Previous   Next

Text-String–Index Algorithms

In , we considered the process of building a string index, and we used binary search in a table of indexes into a text string to provide the capability to determine whether or not a given key string appears in the text. In this section, we look at more sophisticated implementations of this ADT using multiway tries.

As in , we consider each position in the text to be the beginning of a string key that runs all the way to the end of the text and build a symbol table with these keys, using indexes into the text. The keys are all different (for example, they are of different lengths), and most of them are extremely long. The purpose of a search is to determine whether or not a given search key is a prefix of one of the keys in the index, which is equivalent to discovering whether the search key appears somewhere in the text string.

A search tree that is built from keys defined by indexes into a text string is called a suffix tree. We could use any algorithm that can admit variable-length keys. Trie-based methods are particularly suitable, because (except for the trie methods that do one-way branching at the tails of keys) their running time does not depend on the key length but rather depends only on the number of digits required to distinguish among the keys. This characteristic lies in direct contrast to, for example, hashing algorithms, which do not apply immediately to this problem because their running time is proportional to the key length.

Screenshot gives examples of string indexes built with BSTs, patricia, and TSTs (with leaves). These indexes use just the keys starting at word boundaries; an index starting at character boundaries would provide a more complete index, but would use significantly more space.

Screenshot Text-string index examples

These diagrams show text-string indexes built from the text call me ishmael some years ago never mind how long precisely ... using a BST (top), a patricia trie (center), and a TST (bottom). Nodes with string pointers are depicted with the first four characters at the point referenced by the pointer.

Java graphics 15fig20.gif


Strictly speaking, even a random string text does not give rise to a random set of keys in the corresponding index (because the keys are not independent). However, we rarely work with random texts in practical indexing apps, and this analytic discrepancy will not stop us from taking advantage of the fast indexing implementations that are possible with radix methods. We refrain from discussing the detailed performance characteristics when we use each of the algorithms to build a string index, because many of the same tradeoffs that we have discussed for general symbol tables with string keys also hold for the string-index problem.

For a typical text, standard BSTs would be the first implementation that we might choose, because they are simple to implement (see Exercise 12.82). For typical apps, this solution is likely to provide good performance. One byproduct of the interdependence of the keys—particularly when we are building a string index for each character position—is that the worst case for BSTs is not a particular concern for huge texts, since unbalanced BSTs occur with only bizarre constructions.

Patricia was originally designed for the string-index app. To use Programs 15.7 and 15.6, we need only provide an implementation of bit that, given a string pointer and an integer i, returns the ith bit of the string (see Exercise 15.89). In practice, the height of a patricia trie that implements a text string index will be logarithmic. Moreover, a patricia trie will provide fast search implementations for misses because we do not need to examine all the bytes of the key.

TSTs afford several of the performance advantages of patricia, are simple to implement, and take advantage of built-in byte-access operations that are typically found on modern machines. They also are amenable to simple implementations, such as Program 15.3, that can solve search problems more complicated than fully matching a search key. To use TSTs to build a string index, we need to remove the code that handles ends of keys in the data structure, since we are guaranteed that no string is a prefix of another, and thus we never will be comparing strings to their ends. This modification includes changing the definition of equals to regard two strings as equal if one is a prefix of the other, as we did in , since we will be comparing a (short) search key against a (long) text string, starting at some position in the text string. A third change that is convenient is to keep a string index rather than a character in each node so that every node in the tree refers to a position in the text string (the position in the text string following the first occurrence of the character string defined by the characters on equal branches from the root to that node). Implementing these changes is an interesting and informative exercise that leads to a flexible and efficient text-string–index implementation (see Exercise 15.88).

Despite all the advantages that we have been discussing, it is important to remember that the text itself is usually fixed for typical text indexing apps, so we do not need to support the dynamic insert operations that we have become accustomed to supporting. That is, we typically build the index once, then use it for a huge number of searches, without ever changing it. Therefore, we may not need dynamic data structures like BSTs, patricia tries or TSTs at all: the basic binary search algorithm in is sufficient. The primary advantage of using binary search over a dynamic data structure is the space savings. To index a text string at N positions using binary search, we need just N string indexes; in contrast, to index a string at N positions using a tree-based method, we need at least 2N references (for at least two links per node) in addition to the N indexes. Text indexes are typically huge, so binary search might be preferred because it provides guaranteed logarithmic search time but uses less than one-third the amount of memory used by tree-based methods. If sufficient memory space is available, however, TSTs or tries will lead to a faster search for many apps because they move through the key without retracing its steps, and binary search does not do so (though it is possible to improve binary search to examine fewer characters in string keys, as we will see in Part 6).

If we have a huge text but plan to perform only a small number of searches, then building a full index is not likely to be justified. The string-search problem is to determine quickly whether a given text contains a given search key (without preprocessing the text). There are numerous string-processing problems that fall between these two extremes of needing no preprocessing and requiring a full index. Part 6 is devoted to such problems.

Exercises

Java graphics icon01.gif 15.84 Draw the 26-way DST that results when you build a text-string index from the words now is the time for all good people to come the aid of their party.

Java graphics icon01.gif 15.85 Draw the 26-way trie that results when you build a text-string index from the words now is the time for all good people to come the aid of their party.

Java graphics icon01.gif 15.86 Draw the TST that results when you build a text-string index from the words now is the time for all good people to come the aid of their party, in the style of Screenshot.

Java graphics icon01.gif 15.87 Draw the TST that results when you build a text-string index from the words now is the time for all good people to come the aid of their party, using the implementation described in the text, where the TST contains string pointers at every node.

ScreenshotModify the TST search and insertion implementations in Programs 15.15 and 15.16 to provide a TST-based string index.

ScreenshotImplement an interface that allows patricia to process String keys as though they were bitstrings.

ScreenshotDraw the patricia trie that results when you build a text string index from the words now is the time for all good people to come the aid of their party, using a 5-bit binary coding with the ith letter in the alphabet represented by the binary representation of i.

Find a large (at least 106 bytes) text file on your system and compare the height and internal path length of a standard BST, patricia trie, and TST, when you use these methods to build an index from that file.

Run empirical studies to compare the height and internal path length of a standard BST, patricia trie, and TST, when you use these methods to build an index from a text string consisting of N random characters from a 32-character alphabet, for N = 103, 104, 105, and 106.

ScreenshotWrite an efficient program to determine the longest repeated sequence in a huge text string.

ScreenshotWrite an efficient program to determine the 10-character sequence that occurs most frequently in a huge text string.

Java graphics roundbullet.gif 15.95 Build a string index that supports an operation that returns the number of occurrences of its parameter in the indexed text and supports a search operation that calls a method in a client-supplied object for all the text positions that match the search key.

ScreenshotDescribe a text string of N characters for which a TST-based string index will perform particularly badly. Estimate the cost of building an index for the same string with a BST.

Suppose that we want to build an index for a random N-bit string, for bit positions that are a multiple of 16. Run empirical studies to determine which of the bytesizes 1, 2, 4, 8, or 16 leads to the lowest running times to construct a TST-based index, for N = 103, 104, 105, and 106.


Previous   Next
Comments