Multiway Tries and TSTs

For radix sorting, we found that we could get a significant improvement in speed by considering more than 1 bit at a time. The same is true for radix search: By examining r bits at a time, we can speed up the search by a factor of r. However, there is a catch that makes it necessary for us to be more careful in applying this idea than we had to be for radix sorting. The problem is that considering r bits at a time corresponds to using tree nodes with R = 2r links, and that can lead to a considerable amount of wasted space for unused links.

Radix key type example

This code is an example of extending a key class such as Program 12.2, which defines integer-valued keys, in order to provide radix methods with access to the key digits. It provides a digit method that returns the indicated digit from the decimal representation of the key (an integer that is 0 through 9) and the constants R and END. Numbers of type int have only ten decimal digits; our convention is to return END if a client asks for a digit beyond the end of the number.

class radixKey extends myKey { public final static int R = 10; public final static int END = -1; private int[] p = {1000000000, 100000000, 10000000, 1000000, 100000, 10000, 1000, 100, 10, 1 }; public int digit(int B) { int v = val; if (B > 9) return END; return (v/p[B]) % 10; } }


Program 15.9 is an example of a key type implementation that provides access to key digits. As with bit, if we were to use this type of key, we would add to each class the code

private final static int R = radixKey.R; private int digit(KEY v, int i) { return ((radixKey) v).digit(i); }


to give us the flexibility to substitute a direct implementation for keys that are primitive types (see Exercise 15.53).

In the (binary) tries of , the nodes corresponding to key bits have two links: one for the case when the key bit is 0, and the other for the case when the key bit is 1. The appropriate generalization is to R-ary tries, where we have nodes with R links corresponding to key digits, one for each possible digit value. Keys are stored in leaves (nodes with all links null). To search in an R-way trie, we start at the root and at the leftmost key digit and use the key digits to guide us down the tree. We go down the ith link (and move to the next digit) if the digit value is i. If we reach a leaf, it contains the only key in the trie with leading digits corresponding to the path that we have traversed, so we can compare that key with the search key to determine whether we have a search hit or a search miss. If we reach a null link, we know that we have a search miss, because that link corresponds to a leading-digit pattern not found in any keys in the trie. Screenshot shows a 10-way trie that represents a sample set of decimal numbers. As we discussed in , numbers typically seen in practice are distinguished with relatively few trie nodes. This same effect for more general types of keys is the basis for a number of efficient search algorithms.

Screenshot R-way trie for base-10 numbers

This figure depicts the trie that distinguishes the set of numbers


.396465048 
.353336658 
.318693642 
.015583409 
.159369371 
.691004885 
.899854354 
.159072306 
.604144269 
.269971047 
.538069659 

(see Screenshot). Each node has 10 links (one for each possible digit). At the root, link 0 points to the trie for keys with first digit 0 (there is only one); link 1 points to the trie for keys with first digit 1 (there are two), and so forth. None of these numbers has first digit 4, 7, 8, or 9, so those links are null. There is only one number for each of the first digits 0, 2, and 5, so there is a leaf containing the appropriate number for each of those digits. The rest of the structure is built recursively, moving one digit to the right.

Java graphics 15fig14.gif


Before doing a full symbol-table implementation with multiple node types and so forth, we begin our study of multiway tries by concentrating on the existence-table problem, where we have only keys (no records or associated information) and want to develop algorithms to insert a key into a data structure and to search the data structure to tell us whether or not a given key has been inserted. Program 15.10 defines an ADT for existence tables. The existence-table implementation that we consider next clearly exposes the structure of multiway tries, is useful in its own right, and paves the way for using tries in a standard symbol-table ADT implementation.

Definition 15.2 The existence trie corresponding to a set of keys is defined recursively as follows: The trie for an empty set of keys is a null link; and the trie for a nonempty set of keys is an internal node with links referring to the trie for each possible key digit, with the leading digit considered to be removed for the purpose of constructing the subtrees.

For simplicity, we assume in this definition that no key is the prefix of another. Typically, we enforce this restriction by ensuring that the keys are distinct and either are of fixed length or have a termination digit with value END, a sentinel that is used for no other purpose. The point of this definition is that we can use existence tries to implement existence tables, without storing any information within the trie; the information is all implicitly defined within the trie structure. Each node has R + 1 links (one for each possible character value plus one for END) and no other information. To search, we use the digits in the key to guide us down the trie. If we reach the link to END at the same time that we run out of key digits, we have a search hit; otherwise we have a search miss. To insert a new key, we search until we reach a null link, then add nodes for each of the remaining characters in the key. Screenshot is an example of a 27-way trie; Program 15.11 is an implementation of the basic (multiway) existence-trie search and insert procedures.

Screenshot R-way existence trie search and insertion

The 26-way trie for the words now, is, and the (top) has nine nodes: the root plus one for each letter. The nodes are labeled in these diagrams, but we do not use explicit node labels in the data structure, because each node label can be inferred from the position of the link to it in its parents' link array. To insert the key time, we branch off the existing node for t and add new nodes for i, m, and e (center); to insert the key for, we branch off the root and add new nodes for f, o, and r.

Java graphics 15fig15.gif


Existence-table ADT

This interface defines the simplest kind of symbol table, where we have keys but no associated information. Clients can insert a key and search to determine whether or not a given key has been inserted.

class ET // ADT interface { // implementations and private members hidden ET() boolean search(KEY) void insert(KEY) }


If the keys are of fixed length and are distinct, we can dispense with the link to the terminal character and can terminate searches when we reach the key length (see Exercise 15.60). We have already seen an example of this type of trie when we used tries to describe MSD sorting for fixed-length keys (Screenshot).

In one sense, this pure abstract representation of the trie structure is optimal, because it can support the search operation in time proportional to the length of a key and in space proportional to the total number of characters in the key in the worst case. But the total amount of space used could be as high as nearly R links for each character, so we seek improved implementations. As we saw with binary tries, it is worthwhile to consider the pure trie structure as a particular representation of an underlying abstract structure that is a well-defined representation of our set of keys and then to consider other representations of the same abstract structure that might lead to better performance.

Existence-trie search and insertion

This implementation of the search and insert existence-table ADT operations for multiway tries stores the keys implicitly within the structure of the trie. Each node contains R linkss to the next level down the trie. We follow the ith link at level t when the tth digit of the key is i.

private boolean searchR(Node h, KEY v, int d) { int i = digit(v, d); if (h == null) return false; if (i < 0) return true; return searchR(h.next[i], v, d+1); } boolean search(KEY key) { return searchR(head, key, 0); } private Node insertR(Node h, KEY v, int d) { int i = digit(v, d); if (h == null) h = new Node(); if (i < 0) return h; h.next[i] = insertR(h.next[i], v, d+1); return h; } void insert(KEY v) { head = insertR(head, v, 0); }


Definition 15.3 A multiway trie is a multiway tree that has keys associated with each of its leaves, defined recursively as follows: The trie for an empty set of keys is a null link; the trie for a single key is a leaf containing that key; and the trie for a set of keys of cardinality greater than one is an internal node with links referring to tries for keys with each possible digit value, with the leading digit considered to be removed for the purpose of constructing the subtrees.

We assume that keys in the data structure are distinct and that no key is the prefix of another. To search in a standard multiway trie, we use the digits of the key to guide the search down the trie, with three possible outcomes. If we reach a null link, we have a search miss; if we reach a leaf containing the search key, we have a search hit; and if we reach a leaf containing a different key, we have a search miss. All leaves have R null links, and when we are implementing a symbol table we can put the items in the leaves (as we did for binary tries) so that different representations for leaf nodes and non-leaf nodes are appropriate, as in . We consider such an implementation in , and we shall consider another approach to an implementation in this chapter. In either case, the analytic results from generalize to tell us about the performance characteristics of standard multiway tries.

Property 15.6

Search or insertion in a standard R-ary trie requires about logR N byte comparisons on the average in a tree built from N random bytestrings. The number of links in an R-ary trie built from N random keys is about RN/ln R. The number of byte comparisons for search or insertion is no more than the number of bytes in the search key.

These results generalize those in Properties 15.3 and 15.4. We can establish them by substituting R for 2 in the proofs of those properties. As we mentioned, however, extremely sophisticated mathematics is involved in the precise analysis of these quantities. Screenshot


The performance characteristics listed in Property 15.6 represent an extreme example of a time–space tradeoff. On the one hand, there are a large number of unused null links—only a few nodes near the top use more than a few of their links. On the other hand, the height of a tree is small. For example, suppose that we take the typical value R = 256 and that we have N random 64-bit keys. Property 15.6 tells us that a search will take (lg N)/8 character comparisons (8 at most) and that we will use fewer than 47N links. If plenty of space is available, this method provides an extremely efficient alternative. We could cut the search cost to 4 character comparisons for this example by taking R = 65536, but that would require over 5900N links.

In practical apps, the space cost is likely to be even higher, because real sets of keys tend to have long stretches where parts of subsets of keys are equal. The trie for such a set of keys will have many nodes with R - 1 null links.

In particular, this analysis shows that it is unwise to use tries for standard Java Unicode strings, because the amount of space they consume when R = 65536 is excessive. One way to ameliorate this difficulty is to implement digit so as to break characters into multiple pieces, perhaps restricting attention to ASCII so that we only need half of each character (see Exercise 15.63). Next, we consider an algorithmic approach that is even more effective.

In the remainder of this section, we shall consider an alternative representation of multiway tries: the ternary search trie (TST), which help us avoid the excessive space cost normally associated with multiway tries. In a TST, each node has a character and three links, corresponding to keys whose current digits are less than, equal to, or greater than the node's character. Using this arrangement is equivalent to implementing trie nodes as binary search trees that use as keys the characters corresponding to non-null links. In the standard existence tries of Program 15.11, trie nodes are represented by R + 1 links, and we infer the character represented by each non-null link by its index. In the corresponding existence TST, all the characters corresponding to non-null links appear explicitly in nodes—we find characters corresponding to keys only when we are traversing the middle links. A sample TST is illustrated in Screenshot.

Screenshot Existence-trie structures

These figures show three different representations of the existence trie for the 16 words call me ishmael some years ago never mind how long precisely having little or no money: The 26-way existence trie (top); the abstract trie with null links removed (center); and the TST representation (bottom). The 26-way trie has too many links, but the TST is an efficient representation of the abstract trie. The top two tries assume that no key is the prefix of another. For example, adding the key not would result in the key no being lost. We can add a null character to the end of each key to correct this problem, as illustrated in the TST at the bottom.

Java graphics 15fig16.gif


The search algorithm for implementing the existence table ADT with TSTs is so straightforward as nearly to write itself; the insertion algorithm is slightly more complicated, but mirrors directly insertion in existence tries. To search, we compare the first character in the key with the character at the root. If it is less, we take the left link; if it is greater, we take the right link; and if it is equal, we take the middle link and move to the next key character. In each case, we apply the algorithm recursively. We terminate with a search miss if we encounter a null link or if we encounter the end of the search key before encountering END in the tree, and we terminate with a search hit if we traverse the middle link in a node whose character is END. To insert a new key, we search, then add new nodes for the characters in the tail of the key, just as we did for tries.

Program 15.12 gives the details of the implementation of these algorithms, and Screenshot has TSTs that correspond to the tries in Screenshot. Since one of the most important and natural apps of multiway tries is to process strings, Program 15.12 is coded as an implementation of the existence table ADT interface of Program 15.10 for String keys; it is a straightforward matter to change it to use digit and therefore implement the interface for the general KEY type that we have been using (see Exercise 15.62).

Screenshot Existence TSTs

An existence TST has one node for each letter, but only 3 children per node, rather than 26. The top three trees in this figure are the RSTs corresponding to the insertion example in Screenshot, with the additional change that an end-of-key character is appended to each key. We can then remove the restriction that no key may be a prefix of another, so, for example, we can insert the key theory (bottom).

Java graphics 15fig17.gif


Continuing the correspondence that we have been following between search trees and sorting algorithms, we see that TSTs correspond to three-way radix sorting in the same way that BSTs correspond to quicksort, tries correspond to binary quicksort, and M-way tries correspond to M-way radix sorting. Screenshot, which describes the recursive call structure for three-way radix sort, is a TST for that set of keys. The null-links problem for tries corresponds to the empty-bins problem for radix sorting; three-way branching provides an effective solution to both problems.

We can make TSTs more efficient in their use of space by putting keys in leaves at the point where they are distinguished and by eliminating one-way branching between internal nodes as we did for patricia. At the end of this section, we examine an implementation based on the former change.

Property 15.7

A search or insertion in a full TST requires time proportional to the key length. The number of links in a TST is at most three times the number of characters in all the keys.

In the worst case, each key character corresponds to a full R-ary node that is unbalanced, stretched out like a singly linked list. This worst case is extremely unlikely to occur in a random tree. More typically, we might expect to do ln R or fewer byte comparisons at the first level (since the root node behaves like a BST on the R different byte values) and perhaps at a few other levels (if there are keys with a common prefix and up to R different byte values on the character following the prefix), and to do only a few byte comparisons for most characters (since most trie nodes are sparsely populated with non-null links). Search misses are likely to involve only a few byte comparisons, ending at a null link high in the trie, and search hits involve only about one byte comparison per search key character, since most of them are in nodes with one-way branching at the bottom of the trie.

Actual space usage is generally less than the upper bound of three links per character, because keys share nodes at high levels in the tree. We refrain from a precise average-case analysis because TSTs are most useful in practical situations where keys neither are random nor are derived from bizarre worst-case constructions. Screenshot


Existence-TST search and insertion

This code implements the existence table ADT for string keys. Each node contains just one digit and three links: one each for keys whose next digit is less than, equal to, or greater than the corresponding digit in the search key, respectively.

The character value END is used as an end-of-string marker in the TST (this code uses 0, as in C-style strings) but strings need not end with END.

class StringET { private final static int END = 0; private class Node { char c; Node l, m, r; } private Node head; StringET() { head = null; } private Node insertR(Node h, char[] s, int i) { char ch = (i < s.length) ? s[i] : END; if (h == null) { h = new Node(); h.c = ch; } if (ch == END && h.c == END) return h; if (s[i] < h.c) h.l = insertR(h.l, s, i); if (s[i] == h.c) h.m = insertR(h.m, s, i+1); if (s[i] > h.c) h.r = insertR(h.r, s, i); return h; } void insert(String s) { head = insertR(head, s.toCharArray(), 0); } private boolean searchR(Node h, char[] s, int i) { if (h == null) return false; if (i == s.length) return h.c == END; if (s[i] < h.c) return searchR(h.l, s, i); if (s[i] > h.c) return searchR(h.r, s, i); return searchR(h.m, s, i+1); // s[i] == h.c } boolean search(String s) { return searchR(head, s.toCharArray(), 0); } }


The prime virtue of using TSTs is that they adapt gracefully to irregularities in search keys that are likely to appear in practical apps. There are two main effects. First, keys in practical apps come from large character sets, and usage of particular characters in the character sets is far from uniform—for example, a particular set of strings is likely to use only a small fraction of the possible characters. With TSTs, we can use a 256-character ASCII encoding or a 65536-character Unicode encoding without having to worry about the excessive costs of nodes with 256- or 65536-way branching, and without having to determine which sets of characters are relevant. Unicode strings in non-Roman alphabets can have thousands of characters—TSTs are especially appropriate for standard Java String keys that consist of such characters. Second, keys in practical apps often have a structured format, differing from app to app, perhaps using only letters in one part of the key, only digits in another part of the key, and special characters as delimiters (see Exercise 15.79). For example, Screenshot gives a list of library call numbers from an online library database. For such keys, some of the trie nodes might be represented as unary nodes in the TST (for places where all keys have delimiters); some might be represented as 10-node BSTs (for places where all keys have digits); and still others might be represented as 26-node BSTs (for places where all keys have letters). This structure develops automatically, without any need for special analysis of the keys.

Screenshot Sample string keys (library call numbers)

These keys from an online library database illustrate the variability of the structure found in string keys in apps. Some of the characters may appropriately be modeled as random letters, some may be modeled as random digits, and still others have fixed value or structure.

Java graphics 15fig18.gif


A second practical advantage of TST-based search over many other algorithms is that search misses are likely to be extremely efficient, even when the keys are long. Often, the algorithm uses just a few byte comparisons (and chases a few references) to complete a search miss. As we discussed in , a search miss in a hash table with N keys requires time proportional to the key length (to compute the hash function), and at least lg N key comparisons in a search tree. Even patricia requires lg N bit comparisons for a random search miss.

Table 15.2 gives empirical data in support of the observations in the previous two paragraphs.

A third reason that TSTs are attractive is that they support operations more general than the symbol-table operations that we have been considering. For example, Program 15.13 gives a program that allows particular characters in the search key to be unspecified and prints all keys in the data structure that match the specified digits of the search key. An example is depicted in Screenshot. Obviously, with a slight modification, we can adapt this program to visit all the matching keys in the way that we do for sort, rather than just to print them (see Exercise 15.65).

Screenshot TST-based partial-match search

To find all keys in a TST matching the pattern i* (top), we search for i in the BST for the first character. In this example, we find is (the only word that matches the pattern) after two one-way branches. For a less restrictive pattern such as *o* (bottom), we visit all nodes in the BST for the first character, but only those corresponding to o for the second character, eventually finding for and now.

Java graphics 15fig19.gif


Table 15.2. Empirical study of search with string keys

These relative timings for construction and search in symbol tables with string keys such as the library call numbers in Screenshot confirm that TSTs, although more expensive to construct, provide the fastest search for search misses with string keys, primarily because the search does not require examination of all the characters in the key.

 

construction

search misses

N

B

H

T

T*

B

H

T

T*

1250

6

5

15

7

6

5

4

4

2500

11

9

23

12

13

11

8

7

5000

29

18

48

39

31

22

17

13

12500

78

64

264

145

99

66

51

45

25000

223

93

509

251

253

151

119

101

50000

558

189

1234

495

608

322

275

249

100000

1328

516

2917

1008

1507

690

640

563

Key:

B Standard BST (Program 12.15)

H Hashing with separate chaining (M = N/5) (Program 14.3)

T TST (Program 15.12)

T* TST with R2-way branch at root (Programs 15.15 and 15.16)

Several other similar tasks are easy to handle with TSTs. For example, we can visit all keys in the data structure that differ from the search key in at most one digit position (see Exercise 15.66). Operations of this type are expensive or impossible with other symbol-table implementations. We shall consider in Part 6 these and many other algorithms for finding approximate matches in a string search.

Partial-match searching in TSTs

With judicious use of multiple recursive calls, we can find close matches in the TST structure of Program 15.12, as shown in this program for printing all strings in the data structure that match a search string with some characters unspecified (indicated by asterisks).

private char[] w; private void matchR(Node h, char[] s, int i) { if (h == null) return; if (i == s.length && h.c == END) Out.println(w + ""); if (i == s.length) return; if ((s[i] == '*') || (s[i] == h.c)) { w[i] = h.c; matchR(h.m, s, i+1); } if ((s[i] == '*') || (s[i] < h.c)) matchR(h.l, s, i); if ((s[i] == '*') || (s[i] > h.c)) matchR(h.r, s, i); } void match(String s) { w = new char[s.length()]; matchR(head, s.toCharArray(), 0); }


Patricia offers several of the same advantages; the main practical advantage of TSTs over patricia tries is that the former access key bytes or characters rather than key bits. One reason that this difference represents an advantage is that machine operations for this purpose are found in many machines, and Java provides direct access to bytes and characters through arrays or charAt in strings. Another reason is that, in some apps, working with bytes or characters in the data structure naturally reflects the orientation of the data itself—for example, in the partial-match search problem discussed in the previous paragraph (although, as we shall see in Part 6, we can also speed up partial-match search with judicious use of bit access).

To eliminate one-way branching in TSTs, we note that most of the one-way branching occurs at the tail ends of keys, which is not a factor if we evolve to a symbol-table implementation where we keep records in leaves that are placed in the highest level of the tree that distinguishes the keys. We also could maintain a byte index in the same manner as in patricia tries (see Exercise 15.72) but will omit this change, for simplicity. The combination of multiway branching and the TST representation by themselves is quite effective in many apps, but patricia-style collapse of one-way branching will further enhance performance when the keys are such that they are likely to match for long stretches (see Exercise 15.79).

Hybrid TST symbol-table implementation

This TST-based implementation of our standard symbol-table ADT uses R-way branching at the root node: the root is an array heads of R links, indexed by the first key digit. Each link points to a TST built from all the keys that begin with the corresponding digit. This hybrid combines the benefits of tries (fast search through indexing, at the root) and TSTs (efficient use of space with one node per character, except at the root).

class ST { private class Node { int d; ITEM item; Node l, m, r; Node(ITEM x) { item = x; d = END; } Node(int k){d=k;} Node(Node h, int k){d=k;m=h;} boolean internal() { return l!=null || m!=null || r!=null; } } private Node[] heads; ST(int maxN) { heads = new Node[R]; } void insert(ITEM x) // See Program 15.15 ITEM search(KEY v) // See Program 15.16


Another easy improvement to TST-based search is to use a large explicit multiway node at the root. The simplest way to proceed is to keep a table of R TSTs: one for each possible value of the first letter in the keys. If R is not large, we might use the first two letters of the keys (and a table of size R2). For this method to be effective, the leading digits of the keys must be well distributed. The resulting hybrid search algorithm corresponds to the way that a human might search for names in a telephone tutorial. The first step is a multiway decision ("Let's see, it starts with 'A' "), followed perhaps by some two-way decisions ("It's before 'Andrews,' but after 'Aitken' "), followed by sequential character matching (" 'Algonquin,' ...No, 'Algorithms' isn't listed, because nothing starts with 'Algor'!").

Hybrid TST insertion for symbol-table ADT

This implementation of insert for Program 15.15 uses a TST with items in leaves ( generalizing Program 15.4) for all words beginning with each character. If the search ends at a null link, we create a leaf to hold the item. If the search ends at a leaf, we create the internal nodes needed to distinguish the key found from the search key.

private Node split(Node p, Node q, int d) { int pd = digit(p.item.key(), d), qd = digit(q.item.key(), d); Node t = new Node(qd); if (pd < qd) { t.m = q; t.l = new Node(p, pd); } if (pd == qd) { t.m = split(p, q, d+1); } if (pd > qd) { t.m = q; t.r = new Node(p, pd); } return t; } private Node insertR(Node h, ITEM x, int d) { int i = digit(x.key(), d); if (h == null) return new Node(new Node(x), i); if ((h.d == END) && (i == END)) return h; if (!h.internal()) return split(new Node(x), h, d); if (i < h.d) h.l = insertR(h.l, x, d); if (i == h.d) h.m = insertR(h.m, x, d+1); if (i > h.d) h.r = insertR(h.r, x, d); return h; } void insert(ITEM x) { int i = digit(x.key(), 0); heads[i] = insertR(heads[i], x, 1); }


Hybrid TST search for symbol-table ADT

This search implementation for TSTs (built with Program 15.15) is like multiway-trie search, but we use only three, rather than R, links per node (except at the root). We use the digits of the key to travel down the tree, ending either at a null link (search miss) or at a leaf that has a key that either is (search hit) or is not (search miss) equal to the search key.

private ITEM searchR(Node h, KEY v, int d) { if (h == null) return null; if (h.internal()) { int i = digit(v, d); if (i < h.d) return searchR(h.l, v, d); if (i == h.d) return searchR(h.m, v, d+1); if (i > h.d) return searchR(h.r, v, d); } if (equals(v, h.item.key())) return h.item; return null; } ITEM search(KEY v) { return searchR(heads[digit(v, 0)], v, 1); }


Programs 15.14 through 15.16 comprise a TST-based implementation of the symbol-table search and insert operations which uses R-way branching at the root and keeps items in leaves (so there is no one-way branching once the keys are distinguished). These programs are likely to be among the fastest available for searching with string or long radix keys. The underlying TST structure can also support a host of other operations.

In a symbol table that grows to be huge, we may want to adapt the branching factor to the table size. In , we shall see a systematic way to grow a multiway trie so that we can take advantage of multiway radix search for arbitrary file sizes.

Property 15.8

A search or insertion in a TST with items in leaves (no one-way branching at the bottom) and Rt-way branching at the root requires roughly ln N - t ln R byte accesses for N keys that are random bytestrings. The number of links required is Rt (for the root node) plus a small constant times N.

These rough estimates follow immediately from Property 15.6. For the time cost, we assume that all but a constant number of the nodes on the search path (a few at the top) act as random BSTs on R character values, so we simply multiply the time cost by ln R. For the space cost, we assume that the nodes on the first few levels are filled with R character values and that the nodes on the bottom levels have only a constant number of character values. Screenshot


For example, if we have 1 billion random bytestring keys with R = 256, and we use a table of size R2 = 65536 at the top, then a typical search will require about ln 109 - 2 ln 256 Screenshot 20.7 - 11.1 = 9.6 byte comparisons. Using the table at the top cuts the search cost by a factor of 2. If we have truly random keys, we can achieve this performance with more direct algorithms that use the leading bytes in the key and an existence table, in the manner discussed in . With TSTs, we can get the same kind of performance when keys have a less random structure.

It is instructive to compare TSTs without multiway branching at the root with standard BSTs, for random keys. Property 15.8 says that TST search will require about ln N byte comparisons, whereas standard BSTs require about ln N key comparisons. At the top of the BST, the key comparisons can be accomplished with just one byte comparison, but at the bottom of the tree multiple byte comparisons may be needed to accomplish a key comparison. This performance difference is not dramatic. The reasons that TSTs are preferable to standard BSTs for string keys are that they provide a fast search miss; they adapt directly to multiway branching at the root; and (most important) they adapt well to bytestring keys that are not random, so no search takes longer than the length of a key in a TST.

Some apps may not benefit from the R-way branching at the root—for example, the keys in the library-call-number example of Screenshot all begin with either L or W. Other apps may call for a higher branching factor at the root—for example, as just noted, if the keys were random integers, we would use as large a table as we could afford. We can use app-specific dependencies of this sort to tune the algorithm to peak performance, but we should not lose sight of the fact that one of the most attractive features of TSTs is that they free us from having to worry about such app-specific dependencies, providing good performance without any tuning.

Perhaps the most important property of tries or TSTs with records in leaves is that their performance characteristics are independent of the key length. Thus, we can use them for arbitrarily long keys. In , we examine a particularly effective app of this kind.

Exercises

Java graphics icon01.gif 15.53 Write a digit method that corresponds to Program 15.9 for keys that are of type int.

Java graphics icon01.gif 15.54 Draw the existence trie that results when you insert the words now is the time for all good people to come the aid of their party into an initially empty trie. Use 27-way branching.

Java graphics icon01.gif 15.55 Draw the existence TST that results when you insert the words now is the time for all good people to come the aid of their party into an initially empty TST.

Java graphics icon01.gif 15.56 Draw the 4-way trie that results when you insert items with the keys 01001010 into an initially empty trie, using 2-bit bytes.

Java graphics icon01.gif 15.57 Draw the TST that results when you insert items with the keys 01001010 into an initially empty TST, using 2-bit bytes.

Java graphics icon01.gif 15.58 Draw the TST that results when you insert items with the keys 01001010 into an initially empty TST, using 4-bit bytes.

ScreenshotDraw the TST that results when you insert items with the library-call-number keys in Screenshot into an initially empty TST.

ScreenshotModify our multiway-trie search and insertion implementation (Program 15.11) to work under the assumption that keys are (fixed-length) w-byte words (so no end-of-key indication is necessary).

ScreenshotModify our TST search and insertion implementation (Program 15.12) to work under the assumption that keys are (fixed-length) w-byte words (so no end-of-key indication is necessary).

Develop a TST-based implementation of Program 15.10, by modifying Program 15.12 to use parameters of type KEY for search and insert and using digit instead of array indexing to access characters.

Modify Program 15.11 to implement an existence table for String keys (like Program 15.12) by using parameters of type String for search and insert and using array indexing instead of digit to access characters. Assume that the strings are ASCII so that you can use byte arrays instead of character arrays.

Run empirical studies to compare the time and space requirements of an 8-way trie built with random integers using 3-bit bytes, a 4-way trie built with random integers using 2-bit bytes, and a binary trie built from the same keys, for N = 103, 104, 105, and 106 (see Exercise 15.14).

Modify Program 15.13 such that it invokes a method in an object passed as a parameter with each matching string as a parameter (instead of just printing it).

ScreenshotWrite a method that prints all the keys in a TST that differ from the search key in at most k positions, for a given integer k.

Java graphics roundbullet.gif 15.67 Give a full characterization of the worst-case internal path length of an R-way trie with N distinct w-bit keys.

Java graphics roundbullet.gif 15.68 Develop a symbol-table implementation using multiway tries that includes a clone implementation and supports the construct, count, search, insert, remove, and join operations for a symbol-table ADT, with support for client handles (see Exercises 12.6 and 12.7).

Java graphics roundbullet.gif 15.69 Develop a symbol-table implementation using TSTs that includes a clone implementation and supports the construct, count, search, insert, remove, and join operations for a symbol-table ADT, with support for client handles (see Exercises 12.6 and 12.7).

Java graphics icon01.gif 15.70 Write a program that prints out all keys in an R-way trie that have the same initial t bytes as a given search key.

Java graphics roundbullet.gif 15.71 Modify our multiway-trie search and insertion implementation (Program 15.11) to eliminate one-way branching in the way that we did for patricia tries.

Java graphics roundbullet.gif 15.72 Modify our TST search and insertion implementation (Program 15.12) to eliminate one-way branching in the way that we did for patricia tries.

Write a program to balance the BSTs that represent the internal nodes of a TST (rearrange them such that all their external nodes are on one of two levels).

Write a version of insert for TSTs that maintains a balanced-tree representation of all the internal nodes (see Exercise 15.73).

Java graphics roundbullet.gif 15.75 Give a full characterization of the worst-case internal path length of a TST with N distinct w-bit keys.

Write an implementation of radixKey for 80-byte ASCII string keys (see Exercise 10.23). Then write a client that uses Program 15.11 to build a 256-way trie with N random keys, for N = 103, 104, 105, and 106, using search, then insert on search miss. Instrument your code to print out the total number of nodes in each trie, the total amount of space used by each trie, and the total amount of time taken to build each trie. Compare these statistics with the corresponding statistics for a client that uses String keys (see Exercise 15.63).

Answer Exercise 15.76 for TSTs. Compare your performance results with those for tries (see Program 15.12 and Exercise 15.62).

Write an implementation of radixKey that generates random keys by shuffling a random 80-byte sequence of ASCII characters (see Exercise 10.25). Use this key generator to build a 256-way trie with N random keys, for N = 103, 104, 105, and 106, using search, then insert on search miss. Compare your performance results with those for the random case (see Exercise 15.76).

ScreenshotWrite an implementation of radixKey that generates 30-byte random strings made up of three fields: a 4-byte field with one of a set of 10 given ASCII strings; a 10-byte field with one of a set of 50 given ASCII strings; a 1-byte field with one of two given values; and a 15-byte field with random left-justified ASCII strings of letters equally likely to be 4 through 15 characters long (see Exercise 10.27). Use this key generator to build a 256-way trie with N random keys, for N = 103, 104, 105, and 106, using search, then insert on search miss. Instrument your program to print out the total number of nodes in each trie and the total amount of time taken to build each trie. Compare your performance results with those for the random case (see Exercise 15.76).

Answer Exercise 15.79 for TSTs. Compare your performance results with those for tries.

Develop an implementation of search and insert for bytestring keys using multiway digital search trees.

Java graphics icon01.gif 15.82 Draw the 27-way DST (see Exercise 15.81) that results when you insert items with the keys now is the time for all good people to come the aid of their party into an initially empty DST.

Java graphics roundbullet.gif 15.83 Develop an implementation of multiway-trie search and insertion using linked lists to represent the trie nodes (as opposed to the BST representation that we use for TSTs). Run empirical studies to determine whether it is more efficient to use ordered or unordered lists, and to compare your implementation with a TST-based implementation.


Previous   Next
Comments