
### Perspective

The choice of the hashing method that is best suited for a particular application depends on many different factors, as we have discussed while examining the methods. All the methods can reduce the symbol-table search and insert operations to constant-time operations, and all are useful for a broad variety of applications. Roughly, we can characterize the three major methods (linear probing, double hashing, and separate chaining) as follows: linear probing is the fastest of the three (if sufficient memory is available to ensure that the table is sparse), double hashing makes the most efficient use of memory (but requires extra time to compute the second hash function), and separate chaining is the easiest to implement and deploy (provided that a good storage allocator is available). Table 14.1 gives empirical data and commentary on the performance of the algorithms.

The choice between linear probing and double hashing depends primarily on the cost of computing the hash function and on the load factor of the table. For sparse tables (small α), both methods use only a few probes, but double hashing can take more time because it has to compute two hash functions for long keys. As α approaches 1, double hashing far outperforms linear probing, as we saw earlier in the chapter.
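This comparison follows directly from the classical formulas for the average number of probes for a search hit: (1 + 1/(1 − α))/2 for linear probing and ln(1/(1 − α))/α for double hashing. The sketch below (class and method names are illustrative, not from the text) evaluates both as the load factor grows:

```java
// Expected probes per search hit under the standard analysis,
// assuming random hash values and a table at load factor alpha.
public class ProbeEstimates {
    // Linear probing, search hit: (1 + 1/(1 - alpha)) / 2
    static double linearHit(double alpha) {
        return (1 + 1 / (1 - alpha)) / 2;
    }

    // Double hashing, search hit: ln(1/(1 - alpha)) / alpha
    static double doubleHit(double alpha) {
        return Math.log(1 / (1 - alpha)) / alpha;
    }

    public static void main(String[] args) {
        for (double alpha : new double[] {0.5, 0.8, 0.9, 0.99}) {
            System.out.printf("alpha=%.2f  linear=%6.1f  double=%6.1f%n",
                alpha, linearHit(alpha), doubleHit(alpha));
        }
    }
}
```

At α = 0.5 the two are close (1.5 versus about 1.4 probes), but at α = 0.99 linear probing averages over 50 probes per hit while double hashing stays under 5.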

Comparing linear probing and double hashing against separate chaining is more complicated, because we have to account precisely for memory usage. Separate chaining uses extra memory for links; the open-addressing methods use extra memory implicitly within the table to terminate probe sequences. The following concrete example illustrates the situation: Suppose that we have a table of M lists built with separate chaining, that the average length of the lists is 4, and that items and links each occupy a single machine word. The assumption that items and links take the same amount of space is justified in many situations because we would replace huge items with links to the items. With these assumptions, the table uses 9M words of memory (4M for items and 5M for links), and delivers an average search time of 2 probes. But linear probing for 4M items in a table of size 9M requires just (1 + 1/(1 - 4/9))/2 = 1.4 probes for a search hit, a value that is 30 percent faster than separate chaining for the same amount of space; and linear probing for 4M items in a table of size 6M requires 2 probes for a search hit (on the average), and thus uses 33 percent less space than separate chaining for the same amount of time. Furthermore, we can use a dynamic method such as Program 14.7 to ensure that the table can grow while staying sparsely populated.
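The arithmetic in this example is easy to verify. The following fragment (a sketch; the formula is the standard linear-probing search-hit cost (1 + 1/(1 − α))/2 used in this chapter's analysis) evaluates both scenarios:

```java
public class SpaceTimeTradeoff {
    // Average probes for a linear-probing search hit at load factor alpha
    static double linearProbingHit(double alpha) {
        return (1 + 1 / (1 - alpha)) / 2;
    }

    public static void main(String[] args) {
        // Separate chaining: 4M items in M lists -> 9M words, about 2 probes per hit.
        // Linear probing, 4M items in a table of size 9M (alpha = 4/9):
        System.out.println(linearProbingHit(4.0 / 9.0));   // approx. 1.4 probes
        // Linear probing, 4M items in a table of size 6M (alpha = 2/3):
        System.out.println(linearProbingHit(2.0 / 3.0));   // approx. 2.0 probes
    }
}
```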

The argument in the previous paragraph indicates that it is not normally justifiable to choose separate chaining over open addressing on the basis of performance. However, separate chaining with a fixed M is often chosen in practice for a host of other reasons: it is easy to implement (particularly remove); it requires little extra memory for items that have preallocated link fields for use by the symbol table and other ADTs that may need them; and, although its performance degrades as the number of items in the table grows, the degradation is graceful and unlikely to harm the application, because the method remains a factor of M faster than sequential search.

Many other hashing methods have been developed that find application in special situations. Although we cannot go into details, we briefly consider three examples to illustrate the nature of specially adapted hashing methods.

One class of methods moves items around during insertion in double hashing to make successful search more efficient. Indeed, Brent developed a method for which the average time for a successful search can be bounded by a constant, even in a full table (see reference section). Such a method might be useful in applications where search hits are the predominant operation.

Another method, called ordered hashing, exploits ordering to reduce the cost of unsuccessful search in linear probing to be close to the cost of successful search. In standard linear probing, we stop the search when we find an empty table position or an item with a key equal to the search key; in ordered hashing, we stop the search when we find an item with a key greater than or equal to the search key (the table must be constructed cleverly if this procedure is to work; see reference section). The improvement that we get by introducing ordering into the table is analogous to the one that we achieved by ordering the lists in separate chaining. This method is designed for applications where search misses predominate.
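The "clever construction" can be sketched briefly: during insertion, whenever we probe a slot holding a key larger than the one being inserted, we place the smaller key there and continue inserting the displaced larger key, so that keys along any probe sequence appear in increasing order. The following minimal sketch (nonnegative integer keys, fixed sparse table, no removal; all names are illustrative) shows the idea for linear probing:

```java
// Sketch of ordered hashing with linear probing: keys along a probe
// sequence are kept in increasing order, so an unsuccessful search can
// stop as soon as it encounters a larger key, not just an empty slot.
// Assumptions: nonnegative int keys, table kept sparse, no removal.
public class OrderedHash {
    private static final int M = 31;          // illustrative prime table size
    private static final int EMPTY = -1;
    private final int[] table = new int[M];

    public OrderedHash() { java.util.Arrays.fill(table, EMPTY); }

    private int hash(int key) { return key % M; }

    public void insert(int key) {
        int i = hash(key);
        while (table[i] != EMPTY) {
            if (table[i] == key) return;      // already present
            if (table[i] > key) {             // keep the probe path ordered:
                int t = table[i];             // put the smaller key here,
                table[i] = key;               // continue with the larger one
                key = t;
            }
            i = (i + 1) % M;
        }
        table[i] = key;
    }

    public boolean search(int key) {
        int i = hash(key);
        while (table[i] != EMPTY) {
            if (table[i] == key) return true;
            if (table[i] > key) return false; // larger key seen: early miss
            i = (i + 1) % M;
        }
        return false;                         // empty slot: miss
    }
}
```

A miss now stops at the first key larger than the search key, so unsuccessful searches probe about as far as successful ones would, as the text claims.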

A symbol table that has a fast search miss and somewhat slower search hit can be used to implement an exception dictionary. For example, a text-processing system might have an algorithm for hyphenating words that works well for most words, but does not work for bizarre cases (such as "bizarre"). Only a few words in a huge document are likely to be in the exception dictionary, so nearly all the searches are likely to be misses.

These examples are only a few of the many algorithmic improvements that have been suggested for hashing. Many of these improvements are interesting and have important applications. However, because separate chaining, linear probing, and double hashing are simple, efficient, and acceptable for most applications, our usual caution applies: avoid advanced methods except when the requirements are serious and the performance/complexity tradeoffs have been carefully considered.

##### Table 14.1. Empirical study of hash-table implementations
These relative timings for building and searching symbol tables from random sequences of 32-bit integers confirm that hashing is significantly faster than tree search for keys that are easily hashed. Among the hashing methods, double hashing is slower than separate chaining and linear probing for sparse tables (because of the cost of computing the second hash function) but is much faster than linear probing as the table fills, and is the only one of the methods that can provide fast search using only a small amount of extra memory. Dynamic hash tables built with linear probing and expansion by doubling are more costly to construct than are other hash tables, because of memory allocation and rehashing, but they certainly lead to the fastest search times and represent the method of choice when search predominates and when the number of keys cannot be predicted accurately in advance.

The first five columns after N give construction times; the last five give search-miss times.

| N | R | H | P | D | P* | R | H | P | D | P* |
|---|---|---|---|---|----|---|---|---|---|----|
| 1250 | 7 | 13 | 19 | 11 | 2 | 3 | 2 | 1 | 1 | 1 |
| 2500 | 13 | 16 | 18 | 16 | 3 | 8 | 2 | 1 | 2 | 1 |
| 5000 | 22 | 16 | 25 | 10 | 5 | 19 | 7 | 3 | 3 | 3 |
| 12500 | 100 | 29 | 35 | 16 | 14 | 58 | 10 | 8 | 8 | 7 |
| 25000 | 201 | 41 | 31 | 29 | 45 | 145 | 25 | 18 | 19 | 17 |
| 50000 | 836 | 82 | 69 | 53 | 90 | 365 | 81 | 39 | 42 | 41 |
| 100000 | 1137 | 183 | 64 | 76 | 195 | 811 | 261 | 100 | 107 | 91 |
| 150000 | — | 303 | 110 | 166 | 385 | — | 591 | 248 | 216 | 135 |
| 160000 | — | 316 | 123 | 180 | 393 | — | 651 | 320 | 252 | 145 |
| 170000 | — | 325 | 143 | 190 | 386 | — | 773 | 455 | 298 | 157 |
| 180000 | — | 329 | 210 | 164 | 403 | — | 857 | 652 | 375 | 171 |
| 190000 | — | 429 | 259 | 187 | 424 | — | 948 | 1337 | 492 | 183 |
| 200000 | — | 2614 | — | 457 | 442 | — | 2058 | — | 997 | 196 |

Key:

- R: red-black BST (Programs 12.15 and 13.6)
- H: separate chaining (Program 14.3 with table size 20,000)
- P: linear probing (Program 14.4 with table size 200,000)
- D: double hashing (Program 14.6 with table size 200,000)
- P*: linear probing with expansion by doubling (Program 14.7)

The problem of implementing an exception dictionary is an example of an application where we can recast our algorithm slightly to optimize performance for the most frequently performed operation, in this case search miss. For example, suppose that we have a 1000-item exception dictionary, have 1 million items to look up in the dictionary, and expect virtually all the searches to end as misses. This situation might arise if the items were bizarre English-language words or random 32-bit integers. One way to proceed is to hash all the words to, say, 16-bit hash values (table size about 2^16). The 1000 exceptions occupy about 1/64 of the table, and most of the 1 million searches end immediately with search misses, finding an empty table position on the first probe. But if the table contains 32-bit words, we can do much better by converting it into a bit-exception table and using 20-bit hash values. If we have a search miss (as we do most of the time), we finish the search with one bit test; a search hit requires a secondary test in a smaller table. The exceptions occupy 1/1000 of the table; search misses are by far the most likely operation; and we accomplish the task with 1 million directly indexed bit tests. This solution exploits the basic idea that a hash function produces a short certificate that represents a key, an essential concept that is useful in applications other than symbol-table implementations.
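The scheme just described can be sketched as follows (a minimal illustration, not the book's program; the class name, the use of `hashCode`, and the choice of a `HashSet` for the secondary test are all assumptions):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of an exception dictionary tuned for fast search misses:
// a 2^20-bit table answers most queries with one direct bit test;
// only the rare candidate hits go on to an exact secondary check.
public class ExceptionDictionary {
    private static final int BITS = 20;                      // 20-bit hash values
    private final long[] bitTable = new long[(1 << BITS) / 64];
    private final Set<String> exceptions = new HashSet<>();  // smaller secondary table

    private int hash(String key) {
        return key.hashCode() & ((1 << BITS) - 1);           // 20 low-order bits
    }

    public void insert(String key) {
        exceptions.add(key);
        int h = hash(key);
        bitTable[h >>> 6] |= 1L << (h & 63);                 // set the key's bit
    }

    public boolean contains(String key) {
        int h = hash(key);
        if ((bitTable[h >>> 6] & (1L << (h & 63))) == 0)
            return false;                                    // one bit test: miss
        return exceptions.contains(key);                     // rare: exact check
    }
}
```

With 1000 exceptions, the bit table of 2^20 bits (128 KB) is nearly empty, so almost every miss is resolved by the single bit test; the secondary check keeps hash collisions from producing false hits.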

Hashing is preferred to the binary-tree structures of Chapters 12 and 13 as the symbol-table implementation for many applications, because it is somewhat simpler and can provide optimal (constant) search times, if the keys are of a standard type or are sufficiently simple that we can be confident of developing a good hash function for them. The advantages of binary-tree structures over hashing are that the trees are based on a simpler abstract interface (no hash function need be designed); the trees are dynamic (no advance information on the number of insertions is needed); the trees can provide guaranteed worst-case performance (everything could hash to the same place even in the best hashing method); and the trees support a wider range of operations (most important, sort and select). When these factors are not important, hashing is certainly the search method of choice, with one more important proviso: when keys are long strings, we can build them into data structures that can provide for search methods that are even faster than hashing. Such structures are the subject of the next chapter.

#### Exercises

14.48 For 1 million integer keys, compute the hash-table size that makes each of the three hashing methods (separate chaining, linear probing, and double hashing) use the same number of key comparisons as BSTs for a search miss, on the average, counting the hash-function computation as a comparison.

14.49 For 1 million integer keys, compute the number of comparisons for each of the three hashing methods (separate chaining, linear probing, and double hashing) for a search miss, on the average, when they can use a total of 3 million words of memory (as BSTs would).

Implement a symbol-table ADT with fast search miss as described in the text, using separate chaining for secondary testing.

Do an empirical study to produce a table like Table 14.1 that compares linear probing with expansion by doubling against the Java Hashtable class.
