Previous   Next

Perspective

The choice of the hashing method that is best suited for a particular app depends on many different factors, as we have discussed when examining the methods. All the methods can reduce the symbol-table search and insert operations to constant-time operations, and all are useful for a broad variety of apps. Roughly, we can characterize the three major methods (linear probing, double hashing, and separate chaining) as follows: Linear probing is the fastest of the three (if sufficient memory is available to ensure that the table is sparse), double hashing makes the most efficient use of memory (but requires extra time, to compute the second hash function), and separate chaining is the easiest to implement and deploy (provided that a good storage allocator is available). Table 14.1 gives empirical data and commentary on the performance of the algorithms.

The choice between linear probing and double hashing depends primarily on the cost of computing the hash function and on the load factor of the table. For sparse tables (small a), both methods use only a few probes, but double hashing could take more time because it has to compute two hash functions for long keys. As a approaches 1, double hashing far outperforms linear probing, as we saw in Screenshot.

Comparing linear probing and double hashing against separate chaining is more complicated, because we have to account precisely for memory usage. Separate chaining uses extra memory for links; the open-addressing methods use extra memory implicitly within the table to terminate probe sequences. The following concrete example illustrates the situation: Suppose that we have a table of M lists built with separate chaining, that the average length of the lists is 4, and that items and links each occupy a single machine word. The assumption that items and links take the same amount of space is justified in many situations because we would replace huge items with links to the items. With these assumptions, the table uses 9M words of memory (4M for items and 5M for links), and delivers an average search time of 2 probes. But linear probing for 4M items in a table of size 9M requires just (1 + 1/(1 - 4/9))/2 = 1.4 probes for a search hit, a value that is 30 percent faster than separate chaining for the same amount of space; and linear probing for 4M items in a table of size 6M requires 2 probes for a search hit (on the average), and thus uses 33 percent less space than separate chaining for the same amount of time. Furthermore, we can use a dynamic method such as Program 14.7 to ensure that the table can grow while staying sparsely populated.

The argument in the previous paragraph indicates that it is not normally justifiable to choose separate chaining over open addressing on the basis of performance. However, separate chaining with a fixed M is often chosen in practice for a host of other reasons: it is easy to implement (particularly remove); it requires little extra memory for items that have preallocated link fields for use by symbol-table and other ADTs that may need them; and, although its performance degrades as the number of items in the table grows, the degradation is graceful and takes place in a manner that is unlikely to harm the app because it still is a factor of M faster than sequential search.

Many other hashing methods have been developed that have app in special situations. Although we cannot go into details, we consider three examples briefly to illustrate the nature of specially adapted hashing methods.

One class of methods moves items around during insertion in double hashing to make successful search more efficient. In fact, Brent developed a method for which the average time for a successful search can be bounded by a constant, even in a full table (see reference section). Such a method might be useful in apps where search hits are the predominant operation.

Another method, called ordered hashing, exploits ordering to reduce the cost for unsuccessful search in linear probing to be close to the cost for successful search. In standard linear probing, we stop the search when we find an empty table position or an item with a key equal to the search key; in ordered hashing, we stop the search when we find an item with a key greater than or equal to the search key (the table must be constructed cleverly if this procedure is to work) (see reference section). This improvement by introducing ordering in the table is on the same order as what we achieved by ordering the lists in separate chaining. This method is designed for apps where search misses predominate.

A symbol table that has a fast search miss and somewhat slower search hit can be used to implement an exception dictionary. For example, a text-processing system might have an algorithm for hyphenating words that works well for most words, but does not work for bizarre cases (such as "bizarre"). Only a few words in a huge document are likely to be in the exception dictionary, so nearly all the searches are likely to be misses.

These examples are only a few of a large number of algorithmic improvements that have been suggested for hashing. Many of these improvements are interesting and have important apps. Our usual cautions must be raised against premature use of advanced methods except when the requirements are serious and the performance/complexity tradeoffs are carefully considered, because separate chaining, linear probing and double hashing are simple, efficient, and acceptable for most apps.

Table 14.1. Empirical study of hash-table implementations

These relative timings for building and searching symbol tables from random sequences of 32-bit integers confirm that hashing is significantly faster than tree search for keys that are easily hashed. Among the hashing methods, double hashing is slower than separate chaining and linear probing for sparse tables (because of the cost of computing the second hash function) but is much faster than linear probing as the table fills, and is the only one of the methods that can provide fast search using only a small amount of extra memory. Dynamic hash tables built with linear probing and expansion by doubling are more costly to construct than are other hash tables because of memory allocation and rehashing, but they certainly lead to the fastest search times and represent the method of choice when search predominates and when the number of keys cannot be predicted accurately in advance.

 

construction

search misses

N

R

H

P

D

P*

R

H

P

D

P*

1250

7

13

19

11

2

3

2

1

1

1

2500

13

16

18

16

3

8

2

1

2

1

5000

22

16

25

10

5

19

7

3

3

3

12500

100

29

35

16

14

58

10

8

8

7

25000

201

41

31

29

45

145

25

18

19

17

50000

836

82

69

53

90

365

81

39

42

41

100000

1137

183

64

76

195

811

261

100

107

91

150000

 

303

110

166

385

 

591

248

216

135

160000

 

316

123

180

393

 

651

320

252

145

170000

 

325

143

190

386

 

773

455

298

157

180000

 

329

210

164

403

 

857

652

375

171

190000

 

429

259

187

424

 

948

1337

492

183

200000

2614

457

   

442

2058

997

   

196

Key:

R Red–black BST (Programs 12.15 and 13.6)

H Separate chaining (Program 14.3 with table size 20,000)

P Linear probing (Program 14.4 with table size 200,000)

D Double hashing (Program 14.6 with table size 200,000)

P* Linear probing with expansion by doubling (Program 14.7)

The problem of implementing an exception dictionary is an example of an app where we can recast our algorithm slightly to optimize performance for the most frequently performed operation—in this case, search miss. For example, suppose that we have a 1000-item exception dictionary, have 1 million items to look up in the dictionary, and expect virtually all the searches to end as misses. This situation might arise if the items were bizarre English-language words or random 32-bit integers. One way to proceed is to hash all the words to, say, 15-bit hash values (table size about 216). The 1000 exceptions occupy 1/64 of the table, and most of the 1 million searches end immediately with search misses, finding the empty table position on the first probe. But if the table contains 32-bit words, we can do much better by converting it into a bit-exception table and using 20-bit hash values. If we have a search miss (as we do most of the time), we finish the search with one bit test; a search hit requires a secondary test in a smaller table. The exceptions occupy 1/1000 of the table; search misses are by far the most likely operation; and we accomplish the task with 1 million directly indexed bit tests. This solution exploits the basic idea that a hash function produces a short certificate that represents a key—an essential concept that is useful in apps other than symbol-table implementations.

Hashing is preferred to the binary-tree structures of Chapters 12 and 13 as the symbol-table implementation for many apps, because it is somewhat simpler and can provide optimal (constant) search times, if the keys are of a standard type or are sufficiently simple that we can be confident of developing a good hash function for them. The advantages of binary-tree structures over hashing are that the trees are based on a simpler abstract interface (no hash function need be designed); the trees are dynamic (no advance information on the number of insertions is needed); the trees can provide guaranteed worst-case performance (everything could hash to the same place even in the best hashing method); and the trees support a wider range of operations (most important, sort and select). When these factors are not important, hashing is certainly the search method of choice, with one more important proviso: When keys are long strings, we can build them into data structures that can provide for search methods that are even faster than hashing. Such structures are the subject of .

Exercises

Java graphics icon01.gif 14.48 For 1 million integer keys, compute the hash-table size that makes each of the three hashing methods (separate chaining, linear probing, and double hashing) use the same number of key comparisons as BSTs for a search miss, on the average, counting the hash-function computation as a comparison.

Java graphics icon01.gif 14.49 For 1 million integer keys, compute the number of comparisons for each of the three hashing methods (separate chaining, linear probing, and double hashing) for a search miss, on the average, when they can use a total of 3 million words of memory (as BSTs would).

Implement a symbol-table ADT with fast search miss as described in the text, using separate chaining for secondary testing.

Do an empirical study to produce a table like Table 14.1 that compares linear probing with expansion by doubling with the Java Hashtable class.


Previous   Next
Comments