The most important app of the methods discussed in this chapter is to construct indexes for huge databases that are maintained on external memory—for example, in disk files. Although the underlying algorithms that we have discussed are powerful, developing a file-system implementation based on B trees or on extendible hashing is a complex task. First, we cannot use the Java programs in this section directly—they have to be modified to read and refer to disk files. Second, we have to be sure that the algorithm parameters (page and directory size, for example) are tuned properly to the characteristics of the particular hardware that we are using. Third, we have to pay attention to reliability, error detection, and error correction. For example, we need to be able to check that the data structure is in a consistent state and to consider how we might proceed to correct any of the scores of errors that might crop up. Systems considerations of this kind are critical—and are beyond the scope of this tutorial.
On the other hand, if we have a coding system that supports virtual memory, we can put to direct use the Java implementations that we have considered here in a situation where we have a huge number of symbol-table operations to perform on a huge table. Roughly, each time that we access a page, such a system will put that page in a cache, where references to data on that page are handled efficiently. If we refer to a page that is not in the cache, the system has to read the page from external memory, so we can think of cache misses as roughly equivalent to the probe cost measure that we have been using.
For B trees, every search or insertion references the root, so the root will always be in the cache. Otherwise, for sufficiently large M, typical searches and insertions involve at most two cache misses. For a large cache, there is a good chance that the first page (the child of the root) that is accessed on a search is already in the cache, so the average cost per search is likely to be significantly less than two probes.
For extendible hashing, it is unlikely that the whole directory will be in the cache, so we expect that both the directory access and the page access might involve a cache miss (this case is the worst case). That is, two probes are required for a search in a huge table—one to access the appropriate part of the directory and one to access the appropriate page.
These algorithms form an appropriate subject on which to close our discussion of searching, because, to use them effectively, we need to understand basic properties of binary search, BSTs, balanced trees, hashing, and tries—the basic searching algorithms that we have studied in Chapters 12 through 15. As a group, these algorithms provide us with solutions to the symbol-table implementation problem in a broad variety of situations: they constitute an outstanding example of the power of algorithmic technology.
16.46 Develop a symbol-table implementation using B trees that includes a clone implementation and supports the construct, count, search, insert, re-move, and join operations for a symbol-table ADT, with support for client handles (see Exercises 12.6 and 12.7).
16.47 Develop a symbol-table implementation using extendible hashing that includes a clone implementation and supports the construct, count, search, insert, remove, and join operations for a symbol-table ADT, with support for client handles (see Exercises 12.6 and 12.7).
Modify the B-tree implementation in (Programs 16.1 through 16.3) to use an ADT for page references.
Modify the extendible-hashing implementation in (Programs 16.5 through 16.8) to use an ADT for page references.
Estimate the average number of probes per search in a B tree for S random searches, in a typical cache system, where the T most-recently-accessed pages are kept in memory (and therefore add 0 to the probe count). Assume that S is much larger than T.
Estimate the average number of probes per search in an extendible hash table, for the cache model described in Exercise 16.50.
If your system supports virtual memory, design and conduct experiments to compare the performance of B trees with that of binary search, for random searches in a huge symbol table.
Implement a priority-queue ADT that supports construct for a huge number of items, followed by a huge number of insert and remove the maximum operations (see ).
Develop an external symbol-table ADT based on a skip-list representation of B trees (see Exercise 13.80).
16.55 If your system supports virtual memory, run experiments to determine the value of M that leads to the fastest search times for a B-tree implementation supporting random search operations in a huge symbol table. (It may be worthwhile for you to learn basic properties of your system before conducting such experiments, which can be costly.)
16.56 Modify the B-tree implementation in (Programs 16.1 through 16.03) to operate in an environment where the table resides on external storage. If your system allows nonsequential file access, put the whole table on a single (huge) file and use offsets within the file in place of references in the data structure. If your system allows you to access pages on external devices directly, use page addresses in place of references in the data structure. If your system allows both, choose the approach that you determine to be most reasonable for implementing a huge symbol table.
16.57 Modify the extendible-hashing implementation in (Programs 16.5 through 16.8) to operate in an environment where the table resides on external storage. Explain the reasons for the approach that you choose for allocating the directory and the pages to files (see Exercise 16.56).