Previous   Next

Strings and Vectors

As discussed in , we can write an ITEM interface implementation to use the quicksort implementations in this chapter to sort records with string keys. This approach provides a correct and efficient implementation (faster than any other method we have seen so far, for large files), but there is a hidden cost that is interesting to consider.

The problem lies in the cost of comparing strings. The standard way to compares two strings is by proceeding from left to right, comparing the strings character by character, taking time proportional to the number of leading characters that match in the two strings. For the later partitioning stages of quicksort, when keys are close together, this match might be relatively long. As usual, because of the recursive nature of quicksort, nearly all the cost of the algorithm is incurred in the later stages, so examining improvements there is worthwhile.

For example, consider a subfile of size 5 containing the keys discreet, discredit, discrete, discrepancy,and discretion. All the comparisons used for sorting these keys examine at least seven characters, when it would suffice to start at the seventh character, if the extra information that the first six characters are equal were available.

The three-way partitioning procedure that we considered in provides an elegant way to take advantage of this observation. At each partitioning stage, we examine just one character (say the one at position d), assuming that the keys to be sorted are equal in positions 0 through d-1. We do a three-way partition with keys whose dth character is smaller than the dth character of the partitioning element on the left, those whose dth character is equal to the dth character of the partitioning element in the middle, and those whose dth character is larger than the dth character of the partitioning element on the right. Then, we proceed as usual, except that we sort the middle sub-file, starting at character d+1. It is not difficult to see that this method leads to a proper sort on strings, which turns out to be very efficient (see Table 7.2). We have here a convincing example of the power of thinking (and programming) recursively.

Table 7.2. Empirical study of quicksort variants

This table gives relative costs for several different versions of quicksort on the task of sorting the first N words of Moby Dick. Using insertion sort directly for small subfiles, or ignoring them and insertion sorting the same file afterward, are equally effective strategies, but the cost savings is slightly more than for integer keys (see Table 7.1) because comparisons are more expensive for strings. If we do not stop on duplicate keys when partitioning, then the time to sort a file with all keys equal is quadratic; the effect of this inefficiency is noticeable on this example, because there are numerous words that appear with high frequency in the data. For the same reason, three-way partitioning is effective; it is 30 to 35 percent faster than Program 7.1.

N

V

I

M

X

T

12500

84

59

59

55

60

25000

146

117

130

136

107

50000

354

273

311

341

264

100000

916

756

772

975

626

Key:

V Quicksort (Program 7.1)

I Insertion sort for small subfiles

M Ignore small subfiles, insertion sort afterward

X Scan over duplicate keys (goes quadratic when keys all equal)

T Three-way partitioning (Program 7.5)

To implement the sort, we need a more general abstract type that allows access to characters of keys. The way in which strings are handled in Java makes the implementation of this method particularly straightforward. However, we defer considering the implementation in detail until , where we consider a variety of techniques for sorting that take advantage of the fact that sort keys can often be easily decomposed into smaller pieces.

This approach generalizes to handle multidimensional sorts, where the sort keys are vectors and the records are to be rearranged such that the first components of the keys are in order, then those with first component equal are in order by second component, and so forth. If the components do not have duplicate keys, the problem reduces to sorting on the first component; in a typical app, however, each of the components may have only a few distinct values, and three-way partitioning (moving to the next component for the middle partition) is appropriate. This case was discussed by Hoare in his original paper and is an important app.

Exercises

Discuss the possibility of improving selection, insertion, bubble, and shell sorts for strings.

ScreenshotHow many characters are examined by the standard quicksort algorithm (Program 7.1, using a StringITEM class like Program 6.4, but with a String field for the key) when sorting a file consisting of N strings of length t, all of which are equal? Answer the same question for the modification proposed in the text.


Previous   Next
Comments