How do we get several independent processors to work together on the same sorting problem? Whether the processors control external memory devices or are complete computer systems, this question is at the heart of algorithm design for high-performance computing systems. The subject of parallel computing has been studied widely in recent years. Many different types of parallel computers have been devised, and many different models for parallel computation have been proposed. The sorting problem is a test case for the effectiveness of both.
We have already discussed low-level parallelism, in our discussion of sorting networks in , where we considered doing a number of compare–exchange operations at the same time. Now, we discuss a high-level parallel model, where we have a large number of independent general-purpose processors (rather than just comparators) that have access to the same data. Again, we ignore many practical issues, but can examine algorithmic questions in this context.
The abstract model that we use for parallel processing involves a basic assumption that the file to be sorted is distributed among P independent processors. We assume that we have
We assign the processors the labels 0, 1, ..., P - 1, and assume that the file to be input is in the local memories of the processors (that is, each processor has N/P of the records). The goal of the sort is to rearrange the records to put the smallest N/P records in processor 0's memory, the next smallest N/P records in processor 1's memory, and so forth, in sorted order. As we shall see, there is a tradeoff between P and the total running time—we are interested in quantifying that tradeoff so that we can compare competing strategies.
This model is one of many possible ones for parallelism, and it has many of the same liabilities with respect to practical applicability as did our model for external sorting (). Indeed, it does not address one of the most important issues to be faced in parallel computing: constraints on communication between the processors.
We shall assume that such communication is far more costly than references to local memory and that it is most efficiently done sequentially, in large blocks. In a sense, processors treat other processors' memory as external storage devices. Again, this high-level abstract model can be regarded as unsatisfactory from a practical standpoint, because it is an oversimplification; it can be regarded as unsatisfactory from a theoretical standpoint, because it is not fully specified. Still, it provides a framework within which we can develop useful algorithms.
Indeed, this problem (with these assumptions) provides a convincing example of the power of abstraction, because we can use the same sorting networks that we discussed in , by modifying the compare–exchange abstraction to operate on large blocks of data.
Definition 11.2 A merging comparator takes as input two sorted files of size M and produces as output two sorted files: one containing the M smallest of the 2M inputs, and the other containing the M largest of the 2M inputs.
Such an operation is easy to implement: Merge the two input files, and output the first half and the second half of the merged result.
We can sort a file of size N by dividing it into N/M blocks of size M, sorting each file, then using a sorting network built with merging comparators.
Establishing this fact from the 0–1 principle is tricky (see Exercise 11.71), but tracing through an example, such as the one in Screenshot, is a persuasive exercise.
This figure shows how we can use the network in Screenshot to sort blocks of data. The comparators put the small half of the elements in the two input lines out onto the top line and the large half out onto the bottom line. Three parallel steps suffice.
We refer to the method described in Property 11.7 as block sorting. We have a number of design parameters to consider before we use the method on a particular parallel machine. Our interest in the method concerns the following performance characteristic:
Block sorting on P processors, using Batcher's sort with merging comparators, can sort N records in about (lg P)2/2 parallel steps.
By parallel step in this context, we mean a set of disjoint merging comparators. Property 11.8 is a direct consequence of Properties 11.3 and 11.7.
To implement a merging comparator on two processors, we can have them exchange copies of their blocks of data, have both do the merge (in parallel), and have one keep the small half of the keys and the other keep the large half of the keys. If block transfer is slow compared to the individual processor speeds, then we can estimate the total time required for the sort by multiplying the cost of one block transfer by (lg P)2/2. This estimate embodies a large number of assumptions; for example, it assumes that multiple block transfers can be done in parallel without penalty, a rarely achieved goal in real parallel computers. Still, it provides a starting point for understanding what we can expect in a practical implementation.
If the block-transfer cost is comparable to individual processor speeds (another ideal goal that is only approached in real machines), then we have to account for the time to do the initial sorts. The processors each do about (N/P)lg(N/P) comparisons (in parallel) to sort the N/P blocks initially, and about P2(lg P)/2 stages with (N/P)-by-(N/P) merges. If the cost of a comparison is a and the cost per record for a merge is b, then the total running time is about
For huge N and small P, this performance is the best that we can hope for in any comparison-based parallel sorting method, because the cost in that case is about a(N lg N)/P, which is optimal: Any sort requires N lg N comparisons, and the best that we could do is to do P of them at once. For large P, the second term dominates, and the cost is about bN(P lg P)/2, which is suboptimal but still perhaps is competitive. For example, the second term contributes about 256bN/P to the cost of sorting 1 billion elements on 64 processors, as compared to the contribution of 32aN/P from the first term.
When P is large, the communication among all the processors might create a bottleneck on some machines. If so, using a perfect shuffle as in Screenshot might provide a way to control such costs. For precisely this reason, some parallel machines have built-in low-level interconnections that allow us to implement shuffles efficiently.
This example shows that we can get a large number of processors to work efficiently on a huge sort problem, under certain circumstances. To find the best way to do so, we certainly would need to consider many other algorithms for this kind of parallel machine, to learn many other characteristics of a real parallel machine, and to consider many variations on the machine model that we are using. Moreover, we might need to take a completely different approach to parallelism. Still, the idea that increasing the number of processors increases the costs of communicating among them is fundamental to parallel computing, and Batcher's networks provide an effective way of controlling these costs, as we have seen at a low level in and at a high level in this section.
The sorting methods described in this section and elsewhere in this chapter have a flavor different from those of the methods that we have discussed in Chapters 6 through 10, because they involve coping with constraints that we do not consider in ordinary programming. In Chapters 6 through 10, simple assumptions about the nature of our data were sufficient to allow us to compare a large number of different methods for the same basic problem. By contrast, in this chapter we have focused on articulating a variety of problems and have been able to discuss just a few solutions for each. These examples illustrate that changes in real-world constraints can provide new opportunities for algorithmic solutions; a critical part of the process is to develop useful abstract formulations of problems.
Sorting is essential in many practical apps, and the design of an efficient sort is often one of the first problems to be addressed on new computer architectures and in new coding environments. To the extent that new developments build on past experience, the array of techniques that we have discussed here and in Chapters 6 through 10 is important to know; to the extent that radical new departures are invented, the kind of abstract thinking discussed here will be necessary if we are to develop fast sorting procedures on new machines.
Use the 0–1 principle (Property 11.1) to prove Property 11.7.
11.72 Implement a sequential version of block sorting with Batcher's odd– even merge: (i) use standard mergesort (Programs 8.3 and 8.2) to sort the blocks, (ii) use the standard abstract in-place merge (Program 8.2) to implement the merging comparators, and (iii) use bottom-up Batcher's odd–even merge (Program 11.3) to implement the block sort.
Estimate the running time of the program described in Exercise 11.72, as a function of N and M, for large N.
11.74 Do Exercises 11.72 and 11.73, but substitute bottom-up Batcher's odd–even merge (Program 11.3) for Program 8.2 in both instances.
Give the values of P for which (N/P) lg N = NP lg P, for N = 103, 106, 109, and 1012.
Give approximate expressions of the form c1N lg N + c2N for the number of comparisons between data items used by a parallel Batcher's block sort, for P = 1, 4, 16, 64, and 256.
How many parallel steps would be required to sort 1015 records that are distributed on 1000 disks, using 100 processors?