Two-Way Merging - Algorithms - Java Programming Language

Given two ordered input files, we can combine them into one ordered output file simply by keeping track of the smallest element in each file and entering a loop where the smaller of the two elements that are smallest in their files is moved to the output, continuing until both input files are exhausted. We shall look at several implementations of this basic abstract operation in this and the next section. The running time is linear in the number of elements in the output, as long as we can perform the operation of finding the next smallest element in a file in constant time, which is certainly the case for files that are in sorted order and represented with a data structure that supports constant-time sequential access, such as an array or a linked list. This procedure is two-way merging; in , we shall look in detail at multiway merging, when more than two files are involved. The most important app of multiway merging is external sorting, which is discussed in detail in that chapter.

Merging

To combine two ordered arrays a and b into an ordered array c, we use a for loop that puts an element into c at each iteration. If a is exhausted, the element comes from b; if b is exhausted, the element comes from a; and if items remain in both, the smallest of the remaining elements in a and b goes to c. Beyond the implicit assumptions that the arrays are ordered and sufficiently large to be indexed as indicated, this implementation assumes that the array c is disjoint from (that is, does not overlap or share storage with) a and b.

static void mergeAB(ITEM[] c, int cl, ITEM[] a, int al, int ar, ITEM[] b, int bl, int br ) { int i = al, j = bl; for (int k = cl; k < cl+ar-al+br-bl+1; k++) { if (i > ar) { c[k] = b[j++]; continue; } if (j > br) { c[k] = a[i++]; continue; } c[k] = less(a[i], b[j]) ? a[i++] : b[j++]; } }

To begin, let us suppose that we have two disjoint ordered arrays a[0], ..., a[N-1] and b[0], ..., b[M-1] of integers that we wish to merge into a third array c[0], ... , c[N+M-1]. The obvious strategy, which is easily implemented, is to choose successively for c the smallest remaining element from a and b, as shown in Program 8.1. This implementation is simple, but it has two important characteristics that we shall now examine.

First, the implementation assumes that the arrays are disjoint. In particular, if a and b are huge arrays, then a third (also huge) array c is needed to hold the output. Instead of using extra space proportional to the size of the merged file, it would be desirable to have an inplace method so that, for example, we could combine the ordered files a[l], ..., a[m] and a[m+1], ..., a[r] into a single ordered file by moving the elements around within a[l], ..., a[r], without using a significant amount of other extra space. It is a worthwhile exercise to pause momentarily to consider how we might do that. This problem seems to be one that must be simple to solve; actually, however, the solutions that are known are complicated, especially by comparison to Program 8.1. Indeed, it is not easy to develop an algorithm for in-place merging that can outperform the alternative of using an in-place sort. We shall return to this issue in .

Merging has specific apps in its own right. For example, in a typical data-processing environment, we might need to maintain a large (ordered) data file, to which we will need to regularly add new entries. One approach is to batch each group of new entries-append them to the (much larger) main file, then resort the whole file. This situation is tailor-made for merging: A much more efficient strategy is to sort the (small) batch of new entries, then merge the resulting small file with the large main file. Merging has many other similar apps that make its study worthwhile. Our prime interest in this chapter will be the sorting methods that are based on merging.

Exercises

Suppose that an ordered file of size N is to be combined with an unordered file of size M, with M much smaller than N. Assume that you have a sorting program that takes about c₁N lg N seconds to sort a file of size N and a merging program that takes about c₂(N + M) seconds to merge a file of size N with one of size M, with c₁ c₂. How many times faster than resorting is the suggested merge-based method, as a function of M, for N = 10³, 10⁶, and 10⁹?
How does the strategy of using insertion sort for the whole file compare with the two methods postulated in Exercise 8.1? Assume that the small file is random, so each insertion goes about halfway into the large file, and the running time is about c₃MN/2, with c₃ approximately the same as the other constants.
Describe what happens if you try to use Program 8.1 for an in-place merge, by using the call merge(a, 0, a, 0, N/2, a, (N/2)+1, N) for the keys A E Q S U Y E I N O S T.
Does Program 8.1, called as described in Exercise 8.3, produce proper output if and only if the two input subarrays are in sorted order? Prove your answer, or provide a counterexample.