Previous   Next

Union–Find Algorithms

The first step in the process of developing an efficient algorithm to solve a given problem is to implement a simple algorithm that solves the problem. If we need to solve a few particular problem instances that turn out to be easy, then the simple implementation may finish the job for us. If a more sophisticated algorithm is called for, then the simple implementation provides us with a correctness check for small cases and a baseline for evaluating performance characteristics. We always care about efficiency, but our primary concern in developing the first program that we write to solve a problem is to make sure that the program is a correct solution to the problem.

The first idea that might come to mind is somehow to save all the input pairs, then to write a function to pass through them to try to discover whether the next pair of objects is connected. We shall use a different approach. First, the number of pairs might be sufficiently large to preclude our saving them all in memory in practical apps. Second, and more to the point, no simple method immediately suggests itself for determining whether two objects are connected from the set of all the connections, even if we could save them all! We consider a basic method that takes this approach in , but the methods that we shall consider in this chapter are simpler, because they solve a less difficult problem, and more efficient, because they do not require saving all the pairs. They all use an array of integers—one corresponding to each object—to hold the requisite information to be able to implement union and find. Arrays are elementary data structures that we discuss in detail in . Here, we use them in their simplest form: we create an array that can hold N integers by writing int id[] = new int[N]; then we refer to the ith integer in the array by writing id[i], for 0 i < 1000.

Quick-find solution to connectivity problem

This program takes an integer N from the command line, reads a sequence of pairs of integers, interprets the pair p q to mean "connect object p to object q," and prints the pairs that represent objects that are not yet connected. The program maintains the array id such that id[p] and id[q] are equal if and only if p and q are connected.

The In and Out methods that we use for input and output are described in the Appendix, and the standard Java mechanism for taking parameter values from the command line is described in .

```public class QuickF { public static void main(String[] args) { int N = Integer.parseInt(args[0]); int id[] = new int[N]; for (int i = 0; i < N ; i++) id[i] = i; for( In.init(); !In.empty(); ) { int p = In.getInt(), q = In.getInt(); int t = id[p]; if (t == id[q]) continue; for (int i = 0;i<N;i++) if (id[i] == t) id[i] = id[q]; Out.println(" " +p+""+q); } } }
```

Program 1.1 is an implementation of a simple algorithm called the quick-find algorithm that solves the connectivity problem (see and Program 3.1 for basic information on Java programs). The basis of this algorithm is an array of integers with the property that p and q are connected if and only if the pth and qth array entries are equal. We initialize the ith array entry to i for 0 i < N. To implement the union operation for p and q, we go through the array, changing all the entries with the same name as p to have the same name as q. This choice is arbitrary—we could have decided to change all the entries with the same name as q to have the same name as p.

Screenshot shows the changes to the array for the union operations in the example in Screenshot. To implement find, we just test the indicated array entries for equality—hence the name quick find. The union operation, on the other hand, involves scanning through the whole array for each input pair.

Screenshot Example of quick find (slow union)

This sequence depicts the contents of the id array after each of the pairs at left is processed by the quick-find algorithm (Program 1.1). Shaded entries are those that change for the union operation. When we process the pair pq, we change all entries with the value id[p] to have the value id[q].

Property 1.1

The quick-find algorithm executes at least MN instructions to solve a connectivity problem with N objects that involves M union operations.

For each of the M union operations, we iterate the for loop N times. Each iteration requires at least one instruction (if only to check whether the loop is finished).

We can execute tens or hundreds of millions of instructions per second on modern computers, so this cost is not noticeable if M and N are small, but we also might find ourselves with billions of objects and millions of input pairs to process in a modern app. The inescapable conclusion is that we cannot feasibly solve such a problem using the quick-find algorithm (see Exercise 1.10). We consider the process of precisely quantifying such a conclusion precisely in .

Screenshot shows a graphical representation of Screenshot. We may think of some of the objects as representing the set to which they belong, and all of the other objects as having a link to the representative in their set. The reason for moving to this graphical representation of the array will become clear soon. Observe that the connections between objects (links) in this representation are not necessarily the same as the connections in the input pairs—they are the information that the algorithm chooses to remember to be able to know whether future pairs are connected.

Screenshot Tree representation of quick find

This figure depicts graphical representations for the example in Screenshot. The connections in these figures do not necessarily represent the connections in the input. For example, the structure at the bottom has the connection 1-7, which is not in the input, but which is made because of the string of connections 7-3-4-9-5-6-1.

The next algorithm that we consider is a complementary method called the quick-union algorithm. It is based on the same data structure—an array indexed by object names—but it uses a different interpretation of the values that leads to more complex abstract structures. Each object has a link to another object in the same set, in a structure with no cycles. To determine whether two objects are in the same set, we follow links for each until we reach an object that has a link to itself. The objects are in the same set if and only if this process leads them to the same object. If they are not in the same set, we wind up at different objects (which have links to themselves). To form the union, then, we just link one to the other to perform the union operation; hence the name quick union.

Screenshot shows the graphical representation that corresponds to Screenshot for the operation of the quick-union algorithm on the example of Screenshot, and Screenshot shows the corresponding changes to the id array. The graphical representation of the data structure makes it relatively easy to understand the operation of the algorithm—input pairs that are known to be connected in the data are also connected to one another in the data structure. As mentioned previously, it is important to note at the outset that the connections in the data structure are not necessarily the same as the connections in the app implied by the input pairs; rather, they are constructed by the algorithm to facilitate efficient implementation of union and find.

Screenshot Tree representation of quick union

This figure is a graphical representation of the example in Screenshot. We draw a line from object i to object id[i].

Screenshot Example of quick union (not-too-quick find)

This sequence depicts the contents of the id array after each of the pairs at left are processed by the quick-union algorithm (Program 1.2). Shaded entries are those that change for the union operation (just one per operation). When we process the pair p q, we follow links from p to get an entry i with id[i] == i; then, we follow links from q to get an entry j with id[j] == j; then, if i and j differ, we set id[i] = id[j]. For the find operation for the pair 5-8 (final line), i takes on the values 5 6901, and j takes on the values 801.

The connected components depicted in Screenshot are called trees; they are fundamental combinatorial structures that we shall encounter on numerous occasions throughout the tutorial. We shall consider the properties of trees in detail in . For the union and find operations, the trees in Screenshot are useful because they are quick to build and have the property that two objects are connected in the tree if and only if the objects are connected in the input. By moving up the tree, we can easily find the root of the tree containing each object, so we have a way to find whether or not they are connected. Each tree has precisely one object that has a link to itself, which is called the root of the tree. The self-link is not shown in the diagrams. When we start at any object in the tree, move to the object to which its link refers, then move to the object to which that object's link refers, and so forth, we always eventually end up at the root. We can prove this property to be true by induction: It is true after the array is initialized to have every object link to itself, and if it is true before a given union operation, it is certainly true afterward.

The diagrams in Screenshot for the quick-find algorithm have the same properties as those described in the previous paragraph. The difference between the two is that we reach the root from all the nodes in the quick-find trees after following just one link, whereas we might need to follow several links to get to the root in a quick-union tree.

Quick-union solution to connectivity problem

If we replace the body of the for loop in Program 1.1 by this code, we have a program that meets the same specifications as Program 1.1, but does less computation for the union operation at the expense of more computation for the find operation. The for loops and subsequent if statement in this code specify the necessary and sufficient conditions on the id array for p and q to be connected. The assignment statement id[i] = j implements the union operation.

```int i, j, p = In.getInt(), q = In.getInt(); for (i = p; i != id[i]; i = id[i]); for (j = q; j != id[j]; j = id[j]); if (i == j) continue; id[i] = j; Out.println(" " + p + " " + q);
```

Program 1.2 is an implementation of the union and find operations that comprise the quick-union algorithm to solve the connectivity problem. The quick-union algorithm would seem to be faster than the quick-find algorithm, because it does not have to go through the entire array for each input pair; but how much faster is it? This question is more difficult to answer here than it was for quick find, because the running time is much more dependent on the nature of the input. By running empirical studies or doing mathematical analysis (see ), we can convince ourselves that Program 1.2 is far more efficient than Program 1.1, and that it is feasible to consider using Program 1.2 for huge practical problems. We shall discuss one such empirical study at the end of this section. For the moment, we can regard quick union as an improvement because it removes quick find's main liability (that the program requires at least NM instructions to process M union operations among N objects).

This difference between quick union and quick find certainly represents an improvement, but quick union still has the liability that we cannot guarantee it to be substantially faster than quick find in every case, because the input data could conspire to make the find operation slow.

Property 1.2

For M > N, the quick-union algorithm could take more than MN/2 instructions to solve a connectivity problem with M pairs of N objects.

Suppose that the input pairs come in the order 1-2, then 2-3, then 3-4, and so forth. After N - 1 such pairs, we have N objects all in the same set, and the tree that is formed by the quick-union algorithm is a straight line, with N linking to N - 1, which links to N - 2, which links to N - 3, and so forth. To execute the find operation for object N, the program has to follow N - 1 links. Thus, the average number of links followed for the first N pairs is

Now suppose that the remainder of the pairs all connect N to some other object. The find operation for each of these pairs involves at least (N - 1) links. The grand total for the M find operations for this sequence of input pairs is certainly greater than MN/2.

Fortunately, there is an easy modification to the algorithm that allows us to guarantee that bad cases such as this one do not occur. Rather than arbitrarily connecting the second tree to the first for union, we keep track of the number of nodes in each tree and always connect the smaller tree to the larger. This change requires slightly more code and another array to hold the node counts, as shown in Program 1.3, but it leads to substantial improvements in efficiency. We refer to this algorithm as the weighted quick-union algorithm.

Screenshot shows the forest of trees constructed by the weighted union–find algorithm for the example input in Screenshot. Even for this small example, the paths in the trees are substantially shorter than for the unweighted version in Screenshot. Screenshot illustrates what happens in the worst case, when the sizes of the sets to be merged in the union operation are always equal (and a power of 2). These tree structures look complex, but they have the simple property that the maximum number of links that we need to follow to get to the root in a tree of 2n nodes is n. Furthermore, when we merge two trees of 2n nodes, we get a tree of 2n+1 nodes, and we increase the maximum distance to the root to n + 1. This observation generalizes to provide a proof that the weighted algorithm is substantially more efficient than the unweighted algorithm.

Screenshot Tree representation of weighted quick union

This sequence depicts the result of changing the quick-union algorithm to link the root of the smaller of the two trees to the root of the larger of the two trees. The distance from each node to the root of its tree is small, so the find operation is efficient.

Screenshot Weighted quick union (worst case)

The worst scenario for the weighted quick-union algorithm is that each union operation links trees of equal size. If the number of objects is less than 2n, the distance from any node to the root of its tree is less than n.

Weighted version of quick union

This program is a modification to the quick-union algorithm (see Program 1.2) that keeps an additional array sz for the purpose of maintaining, for each object with id[i] == i, the number of nodes in the associated tree so that the union operation can link the smaller of the two specified trees to the larger, thus preventing the growth of long paths in the trees.

```public class QuickUW { public static void main(String[] args) { int N = Integer.parseInt(args[0]); int id[] = new int[N], sz[] = new int[N]; for (int i = 0;i<N;i++) { id[i] = i; sz[i] = 1; } for(In.init(); !In.empty(); ) { int i, j, p = In.getInt(), q = In.getInt(); for (i = p; i != id[i]; i = id[i]); for (j = q; j != id[j]; j = id[j]); if (i == j) continue; if (sz[i] < sz[j]) { id[i] = j; sz[j] += sz[i]; } else { id[j] = i; sz[i] += sz[j]; } Out.println(" " + p +""+q); } } }
```

Property 1.3

The weighted quick-union algorithm follows at most 2 lg N links to determine whether two of N objects are connected.

We can prove that the union operation preserves the property that the number of links followed from any node to the root in a set of k objects is no greater than lg k (we do not count the self-link at the root). When we combine a set of i nodes with a set of j nodes with i j, we increase the number of links that must be followed in the smaller set by 1, but they are now in a set of size i + j, so the property is preserved because 1 + lg i =lg(i + i) lg(i + j).

The practical implication of Property 1.3 is that the weighted quick-union algorithm uses at most a constant times M lg N instruc-tions to process M edges on N objects (see Exercise 1.9). This result is in stark contrast to our finding that quick find always (and quick union sometimes) uses at least MN/2 instructions. The conclusion is that, with weighted quick union, we can guarantee that we can solve huge practical problems in a reasonable amount of time (see Exercise 1.11). For the price of a few extra lines of code, we get a program that is literally millions of times faster than the simpler algorithms for the huge problems that we might encounter in practical apps.

It is evident from the diagrams that relatively few nodes are far from the root; indeed, empirical studies on huge problems tell us that the weighted quick-union algorithm of Program 1.3 typically can solve practical problems in linear time. That is, the cost of running the algorithm is within a constant factor of the cost of reading the input. We could hardly expect to find a more efficient algorithm.

We immediately come to the question of whether or not we can find an algorithm that has guaranteed linear performance. This question is an extremely difficult one that plagued researchers for many years (see ). There are a number of easy ways to improve the weighted quick-union algorithm further. Ideally, we would like every node to link directly to the root of its tree, but we do not want to pay the price of changing a large number of links, as we did in the quick-union algorithm. We can approach the ideal simply by making all the nodes that we do examine link to the root. This step seems drastic at first blush, but it is easy to implement, and there is nothing sacrosanct about the structure of these trees: If we can modify them to make the algorithm more efficient, we should do so. We can easily implement this method, called path compression, by adding another pass through each path during the union operation, setting the id entry corresponding to each vertex encountered along the way to link to the root. The net result is to flatten the trees almost completely, approximating the ideal achieved by the quick-find algorithm, as illustrated in Screenshot. The analysis that establishes this fact is extremely complex, but the method is simple and effective. Screenshot shows the result of path compression for a large example.

Screenshot Path compression

We can make paths in the trees even shorter by simply making all the objects that we touch point to the root of the new tree for the union operation, as shown in these two examples. The example at the top shows the result corresponding to Screenshot. For short paths, path compression has no effect, but when we process the pair 1 6, we make 1, 5, and 6 all point to 3 and get a tree flatter than the one in Screenshot. The example at the bottom shows the result corresponding to Screenshot. Paths that are longer than one or two links can develop in the trees, but whenever we traverse them, we flatten them. Here, when we process the pair 6 8, we flatten the tree by making 4, 6, and 8 all point to 0.

Screenshot A large example of the effect of path compression

This sequence depicts the result of processing random pairs from 100 objects with the weighted quick-union algorithm with path compression. All but two of the nodes in the tree are one or two steps from the root.

There are many other ways to implement path compression. For example, Program 1.4 is an implementation that compresses the paths by making each link skip to the next node in the path on the way up the tree, as depicted in Screenshot. This method is slightly easier to implement than full path compression (see Exercise 1.16), and achieves the same net result. We refer to this variant as weighted quick-union with path compression by halving. Which of these methods is the more effective? Is the savings achieved worth the extra time required to implement path compression? Is there some other technique that we should consider? To answer these questions, we need to look more carefully at the algorithms and implementations. We shall return to this topic in , in the context of our discussion of basic approaches to the analysis of algorithms.

Screenshot Path compression by halving

We can nearly halve the length of paths on the way up the tree by taking two links at a time and setting the bottom one to point to the same node as the top one, as shown in this example. The net result of performing this operation on every path that we traverse is asymptotically the same as full path compression.

Path compression by halving

If we replace the for loops in Program 1.3 by this code, we halve the length of any path that we traverse. The net result of this change is that the trees become almost completely flat after a long sequence of operations.

```for (i = p; i != id[i]; i = id[i]) id[i] = id[id[i]]; for (j = q; j != id[j]; j = id[j]) id[j] = id[id[j]];
```

The end result of the succession of algorithms that we have considered to solve the connectivity problem is about the best that we could hope for in any practical sense. We have algorithms that are easy to implement whose running time is guaranteed to be within a constant factor of the cost of gathering the data. Moreover, the algorithms are online algorithms that consider each edge once, using space proportional to the number of objects, so there is no limitation on the number of edges that they can handle. The empirical studies in Table 1.1 validate our conclusion that Program 1.3 and its path-compression variations are useful even for huge practical apps. Choosing which is the best among these algorithms requires careful and sophisticated analysis (see ).

Exercises

1.4 Show the contents of the id array after each union operation when you use the quick-find algorithm (Program 1.1) to solve the connectivity problem for the sequence 0-2, 1-4, 2-5, 3-6, 0-4, 6-0, and 1-3. Also give the number of times the program accesses the id array for each input pair.

1.5 Do Exercise 1.4, but use the quick-union algorithm (Program 1.2).

Table 1.1. Empirical study of union-find algorithms
 These relative timings for solving random connectivity problems using various union–find algorithms demonstrate the effectiveness of the weighted version of the quick-union algorithm. The added incremental benefit due to path compression is less important. In these experiments, M is the number of random connections generated until all N objects are connected. This process involves substantially more find operations than union operations, so quick union is substantially slower than quick find. Neither quick find nor quick union is feasible for huge N. The running time for the weighted methods is evidently roughly proportional to M. N M F U W P H 1000 3819 63 53 17 18 15 2500 12263 185 159 22 19 24 5000 21591 698 697 34 33 35 10000 41140 2891 3987 85 101 74 25000 162748 237 267 267 50000 279279 447 533 473 100000 676113 1382 1238 1174 Key: F quick find (Program 1.1) U quick union (Program 1.2) W weighted quick union (Program 1.3) P weighted quick union with path compression (Exercise 1.16) H weighted quick union with halving (Program 1.4)

1.6 Give the contents of the id array after each union operation for the weighted quick-union algorithm running on the examples corresponding to Screenshot and Screenshot.

1.7 Do Exercise 1.4, but use the weighted quick-union algorithm (Program 1.3).

1.8 Do Exercise 1.4, but use the weighted quick-union algorithm with path compression by halving (Program 1.4).

Prove an upper bound on the number of machine instructions required to process M connections on N objects using Program 1.3. You may assume, for example, that any Java assignment statement always requires less than c instructions, for some fixed constant c.

Estimate the minimum amount of time (in days) that would be required for quick find (Program 1.1) to solve a problem with 109 objects and 106 input pairs, on a computer capable of executing 109 instructions per second. Assume that each iteration of the inner for loop requires at least 10 instructions.

Estimate the maximum amount of time (in seconds) that would be required for weighted quick union (Program 1.3) to solve a problem with 109 objects and 106 input pairs, on a computer capable of executing 109 instructions per second. Assume that each iteration of the outer for loop requires at most 100 instructions.

Compute the average distance from a node to the root in a worst-case tree of 2n nodes built by the weighted quick-union algorithm.

1.13 Draw a diagram like Screenshot, starting with eight nodes instead of nine.

Give a sequence of input pairs that causes the weighted quick-union algorithm (Program 1.3) to produce a path of length 4.

1.15 Give a sequence of input pairs that causes the weighted quick-union algorithm with path compression by halving (Program 1.4) to produce a path of length 4.

Show how to modify Program 1.3 to implement full path compression, where we complete each union operation by making every node that we touch link to the root of the new tree.

1.17 Answer Exercise 1.4, but use the weighted quick-union algorithm with full path compression (Exercise 1.16).

1.18 Give a sequence of input pairs that causes the weighted quick-union algorithm with full path compression (Exercise 1.16) to produce a path of length 4.

Give an example showing that modifying quick union (Program 1.2) to implement full path compression (see Exercise 1.16) is not sufficient to ensure that the trees have no long paths.

1.20 Modify Program 1.3 to use the height of the trees (longest path from any node to the root), instead of the weight, to decide whether to set id[i] = j or id[j] = i. Run empirical studies to compare this variant with Program 1.3.

1.21 Show that Property 1.3 holds for the algorithm described in Exercise 1.20.

1.22 Modify Program 1.4 to generate random pairs of integers between 0 and N - 1 instead of reading them from standard input, and to loop until N - 1 union operations have been performed. Run your program for N = 103, 104, 105, and 106, and print out the total number of edges generated for each value of N.

1.23 Modify your program from Exercise 1.22 to plot the number of edges needed to connect N items, for 100 N 1000.

1.24 Give an approximate formula for the number of random edges that are required to connect N objects, as a function of N.

 Previous   Next
Comments