Implementations

Now that you have the various types of collections in mind, it is time to turn your attention to the implementation of these collections. A List, for example, can be implemented a number of ways, each with advantages and disadvantages. However, before you can understand the various implementations of Lists and other collections, you need to understand how collections and maps determine equality and order.

Determining Equality and Order

Until now, we have been discussing equality and order quite a bit, but without explaining how these conditions are actually discovered by the collections. For example, how does a Map decide if two keys are the same, and how does a SortedSet determine the order used to sort the objects? To answer these questions, you have to tackle the basic principles of object equality: identity and comparability.

Equality versus identity

For many collections to do their job, they need to know which objects are equal. For example, a Set can't exclude duplicate entries if it doesn't know how to check to see whether supplied objects are equal. These collections take advantage of the fact that all objects in Java descend from the common class java.lang.Object. Since all objects descend from Object, the Set implementations can call the method equals( ) on all objects. This allows the set to compare objects. However, there is one catch: for this to be effective, the object must define (or inherit) a valid implementation of equals( ). The default implementation of equals( ) compares two objects to see whether they are the same object in the virtual machine by identity, not by equality. Suppose you create two Address objects that define the addresses of employees. Both address objects can contain the same street name, number, country, postal code, and other data. However, these objects are two different instances (and therefore reside at two different memory addresses). In this case, the Address objects are equivalent, but they are not the same object—i.e., they are not identical. If you use the default implementation of the equals( ) method, you would always get a result of false despite the fact that the objects are equal. To solve this problem, you need to override the definition of equals( ) to compare the actual data in the class. While creating your new equals method, you need to keep the following requirements in mind:

The method must consistently return the same value for any single comparison of objects x and y no matter how many times the comparison is run, assuming that no data changes in x and y.
The method should be symmetric so that x.equals(y) returns the same result as y.equals(x).
The method should be transitive so that if x.equals(y) and y.equals(z) returns true, then z.equals(x) also returns true.
Whenever the comparison is against a null value, the method should always return false—i.e., (x.equals(null) == false) should be true for all types of x.
The method should be reflexive so that x.equals(x) is always true.

Don't forget to override the implementation of hashCode( ) when you override equals( ). There is a contract between these two methods that says that any objects evaluating as equivalent must also return the same hash code. For more information on this contract, see the Javadoc for java.lang.Object.equals( ).

The overridden equals( ) method ensures that collections that are not supposed to contain duplicates do indeed prevent them. It also illustrates an important concept in programming. Two objects are said to be equal in identity if they are the same object instance, whereas two objects are said to have equality if they contain the same data.

Comparing objects

The problem of sorting in collections is solved by defining a way to compare two objects. This can be accomplished using two different techniques. The first technique is to have the objects implement java.lang.Comparable. The Comparable interface contains a single method defined by the following signature:

public interface Comparable {
 public int compareTo(Object o);
}

The compareTo( ) method returns an integer indicating the comparison of the this object to the target object. If compareTo( ) returns a negative integer, then the this object is said to be less than the object given in the parameter o. If the returned integer is positive, then the this object is greater than the given object. If the integer is 0, the object is comparatively the same as the this object. Generally, implementations will ignore anything other than whether the return value is positive, negative, or 0. Therefore, it is useless to worry about defining how much less than the given object the return value is. For example, a returned value of -8 would be treated the same as a returned value of -1. One important factor to remember about this comparison is that it does not compare equality. Two objects that are not equivalent may result in a compareTo( ) result of 0. This occurs often when objects try to sort on only one or two fields in the class. For example, if you consider a Person object, you would want to sort on first name and last name but not necessarily on the other fields, such as age and tax identification number. In this case, compareTo( ) would return a value of 0 for two people named Paul Henry despite the fact that one is a senior executive and the other is an infant. The compareTo( ) method has a number of requirements similar to the equals( ) method:

Like the equals( ) method, the compareTo( ) method must return the same value for any single comparison of objects x and y no matter how many times the comparison is run, assuming that no data changes in x and y.
The method should be transitive—i.e., if object x is comparatively greater than object y, and object y is comparatively greater than object z, object x should be greater than object z. This is generally written using the following mathematical notation:
```
If x > y and y > z then x > z
```
The method must ensure that the comparison is consistent. For example, if x.compareTo(y) yields -1, then y.compareTo(x) should yield 1.
The method should be reflexive so that x.compareTo(x) always returns 0.
It is recommended that objects that are comparatively 0, but are not necessarily equals( ), indicate this in the documentation to the class.

It is always a good idea to implement the Comparable interface for objects that may be stored inside ordered collections. This implementation defines what is known as the natural ordering of instances. However, there are some situations in which a developer may want to sort the objects in a collection based on a value other than the natural ordering. The java.util.Comparator interface is used for this purpose. Suppose you have a set of data objects that contain a variety of information, and you want to be able to sort that information to achieve something other than the natural ordering. For example, if you have a set of Person objects, you may want to sort them by age or occupation instead of name. To accomplish this, create a new implementation of the Comparator interface that will sort the objects. Example 4-1 shows a Comparator that sorts Person objects based on their tax identification number.

Example 4-1. An age Comparator

package oracle.hcj.collections;
import java.util.Comparator;
import oracle.hcj.bankdata.Person;
public class TaxIdComparator implements Comparator {
 public int compare(final Object x, final Object y) {
 return ((Person)x).getTaxID( )
 .compareTo(((Person)y).getTaxID( ));
 }
}

In this example, you take advantage of the compareTo( ) method built into java.lang.String to compare the two people based on the values of their taxID property. Now that you have a new Comparator implementation, you can sort the people by passing this comparator to sorting collections:

public class SomeClass {
 protected void demoComparator( ) {
 SortedSet people = new TreeSet(new TaxIdComparator( ));
 // . . . add people to the set.
 }
}

Once you have created the sorted set and assigned the comparator, you can be sure that the people inserted into the set will always be sorted in order of taxID. You can also implement more complicated comparators that are configurable.

Big O Notation

Implementing collections in coding languages has been a topic of computer science since the first days of structured languages. This topic was born out of the desire to efficiently store and analyze information in computer systems. It is a true science with theories, mathematics, and concepts. One of the most important pieces of mathematics necessary to understanding collections is the concept of Big O Notation. Big O Notation is a mathematical term used to express the efficiency of an algorithm depending on the number of elements the algorithm must work on. The notation is called Big O Notation because of the conspicuous usage of the letter O in the formulas. The O stands for orders of magnitude. The general idea is that the lower the orders of magnitude, the faster the algorithm is. Consider a list of people stored in a random order. Big O Notation will tell you how long, on average, it will take you to find any one person in the list by name. If you think about the worst-case scenario, you will realize that you may look through all of the elements in a list, only to find that your target person is the last element on the list. As the list grows in size, your worst-case scenario gets . . . worse. If you have 100 people in your list, your worst-case scenario would involve performing 99 checks to find a person. If your list is 100,000 people, then you would have to do 99,999 checks before you find the person. This is described in Big O Notation with the formula O(n). This shorthand is read as "the orders of magnitude grows in proportion to n." By contrast, if you store people that are sorted by name in an ordered list, you can use a binary search algorithm to find the target person, called x, by name. This algorithm compares the person at the halfway point in the list, y, to the target person. If the person x evaluates as equal to person y, then the person is found, and the algorithm stops. However, if the person x evaluates as greater than person y, then the algorithm uses the upper half of the list for the search. If person x evaluates as less than person y, the algorithm uses the lower half of the list. The algorithm then splits up the targeted half and repeats the splitting process until it finds the person or narrows the list to one person that isn't the target person. The binary search algorithm has an order of magnitude of O(log2(n)). This means that the worst-case scenario to find a person in a list of size n is the result of log₂(n). In the case of your 100,000-person list, the result of this equation is roughly 16.609 operations. In other words, the worst-case scenario is that you would need 17 compares to find the target person.

Most calculators have only a log₁₀(n) function. To find log₂(n) on these calculators, you can employ the simple formula log₂(n) - log₁₀(n)/log₁₀(2).

The most common orders of magnitude that you will encounter are the following. They are listed in order of most efficient to least efficient.

O(1): Constant time. No matter how big the list, the time to find the target is always the same.
O(log2(n)): This notation is often abbreviated O(log(n)). For 100,000 people, the value would be 17 compares.
O(n): For 100,000 people, you would need 99,999 compares.
O(n x log(n)): For 100,000 people, you would need 1,700,000 compares!
O(n²): My computer couldn't compute this result for 100,000 entries; however, for 100 entries, it is 10,000. This number increases very rapidly.
O(n!): The factorial is the series of integers from 1 to n multiplied together. For example, if n = 10, the factorial is 10 x 9 x 8 x 7 x 6 x 5 x 4 x 3 x 2 x 1 = 3628800. If you write an algorithm with this order of magnitude, you will definitely have problems.

Big O Notation is not limited to calculating search time in a collection. In fact, it can be applied to any computer algorithm as a measure of efficiency. Generally, algorithms higher than O(n) are not feasible as solutions to practical problems.

Lists

Now that you have an understanding of the basic concepts, you are ready to study various types of collections. Once you have decided that you need a List, the next step is to decide exactly which List you need to use. To make a good decision, you need to have a good understanding of the three standard implementations of List.

Although anyone can implement the collection interfaces in their code, we will stick to the classes defined in the java.util package. Personally, I like to augment this with the collections classes from Jakarta Commons, available at http://jakarta.apache.org/.

java.util.Vector

The Vector class has been with the JDK for quite a long time. If you look at the @since tag in the Javadoc for this class, you will see that it was introduced in JDK 1.0. A Vector is essentially a dynamic resizing array. You create a vector by specifying its initial size and the size of the chunk you want to use to resize the Vector should it exceed its storage capacity. The following code creates a Vector that starts with 100 elements and grows by 100 elements (when needed):

Vector v = new Vector(100, 100);

There are other constructors to vectors that use default values for sizes and increments. Additionally, there is an extra constructor for Vector that copies a collection into a Vector and uses the default value for incrementation.

Since Vector is an implementation of the List interface, instances of Vector also allow the user to access each element in a similar manner to that of an array. Therefore, you can think of Vectors as something of a resizable array. In fact, inside the Vector class, the data is stored as an array of objects. When you initialize the Vector, this array is allocated according to the initial size that you specify in the constructor. As the capacity of the Vector is exceeded, a new array is allocated that is equal to the size of the old array plus the capacityIncrement you specified in the constructor. If you didn't specify either of these values, the class uses its default sizes. When the Vector class has to resize, it copies the old array into the new array using the System.arraycopy( ) method. Although this method uses fast native code to copy the array, copying is still quite slow and will get slower as the size of the Vector grows. This is why it is important to assign values to capacity and capacityIncrement that minimize the number of resize operations without wasting too much memory. Using an array to store the data inside a Vector has one other major disadvantage. When inserting an element into a Vector at a specific position, the Vector must use System.arraycopy( ) to move the data to accommodate the insertion, as Screenshot-1 demonstrates.

Screenshot-1. Inserting an element into a Vector at an index

Screenshot-1 shows the result of inserting a String element at index 2 into the Vector. Since there was already an element at index 2, the Vector had to use System.arraycopy( ) to push the elements further down into the array. If you remove elements at an index, the same process has to be performed to compress the array. What this all adds up to is that Vector is not very good at inserts and removals of items. However, the array storage paradigm makes Vector very good at accessing elements quickly. Therefore, Vector is a good choice for a list whose prime function is to access the data in the list rapidly, and a poor choice if the data in the list will be modified frequently. However, the Vector class has one major problem: since Vector was designed to be thread-safe, it is replete with synchronized blocks in the code. This allows multiple threads to use a Vector safely, but significantly slows down its performance. The semantics of synchronized code are outside the scope of this tutorial; however, I can tell you that the process of synchronization is a costly one.

If you are interested in learning more about synchronization and thread programming, check out Java Threads by Scott Oaks and Henry Wong (Oracle).

In situations in which data structures will not be accessed by multiple threads or in which the synchronization can be provided outside of the list, using Vector is a sad waste of performance. However, if you will be tossing your list around various threads, you should stick with Vector.

java.util.ArrayList

The ArrayList class is JDK 1.2's answer to the synchronization problem in the Vector class. This class does essentially the same job as the Vector class but without synchronization. The lack of thread safety within the ArrayList class makes it much faster than Vector. However, it still suffers from the same performance problems as Vector when resizing, adding, or removing elements. If you need a class that has all of the features of Vector but not the internal synchronization, you should choose a list class.

Marker Interfaces

If you look at the documentation for the ArrayList class, you will notice that in addition to implementing the List interface, it implements java.io.Serializable and java.util.RandomAccess. However, these two interfaces don't declare any methods or attributes. These two interfaces are examples of marker interfaces. Marker interfaces are used to mark classes conceptually. For example, the RandomAccess interface marks List implementations that can be accessed in a random, rapid, and predictable order. Although this marker interface is useful in some situations, I haven't seen it used professionally. On the other hand, Serializable is a marker interface that you see all over the place. This interface marks a class as having the ability to be flattened and serialized to a stream and then restored from that stream. This allows Java to store objects in files and send these objects across a network connection. In this case, the marker interface lets the virtual machine know that this class can be introspected and then written out. Classes that are not serializable do not have this ability. By using marker interfaces, a developer can tag certain code without complicating its inheritance structure.

java.util.LinkedList

The LinkedList class also implements the List interface and allows you to store objects in a defined order. However, it implements this ability in a very different way than Vector or ArrayList. Whereas Vector and ArrayList hold their data in an array of objects, LinkedList uses a special node structure to connect the objects. Each object is stored in a node that contains three attributes: one to contain the data in the node, one to be used as a reference to the next node, and one to be used as a reference to the previous node. The start of the chain of references is usually referred to as the head of the list, and is the only node actually stored in the class variable. The structure of a LinkedList is shown in Screenshot-2.

Screenshot-2. Structure of a LinkedList

This node-based structure makes it easy to insert or remove elements in the LinkedList. All that the list needs to do is create a node and rewire the references to the previous and next objects. There is (thankfully) no need to copy references or perform any other costly actions. On the other hand, the structure of a LinkedList makes finding any particular element in the list a costly exercise. Since the LinkedList class has to step through the nodes one by one from the head of the list until the element is found, the search will be slow.

Some of the data structure purists out there may have noticed that LinkedList is, in fact, not a normal linked list at all. In fact, a normal linked list has only a reference to the next object and not a reference to the previous object. The LinkedList class in the java.util package is actually a doubly linked list.

For these reasons, LinkedList is a great class for manipulating data and shuffling it around. It is ideal for tasks such as sorting and performing frequent insertions and removals. However, it is really bad at accessing data quickly. You should use LinkedList if the primary function of your list is to manipulate data in the list.

Maps and SortedMaps

Maps are useful structures for storing things such as key/value pairs. The Map interface and its descendant SortedMap have many implementations. Once you have decided that you need to use a map as your collection, you will then need to decide exactly which map to use.

java.util.HashTable

The Hashtable class is one of the two grandfathers of Java collections, the other being Vector. The Hashtable class stores a series of keyed values in a structure that uses the hash code of the object to store things quickly. To understand how a Hashtable works, it is important that you understand the hash system. When a Hashtable is constructed, it creates a table to hold the data of the collection. This table is an array of arrays. Each of the subarrays represents a bucket in which the class can store a set of values. All of the buckets are the same size. As you insert new items into the table, the hash code of the object is used to determine which bucket the object should go into. Once the correct bucket is found, the object is inserted into that bucket. The structure of the class is shown in Screenshot-3.

Screenshot-3. Hash-based storage

Each bucket holds data within a particular range of hash codes. As you add items to the table, the hash code of the key is computed and the correct bucket is chosen. One of the advantages of this structure is that it can hold very large data sets without impeding performance. As the map grows, the number of buckets grows, and the range of hash codes allowed in each bucket gets smaller. The result is that the map can find any object, taking, on average, the same amount of time as it does to find any other object. In Big O Notation, this is referred to as O(1). This is the best-case scenario for algorithm efficiency. As the map grows, it fills up until its load reaches a certain percentage. This loadFactor can be given to the class at construction time, or the user can choose to use the default value of 0.75f. When using the default, as soon as the map becomes 75% full, the class will increase the size of the map, usually by creating more buckets and narrowing the range. Once the new map is ready, the class will rehash the map by computing the hash codes of each object in the map and filing them in their proper buckets. This is a fairly slow process, so you don't want to do it often. On the other hand, since hash-based data structures take up a lot of memory, you wouldn't want to allocate maps with huge initial capacities either. Finding the optimal performance for a hash-based data structure is often dependent on how the class will be used in the program. If the data size in the map is fairly constant, you can use higher load factors. However, if the data is constantly changing size and memory is tight, you may want to use lower load factors. The default size of 0.75 provides this good balance in most situations. The Hashtable class is indeed useful. However, it is internally synchronized, which makes it thread-safe but much slower than other alternatives.

java.util.HashMap

Essentially, a HashMap is an implementation of the Hashtable class without internal synchronization. It sacrifices thread safety for higher performance. Other than this difference, they are conceptually the same. If you don't need to pass your map around in a threaded environment, or if thread safety can be provided externally to the map, then you should use a HashMap.

java.util.LinkedHashMap

A LinkedHashMap is an implementation of the SortedMap interface that allows you to guarantee that the iteration of the keys is in a predictable order. It uses the same technique as the HashMap and HashTable classes to store the data. However, LinkedHashMap also maintains a running LinkedList that contains all of the keys in the map in a sorted order. Although keeping the keys in a sorted order is advantageous when it comes to iterating through the map, maintaining the extra linked list causes LinkedHashMap to be slower than HashMap. Use it with caution.

java.util.IdentityHashMap

An IdentityHashMap is the same as a HashMap, except that objects are compared based on identity, not equality. In this map, key objects evaluate as equal only if they are the same instance in the virtual machine. This differs from other collections that evaluate for equality based on the result of calling the equals( ) method. Since this map doesn't compare based on equality, you can insert multiple keys into the map even when they evaluate as equals according to the equals( ) method; they only need to be different instances. For example, the following two instances could be placed in the same IdentityhashMap:

String x = new String("tom"); String y = new String("tom"); // <== Different instance!
IdentityHashMap ihm = new IdentityHashMap( ); ihm.put(x, new Integer(15));
ihm.put(y, new Integer(8));

If you try this with a normal HashMap, the value for the key tom would be replaced on the second line. With an IdentityHashMap, both values are inserted. However, just because you can do this doesn't mean you should; I don't recommend inserting duplicate keys into an IdentityHashMap because it can result in some very confusing code. All other operations on an IdentityHashMap work the same way as they do on a HashMap. The following code takes up where the last snippet left off:

String z = new String("tom");
Integer obj = ihm.get(z);

In this code, the variable element will contain the value null. This is because z is not the same instance as x or y even though they contain the same value. When working with IdentityHashMaps, you should keep these gotchas in mind.

java.util.WeakHashMap

A WeakHashMap is just like a HashMap, except that the keys used in the map are stored in weak references. We will cover weak references in .

java.util.TreeMap

A TreeMap is the other implementation of the SortedSet interface. However, unlike LinkedHashMap, the TreeMap class doesn't store its elements in a hash-based data structure. Instead, it stores its elements, unsurprisingly, in a tree. Specifically, the TreeMap class uses a structure called a binary tree. Screenshot-4 shows how this structure looks in memory.

Screenshot-4. Binary tree storage of a TreeMap

Inside TreeMap is a variable called root, which stores the root of the tree. Each node in the tree stores a key and a value, and each element stores a left and a right element. When a new element is inserted, the key is evaluated against the root. If the key is the same, then the root value in the root entry will be replaced with the new value. However, if the key evaluates as less than the root (compareTo( ) returns -1), the TreeMap will use the left member of the root to navigate to the next element. If the key evaluates as greater than the root (compareTo( ) returns 1), the TreeMap will use the right member to navigate to the next element. This process of evaluation continues until the key is found or the TreeMap hits a leaf node, which is a node without a left or right member. If the TreeMap hits a leaf node, it will add a new entry containing the key and value and tack it on to the leaf node. Keep in mind that determining whether a node is a leaf is dependent on which side, left or right, to which the TreeMap wants to surf. A particular node could be a leaf node with respect to left but not to right, as right may have a reference to another entry. Every now and then, the tree will become lopsided—that is, it has more elements on one side of root than on the other. When this occurs, the tree is rebalanced by using the middle key in the sorted keys as the new root entry. The other entries are then copied into the new tree using the algorithm for adding new entries. This rebalancing process is slow. The process of adding or removing a key from a TreeMap has a O(log2(n)) efficiency. It isn't as good as a LinkedHashMap with its O(1) efficiency, but a LinkedHashMap has to store two data structures for the elements in the map—one to accomplish the actual storage and another to keep a list of the keys in a sorted order. For this reason, TreeMaps are better at storing large sets of data. Since log2(n) grows very slowly as n increases, this size trade-off is often worth the cost. If you are morbidly curious, where n is 100,000,000, log2(n) is only 26.5754247590989. This means that if you have a map that contains a hundred million entries, a TreeMap will be able to find any arbitrary key using a maximum of 27 compares. That is usually good enough in most situations. On the other hand, storing references to all 100,000,000 entries twice would take up too much memory.

Sets and SortedSets

If you have opted to use a set as your collection, you will first need to decide whether the data needs to maintain a sorting order. This question is more complex than it appears. For example, a set of contacts in an address tutorial doesn't necessarily need to maintain sorting order even though the items may be displayed in a sorted order in the interface. In fact, the sorting order may be invalid if you built the list to sort based on last name, and the user wants it sorted based on email address. However, a set that holds the instructions to make a certain part in a factory must be kept in a specific sorted order to ensure that they are followed correctly. The question you should ask yourself is, "Do my objects need to be in a sorted order for presentation only, or is there some other underlying logical reason for the sorting?" If they only need to be presented in sorted order, then you can easily copy them to a list or SortedSet in the GUI and sort them at presentation time. However, if there is another underlying logical reason to sort them in one predefined order, then you should opt for a SortedSet. Try to avoid sorting your sets. Sorting sets introduces significant overhead in the insertion and deletion of items in a set and can impede performance.

java.util.HashSet

A HashSet is an implementation of the Set interface that stores its contents using a hash system. This system works the same way as a HashMap. In fact, one interesting piece of Java trivia is that the HashSet class is implemented using only the key set of the HashMap class. The HashSet class merely inserts a dummy value for each key in the map and hides the fact that it is using a map from the user.

java.util.LinkedSet

A LinkedSet is an implementation of the Set interface that guarantees a predictable iteration order. It does this by maintaining a LinkedList of the objects in the set, which can significantly decrease the amount of memory. Note that just because the order is predictable does not mean that the order is sorted. In fact, the only real benefit of a LinkedHashSet over a regular HashSet is that the iteration of the set remains constant over time and isn't as random as HashSet. However, I have never seen a programming situation in which this is enough of a concern to balance the significantly larger memory consumption.

java.util.TreeSet

A TreeSet is implemented the same way HashSet is. However, instead of using a HashMap as a backing store, it uses TreeMap. Therefore, the TreeSet class inherits the same benefits and limitations as the TreeMap class.

Previous Next

Comments