Advanced Sorting

Earlier, you learned that you could take a list and sort it in ascending ASCII order (like you do strings) using the built-in sort function. What if you don't want an ascending ASCII sort, but something else instead, like a numeric sort? Well, Perl gives you the tools you need to do the job. In fact, you'll see that the Perl sort is completely general and able to perform any well-defined sort order.

To define a sort of a different color, you need to define a comparison routine that describes how two elements compare. Why is this necessary? Well, if you think about it, sorting is putting a bunch of things in order by comparing them all. Since you can't compare them all at once, you need to compare two at a time, eventually using what you find out about each pair's order to put the whole kit'n'caboodle in line.

The comparison routine is defined as an ordinary subroutine. This routine will be called repeatedly, each time passing two elements of the list to be sorted. The routine must determine whether the first value is less-than, equal-to, or greater-than the second value, and return a coded value (described in a moment). This process is repeated until the list is sorted.

To save a little execution speed, the two values are not passed in an array, but rather are handed to the subroutine as the values of the global variables $a and $b. (Don't worry: the original values of $a and $b are safely protected.) The routine should return any negative number if $a is less than $b, zero if $a is equal to $b, and any positive number if $a is greater than $b. Now remember, the less than is according to your meaning of less than; it could be a numeric comparison, according to the third character of the string, or even according to the values of a hash using the passed-in values as keys. It's really pretty flexible.

Here's an example of a comparison routine that sorts in numeric order:

sub by_number {
 if ($a < $b) {
 return -1;
}
elsif ($a == $b) {
 return 0;
}
elsif ($a > $b) {
 return 1;
}
}

Notice the name by_number. There's nothing special about the name of this subroutine, but you'll see why we like names that start with by_ in a minute.

Let's look through this routine. If the value of $a is less than (numerically in this case) the value of $b, we return a -1 value. If the values are numerically equal, we get back a zero, and otherwise a . So, according to our specification for a sort comparison routine, this should work.

How do we use it? Let's try sorting the following list:

@somelist = (1,2,4,8,16,32,64,128,256);

If we use the ordinary sort without any adornment on the list, we get the numbers sorted as if they were strings, and in their ASCII order, like so:

@wronglist = sort @somelist; # @wronglist is now (1,128,16,2,256,32,4,64,8)

Certainly not very numeric. Well, let's give sort our newly defined comparison routine. The name of the comparison routine goes immediately following the sort keyword, like so:

@rightlist = sort by_number @wronglist; # @rightlist is now (1,2,4,8,16,32,64,128,256)

This does the trick. Note that you can read the sort with its companion sort routine in a human-like fashion: "sort by number." That's why I named the subroutine with a by_ prefix.

This kind of three-way value of -1, , and +1 on the basis of a numeric comparison occurs often enough in sort routines that Perl has a special operator to do this in one fell swoop. It's often called the spaceship operator, and looks like <=>. Using the spaceship operator, the preceding sort subroutine can be replaced with this:

sub by_number {
 $a <=> $b;
}

Note the spaceship between the two variables. Yes, it is indeed a three-character-long operator. The spaceship returns the same values as the if/elsif chain from the previous definition of this routine. Now this is pretty short, but you can shortcut the sort invocation even further, by replacing the name of the sort routine with the entire sort routine in line, like so:

@rightlist = sort {
 $a <=> $b
}
@wronglist;

There are some who argue that this decreases readability. They are wrong. Others argue that it removes the need to go somewhere else for the definition. Perl doesn't care. Our personal rule is that if it doesn't fit on one line or we have to use it more than once, it goes into a subroutine.

The spaceship operator for numeric comparison has a comparable string operator called cmp. The cmp operator returns one of three values, depending on the relative string comparisons of the two arguments. So, here's another way to write the default sort order:[]

@result = sort {
 $a cmp $b
}
@somelist;

[2] Not exactly. The built-in sort discards undef elements, but this one doesn't.

You probably won't ever write this exact subroutine (mimicking the built-in default sort), unless you're writing a tutorial about Perl. However, the cmp operator does have its uses in the construction of cascaded ordering schemes. For example, you might need to put the elements in numeric order unless they're numerically equal, in which case they should go in ASCII string order. (By default, the by_number routine above just sticks nonnumeric strings in some random order because there's no numeric ordering when comparing two values of zero.) Here's a way to say "numeric, unless they're numerically equal, then string":

sub by_mostly_numeric {
 ($a <=> $b) || ($a cmp $b);
}

How does this work? Well, if the result of the spaceship is -1 or 1, the rest of the expression is skipped, and the -1 or 1 is returned. If the spaceship evaluates to zero, however, the cmp operator gets its turn at bat, returning an appropriate ordering value considering the values as strings.

The values being compared are not necessarily the values being passed in. For example, say you have a hash where the keys are the login names and the values are the real names of each user. Suppose you want to print a chart where the login names and real names are sorted in the order of the real names. How would you do that?

Actually, it's fairly easy. Let's assume the values are in the array %names. The login names are thus the list of keys(%names). What we want to end up with is a list of the login names sorted by the corresponding value, so for any particular key $a, we need to look at $names{$a} and sort based on that. If you think of it that way, it almost writes itself, as in:

@sortedkeys = sort by_names keys(%names); sub by_names {
 return $names{$a} cmp $names{$b};
}
foreach (@sortedkeys) {
 print "$_ has a real name of $names{$_}\n";
}

To this we should also add a fallback comparison. Suppose the real names of two users are identical. Because of the whimsical nature of the sort routine, we might get one value ahead of another the first time through and the values in the reversed order the next time. This is bad if the report might be fed into a comparison program for reporting, so try very hard to avoid such things. With the cmp operator, it's easy:

sub by_names {
 ($names{$a} cmp $names{$b}) || ($a cmp $b);
}

Here, if the real names are the same, we sort based on the login name instead. Since the login name is guaranteed to be unique (after all, they are the keys of this hash, and no two keys are the same), then we can ensure a unique and repeatable order. Good defensive developing during the day is better than a late-night call from a system administrator wondering why the security alarms are going off.