I just spent some two hours tracking down a bug in a piece of code I’m writing, and I’m quite amazed at what I found. As most of you out there know, the current winner for the fastest sorting algorithm is QuickSort. (Before someone complains, yes, there are variants of QuickSort that are more efficient for special circumstances, bear with me for a moment.) For those who don’t know, or (like me until a few years ago) don’t understand it, QuickSort works by taking a set of data (call it an array, for simplicity’s sake), and selecting a Pivot value (one data element from the set). Next, set two pointers to either end of the set, and walk the pointers towards each other until each points to an element of the set that belongs on the “other side” of the Pivot value (e.g., if a value on the low side of the array is higher than the Pivot.) When you’ve got two values that are on the wrong side of the Pivot, swap them. Continue this until your pointers meet; this is the Partition Point. Now, recursively QuickSort the set from the beginning until the Partition, and from the Partition until the end. If properly coded, when you come out of the recursive QuickSort routine your set is all sorted. Simple, elegant, and oh so fast!
Those who study QuickSort will tell you that picking the Pivot is a matter of great discussion. Textbooks typically point out that the Pivot can easily be the first element of the set; however, they also point out that if the set is already sorted, QuickSort devolves from very efficient ( O( n*log n ) ) to inefficient ( O( n2 ) ). A couple of years ago one of my students and I figured out that if the middle element of the set is chosen as the Pivot, this breakdown of QuickSorting sorted data doesn’t occur; QuickSort remains efficient over a wider range of scenarios. Anyway, I’ve used the QuickSort algorithm in a number of programs over the years, and have been quite pleased with the results. Until the other day …
I’m writing a program to facilitate the course scheduling process at my school (many more blog entries to follow on this, I’m quite sure!), and I need to sort a big array containing pairs of StudentIDs and CourseNums. Each student can select up to about a dozen classes, so the array will have about a dozen CourseNum entries for each StudentID. Well, guess what, my tested and true QuickSort module broke, consistently causing a Stack Overflow error.
Now, as anyone who’s ever programmed a recursive method knows, Stack Overflow means that the recursion isn’t terminating, but going on and on and on. So, this should be easy to find. And after a couple hours of head scratching mixed with hard core debugging, it was easy to find: My “more efficient” Pivot selection was the culprit! Apparently, if you’ve got lots of identical elements in your set, picking the middle point in a ever-shrinking (i.e., recursive) algorithm will cause a failure of the QuickSort. When the recursion gets down to a set with two elements (as it eventually will), the Pivot will be set to the larger one, and the simple and elegant QuickSort will infinitely try to sort the same two points forever.
And now the greater insult in all this: When I tried changing the Pivot to the first element of the set, it worked fine. Thanks, Murphy!
COMMENTS FROM theSpoke:
You may want to try the Sort method in the Array class that is part of the .NET Framework sometime. I find that it is less work (I like less work) than coding my own sort and it it very faast. In fact in the tests I ran back a while a go it seemed to be faster than any sort I could code by hand.
The problem here, Alfred, is that I need to sort first by StudentID, and then later by CourseNum. I’m not sure how to get the Sort method of the Array class to differentiate.
Have you data class impliment IComparable and write a CompareTo method that commpares things the way you want them compared. Array.Sort will use that CompaterTo method to do the sort.