Revisiting a Classic Book

I’m re-reading Programming Pearls by Jon Bentley. I read this book within a year or two of its release, early in my career. That was 16 or 17 years ago. I remember enjoying it greatly. It opened my eyes to the critical importance of selecting optimal data structures and algorithms when designing solutions to programming problems. The author advocates for deliberation prior to putting hands on the keyboard.

Re-reading the book now as a more experienced (and I would hope wiser) programmer, I’m reminded of a quote from Linus Torvalds. What Jon Bentley states in elegant prose throughout two hundred pages of his classic book, Linus Torvalds summarizes pithily:

“Bad programmers worry about the code. Good programmers worry about data structures and their relationships.” -Linus Torvalds

My twenty year programming career and my chess engine hobby project have convinced me Jon Bentley and Linus Torvalds are correct.

Problem

To get more out of the book on my second reading, I’ve decided to write C# code to solve some of the problems and exercises posed in the text. Not all of the problems- only those that interest me. It didn’t take long to find one. In section 1.2 Bentley poses this problem:

To the programmer these requirements added up to, “How do I sort a disk file?” Before we attack the problem, let’s arrange what we know in a less biased and more useful form.

  • Input: A file containing at most n positive integers, each less than n, where n = 10 pow 7 = 10,000,000. It is a fatal error if any integer occurs twice in the input. No other data is associated with the integer.
  • Output: A sorted list in increasing order of the input integers.
  • Constraints: At most (roughly) a megabyte of storage is available in main memory; ample disk storage is available. The run time can be at most several minutes; a run time of ten seconds need not be decreased.

These integers actually are 1-800 toll-free phone numbers. (Jon Bentley worked for Bell Labs.) When considering the hardware and performance constraints, keep in mind this problem was posed in the late 1980s.

Naive Solution

This problem cries out for a bitwise solution. I see it because writing a chess engine has made me aware of the performance benefits of bit manipulation. However, let’s first write what I’ll call a naive solution. A naive solution will give us a baseline against which we’ll measure the performance of a bitwise solution.

My naive solution leverages the generic SortedSet<T> collection in the .NET Base Class Library. SortedSet is implemented as a Red-Black Binary Search Tree.

Run the program, specifying one million phone numbers and the naive sort algorithm.

PS C:\Users\Erik\...\ProgrammingPearls\Phone Number Sort> dotnet run -c release -- 1000000 naive
000.002  Thread01  Creating input file of phone numbers.
001.066  Thread09  Done.
001.068  Thread09  Creating output file of phone numbers.
006.256  Thread10  Done.
Thread10  Sort took 5.187 seconds.

The program creates an input file of one million random phone numbers with no duplicates. It then reads the input file one line at a time, adding each phone number to a SortedSet. The SortedSet, being a Red-Black Binary Search Tree, inserts the phone number into its internal tree such that the order of all phone numbers is maintained. Insertion performs in O(log n) time. In fact, all operations (insert, delete, search) in Red-Black Binary Search Trees perform in O(log n) time, worst case. The program then iterates over the SortedSet (which already is in order), writing phone numbers to the output file. It completes in 5.187 seconds.

Can we do better? Yes!

Optimized Solution

Let’s use an array of integers with each bit indicating whether the phone number (represented as an integer) is included in the input file. An integer contains 32 bits, so the 0th bit indicates if the nonsensical 000-0000 phone number is included, the 15th bit indicates if the 000-0015 phone number is included, etc. The 13th bit of the integer in the 271,103rd index of the array (bit and integer indices are zero-based) indicates if the 867-5309 phone number is included.

Run the program, specifying one million phone numbers and the bitwise sort algorithm.

PS C:\Users\Erik\...\ProgrammingPearls\Phone Number Sort> dotnet run -c release -- 1000000 bitwise
000.002  Thread01  Creating input file of phone numbers.
001.081  Thread09  Done.
001.082  Thread09  Creating output file of phone numbers.
001.728  Thread05  Done.
Thread05  Sort took 0.644 seconds.

The program creates an input file of one million random phone numbers with no duplicates. It then reads the input file one line at a time, converting the phone number to an integer, calculating an array index and mask, then applying the mask to the integer at the index location. The program then iterates over each bit in each integer in the array, writing phone numbers to the output file.

Visual inspection of the input and output files confirms the program functions correctly using either algorithm. However, the bitwise technique outperforms the SortedSet technique, completing in 0.644 seconds versus 5.187 seconds for the SortedSet.

That’s an 8x speedup!

You may review the full source code in the Phone Number Sort folder of my ProgrammingPearls project in GitHub.

Leave a Reply

Your email address will not be published.