This repository currently only features a few select sorting algorithms. They are:
- Insertion Sort
- Selection Sort
- Merge Sort
- Heap Sort
- Smooth Sort
- Tim Sort
In hindsight, smooth sort was the most difficult to implement, but it also brought me the most insights and enjoyment. Because the implementations of the other algorithms are already widely-known, I put emphasis on explaining the mechanism of smooth sort below and leave out the others (except for Heap Sort... It'll become clear why I decided to explain this too).
Additionally, I'll outline the methods I used to benchmark the sorting algorithms. After much research, I decided to generate my own data sets (still using the `struct Record` provided by the starter code) to test the different implementations. I landed on a few ways to measure "entropy" (how shuffled an array is) to help me understand how the sorting algorithms would behave under different circumstances.
Likewise, to see how the different algorithms scale (in terms of execution time), the sizes of the custom data sets were varied too.
Compiling the program is relatively straightforward. Just use `gcc` and type the command `gcc main.c -o main`. The resulting executable `main` can then be run. One small but important note for Linux users: the math library is not linked by default, so run `gcc main.c -o main -lm` instead.
Depending on what exactly you want to run, the program accepts a number of different command-line arguments. They are listed below.
- `algos`, the algorithms to run; default is `insertion,selection,heap,merge,smooth,tim`.
- `N`, the number of records to sort; default is `100`.
- `P`, the shuffling amount (explained later); default is `1.0`.
- `cycles`, the number of cycles per run (explained later); default is `1`.
- `runs`, the number of runs per $(N, P)$ (also explained later in this document); default is `1`.
- `out`, the name of the file to write the results to; default is `output.csv`; of course, you may specify directories within the path.
So for example, you could run the executable with the command `main algos=smooth,heap,merge N=500,5000,50000 P=0.005,0.05,0.5,1.0 cycles=5 runs=2`. This performs two runs (of five cycles each) for each pair of $(N, P)$ values.
However, if you want to run the algorithms with the default datasets provided by the starter code, you can also specify the `debug` argument; instead of calling `./main`, you run `./main debug`. The resulting process will execute each of the algorithms ten different times for all of the starter datasets. The debug process also checks the sorted arrays and verifies that they are indeed sorted.
Because of the way the program is structured, it is possible to import `sortingalgorithms.c` into your own main function and test each of the following functions with your own data:

- `Record_insertionSort(Record *records, int n)`
- `Record_selectionSort(Record *records, int n)`
- `Record_heapSort(Record *records, int n)`
- `Record_mergeSort(Record *records, int n)`
- `Record_smoothSort(Record *records, int n)`
- `Record_timSort(Record *records, int n)`
Note how all the functions preserve the same signature to facilitate testing. Functions that use recursion (such as merge sort) have been abstracted into their own files to avoid succumbing to spaghetti code.
IMPORTANT REMINDERS
Please keep the following in mind to avoid any problems running your custom tests:
- When importing `sortingalgorithms.c` and using any of the `Record_...Sort()` functions, DO NOT FORGET TO RUN `Record_initSorters()` at the start of the main function (with no parameters). This function has to be run once before any of the sorting algorithms can work. A minimal usage sketch follows this list.
- DO NOT MOVE any of the files around to prevent breaking any of the imports.
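For reference, here is a minimal usage sketch of the above; the dataset-loading step is omitted because it depends on the starter code, and the file layout is assumed to match the repository's.

```c
// Minimal usage sketch: assumes this file sits next to sortingalgorithms.c,
// which in turn pulls in the starter code's Record definition.
#include "sortingalgorithms.c"

int main(void) {
    // REQUIRED: initialize the sorter structs before calling any sorting function.
    Record_initSorters();

    // Suppose `records` holds `n` records loaded from one of the datasets;
    // the loading code is omitted here since it depends on the starter code.
    Record records[100];
    int n = 100;
    /* ... fill `records` here ... */

    Record_heapSort(records, n);
    return 0;
}
```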
Some might notice that the project contains a few Python files. These can be ignored; they were simply used to automate running the C program, making it possible to perform batch tests without typing the commands one after the other. Another Python script was used to generate some of the visuals in this report (particularly those that graph data; the other illustrations were created in Figma).
Heap sort has been described as "selection sort using the right data structure". While that was not something that made sense to me a year ago, it was something that clicked during this project.
Heap sort treats the array differently: instead of viewing it as a sequential list of elements, heap sort visualizes it in the form of a tree, with the associations between parent and child nodes defined in a simple manner. Every node at index $i$ (0-based) has its children at indices $2i + 1$ and $2i + 2$, and its parent at index $\lfloor (i - 1) / 2 \rfloor$.
But what benefit does a tree have over a sequential view of an array of elements? Because of the structure of the tree (which is actually a max-heap), we are guaranteed that every element is greater than all of its descendants. This invariant allows us to shift an element to its right place within the structure without having to compare it to every single element in the array; the motion associated with performing this action is vertical along the tree, and thus only depends on the height of the tree. This is wonderful because the height of a complete binary tree is always around $\log_2 n$.
As a brief outline of the actual heap sort algorithm (which I believe we must recall for the sake of understanding smooth sort), I present the following pseudocode:
# The first step involves converting the array into a valid max-heap.
# This pass takes at most n log n steps.
for i = (array.length - 1) down to 0:
    # Consider the subtree rooted at i, and sift its root down.
    # Because we start with the leaves of the tree, this guarantees a valid max-heap at the end.
    siftDown(array, i, array.length)

# The second step is the actual sorting.
# We consider slices of the original array of decreasing length.
for i = (array.length - 1) down to 1:
    # Swap the largest element (the root) to the end of the current slice.
    swap(array, 0, i)
    # Sift the new root of the max-heap down within the remaining slice of length i.
    # Since the max-heap was valid before the swap, only the root needs fixing,
    # because the root was the only thing that changed.
    siftDown(array, 0, i)
It is also possible to optimize heap sort by starting the heapifying process at `i = pow(2, (int) log2(array.length)) - 1` (one less than the largest power of two not exceeding the array length, which is the index where the deepest level of the tree begins); this works because the leaf nodes have no children, and it would be pointless to call `siftDown()` on them. Nevertheless, even though the current implementation does without this optimization, heap sort is generally the fastest among the six sorting algorithms that were chosen (at least according to the implementations within this project). For more information about the analyses and results, refer to the section on Comparisons and Individual Analyses.
One important thing to note is that a different optimization was used to speed up heap sort. Originally, I implemented the algorithm exclusively using swaps: when sifting elements down the tree, a swap would occur at every level where it was possible, which meant copying two different records onto two different locations each time. However, after saving the root in a temp variable and only "shifting" child nodes up instead of performing swaps (much like insertion sort shifts adjacent elements rather than swapping all the way to the end), heap sort ran about twice as fast as it did in the initial runs (although sadly I no longer have data for the initial implementation). This makes sense given that the amount of copying was cut roughly in half.
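To make the idea concrete, below is a minimal sketch of what a shift-based `siftDown()` looks like. The single-key `Record` stand-in is purely illustrative; the actual struct from the starter code has more fields, and the real implementation in `sortingalgorithms.c` may differ in its details.

```c
// Illustrative stand-in for the starter code's struct; only the key matters here.
typedef struct { int key; } Record;

// Sift the element at `root` down within the first `n` elements of the heap,
// shifting children up into the hole instead of swapping at every level.
static void siftDownShift(Record *records, int root, int n) {
    Record temp = records[root];   // save the element being sifted exactly once
    int hole = root;

    while (2 * hole + 1 < n) {
        int child = 2 * hole + 1;                  // left child
        if (child + 1 < n && records[child + 1].key > records[child].key)
            child++;                               // pick the larger child

        if (records[child].key <= temp.key)
            break;                                 // the saved element fits here

        records[hole] = records[child];            // shift the child up: one copy, not a swap
        hole = child;
    }

    records[hole] = temp;                          // drop the saved element into place
}
```

Each level of the descent now costs a single record copy instead of a full swap, which is where the roughly twofold speedup comes from.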
Anyway, on to the fun part...
The reason I decided to explain heap sort prior to smooth sort is because the two algorithms rely on the same fundamental idea: visualizing an array in a manner that differs from its linear structure. However, smooth sort attempts to remedy a certain problem with heap sort: the largest element of the array is always at the root (the beginning) of the array, when ultimately it must end up at the opposite end. This means that regardless of the initial state of our array, heap sort has to haul every element from the front all the way to the back, performing on the order of $n \log n$ operations even when the input is already sorted.
Smooth sort, on the other hand, takes an unorthodox approach. For one, it doesn't create a single tree but rather a forest of max-heaps. For another, it builds these trees such that their root nodes are on the right. This entails far fewer computations for lists that are close to being sorted. It also means that smooth sort, in the best case, runs in $O(n)$ time.
But how does it work, exactly?
Before we discuss the underlying structure of the max-heaps created by smooth sort, I wish to bring forth the idea of the Leonardo numbers. The Leonardo numbers are just an integer sequence defined by the recurrence:

$$L(0) = 1, \quad L(1) = 1, \quad L(n) = L(n - 1) + L(n - 2) + 1$$
To the acute reader, this might seem rather similar to the Fibonacci sequence, aside from the extra $+1$ added at each step.
Now consider for a moment trees with exactly $L(k)$ nodes, built in a particular recursive way: a tree of order $k$ consists of a root whose left subtree is a tree of order $k - 1$ and whose right subtree is a tree of order $k - 2$, while trees of orders 0 and 1 are single nodes. These are what we call Leonardo trees.
Smooth sort uses Leonardo trees to visualize the array.
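As a quick aside, the Leonardo numbers themselves are cheap to precompute up front; a small sketch follows (the array size and the use of `long long` are just illustrative choices, not necessarily the project's).

```c
// Precompute the Leonardo numbers: L(0) = L(1) = 1, L(k) = L(k-1) + L(k-2) + 1.
// L(43) is the last term that fits in a signed 32-bit int, so ~46 terms is plenty.
#define MAX_LEONARDO 46

static long long LEONARDO[MAX_LEONARDO];

static void initLeonardo(void) {
    LEONARDO[0] = 1;
    LEONARDO[1] = 1;
    for (int k = 2; k < MAX_LEONARDO; k++)
        LEONARDO[k] = LEONARDO[k - 1] + LEONARDO[k - 2] + 1;
}
```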
Generating the forest of Leonardo trees for smooth sort is more straightforward than it sounds. The illustration below will likely do better to explain the process, but I will try to outline it regardless. The process relies on the fact that any number can be expressed as the sum of a subset of the Leonardo numbers. This fact can be proven through mathematical induction, though again, we refuse to wander beyond the scope of this report.
We begin at the left of the array and proceed rightwards until the last element is reached. During each iteration, the current element is added as the root of the rightmost tree in the forest, and the corresponding tree is heapified to ensure it is a valid max-heap. We can see the progression of the process below, where the newly added node is highlighted for each iteration. Whenever this newly added node has child nodes, we call a `siftDown()` function to maintain the max-heap property of the tree (note that like heap sort, the `siftDown()` we use for smooth sort doesn't rely on swaps but on shifts).
The exact rules for deciding whether to append the next element as a new root node or as a singleton are simple: if the two rightmost Leonardo trees in the forest have adjacent orders, then we merge those two (alongside the new element as their root) to create a Leonardo tree of the next order. Otherwise, we add a singleton node. The actual code uses the bits of an `unsigned int` to keep track of the orders of the Leonardo trees currently in the forest: bit $k$ is set whenever a tree of order $k$ is present (for larger inputs, the same trick works with an `unsigned long`).
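Below is a rough sketch of what that bookkeeping might look like; the variable names and the exact branch structure are illustrative and not a copy of the project's code.

```c
// `mask` has bit k set when a Leonardo tree of order k is present in the forest;
// `lowest` holds the order of the rightmost (smallest) tree, or -1 if the forest is empty.
static unsigned int mask = 0;
static int lowest = -1;

// Update the bookkeeping when one new element is appended to the array.
static void pushNode(void) {
    if (lowest >= 0 && ((mask >> (lowest + 1)) & 1u)) {
        // The two rightmost trees have adjacent orders: merge them (with the new
        // element as their root) into a single tree of the next order.
        mask &= ~(3u << lowest);      // drop the two merged trees
        lowest += 2;
        mask |= 1u << lowest;         // record the merged tree
    } else if (lowest == 1) {
        // The rightmost tree has order 1, so the new element becomes an order-0 singleton.
        lowest = 0;
        mask |= 1u;
    } else {
        // Otherwise the new element becomes an order-1 singleton.
        lowest = 1;
        mask |= 1u << 1;
    }
    // ...the new root is then sifted down to restore the max-heap property.
}
```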
Now I won't bother going in-depth with tim sort; it's not really the main algorithm I chose anyway. Nevertheless, I feel like it deserves a special mention. The original publication outlining tim sort actually takes inspiration from another academic paper which led me down a rabbit hole of information theory. This eventually helped me realize my ideas on how to benchmark the sorting algorithms.
Some caveats with the implementation of tim sort used by this project: it's not adaptive, and it's not completely faithful to the original. For one, the run size doesn't change based on the number of records, even though in practice it should adapt to minimize the number of merges performed by the algorithm. Aside from this, the algorithm also performs merges only after defining all the runs, as opposed to performing them as soon as the merge criteria are satisfied. Despite these simplifications, I still refer to this version of the algorithm as tim sort, since a lot of the implementations found online follow this pattern (despite their apparent irreverence toward the original).
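To illustrate the simplified scheme described above, here is a sketch over plain integers with a fixed run size of 32; the run size, the int keys, and the function names are illustrative, and the project's actual implementation sorts `Record`s and may differ in its details.

```c
#include <stdlib.h>
#include <string.h>

#define RUN 32   // fixed run size, purely illustrative

// Sort a[lo..hi] in place with insertion sort.
static void insertionSortRange(int *a, int lo, int hi) {
    for (int i = lo + 1; i <= hi; i++) {
        int key = a[i], j = i - 1;
        while (j >= lo && a[j] > key) { a[j + 1] = a[j]; j--; }
        a[j + 1] = key;
    }
}

// Merge the sorted ranges a[lo..mid] and a[mid+1..hi].
static void mergeRanges(int *a, int lo, int mid, int hi) {
    int n = hi - lo + 1;
    int *tmp = malloc(n * sizeof(int));
    int i = lo, j = mid + 1, k = 0;
    while (i <= mid && j <= hi) tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i <= mid) tmp[k++] = a[i++];
    while (j <= hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp, n * sizeof(int));
    free(tmp);
}

void simplifiedTimSort(int *a, int n) {
    // Step 1: sort each fixed-size run in place.
    for (int lo = 0; lo < n; lo += RUN) {
        int hi = (lo + RUN - 1 < n - 1) ? lo + RUN - 1 : n - 1;
        insertionSortRange(a, lo, hi);
    }
    // Step 2: merge runs of doubling width until the whole array is one run.
    for (int width = RUN; width < n; width *= 2) {
        for (int lo = 0; lo + width < n; lo += 2 * width) {
            int mid = lo + width - 1;
            int hi  = (lo + 2 * width - 1 < n - 1) ? lo + 2 * width - 1 : n - 1;
            mergeRanges(a, lo, mid, hi);
        }
    }
}
```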
There is a well-known shuffling algorithm that generates a permutation of the elements of a sequence in a reliably-random way. By this, we mean to say that every possible permutation is equally likely, and the shuffling process does not favor any single one. This is called the Fisher-Yates algorithm.
It's actually simpler than it sounds: you traverse the array starting from the last element. At every element with index $k$, you pick a random index $k'$ between 0 and $k$ (inclusive) and swap the two elements:
for k = (array.length - 1) to k = 0:
k' = random number from 0 to k
swap array[k] with array[k']
Despite its simplicity, it reliably selects a permutation of the original array in a uniform manner (assuming the random function used to choose a number between 0 and each $k$ is itself uniform).
To perform shuffling for this project, a slight variation of the original algorithm is used. Whereas the original Fisher-Yates always executes a swap for each element, the variation used here only performs swaps probabilistically: a swap happens with probability $P$, and the element is left alone otherwise.
for k = (array.length - 1) to k = 0:
k' = random number from 0 to k
t = random float from 0 to 1
if t <= P and t > 0:
swap array[k] with array[k']
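A concrete sketch of this variation over plain integers follows; `rand()` is used purely for illustration, and the project may seed and draw its randomness differently. With `P = 1.0` it degenerates to the classic, fully uniform Fisher-Yates shuffle.

```c
#include <stdlib.h>

// Probabilistic variant of Fisher-Yates used to generate test data.
void partialShuffle(int *array, int n, double P) {
    for (int k = n - 1; k > 0; k--) {
        int kPrime = rand() % (k + 1);             // random index in [0, k]
        double t = (double) rand() / RAND_MAX;     // random float in [0, 1]
        if (t <= P && t > 0) {                     // perform the swap only with probability ~P
            int tmp = array[k];
            array[k] = array[kPrime];
            array[kPrime] = tmp;
        }
    }
}
```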
In the implementation, things are notated a bit differently and we have
When performing a shuffle on data, it's helpful to know just how much of a shuffle we were able to do. To help us measure this idea of "shuffling", we come up with two metrics, the first of which is entropy.
Entropy is often associated with the idea of disorder. Fundamentally, the concept of "disorder" is not too far from the concept of "messing up" a deck of cards, although context necessitates that we refer to the second case as information entropy. Interestingly enough, both information theory and statistical mechanics have characterized their notions of entropy in much the same way. We focus on the formulation Claude Shannon provided for information theory (who impressively came up with the form independently of statistical mechanics):

$$H(X) = -\sum_{x} p(x) \log_2 p(x)$$

This gives the entropy of a discrete random variable $X$, where $p(x)$ is the probability of the outcome $x$.
When the outcomes are all equally likely, the entropy is at its maximum (every outcome is equally surprising); when a single outcome is certain, the entropy drops to zero.
So how can we use this to analyze the "shuffledness" of our arrays? Entropy manifests in our sequences of records when it becomes hard to expect which record comes after a given record. In a sorted list (or a list with some structure to it), this is easy because the records form a sort of pattern. Thus, when traversing the array, at any given point, it is reasonable to have an expectation for what the next record might be (it won't be anything surprising). However, when an array is shuffled really well, it becomes really hard to tell what record might come next, and so things are "uncertain" and "surprising".
To approach this more rigorously, we define a random variable $X$ over the differences between adjacent records in the array: at each position, we take the difference between the next record and the current one, and treat the collection of these differences as samples of $X$. The entropy of this empirical distribution then measures how unpredictable the "next record" is at any point during a traversal.
Do note that in order for this approach to work, we adjust negative differences to fall within the same non-negative range as the positive ones, so that each difference maps to a single well-defined outcome.
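A sketch of this measurement over plain integer keys is shown below; it assumes the keys lie in $[0, n)$ so that wrapped differences land in the same range, and the helper name is purely illustrative.

```c
#include <math.h>
#include <stdlib.h>

// Shannon entropy of the distribution of (wrapped) differences between adjacent keys.
// A sorted array yields a single repeated difference (entropy 0); a well-shuffled
// array spreads the differences out and drives the entropy up.
double shuffleEntropy(const int *keys, int n) {
    if (n < 2) return 0.0;

    // Count how often each difference occurs, wrapping negatives into [0, n).
    int *counts = calloc(n, sizeof *counts);
    for (int i = 0; i + 1 < n; i++) {
        int diff = ((keys[i + 1] - keys[i]) % n + n) % n;
        counts[diff]++;
    }

    // H(X) = -sum over observed outcomes of p * log2(p).
    double H = 0.0;
    for (int d = 0; d < n; d++) {
        if (counts[d] == 0) continue;
        double p = (double) counts[d] / (n - 1);
        H -= p * log2(p);
    }

    free(counts);
    return H;
}
```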
Another useful idea to help us assess the "shuffledness" of an array is correlation. Using the same view of the array as a sequence of values, we can compute the correlation coefficient between each record's position and its key: a coefficient close to $+1$ means the array is nearly in the correct order, a coefficient close to $-1$ means it is nearly reversed, and a coefficient near $0$ means there is little linear relationship between position and key.
However, when measuring shuffledness alone, the correlation coefficient is a bit superfluous: we don't need to know whether the array is biased toward the correct order or the reverse order, since either way it wouldn't be considered "well-shuffled" (it might not be sorted, but it isn't that jumbled up). To get a sense of shuffledness alone, we can square the value of the correlation coefficient. This particular number is used a lot in inferential statistics and has a special name: the coefficient of determination. Although this number has a few different interpretations, we ignore most of these as they do not apply to our framework, and simply use it as a measure of the disorder within our datasets.
Nevertheless, both the coefficient of correlation and determination will be used to assess the performances of the different algorithms. The latter will supplement our measurements of entropy, while the former will give us insights into how the algorithms react to data that has a bias for being sorted in the correct order or the reverse order.
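For completeness, here is a sketch of the correlation measurement over plain integer keys; this is just the standard Pearson coefficient between positions and keys, and squaring the result gives the coefficient of determination. The function name and int keys are illustrative.

```c
#include <math.h>

// Pearson correlation between each element's position and its key.
// Close to +1: nearly sorted; close to -1: nearly reversed; near 0: no linear bias.
double shuffleCorrelation(const int *keys, int n) {
    double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0, sumYY = 0;

    for (int i = 0; i < n; i++) {
        double x = (double) i;          // position in the array
        double y = (double) keys[i];    // key at that position
        sumX += x;   sumY += y;
        sumXY += x * y;
        sumXX += x * x;
        sumYY += y * y;
    }

    double cov  = n * sumXY - sumX * sumY;
    double varX = n * sumXX - sumX * sumX;
    double varY = n * sumYY - sumY * sumY;
    if (varX <= 0 || varY <= 0) return 0.0;   // degenerate case: constant sequence

    return cov / sqrt(varX * varY);
}
```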
This section discusses the two different methods used to compare and analyze the different algorithms. The first uses the provided datasets for the project; there are seven of these, and all the algorithms were run on each of them. The second involves a testing framework specifically coded for this project. Do note, however, that flexibility was considered in designing this system, and the framework may be used to benchmark other sorting algorithms (even ones that don't use the `Record` data structure) so long as they follow the interfaces required by the framework.
This test was relatively straightforward. To ensure the reliability of the measured durations, each algorithm was run ten times on each of the provided datasets. The results were then recorded to a text file (by piping the executable's text output) and are summarized by the table below. Note that all the values depicted here represent the average duration taken by the algorithm across the ten attempts at sorting each dataset (and thus carry an extra significant figure).
| Dataset / Algorithm | Insertion Sort | Selection Sort | Heap Sort | Merge Sort | Smooth Sort | Tim Sort |
| --- | --- | --- | --- | --- | --- | --- |
| `random100.txt` | | | | | | |
| `random25000.txt` | | | | | | |
| `random50000.txt` | | | | | | |
| `random75000.txt` | | | | | | |
| `random100000.txt` | | | | | | |
| `almostsorted.txt` | | | | | | |
| `totallyreversed.txt` | | | | | | |
As expected, both totallyreversed.txt
is likely due to the increased number of swaps performed by selection sort.
For the
Initially the testing framework only checked "sortedness" by checking the order of the elements in the array. If the array had its elements in a nondecreasing (or nonincreasing) arrangement, then the function would say the array was sorted. However, after debugging the more complicated algorithms, I encountered problems that made me realize this check was insufficient.
As mentioned a number of times above, a testing framework was also constructed to aid in the comparison and analysis of the different algorithms. The framework executes a number of different runs, each of which performs a set number of cycles. In this context, a run refers to a fresh shuffling of the records for a given $(N, P)$ pair, while a cycle refers to one timed sort of that shuffle; the timings of the cycles within a run are averaged to smooth out noise.
# This performs a test for a given N and a given P
set value of N
set value of P

# Testing algorithm for a given N and P
for i in number of runs:
    shuffle = new shuffle of records according to N and P
    times = 2d array to store execution times, init to all 0s

    # Perform the required number of cycles
    for j in number of cycles:
        # Do the sort for each algorithm
        for algo in algorithms:
            tosort = copy order of shuffled records
            start = get start time
            algo.sort(tosort)
            end = get end time
            times[algo][i] += end - start

    # Get the average of all cycles for this run
    for algo in algorithms:
        times[algo][i] /= number of cycles

    # Save data somewhere else before the next run
    save times for this run
Note that when we "save data somewhere else", we're saving it alongside the values of $N$ and $P$ used for that run. For the actual tests, cycles were specified with `cycles=5`, while runs used `runs=5`.
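For reference, a single (algorithm, cycle) measurement like the one in the pseudocode might be taken as follows; the function-pointer interface is illustrative and stands in for the sorter structs the framework actually uses, and `clock()` is assumed to have enough resolution for these dataset sizes.

```c
#include <time.h>

typedef struct Record Record;   // defined by the starter code; opaque here

// Time one sort of an already-shuffled copy of the records, in seconds.
double timeSort(void (*sortFn)(Record *, int), Record *tosort, int n) {
    clock_t start = clock();
    sortFn(tosort, n);
    clock_t end = clock();
    return (double) (end - start) / CLOCKS_PER_SEC;
}
```

A call would then look something like `times[algo][run] += timeSort(Record_heapSort, tosort, N);`, with the average taken after all cycles finish.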
Here are a few reflections I've had the privilege of noting down following the completion of the project.
This is probably one of the funniest bugs I've ever encountered... it's even better because it didn't come from a problem with my code.
When I was implementing smooth sort, I found myself needing a way to compute the logarithm of a number in base 2. This was necessary in order to read off the bits contained in a standard `unsigned int` (or rather, their indices) without the need for a loop. Initially, the solution I came up with used the `log()` function of the C library, which takes logarithms in base $e$. Not knowing that `log2()` existed, I computed my logarithms by doing the usual base switch:

$$\log_2 x = \frac{\ln x}{\ln 2}$$
This was working all fine and well (which I guess one should expect since the formula is mathematically correct) until at some point I decided to run my program on Windows. Up until then I'd been working within a Linux environment, and I wasn't expecting to have to deal with platform differences because I wasn't touching anything that low-level. But for some reason, despite the same code running on both my devices, smooth sort failed to sort my arrays on Windows when it successfully did on Linux. Although I now know that the reason for this was the logarithm formula I used above, it took me an hour or two to debug the whole mess and isolate the problem. It's honestly a miracle I found the problem that fast, because my smooth sort implementation had so many components I could've just as well spent an entire day figuring that one out.
It turns out that doing `(int) (log(8) / log(2))` on Windows gave a result one less than expected: the floating-point quotient landed just shy of 3, and the cast to `int` truncated it down. Funnily enough, `log(x) / log(2)` would work fine for just about any other value of `x`; the cleaner fix was simply to use `log2()`. If only I'd known about the function at the beginning, the problem would've never occurred... but is it my fault? Perhaps, perhaps not. It's either bad documentation on the part of the C library or bad documentation comprehension on my part.
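For the curious, here is a tiny demonstration of why the base-switch version is fragile once the result is truncated; what the first expression actually yields is platform- and libm-dependent, which is exactly the point.

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    // Mathematically both expressions equal 3, but the quotient may evaluate to
    // something like 2.999... in floating point, and the cast truncates it down.
    int viaSwitch = (int) (log(8.0) / log(2.0));   // 2 on some platforms, 3 on others
    int viaLog2   = (int) log2(8.0);               // exact for powers of two
    printf("log(8)/log(2) cast to int: %d\n", viaSwitch);
    printf("log2(8) cast to int:       %d\n", viaLog2);
    return 0;
}
```

(Remember to link with `-lm` on Linux, as noted in the compilation section.)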
Initially, heap sort actually wasn't the fastest algorithm; there was no clear winner, but tim sort and merge sort seemed to dominate for the most part. That was until I realized all the algorithms that used swaps could be optimized. If I borrowed the idea from insertion sort and stored the current element in a temp variable, I could theoretically cut the execution times of those algorithms roughly in half (a shift copies half as much data as a swap). This actually ended up being true and sped up heap sort so much that it overtook every other algorithm after that. Do note that I applied the same optimization to smooth sort, but it was already kind of slow to begin with, so even though it ran twice as fast, it was still slower than its $O(n \log n)$ peers.
Towards the latter half of the testing phase, I realized how much more valuable it would be to measure the coefficient of correlation and not just the coefficient of determination. While the initial idea was to use the latter metric because I thought measuring 'shuffledness' was sufficient, I later realized that knowing the 'bias' of a dataset (whether it tends to be in the right order or in reverse) would be just as insightful, since some of the algorithms obviously perform differently based on this. In this case, correlation would definitely give us more insights to work with.
Due to the delay of this realization, I had to restart all the tests I had done at that point. All in all, counting mishaps and dry-runs, I probably left my laptop on for a total of at least 48 hours running tests on the algorithms. In case you're curious, my humble potato has an Intel i3-6006u
processor, so it was definitely up for the task.
I'd spent about ten days on the entire project when suddenly, the day before the submission, I reread the specifications and realized I needed to do the frequency counts for each algorithm (I had no idea that was required). I was about to lose it when I remembered I'd created structs for each of the sorting algorithms, so storing extra state for each of them (such as a frequency count) would be no biggie. The function signatures would remain the same, and only an extra struct member would need to be created. While it did take about half an hour to set up (and I had to redo all the tests again), the fact that the change itself was that small was a relief. I felt kinda happy realizing the way I code these days gives my programs some leeway for flexibility.
|\ _,,,---,,_
ZZZzz /,`.-'`' -. ;-;;,_
|,4- ) )-,_. ,\ ( `'-'
'---''(_/--' `-'\_)