Finding the kth smallest element in a list efficiently
Let’s talk about finding the kth smallest element in a list. We’ll first look at some trivial solutions and then check if there might be any better. The absolute smallest element in a list is easy (and we will never do better than $ \Theta(n) $ time for that). Let’s quickly see if we can design a naive **recursive** algorithm for that. This function will accept a list of (in this case) _integers_ and will return the smallest element. The base case is simply if the list has zero or one element. So we get something like
function smallestElem(A[0..n-1]): if length(a) is zero: return [] if length(a) is one: return A[0] end
Now look at the recursive case. If the last element, $ A[n-1] $ is smaller than smallestElem(A[0..n-2]) then we just return that, or we'll return the smallest element of the subarray $ A[0..n-2] $. We get
function smallestElem(A[0..n-1]): if length(a) is zero: return A if length(a) is one: return A[0] // Smallest element of subarray A[0..n-2] smallest = smallestElem(A[0..n-2]) if A[n-1] ≤ smallest: return A[n-1] else: return smallest
In Python, we can easily translate this into
def smallestElem(A[0..n-1]): if len(a) == 0: return [] if len(a) == 1: return a[0] // Smallest element of subarray A[0..n-2] smallest = smallestElem(a[:-1]) if A[-1] <= smallest: return a[-1] else: return smallest
Finding the kth smallest element
So what about the $ k$th smallest element? A little trickier. This means that, if $ k = 2 $ then we want the second smallest element in list. This is a very simple solution (and quite readable) but not particularly efficient:
def kSmallestElem(array, k): stack = [] while k > 0: currSmallest = float("inf") for j in range(0, len(array)): if array[j] < currSmallest and array[j] not in stack: currSmallest = array[j] k -= 1 return stack.pop()
Obviously not very efficient. However, we can also sort the array before (transform-and-conquer) which lets us do this in $ \Theta(n \, log \,n) $ time on average:
def kSmallestElem(array, k): array.sort() return array[k-1]
However, this doesn't take into account if there are duplicates (but it can easily be worked around). But can we do even better? Yes, absolutely, at least we can improve the computation time by choosing a different strategy. If you have been introduced to the sorting algorithm Quicksort before, then this will feel familiar.
Like Quicksort, Quickselect can be understood in two separate parts. For quicksort, the divide-and-conquer strategy itself and the partition process (which there are several algorithms for), see my post about subarray partition if you need to refresh. In Quickselct, it's variable-decrease (decrease-and-conquer).
In pseudocode, the algorithm looks like this
function quickselect(A[l..r], k): s = lomutoPartition(A, l, r) if s = k - 1: return A[s] else if s > l + k - 1: quickselect(A[l..s-1], k) else: quickselect(A[s+1..r], k)
$ s $ is the split position returned by lomutoPartition, $ l $ is the start of the subarray and $ r $ is the end.
The idea is to use subarray partition (here we use Lomuto's partition); we basically pick an element as a pivot and then we partition the subarray such that all elements less than the pivot are to the left of it, and the greater ones are to the right. The pivot element will then be put at its right place, and we just return the index of the pivot's new position. But why do we do all this? Well, if our split position $ s $ is greater than $ l + k - 1 $, meaning the start index plus the $ k $ value, then we know that the $ k $th smallest element must be in the left subarray (obviously). Note, we have to do $ k - 1 $ because arrays begin at index zero.
The same goes with if $ s $ would be smaller, then the $ k $th smallest element must seemingly be in the right subarray.
So we first check the most simplest case (a.k.a. base case), which is if we get an array and the split position is the $ k $th smallest already, then we just return it. Or we return quickselect recursively depending on the logic whether the $ k $th smallest element must be in the left or right subarray (of the split position $ s $).
Implementation in Python. Neat, don't you think? ;)
## test print quickselect([7, 3, 2, 9, 12, 5], 6) ## wrapper def quickselect(a, k): return quickselect2(a, 0, len(a)-1, k) def quickselect2(a, left, right, k): splitPos = lomuto_partition(a, left, right) if splitPos == k - 1: return a[splitPos] if splitPos > left + k - 1: return quickselect2(a, left, splitPos -1, k) else: return quickselect2(a, splitPos + 1, right, k) def lomuto_partition(a, l, r): pivot = a[r] i = l - 1 for j in range(l, r): if a[j] <= pivot: i += 1 swap(a, i, j) swap(a, i + 1, r) return i + 1
Quickselect - Time complexity
So how well does Quickselect perform?
Remember that the time complexity for partition will always be $ \Theta(n) $, more specifically $ n - 1 $ key comparisons as we have to scan the whole array.
The base case for quickselect is easy. If $ s = k - 1 $ immediately after the first partition, then $ C_{best}(n) = n - 1 \in \Theta(n) $, i.e. linear time.
However, the worst case scenario is interesting. Since in Lomuto's algorithm we're (in this implementation) choosing the last element as pivot, things might not be as good as expected. Assume that we get the following input array
$$ [1, 2, 3, 4, 5, 6, 7] $$
and we have $ k = 1 $ (for purely demonstration).
Then Lomoto will go through the array and say, hey the split position $ s = 6 $. We will then say ok, let's recurse on the left subarray $ A[0..5] $. This time, Lomuto will yield $ s = 5 $, and next time $ s = 4 $ and so forth. This leads to the computation sum
$$ c_{worst}(n) = (n - 1) + (n - 2) + (n - 3) + ... + 1 $$
This is a known sum and occur in so many different areas. We know from math that
$$ c_{worst}(n) = (n - 1) + (n - 2) + (n - 3) + ... + 1 = \frac{n(n-1)}{2} $$
Thus, $ c_{worst}(n) \in \Theta(n^2) $.
Not good at all, pre-sorting the array and choosing the $ k $th largest element is more efficient in worst case. However, it's been shown that quickselect runs in $ \Theta(n) $ in time on average, and can be improved by using a better strategy for choosing the pivot (instead of first or last element). So quckselect is, on average, faster than pre-sorting.