Let’s talk about binary trees and search
Binary search is a pretty straight forward search technique - at least if we’re looking for some good performance. It’s pretty easy to explain binary search because the algorithm itself is pretty intuitive. Let’s dive right into it. Let’s take the classic and intuitive example; the phonebook. Assume that we don’t have the Internet and we’re actually still using one of these. ;) So if we wanted to find **Smith** how would we go on doing that? Well, of course.. you could flip one page at a time until you get there, although this method would probably take some time considering how large phonebooks can actually be. Or, we could simply use the fact that the phonebook is sorted alphabetically - so we flip up to the middle page instead and check from there. If the page we turned on list names with **M** then we know that **Smith** must be listed in the other right part, so we flip the middle page in the right part, and continue doing this until we find Smith. This is a lot lot more efficient than checking every page. Sorry I had to tell you that, I know you're not stupid. :) Binary search is built actually on this logic. Let's say, for simplicity, that we have a sorted list of integers $$ A[1, 14, 18, 27, 33, 35, 37, 44, 58, 81, 103] $$ and we want to see if $ K = 14 $ is in that list. Then we start by selecting the middle element, in this case index $ i = 5 $, which has the value $ 35 $. We compare $ 14 $ with $ 35 $ and see that $ 14 < 35 $, thus if $ 14 $ is in the array then it must be in the left part. We repeat this process by again selecting the middle element and so on, until our comparison $ K = A[i] $ is true, or if there are simply no elements left to search. A **bold** element indicates the element we're comparing with. Let's have two variables $ l $ and $ r $ to keep track of the start and end index of respective subarray, then to figure out the middle index of the subarray we simply just compute the average of $ l $ and $ r $, i.e. $ m = \frac{l + r }{2} $, where $ m $ is this midpoiint. Remember we wanted to search for $ 14 $ in the list. In first iteration, we have $ l = 0 $ and $ r = n - 1 = 10 $, where $ n $ is the length of input array. Then $ m = \frac{0 + 10}{2} = 5 $. A[1, 14, 18, 27, 33, **35**, 37, 44, 58, 81, 103] $ l = 0 $ and $ r = m - 1 = 4 $. Then $ m = \frac{0 + 4}{2} = 2 $. Go into the left subarray $ A[0..4] $. A[1, 14, **18**, 27, 33, 35, 37, 44, 58, 81, 103] $ l = 0 $ and $ r = m - 1 = 1 $. Then $ m = \frac{0 + 1}{2} = 0 $ (integer division). Go into the left subarray $ A[0..1] $ A[**1**, 14, 18, 27, 33, 35, 37, 44, 58, 81, 103] In this case, $ 1 $ is smaller than $ 14 $, so we'll have the following values: $ l = m + 1 = 1 $, $ r = 1 $ and $ m = \frac{1 + 1}{2} = 1 $. A[1, **14**, 18, 27, 33, 35, 37, 44, 58, 81, 103] And we're done! ## Implementation in Python ## This can easily be implemented using recursion like this
def binarysearch(arr, left, right, k): mid = (left + right) / 2 ## simplest case, we found the element immediately if arr[mid] == k: return True ## if this is true, we're done searching. Not found if l == r: return False if arr[mid] > k: return binarysearch(arr, left, mid - 1, k) else: return binarysearch(a, mid + 1, right, k)
The iterative version is pretty similar
def binarysearch(arr, left, right, k): while left <= right: mid = (left + right) / 2 if arr[mid] == k: return True if arr[mid] > k: right = mid - 1 else: left = mid + 1 // Not found return False
But how well does binary search perform? BS is a decrease-and-conquer algorithm, since during each iteration it decreases its problem instance by a constant factor, namely $ 2 $.
So in worst-case, the total amount of times we'll actually have to compare is how many times we can divide the input size $ n $ in halves + 1, i.e. $ \lfloor log_2\, n + 1 \rfloor $. We add one because we're doing a comparison initially before we divide. The $ \lfloor \, \rfloor $ notation is the floor function (it rounds down its argument). Why is $ log_2 \, n $ how many times we can divide $ n $ by $ 2 $? Read my post about it (it's pretty short).
$$ C_{worst}(n) = \lfloor log_2\, n + 1 \rfloor $$
meaning $ C_{worst}(n) \in \Theta(log \, n) $.
On average, BS will run near its worst case, so $ C_{average}(n) \in \Theta(log \, n) $.
However, even though binary search is pretty fast, we can do even better. For instance, we'll look at hash tables and maps in an upcoming post. But all have its pros and cons.
Let's look at binary search tree while we're at binary search. This is an interesting data structure that's used for searching - where we apply this search technique.
So what's the difference between a binary tree and a binary search tree? The simple answer is that, a binary tree just consists of two child nodes. So every tree where each node has (at most) two child nodes is called a binary tree. A binary search tree (BST), however, you could say is a special binary tree used for searching. The BST has some special properties; the left child node should be smaller than its parent node and its right child node should be greater.
For instance, this is a binary tree:
this is a binary search tree
See that the left child node is smaller, and the right is greater. This property must at all times be satisfied in order to be a BST.
Binary and trees in general
So a tree consists of nodes. These are basically just "places" where we put our data. For instance, in the trees above: 1, 2, 3 are all nodes.
In computer science we write our trees upside down (so relate to trees in nature, but think of a "upside-down" tree). This means that the root of the tree, is the start node. In the BST example, 2 is the root node.
Each node can have children, e.g. child nodes, but like I said - binary trees can only have up to two children, but a node isn't required to have children (in programming we usually refer to an empty child node as null).
Since we're working with an "upside-down tree" and if the root is at the top, then the leaves must be at the bottom. So in the BST example above, 1 and 3 are also considered leaf nodes since they are the bottom.
The height of a binary tree is simply how many levels there are in the tree, in both examples above there's one level since we never include the root node while counting the levels, so for instance a tree with only a root node has a height of $ 0 $.
Intuitively, the levels can be seen as how many times we can divide something in halves, however, there's a problem with binary trees (which we will look at soon) and that is, they can be very unbalanced. This means that the height of a binary tree can actually be proportional to the input size $ n - 1 $ (-1 since we don't count the root node). However, the minimum number of levels a BST can have is $ log_2 \, n $ if it's full. This means that, the height $ h $ of a given BST with size $ n $ satisfies the following inequality
$$ log_2 \, n \leq h \leq n - 1 $$
So I said binary search trees can end up really bad. Imagine the scenario if we keep inserting elements in a BST in increasing order, i.e. every element is greater than its precedence, the height of the tree will just increase for every insertion. For an example, we want to insert $ [3, 5, 7, 9, 15] $, starting from left.
So we start with $ 3 $. This will be the root node
Next to insert is $ 5 $. $ 5 $ is greater than $ 3 $, so it should be inserted at the right side.
Now we want to insert $ 7 $. Again, $ 7 > 5 $ so we do the same
Next one is $ 9 $ and again, $ 9 > 7 $ so
Last one is $ 15 $ and, $ 15 > 9 $ so
Conclusions? A pretty skewed tree we ended up with. If we count the levels (not including the root) then we see it's $ 4 $. That's why the height of a BST can at most be $ n - 1 $. This is an issue that also affects the time complexity. Both searching, inserting and deleting is at worst $ \Theta(n) $ if the tree is skewed, for an example, like this one.
It's pretty easy to come up with an algorithm for insertion in a BST, the deletion is a bit harder. A BST is usually constructed using a node with pointers to its child nodes. For searching, the recursive algorithm is simple (as you've seen before):
Let $ N $ be a node in a BST with pointers to its left child node $ N.left $ and right $ N.right $. $ N.value $ is the "compare" element in that node.
if $ N.value = v $ or $ N.value = null $, return $ N $.
If $ v > N.value $, search the right subtree recursively
else, search the left subtree recursively
where $ v $ is the value you want to search for in tree.
With insertion, again using the same variables but as an extra parameter, we want to accept some value $ v $ to be inserted as well.
if $ N = null $ then N = new Node(v) $
else if $ v > N.value $, insert at right subtree recursively
otherwise, insert at the right subtree recursively
Not harder than that! I will leave deletion as an exercise. The scenarios that you need to consider are
If deleting a node with no children, remove the node
If deleting a node with one child, remove the node and replace with its child
The little "tricky" part is what happens when you have two child nodes. That's the problem to solve!