Type "the" into Google. Google searches 9 billion pages in .29 seconds. How does it do that? Try "chocolate". 197 million pages in .48 seconds. (No wonder we hate when the microwave takes 2 mins to boil water!)
When we search Google, we are searching an index of the data, not the data itself. We need a data structure that can store billions of keys and return a value in response to a query almost instantaneously. The answer: B-Trees.
B-Trees are a self-balancing tree where each node can have many children. If a node stores up to M keys, then it has M + 1 children. Suppose a tree stores 60 billion keys with nodes that store up to 1000 keys. To find a given key, the number of nodes searched is <= 4. WAT?
The reason is that the tree is very flat. The number of keys in a node ranges from M/2 to M (except for the root which has 2 keys). In 1979 (hey, B-Trees are not new), a theorist named Yao proved that, for random keys, the amount of empty space is about 44%. In the worse case, the height of the tree is (log base M/2) N. So, for M = 1000 and N = 62 billion, the height is 4.
But how can I search a B-tree with 62 billion keys? I can't fit it in memory. All the programs I write use some data structure that fits into memory and I don't even think about it.
One answer is virtual memory. If your machine has a MMU (Memory Management Unit) then you as a programmer can ignore details about what is in memory and what is on external storage. The MMU maps physical hardware addresses, wherever they are, to addresses that appear to be contiguous, in-memory addresses.
A more basic concept is that the program only needs to bring into memory the chunk of data that it needs to access. It only needs to access the nodes that it hits along the search path of the B-tree. So if the search, starting at the root, takes the left link, then the middle link, then the right link, before it hits a leaf node, then the program only needs to read in the those 3 nodes plus the leaf node. The program then searches for the value associated with the query key in the leaf node.
To explore B-trees and build one of my own, I decided to index the web pages that I can reach from www.princeton.edu - and to explicitly read and write each of the nodes to/from disk. Each node stores 1000 keys. A key is a word on a web page. The value associated with the key is the set of urls on which the word is found. I limit the number of nodes in the tree to 200. (My goal is to get up to 1000 but first I have to deal with broken links and connection errors.)
As I type this, I can see the program in action. The tree starts as 1 node: the root. When it fills up to 1000, it splits into 2 nodes of 500 keys each (left and right). A new root is created that stores 2 keys: 1 key links to the left node and 1 to the right node. Every time a new root is created the tree height grows by 1.
After a node splits, I write each node to its own file on disk. After a search or insert completes, the only node in memory is the root. All other nodes in the search path are written to disk. When a new search or insert begins, only the files associated with the nodes in the search path are opened. By looking at the directory with the files, I can see them being updated (keys added to it) and new files created (nodes splitting). Watching the tree grow in real time is cool.
In the process of building an index of web pages and then searching for search terms in the index (271 urls from princeton homepage, 31 tree nodes, 20180 unique words, <2 avg node hits for each search - the tree has height = 1!), I learned more python along the way:
used python 3, requests, and BeautifulSoup to scrape the web pages for urls and words. Detours: unicode, Ned Batchelder's pycon video on unicode , json and pickle (and their differences), file modes (text and byte) and what these mean in python 3.
And yeah... using a python IDE is a new world. The ability to stop a program and navigate down a tree by expanding its nodes almost makes debugging fun.
P.S. What does the B mean in B-tree? Nothing. It could stand for 'Bayer' (the inventor of B-trees) or B for 'balanced' multiway trees. But it doesn't.