Using Redis to Manage Surrogate Keys
In the ad-tech industry, we get a a lot of traffic. One company I work with receives upwards of 3 billion events per month. The big guys do around 20 billion per day. To manage this amount traffic there are a lot of different data warehouse techniques that people employ to get every last bit of speed out of their apps, while preserving as much precious disk space as possible. While natural keys employ an automatic sense referential integrity, storing them is not very efficient. Surrogate keys, on the other hand, can be expensive to create and manage, plus a lot of datawarehouse solutions dont even offer the ability to auto-increment or even provide constraints. Such is the case of InfoBright.
The problem(s):
Creating a surrogate key requires doing a search on the columns values in a particular dimension table to find if a row exists, then creating a new one that is one more than the last one created.
While doing a lookup for the keys, you will most likely hinder performance and/or lock tables.
Every RDBMS datawarehouse solution is much faster at bulk loading than individual inserts.
My solution:
The idea is to use Redis in your ETL process to take care of the heavy lifting. Imagine you have one fact table call event_facts and one dimension table called users_dim.
Here is what one row in out psv file may look like:
purchase|john|doe|male|37.50|socks
So, from this we know that John Doe made a purchase for socks at the price of 37.50. To split up the data into our table schema it may look like this:
purchase, socks, 37.50, {user_id}
{user_id}, John, Doe, male
Que Redis, here I have made a simple little Ruby script that will get us our surrogate key:
So in this example, we are accomplishing the following process:
Creating a hash of the data ("john,doe,male")
Checking redis to see if we have the key "users_dim/#{hash}" and return it if we do
If do not have the key, increment the key "users_dim/key" which stores our latest value
Set the key "users_dim/#{hash}" to the incremented value, so next time we will have the key
On a 32gb machine, we had well over 40m keys, with room to spare. Because redis is an in memory store, it is insanely fast at these lookups, much faster than the db itself. After you get all of your surrogate keys, just output the file in a format that can be loaded into the db and go.