HBase or Cassandra?
When I joined Reduce Data, I was curious about the two NoSQL technologies we use: Cassandra for our UI and HBase for reporting.
Repeated reading and comparisons suggested that Cassandra is best suited to workloads with more writes than reads, and HBase to workloads with more reads than writes. Yet at Reduce Data, our UI reads extensively from Cassandra and our ad server writes asynchronously to HBase. Further searching suggested that HBase is well suited to analytics, but gave no clear explanation of why or how. As I started using both technologies, I gradually found the answer to that question, and saw why Cassandra is a bad bet for analytics.
The main feature that puts HBase a step ahead of Cassandra for analytics is its support for partial key scans and gets on composite keys. There are other important features, which I note later in this post, but I will focus on partial key scanning first. The claim needs some depth to be convincing, so let me explain it with a case study.
Let's say we want to measure the performance of individual campaigns over the last week. We'll define a composite row key in HBase consisting of
But the same table in Cassandra would look like
(Note: We can define composite keys in Cassandra, but it does not support partial key scanning the way HBase does.)
If our requirement is to analyse the performance of campaign_A over the last week, the operations involved are as follows:
In HBase: We do a partial range scan with startrow (campaign_A,today) and endrow (campaign_A,today+7), taking advantage of the fact that HBase keeps its rows sorted by key by default.
In Cassandra: We have the CampaignName but don't know the ordering of the Dayinversetimestamp, so we have to do an O(n) row scan for campaign_A and then filter the results for the last 7 days. The O(n) scan is costly, and the filtering consumes further compute time.
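The difference between the two access patterns can be sketched with a toy in-memory model. This is not HBase or Cassandra client code. It assumes a hypothetical key layout of `CampaignName|inverse day`, where the inverse day is a fixed ceiling minus the day number, so newer days sort first and "today+7" in inverse-day terms means seven days back. The names `row_key`, `range_scan`, and `scan_and_filter` are made up for illustration.

```python
from bisect import bisect_left
from datetime import date, timedelta

MAX_ORDINAL = date.max.toordinal()  # assumed ceiling used to invert day numbers


def row_key(campaign: str, day: date) -> bytes:
    """Composite key: campaign name, '|', zero-padded inverse day (newest sorts first)."""
    inverse_day = MAX_ORDINAL - day.toordinal()
    return f"{campaign}|{inverse_day:07d}".encode()


def range_scan(sorted_keys, start_row, stop_row):
    """Sorted store: binary-search both ends and return only keys in [start, stop)."""
    lo = bisect_left(sorted_keys, start_row)
    hi = bisect_left(sorted_keys, stop_row)
    return sorted_keys[lo:hi]


def scan_and_filter(keys, campaign, today, days):
    """No usable ordering: visit every row, keep those matching campaign and day."""
    wanted = {row_key(campaign, today - timedelta(days=i)) for i in range(days)}
    touched = 0
    matches = []
    for key in keys:
        touched += 1
        if key in wanted:
            matches.append(key)
    return matches, touched


today = date(2013, 1, 31)  # fixed date so the example is reproducible
campaigns = ("campaign_A", "campaign_B", "campaign_C")
keys = sorted(
    row_key(c, today - timedelta(days=i)) for c in campaigns for i in range(30)
)

# Partial range scan: (campaign_A, today) up to, but excluding, (campaign_A, today+7)
start_row = row_key("campaign_A", today)
stop_row = row_key("campaign_A", today - timedelta(days=7))
last_week = range_scan(keys, start_row, stop_row)

# Full scan plus filter over the same data
filtered, touched = scan_and_filter(keys, "campaign_A", today, 7)
print(len(last_week), len(filtered), touched)
```

Both approaches return the same seven rows, but the range scan locates them with two binary searches, while the scan-and-filter version visits every row in the table.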
But what if we used the same composite row key as we did with HBase?
Looks okay, doesn't it? But Cassandra offers poor support for range-based scans, and as the data grows this can overwhelm the system.
Also, if data is stored in columns in Cassandra so that it is sorted and can support range scans, the practical limit on row size in Cassandra is tens of megabytes. Rows larger than that cause problems with compaction overhead and time.
Let us now add a new requirement: we want to list only the campaigns that are active, or only those that are inactive, irrespective of time.
An HBase composite key looks like the one below:
HBase operations involved:
To get the active campaigns: do a partial scan with startrow(1,CampaignName) and endrow(1,CampaignName+1).
To get the inactive campaigns: do a partial scan with startrow(0,CampaignName) and endrow(0,CampaignName+1).
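As a rough illustration of why leading the key with the isactive flag helps, here is a small in-memory sketch of a prefix scan over sorted keys. It is a toy model rather than HBase client code. The `isactive|CampaignName` byte layout and the helpers `prefix_stop` and `prefix_scan` are assumptions made for this example.

```python
from bisect import bisect_left


def prefix_stop(prefix: bytes) -> bytes:
    """Smallest key that sorts after every key with this prefix.

    Assumes the last byte of the prefix is below 0xFF.
    """
    return prefix[:-1] + bytes([prefix[-1] + 1])


def prefix_scan(sorted_keys, prefix: bytes):
    """Return the contiguous slice of keys that start with the given prefix."""
    lo = bisect_left(sorted_keys, prefix)
    hi = bisect_left(sorted_keys, prefix_stop(prefix))
    return sorted_keys[lo:hi]


# Hypothetical rows: isactive flag first, then the campaign name
keys = sorted(
    [b"1|campaign_A", b"0|campaign_B", b"1|campaign_C", b"0|campaign_D"]
)

active = prefix_scan(keys, b"1|")
inactive = prefix_scan(keys, b"0|")
print(active)
print(inactive)
```

Because the flag is the leading key component, all active rows sit next to each other in sorted order, so each lookup reads one contiguous slice instead of the whole table.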
In Cassandra, since rows are not ordered by default and support for range scans is limited, the cost of this operation is again O(n).
Solutions to avoid the O(n) cost?
Solution 1: Keep a counter table for the isactive parameter. But managing it is an overhead, and a problem if not thought through early on.
Solution 2: Use two column families in Cassandra, one for active and one for inactive campaigns. But again, that is not an elegant solution.
There are ways to solve the problem, but each adds a layer of complexity.
Another big difference is that HBase counters are atomic and consistent, which is not the case with Cassandra counters. This is a deal breaker for certain kinds of systems, and it clearly helps build the case for HBase.
This and other factors eventually led us to choose HBase counters over Cassandra for certain data processing in our ad serving system.
Useful links: http://research.yahoo.com/files/ycsb-v4.pdf
http://stackoverflow.com/questions/7237271/large-scale-data-processing-hbase-vs-cassandra
http://bigdatanoob.blogspot.in/2012/11/hbase-vs-cassandra.html
http://bigdatanerd.wordpress.com/2011/12/08/why-nosql-part-1-cap-theorem/
Author
Bala Kumar S









