Engineering Blog @rdeng - Tumblr Blog

Distributed Quota Management in An Ad Serving Environment

Distributed Quota Management in an Ad Serving Environment, or How We Manage Ad Spend Across Data Centres Without Overspending (from reducedata)

#Reduce Data #programmatic advertising #advertising #distributed quota management #ad budget management.

HBase or Cassandra?

On joining Reduce Data, I was curious about the two NoSQL technologies we use, Cassandra and HBase, which back our UI and our reporting respectively.

Repeated reading and comparisons suggested that Cassandra is best suited to workloads with more writes than reads, and HBase to workloads with more reads than writes. Yet at Reduce Data our UI reads extensively from Cassandra, and our ad server writes asynchronously to HBase. Some more searching suggested that HBase is well suited to analytics, but offered no clear explanation of why or how. As I started using these technologies, I gradually discovered the answer, and why Cassandra is a bad bet for analytics.

The main feature that puts HBase a step ahead of Cassandra for analytics is its support for partial key scans and gets on composite keys. There are other important features, which I will note later in this post, but I am going to focus on partial key scanning first. That claim needs some backing, so let me explain it with a case study.

Let's say we want to measure the performance of individual campaigns over the last week. We would define a composite row key in HBase consisting of the campaign name followed by a day-level inverse timestamp (CampaignName, Dayinversetimestamp).

But the same table in Cassandra would look like

(Note: We can define composite keys in Cassandra, but it does not support partial key scanning the way HBase does.)

If our requirement is to analyse the performance of campaign_A over the last week, the operations involved are as follows:

In HBase: We do a partial range scan with the start row (campaign_A, today) and the end row (campaign_A, today+7), taking advantage of the fact that HBase stores its rows sorted by row key.

In Cassandra: We have the CampaignName but don't know the ordering of the Dayinversetimestamp, so we would have to do an O(n) scan of the rows for campaign_A and then filter the results down to the last 7 days. An O(n) scan is costly, and the filtering consumes additional compute time.
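The range scan described above can be illustrated with a small in-memory sketch. It is not HBase client code: a sorted Python list stands in for HBase's sorted row keys, and `MAX_DAY`, `row_key` and `range_scan` are hypothetical names introduced only for illustration. Days are plain integers, and the inverse timestamp is assumed to be `MAX_DAY - day`, so that newer days sort first.

```python
import bisect

MAX_DAY = 99999  # hypothetical upper bound used to invert day numbers


def row_key(campaign, day):
    """Composite key (campaign, inverse day); newer days sort first."""
    return (campaign, MAX_DAY - day)


def range_scan(sorted_keys, start_key, stop_key):
    """Return keys in [start_key, stop_key): start inclusive, stop exclusive."""
    lo = bisect.bisect_left(sorted_keys, start_key)
    hi = bisect.bisect_left(sorted_keys, stop_key)
    return sorted_keys[lo:hi]


today = 100
# Three weeks of daily rows for two campaigns, kept in sorted key order.
keys = sorted(
    row_key(campaign, day)
    for campaign in ("campaign_A", "campaign_B")
    for day in range(80, today + 1)
)

# Start at (campaign_A, today) and stop at (campaign_A, today+7) in
# inverse-day units, which is the key for the day seven days ago.
start = row_key("campaign_A", today)
stop = row_key("campaign_A", today - 7)
last_week = range_scan(keys, start, stop)

days = [MAX_DAY - inverse_day for _, inverse_day in last_week]
```

Because the keys are sorted, the scan costs two binary searches plus the seven rows returned; it never touches campaign_B or the older rows of campaign_A.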

But what if we used the same composite row key as we did with HBase?

Looks okay, doesn't it? But Cassandra offers poor support for range-based scans. As the data grows, this can overwhelm the system.

Also, if the data is stored as columns within a Cassandra row so that it is sorted and can support range scans, the practical limit on row size in Cassandra is tens of megabytes. Rows larger than that cause problems with compaction overhead and time.

Let us now add a new requirement: we want to list only those campaigns that are active, or only those that are inactive, irrespective of time.

An HBase composite key for this leads with the isactive flag, followed by the campaign name (isactive, CampaignName).

HBase Operations involved:

To get the active ones: do a partial scan with startrow(1, CampaignName) and endrow(1, CampaignName+1).

To get the inactive ones: do a partial scan with startrow(0, CampaignName) and endrow(0, CampaignName+1).
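As a rough illustration of these partial scans, the sketch below again uses a sorted Python list in place of HBase's sorted row keys. The campaign names and the `prefix_scan` helper are made up for the example, and the scan bounds are simplified to the flag alone: it reads everything from (flag, "") up to, but not including, (flag + 1, "").

```python
import bisect


def prefix_scan(sorted_keys, flag):
    """Return campaign names whose key starts with the given isactive flag."""
    lo = bisect.bisect_left(sorted_keys, (flag, ""))
    hi = bisect.bisect_left(sorted_keys, (flag + 1, ""))
    return [name for _, name in sorted_keys[lo:hi]]


# Hypothetical campaigns mapped to their isactive flag (1 = active).
campaigns = {
    "spring_sale": 1,
    "old_promo": 0,
    "brand_launch": 1,
    "beta_test": 0,
}

# Composite keys (isactive, CampaignName), kept in sorted order.
keys = sorted((flag, name) for name, flag in campaigns.items())

active = prefix_scan(keys, 1)
inactive = prefix_scan(keys, 0)
```

Leading with the flag groups all active rows together and all inactive rows together, so each listing is a single contiguous slice of the sorted keys.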

Cassandra has no default ordering here and offers limited support for range scans, so the cost of this operation would again be O(n).

Solutions to avoid the O(n) cost?

Solution 1: Keep a counter table for the isactive parameter. But managing this is an overhead, and a problem if it was not planned for early on.

Solution 2: Use two column families in Cassandra, one for active and one for inactive campaigns. But again, that is not an elegant solution.

There are ways to solve the problem, but they add a layer of complexity.

Another big difference is that HBase counters are atomic and consistent, which Cassandra counters are not. This is a deal breaker for certain kinds of systems, and it helps build the case for HBase.

This and other factors eventually led us to choose HBase counters over Cassandra for certain data processing in our ad serving system.
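To show why atomic increments matter for something like spend tracking, here is a small in-process analogy. It does not use the HBase or Cassandra clients: the `AtomicCounter` class and `record_spend` function are hypothetical, and a lock stands in for the server-side guarantee that each read-modify-write on a counter happens as one indivisible step.

```python
import threading


class AtomicCounter:
    """A counter whose increments cannot interleave with one another."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self, amount=1):
        # The lock makes the read-modify-write a single indivisible step.
        with self._lock:
            self._value += amount
            return self._value

    @property
    def value(self):
        with self._lock:
            return self._value


def record_spend(counter, events, cost_per_event):
    """Simulate one ad server charging a campaign for a batch of events."""
    for _ in range(events):
        counter.increment(cost_per_event)


spend = AtomicCounter()
workers = [
    threading.Thread(target=record_spend, args=(spend, 1000, 2))
    for _ in range(8)
]
for worker in workers:
    worker.start()
for worker in workers:
    worker.join()

total = spend.value
```

With atomic increments, the total always equals the sum of what every writer recorded, which is the property a budget counter needs in order to avoid overspending or undercounting.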

Useful links:

http://research.yahoo.com/files/ycsb-v4.pdf

http://stackoverflow.com/questions/7237271/large-scale-data-processing-hbase-vs-cassandra

http://bigdatanoob.blogspot.in/2012/11/hbase-vs-cassandra.html

http://bigdatanerd.wordpress.com/2011/12/08/why-nosql-part-1-cap-theorem/

Author

Bala Kumar S

[email protected]

#hbase vs cassandra #range scans #partial scans #big data systems

We're Hiring for Several Key Positions in Engineering

Data Mining Scientist – SF Bay Area

We are looking for someone who wants to join a high-energy team and use machine learning and statistical techniques to create state-of-the-art solutions in advertising. We are using early techniques in machine learning and would like to improve them drastically to improve advertisers' ROI in real time. This means analyzing billions of records in real time and working closely with members of the team to deliver agile solutions. If you love to work with data, are deeply technical, highly innovative and entrepreneurial, and long for the opportunity to be a part of a team that delivers a global impact, we want to talk to you!

Responsibilities

• Use statistical and machine learning techniques to create scalable solutions for advertising spend optimization and other core business problems • Analyze and extract relevant information from large amounts of historical business data to help automate and optimize key processes • Design, develop and evaluate highly innovative models for predictive learning • Work closely with engineering teams to drive model implementations

Basic Qualifications

• A BS / MS in CS, machine learning, operations research or statistics • 2+ years of hands-on experience in predictive modeling and analysis • Strong problem-solving ability • Good skills with Java or C++, and Perl (or a similar scripting language)

DevOps Engineer – Part time / Consulting, Chennai, India / SF Bay Area

We are looking for someone with deep DevOps experience to take our mission-critical infrastructure and build tools to completely automate it. Though our scale is small right now, we anticipate handling billions of events each month.

Responsibilities

• Take personal responsibility for the availability and reliability of our service • Save the company money on infrastructure costs • Author and use tools that manage infrastructure. We are looking for someone to write clean, reusable code. Elegant OO code that's simple. This is not a sysadmin job. • Write maintainable code with extensive test coverage, working in a professional software engineering environment (with source control, dev/stage/prod release cycle, continuous deployment) • You will need to get the job done, preferably in Java

Qualifications / Requirements

• A distributed systems foundation with 5 years of development experience • 3+ years of development experience, having handled 100s of servers and automated everything possible • You've made a substantial contribution to a widely used open source project

UI Architect / Lead Developer – Chennai, India

We are looking for a good UI architect / lead developer in the SF Bay Area. You are someone who is extremely passionate about, and has a deep understanding of, user interface design and development. If you love to work with UI and visualization, are highly innovative, and long for the opportunity to be a part of a team that delivers a global impact, we want to talk to you! Given a UI design, you will be expected to create templates, standardize CSS and JavaScript, and build repeatable UIs whose total work quantity reduces over time.

Responsibilities

• Design and build highly fluid user interfaces that visualize data • Design mockups and HTML prototypes during the design process • Test the UI using frameworks / tools • A/B test screens through rollout

Basic Qualifications

• A BS / MS • 2+ years of hands-on UI development • Strong problem-solving ability

Skills

• Expert knowledge of Java (required) • Knowledge of Play (required) • Cassandra or MySQL (required) • HBase • jQuery (required) • YUI (required) • UI Design Patterns (required) • Various UI Frameworks (required) • DOM / JavaScript / CSS (required) • Usability Basics / User Experience (required) • Interest in Data Visualization

Reduce Data offers exciting and challenging careers. We are solving some difficult problems and are looking for

Please apply by sending in your resume to [email protected]

We've open sourced a simple module to read the Twitter data stream

At Reduce Data, we recently evaluated building social data streams into our product, and it is a part of our roadmap. One of the first things we wanted to connect to is the Twitter data stream.

Gnip is the third-party data provider for Twitter. You can connect to Gnip and read any tweets related to any subject, hashtag or user. The Gnip documentation provides a wealth of information about the metadata in the Twitter stream. Read about it here.

Unlike traditional Twitter stream access, the Gnip data stream does not have limits, simply because it is a paid data stream.

We built a small hack around it, including a UI to render the charts. We decided to open source it for companies who want to connect to and consume the Twitter data stream. Please note that you need to talk to Gnip first to get access to the API.

Access the repo at:

https://github.com/ReduceData-Opensource/Twitter-Gnip-Stream
