Big Data Cookbook @bigdata-cookbook-blog - Tumblr Blog


An extensive tutorial and training deck on Apache Kafka 0.8

#kafka #big data #streaming #queue #training #tutorial

An extensive tutorial and training deck on Apache Storm 0.9

#storm #realtime #bigdata #streaming

streamparse lets you run Python code against real-time streams of data. Integrates with Apache Storm.

code on GH: https://github.com/Parsely/streamparse

Presentation: http://pydata.org/sv2014/abstracts/#201

#storm #realtime #python #streaming #big data

Puppet module to deploy Storm 0.9+ clusters by miguno

You can use this Puppet module to deploy Storm to physical and virtual machines, for instance via your existing internal or cloud-based Puppet infrastructure and via a tool such as Vagrant for local and remote deployments.

#storm #puppet #realtime #bigdata

Awesome post on Error Handling in Storm Trident topologies by @svend_x4f

#storm #realtime #bigdata #errors #robust #streaming

CUBE and ROLLUP: Two Pig Functions That Every Data Scientist Should Know

This is a great writeup about Pig's new operators CUBE and ROLLUP by @joshualande.
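As a rough illustration of what the two operators compute, the sketch below lists the grouping combinations each one aggregates over: CUBE covers every subset of the chosen dimensions, while ROLLUP covers only the hierarchical prefixes. The function names and the example dimensions are made up for this illustration; this is plain Python, not Pig syntax.

```python
from itertools import combinations


def cube_groupings(dims):
    """All 2**n subsets of the dimensions, from most to least detailed."""
    n = len(dims)
    return [combo for size in range(n, -1, -1)
            for combo in combinations(dims, size)]


def rollup_groupings(dims):
    """The n+1 prefixes of the dimension list, from most to least detailed."""
    return [tuple(dims[:size]) for size in range(len(dims), -1, -1)]


# Hypothetical dimensions, chosen only for the example.
dims = ("country", "city")
print(cube_groupings(dims))    # every combination, including the grand total ()
print(rollup_groupings(dims))  # (country, city), then (country,), then ()
```

With n dimensions, CUBE yields 2^n groupings and ROLLUP yields n + 1, which is why ROLLUP is the cheaper choice when the dimensions form a natural hierarchy.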

#bigdata #hadoop #pig #cube #rollup

Bucketing, multiplexing and combining in Hadoop

Great two-part blog post about bucketing, multiplexing and combining data in Hadoop by Alex Holmes (@grep_alex). He goes into great detail on how to use MultipleOutputFormat and MultipleOutputs. The posts have excellent code examples and diagrams, and they were incredibly useful for a project I worked on this week.

Alex Holmes is the author of Hadoop in Practice.

Part 1

Part 2

--Jason

#hadoop #bigdata #MultipleOutputs #mapreduce #MultipleOutputFormat

#mapreduce #python #mrjob #hadoop #recommender #bigdata

Pig vs. MapReduce: When, Why, and How

Donald Miner (@donaldpminer), author of MapReduce Design Patterns, gives a talk on Pig vs. MapReduce: When, Why, and How.

Video and Slides

--Jason

#hadoop #pig #mapreduce

Monoidify! Monoids as a Design Principle for Efﬁcient MapReduce Algorithms by Jimmy Lin

It is well known that since the sort/shuffle stage in MapReduce is costly, local aggregation is one important principle to designing efficient algorithms. This short paper represents an attempt to more clearly articulate this design principle in terms of monoids, which generalizes the use of combiners and the in-mapper combining pattern.
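To make the idea concrete, here is a minimal sketch (not taken from the paper) of computing a mean with local aggregation. A mean is not associative on its own, but the pair (sum, count) forms a monoid: it has an identity element and an associative combine operation, so partial results can be merged per split, in any grouping, before the final division. All names below are illustrative.

```python
from functools import reduce

# A monoid is an identity element plus an associative combine operation.
IDENTITY = (0.0, 0)  # (sum, count)


def combine(a, b):
    """Associative merge of two (sum, count) partial aggregates."""
    return (a[0] + b[0], a[1] + b[1])


def lift(x):
    """Turn a raw value into a monoid element."""
    return (float(x), 1)


def local_aggregate(values):
    """What a combiner or in-mapper combining would compute for one split."""
    return reduce(combine, (lift(v) for v in values), IDENTITY)


def mean(partials):
    """Reducer side: merge the partials, then divide once at the end."""
    total, count = reduce(combine, partials, IDENTITY)
    return total / count if count else None


# Three hypothetical input splits of different sizes.
splits = [[1, 2, 3], [4], [5, 6]]
partials = [local_aggregate(s) for s in splits]
print(mean(partials))  # 3.5
```

Averaging the per-split means directly would give the wrong answer here because the splits differ in size; carrying the count alongside the sum is what makes the combiner safe.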

#hadoop #mapreduce #bigdata #monoids

Scalable Real-time State Updates with Storm groupBy / persistentAggregate / IBackingMap

Blog post on Scalable Real-time State Updates with Storm groupBy / persistentAggregate / IBackingMap by @svend_x4f. Code on GH.

--Jason

#storm #realtime #analytics #cassandra

Twitter Storm: How to develop a pluggable scheduler

This is a great blog post on How to Develop a Pluggable Scheduler for Twitter Storm from one of the Storm committers, James Xu (@xumingming). And there is example code on GH.

Storm's scheduler allows you to control which servers a particular bolt or spout runs on. Some examples from this post:

Make sure a particular task runs on a particular machine.

Topology priority: when resources are limited, always make sure TopologyA is scheduled first.

There are two CPU-intensive bolts, and we don't want them to be assigned to the same machine.

--Jason

#storm #realtime #bigdata #scheduler

Agile Analytics Application by Russell Jurney

Russell Jurney, author of Agile Data Science, presenting "Agile Analytics Applications"

http://datasyndrome.com/post/73247620998/me-presenting-on-agile-data-science-to-the-sf-data

--Jason

#agile #analytics #hadoop #data science

A Practical Storm Trident Tutorial

A practical Storm Trident tutorial from Strataconf 2013.

Slides http://htmlpreview.github.io/?https://rawgithub.com/mischat/trident-tutorial/blob/master/slides/index.html

Code on GH https://github.com/eshioji/trident-tutorial

--Jason

#storm #trident #realtime #bigdata

#lucene #search #analytics #solr

Announcing storm-metrics-statsd: Reporting Storm Metrics to statsd

We want to announce the release of storm-metrics-statsd. storm-metrics-statsd is a module for Storm that enables metrics collection and reporting to statsd. It uses the storm-metrics framework. For more info on how to use the storm-metrics framework, check out the Storm Metrics Howto.

Building/Installation

Usage

This module can be used in two ways:

Configure it for each topology by calling Config.registerMetricsConsumer() prior to launching the topology.

Deploy and configure system wide so usage of this is transparent across all topologies.

Configuring each topology separately

Add this as a dependency to your pom.xml

Configure the StatsdMetricConsumer when building your topology. The example below is based on the storm-starter ExclamationTopology.

System Wide Deployment

System wide deployment requires three steps:

1. Add this section to your $STORM_HOME/conf/storm.yaml.

2. Install the storm-metrics-statsd and java-statsd-client JARs into $STORM_HOME/lib/ ON EACH STORM NODE.

3. Restart Storm. You will likely also need to restart any topologies that were running before you changed your $STORM_HOME/conf/storm.yaml.

Notes

You can override the topology name used when reporting to statsd by calling:

This is useful if you use versioned topology names (e.g. appending a timestamp or a version string) but only care to track them as one in statsd.

License

storm-metrics-statsd is licensed under the Apache 2.0 license.

--@jason_trost

#storm #metrics #realtime #bigdata #statsd

Storm Metrics How-to

If you have been following Storm's updates over the past year, you may have noticed the metrics framework feature, added in version 0.9.0 ([New Storm metrics system PR](https://github.com/nathanmarz/storm/issues/305)). This provides nicer primitives built into Storm for collecting application specific metrics and reporting those metrics to external systems in a manageable and scalable way. This blog post is a brief howto on using this system since the only examples of this system I've seen used are in the core storm code.

Concepts

Storm's metrics framework mainly consists of two API additions: 1) Metrics, 2) Metrics Consumers.

Metric

An object initialized in a Storm bolt or spout (or Trident Function) that is used for instrumenting the respective bolt/spout for metric collection. This object must also be registered with Storm using the TopologyContext.registerMetric(...) function. Metrics must implement backtype.storm.metric.api.IMetric.

Several useful IMetric implementations exist. (Excerpt from the Storm Metrics wiki page with some extra notes added).

AssignableMetric -- set the metric to the explicit value you supply. Useful if it's an external value or in the case that you are already calculating the summary statistic yourself. Note: Useful for statsd Gauges.

CombinedMetric -- generic interface for metrics that can be updated associatively.

CountMetric -- a running total of the supplied values. Call incr() to increment by one, incrBy(n) to add/subtract the given number. Note: Useful for statsd counters.

MultiCountMetric -- a hashmap of count metrics. Note: Useful for many Counters where you may not know the name of the metric a priori or where creating many Counters manually is burdensome.

MeanReducer -- an implementation of ReducedMetric that tracks a running average of values given to its reduce() method. (It accepts Double, Integer or Long values, and maintains the internal average as a Double.) Despite his reputation, the MeanReducer is actually a pretty nice guy in person.
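The behaviour these implementations share is "report the value for the current period, then start over". The sketch below models that in plain Python with simplified stand-ins; the class and method names are illustrative and are not Storm's Java API. Returning None for an empty period is a choice made in this sketch.

```python
class CountMetric:
    """Stand-in for a counter: a running total that is reported and zeroed."""

    def __init__(self):
        self._value = 0

    def incr(self):
        self._value += 1

    def incr_by(self, n):
        self._value += n

    def get_value_and_reset(self):
        value, self._value = self._value, 0
        return value


class MeanMetric:
    """Stand-in for a reduced metric that keeps a running average."""

    def __init__(self):
        self._sum = 0.0
        self._count = 0

    def update(self, x):
        self._sum += float(x)
        self._count += 1

    def get_value_and_reset(self):
        # No values seen in this period: report None rather than dividing by zero.
        mean = self._sum / self._count if self._count else None
        self._sum, self._count = 0.0, 0
        return mean


count = CountMetric()
mean = MeanMetric()
for word in ["storm", "kafka", "pig"]:
    count.incr()
    mean.update(len(word))

first = (count.get_value_and_reset(), mean.get_value_and_reset())
second = (count.get_value_and_reset(), mean.get_value_and_reset())
print(first)   # values accumulated during the first period
print(second)  # nothing was recorded after the reset
```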

Metrics Consumer

An object that processes, reports, or logs the output of Metric objects (represented as DataPoint objects) from every place those Metric objects were registered. It also receives useful metadata about where each metric was collected, such as worker host, worker port, componentID (bolt/spout name), taskID, timestamp, and updateInterval (all represented as TaskInfo objects). Metrics Consumers are registered in the Storm topology configuration (using backtype.storm.Config.registerMetricsConsumer(...)) or in Storm's system config (under the config name topology.metrics.consumer.register). Metrics Consumers must implement backtype.storm.metric.api.IMetricsConsumer.
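A consumer's job can be pictured as a function from (task metadata, data points) to whatever the external system expects. The sketch below is a plain-Python model of that role, not Storm's Java interface: the TaskInfo and DataPoint tuples only mirror the fields described above, and the field names, the dotted naming scheme, and the name=value output format are assumptions made for illustration.

```python
from collections import namedtuple

# Simplified stand-ins for the metadata and data point objects described above.
TaskInfo = namedtuple(
    "TaskInfo",
    "worker_host worker_port component_id task_id timestamp update_interval_secs",
)
DataPoint = namedtuple("DataPoint", "name value")


def flatten(data_point):
    """Expand map-valued data points (e.g. per-word counts) into scalar ones."""
    if isinstance(data_point.value, dict):
        return [(f"{data_point.name}.{key}", value)
                for key, value in sorted(data_point.value.items())]
    return [(data_point.name, data_point.value)]


def handle_data_points(task_info, data_points):
    """Turn one batch of data points into 'component.metric=value' lines."""
    lines = []
    for data_point in data_points:
        for name, value in flatten(data_point):
            if value is None:  # nothing was recorded in this period
                continue
            lines.append(f"{task_info.component_id}.{name}={value}")
    return lines


# Hypothetical batch from a bolt named "exclaim1".
info = TaskInfo("host1", 6700, "exclaim1", 4, 1390000000, 60)
batch = [
    DataPoint("execute_count", 3),
    DataPoint("word_count", {"nathan": 2, "bob": 1}),
    DataPoint("word_length", None),
]
for line in handle_data_points(info, batch):
    print(line)
```

Keeping the flattening step separate from the formatting step makes it easy to swap the output format for a different backend.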

Example Usage

To demonstrate how to use the new metrics framework, I will walk through some changes I made to the ExclamationTopology included in storm-starter. These changes will allow us to collect some metrics including:

A simple count of how many times the execute() method was called per time period (5 sec in this example).

A count of how many times an individual word was encountered per time period (1 minute in this example).

The mean length of all words encountered per time period (1 minute in this example).

Adding Metrics to the ExclamationBolt

Add three new member variables to ExclamationBolt. Notice they are all declared as transient. This is needed because none of these Metrics are Serializable, and all non-transient variables in Storm bolts and spouts must be Serializable.

Initialize and register these Metrics in the bolt's prepare method. Metrics can only be registered in the prepare method of bolts or the open method of spouts; otherwise an exception is thrown. The registerMetric method takes three arguments: 1) metric name, 2) metric object, and 3) time bucket size in seconds. The "time bucket size in seconds" controls how often the metrics are sent to the Metrics Consumer.

Actually increment/update the metrics in the bolt's execute method. In this example we are just:

incrementing a counter every time we handle a word.

incrementing a counter for each specific word encountered.

updating the mean length of the words we encountered.
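The effect of the time bucket size on these three metrics can be seen in a small simulation. The sketch below is plain Python rather than Storm code: it groups timestamped words into fixed buckets and summarises each bucket the way the three metrics above would. The function name and sample events are invented for the example, and unlike a live topology this sketch produces no report for a bucket in which nothing arrived.

```python
from collections import Counter
from itertools import groupby


def report_per_bucket(events, bucket_secs):
    """Summarise (timestamp_secs, word) events, sorted by time, per time bucket."""
    reports = []
    for bucket, group in groupby(events, key=lambda e: e[0] // bucket_secs):
        words = [word for _, word in group]
        reports.append({
            "bucket_start": bucket * bucket_secs,
            "execute_count": len(words),           # one increment per tuple handled
            "word_count": dict(Counter(words)),    # one counter per distinct word
            "mean_word_length": sum(map(len, words)) / len(words),
        })
    return reports


# Hypothetical stream of words with arrival times in seconds.
events = [(0, "nathan"), (2, "bob"), (4, "nathan"), (7, "golda"), (61, "bob")]

five_sec = report_per_bucket(events, 5)    # bucket size used for the execute count
one_min = report_per_bucket(events, 60)    # bucket size used for the word metrics

print([r["execute_count"] for r in five_sec])
print(one_min[0]["word_count"], one_min[0]["mean_word_length"])
```

A shorter bucket gives finer-grained counts at the cost of more reports sent to the consumer.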

Collecting/Reporting Metrics

Lastly, we need to enable a Metric Consumer in order to collect and process these metrics. The Metric Consumer is meant to be the interface between the Storm metrics framework and some external system (such as Statsd, Riemann, etc). In this example, we are just going to log the metrics using Storm's builtin LoggingMetricsConsumer. This is accomplished by registering the Metrics Consumer when defining the Storm topology. In this example, we are registering the metrics consumer with a parallelism hint of 2.

Here is the line we need to add when defining the topology.

Here is the full code for defining the topology:

After running this topology, you should see log entries in $STORM_HOME/logs/metrics.log that look like this.

You should also see the LoggingMetricsConsumer show up as a bolt in the Storm web UI, like this (after clicking the "Show System Stats" button at the bottom of the page):

Summary

We instrumented the ExclamationBolt to collect some simple metrics. We accomplished this by initializing and registering the metrics in the Bolt's prepare method and then by incrementing/updating the metrics in the bolt's execute method.

We had the metrics framework simply log all the metrics that were gathered using the builtin LoggingMetricsConsumer.

The full code is here as well as posted below. A diff between the original ExclamationTopology and mine is here.

In a future post I hope to present a Statsd Metrics Consumer that I am working on to allow for easy collection of metrics in statsd and then visualization in graphite, like this.

--@jason_trost

#storm #bigdata #metrics #realtime #streaming
