SPARK streaming and other real time stream processing framework
Streams are everywhere; twitter streams, tcp streams, clickstreams, log streams, event streams. Processing and analyzing them in real time is no less than a daunting task when you have to process hundreds of Megabytes of information per second. There is no point in detecting a potential buyer after the user leaves your eCommerce site, or, detecting fraud after the burglar has absconded. Stream processing is useful in ILP (information leak prevention), SPAM detection, traffic estimation etc.
There are many Real time streaming computation framework available today, notably, Storm from Twitter (BackType), S4 from Yahoo, HStreaming, flume .. To get some idea about their differences (not all though), you can refer to http://www.quora.com/What-would-you-choose-between-Flume-Yahoo-S4-and-Backtype-Twitter-Storm-and-why
In none of the above you can combine real time computation with Batch job. Nature of streaming system is event driven and it is different from the APIs of batch system. Well, you can combine both in SPARK streaming. It provides one API for entire data analysis.
You can also easily combine streaming data with historical data, e.g., join a stream of events against historical data to make a decision. This is achieved through various stateful "window" operation in SPARK streaming. Count frequency of words received in last minute is as simple as,
ones = words.map(w => (w, 1))
freqs = ones.reduceByKey(_ + _)
freqs_60s = freqs.window(Seconds(60), Second(1))
In existing systems, either you need more hardware to achieve fault tolerance or recovery time from fault/stragglers is higher in case of Up-stream backup (Storm, S4 employ up-stream backup). SPARK streaming provides automatic (self healing) recovery from fault/stragglers faster than the rest. This is achieved through parallel recovery across nodes.
Checkpoint state datasets periodically
If a node fails/straggles, build its data in parallel on other nodes using dependency graph
So it saves cost, provides better manageability, performance and consistency across system. It's ability to recover faster from fault in a self healing mode is extremely crucial.
Quick view of SPARK streaming...
The key idea behind SPARK streaming is to treat streaming computations as a series of deterministic batch computations on small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce and groupBy, to produce new datasets representing program outputs or intermediate state. We store these results in resilient distributed datasets (RDDs), an efficient storage abstraction that avoids replication by using lineage for fault recovery.
A discretized stream or D-Stream groups together a series of RDDs and lets the user manipulate them to through various operators. D-Streams provide both stateless operators, such as map, which act independently on each time interval, and stateful operators, such as aggregation over a sliding window, which operate on multiple intervals and may produce intermediate RDDs as state.
This is not yet released officially, but available through github https://github.com/mesos/spark/tree/streaming ) 0.7 version containing the alpha spark streaming is slated to be released soon
Maximum nodes it has been tried out as found through some docs is around 200. Need to see how it scales with a bigger cluster.
Ideally sliding windows can be kept at 100ms duration, but practically this need to be evaluated and it appears that keeping it to 2-5sec makes more practical sense. Again this need to be validated. Latency of the overall computation depend on the length of the sliding window.
At this point, we need to find out who are the major players that are using SPARK streaming. Conviva could be one. who else ?
Refer to https://github.com/mesos/spark/blob/streaming/docs/streaming-programming-guide.md for documentation.
NB: Diagrams are copied from Spark presentations.