Arindam's Tech Blog @arindampaul - Tumblr Blog

Juggling with Data

I created an IPython Notebook to capture the essence of INTEROPERABILITY between different types of Big Data workloads: map-reduce, SQL, DataFrames and visualization (matplotlib) - all. under. one. umbrella.

The link to the notebook is given below. You can download it and play with it. Please let me know if anything needs clarification.

In my next Notebook, I’ll cover streaming and machine learning in the SPARK world.

Please let me know if you want me to cover any specific aspect in that notebook.

http://nbviewer.ipython.org/github/parindam/GIDS/blob/master/wordcount-n-more.ipynb

Thanks, and I hope you find it useful!!

Arindam.

#hadoop #sql #dataframe #apache spark #map-reduce #visualization

God particle of Big Data universe discovered

A great article by Jean-Paul Smets ( http://www.ctocio.com/big-data-2/15730.html ). My main takeaways:

1. I completely agree on "before going into expensive investments: first purchase a small GNU/Linux server with at least 32 GB memory, a large SSD disk (ex. 1 TB) and study Scikit-Learn". A machine with 32 GB of memory can often turn a seemingly out-of-core problem into a single-machine problem, and that is key for an enthusiastic data scientist. For "smaller level" scaling strategies, here is a nice article: http://scikit-learn.org/dev/modules/scaling_strategies.html

2. I did not know that "Scikit-Learn is by the way the toolkit used by many engineers at Google to prototype solutions for their “Big Data” problems.". Google is usually a trendsetter in technology, and if this is true, it is a great thing for scikit-learn, or rather for the pythonification of scientific computing. I am sure many Kagglers will agree.

3. Maybe it is my ignorance, but it was a jaw-dropping moment to learn that "The Large Hadron Collider (LHC) of CERN has been designed to process 1 petabyte of data per second.", and this is equivalent to processing 3.3 years of HD video per second. That is almost an infrastructure for managing the surveillance of the entire world!!!!

#machine learning #big data #python #scikit-learn

How to decide whether to apply Supervised Learning or Anomaly Detection

Let’s quickly find out when to use Anomaly Detection (modeled with a Gaussian distribution) and when to use Supervised Learning:

· When there is a very small number of positive examples (outliers) and a large number of negative examples, the data-set is skewed. In such a scenario, it’s ideal to use Anomaly Detection with a Gaussian distribution. When we have a large number of both positive and negative examples, we can use Supervised Learning.

· When we have many different “types” of anomalies and it’s hard for any algorithm to learn from the positive examples what future anomalies will look like, it’s better to use Anomaly Detection: future anomalies may look nothing like any of the anomalous examples seen before. Conversely, when we have enough positive examples and future positive examples are likely to be similar to the ones in the training set, use a supervised method such as Logistic Regression.

· Fraud detection, detecting manufacturing defects and monitoring machines in a data center are examples of Anomaly Detection.

· Email spam filtering, weather prediction and cancer classification are examples of Supervised Learning.

· Remember, if there are many examples of fraudulent behavior, fraud detection might as well move towards Supervised Learning.

Let’s say you have decided to apply Anomaly Detection. How do you decide whether to use the ordinary (per-feature) Gaussian distribution or the Multivariate Gaussian distribution?

· With the ordinary Gaussian distribution, you have to manually create features that capture anomalies in unusual combinations of values. Use the Multivariate Gaussian distribution if you want to capture correlations between features automatically.

o For example, for monitoring machines in a data center, let’s say you have x1 = CPU load and x2 = network traffic.

o With the ordinary Gaussian distribution you would come up with extra features, such as x3 = CPU load / network traffic = x1/x2 and x4 = x1^2/x2.

· The ordinary Gaussian distribution is computationally cheap, whereas the Multivariate Gaussian distribution is computationally more expensive, because it has to compute the inverse of the covariance matrix.

· The ordinary Gaussian distribution works well even if the training set size is less than the number of features. For the Multivariate Gaussian distribution, the number of training examples must exceed the number of features (otherwise the covariance matrix is not invertible), and as a rule of thumb it should be approx. 10 times the number of features.

· In case of the Multivariate Gaussian distribution, also make sure that there are no redundant features. The covariance matrix becomes non-invertible (singular) if features are redundant, that is, linearly dependent on each other (for example a duplicated feature, or x3 = x1 + x2).
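To make the "manually create features" point concrete, here is a minimal sketch of the ordinary (per-feature) Gaussian approach in plain Scala. The training rows, the threshold epsilon and all names (GaussianAnomalySketch, withRatio, etc.) are made up for illustration; only the x1, x2 and x3 = x1/x2 features come from the example above.

```scala
object GaussianAnomalySketch {
  // Hypothetical training rows: (x1 = CPU load, x2 = network traffic).
  val train: Seq[Seq[Double]] = Seq(
    Seq(0.50, 10.0), Seq(0.55, 11.0), Seq(0.45, 9.0), Seq(0.52, 10.5), Seq(0.48, 9.5))

  // Hypothetical threshold: examples with p(x) < epsilon are flagged.
  val epsilon = 1e-3

  // Mean and variance of one feature column (maximum-likelihood estimates).
  def fit(column: Seq[Double]): (Double, Double) = {
    val mu = column.sum / column.size
    val variance = column.map(x => math.pow(x - mu, 2)).sum / column.size
    (mu, variance)
  }

  // Fit every feature independently.
  def fitAll(rows: Seq[Seq[Double]]): Seq[(Double, Double)] = rows.transpose.map(fit)

  // Univariate Gaussian density.
  def density(x: Double, mu: Double, variance: Double): Double =
    math.exp(-math.pow(x - mu, 2) / (2 * variance)) / math.sqrt(2 * math.Pi * variance)

  // p(x) is the product of the per-feature densities.
  def p(example: Seq[Double], params: Seq[(Double, Double)]): Double =
    example.zip(params).map { case (x, (mu, v)) => density(x, mu, v) }.product

  // Manually engineered feature x3 = x1 / x2.
  def withRatio(row: Seq[Double]): Seq[Double] = row :+ (row(0) / row(1))

  def main(args: Array[String]): Unit = {
    val plain = fitAll(train)
    val engineered = fitAll(train.map(withRatio))
    // CPU load slightly high, traffic slightly low: each value alone looks normal.
    val odd = Seq(0.55, 9.0)
    println(s"flagged without x3: ${p(odd, plain) < epsilon}")
    println(s"flagged with x3:    ${p(withRatio(odd), engineered) < epsilon}")
  }
}
```

In this toy data CPU load and traffic always move together, so the unusual combination only stands out once the ratio feature is added. A Multivariate Gaussian would pick up that correlation without the hand-made feature.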

#Multivariate Gaussian Distribution #Anomaly Detection #Gaussian Distribution #Supervised Learning vs Anomaly Detection

Schooling, Machine Learning and Bias and Variance

Remember our good old school days!! What if we sent machines to school? Would it make them better? Interested? Read on … it's just an analogy...

Remember how we studied the whole year to learn many subjects, took class tests and then finally the year-end exam. In a test, the fewer errors we made, the more marks we got, and the higher the percentage. The whole point was to prove that we had learnt properly over the year, and we did that by answering the questions correctly. The job of the teacher was to evaluate the answers and provide corrective measures so that the student could minimize errors in the final exam. Those corrective measures were mainly applied around the mid-term test.

Machine Learning is quite similar. The machine learns from a large dataset (called the training set) and forms a hypothesis (denoted as h(theta)). It then does model selection as a corrective measure using the Cross-Validation dataset (much like our mid-term exam), and is finally tested on the Test Set (X(test)), which is more like our final exam. We judge the hypothesis to be good if it makes few errors there.

A rule of thumb for splitting the dataset is to first shuffle it randomly and then take

60% for Training Set - used for learning.

20% for the Cross Validation Set - used for model selection, e.g. choosing the degree of polynomial features or the regularization parameter (not discussed in detail here; that would be a separate discussion).

20% for Test Set - Final Exam.
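As a rough sketch, the shuffle-then-split rule of thumb could look like this in plain Scala. The object and function names and the fixed seed are my own choices, not part of any library.

```scala
import scala.util.Random

object SplitSketch {
  // Shuffle, then cut into 60% training, 20% cross validation, 20% test.
  // The seed is only there to make the shuffle repeatable.
  def split[A](data: Seq[A], seed: Long): (Seq[A], Seq[A], Seq[A]) = {
    val shuffled = new Random(seed).shuffle(data)
    val nTrain = data.size * 60 / 100
    val nCv = data.size * 20 / 100
    val (train, rest) = shuffled.splitAt(nTrain)
    val (cv, test) = rest.splitAt(nCv) // the test set takes whatever remains
    (train, cv, test)
  }

  def main(args: Array[String]): Unit = {
    val (train, cv, test) = split(1 to 100, seed = 7L)
    println(s"train=${train.size} cv=${cv.size} test=${test.size}")
  }
}
```

Shuffling first matters because many datasets are stored in some order (by date, by class), and a straight cut would give the three sets different distributions.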

Now, forming a hypothesis boils down to finding the parameter vector, theta, that minimizes the difference between predicted and actual behavior. A good way to check whether your hypothesis is learning is to plot the error (the cost function J(theta)) against the training set size. As you move from a few training examples to a large training set, the training error will increase, since it gets harder to fit every example. Now plot the error on your Cross Validation set (CV) for the same training set sizes: the CV error will decrease as the model sees more data.

With more data, one of two things may happen:

(Case 1) The gap between the training error and the cross validation error closes, but both errors stay high.

(Case 2) A significant gap remains between the training error and the cross validation error.

Case 1 signifies that our hypothesis is under-fitting the data; this is known as high Bias. Getting more data is NOT going to help improve the hypothesis. Instead,

Get additional features, or add polynomial features. And/or

Decrease regularization parameter (lambda)

Case 2 signifies that our hypothesis is over-fitting the data; this is known as high Variance. Getting more data is likely to help improve the hypothesis. So, to fix high variance,

Get more training examples. And/or

Try a smaller set of features. And/or

Increase regularization parameter (lambda)

Using Principal Component Analysis (PCA) to reduce the dimensionality of the data in order to fix high variance is an extremely BAD IDEA. This is a very common misuse of PCA. Increase the regularization parameter instead.

Recommended approach in Machine Learning:

Start with a simple, quick and dirty algorithm. Do not spend too much time on the algorithm.

Plot learning curves and see whether more data, more features etc. are likely to help.

Manually examine the cross validation examples on which the algorithm made errors.

#Machine Learning #Bias #Variance #Regularization parameter #overfitting #underfitting #lambda #pca

Even with 100% precision, many diagnoses fail to predict diseases correctly. Why?

Before I explain, there is a quiz for you.

A car travels 10 km at 60 kmph and returns over the same distance at 40 kmph. What is the average speed for the entire trip?

The answer is 48 kmph.

I am sure you can find out how.

Anyway, the formula is

2 * v1 * v2 / (v1 + v2). Let us call this F1 Score.

Now let us consider the following table.

Please note that when v1 and v2 are the same (case 2), the F1 Score equals that value, and when one is much lower (case 3), the F1 Score is penalized heavily.

This background will help us understand why a diagnosis with 100% precision can still fail.

Have a look at the Actual diagnosis vs Predicted diagnosis table (a kind of confusion matrix).
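For the curious, here is a small Scala sketch that checks the quiz answer two ways: once as total distance over total time, and once with the formula. The object and function names are my own.

```scala
object HarmonicMeanSketch {
  // The formula from above: 2 * v1 * v2 / (v1 + v2).
  def harmonicMean(v1: Double, v2: Double): Double = 2 * v1 * v2 / (v1 + v2)

  // Average speed computed directly: total distance / total time.
  def averageSpeed(distanceKm: Double, v1: Double, v2: Double): Double = {
    val totalTime = distanceKm / v1 + distanceKm / v2
    2 * distanceKm / totalTime
  }

  def main(args: Array[String]): Unit = {
    println(averageSpeed(10.0, 60.0, 40.0)) // the quiz, the long way
    println(harmonicMean(60.0, 40.0))       // the quiz, with the formula
    println(harmonicMean(50.0, 50.0))       // equal inputs give the same value back
    println(harmonicMean(1.0, 0.1))         // one small input drags the result down
  }
}
```

The distance cancels out, which is why the formula needs only the two speeds. The car spends more time at the slower speed, so the result sits below the plain average of 50.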

Precision (P) is defined as,

True positive / Total number of predicted positive

= True positive / (True positive + False positive)

So P == 1, or 100%, when there are no False Positives.

Recall (R) is defined as,

True positive / Total number of actual positives

= True positive / (True positive + False negative)

So R == 1, or 100%, when there are no False Negatives.

A better diagnostic system will have both higher Precision and higher Recall.

Often there is a trade-off between Precision and Recall: to achieve higher precision, systems sacrifice Recall.

So if Precision is 100% (that is, 1) and Recall is 10% (0.1), the F1 Score (2*P*R/(P+R)) is 0.18 (look at case 3 in the above table). So even with 100% precision, the diagnostic system scores only 0.18 (18%) on F1, because it misses 90% of the actual cases.

It turns out to be a good idea to get a diagnosis done by multiple systems before coming to a conclusion, at least for rare diseases. I know people who were initially not diagnosed with cancer (a false negative), but were diagnosed with it at a much later stage.

Precision and Recall are used in Machine Learning as evaluation metrics for skewed classes, for example in Logistic Regression (sigmoid function) and in Anomaly Detection.
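Here is a minimal Scala sketch of the same arithmetic, starting from raw counts. The numbers are hypothetical: 100 people actually have the disease, and the system flags only 10 of them, all correctly. The case class and function names are my own.

```scala
object PrecisionRecallSketch {
  // Counts from the actual-vs-predicted table.
  final case class Counts(truePositive: Int, falsePositive: Int, falseNegative: Int)

  def precision(c: Counts): Double =
    c.truePositive.toDouble / (c.truePositive + c.falsePositive)

  def recall(c: Counts): Double =
    c.truePositive.toDouble / (c.truePositive + c.falseNegative)

  def f1(p: Double, r: Double): Double = 2 * p * r / (p + r)

  def main(args: Array[String]): Unit = {
    // Hypothetical system: 10 correct detections, no false alarms, 90 missed patients.
    val cautious = Counts(truePositive = 10, falsePositive = 0, falseNegative = 90)
    val p = precision(cautious)
    val r = recall(cautious)
    println(f"precision=$p%.2f recall=$r%.2f f1=${f1(p, r)}%.2f")
  }
}
```

Everything the system flags is right, yet 90 of the 100 patients walk away undiagnosed, and the F1 Score reflects that.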

Understanding Locality Sensitive hashing (LSH)

Once I played a roulette kind of game at a casino in Atlantic City. I remember betting on one particular number and losing again and again. Throughout the 10 games I played that night, I consistently bet on the same number. I had already lost 9 games when the 10th and last game came, and that one I won. I got back all the money I had lost. For a moment you might think I was romanticizing the number. I was actually amplifying the probability of that number being a winner at least once. Well, if you are prepared to lose, you can use my trick. I have cautioned you, so don't blame me later :-)

If you have understood how I won, you have probably understood what Locality Sensitive Hashing is.

Now consider another roulette machine. So there are 2 roulette machines now, and let's imagine that both machines are very similar. This means that, for the same initial conditions, if you apply similar forces, they stop at nearby numbers. For force F1, if roulette-1 stops at 7, roulette-2 will also stop at a nearby number, maybe 6, 7 or 8 (with a high probability at 7). For different forces F2, F3 etc., both machines will again stop at numbers close to each other. So we can conclude that the two roulette machines are similar. If the machines are not similar, they will stop at different numbers for the same initial conditions and the same force.

(The following is taken from Rajaraman and Ullman's book.)

One general approach to LSH is to “hash” items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are. We then consider any pair that hashed to the same bucket for any of the hashings to be a candidate pair. We shall normally assume that two vectors hash to the same bucket if and only if they are identical.

-------------------------------------------------------------------------------------------

The mathematics behind LSH is explained very well by Prof. Gautam Shroff on Coursera. The following 2 slides are taken from his lectures.

-------------------------------------------------------------------------------------------

When implementing a symbol table (hash table), the emphasis is on avoiding collisions, so that we get uniform hashing and the best performance (for search, insert and delete). In LSH, however, you do the reverse: you use collisions to move similar items into the same bucket.

Here is a slide explaining concept of "Bins and Balls" which gives some idea about how multiple hashing might move things to the same bucket.
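To give a feel for "hash several times and call anything that collides a candidate pair", here is a small plain-Scala sketch using min-hash signatures split into bands. This is one common way to do LSH for sets, not necessarily what the slides show. The number of hash functions, the band size, the seed and all names are my own assumptions, and the sets are assumed to be non-empty sets of non-negative ids.

```scala
import scala.util.Random

object LshSketch {
  private val prime = 2147483647L // 2^31 - 1
  private val rng = new Random(42)

  // 12 hypothetical hash functions of the form h(x) = (a * x + b) mod prime.
  private val coeffs: Vector[(Long, Long)] =
    Vector.fill(12)((1L + rng.nextInt(1000000), rng.nextInt(1000000).toLong))

  // Min-hash signature: for each hash function, the smallest hash over the set.
  def signature(items: Set[Int]): Vector[Long] =
    coeffs.map { case (a, b) => items.map(x => (a * x + b) % prime).min }

  // Split a signature into bands of `rows` values each.
  def bands(sig: Vector[Long], rows: Int): Vector[Vector[Long]] =
    sig.grouped(rows).toVector

  // Two sets are a candidate pair if ANY band is identical, i.e. they
  // land in the same bucket for at least one of the hashings.
  def isCandidate(s1: Set[Int], s2: Set[Int], rows: Int = 3): Boolean =
    bands(signature(s1), rows).zip(bands(signature(s2), rows)).exists {
      case (b1, b2) => b1 == b2
    }

  def main(args: Array[String]): Unit = {
    val a = Set(1, 2, 3, 4, 5, 6, 7, 8)
    val b = Set(1, 2, 3, 4, 5, 6, 7, 9) // shares most elements with a
    val c = Set(100, 200, 300, 400)     // shares nothing with a
    println(isCandidate(a, a)) // identical sets always collide in every band
    println(isCandidate(a, b)) // similar sets collide with high probability
    println(isCandidate(a, c)) // dissimilar sets collide with low probability
  }
}
```

Each band is one "spin of the roulette": a single band may miss a similar pair, but trying several bands amplifies the chance that similar sets collide at least once.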

Hello World in SPARK

When it comes to distributed computing, Word Count can be considered the "Hello World"!!

And what better way to start than with a REPL (read-eval-print loop, a command-line shell)?

So build SPARK from source and start spark-shell.cmd (on Windows). If you have not built it yet, here is a guide on how to do it on Windows.

Here is our first Word Count in SPARK. I am assuming you know Scala.

When you open the REPL, the Spark context is available as sc.

scala> val file = sc.textFile("C:\\somefile.txt")

This creates a text file RDD from the local file (note the escaped backslash in the Windows path). You can also create the RDD from HDFS or any other Hadoop-supported filesystem by passing a URI: hdfs://, s3://, kfs://, file://, HTTP, HTTPS, FTP etc.

scala> val words = file.flatMap(_.split(" "))

This splits each line on spaces and flattens the results into a single RDD of words.

scala> words.count()

This will count the number of words.

scala> words.distinct().count()

Count of unique words.

scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)

map converts the words into a sequence of (word, 1) pairs. The reduceByKey transformation, which involves a shuffle stage, then reduces it to a dataset of the form (word, total count of this word). All Transformations are lazy operations, so we need to perform an Action to execute them and return or save the output.

scala> wordCounts.saveAsTextFile("sparkHelloWord")

We want to take this further and sort the result from the word that appears most often to the one that appears least (decreasing order). Unfortunately, we do not have sortByValue, but we have sortByKey. So we have to swap key and value, sortByKey, and then swap them back.

scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _).map{case(x,y) => (y,x)}.sortByKey(false).map{case(i,j) => (j, i)}

The following Action will print the top 5 words in the console

scala> wordCounts.take(5)

In the tutorial above we run multiple operations on words, so it makes sense to cache it in memory right after defining it, as:

scala> words.cache()
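If you do not have SPARK at hand, the same pipeline can be tried on ordinary Scala collections. This is only an analogy sketch with my own names: groupBy plays the role of reduceByKey, everything runs eagerly on one machine, and ties are broken alphabetically so the output is repeatable.

```scala
object WordCountSketch {
  // Top-k words by count, in decreasing order of count.
  def topWords(lines: Seq[String], k: Int): Seq[(String, Int)] = {
    val words = lines.flatMap(_.split(" ")).filter(_.nonEmpty)
    val counts = words.groupBy(identity).map { case (w, ws) => (w, ws.size) }
    // Local collections can sort by value directly, so no key/value swap is needed.
    counts.toSeq.sortBy { case (w, n) => (-n, w) }.take(k)
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("to be or not to be", "to do")
    topWords(lines, 2).foreach(println)
  }
}
```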

#Hello World in SPARK #Word Count in SPARK #Top K words #sortByKey

SPARK streaming and other real time stream processing frameworks

Streams are everywhere: Twitter streams, TCP streams, clickstreams, log streams, event streams. Processing and analyzing them in real time is a daunting task when you have to handle hundreds of megabytes of information per second. There is no point in detecting a potential buyer after the user has left your eCommerce site, or in detecting fraud after the burglar has absconded. Stream processing is useful in ILP (information leak prevention), spam detection, traffic estimation etc.

There are many real time streaming computation frameworks available today, notably Storm from Twitter (BackType), S4 from Yahoo, HStreaming and Flume. To get some idea of their differences (not all of them, though), you can refer to http://www.quora.com/What-would-you-choose-between-Flume-Yahoo-S4-and-Backtype-Twitter-Storm-and-why

In none of the above can you combine real time computation with batch jobs. A streaming system is event driven by nature, and its APIs differ from those of a batch system. In SPARK streaming, however, you can combine both: it provides one API for the entire data analysis.

You can also easily combine streaming data with historical data, e.g., join a stream of events against historical data to make a decision. This is achieved through various stateful "window" operations in SPARK streaming. Counting the frequency of words received in the last minute is as simple as:

val ones = words.map(w => (w, 1))

val freqs = ones.reduceByKey(_ + _)

// Sliding window: the last 60 seconds of data, recomputed every second.
val freqs_60s = freqs.window(Seconds(60), Seconds(1)).reduceByKey(_ + _)

In existing systems, either you need more hardware to achieve fault tolerance, or recovery from faults and stragglers is slow, as with upstream backup (Storm and S4 employ upstream backup). SPARK streaming provides automatic (self healing) recovery from faults and stragglers that is faster than the rest. This is achieved through parallel recovery across nodes:

Checkpoint state datasets periodically

If a node fails/straggles, rebuild its data in parallel on other nodes using the dependency graph

So it saves cost and provides better manageability, performance and consistency across the system. Its ability to recover quickly from faults in a self healing mode is extremely crucial.

Quick view of SPARK streaming...

The key idea behind SPARK streaming is to treat streaming computations as a series of deterministic batch computations on small time intervals. The input data received during each interval is stored reliably across the cluster to form an input dataset for that interval. Once the time interval completes, this dataset is processed via deterministic parallel operations, such as map, reduce and groupBy, to produce new datasets representing program outputs or intermediate state. We store these results in resilient distributed datasets (RDDs), an efficient storage abstraction that avoids replication by using lineage for fault recovery.

A discretized stream or D-Stream groups together a series of RDDs and lets the user manipulate them through various operators. D-Streams provide both stateless operators, such as map, which act independently on each time interval, and stateful operators, such as aggregation over a sliding window, which operate on multiple intervals and may produce intermediate RDDs as state.
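The batch-by-batch idea can be imitated on plain Scala collections. In the sketch below each inner Seq stands for the words received in one time interval. The names are mine and nothing here is the SPARK streaming API; it only mirrors the shape of the computation.

```scala
object WindowSketch {
  // The words received during one batch interval.
  type Batch = Seq[String]

  // Stateless step: counts for a single interval, independent of all others.
  def countBatch(batch: Batch): Map[String, Int] =
    batch.groupBy(identity).map { case (w, ws) => (w, ws.size) }

  // Merge two count tables by adding the counts per word.
  def merge(a: Map[String, Int], b: Map[String, Int]): Map[String, Int] =
    b.foldLeft(a) { case (acc, (w, n)) => acc.updated(w, acc.getOrElse(w, 0) + n) }

  // Stateful step: one result per sliding window of `windowLength` intervals.
  def windowedCounts(batches: Seq[Batch], windowLength: Int): List[Map[String, Int]] =
    batches.map(countBatch).sliding(windowLength).map(_.reduce(merge)).toList

  def main(args: Array[String]): Unit = {
    val batches = Seq(Seq("a", "b"), Seq("a"), Seq("b", "b"))
    windowedCounts(batches, windowLength = 2).foreach(println)
  }
}
```

Because every interval's counts are an immutable value computed deterministically from that interval's input, a lost result can simply be recomputed, which is the intuition behind lineage-based recovery.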

D-Streams

Points to be noted,

This is not yet officially released, but it is available on GitHub ( https://github.com/mesos/spark/tree/streaming ). Version 0.7, containing the alpha of SPARK streaming, is slated to be released soon.

According to some docs, the largest cluster it has been tried on has around 200 nodes. We need to see how it scales with a bigger cluster.

Ideally, sliding windows can be kept at a duration of 100ms, but in practice this needs to be evaluated; keeping it at 2-5 sec appears to make more practical sense. Again, this needs to be validated. The latency of the overall computation depends on the length of the sliding window.

At this point, we need to find out who the major users of SPARK streaming are. Conviva could be one. Who else?

Refer to https://github.com/mesos/spark/blob/streaming/docs/streaming-programming-guide.md for documentation.

NB: Diagrams are copied from Spark presentations.

#stream processing #dstreams #spark streaming #big data streaming #real time streaming #storm #flume #s4

Many functional programming languages give you parallelism for FREE. How?

Let’s try and understand why we should care more about this in the first place. This requires a little bit of historical explanation.

We all know Moore’s Law. It says that the number of transistors on integrated circuits doubles approximately every two years. The prediction has held up pretty well since the late ’70s. A couple of things started changing in the middle of the last decade (around 2005).

First, the performance gains we were used to from Moore’s law started stalling, since hardware makers hit a limit in how far they could increase the clock speed, and,

Secondly, there has been exponential growth of data since around that time.

The first point will help us understand our current topic, and the second leads to another interesting discussion: the rising trajectory of the NoSQL landscape and the problems with the traditional RDBMS. I will discuss the second point later.

…Unable to increase the clock speed, companies like Intel started putting multiple cores in the same processor, and we got Dual Core, Quad Core … machines. Now, existing languages like C, C++ and Java are designed to use threads to handle concurrency. But parallelism is different from concurrency. Simply put, concurrency is how we handle multiple request-responses at once, and parallelism is sharing a large CPU intensive piece of work among multiple processors. On top of that, even with the threaded model it’s difficult to write thread safe code that keeps working over time. Livelocks and deadlocks become a daily affair in maintaining a large application written with threaded code. This is one of the reasons I like Node.js so much: concurrency is handled by the event loop and you don’t have to worry about locking and synchronization.

Mutability becomes a nightmare when you have to share mutable data. If something does not change and you share it, you do not have to protect it, which means you don’t have to worry about safety and synchronization when you share immutable data. This is one of the great aspects of functional programming: immutability. This is what lets your code run on multiple processors. It’s not free, but it becomes trivial to make your code run on multiple cores. In Scala this is generally achieved with immutable collections of objects.

Look at this code in Scala

val list = (1 to 100000).toList

list.map(_ + 42)

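To make the operation run in parallel, one simply invokes the par method on the sequential collection, list. Awesome!! (From Scala 2.13 onward, par lives in the separate scala-parallel-collections module and needs an extra import.)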

list.par.map(_ + 42)

Another important aspect of FP is that functions are first class; they are not second class citizens as in C++ or Java. You can treat them like any other value. Pure functions are side-effect free and always return the same result for the same input, and functions can be higher-order: you can pass a function to a function and you can return a function. Closures follow from this. You take an object and transform it into something else; you don’t change it. Monads!!

Scala harnesses all the power of functional programming and combines it with Object Oriented Programming. It’s a JVM language and fully interoperable with Java libraries.

SPARK is written in Scala, and what Scala does for your code on a multi-core machine, SPARK does across machines in a cluster. Parallelism!!
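A tiny Scala sketch of those ideas, with made-up names: a function that returns a function (a closure over n), a function that takes a function, and a transformation that leaves the original list untouched.

```scala
object HigherOrderSketch {
  // Returns a function; the returned function closes over `n`.
  def adder(n: Int): Int => Int = x => x + n

  // Takes a function and returns a new one that applies it twice.
  def twice(f: Int => Int): Int => Int = x => f(f(x))

  def main(args: Array[String]): Unit = {
    val add42 = adder(42)
    val original = List(1, 2, 3)
    val shifted = original.map(add42) // builds a new list
    println(shifted)
    println(original)                 // the original list is unchanged
    println(twice(add42)(0))
  }
}
```

Since add42 has no side effects and original never changes, it makes no difference in which order, or on which core, the elements are processed.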

#Why Functional Programming #why parallelism is important #scala

BDAS stack has been very well summarized in the above link. Data is as useful as the decision it enables.

All components of the BDAS framework share two goals: increased parallelism and low latency. However, the summary does not mention MLBase, which attempts to greatly simplify the machine learning (ML) and statistical techniques that are key to transforming big data into actionable knowledge. It is still under development and is slated to be released soon.

#machine learning #MLBase #BIG DATA #low latency #summary of BDAS

References for better understanding of SPARK

Making Big Data Analytics Interactive and Real-Time from Seven Nguyen

Transforming Big Data with Spark and Shark

http://www.quora.com/Apache-Hadoop/How-does-Impala-compare-to-Shark

http://spark-project.org/docs/latest/ - Source for all references.

Advanced Spark features

Spark streaming (D-Streams)

https://c59951.ssl.cf2.rackcdn.com/hotcloud12/zaharia.mp4

MESOS

http://www.quora.com/How-does-YARN-compare-to-Mesos

There are not many resources/blogs available, so if you find a good one, please let me know and I'll include it here.

#Spark resources #references for BDAS #video links

How to build SPARK on Windows

It's great to play around with SPARK in local mode. Here is how you can build it for Windows.

Get the latest source from http://spark-project.org/downloads/ . I built Spark 0.6.2 on my Windows 7 laptop.

I had the latest Scala version, 2.10.0, installed on my machine, whereas Spark is built with Scala 2.9.2. Ideally this should not create any problem, but it did not work for me. So I had to uninstall Scala and install version 2.9.2 from scala-lang.org

While installing Scala, remember to install it at a path that does not contain any spaces. For example, C:\Program Files\scala will NOT work as it contains a space. So install it at some other path, like C:\software\scala

Add this path to the Windows PATH environment variable.

Spark also requires sbt (simple build tool), which comes bundled with the Spark code. But when I tried to build, I got the following error:

###########

Error occurred during initialization of VM

Could not reserve enough space for object heap

Could not create the Java virtual machine.

###########

Instead of correcting the error, I downloaded the sbt msi from the sbt site, installed it, and set the Windows PATH to include c:\Program Files\sbt\bin

I ran "sbt package" from the spark-0.6.2 directory and the build compiled successfully.

Before building: if you are working with Hadoop, identify which version of Hadoop you are using and specify it in the SparkBuild.scala file under the project folder.

--------------

// Hadoop version to build against. For example, "0.20.2", "0.20.205.0", or

// "1.0.3" for Apache releases, or "0.20.2-cdh3u5" for Cloudera Hadoop.

val HADOOP_VERSION = "0.20.205.0"

val HADOOP_MAJOR_VERSION = "1"

// For Hadoop 2 versions such as "2.0.0-mr1-cdh4.1.1", set the HADOOP_MAJOR_VERSION to "2"

//val HADOOP_VERSION = "2.0.0-mr1-cdh4.1.1"

//val HADOOP_MAJOR_VERSION = "2"

---------------------------------

After the build succeeds, create spark-env.cmd in the conf directory and set the following environment variables:

-------------------

set SCALA_HOME=<SCALA PATH> (Example: C:\software\scala\scala\bin)

set SPARK_CLASSPATH=C:\......\SPARK\source\spark-0.6.2-sources\spark-0.6.2\core\target\scala-2.9.2\spark-core_2.9.2-0.6.2.jar;C:\......\scala\scala\lib\scala-library.jar;C:\......\scala\scala\lib\scala-compiler.jar;

--------------------

Now you should be able to run the REPL using spark-shell.cmd

If you want to run the master for standalone mode, you can do so by running: run spark.deploy.master.Master

By default you can access the web UI for the master at port 8080. You can refer to the guide for more information.

#BDAS #SPARK #AMPLAB #FAST Processing of Big Data #REPL in SPARK #SCALA REPL in SPARK

Composition is the only way to handle complexity in software.

Brian Beckman

#monads #functional programming #closure #monoids #lambda function

Why do we need another Big data processing engine, like SPARK ?

The current ubiquitous standard for storing and processing very large data is Hadoop. It's an open source Apache project, with storage provided by HDFS (Hadoop Distributed File System) and processing by the Map-Reduce computing paradigm. Fault tolerance in Hadoop is achieved by replication, so if any node in the cluster fails, the system keeps functioning without losing data.

Computing over large data across a cluster of commodity machines involves a good amount of network I/O and disk I/O in each of the Map and Reduce stages. Around 90% of the time is spent on I/O rather than on actual computation, which makes it a high latency system. Hadoop is a batch processing system for very large data, often of the order of several terabytes to petabytes. If the hard disks required to store 40 PB of data were stacked one on top of another, the stack would be twice the height of the Empire State Building!!

Although map-reduce is a great computing paradigm for distributed programming, it's not easy to write programs in map-reduce. Some higher level abstraction was required, and that gave birth to Hive, which provides a declarative language, and Pig (Pig Latin), which is a scripting framework. Both work on top of map-reduce: any job written in Hive or Pig essentially gets converted to map-reduce jobs. That does not solve the high latency problem, though. Imagine cases where you need an answer within a bounded period of time, otherwise the purpose of the analysis is lost, such as fraud detection, spam analysis or face detection. Many algorithms for this kind of work, for machine learning or for computing PageRank are inherently multipass, or iterative in nature. Iterative operations are another difficulty with Hadoop.

So we needed a low latency distributed system where iterative algorithms can be run with ease. Welcome to the UC Berkeley research project called SPARK, which essentially solves these problems. SPARK is part of the Berkeley Data Analytics Stack (BDAS) developed at AMPLab. Other parts of the stack are Mesos and Shark (BlinkDB and MLbase are further parts, not yet released). Mesos is a cluster manager and Shark is Hive on Spark.

SPARK can run in three modes: local, standalone and with Mesos. It can also run on YARN.

SPARK is written in Scala and its code size is 22 KLOC, which is one tenth of Hadoop's.

SPARK has been built with generality in mind. It supports diverse workloads: not just the batch jobs that are run with map-reduce, but also iterative algorithms like machine learning and graph algorithms, with jobs that run from under a second to hours. You can combine a wide array of operators and group them together, all in a fault tolerant way. It's a self healing kind of system, so the user need not worry about fault tolerance.

Some of the beautiful things in SPARK are:

Speed. Shared in-memory immutable datasets greatly reduce network and disk I/O.

Consistency is free, because the datasets are immutable.

Local mode: things can be tried out on a single box (even a Windows machine). Once you are comfortable, you can move to a cluster setup. This shortens the learning curve.

REPL: the ability to try things from the command line in an interactive way. This is a great way to start without writing a full program.

The primary language is Scala, although Java and Python are supported too. Scala is a great language that combines functional programming and OOP. It's a JVM language and can utilize existing Java libraries.

Fault tolerance without replication (through RDDs) and fast recovery time. Self healing works under the hood in case of a failure.

Interoperability: its ability to talk to HDFS, S3, EC2, MPI, even the local filesystem.

Not only map-reduce. Spark's programming model includes mutable accumulators, broadcast variables and immutable RDDs, along with 2 types of operations: lazy Transformations, which create a new dataset from an existing one, and Actions, which return a value to the driver program after running a computation on the dataset.

Persisting or caching a dataset in memory across operations. This significantly increases the speed of subsequent operations on the dataset.

Easy deployment in the cloud, e.g. on Amazon Web Services.
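The split between lazy Transformations and Actions can be imitated with a view on an ordinary Scala collection. This is only an analogy with my own names, not the SPARK API: the map on the view does nothing until sum, playing the role of an Action, asks for a result.

```scala
object LazyViewSketch {
  // Returns (evaluations before the "action", the result, evaluations after it).
  def run(): (Int, Int, Int) = {
    var evaluated = 0
    val data = List(1, 2, 3, 4)
    // "Transformation": only describes the computation, nothing runs yet.
    val doubled = data.view.map { x => evaluated += 1; x * 2 }
    val before = evaluated
    // "Action": forces the computation and returns a value.
    val total = doubled.sum
    (before, total, evaluated)
  }

  def main(args: Array[String]): Unit = {
    val (before, total, after) = run()
    println(s"evaluated before action: $before, total: $total, evaluated after action: $after")
  }
}
```

Calling sum on the same view again would run the map once more, which is the local equivalent of why caching a dataset that is reused across several Actions pays off.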

Projects of this kind give great power to the masses. Storing, processing, analyzing: the entire Big Data stack is available for free as open source projects.

I will discuss things in more detail in my next writeup, including why Scala is great. I will also discuss SPARK streaming, which is slated to be released as alpha in the upcoming 0.7 version.

#BDAS #SPARK #BIG DATA #HADOOP #MESOS #RDD
