Hello World in Spark
When it comes to distributed computing, Word Count can be considered the "Hello World"!
And what better way to start than with a REPL (read-eval-print loop, an interactive command-line shell)?
So build Spark from source and start spark-shell.cmd (on Windows). If you have not built it yet, here is a guide about how to do it on Windows.
Here is our first Word Count in Spark. I am assuming you know Scala.
When you open the REPL, the Spark context is already available there as sc.
scala> val file = sc.textFile("C:\\somefile.txt")
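This creates a text file RDD from the local file. Note the doubled backslash: inside a Scala string literal a single backslash starts an escape sequence. You can also create the RDD from HDFS or any other Hadoop-supported file system by passing a URI with a scheme such as hdfs://, s3://, kfs://, file://, HTTP, HTTPS or FTP.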
scala> val words = file.flatMap(_.split(" "))
This splits every line on spaces and flattens the results into a single RDD of words.
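To see the difference between map and flatMap without a cluster, the same step can be sketched on an ordinary Scala List. This is only a local analogy, not Spark itself; the object name FlatMapSketch and the two sample lines are made up for illustration.

```scala
// Local analogy on a plain Scala List (no Spark involved).
// FlatMapSketch and the sample lines are hypothetical.
object FlatMapSketch {
  val lines: List[String] = List("hello spark", "hello world")

  // map keeps one result per line, so we get a nested structure
  val nested: List[List[String]] = lines.map(_.split(" ").toList)

  // flatMap splits each line and flattens everything into one list of words
  val words: List[String] = lines.flatMap(_.split(" "))

  def main(args: Array[String]): Unit = {
    println(nested) // List(List(hello, spark), List(hello, world))
    println(words)  // List(hello, spark, hello, world)
  }
}
```

On an RDD the idea is the same, except that the elements are spread over partitions instead of sitting in one List.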
scala> words.count()
This will count the number of words.
scala> words.distinct().count()
Count of unique words.
scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
map converts each word into a (word, 1) pair. The reduceByKey transformation, which triggers a shuffle stage, then reduces these pairs to a dataset of (word, total count of this word) pairs. All transformations are lazy, so nothing has been computed yet. We need to perform an action to execute them and either return the output to the driver program or write it out.
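What map followed by reduceByKey computes can be sketched on a local collection. The reduceByKey helper below is a hypothetical stand-in written for this example with groupBy; it is not Spark's implementation and does no shuffling. The sample words are made up.

```scala
// Local sketch of the (word, 1) + reduceByKey idea on plain Scala collections.
// ReduceByKeySketch and its reduceByKey helper are hypothetical, not Spark APIs.
object ReduceByKeySketch {
  // Group the pairs by key, then combine all values of a key with f
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  val words: List[String] = List("a", "b", "a", "c", "b", "a")

  // Same first step as in the tutorial: every word becomes (word, 1)
  val pairs: List[(String, Int)] = words.map(w => (w, 1))

  // Same second step: add up the 1s per word
  val counts: Map[String, Int] = reduceByKey(pairs)(_ + _)

  def main(args: Array[String]): Unit =
    counts.toList.sortBy(_._1).foreach(println) // (a,3) (b,2) (c,1), one per line
}
```

Unlike this eager local version, the RDD version only records the steps and runs them when an action is called.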
scala> wordCounts.saveAsTextFile("sparkHelloWord")
Now we want to go a step further and sort the result from the word that appears most often to the one that appears least (decreasing order). Unfortunately, we do not have sortByValue, but we do have sortByKey. So we have to swap key and value, call sortByKey, and then swap them back again.
scala> val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _).map{case(x,y) => (y,x)}.sortByKey(false).map{case(i,j) => (j, i)}
The following action returns the first 5 entries, which are now the 5 most frequent words, and the REPL prints them to the console.
scala> wordCounts.take(5)
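The swap, sort, swap-back pattern and the final take can be tried on an ordinary Scala List as well. This is a local sketch only: the counts are invented, and sortBy with a reversed Ordering stands in for sortByKey(false).

```scala
// Local sketch of "swap, sort descending by key, swap back" on a plain List.
// SortByCountSketch and the sample counts are hypothetical.
object SortByCountSketch {
  val counts: List[(String, Int)] = List(("spark", 3), ("hello", 5), ("world", 1))

  val sorted: List[(String, Int)] =
    counts
      .map { case (word, n) => (n, word) }  // make the count the key
      .sortBy(_._1)(Ordering[Int].reverse)  // descending, like sortByKey(false)
      .map { case (n, word) => (word, n) }  // restore (word, count)

  // take(n) returns the first n elements, here the most frequent words
  val top2: List[(String, Int)] = sorted.take(2)

  def main(args: Array[String]): Unit =
    println(top2) // List((hello,5), (spark,3))
}
```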
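In the tutorial above we run several operations on words, so it makes sense to cache it in memory. cache() is lazy as well: it only marks the RDD, which is then kept in memory the first time an action computes it, so call it right after defining words: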
scala> words.cache()
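Why caching pays off can be illustrated with a plain-Scala analogy. This is not how Spark implements cache(); it only mimics the effect. The computeWords function plays the role of an uncached RDD that is recomputed by every action, and a val holding its result plays the role of the cached RDD. All names here are hypothetical.

```scala
// Plain-Scala analogy for caching (not Spark's actual mechanism).
// CacheSketch, computeWords and the sample lines are hypothetical.
object CacheSketch {
  var computations = 0
  val lines: List[String] = List("hello spark", "hello world")

  // Stands in for an uncached RDD: every use recomputes the words
  def computeWords(): List[String] = {
    computations += 1
    lines.flatMap(_.split(" "))
  }

  // Without caching: two "actions", two computations
  val total: Int = computeWords().size
  val unique: Int = computeWords().distinct.size
  val uncachedComputations: Int = computations

  // With caching: compute once, reuse the stored result for both "actions"
  computations = 0
  val cachedWords: List[String] = computeWords()
  val cachedTotal: Int = cachedWords.size
  val cachedUnique: Int = cachedWords.distinct.size
  val cachedComputations: Int = computations

  def main(args: Array[String]): Unit =
    println(s"uncached: $uncachedComputations, cached: $cachedComputations")
}
```

In the tutorial, count(), distinct().count() and the word count all start from words, so without caching the file would be read and split again for each of them.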