deliciousdata. @blairhudson - Tumblr Blog

Getting started with Spark, using Docker and pySpark (Part 1)

You've probably been hearing a lot about Hadoop and the in-memory computing framework Spark. Getting started can be hard, but with a little help from Docker, your first foray into the wild new world of big data analytics should be just that little bit easier.

Getting Started

Before we begin, a few notes:

I'm running Docker directly on Ubuntu 15.04 but you can pretty well install Docker on anything these days. The instructions and commands listed herein should work identically no matter how you are running Docker.

The Docker image we'll be using will take some time to download, so you might want to go grab a coffee (or read ahead) while it's pulling down from the hub.

You'll often see Spark run with Scala. We'll be using Python to keep things nice and simple.

We'll be using smungee's pySpark image for our Spark instance. The image conveniently comes with numpy, scipy and scikit-learn pre-installed. Neat.

Step 1: Prepare Docker

Before we can run the image, we need to pull it down. Run the following command and find something else to do for a little while. (I, for example, chose to start writing this article.)

docker pull smungee/pyspark-docker:latest

If you're new to Docker, the pull command simply caches a copy of the specified image (smungee/pyspark-docker) from the Docker Hub.

About the Image

If you check out the Dockerfile, you'll notice that it is built on top of SequenceIQ's Spark image (sequenceiq/spark). SequenceIQ was acquired by Hortonworks in April 2015, and together they form one of the largest Hadoop supporters.

Step 2: Start the Docker container

Welcome back. Now we're ready to go. Run this command:

docker run -i -t -h sandbox --name pyspark-sandbox smungee/pyspark-docker:latest /etc/bootstrap.sh -bash

Helpful Tips

To exit bash and detach the container at any time, simply hold CTRL and press P then Q (then let go of CTRL).

If you want to reattach, run:

docker attach pyspark-sandbox

To stop the container once detached, run:

docker stop pyspark-sandbox

If you're having trouble stopping the container gracefully, try:

docker kill pyspark-sandbox

To restart the container once stopped, run:

docker restart pyspark-sandbox

And finally, to remove the container once stopped, run (this will not remove the cached image):

docker rm pyspark-sandbox

If you're new to Docker, here's a breakdown of what's going on in the docker run command from Step 2:

-i -t: in combination these allow us to connect to the Docker container

-h sandbox: sets the container computer's hostname

--name pyspark-sandbox: sets the container's name (to make it easier for us to reference)

smungee/pyspark-docker:latest: the image to run (we're smart and already cached a copy in Step 1!)

/etc/bootstrap.sh -bash: any commands after the image name are passed through to the container, telling the container to execute the specified script (which starts the required nodes and leaves us with a bash shell)
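To make the breakdown above concrete, here is a minimal Python sketch that assembles the same command from its named parts. The build_run_command helper is invented for illustration and is not part of Docker or of the image; it only builds the argument list and does not start anything.

```python
import shlex


def build_run_command(image, name, hostname, command):
    """Assemble a `docker run` argument list from its named parts."""
    args = ["docker", "run"]
    args += ["-i", "-t"]          # interactive, with a terminal attached
    args += ["-h", hostname]      # hostname inside the container
    args += ["--name", name]      # container name used by attach/stop/rm
    args.append(image)            # image to start the container from
    args += shlex.split(command)  # everything after the image runs in the container
    return args


run_args = build_run_command(
    image="smungee/pyspark-docker:latest",
    name="pyspark-sandbox",
    hostname="sandbox",
    command="/etc/bootstrap.sh -bash",
)
print(" ".join(run_args))
```

If you wanted to launch the container from Python rather than from the shell, the same list could be handed to subprocess.call(run_args).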

Step 3: Start pySpark inside Docker container

Now that we have a bash shell available from inside our container, we can get the pySpark engines rolling with the following command (make sure to run it inside the Docker container), giving us a fancy ASCII Spark logo and a useful Python shell:

/usr/local/spark/bin/pyspark

To exit the Python shell at any time, simply press CTRL + D. (CTRL + Z only suspends the shell rather than exiting it.)

Step 4: Does it work?

Alright, from within Python, within pySpark, within our Docker container, within your Terminal prompt (potentially within a Virtual Machine?), we can now execute Python commands.

For those new to Docker, this means you can execute any Python commands, including the simple:

x = 1
print x

Which simply outputs '1'.

Smungee suggests the following simple Spark program to verify the installation:

data = [1, 2, 3, 4, 5]
sc.parallelize(data).count()

Which should of course ultimately output '5'.
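If you're wondering what that one-liner is doing conceptually, the sketch below imitates it in plain Python, with no Spark involved: the data is split into partitions, each partition is counted on its own, and the partial counts are added up. The function names and the round-robin split are assumptions made for illustration; Spark chooses its own partitioning.

```python
def split_into_partitions(data, num_partitions):
    """Deal the items out into num_partitions roughly equal chunks."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def distributed_count(data, num_partitions=2):
    """Count each partition separately, then add the partial counts."""
    partitions = split_into_partitions(data, num_partitions)
    partial_counts = [len(part) for part in partitions]  # the per-worker step
    return sum(partial_counts)                           # the combine step


data = [1, 2, 3, 4, 5]
print(distributed_count(data))
```

In real Spark the per-partition step runs on the workers and only the small partial results travel back to the driver, which is what makes the pattern scale.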

And to verify scikit-learn, try:

from sklearn import svm, datasets
clf = svm.SVC(gamma=0.001, C=100.)
digits = datasets.load_digits()
clf.fit(digits.data[:-1], digits.target[:-1])

Where you can expect an output similar to:

SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.001, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)

Step 5: Taking it further

To see all of the functions available in pySpark, from within Python run:

help(pyspark)

(To exit help simply press Q.)

There are many further examples provided in the official Apache Spark git repo. Part 2 of this post will show you exactly how to get some of these examples running with Docker.

#Spark #Apache #Docker #Python #Hadoop #pySpark

Performing sentiment analysis with Twitter Streaming API, Python and Elasticsearch

Twitter users publish around 500 million tweets per day. The tweets, which could be about anything -- including a product, service, company or some other point of interest -- collectively become a massive and everflowing stream of semi-structured data.

This data, once captured and analysed, can provide an ongoing view of public sentiment towards a product, company or topic, measure reactions to a press release, track impacts to perception from a marketing campaign or create alerts for customer service teams to respond to feedback and complaints.

This recipe is going to explore using the Twitter Streaming API, Docker and Kitematic (a fancy new Docker UI for OS X), Elasticsearch, Python and Kibana. The Python portions of this guide are based on a great post from Real Python published last November. This should take about an hour to cook, maybe less if you're more familiar with Docker. Let's get started...

Set up Elasticsearch with (or without) Kitematic

You may opt to set up your own Docker environment, covered previously, and skip to the next section. Otherwise, download and install Kitematic. Kitematic is a UI replacement for Boot2Docker, which allows OS X users to install and run Docker containers through a lightweight Linux virtual machine. Kitematic simplifies the installation, configuration and management of containers during development.

Create an Elasticsearch container

In Kitematic, search for elasticsearch and create a container based on the official library Dockerfile. Alternatively for those not using Kitematic, run docker run -p 9200:9200 -p 9300:9300 elasticsearch. This will expose Elasticsearch on a port determined by Kitematic (e.g. http://192.168.99.100:49156), or on http://localhost:9200.

Obtain a Twitter API key and set up a config file

You'll need a Twitter account to create an app on Twitter's developer portal. Once done, the 'Keys and Access Tokens' tab will provide various keys. Save these in a file called config.py in the following format. Important -- do not share these keys or store them in version control. In this file also include the Elasticsearch host and port, and the keyword you would like to capture. For this demo, I've captured tweets containing the keyword "Easter".

## Twitter API
consumer_key = "your_consumer_key"
consumer_secret = "your_consumer_secret_key"
access_token = "your_access_token"
access_token_secret = "your_access_token_secret"

## Elasticsearch
elasticsearch_uri = "http://192.168.99.100:49156"

## Stream
keyword = "easter"
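Since these keys should never be shared or committed, one alternative (not the approach used in this post) is to have config.py read its values from environment variables. The sketch below shows the idea. The setting helper and the variable names TWITTER_CONSUMER_KEY, ELASTICSEARCH_URI and STREAM_KEYWORD are made up for illustration.

```python
import os


def setting(name, default=None):
    """Return the environment variable `name`, or `default` if it is unset."""
    value = os.environ.get(name, default)
    if value is None:
        raise RuntimeError("missing required setting: " + name)
    return value


# Hypothetical variable names; export real values in your shell before running.
# The placeholder below is a demo value so the sketch runs as-is.
os.environ.setdefault("TWITTER_CONSUMER_KEY", "your_consumer_key")

consumer_key = setting("TWITTER_CONSUMER_KEY")
elasticsearch_uri = setting("ELASTICSEARCH_URI", "http://localhost:9200")
keyword = setting("STREAM_KEYWORD", "easter")
```

Because the names it defines match those in the file above, a script that does from config import * would keep working unchanged.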

Install Python package dependencies

OS X users will already have Python 2.7 installed by default. The script relies on the following modules:

tweepy: to connect to the Twitter Streaming API

textblob: to analyse tweet sentiment

elasticsearch: to connect to our Elasticsearch instance

config: to load the config.py (this is the file you created above, so there is nothing to install)

datetime: to convert Twitter's bad dates into ISO 8601 format (part of the Python standard library, so there is nothing to install)

To install the three third-party packages in OS X, open Terminal.app and run:

curl -O https://bootstrap.pypa.io/get-pip.py
sudo python get-pip.py
sudo pip install tweepy textblob elasticsearch

This will install pip, a Python package manager, as well as the third-party packages above.

Connect it all together

Save the following code into a file called sentiment.py in the same directory as config.py from above:

import json
from tweepy.streaming import StreamListener
from tweepy import OAuthHandler
from tweepy import Stream
from textblob import TextBlob
from elasticsearch import Elasticsearch
from datetime import datetime

# import twitter keys and tokens
from config import *

# create instance of elasticsearch
es = Elasticsearch([elasticsearch_uri])


class TweetStreamListener(StreamListener):

    # on success
    def on_data(self, data):

        try:
            # decode json
            dict_data = json.loads(data)

            # pass tweet into TextBlob
            tweet = TextBlob(dict_data["text"])

            # determine if sentiment is positive, negative, or neutral
            if tweet.sentiment.polarity < 0:
                sentiment = "negative"
            elif tweet.sentiment.polarity == 0:
                sentiment = "neutral"
            else:
                sentiment = "positive"

            # fix the timestamp format
            # https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior
            timestamp = datetime.strptime(dict_data["created_at"].replace("+0000 ",""), "%a %b %d %H:%M:%S %Y").isoformat()

            # output sentiment
            print timestamp + ":" + sentiment

            # add text and sentiment info to elasticsearch
            es.index(index="sentiment-demo",
                     doc_type="tweet",
                     body={"source": dict_data["source"],
                           "author": dict_data["user"]["screen_name"],
                           "location": dict_data["user"]["location"],
                           "followers": dict_data["user"]["followers_count"],
                           "timestamp": timestamp,
                           "message": dict_data["text"],
                           "polarity": tweet.sentiment.polarity,
                           "subjectivity": tweet.sentiment.subjectivity,
                           "sentiment": sentiment})

        except:
            # tweet skipped due to processing error
            print "processing exception"
            print data

        return True

    # on failure
    def on_error(self, status):
        print status


if __name__ == '__main__':

    # create instance of the tweepy tweet stream listener
    listener = TweetStreamListener()

    # set twitter keys/tokens
    auth = OAuthHandler(consumer_key, consumer_secret)
    auth.set_access_token(access_token, access_token_secret)

    # create instance of the tweepy stream
    stream = Stream(auth, listener)

    # search twitter for the keyword
    stream.filter(track=[keyword])

Once saved, in Terminal.app run python sentiment.py. If the Elasticsearch container is running, config.py exists and details are accurate, the requisite Python packages are installed and an Internet connection is available, the command line should begin outputting a date and sentiment rating (positive, negative, neutral) for each tweet captured using your keyword as it is published. Leave this process running and the tweets will be captured indefinitely.
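If you'd like to check the two transformations inside on_data without connecting to Twitter, they can be pulled out into small functions and tried on their own. This is a sketch of the same logic as sentiment.py. The function names label_sentiment and to_iso_timestamp are my own, and the sample created_at value is made up.

```python
from datetime import datetime


def label_sentiment(polarity):
    """Bucket a TextBlob-style polarity score (-1.0 to 1.0) into a label."""
    if polarity < 0:
        return "negative"
    if polarity == 0:
        return "neutral"
    return "positive"


def to_iso_timestamp(created_at):
    """Convert Twitter's created_at string into an ISO 8601 timestamp."""
    without_offset = created_at.replace("+0000 ", "")
    parsed = datetime.strptime(without_offset, "%a %b %d %H:%M:%S %Y")
    return parsed.isoformat()


# A made-up created_at value in the format the script expects.
print(to_iso_timestamp("Sun Apr 05 10:30:00 +0000 2015"))  # 2015-04-05T10:30:00
print(label_sentiment(-0.4))                                # negative
```

Note that only a polarity of exactly 0 counts as neutral, so even a faintly positive or negative score is labelled accordingly.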

Visualise with Kibana

To descriptively show the sentiment analysis results, install Kibana. In Kitematic, search for 'Kibana4' and create a container using the Dockerfile maintained by user 'marcbachmann'. Non-Kitematic users can instead run the following (replacing the Elasticsearch variable with your own address):

docker run -e ELASTICSEARCH=http://192.168.99.100:49156 -P marcbachmann/kibana4

If you are using Kitematic, once setup is completed, go to the Settings pane and add an Environment Variable called 'ELASTICSEARCH' and set it to your Elasticsearch URI (e.g. http://192.168.99.100:49156).

Go ahead and open your browser to Kibana's local address (either http://localhost:5601 or set by Kitematic).

Configure an index pattern using sentiment-demo and the field name timestamp. Now you can start to explore the data in the Discover tab. The following details are captured for each tweet (and can be modified by changing sentiment.py):

Author (Twitter account)

Timestamp of publishing

Tweet message

Count of followers at time of tweet

Tweet source (e.g. Twitter for iPhone)

Author location on profile at time of tweet

Message polarity and sentiment

Message subjectivity
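As a rough sketch of where each of those fields comes from, the function below builds the same document body that sentiment.py sends to Elasticsearch, using an invented sample tweet. The build_document name and the sample values are illustrative only. The dictionary keys match the ones used in sentiment.py.

```python
def build_document(tweet, timestamp, polarity, subjectivity, sentiment):
    """Pick the fields listed above out of a decoded tweet dictionary."""
    user = tweet["user"]
    return {
        "source": tweet["source"],
        "author": user["screen_name"],
        "location": user["location"],
        "followers": user["followers_count"],
        "timestamp": timestamp,
        "message": tweet["text"],
        "polarity": polarity,
        "subjectivity": subjectivity,
        "sentiment": sentiment,
    }


# Invented sample data, shaped like the keys sentiment.py reads.
sample_tweet = {
    "text": "Happy Easter everyone!",
    "source": "Twitter for iPhone",
    "user": {"screen_name": "example_user", "location": "Sydney", "followers_count": 42},
}
doc = build_document(sample_tweet, "2015-04-05T10:30:00", 0.8, 1.0, "positive")
print(sorted(doc))
```

Adding or removing a key here is all it takes to change what shows up in Kibana's Discover tab.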

To build a dashboard in Kibana, first create a visualisation on the Visualisation tab. Here we will create a breakdown of "Easter" sentiment over time:

Select 'Vertical bar chart'

For the y-axis metric, select 'Count' aggregation

For the x-axis buckets, select 'Date Histogram' aggregation on field 'timestamp' with interval 'Minute'

Add sub-aggregation, select 'Split Bars'

Sub-aggregation as 'Terms', on field 'sentiment'

In View Options, make sure Bar Mode is 'stacked'. You can also scale the y-axis.

Click 'Apply' and the save disk icon, saving as 'Sentiment over time'.
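For intuition about what this chart computes, the steps above amount to counting documents per minute and, within each minute, per sentiment label. Here is a plain-Python sketch of that aggregation over a few invented documents. It is not how Elasticsearch implements it, and the sentiment_per_minute name is made up.

```python
from collections import Counter


def sentiment_per_minute(docs):
    """Count documents per (minute, sentiment) pair."""
    counts = Counter()
    for doc in docs:
        minute = doc["timestamp"][:16]  # "YYYY-MM-DDTHH:MM" from an ISO 8601 string
        counts[(minute, doc["sentiment"])] += 1
    return counts


# Invented sample documents for illustration.
docs = [
    {"timestamp": "2015-04-05T10:30:05", "sentiment": "positive"},
    {"timestamp": "2015-04-05T10:30:40", "sentiment": "neutral"},
    {"timestamp": "2015-04-05T10:30:59", "sentiment": "positive"},
    {"timestamp": "2015-04-05T10:31:02", "sentiment": "negative"},
]
for (minute, sentiment), count in sorted(sentiment_per_minute(docs).items()):
    print(minute, sentiment, count)
```

Each distinct minute corresponds to one bar on the x-axis, and each sentiment count within it to one stacked segment.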

Now in the Dashboard tab click the plus button to add your visualisation and re-size to fit. Save the dashboard as 'Easter Sentiment'. You can then add other visuals and metrics to the dashboard. In the top right menu bar, set refresh interval to '5 seconds' and you can watch the sentiment dashboard update as new data streams in.

In the time that I spent putting this tutorial together, over 70,000 "Easter" tweets were captured, mainly positive and neutral, with a few negative tweets highlighting international conflicts, holiday illness and reluctance to return to work following the long weekend.

That's all for now. Stay tuned to see a tutorial covering creating a Dockerfile to encapsulate this demo, and doing further analysis of the tweets and sentiment measures with R.

#docker #twitter #streamingapi #sentiment #analysis #kitematic #elasticsearch #python #kibana

Bernard Marr's analogy likening Data Scientists to pastry chefs fits in well with the deliciousdata theme. Marr raises six key points, which in summary are:

Follow a strategy/recipe for creating your analyses/pastries.

Use the highest quality data/ingredients.

Follow the strategy/recipe through.

Verify/taste-test the result.

Present the result beautifully, ready for action/eating.

Follow up and confirm the usefulness/deliciousness.

Sounds tasty!

#deliciousdata #datachef

Cooking with Docker: useful tips for beginners

Last week I wrote a quick introduction to getting started with Docker -- a modern replacement for traditional application virtualisation. I covered creating a Docker testing environment and using it to quickly start an instance of R's web application framework called Shiny.

I also attended the Sydney Docker March meetup at Atlassian, hearing about the experiences and challenges with using Docker in production.

Since then, I've compiled this list of the various helpful bits and pieces I've picked up from working with Docker:

Skip 'sudo' for commands

You may have noticed in my last post that the 'sudo' authorisation was prepended to each Docker command. To skip this, add your account to the 'docker' user group. Create the group if it does not exist, add your account (replace 'deliciousdata' with your own) and restart Docker:

sudo groupadd docker
sudo gpasswd -a deliciousdata docker
sudo service docker.io restart

You will need to log out for the changes to take effect.

See running containers

As you try out various different packages on the Docker registry, you'll be starting many containers. Track what's running easily with ps:

docker ps

The -a flag shows all containers, not just those that are running:

docker ps -a

Customise container names

You can reference your containers in Docker commands by specifying a custom name. This way, instead of referring to an ID such as '191baf3d0d19' or trying to remember an auto-generated name (such as 'cranky_hawking' or 'trusting_brown') when starting, stopping or attaching a container, you can use 'shiny', for example.

docker run --name shiny rocker/shiny

Detach a container without it stopping

When attached to a container, you can easily detach from it with CTRL-C; however, this will also stop the container. Running the container with the -t flag will allow CTRL-C to detach it without stopping it:

docker run --name shiny -t rocker/shiny

Run a container unattached

You can run a Docker container without attaching its stdout using the -d flag:

docker run --name shiny -t -d rocker/shiny

Link two containers

To allow a Docker container to see another, the --link flag can be used. This command runs a Shiny container linked to an existing Redis container (Redis is a NoSQL key-value server, and you can use it with Shiny through the rredis package):

docker run --name shiny --link redis:redis rocker/shiny

Map a container directory to the host filesystem

The -v flag allows you to map a directory on the host filesystem to a directory in the container, written as /host/os/path/:/path/in/container/. This was particularly useful for exposing the Shiny server directory so that the served application can be modified.

docker run --name shiny -p 3838:3838 -v /srv/shiny-server/:/srv/shiny-server/ rocker/shiny
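The host-first, container-second ordering is easy to get backwards, and it applies to -p as well as -v. The sketch below spells it out by building the command above from two tiny helpers. The names volume_flag and port_flag are invented for this example, and nothing is executed.

```python
def volume_flag(host_path, container_path):
    """Build the -v arguments: host path first, container path second."""
    return ["-v", host_path + ":" + container_path]


def port_flag(host_port, container_port):
    """Build the -p arguments: host port first, container port second."""
    return ["-p", "{}:{}".format(host_port, container_port)]


command = (
    ["docker", "run", "--name", "shiny"]
    + port_flag(3838, 3838)
    + volume_flag("/srv/shiny-server/", "/srv/shiny-server/")
    + ["rocker/shiny"]
)
print(" ".join(command))
```

Changing only the first argument of port_flag, say to 8080, would serve the same Shiny app on a different port of the host while the container keeps listening on 3838.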

That's all for now.

This week Docker will be two years old, and Optiver is hosting a birthday party "open-source-a-thon" to help non-programmers, beginners and Docker-aficionados contribute to the project. Happy Birthday Docker!

#docker #dockerparty #tips

Deploying prototype analytics applications with Docker

I recently caught wind of Docker -- a modern virtualisation technology -- and after a little research and investigation, I've set out to discover what all the buzz is about. Docker, which describes itself as "a platform for developers and sysadmins to develop, ship and run applications" combines a container-based virtualisation engine and a community platform for application stack sharing.

One of the many promises of Docker is the lightweight nature of applications deployed with the engine. Compared to a more traditional hypervisor-based approach, where one might spin up a virtual appliance for their database server and a second for their webserver, Docker claims to remove the need for a Guest OS for each application, and instead serve applications in a manner not too far removed from Apple's approach to sandboxing apps in iOS (and more recently, OS X).

Sandboxed, or "containerised", applications share a single Host OS, but otherwise run standalone with their own dependencies, which means that application stacks can be built, tested and maintained more easily as individual components. Managing dependencies at the application level, like traditional virtualisation, also means avoiding lib version conflicts -- great!

The one major caveat is that the Host OS is limited to Linux environments for the time being -- though apparently the upcoming Windows Server 2015 will support Docker for Windows-based application environments.

With all that in mind, onto our first recipe...

A simple Docker salad with Shiny dressing

Preparation time 1 hour, serves 1 Shiny web-app

Ingredients

VirtualBox 4.3.24 (Windows, OS X, Ubuntu 64-bit)

Ubuntu 14.04 AMD64 Desktop (ISO) (We should ideally be using a server edition of Linux here, but I only had the desktop image handy at the time of writing.)

An Internet connection

Steps

Download the ingredients listed above and install VirtualBox.

Using the Ubuntu ISO, create an Ubuntu VM to host our Docker install. I’d recommend allocating at least 2GB of RAM. For this tutorial I’ve created a machine called docker with admin username deliciousdata. Click through the installation prompts, grab a coffee and come back in about 15 minutes.

Open a Terminal session and install Docker:

sudo apt-get update
sudo apt-get install docker.io

You can test your install by running:

sudo docker run -i -t ubuntu /bin/bash

This will download the Ubuntu base image and open a bash shell within the container. Go ahead and run the following commands and marvel at your creation:

echo 'Hello, world!'
exit

A quick search on the Docker Registry reveals a Shiny Server Dockerfile. Shiny is an R package released by the creators of RStudio for web application development in R. It makes building pretty great data visualisations a much simpler process.

To run a Shiny container, run the following command (this may take a while depending on your Internet connection, time for another coffee?):

sudo docker run --rm -p 3838:3838 rocker/shiny

Docker will pull the rocker/shiny image, which already has its dependencies installed, and fire up Shiny Server on port 3838. Once completed (Docker will stop displaying status updates), you should be able to open Firefox within your Ubuntu VM and see Shiny running at http://localhost:3838/.

To stop the Docker container, run:

sudo docker ps

and locate the Container ID (in my case, 191baf3d0d19), and then run:

sudo docker stop 191baf3d0d19

(http://localhost:3838/ should no longer respond.)

The next step is to customise the Shiny web app with one of the examples from RStudio's GitHub.

sudo docker run --rm -p 3838:3838 -v /srv/shiny-server/:/srv/shiny-server/ rocker/shiny

This will create the /srv/shiny-server/ directory on the Host OS (if it does not already exist), and map it to the folder within the Shiny Server container. In a new Terminal, run the following:

cd /srv/shiny-server/
sudo wget https://raw.githubusercontent.com/rstudio/shiny-examples/master/008-html/server.R
sudo mkdir www
cd www
sudo wget https://raw.githubusercontent.com/rstudio/shiny-examples/master/008-html/www/index.html

Go ahead and visit http://localhost:3838/ once more and see the Shiny app you just concocted with Docker in all of its glory.

That completes this tutorial. Look out for further posts about my journey with Docker, and an upcoming post about customising HTML webapps with RShiny.

Tomorrow I'll be attending the March Docker Meetup at Atlassian in Sydney, where Tim Robinson (founder of Volt Grid) will be presenting "Intro to Docker in Production (from an Ops perspective)" and Ruben Rubio Rey (CTO of manageacloud.com) will present his talk on "How to use configuration management systems to manage Docker containers".

#docker #virtualisation #shiny-server #tutorial #dockermeetup
