Getting started with Spark, using Docker and pySpark (Part 1)
You've probably been hearing a lot about Hadoop and the in-memory computing framework Spark. Getting started can be hard, but with a little help from Docker, your first foray into the wild new world of big data analytics should be just that little bit easier.
Getting Started
Before we begin, a few notes:
I'm running Docker directly on Ubuntu 15.04 but you can pretty well install Docker on anything these days. The instructions and commands listed herein should work identically no matter how you are running Docker.
The Docker image we'll be using will take some time to download, so you might want to go grab a coffee (or read ahead) while it's pulling down from the hub.
You'll often see Spark run with Scala. We'll be using Python to keep things nice and simple.
We'll be using smungee's pySpark image for our Spark instance. The image conveniently comes with numpy, scipy and scikit-learn pre-installed. Neat.
Step 1: Prepare Docker
Before we can run the image, we need to pull it down. Run the following command and find something else to do for a little while. (I, for example, chose to start writing this article.)
docker pull smungee/pyspark-docker:latest
If you're new to Docker, the pull command simply caches a copy of the specified image (smungee/pyspark-docker) from the Docker Hub.
About the Image
If you check out the Dockerfile, you'll notice that it is built on top of SequenceIQ's Spark image (sequenceiq/spark). SequenceIQ was acquired by Hortonworks in April 2015, and together they form one of the largest Hadoop supporters.
Step 2: Start the Docker container
Welcome back. Now we're ready to go. Run this command:
docker run -i -t -h sandbox --name pyspark-sandbox smungee/pyspark-docker:latest /etc/bootstrap.sh -bash
Helpful Tips
To exit bash and detach the container at any time, simply hold CTRL and press P then Q (then let go of CTRL).
If you want to reattach, run:
docker attach pyspark-sandbox
To stop the container once detached, run:
docker stop pyspark-sandbox
If you're having trouble stopping the container gracefully, try:
docker kill pyspark-sandbox
To restart the container once stopped, run:
docker restart pyspark-sandbox
And finally, to remove the container once stopped, run (this will not remove the cached image):
docker rm pyspark-sandbox
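If you're new to Docker, here's a breakdown of what's going on in the docker run command:
-i -t: -i keeps the container's standard input open and -t allocates a terminal; in combination these allow us to interact with the Docker container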
-h sandbox: sets the container computer's hostname
--name pyspark-sandbox: sets the container's name (to make it easier for us to reference)
smungee/pyspark-docker:latest: the image to run (we're smart and already cached a copy in Step 1!)
/etc/bootstrap.sh -bash: any commands after the image name are passed through to the container, telling the container to execute the specified script (which starts the required nodes and leaves us with a bash shell)
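If it helps to see how those pieces fit together, here's a minimal Python sketch that assembles the same command from its parts. The helper name build_run_command is my own invention for illustration (it isn't part of Docker or the image), and the sketch only prints the command rather than running it.

```python
import subprocess


def build_run_command(image, name, hostname, command):
    """Assemble the argument list for an interactive `docker run` (hypothetical helper)."""
    return (
        ["docker", "run",
         "-i", "-t",        # keep stdin open and allocate a terminal
         "-h", hostname,    # hostname inside the container
         "--name", name,    # container name used by attach/stop/rm
         image]             # image to start
        + list(command)     # everything after the image runs inside the container
    )


argv = build_run_command(
    image="smungee/pyspark-docker:latest",
    name="pyspark-sandbox",
    hostname="sandbox",
    command=["/etc/bootstrap.sh", "-bash"],
)
print(" ".join(argv))

# To actually launch the container from a real terminal, uncomment:
# subprocess.call(argv)
```

The printed line should match the docker run command from Step 2 word for word.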
Step 3: Start pySpark inside Docker container
Now that we have a bash shell available from inside our container, we can get the pySpark engines rolling with the following command (make sure to run it inside the Docker container), giving us a fancy ASCII Spark logo and a useful Python shell:
/usr/local/spark/bin/pyspark
To exit the Python shell at any time, simply press CTRL + D (or type exit()).
Step 4: Does it work?
Alright, from within Python, within pySpark, within our Docker container, within your Terminal prompt (potentially within a Virtual Machine?), we can now execute Python commands.
For those new to Docker, this means you can execute any Python commands, including the simple:
x = 1
print(x)
Which simply outputs '1'.
Smungee suggests the following simple Spark program to verify the installation:
data = [1, 2, 3, 4, 5]
sc.parallelize(data).count()
Which should of course ultimately output '5'.
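If you'd like something slightly meatier than a count, here's a minimal sketch that chains a couple of RDD operations. It assumes the sc variable that the pySpark shell provides, and sum_of_even_squares is a name I've made up for illustration.

```python
def sum_of_even_squares(sc, data):
    """Distribute data, keep the even numbers, square them and add them up."""
    rdd = sc.parallelize(data)                  # turn the local list into an RDD
    evens = rdd.filter(lambda n: n % 2 == 0)    # transformation: keep even numbers
    squares = evens.map(lambda n: n * n)        # transformation: square each one
    return squares.sum()                        # action: triggers the computation
```

Inside the pySpark shell, sum_of_even_squares(sc, [1, 2, 3, 4, 5]) should return 20 (that's 4 + 16). The filter and map steps are lazy; nothing is computed until the final sum is requested.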
And to verify scikit-learn, try:
from sklearn import svm, datasets
clf = svm.SVC(gamma=0.001, C=100.)
digits = datasets.load_digits()
clf.fit(digits.data[:-1], digits.target[:-1])
Where you can expect an output similar to:
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0, degree=3, gamma=0.001, kernel='rbf', max_iter=-1, probability=False, random_state=None, shrinking=True, tol=0.001, verbose=False)
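The fit above deliberately leaves out the last digit image, so one more sanity check is to ask the classifier to predict it. Here's a minimal sketch, assuming the same scikit-learn install; it repeats the imports and the fit so it can be pasted on its own.

```python
from sklearn import svm, datasets

digits = datasets.load_digits()
clf = svm.SVC(gamma=0.001, C=100.)
clf.fit(digits.data[:-1], digits.target[:-1])   # train on all but the last image

prediction = clf.predict(digits.data[-1:])      # classify the held-out image
print(prediction)                               # one-element array holding a digit 0-9
print(digits.target[-1])                        # the true label, for comparison
```

If the two printed values agree, the classifier got the held-out digit right.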
Step 5: Taking it further
To see all of the functions available in pySpark, from within Python run:
help(pyspark)
(To exit help simply press Q.)
There are many further examples provided in the official Apache Spark git repo. Part 2 of this post will show you exactly how to get some of these examples running with Docker.









