Jupyter Notebooks with PySpark on AWS EMR

Analysis and experimentation are among the biggest, most time-consuming parts of data science. One of the most popular tools for doing both in a graphical, interactive environment is Jupyter.

Combining Jupyter with Apache Spark (through PySpark) merges two extremely powerful tools. AWS EMR lets you set up all of these tools with just a few clicks. In this tutorial I’ll walk through creating a cluster of machines running Spark with a Jupyter notebook sitting on top of it all.


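  1. Create an EMR cluster with Spark 2.0 or later with this file as a bootstrap action: https://github.com/mikestaszel/spark-emr-jupyter/blob/master/emr_bootstrap.sh
  2. Copy this file into an S3 bucket and add it as a step: https://github.com/mikestaszel/spark-emr-jupyter/blob/master/jupyter_step.sh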
  3. SSH in to the head/master node and run pyspark with whatever options you need.
  4. Open up port 8888 (make sure it’s allowed in the security group) of your head/master node in a web browser and you’re in Jupyter!

Step by Step Screenshots

Find them here: https://github.com/mikestaszel/spark-emr-jupyter/tree/master/screenshots

Creating an EMR Cluster

When creating your EMR cluster, all you need to do is add a bootstrap action file that will install Anaconda and Jupyter Spark extensions to make job progress visible directly in the notebook. Add this as a bootstrap action: https://github.com/mikestaszel/spark-emr-jupyter/blob/master/emr_bootstrap.sh

Running Jupyter through PySpark

When your cluster is ready, you need to run a step that will tell PySpark to launch Jupyter when you run it. You’ll need to copy this file into an S3 bucket and reference it in the step: https://github.com/mikestaszel/spark-emr-jupyter/blob/master/jupyter_step.sh

To do this, add a step to the cluster with the following parameters:

JAR location: s3://[region].elasticmapreduce/libs/script-runner/script-runner.jar

Arguments: s3://[your-bucket]/jupyter_step.sh
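If you would rather add the step from code than from the console, the same two parameters can be passed through boto3. This is a minimal sketch, not part of the original setup: the function names, the step name, and the "CONTINUE" failure action are my own assumptions, and boto3 has to be installed and configured with credentials separately.

```python
def build_jupyter_step(region, bucket, script_name="jupyter_step.sh"):
    """Build the step definition using the JAR location and argument shown above."""
    return {
        "Name": "Configure PySpark to launch Jupyter",  # hypothetical step name
        "ActionOnFailure": "CONTINUE",  # assumption: keep the cluster up if the step fails
        "HadoopJarStep": {
            "Jar": f"s3://{region}.elasticmapreduce/libs/script-runner/script-runner.jar",
            "Args": [f"s3://{bucket}/{script_name}"],
        },
    }


def submit_jupyter_step(cluster_id, region, bucket):
    """Submit the step to a running cluster and return the new step IDs."""
    import boto3  # third-party; imported here so the builder works without it

    emr = boto3.client("emr", region_name=region)
    response = emr.add_job_flow_steps(
        JobFlowId=cluster_id,
        Steps=[build_jupyter_step(region, bucket)],
    )
    return response["StepIds"]
```

Calling submit_jupyter_step with your cluster ID (the j-... identifier shown in the EMR console), region, and bucket name should queue the step just like the console form does.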

Run PySpark

You’re ready to run PySpark! You can go ahead and run something like “pyspark --master yarn” with any options you need (for example in a tmux session on your master node). You should see the Jupyter notebook server start and print out an address and authentication token.

In your browser, open up port 8888 of the head node and paste in the authentication token and you’re all set! You can create a notebook or upload one. You don’t need to initialize a SparkSession – one is automatically created for you, named “spark”. Make sure your security group firewall rules allow access to port 8888!
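If the page doesn’t load, a quick way to tell whether port 8888 is reachable at all is to try a plain TCP connection from your own machine. This is only a rough sketch using the standard library; the function name and the 3 second timeout are my own choices, and a failed connection can also mean that Jupyter simply isn’t running yet, not only that the security group is blocking you.

```python
import socket


def port_is_open(host, port=8888, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable host names.
        return False
```

Pass in the public DNS name of your head node, for example port_is_open("your-master-node-address"), where the address is whatever the EMR console shows for your cluster.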

One last thing to keep in mind is that your notebooks will be deleted when you terminate the cluster, so make sure to download anything you need! There are some Jupyter plugins you can try if you want to store notebooks in S3, but that’s another blog post.


Vowpal Wabbit – Ramdisk vs. EBS-Optimized SSD

Recently I started playing around with Vowpal Wabbit and various data sets. Vowpal Wabbit promises to be really fast, so much so that disk IO is one of the most common bottlenecks according to the author. I did a quick test to see if using a RAM disk would make Vowpal Wabbit’s training faster. However, a RAM disk is not a silver bullet that will make Vowpal Wabbit faster, at least in my quick testing.


I used an AWS m4.2xlarge machine with 32GB RAM and created a 20GB RAM disk. I trained a logistic regression model on data from the Criteo Display Advertising Challenge.

I expected the RAM disk to be faster because reading data from RAM is 10-20 times faster than an SSD according to StackOverflow.

Vowpal Wabbit?

Vowpal Wabbit is a machine learning library, comparable to Spark’s MLlib or scikit-learn.

Creating a RAM Disk in Ubuntu

This was easier than expected. There’s just one command to run:

sudo mount -t tmpfs -o size=20G tmpfs ramdisk/

This will create a 20GB RAM disk and mount it to a “ramdisk” directory. I copied my training dataset into this folder. I also had a copy of my training data saved on an EBS-optimized SSD attached to my instance.


Long story short, the disk wasn’t the bottleneck with training a logistic regression model on this dataset. Training on the file with 40 million+ rows with ~30 features per row took 3 minutes whether I trained on the dataset from the SSD or from the RAM disk. The exact command (very basic) was:

vw train.vw -f model.vw --loss_function logistic

One really obvious bottleneck here is the CPU – the command above will only use 1 core of the CPU! But that’s another blog post for a future date.
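For anyone who wants to repeat the SSD versus RAM disk comparison, here is a rough sketch of a timing harness around the same vw command. The helper names and the two directory paths in the usage note are hypothetical; only the vw arguments come from the command above.

```python
import subprocess
import sys
import time


def vw_command(training_file, model_file="model.vw"):
    """Build the same basic training command used in this post."""
    return ["vw", training_file, "-f", model_file, "--loss_function", "logistic"]


def time_command(command):
    """Run a command to completion and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(command, check=True)
    return time.perf_counter() - start


def compare(training_files):
    """Time one training run per file; training_files maps a label to a path."""
    return {label: time_command(vw_command(path)) for label, path in training_files.items()}
```

Something like compare({"ssd": "ssd/train.vw", "ramdisk": "ramdisk/train.vw"}) would return the seconds for each run. Keep in mind that Linux may serve a recently read file from its page cache, which could narrow any gap between the two.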


Amazon Dash Button Hackery with Python

I bought a few Amazon Dash Buttons as part of Prime Day. These are the cheaper $4.99 buttons, not the more expensive $19.99 AWS IoT Buttons. In this blog post I’ll walk through how to make these cheaper buttons do what the more expensive button does.

What You Need

  • One or more Amazon Dash Buttons. You’ll need to add the button to your Amazon account, but do not pick a product to buy. Just exit the set-up process without picking a product and you’ll be all set.
  • Computer with root (sudo) access (or a Raspberry Pi or other device capable of running Python).
  • The code from GitHub.


How exactly does the dash button work? In a nutshell, every time you press it, the button connects to the Wi-Fi network, pings Amazon, and then shuts back down for power savings. We’ll exploit the first step in that process – connecting to the Wi-Fi network. Using Python, we can listen for special “ARP probe” packets the Dash Button sends when it attempts to connect to Wi-Fi. All you need to know is the MAC address of the Dash Button and then listen for these ARP packets. When an ARP packet with your Dash Button’s MAC address is detected, you know the button was pressed, and you can call whatever Python methods you want.

Finding the MAC Address

The first step once you’ve set up your Dash Button (but have not picked a product to actually buy!) is to find the button’s MAC address. Grab your computer and connect it to the same Wi-Fi network as your button, then run the pydashbutton.py script as root and watch for any MAC addresses that are printed when you press the button. One important thing to note is that there seems to be throttling when pressing the button. Pressing the button multiple times per minute might not work.
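To give a feel for what listening for ARP probes involves, here is a minimal sketch that uses only the standard library. It is not the pydashbutton.py script from the repository, which may work differently; the function names are my own, the raw AF_PACKET socket only exists on Linux, and I am assuming the button’s probes carry a sender IP of 0.0.0.0, as ARP probes conventionally do.

```python
import socket
import struct

ETH_HEADER = struct.Struct("!6s6sH")          # destination MAC, source MAC, EtherType
ARP_PACKET = struct.Struct("!HHBBH6s4s6s4s")  # IPv4-over-Ethernet ARP body
ETHERTYPE_ARP = 0x0806
ARP_REQUEST = 1


def format_mac(raw):
    """Render 6 raw bytes as aa:bb:cc:dd:ee:ff."""
    return ":".join(f"{byte:02x}" for byte in raw)


def parse_arp_probe(frame):
    """Return the sender MAC if the Ethernet frame is an ARP probe, else None."""
    if len(frame) < ETH_HEADER.size + ARP_PACKET.size:
        return None
    _dst, _src, ethertype = ETH_HEADER.unpack_from(frame, 0)
    if ethertype != ETHERTYPE_ARP:
        return None
    fields = ARP_PACKET.unpack_from(frame, ETH_HEADER.size)
    operation, sender_mac, sender_ip = fields[4], fields[5], fields[6]
    # Assumption: a probe is an ARP request whose sender IP is still 0.0.0.0.
    if operation == ARP_REQUEST and sender_ip == b"\x00\x00\x00\x00":
        return format_mac(sender_mac)
    return None


def listen():
    """Print the MAC address behind every ARP probe seen (Linux only, needs root)."""
    sniffer = socket.socket(socket.AF_PACKET, socket.SOCK_RAW, socket.htons(0x0003))
    while True:
        frame, _address = sniffer.recvfrom(65535)
        mac = parse_arp_probe(frame)
        if mac is not None:
            print("ARP probe from", mac)
```

Running listen() as root and pressing the button should print a MAC address; other devices joining the network will show up too, so press the button a couple of times to be sure which one it is.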

Running Methods on Button Press

Now that you have the MAC address to listen for, all you need to do is throw some if/else logic into that same listener script to run code when a MAC address is detected. Check out the script and make any modifications you need. I included a simple example for logging button presses to a Google Sheet when pressing a button.
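Instead of a growing if/else chain, one option is a small lookup table from MAC address to function. The sketch below is my own illustration and not the code from the repository: make_dispatcher is a hypothetical helper, and the 10 second minimum interval is an arbitrary guard in case a single press produces more than one packet.

```python
import time


def make_dispatcher(handlers, min_interval=10.0, clock=time.monotonic):
    """Return a function that runs the handler registered for a MAC address.

    handlers maps lowercase MAC strings to zero-argument callables. Repeated
    packets from the same MAC within min_interval seconds are ignored.
    """
    last_fired = {}

    def dispatch(mac):
        mac = mac.lower()
        handler = handlers.get(mac)
        if handler is None:
            return False  # not one of our buttons
        now = clock()
        previous = last_fired.get(mac)
        if previous is not None and now - previous < min_interval:
            return False  # too soon after the last press, treat as a duplicate
        last_fired[mac] = now
        handler()
        return True

    return dispatch
```

You would build the dispatcher once, for example make_dispatcher({"aa:bb:cc:dd:ee:ff": log_press}) with your own button’s MAC and your own log_press function, and call it with every MAC address the listener detects.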

The Code

Check out the code on GitHub. Have fun!


Jupyter Spark Integration


Jupyter Notebook extension for Apache Spark integration.

Includes a progress indicator for the current Notebook cell if it invokes a Spark job. Queries the Spark UI service on the backend to get the required Spark job information.
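I haven’t looked at how the extension does this internally, but Spark’s own monitoring REST API exposes the kind of information a progress bar needs. As a rough sketch under that assumption (the function names are mine, and the UI address and application ID are whatever your driver reports):

```python
import json
from urllib.request import urlopen


def fetch_jobs(ui_url, app_id):
    """Fetch the job list for one application from the Spark UI REST API."""
    with urlopen(f"{ui_url}/api/v1/applications/{app_id}/jobs", timeout=5) as response:
        return json.load(response)


def running_job_progress(jobs):
    """Map each running job ID to the fraction of its tasks that have completed."""
    progress = {}
    for job in jobs:
        if job.get("status") == "RUNNING" and job.get("numTasks"):
            progress[job["jobId"]] = job.get("numCompletedTasks", 0) / job["numTasks"]
    return progress
```

Feeding the result of fetch_jobs (pointed at the Spark UI, which usually listens on port 4040 of the driver) into running_job_progress gives a number between 0 and 1 per running job.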

This is really neat. No more checking another tab for job progress when running cells in a notebook!


Data Science from Scratch – Microreview

I recently finished reading Data Science from Scratch by Joel Grus. This book is a great introduction to data science concepts. It uses real code to demonstrate complex Python, data analytics, data science, and machine learning concepts.

I’m really glad I picked up this book as the first book I’ve read about machine learning. There was a great combination of mathematics, statistics, and real applications of machine learning algorithms.

The book starts out with a quick introduction to Python, followed by an in-depth review of all the math you need for the code to make sense.

If you’re looking for a book that’ll show you how to use TensorFlow or scikit-learn, this book is not for you. I’d recommend reading this book before diving into those. You’ll learn about the math behind popular machine learning libraries and implement basic versions of some of the most popular algorithms from scratch.

I think the next book I’ll pick up after this one is Python Data Science Handbook which will go into more detail on using a bunch of Python libraries to do some of this machine learning for me.


Jekyll in Docker

Recently I’ve been playing around with Jekyll to create some simple websites. I’ve used Jekyll in the past and I remember that the set-up was a multi-step process.

Jekyll is a Ruby application that uses several Gems and Bundler. That means installing several dependencies. In my case I don’t have a Ruby development environment already set up, so I would have to install all these packages just to use a static site generator.

Then I found the official Jekyll Docker image.

I already have Docker installed to play around with other containers, so downloading a Jekyll container and using it was as easy as:

docker run --rm --label=jekyll --volume=$(pwd):/srv/jekyll \
  -it -p 4000:4000 jekyll/jekyll jekyll serve

That’s all there is to it. This command will download the latest Jekyll image and start serving your site. No need to install Ruby, Gem, Bundler, or a bunch of other dependencies.


Fluent Python – Microreview

If you really want to get into the details of Python and learn about how the language was built and how some of its internals are implemented, Fluent Python is the book for you.

It’s a great book to refresh your knowledge of coroutines, asyncio, and other Python goodies.