Jupyter Notebooks with PySpark on AWS EMR

One of the biggest, most time-consuming parts of data science is analysis and experimentation. One of the most popular tools for doing that in a graphical, interactive environment is Jupyter.

Combining Jupyter with Apache Spark (through PySpark) merges two extremely powerful tools. AWS EMR lets you set up all of these tools with just a few clicks. In this tutorial I’ll walk through creating a cluster of machines running Spark with a Jupyter notebook sitting on top of it all.

Summary

  1. Create an EMR cluster with Spark 2.0 or later, using this file as a bootstrap action: https://github.com/mikestaszel/spark-emr-jupyter/blob/master/emr_bootstrap.sh
  2. Add this file as a step: https://github.com/mikestaszel/spark-emr-jupyter/blob/master/jupyter_step.sh
  3. SSH into the head/master node and run pyspark with whatever options you need.
  4. Open port 8888 of the head/master node in a web browser (make sure the security group allows it) and you’re in Jupyter!

Step by Step Screenshots

Find them here: https://github.com/mikestaszel/spark-emr-jupyter/tree/master/screenshots

Creating an EMR Cluster

When creating your EMR cluster, all you need to do is add a bootstrap action file that will install Anaconda and Jupyter Spark extensions to make job progress visible directly in the notebook. Add this as a bootstrap action: https://github.com/mikestaszel/spark-emr-jupyter/blob/master/emr_bootstrap.sh
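
If you’d rather script cluster creation than click through the console, a boto3 call along these lines can attach the bootstrap action. This is only a sketch: the region, bucket, key pair, and instance settings are placeholders you’d swap for your own, and the script path assumes you’ve copied emr_bootstrap.sh into your own S3 bucket.

```python
# A sketch of creating the cluster with boto3 instead of the console.
# Region, bucket, key pair, and instance settings are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-jupyter",
    ReleaseLabel="emr-5.8.0",  # any EMR release with Spark 2.0 or later
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "Ec2KeyName": "my-key-pair",  # placeholder key pair
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[{
        "Name": "Install Anaconda and Jupyter Spark extensions",
        "ScriptBootstrapAction": {
            # Your own S3 copy of emr_bootstrap.sh
            "Path": "s3://my-bucket/emr_bootstrap.sh",
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])  # cluster ID, e.g. j-XXXXXXXXXXXXX
```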

Running Jupyter through PySpark

When your cluster is ready, add a step that configures PySpark to launch Jupyter whenever you run pyspark. You’ll need to copy this file into an S3 bucket and reference it in the step: https://github.com/mikestaszel/spark-emr-jupyter/blob/master/jupyter_step.sh

To do this, add a step to the cluster with the following parameters:

JAR location: s3://[region].elasticmapreduce/libs/script-runner/script-runner.jar

Arguments: s3://[your-bucket]/jupyter_step.sh
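
The same step can also be added programmatically. Here’s a sketch using boto3; the cluster ID and bucket name are placeholders, and jupyter_step.sh is assumed to already be in your bucket.

```python
# A sketch of adding the Jupyter step to a running cluster with boto3.
# The cluster ID and bucket name are placeholders.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # your cluster ID
    Steps=[{
        "Name": "Configure PySpark to launch Jupyter",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "s3://us-east-1.elasticmapreduce/libs/script-runner/script-runner.jar",
            "Args": ["s3://my-bucket/jupyter_step.sh"],  # your copy of the step script
        },
    }],
)
```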

Run PySpark

You’re ready to run PySpark! You can go ahead and run something like “pyspark --master yarn” with any options you need (for example in a tmux session on your master node). You should see the Jupyter notebook server start and print out an address and authentication token.

In your browser, open port 8888 of the head node, paste in the authentication token, and you’re all set! You can create a notebook or upload one. You don’t need to initialize a SparkSession; one is automatically created for you, named “spark”. Make sure your security group firewall rules allow access to port 8888!
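
A quick first cell like this makes a reasonable sanity check that the pre-created session is wired up (nothing cluster-specific is assumed here):

```python
# "spark" already exists in the notebook; no SparkSession setup needed.
print(spark.version)

# Run a trivial distributed job to confirm the cluster responds.
df = spark.range(1000).toDF("n")
print(df.selectExpr("sum(n) AS total").collect())  # [Row(total=499500)]
```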

One last thing to keep in mind: your notebooks will be deleted when you terminate the cluster, so make sure to download anything you need! There are some Jupyter plugins you can try if you want to store notebooks in S3, but that’s a topic for another blog post.


EC2 + Route53 for Dynamic DNS

Recently I ran into a problem while working with Amazon EC2 servers. Servers without dedicated Elastic IP addresses would get a different IP address every time they were started up! This proved to be a challenge when trying to SSH into the servers.

How can I have a dynamic domain name that always points to my EC2 server?

Amazon’s Route53 came to mind. Route53, however, does not have a simple way to point a subdomain directly to an EC2 instance. You can set up load balancers between Route53 and your instance, but that’s a hassle. You can also set up an elaborate private network with port forwarding – yuck.

I wanted a simple way to set a Route53 subdomain’s A record to point to an EC2 instance’s public IP address, on startup.

Enter go-route53-dyn-dns, a simple Go project that solves this problem: a small binary that reads a JSON configuration file and updates Route53 with an EC2 instance’s public IP address.
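
The project itself is written in Go, but to give a sense of what it does under the hood, here is a rough Python sketch of the same idea using boto3 and the EC2 instance metadata endpoint. The hosted zone ID and record name are placeholders, and this is not the project’s actual code.

```python
# Rough sketch of the idea: upsert a Route53 A record with this instance's
# current public IP. Hosted zone ID and record name are placeholders.
import urllib.request

import boto3

# The EC2 instance metadata service reports the instance's public IPv4 address.
public_ip = urllib.request.urlopen(
    "http://169.254.169.254/latest/meta-data/public-ipv4", timeout=5
).read().decode()

route53 = boto3.client("route53")
route53.change_resource_record_sets(
    HostedZoneId="ZXXXXXXXXXXXXX",  # placeholder hosted zone ID
    ChangeBatch={
        "Comment": "Dynamic DNS update on instance startup",
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "myserver.example.com.",  # placeholder record name
                "Type": "A",
                "TTL": 60,
                "ResourceRecords": [{"Value": public_ip}],
            },
        }],
    },
)
```

Run at boot (from a startup script, for example), something like this keeps the record pointing at whatever public IP the instance came up with.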

The GitHub README.md explains how to set everything up.

The project is here: go-route53-dyn-dns.