Apache Spark on Google Colaboratory

Google recently launched a preview of Colaboratory, a new service that lets you edit and run IPython notebooks right from Google Drive – free! It’s similar to Databricks – give that a try if you’re looking for a better-supported way to run Spark in the cloud, launch clusters, and much more.

Google has published some tutorials showing how to use Tensorflow and various other Google APIs and tools on Colaboratory, but I wanted to try installing Apache Spark. It turned out to be much easier than I expected. Download the notebook¬†and import it into Colaboratory or read on…

… 

 

Jupyter Notebooks with PySpark on AWS EMR

One of the biggest, most time-consuming parts of data science is analysis and experimentation. One of the most popular tools to do so in a graphical, interactive environment is Jupyter.

Combining Jupyter with Apache Spark (through PySpark) merges two extremely powerful tools. AWS EMR lets you set up all of these tools with just a few clicks. In this tutorial I’ll walk through creating a cluster of machines running Spark with a Jupyter notebook sitting on top of it all.

…