Installing Spark on Ubuntu in 3 Minutes

One thing I hear often from people starting out with Spark is that it’s too difficult to install. Some guides are for Spark 1.x and others are for 2.x. Some guides get really detailed with Hadoop versions, JAR files, and environment variables.

So here’s yet another guide on how to install Apache Spark, condensed and simplified to get you up and running with Spark 2.3.1 in 3 minutes or less.

All you need is a machine (or instance, server, VPS, etc.) that you can install packages on (e.g. “sudo apt” works). If you need one of those, check out DigitalOcean. It’s much simpler than AWS for small projects.

First, log in to the machine via SSH.

Now, install OpenJDK 8 (Java):

sudo apt update && sudo apt install -y openjdk-8-jdk-headless python3

Next, download and extract Apache Spark:

wget http://www-us.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz && tar xf spark-2.3.1-bin-hadoop2.7.tgz

Set up environment variables to configure Spark:

echo 'export SPARK_HOME=$HOME/spark-2.3.1-bin-hadoop2.7' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc
source ~/.bashrc

That’s it – you’re all set! You’ve installed Spark and it’s ready to go. Try out “pyspark”, “spark-submit” or “spark-shell”.

Try running this inside “pyspark” to validate that it worked:

spark.createDataFrame([{"hello": x} for x in range(1000)]).count() # hopefully this equals 1000
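
If you want to try “spark-submit” too, the same check works as a standalone script. A minimal sketch (the file name count_example.py is just a placeholder):

# count_example.py - run with: spark-submit count_example.py
from pyspark.sql import SparkSession

# Standalone scripts create their own SparkSession (the pyspark shell gives you one for free).
spark = SparkSession.builder.appName("count-example").getOrCreate()

df = spark.createDataFrame([(x,) for x in range(1000)], ["hello"])
print(df.count())  # should print 1000

spark.stop()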
 

AWS Certified Solutions Architect

Quick post – I’ve been busy studying for the AWS Certified Solutions Architect – Associate exam for the past few weeks. Good news: I passed it a few days ago! Shoot me a note if you ever need some solutions architected.

I primarily did this because I’ve been using AWS for years now – but so has everyone else, so the certification would be a differentiator. There was also plenty that had fallen between the cracks: for example, I learned how to give instances in a private subnet Internet access to install and update software, without giving them public IP addresses and without spending hours reading Stack Overflow posts.

… 

 

azssh: Easily manage EC2 instances

azssh is a small command-line utility I wrote a few months ago to help with managing EC2 instances.

My workflow on EC2 consists of starting and stopping instances and sometimes SSHing in to run some commands. That’s what this utility does – starts and stops EC2 instances, tells you the public DNS address, and runs an SSH command.

Check out the source code and releases on GitHub: https://github.com/mikestaszel/azssh

 

 

Vowpal Wabbit Docker Image

Vowpal Wabbit is a really fast machine learning system.

A few months ago I put together a Docker image of Vowpal Wabbit, making it easy to run on any platform. It’s been sitting on GitHub and Docker Hub, but I forgot to write a blog post! So here it is:

https://github.com/mikestaszel/vowpal_wabbit_docker

You can download and run Vowpal Wabbit with one command – here is an example:

docker run --rm --volume=$(pwd):/data -t crimsonredmk/vw /data/click.train.vw -f /data/click.model.vw --loss_function logistic --link logistic --passes 1 --cache_file /data/click.cache.vw

Enjoy!

 

Spark + Scala Boilerplate Project

After setting up a few Spark + Scala projects I decided to open-source a boilerplate sample project that you can import right into IntelliJ and build with one command.

Usually I write Apache Spark code in Python, but there are a few times I prefer to use Scala:

  • When functionality isn’t available in PySpark yet.
  • When it’s easier to include dependencies in the JAR file instead of installing them on cluster nodes.
  • When I need that extra bit of performance.
  • Even more reasons here on StackOverflow.

One of the downsides to using Scala over Python is setting up the initial project structure. With PySpark, a single “.py” file does the trick. This boilerplate project makes Spark + Scala just as easy: grab the code, run “sbt assembly”, and you’ll have a JAR file ready to use with “spark-submit”.
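
To see the contrast, the single “.py” file I mean is nothing more than something like this hypothetical word-count job (a sketch for illustration, not part of the boilerplate):

# wordcount.py - the entire project; run with: spark-submit wordcount.py input.txt
import sys

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

if __name__ == "__main__":
    spark = SparkSession.builder.appName("wordcount").getOrCreate()

    # Read a text file, split each line into words, and count them.
    lines = spark.read.text(sys.argv[1])
    counts = (lines
              .select(F.explode(F.split(F.col("value"), r"\s+")).alias("word"))
              .groupBy("word")
              .count())

    counts.show()
    spark.stop()

Getting the same job running in Scala means a build definition and an assembly setup before you can run anything – that’s the part the boilerplate takes care of.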

Check it out here: https://github.com/mikestaszel/spark-scala-boilerplate

 

Fixing WordPress Jetpack Connection Errors

I recently migrated my WordPress installation from an old Debian 8 Google Cloud instance to Debian 9. This time I did the installation myself instead of using a Bitnami image, for greater control – I also couldn’t get certbot (a Let’s Encrypt client for free SSL certificates) to work on the Bitnami image.

I ran into one problem that took me a while to figure out: after I relinked my site to Jetpack to get basic analytics and automatic sharing to LinkedIn, Jetpack couldn’t communicate with my site.

… 

 

Apache Spark on Google Colaboratory

Google recently launched a preview of Colaboratory, a new service that lets you edit and run IPython notebooks right from Google Drive – free! It’s similar to Databricks – give that a try if you’re looking for a better-supported way to run Spark in the cloud, launch clusters, and much more.

Google has published some tutorials showing how to use TensorFlow and various other Google APIs and tools on Colaboratory, but I wanted to try installing Apache Spark. It turned out to be much easier than I expected. Download the notebook and import it into Colaboratory, or read on…
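
The cell-by-cell details are in that notebook, but the general approach is to install Java and Spark inside the notebook’s VM and point PySpark at them. A rough sketch of that kind of setup cell (assuming Colab’s default Ubuntu environment; findspark is one way to point Python at the unpacked Spark directory, and isn’t necessarily exactly what the notebook uses):

!apt-get install -y -qq openjdk-8-jdk-headless > /dev/null
!wget -q http://www-us.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
!tar xf spark-2.3.1-bin-hadoop2.7.tgz
!pip install -q findspark

import os
import findspark

# Point everything at the Java and Spark installs from the commands above.
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()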

… 

 

Jupyter Notebooks with PySpark on AWS EMR

One of the biggest, most time-consuming parts of data science is analysis and experimentation. One of the most popular tools for doing that work in a graphical, interactive environment is Jupyter.

Combining Jupyter with Apache Spark (through PySpark) merges two extremely powerful tools. AWS EMR lets you set up all of these tools with just a few clicks. In this tutorial I’ll walk through creating a cluster of machines running Spark with a Jupyter notebook sitting on top of it all.
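
The rest of the post walks through those few clicks. If you’d rather script the cluster creation instead, a rough boto3 sketch looks something like this – the names, instance sizes, and release label are placeholders, and the Jupyter setup itself isn’t shown here:

import boto3

# Assumes AWS credentials are configured and the default EMR roles already exist.
emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="spark-jupyter-demo",          # placeholder cluster name
    ReleaseLabel="emr-5.16.0",          # placeholder EMR release that ships Spark 2.3
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
        "Ec2KeyName": "my-key-pair",    # placeholder EC2 key pair for SSH access
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)

print(response["JobFlowId"])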

…