After setting up a few Spark + Scala projects I decided to open-source a boilerplate sample project that you can import right into IntelliJ and build with one command.
Usually I write Apache Spark code in Python, but there are a few cases where I prefer Scala:
- When functionality isn’t available in PySpark yet.
- When it’s easier to bundle dependencies into the JAR file than to install them on every cluster node.
- When I need that extra bit of performance.
- Even more reasons here on StackOverflow.
One of the downsides of Scala compared to Python is setting up the initial project structure. With PySpark, a single “.py” file does the trick. This boilerplate project makes Spark + Scala just as easy: grab the code, run “sbt assembly”, and you’ll have a JAR file ready to use with “spark-submit”.
Check it out here: https://github.com/mikestaszel/spark-scala-boilerplate
Google recently launched a preview of Colaboratory, a new service that lets you edit and run IPython notebooks right from Google Drive – free! It’s similar to Databricks; give Databricks a try if you’re looking for a better-supported way to run Spark in the cloud, launch clusters, and much more.
Google has published tutorials showing how to use TensorFlow and various other Google APIs and tools on Colaboratory, but I wanted to try installing Apache Spark. It turned out to be much easier than I expected. Download the notebook and import it into Colaboratory, or read on…
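If you just want the gist, here’s a minimal sketch of one way to get Spark running in a Colaboratory cell. It assumes the pyspark package from PyPI (which bundles Spark itself) rather than a manual download, so the exact steps in the notebook above may differ:

```python
# In a Colaboratory cell. If Java isn't already on the VM, install it first:
# !apt-get install -y openjdk-8-jdk-headless -qq
# The pyspark wheel from PyPI bundles Spark, so pip is enough after that.
!pip install -q pyspark

from pyspark.sql import SparkSession

# Run Spark in local mode, using every core the Colab VM has.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("colab-spark") \
    .getOrCreate()

# Quick sanity check: sum the integers 0 through 999.
spark.range(1000).selectExpr("sum(id) AS total").show()
```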
I recently ran into a use case that the usual Spark CSV writer didn’t handle very well – the data I was writing had an unusual encoding, odd characters, and was very large.
I needed a way to use the Python unicodecsv library with a Spark DataFrame to write to a huge output CSV file.
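The full code isn’t reproduced here, but one way to do this is to stream rows to the driver with toLocalIterator() and feed them to a unicodecsv writer. A minimal sketch – the tiny stand-in DataFrame, file name, and encoding are placeholders for the real data:

```python
import unicodecsv as csv
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("unicodecsv-export").getOrCreate()

# Stand-in for the real (huge, oddly-encoded) DataFrame.
df = spark.createDataFrame([(1, u"café"), (2, u"naïve")], ["id", "word"])

with open("output.csv", "wb") as f:
    # unicodecsv wants a binary file handle and does the encoding itself.
    writer = csv.writer(f, encoding="utf-8")
    writer.writerow(df.columns)
    # toLocalIterator() pulls one partition at a time to the driver, so the
    # full result set never has to fit in driver memory at once.
    for row in df.toLocalIterator():
        writer.writerow(row)
```

The trade-off is that every row funnels through a single writer on the driver, so this is slower than Spark’s distributed CSV writer – the point is control over the encoding, not speed.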
One of the biggest, most time-consuming parts of data science is analysis and experimentation. One of the most popular tools for doing this in a graphical, interactive environment is Jupyter.
Combining Jupyter with Apache Spark (through PySpark) merges two extremely powerful tools. AWS EMR lets you set up all of these tools with just a few clicks. In this tutorial I’ll walk through creating a cluster of machines running Spark with a Jupyter notebook sitting on top of it all.
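Once the cluster is up, a Jupyter cell on the master node can attach to EMR’s Spark installation. A minimal sketch, assuming the findspark package is installed and SPARK_HOME points at EMR’s Spark (both assumptions – the tutorial covers the actual setup):

```python
import findspark
findspark.init()  # locates Spark via SPARK_HOME and puts pyspark on sys.path

from pyspark.sql import SparkSession

# On EMR, "yarn" hands work to the cluster's resource manager instead of
# running everything on the master node.
spark = SparkSession.builder \
    .master("yarn") \
    .appName("jupyter-on-emr") \
    .getOrCreate()

print(spark.range(100).count())  # quick sanity check: should print 100
```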
Jupyter Notebook extension for Apache Spark integration.
Includes a progress indicator for the current Notebook cell if it invokes a Spark job. Queries the Spark UI service on the backend to get the required Spark job information.
This is really neat. No more checking another tab for job progress when running cells in a notebook!
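For a sense of what “queries the Spark UI service” means: Spark exposes a monitoring REST API on the driver UI (port 4040 by default) that reports job status as JSON. A rough sketch of polling it directly – this only illustrates the mechanism, it’s not the extension’s actual code:

```python
import requests

# Spark's monitoring REST API lives under /api/v1 on the driver UI
# (http://localhost:4040 by default when running locally).
base = "http://localhost:4040/api/v1"

for app in requests.get(base + "/applications").json():
    jobs = requests.get("{0}/applications/{1}/jobs".format(base, app["id"])).json()
    for job in jobs:
        print(job["jobId"], job["status"],
              "{0}/{1} tasks done".format(job["numCompletedTasks"], job["numTasks"]))
```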
You’ll definitely want to read this if you’re using AWS Kinesis with Apache Spark to stream data; it’s been extremely valuable:
New tiny GitHub project: https://github.com/mikestaszel/spark_cluster_vagrant
Over the past few weeks I’ve been working on benchmarking Spark as well as learning more about setting up clusters of Spark machines both locally and on cloud providers.
I decided to work on a simple Vagrantfile that spins up a Spark cluster with a head node and as many worker nodes as desired. I’ve seen a few projects like this, but they either used a third-party box, shipped an older version of Spark, or only spun up a single node.
By running only one command I could have a fully-configured Spark cluster ready to use and test. Vagrant also extends easily beyond simple VirtualBox machines to many providers, including AWS EC2 and DigitalOcean, and this Vagrantfile can be extended to provision clusters on those providers.
Check it out here: https://github.com/mikestaszel/spark_cluster_vagrant