Apache Spark on Google Colaboratory

Google recently launched a preview of Colaboratory, a new service that lets you edit and run IPython notebooks right from Google Drive – free! It’s similar to Databricks – give that a try if you’re looking for a better-supported way to run Spark in the cloud, launch clusters, and much more.

Google has published some tutorials showing how to use TensorFlow and various other Google APIs and tools on Colaboratory, but I wanted to try installing Apache Spark. It turned out to be much easier than I expected. Download the notebook and import it into Colaboratory, or read on…
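The linked notebook has the real steps; as a minimal sketch, the pattern relies on the standalone `pyspark` package on PyPI (published since Spark 2.2), so "installing Spark" in a Colab cell is basically one `pip` line followed by starting a local session (names below are illustrative, not the notebook's code):

```python
# A minimal sketch, assuming the standalone "pyspark" package from PyPI.
# In a Colab cell, the install step is shell syntax prefixed with "!":
#
#     !pip install pyspark
#
# After that, a local session works the same as on any single machine:
try:
    from pyspark.sql import SparkSession
    spark = (SparkSession.builder
             .master("local[*]")       # Colab provides one VM, so run Spark locally
             .appName("colab-spark")   # illustrative app name
             .getOrCreate())
    smoke = spark.range(100).count()   # quick sanity check; should be 100
except Exception:
    smoke = None                       # pyspark (or a JVM) not installed yet
print(smoke)
```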

Continue reading “Apache Spark on Google Colaboratory”

Writing Huge CSVs Easily and Efficiently with PySpark

I recently ran into a use case that the usual Spark CSV writer didn’t handle very well – the data I was writing had an unusual encoding, odd characters, and was really large.

I needed a way to use the Python unicodecsv library with a Spark dataframe to write to a huge output CSV file.
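The full post walks through the details; the core pattern looks roughly like this sketch, with two stated stand-ins: the stdlib `csv` module stands in for `unicodecsv` (which mirrors the same writer API), and a plain iterator stands in for PySpark's `df.toLocalIterator()`, which streams rows to the driver one at a time instead of `collect()`-ing the whole dataframe into memory:

```python
# Rough sketch of the pattern, not the post's exact code.
# Assumption: stdlib csv stands in for unicodecsv; `rows` stands in for
# df.toLocalIterator(), which yields driver-side rows without a full collect().
import csv
import io

rows = iter([
    (u"id", u"name"),
    (1, u"café"),          # odd characters are the whole point
    (2, u"naïve"),
])

buf = io.StringIO()        # in a real job: open("out.csv", "w", newline="")
writer = csv.writer(buf)   # with unicodecsv: unicodecsv.writer(f, encoding="utf-8")
for row in rows:           # with Spark: for row in df.toLocalIterator():
    writer.writerow(row)

print(buf.getvalue())
```

Streaming the iterator keeps driver memory flat no matter how large the output file gets, which is what makes this workable for huge CSVs.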

Continue reading “Writing Huge CSVs Easily and Efficiently with PySpark”

Vowpal Wabbit – Ramdisk vs. EBS-Optimized SSD

Recently I started playing around with Vowpal Wabbit and various data sets. Vowpal Wabbit promises to be really fast – so fast that, according to its author, disk IO is one of the most common bottlenecks. I ran a quick test to see whether training from a RAM disk would speed things up. It turns out a RAM disk is no silver bullet, at least in my testing.

Continue reading “Vowpal Wabbit – Ramdisk vs. EBS-Optimized SSD”