After setting up a few Spark + Scala projects I decided to open-source a boilerplate sample project that you can import right into IntelliJ and build with one command.
Usually I write Apache Spark code in Python, but there are a few times I prefer to use Scala:
- When functionality isn’t in PySpark yet.
- When it’s easier to include dependencies in the JAR file than to install them on the cluster nodes.
- When I need that extra bit of performance.
- There are even more reasons on Stack Overflow.
One of the downsides to using Scala over Python is setting up the initial project structure. With PySpark, a single “.py” file does the trick. Using this boilerplate project will make using Spark + Scala just as easy. Grab the code and run “sbt assembly” and you’ll have a JAR file ready to use with “spark-submit”.
Check it out here: https://github.com/mikestaszel/spark-scala-boilerplate
I recently migrated my WordPress installation from an old Debian 8 Google Cloud instance to Debian 9. For greater control, I decided to do the installation myself this time instead of using a Bitnami image. I couldn’t get certbot (a Let’s Encrypt client for free SSL certificates) to work on the Bitnami image, so I figured I’d set everything up myself.
I ran into one problem that took me a while to debug. I relinked my site to Jetpack to get basic analytics and automatic sharing to LinkedIn, but Jetpack couldn’t communicate with my site.
Google recently launched a preview of Colaboratory, a new service that lets you edit and run IPython notebooks right from Google Drive – free! It’s similar to Databricks – give that a try if you’re looking for a better-supported way to run Spark in the cloud, launch clusters, and much more.
Google has published some tutorials showing how to use TensorFlow and various other Google APIs and tools on Colaboratory, but I wanted to try installing Apache Spark. It turned out to be much easier than I expected. Download the notebook and import it into Colaboratory or read on…
I had a hard time figuring out how to make a Go program execute a command and let that command take over the console. I wanted my program to launch an SSH session.
I recently ran into a use case that the usual Spark CSV writer didn’t handle very well – the data I was writing had an unusual encoding, contained odd characters, and was really large.
I needed a way to use the Python unicodecsv library with a Spark dataframe to write to a huge output CSV file.
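To give a rough idea of the shape of this, here is a minimal sketch that streams rows into one CSV file with an explicit encoding. It is not the code from the post: it uses the standard library `csv` module (which handles Unicode on Python 3, where `unicodecsv` is a Python 2 era library), and the function name `write_rows_to_csv` and its parameters are made up for illustration. It accepts any iterable of rows, so with PySpark you could pass `df.toLocalIterator()`, which hands rows to the driver without collecting the whole DataFrame at once.

```python
import csv


def write_rows_to_csv(rows, path, header=None, encoding="utf-8"):
    """Stream rows into a single CSV file and return the number of rows written.

    `rows` is any iterable of sequences (tuples, lists, Spark Row objects).
    The name and parameters are hypothetical, chosen for this sketch.
    """
    count = 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", encoding=encoding, newline="") as f:
        writer = csv.writer(f, quoting=csv.QUOTE_MINIMAL)
        if header is not None:
            writer.writerow(header)
        for row in rows:
            # Rows are written one at a time, so memory use stays flat
            # no matter how large the output file gets.
            writer.writerow(row)
            count += 1
    return count
```

With a Spark DataFrame named `df` (an assumption here), the call would look like `write_rows_to_csv(df.toLocalIterator(), "out.csv", header=df.columns)`. Everything funnels through the driver, so this trades speed for having one output file and full control over the encoding.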
One of the biggest, most time-consuming parts of data science is analysis and experimentation. One of the most popular tools to do so in a graphical, interactive environment is Jupyter.
Combining Jupyter with Apache Spark (through PySpark) merges two extremely powerful tools. AWS EMR lets you set up all of these tools with just a few clicks. In this tutorial I’ll walk through creating a cluster of machines running Spark with a Jupyter notebook sitting on top of it all.
Recently I started playing around with Vowpal Wabbit and various data sets. Vowpal Wabbit promises to be really fast, so much so that disk I/O is one of the most common bottlenecks according to the author. I did a quick test to see if using a RAM disk would make Vowpal Wabbit’s training faster. At least in my quick testing, a RAM disk is not a silver bullet that will make Vowpal Wabbit faster.
I bought a few Amazon Dash Buttons as part of Prime Day. These are the cheaper $4.99 buttons, not the more expensive $19.99 AWS IoT Buttons. In this blog post I’ll walk through how to make these cheaper buttons do what the more expensive button does.
Jupyter Notebook extension for Apache Spark integration.
Includes a progress indicator for the current Notebook cell if it invokes a Spark job. Queries the Spark UI service on the backend to get the required Spark job information.
This is really neat. No more checking another tab for job progress when running cells in a notebook!
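For a feel of what “queries the Spark UI service” could look like, here is a small sketch. It is not the extension’s actual code. It assumes the Spark UI’s monitoring REST API is reachable (for a local driver that is usually `http://localhost:4040`) and reads the `status`, `numTasks`, and `numCompletedTasks` fields from the job list. The function names are my own.

```python
import json
from urllib.request import urlopen


def job_progress(job):
    """Return the completed fraction (0.0 to 1.0) for one job entry."""
    total = job.get("numTasks", 0)
    done = job.get("numCompletedTasks", 0)
    return done / total if total else 0.0


def running_job_progress(jobs):
    """Map jobId -> completed fraction for jobs whose status is RUNNING."""
    return {
        j["jobId"]: job_progress(j)
        for j in jobs
        if j.get("status") == "RUNNING"
    }


def fetch_jobs(base_url, app_id):
    """Fetch the job list for one application from the Spark UI REST API.

    base_url is an assumption, e.g. "http://localhost:4040" for a local driver.
    """
    with urlopen(f"{base_url}/api/v1/applications/{app_id}/jobs") as resp:
        return json.load(resp)
```

Polling `running_job_progress(fetch_jobs(base_url, app_id))` every second or so would give enough information to draw a progress bar under a notebook cell.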
I recently finished reading Data Science from Scratch by Joel Grus. This book is a great introduction to data science concepts. It uses real code to demonstrate complex Python, data analytics, data science, and machine learning concepts.
I’m really glad I picked up this book as the first book I’ve read about machine learning. There was a great combination of mathematics, statistics, and real applications of machine learning algorithms.
The book starts out with a quick introduction to Python, followed by an in-depth review of all the math you need for the code to make sense.
If you’re looking for a book that’ll show you how to use TensorFlow or scikit-learn, this book is not for you. I’d recommend reading this book before diving into those. You’ll learn about the math behind popular machine learning libraries and implement basic versions of some of the most popular algorithms from scratch.
I think the next book I’ll pick up after this one is Python Data Science Handbook, which goes into more detail on using a bunch of Python libraries to do some of this machine learning for me.