Installing Spark on Ubuntu in 3 Minutes

One thing I hear often from people starting out with Spark is that it’s too difficult to install. Some guides are for Spark 1.x and others are for 2.x. Some guides get really detailed with Hadoop versions, JAR files, and environment variables. Here’s yet another guide on how to install Apache Spark, condensed and simplified to get you up and running with Apache Spark 2.3.1 in 3 minutes or less....

September 19, 2018 · Mike Staszel

Apache Spark on Google Colaboratory

Google recently launched a preview of Colaboratory, a new service that lets you edit and run IPython notebooks right from Google Drive, for free! It’s similar to Databricks; give that a try if you’re looking for a better-supported way to run Spark in the cloud, launch clusters, and much more. Google has published some tutorials showing how to use TensorFlow and various other Google APIs and tools on Colaboratory, but I wanted to try installing Apache Spark....

March 7, 2018 · Mike Staszel

Execute Interactive Programs From Go

I had a hard time figuring out how to make a Go program execute a command and let that command take over the console. I wanted my program to launch an SSH session. I recently started working on a tool to help me SSH into EC2 instances (more details coming in a future blog post). The goal was to automatically open up an SSH session into an EC2 instance. It’s easy to execute a program like ssh, but the input and output of that program are lost....

February 21, 2018 · Mike Staszel

Writing Huge CSVs Easily and Efficiently With PySpark

I recently ran into a use case that the usual Spark CSV writer didn’t handle very well: the data I was writing had an unusual encoding, odd characters, and was really large. I needed a way to use the Python unicodecsv library with a Spark dataframe to write to a huge output CSV file. I don’t know how I missed this RDD method before, but toLocalIterator was the cleanest, most straightforward way I got this to work....

February 5, 2018 · Mike Staszel
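
The toLocalIterator idea from the post above can be sketched roughly as follows. This is a minimal illustration, not the post's actual code: it uses the standard-library csv module in place of unicodecsv (Python 3's csv handles Unicode text directly), and the helper name write_rows_to_csv is hypothetical. The point is that rows are pulled to the driver and written one at a time, so the whole dataset is never held in memory at once.

```python
import csv
from typing import Iterable, Sequence


def write_rows_to_csv(rows: Iterable[Sequence], path: str,
                      header: Sequence[str], encoding: str = "utf-8") -> int:
    """Stream rows into a CSV file one at a time; return how many were written.

    Hypothetical helper for illustration. `rows` can be any iterable of
    tuple-like records, including a lazy iterator.
    """
    count = 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding=encoding) as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count


# With PySpark (assumed to be installed, and `df` an existing DataFrame),
# the rows would come from the RDD lazily, for example:
#   write_rows_to_csv(df.rdd.toLocalIterator(), "out.csv", df.columns)
```

Because the helper only needs an iterable, it can be tried out with a plain generator before pointing it at a real dataframe.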

Data Science from Scratch - Microreview

I recently finished reading Data Science from Scratch by Joel Grus. This book is a great introduction to data science concepts. It uses real code to demonstrate complex Python, data analytics, data science, and machine learning concepts. I’m really glad I picked up this book as the first book I’ve read about machine learning. There was a great combination of mathematics, statistics, and real applications of machine learning algorithms. The book starts out with a quick introduction to Python, followed by an in-depth review of all the math you need for the code to make sense....

July 10, 2017 · Mike Staszel

Fluent Python - Microreview

If you really want to get into the details of Python and learn about how the language was built and how some of its internals are implemented, Fluent Python is the book for you. It’s a great book to refresh your knowledge of coroutines, asyncio, and other Python goodies.

April 6, 2017 · Mike Staszel

Hello Startup - Microreview

I just finished reading “Hello, Startup” by Yevgeniy Brikman, a book written for programmers about starting a startup. All the basics are covered, including hiring, teamwork, startup culture, and development methodology while scaling a startup. It’s a nice quick read (I skimmed through the chapters about development, programming, databases, and other technical topics), but I found the other content to be a great place to start learning about what it takes to build a startup....

December 28, 2016 · Mike Staszel

DiskDict — Python dictionaries stored on disk

This weekend while running a rather large Python job, I ran into a memory error. It turned out that a dictionary I was populating could potentially become too big to fit into RAM. This is where DiskDict saved me some time. It’s definitely not the best way to solve an issue, but in this case I was working with a limited system where rewriting the surrounding code would have been intrusive....

December 5, 2016 · Mike Staszel
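
The general idea behind a disk-backed dictionary, as in the DiskDict post above, can be sketched with Python's standard-library shelve module. This is not DiskDict's API, only a stand-in that shows the same pattern; the function names and the word-count scenario are hypothetical.

```python
import shelve


def count_words_on_disk(words, db_path):
    """Tally word counts in a disk-backed mapping instead of an in-memory dict."""
    # shelve.open returns a dict-like object persisted under db_path.
    # Keys must be strings; values can be any picklable object.
    with shelve.open(db_path) as counts:
        for word in words:
            counts[word] = counts.get(word, 0) + 1


def lookup(db_path, word):
    """Read a single count back from disk without loading the whole mapping."""
    with shelve.open(db_path) as counts:
        return counts.get(word, 0)
```

The trade-off is speed: every read and write goes through the on-disk store, so this suits cases where memory is the constraint and the surrounding code expects dictionary-style access.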

EC2 + Route53 for Dynamic DNS

Recently I ran into a problem while working with Amazon EC2 servers. Servers without dedicated elastic IP addresses would get a different IP address every time they were started up! This proved to be a challenge when trying to SSH into the servers. How can I have a dynamic domain name that always points to my EC2 server? Amazon’s Route53 came to mind. Route53, however, does not have a simple way to point a subdomain directly to an EC2 instance....

March 12, 2016 · Mike Staszel