Apache Spark Cluster with Vagrant

New tiny GitHub project: https://github.com/mikestaszel/spark_cluster_vagrant

Over the past few weeks I’ve been working on benchmarking Spark as well as learning more about setting up clusters of Spark machines both locally and on cloud providers.

I decided to work on a simple Vagrantfile that spins up a Spark cluster with a head node and however many worker nodes desired. I’ve seen a few of these but they either used some 3rd party box, had an older version of Spark, or only spun up one node.

By running only one command I could have a fully-configured Spark cluster ready to use and test. Vagrant also easily extends beyond simple Virtualbox machines to many providers, including AWS EC2 and DigitalOcean and this Vagrantfile can be extended to provision clusters on those providers.

Check it out here: https://github.com/mikestaszel/spark_cluster_vagrant


Hello, Startup – Microreview

I just finished reading “Hello, Startup” by Yevgeniy Brikman, a book written for programmers about starting a startup. All the basics are covered, including hiring, teamwork, startup culture, and development methodology while scaling a startup. It’s a nice quick read (I skimmed through the chapters about development, programming, databases, and other technical chapters, but I found the other content to be a great place to start learning about what it takes to build a startup.

Check it out here (also available on Safari Books): http://www.hello-startup.net


Flask Quick Startup Project

I like to start my projects using Flask and Python because it’s fast and quick for most things, yet lightweight.

By default, Flask doesn’t give you much in terms of test frameworks, application settings, deployment, or running the application in production. I always end up making a skeleton that does some of these things, so I decided to put together a GitHub repository with a skeleton Flask project that does it for me.

Have a look here: https://github.com/mikestaszel/flask_startup


DiskDict – Python dictionaries stored on disk

This weekend while running a rather large Python job, I ran into a memory error. It turned out that a dictionary I was populating could potentially become too big to fit into RAM. This is where DiskDict saved me some time.


It’s definitely not the best way to solve an issue, but in this case I was working with a limited system where rewriting the surrounding code would have been intrusive. Plus, the job didn’t have time constraints, so DiskDict was a decent workaround.

Wanted to share because it proved useful to me!


EC2 + Route53 for Dynamic DNS

Recently I ran into a problem while working with Amazon EC2 servers. Servers without dedicated elastic IP addresses would get a different IP address every time they were started up! This proved to be a challenge when trying to SSH in to the servers.

How can I have a dynamic domain name that always points to my EC2 server?

Amazon’s Route53 came to mind. Route53, however, does not have a simple way to point a subdomain directly to an EC2 instance. You can set up load balancers between Route53 and your instance, but that’s a hassle. You can also set up an elaborate private network with port forwarding – yuck.

I wanted a simple way to set a Route53 subdomain’s A record to point to an EC2 instance’s public IP address, on startup.

Enter go-route53-dyn-dns. This is a simple Go project that solves this problem. It is a small binary that reads a JSON configuration file and updates Route53 with an EC2 instance’s public IP address.

Included in the GitHub README.md file is how to set everything up.

The project is here: go-route53-dyn-dns.