AWS CloudFormation

If you’re doing any production-level work in AWS, you should be using AWS CloudFormation. It’s really easy to get started. Let’s walk through the basics.

Why use CloudFormation?

Here’s a common scenario: creating an EC2 instance and assigning an Elastic IP address. Let’s say it’s for a web server. Great! That’s easy. Just spin up an EC2 instance. Choose the correct image, size, security groups, VPC, subnet, keypair, and so on. Then create and assign it an Elastic IP address. No problem!

Now deploy it in QA. Then in production. But production has different security groups. You should probably set up CloudWatch alerts in production too. All of this is getting expensive, so maybe we should turn off the development stack overnight. But at this point we don’t just have one EC2 instance – we also have RDS, some S3 buckets, DynamoDB, and so on. We’ll need all of that configured in each environment. It’s 6 months later now and we need to recreate everything in a different region – did you document how to set everything up?

CloudFormation takes care of all of that for you.

CloudFormation can provision, update, delete, and monitor changes in virtually any AWS service. You can make S3 buckets with specific policies, make IAM roles allowed to access those buckets, spin up a Redshift cluster with that role attached, and so on. You can even create EC2 instances with Elastic IP addresses attached to them (and the VPC, security groups, and subnet associated with that instance).

Here’s an example:

AWSTemplateFormatVersion: 2010-09-09
Description: Create an EC2 instance.
Parameters: 
  InstanceNameParameter: 
    Type: String
    Description: Name of the instance.
Resources:
  EC2Instance:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: ami-0123456789abcdef0
      KeyName: mykeypair
      InstanceType: t3.nano
      SecurityGroupIds:
        - sg-0123456789abcdef0
      SubnetId: subnet-01234567
      BlockDeviceMappings:
        - DeviceName: /dev/xvda
          Ebs:
            VolumeSize: 20
      Tags:
        - Key: Name
          Value: !Ref InstanceNameParameter
  ElasticIP:
    Type: AWS::EC2::EIP
    Properties:
      Domain: vpc
      InstanceId: !Ref EC2Instance
Outputs:
  ElasticIPAddr:
    Value: !Ref ElasticIP

That looks like a lot, and the formatting takes some getting used to. But with that template in hand, you can go right into the AWS Console, upload it, and have an EC2 instance with an Elastic IP address created and configured in a couple of minutes. JSON is also a supported template format.
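If you prefer the command line over the console, the AWS CLI can do the same thing. A minimal sketch, assuming the template above is saved as ec2.yaml and your credentials and region are already configured:

# Create the stack from the template above (the file and stack names are placeholders).
aws cloudformation create-stack \
  --stack-name my-ec2-stack \
  --template-body file://ec2.yaml \
  --parameters ParameterKey=InstanceNameParameter,ParameterValue=my-web-server

# Wait for creation to finish, then print the outputs (including the Elastic IP).
aws cloudformation wait stack-create-complete --stack-name my-ec2-stack
aws cloudformation describe-stacks --stack-name my-ec2-stack --query "Stacks[0].Outputs"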

Drift Detection

This is a really cool feature. Let’s say your stack was created a few months ago and someone has since changed some settings by hand. CloudFormation drift detection compares the live resources against the template and reports exactly which resources no longer match.
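Drift detection runs on demand. A rough sketch with the AWS CLI (the stack name is a placeholder, and the detection run is asynchronous):

# Kick off a drift detection run on the stack.
aws cloudformation detect-stack-drift --stack-name my-ec2-stack

# Once the run completes, list which resources have drifted and how.
aws cloudformation describe-stack-resource-drifts --stack-name my-ec2-stack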

Updates and Deleting a Stack

You guessed it – if you update your CloudFormation template, AWS will intelligently figure out what it needs to do to update your stack.

Here’s an example – let’s say we need to increase the size of the disk on that EC2 instance. We would simply change the value in the template and tell CloudFormation to update the stack. Because changing the block device mapping requires replacing the instance, AWS would create a new instance with a larger disk, attach the Elastic IP address to it automatically, and then terminate the old instance.
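Updating and deleting each boil down to a single command. A sketch with the AWS CLI, again using placeholder names:

# Push the edited template; CloudFormation works out what needs to change.
aws cloudformation update-stack \
  --stack-name my-ec2-stack \
  --template-body file://ec2.yaml \
  --parameters ParameterKey=InstanceNameParameter,UsePreviousValue=true

# Tear down everything in the stack when you no longer need it.
aws cloudformation delete-stack --stack-name my-ec2-stack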

Wrapping Up

The best part is that templates are easy to reuse and work with most AWS services, not just EC2. There’s a slight learning curve, but the benefits are worth it.

 

Installing Spark on Ubuntu in 3 Minutes

One thing I hear often from people starting out with Spark is that it’s too difficult to install. Some guides are for Spark 1.x and others are for 2.x. Some guides get really detailed with Hadoop versions, JAR files, and environment variables.

So here’s yet another guide on how to install Apache Spark, condensed and simplified to get you up and running with Apache Spark 2.3.1 in 3 minutes or less.

All you need is a machine (or instance, server, VPS, etc.) that you can install packages on (e.g. “sudo apt” works). If you need one of those, check out DigitalOcean. It’s much simpler than AWS for small projects.

First, log in to the machine via SSH.

Now, install OpenJDK 8 (Java):

sudo apt update && sudo apt install -y openjdk-8-jdk-headless python3

Next, download and extract Apache Spark:

wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz && tar xf spark-2.3.1-bin-hadoop2.7.tgz

Set up environment variables to configure Spark:

echo 'export SPARK_HOME=$HOME/spark-2.3.1-bin-hadoop2.7' >> ~/.bashrc
echo 'export PATH=$PATH:$SPARK_HOME/bin' >> ~/.bashrc
echo 'export PYSPARK_PYTHON=python3' >> ~/.bashrc
source ~/.bashrc

That’s it – you’re all set! You’ve installed Spark and it’s ready to go. Try out “pyspark”, “spark-submit” or “spark-shell”.

Try running this inside “pyspark” to validate that it worked:

spark.createDataFrame([{"hello": x} for x in range(1000)]).count() # hopefully this equals 1000
 

AWS Certified Solutions Architect

Quick post – I’ve been busy studying for the AWS Certified Solutions Architect – Associate exam for the past few weeks – good news, I passed it a few days ago! Shoot me a note if you ever need some solutions architected.

I primarily did this because I’ve been using AWS for years now – but so has everyone else – so the certification would be a differentiator. A lot had also fallen between the cracks (for example, I learned how to give instances in a private subnet Internet access for installing and updating software, without assigning them public IP addresses and without spending hours reading Stack Overflow posts).
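For reference, the standard way to do that is a NAT gateway in a public subnet plus a route from the private subnet’s route table. A minimal sketch with the AWS CLI, where every ID is a placeholder:

# Allocate an Elastic IP and create a NAT gateway in a public subnet.
aws ec2 allocate-address --domain vpc
aws ec2 create-nat-gateway --subnet-id subnet-0123456789abcdef0 --allocation-id eipalloc-0123456789abcdef0

# Route outbound traffic from the private subnet's route table through the NAT gateway.
aws ec2 create-route \
  --route-table-id rtb-0123456789abcdef0 \
  --destination-cidr-block 0.0.0.0/0 \
  --nat-gateway-id nat-0123456789abcdef0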

… 

 

azssh: Easily manage EC2 instances

azssh is a small command-line utility I wrote a few months ago to help manage EC2 instances.

My workflow on EC2 consists of starting and stopping instances and sometimes SSHing in to run some commands. That’s what this utility does – starts and stops EC2 instances, tells you the public DNS address, and runs an SSH command.
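I won’t guess at azssh’s exact flags here, but the manual steps it wraps look roughly like this with the plain AWS CLI (the instance ID, key path, and user are placeholders):

# Start the instance and wait until it's running.
aws ec2 start-instances --instance-ids i-0123456789abcdef0
aws ec2 wait instance-running --instance-ids i-0123456789abcdef0

# Look up the public DNS name and SSH in to run a command.
HOST=$(aws ec2 describe-instances --instance-ids i-0123456789abcdef0 \
  --query "Reservations[0].Instances[0].PublicDnsName" --output text)
ssh -i ~/.ssh/mykeypair.pem ubuntu@$HOST "uptime"

# Stop the instance when done.
aws ec2 stop-instances --instance-ids i-0123456789abcdef0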

Check out the source code and releases on GitHub: https://github.com/mikestaszel/azssh

 

 

Vowpal Wabbit Docker Image

Vowpal Wabbit is a really fast machine learning system.

A few months ago I put together a Docker image of Vowpal Wabbit, making it easy to run on any platform. It’s been sitting on GitHub and Docker Hub, but I forgot to write a blog post! So here it is:

https://github.com/mikestaszel/vowpal_wabbit_docker

You can download and run Vowpal Wabbit with one command – here is an example:

docker run --rm --volume=$(pwd):/data -t crimsonredmk/vw /data/click.train.vw -f /data/click.model.vw --loss_function logistic --link logistic --passes 1 --cache_file /data/click.cache.vw

Enjoy!

 

Spark + Scala Boilerplate Project

After setting up a few Spark + Scala projects I decided to open-source a boilerplate sample project that you can import right into IntelliJ and build with one command.

Usually I write Apache Spark code in Python, but there are a few times I prefer to use Scala:

  • When functionality isn’t available in PySpark yet.
  • When it’s easier to bundle dependencies into the JAR file instead of installing them on cluster nodes.
  • When I need that extra bit of performance.
  • There are even more reasons on StackOverflow.

One of the downsides to using Scala over Python is setting up the initial project structure. With PySpark, a single “.py” file does the trick. Using this boilerplate project will make using Spark + Scala just as easy. Grab the code and run “sbt assembly” and you’ll have a JAR file ready to use with “spark-submit”.
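As a rough sketch (the main class and JAR name below are placeholders, not necessarily what the boilerplate produces), the build-and-run cycle looks like this:

# Build the fat JAR from the project root (requires the sbt-assembly plugin).
sbt assembly

# Submit it to Spark; adjust --class and the JAR path to match your project.
spark-submit \
  --class com.example.Main \
  --master local[*] \
  target/scala-2.11/my-spark-job-assembly-0.1.jar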

Check it out here: https://github.com/mikestaszel/spark-scala-boilerplate

 

Fixing WordPress Jetpack Connection Errors

I recently migrated my WordPress installation from an old Debian 8 Google Cloud instance to Debian 9. I decided to do the installation myself this time instead of using a Bitnami image for greater control. I couldn’t get certbot (a Let’s Encrypt client for free SSL certificates) to work on the Bitnami image so I figured I’d set everything up myself.

I ran into one problem that took a while to track down. I relinked my site to Jetpack to get basic analytics and automatic sharing to LinkedIn, but Jetpack couldn’t communicate with my site.

… 

 

Apache Spark on Google Colaboratory

Google recently launched a preview of Colaboratory, a new service that lets you edit and run IPython notebooks right from Google Drive – free! It’s similar to Databricks – give that a try if you’re looking for a better-supported way to run Spark in the cloud, launch clusters, and much more.

Google has published some tutorials showing how to use Tensorflow and various other Google APIs and tools on Colaboratory, but I wanted to try installing Apache Spark. It turned out to be much easier than I expected. Download the notebook and import it into Colaboratory or read on…
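As a very rough sketch of what that looks like (this is my condensed guess, not the exact contents of the notebook; installing the PyPI pyspark package is a shortcut compared to downloading the full tarball like the Ubuntu guide above):

# Run these in a Colab cell, each line prefixed with "!".
apt-get install -y -qq openjdk-8-jdk-headless
pip install -q pyspark==2.3.1

After that, creating a SparkSession in a notebook cell works the same way it does on a local machine.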

…