A typical modern Spark stack nowadays most likely runs Spark jobs on a Kubernetes cluster, especially for heavy usage. Workloads are moving away from EMR on EC2 to either EMR on EKS or open-source Spark on EKS.
When you’re running Spark on EKS, you probably want to scale your Kubernetes nodes up and down as you need them. You might only need to run a few jobs per day, or you might need to run hundreds of jobs, each with different resource requirements.
Hi there, I’m Mike.
🔭 I’m currently working on big data engineering with Spark on k8s on AWS at iSpot.tv. 🌱 I’m focusing on mentoring and coaching my team to improve their skills and release awesome products. 🌎 I occasionally write blog posts about software engineering and other topics. Management and Software Engineering I consider myself to be a software engineer at heart. Nowadays I’m trying to do less code-writing and more of everything else:
Updating my post from almost 3 years ago! The world has moved on to Spark 3.3, and so have the necessary JARs you will need to access S3 from Spark.
Run these commands to download JARs for Spark 3.3.2:
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.426/aws-java-sdk-bundle-1.12.426.jar -P $SPARK_HOME/jars/ wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.2/hadoop-aws-3.3.2.jar -P $SPARK_HOME/jars/ That’s all there is to it. The s3a:// prefix should work now for reading and writing data using Spark 3.3.2.
I began properly versioning the software I write recently. I’m working on a Python package that I hope others will use. My goal is to iterate and release new features and fixes over time, but I need a way to signal to the world that a new version is available.
Why You Should Version Your Code It’s a best practice. It makes you think about supporting users of your code. It only takes a few extra minutes during the software development lifecycle.
I finished a thing in my free time! Chipee!
Goals I’ve always wanted to write an emulator and this is the first time I actually got around to finishing one!
My goals were to learn about how to write an emulator and to re-learn C. It’s been years since I wrote any halfway decent C.
I’ve also never done anything using SDL, sound, or even a window with graphics.
One thing I hear often from people starting out with Spark is that it’s too difficult to install. Some guides are for Spark 1.x and others are for 2.x. Some guides get really detailed with Hadoop versions, JAR files, and environment variables.
Here’s yet another guide on how to install Apache Spark, condensed and simplified to get you up and running with Apache Spark 2.3.1 in 3 minutes or less.
I had a hard time figuring out how to make a Go program execute a command and make that program take over the console. I wanted my program to launch an SSH session.
I recently started working on a tool to help me SSH into EC2 instances (more details coming in a future blog post). The goal was to automatically open up an SSH session into an EC2 instance.
It’s easy to execute a program like ssh but the input and output of that program is lost.
This weekend while running a rather large Python job, I ran into a memory error. It turned out that a dictionary I was populating could potentially become too big to fit into RAM. This is where DiskDict saved me some time.
It’s definitely not the best way to solve an issue, but in this case I was working with a limited system where rewriting the surrounding code would have been intrusive.
Recently I ran into a problem while working with Amazon EC2 servers. Servers without dedicated elastic IP addresses would get a different IP address every time they were started up! This proved to be a challenge when trying to SSH in to the servers.
How can I have a dynamic domain name that always points to my EC2 server?
Amazon’s Route53 came to mind. Route53, however, does not have a simple way to point a subdomain directly to an EC2 instance.