A typical Spark stack today most likely runs its jobs on a Kubernetes cluster, especially under heavy usage. Workloads are moving away from EMR on EC2 toward either EMR on EKS or open-source Spark on EKS.
When you’re running Spark on EKS, you probably want to scale your Kubernetes nodes up and down as you need them. You might only need to run a few jobs per day, or you might need to run hundreds of jobs, each with different resource requirements.
Hi there, I’m Mike.
🔭 I’m currently working on big data engineering with Spark on k8s on AWS at iSpot.tv.
🌱 I’m focusing on mentoring and coaching my team to improve their skills and release awesome products.
🌎 I occasionally write blog posts about software engineering and other topics.
Management and Software Engineering
I consider myself to be a software engineer at heart. Nowadays I’m trying to do less code-writing and more of everything else:
Updating my post from almost 3 years ago! The world has moved on to Spark 3.3, and so have the JARs you need to access S3 from Spark.
Run these commands to download JARs for Spark 3.3.2:
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.426/aws-java-sdk-bundle-1.12.426.jar -P $SPARK_HOME/jars/
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.2/hadoop-aws-3.3.2.jar -P $SPARK_HOME/jars/
That’s all there is to it. The s3a:// prefix should work now for reading and writing data using Spark 3.3.2.
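As a rough sketch of how the two download URLs above are put together, the helper below rebuilds them from Maven coordinates, which can be handy when pinning different versions. The function names (`maven_jar_url`, `s3a_jar_urls`, `count_rows`) are hypothetical, and `count_rows` assumes you already have a SparkSession with both JARs on its classpath and AWS credentials available.

```python
from typing import List

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"


def maven_jar_url(group_id: str, artifact_id: str, version: str) -> str:
    """Build the Maven Central URL of a JAR from its coordinates."""
    group_path = group_id.replace(".", "/")  # dots in the group become path segments
    return (
        f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{version}/"
        f"{artifact_id}-{version}.jar"
    )


def s3a_jar_urls(hadoop_version: str = "3.3.2",
                 sdk_version: str = "1.12.426") -> List[str]:
    """Return URLs for the two JARs named in the post (defaults match the post)."""
    return [
        maven_jar_url("com.amazonaws", "aws-java-sdk-bundle", sdk_version),
        maven_jar_url("org.apache.hadoop", "hadoop-aws", hadoop_version),
    ]


def count_rows(spark, path: str) -> int:
    """Count rows of a Parquet dataset; `spark` is an existing SparkSession."""
    return spark.read.parquet(path).count()


if __name__ == "__main__":
    for url in s3a_jar_urls():
        print(url)
```

With the JARs in place, something like `count_rows(spark, "s3a://my-bucket/some/prefix/")` should then work, where the bucket and prefix are placeholders.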
This is the last post in my series on how to connect Spark to various data sources.
Here is how to connect to Google Cloud Storage using Spark 3.x.
First, create a service account in GCP, then download the JSON key file. Save it somewhere secure (e.g. import it into Vault or Secrets Manager).
Grant that service account permission to read and write to the bucket.
Grab the Hadoop 3.x JAR (gcs-connector-hadoop3-latest.
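To give a feel for where the key file from the steps above ends up, here is a minimal sketch of the Spark settings involved. It is an assumption on my part, not necessarily the configuration this post goes on to use: the property names are the ones I know from the 2.x line of the GCS connector, and newer connector releases renamed the auth settings, so check the version you download. The helper names and the key file path are hypothetical.

```python
from typing import Dict, List


def gcs_spark_conf(keyfile_path: str) -> Dict[str, str]:
    """Spark settings for gs:// access with a service-account JSON key.

    Assumes the 2.x GCS connector property names.
    """
    return {
        "spark.hadoop.fs.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
        "spark.hadoop.fs.AbstractFileSystem.gs.impl":
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
        "spark.hadoop.google.cloud.auth.service.account.enable": "true",
        "spark.hadoop.google.cloud.auth.service.account.json.keyfile":
            keyfile_path,
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Turn a settings dict into `--conf key=value` arguments for spark-submit."""
    args: List[str] = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Hypothetical path; in practice, fetch the key from your secrets store.
    print(" ".join(to_submit_args(gcs_spark_conf("/secrets/gcp-key.json"))))
```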
This is another quick post on how to connect Spark to various platforms.
I used Azure Data Lake Storage on a project in the past and had a tough time figuring out what to do (there are huge differences between Azure Blob Storage, Azure Data Lake Gen1, and Azure Data Lake Gen2).
This guide assumes that you have a client_id, tenant_id, and client_secret from Azure.
Code Example
# Acquire these JARs from Maven:
# azure-data-lake-store-sdk-2.
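As a hedged illustration of how the client_id, tenant_id, and client_secret fit together, the sketch below builds the OAuth2 settings that Hadoop's `adl://` connector reads. It assumes Data Lake Gen1, since the JAR named above is the Data Lake Store SDK; Gen2 (`abfss://`) uses a different set of properties. The helper name `adls_gen1_spark_conf` is hypothetical.

```python
from typing import Dict


def adls_gen1_spark_conf(client_id: str, tenant_id: str,
                         client_secret: str) -> Dict[str, str]:
    """Spark settings for adl:// access with a service principal (Gen1 only)."""
    return {
        # Use the client-credentials OAuth2 flow.
        "spark.hadoop.fs.adl.oauth2.access.token.provider.type":
            "ClientCredential",
        "spark.hadoop.fs.adl.oauth2.client.id": client_id,
        "spark.hadoop.fs.adl.oauth2.credential": client_secret,
        # The token endpoint is scoped to the Azure AD tenant.
        "spark.hadoop.fs.adl.oauth2.refresh.url":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


if __name__ == "__main__":
    # Placeholder values; never hard-code a real secret.
    for key, value in adls_gen1_spark_conf("my-client", "my-tenant", "***").items():
        print(f"{key}={value}")
```

Each entry can be passed to `spark-submit` as `--conf key=value` or set on the session builder with `.config(key, value)`.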
I recently graduated from Georgia Tech’s Online Master’s in Computer Science program! I’m now taking some time to reflect on my experience.
What was the program like?
Georgia Tech’s program is a 10-course (30 credit hour) Master’s degree in Computer Science. It’s fully online and follows a traditional academic structure with lectures, office hours, homework, exams, and grades. It’s the real deal, not watered down. You take the same courses the on-campus students take.
This post originally started as a post about Terraform, but I decided to break that out into a separate post.
It turned out that I had a wishlist for improvements I’d like to see in CloudFormation.
I’ve been using CloudFormation for years and have been pushing teams I work with to do the same - with mixed results.
Problems with CloudFormation
Writing Templates
CloudFormation templates are tedious to write.
CloudFormation consists of JSON or YAML files that define “stacks”.
I love seeing when something I wrote helps someone.
Someone sent me an email years ago thanking me for writing about some obscure bug or problem I solved and blogged about, and I remember it to this day!
Medium is a popular blogging platform that a lot of software engineers use, and I’ve started cross-posting there.
Why Cross-Post?
A picture is worth a thousand words:
That’s right - 60 people found my post, found it useful, and clapped!
M1 Mac + Logitech Mouse + Logi Options
This will be a quick post because I’m sure Logitech will fix this eventually (or maybe I’m the only person with this problem).
I have a Logitech M705 mouse that I use extensively with my M1 Mac Mini. I configured one of the side buttons to launch Mission Control (to see all my windows instantly), but this seems to crash some bit of Logitech code every once in a while: roughly once an hour the button stops working for about a minute, then starts working again.
I recently began properly versioning the software I write. I’m working on a Python package that I hope others will use. My goal is to iterate and release new features and fixes over time, but I need a way to signal to the world that a new version is available.
Why You Should Version Your Code
It’s a best practice.
It makes you think about supporting users of your code.
It only takes a few extra minutes during the software development lifecycle.