S3A on Spark 3.x in 2023
Published by Mike Staszel on March 14, 2023

This is an update to my post from almost 3 years ago. The world has moved on to Spark 3.3, and the JARs you need to access S3 from Spark have moved on with it.

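Run these commands to download the JARs for Spark 3.3.2 into $SPARK_HOME/jars/ (the hadoop-aws version should match the Hadoop version your Spark build ships with):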

wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.426/aws-java-sdk-bundle-1.12.426.jar -P $SPARK_HOME/jars/
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.2/hadoop-aws-3.3.2.jar -P $SPARK_HOME/jars/
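
If you would rather script the download, for example in an image build, here is a rough Python sketch of the same two steps. The helper names `maven_jar_url` and `download_jars` are made up for illustration. The only things taken from this post are the Maven coordinates and versions in the wget commands above.

```python
import os
import urllib.request

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"

# (group, artifact, version) copied from the wget commands above.
JARS = [
    ("com.amazonaws", "aws-java-sdk-bundle", "1.12.426"),
    ("org.apache.hadoop", "hadoop-aws", "3.3.2"),
]


def maven_jar_url(group_id, artifact_id, version):
    """Build the Maven Central URL for a JAR (hypothetical helper)."""
    # Maven lays out a group ID as nested directories: a.b.c -> a/b/c
    group_path = group_id.replace(".", "/")
    return (
        f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{version}/"
        f"{artifact_id}-{version}.jar"
    )


def download_jars(dest_dir):
    """Download every JAR in JARS into dest_dir and return the saved paths."""
    os.makedirs(dest_dir, exist_ok=True)
    saved = []
    for group_id, artifact_id, version in JARS:
        url = maven_jar_url(group_id, artifact_id, version)
        target = os.path.join(dest_dir, os.path.basename(url))
        urllib.request.urlretrieve(url, target)
        saved.append(target)
    return saved
```

Calling `download_jars(os.path.join(os.environ["SPARK_HOME"], "jars"))` should leave you with the same two files as the wget commands, assuming `SPARK_HOME` is set and the directory is writable.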

That’s all there is to it. The s3a:// prefix should work now for reading and writing data using Spark 3.3.2.
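
To check that everything is wired up, a small PySpark round trip is usually enough. This is a minimal sketch, not a definitive setup. The bucket name `my-example-bucket`, the key prefix, and the helper names are placeholders. It also assumes PySpark is installed and that your AWS credentials can be found by S3A's default credential providers (environment variables or an instance profile, for example).

```python
def s3a_path(bucket, key=""):
    """Build an s3a:// URI from a bucket name and an optional key prefix."""
    return f"s3a://{bucket.strip('/')}/{key.lstrip('/')}"


def round_trip(spark, path):
    """Write a two-row DataFrame to path as Parquet, read it back, return the row count."""
    df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
    df.write.mode("overwrite").parquet(path)
    return spark.read.parquet(path).count()


def main():
    # Imported here so the helpers above can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3a-smoke-test").getOrCreate()
    try:
        # Placeholder bucket and prefix: replace with a location you can write to.
        path = s3a_path("my-example-bucket", "tmp/s3a-smoke-test/")
        print(round_trip(spark, path))
    finally:
        spark.stop()
```

Run `main()` from a script launched with `spark-submit`, or call `round_trip` directly from a `pyspark` shell. If the JARs are missing or mismatched, the write typically fails with a class-not-found error for the S3A filesystem.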
