Follow me on:

S3A on Spark 2.4
Published by Mike Staszel on November 29, 2020

Quick post mostly for my own reference since I always need to re-learn how to do this.

This used to be more difficult in older versions of Spark, but when using Spark 2.4 or later, all you have to do is:

wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar -P $SPARK_HOME/jars/
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar -P $SPARK_HOME/jars/

That’s all there is to it. The s3a:// prefix should work now for reading and writing data using Spark.

Featured Posts

  1. A typical modern Spark stack nowadays most likely runs Spark jobs on a Kubernetes cluster, especially for heavy usage. Workloads are moving away from EMR on EC2 to either EMR on EKS or open-source Spark on EKS. When you’re running Spark on EKS, you probably want to scale your Kubernetes nodes up and down as you need them. You might only need to run a few jobs per day, or you might need to run hundreds of jobs, each with different resource requirements.

    aws development kubernetes

  2. Hi there, I’m Mike. 🔭 I’m currently working on big data engineering with Spark on k8s on AWS at iSpot.tv. 🌱 I’m focusing on mentoring and coaching my team to improve their skills and release awesome products. 🌎 I occasionally write blog posts about software engineering and other topics. Management and Software Engineering I consider myself to be a software engineer at heart. Nowadays I’m trying to do less code-writing and more of everything else: