Spark to Azure Data Lake Storage Gen1
Published by Mike Staszel on January 29, 2023

This is another quick post on how to connect Spark to various platforms.

I used Azure Data Lake Storage on a project in the past and had a tough time figuring out what to do (there are huge differences between Azure Blob Storage, Azure Data Lake Gen1, and Azure Data Lake Gen2).
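One visible difference between the three services is the URI scheme each Hadoop connector expects. The sketch below is only an illustration of that point: the `storage_service` helper and its lookup table are hypothetical names, not part of Spark or any Azure library.

```python
from urllib.parse import urlparse

# URI schemes used by the Hadoop connectors for each Azure storage service.
SCHEME_TO_SERVICE = {
    "wasb": "Azure Blob Storage",
    "wasbs": "Azure Blob Storage",
    "adl": "Azure Data Lake Storage Gen1",
    "abfs": "Azure Data Lake Storage Gen2",
    "abfss": "Azure Data Lake Storage Gen2",
}


def storage_service(uri: str) -> str:
    """Return the Azure storage service implied by a URI's scheme (hypothetical helper)."""
    scheme = urlparse(uri).scheme.lower()
    if scheme not in SCHEME_TO_SERVICE:
        raise ValueError(f"unrecognized storage scheme: {scheme!r}")
    return SCHEME_TO_SERVICE[scheme]
```

If a path starts with `adl://`, as the one in this post does, it is a Gen1 path, and the Gen2 (`abfss://`) or Blob Storage (`wasbs://`) settings will not apply to it.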

This guide assumes that you have a client_id, tenant_id, and client_secret from Azure.
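Rather than pasting those three values into a script, one option is to read them from environment variables. This is a minimal sketch under that assumption: the variable names (`AZURE_TENANT_ID`, `AZURE_CLIENT_ID`, `AZURE_CLIENT_SECRET`) and the `load_credentials` helper are illustrative choices, so rename them to match your setup.

```python
import os
from typing import Mapping

# Hypothetical environment variable names; adjust to your own conventions.
REQUIRED_VARS = {
    "tenant_id": "AZURE_TENANT_ID",
    "client_id": "AZURE_CLIENT_ID",
    "client_secret": "AZURE_CLIENT_SECRET",
}


def load_credentials(env: Mapping[str, str] = os.environ) -> dict:
    """Collect the three credentials, failing early if any is missing or empty."""
    missing = [var for var in REQUIRED_VARS.values() if not env.get(var)]
    if missing:
        raise KeyError(f"missing environment variables: {', '.join(missing)}")
    return {name: env[var] for name, var in REQUIRED_VARS.items()}
```

Failing early with the list of missing variables tends to be easier to debug than an authentication error raised later, when Spark first touches the storage account.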

Code Example

# Acquire these JARs from Maven:
# azure-data-lake-store-sdk-2.3.10.jar
# hadoop-azure-datalake-3.2.3.jar
# wildfly-openssl-1.0.7.Final.jar
# place them in $SPARK_HOME/jars/

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
tenant_id = "some-identifier-here"
client_id = "some-identifier-here"
client_secret = "super-top-secret-here"

spark.conf.set("fs.adl.account.auth.type", "OAuth")
spark.conf.set("fs.adl.oauth2.refresh.url", f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
spark.conf.set("fs.adl.oauth2.client.id", client_id)
spark.conf.set("fs.adl.oauth2.credential", client_secret)

# That's all there is to it:
df = spark.read.parquet("adl://something.azuredatalakestore.net/folder/")

Finding the correct JARs and Spark configurations was more than half the battle. Hopefully this post helps someone out in the future!
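If you end up reusing these settings across several jobs, it may help to keep them in one place. The sketch below wraps the same four keys from the code example in a small function; the name `adl_oauth_conf` is hypothetical, and the keys and values are copied from this post rather than from any official reference.

```python
def adl_oauth_conf(tenant_id: str, client_id: str, client_secret: str) -> dict:
    """Build the Spark settings used in this post for ADLS Gen1 (hypothetical helper)."""
    return {
        "fs.adl.account.auth.type": "OAuth",
        "fs.adl.oauth2.refresh.url": (
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token"
        ),
        "fs.adl.oauth2.client.id": client_id,
        "fs.adl.oauth2.credential": client_secret,
    }
```

With an existing session, the settings could then be applied by looping over `adl_oauth_conf(...).items()` and calling `spark.conf.set(key, value)` for each pair.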
