
Spark to Google Cloud Storage
Published by Mike Staszel on February 2, 2023

This is the last post in my series on connecting Spark to various data sources.

Here is how to connect to Google Cloud Storage using Spark 3.x.

First, create a service account in GCP, then download the JSON key file. Save it somewhere secure (e.g. import it into Vault or Secrets Manager).
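The key file is plain JSON, so before wiring it into Spark you can sanity-check that it parses and contains the fields the connector relies on. A minimal sketch (the field names below are the standard ones in GCP service-account key files; `check_keyfile` is just a hypothetical helper name):

```python
import json

# Fields present in every GCP service-account key file.
REQUIRED_FIELDS = {"type", "project_id", "private_key", "client_email"}

def check_keyfile(path: str) -> str:
    """Parse a service-account key file and return its client email."""
    with open(path) as f:
        key = json.load(f)
    missing = REQUIRED_FIELDS - key.keys()
    if missing:
        raise ValueError(f"key file is missing fields: {sorted(missing)}")
    if key["type"] != "service_account":
        raise ValueError(f"unexpected credential type: {key['type']}")
    return key["client_email"]
```

For example, `check_keyfile("PATH-TO-JSON-KEYFILE.json")` should return the service account's email address; anything else means you downloaded the wrong kind of credential.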

Grant that service account permission to read and write the bucket (e.g. the roles/storage.objectAdmin role on the bucket).

Grab the Hadoop 3.x JAR (gcs-connector-hadoop3-latest.jar) from Google (https://storage.googleapis.com/hadoop-lib/gcs/gcs-connector-hadoop3-latest.jar).

Place that JAR into Spark’s jars directory.
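One way to confirm the JAR actually landed in the right place is to glob Spark's jars directory. A small sketch, assuming SPARK_HOME points at your Spark installation (`find_gcs_connector` is a hypothetical helper):

```python
import glob
import os

def find_gcs_connector(spark_home: str) -> list:
    """Return paths of any GCS connector JARs in Spark's jars directory."""
    return glob.glob(os.path.join(spark_home, "jars", "gcs-connector-*.jar"))

# e.g. find_gcs_connector(os.environ["SPARK_HOME"])
```

If this comes back empty, Spark will typically fail with an error along the lines of "No FileSystem for scheme: gs" the first time you touch a bucket.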

Finally, fire up a Spark shell (or PySpark as in this example), and run:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Point the connector at the service account key file from earlier.
spark.conf.set("google.cloud.auth.service.account.enable", "true")
spark.conf.set("google.cloud.auth.service.account.json.keyfile", "PATH-TO-JSON-KEYFILE.json")

# Write a small test DataFrame to the bucket, then read it back.
# Replace YOUR-BUCKET with the bucket the service account can access.
df = spark.createDataFrame([{"num": x} for x in range(10000)])
df.write.mode("overwrite").parquet("gs://YOUR-BUCKET/test-data/")
spark.read.parquet("gs://YOUR-BUCKET/test-data/").count()

That’s all there is to it.

This was a quick post, mainly for myself, in case I ever need to come back to it.
