Updating my post from almost 3 years ago! The world has moved on to Spark 3.3, and so have the JARs you need to access S3 from Spark.
Run these commands to download JARs for Spark 3.3.2:
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.426/aws-java-sdk-bundle-1.12.426.jar -P $SPARK_HOME/jars/
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.2/hadoop-aws-3.3.2.jar -P $SPARK_HOME/jars/
That’s all there is to it. The s3a:// prefix should work now for reading and writing data using Spark 3.3.2.
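With those two JARs in place, using s3a:// is mostly a matter of pointing Spark at your AWS credentials. A minimal PySpark configuration sketch; the bucket name, paths, and credential values below are placeholders, and in practice the default AWS provider chain (environment variables, instance profile) can supply credentials instead:

```python
from pyspark.sql import SparkSession

# Build a session; hadoop-aws and aws-java-sdk-bundle are already in $SPARK_HOME/jars/.
spark = (
    SparkSession.builder
    .appName("s3a-example")
    # Only needed if credentials aren't available via the default provider chain.
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# Read and write with the s3a:// prefix ("my-bucket" is a placeholder)
df = spark.read.parquet("s3a://my-bucket/input/")
df.write.mode("overwrite").parquet("s3a://my-bucket/output/")
```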
This is the last post in my series of how to connect Spark to various data sources.
Here is how to connect to Google Cloud Storage using Spark 3.x.
First, create a service account in GCP, then download the JSON key file. Save it somewhere secure (e.g. import it into Vault or Secrets Manager).
Grant that service account permission to read and write to the bucket.
Grab the Hadoop 3.x connector JAR (gcs-connector-hadoop3-latest.jar).
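Once the connector JAR is on Spark's classpath, the remaining work is configuration. A sketch of the session setup, assuming the service account key file was saved locally; the key-file path and bucket name are placeholders:

```python
from pyspark.sql import SparkSession

# Configuration sketch for the GCS connector (gs:// paths).
spark = (
    SparkSession.builder
    .appName("gcs-example")
    # Register the connector's filesystem implementation for gs:// URIs
    .config("spark.hadoop.fs.gs.impl",
            "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    # Authenticate with the service account's JSON key file
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.google.cloud.auth.service.account.json.keyfile",
            "/path/to/service-account-key.json")
    .getOrCreate()
)

df = spark.read.csv("gs://my-bucket/input/", header=True)
```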
This is another quick post for how to connect Spark to various platforms.
I used Azure Data Lake Storage on a project in the past and had a tough time figuring out what to do (there are huge differences between Azure Blob Storage, Azure Data Lake Gen1, and Azure Data Lake Gen2).
This guide assumes that you have a client_id, tenant_id, and client_secret from Azure.
Code Example

# Acquire these JARs from Maven:
# azure-data-lake-store-sdk-2.
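Since the SDK JAR above is the Data Lake Gen1 one, here is a configuration sketch for the adl:// scheme using the service principal credentials mentioned earlier. The store name is a placeholder, and `client_id`, `client_secret`, and `tenant_id` are assumed to hold your Azure values (Gen2's abfss:// scheme uses a different set of config keys):

```python
from pyspark.sql import SparkSession

# ADLS Gen1 (adl://) OAuth configuration sketch with a service principal.
spark = (
    SparkSession.builder
    .appName("adls-example")
    .config("spark.hadoop.dfs.adls.oauth2.access.token.provider.type",
            "ClientCredential")
    .config("spark.hadoop.dfs.adls.oauth2.client.id", client_id)
    .config("spark.hadoop.dfs.adls.oauth2.credential", client_secret)
    # Token endpoint for your Azure AD tenant
    .config("spark.hadoop.dfs.adls.oauth2.refresh.url",
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token")
    .getOrCreate()
)

# "my-store" is a placeholder for your Data Lake store name
df = spark.read.parquet("adl://my-store.azuredatalakestore.net/input/")
```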
Quick post mostly for my own reference since I always need to re-learn how to do this.
This used to be more difficult in older versions of Spark, but when using Spark 2.4 or later, all you have to do is:
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar -P $SPARK_HOME/jars/
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar -P $SPARK_HOME/jars/
That’s all there is to it. The s3a:// prefix should work now for reading and writing data using Spark.
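A quick sketch of what that looks like from PySpark once the JARs are in place. Here I'm assuming credentials come from the standard AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables; the bucket and paths are placeholders:

```python
from pyspark.sql import SparkSession

# With the hadoop-aws and aws-java-sdk JARs in $SPARK_HOME/jars/,
# s3a:// paths work without any extra plumbing.
spark = SparkSession.builder.appName("s3a-spark2").getOrCreate()

df = spark.read.json("s3a://my-bucket/logs/")
df.write.csv("s3a://my-bucket/out/", header=True)
```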
One thing I hear often from people starting out with Spark is that it’s too difficult to install. Some guides are for Spark 1.x and others are for 2.x. Some guides get really detailed with Hadoop versions, JAR files, and environment variables.
Here’s yet another guide on how to install Apache Spark, condensed and simplified to get you up and running with Apache Spark 2.3.1 in 3 minutes or less.
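The short version boils down to a handful of shell commands. A sketch, assuming a Linux or macOS shell with Java installed, and that the 2.3.1 tarball is still available from the Apache archive:

```shell
# Download and unpack a prebuilt Spark distribution
wget https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
tar -xzf spark-2.3.1-bin-hadoop2.7.tgz

# Point SPARK_HOME at the unpacked directory and add its bin/ to PATH
export SPARK_HOME="$PWD/spark-2.3.1-bin-hadoop2.7"
export PATH="$SPARK_HOME/bin:$PATH"

# Verify the install
spark-shell --version
```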
Google recently launched a preview of Colaboratory, a new service that lets you edit and run IPython notebooks right from Google Drive, for free! It’s similar to Databricks; give Databricks a try if you’re looking for a better-supported way to run Spark in the cloud, launch clusters, and much more.
Google has published some tutorials showing how to use TensorFlow and various other Google APIs and tools on Colaboratory, but I wanted to try installing Apache Spark.
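A sketch of a Colab notebook cell that gets Spark running; the Spark and JDK versions here are assumptions, and the `!` lines are Colab's shell escapes:

```python
# Install a JDK and fetch a prebuilt Spark distribution (versions are examples)
!apt-get install -y -qq openjdk-8-jdk-headless > /dev/null
!wget -q https://archive.apache.org/dist/spark/spark-2.3.1/spark-2.3.1-bin-hadoop2.7.tgz
!tar -xzf spark-2.3.1-bin-hadoop2.7.tgz
!pip install -q findspark

import os
# Tell findspark where Java and Spark live in the Colab VM
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark-2.3.1-bin-hadoop2.7"

import findspark
findspark.init()

from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]").getOrCreate()
```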
I recently ran into a use case that the usual Spark CSV writer didn’t handle very well — the data I was writing had an unusual encoding, odd characters, and was really large.
I needed a way to use the Python unicodecsv library with a Spark dataframe to write to a huge output CSV file.
I don’t know how I missed this RDD method before, but toLocalIterator was the cleanest, most straightforward way I found to make this work.
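The shape of the solution, as a sketch: `toLocalIterator()` streams rows to the driver one partition at a time, so the whole dataset never has to fit in driver memory, and each row can be handed straight to a unicodecsv writer. This assumes `df` is an existing Spark DataFrame, the unicodecsv package is installed, and the output path is a placeholder:

```python
import unicodecsv

# unicodecsv wants a binary file handle plus an explicit encoding
with open("/tmp/huge_output.csv", "wb") as f:
    writer = unicodecsv.writer(f, encoding="utf-8")
    writer.writerow(df.columns)  # header row
    # Stream rows from the cluster one partition at a time,
    # instead of collect()-ing everything into driver memory.
    for row in df.rdd.toLocalIterator():
        writer.writerow(row)
```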