Mike Staszel

Building software engineering teams.

7 results for spark

S3A on Spark 3.x in 2023

Updating my post from almost 3 years ago! The world has moved on to Spark 3.3, and so have the JARs you need to access S3 from Spark. Run these commands to download the JARs for Spark 3.3.2:

    wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.426/aws-java-sdk-bundle-1.12.426.jar -P $SPARK_HOME/jars/
    wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.2/hadoop-aws-3.3.2.jar -P $SPARK_HOME/jars/

That’s all there is to it. The s3a:// prefix should work now for reading and writing data using Spark 3.3.2.
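For anyone scripting this step instead of running wget by hand, here is a minimal Python sketch of the same downloads. It assumes the standard Maven Central layout (group path, artifact, version) seen in the URLs above. The helper names maven_jar_url and download_jars are hypothetical, made up for this illustration and not part of Spark or any library.

```python
import os
import urllib.request

MAVEN_CENTRAL = "https://repo1.maven.org/maven2"

# Coordinates (group, artifact, version) copied from the wget commands in the post.
S3A_JARS = [
    ("com.amazonaws", "aws-java-sdk-bundle", "1.12.426"),
    ("org.apache.hadoop", "hadoop-aws", "3.3.2"),
]


def maven_jar_url(group_id, artifact_id, version):
    """Build the Maven Central URL for a JAR from its coordinates (hypothetical helper)."""
    group_path = group_id.replace(".", "/")
    return (
        f"{MAVEN_CENTRAL}/{group_path}/{artifact_id}/{version}/"
        f"{artifact_id}-{version}.jar"
    )


def download_jars(jars_dir, jars=S3A_JARS):
    """Download each JAR into jars_dir, skipping files that are already present."""
    for group_id, artifact_id, version in jars:
        url = maven_jar_url(group_id, artifact_id, version)
        target = os.path.join(jars_dir, url.rsplit("/", 1)[-1])
        if not os.path.exists(target):
            urllib.request.urlretrieve(url, target)
```

Calling download_jars(os.path.join(os.environ["SPARK_HOME"], "jars")) should leave the same two files in place as the wget commands, assuming SPARK_HOME is set.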

development spark Published March 14, 2023
Spark to Google Cloud Storage

This is the last post in my series on how to connect Spark to various data sources. Here is how to connect to Google Cloud Storage using Spark 3.x. First, create a service account in GCP, then download the JSON key file. Save it somewhere secure (e.g. import it into Vault or Secrets Manager). Grant that service account permission to read and write to the bucket. Grab the Hadoop 3.x JAR (gcs-connector-hadoop3-latest.

spark Published February 2, 2023
Spark to Azure Data Lake Storage Gen1

This is another quick post on how to connect Spark to various platforms. I used Azure Data Lake Storage on a project in the past and had a tough time figuring out what to do (there are huge differences between Azure Blob Storage, Azure Data Lake Gen1, and Azure Data Lake Gen2). This guide assumes that you have a client_id, tenant_id, and client_secret from Azure.

Code Example

    # Acquire these JARs from Maven:
    # azure-data-lake-store-sdk-2.

spark Published January 29, 2023
S3A on Spark 2.4

Quick post mostly for my own reference since I always need to re-learn how to do this. This used to be more difficult in older versions of Spark, but when using Spark 2.4 or later, all you have to do is:

    wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar -P $SPARK_HOME/jars/
    wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.3/hadoop-aws-2.7.3.jar -P $SPARK_HOME/jars/

That’s all there is to it. The s3a:// prefix should work now for reading and writing data using Spark.

spark Published November 29, 2020
Installing Spark on Ubuntu in 3 Minutes

One thing I hear often from people starting out with Spark is that it’s too difficult to install. Some guides are for Spark 1.x and others are for 2.x. Some guides get really detailed with Hadoop versions, JAR files, and environment variables. Here’s yet another guide on how to install Apache Spark, condensed and simplified to get you up and running with Apache Spark 2.3.1 in 3 minutes or less.

development spark Published September 19, 2018
Apache Spark on Google Colaboratory

Google recently launched a preview of Colaboratory, a new service that lets you edit and run IPython notebooks right from Google Drive, for free! It’s similar to Databricks, which is worth a try if you’re looking for a better-supported way to run Spark in the cloud, launch clusters, and much more. Google has published some tutorials showing how to use TensorFlow and various other Google APIs and tools on Colaboratory, but I wanted to try installing Apache Spark.

spark python Published March 7, 2018
Writing Huge CSVs Easily and Efficiently With PySpark

I recently ran into a use case that the usual Spark CSV writer didn’t handle very well: the data I was writing had an unusual encoding and odd characters, and it was really large. I needed a way to use the Python unicodecsv library with a Spark dataframe to write to a huge output CSV file. I don’t know how I missed this RDD method before, but toLocalIterator was the cleanest, most straightforward way I got this to work.
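As a rough illustration of the toLocalIterator idea, here is a minimal sketch that streams rows into a single CSV file one at a time. It is not the post's own code: it swaps in the Python 3 standard-library csv module for unicodecsv, and the function name write_rows_to_csv and its parameters are hypothetical. It accepts any iterable of row tuples, so it can be tried without a Spark cluster.

```python
import csv


def write_rows_to_csv(rows, header, path, encoding="utf-8"):
    """Stream rows to one CSV file and return the number of data rows written.

    `rows` can be any iterable of sequences, so only one row needs to be
    held in memory at a time (hypothetical helper for illustration).
    """
    count = 0
    # newline="" lets the csv module control line endings itself.
    with open(path, "w", newline="", encoding=encoding) as f:
        writer = csv.writer(f)
        writer.writerow(header)
        for row in rows:
            writer.writerow(row)
            count += 1
    return count
```

With a PySpark dataframe df, a call along the lines of write_rows_to_csv(df.toLocalIterator(), df.columns, "out.csv") should work, since toLocalIterator yields rows to the driver incrementally instead of collecting the whole dataset at once. The encoding argument is where an unusual output encoding would be set.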

spark Published February 5, 2018
