Writing Huge CSVs Easily and Efficiently With PySpark
Published by Mike Staszel on February 5, 2018

I recently ran into a use case that the usual Spark CSV writer didn’t handle very well — the data I was writing had an unusual encoding, odd characters, and was really large.

I needed a way to use the Python unicodecsv library with a Spark dataframe to write to a huge output CSV file.

I don’t know how I missed this RDD method before, but toLocalIterator was the cleanest, most straightforward way to get this working. Here’s the code (with a bunch of data cleanup and wrangling omitted):

import unicodecsv

# Open in binary mode: unicodecsv encodes each row to bytes before writing.
with open("output.csv", "wb") as f:
    w = unicodecsv.DictWriter(f, fieldnames=["num", "data"])
    for rdd_row in df.rdd.toLocalIterator():
        w.writerow(rdd_row.asDict())

That’s all there is to it! toLocalIterator returns a Python iterator that yields RDD rows. It’s essentially a collect to the driver, but only one partition is pulled and processed at a time, so the driver never has to hold the whole dataframe in memory.
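
Because only one partition lands on the driver at a time, peak driver memory is roughly the size of your largest partition. Below is a minimal, self-contained sketch of the same pattern; the toy dataframe, the column names, and the repartition count of 50 are illustrative assumptions, not from my original job. Repartitioning into more (and therefore smaller) partitions before iterating keeps each pull small:

import unicodecsv
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-export-sketch").getOrCreate()

# Toy dataframe standing in for the real (much larger) one.
df = spark.createDataFrame([(i, "row-%d" % i) for i in range(1000)], ["num", "data"])

with open("output.csv", "wb") as f:
    w = unicodecsv.DictWriter(f, fieldnames=["num", "data"])
    # More partitions means smaller partitions, so less data sits on the driver per pull.
    for row in df.repartition(50).rdd.toLocalIterator():
        w.writerow(row.asDict())

The repartition isn’t required, but it’s a handy knob for driver memory when the source partitions are large.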
