Writing Huge CSVs Easily and Efficiently with PySpark

I recently ran into a use case that the usual Spark CSV writer didn’t handle very well – the data I was writing had an unusual encoding, odd characters, and was really large.

I needed a way to use the Python unicodecsv library with a Spark dataframe to write to a huge output CSV file.

I don’t know how I missed this RDD method before, but toLocalIterator was the cleanest, most straightforward way to get this working. Here’s the code (with a bunch of data cleanup and wrangling omitted):

with open("output.csv", "w+") as f:
    w = unicodecsv.DictWriter(f, fieldnames=["num", "data"])

    for rdd_row in df.rdd.toLocalIterator():

That’s all there is to it! toLocalIterator returns a Python iterator that yields RDD rows. It’s essentially a collect to the driver, but only one partition is materialized at a time, so the driver only needs enough memory for the largest partition rather than the entire dataset.
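One consequence is that a single oversized partition can still blow out driver memory. If that happens, you can repartition before iterating so each chunk pulled to the driver is smaller. A minimal sketch, assuming the same df and DictWriter w from the snippet above; the partition count of 200 is purely illustrative:

    # more, smaller partitions keep the per-iteration memory on the driver
    # down to roughly one partition's worth of rows
    for rdd_row in df.repartition(200).rdd.toLocalIterator():
        w.writerow(rdd_row.asDict())  # w is the DictWriter from the snippet above

The repartition does add a shuffle, so it’s only worth doing when the default partitions are too big to fit comfortably on the driver.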


Mike Staszel