I recently ran into a use case that the usual Spark CSV writer didn’t handle very well — the data I was writing had an unusual encoding, odd characters, and was really large.
I needed a way to use the Python
unicodecsv library with a Spark dataframe to write to a huge output CSV file.
I don’t know how I missed this RDD method before, but
toLocalIterator was the cleanest, most straight-forward way I got this to work. Here’s the code (with a bunch of data cleanup and wrangling omitted):
with open("output.csv", "w+") as f: w = unicodecsv.DictWriter(f, fieldnames=["num", "data"]) for rdd_row in df.rdd.toLocalIterator(): w.writerow(rdd_row.asDict())
That’s all there is to it!
toLocalIterator returns a Python iterator which yields RDD rows. It’s essentially doing a collect to the driver, but only one partition is being processed at a time, saving memory.