After setting up a few Spark + Scala projects I decided to open-source a boilerplate sample project that you can import right into IntelliJ and build with one command.
Usually I write Apache Spark code in Python, but there are a few situations where I prefer Scala:
- Some functionality isn’t available in PySpark yet.
- It’s easier to bundle dependencies into the JAR file than to install them on every cluster node.
- I need that extra bit of performance.
- There are even more reasons on StackOverflow.
One of the downsides of Scala compared to Python is setting up the initial project structure. With PySpark, a single “.py” file does the trick; with Scala you need a build definition, plugin configuration, and a proper directory layout. This boilerplate project makes Spark + Scala just as easy: grab the code, run “sbt assembly”, and you’ll have a JAR file ready to use with “spark-submit”.
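To give a sense of what goes into the JAR, here is a minimal sketch of a Spark entry point in Scala. The object name and the toy job are my own illustration, not necessarily what the boilerplate repo contains:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical entry point — the boilerplate repo's actual layout may differ.
object Main {
  def main(args: Array[String]): Unit = {
    // getOrCreate() picks up master/app settings passed via spark-submit.
    val spark = SparkSession.builder()
      .appName("boilerplate-example")
      .getOrCreate()

    // Toy job: count the even numbers in 0 until 100.
    val evens = spark.range(100).filter("id % 2 = 0").count()
    println(s"Even numbers: $evens")

    spark.stop()
  }
}
```

After “sbt assembly” produces the fat JAR, you would run it with something like “spark-submit --class Main target/scala-2.12/your-assembly.jar” (the exact path and Scala version depend on your build settings).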
Check it out here: https://github.com/mikestaszel/spark-scala-boilerplate