A typical modern Spark stack runs Spark jobs on a Kubernetes cluster, especially at heavy usage. Workloads are moving away from EMR on EC2 toward either EMR on EKS or open-source Spark on EKS.
When you’re running Spark on EKS, you probably want to scale your Kubernetes nodes up and down as you need them. You might only need to run a few jobs per day, or you might need to run hundreds of jobs, each with different resource requirements. Either way — you need to scale your EKS cluster up and down. You’re probably using the default Kubernetes cluster-autoscaler.
It works for any Kubernetes workload, including both open-source Spark and EMR on EKS.
Karpenter vs. cluster-autoscaler
Karpenter is a relatively new project from AWS that completely replaces the default cluster-autoscaler.
To understand Karpenter, let’s briefly talk about how the default cluster-autoscaler works. Typically, you’d create a few nodegroups on your Kubernetes cluster with different instance types and sizes. You could have a “small” nodegroup with cheap instances and a “large” nodegroup with more expensive instances for workloads that need more executor memory.
Someone has to decide what nodegroups to create, what instance types should go in each nodegroup, and keep them up-to-date as AWS adds more instance types. You probably also want different nodegroups for on-demand and spot instances. Some of your Spark jobs might benefit from a GPU (RAPIDS) — you should make a nodegroup for that too. Soon, you’re overwhelmed with nodegroups.
At the end of the day, you just want to specify driver and executor memory and cores. Who cares what types of instances they run on?
Karpenter abstracts all of that away from you. No more nodegroups.
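Instead of a pile of nodegroup definitions, you give Karpenter a single Provisioner resource describing what it's allowed to launch. As a rough sketch (using the karpenter.sh/v1alpha5 API; the discovery tags and limits here are placeholder assumptions, not values from this article):

```yaml
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Let Karpenter choose from both spot and on-demand capacity
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot", "on-demand"]
  # Cap the total capacity Karpenter may provision
  limits:
    resources:
      cpu: "1000"
  # Terminate empty nodes after 30 seconds
  ttlSecondsAfterEmpty: 30
  provider:
    # Placeholder discovery tags — match whatever your subnets
    # and security groups are tagged with
    subnetSelector:
      karpenter.sh/discovery: my-cluster
    securityGroupSelector:
      karpenter.sh/discovery: my-cluster
```

Note what's absent: no instance types anywhere. Karpenter picks them per pod.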
Karpenter is Fast
When you submit a Spark job and your cluster needs to scale up, it probably takes cluster-autoscaler 2 or 3 minutes to spin up a new EC2 node for the driver. Then it might take another minute to register the node with Kubernetes. Then your driver pod starts running.
Then, the driver pod requests pods for your executors, and you repeat the entire process over. More waiting for cluster-autoscaler to provision new EC2 nodes and register them with Kubernetes.
Karpenter makes all of this a lot faster.
Karpenter requests new EC2 nodes almost immediately, typically within 5 seconds of a pod going unschedulable.
While the EC2 instances are still spinning up, Karpenter itself binds the pods to the new nodes, before they've even finished registering with Kubernetes.
In practice, this cuts down the time to start a Spark job on brand new EC2 nodes from ~3 minutes down to ~1 minute. If you’re running hundreds of Spark jobs per day, this adds up quickly.
Karpenter + Spot Instances
This is where you really save money.
The default cluster-autoscaler only knows about the instance types in your nodegroups, and it does a reasonable job of choosing the cheapest spot instance type from that list.
Karpenter doesn’t have nodegroups. It only cares about how much CPU and memory your pod needs. It can run those pods on any instance type that has enough CPU and memory.
Let’s say I have a Spark job that needs 10 executors with 4 cores and 20G of executor memory. This translates into 10 Kubernetes pods with CPU and memory requests. Karpenter takes those 10 pod requests and chooses the cheapest possible spot instances that can run those pods.
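To see what Karpenter actually has to bin-pack for a job like this, here's a back-of-the-envelope sketch of the pod requests Spark generates. It assumes Spark's default memory overhead factor of 0.10 for JVM jobs (`spark.kubernetes.memoryOverheadFactor`); your actual requests depend on your configs:

```python
# Rough sketch of the resource requests Karpenter sees for a Spark job.
# Assumes the default JVM overhead factor of 0.10 and the 384 MiB floor;
# real requests depend on spark.kubernetes.memoryOverheadFactor etc.

def executor_pod_request(cores, memory_gib, overhead_factor=0.10):
    # Spark adds max(overhead_factor * memory, 384 MiB) of overhead
    # on top of executor memory when sizing the pod's memory request
    overhead_gib = max(overhead_factor * memory_gib, 384 / 1024)
    return cores, memory_gib + overhead_gib

executors = 10
cores, mem = executor_pod_request(4, 20)
print(f"Each executor pod requests {cores} CPUs and {mem:.0f} GiB")
print(f"Karpenter must place {executors * cores} CPUs / {executors * mem:.0f} GiB total")
```

Karpenter's job is then to find the cheapest set of instances, of any type, that fits those totals.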
In practice, I’ve seen this turn into 3 spot r4.8xlarge instances at a small fraction of the on-demand price.
The best part is that I have no idea how much memory or CPU an r4.8xlarge instance has, nor should I care. I’d probably never even put an r4.8xlarge into my nodegroup definitions because they cost over $2/hour at on-demand pricing.
And when I inevitably lose my spot instances, Karpenter immediately requests replacements during my two-minute spot instance termination window.
The easiest way to try all of this out is to spin up a new EKS cluster using eksctl. Recent versions have Karpenter support built in, so your subnets, security groups, and the Karpenter controller will all be set up for you on a fresh EKS cluster.
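A minimal eksctl ClusterConfig using that built-in Karpenter support might look like the sketch below. The cluster name, region, and version numbers are placeholder assumptions; check eksctl's docs for the currently supported Karpenter version:

```yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: spark-karpenter        # placeholder cluster name
  region: us-east-1
  tags:
    karpenter.sh/discovery: spark-karpenter
iam:
  withOIDC: true               # required for the Karpenter IAM role
karpenter:
  version: "0.20.0"            # placeholder — use a version eksctl supports
  createServiceAccount: true
managedNodeGroups:
  # A small static nodegroup to host the Karpenter controller itself
  - name: system
    instanceType: m5.large
    desiredCapacity: 2
```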
Spark Configs for Karpenter
Here are a few useful Spark (3.3 or later) configs to get you off the ground. Tune these as needed.
"spark.kubernetes.driver.annotation.karpenter.sh/do-not-evict": "true", "spark.kubernetes.driver.node.selector.karpenter.sh/capacity-type": "on-demand", "spark.kubernetes.executor.node.selector.karpenter.sh/capacity-type": "spot", "spark.kubernetes.node.selector.topology.kubernetes.io/zone": "us-east-1a"
These configs tell Karpenter to:
- never evict (move) driver pods, even when Karpenter consolidates workloads onto fewer nodes to maximize resource usage
- run the driver on an on-demand instance
- run the executors on spot instances
- run all pods in the same AWS availability zone (you don’t want to pay cross-AZ transfer fees!)
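Putting it together, a spark-submit invocation wiring in those configs might look like this sketch. The API endpoint, container image, and application path are placeholders for your own setup:

```shell
# Hypothetical spark-submit — endpoint, image, and app path are placeholders
spark-submit \
  --master k8s://https://<your-eks-api-endpoint> \
  --deploy-mode cluster \
  --conf spark.kubernetes.container.image=<your-spark-image> \
  --conf spark.executor.instances=10 \
  --conf spark.executor.cores=4 \
  --conf spark.executor.memory=20g \
  --conf spark.kubernetes.driver.annotation.karpenter.sh/do-not-evict=true \
  --conf spark.kubernetes.driver.node.selector.karpenter.sh/capacity-type=on-demand \
  --conf spark.kubernetes.executor.node.selector.karpenter.sh/capacity-type=spot \
  --conf spark.kubernetes.node.selector.topology.kubernetes.io/zone=us-east-1a \
  local:///opt/spark/examples/src/main/python/pi.py
```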
That’s really all there is to it. You don’t need to modify anything else in your setup — continue using Airflow, k9s, Kubecost, and whatever else you use. Karpenter is an almost-drop-in replacement for cluster-autoscaler.
If you’re using the default cluster-autoscaler, try Karpenter.
You’ll be amazed at how much simpler it is to run Spark jobs without having to manage nodegroups. You’ll love its speed and its ability to place pods on extremely cheap spot instance types you’ve probably overlooked.