Chapter 15 (Advanced)

Production Spark Best Practices

Resource Sizing, Cluster Tuning, Reliability, and Cost

Everything from Chapters 1 through 14 (lazy evaluation, shuffles, partitioning, caching, AQE, and streaming) comes together when a job has to run reliably and cheaply in production, night after night.

There's no single "correct" cluster size. The right configuration depends on whether a job is bottlenecked on shuffle, memory, or I/O, and tuning the wrong knob can make things worse.

Production also means planning for failure: tasks die, nodes get reclaimed, and a job that can't retry or checkpoint cleanly will fail an entire run over a single bad task.

Key Concepts

Sizing Executors

More, smaller executors improve parallelism and fault isolation; fewer, larger executors reduce shuffle overhead and JVM duplication. Most teams land on 4-5 cores per executor.
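
The trade-off above can be turned into a little arithmetic. The sketch below follows one common rule of thumb, not an official Spark formula: reserve one core and 1 GiB per node for the OS and daemons, pack executors at a chosen core count, and set aside roughly 10% of each executor's memory share as off-heap overhead. The names `ExecutorLayout` and `size_executors` and all of their parameters are hypothetical, introduced only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ExecutorLayout:
    executors_per_node: int
    cores_per_executor: int
    heap_gib: int       # candidate value for spark.executor.memory
    overhead_gib: int   # candidate value for spark.executor.memoryOverhead


def size_executors(node_cores: int, node_mem_gib: int,
                   cores_per_executor: int = 5,
                   overhead_fraction: float = 0.10) -> ExecutorLayout:
    """Rule-of-thumb layout for one worker node (an assumption, not a Spark API)."""
    usable_cores = node_cores - 1        # leave one core for the OS and node daemons
    usable_mem_gib = node_mem_gib - 1    # leave 1 GiB for the same reason
    executors = usable_cores // cores_per_executor
    if executors < 1:
        raise ValueError("node has too few cores for the requested executor size")
    share_gib = usable_mem_gib // executors  # heap plus overhead, per executor
    overhead_gib = max(1, math.ceil(share_gib * overhead_fraction))
    heap_gib = share_gib - overhead_gib
    if heap_gib < 1:
        raise ValueError("node has too little memory for this many executors")
    return ExecutorLayout(executors, cores_per_executor, heap_gib, overhead_gib)


layout = size_executors(node_cores=16, node_mem_gib=64)
print(layout)
```

For a 16-core, 64 GiB node this yields 3 executors of 5 cores each, with 18 GiB of heap and 3 GiB of overhead apiece. Treat the result as a starting point to validate against the Spark UI, since a shuffle-bound job and a memory-bound job may want different layouts.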

Reliability: Retries and Speculative Execution

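spark.task.maxFailures sets how many times a single task may fail (the default is 4) before Spark aborts the stage and fails the job. spark.speculation, which is off by default, launches a second copy of a task that is running much slower than its peers in the same stage, in case the original is stuck on a bad node; whichever copy finishes first is used.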
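
As a rough illustration, these settings can be collected in one place and rendered as spark-submit arguments. The configuration keys are real Spark properties; the helper names `reliability_conf` and `to_submit_args` and the chosen values are assumptions for this sketch, not recommendations for every cluster.

```python
from typing import Dict, List


def reliability_conf(max_task_failures: int = 4,
                     speculate: bool = True,
                     multiplier: float = 1.5,
                     quantile: float = 0.75) -> Dict[str, str]:
    """Build retry and speculation settings as Spark config key/value strings."""
    if max_task_failures < 1:
        raise ValueError("max_task_failures must be at least 1")
    if not 0.0 < quantile <= 1.0:
        raise ValueError("quantile must be in (0, 1]")
    return {
        # Failures allowed for any single task before the job is given up on.
        "spark.task.maxFailures": str(max_task_failures),
        # Launch duplicate attempts of unusually slow tasks.
        "spark.speculation": "true" if speculate else "false",
        # How many times slower than the median a task must be to qualify.
        "spark.speculation.multiplier": str(multiplier),
        # Fraction of a stage's tasks that must finish before speculation starts.
        "spark.speculation.quantile": str(quantile),
    }


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a config dict as a flat list of spark-submit --conf arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


args = to_submit_args(reliability_conf())
print(" ".join(args))
```

Because speculation can run the same task twice, it is safest when task side effects are idempotent, for example writes that go through Spark's own output committers rather than ad hoc calls to an external system.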

Cost Optimization

Dynamic allocation scales executors up and down with workload. Spot/preemptible instances cut cost for fault-tolerant batch jobs, since Spark can recompute lost partitions from lineage.
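
A minimal sketch of a dynamic allocation setup follows. The property names are real Spark settings, but the helper `dynamic_allocation_conf`, its parameters, and the example bounds are hypothetical. It assumes Spark 3.0 or later, where shuffle tracking can stand in for an external shuffle service so that executors holding shuffle data are not removed too early.

```python
from typing import Dict


def dynamic_allocation_conf(min_executors: int, max_executors: int,
                            external_shuffle_service: bool = False,
                            idle_timeout_s: int = 60) -> Dict[str, str]:
    """Build dynamic allocation settings as Spark config key/value strings."""
    if not 0 <= min_executors <= max_executors:
        raise ValueError("need 0 <= min_executors <= max_executors")
    conf = {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.minExecutors": str(min_executors),
        "spark.dynamicAllocation.maxExecutors": str(max_executors),
        # Executors idle for longer than this become candidates for removal.
        "spark.dynamicAllocation.executorIdleTimeout": f"{idle_timeout_s}s",
    }
    if external_shuffle_service:
        # Shuffle files are served by a per-node service and outlive the executor.
        conf["spark.shuffle.service.enabled"] = "true"
    else:
        # Spark 3.0+: keep executors alive while their shuffle data is still needed.
        conf["spark.dynamicAllocation.shuffleTracking.enabled"] = "true"
    return conf


for key, value in dynamic_allocation_conf(2, 50).items():
    print(f"{key}={value}")
```

The upper bound doubles as a cost cap: on spot or preemptible capacity, a reclaimed node costs recomputation time rather than correctness, so the bound limits how much the cluster can grow while it catches up.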

Production Spark Configs

Core Concept

Five things to check before a job goes to production: (1) shuffle partition count matches data size, (2) AQE is enabled, (3) input files are columnar and partitioned, (4) checkpointing is configured for any stateful or streaming step, and (5) task retries and speculation are set for the cluster's reliability profile.
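
Part of this checklist can be approximated as a pre-flight check over a job's configuration. The sketch below is illustrative only: the function names, the 128 MiB target partition size, and the "within 2x" tolerance are assumptions, and item (3), the file format and partitioning of the input, cannot be verified from configuration alone. The Spark property names themselves are real.

```python
import math
from typing import Dict, List

TARGET_PARTITION_MIB = 128  # assumed target shuffle partition size, a rule of thumb


def suggested_shuffle_partitions(shuffle_gib: float) -> int:
    """Partition count that puts roughly TARGET_PARTITION_MIB in each partition."""
    return max(1, math.ceil(shuffle_gib * 1024 / TARGET_PARTITION_MIB))


def preflight(conf: Dict[str, str], shuffle_gib: float, stateful: bool) -> List[str]:
    """Return a list of human-readable problems; an empty list means no findings."""
    problems: List[str] = []

    # (1) Shuffle partition count roughly matches the data size.
    want = suggested_shuffle_partitions(shuffle_gib)
    have = int(conf.get("spark.sql.shuffle.partitions", "200"))  # Spark's default is 200
    if not want / 2 <= have <= want * 2:
        problems.append(f"spark.sql.shuffle.partitions={have}, expected about {want}")

    # (2) AQE is explicitly enabled (the default differs across Spark versions).
    if conf.get("spark.sql.adaptive.enabled", "").lower() != "true":
        problems.append("spark.sql.adaptive.enabled is not set to true")

    # (4) Stateful or streaming steps need a checkpoint location. This only sees
    # the session-wide default; a per-query checkpointLocation option is not visible here.
    if stateful and "spark.sql.streaming.checkpointLocation" not in conf:
        problems.append("no spark.sql.streaming.checkpointLocation configured")

    # (5) Retries and speculation are chosen deliberately, not left implicit.
    for key in ("spark.task.maxFailures", "spark.speculation"):
        if key not in conf:
            problems.append(f"{key} is not set explicitly")

    return problems


job_conf = {
    "spark.sql.shuffle.partitions": "800",
    "spark.sql.adaptive.enabled": "true",
    "spark.task.maxFailures": "4",
    "spark.speculation": "true",
}
print(preflight(job_conf, shuffle_gib=100, stateful=False))
```

With about 100 GiB of shuffle data the suggested count is 800 partitions, so the example configuration produces no findings, while an empty configuration for a stateful job would be flagged on all four checked items.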