Production Spark Best Practices
Resource Sizing, Cluster Tuning, Reliability, and Cost
Everything from Chapters 1 through 14 (lazy evaluation, shuffles, partitioning, caching, AQE, streaming) comes together when a job has to run reliably and cheaply in production, night after night.
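There's no single "correct" cluster size. The right configuration depends on whether a job is bottlenecked on shuffle, memory, or I/O, and tuning the wrong knob can make things worse. Adding executor memory to a job that is bound on shuffle I/O, for example, raises the bill without shortening the run.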
Production also means planning for failure: tasks die, nodes get reclaimed, and a job that can't retry or checkpoint cleanly will fail an entire run over a single bad task.
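As a rough illustration of planning for failure, the sketch below builds the two Spark settings that most directly govern task retries and speculation. The helper name failure_tolerance_conf, the unreliable_cluster flag, and the specific values are assumptions chosen for illustration; only the configuration keys themselves come from Spark.

```python
from typing import Dict


def failure_tolerance_conf(unreliable_cluster: bool) -> Dict[str, str]:
    """Return Spark settings for task retries and speculation.

    Hypothetical helper: the values below are illustrative starting
    points, not recommendations for any particular workload.
    """
    return {
        # How many times a single task may fail before Spark gives up
        # on the job (Spark's default is 4).
        "spark.task.maxFailures": "8" if unreliable_cluster else "4",
        # Whether to launch a second copy of a slow-running task on
        # another executor (Spark's default is false).
        "spark.speculation": "true" if unreliable_cluster else "false",
    }


if __name__ == "__main__":
    # Print the settings in spark-submit form for a cluster where
    # nodes are frequently reclaimed.
    for key, value in failure_tolerance_conf(unreliable_cluster=True).items():
        print(f"--conf {key}={value}")
```

Each pair can be passed to spark-submit with --conf, or set on a SparkSession builder with .config(key, value). Whether a higher retry count or speculation actually helps depends on why tasks are failing, so treat these values as something to test rather than adopt.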
Five things to check before a job goes to production: (1) shuffle partition count matches data size, (2) AQE is enabled, (3) input files are columnar and partitioned, (4) checkpointing is configured for any stateful or streaming step, and (5) task retries and speculation are set for the cluster's reliability profile.
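The checklist above can be sketched as a small pre-flight function. This is a minimal sketch under stated assumptions: the function names, the 128 MiB target partition size, and the "within a factor of two" tolerance are all hypothetical choices, not Spark constants. The function inspects a plain dictionary of settings, so it runs without a cluster.

```python
import math
from typing import Dict, List

# Assumed target size per shuffle partition; a rule of thumb, not a Spark constant.
TARGET_PARTITION_BYTES = 128 * 1024 * 1024


def suggested_shuffle_partitions(shuffle_bytes: int) -> int:
    """Estimate a shuffle partition count from the expected shuffle size."""
    return max(1, math.ceil(shuffle_bytes / TARGET_PARTITION_BYTES))


def preflight(
    conf: Dict[str, str],
    shuffle_bytes: int,
    input_format: str,
    input_partitioned: bool,
    has_stateful_step: bool,
    checkpoint_dir: str = "",
) -> List[str]:
    """Return a list of checklist items that the job does not yet satisfy."""
    problems: List[str] = []

    # (1) Shuffle partition count roughly matches data size.
    # 200 is Spark's default for spark.sql.shuffle.partitions.
    configured = int(conf.get("spark.sql.shuffle.partitions", "200"))
    suggested = suggested_shuffle_partitions(shuffle_bytes)
    if not (suggested / 2 <= configured <= suggested * 2):
        problems.append(
            f"shuffle partitions: configured {configured}, suggested about {suggested}"
        )

    # (2) AQE is enabled. The real default depends on the Spark version,
    # so this sketch requires the setting to be explicit.
    if conf.get("spark.sql.adaptive.enabled", "false").lower() != "true":
        problems.append("AQE is not explicitly enabled")

    # (3) Input files are columnar and partitioned.
    if input_format.lower() not in {"parquet", "orc"} or not input_partitioned:
        problems.append("input is not columnar and partitioned")

    # (4) Checkpointing is configured for stateful or streaming steps.
    if has_stateful_step and not checkpoint_dir:
        problems.append("stateful step has no checkpoint location")

    # (5) Retries and speculation were set deliberately rather than inherited.
    for key in ("spark.task.maxFailures", "spark.speculation"):
        if key not in conf:
            problems.append(f"{key} is not set explicitly")

    return problems


if __name__ == "__main__":
    # An unconfigured job reading unpartitioned CSV with about 1 TiB of shuffle.
    for problem in preflight({}, 1024**4, "csv", False, True):
        print("FAIL:", problem)
```

An empty result means all five items pass under these assumptions. The shuffle size has to come from somewhere, typically the shuffle read and write figures of a previous run, and the tolerance band should be widened or narrowed to suit the job.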