Chapter 12Advanced

Performance Tuning and Optimization

AQE, Data Skew, Predicate Pushdown, and Shuffle Optimization

Data is rarely evenly distributed. If you groupBy or join on a key like user_id and one user accounts for 40% of all rows, the shuffle sends 40% of the data to a single partition. That one task runs far longer than the others, and the whole stage waits for it. This is Data Skew.

Adaptive Query Execution (AQE), on by default since Spark 3.0, watches actual shuffle statistics at runtime and re-optimizes the plan. It can coalesce many small shuffle partitions into fewer larger ones, switch a sort-merge join to a broadcast join once it sees a table is actually small, and, with spark.sql.adaptive.skewJoin.enabled, split an oversized partition into several smaller tasks that run in parallel.

When AQE alone isn't enough, salting helps: add a random suffix to the skewed key so its rows spread across many partitions for the first aggregation, then run a second aggregation that strips the suffix and combines the partial results.

Try It

Key Concepts

Data Skew

An uneven distribution of values for a join or groupBy key, so one shuffle partition ends up far larger than the rest and becomes a straggler task.

Adaptive Query Execution (AQE)

Spark re-optimizes the physical plan at runtime using real shuffle statistics: coalescing small partitions, switching join strategies, and splitting skewed partitions automatically.

Salting

A manual technique for skew AQE can't fully fix: append a random suffix to the hot key to spread its rows across partitions, aggregate, then combine the partial results.

AQE and Salting

Core Concept

Three quick wins for a slow job: turn on AQE (spark.sql.adaptive.enabled, on by default since 3.0), check the Spark UI for one task taking far longer than the rest, that's skew, and run explain() to confirm filters are pushed down before reaching for salting or manual repartitioning.