Performance Tuning and Optimization
AQE, Data Skew, Predicate Pushdown, and Shuffle Optimization
Data is rarely evenly distributed. If you groupBy or join on a key like user_id and one user accounts for 40% of all rows, the shuffle sends 40% of the data to a single partition. That one task runs far longer than the others, and the whole stage waits for it. This is Data Skew.
Adaptive Query Execution (AQE), on by default since Spark 3.0, watches actual shuffle statistics at runtime and re-optimizes the plan. It can coalesce many small shuffle partitions into fewer larger ones, switch a sort-merge join to a broadcast join once it sees a table is actually small, and, with spark.sql.adaptive.skewJoin.enabled, split an oversized partition into several smaller tasks that run in parallel.
When AQE alone isn't enough, salting helps: add a random suffix to the skewed key so its rows spread across many partitions for the first aggregation, then run a second aggregation that strips the suffix and combines the partial results.
Three quick wins for a slow job: turn on AQE (spark.sql.adaptive.enabled, on by default since 3.0), check the Spark UI for one task taking far longer than the rest, that's skew, and run explain() to confirm filters are pushed down before reaching for salting or manual repartitioning.