Chapter 07 · Intermediate

Query Plan Optimization

Catalyst Optimizer, Logical Plans, Physical Plans, and Cost-Based Optimization

In Chapter 6 you saw that DataFrame code and SQL compile to the same logical plan. But Catalyst doesn't stop there: it rewrites that plan several times before anything runs.

Catalyst moves through four stages: a Parsed Logical Plan straight from your syntax, an Analyzed Logical Plan with column and table names resolved against the catalog, an Optimized Logical Plan with rule-based rewrites like predicate pushdown and projection pruning, and finally one or more Physical Plans.

When multiple physical plans are possible, for example different join strategies, the Cost-Based Optimizer picks the cheapest one using table statistics. Run df.explain(True) to see all four stages for any query.

From your query to a running plan
Key Concepts
01 Predicate Pushdown
Catalyst pushes filter conditions down to the data source, so files and row groups that can't match are skipped entirely.
02 Projection Pruning
Columns that are never referenced downstream are removed from the plan, so Spark reads less data off disk.
03 Cost-Based Optimization
Using table size statistics, Catalyst chooses the cheapest physical plan, for example a broadcast join instead of a sort-merge join.
df.explain(True)
Core Concept

Catalyst doesn't just translate your query; it rewrites it. The plan you write and the plan that runs are rarely the same.