Query Plan Optimization
Catalyst Optimizer, Logical Plans, Physical Plans, and Cost-Based Optimization
In Chapter 6 you saw that DataFrame code and SQL compile to the same logical plan. But Catalyst doesn't stop there: it rewrites that plan several times before anything runs.
Catalyst moves through four stages: a Parsed Logical Plan built directly from your syntax, with names still unresolved; an Analyzed Logical Plan with column and table names resolved against the catalog; an Optimized Logical Plan produced by rule-based rewrites such as predicate pushdown and projection (column) pruning; and finally one or more Physical Plans that describe how the work will actually be executed.
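To make the idea of a rule-based rewrite concrete, here is a minimal sketch of predicate pushdown on a toy plan tree in plain Python. The classes Scan, Filter, and Join and the function push_down_filters are hypothetical names invented for this illustration; they are not Catalyst's internal classes, and the rule assumes inner joins only.

```python
from dataclasses import dataclass
from typing import Tuple, Union


# Hypothetical toy plan nodes for illustration; not Spark's real classes.
@dataclass(frozen=True)
class Scan:
    table: str
    columns: Tuple[str, ...]  # columns this table exposes


@dataclass(frozen=True)
class Filter:
    column: str      # the single column the predicate reads
    predicate: str   # predicate text, treated as opaque here
    child: "Plan"


@dataclass(frozen=True)
class Join:
    left: "Plan"
    right: "Plan"
    on: str          # join key; assumed to be an inner join


Plan = Union[Scan, Filter, Join]


def output_columns(plan: Plan) -> set:
    """Return the set of column names a plan node produces."""
    if isinstance(plan, Scan):
        return set(plan.columns)
    if isinstance(plan, Filter):
        return output_columns(plan.child)
    return output_columns(plan.left) | output_columns(plan.right)


def push_down_filters(plan: Plan) -> Plan:
    """Move each Filter below a Join when only one side owns its column."""
    if isinstance(plan, Scan):
        return plan
    if isinstance(plan, Join):
        return Join(push_down_filters(plan.left),
                    push_down_filters(plan.right),
                    plan.on)
    # plan is a Filter: optimize the subtree first, then try to sink the filter.
    child = push_down_filters(plan.child)
    if isinstance(child, Join):
        in_left = plan.column in output_columns(child.left)
        in_right = plan.column in output_columns(child.right)
        if in_left and not in_right:
            pushed = Filter(plan.column, plan.predicate, child.left)
            return Join(push_down_filters(pushed), child.right, child.on)
        if in_right and not in_left:
            pushed = Filter(plan.column, plan.predicate, child.right)
            return Join(child.left, push_down_filters(pushed), child.on)
    # Column lives on both sides (or child is not a Join): leave the filter in place.
    return Filter(plan.column, plan.predicate, child)


orders = Scan("orders", ("order_id", "customer_id", "amount"))
customers = Scan("customers", ("customer_id", "country"))

# The plan as written: join first, then filter on a customers-only column.
written = Filter("country", "country = 'DE'",
                 Join(orders, customers, "customer_id"))

# The plan after the rewrite: the filter now sits directly above customers.
optimized = push_down_filters(written)
```

In this sketch the written plan filters after the join, while the rewritten plan filters customers before the join, so fewer rows reach the join. Real optimizers apply many such rules repeatedly until the plan stops changing.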
When more than one physical plan is possible (different join strategies, for example), the Cost-Based Optimizer uses table statistics such as row counts and data sizes to estimate the cost of each and picks the cheapest. To see all four stages for any query, run df.explain(True).
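The cost-based choice between join strategies can be sketched in a few lines of plain Python. Everything here is an assumption made for illustration: the function names, the row-count limit for broadcasting, and the cost formulas are invented and are not the formulas Spark uses. The point is only the shape of the decision: estimate a cost per candidate from statistics, then take the minimum.

```python
import math
from typing import Dict


def candidate_costs(left_rows: int, right_rows: int,
                    broadcast_limit_rows: int) -> Dict[str, float]:
    """Estimate a toy cost for each join strategy (hypothetical formulas)."""
    small, large = sorted((left_rows, right_rows))
    costs = {
        # Sort both inputs, then merge them: roughly n log n per side.
        "sort_merge_join": (
            left_rows * math.log2(max(left_rows, 2))
            + right_rows * math.log2(max(right_rows, 2))
        ),
    }
    # Broadcasting is only a candidate when the smaller side fits the limit.
    if small <= broadcast_limit_rows:
        # Build a hash table on the small side, stream the large side once.
        costs["broadcast_hash_join"] = float(small + large)
    return costs


def pick_cheapest(costs: Dict[str, float]) -> str:
    """Return the name of the strategy with the lowest estimated cost."""
    return min(costs, key=costs.get)


# A large fact table joined to a small lookup table.
skewed = candidate_costs(left_rows=1_000_000, right_rows=500,
                         broadcast_limit_rows=10_000)
skewed_choice = pick_cheapest(skewed)

# Two large tables: neither side is small enough to broadcast.
balanced = candidate_costs(left_rows=1_000_000, right_rows=1_000_000,
                           broadcast_limit_rows=10_000)
balanced_choice = pick_cheapest(balanced)

print(skewed_choice, balanced_choice)
```

Under these made-up numbers, the small lookup table makes the broadcast strategy the cheaper candidate, while the two large tables leave the sort-merge strategy as the only option. The quality of the choice depends entirely on the statistics fed in, which is why stale or missing table statistics tend to produce poor plans.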
Catalyst doesn't just translate your query; it rewrites it. The plan you write and the plan that runs are rarely the same.