Joins and Aggregations
Broadcast Join, Sort-Merge Join, Shuffle Hash Join, and Aggregation Strategies
Joining two DataFrames means matching rows by key. If the matching rows for a key live on different executors, Spark must move data across the network before it can join them, exactly the shuffle you saw in Chapter 5.
If one side of the join is small enough (under spark.sql.autoBroadcastJoinThreshold, 10MB by default), Spark skips the shuffle entirely. It sends a full copy of the small table to every executor, a Broadcast Join, so the large table never has to move.
When both sides are large, Spark shuffles both DataFrames by the join key so matching rows land on the same partition. From there it either sorts both sides and merges them (Sort-Merge Join, the default) or builds a hash table from one side and probes it with the other (Shuffle Hash Join).
Small table? Broadcast it with broadcast(df). Large table joining a large table? Let Spark use its default Sort-Merge Join.