Chapter 09Intermediate

Joins and Aggregations

Broadcast Join, Sort-Merge Join, Shuffle Hash Join, and Aggregation Strategies

Joining two DataFrames means matching rows by key. If the matching rows for a key live on different executors, Spark must move data across the network before it can join them, exactly the shuffle you saw in Chapter 5.

If one side of the join is small enough (under spark.sql.autoBroadcastJoinThreshold, 10MB by default), Spark skips the shuffle entirely. It sends a full copy of the small table to every executor, a Broadcast Join, so the large table never has to move.

When both sides are large, Spark shuffles both DataFrames by the join key so matching rows land on the same partition. From there it either sorts both sides and merges them (Sort-Merge Join, the default) or builds a hash table from one side and probes it with the other (Shuffle Hash Join).

Try It

Key Concepts

Broadcast Join

When one DataFrame is small, Spark copies it whole to every executor with a BroadcastExchange, avoiding a shuffle of the large table entirely.

Sort-Merge Join (Default)

Both sides are shuffled by the join key, sorted, then merged. Spark's default strategy for joining two large DataFrames.

Aggregation: Partial Then Final

groupBy() aggregates locally on each partition first (a partial aggregate), shuffles the smaller, pre-aggregated results, then combines them into a final aggregate.

broadcast() and groupBy()

Core Concept

Small table? Broadcast it with broadcast(df). Large table joining a large table? Let Spark use its default Sort-Merge Join.