Chapter 05Intermediate

Data Redistribution Across the Network

Wide Dependencies, Stage Boundaries, and Shuffle Mechanism

filter() and select() were narrow: cheap, parallel, no network. But what about groupBy() or join()? Rows with the same key might live on completely different partitions.

To group rows by key, Spark must redistribute every row so all rows with the same key land on the same partition. This redistribution is called a shuffle: data is written to disk, sent across the network, and read back by other Executors.

Because each output partition can depend on every input partition, groupBy() and join() are wide dependencies. Spark draws a hard line here: a new Stage. The next stage cannot start until the shuffle finishes.

Try It

Key Concepts

Wide Dependency

An output partition can depend on multiple, often all, input partitions. groupBy(), join(), distinct(), and repartition() are wide.

Shuffle

Spark writes data to disk, sends it across the network, and reads it back so rows with the same key end up on the same partition.

Stage Boundary

Spark splits the DAG into Stages at every wide dependency. The next Stage cannot start until the shuffle from the previous one completes.

groupBy() and join() Mean Exchange

Core Concept

Shuffles are the most expensive operation in Spark: disk I/O, network transfer, and serialization. Minimize them whenever you can.