Data Redistribution Across the Network
Wide Dependencies, Stage Boundaries, and Shuffle Mechanism
filter() and select() were narrow: cheap, parallel, no network. But what about groupBy() or join()? Rows with the same key might live on completely different partitions.
To group rows by key, Spark must redistribute every row so all rows with the same key land on the same partition. This redistribution is called a shuffle: data is written to disk, sent across the network, and read back by other Executors.
Because each output partition can depend on every input partition, groupBy() and join() are wide dependencies. Spark draws a hard line here: a new Stage. The next stage cannot start until the shuffle finishes.
Shuffles are the most expensive operation in Spark: disk I/O, network transfer, and serialization. Minimize them whenever you can.