Chapter 04 (Beginner)

Processing Data in Parallel

Narrow Dependencies, Pipelining, and Parallel Execution

In Chapter 3 you saw filter() and groupBy() build a plan. But how does Spark actually run that plan across a cluster of machines?

When each output partition depends on exactly one input partition, Spark calls that relationship a narrow dependency. filter(), select(), and map() are all narrow: every Executor can transform its own slice of data without talking to any other Executor.
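To make the idea concrete, here is a minimal sketch in plain Python. It is not the Spark API: the nested lists are a stand-in for partitions, and narrow_transform is a hypothetical helper name chosen for illustration. It only shows that each output partition can be computed from its matching input partition and nothing else.

```python
# Plain-Python model of a narrow dependency (an illustration, not Spark code).
# Each inner list stands in for one partition held by one Executor.
partitions = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]


def narrow_transform(partition):
    """Filter, then map, one partition using only that partition's rows."""
    kept = [x for x in partition if x % 2 == 0]  # plays the role of filter()
    return [x * 10 for x in kept]                # plays the role of map()


# Output partition i is computed from input partition i alone.
output = [narrow_transform(p) for p in partitions]
print(output)  # [[20, 40], [60, 80], [100, 120]]
```

Notice that narrow_transform never looks at a neighbouring list. That independence is what lets each Executor work on its own slice.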

Spark chains narrow transformations into a single pipeline per partition. With 3 partitions and 3 cores, all 3 pipelines run at the same time, and no data is exchanged between Executors.
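The same idea can be sketched with Python generators and a thread pool. This is an analogy under stated assumptions, not how Spark is implemented: generators stand in for pipelined transformations, three worker threads stand in for three cores, and run_pipeline is a hypothetical name. Python threads model the structure here rather than a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor

# Three partitions, as in the text above (an illustration, not Spark code).
partitions = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]


def run_pipeline(partition):
    """One task: stream each row through filter -> map in a single pass."""
    filtered = (x for x in partition if x > 2)  # lazy, plays the role of filter()
    mapped = (x * x for x in filtered)          # lazy, plays the role of map()
    # Rows are pulled through both steps here; no intermediate list is built.
    return list(mapped)


# Three workers stand in for three cores: one task per partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_pipeline, partitions))

print(results)  # [[9, 16], [25, 36, 49, 64], [81, 100, 121, 144]]
```

Each call to run_pipeline touches only its own partition, so the three tasks never need to coordinate with each other.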

Key Concepts
01 Narrow Dependency
Each output partition depends on exactly one input partition. filter(), select(), and map() are all narrow.
02 Pipelining
Spark fuses chained narrow transformations into one stage. Data flows from filter() to select() without ever touching disk in between.
03 Parallel Execution
Each partition's pipeline becomes one task. Give Spark at least as many cores as partitions, and every task runs at the same time.
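A quick way to reason about the cores-versus-partitions relationship is to count how many rounds of tasks are needed. The sketch below is back-of-the-envelope arithmetic, not a Spark API: task_waves is a hypothetical helper, and it assumes each task occupies one core and that all tasks take about the same time.

```python
import math


def task_waves(num_partitions, num_cores):
    """Rounds of tasks needed when each core runs one task at a time.

    Assumes one task per partition and roughly equal task durations.
    """
    if num_partitions < 0 or num_cores <= 0:
        raise ValueError("need num_partitions >= 0 and num_cores > 0")
    return math.ceil(num_partitions / num_cores)


print(task_waves(3, 3))  # 1: every task runs at the same time
print(task_waves(3, 2))  # 2: one task has to wait for a free core
print(task_waves(8, 3))  # 3
```

With fewer cores than partitions the job still finishes; the remaining tasks simply wait for a core to free up.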
Narrow Transformations, One Pipeline
Core Concept

Narrow = each partition is processed independently. No data moves between Executors.