Chapter 04 (Beginner)

Processing Data in Parallel

Narrow Dependencies, Pipelining, and Parallel Execution

In Chapter 3 you saw filter() and groupBy() build a plan. But how does Spark actually run that plan across a cluster of machines?

When each output partition depends on exactly one input partition, Spark calls it a narrow dependency. filter(), select(), and map() are all narrow: every Executor can transform its own slice of data without talking to any other Executor.
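To make the one-to-one relationship concrete, here is a toy model in plain Python rather than real Spark code. The partitions are ordinary lists, and `narrow_filter` and `narrow_map` are hypothetical helpers invented for this sketch. Each helper reads exactly one partition and never looks at the others.

```python
# Toy model: a dataset split into three partitions (plain Python lists).
partitions = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]

def narrow_filter(partition, predicate):
    """Build one output partition from exactly one input partition."""
    return [row for row in partition if predicate(row)]

def narrow_map(partition, fn):
    """Transform each row of one partition; no other partition is read."""
    return [fn(row) for row in partition]

# Output partition i is computed only from input partition i.
evens = [narrow_filter(p, lambda x: x % 2 == 0) for p in partitions]
squared = [narrow_map(p, lambda x: x * x) for p in evens]

print(squared)  # [[4, 16], [36, 64], [100, 144]]
```

Three partitions go in and three come out, and each result could have been computed on a different machine without any coordination.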

Spark chains consecutive narrow transformations into a single pipeline per partition. With 3 partitions and 3 cores, all 3 pipelines run at the same time, fully in parallel, and no data moves between Executors.
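The sketch below imitates that behavior with the Python standard library. It is not how Spark is implemented. The sample rows and the `run_pipeline` function are made up for illustration, generators stand in for the fused filter-then-select pipeline, and three worker threads stand in for three cores.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows, already split into three partitions.
partitions = [
    [{"name": "Ada", "age": 36}, {"name": "Bo", "age": 17}],
    [{"name": "Cy", "age": 52}, {"name": "Di", "age": 12}],
    [{"name": "Eli", "age": 29}, {"name": "Flo", "age": 41}],
]

def run_pipeline(partition):
    """One task: filter then select, fused into a single pass."""
    adults = (row for row in partition if row["age"] >= 18)  # like filter()
    names = (row["name"] for row in adults)                  # like select()
    # Each row is pulled through both steps one at a time; no intermediate
    # list of "adults" is ever built.
    return list(names)

# Three workers stand in for three cores: one task per partition.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(run_pipeline, partitions))

print(results)  # [['Ada'], ['Cy'], ['Eli', 'Flo']]
```

Each call to `run_pipeline` touches only its own partition, which is why the three calls can be handed to three workers without any of them waiting on another.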

Key Concepts

Narrow Dependency

Each output partition depends on exactly one input partition. filter(), select(), and map() are all narrow.

Pipelining

Spark fuses chained narrow transformations into one stage. Data flows from filter() to select() without ever touching disk in between.

Parallel Execution

Each partition's pipeline becomes one task. Give Spark at least as many cores as partitions, and every task runs at the same time.
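When there are more partitions than cores, the tasks cannot all start at once and run in successive rounds instead. A back-of-the-envelope estimate, assuming each task occupies one core and all tasks take about the same time, looks like this. The `waves` function is a hypothetical helper for this sketch, not a Spark API.

```python
import math

def waves(num_partitions, num_cores):
    """Rounds of tasks needed when each core runs one task at a time."""
    return math.ceil(num_partitions / num_cores)

print(waves(3, 3))  # 1: every task runs at the same time
print(waves(6, 3))  # 2: three tasks run, then the other three
print(waves(7, 3))  # 3: the last round has a single task
```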

Narrow Transformations, One Pipeline

Core Concept

Narrow = each partition is processed independently. No data moves between Executors.