Chapter 14 (Advanced)

Structured Streaming Fundamentals

Micro-Batch Processing, Checkpointing, and Watermarks

Structured Streaming treats a live data stream as an unbounded table that keeps growing. Instead of processing one event at a time, Spark repeats the same batch DataFrame plan over and over on small chunks of new data, called micro-batches.

Each micro-batch reads whatever new data has arrived, updates a running state (like a count per word or per user), writes the result to a sink, and records a checkpoint. If the job crashes, it resumes from that checkpoint instead of starting over.
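The loop described above can be sketched in plain Python, without Spark. This is a toy model under stated assumptions: the function name `run_micro_batches`, the JSON checkpoint file, and the `crash_after` switch are all hypothetical, and the sink write is left out. Spark's real checkpoint layout (offset log, commit log, state store) is more involved.

```python
import json
import os
import tempfile
from collections import Counter


def run_micro_batches(source, checkpoint_path, batch_size=3, crash_after=None):
    """Count words in `source` batch by batch, resuming from a checkpoint if one exists."""
    # Recover progress (offset) and running state (counts) from the last checkpoint.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            saved = json.load(f)
        offset, counts = saved["offset"], Counter(saved["counts"])
    else:
        offset, counts = 0, Counter()

    batches_run = 0
    while offset < len(source):
        if crash_after is not None and batches_run == crash_after:
            raise RuntimeError("simulated crash")
        batch = source[offset:offset + batch_size]  # read only the new data
        counts.update(batch)                        # update the running state
        offset += len(batch)
        # Record progress and state together (the sink write is omitted here).
        with open(checkpoint_path, "w") as f:
            json.dump({"offset": offset, "counts": dict(counts)}, f)
        batches_run += 1
    return dict(counts)


words = ["a", "b", "a", "c", "b", "a", "d"]
checkpoint = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

# First run crashes after one micro-batch; the second resumes at offset 3.
try:
    run_micro_batches(words, checkpoint, crash_after=1)
except RuntimeError:
    pass
result = run_micro_batches(words, checkpoint)
```

Because the offset and the counts are saved together, the restarted run neither re-reads the first three words nor loses their counts.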

Late-arriving events are common in real systems. A watermark tells Spark how long to wait for late data before finalizing a window's result and dropping anything older as too late to matter.

Key Concepts

Micro-Batch Processing

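Structured Streaming runs your DataFrame query repeatedly on small batches of new data, either as soon as the previous batch finishes (the default) or at a fixed trigger interval. Each run uses the same Catalyst-optimized plan as a batch job, applied only to the newly arrived slice of data.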

Stateful Aggregations and Checkpointing

Operations like groupBy().count() keep running state between micro-batches. Checkpointing saves that state and progress to durable storage so a restarted job picks up exactly where it left off.
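A minimal PySpark sketch of a checkpointed stateful aggregation might look like the following. It assumes an existing SparkSession is passed in as `spark`. The function name `start_checkpointed_count`, the `bucket` column, and the 5-second trigger are illustrative choices, not anything prescribed by Spark. The built-in `rate` source is used only because it needs no external system.

```python
def start_checkpointed_count(spark, checkpoint_dir):
    """Start a streaming count whose state and progress survive restarts."""
    # Imported here so this sketch can be loaded even where PySpark is not installed.
    from pyspark.sql import functions as F

    # The built-in "rate" test source emits rows with `timestamp` and `value` columns.
    events = (
        spark.readStream
        .format("rate")
        .option("rowsPerSecond", 5)
        .load()
    )

    # Stateful aggregation: a running count per bucket, kept between micro-batches.
    counts = events.groupBy((F.col("value") % 3).alias("bucket")).count()

    return (
        counts.writeStream
        .outputMode("complete")
        .format("console")
        .option("checkpointLocation", checkpoint_dir)  # state and progress are saved here
        .trigger(processingTime="5 seconds")
        .start()
    )
```

Restarting the query with the same `checkpoint_dir` is what lets it resume; pointing it at a fresh directory starts the counts from zero.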

Watermarks and Late Data

withWatermark() tells Spark the maximum delay to tolerate for late events. The watermark trails the latest event time Spark has seen by that delay. Once the watermark passes a window's end, that window's result is finalized and later events for it can be dropped.
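To make the rule concrete, here is a plain-Python simulation of tumbling windows with a watermark. It is a simplified model, not Spark's implementation. The constants `WINDOW` and `DELAY`, the function `process`, and the exact boundary test (a window is final once its end is at or below the watermark) are assumptions made for illustration. Event times are integers in seconds.

```python
from collections import defaultdict

WINDOW = 10  # tumbling window length, in seconds of event time
DELAY = 5    # watermark delay: how late an event may arrive


def process(batches):
    """Count events per window; return (finalized, dropped, still_open)."""
    open_windows = defaultdict(int)  # window start -> running count
    finalized = {}                   # window start -> final count
    dropped = []                     # events that arrived after their window closed
    max_event_time = 0
    watermark = 0

    for batch in batches:
        for t in batch:
            start = (t // WINDOW) * WINDOW
            if start + WINDOW <= watermark:
                dropped.append(t)    # window already finalized: too late
            else:
                open_windows[start] += 1
            max_event_time = max(max_event_time, t)
        # Advance the watermark once per micro-batch; it never moves backwards.
        watermark = max(watermark, max_event_time - DELAY)
        for start in sorted(open_windows):
            if start + WINDOW <= watermark:
                finalized[start] = open_windows.pop(start)
    return finalized, dropped, dict(open_windows)


# The event at t=8 is late but inside the delay, so it is counted.
# The events at t=3 and t=2 arrive after window [0, 10) is final, so they are dropped.
finalized, dropped, still_open = process([[1, 4, 12], [8, 16], [3, 21], [2]])
```

Here window [0, 10) is finalized with a count of 3 after the second batch, when the watermark reaches 11.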

readStream, writeStream, and Watermark

Core Concept

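Output mode matters: "complete" rewrites the whole result table every batch, "update" emits only the rows that changed, and "append" emits each row once, after it is final. Append is the only mode the built-in file sink supports, and for windowed aggregations it requires a watermark so Spark knows when a window is final.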
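The difference between the three modes is easiest to see on a small result table. The sketch below is a plain-Python model of what a sink would receive, not Spark code. The helper `emit` and its parameters are hypothetical, and the result tables are made-up counts keyed by window start.

```python
def emit(mode, previous, current, final_keys, already_appended):
    """Return the rows a sink would receive for one micro-batch."""
    if mode == "complete":
        return dict(current)  # the whole result table, every batch
    if mode == "update":
        # Only rows that are new or whose value changed since the last batch.
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # Only rows that are final and have not been emitted before.
        return {k: current[k] for k in final_keys if k not in already_appended}
    raise ValueError(f"unknown output mode: {mode}")


after_batch_1 = {0: 2, 10: 1}
after_batch_2 = {0: 3, 10: 1, 20: 1}  # window 0 grew, window 20 is new
final_after_batch_2 = {0}             # assume the watermark has passed window 0

complete_rows = emit("complete", after_batch_1, after_batch_2, final_after_batch_2, set())
update_rows = emit("update", after_batch_1, after_batch_2, final_after_batch_2, set())
append_rows = emit("append", after_batch_1, after_batch_2, final_after_batch_2, set())
```

In this model, complete mode sends all three rows, update mode sends only windows 0 and 20, and append mode sends only window 0, because that is the only row that can no longer change.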