Structured Streaming Fundamentals
Micro-Batch Processing, Checkpointing, and Watermarks
Structured Streaming treats a live data stream as an unbounded table that keeps growing. Instead of processing one event at a time, Spark by default runs the same DataFrame query incrementally on small chunks of newly arrived data, called micro-batches.
Each micro-batch reads whatever new data has arrived, updates a running state (like a count per word or per user), writes the result to a sink, and records its progress (source offsets and state) in a checkpoint. If the job crashes, the restarted query resumes from that checkpoint instead of starting over.
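The loop described above can be sketched in plain Python. This is a toy simulation, not Spark code: the function `run_micro_batches`, the JSON checkpoint file, and the `crash_after` parameter are all hypothetical names made up for illustration, and the sink write is left out. In a real PySpark job the checkpoint directory is set with `.option("checkpointLocation", ...)` on the stream writer.

```python
import json
import os
import tempfile
from collections import Counter


def run_micro_batches(source, checkpoint_path, batch_size=3, crash_after=None):
    """Count words in `source` batch by batch, resuming from a checkpoint.

    Hypothetical helper for illustration; Spark's real checkpoint format differs.
    """
    # Recover the offset and running state if a checkpoint exists.
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            saved = json.load(f)
        offset, counts = saved["offset"], Counter(saved["counts"])
    else:
        offset, counts = 0, Counter()

    batches_run = 0
    while offset < len(source):
        if crash_after is not None and batches_run == crash_after:
            raise RuntimeError("simulated crash")

        batch = source[offset:offset + batch_size]  # read only the new data
        counts.update(batch)                        # update the running state
        offset += len(batch)

        # Record progress: write to a temp file, then rename over the old one.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"offset": offset, "counts": counts}, f)
        os.replace(tmp_path, checkpoint_path)
        batches_run += 1

    return dict(counts)


words = ["a", "b", "a", "c", "b", "a", "d"]
ckpt = os.path.join(tempfile.mkdtemp(), "checkpoint.json")

# First run crashes after one micro-batch; the second run picks up at offset 3.
try:
    run_micro_batches(words, ckpt, crash_after=1)
except RuntimeError:
    pass
result = run_micro_batches(words, ckpt)
print(result)
```

Because the second run starts from the saved offset and counts, the first three words are not counted twice, and the final totals match what a single uninterrupted run would produce.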
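Late-arriving events are common in real systems. A watermark tells Spark how long to wait for late data: it trails the latest event time seen so far by a configured delay. Once the watermark passes the end of a window, Spark finalizes that window's result and may drop any events that arrive for it afterward as too late to matter.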
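The watermark rule can be illustrated with a small plain-Python model. It is a simplification under stated assumptions: 10-second tumbling windows, a 5-second delay, and a watermark that advances after every event (Spark advances it between micro-batches). The names `WINDOW`, `DELAY`, and `count_with_watermark` are hypothetical. In PySpark the delay itself is declared with `DataFrame.withWatermark(eventTimeColumn, delayThreshold)`.

```python
from collections import defaultdict

WINDOW = 10  # tumbling window length in seconds (assumed for this sketch)
DELAY = 5    # how far the watermark trails the latest event time (assumed)


def count_with_watermark(event_times):
    """Count events per window, dropping those that arrive behind the watermark."""
    open_windows = defaultdict(int)  # window start -> running count
    finalized = {}                   # window start -> final count
    dropped = []                     # event times rejected as too late
    max_event_time = 0

    for t in event_times:            # events in arrival order
        max_event_time = max(max_event_time, t)
        watermark = max_event_time - DELAY
        start = (t // WINDOW) * WINDOW

        if start + WINDOW <= watermark:
            dropped.append(t)        # its window has already been finalized
        else:
            open_windows[start] += 1

        # Finalize every open window whose end the watermark has passed.
        for s in sorted(open_windows):
            if s + WINDOW <= watermark:
                finalized[s] = open_windows.pop(s)

    return finalized, dict(open_windows), dropped


# Event 8 is late but still inside the delay; event 3 arrives after
# window [0, 10) has been finalized and is dropped.
finalized, still_open, dropped = count_with_watermark([1, 4, 12, 8, 16, 3, 21, 14])
print(finalized, still_open, dropped)
```

A longer delay accepts more late data at the cost of keeping window state in memory longer and delaying final results.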
Output mode matters: "complete" rewrites the whole result table every batch, "update" emits only the rows that changed, and "append" emits each row only once it is final. Append is the only mode the built-in file sink supports.
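To make the difference concrete, here is a toy model of what each mode would send to the sink for one micro-batch. The function `emit` and its arguments are hypothetical; it compares the result table before and after the batch and is not how Spark implements output modes. In PySpark the mode is selected with `writeStream.outputMode("append")` (or `"update"` / `"complete"`).

```python
def emit(mode, previous, current, newly_final):
    """Return the rows a sink would receive for one micro-batch.

    previous / current: result table (key -> value) before and after the batch.
    newly_final: keys whose rows became final in this batch (for example,
    windows that the watermark has just passed).
    """
    if mode == "complete":
        # The whole result table, every batch.
        return dict(current)
    if mode == "update":
        # Only rows that are new or whose value changed.
        return {k: v for k, v in current.items() if previous.get(k) != v}
    if mode == "append":
        # Only rows that are final and will never change again.
        return {k: current[k] for k in newly_final if k in current}
    raise ValueError(f"unknown output mode: {mode}")


previous = {"00:00": 3, "00:10": 1}
current = {"00:00": 3, "00:10": 2, "00:20": 1}
newly_final = ["00:00"]

complete_rows = emit("complete", previous, current, newly_final)
update_rows = emit("update", previous, current, newly_final)
append_rows = emit("append", previous, current, newly_final)
print(complete_rows, update_rows, append_rows)
```

Append suits sinks that cannot modify what they have already written: each row is written exactly once, after it can no longer change.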