Chapter 13 (Advanced)

Monitoring and Debugging Spark Jobs

Spark UI, DAG Visualization, Event Logs, and Troubleshooting

Every Action you call, such as count(), collect(), or write(), creates a Job in the Spark UI (some Actions trigger more than one). Each Job splits into Stages at shuffle boundaries, the same wide-dependency walls from Chapter 5. Each Stage splits into Tasks, one per partition.

The Jobs tab lists every job with its duration and status. Click a job to open its Stages tab, which shows shuffle read and write sizes and how many tasks ran. Click a stage to open its Tasks tab, the fastest way to spot a single straggler task (the skew from Chapter 12).

The DAG Visualization tab draws the same stage and shuffle structure you've been seeing in these diagrams, generated directly from your job. If spark.eventLog.enabled is set to true, the History Server can replay this entire UI later, even after the application has finished.
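
As a minimal sketch of turning event logging on, the snippet below builds a spark-submit command line with the relevant --conf flags. The log directory file:///tmp/spark-events and the application file my_job.py are illustrative assumptions, not values from this chapter.

```python
import shlex

# Settings that enable event logging. The directory is an illustrative
# assumption; any path or URI the driver can write to would do.
EVENT_LOG_CONF = {
    "spark.eventLog.enabled": "true",
    "spark.eventLog.dir": "file:///tmp/spark-events",
}


def spark_submit_command(app_path, conf):
    """Build a spark-submit command line with one --conf flag per setting."""
    args = ["spark-submit"]
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    args.append(app_path)
    return shlex.join(args)


# my_job.py is a hypothetical application file.
print(spark_submit_command("my_job.py", EVENT_LOG_CONF))
```

For the History Server to find these logs, its spark.history.fs.logDirectory setting should point at the same directory the application wrote to.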

Key Concepts

Jobs, Stages, and Tasks

An Action creates a Job. A Job splits into Stages at shuffle boundaries. A Stage splits into Tasks, one per partition, run on executors.
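
The Job, Stage, and Task hierarchy can be illustrated with a toy model. This is not Spark's scheduler, only a deliberately simplified sketch: it takes a linear chain of operations, each labeled narrow or wide, and starts a new stage at every wide one. The Op class, the pipeline, and the partition counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Op:
    name: str
    wide: bool  # True if the operation needs a shuffle (wide dependency)


def split_into_stages(ops):
    """Group a linear chain of operations into stages, cutting at each shuffle."""
    stages = [[]]
    for op in ops:
        if op.wide:
            # A wide operation begins a new stage: its input is shuffled first.
            stages.append([])
        stages[-1].append(op.name)
    return stages


# Hypothetical pipeline: read -> filter -> groupBy -> map -> orderBy
pipeline = [
    Op("read", wide=False),
    Op("filter", wide=False),
    Op("groupBy", wide=True),
    Op("map", wide=False),
    Op("orderBy", wide=True),
]

stages = split_into_stages(pipeline)
partitions_per_stage = [8, 200, 200]  # assumed partition counts, one per stage

for stage_id, (ops, partitions) in enumerate(zip(stages, partitions_per_stage)):
    # One task per partition, so the partition count is the task count.
    print(f"Stage {stage_id}: {' -> '.join(ops)} ({partitions} tasks)")
```

Two wide operations produce three stages here, and each stage's task count is simply its partition count.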

Reading the Stages Tab

Shuffle read and write sizes show how much data each stage exchanged with its neighbors, mostly over the network. If a stage's shuffle write is much larger than the next stage's shuffle read, the upstream stage likely produced data that was never needed.
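
The same comparison can be made outside the UI. The sketch below ranks stages by total shuffle volume. The field names (stageId, shuffleReadBytes, shuffleWriteBytes) are assumed to follow the stage objects returned by Spark's monitoring REST API, and the sample values are made up for illustration.

```python
def human_bytes(n):
    """Format a byte count using binary units."""
    for unit in ("B", "KiB", "MiB", "GiB", "TiB"):
        if n < 1024 or unit == "TiB":
            return f"{n:.1f} {unit}"
        n /= 1024


def rank_by_shuffle(stages):
    """Return stages sorted by total shuffle bytes, largest first."""
    return sorted(
        stages,
        key=lambda s: s["shuffleReadBytes"] + s["shuffleWriteBytes"],
        reverse=True,
    )


MIB = 1024 ** 2

# Made-up sample data in the assumed REST API shape.
sample_stages = [
    {"stageId": 0, "shuffleReadBytes": 0, "shuffleWriteBytes": 512 * MIB},
    {"stageId": 1, "shuffleReadBytes": 512 * MIB, "shuffleWriteBytes": 4 * MIB},
    {"stageId": 2, "shuffleReadBytes": 4 * MIB, "shuffleWriteBytes": 0},
]

for stage in rank_by_shuffle(sample_stages):
    print(
        f"Stage {stage['stageId']}: "
        f"read {human_bytes(stage['shuffleReadBytes'])}, "
        f"write {human_bytes(stage['shuffleWriteBytes'])}"
    )
```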

Event Logs and the History Server

With spark.eventLog.enabled set to true, Spark writes the events behind the UI to a log file in spark.eventLog.dir. The History Server reads these logs to rebuild the UI for finished applications.
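
Because an event log is plain JSON, one event per line, it can also be read directly. The sketch below collects task durations per stage from SparkListenerTaskEnd events. The sample lines are made up and heavily trimmed; real logs carry many more fields per event, and the helper name task_durations_by_stage is hypothetical.

```python
import json
from collections import defaultdict

# Made-up, trimmed lines in the JSON-lines shape of a Spark event log.
SAMPLE_LOG = """\
{"Event": "SparkListenerApplicationStart", "App Name": "demo"}
{"Event": "SparkListenerTaskEnd", "Stage ID": 0, "Task Info": {"Task ID": 0, "Launch Time": 1000, "Finish Time": 1400}}
{"Event": "SparkListenerTaskEnd", "Stage ID": 0, "Task Info": {"Task ID": 1, "Launch Time": 1000, "Finish Time": 1450}}
{"Event": "SparkListenerTaskEnd", "Stage ID": 1, "Task Info": {"Task ID": 2, "Launch Time": 1500, "Finish Time": 10500}}
"""


def task_durations_by_stage(lines):
    """Map each stage ID to the durations (ms) of its finished tasks."""
    durations = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Only task-end events carry the launch and finish timestamps we need.
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        info = event["Task Info"]
        durations[event["Stage ID"]].append(info["Finish Time"] - info["Launch Time"])
    return dict(durations)


durations = task_durations_by_stage(SAMPLE_LOG.splitlines())
for stage_id, values in sorted(durations.items()):
    print(f"Stage {stage_id}: {len(values)} tasks, slowest {max(values)} ms")
```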

Core Concept

When a job is slow, start at the Stages tab and sort by duration. Open the slowest stage's Tasks tab and sort by duration too. One task far above the rest means skew, not just "more data overall".
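
The "one task far above the rest" check can be sketched as a comparison against the median task duration. The threshold of three times the median is an arbitrary assumption chosen for illustration, and the durations are made up.

```python
from statistics import median


def find_stragglers(durations_ms, factor=3.0):
    """Return indices of tasks that ran longer than factor times the median."""
    if not durations_ms:
        return []
    typical = median(durations_ms)
    return [i for i, d in enumerate(durations_ms) if d > factor * typical]


# Made-up task durations for one stage, in milliseconds.
skewed_stage = [410, 395, 430, 402, 6200, 418]
even_stage = [4100, 3950, 4300, 4020, 4180]

print(find_stragglers(skewed_stage))  # one task stands out: skew
print(find_stragglers(even_stage))    # all tasks similar: just more data
```

The median is used rather than the mean because a single huge task would pull the mean upward and hide itself.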