Memory Management and Caching
Cache, Persist, Storage Levels, and Memory Utilization
Suppose you build a filtered DataFrame, then call two different Actions on it: result.count() and result.agg(sum("amount")). Because of lazy evaluation, what happens when both run?
By default, Spark has no memory of the first Action. Every Action re-walks the full lineage, so the read and filter steps run again from scratch for the second Action too.
Calling result.cache() (or .persist()) tells Spark to keep the computed result in memory after the first Action produces it. The second Action reuses that cached result instead of recomputing read and filter.
Cache a DataFrame when multiple Actions reuse it. If you only call one Action on it, skip cache(), it just adds overhead and can push other data out of memory.