Chapter 08Intermediate

Memory Management and Caching

Cache, Persist, Storage Levels, and Memory Utilization

Suppose you build a filtered DataFrame, then call two different Actions on it: result.count() and result.agg(sum("amount")). Because of lazy evaluation, what happens when both run?

By default, Spark has no memory of the first Action. Every Action re-walks the full lineage, so the read and filter steps run again from scratch for the second Action too.

Calling result.cache() (or .persist()) tells Spark to keep the computed result in memory after the first Action produces it. The second Action reuses that cached result instead of recomputing read and filter.

Try It
Key Concepts
01
cache() and persist()
df.cache() is shorthand for df.persist(StorageLevel.MEMORY_AND_DISK). It marks a DataFrame so its result is kept after the first Action computes it.
02
Storage Levels
MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and their _SER and _2 variants control where cached data lives, whether it is serialized, and whether it is replicated.
03
When Caching Hurts
Caching a DataFrame that is only used once adds overhead with no payoff, and can evict other useful cached data from limited executor memory.
cache() and persist()
Core Concept

Cache a DataFrame when multiple Actions reuse it. If you only call one Action on it, skip cache(), it just adds overhead and can push other data out of memory.