Chapter 08: Intermediate

Memory Management and Caching

Cache, Persist, Storage Levels, and Memory Utilization

Suppose you build a filtered DataFrame, then call two different Actions on it: result.count() and result.agg(sum("amount")).collect(). (agg() on its own is a lazy transformation; the collect() is what triggers execution.) Because evaluation is lazy, what happens when both run?

By default, Spark has no memory of the first Action. Every Action re-walks the full lineage, so the read and filter steps run again from scratch for the second Action too.

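Calling result.cache() (or .persist()) before the first Action tells Spark to keep the computed partitions once that Action produces them: in memory where they fit, spilling to disk otherwise under the default storage level. The call itself is lazy and only marks the DataFrame. The second Action then reuses the cached result instead of recomputing read and filter.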
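
The difference can be sketched with a toy model in plain Python. This is not Spark code: LazyFrame, build, total, and the step log are hypothetical names invented for illustration, and the three sample rows are made up. The sketch assumes only what the text states, namely that each Action re-runs the lineage unless the result was marked for caching first.

```python
class LazyFrame:
    """Toy stand-in for a DataFrame: a lineage of steps run only by an Action."""

    def __init__(self, steps, log):
        self._steps = steps            # list of (name, function): the lineage
        self._log = log                # records every step that actually runs
        self._cache_requested = False
        self._cached_rows = None

    def cache(self):
        # Lazy, like the real call: only mark the frame, compute nothing yet.
        self._cache_requested = True
        return self

    def _run(self):
        # Reuse the stored result if an earlier Action already produced it.
        if self._cached_rows is not None:
            return self._cached_rows
        rows = []
        for name, step in self._steps:
            self._log.append(name)
            rows = step(rows)
        if self._cache_requested:
            self._cached_rows = rows
        return rows

    # Two "Actions": each one forces the lineage to be evaluated.
    def count(self):
        return len(self._run())

    def total(self, column):
        return sum(row[column] for row in self._run())


def build(log):
    """Hypothetical lineage: read three rows, then keep amounts above 10."""
    data = [{"amount": 5}, {"amount": 40}, {"amount": 60}]
    steps = [
        ("read", lambda _: list(data)),
        ("filter", lambda rows: [r for r in rows if r["amount"] > 10]),
    ]
    return LazyFrame(steps, log)


# Without cache(): both Actions walk the whole lineage.
uncached_log = []
result = build(uncached_log)
uncached_count = result.count()
uncached_total = result.total("amount")

# With cache(): the second Action reuses the first Action's result.
cached_log = []
result = build(cached_log).cache()
cached_count = result.count()
cached_total = result.total("amount")

print(uncached_log)  # ['read', 'filter', 'read', 'filter']
print(cached_log)    # ['read', 'filter']
```

Both variants return the same answers (2 rows, total 100); only the amount of work differs.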

Key Concepts

cache() and persist()

For DataFrames, df.cache() is shorthand for df.persist(StorageLevel.MEMORY_AND_DISK); RDD.cache(), by contrast, defaults to MEMORY_ONLY. Either call only marks the DataFrame, so its result is kept once the first Action computes it.

Storage Levels

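MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, and their _SER and _2 variants control where cached data lives, whether it is stored serialized (_SER, offered by the Scala and Java APIs; Python data is always stored serialized), and whether each partition is replicated to a second node (_2).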

When Caching Hurts

Caching a DataFrame that is only used once adds overhead with no payoff, and can evict other useful cached data from limited executor memory.
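
To make the storage levels and the limited-memory point above concrete, here is a simplified model in plain Python of where cached partitions end up. It is a sketch, not Spark's implementation: place_partitions, the partition sizes, and the memory budget are all hypothetical, and the model ignores serialization, replication, and eviction of older cached data. It encodes only the documented behaviour that MEMORY_ONLY leaves partitions that do not fit uncached (they are recomputed when needed), MEMORY_AND_DISK writes them to disk, and DISK_ONLY stores everything on disk.

```python
def place_partitions(partition_sizes_mb, memory_budget_mb, level):
    """Return 'memory', 'disk', or 'recompute' for each partition (toy model)."""
    if level not in ("MEMORY_ONLY", "MEMORY_AND_DISK", "DISK_ONLY"):
        raise ValueError(f"unsupported level in this sketch: {level}")

    placement = []
    free_mb = memory_budget_mb
    for size in partition_sizes_mb:
        if level == "DISK_ONLY":
            placement.append("disk")
        elif size <= free_mb:
            # The partition fits in what is left of the memory budget.
            placement.append("memory")
            free_mb -= size
        elif level == "MEMORY_AND_DISK":
            # Does not fit: spill this partition to disk.
            placement.append("disk")
        else:
            # MEMORY_ONLY: not cached at all, rebuilt from lineage on demand.
            placement.append("recompute")
    return placement


sizes = [100, 100, 100]  # hypothetical partition sizes in MB
for level in ("MEMORY_ONLY", "MEMORY_AND_DISK", "DISK_ONLY"):
    print(level, place_partitions(sizes, memory_budget_mb=250, level=level))
```

With a 250 MB budget, the third partition is the one that changes fate: recomputed under MEMORY_ONLY, read back from disk under MEMORY_AND_DISK.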

cache() and persist()

Core Concept

Cache a DataFrame when multiple Actions reuse it. If you only call one Action on it, skip cache(): it just adds overhead and can push other data out of memory.
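
One way to keep caching scoped to the Actions that actually reuse the data is a small context manager that releases the cache afterwards. The helper below is a hypothetical convenience, not part of Spark; it assumes only that the object passed in exposes cache() and unpersist(), as Spark DataFrames do.

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Mark df for caching inside the block, then release it (hypothetical helper)."""
    df.cache()            # lazy: the first Action inside the block fills the cache
    try:
        yield df
    finally:
        df.unpersist()    # free executor memory for other cached data
```

With a real DataFrame this might be used as `with cached(result) as r:` followed by the Actions that share r inside the block. For a DataFrame that feeds a single Action, the wrapper (and the cache) is best left out.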