Chapter 06 (Intermediate)

Spark SQL and DataFrames

Schema, DataFrame API, SQL Engine, and Query Abstraction

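Every DataFrame has a schema: an ordered list of columns, each with a name, a data type, and a flag saying whether it can hold nulls. Spark resolves the columns your code refers to against this schema during analysis, so a misspelled column name or an incompatible type is reported before any job runs.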

You can work with that DataFrame in two ways: chain methods like .select() and .filter(), known as the DataFrame API, or register it as a view and query it with SQL through spark.sql(). These are two surfaces over the same engine.

Whichever syntax you choose, Spark translates it into the same kind of internal logical plan, and the Catalyst optimizer optimizes that plan the same way. The DataFrame API and Spark SQL are not two different engines; they are one engine wearing two different hats.

Schema of the events DataFrame

| column     | type      | nullable |
|------------|-----------|----------|
| event_id   | string    | false    |
| user_id    | bigint    | false    |
| amount     | double    | true     |
| region     | string    | true     |
| created_at | timestamp | false    |
Two syntaxes, one plan
Key Concepts
01 Schema
A DataFrame's schema lists every column's name, type, and nullability. df.printSchema() shows it as a tree.

02 DataFrame API
A chainable, schema-aware interface: .select(), .filter(), .groupBy(). Each call returns a new, lazy DataFrame; nothing executes until an action such as .show() or .collect() is called.

03 Query Abstraction
The DataFrame API and Spark SQL both compile down to the same logical plan. Pick whichever reads better for the task.
DataFrame API vs. Spark SQL
Core Concept

Both produce the same plan. The optimizer doesn't care which syntax you chose.