Chapter 06 · Intermediate

Spark SQL and DataFrames

Schema, DataFrame API, SQL Engine, and Query Abstraction

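Every DataFrame has a schema: a list of column names, types, and whether each column can be null. Spark resolves the column names and types in your query against this schema during analysis, before any job runs, so a misspelled column fails immediately rather than partway through a computation.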

You can work with that DataFrame in two ways: chain methods such as .select() and .filter(), known as the DataFrame API, or write SQL with spark.sql(). Both are surfaces over the same engine.

Whichever syntax you choose, Spark turns it into the same kind of internal logical plan, and the Catalyst optimizer optimizes that plan the same way. The DataFrame API and Spark SQL are not two different engines; they are one engine wearing two hats.

Schema of the events DataFrame

column      type       nullable
event_id    string     false
user_id     bigint     false
amount      double     true
region      string     true
created_at  timestamp  false

Two syntaxes, one plan

Key Concepts

Schema

A DataFrame's schema lists every column's name, type, and nullability. df.printSchema() shows it as a tree.

DataFrame API

A chainable, schema-aware interface: .select(), .filter(), .groupBy().agg(). Each transformation returns a new, lazy DataFrame; nothing runs until an action such as .count() or .show().

Query Abstraction

The DataFrame API and Spark SQL both compile down to the same logical plan. Pick whichever reads better for the task.

DataFrame API vs. Spark SQL

Core Concept

Both produce the same plan. The optimizer doesn't care which syntax you chose.