Databricks Cheatsheet

Core Concepts

Term | Description
Workspace | The web-based UI where you manage notebooks, clusters, jobs, and data.
Cluster | The set of virtual machines (compute resources) that run your commands and jobs.
Notebook | A web-based interface for writing and executing code in cells (Python, Scala, SQL, R).
DBFS | Databricks File System. A distributed file system layer over cloud storage (S3/ADLS) accessible in the workspace.
Delta Lake | The default, open-source storage format that brings ACID transactions, time travel, and performance optimizations to data lakes.
Widget | A UI element in a notebook that allows you to provide parameters to your code (e.g., text boxes, dropdowns).

Notebook Magic Commands

Place these commands on the first line of a notebook cell to switch the cell's language or to run a utility such as the file system (%fs), the shell (%sh), or Markdown rendering (%md).

Command | Description
%python | Executes the cell as Python code (often the default).
%sql | Executes the cell as SQL against a table or view in the metastore.
%scala | Executes the cell as Scala code.
%r | Executes the cell as R code.
%fs | A shortcut for dbutils.fs commands. Example: %fs ls /databricks-datasets/
%sh | Executes the cell as a shell script on the driver node.
%md | Formats the cell content as Markdown.
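
Most magic commands have programmatic equivalents, which is handy when the same logic needs to live in a plain Python cell. A minimal sketch (the dataset path is just the built-in sample folder):

    # Equivalent of a %sql cell: run SQL from Python and get a DataFrame back
    spark.sql("SELECT current_date() AS today").show()

    # Equivalent of %fs ls: list files with dbutils and inspect the results
    for f in dbutils.fs.ls("/databricks-datasets/")[:5]:
        print(f.path, f.size)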

Databricks Utilities (dbutils)

A set of powerful utility functions exclusive to Databricks notebooks.

  • File System (dbutils.fs)

    # List files in a directory
    dbutils.fs.ls("/mnt/my_storage/")
    
    # Create a directory
    dbutils.fs.mkdirs("/new_directory/")
    
    # Copy a file
    dbutils.fs.cp("file:/tmp/my_file.txt", "dbfs:/new_directory/")
    
    # Remove a directory and its contents (recurse=True); a single file does not need recurse
    dbutils.fs.rm("/new_directory/", recurse=True)
    
  • Widgets (dbutils.widgets)

    # Create a text input widget
    dbutils.widgets.text("my_parameter", "default_value", "My Parameter Label")
    
    # Get the value from a widget
    param_value = dbutils.widgets.get("my_parameter")
    
    # Remove a widget
    dbutils.widgets.remove("my_parameter")
    
  • Notebook Workflow (dbutils.notebook)

    # Run another notebook (path, timeout in seconds, arguments) and get its return value
    result = dbutils.notebook.run("./other_notebook", 3600, {"input": "value"})
    
    # Exit a notebook with a value
    dbutils.notebook.exit("This is a return value")
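
Putting widgets and notebook workflows together, here is a minimal sketch of a parent/child pattern. The child notebook path ("./child_notebook") and the argument keys are illustrative, not part of any standard API.

    import json

    # Hypothetical child notebook "./child_notebook":
    #   dbutils.widgets.text("run_date", "")
    #   run_date = dbutils.widgets.get("run_date")
    #   dbutils.notebook.exit(json.dumps({"status": "ok", "run_date": run_date}))

    # Parent notebook: pass a parameter, then parse the JSON string the child returns
    raw = dbutils.notebook.run("./child_notebook", 600, {"run_date": "2023-10-27"})
    result = json.loads(raw)
    print(result["status"])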
    

PySpark DataFrame Essentials

These are the most common operations for data manipulation using PySpark DataFrames.

  • Prerequisite Import:

    # Note: importing sum/avg/count directly shadows Python's built-ins of the same name;
    # the aliased form (F.sum, F.avg, ...) avoids that.
    from pyspark.sql.functions import col, sum, avg, count, when, lit
    import pyspark.sql.functions as F
    
  • Reading & Writing Data

    # Read from Delta Lake (default format)
    df = spark.read.table("my_catalog.my_schema.my_table")
    df = spark.read.format("delta").load("/path/to/delta_table")
    
    # Read from CSV
    df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/path/to/data.csv")
    
    # Write to Delta Lake
    df.write.format("delta").mode("overwrite").save("/path/to/delta_table")
    df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.new_table")
    
  • Viewing & Inspecting Data

    # Show top 20 rows
    df.show()
    
    # Display in a rich, scrollable table (Databricks specific)
    display(df)
    
    # Print the schema (column names and types)
    df.printSchema()
    
    # Get the number of rows
    df.count()
    
  • Selecting & Renaming Columns

    # Select specific columns
    df.select("col1", "col2").show()
    
    # Rename a column
    df.withColumnRenamed("old_name", "new_name").show()
    
    # Select and rename (alias) at the same time
    df.select(col("col1").alias("new_name1"), col("col2")).show()
    
  • Filtering Rows

    # Using column expressions (recommended)
    df.filter(col("country") == "USA").show()
    df.filter((col("age") > 30) & (col("status") != "inactive")).show()
    
    # Using SQL syntax
    df.filter("country = 'USA' AND age > 30").show()
    
  • Adding & Modifying Columns

    # Add a new column with a literal value
    df.withColumn("new_col", lit("some_value")).show()
    
    # Derive a new column from existing ones
    df.withColumn("revenue", col("quantity") * col("price")).show()
    
    # Add a column conditionally using when/otherwise
    df.withColumn("age_group", when(col("age") < 18, "minor").otherwise("adult")).show()
    
  • Aggregations (Group By)

    # Group by a column and calculate aggregates
    df.groupBy("category") \
      .agg(
          sum("sales").alias("total_sales"),
          avg("price").alias("average_price")
      ).show()
    
  • Joining DataFrames

    # Inner join on a common key
    df1.join(df2, df1.user_id == df2.id, "inner").show()
    
    • Join Types: inner, left, right, full_outer, left_semi, left_anti, cross
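
For the less common join types, a small self-contained sketch (the tiny DataFrames are made up for illustration):

    # Hypothetical data, created inline so the example is self-contained
    users = spark.createDataFrame([(1, "Ana"), (2, "Bo"), (3, "Cy")], ["user_id", "name"])
    orders = spark.createDataFrame([(1, 9.99), (3, 20.0)], ["user_id", "amount"])

    # left_semi: keep users that have at least one order (no columns from orders)
    users.join(orders, "user_id", "left_semi").show()

    # left_anti: keep users with no orders at all
    users.join(orders, "user_id", "left_anti").show()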

Working with SQL

  • Register a DataFrame as a Temporary View:
    # Create a view that is only available in the current Spark session
    df.createOrReplaceTempView("my_temp_view")
    
  • Run SQL in a %sql Cell:
    %sql
    SELECT
      category,
      SUM(sales) as total_sales
    FROM my_temp_view
    WHERE region = 'North America'
    GROUP BY category
    ORDER BY total_sales DESC
    LIMIT 10;
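
The same query can also be run from Python with spark.sql(), which returns a DataFrame and lets SQL feed further PySpark transformations. A minimal sketch reusing the my_temp_view created above:

    # spark.sql() returns a DataFrame, so SQL and PySpark mix freely
    top_categories = spark.sql("""
        SELECT category, SUM(sales) AS total_sales
        FROM my_temp_view
        GROUP BY category
        ORDER BY total_sales DESC
        LIMIT 10
    """)
    display(top_categories)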
    

Delta Lake Essentials

These commands leverage the power of the Delta Lake format.

  • Time Travel (Querying Historical Data):

    -- Query a table by version number
    SELECT * FROM my_delta_table VERSION AS OF 5;
    
    -- Query a table by timestamp
    SELECT * FROM my_delta_table TIMESTAMP AS OF '2023-10-27T10:00:00.000Z';
    
  • Merge (Upsert Operation):

    %sql
    MERGE INTO target_table t
    USING source_updates s
    ON t.id = s.id
    WHEN MATCHED THEN
      UPDATE SET t.value = s.value, t.last_updated = s.timestamp
    WHEN NOT MATCHED THEN
      INSERT (id, value, last_updated) VALUES (s.id, s.value, s.timestamp)
    
  • History & Maintenance:

    -- View the transaction history of a Delta table
    DESCRIBE HISTORY my_delta_table;
    
    -- Optimize the layout of small files into larger ones
    OPTIMIZE my_delta_table ZORDER BY (indexed_col);
    
    -- Clean up old data files that are no longer referenced (default retention is 7 days)
    VACUUM my_delta_table;
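
The same operations are also available from Python through the Delta Lake API. A minimal sketch, assuming the delta-spark package is on the cluster (it ships with Databricks runtimes; the optimize() builder needs a reasonably recent Delta Lake/DBR version) and reusing the illustrative table names above:

    from delta.tables import DeltaTable

    # Time travel from PySpark: read an older version or a point in time
    df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/path/to/delta_table")
    df_old = (spark.read.format("delta")
                   .option("timestampAsOf", "2023-10-27 10:00:00")
                   .load("/path/to/delta_table"))

    # Upsert (MERGE) with the DeltaTable builder API
    target = DeltaTable.forName(spark, "target_table")
    source_updates = spark.read.table("source_updates")
    (target.alias("t")
        .merge(source_updates.alias("s"), "t.id = s.id")
        .whenMatchedUpdate(set={"value": "s.value", "last_updated": "s.timestamp"})
        .whenNotMatchedInsert(values={"id": "s.id", "value": "s.value", "last_updated": "s.timestamp"})
        .execute())

    # History and maintenance
    target.history().show()                           # DESCRIBE HISTORY
    target.optimize().executeZOrderBy("indexed_col")  # OPTIMIZE ... ZORDER BY
    target.vacuum()                                   # VACUUM (default 7-day retention)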
    

Databricks CLI

For automating tasks from your local terminal.

Command | Description
databricks configure --token | Set up authentication.
databricks workspace ls /path/ | List workspace objects.
databricks workspace import ... | Import a notebook file to the workspace.
databricks clusters list | List available clusters.
databricks jobs create --json-file ... | Create a new job from a JSON definition.
databricks jobs run-now --job-id ... | Trigger a job run.