# Databricks Cheatsheet

## Core Concepts
| Term | Description |
| --- | --- |
| Workspace | The web-based UI where you manage notebooks, clusters, jobs, and data. |
| Cluster | The set of virtual machines (compute resources) that run your commands and jobs. |
| Notebook | A web-based interface for writing and executing code in cells (Python, Scala, SQL, R). |
| DBFS | Databricks File System. A distributed file system layer over cloud storage (S3/ADLS) accessible in the workspace. |
| Delta Lake | The default, open-source storage format that brings ACID transactions, time travel, and performance optimizations to data lakes. |
| Widget | A UI element in a notebook that allows you to provide parameters to your code (e.g., text boxes, dropdowns). |
## Notebook Magic Commands

These commands go on the first line of a notebook cell to change the cell's language or invoke special behavior, such as running a shell command.
| Command | Description |
| --- | --- |
| `%python` | Executes the cell as Python code (often the default). |
| `%sql` | Executes the cell as SQL against a table or view in the metastore. |
| `%scala` | Executes the cell as Scala code. |
| `%r` | Executes the cell as R code. |
| `%fs` | A shortcut for `dbutils.fs` commands. Example: `%fs ls /databricks-datasets/` |
| `%sh` | Executes the cell as a shell script on the driver node. |
| `%md` | Formats the cell content as Markdown. |
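Because each cell can use a different language, the usual bridge between languages is a temporary view. A minimal sketch (the `numbers` view and sample data are made up for illustration):

```python
# Python cell: create a DataFrame and register it as a temporary view
df = spark.range(10).withColumnRenamed("id", "n")
df.createOrReplaceTempView("numbers")

# A later cell can switch languages and query the same view:
# %sql
# SELECT COUNT(*) AS big_numbers FROM numbers WHERE n > 5
```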
## Databricks Utilities (`dbutils`)
A set of powerful utility functions exclusive to Databricks notebooks.
- **File System (`dbutils.fs`)**

```python
# List files in a directory
dbutils.fs.ls("/mnt/my_storage/")

# Create a directory
dbutils.fs.mkdirs("/new_directory/")

# Copy a file from the driver's local filesystem into DBFS
dbutils.fs.cp("file:/tmp/my_file.txt", "dbfs:/new_directory/")

# Remove a file or directory (recurse=True is required to delete directories)
dbutils.fs.rm("/new_directory/my_file.txt", recurse=True)
```
- **Widgets (`dbutils.widgets`)**

```python
# Create a text input widget
dbutils.widgets.text("my_parameter", "default_value", "My Parameter Label")

# Get the value from a widget
param_value = dbutils.widgets.get("my_parameter")

# Remove a widget
dbutils.widgets.remove("my_parameter")
```
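Text boxes are not the only widget type; a dropdown constrains a parameter to a fixed set of choices. A short sketch (the `env` parameter and its values are hypothetical):

```python
# Create a dropdown widget restricted to a fixed set of choices
dbutils.widgets.dropdown("env", "dev", ["dev", "staging", "prod"], "Environment")

# Read it back like any other widget
env = dbutils.widgets.get("env")
```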
- **Notebook Workflow (`dbutils.notebook`)**

```python
# Run another notebook and capture its return value
# (arguments: path, timeout in seconds, parameters passed to the child's widgets)
result = dbutils.notebook.run("./other_notebook", 3600, {"input": "value"})

# Exit a notebook with a return value
dbutils.notebook.exit("This is a return value")
```
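`dbutils.notebook.exit` can only return a single string, so a common pattern is to serialize structured results as JSON on the way out and parse them in the caller. A sketch (the notebook path and payload are hypothetical):

```python
import json

# In the child notebook: pack multiple values into the one string exit() supports
dbutils.notebook.exit(json.dumps({"status": "ok", "rows_written": 1234}))

# In the caller: parse the string back into a dict
result = json.loads(dbutils.notebook.run("./child_notebook", 600))
print(result["rows_written"])
```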
## PySpark DataFrame Essentials
These are the most common operations for data manipulation using PySpark DataFrames.
- **Prerequisite Imports**

```python
# Note: this shadows Python's built-in sum() in the current namespace
from pyspark.sql.functions import col, sum, avg, count, when, lit
import pyspark.sql.functions as F
```
- **Reading & Writing Data**

```python
# Read from Delta Lake (the default format)
df = spark.read.table("my_catalog.my_schema.my_table")
df = spark.read.format("delta").load("/path/to/delta_table")

# Read from CSV
df = (spark.read.format("csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("/path/to/data.csv"))

# Write to Delta Lake
df.write.format("delta").mode("overwrite").save("/path/to/delta_table")
df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.new_table")
```
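`inferSchema` costs an extra pass over the file, so production reads usually declare the schema explicitly. A sketch with made-up column names:

```python
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declaring the schema up front avoids the second scan that inferSchema performs
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
])

df = (spark.read.format("csv")
      .option("header", "true")
      .schema(schema)
      .load("/path/to/data.csv"))
```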
- **Viewing & Inspecting Data**

```python
# Show the top 20 rows
df.show()

# Display in a rich, scrollable table (Databricks-specific)
display(df)

# Print the schema (column names and types)
df.printSchema()

# Get the number of rows
df.count()
```
- **Selecting & Renaming Columns**

```python
# Select specific columns
df.select("col1", "col2").show()

# Rename a column
df.withColumnRenamed("old_name", "new_name").show()

# Select and rename (alias) at the same time
df.select(col("col1").alias("new_name1"), col("col2")).show()
```
- **Filtering Rows**

```python
# Using column expressions (recommended)
df.filter(col("country") == "USA").show()
df.filter((col("age") > 30) & (col("status") != "inactive")).show()

# Using a SQL expression string
df.filter("country = 'USA' AND age > 30").show()
```
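Two filters that often trip people up have dedicated `Column` methods rather than operators: membership tests and null checks. A sketch (the `country` and `email` columns are assumed):

```python
# Membership test: keep rows whose country is in the given set
df.filter(col("country").isin("USA", "Canada")).show()

# Null check: == None does not work on columns; use isNull()/isNotNull()
df.filter(col("email").isNull()).show()
```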
- **Adding & Modifying Columns**

```python
# Add a new column with a literal value
df.withColumn("new_col", lit("some_value")).show()

# Derive a new column from existing ones
df.withColumn("revenue", col("quantity") * col("price")).show()

# Add a column conditionally using when/otherwise
df.withColumn("age_group", when(col("age") < 18, "minor").otherwise("adult")).show()
```
- **Aggregations (Group By)**

```python
# Group by a column and calculate several aggregates at once
df.groupBy("category").agg(
    sum("sales").alias("total_sales"),
    avg("price").alias("average_price"),
).show()
```
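Grouping also works across multiple columns, and the `F` alias imported above keeps distinct counts unambiguous. A sketch with assumed column names (`region`, `customer_id`):

```python
# Group on two columns and count distinct customers within each group
df.groupBy("category", "region").agg(
    F.countDistinct("customer_id").alias("unique_customers"),
    F.sum("sales").alias("total_sales"),
).show()
```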
- **Joining DataFrames**

```python
# Inner join on a common key
df1.join(df2, df1.user_id == df2.id, "inner").show()
```

Join types: `inner`, `left`, `right`, `full_outer`, `left_semi`
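Of these, `left_semi` is the least self-explanatory: it filters the left side to matching rows without bringing in any right-side columns. A sketch reusing the DataFrames above:

```python
# left_semi: keep df1 rows whose user_id has a match in df2; no df2 columns are returned
df1.join(df2, df1.user_id == df2.id, "left_semi").show()

# left (outer) for contrast: keep all df1 rows, with nulls where df2 has no match
df1.join(df2, df1.user_id == df2.id, "left").show()
```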
## Working with SQL
- **Register a DataFrame as a Temporary View**

```python
# Create a view that is only available in the current Spark session
df.createOrReplaceTempView("my_temp_view")
```
- **Run SQL in a `%sql` Cell**

```sql
%sql
SELECT category, SUM(sales) AS total_sales
FROM my_temp_view
WHERE region = 'North America'
GROUP BY category
ORDER BY total_sales DESC
LIMIT 10;
```
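The same query can also be issued from a Python cell with `spark.sql`, which returns an ordinary DataFrame you can keep transforming. A sketch against the view registered above:

```python
# Run SQL from Python; the result is a regular DataFrame
top_categories = spark.sql("""
    SELECT category, SUM(sales) AS total_sales
    FROM my_temp_view
    GROUP BY category
    ORDER BY total_sales DESC
    LIMIT 10
""")
display(top_categories)
```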
## Delta Lake Essentials
These commands leverage the power of the Delta Lake format.
- **Time Travel (Querying Historical Data)**

```sql
-- Query a table by version number
SELECT * FROM my_delta_table VERSION AS OF 5;

-- Query a table by timestamp
SELECT * FROM my_delta_table TIMESTAMP AS OF '2023-10-27T10:00:00.000Z';
```
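Time travel is also exposed through DataFrame reader options, which is convenient when the historical read feeds further Python transformations. A sketch (the path is hypothetical):

```python
# Read an older version of a Delta table through the DataFrame API
df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/path/to/delta_table")

# Or pin the read to a point in time
df_then = (spark.read.format("delta")
           .option("timestampAsOf", "2023-10-27T10:00:00.000Z")
           .load("/path/to/delta_table"))
```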
- **Merge (Upsert Operation)**

```sql
%sql
MERGE INTO target_table t
USING source_updates s
ON t.id = s.id
WHEN MATCHED THEN
  UPDATE SET t.value = s.value, t.last_updated = s.timestamp
WHEN NOT MATCHED THEN
  INSERT (id, value, last_updated) VALUES (s.id, s.value, s.timestamp)
```
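The same upsert can be expressed through the `DeltaTable` Python API (from the `delta-spark` package). A sketch reusing the table names from the SQL above:

```python
from delta.tables import DeltaTable

# Look up the target Delta table and the incoming updates
target = DeltaTable.forName(spark, "target_table")
updates = spark.table("source_updates")

# Merge: update matching rows, insert the rest
(target.alias("t")
    .merge(updates.alias("s"), "t.id = s.id")
    .whenMatchedUpdate(set={"value": "s.value", "last_updated": "s.timestamp"})
    .whenNotMatchedInsert(values={
        "id": "s.id", "value": "s.value", "last_updated": "s.timestamp"})
    .execute())
```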
- **History & Maintenance**

```sql
-- View the transaction history of a Delta table
DESCRIBE HISTORY my_delta_table;

-- Optimize the layout of small files into larger ones
OPTIMIZE my_delta_table ZORDER BY (indexed_col);

-- Clean up old data files that are no longer referenced (default retention is 7 days)
VACUUM my_delta_table;
```
## Databricks CLI
For automating tasks from your local terminal.
| Command | Description |
| --- | --- |
| `databricks configure --token` | Set up authentication. |
| `databricks workspace ls /path/` | List workspace objects. |
| `databricks workspace import ...` | Import a notebook file into the workspace. |
| `databricks clusters list` | List available clusters. |
| `databricks jobs create --json-file ...` | Create a new job from a JSON definition. |
| `databricks jobs run-now --job-id ...` | Trigger a job run. |