Databricks Cheatsheet

Core Concepts

Term | Description
Workspace | The web-based UI where you manage notebooks, clusters, jobs, and data.
Cluster | The set of virtual machines (compute resources) that run your commands and jobs.
Notebook | A web-based interface for writing and executing code in cells (Python, Scala, SQL, R).
DBFS | Databricks File System. A distributed file system layer over cloud storage (S3/ADLS) accessible in the workspace.
Delta Lake | The default, open-source storage format that brings ACID transactions, time travel, and performance optimizations to data lakes.
Widget | A UI element in a notebook that allows you to provide parameters to your code (e.g., text boxes, dropdowns).

Notebook Magic Commands

Place these commands on the first line of a notebook cell to switch the cell's language or to run a utility such as the file system (%fs), the shell (%sh), or Markdown rendering (%md).

Command | Description
%python | Executes the cell as Python code (often the default).
%sql | Executes the cell as SQL against a table or view in the metastore.
%scala | Executes the cell as Scala code.
%r | Executes the cell as R code.
%fs | A shortcut for dbutils.fs commands. Example: %fs ls /databricks-datasets/
%sh | Executes the cell as a shell script on the driver node.
%md | Formats the cell content as Markdown.
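
Most magic commands have programmatic equivalents, which is handy when the same logic needs to live in a plain Python cell. A minimal sketch (the dataset path is just the built-in sample folder):

    # Equivalent of a %sql cell: run SQL from Python and get a DataFrame back
    spark.sql("SELECT current_date() AS today").show()

    # Equivalent of %fs ls: list files with dbutils and inspect the results
    for f in dbutils.fs.ls("/databricks-datasets/")[:5]:
        print(f.path, f.size)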

Databricks Utilities (dbutils)

A set of powerful utility functions exclusive to Databricks notebooks.

  • File System (dbutils.fs)

    # List files in a directory
    dbutils.fs.ls("/mnt/my_storage/")
    
    # Create a directory
    dbutils.fs.mkdirs("/new_directory/")
    
    # Copy a file
    dbutils.fs.cp("file:/tmp/my_file.txt", "dbfs:/new_directory/")
    
    # Remove a directory and its contents (recurse=True); a single file does not need recurse
    dbutils.fs.rm("/new_directory/", recurse=True)
    
  • Widgets (dbutils.widgets)

    # Create a text input widget
    dbutils.widgets.text("my_parameter", "default_value", "My Parameter Label")
    
    # Get the value from a widget
    param_value = dbutils.widgets.get("my_parameter")
    
    # Remove a widget
    dbutils.widgets.remove("my_parameter")
    
  • Notebook Workflow (dbutils.notebook)

    # Run another notebook (path, timeout in seconds, arguments) and get its return value
    result = dbutils.notebook.run("./other_notebook", 3600, {"input": "value"})
    
    # Exit a notebook with a value
    dbutils.notebook.exit("This is a return value")
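
Putting widgets and notebook workflows together, here is a minimal sketch of a parent/child pattern. The child notebook path ("./child_notebook") and the argument keys are illustrative, not part of any standard API.

    import json

    # Hypothetical child notebook "./child_notebook":
    #   dbutils.widgets.text("run_date", "")
    #   run_date = dbutils.widgets.get("run_date")
    #   dbutils.notebook.exit(json.dumps({"status": "ok", "run_date": run_date}))

    # Parent notebook: pass a parameter, then parse the JSON string the child returns
    raw = dbutils.notebook.run("./child_notebook", 600, {"run_date": "2023-10-27"})
    result = json.loads(raw)
    print(result["status"])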
    

PySpark DataFrame Essentials

These are the most common operations for data manipulation using PySpark DataFrames.

  • Prerequisite Import:

    # Note: importing sum/avg/count directly shadows Python's built-ins of the same name;
    # the aliased form (F.sum, F.avg, ...) avoids that.
    from pyspark.sql.functions import col, sum, avg, count, when, lit
    import pyspark.sql.functions as F
    
  • Reading & Writing Data

    # Read from Delta Lake (default format)
    df = spark.read.table("my_catalog.my_schema.my_table")
    df = spark.read.format("delta").load("/path/to/delta_table")
    
    # Read from CSV
    df = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("/path/to/data.csv")
    
    # Write to Delta Lake
    df.write.format("delta").mode("overwrite").save("/path/to/delta_table")
    df.write.mode("overwrite").saveAsTable("my_catalog.my_schema.new_table")
    
  • Viewing & Inspecting Data

    # Show top 20 rows
    df.show()
    
    # Display in a rich, scrollable table (Databricks specific)
    display(df)
    
    # Print the schema (column names and types)
    df.printSchema()
    
    # Get the number of rows
    df.count()
    
  • Selecting & Renaming Columns

    # Select specific columns
    df.select("col1", "col2").show()
    
    # Rename a column
    df.withColumnRenamed("old_name", "new_name").show()
    
    # Select and rename (alias) at the same time
    df.select(col("col1").alias("new_name1"), col("col2")).show()
    
  • Filtering Rows

    # Using column expressions (recommended)
    df.filter(col("country") == "USA").show()
    df.filter((col("age") > 30) & (col("status") != "inactive")).show()
    
    # Using SQL syntax
    df.filter("country = 'USA' AND age > 30").show()
    
  • Adding & Modifying Columns

    # Add a new column with a literal value
    df.withColumn("new_col", lit("some_value")).show()
    
    # Derive a new column from existing ones
    df.withColumn("revenue", col("quantity") * col("price")).show()
    
    # Add a column conditionally using when/otherwise
    df.withColumn("age_group", when(col("age") < 18, "minor").otherwise("adult")).show()
    
  • Aggregations (Group By)

    # Group by a column and calculate aggregates
    df.groupBy("category") \
      .agg(
          sum("sales").alias("total_sales"),
          avg("price").alias("average_price")
      ).show()
    
  • Joining DataFrames

    # Inner join on a common key
    df1.join(df2, df1.user_id == df2.id, "inner").show()
    
    • Join Types: inner, left, right, full_outer, left_semi, left_anti, cross
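
For the less common join types, a small self-contained sketch (the tiny DataFrames are made up for illustration):

    # Hypothetical data, created inline so the example is self-contained
    users = spark.createDataFrame([(1, "Ana"), (2, "Bo"), (3, "Cy")], ["user_id", "name"])
    orders = spark.createDataFrame([(1, 9.99), (3, 20.0)], ["user_id", "amount"])

    # left_semi: keep users that have at least one order (no columns from orders)
    users.join(orders, "user_id", "left_semi").show()

    # left_anti: keep users with no orders at all
    users.join(orders, "user_id", "left_anti").show()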

Working with SQL

  • Register a DataFrame as a Temporary View:
    # Create a view that is only available in the current Spark session
    df.createOrReplaceTempView("my_temp_view")
    
  • Run SQL in a %sql Cell:
    %sql
    SELECT
      category,
      SUM(sales) as total_sales
    FROM my_temp_view
    WHERE region = 'North America'
    GROUP BY category
    ORDER BY total_sales DESC
    LIMIT 10;
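
The same query can also be run from Python with spark.sql(), which returns a DataFrame and lets SQL feed further PySpark transformations. A minimal sketch reusing the my_temp_view created above:

    # spark.sql() returns a DataFrame, so SQL and PySpark mix freely
    top_categories = spark.sql("""
        SELECT category, SUM(sales) AS total_sales
        FROM my_temp_view
        GROUP BY category
        ORDER BY total_sales DESC
        LIMIT 10
    """)
    display(top_categories)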
    

Delta Lake Essentials

These commands leverage the power of the Delta Lake format.

  • Time Travel (Querying Historical Data):

    -- Query a table by version number
    SELECT * FROM my_delta_table VERSION AS OF 5;
    
    -- Query a table by timestamp
    SELECT * FROM my_delta_table TIMESTAMP AS OF '2023-10-27T10:00:00.000Z';
    
  • Merge (Upsert Operation):

    %sql
    MERGE INTO target_table t
    USING source_updates s
    ON t.id = s.id
    WHEN MATCHED THEN
      UPDATE SET t.value = s.value, t.last_updated = s.timestamp
    WHEN NOT MATCHED THEN
      INSERT (id, value, last_updated) VALUES (s.id, s.value, s.timestamp)
    
  • History & Maintenance:

    -- View the transaction history of a Delta table
    DESCRIBE HISTORY my_delta_table;
    
    -- Optimize the layout of small files into larger ones
    OPTIMIZE my_delta_table ZORDER BY (indexed_col);
    
    -- Clean up old data files that are no longer referenced (default retention is 7 days)
    VACUUM my_delta_table;
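
The same operations are also available from Python through the Delta Lake API. A minimal sketch, assuming the delta-spark package is on the cluster (it ships with Databricks runtimes; the optimize() builder needs a reasonably recent Delta Lake/DBR version) and reusing the illustrative table names above:

    from delta.tables import DeltaTable

    # Time travel from PySpark: read an older version or a point in time
    df_v5 = spark.read.format("delta").option("versionAsOf", 5).load("/path/to/delta_table")
    df_old = (spark.read.format("delta")
                   .option("timestampAsOf", "2023-10-27 10:00:00")
                   .load("/path/to/delta_table"))

    # Upsert (MERGE) with the DeltaTable builder API
    target = DeltaTable.forName(spark, "target_table")
    source_updates = spark.read.table("source_updates")
    (target.alias("t")
        .merge(source_updates.alias("s"), "t.id = s.id")
        .whenMatchedUpdate(set={"value": "s.value", "last_updated": "s.timestamp"})
        .whenNotMatchedInsert(values={"id": "s.id", "value": "s.value", "last_updated": "s.timestamp"})
        .execute())

    # History and maintenance
    target.history().show()                           # DESCRIBE HISTORY
    target.optimize().executeZOrderBy("indexed_col")  # OPTIMIZE ... ZORDER BY
    target.vacuum()                                   # VACUUM (default 7-day retention)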
    

Databricks CLI

For automating tasks from your local terminal.

Command | Description
databricks configure --token | Set up authentication.
databricks workspace ls /path/ | List workspace objects.
databricks workspace import ... | Import a notebook file to the workspace.
databricks clusters list | List available clusters.
databricks jobs create --json-file ... | Create a new job from a JSON definition.
databricks jobs run-now --job-id ... | Trigger a job run.