SQL - Overview

Updated: 2020-12-31

DML vs DDL

  • Data Manipulation Language (DML)

    • DML is about content, data, "rows"
    • e.g. INSERT, UPDATE or DELETE.
  • Data definition or data description language (DDL)

    • DDL is about table, index, schema, metadata, "columns"
    • e.g. CREATE, ALTER, TRUNCATE or DROP.

Standards and Dialects

The latest SQL standard is SQL:2016.

Systems that support standard SQL:

  • Presto: used by Facebook, open source, written in Java, successor of Hive (Presto is based on ANSI SQL but HiveQL is not)
  • BigQuery / Spanner, used by Google, both use Google's Standard SQL, based on ANSI SQL:2011

BigQuery and Presto are optimized for batch/read queries.

Dialects / Extensions:

  • PL/SQL: Oracle's procedural extension for SQL
  • Transact-SQL (T-SQL): Microsoft's extension to SQL

Create Table

Presto

CREATE TABLE IF NOT EXISTS
table_name (
    xxx_names ARRAY<VARCHAR>,
    ds VARCHAR
)

SELECT
    id
    ARRAY_AGG(DISTINCT value) AS values
FROM
    ...
GROUP BY id

Case When

Create is_not_null column

CASE WHEN <column> IS NOT NULL THEN 1 ELSE 0 END as is_not_null

MAXBY / MINBY

Presto provides MAX_BY and MIN_BY, but BigQuery does not, here's an alternative:

ARRAY_AGG(field1 ORDER BY field2)[OFFSET(0)]

Join ARRAY to STRING

SELECT ARRAY_TO_STRING(arr, ",")
FROM ...

Rollup

ROLLUP(YEAR, MONTH, DAY)

With a ROLLUP, it will have the following outputs:

YEAR, MONTH, DAY
YEAR, MONTH
YEAR
()

Cube

With CUBE, it will have the following:

YEAR, MONTH, DAY
YEAR, MONTH
YEAR, DAY
YEAR
MONTH, DAY
MONTH
DAY
()

CUBE essentially contains every possible rollup scenario for each node whereas ROLLUP will keep the hierarchy in tact (so it won't skip MONTH and show YEAR/DAY, whereas CUBE will)

Read More

Best Practice

  • Avoid using SELECT *. Applying a LIMIT clause to a SELECT * query does not affect the amount of data read
  • Keep data denormalized. "The storage savings from normalized data are less of a concern in modern systems"
  • "Trim the data as early in the query as possible, before the query performs a JOIN"
  • JOIN order: place the largest table first, followed by the smallest, and then by decreasing size.