SQL - Overview
- Data Manipulation Language (DML)
- DML is about content, data, "rows"
- Data definition or data description language (DDL)
- DDL is about table, index, schema, metadata, "columns"
The latest SQL standard is SQL:2016.
Systems that support standard SQL:
- Presto: used by Facebook, open source, written in Java, successor of Hive (Presto is based on ANSI SQL but HiveQL is not)
- BigQuery / Spanner, used by Google, both use Google's Standard SQL, based on ANSI SQL:2011
BigQuery and Presto are optimized for batch/read queries.
Dialects / Extensions:
- PL/SQL: Oracle's procedural extension for SQL
- Transact-SQL (T-SQL): Microsoft's extension to SQL
CREATE TABLE IF NOT EXISTS table_name ( xxx_names ARRAY<VARCHAR>, ds VARCHAR ) SELECT id ARRAY_AGG(DISTINCT value) AS values FROM ... GROUP BY id
CASE WHEN <column> IS NOT NULL THEN 1 ELSE 0 END as is_not_null
MIN_BY, but BigQuery does not, here's an alternative:
ARRAY_AGG(field1 ORDER BY field2)[OFFSET(0)]
SELECT ARRAY_TO_STRING(arr, ",") FROM ...
ROLLUP(YEAR, MONTH, DAY)
ROLLUP, it will have the following outputs:
YEAR, MONTH, DAY YEAR, MONTH YEAR ()
CUBE, it will have the following:
YEAR, MONTH, DAY YEAR, MONTH YEAR, DAY YEAR MONTH, DAY MONTH DAY ()
CUBE essentially contains every possible rollup scenario for each node whereas
ROLLUP will keep the hierarchy in tact (so it won't skip MONTH and show YEAR/DAY, whereas CUBE will)
- SQL Anti-patterns: https://cloud.google.com/bigquery/docs/best-practices-performance-patterns
- Avoid using
SELECT *. Applying a
LIMITclause to a
SELECT *query does not affect the amount of data read
- Keep data denormalized. "The storage savings from normalized data are less of a concern in modern systems"
- "Trim the data as early in the query as possible, before the query performs a JOIN"
JOINorder: place the largest table first, followed by the smallest, and then by decreasing size.