SQL - Overview
Updated: 2020-12-31
DML vs DDL
-
Data Manipulation Language (DML)
- DML is about content, data, "rows"
- e.g.
INSERT
,UPDATE
orDELETE
.
-
Data definition or data description language (DDL)
- DDL is about table, index, schema, metadata, "columns"
- e.g.
CREATE
,ALTER
,TRUNCATE
orDROP
.
Standards and Dialects
The latest SQL standard is SQL:2016.
Systems that support standard SQL:
- Presto: used by Facebook, open source, written in Java, successor of Hive (Presto is based on ANSI SQL but HiveQL is not)
- BigQuery / Spanner, used by Google, both use Google's Standard SQL, based on ANSI SQL:2011
BigQuery and Presto are optimized for batch/read queries.
Dialects / Extensions:
- PL/SQL: Oracle's procedural extension for SQL
- Transact-SQL (T-SQL): Microsoft's extension to SQL
Create Table
Presto
CREATE TABLE IF NOT EXISTS
table_name (
xxx_names ARRAY<VARCHAR>,
ds VARCHAR
)
SELECT
id
ARRAY_AGG(DISTINCT value) AS values
FROM
...
GROUP BY id
Case When
Create is_not_null
column
CASE WHEN <column> IS NOT NULL THEN 1 ELSE 0 END as is_not_null
MAXBY / MINBY
Presto provides MAX_BY
and MIN_BY
, but BigQuery does not, here's an alternative:
ARRAY_AGG(field1 ORDER BY field2)[OFFSET(0)]
Join ARRAY to STRING
SELECT ARRAY_TO_STRING(arr, ",")
FROM ...
Rollup
ROLLUP(YEAR, MONTH, DAY)
With a ROLLUP
, it will have the following outputs:
YEAR, MONTH, DAY
YEAR, MONTH
YEAR
()
Cube
With CUBE
, it will have the following:
YEAR, MONTH, DAY
YEAR, MONTH
YEAR, DAY
YEAR
MONTH, DAY
MONTH
DAY
()
CUBE
essentially contains every possible rollup scenario for each node whereas ROLLUP
will keep the hierarchy in tact (so it won't skip MONTH and show YEAR/DAY, whereas CUBE will)
Read More
- SQL Anti-patterns: https://cloud.google.com/bigquery/docs/best-practices-performance-patterns
- http://allyouneedisbackend.com/blog/2017/09/24/the-sql-i-love-part-1-scanning-large-table/
- https://blog.jooq.org/2017/09/28/10-cool-sql-optimisations-that-do-not-depend-on-the-cost-model/
Best Practice
- Avoid using
SELECT *
. Applying aLIMIT
clause to aSELECT *
query does not affect the amount of data read - Keep data denormalized. "The storage savings from normalized data are less of a concern in modern systems"
- "Trim the data as early in the query as possible, before the query performs a JOIN"
JOIN
order: place the largest table first, followed by the smallest, and then by decreasing size.