Cloud / Distributed Systems - Data Warehouse

Updated: 2018-12-12

Cloud Services

  • Amazon Redshift: based on ParAccel(initially based on PostgreSQL, columnar)
  • Google BigQuery: based on Google's Dremel, battle tested inside Google for years.
  • Microsoft Azure SQL Data Warehouse

Hadoop Ecosystem

Hive can be used as Data Warehouse, to store huge amount of data. However other compute engines like Presto are often used to accelerate queries.

Traditional

  • Teradata

Data Sources

  1. logging and messaging system(kafka)
  2. scraping(dumping) data from DB(e.g. MySQL) to DW(e.g. Hive); dump to a staging table, then copy to the target table

Dimension Table vs Fact Table

The fact table contains business facts (or measures), and foreign keys which refer to candidate keys (normally primary keys) in the dimension tables.

Contrary to fact tables, dimension tables contain descriptive attributes (or fields) that are typically textual fields (or discrete numbers that behave like text). These attributes are designed to serve two critical purposes: query constraining and/or filtering, and query result set labeling.

Slowly Changing Dimensions(SCD)

https://en.wikipedia.org/wiki/Slowlychangingdimension