Cloud / Distributed Systems - Data Warehouse

Updated: 2019-01-27

Cloud Services

  • Amazon Redshift: based on ParAccel(initially based on PostgreSQL, columnar)
  • Google BigQuery: based on Google's Dremel, battle tested inside Google for years.
  • Microsoft Azure SQL Data Warehouse

Hadoop Ecosystem

Hive can be used as Data Warehouse, to store huge amount of data. However other compute engines like Presto are often used to accelerate queries.

Traditional

  • Teradata

Data Sources

  1. logging and messaging system(kafka)
  2. scraping(dumping) data from DB(e.g. MySQL) to DW(e.g. Hive); dump to a staging table, then copy to the target table

Dimension Table vs Fact Table

  • fact tables: business facts, or measures, like business transactions, page visits; they have foreign keys which refer to primary keys in the dimension tables.
  • dimension tables: descriptive attributes, like name, age, location, etc. Used for (1) query constraining and/or filtering, and (2) query result set labeling.

Slowly Changing Dimensions(SCD)

https://en.wikipedia.org/wiki/Slowlychangingdimension

  • relatively static data like geo location, customer, or product
  • change slowly but unpredictably, no regular schedule