logo

Data Warehouses

Last Updated: 2022-02-06

Popular Choices

Cloud Services

  • Google BigQuery: based on Google's Dremel, battle tested inside Google for years.
  • Amazon Redshift: based on ParAccel(initially based on PostgreSQL, columnar)
  • Microsoft Azure SQL Data Warehouse
  • Snowflake

Hadoop Ecosystem

Hive can be used as Data Warehouse, to store huge amount of data. However other compute engines like Presto are often used to accelerate queries.

Traditional

  • Teradata

Data Sources

  • logging and messaging system (kafka)
  • scraping (dumping) data from DB (e.g. MySQL) to DW (e.g. Hive); dump to a staging table, then copy to the target table

Dimension Table vs Fact Table

  • fact tables: business facts, or measures, like business transactions, page visits; they have foreign keys which refer to primary keys in the dimension tables.
  • dimension tables: descriptive attributes, like name, age, location, etc. Used for (1) query constraining and/or filtering, and (2) query result set labeling.

Slowly Changing Dimensions (SCD)

https://en.wikipedia.org/wiki/Slowly_changing_dimension

  • relatively static data like geo location, customer, or product
  • change slowly but unpredictably, no regular schedule