Last Updated: 2022-02-06
- Google BigQuery: based on Google's Dremel, battle tested inside Google for years.
- Amazon Redshift: based on ParAccel(initially based on PostgreSQL, columnar)
- Microsoft Azure SQL Data Warehouse
Hive can be used as Data Warehouse, to store huge amount of data. However other compute engines like Presto are often used to accelerate queries.
- logging and messaging system (kafka)
- scraping (dumping) data from DB (e.g. MySQL) to DW (e.g. Hive); dump to a staging table, then copy to the target table
Dimension Table vs Fact Table
- fact tables: business facts, or measures, like business transactions, page visits; they have foreign keys which refer to primary keys in the dimension tables.
- dimension tables: descriptive attributes, like name, age, location, etc. Used for (1) query constraining and/or filtering, and (2) query result set labeling.
Slowly Changing Dimensions (SCD)
- relatively static data like geo location, customer, or product
- change slowly but unpredictably, no regular schedule