Cloud / Distributed Systems - Data Warehouse
- Amazon Redshift: based on ParAccel(initially based on PostgreSQL, columnar)
- Google BigQuery: based on Google's Dremel, battle tested inside Google for years.
- Microsoft Azure SQL Data Warehouse
Hive can be used as Data Warehouse, to store huge amount of data. However other compute engines like Presto are often used to accelerate queries.
- logging and messaging system(kafka)
- scraping(dumping) data from DB(e.g. MySQL) to DW(e.g. Hive); dump to a staging table, then copy to the target table
- fact tables: business facts, or measures, like business transactions, page visits; they have foreign keys which refer to primary keys in the dimension tables.
- dimension tables: descriptive attributes, like name, age, location, etc. Used for (1) query constraining and/or filtering, and (2) query result set labeling.
- relatively static data like geo location, customer, or product
- change slowly but unpredictably, no regular schedule