Over the years, different kinds of databases were developed for different use cases. DB-Engines is tracking more than 300 databases. Choose wisely.
The traditional type.
- relations: between rows and columns(a relation is a table, not between tables)
- a unique key identifying each row.
AWS, Azure and GCP have their home-grown relational database, and also provide hosted services for popular databases.
- Amazon Aurora: Amazon's own relational database
- Amazon RDS: Amazon managed MySQL / PostgreSQL / MariaDB
- Google Cloud SQL: Google managed MySQL / PostgreSQL
- Azure Database for MySQL / PostgreSQL / MariaDB: Microsoft managed db.
NewSQL = Online scalability of NoSQL databases + relational data model / ACID consistency + SQL interface
- Google Cloud Spanner: Google's own relational database
Versus Traditional RDBMS:
In a traditional vertically scaled RDBMS, there is a single node that manages all of the tables. (Depending on the installation, there may be replicas that can be used for reading or for failover). This single node therefore has full control over the table row locks, and the issuing of unique keys from a numeric sequence.
new sql: with many nodes reading and writing to the database at any one time.
Traditional vertical RDBMS have integrated sequence generators, which create new integer keys from a sequence during a transaction. Cloud Spanner cannot have this due to its distributed architecture, as there would either be race conditions between the split leader nodes when inserting new keys, or the table would have to be globally locked when generating a new key, both of which would reduce performance.
Not for online serving, but for offline analytics.
Read more: Data Warehouses.
Time Series Database
Especially useful for:
- DevOps Monitoring
- IoT Applications
- Real-time analytics
- Amazon Timestream
- InfluxDB: https://www.influxdata.com/time-series-platform/influxdb/
Several different types:
- document: stores data as JSON documents.
- wide-column: stores data as tables but the columns are flexible, not all the rows have all the columns. (Data is actually stored in tuples like
(row, column, timestamp).)
- key-value: often used for caching.
- graph: stores data in nodes and relationships in edges.
MongoDB is probably the most popular open source NoSQL database. DynamoDB is the one shining in the cloud.
- MongoDB Atlas: MongoDB's own cloud offering https://www.mongodb.com/cloud/atlas/
- Amazon DynamoDB
Based on the design of Google's Bigtable:
- Google Cloud Bigtable
Choose Redis for cache for all new projects. Though Memcache is used extensively at Facebook.
Redis: not often used as a primary data store, but for storing and accessing ephemeral data(loss can be tolerated) – metrics, session state, caching.
Some databases are optimized for flash, making them cheaper alternatives to caches. E.g. Aerospike.
Key-value stores, by design, have trouble linking values together (in other words, they have no foreign keys).
Read more: caching.
Relational databases can function as general purpose graph databases, but they are not very efficient to traverse the graphs. Multiple queries and joins are often needed.
E.g. Facebook created its own huge social graph, every entity is a node(like a person, a business, a page, a group), and the different types of relationships are the edges. It is backed by TAO, which is actually a caching layer over MySQL.
Read more: Graph
Solr is losing popularity to Elasticsearch.
- Amazon CloudSearch
- Microsoft Azure Search
- Google Search Appliance
Notes of Popular databases
- Wide Column Stores: including Bigtable, Cassandra, HBase, Accumulo
- Spanner: including Spanner, CockroachDB, YugaByteDB
Often used in web browsers and mobile apps.
- the whole database is in one single file on disk, can be embedded inside the application, very portable.
- zero-config, easy to setup (e.g. no
GRANT, use file permissions)
- not a client–server database, no network capabilities, cannot be used over a network.
- not for write-intensive deployments, not for high concurrency use case, since it relies on file-system locks, versus server-based databases handle locks in daemons
- no type checking, the type of a value is dynamic and not strictly constrained by the schema
- no user management; no way to tune for higher performance
- not that reliable, comparing to other RDBMS like MySQL
Uses vector lock to resolve write conflicts.
LevelDB and RocksDB
- LevelDB (created by Google): inspired by Bigtable, also use SSTable.
- RocksDB (created by Facebook): a fork of LevelDB
- better compression
- less write-amplification
How it works:
- SSTables are arranged in several levels.
- SSTables are non-overlapping within a level, e.g. 2 levels:
- level 1: 2 SSTables, one with key space
[a, b), and another
- level 2: single SSTable with key space
- the query with a string starting with
awill look at
[a,c)on level 2 and
[a, b)on level 1. Since the strings are sorted, it takes
- level 1: 2 SSTables, one with key space
- namespaces(databases): contain sets (tables) of records
- sets: table
- key: identify records
- metadata: generation tracks record modification, time-to-live (TTL) specifies record expiration
- bin: name value pairs.
- Data is sharded and balanced between servers using a Paxos-based membership algorithm.
- Aerospike makes a dangerous assumption for a distributed datastore: it assumes the network is reliable
Comparing to in memory cache:
- optimized for flash storage
- cheaper than the in memory cache
- no need to re-load data into memory after outrage
- append only: data is virtually incorruptible and easy to replicate, back up, and restore.
- not for ad hoc queries
- multiple masters
Data is continuously backed up to S3 in real time, with no performance impact.
Instead of creating an index tree for each column, Cosmos DB employs one index for the whole database account
InnoDB vs RocksDB
- InnoDB: B-tree
- RocksDB: LSM(log-structured merge tree)
- We have found RocksDB, when compared to InnoDB, uses less storage space to store an UDB instance and generates less write traffic overall
- A B-Tree wastes space when pages fragment. An LSM doesn't fragment.
Timescale vs InfluxDB
as cardinality moderately increases, InfluxDB performance drops dramatically due to its reliance on time-structured merge trees (which, similar to the log-structured merge trees it is modeled after, suffers with higher-cardinality datasets). This of course should be no surprise, as high cardinality is a well known Achilles heel for InfluxDB
Pipeline that responds to database writes
- Extraction: Stream data to key-value maps or other simplified forms, so that serving systems or others can consume them.
- Transformation: Do additional computation and write the augmented data back to the same database.