Tech Stacks
    Overview
    System Design Patterns
    CAP Theorem
    C10K and C10M
    Network Programming Models
    Infrastructure as Code
    Examples

Databases

Updated: 2022-03-26

Over the years, different kinds of databases were developed for different use cases. DB-Engines is tracking more than 300 databases. Choose wisely.

Relational Databases

The traditional type.

  • relations: between rows and columns(a relation is a table, not between tables)
  • a unique key identifying each row.

It is a safe bet to start with PostgreSQL or MySQL. Many mobile apps rely on SQLite.

Cloud services:

AWS, Azure and GCP have their home-grown relational database, and also provide hosted services for popular databases.

  • Amazon Aurora: Amazon's own relational database
  • Amazon RDS: Amazon managed MySQL / PostgreSQL / MariaDB
  • Google Cloud SQL: Google managed MySQL / PostgreSQL
  • Azure Database for MySQL / PostgreSQL / MariaDB: Microsoft managed db.

NewSQL

NewSQL = Online scalability of NoSQL databases + relational data model / ACID consistency + SQL interface

  • Google Cloud Spanner: Google's own relational database

Versus Traditional RDBMS:

In a traditional vertically scaled RDBMS, there is a single node that manages all of the tables. (Depending on the installation, there may be replicas that can be used for reading or for failover). This single node therefore has full control over the table row locks, and the issuing of unique keys from a numeric sequence.

new sql: with many nodes reading and writing to the database at any one time.

Traditional vertical RDBMS have integrated sequence generators, which create new integer keys from a sequence during a transaction. Cloud Spanner cannot have this due to its distributed architecture, as there would either be race conditions between the split leader nodes when inserting new keys, or the table would have to be globally locked when generating a new key, both of which would reduce performance.

Data Warehouse

Not for online serving, but for offline analytics.

Read more: Data Warehouses.

Time Series Database

Especially useful for:

  • DevOps Monitoring
  • IoT Applications
  • Real-time analytics

Examples:

NoSQL (non-relational)

Several different types:

  • document: stores data as JSON documents.
  • wide-column: stores data as tables but the columns are flexible, not all the rows have all the columns. (Data is actually stored in tuples like (row, column, timestamp).)
  • key-value: often used for caching.
  • graph: stores data in nodes and relationships in edges.

Document

MongoDB is probably the most popular open source NoSQL database. DynamoDB is the one shining in the cloud.

Cloud services:

Wide-column

Based on the design of Google's Bigtable:

Cloud services:

  • Google Cloud Bigtable

Caching (key-value)

Choose Redis for cache for all new projects. Though Memcache is used extensively at Facebook.

Redis: not often used as a primary data store, but for storing and accessing ephemeral data(loss can be tolerated) – metrics, session state, caching.

Some databases are optimized for flash, making them cheaper alternatives to caches. E.g. Aerospike.

Key-value stores, by design, have trouble linking values together (in other words, they have no foreign keys).

Read more: caching.

Graph

Relational databases can function as general purpose graph databases, but they are not very efficient to traverse the graphs. Multiple queries and joins are often needed.

E.g. Facebook created its own huge social graph, every entity is a node(like a person, a business, a page, a group), and the different types of relationships are the edges. It is backed by TAO, which is actually a caching layer over MySQL.

Read more: Graph

Solr is losing popularity to Elasticsearch.

Cloud Services:

  • Amazon CloudSearch
  • Microsoft Azure Search
  • Google Search Appliance

SQLite

Often used in web browsers and mobile apps.

Pros

  • the whole database is in one single file on disk, can be embedded inside the application, very portable.
  • zero-config, easy to setup (e.g. no GRANT, use file permissions)

Cons

  • not a client–server database, no network capabilities, cannot be used over a network.
  • not for write-intensive deployments, not for high concurrency use case, since it relies on file-system locks, versus server-based databases handle locks in daemons
  • no type checking, the type of a value is dynamic and not strictly constrained by the schema
  • no user management; no way to tune for higher performance
  • not that reliable, comparing to other RDBMS like MySQL

Riak

Uses vector lock to resolve write conflicts.

LevelDB and RocksDB

How it works:

  • SSTables are arranged in several levels.
  • SSTables are non-overlapping within a level, e.g. 2 levels:
    • level 1: 2 SSTables, one with key space [a, b), and another [b,c)
    • level 2: single SSTable with key space [a, c)
    • the query with a string starting with a will look at [a,c) on level 2 and [a, b) on level 1. Since the strings are sorted, it takes log(n)

Aerospike

Data model:

  • namespaces(databases): contain sets (tables) of records
  • sets: table
  • records:
    • key: identify records
    • metadata: generation tracks record modification, time-to-live (TTL) specifies record expiration
    • bin: name value pairs.

Traits:

  • Data is sharded and balanced between servers using a Paxos-based membership algorithm.
  • Aerospike makes a dangerous assumption for a distributed datastore: it assumes the network is reliable

Comparing to in memory cache:

  • optimized for flash storage
  • cheaper than the in memory cache
  • no need to re-load data into memory after outrage

CouchDB

  • append only: data is virtually incorruptible and easy to replicate, back up, and restore.
  • not for ad hoc queries
  • multiple masters

Amazon RDS

Data is continuously backed up to S3 in real time, with no performance impact.

CosmosDB

Instead of creating an index tree for each column, Cosmos DB employs one index for the whole database account

Comparisons

InnoDB vs RocksDB

  • InnoDB: B-tree
  • RocksDB: LSM(log-structured merge tree)

LSM vs B-tree: http://smalldatum.blogspot.com/2016/01/summary-of-advantages-of-lsm-vs-b-tree.html

  • We have found RocksDB, when compared to InnoDB, uses less storage space to store an UDB instance and generates less write traffic overall
  • A B-Tree wastes space when pages fragment. An LSM doesn't fragment.

Timescale vs InfluxDB

https://blog.timescale.com/timescaledb-vs-influxdb-for-time-series-data-timescale-influx-sql-nosql-36489299877

as cardinality moderately increases, InfluxDB performance drops dramatically due to its reliance on time-structured merge trees (which, similar to the log-structured merge trees it is modeled after, suffers with higher-cardinality datasets). This of course should be no surprise, as high cardinality is a well known Achilles heel for InfluxDB

Pipeline that responds to database writes

Use cases:

  • Extraction: Stream data to key-value maps or other simplified forms, so that serving systems or others can consume them.
  • Transformation: Do additional computation and write the augmented data back to the same database.