Cloud / Distributed Systems - Database
Over the years, different kinds of databases have been developed for different use cases. DB-Engines tracks more than 300 databases. Choose wisely.
- relations: a relation is a table of rows and columns (not a relationship between tables)
- a unique key identifying each row.
It is a safe bet to start with MySQL or PostgreSQL.
MySQL is the top open-source database on DB-Engines; it has also proved its scalability at Facebook, and Uber migrated from Postgres to MySQL.
MariaDB is a fork of MySQL and it is gaining popularity quickly.
Google and Amazon have their own home-grown relational databases. All three biggest clouds also offer managed versions of the popular databases:
- Amazon Aurora: Amazon's own relational database
- Amazon RDS: Amazon managed MySQL/PostgreSQL/MariaDB
- Google Cloud Spanner: Google's own relational database
- Google Cloud SQL: Google managed MySQL/PostgreSQL
- Azure Database for MySQL/PostgreSQL/MariaDB: Microsoft managed db.
MongoDB is probably the most popular open source NoSQL database. DynamoDB is the one shining in the cloud.
There are many open-source databases under Apache; some of them are implementations of the famous papers published by companies like Google.
- Amazon DynamoDB
- Google Cloud Bigtable
- Azure CosmosDB
- MongoDB Atlas: MongoDB's own cloud offering https://www.mongodb.com/cloud/atlas/
Relational databases can function as general-purpose graph databases, but they are not very efficient at traversing graphs: multiple queries and joins are often needed.
E.g. Facebook created its own huge social graph: every entity is a node (like a person, a business, a page, a group), and the different types of relationships are the edges. It is backed by TAO, which is essentially a caching layer over MySQL.
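A toy sketch of why multi-hop traversal is awkward over a relational store: with edges in a table, every extra hop costs another self-join (or another round-trip query). The schema and names below are invented for illustration, using SQLite in memory.

```python
import sqlite3

# Toy social graph stored relationally (schema and data are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
conn.executemany("INSERT INTO edges VALUES (?, ?)",
                 [("alice", "bob"), ("bob", "carol"), ("carol", "dave")])

# Two hops ("friends of friends") already needs a self-join; each further
# hop needs yet another join, which is what a graph database avoids.
two_hops = conn.execute("""
    SELECT e2.dst FROM edges e1
    JOIN edges e2 ON e1.dst = e2.src
    WHERE e1.src = 'alice'
""").fetchall()
print(two_hops)  # [('carol',)]
```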
Graph databases you can use if you choose not to build one in-house like Facebook did:
- Neo4j: a Java graph db.
- JanusGraph: started as a fork of TitanDB (TitanDB is now discontinued). Supported by Google.
- graph models: Property Graph and W3C's RDF
- graph queries: Apache TinkerPop Gremlin and SPARQL
- Giraph: based on Google's Pregel, however Pregel is deprecated.
Choose Redis for cache for all new projects. Though Memcache is used extensively at Facebook.
Redis: not often used as a primary data store, but for storing and accessing ephemeral data (loss can be tolerated) – metrics, session state, caching.
Some databases are optimized for flash, making them cheaper alternatives to caches. E.g. Aerospike
key-value stores, by design, have trouble linking values together (in other words, they have no foreign keys).
- Amazon ElastiCache
- Google Cloud Memorystore
- Azure Cache for Redis
Read more in the cache page.
Time-series databases are especially useful for:
- DevOps Monitoring
- IoT Applications
- Real-time analytics
- Amazon Timestream
- InfluxDB: https://www.influxdata.com/time-series-platform/influxdb/
- Facebook Beringei: https://github.com/facebookincubator/beringei
Solr is losing popularity to Elasticsearch.
- Amazon CloudSearch
- Microsoft Azure Search
- Google Search Appliance
Read more from the Data Warehouse page.
- buffered writes: support failing MySQL masters
- read-your-write semantics.
- master-slave replication (read from slaves, write to master)
- B-tree index: O(log(N)); good for range queries and LIKE 'string%' (a constant prefix that does not start with a wildcard character).
- hash index: O(1); used only for equality comparisons with the = or <=> operators (but very fast). Good for key-value stores.
E.g. in MySQL:
CREATE [UNIQUE|FULLTEXT|SPATIAL] INDEX index_name USING [BTREE | HASH | RTREE] ON table_name (column_name [(length)] [ASC | DESC], ...)
- FULLTEXT: optimized for regular plain text; it splits the text into words and builds an index over the words instead of the whole text. (LIKE '%foo' does not use any index and thus may incur a full table scan.)
- SPATIAL: for indexing geo locations.
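The B-tree vs. hash trade-off above can be sketched in a few lines: sorted keys (B-tree-like) support prefix/range scans via binary search, while a hash map supports only equality lookups. This is an illustrative analogy, not MySQL's actual index code.

```python
import bisect

# B-tree-like: keys kept sorted, so LIKE 'ban%' becomes a range scan.
keys = sorted(["apple", "banana", "bandana", "cherry"])

def prefix_scan(prefix):
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")  # just past the prefix range
    return keys[lo:hi]

# Hash-like: O(1) equality only; no ordering, so no prefix/range scans.
index = {k: True for k in keys}

print(prefix_scan("ban"))   # ['banana', 'bandana']
print("banana" in index)    # True
```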
The storage engine is usually single-node; it is just one layer of a distributed database.
The storage engine's responsibilities:
- Atomicity and Durability of ACID.
- performance (e.g. the write-ahead log)
Higher layers of the database can then focus on Isolation (distributed coordination) and Consistency (data integrity).
- MySQL's default storage engine is InnoDB, however there are many other options like RocksDB.
- CockroachDB is based on RocksDB.
- Postgres is a monolithic system and does not have a pluggable "storage engine".
- normalized data: join when used
- denormalized data: data is stored and retrieved as a whole (blobs). No joins are needed, and a record can be written with one disk write.
- log-structured storage engines (LSM trees, "log-structured merge trees"), e.g. LevelDB, RocksDB, and Cassandra: high write throughput, lower read throughput.
- page-oriented storage engines (B-trees), e.g. PostgreSQL, MySQL, SQLite: high read throughput, lower write throughput.
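A minimal sketch (not RocksDB's actual implementation) of why log-structured engines favor writes: writes land in an in-memory memtable in O(1) and are flushed sequentially as sorted, immutable SSTables, while a read may have to consult several tables.

```python
import bisect

# Toy LSM: memtable + flushed sorted SSTables (sizes/names are illustrative).
MEMTABLE_LIMIT = 2
memtable, sstables = {}, []

def put(key, value):
    memtable[key] = value                          # cheap in-memory write
    if len(memtable) >= MEMTABLE_LIMIT:
        sstables.append(sorted(memtable.items()))  # sequential "disk" flush
        memtable.clear()

def get(key):
    if key in memtable:
        return memtable[key]
    for table in reversed(sstables):               # newest table first
        i = bisect.bisect_left(table, (key,))      # binary search per table
        if i < len(table) and table[i][0] == key:
            return table[i][1]
    return None

put("a", 1); put("b", 2); put("a", 3)
print(get("a"), get("b"))  # 3 2
```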
log-structured file systems
- hypothesis: as memory sizes increase, reads are served from the in-memory cache, so disk I/O becomes write-heavy
- treats its storage as a circular log and writes sequentially to the head of the log.
Optimistic concurrency (read-validate-write): read the value, grab the write lock, look at the index, and check whether the address of the data we read is still the same. If yes, no writes occurred, so proceed with writing the new value; otherwise a conflicting transaction occurred, so roll back and start again from the read phase.
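The read-validate-write loop can be sketched as below, using a version counter in place of a disk address (all names here are illustrative, not any particular database's API):

```python
import threading

lock = threading.Lock()
record = {"value": 10, "version": 0}

def optimistic_update(fn):
    while True:
        snapshot_value = record["value"]      # read phase
        snapshot_version = record["version"]
        new_value = fn(snapshot_value)        # compute outside the lock
        with lock:                            # validate + write phase
            if record["version"] == snapshot_version:  # no conflicting write
                record["value"] = new_value
                record["version"] += 1
                return new_value
        # else: a conflicting transaction won; retry from the read phase

print(optimistic_update(lambda v: v + 5))  # 15
```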
There’s a common 4-step dual-writing pattern that people often use to do large online migrations like this. Here’s how it works:
- Dual writing to the existing and new tables to keep them in sync (and backfilling old data to the new table).
- Changing all read paths in our codebase to read from the new table.
- Changing all write paths in our codebase to only write to the new table.
- Removing old data that relies on the outdated data model.
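The four phases above can be sketched with two plain dicts standing in for the old and new tables (everything here is illustrative):

```python
old_table, new_table = {"k1": "v1"}, {}

# Phase 1: dual write + backfill, keeping both tables in sync.
def write_phase1(key, value):
    old_table[key] = value
    new_table[key] = value

def backfill():
    for key, value in old_table.items():
        new_table.setdefault(key, value)   # copy rows the new table lacks

# Phase 2: all read paths move to the new table.
def read(key):
    return new_table[key]

# Phase 3: write paths write only to the new table.
def write_phase3(key, value):
    new_table[key] = value

backfill()
write_phase1("k2", "v2")
write_phase3("k3", "v3")
print(read("k1"), read("k2"), read("k3"))  # v1 v2 v3
# Phase 4: drop old_table and delete the phase-1 dual-write path.
```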
- OLTP: On-line Transaction Processing
- OLAP: On-line Analytical Processing
These are two relatively old terms; when you hear people talking about them, think operational databases (e.g. MySQL, Spanner) vs. data warehouses (for analytics, e.g. AWS Redshift, GCP BigQuery).
- mongodb, mysql: single master + sharding, strongly consistent.
- cassandra, dynamo: eventually consistent.
if R + W > N: strong consistency (R = how many nodes must agree during a read, W = how many nodes must confirm a write, N = the number of replicas)
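The quorum condition can be checked by brute force: R + W > N holds exactly when every possible read quorum overlaps every possible write quorum, so a read always touches at least one replica holding the latest write.

```python
from itertools import combinations

def quorums_always_overlap(r, w, n):
    """True if every r-node read set intersects every w-node write set."""
    nodes = range(n)
    return all(set(reads) & set(writes)
               for reads in combinations(nodes, r)
               for writes in combinations(nodes, w))

# Exhaustively confirm the equivalence for small clusters.
for n in range(1, 6):
    for r in range(1, n + 1):
        for w in range(1, n + 1):
            assert quorums_always_overlap(r, w, n) == (r + w > n)
print("R + W > N  <=>  read and write quorums always overlap")
```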
- Save index in memory to speed up search
- The default engine is InnoDB (MyISAM before 5.5.5). MyISAM is no longer under active development. Another option is MyRocks, which is based on Facebook's RocksDB. To list all the engines:
mysql> SHOW ENGINES;
- In most cases, only one storage engine will be needed. Sometimes you may need to have different storage engines for different tables in the same database.
primary: unique, one per table.
unique: unique, no two rows have the same combination of the values.
index: may not be unique, but improves lookup efficiency.
fulltext: for full text search, by creating an index for each word in that column.
- most MySQL indexes (PRIMARY KEY, UNIQUE, INDEX, FULLTEXT) are stored in B+trees;
- spatial data types use R-trees;
- MEMORY tables also support hash indexes; choose B-tree for MEMORY tables if some queries use range operators;
- InnoDB uses inverted lists for FULLTEXT indexes.
- MySQL is moving to X.509 certificates for authentication, away from username + password
Understanding caching in Postgres - An in-depth guide: Postgres is a process-based system, i.e. each connection has a native OS process... Postgres, as a cross-platform database, relies heavily on the operating system for its caching... WAL is a redo log that basically keeps track of whatever is happening to the system... It is always better to monitor something directly from Postgres, rather than going through the OS route.
- row key: UUID, equivalent to "primary key"
- column key: string
- ref key: version number(the highest is the latest)
Uses vector clocks to resolve write conflicts
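A minimal vector clock sketch (as used by Dynamo-style stores) showing how concurrent writes are detected; the node names and helpers are illustrative:

```python
def increment(clock, node):
    """Return a copy of the clock with this node's counter bumped."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    """True if clock a has seen everything clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def conflict(a, b):
    """Neither clock descends from the other: concurrent writes."""
    return not descends(a, b) and not descends(b, a)

v1 = increment({}, "A")    # write handled by node A
v2 = increment(v1, "A")    # later write on A: supersedes v1
v3 = increment(v1, "B")    # concurrent write on B, unaware of v2
print(conflict(v2, v1))  # False — v2 simply supersedes v1
print(conflict(v2, v3))  # True  — siblings; must be reconciled
```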
Spanner is a globally-scalable database system used internally by Google; it is the successor of the BigTable database. (Link to the paper)
Cloud Spanner is the managed database on Google Cloud Platform.
Spanner does not have an auto-increment key. Do not use numbers in incremental order as keys (including timestamps): because Spanner is distributed and sharded by key, such keys will result in hotspots and hurt performance.
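Two common workarounds (sketched here, not Spanner-specific APIs) are to use a UUID as the key, or to keep the timestamp but prefix it with a hash so consecutive timestamps spread across key-range shards; the shard count below is an arbitrary illustration:

```python
import hashlib
import uuid

# Option 1: random UUID keys distribute uniformly across the key space.
row_key = str(uuid.uuid4())

# Option 2: hash-prefix a sequential value (e.g. a timestamp) so adjacent
# values land on different shards instead of all hitting the "last" one.
def sharded_key(timestamp_ms, shards=16):
    digest = hashlib.md5(str(timestamp_ms).encode()).hexdigest()
    shard = int(digest, 16) % shards
    return f"{shard:02d}#{timestamp_ms}"

print(sharded_key(1700000000000))
```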
An open-source take on Google Spanner: Bigtable => Spanner (inside Google) => CockroachDB (outside Google).
CockroachDB uses RocksDB, an embedded key-value store, internally. Though RocksDB is from Facebook, it is based on LevelDB, which was from Google.
Based on the Bigtable technology from Google.
- modeled after google’s bigtable
- HBase features compression, in-memory operation, and Bloom filters on a per-column basis as outlined in the original Bigtable paper.
- In the parlance of Eric Brewer’s CAP Theorem, HBase is a CP type system.
- LevelDB (created by Google): inspired by Bigtable; also uses SSTables.
- RocksDB (created by Facebook): a fork of LevelDB
- better compression
- less write-amplification
How it works:
- SSTables are arranged in several levels. SSTables are non-overlapping within a level, e.g. with 2 levels:
- level 1: 2 SSTables, one with key space [a, b), and another with [b, c)
- level 2: a single SSTable with key space [a, c)
- a query for a string starting with a will look at [a, c) on level 2 and [a, b) on level 1. Since the strings are sorted, each lookup takes only a binary search within the SSTable.
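The leveled lookup above can be sketched as follows; the key ranges mirror the example, while the values and layout are invented for illustration:

```python
import bisect

# Each level holds non-overlapping sorted SSTables as ((low, high), rows).
levels = [
    # level 1 (newer data): two non-overlapping SSTables
    [(("a", "b"), [("aa", 1), ("ab", 2)]),
     (("b", "c"), [("ba", 3)])],
    # level 2 (older data): one SSTable covering [a, c)
    [(("a", "c"), [("aa", 0), ("bb", 4)])],
]

def lookup(key):
    for level in levels:                      # newer levels shadow older ones
        starts = [low for (low, _high), _ in level]
        i = bisect.bisect_right(starts, key) - 1   # table whose range may cover key
        if i >= 0 and key < level[i][0][1]:        # key within [low, high)?
            rows = level[i][1]
            j = bisect.bisect_left(rows, (key,))   # binary search inside SSTable
            if j < len(rows) and rows[j][0] == key:
                return rows[j][1]
    return None

print(lookup("aa"), lookup("bb"))  # 1 4
```

Note that "aa" resolves at level 1 (the newer value 1 shadows level 2's stale 0), while "bb" falls through to level 2.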
Unlike either monolithic or master-slave designs, Cassandra makes use of an entirely peer-to-peer architecture. All nodes in a Cassandra cluster can accept reads and writes, no matter where the data being written or requested actually belongs in the cluster.
- Shard data automatically
- Handle partial outages without data loss or downtime
- Scale close to linearly
Five-dimensional data model (similar to Bigtable or Cassandra):
- namespaces(databases): contain sets (tables) of records
- sets: table
- key: identify records
- metadata: generation tracks record modification, time-to-live (TTL) specifies record expiration
- bin: name-value pairs.
- Data is sharded and balanced between servers using a Paxos-based membership algorithm.
- Aerospike makes a dangerous assumption for a distributed datastore: it assumes the network is reliable
Comparing to in memory cache:
- optimized for flash storage
- cheaper than the in memory cache
- no need to re-load data into memory after an outage
- append only: data is virtually incorruptible and easy to replicate, back up, and restore.
- not for ad hoc queries
- multiple masters
- single master
- index: b-tree
- data is heterogeneous: the items in the collection don’t all have the same structure for some reason
Config server vs mongos: The config server (itself replicated) manages the sharded information for other sharded servers, while mongos will likely live on your local application server where clients can easily connect (without needing to manage which shards to connect to).
MongoDB vs SQL Databases:
- SQL db: join on db side
- one-to-many: self contained;
- many-to-one, many-to-many: join on application/client side
- the whole database is in one single file on disk, can be embedded inside the application, very portable.
- zero-config, easy to set up
- not a client–server database, no network capabilities, cannot be used over a network.
- not for write-intensive deployments, not for high concurrency use case, since it relies on file-system locks, versus server-based databases handle locks in daemons
- no type checking, the type of a value is dynamic and not strictly constrained by the schema
- no user management; no way to tune for higher performance
- not that reliable compared to other RDBMSs like MySQL
- TimescaleDB is packaged as a PostgreSQL extension
- TimescaleDB is more than 20x faster than vanilla PostgreSQL when inserting data at scale
- TimescaleDB enables you to scale to 1 billion rows with no real impact to insert performance
Data is continuously backed up to S3 in real time, with no performance impact.
Instead of creating an index tree for each column, Cosmos DB employs one index for the whole database account
- InnoDB: B-tree
- RocksDB: LSM(log-structured merge tree)
- We have found that RocksDB, when compared to InnoDB, uses less storage space to store a UDB instance and generates less write traffic overall
- A B-Tree wastes space when pages fragment. An LSM doesn't fragment.
- Inefficient architecture for writes;
- Inefficient data replication;
- Issues with table corruption;
- Poor replica MVCC support;
- Difficulty upgrading to newer releases.
- The most important architectural difference is that while Postgres directly maps index records to on-disk locations, InnoDB maintains a secondary structure: instead of holding a pointer to the on-disk row location (like the ctid does in Postgres), a secondary index record holds a pointer to the primary key.
- MySQL supports multiple different replication modes; InnoDB storage engine implements its own LRU in something it calls the InnoDB buffer pool; MySQL implements concurrent connections by spawning a thread-per-connection. This is relatively low overhead.
as cardinality moderately increases, InfluxDB performance drops dramatically due to its reliance on time-structured merge trees (which, similar to the log-structured merge trees it is modeled after, suffers with higher-cardinality datasets). This of course should be no surprise, as high cardinality is a well known Achilles heel for InfluxDB