Cloud / Distributed Systems - Database
Over the years, different kinds of databases have been developed for different use cases. DB-Engines tracks more than 300 databases. Choose wisely.
- relations: between rows and columns (a relation is a table, not a relationship between tables)
- a unique key identifying each row.
B-Tree vs Hash Index:
- B-tree index:
O(log(N)), good for
LIKE 'string%' (a constant string that does not start with a wildcard character)
- hash index:
O(1), used only for equality comparisons with the = or <=> operators (but those are very fast). Good for key-value stores.
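A minimal Python sketch of the trade-off above, using a dict as a stand-in for a hash index and a sorted list (with bisect) as a stand-in for a B-tree; the names and data are made up for illustration:

```python
import bisect

# Hash index stand-in: O(1) equality lookups only, no ordering.
hash_index = {"alice": 1, "bob": 2, "carol": 3}
assert hash_index["bob"] == 2  # fast point lookup

# B-tree stand-in: sorted keys support equality AND range/prefix scans.
sorted_keys = sorted(hash_index)  # ['alice', 'bob', 'carol']

def prefix_scan(keys, prefix):
    """Return all keys starting with `prefix`, like LIKE 'prefix%'."""
    lo = bisect.bisect_left(keys, prefix)
    hi = bisect.bisect_left(keys, prefix + "\uffff")  # just past the prefix range
    return keys[lo:hi]

print(prefix_scan(sorted_keys, "b"))  # a prefix scan a pure hash index cannot do
```

The point: ordered structures pay O(log N) per lookup but get range and prefix queries for free, which is exactly why `LIKE 'string%'` can use a B-tree index.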
It is a safe bet to start with MySQL or PostgreSQL.
MySQL is the top open-source database on DB-Engines; it has also proved its scalability at Facebook, and Uber migrated from Postgres to MySQL.
MariaDB is a fork of MySQL and it is gaining popularity quickly.
Google and Amazon have their own home-grown relational databases, and all three biggest clouds offer managed versions of popular databases.
- Amazon Aurora: Amazon's own relational database
- Amazon RDS: Amazon managed MySQL/PostgreSQL/MariaDB
- Google Cloud Spanner: Google's own relational database
- Google Cloud SQL: Google managed MySQL/PostgreSQL
- Azure Database for MySQL/PostgreSQL/MariaDB: Microsoft managed db.
MongoDB is probably the most popular open source NoSQL database. DynamoDB is the one shining in the cloud.
There are many open-source databases under Apache; some of them are implementations of famous papers published by companies like Google.
- Amazon DynamoDB
- Google Cloud Bigtable
- Azure CosmosDB
- MongoDB Atlas: MongoDB's own cloud offering https://www.mongodb.com/cloud/atlas/
Relational databases can function as general-purpose graph databases, but they are not efficient at traversing graphs: multiple queries and joins are often needed.
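A small sqlite3 sketch of what traversal looks like in a relational database (table and column names are made up): each extra hop needs another pass over the edge table, expressed here with a recursive CTE.

```python
import sqlite3

# Hypothetical friendship edges; schema is illustrative only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
db.executemany("INSERT INTO edges VALUES (?, ?)",
               [("alice", "bob"), ("bob", "carol"), ("carol", "dave")])

# Traversing a graph in SQL needs a recursive CTE (or repeated self-joins),
# one join per hop; this is the inefficiency graph databases avoid.
rows = db.execute("""
    WITH RECURSIVE reachable(name) AS (
        SELECT 'alice'
        UNION
        SELECT e.dst FROM edges e, reachable r WHERE e.src = r.name
    )
    SELECT name FROM reachable
""").fetchall()
print(sorted(r[0] for r in rows))  # everyone reachable from alice
```

A native graph database stores adjacency directly on each node, so a hop is a pointer dereference rather than a join.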
E.g. Facebook built its own huge social graph: every entity is a node (a person, a business, a page, a group), and the different types of relationships are the edges. It is backed by TAO, which is effectively a caching layer over MySQL.
Graph databases you can use if you choose not to build one in-house like Facebook did:
- Neo4j: a Java graph db.
- JanusGraph: started as a fork of TitanDB (TitanDB is now discontinued). Supported by Google.
- graph model: Property Graph and W3C's RDF,
- graph query: Apache TinkerPop Gremlin and SPARQL
- Giraph: based on Google's Pregel, however Pregel is deprecated.
Choose Redis for caching in all new projects, though Memcache is used extensively at Facebook.
Redis: not often used as a primary data store, but for storing and accessing ephemeral data (loss can be tolerated): metrics, session state, caching.
Some databases are optimized for flash, making them cheaper alternatives to caches. E.g. Aerospike
Key-value stores, by design, have trouble linking values together (in other words, they have no foreign keys).
- Amazon ElastiCache
- Google Cloud Memorystore
- Azure Cache for Redis
Read more in the cache page.
Especially useful for:
- DevOps Monitoring
- IoT Applications
- Real-time analytics
- Amazon Timestream
- InfluxDB: https://www.influxdata.com/time-series-platform/influxdb/
- Facebook Beringei: https://github.com/facebookincubator/beringei
Solr is losing popularity to Elasticsearch.
- Amazon CloudSearch
- Microsoft Azure Search
- Google Search Appliance
Read more from the Data Warehouse page.
- buffered writes: support failing MySQL masters
- read-your-write semantics.
- master-slave replication (read from slaves, write to master)
E.g. in MySQL:
CREATE [UNIQUE|FULLTEXT|SPATIAL] INDEX index_name USING [BTREE | HASH | RTREE] ON table_name (column_name [(length)] [ASC | DESC],...)
FULLTEXT: optimized for regular plain text; it splits the text into words and builds the index over the words instead of the whole text. (LIKE '%foo' cannot use any index and may incur a full table scan.)
SPATIAL: for indexing geo location
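The word-splitting idea behind a FULLTEXT index can be sketched as a tiny inverted index in Python (data and names are made up; real engines add tokenization, stemming, and ranking):

```python
from collections import defaultdict

# Inverted index: map each word to the row ids containing it, which is
# roughly what a FULLTEXT index stores instead of the whole strings.
rows = {1: "quick brown fox", 2: "lazy brown dog", 3: "quick red fox"}

index = defaultdict(set)
for row_id, text in rows.items():
    for word in text.split():
        index[word].add(row_id)

print(sorted(index["brown"]))  # word lookup hits the index, no full table scan
```

A `LIKE '%foo'` query has no such word boundary to exploit, which is why it degrades to scanning every row.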
- normalized data: join when used
- denormalized data: data is stored and retrieved as a whole (blobs). No joins are needed, and a record can be written with one disk write.
- log-structured storage engines (LSM trees, "log-structured merge trees"), e.g. LevelDB, RocksDB, and Cassandra: high write throughput, lower read throughput
- page-oriented storage engines (B-trees), e.g. PostgreSQL, MySQL, SQLite: high read throughput, lower write throughput
log-structured file systems
- hypothesis: as memory sizes increase, reads are served from the memory cache and I/O becomes write-heavy
- treats its storage as a circular log and writes sequentially to the head of the log.
Read the value, grab the write lock, look at the index, and check whether the address of the data we read stays the same: if yes, no writes occurred, so proceed with writing the new value; otherwise a conflicting transaction occurred, so roll back and start again with the read phase.
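The read-check-write cycle above is optimistic concurrency control; a minimal sketch using a version number as the stand-in for the "address stays the same" check (store layout and names are illustrative):

```python
import threading

# key -> (version, value); the version plays the role of the address check.
store = {"key": (0, "old")}
lock = threading.Lock()

def read(key):
    return store[key]            # read phase: remember the version we saw

def write(key, seen_version, new_value):
    with lock:                   # grab the write lock
        current_version, _ = store[key]
        if current_version != seen_version:
            return False         # conflicting write happened: caller retries
        store[key] = (current_version + 1, new_value)
        return True

version, _ = read("key")
assert write("key", version, "new")        # no conflict: commit succeeds
assert not write("key", version, "stale")  # stale version: rejected, retry
```

On failure the caller loops back to the read phase, so readers never block and only actual conflicts pay the retry cost.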
There’s a common 4-step dual-writing pattern that people often use for large online migrations like this. Here’s how it works:
- Dual write to the existing and new tables to keep them in sync (and backfill old data to the new table).
- Changing all read paths in our codebase to read from the new table.
- Changing all write paths in our codebase to only write to the new table.
- Removing old data that relies on the outdated data model.
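The four steps above can be sketched with two feature flags; tables are plain dicts here and all names are made up:

```python
# Dual-write migration sketch: old_table is the legacy store, new_table the target.
old_table, new_table = {}, {}

READ_FROM_NEW = False   # flipped in step 2
WRITE_OLD = True        # turned off in step 3

def write(key, value):
    if WRITE_OLD:
        old_table[key] = value   # step 1: keep both tables in sync
    new_table[key] = value

def read(key):
    return new_table[key] if READ_FROM_NEW else old_table[key]

# step 1: dual write, plus backfilling pre-existing rows into new_table
write("user:1", "alice")
new_table.update(old_table)      # backfill
# step 2: flip READ_FROM_NEW; step 3: clear WRITE_OLD; step 4: drop old_table
```

Each flag flip is independently reversible, which is what makes the migration safe to do online.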
OLTP (On-line Transaction Processing) supports the day-to-day operation of a system. It is characterized by a large number of short on-line transactions (INSERT, UPDATE, DELETE). The main emphasis for OLTP systems is very fast query processing, maintaining data integrity in multi-access environments, and effectiveness measured in transactions per second. OLTP databases hold detailed, current data, stored in an entity model (usually 3NF). Queries access individual records, e.g. updating your email in a company database.
OLAP (On-line Analytical Processing) deals with historical or archival data. It is characterized by a relatively low volume of transactions; queries are often very complex and involve aggregations. For OLAP systems, response time is the effectiveness measure. OLAP applications are widely used with data mining techniques. OLAP databases hold aggregated, historical data stored in multi-dimensional schemas (usually a star schema). A query may need to access a large amount of management data, e.g. what was your company's profit last year.
- mongodb, mysql: single master + sharding
- cassandra, dynamo: eventual consistent.
if R + W > N: strong consistency (R = how many nodes must agree on a read, W = how many nodes must confirm a write, N = number of replicas)
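The quorum rule is just arithmetic: if R + W > N, any read set and any write set must overlap in at least one node, so a read always sees the latest confirmed write. A tiny sketch:

```python
def strongly_consistent(r, w, n):
    """R + W > N guarantees the read and write quorums overlap in >= 1 node."""
    return r + w > n

# n=3 replicas: write to 2, read from 2 -> quorums must overlap
assert strongly_consistent(r=2, w=2, n=3)
# write to 1, read from 1 -> a read may land on a replica that missed the write
assert not strongly_consistent(r=1, w=1, n=3)
```

Tuning R and W trades read latency against write latency while keeping the same guarantee, e.g. R=1, W=N for read-heavy workloads.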
- Save index in memory to speed up search
- The default engine is InnoDB (MyISAM before 5.5.5); MyISAM is no longer under active development. Another option is MyRocks, which is based on Facebook's RocksDB. To list all the engines:
mysql> SHOW ENGINES;
- In most cases, only one storage engine will be needed. Sometimes you may need to have different storage engines for different tables in the same database.
primary: unique, one per table.
unique: unique, no two rows have the same combination of the values.
index: may not be unique, but improves lookup efficiency.
fulltext: for full text search, by creating an index for each word in that column.
- FULLTEXT indexes are stored in B+ trees.
- Spatial data types use R-trees.
- MEMORY tables also support hash indexes; choose B-tree for MEMORY tables if some queries use range operators.
- InnoDB uses inverted lists for FULLTEXT indexes.
- MySQL is moving to X.509 certificates for authentication, away from username + password
Understanding caching in Postgres - An in-depth guide: Postgres is a process-based system, i.e. each connection has a native OS process... Postgres, as a cross-platform database, relies heavily on the operating system for its caching... WAL is a redo log that keeps track of whatever is happening to the system... It is always better to monitor something directly from Postgres, rather than going through the OS route.
- row key: UUID, equivalent to "primary key"
- column key: string
- ref key: version number(the highest is the latest)
Uses vector clocks to resolve write conflicts
Spanner is a globally-scalable database system used internally by Google; it is the successor of the BigTable database. (Link to the paper)
Cloud Spanner is the managed database on Google Cloud Platform.
Spanner does not have an auto-increment key. Do not use numbers in incremental order as keys (including timestamps): because Spanner is distributed and sharded by key, such keys result in hotspots and hurt performance.
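One common workaround for the hotspot problem is to put a hash-derived prefix in front of the monotonic part of the key (or to use random UUIDs outright), so consecutive writes scatter across shards. A sketch with made-up key formats:

```python
import hashlib

def hot_key(ts):
    # Monotonic key: consecutive writes all land in the same shard's key range.
    return str(ts)

def spread_key(ts):
    # Illustrative fix: a short hash prefix scatters consecutive timestamps
    # across the keyspace while keeping the timestamp queryable as a suffix.
    prefix = hashlib.sha256(str(ts).encode()).hexdigest()[:4]
    return f"{prefix}-{ts}"

print(spread_key(1700000000000))  # hash prefix followed by the timestamp
```

The trade-off: range scans over time now require fanning out across all prefixes, so this only pays off for write-heavy, point-read workloads.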
CockroachDB is an open-source take on Google Spanner: Bigtable => Spanner (inside Google) => CockroachDB (outside Google). https://www.cockroachlabs.com/
Good to see CockroachDB 1.0 is Production-Ready. OpenCredo published CockroachDB: First Impressions. The good: our impressions with CockroachDB are very good overall. It’s very easy to get started with and you get a fully distributed ANSI SQL database. Even better, it was clearly designed to work well in a container and scheduler setup. The bad: Some attention needed for efficient queries, especially when doing joins; relatively immature tooling; if you have a need for more complex, ad-hoc queries, you should evaluate CockroachDB’s performance specifically for your use case. Also, Local and distributed query processing in CockroachDB.
CockroachDB uses RocksDB, an embedded key-value store, internally. Though RocksDB is from Facebook, it is based on LevelDB, which was from Google.
In-memory database
Based on the Bigtable technology from Google
- modeled after Google's Bigtable
- HBase features compression, in-memory operation, and Bloom filters on a per-column basis as outlined in the original Bigtable paper.
- In the parlance of Eric Brewer’s CAP Theorem, HBase is a CP type system.
Inspired by Bigtable
- RocksDB builds on LevelDB (from Google)
- better compression
- less write-amplification
Netezza’s proprietary AMPP (Asymmetric Massively Parallel Processing) architecture is a two-tiered system designed to quickly handle very large queries from multiple users. https://en.wikipedia.org/wiki/Netezza
Unlike either monolithic or master-slave designs, Cassandra makes use of an entirely peer-to-peer architecture. All nodes in a Cassandra cluster can accept reads and writes, no matter where the data being written or requested actually belongs in the cluster.
- Shard data automatically
- Handle partial outages without data loss or downtime
- Scales close to linearly
Five-dimensional data model (similar to Bigtable or Cassandra):
- namespaces(databases): contain sets (tables) of records
- sets: table
- key: identify records
- metadata: generation tracks record modification, time-to-live (TTL) specifies record expiration
- bin: name value pairs.
- Data is sharded and balanced between servers using a Paxos-based membership algorithm.
- Aerospike makes a dangerous assumption for a distributed datastore: it assumes the network is reliable
Compared to an in-memory cache:
- optimized for flash storage
- cheaper than the in memory cache
- no need to re-load data into memory after an outage
- append only: data is virtually incorruptible and easy to replicate, back up, and restore.
- not for ad hoc queries
- multiple masters
- single master
- index: b-tree
- data is heterogeneous: the items in the collection don’t all have the same structure for some reason
Config server vs mongos: The config server (itself replicated) manages the sharded information for other sharded servers, while mongos will likely live on your local application server where clients can easily connect (without needing to manage which shards to connect to).
MongoDB vs SQL Databases:
- SQL db: join on db side
- one-to-many: self contained;
- many-to-one, many-to-many: join on application/client side
- the whole database is in one single file on disk, can be embedded inside the application, very portable.
- zero-config, easy to setup
- not a client–server database, no network capabilities, cannot be used over a network.
- not for write-intensive deployments or high-concurrency use cases, since it relies on file-system locks, whereas server-based databases handle locks in daemons
- no type checking, the type of a value is dynamic and not strictly constrained by the schema
- no user management; no way to tune for higher performance
- not as reliable as other RDBMSs like MySQL
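The zero-config, embedded nature of SQLite and its dynamic typing can both be seen in a few lines of Python (the schema here is made up; sqlite3 ships in the standard library):

```python
import sqlite3

# SQLite is embedded: no server, no config -- a connection IS the database.
# ":memory:" keeps it in RAM; a file path would make it a single portable file.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, v)")
db.execute("INSERT INTO t (v) VALUES (?)", ("hello",))
# Dynamic typing: column v has no declared type, so a number is accepted too.
db.execute("INSERT INTO t (v) VALUES (?)", (42,))
print(db.execute("SELECT v FROM t").fetchall())  # [('hello',), (42,)]
```

That same flexibility is the "no type checking" caveat above: nothing stops mixed types from landing in one column.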
- TimescaleDB is packaged as a PostgreSQL extension
- TimescaleDB is more than 20x faster than vanilla PostgreSQL when inserting data at scale
- TimescaleDB enables you to scale to 1 billion rows with no real impact to insert performance
Data is continuously backed up to S3 in real time, with no performance impact.
Instead of creating an index tree for each column, Cosmos DB employs one index for the whole database account
- InnoDB: B-tree
- RocksDB: LSM(log-structured merge tree)
- We have found RocksDB, when compared to InnoDB, uses less storage space to store a UDB instance and generates less write traffic overall
- A B-Tree wastes space when pages fragment. An LSM doesn't fragment.
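The write-throughput edge of LSM engines comes from buffering writes in memory and flushing them as immutable sorted runs. A toy sketch (sizes and structure are simplified; real engines add WALs, compaction, and bloom filters):

```python
# Toy LSM: writes go to an in-memory memtable; when full it is flushed as an
# immutable sorted segment. Reads check the memtable, then segments newest-first.
MEMTABLE_LIMIT = 2
memtable, segments = {}, []

def put(key, value):
    memtable[key] = value                                # cheap in-memory write
    if len(memtable) >= MEMTABLE_LIMIT:
        segments.append(dict(sorted(memtable.items())))  # flush a sorted run
        memtable.clear()

def get(key):
    if key in memtable:
        return memtable[key]
    for seg in reversed(segments):                       # newest segment wins
        if key in seg:
            return seg[key]
    return None

put("a", 1); put("b", 2); put("a", 3)
print(get("a"), get("b"))  # 3 2
```

Reads may touch several segments (hence the lower read throughput), and since segments are append-only there are no fragmenting in-place page updates, matching the B-tree vs LSM points above.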
- Inefficient architecture for writes;
- Inefficient data replication;
- Issues with table corruption;
- Poor replica MVCC support;
- Difficulty upgrading to newer releases.
- The most important architectural difference: while Postgres directly maps index records to on-disk locations, InnoDB maintains a secondary structure. Instead of holding a pointer to the on-disk row location (like the ctid does in Postgres), InnoDB secondary index records hold the row's primary key.
- MySQL supports multiple different replication modes; InnoDB storage engine implements its own LRU in something it calls the InnoDB buffer pool; MySQL implements concurrent connections by spawning a thread-per-connection. This is relatively low overhead.
- cassandra: eventual consistent
- mysql: strong consistency
As cardinality moderately increases, InfluxDB performance drops dramatically due to its reliance on time-structured merge trees (which, like the log-structured merge trees they are modeled after, suffer with higher-cardinality datasets). This should be no surprise, as high cardinality is a well-known Achilles heel for InfluxDB.
Facebook database automation: https://code.facebook.com/posts/180455938822278/under-the-hood-mysql-pool-scanner-mps-/