Databases - Concepts
- buffered writes: buffer writes so the system can tolerate failing MySQL masters
- read-your-writes semantics
- master-slave replication (read from slaves, write to the master)
B-Tree vs Hash Index:
- B-tree index: O(log N); good for range scans and prefix matches like LIKE 'string%' (a constant pattern that does not start with a wildcard character)
- hash index: O(1); used only for equality comparisons with the = or <=> operators (but those are very fast). Good for key-value stores.
E.g. in MySQL:
CREATE [UNIQUE | FULLTEXT | SPATIAL] INDEX index_name
  [USING {BTREE | HASH}]
  ON table_name (column_name [(length)] [ASC | DESC], ...)
FULLTEXT: optimized for regular plain text; it splits the text into words and indexes the words instead of the whole text. (LIKE '%foo' cannot use any index and thus may incur a full table scan.)
SPATIAL: for indexing geo location
A storage engine is usually single-node: it is just one layer of a distributed database.
Storage Engine's responsibility:
- Atomicity and Durability of ACID.
- performance (e.g. write-ahead log)
Higher layers of the database can then focus on Isolation (distributed coordination) and Consistency (data integrity).
- MySQL's default storage engine is InnoDB, however there are many other options like RocksDB.
- CockroachDB is based on RocksDB. And a new option: Pebble.
- Postgres is a monolithic system: it does not have a pluggable "storage engine", but internally uses B-tree, hash, and heap storage structures.
- Cassandra: comes with an LSM tree implementation.
Normalized vs Denormalized
- normalized data: join when used
- denormalized data: data is stored and retrieved as a whole (blobs). No joins are needed, and a record can be written with one disk write.
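The trade-off can be seen in a toy example (hypothetical user/order records; Python dicts stand in for tables):

```python
# Normalized: users and orders live in separate tables; reads need a join.
users = {1: {"name": "ada"}}
orders = [{"user_id": 1, "item": "book"}]
joined = [{**o, "name": users[o["user_id"]]["name"]} for o in orders]

# Denormalized: each order carries the user data as one blob. One read,
# one write, no join -- but "name" is duplicated and must be updated
# everywhere when the user changes it.
orders_denorm = [{"user_id": 1, "name": "ada", "item": "book"}]

assert joined == orders_denorm  # same information, different storage shape
```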
Storage Engines: Log-structured vs Page-oriented
- log-structured storage engines (LSM trees, "log-structured merge-trees"), e.g. LevelDB, RocksDB, and Cassandra: high write throughput, lower read throughput
- page-oriented storage engines (B-trees), e.g. PostgreSQL, MySQL, SQLite: high read throughput, lower write throughput
log-structured file systems
- hypothesis: as memory sizes increase, most reads are served from the in-memory cache, so disk I/O becomes write-heavy
- treats its storage as a circular log and writes sequentially to the head of the log
- optimistic write path: read the value, grab the write lock, look at the index, and check whether the address of the data stayed the same. If yes, no conflicting write occurred, so proceed with writing the new value; otherwise a conflicting transaction occurred, so roll back and start again from the read phase.
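That validate-then-write loop can be sketched as a compare-and-retry in memory (a minimal sketch; the `OptimisticStore`, `log`, and `index` names are illustrative, not from any real engine):

```python
import threading

class OptimisticStore:
    """Toy append-only store with an index mapping key -> log address."""
    def __init__(self):
        self.log = []            # append-only log of (key, value) entries
        self.index = {}          # key -> address (offset into the log)
        self.write_lock = threading.Lock()

    def read(self, key):
        addr = self.index.get(key)
        value = self.log[addr][1] if addr is not None else None
        return addr, value

    def write(self, key, update_fn):
        while True:
            addr, value = self.read(key)          # 1. read phase (no lock held)
            new_value = update_fn(value)
            with self.write_lock:                 # 2. grab the write lock
                if self.index.get(key) == addr:   # 3. address unchanged: no conflict
                    self.log.append((key, new_value))
                    self.index[key] = len(self.log) - 1
                    return new_value
            # A conflicting write moved the data; retry from the read phase.

store = OptimisticStore()
store.write("counter", lambda v: (v or 0) + 1)
store.write("counter", lambda v: (v or 0) + 1)
print(store.read("counter")[1])  # 2
```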
Data model change: A pattern for online migrations
There’s a common 4 step dual writing pattern that people often use to do large online migrations like this. Here’s how it works:
- Dual writing to the existing and new tables to keep them in sync (and backfilling old data into the new table).
- Changing all read paths in the codebase to read from the new table.
- Changing all write paths in the codebase to only write to the new table.
- Removing old data and code that rely on the outdated data model.
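The steps above can be sketched as a thin read/write wrapper (a toy sketch: dicts stand in for tables, and `step` mirrors the four migration steps; none of these names come from a real library):

```python
old_table, new_table = {}, {}

def write(key, value, step):
    if step <= 2:                  # steps 1-2: dual write keeps both tables in sync
        old_table[key] = value
        new_table[key] = value
    else:                          # step 3+: write only to the new table
        new_table[key] = value

def read(key, step):
    if step == 1:                  # step 1: reads still come from the old table
        return old_table.get(key)
    return new_table.get(key)      # step 2+: read paths moved to the new table

write("user:1", {"name": "ada"}, step=1)
# Cutting reads over (step 2) is safe because dual writes kept the tables equal:
assert read("user:1", step=1) == read("user:1", step=2)
```

The point of the ordering is that each step is independently reversible: reads can be flipped back to the old table at any point before step 3.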
OLTP vs OLAP
- OLTP: On-line Transaction Processing, data is usually row-oriented
- OLAP: On-line Analytical Processing, data is usually column-oriented
These are two relatively old terms; when you hear people talk about them, think operational databases (e.g. MySQL, Spanner) vs. data warehouses (for analytics, e.g. AWS Redshift, GCP BigQuery).
- MongoDB, MySQL: single master + sharding, strongly consistent.
- Cassandra, Dynamo: eventually consistent.
If R + W > N: strong consistency (R = how many nodes must agree during a read, W = how many nodes must confirm a write, N = the number of replicas).
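The quorum rule is simple enough to state directly in code; the intuition is that when R + W > N, every read quorum must overlap every write quorum in at least one node:

```python
def is_strong(r, w, n):
    """Read and write quorums overlap in at least one replica iff r + w > n."""
    return r + w > n

print(is_strong(2, 2, 3))  # True: any 2 readers overlap any 2 writers
print(is_strong(1, 1, 3))  # False: a read may hit a replica the write missed
```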
Read More: Strong Consistency vs Eventual Consistency
Log-Structured Merge-Tree (LSM Tree)
"LSM" architecture provides a number of interesting behaviors: writes are always fast regardless of the size of the dataset (append-only), and random reads are either served from memory or require a quick disk seek.
A good read: http://www.benstopford.com/2015/02/14/log-structured-merge-trees/
"Log" means append-only.
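A minimal sketch of the append-only idea (an in-memory memtable flushed into immutable sorted segments; `TinyLSM` and the tiny flush threshold are illustrative, and real engines also do compaction, which is omitted here):

```python
class TinyLSM:
    """Toy LSM tree: writes go to an in-memory memtable, which is flushed
    to immutable sorted segments; reads check the newest data first."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []            # sorted, immutable segments (newest last)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # always fast: an in-memory insert
        if len(self.memtable) >= self.memtable_limit:
            # Flush: one sequential write of a sorted segment, then start fresh.
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # newest version wins
            return self.memtable[key]
        for seg in reversed(self.segments):
            if key in seg:
                return seg[key]
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)                        # hits the limit and triggers a flush
db.put("a", 3)                        # newer version shadows the flushed one
print(db.get("a"))  # 3
print(db.get("b"))  # 2
```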
A table with an empty primary key is called a singleton table and equates to a table that must have exactly one row.
MVCC: multi-version concurrency control
To read without blocking writes, keep multiple immutable versions of data. A write creates a new immutable version whose timestamp is that of the write's transaction. A "snapshot read" at a timestamp returns the value of the most recent version prior to that timestamp, and does not need to block writes.
This is trivial for a single-machine database, but achieving it in a widely distributed system, in which servers all over the world need to assign timestamps, is much more difficult to do efficiently.
E.g. Spanner depends on TrueTime to generate monotonically increasing timestamps.
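The snapshot-read rule above can be sketched with per-key version lists (a minimal sketch: timestamps are plain integers here, whereas a real system needs a source of monotonic timestamps; `MVCCStore` is a made-up name):

```python
import bisect

class MVCCStore:
    """Each key maps to a sorted list of immutable (timestamp, value) versions."""
    def __init__(self):
        self.versions = {}   # key -> [(ts, value), ...], sorted by ts

    def write(self, key, value, ts):
        # A write never mutates old versions; it adds a new immutable one.
        bisect.insort(self.versions.setdefault(key, []), (ts, value))

    def snapshot_read(self, key, ts):
        """Most recent version strictly before ts; never blocks writers."""
        vs = self.versions.get(key, [])
        i = bisect.bisect_left(vs, (ts,))
        return vs[i - 1][1] if i > 0 else None

s = MVCCStore()
s.write("x", "old", ts=10)
s.write("x", "new", ts=20)
print(s.snapshot_read("x", ts=15))  # 'old': the ts=20 write is invisible
print(s.snapshot_read("x", ts=25))  # 'new'
```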
Transaction Processing System
- Strong consistency: guarantees that reads observe the effects of all transactions that committed before the start of the operation, independent of which replica receives the read.
- Serializability: executes transactions in a manner that is indistinguishable from a system in which the transactions are executed serially
- Linearizability: a property of concurrent objects that support atomic read and write operations. In a database, an "object" would typically be a single row or even a single cell.
- External consistency: a property of transaction-processing systems, where clients dynamically synthesize transactions that contain multiple read and write operations on arbitrary objects. Linearizability can be viewed as a special case of external consistency, where a transaction can only contain a single read or write operation on a single object. External consistency is stricter than the above.
Avoid hot spots
- change the order of the primary key columns: instead of (Timestamp, UserId), use (UserId, Timestamp), so writes are not all appended at the tail of the key range
- hash the unique key to spread writes across logical shards: use (Hash, Timestamp, UserId), where Hash = hash(Timestamp, UserId), or an int ShardId = hash(Timestamp, UserId) % N
- use UUID
If the id has to be a numeric sequence, add a hash to the front:
CREATE TABLE Table1 (
  Hashed_Id INT64 NOT NULL,
  Id INT64 NOT NULL,
  -- other columns with data values follow....
) PRIMARY KEY (Hashed_Id, Id)
When reading or joining the table, both Id and Hashed_Id must be specified to prevent a table scan.
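The hash-prefix idea can be sketched as follows (a sketch only: `hashlib.sha256` stands in for whatever stable hash the database uses, and N = 16 logical shards is an arbitrary choice):

```python
import hashlib

N = 16  # number of logical shards (arbitrary for this sketch)

def shard_id(timestamp, user_id):
    """Stable hash of the natural key, folded into N logical shards."""
    digest = hashlib.sha256(f"{timestamp}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % N

# A burst of sequential timestamps no longer lands at the tail of a
# (Timestamp, ...) key range; each row is prefixed with its shard id:
keys = [(shard_id(ts, "user42"), ts, "user42") for ts in range(1000, 1010)]
```

The cost is on the read side: a scan over a time range now has to fan out across all N shard prefixes.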
There are several ways to store a UUID (v4) as the primary key:
- In a STRING(36) column.
- In a pair of INT64 columns.
- In a BYTES(16) column.
Disadvantages to using a UUID:
- They are slightly large, using 16 bytes or more. Other options for primary keys don't use this much storage.
- They carry no information about the record. For example, a primary key of SingerId and AlbumId has an inherent meaning, while a UUID does not.
- You lose locality between records that are related; this is the flip side of the property that eliminates hotspots.