Databases - Concepts
- buffered writes: buffer writes so the system can tolerate failing MySQL masters
- read-your-writes semantics
- master-slave replication (read from slaves, write to the master)
B-Tree vs Hash Index:
- B-tree index: O(log N); good for range scans and prefix matches like LIKE 'string%' (a constant pattern that does not start with a wildcard character)
- hash index: O(1); used only for equality comparisons with the = or <=> operators (but those are very fast). Good for key-value stores.
E.g. in MySQL:
CREATE [UNIQUE | FULLTEXT | SPATIAL] INDEX index_name
  [USING {BTREE | HASH}]
  ON table_name (column_name [(length)] [ASC | DESC], ...)
FULLTEXT: optimized for regular plain text; it splits the text into words and indexes the words instead of the whole text. (LIKE '%foo' cannot use any index and thus may incur a full table scan.)
SPATIAL: for indexing geo location
A storage engine is usually single-node: it is just one layer of a distributed database.
Storage Engine's responsibility:
- Atomicity and Durability of ACID.
- performance (e.g. write-ahead log)
Higher layers of the database can then focus on Isolation (distributed coordination) and Consistency (data integrity).
- MySQL's default storage engine is InnoDB, however there are many other options like RocksDB.
- CockroachDB is based on RocksDB. And a new option: Pebble.
- Postgres is a monolithic system: it does not have a pluggable "storage engine", but internally uses B-tree, hash, and heap storage structures.
- Cassandra: comes with an LSM tree implementation.
Normalized vs Denormalized
- normalized data: join when used
- denormalized data: data is stored and retrieved as a whole (blobs). No joins are needed, and a record can be written with one disk write.
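The trade-off can be seen in a toy example (hypothetical user/order records; Python dicts stand in for tables):

```python
# Normalized: users and orders live in separate tables; reads need a join.
users = {1: {"name": "ada"}}
orders = [{"user_id": 1, "item": "book"}]
joined = [{**o, "name": users[o["user_id"]]["name"]} for o in orders]

# Denormalized: each order carries the user data as one blob. One read,
# one write, no join -- but "name" is duplicated and must be updated
# everywhere when the user changes it.
orders_denorm = [{"user_id": 1, "name": "ada", "item": "book"}]

assert joined == orders_denorm  # same information, different storage shape
```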
Storage Engines: Log-structured vs Page-oriented
- log-structured storage engines (LSM trees, "log-structured merge-trees"), e.g. LevelDB, RocksDB, and Cassandra: high write throughput, lower read throughput
- page-oriented storage engines (B-trees), e.g. PostgreSQL, MySQL, SQLite: high read throughput, lower write throughput
log-structured file systems
- hypothesis: as memory sizes increase, most reads are served from the in-memory cache, so disk I/O becomes write-heavy
- treats its storage as a circular log and writes sequentially to the head of the log
- optimistic write path: read the value, grab the write lock, look at the index, and check whether the address of the data stayed the same. If yes, no conflicting write occurred, so proceed with writing the new value; otherwise a conflicting transaction occurred, so roll back and start again from the read phase.
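That validate-then-write loop can be sketched as a compare-and-retry in memory (a minimal sketch; the `OptimisticStore`, `log`, and `index` names are illustrative, not from any real engine):

```python
import threading

class OptimisticStore:
    """Toy append-only store with an index mapping key -> log address."""
    def __init__(self):
        self.log = []            # append-only log of (key, value) entries
        self.index = {}          # key -> address (offset into the log)
        self.write_lock = threading.Lock()

    def read(self, key):
        addr = self.index.get(key)
        value = self.log[addr][1] if addr is not None else None
        return addr, value

    def write(self, key, update_fn):
        while True:
            addr, value = self.read(key)          # 1. read phase (no lock held)
            new_value = update_fn(value)
            with self.write_lock:                 # 2. grab the write lock
                if self.index.get(key) == addr:   # 3. address unchanged: no conflict
                    self.log.append((key, new_value))
                    self.index[key] = len(self.log) - 1
                    return new_value
            # A conflicting write moved the data; retry from the read phase.

store = OptimisticStore()
store.write("counter", lambda v: (v or 0) + 1)
store.write("counter", lambda v: (v or 0) + 1)
print(store.read("counter")[1])  # 2
```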
Data model change: A pattern for online migrations
There’s a common 4 step dual writing pattern that people often use to do large online migrations like this. Here’s how it works:
- Dual writing to the existing and new tables to keep them in sync (and backfilling old data into the new table).
- Changing all read paths in the codebase to read from the new table.
- Changing all write paths in the codebase to only write to the new table.
- Removing old data and code that rely on the outdated data model.
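The steps above can be sketched as a thin read/write wrapper (a toy sketch: dicts stand in for tables, and `step` mirrors the four migration steps; none of these names come from a real library):

```python
old_table, new_table = {}, {}

def write(key, value, step):
    if step <= 2:                  # steps 1-2: dual write keeps both tables in sync
        old_table[key] = value
        new_table[key] = value
    else:                          # step 3+: write only to the new table
        new_table[key] = value

def read(key, step):
    if step == 1:                  # step 1: reads still come from the old table
        return old_table.get(key)
    return new_table.get(key)      # step 2+: read paths moved to the new table

write("user:1", {"name": "ada"}, step=1)
# Cutting reads over (step 2) is safe because dual writes kept the tables equal:
assert read("user:1", step=1) == read("user:1", step=2)
```

The point of the ordering is that each step is independently reversible: reads can be flipped back to the old table at any point before step 3.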
OLTP vs OLAP
- OLTP: On-line Transaction Processing, data is usually row-oriented
- OLAP: On-line Analytical Processing, data is usually column-oriented
These are two relatively old terms; when you hear people talk about them, think operational databases (e.g. MySQL, Spanner) vs. data warehouses (for analytics, e.g. AWS Redshift, GCP BigQuery).
- MongoDB, MySQL: single master + sharding, strongly consistent.
- Cassandra, Dynamo: eventually consistent.
If R + W > N: strong consistency (R = how many nodes must agree during a read, W = how many nodes must confirm a write, N = the number of replicas).
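The quorum rule is simple enough to state directly in code; the intuition is that when R + W > N, every read quorum must overlap every write quorum in at least one node:

```python
def is_strong(r, w, n):
    """Read and write quorums overlap in at least one replica iff r + w > n."""
    return r + w > n

print(is_strong(2, 2, 3))  # True: any 2 readers overlap any 2 writers
print(is_strong(1, 1, 3))  # False: a read may hit a replica the write missed
```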
Read More: Strong Consistency vs Eventual Consistency
Log-Structured Merge-Tree (LSM Tree)
"LSM" architecture provides a number of interesting behaviors: writes are always fast regardless of the size of the dataset (append-only), and random reads are either served from memory or require a quick disk seek.
A good read: http://www.benstopford.com/2015/02/14/log-structured-merge-trees/
"Log" means append-only.
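A minimal sketch of the append-only idea (an in-memory memtable flushed into immutable sorted segments; `TinyLSM` and the tiny flush threshold are illustrative, and real engines also do compaction, which is omitted here):

```python
class TinyLSM:
    """Toy LSM tree: writes go to an in-memory memtable, which is flushed
    to immutable sorted segments; reads check the newest data first."""
    def __init__(self, memtable_limit=2):
        self.memtable = {}
        self.segments = []            # sorted, immutable segments (newest last)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value    # always fast: an in-memory insert
        if len(self.memtable) >= self.memtable_limit:
            # Flush: one sequential write of a sorted segment, then start fresh.
            self.segments.append(dict(sorted(self.memtable.items())))
            self.memtable = {}

    def get(self, key):
        if key in self.memtable:      # newest version wins
            return self.memtable[key]
        for seg in reversed(self.segments):
            if key in seg:
                return seg[key]
        return None

db = TinyLSM()
db.put("a", 1)
db.put("b", 2)                        # hits the limit and triggers a flush
db.put("a", 3)                        # newer version shadows the flushed one
print(db.get("a"))  # 3
print(db.get("b"))  # 2
```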
A table with an empty primary key is called a singleton table and equates to a table that must have exactly one row.
MVCC: multi-version concurrency control
To read without blocking writes, keep multiple immutable versions of data. A write creates a new immutable version whose timestamp is that of the write's transaction. A "snapshot read" at a timestamp returns the value of the most recent version prior to that timestamp, and does not need to block writes.
This is trivial for a single-machine database, but achieving it in a widely distributed system, in which servers all over the world need to assign timestamps, is much more difficult to do efficiently.
E.g. Spanner depends on TrueTime to generate monotonically increasing timestamps.
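The snapshot-read rule above can be sketched with per-key version lists (a minimal sketch: timestamps are plain integers here, whereas a real system needs a source of monotonic timestamps; `MVCCStore` is a made-up name):

```python
import bisect

class MVCCStore:
    """Each key maps to a sorted list of immutable (timestamp, value) versions."""
    def __init__(self):
        self.versions = {}   # key -> [(ts, value), ...], sorted by ts

    def write(self, key, value, ts):
        # A write never mutates old versions; it adds a new immutable one.
        bisect.insort(self.versions.setdefault(key, []), (ts, value))

    def snapshot_read(self, key, ts):
        """Most recent version strictly before ts; never blocks writers."""
        vs = self.versions.get(key, [])
        i = bisect.bisect_left(vs, (ts,))
        return vs[i - 1][1] if i > 0 else None

s = MVCCStore()
s.write("x", "old", ts=10)
s.write("x", "new", ts=20)
print(s.snapshot_read("x", ts=15))  # 'old': the ts=20 write is invisible
print(s.snapshot_read("x", ts=25))  # 'new'
```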
Transaction Processing System
- Strong consistency: guarantees that reads observe the effects of all transactions that committed before the start of the operation, independent of which replica receives the read.
- Serializability: executes transactions in a manner that is indistinguishable from a system in which the transactions are executed serially
- Linearizability: a property of concurrent objects that support atomic read and write operations. In a database, an "object" would typically be a single row or even a single cell.
- External consistency: a property of transaction-processing systems, where clients dynamically synthesize transactions that contain multiple read and write operations on arbitrary objects. Linearizability can be viewed as a special case of external consistency, where a transaction can only contain a single read or write operation on a single object. External consistency is stricter than the above.
Avoid hot spots
- change the order of the primary key columns: instead of (Timestamp, UserId), use (UserId, Timestamp), so writes are not all appended at the tail of the key range
- hash the unique key to spread writes across logical shards: use (Hash, Timestamp, UserId), where Hash = hash(Timestamp, UserId), or an int ShardId = hash(Timestamp, UserId) % N
- use UUID
If the id has to be a numeric sequence, add a hash to the front:
CREATE TABLE Table1 (
  Hashed_Id INT64 NOT NULL,
  Id INT64 NOT NULL,
  -- other columns with data values follow....
) PRIMARY KEY (Hashed_Id, Id)
When reading or joining the table, both Id and Hashed_Id must be specified to prevent a table scan.
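The hash-prefix idea can be sketched as follows (a sketch only: `hashlib.sha256` stands in for whatever stable hash the database uses, and N = 16 logical shards is an arbitrary choice):

```python
import hashlib

N = 16  # number of logical shards (arbitrary for this sketch)

def shard_id(timestamp, user_id):
    """Stable hash of the natural key, folded into N logical shards."""
    digest = hashlib.sha256(f"{timestamp}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % N

# A burst of sequential timestamps no longer lands at the tail of a
# (Timestamp, ...) key range; each row is prefixed with its shard id:
keys = [(shard_id(ts, "user42"), ts, "user42") for ts in range(1000, 1010)]
```

The cost is on the read side: a scan over a time range now has to fan out across all N shard prefixes.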
There are several ways to store a UUID (v4) as the primary key:
- In a STRING(36) column.
- In a pair of INT64 columns.
- In a BYTES(16) column.
Disadvantages to using a UUID:
- They are slightly large, using 16 bytes or more. Other options for primary keys don't use this much storage.
- They carry no information about the record. For example, a primary key of SingerId and AlbumId has an inherent meaning, while a UUID does not.
- You lose locality between records that are related; this is the flip side of the property that eliminates hotspots.