Distributed Systems

What are the key topics of Distributed Systems?

Distributed Systems is a massive field that covers how multiple independent computers (nodes) communicate and coordinate to appear as a single, coherent system to the end user.

1. The Foundational Theory

These are the "laws of physics" for distributed computing.

The CAP Theorem: The trade-off between Consistency, Availability, and Partition Tolerance.
PACELC Theorem: An extension of CAP that describes what happens during "normal" operation (Latency vs. Consistency).
Fallacies of Distributed Computing: Eight false assumptions developers make (e.g., "The network is reliable," "Latency is zero").
Time & Ordering: Since there is no single "clock," how do we know which event happened first?
- Logical Clocks: Lamport Clocks, Vector Clocks.
- Physical Clocks: NTP, Google's TrueTime (used in Spanner).

2. Communication Patterns

How nodes actually move data between each other.

Remote Procedure Calls (RPC): Making a function call on a different machine (e.g., gRPC, Apache Thrift).
Message Queues & Pub/Sub: Asynchronous communication where a "Broker" sits in the middle (e.g., Kafka, RabbitMQ, GCP Pub/Sub).
Serialization: How to turn an object into bits (e.g., Protobuf, Avro, JSON).

3. Consensus & Coordination

How independent nodes "agree" on a value or state (the hardest part of the field).

Consensus Protocols: Paxos and Raft. These are used to ensure that if one machine crashes, the others still agree on what the truth is.
Distributed Locking: Ensuring only one node performs a task (e.g., Etcd, ZooKeeper).
Leader Election: Determining which node is the "Master" and what to do if it dies.

4. Distributed Data & Storage

How to store data across 1,000 machines instead of one.

Replication: Keeping copies of the same data on multiple nodes (Leader-based vs. Multi-leader vs. Leaderless).
Partitioning/Sharding: Splitting a massive database into smaller chunks based on a "Shard Key."
Consistency Models:
- Strong Consistency: Everyone sees the same data at the same time.
- Eventual Consistency: Everyone will eventually see the same data (common in NoSQL).
Distributed Hash Tables (DHT): How systems like BitTorrent or IPFS find data without a central server.

5. Fault Tolerance & Reliability

How the system stays alive when hardware inevitably fails.

Health Checks & Heartbeats: How nodes monitor each other's status.
Quorums: Requiring a majority (e.g., 3 out of 5 nodes) to agree before a write is finalized.
The Byzantine Generals Problem: How to reach agreement when some nodes are intentionally lying or malicious (crucial for Blockchain).
Circuit Breakers: Automatically stopping requests to a failing service to prevent a "cascading failure."

6. Scalability & Performance

How to handle more users by adding more machines.

Load Balancing: Distributing incoming traffic (L4 vs. L7 load balancing).
Caching: Using Redis or Memcached to store frequent results in memory.
Content Delivery Networks (CDNs): Distributing static assets (images/video) to the "edge" (closer to the user).

7. Observability & Management

How to debug a system when you can't log into 5,000 different servers.

Distributed Tracing: Following a single user request as it travels through 20 different microservices (e.g., Jaeger, OpenTelemetry, Zipkin).
Log Aggregation: Pulling logs from every node into one searchable place (e.g., ELK Stack, Cloud Logging).
Service Mesh: Managing the "networking" between microservices (e.g., Istio, Linkerd).