Distributed Systems
What are the key topics of Distributed Systems?
Distributed Systems is a massive field that covers how multiple independent computers (nodes) communicate and coordinate to appear as a single, coherent system to the end user.
1. The Foundational Theory
These are the "laws of physics" for distributed computing.
- The CAP Theorem: The trade-off between Consistency, Availability, and Partition Tolerance.
- PACELC Theorem: An extension of CAP that describes what happens during "normal" operation (Latency vs. Consistency).
- Fallacies of Distributed Computing: Eight false assumptions developers make (e.g., "The network is reliable," "Latency is zero").
- Time & Ordering: Since there is no single "clock," how do we know which event happened first?
- Logical Clocks: Lamport Clocks, Vector Clocks.
- Physical Clocks: NTP, Google's TrueTime (used in Spanner).
2. Communication Patterns
How nodes actually move data between each other.
- Remote Procedure Calls (RPC): Making a function call on a different machine (e.g., gRPC, Apache Thrift).
- Message Queues & Pub/Sub: Asynchronous communication where a "Broker" sits in the middle (e.g., Kafka, RabbitMQ, GCP Pub/Sub).
- Serialization: How to turn an object into bits (e.g., Protobuf, Avro, JSON).
3. Consensus & Coordination
How independent nodes "agree" on a value or state (the hardest part of the field).
- Consensus Protocols: Paxos and Raft. These are used to ensure that if one machine crashes, the others still agree on what the truth is.
- Distributed Locking: Ensuring only one node performs a task (e.g., Etcd, ZooKeeper).
- Leader Election: Determining which node is the "Master" and what to do if it dies.
4. Distributed Data & Storage
How to store data across 1,000 machines instead of one.
- Replication: Keeping copies of the same data on multiple nodes (Leader-based vs. Multi-leader vs. Leaderless).
- Partitioning/Sharding: Splitting a massive database into smaller chunks based on a "Shard Key."
- Consistency Models:
- Strong Consistency: Everyone sees the same data at the same time.
- Eventual Consistency: Everyone will eventually see the same data (common in NoSQL).
- Distributed Hash Tables (DHT): How systems like BitTorrent or IPFS find data without a central server.
5. Fault Tolerance & Reliability
How the system stays alive when hardware inevitably fails.
- Health Checks & Heartbeats: How nodes monitor each other's status.
- Quorums: Requiring a majority (e.g., 3 out of 5 nodes) to agree before a write is finalized.
- The Byzantine Generals Problem: How to reach agreement when some nodes are intentionally lying or malicious (crucial for Blockchain).
- Circuit Breakers: Automatically stopping requests to a failing service to prevent a "cascading failure."
6. Scalability & Performance
How to handle more users by adding more machines.
- Load Balancing: Distributing incoming traffic (L4 vs. L7 load balancing).
- Caching: Using Redis or Memcached to store frequent results in memory.
- Content Delivery Networks (CDNs): Distributing static assets (images/video) to the "edge" (closer to the user).
7. Observability & Management
How to debug a system when you can't log into 5,000 different servers.
- Distributed Tracing: Following a single user request as it travels through 20 different microservices (e.g., Jaeger, OpenTelemetry, Zipkin).
- Log Aggregation: Pulling logs from every node into one searchable place (e.g., ELK Stack, Cloud Logging).
- Service Mesh: Managing the "networking" between microservices (e.g., Istio, Linkerd).