Distributed Systems - System Design Patterns
Particularly useful with Kubernetes: a K8s pod has one or more application containers. A sidecar is a utility container in the pod that supports the main container, enhancing its functionality without tight coupling.
Ingress Proxy sits between the internet (external) and the intranet (internal), providing reverse proxying and TLS (SSL) termination, and mapping URIs to Load Balancer targets. Requests then pass through a Sidecar Proxy, which checks abuse/billing/API activation/quota to decide whether the request is allowed.
Load Balancer acts as a Service Discovery Service (SDS), mapping each LB target to the set of hosts that serve that target (i.e. service discovery).
A service mesh is a dedicated infrastructure layer for handling service-to-service communication and global cross-cutting concerns, making the communication more reliable, secure, observable, and manageable.
Most notable Service Mesh Framework: Istio.
Istio uses Lyft's Envoy as an intelligent proxy deployed as a sidecar.
Online Transaction Processing Databases (OLTP)
- Facebook Graph, mission critical, strong consistency, core services
Semi-online Light Transaction Processing Databases (SLTP)
- Facebook Messages and Facebook Time Series
- Photos, videos, etc
- Data Warehouse, Logs storage
Facebook example. This is adapted from this slide
|Workload|Storage|Key resource|Latency|Consistency|Durability|
|---|---|---|---|---|---|
|Facebook Graph|MySQL/TAO|Random read IOPS|few ms|quickly consistent across data centers|no data loss|
|Messages and Time Series|HBase and HDFS|Write IOPS / storage capacity|< 200 ms|consistent within a data center|no data loss|
|Photos/Videos|Haystack|Storage capacity|< 250 ms|immutable|no data loss|
|Data Warehouse|Hive/Presto/HDFS|Storage capacity|< 1 min|not consistent across data centers|no silent data loss|
- High write throughput
- Horizontal scalability
- Automatic Failover
- Strong consistency within a data center
- CAP Theorem
- Vertical scaling
- Horizontal scaling
- Consistent Hashing
- Load balancing
- Redundancy and Replication
- Data Partitioning
- NoSQL vs SQL
- Async / event loop
- resolve conflict
- fault tolerant
- data retention: how long to store
- data persistent format
throughput = concurrency / latency (Little's law: concurrency = throughput × latency)
Latency is the time it takes to complete a single task. Concurrency is the number of independent tasks that can be in flight at any one time.
Performance optimization of individual tasks (task processing time) is a separate subject and engineering concern.
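By Little's law, concurrency = throughput × latency, so the sustainable throughput is the in-flight task count divided by per-task latency. A quick sanity check:

```python
# Little's law: concurrency = throughput * latency, hence
# throughput = concurrency / latency.
def throughput(concurrency: float, latency_s: float) -> float:
    """Tasks per second sustained with the given number of
    in-flight tasks and per-task latency (in seconds)."""
    return concurrency / latency_s

# 100 concurrent requests at 50 ms each -> 2000 req/s
print(throughput(100, 0.050))  # -> 2000.0
```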
- Guaranteed order of messages
- Exactly-once delivery
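Exactly-once delivery is generally approximated as at-least-once delivery plus idempotent processing: consumers deduplicate by message id, so redeliveries have no extra effect. A toy sketch (the message shape and ids are illustrative):

```python
# At-least-once delivery + dedup by message id approximates
# exactly-once *processing* (messages and handler are illustrative).
processed_ids = set()
results = []

def handle(msg: dict) -> None:
    if msg["id"] in processed_ids:  # duplicate redelivery: skip
        return
    processed_ids.add(msg["id"])
    results.append(msg["body"])     # the actual side effect, applied once

# The broker redelivers message 1; the handler stays idempotent.
for m in [{"id": 1, "body": "a"}, {"id": 1, "body": "a"}, {"id": 2, "body": "b"}]:
    handle(m)

print(results)  # -> ['a', 'b']
```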
- volume: hadoop, mapreduce
- velocity: streaming frameworks
- variety: schema on read, specialized tools(KV pair)
Big data problems tend to fall into three categories; namely, managing ever-increasing volumes of data, managing an increasing velocity of data, and dealing with a greater variety of data structure.
PostgreSQL and other relational data solutions provide very good guarantees on the data because they enforce a lack of variety: you enforce a schema on write, and if it is violated, an error is thrown. Hadoop uses schema on read, so you can store data first and then try to read it, getting a lot of null answers back because the data didn't fit your expectations.
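The contrast can be shown in a few lines (a toy sketch, not any particular engine's behavior):

```python
# Schema on write: validate before storing; bad rows are rejected.
def write_row(table: list, row: dict, schema: set) -> None:
    if set(row) != schema:
        raise ValueError(f"row {row} violates schema {schema}")
    table.append(row)

# Schema on read: store anything; missing fields surface as None later.
def read_field(raw_rows: list, field: str) -> list:
    return [r.get(field) for r in raw_rows]

sql_table: list = []
write_row(sql_table, {"user": "ann", "age": 34}, schema={"user", "age"})
# write_row(sql_table, {"user": "bob"}, {"user", "age"}) would raise ValueError

raw = [{"user": "ann", "age": 34}, {"user": "bob"}]  # no schema enforced
print(read_field(raw, "age"))  # -> [34, None]: "null answers" at read time
```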
- System scalability: Supporting an online service with hundreds of millions of registered users, handling millions of requests per second.
- Organizational scalability: Allowing hundreds or even thousands of software engineers to work on the system without excessive coordination overhead.
streaming: a type of data processing engine that is designed with infinite data sets in mind
- streaming data sets: unbounded data
- batch data sets: finite and bounded data.
- Use lock in Multithreading
- use single thread, e.g. Node.js and Chrome(V8)
- async: (cheating) do the minimal amount of work on the backend and tell the user you are done. Put it in a queue
From reddit: the key to speed is to precompute everything and cache it. Every listing has 15 different sort orders (hot, new, top, old, this week, etc.). When someone submits a link, they recalculate all the listings that link could affect and precompute each sort order for them. http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html
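The precompute-on-write idea can be sketched as follows (the fields and the two sort orders are illustrative, not reddit's actual code):

```python
import itertools

links: list = []
cache: dict = {}        # precomputed listings, keyed by sort order
_seq = itertools.count()  # monotonic stand-in for submission time

SORTS = {
    "new": lambda l: -l["created"],
    "top": lambda l: -l["score"],
}

def recompute_cache() -> None:
    for name, key in SORTS.items():
        cache[name] = sorted(links, key=key)

def submit(title: str, score: int) -> None:
    links.append({"title": title, "score": score, "created": next(_seq)})
    recompute_cache()   # pay the sorting cost on write ...

def listing(sort: str) -> list:
    return cache[sort]  # ... so every read is a cache lookup

submit("a", 10)
submit("b", 50)
print([l["title"] for l in listing("top")])  # -> ['b', 'a']
```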
- Load Balancer - a dispatcher determines which worker instance will handle a request based on different policies.
- Scatter and Gather - a dispatcher multicasts requests to all workers in a pool. Each worker computes a local result and sends it back to the dispatcher, who consolidates them into a single response and sends it back to the client.
- Result Cache - a dispatcher first looks up whether the request has been made before and tries to return the previous result, in order to save the actual execution.
- Shared Space - all workers monitor information from the shared space (blackboard) and contribute partial knowledge back to it. The information is continuously enriched until a solution is reached.
- Pipe and Filter - all workers are connected by pipes across which data flows.
- MapReduce - targets batch jobs where disk I/O is the major bottleneck. It uses a distributed file system so that disk I/O can be done in parallel.
- Bulk Synchronous Parallel - a lock-step execution across all workers, coordinated by a master.
- Execution Orchestrator - an intelligent scheduler / orchestrator schedules ready-to-run tasks (based on a dependency graph) across a cluster of dumb workers.
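For example, Scatter and Gather can be sketched with a thread pool acting as the dispatcher (the shard-based workers and the sum-based gather step are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(request: int, workers: list) -> int:
    # Dispatcher multicasts the request to every worker in the pool,
    # then consolidates the partial results into a single response.
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        partials = pool.map(lambda w: w(request), workers)
    return sum(partials)  # "gather" step: here, a simple sum

# Each worker computes a local result over its own shard (illustrative).
workers = [lambda q, shard=s: q * shard for s in (1, 2, 3)]
print(scatter_gather(10, workers))  # -> 60
```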
- COW: Copy-on-Write
- Clean Replication
- Idempotent: making it possible to fail and restart.
- Immutability: the backbone of big data.
- Fault Tolerance.
- LSF (log-structured file system)
- LSM (log-structured merge-tree)
- consistent hashing
- vector clocks
- anti-entropy based recovery
- probabilistic data structure
- 2 phase commit
- write-ahead log (https://en.wikipedia.org/wiki/Write-ahead_logging): writes are serialized to disk before they are applied to the in-memory database.
- exactly once delivery
- Log-structured storage
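Consistent hashing from the list above, in miniature: keys and node replicas are hashed onto a ring, and each key belongs to the first node point clockwise from it, so removing a node only remaps the keys that node owned. A sketch (the vnode count and node names are arbitrary):

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (a sketch)."""
    def __init__(self, nodes, vnodes=100):
        self._ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self._ring]

    @staticmethod
    def _hash(s: str) -> int:
        return int(hashlib.md5(s.encode()).hexdigest(), 16)

    def node_for(self, key: str) -> str:
        # First point on the ring clockwise from the key's hash.
        i = bisect.bisect(self._keys, self._hash(key)) % len(self._keys)
        return self._ring[i][1]

ring = ConsistentHashRing(["a", "b", "c"])
before = {k: ring.node_for(k) for k in map(str, range(1000))}

smaller = ConsistentHashRing(["a", "b"])  # node "c" removed
moved = sum(before[k] != smaller.node_for(k) for k in before)
print(moved)  # only keys that "c" owned move (roughly a third)
```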
Designs are driving toward immutability, which is needed to coordinate at ever increasing distances. Given space to store data for a long time, immutability is affordable. Versioning provides a changing view, while the underlying data is expressed with new contents bound to a unique identifier.
Copy-on-Write. Many emerging systems leverage COW semantics to provide a façade of change while writing immutable files to an underlying store. In turn, the underlying store offers robustness and scalability because it is storing immutable files. For example, many key-value systems are implemented with LSM trees (e.g., HBase, BigTable, and LevelDB).
Clean Replication. When data is immutable and has a unique identifier, many challenges with replication are eased. There's never a worry about finding a stale version of the data because no stale versions exist. Consequently, the replication system may be more fluid and less picky about where it allows a replica to land. There are also fewer replication bugs.
Immutable Data Sets. Immutable data sets can be combined by reference with transactional database data and offer clean semantics when the data sets project relational schema and tables. Looking at the semantics projected by an immutable data set, you can create a new version of it optimized for a different usage pattern but still projecting the same semantics. Projections, redundant copies, denormalization, indexing, and column stores are all examples of optimizing immutable data while preserving its semantics.
Parallelism and Fault Tolerance. Immutability and functional computation are keys to implementing big data.
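The COW and log-structured ideas above can be reduced to a toy store: writes append immutable records, and an in-memory index points at the latest version of each key, in the spirit of LSM-based engines (greatly simplified, not any particular system):

```python
class AppendOnlyKV:
    """Log-structured KV sketch: puts never modify data in place; each
    put appends an immutable (key, value) record, and an index tracks
    the offset of the latest record per key."""
    def __init__(self):
        self._log = []    # immutable, append-only record list
        self._index = {}  # key -> offset of the latest record

    def put(self, key, value):
        self._index[key] = len(self._log)
        self._log.append((key, value))

    def get(self, key):
        return self._log[self._index[key]][1]

db = AppendOnlyKV()
db.put("x", 1)
db.put("x", 2)      # old record stays in the log; only the index moves
print(db.get("x"))  # -> 2
print(len(db._log)) # -> 2: both immutable versions are retained
```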
- sticky sessions: the load balancer inserts a random identifier into a cookie so that subsequent visits are directed to the same web server
- SSL termination at the load balancer: everything inside is unencrypted (uses port 80), which lessens the CPU load on nginx.
- Layer 7 load balancing (HTTP, HTTPS, WS);
- Layer 4 load balancing (TCP, UDP);
- Layer 3 load balancing;
- DNS load balancing; (e.g. round robin)
- Manually load balancing with multiple subdomains;
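DNS round robin amounts to rotating through a pool of addresses; a trivial sketch (the addresses are made up):

```python
import itertools

# Round-robin rotation over a pool of backends, as DNS round robin
# or a simple L4 balancer would do (addresses are illustrative).
backends = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
rr = itertools.cycle(backends)

picks = [next(rr) for _ in range(5)]
print(picks)  # -> ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.1', '10.0.0.2']
```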
To forward HTTP requests to upstream application components, the frontend needs to provide termination of HTTP, HTTPS, and HTTP/2 connections, along with “shock-absorber” protection and routing. It also needs to offer basic logging and first-line access control, implement global security rules, and offload HTTP heavy lifting (caching, compression) to optimize the performance of the upstream application.
- a simple buffer mechanism with Redis: if the write to Postgres failed, we could retry later, since the trip had been stored in Redis in the interim. But while only in Redis, the trip could not be read from Postgres.
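A sketch of that buffer-and-retry pattern, with an in-memory deque standing in for the Redis list and a dict standing in for Postgres (all names are illustrative):

```python
from collections import deque

buffer = deque()  # stands in for the Redis buffer
db: dict = {}     # stands in for Postgres

def save_trip(trip: dict, db_up: bool) -> None:
    buffer.append(trip)  # buffer first, so a failed DB write can retry later
    if db_up:
        flush(db_up)

def flush(db_up: bool) -> None:
    while buffer and db_up:
        trip = buffer[0]
        db[trip["id"]] = trip  # "write to Postgres"
        buffer.popleft()       # drop from the buffer only after success

save_trip({"id": 1}, db_up=False)  # DB down: the trip waits in the buffer
assert 1 not in db                 # ... and is not yet readable from "Postgres"
flush(db_up=True)                  # the retry succeeds later
print(sorted(db))  # -> [1]
```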
HDFS is write-once (append-only): you can't update records in place as you would with a relational database
Design Principle: Favor composition over inheritance.
- schema on read: hive, pig, hadoop
- schema on write: predefined schema, e.g. RDBMS
Sharding on any categorical field may result in skewed partitions.
Instead, shard on randomly generated or unique values, e.g. UUIDs.
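To see the skew, compare shard assignment of a low-cardinality categorical key against random UUIDs (the shard count and sample data are made up):

```python
import uuid
from collections import Counter

N_SHARDS = 4

def shard(key: str) -> int:
    return hash(key) % N_SHARDS

# A categorical field with 3 values can only ever hit 3 shards,
# and a hot category ("US") skews the load further.
categories = ["US"] * 800 + ["EU"] * 150 + ["APAC"] * 50
skewed = Counter(shard(c) for c in categories)
print(len(skewed) <= 3)  # -> True: at least one shard gets nothing

# Unique ids spread the same 1000 rows roughly evenly.
even = Counter(shard(str(uuid.uuid4())) for _ in range(1000))
print(sorted(even.values()))
```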
IoT Analytics: distributed model scoring + centralised model building
Reactive Platform: VStack is a reactive platform in the sense that it uses an asynchronous, message-oriented architecture (which is the definition of "reactive").
Node.js: pure async, event-driven, non-blocking, based on event loop, single thread
nonblocking RPC server based on Netty
There's a difference between (A) locking (waiting, really) on access to a critical section (where you spinlock, yield your thread, etc.) and (B) locking the processor to safely execute a synchronization primitive (mutexes/semaphores).
Periodic updates + TTLs.
- requires work linear in the number of nodes and places the demand on a fixed number of servers.
- the failure detection window is at least as long as the TTL.
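The TTL approach in miniature (the TTL value is illustrative):

```python
import time

TTL = 5.0             # seconds; illustrative
last_seen: dict = {}  # node -> time of its last periodic update

def heartbeat(node: str, now: float) -> None:
    last_seen[node] = now

def alive_nodes(now: float) -> set:
    # A node is presumed dead once its TTL lapses, which is why the
    # failure detection window is at least one TTL long.
    return {n for n, t in last_seen.items() if now - t <= TTL}

t0 = time.time()
heartbeat("a", t0)
heartbeat("b", t0)
heartbeat("a", t0 + 4)              # only "a" refreshes its entry
print(sorted(alive_nodes(t0 + 6)))  # -> ['a'], since "b"'s TTL lapsed
```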
Ephemeral nodes, e.g. as built into ZooKeeper: K/V entries that are removed when a client disconnects.
- more sophisticated than a heartbeat system
- have scalability issues and add client-side complexity.
- All clients must maintain active connections to the ZooKeeper servers and perform keep-alives.
- requires "thick clients" which are difficult to write and often result in debugging challenges.
Consul uses a very different architecture for health checking. Instead of only having server nodes, Consul clients run on every node in the cluster. These clients are part of a gossip pool which serves several functions, including distributed health checking. The gossip protocol implements an efficient failure detector that can scale to clusters of any size without concentrating the work on any select group of servers. The clients also enable a much richer set of health checks to be run locally, whereas ZooKeeper ephemeral nodes are a very primitive check of liveness. With Consul, clients can check that a web server is returning 200 status codes, that memory utilization is not critical, that there is sufficient disk space, etc. The Consul clients expose a simple HTTP interface and avoid exposing the complexity of the system to clients in the same way as ZooKeeper.
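The kind of local checks such an agent runs can be sketched as simple predicates aggregated into one node health status (the thresholds and the memory stub are illustrative, not Consul's actual API):

```python
import shutil

def check_disk(path: str = "/", min_free_fraction: float = 0.05) -> bool:
    """Local check: is there sufficient disk space? (threshold illustrative)"""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total >= min_free_fraction

def check_memory_stub(used_fraction: float, critical: float = 0.95) -> bool:
    # Stand-in for a real memory probe: pass in the observed usage.
    return used_fraction < critical

def node_healthy(checks: list) -> bool:
    # The local agent aggregates all checks into one health status.
    return all(check() for check in checks)

print(node_healthy([lambda: check_disk("/"),
                    lambda: check_memory_stub(0.40)]))
```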