System Design Patterns
Key factors to consider when designing distributed systems:
- CAP Theorem: pick 2 out of these 3
- Partition tolerant
- latency: the time it takes to start processing a task
- concurrency: the number of independent tasks that can be running at any one time.
- throughput: throughput = latency + concurrency. Performance optimization of individual tasks (task processing time) is a separate subject and engineering concern.
- how to resolve conflicts
- fault tolerant
- data retention: how long to store
- data persistent format
Hard problems in distributed systems
- Guaranteed order of messages
- Exactly-once delivery
- network failures, bandwidth limitations, variable latency connections, security concerns, and anything else that can go wrong in a networked environment, even across multiple data centers.
They added a sidecar as a local proxy for a service. A sidecar is essentially just another containerized application that is run alongside the main application on the EC2 node. The benefit of using sidecars (as opposed to libraries) is that it’s technology agnostic.
Particularly useful with Kubernetes: K8s has pods, each pod has one or more application containers. A sidecar is a utility container in the pod, to support the main container by enhancing the functionalities without strong coupling.
Ingress Proxy sits between internet (external) and intranet (internal), for reverse proxy and TLS(SSL) termination; mapping URI to Load Balancer target. Sidecar Proxy to check abuse / billing / API activation / quota to decide if the request is allowed.
Read More about Proxy
Load Balancer: a dispatcher determines which worker instance will handle a request based on different policies.
Load Balancer as Service Discovery Service (SDS), mapping LB target to set of hosts that serve that target(i.e. Service Discovery)
- sticky sessions: load balancer insert some random number to cookie so subsequent visits will be directed to the same web server
- SSL termination in load balancer, everything inside is unencrypted(use port 80), which lessens the CPU load on nginx.
- L7: Load Balancing at the HTTP level
- L3/L4: Load Balancing at the network (TCP) level
- DNS load balancing; (e.g. round robin)
- Manually load balancing with multiple subdomains;
Fail Open vs Fail Close
If service X depends on service Y for certain policy decisions, and service Y is unavailable ...
- Fail Open: X continues to process requests assuming Y's decision is to proceed. e.g. quota
- Fail Close: X rejects requests assuming Y's decision is not to proceed. e.g. security
Big data problems: 3V
- volume: hadoop, mapreduce
- velocity: streaming frameworks
- variety: schema on read, specialized tools(KV pair)
Big data problems tend to fall into three categories; namely, managing ever-increasing volumes of data, managing an increasing velocity of data, and dealing with a greater variety of data structure.
PostgreSQL and other relational data solutions provide very good guarantees on the data because they enforce a lack of variety.You force a schema on write and if that is violated, you throw an error. Hadoop enforces a schema on read, and so you can store data and then try to read it, and get a lot of null answers back because the data didn't fit your expectations.
The key to speed is to precompute everything and cache it.
For example, Reddit precomputes all 15 different sort orders (hot, new, top, old, this week. etc) for listings when someone submits a link. Every listing has 15 different sort orders (hot, new, top, old, this week). When someone submits a link they recalculate all the possible listing that link could effect. http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html
- use Cache to store the most frequently used data
- Pre-generate static HTML files
- Minify and compress
- use CDN to serve pre-computed pages and results
System scalability vs Organizational scalability
- System scalability: Supporting an online service with hundreds of millions of registered users, handling millions of requests per second.
- Organizational scalability: Allowing hundreds or even thousands of software engineers to work on the system without excessive coordination overhead.
Streaming vs Batching
Streaming: a type of data processing engine that is designed with infinite data sets in mind.
- streaming data sets: unbounded data
- batch data sets: finite and bounded data.
Use a streaming solution like Spark Streaming: store counts in an RDD that can be incremented in a reduce process.
a distribution scheme that does not depend directly on the number of servers
Immutability is the backbone of big data.
Designs are driving toward immutability, which is needed to coordinate at ever increasing distances. Given space to store data for a long time, immutability is affordable. Versioning provides a changing view, while the underlying data is expressed with new contents bound to a unique identifier.
- Copy-on-Write. Many emerging systems leverage COW semantics to provide a façade of change while writing immutable files to an underlying store. In turn, the underlying store offers robustness and scalability because it is storing immutable files. For example, many key-value systems are implemented with LSM trees (e.g., HBase, BigTable, and LevelDB).
- Clean Replication. When data is immutable and has a unique identifier, many challenges with replication are eased. There's never a worry about finding a stale version of the data because no stale versions exist. Consequently, the replication system may be more fluid and less picky about where it allows a replica to land. There are also fewer replication bugs.
- Immutable Data Sets. Immutable data sets can be combined by reference with transactional database data and offer clean semantics when the data sets project relational schema and tables. Looking at the semantics projected by an immutable data set, you can create a new version of it optimized for a different usage pattern but still projecting the same semantics. Projections, redundant copies, denormalization, indexing, and column stores are all examples of optimizing immutable data while preserving its semantics.
- Parallelism and Fault Tolerance. Immutability and functional computation are keys to implementing big data.
HDFS is read only: can’t update records in place as you would with a relational database
To forward HTTP requests to upstream application components, the frontend needs to provide termination of HTTP, HTTPS, and HTTP/2 connections, along with “shock-absorber” protection and routing. It also needs to offer basic logging and first-line access control, implement global security rules, and offload HTTP heavy lifting (caching, compression) to optimize the performance of the upstream application.
- a simple buffer mechanism with Redis: if the write to Postgres failed we could retry later since the trip had been stored in Redis in the interim. But while in Redis, the trip could not be read from Postgres.
Problems with EventBus:
- All subscribers need to be instantiated at app start, significantly impacting app start latency.
- hard to test, events coming in unexpected orders
- good to write loosely coupled features but it makes the dependencies less clear and harder to untangle.
Lambda Architecture and Kappa Architecture
- Lambda Architecture: run a streaming system alongside a batch system, both performing essentially the same calculation. The streaming system gives you low-latency, inaccurate results (either because of the use of an approximation algorithm, or because the streaming system itself does not provide correctness), and some time later a batch system rolls along and provides you with correct output.
- Kappa Architecture: run a single pipeline using a well-designed system that’s appropriately built for the job at hand.
Schema-On-Read vs Schema-On-Write
- schema on read: hive, pig, hadoop
- schema on write: predefined schema, e.g. RDBMS
Using any categorical field for sharding may result in skewed partitions.
Instead, use some random generated or unique numbers, e.g. UUID
One common problem with all the distributed systems is how to determine which servers are alive and operating at any given point of time.
Periodic updates + TTLs.
- require work linear to the number of nodes and place the demand on a fixed number of servers.
- the failure detection window is at least as long as the TTL.
E.g. built in ZooKeeper. K/V entries that are removed when a client disconnects.
- more sophisticated than a heartbeat system
- have scalability issues and add client-side complexity.
- All clients must maintain active connections to the ZooKeeper servers and perform keep-alives.
- requires "thick clients" which are difficult to write and often result in debugging challenges.
Agents on Clients and Servers
Consul uses a very different architecture for health checking. Instead of only having server nodes, Consul clients run on every node in the cluster. These clients are part of a gossip pool which serves several functions, including distributed health checking. The gossip protocol implements an efficient failure detector that can scale to clusters of any size without concentrating the work on any select group of servers. The clients also enable a much richer set of health checks to be run locally, whereas ZooKeeper ephemeral nodes are a very primitive check of liveness. With Consul, clients can check that a web server is returning 200 status codes, that memory utilization is not critical, that there is sufficient disk space, etc. The Consul clients expose a simple HTTP interface and avoid exposing the complexity of the system to clients in the same way as ZooKeeper.
PRPL is a web site architecture developed by Google for building websites and apps that work exceptionally well on smartphones and other devices with unreliable network connections.
PRPL stands for:
- Push critical resources for the initial URL route using
- Render initial route.
- Pre-cache remaining routes.
- Lazy-load and create remaining routes on demand.
- Scatter and Gather - a dispatcher multicasts requests to all workers in a pool. Each worker will compute a local result and send it back to the dispatcher, who will consolidate them into a single response and then send back to the client.
- Shared Space - all workers monitors information from the shared space and contributes partial knowledge back to the blackboard. The information is continuously enriched until a solution is reached.
- Pipe and Filter - all workers connected by pipes across which data flows.
- MapReduce - targets batch jobs where disk I/O is the major bottleneck. It use a distributed file system so that disk I/O can be done in parallel.
- Bulk Synchronous Parallel - a lock-step execution across all workers, coordinated by a master.
- Execution Orchestrator - an intelligent scheduler / orchestrator schedules ready-to-run tasks (based on a dependency graph) across a clusters of dumb workers.
- Idempotent: making it possible to fail and restart.
- LSF (log-structured file system)
- LSM (log-structured merge-tree)
- consistent hashing
- vector clocks
- anti-entropy based recovery
- probabilistic data structure
- 2 phase commit
- write ahead log:https://en.wikipedia.org/wiki/Write-ahead_logging writes are serialized to disk before they are applied to the in-memory database.
- exactly once delivery
- Log-structured storage
- Async: (cheating) do the minimal amount of work on the backend and tell the user you are done. Put it in a queue
- Vertical scaling
- Horizontal scaling
- Consistent Hashing
- Load balancing
- Redundancy and Replication
- Data Partitioning
- NoSQL vs SQL
- Async / event loop