System Design Patterns

Last Updated: 2023-02-17

Key factors to consider when designing distributed systems:

  • CAP Theorem: you can only guarantee two of the three; in practice, a network partition forces a choice between consistency and availability
    • Consistency
    • Availability
    • Partition tolerance
  • latency: the time it takes to process a single task, from start to finish
  • concurrency: the number of independent tasks that can be in flight at any one time
  • throughput: tasks completed per unit time; roughly, throughput = concurrency / latency (see the sketch after this list). Performance optimization of individual tasks (task processing time) is a separate engineering concern.
  • conflict resolution: how to resolve conflicting writes
  • fault tolerance
  • data retention: how long to store data
  • data persistence format
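
A minimal sketch of the throughput relation above (essentially Little's law), assuming latency means per-task processing time in seconds; the function name is illustrative:

```python
def throughput(concurrency: int, latency_s: float) -> float:
    """Tasks completed per second when `concurrency` independent tasks
    are in flight and each one takes `latency_s` seconds."""
    return concurrency / latency_s

# 8 workers, 200 ms per task -> 40 tasks/second
print(throughput(concurrency=8, latency_s=0.2))
```

Halving latency or doubling concurrency both double throughput; optimizing a single task's processing time is how you attack the latency term.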

Hard problems in distributed systems

  • Guaranteed ordering of messages
  • Exactly-once delivery (see the sketch after this list)
  • Network failures, bandwidth limitations, variable-latency connections, security concerns, and anything else that can go wrong in a networked environment, even across multiple data centers.
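
True exactly-once delivery is impossible over an unreliable network; what systems actually ship is at-least-once delivery plus idempotent processing. A minimal sketch of that workaround, assuming each message carries a unique `message_id` (the names here are hypothetical):

```python
processed_ids: set[str] = set()  # production systems use a durable store

def consume(message_id: str, payload: dict, handle) -> None:
    """At-least-once delivery + deduplication = effectively-once processing."""
    if message_id in processed_ids:
        return                        # duplicate delivery: drop it
    handle(payload)
    processed_ids.add(message_id)     # record only after a successful handle
```

A crash between `handle` and the record step causes a reprocess, which is exactly why `handle` itself must be idempotent.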

Sidecar Pattern

A sidecar can act as a local proxy for a service. It is essentially just another containerized application that runs alongside the main application on the same host (e.g. an EC2 node). The benefit of using sidecars (as opposed to libraries) is that they are technology agnostic.

The pattern is particularly useful with Kubernetes: K8s has pods, and each pod has one or more application containers. A sidecar is a utility container in the pod that supports the main container, extending its functionality without tight coupling.

Proxy

An Ingress Proxy sits between the internet (external) and the intranet (internal); it acts as a reverse proxy, performs TLS (SSL) termination, and maps each URI to a load-balancer target. A Sidecar Proxy checks abuse, billing, API activation, and quota to decide whether the request is allowed.

Read More about Proxy

Load Balancing

Load Balancer: a dispatcher that determines which worker instance will handle a request, based on different policies.

A load balancer can also serve as a Service Discovery Service (SDS), mapping each LB target to the set of hosts that serve that target (i.e. service discovery).

  • sticky sessions: the load balancer inserts a random identifier into a cookie so that subsequent visits are directed to the same web server (see the sketch after this list)
  • SSL termination at the load balancer: everything inside the network is unencrypted (plain HTTP on port 80), which lessens the CPU load on the upstream servers (e.g. nginx).
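
A minimal sketch of the dispatcher side of both ideas: round-robin assignment for new clients, with a cookie pinning repeat visitors to their backend. Backend names and the cookie key are illustrative:

```python
import itertools

backends = ["web1:8080", "web2:8080", "web3:8080"]
rr = itertools.cycle(backends)   # round-robin policy over the worker pool

def route(cookies: dict) -> tuple[str, dict]:
    """Return (backend, cookies_to_set) for one incoming request."""
    if cookies.get("backend") in backends:
        return cookies["backend"], {}        # sticky: reuse the pinned backend
    backend = next(rr)                       # new session: next worker in rotation
    return backend, {"backend": backend}     # pin it in a cookie for next time
```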

Different layers

  • L7: Load Balancing at the HTTP level
  • L3/L4: Load Balancing at the network (TCP) level
  • DNS load balancing (e.g. round robin)
  • Manual load balancing with multiple subdomains
  • Anycast.

Fail Open vs Fail Close

If service X depends on service Y for certain policy decisions, and service Y is unavailable ...

  • Fail Open: X continues to process requests assuming Y's decision is to proceed. e.g. quota
  • Fail Close: X rejects requests assuming Y's decision is not to proceed. e.g. security
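
A minimal sketch of the two policies, assuming Y is reached over the network and an outage surfaces as a `ConnectionError`; `check_quota` and `check_auth` are hypothetical policy calls:

```python
def allow(request, policy_check, fail_open: bool) -> bool:
    """Ask service Y for a policy decision; fall back when Y is down."""
    try:
        return policy_check(request)
    except ConnectionError:
        # Y is unavailable: quota-style checks fail open (keep serving),
        # security-style checks fail closed (reject everything).
        return fail_open

# allow(request, check_quota, fail_open=True)   # availability over strictness
# allow(request, check_auth,  fail_open=False)  # strictness over availability
```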

Big data problems: 3V

  • volume: Hadoop, MapReduce
  • velocity: streaming frameworks
  • variety: schema on read, specialized tools (e.g. key-value stores)

Big data problems tend to fall into three categories; namely, managing ever-increasing volumes of data, managing an increasing velocity of data, and dealing with a greater variety of data structure.

PostgreSQL and other relational data solutions provide very good guarantees on the data because they enforce a lack of variety. You force a schema on write, and if it is violated, you throw an error. Hadoop enforces a schema on read, so you can store data and then try to read it, and get a lot of null answers back because the data didn't fit your expectations.
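
A minimal sketch of the contrast, with illustrative field names: schema-on-write rejects nonconforming data at insert time, while schema-on-read accepts anything and yields nulls (None) when the read-time schema doesn't fit:

```python
def write_with_schema(row: dict, required: set) -> dict:
    """Schema on write: reject the row before it is ever stored."""
    missing = required - row.keys()
    if missing:
        raise ValueError(f"schema violation, missing fields: {missing}")
    return row

def read_with_schema(raw: dict, expected: list) -> dict:
    """Schema on read: store anything, apply the schema when reading."""
    return {field: raw.get(field) for field in expected}  # missing -> None

write_with_schema({"id": 1, "name": "a"}, required={"id", "name"})   # ok
print(read_with_schema({"id": 1}, expected=["id", "name"]))
# {'id': 1, 'name': None}  <- the "null answers" described above
```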

Pre-compute

The key to speed is to precompute everything and cache it.

For example, every Reddit listing has 15 different sort orders (hot, new, top, old, this week, etc.). When someone submits a link, Reddit recalculates all the listings that link could affect, so every sort order is precomputed before it is ever requested. http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html
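
A minimal sketch of that precompute-on-write idea; the data structures and scoring functions are illustrative stand-ins, not Reddit's:

```python
from collections import defaultdict

SORTS = {"hot": lambda p: p["score"] / (1 + p["age_hours"]),
         "new": lambda p: -p["age_hours"],
         "top": lambda p: p["score"]}

posts = defaultdict(list)      # source of truth: subreddit -> posts
listings = defaultdict(dict)   # cache: subreddit -> sort name -> ordered posts

def submit(subreddit: str, post: dict) -> None:
    posts[subreddit].append(post)
    # Pay the sorting cost once, at write time, for every affected listing.
    for name, key in SORTS.items():
        listings[subreddit][name] = sorted(posts[subreddit], key=key, reverse=True)

def read(subreddit: str, sort: str) -> list:
    return listings[subreddit][sort]   # a pure cache hit: no sorting on read
```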

  • use a cache to store the most frequently used data
  • pre-generate static HTML files
  • minify and compress assets
  • use a CDN to serve pre-computed pages and results

System scalability vs Organizational scalability

  • System scalability: Supporting an online service with hundreds of millions of registered users, handling millions of requests per second.
  • Organizational scalability: Allowing hundreds or even thousands of software engineers to work on the system without excessive coordination overhead.

Streaming vs Batching

Streaming: a type of data processing engine that is designed with infinite data sets in mind.

  • streaming data sets: unbounded data
  • batch data sets: finite and bounded data.

For example, use a streaming solution like Spark Streaming: keep running counts as state that is updated by a reduce step as each micro-batch arrives.
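
The same incremental-state idea in plain Python (a sketch of the concept, not any framework's API): maintain running counts over an unbounded iterator instead of waiting for a finite batch to finish:

```python
from collections import Counter
from typing import Iterable, Iterator

def running_counts(events: Iterable[str]) -> Iterator[Counter]:
    counts: Counter = Counter()
    for event in events:       # the stream never has to end
        counts[event] += 1     # incremental state update per event
        yield counts           # results are available continuously

for snapshot in running_counts(iter(["login", "click", "login"])):
    print(snapshot)            # counts grow after every event
```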

Consistent Hashing

A distribution scheme that does not depend directly on the number of servers, so that adding or removing a server only remaps a small fraction of the keys.
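
A minimal hash ring, sketched to show the mechanism: servers and keys hash onto the same ring, each key goes to the next server point clockwise, and virtual nodes smooth out the load. MD5 here is for distribution only, not security:

```python
import bisect
import hashlib

def _hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, servers, vnodes: int = 100):
        # Each server owns many points ("virtual nodes") on the ring.
        self.ring = sorted((_hash(f"{s}#{i}"), s)
                           for s in servers for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def lookup(self, key: str) -> str:
        # The key belongs to the first server point at or after its hash,
        # wrapping around the end of the ring.
        i = bisect.bisect(self.points, _hash(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing(["a", "b", "c"])
print(ring.lookup("user:42"))   # stable even as servers join or leave
```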

Immutability

Immutability is the backbone of big data.

Designs are driving toward immutability, which is needed to coordinate at ever increasing distances. Given space to store data for a long time, immutability is affordable. Versioning provides a changing view, while the underlying data is expressed with new contents bound to a unique identifier.

  • Copy-on-Write. Many emerging systems leverage COW semantics to provide a façade of change while writing immutable files to an underlying store. In turn, the underlying store offers robustness and scalability because it is storing immutable files. For example, many key-value systems are implemented with LSM trees (e.g., HBase, BigTable, and LevelDB).
  • Clean Replication. When data is immutable and has a unique identifier, many challenges with replication are eased. There's never a worry about finding a stale version of the data because no stale versions exist. Consequently, the replication system may be more fluid and less picky about where it allows a replica to land. There are also fewer replication bugs.
  • Immutable Data Sets. Immutable data sets can be combined by reference with transactional database data and offer clean semantics when the data sets project relational schema and tables. Looking at the semantics projected by an immutable data set, you can create a new version of it optimized for a different usage pattern but still projecting the same semantics. Projections, redundant copies, denormalization, indexing, and column stores are all examples of optimizing immutable data while preserving its semantics.
  • Parallelism and Fault Tolerance. Immutability and functional computation are keys to implementing big data.

HDFS files are immutable: you can't update records in place as you would with a relational database.
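
A minimal sketch of the copy-on-write façade described above: "updates" append new immutable versions bound to a unique (key, version) identifier, so readers see a changing view while no stored entry is ever modified:

```python
log = []        # append-only store: (key, version, value) triples
versions = {}   # latest version number per key

def put(key, value) -> int:
    version = versions.get(key, 0) + 1
    log.append((key, version, value))   # never overwrite, only append
    versions[key] = version
    return version

def get(key, version=None):
    v = versions[key] if version is None else version
    return next(val for k, ver, val in reversed(log)
                if k == key and ver == v)

put("user:1", {"name": "a"})
put("user:1", {"name": "b"})            # v2; v1 remains readable forever
print(get("user:1"), get("user:1", version=1))
```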

HTTPS Termination

To forward HTTP requests to upstream application components, the frontend needs to provide termination of HTTP, HTTPS, and HTTP/2 connections, along with “shock-absorber” protection and routing. It also needs to offer basic logging and first-line access control, implement global security rules, and offload HTTP heavy lifting (caching, compression) to optimize the performance of the upstream application.

Write Availability

  • a simple buffer mechanism with Redis: if the write to Postgres fails, it can be retried later, since the trip has been stored in Redis in the interim. The trade-off: while a trip exists only in Redis, it cannot be read from Postgres.
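
A minimal sketch of that buffering scheme, with a plain list standing in for Redis and a hypothetical `save_to_postgres` call:

```python
buffer = []   # stand-in for the Redis buffer

def write(trip: dict, save_to_postgres) -> None:
    try:
        save_to_postgres(trip)
    except ConnectionError:
        buffer.append(trip)           # the write survives, but is not yet readable

def flush(save_to_postgres) -> None:
    """Retry buffered writes; call periodically from a background job."""
    while buffer:
        save_to_postgres(buffer[0])   # raises again if Postgres is still down
        buffer.pop(0)                 # remove only after a confirmed write
```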

EventBus

https://github.com/google/guava/wiki/EventBusExplained

Problems with EventBus:

  • All subscribers need to be instantiated at app start, significantly impacting app start latency.
  • Hard to test: events can arrive in unexpected orders.
  • Good for writing loosely coupled features, but it makes the dependencies less clear and harder to untangle.
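
A minimal in-process event bus (a Python sketch of the pattern, not Guava's API), showing both sides of that trade-off: publisher and subscriber never reference each other, which is also why the dependency graph disappears from view:

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable

class EventBus:
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type: type, handler: Callable) -> None:
        self._subscribers[event_type].append(handler)

    def post(self, event) -> None:
        for handler in self._subscribers[type(event)]:
            handler(event)   # delivery order across features is implicit

@dataclass
class UserCreated:
    name: str

bus = EventBus()
bus.subscribe(UserCreated, lambda e: print("send welcome email to", e.name))
bus.post(UserCreated("ada"))   # the poster knows nothing about the handler
```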

Lambda Architecture and Kappa Architecture

  • Lambda Architecture: run a streaming system alongside a batch system, both performing essentially the same calculation. The streaming system gives you low-latency, inaccurate results (either because of the use of an approximation algorithm, or because the streaming system itself does not provide correctness), and some time later a batch system rolls along and provides you with correct output.
  • Kappa Architecture: run a single pipeline using a well-designed system that’s appropriately built for the job at hand.

Schema-On-Read vs Schema-On-Write

  • schema on read: Hive, Pig, Hadoop
  • schema on write: predefined schema, e.g. RDBMS

Sharding

Using any categorical field for sharding may result in skewed partitions.

Instead, use randomly generated or unique values, e.g. UUIDs.
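
A minimal demonstration of the skew, with an illustrative country-code key: every row sharing a category lands on the same shard, while unique IDs spread roughly evenly:

```python
import uuid
from collections import Counter

NUM_SHARDS = 4

def shard(key: str) -> int:
    return hash(key) % NUM_SHARDS

countries = ["US"] * 80 + ["DE"] * 15 + ["NZ"] * 5     # categorical key
print(Counter(shard(c) for c in countries))             # a few hot shards

ids = [str(uuid.uuid4()) for _ in range(100)]           # unique key
print(Counter(shard(i) for i in ids))                   # roughly even
```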

Health Check

A common problem in all distributed systems is determining which servers are alive and operating at any given point in time.

Heartbeating

Periodic updates + TTLs.

  • requires work linear in the number of nodes and places the demand on a fixed number of servers
  • the failure-detection window is at least as long as the TTL
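
A minimal sketch of heartbeating with TTLs: each node periodically refreshes its entry, and a node whose entry has expired is considered dead, which makes the TTL the floor on detection time:

```python
import time

TTL_S = 10.0
last_beat = {}   # node -> time of its last heartbeat

def heartbeat(node: str) -> None:
    last_beat[node] = time.monotonic()   # periodic update from each node

def alive_nodes() -> set:
    now = time.monotonic()
    return {n for n, t in last_beat.items() if now - t < TTL_S}

heartbeat("web1")
print(alive_nodes())   # {'web1'}, until TTL_S seconds pass without a beat
```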

Ephemeral Nodes

Built into ZooKeeper, for example: K/V entries that are removed when a client disconnects.

  • more sophisticated than a heartbeat system
  • has scalability issues and adds client-side complexity
  • all clients must maintain active connections to the ZooKeeper servers and perform keep-alives
  • requires "thick clients", which are difficult to write and often cause debugging challenges

Agents on Clients and Servers

Consul uses a very different architecture for health checking. Instead of only having server nodes, Consul clients run on every node in the cluster. These clients are part of a gossip pool which serves several functions, including distributed health checking. The gossip protocol implements an efficient failure detector that can scale to clusters of any size without concentrating the work on any select group of servers. The clients also enable a much richer set of health checks to be run locally, whereas ZooKeeper ephemeral nodes are a very primitive check of liveness. With Consul, clients can check that a web server is returning 200 status codes, that memory utilization is not critical, that there is sufficient disk space, etc. The Consul clients expose a simple HTTP interface and avoid exposing the complexity of the system to clients in the same way as ZooKeeper.

JAMStack

https://jamstack.org/

PRPL Pattern

PRPL is a web site architecture developed by Google for building websites and apps that work exceptionally well on smartphones and other devices with unreliable network connections.

https://developers.google.com/web/fundamentals/performance/prpl-pattern/

PRPL stands for:

  • Push critical resources for the initial URL route using <link rel="preload"> and HTTP/2.
  • Render initial route.
  • Pre-cache remaining routes.
  • Lazy-load and create remaining routes on demand.

Modular Programming

https://en.wikipedia.org/wiki/Modular_programming

Read More

http://horicky.blogspot.com/2010/10/scalable-system-design-patterns.html

  • Scatter and Gather - a dispatcher multicasts requests to all workers in a pool. Each worker computes a local result and sends it back to the dispatcher, which consolidates them into a single response and sends it back to the client (sketched at the end of this list).
  • Shared Space - all workers monitor information from the shared space and contribute partial knowledge back to the blackboard. The information is continuously enriched until a solution is reached.
  • Pipe and Filter - all workers are connected by pipes across which data flows.
  • MapReduce - targets batch jobs where disk I/O is the major bottleneck. It uses a distributed file system so that disk I/O can be done in parallel.
  • Bulk Synchronous Parallel - a lock-step execution across all workers, coordinated by a master.
  • Execution Orchestrator - an intelligent scheduler / orchestrator schedules ready-to-run tasks (based on a dependency graph) across a cluster of dumb workers.
  • Idempotence: makes it possible to fail and restart safely.
  • LFS (log-structured file system)
  • LSM (log-structured merge-tree)
  • DHT
  • consistent hashing
  • versioning
  • vector clocks
  • quorum
  • anti-entropy based recovery
  • probabilistic data structure
  • 2 phase commit
  • write-ahead log (https://en.wikipedia.org/wiki/Write-ahead_logging): writes are serialized to disk before they are applied to the in-memory database
  • exactly once delivery
  • Log-structured storage
  • Async ("cheating"): do the minimal amount of work on the backend, tell the user you are done, and put the rest in a queue
  • Vertical scaling
  • Horizontal scaling
  • Load balancing
  • Redundancy and Replication
  • Data Partitioning
  • NoSQL vs SQL
  • Async / event loop
  • Consensus
  • Indexes
  • Proxies
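
A minimal scatter-gather sketch, referenced from the first bullet in the list above; the worker callables are illustrative stand-ins for network calls to a worker pool:

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(request: str, workers: list) -> list:
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        futures = [pool.submit(w, request) for w in workers]   # scatter
        return [f.result() for f in futures]                   # gather

# Each worker computes a local result for the same request.
workers = [lambda q, i=i: f"shard{i}: results for {q!r}" for i in range(3)]
print(scatter_gather("search term", workers))
```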