Distributed Systems - Overview
Things You Need to Know
- CAP Theorem
- Vertical scaling
- Horizontal scaling
- Consistent Hashing
- Load balancing
- Redundancy and Replication
- Data Partitioning
- NoSQL vs SQL
- Async / event loop
- Conflict resolution
- Fault tolerance
- Data retention: how long to store data
- Data persistence format
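The consistent-hashing bullet above can be sketched as a hash ring with virtual nodes; a minimal Python sketch (node names and the virtual-node count are illustrative):

```python
import bisect
import hashlib

class HashRing:
    """Consistent hash ring: keys map to the first node point at or
    after their hash, wrapping around. Adding/removing a node only
    remaps the keys that node owned."""

    def __init__(self, nodes, vnodes=100):
        self.ring = []                         # sorted (hash, node) points
        for node in nodes:
            for i in range(vnodes):            # virtual nodes smooth the load
                self.ring.append((self._hash(f"{node}#{i}"), node))
        self.ring.sort()
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def get_node(self, key):
        # first ring point at or after the key's hash (modulo for wrap-around)
        idx = bisect.bisect(self.keys, self._hash(key)) % len(self.keys)
        return self.ring[idx][1]

ring = HashRing(["node-a", "node-b", "node-c"])
owner = ring.get_node("user:42")
```

The key property: removing "node-a" leaves every key owned by "node-b" or "node-c" exactly where it was.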
throughput = concurrency / latency (Little's law: concurrency = throughput × latency)
Latency is the time a task spends in the system from arrival to completion; concurrency is the number of independent tasks in flight at any one time.
Performance optimization of individual tasks (reducing per-task processing time) is a separate subject and engineering concern.
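Assuming Little's law (concurrency = throughput × latency), throughput follows as concurrency divided by latency; a quick check with illustrative numbers:

```python
# Little's law: L = λ × W  →  throughput λ = concurrency L / latency W.
concurrency = 100      # requests in flight at once
latency_s = 0.050      # 50 ms per request, end to end
throughput = concurrency / latency_s
print(throughput)      # 2000.0 requests/second
```

Doubling concurrency or halving latency each doubles throughput, which is why the two are attacked separately.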
Two hard problems in distributed systems
- Guaranteed order of messages
- Exactly-once delivery
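Exactly-once delivery is typically approximated as at-least-once delivery plus an idempotent (deduplicating) consumer; a minimal sketch, with hypothetical message IDs:

```python
# At-least-once delivery + deduplication by message ID gives
# effectively-once *processing*, even when the broker redelivers.
seen = set()
results = []

def handle(msg_id, payload):
    if msg_id in seen:          # duplicate redelivery: drop it
        return
    seen.add(msg_id)
    results.append(payload)     # process exactly once per ID

handle("m1", "charge $5")
handle("m1", "charge $5")       # redelivered duplicate, ignored
handle("m2", "charge $7")
```

In a real system the `seen` set would live in durable storage and be updated in the same transaction as the processing step.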
Big data problems: 3V
- Volume: Hadoop, MapReduce
- Velocity: streaming frameworks
- Variety: schema on read, specialized tools (e.g. key-value stores)
Big data problems tend to fall into three categories: managing ever-increasing volumes of data, managing an increasing velocity of data, and dealing with a greater variety of data structures.
PostgreSQL and other relational solutions provide very good guarantees on the data because they enforce a lack of variety: you force a schema on write, and if it is violated, you get an error. Hadoop applies a schema on read, so you can store data first and then try to read it, and get a lot of null answers back because the data didn't fit your expectations.
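The schema-on-write vs schema-on-read contrast can be sketched as follows (field names are illustrative, not from any real system):

```python
# Two rows; the second is missing the "age" field.
rows = [{"name": "alice", "age": 30}, {"name": "bob"}]

# Schema-on-write: validate before storing; bad rows are rejected.
def insert(row, required=("name", "age")):
    missing = [f for f in required if f not in row]
    if missing:
        raise ValueError(f"missing fields: {missing}")
    return row

# Schema-on-read: store everything as-is, interpret at query time;
# absent fields come back as nulls.
ages = [row.get("age") for row in rows]   # [30, None]
```

The relational engine fails loudly at insert time; the schema-on-read path fails quietly at read time, which is the trade-off the paragraph above describes.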
System scalability vs Organizational scalability
- System scalability: Supporting an online service with hundreds of millions of registered users, handling millions of requests per second.
- Organizational scalability: Allowing hundreds or even thousands of software engineers to work on the system without excessive coordination overhead.
Streaming vs Batch
streaming: a type of data processing engine that is designed with infinite data sets in mind
- streaming data sets: unbounded data
- batch data sets: finite and bounded data.
- Use locks in multithreading
- Use a single thread, e.g. Node.js and Chrome (V8)
How to respond faster
- async (cheating): do the minimal amount of work on the backend, tell the user you are done, and put the rest in a queue
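A minimal sketch of the ack-then-enqueue pattern (function and payload names are hypothetical):

```python
# Respond fast by acknowledging immediately and deferring the heavy
# work to a queue that a background worker drains.
from queue import Queue

jobs = Queue()

def handle_request(payload):
    jobs.put(payload)                  # enqueue the expensive work
    return {"status": "accepted"}      # tell the user "done" right away

resp = handle_request("resize-image-123")

# A worker drains the queue later (done synchronously here for the sketch).
done = []
while not jobs.empty():
    done.append(jobs.get())
```

The user sees the latency of a queue push, not of the actual work; the cost is that "accepted" no longer means "completed".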
From Reddit: the key to speed is to precompute everything and cache it. Every listing has 15 different sort orders (hot, new, top, old, this week, etc.), and when someone submits a link they recalculate all the possible listings that link could affect. http://highscalability.com/blog/2010/5/17/7-lessons-learned-while-building-reddit-to-270-million-page.html
Internet Scale Services Checklist: https://gist.github.com/padajo/c4b5fde1b682aca43dc2a56366653b19
Shared-Nothing Architecture: https://en.wikipedia.org/wiki/Shared-nothing_architecture
Distributed Lock Manager: https://en.wikipedia.org/wiki/Distributed_lock_manager
Shadow Paging https://en.wikipedia.org/wiki/Shadow_paging
Hive (data warehouse) has two major sources of data: messaging systems (Kafka) and scraping (dumps from databases like MySQL).
IoT Analytics: distributed model scoring + centralised model building
Reactive Platform: VStack is a reactive platform in the sense that it uses an asynchronous, message-oriented architecture (which is the definition of "reactive").
Node.js: pure async, event-driven, non-blocking, based on event loop, single thread
nonblocking RPC server based on Netty
There's a difference between (A) locking (waiting, really) on access to a critical section (where you spinlock, yield your thread, etc.) and (B) locking the processor to safely execute a synchronization primitive (mutexes/semaphores).
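Case (A) above, waiting on access to a critical section, looks like this in a minimal Python sketch (thread and iteration counts are arbitrary):

```python
# Four threads increment a shared counter; the lock makes each
# read-modify-write a critical section, so no increments are lost.
import threading

counter = 0
lock = threading.Lock()

def worker():
    global counter
    for _ in range(10_000):
        with lock:            # blocks (waits) if another thread holds it
            counter += 1

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# counter == 40000; without the lock, concurrent += could drop updates
```

Case (B), locking the processor (atomic instructions, bus locking) is what the mutex itself is built on, one layer below this code.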
- http://blog.gainlo.co
- http://horicky.blogspot.com
- https://www.hiredintech.com/classrooms/system-design/lesson/52
- http://www.lecloud.net/tagged/scalability
- http://tutorials.jenkov.com/software-architecture/index.html
- http://highscalability.com/