Distributed Systems - Notes
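- SQL -> NoSQL, Data Warehouse -> Data Lake: think less about structure when putting data in, and more when pulling data out (schema-on-read rather than schema-on-write).
- Do you want it right? Then you need read-your-writes consistency. Do you want it right now? Then you are bounded by a fast (low-latency) SLA.
- DevOps replaces sysadmin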
Servers
A server is a long-lived piece of software that provides services; it can be viewed as a collection of running services. The most common kinds are request-handling services for HTTP and RPC.
Data Formats / Serialization
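- gRPC: RPC framework created by Google (not a data format itself; it uses Protobuf as its default serialization)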
- Protobuf: created and used by Google
- Thrift: created and used by Facebook
- Avro
- RCFile (Record Columnar File): Facebook
- ORC (Optimized Row Columnar): Hortonworks
- Parquet: Cloudera and Twitter
Parquet vs Arrow:
- Parquet: columnar format for data on disk
- Arrow: columnar format for data in memory
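Both formats are columnar, and the idea can be sketched without either library. The snippet below is only an illustration in plain Python: the records and field names are made up, and nothing here reflects the actual Parquet or Arrow APIs or their physical layouts.

```python
# Hypothetical records; the field names are made up for illustration.
rows = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5},
    {"user": "c", "clicks": 2},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: the aggregate walks every record, including unused fields.
total_from_rows = sum(record["clicks"] for record in rows)

# Columnar layout: the same aggregate reads a single list.
total_from_columns = sum(columns["clicks"])
```

Reading one column without touching the others is the property that makes columnar layouts attractive for analytics, whether the columns live on disk or in memory.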
Data Processing
Stream Processing vs Batch Processing
- Observe what is happening, and act on events as they occur (stream processing)
- Periodically crunch a large amount of accumulated data (batch processing)
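A minimal sketch of the two styles over the same data, assuming a hypothetical event log of (timestamp, event_type) pairs and a made-up alert threshold. Real systems would read from a queue or a file store rather than an in-memory list.

```python
from collections import Counter

# Hypothetical event log: (timestamp, event_type) pairs.
EVENTS = [(1, "login"), (2, "error"), (3, "login"), (4, "error"), (5, "error")]


def stream_process(events, alert_after=2):
    """React to each event as it arrives; alert once errors hit a threshold."""
    errors = 0
    for timestamp, kind in events:
        if kind == "error":
            errors += 1
            if errors == alert_after:
                yield f"alert at t={timestamp}"


def batch_process(events):
    """Crunch the whole accumulated data set in one pass."""
    return Counter(kind for _, kind in events)


alerts = list(stream_process(EVENTS))
summary = batch_process(EVENTS)
```

The stream version can act at t=4, as soon as the second error shows up; the batch version only produces its counts after all the data has been collected.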
Batch Data Processing
Integration Patterns
- API: contract driven
- Event Driven
- Data Stream Driven
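The first two patterns can be contrasted with a small in-process sketch. `EventBus`, `get_order_total`, and the "order_placed" topic are hypothetical names invented for this example; a real event-driven integration would normally go through a message broker.

```python
from collections import defaultdict


class EventBus:
    """Tiny in-process publish/subscribe bus (illustrative only)."""

    def __init__(self):
        self._handlers = defaultdict(list)

    def subscribe(self, topic, handler):
        self._handlers[topic].append(handler)

    def publish(self, topic, payload):
        # The publisher does not know who, if anyone, is listening.
        for handler in self._handlers[topic]:
            handler(payload)


# API style: the caller invokes a known function with an agreed signature.
def get_order_total(order):
    return sum(item["price"] * item["qty"] for item in order["items"])


# Event-driven style: consumers react to an "order_placed" event.
bus = EventBus()
audit_log = []
bus.subscribe("order_placed", lambda order: audit_log.append(order["id"]))

order = {"id": 42, "items": [{"price": 5, "qty": 2}]}
total = get_order_total(order)
bus.publish("order_placed", order)
```

With the API call, the caller depends on the contract and waits for the answer. With the event, new consumers can be added without changing the publisher.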
State
- Session state: kept across interactions while things are running. Stateful sessions remember things; stateless sessions remember nothing from one request to the next.
- Durable state: kept across failures; it is still remembered when you come back later.
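The difference can be sketched with two counters, one held only in memory and one written to a file. The class names and the JSON file are assumptions made for this example, and a production durable store would also need atomic writes and fsync, which are left out here.

```python
import json
import os
import tempfile


class SessionCounter:
    """Session state: lives only in memory, lost when the process ends."""

    def __init__(self):
        self.count = 0

    def increment(self):
        self.count += 1
        return self.count


class DurableCounter:
    """Durable state: written to disk so it survives a restart."""

    def __init__(self, path):
        self.path = path

    def _load(self):
        if not os.path.exists(self.path):
            return 0
        with open(self.path) as f:
            return json.load(f)["count"]

    def increment(self):
        count = self._load() + 1
        with open(self.path, "w") as f:
            json.dump({"count": count}, f)
        return count


def simulate_restart():
    path = os.path.join(tempfile.mkdtemp(), "counter.json")
    session, durable = SessionCounter(), DurableCounter(path)
    session.increment()
    durable.increment()
    # "Crash": discard the in-memory objects and build fresh ones.
    session, durable = SessionCounter(), DurableCounter(path)
    return session.increment(), durable.increment()
```

After the simulated restart the session counter starts again from zero, while the durable counter continues from the value it had saved.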
Others
Design Principle: Favor composition over inheritance.
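A small sketch of the principle, using made-up `Logger` and formatter classes: the logger holds a formatter object instead of inheriting from one, so behaviour can be swapped without adding subclasses.

```python
class PlainFormatter:
    def format(self, record):
        return record


class JsonFormatter:
    def format(self, record):
        # Naive formatting for illustration; no escaping is done.
        return '{"msg": "%s"}' % record


class Logger:
    """Has a formatter (composition) rather than being one (inheritance)."""

    def __init__(self, formatter):
        self.formatter = formatter
        self.lines = []

    def log(self, record):
        self.lines.append(self.formatter.format(record))


plain_logger = Logger(PlainFormatter())
plain_logger.log("hi")

json_logger = Logger(JsonFormatter())
json_logger.log("hi")
```

With inheritance, each combination of logger and format would need its own subclass; with composition, any object that has a `format` method can be plugged in.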
IoT Analytics: distributed model scoring + centralised model building
Reactive Platform: VStack is a reactive platform in the sense that it uses an asynchronous, message-oriented architecture (being message driven is one of the defining traits of "reactive", alongside responsive, resilient, and elastic).
Node.js: purely asynchronous, event-driven, and non-blocking; application code runs on a single-threaded event loop
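The same model can be sketched in Python with `asyncio`, which also runs coroutines on a single-threaded event loop. This is an analogue for illustration, not Node.js code; the task names and delays are made up.

```python
import asyncio
import threading


async def handle(name, delay, log):
    # Awaiting yields control to the event loop instead of blocking the thread.
    await asyncio.sleep(delay)
    log.append((name, threading.get_ident()))


async def main():
    log = []
    # The slower task is started first; the faster one still finishes first.
    await asyncio.gather(handle("slow", 0.05, log), handle("fast", 0.01, log))
    return log


log = asyncio.run(main())
finish_order = [name for name, _ in log]
thread_ids = {thread_id for _, thread_id in log}
```

Both tasks record the same thread id, and the short one completes while the long one is still waiting: concurrency comes from the event loop interleaving work, not from extra threads.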
Non-blocking RPC server based on Netty
There is a difference between (A) locking on access to a critical section, which is really waiting (you spin, yield your thread, etc.), and (B) locking the processor so that the synchronization primitive itself (mutex, semaphore) can be executed safely.
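Sense (A) is the one visible from application code. A minimal sketch with `threading.Lock`, assuming a made-up shared counter; sense (B) happens inside the lock implementation and cannot be observed from Python.

```python
import threading


def run_counter(num_threads=4, increments=10_000):
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            # (A) Wait here until the critical section is free.
            with lock:
                counter += 1  # critical section

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Calling `run_counter()` returns `num_threads * increments`, because every read-modify-write of `counter` happens while the lock is held.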