Cloud / Distributed Systems - Overview
Working on any non-trivial project at any non-ancient software company requires some knowledge of distributed systems. Seriously, with data at today's scale, everything is distributed. This is an attempt to create a mind map to help you navigate.
It's all about data:
- Store data so that the same application, or another one, can find it again later (databases)
- Remember the result of an expensive operation, to speed up reads (caches)
- Allow users to search data by keyword or filter it in various ways (search indexes)
- Send a message to another process, to be handled asynchronously (message queues)
- Observe what is happening, and act on events as they occur (stream processing)
- Periodically crunch a large amount of accumulated data (batch processing)
"Cloud is the new OS"
- Amazon AWS
- Microsoft Azure
- Google Cloud Platform (GCP)
- GCP vs AWS: https://cloud.google.com/free/docs/map-aws-google-cloud-platform
- GCP vs Azure: https://cloud.google.com/free/docs/map-azure-google-cloud-platform
Or generate static sites and deploy on services like S3
- Protobuf: created and used by Google
- Thrift: created and used by Facebook
- RCFile (Record Columnar File): Facebook
- ORC (Optimized Row Columnar): Hortonworks
- Parquet: Cloudera and Twitter
Parquet vs Arrow:
- Parquet: on disk
- Arrow: in memory
Git or hg (Mercurial, with better support for larger repos)
- Facebook: hg
- Google: stores all of its roughly 2 billion lines of code in a single repo
- Microsoft: git
Virtual machine, container, serverless
The related job titles may be: DevOps/Production Engineer/Site Reliability Engineer
Offload networking, storage, and management to dedicated hardware, so the host CPU can be used for more important computing jobs. This is made possible by ASICs (application-specific integrated circuits) from Annapurna Labs, a company that Amazon acquired.
Nitro Hypervisor: built on KVM, but does not include general-purpose operating system components.
- Kubernetes: open-sourced by Google and inspired by its internal Borg
- API: contract driven
- Event Driven
- Data Stream Driven
Latency must be paired with a percentile (p50, p95, p99, etc.). A single number such as the average hides the slow tail.
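To make the point concrete, here is a minimal sketch of computing latency percentiles with the nearest-rank method. The `percentile` helper and the sample latencies are made up for illustration; real monitoring systems usually estimate percentiles from histograms or sketches rather than sorting raw samples.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 950]

summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(summary)   # {'p50': 14, 'p95': 950, 'p99': 950}
print(mean_ms)   # 107.2
```

The mean (107.2 ms) describes no real request here: the typical request takes about 14 ms, and the tail takes 950 ms. That is why a latency figure without its percentile is hard to interpret.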
- SQL -> NoSQL, Data Warehouse -> Data Lake: think less about how to put data in, and more about how to pull data out.
- Do you want it right? Read your writes. Do you want it right now? Then it is bounded by a fast SLA.
- DevOps replaces sysadmin
- Session state: across running things. A stateful session remembers stuff; a stateless one remembers nothing within the session.
- Durable state: across failure, stuff is remembered when you come back later.
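The difference between session state and durable state can be sketched with two counters. Both classes, the JSON file layout, and the "restart" simulated by constructing a new object are assumptions made for illustration only.

```python
import json
import os
import tempfile


class SessionCounter:
    """Session state: lives in process memory and is lost on restart."""

    def __init__(self):
        self.count = 0

    def hit(self):
        self.count += 1
        return self.count


class DurableCounter:
    """Durable state: every update is persisted, so it survives a restart."""

    def __init__(self, path):
        self.path = path
        self.count = 0
        if os.path.exists(path):
            with open(path) as f:
                self.count = json.load(f)["count"]

    def hit(self):
        self.count += 1
        tmp = self.path + ".tmp"
        with open(tmp, "w") as f:
            json.dump({"count": self.count}, f)
            f.flush()
            os.fsync(f.fileno())   # force the bytes to disk before the rename
        os.replace(tmp, self.path)  # atomic swap, so readers never see a partial file
        return self.count


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "counter.json")

    session, durable = SessionCounter(), DurableCounter(path)
    for _ in range(3):
        session.hit()
        durable.hit()

    # Simulate a crash and restart by building fresh objects.
    session_after = SessionCounter()
    durable_after = DurableCounter(path)
    print(session_after.count, durable_after.count)  # 0 3
```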
ALTS: Application Layer Transport Security (LOAS2), a replacement for SSL/TLS
- TLS: for traffic from external clients to Google
- ALTS: for service-to-service communication within Google's infrastructure
- gRPC: a replacement for Google's internal-only Stubby, running on top of TLS or ALTS
a distribution scheme that does not depend directly on the number of servers
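Consistent hashing is one well-known scheme with this property. Below is a minimal sketch, not a production implementation: the `HashRing` class, the server names, the choice of SHA-256, and the 100 virtual nodes per server are all assumptions for illustration.

```python
import bisect
import hashlib


class HashRing:
    """Maps keys to servers by walking clockwise on a hash ring."""

    def __init__(self, servers=(), vnodes=100):
        self.vnodes = vnodes  # virtual nodes per server, to even out the load
        self._points = []     # sorted positions on the ring
        self._owner = {}      # ring position -> server name
        for server in servers:
            self.add(server)

    @staticmethod
    def _hash(key):
        digest = hashlib.sha256(key.encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, server):
        for i in range(self.vnodes):
            point = self._hash(f"{server}#{i}")
            self._owner[point] = server
            bisect.insort(self._points, point)

    def remove(self, server):
        self._points = [p for p in self._points if self._owner[p] != server]
        self._owner = {p: s for p, s in self._owner.items() if s != server}

    def lookup(self, key):
        if not self._points:
            raise LookupError("ring is empty")
        # First ring position after the key's hash, wrapping around at the end.
        i = bisect.bisect_right(self._points, self._hash(key)) % len(self._points)
        return self._owner[self._points[i]]


ring = HashRing(["cache-a", "cache-b", "cache-c"])
keys = [f"user:{n}" for n in range(1000)]

before = {k: ring.lookup(k) for k in keys}
ring.add("cache-d")
after = {k: ring.lookup(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved")
```

With a naive `hash(key) % number_of_servers`, going from three to four servers would remap most keys. Here only the keys that land on the new server move; everything else stays where it was.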
SSTable: Sorted String Table. Used by BigTable, Cassandra, Spanner.
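The core idea can be sketched in a few lines: an immutable table whose keys are kept sorted, so point lookups and range scans are binary searches, and merging two tables lets the newer value win. This is an in-memory toy under my own assumptions (the `SSTable` class and `compact` helper are hypothetical names); real SSTables are on-disk files with block indexes and other machinery that is not shown here.

```python
import bisect


class SSTable:
    """Immutable table of key/value pairs kept sorted by key."""

    def __init__(self, items):
        pairs = sorted(dict(items).items())
        self._keys = [k for k, _ in pairs]
        self._values = [v for _, v in pairs]

    def get(self, key, default=None):
        # Binary search works because the keys are sorted.
        i = bisect.bisect_left(self._keys, key)
        if i < len(self._keys) and self._keys[i] == key:
            return self._values[i]
        return default

    def scan(self, start, end):
        """Return pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return list(zip(self._keys[lo:hi], self._values[lo:hi]))


def compact(older, newer):
    """Merge two tables into a new one; on duplicate keys the newer value wins."""
    merged = dict(zip(older._keys, older._values))
    merged.update(zip(newer._keys, newer._values))
    return SSTable(merged)


older = SSTable({"apple": "1", "cherry": "3"})
newer = SSTable({"banana": "2", "cherry": "30"})
table = compact(older, newer)

print(table.get("cherry"))   # 30
print(table.scan("a", "c"))  # [('apple', '1'), ('banana', '2')]
```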
- Binary release: needs to go through compilation and tests, which may take a few hours in a CI/CD system.
- Data push: relatively small, mostly configuration, and should be rolled out quickly (in minutes instead of hours). This is useful for controlled feature rollout (by feature flag and sampling percentage) or operational changes (like a whitelist/blacklist). Data push should be a separate system that can quickly change things in prod without changing code or binaries.
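A rough sketch of what the code side of a data push might look like: the binary ships once with a flag check, and only the small config changes afterwards. `FlagConfig`, `bucket`, `is_enabled`, and the flag and user names are hypothetical; the `allowlist` and `denylist` fields stand in for the whitelist/blacklist mentioned above. How the config actually reaches production is out of scope here.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class FlagConfig:
    """The small piece of data that gets pushed; no code change needed."""
    enabled: bool = False
    percentage: int = 0  # share of users in the rollout, 0 to 100
    allowlist: set = field(default_factory=set)
    denylist: set = field(default_factory=set)


def bucket(flag_name, user_id):
    """Stable bucket in [0, 100), so a user's decision does not flip between requests."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100


def is_enabled(flag_name, user_id, config):
    if user_id in config.denylist:
        return False
    if user_id in config.allowlist:
        return True
    return config.enabled and bucket(flag_name, user_id) < config.percentage


# First push: 10% of users, plus one tester who always gets the feature.
config = FlagConfig(enabled=True, percentage=10,
                    allowlist={"qa-1"}, denylist={"user-7"})
print(is_enabled("new_checkout", "qa-1", config))    # True
print(is_enabled("new_checkout", "user-7", config))  # False

# Later push: widen the rollout to 50% by changing data only.
config = FlagConfig(enabled=True, percentage=50,
                    allowlist={"qa-1"}, denylist={"user-7"})
```

Because the bucket is derived from a hash of the flag name and user ID, raising the percentage only adds users; nobody who already had the feature loses it.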