Cloud / Distributed Systems - Overview

Updated: 2019-06-15

Working on any non-trivial projects in any non-ancient software companies would require some knowledge about distributed systems. Seriously, with data at today's scale, everything is distributed. This is an attempt to create a mind map to help you navigate.

Distributed (Data) System

It's all about data:

  • Store data so that they, or another application, can find it again later (databases)
  • Remember the result of an expensive operation, to speed up reads (caches)
  • Allow users to search data by keyword or filter it in various ways (search indexes)
  • Send a message to another process, to be handled asynchronously (message queues)
  • Observe what is happening, and act on events as they occur (stream processing)
  • Periodically crunch a large amount of accumulated data (batch processing)

Top 3 Clouds

"Cloud is the new OS"

  • Amazon AWS
  • Microsoft Azure
  • Google Cloud Platform(GCP)

Comparison Chart:

Web Tier

Web servers:

  • Apache
  • Nginx

Or generate static sites and deploy on services like S3

Data Format/Serialization

  • Protobuf: created and used by Google
  • Thrift: created and used by Facebook
  • Avro
  • RCFile(Record Columnar File): Facebook
  • Optimized Row Columnar (ORC) Hortonworks
  • Parquet: Cloudera and Twitter
  • GRPC

parquet vs arrow:

  • parquet: on disk
  • arrow: in memory

Code management

Git or hg(better support for larger repos)

CI/CD

Facebook: Landcastle https://gregoryszorc.com/blog/2015/03/28/notes-from-facebook%27s-developer-infrastructure-at-scale-f8-talk/

Compute/Deployment

Virtual machine, container, serverless

The related job titles may be: DevOps/Production Engineer/Site Reliability Engineer

  • Kubernetes/Docker
  • Chef/Puppet/Ansible
  • Serverless

https://www.simform.com/serverless-vs-containers/

Batch Data Processing

https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a

More Stacks

AWS Nitro System

Offload network, storage and management to dedicated hardware, so CPU can be used for more important computing jobs. Thanks to the ASIC(Application-specific integrated circuit) from Annapurna Labs, a company that Amazon acquired.

Nitro Hypervisor: built on KVM, but does not include general purpose operating system components.

Distributed Cluster Operating Systems

  • Kubernetes, the open-source version of Google's internal Borg
  • Mesos

Integration Patterns

  • API: contract driven
  • Event Driven
  • Data Stream Driven

Latency

Latency must be paired with a p number(p50, p95, p99, etc)

Light speed limit: 93 miles distance adds 1 ms RTT(round-trip time).

Notes

  • SQL->NoSQL, Data Warehouse->Data Lake: think less about how to put data in, but more when pulling data out.
  • Do you want it right? read your writes. Do you want it right now? bounded by fast SLA
  • devops replaces sysadmin

State

  • Session state: across running things. Stateful sessions remembers stuff; Stateless does not remember on the session
  • Durable state: across failure, stuff is remembered when you come back later.

gRPC and LOAS

https://security.googleblog.com/2017/12/securing-communications-between-google.html

  • ALTS, Application Layer Transport Security(LOAS2); a replacement for SSL/TLS

    • TLS: from external to Google
    • ALTS: for service-to-service communications within Goolge's infrastructure
  • gRPC: a replacement for Google's internal only Stubby, on top of LTS or ALTS

consistent hashing

a distribution scheme that does not depend directly on the number of servers

SSTable

SSTable: Sorted String Table, sorted strings in files. Used by BigTable, Cassandra, Spanner. Also LevelDB and RocksDB.

Binary Release vs Data Push

  • binary release needs to go through compilation and tests, which may take a few hours in a CI/CD system
  • data push is relatively small, mostly configurations, should be rolled out quickly(in minutes instead of hours). This is useful for controlled feature roll out(by feature flag, and sampling percentage) or operational changes(like whitelist/blacklist). Data push should be a separate system that can quickly changes things in prod without changing code and binary.

Data Plane vs Control Plane vs Management Plane

The 3 Planes in distributed systems/clouds:

  • Management Plane: WRITE configs, either through code or (cloud) console
  • Control Plane: distribute, sync and READ configs in real time from Data Plane
  • Data Plane: the actual services, databases etc.

Software Defined Networking(SDN)

OpenFlow

  • a communications protocol, allows a server to tell network switches where to send packets.
  • used between the switch and controller on a secure channel
  • program data plane, to allow control plane to scale separately from data plane
  • an enabler of SDN