Cloud / Distributed Systems - Overview

Updated: 2019-01-13

Working on any non-trivial projects in any non-ancient software companies would require some knowledge about distributed systems. Seriously, with data at today's scale, everything is distributed. This is an attempt to create a mind map to help you navigate.

Distributed (Data) System

It's all about data:

  • Store data so that they, or another application, can find it again later (databases)
  • Remember the result of an expensive operation, to speed up reads (caches)
  • Allow users to search data by keyword or filter it in various ways (search indexes)
  • Send a message to another process, to be handled asynchronously (message queues)
  • Observe what is happening, and act on events as they occur (stream processing)
  • Periodically crunch a large amount of accumulated data (batch processing)

Top 3 Clouds

"Cloud is the new OS"

  • Amazon AWS
  • Microsoft Azure
  • Google Cloud Platform(GCP)

Comparison Chart:

Web Tier

Web servers:

  • Apache
  • Nginx

Or generate static sites and deploy on services like S3

Data Format/Serialization

  • Protobuf: created and used by Google
  • Thrift: created and used by Facebook
  • Avro
  • RCFile(Record Columnar File): Facebook
  • Optimized Row Columnar (ORC) Hortonworks
  • Parquet: Cloudera and Twitter
  • GRPC

parquet vs arrow:

  • parquet: on disk
  • arrow: in memory

Code management

Git or hg(better support for larger repos)

CI/CD

Facebook: Landcastle https://gregoryszorc.com/blog/2015/03/28/notes-from-facebook%27s-developer-infrastructure-at-scale-f8-talk/

Compute/Deployment

Virtual machine, container, serverless

The related job titles may be: DevOps/Production Engineer/Site Reliability Engineer

  • Kubernetes/Docker
  • Chef/Puppet/Ansible
  • Serverless

https://www.simform.com/serverless-vs-containers/

Batch Data Processing

https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a

More Stacks

AWS Nitro System

Offload network, storage and management to dedicated hardware, so CPU can be used for more important computing jobs. Thanks to the ASIC(Application-specific integrated circuit) from Annapurna Labs, a company that Amazon acquired.

Nitro Hypervisor: built on KVM, but does not include general purpose operating system components.

Distributed Cluster Operating Systems

  • Kubernetes, the open-source version of Google's internal Borg
  • Mesos

Integration Patterns

  • API: contract driven
  • Event Driven
  • Data Stream Driven

Latency

Latency must be paired with a p number(p50, p95, p99, etc)

Notes

  • SQL->NoSQL, Data Warehouse->Data Lake: think less about how to put data in, but more when pulling data out.
  • Do you want it right? read your writes. Do you want it right now? bounded by fast SLA
  • devops replaces sysadmin

State

  • Session state: across running things. Stateful sessions remembers stuff; Stateless does not remember on the session
  • Durable state: across failure, stuff is remembered when you come back later.

gRPC and LOAS

https://security.googleblog.com/2017/12/securing-communications-between-google.html

  • ALTS, Application Layer Transport Security(LOAS2); a replacement for SSL/TLS

    • TLS: from external to Google
    • ALTS: for service-to-service communications within Goolge's infrastructure
  • gRPC: a replacement for Google's internal only Stubby, on top of LTS or ALTS

SSTable

SSTable: Sorted String Table. Used by BigTable, Cassandra, Spanner.