Distributed Systems - Overview

Working on any non-trivial project at any non-ancient software company requires some knowledge of distributed systems. Seriously, with data at today's scale, everything is distributed. This is an attempt to create a mind map to help you navigate.

Databases

Probably the first category you think of when talking about distributed systems: databases.

Over the years, different kinds of databases have been developed for different use cases. DB-Engines tracks more than 300 databases. Choose wisely.

Relational Databases

It is a safe bet to start with MySQL or PostgreSQL.

MySQL is the top open-source database on DB-Engines. It has also proved its scalability at Facebook, and Uber migrated from Postgres to MySQL.

MariaDB is a fork of MySQL and it is gaining popularity quickly.

Cloud services:

  • Microsoft Azure SQL

NoSQL

MongoDB is the most popular NoSQL database.

There are many open-source databases under Apache; some of them are implementations of famous papers published by companies like Google.

Cloud services:

  • Google's BigTable and is powered by Apache Hadoop, Apache ZooKeeper, and Apache Thrift.
  • Amazon DynamoDB
  • MongoDB Atlas: MongoDB's own cloud offering https://www.mongodb.com/cloud/atlas/

Graph Databases

Relational databases can function as general-purpose graph databases, but they are not very efficient at traversing graphs: multiple queries and joins are often needed.

E.g. Facebook created its own huge social graph: every entity (a person, a business, a page, a group) is a node, and the different types of relationships are the edges. It is backed by TAO, which is actually a caching layer over MySQL.
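To make the point about joins concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `person` and `friendship` tables and the sample names are hypothetical, invented only for illustration; they are not Facebook's schema. Each extra hop in the graph costs another self-join, and an unbounded traversal needs a recursive query.

```python
import sqlite3

# Hypothetical schema for illustration: one table of nodes, one table of edges.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE friendship (src INTEGER NOT NULL, dst INTEGER NOT NULL);
""")
conn.executemany("INSERT INTO person VALUES (?, ?)",
                 [(1, "alice"), (2, "bob"), (3, "carol"), (4, "dave")])
# Directed edges: alice -> bob -> carol -> dave
conn.executemany("INSERT INTO friendship VALUES (?, ?)",
                 [(1, 2), (2, 3), (3, 4)])


def friends_at_depth_two(conn, name):
    """Return names exactly two hops away; each hop costs one self-join."""
    rows = conn.execute("""
        SELECT DISTINCT p2.name
        FROM person p0
        JOIN friendship f1 ON f1.src = p0.id
        JOIN friendship f2 ON f2.src = f1.dst
        JOIN person p2 ON p2.id = f2.dst
        WHERE p0.name = ?
        ORDER BY p2.name
    """, (name,)).fetchall()
    return [row[0] for row in rows]


def reachable(conn, name):
    """Return every name reachable at any depth, using a recursive CTE."""
    rows = conn.execute("""
        WITH RECURSIVE reach(id) AS (
            SELECT f.dst FROM friendship f
            JOIN person p ON p.id = f.src
            WHERE p.name = ?
            UNION
            SELECT f.dst FROM friendship f JOIN reach r ON f.src = r.id
        )
        SELECT p.name FROM reach JOIN person p ON p.id = reach.id
        ORDER BY p.name
    """, (name,)).fetchall()
    return [row[0] for row in rows]
```

Calling `friends_at_depth_two(conn, "alice")` walks two edges with two joins on the edge table; a three-hop query would need a third. Graph databases typically express the same traversal directly in their query language.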

Graph databases you can use if you choose not to build one in-house like Facebook:

  • Neo4j: a Java graph db.
  • Amazon Neptune

    • graph model: Property Graph and W3C's RDF
    • graph query: Apache TinkerPop Gremlin and SPARQL
  • Microsoft Azure Cosmos DB

Cache

Choose Redis as the cache for all new projects, though Memcached is used extensively at Facebook.

Redis: not often used as a primary data store, but for storing and accessing ephemeral data (where loss can be tolerated): metrics, session state, caching.
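As a rough illustration of why loss can be tolerated, here is a sketch of the common cache-aside pattern with expiring entries. `TTLCache` is a hypothetical in-memory stand-in for a cache server, not the Redis client API, and `get_session` and `load_from_db` are names made up for this example. If a cached entry expires or is lost, the value is simply reloaded from the primary store.

```python
import time


class TTLCache:
    """Hypothetical in-memory stand-in for a cache server with expiring keys."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._data[key]  # expired: treat it as a miss
            return None
        return value

    def set(self, key, value, ttl):
        self._data[key] = (value, self._clock() + ttl)


def get_session(cache, session_id, load_from_db, ttl=60):
    """Cache-aside read: try the cache first, fall back to the primary store."""
    cached = cache.get(session_id)
    if cached is not None:
        return cached
    value = load_from_db(session_id)  # the source of truth
    cache.set(session_id, value, ttl)
    return value
```

The primary store stays authoritative, so the cache can be flushed or restarted at the cost of extra load rather than lost data.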

Some databases are optimized for flash, making them cheaper alternatives to in-memory caches, e.g. Aerospike.

Time Series Database

Search

Solr is losing popularity to Elasticsearch.

Cloud Services:

  • Amazon CloudSearch
  • Microsoft Azure Search
  • Google Search Appliance

Data Warehouse/Analytics

For offline analytics.

Hadoop Ecosystem

Load data from the data warehouse and other storage systems, crunch the numbers, and get insights.

  • MapReduce: the old Hadoop way
  • SQL-like interface:

  • Spark: the new, faster way; also supports SQL
  • Storm
  • Flink
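For readers who have not seen the MapReduce model mentioned in the list above, here is a single-process sketch of its three steps (map, shuffle, reduce) applied to a word count. This is plain Python, not the Hadoop API; the function names are invented for illustration, and a real framework would run the map and reduce steps in parallel across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())
```

For example, `word_count(["a b a", "B c"])` counts `a` twice, `b` twice (case-insensitively), and `c` once.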

Traditional

  • Teradata

Cloud Services

  • Amazon Redshift
  • Microsoft Azure SQL Data Warehouse
  • Google BigQuery

Storage

Different types of storage: blob storage, file storage

  • HDFS
  • S3

Messaging

  • Kafka

Web Tier

Web servers:

  • Apache
  • Nginx

Or generate static sites and deploy on services like S3

CDN

  • Large companies use Akamai
  • Cheap alternative: CloudFlare
  • Amazon CloudFront if you use AWS

Data Format/Serialization

  • Protobuf: created and used by Google
  • Thrift: created and used by Facebook
  • Avro
  • RCFile (Record Columnar File): Facebook
  • ORC (Optimized Row Columnar): Hortonworks
  • Parquet: Cloudera and Twitter
  • gRPC

Parquet vs Arrow:

  • Parquet: on disk
  • Arrow: in memory

Code management

Git or hg (better support for larger repos)

CI/CD

Facebook: Landcastle https://gregoryszorc.com/blog/2015/03/28/notes-from-facebook%27s-developer-infrastructure-at-scale-f8-talk/

Compute/Deployment

Virtual machine, container, serverless

The related job titles may be: DevOps/Production Engineer/Site Reliability Engineer

  • Kubernetes/Docker
  • Chef/Puppet/Ansible
  • Serverless

Container

  • LXC (Linux Containers)
  • docker coreos
  • containerd: Container Runtime
  • rkt: Container Runtime

Orchestration:

https://www.slideshare.net/Docker/aravindnarayanan-facebook140613153626phpapp02-37588997

https://www.simform.com/serverless-vs-containers/

Service Discovery And Configuration

  • Build your own: etcd, consul, zookeeper
  • Built-in: Kubernetes, Marathon, AWS


API

Batch Data Processing

https://medium.com/@maximebeauchemin/functional-data-engineering-a-modern-paradigm-for-batch-data-processing-2327ec32c42a

More Stacks