Distributed Systems - Overview
Working on any non-trivial projects in any non-ancient software companies would require some knowledge about distributed systems. Seriously, with data at today's scale, everything is distributed. This is an attempt to create a mind map to help you navigate.
Probably the first category you could think of when talking about distributed systems: databases.
Over the years, different kinds of databases are developed for different use cases. DB-Engines is tracking more than 300 databases. Choose wisely.
It is a safe bet to start With MySQL or PostgreSQL.
MySQL is the top one open-source db on DB-Engines, also proved its scalability at Facebook, and Uber migrated from Postres to MySQL
MariaDB is a fork of MySQL and it is gaining popularity quickly.
- Microsoft Azure SQL
MongoDB is the most popular NoSQL database.
There are many open source db under Apache, some of them are implementations of the famous papers published by companies like Google
- Google's BigTable and is powered by Apache Hadoop,Apache Zookeeper, and Apache Thrift.
- Amazon DynamoDB
- MongoDB Atlas: MongoDB's own cloud offering https://www.mongodb.com/cloud/atlas/
Relational databases can function as general purpose graph databases, but they are not very efficient to traverse the graphs. Multiple queries and joins are often needed.
E.g. Facebook created its own huge social graph, every entity is a node(like a person, a business, a page, a group), and the different types of relationships are the edges. It is backed by TAO, which is actually a caching layer over MySQL.
Graph databases you can use if you choose not to build it in-house like Facebook:
- Neo4j: a Java graph db.
- graph model: Property Graph and W3C's RDF,
- graph query: Apache TinkerPop Gremlin and SPARQL
- Microsoft Azure Cosmos
Choose Redis for cache for all new projects. Though Memcache is used extensively at Facebook.
Redis: not often used as a primary data store, but for storing and accessing ephemeral data(loss can be tolerated) – metrics, session state, caching.
Some databases are optimized for flash, making them cheaper alternatives to caches. E.g. Aerospike
Time Series Database
Facebook Beringei: https://github.com/facebookincubator/beringei
Solr is losing popularity to Elasticsearch.
- Amazon CloudSearch
- Microsoft Azure Search
- Google Search Appliance
For offline analytics.
Loading data from datawarehouse and other storage systems. Crunch the numbers, get insights.
- MapReduce: the old Hadoop way
- Spark: the new, faster way; also support SQL
- Amazon Redshift
- Microsoft Azure SQL Data Warehouse
- Google BigQuery
Different types of storage: blob storage, file storage
Or generate static sites and deploy on services like S3
- Large companies use Akamai
- Cheap alternative: CloudFlare
- Amazon CloudFront if you use AWS
- Protobuf: created and used by Google
- Thrift: created and used by Facebook
- RCFile(Record Columnar File): Facebook
- Optimized Row Columnar (ORC) Hortonworks
- Parquet: Cloudera and Twitter
parquet vs arrow:
- parquet: on disk
- arrow: in memory
Git or hg(better support for larger repos)
- Facebook: Read More
- Google: Storing all 2b code in a single repo. Read More
- Microsoft: git Read More
Virtual machine, container, serverless
The related job titles may be: DevOps/Production Engineer/Site Reliability Engineer
- docker coreos
- containerd: Container Runtime
- rkt: Container Runtime
- Facebook's Tupperware：https://www.youtube.com/watch?v=C_WuUgTqgOc (why not docker or coreos? they didn't exist then)
Service Discovery And Configuration
- Build your own: etcd, consul, zookeeper
- Built-in: Kubernetes, Marathon, AWS