Distributed Systems - Monitoring

Updated: 2020-06-29

3 Kinds

  • Metrics: show that something is wrong, but not why
  • Tracing: pinpoints which part of the distributed system is responsible
  • Logs: provide detailed context for debugging

Ecosystem

Prometheus

Jaeger

  • Distributed tracing

OpenTelemetry

  • Specs, APIs, SDKs, libraries for collecting metrics, traces, and logs
  • Does NOT include backends; telemetry data is sent to backends such as Prometheus (for metrics) or Jaeger (for tracing)
  • OpenCensus and OpenTracing have merged to form OpenTelemetry.
  • OpenCensus originated from Google.
  • OpenTelemetry is a CNCF (Cloud Native Computing Foundation) project.
  • Official website: https://opentelemetry.io/
  • GitHub: https://github.com/open-telemetry/

Solutions

Google Cloud Monitoring, ELK, Splunk, Sumo Logic

Logging


Metrics

SLI vs SLO vs SLA

  • Service Level Indicator (SLI): what to measure, e.g. latency, availability, data quality.
  • Service Level Objective (SLO): the target value for an SLI, e.g. p99 latency under 100ms.
  • Service Level Agreement (SLA): an externally visible contract built on an SLO, e.g. if p99 latency exceeds 100ms, the provider issues a refund or pays a penalty.
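
The relationship between the three terms can be sketched in a few lines of Python. This is only an illustration: the function names, the 99.9% availability target, and the 10% service credit are hypothetical values chosen for the example, not taken from any real contract.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measurement itself, here the fraction of successful requests."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as fully available
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO: the internal target that the SLI is compared against."""
    return sli >= slo_target


def sla_credit_percent(sli: float, slo_target: float) -> int:
    """SLA: the external consequence of missing the SLO.

    The flat 10% credit is a made-up penalty for illustration.
    """
    return 0 if meets_slo(sli, slo_target) else 10


SLO_TARGET = 0.999  # hypothetical objective: 99.9% availability

sli = availability_sli(good_requests=998, total_requests=1000)
print(f"SLI={sli:.3%}  SLO met={meets_slo(sli, SLO_TARGET)}  "
      f"credit={sla_credit_percent(sli, SLO_TARGET)}%")
```

Keeping the three as separate functions mirrors the definitions: the SLI is pure measurement, the SLO is a threshold, and the SLA only describes what happens when the threshold is missed.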

Latency

Latency must be reported with a percentile (p50, p95, p99, etc.); an average alone hides the slow tail.
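
A small sketch of why the percentile matters, using the nearest-rank method (one of several common percentile definitions; monitoring backends may interpolate or use histogram buckets instead). The sample data is invented for illustration.

```python
import math


def percentile(samples, p):
    """Return the p-th percentile of samples using the nearest-rank method."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(samples)
    # Nearest rank: the smallest value with at least p% of samples at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Invented data: 98 fast requests (10 ms) and 2 slow ones (500 ms).
latencies_ms = [10] * 98 + [500] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)
print("mean:", mean_ms)
print("p50: ", percentile(latencies_ms, 50))
print("p95: ", percentile(latencies_ms, 95))
print("p99: ", percentile(latencies_ms, 99))
```

With this data the mean and p50 look healthy while p99 exposes the slow requests, which is why a latency figure without its percentile is ambiguous.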

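Speed-of-light limit: every 93 miles of distance adds at least 1 ms of RTT (round-trip time). This is the bound in a vacuum; light in optical fiber travels at roughly two-thirds of that speed, so real links are slower.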
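
The arithmetic behind that figure, as a rough sketch. The constant is the approximate speed of light in a vacuum; the `velocity_factor` parameter and the 0.67 value used for fiber are assumptions for illustration, and real networks add routing, queuing, and processing delays on top.

```python
SPEED_OF_LIGHT_MILES_PER_SEC = 186_282  # approximate, in a vacuum


def min_rtt_ms(distance_miles: float, velocity_factor: float = 1.0) -> float:
    """Lower bound on round-trip time in milliseconds for a given one-way distance.

    velocity_factor is the signal speed as a fraction of the vacuum speed of
    light (1.0 for a vacuum; about 0.67 is a common rough figure for fiber).
    """
    one_way_seconds = distance_miles / (SPEED_OF_LIGHT_MILES_PER_SEC * velocity_factor)
    return 2 * one_way_seconds * 1000


print(f"93 miles, vacuum: {min_rtt_ms(93):.2f} ms")
print(f"93 miles, fiber:  {min_rtt_ms(93, velocity_factor=0.67):.2f} ms")
```

The signal covers the distance twice (there and back), so 93 miles of separation means about 186 miles of travel, which light covers in roughly one millisecond.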