Distributed Systems - Monitoring
- Metrics: show that something is wrong without explaining why
- Tracing: finds which part of the distributed system to blame
- Logs: extra info for debugging
- Metrics-based Monitoring
- Prometheus: monitoring system and time series database (TSDB)
- Originally built at SoundCloud
- Part of CNCF, second hosted project, after Kubernetes.
- Official website: https://prometheus.io/
- GitHub: https://github.com/prometheus
- Distributed tracing
- OpenTelemetry: specs, APIs, SDKs, and libraries for collecting metrics, traces, and logs
- Does NOT include backends; telemetry data can be sent to backends such as Prometheus (for metrics) or Jaeger (for tracing)
- OpenCensus and OpenTracing have merged to form OpenTelemetry.
- OpenCensus originated from Google.
- OpenTelemetry is part of CNCF
- Official website: https://opentelemetry.io/
- GitHub: https://github.com/open-telemetry/
Other monitoring and logging tools: Google Cloud Monitoring, ELK, Splunk, Sumo Logic
- Service Level Indicator (SLI): what to measure, e.g. latency, availability, data quality.
- Service Level Objective (SLO): desired target for an SLI, e.g. p99 latency of at most 100 ms
- Service Level Agreement (SLA): externally visible contract about an SLO, e.g. if p99 latency exceeds 100 ms, the provider issues a refund or pays a monetary penalty.
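The relationship between SLI, SLO, and SLA can be sketched in a few lines of Python. This is only an illustration, not a standard API: the `SLO` class, the `availability` helper, the 99.9% target, and the request counts are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """A target for one SLI (hypothetical structure, for illustration only)."""
    sli_name: str
    target: float  # minimum acceptable SLI value, e.g. 0.999

    def is_met(self, measured: float) -> bool:
        return measured >= self.target


def availability(successful: int, total: int) -> float:
    """SLI: fraction of requests that succeeded in a measurement window."""
    if total == 0:
        return 1.0  # assumption: a window with no traffic counts as available
    return successful / total


# Hypothetical numbers for one measurement window.
slo = SLO(sli_name="availability", target=0.999)
measured = availability(successful=99_950, total=100_000)
slo_met = slo.is_met(measured)

# An SLA would attach an external consequence (refund, penalty)
# to the case where slo_met is False.
```

Here the SLI is the measured number, the SLO is the threshold it is compared against, and the SLA is whatever the provider has promised to do when the comparison fails.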
Latency should always be reported with a percentile (p50, p95, p99, etc.); a bare number or an average hides the slow tail.
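A small sketch of why the percentile matters, using the nearest-rank method (one of several common percentile definitions). The `percentile` helper and the sample latencies are made up for illustration.

```python
import math
from statistics import mean


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical latencies in ms: 98 fast requests and 2 very slow ones.
latencies_ms = [10.0] * 98 + [2000.0] * 2

avg = mean(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

With this sample, the average is about 49.8 ms, a value no real request experienced, while p50 and p95 are both 10 ms and p99 is 2000 ms. Only the high percentile exposes the slow tail.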
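Light speed limit: every 93 miles of distance adds at least 1 ms of RTT (round-trip time), because light covers about 186 miles per millisecond in a vacuum. Signals in optical fiber travel slower than that, so real RTTs are higher.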
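The arithmetic behind this rule of thumb can be checked with a short calculation. The `min_rtt_ms` helper, the `velocity_factor` parameter, and the 2,500-mile example path are assumptions made for illustration; the 0.67 factor for fiber is a rough approximation, not a measured value.

```python
# Speed of light in a vacuum: about 186,282 miles per second.
SPEED_OF_LIGHT_MILES_PER_MS = 186.282


def min_rtt_ms(distance_miles: float, velocity_factor: float = 1.0) -> float:
    """Lower bound on round-trip time for a given one-way distance.

    velocity_factor is the fraction of the vacuum speed of light at which
    the signal travels (1.0 for vacuum; roughly 0.67 is often assumed
    for optical fiber).
    """
    if distance_miles < 0:
        raise ValueError("distance_miles must be non-negative")
    if not 0 < velocity_factor <= 1:
        raise ValueError("velocity_factor must be in (0, 1]")
    speed = SPEED_OF_LIGHT_MILES_PER_MS * velocity_factor
    return 2 * distance_miles / speed  # there and back


rtt_93_miles = min_rtt_ms(93)             # just under 1 ms
rtt_long_path = min_rtt_ms(2500)          # hypothetical 2,500-mile path
rtt_long_fiber = min_rtt_ms(2500, 0.67)   # same path, fiber approximation
```

For the hypothetical 2,500-mile path, the vacuum bound is a little under 27 ms and the fiber approximation is around 40 ms. These are physical floors; routing, queuing, and processing add to them.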