Tech Stacks - Observability
3 Kinds
- Metrics: show that something is wrong, without explaining why
- Tracing: finds which part of the distributed system to blame
- Logs: extra info for debugging
Ecosystem
Prometheus
- Metrics-based Monitoring
- Monitoring system and time series database (TSDB) (tracked in db-engines: https://db-engines.com/en/system/Prometheus).
- Originally built at SoundCloud.
- Part of CNCF, second hosted project, after Kubernetes.
- Prometheus 3.0 adds native support for OpenTelemetry.
- Official website: https://prometheus.io/
- GitHub: https://github.com/prometheus
Jaeger
- Distributed tracing
Cortex
cortex/alertmanager: metrics storage, time series db (not to be confused with the unrelated internal developer portal product of the same name).
Cortex provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
OpenTelemetry
- Specs, APIs, SDKs, libraries for collecting metrics, traces, and logs.
- Standardized format: OpenTelemetry Protocol (OTLP)
- Does NOT include backends; telemetries can be sent to backends like Prometheus (for metrics), Jaeger (for tracing)
- Prometheus: OpenTelemetry metrics can be sent to the /api/v1/otlp/v1/metrics endpoint and ingested natively, once the OTLP receiver is enabled.
- OpenCensus and OpenTracing have merged to form OpenTelemetry.
- OpenCensus originated from Google.
- OpenTelemetry is part of CNCF.
- Official website: https://opentelemetry.io/
- GitHub: https://github.com/open-telemetry/
- logs ingestion
- OTEL: short for OpenTelemetry. E.g. an otel-collector running as a sidecar collects telemetry, then sends it to a storage backend.
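To make the "standardized format" point concrete, here is a minimal sketch of what a metrics payload might look like in OTLP's JSON encoding. It only builds the request body and sends nothing. The service name, scope name, and metric name are hypothetical, and the field layout follows the OTLP JSON mapping as I understand it, so verify it against the spec before relying on it.

```python
import json
import time


def otlp_gauge_payload(service_name, metric_name, value, unit="1"):
    """Build an OTLP/JSON metrics request body with a single gauge data point."""
    now_ns = time.time_ns()
    return {
        "resourceMetrics": [{
            # The resource describes who produced the telemetry.
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}},
                ],
            },
            "scopeMetrics": [{
                "scope": {"name": "example.manual"},  # hypothetical scope name
                "metrics": [{
                    "name": metric_name,
                    "unit": unit,
                    "gauge": {
                        "dataPoints": [{
                            "asDouble": float(value),
                            # 64-bit integers are written as strings in the JSON encoding.
                            "timeUnixNano": str(now_ns),
                        }],
                    },
                }],
            }],
        }],
    }


# Hypothetical service and metric names, for illustration only.
body = json.dumps(otlp_gauge_payload("checkout", "queue_depth", 7))
print(body)
```

In practice an SDK or the otel-collector would build and ship this for you. Whether a given backend accepts the JSON encoding rather than protobuf should be checked against that backend's documentation.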
Grafana
For visualization.
Grafana Loki
For log aggregation.
Solutions
Google Cloud Monitoring, ELK, Splunk, Sumo Logic, Grafana, Prometheus, InfluxDB.
Logging
Data
Introspective and synthetic time series data for visualizations and drilldowns
- Short-term (In-memory) or long-term (on disk)
- High-cardinality (raw data) or summarized
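The raw-versus-summarized distinction can be sketched as a downsampling step: raw points are grouped into fixed time buckets and each bucket is reduced to a few aggregates. This is an illustrative sketch only. The function name, the `(timestamp, value)` tuple layout, and the choice of aggregates are assumptions, not how any particular TSDB does it.

```python
from collections import defaultdict


def summarize(points, bucket_seconds):
    """Reduce raw (timestamp_seconds, value) points to per-bucket aggregates."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }


raw = [(0, 1.0), (10, 3.0), (65, 5.0)]
print(summarize(raw, 60))
```

The summarized form is cheaper to keep long-term, but the individual points (and any per-point detail) can no longer be recovered from it.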
Alerts
Treat time-series data as a data source for generating alerts
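As a rough illustration of alerting on time-series data, the sketch below evaluates a threshold rule that only fires after the value has stayed above the threshold for a minimum duration. The function name, the tuple layout, and the "sustained for N seconds" rule are assumptions chosen for the example. Real alerting systems add evaluation intervals, grouping, and notification routing.

```python
def is_firing(series, threshold, for_seconds):
    """Return True if, as of the last sample, the value has been above
    `threshold` continuously for at least `for_seconds`.

    `series` is a list of (timestamp_seconds, value) sorted by timestamp.
    """
    breach_start = None
    last_ts = None
    for ts, value in series:
        last_ts = ts
        if value > threshold:
            if breach_start is None:
                breach_start = ts  # the breach begins here
        else:
            breach_start = None    # any recovery resets the timer
    return breach_start is not None and last_ts - breach_start >= for_seconds


# Hypothetical CPU utilization samples, one per minute.
sustained = [(0, 0.20), (60, 0.95), (120, 0.97), (180, 0.96)]
spike = [(0, 0.20), (60, 0.95), (120, 0.30), (180, 0.40)]

print(is_firing(sustained, threshold=0.9, for_seconds=120))
print(is_firing(spike, threshold=0.9, for_seconds=120))
```

Requiring the condition to hold for a duration is a common way to avoid paging on brief spikes.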
Metrics
Latency
Latency must be reported with a percentile (p50, p95, p99, etc.); a bare average hides the slow tail.
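A small sketch of why the percentile matters: with one slow request in ten, the median looks healthy while the tail does not. This uses the nearest-rank definition of a percentile, which is one of several common definitions. The sample values are made up.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Made-up request latencies in milliseconds; one request is very slow.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 900]

print(percentile(latencies_ms, 50))            # 14
print(percentile(latencies_ms, 95))            # 900
print(sum(latencies_ms) / len(latencies_ms))   # 102.2
```

The mean (102.2 ms) describes no request that actually happened, while p50 and p95 together show both the typical case and the tail.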
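Speed-of-light limit: every 93 miles of distance adds at least 1 ms of RTT (round-trip time). Light in optical fiber travels slower than in vacuum, so real links add more (roughly 1.5 ms).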
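The 93-mile figure can be checked with a little arithmetic: light covers about 186,282 miles per second in vacuum, so a 93-mile one-way distance is a 186-mile round trip. The fiber velocity factor of 2/3 below is a commonly quoted approximation and is an assumption here. The function name is illustrative.

```python
SPEED_OF_LIGHT_MILES_PER_S = 186_282  # in vacuum, approximate
FIBER_VELOCITY_FACTOR = 2 / 3         # assumption: light in fiber travels at about 2/3 c


def min_rtt_ms(distance_miles, velocity_factor=1.0):
    """Lower bound on round-trip time in milliseconds for a one-way distance."""
    one_way_seconds = distance_miles / (SPEED_OF_LIGHT_MILES_PER_S * velocity_factor)
    return 2 * one_way_seconds * 1000


print(round(min_rtt_ms(93), 2))                         # 1.0
print(round(min_rtt_ms(93, FIBER_VELOCITY_FACTOR), 2))  # 1.5
```

These are physical lower bounds only. Routing detours, queuing, and processing add latency on top.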
Proactive Monitoring vs Reactive Monitoring
- proactive monitoring (exploration): query and analyze data to improve performance, reliability, or availability
- reactive monitoring (incident response):
  - Detect
  - Triage
  - Investigate
  - Mitigate
Query
- PromQL: Prometheus Query Language.
- InfluxQL.
- Graphite.
- MQL for Google Cloud Monitoring: https://cloud.google.com/monitoring/mql
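For a feel of what a PromQL query looks like, the sketch below builds the URL for an instant query against the Prometheus HTTP API (`/api/v1/query`). It does not send the request. The server address and the metric name `http_request_duration_seconds_bucket` are assumptions for illustration. The query estimates p99 latency from a histogram over the last 5 minutes.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

PROMETHEUS_URL = "http://localhost:9090"  # assumption: a local Prometheus server

# p99 latency from a histogram; the metric name is hypothetical.
P99_LATENCY = (
    "histogram_quantile(0.99, "
    "sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
)


def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query, URL-encoding the expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


url = instant_query_url(PROMETHEUS_URL, P99_LATENCY)
print(url)

# Round-trip check: decoding the URL yields the original expression.
print(parse_qs(urlsplit(url).query)["query"][0] == P99_LATENCY)
```

Fetching this URL from a running server would return a JSON document with the query result. The other query languages listed above have their own syntax and endpoints.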