Tech Stacks - Monitoring
Last Updated: 2023-02-14
3 Kinds
- Metrics: show that something is wrong, but not why
- Tracing: pinpoints which part of a distributed system is responsible
- Logs: provide detailed context for debugging
Ecosystem
Prometheus
- Metrics-based Monitoring
- Monitoring system with a built-in time series database (TSDB)
- Originally built at SoundCloud
- Part of CNCF; the second hosted project, after Kubernetes
- Official website: https://prometheus.io/
- GitHub: https://github.com/prometheus
Jaeger
- Distributed tracing
Cortex
- Horizontally scalable, long-term storage for Prometheus metrics (time series database); includes a multi-tenant Alertmanager
OpenTelemetry
- Specs, APIs, SDKs, libraries for collecting metrics, traces, and logs
- Does NOT include backends; telemetry data is exported to backends such as Prometheus (for metrics) and Jaeger (for traces)
- OpenCensus and OpenTracing have merged to form OpenTelemetry.
- OpenCensus originated from Google.
- OpenTelemetry is part of CNCF
- Official website: https://opentelemetry.io/
- GitHub: https://github.com/open-telemetry/
- logs ingestion
Solutions
Google Cloud Monitoring, ELK, Splunk, Sumo Logic, Grafana, Prometheus, InfluxDB
Logging
Data
Introspective and synthetic time series data for visualizations and drilldowns
- Short-term (in-memory) or long-term (on disk)
- High-cardinality (raw data) or summarized
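The trade-off between raw and summarized data can be sketched in a few lines of Python. This is a minimal illustration, not how any particular TSDB does it: the function name `summarize`, the 60-second bucket width, and the sample values are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean

def summarize(samples, bucket_seconds=60):
    """Group (timestamp, value) samples into fixed-width time buckets and
    keep only count, min, max, and mean for each bucket."""
    buckets = defaultdict(list)
    for ts, value in samples:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": mean(values),
        }
        for start, values in sorted(buckets.items())
    }

# Hypothetical raw samples: (unix-style timestamp in seconds, value).
raw = [(0, 10.0), (15, 20.0), (59, 30.0), (60, 5.0), (90, 15.0)]
summary = summarize(raw)
for start, stats in summary.items():
    print(start, stats)
```

The summarized form is much cheaper to keep long-term, but the individual raw points can no longer be recovered from it.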
Alerts
Treat time-series data as a data source for generating alerts
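As a rough sketch of alerting on time-series data, the Python below fires only when the most recent samples all exceed a threshold, which avoids alerting on a single spike. The function name `should_alert`, the threshold, and the sample error rates are hypothetical; real alerting systems have their own rule formats.

```python
def should_alert(values, threshold, min_consecutive=3):
    """Return True if the most recent `min_consecutive` values
    all exceed `threshold`."""
    if len(values) < min_consecutive:
        return False
    return all(v > threshold for v in values[-min_consecutive:])

# Hypothetical error-rate series, oldest sample first.
error_rate = [0.01, 0.02, 0.07, 0.08, 0.09]
firing = should_alert(error_rate, threshold=0.05)
print("alert firing:", firing)
```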
Metrics
Latency
Latency must be reported with a percentile (p50, p95, p99, etc.)
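To illustrate why the percentile matters, here is a minimal Python sketch that computes p50, p95, and p99 with the nearest-rank method. Nearest-rank is one of several percentile definitions and is chosen here only for simplicity; the latency values are made up for the example.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 18, 900]
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
print(f"p50={p50} ms, p95={p95} ms, p99={p99} ms")
```

In this sample the median looks healthy while the tail percentiles expose the slow requests, which is why a latency figure without its percentile is hard to interpret.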
Speed-of-light limit: every 93 miles of distance adds at least 1 ms of RTT (round-trip time), since light covers about 186 miles per millisecond in a vacuum.
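The arithmetic behind this figure can be checked with a short Python sketch. The speed of light in a vacuum (about 186,282 miles per second) is a known constant; the `velocity_factor` parameter and the 0.67 value used for optical fiber are assumptions added for illustration, reflecting the common rule of thumb that light in fiber travels at roughly two-thirds of its vacuum speed.

```python
SPEED_OF_LIGHT_MILES_PER_MS = 186.282  # vacuum: ~186,282 miles per second

def min_rtt_ms(distance_miles, velocity_factor=1.0):
    """Lower bound on round-trip time: the signal covers the distance twice.
    velocity_factor scales the vacuum speed (assumed ~0.67 for fiber)."""
    return 2 * distance_miles / (SPEED_OF_LIGHT_MILES_PER_MS * velocity_factor)

vacuum_rtt = min_rtt_ms(93)
fiber_rtt = min_rtt_ms(93, velocity_factor=0.67)  # assumed fiber factor
print(f"93 miles in vacuum: {vacuum_rtt:.2f} ms RTT")
print(f"93 miles in fiber (assumed 0.67c): {fiber_rtt:.2f} ms RTT")
```

Real networks add routing, queuing, and processing delays on top of this bound, so measured RTT is higher.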
Proactive Monitoring vs Reactive Monitoring
- proactive monitoring (exploration): query and analyze data to improve performance, reliability, or availability
- reactive monitoring (incident response): detect, triage, investigate, mitigate
Query
PromQL (Prometheus), InfluxQL (InfluxDB), or Graphite functions; MQL for Google Cloud Monitoring.
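As one example of what such a query looks like, the Python sketch below builds the URL for a PromQL instant query against the Prometheus HTTP API endpoint `/api/v1/query`. The server address `http://localhost:9090`, the helper name `instant_query_url`, and the metric name `http_request_duration_seconds_bucket` are assumptions for illustration; substitute whatever your own setup exposes.

```python
from urllib.parse import urlencode

def instant_query_url(base_url, promql):
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"

# PromQL: p95 latency over the last 5 minutes from a histogram metric.
# The metric name is hypothetical.
P95_LATENCY = (
    "histogram_quantile(0.95, "
    "sum by (le) (rate(http_request_duration_seconds_bucket[5m])))"
)

url = instant_query_url("http://localhost:9090", P95_LATENCY)
print(url)
```

The sketch only constructs the URL; sending it with any HTTP client returns a JSON response from a running Prometheus server.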