Tech Stacks - Monitoring

Last Updated: 2023-02-14

3 Kinds

  • Metrics: show that something is wrong, but not why
  • Tracing: pinpoints which part of a distributed system is responsible
  • Logs: provide extra context for debugging

Ecosystem

Prometheus

Jaeger

  • Distributed tracing

Cortex

https://cortexmetrics.io/

Cortex (with its Alertmanager component): long-term metrics storage for Prometheus; a time series database.

OpenTelemetry

  • Specs, APIs, SDKs, libraries for collecting metrics, traces, and logs
  • Does NOT include backends; telemetry data is exported to backends such as Prometheus (for metrics) and Jaeger (for tracing)
  • OpenCensus and OpenTracing have merged to form OpenTelemetry.
  • OpenCensus originated from Google.
  • OpenTelemetry is part of CNCF
  • Official website: https://opentelemetry.io/
  • GitHub: https://github.com/open-telemetry/
  • Also handles log ingestion

Solutions

Google Cloud Monitoring, ELK, Splunk, Sumo Logic, Grafana, Prometheus, InfluxDB

Logging

Data

Introspective and synthetic time series data for visualizations and drilldowns

  • Short-term (in memory) or long-term (on disk)
  • High-cardinality (raw data) or summarized
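The trade-off between keeping raw points and keeping summaries can be sketched as a simple downsampling step. This is a minimal illustration, not how any particular backend does it; the function name `downsample`, the 60-second bucket size, and the sample data are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import mean


def downsample(points, bucket_seconds):
    """Summarize raw (timestamp, value) points into per-bucket min/mean/max/count."""
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        buckets[ts - ts % bucket_seconds].append(value)
    return {
        start: {
            "min": min(vals),
            "mean": mean(vals),
            "max": max(vals),
            "count": len(vals),
        }
        for start, vals in sorted(buckets.items())
    }


# Hypothetical raw samples: (timestamp in seconds, value).
raw = [(0, 10), (15, 20), (30, 30), (60, 40), (90, 60)]
summary = downsample(raw, 60)
for start, stats in summary.items():
    print(start, stats)
```

The summary is much smaller than the raw series, but the individual points can no longer be recovered from it, which is why raw data is usually kept only for a short time.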

Alerts

Treat time-series data as a data source for generating alerts
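A minimal sketch of that idea: an alert rule is just a predicate evaluated over recent samples of a series. The function name `should_alert`, the threshold, and the "N consecutive samples" condition are assumptions chosen for illustration; real alerting systems offer richer rule languages.

```python
def should_alert(values, threshold, min_consecutive):
    """Fire only if the last `min_consecutive` samples all exceed `threshold`.

    Requiring several consecutive breaches avoids alerting on a single spike.
    """
    if min_consecutive < 1:
        raise ValueError("min_consecutive must be at least 1")
    if len(values) < min_consecutive:
        return False
    return all(v > threshold for v in values[-min_consecutive:])


# Hypothetical error-rate series, oldest sample first.
error_rate = [0.01, 0.02, 0.08, 0.09, 0.07]
print(should_alert(error_rate, threshold=0.05, min_consecutive=3))
print(should_alert(error_rate, threshold=0.05, min_consecutive=4))
```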

Metrics

SLI vs SLO vs SLA

Latency

Latency must always be reported with a percentile (p50, p95, p99, etc.); a latency number without one is ambiguous.
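To illustrate why the percentile matters, here is a small sketch using the nearest-rank method (one of several common percentile definitions; monitoring backends may interpolate or use histograms instead). The `percentile` helper and the sample latencies are made up for the example.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with two slow outliers.
latencies_ms = [12, 15, 11, 14, 13, 250, 16, 12, 13, 900]

for p in (50, 95, 99):
    print(f"p{p} = {percentile(latencies_ms, p)} ms")
print(f"mean = {sum(latencies_ms) / len(latencies_ms)} ms")
```

With this data the median stays low while p95 and p99 land on the slowest request, and the mean sits somewhere in between without describing either group well.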

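Speed-of-light limit: every 93 miles (about 150 km) of distance adds at least 1 ms of RTT (round-trip time), even in a vacuum. Signals in optical fiber travel more slowly, so real RTTs are higher.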
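The arithmetic behind that figure can be checked with a short sketch. The function name `min_rtt_ms` and the `velocity_factor` parameter are hypothetical; the 2/3 factor used for fiber is a rough assumption, not a measured value.

```python
SPEED_OF_LIGHT_MILES_PER_MS = 186.282  # in a vacuum (about 186,282 miles per second)


def min_rtt_ms(distance_miles, velocity_factor=1.0):
    """Lower bound on round-trip time for a given one-way distance.

    velocity_factor scales the propagation speed relative to a vacuum
    (1.0 = vacuum; roughly 2/3 is an assumed value for optical fiber).
    """
    one_way_ms = distance_miles / (SPEED_OF_LIGHT_MILES_PER_MS * velocity_factor)
    return 2 * one_way_ms


print(f"{min_rtt_ms(93):.2f} ms in a vacuum")
print(f"{min_rtt_ms(93, velocity_factor=2 / 3):.2f} ms with the assumed fiber factor")
```

This bound covers propagation only; routing, queuing, and processing delays come on top of it.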

Proactive Monitoring vs Reactive Monitoring

  • Proactive monitoring (exploration): query and analyze data to improve performance, reliability, or availability
  • Reactive monitoring (incident response):
    • Detect
    • Triage
    • Investigate
    • Mitigate

Query

PromQL (Prometheus), InfluxQL (InfluxDB), or Graphite functions; MQL for Google Cloud Monitoring.