Tech Stacks - Observability

Last Updated: 2024-01-14

3 Kinds

  • Metrics: show what is wrong without explaining why
  • Tracing: finds which part of the distributed system to blame
  • Logs: extra info for debugging

Ecosystem

Prometheus

Jaeger

  • Distributed tracing

Cortex

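https://cortexmetrics.io/

Not to be confused with https://www.cortex.io/, which is an unrelated internal developer portal. The Cortex meant here is metrics storage (a time series db); cortex/alertmanager is its alerting component.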

Cortex provides horizontally scalable, highly available, multi-tenant, long term storage for Prometheus.

OpenTelemetry

  • Specs, APIs, SDKs, libraries for collecting metrics, traces, and logs.
  • Standardized format: OpenTelemetry Protocol (OTLP)
  • Does NOT include backends; telemetry can be sent to backends such as Prometheus (for metrics) or Jaeger (for tracing)
    • Prometheus: can ingest OpenTelemetry metrics natively over OTLP/HTTP on the /api/v1/otlp/v1/metrics endpoint, once its OTLP receiver is enabled.
  • OpenCensus and OpenTracing have merged to form OpenTelemetry.
  • OpenCensus originated from Google.
  • OpenTelemetry is part of CNCF.
  • Official website: https://opentelemetry.io/
  • GitHub: https://github.com/open-telemetry/
  • Supports log ingestion in addition to metrics and traces.
  • OTEL: short for OpenTelemetry. E.g. otel-collector can run as a sidecar that collects telemetry and forwards it to a storage backend.
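
To make the OTLP bullet concrete, here is a rough sketch of what a single gauge data point might look like in OTLP's JSON encoding. The field names follow my reading of the OTLP JSON mapping and should be checked against the spec; the service name, metric name, and scope name are made up for illustration. In practice an OpenTelemetry SDK or the collector builds this payload, not hand-written code.

```python
import json
import time


def otlp_gauge_payload(service_name, metric_name, value, time_unix_nano=None):
    """Build a dict shaped like an OTLP/JSON metrics export with one gauge point.

    Assumption: field names (resourceMetrics, scopeMetrics, asDouble, ...)
    mirror the OTLP JSON encoding; verify against the spec before relying on it.
    """
    if time_unix_nano is None:
        time_unix_nano = time.time_ns()
    return {
        "resourceMetrics": [{
            "resource": {
                "attributes": [
                    {"key": "service.name", "value": {"stringValue": service_name}},
                ],
            },
            "scopeMetrics": [{
                "scope": {"name": "example.manual"},  # hypothetical scope name
                "metrics": [{
                    "name": metric_name,
                    "unit": "1",
                    "gauge": {
                        "dataPoints": [{
                            "asDouble": float(value),
                            # 64-bit integers are carried as strings in JSON.
                            "timeUnixNano": str(time_unix_nano),
                        }],
                    },
                }],
            }],
        }],
    }


# Hypothetical service and metric names; fixed timestamp for reproducibility.
payload = otlp_gauge_payload(
    "checkout", "queue.depth", 7, time_unix_nano=1_700_000_000_000_000_000
)
body = json.dumps(payload)  # what an exporter would put on the wire as JSON
```

The sketch only builds the body; it deliberately does not send it anywhere, since which encodings a given backend accepts depends on that backend.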

Grafana

For visualization.

Grafana Loki

For log aggregation.

Solutions

Google Cloud Monitoring, ELK, Splunk, Sumo Logic, Grafana, Prometheus, InfluxDB.

Logging

Data

Introspective and synthetic time series data for visualizations and drilldowns

  • Short-term (in memory) or long-term (on disk)
  • High-cardinality (raw data) or summarized
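
One way to picture the raw vs summarized trade-off: raw points keep every sample, while a summary keeps only a few aggregates per time bucket. A minimal sketch, assuming points are (timestamp_seconds, value) pairs; the function name and the sample data are hypothetical.

```python
from collections import defaultdict


def summarize(points, bucket_seconds):
    """Group raw (timestamp, value) points into fixed-width buckets.

    Each bucket keeps only count, min, max, and mean, so the individual
    samples are no longer recoverable from the summary.
    """
    buckets = defaultdict(list)
    for timestamp, value in points:
        bucket_start = timestamp - timestamp % bucket_seconds
        buckets[bucket_start].append(value)
    return {
        start: {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "mean": sum(values) / len(values),
        }
        for start, values in sorted(buckets.items())
    }


# Hypothetical raw samples taken every 15 seconds.
raw = [(0, 10.0), (15, 12.0), (30, 11.0), (45, 50.0), (60, 9.0), (75, 13.0)]
summary = summarize(raw, bucket_seconds=60)
```

Here the spike to 50.0 survives only as the bucket's max; when exactly it happened within the minute is lost, which is the cost of summarizing.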

Alerts

Treat time-series data as a data source for generating alerts
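
A minimal sketch of the idea, assuming a simple rule: fire when the value stays above a threshold for N consecutive samples. The rule shape, function name, and data are illustrative assumptions, not any particular alerting system's semantics.

```python
def firing_points(series, threshold, for_samples):
    """Return timestamps at which a threshold alert would start firing.

    The alert fires once per streak, at the sample where the value has been
    above `threshold` for `for_samples` consecutive samples.
    """
    fired = []
    streak = 0
    for timestamp, value in series:
        streak = streak + 1 if value > threshold else 0
        if streak == for_samples:
            fired.append(timestamp)
    return fired


# Hypothetical error-rate series: (timestamp_seconds, percent of failed requests).
error_rate = [
    (0, 0.5), (60, 2.5), (120, 0.4),
    (180, 3.0), (240, 3.2), (300, 3.1), (360, 0.3),
]
alerts = firing_points(error_rate, threshold=2.0, for_samples=3)
```

Requiring several consecutive samples keeps the single blip at t=60 from paging anyone, at the cost of a slower reaction to a real incident.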

Metrics

SLI vs SLO vs SLA

Latency

Latency must be paired with a percentile (p50, p95, p99, etc.).
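
To illustrate why a bare latency number is not enough, here is a small sketch that computes percentiles with the nearest-rank method (one of several common definitions; real monitoring systems often interpolate or estimate from histograms). The sample latencies are made up.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds, with one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 950]

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50 = percentile(latencies_ms, 50)
p95 = percentile(latencies_ms, 95)
p99 = percentile(latencies_ms, 99)
```

With this data the mean is pulled far above what a typical request sees, p50 stays near the typical value, and the high percentiles expose the outlier. That spread is the reason to always state which percentile a latency figure refers to.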

Speed-of-light limit: every 93 miles of one-way distance adds at least 1 ms of RTT (round-trip time), even in a vacuum; real links such as fiber are slower.
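
The 93-mile figure can be checked with a quick calculation. This sketch assumes the speed of light in vacuum is about 186,282 miles per second; the 2/3 factor for fiber is a common rule of thumb, not a measured value for any particular link.

```python
SPEED_OF_LIGHT_MILES_PER_S = 186_282  # in vacuum, approximate
FIBER_SPEED_FRACTION = 2 / 3          # assumed rule of thumb for optical fiber


def min_rtt_ms(distance_miles, speed_fraction=1.0):
    """Lower bound on round-trip time in ms for a given one-way distance.

    Ignores routing detours, queuing, and processing delay, so real RTTs
    are higher than this bound.
    """
    one_way_seconds = distance_miles / (SPEED_OF_LIGHT_MILES_PER_S * speed_fraction)
    return 2 * one_way_seconds * 1000


vacuum_rtt = min_rtt_ms(93)                       # roughly 1 ms
fiber_rtt = min_rtt_ms(93, FIBER_SPEED_FRACTION)  # roughly 1.5 ms
```

The bound is physical: no amount of server tuning removes it, so for distant users the only fix is moving the data closer.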

Proactive Monitoring vs Reactive Monitoring

  • proactive monitoring (exploration): querying and analyzing data in order to improve performance, reliability, or availability
  • reactive monitoring (incident response):
    • Detect
    • Triage
    • Investigate
    • Mitigate

Query