Tech Stacks - Observability

3 Kinds

Metrics-based Monitoring
Monitoring system and time series database (TSDB) (tracked on DB-Engines: https://db-engines.com/en/system/Prometheus).
Originally built at SoundCloud.
Part of CNCF, second hosted project, after Kubernetes.
Prometheus 3.0 brings native support for OpenTelemetry.
Official website: https://prometheus.io/
GitHub: https://github.com/prometheus

cortex/alertmanager: the internal developer portal, metrics storage, time series db.

Cortex provides horizontally scalable, highly available, multi-tenant, long term storage for Prometheus.

Specs, APIs, SDKs, libraries for collecting metrics, traces, and logs.
Standardized format: OpenTelemetry Protocol (OTLP)
Does NOT include backends; telemetry can be sent to backends such as Prometheus (for metrics) and Jaeger (for tracing).
- Prometheus: OpenTelemetry metrics can be sent over HTTP to the /api/v1/otlp/v1/metrics endpoint and ingested natively.
OpenCensus and OpenTracing have merged to form OpenTelemetry.
OpenCensus originated from Google.
OpenTelemetry is part of CNCF.
Official website: https://opentelemetry.io/
GitHub: https://github.com/open-telemetry/
logs ingestion
OTEL: short for OpenTelemetry. E.g. otel-collector runs as a sidecar to collect telemetry, then sends it to a storage backend.
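
As a rough illustration of pointing an OpenTelemetry SDK at Prometheus, the sketch below builds the standard OTLP exporter environment variables. It assumes a Prometheus server with its OTLP receiver enabled and the full ingestion path /api/v1/otlp/v1/metrics; the helper name otlp_metrics_env and the localhost URL are hypothetical.

```python
# Full OTLP ingestion path on a Prometheus server (assumes the OTLP receiver is enabled).
OTLP_METRICS_PATH = "/api/v1/otlp/v1/metrics"


def otlp_metrics_env(prometheus_base_url):
    """Build OTel SDK environment variables that export metrics to Prometheus.

    otlp_metrics_env is a hypothetical helper; the variable names themselves
    come from the OpenTelemetry exporter configuration conventions.
    """
    base = prometheus_base_url.rstrip("/")  # avoid a double slash when joining
    return {
        "OTEL_METRICS_EXPORTER": "otlp",
        "OTEL_EXPORTER_OTLP_PROTOCOL": "http/protobuf",
        "OTEL_EXPORTER_OTLP_METRICS_ENDPOINT": base + OTLP_METRICS_PATH,
    }


# Example: a Prometheus instance assumed to listen on localhost:9090.
env = otlp_metrics_env("http://localhost:9090/")
for name, value in env.items():
    print(f"{name}={value}")
```

The printed lines can be exported in the shell of an instrumented service; traces and logs would still need their own backends.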

For visualization.

For log aggregation.

Google Cloud Monitoring, ELK, Splunk, Sumo Logic, Grafana, Prometheus, InfluxDB.

Introspective and synthetic time series data for visualizations and drilldowns

Treat time-series data as a data source for generating alerts
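
A minimal sketch of alerting on time-series data: fire only when a value has stayed above a threshold for a sustained window, so a single spike does not page anyone. The Sample type, the should_alert function, and the error-rate numbers are all hypothetical, not any particular product's rule engine.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float  # seconds since some epoch
    value: float


def should_alert(series, threshold, for_seconds):
    """True if the value has been above threshold continuously for at least
    for_seconds, measured up to the most recent sample."""
    breach_start = None
    for sample in sorted(series, key=lambda s: s.timestamp):
        if sample.value > threshold:
            if breach_start is None:
                breach_start = sample.timestamp  # breach begins here
        else:
            breach_start = None  # recovered, reset the window
    if breach_start is None:
        return False
    latest = max(s.timestamp for s in series)
    return latest - breach_start >= for_seconds


# Hypothetical error-rate series sampled once a minute.
sustained = [Sample(t, v) for t, v in [(0, 0.01), (60, 0.06), (120, 0.07), (180, 0.08), (240, 0.09)]]
spike = [Sample(t, v) for t, v in [(0, 0.01), (60, 0.09), (120, 0.01), (180, 0.01), (240, 0.06)]]

print(should_alert(sustained, threshold=0.05, for_seconds=120))
print(should_alert(spike, threshold=0.05, for_seconds=120))
```

The sustained series triggers the alert, while the spike series does not because its latest breach has only just started.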

Latency must be paired with a percentile (p50, p95, p99, etc.).
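
To show why a bare latency number is not enough, the sketch below computes percentiles with the nearest-rank method (one of several common definitions) and compares them with the mean. The percentile helper and the sample latencies are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


# Hypothetical request latencies in milliseconds: mostly fast, one slow outlier.
latencies_ms = [12, 15, 11, 14, 13, 16, 12, 15, 14, 950]

summary = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
mean_ms = sum(latencies_ms) / len(latencies_ms)

print(summary)
print(f"mean = {mean_ms} ms")
```

With these samples the median is 14 ms while the mean is pulled above 100 ms by the single outlier, and p95/p99 expose the outlier directly.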

Speed-of-light limit: every 93 miles of distance adds at least 1 ms of RTT (round-trip time).
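
The arithmetic behind this rule of thumb can be checked with a short sketch. It uses the speed of light in vacuum (about 186,282 miles per second); the function name min_rtt_ms and the 2/3 velocity factor for optical fiber are assumptions for illustration, and real networks add routing and queuing delay on top.

```python
SPEED_OF_LIGHT_MILES_PER_S = 186_282.397  # speed of light in vacuum


def min_rtt_ms(distance_miles, velocity_factor=1.0):
    """Lower bound on round-trip time in milliseconds for a given one-way distance.

    velocity_factor scales the signal speed relative to vacuum
    (1.0 = vacuum; roughly 2/3 is a common approximation for fiber).
    """
    if distance_miles < 0:
        raise ValueError("distance must be non-negative")
    if not 0 < velocity_factor <= 1:
        raise ValueError("velocity_factor must be in (0, 1]")
    one_way_s = distance_miles / (SPEED_OF_LIGHT_MILES_PER_S * velocity_factor)
    return 2 * one_way_s * 1000  # there and back, seconds to milliseconds


print(f"93 miles, vacuum: {min_rtt_ms(93):.3f} ms")
print(f"93 miles, fiber (approx.): {min_rtt_ms(93, velocity_factor=2 / 3):.3f} ms")
```

In vacuum, 93 miles comes out just under 1 ms of RTT; in fiber the same distance costs roughly 1.5 ms, so the rule is a floor rather than an estimate.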

proactive monitoring (exploration): query and analyze data to improve performance, reliability, or availability
reactive monitoring (incident response):
- Detect
- Triage
- Investigate
- Mitigate