Tech Stacks - Logging
- Debug logging: instead of logging errors locally, send them to a centralized place so that stack traces can be viewed on a dedicated web page, regardless of where the code runs, whether on your dev server or on some random node in a staging/prod cluster.
- System metrics: things like QPS, latency, availability, request counts, etc. These indicate the health of the system (cluster) and are largely unrelated to business logic.
- Business metrics: especially for billing purposes.
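The debug-logging idea above can be sketched with Python's standard logging module. This is only an illustration: CentralizedHandler and the send callable are hypothetical names, and an in-memory list stands in for the central store. In practice the transport might be one of the standard library's network handlers, such as logging.handlers.HTTPHandler or logging.handlers.SocketHandler.

```python
import json
import logging
import socket
import traceback


class CentralizedHandler(logging.Handler):
    """Serialize each record to JSON and hand it to a transport callable."""

    def __init__(self, send):
        super().__init__()
        # Hypothetical transport, e.g. a function that POSTs to a log collector.
        self.send = send

    def emit(self, record):
        payload = {
            # The host name lets the viewer tell which node produced the error.
            "host": socket.gethostname(),
            "logger": record.name,
            "level": record.levelname,
            "message": record.getMessage(),
            "stacktrace": None,
        }
        if record.exc_info:
            payload["stacktrace"] = "".join(
                traceback.format_exception(*record.exc_info)
            )
        try:
            self.send(json.dumps(payload))
        except Exception:
            self.handleError(record)


collected = []  # stands in for the central store in this sketch
logger = logging.getLogger("centralized_demo")
logger.setLevel(logging.DEBUG)
logger.propagate = False
logger.addHandler(CentralizedHandler(collected.append))

try:
    1 / 0
except ZeroDivisionError:
    # logger.exception attaches the current stack trace to the record.
    logger.exception("division failed")
```

Each entry in collected is a JSON string carrying the host name, the log level, and the full stack trace, which is the information a central web viewer would need.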
Distributed / Centralized Logging
- High write availability and durable record storage.
- A repeatable total order on those records: every reader sees the records in the same order.
- Append-only: existing records cannot be modified.
- Record-oriented: data is written into the log in indivisible records, rather than individual bytes.
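As a rough single-process illustration of the last three properties (total order, append-only, record-oriented), here is a minimal sketch. AppendOnlyLog is a hypothetical class, not the API of any product listed in these notes, and it makes no attempt at availability or durability, which need replication and persistent storage.

```python
import threading


class AppendOnlyLog:
    """In-memory log: whole records in, sequence numbers out, no updates."""

    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def append(self, record: bytes) -> int:
        """Store one indivisible record and return its sequence number."""
        with self._lock:
            self._records.append(bytes(record))  # copy so callers cannot mutate it
            return len(self._records) - 1

    def read(self, start: int = 0):
        """Return (sequence_number, record) pairs from start, in append order."""
        with self._lock:
            snapshot = tuple(self._records[start:])
        return list(enumerate(snapshot, start=start))
```

Because the only write operation is append and sequence numbers are assigned under a lock, every call to read(0) returns the same records in the same order, with any newer records added at the end.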
Products and Solutions
- Commercial Solutions: Splunk, Sumo Logic
- Open Source Solutions: Kibana (the visualization front end of the Elastic/ELK stack)
- Facebook LogDevice
- Google Cloud Logging
- AWS Central Logging
- machine logs vs user logs
- real-time vs historical
- collected logs vs processed logs (sessionization, normalization, anonymization)
- debug logs (INFO/WARNING/ERROR) vs event logs (revenue logs, click logs)
- logs: retention, wipeout, takeout
Versioned Process Logs
Versioned directories (e.g. /prefix/YYYY/MM/DD/<version>/) should be used.
For Log Consumer
With non-versioned directories: suppose an analysis job is reading log files when the data processing pipeline finishes and updates the old data in place. The analysis job ends up reading half old files and half new files, with duplicated or missing data.
With versioned directories: the analysis job keeps reading the old files until it finishes, so it gets a consistent result. If it runs again, it can pick up the new data.
For Log Producer
With non-versioned directories: the new logs must be generated in place, or generated in another directory and then copied over.
With versioned directories: the new logs are generated in a new version directory, which can then be marked as ready (live) and made visible to consumers. Rolling back is also easier.
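The producer and consumer sides of versioned directories can be sketched as follows. This is one possible way to do it under stated assumptions, not a prescribed layout: the _READY marker file, the function names, the version names (v001, v002), and the example date are all hypothetical, and version names are assumed to sort lexicographically (for example, zero-padded).

```python
import os
import tempfile

READY_MARKER = "_READY"  # hypothetical marker file name


def publish_version(day_dir, version, files):
    """Write all files into a new version directory, then mark it ready."""
    version_dir = os.path.join(day_dir, version)
    # Raises if the version already exists, so data is never rewritten in place.
    os.makedirs(version_dir)
    for name, content in files.items():
        with open(os.path.join(version_dir, name), "w") as f:
            f.write(content)
    # The marker is written last, so consumers never pick a half-written version.
    open(os.path.join(version_dir, READY_MARKER), "w").close()
    return version_dir


def latest_ready_version(day_dir):
    """Return the newest version directory marked ready, or None."""
    ready = [
        v for v in sorted(os.listdir(day_dir))
        if os.path.exists(os.path.join(day_dir, v, READY_MARKER))
    ]
    return os.path.join(day_dir, ready[-1]) if ready else None


# Example: a consumer pins v001 while the producer publishes v002.
day_dir = os.path.join(tempfile.mkdtemp(), "2024", "01", "15")
os.makedirs(day_dir)
publish_version(day_dir, "v001", {"part-0.log": "old\n"})
pinned = latest_ready_version(day_dir)  # the consumer reads from here for its whole run
publish_version(day_dir, "v002", {"part-0.log": "new\n"})
```

The consumer that pinned v001 keeps seeing the old files unchanged, and a later run that calls latest_ready_version again picks up v002. Rolling back amounts to removing the marker from the bad version, so consumers fall back to the previous ready one.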
- remote debug logs (server logs), info, fatal, error
- query log
- production change logs: rollouts, command-line parameter changes
- binary crash logging