Observability

Categories: Systems
Sources: Site Reliability Engineering (Google)

Observability is the degree to which a system's internal health can be inferred from what it emits: metrics, logs, and traces. You cannot operate, debug, or improve what you cannot see. The practical core is to instrument a small set of user-centric signals (latency, traffic, errors, and saturation, which the SRE book calls the four golden signals, are a common starting set) and to alert on symptoms the user would feel rather than on internal causes that may or may not matter.

Why It Matters

Operations, incident response, and reliability targets all depend on measurement: an SLO is meaningless without an indicator behind it, and a failure cannot be understood without signal into what the system was doing. Observability is the sense organ that closes the loop between a system's behavior and the people responsible for it.
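To make "an indicator behind an SLO" concrete, here is a minimal sketch. It assumes a hypothetical availability SLI defined as the fraction of successful requests in a window; the function names, the 99.9% target, and the request counts are illustrative and do not come from the source.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served successfully in the measurement window."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as meeting the objective
    return good_requests / total_requests


def error_budget_remaining(sli: float, slo_target: float) -> float:
    """Share of the error budget still unspent; negative once the SLO is breached."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo_target  # failure fraction the SLO tolerates
    actual_failure = 1.0 - sli          # failure fraction actually observed
    return 1.0 - actual_failure / allowed_failure


# Hypothetical window: 1,000,000 requests, 500 of them failed.
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
budget = error_budget_remaining(sli, slo_target=0.999)
print(f"SLI={sli:.4%}, error budget remaining={budget:.0%}")
```

With these made-up numbers, roughly half of the error budget is spent. Without the indicator there would be nothing to compare the target against.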

Signals

The questions that come up during an incident can be answered from telemetry that already exists, rather than from instrumentation added after the fact.
Alerts fire on user-visible symptoms (errors, latency) rather than on every internal metric, keeping noise low.
Health is judged from the user's edge, not only from internal components that can look healthy while users suffer.
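The signals above can be sketched as a check over requests recorded at the user's edge. This is an illustration under stated assumptions, not a prescribed implementation: the `EdgeRequest` record, the `symptoms` function, and the thresholds (1% errors, 500 ms p99) are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class EdgeRequest:
    latency_ms: float  # latency as observed at the edge, not inside a backend
    status: int        # HTTP status code the client received


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def symptoms(requests: list[EdgeRequest],
             max_error_rate: float = 0.01,
             max_p99_ms: float = 500.0) -> list[str]:
    """Return the user-visible symptoms present in this window of requests."""
    if not requests:
        return []
    error_rate = sum(r.status >= 500 for r in requests) / len(requests)
    p99 = percentile([r.latency_ms for r in requests], 99)
    found = []
    if error_rate > max_error_rate:
        found.append(f"error rate {error_rate:.1%} exceeds {max_error_rate:.1%}")
    if p99 > max_p99_ms:
        found.append(f"p99 latency {p99:.0f} ms exceeds {max_p99_ms:.0f} ms")
    return found


# Hypothetical window: 97 fast successes and 3 slow server errors.
window = [EdgeRequest(120.0, 200)] * 97 + [EdgeRequest(900.0, 503)] * 3
for symptom in symptoms(window):
    print(symptom)
```

Only two conditions can fire here, and both describe something a user would notice. Internal metrics do not appear in the check at all.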

Benefits

Faster, better-grounded diagnosis; alerting that tracks real impact instead of generating fatigue; and the measurement substrate that SLOs, error budgets, and postmortems all require.

Risks

Measuring everything and alerting on all of it produces noise and fatigue, which hides real problems. Cause-based alerts proliferate and mislead; vanity metrics that look healthy while users suffer give false comfort. Instrumentation is itself work that can become its own source of toil.

Tensions

More signal aids debugging but raises cost, noise, and the risk of alert fatigue; less signal is cheaper and calmer but can leave you blind. The judgment is which few signals reveal user-facing health, not how much data can be collected.

Examples

Page on a breached latency or error-rate SLO (a symptom), and leave CPU saturation as a dashboard signal to consult during diagnosis rather than paging on every resource threshold.
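A minimal sketch of that routing decision, assuming each signal is tagged as a symptom or a cause. The `Signal` type, the `route` function, and the example values and thresholds are hypothetical and not tied to any particular alerting tool.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    SYMPTOM = "symptom"  # something the user would feel
    CAUSE = "cause"      # an internal condition that may or may not matter


@dataclass(frozen=True)
class Signal:
    name: str
    kind: Kind
    value: float
    threshold: float


def route(signal: Signal) -> str:
    """Decide where a signal goes: 'ok', 'page', or 'dashboard'."""
    if signal.value <= signal.threshold:
        return "ok"
    # Only breached symptoms wake a human; breached causes stay visible for diagnosis.
    return "page" if signal.kind is Kind.SYMPTOM else "dashboard"


# Hypothetical readings for one evaluation interval.
signals = [
    Signal("p99_latency_ms", Kind.SYMPTOM, value=820.0, threshold=500.0),
    Signal("error_rate", Kind.SYMPTOM, value=0.002, threshold=0.01),
    Signal("cpu_utilization", Kind.CAUSE, value=0.97, threshold=0.90),
]
for s in signals:
    print(f"{s.name}: {route(s)}")
```

Here the latency breach pages, the error rate is within its threshold, and the hot CPU is recorded on the dashboard for whoever responds to the page.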