Site Reliability Engineering (Google)
Main Argument
Reliability is the most important feature of any system, but more reliability is not always better: there is an optimal point past which chasing nines costs more than it returns. Treat operations as an engineering and software problem. Choose an explicit reliability target, measure against it, spend the remaining unreliability deliberately on change and velocity, automate away repetitive operational work, and learn from failure without blame. The durable content is the reasoning about targets, tradeoffs, and operational discipline, not Google's specific tools.
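To make "chasing nines" concrete, here is a minimal sketch that converts an availability target into the full-outage time it permits. The function name `allowed_downtime_minutes` and the 30-day window are assumptions chosen for illustration, not something the book prescribes.

```python
def allowed_downtime_minutes(availability: float, window_days: int = 30) -> float:
    """Minutes of total outage an availability target permits over the window."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    # The permitted unreliability is simply (1 - target) of the window.
    return (1.0 - availability) * window_minutes


# Print the downtime allowance for two through five nines.
for target in (0.99, 0.999, 0.9999, 0.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target:.3%} -> {minutes:.2f} min per 30 days")
```

Each added nine cuts the permitted downtime by a factor of ten (roughly 432, 43.2, 4.32, and 0.43 minutes per 30 days), while the cost of achieving it generally rises. That gap is why the target should be chosen explicitly, not maximized.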
Key Takeaways
- 100% is the wrong reliability target. Pick a Service Level Objective from indicators that track user-visible health, and manage to it rather than to "as reliable as possible."
- The allowable unreliability (1 − SLO) is an error budget: a quantified amount of risk to spend on releases and change. It turns the developer-versus-operations tension into one shared, measurable tradeoff.
- Toil (manual, repetitive, automatable work that scales with the service and creates no lasting value) must be capped and engineered away, so that engineering effort compounds over time instead of operational work growing with load.
- You cannot operate what you cannot see. Instrument services for a small set of user-centric signals (latency, traffic, errors, saturation) and alert on symptoms, not causes.
- When systems fail, write blameless postmortems: assume good intent, look for systemic and latent causes rather than a culprit, and feed fixes back into the system. Failure is a property of the system, not the operator.
- Reliability depends on disciplined change: most incidents follow a change, so roll out gradually (canarying) to limit blast radius, and back risky changes with automated testing.
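The error-budget takeaway above reduces to simple arithmetic, sketched below for a request-based SLO. The class `ErrorBudget`, its field names, and the "release only while budget remains" rule are illustrative assumptions. Real policies vary by team and are usually negotiated between development and operations.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float            # target fraction of good requests, e.g. 0.999
    total_requests: int   # requests served in the measurement window
    failed_requests: int  # requests that violated the indicator

    @property
    def allowed_failures(self) -> float:
        # The budget is (1 - SLO) of all requests in the window.
        return (1.0 - self.slo) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # Share of the budget still unspent; negative once it is overspent.
        if self.allowed_failures == 0:
            return 0.0  # a 100% SLO leaves nothing to spend
        return 1.0 - self.failed_requests / self.allowed_failures

    def can_release(self) -> bool:
        # Hypothetical policy: ship changes only while budget remains.
        return self.remaining_fraction > 0.0


budget = ErrorBudget(slo=0.999, total_requests=1_000_000, failed_requests=250)
print(f"allowed failures: {budget.allowed_failures:.0f}")
print(f"budget remaining: {budget.remaining_fraction:.0%}")
print(f"release allowed:  {budget.can_release()}")
```

With a 99.9% SLO over one million requests, about 1,000 failures are allowed. After 250 failures, roughly three quarters of the budget is left to spend on change. Note that a 100% SLO yields a zero budget and therefore blocks every release under this rule, which is one way to see why 100% is the wrong target.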