Error Budgets

Categories
Systems
Sources
Site Reliability Engineering (Google)

The error budget is the amount of unreliability a service is allowed to spend over a period: one minus the Service Level Objective. If the objective is 99.9%, the budget is the remaining 0.1% of failed or slow requests. Because some failure is permitted by design, the budget becomes a resource that can be spent on releasing changes, running experiments, and taking risks, until it runs out.

Why it Matters

It dissolves the structural conflict between development, which wants to ship, and operations, which wants stability, by making both answer to one number. While budget remains, the team can move fast; when it is exhausted, the policy shifts to protecting reliability until the budget recovers. The argument stops being about temperament and becomes about a shared, measured quantity.

Signals

  • Release decisions reference how much budget is left, not who is more cautious.
  • Spending the budget is treated as normal and useful, not as failure to be driven to zero.
  • A written policy says what happens when the budget is exhausted (slow or freeze risky change).

Benefits

An objective, self-correcting control on the pace of change: velocity and reliability are balanced automatically by the budget rather than negotiated by authority, and the incentive to hide or fear failure drops because a budget exists to be used.

Risks

A budget with no policy behind it changes nothing. Gaming the indicators, or resetting the budget after every breach, removes the feedback. Treating any spend as a problem recreates the "100% or bust" mindset the budget was meant to replace.

Tensions

Spending the budget buys speed now at the cost of reliability headroom; saving it buys stability at the cost of slower delivery. The budget makes the tradeoff explicit but does not remove it: someone still chooses how aggressively to spend.

Examples

A team that has burned its monthly budget halts feature rollouts and shifts to reliability work until the next window; a team with budget to spare runs a risky migration behind a canary, accepting that it may consume part of the budget.