Change Introduces New Failure Modes

Categories
Systems
Sources
How Complex Systems Fail, Site Reliability Engineering (Google)

Every change to a complex system, including changes that fix problems or add safety, creates new and often unforeseen paths to failure. Improvement and new risk arrive together.

Why it Matters

Changes alter the web of interactions and consume the margin that absorbed past variation, so the system's failure landscape shifts with each one. Low-frequency, high-consequence failures are especially likely to be introduced by well-intentioned changes whose downside is not yet visible.

Signals

  • A new failure mode appearing shortly after an upgrade, optimization, or added safeguard.
  • "That used to be impossible before we changed X."
  • A risk reappearing in a new form after a fix.

Benefits

Pacing and reviewing change with its new risks in view; treating each change as a hypothesis to monitor rather than a settled improvement.

Risks

Assuming a change is purely positive; changing faster than the new failure modes can be learned; safety improvements that quietly enable new catastrophes.

Tensions

Systems must change to improve and adapt, yet every change reintroduces uncertainty and risk; progress competes with stability.

Examples

Adding a cache that speeds the system but creates a new class of stale-data incidents; a new automated safeguard that operators come to over-trust.