Partial Failure

Categories: Systems
Sources: Designing Data-Intensive Applications

In a distributed system, some parts can be broken while others keep working, and a node often cannot tell whether a remote node has failed, whether it is merely slow, or whether the network dropped the message. Unlike a single machine, which typically either works or crashes, a distributed system fails partially and nondeterministically.

Why It Matters

The defining difficulty of distributed systems is not that things fail but that you cannot reliably detect what failed. Every remote interaction must assume messages may be lost, delayed, duplicated, or reordered, and a timeout is a guess, not a fact.
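One way to keep "a timeout is a guess" visible in code is to give the caller three outcomes instead of two. The sketch below is illustrative only: `Outcome`, `classify`, and `lost_reply` are hypothetical names, and the remote call is simulated by a plain function that raises `TimeoutError`.

```python
from enum import Enum


class Outcome(Enum):
    SUCCEEDED = "succeeded"  # a reply arrived confirming the operation
    FAILED = "failed"        # a reply arrived rejecting the operation
    UNKNOWN = "unknown"      # no reply in time: the operation may or may not have been applied


def classify(call, timeout_s):
    """Run a (hypothetical) remote call and report only what the caller actually knows."""
    try:
        reply = call(timeout_s)
    except TimeoutError:
        # The request, the remote node, or the reply may have been lost or delayed.
        # The caller cannot distinguish these cases from here.
        return Outcome.UNKNOWN
    return Outcome.SUCCEEDED if reply else Outcome.FAILED


def lost_reply(timeout_s):
    """Simulates a request whose reply never arrives within the timeout."""
    raise TimeoutError(f"no reply within {timeout_s:.1f}s")


result = classify(lost_reply, timeout_s=1.0)
```

Callers that receive `Outcome.UNKNOWN` then have to decide explicitly whether a retry is safe, rather than silently treating the timeout as a failure.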

Signals

Code that assumes a remote call either clearly succeeds or clearly fails.
Treating a timeout as proof that a node is dead.
Code that "worked in testing" on a single machine and then breaks in a cluster.

Benefits

Designing for partial failure yields systems that degrade gracefully instead of corrupting data or hanging.

Risks

Building on the assumption of a reliable network. Acting on a false failure detection: for example, two nodes each conclude the other is dead and both act as leader.
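One commonly described safeguard against two nodes both acting as leader is a fencing token: each newly granted leadership carries a higher number, and the shared resource rejects writes that carry an older one. The sketch below is a minimal illustration under assumptions. `Storage` and its `write` method are hypothetical, and the service that hands out increasing tokens is not shown.

```python
class Storage:
    """Hypothetical shared resource that rejects writes from stale leaders."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token, value):
        # A token lower than one already seen means the writer's leadership
        # has been superseded, even if the writer still believes it is leader.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.value = value
        return True


storage = Storage()
storage.write(token=1, value="from old leader")             # accepted
storage.write(token=2, value="from new leader")             # accepted, raises the highest token
stale = storage.write(token=1, value="old leader again")    # rejected: token 1 < 2
```

The check lives in the resource rather than in the leaders, because a node that was wrongly declared dead has no reliable way to know it has been replaced.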

Tensions

Longer timeouts reduce false failure detection but slow recovery; shorter timeouts recover fast but misjudge slow nodes as dead. No perfect failure detector exists over an asynchronous network.
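The trade-off can be made concrete with a toy timeout-based detector. This is a sketch with made-up numbers: `suspects` is a hypothetical helper, and the 3-second pause stands in for any slow but live node.

```python
def suspects(silence_s, timeout_s):
    """Return True if a node silent for silence_s would be suspected dead.

    This is a suspicion based on elapsed time, not knowledge that the node failed.
    """
    return silence_s > timeout_s


PAUSE_S = 3.0  # assumed: a live node that is unresponsive for 3 seconds

for timeout_s in (1.0, 5.0):
    false_alarm = suspects(PAUSE_S, timeout_s)
    # A node that has really crashed stays silent, so it is first suspected
    # only after timeout_s has elapsed: the timeout is also the detection delay.
    print(f"timeout={timeout_s}s  false_alarm_on_paused_node={false_alarm}  "
          f"crash_detected_after>{timeout_s}s")
```

With the short timeout the paused node is wrongly suspected; with the long one it is not, but a real crash goes unnoticed for longer.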

Examples

A client cannot tell whether a request succeeded after a timeout, so a naive retry applies it twice. A network partition leaves two halves, each believing the other has gone down.
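The double-apply problem in the first example is often handled by making the operation idempotent, for instance by attaching a client-generated request ID that the server uses to deduplicate. The following is a minimal in-memory sketch. `Account` and `deposit` are hypothetical, and a real server would need to persist the seen IDs.

```python
import uuid


class Account:
    """Hypothetical server-side state that deduplicates by request ID."""

    def __init__(self):
        self.balance = 0
        self._seen = {}  # request_id -> result already returned for that request

    def deposit(self, request_id, amount):
        # A repeated request ID means this is a retry: return the earlier
        # result instead of applying the deposit a second time.
        if request_id in self._seen:
            return self._seen[request_id]
        self.balance += amount
        self._seen[request_id] = self.balance
        return self.balance


account = Account()
request_id = str(uuid.uuid4())  # generated once by the client, reused on retry

account.deposit(request_id, 100)  # applied, but suppose the reply is lost
account.deposit(request_id, 100)  # client times out and retries with the same ID
```

Because the retry reuses the same ID, the client no longer needs to know whether the first attempt succeeded before retrying.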