Derived Data

Categories
Architecture
Sources
Designing Data-Intensive Applications, Out of the Tar Pit

Data that can be recomputed from another source, such as caches, search indexes, materialized views, and aggregations. It is distinguished from the system of record, which is the authoritative source of truth. If derived data is lost, it can be rebuilt from the system of record.

Why It Matters

Separating the source of truth from derived views clarifies which data must be protected and which is disposable. It also frees you to maintain many specialized views of the same underlying data (an index here, a cache there) without creating competing sources of truth. Out of the Tar Pit reaches the same conclusion from a complexity angle: derived data should be defined as a function of the essential state rather than stored and mutated separately, because every separately stored copy is accidental state that can drift.
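The idea of a view defined as a function of essential state can be sketched in a few lines of Python. This is only an illustration: the ORDERS records and the spend_by_customer function are hypothetical names, not taken from either source.

```python
from collections import defaultdict

# Essential state: the system of record (hypothetical order records).
ORDERS = [
    {"id": 1, "customer": "ana", "total": 30},
    {"id": 2, "customer": "ben", "total": 15},
    {"id": 3, "customer": "ana", "total": 20},
]


def spend_by_customer(orders):
    """Derived view: a pure function of the orders, never edited on its own."""
    totals = defaultdict(int)
    for order in orders:
        totals[order["customer"]] += order["total"]
    return dict(totals)


# The view can be thrown away and recomputed at any time.
view = spend_by_customer(ORDERS)
```

Because the view is only ever produced by calling the function, there is no second writable copy that could disagree with the orders.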

Signals

  • Confusion over which store is authoritative.
  • A cache or index treated as if it were the source of truth.
  • Multiple independently writable copies of the same data that can disagree.

Benefits

Each view optimized for its own access pattern; derived stores can be rebuilt, re-indexed, or replaced without risking the source of truth; clearer ownership of correctness.

Risks

Writing to a derived store as if it were authoritative; derived data drifting out of sync with the source; mistaking an irreplaceable store for derived data and losing it.
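One way to surface drift is to recompute the view and compare it with what is stored. The sketch below assumes both stores fit in memory as plain dicts; find_drift and word_counts are hypothetical helpers invented for illustration.

```python
def find_drift(source, derived, derive):
    """Return keys whose stored derived value differs from a fresh recomputation."""
    expected = derive(source)
    keys = expected.keys() | derived.keys()
    return {k for k in keys if expected.get(k) != derived.get(k)}


def word_counts(docs):
    """Hypothetical derivation: number of words per document."""
    return {doc_id: len(text.split()) for doc_id, text in docs.items()}


docs = {"a": "derived data can be rebuilt", "b": "system of record"}
# Stored view: "b" is stale, and "c" no longer exists in the source.
stored = {"a": 5, "b": 2, "c": 7}

drifted = find_drift(docs, stored, word_counts)

# Repair by rebuilding from the source, never by editing the source to match.
stored = word_counts(docs)
```

The repair always flows in one direction, from the system of record to the derived store.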

Tensions

More derived views speed reads but add the cost of keeping them consistent with the source; denormalization improves performance while duplicating data that must then be kept in step.

Examples

A search index rebuilt from the primary database; a cache that can be flushed and repopulated; a materialized rollup recomputed from raw events.
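The first example can be sketched as a tiny inverted index that is rebuilt in full from the primary rows. This is a toy under stated assumptions: the primary dict stands in for a database table, and build_index is a hypothetical function, not any search engine's API.

```python
from collections import defaultdict


def build_index(rows):
    """Build an inverted index (term -> set of row ids) from the primary rows."""
    index = defaultdict(set)
    for row_id, text in rows.items():
        for term in text.lower().split():
            index[term].add(row_id)
    return dict(index)


# Stand-in for the primary database (hypothetical rows).
primary = {1: "Red bicycle", 2: "Blue bicycle", 3: "Red kettle"}

index = build_index(primary)
index.clear()                 # simulate losing the derived store
index = build_index(primary)  # full rebuild from the system of record
```

Losing the index costs rebuild time but no information, which is what marks it as derived.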