Metastable Failures and Interactions Between Systems
12 hours ago
- #system-failures
- #retry-storms
- #feedback-loops
- Metastable failures are self-sustaining performance failures: a positive feedback loop keeps the system degraded even after the original trigger is gone.
- A classic example is a retry storm: an overloaded system fails or times out requests, clients retry, and the retries add load that keeps the system overloaded.
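- Systems interact through signals and actions, but a signal can be ambiguous (a timeout, for instance, may come from a crashed peer or an overloaded one), so the action taken in response can be the wrong one.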
- Avoiding metastable failures involves minimizing unnecessary interactions, avoiding positive feedback actions, and reducing signal ambiguity.
- Some of these actions are required by the system or by an algorithm's design, so metastable failures cannot always be avoided completely.
- Mitigation strategies include minimizing feedback loops and using multiple signals to disambiguate states.
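
One common way to weaken the retry feedback loop described above is a retry budget, sketched here in Python. The summary does not name this technique, so treat it as an illustrative assumption rather than the post's own method. The names `RetryBudget`, `call_with_budget`, `ratio`, and `max_tokens` are hypothetical. The idea is that retries are only allowed as a bounded fraction of first attempts, so a failing dependency sees at most a small, fixed amplification of load instead of a storm.

```python
class RetryBudget:
    """Token bucket that caps retries to a fraction of first attempts.

    Hypothetical helper for illustration; not taken from any library.
    """

    def __init__(self, ratio: float = 0.1, max_tokens: float = 10.0) -> None:
        self.ratio = ratio            # tokens earned per first attempt
        self.max_tokens = max_tokens  # upper bound on saved-up retries
        self.tokens = max_tokens      # start with a full bucket

    def record_first_attempt(self) -> None:
        # Each new (non-retry) request earns a fraction of a retry token.
        self.tokens = min(self.max_tokens, self.tokens + self.ratio)

    def try_spend(self) -> bool:
        # A retry is allowed only if a whole token is available.
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


def call_with_budget(operation, budget: RetryBudget, max_attempts: int = 3):
    """Run operation(), retrying on failure only while the budget allows."""
    budget.record_first_attempt()
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            is_last = attempt == max_attempts - 1
            # Give up when attempts are exhausted or the budget is empty,
            # instead of adding more load to a struggling dependency.
            if is_last or not budget.try_spend():
                raise
```

With `ratio=0.1`, sustained failures let through roughly one retry per ten first attempts once the initial bucket is drained. A real client would usually combine this with backoff and jitter, which are omitted here to keep the sketch short.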