What Now? Handling Errors in Large Systems
13 hours ago
- #error-handling
- #Rust
- #system-architecture
- Cloudflare's outage postmortem highlighted the debate around error handling, specifically the use of Rust's `unwrap`, which panics when called on an `Err` or `None` value instead of returning the error to the caller.
- Error handling decisions should consider whether failures are correlated, whether they can be handled at a higher layer, and whether meaningful continuation is possible.
- System architecture plays a crucial role in determining the appropriateness of crashing versus continuing, with fine-grained architectures better suited to handle higher error rates.
- The complexity of error handling is a global property of the system: strategies must be built in from the beginning, and techniques like blast radius reduction help mitigate the risk.
- Rust's `unwrap` could be more explicitly named (e.g., `or_panic`) or require justification to improve clarity and safety in error handling.
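
To make the trade-off in the points above concrete, here is a minimal Rust sketch, not taken from the postmortem. The names `parse_limit` and `reload_limit` are hypothetical, and falling back to the last known-good value is only an assumed policy for illustration; whether continuing is meaningful depends on the system. It contrasts a panic whose justification is written down (the standard library's `expect`) with an error that is returned to a higher layer, which then decides what to do.

```rust
use std::num::ParseIntError;

/// Hypothetical parser for a configuration value.
/// It returns the error to the caller instead of panicking.
fn parse_limit(raw: &str) -> Result<u32, ParseIntError> {
    raw.trim().parse::<u32>()
}

/// Hypothetical higher layer that decides how to handle a bad value.
/// Assumed policy: keep the last known-good limit rather than crash.
fn reload_limit(current: u32, raw: &str) -> u32 {
    match parse_limit(raw) {
        Ok(limit) => limit,
        Err(err) => {
            eprintln!("rejecting new limit {raw:?}: {err}; keeping {current}");
            current
        }
    }
}

fn main() {
    // Panicking form: `expect` behaves like `unwrap`, but the message
    // records why the author believes the failure cannot happen.
    let default_limit = parse_limit("200").expect("hard-coded default is a valid u32");

    // Recoverable form: an invalid update is rejected and the old value stays.
    let limit = reload_limit(default_limit, "not-a-number");
    println!("limit after bad update = {limit}");

    // A valid update replaces the old value.
    let limit = reload_limit(limit, "500");
    println!("limit after good update = {limit}");
}
```

The `expect` call is reasonable here only because the input is a literal under the author's control. For data arriving from outside the process, returning the `Result` keeps the decision to crash or continue at the layer that knows whether the failure is likely to be correlated across instances.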