Passwords and Power Drills
3 days ago
- #security
- #reliability
- #incident-response
- A Google-wide announcement about a WiFi password change caused a cascading failure in an internal password manager due to unexpected traffic spikes.
- Recovery was complicated by the system's own security measures: restarting the service required a hardware security module (HSM) smart card, which was locked in a safe that engineers could not initially open.
- Engineers eventually recovered the system by drilling the safe open with a power drill and then realizing that the smart card had been inserted into the reader incorrectly.
- The incident highlighted the interplay between reliability (load balancing) and security (HSM requirements), showing how design considerations for one can impact the other.
- Reliability and security are both crucial but require different design approaches: reliability assumes non-malicious failures, while security assumes active adversaries.
- Examples from aviation and data storage illustrate how reliability issues (e.g., hardware flaws) can lead to security problems (e.g., confidentiality breaches).
- Denial-of-service (DoS) attacks blur the line between reliability and security, as they can stem from malicious intent or legitimate traffic spikes.
- System complexity and small changes can lead to major failures, as seen in the Debian OpenSSL bug and a YouTube outage caused by a logging library update.
- Defense in depth, least privilege, and multi-party authorization are key strategies to mitigate both reliability and security risks.
- Incident response plans, such as Google's IMAG (Incident Management at Google) program, are critical for managing crises, and regular exercises such as DiRT (Disaster Recovery Testing) prepare teams for emergencies.
- Recovering from a security failure often means trading speed against reliability: a patch rushed out to close a vulnerability is more likely to introduce new problems.
- The book emphasizes the importance of integrating security and reliability considerations early in system design to avoid costly fixes later.
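
Multi-party authorization, one of the strategies listed above, can be illustrated with a minimal sketch. This is not code from the book or from any Google system. The `SensitiveAction` class, its fields, and the two-approver threshold are all hypothetical, chosen only to show the idea that a privileged operation should not run on one person's say-so.

```python
from dataclasses import dataclass, field


@dataclass
class SensitiveAction:
    """Hypothetical privileged operation that needs sign-off before it runs."""

    name: str
    requester: str
    required_approvals: int = 2  # assumed threshold, not taken from the source
    approvers: set = field(default_factory=set)

    def approve(self, user: str) -> None:
        # The requester may not approve their own request.
        if user == self.requester:
            raise PermissionError("requester cannot approve their own action")
        # A set means a repeated approval from the same user counts only once.
        self.approvers.add(user)

    def is_authorized(self) -> bool:
        return len(self.approvers) >= self.required_approvals

    def execute(self) -> str:
        # Refuse to run until enough distinct people have approved.
        if not self.is_authorized():
            missing = self.required_approvals - len(self.approvers)
            raise PermissionError(f"{missing} more approval(s) required")
        return f"executed {self.name}"


if __name__ == "__main__":
    action = SensitiveAction(name="rotate-hsm-key", requester="alice")
    action.approve("bob")
    action.approve("carol")
    print(action.execute())
```

The same check serves both goals from the list: it limits what a single compromised account can do (security) and it gives a second person the chance to catch an honest mistake before it ships (reliability).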