Passwords and Power Drills

6 months ago

#security
#reliability
#incident-response

A Google-wide announcement about a WiFi password change triggered a cascading failure in an internal password manager, which could not handle the sudden spike in traffic.
The recovery process was complicated by security measures, including the need for a hardware security module (HSM) smart card stored in a safe, which was initially inaccessible.
Engineers eventually recovered the system by drilling open the safe with a power drill and then discovering that the smart card had been inserted incorrectly.
The incident highlighted the interplay between reliability (load balancing) and security (HSM requirements), showing how design considerations for one can impact the other.
Reliability and security are both crucial but require different design approaches: reliability assumes non-malicious failures, while security assumes active adversaries.
Examples from aviation and data storage illustrate how reliability issues (e.g., hardware flaws) can lead to security problems (e.g., confidentiality breaches).
Denial-of-service (DoS) attacks blur the line between reliability and security, as they can stem from malicious intent or legitimate traffic spikes.
System complexity and small changes can lead to major failures, as seen in the Debian OpenSSL bug and a YouTube outage caused by a logging library update.
Defense in depth, least privilege, and multi-party authorization are key strategies to mitigate both reliability and security risks.
Incident response plans, like Google's Incident Management at Google (IMAG) program, are critical for managing crises, and regular exercises such as Disaster Recovery Testing (DiRT) prepare teams for emergencies.
Recovering from security failures often involves trade-offs between speed and reliability when patching vulnerabilities.
The book emphasizes the importance of integrating security and reliability considerations early in system design to avoid costly fixes later.
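
Of the strategies named above, multi-party authorization is the easiest to show in a few lines. The following is a minimal sketch, not the mechanism Google or the book describes: the `SensitiveAction` class, its fields, and the user names are all hypothetical, chosen only to illustrate that the person requesting a risky action cannot be the one who approves it.

```python
from dataclasses import dataclass, field


@dataclass
class SensitiveAction:
    """Hypothetical high-risk operation that needs sign-off before it runs."""
    description: str
    requester: str
    required_approvals: int = 1
    approvers: set[str] = field(default_factory=set)

    def approve(self, user: str) -> None:
        # The requester may not approve their own action; a second person
        # must review it. This is the core of multi-party authorization.
        if user == self.requester:
            raise PermissionError("requester cannot approve their own action")
        self.approvers.add(user)

    def is_authorized(self) -> bool:
        # Authorized once enough distinct non-requesters have approved.
        return len(self.approvers) >= self.required_approvals


action = SensitiveAction("restart password manager", requester="alice")
before = action.is_authorized()   # no approvals yet
action.approve("bob")
after = action.is_authorized()    # one independent approval recorded
```

The same check helps on both fronts: it stops a single compromised account from acting alone (security) and gives a second person the chance to catch an honest mistake (reliability).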
