I Dropped Our Production Database and Now Pay 10% More for AWS
10 hours ago
- #Terraform
- #Disaster Recovery
- #AWS
- A Terraform command executed by an AI agent accidentally wiped out the production infrastructure for DataTalks.Club, including a database with 2.5 years of course submissions.
- The incident happened because an AI agent (Claude Code) was trusted to run Terraform commands without proper oversight, which led to the deletion of both the database and its automated snapshots.
- The author engaged AWS Business Support for faster assistance, which adds 10% to cloud expenses, and the database was successfully restored after 24 hours.
- Key missteps included reusing an existing Terraform setup for a new project, not migrating Terraform state to the new computer, and allowing the AI agent to execute destructive commands without manual review.
- Post-incident, several safeguards were implemented: backups outside Terraform state, daily restore tests with Lambda, deletion protection in Terraform and AWS, S3 backup protection, and moving Terraform state to S3.
- Lessons learned include the dangers of over-reliance on AI agents for critical operations, the importance of manual reviews for destructive commands, and the necessity of independent backup systems.
- The author also shared updates on recent projects, including the AI Engineering Buildcamp, an AI Engineering Newsletter series, live research sessions, and upcoming workshops on data engineering and Apache Flink.
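
The lesson about manually reviewing destructive commands can be sketched as a small gate that runs before any apply. This is a minimal illustration, not the author's actual setup: it assumes the plan was exported with `terraform plan -out=tfplan` followed by `terraform show -json tfplan > plan.json`, and it relies on the `resource_changes`, `address`, and `change.actions` fields of Terraform's JSON plan output. The function names `destructive_changes` and `require_human_approval` are hypothetical.

```python
import json


def destructive_changes(plan: dict) -> list[str]:
    """Return the addresses of resources the plan would delete or replace.

    In Terraform's JSON plan output, a pure destroy has actions ["delete"],
    and a replacement has ["delete", "create"] or ["create", "delete"].
    """
    flagged = []
    for resource in plan.get("resource_changes", []):
        actions = resource.get("change", {}).get("actions", [])
        if "delete" in actions:
            flagged.append(resource["address"])
    return flagged


def require_human_approval(plan: dict, approved: bool = False) -> list[str]:
    """Raise unless a human has explicitly approved every destructive change.

    `approved` is a hypothetical flag: an agent or CI job should never set it
    on its own; it is meant to come from a person who has read the plan.
    """
    flagged = destructive_changes(plan)
    if flagged and not approved:
        raise PermissionError(
            "Plan destroys or replaces resources; manual review required: "
            + ", ".join(flagged)
        )
    return flagged


def load_plan(path: str) -> dict:
    """Load a plan previously exported with `terraform show -json`."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)
```

One possible use is to call `require_human_approval(load_plan("plan.json"))` in the wrapper script an agent uses, so that any plan containing a delete stops with an error until a person reruns it with `approved=True`. The check is deliberately coarse: it flags every delete or replace, regardless of resource type, on the assumption that a false alarm is cheaper than a lost database.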