Production tests: a guidebook for better systems and more sleep
- #Production Testing
- #DevOps
- #Software Reliability
- Customers expect full site functionality at all times, necessitating near-perfect uptime.
- Production tests (synthetics) offer immediate failure notifications in production environments.
- Setting up production tests is quick (within one sprint) and offers high ROI.
- Atlassian's use of 'pollinators' showcases production tests' value in early problem detection.
- Production tests are automated, frequent (e.g., every minute), and can emulate user actions via headless browsers or API calls.
- Tests should be simple, fast (≤30 seconds), and integrate with alerting systems like Slack or paging.
- They enhance reliability by providing immediate warning of regressions, and can double as canary checks during deployments.
- Design considerations include keeping tests basic to avoid false alarms and ensuring they don't overly impact system resources.
- Good test examples include login verification and simple CRUD operations; bad examples are overly complex or timing-sensitive checks.
- Production tests differ from health checks, though the two can overlap; a well-designed test neither raises false alarms nor is so simplistic that it misses real failures.
- Tests improve observability, especially in low-traffic regions, but may add noise or costs.
- Fake data and test accounts require careful management to avoid expiration or storage issues.
- Implementing a 'three strikes' rule for alerts reduces false alarms while maintaining oversight.
- Pros include real-world testing, quality control, troubleshooting aid, and safer deployments.
- Cons involve setup challenges, potential flakiness, resource costs, and maintenance efforts.
- Observability tools complement production tests by monitoring real traffic for issues like latency or failures.
- Both production tests and observability are recommended for comprehensive monitoring.
- Regular review and adjustment of production tests ensure continued value as systems evolve.
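The simple, fast (≤30 seconds) check described in the bullets above can be sketched as a small harness. This is a minimal illustration, not the article's implementation: the `check` callable stands in for whatever the real test does (a headless-browser login, an API call, a CRUD round trip).

```python
import time

# Latency budget from the guideline above: a production test
# should complete in 30 seconds or less.
LATENCY_BUDGET_SECONDS = 30.0


def run_synthetic_check(check, budget=LATENCY_BUDGET_SECONDS):
    """Run one synthetic check; return (passed, elapsed_seconds).

    `check` is any zero-argument callable that raises on failure,
    e.g. a function that logs in via a headless browser or hits an API.
    """
    start = time.monotonic()
    try:
        check()
    except Exception:
        return (False, time.monotonic() - start)
    elapsed = time.monotonic() - start
    # A slow success still fails the check: slowness is a regression too.
    return (elapsed <= budget, elapsed)


# Usage with a stand-in check; a real one would exercise your
# login flow or a simple CRUD operation, as suggested above.
ok, elapsed = run_synthetic_check(lambda: None)
```

A scheduler (cron, or your monitoring platform) would invoke this every minute and feed the result into alerting.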
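The "three strikes" alerting rule mentioned above, where an alert fires only after several consecutive failures, can be sketched as a small state machine. The class name and threshold default are illustrative assumptions, not the article's code.

```python
class ThreeStrikes:
    """Suppress alerts until a check fails three times in a row.

    This reduces false alarms from transient blips while still
    paging on sustained failures.
    """

    def __init__(self, strikes_to_alert=3):
        self.strikes_to_alert = strikes_to_alert
        self.consecutive_failures = 0

    def record(self, passed):
        """Record one check result; return True when an alert should fire."""
        if passed:
            # Any success resets the count: only *consecutive* failures alert.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.strikes_to_alert
```

When `record` returns `True`, the harness would notify Slack or a paging system, per the integration point noted above.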
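The note above about fake data and test accounts needing careful management (to avoid expiry surprises and storage bloat) implies some cleanup policy. Here is one hedged sketch: the record shape, tag value, and retention window are all hypothetical assumptions for illustration.

```python
import time

SYNTHETIC_TAG = "synthetic-test"   # assumed tag marking fake records
MAX_AGE_SECONDS = 24 * 3600        # assumed 24-hour retention for test data


def records_to_purge(records, now=None, max_age=MAX_AGE_SECONDS):
    """Select synthetic-test records old enough to delete.

    `records` is an iterable of dicts with 'tag' and 'created_at'
    (epoch seconds) keys; this shape is an assumption, not from
    the article. Real records are never selected.
    """
    now = time.time() if now is None else now
    return [
        r for r in records
        if r.get("tag") == SYNTHETIC_TAG and now - r["created_at"] > max_age
    ]
```

Running a purge like this on a schedule keeps synthetic data from accumulating alongside real customer data.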