Production tests: a guidebook for better systems and more sleep
- #Production Testing
- #DevOps
- #Software Reliability
- Customers expect full site functionality at all times, necessitating near-perfect uptime.
- Production tests (synthetics) offer immediate failure notifications in production environments.
- Setting up production tests is quick (within one sprint) and offers high ROI.
- Atlassian's use of 'pollinators' showcases production tests' value in early problem detection.
- Production tests are automated, frequent (e.g., every minute), and can emulate user actions via headless browsers or API calls.
- Tests should be simple, fast (≤30 seconds), and integrate with alerting systems like Slack or paging.
- They enhance reliability by providing immediate warning of regressions, and can double as canary checks during deployments.
- Design considerations include keeping tests basic to avoid false alarms and ensuring they don't overly impact system resources.
- Good test examples include login verification and simple CRUD operations; bad examples are overly complex or timing-sensitive checks.
- Production tests differ from health checks, though the two can overlap; a well-designed test neither raises false alarms nor is so simplistic that it misses real failures.
- Tests improve observability, especially in low-traffic regions, but may add noise or costs.
- Fake data and test accounts require careful management to avoid expiration or storage issues.
- Implementing a 'three strikes' rule for alerts reduces false alarms while maintaining oversight.
- Pros include real-world testing, quality control, troubleshooting aid, and safer deployments.
- Cons involve setup challenges, potential flakiness, resource costs, and maintenance efforts.
- Observability tools complement production tests by monitoring real traffic for issues like latency or failures.
- Both production tests and observability are recommended for comprehensive monitoring.
- Regular review and adjustment of production tests ensure continued value as systems evolve.
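The simple, fast (≤30 seconds) check described in the bullets above can be sketched as a small harness. This is a minimal illustration, not the article's implementation: the `check` callable stands in for whatever the real test does (a headless-browser login, an API call, a CRUD round trip).

```python
import time

# Latency budget from the guideline above: a production test
# should complete in 30 seconds or less.
LATENCY_BUDGET_SECONDS = 30.0


def run_synthetic_check(check, budget=LATENCY_BUDGET_SECONDS):
    """Run one synthetic check; return (passed, elapsed_seconds).

    `check` is any zero-argument callable that raises on failure,
    e.g. a function that logs in via a headless browser or hits an API.
    """
    start = time.monotonic()
    try:
        check()
    except Exception:
        return (False, time.monotonic() - start)
    elapsed = time.monotonic() - start
    # A slow success still fails the check: slowness is a regression too.
    return (elapsed <= budget, elapsed)


# Usage with a stand-in check; a real one would exercise your
# login flow or a simple CRUD operation, as suggested above.
ok, elapsed = run_synthetic_check(lambda: None)
```

A scheduler (cron, or your monitoring platform) would invoke this every minute and feed the result into alerting.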
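The "three strikes" alerting rule mentioned above, where an alert fires only after several consecutive failures, can be sketched as a small state machine. The class name and threshold default are illustrative assumptions, not the article's code.

```python
class ThreeStrikes:
    """Suppress alerts until a check fails three times in a row.

    This reduces false alarms from transient blips while still
    paging on sustained failures.
    """

    def __init__(self, strikes_to_alert=3):
        self.strikes_to_alert = strikes_to_alert
        self.consecutive_failures = 0

    def record(self, passed):
        """Record one check result; return True when an alert should fire."""
        if passed:
            # Any success resets the count: only *consecutive* failures alert.
            self.consecutive_failures = 0
            return False
        self.consecutive_failures += 1
        return self.consecutive_failures >= self.strikes_to_alert
```

When `record` returns `True`, the harness would notify Slack or a paging system, per the integration point noted above.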
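The note above about fake data and test accounts needing careful management (to avoid expiry surprises and storage bloat) implies some cleanup policy. Here is one hedged sketch: the record shape, tag value, and retention window are all hypothetical assumptions for illustration.

```python
import time

SYNTHETIC_TAG = "synthetic-test"   # assumed tag marking fake records
MAX_AGE_SECONDS = 24 * 3600        # assumed 24-hour retention for test data


def records_to_purge(records, now=None, max_age=MAX_AGE_SECONDS):
    """Select synthetic-test records old enough to delete.

    `records` is an iterable of dicts with 'tag' and 'created_at'
    (epoch seconds) keys; this shape is an assumption, not from
    the article. Real records are never selected.
    """
    now = time.time() if now is None else now
    return [
        r for r in records
        if r.get("tag") == SYNTHETIC_TAG and now - r["created_at"] > max_age
    ]
```

Running a purge like this on a schedule keeps synthetic data from accumulating alongside real customer data.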