Taming P99s in OpenFGA: How we built a self-tuning strategy planner
- #latency-optimization
- #authorization-systems
- #reinforcement-learning
- Tail latency reduction is crucial for latency-critical systems like OpenFGA, an open-source authorization system.
- OpenFGA's Check API performance depends on efficient graph traversal strategies for authorization decisions.
- Initial static strategy selection lacked adaptability, leading to the development of a dynamic, self-tuning planner.
- The planner uses Thompson Sampling, a reinforcement learning approach, to balance exploitation and exploration of traversal strategies.
- Thompson Sampling maintains a posterior probability distribution over each strategy's latency, allowing adaptive decision-making.
- The system models each strategy's latency with a Normal-Gamma distribution over its mean and precision (inverse variance), updating the posterior in real time as new observations arrive.
- Production results showed a 98% reduction in P99 latency for complex models, with the planner identifying optimal strategies dynamically.
- The approach emphasizes robustness over hand-tuned heuristics, ensuring performance adapts to evolving data distributions.
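The mechanism described above can be sketched in a few lines: each strategy keeps a Normal-Gamma posterior over its latency, the planner samples a plausible mean latency per strategy and picks the lowest, then folds the observed latency back into that strategy's posterior via the standard conjugate update. This is a minimal illustration, not OpenFGA's actual implementation; the strategy names and prior values are placeholders.

```python
import random

class StrategyArm:
    """Normal-Gamma posterior over one strategy's latency (unknown mean and precision)."""
    def __init__(self, mu=10.0, kappa=1.0, alpha=1.0, beta=1.0):
        # mu: prior mean latency; kappa: pseudo-observations backing mu;
        # alpha, beta: Gamma(shape, rate) prior over the latency precision.
        self.mu, self.kappa, self.alpha, self.beta = mu, kappa, alpha, beta

    def sample_mean(self):
        # Thompson draw: tau ~ Gamma(alpha, rate=beta), then
        # mean ~ Normal(mu, 1/(kappa*tau)). Python's gammavariate takes a
        # scale parameter, hence 1/beta.
        tau = random.gammavariate(self.alpha, 1.0 / self.beta)
        return random.gauss(self.mu, (1.0 / (self.kappa * tau)) ** 0.5)

    def update(self, x):
        # Conjugate Normal-Gamma update for a single latency observation x.
        # beta and alpha are updated with the *old* mu before mu itself moves.
        self.alpha += 0.5
        self.beta += self.kappa * (x - self.mu) ** 2 / (2.0 * (self.kappa + 1.0))
        self.mu = (self.kappa * self.mu + x) / (self.kappa + 1.0)
        self.kappa += 1.0

def choose(arms):
    # Exploit-vs-explore falls out of the sampling itself: uncertain arms
    # produce wide draws and occasionally win, so they keep getting data.
    return min(arms, key=lambda name: arms[name].sample_mean())
```

A short simulated loop shows the behavior: with two hypothetical strategies whose true mean latencies differ, the planner converges on the faster one while still occasionally probing the slower one.

```python
random.seed(0)
arms = {"strategy_a": StrategyArm(), "strategy_b": StrategyArm()}
true_latency = {"strategy_a": 50.0, "strategy_b": 5.0}  # ms, invented for the demo

picks = {"strategy_a": 0, "strategy_b": 0}
for _ in range(2000):
    s = choose(arms)
    picks[s] += 1
    arms[s].update(random.gauss(true_latency[s], 1.0))  # observe one request
# After warm-up, the faster strategy dominates the selections.
```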