What we learned from a 22-Day storage bug (and how we fixed it)
4 days ago
- #video-streaming
- #incident-report
- #distributed-systems
- Mux Video's Just-In-Time transcoding feature allows quick playback of uploaded content.
- A recent incident affected 0.33% of audio and video segments, causing playback issues like audio dropouts and visual stuttering.
- The incident was caused by a combination of factors including remote read errors, deletion race conditions, and scaling issues.
- Mux's storage system involves multiple components like storage-worker, storage-api, and object storage for durable persistence.
- Context cancellation in Go during remote reads caused adjacent transcode tasks to fail.
- A race condition between file deletion and replication led to corrupted segments being served.
- Scaling changes introduced bottlenecks, slowing down operations and increasing errors.
- Corrupted segments were fixed by addressing purge paths, context cancellation, and scaling adjustments.
- Mux is improving observability, support escalation processes, and system behavior under load to prevent future incidents.