Hasty Briefsbeta

Bilingual

What we learned from a 22-Day storage bug (and how we fixed it)

4 days ago
  • #video-streaming
  • #incident-report
  • #distributed-systems
  • Mux Video's Just-In-Time transcoding feature allows quick playback of uploaded content.
  • A recent incident affected 0.33% of audio and video segments, causing playback issues like audio dropouts and visual stuttering.
  • The incident was caused by a combination of factors including remote read errors, deletion race conditions, and scaling issues.
  • Mux's storage system involves multiple components like storage-worker, storage-api, and object storage for durable persistence.
  • Context cancellation in Go during remote reads caused adjacent transcode tasks to fail.
  • A race condition between file deletion and replication led to corrupted segments being served.
  • Scaling changes introduced bottlenecks, slowing down operations and increasing errors.
  • Corrupted segments were fixed by addressing purge paths, context cancellation, and scaling adjustments.
  • Mux is improving observability, support escalation processes, and system behavior under load to prevent future incidents.