When JIT Meets Storage: 22 Days of Silent Segment Corruption
Mux describes a 22-day incident (Jan 8–Feb 4) in which about 0.33% of served VOD audio/video segments were corrupted, leading to playback issues such as brief audio dropouts and visual stuttering, without loss of source video data. The postmortem attributes the issue to interactions between JIT transcoding, a new storage architecture, cloud object storage read/replication slowdowns after a scaling change, and a cleanup path that allowed partially written segments to be treated as valid and replicated. Mux reports deploying fixes on Jan 29 and completing remediation by regenerating affected segments and purging CDN caches by Feb 3, alongside planned improvements to observability and support escalation.
Key Takeaways
- Impact was small in percentage (~0.33% of served segments) but broad in surface area: corrupted segments appeared across VOD playback, with symptoms varying by player behavior.
- A scaling move to fewer, larger storage-worker nodes introduced bottlenecks that increased object storage timeouts and widened replication/delete race windows.
- Critical failure mode: transcoding cleanup closed a partial write cleanly, causing storage to treat incomplete segments as valid and replicate them.
- Mux stopped new corruption on Jan 29 (delete/purge fixes, background-context remote reads, more nodes) and remediated by regenerating affected segments plus CDN purges by Feb 3.
- Observability and operations gaps were material: dropped logs and slow pattern recognition from customer reports prolonged time-to-diagnosis.
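The partial-write failure mode in the takeaways above can be illustrated with a minimal sketch. This is not Mux's code: `SegmentWriter`, `transcode_segment`, and the `.partial` suffix are hypothetical names, and a local filesystem stands in for the storage layer. The idea shown is that closing a file and committing a segment are separate operations, so a cleanup path that runs after cancellation can only abort, never publish.

```python
import os
import tempfile


class SegmentWriter:
    """Hypothetical writer: a segment becomes visible only after commit()."""

    def __init__(self, final_path):
        self.final_path = final_path
        directory = os.path.dirname(final_path) or "."
        # Write to a temporary file in the same directory so the final
        # rename is atomic on POSIX filesystems.
        fd, self.tmp_path = tempfile.mkstemp(dir=directory, suffix=".partial")
        self._file = os.fdopen(fd, "wb")
        self._committed = False

    def write(self, chunk):
        self._file.write(chunk)

    def commit(self):
        # Called only when the producer knows the segment is complete.
        self._file.flush()
        os.fsync(self._file.fileno())
        self._file.close()
        os.replace(self.tmp_path, self.final_path)
        self._committed = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if not self._committed:
            # Cleanup path: any exit without an explicit commit is an abort.
            # The partial file is discarded instead of being closed "cleanly".
            if not self._file.closed:
                self._file.close()
            if os.path.exists(self.tmp_path):
                os.remove(self.tmp_path)
        return False  # never swallow the original exception


def transcode_segment(final_path, chunks, cancel_after=None):
    """Stand-in for a transcode job; cancel_after simulates a cancellation."""
    with SegmentWriter(final_path) as writer:
        for i, chunk in enumerate(chunks):
            if cancel_after is not None and i >= cancel_after:
                raise RuntimeError("transcode cancelled")
            writer.write(chunk)
        writer.commit()
```

With this shape, a job cancelled midway leaves no file at `final_path`, so a downstream replicator has nothing incomplete to pick up. A stream-while-writing design cannot hide the file until commit in the same way, so it would need an equivalent explicit "complete" or "aborted" signal on the stream.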
Why It Matters
This incident illustrates a failure mode common in modern streaming infrastructure: a partial write is treated as valid until something proves otherwise. JIT workflows and stream-while-writing storage architectures optimize latency, but they also turn subtle API contracts (such as what counts as a "successful" close) into customer-visible playback defects, especially when a scaling change shifts timing across caches, object storage, and replication. For platform buyers, the lesson is to question vendors about end-to-end integrity checks, cache purge playbooks, and incident detection, not just uptime. For builders, the lesson is to treat object store behavior, retries, and cancellation semantics as product requirements, not implementation details.
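One form an end-to-end integrity check could take is sketched below. It assumes, hypothetically, that the transcoder emits a small manifest (byte length and SHA-256 digest) only when a segment finishes successfully. `SegmentManifest`, `is_intact`, and `replicate_if_intact` are illustrative names, and plain lists stand in for replica stores. Nothing here describes how Mux validates segments.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentManifest:
    """Hypothetical record written by the producer on successful completion."""
    size: int
    sha256: str


def make_manifest(data: bytes) -> SegmentManifest:
    return SegmentManifest(size=len(data), sha256=hashlib.sha256(data).hexdigest())


def is_intact(data: bytes, manifest: SegmentManifest) -> bool:
    # Cheap length check first; a truncated segment usually fails here.
    if len(data) != manifest.size:
        return False
    return hashlib.sha256(data).hexdigest() == manifest.sha256


def replicate_if_intact(data: bytes, manifest: SegmentManifest, replicas) -> bool:
    """Copy the segment to every replica only if it matches its manifest."""
    if not is_intact(data, manifest):
        return False
    for replica in replicas:
        replica.append(data)
    return True
```

The same check can run at read time before a segment is handed to a CDN, which would limit how long a corrupted object stays cached.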
Read full article at mux.com