Sub-second voice loops hinge on LLM and TTS latency
An analysis of more than 30 stack benchmarks examines what it takes to reach sub-second voice loops for AI agents and finds that LLM Time to First Token (TTFT) and TTS Time to First Byte (TTFB) are the primary latency contributors. Among the STT, LLM, and TTS combinations tested, GPT-4.1-nano + Cartesia Sonic-Turbo achieved the lowest first-byte latency at 0.73-0.75 seconds. The research also notes that English is still faster than Spanish and argues for modular pipeline designs. The article predicts that models will keep shrinking and that joint LLM-TTS training will emerge to reduce latency further.
Key Takeaways
- LLM TTFT plus TTS TTFB accounted for more than 90% of total loop time in every stack measured.
- GPT-4.1-nano + Cartesia Sonic-Turbo delivered the lowest first-byte latency: 0.73-0.75 seconds in English and Spanish.
- Spanish added roughly 300-500 ms of TTFT versus English across the tested stacks.
- Deepgram streaming STT logged under 5 ms, making transcription latency effectively zero in the benchmark.
- LiveKit was used to measure STT duration, TTFT, and TTFB across dozens of STT, LLM, and TTS combinations.
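The "more than 90%" figure above is a simple ratio of the three measured components. The sketch below shows the arithmetic; the class name `LoopSample` and the per-stage millisecond values are illustrative placeholders, not numbers from the benchmark (only the roughly 0.74 s total and the sub-5 ms STT figure echo the article).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoopSample:
    """One voice-loop measurement, all values in milliseconds."""
    stt_ms: float       # streaming STT duration
    llm_ttft_ms: float  # LLM time to first token
    tts_ttfb_ms: float  # TTS time to first byte

    @property
    def total_ms(self) -> float:
        # First-byte latency of the whole loop.
        return self.stt_ms + self.llm_ttft_ms + self.tts_ttfb_ms

    @property
    def model_share(self) -> float:
        # Fraction of the loop spent waiting on LLM TTFT plus TTS TTFB.
        return (self.llm_ttft_ms + self.tts_ttfb_ms) / self.total_ms


# Hypothetical split of a ~0.74 s loop; the per-stage values are made up.
sample = LoopSample(stt_ms=5.0, llm_ttft_ms=450.0, tts_ttfb_ms=285.0)
print(f"total: {sample.total_ms:.0f} ms, model share: {sample.model_share:.1%}")
```

With STT at a few milliseconds, the model share stays close to 100% however the remaining time is divided between the LLM and the TTS.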
Why It Matters
The immediate takeaway is that voice-agent latency is now mostly a model-selection problem, not an STT problem: streamed transcription is effectively negligible, while LLM TTFT and TTS TTFB dominate the loop. That makes modular pipelines attractive, because STT, LLM, and TTS can be swapped independently as faster checkpoints ship. The article also shows the tradeoff is not just speed versus quality; English still outpaces Spanish, and the fastest stacks lean on smaller models. Watch the first-byte latency of new nano and flash releases, plus any benchmark that gets below 0.5 seconds.
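One way to picture the modular design is as three small interfaces with per-stage timing, so that any stage can be replaced without touching the others. This is a minimal sketch under stated assumptions: every name here (`SpeechToText`, `LanguageModel`, `TextToSpeech`, `VoiceLoop`, and the stub classes) is hypothetical and does not come from LiveKit or any vendor SDK, and the stubs stand in for real streaming clients.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Protocol, Tuple, TypeVar

T = TypeVar("T")


class SpeechToText(Protocol):
    def transcribe(self, audio: bytes) -> str: ...


class LanguageModel(Protocol):
    def first_token(self, prompt: str) -> str: ...


class TextToSpeech(Protocol):
    def first_chunk(self, text: str) -> bytes: ...


def timed(fn: Callable[[], T]) -> Tuple[T, float]:
    """Run fn and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn()
    return result, (time.perf_counter() - start) * 1000.0


@dataclass
class VoiceLoop:
    stt: SpeechToText
    llm: LanguageModel
    tts: TextToSpeech

    def run_once(self, audio: bytes) -> Dict[str, float]:
        # Time each stage separately so a slow component is easy to spot.
        text, stt_ms = timed(lambda: self.stt.transcribe(audio))
        token, ttft_ms = timed(lambda: self.llm.first_token(text))
        _, ttfb_ms = timed(lambda: self.tts.first_chunk(token))
        return {"stt_ms": stt_ms, "llm_ttft_ms": ttft_ms, "tts_ttfb_ms": ttfb_ms}


# Stub components (assumptions) standing in for real STT, LLM, and TTS clients.
class EchoSTT:
    def transcribe(self, audio: bytes) -> str:
        return audio.decode("utf-8")


class FirstWordLLM:
    def first_token(self, prompt: str) -> str:
        words = prompt.split()
        return words[0] if words else ""


class BytesTTS:
    def first_chunk(self, text: str) -> bytes:
        return text.encode("utf-8")


loop = VoiceLoop(stt=EchoSTT(), llm=FirstWordLLM(), tts=BytesTTS())
timings = loop.run_once(b"hello there")
print(timings)
```

Because `VoiceLoop` depends only on the three protocols, trying a faster checkpoint amounts to assigning a different object to `loop.llm` or `loop.tts` and rerunning the same measurement.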
Read full article at dev.to
