Modal shows one-second voice AI with open models
Modal, working with Pipecat, described how it reached roughly one-second voice-to-voice latency for AI chatbots using open-weight models hosted on its platform: Parakeet-tdt-v3 for speech-to-text (STT), Qwen3-4B-Instruct-2507 served with vLLM for the LLM, and Kokoro for text-to-speech (TTS). The article covers the architecture, including WebRTC, Modal Tunnels for low-latency communication, and retrieval-augmented generation (RAG) with ChromaDB for knowledge retrieval, and reports that latency was lowest when the client and the services were geographically close.
Key Takeaways
- The demo chains three inference steps: STT, LLM, and TTS, coordinated by the open-source Pipecat framework.
- Modal used Parakeet-tdt-v3 for transcription after finding it faster than the open-weight streaming implementations it tried.
- The LLM stack pairs Qwen3-4B-Instruct-2507 with vLLM, and Modal says it tuned CUDA graph compilation to reduce time-to-first-token.
- For TTS, the demo uses Kokoro, an 82M-parameter model that supports streaming output and phonetic input for words like “Modal.”
- Modal reports a median voice-to-voice latency of one second when the client and Modal containers are near each other; it tested deployments in the Bay Area and Virginia, as well as deployments with no region preference.
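To make the STT, LLM, TTS chain and the "median voice-to-voice latency" metric concrete, here is a minimal Python sketch. It is not Modal's or Pipecat's code: the functions `transcribe`, `generate`, and `synthesize` are hypothetical stand-ins for the three services, and the `asyncio.sleep` delays are arbitrary placeholders, not measured timings. It assumes latency is counted from the end of the user's audio to the first synthesized audio chunk, and that speech synthesis starts as soon as the LLM has streamed one full sentence.

```python
import asyncio
import statistics
import time


# Hypothetical stand-ins for the three services. The sleeps are arbitrary
# placeholders so the script runs; they are not measurements of any model.
async def transcribe(audio: bytes) -> str:
    await asyncio.sleep(0.01)
    return "what is modal"


async def generate(prompt: str):
    # Stream tokens one at a time, as a streaming LLM server would.
    for token in ["Modal ", "is ", "a ", "cloud ", "platform."]:
        await asyncio.sleep(0.005)
        yield token


async def synthesize(text: str) -> bytes:
    await asyncio.sleep(0.01)
    return text.encode("utf-8")


async def voice_to_voice(audio: bytes) -> float:
    """Seconds from end of user audio to the first synthesized audio chunk."""
    start = time.perf_counter()
    transcript = await transcribe(audio)

    # Collect streamed tokens until the first sentence is complete, then
    # hand that sentence to TTS without waiting for the rest of the reply.
    sentence = ""
    async for token in generate(transcript):
        sentence += token
        if sentence.rstrip().endswith((".", "?", "!")):
            break

    await synthesize(sentence)
    return time.perf_counter() - start


async def median_latency(turns: int = 5) -> float:
    """Run several simulated turns and return the median latency in seconds."""
    samples = [await voice_to_voice(b"\x00" * 320) for _ in range(turns)]
    return statistics.median(samples)


if __name__ == "__main__":
    print(f"median voice-to-voice: {asyncio.run(median_latency()):.3f} s")
```

In a real deployment each stub would be a network call to a GPU-backed service, so the measured value would also include transport time between the client, the bot, and the services.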
Why It Matters
This is a concrete latency benchmark for an open-model voice stack, not just a demo architecture. Modal’s result depends on three things working together: a CPU-based Pipecat bot, GPU-backed STT/LLM/TTS services, and lower network overhead from WebRTC plus Modal Tunnels. The broader point for streaming and media teams is that real-time voice apps now hinge as much on transport and region placement as on model choice. What to watch next is whether Modal’s approach keeps the median near one second when deployments move outside the Bay Area/Virginia test regions.
Read full article at modal.com
