
Modal shows one-second voice AI with open models

Modal, in collaboration with Pipecat, detailed how it achieved a voice-to-voice latency of about one second for AI chatbots using open-weight models hosted on Modal's platform: Parakeet-tdt-v3 for speech-to-text (STT), Qwen3-4B-Instruct-2507 served with vLLM for the LLM, and KokoroTTS for text-to-speech (TTS). The article explains the architecture, including WebRTC and Modal Tunnels for low-latency communication and RAG with ChromaDB for knowledge retrieval, and shows that latency is lowest when the services are geographically co-located.

Key Takeaways

The demo chains three inference steps: STT, LLM, and TTS, coordinated by the open-source Pipecat framework.
Modal used Parakeet-tdt-v3 for transcription after finding it faster than the open-weight streaming implementations it tried.
The LLM stack pairs Qwen3-4B-Instruct-2507 with vLLM, and Modal says it tuned CUDA graph compilation to reduce time-to-first-token.
For TTS, the demo uses Kokoro, an 82M-parameter model that supports streaming output and phonetic input for words like “Modal.”
Modal reports a median voice-to-voice latency of one second when the client and Modal containers are near each other; it tested Bay Area, Virginia, and no-preference regions.
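
To make the three-step chain concrete, here is a minimal sketch of an STT, LLM, and TTS pipeline built from Python async generators. It is not Pipecat's API or Modal's code: every function name here (transcribe, generate, synthesize, microphone, voice_turn) is hypothetical, and each stage is a stub that passes text through instead of running a model. It only illustrates the streaming hand-off, where TTS starts emitting frames as soon as the LLM yields its first token.

```python
import asyncio
from typing import AsyncIterator


async def microphone() -> AsyncIterator[bytes]:
    """Hypothetical audio source: each 'chunk' is just an encoded word."""
    for word in ("hello", "modal"):
        yield word.encode("utf-8")


async def transcribe(audio_chunks: AsyncIterator[bytes]) -> str:
    """Stub STT: pretend every audio chunk decodes to exactly one word."""
    words = [chunk.decode("utf-8") async for chunk in audio_chunks]
    return " ".join(words)


async def generate(prompt: str) -> AsyncIterator[str]:
    """Stub LLM: stream a canned reply one token at a time."""
    for token in f"You said: {prompt}.".split():
        await asyncio.sleep(0)  # yield control, as a real network call would
        yield token


async def synthesize(tokens: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stub TTS: emit one 'audio frame' per token as soon as it arrives."""
    async for token in tokens:
        yield token.encode("utf-8")


async def voice_turn() -> list[bytes]:
    """Run one turn: full transcript in, streamed audio frames out."""
    text = await transcribe(microphone())
    return [frame async for frame in synthesize(generate(text))]


frames = asyncio.run(voice_turn())
print(frames)
```

In a real deployment each stub would be replaced by a call to the corresponding GPU-backed service, and the orchestration would be handled by the framework rather than by hand.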

Why It Matters

This is a concrete latency benchmark for an open-model voice stack, not just a demo architecture. Modal’s result depends on three things working together: a CPU-based Pipecat bot, GPU-backed STT/LLM/TTS services, and lower network overhead from WebRTC plus Modal Tunnels. The broader point for streaming and media teams is that real-time voice apps now hinge as much on transport and region placement as on model choice. What to watch next is whether Modal’s approach keeps the median near one second when deployments move outside the Bay Area/Virginia test regions.
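
One rough way to reason about why transport and placement matter as much as model choice is to treat each turn's voice-to-voice latency as a sum of per-stage delays and look at the median across turns. The sketch below assumes that simple additive breakdown, which is a simplification and not Modal's measurement method. The TurnTiming class, its field names, and all the numbers are hypothetical placeholders chosen for illustration, not figures from the article.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TurnTiming:
    """Hypothetical per-turn latency components, all in milliseconds."""
    network_ms: float          # client <-> bot transport overhead
    stt_ms: float              # time to final transcript
    llm_first_token_ms: float  # LLM time-to-first-token
    tts_first_audio_ms: float  # TTS time-to-first-audio

    def voice_to_voice_ms(self) -> float:
        # Assumed additive model: stages run back to back with no overlap.
        return (
            self.network_ms
            + self.stt_ms
            + self.llm_first_token_ms
            + self.tts_first_audio_ms
        )


# Placeholder values only; the third turn models a client far from the region.
turns = [
    TurnTiming(50, 200, 300, 250),
    TurnTiming(60, 220, 350, 270),
    TurnTiming(200, 210, 320, 260),
]

totals = [t.voice_to_voice_ms() for t in turns]
median_ms = median(totals)
print(totals, median_ms)
```

With the model stages held roughly constant, the network term is the one that moves when the client and the containers are far apart, which is the effect the regional tests are probing.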

Read full article at modal.com

Related Articles

Agora Integrates OpenAI Real-Time API for Low-Latency Conversational AI

AWS SageMaker Adds Multi-Turn RL for Specialized AI Model Training

wTVision Debuts CricketStats CG, Enters Cricket Graphics Market in Bangladesh