AI & Video | Product Launch

OpenMOSS launches 30-second MOSS-SoundEffect v2.0 on Hugging Face

OpenMOSS-Team has released MOSS-SoundEffect v2.0, a text-to-audio model that generates high-fidelity environmental, urban, creature, and human-action sound effects from natural-language prompts. This version uses a Diffusion Transformer backbone, generates clips of up to 30 seconds at 48 kHz, and supports both English and Chinese captions.

Key Takeaways

MOSS-SoundEffect v2.0 uses a Diffusion Transformer backbone with a Flow Matching objective, plus a DAC VAE and Qwen3 text encoder.
The model generates sound effects up to 30 seconds long at 48 kHz; a duration tag is prepended during training.
Supported prompt languages are English and Chinese, expanding beyond single-language captioning.
OpenMOSS lists the model at 1.3B parameters and says it supersedes the v1 discrete-token autoregressive backbone, MossTTSDelay.
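
To make the numbers and the Flow Matching idea in the takeaways concrete, here is a minimal toy sketch in plain Python. It assumes the common linear-interpolation form of flow matching, in which a sample moves from noise toward data along a straight path and the network is trained to predict the velocity along that path. OpenMOSS's actual objective, latent shapes, and sampler are not described in this brief, so every function name below is illustrative, and the true velocity stands in for the trained network.

```python
# Toy illustration only: not the MOSS-SoundEffect implementation.
SAMPLE_RATE_HZ = 48_000   # output rate stated in the article
MAX_SECONDS = 30          # maximum clip length stated in the article


def num_samples(seconds, sample_rate=SAMPLE_RATE_HZ):
    """Waveform samples per channel for a clip of the given length."""
    return int(round(seconds * sample_rate))


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the linear path: constant velocity x1 - x0."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(x0, velocity_fn, steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


if __name__ == "__main__":
    print(num_samples(MAX_SECONDS))  # samples per channel in a 30 s clip

    noise, data = [0.0, 1.0], [2.0, -1.0]
    # Stand-in for a trained network: the exact velocity of the linear path.
    oracle = lambda x, t: target_velocity(noise, data)
    print(euler_sample(noise, oracle))  # ends at the data point
```

In a real system the velocity would come from the Diffusion Transformer conditioned on the text embedding, and integration would run in the VAE's latent space before decoding to a waveform.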

Why It Matters

MOSS-SoundEffect v2.0 gives audio generation pipelines a longer-form, higher-fidelity option for non-speech sounds, with 30-second clips at 48 kHz and bilingual prompting built in. That matters for video workflows that need ambience, urban scenes, creatures, or human-action effects from text alone. The model also marks a technical shift inside the MOSS-TTS family from a discrete-token autoregressive backbone to a DiT plus Flow Matching design. The next signal to watch is whether the model gets broader deployment support beyond Hugging Face, since it is not currently deployed by any Inference Provider.

Read full article at huggingface.co

