OpenMOSS launches MOSS-SoundEffect v2.0 on Hugging Face, with sound effects up to 30 seconds long
OpenMOSS-Team has released MOSS-SoundEffect v2.0, a new text-to-audio model designed to generate high-fidelity environmental, urban, creature, and human-action sound effects from natural-language prompts. This version, which uses a Diffusion Transformer backbone, can generate audio up to 30 seconds at 48 kHz and supports both English and Chinese captions.
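For a sense of scale, the arithmetic behind "30 seconds at 48 kHz" can be sketched as below. The sample rate and maximum duration come from the announcement; the channel count and the 16-bit sample width are assumptions made only for illustration, since the announcement does not state the output format, and `pcm_footprint` is a hypothetical helper name.

```python
SAMPLE_RATE_HZ = 48_000  # stated in the announcement
MAX_DURATION_S = 30      # stated in the announcement


def pcm_footprint(duration_s, sample_rate_hz=SAMPLE_RATE_HZ,
                  channels=1, bytes_per_sample=2):
    """Return (samples per channel, raw PCM size in bytes).

    channels=1 and bytes_per_sample=2 (16-bit mono) are illustrative
    assumptions, not values published for the model.
    """
    samples = round(duration_s * sample_rate_hz)
    return samples, samples * channels * bytes_per_sample


samples, size_bytes = pcm_footprint(MAX_DURATION_S)
print(f"{samples} samples per channel, {size_bytes / 1e6:.2f} MB raw PCM")
```

Under these assumptions, a maximum-length clip is 1,440,000 samples per channel, or about 2.88 MB of raw 16-bit mono PCM.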
Key Takeaways
- MOSS-SoundEffect v2.0 uses a Diffusion Transformer backbone with a Flow Matching objective, plus a DAC VAE and Qwen3 text encoder.
- The model generates sound effects up to 30 seconds long at 48 kHz; a duration tag is prepended during training.
- Supported prompt languages are English and Chinese, expanding beyond single-language captioning.
- OpenMOSS lists the model at 1.3B parameters and says it supersedes the v1 discrete-token autoregressive backbone, MossTTSDelay.
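To make the Flow Matching objective named above more concrete, here is a minimal sketch of one common formulation: a straight-line path between Gaussian noise and the data, with the network trained to predict the constant velocity along that path. The announcement does not say which path or sampler MOSS-SoundEffect uses, so the linear path and the Euler sampler are assumptions. `predict_velocity` is a hypothetical stand-in for the Diffusion Transformer, and plain Python lists stand in for VAE latents. Text conditioning is omitted.

```python
import random


def interpolate(noise, data, t):
    """Point on the straight line from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for n, d in zip(noise, data)]


def flow_matching_loss(predict_velocity, data, rng):
    """Mean squared error between predicted and target velocity.

    The target for a linear path is simply data - noise.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in data]
    t = rng.random()  # uniform in [0, 1)
    x_t = interpolate(noise, data, t)
    target = [d - n for n, d in zip(noise, data)]
    pred = predict_velocity(x_t, t)
    return sum((p - v) ** 2 for p, v in zip(pred, target)) / len(data)


def euler_sample(predict_velocity, noise, steps=10):
    """Integrate the learned velocity field from t=0 to t=1."""
    x = list(noise)
    dt = 1.0 / steps
    for i in range(steps):
        v = predict_velocity(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy check with a single known "latent": the exact velocity field for
# one data point is (data - x_t) / (1 - t), so the loss is near zero.
toy_data = [0.5, -1.0, 2.0]


def oracle(x_t, t):
    return [(d - x) / (1.0 - t) for d, x in zip(toy_data, x_t)]


rng = random.Random(0)
loss = flow_matching_loss(oracle, toy_data, rng)
start = [rng.gauss(0.0, 1.0) for _ in toy_data]
generated = euler_sample(oracle, start)
```

In a real system the oracle would be replaced by a trained network, and the generated latents would be passed through a decoder to produce a waveform. The toy version only shows how the training target and the sampling loop fit together.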
Why It Matters
MOSS-SoundEffect v2.0 gives audio generation pipelines a longer-form, higher-fidelity option for non-speech sounds, with clips of up to 30 seconds at 48 kHz and bilingual prompting built in. That matters for video workflows that need ambience, urban scenes, creatures, or human-action effects generated from text alone. The model also marks a technical shift inside the MOSS-TTS family, from a discrete-token autoregressive backbone to a Diffusion Transformer (DiT) trained with a Flow Matching objective. The next signal to watch is whether the model gains deployment support beyond its Hugging Face listing, since it is not currently deployed by any Inference Provider.
Read full article at huggingface.co