DeltaToken cuts video tokens from 180K to under 1,000
Qiang Zhang announced DeltaToken, a new video tokenizer designed to cut the number of VAE tokens for video models by up to 192x while keeping the same number of channels. According to the post, this lowers training cost, reduces inference cost for real-time video generation, and extends video context length from seconds to minutes.
Key Takeaways
- DeltaToken is a new video tokenizer for world models and video models that uses the same number of channels while cutting VAE tokens by up to 192x.
- One example in the post shows token count falling from 180K to under 1,000.
- The project claims 10–100x lower training cost, with a video foundation model trained from scratch for under $4,000 in compute.
- The post says the compression could extend context length from 10–15 seconds to 5–10 minutes for native cross-shot consistency.
- Qiang Zhang says the encoder focuses on what changes in the video, which he argues improves physical grounding for embodied world models.
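The headline figures can be checked against each other with simple arithmetic. The sketch below is only a consistency check on the numbers quoted in the post (180K tokens, up to 192x, under 1,000 tokens). The function name `compressed_tokens` is hypothetical and has nothing to do with DeltaToken's actual code, and it assumes the 180K example is the case that reaches the full 192x reduction, which the post does not confirm.

```python
def compressed_tokens(baseline_tokens: int, reduction_factor: float) -> float:
    """Return the token count after dividing a baseline by a reduction factor.

    Hypothetical helper for illustration only; not part of DeltaToken.
    """
    return baseline_tokens / reduction_factor


baseline = 180_000    # "180K" tokens, from the post's example
max_reduction = 192   # "up to 192x", from the post

# Token count if the full 192x reduction applied to the 180K example (assumption).
tokens = compressed_tokens(baseline, max_reduction)

# Reduction factor at which 180K tokens becomes exactly 1,000 tokens;
# anything above this lands "under 1,000".
min_factor = baseline / 1_000

print(tokens)      # 937.5
print(min_factor)  # 180.0
```

Under that assumption, a 192x reduction of 180K tokens gives about 937 tokens, which agrees with the "under 1,000" figure. Any reduction above 180x would also satisfy it.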
Why It Matters
If the claims hold up, DeltaToken reduces the token burden that sits between raw video and model training, inference, and longer-context generation. That matters most for systems trying to run video generation inside LLMs, VLMs (vision-language models), and VLAs (vision-language-action models), since the post argues the compression makes native integration possible without architectural compromise. The immediate technical signal is cost: from-scratch training for under $4,000 and real-time on-device generation are both called out. Watch for the released demo details and whether the 180K-to-under-1,000 token reduction holds across different video workloads.
Read full article at linkedin.com