OpenBMB shrinks multimodal video understanding to phone-sized deployment
OpenBMB has introduced MiniCPM-V 4.6, a multimodal large language model designed for efficient image and video understanding on mobile devices. The model is built on the SigLIP2-400M vision encoder and the Qwen3.5-0.8B LLM. OpenBMB highlights strong multimodal capability and lower compute cost, including mixed 4x/16x visual token compression, and supports deployment on iOS, Android, and HarmonyOS.
Key Takeaways
- MiniCPM-V 4.6 scores 13 on the Artificial Analysis Intelligence Index, above Qwen3.5-0.8B’s 10 and Qwen3.5-0.8B-Thinking’s 11.
- The model uses mixed 4x/16x visual token compression and reduces visual encoding FLOPs by more than 50%.
- OpenBMB says MiniCPM-V 4.6 reaches Qwen3.5 2B-level capability on benchmarks including OpenCompass, RefCOCO, HallusionBench, MUIRBench, and OCRBench.
- The model can be deployed on iOS, Android, and HarmonyOS, with edge adaptation code open-sourced.
- It is adapted to vLLM, SGLang, llama.cpp, and Ollama, and supports SWIFT and LLaMA-Factory for fine-tuning.
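To make the 4x/16x compression figures concrete, here is a minimal arithmetic sketch of how many visual tokens would reach the language model under each mode. The 448x448 input size, the 14-pixel patch size, the 8-frame clip, and the `visual_tokens` helper are all illustrative assumptions, not details OpenBMB has published for MiniCPM-V 4.6. The sketch only counts tokens and does not model the reported FLOPs reduction.

```python
from math import ceil


def visual_tokens(num_patches: int, compression: int) -> int:
    """Tokens left after merging every `compression` patches into one token.

    Hypothetical helper for illustration; not part of any MiniCPM-V API.
    """
    if compression not in (4, 16):
        raise ValueError("this sketch only models the 4x and 16x modes")
    return ceil(num_patches / compression)


# Assumed input: a 448x448 image cut into 14x14-pixel patches.
patches_per_side = 448 // 14           # 32 patches along each side
num_patches = patches_per_side ** 2    # 1024 patches in total

tokens_4x = visual_tokens(num_patches, 4)     # 256 tokens per image
tokens_16x = visual_tokens(num_patches, 16)   # 64 tokens per image

# Assumed video clip: 8 sampled frames, each compressed at 16x.
video_tokens_16x = 8 * tokens_16x             # 512 tokens for the clip

print(tokens_4x, tokens_16x, video_tokens_16x)
```

Under these assumptions, the 16x mode passes a quarter as many visual tokens to the language model as the 4x mode. That is one plausible reason to mix the two: heavier compression where many frames are involved, lighter compression where detail matters.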
Why It Matters
MiniCPM-V 4.6 pushes image and video understanding closer to on-device deployment by combining a 1B-parameter footprint with lower visual compute and support for three mobile platforms. That matters for product teams building phone-based video or multimodal features, because the model is explicitly packaged for edge use rather than for server inference alone. The broader ecosystem angle is compatibility: OpenBMB lists support for vLLM, SGLang, llama.cpp, Ollama, SWIFT, and LLaMA-Factory, which lowers integration friction across serving and fine-tuning stacks. Watch for how the open-sourced edge adaptation code and quantized variants are adopted in actual mobile deployments.
Read full article at huggingface.co