Jina extends text embeddings to image, audio, and video
Researchers at Jina by Elastic introduced jina-embeddings-v5-omni, a suite of multimodal embedding models that extends existing text embedding models to accept image, audio, and video inputs. The models use a "frozen-encoder model composition" approach: pre-trained modality-specific encoders are connected to a frozen text embedding model through compact projectors, which yields competitive performance while training only a small fraction of the weights. The jina-embeddings-v5-omni-small model achieves strong text-only performance and competitive image and audio scores compared with other open-weight multimodal embedding models.
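The composition idea can be illustrated with a toy sketch in plain Python. This is not the released architecture: the encoder and backbone below are hypothetical stand-ins, the dimensions are tiny, and names such as `embed_image` and `frozen_vision_encoder` are made up for illustration. It only shows the wiring described above, where image features pass through a projector into the frozen text model while text inputs never touch the projector.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def frozen_text_backbone(token_vectors: List[Vector]) -> Vector:
    """Hypothetical stand-in for the frozen text model: mean-pool, then L2-normalize."""
    dim = len(token_vectors[0])
    pooled = [sum(t[i] for t in token_vectors) / len(token_vectors) for i in range(dim)]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


def frozen_vision_encoder(patches: List[Vector]) -> List[Vector]:
    """Hypothetical stand-in for a pre-trained image encoder (fixed, no weights to train)."""
    return [[math.tanh(x) for x in patch] for patch in patches]


def embed_text(token_vectors: List[Vector]) -> Vector:
    """Text path: goes straight into the frozen backbone, no projector involved."""
    return frozen_text_backbone(token_vectors)


def embed_image(patches: List[Vector], projector: Matrix) -> Vector:
    """Image path: frozen encoder -> projector (the trainable part) -> frozen backbone."""
    features = frozen_vision_encoder(patches)
    projected = [matvec(projector, f) for f in features]
    return frozen_text_backbone(projected)


# Toy inputs: 2-dimensional "text tokens" and 3-dimensional "image patches".
text_tokens = [[1.0, 0.0], [0.0, 1.0]]
patches = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.4]]

# Two different projector states, e.g. before and after training (values are arbitrary).
projector_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
projector_b = [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]]

text_vec = embed_text(text_tokens)
image_vec_a = embed_image(patches, projector_a)
image_vec_b = embed_image(patches, projector_b)
```

Because `embed_text` never takes the projector as an argument, changing the projector weights moves only the image vectors. Text vectors stay exactly where the frozen backbone puts them.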
Key Takeaways
- jina-embeddings-v5-omni comes in two base models: nano at 0.95B parameters and small at 1.57B parameters.
- The training recipe freezes the text backbone, vision encoder, and audio encoder, and updates only fc_vision_2, fc_audio, and modality delimiter embeddings.
- The paper says the trainable components are 0.35% of the joint model’s total weights.
- On the open-weight benchmark table, jina-embeddings-v5-omni-small scores 67.00 on text, 56.05 on image, 41.20 on video, and 51.46 on audio, averaging 53.93 across the four modalities.
- For visual document retrieval on ViDoRe-in-MIEB, jina-embeddings-v5-omni-small scores 79.08, while jina-embeddings-v5-omni-nano scores 70.05.
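The freezing recipe in the takeaways can be sketched as a small bookkeeping exercise in plain Python. The component names `fc_vision_2` and `fc_audio` come from the summary above; `modality_delimiter_embeddings` is a hypothetical identifier for the delimiter embeddings, and every parameter count below is a toy value chosen for illustration, not a figure from the paper.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ParamGroup:
    num_params: int
    trainable: bool = True


# Components the summary lists as the only ones updated during training.
# "modality_delimiter_embeddings" is an assumed name for illustration.
TRAINABLE_NAMES = {"fc_vision_2", "fc_audio", "modality_delimiter_embeddings"}


def apply_freeze_recipe(groups: Dict[str, ParamGroup]) -> None:
    """Mark every group as frozen unless its name is in TRAINABLE_NAMES."""
    for name, group in groups.items():
        group.trainable = name in TRAINABLE_NAMES


def trainable_fraction(groups: Dict[str, ParamGroup]) -> float:
    """Share of all parameters that remain trainable."""
    total = sum(g.num_params for g in groups.values())
    trainable = sum(g.num_params for g in groups.values() if g.trainable)
    return trainable / total


# Toy sizes only; these are NOT the real model's parameter counts.
toy_model = {
    "text_backbone": ParamGroup(600_000),
    "vision_encoder": ParamGroup(300_000),
    "audio_encoder": ParamGroup(96_000),
    "fc_vision_2": ParamGroup(2_000),
    "fc_audio": ParamGroup(1_500),
    "modality_delimiter_embeddings": ParamGroup(500),
}

apply_freeze_recipe(toy_model)
print(f"trainable share: {trainable_fraction(toy_model):.2%}")
```

With these toy sizes the trainable share comes out to 0.40%, which is in the same range as the 0.35% the paper reports but is not derived from it. In a real training framework the same idea is usually expressed by disabling gradients on the frozen modules and passing only the remaining parameters to the optimizer.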
Why It Matters
This is a practical way to extend a text embedding stack into multimodal retrieval without retraining the full model. Because the text backbone stays frozen, text embeddings remain identical to those of Jina Embeddings v5 Text, which matters for existing retrieval and RAG pipelines that depend on stable vector geometry. The models post their strongest results on visual document retrieval, while video remains the weak spot in the benchmark tables. Watch the release's task-specific variants and the gap between image/audio scores and MMEB-Video performance, since those are the clearest signs of where the recipe holds up and where it doesn't.
Read full article at arxiv.org
