
Jina extends text embeddings to image, audio, and video

Researchers at Jina by Elastic introduced jina-embeddings-v5-omni, a suite of multimodal embedding models that extends existing text embedding models to image, audio, and video inputs. The models use a "frozen-encoder model composition" approach: pre-trained modality-specific encoders are connected to a frozen text embedding model through compact projectors, which yields competitive performance with less training. The jina-embeddings-v5-omni-small model achieves strong text-only performance and competitive scores on image and audio tasks compared with other open-weight multimodal embedding models.
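
To make the composition idea concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: frozen_text_backbone, frozen_vision_encoder, and Projector are hypothetical stand-ins, and the "training step" is just a manual weight change. It only illustrates that when the projector is the sole component that changes, the text path produces the same vector before and after.

```python
import math

# Hypothetical stand-ins for illustration; none of these are real Jina components.

def frozen_text_backbone(vec):
    """Toy 'backbone' with no trainable state: L2-normalizes its input."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def frozen_vision_encoder(pixels):
    """Toy 'vision encoder' with fixed behavior: returns two summary features."""
    return [sum(pixels) / len(pixels), max(pixels) - min(pixels)]

class Projector:
    """The only trainable part: a linear map from vision features
    into the backbone's input space (rows = output dims)."""
    def __init__(self, weights):
        self.weights = weights

    def __call__(self, feats):
        return [sum(w * f for w, f in zip(row, feats)) for row in self.weights]

def embed_text(vec):
    # Text goes straight through the frozen backbone.
    return frozen_text_backbone(vec)

def embed_image(pixels, projector):
    # Image path: frozen encoder -> trainable projector -> frozen backbone.
    return frozen_text_backbone(projector(frozen_vision_encoder(pixels)))

text_input = [3.0, 4.0, 0.0]
image_input = [0.2, 0.4, 0.6]

projector = Projector([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
text_before = embed_text(text_input)
image_before = embed_image(image_input, projector)

# Simulated training update: only the projector weights change.
projector.weights = [[0.5, 0.1], [0.0, 1.2], [0.3, 0.0]]
text_after = embed_text(text_input)
image_after = embed_image(image_input, projector)

print("text unchanged:", text_before == text_after)
print("image changed:", image_before != image_after)
```

In this toy setup the text embedding cannot move because nothing on its path is updated, which is the property the frozen-backbone design relies on.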

Key Takeaways

jina-embeddings-v5-omni comes in two base models: nano at 0.95B parameters and small at 1.57B parameters.
The training recipe freezes the text backbone, vision encoder, and audio encoder, and updates only fc_vision_2, fc_audio, and modality delimiter embeddings.
The paper says the trainable components are 0.35% of the joint model’s total weights.
On the open-weight benchmark table, jina-embeddings-v5-omni-small scores 67.00 on text, 56.05 on image, 41.20 on video, and 51.46 on audio, for a 53.93 average.
For visual document retrieval on ViDoRe-in-MIEB, jina-embeddings-v5-omni-small scores 79.08, while jina-embeddings-v5-omni-nano scores 70.05.
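
The figures above can be sanity-checked with a few lines of arithmetic. The sketch below recomputes the four-modality average and the small-versus-nano gap on ViDoRe-in-MIEB from the quoted scores. The trainable-parameter estimates are an assumption: the summary does not say which model size the 0.35% figure refers to, so the calculation simply applies it to both sizes as a rough guide.

```python
# Scores quoted above for jina-embeddings-v5-omni-small.
scores = {"text": 67.00, "image": 56.05, "video": 41.20, "audio": 51.46}
average = sum(scores.values()) / len(scores)
print(f"average: {average:.2f}")

# ViDoRe-in-MIEB scores quoted above.
vidore_gap = 79.08 - 70.05
print(f"small minus nano on ViDoRe-in-MIEB: {vidore_gap:.2f}")

# Assumption: 0.35% is applied to each model size; the source does not
# state which size the figure was reported for.
total_params = {"nano": 0.95e9, "small": 1.57e9}
trainable_estimate = {name: 0.0035 * n for name, n in total_params.items()}
for name, n in trainable_estimate.items():
    print(f"{name}: roughly {n / 1e6:.2f}M trainable parameters")
```

The recomputed average agrees with the reported 53.93, and under the stated assumption the trainable components come to only a few million parameters for either size.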

Why It Matters

This is a practical way to extend a text embedding stack into multimodal retrieval without retraining the full model. The paper keeps text embeddings identical to Jina Embeddings v5 Text, which matters for existing retrieval and RAG pipelines that depend on stable vector geometry. The models post their strongest results on visual document retrieval, while video remains the weak spot in the benchmark tables. Watch the release’s task-specific variants and the gap between image/audio scores and MMEB-Video performance, since those are the clearest signs of where the recipe holds up and where it doesn’t.
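
As a rough illustration of why stable vector geometry matters, the sketch below runs a cosine-similarity search over an existing text index using a query vector that is imagined to come from an image. All vectors, document names, and the search helper are made up for the example; no real model is called. The point is only that if text embeddings are unchanged, the stored index can be reused as-is while new modalities are added on the query side.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical pre-existing index of text embeddings (toy 3-d vectors).
text_index = {
    "invoice policy": [0.9, 0.1, 0.0],
    "holiday schedule": [0.1, 0.9, 0.1],
    "gpu cluster guide": [0.0, 0.2, 0.9],
}

# Hypothetical embedding of a scanned invoice image in the same space.
image_query = [0.8, 0.2, 0.1]

def search(query, index, k=1):
    """Return the k document names most similar to the query vector."""
    ranked = sorted(index.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [name for name, _ in ranked[:k]]

top = search(image_query, text_index, k=1)
print(top)
```

Nothing in text_index had to be recomputed to accept the image query, which is the practical benefit for pipelines that already store text vectors.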

Read full article at arxiv.org


