AI & Video | Product Launch | March 10, 2026

Google adds multimodal embeddings across text, video, audio

Google has launched Gemini Embedding 2, its first natively multimodal embedding model, now available in Public Preview via the Gemini API and Vertex AI. The model maps text, images, video, audio, and documents (up to 6 pages) into a single embedding space, which supports multimodal retrieval and classification as well as applications such as Retrieval-Augmented Generation (RAG) and semantic search.

Key Takeaways

Gemini Embedding 2 is Google’s first natively multimodal embedding model, available now in Public Preview.
The model accepts text, up to 6 images, up to 120 seconds of video, audio, and PDFs up to 6 pages.
Google says the model captures semantic intent across more than 100 languages and accepts interleaved inputs, such as image + text, in one request.
The embedding output uses Matryoshka Representation Learning, with dimensions that can scale down from 3072 to 1536 or 768.
Early partners cited results such as Paramount Skydance’s 85.3% text-to-video Recall@1 and Mindlid’s 20% lift in top-1 recall.
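
To make the Matryoshka point concrete: with this kind of training, the leading components of a vector are meant to remain useful on their own, so a 3072-dimension embedding can be cut to 1536 or 768 dimensions by keeping its first values. The sketch below is an illustration only. It uses random numbers as a stand-in for a real embedding, the helper name truncate_embedding is hypothetical, and the rescaling to unit length is an assumption about how a shortened vector would typically be prepared for cosine similarity, not something the article states about Google's API.

```python
import math
import random


def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale the result to unit length."""
    if dim > len(vec):
        raise ValueError("dim exceeds embedding length")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head  # nothing to rescale
    return [x / norm for x in head]


# Stand-in for a 3072-dimension embedding (assumption: a real vector
# would come from the model, not from a random generator).
rng = random.Random(0)
full = [rng.gauss(0.0, 1.0) for _ in range(3072)]

half = truncate_embedding(full, 1536)
small = truncate_embedding(full, 768)
print(len(full), len(half), len(small))
```

Smaller vectors cut storage and search cost roughly in proportion to their length, at some cost in retrieval quality that would need to be measured on real data.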

Why It Matters

Gemini Embedding 2 gives streaming and media teams a single embedding layer for text, images, video, audio, and short documents instead of separate pipelines for each format. Google positions it for multimodal retrieval, RAG, semantic search, sentiment analysis, and clustering, and says it already works with tools like LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vertex AI Vector Search. The clearest near-term signal to watch is whether developers adopt the preview through the Gemini API and Vertex AI, and whether the benchmark claims translate into production search and retrieval gains like Paramount Skydance’s 85.3% Recall@1.
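
For readers unfamiliar with the metric, text-to-video Recall@1 is the share of text queries whose single closest video embedding is the correct clip. The sketch below shows one way to compute it with cosine similarity. The three-dimensional vectors, the ground-truth list, and the function names are made up for illustration and have no connection to the partner results quoted above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def recall_at_1(query_vecs, video_vecs, truth):
    """Fraction of queries whose most similar video is the labeled one."""
    hits = 0
    for query, expected in zip(query_vecs, truth):
        best = max(range(len(video_vecs)),
                   key=lambda i: cosine(query, video_vecs[i]))
        if best == expected:
            hits += 1
    return hits / len(query_vecs)


# Toy data (assumption): three video embeddings and three text queries.
videos = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
queries = [[0.9, 0.1, 0.0], [0.2, 0.8, 0.0], [0.6, 0.0, 0.5]]
truth = [0, 1, 2]  # index of the correct video for each query

score = recall_at_1(queries, videos, truth)
print(f"Recall@1 = {score:.3f}")  # two of the three queries rank their true clip first
```

A production evaluation would use real embeddings and an approximate nearest-neighbor index instead of this brute-force loop, but the metric is the same.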

Read full article at blog.google

