Google adds multimodal embeddings across text, video, audio
Google has launched Gemini Embedding 2, its first natively multimodal embedding model, now available in Public Preview via the Gemini API and Vertex AI. This model can map text, images, video, audio, and documents (up to 6 pages) into a single embedding space, facilitating multimodal retrieval and classification and enhancing AI applications like Retrieval-Augmented Generation (RAG) and semantic search.
Key Takeaways
- Gemini Embedding 2 is Google’s first natively multimodal embedding model, available now in Public Preview.
- The model accepts text, up to 6 images, up to 120 seconds of video, audio, and PDFs up to 6 pages.
- Google says the model captures semantic intent across more than 100 languages and accepts interleaved inputs, such as image + text, in one request.
- The embedding output uses Matryoshka Representation Learning, with dimensions that can scale down from 3072 to 1536 or 768.
- Early partners cited results such as Paramount Skydance’s 85.3% text-to-video Recall@1 and Mindlid’s 20% lift in top-1 recall.
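The dimension scaling mentioned in the takeaways can be illustrated with a minimal sketch. With Matryoshka-style embeddings, a smaller vector is usually obtained by keeping the leading components and rescaling to unit length. The function name `truncate_embedding` and the toy vector are hypothetical, not part of Google's API, and the hosted model may handle truncation or normalization on its side.

```python
import math

def truncate_embedding(vec, dim):
    """Keep the first `dim` components and rescale them to unit length."""
    if not 0 < dim <= len(vec):
        raise ValueError("dim must be between 1 and len(vec)")
    head = vec[:dim]
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return list(head)  # all-zero prefix: nothing to rescale
    return [x / norm for x in head]

# Toy stand-in for a 3072-dimensional model output (values are made up).
full = [math.sin(i + 1) for i in range(3072)]

small = truncate_embedding(full, 768)
print(len(small))  # 768
```

Smaller vectors reduce storage and speed up similarity search, usually at some cost in retrieval quality, so the right size is worth measuring on your own data.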
Why It Matters
Gemini Embedding 2 gives streaming and media teams a single embedding layer for text, images, video, audio, and short documents instead of separate pipelines for each format. Google positions it for multimodal retrieval, RAG, semantic search, sentiment analysis, and clustering, and says it already works with tools like LangChain, LlamaIndex, Haystack, Weaviate, Qdrant, ChromaDB, and Vertex AI Vector Search. The clearest near-term signals to watch are whether developers adopt the preview through the Gemini API and Vertex AI, and whether partner-reported results such as Paramount Skydance’s 85.3% Recall@1 carry over to production search and retrieval.
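For teams that want to check such claims on their own catalog, Recall@1 is simple to compute: it is the share of queries whose top-ranked result is the labeled match. The sketch below assumes embeddings have already been obtained from some model. The vectors, IDs, and helper names are made up for illustration and do not come from Google or its partners.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def recall_at_1(queries, corpus, relevant):
    """Fraction of queries whose top-ranked corpus item is the labeled one."""
    hits = 0
    for query_id, query_vec in queries.items():
        best = max(corpus, key=lambda item_id: cosine(query_vec, corpus[item_id]))
        if best == relevant[query_id]:
            hits += 1
    return hits / len(queries)

# Hypothetical 3-dimensional embeddings; real ones would have 768 to 3072 dimensions.
corpus = {
    "clip_a": [1.0, 0.0, 0.0],
    "clip_b": [0.0, 1.0, 0.0],
    "clip_c": [0.0, 0.0, 1.0],
}
queries = {
    "q1": [0.9, 0.1, 0.0],
    "q2": [0.1, 0.8, 0.1],
    "q3": [0.7, 0.2, 0.1],  # labeled clip_c, but closest to clip_a
    "q4": [0.0, 0.2, 0.9],
}
relevant = {"q1": "clip_a", "q2": "clip_b", "q3": "clip_c", "q4": "clip_c"}

score = recall_at_1(queries, corpus, relevant)
print(score)  # 0.75
```

Because text queries and video clips share one embedding space, the same comparison works across modalities without a separate model per format.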
Read full article at blog.google