Gemini Embedding 2: natively multimodal embedding model
4 days ago
- #AI
- #embedding
- #multimodal
- Gemini Embedding 2 is Google's first natively multimodal embedding model.
- It represents text, images, video, audio, and documents in a single unified embedding space.
- The model can embed interleaved inputs (e.g., an image plus its caption) in a single request (see the API sketch after this list).
- It supports up to 8192 input tokens for text, 6 images, 120 seconds of video, and 6-page PDFs.
- Gemini Embedding 2 uses Matryoshka Representation Learning (MRL), so its embeddings can be truncated to smaller output dimensions (see the truncation sketch after this list).
- Google reports that it outperforms leading embedding models on text, image, and video tasks.
- Early-access partners are using it for high-value multimodal applications such as retrieval-augmented generation (RAG) and semantic search.
- Available via the Gemini API and Vertex AI, with support for LangChain, LlamaIndex, and other tools.
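
To make the interleaved-input point concrete, here is a minimal sketch using the google-genai Python SDK. The model id `gemini-embedding-2` is a placeholder, and passing an image part to `embed_content` is an assumption based on the announcement rather than documented SDK behavior.

```python
# Minimal sketch, not a confirmed recipe: assumes the google-genai Python SDK,
# a placeholder model id ("gemini-embedding-2"), and that embed_content accepts
# image Parts for this model, which the announcement implies but does not document.
from google import genai
from google.genai import types

client = genai.Client()  # picks up the API key from the environment

with open("product_photo.jpg", "rb") as f:  # any local image
    image_bytes = f.read()

# One request containing an interleaved image + text pair.
result = client.models.embed_content(
    model="gemini-embedding-2",  # hypothetical model id
    contents=[
        types.Part.from_bytes(data=image_bytes, mime_type="image/jpeg"),
        types.Part.from_text(text="red trail-running shoe, size 42"),
    ],
    config=types.EmbedContentConfig(output_dimensionality=768),
)

vector = result.embeddings[0].values  # one 768-dim vector for the whole pair
print(len(vector))
```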
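
The practical payoff of MRL is that a stored embedding can be shortened by keeping only its leading dimensions and re-normalizing, trading a little quality for much cheaper storage and vector search. A minimal sketch of that truncation, with made-up vector sizes rather than published specifications:

```python
# Minimal sketch of Matryoshka-style truncation: keep the first k dimensions
# and L2-normalize. The 3072-dim vectors below are random stand-ins, not
# outputs of the model or a published dimension.
import numpy as np

def truncate_embedding(vec, k: int) -> np.ndarray:
    """Keep the first k dimensions and L2-normalize the result."""
    v = np.asarray(vec, dtype=np.float32)[:k]
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
query_full = rng.normal(size=3072)
doc_full = query_full + 0.1 * rng.normal(size=3072)  # a "similar" document

# With an MRL-trained model, similarity computed on short prefixes
# approximates the similarity computed on the full-size vectors.
print(cosine(query_full, doc_full))
print(cosine(truncate_embedding(query_full, 256), truncate_embedding(doc_full, 256)))
```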