Gemini Embedding 2: A Native Multimodal Embedding Model from Gemini
Madhuri Shanbhogue, Zhe Li, Shanfeng Zhang, Gustavo Hernández Ábrego, Shih-Cheng Huang, Aashi Jain, Daniel Salz, Sonam Goenka, Chaitra Hegde, Ji Ma, Feiyang Chen, Jiaxing Wu, Tanmaya Dabral, Babak Samari, Kevin Poulet, Daniel Cer, Kaifeng Chen, Paul Suganathan, Hui Hui, Jovan Andonov, Philippe Schlattner, Jay Han, Iftekhar Naim, Wing Lowe, Vladimir Pchelin, Albert Yang, Yi-Ting Chen, Zhongli Ding, Grace Zhang, Georg Heigold, Yichang Chen, Antoine Reveillon, Brendan Mccloskey, Wenlei Zhou, Dahun Kim, Rui Meng, Emma Wang, Jack Zheng, Halley Fede, Zhen Yang, Keegan Mosley, Brian Potetz, Sahil Dua, Henrique Schechter Vera, Shen Gao, Hesen Zhang, Andreas Hess, Hengxuan Ying, Alberto Montes, Karan Gill, Min Choi, Sebastian Russo, Anja Hauth, Jinhyuk Lee, Michael Boratko, Megan Barnes, Vikram Rao, Claudiu Musat, Cyril Allauzen, Ehsan Variani, Shankar Kumar, Tom Bagby, Junyi Jiao, Yang Gu, Tengxin Li, Ayush Agrawal, Roberto Santana, Dev Nath, Stephen Karukas, Shuoxuan Han, Lucia Loher, Alice Twu, Nidhi Vyas, Siddharth Bhai, Frank Palma Gomez, Wangyuan Zhang, Chaoren Liu, Jizheng Yang, Steve Qiu, Shijie Zhang, Sujay Kulkarni, Sascha Rothe, Sean Nakamoto, Raphael Hoffmann, Zach Gleicher, Yunhsuan Sung, Qin Yin, Tom Duerig, Mojtaba Seyedhosseini
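AI summary: Proposes Gemini Embedding 2, a native multimodal embedding model that unifies the representation space for video, audio, image, and text through multi-task, multi-stage contrastive learning, reaching state-of-the-art performance on unimodal, cross-modal, and multimodal retrieval tasks.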
Details
We introduce Gemini Embedding 2, a native multimodal embedding model that embeds video, audio, image, and text in a unified representation space. We leverage the multimodal capabilities of Gemini to produce embeddings for arbitrary interleaved combinations of these modalities, and the resulting embeddings generalize well across a wide variety of tasks. Applying large-scale contrastive learning in a multi-task, multi-stage training setup, we achieve state-of-the-art performance on key embedding benchmarks covering unimodal, cross-modal, and multimodal retrieval. Our embedding model performs strongly across a variety of tasks (62.9 R@1 on MSCOCO, 68.8 NDCG@10 on Vatex, 69.9 on MTEB multilingual, and 84.0 on MTEB Code), surpassing specialized models. These unified capabilities make Gemini Embedding 2 a promising candidate for downstream use cases such as RAG, recommendation, and search. Furthermore, its robust zero-shot performance across distinct fields, from astronomy and bioscience to fine arts and the culinary arts, establishes it as a highly reliable, out-of-the-box representation even for specialized domains.
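For readers unfamiliar with the reported metrics, the following is a minimal sketch of how R@1 and NDCG@10 could be computed for a single query once items are embedded in a shared vector space. It is not the paper's evaluation code. The cosine-similarity ranking, the toy vectors, and the helper names (`rank_documents`, `recall_at_k`, `ndcg_at_k`) are assumptions made for illustration, and `recall_at_k` assumes the convention, common in image-text retrieval, that a query counts as a hit when any relevant item appears in the top k.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_documents(query, docs):
    """Return document indices sorted by descending similarity to the query."""
    scores = [cosine(query, d) for d in docs]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


def recall_at_k(ranking, relevant, k):
    """1.0 if any relevant index appears in the top k, else 0.0 (assumed convention)."""
    return 1.0 if any(i in relevant for i in ranking[:k]) else 0.0


def ndcg_at_k(ranking, gains, k):
    """NDCG@k with graded gains given as {doc_index: gain}."""
    dcg = sum(gains.get(i, 0.0) / math.log2(pos + 2)
              for pos, i in enumerate(ranking[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(pos + 2) for pos, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0


# Toy 2-dimensional "embeddings"; real embeddings would come from the model.
query = [1.0, 0.0]
docs = [[0.0, 1.0], [0.9, 0.1], [0.5, 0.5]]

ranking = rank_documents(query, docs)
print(ranking)
print(recall_at_k(ranking, relevant={1}, k=1))
print(ndcg_at_k(ranking, gains={2: 1.0}, k=10))
```

A benchmark score is then the average of the per-query value over all queries. Because every modality shares one space, the same ranking code would apply whether the query and documents are text, images, audio, or video.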