arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

多模态大模型

跨文本、图像、视频、音频等模态的大模型与学习方法。

今日/当前日期收录 2 信号源:cs.CV, cs.CL, cs.AI, cs.MM, eess.AS
2606.20280 2026-06-19 cs.IR cs.AI 新提交 85%

ELVA: Exploring Ranking-Driven Universal Multimodal Retrieval

ELVA:探索排序驱动的通用多模态检索

Yuhan Liu, Pei Fu, Hang Li, Yukun Qi, Chao Jiang, Jingwen Fu, Zhen Liu, Bin Qin, Zhenbo Luo, Jian Luan, Jingmin Xin

发表机构 * National Key Laboratory of Human-Machine Hybrid Augmented Intelligence(人机混合增强智能国家级重点实验室) Institute of Artificial Intelligence and Robotics(人工智能与机器人研究院) MiLM Plus Xiaomi Inc(小米公司) Zhongguancun Academy(中关村学院) Beijing, China(北京市)

专题命中 跨模态检索 :提出ELVA框架用于通用多模态检索

AI总结 提出ELVA框架,通过基于规则的强化学习缓解对比学习中的粒度盲视问题,在通用多模态检索中实现排序优化,并在新基准MRBench上提升13.1%。

Comments Accepted by ECCV 2026

详情
AI中文摘要

利用多模态大语言模型(MLLMs)进行对比学习已成为提升通用多模态检索(UMR)性能的主流范式。然而,先前的工作在将对比范式适应到检索任务时忽略了粒度盲视问题。粒度盲视是指模型倾向于忽略查询中包含的粒度级信息,而这些信息对于有效处理复杂查询至关重要。这源于对比学习将样本视为二元分类(正/负),而忽略了每个负样本携带的不同信息。为了解决这个问题,我们认为应该根据负样本与正样本的相似度区别对待它们,使模型能够从每个负样本中学习不同的粒度信息。在本文中,我们引入了一个简单但有效的框架,称为ELVA,一种新颖的基于规则的强化学习框架,通过排序驱动的MLLMs缓解粒度盲视。1)不依赖奖励模型,我们将可验证奖励的强化学习(RLVR)扩展到检索任务,使模型能够探索新的排序行为而无需显式的排序标签。2)通过利用基于规则的奖励,我们的方法联合优化负样本的排序,同时扩大正负样本之间的相似度差距。为了更精确地衡量粒度盲视,我们进一步引入了MRBench,一个专门为多粒度查询场景设计的新基准。ELVA在标准检索基准上取得了最先进的结果,在MRBench上显著提升13.1%,进一步证明了其在缓解粒度盲视方面的有效性。

英文摘要

Leveraging Multimodal Large Language Models (MLLMs) via contrastive learning has become a mainstream paradigm for improving the performance of Universal Multimodal Retrieval (UMR). However, previous works have ignored the grain blindness when adapting the contrastive paradigm into retrieval tasks. Grain blindness refers to the tendency of the model to overlook grain-level information contained in the query, which is crucial for effectively handling complex queries. This stems from contrastive learning treating samples as a binary classification (positive/negative), while ignoring the different information carried by each negative sample. To address this, we argue that negatives should be treated differently according to their similarity to the positive sample, enabling the model to learn distinct grain information from each negative. In this paper, we introduce a simple but effective framework, called ELVA, a novel rule-based RL framework that mitigates grain blindness through ranking-driven MLLMs. 1) Instead of relying on reward models, we extend Reinforcement Learning with Verifiable Rewards (RLVR) to retrieval tasks, allowing the model to explore new ranking behaviors without explicit ranking labels. 2) By utilizing rule-based rewards, our approach jointly optimizes the ranking of negative samples while enlarging the similarity gap between positive and negative. To more precisely measure grain blindness, we further introduce MRBench, a new benchmark specifically designed for multi-grain query scenarios. ELVA achieves state-of-the-art results across standard retrieval benchmarks, and its notable 13.1% improvement on MRBench further demonstrates its effectiveness in alleviating grain blindness.

2606.20523 2026-06-19 cs.CV cs.AI cs.DB 新提交 70%

SARLO-80: Worldwide Slant SAR Language Optic Dataset 80cm

SARLO-80:全球斜距SAR语言光学数据集80cm

Solène Debuysère, Nicolas Trouvé, Nathan Letheule, Elise Colin, Georgia Channing

发表机构 * DEMR-ONERA – The French Aerospace Lab, Université Paris-Saclay(法国航空航天实验室DEMR-ONERA,巴黎-萨克雷大学) DTIS-ONERA – The French Aerospace Lab, Université Paris-Saclay(法国航空航天实验室DTIS-ONERA,巴黎-萨克雷大学) Hugging Face

专题命中 跨模态检索 :支持跨模态检索与生成的多模态数据集

AI总结 为解决高分辨率SAR与光学图像及文本对齐的数据稀缺问题,基于Umbra SLC数据构建了80cm斜距网格的SAR-光学-文本三元组数据集,支持跨模态检索与生成任务。

详情
AI中文摘要

多模态基础模型因大规模光学基准而快速发展,但合成孔径雷达(SAR)的类似资源仍然有限。现有的SAR-光学数据集主要依赖低分辨率、仅强度的地面距离检测(GRD)产品,未保留复值SAR测量或原生采集几何,限制了基于物理的多模态学习。特别是,结合甚高分辨率(VHR)SAR SLC、对齐光学图像和自然语言描述的大规模公开数据集仍然缺乏。我们提出了一个基于开源Umbra聚束模式采集的传感器独立复数据(SICD)构建的VHR SAR-光学-文本数据集。从约2500个全球场景(VV/HH,20cm–2m原生分辨率)出发,通过带限FFT重采样将所有SAR数据标准化到80cm斜距网格,并将图像分割为1024×1024的图块。对于每个SAR图块,我们检索高分辨率光学图块,并利用局部坐标对应关系将其扭曲到SAR网格以实现局部像素级对齐。我们进一步为每个样本生成三种描述变体(短/中/长),以支持视觉-语言训练和评估。我们的数据集包含119,566个三元组(复数和幅度斜距SAR图块、对齐光学图块、自然语言描述),覆盖72个国家的257个地点以及广泛的地物类型和基础设施。我们发布固定的训练/验证/测试划分以及完整的预处理和基线代码,以支持在原生SAR几何中进行跨模态检索和条件生成的多模态对齐的可重复基准测试。该数据集在Hugging Face Hub上公开可用,网址为https://this URL。

英文摘要

Multimodal foundation models have advanced rapidly thanks to large optical benchmarks, but comparable resources for synthetic aperture radar (SAR) remain limited. Existing SAR--optical datasets largely rely on low-resolution, intensity-only Ground Range Detected~(GRD) products and do not preserve complex-valued SAR measurements or native acquisition geometry, which restricts physically grounded multimodal learning. In particular, large-scale public datasets combining very-high-resolution (VHR) SAR SLC, aligned optical imagery, and natural-language descriptions are still lacking. We present a VHR SAR--optical--text dataset built from open-access Umbra spotlight acquisitions distributed as Sensor Independent Complex Data (SICD). From around 2,500 worldwide scenes (VV/HH, 20cm--2m native resolution), we standardize all SAR data to an 80cm slant-range grid via band-limited FFT resampling and tile the imagery into 1024 by 1024 patches. For each SAR patch, we retrieve a high-resolution optical tile and warp it into the SAR grid using local coordinate correspondences for local pixel-level alignment. We further generate three caption variants (SHORT/MID/LONG) per sample to support vision--language training and evaluation. Our dataset contains 119,566 triplets (complex and amplitude slant-range SAR patch, aligned optical patch, natural-language description) covering 257 locations across 72 countries and a broad range of land types and infrastructures. We release fixed train/validation/test splits and the full preprocessing and baseline code to enable reproducible benchmarks for multimodal alignment on cross-modal retrieval and conditional generation in native SAR geometry. The dataset is publicly available on the Hugging Face Hub at https://huggingface.co/datasets/ONERA/SARLO-80.