Multimodal Large Models - arXivDaily Topic

2606.19062 2026-06-18 cs.CV New submission 85%

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

Kaleem Ullah, Altaf Hussain, Muhammad Munsif, Sung Wook Baik

Affiliations: Sejong University; Korea Advanced Institute of Science and Technology; Ulsan National Institute of Science and Technology

Topic match (cross-modal retrieval): cross-modal retrieval, dual-objective encoding.

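AI summary: Proposes the DREAM model, which enhances and aligns representations along two paths, combining a hierarchical vision encoder with hybrid language modeling to reach new SOTA results on video retrieval.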

Abstract

In today's media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM: Dual-path Representation Enhancement and Alignment Model, a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely-used MSRVTT, MSVD and LSMDC benchmark datasets, where it achieves new state-of-the-art R1 scores of 49.4%, 49.7% and 27.3%, respectively. Qualitative analyses further show the model's ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research in advancing cross-modal representation learning.
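
The R1 scores quoted above are presumably Recall@1: the fraction of text queries whose correct video is ranked first. The sketch below shows how Recall@K can be computed from a query-by-video similarity matrix. It assumes one ground-truth video per query, stored on the diagonal, which may differ from the exact evaluation protocol used for MSRVTT, MSVD, or LSMDC. The function name `recall_at_k` and the toy matrix are illustrative and are not taken from the paper.

```python
def recall_at_k(sim, k=1):
    """Recall@K for retrieval.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the ground-truth video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Rank candidate videos by descending similarity.
        ranked = sorted(range(len(row)), key=lambda j: -row[j])
        if i in ranked[:k]:
            hits += 1
    return hits / len(sim)


# Toy 3x3 similarity matrix (made-up values, for illustration only).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.1, 0.2, 0.7],
]

print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```

In this toy matrix the correct video for the second query is ranked second, so Recall@1 is 2/3 and Recall@2 is 1.0.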

2606.18885 2026-06-18 cs.CV cs.IR New submission 75%

LARE: Low-Attention Region Encoding for Text-Image Retrieval

Abdulmalik Alquwayfili, Faisal Almeshal, Jumanah Almajnouni, Leena Alotaibi, Faisal Alhajari, Mohammed Alkhrashi, Alreem Almuhrij, Abdullah Aldwyish, Raied Aljadaany, Huda Alamri, Muhammad Kamran J. Khan

Affiliations: Saudi Data and Artificial Intelligence Authority (SDAIA)

Topic match (cross-modal retrieval): text-image cross-modal retrieval.

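AI summary: Proposes the LARE framework, which encodes low-attention regions and the full image in parallel to address visual encoders overlooking key details in crowded scenes, improving retrieval performance on a dense-scene subset.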

Comments: Accepted at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). Code: https://github.com/AbdulmalikDS/LARE ; Dataset: https://huggingface.co/datasets/AbdulmalekDS/Dense-Set

Abstract

Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings. To evaluate image retrieval performance in challenging crowded scenes, we introduce Dense-Set, a challenging subset derived from COCO and Flickr30K. In this subset, images are re-captioned to provide richer descriptions of low-attention or previously overlooked regions. This dataset highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded scene conditions. Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.
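
The abstract does not say how low-attention regions are selected or how the two encodings are combined, so the following is only a rough sketch of the general dual-encoding idea, not the authors' implementation. It assumes per-patch feature vectors and per-patch attention scores are already available, picks the lowest-attention patches, mean-pools them, and blends the result with a full-image embedding. The names `dual_encode`, `low_frac`, and `alpha`, and the averaging-based fusion, are all hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_pool(vectors):
    """Average a list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def dual_encode(patch_feats, attn, low_frac=0.5, alpha=0.5):
    """Hypothetical fusion of a full-image and a low-attention-region embedding.

    patch_feats: list of per-patch feature vectors (assumed given).
    attn:        one attention score per patch (assumed given).
    low_frac:    fraction of patches treated as "low attention" (assumption).
    alpha:       weight of the full-image embedding in the blend (assumption).
    """
    n_low = max(1, int(len(attn) * low_frac))
    # Indices of the patches with the smallest attention scores.
    low_idx = sorted(range(len(attn)), key=lambda i: attn[i])[:n_low]

    full_emb = l2_normalize(mean_pool(patch_feats))
    low_emb = l2_normalize(mean_pool([patch_feats[i] for i in low_idx]))

    fused = [alpha * f + (1 - alpha) * l for f, l in zip(full_emb, low_emb)]
    return l2_normalize(fused)


# Toy example: two salient patches along axis 0, two overlooked ones along axis 1.
patch_feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
attn = [0.4, 0.4, 0.1, 0.1]
print(dual_encode(patch_feats, attn))
```

In the toy example the fused embedding leans toward the direction of the overlooked patches, whereas setting `alpha=1.0` reduces it to the plain full-image embedding.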
