Multimodal Information Fusion - arXivDaily Topic

2606.19325 2026-06-18 cs.SD cs.AI cs.CV New submission 90%

Reference-Driven Multi-Speaker Audio Scene Generation from In-the-Wild Priors

Michael Finkelson, Daniel Segal, Eitan Richardson, Shahar Armon, Nani Goldring, Poriya Panet, Nir Zabari, Benjamin Brazowski, Or Patashnik, Yoav HaCohen

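Affiliations: Lightricks; Tel Aviv University

Topic match (audio-visual / vision-language fusion): generates multi-speaker audio scenes from multiple reference voices and a text prompt.

AI summary: Proposes ScenA, which uses a pretrained text-to-audio flow-matching foundation model to generate multi-speaker audio scenes from multiple reference voices and a natural-language prompt, and adopts a high-noise-biased timestep distribution to counter the Reference Shortcut; it outperforms existing systems on the CoVoMix2-Dialogue benchmark.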

Comments Project page at https://finmickey.github.io/scena/

Abstract

Existing multi-speaker dialogue systems bind speakers to utterances through structured supervision: per-turn tags, multi-stream transcriptions, or learnable speaker embeddings. These systems operate within speech-only pipelines that produce clean vocal sequences without the ambient texture of real conversations. We take a different approach. Our method, ScenA, conditions a text-to-audio flow-matching foundation model, pretrained on large-scale in-the-wild data, directly on multiple reference voices and a free-form natural language prompt that describes an entire multi-speaker audio scene. Leveraging such a foundational model allows us to inherit its capacity for natural, non-studio audio: background noise, room acoustics, overlapping dialogue, and spontaneous paralinguistic events, while adding multi-speaker control without any per-turn structure. Concretely, reference latents are concatenated into the model's token sequence and distinguished by lightweight identity-aware positional encodings. However, we identify a critical obstacle to this approach: the Reference Shortcut. During training under standard noise schedules, the model can identify the matching reference by acoustic similarity to the noisy target, bypassing the text prompt entirely. We address this with a high-noise-biased timestep distribution that forces the model to rely on the text prompt for speaker assignment. We evaluate ScenA on the CoVoMix2-Dialogue benchmark, showing that it outperforms existing multi-speaker systems on speaker-binding metrics while generating rich conversational audio with overlapping speech, emotional vocalizations, and ambient sound. Our results demonstrate the advantage of using a general-purpose audio model conditioned on a free-form scene description, rather than passing structured dialog scripts through a speech-only pipeline.
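
The abstract does not specify the form of the high-noise-biased timestep distribution. As a minimal sketch, assuming a flow-matching convention in which t = 1 is pure noise and t = 0 is clean data, one hypothetical way to skew training timesteps toward high noise is to draw them from a Beta(a, b) distribution with a > b. The function names and parameter values below are illustrative assumptions, not ScenA's actual settings.

```python
import random


def sample_high_noise_timesteps(n, a=4.0, b=1.0, seed=None):
    """Draw n timesteps in [0, 1] skewed toward 1 (assumed to mean pure noise).

    Beta(a, b) with a > b concentrates mass near 1; a = b = 1 recovers the
    uniform distribution. Both values are illustrative, not from the paper.
    """
    rng = random.Random(seed)
    return [rng.betavariate(a, b) for _ in range(n)]


def noisy_interpolant(x_clean, noise, t):
    """Linear flow-matching interpolant under the assumed convention.

    At t = 1 the result is pure noise, so the target carries little acoustic
    information that could be matched against a reference voice.
    """
    return [(1.0 - t) * x + t * e for x, e in zip(x_clean, noise)]


biased = sample_high_noise_timesteps(10_000, seed=0)
uniform = sample_high_noise_timesteps(10_000, a=1.0, b=1.0, seed=0)
print(f"mean t (biased):  {sum(biased) / len(biased):.3f}")
print(f"mean t (uniform): {sum(uniform) / len(uniform):.3f}")
```

With most training steps drawn near t = 1, the noisy target is dominated by noise, which is the condition under which acoustic matching against the references becomes unreliable and the text prompt has to carry the speaker assignment.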

2606.19062 2026-06-18 cs.CV New submission 90%

DREAM: Extending Vision-Language Models with Dual-Objective Encoding for Cross-Modal Retrieval

Kaleem Ullah, Altaf Hussain, Muhammad Munsif, Sung Wook Baik

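Affiliations: Sejong University; Korea Advanced Institute of Science and Technology; Ulsan National Institute of Science and Technology

Topic match (audio-visual / vision-language fusion): proposes a dual-path vision-language model for cross-modal video retrieval.

AI summary: Proposes DREAM, which combines dual-path representation enhancement and alignment with a hierarchical vision encoder and hybrid language modeling, reaching new state-of-the-art results in video retrieval.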

Abstract

In today's media-driven world, the exponential growth of video content across domains such as surveillance, education, and entertainment has made retrieving semantically relevant videos via natural language queries increasingly critical. Early video retrieval systems relied on handcrafted features or shallow cross-modal mappings, limiting their ability to capture complex semantics and temporal dynamics. While large-scale vision-language models have improved cross-modal alignment, challenges remain in modeling fine-grained temporal dependencies and nuanced linguistic structures. In this paper, we introduce DREAM: Dual-path Representation Enhancement and Alignment Model, a novel multimodal framework that addresses these limitations through enhanced visual and textual encoding. DREAM incorporates a hybrid language modeling strategy that combines masked and permuted language modeling objectives to capture both local and global linguistic semantics. On the visual side, we design a hierarchical vision encoder with cascaded group attention, which integrates spatial and temporal information through multi-stage token interaction and coarse-to-fine attention refinement. We validate DREAM through comprehensive evaluations on the widely-used MSRVTT, MSVD and LSMDC benchmark datasets, where it achieves new state-of-the-art R1 scores of 49.4%, 49.7% and 27.3%, respectively. Qualitative analyses further show the model's ability to maintain coherent attention across frames and align complex queries with dynamic video content. These findings underscore the effectiveness of hierarchical attention and dual-objective textual modeling in enabling robust, context-aware video retrieval, and pave the way for future research in advancing cross-modal representation learning.
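
The R1 (Recall@1) scores quoted above measure how often the top-ranked video for a text query is the correct one. A minimal sketch of the metric, assuming a square query-by-video similarity matrix in which query i is paired with video i (the toy matrix is made up for illustration and is unrelated to DREAM's results):

```python
def recall_at_k(similarity, k=1):
    """Fraction of queries whose paired item (same index) ranks in the top k."""
    hits = 0
    for i, row in enumerate(similarity):
        # Indices of candidates sorted by descending similarity to query i.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy text-to-video similarity matrix (rows: queries, columns: videos).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],  # query 1 ranks video 2 above its own video
    [0.1, 0.2, 0.7],
]
print(recall_at_k(sim, k=1))
print(recall_at_k(sim, k=2))
```

In this toy case two of three queries retrieve their own video first, so Recall@1 is 2/3, while every query finds its video within the top two.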

2606.14702 2026-06-18 cs.CV New submission 90%

OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

Xinyue Cai, Chaoyou Fu, Yi-Fan Zhang, Ran He, Caifeng Shan

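Affiliations: Nanjing University; CASIA (Institute of Automation, Chinese Academy of Sciences)

Topic match (audio-visual / vision-language fusion): an audio-visual question answering dataset involving joint reasoning over audio and visual modalities.

AI summary: Proposes the OmniVideo-100K dataset, built with entity-anchored video scripting and clue-guided QA generation to address cross-segment entity inconsistency and weak long-range reasoning in audio-visual QA; fine-tuned models improve markedly on several benchmarks.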

Comments Project page: https://github.com/MiG-NJU/OmniVideo-100K

Abstract

Current automated pipelines for audio-visual Question Answering (QA) generally adopt a "video-caption-QA" paradigm. However, these methods typically segment videos into short clips and generate separate descriptions for audio and visual modalities. This decoupled processing severs inherent associations between sounds and their visual sources, while independent clip processing often causes inconsistent descriptions of the same entity across segments. Furthermore, coupling long-text comprehension and QA synthesis into a single step often restricts models to localized events, yielding questions lacking long-term temporal connections and deep cross-modal reasoning. To address these issues, we propose an automated data engine featuring two mechanisms: (1) Entity-Anchored Video Scripting transforms videos into structured scripts, comprising summaries, main entity lists, and segment-wise audio-visual descriptions. The entity list serves as a global prior to ensure cross-segment referential consistency and reconstruct audio-visual associations. (2) Clue-Guided QA Generation prompts models to first mine cross-segment, multimodal clues from the script, and subsequently generate QA pairs based on these high-value clues. Leveraging this pipeline, we construct the instruction-tuning dataset OmniVideo-100K and a human-verified test set, OmniVideo-Test. Fine-tuning VITA-1.5, Qwen2.5-Omni-7B and Qwen3-Omni-30B on OmniVideo-100K yields performance gains of up to 20.59% on OmniVideo-Test, demonstrating strong generalization (up to 12.64% improvements) across established benchmarks like Daily-Omni and JointAVBench.

2606.19341 2026-06-18 cs.CV cs.CL cs.SD New submission 85%

Native Active Perception as Reasoning for Omni-Modal Understanding

Zhenghao Xing, Ruiyang Xu, Yuxuan Wang, Jinzheng He, Ziyang Ma, Qize Yang, Yunfei Chu, Jin Xu, Junyang Lin, Chi-Wing Fu, Pheng-Ann Heng

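Affiliations: The Chinese University of Hong Kong; Shanghai Jiao Tong University; Nanyang Technological University; Qwen Team, Alibaba Group

Topic match (audio-visual / vision-language fusion): an omni-modal agent that fuses audio-visual cues for video understanding.

AI summary: Proposes OmniAgent, a native omni-modal agent built on a POMDP-based iterative Observation-Thought-Action cycle; active perception decouples reasoning complexity from video duration, and the agent achieves the best performance among open-source models on multiple benchmarks.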

Comments Accepted at ICML 2026. Code and models: https://github.com/harryhsing/omniagent

Abstract

Passive models for long video understanding typically rely on a "watch-it-all" paradigm, processing frames uniformly regardless of query difficulty, causing computational cost to grow with video duration. Although interactive frameworks have emerged, they often rely on global pre-scanning, and their context cost still scales with video length. We propose OmniAgent, the first native omni-modal agent that formulates video understanding as a POMDP-based iterative Observation-Thought-Action cycle. OmniAgent executes on-demand actions to selectively distill audio-visual cues into a persistent textual memory, effectively decoupling reasoning complexity from raw video duration. To operationalize this, we introduce (1) Agentic Supervised Fine-Tuning to bootstrap native active perception via best-of-N trajectory synthesis with dual-stage quality control, and (2) Agentic Reinforcement Learning with TAURA (Turn-aware Adaptive Uncertainty Rescaled Advantage), which leverages turn-level entropy to steer credit assignment toward pivotal discovery turns. Crucially, OmniAgent exhibits positive test-time scaling, where performance improves as the number of reasoning turns increases, validating the efficacy of active perception. Empirical results across ten benchmarks (e.g., VideoMME, LVBench) demonstrate that OmniAgent achieves state-of-the-art performance among open-source models. Notably, on LVBench, our 7B agent outperforms the 10× larger Qwen2.5-VL-72B (50.5% vs. 47.3%).

2606.18974 2026-06-18 cs.CV New submission 85%

Visual-OPSD: Cross-Modal On-Policy Self-Distillation for Efficient Unified Multimodal Reasoning

Pengyu Li, Zhitao Gao, Lingling Zhang, Muye Huang, Yuanming Li, Fangzhi Xu, Jun Liu

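Affiliations: Xi'an Jiaotong University; MOE KLINNS Lab; Shaanxi Province Key Laboratory of Big Data Knowledge Engineering; Sun Yat-sen University

Topic match (audio-visual / vision-language fusion): cross-modal self-distillation transfers visual reasoning to a text-only model.

AI summary: Proposes Visual-OPSD, which uses cross-modal on-policy self-distillation to transfer the visual-thought reasoning produced by multi-step diffusion to a text-only student, yielding a 14.3× speedup and a 3.40 percentage-point improvement.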

Abstract

Unified multimodal models (UMMs) interleave generated "visual thoughts" (VTs) with text reasoning to improve spatial tasks. This incurs roughly an order-of-magnitude inference cost from multi-step diffusion. We find this cost yields limited direct benefit. On ThinkMorph, removing or noising VTs barely changes accuracy across nine benchmarks. Once rendered, attention concentrates on the VT regardless of content. Yet a KL diagnostic shows that conditioning on a privileged VT trace shifts the model's completion distribution. This suggests the generation pathway encodes useful reasoning beyond the rendered pixels. Motivated by this gap, we propose Visual On-Policy Self-Distillation (Visual-OPSD). Teacher and student share identical weights but differ in context: the teacher sees privileged VTs while the student sees only the question. Token-level JSD distillation on on-policy student trajectories transfers the teacher's reasoning to a text-only student. Across nine benchmarks, Visual-OPSD improves over its generative teacher by +3.40 pp with a 14.3× speedup (10.0 s vs. 142.8 s per sample) and outperforms same-scale VLMs by +63.83 pp on VSP. A Gaussian-noise control (+0.40 pp vs. +10.28 pp for real VTs) and 58.4% closure of the KL gap confirm that gains come from the semantic content of the generation pathway.
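
To make the distillation objective concrete, here is a minimal sketch of token-level Jensen-Shannon divergence between a teacher and a student next-token distribution, averaged over the positions of one trajectory. The abstract does not give the exact loss (weighting, temperature, or any generalized-JSD interpolation), so the equal-weight mixture and the toy distributions below are assumptions.

```python
import math


def kl(p, q):
    """KL(p || q) in nats; terms with p_i = 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p, q):
    """Jensen-Shannon divergence with an equal-weight mixture m = (p + q) / 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def token_level_jsd(teacher_dists, student_dists):
    """Mean JSD over the token positions of one sampled student trajectory."""
    per_token = [jsd(t, s) for t, s in zip(teacher_dists, student_dists)]
    return sum(per_token) / len(per_token)


# Toy next-token distributions over a 3-word vocabulary at two positions.
teacher = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
student = [[0.4, 0.4, 0.2], [0.1, 0.8, 0.1]]
print(f"loss = {token_level_jsd(teacher, student):.4f}")
```

JSD is symmetric and bounded above by ln 2, which keeps the per-token signal finite even where teacher and student put mass on disjoint tokens. In the setup the abstract describes, the two distributions would come from the same weights evaluated with and without the privileged VT context.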

2606.18780 2026-06-18 cs.CV cs.CL cs.MM New submission 85%

SAMA: Semantic Anchor-aligned Augmentation for Unified Low-Resource Multimodal Information Extraction

Quanjiang Guo, Chong Mu, Jiazhou Pan, Ming Jia, Ling Tian, Hui Gao, Zhao Kang

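Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China

Topic match (audio-visual / vision-language fusion): augmentation for multimodal information extraction, fusing visual and language modalities.

AI summary: Proposes SAMA, a semantic anchor-aligned augmentation framework that builds structured semantic anchors to guide a multi-expert multimodal LLM in generating high-fidelity text, synthesizes images with an anchor-preserving diffusion mechanism, and adds a dual-constraint filtering module, markedly improving low-resource multimodal information extraction.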

Comments Accepted by IEEE Transactions on Multimedia

Abstract

Multimodal Information Extraction (MIE), covering tasks such as Multimodal Named Entity Recognition (MNER), Relation Extraction (MRE), and Event Extraction (MEE), is essential for understanding multimedia content but remains constrained by severe data scarcity. Although data augmentation is a promising remedy, existing approaches are impeded by coarse cross-modal alignment and fragmented, task-specific designs that fail to exploit shared semantic knowledge. To overcome these limitations, we introduce Semantic Anchor-aligned Multimodal Augmentation (SAMA), a unified framework for generating high-fidelity, task-aware synthetic data. SAMA constructs structured semantic anchors from ground-truth labels to guide a Collaborative Multi-Experts Multimodal Large Language Model (CME-MLLM), which integrates a Universal Adapter for shared semantics with Task-Specific Adapters to produce diverse yet constraint-compliant textual samples. For image synthesis, SAMA employs an Anchor-Preserving Diffusion mechanism that uses anchor-weighted prompts and latent conditioning to maintain critical semantic anchors while diversifying visual contexts. To eliminate the need for manual verification, SAMA further introduces a Dual-Constraint Filtering module that selects synthetic samples based on both cross-modal consistency and anchor fidelity. Extensive experiments across benchmark datasets for MNER, MRE, and MEE demonstrate that SAMA consistently outperforms state-of-the-art augmentation baselines under both fully supervised and low-resource settings, underscoring its versatility, robustness, and effectiveness.

2606.18586 2026-06-18 cs.CV cs.AI New submission 85%

APT: Atomic Physical Transitions for Causal Video-Language Understanding

Shang Wu, Haoran Lu, Songling Liu, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

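Affiliations: Northwestern University; Dolby Laboratories

Topic match (audio-visual / vision-language fusion): proposes APTs to represent causal state changes in video for video-language understanding, a form of vision-language fusion.

AI summary: Proposes Atomic Physical Transitions (APTs) as an explicit representation of causal state changes in video, builds a mixed-source dataset, and introduces the APT-Tune fine-tuning recipe so that VLMs learn physical transitions without forgetting event-level knowledge.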

Abstract

Physical events are not understood by their names alone, but by the causal state changes that compose them. A clip-level label such as "bounce" can be correct while hiding the process that makes the event physically valid, from support loss and contact onset to rebound and settling. To make this hidden process explicit, we introduce Atomic Physical Transitions (APTs): minimal, temporally localized state changes that bind a visible cue to an active physical mechanism and before/after dynamical regimes. An APT chain represents a video as an ordered causal transition sequence rather than a single aggregate event label: event labels tell what happened; APT chains explain why it happened. To make APTs learnable by VLMs, we construct mixed-source APT data from human annotations and simulator ground truth, covering 14 transition types across contact, gravity, friction, and rotation/stability, with 27,303 timed instances over 1,246 trials. Using this data, we find that current VLMs miss transition-level physics, with zero-shot recall at most 14% and errors dominated by missed transitions. Direct fine-tuning on APT chains improves transition detection but causes event-level forgetting, indicating that the model learns a specialized answer format rather than a reusable physical representation. We therefore propose APT-Tune, a parameter-efficient recipe that teaches VLMs to use causal transitions without forgetting how to answer video questions. It combines image-pad-aware supervision, format-conditional co-training, and mechanism-conditioned domain-to-type decoding to make APT learning format-robust and physically grounded. With only 11 M LoRA parameters on Qwen3-VL-2B, APT-Tune substantially improves APT recall while also improving event-level video transfer. These results show that APTs are not a new answer format, but a human-aligned causal supervision signal for physical video understanding.

2606.18553 2026-06-18 cs.CV New submission 85%

Hierarchical Multi-Modal Retrieval for Knowledge-Grounded News Image Captioning

Minh-Loi Nguyen, Xuan-Vu Le, Long-Bao Nguyen, Hoang-Bach Ngo, Trung-Nghia Le

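Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City

Topic match (audio-visual / vision-language fusion): hierarchical multi-modal retrieval augments news image captioning, fusing vision and text.

AI summary: Proposes an image captioning framework augmented by hierarchical multi-modal article retrieval; structure-aware retrieval and contextual refinement are combined with a VLM and an LLM to generate captions rich in contextual detail, placing 5th in the EVENTA 2025 Challenge.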

Comments SOICT 2025

Abstract

Traditional image captioning methods often struggle to generate comprehensive, context-rich descriptions, especially for details not directly observable from visual cues. To overcome this, we propose a novel retrieval-augmented image captioning framework that generates captions with deeper insights, such as object attributes, event context, and underlying significance, by leveraging external knowledge. Our approach features a hierarchical multi-modal article retrieval mechanism that moves beyond monolithic text entities. This retrieval considers article structure-aware features, including weighted textual components (e.g., headlines, body sections) and visual placement patterns, alongside multi-faceted similarity computations (content-visual, visual-visual, and discourse positioning). A subsequent contextual relevance refinement stage further enhances the retrieved information. The retrieved articles then serve as the knowledge base for caption generation: first, a VLM generates a concise image description; second, we segment relevant information from the retrieved articles based on this description; and finally, an LLM utilizes both the description and extracted knowledge to generate a comprehensive, contextually detailed caption. We participated in the ACM Multimedia EVENTA 2025 Challenge and achieved 5th place with an overall score of 0.2824 on the private test set of the OpenEvent-V1 dataset. Source code is publicly released at https://github.com/mf0212/EVENTA-Challange.

2606.18472 2026-06-18 cs.CV New submission 85%

Domain Generalizable Adaptation of 3D Vision-Language Models via Regularized Fine-Tuning

Sneha Paul, Zachary Patterson, Nizar Bouguila

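Affiliations: Concordia University

Topic match (audio-visual / vision-language fusion): domain generalization for 3D vision-language models, fusing point cloud, visual, and text modalities.

AI summary: Proposes ReFine3D, a framework that improves the domain generalization of 3D large multimodal models (LMMs) through regularization strategies including selective layer tuning, multi-view consistency, synonym-based prompts, and point-rendered vision supervision.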

Comments Accepted at Transactions on Machine Learning Research (TMLR)

Abstract

Domain adaptation remains a central challenge in 3D vision, especially for multimodal foundation models that align 3D point clouds with visual and textual data. While these models demonstrate strong general capabilities, adapting them to downstream domains with limited data often leads to overfitting and catastrophic forgetting. To address this, we introduce ReFine3D, a regularized fine-tuning framework designed for domain-generalizable tuning of 3D large multimodal models (LMMs). ReFine3D combines selective layer tuning with two targeted regularization strategies: multi-view consistency across augmented point clouds and text diversity through synonym-based prompts generated by large language models. Additionally, we incorporate point-rendered vision supervision and a test-time augmentation mechanism with confidence-based aggregation to further enhance robustness. Extensive experiments across different 3D domain generalization benchmarks show that ReFine3D improves base-to-novel class generalization by 1.36%, cross-dataset transfer by 2.43%, robustness to corruption by 1.80%, and few-shot accuracy by up to 3.11%, outperforming prior state-of-the-art methods with minimal added computational overhead.
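
The abstract does not detail how the confidence-based aggregation works. One plausible reading, sketched below purely as an assumption, weights each augmented view's class probabilities by that view's maximum probability before averaging. The function name and toy numbers are hypothetical and are not taken from ReFine3D.

```python
def aggregate_by_confidence(view_probs):
    """Combine per-view class probabilities, weighting each view by its confidence.

    Confidence is taken to be the view's maximum class probability (an
    assumption); the weighted average is itself a probability distribution.
    """
    weights = [max(p) for p in view_probs]
    total = sum(weights)
    num_classes = len(view_probs[0])
    return [
        sum(w * p[c] for w, p in zip(weights, view_probs)) / total
        for c in range(num_classes)
    ]


# Class probabilities predicted for three augmented views of one point cloud.
views = [
    [0.90, 0.05, 0.05],  # confident view
    [0.40, 0.35, 0.25],  # uncertain view, contributes less
    [0.10, 0.80, 0.10],  # confident but disagreeing view
]
fused = aggregate_by_confidence(views)
predicted_class = max(range(len(fused)), key=lambda c: fused[c])
print(fused, predicted_class)
```

Compared with a plain mean over views, this weighting lets low-confidence augmentations influence the final prediction less, which is one way test-time augmentation can be made more robust to corrupted inputs.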

2606.19100 2026-06-18 cs.CV New submission 80%

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

Diogo Glória-Silva, João Cardeira, Manuel Letras da Luz, Afonso Simplício, Gonçalo Vinagre, Diogo Tavares, Rafael Ferreira, Inês Calvo, Inês Vieira, David Semedo, João Magalhães

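Affiliations: NOVA School of Science and Technology; NOVA LINCS

Topic match (audio-visual / vision-language fusion): builds a vision-language model for European Portuguese.

AI summary: To address the lack of open-source multimodal models for European Portuguese, proposes AMALIA-VL, which uses three-stage training and a Portuguese-centric data mix to establish a strong baseline, with model weights, training data, and pipelines slated for release.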

Abstract

Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute a purposefully designed three-stage training process (vision-language alignment, general visual instruction tuning, and preference optimization) together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs. We will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.

2606.18992 2026-06-18 cs.CV New submission 80%

Show, Don't Ask: Generative Visual Disambiguation for Composed Image Retrieval with Turn-Valid Coverage

Amsisan Tran, Baogh Le, Tuan Kiet Pham, Sui Yang Guang

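Affiliations: Amsisan Tran; Baogh Le; Tuan Kiet Pham; Sui Yang Guang

Topic match (audio-visual / vision-language fusion): composed image retrieval involves cross-modal fusion of vision and text.

AI summary: Proposes CLARA, a framework that shows users a panel of visual alternatives to choose from and reweights calibration by likelihood ratio to maintain coverage guarantees across multiple rounds; it disambiguates composed image retrieval queries effectively and outperforms text-question baselines.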

Abstract

Composed image retrieval (CIR) uses a reference image and a text modification to search for a target image. However, such queries often describe several possible images rather than one exact target, making the user's intent ambiguous. Recent methods address this by using conformal prediction to estimate ambiguity and by asking users clarifying text questions. However, these methods have two limitations: their coverage guarantee only holds at the first interaction, and text questions are often insufficient for resolving fine-grained visual differences such as appearance, attributes, or viewpoint. We propose CLARA, a clarification framework that resolves ambiguity by showing users a small panel of visual alternatives. Instead of answering text questions, the user simply selects the prototype image closest to the intended target. This provides a direct visual signal and avoids relying on a model to predict the user's answer. To maintain valid conformal guarantees across multiple interaction rounds, CLARA reweights calibration using the likelihood ratio induced by the user's selection. The displayed prototypes are also constrained to represent the current candidate set and are snapped to real corpus images, ensuring that generated images cannot artificially improve coverage. Experiments on open-domain and fashion benchmarks show that CLARA matches single-turn state-of-the-art retrieval performance, maintains nominal coverage across interaction rounds, and finds the intended target in fewer rounds than strong text-question baselines. Its advantage is especially clear when ambiguity involves viewpoint or fine-grained attributes, where visual clarification is more effective than textual questioning.
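
As a rough illustration of reweighted calibration, the sketch below computes a weighted split-conformal threshold: each calibration nonconformity score carries a weight (standing in for a likelihood ratio), and the threshold is the smallest score at which the normalized cumulative weight reaches 1 - alpha, with the test point's own weight placed at infinity. This follows the general weighted conformal prediction recipe; how CLARA derives its likelihood ratios from user selections is not described in the abstract, so the weights here are made-up placeholders.

```python
import math


def weighted_conformal_threshold(scores, weights, test_weight, alpha):
    """Weighted (1 - alpha) quantile of calibration scores.

    The test point contributes `test_weight` of mass at +infinity, so the
    threshold is infinite when the calibration mass alone cannot reach 1 - alpha.
    """
    total = sum(weights) + test_weight
    cumulative = 0.0
    for score, weight in sorted(zip(scores, weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    return math.inf


# Calibration nonconformity scores with hypothetical likelihood-ratio weights.
scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
uniform = [1.0] * len(scores)
shifted = [0.2, 0.2, 0.2, 0.5, 0.5, 1.0, 2.0, 2.0, 2.0]

print(weighted_conformal_threshold(scores, uniform, 1.0, alpha=0.25))
print(weighted_conformal_threshold(scores, shifted, 1.0, alpha=0.25))
```

With uniform weights the threshold is 0.8; upweighting the high-score calibration points raises it to 0.9, so the candidate set widens when the observed interaction makes hard cases more likely. Candidates whose score falls at or below the threshold would be kept in the prediction set.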

2606.18885 2026-06-18 cs.CV cs.IR New submission 80%

LARE: Low-Attention Region Encoding for Text-Image Retrieval

Abdulmalik Alquwayfili, Faisal Almeshal, Jumanah Almajnouni, Leena Alotaibi, Faisal Alhajari, Mohammed Alkhrashi, Alreem Almuhrij, Abdullah Aldwyish, Raied Aljadaany, Huda Alamri, Muhammad Kamran J. Khan

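Affiliations: Saudi Data and Artificial Intelligence Authority (SDAIA)

Topic match (audio-visual / vision-language fusion): text-image retrieval, where low-attention region encoding strengthens cross-modal retrieval.

AI summary: Proposes LARE, a framework that encodes low-attention regions and the full image in parallel to address visual encoders overlooking key details in crowded scenes, improving retrieval performance on a dense-scene subset.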

Comments Accepted at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). Code: https://github.com/AbdulmalikDS/LARE ; Dataset: https://huggingface.co/datasets/AbdulmalekDS/Dense-Set

Abstract

Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings. To evaluate image retrieval performance in challenging crowded scenes, we introduce Dense-Set, a challenging subset derived from COCO and Flickr30K. In this subset, images are re-captioned to provide richer descriptions of low-attention or previously overlooked regions. This dataset highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded scene conditions. Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.

2606.18558 2026-06-18 cs.CV New submission 80%

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

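Affiliations: Allen Institute for AI; University of Washington; UNC-Chapel Hill

Topic match (audio-visual / vision-language fusion): predicts 3D point trajectories from language instructions, involving vision-language fusion.

AI summary: Proposes a language-instructed 3D point motion forecasting method; a large-scale dataset and benchmark enable class-agnostic, view-stable trajectory prediction, and the approach is validated on robot manipulation and video generation.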

英文摘要

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

2606.18441 2026-06-18 cs.CV 新提交 80%

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

推理即交集：视频多模态大语言模型中视觉焦点的一致性帧对齐

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University（兰州大学信息科学与工程学院）； Beijing University of Posts and Telecommunications（北京邮电大学）； Cloud and AI BU, Huawei（华为云与AI业务部）； School of Computing, National University of Singapore（新加坡国立大学计算机学院）

专题命中音视频/视觉语言融合：视频多模态大语言模型推理，融合视频帧与语言。

AI总结提出无时间标注的过程级奖励框架CF-GRPO，通过视频内在线索构建一致性帧先验，并利用一致性帧奖励优化模型帧使用与先验的对齐，提升视频推理性能。

AI中文摘要

强化学习提升了大型语言模型的推理能力，但将仅结果奖励应用于视频多模态大语言模型（Video-MLLMs）时，对哪些视觉证据应支持答案提供的指导有限。受多感官整合启发（其中一致的线索可以增强感知估计的显著性和可靠性），我们引入了一致性帧GRPO（CF-GRPO），一种无需时间标注的过程级奖励框架，用于证据感知的视频推理。CF-GRPO从内在视频线索中构建一致性帧先验，包括时间覆盖、场景转换线索和查询条件化的视觉相关性。然后，它从视觉和响应表示中计算模型侧的帧使用分数，并通过一致性帧奖励（CFR）优化它们的一致性。通过显著性感知的稀疏聚合和分布锐化，CFR提供了高对比度的奖励信号，无需人工时间标注。实验表明，VideoCFR在复杂视频推理基准上取得了有竞争力的性能，并在多个指标上优于代表性的Video-MLLM和RL基线，同时一致性先验提供了训练中强调的证据帧的可解释视图。实现代码见：https://github.com/1Pansy/VideoCFR。

英文摘要

Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.

2606.19338 2026-06-18 cs.CV 新提交 75%

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

超越当前观测：评估多模态大语言模型在可控非马尔可夫博弈中的表现

Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang

发表机构 * Fudan University（复旦大学）； Shanghai Innovation Institute（上海创新研究院）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Zhejiang University（浙江大学）； The Chinese University of Hong Kong（香港中文大学）

专题命中音视频/视觉语言融合：评估多模态大模型在非马尔可夫博弈中的表现

AI总结提出RNG-Bench基准套件，通过配对记忆和3D迷宫两个博弈，评估多模态大模型在非马尔可夫环境中重建历史观测并据此行动的能力，发现主要错误源于遗忘而非决策，微调可提升性能。

AI中文摘要

将多模态基础模型部署为闭环策略时，越来越需要基于不再可见的观测来调节动作。然而，现有基准要么暴露完整状态，要么将隐藏状态重建与其他智能体技能混为一谈，要么仅在回合结束后测试记忆。我们引入了RNG-Bench（重建性非马尔可夫博弈），这是一个基准套件，旨在隔离基础模型在多步交互中重建过去观测并据此行动的能力。RNG-Bench包含两个互补的博弈：配对记忆，其中在特定位置短暂显示的卡片身份需在之后被回忆；以及3D迷宫，其中自我中心视图需整合为空间地图。两个博弈都在统一的测试框架下评估，具有三个可控难度轴：网格大小、视觉图案和观测模态。该基准进一步引入了头对头对决协议以控制实例级方差，以及记忆差距指标，将遗忘与不良动作选择区分开来。最难的配置每回合需要大约128K个token的上下文和350个图像输入，前沿MLLM仍远未使其饱和。记忆差距分析表明，大多数残余错误源于遗忘较早的观测，而非次优决策。最后，在最优策略轨迹和过滤后的模型演示上微调Qwen3.5-9B，提高了在RNG-Bench上的性能，并能迁移到现有基准，而不降低通用多模态能力。

英文摘要

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

2606.19297 2026-06-18 cs.LG cs.RO 新提交 75%

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

VLA 甚至知道基础知识吗？衡量视觉-语言-动作模型中的常识和世界知识保留

Nikita Kachaev, Andrey Moskalenko, Matvey Skripkin, Nikita Kurlaev, Daria Pugacheva, Albina Burlova, Mikhail Kolosov, Denis Shepelev, Andrey Kuznetsov, Elena Tutubalina, Aleksandr I. Panov, Alexey K. Kovalev, Vlad Shakhuro

发表机构 * CogAI Lab（CogAI实验室）； FusionBrain Lab（FusionBrain实验室）； IAI MSU（莫斯科国立大学人工智能研究所）； Lomonosov MSU（罗蒙诺索夫莫斯科国立大学）； NUST MISIS ； Applied AI Institute（应用人工智能研究所）； HSE University（俄罗斯高等经济大学）； Generalizable AI Systems（可泛化人工智能系统）； ISP RAS（俄罗斯科学院系统编程研究所）； MIRAI ； Domain-specific NLP Group（领域特定自然语言处理小组）

专题命中音视频/视觉语言融合：评估视觉-语言-动作模型的知识保留

AI总结提出 Act2Answer 协议，通过动作回答评估 VLA 模型的知识保留，发现模型在简单概念上表现良好，但在丰富语义类别上存在差距，且 VQA 联合训练有助于知识保留。

Comments Project page: https://tttonyalpha.github.io/act2answer/

AI中文摘要

具身视觉-语言-动作（VLA）模型通常通过在机器人数据上微调强大的预训练 VLM 获得，但目前尚不清楚它们在适应后保留了多少常识和事实知识。在知识敏感任务上的失败是模糊的，混淆了知识缺失与低级控制泛化能力差。我们引入 Act2Answer，一种轻量级协议，通过要求智能体通过动作来回答，将 VLM 知识基准适配到 VLA 评估。每个问题变成一个简短的桌面场景，其中智能体执行单个物体放置动作以选择候选答案，从而产生动作基础的、减少控制混淆的成功率。我们在不同的常识和世界知识类别中策划了这样的环境测试套件，并引入逐层意图探测以定位 VLM 骨干和动作头中与答案相关的信息。在对 7 个 VLA 模型和 9 个 VLM 基线的大规模研究中，我们系统地跨类别对模型进行排名，发现 VLA 在简单概念上表现稳健，但在更丰富的语义类别上相对于其源 VLM 显示出更大的差距，VQA 联合训练与更好的知识保留相关，并且答案相关信号在 VLA 中间层达到峰值，但在上层减弱。Act2Answer 可在以下网址获取：https://tttonyalpha.github.io/act2answer/。

英文摘要

Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action to select among candidate answers, yielding an action-grounded success rate with reduced control confounds. We curate a test suite of such environments across diverse commonsense and world-knowledge categories and introduce layerwise intent probing to localize answer-relevant information across the VLM backbone and action head. In a large-scale study of 7 VLA models and 9 VLM baselines, we systematically rank models across categories, finding that VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs, that VQA co-training is associated with better knowledge retention, and that answer-relevant signals peak in middle VLA layers but attenuate in upper layers. Act2Answer is available at https://tttonyalpha.github.io/act2answer/.

2606.19161 2026-06-18 cs.RO 新提交 75%

HT-Bench: Benchmarking and Learning Dexterous Full-Hand Tactile Representations with Egocentric Vision

HT-Bench：基于自我中心视觉的灵巧全手触觉表示基准与学习

Yuzhe Huang, Jiaping Wu, Jiaming Jiang, Hezhe Lin, Aikebaier Aierken, Yunlong Wang, Kun Cheng, Ziyuan Jiao, Yuanxin Zhong

发表机构 * Beihang University（北航）； Rimbot ； BUPT（北邮）； ShanghaiTech University（上海科技大学）； Tsinghua University（清华大学）； CAS（中国科学院）

专题命中音视频/视觉语言融合：对齐触觉与视觉信息，多模态表示学习

AI总结提出HT-Bench多任务基准和HandTouch编码器，通过大规模自我中心视觉与全手触觉数据，在触觉相似性检索、掩码修复、视觉到触觉合成等任务上验证了触觉表示的有效性。

Comments 9 pages, 4 figures

AI中文摘要

由于触觉传感器设计、数据格式和机器人形态的多样性，为机器人操作中的触觉表示学习建立通用基准仍然具有挑战性。我们并未试图建立这样的基准，而是探索了一个可扩展且有前景的未来发展方向：将自我中心视觉与全手触觉数据配对。为此，我们引入了HT-Bench，一个用于灵巧全手触觉感知的大规模多任务基准，包含在226个任务中收集的1000万RGB帧和780万触觉帧。HT-Bench从三个关键角度评估触觉表示：它们是否编码有意义的接触几何、是否能够将触觉观测与视觉信息对齐、以及是否能够泛化到未见任务。为评估这些能力，HT-Bench包含四个任务：细粒度触觉相似性检索、掩码触觉修复、视觉到触觉合成以及多模态触觉帧预测。我们进一步提出了HandTouch，一个矢量量化视觉-触觉编码器，通过渐进的空间、跨模态和时间训练学习触觉表示。在HT-Bench上，HandTouch始终优于代表性的触觉编码器基线，将细粒度触觉相似性检索的Recall@5从74.65%提高到85.23%，将掩码触觉修复的RMSE从0.022降低到0.010，并将视觉到触觉合成的OOD cIoU从0.628提高到0.705。这些结果证明了HandTouch的有效性，并表明大规模自我中心全手触觉数据为评估和推进灵巧操作中的触觉表示学习提供了可扩展的基础。

英文摘要

Establishing a universal benchmark for tactile representation learning in robotic manipulation remains challenging due to the diversity of tactile sensor designs, data formats, and robot embodiments. Rather than attempting to establish one, we explore a scalable and promising direction for future development: egocentric vision paired with full-hand tactile data. To this end, we introduce HT-Bench, a large-scale multi-task benchmark for dexterous full-hand tactile sensing, comprising 10M RGB frames and 7.8M tactile frames collected across 226 tasks. HT-Bench evaluates tactile representations from three key perspectives: whether they encode meaningful contact geometry, whether they can align tactile observations with visual information, and whether they generalize to unseen tasks. To assess these capabilities, HT-Bench includes four tasks: fine-grained tactile similarity retrieval, masked tactile inpainting, vision-to-tactile synthesis, and multimodal tactile frame prediction. We further propose HandTouch, a vector-quantized vision-tactile encoder that learns tactile representations through progressive spatial, cross-modal, and temporal training. Across HT-Bench, HandTouch consistently outperforms representative tactile encoder baselines, improving Recall@5 on fine-grained tactile similarity retrieval from 74.65% to 85.23%, reducing RMSE on masked tactile inpainting from 0.022 to 0.010, and increasing OOD cIoU on vision-to-tactile synthesis from 0.628 to 0.705. These results demonstrate the effectiveness of HandTouch and suggest that large-scale egocentric full-hand tactile data provides a scalable basis for evaluating and advancing tactile representation learning in dexterous manipulation.
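
For readers unfamiliar with the retrieval metric quoted above, here is a minimal sketch of Recall@K. It assumes the common convention of one ground-truth match per query, with a query counted as a hit when that match appears among the top K retrieved items. The abstract does not spell out HT-Bench's exact protocol, and the query and item ids below are made up.

```python
def recall_at_k(rankings, ground_truth, k=5):
    """Fraction of queries whose ground-truth item is in the top-k results.

    rankings:     dict mapping query id -> list of retrieved ids, best first
    ground_truth: dict mapping query id -> the single correct id (assumption)
    """
    hits = sum(1 for q, ranked in rankings.items() if ground_truth[q] in ranked[:k])
    return hits / len(rankings)


# Hypothetical retrieval results for four tactile queries.
rankings = {
    "q1": ["t3", "t1", "t7", "t2", "t9", "t4"],
    "q2": ["t5", "t8", "t6", "t1", "t3", "t2"],
    "q3": ["t4", "t9", "t2", "t6", "t1", "t7"],
    "q4": ["t8", "t2", "t5", "t3", "t7", "t6"],
}
ground_truth = {"q1": "t1", "q2": "t2", "q3": "t9", "q4": "t6"}
score = recall_at_k(rankings, ground_truth, k=5)
```

Here q1 and q3 are hits, while q2 and q4 rank the correct item sixth, so the score is 0.5.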

2606.19088 2026-06-18 cs.RO 新提交 75%

ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks

ReSiReg：面向语言条件机器人任务的空间一致语义

Simon Schwaiger, David Seyser, Alessandro Scherl, Wilfried Wöber, Gerald Steinbauer-Wagner

发表机构 * Graz University of Technology, Institute of Software Engineering and Artificial Intelligence（格拉茨技术大学，软件工程与人工智能研究所）； University of Applied Sciences Technikum Wien, Department of Industrial Engineering（维也纳应用科技大学，工业工程系）； University of Alicante, Department of Computer Technology（阿利坎特大学，计算机技术系）； University of Natural Resources and Life Sciences, Institute for Integrative Nature Conservation Research（自然资源与生命科学大学，整合自然保护研究所）

专题命中音视频/视觉语言融合：改进VLM特征空间一致性用于语言接地

AI总结提出ReSiReg方法，通过重构空间一致的VLM中间特征，改善密集语言接地检索，在OVSS和3D映射中提升空间一致性，并发布紧凑的25M参数VLM模型。

AI中文摘要

视觉-语言模型（VLM）使机器人能够遵循开放语言指令。然而，密集的VLM嵌入已被证明存在噪声且缺乏空间一致性。这对于需要同时推理语义和3D空间的机器人应用来说是有问题的。我们研究了近期VLM的空间结构，并提出了ReSiReg，一种特征重构方法，利用空间一致的VLM中间特征来改善密集语言接地检索。ReSiReg将中间特征聚类为视觉原型，推导其语言描述符，并将每个补丁重构为原型级语言嵌入的软混合。我们在OVSS和3D映射上跨骨干网络进行定量评估，并在真实世界操作场景中进行定性评估。定量结果显示密集检索得到改善；操作场景显示出更空间一致的目标激活。我们进一步为机器人应用提供了一个紧凑的25M密集VLM，远小于ViT-B基线且具有竞争力。可从 https://resireg.github.io 获取。

英文摘要

Vision-Language Models (VLMs) enable robots to follow open-language instructions. However, dense VLM embeddings have been shown to be noisy and to lack spatial consistency. This is problematic for robotic applications, which require simultaneous reasoning over semantics and 3D space. We examine spatial structure across recent VLMs and propose ReSiReg, a feature reconstruction method that uses spatially consistent VLM intermediates to improve dense language-grounded retrieval. ReSiReg clusters intermediates into visual prototypes, derives their language descriptors, and reconstructs each patch as a soft mixture of prototype-level language embeddings. We evaluate quantitatively on OVSS and 3D mapping across backbones, and qualitatively in real-world manipulation scenes. Quantitative results show improved dense retrieval; manipulation scenes show more spatially consistent target activations. We further provide a compact 25M dense VLM for robotic applications, substantially smaller than and competitive with ViT-B baselines. Available at https://resireg.github.io.
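
The reconstruction step can be pictured with a small sketch: each patch is compared with a set of visual prototypes, the similarities are turned into soft weights, and the patch is rebuilt as the weighted sum of the prototypes' language embeddings. The prototypes are taken as given here (for example, from a clustering step), and the dot-product similarity, the softmax, and the `temperature` value are assumptions rather than details confirmed by the abstract.

```python
import math


def softmax(xs, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(xs)
    exps = [math.exp((x - m) / temperature) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def reconstruct_patch(patch_feat, visual_prototypes, prototype_text_embs, temperature=0.1):
    """Rebuild one patch as a soft mixture of prototype-level language embeddings.

    visual_prototypes:   one vector per prototype, in the visual feature space
    prototype_text_embs: one language embedding per prototype, same order
    """
    sims = [dot(patch_feat, p) for p in visual_prototypes]
    weights = softmax(sims, temperature)
    dim = len(prototype_text_embs[0])
    return [sum(w * t[i] for w, t in zip(weights, prototype_text_embs)) for i in range(dim)]


# Toy example: two prototypes, three-dimensional language embeddings.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
text_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
patch_emb = reconstruct_patch([1.0, 0.0], prototypes, text_embs)
```

Because neighboring patches that fall in the same cluster receive nearly the same mixture weights, their reconstructed language embeddings are nearly identical, which is one way to read the spatial-consistency claim.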

2606.18955 2026-06-18 cs.CV cs.RO 新提交 75%

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

运动聚焦的潜在动作使跨实体VLA训练能从人类自我中心视频中学习

Runze Xu, Yiluo Zhang, Jian Wang, Yu Wang, Jincheng Yu

发表机构 * Department of Electronic Engineering, Tsinghua University（清华大学电子工程系）； Tianfu Jiangxi Laboratory（天府绛溪实验室）

专题命中音视频/视觉语言融合：从人类视频提取动作先验，涉及视觉语言融合。

AI总结提出基于潜在动作的框架，利用混合解耦VQ-VAE从无标签人类视频中提取通用动作先验，通过意图-感知解耦策略减少动作幻觉，仅需50条轨迹即可适配下游任务。

Comments Accepted to IROS 2026

AI中文摘要

训练通用视觉-语言-动作（VLA）模型通常需要大量、多样化的机器人数据集，并带有高保真动作标注。尽管自我中心的人类操作视频丰富且捕捉了显著的环境多样性，但缺乏动作标签使其难以在传统训练范式下使用。为解决这一问题，我们提出了一种基于潜在动作的框架，旨在从无标签人类视频中提取通用动作先验。该架构采用混合解耦VQ-VAE，通过物理掩码将运动动态与环境背景解耦，从而构建跨实体动作码本。通过在人类视频上使用码本进行预训练，VLM骨干网络学习到动作意图的深层表示。为了适应特定实体，我们引入了一种意图-感知解耦策略，其中VLM预测动作意图，而一个独立的冻结视觉编码器为动作专家提供状态特定特征，从而减少动作幻觉。在仿真和真实环境中的结果表明，我们的方法仅在无标签人类视频上预训练，与在大量标注数据集上训练的最先进VLA模型相比具有竞争力，且仅需50条轨迹进行下游适配。

英文摘要

Training generalist Vision-Language-Action (VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.
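
To make the codebook idea concrete, the sketch below shows only the standard VQ-VAE lookup: a masked frame-to-frame change is snapped to its nearest codeword, and the codeword index serves as a discrete latent action. The binary mask, the raw feature difference, and the three-entry codebook are illustrative assumptions; the Hybrid Disentangled VQ-VAE itself is not specified in the abstract.

```python
def masked_motion(prev_feat, next_feat, mask):
    """Feature change between two frames, zeroed where the mask marks background."""
    return [m * (b - a) for a, b, m in zip(prev_feat, next_feat, mask)]


def quantize(latent, codebook):
    """Return (index, codeword) of the nearest codebook entry by squared L2 distance."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))
    return idx, codebook[idx]


# Toy example: the third feature changes a lot but is masked out as background.
prev_feat = [0.0, 0.0, 5.0]
next_feat = [1.0, 0.0, 9.0]
mask = [1.0, 1.0, 0.0]  # 1 = moving hand/object region, 0 = background (assumed)
codebook = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

motion = masked_motion(prev_feat, next_feat, mask)
action_id, codeword = quantize(motion, codebook)
```

Without the mask, the large background change in the third feature would dominate the distance computation, which is the kind of entanglement the masking is meant to avoid.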

2606.18846 2026-06-18 cs.CV 新提交 75%

From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

从边界框到视觉推理：一种用于视觉语言模型的在线策略数据标注工具

Like Zhang, Runliang Niu, Shiqi Wang, Xiyu Hu, Qianli Xing, Pan Wang, Qingzu He, Qi Wang

发表机构 * School of Artificial Intelligence, Jilin University（吉林大学人工智能学院）； College of Computer Science, Jilin University（吉林大学计算机科学与技术学院）； OPPO

专题命中音视频/视觉语言融合：提出视觉语言模型标注工具，涉及视觉与语言模态融合。

AI总结提出ScreenAnnotator，通过统一标注原子模式、在线策略循环与贝叶斯验证器，解决现有工具表达力不足、标注-训练脱节和数据复用性差的问题，实现高效多任务数据生成。

Comments 14 pages, 7 figures

AI中文摘要

视觉语言模型（VLM）正快速向复杂的基于基础的结构化视觉推理发展。训练具备此类高级能力的模型需要一种新型数据，该数据能将空间坐标、开放词汇描述、结构化属性和拓扑关系无缝统一为单一表示。然而，现有数据标注工具从根本上无法满足这些复杂需求，存在三个系统性瓶颈：表达力有限、严重的标注-训练解耦以及数据复用性差。为弥补这一基础设施差距，我们引入了一个开源标注工具ScreenAnnotator。首先，我们定义了一个统一的标注原子模式，将空间、语义和结构基元绑定为单个单元。其次，我们实现了一个嵌入贝叶斯标注验证器（BAV）的在线策略标注循环。最后，我们设计了一个模板驱动的多任务数据合成过程，动态地将静态原子转化为多样化的多维推理任务，消除了冗余的重新标注。在线策略循环将流程图上的标注接受率提升至近100%，GUI截图上的接受率达到77%，同时随着标注数据的积累，每张图像的标注时间稳步减少。在流程图场景中，微调VLM的平均准确率达到76.1%，绝对提升了35.1个百分点。我们的代码可在以下网址获取：https://github.com/WnQinm/Annotator。

英文摘要

Vision-language models (VLMs) are rapidly advancing toward sophisticated grounded structured visual reasoning. Training models for such advanced capabilities demands a new genre of data that seamlessly unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a single representation. However, existing data annotation tools fundamentally fail to meet these intricate demands, suffering from three systematic bottlenecks: limited expressiveness, severe annotation-training decoupling, and poor data reusability. To bridge this infrastructure gap, we introduce an open-source annotation tool, ScreenAnnotator. First, we define a unified annotation atom schema that binds spatial, semantic, and structural primitives into a single unit. Second, we implement an on-policy annotation loop embedded with a Bayesian Annotation Verifier (BAV). Finally, we design a template-driven multi-task data synthesis process that dynamically transforms static atoms into diverse multi-dimensional reasoning tasks, eliminating redundant re-annotation. The on-policy loop drives the annotation accept rate to nearly 100% on flowcharts and 77% on GUI screenshots, while steadily reducing per-image annotation time as labeled data accumulate. In the flowchart scenario, fine-tuning a VLM yields 76.1% average accuracy, which is a 35.1 percentage-point absolute gain. Our code is available at: https://github.com/WnQinm/Annotator.
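
One way to picture an "annotation atom" and template-driven synthesis is the sketch below: a single record bundles a box, a description, attributes, and links, and a few templates turn that one record into several question-answer tasks. The field names and templates are hypothetical and are not ScreenAnnotator's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class AnnotationAtom:
    bbox: tuple                                     # spatial: (x1, y1, x2, y2) in pixels
    description: str                                # semantic: open-vocabulary text
    attributes: dict = field(default_factory=dict)  # structured attributes
    links: list = field(default_factory=list)       # structural: ids of connected atoms


# Each template maps one atom to a (question, answer) pair.
TEMPLATES = {
    "grounding": lambda a: (f"Where is {a.description}?", str(list(a.bbox))),
    "attribute": lambda a: (f"What shape is {a.description}?",
                            a.attributes.get("shape", "unknown")),
    "relation": lambda a: (f"How many elements does {a.description} connect to?",
                           str(len(a.links))),
}


def synthesize_tasks(atom):
    """Turn one static atom into several task-specific training samples."""
    return {name: template(atom) for name, template in TEMPLATES.items()}


atom = AnnotationAtom((10, 20, 110, 60), "the 'Start' node", {"shape": "ellipse"}, ["n2"])
tasks = synthesize_tasks(atom)
```

Adding a new task type then means adding a template rather than re-annotating images, which matches the reusability argument made in the abstract.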

2606.06926 2026-06-18 cs.CV cs.MM 新提交 75%

SVHighlights: Towards Extremely Long Sport Video Highlight Detection

SVHighlights: 迈向极长体育视频精彩片段检测

Donggyu Lee, Youngbin Ki, Jeonghun Kang, Taehwan Kim

发表机构 * Ulsan National Institute of Science and Technology（蔚山科学技术院）

专题命中音视频/视觉语言融合：利用大语言模型融合多模态信息检测体育视频精彩片段

AI总结针对现有方法无法处理超长视频精彩片段检测的问题，提出首个基准SVHighlights（包含320个平均时长2小时的体育视频）以及无训练的分段方法TF-SELECTOR，通过大语言模型融合多模态信息预测片段级显著性分数，在多个指标上超越现有基线。

Comments Accepted to KDD 2026 (Datasets and Benchmarks Track). Project Page: https://leedongkyu2019.github.io/SVHighlights/

DOI: 10.1145/3770855.3817564

AI中文摘要

尽管长视频的精彩片段检测具有重要的实际意义，但现有方法大多局限于短视频内容，这主要是由于缺乏合适的基准。为了填补这一空白，我们引入了SVHighlights，据我们所知，这是首个针对极长体育视频（每段时长超过一小时，涵盖多种体育类别）精彩片段检测的基准。SVHighlights是通过一个数据集生成流水线，从完整体育视频及其对应的官方精彩片段视频对构建而成，无需传统的逐片段显著性标注即可实现可扩展的标签生成。该基准包含320个视频，平均时长2.00小时，总时长640.18小时，显著超过以往的数据集。现有方法在长视频上也面临根本性挑战：在短视频片段上训练的模型无法泛化到小时级内容，并且它们的片段级评分缺乏识别精彩片段所需的更广泛上下文。为了解决这一问题并提供一个强基线，我们提出了TF-SELECTOR，一种无需训练的基于分段的方法，该方法通过合并相邻的具有相同语义内容的镜头，将每个视频划分为上下文感知的分段，并使用多模态输入（包括视觉描述、转录文本和音频音量）的大语言模型预测分段级显著性分数。实验表明，与视频时间定位（VTG）微调的基线相比，TF-SELECTOR在大多数指标上取得了更优的性能，在HIT@1上提升+2.50，在HIT@K上提升+4.04，在IoU上提升+2.95。这些结果确立了SVHighlights作为长视频精彩片段检测的具有挑战性的测试平台，并证明了简单的基于分段的策略可以有效地扩展到小时级视频。

英文摘要

While highlight detection for long-form videos is of great practical importance, most existing methods remain limited to short-form content, largely due to the absence of a suitable benchmark. To bridge this gap, we introduce SVHighlights, to the best of our knowledge, the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour in duration, across multiple sports categories. SVHighlights is constructed from pairs of full-length sports videos and their corresponding official highlight videos using a dataset generation pipeline, enabling scalable label generation without conventional per-clip saliency annotation. The benchmark comprises 320 videos with an average duration of 2.00 hours and a total of 640.18 hours, substantially exceeding previous datasets. Existing methods also face fundamental challenges on long videos: models trained on short clips fail to generalize to hour-long content, and their clip-level scoring lacks the broader context needed to identify highlights. To address this and provide a strong baseline, we present TF-SELECTOR, a training-free segment-based approach that divides each video into context-aware segments by merging adjacent shots sharing the same semantic content, and predicts segment-level saliency scores using a large language model with multimodal inputs including visual captions, transcripts, and audio volume. Experiments demonstrate that TF-SELECTOR achieves superior performance across most metrics compared to Video Temporal Grounding (VTG)-tuned baselines, with improvements of +2.50 in HIT@1, +4.04 in HIT@K, and +2.95 in IoU. These results establish SVHighlights as a challenging testbed for long-form highlight detection and demonstrate that a simple segment-based strategy can effectively scale to hour-long videos.
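
The segment-building step described above can be sketched as follows: adjacent shots that carry the same semantic label are merged into one segment, and segments are then ranked by a saliency score. Exact label equality as the merge criterion, the shot labels, and the placeholder scores (which stand in for the language-model predictions) are all assumptions for illustration.

```python
from itertools import groupby


def merge_shots(shots):
    """Merge adjacent shots that share a semantic label into segments.

    shots: list of (start_sec, end_sec, label) tuples in temporal order.
    """
    segments = []
    for label, group in groupby(shots, key=lambda s: s[2]):
        group = list(group)
        segments.append((group[0][0], group[-1][1], label))
    return segments


def top_k_segments(segments, scores, k):
    """Return the k segments with the highest saliency scores, best first."""
    ranked = sorted(zip(scores, segments), key=lambda pair: pair[0], reverse=True)
    return [segment for _, segment in ranked[:k]]


shots = [
    (0, 10, "warmup"), (10, 25, "warmup"),
    (25, 40, "goal"),
    (40, 55, "replay"), (55, 70, "replay"),
]
segments = merge_shots(shots)
placeholder_scores = [0.1, 0.9, 0.6]  # stand-ins for model-predicted saliency
highlights = top_k_segments(segments, placeholder_scores, k=2)
```

Scoring three merged segments instead of five raw shots gives each score more context per call, which is the motivation the abstract gives for working at segment level.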

2606.19120 2026-06-18 cs.LG cs.CV 新提交 70%

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

先看后思：解耦感知与推理以实现抗捷径的多模态在策略自蒸馏

Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han

发表机构 * State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences（机器人与智能系统国家重点实验室，沈阳自动化研究所，中国科学院）； University of Chinese Academy of Sciences（中国科学院大学）

专题命中音视频/视觉语言融合：多模态大语言模型后训练，融合视觉与语言

AI总结提出ViGOS框架，通过解耦感知和推理，在MLLM后训练中避免文本捷径，提升图像依赖行为。

Comments 29 pages, 5 figures, 8 tables

AI中文摘要

在策略自蒸馏（OPSD）训练模型在其自身rollouts上，并使用冻结副本提供基于参考目标的密集token级目标。这对于LLM推理效果良好，但直接扩展到多模态大语言模型（MLLMs）可能产生捷径：特权目标可能主要基于文本参考目标而非图像来引导token。我们提出ViGOS，一种视觉引导的OPSD框架用于MLLM后训练。学生首先编写视觉描述，然后推理出最终答案。对于有效rollouts，仅图像的感知教师监督描述，而特权推理教师监督同一学生前缀上的推理和最终答案。仅对无效rollouts使用参考教师以恢复输出格式。在通用视觉-语言、专家推理、视觉数学、空间定位和视觉-语言先验基准测试中，ViGOS保持了OPSD的主要优势，并在易产生捷径的设置中改善了图像引导行为。

英文摘要

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

2606.18839 2026-06-18 cs.LG cs.CV 新提交 70%

Semantic Robustness Certification for Vision-Language Models

视觉语言模型的语义鲁棒性认证

Peiyu Yang, Paul Montague, Feng Liu, Andrew C. Cullen, Amardeep Kaur, Christopher Leckie, Sarah M. Erfani

发表机构 * School of Computing & Information Systems, University of Melbourne, Australia

专题命中音视频/视觉语言融合：认证视觉语言模型鲁棒性，涉及视觉与文本语义融合。

AI总结提出首个无需额外数据即可认证视觉语言模型在语义层面（如形状、大小、风格）鲁棒性的框架，通过文本提示作为语义代理并量化决策边界，确保预测类别在语义变换下不变。

Comments Accepted to ICML

AI中文摘要

视觉语言模型（VLM）现在被广泛用于下游任务。然而，现实世界的应用常常使VLM面临由语义变化（例如形状、大小和风格）引起的分布偏移。鲁棒性认证确定当对输入应用变换时模型的预测是否改变。虽然大多数认证框架研究输入的几何或像素级变换，但本文提出了一种新颖的框架，能够在语义级变换下认证VLM的鲁棒性。利用VLM的开放词汇能力，我们使用文本提示作为语义代理来构建由控制语义变化程度的范围参数化的变换。通过以封闭形式表征VLM决策边界，我们的框架定量地认证了在语义变换下预测类别保持不变的范围区间。我们的框架是第一个在语义级变化下认证VLM鲁棒性而无需为每种变化提供额外数据的框架，使其易于应用。在合成数据和真实数据上的实验表明，我们的框架能够在各种场景下认证针对多种语义变化的鲁棒性。

英文摘要

Vision-language models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification determines if a model's prediction changes when transformations are applied to its input. While most certification frameworks study geometric or pixel-level transformations over inputs, this work proposes a novel framework that enables certifying VLM robustness under semantic-level transformations. Leveraging the open-vocabulary capability of VLMs, we use text prompts as semantic proxies to construct transformations parameterized by an extent that controls the degree of semantic variation. By characterizing the VLM decision boundary in closed form, our framework quantitatively certifies extent intervals for which the predicted class remains unchanged under the semantic transformation. Our framework is the first to certify VLM robustness under semantic-level variations without requiring additional data for each variation, making it practical to apply. Experiments on both synthetic and real-world data show that our framework enables certifying robustness under diverse semantic variations across scenarios.
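
A toy version of "certifying an extent interval in closed form" is sketched below. It assumes a deliberately simplified model: the semantic transformation shifts the image embedding linearly, z(t) = z + t·d, where d is a direction derived from text prompts, and classes are scored by an unnormalized dot product with class text embeddings. Under those assumptions each class margin is linear in the extent t, so the interval on which the prediction is unchanged can be solved exactly. Real VLMs typically normalize embeddings, so the paper's actual closed form is likely different; all names here are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def certified_interval(z, d, class_embs, pred):
    """Open interval (lo, hi) of extents t for which `pred` stays the top class.

    Margin against class c: (w_pred - w_c) . (z + t * d) = a + t * b, required > 0.
    """
    lo, hi = -math.inf, math.inf
    for c, w in enumerate(class_embs):
        if c == pred:
            continue
        diff = [wp - wc for wp, wc in zip(class_embs[pred], w)]
        a = dot(diff, z)  # margin at t = 0
        b = dot(diff, d)  # change in margin per unit of extent
        if a <= 0:
            raise ValueError("pred is not the strict top class at t = 0")
        if b > 0:
            lo = max(lo, -a / b)  # margin shrinks as t decreases
        elif b < 0:
            hi = min(hi, -a / b)  # margin shrinks as t increases
    return lo, hi


z = [2.0, 1.0]                         # image embedding (toy)
d = [-1.0, 1.0]                        # text-derived semantic direction (toy)
class_embs = [[1.0, 0.0], [0.0, 1.0]]  # class text embeddings (toy)
interval = certified_interval(z, d, class_embs, pred=0)
```

In this example the prediction is guaranteed for every extent below 0.5; at t = 0.5 the two class scores tie, so that is the decision boundary.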

2606.19194 2026-06-18 cs.RO 新提交 60%

Invertible Neural Network Adapter for One-Step Flow Matching in Robot Manipulation

用于机器人操作中一步流匹配的可逆神经网络适配器

Yu Zhang, Kangyi Ji, Yongxiang Zou, Rongtao Xu, Feng Zheng, Long Cheng

专题命中音视频/视觉语言融合：条件于多模态观测生成动作，但非典型融合

AI总结提出可逆神经网络适配器，通过一步去噪过程生成高维动作，降低推理复杂度并保持精度，在仿真和真实实验中提升效率。

AI中文摘要

本文提出了一种用于通用机器人操作的可逆神经网络适配器，旨在通过一步去噪过程，基于多模态观测（包括视觉、语言和本体感受输入）生成精确的高维动作。基于流匹配公式，所提出的适配器有效地将动作生成轨迹约束在可逆潜空间内，从而仅需单次推理步骤即可实现高效、高质量的灵巧动作合成。与传统的迭代流匹配策略相比，所提出的框架显著降低了推理复杂度，同时保持了强大的动作预测精度和稳定性。在多种仿真基准和真实机器人平台上进行了大量实验，以评估所提出方法的有效性。在仿真基准测试中，所提出的适配器在广泛的操作任务上持续表现出优于或接近最先进的性能。此外，真实世界实验显示，视觉-语言-动作（VLA）模型的推理效率显著提升，平均推理延迟从110毫秒降低到61毫秒，同时保持了强大的任务性能。

英文摘要

This paper presents an invertible neural network adapter for general robotic manipulation, designed to generate precise high-dimensional actions conditioned on multimodal observations, including visual, linguistic, and proprioceptive inputs, through a one-step denoising process. Built upon a flow-matching formulation, the proposed adapter effectively constrains the action generation trajectory within an invertible latent space, thereby enabling efficient and high-quality dexterous action synthesis with only a single inference step. Compared with conventional iterative flow-matching policies, the proposed framework substantially reduces inference complexity while maintaining strong action prediction accuracy and stability. Extensive experiments are conducted across a diverse set of simulation benchmarks and real-world robotic platforms to evaluate the effectiveness of the proposed method. Across simulation benchmarks, the proposed adapter consistently demonstrates superior or near state-of-the-art performance on a wide range of manipulation tasks. Furthermore, real-world experiments reveal a significant improvement in inference efficiency for vision-language-action (VLA) models, reducing the average inference latency from 110 ms to 61 ms while maintaining strong task performance.
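
To see why the number of denoising steps matters, the sketch below integrates a flow-matching velocity field with plain Euler steps. It does not implement the invertible adapter. It only shows, on a hand-written toy field whose trajectories are straight lines, that a single step already lands on the target, whereas curved trajectories would generally need more steps. The field `straight_velocity` and the fixed `target` are assumptions for illustration.

```python
def euler_sample(x0, velocity, num_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with Euler steps."""
    x = list(x0)
    h = 1.0 / num_steps
    for k in range(num_steps):
        t = k * h
        v = velocity(x, t)
        x = [xi + h * vi for xi, vi in zip(x, v)]
    return x


target = [0.5, -1.0]  # stand-in for a two-dimensional action


def straight_velocity(x, t):
    """Velocity of the straight path x_t = (1 - t) * x0 + t * target."""
    return [(a - xi) / (1.0 - t) for a, xi in zip(target, x)]


one_step = euler_sample([0.0, 0.0], straight_velocity, num_steps=1)
ten_step = euler_sample([0.0, 0.0], straight_velocity, num_steps=10)
```

Both calls reach the same point, but the first needs one evaluation of the velocity field instead of ten, which is where the inference saving of a one-step policy comes from.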
