arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

视觉大模型 / VLM

视觉语言模型、视觉推理、视觉问答、图文理解和视觉 grounding。

今日/当前日期收录 37 信号源:cs.CV, cs.AI, cs.LG

1. 视觉推理 25 篇

2604.04917 2026-06-19 cs.CV cs.AI cs.CL 版本更新 95%

Vero: An Open RL Recipe for General Visual Reasoning

Vero: 通用视觉推理的开放RL配方

Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

发表机构 * Princeton University(普林斯顿大学)

专题命中 视觉推理 :提出Vero系列VLM,在视觉推理基准上显著提升

AI总结 提出Vero系列开放视觉语言模型,通过构建600K样本数据集Vero-600K和任务路由奖励,在30个基准测试中平均提升2.9-5.4点,Vero-Qwen3I-8B超越Qwen3-VL-8B-Thinking 3.8点。

Comments Project page: https://vero-reasoning.github.io/

详情
AI中文摘要

构建一个能在图表、科学、空间理解和开放式任务中工作的视觉推理器需要什么?最强的视觉语言模型(VLM)表明广泛的视觉推理是可以实现的,但其封闭的数据和强化学习(RL)流程使得其成果难以研究、复现或扩展。我们引入了Vero,一个完全开放的VLM系列,在各种视觉推理任务中匹配或超越现有的开放权重模型。我们跨六个广泛的任务类别扩展RL数据和奖励,构建了Vero-600K,一个来自59个数据集的600K样本数据集,并设计了处理异构答案的任务路由奖励。在我们的30个基准测试套件VeroEval中,Vero-600K在受控比较下优于现有的RL数据集。应用于五个起始模型,Vero变体在其初始模型上平均获得2.9-5.4分的提升。值得注意的是,基于Instruct模型训练的Vero-Qwen3I-8B,在没有额外蒸馏的情况下,平均超过Qwen3-VL-8B-Thinking 3.8分。系统的消融实验揭示,不同的任务类别引发不同的推理模式,而广泛的收益依赖于联合学习它们,而非孤立学习。所有数据、代码和模型均已公开。

英文摘要

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) suggest that broad visual reasoning is within reach, yet their closed data and reinforcement learning (RL) pipelines make their gains difficult to study, reproduce, or extend. We introduce Vero, a family of fully open VLMs that match or exceed existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answers. Across VeroEval, our 30-benchmark suite, Vero-600K outperforms existing RL datasets under controlled comparisons. Applied to five starting models, Vero variants gain 2.9-5.4 points on average over their initial models. Notably, Vero-Qwen3I-8B, trained on the Instruct model, surpasses Qwen3-VL-8B-Thinking by 3.8 points on average without additional distillation. Systematic ablations reveal that different task categories elicit distinct reasoning patterns and that broad gains depend on learning them jointly rather than in isolation. All data, code, and models are publicly available.

2606.20515 2026-06-19 cs.CV 新提交 90%

S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

S-Agent:空间工具使用激发空间智能推理

Yalun Dai, Hao Li, Shulin Tian, Runmao Yao, Yuhao Dong, Fangzhou Hong, Zhaoxi Chen, Fangfu Liu, Baoliang Tian, Dingwen Zhang, Tao Wang, Kim-Hui Yap, Ziwei Liu

发表机构 * NTU(南洋理工大学) THU(清华大学) ByteDance(字节跳动) NWPU(西北工业大学)

专题命中 视觉推理 :利用VLM作为语义规划器进行空间推理

AI总结 提出S-Agent空间工具使用智能体范式,通过时空证据积累和层次化工具集,将VLM作为语义规划器,实现连续多视图图像和视频的空间推理,在无训练下提升开源和闭源VLM性能,并基于S-300K轨迹微调得到紧凑空间智能体S-Agent-8B。

Comments Project Page : https://Ropedia.github.io/S-Agent

详情
AI中文摘要

现实世界的空间智能需要对连续且不断变化的三维世界进行推理,然而现有的VLM和工具增强智能体大多仍局限于从孤立的视觉观察中进行静态、无状态的推理。我们引入了\textbf{\textsc{S-Agent}},一种用于理解和推理连续多视图图像和视频的空间工具使用智能体范式。通过将空间推理表述为时空证据积累而非孤立的帧级预测,\textsc{S-Agent}将空间感知重塑为以场景为中心的理解,超越以帧为中心的识别。具体而言,\textsc{S-Agent}将VLM作为语义规划器,决定需要哪些证据,而层次化的空间工具和专家将物体锚定在2D中,将其提升为3D几何证据,并将这些证据聚合为高级空间知识(例如,计数、测量、方向和相对位置)。此外,时间记忆机制,包括用于维护不断演变的场景状态的场景记忆和用于积累推理上下文的智能体记忆,实现了跨帧和推理步骤的证据整合。在多视图和视频空间推理基准上的全面实验表明,\textsc{S-Agent}以无需训练的方式持续提升开源和闭源VLM的性能。除了推理时增强,在\textsc{S-Agent}生成的空间轨迹\textsc{S-300K}上进行监督微调(SFT)得到了\textsc{S-Agent-8B},一个紧凑的空间智能体,显著超越了类似规模的基线(例如,Qwen3-VL-8B),并与先进的闭源模型(例如,GPT-5.4和Gemini 3)性能相当。

英文摘要

Real-world spatial intelligence requires reasoning over a continuous and evolving 3D world, yet existing VLMs and tool-augmented agents largely remain tied to static, stateless inference from isolated visual observations. We introduce \textbf{\textsc{S-Agent}}, a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos. By formulating spatial reasoning as spatio-temporal evidence accumulation rather than isolated frame-level prediction, \textsc{S-Agent} reshapes spatial perception into scene-centric understanding beyond frame-centric recognition. Specifically, \textsc{S-Agent} casts the VLM as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge (\textit{e.g.}, counting, measurement, orientation, and relative position). Additionally, a temporal memory mechanism, including Scene Memory for maintaining the evolving scene state and Agent Memory for accumulating reasoning context, enables evidence integration across frames and reasoning steps. Comprehensive experiments on multi-view and video spatial reasoning benchmarks show that \textsc{S-Agent} consistently improves both open-source and closed-source VLMs in a training-free manner. Beyond inference-time augmentation, supervised fine-tuning (SFT) on \textsc{S-Agent}-generated spatial trajectories \textsc{S-300K} yields \textsc{S-Agent-8B}, a compact spatial agent that significantly surpasses similar-scale baselines (e.g., Qwen3-VL-8B) and performs comparably to advanced closed-source models (e.g., GPT-5.4 and Gemini 3).

2606.19776 2026-06-19 cs.CV 新提交 90%

Occ-VLM: Occupancy Grounded Vision Language Model for Indoor Scene Understanding

Occ-VLM: 面向室内场景理解的占用接地视觉语言模型

Jianing Li, Zhou Fang, Yijiang Liu, Li Du

发表机构 * School of Electronic Science and Engineering, Nanjing University(南京大学电子科学与工程学院)

专题命中 视觉推理 :占用接地VLM用于室内3D场景理解。

AI总结 提出Occ-VLM,仅用姿态RGB图像和单一2D视觉编码器,通过重建3D占用作为几何先验,实现统一的3D场景理解,在占用预测、3D VQA和密集描述任务上达到领先水平。

详情
AI中文摘要

近期,视觉语言模型(VLM)在3D场景理解方面取得了显著进展,推动了具身智能和机器人视觉等应用的发展。然而,现有方法通常要么直接依赖显式的3D输入(如点云或RGB-D序列),要么引入额外的3D几何编码器从2D图像中推导出3D感知的视觉标记。这种设计在结构上将3D几何感知与通过视觉语言预训练学到的丰富2D语义解耦,阻碍了统一3D视觉语言表示的发展。在这项工作中,我们提出了Occ-VLM,一个仅基于姿态RGB图像并采用单一2D视觉编码器的3D场景理解新框架。具体而言,Occ-VLM重建3D场景占用作为辅助几何先验,用于将前景2D标记与3D空间进行空间关联。然后,这些标记由大型语言模型(LLM)解码,实现统一的场景理解。大量实验表明,Occ-VLM实现了准确的几何感知和稳健的视觉语言推理:在多视角占用预测上达到最先进性能,同时在3D视觉问答(VQA)和3D密集描述基准上与使用3D输入的VLM表现相当。

英文摘要

Recently, vision-language models (VLMs) have made significant progress in 3D scene understanding, driving advances in applications such as embodied intelligence and robotic vision. However, existing approaches typically either rely directly on explicit 3D inputs (e.g., point clouds or RGB-D sequences), or introduce an additional 3D geometry encoder to derive 3D-aware visual tokens from 2D images. Such designs structurally decouple 3D geometric perception from the rich 2D semantics learned via vision-language pre-training, hindering the development of a unified 3D vision-language representation. In this work, we propose Occ-VLM, a novel framework for 3D scene understanding that operates purely on posed RGB images and employs a single 2D vision encoder. Specifically, Occ-VLM reconstructs 3D scene occupancy as an auxiliary geometric prior, which is utilized to spatially associate foreground 2D tokens with 3D space. These tokens are then decoded by a Large Language Model (LLM) for unified scene understanding. Extensive experiments demonstrate that Occ-VLM achieves both accurate geometric perception and robust vision-language reasoning: it attains state-of-the-art performance on multi-view occupancy prediction, while performing on par with 3D-input VLMs on 3D Visual Question Answering (VQA) and 3D dense captioning benchmarks.

2606.19552 2026-06-19 cs.CL 新提交 90%

LaViSA: A Language and Vision Structural Ambiguity Benchmark

LaViSA:语言与视觉结构歧义基准

Lee Sangmyeong, Shun Inadumi, Koichiro Yoshino

发表机构 * Nara Institute of Science and Technology(奈良先端科学技术大学院大学) Guardian Robot Project RIKEN(RIKEN守护机器人项目) The University of Osaka(大阪大学)

专题命中 视觉推理 :评估VLM利用视觉场景解决结构歧义的能力

AI总结 提出LaViSA基准,通过七类歧义句及对应图像评估视觉语言模型利用视觉场景解决结构歧义的能力,实验显示现有模型虽能部分利用视觉信息,但在特定歧义类型和细微语义区分上仍有局限。

详情
AI中文摘要

结构歧义是指单个句子由于其句法结构而产生多种有效解释,这给语言理解带来了基本挑战。视觉场景可作为解决此类歧义的有用线索,视觉语言模型(VLM)需要能够从视觉场景中推导出可能的语义解释。我们引入了语言与视觉结构歧义(LaViSA)基准,旨在评估VLM利用视觉场景解决结构歧义的能力。LaViSA包含歧义句子、其消歧句子以及这些消歧句子对应的图像,涵盖七类歧义。利用LaViSA,我们对多种VLM进行了全面评估,包括专有模型和开源模型,参数规模和推理能力各异。实验结果表明,尽管最近的VLM能在一定程度上利用视觉场景解决结构歧义,但它们仍然在特定歧义类型和视觉上微妙的语义区分上存在困难,表明在利用视觉场景解决结构歧义方面仍存在局限性。

英文摘要

Structural ambiguity arises when a single sentence admits multiple valid interpretations due to its syntactic structure, posing a fundamental challenge for language understanding. Visual scenes serve as useful cues for resolving such ambiguity, and Vision and Language Models (VLMs) need to be capable of deriving possible semantic interpretations from visual scenes. We introduce Language and Vision Structural Ambiguity (LaViSA), a benchmark designed to evaluate the ability of VLMs to resolve structural ambiguity leveraging visual scenes. LaViSA consists of ambiguous sentences, their disambiguated sentences, and corresponding images of these disambiguated sentences across seven ambiguity categories. Using LaViSA, we conduct a comprehensive evaluation of diverse VLMs, including both proprietary and open-source models with varying parameter scales and reasoning capabilities. Experimental results show that although recent VLMs can leverage visual scenes to resolve structural ambiguity to a some extent, they still struggle with certain ambiguity types and visually subtle semantic distinctions, indicating remaining limitations in resolving structural ambiguity using visual scenes.

2606.05833 2026-06-19 cs.CV cs.AI 版本更新 90%

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

从视频中学习几何表示以实现空间智能多模态大语言模型

Haibo Wang, Lifu Huang

发表机构 * University of California, Davis(加州大学戴维斯分校)

专题命中 视觉推理 :从视频学习几何表示提升MLLM空间智能。

AI总结 提出GeoVR框架,通过从2D视频序列中蒸馏3D几何知识(包括相机姿态、深度图、尺度因子和多尺度3D特征),重塑多模态大语言模型的内部表示以赋予其空间智能,在空间推理基准上达到最先进性能。

详情
AI中文摘要

多模态大语言模型(MLLMs)在2D语义理解方面表现出色,但缺乏内在的3D感知能力,导致其表示无法在视频帧间保持几何和空间一致性。鉴于大规模3D数据的稀缺性,我们提出了GeoVR,一种新颖的框架,仅使用2D视频序列学习几何表示。该方法有效地重构了MLLMs内部的语义潜在空间,以解锁空间智能。GeoVR并非采用浅层的特征混合,而是通过从预训练的3D基础模型中蒸馏几何知识来重塑MLLM的内部表示。这是通过一种多目标学习策略实现的,该策略由四个互补的几何目标驱动:(1)估计帧间相机姿态以嵌入变化的视角动态,(2)回归密集深度图以锚定物理距离,(3)预测度量尺度因子以进行真实世界校准,以及(4)蒸馏多尺度3D特征以对齐中间特征空间。在这些显式的物理和几何约束的引导下,模型的内部表示自然地发展出强大的3D感知能力。在空间推理基准上的大量实验表明,GeoVR实现了最先进的性能,为赋予基础模型空间智能建立了一种新范式。

英文摘要

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

2606.20527 2026-06-19 cs.CL cs.CV 新提交 85%

StylisticBias: A Few Human Visual Cues Drive Most Social Biases in MLLMs

StylisticBias: 少数人类视觉线索驱动多模态大语言模型中的大部分社会偏见

Shaghayegh Kolli, Timo Cavelius, Nafiseh Nikeghbal, Samantha Dalal, Jana Diesner

发表机构 * Technical University of Munich(慕尼黑工业大学) Munich Center for Machine Learning(慕尼黑机器学习中心) Princeton Center for Information and Technology Policy(普林斯顿信息与技术政策中心)

专题命中 视觉推理 :评估MLLM中视觉线索导致的社会偏见

AI总结 提出StylisticBias基准,通过控制单一视觉属性变化,发现年龄和体型主导身份层面偏见,而时尚风格等约15个属性解释近80%的偏见变化,偏见集中于少数视觉线索。

Comments Accepted to the non-archival workshops AI4Good and Culture x AI at ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地部署在个人和社会影响重大的场景中,但影响这些模型判断人物的视觉线索仍知之甚少。先前的工作通常比较不同的(群体)个体,难以将外貌效应与身份差异分离。我们引入StylisticBias,一个用于评估MLLMs中属性级社会偏见的受控基准。我们生成500张逼真的基础人脸,每张脸创建约50个单一属性变体,产生约25K张图像。这种设计保持身份不变,每次改变一个视觉属性,使我们能够测量特定线索如何改变模型判断。我们在25个二元社会判断场景中评估了六个MLLMs。我们发现年龄和体型主导身份层面的效应,而时尚风格和其他视觉线索驱动最大的属性级变化。我们进一步发现,约15个属性解释了近80%的总变异,表明偏见集中在少数视觉线索上。在与外貌语义对齐的判断中,尤其是社会经济和风格相关判断,敏感性最强。我们发布StylisticBias作为多模态模型细粒度偏见评估的基准。代码和数据集:此https URL和此https URL。

英文摘要

Multimodal large language models (MLLMs) are increasingly deployed in personally and societally consequential settings, yet the visual cues that shape how these models judge people remain poorly understood. Prior work often compares different (groups of) individuals, making it difficult to separate appearance effects from identity differences. We introduce StylisticBias, a controlled benchmark for evaluating attribute-level social bias in MLLMs. We generate 500 photorealistic base faces and create about 50 single-attribute variations per face, producing about 25K images. This design keeps identity fixed and changes one visual attribute at a time. It lets us measure how specific cues shift model judgments. We evaluate six MLLMs across 25 binary social judgment scenarios. We find that age and body type dominate identity-level effects, while fashion style and other visual cues drive the largest attribute-level shifts. We further find that about 15 attributes account for nearly 80\% of the total variation, showing that bias is concentrated in a small set of visual cues. Sensitivity is strongest in judgments that are semantically aligned with appearance, especially socioeconomic and style-related judgments. We release StylisticBias as a benchmark for fine-grained bias evaluation in multimodal models. Code and dataset: https://github.com/timo-cavelius/StylisticBias and https://hf.co/datasets/shaghayegh/stylistic-bias-dataset.

2606.20419 2026-06-19 cs.CV 新提交 85%

Spectral Query-Key Product Weight Steering for Training-Free VLM Hallucination Mitigation

谱查询-键乘积权重引导用于免训练VLM幻觉缓解

Karn Tiwari, Varnith Chordia, Prathosh A P

发表机构 * Indian Institute of Science, Bengaluru(印度科学理工学院,班加罗尔) Snap Research(Snap 研究院)

专题命中 视觉推理 :免训练方法减少VLM对象幻觉,提升视觉推理

AI总结 提出QK乘积引导,一种无数据、免训练、零推理成本的权重编辑方法,通过抑制中间层主导奇异模式减少对象幻觉,在三个GQA基VLM上平均降低CHAIR$_s$ 4.0%。

Comments Under Review

详情
AI中文摘要

视觉语言模型(VLM)通常生成流畅但视觉上无依据的描述,尤其是提及图像中不存在的对象。我们提出QK乘积引导,一种无数据、免训练、零推理成本的权重编辑方法,用于减少对象幻觉。该方法通过抑制选定中间层中少量主导奇异模式,直接编辑每头的查询-键乘积(即产生softmax前注意力logits的算子)。然后,通过封闭形式的仅查询更新将编辑后的乘积映射回查询权重,同时保持共享的键权重固定,使编辑兼容分组查询注意力。我们进一步将QK乘积分解为对称和反对称分量,以区分相互内容相似性模式与方向性注意力模式。在三个基于GQA的VLM上,QK乘积引导实现了平均相对CHAIR$_s$降低4.0%,而匹配的随机模式控制显示可忽略的变化。可解释性消融表明,幻觉信号特定于主导QK模式,并主要定位于对称相互注意力通道。总体而言,QK乘积引导提供了一种解码时缓解的简单替代方案,无需额外数据、微调或推理时开销,同时基本保持多模态能力。

英文摘要

Vision-language models (VLMs) often generate fluent but visually unsupported descriptions, especially by mentioning objects absent from the image. We propose QK Product Steering, a data-free, training-free, and zero-inference-cost weight edit for reducing object hallucination. The method directly edits the per-head query-key product, the operator that produces pre-softmax attention logits, by suppressing a small number of dominant singular modes in selected middle layers. The edited product is then mapped back to the query weights through a closed-form query-only update while keeping shared key weights fixed, making the edit compatible with grouped-query attention. We further decompose the QK product into symmetric and antisymmetric components to distinguish mutual content-similarity patterns from directional attention patterns. Across three GQA-based VLMs, QK Product Steering achieves an average relative CHAIR$_s$ reduction of $4.0\%$, while matched random-mode controls show negligible change. Interpretability ablations show that the hallucination signal is specific to dominant QK modes and is primarily localized to the symmetric mutual-attention channel. Overall, QK Product Steering offers a simple alternative to decoding-time mitigation, requiring no additional data, fine-tuning, or inference-time overhead while largely preserving general multimodal capability.

2606.20364 2026-06-19 cs.LG 新提交 85%

Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation

评判以改进:一种去偏的 VLM-as-3D-Judge 协议用于单图像 3D 生成

Ali Asaria, Tony Salomone, Deep Gandhi

发表机构 * Transformer Lab

专题命中 视觉推理 :VLM作为评判者优化3D生成

AI总结 本文提出一种去偏的跨模型 VLM-as-3D-Judge 协议,将评判者从排序扩展到优化,通过训练与评估评判者分离、位置偏差校正及修复三种失效模式,实现轻量级适应下与强基线的匹配。

详情
AI中文摘要

一项伴随研究建立了一个去偏的、跨模型的 VLM-as-3D-Judge,能够可靠地对单图像到 3D 网格质量进行排序,而廉价的几何和 CLIP 代理在此方面表现不足。本文提出:该评判者的偏好能否专门化一个强大的开放生成器 TRELLIS,针对单一资产类别(家具),且无需人工标注?将评判者从排序扩展到优化是本文的工作所在。将 VLM 评判者推入训练和评估循环会暴露排序从未触发的失效模式,因此我们的贡献是对评判者进行优化级别的强化:一个训练评判者(Qwen2.5-VL-7B)与一个评估评判者(InternVL3-8B)保持分离以打破循环性;位置偏差校正;以及针对三种失效模式(图像过载、隐藏几何的溅射渲染、以及奖励干净但错误输出的无参考评判)的修复,并附有校准证据(清晰差距胜率 0.83-1.0;基线间约 0.5)。使用此协议作为独立评估者,仅从公开模型和数据出发,采用轻量级参数高效适应,我们发现我们的方法匹配了强基线而非超越它。独立基线样本几乎不携带可学习的偏好(0.94 顺序翻转率),因此信号必须通过质量对比构造来设计。在六种适应方法、两种输入模式和严重程度扫描中,最具针对性的方法——严重退化下的条件器修复——达到了与基线持平(0.50),而没有方法达到 >=65% 的胜率目标。结果是机制性的:干净输入使评判者饱和,流式 DIT 微调通过采样器被冲刷,而条件器修复是改变几何的位点。胜率在 n=8 个对象时具有方向性。匹配一个强大的公开数据基线本身具有信息量:超越它需要比公开数据上的轻量级 PEFT 更多,而评判者协议是可复用的。

英文摘要

A companion study established a de-biased, cross-model VLM-as-3D-judge that reliably ranks single-image-to-3D mesh quality where cheap geometry and CLIP proxies fall short. This paper asks: can that judge's preferences specialize a strong open generator, TRELLIS, on one asset class (furniture), cheaply and without human labels? Taking the judge from ranking to optimization is where the work lives. Pushing a VLM judge into the training and evaluation loop exposes failure modes ranking never triggered, so our contribution is an optimization-grade hardening of the judge: a training judge (Qwen2.5-VL-7B) held distinct from an evaluation judge (InternVL3-8B) to break circularity; position-bias correction; and fixes for three failure modes (image overload, geometry-hiding splat renders, and reference-free judging that rewards clean-but-wrong outputs), with calibration evidence (clear-gap win-rate 0.83-1.0; base-vs-base ~0.5). Using this protocol as an independent evaluator, and working only from public models and data with lightweight parameter-efficient adaptation, we find our methods match the strong base rather than exceed it. Independent base samples carry essentially no learnable preference (0.94 order-flip rate), so signal must be engineered by quality-contrastive construction. Across six adaptation methods, two input regimes, and a severity sweep, the most targeted - conditioner repair under severe degradation - reaches parity (0.50) with the base, while no method clears the >=65% win-rate target. The result is mechanistic: clean inputs saturate the judge, flow-DIT fine-tuning washes out through the sampler, and conditioning repair is the locus that moves geometry. Win-rates are directional at n=8 objects. Matching a strong public-data base with cheap adaptation is itself informative: exceeding it needs more than lightweight PEFT on public data, and the judge protocol is reusable.

2606.20244 2026-06-19 cs.CV cs.AI 新提交 85%

SPOT-E: Test-Time Entropy Shaping with Visual Spotlights for Frozen VLMs

SPOT-E:基于视觉聚光灯的冻结VLM测试时熵整形

Bo Yin, Xiaobin Hu, Chengming Xu, Ruolin Shen, Mo Yang, Jiangning Zhang, Peng-Tao Jiang, Cheng Tan, Shuicheng YAN

发表机构 * National University of Singapore(新加坡国立大学) Fudan University(复旦大学) Technical University of Munich(慕尼黑工业大学) Sagenic Tech Zhejiang University(浙江大学) vivo Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

专题命中 视觉推理 :测试时熵整形提升VLM证据定位

AI总结 提出SPOT-E方法,通过测试时熵整形和视觉聚光灯,解决VLM在证据密集型任务中因忽视局部关键证据而表现不佳的问题,无需重新训练即可提升定位与鲁棒性。

详情
AI中文摘要

视觉语言模型(VLM)在证据密集型任务中通常表现不佳,因为决定性视觉证据往往微小、局部且容易被忽略,导致即使高层推理完好,证据读取也会失败。先前的推理时视觉干预可以在不重新训练的情况下改善定位,但大多是开环的,缺乏验证高亮证据是否实际使用的机制。我们研究答案跨度预测熵作为模型内部反馈信号,并表明朴素熵最小化具有歧义性,因为低熵可能源于证据支持的置信度或捷径坍塌。为解决这一歧义,我们引入低熵锚点和熵整形目标,在减少答案不确定性的同时保留基线高置信度标记。我们将这一原理实例化为SPOT-E,一种即插即用的测试时方法,生成问题条件聚光灯,并通过基于组相对策略优化(GRPO)的轻量级调优对每个实例进行优化。在所有基准测试和不同VLM家族中,SPOT-E在视觉损坏下均取得一致增益和改进的鲁棒性。代码公开于:\url{this https URL}

英文摘要

Vision-language models (VLMs) often underperform on evidence intensive tasks because decisive visual evidence are small, localized, and easy to overlook, leading to failures in evidence readout even when high-level reasoning is intact. Prior inference-time visual interventions can improve grounding without retraining, but they are largely open-loop and lack a mechanism to verify whether highlighted evidence is actually used. We study answer-span prediction entropy as a model-internal feedback signal and show that naive entropy minimization is ambiguous, since low entropy may arise from evidence-grounded confidence or shortcut collapse. To resolve this ambiguity, we introduce low-entropy anchors and an entropy-shaping objective that reduces answer uncertainty while preserving baseline high-confidence tokens. We instantiate this principle in SPOT-E, a plug-and-play test-time method that produces question-conditioned spotlights, optimized per instance via light-weight tuning based on Group Relative Policy Optimization (GRPO). Across all benchmarks and different VLM families, SPOT-E yields consistent gains and improved robustness under visual corruptions. Code is publicly available at: \url{https://github.com/YinBo0927/SPOT-E}

2606.20077 2026-06-19 cs.CV cs.AI 新提交 85%

The Hidden Evolution of Disguised Visual Context inside the VLM

VLM内部伪装视觉上下文的隐藏演化

Wish Suharitdamrong, Tony Alex, Muhammad Awais, Sara Atito

发表机构 * Surrey Institute for People-Centred AI, University of Surrey(萨里大学以人为本人工智能研究所) Centre for Vision, Speech and Signal Processing (CVSSP), University of Surrey(萨里大学视觉、语音与信号处理中心)

专题命中 视觉推理 :VLM中视觉令牌演化分析

AI总结 研究视觉语言模型中视觉令牌如何通过不同集成架构(上下文注入与逐层注入)转化为有意义表示,揭示其内部演化过程及对性能的影响。

详情
AI中文摘要

视觉令牌作为原始的外部信号进入大语言模型(LLM)。它们如何被转化为有意义的表示并与语言空间交互完全取决于集成架构——无论是将视觉令牌视为输入序列中的上下文提示,还是直接注入到LLM的中间层。对于这些架构选择如何影响视觉信息及其内部转换以与LLM集成,目前仍缺乏受控比较和理解。我们通过在相同训练条件下评估上下文注入和逐层注入的VLM集成范式,在单图像、多图像和视频基准上进行公平比较。在此过程中,我们揭示了一个隐藏的演化:视觉令牌作为伪装的视觉上下文(缺乏语言结构的原始表示)进入LLM,但根据集成范式逐渐被重塑,每种范式捕捉视觉信号的不同频率特征。我们表明,LLM内部的这种演化决定了VLM能够有效利用哪些视觉特征、视觉表示如何与语言空间对齐,以及最终每种范式在不同任务上的表现。我们进一步证明,仅关注注意力分配是不够的,性能由每一层视觉表示的质量驱动。

英文摘要

Visual tokens enter Large Language Models (LLMs) as raw, foreign signals. How they are transformed into meaningful representations and interact with the language space depends entirely on the integration architecture. Whether by treating visual tokens as in-context prompts within the input sequence or injecting them directly into the LLM's intermediate layers. A controlled comparison and understanding of how these architectural choices affect visual information and its internal transformation to integrate with the LLM remains underexplored. We provide a fair comparison by evaluating in-context and layer-wise injection VLM integration paradigms under identical training conditions across single image, multi-image, and video benchmarks. In doing so, we uncover a hidden evolution where visual tokens enter the LLM as disguised visual context, raw representations lacking linguistic structure, but are progressively reshaped depending on the integration paradigm, each capturing fundamentally different frequency characteristics of the visual signal. We show that this evolution inside the LLM determines what visual features the VLM can utilize effectively, how visual representations align with the language space, and ultimately how each paradigm performs across different tasks. We further demonstrate that attention allocation alone is insufficient, and that performance is driven by the quality of visual representations at each layer.

2606.20045 2026-06-19 cs.CV cs.AI 新提交 85%

See-and-Reach: Precise Vision-Language Navigation for UAVs within the Field of View

See-and-Reach: 视场内的精确视觉语言导航用于无人机

Fanfu Xue, En Yu, Yantian Shen, Zhikun Hu, Hongjun Wang, Yang Yang, Xindi Wang, Jiande Sun

发表机构 * School of Information Science and Engineering, Shandong University(山东大学信息科学与工程学院) Faculty of Engineering and Information Technology, University of Technology Sydney(悉尼科技大学工程与信息技术学院) School of Computer Science and Technology, Shandong University(山东大学计算机科学与技术学院) School of Artificial Intelligence, Shandong University(山东大学人工智能学院) School of Computer Science and Artificial Intelligence, Shandong Normal University(山东师范大学计算机科学与人工智能学院) Interdisciplinary Research Center of General Artificial Intelligence, Shandong Normal University(山东师范大学通用人工智能跨学科研究中心)

专题命中 视觉推理 :提出视觉语言导航框架用于无人机精确到达。

AI总结 针对无人机视觉语言导航中目标可见后精确到达能力评估不足的问题,提出UAV-VLN-FOV任务和3DG-VLN框架,通过动态3D方向线索增强细粒度视觉定位与空间对齐,在基准和真实实验中显著提升成功率。

Comments 12 pages, 7 figures

详情
AI中文摘要

无人机视觉语言导航(UAV-VLN)通常被形式化为一个整体的搜索与到达问题,其中远程目标发现和最终目标接近被联合优化和评估。这种表述使得评估空中具身代理的关键能力变得困难,即一旦目标进入其视场,无人机能否准确地将可见目标定位并将视觉语言证据转化为精确的3D运动。为了解决这一局限性,我们引入了UAV-VLN-FOV,一个目标可见的导航任务,它隔离了“看到并到达”阶段,并能够对终端到达能力进行更具诊断性的评估。我们进一步提出了3DG-VLN,一种由动态3D方向线索引导的视觉语言航点预测框架,以增强细粒度视觉定位和空间方向对齐,从而实现精确的目标到达。具体来说,3DG-VLN自适应地处理高分辨率的前视和下视观测,以保留用于目标定位的细粒度视觉和几何细节。它还在闭环导航过程中在线更新目标相对方向,使代理能够保持与目标的空间对齐并减少累积的方向漂移。为了支持该任务,我们构建了一个专用的高分辨率基准,包含2,717条轨迹,带有面向目标的高级指令、高分辨率的前视和下视自我中心观测以及连续的3D航点注释。实验表明,3DG-VLN优于具有竞争力的UAV-VLN基线,成功率提高了13.82%。真实世界试验进一步展示了3DG-VLN在实际“看到并到达”导航中的潜力。源代码和基准可在以下网址获取:此 https URL。

英文摘要

UAV Vision-Language Navigation (UAV-VLN) is typically formulated as a holistic search-and-reach problem, where long-range target discovery and final target approach are optimized and evaluated jointly. This formulation makes it difficult to assess a critical capability of aerial embodied agents, namely whether a UAV can accurately ground a visible target and translate vision-language evidence into precise 3D motion once the target enters its field of view. To address this limitation, we introduce UAV-VLN-FOV, a target-visible navigation task that isolates the see-and-reach stage and enables a more diagnostic evaluation of terminal reaching ability. We further propose 3DG-VLN, a vision-language waypoint prediction framework guided by dynamic 3D direction cues to enhance fine-grained visual grounding and spatial direction alignment for precise target reaching. Specifically, 3DG-VLN adaptively processes high-resolution front-view and downward-view observations to preserve fine-grained visual and geometric details for target grounding. It also updates the target-relative direction online during closed-loop navigation, allowing the agent to maintain spatial alignment with the target and reduce accumulated direction drift. To support this task, we construct a dedicated high-resolution benchmark which contains 2,717 trajectories with target-oriented high-level instructions, high-resolution front-view and downward-view egocentric observations, and continuous 3D waypoint annotations. Experiments show that 3DG-VLN outperforms competitive UAV-VLN baselines, achieving a 13.82\% improvement in success rate. Real-world trials further demonstrate the potential of 3DG-VLN for practical see-and-reach navigation. The source code and benchmark are available at https://github.com/xuefanfu/3DG-VLN.

2606.19965 2026-06-19 cs.CV cs.AI 新提交 85%

ROSE: Benchmarking the Perception-to-Action Gap in Multimodal Models

ROSE:多模态模型中感知到行动差距的基准测试

Yihao Wang, Zijian He, Jie Ren, Keze Wang

发表机构 * Sun Yat-sen University(中山大学) Shaanxi Normal University(陕西师范大学)

专题命中 视觉推理 :提出ROSE基准测试MLLM感知到行动差距

AI总结 提出ROSE基准,通过固定视觉场景并变化区域约束与符号输出,测试多模态大模型在不同上下文中将相同视觉证据转化为所需行动的能力,发现模型性能下降高达44.5个百分点,揭示感知到行动的瓶颈。

Comments 29 pages, 11 figures

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越被期望基于视觉信息采取行动,然而同一场景在不同任务上下文中可能需要不同的行动。模型能否可靠地将相同的视觉证据转化为当前上下文所需的行动?为了回答这个问题,我们引入了\textsc{ROSE}(\textbf{R}eference-conditioned \textbf{O}ddity and \textbf{S}ymbolic \textbf{E}xecution),一个受控基准,它在保持视觉场景固定的同时变化区域约束和所需的符号输出。通过耦合的计数和坐标行动任务,\textsc{ROSE}测试模型是否能够推断出隐含的多数参考,并在变化的上下文中基于由此产生的细粒度视觉证据采取行动。在九个最近的MLLMs中,从计数导向任务到区域条件行动的性能下降高达44.5个百分点,而人类表现达到98.8%。这种差距在成对的场景和区域中持续存在,即使同一模型在这些场景和区域上返回正确的计数,而全局点击和匹配的局部控制表明坐标定位仅解释了部分损失,揭示了在将共享视觉证据转化为上下文特定行动时存在一个独特的、模型相关的瓶颈。

英文摘要

Multimodal large language models (MLLMs) are increasingly expected to act on visual information, yet the same scene may require different actions under different task contexts. How reliably can a model turn the same visual evidence into the action required by the current context? To answer this question, we introduce \textsc{ROSE} (\textbf{R}eference-conditioned \textbf{O}ddity and \textbf{S}ymbolic \textbf{E}xecution), a controlled benchmark that holds the visual scene fixed while varying region constraints and required symbolic outputs. Through coupled counting and coordinate-action tasks, \textsc{ROSE} tests whether models can infer an implicit majority reference and act on the resulting fine-grained visual evidence under changing contexts. Across nine recent MLLMs, performance drops by as much as 44.5 percentage points from counting-oriented tasks to region-conditioned action, despite 98.8\% human performance. The gap persists on paired scenes and regions for which the same model returns the correct count, while global-click and matched local controls show that coordinate grounding explains only part of the loss, revealing a distinct, model-dependent bottleneck in turning shared visual evidence into context-specific actions.

2606.19944 2026-06-19 cs.CV 新提交 85%

Timage: A Generative Text-in-Image Paradigm for Fine-Tuning Vision-Language Models

Timage: 一种用于微调视觉语言模型的文本嵌入图像生成范式

Yifeng Wu, Huimin Huang, Ruiluo Wu, Chunyi Lin, Guanhua Chen, Xian Wu, Wang Song, Ruize Han

发表机构 * Fudan University(复旦大学) Shenzhen University of Advanced Technology(深圳先进技术大学) Tencent Jarvis Lab(腾讯贾维斯实验室) Southern University of Science and Technology(南方科技大学)

专题命中 视觉推理 :提出文本嵌入图像范式提升VLM细粒度空间推理

AI总结 提出Timage范式,通过约束薛定谔桥将查询文本作为排版覆盖层嵌入图像,以显式空间锚点引导模型关注,在不侵蚀骨干能力前提下提升细粒度空间推理性能。

Comments ECCV

详情
AI中文摘要

多模态大语言模型(MLLMs)在细粒度空间推理中常丢失正确图像区域,因为文本查询很少携带明确的几何锚点进入像素域。现有补救方法要么重新调整模型权重,要么用冗长指令填充提示,但都无法在不侵蚀骨干通用能力的情况下可靠地将语言定位到正确的视觉坐标。我们提出Timage,一种将多模态理解重新定义为输入层面对齐问题的范式:查询被绘制为排版覆盖层直接叠加在图像上。该覆盖层的放置和外观由约束薛定谔桥(cSB)生成,这是一种熵最优传输采样器,将布局合成分解为两个耦合的随机阶段。第一阶段——区域搜索,将噪声向查询对齐的图像区域传输,同时遵守硬遮挡屏障以保护显著前景内容;第二阶段——外观塑造,通过“墨水预算”正则化调整字形大小,使渲染文本保持可读和视觉平衡。生成的覆盖层作为显式注意力信标,引导模型沿空间语义聚焦。在VMCBench基准上,Timage搭配7B骨干模型明显超越更大的专有系统和参数调优基线。该研究将审慎的输入重构定位为一种强大的、架构中立的杠杆,以增强多模态推理。

英文摘要

Multimodal Large Language Models (MLLMs) often lose track of the right image regions during fine-grained spatial reasoning, because a textual query rarely carries any explicit geometric anchor into the pixel domain. Prevailing remedies either rewire the model's weights or pad the prompt with verbose instructions, yet neither reliably pins the language to the correct visual coordinates without eroding the backbone's general competence. We introduce Timage, a paradigm that recasts multimodal understanding as an alignment problem solved at the input: the query is drawn, as a typeset overlay, onto the image itself. The placement and appearance of this overlay are produced by a Constrained Schrödinger Bridge (cSB), an entropic optimal-transport sampler that factorizes layout synthesis into two coupled stochastic stages. The first stage, Region Search, transports noise toward query-aligned image zones while obeying a hard occlusion barrier that protects salient foreground content; the second stage, Appearance Shaping, sizes the glyphs through an ``ink-budget'' regularizer so that the rendered text stays legible and visually balanced. The resulting overlay behaves as an explicit attention beacon that channels the model's focus along spatial semantics. On the VMCBench suite, Timage paired with a modest 7B backbone clearly overtakes far larger proprietary systems as well as parameter-tuned baselines. The study positions deliberate input reconstruction as a powerful, architecture-neutral lever for strengthening multimodal reasoning.

2606.19828 2026-06-19 cs.CV 新提交 85%

3D-PLOT-LLM: Part-Level Object Tokens for 3D Large Language Models

3D-PLOT-LLM: 用于三维大语言模型的部件级对象标记

Jintang Xue, Xinyu Wang, Yixing Wu, Jingwen Chen, C. -C. Jay Kuo

发表机构 * University of Southern California(南加州大学) Ohio State University(俄亥俄州立大学)

专题命中 视觉推理 :3D多模态大模型,支持部件级对象标记和推理。

AI总结 提出3D-PLOT-LLM,通过重组输入标记流使部件可直接通过LLM词汇寻址,无需分割解码器或边界框,在部件级基准上超越现有方法。

详情
AI中文摘要

三维多模态大语言模型(3D MLLMs)将3D对象作为一个整体进行描述,但无法处理、命名或推理其部件。先前的部件感知尝试增加了分割解码器、更重的3D编码器或边界框语法,导致参数成本大幅增加。我们采取了一条根本不同的路径:重新组织输入标记流,使得部件通过LLM自身的词汇变得可直接寻址。我们的模型3D-PLOT-LLM将冻结的点编码器的块分割成K个局部一致的区域,并在每个区域的块标记之前插入一个可学习的每区域标记和一个保留词汇标记<part_k>;然后,一个标记空间精化(MSR)模块根据每个区域的空间统计信息和邻接邻居对该标记进行条件化。因此,模型在其输出中引用部件,并遵循通过标记引用部件的提示,这是先前对象级3D MLLMs所不具备的能力。为了探究这一接口,我们构建了PartVerse-QA,一个基于PartVerse网格注释改编的词汇级部件问答基准(77K训练对和588个保留查询,基于不相交的对象划分),在该基准上,3D-PLOT-LLM达到了描述到槽的Jaccard指数0.459和精确匹配率13.78%,槽到描述的GPT-4o评判得分为44.68。在3DCoMPaT-GrIn部件感知接地描述基准上,3D-PLOT-LLM在所有文本输出指标上优于PointLLM、Kestrel、PARIS3D和SegPoint,并在4项指标中的3项上优于ShapeLLM,相比PointLLM的GPT-4o评判得分最高提升+3.03。在Objaverse整体对象描述中,在第二阶段添加PartVerse-QA使得相比PointLLM的SBERT得分提升+0.65,GPT-4o得分提升+1.85,并且在5项传统指标中的4项(SBERT、SimCSE、BLEU-1、METEOR)上超过PointLLM-PiSA,尽管其目标是不同的(部件接地)目标。所有这些仅需在冻结的点编码器上增加不到100万个可训练参数,比先前的部件感知3D MLLMs低一个数量级,且无需分割解码器或边界框头。

英文摘要

3D multimodal large language models (3D MLLMs) describe a 3D object as a whole but cannot address, name, or reason about its parts. Prior part-aware attempts add segmentation decoders, heavier 3D encoders, or bounding-box grammars at substantial parameter cost. We take a fundamentally different path: we reorganize the input token stream so that parts become directly addressable through the LLM's own vocabulary. Our model, 3D-PLOT-LLM, partitions the frozen point encoder's patches into K locally coherent regions and inserts, before each region's patch tokens, a learnable per-region marker and a reserved vocabulary token <part_k>; a Marker-Space Refinement (MSR) module then conditions each marker on its region's spatial statistics and adjacency neighbors. The model thus cites parts in its output and follows prompts that refer to parts by token, a capability absent from prior object-level 3D MLLMs. To probe this interface, we construct PartVerse-QA, a vocabulary-level part-QA benchmark adapted from PartVerse mesh annotations (77K training pairs and 588 held-out queries on disjoint object splits), on which 3D-PLOT-LLM reaches caption-to-slots Jaccard 0.459 and Exact-match 13.78%, with a slot-to-caption GPT-4o judge of 44.68. On the 3DCoMPaT-GrIn part-aware grounded description benchmark, 3D-PLOT-LLM outperforms PointLLM, Kestrel, PARIS3D, and SegPoint on every text-output metric, and ShapeLLM on 3 of 4, with up to +3.03 GPT-4o judge over PointLLM. On Objaverse whole-object captioning, adding PartVerse-QA at Stage 2 yields +0.65 SBERT and +1.85 GPT-4o over PointLLM, and tops PointLLM-PiSA on 4 of 5 traditional metrics (SBERT, SimCSE, BLEU-1, METEOR) despite targeting a different (part-grounded) objective. All with under 1M new trainable parameters on a frozen point encoder, an order of magnitude below prior part-aware 3D MLLMs, and no segmentation decoder or bounding-box head.

2606.19584 2026-06-19 cs.CV 新提交 85%

Language-Instructed Vision Embeddings for Controllable and Generalizable Perception

语言引导的视觉嵌入用于可控且可泛化的感知

Chengzhi Mao, Xudong Lin, Wen-Sheng Chu

发表机构 * Google(谷歌)

专题命中 视觉推理 :语言引导视觉嵌入方法,提升视觉推理和泛化能力

AI总结 提出语言引导视觉嵌入(LIVE)方法,利用语言动态引导视觉编码器生成任务中心嵌入,无需任务特定重训练,减少视觉幻觉并提升泛化能力。

Journal ref Published as a conference paper at ICLR 2026

详情
AI中文摘要

视觉基础模型通常被训练为静态特征提取器,将任务适应的负担转移到大型下游模型上。我们提出另一种范式:不是仅将视觉特征输入语言模型,而是使用语言本身动态引导视觉编码器。我们的方法,语言引导视觉嵌入(LIVE),利用语言作为高层指导在推理时生成以任务为中心的嵌入,消除了任务特定重训练的需要。这使得编码器能够关注输入中上下文相关的方面,产生更可控和可泛化的表示。实验上,LIVE减少了视觉幻觉(在MMVP上提升34分),在视觉问答上超越了参数数量大几个数量级的视觉语言模型,并泛化到未见过的指令和任务——为自适应的、指令驱动的视觉智能提供了直接路径。

英文摘要

Vision foundation models are typically trained as static feature extractors, placing the burden of task adaptation onto large downstream models. We propose an alternative paradigm: instead of solely feeding visual features into language models, we use language itself to dynamically guide the vision encoder. Our method, Language-Instructed Vision Embeddings (LIVE), leverages language as high-level guidance to produce task-centric embeddings at inference time, removing the need for task-specific retraining. This enables the encoder to focus on contextually relevant aspects of the input, yielding more controllable and generalizable representations. Empirically, LIVE reduces visual hallucinations (+34 points on MMVP), surpasses vision-language models with orders of magnitude more parameters on visual question answering, and generalizes to unseen instructions and tasks -- offering a direct path toward adaptive, instruction-driven visual intelligence.

2606.18950 2026-06-19 cs.AI 新提交 85%

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench: 视觉语言模型战略推理的RTS基准

San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi

发表机构 * Seoul National University(首尔国立大学)

专题命中 视觉推理 :评估VLM在RTS游戏中的战略推理

AI总结 提出RTSGameBench,基于Beyond All Reason游戏,通过多样化对战、迷你游戏诊断和自进化生成框架,评估视觉语言模型在实时策略游戏中的战略推理能力。

Comments First two authors contributed equally

详情
AI中文摘要

现代视觉语言模型(VLM)在竞争和合作环境中的不确定性下,往往难以进行战略推理,即预测和影响其他智能体的行为。实时策略(RTS)游戏可以作为诊断这一局限性的自然测试平台,因为它们要求与盟友协调、适应对手策略,并在部分可观测性下进行长期规划。然而,现有的RTS基准评估范围有限,缺乏系统的能力诊断,并且局限于预设计的场景覆盖。为了解决这些限制,我们提出了RTSGameBench,它建立在Beyond All Reason之上,这是一款大规模RTS游戏,其扩展战场要求比现有测试平台更广泛的策略多样性。该基准通过多种对战结构提供评估,通过迷你游戏进行诊断性评估,每个迷你游戏针对单个战略能力,并通过自进化生成框架实现可扩展的覆盖,该框架将自由形式的查询转化为新的迷你游戏,并在连续循环中改进。此外,为了让VLM在大规模RTS游戏中运行,我们提供了RTSGameAgent,它通过具有智能体记忆的有限状态机(FSM)管理单位。我们通过实验验证,多个最先进的VLM在对战需要更紧密协调、多智能体协调以及任务规模增加时表现不佳。

英文摘要

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents' strategy, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain fixed in the pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than the existing testbeds. The proposed benchmark provides evaluations through diverse gameplay across various matchup structures, diagnostic assessment via mini-games, each targeting an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games, improving over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent that manages units by an FSM with agentic memory. We empirically validate that multiple state-of-the-art VLMs do not perform well when matchups demand tighter coordination, multiagent coordination and when task scale increases.

2605.20448 2026-06-19 cs.CV cs.LG 版本更新 85%

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

视觉-语言模型是理解3D场景还是仅仅 catalogue 物体?

Animesh Maheshwari, Divyansh Sahu, Nishit Verma

发表机构 * Deccan AI(德克南人工智能)

专题命中 视觉推理 :VLM 3D场景理解能力评估

AI总结 本文通过一个包含3034个样本的人工整理基准,探讨了视觉-语言模型对空间理解的深度有序遮挡、光学几何推断和体积重新安排规划能力,发现模型在重新安排可见布局时表现优异,但在遮挡和反射推断上表现较差。

详情
AI中文摘要

视觉-语言模型能够可靠地命名场景中的物体,但它们是否代表这些物体所处的3D布局?我们引入了一个包含3034个样本的人工整理基准,针对空间理解的三个组成部分:深度有序遮挡(通过三种独立的反事实操作化进行探测)、可见反射的光学几何推断,以及体积重新安排规划。六个前沿和开放权重的VLMs在18,204个响应上由训练注释者评分,没有使用LLM作为判断标准,揭示了明显的分离:在53-97%的准确率下,能够对可见布局进行重新安排的模型,在遮挡任务中表现不佳,仅在6-45%之间,而在反射任务中低于7%。一个具身推理模型重现了相同的模式。对Qwen3-VL-8B-Thinking的白盒分析显示,失败归因于视觉标记合并:在视觉编码器中可恢复的空间信息在标记压缩后变得不可用,只有在清洁的标记合并后激活被重新引入语言解码器后才恢复。

英文摘要

Vision-language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53--97% accuracy and rarely violate collision constraints fall to 6--45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.

2603.12252 2026-06-19 cs.CV cs.CL 版本更新 85%

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

EndoCoT:扩散模型中的内生思维链推理扩展

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Xi’an Jiaotong University(西安交通大学) University of Science and Technology of China(中国科学技术大学) Shanghai Jiaotong University(上海交通大学) Fudan University(复旦大学) The Chinese University of Hong Kong(香港中文大学)

专题命中 视觉推理 :扩散模型中内生思维链,提升视觉推理

AI总结 提出EndoCoT框架,通过迭代思维引导模块激活MLLM的推理潜力,并利用终端思维接地模块确保推理轨迹与文本监督对齐,使DiT逐步执行复杂任务,在多个基准上平均准确率达92.1%。

Comments 23 pages, 18 figures, The code and dataset are publicly available at https://internlm.github.io/EndoCoT/

详情
AI中文摘要

最近,多模态大语言模型(MLLMs)被广泛集成到扩散框架中,主要作为文本编码器来处理空间推理等复杂任务。然而,这种范式存在两个关键限制:(i)MLLM文本编码器表现出不足的推理深度。单步编码无法激活思维链过程,而这对MLLM为复杂任务提供准确指导至关重要。(ii)在解码过程中,指导保持不变。即使有正确的MLLM编码,解码过程中的不变指导也阻止了DiT逐步将复杂指令分解为可执行的去噪步骤。为此,我们提出了内生思维链(EndoCoT),一种新颖的框架,首先通过迭代思维引导模块迭代细化潜在思维状态来激活MLLM的推理潜力,然后将这些状态桥接到DiT的去噪过程。其次,应用终端思维接地模块,通过将最终状态与真实答案对齐,确保推理轨迹保持与文本监督的接地。通过这两个组件,MLLM文本编码器提供精心推理的指导,使DiT能够逐步执行并最终以逐步方式解决复杂任务。在多个基准(如Maze、TSP、VSP和Sudoku)上的广泛评估实现了平均准确率92.1%,比最强基线高出8.3个百分点。代码和数据集在此https URL公开。

英文摘要

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) MLLMs text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points. The code and dataset are publicly available at https://internlm.github.io/EndoCoT/.

2606.20177 2026-06-19 cs.CV cs.AI 新提交 80%

Evaluating and Enhancing Negation Comprehension in Remote Sensing MLLMs

评估与增强遥感多模态大语言模型的否定理解能力

Haochen Han, Jue Wang, Alex Jinpeng Wang, Fangming Liu

发表机构 * Peng Cheng Laboratory(鹏城实验室) Tsinghua University(清华大学) Central South University(中南大学)

专题命中 视觉推理 :评估遥感MLLMs否定理解,属于视觉语言推理。

AI总结 提出RS-Neg基准评估遥感MLLMs的否定理解,并设计NeFo方法通过测试时学习利用约5%未标注样本显著提升模型性能。

Comments ECCV 2026 Accepted

详情
AI中文摘要

多模态大语言模型(MLLMs)在各种遥感(RS)任务中取得了显著成功。然而,它们理解否定的能力仍未得到充分探索,限制了在现实应用中的部署,其中模型必须明确识别什么是错误的或不存在的,例如,应急响应人员需要定位非洪水路线进行疏散。为了全面研究这一局限性,我们引入了RS-Neg,这是第一个从区域级到场景级任务评估否定理解的基准。具体来说,我们为遥感图像设计了一个自动数据生成流程,使用LLMs合成多样化的否定查询,并引入了一个动态视觉焦点模块进行验证。我们的评估表明,先进的遥感MLLMs在否定理解上存在困难,表现出幻觉和显著的性能下降。为了弥补这一差距,我们提出了NeFo,一种新颖的测试时学习方法,将否定的逻辑角色明确纳入模型优化。值得注意的是,使用约5%的未标注测试样本,NeFo显著提升了模型的否定理解能力,并展现出对未见任务的强泛化能力。代码和数据将在接收后发布。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in various Remote Sensing (RS) tasks. However, their ability to comprehend negation remains underexplored, limiting deployment in real-world applications where models must explicitly identify what is false or absent, e.g., emergency responders need to locate non-flooded routes for evacuation. To comprehensively study this limitation, we introduce RS-Neg, the first benchmark to evaluate negation understanding across region-level to scene-level tasks. Specifically, we design an automated data generation pipeline for RS imagery, using LLMs to synthesize diverse negation queries, and introduce a dynamic visual focus module for verification. Our evaluation reveals that advanced RS MLLMs struggle with negation, exhibiting hallucinations and substantial performance degradation. To close this gap, we propose NeFo, a novel test-time learning method that explicitly incorporates the logical role of negation into the model optimization. Remarkably, using about 5\% unlabeled test samples, NeFo significantly improves the negation understanding of models and shows strong generalization to unseen tasks. Code and data will be released upon acceptance.

2606.19927 2026-06-19 cs.CV 新提交 80%

CARE: Competence-Aware Reward Shaping for Adaptive Reasoning Length in Video-MLLMs

CARE: 面向视频多模态大语言模型的自适应推理长度的能力感知奖励塑形

Chengwen Liu, Hao Peng, Jisheng Dang, Hong Peng, Bin Hu, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) School of Medical Technology, Beijing Institute of Technology(北京理工大学医学技术学院) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

专题命中 视觉推理 :提出自适应推理长度优化框架用于视频MLLM

AI总结 提出CARE框架,通过能力感知奖励塑形自适应优化推理长度,利用指数移动平均估计能力并分阶段调整奖励偏好,结合批次归一化和后验放大器提升效率与准确性。

详情
AI中文摘要

在多模态视频推理中,基于强化学习的方法通常依赖简单且不灵活的推理长度控制策略,无法适应模型不断变化的能力。这种不匹配可能在早期阶段抑制必要的探索,而在模型变得更有能力后鼓励冗余推理和低效解码。本文提出CARE,一种用于多模态推理中自适应推理长度优化的能力感知奖励塑形框架。具体来说,CARE通过通过率的指数移动平均维护平滑的能力估计,并利用它将训练路由到渐进阶段,将奖励偏好从探索导向的长形式推理转向效率导向的简洁推理。为避免将冗长与内在任务复杂性混淆,CARE进一步使用批次级统计归一化推理努力,并引入后验放大器以增强对历史上困难样本上意外强性能的奖励信号。所提出的机制无缝集成到GRPO训练流程中,且不增加额外推理开销。在多个视频推理和通用视频理解基准上的大量实验表明,CARE持续提高推理准确性,稳定强化学习,并显著提升令牌效率。此外,CARE在训练过程中展现出推理长度的特征性倒U型轨迹,并在收敛时产生更短但信息更丰富的推理轨迹,表明推理预算的有效自适应分配。我们在以下网址提供CARE框架和实验的源代码:此https URL。

英文摘要

In multimodal video reasoning, reinforcement learning-based methods typically rely on simplistic and inflexible reasoning-length control strategies that fail to adapt to the model's evolving competence. This mismatch may suppress necessary exploration at early stages, while encouraging redundant reasoning and inefficient decoding once the model becomes more competent. In this paper, we propose CARE, a competence-aware reward shaping framework for adaptive reasoning length optimization in multimodal reasoning. Specifically, CARE maintains a smoothed competence estimate via an exponential moving average of pass rates, and uses it to route training into progressive stages that shift the reward preference from exploration-oriented long-form reasoning to efficiency-oriented concise reasoning. To avoid conflating verbosity with intrinsic task complexity, CARE further normalizes reasoning effort with batch-level statistics, and introduces a posterior amplifier to strengthen reward signals for unexpectedly strong performance on historically difficult samples. The proposed mechanism is seamlessly integrated into the GRPO training pipeline and incurs no additional inference-time overhead. Extensive experiments on multiple video reasoning and general video understanding benchmarks demonstrate that CARE consistently improves reasoning accuracy, stabilizes reinforcement learning, and significantly enhances token efficiency. Moreover, CARE exhibits a characteristic inverted-U trajectory of reasoning length during training, and yields shorter yet more informative reasoning traces at convergence, indicating effective adaptive allocation of reasoning budget. We provide the source code for our proposed CARE framework and experiments at https://github.com/1Pansy/Video-CARE.

2606.19882 2026-06-19 cs.CV cs.LG 新提交 80%

Multimodal Concept Bottleneck Models

多模态概念瓶颈模型

Tongqing Shi, Ge Yan, Tuomas Oikarinen, Tsui-Wei Weng

发表机构 * UC San Diego(加州大学圣地亚哥分校)

专题命中 视觉推理 :多模态概念瓶颈模型,可解释零样本分类。

AI总结 提出多模态概念瓶颈模型(MM-CBM),利用双概念瓶颈层对齐图像和文本嵌入,实现可解释的零样本分类和图像检索,在四个基准上平均准确率提升高达51.26%。

Comments Present at NeurIPS 2025 Mechanistic Interpretability Workshop

详情
AI中文摘要

概念瓶颈模型(CBM)通过将图像提取的特征与自然概念对齐,增强了深度学习网络的可解释性。然而,现有的CBM在泛化到固定预定义类别集之外的能力以及非概念信息泄露的风险方面受到限制,其中预期概念之外的预测信号被无意中利用。在本文中,我们提出了多模态概念瓶颈模型(MM-CBM)来解决这些问题,并将CBM扩展到CLIP。MM-CBM利用双概念瓶颈层(CBL)将图像和文本嵌入对齐为可解释的特征。这使我们能够以可解释的方式执行新的视觉任务,如零样本分类或图像检索。与现有方法相比,MM-CBM在四个标准基准上平均准确率提升高达51.26%。我们的方法保持高准确率,在黑盒性能的约5%以内,同时提供更高的可解释性。

英文摘要

Concept Bottleneck Models (CBMs) enhance the interpretability of deep learning networks by aligning the features extracted from images with natural concepts. However, existing CBMs are constrained in their ability to generalize beyond a fixed set of predefined classes and the risk of non-concept information leakage, where predictive signals outside the intended concepts are inadvertently exploited. In this paper, we propose Multimodal Concept Bottleneck Model (MM-CBM) to address these issues and extend CBMs into CLIP. MM-CBM utilizes dual Concept Bottleneck Layers (CBLs) to align both the image and text embeddings into interpretable features. This allows us to perform new vision tasks like zero-shot classification or image retrieval in an interpretable way. Compared to existing methods, MM-CBM achieves up to 51.26% accuracy improvement on average across four standard benchmarks. Our method maintains high accuracy, staying within ~5% of black-box performance while offering greater interpretability.

2605.10873 2026-06-19 cs.CV cs.AI 版本更新 80%

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

CADBench:一个用于AI辅助CAD程序生成的多模态基准

Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari, Faez Ahmed

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

专题命中 视觉推理 :评估视觉语言模型在CAD程序生成中的表现

AI总结 本文提出CADBench,一个统一的多模态CAD程序生成基准,包含18000个样本和六类基准,评估11种视觉语言模型,揭示了CAD程序生成中的三种常见失败模式。

详情
AI中文摘要

从图像或3D观测中恢复可编辑的CAD程序是AI辅助设计的核心,但进展难以衡量,因为现有评估分散在数据集、模态和指标上。我们引入CADBench,一个统一的多模态CAD程序生成基准。CADBench包含18000个评估样本,涵盖来自DeepCAD、Fusion 360、ABC、MCB和Objaverse的六个基准家族,五种输入模态包括干净的网格、噪声网格、单视图渲染、逼真渲染和多视图渲染,以及六个指标,涵盖几何保真度、可执行性和程序紧凑性。STEP-based家族按B-rep面数分层,所有家族均进行多样性采样,以支持在复杂性和物体变化方面的受控分析。我们评估了11种CAD专用和通用的视觉语言系统,生成超过140万个CAD程序。在理想输入下,专用的网格到CAD模型显著优于代码生成VLMs,后者仍远未可靠。CADBench进一步揭示了三种常见的失败模式:几何复杂性增加时重建质量下降,CAD专用模型在模态转移下可能变得脆弱,且模型排名在不同指标下会变化。这些结果将CADBench定位为衡量可编辑3D重建和多模态CAD理解进展的诊断测试平台。该基准在https://huggingface.co/datasets/DeCoDELab/CADBench上公开可用。

英文摘要

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://github.com/anniedoris/CADBench.

2508.04424 2026-06-19 cs.CV 版本更新 80%

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

组合对象检索:通过组合表达式进行对象级检索

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan

发表机构 * Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education, Jiangsu, China(新一代人工智能技术及跨学科应用国家重点实验室,东南大学,教育部,江苏,中国) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE(穆罕默德·本·扎耶德人工智能大学(MBZUAI),阿布扎赫德,阿联酋)

专题命中 视觉推理 :提出组合对象检索任务,需视觉-语言推理。

AI总结 提出组合对象检索(COR)任务,通过组合参考对象、掩码和检索文本进行对象级检索,并构建COR125K基准和CORE模型,显著优于现有方法。

详情
AI中文摘要

基于用户意图检索细粒度视觉内容在多模态系统中仍然是一个挑战。尽管当前的组合图像检索(CIR)方法结合了参考图像和检索文本,但它们局限于图像级匹配,无法定位特定对象。为此,我们提出了组合对象检索(COR),一种新的对象级检索任务,从目标图像中的候选对象中检索目标对象,并用像素级掩码对检索结果进行定位。给定一个参考对象、其掩码、一个目标图像以及描述所需修改的检索文本,COR要求模型执行组合视觉-文本推理,而不是依赖显式的类别名称。这一设置带来了若干挑战,包括细粒度组合匹配、在视觉相似干扰物下的负对象过滤以及灵活的单对象或多对象检索。我们构建了COR125K,第一个大规模COR基准,包含408个类别的125,541个检索三元组,并划分基础/新类别以评估类别级泛化能力。我们还提出了CORE,一个统一的端到端模型,集成了参考区域编码、自适应视觉-文本交互和区域级对比学习,以将组合表示与目标对象对齐,同时抑制背景和干扰物。大量实验表明,CORE在基础和新类别上均显著优于现有的基于CIR的流程和强基线,为细粒度对象级多模态检索建立了一个简单而有效的基础。代码将在此https URL公开发布。

英文摘要

Retrieving fine-grained visual content based on user intent remains a challenge in multimodal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a new object-level retrieval task that retrieves target object(s) from candidate objects in a target image and grounds the retrieved result with pixel-level masks. Given a reference object, its mask, a target image, and a retrieval text describing the desired modification, COR requires models to perform composed visual-textual reasoning rather than relying on explicit category names. This setting introduces several challenges, including fine-grained compositional matching, negative-object filtering under visually similar distractors, and flexible single- or multi-object retrieval. We construct COR125K, the first large-scale COR benchmark, containing 125,541 retrieval triplets across 408 categories with base/novel splits for evaluating category-level generalization. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive vision-text interaction, and region-level contrastive learning to align composed representations with target objects while suppressing background and distractors. Extensive experiments demonstrate that CORE significantly outperforms existing CIR-based pipelines and strong baselines in both base and novel categories, establishing a simple and effective foundation for fine-grained object-level multimodal retrieval. Code will be released publicly at https://github.com/wangtong627/COR.

2606.20458 2026-06-19 cs.RO 新提交 75%

Slow Brain, Fast Planner: Latency-Resilient VLM-Augmented Urban Navigation

慢速大脑,快速规划器:延迟鲁棒的VLM增强城市导航

Zhenghao "Mark'' Peng, Honglin He, Quanyi Li, Yukai Ma, Bolei Zhou

发表机构 * Amazon FAR(亚马逊 FAR) UCLA(加州大学洛杉矶分校) Independent(独立) Zhejiang University(浙江大学)

专题命中 视觉推理 :使用VLM增强城市导航中的轨迹选择。

AI总结 针对移动机器人在人行道导航中轨迹评分差距问题,提出一种无需训练的延迟鲁棒轨迹级融合层,利用VLM选择候选轨迹并与规划器输出融合,在挑战场景下降低ADE 30%。

详情
AI中文摘要

基于学习的 sidewalk 导航规划器可以实时生成多样化的候选轨迹,但其评分函数在挑战性场景中往往无法选择最佳轨迹,即使同一集合中存在更好的候选,也会输出使移动机器人驶入草地、朝向行人或错误方向的轨迹。我们称之为轨迹评分差距:在真实世界的人行道导航中,基于锚点的规划器的最佳选择与最佳候选之间的差距很大,这可能是由于规划器的高层场景理解能力有限。我们不是用端到端的视觉-语言-动作模型替换规划器,而是提出一种VLM-规划器接口,使用VLM从规划器的候选集合中选择一个候选索引,然后将其与规划器的初始输出融合。然而,VLM每次查询需要1-3秒,因此无法直接驱动5-20Hz的控制循环。我们贡献了一种无需训练、延迟鲁棒的轨迹级融合层,通过指数衰减的几何相似性将过时的VLM选择转化为实时规划器评分。在约2000个具有挑战性的真实世界场景(例如交叉口、行人相遇)中,VLM选择相比规划器的最佳选择实现了30%的ADE降低,而规划器在常规场景中仍保持竞争力。在仿真中,Score Fusion在高达5秒的延迟下仍保持>80%的成功率。我们在移动机器人上展示了完整系统,在具有不同网络延迟的具有挑战性的校园人行道上进行导航。

英文摘要

Learning-based planners for sidewalk navigation can generate diverse candidate trajectories in real time, yet their scoring functions often fail to select the best trajectory in challenging situations, outputting trajectories that make the mobile robot drive onto grass, toward pedestrians, or in the wrong direction, even when better candidates exist in the same set. We call this the trajectory scoring gap: in real-world sidewalk navigation, the gap between an anchor-based planner's top choice and the best possible candidate is substantial, likely due to limited high-level scene understanding capability of the planner. Rather than replacing the planner with an end-to-end Vision-Language-Action model, we propose a VLM-Planner interface that uses a VLM to select a candidate index from the planner's proposal set and then fuse it with the planner's initial output. However, VLMs take 1--3s per query and so cannot directly drive a 5--20Hz control loop. We contribute a training-free, latency-resilient trajectory-level fusion layer that turns a stale VLM selection into real-time planner scoring via geometric similarity with exponential decay. On $\sim$2,000 challenging real-world scenarios (e.g., junctions, pedestrian encounters), VLM selection achieves 30% ADE reduction versus the planner's best selection, while the planner remains competitive in routine situations. In simulation, Score Fusion maintains >80% success rate with delays up to 5s. We demonstrate the full system on a mobile robot navigating challenging campus sidewalks with varied network latency.

2606.19489 2026-06-19 cs.LG cs.AI 新提交 75%

Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks

概念流模型:通过层次瓶颈锚定基于概念的推理

Ya Wang, Adrian Paschke

发表机构 * Fraunhofer Institute for Open Communication Systems(弗劳恩霍夫开放通信系统研究所) Freie Universität Berlin(柏林自由大学)

专题命中 视觉推理 :利用视觉语言模型生成概念嵌入,提升可解释性。

AI总结 提出概念流模型(CFM),用层次化概念决策树替代扁平瓶颈,通过逐步缩小预测范围减少信息泄露,在保持预测性能的同时提升可解释性。

Journal ref Transaction on Machine Learning Research, 2/2026

详情
AI中文摘要

概念瓶颈模型(CBM)通过将学习到的特征投影到人类可理解的概念空间来增强可解释性。最近的方法利用视觉-语言模型生成概念嵌入,减少了对人工概念标注的需求。然而,这些模型存在一个关键限制:当概念数量接近嵌入维度时,信息泄露增加,使得模型能够利用虚假或语义上不相关的相关性,从而削弱可解释性。在这项工作中,我们提出了概念流模型(CFM),它将扁平瓶颈替换为层次化的、概念驱动的决策树。层次结构中的每个内部节点专注于局部判别性概念子集,逐步缩小预测范围。我们的框架从视觉嵌入构建决策层次,在每个层次级别分布语义概念,并通过概率树遍历训练可微的概念权重。在多个基准上的大量实验表明,CFM在预测性能上与扁平CBM相当,同时通过减少有效概念使用显著缓解了信息泄露。此外,CFM产生逐步决策流,使得具有层次类结构的透明且可审计的模型推理成为可能。

英文摘要

Concept Bottleneck Models (CBMs) enhance interpretability by projecting learned features into a human-understandable concept space. Recent approaches leverage vision-language models to generate concept embeddings, reducing the need for manual concept annotations. However, these models suffer from a critical limitation: as the number of concepts approaches the embedding dimension, information leakage increases, enabling the model to exploit spurious or semantically irrelevant correlations and undermining interpretability. In this work, we propose Concept Flow Models (CFMs), which replace the flat bottleneck with a hierarchical, concept-driven decision tree. Each internal node in the hierarchy focuses on a localized subset of discriminative concepts, progressively narrowing the prediction scope. Our framework constructs decision hierarchies from visual embeddings, distributes semantic concepts at each hierarchy level, and trains differentiable concept weights through probabilistic tree traversal. Extensive experiments on diverse benchmarks demonstrate that CFMs match the predictive performance of flat CBMs, while substantially mitigating information leakage by reducing effective concept usage. Furthermore, CFMs yield stepwise decision flows that enable transparent and auditable model reasoning with hierarchical class structures.

2. 视觉定位 1 篇

2606.20161 2026-06-19 cs.CV 新提交 85%

ARTEMIS: Agent-guided Reliability-aware Temporal Mask Evolution for Imperfectly Supervised Video Polyp Segmentation

ARTEMIS: 基于智能体引导的可靠性感知时间掩码演化用于不完美监督的视频息肉分割

Tong Wang, Siwen Wang, Yaolei Qi, Jinxing Zhou, Yuting He, Guanyu Yang, Yutong Xie

发表机构 * Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education(东南大学教育部新一代人工智能技术及其跨学科应用重点实验室) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学) School of Medicine, Case Western Reserve University(凯斯西储大学医学院)

专题命中 视觉定位 :利用视觉语言智能体选择可靠时间锚点,结合SAM2进行视频息肉分割。

AI总结 提出ARTEMIS框架,利用视觉语言智能体选择可靠时间锚点,结合SAM2传播和可靠性感知鲁棒学习,从不完美监督(点、涂鸦、少量密集标签)中学习高质量视频息肉分割掩码,在多个基准上达到最优性能。

详情
AI中文摘要

不完美监督的视频息肉分割(VPS)旨在从廉价监督中学习密集、时间一致的掩码,包括弱标注(点、涂鸦)和少量密集标注帧的半监督。该设置具有临床价值,但由于弱对比、模糊边界、运动模糊和镜面高光,加上稀疏的像素级指导,具有挑战性。虽然SAM2可以从稀疏输入生成密集掩码,但直接伪标签通常会产生几何退化的掩码,存在边界泄漏,未充分利用时间一致性,并忽略可靠性。为解决这些问题,我们提出ARTEMIS,一个由智能体引导的可靠性感知时间掩码演化驱动的统一框架,用于不完美监督的VPS。ARTEMIS从可用监督初始化粗掩码:SAM2转换点/涂鸦,而密集标签作为可靠锚点。一个辩论-判断视觉语言智能体在弱监督下选择可靠的时间锚点,这些锚点通过SAM2双向传播以细化不可靠或未标注的帧。最后,ARTEMIS使用时间可靠性感知鲁棒学习训练分割器,结合可靠性引导的参考选择、参考原型传输模块和可靠性感知鲁棒损失。这些组件评估掩码可靠性,随时间演化锚点,跨帧传输目标身份,并降低噪声监督的权重而非丢弃困难样本。在SUN-SEG和CVC-ClinicDB-612上的涂鸦、点和有限标签设置下的实验表明,ARTEMIS达到了最先进的性能。代码将在此https URL发布。

英文摘要

Imperfectly supervised video polyp segmentation (VPS) aims to learn dense, temporally consistent masks from inexpensive supervision, including weak annotations (points, scribbles) and semi-supervision with few densely labeled frames. This setting is clinically valuable but challenging due to weak contrast, ambiguous boundaries, motion blur, and specular highlights, compounded by sparse pixel-level guidance. While SAM2 can generate dense masks from sparse inputs, direct pseudo-labeling often yields geometry-degraded masks with boundary leakage, underutilizes temporal consistency, and ignores reliability. To address these issues, we propose ARTEMIS, a unified framework for imperfectly supervised VPS driven by agent-guided reliability-aware temporal mask evolution. ARTEMIS initializes coarse masks from available supervision: SAM2 converts points/scribbles, while dense labels serve as reliable anchors. A debate-and-judge vision-language agent selects reliable temporal anchors under weak supervision, which are propagated bidirectionally with SAM2 to refine unreliable or unlabeled frames. Finally, ARTEMIS trains the segmenter using temporal reliability-aware robust learning, incorporating reliability-guided reference selection, a Reference Prototype Transport Module, and reliability-aware robust loss. These components assess mask reliability, evolve anchors over time, transport target identity across frames, and down-weight noisy supervision instead of discarding difficult samples. Experiments on SUN-SEG and CVC-ClinicDB-612 under scribble, point, and limited-label settings demonstrate that ARTEMIS achieves state-of-the-art performance. Code will be released at https://github.com/wangtong627/ARTEMIS.

3. 视觉问答 4 篇

2606.19646 2026-06-19 cs.IR cs.CV 新提交 85%

SAFE-Cascade: Cost-Adaptive Vision-Language Routing for Chart Question Answering

SAFE-Cascade: 面向图表问答的成本自适应视觉语言路由

Ayush Dwivedi, Qixin Wang, Ashvi Soni, Ruoteng Wang, Han Li, Animesh Mahapatra, Neeraj Agrawal, Xintao Wu

发表机构 * University of Arkansas(亚拉巴马大学)

专题命中 视觉问答 :提出成本自适应路由系统,用于图表问答,涉及VLM调用决策。

AI总结 提出SAFE-Cascade系统,通过OCR和轻量语言模型先给出答案,再由学习路由器决定是否调用VLM,在ChartQA上以73.1%的VLM调用率达到69.1%准确率,减少26.9%的VLM调用和9.3%的成本。

Comments Demo paper submitted at CIKM 2026. 4 pages, 2 figures

详情
AI中文摘要

视觉语言模型(VLM)在图表问答中表现出色,但若每个查询都调用VLM,当许多问题可通过OCR文本和轻量语言推理回答时,成本会不必要地高昂。我们展示了SAFE-Cascade,一个用于成本自适应图表问答的交互系统。给定图表图像和自然语言问题,SAFE-Cascade首先通过OCR提取图表文本,从纯文本语言模型获得临时答案,然后使用学习路由器决定接受文本答案还是升级到VLM。该演示向用户展示这一决策过程:OCR证据、纯文本答案、路由概率、升级决策、最终答案、估计成本和估计延迟并排显示。SAFE-Cascade被设计为一个透明界面,用于理解何时实际需要视觉基础。用户可以上传或选择图表、提问、检查每条路径使用的证据、比较纯文本和VLM答案,并调整升级阈值以探索准确率-成本边界。该系统使用Azure Document Intelligence进行OCR,gpt-5-mini作为纯文本模型,gemini-2.5-flash-image作为VLM,以及基于推理时特征训练的随机森林路由器。在从2500个样本实验中留出的375个ChartQA测试集上,SAFE-Cascade实现了69.1%的统一准确率和73.1%的VLM调用率,而全VLM基线为67.7%准确率和100% VLM调用率。观察到的+1.4个百分点差异在统计上不确定,因此我们将SAFE-Cascade解释为匹配全VLM性能,同时减少26.9%的VLM调用和9.3%的估计成本。该演示展示了选择性模态路由如何使多模态知识系统更加透明、可调优和成本感知。

英文摘要

Vision-language models (VLMs) are powerful for chart question answering, but invoking a VLM for every query can be unnecessarily expensive when many questions are answerable from OCR text and lightweight language reasoning. We demonstrate SAFE-Cascade, an interactive system for cost-adaptive chart question answering. Given a chart image and a natural-language question, SAFE-Cascade first extracts chart text with OCR, obtains a provisional answer from a text-only language model, and then uses a learned router to decide whether to accept the text answer or escalate to a VLM. The demo exposes this decision process to users: OCR evidence, text-only answer, routing probability, escalation decision, final answer, estimated cost, and estimated latency are shown side by side. SAFE-Cascade is designed as a transparent interface for understanding when visual grounding is actually needed. Users can upload or select charts, ask questions, inspect the evidence used by each pathway, compare text-only and VLM answers, and adjust the escalation threshold to explore the accuracy-cost frontier. The system is implemented with Azure Document Intelligence for OCR, gpt-5-mini as the text-only model, gemini-2.5-flash-image as the VLM, and a Random Forest router trained on inference-time features. On a held-out ChartQA test split of 375 examples from a 2,500-example experiment, SAFE-Cascade achieves 69.1% unified accuracy with 73.1% VLM invocation, compared with 67.7% accuracy and 100% VLM invocation for the full-VLM baseline. The observed +1.4 percentage-point difference is statistically uncertain, so we interpret SAFE-Cascade as matching full-VLM performance while reducing VLM calls by 26.9% and estimated cost by 9.3%. The demonstration shows how selective modality routing can make multimodal knowledge systems more transparent, tunable, and cost-aware.

2603.28387 2026-06-19 cs.AI cs.LG 版本更新 85%

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

脚手架效应:提示框架如何驱动临床VLM评估中的表面多模态增益

Doan Nam Long Vu, Simone Balloccu

发表机构 * Technical University of Darmstadt(达姆施塔特技术大学)

专题命中 视觉问答 :揭示临床VLM评估中提示框架的脚手架效应

AI总结 研究发现,在临床VLM评估中,提示中提及MRI可用性即可解释70-80%的性能提升,与图像数据是否存在无关,这种“脚手架效应”揭示了表面评估无法反映真实多模态推理能力。

详情
AI中文摘要

可信的临床AI要求性能提升反映真实的证据整合而非表面伪影。我们在两个临床神经影像队列\textsc{FOR2107}(情感障碍)和\textsc{OASIS-3}(认知衰退)上评估了12个开源视觉语言模型(VLM)的二分类性能。两个数据集都包含结构MRI数据,但这些数据不携带可靠的个体级诊断信号。在这些条件下,较小的VLM在引入神经影像上下文后F1分数提升高达58%,蒸馏模型变得与规模大一个数量级的模型相当。对比置信度分析显示,仅仅在任务提示中\textit{提及}MRI可用性就解释了70-80%的转变,与影像数据是否存在无关,这是模态坍塌的一个领域特定实例,我们称之为\textit{脚手架效应}。专家评估揭示了在所有条件下捏造基于神经影像的正当理由,而偏好对齐虽然消除了引用MRI的行为,却使两种条件都退化为随机基线。我们的发现表明,表面评估不足以作为多模态推理的指标,这对VLM在临床环境中的部署有直接影响。

英文摘要

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, \textsc{FOR2107} (affective disorders) and \textsc{OASIS-3} (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58\% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely \emph{mentioning} MRI availability in the task prompt accounts for 70-80\% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the \emph{scaffold effect}. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

2606.20477 2026-06-19 cs.CV cs.CL cs.LG 新提交 80%

Scalable Training of Spatially Grounded 2D Vision-Language Models for Radiology

面向放射学的空间定位2D视觉-语言模型的可扩展训练

Yusuf Salcan, Simon Ging, Robin Schirrmeister, Philipp Arnold, Elmar Kotter, Behzad Bozorgtabar, Thomas Brox

发表机构 * Computer Vision Group, University of Freiburg, Germany(德国弗莱堡大学计算机视觉组) Department of Radiology, Medical Center -- University of Freiburg, Germany(德国弗莱堡大学医学中心放射科) CRIION-AI Lab, Freiburg, Germany(德国弗莱堡CRIION-AI实验室)

专题命中 视觉问答 :联合报告生成、VQA和空间定位

AI总结 提出RefRad2D大规模双语数据集,通过LLM和自动分割生成空间定位数据,训练RadGrounder模型联合完成报告生成、VQA和空间定位,在外部基准上取得竞争性结果。

Comments Accepted for MICCAI 2026. First two authors: equal contribution. Last two authors: equal supervision

详情
AI中文摘要

我们研究了如何在没有手动空间标注的情况下,为放射学训练具有视觉定位能力的视觉-语言模型(VLM)。我们引入了RefRad2D,这是一个大规模的双语(德语/英语)数据集,包含来自临床实践的120万对CT和MR图像-文本对,并通过基于LLM的筛选和自动分割自动生成任务特定的VQA和空间定位子集。在此数据上训练的模型RadGrounder联合执行报告生成、视觉问答以及通过边界框检测或分割进行的空间定位。在外部VQA基准(Slake,VQA-RAD)上,RadGrounder取得了与专用医学VLM竞争的结果。将我们的临床数据加入训练混合集,相比于仅在下游数据集上微调,提高了开放式VQA的性能,显示了数据集的迁移性。关键在于,添加定位监督不会降低语言质量,从而在不牺牲VQA性能的情况下实现空间可验证的输出。

英文摘要

We study how to train visually grounded vision-language models (VLMs) for radiology without manual spatial annotations. We introduce RefRad2D, a large-scale bilingual (German/English) dataset of 1.2M CT and MR image-text pairs derived from clinical practice, with task-specific VQA and spatial grounding subsets generated automatically via LLM-based curation and automated segmentation. Trained on this data, our model RadGrounder jointly performs report generation, visual question answering, and spatial grounding via bounding-box detection or segmentation. On external VQA benchmarks (Slake, VQA-RAD), RadGrounder achieves competitive results with specialized medical VLMs. Adding our clinical data to the training mixture improves open-ended VQA over fine-tuning on the downstream datasets alone, showing the transferability of our dataset. Crucially, adding grounding supervision does not degrade language quality, enabling spatially verifiable outputs at no cost to VQA performance.

2606.20561 2026-06-19 cs.CV 新提交 70%

TimeProVe: Propose, then Verify for Efficient Long Video Temporal Reasoning in Activities of Daily Living

TimeProVe: 先提出后验证,实现日常活动中的高效长视频时间推理

Arkaprava Sinha, Dominick Reilly, Siddharth Krishnan, Hieu Le, Srijan Das

发表机构 * University of North Carolina, Charlotte(北卡罗来纳大学夏洛特分校)

专题命中 视觉问答 :使用VLM进行长视频问答验证

AI总结 提出TimeProVe框架,先通过轻量模块生成基于动作的候选假设,再调用昂贵VLM验证,在长视频问答中降低75%VLM调用和93%推理成本,性能提升7.3%。

详情
AI中文摘要

长视频问答(LVQA)需要在数小时未修剪的视频中识别稀疏的、与查询相关的证据。现有方法要么使用大型视觉语言模型(VLM)密集处理视频,导致计算成本过高,要么依赖稀疏的基于字幕的推理,这往往会遗漏时间局部化和以运动为中心的证据。我们提出TimeProVe,一种用于长视频中时间基础推理的高效混合框架。TimeProVe首先使用轻量模块生成基于动作的答案-证据假设,随后仅调用昂贵的VLM进行针对性验证。我们框架的核心在于基于动作的候选证据(ACE)模块,该模块通过轻量级LLM推理将时间局部化的动作转换为查询条件化的候选答案和支持证据窗口。我们进一步引入OpenTSUBench(OTB),一个开放基准测试,旨在评估真实世界日常活动(ADL)场景中的时间基础推理。实验表明,TimeProVe在OTB上比最强基线高出7.3%,同时减少了75%的VLM调用和93%的推理成本。此外,在没有显式时间基础训练的情况下,TimeProVe在Charades-STA上取得了竞争性性能,并在结合基础VLM增强时达到了最先进的结果。

英文摘要

Long Video Question Answering (LVQA) requires identifying sparse, query-relevant evidence within hours-long untrimmed videos. Existing approaches either process videos densely with large vision-language models (VLMs), incurring prohibitive computational cost, or rely on sparse caption-based reasoning, which often misses temporally localized and motion-centric evidence. We introduce TimeProVe, a cost-efficient hybrid framework for temporally grounded reasoning in long videos. TimeProVe first employs lightweight modules to generate action-grounded answer--evidence hypotheses and subsequently invokes an expensive VLM only for targeted verification. The core of our framework lies in the Action-based Candidate Evidence (ACE) module, which converts temporally localized actions into query-conditioned candidate answers and supporting evidence windows through lightweight LLM reasoning. We further introduce OpenTSUBench (OTB), an open-ended benchmark designed to evaluate temporally grounded reasoning in real-world Activities of Daily Living (ADL) scenarios. Experiments show that TimeProVe outperforms the strongest baseline on OTB by 7.3%, while reducing VLM calls by 75% and inference cost by 93%. Furthermore, without explicit temporal grounding training, TimeProVe achieves competitive performance on Charades-STA, and reaches state-of-the-art results when enhanced with grounding VLMs.