Large Vision Models / VLM - arXivDaily Topic

2604.04917 2026-06-19 cs.CV cs.AI cs.CL Version update 95%

Vero: An Open RL Recipe for General Visual Reasoning

Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

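Affiliations: Princeton University

Topic match (visual reasoning): Proposes the Vero family of VLMs, with significant gains on visual reasoning benchmarks.

AI summary: Proposes the Vero family of open vision-language models. Using the 600K-sample dataset Vero-600K and task-routed rewards, Vero variants improve by 2.9-5.4 points on average across 30 benchmarks, and Vero-Qwen3I-8B surpasses Qwen3-VL-8B-Thinking by 3.8 points.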

Comments Project page: https://vero-reasoning.github.io/

Abstract

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) suggest that broad visual reasoning is within reach, yet their closed data and reinforcement learning (RL) pipelines make their gains difficult to study, reproduce, or extend. We introduce Vero, a family of fully open VLMs that match or exceed existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answers. Across VeroEval, our 30-benchmark suite, Vero-600K outperforms existing RL datasets under controlled comparisons. Applied to five starting models, Vero variants gain 2.9-5.4 points on average over their initial models. Notably, Vero-Qwen3I-8B, trained on the Instruct model, surpasses Qwen3-VL-8B-Thinking by 3.8 points on average without additional distillation. Systematic ablations reveal that different task categories elicit distinct reasoning patterns and that broad gains depend on learning them jointly rather than in isolation. All data, code, and models are publicly available.
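
Illustrative sketch (not from the paper): the abstract mentions task-routed rewards that handle heterogeneous answers but gives no implementation details. The Python sketch below shows only the general idea of routing each sample to a scorer suited to its answer format. The three task types (`numeric`, `multiple_choice`, `string`) and all function names are hypothetical, and the actual Vero reward design may differ.

```python
import re
from typing import Callable, Dict


def _normalize(text: str) -> str:
    """Lowercase, trim, and collapse whitespace before comparing strings."""
    return re.sub(r"\s+", " ", text.strip().lower())


def exact_match_reward(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return 1.0 if _normalize(pred) == _normalize(gold) else 0.0


def numeric_reward(pred: str, gold: str, rel_tol: float = 0.01) -> float:
    """1.0 if both parse as numbers and agree within a relative tolerance."""
    try:
        p, g = float(pred), float(gold)
    except ValueError:
        return 0.0
    return 1.0 if abs(p - g) <= rel_tol * max(1.0, abs(g)) else 0.0


def choice_reward(pred: str, gold: str) -> float:
    """1.0 if the first standalone option letter (A-E) matches the gold letter."""
    match = re.search(r"\b([A-E])\b", pred.upper())
    return 1.0 if match and match.group(1) == gold.strip().upper() else 0.0


# Hypothetical routing table: task type -> scorer (an assumption, not Vero's).
REWARD_ROUTES: Dict[str, Callable[[str, str], float]] = {
    "numeric": numeric_reward,
    "multiple_choice": choice_reward,
    "string": exact_match_reward,
}


def routed_reward(task_type: str, pred: str, gold: str) -> float:
    """Dispatch to the scorer for this task type; fall back to exact match."""
    scorer = REWARD_ROUTES.get(task_type, exact_match_reward)
    return scorer(pred, gold)
```

With this sketch, a numeric answer `"42"` scores 1.0 against `"42.0"`, whereas a plain string comparison of the same pair would score 0.0. That difference is the motivation for routing by answer type.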

2606.05833 2026-06-19 cs.CV cs.AI Version update 90%

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

Haibo Wang, Lifu Huang

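Affiliations: University of California, Davis

Topic match (visual reasoning): Learns geometric representations from video to improve the spatial intelligence of MLLMs.

AI summary: Proposes the GeoVR framework, which learns from 2D video sequences by distilling 3D geometric knowledge (camera poses, depth maps, scale factors, and multi-scale 3D features). This reshapes the internal representations of multimodal large language models to give them spatial intelligence, and reaches state-of-the-art performance on spatial reasoning benchmarks.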

Abstract

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.
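
Illustrative sketch (not from the paper): one simple way to combine the four geometric targets into a single training objective is a weighted sum. The `GeometryLosses` container, the function name, and the weights below are assumptions. The abstract does not say how the terms are computed or balanced.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class GeometryLosses:
    """Per-batch scalar losses for the four targets named in the abstract."""
    pose: float     # inter-frame camera pose estimation
    depth: float    # dense depth map regression
    scale: float    # metric scale factor prediction
    feature: float  # multi-scale 3D feature distillation


def combined_loss(
    losses: GeometryLosses,
    weights: Sequence[float] = (1.0, 1.0, 1.0, 1.0),  # hypothetical weights
) -> float:
    """Weighted sum of the four terms, in the order pose, depth, scale, feature."""
    terms = (losses.pose, losses.depth, losses.scale, losses.feature)
    if len(weights) != len(terms):
        raise ValueError("expected one weight per loss term")
    return sum(w * t for w, t in zip(weights, terms))
```

In practice such a term would usually be added to the model's language-modeling loss, with the weights tuned on validation data. That is also an assumption here.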

2605.20448 2026-06-19 cs.CV cs.LG Version update 85%

Do Vision-Language Models Understand 3D Scenes or Just Catalogue Objects?

Animesh Maheshwari, Divyansh Sahu, Nishit Verma

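Affiliations: Deccan AI

Topic match (visual reasoning): Evaluates the 3D scene understanding of VLMs.

AI summary: Using a 3,034-sample human-curated benchmark, this paper probes three components of spatial understanding in vision-language models: depth-ordered occlusion, optical-geometry inference, and volumetric rearrangement planning. Models do well at rearranging visible layouts but poorly at occlusion and reflection inference.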

Abstract

Vision-language models reliably name objects in a scene, but do they represent the 3D layout those objects inhabit? We introduce a 3,034-sample human-curated benchmark targeting three components of spatial understanding: depth-ordered occlusion (probed via three independent counterfactual operationalisations), optical-geometry inference over visible reflections, and volumetric rearrangement planning. Six frontier and open-weight VLMs, scored by trained annotators on 18,204 responses with no LLM-as-judge, reveal a sharp dissociation: models that plan rearrangements over visible layouts at 53-97% accuracy and rarely violate collision constraints fall to 6-45% on occlusion and below 7% on reflections. An embodied-reasoning model reproduces the same profile. White-box analysis on Qwen3-VL-8B-Thinking localises the failure to the visual-token merger: spatial information recoverable throughout the vision encoder becomes inaccessible after token compression and only stabilises again when clean post-merger activations are patched into the language decoder.
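
Worked check (an inference, not a statement from the paper): the 18,204 scored responses equal 3,034 samples times six models, which is consistent with each model answering every sample once. The short Python check below makes the arithmetic explicit. The variable names are illustrative.

```python
# Figures taken from the abstract.
NUM_SAMPLES = 3034       # human-curated benchmark samples
NUM_MODELS = 6           # frontier and open-weight VLMs evaluated
REPORTED_RESPONSES = 18204

# Assumption: one scored response per (model, sample) pair.
expected_responses = NUM_SAMPLES * NUM_MODELS
matches_reported = expected_responses == REPORTED_RESPONSES
```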

2603.12252 2026-06-19 cs.CV cs.CL Version update 85%

EndoCoT: Scaling Endogenous Chain-of-Thought Reasoning in Diffusion Models

Xuanlang Dai, Yujie Zhou, Long Xing, Jiazi Bu, Xilin Wei, Yuhong Liu, Beichen Zhang, Kai Chen, Yuhang Zang

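Affiliations: Shanghai AI Laboratory; Xi’an Jiaotong University; University of Science and Technology of China; Shanghai Jiaotong University; Fudan University; The Chinese University of Hong Kong

Topic match (visual reasoning): Endogenous chain-of-thought in diffusion models to improve visual reasoning.

AI summary: Proposes the EndoCoT framework, which activates the reasoning potential of the MLLM through an iterative thought guidance module and uses a terminal thought grounding module to keep the reasoning trajectory aligned with textual supervision, letting the DiT carry out complex tasks step by step. It reaches 92.1% average accuracy across multiple benchmarks.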

Comments 23 pages, 18 figures, The code and dataset are publicly available at https://internlm.github.io/EndoCoT/

Abstract

Recently, Multimodal Large Language Models (MLLMs) have been widely integrated into diffusion frameworks primarily as text encoders to tackle complex tasks such as spatial reasoning. However, this paradigm suffers from two critical limitations: (i) The MLLM text encoder exhibits insufficient reasoning depth. Single-step encoding fails to activate the Chain-of-Thought process, which is essential for MLLMs to provide accurate guidance for complex tasks. (ii) The guidance remains invariant during the decoding process. Invariant guidance during decoding prevents DiT from progressively decomposing complex instructions into actionable denoising steps, even with correct MLLM encodings. To this end, we propose Endogenous Chain-of-Thought (EndoCoT), a novel framework that first activates MLLMs' reasoning potential by iteratively refining latent thought states through an iterative thought guidance module, and then bridges these states to the DiT's denoising process. Second, a terminal thought grounding module is applied to ensure the reasoning trajectory remains grounded in textual supervision by aligning the final state with ground-truth answers. With these two components, the MLLM text encoder delivers meticulously reasoned guidance, enabling the DiT to execute it progressively and ultimately solve complex tasks in a step-by-step manner. Extensive evaluations across diverse benchmarks (e.g., Maze, TSP, VSP, and Sudoku) achieve an average accuracy of 92.1%, outperforming the strongest baseline by 8.3 percentage points. The code and dataset are publicly available at https://internlm.github.io/EndoCoT/.

2605.10873 2026-06-19 cs.CV cs.AI Version update 80%

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari, Faez Ahmed

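Affiliations: Massachusetts Institute of Technology

Topic match (visual reasoning): Evaluates vision-language models on CAD program generation.

AI summary: Presents CADBench, a unified benchmark for multimodal CAD program generation with 18,000 samples and six benchmark families. It evaluates 11 vision-language systems and identifies three recurring failure modes in CAD program generation.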

Abstract

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://github.com/anniedoris/CADBench.
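
Illustrative sketch (not from the paper): the abstract says the STEP-based families are stratified by B-rep face count. The Python sketch below shows a generic form of stratified sampling, assuming hypothetical bin edges and a fixed per-bin quota. The function name and parameters are illustrative, and CADBench's actual sampling procedure, including its diversity sampling, is not reproduced here.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, TypeVar

T = TypeVar("T")


def stratified_sample(
    items: Sequence[T],
    face_counts: Sequence[int],
    bin_edges: Sequence[int],  # hypothetical ascending thresholds, e.g. [10, 50]
    per_bin: int,              # hypothetical quota drawn from each bin
    seed: int = 0,
) -> List[T]:
    """Group items into face-count bins, then draw up to per_bin from each."""
    rng = random.Random(seed)
    bins: Dict[int, List[T]] = defaultdict(list)
    for item, count in zip(items, face_counts):
        # Bin index = number of thresholds the face count meets or exceeds.
        index = sum(count >= edge for edge in bin_edges)
        bins[index].append(item)
    sample: List[T] = []
    for index in sorted(bins):
        members = bins[index]
        sample.extend(rng.sample(members, min(per_bin, len(members))))
    return sample
```

Sampling an equal number of shapes per bin keeps simple and complex geometry equally represented, which is what allows accuracy to be compared across complexity levels.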

2508.04424 2026-06-19 cs.CV Version update 80%

Composed Object Retrieval: Object-level Retrieval via Composed Expressions

Tong Wang, Guanyu Yang, Nian Liu, Zongyan Han, Jinxing Zhou, Salman Khan, Fahad Shahbaz Khan

Affiliations: Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Southeast University, Ministry of Education, Jiangsu, China; Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE

Topic match (visual reasoning): Proposes the composed object retrieval task, which requires visual-language reasoning.

AI summary: Proposes the Composed Object Retrieval (COR) task, which performs object-level retrieval by combining a reference object, its mask, and a retrieval text. Also builds the COR125K benchmark and the CORE model, which significantly outperforms existing methods.

Abstract

Retrieving fine-grained visual content based on user intent remains a challenge in multimodal systems. Although current Composed Image Retrieval (CIR) methods combine reference images with retrieval texts, they are constrained to image-level matching and cannot localize specific objects. To this end, we propose Composed Object Retrieval (COR), a new object-level retrieval task that retrieves target object(s) from candidate objects in a target image and grounds the retrieved result with pixel-level masks. Given a reference object, its mask, a target image, and a retrieval text describing the desired modification, COR requires models to perform composed visual-textual reasoning rather than relying on explicit category names. This setting introduces several challenges, including fine-grained compositional matching, negative-object filtering under visually similar distractors, and flexible single- or multi-object retrieval. We construct COR125K, the first large-scale COR benchmark, containing 125,541 retrieval triplets across 408 categories with base/novel splits for evaluating category-level generalization. We also present CORE, a unified end-to-end model that integrates reference region encoding, adaptive vision-text interaction, and region-level contrastive learning to align composed representations with target objects while suppressing background and distractors. Extensive experiments demonstrate that CORE significantly outperforms existing CIR-based pipelines and strong baselines in both base and novel categories, establishing a simple and effective foundation for fine-grained object-level multimodal retrieval. Code will be released publicly at https://github.com/wangtong627/COR.
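
Illustrative sketch (not from the paper): the abstract names region-level contrastive learning but does not give the loss. A common contrastive objective is InfoNCE, shown below in plain Python for a single composed query, one target-region embedding, and a list of distractor-region embeddings. Treating CORE's loss as InfoNCE is an assumption, as are the function names and the temperature value.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    temperature: float = 0.07,  # hypothetical value
) -> float:
    """InfoNCE loss: negative log-softmax of the positive pair's similarity."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]
```

The loss is small when the composed query is closer to the target region than to every distractor, and grows when a distractor region scores higher than the target.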

2509.10416 2026-06-19 cs.RO Version update 70%

TASC: Task-Aware Shared Control for Relational Telemanipulation

Ze Fu, Pinhao Song, Yutong Hu, Renaud Detry

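Affiliations: KU Leuven, Dept. Mechanical Engineering, Research unit Robotics, Automation and Mechatronics; KU Leuven, Dept. Electrical Engineering, Research unit Processing Speech and Images

Topic match (visual reasoning): Uses a vision-language model to predict spatial constraints that guide shared control.

AI summary: Proposes the TASC framework, which builds an open-vocabulary interaction graph from visual input to infer task-level user intent and provides shared-control assistance based on spatial constraints, improving the efficiency and generalization of relational telemanipulation.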

Comments Accepted to IROS 2026

Abstract

We present TASC, a Task-Aware Shared Control framework for relational telemanipulation that infers task-level user intent and provides assistance from motion-only input. To support prehensile relational tasks without predefined templates, TASC constructs an open-vocabulary interaction graph from visual input to represent functional object relationships, and infers user intent accordingly. A shared control policy then provides assistance during both grasping and object interaction, guided by spatial constraints predicted by a vision-language model. Our method addresses two key challenges in relational telemanipulation under shared control: (1) task-level intent inference from low-level motion commands, and (2) generalizable assistance across diverse objects and tasks. Experiments in both simulation and the real world demonstrate that TASC improves task efficiency and reduces user input effort compared to prior methods, while enabling zero-shot generalization across diverse relational telemanipulation tasks. The code that supports our experiments is publicly available at https://github.com/fitz0401/tasc.
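
Illustrative sketch (not from the paper): the abstract does not say how TASC combines the operator's motion with its assistance. One formulation often used in shared control is linear blending, where an arbitration weight mixes the user's command with an assistive command. The Python sketch below assumes that formulation. The function name and the `confidence` parameter are hypothetical, and TASC's actual policy may work differently.

```python
from typing import List, Sequence


def blend_commands(
    user_cmd: Sequence[float],
    assist_cmd: Sequence[float],
    confidence: float,
) -> List[float]:
    """Linearly blend two velocity commands of equal length.

    confidence is a hypothetical arbitration weight: 0.0 gives the user
    full control and 1.0 gives the assistance full control.
    """
    alpha = min(1.0, max(0.0, confidence))  # clamp to [0, 1]
    return [(1.0 - alpha) * u + alpha * a for u, a in zip(user_cmd, assist_cmd)]
```

In a scheme like this the weight would typically rise as the inferred intent becomes more certain, so the operator keeps full control while the system is still unsure of the goal.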

2305.14985 2026-06-19 cs.CV cs.CL Version update 70%

IdealGPT: Iteratively Decomposing Vision and Language Reasoning via Large Language Models

Haoxuan You, Rui Sun, Zhecan Wang, Long Chen, Gengyu Wang, Hammad A. Ayyubi, Kai-Wei Chang, Shih-Fu Chang

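Affiliations: Columbia University; HKUST; University of California, Los Angeles

Topic match (visual reasoning): Uses LLMs to iteratively decompose vision-and-language reasoning tasks.

AI summary: Proposes the IdealGPT framework, which uses large language models to iteratively decompose vision-and-language reasoning tasks. Looping over sub-question generation, sub-answer collection, and final-answer reasoning, it significantly improves multi-step reasoning in the zero-shot setting.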

Comments 13 pages, 5 figures

Abstract

The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inference. To address this, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason toward the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT.
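
Illustrative sketch (not the authors' code): the three-module loop in the abstract can be outlined in Python with the models passed in as plain callables. The type aliases, the function name, the `max_rounds` cap, and the convention that the reasoner returns `None` when it is not yet confident are all assumptions made for this sketch.

```python
from typing import Callable, List, Optional, Tuple

Evidence = List[Tuple[str, str]]  # (sub-question, sub-answer) pairs so far

# Hypothetical interfaces for the three modules described in the abstract.
Questioner = Callable[[str, Evidence], List[str]]    # LLM: propose sub-questions
Answerer = Callable[[str], str]                      # VLM: answer one sub-question
Reasoner = Callable[[str, Evidence], Optional[str]]  # LLM: answer, or None if unsure


def iterative_decompose(
    main_question: str,
    ask: Questioner,
    answer: Answerer,
    reason: Reasoner,
    max_rounds: int = 4,  # hypothetical cap on the number of rounds
) -> Tuple[Optional[str], Evidence]:
    """Repeat ask -> answer -> reason until the reasoner commits to an answer."""
    evidence: Evidence = []
    for _ in range(max_rounds):
        for sub_question in ask(main_question, evidence):
            evidence.append((sub_question, answer(sub_question)))
        final = reason(main_question, evidence)
        if final is not None:
            return final, evidence
    # No confident answer within the round budget.
    return None, evidence
```

Letting the reasoner decline to answer is what separates this loop from a single-pass pipeline: when the evidence is insufficient, the next round gathers more sub-answers instead of forcing a prediction.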
