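arXivDaily: a daily academic digest that syncs the full arXiv feed and adds AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.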

2606.15912 2026-06-16 cs.LG cs.AI New submission

On-Policy Distillation with Curriculum Turn-level Guidance for Multi-turn Agents

Gengsheng Li, Mao Zheng, Mingyang Song, Ruiqi Liu, Tianyu Yang, Jie Sun, Qiyong Zhong, Haiyun Guo, Junfeng Fang, Dan Zhang, Jinqiao Wang

Affiliations: Foundation Model Research Center, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Large Language Model Department, Tencent; University of Science and Technology of China; Zhejiang University; National University of Singapore; Wuhan AI Research

AI summary: In on-policy distillation for multi-turn agents, compounding student errors make the teacher's supervision unreliable. The paper proposes Guided-OPD, which mixes teacher- and student-generated turns within each rollout and decays the teacher's intervention probability along a curriculum; on ALFWorld, ScienceWorld, and WebShop it improves Score by 21.1% and Success Rate by 25.5% on average over vanilla OPD.

Abstract

Multi-turn agents that plan, invoke tools, and interact with environments offer a promising paradigm for solving complex tasks, yet their capabilities typically rely on very large models whose inference cost is prohibitive in practice. On-Policy Distillation (OPD) is a natural recipe for transferring such capabilities to smaller students, but we find that it suffers a characteristic failure mode in this setting: small student errors compound across turns and push the trajectory out of the teacher's familiar state distribution, so the teacher's supervision becomes least reliable precisely where the student needs it most. We propose Guided On-Policy Distillation (Guided-OPD), a simple yet effective algorithm that mixes teacher- and student-generated turns within each rollout and schedules the teacher's intervention probability along a curriculum that decays to zero. Strong guidance keeps early trajectories close to the teacher distribution and is then gradually withdrawn to recover the purely on-policy regime used at inference. On ALFWorld, ScienceWorld, and WebShop, distilling Qwen3 students from a Qwen3-30B-A3B teacher, Guided-OPD improves Score by 21.1% and Success Rate by 25.5% over vanilla OPD on average, with larger gains on smaller students.
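
The curriculum described in this abstract can be sketched roughly as follows. This is an illustration only: the abstract says the intervention probability "decays to zero" but does not give the schedule, so the linear decay, the starting value `p0`, and the function names `teacher_prob` and `assign_turns` are all assumptions.

```python
import random
from typing import List


def teacher_prob(step: int, total_steps: int, p0: float = 0.8) -> float:
    """Teacher-intervention probability at a training step.

    Assumed schedule: linear decay from p0 at step 0 to 0 at total_steps.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = min(max(step / total_steps, 0.0), 1.0)
    return p0 * (1.0 - frac)


def assign_turns(num_turns: int, p_teacher: float, rng: random.Random) -> List[str]:
    """Pick, turn by turn, whether the teacher or the student acts in a rollout."""
    return ["teacher" if rng.random() < p_teacher else "student"
            for _ in range(num_turns)]


if __name__ == "__main__":
    rng = random.Random(0)
    for step in (0, 50, 100):
        p = teacher_prob(step, total_steps=100)
        print(step, round(p, 2), assign_turns(6, p, rng))
```

Once the probability reaches zero, every turn comes from the student, which matches the purely on-policy regime the abstract says is used at inference.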

2606.15911 2026-06-16 cs.CL cs.IR New submission

Interactor: Agentic RL oriented Iterative Creation for Ad Description Generation in Sponsored Search

Penghui Wei, Jiayu Wu, Chao Ye, Zhi Guo, Shuanglong Li, Lin Liu

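Affiliations: Baidu Inc.

AI summary: Proposes Interactor, a framework that uses agentic RL to generate ad descriptions through multi-turn iteration, with multiple generative reward models judging knowledge capacity and landing page consistency; it significantly improves the knowledge richness and faithfulness of ad descriptions.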

Abstract

This paper focuses on automatically generating informative ad descriptions in sponsored search. Unlike ad titles, which are usually optimized to attract user click feedback, ad descriptions have a longer text span and possess the potential of incorporating world knowledge to address user search intents while presenting the fine-grained selling points of the ads. We propose Interactor, a multi-turn iterative creation framework optimized with agentic RL for ad description generation. The generation model acts as a policy that interacts with a customized environment consisting of multiple generative reward models. Given initial generations by the policy, the customized GenRMs evaluate multi-dimensional qualities including knowledge capacity and landing page consistency, providing both binary signals and reasoning feedback. The policy then iteratively refines the descriptions based on such feedback to ensure continuous improvement. Experiments on industrial datasets show that the Interactor framework significantly outperforms state-of-the-art approaches in generating knowledge-rich and faithful ad descriptions. Since May 2026, it has been deployed online in a leading search ads system, contributing to both ad revenue and user experience.
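
The generate, judge, refine interaction in this abstract might look like the loop below at inference time. It is a sketch under stated assumptions, not the authors' implementation: `refine`, the `Judge` type, and the stopping rule (stop when every judge passes or after `max_turns`) are hypothetical, and the RL training of the policy is not shown.

```python
from typing import Callable, List, Tuple

# A judge stands in for a GenRM: it returns (passed, reasoning feedback).
Judge = Callable[[str], Tuple[bool, str]]


def refine(generate: Callable[[str, List[str]], str],
           judges: List[Judge],
           query: str,
           max_turns: int = 3) -> Tuple[str, int]:
    """Iteratively regenerate a description until all judges pass.

    Returns the final draft and the number of evaluation turns used.
    """
    feedback: List[str] = []
    draft = generate(query, feedback)
    for turn in range(1, max_turns + 1):
        results = [judge(draft) for judge in judges]
        # Keep only the reasoning from judges that rejected the draft.
        feedback = [msg for passed, msg in results if not passed]
        if not feedback:
            return draft, turn
        if turn < max_turns:
            draft = generate(query, feedback)
    return draft, max_turns


if __name__ == "__main__":
    def toy_policy(query: str, feedback: List[str]) -> str:
        return f"{query}: durable, with care tips" if feedback else f"{query}: durable"

    def knowledge_judge(draft: str) -> Tuple[bool, str]:
        return ("tips" in draft, "add useful knowledge for the searcher")

    print(refine(toy_policy, [knowledge_judge], "hiking boots"))
```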

2606.15910 2026-06-16 cs.CL New submission

Calibrated Triage, Not Autonomy: Confidence Estimation for Medical Vision-Language Models

Reza Khanmohammadi, Kundan Thind, Mohammad M. Ghassemi

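Affiliations: Michigan State University

AI summary: Medical vision-language models can answer while ignoring the image and leaning on language priors. The paper proposes calibrated triage based on confidence estimation and evaluates seven confidence estimators; what separates usable estimators is the high-confidence region, where the weakest baselines are confidently wrong on 41 to 45 percent of their errors against 1 to 4 percent for the best probe, and no estimator is consistently best across domains and models.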

Abstract

A vision-language model can answer a question about a medical image fluently and confidently while barely using the image, leaning instead on language priors. In medicine this is the failure that matters most, because the answer looks trustworthy and is not, and the only protection is a confidence score reliable enough to tell the system when to abstain. We ask a deployment question rather than an accuracy one: how much imaging work a model can safely handle alone, and which confidence signal makes that possible. We evaluate seven confidence estimators across five open-weight LVLMs and three medical visual-question-answering datasets spanning broad clinical imaging, radiology, and pathology, with every probe trained only on natural images and applied without adaptation. Recast as bounded selective prediction (automate a case only when confidence clears a threshold, defer the rest), the comparison is cautionary. The standard metrics are poor guides: discrimination barely separates the methods, and the weak calibration of a cheap self-report is cheaply removed by off-domain temperature scaling without changing deployable yield. What distinguishes a usable estimator is the high-confidence region a clinician acts on: the weakest baselines are confidently wrong on 41 to 45 percent of their errors against 1 to 4 percent for the best probe, and no estimator is reliably best across domains or models. Safe handoff is governed at two levels: base-model competence sets a ceiling, so a well-calibrated score recovers roughly a third of radiology cases at a 20 percent error tolerance but almost none of pathology; the confidence layer then decides how much of that ceiling is reachable. The usable role today is calibrated triage, not autonomy: automate the cases a calibrated score marks safe, route the rest to a clinician. We release all outputs, correctness judgments, and confidence scores, with code.
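
Bounded selective prediction, as this abstract frames it, can be illustrated with a small helper that reports how many cases can be automated at a given error tolerance. This is a minimal sketch, not the paper's evaluation code: `deployable_yield` is a hypothetical name, and tied confidence scores are handled naively by sort order.

```python
from typing import Sequence


def deployable_yield(conf: Sequence[float],
                     correct: Sequence[bool],
                     max_error: float = 0.20) -> float:
    """Largest fraction of cases that can be automated (highest confidence
    first) while the error rate among automated cases stays <= max_error.
    The remaining cases would be deferred to a clinician."""
    order = sorted(range(len(conf)), key=lambda i: conf[i], reverse=True)
    best, errors = 0.0, 0
    for k, i in enumerate(order, start=1):
        errors += 0 if correct[i] else 1
        if errors / k <= max_error:
            best = k / len(conf)
    return best


if __name__ == "__main__":
    conf = [0.9, 0.8, 0.7, 0.6, 0.5]
    correct = [True, True, True, False, False]
    print(deployable_yield(conf, correct, max_error=0.20))
```

In this toy input the two errors sit at the lowest confidences, so the top three cases can be automated; an estimator that placed an error at the highest confidence would cut the yield sharply, which is the high-confidence behavior the abstract singles out.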

2606.15909 2026-06-16 cs.RO New submission

GeoTLM: Geometry-aware Tactile-Language Models for Contact Motion Orientation Reasoning of Dynamic Objects

Qiutian Li, Zinan Liu, Lin Wang

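Affiliations: School of EEE, Nanyang Technological University (NTU)

AI summary: Proposes GeoTLM, which uses a Differentiable Geometric Representation (DGR) to extract geometric priors from the tactile shear field and improves reasoning about the rotation and sliding directions of dynamic objects, raising accuracy by 14.6% on rotation and 16.2% on sliding.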

Comments 7 pages, 3 figures, 4 tables

Abstract

Modern tactile-language models (TLMs) have shown potential for robot learning tasks, such as material and texture recognition. However, for contact-rich scenarios, these TLMs struggle to understand the physical properties of dynamic objects, such as rotation and sliding directions. For instance, our preliminary experiments reveal that popular TLMs, such as Sparsh and AnyTouch2, exhibit weak performance on basic rotation direction reasoning from GelSight Mini tactile data. This surprising gap inspires us to explore a novel research question: Can we inject physically grounded geometric priors into TLMs to enable reliable contact orientation reasoning of dynamic object properties? To this end, we propose GeoTLM, a novel geometric representation-guided TLM for the perception of dynamic contact events. Our key idea is to preserve and structure tactile shear-field geometry before language-level reasoning, rather than forcing low-resolution tactile tokens into fragile closed-form physics operators. To achieve this, we propose a lightweight (only 14k parameters) yet novel Differentiable Geometric Representation (DGR). Specifically, DGR learns a contact-mask-guided representation in the shear field and aggregates it through an antisymmetric seven-region pooling design, motivated by the physical intuition that rotational contact produces antisymmetric deformation patterns. We conduct experiments on two representative tasks: rotation direction and sliding direction reasoning. Extensive experiments show that GeoTLM improves novel-object rotation accuracy by +14.6% and real-sensor sliding accuracy by +16.2% over the same backbone without the geometric encoder. Overall, our work paves a new way for physically grounded tactile-language reasoning, with strong potential for dynamic object understanding and contact-rich robotic manipulation.

2606.15898 2026-06-16 cs.RO New submission

VL2Spike: Spike-driven Distillation from VLMs for Low-Power Visual Perception in Embodied AI

Zinan Liu, Eric Zheng, Soumyaratna Debnath, Hao Shi, Ling Xiao, Lin Wang

Affiliations: School of EEE, Nanyang Technological University (NTU); Department of Computer Science, University of Toronto; Advanced Micro Devices, Inc.; State Key Laboratory of Extreme Photonics and Instrumentation, Zhejiang University; Faculty of Information Science and Technology, Hokkaido University

AI summary: Proposes VL2Spike, which transfers multi-modal VLM knowledge to Spikformer through spatial-temporal visual spike distillation and spike prototype-guided linguistic distillation, gaining 6.81% across three static datasets at only 15.7% of the energy consumption and also improving robotic visual place recognition.

Comments 9 pages, 4 figures, 8 tables

Abstract

Spiking neural networks (SNNs) are brain-inspired, event-driven models that compute with sparse spikes, which enables highly efficient visual perception in resource-constrained embodied AI models. The emergence of Spiking-Transformer models with spike self-attention has substantially improved the learning capacity of pure SNNs. Although SNNs are energy efficient, their performance is still limited by the spike-based architecture and optimization challenges, as standard gradient descent rules cannot be directly applied. Recently, vision-language models (VLMs) have shown rich multi-modal knowledge representation capabilities for visual perception. Thus, it is promising to leverage VLMs for better Spikformer training. To this end, we present VL2Spike, a novel spike-based knowledge distillation (KD) framework that bridges multi-modal knowledge from VLMs with compact Spikformer models. This design enhances the learning capacity of Spikformer models while preserving their energy-efficiency merits, thereby offering a practical pathway toward low-power robotic perception. Our VL2Spike brings two key technical contributions. To align with spiking dynamics, we first propose spatial-temporal visual spike (SVS) distillation, which achieves (1) shared manifold alignment between VLM image features and spike tokens, and (2) warm-started temporal consistency on membrane potentials and spike rates. We then design a novel spike prototype-guided linguistic (SPL) distillation strategy that aligns Spikformer's class prototypes and logits with promptable VLM text embeddings. Extensive experiments show that VL2Spike achieves 6.81% gain across three static datasets with only 15.7% energy consumption. It also exhibits strong generalization capacity on robotic visual place recognition (VPR) with a gain of 6.63%, highlighting its potential for low-power perception in embodied AI.

2606.15897 2026-06-16 cs.LG cs.AI stat.ML New submission

Topological Flow Matching

Kacper Wyrwal, İsmail İlkan Ceylan, Alexander Tong

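Affiliations: University of Oxford; TU Wien; AITHYRA

AI summary: Proposes topological flow matching, which augments the reference process with a Laplacian-derived drift to capture the topology of the underlying domain while keeping flow matching's stable, simulation-free objective; it applies to structured data such as brain fMRI and ocean currents.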

Comments Accepted at ICLR 2026. 26 pages, 24 figures. Code: https://github.com/KacperWyrwal/topological-flow-matching

Abstract

Flow matching is a powerful generative modeling framework, valued for its simplicity and strong empirical performance. However, its standard formulation treats signals on structured spaces, such as fMRI data on brain graphs, as points in Euclidean space, overlooking the rich topological features of their domains. To address this, we introduce topological flow matching, a topology-aware generalization of flow matching. We interpret flow matching as a framework for solving a degenerate Schrödinger bridge problem and inject topological information by augmenting the reference process with a Laplacian-derived drift. This principled modification captures the structure of the underlying domain while preserving the desirable properties of flow matching: a stable, simulation-free objective and deterministic sample paths. As a result, our framework serves as a drop-in replacement for standard flow matching. We demonstrate its effectiveness on diverse structured datasets, including brain fMRIs, ocean currents, seismic events, and traffic flows.

2606.15896 2026-06-16 cs.RO cs.LG New submission

LoComposition: Terrain-Adaptive Energy-Efficient Quadruped Locomotion without Gait Priors

Loukas Kordos, Leonard T. Franz, Simon Rappenecker, Oliver Hausdoerfer, Angela P. Schoellig, Pavel Kolev, Georg Martius

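Affiliations: Max Planck Institute for Intelligent Systems; University of Tübingen; Technical University of Munich; University of Stuttgart

AI summary: Proposes a framework that handles task rewards, operational constraints, energy minimization, and terrain perception through separate mechanisms, with no explicit gait priors; it achieves efficient, terrain-adaptive quadruped locomotion, cutting cost of transport by 56% and operational-limit violations by 96%.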

Comments 17 pages, 5 figures, 10 tables

Abstract

Learning-based quadrupedal locomotion typically relies on complex reward formulations that entangle task specification, operational limits, gait preference, and terrain adaptation within a single optimization objective. We instead treat these functions through distinct mechanisms: rewards for task specification, constraints for operational limits, energy minimization for gait preference, and exteroceptive perception for adapting energy use to terrain difficulty. We show that these components jointly enable efficient, terrain-adaptive locomotion, and that removing each component exposes a distinct failure mode. Our formulation removes explicit gait priors (including air-time, contact-count, and foot-clearance targets) in favor of emergent behavior. Compared to a conventional complex-reward baseline, our formulation achieves comparable terrain traversal while reducing cost of transport by 56% and operational-limit violations by 96%. The resulting policies transfer zero-shot to a physical Unitree Go2 using LiDAR-based elevation mapping. Project website with videos: https://tinyurl.com/locomposition.
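
For readers unfamiliar with the metric, cost of transport is commonly defined as the dimensionless ratio E / (m g d). The sketch below uses that standard definition; the abstract does not state which variant the authors use, and the numbers are illustrative, not results from the paper.

```python
def cost_of_transport(energy_j: float, mass_kg: float, distance_m: float,
                      g: float = 9.81) -> float:
    """Dimensionless cost of transport: energy spent per unit weight per
    unit distance travelled (standard definition, assumed here)."""
    if mass_kg <= 0 or distance_m <= 0:
        raise ValueError("mass and distance must be positive")
    return energy_j / (mass_kg * g * distance_m)


if __name__ == "__main__":
    # Hypothetical robot: 15 kg, 10 m walked.
    baseline = cost_of_transport(1471.5, 15.0, 10.0)
    # A 56% reduction in energy at equal mass and distance
    # lowers the cost of transport by the same 56%.
    improved = cost_of_transport(1471.5 * (1 - 0.56), 15.0, 10.0)
    print(round(baseline, 3), round(improved, 3))
```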

2606.15893 2026-06-16 cs.CL New submission

BALTO: Balanced Token-Level Policy Optimization for Hallucination Mitigation

Ning Li, Zixuan Guo, Yan Xu, Wenbo Fei, Yifan Niu, Chang Luo, Yasheng Wang, Weiwen Liu, Yong Yu, Weinan Zhang

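Affiliations: Shanghai Jiao Tong University; Tencent; The Hong Kong University of Science and Technology (Guangzhou)

AI summary: To mitigate LLM hallucination, proposes BALTO, which extracts checkable factual claims, projects their verification results to token-level labels, and adds a balanced credit assignment mechanism; it achieves the highest faithfulness in all six model-benchmark settings and outperforms existing post-training baselines.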

Abstract

Hallucinations remain a major obstacle to deploying large language models (LLMs) in knowledge-intensive settings, where generated responses must be faithfully grounded in provided evidence. Reinforcement learning (RL) is a promising direction for hallucination mitigation, but response-level faithfulness rewards suffer from a granularity mismatch: localized hallucinations can cause supported content to receive spurious penalties. Although recent work introduces fine-grained feedback such as claim-level verification and token-level rewards, unbalanced credit assignment can still induce length, verbosity, or optimization-noise biases. We propose BALTO, a Balanced Token-level Policy Optimization framework for hallucination mitigation. BALTO extracts checkable factual claims, verifies them against the reference context, and projects claim-level judgments to token-level labels. A balanced token-level credit assignment mechanism is introduced into the framework. This design redistributes probability mass from unsupported content toward faithful content, rather than suppressing the entire response. We systematically analyze the limitations of response-level rewards from a theoretical standpoint, and prove BALTO's advantages in training stability and optimization efficiency for hallucination mitigation. Experiments on ConFiQA, RAGTruth, and FinLLM-Eval show that BALTO achieves the highest faithfulness across all six model-benchmark settings and consistently outperforms existing post-training baselines in Q-Score, demonstrating a stronger faithfulness-informativeness trade-off.
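
Projecting claim-level judgments onto tokens could be sketched as below. The abstract does not specify the balancing rule, so the choice here (positive and negative token weights each normalized to sum to one in magnitude) is purely an assumption for illustration, as are the names `token_labels` and `balanced_weights` and the span format.

```python
from typing import List, Tuple

# A claim is a half-open token span plus a verification verdict.
Claim = Tuple[int, int, bool]  # (start, end, supported)


def token_labels(num_tokens: int, claims: List[Claim]) -> List[int]:
    """Label tokens +1 (supported claim), -1 (unsupported claim), 0 (no claim)."""
    labels = [0] * num_tokens
    for start, end, supported in claims:
        for t in range(max(start, 0), min(end, num_tokens)):
            if not supported:
                labels[t] = -1  # unsupported wins when spans overlap
            elif labels[t] == 0:
                labels[t] = 1
    return labels


def balanced_weights(labels: List[int]) -> List[float]:
    """Assumed balancing: each side's weights sum to 1 in magnitude, so a
    long supported passage cannot drown out a short hallucinated one."""
    pos, neg = labels.count(1), labels.count(-1)
    return [1.0 / pos if l == 1 else -1.0 / neg if l == -1 else 0.0
            for l in labels]


if __name__ == "__main__":
    labels = token_labels(6, [(0, 3, True), (3, 5, False)])
    print(labels, balanced_weights(labels))
```

With weights of this shape, credit moves away from unsupported tokens and toward supported ones without penalizing the whole response, which is the qualitative behavior the abstract describes.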

2606.15892 2026-06-16 cs.LG New submission

Scalar-pathway fidelity improves physical accuracy in short-range equivariant interatomic potentials

Jia Bi, Alin Marin Elena, Samuel Pinilla

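Affiliations: Science and Technology Facilities Council; Diamond Light Source

AI summary: Proposes scalar-pathway corrections (PAN pooling and PGS mixers) that modify only the scalar channels and leave the equivariant backbone unchanged, reducing MACE force errors by 22-27% and energy errors by 19-22% at roughly 5% extra inference FLOPs.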

Abstract

Accurate interatomic potentials enable molecular dynamics of materials, molecules, and interfaces beyond density-functional-theory length and time scales. Equivariant neural network potentials have improved the representation of local geometry. However, their deployable energy surfaces ultimately manifest through invariant scalar channels, whose aggregation and spectral resolution remain comparatively underexamined. Here we use Physics-Aware Neighborhood (PAN) pooling and Physics-Guided Spectral (PGS) mixers as controlled scalar-pathway probes: lightweight, symmetry-preserving modifications that act only on $\ell=0$ channels while leaving the equivariant tensor backbone unchanged. Using MACE as a high-body-order mechanistic scaffold, PAN adds coordination-sensitive amplitude modulation, whereas PGS augments edge and readout scalar features with radial and tapered spectral bases. Across metallic Ag, covalent Si, a short-range ionic LiF/Li-F subset, and MD17/rMD17 molecules, this scalar-pathway correction reduces MACE force errors by 22-27% and energy errors by 19-22%; on systems with stress labels, stress errors decrease by 27-28%, at approximately 5% additional inference-FLOPs cost. Directionally consistent gains in Allegro and NequIP further indicate that the correction is portable across distinct short-range equivariant backbones, although effect sizes remain architecture-dependent. These results identify scalar-pathway fidelity as a practical design dimension for short-range equivariant interatomic potentials.

2606.15890 2026-06-16 cs.AI New submission

UrbanWell: Benchmarking Multimodal Large Language Models for Spatio-Temporal Urban Wellbeing Analytics

Yanxin Xi, Xiang Su, Jie Feng, Yu Liu, Sasu Tarkoma, Pan Hui

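Affiliations: University of Helsinki; Zhongguancun Academy; University of Oxford; Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes the UrbanWell benchmark, which jointly models satellite and street view imagery to systematically evaluate the spatio-temporal reasoning of multimodal large language models on five categories of urban wellbeing indicators (environment, spatial accessibility, urban form, vitality, and subjective perception), and defines temporal forecasting and trend classification tasks.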

Comments accepted by KDD Datasets and Benchmarks Track 2026

Abstract

Understanding urban wellbeing from multimodal data requires integrating heterogeneous spatial and temporal signals, posing significant challenges for current multimodal large language models (MLLMs). We introduce UrbanWell, a large-scale benchmark designed to systematically evaluate the spatio-temporal reasoning capabilities of MLLMs for urban wellbeing analytics through joint modeling of satellite and street view imagery. UrbanWell spans 38 cities across multiple years and includes diverse indicators covering (1) environmental conditions (CO$_2$, NO$_2$, PM$_{2.5}$, and Normalized Difference Vegetation Index), (2) spatial accessibility (minimum distance to supermarkets and restaurants), (3) urban form (road length, road density, and land use), (4) urban vitality (population, economic activity diversity, and land use diversity), and (5) subjective perception attributes (e.g., safety, beauty, liveliness, wealth, and quietness). All indicators are aligned at grid level to enable standardized evaluation. Beyond static prediction, UrbanWell defines temporal reasoning tasks, including future value forecasting from historical observations and temporal trend classification. We benchmark 15 state-of-the-art representative MLLMs in a zero-shot setting, providing a comprehensive comparative evaluation across spatial and temporal dimensions. Experimental results indicate that while MLLMs capture salient spatial and perceptual cues, their performance varies substantially across heterogeneous urban indicators spanning environment and subjective perception. UrbanWell serves as a unified benchmark for evaluating multimodal spatial and temporal reasoning in urban wellbeing analytics, offering a standardized testbed for systematic assessment and future research on multimodal urban intelligence. Our code and datasets are accessible via https://github.com/axin1301/UrbanWell-Benchmark.

2606.15889 2026-06-16 cs.CV New submission

SiGnature: Explicit Motion Diffusion for Stylized Semantic Gesture

Adi Rosenthal, Tomer Koren, Nadav Shaked, Doron Friedman, Ariel Shamir

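Affiliations: Reichman University

AI summary: Proposes SiGnature, which uses an explicit joint-rotation space and a training-free inference mechanism, JMI, to control semantic gestures precisely while preserving speaker style with high fidelity, outperforming existing methods.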

Abstract

While recent advances in co-speech gesture generation have achieved impressive rhythmic synchronization, synthesizing gestures that are both semantically meaningful and faithful to a speaker's unique non-verbal style remains an open challenge. Semantic gestures, such as iconic shapes or deictic pointing, are statistically sparse, making them difficult to learn effectively within standard generative models. We present SiGnature, a framework for Stylized and Semantic Gesture generation that reconciles precise semantic control with high-fidelity style preservation. Unlike prevalent methods that rely on entangled latent representations, SiGnature operates in an explicit joint-rotation space. This design enables our core contribution, Joint Motion Integration (JMI), a training-free inference mechanism capable of injecting any external motion sequence, particularly in-the-wild semantic gestures, directly into the diffusion process. JMI automatically identifies the specific "active joints" conveying a semantic action and injects them into the generation, while relying on the diffusion backbone to synthesize the remaining body dynamics, including posture and flow, in accordance with the pre-learned style of the target speaker. This allows for the plug-and-play integration of arbitrary motions, including complex semantic gestures, without retraining or introducing the "Frankenstein" artifacts typical of cut-and-paste methods. Extensive experiments and perceptual studies demonstrate that SiGnature offers superior semantic motion control while maintaining smooth and natural co-speech gesture generation and preserving the distinct characteristics of the speaker, thereby outperforming state-of-the-art baselines.

2606.15888 2026-06-16 cs.SD cs.AI eess.AS New submission

NVMOS: Non-Verbal Vocalization Quality Assessment in Speech

Jialong Mai, Jinxin Ji, Xiaofen Xing, Wencui Liu, Xiangmin Xu

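Affiliations: University of Science and Technology of China

AI summary: To fill the gap in perceptual quality assessment of non-verbal vocalizations (such as laughter and sighs), builds the NV-MOS dataset and proposes NVMOS, to the authors' knowledge the first dedicated model, which reaches expert-level agreement through a local NV-event focusing module.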

Comments 6 pages. Code and model: https://github.com/yongaifadian1/NVMOS

Abstract

Non-verbal vocalizations (NVs), such as laughter, sighs, and coughs, are important acoustic cues for emotion and intent. Existing speech quality assessment methods typically focus on overall naturalness, while non-verbal TTS evaluations mainly examine whether a target NV appears with the correct type and position. However, the perceptual quality of NV events themselves remains underexplored. To address this gap, we construct an NV-MOS dataset containing outputs from multiple NV-TTS systems and naturally occurring NV samples, with ratings collected from three acoustic experts on a perceptual quality scale. We further analyze audio-capable multimodal large language models such as Gemini and find clear inconsistencies between their scores and expert ratings. These results suggest that general-purpose multimodal models cannot reliably replace human judgments for NV quality assessment. We then propose NVMOS, to our knowledge the first model that can reliably predict the perceptual quality of NV events in speech. Experimental results show that, with a local NV-event focusing module, NVMOS reaches expert-level or stronger agreement with human MOS.

2606.15887 2026-06-16 cs.LG cs.AI New submission

Intelligence Is Not the Bottleneck: Validating an LLM First-Pass Manuscript Score Against Peer-Review Outcomes

Costa Georgantas

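Affiliations: aipr.pub

AI summary: Validates AIPR, an LLM system that scores manuscripts by prompting alone with no fine-tuning; its overall score separates accepted from rejected ICLR submissions (AUROC 0.82) and is stable across repeated runs, supporting its use as an aid to peer review.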

Comments 34 pages, 14 figures

Abstract

Large language model (LLM) systems are increasingly proposed to assist peer review, yet most evaluations judge the prose of machine-generated review text, not the validity of the numeric score a system assigns. We validate AIPR, which reads a submitted manuscript and emits five 0-100 quality dimensions and a weighted overall score, against the public decision outcomes of a major machine learning venue. AIPR grades by prompting alone, with no fine-tuning on reviews or decisions. Across 300 ICLR submissions with public decision tiers and reviewer ratings, graded under a frozen pipeline with hypotheses pre-registered before any score met any outcome, the overall score separates rejected from accepted submissions (AUROC 0.82, 95% CI 0.78-0.87), rises monotonically across tiers, and tracks the mean reviewer rating. The signal is strongest where we claim it: the lowest-scoring fifth is rejected far above the base rate, with oral papers absent. The validity comes mostly from the model: a one-paragraph prompt on the same model discriminates almost as well as the full pipeline (the small gap favours the pipeline but does not meet the pre-declared criterion, p = 0.09). What the engineering adds is reliability and a grounded review: AIPR's score barely moves across repeated runs (0.7 vs. 2.8 points within-paper SD) where the bare prompt swings, and the same pass returns a rubric-structured, evidence-grounded review rather than a bare number, with the human keeping the decision.
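
The AUROC figure quoted in this abstract measures how often a randomly chosen accepted paper outscores a randomly chosen rejected one. The sketch below computes it by direct pairwise comparison; the function name `auroc` and the toy scores are illustrative, not data from the study.

```python
from typing import Sequence


def auroc(scores: Sequence[float], accepted: Sequence[bool]) -> float:
    """Probability that an accepted paper scores above a rejected one,
    counting ties as one half (pairwise Mann-Whitney form of AUROC)."""
    pos = [s for s, a in zip(scores, accepted) if a]
    neg = [s for s, a in zip(scores, accepted) if not a]
    if not pos or not neg:
        raise ValueError("need at least one accepted and one rejected paper")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    scores = [80, 55, 60, 50]            # hypothetical overall scores
    accepted = [True, True, False, False]
    print(auroc(scores, accepted))
```

A value of 0.5 means the score carries no ranking signal and 1.0 means perfect separation, so 0.82 indicates a substantial but imperfect separation of accepted from rejected submissions.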


2606.15886 2026-06-16 cs.CV 新提交

Text region detection in historical astronomical diagrams

历史天文图中的文本区域检测

Zeynep Sonat Baltacı, Raphaël Baena, Fei Meng, Somkéo Norindr, Florence Somer, Matthieu Husson, Mathieu Aubry

发表机构 * LIGM, ENPC, IP Paris, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France（LIGM, 国立桥路学校, 巴黎理工学院, 古斯塔夫·埃菲尔大学, 法国国家科学研究中心, 马恩拉瓦莱, 法国）； LTE, CNRS, PSL-Observatoire de Paris, SU, EIDA Project（LTE, 法国国家科学研究中心, 巴黎文理研究大学-巴黎天文台, 索邦大学, EIDA项目）

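AI summary: Introduces a large-scale dataset of 948 historical astronomical diagrams spanning ten centuries and seven linguistic traditions, and designs Poly-DETR, a model for text region detection.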


AI中文摘要

文本检测是历史文献分析中的关键任务。尽管手稿和地图的文本检测已有数据集和基准，但数学图表中的文本研究鲜受关注。为此，我们引入一个大规模、多样化、开放获取的数据集，包含948张历史天文图，共计10,940个定向多边形文本区域。数据集跨越十个世纪（8至18世纪）和七种主要语言传统：阿拉伯语和波斯语（115张）、中文（332张）、拜占庭语（233张）、拉丁语（185张）、希伯来语（48张）和梵语（35张）。它涵盖了从符号到多行段落的广泛图表风格和文本内容。每个文本实例都标注了有序多边形，精确描绘文本区域并编码阅读方向。此外，我们为拉丁图表中的2,293个区域标注了20个类别标签。我们在数据集上评估了多个强基线，包括TESTR、DeepSolo++以及Poly-DETR（我们设计的DINO-DETR的简单扩展，用于预测有序多边形顶点）。Poly-DETR在MTHv2和cBAD2019基准上达到最先进性能，并在我们的数据集上提供了坚实、简单的基线。代码和数据集在线提供。

英文摘要

Text detection is a crucial task in the analysis of historical documents. While datasets and benchmarks exist for text detection in manuscripts and maps, the study of text in mathematical diagrams has received little attention. To address this, we introduce a large-scale, diverse, open-access dataset of 948 historical astronomical diagrams containing 10,940 oriented polygonal text regions. Our dataset spans ten centuries (8th to 18th) and seven main linguistic traditions: Arabic and Persian (115), Chinese (332), Byzantine (233), Latin (185), Hebrew (48), and Sanskrit (35). It captures a wide range of diagram styles and textual content, from symbols to multi-line paragraphs. Each text instance is annotated with ordered polygons that precisely delineate text regions and encode the reading direction. In addition, we annotated the 2,293 regions in Latin diagrams with 20 class labels. We evaluated several strong baselines on our dataset, including TESTR, DeepSolo++, and Poly-DETR, a simple extension of DINO-DETR that we design to predict ordered polygon vertices. Poly-DETR achieves state-of-the-art performance on the MTHv2 and cBAD2019 benchmarks and provides a solid, simple baseline on our dataset. Code and dataset are available online.


2606.15884 2026-06-16 cs.CL 新提交

Neuron Level Analysis of Large Language Model in Legal Domain Reasoning

法律领域推理中大语言模型的神经元级分析

Eri Onami, Youmi Ma, Shuhei Kurita, Naoaki Okazaki

发表机构 * Institute of Science Tokyo（东京科学大学）； NII（国立信息学研究所）； AIST（产业技术综合研究所）

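AI summary: Uses neuron attribution scores to identify and suppress key neurons, finding both task-specific neurons and neurons shared across tasks; neurons for the legal benchmarks overlap heavily, and their distribution depends on the input format.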


AI中文摘要

我们对LLM在法律领域推理中的神经元级分析进行了研究，并将其与七个开放权重模型中的其他应用领域任务进行了比较。通过使用神经元归因分数对影响神经元进行排序和抑制，我们证实抑制识别出的神经元会显著降低目标任务的准确率，而抑制相同数量的随机神经元则不会。我们进一步发现了一小部分对所有七个任务都有影响的神经元；一旦这些神经元被移除，抑制剩余神经元只会降低它们被识别出的任务的表现，从而揭示了每个研究模型中真正任务特异性的神经元。在法律领域内，三个基准测试表现出相对较高的神经元重叠，并且往往共同受到影响，这表明存在跨司法管辖区的法律组件神经元。我们实验中识别出的神经元分布表明，关于影响神经元集中在中间MLP层的假设可能取决于输入格式和内容，而非普遍现象。

英文摘要

We presented a neuron-level analysis of legal-domain reasoning in LLMs, comparing it with other applied domain tasks across seven open-weight models. Using neuron attribution scores to rank and suppress influential neurons, we confirmed that suppressing the identified neurons collapses accuracy on the target task, whereas suppressing the same number of random neurons does not. We further found a small subset of neurons influential across all seven tasks; once these are removed, suppressing the remaining neurons degrades only the task they were identified from, revealing genuinely task-specific neurons in every model studied. Within the legal domain, the three benchmarks exhibit relatively high neuron overlap and tend to be affected jointly, suggesting the existence of legal-component neurons that span jurisdictions. The distribution of identified neurons in our experiments suggests that the hypothesis that influential neurons are concentrated in middle MLP layers may depend on the input format and content, rather than being a universal phenomenon.


2606.15880 2026-06-16 cs.CV cs.AI 新提交

Deep Residual Injection for Full-Spectrum Forensic Signal Perception in Multimodal Large Language Models

深度残差注入：多模态大语言模型的全频谱取证信号感知

Kaiqing Lin, Zhiyuan Yan, Ruoxin Chen, Ke-Yue Zhang, Yue Zhou, Caiyong Piao, Bin Li, Taiping Yao, Bo Wang, Youchang Xiao, Shouhong Ding

发表机构 * National University of Singapore（新加坡国立大学）； Tsinghua University（清华大学）； University of Science and Technology of China（中国科学技术大学）； University of Electronic Science and Technology of China（电子科技大学）； University of California, Berkeley（加州大学伯克利分校）

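AI summary: Multimodal large language models used for forensics struggle to retain semantic knowledge while capturing low-level generator artifacts. Deep-VRM injects artifact-specific visual signals into an intermediate layer as a residual path, giving full-spectrum signal perception and robust detection performance.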

Comments Accepted at ICML 2026


AI中文摘要

多模态大语言模型（MLLMs）因其强大的语义理解能力，越来越多地被应用于取证领域。随着AI生成图像变得逼真，仅凭语义层面的不一致往往不足以进行可靠检测。这引发了一个关键问题：MLLMs能否实现全频谱取证信号感知，即在不牺牲预训练语义知识的情况下捕获低级生成器伪影。我们进一步对MLLMs中的取证信号感知进行了逐层分析，表明语义信息主要在早期到中间层形成，而直接微调学习伪影会破坏这些语义表示。基于这一发现，我们提出了深度视觉残差MLLM（Deep-VRM），以保留早期语义处理，同时将伪影特定的视觉信号作为残差路径注入中间层，在此与语义标记表示融合，并通过后续可训练层传播。这使得后续层能够联合建模语义推理和信号级取证线索，令人惊讶的是，模型学会了根据输入自适应地利用不同级别的取证信号，实现了鲁棒且可泛化的检测性能。大量实验表明，我们的方法在大多数基准测试中达到了最先进水平。代码和数据可在https://github.com/KQL11/Deep-VRM获取。

英文摘要

Multimodal large language models (MLLMs) have been increasingly adopted in forensics for their robust semantic understanding. As AI-generated images become realistic, semantic-level inconsistencies alone are often insufficient for reliable detection. This motivates a critical question: whether MLLMs can achieve full-spectrum forensic signal perception, i.e., capturing low-level generator artifacts without sacrificing pre-trained semantic knowledge. We further perform a layer-wise analysis of forensic signal perception in MLLMs, showing that semantic information is primarily formed in the early-to-middle layers, whereas direct fine-tuning for artifact learning disrupts these semantic representations. Based on this insight, we propose Deep Visual Residual MLLM (Deep-VRM) to preserve early semantic processing while injecting artifact-specific visual signals as a residual path into an intermediate layer, where they are fused with semantic token representations and propagated through subsequent trainable layers. This enables later layers to jointly model semantic reasoning and signal-level forensic cues, and surprisingly, the model learns to adaptively leverage different levels of forensic signals depending on the input, achieving robust and generalizable detection performance. Extensive experiments show that our method achieves state-of-the-art across most benchmarks. The code and data are available at https://github.com/KQL11/Deep-VRM.


2606.15877 2026-06-16 cs.CL cs.AI 新提交

Free Energy Heuristics: Fast-And-Frugal Cognition as Active Inference Under Uncertain Precision

自由能启发式：作为不确定精度下主动推理的快速节俭认知

Alex Bogdan

发表机构 * Evolutionairy AI Toronto, Canada（进化人工智能（多伦多，加拿大））

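AI summary: Argues that meta-uncertainty determines whether chain-of-thought (CoT) helps: when a model is highly unsure about the reliability of its own evidence, more reasoning lowers accuracy. Proves that, under a heavy-tailed precision prior, the policy minimizing expected free energy stops integrating cues after a finite number of high-validity ones, and that under a Descending Dominance condition it coincides with the take-the-best heuristic. Experiments show that longer CoT costs 17.3 accuracy points on high-meta-uncertainty items.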

Comments 64 pages, 6 figures


AI中文摘要

链式思维（CoT）提升了大型语言模型在数学和符号推理中的表现。但在规划、有争议的伦理问题以及模型无法自我检查的任务中，更多推理反而使情况更糟。这两种效应均有文献记载；但一直缺少一个原则性的解释来说明哪种属性决定了结果。我们认为这是元不确定性：模型对其自身证据可靠性的不确定程度。当这种不确定性很高时，额外的推理不再增加信号，而是开始制造虚假的置信度。我们证明，在不确定精度下最小化期望自由能的策略，在精度先验为重尾分布时（定理2.6.1），会在有限数量的高有效性线索后停止整合线索，并且在递减优势条件下，该策略在样本层面上与“取最优”策略相同（定理2.7.4）。因此，快速节俭启发式和主动推理是同一计算的两种描述。预测是，在高元不确定性项目上，更长的CoT会降低准确率。我们按项目对区间进行评分（模拟-恢复rho > 0.96），构建了FEH-79基准（包含匹配对照的奈特框架），并在七个模型（五个开放权重3B-32B，两个前沿模型）、五种CoT长度和7,875个响应上进行了预注册研究。门槛（在数据前固定）要求负交互的后验概率高于0.95，准确率下降超过6个百分点。结果成立。高区间下降为17.3个百分点（95% CI [7.7, 25.5]）；具有明确答案的匹配项目没有显示成本。该效应依赖于区间：在能力较强的中大型模型中显著，在两个前沿系统中具有方向性，在最弱的模型中缺失甚至反转。该框架回答了CoT何时有帮助，并统一了贝叶斯和快速节俭传统：少即是多的效应是关于元不确定性区间的证据，而非反对贝叶斯认知。

英文摘要

Chain-of-thought (CoT) improves large language models' performance in math and symbolic reasoning. But on planning, contested ethics, and tasks where the model cannot check itself, more reasoning makes things worse. Both effects are documented; what has been missing is a principled account of which property decides the outcome. We argue it is meta-uncertainty: how unsure the model is about the reliability of its own evidence. When that uncertainty is high, extra reasoning stops adding signal and starts manufacturing false confidence. We prove that the policy minimizing expected free energy under uncertain precision stops integrating cues after a finite number of high-validity ones when the precision prior is heavy-tailed (Theorem 2.6.1), and under a Descending Dominance condition, is sample-wise identical to take-the-best (Theorem 2.7.4). Fast-and-frugal heuristics and active inference are, then, two descriptions of the same computation. The prediction is that on high-meta-uncertainty items, longer CoT should degrade accuracy. We score the regime per item (simulate-and-recover rho > 0.96), build FEH-79, a benchmark of Knightian frames with matched controls, and run a pre-registered study across seven models (five open-weight 3B-32B, two frontier), five CoT lengths, and 7,875 responses. The gate, fixed before any data, required a negative interaction with posterior probability above 0.95 and an accuracy drop of more than 6 points. It held. The high-regime drop is 17.3 points (95% CI [7.7, 25.5]); matched items with definite answers show no cost. The effect is regime-dependent: decisive in capable mid-to-large models, directional in the two frontier systems, absent-to-reversed in the weakest. The framework answers when CoT helps and unifies the Bayesian and fast-and-frugal traditions: less-is-more effects are evidence about the meta-uncertainty regime, not against Bayesian cognition.
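For readers unfamiliar with take-the-best, the sketch below shows the classic form of the heuristic: cues are inspected in descending order of validity, and the comparison stops at the first cue that discriminates between the two options. This is a generic illustration of the heuristic the abstract refers to, not the paper's free-energy policy; the function name, cue names, and example values are all hypothetical.

```python
def take_the_best(option_a, option_b, cues_by_validity):
    """Compare two options cue by cue and stop at the first cue that differs.

    option_a / option_b map cue name -> binary cue value (1 or 0).
    cues_by_validity lists cue names from most to least valid.
    Returns (choice, cues_inspected), where choice is 'a', 'b', or None.
    """
    for inspected, cue in enumerate(cues_by_validity, start=1):
        a, b = option_a[cue], option_b[cue]
        if a != b:
            # The first discriminating cue decides; later cues are ignored.
            return ("a" if a > b else "b"), inspected
    return None, len(cues_by_validity)  # no cue discriminates: caller must guess


# Hypothetical question: which of two cities is larger?
cues = ["is_capital", "has_airport", "has_university"]
city_a = {"is_capital": 0, "has_airport": 1, "has_university": 1}
city_b = {"is_capital": 0, "has_airport": 0, "has_university": 1}
print(take_the_best(city_a, city_b, cues))
```

The second return value makes the frugality explicit: the decision here uses two of the three cues, and the remaining evidence is never integrated.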


2606.15874 2026-06-16 cs.AI cs.SE 新提交

LLM-as-Code Agentic Programming for Agent Harness

LLM即代码：面向Agent框架的编程范式

Junjia Qi, Zichuan Fu, Jingtong Gao, Wenlin Zhang, Hanyu Yan, Xian Wu, Xiangyu Zhao

发表机构 * City University of Hong Kong（香港城市大学）； Tencent Jarvis Lab（腾讯贾维斯实验室）

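AI summary: Using the LLM as orchestrator leads to control-flow hallucination and unreliable execution. Agentic Programming instead lets the program govern all control flow and invokes the LLM only as a code component where reasoning or generation is needed, substantially improving the stability of long operation sequences.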

Comments Accepted at the KDD 2026 Workshop on Agentic Software Engineering (AgenticSE)


AI中文摘要

每个主要的LLM Agent框架都赋予LLM编排者的角色；模型决定下一步做什么、何时调用工具以及何时停止。我们认为，令牌爆炸、控制流幻觉和不可靠完成并非实现缺陷，而是将循环、分支和排序等确定性工作分配给概率系统的架构后果。更好的提示或更强的模型无法保证LLM Agent的可靠性。因此，我们提出Agentic Programming，其中程序控制所有流程，而LLM本身是其中的一部分，一个称为LLM-as-Code的自适应组件，仅在任务需要推理或生成时调用。在每个调用中，模型保持完全灵活性，但不能改变程序的执行路径。由于控制权在程序中，LLM的上下文由执行历史的调用树构建，形成有向无环图（DAG）。每个调用的上下文长度由其调用深度决定，而非随步骤累积。计算机使用Agent的案例研究表明，该设计不仅是理论立场，而且是实用的，显著提高了长视觉操作序列的稳定性。

英文摘要

Every major LLM agent framework gives the LLM the role of orchestrator; the model decides what to do next, when to call tools, and when to stop. We argue that token explosion, control-flow hallucination, and unreliable completion are not implementation bugs but architectural consequences of assigning the deterministic work of looping, branching, and sequencing to a probabilistic system. A better prompt or a stronger model cannot guarantee the reliability of the LLM agent. We therefore propose Agentic Programming, in which the program governs all control flow, and the LLM is itself part of it, an adaptive component we call LLM-as-Code and invoke only where a task calls for reasoning or generation. Within each call the model keeps full flexibility, but it cannot alter the program's execution path. With control in the program, the LLM's context is built from the execution history's call tree and forms a directed acyclic graph (DAG). Each call's context length is then determined by its call depth rather than by accumulation over steps. A case study of computer-use agents shows that the design is practical, not just a theoretical stance, substantially improving the stability of long visual operation sequences.
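The division of labor described in the abstract can be pictured with a small sketch in which ordinary code owns the loop, the bound, and the stop condition, and the model is called only to produce text. This is one reading of the idea under stated assumptions, not the authors' implementation: `summarize_files`, `fake_llm`, and the prompt wording are hypothetical, and the model is replaced by a stub so the example runs offline.

```python
from typing import Callable, List

LLM = Callable[[str], str]  # any function mapping a prompt to a reply


def summarize_files(files: List[str], llm: LLM, max_files: int = 3) -> List[str]:
    """The program decides what runs and when; the LLM only fills in text."""
    summaries = []
    for name in files[:max_files]:  # deterministic iteration and hard bound
        prompt = f"Summarize the purpose of {name} in five words."
        # Each call receives only its own prompt, not an accumulated transcript.
        summaries.append(llm(prompt))
    return summaries


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (assumption: no network access)."""
    return f"[reply to {len(prompt)} chars]"


print(summarize_files(["a.py", "b.py", "c.py", "d.py"], fake_llm))
```

Whatever the stub returns, the loop still runs exactly three times, which is the property the abstract attributes to keeping control flow in the program.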


2606.15872 2026-06-16 cs.CL 新提交

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

SciOrch: 学习编排专家大语言模型以解决前沿多模态科学推理任务

Jingru Guo, Xiangyuan Xue, Lian Zhang, Wanghan Xu, Siki Chen, Philip Torr, Wanli Ouyang, Lei Bai, Zhenfei Yin

发表机构 * Imperial College London（伦敦帝国学院）； The Chinese University of Hong Kong（香港中文大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Shanghai Jiao Tong University（上海交通大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； University of Oxford（牛津大学）； Shenzhen Loop Area Institute（深圳环湖研究所）

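AI summary: Proposes SciOrch, a framework that trains a lightweight 8B model to orchestrate multiple frontier LLMs, optimized with MCTS and GRPO-style training; it outperforms the strongest single model and the strongest multi-agent baseline on scientific reasoning tasks.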


AI中文摘要

前沿科学推理仍然是大语言模型（LLMs）面临的主要挑战，即使是最强大的商业系统也达不到专家级性能。对模型行为的深入分析揭示了单模型评估所隐藏的显著互补性：不同的前沿模型在不同类型的问题上表现出色，没有一个模型能全面覆盖。我们提出了SciOrch，一个训练轻量级8B模型来编排前沿LLMs进行科学推理的框架。编排器分解每个问题，通过API调用将子问题委托给选定的商业模型，并综合最终答案。训练这样的编排器比传统的智能体强化学习更难：每个动作都会触发一次API调用，这在金钱成本和延迟上都代价高昂，使得标准的在线回滚不可行。我们通过基于MCTS的方法解决了这个问题，生成了多样化的编排轨迹，提取了每个节点的单轮样本，并使用GRPO风格的训练优化编排器。在包含SGI-Reasoning和Scientists' First Exam的240个问题测试集上，SciOrch达到了56.66%的平均准确率，比最强的单个商业模型高出3.74%，比最强的多智能体基线高出3.33%。它还在SGI和SFE上都取得了最佳准确率，而API成本不到典型多智能体方法的一半。

英文摘要

Frontier scientific reasoning remains a major challenge for large language models (LLMs), where even the strongest commercial systems fall short of expert-level performance. A closer look at model behavior reveals substantial complementarity that single-model evaluation hides: different frontier models excel on different question types, and no single model captures the full picture. We present SciOrch, a framework that trains a lightweight 8B model to orchestrate frontier LLMs for scientific reasoning. The orchestrator decomposes each question, delegates sub-problems to selected commercial models through API calls, and synthesizes a final answer. Training such an orchestrator is fundamentally harder than conventional agentic RL: each action triggers an API call that is expensive in both dollar cost and latency, making standard online rollouts infeasible. We address this with an MCTS-based approach, producing diverse orchestration trajectories, extracting per-node single-turn samples, and optimizing the orchestrator with GRPO-style training. On a 240-question test set spanning SGI-Reasoning and Scientists' First Exam, SciOrch reaches 56.66% average accuracy, outperforming the strongest single commercial model by 3.74% and the strongest multi-agent baseline by 3.33%. It also attains the best accuracy on both SGI and SFE with less than half the API cost of typical multi-agent methods.


2606.15869 2026-06-16 cs.CV 新提交

Metis: A Generalizable and Efficient World-Action Model for Autonomous Driving and Urban Navigation

Metis: 一种用于自动驾驶和城市导航的通用高效世界-动作模型

Jingyu Li, Zhe Liu, Dongnan Hu, Junjie Wu, Zipei Ma, Wenxiao Wu, Chao Han, Zhihui Hao, Zhikang Liu, Kun Zhan, Jiankang Deng, Xiatian Zhu, Li Zhang

发表机构 * Fudan University（复旦大学）； Shanghai Innovation Institute（上海创新研究院）； The University of Hong Kong（香港大学）； Tongji University（同济大学）； Li Auto Inc.（理想汽车）； Huazhong University of Science and Technology（华中科技大学）； Imperial College London（伦敦帝国理工学院）； University of Surrey（萨里大学）

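AI summary: Proposes Metis, a framework that decouples video generation from action prediction using a Mixture-of-Transformers architecture and an asymmetric attention mask, giving efficient inference and generalization and state-of-the-art performance on several navigation benchmarks.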


AI中文摘要

世界-动作模型（WAMs）在自动驾驶和城市导航中展现出巨大潜力。基于视觉-语言-动作模型或视频生成模型的现有方法存在关键限制：（1）测试时因预测未来观测而导致高推理延迟，（2）视频与动作建模紧密耦合导致表示不匹配和泛化能力下降。为解决这两个问题，我们提出Metis，一种端到端WAM框架，将视频生成与动作预测解耦。具体而言，Metis采用混合专家（Mixture-of-Transformers）架构，包含专门用于视频生成和动作预测的专家，保留了每个任务的内在分布特性。为提高效率，我们引入非对称注意力掩码，使得两个专家能够联合训练，同时允许动作模型在推理时绕过显式视频生成。这种设计确保了训练-推理一致性，并在不牺牲规划性能的情况下显著降低计算成本。大量实验表明，Metis在NAVSIM navhard和navtest基准以及CityWalker导航基准上取得了最先进的性能，验证了其在多样化任务中的泛化能力和效率。真实机器人部署进一步证实了我们方法的实际可行性。

英文摘要

World action models (WAMs) have shown great promise for autonomous driving and urban navigation. Built upon Vision-Language-Action models or video generation models, existing approaches suffer from key limitations: (1) high inference latency due to future observation prediction at test time, and (2) tightly coupled video and action modeling leading to representational mismatch and degraded generalization. To address both issues, we propose Metis, an end-to-end WAM framework that decouples video generation and action prediction. Specifically, Metis employs a Mixture-of-Transformers architecture with dedicated experts for video generation and action prediction, preserving the intrinsic distributional properties of each task. To enhance efficiency, we introduce an asymmetric attention mask that enables joint training of both experts while allowing the action model to bypass explicit video generation during inference. This design ensures training-inference consistency and significantly reduces computational costs without compromising planning performance. Extensive experiments demonstrate state-of-the-art performance on the NAVSIM navhard and navtest benchmarks and the CityWalker navigation benchmark, validating both the generalizability and efficiency across diverse tasks. Real-robot deployments further confirm the practical feasibility of our approach.


2606.15868 2026-06-16 cs.LG 新提交

David vs. Goliath in Next Activity Prediction: Argmax vs. LSTM, Transformer, and LLM

下一活动预测中的大卫与歌利亚：Argmax 与 LSTM、Transformer 和 LLM

Hans Weytjens, Ingo Weber

发表机构 * Technical University of Munich（慕尼黑工业大学）； Fraunhofer Gesellschaft（弗劳恩霍夫协会）

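AI summary: A systematic benchmark compares a simple counting-based argmax baseline with LSTMs, Transformers, and LLMs for next activity prediction, and finds that the argmax baseline matches or approaches billion-parameter LLMs on most datasets.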

Comments Accepted for 24th International Conference on Business Process Management (2026) Forum


AI中文摘要

下一活动预测（NAP）是预测性流程监控（PPM）的基石，使组织能够从回顾性分析转向主动流程引导。PPM 领域已从经典机器学习发展到深度学习架构（如 LSTM 和 Transformer），再到大型语言模型（LLM）。尽管模型复杂性不断增加，但目前尚无基准在 NAP 的直接序列建模设置中联合比较 LLM、Transformer、LSTM 和简单基线。在本文中，我们通过系统基准测试填补了这一空白。我们在七个真实事件日志上比较了词汇适应型 LLM、从头训练的 Transformer、LLM 蒸馏 Transformer 和 LSTM 与基于计数的简单 argmax 基线。我们的结果讲述了一个大卫与歌利亚的故事：预训练相比从头训练没有带来一致的改进，模型大小对性能影响很小，并且在大多数数据集上，argmax 基线匹配或接近十亿参数 LLM 的性能。

英文摘要

Next activity prediction (NAP) is a cornerstone of predictive process monitoring (PPM), enabling organizations to move from retrospective analysis to proactive process steering. The PPM field has progressed from classical machine learning through deep learning architectures such as LSTMs and Transformers to large language models (LLMs). Despite growing model complexity, no benchmark jointly compares LLMs, Transformers, LSTMs, and simple baselines in a direct sequence modeling setting for NAP. In this paper, we fill this gap with a systematic benchmark. We compare vocabulary-adapted LLMs, Transformers trained from scratch, LLM-distilled Transformers, and LSTMs against a simple counting-based argmax baseline across seven real-life event logs. Our results tell a David vs. Goliath story: pretraining confers no consistent improvement over training from scratch, model size shows little effect on performance, and on most datasets the argmax baseline matches or approaches the performance of billion-parameter LLMs.
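A counting-based argmax baseline of the kind the abstract describes can be sketched in a few lines: count which activity follows each context in the training traces, then predict the most frequent successor. The abstract does not specify how much history the baseline conditions on, so the `context` parameter (default: the last activity only), the function names, and the toy event log below are assumptions.

```python
from collections import Counter, defaultdict


def fit_argmax(traces, context=1):
    """Count next activities for each context of up to `context` prior activities."""
    counts = defaultdict(Counter)
    for trace in traces:
        for i in range(1, len(trace)):
            key = tuple(trace[max(0, i - context):i])
            counts[key][trace[i]] += 1
    return counts


def predict_next(counts, prefix, context=1):
    """Return the most frequent successor of the prefix's context, or None if unseen."""
    key = tuple(prefix[-context:])
    if key not in counts:
        return None
    return counts[key].most_common(1)[0][0]


# Toy event log (invented): each trace is one case's ordered activities.
traces = [
    ["register", "check", "approve", "pay"],
    ["register", "check", "reject"],
    ["register", "check", "approve", "pay"],
]
model = fit_argmax(traces)
print(predict_next(model, ["register", "check"]))  # prints: approve
```

The model has no learned parameters beyond the count table, which is what makes it a useful floor for comparison with trained sequence models.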


2606.15867 2026-06-16 cs.CV 新提交

CogCanvas: A Benchmark for Evaluating Multi-Subject Reference-Based Image Generation

CogCanvas: 用于评估多主体参考图像生成的基准

Long-Bao Nguyen, Quang-Khai Tran, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science, Ho Chi Minh City, Vietnam（胡志明市理科大学）； University of Dayton, Ohio, United States（代顿大学）； Vietnam National University, Ho Chi Minh City, Vietnam（越南国家大学胡志明市分校）

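AI summary: Introduces the CogCanvas benchmark, with 1,952 reference images and 1,361 compositional prompts, to evaluate generation involving multiple identities, object binding, and background scenes. It adds the BG-Sim and Attr-VQA metrics and finds that existing models degrade severely beyond three subjects.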


AI中文摘要

多主体参考图像生成需要同时保留多个人的身份、绑定每个人的对象和时尚物品，并尊重指定的背景场景，当前扩散模型在此方面仍然脆弱。现有基准一次只评估一个方面，没有一个能联合捕捉多身份组合、人-物交互、背景基础和空间合理性。我们引入了CogCanvas，一个包含1952张精选参考图像的基准，涵盖100个名人身份、115个独特对象和时尚物品，以及29个真实世界背景场景（包括地标），从中我们构建了1361个组合提示，覆盖2-5人的群体规模。筛选流程结合了基于DINOv2的去重、两阶段美学过滤以及结构化交互和位置图的自动推导，作为真实监督。CogCanvas在统一的六轴评估协议下支持三个任务：基于参考的多人物-对象生成（主要）、文本到图像的组合生成和参考检索。我们引入了两个针对多参考设置量身定制的指标：BG-Sim，通过DINOv3特征相似性在SAM 3掩码区域上评分背景保真度；Attr-VQA，使用多模态大语言模型根据结构化图验证每个主体的属性绑定和人际交互。对五种最先进方法的基准测试表明，随着群体规模从2人增加到5人，每个模型都显著退化，在超过三个主体时对象/时尚物品绑定几乎完全失败。

英文摘要

Multi-subject reference-based image generation requires jointly preserving multiple human identities, binding per-person objects and fashion items, and respecting a specified background scene, a regime where current diffusion models remain brittle. Existing benchmarks evaluate only one axis at a time and none jointly captures multi-identity composition with human-object interaction, background grounding, and spatial plausibility. We introduce CogCanvas, a benchmark of 1,952 curated reference images spanning 100 celebrity identities, 115 distinctive objects and fashion items, and 29 real-world background scenes including landmarks, from which we construct 1,361 compositional prompts covering 2-5 person group sizes. The curation pipeline combines DINOv2-based deduplication, two-stage aesthetic filtering, and automated derivation of structured interaction and position graphs that serve as ground-truth supervision. CogCanvas supports three tasks, reference-based multi-human-object generation (primary), text-to-image compositional generation, and reference retrieval, under a unified six-axis evaluation protocol. We introduce two metrics tailored to the multi-reference setting: BG-Sim, which scores background fidelity on SAM 3-masked regions via DINOv3 feature similarity, and Attr-VQA, which uses a multimodal LLM to verify per-subject attribute binding and inter-person interactions against the structured graphs. Benchmarking five SOTA methods reveals that every model degrades substantially as group size grows from 2 to 5, with near-complete failure on object/fashion binding beyond three subjects.


2606.15866 2026-06-16 cs.AI cs.LG 新提交

STRIDE: Strategic Trajectory Reasoning via Discriminative Estimation for Verifiable Reinforcement Learning

STRIDE: 通过判别估计进行策略轨迹推理以实现可验证强化学习

Qinjian Zhao, Zhihao Dou, Dinggen Zhang, Xiangyu Li, Chaoda Song, Zhongwei Wan, Xinpeng Li, Yanyan Zhang, Kaijie Chen, Qingtao Pan, Chengcheng Feng, Zhiqiang Gao, Xiaoyu Xia

发表机构 * Kean University（基恩大学）； Case Western Reserve University（凯斯西储大学）； University of Texas at Austin（德克萨斯大学奥斯汀分校）； The Ohio State University（俄亥俄州立大学）； Tongji University（同济大学）； Duke Kunshan University（昆山杜克大学）； Royal Melbourne Institute of Technology（皇家墨尔本理工大学）

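AI summary: Proposes STRIDE, a framework that contrasts successful and failed trajectories to estimate the discriminative preference of n-gram strategic patterns, combines it with reasoning saliency entropy to identify key patterns, and uses them for fine-grained credit assignment that improves reasoning performance in RLVR.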


AI中文摘要

可验证奖励强化学习（RLVR）已成为提升大语言模型推理能力的有效后训练范式。然而，现有RLVR方法通常依赖最终答案正确性分配轨迹级奖励，提供稀疏监督，并统一处理所有token，不考虑它们对推理的实际贡献。尽管最近的研究引入了中间信号，如过程奖励、高熵token和语义不确定性，但这些信号通常本身不可验证，且可能无法区分有益策略模式与有害模式。为解决这一局限，我们提出STRIDE（通过判别估计进行策略轨迹推理），一种从可验证结果中推导策略推理监督的细粒度RLVR框架。STRIDE对比每个响应组内的成功和失败轨迹，以估计每个n-gram策略模式的结果判别偏好，并进一步将该信号与推理显著性熵结合，识别决策相关的策略模式。在RL优化过程中，这些模式被分配差异化的优势值，从而在保持RLVR可验证性的同时实现更精确的信用分配。大量实验表明，STRIDE在多种模型、任务和扩展设置（包括VLM和基于智能体的系统）中一致提升了推理性能。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has become an effective post-training paradigm for improving the reasoning abilities of large language models. However, existing RLVR methods typically rely on final-answer correctness to assign trajectory-level rewards, providing sparse supervision and treating all tokens uniformly regardless of their actual contribution to reasoning. Although recent studies introduce intermediate signals such as process rewards, high-entropy tokens, and semantic uncertainty, these signals are often not inherently verifiable and may fail to distinguish beneficial strategic patterns from harmful ones. To address this limitation, we propose STRIDE (Strategic Trajectory Reasoning with Discriminative Estimation), a fine-grained RLVR framework that derives strategic reasoning supervision from verifiable outcomes. STRIDE contrasts successful and failed trajectories within each response group to estimate the outcome-discriminative preference of each n-gram strategic pattern, and further combines this signal with reasoning saliency entropy to identify decision-relevant strategic patterns. These patterns are assigned differentiated advantage values during RL optimization, enabling more precise credit assignment while preserving the verifiability of RLVR. Extensive experiments demonstrate that STRIDE consistently improves reasoning performance across diverse models, tasks, and extended settings, including VLMs and agent-based systems.
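One simple way to picture the contrast between successful and failed trajectories is to score each n-gram by how much more often it appears in successful rollouts than in failed ones within a group. The scoring rule below (difference of containment rates) is an assumption chosen for illustration; the abstract does not give STRIDE's actual estimator, and the sketch omits the reasoning saliency entropy term entirely. All names and the toy rollouts are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Set of distinct n-grams in one trajectory."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def discriminative_scores(trajectories, successes, n=2):
    """Score = share of successful rollouts containing the n-gram
    minus share of failed rollouts containing it (range -1 to 1)."""
    win = [t for t, ok in zip(trajectories, successes) if ok]
    lose = [t for t, ok in zip(trajectories, successes) if not ok]
    if not win or not lose:
        return {}  # no contrast is available within this group
    win_counts = Counter(g for t in win for g in ngrams(t, n))
    lose_counts = Counter(g for t in lose for g in ngrams(t, n))
    grams = set(win_counts) | set(lose_counts)
    return {g: win_counts[g] / len(win) - lose_counts[g] / len(lose) for g in grams}


# Toy response group (invented tokens) with verifiable outcomes.
group = [
    ["let", "x", "=", "2", "check", "answer"],
    ["guess", "answer"],
    ["let", "x", "=", "3", "check", "answer"],
    ["guess", "then", "stop"],
]
outcomes = [True, False, True, False]
scores = discriminative_scores(group, outcomes)
print(scores[("check", "answer")], scores[("guess", "answer")])
```

Positive scores mark patterns associated with verified success and negative scores mark patterns associated with failure, so the supervision signal is derived only from final outcomes.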


2606.15861 2026-06-16 cs.CV 新提交

Object Tokens as a Bridge Between Segmentation and Visual Question Answering in Robotic Surgery

对象标记作为机器人手术中分割与视觉问答的桥梁

Yiping Li, Ronald de Jong, Romy van Jaarsveld, Franco Badaloni, Gino Kuiper, Jelle Ruurda, Josien Pluim, Marcel Breeuwer

发表机构 * Department of Biomedical Engineering, Eindhoven University of Technology（埃因霍温理工大学生物医学工程系）； Department of Electrical Engineering, Eindhoven University of Technology（埃因霍温理工大学电气工程系）； Department of Surgery, University Medical Center Utrecht（乌得勒支大学医学中心外科）

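AI summary: Proposes a unified framework that jointly performs pixel-level segmentation and visual question answering, with VLM-generated object tokens guiding both answer prediction and segmentation masks; it outperforms baselines on the RAMIE and EndoVis18 datasets.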


AI中文摘要

机器人手术中的视觉问答（VQA），称为手术VQA，需要对复杂手术场景进行高级理解，并将视觉感知与语言推理相结合，具有支持手术培训和术中决策的潜力。最近的视觉-语言模型（VLM）通过参数高效微调显示出有希望的性能；然而，大多数现有方法依赖于粗粒度的视觉定位，通常仅限于边界框，这未能捕捉手术对象的细粒度空间结构。在这项工作中，我们提出了一个统一框架，在单个框架内联合执行像素级分割和视觉问答。我们的方法将VLM与基于Segment Anything Model（SAM）的解码器集成，并将场景元素表示为VLM生成的对象标记。这些对象标记指导答案预测，并进一步投影到基于SAM的解码器以产生分割掩码。通过分割和问答目标优化对象标记嵌入，模型学习空间基础表示，增强视觉推理，同时提供显式的像素级基础。我们在私有RAMIE（机器人辅助微创食管切除术）数据集和公共EndoVis18数据集上评估了所提出的方法，在手术VQA中始终优于基线方法。这些结果表明，将上下文感知的对象标记纳入视觉-语言模型可改善细粒度手术场景理解。

英文摘要

Visual Question Answering (VQA) in robotic surgery, referred to as surgical VQA, requires high-level understanding of complex surgical scenes and the integration of visual perception with language reasoning, with the potential to support surgical training and intraoperative decision-making. Recent Vision-Language Models (VLMs) have shown promising performance through parameter-efficient fine-tuning; however, most existing approaches rely on coarse visual grounding, typically limited to bounding boxes, which fails to capture the fine-grained spatial structure of surgical objects. In this work, we propose a unified framework that jointly performs pixel-level segmentation and visual question answering. Our approach integrates a VLM with a Segment Anything Model (SAM)-based decoder and represents scene elements as object tokens generated by the VLM. These object tokens guide answer prediction and are further projected to the SAM-based decoder to produce segmentation masks. By optimizing the object token embeddings through both segmentation and question answering objectives, the model learns spatially grounded representations that enhance visual reasoning while providing explicit pixel-level grounding. We evaluate the proposed method on the private RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) dataset and the public EndoVis18 dataset, where it consistently outperforms baseline methods for surgical VQA. These results demonstrate that incorporating context-aware object tokens into vision-language models improves fine-grained surgical scene understanding.


2606.15857 2026-06-16 cs.CV 新提交

A Dual-Branch Collaborative Framework for Joint Optimization of Underwater Image Enhancement and Object Detection

用于水下图像增强与目标检测联合优化的双分支协作框架

Liyuan Cao, Zheng Liu, Guanghao Liao, Yonghui Yang, Qi Li

发表机构 * School of Electronic and Information Engineering, University of Science and Technology Liaoning（辽宁科技大学电子与信息工程学院）

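AI summary: Proposes a dual-branch underwater image enhancement framework whose detail enhancement and color restoration branches recover texture detail and correct color cast, balancing visual quality against detection performance and efficiency; it raises YOLOv8 mAP50 by 2.1% on the URPC dataset.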


AI中文摘要

由于波长依赖的光吸收和散射，水下图像通常存在颜色失真和细节模糊，这限制了水下目标检测的性能。现有的水下图像增强方法主要关注视觉质量提升，但仍难以平衡增强质量、处理效率和下游检测性能。因此，本文提出一种高效的双分支水下图像增强框架用于目标检测。细节增强分支通过提升亮度和局部对比度来恢复暗区域的纹理细节。颜色恢复分支使用自适应补偿来减少颜色失真并改善色彩层次。通过结合两个分支的互补输出，所提框架为目标检测提供更清晰、信息更丰富的图像。在UIEB和EUVP数据集上，所提方法分别达到2.249和2.576的UIQM分数。当应用于URPC数据集上的YOLOv8检测任务时，与基线相比，所提方法将mAP50提升了2.1%。大量实验表明，我们的方法在复杂水下场景中改善了目标检测，同时平衡了增强质量和处理效率。

英文摘要

Due to wavelength-dependent light absorption and scattering, underwater images usually suffer from color distortion and blurred details, which limits underwater object detection performance. Existing underwater image enhancement methods mainly focus on visual quality improvement, and it remains difficult to balance enhancement quality, processing efficiency, and downstream detection performance. Therefore, this paper proposes an efficient dual-branch underwater image enhancement framework for object detection. The detail enhancement branch improves brightness and local contrast to recover texture details in dark regions. The color restoration branch uses adaptive compensation to reduce color distortion and improve color gradation. By combining the complementary outputs of the two branches, the proposed framework provides clearer and more informative images for object detection. On the UIEB and EUVP datasets, the proposed method achieves UIQM scores of 2.249 and 2.576. When applied to the YOLOv8 detection task on the URPC dataset, the proposed method improves mAP50 by 2.1% compared with the baseline. Extensive experiments show that our method improves object detection in complex underwater scenes, while balancing enhancement quality and processing efficiency.


2606.15848 2026-06-16 cs.CV 新提交

EmoZone-Talker: Regional Semantic Control of Audio-Driven 3DGS Talking Heads via Facial Action Units

EmoZone-Talker: 基于面部动作单元的音频驱动3DGS说话人头部的区域语义控制

Tingting Chen, Shaojun Wang, Huaye Zhang, Diqiong Jiang, Chenglizhao Chen

发表机构 * China University of Petroleum (East China)（中国石油大学（华东））

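AI summary: Proposes EmoZone-Talker, a framework that resolves conflicts between audio and expression signals through regional decoupling and temporal modeling, enabling fine-grained, interpretable control of facial expressions.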


AI中文摘要

3D高斯泼溅（3DGS）在高保真说话头部合成方面显示出巨大潜力。然而，由于语音驱动的面部动态与显式表情信号之间的内在冲突，实现细粒度、可解释且可编辑的面部表情控制仍然具有根本性挑战。现有方法依赖隐式多模态融合，导致空间纠缠和时间不稳定性。我们提出EmoZone-Talker，一种新颖的框架，将音频驱动的面部动画重新表述为跨模态冲突下的结构化时空协调问题。我们的方法引入了面部运动的显式空间解缠和时序动态建模。具体来说，我们提出了具有优先注意力偏好的协同区域（SZ-PAB），通过解剖先验引导的区域约束显式解耦模态贡献，以及通道独立的时间AU编码器（CIT-AE）来建模时间连贯的AU动态。通过将这些表示集成到3D高斯变形中，EmoZone-Talker实现了对面部表情的精确和可解释控制。大量实验表明，我们的方法提高了表情可控性和真实感，在上脸准确性和时间连贯性方面取得了显著提升，同时保持了高渲染质量和准确的唇形同步。代码将公开发布以促进可重复性和进一步研究。

英文摘要

3D Gaussian Splatting (3DGS) has shown strong potential for high-fidelity talking head synthesis. However, enabling fine-grained, interpretable, and editable facial expression control remains fundamentally challenging due to intrinsic conflicts between speech-driven facial dynamics and explicit expression signals. Existing methods rely on implicit multimodal fusion, leading to spatial entanglement and temporal instability. We present EmoZone-Talker, a novel framework that reformulates audio-driven facial animation as a structured spatial-temporal coordination problem under cross-modal conflicts. Our approach introduces explicit spatial disentanglement and temporal dynamics modeling of facial motion. Specifically, we propose Synergy Zones with Prioritized Attention Bias (SZ-PAB) to explicitly decouple modality contributions via region-wise constraints guided by anatomical priors, and a Channel-Independent Temporal AU Encoder (CIT-AE) to model temporally coherent AU dynamics. By integrating these representations into 3D Gaussian deformation, EmoZone-Talker enables precise and interpretable control over facial expressions. Extensive experiments demonstrate that our method improves expression controllability and realism, with notable gains in upper-face accuracy and temporal coherence, while preserving high rendering quality and accurate lip synchronization. Code will be publicly released to facilitate reproducibility and further research.


2606.15846 2026-06-16 cs.RO 新提交

FlashNav: Ultra-Fast Policy Training for Robot Navigation within 20 Seconds

FlashNav：20秒内实现超快速机器人导航策略训练

Shanze Wang, Yiwei Qian, Xinming Zhang, Jun Xue, Siwei Cheng, Xianghui Wang, Qingyuan Hu, Xiaoyu Shen, Wei Zhang

发表机构 * Eastern Institute of Technology, Ningbo（宁波东方理工大学）； The Hong Kong Polytechnic University（香港理工大学）； National University of Singapore（新加坡国立大学）； University of Science and Technology of China（中国科学技术大学）； Shanghai Jiao Tong University（上海交通大学）

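AI summary: Proposes FlashNav, a framework that uses GPU acceleration and simulation aligned with the navigation MDP to train deployable navigation policies in under 20 seconds, validated on TurtleBot2 and Unitree Go2.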

Comments 15 pages, 4 figures


AI中文摘要

深度强化学习在机器人导航中展现出强大潜力，但其实际部署仍受限于策略训练的长时钟成本。本文提出FlashNav，一个用于超快速基于距离的机器人导航训练的GPU优先框架。据我们所知，FlashNav是首个达到秒级策略训练的基于DRL的机器人导航框架，最快可部署策略在不到20秒内训练完成。关键思想是将仿真与导航MDP对齐：FlashNav保留了速度级导航的必要组件，包括占据几何、距离感知、目标条件控制、机器人运动动力学、碰撞处理、终止和重置，同时从训练循环中移除不必要的渲染和高保真物理细节。基于批量位图仿真器和我们的FastDSAC学习器构建的全GPU驻留训练流水线，FlashNav完全在GPU上生成大规模并行导航转移。在TurtleBot2和Unitree Go2上的实验表明，FlashNav在RTX 5090上20秒内达到100%成功率，并在桌面GPU上保持在几十秒内。学习到的策略进一步迁移到静态和动态室内场景中的物理轮式和腿式机器人，证明基于DRL的导航可以在秒级速度下训练，同时保持可部署的避障行为。

Abstract

Deep reinforcement learning has shown strong potential for robot navigation, but its practical deployment is still limited by the long wall-clock cost of policy training. This paper presents FlashNav, a GPU-first framework for ultra-fast range-based robot navigation training. To the best of our knowledge, FlashNav is the first DRL-based robot navigation framework that reaches seconds-level policy training, with the fastest deployable policy trained in less than 20 seconds. The key idea is to align simulation with the navigation MDP: FlashNav preserves the essential components for velocity-level navigation, including occupancy geometry, range sensing, goal-conditioned control, robot motion dynamics, collision handling, termination, and reset, while removing unnecessary rendering and high-fidelity physical details from the training loop. Built on a batched bitmap simulator and a fully GPU-resident training pipeline with our FastDSAC learner, FlashNav generates massively parallel navigation transitions entirely on the GPU. Experiments on TurtleBot2 and Unitree Go2 show that FlashNav reaches a 100% success rate in under 20 seconds on an RTX 5090 and remains within tens of seconds across desktop GPUs. The learned policies further transfer to physical wheeled and legged robots in static and dynamic indoor scenes, demonstrating that DRL-based navigation can be trained at seconds-level speed while preserving deployable obstacle-avoidance behavior.
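To make "range sensing on an occupancy bitmap" concrete, here is a minimal CPU-only sketch. It is not FlashNav's simulator: the abstract does not give implementation details, so the grid, the fixed-step ray marching, and the names `cast_ray` and `batched_scan` are all illustrative assumptions. A GPU version would presumably evaluate all poses and beams in parallel rather than in Python loops.

```python
import math

# Toy occupancy bitmap (assumption): 1 = occupied, 0 = free, a 5x5 walled room.
GRID = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
]

def cast_ray(grid, x, y, angle, max_range, step=0.25):
    """March along one beam until it hits an occupied cell or leaves the map."""
    dx, dy = math.cos(angle), math.sin(angle)
    dist = 0.0
    while dist < max_range:
        col = math.floor(x + dx * dist)
        row = math.floor(y + dy * dist)
        outside = row < 0 or row >= len(grid) or col < 0 or col >= len(grid[0])
        if outside or grid[row][col]:
            return dist
        dist += step
    return max_range

def batched_scan(grid, poses, n_beams, max_range):
    """Return one list of beam ranges per (x, y, heading) pose in the batch."""
    scans = []
    for x, y, heading in poses:
        angles = [heading + 2.0 * math.pi * i / n_beams for i in range(n_beams)]
        scans.append([cast_ray(grid, x, y, a, max_range) for a in angles])
    return scans

# Two hypothetical robot poses scanned against the same map.
poses = [(2.5, 2.5, 0.0), (1.5, 1.5, 0.0)]
scans = batched_scan(GRID, poses, n_beams=4, max_range=10.0)
```

The sketch covers only occupancy geometry and range sensing; dynamics, collision handling, termination, and reset are left out.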

2606.15841 2026-06-16 cs.AI New submission

Heteroskedastic Signals in Budgeted LLM Verification: Structural Heterogeneity Limits Optimization Gains

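Jinlong Yang

Affiliations: Northwestern Polytechnical University

AI summary: Finds that LLM uncertainty signals are heteroskedastic in budgeted verification, which distorts global allocation. A cost-stratified thresholding intervention (CST) raises hit rate by up to 17 percentage points in strongly heterogeneous settings, indicating that structural heterogeneity is the primary bottleneck.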

Abstract

Large language model (LLM) systems increasingly use uncertainty signals to allocate limited computation across verification, test-time scaling, tool execution, and other selective-compute decisions. Such policies rely on a global signal comparability assumption: equal scores should carry comparable decision value across inputs. Using budgeted verification as a controlled diagnostic setting, we identify a failure mode of this assumption: uncertainty quality is heteroskedastic across cost strata, with some regions exhibiting near-random discriminability despite concentrating many errors. Under an explicit local model, we characterize the resulting distortion of global allocation and show that its upper bound scales with cross-stratum signal-quality dispersion. We separate weak signals, optimization instability, and structural heterogeneity through a controlled intervention hierarchy: Threshold, MP-Adapt, MP-Strat, and a deliberately simple cost-stratified thresholding intervention (CST). Across MBPP and MATH using Qwen3-8B, LLaMA3-8B, and GPT-4o-mini, global online adaptation yields inconsistent gains over static thresholding; MP-Strat partially recovers performance, while CST improves hit rate by up to 17 percentage points in strongly heterogeneous settings without gradient updates. These results identify structural heterogeneity, rather than optimizer weakness alone, as the primary bottleneck in the observed settings. More broadly, misaligned feedback structure cannot always be repaired by stronger optimization.
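The contrast between a single global threshold and per-stratum thresholding can be sketched on toy data. This is a hedged illustration, not the paper's CST procedure: the abstract does not specify how budget is split or how thresholds are set, so the proportional split, the function names, and the eight hand-made items below are all assumptions chosen so that one stratum has high but uninformative scores.

```python
from collections import defaultdict

def global_topk(items, budget):
    """Verify the `budget` highest-uncertainty items, ignoring strata."""
    ranked = sorted(items, key=lambda it: it["score"], reverse=True)
    return ranked[:budget]

def stratified_topk(items, budget):
    """Split the budget across strata by size, then take the top items per stratum."""
    groups = defaultdict(list)
    for it in items:
        groups[it["stratum"]].append(it)
    chosen = []
    for stratum in sorted(groups):
        members = sorted(groups[stratum], key=lambda it: it["score"], reverse=True)
        k = (budget * len(members)) // len(items)  # floor; leftover budget is unused here
        chosen.extend(members[:k])
    return chosen

def hit_rate(chosen):
    """Fraction of verified items that were actual errors."""
    return sum(it["is_error"] for it in chosen) / len(chosen) if chosen else 0.0

# Toy data (assumption): scores are informative in "easy" but not in "hard",
# and "hard" scores are uniformly higher, so they dominate a global ranking.
items = [
    {"stratum": "easy", "score": 0.1, "is_error": False},
    {"stratum": "easy", "score": 0.2, "is_error": False},
    {"stratum": "easy", "score": 0.3, "is_error": True},
    {"stratum": "easy", "score": 0.4, "is_error": True},
    {"stratum": "hard", "score": 0.6, "is_error": True},
    {"stratum": "hard", "score": 0.7, "is_error": False},
    {"stratum": "hard", "score": 0.8, "is_error": True},
    {"stratum": "hard", "score": 0.9, "is_error": False},
]

global_pick = global_topk(items, budget=4)
strat_pick = stratified_topk(items, budget=4)
```

On this toy set the global ranking spends the whole budget in the stratum where scores carry no signal, while the stratified split preserves the informative ranking inside the other stratum. Neither function involves gradient updates.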

2606.15837 2026-06-16 cs.CV cs.LG stat.ME stat.ML New submission

Learning a Sampling-Free Variational DNN Plugin from Tiny Training Sets to Refine OOD Segmentation With Uncertainty Estimation

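Jimut B. Pal, Suyash P. Awate

Affiliations: Centre for Machine Intelligence and Data Science (C-MInDS), Indian Institute of Technology (IIT) Bombay; Computer Science and Engineering (CSE) Department, Indian Institute of Technology (IIT) Bombay

AI summary: Proposes VarDeepPCA, a lightweight variational DNN framework that learns a distribution of valid anatomical geometries from small in-distribution datasets, without target-domain data or extensive pre-training. A reinterpretation of the softmax mapping enables sampling-free inference with uncertainty estimates, and across 4 clinical applications the method significantly improves the anatomical plausibility and accuracy of OOD segmentation.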

Comments Accepted at the Journal of Machine Learning for Biomedical Imaging

Abstract

Deep neural networks (DNNs) frequently fail to generalize to out-of-distribution (OOD) medical images because of variations in scanners and acquisition protocols. Retraining DNN models to address these distribution shifts is often impractical due to the high cost of acquiring and annotating new medical datasets. To address this, we introduce VarDeepPCA, a novel lightweight variational DNN framework designed to restore/refine degraded segmentation maps by leveraging intrinsic geometric priors. Unlike existing approaches that require target-domain data or extensive pre-training, our VarDeepPCA explicitly learns a distribution of valid anatomical geometries using only small in-distribution (ID) datasets. Theoretically, our novel variational learning framework leverages a reinterpretation of the softmax mapping to implicitly perform exact distribution modeling, thereby enabling computationally efficient, sampling-free learning and inference. This also enables VarDeepPCA to provide uncertainty estimates associated with its restored segmentation maps. We empirically validate our framework across 4 distinct clinical applications, using 14 publicly available datasets, involving segmentation of the myocardium, neuroretinal rim, prostate, and fetal head. Comparisons against 15 existing methods demonstrate that VarDeepPCA consistently restores segmentation maps produced by the existing methods on OOD data to (i) significantly improve anatomical plausibility of geometries and clinical utility of the segmentations, and (ii) significantly reduce errors, without needing any more training data than that used by existing methods.

2606.15835 2026-06-16 cs.LG cs.AI New submission

Wasserstein Convergence of ODE-Based Samplers in Decentralized Diffusion Model via Velocity Field Decomposition

Chencheng Tang, Xuanyu Xue, Fangyikang Wang, Chao Zhang, Hubery Yin

Affiliations: Peking University; Shanghai Jiao Tong University; MBZUAI; Zhejiang University; Tencent

AI summary: For ODE sampling with random expert switching in decentralized diffusion models, establishes convergence guarantees in Wasserstein-2 distance via velocity field decomposition, proving that an N-step discretization converges at rate O(N^{-1/2} + ε).

Comments 50 pages, 9 figures. Preprint under review

On-Policy Distillation with Curriculum Turn-level Guidance for Multi-turn Agents

Interactor: Agentic RL oriented Iterative Creation for Ad Description Generation in Sponsored Search

Calibrated Triage, Not Autonomy: Confidence Estimation for Medical Vision-Language Models

GeoTLM: Geometry-aware Tactile-Language Models for Contact Motion Orientation Reasoning of Dynamic Objects

VL2Spike: Spike-driven Distillation from VLMs for Low-Power Visual Perception in Embodied AI

Topological Flow Matching

LoComposition: Terrain-Adaptive Energy-Efficient Quadruped Locomotion without Gait Priors

BALTO: Balanced Token-Level Policy Optimization for Hallucination Mitigation

Scalar-pathway fidelity improves physical accuracy in short-range equivariant interatomic potentials

UrbanWell: Benchmarking Multimodal Large Language Models for Spatio-Temporal Urban Wellbeing Analytics

SiGnature: Explicit Motion Diffusion for Stylized Semantic Gesture

NVMOS: Non-Verbal Vocalization Quality Assessment in Speech

Intelligence Is Not the Bottleneck: Validating an LLM First-Pass Manuscript Score Against Peer-Review Outcomes

Text region detection in historical astronomical diagrams

Neuron Level Analysis of Large Language Model in Legal Domain Reasoning

Deep Residual Injection for Full-Spectrum Forensic Signal Perception in Multimodal Large Language Models

Free Energy Heuristics: Fast-And-Frugal Cognition as Active Inference Under Uncertain Precision

LLM-as-Code Agentic Programming for Agent Harness

SciOrch: Learning to Orchestrate Expert LLMs for Solving Frontier Multimodal Scientific Reasoning Tasks

Metis: A Generalizable and Efficient World-Action Model for Autonomous Driving and Urban Navigation

David vs. Goliath in Next Activity Prediction: Argmax vs. LSTM, Transformer, and LLM

CogCanvas: A Benchmark for Evaluating Multi-Subject Reference-Based Image Generation

STRIDE: Strategic Trajectory Reasoning via Discriminative Estimation for Verifiable Reinforcement Learning

Object Tokens as a Bridge Between Segmentation and Visual Question Answering in Robotic Surgery

A Dual-Branch Collaborative Framework for Joint Optimization of Underwater Image Enhancement and Object Detection

EmoZone-Talker: Regional Semantic Control of Audio-Driven 3DGS Talking Heads via Facial Action Units

FlashNav: Ultra-Fast Policy Training for Robot Navigation within 20 Seconds

Heteroskedastic Signals in Budgeted LLM Verification: Structural Heterogeneity Limits Optimization Gains

Learning a Sampling-Free Variational DNN Plugin from Tiny Training Sets to Refine OOD Segmentation With Uncertainty Estimation

Wasserstein Convergence of ODE-Based Samplers in Decentralized Diffusion Model via Velocity Field Decomposition