arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.10876 2026-06-10 cs.CV New submission

Advancing Wood Identification in the Philippines: Utilizing the Xylorix Platform for Efficient AI Model Development and Deployment for Five Key Species

Rosalie C. Mendoza, Vivian C. Daracan, Arlene D. Romano, Ronniel D. Manalo, Xin Jie Tang, Yi Hong Wong, Yong Haur Tay

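Affiliations: College of Forestry and Natural Resources, University of the Philippines Los Banos; Agritix

AI summary: Using the Xylorix platform, wood scientists without programming experience developed and deployed macroscopic wood identification AI models for five Philippine hardwoods. AUC ranged from 0.969 to 1.000 and four species reached grade AA, showing that non-programmers can build reliable models suitable for field deployment.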

Abstract

Illegal logging and timber trade continue to pose significant challenges in the Philippines, where accurate wood species identification is essential for enforcement but limited by the need for specialised equipment and expertise. This study evaluates whether wood scientists without programming expertise can develop and deploy AI models for macroscopic wood identification using the Xylorix platform, focusing on five Philippine hardwood species: Mangium (Acacia mangium Willd.), Rain Tree [Samanea saman (Jacq.) Merr.], Banuyo (Wallaceodendron celebicum Koord.), Tindalo [Afzelia rhomboidea (Blanco) Vidal], and Ipil [Intsia bijuga (Colebr.) O. Kuntze]. Binary classifiers were trained on 10,663 verified cross-section images from 260 specimens and evaluated using specimen-level mean scoring to mirror operational field conditions. Area Under the ROC Curve (AUC) values ranged from 0.969 (Ipil) to 1.000 (Mangium), and Average Precision (AP) values ranged from 0.589 (Rain Tree) to 1.000 (Mangium). Four of five species achieved AA grade (AUC and AP both ≥ 0.90); Rain Tree received AE (AUC ≥ 0.90, AP < 0.60) due to AP compression from its small positive test set (3 specimens). All five classifiers rank their target specimens above non-target specimens with near-perfect fidelity. Specimen-level error analysis revealed 9 false negatives for Ipil, primarily stemming from localized image artifacts, as well as 3 false positives for Rain Tree and 1 false positive for Tindalo caused by shared tribal-level anatomical traits. These findings demonstrate that non-programmers can leverage the Xylorix platform to construct operationally reliable wood identification models suitable for field deployment at supply chain checkpoints.
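
The specimen-level evaluation described in the abstract above can be illustrated with a small sketch. This is not the Xylorix implementation: the function names and sample data are hypothetical, and AUC and AP are computed from their textbook definitions. It shows how per-image scores are averaged per specimen before ranking.

```python
from collections import defaultdict

def specimen_scores(image_scores):
    """Average per-image classifier scores within each specimen.

    image_scores: iterable of (specimen_id, score) pairs.
    """
    buckets = defaultdict(list)
    for specimen_id, score in image_scores:
        buckets[specimen_id].append(score)
    return {sid: sum(v) / len(v) for sid, v in buckets.items()}

def roc_auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def average_precision(scores, labels):
    """Mean of the precision values at the rank of each positive (descending score)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / hits

# Hypothetical per-image scores for three specimens (not data from the paper).
images = [("A", 0.9), ("A", 0.7), ("B", 0.2), ("B", 0.4), ("C", 0.6), ("C", 0.5)]
is_target = {"A": 1, "B": 0, "C": 0}
means = specimen_scores(images)
ids = sorted(means)
print(roc_auc([means[i] for i in ids], [is_target[i] for i in ids]))
```

With very few positives, AP can fall well below AUC: a single positive ranked second among eleven specimens gives an AUC of 0.9 but an AP of only 0.5, which is the kind of AP compression the abstract attributes to Rain Tree's three positive test specimens.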

2606.10875 2026-06-10 cs.CL New submission

Pushing the Limits of LLM Tool Calling via Experiential Knowledge Integration and Activation

Yupu Hao, Zhuoran Jin, Huanxuan Liao, Kang Liu, Jun Zhao

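Affiliations: The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary: Studies how acquiring, activating, and internalizing experiential knowledge improves multi-step LLM tool calling, and proposes KATE, a knowledge-augmented tool execution framework that combines reasoning-width-expanded inference with knowledge-aware training. KATE substantially outperforms baselines on BFCL-V3 and AppWorld.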

Abstract

Large language models (LLMs) rely on tool use to act as autonomous agents, yet often fail in multi-step execution due to insufficient tool-related knowledge and ineffective knowledge activation. Therefore, we present a systematic study on how knowledge influences tool-use performance, covering the stages of knowledge acquisition, activation, and internalization. In the knowledge acquisition stage, we acquire and evaluate various forms of experiential knowledge, and our analysis shows that simple instance-level knowledge can already provide strong and reliable gains, while abstract intent-level knowledge offers limited benefits. At inference time, to activate knowledge, we find that prompting the LLM to expand the depth of reasoning yields diminishing returns, whereas expanding the width of reasoning by parallel sampling with aggregation more effectively activates latent experiential knowledge. At training time, for knowledge internalization, post-training with knowledge-augmented data further improves performance, with reinforcement learning outperforming supervised fine-tuning. Based on these insights, we propose Knowledge-Augmented Tool Execution (KATE), a framework that integrates experiential knowledge with reasoning-width-expanded inference and knowledge-aware training. Experiments on BFCL-V3 and AppWorld demonstrate consistent and substantial improvements over strong baselines across model scales. Our code is available at https://github.com/hypasd-art/KATE.
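
As a rough illustration of "expanding the width of reasoning by parallel sampling with aggregation", the sketch below draws several candidate tool calls in parallel and keeps the most frequent one. The abstract does not specify the aggregation rule, so majority voting is an assumption here, and `sample_fn` and `fake_sampler` are hypothetical stand-ins for an LLM call, not part of KATE.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def width_expanded_call(sample_fn, prompt, width=5):
    """Draw `width` candidate tool calls in parallel and return the most common one."""
    with ThreadPoolExecutor(max_workers=width) as pool:
        candidates = list(pool.map(lambda i: sample_fn(prompt, i), range(width)))
    winner, votes = Counter(candidates).most_common(1)[0]
    return winner, votes

def fake_sampler(prompt, seed):
    """Hypothetical stand-in for one sampled LLM completion (deterministic per seed)."""
    return "get_weather(city='Oslo')" if seed % 3 else "get_time()"

call, votes = width_expanded_call(fake_sampler, "What is the weather in Oslo?")
print(call, votes)
```

Candidates are compared as exact strings here; a real system would more likely normalize the tool name and arguments before voting.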

2606.10857 2026-06-10 cs.RO cs.LG New submission

Embodiment-conditioned Generalist Control for Multirotor Aerial Robots

Orestis Konstantaropoulos, Welf Rehberg, Mihir Kulkarni, Kostas Alexis

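Affiliations: Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU), Trondheim, Norway

AI summary: Proposes a generalist position control policy that controls arbitrary multirotor configurations with a single set of network weights by conditioning on a physics-grounded embodiment descriptor (a mass- and inertia-normalized control allocation matrix). The policy is trained with PPO in five minutes and transfers zero-shot to the real world.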

Abstract

We present a generalist position control policy capable of controlling arbitrary multirotor configurations of a certain rotor count (e.g., hexarotors or quadrotors) with a single set of network weights. The policy is conditioned on a physics-grounded embodiment descriptor: a mass- and inertia-normalized control allocation matrix that captures how mass-normalized motor thrusts generate linear and angular accelerations in the body frame. To train the policy, we sample from a broad distribution of arbitrary multirotor configurations, including non-planar and asymmetric systems, and optimize a single, compact network using Proximal Policy Optimization. Training requires only five minutes on an RTX 3090 GPU using a custom NVIDIA Warp-based dynamics simulator. Through extensive simulation experiments, we show that embodiment conditioning enables robust generalist control across arbitrary morphologies. We demonstrate zero-shot real-world transfer of this generalist policy on three diverse hexarotor systems, including a planar robot, a partially symmetric non-planar system, and a random asymmetric, non-planar configuration.
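
One plausible reading of the embodiment descriptor is sketched below: a 6 x n matrix whose columns map each rotor's mass-normalized thrust to body-frame linear and angular acceleration. The paper's exact definition may differ. The diagonal inertia, the torque-to-thrust coefficient `k_torque`, and the neglect of gyroscopic terms are all simplifying assumptions made for this example.

```python
def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def allocation_matrix(mass, inertia_diag, rotors, k_torque=0.01):
    """Return a 6 x n matrix mapping mass-normalized thrusts to body-frame accelerations.

    rotors: list of (position, thrust_direction, spin) with spin in {+1, -1}.
    Rows 0-2 give linear acceleration, rows 3-5 angular acceleration.
    """
    columns = []
    for pos, direction, spin in rotors:
        arm = cross(pos, direction)  # torque from the thrust lever arm
        torque = [arm[i] + spin * k_torque * direction[i] for i in range(3)]
        # Thrust = mass * normalized thrust, so angular acceleration scales by mass / inertia.
        ang = [mass * torque[i] / inertia_diag[i] for i in range(3)]
        columns.append(list(direction) + ang)
    return [[col[row] for col in columns] for row in range(6)]

# Hypothetical symmetric quadrotor with 0.2 m arms and all rotors thrusting along body z.
L = 0.2
quad = [((L, 0, 0), (0, 0, 1), 1), ((-L, 0, 0), (0, 0, 1), 1),
        ((0, L, 0), (0, 0, 1), -1), ((0, -L, 0), (0, 0, 1), -1)]
A = allocation_matrix(1.0, (0.01, 0.01, 0.02), quad)
```

For this symmetric layout, equal thrust on all four rotors produces pure vertical acceleration: the angular rows of `A` sum to zero across the columns.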

2606.10856 2026-06-10 cs.RO New submission

An Exposure-Time-Aligned Primary-Path Architecture for Autonomous-Driving ECUs

Toru Saito, Yuki Hagura, Tatsuya Konishi, Satoru Mizusawa, Takumi Yajima

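Affiliations: National Institute of Advanced Industrial Science and Technology, Japan

AI summary: To support production vehicles transitioning from modular multi-NN pipelines to end-to-end autonomous driving, proposes three design principles (Primary-Path, Exposure-Time-Aligned, and Co-Path Coexistence) and achieves a mean latency of 296 ms on a dual-SoC platform.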

Abstract

While end-to-end (E2E) autonomous driving has become the dominant research direction, production vehicles continue to rely on modular multi-NN pipelines for a non-trivial transitional period. The subject of this paper is the design of an architecture that, during this phase, supports a modular pipeline and an E2E path side by side and embeds a path for staged migration. Transplanted to a production SoC, egalitarian late fusion is compute-inefficient and offers no natural unit for staged E2E substitution. As an alternative, we propose three design principles: (i) Primary-Path, which explicitly selects a primary perception chain and prioritizes its enclosure within a single SoC pair over the non-critical paths; (ii) Exposure-Time-Aligned, which propagates the primary sensor's exposure time $\tau_{\mathrm{exp}}$ as a tag along the chain and event-drives the fusion node on matched $\tau_{\mathrm{exp}}$ rather than a fixed cycle; and (iii) Co-Path Coexistence, which, building on (i) and (ii), lets an E2E output path co-run with the modular pipeline within the same $\tau_{\mathrm{exp}}$ cycle. On a Dual-SoC production AD-ECU, the implementation closes camera-shutter to planner-output latency at a mean of 296 ms within the 350 ms design budget. Under (iii), the modular pipeline is primary at production launch and the E2E path runs as shadow on real vehicles, and the E2E scope is expanded as evaluation evidence accumulates.
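
The Exposure-Time-Aligned principle, event-driving the fusion node on matched exposure-time tags rather than on a fixed cycle, can be sketched as a small buffer keyed by the tag. This is a toy model, not the paper's ECU implementation: the class name, the input names, and the integer-microsecond tags are assumptions.

```python
from collections import defaultdict

class ExposureAlignedFusion:
    """Fire a fusion callback once every required input has arrived for the same tag."""

    def __init__(self, required_inputs, on_fuse):
        self.required = set(required_inputs)
        self.on_fuse = on_fuse
        self.pending = defaultdict(dict)  # tau_exp tag -> {input name: payload}

    def push(self, name, tau_exp, payload):
        """Store one upstream result; fuse as soon as the tag's input set is complete."""
        slot = self.pending[tau_exp]
        slot[name] = payload
        if self.required.issubset(slot):
            self.on_fuse(tau_exp, self.pending.pop(tau_exp))

fused = []
node = ExposureAlignedFusion(
    ["camera_objects", "lidar_objects"],  # hypothetical input names
    lambda tag, inputs: fused.append((tag, sorted(inputs))),
)
node.push("camera_objects", 100_000, "c0")
node.push("camera_objects", 133_333, "c1")  # next frame arrives before lidar catches up
node.push("lidar_objects", 100_000, "l0")   # completes the 100_000 tag and triggers fusion
```

Fusion latency in this sketch follows the slowest input for a given exposure rather than a timer period. A production version would also need to evict tags whose inputs never complete, which is omitted here.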

2606.10852 2026-06-10 cs.CL cs.AI New submission

Janus: A Benchmark for Goal-Conditioned Information Distortion in LLMs

Polydoros Giannouris, Mohsinul Kabir, Sophia Ananiadou

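Affiliations: The University of Manchester; Archimedes/Athena RC

AI summary: Proposes the JANUS benchmark, which compares neutral and goal-directed conditions over a fixed fact pool to measure selective distortion in fact-grounded LLM outputs, and finds that models lack robust safeguards against misleading communication.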

Abstract

LLM deception is often evaluated through direct markers such as fabricated claims, explicit lies, or strategic concealment. However, many real-world misleading communications do not depend on false statements; rather, they arise from selective treatment of true material facts: omitting adverse evidence, softening unfavorable details, emphasizing favorable details, or replacing precise qualifications with vague language. Existing benchmarks largely miss this subtler and arguably more dangerous failure mode. We introduce JANUS, a benchmark for measuring goal-conditioned pragmatic distortion in fact-grounded LLM outputs. Each scenario in our benchmark provides a fixed pool of favorable and adverse facts and compares a neutral condition against a goal-directed condition, such as increasing adoption, enrollment, approval, or support, despite potential harm to directly affected individuals or groups. Because all outputs are constrained to use the same fact pool, JANUS isolates misleading net impressions from hallucination and fabrication. JANUS contains 160 scenarios across 8 domains, with each scenario paired with neutral and goal-conditioned prompts and annotated material facts. Extensive experiments across 12 LLMs reveal consistent goal-conditioned distortions, demonstrating that current models remain sensitive to incentive and framing objectives and lack robust safeguards against selectively misleading communication. We publicly release our corpus and code for future research.

2606.10842 2026-06-10 cs.CL cs.IR New submission

ConvMemory v2: A Recall-Preserving Top-10 Evidence Reranker for Conversational Memory Retrieval

Taiheng Pan

Affiliations: School of Computing and Information Systems, University of Melbourne

AI summary: Proposes ConvMemory v2, a lightweight reranker that improves MRR and H@1 with a fine-tuned cross-encoder while preserving v1's Recall@10, and analyzes the mechanism behind the gain.

Comments: 19 pages, 3 figures. Single-author technical report. Extends arXiv:2605.28062 (ConvMemory v1). Code and checkpoint: github.com/pth2002/ConvMemory

Abstract

We describe ConvMemory v2, an opt-in token-evidence reranker that sits after the lightweight ConvMemory v1 reranker and reorders only v1's protected top-10 candidate set. v2 is a fine-tuned ms-marco-MiniLM-L-6-v2 cross-encoder (22,713,601 parameters, measured from the released checkpoint) applied to the ten (query, memory) pairs that v1 has already selected; it does not change which ten memories are returned, so Recall@10 and Hit@10 are identical to v1 by construction, not by statistical coincidence. On the LoCoMo conversational memory benchmark (5 seeds, n = 4955 test rows), v2 raises FULL MRR from v1's 0.5824 to 0.6560 (paired bootstrap +0.0734, 95% CI [+0.0645, +0.0827]) and H@1 from 0.4440 to 0.5474. v2 closes most but not all of the gap to a much more expensive full-pool cross-encoder reference (mxbai-rerank-large-v1 over the top-500, MRR 0.6688): on FULL MRR v2 sits 0.013 below mxbai_top500, but on two raw-dense-hard slices (where v1's protected top-10 has higher recall than mxbai's own top-10) v2 exceeds mxbai_top500. A four-arm load-bearing ablation shows candidate-specific memory text is the mechanism: removing, shuffling, or replacing it collapses MRR below raw dense retrieval. v2 is best understood as a standard recall-preserving cascade pattern with LoCoMo-specific fine-tuning, an explicit anti-shortcut inference contract, and disciplined load-bearing analysis; its advantage over mxbai is slice-specific rather than a general dominance claim. This report extends the v1 technical report (arXiv:2605.28062).
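
The "identical by construction" claim follows directly from reordering only the top-10 set. The sketch below shows this with a toy ranking. It uses hypothetical names and a placeholder scoring function, not the ConvMemory code or its cross-encoder.

```python
def rerank_top_k(ranked_ids, rerank_score, k=10):
    """Reorder only the first k candidates by rerank_score; the tail is untouched."""
    head = sorted(ranked_ids[:k], key=rerank_score, reverse=True)
    return head + ranked_ids[k:]

def reciprocal_rank(ranked_ids, relevant):
    """1 / rank of the first relevant item, or 0.0 if none is present."""
    for rank, item in enumerate(ranked_ids, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def hit_at_k(ranked_ids, relevant, k=10):
    """True if any relevant item appears in the first k positions."""
    return any(item in relevant for item in ranked_ids[:k])

# Toy first-stage ranking of 15 memories; the relevant one sits at rank 8.
ranked = ["m%d" % i for i in range(15)]
relevant = {"m7"}
reranked = rerank_top_k(ranked, lambda m: 1.0 if m == "m7" else 0.0)
```

Because the set of the first ten items is unchanged, any set-based metric at cutoff 10 is the same before and after, while order-sensitive metrics such as MRR and H@1 can move. In this toy case the reciprocal rank goes from 0.125 to 1.0.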

2606.10839 2026-06-10 cs.CV New submission

HarmoView: Harmonizing Multi-View Constraints for Identity-Consistent Video Generation

Cong Wang, Zhentao Yu, Hongmei Wang, Weicong Liang, Zixiang Zhou, Zilin Yang, Jiarong Ou, Rui Chen, Yuan Zhou, Qinglin Lu

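Affiliations: Tencent Hunyuan

AI summary: Proposes the HarmoView framework, which combines architectural refinements (multi-level feature injection, learnable proxy tokens, and Jump-RoPE) with a progressive view curriculum to preserve appearance fidelity in identity-consistent video generation under large viewpoint changes, reaching state-of-the-art performance on a multi-view benchmark.

Comments: Project Page: https://conallwang.github.io/HarmoView_Pages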

Abstract

Current identity-consistent video generation methods struggle to preserve appearance fidelity under large viewpoint changes. While introducing multi-view reference input offers a natural solution, progress remains constrained by the lack of effective frameworks for multi-view inputs and the scarcity of multi-view data. We address these challenges by proposing HarmoView, a robust framework for identity-consistent video generation that effectively integrates multi-view cues through three architectural refinements complemented by a staged training curriculum. Specifically, we first introduce Multi-level Feature Injection (MFI) to anchor identity fidelity; by injecting raw ViT features from frontal references alongside text tokens via cross-attention, MFI provides persistent low-level appearance anchors that complement the high-level identity features within DiT blocks, leading to enhanced identity preservation. Then, we employ learnable proxy tokens to unify heterogeneous reference layouts across single-/multi-view settings while simultaneously resolving the reference-view mismatch problem. Jump-RoPE is further developed for identity-wise feature isolation to reduce identity crosstalk. To activate these structural capabilities while preserving the original generative priors, we propose the Progressive View Curriculum. This four-stage training strategy employs view dropout to facilitate a stable transition from vanilla T2V generation to high-fidelity, identity-persistent spatial reasoning. Furthermore, we construct a large-scale multi-view dataset to address the issue of data scarcity. Extensive evaluation on our multi-view benchmark, comprising 100 manually curated cases spanning 52 unique identities, demonstrates that HarmoView significantly outperforms open-source baselines and matches leading closed-source engines, achieving state-of-the-art performance in identity-consistent video generation.

2606.10835 2026-06-10 cs.LG cs.AI New submission

Geometrically Averaged Hard Target Updates for Linear Q-Learning

Donghwan Lee

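Affiliations: School of Electrical Engineering, KAIST

AI summary: Proposes a periodic target update with λ-geometric weighted averaging for linear Q-learning, analyzes its stability with a switching-system model, and connects the one-period update to projected Q-value iteration.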

Abstract

Periodic hard target updates are among the most common stabilization devices in modern deep Q-learning. Recent studies suggest that target updates can improve stability in Q-learning with function approximation, including linear function approximation. We introduce and analyze the so-called $\lambda$-target update, obtained by averaging the $m$-periodic target update maps with $\lambda$-geometric weights $(1-\lambda)\lambda^{m-1}$, $\lambda \in [0,1]$. The endpoint $\lambda=0$ recovers the one-period target update, while the continuous endpoint $\lambda \uparrow 1$ recovers projected Q-value iteration. We study this mechanism for Q-learning with linear function approximation, namely linear Q-learning, using a switching-system model and related tools. For clarity, the paper treats a deterministic version; the formulation extends to stochastic reinforcement-learning settings.
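
A short worked form of the averaging in the abstract above may help. The symbol $T_m$ for the $m$-periodic target update map is notation assumed for this note, and the paper may write the averaged map differently. The point is that the geometric weights form a probability distribution over the period $m$.

```latex
\[
T_\lambda \;=\; (1-\lambda)\sum_{m=1}^{\infty} \lambda^{m-1}\, T_m ,
\qquad
(1-\lambda)\sum_{m=1}^{\infty} \lambda^{m-1}
\;=\; (1-\lambda)\cdot\frac{1}{1-\lambda} \;=\; 1
\quad (0 \le \lambda < 1).
\]
\[
\text{Mean period under these weights:}\qquad
\sum_{m=1}^{\infty} m\,(1-\lambda)\,\lambda^{m-1} \;=\; \frac{1}{1-\lambda}.
\]
```

At $\lambda = 0$ all weight sits on $m = 1$, so $T_0 = T_1$, the one-period update. As $\lambda \uparrow 1$ the mean period $1/(1-\lambda)$ grows without bound, which is consistent with the abstract's statement that this limit recovers projected Q-value iteration.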

2606.10833 2026-06-10 cs.AI New submission

Do VLMs Reason Like Engineers? A Benchmark and a Stage-wise Evaluation

Syed Wasiq, Syed Mohamad Tawseeq, Yashwant Pravinrao Bangde, Debaditya Roy

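Affiliations: Indian Institute of Technology Kharagpur

AI summary: Proposes EngVQA, an engineering visual question answering benchmark, and an 8-stage automatic evaluation framework; reveals substantial limitations of current vision-language models in engineering reasoning and shows that the automated evaluation agrees closely with human grading.

Comments: 9 pages (main text), 4 figures, 2 tables; 50 pages total including appendix. The first two authors contributed equally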

Abstract

Vision-Language Models (VLMs) demonstrate strong performance on general multimodal reasoning benchmarks, yet their ability to perform engineering reasoning remains largely unexplored. Unlike general visual question answering, engineering problem solving requires interpreting technical diagrams, selecting governing physical principles, and maintaining physically consistent multi-step reasoning. These capabilities are increasingly important for AI systems used in engineering education, scientific assistance, and technical decision-making, where reasoning failures may produce physically invalid yet superficially plausible solutions. Existing benchmarks primarily evaluate final answers and provide limited assessment of intermediate reasoning processes. We introduce EngVQA, a multimodal benchmark of 696 problems for evaluating engineering reasoning across 5 engineering subjects. We also propose an 8-stage automatic evaluation framework for assessing VLM-generated solutions. The framework independently evaluates each stage of the solution, enabling fine-grained analysis of reasoning failures. We benchmark multiple state-of-the-art open- and closed-source VLMs with our evaluation framework and demonstrate substantial limitations in current engineering reasoning capabilities. Human evaluation shows strong agreement with our automated framework, achieving a Pearson correlation of 0.975 and a mean absolute error of 0.67 on a 10-point grading scale. Our results highlight the importance of process-oriented evaluation for reliable assessment of multimodal engineering reasoning systems.
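
The two agreement statistics quoted in the abstract above, Pearson correlation and mean absolute error, can be computed as in the sketch below. The grade lists are made-up illustrative values on a 10-point scale, not data from EngVQA.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def mean_absolute_error(xs, ys):
    """Average absolute difference between paired grades."""
    return sum(abs(x - y) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical human and automated grades for four solutions.
human = [2, 5, 7, 9]
auto = [3, 5, 8, 9]
print(pearson(human, auto), mean_absolute_error(human, auto))
```

The two numbers answer different questions: correlation checks whether the automated grader orders solutions like a human does, while MAE checks how far individual grades are off in absolute points.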

2606.10832 2026-06-10 cs.RO New submission

GUIDE: Goal-Initialized Directional Understanding for End-to-End Visual Navigation

Liang Wang, Jin Jin, KanZhong Yao, YiBin Wu, Fangqiang Ding, Jin Wang, Jun Wu, Zhe Sun, Qiuguo Zhu

Affiliations: Institute of Cyber-Systems and Control, Zhejiang University; Institute of Artificial Intelligence (TeleAI), China Telecom; Oxford Robotics Institute, University of Oxford; Center for Robotics, University of Bonn; Department of Mechanical Engineering, Massachusetts Institute of Technology

AI summary: Proposes the GUIDE framework, in which a spatial anchor predictor extracts egomotion representations from multi-frequency proprioceptive history and raw depth streams capture local geometry, enabling end-to-end quadruped navigation without subsequent goal updates.

Comments: https://guide-navigation.github.io/

Abstract

Learning-based visual navigation for legged robots typically relies on continuous goal updates from hierarchical state estimation to provide a persistent directional reference. This reliance incurs additional sensory and computational overhead and deviates from fully end-to-end mobile autonomy. Furthermore, under partial observability, policies are prone to learn myopic behaviors, easily becoming trapped in dead ends and complex structural layouts. To address these limitations, we investigate a goal-initialized navigation setting, where the target is provided only once at the beginning of an episode, requiring the robot to operate based on intrinsic spatial memory without subsequent goal updates from external modules. In this work, we propose GUIDE, a fully end-to-end reinforcement learning framework designed to cultivate internal directional awareness. Specifically, GUIDE incorporates a spatial anchor predictor that leverages multi-frequency proprioceptive history to extract egomotion representations, thereby maintaining a persistent long-horizon spatial context for navigation. Concurrently, it utilizes raw depth streams to perceive local environmental geometry. We evaluate the proposed framework across both simulation and real-world scenarios on a quadruped robot. Experiments show that GUIDE learns reliable egomotion and directional awareness, enabling a fully end-to-end deployed policy to safely navigate through dense clutter and structured mazes without subsequent goal guidance or prior maps.

2606.10829 2026-06-10 cs.CL cs.AI New submission

Attention-Discounted Adaptive Sampler for Masked Diffusion Language Models

Yusuf Sahin, Ahmed Rockey Saikia, Volkan Cevher, Paolo Favaro

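Affiliations: University of Bern; EPFL

AI summary: To address unsafe joint commits caused by interactions among candidate tokens in parallel decoding of masked diffusion language models, proposes ADAS, a training-free reranking rule that improves subset construction with a soft attention-discounted penalty and raises low-NFE performance on multiple benchmarks.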

Abstract

Masked diffusion language models can reduce inference steps by revealing multiple tokens per denoising iteration, but this parallelism is fragile: positions that are individually confident may be unsafe to commit together when their predictions are coupled. Existing training-free samplers such as Top-\(k\), Fast-dLLM, and EB-Sampler mainly control how many tokens to reveal, while often ranking candidates by token-wise scores that ignore interactions within the selected set. We propose ADAS, a training-free reranking rule for parallel masked diffusion decoding. ADAS leaves the base sampler's stopping rule unchanged and modifies only subset construction: it greedily discounts a candidate when it attends strongly to already selected positions whose predictions remain uncertain. Unlike graph-constrained methods that turn attention into hard compatibility constraints, ADAS keeps attention continuous and uses it as a soft marginal penalty. Across LLaDA-8B-Base and Dream-7B-Base on GSM8K, MATH500, HumanEval, and MBPP, plugging ADAS into Top-\(k\), Fast-dLLM, and EB-Sampler improves low-NFE performance at matched denoiser evaluations by \(9.11\) and \(10.46\) percentage points on average, respectively, with \(3.1\%\) per-forward runtime overhead. These results show that soft attention-discounted reranking is a simple and modular way to improve quality in highly parallel decoding for masked diffusion language models.
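
The greedy, attention-discounted subset construction can be sketched as follows. The abstract does not give the exact penalty, so the linear form used here (confidence minus `beta` times attention-weighted uncertainty of already selected positions) is an assumption, as are all names. The number of tokens to reveal, `budget`, stands in for whatever the base sampler's stopping rule decides.

```python
def adas_select(confidence, attention, uncertainty, budget, beta=1.0):
    """Greedily pick `budget` masked positions to reveal.

    confidence[i]: token-wise score of position i from the base sampler.
    attention[i][j]: how strongly position i attends to position j.
    uncertainty[j]: e.g. 1 - confidence[j] for the prediction at position j.
    """
    selected = []
    remaining = set(range(len(confidence)))
    while remaining and len(selected) < budget:
        def discounted(i):
            # Soft penalty: attending to selected-but-uncertain positions lowers the score.
            penalty = sum(attention[i][j] * uncertainty[j] for j in selected)
            return confidence[i] - beta * penalty
        best = max(sorted(remaining), key=discounted)
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy case: position 1 is confident on its own but attends strongly to position 0.
confidence = [0.70, 0.68, 0.60]
uncertainty = [0.30, 0.32, 0.40]
attention = [[0.0, 0.0, 0.0], [0.8, 0.0, 0.0], [0.0, 0.0, 0.0]]
picked = adas_select(confidence, attention, uncertainty, budget=2)
baseline = adas_select(confidence, attention, uncertainty, budget=2, beta=0.0)
```

With `beta=0.0` the rule reduces to plain top-k by confidence and picks positions 0 and 1. With the discount, the coupled position 1 is passed over in favour of position 2, while the number of revealed tokens stays the same.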

2606.10825 2026-06-10 cs.LG New submission

MODIP: Efficient Model-Based Optimization for Diffusion Policies

Zakariae El Asri, Philippe Gratias-Quiquandon, Nicolas Thome, Olivier Sigaud

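Affiliations: Sorbonne Université, CNRS, ISIR, F-75005 Paris, France; Institut Universitaire de France (IUF)

AI summary: Proposes the MODIP framework, which uses a world model and model predictive control to generate high-quality trajectories and fine-tunes diffusion policies on them in a supervised manner, enabling offline-to-online fine-tuning that improves on behavioral cloning baselines on D4RL and RoboMimic tasks.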

Abstract

Diffusion policies (DPs) have emerged as expressive policy representations for robot learning, often used with imitation learning methods such as behavioral cloning (BC). However, while their success has largely been confined to BC, direct reinforcement learning (RL) fine-tuning remains challenging because actions are generated through a multi-step denoising process. In this work, we propose MODIP, a framework for the offline-to-online fine-tuning of DPs. Rather than directly applying RL to the DPs, MODIP leverages a world model (WM) to guide policy adaptation and keeps the simplicity and stability of BC. We utilize model predictive control (MPC) to generate high-quality trajectories within the WM, and use them as supervised targets for fine-tuning the DP. To make MPC planning efficient, MODIP uses a terminal state value instead of a policy-dependent state-action value, reducing inference time. Additionally, MODIP trains critics with policy-independent TD targets, reducing training time. Experiments on D4RL (MuJoCo, Kitchen) and RoboMimic tasks show that MODIP improves diffusion policies beyond BC, and is competitive with or outperforms diffusion policy RL fine-tuning methods and strong model-based baselines such as TD-MPC2.
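
Planning with a terminal state value, as opposed to a policy-dependent state-action value, can be sketched on a toy problem. Each candidate action sequence is scored by its discounted model rewards plus the discounted value of the final state, so no policy has to be queried during planning. Exhaustive enumeration replaces whatever sampling-based planner MODIP actually uses, and the 1-D world model, reward, and value function are all hypothetical.

```python
from itertools import product

def plan(state, step, reward, value, actions, horizon=3, gamma=0.99):
    """Return the first action and score of the best action sequence under the model."""
    best_score, best_seq = float("-inf"), None
    for seq in product(actions, repeat=horizon):
        s, score = state, 0.0
        for t, a in enumerate(seq):
            score += (gamma ** t) * reward(s, a)
            s = step(s, a)  # roll the world model forward
        score += (gamma ** horizon) * value(s)  # terminal state value, no policy needed
        if score > best_score:
            best_score, best_seq = score, seq
    return best_seq[0], best_score

# Toy 1-D task: move toward position 5, paying a small cost per unit of action.
action, score = plan(
    state=0,
    step=lambda s, a: s + a,
    reward=lambda s, a: -0.1 * abs(a),
    value=lambda s: -abs(s - 5),
    actions=(-1, 0, 1),
    horizon=2,
)
```

In MODIP's setting, the planned trajectories would then serve as supervised targets for the diffusion policy. That fine-tuning step is not shown here.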

2606.10819 2026-06-10 cs.CV cs.AI New submission

Earth-OneVision: Extending Remote Sensing Multimodal Large Language Models to More Sensor Modalities and Tasks

Miaoxin Cai, Guanqun Wang, Wei Zhang, Guangyao Zhou, Yin Zhuang, Tong Zhang, Hao Wang, He Chen, Jun Li

Affiliations: National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing (SBIIP), Beijing Institute of Technology; Aerospace Information Research Institute, Chinese Academy of Sciences; Key Laboratory of Technology in Geo-Spatial Information Processing and Application System, Chinese Academy of Sciences; Advanced Research Institute of Multidisciplinary Sciences, Beijing Institute of Technology; School of Mechatronical Engineering, Beijing Institute of Technology; School of Earth and Space Sciences, Peking University; School of Electronics, Peking University; School of Computer Science and Hubei Key Laboratory of Intelligent Geo-Information Processing

AI summary: Proposes Earth-OneVision, a 2B-parameter RS-MLLM that unifies six sensor modalities and nine task categories through full-granularity vision-language alignment, spatial-linguistic isomorphic serialization, and progressive cross-modality adaptation, matching or outperforming 4B-72B models on multiple benchmarks.

Abstract

RS-MLLMs enable natural-language understanding and spatial reasoning over Earth observation imagery. However, existing models support only a narrow range of sensor types and tasks, yielding a fragmented view of the Earth and leaving cross-modal geoscientific knowledge largely unexploited. This work presents Earth-OneVision, a 2B RS-MLLM that unifies six sensor modalities (i.e., optical, SAR, infrared, multispectral, temporal, and video) and cross-sensor fusion across 9 task categories within a single autoregressive framework. Three dedicated mechanisms address three bottlenecks. Full-Granularity Vision-Language Alignment (FGVLA) aligns multi-level visual features with the multi-dimensional language space. Spatial-Linguistic Isomorphic Serialization (SLIS) unifies heterogeneous spatial outputs as autoregressive tokens. Progressive Cross-Modality Adaptation (PCMA) decomposes the compound domain gap into sequential stages, tackling the viewpoint and imaging physics gaps in turn. To support joint training, MMRS-OneVision is constructed with ~34M QA pairs spanning all six sensor modalities and cross-sensor fusion across 9 task categories, substantially exceeding existing RS multimodal instruction datasets. With only 2B parameters, Earth-OneVision achieves competitive or state-of-the-art results across extensive benchmarks, consistently matching or outperforming 4B-72B RS-MLLMs. It achieves 87.52% P@0.5 on the OPT-RSVG test set for optical visual grounding and 80.68% on the SAR VQA benchmark SARLANG-Bench, exceeding 7B models by over 7%. It further achieves 75.74% recall on the BigEarthNet-MS test set for multispectral classification, and 81.94% MCQ accuracy on EarthMind-Bench for cross-modality reasoning.
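
For readers unfamiliar with the visual grounding metric, P@0.5 is commonly computed as the fraction of predicted boxes whose intersection over union (IoU) with the ground-truth box is at least 0.5. The sketch below assumes that convention and axis-aligned `(x1, y1, x2, y2)` boxes. The benchmark's own evaluation script may differ, for example by using a strict inequality or rotated boxes.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def precision_at_iou(preds, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the paired target meets the threshold."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(preds, targets))
    return hits / len(targets)

# Two hypothetical predictions: one exact match, one barely overlapping.
preds = [(0, 0, 10, 10), (0, 0, 10, 10)]
targets = [(0, 0, 10, 10), (8, 8, 20, 20)]
score = precision_at_iou(preds, targets)
```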

2606.10818 2026-06-10 cs.RO cs.CV 新提交

IMPACT: Learning Internal-Model Predictive Control for Forceful Robotic Manipulation

IMPACT:面向强力机器人操控的内部模型预测控制学习

Jiawei Gao, Chaoqi Liu, Peilin Wu, Haonan Chen, Yilun Du

发表机构 * Harvard University(哈佛大学) Stanford University(斯坦福大学)

AI总结 提出IMPACT框架,将强力操控任务解耦为任务规划和基于内部模型的预测控制,通过仿真和实验证明其在成功率、泛化性、安全性和能效上的优势。

Comments Project website: https://gao-jiawei.com/IMPACT/

AI中文摘要

现实世界中的机器人操控任务通常涉及与环境的有力交互,例如使用不同重量的工具、运输不同质量的物体以及执行接触密集任务(如擦桌子)。先前基于学习的方法通常采用模仿学习策略,输出由低级阻抗控制器跟踪的目标末端执行器姿态。在这些系统中,有力交互要么通过稳态跟踪误差隐式实现,要么使用腕部力/扭矩或触觉传感器显式命令。然而,隐式方法在不同物体重量下泛化能力差,而显式方法需要专用硬件并增加系统复杂性。在这项工作中,我们提出了IMPACT,一个将这些有力任务解耦为任务规划和基于内部模型的预测控制的框架。广泛的仿真和真实世界实验表明,所提出的框架实现了更高的成功率、对未见物体重量的更好泛化性,以及更好的安全性和能效。

英文摘要

Real-world robotic manipulation tasks often involve forceful interactions with the environment, such as using tools of varying weights, transporting objects with different masses, and performing contact-rich tasks like table wiping. Previous learning-based approaches typically employ imitation learning policies that output target end-effector poses tracked by low-level impedance controllers. In these systems, forceful interactions are either implicitly realized through steady-state tracking errors or explicitly commanded using wrist force/torque or tactile sensors. However, implicit approaches generalize poorly across object weights, while explicit approaches require specialized hardware and increase system complexity. In this work, we propose IMPACT, a framework that decouples these forceful tasks into task-planning and internal-model-based predictive control. Extensive simulation and real-world experiments demonstrate that the proposed framework achieves higher success rates and improved generalization to unseen object weights, as well as better safety and energy efficiency.
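To see why forces realized implicitly through steady-state tracking error depend on object weight, consider a simplified one-dimensional stiffness controller that applies a force `K * (x_target - x)`. Holding a mass `m` against gravity requires `K * e = m * g`, so the steady-state error is `e = m * g / K`. The sketch below works through that arithmetic; the 1-D model, the function names, and the numbers are illustrative assumptions and do not describe the IMPACT controller.

```python
G = 9.81  # gravitational acceleration in m/s^2


def steady_state_offset(mass_kg, stiffness_n_per_m):
    """Tracking error (m) a pure stiffness controller needs to hold a mass."""
    return mass_kg * G / stiffness_n_per_m


def target_height_for(desired_height_m, mass_kg, stiffness_n_per_m):
    """Target the policy must command so the object settles at the desired height."""
    return desired_height_m + steady_state_offset(mass_kg, stiffness_n_per_m)


if __name__ == "__main__":
    # The required offset scales linearly with mass, so an offset learned
    # for one object weight is wrong for a heavier or lighter object.
    for mass in (0.5, 1.0, 2.0):
        print(mass, steady_state_offset(mass, stiffness_n_per_m=500.0))
```

Because the required offset grows linearly with mass, a policy that memorized the offset for one weight will under- or over-lift objects of another weight, which is consistent with the generalization problem the abstract attributes to implicit approaches.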

2606.10811 2026-06-10 cs.CV 新提交

Deep learning for echo sounder data

深度学习用于回声测深仪数据

Ketil Malde

发表机构 * Ketil Malde

AI总结 本文探讨深度学习在声学数据(如回声图)中的应用,指出由于声学数据特性,需开发专用方法而非简单复用图像处理模型,并强调缺乏标准数据集和格式是主要障碍。

AI中文摘要

毫无疑问,在过去十年中,机器学习领域的技术已经彻底改变了我们处理和解释数据的方式,尤其是图像和文本。对于水下观测,声学是主要的信息来源,自然地,深度学习方法已被应用于回声图和其他声学数据,但迄今为止成果相当有限。在此,我们认为,由于声学数据的固有特性,重大进展可能需要研究超越简单复用图像处理模型和技术的深度学习方法。目前,方法开发的突破潜力受到缺乏标准数据格式和组织方式的阻碍,更甚的是缺乏具有既定性能目标的现成高质量数据集。为了推动该领域的发展,这些不足应得到纠正。

英文摘要

There is no doubt that over the last decade, techniques from the field of machine learning have revolutionized how we process and interpret data, especially images and text. For underwater observations, acoustics is a primary source of information, and naturally, deep learning methods have been applied to echograms and other acoustics data, but so far with rather modest results. Here, we argue that due to intrinsic properties of acoustic data, substantial advances will likely require research into deep learning methods beyond mere recycling of models and techniques from image processing. Currently, the potential for breakthroughs in method development is hindered by the lack of standard data formats and organization, and even more by the lack of readily available, high quality data sets with established performance goals. To advance the field, these shortcomings should be remedied.

2606.10808 2026-06-10 cs.RO 新提交

Bridging Semantics and Physical Execution: A Neuro-Symbolic Framework for Multi-Pair Robotic Assembly

桥接语义与物理执行:面向多对机器人装配的神经符号框架

Xinyi Li, Aiguo Song, Linhu Wei, Huijun Li

发表机构 * School of Instrument Science and Engineering, Southeast University(东南大学仪器科学与工程学院)

AI总结 提出一种端到端神经符号框架,通过分层生成最优子图、解耦通用性与边缘情况、协调全局序列,解决非结构化环境中多对装配的空间干扰和接触不确定性,在100个真实场景中达到97%全局可执行性,UR3机械臂部署成功率90%。

Comments Corresponding author: Aiguo Song (a.g.song@seu.edu.cn)

AI中文摘要

非结构化环境中的多对机器人装配面临空间干扰和接触不确定性。现有范式无法桥接认知决策与物理执行,要么遭遇状态空间爆炸和知识瓶颈,要么遭受逻辑幻觉和拓扑冲突。我们提出一种端到端神经符号框架,分层解决该挑战:为每对生成最优子图,将通用性与边缘情况解耦,然后解决跨对干扰。给定眼在手RGB-D装配场景,框架提取语义实例身份和状态,同时量化场景以计算散度。对于每对,通过LLM使用基本动作生成最优子图以减轻幻觉。边缘情况的支撑动作通过轻量级判别器推理并插入。由量化基线与当前场景之间的散度驱动,该框架易于以低成本扩展。增强的子图在拓扑上协调为全局序列,同时保持内部行为一致性。嵌入原子技能的动态行为树闭环力感知执行循环。在100个真实场景上的离线评估达到97.00%的全局可执行性,优于经典和最新规划器。在UR3机械臂上的真实机器人部署在强干扰下达到90%的成功率,公差0.5毫米,展示了复杂自主装配的统一且可验证解决方案。

英文摘要

Multi-pair robotic assembly in unstructured environments faces spatial interference and contact uncertainties. Existing paradigms fail to bridge cognitive decision-making and physical execution, as they either encounter state-space explosion and knowledge bottlenecks or suffer from logical hallucinations and topological conflicts. We propose an end-to-end neuro-symbolic framework that solves the challenge hierarchically: generating optimal subgraphs for each pair, decoupling generality from edge cases, and then resolving cross-pair interferences. Given an eye-on-hand RGB-D assembly scene, the framework extracts semantic instance identity and state while quantifying the scene for divergence calculation. For each pair, an optimal subgraph is generated by an LLM using only basic actions to mitigate hallucinations. Supportive actions for edge cases are reasoned and inserted with a lightweight discriminator. Driven by the divergence between the quantified baseline and current scene, the framework is easily extensible at low cost. Augmented subgraphs are topologically coordinated into global sequences while preserving internal behavioral coherence. Dynamic behavior trees embedding atomic skills close the force-aware execution loop. Offline evaluation on 100 real-world scenes achieves 97.00% global executability, outperforming classical and state-of-the-art planners. Real-robot deployment on a UR3 arm attains a 90% success rate with 0.5 mm tolerance under strong interference, demonstrating a unified and verifiable solution for complex autonomous assembly.

2606.10803 2026-06-10 cs.CL cs.AI cs.CV 新提交

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

超越API:探索多模态大语言模型在物理工具使用中的极限

Zhixin Ma, Yutong Zhou, Yongqi Li, Chong-Wah Ngo, Wenjie Li

发表机构 * Singapore Management University(新加坡管理大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出PhysTool-Bench基准,评估多模态大语言模型在真实场景中识别物理工具并规划使用的能力,发现最强模型仅完成21%任务,揭示感知与规划双重缺陷。

AI中文摘要

多模态大语言模型(MLLMs)在利用数字API方面表现出色,并日益成为具身AI的“大脑”,指导机器人与物理世界交互。在这种具身环境中,核心能力之一是使用物理工具,这支撑着MLLMs在现实任务中协助人类的能力。尽管重要性显著,MLLMs在物理工具使用方面的熟练程度仍基本未被探索。为填补这一空白,我们引入了PhysTool-Bench,这是首个评估MLLMs理解真实场景、识别物理工具并规划其使用能力的物理工具使用基准。PhysTool-Bench包含2,510个查询,覆盖2,678个真实世界物理工具,涉及制造、电气工程、农业和医疗等多个领域。具体而言,模型沿两个主要维度进行评估:1)识别场景中所有存在的物理工具,2)根据指令和视觉上下文规划工具选择和使用顺序。在13个领先的MLLMs中,即使最强的模型(Gemini-3.1-Pro)也只能识别场景中58.7%的工具,并仅完成21.0%的端到端查询。我们的分析揭示了两个层面的缺陷:MLLMs难以在真实场景中感知工具,而规划阶段更大的下降进一步表明缺乏将感知到的工具映射到任务语义的功能常识,这指出了发展实用具身AI的关键瓶颈。

英文摘要

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite the importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.

2606.10802 2026-06-10 cs.LG cs.AI 新提交

Boosting ECG Classification Performance by Pre-training with Synthesized Data

通过合成数据预训练提升心电图分类性能

Naoki Nonaka, Jun Seita

发表机构 * Advanced Data Science Project, RIKEN Information R&D and Strategy Headquarters(理化学研究所信息研发与战略总部先进数据科学项目)

AI总结 提出基于医学知识的高斯组合合成算法生成单导联II心电图数据,用于预训练深度神经网络,在四种异常分类中平均提升最高33.2%,尤其在小数据集场景下效果显著。

AI中文摘要

深度神经网络通常需要大量数据集才能有效训练。在医学领域,由于隐私问题和某些疾病的罕见性,获取大规模数据往往具有挑战性。为了解决数据稀缺问题,我们研究了使用基于领域医学知识生成的合成数据训练深度神经网络模型的有效性。具体来说,我们针对单导联II心电图开发了一种知识驱动的高斯组合合成算法,其中每个心跳由高斯形状的P、Q、R、S和T波分量表示。使用该模拟器,我们为四种异常心电图类别生成合成数据:心房颤动、心房扑动、室性早搏和沃尔夫-帕金森-怀特综合征。我们通过使用十种不同的深度神经网络架构进行异常心电图分类来评估该合成数据的效用。结果表明,合成到真实的训练提高了四种目标异常中三种的分类性能,其中心房扑动观察到的最大架构平均增益为33.2%。进一步分析表明,合成数据带来的性能提升在真实数据集较小时更为明显。这些发现表明,基于领域知识的合成心电图可以作为有用的预训练资源,特别是在真实数据有限或难以获取的场景中。

英文摘要

Deep Neural Networks (DNNs) typically require extensive datasets for effective training. In the medical domain, acquiring large-scale data is often challenging due to privacy concerns and the rarity of certain diseases. To address this data scarcity, we investigate the efficacy of training DNN models using synthetic data, generated based on domain-specific medical knowledge. Specifically, we develop a knowledge-driven Gaussian-composition synthesis algorithm for single-lead II ECGs, in which each heartbeat is represented by Gaussian-shaped P, Q, R, S, and T wave components. Using this simulator, we generate synthetic data for four abnormal electrocardiogram (ECG) classes: atrial fibrillation (AF), atrial flutter (AFLT), premature ventricular complex (PVC), and Wolff-Parkinson-White Syndrome (WPW). We evaluate the utility of this synthetic data by conducting abnormal ECG classification using ten different DNN architectures. Our results demonstrate that synthetic-to-real training improves classification performance for three of the four target abnormalities, with the largest architecture-averaged gain of $33.2\%$ observed for AFLT. Further analysis reveals that the performance enhancement from synthetic data is more pronounced with smaller real-world datasets. These findings suggest that domain-knowledge-based synthetic ECGs can serve as a useful pre-training resource, particularly in scenarios where real-world data are limited or difficult to obtain.
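The Gaussian-composition idea described above, where each heartbeat is a sum of Gaussian-shaped P, Q, R, S, and T components, can be sketched in a few lines. The amplitudes, centers, and widths below are hypothetical placeholder values chosen only so the waveform looks roughly ECG-like; they are not the parameters used by the authors, and `drop_wave` is an illustrative helper rather than part of their simulator.

```python
import math

# Hypothetical (amplitude in mV, center in s, width in s) per wave component.
WAVES = {
    "P": (0.15, 0.200, 0.025),
    "Q": (-0.10, 0.350, 0.010),
    "R": (1.00, 0.375, 0.012),
    "S": (-0.20, 0.400, 0.010),
    "T": (0.30, 0.600, 0.040),
}


def beat_value(t, waves=WAVES):
    """Signal value at time t (s): the sum of all Gaussian wave components."""
    return sum(
        amp * math.exp(-((t - mu) ** 2) / (2 * sigma ** 2))
        for amp, mu, sigma in waves.values()
    )


def synthesize_beat(fs=250, duration=0.8, waves=WAVES):
    """Sample one heartbeat at fs Hz for the given duration in seconds."""
    n_samples = int(round(fs * duration))
    return [beat_value(i / fs, waves) for i in range(n_samples)]


def drop_wave(name, waves=WAVES):
    """Return a copy of the wave table without one component."""
    return {key: value for key, value in waves.items() if key != name}
```

An abnormal class could then be approximated by editing the component table, for example `synthesize_beat(waves=drop_wave("P"))` for a beat without a distinct P wave. How the paper actually encodes each abnormality is not stated in the abstract, so this is only one plausible reading.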

2606.10799 2026-06-10 cs.AI 新提交

Evaluating Research-Level Math Proofs via Strict Step-Level Verification

通过严格的步骤级验证评估研究级数学证明

Yifeng Sun

发表机构 * Independent Researcher(独立研究者)

AI总结 提出严格步骤级验证框架,通过约束推理上下文和定理来源,解决大模型在复杂数学证明验证中的“上下文中毒”问题,在FirstProof挑战数据集上优于全局评估,并揭示基准中的隐含歧义。

AI中文摘要

大型语言模型(LLM)难以严格验证复杂的数学证明。标准的全局评估方法遭受“上下文中毒”,即表面上合理的陈述掩盖了微妙的逻辑缺陷,导致幻觉或过度怀疑。为了解决这个问题,我们从全局评估转向严格的步骤级验证:我们的框架为每个推理步骤维护详细的上下文,并严格约束所应用定理的来源。我们在从FirstProof挑战中精心策划的对抗性诊断套件上评估研究级证明。系统的消融研究表明,这些演绎约束是不可或缺的,因为无约束的全局提示始终无法定位微妙的逻辑错误。除了优于全局评估,我们的方法从根本上改变了失败分类。错误分析显示,剩余的拒绝主要是“迂腐的过度严谨”实例,源于未说明的领域约定,而不是表现出严重的逻辑幻觉,这有效地暴露了专家基准本身中的隐含歧义。我们的发现表明,提示代理以谨慎的、类似人类数学家的方式组织其验证笔记,可以显著提高其区分严谨证明和有缺陷证明的能力,有可能加强基础模型尚不熟悉的前沿数学概念上的代理推理,并为未来的自动化证明审查系统奠定理论基础。代码和提示可在GitHub上获取。

英文摘要

Large Language Models (LLMs) struggle to rigorously verify complex mathematical proofs. Standard global evaluation approaches suffer from "context poisoning," in which superficially plausible statements mask subtle logical flaws, leading to hallucination or over-skepticism. To address this, we shift from global evaluation to strict step-level verification: our framework maintains detailed context for each deduction step and strictly constrains the sources of applied theorems. We evaluate on a carefully curated adversarial diagnostic suite of research-level proofs drawn from the FirstProof challenge. A systematic ablation study demonstrates that these deductive constraints are indispensable, as unconstrained global prompting consistently fails to localize subtle logical errors. Beyond outperforming global evaluation, our approach fundamentally alters the failure taxonomy. Error analysis reveals that, rather than exhibiting severe logical hallucinations, remaining rejections are primarily instances of "pedantic hyper-rigor" stemming from unstated domain conventions, effectively exposing implicit ambiguities within the expert benchmark itself. Our findings suggest that prompting agents to organize their verification notes in a cautious, human-mathematician-like manner can substantially improve their ability to distinguish rigorous proofs from flawed ones, with the potential to strengthen agentic reasoning on frontier mathematical concepts that the base model does not already know well, and to lay a theoretical foundation for future automated proof-review systems. Code and prompts are available at GitHub.

2606.10798 2026-06-10 cs.LG 新提交

CITRAS-FM: Tiny Time Series Foundation Model for Covariate-Informed Zero-Shot Forecasting

CITRAS-FM: 面向协变量信息零样本预测的微型时间序列基础模型

Yosuke Yamaguchi, Issei Suemitsu, Yuki Kajihara, Wenpeng Wei

发表机构 * University of Tokyo(东京大学) Keio University(庆应大学)

AI总结 提出CITRAS-FM,一个仅7M参数的时间序列基础模型,通过引入Shifted Attention和协变量合成方法CovSynth,实现高效零样本预测,在100个任务上达到参数量低于10M的模型中的最优精度,且CPU推理时间低于0.1秒。

Comments Accepted to EUSIPCO 2026

AI中文摘要

预训练的时间序列基础模型(TSFMs)已实现对未见目标序列的零样本预测。然而,现有TSFMs通常计算成本高,对多样变量类型的支持有限,且往往未能考虑外生影响目标变异的协变量。为解决这些挑战,我们提出CITRAS-FM,一个仅7M参数的微型TSFM,支持单变量、多变量和协变量信息零样本预测,并实现实时CPU推理。基于补丁化的仅解码器Transformer,CITRAS-FM在跨变量模块中引入Shifted Attention,以有效利用在整个预测范围内可获取的已知协变量。此外,为了在协变量丰富语料稀缺的情况下实现协变量感知预训练,我们提出CovSynth,从目标序列的分解成分中合成逼真的协变量。在fev-bench上的实验(涵盖不同设置下的100个任务)表明,CITRAS-FM在参数量低于10M的TSFMs中实现了最先进的零样本精度,同时提供低于0.1秒的CPU推理,在预测精度和实时部署能力之间取得了良好平衡。

英文摘要

Pretrained time series foundation models (TSFMs) have enabled zero-shot forecasting on unseen target series. However, existing TSFMs often incur high computational cost and provide limited support for diverse variable types, often failing to account for covariates that exogenously influence target variability. To address these challenges, we propose CITRAS-FM, a tiny 7M-parameter TSFM that supports univariate, multivariate, and covariate-informed zero-shot forecasting with real-time CPU inference. Built on a patch-based, decoder-only Transformer, CITRAS-FM introduces Shifted Attention into the cross-variate module to effectively exploit known covariates accessible throughout the forecast horizon. Moreover, to enable covariate-aware pretraining despite the scarcity of covariate-rich corpora, we propose CovSynth, which synthesizes realistic covariates from decomposed components of target series. Experiments on fev-bench, spanning 100 tasks across various settings, demonstrate that CITRAS-FM achieves state-of-the-art zero-shot accuracy among sub-10M TSFMs while delivering sub-0.1-second CPU inference, offering a strong balance between forecasting accuracy and real-time deployability.

2606.10796 2026-06-10 cs.CL cs.AI 新提交

Dep-LLM: Training-Free Depression Diagnosis via Evidence-Guided Structured Multi-factor with Reliable LLM Reasoning

Dep-LLM:基于证据引导的结构化多因素与可靠LLM推理的无训练抑郁症诊断

Yiqing Lyu, Xianbing Zhao, Buzhou Tang, Ronghuan Jiang

发表机构 * School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong, China(哈尔滨工业大学(深圳)计算机科学与技术学院) School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, China(江南大学人工智能与计算机学院) Guangdong Provincial Key Laboratory of Intelligent Information Processing(广东省智能信息处理重点实验室) Pengcheng Laboratory(鹏城实验室) Chinese People’s Liberation Army General Hospital, Beijing, China(中国人民解放军总医院)

AI总结 提出无训练框架Dep-LLM,通过思维链多因素分析、置信度调制和协作预测,在冻结LLM上实现抑郁症诊断,超越零样本和微调方法。

AI中文摘要

从临床访谈中进行自动抑郁症检测(ADD)是计算心理健康领域的关键任务,但由于两个关键障碍仍然具有挑战性:1)在冗长、多主题的临床访谈中建模复杂但稀疏分布的抑郁线索困难,导致推理肤浅且不可靠;2)由于临床隐私导致标记数据稀缺,加上训练和微调的高成本,限制了监督式ADD系统的部署。为了共同应对这些挑战,我们提出了Dep-LLM,一个无训练框架,它模仿临床精神科医生的逐步推理,并完全在冻结的现成基础LLM上运行。Dep-LLM包含三个阶段。首先,思维链(CoT)抑郁症多因素分析模块将长对话结构性地分解为五个临床对齐的主题,并产生基于证据的推理,有效处理长上下文依赖。其次,我们引入了置信度分析与调制模块,该模块从每个推理的token级熵中量化认知可靠性,并应用标签内和主题间调制,在不进行额外训练的情况下放大可信信号同时抑制不确定信号。第三,协作多因素预测模块动态整合由置信度加权的多因素信号,形成最终诊断。在DAIC-WOZ和E-DAIC数据集上的大量实验证明了Dep-LLM的有效性和泛化性:它在几乎所有21个基础LLM上,在准确率、宏F1和加权平均F1等9个指标上超越了零样本基线,并进一步优于最先进的监督式领域特定LLM以及最新的闭源商业LLM,同时无需额外训练。

英文摘要

Automatic Depression Detection (ADD) from clinical interviews is a pivotal task in computational mental health, yet it remains challenging due to two critical obstacles: 1) difficulty in modeling complex but sparsely distributed depression clues within lengthy, multi-topic clinical interviews, leading to superficial and unreliable reasoning; 2) scarcity of labeled data due to clinical privacy, together with the high cost of training and fine-tuning, limiting the deployment of supervised ADD systems. To jointly address these challenges, we propose Dep-LLM, a training-free framework that mirrors the step-by-step reasoning of clinical psychiatrists and operates entirely on frozen off-the-shelf foundation LLMs. Dep-LLM comprises three stages. First, a Chain-of-Thought (CoT) Depression Multi-factor Analysis module structurally decomposes the long dialogue into five clinically aligned themes and produces evidence-grounded rationales, effectively handling long-context dependencies. Second, we introduce a Confidence Analysis and Modulation module that quantifies epistemic reliability from the token-level entropy of each rationale and applies an intra-label and inter-theme modulation that amplifies trustworthy signals while suppressing uncertain ones without extra training. Third, a Collaborative Multi-factor Prediction module dynamically integrates multi-factor signals weighted by confidence into the final diagnosis. Extensive experiments on the DAIC-WOZ and E-DAIC datasets demonstrate the effectiveness and generalizability of Dep-LLM: it surpasses the zero-shot baseline on nearly all 21 foundation LLMs across 9 metrics such as accuracy, macro F1 and weighted-average F1, and further outperforms state-of-the-art supervised domain-specific LLMs as well as the latest closed-source commercial LLMs, while requiring no extra training.
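The second and third stages described above rest on turning token-level entropy into a confidence weight and then combining per-theme signals by that weight. A minimal sketch of that idea follows. The mapping `exp(-mean entropy)` and the plain weighted average are assumptions made for illustration; the abstract does not give the paper's actual modulation formulas, and all names here are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def rationale_confidence(token_distributions):
    """Map the mean token entropy of a rationale to a weight in (0, 1].

    Low entropy (the model was certain at each step) gives a weight near 1.
    """
    mean_entropy = sum(token_entropy(d) for d in token_distributions) / len(
        token_distributions
    )
    return math.exp(-mean_entropy)


def weighted_vote(theme_scores, confidences):
    """Confidence-weighted average of per-theme scores in [0, 1]."""
    total = sum(confidences)
    return sum(s * c for s, c in zip(theme_scores, confidences)) / total
```

With this weighting, a theme whose rationale was generated with high uncertainty contributes less to the final score than a theme whose rationale was generated confidently.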

2606.10791 2026-06-10 cs.SD 新提交

Overview of ESDD2: Environment-Aware Speech and Sound Deepfake Detection Challenge

ESDD2概述:环境感知语音与声音深度伪造检测挑战赛

Xueping Zhang, Han Yin, Yang Xiao, Lin Zhang, Ting Dang, Rohan Kumar Das, Ming Li

发表机构 * Duke Kunshan University(昆山杜克大学) Korea Advanced Institute of Science and Technology(韩国科学技术院) The University of Melbourne(墨尔本大学) Johns Hopkins University(约翰霍普金斯大学) Fortemedia Singapore(Fortemedia新加坡) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 介绍ESDD2挑战赛,评估语音和环境声音独立或联合操纵的检测系统,最佳系统Macro-F1达0.8775,模块化分解、跨域自监督编码器、数据增强和选择性集成是关键。

Comments Accepted to 2026 ICME workshop

AI中文摘要

与ICME 2026联合举办的环境感知语音与声音深度伪造检测挑战赛(ESDD2)评估了五个组件级别的音频欺骗检测系统,其中语音和环境声音可能被独立或联合操纵。挑战结束后,我们分析了最终排行榜,并总结了来自顶级提交的有效设计选择。该挑战吸引了来自16个国家的94个注册;在验证提交要求和元数据后,保留了13个团队进行最终分析。在测试集上,最佳系统实现了0.8775的Macro-F1分数,显著优于分离增强的联合学习基线(0.6327)。顶级系统一致受益于模块化任务分解、跨域自监督编码器、针对性数据增强和选择性集成,而非简单的模型缩放。同时,辅助EER分析揭示了在检测伪造环境组件以及泛化到测试集中未见生成器方面的持续困难。本文报告了挑战结果,并为未来环境感知深度伪造检测研究提供了见解。CompSpoofV2数据集和基线代码仍公开可用,以促进可重复性。

英文摘要

The Environment-Aware Speech and Sound Deepfake Detection Challenge (ESDD2), held in conjunction with ICME 2026, evaluated systems for five component-level audio spoofing detection, where speech and environmental sounds may be manipulated independently or jointly. After the challenge concludes, we analyze the final leaderboard and summarize effective design choices from the top-performing submissions. The challenge attracted 94 registrations from 16 countries; after verification of submission requirements and metadata, 13 teams were retained for the final analysis. On the test set, the best system achieved a Macro-F1 score of 0.8775, substantially outperforming the separation-enhanced joint learning baseline (0.6327). Top systems consistently benefited from modular task decomposition, cross-domain self-supervised encoders, targeted data augmentation, and selective ensembling rather than simple model scaling. At the same time, auxiliary EER analyses reveal persistent difficulty in detecting the spoofed environmental component and in generalizing to unseen generators in the test set. This paper reports challenge results and provides insights for future environment-aware deepfake detection research. The CompSpoofV2 dataset and baseline code remain publicly available for reproducibility.
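The leaderboard metric quoted above, Macro-F1, averages per-class F1 scores with equal weight per class, so rare classes count as much as frequent ones. The sketch below is a plain-Python version of the standard definition; the label names in the usage note are invented for illustration and are not the challenge's class names.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never correct scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    """Fraction of exactly matching labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
```

For example, with nine `"bonafide"` items and one `"spoof"` item, a system that always answers `"bonafide"` reaches 0.9 accuracy but a Macro-F1 below 0.5, because the missed class contributes an F1 of zero.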

2606.10790 2026-06-10 cs.CV 新提交

A Multimodal RGB and Events Dataset for Hand Detection in First-Person View

第一人称视角下用于手部检测的多模态RGB和事件数据集

Bharghav Kota, Yulia Sandamirskaya

发表机构 * Zurich University of Applied Sciences(苏黎世应用科技大学)

AI总结 针对移动机器人系统中传统相机在暗光下运动模糊的问题,提出利用事件相机与RGB相机结合的多模态手部检测方法,并通过合成事件数据集实现与现有方法相当的性能。

AI中文摘要

现有的手部检测算法基于图像工作,检测率受限于相机的帧率。在移动机器人系统的手部检测应用中,传统相机会导致运动模糊,尤其是在较暗的光照条件下。我们可以利用事件相机,它具有高动态范围、高时间分辨率和低功耗的特点。最近的研究表明,使用事件相机和帧相机的立体设置可以提高检测精度和带宽-延迟权衡。在目标检测和识别任务中使用事件相机的主要瓶颈是训练数据量相对较少。在这项工作中,我们提出了一种方法以及一个从自我中心、第一人称视角合成的示例性事件手部数据集。数据使用v2e工具箱从现有的RGB Egohands数据集合成。通过改变v2e工具箱的参数,提供不同光照条件和尺度的数据集版本。使用微调后的YOLOv8模型生成地面真值检测,该模型应用于Egohands数据集中的RGB图像,并在高时间分辨率事件上进行插值。我们使用多模态数据集,利用现有的使用事件和RGB相机多模态设置的目标检测算法进行手部检测,并展示了与最先进方法相当的性能。

英文摘要

Existing hand detection algorithms work on images and the detection rate is restricted by the frame rate of the camera. In hand detection applications for moving robotic systems, conventional cameras cause motion blur, especially in darker lighting conditions. We can leverage the use of event-based cameras which possess a high dynamic range, high temporal resolution, and low power consumption. Recent work has shown that using a stereo setup of an event-based and a frame-based camera improves detection accuracy and the bandwidth-latency tradeoff. The main bottleneck in using event-based cameras in object detection and recognition tasks is a relatively low amount of training data. In this work, we propose a methodology and an exemplary synthetic event-based hand dataset from an egocentric, first-person view perspective. The data is synthesized from the existing RGB Egohands dataset with the v2e toolbox. Parameters of the v2e toolbox are varied to provide versions of the dataset with different lighting conditions and scales. Ground truth detections are generated with a fine-tuned YOLOv8 model which is applied to the RGB images in the Egohands dataset and interpolated on the high-temporal resolution events. We use the multi-modal dataset to perform hand detection with existing object detection algorithms which use a multi-modal setup of event and RGB cameras and demonstrate performance comparable to the state-of-the-art.
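The abstract states that ground-truth boxes detected on RGB frames are interpolated onto the high-temporal-resolution events. One simple way to do this is linear interpolation between the boxes of the two neighbouring frames. The sketch below assumes linear interpolation and an `(x1, y1, x2, y2)` box layout; neither detail is confirmed by the abstract.

```python
def interpolate_box(box0, t0, box1, t1, t):
    """Linearly interpolate a box between two frame timestamps.

    box0 and box1 are (x1, y1, x2, y2) boxes detected at times t0 and t1;
    t is an event timestamp with t0 <= t <= t1.
    """
    if not t0 <= t <= t1:
        raise ValueError("event timestamp must lie between the two frames")
    if t1 == t0:
        return tuple(float(v) for v in box0)
    alpha = (t - t0) / (t1 - t0)
    return tuple((1 - alpha) * a + alpha * b for a, b in zip(box0, box1))
```

For a 30 fps source video and events binned every millisecond, each pair of consecutive frames would yield roughly 33 interpolated boxes. Fast, non-linear hand motion between frames is not captured by this approximation.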

2606.10789 2026-06-10 cs.LG 新提交

Closing the Modality Gap in Zero-Shot HAR: Contrastive Training and Separability-Optimized Prototypes on IMU Data

缩小零样本HAR中的模态差距:基于IMU数据的对比训练与可分性优化原型

Anik Ghosh

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对IMU基零样本人体活动识别中的模态差距问题,提出对比训练与描述性原型结合的方法,在PAMAP2数据集上实现73.2%准确率和0.583宏F1,并指出宏F1更适合作为评估指标。

Comments 17 pages, 7 figures

AI中文摘要

基于惯性测量单元(IMU)的人体活动识别(HAR)中的零样本学习(ZSL)面临一个核心挑战:弥合传感器嵌入与语义类表示之间的差距。我们在PAMAP2数据集上系统评估了三种推理方法与两种训练流程组合的七种配置,使用14个已知和4个未知活动类别,并保留受试者108和109用于测试。我们发现模态差距是一个由编码器目标决定的训练时现象。使用标签名称的Sentence-BERT原型进行交叉熵训练的时间卷积网络(TCN)产生的传感器嵌入与对应文本原型的平均余弦相似度为0.30,而将标签名称原型目标替换为判别性活动描述后,该值提升至0.69。这种对齐改进在所有三种推理方法中一致迁移。最强的结果结合了对比训练与反向softmax校正,在未知类别上达到73.2%的准确率和0.583的宏F1,而标签名称基线仅为58.3%准确率和0.34宏F1。另一个发现是,更丰富的文本描述降低了Sentence-BERT空间中原型间的可分性,因为共享的生物力学词汇导致语言模型压缩了原型云。只要原型描述保留足够的判别性词汇,这种效应不会抵消对比对齐的好处。我们还证明,当测试集类别分布不平衡时,总体准确率是一个误导性的主要指标,并推荐宏平均F1作为ZSL-HAR基准的标准报告指标。

英文摘要

Zero-shot learning (ZSL) for inertial measurement unit (IMU)-based human activity recognition (HAR) faces a central challenge: bridging the gap between sensor embeddings and semantic class representations. We systematically evaluate seven configurations combining three inference methods with two training pipelines on the PAMAP2 dataset, using 14 seen and 4 unseen activity classes with subjects 108 and 109 held out for testing. We find that the modality gap is a training-time phenomenon governed by the encoder objective. A temporal convolutional network (TCN) trained with cross-entropy over label-name Sentence-BERT prototypes yields sensor embeddings with a mean cosine similarity of 0.30 to the corresponding text prototypes, while replacing the label-name prototype targets with discriminative activity descriptions raises this to 0.69. This alignment improvement transfers consistently across all three inference methods. The strongest result combines contrastive training with inverted softmax correction, achieving 73.2% accuracy and 0.583 macro F1 on unseen classes, compared to 58.3% accuracy and 0.34 macro F1 for the label-name baseline. A secondary finding is that richer text descriptions reduce inter-prototype separability in Sentence-BERT space, because shared biomechanical vocabulary causes the language model to compress the prototype cloud. This effect does not negate the benefits of contrastive alignment provided prototype descriptions retain sufficient discriminative vocabulary. We also demonstrate that overall accuracy is a misleading primary metric when test-set class distributions are imbalanced, and recommend macro-averaged F1 as the standard reporting metric for ZSL-HAR benchmarks.
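The inference step described above compares each sensor embedding with text prototypes by cosine similarity, and the inverted softmax correction counteracts "hub" prototypes that attract too many test samples. The sketch below uses the commonly cited form of inverted softmax, which normalizes each prototype's scores over the batch of test embeddings instead of over classes. Whether the paper uses exactly this form, and the temperature `beta`, are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def predict_nearest(embeddings, prototypes):
    """Plain zero-shot inference: pick the most similar prototype."""
    return [
        max(range(len(prototypes)), key=lambda j: cosine(e, prototypes[j]))
        for e in embeddings
    ]


def predict_inverted_softmax(embeddings, prototypes, beta=10.0):
    """Normalize each prototype's scores over the test batch, then pick the best."""
    sims = [[cosine(e, p) for p in prototypes] for e in embeddings]
    # One normalizer per prototype (column), summed over all test embeddings.
    column_totals = [
        sum(math.exp(beta * row[j]) for row in sims) for j in range(len(prototypes))
    ]
    return [
        max(
            range(len(prototypes)),
            key=lambda j: math.exp(beta * row[j]) / column_totals[j],
        )
        for row in sims
    ]
```

In a toy case with prototypes `(1, 0)` and `(0.6, 0.8)` and embeddings `(1, 0.1)` and `(0.9, 0.35)`, plain nearest-prototype inference assigns both embeddings to the first prototype, while the inverted softmax assigns the second embedding to the second prototype, because the first prototype's scores are discounted for being high across the whole batch.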

2606.10787 2026-06-10 cs.AI cs.LO 新提交

Accelerating NeurASP with vectorization and caching

通过向量化和缓存加速NeurASP

Alexander Philipp Rader, Alessandra Russo

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 本文通过向量化、批处理和缓存中间计算,显著加速了神经符号框架NeurASP的训练,在大型任务上实现了多个数量级的提速。

Comments 16 pages, 5 figures, to be published in the Theory and Practice of Logic Programming (TPLP) journal for the 42nd International Conference on Logic Programming (ICLP) issue

AI中文摘要

神经符号AI将神经网络与符号程序相结合,以创建鲁棒且可解释的预测。其中一个框架是NeurASP,它训练神经网络来预测概念,并使用答案集编程(ASP)编写的规则对这些概念进行推理,以解决下游任务。关键的是,标签仅由符号规则产生的下游预测提供,而不是潜在概念。通过不可微的ASP组件进行反向传播需要昂贵的概率和梯度计算,这阻碍了其扩展到更复杂的任务。在本文中,我们通过向量化、批处理和训练期间中间计算的缓存来改善NeurASP的计算性能,从而解决其当前局限性。我们比较了原始NeurASP和新实现的计算速度,并报告了在较大任务上多个数量级的加速。为此,我们提出了一个涉及扑克牌的困难任务新数据集,用于测试NeurASP增强学习功能的能力。

英文摘要

Neurosymbolic AI combines neural networks with symbolic programs to create robust and explainable predictions. One such framework is NeurASP, which trains a neural network to predict concepts and reasons over them using rules written in answer set programming (ASP) to solve downstream tasks. Crucially, labels are only provided for the downstream prediction produced by the symbolic rules, not for the latent concepts themselves. Backpropagation through the non-differentiable ASP component requires expensive probability and gradient calculations, which has hindered scalability to more sophisticated tasks. In this paper, we address the current limitations of NeurASP by improving its computational performance through vectorization, batch processing and caching of intermediate computations during training. We compare computation speeds between the original and our new implementation of NeurASP and report speedups of multiple orders of magnitude for larger tasks. To this end, we propose a new dataset of difficult tasks involving playing cards, which we use to test the capabilities of NeurASP's enhanced learning function.
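The benefit of caching can be illustrated with a toy stand-in for the symbolic component. In the sketch below, the probability of a downstream label is the sum, over all concept assignments consistent with that label, of the product of the network's concept probabilities. Enumerating the consistent assignments plays the role of the expensive solver call, and it is cached per label so that repeated training examples with the same label reuse it. The digit-addition task, the function names, and the use of `functools.lru_cache` are illustrative assumptions; the abstract does not describe how the authors implement their caching.

```python
from functools import lru_cache
from itertools import product


@lru_cache(maxsize=None)
def satisfying_pairs(target_sum, n_classes=10):
    """Stand-in for a solver call: all digit pairs (a, b) with a + b == target_sum."""
    return tuple(
        (a, b) for a, b in product(range(n_classes), repeat=2) if a + b == target_sum
    )


def label_probability(p_first, p_second, target_sum):
    """Probability that the two predicted digits add up to target_sum.

    p_first and p_second are per-class probability lists from the network.
    """
    return sum(
        p_first[a] * p_second[b]
        for a, b in satisfying_pairs(target_sum, len(p_first))
    )
```

Only the neural probabilities change between training steps, so the cached enumeration stays valid for the whole run. The per-label sum of products is also the kind of computation that can be expressed as a batched tensor operation, which is the vectorization aspect mentioned in the abstract.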

2606.10778 2026-06-10 cs.CV 新提交

From Patches to Patients: A study of the tile-to-slide performance transferability in Digital Pathology

从图块到患者:数字病理学中图块到全切片性能可迁移性的研究

Sofiène Boutaj, Leo Fillioux, Maria Vakalopoulou, Stergios Christodoulidis, Pierre Marza

发表机构 * Université Paris-Saclay, CentraleSupélec, Gustave Roussy, INSERM, IHU PRISM, Cancer Data Science Unit(巴黎-萨克雷大学、中央理工-高等电力学院、古斯塔夫·鲁西研究所、法国国家健康与医学研究院、IHU PRISM、癌症数据科学单元) Université Paris-Saclay, CentraleSupélec, MICS Laboratory(巴黎-萨克雷大学、中央理工-高等电力学院、MICS实验室)

AI总结 研究图块级线性探测能否作为全切片级性能的可靠代理,通过19个基础模型在42个切片级和16个图块级任务上的基准测试,发现图块与切片性能高度相关,图块级基准测试可有效筛选候选模型。

Comments Accepted to MICCAI 2026

AI中文摘要

基础模型最近通过为全切片图像分析提供稳健表示,重新定义了组织病理学中的最先进技术。然而,为特定临床队列选择最优基础模型目前需要多个预处理步骤,随后对每个模型进行计算昂贵的特征提取和训练多实例学习聚合器。在这项工作中,我们研究高效的图块级线性探测能否作为切片级性能的可靠代理,从而减少对每个候选编码器运行完整切片级管道的需求。我们在42个切片级和16个图块级任务上对19个最先进的基础模型进行基准测试,使用ABMIL和均值池化聚合器比较图块探测指标与切片级结果。我们观察到在不同任务难度下,图块与切片性能之间存在高度相关性,表明编码器表示质量是WSI成功的主要决定因素。敏感性分析显示,可迁移性在不同模型间稳定,且受队列规模和每张切片图块数量的影响大于平均任务难度。我们还测量了图块级和切片级任务中最佳表现模型的一致性,表明图块基准测试可靠地筛选出强候选模型。总体而言,我们的研究表明,图块级基准测试为缩小候选模型范围提供了高效且实用的第一步,而切片级评估对于临床任务的最终验证仍然必不可少。

英文摘要

Foundation Models (FMs) have recently redefined the state-of-the-art in histopathology by providing robust representations for whole-slide image (WSI) analysis. However, selecting the optimal foundation model (FM) for a specific clinical cohort currently requires multiple preprocessing steps, followed by computationally expensive feature extraction and the training of a Multiple Instance Learning (MIL) aggregator for every model. In this work, we investigate whether efficient tile-level linear probing can serve as a reliable proxy for slide-level performance, reducing the need to run full slide-level pipelines for every candidate encoder. We benchmark 19 state-of-the-art FMs on 42 slide-level and 16 tile-level tasks, comparing tile probing metrics against slide-level outcomes using ABMIL and Mean Pooling aggregations. We observe a high correlation between tile and slide performance across varying task difficulties, indicating that encoder representation quality is the primary determinant of WSI success. Sensitivity analyses show that transferability is stable across models and is more influenced by cohort sizes and numbers of tiles per slide than by average task difficulty. We also measure the agreement in best performing models between tile and slide-level tasks, showing tile benchmarks reliably shortlist strong candidates. Overall, our study indicates that tile-level benchmarking provides an efficient and practical first step for narrowing down candidate models, while slide-level evaluation remains essential for final validation on clinical tasks.
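Whether tile-level scores are a usable proxy comes down to whether they rank the candidate encoders in the same order as slide-level scores. A rank correlation such as Spearman's is one natural way to check this. The sketch below is a plain-Python Spearman coefficient with average ranks for ties; the abstract reports a "high correlation" without naming the statistic, so the choice of Spearman here is an assumption.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(tile_scores, slide_scores):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(tile_scores), average_ranks(slide_scores))
```

A value near 1 across encoders would mean the cheap tile-level probe orders the models almost exactly as the full slide-level pipeline does, which is the situation in which shortlisting by tile benchmarks is safe.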

2606.10777 2026-06-10 cs.LG 新提交

Can we trust our models? Epistemic calibration in second-order classification

我们能信任我们的模型吗?二阶分类中的认知校准

Arthur Hoarau

发表机构 * Université de Lorraine, CentraleSupélec Loria, CNRS(洛林大学,中央理工-高等电力学院洛里亚实验室,法国国家科学研究中心)

AI总结 提出认知校准准则,衡量认知不确定性估计是否可靠,并证明其比经典校准更严格,通过EECE指标实验揭示不同不确定性量化方法的差异。

AI中文摘要

不确定性估计对于在高风险场景中部署机器学习模型至关重要。然而,经典校准仅评估预测概率的可靠性,并不评估认知不确定性估计本身是否可信。这一局限性对于二阶分类模型尤为突出。我们引入认知校准,这是一个有原则的准则,用于衡量报告的认知不确定性是否忠实地反映了模型预测围绕真实值的分散程度。我们证明认知校准是比经典校准更严格的概念,并能捕捉标准指标无法发现的失败模式。通过一个在认知校准假设下成立的不可能性定理,我们将这项工作与现有文献联系起来。为了将这一概念付诸实践,我们提出了期望认知校准误差(EECE),并证明它是真实认知校准误差(TECE)的一致估计量。在广泛的不确定性量化方法上的实验表明,认知校准是一个连贯且有意义的准则,并揭示了不同方法之间的显著差异,尽管它们的预测性能相似。

英文摘要

Uncertainty estimation is critical for deploying machine learning models in high-stakes settings. However, classical calibration only assesses the reliability of predicted probabilities and does not evaluate whether epistemic uncertainty estimates are themselves trustworthy. This limitation is particularly relevant for second-order classification models. We introduce epistemic calibration, a principled criterion that measures whether reported epistemic uncertainty faithfully reflects the dispersion of model predictions around the ground truth. We show that epistemic calibration is a strictly stronger notion than classical calibration and captures failure modes invisible to standard metrics. We relate this work to the existing literature through an impossibility theorem that holds under the epistemic calibration hypothesis. To operationalize this concept, we propose the Expected Epistemic Calibration Error (EECE), which we prove to be a consistent estimator of a True Epistemic Calibration Error (TECE). Experiments across a broad range of uncertainty quantification methods show that epistemic calibration is a coherent and meaningful criterion and reveal substantial differences across methods, despite similar predictive performance.
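For contrast with the epistemic notion introduced above, the classical calibration that the abstract calls insufficient is typically measured with the expected calibration error (ECE): predictions are binned by confidence, and the gap between average confidence and empirical accuracy is averaged over bins. The sketch below implements that classical ECE with equal-width bins. It is not the paper's EECE, whose definition the abstract does not give.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Classical ECE: bin-size-weighted mean of |accuracy - confidence|.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: booleans, True where the prediction matched the label.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(acc - mean_conf)
    return ece
```

This metric only looks at the first-order predicted probability. Two second-order models can share the same ECE while reporting very different spreads over their predictions, which is the gap an epistemic calibration criterion is meant to expose.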

2606.10769 2026-06-10 cs.CV 新提交

ZODS-RS -- Zero-training Oriented Detection & Segmentation for Remote Sensing

ZODS-RS -- 面向遥感的零训练目标检测与分割

Zuan Gu, Tianhan Gao, Langxu Zhao

发表机构 * Northeastern University, China(东北大学)

AI总结 提出一种无需训练的闭式管道ZODS-RS,通过原型纯化、旋转尺度等变匹配和不确定性感知像素合并,统一了遥感图像的水平框检测与实例分割,在多个数据集上取得优异性能。

AI中文摘要

遥感与无人机应用需要模型能够跨平台和视角泛化,而无需特定任务训练。然而,无训练管道在处理有向几何、尺度/旋转变化以及拥挤的港口或机场时常常失败,并且很少统一检测与分割。我们提出ZODS-RS,一种无训练、闭式的管道,输出水平框(HBB)和实例掩码。基于DINOv3密集特征和SAM风格的提议,ZODS-RS依次串联:PP(通过Tyler协方差进行原型纯化)、R-SEM(使用可分离核和全局匈牙利分配的旋转尺度等变匹配)以及UAM(具有自适应先验和可选负原型的不确定性感知逐像素合并)。一个轻量级的CWLA融合多个DINOv3层。在FAIR1M(HBB)上,我们获得$\mathrm{mAP}_{0.50:0.95}=\mathbf{13.06}$和$\mathrm{AP}_S=\mathbf{2.93}$(船舶/飞机类别平均);在xView(HBB)上,我们报告$\mathrm{mAP}=\mathbf{16.69}$。在我们的无人机数据集上,ZODS-RS实现了掩码$\mathrm{mIoU}=\mathbf{31.10}$,并在单张5090上将小目标AP相对于Grounded-SAM提升了$\mathbf{+30.70}$。这项工作为航空影像中的水平框检测加实例分割提供了统一的、无需训练的解决方案;提供了与DINOv3紧密耦合的PP/R-SEM/UAM的显式闭式公式;并在小目标和拥挤目标以及跨域迁移下展示了一致的增益,同时保持部署简单。

English abstract

Remote-sensing and UAV applications need models that generalize across platforms and viewpoints without task-specific training. Yet training-free pipelines often falter on oriented geometry, scale/rotation variation, and crowded ports or airfields, and rarely unify detection and segmentation. We introduce ZODS-RS, a training-free, closed-form pipeline that outputs horizontal boxes (HBB) and instance masks. Built on DINOv3 dense features and SAM-style proposals, ZODS-RS chains: PP (prototype purification via Tyler covariance), R-SEM (rotation-scale equivariant matching with separable kernels and global Hungarian assignment), and UAM (uncertainty-aware pixelwise merging with adaptive priors and optional negative prototypes). A lightweight CWLA fuses multiple DINOv3 layers. On FAIR1M (HBB) we obtain $\mathrm{mAP}_{0.50:0.95}=13.06$ and $\mathrm{AP}_S=2.93$ (class-averaged over ship/airplane); on xView (HBB) we report $\mathrm{mAP}=16.69$. On our UAV dataset, ZODS-RS achieves mask $\mathrm{mIoU}=31.10$ and improves small-object AP by $+30.70$ over Grounded-SAM on a single 5090. This work offers a unified, no-training solution for horizontal-box detection plus instance segmentation in aerial imagery; provides explicit closed-form formulations for PP/R-SEM/UAM tightly coupled with DINOv3; and demonstrates consistent gains on small and crowded targets and under cross-domain shifts while keeping deployment simple.
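
The abstract mentions a "global Hungarian assignment" inside R-SEM without giving details. As a rough illustration of what a global one-to-one assignment optimizes, the sketch below finds the best prototype-to-proposal matching by exhaustive search. This is a brute-force stand-in, not the Hungarian algorithm (which reaches the same optimum in polynomial time) and not the authors' implementation. The score matrix and all names are hypothetical.

```python
from itertools import permutations


def best_assignment(score):
    """Return (total, pairs) maximizing the summed score over one-to-one matches.

    score[i][j] is an assumed similarity between prototype i and proposal j.
    Requires at least as many proposals (columns) as prototypes (rows).
    Exhaustive search, so only suitable for very small inputs.
    """
    n_rows = len(score)
    n_cols = len(score[0]) if n_rows else 0
    if n_rows > n_cols:
        raise ValueError("need at least as many proposals as prototypes")

    best_total, best_pairs = float("-inf"), []
    # Each candidate assigns a distinct column to every row.
    for cols in permutations(range(n_cols), n_rows):
        total = sum(score[i][j] for i, j in enumerate(cols))
        if total > best_total:
            best_total, best_pairs = total, list(enumerate(cols))
    return best_total, best_pairs


# Hypothetical scores: greedy matching would pick (0, 0) first and end at 1.0,
# while the global optimum pairs 0 with 1 and 1 with 0.
toy_scores = [[0.90, 0.80],
              [0.85, 0.10]]
toy_total, toy_pairs = best_assignment(toy_scores)
```

The toy matrix is chosen so that a greedy, row-by-row match is suboptimal, which is the usual motivation for solving the assignment globally.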

2606.10768 2026-06-10 cs.LG cs.CL New submission

N-GRPO: Embedding-Level Neighbor Mixing for Enhanced Policy Optimization

Xukun Zhu, Hang Yu, Peng Di, Linchao Zhu

Affiliations * Zhejiang University; Ant Group

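AI summary: To address the exploration trade-off in LLM mathematical reasoning, proposes N-GRPO, which injects diversity at the embedding level through a semantic neighbor mixing mechanism while preserving semantic consistency, improving policy optimization.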

Comments ACL 2026 Findings. 16 pages, 3 figures. Code: https://github.com/ZJUSCL/N-GRPO

English abstract

The success of Large Language Models in mathematical reasoning relies heavily on the generation of diverse and valid solution paths during the rollout phase. However, current rollout techniques face a fundamental trade-off: token-level sampling often yields redundant trajectories that differ only in rephrasing, while embedding-level methods utilizing random noise frequently disrupt semantic consistency. To resolve this, we introduce N-GRPO, a novel exploration strategy integrated into the Group Relative Policy Optimization (GRPO) framework. Rather than relying on token-level sampling or native embedding-level noise, our approach leverages Semantic Neighbor Mixing. This mechanism dynamically constructs input representations by mixing the embeddings of an anchor token and its nearest semantic neighbors, thereby injecting diversity while strictly adhering to the local semantic manifold. Experimental evaluations on the DeepSeek-R1-Distill-Qwen models across different sizes show that N-GRPO not only achieves consistent improvements over strong baselines on math reasoning benchmarks but also exhibits robust generalization capabilities on out-of-distribution tasks.
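
The abstract describes Semantic Neighbor Mixing as mixing the embedding of an anchor token with those of its nearest semantic neighbors, but it does not state the formula. The sketch below is one plausible reading under stated assumptions: neighbors are ranked by cosine similarity, and the result is a convex combination of the anchor with the mean of its `k` nearest neighbors. The parameters `k` and `alpha`, the similarity measure, and the toy embedding table are all hypothetical. The authors' actual method is in the linked repository.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def neighbor_mix(embeddings, anchor_id, k=2, alpha=0.2):
    """Blend an anchor embedding with the mean of its k nearest neighbors.

    embeddings: list of vectors, indexed by token id (assumed layout).
    alpha: assumed mixing weight; 0.0 returns the anchor unchanged.
    """
    anchor = embeddings[anchor_id]
    # Rank every other token by similarity to the anchor, most similar first.
    ranked = sorted(
        (t for t in range(len(embeddings)) if t != anchor_id),
        key=lambda t: cosine(anchor, embeddings[t]),
        reverse=True,
    )
    neighbors = ranked[:k]
    if not neighbors:
        return list(anchor)

    dim = len(anchor)
    mean_nb = [sum(embeddings[t][d] for t in neighbors) / len(neighbors)
               for d in range(dim)]
    # Convex combination keeps the result between the anchor and its neighbors.
    return [(1 - alpha) * anchor[d] + alpha * mean_nb[d] for d in range(dim)]


# Hypothetical 2-D embedding table: token 1 is close to token 0, token 2 is not.
toy_table = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
toy_mixed = neighbor_mix(toy_table, anchor_id=0, k=1, alpha=0.5)
```

Restricting the mix to nearest neighbors is what distinguishes this from adding random noise: the perturbed vector stays near embeddings of semantically related tokens.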

2606.10765 2026-06-10 cs.CL New submission

ArabiGEE: A Hierarchical Taxonomy for Arabic Grammatical Error Explanation

Khaled Elhady, Omar Kallas, Nizar Habash, Bashar Alhafni

Affiliations * Mohamed bin Zayed University of Artificial Intelligence; New York University Abu Dhabi

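AI summary: Proposes the first hierarchical taxonomy for Arabic grammatical error explanation grounded in explicit error types. It spans orthographic, morphological, syntactic, and lexical dimensions, with 27 error types, 140 correction types, and 324 explanations, and is used to manually annotate portions of existing corpora in support of automatic evaluation of LLMs.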

English abstract

We introduce ArabiGEE, the first comprehensive Arabic grammatical error explanation (GEE) taxonomy grounded in explicit error types. Unlike existing GEE approaches that treat explanation generation as free-form text, ArabiGEE organizes grammatical explanations through a hierarchical structure spanning orthographic, morphological, syntactic, and lexical dimensions. The taxonomy consists of 27 error types, 140 correction types, and 324 associated explanations. We apply ArabiGEE to manually annotate portions of existing Arabic grammatical error correction corpora and demonstrate how structured grammatical explanations can support automatic evaluation of LLMs on Arabic GEE. Our code and data are publicly available.
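
The three-level structure described above (error types containing correction types, each with associated explanations, grouped under four dimensions) can be pictured as nested records. The sketch below is only an assumed data layout. The class names, the field names, and every entry in the toy data are placeholders and are not taken from the ArabiGEE taxonomy itself.

```python
from dataclasses import dataclass, field

# The four dimensions named in the abstract.
DIMENSIONS = ("orthographic", "morphological", "syntactic", "lexical")


@dataclass
class CorrectionType:
    name: str
    explanations: list = field(default_factory=list)


@dataclass
class ErrorType:
    name: str
    dimension: str
    corrections: list = field(default_factory=list)

    def __post_init__(self):
        # Reject dimensions outside the four listed in the abstract.
        if self.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {self.dimension}")


def taxonomy_counts(error_types):
    """Return (error types, correction types, explanations) for a taxonomy."""
    n_corrections = sum(len(e.corrections) for e in error_types)
    n_explanations = sum(len(c.explanations)
                         for e in error_types for c in e.corrections)
    return len(error_types), n_corrections, n_explanations


# Placeholder entries only; the real taxonomy reports 27 / 140 / 324.
toy_taxonomy = [
    ErrorType("placeholder-error-1", "orthographic", [
        CorrectionType("placeholder-correction-1",
                       ["placeholder explanation a", "placeholder explanation b"]),
    ]),
    ErrorType("placeholder-error-2", "syntactic", [
        CorrectionType("placeholder-correction-2", ["placeholder explanation c"]),
        CorrectionType("placeholder-correction-3"),
    ]),
]
toy_counts = taxonomy_counts(toy_taxonomy)
```

With a fixed label set of this shape, a model's explanation can be compared against a gold label at each level, which is one way structured explanations could support the automatic evaluation the abstract mentions.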