arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30512 2026-06-01 cs.AI cs.CV

PhyDrawGen: Physically Grounded Diagram Generation from Natural Language

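Nafiul Haque, Syed Nazmus Sakib, Shifat E Arman

Affiliations: Department of Robotics and Mechatronics Engineering, University of Dhaka

AI summary: Proposes PhyDrawGen, a neuro-symbolic pipeline that generates physically consistent diagrams from natural language through scene-graph extraction, a deterministic solver, and a visual verification loop; it significantly outperforms existing models on a benchmark covering mechanics, optics, and electromagnetism.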

Comments 9 figures, 7 tables. Under review at EMNLP 2026

Abstract

Generating physics diagrams from text requires strict adherence to physical laws. While current generative models produce visually plausible outputs, they systematically hallucinate force vectors, ignore conservation laws, and violate geometric constraints. We present PhyDrawGen, a neuro-symbolic pipeline that decouples semantic scene understanding from physical constraint satisfaction. First, a large language model extracts a typed scene graph from the problem text. A deterministic solver then converts this graph into a Planar Straight-Line Graph (PSLG), encoding force balance, optical paths, and field topologies as exact geometric primitives. Finally, a fine-tuned Qwen-VL model implements a visually grounded propose-verify loop to iteratively correct any constraint violations. Evaluated on a benchmark of 1,449 problems spanning mechanics, optics, and electromagnetism, PhyDrawGen significantly outperforms GPT-5-image, Gemini 2.5 Flash, and Gemini 3 Pro, demonstrating robust physical accuracy even on unusual-object problems.

2605.30510 2026-06-01 cs.CV cs.AI

A Novel Global Context-aware Deep Neural Network for Enhanced Brain Tumor Segmentation using Magnetic Resonance Images

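Sourjya Mukherjee, Ananya Bhattacharjee, R. Murugan

Affiliations: National Institute of Technology Silchar

AI summary: Proposes the Global Context-aware Squeeze and Excite Residual UNet (GCSER-UNet), which fuses spatial and channel-wise attention and reports Dice scores above the prior state of the art on TCGA LGG and on two of the three BraTS 2020 tumor regions.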

Comments 11 pages, 9 figures, 6 tables. Submitted to arXiv cs.CV

Abstract

Brain cancer's severity necessitates precise brain tumor segmentation, which is crucial for effective brain tumor diagnosis. Manual identification, burdened by high costs, labor, and error risks, highlights the need for automated methods. In this study, we introduce the Global Context-aware Squeeze and Excite Residual UNet (GCSER-UNet), which facilitates a fusion of spatial and channel-wise attention and thus enhances the model's capacity to capture intricate spatial dependencies and contextual information. GCSER-UNet efficiently extracts tumor segments from multimodal MRI slices, delivering exceptional performance. Evaluations on benchmark databases exhibit its superiority, achieving a notable 94 percent dice score on the TCGA LGG dataset, surpassing the state-of-the-art dice score of 91.8 percent. In the BraTS 2020 dataset, the proposed GCSER-UNet ensemble approach yielded dice scores of 95 percent, 92 percent, and 90 percent for the tumor regions - Whole Tumor (W), Tumor Core (T), and Enhancing Tumor (E), respectively. The current state-of-the-art dice scores were 94 percent, 93 percent, and 88 percent. These compelling outcomes highlight the efficacy of GCSER-UNet in precise brain tumor segmentation and thus can aid neurologists in effective brain cancer management and treatment planning.
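
The Dice scores quoted above measure the overlap between a predicted and a reference segmentation mask. As a minimal sketch (not the paper's evaluation code; the function name `dice_score` and the toy masks are illustrative assumptions), the metric for flat binary masks can be computed like this:

```python
def dice_score(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Hypothetical masks: 1 marks a tumor pixel, 0 marks background.
pred = [1, 1, 0, 0, 1, 0]
truth = [1, 0, 0, 0, 1, 1]
print(dice_score(pred, truth))
```

A score of 1.0 means the masks coincide and 0.0 means they share no foreground pixels, so "94 percent" corresponds to a value of 0.94 on this scale.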

2605.30508 2026-06-01 cs.RO

ARISTO Hand: Sensing-Driven Distal Hyperextension for Fine-Grained Manipulation

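Aaron Kim, Dong Ho Kang, Mark Helwig, Mingyo Seo, Kazuto Yokoyama, Tetsuya Narita, Luis Sentis

Affiliations: Human Centered Robotics Lab at The University of Texas at Austin; Sony Group Corporation

AI summary: Proposes the ARISTO Hand, a tendon-driven robotic hand that combines active distal hyperextension with a hybrid fingertip-sensing architecture (a rigid nail-mounted force-torque sensor plus a soft capacitive tactile array) to improve thin-object manipulation, increasing pull-out force by 2.76x for object thicknesses of 1-20 mm and enabling fine tasks such as SD card extraction and insertion.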

Abstract

Manipulating thin objects requires precise contact geometry and reliable force perception, yet many anthropomorphic robotic hands lack the mechanical and sensing capabilities needed for such interactions. We present the ARISTO Hand, a tendon-driven robotic hand that integrates active distal hyperextension with a hybrid fingertip-sensing architecture that combines a rigid, nail-mounted force-torque sensor and a soft capacitive tactile array. Active hyperextension enables controlled fingertip engagement beyond the kinematic limits of standard flexion, increasing pull-out force by 2.76x for object thicknesses of 1-20 mm while preserving the nominal grasp capability. The rigid nail-mounted sensor provides reliable force measurements during edge contacts, where the sensitivity of proprioceptive force estimation degrades as the contact geometry approaches kinematic singularities. We validate the proposed architecture through quantitative force characterization and a multi-stage SD card extraction and insertion task. Video and supplementary materials are available at: https://aristohand.github.io

2605.30506 2026-06-01 cs.RO cs.CV

VLM-GLoc: Vision-Language Model Enhanced Monte Carlo Localization for Robust Semantic Global Localization in Cluttered Quasi-Static Environments

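Shivendra Agrawal, Bradley Hayes

Affiliations: University of Colorado Boulder

AI summary: Proposes VLM-GLoc, which uses open-vocabulary vision-language models as a unified semantic observation front-end and seeds particles through an inverse semantic proposal mechanism based on text-to-map retrieval, achieving robust global localization in quasi-static environments with geometric and semantic ambiguity.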

Abstract

Global localization in geometrically aliased, quasi-static environments such as grocery stores, offices, schools, and hospitals poses a significant challenge for mobile robots. Grocery stores with parallel aisles and a long-tailed distribution of products, as well as offices and labs with repetitive furniture such as chairs, desks, monitors, and doors, exemplify common indoor environments that present geometric and even semantic ambiguity. Traditional approaches rely either on distinct geometric features or on domain-specific vision pipelines that struggle with long-tail semantic distributions and transient visual clutter. We present VLM-GLoc, a method for hierarchical semantic Monte Carlo Localization (MCL) that leverages open-vocabulary Vision-Language Models (VLMs) as a unified semantic observation front-end. We hypothesize a three-fold benefit from VLMs: (1) extracting highly discriminative rich text features, (2) implicit quality filtering of blurry or dynamic objects, and (3) permanence reasoning for targeted data augmentation. We introduce an inverse semantic proposal mechanism that seeds particles via text-to-map retrieval. Evaluated in two real-world environments with different characteristics and on two different platforms (a 3,500 sq. ft. grocery store with a cellphone and a 3,700 sq. ft. lab space with a quadruped), VLM-GLoc achieves 70% and 74% global localization success, respectively, substantially outperforming traditional geometry-only and domain-specific baselines.

2605.30504 2026-06-01 cs.CL

Auditing LLM Benchmarks with Item Response Theory

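Sander Land, Daniel M. Bikel

Affiliations: Writer, Inc.

AI summary: Introduces an Item Response Theory-based indicator that, using responses from 114 models, surfaces likely mislabels at 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks, traces the sources of these errors, and shows that reward models specialize in stylistic preference rather than factual knowledge.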

Abstract

LLM benchmark labels are frozen at release and silently propagated into downstream benchmarks, errors and all. We introduce an Item Response Theory-based indicator that surfaces likely mislabels at 95% precision in the top 200 examples across seven preference and multiple-choice benchmarks using responses from 114 models, outperforming a supervised classifier. We trace these errors to mechanical labeling heuristics, upstream annotation mistakes inherited unchanged from source datasets, and fundamentally ambiguous items without a defensible single label. The same model fit reveals that reward models specialize in stylistic preference rather than factual knowledge, and identifies one frontier reward model that agrees with detected mislabels at 78% accuracy versus 38% for its peers, consistent with benchmark contamination or benchmark-specific over-optimization.
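
The abstract does not spell out the indicator itself, so the sketch below only illustrates the standard two-parameter logistic (2PL) item response function that IRT analyses commonly build on. The parameter names (`theta` for model ability, `a` for item discrimination, `b` for item difficulty) and the example values are assumptions for illustration, not the paper's fitted quantities.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function: probability that a model with ability
    `theta` matches the label of an item with discrimination `a` and
    difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical items: one well-behaved, one with negative discrimination.
for a in (1.5, -1.5):
    weak = p_correct(-1.0, a, 0.0)
    strong = p_correct(2.0, a, 0.0)
    print(f"a={a:+.1f}: weak model={weak:.3f}, strong model={strong:.3f}")
```

When the fitted discrimination is negative, stronger models agree with the label less often than weaker ones. In IRT practice that pattern is a common reason to review an item's label; whether the paper's indicator is defined this way is not stated in the abstract.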

2605.30503 2026-06-01 cs.RO cs.SY eess.SY stat.ML

Physics-informed Goal-Conditioned Reinforcement Learning under Hybrid Contact Dynamics

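Vittorio Giammarino, Anastasios Manganaris, Ahmed H. Qureshi

Affiliations: Department of Computer Science

AI summary: Shows that the hybrid dynamics of contact-rich tasks can degrade existing physics-informed goal-conditioned reinforcement learning (Pi-GCRL) methods, and proposes contact-aware and hierarchical formulations that apply physics-informed inductive biases selectively, as a step toward contact-rich manipulation.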

Abstract

Learning to reach arbitrary goals from sparse feedback requires agents to infer a rich notion of reachability across state-goal pairs. Goal-conditioned reinforcement learning (GCRL) tackles this challenge by learning policies that generalize across goals, but this generalization becomes increasingly difficult as the underlying dynamics become high-dimensional, hybrid, or contact-dependent. To address this issue, physics-informed GCRL (Pi-GCRL) introduces optimal-control-inspired inductive biases into goal-conditioned value learning. While Pi-GCRL methods have proven effective in navigation and object-free goal-reaching domains, their reliability remains unclear in contact-rich tasks, where contact interactions induce hybrid dynamics, mode-dependent controllability, and nonsmooth value landscapes. In this work, we show that these structural properties can cause existing Pi-GCRL methods to degrade when applied naively to contact-rich manipulation. Motivated by this analysis, we introduce contact-aware and hierarchical formulations that apply physics-informed inductive biases selectively across the manipulation problem. Our results provide a principled step toward extending Pi-GCRL to contact-rich manipulation.

2605.30501 2026-06-01 cs.CL

Linear Ensembles Wash Away Watermarks: On the Fragility of Distributional Perturbations in LLMs

Zhihao Wu, Gracia Gong, Qinglin Zhu, Yudong Chen, Runcong Zhao

Affiliations: Department of Informatics, King's College London, UK; Department of Mathematics, Imperial College London, UK; Department of Statistics, University of Warwick, UK

AI summary: Shows theoretically and empirically that when users can access multiple models, simply averaging their output probability distributions removes watermarks, and proposes WASH to handle vocabulary misalignment and tokenisation differences in the ensemble.

Abstract

Watermarking embeds statistical signatures in AI-generated text for detection and attribution. We reveal a fundamental vulnerability: when users access multiple models (today's reality), watermarks trivially fail. Watermarks perturb output distributions away from the original, and in competitive markets, these perturbations are typically independent across providers. We theoretically prove that averaging output probability distributions recovers the unwatermarked distribution up to a second-order error term. Empirically, simply averaging 3-5 models cancels out these perturbations. We introduce WASH (Watermark Attenuation via Statistical Hybridisation), which solves practical challenges in ensemble generation: vocabulary misalignment and tokenisation differences across heterogeneous models. Experiments across six watermarking schemes and three LLMs show that averaging across 3 models suppresses detection z-scores from 5-300 to below 2 (below the detection threshold of 4) and reduces TPR at 5% FPR to below 50%, while improving quality by 27.5% and running 6 times faster than the best baseline on long-sequence generation. Our results suggest that robust AI-text detection via watermarking requires either accepting this fundamental vulnerability or unprecedented coordination among model providers.
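
The averaging argument can be illustrated with a toy experiment. This is a sketch under strong simplifying assumptions and is not the WASH method: all three "providers" share one base distribution and one vocabulary, and the watermark is a hypothetical green-list scheme that adds a fixed bias `delta` to the logits of a random half of the vocabulary. Every name here is illustrative.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l1(p, q):
    """L1 distance between two distributions over the same vocabulary."""
    return sum(abs(a - b) for a, b in zip(p, q))


def watermarked(logits, rng, delta=2.0):
    """Toy green-list watermark: add `delta` to a random half of the vocabulary."""
    green = set(rng.sample(range(len(logits)), len(logits) // 2))
    return softmax([x + delta if i in green else x for i, x in enumerate(logits)])


rng = random.Random(0)
base_logits = [rng.gauss(0.0, 1.0) for _ in range(50)]
clean = softmax(base_logits)

# Three providers with independent green lists over the same base model.
dists = [watermarked(base_logits, rng) for _ in range(3)]
average = [sum(col) / len(dists) for col in zip(*dists)]

mean_single = sum(l1(d, clean) for d in dists) / len(dists)
print(f"mean single-model L1 shift: {mean_single:.3f}")
print(f"averaged-ensemble L1 shift: {l1(average, clean):.3f}")
```

By the triangle inequality the averaged distribution can never sit further from the clean one than the individual watermarked distributions do on average, and independent green lists make it strictly closer in practice. Heterogeneous vocabularies and tokenisers, which the abstract names as the practical obstacles, are not modelled here.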

2605.03337 2026-06-01 cs.CV cs.AI

FreeTimeGS++: Secrets of Dynamic Gaussian Splatting and Their Principles

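Lucas Yunkyu Lee, Soonho Kim, Youngwook Kim, Sangmin Kim, Jaesik Park

Affiliations: Seoul National University; POSTECH

AI summary: Builds a controlled baseline, FreeTimeGS_ours, to systematically analyze hidden factors in 4D Gaussian Splatting, uncovering emergent temporal partitioning driven by Gaussian durations and a discrepancy between photometric fidelity and spatiotemporal consistency, and proposes FreeTimeGS++, which uses gated marginalization and neural velocity fields for more stable dynamic representations.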

Comments Project page: https://yklcs.com/ftgspp

Abstract

The recent surge in 4D Gaussian Splatting (4DGS) has achieved impressive dynamic scene reconstruction. While these methods demonstrate remarkable performance, the specific drivers behind such gains remain less explored, making a systematic understanding of the underlying principles challenging. In this paper, we perform a comprehensive analysis of these hidden factors to provide a clearer perspective on the 4DGS framework. We first establish a controlled baseline, FreeTimeGS_ours, by formalizing and reproducing the heuristics of the state-of-the-art FreeTimeGS. Using this framework, we dissect 4DGS along its fundamental axes and uncover key secrets, including the emergent temporal partitioning driven by Gaussian durations and the discrepancy between photometric fidelity and spatiotemporal consistency. Based on these insights, we propose FreeTimeGS++, a principled method that employs gated marginalization and neural velocity fields to achieve superior stability and robust dynamic representations. Our approach yields reproducible results with reduced run-to-run variance. We will release our implementation to provide a reliable foundation for future 4DGS research.

2605.30497 2026-06-01 cs.CL

CanLegalRAGBench: Evaluating Retrieval-Augmented Generation on Canadian Case Law

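Ethan Zhao, Maksym Taranukhin, Wei Cui, Moira Aikenhead, Vered Shwartz

Affiliations: Department of Computer Science, University of British Columbia; Vector Institute; CIFAR AI Chair; Peter A. Allard School of Law, University of British Columbia

AI summary: To address hallucination in legal RAG systems and the under-representation of Canadian law in evaluations, introduces CanLegalRAGBench, a Canadian legal QA benchmark built on realistic queries and expert annotations; finds that retrieval performance is sensitive to design choices and that open-source embedding models are competitive with closed-source ones, but that automatic evaluation has limitations and generated answers often diverge from gold responses.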

Abstract

RAG-based legal assistants have been growing in popularity, but LLM hallucinations remain a key issue and potentially undermine justice. While benchmarks have been developed to evaluate progress, many rely on synthetic queries rather than realistic legal scenarios. Moreover, Canadian law remains underrepresented in existing evaluations. To address this gap, we introduce CanLegalRAGBench, a Canadian legal QA benchmark based on realistic queries and expert-annotated answers grounded in case law. Our evaluation shows that retrieval performance is sensitive to design choices and that open-source embedding models are competitive with closed-source models. However, it also reveals a limitation of automatic evaluations, which penalize systems for retrieving alternative relevant documents. We also find that generated answers often diverge from gold responses, either with hallucinations or by producing overly detailed or irrelevant content, with 8-29% of claims not being supported by the retrieved documents. We hope this benchmark will help drive continued progress in addressing limitations of legal RAG systems.

2605.30488 2026-06-01 cs.RO

CoMo3R-SLAM: Collaborative Monocular Dense SLAM with Learned 3D Reconstruction Priors for Outdoor Multi-Agent Systems

Zhihao Cao, Qi Shao, Shuhao Zhai, Feng Tian, Anh Nguyen, Hesheng Wang, Baoru Huang

Affiliations: ETH Zurich; University of Liverpool; Harbin Engineering University; University of Ottawa; Shanghai Jiao Tong University; Imperial College London

AI summary: Proposes CoMo3R-SLAM, described as the first collaborative monocular dense RGB SLAM system, which uses learned feed-forward 3D reconstruction priors for outdoor multi-agent mapping and produces globally consistent metric maps without depth sensors.

Abstract

Collaborative dense SLAM is essential for multi-robot teams to achieve scalable and consistent 3D perception across large-scale outdoor environments. Existing systems typically depend on depth sensors, incurring significant payload, power, and calibration costs. Monocular RGB cameras are a lightweight alternative, but collaborative monocular dense SLAM remains difficult due to scale ambiguity and unreliable inter-agent data association, especially in outdoor scenes where low overlap and repetitive structures make traditional feature matching unreliable; this motivates the use of robust geometric information. We propose CoMo3R-SLAM, the first collaborative monocular dense RGB SLAM system that leverages robust learned feed-forward 3D reconstruction priors for outdoor multi-agent mapping. Each agent runs a prior-guided front-end for real-time tracking and local dense fusion, while a coordinator performs dense pointmap matching for cross-agent verification, closed-form Sim(3) gauge synchronization, and GPU-accelerated global bundle adjustment with segment-level depth optimization. Requiring neither depth sensors nor parametric intrinsics, our system produces robust cross-agent constraints and globally consistent metric maps from monocular RGB alone. On Tanks and Temples and Waymo sequences, CoMo3R-SLAM achieves the best ATE on three of four Tanks and Temples scenes and competitive Waymo accuracy, matching or exceeding state-of-the-art RGB-D methods while running online at 8 FPS.

2605.30487 2026-06-01 cs.CL

Configurable Reward Model for Balanced Safety Alignment

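Zhengping Jiang, Mehran Khodabandeh, Akash Bharadwaj, Manik Bhandari, Mayur Srungarapu, Anqi Liu, Benjamin Van Durme, Li Chen

Affiliations: Johns Hopkins University; Meta Superintelligence Labs

AI summary: Proposes the Configurable Safety Reward Model (CSRM), which uses configuration-targeted data augmentation and joint optimization to stay sensitive to fine-grained safety configurations and conversational nuances, reaches state-of-the-art results on configurable safety benchmarks, and improves the helpfulness-safety tradeoff.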

Abstract

Aligning large language models (LLMs) to heterogeneous and rapidly evolving safety requirements remains a critical challenge. Existing instruction-tuned LLMs and standalone safety classifiers often fail to generalize to new safety configurations, motivating the need for Reward Models (RMs) that are explicitly configurable to changing specifications. We introduce the Configurable Safety Reward Model (CSRM), which is jointly optimized for calibrated safety compliance and reward modeling. Our approach is supported by configuration-targeted data augmentation that enforces instruction adherence while preserving relative severity structure. The resulting RM is sensitive to fine-grained safety configurations and conversational nuances, substantially improving generalization to previously unseen safety configurations. CSRM achieves state-of-the-art performance on recent configurable safety benchmarks, including CoSApien (94.6% F1) and DynaBench (75.8% F1), without requiring additional human annotation. When used for downstream safety alignment, CSRM yields LLMs with a significantly improved helpfulness-safety tradeoff compared to existing baselines.

2605.30486 2026-06-01 cs.LG cs.AI

Graph-Conditioned Mixture of Graph Neural Network Experts for Traffic Forecasting

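Amirhossein Ghaffari, Saeid Sheikhi, Ekaterina Gilman

Affiliations: Future Computing Group, University of Oulu

AI summary: Proposes GC-MoE, a framework that assigns each node a personalized combination of frozen experts based on graph topology and recent traffic input, trains only a lightweight routing module, and improves MAE over a zero-parameter ensemble baseline on four benchmarks.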

Comments Accepted at the 27th IEEE International Conference on Mobile Data Management (MDM 2026)

Abstract

Spatio-temporal forecasting on sensor graphs is commonly tackled with a single backbone architecture applied uniformly across all nodes, although graph regions can exhibit different dynamics. Road segments differ in functional class, structure, and traffic behavior, suggesting that node-wise expert specialization can be useful. We propose GC-MoE, a graph-conditioned mixture of experts framework that assigns each node a personalized combination of frozen forecasting experts based on graph topology and the recent traffic input window. GC-MoE combines frozen pretrained spatio-temporal GNN experts with an input-aware, spatially contextualized router while training only a lightweight routing module. We also study a bounded graph-conditioned output refinement layer as an optional extension and include node-adaptive ST-LoRA adapters only as an ablation diagnostic. Across four standard benchmarks (PEMS04, PEMS07, METR-LA, and PEMS-BAY), GC-MoE improves MAE over a zero-parameter ensemble baseline, with competitive RMSE and MAPE, while training only ~17K parameters on top of 1.5M frozen expert weights. The implementation is available at https://github.com/Ahghaffari/gc_moe.
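
The core idea of mixing frozen experts with per-node weights can be sketched as follows. This is an illustrative toy and not the GC-MoE implementation: in the paper the routing scores come from a trained router conditioned on graph topology and the recent input window, whereas here `router_logits` is simply hard-coded, and all names are assumptions.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mix_experts(expert_preds, router_logits):
    """Combine frozen expert forecasts with per-node routing weights.

    expert_preds[e][n] is expert e's forecast for node n;
    router_logits[n][e] is the router's score for expert e at node n.
    """
    mixed = []
    for n, logits in enumerate(router_logits):
        weights = softmax(logits)
        mixed.append(sum(w * preds[n] for w, preds in zip(weights, expert_preds)))
    return mixed


# Two hypothetical experts and three sensor nodes (e.g. predicted speeds).
expert_preds = [[60.0, 30.0, 45.0], [50.0, 40.0, 45.0]]
router_logits = [[0.0, 0.0], [4.0, -4.0], [-1.0, 2.0]]
print(mix_experts(expert_preds, router_logits))
```

Equal scores reproduce a plain average of the experts at that node, which corresponds to the kind of zero-parameter ensemble the abstract uses as a baseline, while strongly skewed scores let a node follow a single expert.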

2605.30484 2026-06-01 cs.RO

ELAN4D: Embodiment-Centric 4D Supervision for Vision-Language-Action Models via Plug-and-Play Adaptation

Zeyuan He, Bowen Yang, Zhirui Fang, Keru Zhou, Lei Jiang, Jingjing Qian, Fan Mo, Junchi Yan, Philip Torr, Xiu Li, Li Jiang, Jialin Yu

Affiliations: Torr Vision Group, University of Oxford; The Chinese University of Hong Kong, Shenzhen; Tsinghua University; Shanghai Jiao Tong University; University College London; University of Cambridge

AI summary: Proposes ELAN4D, a plug-and-play framework that uses future robot keypoint tracks as predictive spatio-temporal supervision to improve the robustness and generalization of VLA policies.

Abstract

Vision-Language-Action (VLA) models have shown promise for robotic manipulation, yet most existing policies operate reactively by directly regressing actions from current observations, without explicitly modeling future dynamics. This limits their ability to generalize under out-of-distribution perturbations. To address this issue, we propose ELAN4D, an embodiment-centric, 4D-aware training framework that enhances VLA policies with future robot keypoint tracks as predictive spatio-temporal supervision. Using only forward kinematics from proprioceptive states, we derive 3D displacement tracks of robot keypoints, such as joints and the end-effector, with negligible preprocess cost. These tracks provide metric and compact supervision without requiring external trackers or reconstruction. A plug-and-play auxiliary branch with a lightweight track decoder injects this 4D signal into the action expert while preserving the pretrained vision-language backbone through gradient isolation. The track decoder is discarded during inference, leaving the base policy interface unchanged. Extensive experiments on LIBERO, LIBERO-Plus, RoboTwin2.0 and real-world manipulation tasks demonstrate that ELAN4D consistently improves over strong VLA baselines, achieving the best overall performance and substantial gains under out-of-distribution perturbations, including camera, background, and layout shifts. These results highlight the effectiveness of embodiment-centric 4D supervision for building more robust and generalizable manipulation policies.
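
Deriving keypoint displacement tracks from proprioceptive states alone can be sketched with a toy arm. The example is a simplification under stated assumptions: it uses a planar two-link arm, so the tracks are 2D rather than the 3D tracks the abstract describes, and the link lengths, function names, and joint states are all hypothetical.

```python
import math


def keypoints(q1, q2, l1=0.3, l2=0.25):
    """Forward kinematics of a planar 2-link arm: returns [elbow, end_effector]."""
    elbow = (l1 * math.cos(q1), l1 * math.sin(q1))
    tip = (elbow[0] + l2 * math.cos(q1 + q2), elbow[1] + l2 * math.sin(q1 + q2))
    return [elbow, tip]


def displacement_tracks(joint_states, horizon):
    """Displacement of each keypoint from the current step to each of the
    next `horizon` steps, computed from joint angles only."""
    now = keypoints(*joint_states[0])
    tracks = []
    for q in joint_states[1:horizon + 1]:
        future = keypoints(*q)
        tracks.append([(f[0] - n[0], f[1] - n[1]) for f, n in zip(future, now)])
    return tracks  # indexed as tracks[step][keypoint] -> (dx, dy)


# Hypothetical logged joint angles in radians, one tuple per time step.
states = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (0.3, 0.2)]
tracks = displacement_tracks(states, horizon=3)
print(len(tracks), len(tracks[0]))
```

Because the targets come from the robot's own joint readings, no external tracker or scene reconstruction is involved, which matches the abstract's point about negligible preprocessing cost.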

2605.30482 2026-06-01 cs.LG

Discovering a Zeta Map Algorithm on Dyck Paths via Mechanistic Interpretability

Xiaoyu Huang, Blake Jackson, Kyu-Hwan Lee

Affiliations: Department of Mathematics, Temple University, Philadelphia, PA, USA; Institute for Computer-Aided Reasoning in Mathematics, Carnegie Mellon University, Pittsburgh, PA, USA; Department of Mathematics, University of Connecticut, Storrs, CT, USA; Korea Institute for Advanced Study, Seoul 02455, Republic of Korea

AI summary: Trains a small encoder-decoder transformer to learn the zeta map on Dyck paths, analyzes its computation with mechanistic interpretability tools, and thereby discovers and proves a new explicit combinatorial algorithm, the scaffolding map.

Abstract

Machine learning is increasingly used in mathematical discovery, but in mathematics the desired output is often not a prediction itself, but an explicit construction that can be checked independently. We study this setting through the zeta map on Dyck paths, a classical bijection in the combinatorics of the q,t-Catalan numbers. We train a deliberately small one-layer, one-head encoder-decoder transformer on this map and analyze its learned computation using mechanistic interpretability tools, including decoder cross-attention analysis, linear probing, and causal intervention. The analysis reveals a level-based mechanism: encoder representations make path levels linearly accessible, while the decoder selects and traverses input positions in a structured way. Translating these signals into combinatorics leads to the scaffolding map, an explicit peak-centered traversal algorithm for Dyck paths. We prove that this algorithm agrees with the zeta map, modulo a reversal convention in the labeling. This gives a controlled example of AI-assisted mathematical discovery in which mechanistic interpretability turns model behavior into a precise, human-verifiable combinatorial algorithm.

2605.30481 2026-06-01 cs.CL

When English Rewrites Local Knowledge: Global Narrative Dominance in Large Language Models

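Md Arid Hasan, Ruwad Naswan, Farhan Samir, Sharifa Sultana, Syed Ishtiaque Ahmed

Affiliations: University of Toronto; BUET; University of Illinois Urbana-Champaign

AI summary: Builds CulturalNB, a Bengali cultural dataset, to evaluate the cross-lingual knowledge consistency of LLMs in a low-resource cultural context; finds that asking in English systematically increases global substitution and institutional framing while reducing local perspective coverage, indicating that cultural failures are failures of grounding and narrative prioritization as well as of missing knowledge.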

Comments Submitted to ARR

Abstract

Large language models (LLMs) are widely used as cross-lingual knowledge interfaces. However, culturally grounded questions often reflect globally dominant narratives rather than local contexts. We study this failure mode as global narrative dominance in Bangla, a low-resource cultural context. We introduce CulturalNB, a dataset of 717 manually curated Bengali cultural instances with parallel Bangla-English question-answer pairs and supporting evidence, metadata, and sociocultural annotations. Using question-only and evidence-based prompting, we evaluate nine state-of-the-art LLMs with human judges and two independent LLM judges across metrics for cross-lingual consistency, language anchoring, global substitution, institutional bias, and epistemic perspective coverage. Results show that questions asked in English systematically increase global substitution and institutional framing while reducing local perspective coverage. Local evidence improves factual consistency and perspective coverage, but does not eliminate language-induced epistemic shifts. These findings suggest that cultural failures in LLMs are not only missing-knowledge errors but also failures of grounding and narrative prioritization.

2605.30479 2026-06-01 cs.LG

Universal Multiclass Transductive Online Learning

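Steve Hanneke, Hongao Wang

Affiliations: Department of Computer Science, Purdue University, West Lafayette, IN 47907, USA

AI summary: Studies universal transductive online classification with a possibly unbounded label space, characterizes learnability through a new "Level-Constrained-Littlestone-Littlestone (LCLL) tree" structure together with the indifference property, and shows that the optimal mistake rate for learnable classes is either bounded or logarithmic.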

英文摘要

We consider the problem of universal transductive online classification with a possibly unbounded label space. This setting considers online learning, with the sequence of instances (without labels) known to the learner in advance. We say a concept class $\mathcal{H}$ is learnable if there is a learning algorithm $\mathcal{A}$, such that for every realizable sequence, the number of mistakes made by $\mathcal{A}$ grows at most sublinearly with the number of predictions. We characterize the learnability of this setting and show that there are only two possible optimal rates for the learnable classes: either bounded or increasing logarithmically. We introduce a new combinatorial structure, called ``Level-Constrained-Littlestone-Littlestone (LCLL) tree'', which, along with the indifference property, characterizes the learnability. We also extend the learnability result to the agnostic case and the case where only the stochastic process that generates the instance sequence is known.

2605.30472 2026-06-01 cs.CL

Your Multimodal Speech Model Says I Have a Face for Radio

你的多模态语音模型说我有张适合电台的脸

Maya K. Nachesa, Vlad Niculae, Vagrant Gautam

发表机构 * Language Technology Lab University of Amsterdam(语言技术实验室 阿姆斯特丹大学) Heidelberg Institute for Theoretical Studies(海德堡理论研究所)

AI总结 通过配对不同人脸与相同音频的视频,评估多模态语音识别模型在性别、种族及其交叉属性上的偏差,发现质量差异显著,提示多模态并非必然更好。

详情
AI中文摘要

随着大型神经模型在语言任务上的表现越来越好,研究人员正在构建处理更多数据模态的多模态和全模态模型。其中一个例子是将语音识别模型扩展到音视频数据,用于噪声抑制和多模态字幕生成。虽然性能和偏差在单模态领域已得到广泛研究,但新模态如何影响这些方面尚不清楚,尽管它们在人类中会产生偏差。因此,我们提出了对多模态语音识别的首次偏差评估,我们创建了将不同人脸与相同音频配对的视频,并测量语音转录准确性的变化。我们发现,在mWhisper-Flamingo和Gemini模型中,自我声明的性别、种族及其交叉属性之间存在高达4.05个词错误率点的服务质量差异。我们的研究结果表明,开发者应优先评估、修复和传达此类限制,因为通过额外模态提供更多信号并不一定更好,甚至可能导致有偏见的结果。

英文摘要

As large neural models have become better at language tasks, researchers are increasingly building multi- and omnimodal models that handle more modalities of data. One example is the expansion of speech recognition models to audio-visual data for noise mitigation and multimodal subtitling. While performance and bias have been studied extensively in the single-modality regime, it is unknown how new modalities affect this, even though they produce biases in humans. We therefore propose the first bias evaluation of multimodal speech recognition, where we create videos pairing different faces with the same audio, and measure changes in speech transcription accuracy. We find large quality-of-service differences across mWhisper-Flamingo and Gemini models, with drops of up to 4.05 word error rate points, across self-declared gender, ethnicity, and their intersection. Our findings point to a priority for developers to evaluate, fix, and communicate such limitations, as providing more signals through additional modalities is not necessarily better, and may even lead to biased outcomes.
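The abstract reports differences in word error rate (WER) points. As a reminder of what that metric measures, here is a minimal sketch of the standard word-level edit-distance definition of WER. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the paper's own scoring pipeline (text normalization, tooling) is not described in the abstract and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting,
    with no casing or punctuation normalization.
    """
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A gap in "WER points" between two face conditions paired with the same audio would then be `100 * (word_error_rate(ref, hyp_b) - word_error_rate(ref, hyp_a))`.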

2605.30470 2026-06-01 cs.LG

Can Subgraph Explanations Be Weaponized to Steal Graph Neural Networks?

子图解释能否被武器化以窃取图神经网络?

Ojas Nimase, Jiate Li, Yue Zhao, Yushun Dong

发表机构 * University of Southern California(南加州大学) Florida State University(佛罗里达州立大学)

AI总结 本文提出首个针对图分类的黑盒模型提取攻击,利用模型解释输出引导蒙特卡洛边敏感性估计,并利用解释子图缩小边界搜索空间,实验表明该方法优于现有基线。

Comments 28 pages, 8 figures, 10 tables. Under review at NeurIPS 2026

详情
AI中文摘要

图机器学习即服务(GMLaaS)平台越来越多地实现可解释性接口以满足监管透明度要求。然而,这种透明度为模型提取攻击创造了可利用的漏洞。我们提出了首个针对图分类的模型提取攻击,该攻击在严格的黑盒约束下进行,攻击者仅观察到离散类标签和二进制解释掩码(无概率分数、梯度或置信度值)。我们的方法(1)利用模型解释输出引导蒙特卡洛边敏感性估计朝向决策边界,并具有Hoeffding集中保证估计精度;(2)利用解释子图有效缩小边界搜索空间。在多个领域的基准图数据集上的大量实验表明,我们的方法优于可比基线。这些发现表明,此类可解释性接口创造了可利用的攻击面,为可解释AI指令的防御机制和政策框架提供了信息。实现代码见https://github.com/LabRAI/XSTEAL/。

英文摘要

Graph Machine Learning as a Service (GMLaaS) platforms increasingly implement explainability interfaces to meet regulatory transparency requirements. However, this transparency creates exploitable vulnerabilities for model extraction attacks. We present the first model extraction attack specifically designed for graph classification under strict black-box constraints where the attacker observes only discrete class labels and binary explanation masks (no probability scores, gradients, or confidence values). Our method (1) uses model explanation outputs to guide Monte Carlo edge sensitivity estimation toward decision boundaries, with Hoeffding concentration guarantees on estimation accuracy and (2) exploits explanation subgraphs to efficiently narrow the boundary search space. Extensive experiments on benchmark graph datasets across multiple domains demonstrate our method's superiority over comparable baselines. These findings demonstrate that such explainability interfaces create exploitable attack surfaces, informing both defensive mechanisms and policy frameworks for explainable AI mandates. The implementation code is provided in https://github.com/LabRAI/XSTEAL/.
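The Hoeffding guarantee mentioned above can be made concrete with a small sketch. It assumes each Monte Carlo sample is a value in [0, 1] (for example, an indicator of whether a random edge perturbation flips the predicted label) and computes how many samples the two-sided Hoeffding inequality requires for a given accuracy and failure probability. The function names and the `query` callable are hypothetical; this is not the XSTEAL implementation.

```python
import math
import random
from typing import Callable


def hoeffding_samples(eps: float, delta: float) -> int:
    """Smallest n such that 2 * exp(-2 * n * eps**2) <= delta.

    Valid for i.i.d. samples bounded in [0, 1]: with n samples, the
    empirical mean is within eps of the true mean with probability
    at least 1 - delta.
    """
    if eps <= 0 or not (0 < delta < 1):
        raise ValueError("need eps > 0 and 0 < delta < 1")
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))


def estimate_flip_rate(query: Callable[[random.Random], bool], n: int, seed: int = 0) -> float:
    """Monte Carlo estimate of how often `query` reports a label flip.

    `query` is a hypothetical stand-in for one black-box call that
    perturbs an edge and returns True if the class label changed.
    """
    rng = random.Random(seed)
    return sum(1 for _ in range(n) if query(rng)) / n
```

For instance, `hoeffding_samples(0.1, 0.05)` gives the query count needed to estimate a flip rate to within 0.1 with 95% confidence under these assumptions.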

2605.30469 2026-06-01 cs.SD cs.CV

3DAE: Binaural Quality Assessment for Audio Novel View Synthesis with Spatial Maps and Benchmark

3DAE: 基于空间图谱和基准的音频新视角合成双耳质量评估

Jialu Xu, Yifan Zhou

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出一个全参考诊断框架3DAE Map,通过时频音频误差图(幅度、ILD、IPD、时间对齐、响度和高频故障)进行视觉检查,并构建模型无关基准3DAE Bench,用于评估音频新视角合成模型的双耳预测质量。

详情
AI中文摘要

3D音频和新视角声学合成模型通常使用全局指标进行评估。然而,全局指标往往隐藏了双耳预测失败的位置和原因。我们提出一个全参考诊断框架,该框架使用时频音频误差图,包括幅度、ILD、IPD、时间对齐、响度和高频故障,形成3D音频误差图(3DAE Map)用于视觉检查。我们将这些诊断方法整合到一个模型无关的基准——空间音频误差基准(3DAE Bench)中,该基准接受任意真实和预测的双耳对,并报告音频新视角合成模型的预测质量。在Replay-NVAS和SoundSpaces上对ViGAS输出的实验显示了不同的主要故障模式:Replay-NVAS上的时间错位和SoundSpaces上的ILD不匹配。总体而言,该框架为音频新视角合成模型开发优化提供了可解释的故障模式总结和直观的视觉图谱。

英文摘要

3D audio and novel-view acoustic synthesis models are usually evaluated with global metrics. However, global metrics often hide where and why binaural prediction fails. We propose a full-reference diagnostic framework that uses time-frequency audio error maps for magnitude, ILD, IPD, temporal alignment, loudness, and high-frequency failures, forming a 3D Audio Error Map (3DAE Map) for visual inspection. We frame these diagnostics into a model-agnostic benchmark, Spatial Audio Error Bench (3DAE Bench), which takes arbitrary ground-truth and predicted binaural pairs and reports the prediction quality of audio novel-view synthesis models. Experiments on ViGAS outputs over Replay-NVAS and SoundSpaces show different dominant failure modes: temporal misalignment on Replay-NVAS and ILD mismatch on SoundSpaces. Overall, the framework provides interpretable failure-mode summaries and intuitive visual maps to support the development and optimization of audio novel-view synthesis models.

2605.30468 2026-06-01 cs.RO

Learning-Based Navigation for Indoor Mobile Robots

基于学习的室内移动机器人导航

Tri-Tin Nguyen, Tien-Dat Nguyen, Gia-Uy Le, Vinh Nguyen, Vinh-Hao Nguyen

发表机构 * Faculty of Electrical Electronic Engineering, Ho Chi Minh City University of Technology, VNU-HCM Ho Chi Minh City, Vietnam

AI总结 提出一种结合监督学习全局规划器与基于学习的DWA局部规划器的导航框架,通过行为克隆和PPO优化实现安全避障导航。

详情
AI中文摘要

本文提出了一种基于学习的室内移动机器人导航框架。该方法将基于代价感知A*专家轨迹训练的监督神经全局规划器与提出的基于学习的DWA局部规划器相结合,后者被表述为动态窗口法(DWA)动作格上的离散候选选择。对于局部规划,策略首先通过行为克隆进行训练,然后在可行性感知掩码下通过近端策略优化(PPO)进行精炼。该框架在模拟和真实室内环境中进行了实现和评估。实验结果表明,所提方法能够在存在障碍物的情况下生成可行的全局路径和可靠的局部运动指令,以实现安全的目标导向导航。这些结果证明了将基于学习的全局规划与强化学习精炼的局部控制相结合用于室内移动机器人导航的有效性。源代码将在 https://ntdathp.github.io/rl_robot_web/ 发布。

英文摘要

This paper presents a learning-based navigation framework for indoor mobile robots. The proposed method combines a supervised neural global planner, trained from cost-aware A* expert trajectories, with the proposed Learning-Based DWA local planner, which is formulated as discrete candidate selection over the Dynamic Window Approach (DWA) action lattice. For local planning, the policy is first trained by behavior cloning and then refined by Proximal Policy Optimization (PPO) under feasibility-aware masking. The framework is implemented and evaluated in both simulated and real-world indoor environments. Experimental results show that the proposed method generates feasible global routes and reliable local motion commands for safe goal-directed navigation in the presence of obstacles. These results demonstrate the effectiveness of integrating learning-based global planning with reinforcement-learning-refined local control for indoor mobile robot navigation. The source code will be released at https://ntdathp.github.io/rl_robot_web/.
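The idea of discrete candidate selection under feasibility-aware masking can be sketched as follows. This is an assumption-laden illustration: candidates are (linear velocity, angular velocity) pairs from a DWA-style lattice, the policy is represented only by a list of logits, and infeasible candidates (for example, those predicted to collide) receive zero probability. The names `masked_softmax` and `select_candidate` are hypothetical and not taken from the paper's code.

```python
import math
from typing import List, Sequence, Tuple

Candidate = Tuple[float, float]  # (linear velocity, angular velocity)


def masked_softmax(logits: Sequence[float], feasible: Sequence[bool]) -> List[float]:
    """Softmax over feasible entries only; infeasible entries get probability 0."""
    if len(logits) != len(feasible):
        raise ValueError("logits and feasible must have the same length")
    if not any(feasible):
        raise ValueError("at least one candidate must be feasible")
    # Subtract the largest feasible logit for numerical stability.
    top = max(l for l, ok in zip(logits, feasible) if ok)
    exps = [math.exp(l - top) if ok else 0.0 for l, ok in zip(logits, feasible)]
    total = sum(exps)
    return [e / total for e in exps]


def select_candidate(
    candidates: Sequence[Candidate],
    logits: Sequence[float],
    feasible: Sequence[bool],
) -> Candidate:
    """Greedy choice: the feasible candidate with the highest masked probability."""
    probs = masked_softmax(logits, feasible)
    best = max(range(len(probs)), key=probs.__getitem__)
    return candidates[best]
```

Masking before normalization means a policy trained with behavior cloning or PPO never has to place probability mass on commands the robot cannot safely execute, which is one common motivation for this design.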

2605.30465 2026-06-01 cs.CL

Knowledge Graph-Enhanced Zero-Shot Topic Classification: A Multi-Strategy Comparative Study

知识图谱增强的零样本主题分类:多策略比较研究

Shahana Akter, Yatharth Vohra, Ankita Shukla, Souvika Sarkar

发表机构 * A2I Lab, School of Computing, Wichita State University(A2I实验室,计算学院,威奇托州立大学) EIS Lab, College of Engineering, University of Nevada, Reno(EIS实验室,工程学院,内华达大学,里诺)

AI总结 提出零样本多标签主题分类框架,通过知识图谱增强文档表示,实验表明对小型模型有正面影响,对大型模型有负面影响。

Comments 15 pages, 1 figure, ACL format. This paper proposes a KG-augmented zero-shot multi-label topic classification framework and evaluates multiple strategies

详情
AI中文摘要

在没有标注训练数据的情况下进行多标签主题分类是一项具有挑战性的任务,特别是当文档包含复杂的关系信息时。我们提出了一个零样本多标签主题分类框架,并系统地研究了每篇文章的知识图谱增强如何影响其性能。基础框架在没有标注训练数据的情况下对文档中的主题进行分类,有四种变体:仅文章分类、关键词增强分类以及两者的自一致性解码变体。然后,我们为每个基础变体增加每篇文章的知识图谱。该图谱通过类似于KGGen的流水线从输入文档中提取,基于主语-谓语-宾语三元组。我们在十五个大型语言模型和八个跨不同领域的多标签数据集上测试了所有八种方法(四种基础和四种图谱增强)。对于基础框架,关键词增强分类(AK)是表现最好的方法,十五个大型语言模型中有六个超过了句子编码器基线。图谱增强对小型模型有正面影响,对大型模型有负面影响。这表明大型模型已经从预训练中包含了足够的关系信息。此外,自一致性解码变体在任何实验中都没有显示出性能提升,同时计算成本增加了约五倍。

英文摘要

Multi-label topic classification without labeled training data is a challenging task, especially when documents contain complex relational information. We present a zero-shot multi-label topic classification framework and systematically investigate how per-article knowledge graph augmentation affects its performance. The base framework classifies topics in documents without labeled training data and has four variants: article-only classification, keyword-enhanced classification, and self-consistency decoding variants of both. Then, we augment each base variant with a per-article knowledge graph. This graph is extracted from the input document through a pipeline similar to KGGen, based on subject-predicate-object triples. We test all eight methods (four base and four graph-augmented) on fifteen LLMs and eight multi-label datasets across different domains. For the base framework, keyword-enhanced classification (AK) is the best-performing method, and six out of fifteen LLMs surpass the sentence-encoder baseline. Graph augmentation has positive and negative impacts on small and large models, respectively. This shows that larger models already contain enough relational information from pretraining. Furthermore, the self-consistency decoding variant does not show performance improvements in any experiment while increasing computation costs about fivefold.

2605.30462 2026-06-01 cs.LG cs.AI

idSCD: Identifying Training Datasets through Semantic Correlation Descriptors

idSCD: 通过语义相关描述符识别训练数据集

Andrada Gobeaja, Ionut Hodoroaga, Elena Burceanu, Marius Leordeanu

发表机构 * POLITEHNICA University of Bucharest(布加勒斯特理工大学) Bitdefender, Romania(罗马尼亚Bitdefender公司) Institute of Mathematics of the Romanian Academy(罗马尼亚科学院数学研究所)

AI总结 提出基于语义相关描述符(SCD)的白盒方法,通过模型学习到的语义相关结构识别训练数据集中的成员关系,在多个实验设置中优于现有基线方法。

Comments 16 pages, 3 figures

详情
AI中文摘要

一个数据集能否通过其在训练过程中引起的虚假相关性被识别?我们认为,数据集会在模型学习的语义相关结构中留下特定于数据集的痕迹:在数据集中具有预测性但对底层任务非因果的偶然规律性,可能在训练过程中被内化。我们利用这一洞察研究数据集级别的成员推断,超越了依赖置信度分数、损失、边际、生成样本或查询响应等行为或分布证据的现有方法。我们引入了一种基于语义相关描述符(SCD)的白盒语义指纹方法,该方法捕获模型学习的语义相关结构,并使其在不同数据集混合中具有可比性。在受控的留一数据集诊断中,SCD恢复了数据集特定的变化,并完美区分匹配与非匹配的数据集对。然后,我们提出了一种实用的基于SCD的成员分数,该分数仅使用模型的SCD和目标数据集的独立SCD来测试目标数据集是否是模型训练混合的一部分,无需留一数据集模型。在三个不同的实验设置中,包括自然语言推理、情感分类和医学文本分类的数据集组,我们测试了基于SCD的成员推断在不同程度的语义分离和数据集划分之间的关键词支持下的优势和局限性。平均而言,基于该分数的分类器实现了最高的性能和最低的标准差,优于黑盒基线RMIA、Attack-P和LiRA,以及白盒基线SIF。这些结果表明,数据集成员可以通过内部语义相关性进行追踪,当数据集组暴露不同的语义特性时,ROC-AUC的最大相对增益超过60%。

英文摘要

Can a dataset be recognized from the spurious correlations it induces during training? We argue that datasets leave dataset-specific traces in a model's learned semantic correlation structure: incidental regularities that are predictive within a dataset, but not causal for the underlying task, can be internalized during training. We use this insight to study dataset-level membership inference, moving beyond existing methods that rely on behavioral or distributional evidence such as confidence scores, losses, margins, generated samples, or query responses. We introduce a white-box semantic fingerprinting approach based on semantic correlation descriptors (SCDs), which capture the semantic correlation structure learned by a model and make it comparable across dataset mixtures. In a controlled leave-one-dataset-out diagnostic, SCDs recover dataset-specific changes and perfectly separate matching from non-matching dataset pairs. We then propose a practical SCD-based membership score that tests whether a target dataset is part of a model's training mixture using only the model's SCD and the target dataset's standalone SCD, without requiring leave-one-dataset-out models. Across three diverse experimental settings, with dataset groups for natural language inference, emotion classification, and medical text classification, we test both the advantages and limitations of SCD-based membership inference with different degrees of semantic separation and keyword support between dataset splits. On average, the classifier based on this score achieves the highest performance and the lowest standard deviation, outperforming the black-box baselines RMIA, Attack-P, and LiRA, as well as the white-box SIF baseline. These results show that dataset membership can be traced through internal semantic correlations, with the largest relative gain exceeding 60% in ROC-AUC when dataset groups expose distinct semantic particularities.

2605.30461 2026-06-01 cs.LG cs.AI

Scalable Constrained Multi-Agent Reinforcement Learning via State Augmentation and Consensus for Separable Dynamics

通过状态增强和共识实现可分离动力学的可扩展约束多智能体强化学习

Santiago Amaya-Corredor, Miguel Calvo-Fullana, Anders Jonsson

发表机构 * Department of Engineering University Pompeu Fabra(工程系庞培法布拉大学)

AI总结 提出一种结合状态增强策略学习与对偶变量分布式共识的分布式约束多智能体强化学习方法,解决可分离动力学系统中全局资源约束的协调问题,实现线性可扩展性并保证约束满足。

Comments 17 pages, 8 figures, 3 tables. Plus appendix

详情
AI中文摘要

我们提出了一种用于约束多智能体强化学习(MARL)的分布式方法,该方法将状态增强策略学习与对偶变量的分布式共识相结合。我们的方法针对智能体具有可分离动力学但必须协调以满足全局资源约束的系统,正如我们通过实验证明的,在这种设置下,独立学习无法产生可行解,因为智能体无法确定各自对集体约束满足的适当贡献。关键技术贡献在于证明,对拉格朗日乘子进行轻量级邻居到邻居共识足以实现全局协调的约束执行,同时保持独立训练的可扩展性。每个智能体离线学习一个单一的增强策略,该策略以其局部状态和编码约束反馈的对偶变量为条件。在执行过程中,智能体仅通过局部通信就该对偶变量达成共识。我们证明,在温和的连通性假设下,智能体乘子之间的共识误差是有界的,并且表明这转化为有界的约束违反,该违反随图连通性和共识轮次增加而减小。与集中训练分散执行(CTDE)方法相比,后者的复杂度至少随智能体数量呈二次增长,而我们的方法在训练和执行中均呈线性扩展。在智能电网需求响应上的实验表明,共识协调对于可行性至关重要:没有共识,智能体只能通过无限期推迟需求来满足电网容量约束,这是一种退化的非解。有了共识,智能体收敛到共享的对偶变量,并同时满足电网约束和需求满足,可扩展到数千个智能体,而CTDE基线仅能处理数十个。

英文摘要

We present a distributed approach for constrained Multi-Agent Reinforcement Learning (MARL) that combines state-augmented policy learning with distributed consensus over dual variables. Our method targets systems where agents have separable dynamics but must coordinate to satisfy global resource constraints, a setting in which, as we demonstrate empirically, independent learning fails to produce feasible solutions because agents cannot determine appropriate individual contributions toward collective constraint satisfaction. The key technical contribution is showing that lightweight neighbor-to-neighbor consensus over Lagrange multipliers suffices for globally coordinated constraint enforcement while preserving the scalability of independent training. Each agent learns a single augmented policy offline, conditioned on both its local state and a dual variable encoding constraint feedback. During execution, agents reach agreement on this dual variable through local communication alone. We prove that under mild connectivity assumptions, the consensus error among agents' multipliers is bounded, and show that this translates to a bounded constraint violation that decreases with graph connectivity and the number of consensus rounds. Unlike centralized training with decentralized execution (CTDE) approaches, whose complexity grows at least quadratically with agent count, our method scales linearly in both training and execution. Experiments on smart grid demand response demonstrate that consensus coordination is \emph{essential for feasibility}: without it, agents satisfy grid capacity constraints only by indefinitely postponing demand, a degenerate non-solution. With consensus, agents converge to a shared dual variable and satisfy both grid constraints and demand fulfillment, scaling to thousands of agents while CTDE baselines are limited to dozens.
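To illustrate what neighbor-to-neighbor consensus over a dual variable can look like, here is a minimal sketch of one averaging round using Metropolis weights, a standard choice that preserves the network-wide mean on any undirected graph. It is an assumption that a rule of this kind is representative; the abstract does not specify the authors' exact update, and the function name and data layout below are hypothetical.

```python
from typing import Dict, List


def metropolis_consensus_round(
    multipliers: Dict[int, float],
    neighbors: Dict[int, List[int]],
) -> Dict[int, float]:
    """One synchronous consensus round over per-agent dual variables.

    Each agent moves toward its neighbors' values with weight
    1 / (1 + max(deg_i, deg_j)). The graph is assumed undirected:
    j in neighbors[i] if and only if i in neighbors[j].
    """
    degree = {i: len(neighbors[i]) for i in multipliers}
    updated = {}
    for i, lam_i in multipliers.items():
        correction = sum(
            (multipliers[j] - lam_i) / (1 + max(degree[i], degree[j]))
            for j in neighbors[i]
        )
        updated[i] = lam_i + correction
    return updated


def run_consensus(multipliers, neighbors, rounds):
    """Apply several rounds; only local (neighbor) information is used per step."""
    for _ in range(rounds):
        multipliers = metropolis_consensus_round(multipliers, neighbors)
    return multipliers
```

On a connected graph the spread between agents' values shrinks with each round, which matches the abstract's qualitative claim that disagreement decreases with connectivity and the number of consensus rounds.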

2605.30459 2026-06-01 cs.CL

Can LLM Teams Play What? Where? When?

LLM 团队能玩“什么?哪里?何时?”吗?

Anastasia Kotelnikova, Viktor Byzov, Maria Dolzhenkova, Evgeny Kotelnikov

发表机构 * Vyatka State University(维亚特卡国立大学) European University at St. Petersburg(圣彼得堡欧洲大学)

AI总结 本文研究基于团队的交互策略(投票、静默团队、健谈团队)能否提升大语言模型在“什么?哪里?何时?”问答游戏中的表现,实验表明团队策略优于单模型基线,准确率最高提升20个百分点,最佳团队达到44.23%,接近人类水平。

Comments Accepted for Dialogue-2026 conference

详情
AI中文摘要

大语言模型(LLM)在需要间接推理、文化知识和协调假设测试的任务上仍然受限。我们研究基于团队的交互是否能提升LLM在“什么?哪里?何时?”(ChGK)问答游戏中的表现,该游戏旨在奖励集体推理。我们引入了三种团队策略:投票、静默团队(队长观察最终答案)和健谈团队(队长观察答案和理由)。为最小化数据泄露,我们在2025年发布的572个ChGK问题数据集上评估这些策略。使用六个近期的大规模开源模型,我们表明基于团队的策略优于单模型基线,准确率提升高达20个百分点。最佳团队达到44.23%的准确率,并在有人类统计数据的题目上接近人类团队表现。模型间多样性分析显示,分歧强烈预测较低准确率,但解释性沟通显著缓解性能下降。我们进一步检查队长行为,未发现自我偏好偏差的证据;访问同伴理由提升了队长判断。总体而言,LLM团队主要作为答案选择和错误过滤机制,而非新解决方案的生成器。我们的发现强调了交互的重要性,并表明自适应策略是多智能体系统的一个有前景的方向。

英文摘要

Large language models (LLMs) remain limited on tasks requiring indirect reasoning, cultural knowledge, and coordinated hypothesis testing. We investigate whether team-based interaction improves LLM performance in What? Where? When? (ChGK), a quiz game designed to reward collective reasoning. We introduce three team strategies: Voting, Silent Team (the captain observes final answers), and Talkative Team (the captain observes both answers and rationales). To minimize data leakage, we evaluate these strategies on a dataset consisting of 572 ChGK questions released in 2025. Using six recent large-scale open models, we show that team-based strategies outperform single-model baselines, yielding gains of up to 20 percentage points in accuracy. The best team achieves 44.23% accuracy, and approaches human team performance on questions with available human statistics. Analysis of inter-model diversity reveals that disagreement strongly predicts lower accuracy, but explanatory communication substantially mitigates performance drops. We further examine captain behavior and find no evidence of self-preference bias; access to peer rationales improves captain judgments. Overall, LLM teams function primarily as answer selection and error-filtering mechanisms rather than generators of novel solutions. Our findings highlight the importance of interaction and suggest adaptive strategies as a promising direction for multi-agent systems.
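Of the three strategies, Voting is the simplest to picture. The sketch below shows one plausible reading: each model submits a final answer, answers are lightly normalized, and the most frequent one wins, with ties broken in favor of the earliest submitted answer. The normalization and the tie-breaking rule are assumptions made for illustration; the abstract does not describe how the authors handle either.

```python
from collections import Counter
from typing import Sequence


def normalize(answer: str) -> str:
    """Illustrative normalization: trim whitespace and lowercase."""
    return answer.strip().lower()


def majority_vote(answers: Sequence[str]) -> str:
    """Return the most common normalized answer.

    Ties go to the answer that appears first in `answers`
    (an assumed rule, chosen so the result is deterministic).
    """
    if not answers:
        raise ValueError("at least one answer is required")
    normalized = [normalize(a) for a in answers]
    counts = Counter(normalized)
    top = max(counts.values())
    for answer in normalized:
        if counts[answer] == top:
            return answer
    raise AssertionError("unreachable: some answer always attains the top count")
```

The Silent Team and Talkative Team strategies replace this fixed rule with a captain model that sees the answers (and, for the latter, the rationales), which is consistent with the abstract's framing of teams as answer-selection mechanisms.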

2605.30452 2026-06-01 cs.LG cs.AI math.OC

A Unified Framework for Gradient Aggregation in Multi-Objective Optimization

多目标优化中梯度聚合的统一框架

Zeou Hu, Kelvin Ho, Yaoliang Yu

发表机构 * Cheriton School of Computer Science(切尔顿计算机科学学院) University of Waterloo(滑铁卢大学) Vector Institute(向量研究所) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出一个统一框架,通过充分对齐条件建立梯度聚合方法的收敛率,并引入基于CVaR的capped MGDA算法,在对抗联邦学习中验证鲁棒性。

详情
AI中文摘要

许多机器学习问题涉及多个固有的权衡,最好通过基于梯度的多目标优化(MOO)算法来解决。现有方法通常基于不同的动机提出,逐个案例进行分析,并且在每一步中如何聚合分量梯度在算法上有所不同。在这项工作中,我们为MOO中的梯度聚合开发了一个统一框架,建立了收敛到帕累托平稳性(MOO的标准性能度量)的(最优)速率。我们分析的核心是一个充分对齐条件,由此我们推导出一个定理,表明当在梯度的凸包内选择时,非冲突方向构成了收敛的基本充分条件。我们进一步表明,通过对偶锥上的投影可以确保可行性,从而拓宽了具有收敛保证的方法的范围。同时,我们提出了梯度聚合的原始优化视角,该视角涵盖了已有算法,阐明了它们的理论关系,并能够设计新的变体。作为示例,我们引入了capped MGDA,它基于CVaR公式推导而来,并展示了其在对抗联邦学习中的鲁棒性。最后,我们通过在合成问题和实际基准上的实验验证了我们的理论。

英文摘要

Many machine learning problems involve multiple inherent trade-offs that are best addressed by gradient-based multi-objective optimization (MOO) algorithms. Existing methods are often proposed with various motivations, analyzed case by case, and differ algorithmically in how the component gradients are aggregated at each step. In this work, we develop a unifying framework for gradient aggregation in MOO, establishing (optimal) rates of convergence to Pareto stationarity, the standard measure of performance in MOO. Central to our analysis is a sufficient alignment condition, from which we derive a theorem showing that non-conflicting directions, when chosen within the convex hull of gradients, form a fundamental sufficient condition for convergence. We further show that feasibility can be ensured through projection onto the dual cone, broadening the scope of methods that admit convergence guarantees. In parallel, we present a primal optimization perspective of gradient aggregation that encompasses established algorithms, clarifies their theoretical relationships, and enables the design of new variants. As an illustration, we introduce capped MGDA, derived from a CVaR-based formulation, and demonstrate its robustness in adversarial federated learning. Finally, we validate our theory through experiments on synthetic problems and practical benchmarks.
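For readers unfamiliar with gradient aggregation, a small sketch of the classic two-objective MGDA step may help. It picks the minimum-norm point in the convex hull of two gradients, which has a closed form, and the result has a non-negative inner product with both gradients, so it does not conflict with either objective. This shows plain MGDA only, under the assumption of exactly two objectives; it is not the capped MGDA variant or the general framework proposed in the paper.

```python
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def mgda_two_objectives(g1: Sequence[float], g2: Sequence[float]) -> List[float]:
    """Minimum-norm convex combination gamma * g1 + (1 - gamma) * g2.

    Closed form: gamma = clip(((g2 - g1) . g2) / ||g1 - g2||^2, 0, 1).
    A zero result means the point is Pareto stationary.
    """
    diff = [b - a for a, b in zip(g1, g2)]  # g2 - g1
    denom = dot(diff, diff)
    if denom == 0.0:
        return list(g1)  # identical gradients: any combination gives g1
    gamma = min(1.0, max(0.0, dot(diff, g2) / denom))
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(g1, g2)]
```

A descent step would then move the parameters along the negative of the returned direction.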

2605.30451 2026-06-01 cs.LG

VeriGate: Verifier-Gated Step-Level Supervision for GRPO

VeriGate: 验证器门控的步骤级监督用于GRPO

Aakriti Agrawal, Minghui Liu, Furong Huang

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 提出VeriGate方法,通过验证器门控的步骤级监督扩展GRPO,解决稀疏奖励和信用分配问题,在多个推理基准上显著提升准确率。

详情
AI中文摘要

组相对策略优化(GRPO)是一种有效的训练推理模型的方法,使用基于验证器的结果奖励,但其监督是稀疏的:当针对某个提示的所有采样轨迹获得相同的验证器奖励时,组相对优势会坍缩为零,学习停滞。仅结果奖励也不提供步骤级信用分配,限制了探索,使得学习稳健推理更加困难。我们提出了VeriGate(验证器门控步骤级GRPO),这是GRPO的一种验证器门控扩展,通过三个设计选择解决了这些限制。首先,每当验证器奖励在采样轨迹之间诱导出有意义的偏好时,VeriGate让验证器负责,并且仅在验证器奖励退化时使用过程监督。其次,VeriGate不将过程奖励模型(PRM)的步骤分数坍缩为单个轨迹奖励,而是将其转换为未来累积奖励,以分配延续感知的信用。第三,VeriGate将这些奖励转换为组归一化的令牌级优势,恢复信息丰富的梯度和细粒度的信用分配,同时相比优化聚合PRM分数的方法,对奖励黑客攻击的敏感性更低。实验上,在MATH上使用1.5B和7B Qwen2.5-Instruct模型进行训练,并在六个推理基准上评估,VeriGate将1.5B和7B模型的平均准确率分别提高了约20%和12%,显著减少了零梯度失败,降低了奖励黑客行为,并相对于仅结果GRPO和PRM作为结果的基线提高了推理质量。

英文摘要

Group Relative Policy Optimization (GRPO) is an effective recipe for training reasoning models with verifier-based outcome rewards, but its supervision is sparse: when all sampled trajectories for a prompt receive the same verifier reward, the group-relative advantage collapses to zero and learning stalls. Outcome-only rewards also provide no step-level credit assignment, limiting exploration and making it harder to learn robust reasoning. We present VeriGate (Verifier-Gated Step-Level GRPO), a verifier-gated extension of GRPO that addresses these limitations with three design choices. First, VeriGate keeps the verifier in charge whenever verifier rewards induce a meaningful preference among sampled trajectories, and uses process supervision only when verifier rewards are degenerate. Second, instead of collapsing Process Reward Model (PRM) step scores into a single trajectory reward, VeriGate converts them into future-cumulated rewards to assign continuation-aware credit. Third, VeriGate transforms these rewards into group-normalized token-level advantages, restoring informative gradients and fine-grained credit assignment while remaining less susceptible to reward hacking than methods that optimize aggregated PRM scores. Empirically, training on MATH with 1.5B and 7B Qwen2.5-Instruct models and evaluating on six reasoning benchmarks, VeriGate improves average accuracy by about 20% and 12% for 1.5B and 7B models respectively, substantially reduces zero-gradient failures, decreases reward-hacking behavior, and improves reasoning quality relative to outcome-only GRPO and PRM-as-outcome baselines.
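The zero-advantage collapse described above is easy to reproduce numerically. The sketch below computes group-relative advantages in the usual GRPO style (reward minus group mean, divided by group standard deviation) and also shows one way to turn per-step scores into future-cumulated rewards. The suffix-sum form and the optional discount `gamma` are assumptions made for illustration; the abstract does not give VeriGate's exact formula, and both function names are hypothetical.

```python
import statistics
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each trajectory's reward against its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def future_cumulated(step_scores: Sequence[float], gamma: float = 1.0) -> List[float]:
    """Reward-to-go per step: score at step t plus (discounted) later scores."""
    out: List[float] = []
    running = 0.0
    for score in reversed(step_scores):
        running = score + gamma * running
        out.append(running)
    return out[::-1]
```

When every trajectory in a group receives the same verifier reward, `group_relative_advantages` returns all zeros, so the policy gradient vanishes for that prompt. That degenerate case is the one in which, according to the abstract, VeriGate falls back on process supervision.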

2605.30448 2026-06-01 cs.LG cs.CL

Bounded Behavioral Indistinguishability for Black-Box LLM Distillation

黑盒大语言模型蒸馏的有界行为不可区分性

Munawar Hasan

发表机构 * Michigan Technological University(密歇根理工大学)

AI总结 针对黑盒LLM蒸馏,提出有界行为不可区分性形式化定义,并通过对抗评估揭示语义相似性不足以保证行为不可区分性。

详情
AI中文摘要

黑盒大语言模型蒸馏通常被评估为输出匹配问题:当学生模型的响应与教师模型在语义上相似或任务一致时,即认为学生模型成功。然而,输出相似性并不意味着学生模型与其模仿的模型在行为上不可区分。我们引入了有界行为不可区分性,形式化为在显式提示分布上的$(ε,q,t,\mathbb{A})$-行为不可区分性,其中$ε$限制区分优势,$q$限制预言机查询次数,$t$限制计算量,$\mathbb{A}$表示对手类别。我们在Qwen和Llama教师-学生对上使用受控的$5,000$提示行为探测套件实例化该概念。对于每个系列,我们比较教师模型与基础学生模型以及LoRA蒸馏学生模型,衡量蒸馏是否降低了可区分性而不仅仅是提高了相似性。LoRA将Qwen的语义相似性从$0.788$提升至$0.862$,Llama从$0.814$提升至$0.874$。然而,对抗评估揭示了剩余的行为差异:学习到的判别器保持非零优势,成对类别分析显示伪影集中在风格/格式、鲁棒性和领域技术提示中。成对教师识别对手证实了这一趋势。使用不同系列的Llama评判器和A/B交换一致性过滤,Qwen的区分优势从基础学生模型的$0.158$下降到LoRA蒸馏后的$0.081$。查询预算实验表明,分歧引导的采集并不始终优于分层随机采样,表明覆盖率和多样性仍然是强基线。我们的结果表明,语义保真度有用但不足:黑盒大语言模型蒸馏需要有界、对抗性和类别感知的评估。

英文摘要

Black-box LLM distillation is usually evaluated as an output-matching problem: a student is considered successful when its responses are semantically similar to, or task-consistent with, those of a teacher. However, output similarity does not imply that the student is behaviorally indistinguishable from the model it imitates. We introduce bounded behavioral indistinguishability, formalized as $(ε,q,t,\mathbb{A})$-behavioral indistinguishability over an explicit prompt distribution, where $ε$ bounds distinguishing advantage, $q$ bounds oracle queries, $t$ bounds computation, and $\mathbb{A}$ denotes the adversary class. We instantiate this notion on Qwen and Llama teacher-student pairs using a controlled $5,000$-prompt behavioral probe suite. For each family, we compare the teacher with both the base student and the LoRA-distilled student, measuring whether distillation reduces distinguishability rather than merely improving similarity. LoRA raises semantic similarity from $0.788$ to $0.862$ for Qwen and from $0.814$ to $0.874$ for Llama. Yet adversarial evaluation reveals remaining behavioral differences: learned discriminators retain nonzero advantage, and pairwise category analysis shows artifacts concentrated in style/format, robustness, and domain-technical prompts. A pairwise teacher-identification adversary confirms this trend. With a different-family Llama judge and A/B-swap consistency filtering, Qwen distinguishing advantage drops from $0.158$ for the base student to $0.081$ after LoRA distillation. Query-budget experiments show that disagreement-guided acquisition does not consistently outperform stratified random sampling, indicating that coverage and diversity remain strong baselines. Our results show that semantic fidelity is useful but insufficient: black-box LLM distillation requires bounded, adversarial, and category-aware evaluation.
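One way to read the $(ε,q,t,\mathbb{A})$ notion is by analogy with distinguishing advantage in cryptography. The following is a hedged sketch of such a definition, not a quotation of the paper: the teacher $T$, student $S$, prompt distribution $\mathcal{D}$, and the oracle notation $A^{M}$ are symbols introduced here for illustration, and the authors' formal statement may differ in its details.

```latex
% Assumed notation: T = teacher, S = student, D = prompt distribution,
% A^{M} = adversary A with black-box query access to model M.
\[
  \mathrm{Adv}_{A}(T, S)
  \;=\;
  \Bigl|\,
    \Pr_{x \sim \mathcal{D}}\bigl[A^{T}(x) = 1\bigr]
    \;-\;
    \Pr_{x \sim \mathcal{D}}\bigl[A^{S}(x) = 1\bigr]
  \,\Bigr|
\]
\[
  S \text{ is } (\varepsilon, q, t, \mathbb{A})\text{-indistinguishable from } T
  \quad\text{if}\quad
  \mathrm{Adv}_{A}(T, S) \le \varepsilon
  \;\text{ for every } A \in \mathbb{A}
  \text{ making at most } q \text{ queries and running in time at most } t .
\]
```

Under this reading, the reported drop in Qwen distinguishing advantage from $0.158$ to $0.081$ means the tested adversary became less able to tell teacher from student after LoRA distillation, while the nonzero value shows the two remain distinguishable.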

2605.30447 2026-06-01 cs.LG cs.AI stat.ML

Calibrated Preference Learning: The Case of Label Ranking

校准偏好学习:以标签排序为例

Santo M. A. R. Thies, Viktor Bengs, Timo Kaufmann, Sebastian J. Vollmer, Eyke Hüllermeier

发表机构 * Munich Center for Machine Learning, Munich (MCML), Germany(慕尼黑机器学习中心,慕尼黑(MCML),德国)

AI总结 针对概率标签排序问题,形式化定义了校准概念并建立层次体系,通过理论证明和实验验证了不同校准概念的关系及现有模型的校准缺陷。

详情
AI中文摘要

校准,即预测概率与真实结果频率的对齐,对于可靠决策至关重要。尽管在分类和回归中已有广泛研究,但校准尚未在概率标签排序中得到正式处理,其目标是预测标签集排序上的分布。将排序视为类别会忽略其结构,并无法捕捉成对和top-k预测等重要模态。我们形式化了标签排序的校准,并建立了一个涵盖完整排序、子排序和top-k排序的概念层次。我们证明完整排序校准蕴含其他校准,但反之不成立,且子排序和top-k校准不可比较。实验发现,流行的标签排序模型通常校准不良,子排序和top-k指标之间存在显著差异。将我们的框架应用于RLHF奖励模型,发现校准与基准准确性强相关但不完全一致,表明它捕捉了超越top-1准确性的有意义的质量维度。这些发现激励了未来关于理解误校准的下游影响以及开发纠正方法的工作。

英文摘要

Calibration, the alignment of predicted probabilities with true outcome frequencies, is essential for reliable decision-making. While extensively studied for classification and regression, calibration has not been formally addressed for probabilistic label ranking, where the goal is to predict a distribution over orderings of a label set. Naively treating rankings as classes ignores their structure and fails to capture important modalities such as pairwise and top-k predictions. We formalize calibration for label ranking and develop a hierarchy of notions covering full rankings, sub-rankings, and top-k rankings. We prove that full-rank calibration implies the others but not conversely, and sub-ranking and top-k calibration are incomparable. Empirically, we find popular label ranking models are often poorly calibrated, with substantial differences between sub-ranking and top-k metrics. Applying our framework to RLHF reward models, we find that calibration correlates strongly but not perfectly with benchmark accuracy, suggesting it captures a meaningful quality dimension beyond top-1 accuracy. These findings motivate future work on understanding the downstream effects of miscalibration and developing methods to correct it.

2605.30444 2026-06-01 cs.CV

Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation

Dex2HOI: 灵巧双手双物体交互生成

Chrysa Pratikaki, Pablo Ruiz-Ponce, Jiankang Deng, Stefanos Zafeiriou, Rolandos Alexandros Potamias

发表机构 * Imperial College London, UK(伦敦帝国学院) University of Alicante, Spain(阿利坎特大学)

AI总结 提出Dex2HOI统一扩散模型,通过双流扩散和运动融合网络,实现从文本生成单/双物体灵巧双手交互,速度提升达540倍。

详情
AI中文摘要

近期4D人-物体交互(HOI)生成的进展使得运动合成越来越逼真,特别是对于单物体操作。然而,当前研究忽视了人类行为的一个固有特性:人们自然地协调双手并同时操作多个物体。为填补这一空白,我们提出了Dex2HOI,一个用于从文本合成单物体和双物体HOI的统一扩散模型。其核心采用双流扩散方法,每个物体在专用交互流中处理,并通过双向交叉注意力进行协调。为了合成最终运动,我们引入了一个运动融合网络,该网络集成了新颖的相对于手的物体表示和应用于整个序列的接触感知条件。通过在带前缀条件的窗口上自回归采样扩散过程,Dex2HOI以实时速度生成任意长的序列,省略了冗余的测试时优化,相比先前最先进方法实现了高达540倍的推理加速。在单物体和双物体基准上的广泛评估展示了最先进的定量结果,标志着超越传统单物体HOI生成、向表达性多物体操作迈出的一步。代码和模型将在接收后发布。

英文摘要

Recent advances in 4D Human-Object Interaction (HOI) generation have enabled increasingly realistic motion synthesis, particularly for single-object manipulation. Yet current research overlooks an inherent property of human behavior: people naturally coordinate both hands and manipulate multiple objects simultaneously. To address this gap, we present Dex2HOI, a unified diffusion model for single- and two-object HOI synthesis from text. At its core, Dex2HOI employs a Dual-Stream Diffusion approach, where each object is processed in a dedicated interaction stream and coordinated through bidirectional cross-attention. To synthesize the final motion, we introduce a Motion Fusion Network integrated with novel hand-relative object representations and contact-aware conditioning applied across the whole sequence. By sampling the diffusion process autoregressively over prefix-conditioned windows, Dex2HOI generates arbitrarily long sequences at real-time speed without redundant test-time optimization, achieving up to a 540x inference speed-up over prior state-of-the-art methods. Extensive evaluation on both single- and two-object benchmarks demonstrates state-of-the-art quantitative results, marking a step beyond conventional single-object HOI generation and toward expressive multi-object manipulation. Code and models will be released upon acceptance.

2605.30434 2026-06-01 cs.LG cs.AI cs.CL cs.MA

LongDS-Bench: On the Failure of Long-Horizon Agentic Data Analysis

LongDS-Bench:关于长周期智能数据分析的失败

Kewei Xu, Xiaoben Lu, Shuofei Qiao, Zihan Ding, Haoming Xu, Lei Liang, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团) Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph(知识图谱联合实验室)

AI总结 提出LongDS基准,用于评估长周期多轮数据分析中智能体维护和更新分析状态的能力,发现最佳模型平均准确率仅48.45%,且长周期错误占失败原因的52%-69%。

Comments Ongoing work

详情
AI中文摘要

现实世界的数据分析本质上是迭代的,然而现有基准大多评估孤立或短期的交互任务,未能测试智能体在长周期内跟踪不断变化的分析上下文的能力。我们引入了LongDS,一个用于长周期、多轮数据分析的基准,其中智能体必须维护、更新、恢复和组合不断变化的分析状态。LongDS包含68个从真实世界Kaggle笔记本构建的任务,涵盖地球科学、商业和教育等六个领域的2,225轮交互。任务围绕状态演化模式(例如反事实扰动、回滚、多状态组合)设计,平均依赖跨度为11.3轮。评估五个最先进模型,我们发现最佳模型仅达到48.45%的平均准确率,性能从早期到后期轮次下降近47个百分点,长周期错误占失败原因的52%-69%。进一步分析表明,额外的智能体步骤并不一定能提高性能,这表明关键瓶颈在于维护正确的分析状态,而非增加交互预算。我们发布LongDS以支持可靠的长周期智能数据分析研究。代码和数据将在https://github.com/zjunlp/DataMind发布。

英文摘要

Real-world data analysis is inherently iterative, yet existing benchmarks mostly evaluate isolated or short interactive tasks, leaving agents' ability to track evolving analytical context over long horizons untested. We introduce LongDS, a benchmark for long-horizon, multi-turn data analysis where agents must maintain, update, restore, and compose evolving analytical states. LongDS comprises 68 tasks constructed from real-world Kaggle notebooks, spanning 2,225 turns across six domains including Geoscience, Business, and Education. Tasks are designed around state-evolution patterns (e.g., counterfactual perturbation, rollback, multi-state composition), with an average dependency span of 11.3 turns. Evaluating five state-of-the-art models, we find that the best model reaches only 48.45% average accuracy, performance drops nearly 47 points from early to late turns, and long-horizon errors account for 52%--69% of failures. Further analysis shows that additional agent steps do not necessarily improve performance, suggesting that the key bottleneck is maintaining a correct analytical state rather than increasing interaction budget. We release LongDS to support research on reliable long-horizon agentic data analysis. Code and data will be released at https://github.com/zjunlp/DataMind.