arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.04853 2026-06-04 cs.RO

Teaching Robots to Say 'I Don't Know' : SENTINEL for Uncertainty-Aware SLAM

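Abhishek S, Badrikanath Praharaj, Sreeram MV

Affiliations: University of California, Berkeley; Stanford University

AI summary: Proposes the SENTINEL framework, which uses geometric scan statistics and cross-modal depth consistency to give low-cost 2D LiDAR a training-free, label-free reliability score, rejecting corrupted scans and falling back to wheel odometry to prevent silent SLAM corruption.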

Comments 6 pages, 10 figures, 3 tables. Accepted at the Uncertainty in Open-World Robotics Workshop, held in conjunction with the International Conference on Robotics and Automation (ICRA 2026)

Abstract

Low-cost 2D LiDARs lack the intensity channel that higher-end sensors use to diagnose measurement failures, yet they are widely used on educational and budget robotics platforms. We present SENTINEL, a training-free, label-free reliability estimation framework that gives range-only LiDAR an effective diagnostic signal. SENTINEL combines geometry-based scan statistics with cross-modal depth consistency between LiDAR and an RGB-D camera to compute a per-scan reliability score between 0 and 1. When the score falls below a threshold, corrupted scans are rejected and the robot falls back to calibrated wheel odometry, preventing silent SLAM corruption. We evaluate SENTINEL on a GEFIER R1 four-wheel skid-steer robot equipped with an RPLidar A2M12 and an Intel RealSense D435i in a 185 cm by 245 cm arena containing controlled transparent and reflective failure elements on a central obstacle. Spatial reliability maps across five surface conditions, including glass, mirror, shiny paper, and a mixed mirror and shiny-paper condition, show clear separation between clean and failure cases, allowing affected regions to be identified as reject or noise. Because these failure modes are absent in simulation, validation is performed entirely on real hardware.
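
The reject-and-fall-back logic described in this abstract can be sketched roughly as follows. The abstract does not say how the geometric and cross-modal terms are computed or combined, so the two statistics, the equal-weight average, the 0.5 threshold, the 0.2 m tolerance, the 12 m range limit, and every function name below are assumptions made for illustration, not SENTINEL's actual method.

```python
import math


def valid_return_ratio(lidar_ranges, max_range_m=12.0):
    """Assumed geometric statistic: share of beams with a finite, in-range return."""
    if not lidar_ranges:
        return 0.0
    valid = sum(1 for r in lidar_ranges if math.isfinite(r) and 0.0 < r <= max_range_m)
    return valid / len(lidar_ranges)


def depth_consistency(lidar_ranges, camera_depths, tol_m=0.2):
    """Assumed cross-modal statistic: share of beams whose LiDAR range agrees
    with the camera depth for the same bearing to within tol_m."""
    pairs = [(l, d) for l, d in zip(lidar_ranges, camera_depths)
             if math.isfinite(l) and math.isfinite(d)]
    if not pairs:
        return 0.0
    agree = sum(1 for l, d in pairs if abs(l - d) <= tol_m)
    return agree / len(pairs)


def reliability_score(lidar_ranges, camera_depths):
    """Per-scan score in [0, 1]; the equal weighting is an assumption."""
    return 0.5 * valid_return_ratio(lidar_ranges) + \
        0.5 * depth_consistency(lidar_ranges, camera_depths)


def select_pose_source(score, threshold=0.5):
    """Reject the scan and fall back to wheel odometry below the threshold."""
    return "lidar_slam" if score >= threshold else "wheel_odometry"
```

In this toy model, a scan that matches the camera depth on every beam scores 1.0 and is passed to SLAM, while a scan taken through glass (missing returns plus ranges that disagree with the camera) scores low and triggers the odometry fallback.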

2606.04847 2026-06-04 cs.CV cs.CL cs.LG

MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

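Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang

Affiliations: Moore Threads AI

AI summary: Proposes MusaCoder, a full-stack training framework that combines progressive data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback reinforcement learning to generate efficient native GPU kernels on CUDA and MUSA backends; the 9B model matches frontier closed-source models and the 27B model sets a new state of the art.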

Abstract

Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.

2606.04846 2026-06-04 cs.CL

Large Language Models in K-12 Education: Alignment with State Curriculum Standards and Student Personas

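Lisa Korver, Tomo Lazovich, Sherief Reda

Affiliations: Department of Computer Engineering; Brown University; Data Science Institute

AI summary: Develops an LLM-based pipeline to evaluate how well different LLMs align with U.S. state history curriculum standards, and uses controlled user-persona experiments to analyze sensitivity to geography, grade level, gender, and race. Models adjust their presentation of historical topics, but possibly because of the states' political leanings; they adapt well to grade level and show little sensitivity to race or gender. The findings point to potential risks to student learning from misalignment between LLMs and curriculum standards.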

Abstract

As Large Language Models (LLMs) become increasingly popular in educational settings, they raise important questions about the ethical implications of their use. Publicly available online chatbots are quickly improving in capability and accuracy, leading to more widespread use, including among students looking for help with their homework. This makes it crucial to consider whether these models are aligned with educational standards. Because curriculum standards in the United States are set at the state level, they differ significantly in required content, emphasis, and narrative focus. In this work, we develop an LLM-based pipeline to identify variations in U.S. History curricula across states and evaluate the extent to which different LLMs reflect these state-specific curricular differences. In addition, we conduct controlled experiments that vary user personas by stating user attributes such as geographic location, grade level, gender, and race to evaluate the sensitivity of LLM responses to user characteristics. We find that while models are able to adjust their presentation of historical topics, these shifts may come from the perceived political leanings of states and do not necessarily reflect actual curriculum content. Additionally, models successfully adapt to a student's grade level while showing minimal sensitivity to race or gender, suggesting they are capable of useful adaptation to student personas with limited demographic bias. Together, these findings highlight the potential risks that open access to LLM chatbots may pose to student learning outcomes through misalignment with state curriculum standards, and they underscore the need for more robust alignment techniques.

2606.04844 2026-06-04 cs.SD cs.CV

Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification

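Tu Vo, Sheir Zaheer, Chan Y. Park

Affiliations: Anonymous Authors

AI summary: Proposes Drift-Augmented Scoring (DAS), which uses text-derived noise-conditioned prompts to predict the drift direction of audio embeddings and adds a per-class bonus score, notably improving zero-shot audio classification accuracy and mAP under noise without gradients or test-time batching.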

Abstract

Contrastive audio-language models such as CLAP enable zero-shot audio classification: a sound is labelled by matching its embedding to text prompt embeddings, with no labelled audio. This matching breaks down under acoustic noise, where accuracy and mAP fall by 12-30 percentage points at 0 dB SNR on standard benchmarks. We propose Drift Augmented Scoring (DAS), a small per-class bonus added to the cosine score. The bonus rewards a class when the noisy audio embedding drifts in the direction that the class's noise-conditioned text prompts predict. It is derived from text alone, computed once and cached, and adds a single inner product per class at inference, with no gradients and no test-time batch. On a LAION CLAP backbone, we compare DAS against the four variants of Acevedo et al.'s concurrent method on UrbanSound8K and the full FSD50K eval set, mixing each clip with urban acoustic scene noise across a range of SNRs. DAS improves the metric on every test condition: by +2.60 to +5.75 accuracy points on UrbanSound8K and +1.50 to +1.74 mAP points on FSD50K.
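
One way to read the scoring rule in this abstract is sketched below: a cached, text-only drift direction per class, and a bonus equal to a single inner product with the audio embedding. The abstract does not give the formula, so defining the drift direction as the normalized difference between noise-conditioned and clean prompt embeddings, the weight `alpha`, and all names are assumptions for illustration only.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def unit(v):
    """Return v scaled to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v] if n > 0 else list(v)


def drift_directions(clean_text_emb, noisy_text_emb):
    """Computed once from text alone and cached: for each class, the unit vector
    pointing from the clean prompt embedding to the noise-conditioned one."""
    return {
        c: unit([n - k for n, k in zip(noisy_text_emb[c], clean_text_emb[c])])
        for c in clean_text_emb
    }


def das_scores(audio_emb, clean_text_emb, drift_dirs, alpha=0.5):
    """Cosine score plus a per-class bonus: one extra inner product per class."""
    a = unit(audio_emb)
    return {
        c: dot(a, unit(t)) + alpha * dot(a, drift_dirs[c])
        for c, t in clean_text_emb.items()
    }
```

With `alpha=0` this reduces to plain cosine scoring, so the bonus only matters when two classes are close under the cosine score and the audio embedding leans toward one class's predicted drift.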

2606.04836 2026-06-04 cs.CV

3D Temporal Analysis for Autism Spectrum Disorder Screening During Attention Tasks

Inam Qadir, Elizabeth B Varghese, Dena Al-Thani, Marwa Qaraqe

Affiliations: College of Science and Engineering, Hamad Bin Khalifa University, Qatar Foundation, Doha, Qatar

AI summary: Proposes a DECA-based 3D temporal analysis framework that extracts head pose and facial expression features and uses LSTM/GRU classifiers for ASD screening during VR-CPT tasks; multimodal fusion reaches 84.6% accuracy.

Abstract

Accurate Autism Spectrum Disorder (ASD) screening for school-age children is crucial to identify cases that may have been missed earlier and to enable timely interventions supporting social, cognitive, and academic development. Current ASD screening relies on subjective assessments and 2D analysis methods that fail to capture spatial displacement patterns characteristic of ASD behaviors. In this study, a novel 3D temporal analysis framework is presented, built on top of DECA (Detailed Expression Capture and Animation), a 3D modeling framework, to extract comprehensive head pose parameters (including translational components $T_x, T_y, T_z$) and facial expressions independent of pose variations. LSTM and GRU-based temporal classifiers were trained on the extracted 3D features from video data collected from 39 participants (19 ASD, 20 TD) aged 7-12 years during Virtual Reality-Continuous Performance Test tasks. The GRU-based models demonstrated superior performance, with 3D head pose features achieving 83.9% accuracy and 3D facial features reaching 81.4% accuracy, outperforming 2D baseline approaches by 10.7% and 7.5%, respectively. Furthermore, multimodal fusion of 3D head pose and facial features with PCA-based dimensionality reduction achieved the highest accuracy of 84.6%, outperforming unimodal approaches. This work establishes a foundation for objective, automated screening tools addressing current diagnostic limitations in ASD identification for school-age populations.

2606.04834 2026-06-04 cs.LG

Prediction Under Imperfect Compression: A Theory of Approximate MDL

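Qian Li, Xinyu Mao, Shang-Hua Teng, Guangxu Yang

Affiliations: Shenzhen Research Institute of Big Data; University of Southern California

AI summary: Studies the conditions under which the Minimum Description Length (MDL) principle still guarantees reliable sequential prediction under approximate optimization, proving robustness to additive slack and characterizing when regularization is necessary.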

Comments 26 pages

Abstract

Minimum Description Length (MDL) formalizes the principle of Occam's razor by optimizing the total description length: $L(\mathrm{model})+L(\mathrm{data} \ | \ \mathrm{model})$. For sequential prediction, the MDL method repeatedly selects a model with the minimum objective score on the observed prefix and uses it for the next-step prediction. Classical MDL prediction theory shows that exact optimization of the MDL objective provides a strong compression guarantee that supports reliable prediction. However, practical machine learning can usually only find models by approximately optimizing the objective function. To bridge this gap, this paper addresses the following fundamental question: Under what forms of approximation and regularization does approximate MDL still guarantee reliable sequential prediction? This work offers a principled characterization. We prove that for any approximation with additive slack $C$ of the more general balanced MDL objective $\lambda\cdot L(\mathrm{model})+L(\mathrm{data} \ | \ \mathrm{model})$, the cumulative expected squared prediction error is finite for all $\lambda\ge1$. The case $\lambda>1$ is proved by an affinity-telescoping argument, while the boundary case $\lambda=1$ is proved by a likelihood-ratio stopping argument based on exact static MDL bounds. Our results establish that classical MDL regularization remains robust to any fixed additive optimization error. Furthermore, we establish that our characterization of the approximate MDL framework is sharp: when $0<\lambda<1$, overfitting can incur infinite cumulative expected error in the universal class of estimable measures, and hence a strong form of model-complexity regularization is necessary. In addition, under multiplicative approximation, model selection may fail in every regularized regime $\lambda>0$; thus, additive approximation is both sufficient and essential.
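
The two notions of approximation contrasted in this abstract can be written out as follows. This is a plausible reading rather than the paper's own definitions: the symbols $x_{1:t}$ (observed prefix), $\mathcal{M}$ (model class), $\hat{m}_t$ (model selected at step $t$), and the factor $(1+\varepsilon)$ are notation assumed here for illustration.

```latex
% Assumed notation: x_{1:t} is the observed prefix, \mathcal{M} the model class,
% \hat{m}_t the model selected at step t.

% Additive slack C: the selected model is within C of the optimal balanced score.
\[
\lambda \cdot L(\hat{m}_t) + L(x_{1:t} \mid \hat{m}_t)
\;\le\;
\min_{m \in \mathcal{M}} \bigl[ \lambda \cdot L(m) + L(x_{1:t} \mid m) \bigr] + C
\]

% Multiplicative approximation (assumed form): within a constant factor of the optimum.
\[
\lambda \cdot L(\hat{m}_t) + L(x_{1:t} \mid \hat{m}_t)
\;\le\;
(1+\varepsilon) \cdot \min_{m \in \mathcal{M}} \bigl[ \lambda \cdot L(m) + L(x_{1:t} \mid m) \bigr]
\]
```

Under this reading, the additive error stays bounded by $C$ as the prefix grows, whereas the multiplicative error grows with the optimal score itself, which is consistent with the abstract's claim that only the additive form preserves the guarantee.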

2606.04829 2026-06-04 cs.RO

M3imic: Learning a Versatile Whole-Body Controller for Multimodal Motion Mimicking

Zuxing Lu, Ziang Zheng, Yao Lyu, Jingyu Liu, Feihong Zhang, Song Lu, Xin Yuan, Changyin Sun, Xingxing Zuo, Shengbo Eben Li

Affiliations: School of Automation, Southeast University; School of Vehicle and Mobility, Tsinghua University; Department of Robotics, Mohamed Bin Zayed University of Artificial Intelligence

AI summary: Proposes the M3imic framework, which uses modality-specific encoders to map heterogeneous motion reference modalities (robot joint angles, human pose trajectories, end-effector poses) into a shared latent space and trains a single policy with large-scale reinforcement learning, achieving sim-to-real transfer without modality-specific retraining.

Abstract

Building a general-purpose whole-body controller is essential for enabling diverse motion capabilities in humanoid robots across a wide range of downstream tasks, including locomotion and loco-manipulation. Different tasks rely on distinct motion reference modalities: locomotion primarily depends on coordinated robot joint trajectories, whereas manipulation requires precise end-effector trajectory tracking. Existing methods often overlook the representational mismatch between dense robot joint angles and sparse end-effector poses. To address this, we propose Multi-Modal Mimic (M3imic), a versatile multi-modal whole-body control framework that unifies heterogeneous motion reference modalities, including robot joint angles, human pose trajectories, and end-effector poses, using modality-specific encoders to map them into a shared latent space. Leveraging large-scale reinforcement learning in the simulator, we train a single policy that achieves sim-to-real transfer across multiple motion reference modalities without modality-specific retraining. Extensive simulation and real-world experiments on the Unitree G1 robot are conducted to evaluate the proposed framework. In simulation, the policy achieves a peak success rate of 98.42% on an unseen test dataset, demonstrating its exceptional generalization capability. The code is available at https://github.com/Renforce-Dynamics/MultiModalWBC

2606.04828 2026-06-04 cs.CL

A French Corpus Annotated for Multiword Expressions with Adverbial Function

Eric Laporte, Takuya Nakamura, Stavroula Voyatzi

Affiliations: Université Paris-Est; Institut Gaspard-Monge - LabInfo

AI summary: Presents a French corpus annotated for adverbial multiword expressions, intended to support research on information retrieval, information extraction, and deep and shallow syntactic parsing.

Journal ref
Language Resources and Evaluation Conference (LREC), Linguistic Annotation Workshop, 2008, Marrakech, Morocco, pp.48-51

Abstract

This paper presents a French corpus annotated for multiword expressions (MWEs) with adverbial function. This corpus is designed for research on information retrieval and extraction, as well as on deep and shallow syntactic parsing. We delimit the kinds of MWEs we annotated, describe the resources and methods we used for the annotation, and briefly comment on the results. The annotated corpus is available at http://infolingu.univ-mlv.fr/ under the LGPLLR license.

2606.04825 2026-06-04 cs.RO

HapTile: A Haptic-Informed Vision-Tactile-Language-Action Dataset for Contact-Rich Imitation Learning

Amirhosein Alian, Yongqiang Zhao, Shiyi Gu, Xuyang Zhang, Zhuo Chen, Christopher E. Mower, Haitham Bou-Ammar, Shan Luo

Affiliations: King’s College London, UK; Huawei, Noah’s Ark Lab, UK; University College London, UK

AI summary: Presents the HapTile dataset, which integrates fingertip tactile feedback and operator-side haptic perception to provide joint vision-tactile-language-action data for contact-rich robot manipulation tasks, and validates its usefulness for policy learning.

Abstract

Despite the importance of tactile sensing for reliable manipulation, most existing Vision-Language-Action (VLA) datasets remain vision-only, and those that do incorporate tactile information typically lack the joint combination of task diversity, language conditioning, and action trajectories. Furthermore, existing teleoperation pipelines rarely provide haptic feedback to the operator, despite its established role in demonstration quality and manipulation stability. In this work, we present HapTile, a contact-grounded visuotactile manipulation dataset that advances beyond vision-only trajectory datasets by embedding physical interaction sensing at two levels: fingertip tactile feedback at the robot end-effector, and haptic-informed demonstrations at the teleoperator side. The data collection platform integrates haptic feedback directly into the teleoperation controller, enabling the operator to perceive contact interactions in real time. It is built around a standard and reproducible robotic system equipped with custom-designed fingertip tactile sensors. The dataset comprises everyday manipulation tasks spanning a broad range of contact-rich skills, including pick-and-place, folding, pressing, stacking, and other routine activities. Each task is paired with language instructions that condition the policy on the manipulation objective, together with synchronized visuotactile observations and action trajectories. In addition, we provide a benchmarking study on contact-rich policy learning using two baseline models to evaluate the effectiveness of the proposed contact-grounded dataset. The dataset and additional details are available on our website: haptile-dataset.github.io.

2606.04823 2026-06-04 cs.AI cs.CL cs.MA

R-APS: Compositional Reasoning and In-Context Meta-Learning for Constrained Design via Reflective Adversarial Pareto Search

João Pedro Gandarela, Thiago Rios, Stefan Menzel, André Freitas

Affiliations: Idiap Research Institute; École Polytechnique Fédérale de Lausanne; Honda Research Institute Europe; Department of Computer Science, University of Manchester; National Biomarker Centre, CRUK-MI, University of Manchester

AI summary: Proposes R-APS, which uses reasoning-mode decomposition, staged compositional reasoning, sensitivity-guided adversarial testing, and meta-inductive rule extraction to jointly address error propagation, worst-case perturbations, and never-invalidated knowledge for LLMs in agentic settings, achieving tighter robustness certificates and faster iterations on planar mechanism synthesis.

Abstract

Large language models (LLMs) are fluent on open-ended tasks, yet in agentic settings, where a system must plan, use tools, and act over extended horizons, fluency does not ensure reliable delivery. We trace this gap to three coupled structural failures: errors propagate without localization, worst-case perturbations go unevaluated, and accumulated knowledge is never invalidated. We argue these share a root cause: abductive, counterfactual, meta-inductive, corrective, and inductive reasoning pull a shared context in incompatible directions. We introduce Reflective Adversarial Pareto Search (R-APS), to our knowledge the first method addressing all three failures jointly via reasoning-mode decomposition, allocating each reasoning mode its own context and orchestrating interaction across three timescales: staged compositional reasoning with a typed validation critic (failure localization), sensitivity-guided counterfactual stress-testing as a first-class Pareto objective (robustness), and meta-inductive rule extraction with explicit invalidation (persistent memory). R-APS requires no fine-tuning and operates on a frozen LLM purely via structured protocol design. We evaluate on planar mechanism synthesis (robotics, prosthetics, mechanical design), with every candidate checked by a kinematic solver. On 32 target trajectories, R-APS delivers robustness certificates 3.5x tighter than uniform-perturbation baselines, 46% faster iterations-to-first-admission, and 2.1x Chamfer-distance reduction over Enum+GA while jointly controlling bar-count and worst-case robustness. Small 4B reasoning-specialized models prove competitive with general-purpose 70B backbones inside the protocol, suggesting structured protocols can partially offset model scale.

2606.04822 2026-06-04 cs.LG

Reconciling Causality and Non-Equilibrium Thermodynamics with Hamiltonian Causal Models

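Dario Rancati, Max Welling, Francesco Locatello

Affiliations: Institute of Science and Technology Austria; CuspAI; University of Amsterdam

AI summary: Proposes Hamiltonian Causal Models (HCMs), which separate immutable equations of motion from intervenable mechanisms, define path-level causal effects, interface naturally with non-equilibrium thermodynamics, and use entropy production to quantify causal effects.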

Abstract

Causal modeling of physical temporal phenomena must handle interventions that act along trajectories, nonstationary induced laws, path-dependent effects, and feedback mediated by dynamics, all challenging in standard causal models. We introduce Hamiltonian Causal Models (HCMs), a trajectory-level framework in which observed variables interact with local environments and interventions act as controls of Hamiltonian mechanisms. HCMs separate immutable equations of motion from intervenable mechanisms and define causal effects as discrepancies between interventional path laws. A key motivation for HCMs is their natural interface with non-equilibrium thermodynamics. Entropy production quantifies the irreversibility of a process and is a central causal observable: it is estimable from data and witnesses causal effects along the system's evolution that are invisible to endpoint and cumulative versions of the standard average treatment effect. As in physics, cause and effect are not primitives of the relation between two random variables but arise from the non-invertibility of the thermodynamic arrow. With this, our paper reconciles the language of statistical causal models and non-stationary thermodynamics, offering new tools to describe causality in a wide range of physical systems.

2606.04820 2026-06-04 cs.CV cs.AI cs.LG

OA-CutMix: Correcting the Label Bias of CutMix

Tobias Christian Nauen, Stanislav Frolov, Federico Raue, Brian B. Moser, Andreas Dengel

Affiliations: RPTU University Kaiserslautern-Landau; German Research Center for Artificial Intelligence (DFKI)

AI summary: To address the semantic bias caused by CutMix assigning labels by patch area, proposes OA-CutMix, which uses segmentation masks to assign labels according to visible object area, improving classification accuracy without changing the image mixing procedure.

Abstract

CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: that the area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, however, patches frequently land on background regions, assigning label credit to classes whose objects are not visible. The mean discrepancy between the CutMix label and the semantic object area is 21.5%. In 17% of samples, an image contributes zero visible object pixels yet receives nonzero label weight. We propose Object-Aware CutMix (OA-CutMix), which corrects this bias by replacing the area-based CutMix weight with one derived from precomputed segmentation masks, assigning labels in proportion to the visible object area each image contributes to the mix. The image mixing procedure is left entirely unchanged. We evaluate OA-CutMix against 10+ static and dynamic mixing methods across 4 architectures and 6 datasets. OA-CutMix consistently achieves the highest accuracy over all tasks, outperforming even dynamic mixing methods, but at a fraction of the training-time cost. Improvements are largest for small objects, where the label bias from CutMix is greatest. Thus, correcting the label is sufficient to match or exceed the performance of methods modifying the image mixing algorithm.
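
The label correction described in this abstract can be illustrated with a small sketch that compares the area-based CutMix weights with weights derived from binary object masks. The mask representation (nested lists of 0/1), the convention that the patch from image B is pasted at the same coordinates it was cut from, the fallback when no object pixels are visible, and all function names are assumptions for illustration; the paper's exact weighting may differ.

```python
def cutmix_area_weights(height, width, box):
    """Standard CutMix labels: weight of the pasted image B is the patch-area share."""
    top, left, bottom, right = box
    w_b = (bottom - top) * (right - left) / (height * width)
    return 1.0 - w_b, w_b


def object_aware_weights(mask_a, mask_b, box):
    """Weights proportional to the visible object pixels each image contributes.

    mask_a, mask_b: binary object masks (1 = object pixel) for base image A and
    patch source B. Pixels inside `box` come from B, all others from A.
    """
    top, left, bottom, right = box
    height, width = len(mask_a), len(mask_a[0])
    visible_a = visible_b = 0
    for r in range(height):
        for c in range(width):
            inside = top <= r < bottom and left <= c < right
            if inside:
                visible_b += mask_b[r][c]
            else:
                visible_a += mask_a[r][c]
    total = visible_a + visible_b
    if total == 0:
        # Assumed fallback: no object visible at all, so keep the area-based label.
        return cutmix_area_weights(height, width, box)
    return visible_a / total, visible_b / total
```

For a 4 by 4 image whose patch covers only background of image B, the area-based rule still gives B a weight of 0.25, while the object-aware rule gives it 0, which is the failure case the abstract reports for 17% of samples.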

2606.04818 2026-06-04 cs.RO

Real-World Deployment of a 5G-Connected Edge-Controlled Aerial Robot in Industrial Subterranean Mines

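Achilleas Santi Seisa, Emanuele Pagliari, Gerasimos Damigos, Elias Small, George Nikolakopoulos

Affiliations: Robotics and AI Group, Department of Computer, Electrical and Space Engineering, Luleå University of Technology

AI summary: Reports the first deployment, in a real industrial subterranean mine, of a 5G-connected autonomous aerial robot controlled by an edge-offloaded controller; a Model Predictive Controller (MPC) generates smooth, collision-free paths, demonstrating the potential of edge-controlled robotic systems for time-critical, safe, and efficient future deployments.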

Comments 6 pages, 8 figures, MED 2026

Abstract

This article presents the first real-world autonomous flight of a 5G-connected aerial robot controlled by an edge-offloaded controller, and aims to bridge the gap between controlled setups and real deployments. The robot operates within an active industrial subterranean mine, while the high-level controller is deployed in a nearby Kubernetes-based edge cluster. Communication between the robot and the edge is enabled via a 5G New Radio (NR) Standalone (SA) network. The chosen controller is a Model Predictive Controller (MPC), which generates control actions to allow the robot to navigate seamlessly through the mining environment. A human operator selects waypoints for the aerial robot, and the MPC generates smooth, collision-free paths for autonomous execution. The proposed 5G edge-based closed-loop system is evaluated in a real industrial setting and demonstrates the potential of edge-controlled robotic systems for time-critical, safe, and efficient future deployments.

2606.04816 2026-06-04 cs.AI cs.LG

Beyond Objective Equivalence: Constraint Injection for LLM-Based Optimization Modeling on Vehicle Routing Problems

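Xizi Luo, Changhong He, Dongdong Geng, Chenggong Shi, Yu Mei

Affiliations: Beihang University; Baidu Inc.

AI summary: To address LLMs adding spurious constraints or omitting required ones in constraint-dense operations research problems, proposes constraint injection, combines it with differential testing to form a dual verifier, and validates the approach on vehicle routing problems.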

Comments 28 pages

Abstract

Large language models (LLMs) increasingly translate natural-language optimization problems into executable solver code. Yet for constraint-dense operations research (OR) problems, existing data-filtering and training pipelines largely rely on objective-equivalence signals such as differential testing and answer agreement, which a program can pass while adding spurious constraints or silently omitting required ones, whenever those constraints are non-binding on the tested instance. We propose constraint injection, which uses feasible probes to expose spurious over-constraint and one-constraint-violating probes to reveal silent constraint omission. Combined with differential testing, it forms a dual verifier. We instantiate and evaluate it on vehicle routing problems (VRPs), a representative constraint-dense combinatorial optimization testbed with coupled operational constraints. We develop VRPCoder, an 8B end-to-end model that translates natural-language VRP scenarios into Gurobi scripts, together with an expert-verified VRP benchmark suite covering 21 variants. The verifier is reused as a rejection-sampling filter during data synthesis and as a per-rollout reward in group relative policy optimization (GRPO). Across four VRP benchmarks, VRPCoder-GRPO reaches 93% average Pass@1, outperforms Gemini-3.1-Pro Preview on three benchmarks, exceeds Claude-Sonnet-4.5 by 28 average points, and surpasses prior OR-LLMs by 78 average points.
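
The probe logic in this abstract can be sketched on a toy routing model. In the paper the probes are presumably injected into generated solver code; the version below simplifies this to plain feasibility predicates, and the capacity and stop limits, the probe routes, and all names are hypothetical values chosen for illustration.

```python
def run_constraint_probes(candidate_is_feasible, feasible_probes, violating_probes):
    """Check a candidate model's constraint set with two kinds of probes.

    feasible_probes: solutions known to be feasible; rejecting one exposes a
        spurious extra constraint.
    violating_probes: maps a constraint name to a solution that violates only
        that constraint; accepting one exposes a silently omitted constraint.
    """
    rejected_feasible = [p for p in feasible_probes if not candidate_is_feasible(p)]
    omitted = [name for name, p in violating_probes.items() if candidate_is_feasible(p)]
    return {
        "passed": not rejected_feasible and not omitted,
        "rejected_feasible": rejected_feasible,
        "omitted_constraints": omitted,
    }


# Toy problem (hypothetical): a route is a tuple of customer demands.
CAPACITY = 10   # total demand on a route must not exceed this
MAX_STOPS = 3   # a route may visit at most this many customers

FEASIBLE_PROBES = [(3, 4), (5, 5)]
VIOLATING_PROBES = {"capacity": (6, 6), "max_stops": (1, 1, 1, 1)}


def correct_model(route):
    return sum(route) <= CAPACITY and len(route) <= MAX_STOPS


def omits_max_stops(route):
    # Silently drops the stop limit; identical to correct_model on short routes.
    return sum(route) <= CAPACITY


def spurious_tight_capacity(route):
    # Adds an unrequested tighter capacity bound.
    return sum(route) <= 8 and len(route) <= MAX_STOPS
```

On an instance where no optimal route has more than three stops, `omits_max_stops` would return the same objective as `correct_model` and pass an objective-equivalence check; the `max_stops` probe is what separates the two.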

2606.04815 2026-06-04 cs.LG cs.AI

Learning While Acting: A Skill-Enhanced Test-Time Co-Evolution Framework for Online Lifelong Learning Agents

边行动边学习:面向在线终身学习智能体的技能增强测试时协同进化框架

Bo Mao, Jie Zhou, Yutao Yang, Xin Li, Xian Wei, Qin Chen, Xingjiao Wu, Liang He

Affiliations School of Computer Science and Technology, East China Normal University; Shanghai AI Laboratory; Software Engineering Institute, East China Normal University

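AI summary Proposes the LifeSkill framework, which uses verifier-guided skill learning and online skill internalization so that LLM agents keep internalizing feedback at test time, improving lifelong-learning performance.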

详情
AI中文摘要

终身学习对于在动态、交互环境中运行的大型语言模型(LLM)智能体至关重要。然而,现有的用于长时任务的终身学习智能体通常依赖于离散技能或过去经验检索,并在推理期间使用静态参数,这阻止了它们像人类学习者一样持续内化测试时反馈。为弥补这一差距,我们提出了技能增强测试时协同进化(LifeSkill),一个用于在线终身学习智能体的两阶段强化学习框架。具体来说,我们设计了验证器引导的技能学习,通过根据多个技能条件策略滚动的平均验证器成功率奖励候选技能,解决了技能提取缺乏直接监督的问题,鼓励模型生成对解决任务有用的技能,而不仅仅是文本上合理的技能。此外,我们引入了在线技能内化,通过在测试时交互期间将技能条件轨迹转化为奖励信号,持续改进策略模型。这使得智能体能够将推理能力直接内化到其参数中,避免了经验检索的上下文膨胀。在LifelongAgentBench上的实验表明,与现有终身学习智能体基线相比,LifeSkill将平均性能提高了7个绝对百分点。

英文摘要

Lifelong learning is essential for Large Language Model (LLM) agents operating in dynamic, interactive environments. However, existing lifelong learning agents for long-horizon tasks typically depend on retrieving discrete skills or past experiences while keeping parameters static during inference, which prevents them from continuously internalizing test-time feedback like human learners. To bridge this gap, we propose Skill-enhanced Test-Time Co-Evolution (LifeSkill), a two-stage reinforcement learning framework for Online Lifelong Learning Agents. Specifically, we design Verifier-Guided Skill Learning that addresses the lack of direct supervision for skill extraction by rewarding candidate skills according to the average verifier success of multiple skill-conditioned policy rollouts, encouraging the model to generate skills that are useful for solving tasks rather than merely plausible in text. Furthermore, we introduce Online Skill Internalization, which continuously improves the policy model during test-time interaction by transforming skill-conditioned trajectories into reward signals. This enables the agent to directly internalize reasoning capabilities into its parameters, avoiding the context bloat of experience retrieval. Experiments on LifelongAgentBench show that LifeSkill improves average performance by 7 absolute points over existing lifelong agent baselines.

2606.04807 2026-06-04 cs.AI cs.CL cs.CY cs.LG

BiasGRPO: Stabilizing Bias Mitigation in High-Variance Reward Landscapes via Group-Relative Policy Optimization

BiasGRPO:通过组相对策略优化在高方差奖励景观中稳定偏差缓解

Saket Reddy, Ke Yang, ChengXiang Zhai

Affiliations University of Illinois Urbana-Champaign

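AI summary Proposes the BiasGRPO framework, which applies Group Relative Policy Optimization (GRPO) with group-normalized rewards to stabilize social-bias mitigation in large language models, outperforming DPO and PPO.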

Comments Accepted to Findings of the ACL

详情
AI中文摘要

缓解大语言模型(LLMs)中的社会偏差提出了一个独特的对齐挑战:与可验证任务不同,偏差缺乏单一的真实标准,从而产生高方差、主观的奖励景观。先前的基于偏好的微调方法存在主要权衡:直接偏好优化(DPO)受限于离线训练中缺乏探索,而近端策略优化(PPO)由于潜在不可靠的评论家估计可能导致训练不稳定。在本文中,我们提出了BiasGRPO,一个使用组相对策略优化(GRPO)的框架,通过对一组采样完成进行奖励归一化来稳定对齐。通过用组相对基线替代价值函数,我们的方法在保持在线训练探索优势的同时减少了不稳定性。我们发现BiasGRPO在多个基准测试中优于DPO和PPO,表明其有效性。为了适应GRPO,我们综合扩展了一个涵盖多个领域和上下文的数据集。我们还创建并发布了一个定制的偏差奖励模型,该模型在有效指导生成的同时高度计算高效且避免知识退化,提供了一个可无缝集成到多目标RLHF流程中的宝贵资源。

英文摘要

Mitigating social bias in Large Language Models (LLMs) presents a distinct alignment challenge: unlike verifiable tasks, bias lacks a single ground truth, creating a high-variance, subjective reward landscape. Previous preference-based fine-tuning methods have major trade-offs: Direct Preference Optimization (DPO) is limited by the lack of exploration inherent in offline training, while Proximal Policy Optimization (PPO) can lead to training instability due to potentially unreliable critic estimates. In this paper, we propose BiasGRPO, a framework using Group Relative Policy Optimization (GRPO) to stabilize alignment by normalizing rewards across a group of sampled completions. By substituting the value function with a group-relative baseline, our approach reduces instability while maintaining the exploration benefits of online training. We find that BiasGRPO outperforms DPO and PPO across multiple benchmarks, indicating its effectiveness. To adapt GRPO, we synthetically extend a dataset spanning multiple domains and contexts. We also create and release a custom bias reward model that effectively guides generation while being highly compute-efficient and avoiding knowledge degradation, providing a valuable resource that can be seamlessly integrated into multi-objective RLHF pipelines.
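The group-relative baseline mentioned in the abstract above can be sketched as follows. This is a minimal illustration of the commonly used GRPO normalization (subtract the group mean reward, divide by the group standard deviation), not the BiasGRPO training code; the function name and the `eps` stabilizer are assumptions made for the example.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize the rewards of completions sampled for the same prompt.

    The group mean acts as the baseline, so no learned value function
    (critic) is needed; eps avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four completions for one prompt, scored by a reward model.
advantages = group_relative_advantages([0.2, 0.9, 0.4, 0.5])
```

Completions that score above the group mean receive positive advantages and the rest receive negative ones, so the scale of a noisy reward model matters less than the ranking within each group.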

2606.04806 2026-06-04 cs.CV cs.AI

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

NoRA: 评估视觉第一人称规范性动作推理中的基于事实的合理性

Sichao Li, Sai Ma, Daniel Kilov, Secil Yanik Guyot, Zhuang Li, Seth Lazar

Affiliations The University of Sydney; Australian National University; RMIT University; Johns Hopkins University

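AI summary Proposes the NoRA benchmark, which uses fact-reason-action support graphs to evaluate whether multimodal models can generate reasonable actions and ground their reasoning in visible facts; finds that current VLMs fall short in constructing the full action space and binding actions to the correct support.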

详情
AI中文摘要

LLM和智能系统越来越多地部署在社交环境中,使得规范能力对安全和适当行为至关重要。然而,现有方法要么仅在文本中评估规范性判断,要么将其简化为从固定候选动作集中选择。我们认为两者都不够。在实践中,智能体永远不会获得一个选项菜单;它们必须从头识别一个合理的动作,基于可见事实并由可检查的理由支持。我们引入了NoRA,一个视觉第一人称视频基准,要求模型生成候选的下一个动作,并通过显式的事实-理由-动作支持图来证明每个动作。该基准包含1,420个带注释的视频片段,包括HumanGold-190和LLMSilver-1230分割。每个实例通过动作对齐、事实基础和支持绑定进行评估,汇总为单一的基于事实的合理性分数。我们在直接、深思熟虑和结构化提示模式下对12个多模态系统进行了基准测试,发现当前的VLM经常能恢复合理的动作和相关的场景事实,但始终难以构建完整的合理动作空间并将所选动作绑定到正确的局部支持上。NoRA使这一差距可测量,将评估问题从模型是否能选择一个动作转变为是否能基于正确的可见理由证明一个适当的动作。

英文摘要

LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated through action alignment, factual grounding, and support binding, aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes, finding that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.

2606.04801 2026-06-04 cs.CV

Fast Cubical Persistent Homology on 2D and 3D Images via Union-Find, Pruning, and Lookup Tables

基于并查集、剪枝和查找表的2D和3D图像快速立方体持久同调

Titouan Le Breton, Karol Szustakowski, Marie Piraud

Affiliations Helmholtz AI; Helmholtz Munich; École des Ponts ParisTech; ENS Paris-Saclay; Institute of AI for Health, Helmholtz Munich

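AI summary Proposes Flash Cubical, which combines union-find, edge pruning, and lookup tables to compute cubical persistence of 2D and 3D images under a V-filtration; the authors report it as, to their knowledge, the most time- and memory-efficient implementation for this setting.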

详情
AI中文摘要

我们提出Flash Cubical,一种在$\mathbb{F}_2$上对2D和3D图像的V-过滤进行立方体持久性高效计算的方法。该实现基于三个核心思想。首先,立方体复形满足某些性质,允许通过并查集和对偶性计算最高维度的持久性。其次,对某些边进行剪枝可以实现快速高效的并查集。第三,使用查找表,利用立方体复形的规律性预计算局部信息,避免运行时计算局部信息。据我们所知,这是在V-过滤下最有效的立方体持久性实现,无论在时间还是内存成本上。尽管本文关注V-过滤立方体复形的持久性,但基本思想自然推广到立方体复形的T-过滤,并为其他复形提供了有希望的方向。

英文摘要

We present Flash Cubical, a highly efficient computation of cubical persistence on a V-filtration for 2D and 3D images over $\mathbb{F}_2$. The implementation is built around three core ideas. First, cubical complexes satisfy properties that allow for the computation of persistence of the highest dimension via union-find and duality. Second, pruning of certain edges allows for a fast and efficient implementation of union-find. Third, a lookup table exploits the regularity of cubical complexes to pre-compute local information, which avoids the need to compute it at run time. To the best of our knowledge, this is the most efficient implementation of cubical persistence with a V-filtration, both in terms of time and memory costs. Although the paper focuses on persistence for V-filtration cubical complexes, the underlying ideas generalise naturally to T-filtrations on cubical complexes and suggest promising directions for other complexes.
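As background for the union-find idea in the abstract above, here is a plain sketch of 0-dimensional sublevel-set persistence on a 2D image with pixels as vertices and 4-neighbour edges. It is a textbook elder-rule computation written for illustration and assumes nothing about Flash Cubical itself: there is no edge pruning, no lookup table, and no duality step, and the function name `h0_persistence` is hypothetical.

```python
from typing import Dict, List, Tuple

Pixel = Tuple[int, int]


def h0_persistence(img: List[List[float]]) -> List[Tuple[float, float]]:
    """Return (birth, death) pairs of connected components of a 2D image.

    Pixels enter in increasing order of value; an edge appears when both
    endpoints are present. Zero-persistence pairs are skipped and the
    oldest component is reported with death = inf.
    """
    order = sorted((img[r][c], r, c)
                   for r in range(len(img)) for c in range(len(img[0])))
    parent: Dict[Pixel, Pixel] = {}
    birth: Dict[Pixel, float] = {}
    pairs: List[Tuple[float, float]] = []

    def find(x: Pixel) -> Pixel:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for value, r, c in order:
        parent[(r, c)] = (r, c)
        birth[(r, c)] = value
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            neighbour = (r + dr, c + dc)
            if neighbour not in parent:     # not yet in the filtration
                continue
            young, old = find((r, c)), find(neighbour)
            if young == old:
                continue
            if birth[young] < birth[old]:   # elder rule: the younger root dies
                young, old = old, young
            if birth[young] < value:
                pairs.append((birth[young], value))
            parent[young] = old             # surviving root keeps the older birth
    pairs.append((order[0][0], float("inf")))
    return pairs
```

For an image with several local minima separated by a ridge, each minimum other than the global one yields one pair whose death value is the ridge height at which its basin merges into an older one.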

2606.04798 2026-06-04 cs.LG

Uncertainty-Aware (Un)Supervised Few-Shot User Adaptation for On-Device Personalized Human Activity Recognition

不确定性感知的(无)监督少样本用户自适应用于设备上个性化人类活动识别

Maximilian Burzer, Till Riedel, Michael Beigl, Tobias Röddiger

Affiliations Karlsruhe Institute of Technology; IPAI Foundation gGmbH

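AI summary Proposes a gradient-free framework that performs supervised or unsupervised few-shot user adaptation through Bayesian prototype estimation, improving on-device HAR models with only 3 seconds of calibration data per activity.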

Comments 6 pages, 4 figures, 2 tables, 2 algorithms

详情
AI中文摘要

基于传感器的人类活动识别(HAR)模型通常因个体运动模式和传感器放置导致的域偏移而在未见用户上性能下降。因此,实用的可穿戴HAR系统需要轻量级的个性化方法,这些方法应适用于校准数据有标签、无标签或不可用的情况,并在有限校准下具有鲁棒性。我们提出一个无梯度框架,将预训练的HAR分类器重新用作原型网络,利用先验原型保持零样本性能并规范自适应。对于有标签校准,我们引入闭式贝叶斯原型估计,并将相同原理扩展到无标签校准。仅使用每活动3秒的校准数据(一次样本),监督自适应在四个数据集上将宏F1提高了+2.76至+33.44个百分点,而无监督自适应提高了+0.56至+32.13个百分点。由于自适应仅需要闭式原型更新,该框架能够实现现有HAR分类器的高效且鲁棒的设备上个性化。

英文摘要

Sensor-based Human Activity Recognition (HAR) models often degrade on unseen users due to domain shifts caused by individual movement patterns and sensor placement. Practical wearable HAR systems therefore require personalization methods that are lightweight, applicable whether calibration data is labeled, unlabeled, or unavailable, and robust under limited calibration. We present a gradient-free framework that repurposes pretrained HAR classifiers as Prototypical Networks using prior prototypes, which preserve zero-shot performance and regularize adaptation. For labeled calibration, we introduce closed-form Bayesian prototype estimation and extend the same principle to unlabeled calibration. With only 3 seconds of calibration data per activity (one shot), supervised adaptation improves macro-F1 by +2.76 to +33.44 percentage points across four datasets, while unsupervised adaptation improves by +0.56 to +32.13 points. Since adaptation requires only closed-form prototype updates, the framework enables efficient and robust on-device personalization of preexisting HAR classifiers.
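One way to picture a closed-form, gradient-free prototype update is the standard conjugate Gaussian posterior mean, sketched below. The abstract above does not spell out the estimator, so the update rule, the prior strength `kappa`, and both function names are assumptions chosen for illustration and may differ from the paper's formulation.

```python
from typing import Dict, List, Sequence


def update_prototype(prior: Sequence[float],
                     samples: List[Sequence[float]],
                     kappa: float = 1.0) -> List[float]:
    """Posterior-mean prototype under an assumed Gaussian prior.

    The prior prototype counts as kappa pseudo-observations (kappa > 0).
    With no calibration samples the prior is returned unchanged, which
    keeps the zero-shot behaviour of the pretrained classifier.
    """
    n = len(samples)
    sums = [sum(s[d] for s in samples) for d in range(len(prior))]
    return [(kappa * prior[d] + sums[d]) / (kappa + n) for d in range(len(prior))]


def nearest_prototype(x: Sequence[float],
                      prototypes: Dict[str, Sequence[float]]) -> str:
    """Classify an embedding by squared Euclidean distance to each prototype."""
    def sqdist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(prototypes, key=lambda label: sqdist(x, prototypes[label]))
```

A larger `kappa` keeps the adapted prototype closer to the prior, which is one way a prior prototype can regularize adaptation when only a single calibration window is available.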

2606.04797 2026-06-04 cs.CV cs.LG

Crafting Your Evolving Dreams: Concept-Incremental Versatile Customization

打造你不断演变的梦想:概念增量式多功能定制

Jiahua Dong, Wenqi Liang, Hongliu Li, Yang Cong, Duzhen Zhang, Hanbin Zhao, Henghui Ding, Yulun Zhang, Salman Khan, Fahad Shahbaz Khan

Affiliations Mohamed bin Zayed University of Artificial Intelligence; University of Trento; Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University; South China University of Technology; College of Computer Science and Technology, Zhejiang University; Institute of Big Data, Fudan University; Shanghai Jiao Tong University

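AI summary Proposes the Continually Customizable Diffusion Model (CCDM), which mitigates catastrophic forgetting with an attribute-decoupled LoRA module and a relevance-guided aggregation strategy, and handles concept neglect with a controllable regional context synthesis strategy, enabling concept-incremental versatile customization.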

Comments Accepted to Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

详情
AI中文摘要

定制扩散模型(CDMs)因其生成个性化概念的卓越能力而引起了广泛关注。然而,大多数CDMs不切实际地假设用户的个性化概念集合是静态的,无法随时间增长。此外,在增量学习一系列新概念时,它们对先前学习的概念表现出显著的灾难性遗忘和概念忽视。为了解决上述挑战,我们开发了一种新颖的持续可定制扩散模型(CCDM),使用户能够进行概念增量式多功能定制。具体来说,我们设计了一个属性解耦LoRA(AD-LoRA)模块和一个相关性引导的AD-LoRA聚合策略,以缓解灾难性遗忘。它们可以保留每个任务的概念特定属性,并利用有益的任务间相关性来增强新定制任务的持续学习。此外,为了解决概念忽视的挑战,我们提出了一种可控区域上下文合成策略,该策略根据用户提供的条件进行多概念合成。该策略通过保证用户定义区域之间的语义独立性及其平滑边界过渡,增强了多概念合成的整体一致性。实验表明,我们的CCDM在基线方法上表现出显著改进。

英文摘要

Custom diffusion models (CDMs) have garnered significant interest owing to their remarkable capacity for generating personalized concepts. However, the majority of CDMs unrealistically presume that the user's collection of personalized concepts is static and incapable of incremental growth over time. Furthermore, they exhibit significant catastrophic forgetting and concept neglect of previously learned concepts when incrementally learning a sequence of new ones. To resolve the above challenges, we develop a novel Continually Customizable Diffusion Model (CCDM), enabling users to perform concept-incremental versatile customization. Specifically, we design an attribute-decoupled LoRA (AD-LoRA) module and a relevance-guided AD-LoRA aggregation strategy to mitigate catastrophic forgetting. They can preserve concept-specific attributes of each task and leverage beneficial inter-task correlations to enhance the continual learning of new customization tasks. Additionally, to address the challenge of concept neglect, we propose a controllable regional context synthesis strategy that performs multi-concept composition in alignment with user-provided conditions. This strategy enhances the overall consistency in multi-concept synthesis by guaranteeing semantic independence between user-defined regions and their smooth boundary transitions. Experiments show our CCDM exhibits significant improvements over baseline methods.

2606.04792 2026-06-04 cs.CV

A Pathology Foundation Model for Gastric Cancer with Real-World Validation

用于胃癌的病理基础模型及真实世界验证

Ling Liang, Jiabo Ma, Zhengyu Zhang, Fengtao Zhou, Yingxue Xu, Yihui Wang, Cheng Jin, Zhengrui Guo, On Ki Tang, Zhijian Cen, Zhen Wang, Qi Xie, Chengyu Lu, Chenglong Zhao, Feifei Wang, Yu Cai, Hongyi Wang, Jing Zhang, Yaping Ye, Shijun Sun, Shenglei Li, Yu Wang, Zhenhui Li, Ronald Cheong Kin Chan, Xiuming Zhang, Zhe Wang, Hao Chen, Li Liang

Affiliations Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China; Department of Pathology, Nanfang Hospital, Southern Medical University, Guangzhou, China; Department of Pathology, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China; Guangdong Province Key Laboratory of Molecular Tumor Pathology, Guangzhou, China; Department of Anatomical and Cellular Pathology, The Chinese University of Hong Kong, Hong Kong SAR, China; Pathology Artificial Intelligence Development and Assessment Laboratory, State Key Laboratory of Translational Oncology, The Chinese University of Hong Kong, Hong Kong SAR, China

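AI summary Proposes GRACE, a gastric-cancer-specific foundation model built on multicenter, primarily HE-stained whole-slide images; it outperforms general-purpose PFMs across 28 clinical tasks, and prospective validation and a reader study support its value as a diagnostic aid.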

详情
AI中文摘要

胃癌仍然是癌症死亡的主要原因,但其组织学和分子异质性使诊断和风险分层复杂化。通用病理基础模型(PFM)在胃癌诊疗的关键细粒度终点上往往表现停滞,且很少有模型经过严格的前瞻性验证或临床读者研究。我们提出了GRACE,一个用于真实世界评估和临床决策支持的胃癌专用基础模型。GRACE基于来自37,493名患者的多中心胃癌病理数据集(共48,364张HE染色全切片图像)开发。在28项临床相关任务评估中,GRACE持续优于代表性泛癌PFM,达到宏观AUC 0.9188,在癌前病变诊断(宏观AUC 0.9322)、肿瘤组织病理评估(宏观AUC 0.9119)、分子分型(宏观AUC 0.8682)和预后预测方面表现强劲。除基准测试外,GRACE的转化价值通过严格的证据链得到证实。在安全门控标准(排除要求100%阴性预测值,纳入要求100%阳性预测值)下,GRACE简化了高达69.6%的恶性诊断病例的审查,并分流了46.8%的MMR-IHC随访请求。这种转化可行性通过病理学家-AI协作的随机交叉读者研究得到进一步加强。在GRACE辅助下,诊断准确率从82.0%提高到89.9%,正确诊断的校正优势比提高近两倍(OR 1.987),同时敏感性和特异性也得到提升。AI辅助还使诊断时间减少14.9%,诊断信心提高9.0%,并显著改善评估者间一致性。当校准至不劣于高级病理医生时,AI辅助工作流可分流60.7%的萎缩病例和82.7%的肠化生病例。

英文摘要

Gastric cancer remains a major cause of cancer mortality, yet its histological and molecular heterogeneity complicates diagnosis and risk stratification. General-purpose pathology foundation models (PFMs) often plateau on fine-grained endpoints central to gastric cancer care, and few have undergone rigorous prospective validation or clinical reader studies. We present GRACE, a Gastric-specific foundation model for Real-world Assessment and Clinical dEcision support. GRACE was developed from multicenter gastric pathology datasets totaling 48,364 primarily HE-stained whole-slide images from 37,493 patients. When evaluated on 28 clinically relevant tasks, GRACE consistently outperformed representative pancancer PFMs, achieving a macro-AUC of 0.9188, with strong performance for precancerous lesion diagnosis (macro-AUC 0.9322), tumor histopathological assessment (macro-AUC 0.9119), molecular profiling (macro-AUC 0.8682), and prognostic prediction. Beyond benchmarking, GRACE's translational value was substantiated through a rigorous evidence chain. Under safety-gated criteria requiring 100% NPV for rule-out and 100% PPV for rule-in, GRACE streamlined review for up to 69.6% of malignancy-diagnosis cases and triaged 46.8% of MMR-IHC follow-up requests. This translational feasibility was further strengthened by a randomized crossover reader study of pathologist-AI collaboration. With GRACE assistance, diagnostic accuracy improved from 82.0% to 89.9%, yielding nearly twofold higher adjusted odds of a correct diagnosis (OR 1.987) alongside concurrent gains in sensitivity and specificity. AI assistance also reduced diagnostic time by 14.9%, elevated diagnostic confidence by 9.0%, and markedly improved inter-rater agreement. When calibrated to maintain non-inferior performance to senior pathologists, the AI-assisted workflow could triage 60.7% of atrophy and 82.7% of intestinal metaplasia cases.

2606.04788 2026-06-04 cs.CV cs.RO

Z-FLoc: Zero-Shot Floorplan Localization via Geometric Primitives

Z-FLoc: 基于几何基元的零样本楼层平面定位

Ayumi Umemura, Toshinori Kuwahara, Marc Pollefeys, Daniel Barath

Affiliations ETH Zurich; Tohoku University

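AI summary Proposes a zero-shot floorplan localization method that extracts geometric primitives such as lines and circles from bird's-eye-view projections of monocular 3D reconstructions and robustly matches them to the floorplan, generalizing to new environments without retraining.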

详情
AI中文摘要

视觉定位——在预先存在的地图中估计相机姿态——是计算机视觉中的一个基本问题。楼层平面是一种有吸引力的地图表示:它们对于大多数建筑来说易于获取、紧凑,并且固有地不受视觉外观变化的影响。然而,弥合相机观测与楼层平面几何之间的严重领域差距仍然具有挑战性。现有方法通过数据驱动学习来解决这一差距,但它们需要大规模训练数据和特定环境的重新训练,限制了实际部署。我们提出了一种零样本楼层平面定位方法,无需任何重新训练即可泛化到新环境。我们的关键见解是,主导几何基元——直线和圆——在人造环境中无处不在,并提供外观不变的结构约束。我们从单目3D重建的鸟瞰图投影中提取这些基元,并通过鲁棒估计框架内的专用最小求解器将它们与楼层平面进行匹配。在模拟和真实数据集上的实验表明,我们的方法在未见过的环境上优于最先进的基于学习的方法,同时在所有实验中使用单一固定的超参数集。源代码将公开提供。

英文摘要

Visual localization, the task of estimating a camera pose within a pre-existing map, is a fundamental problem in computer vision. Floorplans are an attractive map representation: they are readily available for most buildings, compact, and inherently invariant to visual appearance changes. However, bridging the severe domain gap between camera observations and floorplan geometry remains challenging. Existing methods address this gap through data-driven learning, yet they require large-scale training data and environment-specific retraining, limiting their practical deployment. We propose a zero-shot floorplan localization method that generalizes to novel environments without any retraining. Our key insight is that dominant geometric primitives (lines and circles) are ubiquitous in human-made environments and provide appearance-invariant structural constraints. We extract these primitives from a bird's-eye-view (BEV) projection of monocular 3D reconstructions and match them to the floorplan via dedicated minimal solvers within a robust estimation framework. Experiments on both simulated and real-world datasets show that our approach outperforms state-of-the-art learning-based methods on unseen environments, while using a single fixed set of hyperparameters across all experiments. The source code will be made publicly available.

2606.04781 2026-06-04 cs.AI cs.LG

AIP: A Graph Representation for Learning and Governing Agent Skills

AIP: 一种用于学习和治理智能体技能的图表示

Zachary Blumenfeld, Jim Webber

Affiliations Neo4j USA; Neo4j UK

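AI summary Proposes the Agent Instruction Protocol (AIP), which represents skills as directed execution graphs; compiling human-written skills into this form improves task performance and supports diagnosable repair and governance of skills.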

详情
AI中文摘要

当前的智能体技能主要由自由形式的散文组成,要求智能体在每个会话中阅读、解释并重新推导如何行动。这带来了两个叠加的成本:在实现密集型任务上降低了可靠性,并且技能创建和改进困难,因为编辑散文是一个脆弱的过程,人类和智能体都难以处理,特别是对于模型训练中代表性不足的领域特定程序性知识。智能体指令协议(AIP)通过将技能建模为有向执行图来解决这两个问题:离散步骤作为节点,由确定性脚本或自然语言描述支持,通过显式类型的输入/输出边连接,并由模式验证的YAML规范管理。一个编译器元技能将现有的人类编写的技能转换为这种形式。好处是双重的。首先,将人类编写的技能编译为AIP后,Claude Sonnet在SkillsBench的27个真实智能体任务上的平均任务奖励从0.60提高到0.71,通过率从53%提高到67%——这是统计上显著的提升(Wilcoxon符号秩检验p=0.011),在12个任务中获胜,2个失败,13个平局——通常耗时更少。该图为智能体提供了经过验证、可运行的单元,而不是要求它从自然语言中重新推导代码、命令和工具调用。其次,在创建和改进方面,由于每个技能都经过模式验证、功能可测试且可逐节点寻址,因此可以精确诊断和修复故障。两个作者编写的技能故障被追溯到脚本级别。在调整AIP规范并重新编译后,两者均恢复且无回归(一个任务从0/5变为5/5),将技能改进转变为可测量的调优循环,而不是散文重写。相同的图结构支持语料库级别的治理和技能内省,并为基于技能的强化学习提供了自然的动作空间。

英文摘要

Agent Skills today consist largely of free-form prose requiring the agent to read, interpret, and re-derive how to act in every session. This imposes two compounding costs: reduced reliability on implementation-heavy tasks, and difficulty in skill creation and improvement, since editing prose is a fragile process that both humans and agents struggle with, particularly for domain-specific procedural knowledge underrepresented in model training. The Agent Instruction Protocol (AIP) addresses both by modeling a skill as a directed execution graph: discrete steps as nodes backed by deterministic scripts or natural-language descriptions, connected by explicit typed input/output edges, and governed by a schema-validated YAML specification. A compiler meta-skill translates existing human-written skills into this form. The benefits are twofold. First, compiling human-written skills to AIP raised Claude Sonnet's mean task reward from 0.60 to 0.71 and pass rate from 53% to 67% across 27 real agent tasks from SkillsBench - a statistically significant gain (Wilcoxon signed-rank p = 0.011), winning 12 tasks to 2 with 13 ties - often in less wall-clock time. The graph delivers vetted, runnable units to the agent rather than asking it to re-derive code, commands, and tool calls from natural language. Second, on creation and improvement, because each skill is schema-validated, functionally testable, and addressable node-by-node, failures can be diagnosed and repaired precisely. Two authored-skill failures were traced to the script level. After adjusting the AIP spec and recompiling, both recovered with zero regressions (one task going from 0/5 to 5/5), turning skill improvement into a measurable tuning loop rather than a prose rewrite. That same graph structure supports corpus-level governance and skill introspection, and provides a natural action space for reinforcement learning over skills.

2606.04780 2026-06-04 cs.CL

PersonaTree: Structured Lifecycle Memory for Person Understanding in LLM Agents

PersonaTree: 面向LLM智能体人物理解的结构化生命周期记忆

Yubo Hou, Jingwei Song, Hongbo Zhang, Zhisheng Chen, Bang Xiao, Tao Wan, Zengchang Qin

Affiliations School of ASEE, Beihang University, Beijing, China; The University of Hong Kong, Hong Kong, China; Peking University, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; School of BME, Beihang University, Beijing, China; CAIR and CECS, VinUniversity, Hanoi, Vietnam

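AI summary Proposes PersonaTree, a structured lifecycle memory framework that abstracts interaction evidence into person understanding through a three-level persona tree with explicit support paths, achieving leading performance on multiple benchmarks.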

详情
AI中文摘要

持久化的LLM智能体需要记忆表示,使得在长期交互中人物理解的形成变得明确。现有的智能体记忆方法强调信息保留和检索,但对累积的交互证据如何被抽象为人物理解的解释有限。我们将这一过程视为图式形成,其中情境证据被抽象为可重用模式和稳定的人物层面断言。我们引入PersonaTree,一种结构化生命周期记忆框架,通过三级人物树实现这一观点,并具有从证据到断言的显式支持路径。PersonaTree通过保守写入、置信度引导的合并和查询条件路径检索来维护树,仅返回每个查询所需的证据深度。在六个涉及人物理解和持久记忆的基准测试中,使用三个回答骨干,PersonaTree在18个紧凑分数中排名第一,并在16个设置中进入前两名。消融实验表明,层次结构提高了KnowMe上的抽象人物理解,而在可比上下文预算下,支持路径检索提高了RealPref的对齐度。

英文摘要

Persistent LLM agents require memory representations that make the formation of person understanding explicit across long-term interaction. Existing agent memory methods emphasize information retention and retrieval, yet give limited account of how accumulated interaction evidence is abstracted into person understanding. We view this process as schema formation, where situated evidence is abstracted into reusable patterns and stable person-level claims. We introduce PersonaTree, a structured lifecycle memory framework that realizes this view as a three-level persona tree with explicit support paths from evidence to claims. PersonaTree maintains the tree through conservative writing, confidence-guided consolidation, and query-conditioned path retrieval, returning only the evidence depth required by each query. Across six person-understanding and persistent-memory benchmarks with three answer backbones, PersonaTree ranks first in 12 of 18 compact scores and reaches the top two in 16 settings. Ablations show that hierarchy improves abstract person understanding on KnowMe, while support-path retrieval improves RealPref alignment under a comparable context budget.

2606.04779 2026-06-04 cs.AI math.CO

Tree-Based Formalization of Multi-Agent Complementarity in Human-AI Interactions

基于树的人机交互中多智能体互补性形式化

Andrea Ferrario

Affiliations Institute of Biomedical Ethics and History of Medicine, University of Zurich; SUPSI, Dalle Molle Institute for Artificial Intelligence (IDSIA); ETH Zurich

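AI summary Proposes a tree-based formal framework that represents human-AI interaction protocols by ordered agent-role configurations and planar binary trees, and proves that complementarity is attainable in regression but obstructed in classification under natural conditions on local aggregation and loss functions.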

Comments 29 pages, 9 figures

详情
AI中文摘要

互补性是指人机交互(HAI)的表现优于其成员中最佳预测基准的情况。尽管这一概念在HAI研究中至关重要,但关于互补性的形式化工作仍然有限。现有框架未能建模智能体的预测如何组合成对工作流敏感的多智能体协议。我们通过引入基于树的多智能体HAI互补性形式化来填补这一空白。一个HAI协议由一个有序的智能体角色配置以及一棵有根平面二叉树表示,树的叶子由预测向量装饰。沿树递归评估一个局部二元组合规则,产生相对于逐点最小预言基准的树相对互补性泛函。我们证明了四个结果。第一,基于选择器的HAI(包括自我或AI依赖)无法实现互补性,无论任务、损失或预测质量如何。第二,在平方损失下的回归中,互补性等价于与真实向量之间的欧几里得距离最小化;对于$N=2$,最优线性池化权重具有封闭形式并具有残差校正解释。第三,在线性局部组合下,每个协议树定义了叶子权重单纯形上的重心坐标图;协议树的Tamari覆盖重新参数化保持互补性,对于$N=4$,它们满足五边形恒等式。第四,在二元分类中,在端点单调损失(包括标准Bregman和许多有限伯努利$f$散度损失)下,没有内部局部组合能实现互补性;在交叉熵下的多类聚合中存在类似障碍。总之,我们的框架表明,互补性在多智能体回归中是可实现的,但在分类中,在局部聚合和损失函数的自然条件下受到阻碍。

英文摘要

Complementarity is the case in which a human-AI interaction (HAI) outperforms the best prediction benchmark available among its members. Although this idea is central in HAI research, formal work on complementarity remains limited. Existing frameworks do not model how agents' predictions compose into workflow-sensitive multi-agent protocols. We close this gap by introducing a tree-based formalization of complementarity in multi-agent HAI. An HAI protocol is represented by an ordered agent-role configuration together with a rooted planar binary tree whose leaves are decorated by prediction vectors. A local binary composition rule is evaluated recursively along the tree, yielding a tree-relative complementarity functional relative to a pointwise-min oracle benchmark. We prove four results. First, selector-based HAIs, including self- or AI-reliance, cannot achieve complementarity regardless of task, loss, or prediction quality. Second, in regression under squared loss, complementarity is equivalent to Euclidean distance minimization from the ground-truth vector; for $N=2$, the optimal linear-pooling weight has a closed form and a residual-correction interpretation. Third, under linear local composition, every protocol tree defines a barycentric coordinate chart on the simplex of leaf weights; Tamari-cover reparameterizations of protocol trees preserve complementarity, and for $N=4$, they satisfy the pentagon identity. Fourth, in binary classification, no internal local composition can achieve complementarity under endpoint-monotone losses, including standard Bregman and many finite Bernoulli $f$-divergence losses; an analogous obstruction holds for multiclass aggregation under cross-entropy. In summary, our framework shows that complementarity is attainable in multi-agent regression, but obstructed in classification under natural conditions on local aggregation and loss functions.
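For the $N=2$ regression case mentioned above, the least-squares pooling weight can be worked out directly. Writing the pooled prediction as b + w(a - b) and minimizing its squared distance to the ground truth y gives w* = <y - b, a - b> / ||a - b||^2, which reads as "start from b and correct it along the disagreement a - b". The sketch below is this standard derivation, not code from the paper; the function names and the example vectors are hypothetical, and the weight is left unconstrained (it is not clipped to [0, 1]).

```python
from typing import List, Sequence


def optimal_pool_weight(a: Sequence[float], b: Sequence[float],
                        y: Sequence[float]) -> float:
    """Weight w minimizing ||w*a + (1-w)*b - y||^2 (unconstrained)."""
    diff = [ai - bi for ai, bi in zip(a, b)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:            # identical predictions: any weight is optimal
        return 0.0
    return sum((yi - bi) * d for yi, bi, d in zip(y, b, diff)) / denom


def pool(a: Sequence[float], b: Sequence[float], w: float) -> List[float]:
    """Linear pooling of two prediction vectors."""
    return [w * ai + (1.0 - w) * bi for ai, bi in zip(a, b)]


def squared_loss(p: Sequence[float], y: Sequence[float]) -> float:
    return sum((pi - yi) ** 2 for pi, yi in zip(p, y))


# Two agents that err in opposite directions around the truth.
human, ai, truth = [2.0, 0.0], [0.0, 2.0], [1.0, 1.0]
w_star = optimal_pool_weight(human, ai, truth)
pooled = pool(human, ai, w_star)
```

In this example each agent alone has a total squared loss of 2, while the pooled prediction matches the truth, so the pooled loss is below both individual losses. The paper's own benchmark is a pointwise-min oracle, which is a stricter comparison than the total-loss comparison shown here.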

2606.04778 2026-06-04 cs.AI cs.CL cs.LG

Inference-Time Vulnerability Beyond Shallow Safety: Alignment Along Generation Trajectories

超越浅层安全的推理时脆弱性:沿生成轨迹的对齐

Kyungmin Park, Taesup Kim

Affiliations Hankuk University of Foreign Studies; Seoul National University

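AI summary Shows that safety-aligned large language models have a broader inference-time vulnerability, in which short token injections at any generation step can substantially alter subsequent safety behavior, and proposes aligning models directly on generation trajectories to improve robustness.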

详情
AI中文摘要

安全对齐的大语言模型(LLMs)在推理时仍然容易受到干预,这些干预会将生成导向有害输出。最近的研究将其归因于浅层安全,即对齐集中在最初的几个输出标记上。我们表明,浅层安全是更广泛的推理时脆弱性的一个特例,其中在任何生成步骤的短标记注入都能显著改变后续的安全行为。我们还发现,模型在其隐藏状态中与拒绝方向的对齐并不能预测其对这种注入的鲁棒性,这表明在扰动下,内部状态本身并不能决定生成行为。为了解决这个问题,我们通过模拟序列中段扰动构建的生成轨迹上直接对齐模型,并表明这提高了对中段注入的鲁棒性,并泛化到利用早期标记生成的攻击。我们的工作认为,鲁棒的安全对齐需要对生成过程本身进行训练,而不仅仅是其输出。

英文摘要

Safety-aligned Large Language Models (LLMs) remain vulnerable to interventions during inference that redirect generation toward harmful outputs. Recent work attributes this to shallow safety, where alignment concentrates in the first few output tokens. We show that shallow safety is a special case of a broader inference-time vulnerability, in which short token injections at any generation step can substantially alter subsequent safety behavior. We also find that a model's alignment with refusal directions in its hidden states does not predict its robustness to such injection, revealing that internal state alone does not determine generation behavior under perturbation. To address this, we align models directly on generation trajectories constructed by simulating mid-sequence perturbation, and show that this improves robustness to mid-sequence injection and generalizes to attacks that exploit early-token generation. Our work argues that robust safety alignment requires training on the generation process itself, not only its outputs.

2606.04776 2026-06-04 cs.RO

SoftPINCH: EMG-Driven Soft Exoskeleton Assistance for Finger Flexion and Grasping

SoftPINCH: 用于手指屈曲和抓取的EMG驱动软体外骨骼辅助

Nicklas Nikolaj Grønvall, Magnus Malthe Sigsgaard Nielsen, Xiaofeng Xiong, Saravana Prashanth Murali Babu

Affiliations SDU Soft Robotics, The Maersk Mc-Kinney Moller Institute, University of Southern Denmark, Odense, Denmark

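AI summary Proposes SoftPINCH, an EMG-driven soft wearable exoskeleton that combines a tendon-driven soft exoskeleton, fingertip magnetic contact sensing, and neural EMG decoding to assist thumb-index flexion and pinching; the CNN+LSTM decoder reaches 99.4% accuracy, and active assistance reduces muscular effort.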

Comments Submitted to 18th International Conference on the Simulation of Adaptive Behavior (SAB 2026)

详情
AI中文摘要

表面肌电图(sEMG)提供了一种非侵入式接口,用于检测手部运动意图并控制可穿戴辅助设备。然而,由于EMG信号受噪声、运动伪影、电极放置、肌肉疲劳和受试者间差异的影响,可靠的EMG驱动手部辅助仍然具有挑战性。同时,许多手部外骨骼在机械上仍具有限制性或笨重,限制了舒适性和自然手部运动。本工作提出了SoftPINCH,一种用于拇指-食指屈曲和捏取辅助的EMG驱动软体可穿戴外骨骼。该系统结合了肌腱驱动的软体外骨骼、指尖磁接触传感和用于基于意图辅助的神经EMG解码。在食指和拇指运动期间记录前臂肌肉的表面EMG,并评估了三种独立于受试者的解码架构:LSTM、CNN+LSTM和带注意力的CNN+LSTM。CNN+LSTM和CNN+LSTM-attention模型均达到99.4%的LOSO测试准确率,优于达到97.8%的独立LSTM。然而,注意力机制相比CNN+LSTM并未提供显著改进,表明基于CNN的特征提取足以实现鲁棒的EMG表示。因此,由于高准确率和较低的架构复杂度,选择了CNN+LSTM模型进行实时部署。功能评估表明,主动外骨骼辅助在孤立手指屈曲和物体抓取期间减少了肌肉用力。在负重抓取期间,辅助在所有测试负载下均减少了肌肉用力,在最高负载下减少了92.6%。这些结果证明了SoftPINCH通过实时EMG驱动软体机器人控制实现直观、低用力捏取辅助的潜力。

英文摘要

Surface electromyography (sEMG) provides a non-invasive interface for detecting hand-movement intention and controlling wearable assistive devices. However, reliable EMG-driven hand assistance remains challenging because EMG signals are affected by noise, motion artifacts, electrode placement, muscle fatigue, and inter-subject variability. At the same time, many hand exoskeletons remain mechanically restrictive or bulky, limiting comfort and natural hand motion. This work presents SoftPINCH, an EMG-driven soft wearable exoskeleton for thumb-index finger flexion and pinch grasp assistance. The system combines a tendon-driven soft exoskeleton, fingertip magnetic contact sensing, and neural EMG decoding for intention-based assistance. Surface EMG was recorded from forearm muscles during index and thumb movements, and three subject-independent decoding architectures were evaluated: LSTM, CNN+LSTM, and CNN+LSTM with attention. The CNN+LSTM and CNN+LSTM-attention models both achieved 99.4% LOSO test accuracy, outperforming the standalone LSTM, which reached 97.8%. However, the attention mechanism did not provide a significant improvement over CNN+LSTM, indicating that CNN-based feature extraction was sufficient for robust EMG representation. The CNN+LSTM model was therefore selected for real-time deployment due to its high accuracy and lower architectural complexity. Functional evaluation showed that active exoskeleton assistance reduced muscular effort during isolated finger flexion and object grasping. During weighted grasping, assistance reduced muscular effort across all tested loads, with a 92.6% reduction at the highest load. These results demonstrate the potential of SoftPINCH for intuitive, low-effort pinch assistance using real-time EMG-driven soft robotic control.

2606.04775 2026-06-04 cs.LG cs.AI cs.CV cs.SY eess.SY math.OC

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

通过降阶线性最优控制引导视频生成模型的激活

Jihoon Hong, Alice Chan, Qiyue Dai, Julian Skifstad, Glen Chou

Affiliations * Georgia Institute of Technology

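AI summary Proposes the LA-LQR framework, which models text-to-video inference as a dynamical system and uses reduced-order optimal control for minimally invasive activation steering, reducing unsafe generations while preserving visual quality.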

Details
AI Chinese abstract

在大规模网络数据上训练的文本到视频(T2V)模型可能生成不良内容,这促使我们进行干预以减少有害输出而不牺牲视觉质量。激活引导提供了一种有吸引力的机制替代微调和提示过滤,但现有的T2V引导方法仍然有限,通常采用粗糙的、非预测性的干预,可能导致过度引导和内容退化。为了弥补这一差距,我们提出了潜在激活线性二次型调节器(LA-LQR),一种用于最小侵入性T2V引导的降阶最优控制框架。LA-LQR将T2V推理表述为一个动态系统,并计算闭环反馈干预,将激活引导向期望的特征设定点,同时惩罚不必要的扰动。为了使最优控制对高维视频激活可行,我们将激活投影到由对比提示对导出的低维、任务相关子空间,估计该潜在空间中的局部线性动力学,并求解潜在LQR问题以获得时间步和层特定的引导信号。我们提供了将潜在设定点跟踪与原始激活空间特征控制联系起来的理论界限,并实证验证了降阶潜在动力学的保真度。在概念引导和视频安全基准测试中,LA-LQR相对于基线减少了不安全生成,同时保持了提示保真度和视觉质量。

English abstract

Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V steering methods remain limited, typically applying coarse, non-anticipative interventions that can lead to oversteering and content degradation. To close this gap, we propose Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control feasible for high-dimensional video activations, we project activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds relating latent setpoint tracking to raw activation-space feature control, and empirically validate the fidelity of the reduced latent dynamics. On concept steering and video safety benchmarks, LA-LQR reduces unsafe generations relative to baselines, while preserving prompt fidelity and visual quality.
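
The closed-loop feedback idea in this abstract can be illustrated with a textbook finite-horizon LQR on a one-dimensional latent error. This is a sketch under strong assumptions, not the authors' implementation: the latent error is assumed to follow e[t+1] = a*e[t] + b*u[t], and the values of `a`, `b`, `q`, `r`, and `q_final` are made up for illustration.

```python
def lqr_gains(a, b, q, r, q_final, horizon):
    """Backward Riccati recursion for scalar dynamics e[t+1] = a*e[t] + b*u[t].

    Stage cost is q*e^2 + r*u^2 and terminal cost is q_final*e^2.
    Returns one feedback gain per timestep.
    """
    p = q_final
    gains = [0.0] * horizon
    for t in reversed(range(horizon)):
        k = (b * p * a) / (r + b * p * b)
        gains[t] = k
        p = q + a * p * a - a * p * b * k
    return gains


def rollout(a, b, gains, e0):
    """Apply u[t] = -gains[t] * e[t] and return the error trajectory."""
    errors = [e0]
    e = e0
    for k in gains:
        u = -k * e  # the steering signal shrinks as the error shrinks
        e = a * e + b * u
        errors.append(e)
    return errors


# Usage: ten steps, starting two units away from the setpoint.
gains = lqr_gains(a=1.0, b=0.5, q=1.0, r=0.1, q_final=1.0, horizon=10)
print(rollout(1.0, 0.5, gains, e0=2.0))
```

Raising `r` relative to `q` penalizes the control input more heavily, so the controller intervenes less aggressively. That trade-off is the sense in which an LQR-style formulation can discourage oversteering.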

2606.04773 2026-06-04 cs.CV cs.CL

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

NextMotionQA: 使用视觉语言模型基准测试和评判人体运动理解

Yong Cao, Chuqiao Li, Xianghui Xie, Gerard Pons-Moll, Andreas Geiger

Affiliations * University of Tübingen; Tübingen AI Center; Max Planck Institute for Informatics; Saarland Informatics Campus

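AI summary Proposes the NextMotionQA benchmark, which uses three complementary tasks and multi-level difficulty stratification to systematically evaluate vision-language models on human motion understanding, and reveals their limitations in fine-grained judging.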

Comments 23 pages, 8 figures, 9 tables

Details
AI Chinese abstract

人体运动理解的可靠评估对于推进具身人工智能、机器人和动画至关重要。然而,现有基准存在语义粒度粗糙、难度无区分、标注质量有限以及答案模糊等问题,无法诊断当前模型的失败之处。为弥补这一差距,我们引入NextMotionQA,这是一个全面的基准,利用视觉语言模型(VLM)进行半自动化、专家验证的数据集构建。NextMotionQA包含三项互补任务:多项选择题问答、视频字幕生成和细粒度错误纠正。每项任务沿三个核心语义轴系统组织,并分为三个任务复杂度级别。我们对十二个代表性VLM的广泛评估揭示了在传统单任务评估中不可见的关键能力差距和弱点。在互补方向上,近期工作开始使用VLM作为文本到运动评估的评判者;我们探究它们在更困难任务下是否表现出同样的退化。我们发现,VLM在粗粒度标准上与专家评分高度一致(Cohen's κ=0.70),但在细粒度、部件级评判上表现不佳(κ=0.10),验证了该范式在其强项领域的有效性,同时明确了其局限性。

English abstract

Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we introduce NextMotionQA, a comprehensive benchmark that leverages vision-language models (VLMs) for semi-automated, expert-verified dataset construction. NextMotionQA features three complementary tasks: multiple-choice question answering, video captioning, and fine-grained error correction. Each task is systematically structured across three core semantic axes and stratified into three task complexity levels. Our extensive evaluation of twelve representative VLMs uncovers critical capability gaps and weaknesses that remain invisible under conventional, single-task evaluations. In a complementary direction, recent work has begun using VLMs as judges for text-to-motion evaluation; we ask whether they show the same degradation under harder tasks. We find that VLMs align strongly with expert ratings on coarse criteria (Cohen's κ=0.70) but break down on fine-grained, part-level judgment (κ=0.10), validating the paradigm in its strong regime while clarifying its limits.
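
For readers unfamiliar with the agreement statistic quoted above, Cohen's κ compares the observed agreement between two raters with the agreement expected by chance from their label frequencies. A minimal sketch follows; the rating lists are invented for illustration and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Fraction of items on which the two raters agree.
    observed = sum(x == y for x, y in zip(rater_a, rater_b)) / n

    # Agreement expected if each rater labeled at random with their own frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when both raters use one identical label")
    return (observed - expected) / (1.0 - expected)


# Usage: a hypothetical VLM judge versus a hypothetical expert on six items.
vlm_judge = ["good", "good", "bad", "bad", "good", "bad"]
expert = ["good", "good", "bad", "good", "good", "bad"]
print(cohens_kappa(vlm_judge, expert))
```

A value near 1 means agreement well beyond chance, while a value near 0 means the raters agree about as often as chance alone would predict.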

2606.04772 2026-06-04 cs.CV cs.AI

Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction

用于脑重建的基于顺序Mamba的粗到细层次架构

Hoang-Son Vo, Van-Hung Bui, Minh-Huy Mai-Duc, Tien-Dung Mai, Soo-Hyung Kim

Affiliations * Chonnam National University, Gwangju, Republic of Korea; Vietnam National University - Ho Chi Minh City, University of Science, Vietnam; Institute for Cybersecurity and Digital Technologies, Russia

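AI summary Proposes CHASMBrain, a two-stage image-to-fMRI encoding framework built on a dual-stream Mamba design and a coarse-to-fine strategy; it outperforms baselines on the NSD dataset and reveals causal organizational properties of the visual cortex.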

Details
AI Chinese abstract

理解深度视觉表征与人类视觉系统之间的关系是计算神经科学中的一个基本挑战。尽管现代视觉模型在图像识别中取得了强劲性能,但它们与人类视觉皮层层次组织的对应关系仍是一个开放问题。在本研究中,我们提出了CHASMBrain,一种新颖的分层两阶段图像到fMRI编码框架。我们的架构利用双流Mamba设计,明确分离并处理全局语义标记和局部空间补丁,这一设计受视觉皮层功能组织的启发。采用粗到细策略:第一阶段预测去噪的ROI级激活,第二阶段使用Mamba-VAE将这些粗响应细化为全体素级预测。在自然场景数据集(NSD)上的实验表明,我们的方法达到了0.429的皮尔逊相关系数和0.261的均方误差,优于所有评估的基线,包括岭回归和DINOv2线性探针。除了预测性能,因果分支消融实验揭示了一种非对称特化:补丁流特定锁定于早期视觉皮层(视网膜拓扑区域),而CLS流为高阶区域提供更广泛的语义上下文——这种对应关系是因果性的,而不仅仅是相关性的。跨被试迁移实验进一步表明,学习到的骨干网络在个体间泛化良好,只需极少的个体适应,表明模型捕捉到了共享的、与主体无关的视觉表征。

English abstract

Understanding the relationship between deep visual representations and the human visual system is a fundamental challenge in computational neuroscience. While modern vision models achieve strong performance in image recognition, their correspondence with the hierarchical organization of the human visual cortex remains an open question. In this study, we propose CHASMBrain, a novel hierarchical two-stage framework for image-to-fMRI encoding. Our architecture leverages a dual-stream Mamba design to explicitly separate and process global semantic tokens and local spatial patches, motivated by the functional organization of the visual cortex. A coarse-to-fine strategy is employed: Stage 1 predicts denoised ROI-level activations, while Stage 2 refines these coarse responses into full voxel-level predictions using a Mamba-VAE. Experiments on the Natural Scenes Dataset (NSD) demonstrate that our method achieves a Pearson correlation of 0.429 and an MSE of 0.261, outperforming all evaluated baselines including ridge regression and DINOv2 linear probes. Beyond predictive performance, causal branch-ablation experiments reveal an asymmetric specialization: the patch stream is specifically locked to early visual cortex (retinotopic regions), while the CLS stream contributes broader semantic context to higher-order areas -- a correspondence that holds causally, not merely correlationally. Cross-subject transfer experiments further show that the learned backbone generalizes across individuals with minimal per-subject adaptation, suggesting the model captures a shared, subject-agnostic visual representation.
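
The two metrics reported above, Pearson correlation and MSE, can be computed for one voxel's predicted and measured responses as sketched below. The abstract does not say how scores are aggregated across voxels or subjects, so the per-voxel framing and the sample values here are assumptions for illustration only.

```python
import math


def pearson_r(pred, target):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    if var_p == 0.0 or var_t == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_p * var_t)


def mse(pred, target):
    """Mean squared error between two equal-length sequences of numbers."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


# Usage: made-up responses of a single voxel to four images.
predicted = [0.2, 0.5, 0.1, 0.9]
measured = [0.3, 0.4, 0.0, 1.0]
print(pearson_r(predicted, measured), mse(predicted, measured))
```

The two metrics are complementary: correlation ignores scale and offset, whereas MSE penalizes them, so a model can score well on one and poorly on the other.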