arXivDaily: arXiv daily academic digest, updated Monday to Friday
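2510.04567 2026-06-11 cs.LG cs.AI (updated version)

GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning

Weishuo Ma, Yanbo Wang, Xiyuan Wang, Lei Zou, Muhan Zhang

Affiliations Institute for Artificial Intelligence, Peking University; Wangxuan Institute of Computer Technology, Peking University

AI summary Proposes GILT, a framework that uses a token-based in-context learning mechanism to handle node-, edge-, and graph-level classification tasks in a unified way, without LLMs or tuning, and generalizes efficiently.

Comments Accepted as an oral presentation at the GFM @ ICML 2026 Workshop

AI abstract (translated from Chinese)

Graph neural networks (GNNs) are powerful tools for relational data but often generalize poorly to unseen graphs, which has motivated graph foundational models (GFMs). Current GFMs, however, face the extreme heterogeneity of graph data: each graph may have its own feature space, label set, and topology. Two main paradigms have emerged in response. The first relies on large language models (LLMs) but is inherently text-dependent and therefore struggles with the numerical features found in vast numbers of graphs. The second pre-trains a structure-based model, but adapting to a new task usually requires a costly per-graph tuning stage, a critical efficiency bottleneck. In this work we move beyond these limits and introduce the Graph In-context Learning Transformer (GILT), a framework built on an LLM-free, tuning-free architecture. GILT introduces a novel token-based framework for in-context learning (ICL) on graphs that recasts node-, edge-, and graph-level classification in a unified form. This mechanism is the key to handling heterogeneity because it is designed to operate on generic numerical features, and its ability to infer class semantics dynamically from context enables tuning-free adaptation. Comprehensive experiments show that GILT achieves stronger few-shot performance in significantly less time than LLM-based or tuning-based baselines, validating the approach. Our code is available at https://github.com/yiming421/inductnode/.

English abstract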

Graph Neural Networks (GNNs) are powerful tools for processing relational data but often struggle to generalize to unseen graphs, giving rise to the development of Graph Foundational Models (GFMs). However, current GFMs are challenged by the extreme heterogeneity of graph data, where each graph can possess a unique feature space, label set, and topology. To address this, two main paradigms have emerged. The first leverages Large Language Models (LLMs), but is fundamentally text-dependent and thus struggles to handle the numerical features in vast graphs. The second pre-trains a structure-based model, but the adaptation to new tasks typically requires a costly, per-graph tuning stage, creating a critical efficiency bottleneck. In this work, we move beyond these limitations and introduce the Graph In-context Learning Transformer (GILT), a framework built on an LLM-free and tuning-free architecture. GILT introduces a novel token-based framework for in-context learning (ICL) on graphs, reframing classification tasks at the node, edge, and graph levels in a unified way. This mechanism is the key to handling heterogeneity, as it is designed to operate on generic numerical features. Further, its ability to understand class semantics dynamically from the context enables tuning-free adaptation. Comprehensive experiments show that GILT achieves stronger few-shot performance with significantly less time than LLM-based or tuning-based baselines, validating the effectiveness of our approach. Our code is available at: https://github.com/yiming421/inductnode/.

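2604.07833 2026-06-11 cs.RO (updated version)

Harnessing Embodied Agents: Runtime Governance for Policy-Constrained Execution

Xue Qin, Simin Luan, John See, Zeyd Boukhers, Cong Yang, Zhijun Li

Affiliations School of Software, Harbin Institute of Technology; School of Computer Science and Technology, Harbin Institute of Technology; School of Mathematical and Computer Sciences, Heriot-Watt University, Malaysia Campus; School of Future Science and Engineering, Soochow University; Fraunhofer Institute for Applied Information Technology

AI summary Proposes a policy-constrained execution framework that strengthens runtime governance of embodied agents by separating agent cognition from execution oversight. Validated over 1000 randomized simulation trials, it markedly raises the interception rate for unauthorized actions and the recovery success rate.

Comments 36 pages, 3 figures, 10 tables

AI abstract (translated from Chinese)

Embodied agents are evolving from passive reasoning systems into active executors that interact with tools, robots, and physical environments. Once an agent has execution authority, the core challenge is keeping its actions controllable at runtime. Existing approaches embed safety and recovery logic inside the agent loop, which makes execution control hard to standardize, audit, and adapt. This paper argues that embodied intelligence needs not only stronger agents but also stronger runtime governance. We propose a policy-constrained execution framework that separates agent cognition from execution oversight. Governance is externalized into a dedicated runtime layer responsible for policy checking, capability admission, execution monitoring, rollback handling, and human override. We formalize the control boundary among the embodied agent, Embodied Capability Modules (ECMs), and the runtime governance layer, and validate the framework along three governance dimensions in 1000 randomized simulation trials. The results show that 96.2% of unauthorized actions are intercepted, unsafe continuation under runtime drift falls from 100% to 22.2%, and 91.4% of recoveries succeed with full policy compliance, significantly outperforming all baselines (p<0.001). By recasting runtime governance as a systems problem, the paper positions policy-constrained execution as a key design principle for embodied agent systems.

English abstract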

Embodied Agents are evolving from passive reasoning systems into active executors that interact with tools, robots, and physical environments. Once an agent gains execution authority, the central challenge shifts from how to make it act to how to keep its actions governable at runtime. Existing approaches embed safety, recovery, and decision constraints inside the agent loop, making execution control difficult to standardize, audit, and adapt across environments. We propose a runtime governance framework for policy-constrained execution that separates agent cognition from execution oversight. Governance is externalized into a dedicated runtime layer performing policy checking, capability admission, execution monitoring, rollback, and human override. We formalize the control boundary among a persistent Embodied Agent, modular Capability Packages, and the governance layer, and define a policy-constrained execution pipeline evaluated under controlled simulation. Over 1000 randomized trials, the framework achieves 96.2% ± 2.7% interception of unauthorized actions, reduces unsafe continuation from 100% to 22.2% ± 3.1% under runtime drift, and attains 90.7% ± 3.0% recovery success with full policy compliance. Comparison with five baselines, including AutoRT-style constitution filtering and RoboGuard-style two-stage guardrails, shows that pre-execution filtering is equally effective across governance-aware methods, while only the proposed framework provides continuous runtime detection (RVDR = 61.3% vs. 0%) and structured recovery (all p<0.001). A sensitivity sweep across the full detection range confirms a genuine detection-continuation trade-off. This work argues that future embodied systems should be designed for governable execution.

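2605.14738 2026-06-11 cs.LG cs.AI (updated version)

TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability

Krish Sharma, Omar Naim, Soumadeep Saha, Vinija Jain, Aman Chadha, Nicholas Asher

Affiliations ANITI; Meta; Apple

AI summary Studies the mechanism by which task-aware pruning helps on out-of-distribution data. Experiments show that pruning improves OOD accuracy; the core contribution is a geometric explanation of how task-aware pruning adjusts model representations to fit the task.

AI abstract (translated from Chinese)

Recent work, exemplified by TALE, shows that task-aware layer pruning can improve model performance on specific tasks. This paper examines when and why the improvement occurs. We first show that, on controlled polynomial regression tasks and on large language models, such pruning brings no benefit on in-distribution (ID) data but consistently improves out-of-distribution (OOD) accuracy. We further show experimentally that OOD inputs induce layerwise norm and pairwise-distance profiles that deviate from the corresponding ID profiles. This leads to a geometric explanation of task-aware pruning: each task induces a task-adapted geometry, characterized empirically by the representation profiles observed on ID inputs, and OOD inputs can introduce a distorted version of that geometry. Task-aware pruning identifies the layers that create or amplify the distortion; removing them shifts the norms and pairwise distances of OOD representations toward the values observed on the adapted distribution. This realigns OOD inputs with the model's task-adapted geometry and improves performance. We provide causal evidence through controlled distribution shifts and residual-scaling interventions, and show consistent behavior across model scales.

English abstract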

Recent work has promoted task-aware layer pruning as a way to improve model performance on particular tasks, as shown by TALE. In this paper, we investigate when such improvements occur and why. We show first that, across controlled polynomial regression tasks and large language models, such pruning yields no benefit on in-distribution (ID) data but consistently improves out-of-distribution (OOD) accuracy. We further show empirically that OOD inputs induce layerwise norm and pairwise-distance profiles that deviate from the corresponding ID profiles. This leads to a geometric explanation of task-aware pruning: each task induces a task-adapted geometry, characterized empirically by the representation profiles observed on ID inputs. OOD inputs can introduce a distorted version of the task-adapted geometry. Task-aware pruning identifies layers that create or amplify this distortion; by removing them, it shifts OOD representational norms and pairwise distances toward those observed on the adapted distribution. This realigns OOD inputs with the model's task-adapted geometry and improves performance. We provide causal evidence through controlled distribution shifts and residual-scaling interventions, and demonstrate consistent behavior across model scales.

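2605.20795 2026-06-11 cs.CV (updated version)

What Semantics Survive the Connector? Diagnosing VLM-to-DiT Alignment in Video Editing

Hangyu Lin, Chao Wen, Chengming Xu, Jianxiong Gao, Jiangning Zhang, Xiaobin Hu, Yanwei Fu

Affiliations HKUST (The Hong Kong University of Science and Technology); FDU (Fudan University); ZJU (Zhejiang University); NUS (National University of Singapore)

AI summary Investigates the semantic bottleneck in aligning VLMs with DiTs in video generation models. Using the proposed TRACE-Edit dataset and diagnostic protocol, it finds that the connector module severely degrades fine-grained structural semantics, challenging the prevailing assumption.

AI abstract (translated from Chinese)

Flow-matching-based video generation models increasingly rely on prepended vision-language models (VLMs) to handle complex, instruction-based video editing. The prevailing assumption in this paradigm is that a connector module can seamlessly align the VLM's rich multimodal reasoning with the DiT's original text embedding space. We hypothesize instead that this alignment is a severe semantic bottleneck that degrades fine-grained structural variables. Verifying the hypothesis is hard, because end-to-end evaluation conflates alignment failures with generation errors and natural datasets lack disentangled annotations. To study the question rigorously, we propose a controlled data processing pipeline based on video composition, which yields TRACE-Edit, a dataset focused on relation-based editing. Using this dataset, we propose a comprehensive diagnostic protocol that analyzes two important designs in existing video editing models, the meta-query and the connector. A systematic evaluation of four representative model cases shows that fine-grained structural semantics are severely degraded during alignment. Our findings overturn the assumption of lossless semantic transfer, identify VLM-to-DiT alignment as a major bottleneck, and provide a new diagnostic basis for future multimodal alignment architectures.

English abstract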

Flow-matching-based video generative models increasingly rely on prepended Vision-Language Models (VLMs) to handle complex, instruction-based video editing. The prevailing assumption underlying this paradigm is that a connector module can seamlessly align the VLM's rich multi-modal reasoning with the original text embedding space of DiTs. However, we hypothesize that this alignment acts as a severe semantic bottleneck, degrading fine-grained structural variables. Verifying this is challenging, as end-to-end evaluations conflate alignment failures with generation errors, and natural datasets lack disentangled annotations. To rigorously investigate this, we propose a controlled data processing pipeline based on video composition that results in TRACE-Edit, a diagnostic dataset focusing on relation-based editing. Leveraging this dataset, we propose a comprehensive diagnostic protocol to analyze two important designs in existing video editing models: the meta-query and the connector. Systematic evaluation of four representative model cases reveals that fine-grained structural semantics can be severely degraded during alignment. Our findings overturn the assumption of lossless semantic transfer, identifying the VLM-to-DiT alignment as a major bottleneck and providing a new diagnostic foundation for future multi-modal alignment architectures.

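2605.20436 2026-06-11 cs.CV (updated version)

Lighting-aware Unified Model for Instance Segmentation

Qisai Liu, Alloy Das, Zhanhong Jiang, Joshua R. Waite, Aditya Balu, Adarsh Krishnamurthy, Soumik Sarkar

Affiliations Iowa State University

AI summary Proposes a lighting-aware unified model for instance segmentation. Its Lighting Convolutional-Attention module improves segmentation robustness without fine-tuning the heavy backbone, and experiments show that the method effectively addresses the domain gap caused by illumination changes.

AI abstract (translated from Chinese)

Foundation models such as the Segment Anything Model (SAM) show impressive zero-shot generalization but often degrade under diverse real-world illumination, especially in instance segmentation. In this work we address this limitation by developing Lighting Convolutional-Attention (LCA), an adapter module. LCA uses a dual-branch architecture that processes RGB features and contrast maps, making the model sensitive to structural changes rather than illumination artifacts. We optimize LCA with a pairwise training strategy that introduces a targeted loss term explicitly penalizing discrepancies between clean images and their illumination variants. To evaluate and support this architecture, we conduct a comprehensive empirical study across multiple existing benchmarks and present a purpose-built Unity-based synthetic dataset that accurately replicates complex real-world lighting conditions. Extensive experiments show that our method bridges the domain gap and delivers superior lighting-robust segmentation.

English abstract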

Foundation models like the Segment Anything Model (SAM) demonstrate impressive zero-shot generalization but frequently degrade under diverse real-world illumination, particularly for instance segmentation. In this work, we address this limitation by developing Lighting Convolutional-Attention (LCA), an adapter module that enhances segmentation robustness without fine-tuning the heavy backbone. LCA employs a dual-branch architecture to process RGB features alongside contrast maps, enabling physically motivated sensitivity to structural changes rather than illumination artifacts. We optimize LCA through a pairwise training strategy, introducing a targeted loss term that explicitly penalizes discrepancies between clean images and their corresponding illumination variants. To evaluate and support this architecture, we conduct a comprehensive empirical study across multiple existing benchmarks and present a novel Unity-based synthetic dataset specifically designed to accurately replicate complex real-world lighting conditions. Extensive experimental results demonstrate that our approach successfully bridges the domain gap, delivering superior lighting-robust segmentation.

2605.19031 2026-06-11 cs.AI eess.SP (updated version)

KAN-MLP-Mixer: A comprehensive investigation of the usage of Kolmogorov-Arnold Networks (KANs) for improving IMU-based Human Activity Recognition

Mengxi Liu, Sizhen Bian, Vitor Fortes, Francisco Calatrava Nicolas, Daniel Geißler, Maximilian Kiefer-Emmanouilidis, Bo Zhou, Paul Lukowicz

Affiliations DFKI (German Research Center for Artificial Intelligence), Germany; Northwestern Polytechnical University, China; RPTU (University of Kaiserslautern-Landau), Germany; Örebro University, Sweden

AI summary Studies the use of KANs to improve IMU-based human activity recognition (HAR) models and proposes a hybrid architecture that combines the precision of KANs with the robustness and efficiency of MLPs. Experiments show that the hybrid model substantially improves performance on multiple datasets.

Comments 23 pages, 9 figures

English abstract

Kolmogorov-Arnold Networks (KANs) have demonstrated an exceptional ability to learn complex functions on clean, low-dimensional data but struggle to maintain performance on noisy and imperfect real-world datasets. In contrast, conventional multi-layer perceptrons (MLPs) are far more tolerant to noise and more computationally efficient. Replacing all MLP components with KANs in HAR models often degrades accuracy and computational efficiency, highlighting an open challenge: how to combine KANs' precision with MLPs' noise robustness and efficiency. To address this, we systematically explore various placements of KAN modules within deep HAR networks and propose a hybrid architecture that combines the strengths of both paradigms: it uses a KAN-based input embedding layer, retains MLP layers for intermediate feature mixing, and introduces a specialized LarctanKAN module for final activity classification. Across eight public HAR datasets, the hybrid KAN-MLP model achieves an average relative improvement in macro F1 score of 5.33% compared with the pure-MLP model, significantly outperforming standalone KAN and MLP baselines. Furthermore, integrating this hybrid strategy into other state-of-the-art HAR architectures consistently boosts their performance. Our findings demonstrate that a carefully orchestrated combination of KAN, MLP, or other conventional neural components yields more robust and accurate HAR models for real-world wearable sensing environments.

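2602.17001 2026-06-11 cs.AI cs.CL cs.DB (updated version)

Sonar-TS: Search-Then-Verify Natural Language Querying for Time Series Databases

Zhao Tan, Yiji Zhao, Shiyu Wang, Chang Xu, Yuxuan Liang, Xiping Liu, Shirui Pan, Ming Jin

Affiliations Jiangxi University of Finance and Economics; Griffith University; Yunnan University; Microsoft Research Asia; The Hong Kong University of Science and Technology (Guangzhou)

AI summary Proposes Sonar-TS, a neuro-symbolic framework for natural language querying of time series databases. A search-then-verify pipeline handles continuous morphological intents and ultra-long histories, and the new NLQTSBench benchmark is used to show the method's effectiveness on complex temporal queries.

Comments Accepted by ICML 2026

AI abstract (translated from Chinese)

Natural language querying for time series databases (NLQ4TSDB) aims to help non-expert users retrieve meaningful events, intervals, and summaries from massive temporal records. Existing Text-to-SQL methods, however, are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle with ultra-long histories. To address these problems we propose Sonar-TS, a neuro-symbolic framework that handles NLQ4TSDB with a search-then-verify pipeline. Like active sonar, it uses a feature index to ping candidate windows via SQL, then locks on and verifies the candidates against the raw signal with generated Python programs. To enable effective evaluation we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the challenges unique to this domain and show that Sonar-TS is effective on complex temporal queries that traditional methods cannot handle. This work is the first systematic study of NLQ4TSDB and offers a general framework and evaluation standard to support future research.

English abstract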

Natural Language Querying for Time Series Databases (NLQ4TSDB) aims to help non-expert users retrieve meaningful events, intervals, and summaries from massive temporal records. However, existing Text-to-SQL methods are not designed for continuous morphological intents such as shapes or anomalies, while time series models struggle to handle ultra-long histories. To address these challenges, we propose Sonar-TS, a neuro-symbolic framework that tackles NLQ4TSDB via a Search-Then-Verify pipeline. Analogous to active sonar, it utilizes a feature index to ping candidate windows via SQL, followed by generated Python programs to lock on and verify candidates against raw signals. To enable effective evaluation, we introduce NLQTSBench, the first large-scale benchmark designed for NLQ over TSDB-scale histories. Our experiments highlight the unique challenges within this domain and demonstrate that Sonar-TS effectively navigates complex temporal queries where traditional methods fail. This work presents the first systematic study of NLQ4TSDB, offering a general framework and evaluation standard to facilitate future research.
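The search-then-verify idea can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not taken from Sonar-TS itself: the `windows` table schema, the min/max features, the hand-written spike check (standing in for a generated verification program), and all function names are hypothetical.

```python
import sqlite3


def build_feature_index(signal, window):
    """Index fixed-size windows by cheap features (min and max); hypothetical schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE windows (start INTEGER, stop INTEGER, lo REAL, hi REAL)")
    for start in range(0, len(signal) - window + 1, window):
        chunk = signal[start:start + window]
        conn.execute("INSERT INTO windows VALUES (?, ?, ?, ?)",
                     (start, start + window, min(chunk), max(chunk)))
    return conn


def search(conn, min_range):
    """Search step: SQL over the feature index returns candidate windows."""
    rows = conn.execute(
        "SELECT start, stop FROM windows WHERE hi - lo >= ? ORDER BY start",
        (min_range,))
    return rows.fetchall()


def verify_spike(signal, start, stop, min_jump):
    """Verify step: check the raw signal for a point above both neighbours by min_jump."""
    for i in range(max(start, 1), min(stop, len(signal) - 1)):
        if signal[i] - signal[i - 1] >= min_jump and signal[i] - signal[i + 1] >= min_jump:
            return True
    return False


def find_spikes(signal, window=4, min_jump=5.0):
    """Run the coarse SQL search, then keep only candidates the raw-signal check confirms."""
    conn = build_feature_index(signal, window)
    return [(a, b) for a, b in search(conn, min_jump)
            if verify_spike(signal, a, b, min_jump)]
```

In this toy setting a steadily rising window can pass the coarse range filter and still be rejected by the raw-signal check, which is the reason for separating a cheap index search from a more precise verification pass.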

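2510.13293 2026-06-11 cs.CL (updated version)

Cross-modal Consistency Guidance for Robust Emotion Control in Auto-Regressive TTS Models

Yizhou Peng, Yukun Ma, Chong Zhang, Yi-Wen Chao, Chongjia Ni, Bin Ma, Eng Siong Chng

Affiliations Alibaba-NTU Global e-Sustainability CorpLab; Nanyang Technological University; College of Computing and Data Science; Alibaba; Alibaba Inc.

AI summary Proposes Cross-modal Consistency Guided Classifier-Free Guidance (CCG-CFG), whose dynamic scale depends on the degree of inconsistency between the text emotion and the explicit speech emotion. The method replaces the dropout condition with the text emotion and distills the CCG-CFG guidance signal with a hard-sample mining strategy to improve the TTS model's emotional alignment. On five emotional corpora and two TTS benchmarks, applying it to CosyVoice2 yields a 12% gain in emotion-recognition accuracy and a 10% gain in subjective scores, outperforming the baselines HierSpeech++, Qwen3-TTS, and the original CosyVoice2 while preserving intelligibility, naturalness, and quality.

Comments Accepted to Interspeech 2026, short paper

AI abstract (translated from Chinese)

Although text-to-speech (TTS) systems support emotion control through natural-language instructions, expressiveness, naturalness, and speech quality degrade when the target emotion conflicts with the semantics of the text. We propose Cross-modal Consistency Guided Classifier-Free Guidance (CCG-CFG), which uses a dynamic scale based on the degree of inconsistency between the text emotion and the explicit speech emotion and replaces the dropout condition with the text emotion. We also distill the CCG-CFG guidance signal with a hard-sample mining strategy to improve the TTS model's emotional alignment. Evaluations on five emotional corpora and two TTS benchmarks show that, applied to CosyVoice2, our method improves emotion-recognition accuracy by 12% and subjective scores by 10%, outperforming baselines including HierSpeech++, Qwen3-TTS, and the original CosyVoice2 while preserving intelligibility, naturalness, and high quality.

English abstract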

While Text-to-Speech (TTS) systems enable emotional control via natural-language instructions, expressiveness, naturalness, and speech quality degrade when the target emotion conflicts with the textual semantics. We propose a Cross-modal Consistency Guided Classifier-Free Guidance (CCG-CFG) method with dynamic scales based on the degree of inconsistency between the text emotion and the explicit speech emotion, replacing the dropout condition with the text emotion. We also distill the CCG-CFG guidance signal using a hard-sample mining strategy, improving the TTS model's emotional alignment capability. Evaluations on five emotional corpora and two TTS benchmarks show that our approaches applied to CosyVoice2 achieve up to a 12% absolute improvement in emotion-recognition accuracy and a 10% relative improvement in subjective scores, outperforming baselines including HierSpeech++, Qwen3-TTS, and original CosyVoice2, while preserving intelligibility, naturalness, and high speech quality.
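As a rough illustration of guidance with a dynamic scale, the sketch below applies the standard classifier-free guidance extrapolation, reference + scale * (conditioned - reference), with the text-emotion branch used as the reference in place of an unconditional one. The linear scale schedule, the `base` and `gain` values, and the function names are assumptions for illustration only; the abstract does not state the paper's actual schedule.

```python
def dynamic_scale(inconsistency, base=1.0, gain=2.0):
    """Hypothetical schedule: guidance grows linearly with the text/speech emotion mismatch.

    inconsistency is assumed to lie in [0, 1]; base and gain are illustrative values.
    """
    return base + gain * inconsistency


def guided_logits(target_logits, text_logits, scale):
    """Classifier-free-guidance-style extrapolation from the text-emotion branch
    toward the target-emotion branch."""
    return [ref + scale * (tgt - ref) for tgt, ref in zip(target_logits, text_logits)]
```

With `scale` equal to 1 the result is just the target-conditioned logits; larger scales push further away from what the text emotion alone would predict, which is the intended effect when the two emotions disagree.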

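2511.02414 2026-06-11 cs.AI (updated version)

A New Perspective on Precision and Recall for Generative Models

Benjamin Sykes, Loïc Simon, Julien Rabin, Jalal Fadili

Affiliations Normandie Univ, UNICAEN, ENSICAEN, CNRS, GREYC

AI summary Proposes a new framework, based on a binary classification viewpoint, for estimating the complete precision-recall curve of a generative model. A statistical analysis yields a minimax upper bound, and the framework is shown to extend several landmark PR metrics from the literature.

AI abstract (translated from Chinese)

With the recent success of generative models for images and text, their evaluation has attracted wide attention in recent years. While most existing methods rely on scalar metrics, the introduction of precision and recall (PR) as evaluation metrics for generative models has opened a new line of research. The associated PR curves allow a richer analysis, but estimating them poses many challenges. In this paper we propose a new framework, based on a binary classification viewpoint, for estimating complete PR curves. We carry out a thorough statistical analysis of the proposed estimates and, as a byproduct, obtain a minimax upper bound on the PR estimation risk. We also show that the framework extends several landmark PR metrics from the literature that are restricted by design to the extreme points of the curve. Finally, we study the different behaviors of the curves obtained in various settings.

English abstract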

With the recent success of generative models for images and text, the question of their evaluation has gained a lot of attention. While most state-of-the-art methods rely on scalar metrics, the introduction of Precision and Recall (PR) for generative models has opened up a new avenue of research. The associated PR curves allow for a richer analysis, but their estimation poses several challenges. In this paper, we present a new framework for estimating entire PR curves based on a binary classification standpoint. We conduct a thorough statistical analysis of the proposed estimates. As a byproduct, we obtain a minimax upper bound on the PR estimation risk. We also show that our framework extends several landmark PR metrics from the literature, which by design are restricted to the extreme values of the curve. Finally, we study the different behaviors of the curves obtained experimentally in various settings.
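For readers unfamiliar with PR curves for generative models, the sketch below evaluates one point of the curve for two known discrete distributions, using the trade-off-parameter formulation common in this literature: precision alpha(lambda) = sum over outcomes of min(lambda * P, Q), and recall beta(lambda) = alpha(lambda) / lambda. This is background only, not the estimator proposed in the paper; in practice P and Q are unknown and must be estimated from samples, which is the hard part the abstract refers to. The function names are hypothetical.

```python
def prd_point(p_ref, q_model, lam):
    """One (precision, recall) point for discrete reference P and model Q.

    lam > 0 is the trade-off parameter; sweeping it traces the whole curve.
    """
    alpha = sum(min(lam * p, q) for p, q in zip(p_ref, q_model))
    beta = alpha / lam
    return alpha, beta


def prd_curve(p_ref, q_model, lambdas):
    """Evaluate the curve at several trade-off values."""
    return [prd_point(p_ref, q_model, lam) for lam in lambdas]
```

Identical distributions give precision and recall of 1 at lambda = 1, while distributions that share only half of their mass give 0.5 for both.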

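2605.16651 2026-06-11 cs.CV cs.LG (updated version)

Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations

Narges Babadi, Hadis Karimipour

Affiliations University of California, Berkeley

AI summary Investigates whether explanation heatmaps in vision-language models faithfully reflect the reasoning process under adversarial conditions, and proposes the X-Shift attack, which exposes a disconnect between explanations and predictive behavior and confirms the fragility of explanation mechanisms.

Comments Accepted at the ICML 2026 Workshop on Trustworthy AI for Good (AI4GOOD), Seoul, South Korea

AI abstract (translated from Chinese)

Explanation mechanisms are widely used to strengthen transparency and trust in vision-language models (VLMs), especially in decision settings that require human oversight. The robustness of these explanations, however, remains unclear. This paper investigates whether explanation heatmaps in VLMs, particularly CLIP-based models, faithfully reflect model reasoning under adversarial conditions. We find that explanation maps can be manipulated systematically while the model's original prediction is preserved, revealing a disconnect between predictive behavior and explanation faithfulness. To study this vulnerability we introduce X-Shift, a new grey-box attack that perturbs patch-level visual representations to steer explanation heatmaps toward semantically irrelevant regions without changing the predicted output. Unlike conventional adversarial attacks, which aim to induce misclassification, X-Shift targets the integrity of the explanation process itself. The attack does not modify model parameters and generalizes across multiple CLIP architectures and explanation methods. We evaluate the approach on ImageNet-1k, MS-COCO, and Flickr30K and show that, under imperceptible perturbations, explanation alignment degrades consistently while predictions remain stable. Moreover, standard prediction-oriented adversarial attacks fail to reproduce the same explanation-shifting behavior even with larger perturbation budgets. Our findings highlight a fundamental limitation of current explanation mechanisms in VLMs and raise concerns about their use as reliable indicators of trust in high-impact applications.

English abstract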

Explanation mechanisms are increasingly used to support transparency and trust in vision-language models (VLMs), particularly in settings where model decisions require human oversight. However, the robustness of these explanations remains insufficiently understood. In this work, we investigate whether explanation heatmaps in VLMs, particularly CLIP-based models, faithfully reflect model reasoning under adversarial conditions. We show that explanation maps can be systematically manipulated while preserving the model's original prediction, revealing a disconnect between predictive behavior and explanation faithfulness. To study this vulnerability, we introduce X-Shift, a novel grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output. Unlike conventional adversarial attacks that aim to induce misclassification, X-Shift specifically targets the integrity of the explanation process itself. The attack operates without modifying model parameters and generalizes across multiple CLIP architectures and explanation methods. We evaluate the proposed approach on ImageNet-1k, MS-COCO, and Flickr30K, demonstrating consistent degradation in explanation alignment under imperceptible perturbations while maintaining prediction stability. Furthermore, standard prediction-oriented adversarial attacks fail to reproduce the same explanation-shifting behavior even under substantially larger perturbation budgets. Our findings highlight a fundamental limitation of current explanation mechanisms in VLMs and raise concerns about their use as reliable indicators of model trustworthiness in high-impact applications.

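2603.21396 2026-06-11 cs.LG (updated version)

Mechanisms of Introspective Awareness

Uzay Macar, Li Yang, Atticus Wang, Peter Wallich, Emmanuel Ameisen, Jack Lindsey

Affiliations Anthropic Fellows Program; MIT; Constellation; Anthropic

AI summary Uncovers the mechanisms of introspective awareness by which large language models detect injected steering vectors: the behavior is robust, originates in post-training, is implemented by a two-stage circuit, and relies on mechanisms that differ across layers.

AI abstract (translated from Chinese)

Recent work shows that large language models can sometimes detect that a steering vector has been injected into their residual stream and identify the injected concept, a phenomenon termed "introspective awareness."

English abstract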

Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept -- a phenomenon termed "introspective awareness." We investigate the mechanisms underlying this capability in open-weights models. First, we find that it is behaviorally robust: models detect injected steering vectors at moderate rates with 0% false positives across diverse prompts and dialogue formats. Notably, this capability emerges specifically from post-training; we show that preference optimization algorithms like DPO can elicit it, but standard supervised finetuning does not. We provide evidence that detection cannot be explained by simple linear association between certain steering vectors and directions promoting affirmative responses. We trace the detection mechanism to a two-stage circuit in which "evidence carrier" features in early post-injection layers detect perturbations monotonically along diverse directions, suppressing downstream "gate" features that implement a default negative response. This circuit is absent in base models and robust to refusal ablation. Identification of injected concepts relies on largely distinct later-layer mechanisms that only weakly overlap with those involved in detection. Finally, we show that introspective capability is substantially underelicited: ablating refusal directions improves detection by +53%, and a trained bias vector improves it by +75% on held-out concepts, both without meaningfully increasing false positives. Our results suggest that this introspective awareness of injected concepts is robust and mechanistically nontrivial, and could be substantially amplified in future models. Code: https://github.com/safety-research/introspection-mechanisms.

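2605.15687 2026-06-11 cs.CL cs.AI (updated version)

ASRU: Activation Steering Meets Reinforcement Unlearning for Multimodal Large Language Models

Jiahui Guang, Haiyan Wang, Yingjie Zhu, Cuiyun Gao, Jing Li, Di Shao, Zhaoquan Gu

Affiliations University of Science and Technology of China

AI summary ASRU is a controllable multimodal unlearning framework that uses activation steering and reinforcement learning to improve the unlearning effectiveness and generation quality of multimodal large language models. On Qwen3-VL, unlearning effectiveness improves by 24.6% and generation quality by 5.8x.

AI abstract (translated from Chinese)

Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, which makes machine unlearning (MU) crucial. Existing methods usually evaluate unlearning effectiveness by output deviation and overlook generation quality after unlearning. This can lead to hallucinated or rigid responses that harm the usability and safety of the unlearned model. To address this, we propose ASRU, a controllable multimodal unlearning framework that treats generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation steering, then optimizes fine-grained refusal boundaries with a customized reward function, achieving a better balance between forgetting the target knowledge and preserving model utility. Experiments on Qwen3-VL show that, on average, ASRU significantly improves unlearning effectiveness (+24.6%) and generation quality (5.8x) while effectively preserving model utility, using only a small amount of retained supervision data.

English abstract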

Multimodal large language models (MLLMs) may memorize sensitive cross-modal information during pretraining, making machine unlearning (MU) crucial. Existing methods typically evaluate unlearning effectiveness based on output deviations, while overlooking the generation quality after unlearning. This can easily lead to hallucinated or rigid responses, thereby affecting the usability and safety of the unlearned model. To address this issue, we propose ASRU, a controllable multimodal unlearning framework that incorporates generation quality as a core evaluation objective. ASRU first induces initial refusal behavior through activation redirection, and then optimizes fine-grained refusal boundaries using a customized reward function, thereby achieving a better trade-off between target knowledge unlearning and model utility. Experiments on Qwen3-VL show that, on average, ASRU significantly improves both unlearning effectiveness (+24.6%) and generation quality (5.8x) while effectively preserving model utility, using only a small amount of retained supervision data.

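2605.15435 2026-06-11 cs.LG cs.NE (updated version)

On the Stability of Growth in Structural Plasticity

Lute Lillo, Nick Cheney

Affiliations University of Vermont

AI summary Studies how growth and pruning differ in stability under structural plasticity: growth inserts new units into an already specialized optimization trajectory, whereas pruning selects among units that have trained from the start. In image classification, growth can reach high final accuracy but needs enough time to integrate new units to be competitive.

AI abstract (translated from Chinese)

Standard deep-learning pipelines usually choose a network architecture before training and keep it fixed. A model can instead adapt during training by pruning existing hidden units or growing new ones. Although growth is attractive for adaptive and continual learning systems, this paper shows that growth is not simply the inverse of pruning. Pruning selects among units that have taken part in training from the start, whereas growth inserts new units into an already specialized optimization trajectory. Newborn units are usually active in the forward computation but receive a weak backward signal. This disadvantage is small in small MLP benchmarks but becomes evident in harder image-classification settings. There, Grow can reach high final accuracy during structural editing, while Prune is stronger when performance is averaged over the training trajectory or when the sparse network is retrained. Interventions on optimizer state, insertion, selection, and trainability show that better integration of newborn units improves adaptive performance but does not automatically yield better final subnetworks. In continual-learning benchmarks that stress plasticity loss, Grow is competitive when new units have enough time to integrate. These results suggest that Grow should be treated not only as an architecture-search operator but as a time-sensitive optimization process whose success depends on insertion stability.

English abstract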

Standard deep-learning pipelines usually choose the network architecture before training and keep it fixed throughout optimization. In contrast, a model can also be adapted by editing its structure during training, for example by pruning existing hidden-neuron units or growing new ones. Although growth is appealing for adaptive and continual systems, we show that it is not simply the inverse of pruning. Pruning selects among units that have participated in training from the start, whereas growth inserts new units into an already specialized optimization trajectory. We isolate this insertion problem and show that newborn units are often forward-active but backward-starved: they participate in the forward computation, yet receive much weaker gradient signal than incumbent units. This disadvantage is minor in small MLP benchmarks, but becomes clear in harder image-classification settings with a convolutional trunk. In these settings, Grow can achieve high final accuracy during the structural-editing procedure, while Prune is stronger when performance is averaged over the training trajectory or when the final sparse network is retrained from scratch. Interventions targeting optimizer state, insertion, selection, and trainability show that improving the integration of newborn units can improve adaptive performance, but does not automatically produce better final subnetworks. In continual-learning benchmarks stressing plasticity loss, Grow becomes competitive mainly when new units have enough time to integrate. Together, these results suggest that Grow should be evaluated not only as an architecture-search operator, but as a time-sensitive optimization process whose success depends on insertion stability.

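2509.20241 2026-06-11 cs.LG cs.DC (updated version)

Energy Use of AI Inference, Efficiency Pathways, and Test-Time Scaling

Felipe Oviedo, Fiodar Kazhamiaka, Esha Choukse, Allen Kim, Amy Luers, Melanie Nakagawa, Ricardo Bianchini, Juan M. Lavista Ferres

Affiliations Microsoft

AI summary Introduces a bottom-up method based on token throughput to estimate the per-query energy use of large language models at scale, showing how energy use changes under test-time scaling and how much efficiency improvements could save.

Comments A preprint version with DOI is available at Zenodo: https://doi.org/10.5281/zenodo.17188770

Journal ref Joule (2026) 102430

AI abstract (translated from Chinese)

As AI inference scales to billions of queries and emerging reasoning and agentic workflows increase token demand, reliable estimates of per-query energy use are essential for capacity planning, emissions accounting, and efficiency prioritization. Many public estimates are inconsistent and overstate energy use because they extrapolate from limited benchmarks and fail to reflect efficiency gains at scale. This paper introduces a bottom-up method based on token throughput to estimate the per-query energy of large-scale LLM systems. For models running on H100 nodes, under realistic workloads and with GPU utilization and PUE constraints, we estimate a median per-query energy of 0.34 Wh (IQR: 0.18-0.67) for frontier-scale models (>200B parameters). These results agree with measurements of production-scale configurations and indicate that non-production estimates may overstate energy use by 4-20x. Extending to test-time scaling scenarios, in which a typical query uses 15x more tokens, the median energy rises to 4.32 Wh, indicating that targeting efficiency in this regime will deliver the largest fleet-wide savings. We quantify achievable efficiency gains at the model, serving-platform, and hardware levels, finding individual median reductions of 1.5-3.5x in per-query energy, while combined improvements could deliver 8-20x reductions. To illustrate the system-level impact, we estimate a baseline of 0.8 GWh/day for a deployment serving one billion queries. If 10% are long queries, demand could grow to 1.8 GWh/day. With targeted efficiency interventions it falls to 0.9 GWh/day, comparable to the energy of web search at that scale. This echoes how data centers have historically contained energy growth through efficiency gains.

English abstract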

As AI inference scales to billions of queries, estimates of per-query energy use are increasingly important for capacity planning, efficiency interventions, and policy. Yet many public estimates assume non-production settings, leading to systematic overestimation. We introduce a bottom-up framework estimating inference energy from token throughput, node power, and overhead under large-scale deployment assumptions. For frontier-scale models (>200B parameters) on H100 nodes, we estimate a median energy of 0.31 Wh/query (IQR 0.16-0.60), indicating widely cited estimates are overstated by 4-20x. In test-time scaling scenarios 15x longer than typical queries, the median energy rises 13x to 3.91 Wh (IQR 2.15-7.05). Across models, serving systems, and hardware, we estimate 8-20x line-of-sight energy reductions. At datacenter scale, serving 1 billion queries/day requires 0.7 GWh; if 10% are long queries, demand rises to 1.7 GWh/day. With efficiency interventions, it falls to 0.8 GWh/day, mitigating the energy impact of test-time scaling.
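The arithmetic behind a bottom-up estimate of this kind can be sketched as follows. The abstract names token throughput, node power, and overhead as the inputs; the exact way they are combined here, the function names, and every number in the usage note are illustrative assumptions, not the paper's model or data.

```python
def energy_per_query_wh(tokens, node_tokens_per_second, node_power_w, pue):
    """Time the node spends on the query's tokens x node power x facility overhead (PUE)."""
    seconds = tokens / node_tokens_per_second
    joules = seconds * node_power_w * pue
    return joules / 3600.0  # 1 Wh = 3600 J


def fleet_energy_gwh_per_day(queries_per_day, wh_typical, wh_long, long_share):
    """Daily energy for a mix of typical and long (test-time-scaled) queries."""
    mean_wh = (1.0 - long_share) * wh_typical + long_share * wh_long
    return queries_per_day * mean_wh / 1e9  # Wh -> GWh
```

For example, with made-up inputs of 500 tokens, an aggregate node throughput of 2000 tokens/s, a 10 kW node, and a PUE of 1.2, the first function gives about 0.83 Wh per query. Fleet-level totals depend on the mean energy over the whole query mix rather than the median, so medians quoted for typical and long queries cannot simply be multiplied out to recover a fleet figure.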

2605.12288 2026-06-11 cs.CL cs.AI updated version

TokenRatio: Principled Token-Level Preference Optimization via Ratio Matching

TokenRatio: 通过比率匹配实现原理化的token级偏好优化

Truong Nguyen, Tien-Phat Nguyen, Linh Ngo Van, Duy Minh Ho Nguyen, Khoa Doan, Trung Le

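Affiliations: National University of Singapore; Institute of Cybernetics and Robotics, Czech Technical University in Prague

AI summary: This paper proposes TBPO, which recovers token-level preference optimality through ratio matching, improving alignment quality and training stability while increasing output diversity.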

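Details
AI-generated abstract (translated from Chinese)

Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose the sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a prefix-conditioned token-level Bradley-Terry preference model and derives a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and retaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. On instruction-following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.

English abstract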

Direct Preference Optimization (DPO) is a widely used RL-free method for aligning language models from pairwise preferences, but it models preferences over full sequences even though generation is driven by per-token decisions. Existing token-level extensions typically decompose a sequence-level Bradley-Terry objective across timesteps, leaving per-prefix (state-wise) optimality implicit. We study how to recover token-level preference optimality using only standard sequence-level pairwise comparisons. We introduce Token-level Bregman Preference Optimization (TBPO), which posits a token-level Bradley-Terry preference model over next-token actions conditioned on the prefix, and derive a Bregman-divergence density-ratio matching objective that generalizes the logistic/DPO loss while preserving the optimal policy induced by the token-level model and maintaining DPO-like simplicity. We introduce two instantiations: TBPO-Q, which explicitly learns a lightweight state baseline, and TBPO-A, which removes the baseline through advantage normalization. Across instruction following, helpfulness/harmlessness, and summarization benchmarks, TBPO improves alignment quality and training stability and increases output diversity relative to strong sequence-level and token-level baselines.
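For orientation, the sequence-level logistic loss that the abstract says TBPO generalizes can be sketched as below. This is the standard DPO loss for a single preference pair, not the TBPO objective itself. The argument names and the default `beta` are illustrative assumptions.

```python
import math


def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair; inputs are summed sequence log-probs."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)
```

The loss depends on the response only through whole-sequence log-probabilities, which is the sequence-level granularity the abstract contrasts with per-token decisions.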

2605.13674 2026-06-11 cs.CV cs.AI updated version

Weakly Supervised Segmentation as Semantic-Based Regularization

弱监督分割作为语义基于的正则化

Stefano Colamonaco, Andrei-Bogdan Florea, Jaron Maene

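Affiliations: KU Leuven

AI summary: This paper takes a neurosymbolic approach that integrates fuzzy logic with deep segmentation models, using weak annotations and domain priors to improve pseudo-label quality and reach segmentation accuracy that often exceeds densely supervised baselines.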

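Details
AI-generated abstract (translated from Chinese)

Weakly supervised semantic segmentation (WSSS) trains dense pixel-level segmentation models from partial or coarse annotations such as bounding boxes, scribbles, or image tags. Although recent work uses foundation models such as the Segment Anything Model (SAM) to generate pseudo-labels, these methods usually rely on heuristic prompt selection and struggle to incorporate prior knowledge or heterogeneous labels. This paper takes a neurosymbolic perspective, combining differentiable fuzzy logic with deep segmentation models. Weak annotations and domain-specific priors are unified as continuous logical constraints used to fine-tune SAM under weak supervision. The refined foundation model then generates improved pseudo-labels, from which a prompt-free second-stage segmentation model is trained. Experiments on Pascal VOC 2012 and the REFUGE2 optic disc/cup segmentation dataset show that logic-guided fine-tuning produces higher-quality pseudo-labels, leading to segmentation accuracy that exceeds densely supervised baselines.

English abstract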

Weakly supervised semantic segmentation (WSSS) trains dense pixel-level segmentation models from partial or coarse annotations such as bounding boxes, scribbles, or image-level tags. While recent work leverages foundation models such as the Segment Anything Model (SAM) to generate pseudo-labels, these approaches typically depend on heuristic prompt choices and offer limited ways to incorporate prior knowledge or heterogeneous labels. We address this gap by taking a neurosymbolic perspective: integrating differentiable fuzzy logic with deep segmentation models. Weak annotations and domain-specific priors are unified as continuous logical constraints that fine-tune SAM under weak supervision. The refined foundation model then produces improved pseudo-labels, from which we train a second-stage prompt-free segmentation model. Experiments on Pascal VOC 2012 and the REFUGE2 optic disc/cup segmentation dataset show that our logic-guided fine-tuning yields higher-quality pseudo-labels, leading to state-of-the-art segmentation accuracy that often exceeds densely supervised baselines.
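To illustrate how a weak annotation can become a continuous logical constraint, the sketch below scores a bounding-box label against a map of foreground probabilities using product fuzzy logic. The choice of operators (product t-norm and its dual) and the two rules encoded are assumptions made for illustration; the abstract does not state which operators or constraints the authors use.

```python
import math


def fuzzy_not(a):
    return 1.0 - a


def fuzzy_and(values):
    """Product t-norm over an iterable of truth values in [0, 1]."""
    return math.prod(values)


def fuzzy_or(values):
    """Probabilistic sum, the dual of the product t-norm."""
    return 1.0 - math.prod(1.0 - v for v in values)


def box_constraint_truth(probs, box):
    """Truth of: no foreground outside the box AND some foreground inside.

    probs: 2D list of foreground probabilities.
    box: (r0, r1, c0, c1) with half-open row and column ranges.
    """
    r0, r1, c0, c1 = box
    inside, outside = [], []
    for r, row in enumerate(probs):
        for c, p in enumerate(row):
            in_box = r0 <= r < r1 and c0 <= c < c1
            (inside if in_box else outside).append(p)
    no_fg_outside = fuzzy_and(fuzzy_not(p) for p in outside)
    some_fg_inside = fuzzy_or(inside)
    return fuzzy_and([no_fg_outside, some_fg_inside])
```

A training loss could be taken as one minus this truth value. Fine-tuning a model such as SAM would require the same operators in an automatic-differentiation framework, since this plain-Python version only shows the semantics.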

2605.12655 2026-06-11 cs.AI cs.MA updated version

Robust Instruction Compliance in Cooperative Multi-Agent Reinforcement Learning

鲁棒的指令遵从:合作多智能体强化学习

Wo Wei Lin, Ethan Rathbun, Enrico Marchesini, Xiang Zhi Tan

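Affiliations: Department of Computer Sciences, Northeastern University; Department of Computer Sciences, Massachusetts Institute of Technology

AI summary: To handle external instructions that interrupt behavior and conflict with long-horizon objectives, this paper proposes Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries to obtain consistent value estimates, maintaining high instruction compliance and base task performance in complex cooperative environments.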

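Details
AI-generated abstract (translated from Chinese)

Multi-agent reinforcement learning (MARL) in real-world settings may need to adapt to external natural-language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode: Bellman updates couple value estimates across instruction contexts, which leads to inconsistent values when an instruction interrupts a macro-action. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide a theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.

English abstract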

Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode as Bellman updates couple value estimates across instruction contexts, leading to inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base task performance in increasingly complex cooperative multi-agent environments.

2605.12386 2026-06-11 cs.RO updated version

SafeManip: A Property-Driven Benchmark for Temporal Safety Evaluation in Robotic Manipulation

SafeManip: 一种基于属性的基准,用于机器人操作中的时间安全评估

Chengyue Huang, Khang Vo Huynh, Sebastian Elbaum, Zsolt Kira, Lu Feng

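Affiliations: Department of Machine Learning, Georgia Institute of Technology; Department of Computer Science, University of Virginia

AI summary: SafeManip defines reusable safety templates to evaluate temporal safety properties in robotic manipulation across eight safety categories, including collision safety and grasp stability, and shows that existing policies fall short on safety.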

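Details
AI-generated abstract (translated from Chinese)

Robotic manipulation is usually evaluated by task success rate, but successful completion does not guarantee safe execution. Many safety failures are temporal: a robot may touch a clean surface after contamination, or release an object before it is fully inside an enclosure. We introduce SafeManip, a property-driven benchmark that explicitly evaluates temporal safety properties in robotic manipulation, going beyond prior evaluations that focus mainly on task completion or per-state constraint violations. SafeManip defines reusable safety templates using Linear Temporal Logic over finite traces (LTLf). It maps observed rollouts to symbolic predicate traces and evaluates them with LTLf-based monitors. Its property suite covers eight manipulation safety categories: collision and contact safety, grasp stability, release stability, cross-contamination, action onset, mechanism recovery, object containment, and enclosure access. Templates can be instantiated with task-specific objects, fixtures, regions, or skills, so the same safety specifications generalize across tasks and environments. We evaluate SafeManip on six vision-language-action policies, including π0, π0.5, GR00T, and their training variants, across 50 RoboCasa365 household tasks. The results show that even strong models often behave unsafely. Gains in task success do not always translate into safer execution: many successful runs remain unsafe, while longer-horizon or more complex tasks expose more violations. SafeManip provides a reusable evaluation layer for diagnosing temporal safety failures and measuring safe success rather than task completion alone.

English abstract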

Robotic manipulation is typically evaluated by task success, but successful completion does not guarantee safe execution. Many safety failures are temporal: a robot may touch a clean surface after contamination or release an object before it is fully inside an enclosure. We introduce SafeManip, a property-driven benchmark to explicitly evaluate temporal safety properties in robotic manipulation, moving beyond prior evaluations that largely focus on task completion or per-state constraint violations. SafeManip defines reusable safety templates over finite executions using Linear Temporal Logic over finite traces (LTLf). It maps observed rollouts to symbolic predicate traces and evaluates them with LTLf-based monitors. Its property suite covers eight manipulation safety categories: collision and contact safety, grasp stability, release stability, cross-contamination, action onset, mechanism recovery, object containment, and enclosure access. Templates can be instantiated with task-specific objects, fixtures, regions, or skills, allowing the same safety specifications to generalize across tasks and environments. We evaluate SafeManip on six vision-language-action policies, including $π_0$, $π_{0.5}$, GR00T, and their training variants, across 50 RoboCasa365 household tasks. Results show that even strong models often behave unsafely. Task-success gains do not reliably translate into safer execution: many successful rollouts remain unsafe, while longer-horizon or more complex tasks expose more violations. SafeManip provides a reusable evaluation layer for diagnosing temporal safety failures and measuring safe success beyond task completion.
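The two temporal failures named in the abstract can be checked on a finite trace with very small monitors. The sketch below represents a rollout as a list of sets of predicate names and evaluates two LTLf patterns directly, without a general LTLf library. The predicate names and helper functions are hypothetical and are not SafeManip's templates or API.

```python
def always(trace, condition):
    """LTLf G(condition): condition holds at every step of the finite trace."""
    return all(condition(step) for step in trace)


def never_after(trace, trigger, forbidden):
    """LTLf G(trigger -> G(not forbidden)).

    Once `trigger` has held, `forbidden` must not hold at that step or later.
    """
    triggered = False
    for step in trace:
        triggered = triggered or trigger in step
        if triggered and forbidden in step:
            return False
    return True


def release_only_inside(step):
    """release -> inside_enclosure, evaluated on a single step."""
    return "release" not in step or "inside_enclosure" in step


# Hypothetical predicate traces: each element is the set of true predicates.
safe_rollout = [{"holding"}, {"holding", "inside_enclosure"},
                {"release", "inside_enclosure"}]
early_release = [{"holding"}, {"release"}]
contaminated_touch = [{"contaminated"}, set(), {"touch_clean_surface"}]
```

Here `always(early_release, release_only_inside)` fails even though the object might still end up in place, which mirrors the abstract's point that a rollout can succeed at the task and still be unsafe.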

2605.12053 2026-06-11 cs.RO updated version

Closing the Motion Execution Gap: From Semantic Motion Task Constraints to Kinematic Control

弥合运动执行差距:从语义运动任务约束到运动学控制

Simon Stelter, Vanessa Hassouna, Malte Huerkamp, Michael Beetz

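Affiliations: University of Bremen

AI summary: This paper proposes Motion Statecharts to connect semantic constraints to executable robot motions, uses a unified differentiable kinematic world model for world-centric motion specification and cross-platform generalization, and adopts an lMPC-based task-function approach to ensure smooth transitions during task switches.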

Comments 9 pages, 8 figures, to be published in IJCAI 2026

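Details
AI-generated abstract (translated from Chinese)

This paper addresses the motion execution gap, the disconnect between high-level symbolic task descriptions and executable robot motions, and proposes Motion Statecharts as an executable symbolic representation. The approach allows motion constraints, monitors, or nested statecharts to be arranged arbitrarily in parallel and in sequence. A unified differentiable kinematic world model enables world-centric motion specification and generalization across embodiments. Motion execution is realized through an lMPC-based task-function approach, with jerk limits ensuring smooth transitions during task switches. Deployment on eight robot platforms demonstrates cross-platform transferability. The proposed framework is called Giskard and is open source: https://github.com/cram2/cognitive_robot_abstract_machine.

English abstract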

This paper addresses the Motion Execution Gap, the disconnect between high-level symbolic task descriptions using semantic constraints and executable robot motions. Motion Statecharts are introduced as an executable symbolic representation for complex motions. They allow the arbitrary arrangement of motion constraints, monitors, or nested statecharts in parallel and in sequence. World-centric motion specification and generalization across embodiments are enabled through the use of a unified differentiable kinematic world model of both robots and environments. Motion execution is realized through an lMPC-based implementation of the task-function approach, in which smooth transitions during task switches are ensured using jerk bounds. Cross-platform transferability was demonstrated by deploying the method on eight robot platforms operating in diverse environments. The proposed framework is called Giskard and is available as open source: https://github.com/cram2/cognitive_robot_abstract_machine.

2605.11911 2026-06-11 cs.LG updated version

Understanding Sample Efficiency in Predictive Coding

理解预测编码中的样本效率

Gaspard Oliviers, Elene Lominadze, Rafal Bogacz

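Affiliations: Nuffield Department of Clinical Neurosciences, University of Oxford, United Kingdom; MRC Centre of Research Excellence in Restorative Neural Dynamics, United Kingdom

AI summary: This paper studies the sample-efficiency advantage of predictive coding (PC), using a target-alignment metric to analyze learning efficiency in BP and PC; it finds that PC performs better in deep, narrow, and pre-trained networks and offers a mechanistic understanding that can guide how PC is parametrised.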

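Details
AI-generated abstract (translated from Chinese)

Predictive Coding (PC) is an influential theory of cortical learning. Much recent research compares PC with Backpropagation (BP) to determine whether PC has advantages. Small-scale experiments show that PC learns more efficiently in many contexts, but the theoretical understanding remains unclear. This paper quantifies the learning efficiency of BP and PC through a target-alignment metric, and derives and validates analytical expressions for target alignment in deep linear networks. We find that PC learns more efficiently than BP, especially in deep, narrow, and pre-trained networks. We also derive exact conditions that guarantee optimal target alignment in PC and validate them experimentally. We study full training trajectories of linear and non-linear models and find that the predicted benefits of PC persist even when some assumptions do not hold. This work provides a mechanistic understanding of the higher learning efficiency of PC over BP observed in previous work, and guides how PC should be parametrised to learn most effectively.

English abstract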

Predictive Coding (PC) is an influential account of cortical learning. Much recent work has focused on comparing PC to Backpropagation (BP) to determine whether PC offers any advantages. Small-scale experiments show that PC enables learning that is more sample efficient and effective in many contexts, though a thorough theoretical understanding of the phenomenon remains elusive. To address this, we quantify the efficiency of learning in BP and PC through a metric called "target alignment", which measures how closely the change in the output of the network is aligned to the output prediction error. We then derive and empirically validate analytical expressions for target alignment in Deep Linear Networks. We show that learning in PC is more efficient than in BP, an advantage that is especially pronounced in deep, narrow, and pre-trained networks. We also derive exact conditions for guaranteed optimal target alignment in PC and validate our findings through experiments. We study full training trajectories of linear and non-linear models, and find that the predicted benefits of PC persist in practice even when some assumptions are violated. Overall, this work provides a mechanistic understanding of the higher learning efficiency observed for PC over BP in previous works, and can guide how PC should be parametrised to learn most effectively.
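As a rough illustration of the target-alignment idea, the sketch below takes one backpropagation step in a two-layer linear network and measures the cosine similarity between the resulting change in output and the output prediction error. Reading "aligned" as cosine similarity is an assumption here, and the paper's exact definition may differ. Only the BP side is shown, and all helper names are hypothetical.

```python
import math


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def outer(u, v):
    return [[a * b for b in v] for a in u]


def transpose(m):
    return [list(col) for col in zip(*m)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def bp_target_alignment(w1, w2, x, target, lr=0.01):
    """Cosine between the output change after one BP step and the output error.

    Network: y = w2 @ (w1 @ x), loss = 0.5 * ||target - y||^2.
    """
    h = matvec(w1, x)
    y = matvec(w2, h)
    err = [t - yi for t, yi in zip(target, y)]
    back = matvec(transpose(w2), err)  # error propagated to the hidden layer
    g2 = outer(err, h)                 # descent direction for w2
    g1 = outer(back, x)                # descent direction for w1
    w2_new = [[w + lr * g for w, g in zip(rw, rg)] for rw, rg in zip(w2, g2)]
    w1_new = [[w + lr * g for w, g in zip(rw, rg)] for rw, rg in zip(w1, g1)]
    y_new = matvec(w2_new, matvec(w1_new, x))
    delta = [a - b for a, b in zip(y_new, y)]
    return cosine(delta, err)
```

With an identity output layer the BP output change points exactly along the error, and an anisotropic output layer pulls the alignment below 1.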

2505.13196 2026-06-11 cs.LG cs.AI quant-ph updated version

A Physics-Inspired Optimizer: Velocity Regularized Adam

一种受物理启发的优化器:速度正则化Adam

Pranav Vaidhyanathan, Lucas Schorling, Natalia Ares, Maike Osborne

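Affiliations: University of Oxford

AI summary: This paper proposes the VRAdam optimizer, which adds velocity regularization to Adam's per-parameter scaling to improve training stability and convergence speed; the theoretical analysis gives a convergence rate of $\mathcal{O}(\ln(N)/\sqrt{N})$ for non-convex objectives.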

Comments L. Schorling and P. Vaidhyanathan contributed equally to this work. 20 pages, 10 figures

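Details
AI-generated abstract (translated from Chinese)

We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks. The optimizer draws on the idea of quartic terms for kinetic energy, which have a stabilizing effect on system dynamics. Previous algorithms, including the ubiquitous Adam, operate at the so-called edge of stability during training, which leads to rapid oscillations and slow convergence of the loss. VRAdam instead adds a higher-order, velocity-based penalty on the learning rate, so the algorithm automatically slows down when weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes and damps oscillations. By combining this velocity-based regularization for global damping with Adam's per-parameter scaling, we create a powerful hybrid optimizer. For this optimizer, we give a rigorous theoretical analysis of the momentum's operation at the edge of stability from a physics and control perspective. We also derive convergence bounds for non-convex stochastic objectives under mild assumptions, with a rate of $\mathcal{O}(\ln(N)/\sqrt{N})$. We show that VRAdam outperforms standard optimizers such as AdamW. We benchmark on tasks including image classification, language modeling, and generative modeling, using different architectures and training methods (including convolutional neural networks, Transformers, and GFlowNets).

English abstract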

We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on quartic kinetic-energy terms and their stabilizing effects on various system dynamics. Previous algorithms, including the ubiquitous Adam, operate at the so-called adaptive edge of stability regime during training, leading to rapid oscillations and slowed convergence of the loss. VRAdam, by contrast, adds a higher-order penalty on the learning rate based on the velocity, such that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, damping oscillations. By combining this velocity-based regularizer for global damping with the per-parameter scaling of Adam, we create a powerful hybrid optimizer. For this optimizer, we provide a rigorous theoretical analysis of the momentum's operation at the edge of stability from a physical and control perspective. Furthermore, we derive convergence bounds with the rate $\mathcal{O}(\ln(N)/\sqrt{N})$ for a stochastic non-convex objective under mild assumptions. We demonstrate that VRAdam outperforms standard optimizers including AdamW. We benchmark on various tasks such as image classification, language modeling, and generative modeling, using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.

2605.10592 2026-06-11 cs.AI cs.HC cs.LG updated version

A Resilient Solution for Sewer Overflow Monitoring across Cloud and Edge

跨云和边缘的防洪溢流监控稳健解决方案

Vipin Singh, Tianheng Ling, Peter Ghaly, Felix Grimmeisen, Gregor Schiele, Felix Biessmann

Affiliations: Berlin University of Applied Sciences; University of Duisburg-Essen; Okeanos Smart Data Solutions GmbH; Einstein Center Digital Future

AI summary: This paper presents a deep-learning monitoring platform spanning cloud and edge that forecasts the filling dynamics of overflow basins, addressing aging urban sewer systems and supporting early warning of overflows.

Comments 3 pages, 6 figures, accepted at 35th International Joint Conference on Artificial Intelligence 2026 (IJCAI-ECAI 2026), Demonstrations Track. URL: https://riwwer.demo.calgo-lab.de

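Details
AI-generated abstract (translated from Chinese)

Aging combined sewer systems in many historical cities are under growing stress from extreme rainfall events, which can trigger combined sewer overflows (CSO) with serious environmental and public health impacts. Forecasting the filling dynamics of overflow basins is essential for anticipating capacity exceedance and taking timely preventive action. We present a web-based demonstrator (https://riwwer.demo.calgo-lab.de) that integrates deep learning forecasting methods in cloud and edge settings into an interactive monitoring dashboard, making overflow monitoring robust to network outages. A video showcase is available online (https://cloud.bht-berlin.de/index.php/s/b9xt4T3SdiLBiFZ).

English abstract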

Aging combined sewer systems in many historical cities are increasingly stressed by extreme rainfall events, which can trigger combined sewer overflows (CSO) with significant environmental and public health impacts. Forecasting the filling dynamics of overflow basins is critical for anticipating capacity exceedance and enabling timely preventive actions for CSO. We present a web-based demonstrator that integrates Deep Learning forecasting methods in both cloud and edge settings into an interactive monitoring dashboard for overflow monitoring, resilient to network outages. A video showcase is available online (https://cloud.bht-berlin.de/index.php/s/b9xt4T3SdiLBiFZ).

2603.08558 2026-06-11 cs.LG stat.ML updated version

Impact of Connectivity on Laplacian Representations in Reinforcement Learning

连通性对强化学习中拉普拉斯表示的影响

Tommaso Giorgi, Pierriccardo Olivieri, Keyue Jiang, Laura Toni, Matteo Papini

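Affiliations: University of Edinburgh

AI summary: This paper studies how connectivity affects the error of Laplacian representations in reinforcement learning: by analyzing the algebraic connectivity of the state graph, it derives an upper bound on the error of linear value function approximation and presents an end-to-end error decomposition across the representation learning pipeline.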

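Details
AI-generated abstract (translated from Chinese)

Learning compact state representations in Markov Decision Processes (MDPs) is essential for addressing the curse of dimensionality in large-scale reinforcement learning (RL) problems. Existing methods exploit structural priors by constructing state representations as linear combinations of the eigenvectors of the state-graph Laplacian. When the transition graph is unknown or the state space is too large, the graph spectral features can be estimated directly from sample trajectories. This paper proves an upper bound on the error of linear value function approximation under the learned spectral features and shows how this error varies with the algebraic connectivity of the state graph, grounding approximation quality in the topological structure of the MDP. We further bound the error introduced by the eigenvector estimation itself, leading to an end-to-end error decomposition across the representation learning pipeline. In addition, our expression of the Laplacian operator for the RL setting, although equivalent to existing ones, prevents some common misunderstandings, examples of which we show from the literature. Our results hold for general (non-uniform) policies, without any assumptions on the symmetry of the induced transition kernel. We validate our theoretical findings with numerical simulations on gridworld environments.

English abstract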

Learning compact state representations in Markov Decision Processes (MDPs) has proven crucial for addressing the curse of dimensionality in large-scale reinforcement learning (RL) problems. Existing principled approaches leverage structural priors on the MDP by constructing state representations as linear combinations of the state-graph Laplacian eigenvectors. When the transition graph is unknown or the state space is prohibitively large, the graph spectral features can be estimated directly via sample trajectories. In this work, we prove an upper bound on the approximation error of linear value function approximation under the learned spectral features. We show how this error scales with the algebraic connectivity of the state-graph, grounding the approximation quality in the topological structure of the MDP. We further bound the error introduced by the eigenvector estimation itself, leading to an end-to-end error decomposition across the representation learning pipeline. Additionally, our expression of the Laplacian operator for the RL setting, although equivalent to existing ones, prevents some common misunderstandings, of which we show some examples from the literature. Our results hold for general (non-uniform) policies without any assumptions on the symmetry of the induced transition kernel. We validate our theoretical findings with numerical simulations on gridworld environments.
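The algebraic connectivity referred to in the abstract is the second-smallest eigenvalue of the graph Laplacian. As a small illustration, the sketch below computes it for an undirected, unweighted graph using shifted power iteration in plain Python. The symmetric, unweighted setting is a simplifying assumption: the paper states that its results cover general policies without symmetry assumptions, which this toy does not attempt to model.

```python
import math


def laplacian(n, edges):
    """Combinatorial Laplacian L = D - A of an undirected, unweighted graph."""
    lap = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        lap[i][i] += 1.0
        lap[j][j] += 1.0
        lap[i][j] -= 1.0
        lap[j][i] -= 1.0
    return lap


def _centered_unit(x):
    """Project out the constant eigenvector, then normalize."""
    mean = sum(x) / len(x)
    x = [v - mean for v in x]
    norm = math.sqrt(sum(v * v for v in x))
    return [v / norm for v in x]


def algebraic_connectivity(lap, iters=2000):
    """Second-smallest Laplacian eigenvalue via power iteration on shift*I - L."""
    n = len(lap)
    shift = 2.0 * max(lap[i][i] for i in range(n))  # >= largest eigenvalue of L
    x = [float(i) for i in range(n)]
    for _ in range(iters):
        x = _centered_unit(x)
        x = [shift * x[i] - sum(lap[i][j] * x[j] for j in range(n))
             for i in range(n)]
    x = _centered_unit(x)
    lx = [sum(lap[i][j] * x[j] for j in range(n)) for i in range(n)]
    return sum(a * b for a, b in zip(x, lx))  # Rayleigh quotient x^T L x


path = laplacian(6, [(i, i + 1) for i in range(5)])
ring = laplacian(6, [(i, (i + 1) % 6) for i in range(6)])
```

Closing the six-state path into a ring raises the algebraic connectivity from about 0.27 to 1.0, which is the kind of topological quantity the abstract ties the approximation error to.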

2602.06827 2026-06-11 cs.RO updated version

DynaRetarget: Dynamically-Feasible Retargeting using Sampling-Based Trajectory Optimization

DynaRetarget: 基于采样的轨迹优化的动态可行重定向

Victor Dhedin, Ilyass Taouil, Shafeef Omar, Dian Yu, Kun Tao, Angela Dai, Majid Khadiv

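Affiliations: University of California, Berkeley

AI summary: Proposes the DynaRetarget framework, which uses sampling-based trajectory optimization to retarget human motions to humanoid control policies, producing dynamically feasible long-horizon motions and achieving higher success rates across hundreds of demonstrations.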

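Details
AI-generated abstract (translated from Chinese)

In this paper, we introduce DynaRetarget, a complete pipeline for retargeting human motions to humanoid control policies. Its core component is a novel Sampling-Based Trajectory Optimization (SBTO) framework that refines imperfect kinematic trajectories into dynamically feasible motions. SBTO advances the optimization horizon incrementally, which makes it possible to optimize the entire trajectory of a long-horizon task. We validate DynaRetarget by successfully retargeting hundreds of humanoid-object demonstrations and achieving higher success rates than the state of the art. Using the same tracking objective, the framework also generalizes across object properties such as mass, size, and geometry. This ability to robustly retarget diverse demonstrations opens the door to generating large-scale synthetic datasets of humanoid loco-manipulation trajectories, addressing a major bottleneck in real-world data collection.

English abstract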

In this paper, we introduce DynaRetarget, a complete pipeline for retargeting human motions to humanoid control policies. The core component of DynaRetarget is a novel Sampling-Based Trajectory Optimization (SBTO) framework that refines imperfect kinematic trajectories into dynamically feasible motions. SBTO incrementally advances the optimization horizon, enabling optimization over the entire trajectory for long-horizon tasks. We validate DynaRetarget by successfully retargeting hundreds of humanoid-object demonstrations and achieving higher success rates than the state of the art. The framework also generalizes across varying object properties, such as mass, size, and geometry, using the same tracking objective. This ability to robustly retarget diverse demonstrations opens the door to generating large-scale synthetic datasets of humanoid loco-manipulation trajectories, addressing a major bottleneck in real-world data collection.

2605.06485 2026-06-11 cs.CL cs.AI updated version

Litespark Inference For CPUs: Ultra-Fast SIMD Framework for Ternary (1.58-bit) Language Models

Litespark Inference For CPUs: 三元(1.58位)语言模型的超快SIMD框架

Nii Osae Osae Dade, Tony Morri, Moinul Hossain Rahat, Sayandip Pal, Rickston Pinto

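Affiliations: Mindbeam AI

AI summary: Exploiting the fact that ternary language model weights lie in {-1, 0, +1}, this work proposes custom SIMD kernels that replace matrix multiplication with additions and subtractions, achieving 18-96x speedups and a 6x memory reduction on CPUs.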

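Details
AI-generated abstract (translated from Chinese)

Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain too high for most users. Standard inference requires expensive datacenter GPUs or cloud API access, leaving more than one billion personal computers underutilized for AI workloads. Ternary models offer a way forward: their weights are constrained to {-1, 0, +1}, which in theory eliminates the need for floating-point multiplication. However, existing frameworks fail to exploit this structure and treat ternary models as dense floating-point networks. We close this gap with custom SIMD kernels that replace matrix multiplication with simple additions and subtractions, targeting the integer dot-product instructions available on modern CPUs. Our implementation, Litespark-Inference, can be installed with pip and integrates directly with Hugging Face. On Apple Silicon it achieves 18.15x higher throughput, 7.15x faster time-to-first-token, and a 6.03x memory reduction compared with standard PyTorch inference, and on Intel and AMD processors it achieves throughput speedups of up to 95.81x.

English abstract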

Large language models (LLMs) have transformed artificial intelligence, but their computational requirements remain prohibitive for most users. Standard inference demands expensive datacenter GPUs or cloud API access, leaving over one billion personal computers underutilized for AI workloads. Ternary models offer a path forward: their weights are constrained to {-1, 0, +1}, theoretically eliminating the need for floating-point multiplication. However, existing frameworks fail to exploit this structure, treating ternary models as dense floating-point networks. We address this gap with custom SIMD kernels that replace matrix multiplication with simple addition and subtraction operations, targeting the integer dot product instructions available on modern CPUs. Our implementation, Litespark-Inference, is pip-installable and integrates directly with Hugging Face, achieving 18.15x higher throughput, 7.15x faster time-to-first-token, and 6.03x memory reduction compared to standard PyTorch inference on Apple Silicon, with comparable or higher throughput speedups of up to 95.81x on Intel and AMD processors.
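The core observation, that a product with a weight in {-1, 0, +1} reduces to an add, a subtract, or a skip, can be shown with a scalar sketch. This is plain Python for illustration only. It is not the Litespark-Inference kernel, which the abstract describes as SIMD code built on integer dot-product instructions, and the function name is hypothetical.

```python
def ternary_matvec(weights, x):
    """Matrix-vector product for weights in {-1, 0, +1} without multiplication.

    weights: list of rows, each a list of -1, 0, or +1.
    x: list of numbers with the same length as each row.
    """
    out = []
    for row in weights:
        acc = 0
        for w, v in zip(row, x):
            if w == 1:
                acc += v      # +1 weight: add the activation
            elif w == -1:
                acc -= v      # -1 weight: subtract the activation
            # 0 weight: contributes nothing, so skip it
        out.append(acc)
    return out


example = ternary_matvec([[1, 0, -1], [-1, -1, 1]], [2, 5, 3])
```

A production kernel would additionally pack several ternary weights per byte and process many lanes at once, which this sketch leaves out.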

2602.03141 2026-06-11 cs.CL updated version

Short Chains, Deep Thoughts: Balancing Reasoning Efficiency and Intra-Segment Capability via Split-Merge Optimization

短链深思:通过拆分-合并优化平衡推理效率与段内能力

Runquan Gui, Jie Wang, Zhihai Wang, Chi Ma, Jianye Hao, Feng Wu

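Affiliations: University of Science and Technology of China

AI summary: Proposes the CoSMo framework, which dynamically refines reasoning chains with a split-merge algorithm and applies structure-aligned reinforcement learning with a segment-level budget, cutting redundant segments while improving accuracy by 3.3 points and reducing segment usage by 28.7% on average.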

Comments camera ready version upload

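Details
AI-generated abstract (translated from Chinese)

Although Large Reasoning Models (LRMs) have shown impressive ability to solve complex tasks by generating long reasoning chains, this reliance on verbose generation causes significant latency and computational overhead. To address these challenges, we propose CoSMo (Consistency-Guided Split-Merge Optimization), a framework designed to eliminate structural redundancy rather than indiscriminately limit the number of tokens. Specifically, CoSMo uses a split-merge algorithm that dynamically refines reasoning chains by merging redundant segments and splitting at logical gaps to ensure coherence. We then apply structure-aligned reinforcement learning with a novel segment-level budget, which supervises the model to maintain efficient reasoning structures throughout training. Extensive experiments across multiple benchmarks and backbones show that CoSMo achieves superior performance: compared with reasoning-efficiency baselines, it improves average accuracy by 3.3 percentage points while reducing segment usage by 28.7%.

English abstract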

While Large Reasoning Models (LRMs) have demonstrated impressive capabilities in solving complex tasks through the generation of long reasoning chains, this reliance on verbose generation results in significant latency and computational overhead. To address these challenges, we propose CoSMo (Consistency-Guided Split-Merge Optimization), a framework designed to eliminate structural redundancy rather than indiscriminately restricting token volume. Specifically, CoSMo utilizes a split-merge algorithm that dynamically refines reasoning chains by merging redundant segments and splitting logical gaps to ensure coherence. We then employ structure-aligned reinforcement learning with a novel segment-level budget to supervise the model in maintaining efficient reasoning structures throughout training. Extensive experiments across multiple benchmarks and backbones demonstrate that CoSMo achieves superior performance, improving accuracy by 3.3 points while reducing segment usage by 28.7% on average compared to reasoning efficiency baselines.

2605.00545 2026-06-11 cs.LG cs.AI math-ph math.MP q-bio.GN q-bio.QM updated version

Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots

超越连续性:从单细胞快照无模拟重建离散分支动力学

Junda Ying, Yuxuan Wang, Bowen Yang, Peijie Zhou, Lei Zhang

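Affiliations: University of Science and Technology of China

AI summary: To address stochasticity and non-conservative mass dynamics (such as cell proliferation and apoptosis) in single-cell snapshot data, this work proposes Unbalanced Schrödinger Bridge (USB), a simulation-free framework that models jump-like birth-death dynamics at single-cell resolution through the discrete Branching Schrödinger Bridge problem, enabling efficient trajectory reconstruction and discrete simulation.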

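Details
AI-generated abstract (translated from Chinese)

Inferring cellular trajectories from destructive snapshots is complicated by stochasticity and by non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid and perform inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We propose Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning the underlying dynamics that integrates stochastic and unbalanced effects and models discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem and gives a rigorous microscopic interpretation in which individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method obtains an efficient solver by introducing a simulation-free training objective that scales to high-dimensional omics data. Empirically, we show on simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines, but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.

English abstract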

Inferring cellular trajectories from destructive snapshots is complicated by the challenges of stochasticity and non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that integrates both stochastic and unbalanced effects and also models the discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation in which individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.

2605.00321 2026-06-11 cs.RO updated version

Embodied Interpretability: Linking Causal Understanding to Generalization in Vision-Language-Action Models

具身可解释性:将因果理解与视觉-语言-动作模型的泛化联系起来

Hanxin Zhang, Mingshuo Xu, Abdulqader Dhafer, Shigang Yue, Hongbiao Dong, Zhou Daniel Hao

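Affiliations: University of Science and Technology of China

AI summary: Proposes the Interventional Significance Score (ISS) and the Nuisance Mass Ratio (NMR), which use interventional masking to estimate the causal influence of visual regions on action predictions and quantify attribution to task-irrelevant features; experiments indicate that NMR predicts generalization behavior and that ISS gives more faithful explanations than existing methods.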

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

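Details
AI-generated abstract (translated from Chinese)

Vision-Language-Action (VLA) policies often fail under distribution shift, which suggests that their decisions may rely on spurious visual correlations rather than task-relevant causes. We formalize visual-action attribution as an interventional estimation problem. On this basis, we introduce the Interventional Significance Score (ISS), an interventional masking procedure for estimating the causal influence of visual regions on action predictions, and the Nuisance Mass Ratio (NMR), a scalar measure of attribution to task-irrelevant features. We analyze the statistical properties of ISS, show that it admits unbiased estimation, and characterize the conditions under which action prediction error is a valid proxy for causal influence. Experiments across diverse manipulation tasks show that NMR predicts generalization behavior and that ISS yields more faithful explanations than existing interpretability methods. These results suggest that interventional attribution offers a simple diagnostic for identifying causal misalignment in embodied policies.

English abstract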

Vision-Language-Action (VLA) policies often fail under distribution shift, suggesting that decisions may depend on spurious visual correlations rather than task-relevant causes. We formulate visual-action attribution as an interventional estimation problem. Accordingly, we introduce the Interventional Significance Score (ISS), an interventional masking procedure for estimating the causal influence of visual regions on action predictions, and the Nuisance Mass Ratio (NMR), a scalar measure of attribution to task-irrelevant features. We analyze the statistical properties of ISS and show that it admits unbiased estimation, and we characterize conditions under which action prediction error provides a valid proxy for causal influence. Experiments across diverse manipulation tasks indicate that NMR predicts generalization behavior and that ISS yields more faithful explanations than existing interpretability methods. These results suggest that interventional attribution provides a simple diagnostic approach for identifying causal misalignment in embodied policies.

2604.18543 2026-06-11 cs.AI cs.CL updated version

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents

ClawEnvKit:爪型智能体的自动环境生成

Xirui Li, Ming Li, Ion Stoica, Cho-Jui Hsieh, Tianyi Zhou

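Affiliations: University of Maryland; Arena; University of California, Berkeley; University of California, Los Angeles; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes ClawEnvKit, which automatically generates diverse, verified training and evaluation environments for claw-like agents, and builds the Auto-ClawEval benchmark of 1,040 environments at 13,800x lower cost; harness engineering improves performance by up to 15.7 percentage points.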

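Details
AI-generated abstract (translated from Chinese)

Building environments for training and evaluating claw-like agents remains a manual, labor-intensive process that does not scale. We argue that what is needed is not merely a dataset but an automated pipeline that can generate diverse, verifiable environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural-language descriptions. The pipeline has three modules: (1) a parser that extracts structured generation parameters from natural-language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that ensures the feasibility, diversity, structural validity, and internal consistency of the generated environments. Using ClawEnvKit, we build Auto-ClawEval, the first large-scale benchmark for claw-like agents, with 1,040 environments across 24 categories. Experiments show that Auto-ClawEval matches or exceeds human-curated environments in coherence and clarity at 13,800x lower cost. Evaluating 4 model families and 8 agent harness frameworks, we find that harness engineering improves performance by up to 15.7 percentage points over a bare ReAct baseline, that completion remains the main axis of variation, that no model saturates the benchmark, and that automated generation enables evaluation at a previously infeasible scale. Beyond static benchmarking, ClawEnvKit also supports live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, which turns evaluation into a continuous, user-driven process. The same mechanism can serve as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being limited by existing user logs.

English abstract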

Constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale. We argue that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand. To this end, we introduce ClawEnvKit, an autonomous generation pipeline that instantiates this formalism from natural language descriptions. The pipeline comprises three modules: (1) a parser that extracts structured generation parameters from natural language input; (2) a generator that produces the task specification, tool interface, and scoring configuration; and (3) a validator that enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. Using ClawEnvKit, we construct Auto-ClawEval, the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories. Empirically, Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost. Evaluated across 4 model families and 8 agent harness frameworks, we find that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, completion remains the primary axis of variation with no model saturating the benchmark, and automated generation enables evaluation at a scale previously infeasible. Beyond static benchmarking, ClawEnvKit enables live evaluation: users describe a desired capability in natural language and obtain a verified environment on demand, turning evaluation into a continuous, user-driven process. The same mechanism serves as an on-demand training environment generator, producing task distributions that adapt to an agent's current weaknesses rather than being bounded by existing user logs.

2601.21293 2026-06-11 cs.LG cs.AI updated version

Reliability-Calibrated Edge-IoT Early Fault Warning for Rotating Machinery with a Physics-Guided Tiny-Mamba Transformer

面向旋转机械的可靠性校准边缘物联网早期故障预警:一种物理引导的Tiny-Mamba Transformer

Changyu Li, Huabei Nie, Xiaoya Ni, Lu Wang, Lijuan Shen, Kaishun Wu, Fei Luo

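Affiliations: Great Bay University; Huizhou University; National University of Singapore; Shenzhen University; James Cook University; Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes a reliability-calibrated edge-IoT early fault warning framework that extracts features with a Physics-Guided Tiny-Mamba Transformer and calibrates the false-alarm rate with extreme value theory, providing accurate, low-latency fault warnings for rotating machinery under limited compute.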

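Details
AI-generated abstract (translated from Chinese)

Industrial Internet of Things (IIoT) systems increasingly rely on distributed vibration sensing to support predictive maintenance of rotating machinery. In practical deployments, however, uploading raw signals is costly, and alarm decisions must be made locally under limited compute, changing operating conditions, and strict false-alarm budgets. This paper proposes a reliability-calibrated edge-IoT early-warning framework in which a compact Physics-Guided Tiny-Mamba Transformer (PG-TMT) serves as the representation module and an extreme value theory (EVT) layer converts streaming anomaly scores into event-level alarm episodes. PG-TMT combines a depthwise-separable convolutional stem, a Tiny-Mamba state-space branch, and a lightweight local Transformer to capture transient, long-horizon, and multichannel degradation cues under batch-size-one inference. To improve auditability, temporal attention is projected to the frequency domain and softly aligned with analytical bearing fault-order bands. EVT calibration, dual-threshold hysteresis, and trimmed-tail fitting provide controllable false-alarm intensity even when the healthy calibration data are imperfect. Experiments on CWRU, Paderborn, XJTU-SY, and an industrial pilot show that the proposed framework improves PR-AUC, reduces detection delay under a controlled false-alarm budget, and remains robust to structured interference, metadata uncertainty, compound fault mixtures, and domain transfer. With a footprint under 1 MB and Jetson p99 latency below 7 ms, the framework supports calibrated and interpretable early warning for IIoT predictive maintenance.

English abstract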

Industrial Internet of Things (IIoT) systems increasingly rely on distributed vibration sensing to support predictive maintenance of rotating machinery. In practical deployments, however, raw signal upload is costly and alarm decisions must be made locally under limited computation, changing operating conditions, and strict nuisance-alarm budgets. This paper presents a reliability-calibrated edge-IoT early-warning framework, in which a compact Physics-Guided Tiny-Mamba Transformer (PG-TMT) acts as the representation module and an extreme value theory (EVT) layer converts streaming anomaly scores into event-level alarm episodes. PG-TMT combines a depthwise-separable convolutional stem, a Tiny-Mamba state-space branch, and a lightweight local Transformer to capture transient, long-horizon, and multichannel degradation cues under batch-size-one inference. To improve auditability, temporal attention is projected to the frequency domain and softly aligned with analytical bearing fault-order bands. EVT calibration, dual-threshold hysteresis, and trimmed-tail fitting provide controllable false-alarm intensity even when healthy calibration data are imperfect. Experiments on CWRU, Paderborn, XJTU-SY, and an industrial pilot demonstrate that the proposed framework improves PR-AUC, reduces detection delay under a controlled nuisance-alarm budget, and remains robust to structured interference, metadata uncertainty, compound fault mixtures, and domain transfer. With a sub-1 MB footprint and Jetson p99 latency below 7 ms, the framework supports calibrated and interpretable early warnings for IIoT predictive maintenance.
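The dual-threshold hysteresis mentioned in the abstract can be pictured as a small state machine that turns a stream of anomaly scores into alarm episodes. The sketch below is a generic version of that idea. The thresholds are hand-picked for illustration, whereas the abstract indicates they would come from EVT calibration, and the function name is hypothetical.

```python
def alarm_episodes(scores, high, low):
    """Group streaming anomaly scores into alarm episodes using hysteresis.

    An episode opens when a score reaches `high` and closes at the first later
    score below `low`. Returns (start, end) index pairs with `end` exclusive.
    """
    episodes = []
    start = None
    for i, score in enumerate(scores):
        if start is None and score >= high:
            start = i                      # alarm turns on
        elif start is not None and score < low:
            episodes.append((start, i))    # alarm turns off
            start = None
    if start is not None:
        episodes.append((start, len(scores)))  # still active at end of stream
    return episodes


stream = [0.1, 0.9, 0.6, 0.55, 0.2, 0.7, 0.95, 0.1]
episodes = alarm_episodes(stream, high=0.8, low=0.5)
```

Using two thresholds keeps a score that hovers between `low` and `high` from toggling the alarm repeatedly, which is one way to report event-level episodes rather than per-sample flags.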