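arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems science.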

2604.11661 2026-05-21 cs.LG cs.AI

Towards Autonomous Mechanistic Reasoning in Virtual Cells

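Yunhui Jang, Lu Zhu, Jake Fawkes, Alisandra Kaye Denton, Dominique Beaini, Emmanuel Noutahi

Affiliations: Korea Advanced Institute of Science and Technology (KAIST); Valence Labs; University College London

AI summary: This paper proposes a structured explanation formalism for biological reasoning in virtual cells that uses mechanistic action graphs to enable systematic verification and falsification. It also introduces VCR-Agent, a multi-agent framework that combines biologically grounded knowledge retrieval with verifier-based filtering to generate and validate mechanistic reasoning.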

Abstract

Large language models (LLMs) have recently gained significant attention as a promising approach to accelerate scientific discovery. However, their application in open-ended scientific domains such as biology remains limited, primarily due to the lack of factually grounded and actionable explanations. To address this, we introduce a structured explanation formalism for virtual cells that represents biological reasoning as mechanistic action graphs, enabling systematic verification and falsification. Building upon this, we propose VCR-Agent, a multi-agent framework that integrates biologically grounded knowledge retrieval with a verifier-based filtering approach to generate and validate mechanistic reasoning autonomously. Using this framework, we release VC-TRACES dataset, which consists of verified mechanistic explanations derived from the Tahoe-100M atlas. Empirically, we demonstrate that training with these explanations improves factual precision and provides a more effective supervision signal for downstream gene expression prediction. These results underscore the importance of reliable mechanistic reasoning for virtual cells, achieved through the synergy of multi-agent and rigorous verification.

2604.11530 2026-05-21 cs.CV cs.AI

Beyond Attention Scores: SVD-Based Vision Token Pruning for Efficient Vision-Language Models

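Yvon Apedo, Martyna Poreba, Michal Szczepanski, Samia Bouchafa

Affiliations: anoncvlab

AI summary: This paper proposes SVD-Prune, which decomposes the vision token feature matrix with SVD and uses statistical leverage scores to select the top-k tokens. It maintains strong performance under extreme vision token budgets and outperforms existing pruning methods.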

Abstract

Vision-Language Models (VLMs) have revolutionized multi-modal learning by jointly processing visual and textual information. Yet, they face significant challenges due to the high computational and memory demands of processing long sequences of vision tokens. Many existing methods rely on local heuristics, such as attention scores or token norms. However, these criteria suffer from positional bias and information dispersion, limiting their ability to preserve essential content at high pruning ratios and leading to performance degradation on visually detailed images. To address these issues, we propose SVD-Prune, a training-free, plug-and-play token pruning method based on Singular Value Decomposition. It decomposes the vision token feature matrix and selects the top-k tokens using statistical leverage scores, ensuring only tokens contributing most to the dominant global variance are preserved. Experiments show that SVD-Prune consistently outperforms prior pruning methods under extreme vision token budgets, maintaining strong performance even with 32 and 16 vision tokens.
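
The selection step described in this abstract can be sketched in a few lines. This is a minimal illustration of leverage-score selection in general, not the authors' implementation: it assumes NumPy, and the function name `select_tokens_by_leverage`, the `rank` parameter, and the random stand-in features are all hypothetical. The leverage score of token i is taken here to be the squared norm of row i of the top-`rank` left singular vectors.

```python
import numpy as np

def select_tokens_by_leverage(tokens, rank, keep):
    """Return (kept_indices, scores) for a (n_tokens, dim) feature matrix.

    Hypothetical helper: `rank` and `keep` are illustrative parameters,
    not values taken from the paper.
    """
    # Thin SVD: columns of U are left singular vectors, ordered by singular value.
    U, _, _ = np.linalg.svd(tokens, full_matrices=False)
    # Rank-`rank` leverage score of token i = squared norm of row i of U[:, :rank].
    scores = np.sum(U[:, :rank] ** 2, axis=1)
    # Keep the `keep` highest-scoring tokens, restored to their original order.
    kept = np.sort(np.argsort(-scores)[:keep])
    return kept, scores

rng = np.random.default_rng(0)
features = rng.normal(size=(64, 16))  # 64 stand-in vision tokens, 16 dimensions
kept, scores = select_tokens_by_leverage(features, rank=4, keep=8)
```

Because the columns of `U` are orthonormal, the scores sum to `rank` whenever the matrix has at least that rank, which gives a quick sanity check on the computation.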

2604.11071 2026-05-21 cs.CV cs.AI cs.LG

Lightweight Low-Light Image Enhancement via Distribution-Normalizing Preprocessing and Depthwise U-Net

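Shimon Murai, Teppei Kurita, Ryuta Satoh, Yusuke Moriuchi

Affiliations: Sony Semiconductor Solutions Corporation

AI summary: This paper proposes a lightweight two-stage framework that enhances low-light images through distribution-normalizing preprocessing and a depthwise U-Net, using fewer parameters than existing methods while achieving better perceptual quality.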

Comments Technical report for the NTIRE 2026 Efficient Low-Light Image Enhancement Challenge (CVPR 2026 Workshops), 3rd place solution

2604.10245 2026-05-21 cs.CV physics.med-ph

Warm-Started Reinforcement Learning for Iterative 3D/2D Liver Registration

Hanyuan Zhang, Lucas He, Zijie Cheng, Abdolrahim Kadkhodamohammadi, Danail Stoyanov, Brian R. Davidson, Evangelos B. Mazomenos, Matthew J. Clarkson

Affiliations: UCL Hawkes Institute, University College London; Medtronic plc.

AI summary: This paper proposes a discrete-action reinforcement learning framework for registering preoperative CT to intraoperative laparoscopic video. An encoder warm-started from a supervised pose estimation network provides stable geometric features and faster convergence, enabling more efficient registration.

Comments Laparoscopic Liver Surgery, Augmented Reality, Image Registration, Reinforcement Learning

Abstract

Registration between preoperative CT and intraoperative laparoscopic video plays a crucial role in augmented reality (AR) guidance for minimally invasive surgery. Learning-based methods have recently achieved registration errors comparable to optimization-based approaches while offering faster inference. However, many supervised methods produce coarse alignments that rely on additional optimization-based refinement, thereby increasing inference time. We present a discrete-action reinforcement learning (RL) framework that formulates CT-to-video registration as a sequential decision-making process. A shared feature encoder, warm-started from a supervised pose estimation network to provide stable geometric features and faster convergence, extracts representations from CT renderings and laparoscopic frames, while an RL policy head learns to choose rigid transformations along six degrees of freedom and to decide when to stop the iteration. Experiments on a public laparoscopic dataset demonstrated that our method achieved an average target registration error (TRE) of 15.70 mm, comparable to supervised approaches with optimization, while achieving faster convergence. The proposed RL-based formulation enables automated, efficient iterative registration without manually tuned step sizes or stopping criteria. This discrete framework provides a practical foundation for future continuous-action and deformable registration models in surgical AR applications.
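
The sequential decision loop in this abstract can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the 13-action layout (one step in either direction along each of six pose parameters, plus a stop action), the names `iterate_registration` and `make_toy_policy`, and above all the policy itself, which is a hand-written stand-in that knows the target pose, whereas the paper's policy head is learned from image features.

```python
STOP = "stop"
# Assumed action set: +/- one step along each of 6 pose parameters, plus stop.
ACTIONS = [(axis, sign) for axis in range(6) for sign in (1, -1)] + [STOP]

def apply_action(pose, action, step):
    """Return a new 6-parameter pose after moving one step along one axis."""
    axis, sign = action
    new_pose = list(pose)
    new_pose[axis] += sign * step
    return new_pose

def iterate_registration(pose, policy, step=1.0, max_iters=100):
    """Apply policy-chosen actions until the policy stops or the budget runs out."""
    pose = list(pose)
    for n_iters in range(max_iters):
        action = policy(pose)
        if action == STOP:
            return pose, n_iters
        pose = apply_action(pose, action, step)
    return pose, max_iters

def make_toy_policy(target, step=1.0):
    """Stand-in for a learned policy: greedily shrink the largest pose error."""
    def policy(pose):
        errors = [t - p for t, p in zip(target, pose)]
        axis = max(range(6), key=lambda i: abs(errors[i]))
        if abs(errors[axis]) < step / 2:
            return STOP  # no remaining error can be reduced by a full step
        return (axis, 1 if errors[axis] > 0 else -1)
    return policy

target = [3, -2, 0, 1, 0, 0]
final_pose, n_moves = iterate_registration([0] * 6, make_toy_policy(target))
```

The point of the sketch is the control flow: step size is fixed by the action set and termination is itself an action, so neither needs manual tuning at inference time.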

2604.07213 2026-05-21 cs.LG math.PR

Diffusion Processes on Implicit Manifolds

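Victor Kawasaki-Borruat, Clara Grotehans, Pierre Vandergheynst, Adam Gosztolai

Affiliations: Signal Processing Laboratory 2; Institute of Artificial Intelligence; EPFL; Medical University of Vienna

AI summary: This paper studies how to construct diffusion processes on a data manifold using only point cloud samples. It proposes Implicit Manifold-valued Diffusions (IMDs), which approximate the infinitesimal generator of the diffusion process and its carré-du-champ to define stochastic differential equations in the high-dimensional ambient space, lifting the intrinsic process into ambient coordinates. Experiments show that the resulting processes stay confined to the data manifold while exploring it.

Comments Comments are more than welcome!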

Abstract

High-dimensional data are often assumed to lie on lower-dimensional manifolds. We study how to construct diffusion processes on this data manifold using only point cloud samples and without access to charts, projections, or other geometric primitives. Here, we introduce Implicit Manifold-valued Diffusions (IMDs), a data-driven mathematical formalism for defining stochastic differential equations in the original high-dimensional space that describe drifting Brownian particles evolving intrinsically on the underlying manifold. Our construction hinges on approximating the corresponding infinitesimal generator of the diffusion process using a proximity graph over the data and using the carré-du-champ of the generator, which encodes the local tangent spaces of the manifold and lifts the intrinsic process into ambient coordinates. We show that as the number of samples grows, our discrete diffusion process converges in law on the space of probability paths to its smooth manifold counterpart. We further present an Euler-Maruyama scheme for the numerical integration of IMDs. We validate our framework using numerical experiments on synthetic manifolds and the MNIST data manifold, showing that IMDs remain confined over the manifold and enable its guided exploration. Our work provides the mathematical foundation and practical implementations of diffusion processes on data manifolds, opening new avenues for manifold-aware sampling, exploration, and generative modeling.
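
For readers unfamiliar with the Euler-Maruyama scheme mentioned in this abstract, the generic update is x_{n+1} = x_n + b(x_n) Δt + σ(x_n) √Δt ξ_n with ξ_n standard normal. The sketch below shows that generic scalar scheme only. It is not the IMD integrator: in the paper the drift and diffusion terms come from a graph-based estimate of the generator, which is not reproduced here, and the function name and the Ornstein-Uhlenbeck example are illustrative choices.

```python
import math
import random

def euler_maruyama(x0, drift, diffusion, dt, n_steps, rng):
    """Simulate dX = drift(X) dt + diffusion(X) dW for a scalar process."""
    path = [x0]
    x = x0
    for _ in range(n_steps):
        noise = rng.gauss(0.0, 1.0)  # standard normal increment, scaled below
        x = x + drift(x) * dt + diffusion(x) * math.sqrt(dt) * noise
        path.append(x)
    return path

# Illustrative example: an Ornstein-Uhlenbeck process pulled toward zero.
ou_path = euler_maruyama(1.0, lambda x: -x, lambda x: 0.3, 0.01, 500, random.Random(0))

# With the noise switched off the scheme reduces to the explicit Euler method.
det_path = euler_maruyama(1.0, lambda x: -x, lambda x: 0.0, 0.1, 10, random.Random(0))
```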

2604.06750 2026-05-21 cs.CV

How Well Do Vision-Language Models Understand Sequential Driving Scenes? A Sensitivity Study

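Roberto Brusnicki, Mattia Piccinini, Johannes Betz

Affiliations: Professorship of Autonomous Vehicle Systems; TUM School of Engineering and Design

AI summary: This paper systematically analyzes how vision-language models (VLMs) perform on sequential driving scenes and how input configurations affect that performance. Even top models reach only 57% accuracy, below the 65% achieved by humans under similar constraints, exposing significant gaps in how VLMs understand vehicle dynamics and temporal relations.

Comments 8 pages, 5 figures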

Abstract

Vision-Language Models (VLMs) are increasingly proposed for autonomous driving tasks, yet their performance on sequential driving scenes remains poorly characterized, particularly regarding how input configurations affect their capabilities. We introduce VENUSS (VLM Evaluation oN Understanding Sequential Scenes), a framework for systematic sensitivity analysis of VLM performance on sequential driving scenes, establishing baselines for future research. Building upon existing datasets, VENUSS extracts temporal sequences from driving videos, and generates structured evaluations across custom categories. By comparing 25+ existing VLMs across 2,600+ scenarios, we reveal how even top models achieve only 57% accuracy, not matching human performance under similar constraints (65%) and exposing significant capability gaps. Our analysis shows that VLMs excel with static object detection but struggle with understanding vehicle dynamics and temporal relations. VENUSS offers the first systematic sensitivity analysis of VLMs focused on how input image configurations - resolution, frame count, temporal intervals, spatial layouts, and presentation modes - affect performance on sequential driving scenes. Supplementary material available at https://TUM-AVS.github.io/VENUSS/.

2604.01341 2026-05-21 cs.CV q-bio.NC

Perceptual misalignment of texture representations in convolutional neural networks

Ludovica de Paolis, Fabio Anselmi, Alessio Ansuini, Eugenio Piasini

Affiliations: Neuroscience Area, International School for Advanced Studies (SISSA), Trieste, Italy; Department of Mathematics, Informatics and Geosciences, Università degli Studi di Trieste, Trieste, Italy; Department of Data Engineering, Area Science Park, Trieste, Italy

AI summary: This paper studies how well texture representations in convolutional neural networks align with human perceptual content. It finds that conventional measures of a CNN's quality as a vision model are not directly related to its alignment with human texture perception, suggesting that texture perception may involve mechanisms distinct from those captured by conventional CNN object recognition models.

2603.29183 2026-05-21 cs.LG cs.AI

IMPACT: Influence Modeling for Open-Set Time Series Anomaly Detection

Xiaohui Zhou, Yijie Wang, Hongzuo Xu, Weixuan Liang, Xiaoli Li, Guansong Pang

Affiliations: National Key Laboratory of Parallel and Distributed Computing; College of Computer Science and Technology, National University of Defense Technology; Intelligent Game and Decision Lab (IGDL); Information Systems Technology and Design, Singapore University of Technology and Design; School of Computing and Information Systems, Singapore Management University

AI summary: This paper proposes IMPACT, a framework that uses influence modeling to address the challenges of open-set time series anomaly detection. It learns an influence function, then uses the influence scores to generate realistic anomaly patterns and to decontaminate the training data.

Comments Accepted by ICML 2026

Abstract

Open-set anomaly detection (OSAD) is an emerging paradigm designed to utilize limited labeled data from anomaly classes seen in training to identify both seen and unseen anomalies during testing. Current approaches rely on simple augmentation methods to generate pseudo anomalies that replicate unseen anomalies. Despite being promising in image data, these methods are found to be ineffective in time series data due to the failure to preserve its sequential nature, resulting in trivial or unrealistic anomaly patterns. They are further plagued when the training data is contaminated with unlabeled anomalies. This work introduces IMPACT, a novel framework that leverages influence modeling for open-set time series anomaly detection, to tackle these challenges. The key insight is to i) learn an influence function that can accurately estimate the impact of individual training samples on the modeling, and then ii) leverage these influence scores to generate semantically divergent yet realistic unseen anomalies for time series while repurposing high-influential samples as supervised anomalies for anomaly decontamination. Extensive experiments show that IMPACT significantly outperforms existing state-of-the-art methods, showing superior accuracy under varying OSAD settings and contamination rates. Code is available at https://github.com/mala-lab/IMPACT.

2603.27747 2026-05-21 cs.CV cs.AI

AI-Powered Facial Mask Removal Is Not Suitable For Identification

Emily A Cooper, Hany Farid

Affiliations: Herbert Wertheim School of Optometry & Vision Science, University of California, Berkeley; School of Information, University of California, Berkeley

AI summary: This paper examines the effectiveness and risks of AI-powered facial mask removal and assesses how reliable it is for matching true identities.

2603.26539 2026-05-21 cs.CL cs.AI

How Open Must Language Models be to Enable Reliable Scientific Inference?

James A. Michaelov, Catherine Arnett, Tyler A. Chang, Pamela D. Rivière, Samuel M. Taylor, Cameron R. Jones, Sean Trott, Roger P. Levy, Benjamin K. Bergen, Micah Altman

Affiliations: Massachusetts Institute of Technology; EleutherAI; University of California San Diego; Rutgers University-Newark; Stony Brook University

AI summary: This paper examines how the degree of openness of language models affects the reliability of scientific inferences drawn from research on them. It argues that closed models are generally unsuitable for scientific use and proposes a way to systematically identify and mitigate threats to inference.

2603.23531 2026-05-21 cs.CL

Large Language Models Unpack Complex Political Opinions through Target-Stance Extraction

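Özgür Togay, Javier Garcia-Bernardo, Florian Kunneman, Anastasia Giachanou

Affiliations: Department of Methodology and Statistics, Utrecht University; Department of Languages, Literature and Communication, Utrecht University

AI summary: This paper investigates whether large language models can unpack complex political opinions through Target-Stance Extraction. Using a dataset of Reddit posts covering 138 distinct political targets, it evaluates a range of LLMs under zero-shot, few-shot, and context-augmented prompting. The best models perform close to highly trained human annotators, showing that LLMs can extract complex political opinions with minimal supervision.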

Abstract

Political polarization emerges from a complex interplay of beliefs about policies, figures, and issues. However, most computational analyses reduce discourse to coarse partisan labels, overlooking how these beliefs interact. This is especially evident in online political conversations, which are often nuanced and cover a wide range of subjects, making it difficult to automatically identify the target of discussion and the opinion expressed toward them. In this study, we investigate whether Large Language Models (LLMs) can address this challenge through Target-Stance Extraction (TSE), a recent natural language processing task that combines target identification and stance detection, enabling more granular analysis of political opinions. For this, we construct a dataset of 1,084 Reddit posts from r/NeutralPolitics, covering 138 distinct political targets and evaluate a range of proprietary and open-source LLMs using zero-shot, few-shot, and context-augmented prompting strategies. Our results show that the best models perform comparably to highly trained human annotators and remain robust on challenging posts with low inter-annotator agreement. These findings demonstrate that LLMs can extract complex political opinions with minimal supervision, offering a scalable tool for computational social science and political text analysis.

2603.22727 2026-05-21 cs.LG eess.SP

Spiking Personalized Federated Learning for Brain-Computer Interface-Enabled Immersive Communication

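Chen Shang, Dinh Thai Hoang, Diep N. Nguyen, Jiadong Yu

Affiliations: School of Electrical and Data Engineering, University of Technology Sydney; Thrust of Internet of Things, The Hong Kong University of Science and Technology (Guangzhou)

AI summary: This paper proposes an immersive communication framework that uses a brain-computer interface to acquire brain signals and infer user-centric states such as intention and perception-related discomfort. A personalized federated learning model processes the signals, accommodating neurodiverse data and preventing leakage of sensitive brain-signal information, while embedded spiking neural networks reduce energy consumption. Experiments on a real brain-signal dataset show the best overall identification accuracy and a 6.46× reduction in inference energy.

Comments 6 pages, 3 figures

Journal ref INFOCOM Workshop, 2026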

Abstract

This work proposes a novel immersive communication framework that leverages a brain-computer interface (BCI) to acquire brain signals for inferring user-centric states (e.g., intention and perception-related discomfort), thereby enabling more personalized and robust immersive adaptation under strong individual variability. Specifically, we develop a personalized federated learning (PFL) model to analyze and process the collected brain signals, which not only accommodates neurodiverse brain-signal data but also prevents the leakage of sensitive brain-signal information. To address the energy bottleneck of continual on-device learning and inference on energy-limited immersive terminals (e.g., head-mounted displays), we further embed spiking neural networks (SNNs) into the PFL. By exploiting sparse, event-driven spike computation, the SNN-enabled PFL reduces the computation and energy cost of training and inference while maintaining competitive personalization performance. Experiments on a real brain-signal dataset demonstrate that our method achieves the best overall identification accuracy while reducing inference energy by 6.46× compared with conventional artificial neural network-based personalized baselines.

2603.22430 2026-05-21 cs.LG

Inference Time Policy Optimization for Offline RL with Differentiable World Models

Rohan Deb, Stephen J. Wright, Arindam Banerjee

Affiliations: Siebel School of Computing and Data Science; Department of Computer Sciences; University of Illinois Urbana-Champaign; University of Wisconsin Madison

AI summary: This paper proposes optimizing policy parameters at inference time using a differentiable world model, improving offline reinforcement learning performance through end-to-end gradient computation. It also examines the trade-off between the computational cost and the gains of inference-time adaptation.

Abstract

Offline Reinforcement Learning (RL) learns optimal policies from fixed datasets, training a policy once and deploying it at inference time without further refinement. Inspired by model predictive control (MPC), we introduce an inference time adaptation framework that utilizes a pretrained policy along with a learned world model. While existing world model and diffusion-planning methods use learned dynamics to generate imagined trajectories during training, or to sample candidate plans at inference time, they do not use inference-time information to *optimize* the policy parameters on the fly. In contrast, our design is a Differentiable World Model (DWM) pipeline that enables end-to-end gradient computation through imagined rollouts for inference time policy optimization (ITPO). We evaluate our algorithm on D4RL continuous-control benchmarks (MuJoCo locomotion tasks and AntMaze), and show that exploiting inference-time information to optimize the policy parameters yields consistent gains over strong offline RL baselines. Inference-time adaptation, however, is expensive: rollout generation and backpropagation dominate per-step compute. We study this tradeoff explicitly, showing that a suitable tilted version of one-step MeanFlow sampler recovers much of the gains at a fraction of the cost.

2603.16513 2026-05-21 cs.LG cs.AI

FEAT: A Linear-Complexity Foundation Model for Extremely Large Structured Data

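Zhenghang Song, Tang Qian, Lu Chen, Yushuai Li, Zhengke Hu, Bingbing Fang, Yumeng Song, Junbo Zhao, Sheng Zhang, Tianyi Li

Affiliations: Zhejiang University; Ant Group; Aalborg University

AI summary: This paper proposes FEAT, a linear-complexity foundation model for extremely large structured data. Its multi-layer dual-axis encoding architecture and adaptive-fusion bidirectional state-space model enable cross-tuple contextualization in linear time while supporting permutation-invariant representation learning.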

Abstract

Structured data is widely used in domains such as healthcare, finance, and scientific data management. Recent studies on structured data foundation models (SFMs) aim to support data analysis and mining tasks over such data, but still face scalability and generalization challenges when applied to real-world enterprise databases. First, many SFMs rely on full self-attention, which introduces an O(N^2) computational bottleneck and limits the number of tuples that can be processed jointly. Second, directly replacing attention with linear-complexity sequence models may conflict with the permutation-invariant nature of structured data, introducing artificial order bias and degrading representation quality. Moreover, models trained only on synthetic data may struggle to generalize to the heavy-tailed and heterogeneous distributions commonly found in real-world databases. To address these challenges, we propose FEAT, a linear-complexity foundation model for extremely large structured data. FEAT replaces quadratic attention with a multi-layer dual-axis encoding architecture. It integrates an adaptive-fusion bidirectional state-space model (AFBM) with convolutional gated linear attention (Conv-GLA), enabling cross-tuple contextualization in O(N) time while supporting permutation-invariant representation learning. To improve robustness under real-world data skewness, FEAT further adopts a hybrid structural causal pre-training pipeline with a robust reconstruction objective. Experiments on 12 real-world database benchmarks show that FEAT consistently outperforms representative SFMs on zero-shot tasks and scales linearly with structured-data sample length, achieving up to 50x faster inference latency.
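
The O(N^2) versus O(N) contrast in this abstract can be illustrated with generic kernelized linear attention. This sketch is not FEAT's AFBM or Conv-GLA, whose details the abstract does not give. It assumes NumPy, a non-causal setting, and an `elu(x) + 1` feature map, all chosen for illustration. The quadratic form builds an N×N score matrix. The linear form first sums key-value products over all tuples, so its cost grows linearly in N and the result does not depend on tuple order.

```python
import numpy as np

def feature_map(x):
    """elu(x) + 1, a positive feature map (an illustrative choice)."""
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def quadratic_attention(Q, K, V):
    """Materializes the (N, N) score matrix: O(N^2) time and memory."""
    scores = feature_map(Q) @ feature_map(K).T
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def linear_attention(Q, K, V):
    """Same output, but keys and values are summarized once: O(N) in N."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V            # (d, d_v) summary of all key-value pairs
    z = phi_k.sum(axis=0)       # (d,) normalizer summary
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(32, 8)), rng.normal(size=(32, 8)), rng.normal(size=(32, 4))
perm = rng.permutation(32)
out_quadratic = quadratic_attention(Q, K, V)
out_linear = linear_attention(Q, K, V)
out_shuffled = linear_attention(Q, K[perm], V[perm])  # reorder the attended tuples
```

Both summaries are plain sums over tuples, which is why reordering the keys and values leaves the output unchanged. A recurrent sequence model does not have this property by default, which is the order-bias concern the abstract raises.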

2603.14392 2026-05-21 cs.LG cs.RO

WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic Systems

Yuchen Wang, Jiangtao Kong, Sizhe Wei, Xiaochang Li, Haohong Lin, Hongjue Zhao, Tianyi Zhou, Lu Gan, Huajie Shao

Affiliations: Georgia Institute of Technology; University of Illinois at Urbana-Champaign; Carnegie Mellon University; Mohamed bin Zayed University of Artificial Intelligence

AI summary: This paper proposes WestWorld, a knowledge-encoded scalable trajectory world model for diverse robotic systems. It introduces a system-aware Mixture-of-Experts (Sys-MoE) and a structural embedding to improve scalability and zero-shot generalization, yielding efficient trajectory prediction and control across a range of robotic environments.

Comments ICML 2026 spotlight

Abstract

Trajectory world models play a crucial role in robotic dynamics learning, planning, and control. While recent works have explored trajectory world models for diverse robotic systems, they struggle to scale to a large number of distinct system dynamics and overlook domain knowledge of physical structures. To address these limitations, we introduce WestWorld, a knoWledge-Encoded Scalable Trajectory World model for diverse robotic systems. To tackle the scalability challenge, we propose a novel system-aware Mixture-of-Experts (Sys-MoE) that dynamically combines and routes specialized experts for different robotic systems via a learnable system embedding. To further enhance zero-shot generalization, we incorporate domain knowledge of robot physical structures by introducing a structural embedding that aligns trajectory representations with morphological information. After pretraining on 89 complex environments spanning diverse morphologies across both simulation and real-world settings, WestWorld achieves significant improvements over competitive baselines in zero- and few-shot trajectory prediction. Additionally, it shows strong scalability across a wide range of robotic environments and significantly improves performance on downstream model-based control for different robots. Finally, we deploy our model on a real-world Unitree Go1, where it demonstrates stable locomotion performance. The code is available at https://github.com/511205787/WestWorld.

2603.14184 2026-05-21 cs.CV cs.AI

Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

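Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma, Xiaohui Li

Affiliations: Tsinghua Shenzhen International Graduate School; Huawei Technologies

AI summary: This paper studies the visual perceptual impairment that multimodal large language models exhibit during reasoning. It proposes a training-free Visual Region-Guided Attention framework that selects visual heads and reweights their attention to steer the model toward question-relevant regions, improving visual grounding and reasoning accuracy.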

Abstract

Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.

2603.13419 2026-05-21 cs.LG

Diffusion Models Memorize in Training -- and Generalize in Inference

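Tim Kaiser, Markus Kollmann

Affiliations: Heinrich-Heine-University Düsseldorf

AI summary: This paper shows that diffusion models overfit the denoising objective during training, producing a performance gap between training and validation samples, yet generalize at inference because model error moves sampling trajectories away from the distribution of noisy training samples.

Comments 31 pages and 29 figures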

Abstract

Diffusion models generalize well in practice. However, an optimal diffusion model fully memorizes the training data and therefore fails to generalize, raising the question of what induces generalization in a real diffusion model. We show that, despite generalizing at the sample level, diffusion models progressively overfit the denoising training objective and thereby create a generalization gap between the performance on validation and training samples. This gap is most pronounced at intermediate noise levels. Using a fully analytic error-prone toy model, we trace the factors affecting the generalization gap. We find that the optimal denoising flow field localizes sharply around training points, but the model error suppresses the exact recall of training points, yielding a smooth, generalizing flow field. Finally, we find that the generalization gap observed in training does not translate to inference, which would result in a strong similarity between generated samples and training samples. This is because the intermediate states of sampling trajectories are sufficiently far from the distribution of noisy training samples the model is trained on. Together, these findings reveal a novel picture of how diffusion models generalize: the flow field generalizes through model error, which moves sampling trajectories outside the domain of noisy training samples and thereby naturally prevents overfitting.
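
The statement that an optimal model memorizes the training data can be made concrete with a small worked example. It assumes a setup chosen here for illustration, which may differ from the paper's parameterization: a noisy sample is x_t = x_0 + σ ε with Gaussian ε, and x_0 is drawn uniformly from a finite training set. Under those assumptions the mean-squared-error-optimal denoiser is the posterior mean, a softmax-weighted average of the training points. The function name and the two-point dataset are hypothetical.

```python
import math

def optimal_denoiser(x_noisy, train_points, sigma):
    """Posterior mean E[x_0 | x_t] for a uniform empirical prior (1-D case)."""
    # Log-weights from the Gaussian likelihood of x_noisy given each training point.
    logits = [-(x_noisy - x) ** 2 / (2 * sigma ** 2) for x in train_points]
    top = max(logits)
    weights = [math.exp(logit - top) for logit in logits]  # stabilized softmax
    total = sum(weights)
    return sum(w * x for w, x in zip(weights, train_points)) / total

train = [-1.0, 2.0]
# Low noise: the output collapses onto the nearest training point.
low_noise = optimal_denoiser(1.9, train, sigma=0.05)
# High noise: the weights flatten and the output approaches the dataset mean.
high_noise = optimal_denoiser(0.5, train, sigma=100.0)
```

At low noise the weights concentrate on a single training point, which is the sharply localized behavior the abstract attributes to the optimal flow field. A trained network that cannot represent this function exactly smooths it out.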

2603.09024 2026-05-21 cs.LG

When to Retrain after Drift: A Data-Only Test of Post-Drift Data Size Sufficiency

Ren Fujiwara, Yasuko Matsubara, Yasushi Sakurai

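Affiliations: SANKEN, The University of Osaka, Japan

AI summary: This paper proposes CALIPER, a data-only test that estimates how much post-drift data is needed for stable retraining. It exploits state dependence in streams generated by dynamical systems: a single-pass weighted local regression tracks a one-step proxy error as a function of a locality parameter θ. Once an effective sample size gate is satisfied, an error that is monotonically non-increasing as θ grows indicates that the data are sufficient for retraining.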

Comments Accepted by ICLR 2026

Abstract

Sudden concept drift makes previously trained predictors unreliable, yet deciding when to retrain and what post-drift data size is sufficient is rarely addressed. We propose CALIPER, a detector- and model-agnostic, data-only test that estimates the post-drift data size required for stable retraining. CALIPER exploits state dependence in streams generated by dynamical systems: we run a single-pass weighted local regression over the post-drift window and track a one-step proxy error as a function of a locality parameter $θ$. When an effective sample size gate is satisfied, a monotonically non-increasing trend in this error as the locality parameter increases indicates that the data size is sufficiently informative for retraining. We also provide a theoretical analysis of our method, and we show that the algorithm has a low per-update time and memory. Across datasets from four heterogeneous domains, three learner families, and two detectors, CALIPER consistently matches or exceeds the best fixed data size for retraining while incurring negligible overhead and often outperforming incremental updates. CALIPER closes the gap between drift detection and data-sufficient adaptation in streaming learning.
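Two ingredients named in the abstract, an effective sample size gate and a monotone trend check over the locality parameter, can be sketched as below. This is a minimal illustration and not the CALIPER algorithm: the exponential kernel, the Kish formula for effective sample size, the threshold `min_ess`, the tolerance `tol`, and the decision rule in `data_sufficient` are all assumptions introduced here.

```python
import math

def kernel_weights(distances, theta):
    """Exponential locality kernel (an assumed form): larger theta
    concentrates weight on the nearest neighbours of the query state."""
    mean_d = sum(distances) / len(distances)
    return [math.exp(-theta * d / mean_d) for d in distances]

def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum(w^2)."""
    s1 = sum(weights)
    s2 = sum(w * w for w in weights)
    return s1 * s1 / s2

def is_non_increasing(errors, tol=1e-9):
    """True if each error is no larger than its predecessor (up to tol)."""
    return all(b <= a + tol for a, b in zip(errors, errors[1:]))

def data_sufficient(errors_by_theta, ess_by_theta, min_ess):
    """Hypothetical decision rule: every theta must pass the ESS gate and
    the proxy error must not increase as theta grows."""
    gate_ok = all(ess >= min_ess for ess in ess_by_theta)
    return gate_ok and is_non_increasing(errors_by_theta)

distances = [0.1, 0.2, 0.4, 0.8]
ess_values = [effective_sample_size(kernel_weights(distances, t))
              for t in (0.0, 1.0, 2.0)]
```

With `theta = 0` all weights are equal and the effective sample size equals the window length. It shrinks as `theta` grows, which is why a gate of this kind is needed before the error trend is trusted.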

2603.08235 2026-05-21 cs.CV cs.AI

Exploring Deep Learning and Ultra-Widefield Imaging for Diabetic Retinopathy and Macular Edema

Pablo Jimenez-Lizcano, Sergio Romero-Tapiador, Ruben Tolosana, Aythami Morales, Guillermo González de Rivera, Ruben Vera-Rodriguez, Julian Fierrez

Affiliations: BiometricsAI, Universidad Autónoma de Madrid, Madrid, Spain; Department of Mathematics, Universidad de Las Palmas de Gran Canaria, Spain; HCTLab Research Group, Universidad Autónoma de Madrid, Madrid, Spain

AI summary: This paper studies deep learning with ultra-widefield imaging for detecting diabetic retinopathy and diabetic macular edema. It benchmarks a range of deep learning models on a public dataset and examines the potential of feature-level fusion and frequency-domain representations.

Comments 6 pages, 4 figures, 2 tables

Abstract

Diabetic retinopathy (DR) and diabetic macular edema (DME) are leading causes of preventable blindness among working-age adults. Traditional approaches in the literature focus on standard color fundus photography (CFP) for the detection of these conditions. Nevertheless, recent ultra-widefield imaging (UWF) offers a significantly wider field of view in comparison to CFP. Motivated by this, the present study explores state-of-the-art deep learning (DL) methods and UWF imaging on three clinically relevant tasks: i) image quality assessment for UWF, ii) identification of referable diabetic retinopathy (RDR), and iii) identification of DME. Using the publicly available UWF4DR Challenge dataset, released as part of the MICCAI 2024 conference, we benchmark DL models in the spatial (RGB) and frequency domains, including popular convolutional neural networks (CNNs) as well as recent vision transformers (ViTs) and foundation models. In addition, we explore a final feature-level fusion to increase robustness. Finally, we also analyze the decisions of the DL models using Grad-CAM, increasing the explainability. Our proposal achieves consistently strong performance across all architectures, underscoring the competitiveness of emerging ViTs and foundation models and the promise of feature-level fusion and frequency-domain representations for UWF analysis.

2603.08155 2026-05-21 cs.LG

C$^2$FG: Control Classifier-Free Guidance via Score Discrepancy Analysis

Jiayang Gao, Tianyi Zheng, Jiayang Zou, Fengxiang Yang, Shice Liu, Luyao Fan, Zheyu Zhang, Hao Zhang, Jinwei Chen, Peng-Tao Jiang, Bo Li, Jia Wang

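Affiliations: Shanghai Jiao Tong University; vivo BlueImage Lab; vivo Mobile Communication Co., Ltd.

AI summary: This paper proposes C$^2$FG, a method for controlling classifier-free guidance based on score discrepancy analysis. A rigorous theoretical analysis establishes upper bounds on the score discrepancy between the conditional and unconditional distributions at different timesteps, which gives time-dependent guidance a principled foundation. Experiments confirm its effectiveness across diverse generative tasks.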

Comments Accepted to CVPR 2026 (Highlight)

Abstract

Classifier-Free Guidance (CFG) is a cornerstone of modern conditional diffusion models, yet its reliance on the fixed or heuristic dynamic guidance weight is predominantly empirical and overlooks the inherent dynamics of the diffusion process. In this paper, we provide a rigorous theoretical analysis of the Classifier-Free Guidance. Specifically, we establish strict upper bounds on the score discrepancy between conditional and unconditional distributions at different timesteps based on the diffusion process. This finding explains the limitations of fixed-weight strategies and establishes a principled foundation for time-dependent guidance. Motivated by this insight, we introduce Control Classifier-Free Guidance (C$^2$FG), a novel, training-free, and plug-in method that aligns the guidance strength with the diffusion dynamics via an exponential decay control function. Extensive experiments demonstrate that C$^2$FG is effective and broadly applicable across diverse generative tasks, while also exhibiting orthogonality to existing strategies.
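For orientation, one common parameterization of classifier-free guidance combines the two noise predictions as eps_uncond + w * (eps_cond - eps_uncond). The sketch below replaces the constant w with an exponentially decaying schedule. The abstract does not give the exact control function of C$^2$FG, so the form `w_max * exp(-lam * s)`, the parameters `w_max` and `lam`, and the orientation of the progress variable `s` in [0, 1] are assumptions.

```python
import math

def guidance_weight(s, w_max=7.5, lam=2.0):
    """Exponentially decaying guidance weight (assumed form).

    s is a normalized progress variable in [0, 1]; which end of the
    sampling trajectory s = 0 corresponds to is left to the caller.
    """
    return w_max * math.exp(-lam * s)

def guided_prediction(eps_uncond, eps_cond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_u = [0.0, 1.0]
eps_c = [1.0, 3.0]
at_s0 = guided_prediction(eps_u, eps_c, guidance_weight(0.0))
at_s1 = guided_prediction(eps_u, eps_c, guidance_weight(1.0))
```

Because the schedule only changes the scalar passed to `guided_prediction`, a scheme like this needs no retraining, which is consistent with the training-free, plug-in description in the abstract.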

2603.06007 2026-05-21 cs.CL cs.AI cs.MA

MASFactory: A Graph-centric Framework for Orchestrating LLM-Based Multi-Agent Systems with Vibe Graphing

Yang Liu, Jinxuan Cai, Yishen Li, Qi Meng, Zedi Liu, Xin Li, Chen Qian, Chuan Shi, Cheng Yang

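Affiliations: Beijing University of Posts and Telecommunications; Shanghai Jiao Tong University

AI summary: This paper presents MASFactory, a graph-centric framework for orchestrating LLM-based multi-agent systems with Vibe Graphing. It targets three problems in existing frameworks: implementing complex graph workflows takes substantial manual effort, reuse is limited, and heterogeneous external context sources are hard to integrate.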

Comments Accepted to the ACL 2026 Demo Track. Camera-ready version. 10 pages, 6 figures. Code and documentation are available at: https://github.com/BUPT-GAMMA/MASFactory

Abstract

Large language model-based (LLM-based) multi-agent systems (MAS) are increasingly used to extend agentic problem solving via role specialization and collaboration. MAS workflows can be naturally modeled as directed computation graphs, where nodes execute agents or sub-workflows and edges encode dependencies and message passing. However, implementing complex graph workflows in current frameworks still requires substantial manual effort, offers limited reuse, and makes it difficult to integrate heterogeneous external context sources. To overcome these limitations, we present MASFactory, a graph-centric framework for orchestrating LLM-based MAS. It introduces Vibe Graphing, a human-in-the-loop approach that compiles natural-language intent into an editable workflow specification and then into an executable graph. In addition, the framework provides reusable components, skill support, multimodal message handling, and pluggable context integration, as well as a visualizer for topology preview, runtime tracing, and human-in-the-loop interaction. We evaluate MASFactory on seven public benchmarks, validating both reproduction consistency for representative MAS methods and the effectiveness of Vibe Graphing. Our code (https://github.com/BUPT-GAMMA/MASFactory, licensed under Apache-2.0) and video demonstration (https://youtu.be/ANynzVfY32k) are publicly available.
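The directed-computation-graph view in the abstract can be illustrated with Python's standard library alone. This is not MASFactory's API: the node names, the `run_workflow` helper, and the convention that each node is a function of its predecessors' messages are hypothetical.

```python
from graphlib import TopologicalSorter

def run_workflow(nodes, deps, initial):
    """Execute nodes in dependency order.

    nodes:   name -> callable taking a dict {predecessor_name: message}
    deps:    name -> set of predecessor names (edges carry messages)
    initial: message handed to nodes that have no predecessors
    """
    outputs = {}
    for name in TopologicalSorter(deps).static_order():
        preds = deps.get(name, set())
        inbox = {p: outputs[p] for p in preds} if preds else {"input": initial}
        outputs[name] = nodes[name](inbox)
    return outputs

# Stand-in "agents": plain functions instead of LLM calls.
nodes = {
    "planner": lambda inbox: f"plan({inbox['input']})",
    "coder": lambda inbox: f"code({inbox['planner']})",
    "reviewer": lambda inbox: f"review({inbox['planner']}, {inbox['coder']})",
}
deps = {"planner": set(), "coder": {"planner"}, "reviewer": {"planner", "coder"}}
result = run_workflow(nodes, deps, "build a CLI")
```

Keeping the topology in a plain dictionary separates the workflow specification from its execution, which is the kind of split that makes a graph editable before it is run.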

2603.01712 2026-05-21 cs.AI cs.LG

FT-Dojo: Towards Autonomous LLM Fine-Tuning with Language Agents

Qizheng Li, Yifei Zhang, Xiao Yang, Xu Yang, Zhuo Wang, Weiqing Liu, Jiang Bian

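Affiliations: Peking University; Nanjing University; Microsoft Research Asia; The University of Chicago

AI summary: This paper introduces FT-Dojo, an interactive benchmark environment for studying autonomous LLM fine-tuning. FT-Dojo standardizes a task interface, a shared raw-data repository, a sandboxed execution environment, and a structured feedback protocol. The paper also develops FT-Agent, a framework built on structured iteration planning, fail-fast validation, and multi-level feedback analysis. FT-Agent achieves the best performance on 10 of the 13 tasks, and case studies show that agents can recover from failures while still exposing limitations in causal diagnosis and long-horizon planning.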

Comments 26 pages, 6 figures, 11 tables

Abstract

Fine-tuning large language models for vertical domains remains labor-intensive, requiring practitioners to curate data, configure training, and iteratively diagnose model behavior. Despite growing interest in autonomous machine learning and language agents, end-to-end LLM fine-tuning has not been systematically studied as an interactive agent task. We introduce FT-Dojo, an interactive benchmark environment for autonomous LLM fine-tuning, comprising 13 tasks across 5 domains. Rather than a new collection of static datasets, FT-Dojo standardizes a task interface, shared raw-data repository, sandboxed execution environment, structured feedback protocol, and held-out evaluation procedure. We further develop FT-Agent, a fine-tuning-oriented autonomous framework that uses structured iteration planning, fail-fast validation, and multi-level feedback analysis to refine data and training strategies. Experiments show that FT-Agent provides a strong initial baseline, achieving the best performance on 10 out of 13 tasks, with additional controlled comparisons against frontier agents, open-source planning backbones, and multi-run statistics supporting the main findings. Case studies show that agents can recover from failures through cumulative learning, while still exposing limitations in causal diagnosis and long-horizon planning. The implementation is available at https://github.com/microsoft/rd-agent.

2603.01406 2026-05-21 cs.LG cs.AI cs.NA math.NA

One Operator to Rule Them All? On Boundary-Indexed Operator Families in Neural PDE Solvers

Lennon J. Shikhman

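Affiliations: College of Computing, Georgia Institute of Technology; Department of Mathematics and Systems Engineering, Florida Institute of Technology

AI summary: This paper argues that standard neural operator training for PDE solvers learns a boundary-indexed family of operators rather than a single boundary-agnostic operator. It derives a non-identifiability result outside the support of the training boundary distribution and confirms the resulting limitations with controlled experiments under varying boundary conditions.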

Comments Published in the ICLR 2026 Workshop on AI & PDEs. 10 pages, 5 figures

Abstract

Neural PDE solvers are often described as learning solution operators that map problem data to PDE solutions. In this work, we argue that this interpretation is generally incorrect when boundary conditions vary. We show that standard neural operator training implicitly learns a boundary-indexed family of operators, rather than a single boundary-agnostic operator, with the learned mapping fundamentally conditioned on the boundary-condition distribution seen during training. We formalize this perspective by framing operator learning as conditional risk minimization over boundary conditions, which leads to a non-identifiability result outside the support of the training boundary distribution. As a consequence, generalization in forcing terms or resolution does not imply generalization across boundary conditions. We support our theoretical analysis with controlled experiments on the Poisson equation, demonstrating sharp degradation under boundary-condition shifts, cross-distribution failures between distinct boundary ensembles, and convergence to conditional expectations when boundary information is removed. Our results clarify a core limitation of current neural PDE solvers and highlight the need for explicit boundary-aware modeling in the pursuit of foundation models for PDEs.
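A one-dimensional Poisson problem makes the dependence on boundary data explicit. The sketch below solves -u'' = f on [0, 1] with Dirichlet values u(0) = a and u(1) = b by finite differences. It is an illustration written for this listing, not the paper's experimental setup, and the function name `solve_poisson_1d` is hypothetical.

```python
def solve_poisson_1d(f, a, b, n=99):
    """Solve -u'' = f on [0, 1], u(0) = a, u(1) = b, on n interior points.

    Uses the second-order central difference and the Thomas algorithm
    for the resulting tridiagonal system (diagonal 2, off-diagonals -1).
    """
    h = 1.0 / (n + 1)
    xs = [(i + 1) * h for i in range(n)]
    rhs = [h * h * f(x) for x in xs]
    rhs[0] += a   # known boundary values move to the right-hand side
    rhs[-1] += b
    # Forward elimination.
    c_prime = [0.0] * n
    d_prime = [0.0] * n
    c_prime[0] = -1.0 / 2.0
    d_prime[0] = rhs[0] / 2.0
    for i in range(1, n):
        denom = 2.0 + c_prime[i - 1]
        c_prime[i] = -1.0 / denom
        d_prime[i] = (rhs[i] + d_prime[i - 1]) / denom
    # Back substitution.
    u = [0.0] * n
    u[-1] = d_prime[-1]
    for i in range(n - 2, -1, -1):
        u[i] = d_prime[i] - c_prime[i] * u[i + 1]
    return xs, u

forcing = lambda x: 1.0
xs, u_zero_bc = solve_poisson_1d(forcing, 0.0, 0.0)     # u = x(1 - x)/2
_, u_shifted_bc = solve_poisson_1d(forcing, 0.0, 1.0)   # u = x(1 - x)/2 + x
```

The same forcing term yields two different solutions under the two boundary settings, so a map from forcing alone to solution is only defined once the boundary data are fixed. That is the sense in which the learned object is a family of operators indexed by boundary conditions.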

2602.24138 2026-05-21 cs.CV cs.AI

Multimodal Optimal Transport for Training-free Temporal Segmentation in Surgical Robotics

Omar Mohamed, Edoardo Fazzari, Ayah Al-Naji, Hamdan Alhadhrami, Khalfan Hableel, Saif Alkindi, Ivan Laptev, Cesare Stefanini

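Affiliations: Dept. of Robotics, Mohamed bin Zayed University of AI

AI summary: This paper proposes TASOT, an annotation-free framework for surgical temporal segmentation. It fuses visual features with temporally aligned textual descriptions under a unified unbalanced Gromov-Wasserstein optimal transport objective, yielding substantial improvements on several public surgical datasets.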

Abstract

Automated recognition of surgical phases and steps is a fundamental capability for intraoperative decision support, workflow automation, and skill assessment in robotic-assisted surgery. Existing approaches either depend on large-scale annotated surgical datasets or require expensive domain-specific pretraining on thousands of labeled videos, limiting their practical deployability across diverse robotic platforms and clinical environments. In this work, we propose TASOT (Text-Augmented Action Segmentation Optimal Transport), an annotation-free framework for surgical temporal segmentation that requires no task-specific annotations or surgical-domain pretraining. TASOT extends the Action Segmentation Optimal Transport (ASOT) formulation by incorporating temporally aligned textual descriptions generated directly from the input video, fusing visual and semantic cues within a unified unbalanced Gromov-Wasserstein optimal transport objective. Visual representations are extracted using DINOv3, while temporal captions produced by a vision-language model are encoded via CLIP and temporally aligned to individual frames, providing complementary semantic structure to the transport cost. We evaluate TASOT on three public surgical datasets and four benchmark settings spanning laparoscopic and robotic procedures, showing substantial improvements over the strongest zero-shot baselines: +18.9 F1 on Cholec80, +33.7 on AutoLaparo, +23.7 on StrasByPass70, and +4.5 on BernByPass70. These results suggest that fine-grained surgical workflow understanding in robotic settings can be achieved without manual training annotations or surgical-specific pretraining pipelines, offering a promising alternative for real-world robotic surgical systems.

2602.20399 2026-05-21 cs.LG

GeoPT: Scaling Physics Simulation via Lifted Geometric Pre-Training

Haixu Wu, Minghao Guo, Zongyi Li, Zhiyang Dou, Mingsheng Long, Kaiming He, Wojciech Matusik

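Affiliations: MIT CSAIL; Tsinghua University

AI summary: This paper presents GeoPT, a unified pre-trained model for general physics simulation based on lifted geometric pre-training. It augments geometry with synthetic dynamics to enable dynamics-aware self-supervision without physics labels, reducing labeled-data requirements and accelerating convergence on downstream simulation benchmarks.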

Comments Project Page: https://physics-scaling.github.io/GeoPT/

Abstract

Neural simulators promise efficient surrogates for physics simulation, but scaling them is bottlenecked by the prohibitive cost of generating high-fidelity training data. Pre-training on abundant off-the-shelf geometries offers a natural alternative, yet faces a fundamental gap: supervision on static geometry alone ignores dynamics and can lead to negative transfer on physics tasks. We present GeoPT, a unified pre-trained model for general physics simulation based on lifted geometric pre-training. The core idea is to augment geometry with synthetic dynamics, enabling dynamics-aware self-supervision without physics labels. Pre-trained on over one million samples, GeoPT consistently improves industrial-fidelity benchmarks spanning fluid mechanics for cars, aircraft, and ships, and solid mechanics in crash simulation, reducing labeled data requirements by 20-60% and accelerating convergence by 2$\times$. These results show that lifting with synthetic dynamics bridges the geometry-physics gap, unlocking a scalable path for neural simulation and potentially beyond. Code is available at https://github.com/Physics-Scaling/GeoPT.

2602.19320 2026-05-21 cs.CL cs.AI

Anatomy of Agentic Memory: Taxonomy and Empirical Analysis of Evaluation and System Limitations

Dongming Jiang, Yi Li, Songtao Wei, Jinxin Yang, Ayushi Kishore, Alysa Zhao, Dingyi Kang, Xu Hu, Feng Chen, Qiannan Li, Bingzhe Li

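Affiliations: University of Texas at Dallas; University of California Davis; Texas A&M University

AI summary: Through a taxonomy and empirical analysis, this survey examines the architectural and system-level limitations of agentic memory systems. It identifies problems with benchmark saturation, metric validity, backbone-model dependence, and the overhead of memory maintenance, and it outlines directions for more reliable evaluation and scalable system design.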

Abstract

Agentic memory systems enable large language model (LLM) agents to maintain state across long interactions, supporting long-horizon reasoning and personalization beyond fixed context windows. Despite rapid architectural development, the empirical foundations of these systems remain fragile: existing benchmarks are often underscaled, evaluation metrics are misaligned with semantic utility, performance varies significantly across backbone models, and system-level costs are frequently overlooked. This survey presents a structured analysis of agentic memory from both architectural and system perspectives. We first introduce a concise taxonomy of MAG systems based on four memory structures. Then, we analyze key pain points limiting current systems, including benchmark saturation effects, metric validity and judge sensitivity, backbone-dependent accuracy, and the latency and throughput overhead introduced by memory maintenance. By connecting the memory structure to empirical limitations, this survey clarifies why current agentic memory systems often underperform their theoretical promise and outlines directions for more reliable evaluation and scalable system design.

2602.18532 2026-05-21 cs.CV cs.AI cs.RO

VLANeXt: Recipes for Building Strong VLA Models

Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, Chen Change Loy

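Affiliations: S-Lab, Nanyang Technological University; SenseTime Research; Sun Yat-sen University; Shanghai Jiao Tong University

AI summary: This paper reexamines the VLA design space under a unified framework and evaluation setup, systematically dissecting foundational components, perception essentials, and action modelling perspectives. It distills 12 key findings into a simple yet effective model, VLANeXt, which outperforms existing methods on the LIBERO and LIBERO-plus benchmarks, and it releases an easy-to-use codebase.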

Comments Accepted in ICML 2026, Project Page: https://dravenalg.github.io/VLANeXt/

Abstract

Following the rise of large foundation models, Vision-Language-Action models (VLAs) emerged, leveraging strong visual and language understanding from Vision-Language Models for general-purpose policy learning. Yet, the current VLA landscape remains fragmented and exploratory. Although many groups have proposed their own VLA models, inconsistencies in training protocols and evaluation settings make it difficult to identify which design choices truly matter. To bring structure to this evolving space, we reexamine the VLA design space under a unified framework and evaluation setup. Starting from a simple VLA baseline similar to RT-2, which is the origin of VLA, we systematically dissect design choices along three dimensions: foundational components, perception essentials, and action modelling perspectives. From this study, we distill 12 key findings that together form a practical recipe for building strong VLA models. The outcome of this exploration is a simple yet effective model, VLANeXt. It outperforms the state-of-the-art methods on the LIBERO and LIBERO-plus benchmarks and demonstrates strong performance in real-world experiments. We release a unified and easy-to-use codebase to reproduce our findings, explore the design space, and develop new VLA variants on top of a shared foundation. The codebase is available at https://github.com/DravenALG/VLANeXt.

2602.17062 2026-05-21 cs.AI

Retaining Suboptimal Actions to Follow Shifting Optima in Multi-Agent Reinforcement Learning

Yonghyeon Jo, Sunwoo Lee, Seungyul Han

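Affiliations: Graduate School of Artificial Intelligence; Ulsan National Institute of Science and Technology (UNIST)

AI summary: This paper proposes S2Q, an algorithm that learns multiple sub-value functions to retain alternative high-value actions, so that agents can follow the optimum as it shifts when the value function changes in multi-agent reinforcement learning. Experiments show strong performance on multi-agent reinforcement learning benchmarks.

Comments 10 technical pages followed by references and appendix. Accepted to ICLR 2026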

Journal ref International Conference on Learning Representations (ICLR), 2026

2602.16608 2026-05-21 cs.CL cs.AI cs.CV cs.LG

Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models

Melkamu Abay Mersha, Jugal Kalita

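Affiliations: College of Engineering and Applied Science, University of Colorado Colorado Springs

AI summary: This paper proposes the Context-Aware Layer-wise Integrated Gradients (CA-LIG) framework for explaining the decisions of Transformer models. It computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. The result is a set of signed, context-sensitive attribution maps that capture supportive and opposing evidence and trace the hierarchical flow of relevance through the Transformer layers.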

DOI: 10.1016/j.neucom.2026.133050

Abstract

Transformer models achieve state-of-the-art performance across domains and tasks, yet their deeply layered representations make their predictions difficult to interpret. Existing explainability methods rely on final-layer attributions, capture either local token-level attributions or global attention patterns without unification, and lack context-awareness of inter-token dependencies and structural components. They also fail to capture how relevance evolves across layers and how structural components shape decision-making. To address these limitations, we propose the Context-Aware Layer-wise Integrated Gradients (CA-LIG) Framework, a unified hierarchical attribution framework that computes layer-wise Integrated Gradients within each Transformer block and fuses these token-level attributions with class-specific attention gradients. This integration yields signed, context-sensitive attribution maps that capture supportive and opposing evidence while tracing the hierarchical flow of relevance through the Transformer layers. We evaluate the CA-LIG Framework across diverse tasks, domains, and transformer model families, including sentiment analysis and long and multi-class document classification with BERT, hate speech detection in a low-resource language setting with XLM-R and AfroLM, and image classification with a Masked Autoencoder Vision Transformer model. Across all tasks and architectures, CA-LIG provides more faithful attributions, shows stronger sensitivity to contextual dependencies, and produces clearer, more semantically coherent visualizations than established explainability methods. These results indicate that CA-LIG provides a more comprehensive, context-aware, and reliable explanation of Transformer decision-making, advancing both the practical interpretability and conceptual understanding of deep neural models.
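Since CA-LIG builds on Integrated Gradients, a minimal version of the base technique may help. The sketch below implements plain Integrated Gradients with a midpoint Riemann sum for a toy function with a hand-written gradient. It does not include the layer-wise computation or the attention-gradient fusion described in the abstract, and the names `integrated_gradients`, `model`, and `model_grad` are illustrative.

```python
def integrated_gradients(grad_fn, x, baseline, steps=50):
    """Approximate IG_i = (x_i - b_i) * integral over a in [0, 1] of
    dF/dx_i evaluated at b + a * (x - b), using a midpoint Riemann sum."""
    dim = len(x)
    avg_grad = [0.0] * dim
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad_fn(point)
        for i in range(dim):
            avg_grad[i] += g[i] / steps
    return [(xi - b) * gi for xi, b, gi in zip(x, baseline, avg_grad)]

def model(x):
    # Toy scalar "model": F(x) = x0^2 + 3 * x1
    return x[0] ** 2 + 3.0 * x[1]

def model_grad(x):
    # Analytic gradient of the toy model.
    return [2.0 * x[0], 3.0]

x = [2.0, 1.0]
baseline = [0.0, 0.0]
attributions = integrated_gradients(model_grad, x, baseline)
```

The attributions are signed and sum to F(x) - F(baseline), the completeness property of Integrated Gradients. A layer-wise variant would apply the same path integral to the inputs of each block rather than only to the model input.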

2602.11675 2026-05-21 cs.AI

Epistemic Regret Minimization: Label-Free Causal Critique Beyond Outcome Reward

Edward Y. Chang, Longling Geng

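Affiliations: Stanford University

AI summary: This paper proposes Epistemic Regret Minimization (ERM), a framework that critiques the causal structure of a model's reasoning trace rather than its answer, which allows label-free operation when no correct answer is available. Experiments on six frontier LLMs show that causal critique recovers far more reasoning errors than outcome-only correction.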

Comments 43 pages, 22 tables, 18 figures

Abstract

Large language models can answer causal questions correctly for the wrong reasons. Current RL methods reward what a model concludes but ignore why, reinforcing correlational shortcuts, a failure we call Reward Entrenchment. We introduce Epistemic Regret Minimization (ERM), a framework that critiques the causal structure of a model's reasoning trace rather than its answer. Applying established causal principles, ERM flags unexamined confounders, correlation-intervention conflation, and unchecked back-door paths from exposed reasoning traces. The framework admits label-free operation (without the true causal graph or correct answer), and we separately distinguish favorable benchmark-derived critique, error-direction cues, and fully label-free judge-generated critique in the experiments. Within a single episode, ERM detects and repairs causal reasoning errors; across episodes, it accumulates interventional evidence into a reward signal applicable where no answer key exists. Experiments on 1,360 scenarios across six frontier LLMs show that reasoning-heavy models (GPT-4 Turbo, GPT-5.2) resist outcome-only correction (25-31% recovery) yet respond to causal critique (78-91%), gaining +53-59 pp. Standard test-time methods (self-consistency, Best-of-N, Self-Refine) underperform outcome-only reprompting on causal tasks, while ERM reduces residual Rung Collapse from 55-70% to 4%. A separation theorem proves outcome-only reward cannot close this gap; a controlled simulation confirms epistemic feedback does, outperforming outcome-only baselines 38-fold.
