arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday
2602.11499 2026-05-21 cs.CV

What if Agents Could Imagine? Reinforcing Open-Vocabulary HOI Comprehension through Generation

Zhenlong Yuan, Yue Wang, Dapeng Zhang, Kejin Cui, Rui Chen, Jing Tang, Lei Sun, Hongwei Yu, Chengxuan Qian, Xiangxiang Chu, Shuo Li, Yuyin Zhou

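Affiliations Dream-X Team; Stanford University; National University of Singapore; Independent Researcher; Case Western Reserve University; UC Santa Cruz

AI summary This paper proposes ImagineAgent, a framework that uses generative world modeling and tool-augmented reinforcement learning to address cross-modal hallucinations and limited viewpoints in open-vocabulary human-object interaction understanding, achieving efficient and robust reasoning.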

Abstract

Multimodal Large Language Models have shown promising capabilities in bridging visual and textual reasoning, yet their reasoning capabilities in Open-Vocabulary Human-Object Interaction (OV-HOI) are limited by cross-modal hallucinations and limited viewpoints of images. To address this, we propose ImagineAgent, an agentic framework that integrates cognitive mapping, tool-augmented reinforcement learning (RL), and generative world modeling for robust OV-HOI understanding. Specifically, we first propose an innovative CoT dataset named hicodet-6K for supervised fine-tuning (SFT), which effectively bridges the perception-to-cognition gap by structuring perceived entities into interaction pairs for comprehensive predictions. Subsequently, we develop a multimodal tool library integrating online retrieval, image cropping, and generative modeling, enabling the agent to dynamically augment reasoning with domain-specific tools to resolve visual-semantic ambiguities and hallucinations during inference. Moreover, we incorporate a generative model to reconstruct alternative viewpoints, enabling the agent to 'imagine' under limited viewpoints. Finally, we propose a composite reward mechanism to jointly optimize prediction accuracy and tool efficiency. Evaluations on both SWIG-HOI and HICO-DET datasets demonstrate that our method achieves state-of-the-art performance while requiring merely 36.7% of the training data compared to existing methods, validating our robustness, empirical effectiveness and efficiency.
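The abstract describes the composite reward only at a high level. The following is a minimal sketch of one way to trade prediction accuracy against tool efficiency, assuming a binary accuracy term and a capped per-call penalty. The function name, the penalty weight, and the cap are hypothetical and are not taken from the paper.

```python
def composite_reward(correct: bool, num_tool_calls: int,
                     tool_penalty: float = 0.1, max_calls: int = 5) -> float:
    """Hypothetical reward: 1 for a correct prediction, minus a capped tool-use cost."""
    accuracy_term = 1.0 if correct else 0.0
    # Cap the penalty so that heavy tool use cannot dominate the accuracy signal.
    efficiency_cost = tool_penalty * min(num_tool_calls, max_calls)
    return accuracy_term - efficiency_cost
```

Under these assumed defaults, a correct answer reached with two tool calls scores higher than the same answer reached with four, which is the kind of pressure toward tool efficiency the abstract refers to.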

2602.08819 2026-05-21 cs.LG cs.CL

Bayesian Preference Learning for Test-Time Steerable Reward Models

Jiwoo Hong, Shao Tang, Zhipeng Wang

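Affiliations LinkedIn Corporation; Nubank

AI summary This paper proposes Variational In-Context Reward Modeling (ICRM), a new Bayesian reward modeling objective that achieves test-time steerability through in-context preference demonstrations, adapting to unseen preference distributions and improving reward model accuracy and robustness.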

Comments Preprint

Abstract

Reward models are central to aligning language models with human preferences via reinforcement learning (RL). As RL is increasingly applied to settings such as verifiable rewards and multi-objective alignment, RMs are expected to encode more complex and multifaceted preference distributions. However, classifier RMs remain static once trained, limiting their adaptability at test time. We propose Variational In-Context Reward Modeling (ICRM), a novel Bayesian reward modeling objective that enables test-time steerability via in-context preference demonstrations. ICRM casts reward modeling as amortized variational inference over a latent preference probability under the Bradley-Terry model using a conjugate Beta prior. We show that ICRM adapts to unseen preference distributions at test time for both single and multi-objective settings. With more demonstrations, ICRM improves RM-Bench accuracy from 60.5 to 70.8, achieves lower calibration error than a generative judge on moral dilemma preferences, and expands the attainable Pareto frontier under conflicting preferences. We further study the practical applicability of ICRM for RL training, showing that it can effectively encode verifiable rewards by outperforming a conventional RM in math reasoning. Finally, we provide theoretical guarantees that the variational objective admits a global interior optimum with finite confidence, and we analyze how KL regularization mitigates reward over-optimization.
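To make the Bradley-Terry and conjugate Beta ingredients concrete, here is a minimal sketch of the textbook versions: the Bradley-Terry preference probability, and the closed-form Beta update after observing preference outcomes. This is not the amortized variational objective of ICRM; the function names are illustrative assumptions.

```python
import math


def bradley_terry_prob(reward_a: float, reward_b: float) -> float:
    """Bradley-Terry probability that response A is preferred over response B."""
    return 1.0 / (1.0 + math.exp(-(reward_a - reward_b)))


def beta_posterior(alpha: float, beta: float, wins: int, losses: int):
    """Conjugate update of a Beta(alpha, beta) prior after Bernoulli preference outcomes."""
    return alpha + wins, beta + losses


def beta_mean_and_count(alpha: float, beta: float):
    """Posterior mean of the preference probability and the total pseudo-count."""
    return alpha / (alpha + beta), alpha + beta
```

Starting from a uniform Beta(1, 1) prior, three demonstrations favoring A and one favoring B give Beta(4, 2), whose mean is 2/3. More demonstrations raise the pseudo-count, which is one simple reading of "confidence" in this setting.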

2602.08028 2026-05-21 cs.CL cs.AI

Diverge to Induce Prompting: Multi-Rationale Induction for Zero-Shot Reasoning

Po-Chun Chen, Hen-Hsen Huang, Hsin-Hsi Chen

Affiliations Department of Computer Science and Information Engineering, National Taiwan University, Taiwan; Institute of Information Science, Academia Sinica, Taiwan; AI Research Center (AINTU), National Taiwan University, Taiwan

AI summary This study proposes the DIP framework, which generates multiple diverse high-level rationales and induces a final plan from them to improve zero-shot reasoning accuracy, overcoming the unstable reasoning paths of conventional chain-of-thought prompting.

Comments Accepted to Findings of IJCNLP-AACL 2025

Abstract

To address the instability of unguided reasoning paths in standard Chain-of-Thought prompting, recent methods guide large language models (LLMs) by first eliciting a single reasoning strategy. However, relying on just one strategy for each question can still limit performance across diverse tasks. We propose Diverge-to-Induce Prompting (DIP), a framework that first prompts an LLM to generate multiple diverse high-level rationales for each question. Each rationale is then elaborated into a detailed, step-by-step draft plan. Finally, these draft plans are induced into a final plan. DIP enhances zero-shot reasoning accuracy without reliance on resource-intensive sampling. Experiments show that DIP outperforms single-strategy prompting, demonstrating the effectiveness of multi-plan induction for prompt-based reasoning.
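The three stages described above (diverge into rationales, elaborate each into a draft plan, induce a final plan) can be sketched as a small pipeline. This is an illustration only: `llm` is a hypothetical callable that maps a prompt string to a completion, and the prompt wording is invented here, not taken from the paper.

```python
from typing import Callable, List


def diverge_to_induce(question: str, llm: Callable[[str], str],
                      num_rationales: int = 3) -> str:
    """Return a final plan induced from several draft plans (hypothetical prompts)."""
    # Step 1: diverge into several distinct high-level rationales.
    rationales: List[str] = [
        llm(f"Question: {question}\n"
            f"Give high-level rationale #{i + 1}, distinct from the previous ones.")
        for i in range(num_rationales)
    ]
    # Step 2: elaborate each rationale into a step-by-step draft plan.
    drafts: List[str] = [
        llm(f"Question: {question}\nRationale: {r}\nWrite a step-by-step draft plan.")
        for r in rationales
    ]
    # Step 3: induce one final plan from all drafts.
    joined = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    return llm(f"Question: {question}\n{joined}\nInduce a single final plan.")
```

With `num_rationales = n`, this sketch issues 2n + 1 model calls per question, a fixed cost, in contrast to sampling many full reasoning chains.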

2602.06862 2026-05-21 cs.CV

Parameters as Experts: Adapting Vision Models with Dynamic Parameter Routing

Meng Lou, Stanley Yu, Yizhou Yu

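Affiliations The University of Hong Kong; University of Pennsylvania

AI summary This paper proposes ParaX, which fine-tunes vision models efficiently through a dynamic parameter routing mechanism that produces more customized and powerful feature representations, achieving superior performance across diverse visual recognition tasks.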

Comments Accepted by ICML 2026

Abstract

Adapting pre-trained vision models using parameter-efficient fine-tuning (PEFT) remains challenging, as it aims to achieve performance comparable to full fine-tuning using a minimal number of trainable parameters. When applied to complex dense prediction tasks, existing methods exhibit limitations, including input-agnostic modeling and redundant cross-layer representations. To this end, we propose ParaX, a new adapter-style method featuring a simple mixture-of-experts (MoE) architecture. Specifically, we introduce shared expert centers, where each expert is a trainable parameter matrix. During a feedforward pass, each ParaX module in the network dynamically generates weight matrices tailored for the current module via a simple dynamic parameter routing mechanism, which selectively aggregates parameter matrices in the corresponding expert center. Dynamic weight matrices in ParaX modules facilitate low-rank adaptation in an input-dependent manner, thus generating more customized and powerful feature representations. Moreover, since ParaX modules across multiple network layers share the same expert center, they improve feature diversity by promoting implicit cross-layer feature interaction. Extensive experimental results demonstrate the superiority of ParaX across diverse visual recognition tasks. Code is publicly released at: https://github.com/LMMMEng/ParaX.
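The routing idea can be illustrated with a small sketch: input-dependent weights select and blend parameter matrices from a shared expert center. The linear gate, the softmax, and all names below are assumptions made for illustration; the released ParaX code may differ in every detail.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def routing_weights(pooled_input: List[float], gate: Matrix) -> List[float]:
    """Input-dependent weights: one linear gate row per expert (assumed form)."""
    scores = [sum(g * x for g, x in zip(row, pooled_input)) for row in gate]
    return softmax(scores)


def aggregate_experts(weights: List[float], experts: List[Matrix]) -> Matrix:
    """Weighted sum of the expert parameter matrices in a shared expert center."""
    rows, cols = len(experts[0]), len(experts[0][0])
    return [[sum(w * e[i][j] for w, e in zip(weights, experts))
             for j in range(cols)] for i in range(rows)]
```

In an adapter, two such aggregated matrices (a down-projection and an up-projection) would form the low-rank update. Because the weights depend on the input, each sample gets its own update, and because layers share one expert center, they draw on the same parameters.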

2602.06500 2026-05-21 cs.LG

Can Microcanonical Langevin Dynamics Leverage Mini-Batch Gradient Noise?

Emanuel Sommer, Kangning Diao, Jakob Robnik, Uros Seljak, David Rügamer

Affiliations Department of Statistics, LMU Munich; Munich Center for Machine Learning; Department of Physics, University of California, Berkeley; Department of Astronomy, Tsinghua University; Physics Division, Lawrence Berkeley National Lab

AI summary This paper studies whether microcanonical Langevin dynamics can effectively leverage mini-batch gradient noise. It proposes a gradient noise preconditioning scheme and an energy-variance-based adaptive tuner, yielding a robust and scalable microcanonical sampler that achieves state-of-the-art performance on high-dimensional inference tasks.

Comments In Proceedings of the 43rd International Conference on Machine Learning

Abstract

Scaling inference methods such as Markov chain Monte Carlo to high-dimensional models remains a central challenge in Bayesian deep learning. A promising recent proposal, microcanonical Langevin Monte Carlo, has shown state-of-the-art performance across a wide range of problems. However, its reliance on full-dataset gradients makes it prohibitively expensive for large-scale problems. This paper addresses a fundamental question: Can microcanonical dynamics effectively leverage mini-batch gradient noise? We provide the first systematic study of this problem, establishing a novel continuous-time theoretical analysis of stochastic-gradient microcanonical dynamics. We reveal two critical failure modes: a theoretically derived bias due to anisotropic gradient noise and numerical instabilities in complex high-dimensional posteriors. To tackle these issues, we propose a principled gradient noise preconditioning scheme shown to significantly reduce this bias and develop a novel, energy-variance-based adaptive tuner that automates step size selection and dynamically informs numerical guardrails. The resulting algorithm is a robust and scalable microcanonical Monte Carlo sampler that achieves state-of-the-art performance on challenging high-dimensional inference tasks like Bayesian neural networks. Combined with recent ensemble techniques, our work unlocks a new class of stochastic microcanonical Langevin ensemble (SMILE) samplers for large-scale Bayesian inference.

2602.04907 2026-05-21 cs.LG cs.AI stat.ME

Causal Discovery from Heteroscedastic Stochastic Dynamical Systems under Imperfect Physical Models

Jianhong Chen, Naichen Shi, Xubo Yue

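Affiliations Department of Mechanical & Industrial Engineering; Northeastern University; Department of Industrial Engineering and Management Sciences; Department of Mechanical Engineering; Northwestern University

AI summary This paper proposes an integrative causal discovery framework that uses partial physical knowledge, encoded through stochastic differential equations, to improve causal graph recovery in dynamical systems, and analyzes its robustness under imperfect physical models.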

Comments 101 pages

Abstract

Causal discovery is a data-driven paradigm for analyzing complex systems, while physics-based models, such as ordinary differential equations (ODEs), provide mechanistic structure for real-world dynamical processes. Integrating these paradigms can improve identifiability, stability, and robustness. However, real dynamical systems often exhibit cyclic interactions and nonstationarity, whereas many causal discovery methods rely on acyclicity, stationarity, or equilibrium assumptions. We propose an integrative causal discovery framework for dynamical systems that leverages partial physical knowledge through stochastic differential equations (SDEs). The drift term encodes known ODE dynamics, while the diffusion term captures unknown causal couplings beyond the prescribed physics. We develop a scalable sparsity-inducing maximum quasi-likelihood estimator with a theoretically justified stabilization technique to improve the optimization landscape. Under mild conditions, we establish causal graph recovery guarantees for both stable and unstable SDEs. We also analyze robustness of our causal graph estimate to ODE misspecification and clarify how the introduced stabilization technique balances numerical stability and statistical recoverability. Experiments on linear SDEs and nonlinear benchmarks, including Lotka-Volterra and Lorenz dynamics with acyclic and cyclic structures, show improved graph recovery and robustness over data-driven baselines. We also demonstrate practical utility on real-world epidemic data by reconstructing stochastic SIR dynamics within our causal discovery framework.
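The split between a known drift and a state-dependent diffusion can be made concrete with a generic Euler-Maruyama simulator for dX = f(X) dt + G(X) dW. This is a standard discretization scheme, shown here only to illustrate the model class; it is not the paper's estimator, and the function and argument names are assumptions.

```python
import math
import random
from typing import Callable, List

Vector = List[float]
Matrix = List[List[float]]


def euler_maruyama(x0: Vector,
                   drift: Callable[[Vector], Vector],
                   diffusion: Callable[[Vector], Matrix],
                   dt: float, steps: int, seed: int = 0) -> List[Vector]:
    """Simulate one path of dX = drift(X) dt + diffusion(X) dW."""
    rng = random.Random(seed)
    x = list(x0)
    path = [list(x)]
    dim = len(x)
    for _ in range(steps):
        f = drift(x)          # known physics (the ODE part)
        g = diffusion(x)      # state-dependent noise scale (heteroscedastic part)
        dw = [rng.gauss(0.0, math.sqrt(dt)) for _ in range(dim)]
        x = [x[i] + f[i] * dt + sum(g[i][j] * dw[j] for j in range(dim))
             for i in range(dim)]
        path.append(list(x))
    return path
```

With the diffusion set to zero the scheme reduces to the explicit Euler method for the ODE, which is a convenient sanity check before adding state-dependent noise.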

2602.04876 2026-05-21 cs.CV

PerpetualWonder: Long-Horizon Action-Conditioned 4D Scene Generation

Jiahao Zhan, Zizhang Li, Hong-Xing Yu, Jiajun Wu

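Affiliations Stanford University

AI summary This paper proposes PerpetualWonder, a hybrid generative simulator that generates long-horizon, action-conditioned 4D scenes from a single image. By introducing a true closed-loop system, it addresses the generation failures of existing methods caused by decoupling the physical state from the visual representation, so that refinements correct both physical dynamics and appearance.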

Comments Project website: https://johnzhan2023.github.io/PerpetualWonder/

Abstract

We introduce PerpetualWonder, a hybrid generative simulator that enables long-horizon, action-conditioned 4D scene generation from a single image. Current works fail at this task because their physical state is decoupled from their visual representation, which prevents generative refinements to update the underlying physics for subsequent interactions. PerpetualWonder solves this by introducing the first true closed-loop system. It features a novel unified representation that creates a bidirectional link between the physical state and visual primitives, allowing generative refinements to correct both the dynamics and appearance. It also introduces a robust update mechanism that gathers supervision from multiple viewpoints to resolve optimization ambiguity. Experiments demonstrate that from a single image, PerpetualWonder can successfully simulate complex, multi-step interactions from long-horizon actions, maintaining physical plausibility and visual consistency.

2602.03209 2026-05-21 cs.RO

Depth Completion in Unseen Field Robotics Environments Using Extremely Sparse Depth Measurements

Marco Job, Thomas Stastny, Eleni Kelasidi, Roland Siegwart, Michael Pantic

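Affiliations Autonomous Systems Lab, ETH Zürich; Field Robotics Lab, NTNU

AI summary This study proposes a depth completion model that is trained on synthetic data and uses extremely sparse depth sensor measurements to predict dense metric depth in unseen field robotics environments, addressing the limited adoption of low-cost cameras in field robotics.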

Comments Accepted to ICRA 2026

Abstract

Autonomous field robots operating in unstructured environments require robust perception to ensure safe and reliable operations. Recent advances in monocular depth estimation have demonstrated the potential of low-cost cameras as depth sensors; however, their adoption in field robotics remains limited due to the absence of reliable scale cues, ambiguous or low-texture conditions, and the scarcity of large-scale datasets. To address these challenges, we propose a depth completion model that trains on synthetic data and uses extremely sparse measurements from depth sensors to predict dense metric depth in unseen field robotics environments. A synthetic dataset generation pipeline tailored to field robotics enables the creation of multiple realistic datasets for training purposes. This dataset generation approach utilizes textured 3D meshes from Structure from Motion and photorealistic rendering with novel viewpoint synthesis to simulate diverse field robotics scenarios. Our approach achieves an end-to-end latency of 53 ms per frame on an NVIDIA Jetson AGX Orin, enabling real-time deployment on embedded platforms. Extensive evaluation demonstrates competitive performance across diverse real-world field robotics scenarios.

2602.03004 2026-05-21 cs.LG cs.AI

Graph Autoencoder for Process Monitoring

Xiangrui Zhang

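Affiliations School of Information and Control Engineering, China University of Mining and Technology

AI summary This paper proposes a Causal Graph Spatial-Temporal Autoencoder (CGSTAE) that combines a correlation graph structure learning module based on a spatial self-attention mechanism with a spatial-temporal encoder-decoder module built on graph convolutional long short-term memory (GCLSTM), improving the reliability and interpretability of industrial process monitoring.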

Abstract

To improve the reliability and interpretability of industrial process monitoring, this article proposes a Causal Graph Spatial-Temporal Autoencoder (CGSTAE). The network architecture of CGSTAE combines two components: a correlation graph structure learning module based on spatial self-attention mechanism (SSAM) and a spatial-temporal encoder-decoder module utilizing graph convolutional long-short term memory (GCLSTM). The SSAM learns correlation graphs by capturing dynamic relationships between variables, while a novel three-step causal graph structure learning algorithm is introduced to derive a causal graph from these correlation graphs. The algorithm leverages a reverse perspective of causal invariance principle to uncover the invariant causal graph from varying correlations. The spatial-temporal encoder-decoder, built with GCLSTM units, reconstructs time-series process data within a sequence-to-sequence framework. The proposed CGSTAE enables effective process monitoring and fault detection through two statistics in the feature space and residual space. Finally, we validate the effectiveness of CGSTAE in process monitoring through the Tennessee Eastman process and a real-world air separation process.

2602.02660 2026-05-21 cs.AI

MARS: Modular Agent with Reflective Search for Automated AI Research

Jiefeng Chen, Bhavana Dalvi Mishra, Jaehyun Nam, Rui Meng, Tomas Pfister, Jinsung Yoon

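Affiliations Google Cloud AI Research

AI summary This paper proposes the MARS framework, which uses budget-aware planning, modular construction, and comparative reflective memory to address execution cost and performance attribution in complex machine learning engineering tasks, achieving state-of-the-art performance among open-source frameworks on MLE-Bench.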

Comments Paper published at International Conference on Machine Learning (ICML 2026)

Abstract

A critical bottleneck in automating AI research is the execution of complex machine learning engineering (MLE) tasks. MLE differs from general software engineering due to computationally expensive evaluation (e.g., model training) and opaque performance attribution. Current LLM-based agents struggle here, often generating monolithic scripts that ignore execution costs and causal factors. We introduce MARS (Modular Agent with Reflective Search), a framework optimized for autonomous AI research. MARS relies on three pillars: (1) Budget-Aware Planning via cost-constrained Monte Carlo Tree Search (MCTS) to explicitly balance performance with execution expense; (2) Modular Construction, employing a "Design-Decompose-Implement" pipeline to manage complex research repositories; and (3) Comparative Reflective Memory, which addresses credit assignment by analyzing solution differences to distill high-signal insights. MARS achieves state-of-the-art performance among open-source frameworks on MLE-Bench under comparable settings, maintaining competitiveness with the global leaderboard's top methods. Furthermore, the system exhibits qualitative "Aha!" moments, where 63% of all utilized lessons originate from cross-branch transfer, demonstrating that the agent effectively generalizes insights across search paths.
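One way to picture budget-aware planning is a tree-search selection score that adds a cost term to the usual UCT formula. The sketch below is an assumption-laden illustration, not the scoring rule used by MARS: the linear cost penalty, the hard budget cutoff, and all parameter names are hypothetical.

```python
import math


def budget_aware_uct(mean_reward: float, visits: int, parent_visits: int,
                     est_cost: float, remaining_budget: float,
                     c: float = 1.4, cost_weight: float = 0.1) -> float:
    """Hypothetical selection score: UCT minus a penalty on estimated execution cost."""
    if est_cost > remaining_budget:
        return float("-inf")      # cannot afford to expand this node
    if visits == 0:
        return float("inf")       # always try affordable unvisited nodes first
    # Standard UCT exploration bonus; parent_visits is assumed to be at least 1.
    exploration = c * math.sqrt(math.log(parent_visits) / visits)
    return mean_reward + exploration - cost_weight * est_cost
```

Under this rule, two candidate solutions with identical statistics are ranked by estimated cost, so a cheaper experiment is expanded first, and anything exceeding the remaining budget is never selected.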

2602.02304 2026-05-21 cs.AI cs.LG

Comparing Explanations is Not Enough, Explain the Change: New Standards are Needed to Explain Behavioral Shifts in Large Language Models

Martino Ciaperoni, Marzio Di Vece, Roberto Pellungrini, Luca Pappalardo, Fosca Giannotti, Francesco Giannini

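Affiliations Scuola Normale Superiore; ISTI-CNR; University of Pisa

AI summary This paper proposes a new XAI paradigm for explaining how and why the behavior of large language models shifts after an intervention, addressing the inability of existing explanation methods to account for such shifts.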

Abstract

Large-scale foundation models exhibit behavioral shifts when subjected to interventions such as scaling, fine-tuning, reinforcement learning with human feedback, or in-context learning. Current explainability methods are structurally ill-suited to explain these shifts, because they either treat models as static objects, as traditional eXplainable AI (XAI) approaches do, or merely compare independent explanations across different checkpoints of a model. As a result, these approaches fail to explain the functional transition between two model instances in which a certain behavior has shifted following an intervention. This gap creates significant governance risks across jurisdictions including the EU AI Act, US state legislation, and Chinese AI regulations, which require documenting causal chains for substantial system modifications. This position paper argues that explaining behavioral shifts in large language models requires a principled approach that treats the shift itself as the primary object of explanation: namely, one that explains how and why an intervention transforms a reference model into an updated model with different behavior. To support this claim, we introduce Comparative XAI (XAI$_\Delta$), a novel XAI paradigm aimed at explaining the difference between two model checkpoints where a behavior has shifted, together with a set of desiderata specifying what XAI$_\Delta$ explainers and explanations must satisfy, including comparability, validity, actionability, and monitoring, with the goal of grounding model auditing in explicit, measurable requirements. Finally, we provide preliminary evidence suggesting the need for XAI$_\Delta$ in practice through illustrative experiments, compiling the resulting findings into a transition report directly usable for governance and incident documentation.

2602.01273 2026-05-21 cs.CV

Q-DiT4SR: Exploration of Detail-Preserving Diffusion Transformer Quantization for Real-World Image Super-Resolution

Xun Zhang, Kaicheng Yang, Hongliang Lu, Haotong Qin, Yong Guo, Yulun Zhang

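Affiliations Shanghai Jiao Tong University; Huawei

AI summary This paper proposes Q-DiT4SR, a post-training quantization framework tailored to Diffusion Transformer-based real-world image super-resolution. By introducing a hierarchical SVD and variance-aware spatio-temporal mixed precision, it compresses and accelerates the model efficiently while preserving detail.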

Comments Accepted to ICML 2026. Our code and models will be available at https://github.com/xunzhang1128/Q-DiT4SR

Abstract

Recently, Diffusion Transformers (DiTs) have emerged in Real-World Image Super-Resolution (Real-ISR) to generate high-quality textures, yet their heavy inference burden hinders real-world deployment. While Post-Training Quantization (PTQ) is a promising solution for acceleration, existing methods in super-resolution mostly focus on U-Net architectures, whereas generic DiT quantization is typically designed for text-to-image tasks. Directly applying these methods to DiT-based super-resolution models leads to severe degradation of local textures. Therefore, we propose Q-DiT4SR, the first PTQ framework specifically tailored for DiT-based Real-ISR. We propose H-SVD, a hierarchical SVD that integrates a global low-rank branch with a local block-wise rank-1 branch under a matched parameter budget. We further propose Variance-aware Spatio-Temporal Mixed Precision: VaSMP allocates cross-layer weight bit-widths in a data-free manner based on rate-distortion theory, while VaTMP schedules intra-layer activation precision across diffusion timesteps via dynamic programming (DP) with minimal calibration. Experiments on multiple real-world datasets demonstrate that our Q-DiT4SR achieves SOTA performance under both W4A6 and W4A4 settings. Notably, the W4A4 quantization configuration reduces model size by 5.8$\times$ and computational operations by 6.14$\times$. Our code and models will be available at https://github.com/xunzhang1128/Q-DiT4SR.
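For readers unfamiliar with the "W4" in W4A4, the sketch below shows plain symmetric uniform quantization of a weight vector to 4-bit integers with a single scale. It is the generic baseline operation that post-training quantization builds on, not the H-SVD, VaSMP, or VaTMP components of Q-DiT4SR; the function names are illustrative.

```python
from typing import List, Tuple


def quantize_symmetric(weights: List[float], bits: int = 4) -> Tuple[List[int], float]:
    """Map floats to signed integers in [-2^(bits-1), 2^(bits-1) - 1] with one scale."""
    qmax = 2 ** (bits - 1) - 1                 # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return q, scale


def dequantize(q: List[int], scale: float) -> List[float]:
    """Recover approximate float weights from the integer codes."""
    return [v * scale for v in q]
```

Each weight is then recovered to within half a quantization step (scale / 2). That rounding error on fine-grained weights is one plausible reason naive low-bit quantization can damage local textures.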

2601.23086 2026-05-21 cs.AI

Chain-of-thought obfuscation learned from output supervision can generalise to unseen tasks

Nathaniel Mitrani Hadida, Sassan Bhanji, Cameron Tice, Puria Radmard

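Affiliations University of Cambridge; Geodesic Research

AI summary This paper studies how obfuscation in chain-of-thought (CoT) reasoning generalizes, finding that models that learn to obfuscate their reasoning traces carry both the underlying behavior and its obfuscation over to unseen tasks, which reduces model monitorability.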

Abstract

Chain-of-thought (CoT) reasoning provides a significant performance uplift to LLMs by enabling planning, exploration, and deliberation of their actions. CoT is also a powerful tool for monitoring the behaviours of these agents: when faithful, they offer interpretations of the model's decision making process, and an early warning sign for dangerous behaviours. However, optimisation pressures placed on the CoT may cause the model to obfuscate reasoning traces, losing this beneficial property. We show that obfuscation can generalise across tasks; models that learn to obfuscate reasoning involving reward hacking (e.g. accessing and utilising leaked information) generalise both the reward hacking behaviour and its obfuscation in CoT to unseen reward hacking settings. Most worryingly, we show that obfuscation of CoT reasoning, and its generalisation across tasks, also follows when we penalise only the model's final actions after closing its CoT. Our findings suggest that current practices of penalising harmful generations may inadvertently lead to a reduction in the broader monitorability of LLMs in unpredictable ways.

2601.22932 2026-05-21 cs.LG

DC-LA: Difference-of-Convex Langevin Algorithm

Hoang Phuc Hau Luu, Zhongjian Wang

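Affiliations Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore

AI summary This paper studies a sampling problem whose target distribution is $\pi \propto \exp(-f-r)$, where the data fidelity term $f$ is Lipschitz smooth and the regularizer $r=r_1-r_2$ is a non-smooth difference-of-convex (DC) function. Exploiting the DC structure, it smooths $r$ by applying Moreau envelopes to $r_1$ and $r_2$ separately, then redistributes the concave part of the regularizer to the data fidelity term and studies the corresponding proximal Langevin algorithm (DC-LA). Under the assumption that $V$ is distant dissipative, it establishes convergence of DC-LA to $\pi$ in the $q$-Wasserstein distance for all $q \in \mathbb{N}^*$, up to discretization and smoothing errors, improving on previous results for non-log-concave sampling.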

Abstract

We study a sampling problem whose target distribution is $\pi \propto \exp(-f-r)$ where the data fidelity term $f$ is Lipschitz smooth while the regularizer term $r=r_1-r_2$ is a non-smooth difference-of-convex (DC) function, i.e., $r_1,r_2$ are convex. By leveraging the DC structure of $r$, we can smooth out $r$ by applying Moreau envelopes to $r_1$ and $r_2$ separately. In line with DC programming, we then redistribute the concave part of the regularizer to the data fidelity and study its corresponding proximal Langevin algorithm (termed DC-LA). We establish convergence of DC-LA to the target distribution $\pi$, up to discretization and smoothing errors, in the $q$-Wasserstein distance for all $q \in \mathbb{N}^*$, under the assumption that $V$ is distant dissipative. Our results improve previous work on non-log-concave sampling in terms of a more general framework and assumptions. Numerical experiments show that DC-LA produces accurate distributions in synthetic settings and provides qualitatively reasonable uncertainty quantification in a real-world Computed Tomography application.
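The smoothing step rests on a standard fact: the Moreau envelope of a convex function with parameter λ is differentiable, with gradient (x − prox(x)) / λ. The sketch below works this out for the illustrative choice r₁(x) = |x| in one dimension, where the proximal map is soft-thresholding and the envelope is the Huber function. The choice of r₁ and the function names are assumptions for illustration; the abstract does not specify the regularizers.

```python
def prox_abs(x: float, lam: float) -> float:
    """Proximal map of lam * |.|, i.e. soft-thresholding."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_env_abs(x: float, lam: float) -> float:
    """Moreau envelope of |.| with parameter lam (the Huber function)."""
    if abs(x) <= lam:
        return x * x / (2.0 * lam)
    return abs(x) - lam / 2.0


def grad_moreau_env_abs(x: float, lam: float) -> float:
    """Gradient of the envelope: (x - prox(x)) / lam."""
    return (x - prox_abs(x, lam)) / lam
```

The gradient is bounded by 1 and is 1/λ-Lipschitz, which is what makes the smoothed regularizer usable inside a gradient-based Langevin update.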

2601.21662 2026-05-21 cs.LG

Epistemic Uncertainty Quantification for Pre-trained VLMs via Riemannian Flow Matching

Li Ju, Mayank Nautiyal, Andreas Hellander, Ekta Vats, Prashant Singh

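Affiliations Department of Information Technology, Uppsala University, Uppsala, Sweden; Science for Life Laboratory, Uppsala University, Uppsala, Sweden

AI summary This paper proposes REPVLM, which uses Riemannian Flow Matching to compute probability density on the hyperspherical manifold of vision-language model embeddings in order to quantify epistemic uncertainty, with strong results in classification and out-of-distribution detection.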

Journal ref Forty-Third International Conference on Machine Learning, 2026

Abstract

Vision-Language Models (VLMs) are typically deterministic in nature and lack intrinsic mechanisms to quantify epistemic uncertainty, which reflects the model's lack of knowledge or ignorance of its own representations. We theoretically motivate the negative log-density of an embedding as a proxy for epistemic uncertainty, where low-density regions signify model ignorance. The proposed method, REPVLM, computes the probability density on the hyperspherical manifold of the VLM embeddings using Riemannian Flow Matching. We empirically demonstrate that REPVLM achieves near-perfect correlation between uncertainty and prediction error, significantly outperforming existing baselines. Beyond classification, we demonstrate that the model also provides a scalable metric for out-of-distribution detection and automated data curation.

2601.18696 2026-05-21 cs.LG

Explainability Methods for Hardware Trojan Detection: A Systematic Comparison

用于硬件木马检测的可解释性方法:系统性比较

Paul Whitten, Francis Wolff, Chris Papachristou

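Affiliations * Electrical, Computer, and Systems Engineering

AI summary This paper systematically compares explainability methods for hardware trojan detection, examining how domain-aware property analysis, case-based reasoning, and feature attribution techniques differ in hardware security applications.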

Details
AI-generated Chinese abstract

硬件木马是恶意电路,会破坏集成电路(IC)的功能和安全性。这些电路直接制造在硅片上,无法像软件一样通过安全补丁修复。解决方案需要通过更换IC进行昂贵的产品召回,因此在设计过程中早期检测至关重要。最佳的硬件检测仅能提供基于统计的解决方案,存在大量假阳性和假阴性。这些检测方法需要更深入的可解释性分析来过滤假指标。现有为通用领域(如图像分类)开发的可解释性方法可能无法提供硬件工程师所需的操作洞察。问题在于:领域感知属性分析、基于案例的推理和特征归因技术在硬件安全应用中如何比较?本文比较了三种可解释性方法用于门级硬件木马检测,在Trust-Hub基准数据集上:(1)基于31个电路特定特征的领域感知属性分析,这些特征来自门扇入模式、触发器距离和主输入/输出(I/O)连接;(2)使用k-最近邻进行基于案例的推理以获得基于先例的解释;(3)基于模型无关的特征归因方法(局部可解释模型无关解释(LIME)、SHapley Additive exPlanations(SHAP)、梯度)提供通用的重要性评分,而无需电路级上下文。

English abstract

Hardware trojans are malicious circuits which compromise the functionality and security of an integrated circuit (IC). These circuits are manufactured directly into the silicon and cannot be fixed by security patches like software. The solution would require a costly product recall by replacing the IC and hence, early detection in the design process is essential. Hardware detection at best provides statistically based solutions with many false positives and false negatives. These detection methods require more thorough explainable analysis to filter out false indicators. Existing explainability methods developed for general domains like image classification may not provide the actionable insights that hardware engineers need. A question remains: How do domain-aware property analysis, model-agnostic case-based reasoning, and model-agnostic feature attribution techniques compare for hardware security applications? This work compares three categories of explainability for gate-level hardware trojan detection on the Trust-Hub benchmark dataset: (1) domain-aware property-based analysis of 31 circuit-specific features derived from gate fanin patterns, flip-flop distances, and primary Input/Output (I/O) connectivity; (2) model-agnostic case-based reasoning using k-nearest neighbors for precedent-based explanations; and (3) model-agnostic feature attribution methods (Local Interpretable Model-agnostic Explanations (LIME), SHapley Additive exPlanations (SHAP), gradient) that provide generic importance scores without circuit-level context.
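Category (2), case-based reasoning with k-nearest neighbors, can be sketched as follows. This is a minimal illustration under stated assumptions: the two-dimensional feature vectors, the gate names, and the `knn_explain` helper are hypothetical and do not come from the Trust-Hub benchmark or the paper's 31-feature set.

```python
import math
from collections import Counter

def knn_explain(train, query, k=3):
    """Return (predicted label, nearest precedents) for a query feature vector."""
    ranked = sorted(train, key=lambda item: math.dist(item["features"], query))
    neighbors = ranked[:k]
    votes = Counter(item["label"] for item in neighbors)
    prediction = votes.most_common(1)[0][0]
    # Each precedent is (gate name, label, distance to the query).
    precedents = [(item["name"], item["label"], math.dist(item["features"], query))
                  for item in neighbors]
    return prediction, precedents

# Hypothetical labeled gates; features are illustrative, e.g. (fan-in, flip-flop distance).
train = [
    {"name": "g1", "features": [0.0, 0.0], "label": "benign"},
    {"name": "g2", "features": [0.0, 1.0], "label": "benign"},
    {"name": "g3", "features": [5.0, 5.0], "label": "trojan"},
    {"name": "g4", "features": [5.0, 6.0], "label": "trojan"},
    {"name": "g5", "features": [6.0, 5.0], "label": "trojan"},
]
prediction, precedents = knn_explain(train, query=[5.0, 5.2], k=3)
```

The explanation here is the list of precedents itself: an engineer sees which previously labeled gates the flagged gate most resembles, rather than a generic importance score.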

2601.05877 2026-05-21 cs.CL

iReasoner: Trajectory-Aware Intrinsic Reasoning Supervision for Self-Evolving Large Multimodal Models

iReasoner: 一种面向轨迹的内在推理监督方法,用于自演化的大多模态模型

Meghana Sunil, Manikandarajan Venmathimaran, Muthu Subash Kavitha

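Affiliations * Vellore Institute of Technology; School of Information and Data Sciences; Nagasaki University; Loughborough University

AI summary This paper proposes iReasoner, a self-evolving framework that improves a large multimodal model's implicit reasoning by explicitly eliciting chain-of-thought and rewarding its internal agreement, yielding gains on multimodal reasoning benchmarks in an unsupervised setting.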

Comments ACL 2026 (Findings)

Details
AI-generated Chinese abstract

最近的研究表明,大多模态模型(LMMs)可以通过自博弈和内在反馈从无标签数据中自我改进。然而,现有的自演化框架主要奖励最终结果,尽管中间推理对于视觉基础决策至关重要,但其约束较弱。我们提出iReasoner,一种自演化框架,通过显式引导推理链(CoT)并奖励其内部一致性来提升LMM的隐式推理能力。在无标签图像上进行Proposer--Solver循环时,iReasoner通过在中间推理步骤上定义轨迹感知信号,增强结果层面的内在奖励,提供无需真实标签或外部评判的学习信号,以区分导向相同答案的不同推理路径。从Qwen2.5-VL-7B开始,iReasoner在完全无监督的后训练阶段,在多样化的多模态推理基准上实现了最高+2.1分的提升。我们希望这项工作能成为在纯无监督设置中实现推理感知自改进的LMMs的起点。我们的代码可在https://meghanaasunil.github.io/iReasoner上获取。

English abstract

Recent work shows that large multimodal models (LMMs) can self-improve from unlabeled data via self-play and intrinsic feedback. Yet existing self-evolving frameworks mainly reward final outcomes, leaving intermediate reasoning weakly constrained despite its importance for visually grounded decision making. We propose iReasoner, a self-evolving framework that improves an LMM's implicit reasoning by explicitly eliciting chain-of-thought (CoT) and rewarding its internal agreement. In a Proposer--Solver loop over unlabeled images, iReasoner augments outcome-level intrinsic rewards with a trajectory-aware signal defined over intermediate reasoning steps, providing learning signals that distinguish reasoning paths leading to the same answer without ground-truth labels or external judges. Starting from Qwen2.5-VL-7B, iReasoner yields up to $+2.1$ points across diverse multimodal reasoning benchmarks under fully unsupervised post-training. We hope this work serves as a starting point for reasoning-aware self-improvement in LMMs in purely unsupervised settings. Our code is available at https://meghanaasunil.github.io/iReasoner.

2601.04068 2026-05-21 cs.CV cs.AI

Mind the Generative Details: Direct Localized Detail Preference Optimization for Video Diffusion Models

注意生成细节:面向视频扩散模型的直接局部化细节偏好优化

Zitong Huang, Kaidong Zhang, Yukang Ding, Chao Gao, Rui Ding, Ying Chen, Wangmeng Zuo

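Affiliations * Harbin Institute of Technology; Alibaba Group - Taobao & Tmall Group

AI summary This paper proposes LocalDPO, a new post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level to improve video generation quality and human preference scores.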

Comments Accepted by CVPR 2026

Details
AI-generated Chinese abstract

将文本到视频的扩散模型与人类偏好对齐对于生成高质量视频至关重要。现有的直接偏好优化(DPO)方法依赖于多样本排序和任务特定的批评模型,这效率低下且常导致模糊的全局监督。为了解决这些限制,我们提出了LocalDPO,一种新的后训练框架,该框架从真实视频中构建局部偏好对,并在时空区域层面进行优化。我们设计了一个自动化流程,高效地收集偏好对数据,通过单次提示推理生成偏好对,消除了对外部批评模型或人工标注的需求。具体来说,我们将高质量的真实视频作为正样本,并通过局部随机时空掩码来生成对应的负样本,仅使用冻结的基模型恢复被掩码的区域。在训练过程中,我们引入了区域感知的DPO损失,将偏好学习限制在被损坏的区域以实现快速收敛。在Wan2.1和CogVideoX上的实验表明,LocalDPO在视频保真度、时间连贯性和人类偏好评分方面优于其他后训练方法,建立了更高效和精细的视频生成器对齐范式。代码可在https://github.com/1170300714/Local-DPO上获得。

English abstract

Aligning text-to-video diffusion models with human preferences is crucial for generating high-quality videos. Existing Direct Preference Optimization (DPO) methods rely on multi-sample ranking and task-specific critic models, which is inefficient and often yields ambiguous global supervision. To address these limitations, we propose LocalDPO, a novel post-training framework that constructs localized preference pairs from real videos and optimizes alignment at the spatio-temporal region level. We design an automated pipeline that efficiently collects preference-pair data, generating preference pairs with a single inference per prompt and eliminating the need for external critic models or manual annotation. Specifically, we treat high-quality real videos as positive samples and generate corresponding negatives by locally corrupting them with random spatio-temporal masks and restoring only the masked regions using the frozen base model. During training, we introduce a region-aware DPO loss that restricts preference learning to corrupted areas for rapid convergence. Experiments on Wan2.1 and CogVideoX demonstrate that LocalDPO consistently improves video fidelity, temporal coherence and human preference scores over other post-training approaches, establishing a more efficient and fine-grained paradigm for video generator alignment. The code is available at https://github.com/1170300714/Local-DPO.
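The idea of restricting preference learning to corrupted regions can be sketched as below. This is a hedged illustration, not the LocalDPO loss itself: it assumes a Diffusion-DPO-style objective in which denoising errors of the trained model and a frozen reference are compared on the positive and negative samples, with the error averaged only over masked entries. All function names and the flattened-list representation are assumptions.

```python
import math

def masked_mse(pred, target, mask):
    """Mean squared error restricted to entries where mask == 1."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, mask))
    count = sum(mask)
    return total / count if count else 0.0

def softplus(z):
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))

def region_dpo_loss(err_pos, err_pos_ref, err_neg, err_neg_ref, beta=1.0):
    """DPO-style loss on masked denoising errors (illustrative form).

    margin < 0 means the trained model improved more on the positive
    sample than on the negative one, relative to the reference model.
    """
    margin = (err_pos - err_pos_ref) - (err_neg - err_neg_ref)
    return softplus(beta * margin)  # equals -log(sigmoid(-beta * margin))
```

Because `masked_mse` ignores unmasked entries, regions where the positive and negative videos are identical contribute nothing to the margin, which is one plausible reading of "region-aware".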

2601.03135 2026-05-21 cs.CL

Improving Indigenous Language Machine Translation with Synthetic Data and Language-Specific Preprocessing

通过合成数据和语言特定预处理改进原住民语言机器翻译

Aashish Dhawan, Christopher Driggers-Ellis, Christan Grant, Daisy Zhe Wang

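Affiliations * University of Florida

AI summary This study improves neural machine translation for low-resource indigenous languages through synthetic data generation and language-specific preprocessing. Experiments show that synthetic data augmentation improves translation quality, while generic preprocessing has limitations for highly agglutinative languages.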

Details
AI-generated Chinese abstract

低资源原住民语言往往缺乏用于有效神经机器翻译(NMT)所需的平行语料库。合成数据生成为数据稀缺环境提供了一种实用策略。在本工作中,我们通过使用高容量多语言翻译模型生成合成句子对,扩充美洲原住民语言的精选平行语料库。我们对多语言mBART模型进行微调,使用curated-only和合成增强的数据,并通过chrF++评估翻译质量,该指标是最近美洲NLP共享任务中用于屈折语言的主要指标。我们进一步应用语言特定的预处理,包括正字法标准化和噪声感知过滤,以减少语料库中的伪影。在瓜拉尼-西班牙语和克丘亚-西班牙语翻译实验中,合成数据增强显示出一致的chrF++提升,而对艾马拉语的诊断实验则揭示了通用预处理在高度屈折语言中的局限性。

English abstract

Low-resource indigenous languages often lack the parallel corpora required for effective neural machine translation (NMT). Synthetic data generation offers a practical strategy for mitigating this limitation in data-scarce settings. In this work, we augment curated parallel datasets for indigenous languages of the Americas with synthetic sentence pairs generated using a high-capacity multilingual translation model. We fine-tune a multilingual mBART model on curated-only and synthetically augmented data and evaluate translation quality using chrF++, the primary metric used in recent AmericasNLP shared tasks for agglutinative languages. We further apply language-specific preprocessing, including orthographic normalization and noise-aware filtering, to reduce corpus artifacts. Experiments on Guarani-Spanish and Quechua-Spanish translation show consistent chrF++ improvements from synthetic data augmentation, while diagnostic experiments on Aymara highlight the limitations of generic preprocessing for highly agglutinative languages.

2601.00473 2026-05-21 cs.LG cs.AI

Deep Neural Networks as Discrete Dynamical Systems: Implications for Physics-Informed Learning

深度神经网络作为离散动力系统:对物理信息学习的启示

Abhisek Ganguly, Santosh Ansumali, Sauro Succi

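Affiliations * Engineering Mechanics Unit, Jawaharlal Nehru Centre for Advanced Scientific Research; Italian Institute of Technology; University of Roma Tre; Physics Department, Harvard University; Cornell University

AI summary This paper examines the analogy between deep neural networks and discrete dynamical systems. By comparing numerical/exact solutions of the Burgers' and Eikonal equations with solutions obtained via PINNs, it shows that PINN learning offers a different computational pathway for approximating the same system dynamics, and notes that the dense parameter representations of PINNs may be advantageous in high-dimensional settings.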

Details
AI-generated Chinese abstract

我们重新审视了前馈深度神经网络(DNNs)与源自神经积分方程及其相应偏微分方程(PDE)形式的离散动力系统之间的类比。本文呈现了Burgers方程和Eikonal方程的数值/精确解与通过PINNs获得的解的比较分析。我们展示了PINN学习在近似本质上相同的系统动力学时提供了一种不同于标准数值离散化的计算路径。在此框架下,DNNs可以被解释为离散动力系统,其层间演进方法趋向于吸引子,多个参数配置可能产生可比的解,反映了逆映射的退化性。与有限差分(FD)过程相关的结构化算子不同,PINNs学习密集的参数表示,这些表示与经典离散化 stencil 无直接关联。这种分布式表示通常涉及更多的参数,导致可解释性降低和计算成本增加。然而,这种额外的灵活性可能在高维情况下提供优势,其中经典网格方法变得不切实际。

English abstract

We revisit the analogy between feed-forward deep neural networks (DNNs) and discrete dynamical systems derived from neural integral equations and their corresponding partial differential equation (PDE) forms. We present a comparative analysis of the numerical/exact solutions of the Burgers' and Eikonal equations and the corresponding solutions obtained via physics-informed neural networks (PINNs). We show that PINN learning provides a different computational pathway compared to standard numerical discretization in approximating essentially the same underlying dynamics of the system. Within this framework, DNNs can be interpreted as discrete dynamical systems whose layer-wise evolution approaches attractors, and multiple parameter configurations may yield comparable solutions, reflecting the degeneracy of the inverse mapping. In contrast to the structured operators associated with finite-difference (FD) procedures, PINNs learn dense parameter representations that are not directly associated with classical discretization stencils. This distributed representation generally involves a larger number of parameters, leading to reduced interpretability and increased computational cost. However, the additional flexibility of such representations may offer advantages in high-dimensional settings where classical grid-based methods become impractical.

2512.14896 2026-05-21 cs.CL cs.AI

DrugRAG: Enhancing Pharmacy LLM Performance Through A Novel Retrieval-Augmented Generation Pipeline

DrugRAG: 通过一种新颖的检索增强生成流水线提升药学LLM性能

Houman Kazemzadeh, Kiarash Mokhtari Dizaji, Seyed Reza Tavakoli, Farbod Davoodi, MohammadReza KarimiNejad, Parham Abed Azad, Fatemeh Latifi, Ali Sabzi, Armin Khosravi, Siavash Ahmadi, Babak Khalaj, Mohammad Hossein Rohban, Glolamali Aminian, Zohreh Amoozgar, Tahereh Javaheri

Affiliations * Department of Medicinal Chemistry, Faculty of Pharmacy, Tehran University of Medical Sciences; Department of Computer Sciences, Faculty of Mathematics and Computer Sciences, Amir Kabir University of Technology; Department of Mathematical Sciences, Sharif University of Technology; Department of Computer Sciences, Missouri University of Science and Technology; Department of Computer Engineering, Sharif University of Technology; Department of Faculty of Interdisciplinary Science and Technology, Tarbiat Modares University; Electronics Research Institute, Sharif University of Technology; Department of Electrical Engineering, Sharif University of Technology; The Alan Turing Institute, London, United Kingdom; Department of Radiation Oncology, Massachusetts General Hospital & Harvard Medical School; Health Informatics Lab, Metropolitan College, Boston University

AI summary This study evaluates large language models on pharmacy licensure-style question-answering tasks and develops an external knowledge integration method to improve accuracy: the DrugRAG pipeline, which integrates structured drug knowledge to raise LLM performance on pharmacy-focused question answering.

Comments 14 pages, 2 figures, 2 tables. The revised version includes McNemar's paired statistical analysis, Wilson confidence intervals, expanded methodological clarifications, a revised discussion of evidence retrieval, improved reproducibility details, and updated limitations

Details
AI-generated Chinese abstract

在本研究中,我们评估了大型语言模型(LLM)在药学执业资格问答任务中的性能,并开发了一种外部知识整合方法以提高准确性。我们使用一个包含141个问题的药学数据集,对十个参数规模不同的LLM(8十亿到70十亿以上)进行了基准测试,测量了基线准确性。基线性能范围从46%到92%,其中GPT-5(92%)和o3(89%)取得了最高分数,而较小的开源模型表现显著较低。然后,我们开发了DrugRAG,一种三步检索增强生成(RAG)流水线,该流水线检索结构化、基于证据的药物信息,并将上下文药理学证据添加到模型提示中,该流水线在模型架构或参数无需更改的情况下外部运行。DrugRAG在所有五个评估模型上均提高了准确性,提升幅度范围从7到21个百分点(例如,Gemma 3 27B:61.0%到71%,Llama 3.1 8B:46%到67%)。McNemar分析显示,这些改进在较小和中等规模的开源模型中具有统计学显著性。这些发现表明,通过DrugRAG整合结构化外部药物知识可以提高LLM在药学相关问答任务中的性能,而无需修改底层模型,为提升基于证据的药学相关AI应用提供了实用的流水线。

English abstract

In our study, we evaluated large language model (LLM) performance on pharmacy licensure-style question-answering tasks and developed an external knowledge integration method to improve accuracy. We benchmarked ten LLMs with varying parameter sizes (8 billion to 70+ billion) using a 141-question pharmacy dataset, measuring baseline accuracy without modification. Baseline performance ranged from 46% to 92%, with GPT-5 (92%) and o3 (89%) achieving the highest scores, while smaller open-source models showed substantially lower performance. We then developed DrugRAG, a three-step retrieval-augmented generation (RAG) pipeline that retrieves structured, evidence-based drug information and augments model prompts with contextual pharmacological evidence, operating externally and requiring no changes to model architecture or parameters. DrugRAG increased accuracy across all five evaluated models, with gains ranging from 7 to 21 percentage points (e.g., Gemma 3 27B: 61.0% to 71%, Llama 3.1 8B: 46% to 67%). McNemar analyses demonstrated statistically significant paired improvements primarily in smaller and mid-sized open-source models. These findings demonstrate that integrating structured external drug knowledge via DrugRAG can improve LLM performance on pharmacy-focused question-answering tasks without modifying the underlying models, providing a practical pipeline for enhancing evidence-based pharmacy-focused AI applications.
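The two statistical tools mentioned in the comments field, Wilson confidence intervals and McNemar's paired test, can be sketched in a few lines. This is a minimal illustration using the exact (binomial) form of McNemar's test; whether the authors used the exact or the chi-square form is not stated here, and the counts in the usage lines are hypothetical, not results from the paper.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value.

    b: questions wrong at baseline but right with augmentation.
    c: questions right at baseline but wrong with augmentation.
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)

# Hypothetical counts on a 141-question set (not taken from the paper).
low, high = wilson_interval(successes=65, n=141)
p_value = mcnemar_exact(b=33, c=4)
```

McNemar's test looks only at the discordant pairs `b` and `c`, which is why it suits a before/after comparison on the same questions better than comparing two independent accuracies.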

2512.13788 2026-05-21 cs.LG cs.RO

Constrained Policy Optimization via Sampling-Based Weight-Space Projection

通过基于采样的权重空间投影进行约束策略优化

Shengfan Cao, Francesco Borrelli, Eunhyek Joa

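Affiliations * Department of Mechanical Engineering, Seoul National University, Seoul, Korea

AI summary This study proposes SCPO, a sampling-based weight-space projection method for improving policies without leaving the safe operating regime. By enforcing safety constraints directly in parameter space, it maintains safety and feasibility throughout training and achieves closed-loop stability in constrained control tasks.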

Comments Accepted for publication at IFAC World Congress 2026; fixed minor notation inconsistencies

Details
AI-generated Chinese abstract

安全关键学习需要在不离开安全操作范围的情况下提高性能的策略。我们研究了约束策略学习,其中模型参数必须满足基于滚动的安全部署约束,这些约束可以评估但不能解析地微分。我们提出了SCPO,一种基于采样的权重空间投影方法,该方法在不需梯度访问约束函数的情况下直接在参数空间中强制安全。SCPO通过结合基于滚动的安全评估和参数扰动与安全度量变化之间的平滑性界,构建局部安全区域,并通过凸QCQP将每个梯度更新投影。我们建立了安全-by-induction保证:从任何安全初始化开始,给定可行的投影,所有中间策略保持安全。在具有稳定备份策略的约束控制设置中,SCPO进一步确保闭环稳定性,同时在保守备份之外实现安全适应。在具有有害监督的约束回归和双积分模仿与恶意专家的实验中,SCPO拒绝了不安全的更新,保持了训练过程中的可行性,并实现了有意义的目标改进。

English abstract

Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy rollout-based safety constraints that can be evaluated but not differentiated analytically. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. SCPO constructs a local safe region by combining rollout-based safety evaluations with smoothness bounds relating parameter perturbations to changes in safety metrics, and projects each gradient update via a convex QCQP. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, SCPO further ensures closed-loop stability while enabling safe adaptation beyond the conservative backup. Experiments on constrained regression with harmful supervision and double-integrator imitation with a malicious expert show that SCPO rejects unsafe updates, maintains feasibility throughout training, and achieves meaningful objective improvement.
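The projection idea can be illustrated with the simplest possible case. This sketch assumes a single safety metric $g$ with $g(\theta) \le -m$ at the current parameters (margin $m \ge 0$) and a smoothness bound $|g(\theta + \Delta) - g(\theta)| \le L \|\Delta\|$, so any update with $\|\Delta\| \le m / L$ keeps $g \le 0$. Projecting onto that ball is a one-constraint QCQP with a closed-form solution. The names, the single-ball safe region, and the closed form are assumptions; the paper's QCQP, built from multiple rollout samples, is presumably more general.

```python
import math

def safe_radius(margin, lipschitz):
    """Radius of a ball of updates guaranteed to keep g(theta + delta) <= 0."""
    return max(margin, 0.0) / lipschitz

def project_update(delta, radius):
    """Solve min ||d - delta||^2 subject to ||d||^2 <= radius^2 (closed form)."""
    norm = math.sqrt(sum(d * d for d in delta))
    if norm <= radius:
        return list(delta)
    scale = radius / norm
    return [scale * d for d in delta]

# Hypothetical gradient step that is too large for the current safety margin.
raw_step = [3.0, 4.0]
safe_step = project_update(raw_step, safe_radius(margin=1.0, lipschitz=1.0))
```

Applied at every iteration from a safe starting point, a rule of this kind gives the inductive structure the abstract describes: each accepted update stays inside a region certified from the previous safe iterate.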

2512.13402 2026-05-21 cs.CV cs.AI

End2Reg: Learning Task-Specific Segmentation for Markerless Registration in Spine Surgery

End2Reg: 为无标记定位学习任务特定分割在脊柱手术中

Lorenzo Pettinari, Sidaty El Hadramy, Michael Wehrli, Philippe C. Cattin, Daniel Studer, Carol C. Hasler, Maria Licci

Affiliations * Department of Biomedical Engineering, University of Basel, Allschwil, Switzerland; Department of Orthopedics, University Children’s Hospital, Basel, Switzerland

AI summary This paper proposes End2Reg, an end-to-end deep learning framework that jointly optimizes segmentation and registration without segmentation labels or manual steps, improving the accuracy of markerless navigation in spine surgery.

Comments Early Accepted MICCAI 2026. Code and interactive visualizations: https://lorenzopettinari.github.io/end-2-reg/

Details
AI-generated Chinese abstract

脊柱手术中的术中导航需要毫米级的精度。目前,这通过辐射强度大的术中成像和骨锚定标记实现,但这些标记侵入性且会干扰手术流程。无标记RGB-D定位方法提供了一种有前途的替代方案。然而,现有方法依赖于弱分割标签来隔离相关解剖结构,这可能导致在定位过程中传播误差。我们提出了End2Reg,一种端到端深度学习框架,通过联合优化分割和定位,消除了对分割标签和手动步骤的需要。网络学习任务特定的分割掩码,以适应定位,仅通过定位目标进行指导,而无需显式的分割监督。End2Reg在体外和体内基准测试中实现了最先进的性能,将中位目标定位误差减少了32%,均方根误差平均减少了61%,同时在部分遮挡下保持稳健性能。消融结果证实,端到端优化显著提高了定位精度。总体而言,End2Reg朝着完全自动化的无标记术中导航迈进。代码和交互式可视化可在:https://lorenzopettinari.github.io/end-2-reg/ 上找到。

English abstract

Intraoperative navigation in spine surgery demands millimeter-level accuracy. Currently, this is achieved through radiation-intensive intraoperative imaging and bone-anchored markers that are invasive and disrupt surgical workflow. Markerless RGB-D registration methods offer a promising alternative. However, existing approaches rely on weak segmentation labels to isolate relevant anatomical structures, potentially propagating errors through the registration process. We present End2Reg, an end-to-end deep learning framework that jointly optimizes segmentation and registration, eliminating the need for segmentation labels and manual steps. The network learns task-specific segmentation masks optimized for registration, guided solely by the registration objective without explicit segmentation supervision. End2Reg achieves state-of-the-art performance on ex- and in-vivo benchmarks, reducing median Target Registration Error by 32% and mean Root Mean Square Error by 61%, while maintaining robust performance under partial occlusions. Ablation results confirm that end-to-end optimization significantly improves registration accuracy. Overall, End2Reg advances towards fully automatic, markerless intraoperative navigation. Code and interactive visualizations are available at: https://lorenzopettinari.github.io/end-2-reg/.
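For readers unfamiliar with the reported metric, Target Registration Error (TRE) is commonly computed as the distance between target points mapped by the estimated transform and by the reference transform. The sketch below assumes rigid transforms given as a 3x3 rotation and a translation; the helper names and the sample points are hypothetical and unrelated to the paper's benchmarks.

```python
import math
import statistics

def apply_rigid(rotation, translation, point):
    """Apply p -> R p + t for a 3x3 rotation (nested lists) and a 3-vector."""
    return [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
            for i in range(3)]

def target_registration_errors(estimated, reference, targets):
    """Per-target distance between the estimated and reference mappings."""
    return [math.dist(apply_rigid(*estimated, p), apply_rigid(*reference, p))
            for p in targets]

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reference = (identity, [0.0, 0.0, 0.0])
estimated = (identity, [1.0, 0.0, 0.0])  # hypothetical 1 mm offset along x
targets = [[0.0, 0.0, 0.0], [10.0, 5.0, 2.0], [-3.0, 4.0, 1.0]]
errors = target_registration_errors(estimated, reference, targets)
median_tre = statistics.median(errors)
```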

2512.09806 2026-05-21 cs.CV cs.AI

CHEM: Estimating and Understanding Hallucinations in Deep Learning for Image Processing

CHEM: 估计和理解深度学习在图像处理中的幻觉

Jianfei Li, Ines Rosellon-Inclan, Gitta Kutyniok, Jean-Luc Starck

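Affiliations * Munich Center for Machine Learning (MCML); University of Tromsø; German Aerospace Center (DLR); Foundation for Research and Technology Hellas (FORTH)

AI summary This paper proposes CHEM, a method for quantifying and characterizing hallucinated artifacts in image reconstruction models. It localizes hallucination-prone regions using wavelet and shearlet representations, assesses hallucination levels with conformalized quantile regression, and analyzes why U-shaped networks tend to produce hallucination-prone predictions.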

Details
AI-generated Chinese abstract

基于深度学习的方法最近在图像重建问题中取得了显著成功。然而,挑战出现了,因为这些方法可能会生成不真实的 artifacts 或幻觉,这可能干扰安全关键场景中的分析。本文介绍了一个框架,用于量化和表征图像重建模型中的幻觉 artifacts。所提出的方法称为 Conformal Hallucination Estimation Metric (CHEM),能够识别模型预测中的幻觉易发区域。它利用小波和shearlet表示在图像特征层面定位这些区域,并使用 conformalized quantile regression 以分布无关的方式评估幻觉水平。提供了理论分析,表征了CHEM对幻觉 artifacts 的灵敏度及其与均方误差的关系。基于这些见解并采用基于逼近理论的观点,我们研究了为何U-shaped网络,广泛用于图像重建的架构,倾向于产生易受幻觉影响的预测。我们在天文图像去卷积中使用CANDELS数据集(如U-Net、SwinUNet和Learnlets)以及在自然图像超分辨率中使用DIV2K数据集(如DRUNet、Unfolded DRS、RAM和DPS)上评估了所提出方法的有效性。

English abstract

Deep learning-based methods have recently achieved significant success in image reconstruction problems. However, challenges have emerged, as these methods may generate unrealistic artifacts or hallucinations, which can interfere with analysis in safety-critical scenarios. This paper introduces a framework for quantifying and characterizing hallucinated artifacts in image reconstruction models. The proposed method, termed the Conformal Hallucination Estimation Metric (CHEM), enables the identification of hallucination-prone regions in model predictions. It leverages wavelet and shearlet representations to localize such regions at the level of image features, and uses conformalized quantile regression to assess hallucination levels in a distribution-free manner. A theoretical analysis is provided, characterizing the sensitivity of CHEM to hallucinated artifacts and its relationship to the mean squared error. Building on these insights and adopting a viewpoint grounded in approximation theory, we investigate why U-shaped networks, widely used architectures for image reconstruction, tend to produce hallucination-prone predictions. We assess the effectiveness of the proposed approach on astronomical image deconvolution using the CANDELS dataset with architectures such as U-Net, SwinUNet, and Learnlets, and on natural image super-resolution using the DIV2K dataset with models such as DRUNet, Unfolded DRS, RAM, and DPS.
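The conformalized quantile regression step can be sketched generically. This follows the standard split-conformal recipe: compute conformity scores on a held-out calibration set from the lower and upper quantile predictions, take a finite-sample-corrected empirical quantile of those scores, and widen new intervals by that amount. How CHEM applies this to wavelet or shearlet coefficients is not shown; the function names and the toy numbers are assumptions.

```python
import math

def cqr_offset(cal_lo, cal_hi, cal_y, alpha):
    """Calibration offset for target miscoverage alpha.

    Scores are max(lo - y, y - hi); the offset is the
    ceil((n + 1) * (1 - alpha))-th smallest score.
    """
    scores = sorted(max(lo - y, y - hi)
                    for lo, hi, y in zip(cal_lo, cal_hi, cal_y))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]

def cqr_interval(lo, hi, offset):
    """Widen (or shrink, if offset < 0) a predicted quantile interval."""
    return lo - offset, hi + offset

# Toy calibration set: predicted interval [0, 1] for every point.
offset = cqr_offset([0.0] * 3, [1.0] * 3, [0.5, 1.25, 2.0], alpha=0.5)
interval = cqr_interval(0.0, 1.0, offset)
```

The guarantee is distribution-free in the sense that it relies only on exchangeability of calibration and test points, not on the quantile model being correct.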

2512.09447 2026-05-21 cs.RO cs.CV

Query-Calibrated Segmental Admission for Descriptor-Agnostic LiDAR Loop Closure in Repetitive Environments

基于查询校准的分段准入用于无描述符的激光雷达回环闭合在重复环境中

Jaehyun Kim, Seungwon Choi, Wonseok Kang, Tae-Wan Kim

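Affiliations * Department of Naval Architecture and Ocean Engineering

AI summary This study proposes a descriptor-agnostic sparse loop-admission policy for stabilizing the pose graph in repetitive environments. It calibrates query-level segment hypotheses and validates representative pairs to reduce false admissions of loop factors, improving the precision and stability of loop closure.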

Comments 8 pages, 3 figures

Details
AI-generated Chinese abstract

结构重复的环境会产生视觉上合理但存在混叠的LiDAR回环候选者,当这些候选者被作为回环因子加入图中时,可能会破坏位姿图优化。我们提出了一种名为查询校准分段准入(QCSA)的策略,这是一种面向图稳定性的稀疏回环准入政策。该策略通过与硬负样本对比对短描述符分段进行评分,校准哪些查询级的分段假设能达到几何关系,并通过广义迭代最近点(G-ICP)验证代表性配对。我们在SNU图书馆数据集(SNULib)和HeLiPR重叠路线上评估了该方法。在SNULib上对七种LiDAR描述符家族进行汇总分析,QCSA将插入的回环因子减少了3.8倍,将因子精度从0.542提高到0.717,并显著降低了每组查询的误入率。在更稀疏的图中,它保持了可比的平均绝对轨迹误差(ATE)并大幅降低了最坏序列ATE与密集Top1+G-ICP相比,从1.064降至0.778米。这些结果支持了所提出的回环准入层在重混叠的同时定位与建图(SLAM)中的应用。我们的实现和数据集将在:https://github.com/wanderingcar/snu_library_dataset上发布。

English abstract

Structurally repetitive environments produce visually plausible but aliased LiDAR loop candidates that can destabilize pose-graph optimization when admitted as loop factors. We propose Query-Calibrated Segmental Admission (QCSA), a descriptor-agnostic sparse loop-admission policy for graph-stability-oriented insertion. The policy scores short descriptor segments against hard negatives, calibrates which query-level segment hypotheses reach geometry, and inserts representative pairs validated by Generalized Iterative Closest Point (G-ICP). We evaluate it on the SNU Library Dataset (SNULib) and HeLiPR overlap routes. Aggregated over seven LiDAR descriptor families on SNULib, QCSA reduces inserted loop factors by a factor of 3.8, raises factor precision from 0.542 to 0.717, and sharply lowers false admissions per query group. With this sparser graph, it maintains comparable mean absolute trajectory error (ATE) and substantially reduces worst-sequence ATE versus dense Top1+G-ICP, from 1.064 to 0.778 m. The aggregate mean and worst-sequence ATE remain lower than the odometry-only reference. Under a matched factor budget, QCSA also attains lower trajectory error than SeqSLAM and sparse Top1+G-ICP selections. Fixed-transfer validation on HeLiPR, with no route-specific tuning, likewise suppresses hard-negative admissions. These results support the proposed admission layer for aliasing-heavy simultaneous localization and mapping (SLAM). Our implementation and dataset will be released at: https://github.com/wanderingcar/snu_library_dataset.

2511.23152 2026-05-21 cs.LG cond-mat.dis-nn math.OC math.RT stat.ML

A Differentiable Measure of Algebraic Complexity: Provably Exact Discovery of Group Structures

一种可微的代数复杂性度量:证明精确发现群结构

Dongsung Huh, Lior Horesh, Halyun Jeong

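Affiliations * Independent Researcher; IBM Research; University at Albany, SUNY

AI summary This paper proposes a differentiable measure of algebraic complexity. Framing the problem as Cayley-table completion, it proves that the HyperCube operator-valued tensor factorization exactly discovers group structures, resolving the central open conjecture of Huh (2025).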

Comments 29 pages, 3 figures. All theoretical conjectures are formally proven as theorems and verified in Lean 4. v4: Minor typographical corrections

Details
AI-generated Chinese abstract

从数据中发现离散代数规则是机器学习中的基本挑战。我们通过Cayley表完成——经典矩阵完成的代数对应物——正式化了这个问题,其中关联性违反的程度取代线性秩作为复杂性的内在度量。我们对超立方体,一种操作值张量分解,在完全观察的目标表δ上进行了严格的景观分析,证明其全局下界H_inf(δ) := inf_{Θ∈F_δ} H(Θ)隐式定义了这种复杂性的精确可微度量。我们证明了超立方体的原目标函数H(Θ)分解为两个组成部分:几何对齐(共线性)和反ℓ_2惩罚。我们建立这些连续变分压力诱导了核心离散属性:共线性强制关联性(共线性-关联性等价),而反ℓ_2惩罚在共线性流形内减少为精确反秩惩罚,驱动参数向全秩单位性发展。因此,我们推导出一个绝对下界H(Θ) ≥ H_inf(δ) ≥ 3 |δ|,其中|δ|是目标表大小。我们证明这个绝对地板在且仅在目标是同源于群时被达到,并将全局最小值表征为底层群的正则表示(除单位性规范外),解决了Huh(2025)的核心开放猜想。本文为某些离散代数结构可以被可微度量精确表征提供了存在证明,使得基于梯度的发现无需组合搜索。所有理论结果均在Lean 4中机械验证并通过小规模实验确认。

English abstract

Discovering discrete algebraic rules from data is a fundamental challenge in machine learning. We formalize this problem through Cayley-table completion -- an algebraic counterpart to classical matrix completion -- where the degree of associativity violation replaces linear rank as the intrinsic measure of complexity. We provide a rigorous landscape analysis of HyperCube, an operator-valued tensor factorization, on the fully observed target table $δ$, proving that its global infimum $H_{\inf}(δ) := \inf_{Θ\in F_δ} H(Θ)$ implicitly defines an exact differentiable measure for this complexity. We show that HyperCube's native objective $H(Θ)$ decomposes into two components: geometric alignment (collinearity) and an inverse $\ell_2$ penalty. We establish that these continuous variational pressures induce core discrete properties: collinearity enforces associativity (Collinearity--Associativity Equivalence), and the inverse $\ell_2$ penalty reduces to an exact inverse rank penalty within the collinear manifold, driving the parameters toward full-rank unitarity. Consequently, we derive an absolute lower bound $H(Θ) \ge H_{\inf}(δ) \ge 3 \, |δ|$, where $|δ|$ is the target table size. We prove this absolute floor is attained if and only if the target is isotopic to a group, and characterize the global minimizer as the regular representation of the underlying group (up to unitary gauge), resolving the central open conjecture of Huh (2025). This work serves as an existence proof that certain discrete algebraic structures can be exactly characterized by differentiable measures, enabling gradient-based discovery without the need for combinatorial search. All theoretical results are mechanically verified in Lean 4 and confirmed via small-scale experiments.
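To make "degree of associativity violation" concrete, the sketch below counts the triples $(a, b, c)$ of a finite Cayley table for which $(a \cdot b) \cdot c \ne a \cdot (b \cdot c)$. This discrete count is only an illustration of the quantity being measured; it is not the differentiable measure $H_{\inf}(δ)$ from the abstract, and the function name and table encoding are assumptions.

```python
def associativity_violations(table):
    """Count triples (a, b, c) with (a*b)*c != a*(b*c); table[a][b] is a*b."""
    n = len(table)
    return sum(1
               for a in range(n) for b in range(n) for c in range(n)
               if table[table[a][b]][c] != table[a][table[b][c]])

# Addition mod 3 is a group, so every triple associates.
z3_add = [[(a + b) % 3 for b in range(3)] for a in range(3)]

# Subtraction mod 3 is a Latin square but not associative:
# (a - b) - c equals a - (b - c) only when c == 0.
z3_sub = [[(a - b) % 3 for b in range(3)] for a in range(3)]

violations_add = associativity_violations(z3_add)
violations_sub = associativity_violations(z3_sub)
```

The brute-force count costs $O(n^3)$ table lookups and offers no gradient, which is the gap a differentiable surrogate is meant to close.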

2511.01482 2026-05-21 cs.CL

Towards Consistent Detection of Cognitive Distortions: LLM-Based Annotation and Dataset-Agnostic Evaluation

迈向认知扭曲一致检测:基于大语言模型的标注与数据集无关评估

Neha Sharma, Navneet Agarwal, Kairit Sirts

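Affiliations * LREC-2026 (LREC-2026 conference)

AI summary This paper explores using large language models as consistent and reliable annotators for cognitive distortion detection and proposes a dataset-agnostic evaluation framework for fairly comparing models trained on different datasets. The results show that GPT-4 produces consistent annotations, which improve model performance on subjective NLP tasks.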

Journal ref https://lrec.elra.info/lrec2026-main-851

Details
AI-generated Chinese abstract

基于文本的自动化认知扭曲检测是一项具有挑战性的任务,由于其主观性质,即使在专家人类标注者之间也观察到低一致性分数,导致不可靠的标注。我们探索了使用大型语言模型(LLMs)作为一致且可靠的标注器,并提出多个独立的LLM运行可以揭示稳定的标注模式,尽管任务本身具有内在的主观性。此外,为了公平比较训练于不同特征数据集上的模型,我们引入了一种使用Cohen's kappa作为效应大小度量的数据集无关评估框架。该方法允许在传统指标如F1分数不足的情况下进行公平的跨数据集和跨研究比较。我们的结果表明,GPT-4可以产生一致的标注(Fleiss's Kappa = 0.78),从而在使用这些标注训练的模型在测试集上的表现优于使用人类标注数据训练的模型。我们的发现表明,LLMs可以为生成支持强大下游性能的主观NLP任务的训练数据提供可扩展且内部一致的替代方案。

English abstract

Text-based automated Cognitive Distortion detection is a challenging task due to its subjective nature, with low agreement scores observed even among expert human annotators, leading to unreliable annotations. We explore the use of Large Language Models (LLMs) as consistent and reliable annotators, and propose that multiple independent LLM runs can reveal stable labeling patterns despite the inherent subjectivity of the task. Furthermore, to fairly compare models trained on datasets with different characteristics, we introduce a dataset-agnostic evaluation framework using Cohen's kappa as an effect size measure. This methodology allows for fair cross-dataset and cross-study comparisons where traditional metrics like F1 score fall short. Our results show that GPT-4 can produce consistent annotations (Fleiss's Kappa = 0.78), resulting in improved test set performance for models trained on these annotations compared to those trained on human-labeled data. Our findings suggest that LLMs can offer a scalable and internally consistent alternative for generating training data that supports strong downstream performance in subjective NLP tasks.
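Cohen's kappa, used above as a chance-corrected effect size, is $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed agreement and $p_e$ the agreement expected from the two label distributions alone. A minimal sketch follows; the function name and the toy labels are illustrative and not data from the paper.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa between two equal-length label sequences."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use a single identical label
    return (observed - expected) / (1.0 - expected)

# Toy example: model predictions versus reference labels.
kappa = cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Because kappa subtracts the agreement a label distribution would produce by chance, it is less sensitive to class imbalance than raw accuracy or F1, which is the property that makes cross-dataset comparison more meaningful.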

2511.01219 2026-05-21 cs.RO

Tackling the Kidnapped Robot Problem via Sparse Feasible Hypothesis Sampling and Reliable Batched Multi-Stage Inference

通过稀疏可行假设采样和可靠的分批多阶段推理解决被绑架的机器人问题

Muhua Zhang, Lei Ma, Ying Wu, Kai Shen, Deqing Huang, Henry Leung

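Affiliations * School of Electrical Engineering, Southwest Jiaotong University

AI summary This paper proposes a passive 2-D global relocalization framework that estimates the global pose efficiently and reliably from a single LiDAR scan and an occupancy grid map while the robot remains stationary, enhancing the long-term autonomy of mobile robots. The framework casts global relocalization as a non-convex problem and balances completeness and efficiency through a multi-hypothesis scheme with batched multi-stage inference and early termination.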

Comments 14 pages, 8 figures. Accepted for publication in IEEE Transactions on Instrumentation and Measurement. DOI: 10.1109/TIM.2026.3694741

Details
AI-generated Chinese abstract

本文针对被绑架的机器人问题(KRP),即在已知地图中重新定位机器人时,没有先验姿态估计或在SLAM初始化时的定位丢失问题。为此,提出了一种被动的2D全局重定位框架。该框架在机器人静止时,通过单个LiDAR扫描和占用网格地图高效可靠地估计全局姿态,从而提高移动机器人的长期自主性。所提出的框架将全局重定位问题转化为非凸问题,并通过多假设方案与分批多阶段推理和早期终止来解决,平衡完整性和效率。快速探索随机树(RRT)在可通行性约束下,渐近覆盖可达空间以生成稀疏、均匀分布的可行位置假设,从根本上减少采样空间。假设首先通过所提出的扫描均方差(SMAD)进行排序,这是一种粗略的光束误差水平度量,通过优先处理高可能性的候选者来实现早期终止。SMAD计算优化以适应有限的扫描测量。提出的翻译亲和度扫描到地图对齐度量(TAM)用于在假设位置可靠地选择方向,并准确评估最终的全局姿态,以减轻由于稀疏假设引起的翻译不确定性以及非全景LiDAR扫描和环境变化导致的传统似然场度量的退化。在资源受限的移动机器人上的真实世界实验表明,所提出的框架在成功率、在测量不确定性下的鲁棒性和计算效率方面均表现优异。

English abstract

This paper addresses the Kidnapped Robot Problem (KRP), a core localization challenge of relocalizing a robot in a known map without a prior pose estimate upon localization loss or at SLAM initialization. For this purpose, a passive 2-D global relocalization framework is proposed. It estimates the global pose efficiently and reliably from a single LiDAR scan and an occupancy grid map while the robot remains stationary, thereby enhancing the long-term autonomy of mobile robots. The proposed framework casts global relocalization as a non-convex problem and solves it via a multi-hypothesis scheme with batched multi-stage inference and early termination, balancing completeness and efficiency. The Rapidly-exploring Random Tree (RRT), under traversability constraints, asymptotically covers the reachable space to generate sparse, uniformly distributed feasible positional hypotheses, fundamentally reducing the sampling space. The hypotheses are preliminarily ordered by the proposed Scan Mean Absolute Difference (SMAD), a coarse beam-error level metric that facilitates early termination by prioritizing high-likelihood candidates. The SMAD computation is optimized for limited scan measurements. The Translation-Affinity Scan-to-Map Alignment Metric (TAM) is proposed for reliable orientation selection at hypothesized positions and accurate final global pose evaluation to mitigate degradation in conventional likelihood-field metrics under translational uncertainty induced by sparse hypotheses, as well as non-panoramic LiDAR scans and environmental changes. Real-world experiments on a resource-constrained mobile robot with non-panoramic LiDAR scans show that the proposed framework achieves competitive performance in success rate, robustness under measurement uncertainty, and computational efficiency.
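The ordering role of SMAD can be sketched under a simple reading of its name: the mean absolute difference between the observed beam ranges and the ranges expected at a hypothesized pose, with invalid beams skipped. The exact definition and optimizations in the paper are not given in the abstract, so the formula, the function names, and the precomputed expected ranges below are assumptions.

```python
import math

def smad(observed, expected):
    """Mean absolute range difference over beams valid in both scans."""
    diffs = [abs(o - e) for o, e in zip(observed, expected)
             if math.isfinite(o) and math.isfinite(e)]
    return sum(diffs) / len(diffs) if diffs else math.inf

def rank_hypotheses(observed, expected_by_hypothesis):
    """Order hypothesis names from most to least consistent with the scan."""
    return sorted(expected_by_hypothesis,
                  key=lambda name: smad(observed, expected_by_hypothesis[name]))

# Hypothetical scan with one invalid beam, and ranges ray-cast at two poses.
observed = [1.0, 2.0, math.inf]
expected_by_hypothesis = {
    "B": [2.0, 4.0, 3.0],
    "A": [1.0, 2.0, 3.0],
}
order = rank_hypotheses(observed, expected_by_hypothesis)
```

Evaluating candidates in this order means a more expensive alignment check can stop as soon as one hypothesis passes, which is the early-termination behavior the abstract describes.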

2510.23538 2026-05-21 cs.AI cs.CL cs.CV cs.SE

JanusCoder: Towards a Foundational Visual-Programmatic Interface for Code Intelligence

JanusCoder: 向代码智能的视觉-程序化界面迈进

Qiushi Sun, Jingyang Gong, Yang Liu, Qiaosheng Chen, Lei Li, Kai Chen, Qipeng Guo, Ben Kao, Fei Yuan

Affiliations * The University of Hong Kong, Shanghai AI Laboratory, Nanjing University, Carnegie Mellon University, Shanghai Innovation Institute

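AI summary This paper proposes JanusCoder, a visual-programmatic interface for code intelligence. By building a large-scale multimodal code dataset and a unified model, it generates code from textual instructions, visual inputs, or both, and shows superior performance on text-centric and vision-centric coding tasks.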

Comments ICLR 2026 Camera Ready Version, with code and data available

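Details
AI-translated abstract

The scope of neural code intelligence is expanding rapidly, from text-based source code to the rich visual outputs that programs generate. This visual dimension is critical for advanced applications such as flexible content generation and precise, program-driven editing of visualizations. Progress, however, has been held back by the scarcity of high-quality multimodal code data, which stems from challenges in synthesis and quality assessment. To address these challenges, we contribute on both the data and the modeling side. We first introduce a complete synthesis toolkit that exploits reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus ranging from standard charts to complex interactive web UIs and code-driven animations. With this toolkit we build JanusCode-800K, the largest multimodal code corpus to date. It powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model departs from existing approaches, which build specialized models for isolated tasks. Extensive experiments on text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B models approaching or even exceeding commercial models. Further analysis offers key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints are available at https://github.com/InternLM/JanusCoder.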

English abstract

The scope of neural code intelligence is rapidly expanding beyond text-based source code to encompass the rich visual outputs that programs generate. This visual dimension is critical for advanced applications like flexible content generation and precise, program-driven editing of visualizations. However, progress has been impeded by the scarcity of high-quality multimodal code data, a bottleneck stemming from challenges in synthesis and quality assessment. To address these challenges, we make contributions from both a data and modeling perspective. We first introduce a complete synthesis toolkit that leverages reciprocal synergies between data modalities to efficiently produce a large-scale, high-quality corpus spanning from standard charts to complex interactive web UIs and code-driven animations. Leveraging this toolkit, we construct JanusCode-800K, the largest multimodal code corpus to date. This powers the training of our models, JanusCoder and JanusCoderV, which establish a visual-programmatic interface for generating code from textual instructions, visual inputs, or a combination of both. Our unified model is a departure from existing approaches that build specialized models for isolated tasks. Extensive experiments on both text-centric and vision-centric coding tasks demonstrate the superior performance of the JanusCoder series, with our 7B to 14B scale models approaching or even exceeding the performance of commercial models. Furthermore, extensive analysis provides key insights into harmonizing programmatic logic with its visual expression. Our code and checkpoints are available at https://github.com/InternLM/JanusCoder.

2510.18034 2026-05-21 cs.CV cs.AI cs.RO

Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

VLMs能否解锁语义异常检测?一个结构化推理的框架

Roberto Brusnicki, David Pop, Yuan Gao, Mattia Piccinini, Johannes Betz

Affiliations * Professorship of Autonomous Vehicle Systems, TUM School of Engineering Design, Technical University of Munich, Munich, Germany

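AI summary This paper proposes the SAVANT framework, which uses structured reasoning to improve VLM performance in semantic anomaly detection, enabling more accurate identification of rare anomalous situations in autonomous driving scenes.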

Comments 8 pages, 5 figures

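Details
AI-translated abstract

Autonomous driving systems remain critically vulnerable to the long tail of rare, out-of-distribution semantic anomalies. Although VLMs have emerged as promising tools for perception, their use in anomaly detection is still largely limited to prompting proprietary models, which limits reliability, reproducibility, and deployment feasibility. To close this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a new model-agnostic reasoning framework that reformulates anomaly detection as layered semantic consistency verification. Applying SAVANT's two-phase pipeline, structured scene description extraction followed by multi-modal evaluation, improves existing VLMs' scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, turning VLM-based detection into a principled decomposition across four semantic domains. On a balanced set of real-world driving scenarios, applying SAVANT improves VLM absolute recall by approximately 18.5% compared to prompting baselines. This gain also enables reliable large-scale annotation: using the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. We use the resulting high-quality dataset to fine-tune a 7B open-source model (Qwen2.5-VL) for single-shot anomaly detection, reaching 90.8% recall and 93.8% accuracy, surpassing all evaluated models while enabling local deployment at near-zero cost. By coupling structured semantic reasoning with scalable data curation, we provide a practical solution to data scarcity in semantic anomaly detection for autonomous systems. Supplementary material: https://TUM-AVS.github.io/SAVANT/.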

English abstract

Autonomous driving systems remain critically vulnerable to the long-tail of rare, out-of-distribution semantic anomalies. While VLMs have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models - limiting reliability, reproducibility, and deployment feasibility. To address this gap, we introduce SAVANT (Semantic Anomaly Verification/Analysis Toolkit), a novel model-agnostic reasoning framework that reformulates anomaly detection as a layered semantic consistency verification. By applying SAVANT's two-phase pipeline - structured scene description extraction and multi-modal evaluation - existing VLMs improve their scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains. We show that across a balanced set of real-world driving scenarios, applying SAVANT improves VLM's absolute recall by approximately 18.5% compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. We use the resulting high-quality dataset to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy - surpassing all models evaluated while enabling local deployment at near-zero cost. By coupling structured semantic reasoning with scalable data curation, we provide a practical solution to data scarcity in semantic anomaly detection for autonomous systems. Supplementary material: https://TUM-AVS.github.io/SAVANT/.
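The abstract reports recall and accuracy figures and an absolute recall gain. As a reading aid, the sketch below shows how these quantities are computed from confusion-matrix counts and how an absolute gain (a difference in recall) differs from a relative one. All counts are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def recall(tp: int, fn: int) -> float:
    """Fraction of true anomalies that were detected."""
    return tp / (tp + fn)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)

# Hypothetical counts on a balanced set: 100 anomalous and 100 normal scenes.
baseline = dict(tp=60, fn=40, tn=90, fp=10)    # plain prompting (made up)
structured = dict(tp=78, fn=22, tn=92, fp=8)   # structured reasoning (made up)

base_recall = recall(baseline["tp"], baseline["fn"])
new_recall = recall(structured["tp"], structured["fn"])

absolute_gain = new_recall - base_recall       # difference in recall
relative_gain = absolute_gain / base_recall    # gain as a fraction of the baseline
new_accuracy = accuracy(**structured)

print(f"recall {base_recall:.2f} -> {new_recall:.2f}")
print(f"absolute gain {absolute_gain:.2f}, relative gain {relative_gain:.2f}")
print(f"accuracy {new_accuracy:.2f}")
```

With these made-up counts the same improvement reads as 0.18 in absolute terms but 0.30 relative to the baseline, which is why it matters that the abstract specifies an absolute figure.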