arXivDaily: arXiv daily academic digest, updated Monday through Friday
2602.04189 2026-05-12 cs.LG stat.CO

Beyond Accuracy: Evaluating Posterior Fidelity of Diffusion Inverse Solvers

Xiaoyu Qiu, Taewon Yang, Zhanhao Liu, Guanyang Wang, Liyue Shen

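Affiliations: Department of Statistics, University of Michigan; Department of EECS, University of Michigan; Department of Statistics, Rutgers University

AI summary: This paper studies the posterior fidelity of Diffusion Inverse Solvers (DIS) in scientific and engineering inverse problems, noting that existing benchmarks focus mainly on reconstruction accuracy and overlook uncertainty quantification. The authors propose score-based Kernel Stein Discrepancy (score-KSD), a metric that requires no ground-truth posterior, to assess how consistent the samples generated by a diffusion sampler are with the target posterior. Experiments show that the metric exposes the gap between reconstruction accuracy and posterior consistency, offering a more complete basis for model evaluation.

Abstract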

Uncertainty evaluation is critical in scientific and engineering inverse problems. However, existing benchmarks on Diffusion Inverse Solvers (DIS) primarily focus on reconstruction accuracy but overlook uncertainty and distributional behavior. Since stochastic inverse solvers represent uncertainty through diffusion-based posterior samples, evaluating how well their generated samples capture the target posterior distribution becomes an important aspect of uncertainty quantification. To address this limitation and better understand the distributional behavior of diffusion samplers, we conduct a systematic study to investigate the posterior fidelity of a broad range of existing DIS methods in controlled simulation settings with a known analytical true posterior. Furthermore, to enable posterior-aware evaluation on real-world inverse problems where the ground-truth posterior is unavailable, we propose score-based Kernel Stein Discrepancy (score-KSD), a theoretically grounded and ground-truth-free metric that measures the consistency of the distribution of generated samples from a DIS method with the target posterior score field, induced by the forward model and the learned diffusion prior. Through both simulation experiments and real-world inverse problem solving, we validate the effectiveness of the proposed score-KSD and demonstrate that it provides meaningful posterior fidelity diagnostics beyond reconstruction accuracy, revealing that higher reconstruction accuracy does not necessarily imply better posterior consistency.
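
To make the idea of a score-based discrepancy concrete, here is a minimal sketch of a generic kernel Stein discrepancy estimate. It is not the paper's score-KSD: the RBF kernel, the fixed bandwidth, and the function name `ksd_vstat` are assumptions chosen for illustration, and the target score is supplied as a plain callable rather than being induced by a forward model and a learned diffusion prior.

```python
import numpy as np


def ksd_vstat(samples, score_fn, bandwidth=1.0):
    """V-statistic estimate of the squared kernel Stein discrepancy.

    samples:  array of shape (n, d), draws from the sampler under test.
    score_fn: callable returning the target score (gradient of the log
              density) at each sample, shape (n, d). Assumed to be known.
    The RBF kernel and fixed bandwidth are illustrative choices.
    """
    x = np.asarray(samples, dtype=float)
    n, d = x.shape
    s = score_fn(x)
    diff = x[:, None, :] - x[None, :, :]        # pairwise x_i - x_j
    sq = np.sum(diff ** 2, axis=-1)             # squared distances
    h2 = bandwidth ** 2
    k = np.exp(-sq / (2.0 * h2))                # RBF kernel matrix

    term_ss = (s @ s.T) * k                                 # s_i . s_j k
    term_sy = np.einsum("id,ijd->ij", s, diff) / h2 * k     # s_i . grad_{x_j} k
    term_sx = -np.einsum("jd,ijd->ij", s, diff) / h2 * k    # s_j . grad_{x_i} k
    term_tr = (d / h2 - sq / h2 ** 2) * k                   # trace of mixed Hessian
    return float(np.mean(term_ss + term_sy + term_sx + term_tr))
```

With a standard normal target (score `-x`), samples drawn from the target should give a value near zero, while shifted samples should give a clearly larger one. No ground-truth samples from the target are needed, only its score.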

2602.04093 2026-05-12 cs.LG

Federated Concept-Based Models: Interpretable models with distributed supervision

Dario Fenoglio, Arianna Casanova, Francesco De Santis, Gabriele Dominici, Johannes Schneider, Pietro Barbiero, Giovanni De Felice, Marc Langheinrich, Martin Gjoreski

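Affiliations: Università della Svizzera italiana; University of Liechtenstein; Politecnico di Torino; IBM Research Zurich

AI summary: This paper proposes Federated Concept-based Models (F-CMs), which combine interpretable concept-based models with federated learning to address the scarcity of concept annotations across distributed data sources. The method aggregates concept-level information across institutions and efficiently adapts the model architecture as concept supervision changes, while preserving privacy. Experiments show that F-CMs maintain predictive accuracy and enable interpretable inference on concepts that a given institution cannot access, which the authors highlight as a key novelty.

Abstract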

Concept-based Models (CMs) enhance interpretability in deep learning by grounding predictions in human-understandable concepts. However, concept annotations are costly and rarely available at scale within a single data source. Federated Learning (FL) could alleviate this limitation by enabling cross-institutional training over concept annotations distributed across multiple data owners. Yet, FL lacks interpretable modeling paradigms. Integrating CMs with FL is non-trivial: although FL supports heterogeneous and non-stationary client participation, it typically assumes a fixed shared architecture, whereas CMs may require architectural adaptation as the available concept set evolves. We propose Federated Concept-based Models (F-CMs), a new methodology for deploying CMs in evolving FL settings. F-CMs aggregate concept-level information across institutions and efficiently adapt the model architecture to changes in concept supervision while preserving privacy. Empirically, F-CMs maintain accuracy and intervention effectiveness comparable to training settings with full concept supervision, while outperforming on average non-adaptive federated baselines. Notably, F-CMs enable interpretable inference on concepts unavailable to a given institution, a key novelty over existing approaches.

2602.03688 2026-05-12 cs.AI

TodyComm: Task-Oriented Dynamic Communication for Multi-Round LLM-based Multi-Agent System

Wenzhe Fan, Tommaso Tognoli, Henry Peng Zou, Chunyu Miao, Yibo Wang, Xinhua Zhang

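Affiliations: Department of Computer Science, University of Illinois at Chicago

AI summary: This paper proposes TodyComm, a task-oriented dynamic communication algorithm that addresses the loss of collaboration efficiency caused by fixed communication structures in multi-round LLM-based multi-agent systems. Using policy-gradient optimization, it produces a collaboration topology at each round that adapts to the task, improving task performance. Experiments show that TodyComm performs better under dynamic adversarial settings and communication budget constraints while remaining efficient, scalable, and generalizable.

Abstract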

Multi-round LLM-based multi-agent systems rely on effective communication structures to support collaboration across rounds. However, most existing methods employ a fixed communication topology during inference, which falls short in many realistic applications where the agents' roles may change across rounds due to a dynamic adversary, task progression, or time-varying constraints such as communication bandwidth. In this paper, we propose addressing this issue through TodyComm, a task-oriented dynamic communication algorithm. It produces behavior-driven collaboration topologies that adapt to the dynamics at each round, optimizing the utility for the task through policy gradient. Experiments on five benchmarks demonstrate that, under both dynamic adversarial settings and communication budget constraints, TodyComm achieves superior task performance while maintaining token efficiency, scalability, and strong generalizability across varying adversarial conditions.

2602.02281 2026-05-12 cs.LG cs.AI cs.NE physics.class-ph physics.comp-ph

A Physical Theory of Backpropagation: Exact Gradients from the Least-Action Principle

Antonino Emanuele Scurria

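Affiliations: Quantum Information Laboratory (LIQ); Université libre de Bruxelles (ULB)

AI summary: Starting from Hamilton's least-action principle, this paper derives exact backpropagation, closing a theoretical gap between physical principles and backpropagation. By recasting the forward pass as continuous-time dynamics and adopting a Lagrangian formalism for non-conservative systems, the author unifies inference and gradient computation on a doubled phase space whose conjugate fields jointly encode activations and sensitivities. No separate backward circuit is needed, inference and gradient computation proceed simultaneously, and standard backpropagation emerges as the discrete-time projection of the continuous flow, giving a theoretical basis for applying tools from classical mechanics to the analysis of learning dynamics.

Comments: 22 pages

Abstract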

Backpropagation is typically presented as a symbolic procedure: a backward pass topologically distinct from inference, with non-local error signals and synchronous global clocking, features with no clear analog in physical reality. Existing physics-inspired alternatives recover gradients only approximately, in vanishing-perturbation limits, or under weight-symmetry constraints incompatible with feedforward architectures. In this paper, we address this gap by deriving exact backpropagation from Hamilton's least-action principle. By recasting the forward dynamics in continuous time and adapting a Lagrangian formalism for non-conservative systems to the resulting flow, we unify inference and gradient computation within a single variational framework on a doubled phase space, whose two conjugate fields jointly encode activations and sensitivities. A single global Lagrangian governs the dynamics: the task loss enters as a symmetry-breaking perturbation of the forward manifold, and credit assignment emerges as the tension that develops between the conjugate states. Inference and gradient computation thus unfold simultaneously through local interactions, requiring no separate backward circuit. Ultimately, standard backpropagation is recovered exactly as the discrete-time projection of this continuous flow. This perspective unifies the formalism of physics with backpropagation, opening a principled pathway for applying tools from classical mechanics - symplectic geometry, Noether's theorem, path-integral methods - to the analysis of learning dynamics. As a downstream consequence, it also points toward analog and neuromorphic substrates in which learning is embodied in the hardware itself.

2602.01698 2026-05-12 cs.CL cs.LG

Restoring Exploration after Post-Training: Latent Exploration Decoding for Large Reasoning Models

Wenhui Tan, Fiorenzo Parascandolo, Enver Sangineto, Jianzhong Ju, Zhenbo Luo, Qian Cao, Rita Cucchiara, Ruihua Song, Jian Luan

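Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China; MiLM Plus, Xiaomi Inc., Beijing, China; University of Modena and Reggio Emilia; University of Pisa, Italy

AI summary: Large Reasoning Models (LRMs) have made notable progress on mathematical and code reasoning through reinforcement-learning post-training, but the study finds that this post-training degrades exploration: temperature sampling no longer raises task success rates. The paper proposes Latent Exploration Decoding (LED), which exploits the high entropy of intermediate layers through a depth-conditioned decoding strategy to restore exploration. Experiments show that LED improves reasoning accuracy on multiple benchmarks without additional training or parameters, and that combining it with reinforcement learning speeds up performance gains.

Comments: Project Page: https://github.com/AlbertTan404/LED

Abstract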

Large Reasoning Models (LRMs) have recently achieved strong mathematical and code reasoning performance through Reinforcement Learning (RL) post-training. However, we show that modern reasoning post-training induces an unintended exploration collapse: temperature-based sampling no longer increases pass@$n$ accuracy. Empirically, the final-layer posterior of post-trained LRMs exhibits sharply reduced entropy, while the entropy of intermediate layers remains relatively high. Motivated by this entropy asymmetry, we propose Latent Exploration Decoding (LED), a depth-conditioned decoding strategy. LED aggregates intermediate posteriors via cumulative sum and selects depth configurations with maximal entropy as exploration candidates. Without additional training or parameters, LED consistently improves pass@1 and pass@16 accuracy by 0.61 and 1.03 percentage points across multiple reasoning benchmarks and models. Furthermore, integrating LED into reinforcement learning, e.g., using GRPO as the rollout strategy, yields faster reward improvement and higher final performance, due to the efficient exploration capability of LED. Project page: https://github.com/AlbertTan404/LED.

2601.23026 2026-05-12 cs.LG

Root Cause Analysis of Measurement and Mechanistic Anomalies

Hendrik Suhr, David Kaltenpoth, Jilles Vreeken

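Affiliations: CISPA Helmholtz Center for Information Security

AI summary: This paper studies root cause analysis of anomalies, which aims to identify how and why a sample deviates from the normal process. Existing methods focus on which features are responsible and ignore that anomalies can arise from two different processes: measurement errors and mechanism shifts. The authors propose a causal model that explicitly distinguishes the two anomaly types and build on it an efficient inference procedure that localizes root causes and classifies the anomaly type. Experiments on synthetic and real-world data show strong performance.

Abstract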

Root cause analysis of anomalies aims to identify how and why a sample deviates from the normal process. Existing methods primarily focus on telling which features are responsible, ignoring that anomalies can arise through two fundamentally different processes: measurement errors, where the sample is generated normally but one or more values are recorded incorrectly, and mechanism shifts, where the causal process that generated the sample was changed. While measurement errors can often be safely corrected, mechanistic anomalies require careful consideration. In this paper, we formally define a causal model that explicitly captures both types by treating outliers as latent interventions on latent ("true") and observed ("measured") variables and show under which conditions the distinction is possible. Based on this model, we develop an efficient inference procedure for localizing root causes and distinguishing anomaly types. Experiments on synthetic and real-world data show that our method provides state-of-the-art and highly robust performance in both root cause localization and classification of anomaly types.

2601.22131 2026-05-12 cs.LG

SMOG: Scalable Meta-Learning for Multi-Objective Bayesian Optimization

Leonard Papenmeier, Petru Tighineanu

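Affiliations: Department of Information Systems, University of Münster; Robert Bosch GmbH

AI summary: The paper proposes SMOG, a scalable meta-learning method for multi-objective Bayesian optimization. Built on a multi-output Gaussian process, SMOG explicitly learns correlations between objectives and constructs a structured joint prior across meta-tasks and the target task, so that metadata uncertainty propagates into the target model. It supports hierarchical, parallel training, scales well, and plugs into standard multi-objective Bayesian optimization acquisition functions, improving data efficiency.

Comments: 29 pages, 18 figures

Abstract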

Multi-objective optimization aims to solve problems with competing objectives. Evaluating such problems is often slow or expensive, limiting the budget of evaluations. In many applications, historical data from related optimization tasks is available and can be leveraged via meta-learning to accelerate optimization. Bayesian optimization, as a promising technique for expensive black-box problems, has been extended independently to meta-learning and multi-objective optimization, but methods that simultaneously address both settings remain largely unexplored. We propose SMOG, a scalable and modular meta-learning model, based on a multi-output Gaussian process, that explicitly learns correlations between objectives. SMOG builds a structured joint Gaussian process prior across meta- and target tasks and, after conditioning on metadata, yields a closed-form prior for the target task. This construction propagates metadata uncertainty into the target surrogate in a principled way. SMOG supports hierarchical, parallel training, achieving linear scaling with the number of meta-tasks. The resulting surrogate integrates seamlessly with standard multi-objective Bayesian optimization acquisition functions. We demonstrate that our method is consistently competitive, delivering strong data efficiency across representative benchmarks and applications.

2601.21926 2026-05-12 cs.RO

Information Filtering via Variational Regularization for Robot Manipulation

Jinhao Zhang, Wenlong Xia, Yaojia Wang, Zhexuan Zhou, Huizhe Li, Yichen Lai, Haoming Song, Youmin Gong, Jie Mei

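Affiliations: Harbin Institute of Technology, Shenzhen; Shanghai Jiao Tong University

AI summary: This paper studies information filtering in diffusion-based visuomotor policies for robot manipulation, noting that the denoising decoder in existing methods is oversized, which introduces redundancy and noise in intermediate feature blocks. The authors propose a plug-and-play Variational Regularization module that imposes a conditional Gaussian and a KL-divergence regularizer, forming an adaptive information bottleneck. Experiments show gains over the baseline on several simulation and real-robot tasks, reaching new state-of-the-art results.

Abstract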

Diffusion-based visuomotor policies built on 3D visual representations have achieved strong performance in learning complex robotic skills. However, most existing methods employ an oversized denoising decoder. While increasing model capacity can improve denoising, empirical evidence suggests that it also introduces redundancy and noise in intermediate feature blocks. Crucially, we find that randomly masking backbone features in U-Net or skipping intermediate layers in DiT at inference time (without changing training) can improve performance, confirming the presence of task-irrelevant noise in intermediate features. To this end, we propose Variational Regularization (VR), a plug-and-play module that imposes a context-conditioned Gaussian over the noisy features and applies a KL-divergence regularizer, forming an adaptive information bottleneck. Extensive experiments on three simulation benchmarks, RoboTwin2.0, Adroit, and MetaWorld, show that our approach consistently improves task success rates over the baseline for both DP3-UNet and DP3-DiT, achieving new state-of-the-art results. Real-world experiments further demonstrate that our method performs well in practical deployments.
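
The abstract describes the bottleneck only at a high level, so the following is a rough sketch of what a Gaussian layer with a KL penalty can look like, not the authors' VR module. The linear heads `w_mu` and `w_logvar`, the concatenation of features with context, and the standard normal reference distribution are all assumptions made for illustration.

```python
import numpy as np


def gaussian_kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over the last axis."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)


def variational_layer(features, context, w_mu, w_logvar, rng):
    """Hypothetical bottleneck: predict a Gaussian from features and context,
    sample from it with the reparameterization trick, and return the sample
    together with its KL penalty."""
    h = np.concatenate([features, context], axis=-1)
    mu = h @ w_mu                  # predicted mean
    log_var = h @ w_logvar         # predicted log-variance
    noise = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * noise
    return z, gaussian_kl_to_standard_normal(mu, log_var)
```

In training, the KL term would be added to the policy loss with a small weight, so that the layer is pushed to pass on only the information the denoiser needs.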

2601.21739 2026-05-12 cs.LG cs.AI stat.ML

Why Adam Works Better with $β_1 = β_2$: The Missing Gradient Scale Invariance Principle

Alberto Fernández-Hernández, Cristian Pérez-Corral, Jose I. Mestre, Manuel F. Dolz, Enrique S. Quintana-Ortí

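Affiliations: Universitat Politècnica de València; Universitat Jaume I

AI summary: This paper examines the previously unexplained observation that Adam performs better when its momentum parameters satisfy $β_1 = β_2$. The authors formalize a structural property called gradient scale invariance and prove that Adam is gradient scale invariant of first order if and only if $β_1 = β_2$. The result explains Adam's behavior in the balanced regime and offers guidance for designing more robust optimizers.

Comments: 23 pages, 8 figures. Preprint

Abstract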

Adam has been at the core of large-scale training for almost a decade, yet a simple empirical fact remains unaccounted for: both validation scores and the qualitative behaviour of the training runs improve when the momentum parameters satisfy $β_{1}=β_{2}$. Some recent studies have reported this pattern, but there is still no explanation for why this choice helps. We show that this choice is closely tied to a structural property that we refer to as gradient scale invariance. We formalize this notion and prove that Adam becomes gradient scale invariant of first order if and only if $β_{1}=β_{2}$. This perspective places the balanced regime of Adam in direct alignment with the design principles underlying several recent optimizers that explicitly enforce scale-robust updates. The theory is supported by experiments across vision and language tasks, and across different architectural families, in which rescaling the gradient has a markedly smoother effect on the update when $β_{1}=β_{2}$. Overall, our results offer a coherent explanation for an open question in the behavior of Adam and provide a simple principle that helps guide the design of future optimizers.
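
A toy calculation gives some intuition for why the two coefficients interact. It is not the paper's formal definition of gradient scale invariance: it assumes a scalar parameter, a long run of constant gradient so that both moving averages have converged, and no bias correction or epsilon. The function names are hypothetical. Under those assumptions, rescaling the newest gradient by $1+δ$ changes the ratio $m/\sqrt{v}$ by approximately $(β_2 - β_1)\,δ$, so the first-order effect vanishes when $β_1 = β_2$.

```python
def adam_direction(g_prev, g_now, beta1, beta2):
    """Adam's m / sqrt(v) after the moving averages have converged on a
    constant gradient g_prev and then receive one new gradient g_now.
    Bias correction and epsilon are omitted (toy setting)."""
    m = beta1 * g_prev + (1.0 - beta1) * g_now
    v = beta2 * g_prev ** 2 + (1.0 - beta2) * g_now ** 2
    return m / v ** 0.5


def scale_sensitivity(beta1, beta2, delta=1e-4, g=1.0):
    """Finite-difference slope of the update direction with respect to a
    relative rescaling (1 + delta) of the newest gradient."""
    base = adam_direction(g, g, beta1, beta2)
    perturbed = adam_direction(g, g * (1.0 + delta), beta1, beta2)
    return (perturbed - base) / delta
```

Calling `scale_sensitivity(0.9, 0.999)` gives a slope close to the difference of the two coefficients, while `scale_sensitivity(0.95, 0.95)` gives a slope close to zero.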

2601.20756 2026-05-12 cs.LG stat.ML

Supervised Guidance Training for Infinite-Dimensional Diffusion Models

Elizabeth L. Baker, Alexander Denker, Jes Frellsen

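Affiliations: Department of Applied Mathematics and Computer Science, Technical University of Denmark, Denmark

AI summary: This paper studies supervised guidance training of diffusion models in infinite-dimensional function spaces, targeting Bayesian inverse problems arising from partial differential equations. The authors condition the models with an infinite-dimensional Doob $h$-transform, decompose the conditional score into an unconditional score and a guidance term, and design a simulation-free score matching objective (Supervised Guidance Training) for efficient and stable posterior sampling. The authors describe it as the first function-space method for fine-tuning diffusion models to sample accurately from a posterior.

Abstract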

Score-based diffusion models have recently been extended to infinite-dimensional function spaces, with uses such as inverse problems arising from partial differential equations. In the Bayesian formulation of inverse problems, the aim is to sample from a posterior distribution over functions obtained by conditioning a prior on noisy observations. While diffusion models provide expressive priors in function space, the theory of conditioning them to sample from the posterior remains open. We address this, assuming that either the prior lies in the Cameron-Martin space, or is absolutely continuous with respect to a Gaussian measure. We prove that the models can be conditioned using an infinite-dimensional extension of Doob's $h$-transform, and that the conditional score decomposes into an unconditional score and a guidance term. As the guidance term is intractable, we propose a simulation-free score matching objective (called Supervised Guidance Training) enabling efficient and stable posterior sampling. We illustrate the theory with numerical examples on Bayesian inverse problems in function spaces. In summary, our work offers the first function-space method for fine-tuning trained diffusion models to accurately sample from a posterior.

2601.20164 2026-05-12 cs.LG cs.AI cs.CL

What's the plan? Metrics for implicit planning in LLMs and their application to rhyme generation and question answering

Jim Maar, Denis Paperno, Callum Stuart McDougall, Neel Nanda

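Affiliations: HPI / University of Potsdam; Utrecht University; Google DeepMind

AI summary: This paper studies implicit planning in large language models (LLMs): when generating text, a model may choose tokens in preparation for a word it expects later, such as a rhyme word or the answer to a question. The authors propose simple methods for assessing implicit planning and demonstrate their broad applicability through case studies on rhyme generation and question answering. They find implicit planning even in small models (around 1B parameters), which bears on understanding planning in language models and on AI safety and control.

Comments: 41 pages, 34 figures, Accepted at ICLR 2026, Code available at https://github.com/Jim-Maar/implicit-planning-in-llms

Abstract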

Prior work suggests that language models, while trained on next token prediction, show implicit planning behavior: they may select the next token in preparation for a predicted future token, such as a likely rhyming word, as supported by a prior qualitative study of Claude 3.5 Haiku using a cross-layer transcoder. We propose much simpler techniques for assessing implicit planning in language models. With case studies on rhyme poetry generation and question answering, we demonstrate that our methodology easily scales to many models. Across models, we find that the generated rhyme (e.g. "-ight") or answer to a question ("whale") can be manipulated by steering at the end of the preceding line with a vector, affecting the generation of intermediate tokens leading up to the rhyme or answer word. We show that implicit planning is a universal mechanism, present in smaller models than previously thought, starting from 1B parameters. Our methodology offers a widely applicable direct way to study implicit planning abilities of LLMs. More broadly, understanding planning abilities of language models can inform decisions in AI safety and control.

2601.19914 2026-05-12 cs.CL cs.AI cs.SE

Simulating Complex Multi-Turn Tool Calling Interactions in Stateless Execution Environments

Maxwell Crouse, Ibrahim Abdelaziz, Kshitij Fadnis, Siva Sankalp Patel, Kinjal Basu, Chulaka Gunasekara, Sadhana Kumaravel, Asim Munawar, Pavan Kapanipathi

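Affiliations: IBM Research AI

AI summary: This study addresses generating complex multi-turn tool-calling conversations in stateless execution environments. Prior methods usually assume an execution environment that maintains state, an assumption that fails in settings such as security-sensitive enterprises or tool specifications synthesized from multiple sources. The authors propose DiGiT-TC, a data generation method whose novel generation pattern implicitly represents tool calls in the user request, producing conversations in a stateless environment that resemble those generated in a stateful one. Experiments on standard benchmarks show strong performance gains, even in stateful problem settings.

Abstract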

Synthetic data has proven itself to be a valuable resource for tuning smaller, cost-effective language models to handle the complexities of multi-turn tool calling conversations. While many frameworks and systems for producing synthetic multi-turn tool calling data have been proposed, prior works have frequently assumed that any tool calling interactions will take place in an execution environment that maintains state. When such an environment is available, this is advantageous as it allows for the validity of an interaction to be determined by whether or not the state of the execution environment matches some prespecified objective. Unfortunately, this does not hold in many real-world tool use settings, e.g., in enterprise settings where data security is of the utmost importance or in cases where tool specifications are synthesized from multiple sources. In this work, we address this gap by introducing a data generation method, DiGiT-TC, that is designed to produce tool calling conversations that have the characteristics of conversations generated through search in a stateful environment. The key to our technique lies in a novel generation pattern that allows our approach to implicitly represent certain tool calls in the user request. We validate our approach on standard tool calling benchmarks and demonstrate that, even in stateful problem settings, our approach results in strong performance gains.

2601.16097 2026-05-12 cs.CL

Incremental Multilingual Text2Cypher with Adapter Combination

Makbule Gulcin Ozsoy

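Affiliations: Neo4j, London, UK

AI summary: This study develops a scalable multilingual Text2Cypher system that supports new languages without re-running full fine-tuning, improving multilingual database access. It trains language-specific LoRA adapters and combines them through uniform linear merging or a fusion MLP with dynamic gating. Experiments show performance approaching joint multilingual fine-tuning with less data, along with support for incremental language expansion, giving a practical balance of performance and data efficiency.

Abstract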

Large Language Models enable users to access databases through natural language interfaces with tools like Text2SQL, Text2SPARQL, and Text2Cypher, which translate user questions into structured database queries. While these systems improve database accessibility, most research focuses on English with limited multilingual support. This work investigates a scalable multilingual Text2Cypher, aiming to support new languages without re-running full fine-tuning, avoiding manual hyper-parameter tuning, and maintaining performance close to joint multilingual fine-tuning. We train language-specific LoRA adapters for English, Spanish, and Turkish and combine them via uniform linear merging or a learned fusion MLP with dynamic gating. Experimental results show that the fusion MLP recovers around 75% of the accuracy gains from joint multilingual fine-tuning while requiring only a smaller subset of the data, outperforming linear merging across all three languages. This approach enables incremental expansion to new languages by requiring only one LoRA adapter and a lightweight MLP retraining. Learned adapter fusion offers a practical alternative to expensive joint fine-tuning, balancing performance, data efficiency, and scalability for the multilingual Text2Cypher task.
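
The two combination strategies named in the abstract can be illustrated on raw weight updates. This is a simplified sketch under stated assumptions, not the paper's implementation: each adapter is a hypothetical `(A, B)` pair whose update is `B @ A`, and the gate weights are passed in as plain logits, whereas the paper's fusion MLP would produce them from the input.

```python
import numpy as np


def lora_delta(a, b):
    """Weight update of one LoRA adapter: B (d_out, r) times A (r, d_in)."""
    return b @ a


def uniform_merge(adapters):
    """Uniform linear merging: the plain average of the adapter updates."""
    return sum(lora_delta(a, b) for a, b in adapters) / len(adapters)


def gated_merge(adapters, gate_logits):
    """Gated merging: a softmax over per-adapter logits weights each update.
    The logits stand in for the output of a learned gating network."""
    logits = np.asarray(gate_logits, dtype=float)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return sum(w * lora_delta(a, b) for w, (a, b) in zip(weights, adapters))
```

With equal logits the gated merge reduces to the uniform one. Adding a language under this scheme means adding one `(A, B)` pair and one more gate output, which matches the incremental expansion the abstract describes.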

2601.15686 2026-05-12 cs.LG

Beyond Hard Writes and Rigid Preservation: Soft Recursive Least-Squares for Lifelong LLM Editing

Xinyu Wang, Sicheng Lyu, Yu Gu, Jerry Huang, Peng Lu, Yufei Cui, Xiao-Wen Chang

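Affiliations: McGill University; Mila–Quebec AI Institute; Université de Montréal

AI summary: The paper studies how to apply long, sequential streams of fact or rule edits to a pre-trained large language model without retraining, addressing the tension between accumulated interference and behavioral stability. It proposes RLSEdit, a recursive least-squares editor that casts editing as online quadratic optimization with soft constraints and regularizers controlling deviation from the pre-trained weights and from an anchor mapping, and that admits an efficient online recursion. Experiments across several models and datasets show that it handles large numbers of edits stably, outperforming existing methods in edit success and overall stability while retaining early edits and general capabilities.

Abstract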

Model editing updates a pre-trained LLM with new facts or rules without retraining while preserving unrelated behavior. In real deployment, edits arrive as long streams, creating a plasticity-stability dilemma: repeated locate-then-edit "hard writes" can accumulate interference over time, while rigid preservation constraints may protect only explicitly constrained directions, allowing past edits or unconstrained behaviors to deviate. We propose RLSEdit, a recursive least-squares editor for long sequential editing. RLSEdit formulates editing as an online quadratic optimization with soft constraints, minimizing a cumulative key-value fitting objective together with two regularizers that control deviation from the pre-trained weights and from a designated anchor mapping. This objective admits an efficient Woodbury-based online recursion, with per-edit cost independent of history length and scaling only with the current edit size. We further provide deviation bounds and an asymptotic characterization of the adherence-preservation trade-off in the many-edits regime. Experiments on CounterFact and ZsRE across multiple model families show stable scaling to 10K edits, outperforming strong baselines in both edit success and holistic stability, while retaining early edits and preserving general capabilities on GLUE and held-out reasoning/code benchmarks.
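
The Woodbury-based recursion mentioned above can be sketched with textbook recursive least squares. This is a simplified illustration and not RLSEdit itself: it keeps only one regularizer (deviation from the starting weights `w0`) and omits the anchor-mapping term; the class name and the `reg` parameter are hypothetical.

```python
import numpy as np


class RecursiveLeastSquaresEditor:
    """Online minimizer of  sum_i ||W K_i - V_i||^2 + reg * ||W - W0||^2.

    Each edit touches only a (b x b) system, where b is the number of keys
    in the current edit, so the cost does not grow with the edit history.
    """

    def __init__(self, w0, reg=1.0):
        self.w = np.array(w0, dtype=float)
        # Inverse of (reg * I + sum of K K^T), maintained incrementally.
        self.p = np.eye(self.w.shape[1]) / reg

    def edit(self, keys, values):
        """keys: (d_in, b) key vectors; values: (d_out, b) desired outputs."""
        pk = self.p @ keys                                 # (d_in, b)
        gram = np.eye(keys.shape[1]) + keys.T @ pk         # (b, b)
        gain = np.linalg.solve(gram, pk.T)                 # (b, d_in)
        self.w = self.w + (values - self.w @ keys) @ gain  # soft write
        self.p = self.p - pk @ gain                        # Woodbury update
        return self.w
```

After any number of calls to `edit`, the weights equal the batch solution `(reg * W0 + sum V K^T) @ inv(reg * I + sum K K^T)`, which is a convenient way to check the recursion.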

2601.15599 2026-05-12 cs.AI

Autonomous Business System via Neuro-symbolic AI

Cecil Pang, Hiroki Sayama

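Affiliations: School of Systems Science and Industrial Engineering, Binghamton University, State University of New York; AI Engineering, USA TODAY Co., Inc.; Binghamton Center of Complex Systems, Binghamton University, State University of New York; Waseda Innovation Lab, Waseda University

AI summary: Modern business environments require continuous adjustment of cross-functional processes, yet existing enterprise systems are mostly organized around siloed departments, rigid workflows, and hard-coded automation. This paper proposes the Autonomous Business System (AUTOBUS), a neuro-symbolic architecture that integrates large language models, predicate-logic programming, and business-semantics-centric data to execute end-to-end business tasks. The system organizes enterprise data as a knowledge graph, has AI agents generate task-specific logic programs, and runs them on a logic engine that ensures deterministic execution and semantic consistency, improving the flexibility and auditability of business processes.

Comments: IEEE SysCon 2026

Journal ref: 2026 IEEE International Systems Conference (SysCon), Halifax, NS, Canada, 2026, pp. 1-8

Abstract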

Modern business environments demand continuous reconfiguration of cross-functional processes, yet most enterprise systems remain organized around siloed departments, rigid workflows, and hard-coded automation. Meanwhile, large language models (LLMs) demonstrate strong capabilities in interpreting natural language and synthesizing unstructured information, but they lack deterministic, auditable execution of complex business logic. We introduce Autonomous Business System (AUTOBUS), a system that integrates LLM-based AI agents, predicate-logic programming, and business-semantics-centric enterprise data into a unified neuro-symbolic architecture for executing end-to-end business initiatives. AUTOBUS models a business initiative as a network of interrelated tasks with explicit pre- and post-conditions, required data, evaluation rules, and API-level actions. Enterprise data is organized as a knowledge graph, whose entities, relationships, and constraints are translated into logic facts and foundational rules that ground reasoning and ensure semantic consistency. Core AI agents synthesize task instructions, enterprise semantics, and available tools into task-specific logic programs, which are executed by a logic engine that enforces constraints, coordinates auxiliary tools, and produces deterministic outcomes. Humans specify task instructions, define and maintain business semantics and policies, curate tools, and supervise high-impact or ambiguous decisions, ensuring accountability and adaptability. We detail the AUTOBUS architecture, the structure of AI-generated logic programs, and the human-AI collaboration model and present a case study that demonstrates accelerated time to market in a data-rich organization. A reference implementation of the case study is available at https://github.com/cecilpang/autobus-paper.

2601.12374 2026-05-12 cs.CL cs.AI

A Scalable Entity-Based Framework for Auditing Bias in LLMs

Akram Elbouanani, Aboubacar Tuo, Adrian Popescu

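Affiliations: Université Paris-Saclay, CEA, List

AI summary: This paper proposes a scalable entity-based framework for auditing bias in large language models. It uses named entities as controlled probes and synthetic data to generate diverse, controlled inputs, systematically measuring behavioral disparities across entity types, tasks, languages, and prompting strategies. The study finds consistent bias patterns in political leaning, country preference, and industry, and reports that increasing model scale amplifies bias while instruction tuning reduces but does not eliminate it. The framework is publicly available to support future work.

Abstract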

Existing approaches to bias evaluation in large language models (LLMs) trade ecological validity for statistical control, relying either on artificial prompts that poorly reflect real-world use or on naturalistic tasks that lack scale and rigor. We introduce a scalable bias-auditing framework that uses named entities as controlled probes to measure systematic disparities in model behavior. Synthetic data enables us to construct diverse, controlled inputs, and we show that it reliably reproduces bias patterns observed in natural text, supporting its use for large-scale analysis. Using this framework, we conduct the largest bias audit to date, comprising 1.9 billion data points across multiple entity types, tasks, languages, models, and prompting strategies. We find consistent patterns: models penalize right-wing politicians and favor left-wing politicians, prefer Western and wealthier countries over the Global South, favor Western companies, and penalize firms in the defense and pharmaceutical sectors. While instruction tuning reduces bias, increasing model scale amplifies it, and prompting in Chinese or Russian does not mitigate Western-aligned preferences. These findings highlight the need for systematic bias auditing before deploying LLMs in high-stakes applications. Our framework is extensible to other domains and tasks, and we make it publicly available to support future work.

2601.08321 2026-05-12 cs.CV

UM-Text: A Unified Multimodal Model for Image Understanding and Visual Text Editing

Lichen Ma, Xiaolong Fu, Gaojing Zhou, Zipeng Guo, Ting Zhu, Yichun Liu, Yu Shi, Jason Li, Junshi Huang

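Affiliations: Sun Yat-sen University

AI summary: With rapid progress in image generation, visual text editing from natural language instructions has drawn growing attention. The core challenge is to understand the instruction and reference image accurately and to generate visual text whose style is consistent with the image. The paper proposes UM-Text, a unified multimodal model that uses a Visual Language Model (VLM) and a UM-Encoder to design text content and layout, improves generation with a regional consistency loss and a three-stage training strategy, and contributes UM-DATA-200K, a large-scale visual text image dataset.

Comments: Accepted by AAAI 2026

Abstract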

With the rapid advancement of image generation, visual text editing using natural language instructions has received increasing attention. The main challenge of this task is to fully understand the instruction and reference image, and thus generate visual text that is style-consistent with the image. Previous methods often involve complex steps of specifying the text content and attributes, such as font size, color, and layout, without considering the stylistic consistency with the reference image. To address this, we propose UM-Text, a unified multimodal model for context understanding and visual text editing by natural language instructions. Specifically, we introduce a Visual Language Model (VLM) to process the instruction and reference image, so that the text content and layout can be elaborately designed according to the context information. To generate an accurate and harmonious visual text image, we further propose the UM-Encoder to combine the embeddings of various condition information, where the combination is automatically configured by VLM according to the input instruction. During training, we propose a regional consistency loss to offer more effective supervision for glyph generation on both latent and RGB space, and design a tailored three-stage training strategy to further enhance model performance. In addition, we contribute the UM-DATA-200K, a large-scale visual text image dataset on diverse scenes for model training. Extensive qualitative and quantitative results on multiple public benchmarks demonstrate that our method achieves state-of-the-art performance.

2601.03042 2026-05-12 cs.CL

BaseCal: Unsupervised Confidence Calibration via Base Model Signals

Hexiang Tan, Wanli Yang, Junwei Zhang, Xin Chen, Rui Tang, Du Su, Jingang Wang, Yuanzhuo Wang, Fei Sun, Xueqi Cheng

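Affiliations * State Key Laboratory of AI Safety, Institute of Computing Technology, CAS

AI summary This work targets the overconfidence that post-trained large language models (PoLLMs) often show in deployment and proposes BaseCal, an unsupervised confidence calibration method. Using the corresponding base LLM as a reference, BaseCal comes in two variants: one re-evaluates the PoLLM's responses with the base model to obtain confidence, and the other trains a lightweight projection that maps the PoLLM's final-layer hidden states to those of the base model to produce calibrated confidence. Experiments show that BaseCal reduces Expected Calibration Error (ECE) across multiple datasets and model families.

Comments ACL 2026 Main

Abstract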

Reliable confidence is essential for trusting the outputs of LLMs, yet widely deployed post-trained LLMs (PoLLMs) typically compromise this trust with severe overconfidence. In contrast, we observe that their corresponding base LLMs often remain well-calibrated. This naturally motivates us to calibrate PoLLM confidence using the base LLM as a reference. This work proposes two ways to achieve this. A straightforward solution, BaseCal-ReEval, evaluates PoLLM's responses by feeding them into the base LLM to get average probabilities as confidence. While effective, this approach introduces additional inference overhead. To address this, we propose BaseCal-Proj, which trains a lightweight projection to map the final-layer hidden states of PoLLMs back to those of their base LLMs. These projected states are then processed by the base LLM's output layer to derive base-calibrated confidence for PoLLM's responses. Notably, BaseCal is an unsupervised, plug-and-play solution that operates without human labels or LLM modifications. Experiments across five datasets and three LLM families demonstrate the effectiveness of BaseCal, reducing Expected Calibration Error (ECE) by an average of 42.90% compared to the best unsupervised baselines.
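
Illustrative sketch (not from the paper): the abstract reports results as Expected Calibration Error and describes confidence as an average of token probabilities. The snippet below shows the standard equal-width-bin ECE and one plausible reading of "average probabilities". The function names, the choice of 10 bins, and the use of per-token log-probabilities as input are assumptions made for illustration.

```python
import numpy as np

def mean_token_probability(token_logprobs):
    """Average per-token probability of a response (assumed confidence score)."""
    return float(np.exp(np.asarray(token_logprobs, dtype=float)).mean())

def expected_calibration_error(confidences, correct, n_bins=10):
    """Standard ECE: bin-size-weighted gap between accuracy and mean confidence."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Interior edges only, so bin ids fall in 0..n_bins-1 (1.0 lands in the last bin).
    bin_ids = np.digitize(confidences, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return float(ece)
```

A model that answers with 0.9 confidence but is right only three times out of four has an ECE of 0.15 under this definition; lower is better.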

2512.24601 2026-05-12 cs.AI cs.CL

Recursive Language Models

Alex L. Zhang, Tim Kraska, Omar Khattab

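Affiliations * MIT CSAIL

AI summary This paper studies how inference-time scaling can let large language models (LLMs) process arbitrarily long prompts. The authors propose Recursive Language Models (RLMs), which treat a long prompt as part of an external environment and let the model programmatically examine it, decompose it, and recursively call itself. Experiments show that RLMs handle inputs up to two orders of magnitude beyond the model context window and, at comparable cost, substantially outperform existing frontier models on several long-context tasks. The authors also post-train the first RLM-based model, RLM-Qwen3-8B, which outperforms its base model on several long-context tasks and approaches GPT-5.

Comments 9 pages, 43 with Appendix

Abstract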

We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference paradigm that treats long prompts as part of an external environment and allows the LLM to programmatically examine, decompose, and recursively call itself over snippets of the prompt. We find that RLMs can successfully process inputs up to two orders of magnitude beyond model context windows and, even for shorter prompts, dramatically outperform the quality of vanilla frontier LLMs and common long-context and coding scaffolds (e.g., on GPT-5 by a median across the evaluated benchmarks of 26% against compaction, 130% against CodeAct with sub-calls, and 13% against Claude Code) across four diverse long-context tasks while having comparable cost. At a small scale, we post-train the first model around the RLM. Our model, RLM-Qwen3-8B, outperforms the underlying Qwen3-8B model by 28.3% on average and even approaches the quality of vanilla GPT-5 on three long-context tasks. Code is available at https://github.com/alexzhang13/rlm.
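
Illustrative sketch (not the authors' implementation): in the paper the model itself decides, programmatically, how to inspect and split the prompt. The snippet below is a much simpler stand-in that uses a fixed split-in-half rule, only to show the recursive call pattern over snippets of a long context. The `llm` callable, the prompt wording, and `max_chars` are all hypothetical.

```python
from typing import Callable

def recursive_answer(query: str, context: str,
                     llm: Callable[[str], str], max_chars: int = 2000) -> str:
    """Answer `query` over `context`, recursing whenever the context is too long."""
    if max_chars < 1:
        raise ValueError("max_chars must be at least 1")
    # Base case: the snippet fits, so call the (hypothetical) model directly.
    if len(context) <= max_chars:
        return llm(f"Context:\n{context}\n\nQuestion: {query}")
    # Recursive case: split the context, answer each half, then merge the answers.
    mid = len(context) // 2
    left = recursive_answer(query, context[:mid], llm, max_chars)
    right = recursive_answer(query, context[mid:], llm, max_chars)
    merged = f"Partial answer 1: {left}\nPartial answer 2: {right}"
    return llm(f"Combine the partial answers.\n{merged}\n\nQuestion: {query}")
```

With a 5,000-character context and `max_chars=2000`, this makes four leaf calls and three merge calls. The actual code is linked in the abstract.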

2512.23964 2026-05-12 cs.LG cs.AI

DUALFloodGNN: Physics-informed Graph Neural Network for Operational Flood Modeling

Carlo Malapad Acosta, Herath Mudiyanselage Viraj Vidura Herath, Jia Yu Lim, Abhishek Saha, Sanka Rasnayaka, Lucy Marshall

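Affiliations * Department of Computer Science, School of Computing, National University of Singapore; School of Civil Engineering, Faculty of Engineering, The University of Sydney; Delft Institute of Applied Mathematics, Delft University of Technology

AI summary The paper proposes DUALFloodGNN, a physics-informed graph neural network for operational flood modeling. The model embeds physical constraints at global and local scales through explicit loss terms and jointly predicts water volume at nodes and flow along edges. Compared with standard GNNs and existing GNN flood models, DUALFloodGNN predicts hydrologic variables (such as water volume, flow, and depth) more accurately while maintaining high computational efficiency, and its fast predictions suit real disaster-management settings.

Comments Accepted for publication at the IJCAI-ECAI 2026 AI4Tech track

Abstract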

Flood models inform strategic disaster management by simulating the spatiotemporal hydrodynamics of flooding. While physics-based numerical flood models are accurate, their substantial computational cost limits their use in operational settings where rapid predictions are essential. Models designed with graph neural networks (GNNs) provide both speed and accuracy while having the ability to process unstructured spatial domains. Given their flexible input and architecture, GNNs can be leveraged alongside physics-informed techniques with ease, significantly improving interpretability and generalizability. We introduce a novel flood GNN architecture, DUALFloodGNN, which embeds physical constraints at both global and local scales through explicit loss terms. The model jointly predicts water volume at nodes and flow along edges through a shared message-passing framework. To improve performance for autoregressive inference, model training is conducted with a multi-step loss enhanced with dynamic curriculum learning. Compared with standard GNN architectures and state-of-the-art GNN flood models, DUALFloodGNN achieves substantial improvements in predicting multiple hydrologic variables (e.g., water volume, flow, and depth) while maintaining high computational efficiency. The model is open sourced at https://github.com/acostacos/dual_flood_gnn. The dataset is open sourced at https://hdl.handle.net/2123/35293 with the DOI 10.25910/9xav-0s86.
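
Illustrative sketch (not the paper's loss): the abstract does not spell out the physical constraints, so the snippet below assumes a generic mass-conservation (continuity) residual. Locally, each node's change in water volume should match its net edge inflow plus rainfall. Globally, internal edge flows cancel, so the total volume change should match total rainfall. All names, and the absence of boundary outflow, are assumptions.

```python
import numpy as np

def local_mass_residual(vol_prev, vol_next, edge_index, edge_flow, rainfall, dt):
    """Per-node continuity residual: dV - dt * (net edge inflow + rainfall)."""
    src, dst = edge_index
    net_inflow = np.zeros_like(vol_prev, dtype=float)
    np.add.at(net_inflow, dst, edge_flow)    # flow arriving at the destination node
    np.add.at(net_inflow, src, -edge_flow)   # the same flow leaving the source node
    return (vol_next - vol_prev) - dt * (net_inflow + rainfall)

def physics_losses(vol_prev, vol_next, edge_index, edge_flow, rainfall, dt):
    """Return (local_loss, global_loss) as squared continuity residuals."""
    res = local_mass_residual(vol_prev, vol_next, edge_index, edge_flow, rainfall, dt)
    local_loss = float(np.mean(res ** 2))    # penalises imbalance at every node
    global_loss = float(res.sum() ** 2)      # penalises imbalance of the whole domain
    return local_loss, global_loss
```

In training, terms like these would be added to the data loss with tunable weights; the linked repository holds the authors' formulation.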

2512.19995 2026-05-12 cs.CL cs.AI cs.LG

Schoenfeld's Anatomy of Mathematical Reasoning by Language Models

Ming Li, Chenrui Fan, Yize Cheng, Soheil Feizi, Tianyi Zhou

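Affiliations * University of Maryland, College Park

AI summary This study examines the structure of thought that large language models exhibit during mathematical reasoning. Using Schoenfeld's Episode Theory as the analytical framework, it introduces ThinkARM, a scalable method that abstracts reasoning traces into explicit reasoning steps such as Analysis, Explore, and Verify. The method reveals dynamic characteristics and structural differences in how different models reason, and case studies show that exploration steps are critical to correctness and that efficiency-oriented methods suppress evaluative feedback steps rather than simply shortening responses. The work offers a new perspective for systematically analyzing the reasoning structure of language models.

Comments ACL2026, camera-ready

Abstract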

Large language models increasingly expose reasoning traces, yet their underlying cognitive structure and steps remain difficult to identify and analyze beyond surface-level statistics. We adopt Schoenfeld's Episode Theory as an inductive, intermediate-scale lens and introduce ThinkARM (Anatomy of Reasoning in Models), a scalable framework that explicitly abstracts reasoning traces into functional reasoning steps such as Analysis, Explore, Implement, Verify, etc. When applied to mathematical problem solving by diverse models, this abstraction reveals reproducible thinking dynamics and structural differences between reasoning and non-reasoning models, which are not apparent from token-level views. We further present two diagnostic case studies showing that exploration functions as a critical branching step associated with correctness, and that efficiency-oriented methods selectively suppress evaluative feedback steps rather than uniformly shortening responses. Together, our results demonstrate that episode-level representations make reasoning steps explicit, enabling systematic analysis of how reasoning is structured, stabilized, and altered in modern language models.

2512.17593 2026-05-12 cs.LG math.OC

A Unified Representation of Neural Networks Architectures

Christophe Prieur, Mircea Lazar, Bogdan Robu

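Affiliations * Univ. Grenoble Alpes, CNRS, Grenoble INP; Eindhoven University of Technology, Electrical Engineering, Control Systems

AI summary The paper studies the limit of neural network architectures as the number of neurons per hidden layer and the number of hidden layers tend to infinity, formalizes the limit as a continuum, and derives the corresponding approximation errors. The authors first treat single-hidden-layer networks and propose a generalized infinite-width integral representation, then extend it to deep residual continuous neural networks (CNNs) with a finite number of integral hidden layers and residual connections. Combining this with the relation between neural ODEs and deep residual networks, they propose a unified Distributed Parameter neural Network (DiPaNet) representation and show that most existing finite- and infinite-dimensional architectures relate to it through homogenization or discretization, offering a new perspective for the theoretical analysis of neural networks.

Comments Typographical corrections and additional clarifications, remarks; few new relevant references added and acknowledgements; main results unchanged

Abstract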

In this paper we consider the limiting case of neural network (NN) architectures when the number of neurons in each hidden layer and the number of hidden layers tend to infinity, thus forming a continuum, and we derive approximation errors as a function of the number of neurons and/or hidden layers. First, we consider the case of neural networks with a single hidden layer and we derive an integral infinite width neural representation that generalizes existing continuous neural network (CNN) representations. Then we extend this to deep residual CNNs that have a finite number of integral hidden layers and residual connections. Second, we revisit the relation between neural ODEs and deep residual NNs and we formalize approximation errors via discretization techniques. Then, we merge these two approaches into a unified homogeneous representation of NNs as a Distributed Parameter neural Network (DiPaNet) and we show that most of the existing finite and infinite-dimensional NN architectures are related via homogenization/discretization with the DiPaNet representation. Our approach is purely deterministic and applies to general, uniformly continuous matrix weight functions. Relations with neural fields and other neural integro-differential equations are discussed along with further possible generalizations and applications of the DiPaNet framework.
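
Illustrative sketch (not from the paper): the relation between neural ODEs and deep residual networks mentioned in the abstract is commonly explained by reading a residual network with step size h = T / L as a forward-Euler discretization of dx/dt = tanh(W(t) x). The snippet below assumes that reading. The tanh activation, the absence of biases, and the `weight_fn` interface are assumptions.

```python
import numpy as np

def residual_forward(x0, weight_fn, n_layers, t_final=1.0):
    """Residual network viewed as forward Euler on dx/dt = tanh(W(t) x).

    weight_fn(t) returns the weight matrix at depth-time t; layer k uses W(k * h).
    """
    h = t_final / n_layers              # step size shrinks as the network deepens
    x = np.asarray(x0, dtype=float)
    for k in range(n_layers):
        W = weight_fn(k * h)            # sample the continuous weight function
        x = x + h * np.tanh(W @ x)      # residual update = one Euler step
    return x
```

For a continuous `weight_fn`, the output of a 100-layer network lies closer to the many-layer limit than the output of a 10-layer network, which is the kind of discretization error the abstract refers to.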

2512.15977 2026-05-12 cs.CV

Are vision-language models ready to zero-shot replace supervised classification models in agriculture?

Earl Ranario, Mason J. Earles

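Affiliations * University of California, Davis

AI summary The study evaluates a range of open-source and closed-source vision-language models (VLMs) on agricultural image classification, covering 27 datasets, 162 classes, and 248,000 images. Zero-shot VLMs fall well short of the supervised baseline YOLO11 across all tasks, and open-ended prompting lowers accuracy further unless methods such as LLM-based semantic judging are applied. Although the best open-source model, Qwen-VL-72B, approaches closed-source performance, current VLMs are not yet suitable as standalone agricultural diagnostic systems and work better as assistive tools backed by constrained interfaces and domain knowledge.

Abstract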

Vision-language models (VLMs) are increasingly proposed as general-purpose solutions for visual recognition tasks, yet their reliability for agricultural decision support remains poorly understood. We benchmark a diverse set of open-source and closed-source VLMs on 27 agricultural image classification datasets from the AgML collection (https://github.com/Project-AgML), spanning 162 classes and 248,000 images across plant disease, pest and damage, and plant and weed species identification. Across all tasks, zero-shot VLMs substantially underperform a supervised task-specific baseline (YOLO11), which consistently achieves markedly higher accuracy than any foundation model. Under multiple-choice prompting, the best-performing VLM (Gemini-3 Pro) reaches approximately 62% average accuracy, while open-ended prompting yields much lower performance, with raw accuracies typically below 25%. Applying LLM-based semantic judging increases open-ended accuracy (e.g., from ~21% to ~30% for top models) and alters model rankings, demonstrating that evaluation methodology meaningfully affects reported conclusions. Among open-source models, Qwen-VL-72B performs best, approaching closed-source performance under constrained prompting but still trailing top proprietary systems. Task-level analysis shows that plant and weed species classification is consistently easier than pest and damage identification, which remains the most challenging category across models. Overall, these results indicate that current off-the-shelf VLMs are not yet suitable as standalone agricultural diagnostic systems, but can function as assistive components when paired with constrained interfaces, explicit label ontologies, and domain-aware evaluation strategies.

2512.13919 2026-05-12 cs.LG cs.NA math.NA

Adaptive digital twins for predictive decision-making: Online Bayesian learning of transition dynamics

Eugenio Varetti, Matteo Torzoni, Marco Tezzele, Andrea Manzoni

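Affiliations * MOX, Dipartimento di Matematica, Politecnico di Milano; Dipartimento di Ingegneria Civile e Ambientale, Politecnico di Milano; Mathematics Department, Emory University

AI summary The paper studies how adaptivity can increase the value that digital twins deliver in civil engineering, focusing on adapting the state transition models of digital twins represented as probabilistic graphical models. Dynamic Bayesian networks model the bidirectional interaction between the physical and virtual domains, and treating state transition probabilities as random variables with conjugate priors enables hierarchical online learning through Bayesian updates. The method widens the class of distributions covered by existing digital twin frameworks and solves parametric Markov decision processes with reinforcement learning, improving personalization, robustness, and cost-effectiveness. A case study on structural health monitoring and maintenance planning for a railway bridge assesses the approach.

Abstract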

This work shows how adaptivity can enhance value realization of digital twins in civil engineering. We focus on adapting the state transition models within digital twins represented through probabilistic graphical models. The bi-directional interaction between the physical and virtual domains is modeled using dynamic Bayesian networks. By treating state transition probabilities as random variables endowed with conjugate priors, we enable hierarchical online learning of transition dynamics from one state to another through effortless Bayesian updates. We provide the mathematical framework to account for a larger class of distributions than the current literature on digital twins considers. To compute dynamic policies with precision updates we solve parametric Markov decision processes through reinforcement learning. The proposed adaptive digital twin framework enjoys enhanced personalization, increased robustness, and improved cost-effectiveness. We assess our approach on a case study involving structural health monitoring and maintenance planning of a railway bridge.
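
Illustrative sketch (not the authors' framework): the abstract describes transition probabilities as random variables with conjugate priors, updated online. The simplest textbook instance is a Dirichlet prior on each row of a discrete transition matrix, where observing a transition just increments a count. The class name, the symmetric prior, and the restriction to the Dirichlet-categorical pair are assumptions; the paper states that it covers a larger class of distributions.

```python
import numpy as np

class DirichletTransitionModel:
    """Online Bayesian estimate of a discrete transition matrix.

    Row s holds the Dirichlet concentration parameters over next states
    given current state s.
    """

    def __init__(self, n_states, prior_concentration=1.0):
        self.alpha = np.full((n_states, n_states), float(prior_concentration))

    def update(self, state, next_state):
        # Conjugate update: one observed transition adds one pseudo-count.
        self.alpha[state, next_state] += 1.0

    def posterior_mean(self):
        # Expected transition probabilities under the current posterior.
        return self.alpha / self.alpha.sum(axis=1, keepdims=True)
```

Starting from a uniform prior over two states, three observed transitions 0 -> 1 move the posterior mean of row 0 from (0.5, 0.5) to (0.2, 0.8), while row 1 is unchanged.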

2512.13618 2026-05-12 cs.CL cs.LG

Temporal Tokenization Strategies for Event Sequence Modeling with Large Language Models

Zefang Liu, Nam H. Nguyen, Yinzhu Quan, Shi-Xiong Zhang

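Affiliations * Capital One; Georgia Institute of Technology

AI summary This paper addresses a key but under-explored question in modeling event sequences with large language models (LLMs): how to represent continuous time effectively. It systematically compares several temporal encoding strategies, including numeric strings, high-precision byte-level representations, calendar tokens, uniform binning, and adaptive residual scalar quantization, and finds that each performs differently on data with different statistical distributions. The study stresses that the tokenization strategy should match the statistical properties of the data, and identifies temporal tokenization as a critical yet often overlooked design dimension in LLM-based event modeling.

Abstract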

Representing continuous time is a critical and under-explored challenge in modeling temporal event sequences with large language models (LLMs). Various strategies like byte-level representations or calendar tokens have been proposed. However, the optimal approach remains unclear, especially given the diverse statistical distributions of real-world event data, which range from smooth log-normal to discrete, spiky patterns. This paper presents a systematic empirical study of temporal tokenization for modeling event sequences with LLMs, comparing distinct encoding strategies: naive numeric strings, high-precision byte-level representations, human-semantic calendar tokens, classic uniform binning, and adaptive residual scalar quantization. We evaluate these strategies by fine-tuning LLMs on real-world datasets that exemplify these diverse distributions. Our analysis reveals that no single strategy is universally superior; instead, prediction performance depends heavily on aligning the tokenizer with the data's statistical properties, highlighting temporal tokenization as a critical yet often overlooked design dimension in LLM-based event modeling.
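
Illustrative sketch (not the paper's tokenizers): two of the strategies named in the abstract can be shown in a few lines. Uniform binning maps each inter-event time to one of a fixed number of equal-width bins. Residual scalar quantization encodes a value coarse-to-fine, each stage quantizing what the previous stage left over. The token format `<t_bin_i>`, the fixed step sizes, and the function names are assumptions; the paper's adaptive variant is not specified in the abstract.

```python
import numpy as np

def uniform_bin_tokens(intervals, low, high, n_bins):
    """Map each time interval to an equal-width bin token (out-of-range values are clipped)."""
    edges = np.linspace(low, high, n_bins + 1)
    ids = np.clip(np.digitize(intervals, edges[1:-1]), 0, n_bins - 1)
    return [f"<t_bin_{i}>" for i in ids]

def residual_scalar_quantize(value, step_sizes):
    """Encode a non-negative value as one integer code per stage, coarse to fine."""
    codes, residual = [], float(value)
    for step in step_sizes:
        code = int(np.floor(residual / step))   # how many whole steps fit
        codes.append(code)
        residual -= code * step                 # pass the remainder to the next stage
    return codes, residual
```

For example, 7.25 with step sizes 4, 1, and 0.25 is encoded as codes [1, 3, 1] with zero remainder, so three small-vocabulary tokens recover the value exactly.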

2512.06949 2026-05-12 cs.CV

Can We Go Beyond Visual Features? Neural Tissue Relation Modeling for Relational Graph Analysis in Non-Melanoma Skin Histology

Shravan Venkatraman, Muthu Subash Kavitha, Joe Dhanith P R, V Manikandarajan, Jia Wu

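Affiliations * Mohamed bin Zayed University of AI; School of Information and Data Sciences; Vellore Institute of Technology; Loughborough University; MD Anderson Cancer Center, The University of Texas

AI summary In skin cancer diagnostics, histopathology image segmentation is essential for identifying tissue structures, but modeling spatial context and inter-tissue relationships remains difficult, especially where tissues overlap or look morphologically similar. The paper proposes Neural Tissue Relation Modeling (NTRM), a segmentation framework that adds a graph neural network to a convolutional neural network to model spatial and functional relationships between tissue types and so improve the structural coherence of the segmentation. On a non-melanoma skin cancer segmentation dataset, NTRM outperforms existing methods, with a Dice similarity coefficient 4.9% to 31.25% higher, showing the potential of relational modeling for more accurate and interpretable segmentation.

Comments CVPR 2026 Workshops

Abstract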

Histopathology image segmentation is essential for delineating tissue structures in skin cancer diagnostics, but modeling spatial context and inter-tissue relationships remains a challenge, especially in regions with overlapping or morphologically similar tissues. Current convolutional neural network (CNN)-based approaches operate primarily on visual texture, often treating tissues as independent regions and failing to encode biological context. To this end, we introduce Neural Tissue Relation Modeling (NTRM), a novel segmentation framework that augments CNNs with a tissue-level graph neural network to model spatial and functional relationships across tissue types. NTRM constructs a graph over predicted regions, propagates contextual information via message passing, and refines segmentation through spatial projection. Unlike prior methods, NTRM explicitly encodes inter-tissue dependencies, enabling structurally coherent predictions in boundary-dense zones. On the benchmark Histopathology Non-Melanoma Skin Cancer Segmentation Dataset, NTRM outperforms state-of-the-art methods, achieving a robust Dice similarity coefficient that is 4.9% to 31.25% higher than the best-performing models among the evaluated approaches. Our experiments indicate that relational modeling offers a principled path toward more context-aware and interpretable histological segmentation, compared to local receptive-field architectures that lack tissue-level structural awareness. Our code is available at https://github.com/shravan-18/NTRM.
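
Illustrative sketch (not from the paper): the headline metric in the abstract is the Dice similarity coefficient, 2|A ∩ B| / (|A| + |B|) for a predicted mask A and a reference mask B. The snippet below computes it for binary masks. The smoothing constant `eps` and the function name are assumptions; the paper's exact evaluation protocol (per class, per image, or pooled) is not stated in the abstract.

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice similarity between two binary masks; 1.0 means perfect overlap."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    intersection = np.logical_and(pred, target).sum()
    # eps keeps the ratio defined (and equal to 1) when both masks are empty.
    return float((2.0 * intersection + eps) / (pred.sum() + target.sum() + eps))
```

A prediction covering two pixels, one of which matches a one-pixel reference mask, scores 2 * 1 / (2 + 1), about 0.667.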

2512.06427 2026-05-12 cs.LG

A new initialisation to Control Gradients in Sinusoidal Neural network

Andrea Combette, Antoine Venaille, Nelly Pustelnik

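Affiliations * ENSL, CNRS UMR 5672

AI summary The paper proposes a new initialisation for neural networks with sinusoidal activation functions (such as SIREN) that aims to control gradients, mitigate vanishing or exploding gradients, and improve training and generalization. By analysing the convergence of the pre-activation distribution and of the variance of Jacobian sequences, the authors derive a closed-form initialisation that differs from the original SIREN scheme. Experiments show that it outperforms existing methods on function fitting and image reconstruction, including tasks involving physics-informed neural networks.

Abstract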

A proper initialisation strategy is of primary importance to mitigate gradient explosion or vanishing when training neural networks. Yet, the impact of initialisation parameters still lacks a precise theoretical understanding for several well-established architectures. Here, we propose a new initialisation for networks with sinusoidal activation functions such as SIREN, focusing on gradient control, their scaling with network depth, their impact on training and on generalization. To achieve this, we identify a closed-form expression for the initialisation of the parameters, differing from the original SIREN scheme. This expression is derived from fixed points obtained through the convergence of pre-activation distribution and the variance of Jacobian sequences. Controlling both gradients and targeting vanishing pre-activation helps prevent the emergence of inappropriate frequencies during estimation, thereby improving generalization. We further show that this initialisation strongly influences training dynamics through the Neural Tangent Kernel framework (NTK). Finally, we benchmark SIREN with the proposed initialisation against the original scheme and other baselines on function fitting and image reconstruction. The new initialisation consistently outperforms state-of-the-art methods across a wide range of reconstruction tasks, including those involving physics-informed neural networks.
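
Illustrative sketch of the baseline only: the abstract does not give the proposed closed-form expression, so the snippet below shows the original SIREN scheme that the paper compares against, as commonly implemented. First-layer weights are drawn from U(-1/n, 1/n) and later layers from U(-sqrt(6/n)/w0, sqrt(6/n)/w0), with n the fan-in and w0 = 30 by default. The function name and NumPy interface are assumptions.

```python
import numpy as np

def siren_uniform_init(fan_in, fan_out, is_first, w0=30.0, rng=None):
    """Weight matrix drawn with the original SIREN uniform initialisation."""
    rng = np.random.default_rng() if rng is None else rng
    if is_first:
        bound = 1.0 / fan_in                  # first layer: U(-1/n, 1/n)
    else:
        bound = np.sqrt(6.0 / fan_in) / w0    # hidden layers: scaled by 1/w0
    return rng.uniform(-bound, bound, size=(fan_out, fan_in))
```

A new scheme such as the one proposed in the paper would replace the two `bound` expressions while keeping the same sampling interface.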

2512.04949 2026-05-12 cs.LG cs.AI cs.CL

CARL: Criticality-Aware Agentic Reinforcement Learning

Leyang Shen, Yang Zhang, Chun Kai Ling, Xiaoyan Zhao, Tat-Seng Chua

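Affiliations * National University of Singapore, Singapore

AI summary The paper proposes CARL, a reinforcement learning algorithm that addresses the weakness of conventional policy optimization in multi-step tasks, where every step is assumed to contribute equally. CARL uses entropy as a proxy for state criticality and assigns rewards only to actions taken from critical states, which improves training efficiency and effectiveness. Experiments show that CARL achieves stronger performance and higher efficiency across diverse evaluation settings.

Comments 18 pages, 6 figures

Abstract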

Agents capable of accomplishing complex tasks through multiple interactions with the environment have emerged as a popular research direction. However, in such multi-step settings, the conventional group-level policy optimization algorithm becomes suboptimal because of its underlying assumption that each step holds equal contribution, which deviates significantly from reality. Our analysis reveals that only the action choices on a small fraction of states are critical in determining the final outcome. Building on this insight, we propose CARL, a criticality-aware reinforcement learning algorithm tailored for long-horizon agentic reasoning. CARL leverages entropy as a heuristic proxy for state criticality and achieves focused training by assigning rewards to actions taken from high-criticality states while excluding actions taken from low-criticality states from model updates, avoiding noisy credit assignment and redundant computation. Extensive experiments demonstrate that CARL achieves both stronger performance and higher efficiency across diverse evaluation settings. The source code will be publicly available.
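
Illustrative sketch (not the authors' algorithm): the abstract says CARL uses entropy as a heuristic proxy for state criticality and excludes low-criticality steps from model updates. The snippet below assumes one simple way to do that: compute the policy entropy at each step, keep the top fraction, and zero the advantages elsewhere. The `keep_fraction` rule, the function names, and the use of advantages as the update signal are assumptions.

```python
import numpy as np

def policy_entropy(step_probs):
    """Shannon entropy of the action distribution at each step."""
    p = np.clip(np.asarray(step_probs, dtype=float), 1e-12, 1.0)
    return -(p * np.log(p)).sum(axis=-1)

def criticality_mask(step_probs, keep_fraction=0.2):
    """Boolean mask selecting the highest-entropy (assumed most critical) steps."""
    ent = policy_entropy(step_probs)
    k = max(1, int(np.ceil(keep_fraction * len(ent))))
    threshold = np.sort(ent)[-k]          # entropy of the k-th most uncertain step
    return ent >= threshold

def masked_advantages(advantages, mask):
    """Zero the learning signal for steps outside the critical set."""
    return np.where(mask, np.asarray(advantages, dtype=float), 0.0)
```

Steps where the policy is nearly deterministic then contribute nothing to the gradient, which is the source of the efficiency gain the abstract describes.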

2511.23332 2026-05-12 cs.CV

UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, Jing Zhang

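Affiliations * Beijing Institute of Technology; Wuhan University; Zhongguancun Academy; Hong Kong Polytechnic University

AI summary The paper proposes UniGeoSeg, a unified open-world segmentation framework for remote sensing scenes that addresses the fragmented task formulations and limited instruction data of existing methods. The authors build the GeoSeg-1M dataset of image-mask-instruction triplets and design GeoSeg-Bench to evaluate contextual understanding and reasoning in complex geospatial scenes. UniGeoSeg combines task-aware text enhancement, latent knowledge memory, and a progressive training strategy for multi-task learning, performs strongly on multiple benchmarks, and shows strong zero-shot generalization.

Comments Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg ; Accepted by CVPR 2026

Abstract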

Instruction-driven segmentation in remote sensing generates masks from guidance, offering great potential for accessible and generalizable applications. However, existing methods suffer from fragmented task formulations and limited instruction data, hindering effective understanding and generalization. To address these issues, we introduce GeoSeg-1M, the first million-scale dataset for remote sensing instruction-driven segmentation, constructed via an automatic mask filtering and instruction generation pipeline that synthesizes referring, interactive, and reasoning segmentation instructions from multiple public datasets. GeoSeg-1M contains 590K images, 117 categories, and 1.1M image-mask-instruction triplets. Building upon this foundation, we further curate GeoSeg-Bench, a challenging benchmark designed to evaluate contextual understanding and reasoning capabilities across diverse instruction-driven tasks and complex geospatial scenes. Furthermore, we present UniGeoSeg, a unified framework that serves as a strong baseline, incorporating task-aware text enhancement, latent knowledge memory, and a progressive training strategy to facilitate multi-task learning. Extensive experiments demonstrate the state-of-the-art performance of UniGeoSeg across GeoSeg-Bench and diverse public benchmarks, while exhibiting strong zero-shot generalization. Datasets and source code were released at https://github.com/MiliLab/UniGeoSeg.

2511.22963 2026-05-12 cs.RO cs.AI

Commanding Humanoid by Free-form Language: A Large Language Action Model with Unified Motion Vocabulary

Zhirui Liu, Kaiyang Ji, Ke Yang, Yahao Fan, Jingyi Yu, Ye Shi, Jingya Wang

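Affiliations * ShanghaiTech University

AI summary The paper studies how humanoid robots can understand and execute free-form natural language commands and proposes Humanoid-LLA, a Large Language Action model that translates natural language directly into executable whole-body motions. The method learns a unified human-humanoid motion vocabulary to align language semantics with physical control, and uses a two-stage fine-tuning framework that combines supervised learning with reinforcement learning to improve physical stability and robustness. Experiments show that the model generates diverse, physically plausible motions in simulation and in the real world and generalizes well to new language commands.

Comments Project page: https://humanoidlla.github.io/

Abstract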

Enabling humanoid robots to follow free-form natural language commands is a critical step toward seamless human-robot interaction and general-purpose embodied AI. However, existing methods remain limited, often constrained to simple instructions or forced to sacrifice motion diversity for physical plausibility. To address this gap, we present Humanoid-LLA, a Large Language Action model that translates unconstrained natural language directly into executable whole-body motions for humanoid robots. Our approach tackles two core challenges: paired language-humanoid motion data scarcity and physical instability. First, we bridge high-level language semantics with physically-grounded control by learning a unified human-humanoid motion vocabulary. Second, we introduce a novel two-stage fine-tuning framework that begins with supervised motion Chain-of-Thought learning, followed by reinforcement learning refined with physical feedback to ensure robustness and stability. Extensive evaluation in simulation and real-world cross-embodiment experiments demonstrates that Humanoid-LLA achieves superior generalization to novel language commands and diverse motion generation while maintaining high physical fidelity.