arXivDaily arXiv每日学术速递 周一至周五更新
全部学科分类 1670
专题追踪
2605.15284 2026-05-18 cs.LG

Tadpole: Autoencoders as Foundation Models for 3D PDEs with Online Learning

Tadpole:用于3D偏微分方程的自动编码器作为基础模型的在线学习

Qiang Liu, Felix Koehler, Benjamin Holzschuh, Nils Thuerey

发表机构 * TUM School of Computation, Information and Technology(慕尼黑技术大学计算、信息与技术学院) Technical University of Munich, Garching, Germany(慕尼黑技术大学,慕尼黑,德国) MCML, Munich Center for Machine Learning, Munich, Germany(慕尼黑机器学习中心,慕尼黑,德国)

AI总结 Tadpole通过在线数据生成框架预训练自动编码器,学习跨异构物理系统的丰富可迁移表示,支持高维扩展和多任务应用,包括动态学习和生成建模。

详情
AI中文摘要

我们介绍了Tadpole,一种新的三维偏微分方程(PDE)基础模型,解决了可迁移性、高维可扩展性和多功能性等关键挑战。Tadpole在由高效在线数据生成框架生成的合成3D PDE数据上预训练为自动编码器。这使得能够进行大规模、多样化的训练,无需存储或I/O开销,通过扩展到相当于数百TB的训练数据进行演示。通过自动编码单通道空间裁剪,Tadpole在具有不同状态变量数量和空间分辨率的异构物理系统中学习丰富的、可迁移的表示。尽管仅预训练为自动编码器,Tadpole可以高效地应用于多种下游任务,包括动态学习和生成建模。对于动态学习,我们提出了一种新颖的参数高效微调策略,结合低秩适应、潜在空间转换和重新引入的跳跃连接,以最小的可训练参数数量实现精确的时间建模。Tadpole在各种下游任务中展示了强大的微调性能,突显了其作为3D PDE学习基础模型的通用性和有效性。Tadpole的源代码和预训练权重可在https://github.com/tum-pbs/tadpole获取。

英文摘要

We introduce Tadpole, a novel foundation model for three-dimensional partial differential equations (PDEs) that addresses key challenges in transferability, scalability to high dimensionality, and multi-functionality. Tadpole is pre-trained as an autoencoder on synthetic 3D PDE data generated by an efficient online data-generation framework. This enables large-scale, diverse training without storage or I/O overhead, demonstrated by scaling to an equivalent of hundreds of terabytes of training data. By autoencoding single-channel spatial crops, Tadpole learns rich and transferable representations across heterogeneous physical systems with varying numbers of state variables and spatial resolutions. Although pre-trained solely as an autoencoder, Tadpole can be efficiently applied for multiple downstream tasks beyond reconstruction, including dynamics learning and generative modeling. For dynamics learning, we propose a novel parameter-efficient fine-tuning strategy that integrates low-rank adaptation, latent-space transformations, and reintroduced skip connections, achieving accurate temporal modeling with a minimal number of trainable parameters. Tadpole demonstrates strong fine-tuning performance across various downstream tasks, highlighting its versatility and effectiveness as a foundation model for 3D PDE learning. Source code and pre-trained weights of Tadpole are available at https://github.com/tum-pbs/tadpole

2605.15282 2026-05-18 cs.CL

Fluency and Faithfulness in Human and Machine Literary Translation

Sarah Griebel, Ted Underwood

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) School of Information Sciences(信息科学学院)

AI总结 本研究探讨了文学翻译中流畅性与忠实度之间的关系,分析了106部小说中13万余段人工及机器翻译文本。通过自动评估方法,发现流畅性与忠实度存在显著负相关,且该现象在人类翻译和谷歌翻译中尤为明显,而TranslateGemma则表现出较弱的相关性。研究结果表明,在文学翻译中,提升流畅性可能以牺牲忠实度为代价,且评估结果受文本长度影响。

Comments Accepted NLP4DH 2026

详情
英文摘要

Literary translation requires balancing target-language fluency with faithfulness to the source. Recent large language models (LLMs) often produce fluent translations, but it remains unclear whether fluency corresponds to semantic preservation in literary text. We examine this relationship using 130,486 translated paragraphs from 106 novels in 16 source languages, including human, Google Translate, and TranslateGemma translations. Fluency is measured as original-likeness with a translationese classifier trained on paragraph part-of-speech n-grams, and faithfulness with the automatic translation evaluation metric COMET-KIWI. We control for paragraph length and find a consistent negative correlation between fluency and faithfulness. The pattern appears for both human and Google Translate, but is weaker and often non-significant for TranslateGemma. These results show that segment length matters for automatic evaluation and suggest a tradeoff between fluency and faithfulness in literary translation.

2605.15257 2026-05-18 cs.LG

Training on Documents About Monitoring Leads to CoT Obfuscation

Reilly Haskins, Bilal Chughtai, Joshua Engels

发表机构 * University of Canterbury(坎特伯雷大学) Google DeepMind(谷歌深Mind) Pivotal Research(Pivotal研究)

AI总结 本文研究了模型在了解监控机制的情况下是否会通过隐藏其推理过程来逃避检测。研究者通过合成文档微调的方式,使八种模型接触描述思维链(CoT)监控的预训练风格文档,发现具备监控意识的模型在逃避检测方面的表现显著优于无意识的对照组。研究还表明,模型的思维链可控性与其成功隐藏推理的能力高度相关,并且具备监控意识的模型在同等强化学习压力下更快学会规避监控。这些结果表明,监控知识与高思维链可控性的结合可能对基于CoT的监控系统构成潜在风险。

详情
英文摘要

Chain-of-thought (CoT) monitoring is one of the most promising tools we have for detecting model misbehavior, but its effectiveness depends on models faithfully externalizing their reasoning. Motivated by this vulnerability, we study whether monitor-aware models are capable of obfuscating their reasoning to evade detection. We use synthetic document finetuning to expose eight models to realistic pre-training-style documents describing a CoT monitor and find that monitor-aware models consistently achieve higher rates of undetected misbehavior compared to unaware controls. This effect is weaker but still present on a harder agentic task. We also show that CoT controllability, a model's ability to reshape its own reasoning trace under an imposed constraint, is closely correlated with obfuscation success across the eight models studied ($r=0.800$, $p=0.017$). Monitor-aware models placed under equal reinforcement learning optimization pressure also learn to reward-hack without triggering a CoT monitor substantially faster than unaware controls. Together, these results suggest that knowledge of monitoring combined with high CoT controllability poses a risk to CoT-based monitoring.

2605.15256 2026-05-18 cs.CV

ReactiveGWM: Steering NPC in Reactive Game World Models

Zeqing Wang, Danze Chen, Zhaohu Xing, Zizhao Tong, Yinhan Zhang, Xingyi Yang, Yeying Jin

发表机构 * Tencent(腾讯) National University of Singapore(新加坡国立大学) The Hong Kong Polytechnic University(香港理工大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科学与技术大学(广州)) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 当前游戏世界模型多从玩家视角出发,将非玩家角色(NPC)仅视为背景像素,难以捕捉玩家与NPC之间的互动。为此,本文提出ReactiveGWM,一种能够模拟玩家与NPC动态交互的反应型游戏世界模型。该模型通过解耦玩家控制与NPC行为,并引入轻量级偏差注入和跨注意力模块,实现了对NPC高层策略(如进攻、防守)的灵活响应,且无需针对具体游戏进行再训练,具备跨游戏的零样本策略迁移能力。

Comments The code is available at https://inv-wzq.github.io/ReactiveGWM/

详情
英文摘要

Current game world models simulate environments from a subjective, player-centric perspective. However, by treating the Non-Player Character (NPC) merely as background pixels, these models cannot capture interactions between the player and NPC. In that sense, they act as passive video renderers rather than real simulation engines, lacking the physical understanding needed to model action-induced NPC reactivities. We introduce ReactiveGWM, a reactive game world model that synthesizes dynamic interactions between the player and NPC. Instead of entangling all interaction dynamics, ReactiveGWM explicitly decouples player controls from NPC behaviors. Player actions are injected into the diffusion backbone via a lightweight additive bias, while high-level NPC responses (e.g., Offense, Control, Defense) are grounded through cross-attention modules. Crucially, these modules learn a game-agnostic representation of interactive logic. This enables zero-shot strategy transfer: our learned modules can be plugged directly into off-the-shelf, unannotated world models of different games. This instantly unlocks steerable NPC interactions without any domain-specific retraining. Evaluated on two Street Fighter games, ReactiveGWM maintains fine-grain player controllability while achieving robust, prompt-aligned NPC strategy adherence, paving the way for scalable, strategy-rich interaction with the NPC.

2605.15254 2026-05-18 cs.LG

Curriculum Learning of Physics-Informed Neural Networks based on Spatial Correlation

Xujia Chen, Xinyue Hu, Letian Chen, Daming Shi, Wenhui Fan

发表机构 * Department of Automation, Tsinghua University(自动化系,清华大学)

AI总结 本文针对物理信息神经网络(PINNs)在求解偏微分方程时面临的训练不稳定、多目标约束不平衡及信息传播效率低等问题,提出了一种基于空间相关性的课程学习框架。该方法通过空间因果权重引导边界附近区域的信息向内传播,利用低频信息桥增强空间分离区域的一致性,并采用区域自适应重加权策略优化局部残差,从而有效提升训练稳定性和解的精度。实验表明,在相近计算成本下,该方法显著改善了PINNs的训练效果。

Comments 37 pages, 14 figures, 9 tables

详情
英文摘要

Physics-Informed Neural Networks (PINNs) combine deep learning with physical constraints for solving partial differential equations (PDEs), and are widely applied in fluid mechanics, heat transfer, and solid mechanics. However, PINN training still suffers from high-dimensional non-convex loss landscapes, imbalanced multiobjective constraints, and ineffective information propagation. Existing curriculum learning and causality-guided strategies improve training stability, but mainly focus on temporal or parametric progression, lacking explicit treatment of spatial information propagation and inter-region consistency. Moreover, they are not directly applicable to boundary value problems (BVPs) with strong spatial coupling. To address this issue, we propose a spatially correlated curriculum learning framework for PINNs. To the best of our knowledge, this is the first work to address PINN training difficulties from the perspective of spatial coupling among subregions. First, spatial causal weights guide information from near-boundary regions inward, reducing optimization failures and spurious convergence. Second, a low-frequency information bridge enforces pseudo-label-based consistency across spatially separated regions, suppressing global low-frequency drift. Third, a region-adaptive reweighting strategy adjusts subregion losses to reduce local residuals and recover high-frequency details. Experiments on PDE benchmarks show that, under comparable computational cost, the proposed method alleviates training failures and improves solution accuracy. The code is available at https://github.com/pigofmomo/CurriculumLearningPINN.

2605.15253 2026-05-18 cs.LG

Position: Ideas Should be the Center of Machine Learning Research

Jairo Diaz-Rodriguez

发表机构 * Department of Mathematics and Statistics(数学与统计学系) York University(约克大学) Toronto M3J 1P3, Canada(多伦多M3J 1P3, 加拿大)

AI总结 本文指出当前机器学习研究日益分化为追求指标优化的工程实践和脱离实际的理想化理论,忽视了研究的核心应是“想法”本身。作者提出“以想法为中心”的研究框架,强调通过设计针对性实验验证想法在现代模型中的行为特征,而非单纯追求榜单成绩。这一转变有助于弥合理论与实践之间的差距,同时促进研究公平性,使资源有限的研究者也能做出严谨的科学贡献。

Comments Accepted into ICML 2026 https://icml.cc/virtual/2026/poster/67144

详情
英文摘要

Machine learning research increasingly bifurcates into two disconnected modes: benchmark-driven engineering that prioritizes metrics over understanding, and idealized theory that often fails to transfer to modern systems. In this position paper, we argue that the field focuses too heavily on these endpoints, neglecting the central scientific object: the idea. We propose an Ideas First framework in which ideas are valued for the behavioral signatures they predict in modern models, and these signatures are tested through tailored experiments designed to detect the relevant patterns rather than to win leaderboards. This shift not only bridges the gap between theory and practice but also promotes equity by removing the "complexity premium," enabling rigorous scientific contributions from researchers with modest computational, financial, and human resources. Ultimately, we advocate for a research culture centered on ideas, treating benchmarks and theorems as instruments for testing mechanistic hypotheses rather than as ends in themselves.

2605.15252 2026-05-18 cs.LG cs.AI eess.SP

PDRNN: Modular Data-driven Pedestrian Dead Reckoning on Loosely Coupled Radio- and Inertial-Signalstreams

Peter Bauer, Andreas Porada, Felix Ott, Christopher Mutschler, Tobias Feigl

发表机构 * Fraunhofer Institute for Integrated Circuits IIS(弗劳恩霍夫集成电路研究所)

AI总结 本文提出了一种名为PDRNN的模块化数据驱动行人航位推算系统,用于处理松耦合的无线电与惯性传感器信号流。该方法基于简单循环神经网络架构,能够隐式预测不同估计方法下的异步传感器数据流,并通过独立的机器学习模型分别估计姿态、速度和位置等关键参数及其方差,最终融合模型结合这些输出以提升系统鲁棒性。实验表明,PDRNN在动态运动数据上的精度和稳定性优于传统方法和现有机器学习方法,同时具备更好的组件控制能力和预测能力。

Comments 12 pages

Journal ref IEEE/ION Position, Location and Navigation Symposium (PLANS), Salt Lake City, UT, May 2025

详情
英文摘要

Modern pedestrian dead reckoning (PDR) systems rely on fusing noisy and biased estimates of position, velocity, and calibrated orientation derived from loosely coupled sensors to determine the current pose of a localized object. However, discrepancies in the sampling rates of sensor-specific estimation methods and unreliable transmission pose significant challenges. And traditional methods often fail to effectively fuse multimodal sensor data during dynamic movements characterized by high accelerations, velocities, and rapidly varying orientations. To address these limitations, we propose a simple recurrent neural network (RNN) architecture capable of implicitly forecasting asynchronous sensor data streams from diverse estimation methods along reference trajectories. The proposed approach introduces PDRNN, a modular hybrid AI-assisted PDR system that handles each component as an independent ensemble of machine learning (ML) models to estimate both key parameter means and variances. Separate ML-based models are employed to estimate orientation, (un)directed velocity or distance from acceleration and gyroscope data, with optional absolute positioning from synchronized radio systems such as 5G for stabilization. A final fusion model combines these outputs, position, velocity, and orientation, while using uncertainty estimates to enhance system robustness. The modular design allows individual components to be updated, fine-tuned, or replaced without affecting the entire system. Experiments on dynamic sports movement data show that PDRNN achieves superior accuracy and precision compared to classic and ML-based methods, effectively avoiding error accumulation common in black-box approaches. And PDRNN offers forecast capabilities and better component control despite increased system complexity.

2605.15246 2026-05-18 cs.LG

Privacy Evaluation of Generative Models for Trajectory Generation

Stavros Bouras, Ioannis Kontopoulos, Chiara Pugliese, Francesco Lettich, Emanuele Carlini, Hanna Kavalionak, Chiara Renso, Konstantinos Tserpes

发表机构 * School of Electrical and Computer Engineering, National Technical University of Athens(电气与计算机工程学院,国家技术大学雅典) Institute of Information Science and Technologies, National Research Council (CNR)(信息科学与技术研究所,国家研究委员会(CNR)) Institute of Informatics and Teletematics, National Research Council (CNR)(信息学与电信研究所,国家研究委员会(CNR))

AI总结 轨迹数据在现代城市智能中具有重要作用,但其敏感性也带来了显著的隐私风险。本文研究了生成模型在轨迹生成任务中的隐私保护问题,指出现有生成模型虽然能够生成符合时空分布和移动模式的合成轨迹数据,但其生成特性并不意味着隐私得到保障。通过实施成员推理攻击,作者揭示了生成轨迹模型在隐私保护方面的评估缺口,并证明其仍存在潜在的隐私泄露风险。

Comments Accepted at the 1st Workshop on Multi-Sensor Trajectory Knowledge Discovery and Extraction (MuseKDE 2026), co-located with the 27th IEEE International Conference on Mobile Data Management (IEEE MDM 2026)

详情
英文摘要

Trajectory data is fundamental to modern urban intelligence, yet its sensitivity raises significant privacy concerns. Generative models such as Generative Adversarial Networks, Variational Autoencoders, and Diffusion Models have been developed to generate realistic synthetic trajectory data by capturing underlying spatiotemporal distributions and mobility patterns. Although these models are often assumed to preserve privacy due to their generative nature, this assumption does not necessarily hold. In this work, we investigate the intersection of generative trajectory modeling and privacy evaluation. By identifying applicable empirical methods for assessing privacy preservation in trajectory generation tasks, we demonstrate a significant gap in the evaluation of privacy for generative trajectory models. Motivated by this gap, we implement Membership Inference Attacks against representative models, demonstrating the feasibility of using such empirical privacy evaluation methods and showing that their generative nature does not eliminate privacy risks.

2605.15243 2026-05-18 cs.LG cs.AI q-bio.BM q-bio.MN q-bio.QM

Reading the Cell, Designing the Cure: Perturbation-Conditioned Molecular Diffusion for Function-Oriented Drug Design

Ziyu Xu, Zijian Zhang, Liang Wang, Zhiyuan Liu, Qiang Liu, Shu Wu, Liang Wang

发表机构 * School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学先进交叉学科学院) NLPR, MAIS, Institute of Automation, Chinese Academy of Sciences, Beijing, China(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学人工智能学院) National University of Singapore, Singapore(新加坡国立大学)

AI总结 该研究提出了一种基于转录组的药物设计方法(TBDD),旨在根据期望的基因表达变化生成具有特定功能的分子。为了解决生物学与化学领域间的巨大差异以及转录组信号稀疏性带来的挑战,研究设计了多尺度的扩散生成模型CURE,其核心模块TFE能够提取功能导向的扰动特征,并跨模态对齐化学结构信息,从而生成结构合理且功能一致的候选药物分子。实验表明,该方法在多个基准测试中表现优异,并在零样本基因抑制剂设计任务中验证了其实际应用潜力。

详情
英文摘要

When reliable target structures are unavailable at scale or phenotypes arise from dysregulated pathways, transcriptomic perturbations provide a system-level functional readout for drug action. In this work, we formalize \emph{Transcriptome-based Drug Design (TBDD)} as a generative inverse problem: designing drug molecules conditioned on desired transcriptomic state transitions. We analyze the inherently ill-posed nature of this task, which is further complicated by the profound domain gap between biology and chemistry and by the sparsity of transcriptomic signals. To address these challenges, we propose \textbf{\themodel{}} (A \textbf{C}ell\textbf{U}lar \textbf{R}esponse \textbf{E}ngine), a multi-resolution transcriptome-guided diffusion framework. \themodel{} features a specialized \textbf{Transcriptome Perturbation Functional Feature Extractor (TFE)} that (1) distills function-oriented perturbation embeddings from pre/post states, (2) aligns these signatures to dual chemical views to bridge the cross-modal gap, and (3) performs heterogeneity-aware aggregation to extract robust state-specific signals from noisy transcriptomic data. Extensive evaluations on both standard benchmarks and rigorous out-of-distribution protocols demonstrate that \themodel{} consistently outperforms strong baselines in structural quality and functional consistency. Furthermore, we validate its practical utility via a zero-shot gene-inhibitor design task, highlighting the potential of phenotype-driven generative discovery.

2605.15242 2026-05-18 cs.LG

Logical Grammar Induction via Graph Kolmogorov Complexity: A Neuro-Symbolic Framework for Self-Healing Clinical Data Integrity

Abolfazl Zarghani, Amir Malekesfandiari

发表机构 * Department of Computer Engineering, Ferdowsi University of Mashhad(法尔德大学马什哈德分校计算机工程系)

AI总结 本文提出了一种名为Logic-GNN的神经符号框架,用于解决医疗信息系统中由人为错误引起的临床数据完整性问题。该方法将临床记录视为受潜在逻辑规则支配的结构化“私有语言”,结合时序图神经网络与图 Kolmogorov 复杂度,推导出描述医疗交互逻辑的符号语法规则,并将异常定义为违反该语法导致图描述长度显著增加的情况。实验表明,该方法在区分医疗异常与数据错误方面表现出色,F1 分数达到 0.94,优于现有方法,并具备实时自我修复功能以维护数据完整性。

详情
英文摘要

The reliability of Healthcare Information Systems (HIS) is frequently compromised by human-induced data entry errors, which existing statistical anomaly detection methods fail to distinguish from legitimate clinical extremes. This paper proposes Logic-GNN, a novel neuro-symbolic framework that treats clinical records as a structured ``private language'' governed by latent logical games. By integrating Temporal Graph Neural Networks (TGNN) with Graph Kolmogorov Complexity, we induce a symbolic grammar that represents the underlying logic of medical interactions. We define anomalies as ``grammatical violations'' that cause a significant expansion in the Minimum Description Length (MDL) of the clinical graph. Evaluated on the Sina System dataset (2M+ records), Logic-GNN achieves an F1-score of 0.94, outperforming state-of-the-art baselines by 12\% in distinguishing between life-threatening medical outliers and data corruption. Our approach introduces a self-healing mechanism that suggests logical corrections to maintain data integrity in real-time HIS environments.

2605.15235 2026-05-18 cs.LG

MuteBench: Modality Unavailability Tolerance Evaluation for Incomplete Multimodal Fusion

Wugeng Zheng, Ziwen Kan, Tianlong Chen, Chen Chen, Song Wang

发表机构 * University of Central Florida(中央佛罗里达大学) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 MuteBench 是一个用于评估不完整多模态融合系统在模态缺失情况下的鲁棒性的基准,涵盖了7个临床领域的9个数据集、6种融合架构和两种缺失数据模式。研究发现,架构类型是影响系统鲁棒性的最主要因素,而通道独立模型在处理模态缺失时表现较好,但在处理模态内缺失时可能存在问题。该基准为临床AI系统的设计与选择提供了重要参考。

详情
英文摘要

Multimodal physiological data powers clinical AI systems from intensive care units to wearable devices, but sensors routinely fail in practice. Two failure modes are common: modality missing, where an entire channel is absent, and within-modality missing, where a contiguous time segment is lost. No existing benchmark evaluates multiple fusion architectures under both failure modes at controlled severity levels across diverse clinical datasets. We present MuteBench, a benchmark covering 9 datasets from 7 clinical domains, 6 fusion architectures, and 2 missing-data modes over 125,000 samples. Through this benchmark, we find that architecture family is the strongest predictor of robustness, outweighing parameter count. Channel-independent models tolerate modality missing well but can be sensitive to within-modality missing, especially on short sequences. Curriculum modality dropout protects reliably only up to the maximum dropout rate used in training. We also find that channel count, sequence length, and modality alignment jointly determine which failure mode poses the greater threat. Finally, a PTB-XL case study suggests that diffusion-based imputation can improve downstream classification under within-modality missing, with the largest gains for models whose expert routing is most sensitive to corrupted inputs, though broader validation across datasets remains an open direction. MuteBench provides practitioners with concrete guidance for both selecting existing architectures and informing the design of future robust multimodal fusion methods.

2605.15227 2026-05-18 cs.AI cond-mat.mtrl-sci cs.RO

NIMO Controller: a self-driving laboratory orchestrator based on the Model Context Protocol

Naruki Yoshikawa, Ryo Tamura

发表机构 * National Institute for Materials Science(国家材料科学研究所) Graduate School of Frontier Sciences, The University of Tokyo(东京大学前沿科学研究生院)

AI总结 本文提出了一种基于模型上下文协议(MCP)的自主驾驶实验室(SDL)控制架构——NIMO Controller,旨在解决现有SDL软件框架缺乏标准化接口、难以支持AI代理的问题。该架构通过MCP服务器统一暴露所有SDL功能,并提供了基于MCP工具发现的可视化编程接口,使用户无需编写代码即可设计实验流程,同时支持AI代理通过同一后端进行交互。研究通过颜色匹配实验验证了该架构的可行性与实用性。

Comments 9 pages, 4 figures

详情
英文摘要

Self-driving laboratories (SDLs) have attracted increasing attention as a means of accelerating scientific discovery; however, developing SDL software remains technically demanding. To improve accessibility, orchestration software frameworks have been proposed to coordinate SDL components. Nevertheless, existing frameworks are primarily designed for human interaction and do not provide standardized interfaces suitable for AI agents. In this work, we propose an SDL software architecture based on the Model Context Protocol (MCP), in which all SDL functionalities are exposed through MCP servers. Following this design principle, we introduce an MCP-based SDL orchestrator, named NIMO Controller. It provides a visual programming interface automatically generated through MCP-based tool discovery, allowing human users to design experimental workflows without writing code. The same MCP backend can also be accessed by AI agents, providing a unified interface for both human users and AI agents. We demonstrate the proposed system through a case study on a color-matching SDL. The results validate the usability of the proposed MCP-based SDL architecture.

2605.15224 2026-05-18 cs.AI cs.MA

ICRL: Learning to Internalize Self-Critique with Reinforcement Learning

Jianbo Lin, Xiaomin Yu, Yi Xin, Yifu Guo, Zhuosong Jiang, Zhongqi Yue, Weishi Wang, Heqing Zou, Chengwei Qin, Hui Xiong

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Nanjing University(南京大学) Sun Yat-sen University(中山大学) National University of Singapore(新加坡国立大学) Nanyang Technological University(南洋理工大学) SAP Microsoft Research(微软研究院)

AI总结 本文提出了一种基于强化学习的新型框架ICRL,旨在使大型语言模型在获得自我批评反馈后能够内化这些指导,从而在无外部批评的情况下仍能保持良好的表现。该框架通过联合训练求解器和批评者,利用批评反馈带来的性能提升作为奖励,促使批评者生成更有助于改进的反馈。为了解决批评条件行为与无批评行为之间的分布偏移问题,ICRL引入了分布校准的重加权策略,并通过角色分组优势估计稳定联合优化过程。实验表明,ICRL在多种任务中均取得了显著提升,且训练出的批评者在性能上可与更大规模的模型相媲美。

详情
英文摘要

Large language model-based agents make mistakes, yet critique can often guide the same model toward correct behavior. However, when critique is removed, the model may fail again on the same query, indicating that it has not internalized the critique's guidance into its underlying capability. Meanwhile, a frozen critic cannot improve its feedback quality over time, limiting the potential for iterative self-improvement. To address this, we propose learning to internalize self-critique with reinforcement learning(ICRL), a novel framework that jointly trains a solver and a critic from a shared backbone to convert critique-induced success into unassisted solver ability. The critic is rewarded based on the solver's subsequent performance gain, incentivizing actionable feedback. To address the distribution shift between critique-conditioned and critique-free behavior, ICRL introduces a distribution-calibration re-weighting ratio that selectively transfers critique-guided improvements compatible with the solver's own prompt distribution. Additionally, a role-wise group advantage estimation stabilizes joint optimization across the two roles. Together, these mechanisms ensure that the solver learns to improve itself without external critique, rather than becoming dependent on critique-conditioned behavior. We evaluate ICRL on diverse benchmarks spanning agentic and mathematical reasoning tasks, using Qwen3-4B and Qwen3-8B as backbones. Results show consistent improvements, with average gains of 6.4 points over GRPO on agentic tasks, and 7.0 points on mathematical reasoning. Notably, the learned 8B critic is comparable to 32B critics while using substantially fewer tokens. The code is available at https://github.com/brick-pid/ICRL.

2605.15220 2026-05-18 cs.CL cs.AI cs.LG

Always Learning, Always Mixing: Efficient and Simple Data Mixing All The Time

Michael Y. Hu, Apurva Gandhi, Kyunghyun Cho, Tal Linzen, Pratyusha Sharma

发表机构 * New York University(纽约大学) Carnegie Mellon University(卡内基梅隆大学) Microsoft(微软公司)

AI总结 数据混合在语言模型训练中起着关键作用,决定了如何组合不同来源或类型的训练数据。本文提出了一种名为OP-Mix的高效数据混合算法,能够在整个语言模型训练生命周期中持续运行,解决了现有方法仅适用于单一训练阶段的问题。该方法通过在当前模型上训练低秩适配器并进行插值,低成本地模拟候选数据混合方案,从而避免了对代理模型的依赖,并始终基于模型的实际学习动态进行搜索。实验表明,OP-Mix在预训练、持续微调等任务中均能以更低的计算成本达到接近最优的性能。

详情
英文摘要

Data mixing decides how to combine different sources or types of data and is a consequential problem throughout language model training. In pretraining, data composition is a key determinant of model quality; in continual learning and adaptation, it governs what is retained and acquired. Yet existing data mixing methods address only one phase of this lifecycle at a time: some require smaller proxy models tied to a single training phase, others assume a fixed domain set, and continual learning lacks principled guidance altogether. We argue that data mixing is fundamentally an online decision making problem -- one that recurs throughout training and demands a single, unified solution. We introduce OP-Mix (On-Policy Mix), a data mixing algorithm that operates across the entire language model training lifecycle. Our main insight is that candidate data mixtures can be cheaply simulated by interpolating between low-rank adapters trained directly on the current model, eliminating separate proxy models and ensuring the search is always grounded in the model's actual learning dynamics. Across pretraining, continual midtraining, and continual instruction tuning, OP-Mix consistently finds near-optimal mixtures while using a fraction of the compute of the baselines. In pretraining, OP-Mix improves upon training without mixing by 6.3% in average perplexity. For continual learning, OP-Mix matches the performance of both retraining and on-policy distillation while using 66% and 95% less overall compute, respectively. OP-Mix suggests a different view of language model training: not a sequence of distinct phases, but a single continuous process of learning from data.

2605.15218 2026-05-18 cs.AI cs.CE

CAX-Agent: A Lightweight Agent Harness for Reliable APDL Automation

Chenying Lin, Yichen Hai, Yi He, Ran Wang, Haiyan Qiang, Liang Yu

发表机构 * Shanghai Ultradimension Technology Co., Ltd.(上海超维科技有限公司) College of Logistics Engineering, Shanghai Maritime University(上海海洋大学物流学院) School of Civil Aviation, Northwestern Polytechnical University(西北工业大学航空学院) State Key Laboratory of Airliner Integration Technology(航空器集成技术国家重点实验室) National Key Laboratory of Strength(强度与结构完整性国家实验室) Wuhan University(武汉大学)

AI总结 本文提出了一种轻量级的代理框架CAX-Agent,旨在提升MAPDL有限元仿真中的自动化可靠性。该框架通过引入领域特定的中间件,实现工具生命周期管理、工作流状态控制和故障恢复,从而解决大语言模型在该任务中常见的输出不一致和任务失败问题。实验评估表明,CAX-Agent中基于模型驱动的恢复策略在多个结构基准测试中表现出色,显著优于仅依赖规则或无恢复策略的方法。

Comments 8 pages, 6 figures, IEEE conference format

详情
英文摘要

Large language models deployed for MAPDL finite-element simulation face practical reliability challenges: without structured execution control, tool encapsulation, and fault recovery, outputs may be inconsistent and task failures are common. The Agent Harness paradigm addresses this by inserting domain-specific orchestration middleware that manages tool lifecycles, workflow state, and recovery escalation. This paper presents the architecture of CAX-Agent, a lightweight agent harness purpose-built for MAPDL automation, and empirically evaluates one of its core components -- the recovery policy.CAX-Agent organizes execution into three layers -- LLM service, agent harness, and solver backend -- with a recovery ladder that escalates from deterministic rule patching through model-driven regeneration to context enrichment and human intervention. We evaluate three recovery strategies (no_recovery, rule_only, and model_only) on 50 standard structural benchmarks with three repeated runs per strategy (450 case-runs total). Two independent human raters score task completion under blind conditions; inter-rater agreement is strong (quadratic weighted Cohen's kappa = 0.84, 96 percent of score pairs within one point). Model_only achieves the best completion rate (0.9267), task score (3.59/4), total score (9.16/10), and zero-intervention rate (0.84), outperforming rule_only (0.7733, 3.17/4, 7.03/10, 0.00) and no_recovery (0.6933, 2.74/4, 5.60/10, 0.00) with large effect sizes (Cliff's delta = 0.81-0.87). The benchmark uses deliberately simple geometries to isolate recovery-policy effects; we discuss the scope of these findings and directions for broader validation.

2605.15217 2026-05-18 cs.AI cs.CY cs.LG econ.GN q-fin.EC

Fair outputs, Biased Internals: Causal Potency and Asymmetry of Latent Bias in LLMs for High-Stakes Decisions

Jagdish Tripathy, Marcus Buckmann

发表机构 * Bank of England(英格兰银行)

AI总结 本研究探讨了指令微调语言模型在高风险决策(如房贷审批)中表现出的行为公平性与其内部潜在偏见之间的不对称关系。研究发现,尽管模型在输出层面看似无偏,但其内部表示仍保留并放大了与种族相关的偏见,且这些隐藏的偏见具有因果影响力,能够通过特定干预引发决策反转。研究还揭示了这种偏见在不同群体间的不对称性,并指出仅关注输出的行为审计不足以识别和治理模型中的潜在偏差,需结合表示分析的双重评估框架。

Comments 39 pages, 16 figures, 2 tables

详情
英文摘要

Instruction-tuned language models exhibit behavioural fairness in high-stakes decisions while retaining biased associations in their internal representations. However, whether these suppressed representations can affect model outputs - and whether such causal potency is symmetric across demographic groups - remains unknown. We investigate the use of open-weight models for mortgage underwriting using matched applications that differ only in racially-associated names and reveal a critical disconnect: models show no output-level bias, yet retain and amplify demographic representations across model layers. Through activation steering and novel cross-layer interventions, we demonstrate that this suppressed information is decision-relevant: when reinjected at critical layers, it produces near-complete decision reversals. Critically, this latent bias is asymmetric - steering interventions affect decisions in one demographic direction, while producing minimal effects in reverse - and susceptible to adversarial prompt engineering and parameter-efficient fine-tuning. These findings demonstrate that behavioural audits focused on outputs are insufficient: fair outputs can mask exploitable internal biases. They also motivate dual-layer testing frameworks combining output evaluation with representational analysis for AI governance in high-stakes decisions.

2605.15215 2026-05-18 cs.AI cs.SE

SkillSmith: Compiling Agent Skills into Boundary-Guided Runtime Interfaces

Duling Xu, Zheng Chen, Zaifeng Pan, Jiawei Guan, Dong Dong, Jialin Li, Bangzheng Pu

发表机构 * AetherHeart Tech Co., Ltd.(AetherHeart科技有限公司) Renmin University of China(中国人民大学) University of California San Diego(加州大学圣地亚哥分校)

AI总结 SkillSmith 是一种边界引导的编译-运行时框架,旨在优化基于技能的智能体系统。该方法通过离线编译技能包为最小可执行接口,提取技能的细粒度操作边界,使智能体在运行时仅调用相关组件,从而减少冗余上下文注入和重复推理。实验表明,SkillSmith 显著降低了推理阶段的 token 使用量、思考迭代次数和执行时间,并提升了任务准确率,同时支持强模型生成的编译结果被轻量模型复用。

详情
英文摘要

Recently, skills have been widely adopted in large language model (LLM)-based agent systems across various domains. In existing frameworks, skills are typically injected into the agent reasoning loop as contextual guidance once matched to a runtime task, enabling specialized task-solving capabilities. We find that this execution paradigm introduces two major sources of redundancy: irrelevant context injection and repeated skill-specific reasoning and planning. To this end, we propose SkillSmith, a boundary-first compiler-runtime framework that compiles skill packages offline into minimal executable interfaces. By extracting fine-grained operational boundaries from skills, SkillSmith enables agents to dynamically access and execute only the relevant components at runtime, thereby minimizing unnecessary context injection and redundant reasoning overhead. In the evaluation on SkillsBench benchmark, SkillSmith reduces solve-stage token usage by 57.44%, thinking iterations by 42.99%, solve time by 50.57% (2.02x faster), and token-proportional monetary cost by 57.44% compared with using raw-skills. Moreover, compiled artifacts produced by a stronger model can be reused by a smaller or more efficient runtime model, improving task accuracy in cases where raw skill interpretation fails. The source code and data are available at https://github.com/AetherHeart-AI/Aeloon.

2605.15208 2026-05-18 cs.LG cs.AI

Quantization Undoes Alignment: Bias Emergence in Compressed LLMs Across Models and Precision Levels

Plawan Kumar Rath, Rahul Maliakkal

发表机构 * Meta

AI总结 该研究探讨了量化压缩对大型语言模型(LLMs)偏见表现的影响,发现低精度量化会导致模型在多个任务中产生新的刻板印象行为,且这种变化与精度水平呈剂量反应关系。通过在多个模型和精度级别上的大规模实验,研究揭示了传统质量评估指标无法检测到这种偏见的增加,强调了在模型压缩前进行公平性检测的重要性。

Comments 7 pages, 4 figures, 4 tables. Accepted at IEEE Cloud Summit 2026. This is the author's accepted version; the version of record will appear in IEEE Xplore

详情
英文摘要

Large Language Models are routinely compressed via post-training quantization to reduce inference costs and memory footprint for cloud and edge deployment, yet the impact of this compression on model quality remains poorly understood. Existing studies typically compare only two conditions (full-precision vs. a single quantized variant), rely on aggregate bias metrics, and evaluate a single model family, making it impossible to distinguish gradual degradation from threshold-dependent safety failures. We conduct a controlled empirical study of three instruction-tuned models (Qwen2.5-7B, Mistral-7B, Phi-3.5-mini) at five precision levels (BF16 through 3-bit) on 12,148 BBQ bias benchmark items across 5 random seeds, totaling 911,100 inference records. Our results reveal that 3-bit quantization causes 6-21% of previously unbiased items to develop new stereotypical behaviors, following a clear dose-response pattern confirmed via logistic regression, while models' willingness to select "unknown" answers declines by 17.4%. Crucially, these item-level changes are invisible to standard quality metrics: perplexity increases by less than 0.5% at 8-bit and under 3% at 4-bit across all three models, yet 2.5-5.6% of items already develop new biases at 4-bit. These findings demonstrate that aggregate evaluation metrics systematically miss fairness-critical degradation, underscoring the need for quality-aware compression protocols that explicitly test for bias emergence before deployment.

2605.15207 2026-05-18 cs.LG cs.MA

TeamTR: Trust-Region Fine-Tuning for Multi-Agent LLM Coordination

Yi Xie, Siao Liu, Falong Fan, Yuanqi Yao, Yue Zhao, Bo Liu

发表机构 * Department of Electrical \& Computer Engineering, University of Arizona Engineering College, Soochow University INSAIT, Sofia University "St. Kliment Ohridski" Department of Electrical Computer Engineering, Stony Brook University

AI总结 多智能体大语言模型系统在复杂推理任务中展现出潜力,但近期评估显示其性能常低于单一模型基线。本文识别出共享上下文团队在顺序微调中存在结构性失效模式,即更新一个智能体会导致团队上下文分布偏移,而后续使用缓存轨迹进行评估会加剧这种偏差。为此,作者提出TeamTR信任域框架,通过每次更新后重新采样轨迹并控制每个智能体的分布偏差,从而保证每次更新和每个阶段的改进下界。实验表明,TeamTR在平均性能上优于单智能体和顺序微调方法约7.1%,有效缓解了协调退化问题,并支持组件的即插即用替换。

Comments 9pages, Accepted at ICML2026

详情
英文摘要

Multi-agent LLM systems have shown promise for complex reasoning, yet recent evaluations reveal they often underperform single-model baselines. We identify a structural failure mode in sequential fine-tuning of shared-context teams: updating one agent shifts the team's context distribution, and when subsequent updates are evaluated on cached rollouts, this mismatch compounds. We formalize this as the compounding occupancy shift and prove that stale-occupancy evaluation incurs a penalty that scales quadratically with the number of agents. In contrast, intermediate-occupancy evaluation reduces this to linear scaling. We propose TeamTR, a trust-region framework that resamples trajectories after each component update and enforces per-agent divergence control, yielding rigorous per-update and per-stage improvement lower bounds. Experiments show that TeamTR outperforms single-agent and sequential baselines with 7.1% on average, mitigates coordination regressions, and supports plug-and-play component replacement. Code is available at https://github.com/Yydc/TeamTR.

2605.15206 2026-05-18 cs.LG cs.AI cs.DC

AgentStop: Terminating Local AI Agents Early to Save Energy in Consumer Devices

Dzung Pham, Kleomenis Katevas, Ali Shahin Shamsabadi, Hamed Haddadi

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Brave Software, Imperial College London(Brave软件公司,伦敦帝国学院)

AI总结 随着基于大语言模型的自主代理在复杂任务中应用增多,本地部署虽能提升隐私保护和降低成本,但其资源消耗远高于普通语言模型交互。本文研究了在消费级硬件上本地运行代理的能耗问题,提出了一种名为AgentStop的轻量级监督机制,通过预测任务失败的可能性提前终止无效流程,在减少15%-20%能耗的同时仅小幅影响任务性能,为可持续的本地智能代理系统提供了可行方案。

Comments ACM CAIS '26

详情
英文摘要

Autonomous agents powered by large language models (LLMs) are increasingly used to automate complex, multi-step tasks such as coding or web-based question answering. While remote, cloud-based agents offer scalability and ease of deployment, they raise privacy concerns, depend on network connectivity, and incur recurring API costs. Deploying agents locally on user devices mitigates these issues by preserving data privacy and eliminating usage-based fees. However, agentic workflows are far more resource-intensive than typical LLM interactions. Iterative reasoning, tool use, and failure retries substantially increase token consumption, often expending significant compute without successfully completing tasks. In this work, we investigate the time, token, and energy overhead of locally deployed LLM-based agents on consumer hardware. Our measurements show that agentic execution increases GPU power draw, temperature, and battery drain compared to single-inference workloads. To address this inefficiency, we introduce AgentStop, a lightweight efficiency supervisor that predicts and preemptively terminates trajectories unlikely to succeed. Leveraging low-cost execution signals, such as token-level log probabilities, AgentStop can reduce wasted energy by 15-20% with minimal impact on task performance (<5% utility drop) for challenging web-based question answering and coding benchmarks. These findings position predictive early termination as a practical mechanism for enabling sustainable, privacy-preserving LLM agents on user devices. Our project code and data are available at https://github.com/brave-experiments/AgentStop.

2605.15205 2026-05-18 cs.AI

Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations

Nanxu Gong, Zixin Chen, Haotian Li, Zishu Zhao, Jianxun Lian, Huamin Qu, Yanjie Fu, Xing Xie

发表机构 * Arizona State University(亚利桑那州立大学) Hong Kong University of Science and Technology(香港科学与技术大学) Microsoft Research Asia(微软亚洲研究院) Smith College(史密斯学院)

AI总结 本研究探讨了提升大型语言模型(LLM)心智理论(ToM)能力是否真正有助于改善人机交互。研究指出,现有基准多从第三人称视角通过阅读故事和选择题评估ToM能力,忽视了真实交互中的第一人称、动态和开放特性。为此,研究提出了一种新的交互式ToM评估范式,并通过真实数据集和用户实验系统评估了四种代表性ToM增强技术,发现静态基准上的提升并不一定带来动态人机交互中的性能改善,强调了基于交互的评估在开发下一代社会智能模型中的重要性。

详情
英文摘要

Improving the Theory of Mind (ToM) capability of Large Language Models (LLMs) is crucial for effective social interactions between these AI models and humans. However, the existing benchmarks often measure ToM capability improvement through story-reading, multiple-choice questions from a third-person perspective, while ignoring the first-person, dynamic, and open-ended nature of human-AI (HAI) interactions. To directly examine how ToM improvement techniques benefit HAI interactions, we first proposed the new paradigm of interactive ToM evaluation with both perspective and metric shifts. Next, following the paradigm, we conducted a systematic study of four representative ToM enhancement techniques using both four real-world datasets and a user study, covering both goal-oriented tasks (e.g., coding, math) and experience-oriented tasks (e.g., counseling). Our findings reveal that improvements on static benchmarks do not always translate to better performance in dynamic HAI interactions. This paper offers critical insights into ToM evaluation, showing the necessity of interaction-based assessments in developing next-generation, socially aware LLMs for HAI symbiosis.

2605.15204 2026-05-18 cs.AI

SDOF: Taming the Alignment Tax in Multi-Agent Orchestration with State-Constrained Dispatch

Zhantao Wang

发表机构 * Digital China(数字中国)

AI总结 本文提出了一种名为SDOF的多智能体协调框架,旨在解决现有系统在任务调度中缺乏阶段约束的问题。该框架将多智能体执行视为受约束的状态机,并通过强化学习与有限状态自动机相结合的方法,实现对任务流程的精确控制与合规性验证。实验表明,SDOF在招聘系统等实际场景中表现出更高的任务完成率与执行安全性,显著优于现有模型。

Comments 12 pages, 4 figures, 14 tables

详情
英文摘要

Multi-agent orchestration frameworks such as LangChain, LangGraph, and CrewAI route tasks through graph-based pipelines but do not enforce the stage constraints that govern real business processes. We present SDOF, a framework that treats multi-agent execution as a constrained state machine. SDOF operates through two primary defensive layers, implemented by three components: (1) an Online-RLHF Specialized Intent Router trained via Generative Reward Modeling (GRPO) and (2) a StateAwareDispatcher with GoalStage finite-automaton checks and precondition/postcondition SkillRegistry validation for auditable execution control. On a recruitment system backed by the Beisen iTalent platform (6000+ enterprises), 185 expert-curated scenarios trigger 1671 live API calls. Our GSPO-aligned 7B Intent Router achieves higher joint accuracy than zero-shot GPT-4o on this FSM-constrained adversarial routing benchmark (80.9% versus 48.9%). In end-to-end execution, SDOF reaches 86.5% task completion (95% confidence interval 80.8 to 90.7) and blocks all 22 operations in the injection, illegal HR subset. Under a broader message-level blocking audit, SDOF attains precision 100% and recall 88%, expert agreement kappa=0.94. A separate evaluation on 960 SGD-derived dialogues spanning 8 service domains surfaces 201 stage-order conflicts under our FSM mapping, 41 of which arise in the normal split. This arXiv version reports the current validated scope; extended multi-seed training comparisons and deeper workflow evaluations will be released in a subsequent update.

2605.15202 2026-05-18 cs.AI cs.CL cs.IR

DeepSlide: From Artifacts to Presentation Delivery

Ming Yang, Zhiwei Zhang, Jiahang Li, Haoseng Liu, Yuzheng Cai, Weiguo Zheng

发表机构 * School of Data Science, Fudan University(复旦大学数据科学学院)

AI总结 DeepSlide 是一个支持全流程演示文稿准备的人机协作多智能体系统,旨在优化从内容规划到演讲表现的整个过程,而不仅仅是生成视觉上合理的幻灯片。该系统结合了可控逻辑链规划、内容树检索、风格继承的序列渲染以及可执行的排练支持,有效提升了演讲的叙事连贯性、节奏精确度和幻灯片与讲稿的协同性。研究还引入了一个双评分板基准,用于区分静态内容质量与动态演讲表现,实验表明 DeepSlide 在多个领域和受众场景下均优于现有方法。

Comments 37 pages,10 figures,9 tables

详情
英文摘要

Presentations are a primary medium for scholarly communication, yet most AI slide generators optimize the artifact (a visually plausible deck) while under-optimizing the delivery process (pacing, narrative, and presentation preparation). We present DeepSlide, a human-in-the-loop multi-agent system that supports preparing the full presentation process, from requirement elicitation and time-budgeted narrative planning, to evidence-grounded slide--script generation, attention augmentation, and rehearsal support. DeepSlide integrates (i) a controllable logical-chain planner with per-node time budgets, (ii) a lightweight content-tree retriever for grounding, (iii) Markov-style sequential rendering with style inheritance, and (iv) sandboxed execution with minimal repair to ensure renderability. We further introduce a dual-scoreboard benchmark that cleanly separates static artifact quality from dynamic delivery excellence. Across 20 domains and diverse audience profiles, DeepSlide matches strong baselines on artifact quality while consistently achieving larger gains on delivery metrics, improving narrative flow, pacing precision, and slide--script synergy with clearer attention guidance.

2605.15093 2026-05-18 cs.CV

CoralLite: μCT Reconstruction of Coral Colonies from Individual Corallites

Jess Jones, Leonardo Bertini, Kenneth Johnson, Erica Hendy, Tilo Burghardt

发表机构 * University of Bristol(布里斯托大学) University of Liverpool(利物浦大学) Natural History Museum(自然历史博物馆)

AI总结 该研究提出了一种名为CoralLite的方法,用于从珊瑚骨骼的微CT扫描数据中重建单个珊瑚虫的骨骼结构。研究通过结合弱标注数据预训练与全标注切片微调的混合V-Trans-UNet网络,实现了对整个珊瑚群体骨骼的高精度分割与三维建模。该方法在相同珊瑚群体和不同生物样本上均表现出良好的分割性能,为基于微CT的珊瑚个体骨骼建模提供了首个深度学习基准与完整数据集。

Comments 15 pages, 10 figures, 2 tables

详情
英文摘要

The life history of an individual coral is archived within the accreting skeleton of the colony. While reef-forming coral colonies (e.g. massive $\textit{Porites}$ sp.) may live for hundreds of years and deposit calcareous structures many metres in height and width, their living tissue is a thin outer surface layer comprised of asexually-dividing polyps that only survive a few years. To understand the rate and timing of polyp division and the consequences for colony skeletal growth, scientists need to track the skeletal corallite deposited around each polyp. Here we propose CoralLite, an annotated $μ$CT scan dataset of entire calcareous skeletons and an associated, first corallite deep learning reconstruction baseline. CoralLite combines fully quantified volumetric segmentations with cross-slice linking for visualisations of 3D models for each corallite up to colony scale. For segmentation, we propose and evaluate in detail a hybrid V-Trans-UNet architecture applicable to segmenting tiled $μ$CT virtual slabs of $\textit{Porites}$ sp. colonies. The model is pre-trained on weakly annotated data and topology-aware fine-tuned using fully annotated slice sections with 8k+ manual corallite region annotations. On unseen slices of the same colony, the resulting model reaches 0.94 topological accuracy at mean Dice scores of 0.77 on the same colony and projection axis, and 0.63 mean Dice scores on a different, biologically unrelated specimen. Whilst our experiments are limited in scale and context, our results show for the first time that visual machine learning can effectively support full 3D individual corallite modelling from $μ$CT scans of coral skeletons alone. For reproducibility and as a baseline for future research we publish our full dataset of 697 $μ$CT slices, 37 partial or full slice annotations, and all network weights and source code with this paper.

2605.15053 2026-05-18 cs.LG cs.AI

TFGN: Task-Free, Replay-Free Continual Pre-Training Without Catastrophic Forgetting at LLM Scale

Anurup Ganguli

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出了一种名为TFGN的新型架构,能够在无需回放数据、无需任务标识的情况下,在大规模语言模型中实现无灾难性遗忘的持续预训练。该方法通过在Transformer模型上叠加一个参数高效的输入条件更新模块,实现了跨异构文本领域的正向和反向迁移,并在多个大规模模型和数据集上取得了显著效果。研究还进一步引入了闭环元控制器和操作级计划向量,提升了模型的自主学习能力和跨域适应性,为大规模语言模型的持续学习提供了新的架构解决方案。

Comments 65 pages, 10 figures, 40 tables

详情
英文摘要

Continually pre-training a large language model on heterogeneous text domains, without replay or task labels, has remained an unsolved architectural problem at LLM scale. Existing methods rely on replay buffers, task identifiers, regularization penalties that scale poorly, or sentence-classification-scale evaluation. We introduce TFGN, an architectural overlay for transformer language models that produces input-conditioned, parameter-efficient updates while leaving the rest of the transformer unchanged. On six heterogeneous text domains (Prose, Python, Math, Biomedical, Chinese, JavaScript) at 1B tokens per phase across three model scales (~398M, ~739M, ~9B) and two regimes (From-Scratch and Retrofit), TFGN achieves backward transfer of -0.007 at LLaMA 3.1 8B Retrofit, HellaSwag retention 0.506/0.504/0.510, and >=99.59% L2-orthogonal gradient separation between domain pairs - with no replay, no task IDs, no Fisher penalty. The same matrices show positive cross-domain forward transfer: held-out JavaScript PPL drops 26.8% at LLaMA-8B Retrofit and 62.0% at GPT-2 Medium From-Scratch purely from Python training. Two extensions on the same substrate close further open problems. A closed-loop meta-control layer (Extension A) reduces forgetting by an additional 81% at ~398M, mapping onto the System A and System M roles of Dupoux et al. (arXiv:2603.15381). An operator-level plan vector (Extension B) reshapes forward-pass behavior at 99.96% cosine fidelity over 30 source->target pairs. The architectural insight is a Read/Write decomposition: the forward pass is fully dense, while cross-domain parameter updates are structured so prior-domain subspaces are not written to. To our knowledge, TFGN is the first architecture that simultaneously closes catastrophic forgetting at LLM scale, realizes a closed-loop autonomous-learning meta-controller, and carries an operator-level latent planner.

2605.15010 2026-05-18 cs.CV

3D Skew-Normal Splatting

Xiangru Wu, Ke Fan, Yanwei Fu

发表机构 * Fudan University(复旦大学)

AI总结 本文提出了一种名为Skew-Normal Splatting(SNS)的新方法,用于改进3D高斯溅射(3DGS)在实时新视角合成中的表示能力。通过引入Azzalini偏正态分布作为基本单元,SNS能够灵活建模对称和非对称结构,尤其在处理物体边界和单侧表面时表现出更强的表示能力。此外,SNS保持了数学上的可解析性,并通过解耦参数化和分块优化策略提升了训练稳定性,实验表明其在多个基准测试中优于传统高斯及其他非高斯核方法。

详情
英文摘要

3D Gaussian Splatting (3DGS) has emerged as a leading representation for real-time novel view synthesis and has been widely adopted in various downstream applications. The core strength of 3DGS lies in its efficient kernel-based scene representation, where Gaussian primitives provide favorable mathematical and computational properties. However, under a finite primitive budget, the symmetric shape of each primitive directly affects representation compactness, especially near asymmetric structures such as object boundaries and one-sided surfaces. Recent works have explored more complex kernel distributions; however, they either remain within the elliptical family or rely on hard truncation, which limits continuous shape control and introduces distributional discontinuities. In this paper, we propose Skew-Normal Splatting (SNS), which adopts the Azzalini Skew-Normal distribution as the fundamental primitive. By introducing a learnable and bounded skewness parameter, SNS can continuously interpolate between symmetric Gaussians and Half-Gaussian-like shapes, enabling flexible modeling of both sharp boundaries and interior regions. Moreover, SNS preserves analytical tractability under affine transformations and marginalization. This property allows seamless integration into existing Gaussian Splatting rasterization pipelines. Furthermore, to address the strong coupling between scale, rotation, and skewness parameters, we introduce a decoupled parameterization and a block-wise optimization strategy to enhance training stability and accuracy. Extensive experiments on standard novel-view synthesis benchmarks show that SNS consistently improves reconstruction quality over Gaussian and recent non-Gaussian kernels, with clearer benefits on sharp boundaries and thin or one-sided structures.

2605.14892 2026-05-18 cs.AI

Beyond Individual Intelligence: Surveying Collaboration, Failure Attribution, and Self-Evolution in LLM-based Multi-Agent Systems

Shihao Qi, Jie Ma, Rui Xing, Wei Guo, Xiao Huang, Zhitao Gao, Jianhao Deng, Jun Liu, Lingling Zhang, Bifan Wei, Boqian Yang, Pinghui Wang, Jianwen Sun, Jing Tao, Yaqiang Wu, Hui Liu, Yu Yao, Tongliang Liu

发表机构 * MOE KLINNS Lab(MOE KLINNS实验室) School of Computer Science and Technology(计算机科学与技术学院) School of Cyber Science and Engineering(网络安全工程学院) School of Software Engineering(软件工程学院) School of Control Science and Engineering(控制科学与工程学院) Shaanxi Provincial Key Laboratory of Big Data Knowledge Engineering(陕西省大数据知识工程重点实验室) Laboratory for AI and New Forms of Education(人工智能与新型教育实验室) Lenovo AI Technology Center, CTOO, Lenovo(联想AI技术中心,联想CTOO) Sydney AI Centre, The University of Sydney(悉尼AI中心,悉尼大学)

AI总结 本文综述了基于大语言模型的多智能体系统在协作、错误归因与自主进化方面的研究进展,指出现有研究多分别关注单个智能体能力、协作机制或自我进化,而忽视了它们之间的因果关系。文章提出了一个统一的框架——LIFE 进程,涵盖能力基础构建、协作整合、错误归因与自主进化四个阶段,系统分析了各阶段之间的依赖关系,并提出了跨阶段的研究方向,旨在推动具备持续诊断、结构调整与行为优化能力的自组织多智能体系统发展。

详情
英文摘要

LLM-based autonomous agents have demonstrated strong capabilities in reasoning, planning, and tool use, yet remain limited when tasks require sustained coordination across roles, tools, and environments. Multi-agent systems address this through structured collaboration among specialized agents, but tighter coordination also amplifies a less explored risk: errors can propagate across agents and interaction rounds, producing failures that are difficult to diagnose and rarely translate into structural self-improvement. Existing surveys cover individual agent capabilities, multi-agent collaboration, or agent self-evolution separately, leaving the causal dependencies among them unexamined. This survey provides a unified review organized around four causally linked stages, which we term the LIFE progression: Lay the capability foundation, Integrate agents through collaboration, Find faults through attribution, and Evolve through autonomous self-improvement. For each stage, we provide systematic taxonomies and formally characterize the dependencies between adjacent stages, revealing how each stage both depends on and constrains the next. Beyond synthesizing existing work, we identify open challenges at stage boundaries and propose a cross-stage research agenda for closed-loop multi-agent systems capable of continuously diagnosing failures, reorganizing structures, and refining agent behaviors, extending current coordination frameworks toward more self-organizing forms of collective intelligence. By bridging these previously fragmented research threads, this survey aims to offer both a systematic reference and a conceptual roadmap toward autonomous, self-improving multi-agent intelligence.

2605.14884 2026-05-18 cs.LG

AIMing for Standardised Explainability Evaluation in GNNs: A Framework and Case Study on Graph Kernel Networks

Magdalena Proszewska, N. Siddharth

发表机构 * School of Informatics, University of Edinburgh(爱丁堡大学信息学院)

AI总结 图神经网络(GNNs)在处理图结构数据方面取得了显著进展,但缺乏一个全面的可解释性评估框架。本文提出AIM框架,从准确性、实例级解释和模型级解释三个维度对可解释性进行系统评估,具有高度灵活性和广泛适用性。通过将AIM应用于图核网络(GKNs)等内在可解释的GNN模型,研究发现了其解释性局限并据此改进模型,提出了在保持高准确率的同时提升可解释性的xGKN,为图神经网络的可解释性研究提供了更实用和有效的解决方案。

Comments 19 pages, 4 figures, 8 tables

Journal ref Transactions on Machine Learning Research (TMLR). ISSN 2835-8856 (2026)

详情
英文摘要

Graph Neural Networks (GNNs) have advanced significantly in handling graph-structured data, but a comprehensive framework for evaluating explainability remains lacking. Existing evaluation frameworks primarily involve post-hoc explanations, and operate in the setting where multiple methods generate a suite of explanations for a single model. This makes comparison of explanations across models difficult. Evaluation of inherently interpretable models often targets a specific aspect of interpretability relevant to the model, but remains underdeveloped in terms of generating insight across a suite of measures. We introduce AIM, a comprehensive framework that addresses these limitations by measuring Accuracy, Instance-level explanations, and Model-level explanations. AIM is formulated with minimal constraints to enhance flexibility and facilitate broad applicability. Here, we use AIM in a pipeline, extracting explanations from inherently interpretable GNNs such as graph kernel networks (GKNs) and prototype networks (PNs), evaluating these explanations with AIM, identifying their limitations and obtaining insights to their characteristics. Taking GKNs as a case study, we show how the insights obtained from AIM can be used to develop an updated model, xGKN, that maintains high accuracy while demonstrating improved explainability. Our approach aims to advance the field of Explainable AI (XAI) for GNNs, providing more robust and practical solutions for understanding and improving complex models.

2605.14876 2026-05-18 cs.CV cs.AI

Unlocking Complex Visual Generation via Closed-Loop Verified Reasoning

Hanbo Cheng, Limin Lin, Ruo Zhang, Yicheng Pan, Jun Du

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 尽管当前文本到图像生成模型在技术上取得了快速进展,但它们大多依赖单步生成范式,难以处理复杂的语义内容,且参数扩展带来的性能提升有限。为了解决多步推理方法中存在的幻觉、优化不稳定和推理延迟等问题,本文提出了一种闭环视觉推理框架CLVR,该框架将视觉语言逻辑规划与像素级扩散生成深度融合,并引入了基于代理提示的强化学习和Δ-空间权重合并等方法,有效提升了生成质量与推理效率,实验表明其在多个基准测试中优于现有开源模型,接近商业模型的性能。

详情
英文摘要

Despite rapid advancements, current text-to-image (T2I) models predominantly rely on a single-step generation paradigm, which struggles with complex semantics and faces diminishing returns from parameter scaling. While recent multi-step reasoning approaches show promise, they are hindered by ungrounded planning hallucinations lacking verification, monolithic post-hoc reflection, long-context optimization instabilities, and prohibitive inference latency. To overcome these bottlenecks, we propose the Closed-Loop Visual Reasoning (CLVR) framework, a comprehensive system that deeply couples visual-language logical planning with pixel-level diffusion generation. CLVR introduces an automated data engine with step-level visual verification to synthesize reliable reasoning trajectories, and proposes Proxy Prompt Reinforcement Learning (PPRL) to resolve long-context optimization instabilities by distilling interleaved multimodal histories into explicit reward signals for accurate causal attribution. Furthermore, to mitigate the severe latency bottleneck caused by iterative denoising, we propose $Δ$-Space Weight Merge (DSWM), a theoretically grounded method that fuses alignment weights with off-the-shelf distillation priors, reducing the per-step inference cost to just 4 NFEs without requiring expensive re-distillation. Extensive experiments demonstrate that CLVR outperforms existing open-source baselines across multiple benchmarks and approaches the performance of proprietary commercial models, unlocking general test-time scaling capabilities for complex visual generation.

2605.14665 2026-05-18 cs.AI cs.CL cs.IR

Falkor-IRAC: Graph-Constrained Generation for Verified Legal Reasoning in Indian Judicial AI

Joy Bose

发表机构 * Independent Researcher(独立研究者)

AI总结 该论文提出了一种名为Falkor-IRAC的图约束生成框架,旨在提升印度司法AI系统中法律推理的准确性和可靠性。该方法基于IRAC(问题、规则、分析、结论)知识图谱,将印度最高法院和高等法院的判决结构化为图节点,并整合程序状态转换、先例关系和法律条文引用。在推理过程中,系统仅接受能通过图结构验证的生成结果,从而有效减少错误引用和推理链不完整的问题,并能主动检测法律原则间的冲突,为法律AI的可信推理提供了新思路。

Comments 20 pages, 8 figures, 4 tables

详情
英文摘要

Legal reasoning is not semantic similarity search. A court judgment encodes constrained symbolic reasoning: precedent propagation, procedural state transitions, and statute-bound inference. These are properties that vector-based retrieval-augmented generation (RAG) cannot faithfully represent. Hallucinated precedents, outdated statute citations, and unsupported reasoning chains remain persistent failure modes in LLM-based legal AI, with real consequences for access to justice in high-caseload jurisdictions such as India. This paper presents Falkor-IRAC, a graph-constrained generation framework for Indian legal AI that grounds generation in structured reasoning over an IRAC (Issue, Rule, Analysis, Conclusion) knowledge graph. Judgments from the Supreme Court and High Courts of India are ingested as IRAC node structures enriched with procedural state transitions, precedent relationships, and statutory references, stored in FalkorDB for low-latency agentic traversal. At inference time, LLM-generated answers are accepted only if a valid supporting path can be traced through the graph, a check performed by a falsifiability oracle called the Verifier Agent. The system also detects doctrinal conflicts as a first-class output rather than silently resolving them. Falkor-IRAC is evaluated using graph-native metrics: citation grounding accuracy, path validity rate, hallucinated precedent rate, and conflict detection rate. These metrics are argued to be more appropriate for legal reasoning evaluation than BLEU and ROUGE. On a proof-of-concept corpus of 51 Supreme Court judgments, the Verifier Agent correctly validated citations on completed queries and correctly rejected fabricated citations. Evaluation against vector-only RAG baselines is left for future work. The companion InIRAC dataset, 500+ structured Indian court judgments with IRAC annotations, is released alongside this paper.