arXivDaily: daily arXiv academic digest, updated Monday through Friday
2605.13333 2026-05-14 cs.CV cs.AI cs.GR cs.LG

Stylized Text-to-Motion Generation via Hypernetwork-Driven Low-Rank Adaptation

Junhyuk Jeon, Seokhyeon Hong, Junyong Noh

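Affiliations: Visual Media Lab, KAIST

AI summary: This work addresses the difficulty that text-driven motion diffusion models have in generating fine-grained stylized motion and proposes a lightweight style-conditioning framework. A hypernetwork generates low-rank adaptation (LoRA) parameters that dynamically modulate a pretrained diffusion model, giving fine control over style during denoising. A supervised contrastive loss structures the style latent space, which improves generalization to unseen styles, and the method reports state-of-the-art stylization results on the HumanML3D and 100STYLE datasets.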

Comments Accepted to SIGGRAPH 2026. Project page: https://junhyukjeon.github.io/projects/style-salad/

Abstract

Text-driven motion diffusion models are capable of generating realistic human motions, but text alone often struggles to express fine-level nuances of motion, commonly referred to as style. Recent approaches have tackled this challenge by attaching a style injection mechanism to a pretrained text-driven diffusion model. Existing stylization methods, however, either require style-specific fine-tuning of existing models or rely on heavy ControlNet-based architectures, limiting efficiency and generalization to unseen styles. We propose a lightweight style conditioning framework that dynamically modulates a pretrained diffusion model through hypernetwork-generated LoRA parameters. A style reference motion is encoded into a global style embedding, which is mapped by a hypernetwork to low-rank updates applied at each denoising step of the diffusion model. By structuring the style latent space with a supervised contrastive loss, our framework reliably captures diverse stylistic attributes, improves generalization to unseen styles, and supports optimization-based guidance without requiring predefined style categories. Experiments on the HumanML3D and 100STYLE datasets show state-of-the-art stylization results, while achieving improved stylization for unseen styles.
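The mechanism described in this abstract, a style embedding mapped by a hypernetwork to low-rank updates of a frozen layer, can be illustrated with a small sketch. Everything below is an assumption made for illustration: the linear hypernetwork, the layer sizes, the random untrained weights, and the names `make_hypernetwork` and `lora_forward` are hypothetical and are not taken from the paper.

```python
import random

def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def make_hypernetwork(style_dim, d_out, d_in, rank, seed=0):
    """Toy linear hypernetwork: style embedding -> LoRA factors (B, A).

    Hypothetical stand-in; the weights are random, not trained.
    """
    rng = random.Random(seed)
    n_params = rank * (d_out + d_in)
    H = [[rng.gauss(0.0, 0.1) for _ in range(style_dim)] for _ in range(n_params)]

    def hyper(style):
        flat = matvec(H, style)
        # First d_out*rank values form B (d_out x rank); the rest form A (rank x d_in).
        B = [flat[i * rank:(i + 1) * rank] for i in range(d_out)]
        off = d_out * rank
        A = [flat[off + r * d_in:off + (r + 1) * d_in] for r in range(rank)]
        return B, A

    return hyper

def lora_forward(W, B, A, x, scale=1.0):
    """Compute (W + scale * B A) x without forming the full update matrix."""
    base = matvec(W, x)
    low = matvec(B, matvec(A, x))  # passes through a rank-sized bottleneck
    return [b + scale * l for b, l in zip(base, low)]

# Frozen "pretrained" layer (identity, for readability) and two style embeddings.
W = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
hyper = make_hypernetwork(style_dim=3, d_out=4, d_in=4, rank=2)
x = [1.0, 2.0, 3.0, 4.0]

B, A = hyper([0.5, -1.0, 2.0])
y_styled = lora_forward(W, B, A, x)

B0, A0 = hyper([0.0, 0.0, 0.0])
y_neutral = lora_forward(W, B0, A0, x)  # zero style gives a zero update
```

In this toy version a zero style embedding leaves the frozen layer unchanged, so the base model's behavior is recovered exactly; whether the paper's hypernetwork has that property is not stated in the abstract.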

2605.13332 2026-05-14 cs.AI cs.CC

Diversity of Extensions in Abstract Argumentation

Johannes K. Fichte, Markus Hecher, Yasir Mahmood, Zhengjun Wang

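Affiliations: Department of Computer and Information Science (IDA), Linköping University, Sweden; University of Potsdam, Germany & University of Artois, CNRS, UMR8188 (CRIL), France; Data Science Group, Heinz Nixdorf Institute, Paderborn University, Germany

AI summary: This paper studies the diversity of extensions in abstract argumentation frameworks and proposes a quantitative diversity measure, based on the symmetric difference, that captures how far apart extensions are. The authors systematically classify the computational complexity of the associated reasoning problems, including whether a framework admits extensions with a given diversity and how to compute the largest achievable diversity value. They also outline a prototype for computing diversity levels and evaluate it experimentally.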

Comments Technical Report to the paper accepted at IJCAI 2026

Abstract

Argumentation is an important topic of AI for modelling and reasoning about arguments. In abstract argumentation, we consider directed graphs, so-called argumentation frameworks (AF), that express conflicts between arguments. The semantics is defined by the notion of extensions, which are sets of arguments that satisfy particular relationship conditions in the AF. Usually, standard reasoning in argumentation does not reveal how far apart extensions are. We introduce a quantitative notion of diversity of extensions based on the symmetric difference and provide a systematic complexity classification. Intuitively, diversity captures whether extensions of a framework (accepted viewpoints) differ only marginally or represent fundamentally incompatible sets of arguments. We study whether an AF admits k-diverse extensions, whether it admits k-diverse extensions covering specific arguments, and how to compute the largest k for which an AF admits k-diverse extensions. We outline a prototype and provide an evaluation for computing diversity levels.
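To make the symmetric-difference idea concrete, the sketch below enumerates the stable extensions of a tiny framework by brute force and reports the largest pairwise symmetric-difference size. Two assumptions are made here and are not taken from the paper: stable semantics is chosen as the example semantics, and diversity is read as a pairwise quantity. The paper's formal definition of k-diverse extensions may differ.

```python
from itertools import combinations

def stable_extensions(args, attacks):
    """Enumerate stable extensions by brute force (exponential; tiny AFs only).

    A set S is stable if it is conflict-free and attacks every argument outside S.
    """
    exts = []
    for r in range(len(args) + 1):
        for cand in combinations(sorted(args), r):
            s = set(cand)
            conflict_free = not any((a, b) in attacks for a in s for b in s)
            attacks_rest = all(any((a, b) in attacks for a in s) for b in args - s)
            if conflict_free and attacks_rest:
                exts.append(frozenset(s))
    return exts

def max_pairwise_diversity(exts):
    """Largest symmetric-difference size over pairs of extensions (0 if fewer than 2)."""
    return max((len(e1 ^ e2) for e1, e2 in combinations(exts, 2)), default=0)

# Toy AF: a and b attack each other, and b attacks c.
args = {"a", "b", "c"}
attacks = {("a", "b"), ("b", "a"), ("b", "c")}

exts = stable_extensions(args, attacks)   # {b} and {a, c}
k_max = max_pairwise_diversity(exts)      # the two extensions share no argument
```

Here the two stable extensions are disjoint and together cover all three arguments, so they are as far apart as this framework allows.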

2605.13330 2026-05-14 cs.CL

FIND: Toward Multimodal Financial Reasoning and Question Answering for Indic Languages

Sarmistha Das, Vaibhav Vishal, Syed Ibrahim Ahmad, Manish Gupta, Sriparna Saha

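Affiliations: Indian Institute of Technology Patna; Microsoft

AI summary: Targeting numerical reasoning and question answering in multilingual financial settings, this work introduces FinVQA, a new benchmark for Indic languages that covers six languages (English, Hindi, Bengali, Marathi, Gujarati, and Tamil) with 18,900 samples across 14 financial domains. To handle the multimodal and multilingual challenges, it also proposes FIND, a framework that combines supervised fine-tuning with constraint-aware decoding to improve numerical reasoning, multimodal understanding, and structured decision-making. Together they provide an evaluation and modeling paradigm for high-stakes multilingual financial reasoning.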

Abstract

Financial decision-making in multilingual settings demands accurate numerical reasoning grounded in diverse modalities, yet existing benchmarks largely overlook this high-stakes, real-world challenge, especially for Indic languages. We introduce FinVQA, a benchmark for evaluating financial numerical and multimodal reasoning in multilingual Indic contexts. FinVQA spans English, Hindi, Bengali, Marathi, Gujarati, and Tamil, and comprises 18,900 samples across 14 financial domains. The dataset captures diverse reasoning paradigms under realistic constraints, and is structured across three difficulty levels (easy, moderate, hard) and four question formats: multiple choice, fill-in-the-blank, table matching, and true/false. To address these challenges, we propose FIND, a framework that combines supervised fine-tuning with constraint-aware decoding to promote faithful numerical reasoning, robust multimodal grounding, and structured decision-making. Together, FinVQA and FIND establish a rigorous evaluation and modeling paradigm for high-stakes multilingual multimodal financial reasoning.

2605.13329 2026-05-14 cs.CL cs.AI

Tracing Persona Vectors Through LLM Pretraining

Viktor Moskvoretskii, Dominik Glandorf, Jorge Medina Moreira, Tanja Käser, Robert West

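Affiliations: EPFL

AI summary: This paper studies how large language models form, during pretraining, the "persona vectors" that represent high-level behaviors, and traces how these vectors evolve across the pretraining of OLMo-3-7B. Persona vectors emerge early in pretraining and continue to be refined as training proceeds. Different elicitation strategies surface different facets of the model's behavior, and the findings transfer qualitatively to Apertus-8B, indicating that persona vectors are stable features formed early in pretraining and opening a direction for interpretability research on model behavior.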

Comments Preprint

Abstract

How large language models internally represent high-level behaviors is a core interpretability question with direct relevance to AI safety: it determines what we can detect, audit, or intervene on. Recent work has shown that traits such as evil or sycophancy correspond to linear directions in the internal activations, the so-called persona vectors. Although these vectors are now routinely utilized to inspect and steer model behavior in safety-relevant settings, how these representations are formed during training remains unknown. To address this gap, we trace persona vectors across the pretraining of OLMo-3-7B, finding that persona vectors form remarkably early -- within 0.22% of OLMo-3 pretraining -- and remain effective for steering the fully post-trained instruct models. Although core representations are formed early on, persona vectors continue to refine geometrically and semantically throughout pretraining. We further compare alternative elicitation strategies and find that all yield effective directions, with each strategy surfacing qualitatively distinct facets of the underlying persona. Replicating our analysis on Apertus-8B reveals that our findings transfer qualitatively beyond OLMo-3. Our results establish persona representations as stable features of early pretraining and open a path to studying how training forms, refines, and shapes them.
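A linear direction of the kind this abstract describes is often estimated as a difference of mean activations between trait-eliciting and baseline prompts. The sketch below assumes that recipe; the paper's exact extraction and elicitation procedures are not given in the abstract, and the toy activations and function names are hypothetical. The cosine helper shows one way such directions could be compared across checkpoints.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def persona_vector(trait_acts, baseline_acts):
    """Difference of means: mean(trait activations) - mean(baseline activations)."""
    mu_t = mean_vector(trait_acts)
    mu_b = mean_vector(baseline_acts)
    return [t - b for t, b in zip(mu_t, mu_b)]

def steer(hidden, direction, alpha):
    """Shift a hidden state along the direction by strength alpha."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

def cosine(u, v):
    """Cosine similarity, e.g. between directions from two checkpoints."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v)

# Toy 2-D activations (made up for illustration).
trait_acts = [[2.0, 0.0], [4.0, 0.0]]
baseline_acts = [[0.0, 1.0], [0.0, 1.0]]

v = persona_vector(trait_acts, baseline_acts)   # [3.0, -1.0]
steered = steer([0.0, 0.0], v, alpha=2.0)       # [6.0, -2.0]
```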

2605.13328 2026-05-14 cs.RO cs.AI cs.CL cs.CV

What Limits Vision-and-Language Navigation?

Yunheng Wang, Yuetong Fang, Taowen Wang, Lusong Li, Kun Liu, Junzhe Xu, Zizhao Yuan, Yixiao Feng, Jiaxi Zhang, Wei Lu, Zecui Zeng, Renjing Xu

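Affiliations: HKUST(GZ); JD Explore Academy

AI summary: Vision-and-Language Navigation (VLN) is an important research direction in embodied intelligence, but existing methods often degrade when moving from simulation to the real world because of perceptual instability and ambiguous instructions. This paper proposes StereoNav, a robust Vision-Language-Action framework that introduces target-location priors and stereo vision to improve the stability and accuracy of cross-domain navigation. Experiments show state-of-the-art performance on R2R-CE and RxR-CE, and real-robot deployments show substantially more reliable navigation in complex environments.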

Abstract

Vision-and-Language Navigation (VLN) is a cornerstone of embodied intelligence. However, current agents often suffer from significant performance degradation when transitioning from simulation to real-world deployment, primarily due to perceptual instability (e.g., lighting variations and motion blur) and under-specified instructions. While existing methods attempt to bridge this gap by scaling up model size and training data, we argue that the bottleneck lies in the lack of robust spatial grounding and cross-domain priors. In this paper, we propose StereoNav, a robust Vision-Language-Action framework designed to enhance real-world navigation consistency. To address the inherent gap between synthetic training and physical execution, we introduce Target-Location Priors as a persistent bridge. These priors provide stable visual guidance that remains invariant across domains, effectively grounding the agent even when instructions are vague. Furthermore, to mitigate visual disturbances like motion blur and illumination shifts, StereoNav leverages stereo vision to construct a unified representation of semantics and geometry, enabling precise action prediction through enhanced depth awareness. Extensive experiments on R2R-CE and RxR-CE demonstrate that StereoNav achieves state-of-the-art egocentric RGB performance, with SR and SPL scores of 81.1% and 68.3%, and 67.5% and 52.0%, respectively, while using significantly fewer parameters and less training data than prior scaling-based approaches. More importantly, real-world robotic deployments confirm that StereoNav substantially improves navigation reliability in complex, unstructured environments. Project page: https://yunheng-wang.github.io/stereonav-public.github.io.

2605.13321 2026-05-14 cs.RO

HCSG: Human-Centric Semantic-Geometric Reasoning for Vision-Language Navigation

Haoxuan Xu, Tianfu Li, Wenbo Chen, Yi Liu, Jin Wu, Huashuo Lei, Yunfan Lou, Lujia Wang, Hesheng Wang, Haoang Li

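Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University; University of Science and Technology Beijing; National University of Singapore; Shanghai Jiao Tong University

AI summary: Vision-Language Navigation (VLN) has progressed markedly through scaling data and model capacity, but in real indoor scenes robots must deal with moving pedestrians, and existing methods mostly treat them as passive obstacles without actively reasoning about human intent or social norms. This paper proposes HCSG, described as the first human-centric VLN framework, which fuses a geometric forecasting module with a semantic interpretation module to actively understand human behavior and control social distance, improving navigation safety and social compliance. On the HA-VLNCE benchmark, HCSG substantially outperforms existing methods, with a 14% improvement in success rate and a 34% reduction in collision rate.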

Abstract

VLN has achieved remarkable progress by scaling data and model capacity. However, the assumption of a static environment breaks down in real-world indoor scenarios, where robots inevitably encounter dynamic pedestrians. Existing human-aware approaches typically treat humans merely as moving obstacles based on implicit visual cues, lacking the explicit reasoning required to interpret human intentions or maintain social norms. To address this, we propose HCSG, the first human-centric framework for VLN. This framework provides a robust foundation for safe, socially intelligent navigation in dynamic human-robot environments that shifts the paradigm from passive collision avoidance to active human behavior understanding. Specifically, HCSG introduces a unified Human Understanding Module that synergizes two key capabilities: (i) geometric forecasting, which predicts human pose and trajectory to anticipate future motion dynamics; and (ii) semantic interpretation, which leverages a Vision-Language Model (VLM) to generate natural language descriptions of human actions and intentions. These semantic-geometric representations are fused into the agent's topological map for instruction-conditioned planning. Furthermore, a social distance loss is introduced to enforce socially compliant interaction distances. Extensive experiments on the HA-VLNCE benchmark demonstrate that HCSG significantly outperforms state-of-the-art methods, achieving a 14% improvement in Success Rate and a 34% reduction in Collision Rate. Our project can be seen at https://haoxuanxu1024.github.io/HCSG/.

2605.13316 2026-05-14 cs.CV

Test-time Sparsity for Extreme Fast Action Diffusion

Kangye Ji, Yuan Meng, Jianbo Zhou, Ye Li, Chen Tang, Zhi Wang

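Affiliations: Tsinghua University; The Chinese University of Hong Kong

AI summary: To address the high computational cost of action diffusion models when generating high-quality action sequences, this work proposes test-time sparsity, which accelerates generation by dynamically predicting prunable residual computations in each model forward pass. To remove the efficiency bottlenecks caused by repeated encoding and pruning, it designs a highly parallelized inference pipeline and introduces an omnidirectional reuse strategy that raises both pruning sparsity and generation efficiency. Experiments show a 92% reduction in FLOPs and a 5x speedup in action generation with no loss in performance.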

Abstract

Action diffusion excels at high-fidelity action generation but incurs heavy computational costs owing to its iterative denoising nature. Despite current technologies showing promise in accelerating diffusion transformers by reusing the cached features, they struggle to adapt to policy dynamics arising from diverse perceptions and multi-round rollout iterations in open environments. We propose test-time sparsity to tackle this challenge, which aims to accelerate action diffusion by dynamically predicting prunable residual computations for each model forward at test time. However, two bottlenecks remain in this paradigm: 1) repetitive conditional encoding and pruning offset most potential speed gains, and 2) the features cached from previous denoising timesteps cannot constrain large pruning errors under aggressive sparsity. To address the first bottleneck, we design a highly parallelized inference pipeline that minimizes the non-decoder delay to milliseconds. Specifically, we first design a lightweight pruner that shares the encoder with the diffusion transformer. Then, we decouple the encoding and pruning from the autoregressive denoising loop by processing all denoising timesteps in parallel, and overlap the pruner with the decoder forward inference through asynchronism. To overcome the second bottleneck, we introduce an omnidirectional reusing strategy, which achieves 95% sparsity by selectively reusing features cached from the current forward, previous denoising timesteps, and earlier rollout iterations. To learn the rollout-level reusing strategies, we sample a few action trajectories to supervise the sparsified diffusion step by step. Extensive experiments demonstrate that our method reduces FLOPs by 92% and accelerates action generation by 5x, achieving lossless performance with an inference frequency of 47.5 Hz. Our code is available at https://github.com/ky-ji/Test-time-Sparsity.

2605.13312 2026-05-14 cs.LG

Supervised Deep Multimodal Matrix Factorization for Interpretable Brain Network Analysis

Amjad Seyedi, Lifang He, Songlin Zhao, Akwum Onwunta, Nicolas Gillis

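Affiliations: Dept. of Mathematics & Operational Research, University of Mons; Dept. of Computer Science & Engineering, Lehigh University; Dept. of Industrial & Systems Engineering, Lehigh University

AI summary: This paper proposes SD3MF, an interpretable supervised deep multimodal matrix factorization framework for integrative analysis of multimodal brain network data. The method extends conventional unsupervised single-graph clustering to supervised prediction over multimodal graphs, learning per-modality features through deep hierarchical factorization and building a shared latent representation that aligns subjects across views. Experiments on multimodal connectome datasets show that SD3MF outperforms deep learning baselines such as CNNs and GNNs while yielding biologically meaningful, interpretable features.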

Abstract

We present Supervised Deep Multimodal Matrix Factorization (SD3MF), an interpretable framework for integrative brain network analysis that generalizes Symmetric Nonnegative Matrix Tri-Factorization (SNMTF) from unsupervised single-graph clustering to supervised prediction over populations of multimodal graphs. SD3MF learns deep hierarchical factorizations for each modality together with a shared latent representation that aligns subjects across views. An encoder-decoder formulation jointly optimizes graph reconstruction and supervised prediction, while adaptive weights enable data-driven multimodal fusion. By representing each subject through community-level interaction matrices, the model yields interpretable and discriminative features. Experiments on multimodal connectome datasets show that SD3MF consistently outperforms strong deep learning baselines such as CNNs and GNNs, while enabling biologically interpretable insights. Code for reproducibility is available at: https://github.com/amjadseyedi/SD3MF.

2605.13311 2026-05-14 cs.AI cs.IR cs.MA

IdeaForge: A Knowledge Graph-Grounded Multi-Agent Framework for Cross-Methodology Innovation Analysis and Patent Claim Generation

Joy Bose

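Affiliations: Independent Researcher

AI summary: IdeaForge is a knowledge graph-grounded multi-agent framework that supports innovation analysis and patent claim generation across innovation methodologies such as TRIZ and Design Thinking. Specialist agents collaborate over a persistent knowledge graph, integrating the structured reasoning produced by each methodology and using the graph structure to link claims on which methodologies converge, which identifies high-confidence innovation candidates. The work presents a graph-based convergence mechanism and a patent drafting pipeline, and experiments indicate more diverse and traceable innovation candidates than single-methodology baselines.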

Comments 14 pages, 3 figures, 6 tables

Abstract

Current AI-assisted innovation systems typically apply a single ideation methodology (such as TRIZ or Design Thinking) using sequential prompt-based workflows that do not preserve intermediate reasoning structure. As a result, insights generated across methodologies remain fragmented, limiting traceability, synthesis, and systematic evaluation of novelty. We present IdeaForge, a knowledge graph-grounded multi-agent framework for innovation analysis and patent claim generation. IdeaForge integrates multiple innovation methodologies (TRIZ, Design Thinking, and SCAMPER) through specialist agents operating over a persistent FalkorDB knowledge graph. Each agent contributes structured entities and relationships representing contradictions, inventive principles, user needs, transformations, analogies, and candidate claims. The central contribution of IdeaForge is a cross-methodology convergence mechanism implemented through graph-based claim linkage. Claims independently supported by multiple methodologies are connected using CONVERGENT relationships, enabling identification of high-confidence innovation candidates through graph traversal. A downstream patent drafting agent generates structured patent drafts grounded in convergent claim subgraphs, reducing reliance on unconstrained language model generation. An InnovationScore formula ranks claims by convergent support, methodology diversity, claim strength, and prior art challenge count. We describe the graph schema, agent architecture, convergence detection pipeline, and patent synthesis workflow. Experiments on a legal technology use case demonstrate that graph-grounded multi-methodology synthesis produces more diverse and traceable innovation candidates compared to single-methodology baselines. We discuss implications for computational creativity, explainable AI-assisted invention, and graph-native innovation systems.

2605.13307 2026-05-14 cs.CL cs.HC

PRISM-X: Experiments on Personalised Fine-Tuning with Human and Simulated Users

Hannah Rose Kirk, Liu Leqi, Fanzhi Zeng, Henry Davidson, Bertie Vidgen, Christopher Summerfield, Scott A. Hale

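Affiliations: University of Oxford; UK AI Security Institute; University of Texas at Austin; Mercor; Meedan

AI summary: This study examines how effective personalised fine-tuning is in conversational systems, using a large-scale within-subject experiment to compare how real and simulated users rate personalised and non-personalised language models. Fine-tuning on user preferences outperforms both a generic model and personalised prompting in short-term evaluations, but it may amplify sycophancy and relationship-seeking behaviours with possible long-term harms. Simulated users differ markedly from real users in judgement consistency, topic diversity, and feedback dynamics, so they cannot fully replace humans in evaluation.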

Abstract

Personalisation is a standard feature of conversational AI systems used by millions; yet, the efficacy of personalisation methods is often evaluated in academic research using simulated users rather than real people. This raises questions about how users and their simulated counterparts differ in interaction patterns and judgements, as well as whether personalisation is best achieved through context-based prompting or weight-based fine-tuning. Here, in a large-scale within-subject experiment, we re-recruit 530 participants from 52 countries two years after they gave their preferences in the PRISM dataset (Kirk et al., 2024) to evaluate personalised and non-personalised language models in blinded multi-turn conversations. We find preference fine-tuning (P-DPO, Li et al., 2024) significantly outperforms both a generic model and personalised prompting but adapting to individual preference data yields marginal gains over training on pooled preferences from a diverse population. Beyond length biases, fine-tuning amplifies sycophancy and relationship-seeking behaviours that people reward in short-term evaluations but which may introduce deleterious long-term consequences. Replicating this within-subject experiment with simulated users recovers aggregate model hierarchies but simulators perform far below human self-consistency baselines for individual judgements, discuss different topics, exhibit amplified position biases, and produce feedback dynamics that diverge from humans.

2605.13306 2026-05-14 cs.CV

Color Constancy in Hyperspectral Imaging via Reduced Spectral Spaces

G. Dofri Vidarsson, Liying Lu, Sabine Süsstrunk

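Affiliations: École Polytechnique Fédérale de Lausanne (EPFL), Lausanne, Switzerland

AI summary: This paper studies how reducing spectral dimensionality affects illuminant estimation for color constancy in hyperspectral imaging. Using the Color-by-Correlation (CbC) framework, the authors analyze how different spectral dimensionality reduction strategies influence illuminant estimation and identify the conditions under which compact spectral representations outperform conventional RGB approaches. The study offers practical guidance on exploiting hyperspectral information efficiently for illuminant estimation.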

Abstract

Illuminant estimation aims to infer scene illumination from image measurements despite intrinsic ambiguities between surface reflectance and lighting. Most existing methods operate on trichromatic RGB images and are therefore fundamentally limited by the restricted spectral information available. Hyperspectral imaging provides a much richer representation of scene radiance and has the potential to alleviate these ambiguities. However, its high dimensionality poses computational and statistical challenges. In this work, we systematically study the effect of spectral dimensionality and representation choice on illuminant estimation performance using hyperspectral data. We adopt the practical and effective Color-by-Correlation (CbC) framework as the estimation backbone and analyze its behavior under different spectral dimensionality reduction strategies. Our results offer practical insights into how hyperspectral information can be efficiently exploited for illuminant estimation and identify conditions under which compact spectral representations outperform conventional RGB-based approaches. The code is available at https://github.com/IVRL/Reduced-Spectral-Color-Constancy.

2605.13305 2026-05-14 cs.LG math.DS physics.chem-ph

MPINeuralODE: Multiple-Initial-Condition Physics-Informed Neural ODEs for Globally Consistent Dynamical System Learning

Lake Yang, Antonio Malpica-Morales, Frank Ioannis Papadakis Wood, Serafim Kalliadasis

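Affiliations: Department of Chemical Engineering, Imperial College London

AI summary: This paper proposes MPINeuralODE, a method addressing the poor generalization of neural ordinary differential equations (Neural ODEs) to unseen initial conditions and long-horizon prediction. It combines a soft physics-informed residual with a Multiple-Initial-Condition (MIC) multiple-shooting training curriculum, two structurally complementary ingredients that improve globally consistent learning of the system's vector field. On Lotka-Volterra, MPINeuralODE achieves the lowest out-of-sample and long-horizon error among data-driven methods and essentially matches the PINN ablation on Hamiltonian drift.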

Abstract

Neural ordinary differential equations (Neural ODEs) often fit training trajectories while generalizing poorly to unseen initial conditions and long horizons. We propose MPINeuralODE, which combines a soft physics-informed residual with a Multiple-Initial-Condition (MIC) multiple-shooting curriculum whose ingredients are structurally complementary: the physics term anchors the vector-field magnitude on the support that MIC enlarges. We evaluate along three axes: out-of-sample error, long-horizon stability, and Hamiltonian drift, which together expose whether the learned dynamics recover the underlying vector field. On Lotka-Volterra, MPINeuralODE achieves the lowest out-of-sample and long-horizon MSE among data-driven methods, with a 26% reduction over the baseline Neural ODE, while essentially matching the PINN ablation on Hamiltonian drift.
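The Hamiltonian-drift axis mentioned in this abstract can be illustrated on the classical Lotka-Volterra system, dx/dt = αx − βxy and dy/dt = δxy − γy, which conserves V = δx − γ ln x + βy − α ln y. The sketch below assumes this classical parameterization and measures drift as the largest deviation of V from its initial value along a trajectory; the paper's own parameter values and drift metric are not given in the abstract. `biased_field` is a hypothetical stand-in for an imperfect learned vector field.

```python
import math

def lv_field(state, alpha=1.0, beta=0.5, delta=0.5, gamma=1.0):
    """Classical Lotka-Volterra vector field (parameter values are assumptions)."""
    x, y = state
    return (alpha * x - beta * x * y, delta * x * y - gamma * y)

def lv_invariant(state, alpha=1.0, beta=0.5, delta=0.5, gamma=1.0):
    """Conserved quantity V = delta*x - gamma*ln(x) + beta*y - alpha*ln(y)."""
    x, y = state
    return delta * x - gamma * math.log(x) + beta * y - alpha * math.log(y)

def rk4_step(f, state, dt):
    """One classical fourth-order Runge-Kutta step."""
    k1 = f(state)
    k2 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k1)))
    k3 = f(tuple(s + 0.5 * dt * k for s, k in zip(state, k2)))
    k4 = f(tuple(s + dt * k for s, k in zip(state, k3)))
    return tuple(
        s + dt / 6.0 * (a + 2.0 * b + 2.0 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def invariant_drift(f, state0, dt, steps):
    """Largest |V(t) - V(0)| along a trajectory integrated under field f."""
    v0 = lv_invariant(state0)
    state, worst = state0, 0.0
    for _ in range(steps):
        state = rk4_step(f, state, dt)
        worst = max(worst, abs(lv_invariant(state) - v0))
    return worst

def biased_field(state):
    """Hypothetical imperfect model: the true field with a small error in dx/dt."""
    fx, fy = lv_field(state)
    return (fx - 0.02 * state[0], fy)

drift_true = invariant_drift(lv_field, (2.0, 1.0), dt=0.01, steps=2000)
drift_biased = invariant_drift(biased_field, (2.0, 1.0), dt=0.01, steps=2000)
```

Under the true field the drift reflects only integration error, while the perturbed field accumulates a visibly larger deviation. A learned vector field can be assessed the same way by passing it in place of `biased_field`.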

2605.13301 2026-05-14 cs.AI cs.CL

Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

Yafu Li, Runzhe Zhan, Haoran Zhang, Shunkai Zhang, Yizhuo Li, Zhilin Wang, Jiacheng Chen, Futing Wang, Xuyang Hu, Yuchen Fan, Bangjie Xu, Yucheng Su, Xinmiao Han, Chenxi Li, Haodi Lei, Yufeng Zhao, Zejin Lin, Qianjia Cheng, Tong Zhu, Xiaoye Qu, Ganqu Cui, Peng Ye, Yun Luo, Zhouchen Lin, Yu Qiao, Bowen Zhou, Ning Ding, Yu Cheng

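Affiliations: Shanghai AI Laboratory; The Chinese University of Hong Kong; Tsinghua University; Shanghai Jiao Tong University; Peking University

AI summary: This paper presents a simple, unified recipe for turning a post-trained reasoning model into a solver that reaches gold-medal level on International Mathematical Olympiad and International Physics Olympiad problems. The recipe uses a reverse-perplexity curriculum for supervised fine-tuning to instill rigorous proof-search and self-checking behaviors, scales them through a two-stage reinforcement learning pipeline, and then boosts solving performance with test-time scaling. The resulting model, SU-01, performs strongly on mathematics and physics competitions and also generalizes its scientific reasoning to other domains.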

Comments Technical Report. 77 pages

Abstract

Recent progress in reasoning models has substantially advanced long-horizon mathematical and scientific problem solving, with several systems now reaching gold-medal-level performance on International Mathematical Olympiad (IMO) and International Physics Olympiad (IPhO) problems. In this paper, we introduce a simple and unified recipe for converting a post-trained reasoning backbone into a rigorous olympiad-level solver. The recipe first uses a reverse-perplexity curriculum for SFT to instill rigorous proof-search and self-checking behaviors, then scales these behaviors through a two-stage RL pipeline that progresses from RL with verifiable rewards to more delicate proof-level RL, and finally boosts solving performance with test-time scaling. Applying this recipe, we train a 30B-A3B backbone with SFT on around 340K sub-8K-token trajectories followed by 200 RL steps. The resulting model, SU-01, supports stable reasoning on difficult problems with trajectories exceeding 100K tokens, while achieving gold-medal-level performance on mathematical and physical olympiad competitions, including IMO 2025/USAMO 2026 and IPhO 2024/2025. It also demonstrates strong generalization of scientific reasoning to domains beyond mathematics and physics.

2605.13297 2026-05-14 cs.LG

PaMM: Periodic Motif Memory for Atomistic Models with an Explicit Local-Structure Interface

Ryan Dong

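Affiliations: Independent Research

AI summary: This paper proposes PaMM, a periodic motif memory module that strengthens atomistic models' explicit modeling of the local coordination motifs that recur in crystal structures. PaMM introduces pair and triplet motif lookup tables keyed on atomic types and geometric features, encoding local structure explicitly and fusing it with the original edge features. Experiments show that, at a fixed training budget, PaMM improves energy and force prediction, and that the gain comes from the structured pair/triplet organization rather than from generic added capacity.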

Abstract

Periodic crystals repeatedly instantiate similar local coordination motifs across translated cells and chemically related structures, but current equivariant atomistic models usually encode these patterns only implicitly in dense edge features. We introduce PaMM, a periodic motif memory that augments the UMA eSCN-MD edge encoder with explicit pair and triplet lookup features. Pair motifs are keyed by $(Z_j, Z_i, b_r)$ and triplet motifs by $(Z_j, Z_i, Z_k, b_\theta)$, hashed into fixed-size tables and fused with the baseline edge representation through lightweight gate-only and affine-equipped variants. We evaluate PaMM in a matched UMA-S OMAT setting and focus on a narrow question: whether explicit motif memory helps at a fixed intermediate training budget. At the 10k-step checkpoint, both PaMM variants improve over the plain baseline; gate-only gives the best energy MAE, while the affine-equipped variant gives the best force MAE. A matched 20k follow-up keeps the same operating-point picture. Aligned controls show that the gain weakens for pair-only, triplet-only, random-bucket, and parameter-matched MLP alternatives, suggesting that the benefit is tied to structured pair/triplet organization rather than generic added capacity. A within-OMAT24 source-family evaluation also shows small but consistent gains across held-out generation families. We therefore make a focused claim: in the studied UMA-S + OMAT regime, explicit pair/triplet motif memory is a useful inductive bias for periodic atomistic modeling. We do not claim broad cross-dataset transfer, a uniquely preferred fusion variant, or strong scientific interpretability beyond a more inspectable local-structure interface.
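The pair lookup described in this abstract, a key (Z_j, Z_i, b_r) hashed into a fixed-size table and fused with the edge feature, might look roughly like the sketch below. The binning scheme, cutoff, table size, hash function, random table entries, and the sigmoid form of the gate are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random

TABLE_SIZE = 64   # assumed number of hash buckets
DIM = 4           # assumed feature width

def radial_bin(r, cutoff=6.0, n_bins=12):
    """Discretize a distance into equal-width bins (assumed scheme), clamped at the cutoff."""
    return min(int(r / cutoff * n_bins), n_bins - 1)

def pair_slot(z_j, z_i, b_r, table_size=TABLE_SIZE):
    """Map a pair key (Z_j, Z_i, b_r) to a table slot with a simple mixing hash."""
    h = (z_j * 73856093) ^ (z_i * 19349663) ^ (b_r * 83492791)
    return h % table_size

# Stand-in for a learned embedding table; entries are random here.
rng = random.Random(0)
pair_table = [[rng.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(TABLE_SIZE)]

def gate_only_fuse(edge_feat, motif_feat):
    """Scale each edge channel by sigmoid(motif channel); one possible gate-only fusion."""
    return [e / (1.0 + math.exp(-m)) for e, m in zip(edge_feat, motif_feat)]

# Example: a silicon (Z=14) and oxygen (Z=8) pair at distance 1.6.
b_r = radial_bin(1.6)
slot = pair_slot(14, 8, b_r)
fused = gate_only_fuse([1.0] * DIM, pair_table[slot])
```

Because the key is ordered, (Z_j, Z_i) and (Z_i, Z_j) generally land in different slots; distinct keys can also collide in a fixed-size table, which the model would have to tolerate.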

2605.13296 2026-05-14 cs.AI cs.LG cs.MA

Discrete Diffusion for Complex and Congested Multi-Agent Path Finding with Sparse Social Attention

Yuanzhe Wang, Tian Zhi, Zihang Wei, Hongguang Wang, Jiaming Guo, Yang Zhao, Zisheng Liu, Shiyu Quan, Xing Hu, Zidong Du, Yunji Chen

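Affiliations: State Key Lab of Processors, Institute of Computing Technology, CAS; School of Advanced Interdisciplinary Sciences, CAS; University of Chinese Academy of Sciences; Institute of Microelectronics, CAS

AI summary: This paper studies multi-agent path finding (MAPF) in complex, congested environments and proposes DiffLNS, a hybrid framework built on a discrete diffusion model that generates high-quality initial plan drafts to improve repair-based solvers. It combines a discrete denoising diffusion probabilistic model (D3PM) that uses sparse social attention with the LNS2 algorithm, sampling diverse joint plan drafts directly in the discrete action space, which raises success rates on large-scale MAPF instances. Across 20 complex and congested settings, DiffLNS reaches an average success rate of 95.8%, 9.6 percentage points above the strongest tested baseline.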

Comments 24 pages, 7 figures

Abstract

Multi-Agent Path Finding (MAPF) is a coordination problem that requires computing globally consistent, collision-free trajectories from individual start positions to assigned goal positions under combinatorial planning complexity. In dense environments, suboptimal initial plans induce compound conflicts that hinder feasible repair. For repair-based solvers like LNS2, initial plan quality critically affects downstream repair, yet this factor remains underexplored. We propose DiffLNS, a hybrid framework that integrates a discrete denoising diffusion probabilistic model (D3PM) with LNS2. The D3PM serves as an initializer with sparse social attention that learns a spatiotemporal prior over coordinated multi-agent action trajectories from expert demonstrations and samples multiple joint plans. Operating directly on the categorical action space, our discrete diffusion preserves the MAPF action structure and samples from a multimodal joint-plan distribution to produce diverse drafts well suited for neighborhood repair. These drafts act as warm starts for downstream repair, which completes unfinished trajectories and resolves remaining conflicts under hard MAPF constraints. Experimental results show that despite being trained only on instances with at most 96 agents, the initializer generalizes to scenarios with up to 312 agents at inference time. Across 20 complex and congested settings, DiffLNS achieves an average success rate of 95.8%, outperforming the strongest tested baseline by 9.6 percentage points and matching or exceeding all baselines in all 20 settings. To the best of our knowledge, this is the first work to leverage discrete diffusion for warm-starting an LNS-based MAPF solver.
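The conflicts that a repair stage must resolve in a draft joint plan can be counted directly from categorical action sequences. The sketch below uses the standard MAPF notions of vertex conflicts (two agents in the same cell at the same timestep) and edge conflicts (two agents swapping cells in one step). The action encoding, the obstacle-free grid, and the wait-at-final-cell convention are assumptions for illustration, not details from the paper.

```python
from itertools import combinations

# Assumed categorical action set: stay, up, down, left, right.
MOVES = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 0), 4: (1, 0)}

def rollout(start, actions):
    """Turn a start cell and an action sequence into a list of visited cells."""
    pos, path = start, [start]
    for a in actions:
        dx, dy = MOVES[a]
        pos = (pos[0] + dx, pos[1] + dy)
        path.append(pos)
    return path

def cell_at(path, t):
    """Cell occupied at time t; agents wait at their last cell after the plan ends."""
    return path[min(t, len(path) - 1)]

def count_conflicts(paths):
    """Return (vertex_conflicts, edge_conflicts) summed over all agent pairs."""
    vertex = edge = 0
    horizon = max(len(p) for p in paths)
    for p, q in combinations(paths, 2):
        for t in range(horizon):
            if cell_at(p, t) == cell_at(q, t):
                vertex += 1
            swapped = (
                t + 1 < horizon
                and cell_at(p, t) == cell_at(q, t + 1)
                and cell_at(p, t + 1) == cell_at(q, t)
                and cell_at(p, t) != cell_at(p, t + 1)
            )
            if swapped:
                edge += 1
    return vertex, edge

# Two agents heading toward each other along a corridor meet in the middle cell.
head_on = [rollout((0, 0), [4, 4]), rollout((2, 0), [3, 3])]
# Two adjacent agents exchanging cells in a single step.
swap = [rollout((0, 0), [4]), rollout((1, 0), [3])]

head_on_conflicts = count_conflicts(head_on)   # (1, 0)
swap_conflicts = count_conflicts(swap)         # (0, 1)
```

A draft with few such conflicts leaves less work for neighborhood repair, which is the sense in which the abstract treats initial plan quality as important.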

2605.13295 2026-05-14 cs.CL cs.AI cs.MA

CANTANTE: Optimizing Agentic Systems via Contrastive Credit Attribution

Tom Zehle

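Affiliations: University of Freiburg; ELLIS Institute Tübingen

AI summary: This paper proposes CANTANTE, a framework for optimizing LLM-based multi-agent systems. It addresses the credit-assignment problem by contrasting rollouts of different joint configurations on the same query, decomposing the system-level reward into per-agent update signals. CANTANTE achieves the best average rank among the evaluated optimizers on programming (MBPP), mathematical reasoning (GSM8K), and multi-hop question answering (HotpotQA), improving on the strongest baseline on MBPP and GSM8K at lower inference cost while staying within one standard deviation of it on HotpotQA.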

Abstract

LLM-based multi-agent systems have demonstrated strong performance across complex real-world tasks, such as software engineering, predictive modeling, and retrieval-augmented generation. Yet automating their configuration remains a structural challenge, as scores are available only at the system level, whereas the parameters governing agent behavior are local. We argue that optimizing these systems is fundamentally a credit-assignment problem. We therefore introduce CANTANTE, a framework that decomposes system-level rewards into per-agent update signals by contrasting rollouts of multiple joint configurations on the same query. We instantiate it for prompt optimization, treating agent prompts as learnable system parameters. We evaluate CANTANTE against GEPA and MIPROv2 on programming (MBPP), mathematical reasoning (GSM8K), and multi-hop question answering (HotpotQA). Across these benchmarks, CANTANTE achieves the best average rank among all evaluated optimizers and consistently outperforms unoptimized prompts. It improves over the strongest baseline by +18.9 percentage points on MBPP and +12.5 percentage points on GSM8K, while incurring a lower inference cost. It remains within one standard deviation of the strongest baseline on HotpotQA. Crucially, our credit correlation analysis confirms that the attributer produces meaningful per-agent signals rather than echoing the global system score.

2605.13293 2026-05-14 cs.CV

Img2CADSeq: Image-to-CAD Generation via Sequence-Based Diffusion

Shiyu Tan, Zixuan Zhao, Hao Gao, Zhiheng Chen, Xiaolong Yin, Enya Shen

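Affiliations: School of Software, Tsinghua University, China; Tsinghua University

AI summary: This paper proposes Img2CADSeq, a multi-stage image-to-CAD pipeline that generates high-quality Boundary Representation (BRep) CAD models from single-view images. Its core idea is to encode CAD operation sequences into a three-level hierarchical codebook and, guided by an importance prioritization that favors profiles over details, compress long sequences into a stable discrete latent space. To bridge the modality gap between images and CAD, it uses a point cloud intermediate representation aligned through contrastive learning to condition a VQ-Diffusion model. Supported by the newly introduced CAD-220K and PrintCAD datasets, the method significantly outperforms existing methods and produces STEP files that can be used directly in commercial CAD software.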

Comments Accepted by SIGGRAPH 2026 Conference

Abstract

Boundary Representation (BRep) is the standard format for Computer-Aided Design (CAD), yet reconstructing high-quality BReps from single-view images remains challenging due to the complexity of topological constraints and operation sequences. We present Img2CADSeq, a multi-stage pipeline that overcomes these limitations by encoding CAD sequences into a three-level hierarchical codebook. Guided by an importance prioritization, this strategy values profiles over details, compressing long sequences into a stable discrete latent space. To bridge the modality gap, we leverage a coarse-to-fine point cloud intermediate, aligning 2D visual features with 3D CAD sequences via contrastive learning to condition a VQ-Diffusion model. Supported by newly introduced CAD-220K and PrintCAD datasets, our approach ensures robust industrial domain adaptation. Extensive experiments demonstrate that Img2CADSeq significantly outperforms state-of-the-art methods, producing standard STEP files that can be directly used in commercial CAD software.

2605.13292 2026-05-14 cs.CL cs.AI cs.IR cs.LG

IndicMedDialog: A Parallel Multi-Turn Medical Dialogue Dataset for Accessible Healthcare in Indic Languages

Shubham Kumar Nigam, Suparnojit Sarkar, Piyush Patel

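Affiliations * University of Birmingham; Heritage Institute of Technology; Madan Mohan Malaviya University of Technology

AI summary This paper introduces IndicMedDialog, a parallel multi-turn medical dialogue dataset covering English and nine Indic languages, intended to improve the applicability and conversational realism of medical dialogue systems for Indic languages. Dialogues are generated with large language models, verified by native speakers, and refined through post-processing. On this dataset the authors fine-tune IndicMedLM, a parameter-efficient medical language model, to support more personalised symptom elicitation. Comparisons against multilingual baselines and expert evaluation are used to validate the model's clinical plausibility and effectiveness.

Comments Accepted in BioNLP @ ACL 2026 Conference

Details
English abstract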

Most existing medical dialogue systems operate in a single-turn question--answering paradigm or rely on template-based datasets, limiting conversational realism and multilingual applicability. We introduce IndicMedDialog, a parallel multi-turn medical dialogue dataset spanning English and nine Indic languages: Assamese, Bengali, Gujarati, Hindi, Marathi, Punjabi, Tamil, Telugu, and Urdu. The dataset extends MDDial with LLM-generated synthetic consultations, translated using TranslateGemma, verified by native speakers, and refined through a script-aware post-processing pipeline to correct phonetic, lexical, and character-spacing errors. Building on this dataset, we fine-tune IndicMedLM via parameter-efficient adaptation of a quantized small language model, incorporating optional patient pre-context to personalise multi-turn symptom elicitation. We evaluate against zero-shot multilingual baselines, conduct systematic error analysis across ten languages, and validate clinical plausibility through medical expert evaluation.

2605.13290 2026-05-14 cs.AI

What properties of reasoning supervision are associated with improved downstream model quality?

Mikołaj Langner, Dzmitry Pihulski, Jan Eliasz, Michał Rajkowski, Przemysław Kazienko, Maciej Piasecki, Jan Kocoń, Teddy Ferdinan

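Affiliations * Wroclaw Tech

AI summary This paper studies whether the utility of a reasoning dataset can be reliably predicted before training from intrinsic data metrics, reducing reliance on expensive trial-and-error fine-tuning. The authors propose a suite of quantitative metrics and validate them by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset, finding significant correlations with downstream model performance. The predictors of utility are scale-dependent: smaller models rely on alignment-focused metrics to ensure precision, whereas larger models benefit from high redundancy and verbose reasoning traces when solving complex tasks. The findings provide a scale-aware framework for validating reasoning data and selecting training sets more efficiently.

Comments To appear in the Proceedings of the International Conference on Computational Science (ICCS) 2026

Details
English abstract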

Validating training data for reasoning models typically requires expensive trial-and-error fine-tuning cycles. In this work, we investigate whether the utility of a reasoning dataset can be reliably predicted prior to training using intrinsic data metrics. We propose a suite of quantitative measures and evaluate their predictive power by fine-tuning 8B and 11B models on semantically distinct variants of a Polish reasoning dataset. Our analysis reveals that these intrinsic metrics demonstrate strong and significant correlations with downstream model performance. Crucially, we find that the predictors of utility are scale-dependent: smaller models rely on alignment-focused metrics to ensure precision, whereas larger models benefit from high redundancy, utilizing verbose traces to solve complex tasks. These findings establish a scale-aware framework for validating reasoning data, enabling practitioners to select effective training sets without the need for exhaustive empirical testing.

2605.13287 2026-05-14 cs.LG cs.AI math.OC stat.ML

Delightful Exploration

Ian Osband

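Affiliations * Google DeepMind

AI summary This paper proposes Delight-gated exploration (DE), an exploration strategy for large action spaces where the exploration budget is limited. It decides whether to explore by measuring "delight", the product of potential gain and surprisal, so that scarce exploratory actions are spent more efficiently. Across several tasks DE shows weaker regret growth than Thompson Sampling and $\varepsilon$-greedy, and its hyperparameters transfer across tasks without retuning.

Details
English abstract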

Most exploration algorithms search broadly until uncertainty is resolved. When the action space is too large to resolve within budget, practitioners default to $\varepsilon$-greedy, which bounds disruption but spends its override blindly. We introduce \textit{Delight-gated exploration} (DE), a host--override rule that spends exploratory actions only when their prospective delight (expected improvement times surprisal) exceeds a gate price. This practical heuristic recovers a classical result: Pandora's reservation-value rule for costly search, with surprisal setting the effective inspection cost. Resolved arms exit the gate, fresh arms shut off above a prior-determined threshold, and selected linear-bandit overrides consume finite information budget. Across Bernoulli bandits, linear bandits, and tabular MDPs, the same hyperparameters transfer without retuning, and DE shows much weaker regret growth than Thompson Sampling and $\varepsilon$-greedy in the tested unresolved regimes. Delight improves acting for the same reason it improves learning: it prices scarce resources by the product of upside and surprisal.
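The gate described in the abstract can be illustrated with a minimal sketch. It assumes surprisal is the negative log-probability the host policy assigns to an action, and that the override with the largest delight is taken when it clears the gate; the function names, the candidate tuple layout, and the tie-breaking rule are hypothetical and not taken from the paper.

```python
import math


def delight(expected_improvement: float, action_prob: float) -> float:
    """Delight = expected improvement times surprisal.

    Assumption: surprisal is -log(probability of the action under the host policy).
    """
    surprisal = -math.log(action_prob)
    return expected_improvement * surprisal


def choose_action(host_action, candidates, gate_price):
    """Return the highest-delight override if it exceeds the gate price,
    otherwise fall back to the host policy's action.

    candidates: list of (action, expected_improvement, action_prob) tuples.
    """
    best_action, best_delight = host_action, gate_price
    for action, improvement, prob in candidates:
        d = delight(improvement, prob)
        if d > best_delight:  # strictly above the current best (initially the gate)
            best_action, best_delight = action, d
    return best_action


candidates = [("a", 0.1, 0.5), ("b", 0.3, 0.05)]
print(choose_action("host", candidates, gate_price=0.5))
print(choose_action("host", candidates, gate_price=1.0))
```

In this toy setting a rare action with moderate upside clears a low gate, while a higher gate price leaves the host policy undisturbed.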

2605.13283 2026-05-14 cs.LG math.ST stat.TH

Byzantine-Robust Distributed Sparse Learning Revisited

Yuxuan Wang, Lixin Zhang, Kangqiang Li

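Affiliations * School of Mathematical Sciences; School of Statistics and Mathematics; Information Center

AI summary This paper revisits Byzantine-robust distributed estimation for high-dimensional sparse linear models. The authors propose a framework that combines local robust $\ell_1$-regularized estimation with robust aggregation at the server, applicable to pseudo-Huber regression, quantile regression, and sparse support vector machines. Under relatively weak conditions the method provides non-asymptotic guarantees and attains near-optimal statistical convergence rates while remaining communication-efficient. Simulations confirm its robustness in estimation, support recovery, and classification accuracy under a variety of Byzantine attacks.

Details
English abstract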

We revisit Byzantine robust distributed estimation for high-dimensional sparse linear models. By combining local $\ell_1$-regularized robust estimation with robust aggregation at the server, the framework applies to pseudo-Huber regression, quantile regression, and sparse SVM. We show that the resulting estimators yield non-asymptotic guarantees and attain near-optimal statistical rates under mild conditions, while remaining communication-efficient. Simulations confirm strong robustness in estimation, support recovery and classification accuracy under various Byzantine attacks.
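As an illustration of what robust aggregation at the server can look like, here is a small sketch using the coordinate-wise median. This is an assumption for illustration only: the abstract does not say which aggregator the paper uses, and the function name is hypothetical.

```python
from statistics import median


def coordinatewise_median(estimates):
    """Aggregate per-machine coefficient vectors by taking the median of each coordinate.

    estimates: list of equal-length lists, one local estimate per machine.
    """
    return [median(column) for column in zip(*estimates)]


# Four honest machines report similar sparse estimates; one Byzantine machine lies.
honest = [
    [1.0, 0.0, 2.0],
    [1.1, 0.0, 1.9],
    [0.9, 0.0, 2.1],
    [1.0, 0.0, 2.0],
]
byzantine = [[100.0, -50.0, 1e6]]
print(coordinatewise_median(honest + byzantine))
```

A plain average of the same five vectors would be dominated by the corrupted report, which is the failure mode a robust aggregator is meant to avoid.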

2605.13277 2026-05-14 cs.CL cs.AI cs.CV cs.IR cs.LG

Utility-Oriented Visual Evidence Selection for Multimodal Retrieval-Augmented Generation

Weiqing Luo, Zongye Hu, Xiao Wang, Zhiyuan Yu, Haofeng Zhang, Ziyi Huang

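Affiliations * Arizona State University; Texas A&M University; Morgan Stanley

AI summary This paper studies visual evidence selection in multimodal retrieval-augmented generation (RAG), noting that existing methods usually rely on semantic relevance or surface-level similarity, which poorly reflect how useful a piece of evidence actually is for downstream reasoning. The authors redefine evidence utility from an information-theoretic perspective, measuring the value of evidence by the information gain it induces on the model's output distribution, and design a training-free, efficient estimation framework based on lightweight multimodal models. Experiments show that the method outperforms existing RAG methods on multiple benchmarks while substantially reducing computational cost.

Comments Accepted to ACL 2026

Details
English abstract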

Visual evidence selection is a critical component of multimodal retrieval-augmented generation (RAG), yet existing methods typically rely on semantic relevance or surface-level similarity, which are often misaligned with the actual utility of visual evidence for downstream reasoning. We reformulate multimodal evidence selection from an information-theoretic perspective by defining evidence utility as the information gain induced on a model's output distribution. To overcome the intractability of answer-space optimization, we introduce a latent notion of evidence helpfulness and theoretically show that, under mild assumptions, ranking evidence by information gain on this latent variable is equivalent to answer-space utility. We further propose a training-free, surrogate-accelerated framework that efficiently estimates evidence utility using lightweight multimodal models. Experiments on MRAG-Bench and Visual-RAG across multiple model families demonstrate that our method consistently outperforms state-of-the-art RAG baselines while achieving substantial reductions in computational cost.
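One way to read "information gain induced on a model's output distribution" is as a reduction in entropy once a piece of evidence is added to the prompt. The sketch below ranks candidate evidence by that quantity. The entropy-difference definition, the function names, and the toy distributions are assumptions for illustration; the paper's exact estimator and its latent helpfulness variable are not reproduced here.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def information_gain(prior, posterior):
    """Entropy reduction when moving from the evidence-free to the evidence-conditioned distribution."""
    return entropy(prior) - entropy(posterior)


def rank_evidence(prior, posteriors):
    """Sort evidence ids from most to least informative.

    posteriors: dict mapping evidence id -> output distribution given that evidence.
    """
    return sorted(
        posteriors,
        key=lambda eid: information_gain(prior, posteriors[eid]),
        reverse=True,
    )


prior = [0.5, 0.5]
posteriors = {
    "img_a": [0.9, 0.1],
    "img_b": [0.6, 0.4],
    "img_c": [0.5, 0.5],
}
print(rank_evidence(prior, posteriors))
```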

2605.13266 2026-05-14 cs.RO

Galilean State Estimation for Inertial Navigation Systems with Unknown Time Delay

Giulio Delama, Martin Scheiber, Yixiao Ge, Tarek Hamel, Stephan Weiss, Robert Mahony

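Affiliations * Control of Networked Systems Group; University of Klagenfurt; Systems Theory and Robotics Group; Australian National University; I3S, CNRS, Université Côte d’Azur and Institut Universitaire de France

AI summary This paper studies state estimation for Inertial Navigation Systems (INS) in the presence of an unknown time delay. The authors propose a geometric framework based on Galilean symmetry that models space and time jointly, enabling joint estimation of the navigation states and the time delay, and derive an Equivariant Filter (EqF) for online estimation. Experiments show that the method maintains estimation accuracy with better consistency than the existing Extended Kalman Filter (EKF) approach, particularly at larger time delays.

Details
English abstract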

Many Inertial Navigation Systems (INS) use Global Navigation Satellite System (GNSS) position as the primary measurement to drive filter performance and bound error growth. However, commercial-grade GNSS receivers introduce unknown measurement delays ranging from 50 ms to 300 ms depending on sensor quality and operating mode. Such time delays can significantly degrade INS performance unless they are explicitly compensated for. Existing algorithms commonly estimate this delay offline, run the filter concurrently with GNSS measurements using buffered Inertial Measurement Unit (IMU) data, and predict the current state by forward-integrating buffered inertial measurements via IMU preintegration. The state-of-the-art online method is an Extended Kalman Filter (EKF) that explicitly models the time delay as a state parameter, which defines the preintegration duration. This paper introduces a novel geometric framework for modeling time-delayed INS, in which Galilean symmetry is leveraged to provide a joint representation of space and time for consistent state estimation. An Equivariant Filter (EqF) is derived for the coupled estimation of navigation states and time delay. Validation is performed on two fixed-wing Uncrewed Aerial Vehicles (UAV) with GNSS time lags of 90 ms and 120 ms. The test flights last two to three minutes. Simulations further investigate delays up to 500 ms and provide a statistical comparison against the state-of-the-art EKF. Results show that the EqF preserves accuracy and consistency, while the EKF lacks consistency and its performance degrades significantly with increasing measurement delays.

2605.13265 2026-05-14 cs.LG

LightSplit: Practical Privacy-Preserving Split Learning via Orthogonal Projections

Mert Cihangiroglu, Alessandro Pegoraro, Phillip Rieger, Antonino Nocera, Ahmad-Reza Sadeghi

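Affiliations * University of Pavia; Technical University of Darmstadt

AI summary Split learning (SL) enables collaborative training by partitioning a neural network between clients and a central server, but the cut-layer interface incurs high communication overhead from high-dimensional activations and exposes representations that are vulnerable to reconstruction attacks. This paper proposes LightSplit, which applies a lightweight, fixed orthogonal random projection at the cut layer to reduce both information exposure and communication overhead. Grounded in information theory, the projection restricts instance-specific information and suppresses exploitable per-sample signals; it requires no changes to the existing architecture, suits edge devices, and preserves end-to-end differentiability. Experiments show that LightSplit retains more than 95% of baseline accuracy while greatly reducing the transmitted dimensionality.

Details
English abstract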

Split learning (SL) enables collaborative training by partitioning a neural network across clients and a central server, but the cut-layer interface introduces a key challenge: high-dimensional activations incur substantial communication overhead while exposing representations vulnerable to reconstruction attacks. Existing approaches typically address efficiency or privacy in isolation, relying on additional mechanisms such as sparsification, quantization, or noise injection. We propose LightSplit, which limits information exposure and reduces communication overhead by applying a lightweight fixed orthogonal random projection at the cut layer. Based on Shannon's information theory, this projection acts as an information bottleneck that restricts instance-specific information and suppresses exploitable per-sample signals. By transmitting low-dimensional projections instead of raw activations, the server operates on lifted representations without requiring architectural modifications, ensuring compatibility with existing SL architectures. By avoiding additional trainable components on the client, the method remains lightweight and suitable for edge devices while preserving end-to-end differentiability via exact gradient propagation. Because the projection is non-invertible, part of the original representation is irreversibly discarded at the client, so LightSplit reduces the information available for reconstruction and limits information exposure. We extensively evaluate LightSplit on state-of-the-art benchmarks in both IID and non-IID settings across varying projection dimensions and client scales. Our results show that the method retains more than 95% of the baseline accuracy at up to 32x reduction in transmitted dimensionality while maintaining stable training dynamics.
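A fixed orthogonal random projection at the cut layer can be sketched in a few lines of plain Python. The sketch assumes the projection matrix has orthonormal rows obtained by Gram-Schmidt on Gaussian vectors, and that the server "lifts" the received vector back to the original width with the transpose; both choices, and all function names, are illustrative assumptions rather than the paper's implementation.

```python
import random


def random_orthonormal_rows(k, d, seed=0):
    """Build a fixed k x d matrix with orthonormal rows (k <= d).

    Gaussian vectors are drawn from a seeded generator and orthonormalised
    with modified Gram-Schmidt, so client and server can share the matrix via the seed.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(k):
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for q in rows:
            dot = sum(a * b for a, b in zip(v, q))
            v = [a - dot * b for a, b in zip(v, q)]
        norm = sum(a * a for a in v) ** 0.5
        rows.append([a / norm for a in v])
    return rows


def project(P, x):
    """Client side: map a d-dimensional activation to k dimensions (z = P x)."""
    return [sum(p * a for p, a in zip(row, x)) for row in P]


def lift(P, z):
    """Server side (assumed): map z back to d dimensions with the transpose (P^T z)."""
    d = len(P[0])
    return [sum(P[i][j] * z[i] for i in range(len(P))) for j in range(d)]


P = random_orthonormal_rows(k=4, d=16, seed=0)
activation = [float(i) for i in range(16)]
z = project(P, activation)        # 4 numbers are transmitted instead of 16
reconstruction = lift(P, z)       # only the component inside the 4-dim subspace survives
print(len(z), len(reconstruction))
```

Because `P` has fewer rows than columns, the component of the activation orthogonal to its row space never leaves the client, which is the non-invertibility the abstract refers to. The map is linear, so the gradient with respect to the activation is obtained by applying the transpose to the gradient with respect to `z`.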

2605.13262 2026-05-14 cs.LG q-bio.QM

Chem-GMNet: A Sphere-Native Geometric Transformer for Molecular Property Prediction

Deepak Warrier, Raja Sekhar Pappala

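Affiliations * MSTACK AI

AI summary This paper proposes Chem-GMNet, a sphere-native geometric transformer for molecular property prediction. The model replaces each module of a conventional transformer with a counterpart built on spherical geometry, exploiting the geometric priors present in chemical structure. Experiments show that Chem-GMNet outperforms existing methods such as ChemBERTa with fewer parameters and performs well even without pretraining.

Details
English abstract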

Modern SMILES-based chemical language models obtain strong MoleculeNet performance by treating SMILES as generic text and compensating with multi-million-molecule self-supervised pretraining. We ask: when a domain carries structural priors as rich as chemistry's, does it warrant a domain-native transformer rather than a generic one rescued by scale? We answer affirmatively with \textbf{GM-Net} (Geometric Measure Network), a transformer family in which every module is replaced by a sphere-native counterpart, and instantiate it as \textbf{Chem-GMNet}. Three blocks follow: SH-Embedding (tokens as learnable directions on $S^{k-1}$ lifted through a Gegenbauer feature map); DualSKA (a per-head fusion of a linear-time gated Sphere-Flow recurrence whose persistent state we prove is the truncated multipole expansion of the input distribution, and a softmax Sphere-Kernel branch over the same Schoenberg-valid kernel); and SH-FFN (sphere projection $\to$ Gegenbauer lift $\to$ moment readout). On canonical DeepChem scaffold splits, against same-shape ChemBERTa-2 baselines under the chemberta3-faithful protocol: (i) random-initialised, Chem-GMNet wins on 7 of 10 MoleculeNet endpoints at $\sim\!35\%$ fewer parameters; (ii) pretrained on the same 10M-SMILES ZINC corpus as ChemBERTa-2 MLM-10M, it matches or beats the public release on 6 of 8 shared endpoints (5/7 excluding a known ClinTox release anomaly). A $(k,L)$ ablation shows that increasing the sphere dimension from $k\!=\!8$ to $k\!=\!10$ at fixed $L\!=\!3$ lowers ESOL RMSE to $0.938$ at scratch, beating pretrained ChemBERTa-2 MLM-10M on this endpoint without any pretraining at all.

2605.13260 2026-05-14 cs.LG math.AP math.FA stat.ML

Unified generalization analysis for physics informed neural networks

Yuka Hashimoto, Tomoharu Iwata

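Affiliations * NTT, Inc.; RIKEN AIP

AI summary This paper gives a unified theoretical analysis of the generalization of physics-informed neural networks (PINNs) and their variational counterparts (VPINNs). Using Taylor expansion, nonlinear differential operators are represented as linear operators on a high-dimensional space, and Koopman-based analysis then yields generalization bounds for neural networks that involve differentiation. The approach removes the earlier reliance on stability conditions or linear ellipticity and shows that the nonlinearity of the differential operator has a significant effect on generalization, offering a new theoretical perspective on training and generalization in physics-informed neural networks.

Details
English abstract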

Physics-Informed Neural Networks (PINNs) and their variational counterparts (VPINNs) are neural networks that incorporate physical laws, making them useful for scientific problems. Existing generalization analyses for PINNs and VPINNs remain limited, often requiring restrictive assumptions such as stability conditions or linear ellipticity. In this paper, we derive generalization bounds for neural networks that involve differentiation with respect to input variables, covering PINNs and VPINNs under a unified framework. We apply Taylor expansion to represent nonlinear differential operators as linear operators on a high-dimensional space, enabling the use of Koopman-based analysis and showing that high-rank networks can generalize well even in settings involving differential operators. We also show that the nonlinearity of the differential operator exponentially enlarges the bound, highlighting its significant impact on generalization.

2605.13255 2026-05-14 cs.AI

Respecting Self-Uncertainty in On-Policy Self-Distillation for Efficient LLM Reasoning

Junlong Ke, Zichen Wen, Weijia Li, Conghui He, Linfeng Zhang

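Affiliations * Shanghai Jiao Tong University; Tsinghua University; Shanghai AI Laboratory

AI summary This paper studies how to make better use of the teacher's uncertainty in on-policy self-distillation to improve the reasoning efficiency of large language models. It proposes EGRSD, an entropy-guided reinforced self-distillation method that combines a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and a teacher-entropy confidence gate to adjust the supervision weight of each token position dynamically. It further introduces CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient ones. Experiments show that the methods achieve a better trade-off between reasoning accuracy and length than the compared trainable methods.

Details
English abstract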

On-policy self-distillation trains a reasoning model on its own rollouts while a teacher, often the same model conditioned on privileged context, provides dense token-level supervision. Existing objectives typically weight the teacher's token-level signal uniformly across a chain-of-thought sequence, despite substantial variation in the entropy of the teacher's predictive distribution. We propose EGRSD (Entropy-Guided Reinforced Self-Distillation), which unifies token-level updates through three signals: a reward-grounded direction, a teacher-student likelihood-ratio magnitude, and the proposed teacher-entropy confidence gate that down-weights high-entropy token positions while maintaining a nonzero lower bound on every token weight. We further introduce CL-EGRSD, a causal-lookahead variant that distinguishes sustained high-entropy spans from transient high-entropy positions whose following context rapidly becomes low entropy. Experiments with Qwen3-4B and Qwen3-8B in thinking mode show that EGRSD and CL-EGRSD advance the accuracy-length frontier among the compared trainable methods.
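The teacher-entropy confidence gate can be pictured as a per-token weight that shrinks as the teacher's predictive entropy grows but never reaches zero. The sketch below uses a linear gate on entropy normalised by its maximum; this functional form, the `floor` parameter, and the function names are assumptions for illustration, since the abstract does not give the exact formula.

```python
import math


def entropy(probs):
    """Shannon entropy in nats of the teacher's next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def confidence_gate(teacher_probs, floor=0.1):
    """Weight in [floor, 1]: 1 when the teacher is certain, floor when it is uniform.

    Assumption: linear interpolation on entropy normalised by log(vocabulary size).
    """
    max_entropy = math.log(len(teacher_probs))
    normalised = entropy(teacher_probs) / max_entropy if max_entropy > 0 else 0.0
    return floor + (1.0 - floor) * (1.0 - normalised)


confident = [1.0, 0.0, 0.0, 0.0]
uncertain = [0.25, 0.25, 0.25, 0.25]
print(confidence_gate(confident), confidence_gate(uncertain))
```

The floor keeps every token position in the loss, matching the abstract's requirement of a nonzero lower bound on each token weight.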

2605.13245 2026-05-14 cs.AI

It's not the Language Model, it's the Tool: Deterministic Mediation for Scientific Workflows

Marios Adamidis, Danae Katrisioti, Yannis Tzitzikas, Emmanuel Stratakis

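Affiliations * Department of Materials Science and Technology, University of Crete; Institute of Electronic Structure and Laser, FORTH; Computer Science Department, University of Crete; Institute of Computer Science, FORTH; Department of Physics, University of Crete

AI summary This study examines the reproducibility of analyses that language models generate in scientific workflows, noting that repeated generations on the same data can give different results, which undermines trust. The authors propose "typed mediation", in which the model calls deterministic tools to perform the analysis; each tool encodes the exact procedure for a specific instrument, so results stay consistent. In a comparison across several platforms, the typed tool gives fully reproducible results for the same analysis task, with greater stability and reliability than commercial models, offering a practical answer to reproducibility requirements in scientific analysis.

Comments 18 pages, 4 figures, 2 appendices. Submitted to SETN 2026

Details
English abstract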

Language models can produce convincing scientific analyses, but repeated generations on the same data do not guarantee the same result. A researcher may regenerate an identical query and receive a different fit, a different peak position or a different analysis procedure, without an obvious way to decide which output to trust. We propose typed mediation, a pattern in which the model orchestrates deterministic tools rather than generating analytical code. Each tool encodes one researcher's exact procedure for one instrument, ported through structured interviews. The model selects which tool to call and with what parameters. The tool produces the result. Regeneration does not change it. We evaluate this claim by running the same photoluminescence analysis on four platforms, including three commercial foundation models, four times each with the same prompt. The typed tool produces identical results across all runs. The commercial platforms either vary in numerical output and analytical methodology across runs, or fail to produce valid results on the task. We deploy this pattern on two instruments serving users over approximately six months, with very positive user feedback. Both cases are very challenging: they involve proprietary binary formats and per-seat licensed software, which force the tool to remain on local infrastructure alongside the data and the instrument it operates. We argue that deployment topology is not just a preference, but a structural requirement of scientific tool mediation. The result is a practical pattern for deploying language models in scientific workflows where reproducibility is mandatory, reducing analysis time from weeks to minutes while guaranteeing identical outputs across runs.
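The pattern of a model choosing a tool and its parameters while a deterministic function produces the result can be sketched as a small typed registry. Everything here is hypothetical: the `peak_position` tool, the `PeakParams` type, and the dictionary shape of the model's call are illustrative stand-ins, not the authors' tools or interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeakParams:
    """Typed parameters for the hypothetical peak-finding tool."""
    wavelengths_nm: tuple
    intensities: tuple


def peak_position(params: PeakParams) -> float:
    """Deterministic tool: wavelength at the maximum intensity (first one on ties)."""
    idx = max(range(len(params.intensities)), key=params.intensities.__getitem__)
    return params.wavelengths_nm[idx]


# Registry mapping a tool name to its parameter type and implementation.
TOOLS = {"peak_position": (PeakParams, peak_position)}


def dispatch(call: dict):
    """Validate a model-issued call against the registry and run the tool.

    Unknown tool names raise KeyError; unexpected parameter names raise TypeError
    when the dataclass is constructed.
    """
    name = call["tool"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    param_type, tool = TOOLS[name]
    return tool(param_type(**call["params"]))


# Stand-in for what a language model might emit after reading the user's request.
call = {
    "tool": "peak_position",
    "params": {
        "wavelengths_nm": (640.0, 650.0, 660.0),
        "intensities": (0.2, 0.9, 0.4),
    },
}
print(dispatch(call), dispatch(call))
```

The model's only freedom is the contents of `call`; the numerical result comes from the tool, so repeating the same call cannot change it.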

2605.13236 2026-05-14 cs.CL

A Hybrid Framework for Natural Language Querying of IFC Models with Relational and Graph Representations

Rabindra Lamsal, Sisi Zlatanova, Haowen Xu, Yafei Sun, Johnson Xuesong Shen

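Affiliations * GRID Lab, School of Built Environment, The University of New South Wales; School of Civil and Environmental Engineering, The University of New South Wales

AI summary This paper proposes IfcLLM, a hybrid framework for querying IFC-format Building Information Modeling (BIM) models in natural language. The framework converts an IFC model into complementary representations, a relational representation for structured properties and geometry and a graph representation for topological relationships, and integrates them for LLM reasoning through an iterative retry-and-refine mechanism. Experiments show first-attempt query accuracy of 93.3% to 100% across multiple scenarios, improving non-expert users' ability to access and analyse BIM data.

Details
English abstract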

Building Information Modeling (BIM) is widely used in the Architecture, Engineering, and Construction (AEC) industry, but the complexity of Industry Foundation Classes (IFC) limits accessibility for non-expert users. To address this, we introduce IfcLLM, a hybrid framework for natural language interaction with IFC-based BIM models. It transforms IFC models into complementary representations: a relational representation for structured element properties and geometry, and a graph representation for topological relationships. These representations are integrated through iterative retry-and-refine LLM reasoning. We implement the framework using an open-weight LLM (GPT OSS 120B), supporting reproducible and deployment-oriented workflows. Evaluation on three IFC models with queries derived from 30 scenarios shows first-attempt accuracy of 93.3%-100%, with all failures recovered using a fallback LLM. The results show that combining complementary representations with iterative reasoning enables more accessible natural language querying of IFC data while supporting routine BIM analysis tasks.

2605.13229 2026-05-14 cs.AI cs.SE

Improving Code Translation with Syntax-Guided and Semantic-aware Preference Optimization

Yuhan Wu, Huan Zhang, Wei Cheng, Chen Shen, Jingyue Yang, Wei Hu

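Affiliations * State Key Laboratory for Novel Software Technology, Nanjing University, China; National Institute of Healthcare Data Science, Nanjing University, China

AI summary This paper studies how to improve the accuracy and semantic consistency of code translation and proposes CTO, a syntax-guided and semantic-aware preference optimization method. It trains a cross-lingual semantic model with contrastive learning to assess the functional equivalence of source and translated code directly, and unifies this semantic signal with syntactic feedback from the compiler in a multi-objective optimization framework. Experiments show that CTO significantly outperforms existing methods on C++, Java, and Python code translation tasks.

Comments Accepted in the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

Details
English abstract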

LLMs have shown immense potential for code translation, yet they often struggle to ensure both syntactic correctness and semantic consistency. While preference-based learning offers a promising alignment strategy, it is hindered by unreliable semantic rewards derived from sparse test cases or restrictive reference translations. We argue that a robust semantic reward for code translation must be derived directly from the source code. In this paper, we propose CTO to improve code translation with syntax-guided and semantic-aware preference optimization. Through contrastive learning, we train a cross-lingual semantic model to directly assess functional equivalence between source and translated code. By formulating code translation as a multi-objective optimization problem, this robust semantic signal is seamlessly unified with compiler-based syntactic feedback within the direct preference optimization framework. Extensive experiments on C++, Java, and Python translations demonstrate that CTO significantly outperforms existing baselines and alternative preference optimization strategies.