arXivDaily arXiv Daily Academic Digest, updated Monday to Friday
2601.19831 2026-05-11 cs.LG cs.CL

Neural Neural Scaling Laws

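Michael Y. Hu, Jane Pan, Ayush Rajesh Jhaveri, Nicholas Lourie, Kyunghyun Cho

Affiliations New York University

AI summary This paper proposes NeuNeu, a neural network that predicts language model performance on downstream tasks by time-series extrapolation. Across 66 downstream tasks it reduces mean absolute error by 44% relative to logistic scaling laws.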

English abstract

Neural scaling laws predict how language model performance improves with increased training inputs. While aggregate metrics like validation loss can follow smooth power-law curves, individual downstream tasks exhibit diverse scaling behaviors: some improve monotonically, others plateau, and some even degrade with scale. We argue that predicting downstream performance from validation loss suffers from two limitations: averaging token-level losses obscures signal, and no simple parametric family can capture the full spectrum of scaling behaviors. To address this, we propose Neural Neural Scaling Laws (NeuNeu), a neural network that frames scaling law prediction as time-series extrapolation. NeuNeu combines temporal context from observed accuracy trajectories with token-level validation losses, learning to predict future performance without the limitations inherent in assuming a specific functional form. Trained entirely on open-source model checkpoints from HuggingFace, NeuNeu achieves 1.99% mean absolute error in predicting model accuracy on 66 downstream tasks -- a 44% reduction compared to logistic scaling laws (3.56% MAE). Furthermore, NeuNeu generalizes zero-shot to unseen model families, architectures, parameter counts, and downstream tasks. Our work suggests that predicting downstream scaling directly from data outperforms parametric alternatives.
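
The headline comparison can be checked with a few lines of arithmetic. The Python sketch below computes mean absolute error and the relative reduction between the two MAE values quoted in the abstract. The function names are illustrative and are not taken from the paper's code.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute gap between predicted and observed accuracy."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction by which `improved` is lower than `baseline`."""
    return (baseline - improved) / baseline


# MAE values (percentage points of accuracy) quoted in the abstract.
logistic_mae = 3.56
neuneu_mae = 1.99
reduction = relative_reduction(logistic_mae, neuneu_mae)
print(f"relative MAE reduction: {reduction:.1%}")  # prints 44.1%
```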

2601.18744 2026-05-11 cs.AI cs.LG

TSRBench: A Comprehensive Multi-task Multi-modal Time Series Reasoning Benchmark for Generalist Models

Fangxu Yu, Xingang Guo, Lingzhi Yuan, Haoqiang Kang, Hongyu Zhao, Lianhui Qin, Furong Huang, Bin Hu, Tianyi Zhou

Affiliations Mohamed bin Zayed University of Artificial Intelligence; University of Illinois at Urbana–Champaign; University of Maryland, College Park; University of California, San Diego

AI summary TSRBench evaluates the time series reasoning of generalist models with 4125 problems across 15 tasks. It finds that scaling laws hold for perception and reasoning but break down for prediction, and that current multimodal models fail to fuse textual and visual inputs effectively.

Comments Accepted to ICML 2026

English abstract

Time series are ubiquitous in real-world scenarios and crucial for applications ranging from energy management to traffic control. Consequently, the ability to reason over time series is a fundamental skill for generalist models to solve complex problems. However, current benchmarks for generalist models largely overlook this dimension. To bridge this gap, we introduce TSRBench, a comprehensive multi-modal benchmark designed to stress-test the full spectrum of time series reasoning capabilities. TSRBench features: i) a diverse set of 4125 problems from 14 domains, categorized into 4 major dimensions: Perception, Reasoning, Prediction, and Decision-Making; and ii) 15 tasks across the 4 dimensions that evaluate essential reasoning capabilities (e.g., numerical reasoning). Through extensive experiments, we evaluate over 30 leading proprietary and open-source LLMs, VLMs, and TSLLMs within TSRBench. Our findings reveal that: i) scaling laws hold for perception and reasoning but break down for prediction; ii) strong reasoning does not guarantee accurate context-aware forecasting, indicating a decoupling between semantic understanding and numerical prediction; and iii) despite the complementary nature of textual and visual forms of time series as inputs, current multimodal models fail to effectively fuse them for reciprocal performance gains. TSRBench provides a standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance generalist models. Our code and dataset are available at https://tsrbench.github.io/.

2601.18700 2026-05-11 cs.AI

TEA-Bench: A Systematic Benchmarking of Tool-enhanced Emotional Support Dialogue Agent

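Xingyu Sui, Yanyan Zhao, Yulin Hu, Jiahe Guo, Weixiang Zhao, Bing Qin

Affiliations Harbin Institute of Technology

AI summary This paper proposes TEA-Bench, the first interactive benchmark for evaluating tool-augmented emotional support conversation agents. It assesses the quality and factual grounding of emotional support using realistic emotional scenarios and a tool environment. Experiments show that tool augmentation generally improves emotional support quality and reduces hallucination, but the gains depend on model capacity.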

English abstract

Emotional Support Conversation (ESC) requires not only affective expression but also grounded instrumental support to provide trustworthy guidance. However, existing ESC systems and benchmarks largely focus on affective support in text-only settings, overlooking how external tools can enable factual grounding and reduce hallucination in multi-turn emotional support. We introduce TEA-Bench, the first interactive benchmark for evaluating tool-augmented agents in ESC, featuring realistic emotional scenarios, an MCP-style tool environment, and process-level metrics that jointly assess the quality and factual grounding of emotional support. Experiments on nine LLMs show that tool augmentation generally improves emotional support quality and reduces hallucination, but the gains are strongly capacity-dependent: stronger models use tools more selectively and effectively, while weaker models benefit only marginally. We further release TEA-Dialog, a dataset of tool-enhanced ESC dialogues, and find that supervised fine-tuning improves in-distribution support but generalizes poorly. Our results underscore the importance of tool use in building reliable emotional support agents. Our code and data can be found at https://github.com/XingYuSSS/TEA-Bench.

2601.18681 2026-05-11 cs.LG cs.AI cs.SY eess.SY math.OC

ART for Diffusion Sampling: A Reinforcement Learning Approach to Timestep Schedule

Yilie Huang, Wenpin Tang, Xunyu Zhou

Affiliations Department of Industrial Engineering and Operations Research, Columbia University, New York, NY 10027, USA; Department of Industrial Engineering and Operations Research & Data Science Institute, Columbia University, New York, NY 10027, USA

AI summary This paper proposes ART, which uses reinforcement learning to optimize the timestep schedule of diffusion models, reducing Euler discretization error and improving generation quality.

Comments 25 pages, 8 figures, 5 tables

English abstract

We consider time discretization for score-based diffusion models to generate samples from a learned reverse-time dynamic on a finite grid. Uniform and hand-crafted grids can be suboptimal given a budget on the number of time steps. We introduce Adaptive Reparameterized Time (ART), which controls the clock speed of a reparameterized time variable to redistribute computation along the sampling trajectory while preserving the terminal time, with the objective of minimizing the aggregate Euler discretization error. We derive a randomized companion ART-RL that recasts ART as a continuous-time reinforcement learning problem with Gaussian policies, and prove a two-directional bridge between the two: the deterministic ART optimum lifts to an optimal Gaussian policy, and conversely any optimal Gaussian policy must recover the ART control through its mean. This bridge turns continuous-time actor-critic learning into a principled, rather than heuristic, route to the deterministic timestep optimum. Within the official EDM pipeline, ART-RL improves FID on CIFAR-10 across a wide range of budgets; after one-time offline training, the distilled deterministic schedule transfers without retraining to AFHQv2, FFHQ, and ImageNet at no extra inference cost.
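
The idea of redistributing a fixed step budget while preserving the terminal time can be illustrated in a few lines. The Python sketch below is not the authors' method: it takes a hand-picked, hypothetical clock-speed function (ART learns its control instead), normalizes the resulting step sizes so they sum to the terminal time, and runs explicit Euler on the grid. All names are illustrative.

```python
from typing import Callable, List


def reparameterized_grid(speed: Callable[[float], float],
                         n_steps: int, terminal_time: float) -> List[float]:
    """Build a time grid whose step sizes are proportional to `speed`.

    `speed` is sampled at the midpoints of a uniform grid on [0, 1] and must
    be positive. Steps are rescaled so the grid ends at `terminal_time`.
    """
    raw = [speed((k + 0.5) / n_steps) for k in range(n_steps)]
    total = sum(raw)
    grid = [0.0]
    for r in raw:
        grid.append(grid[-1] + terminal_time * r / total)
    grid[-1] = terminal_time  # remove floating-point round-off at the end point
    return grid


def euler(f: Callable[[float, float], float], x0: float, grid: List[float]) -> float:
    """Explicit Euler integration of dx/dt = f(t, x) along `grid`."""
    x = x0
    for t0, t1 in zip(grid, grid[1:]):
        x += (t1 - t0) * f(t0, x)
    return x


uniform = reparameterized_grid(lambda s: 1.0, 4, 1.0)
adaptive = reparameterized_grid(lambda s: 0.2 + s, 4, 1.0)  # hypothetical speed
x_uniform = euler(lambda t, x: -x, 1.0, uniform)
x_adaptive = euler(lambda t, x: -x, 1.0, adaptive)
```

Both grids use four steps and end at the same terminal time; only the placement of the intermediate points differs, which is the quantity a learned schedule would tune.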

2601.17942 2026-05-11 cs.AI cs.DB

LLM-Based SQL Generation: Prompting, Self-Refinement, and Adaptive Weighted Majority Voting

Yu-Jie Yang, Hung-Fu Chang, Po-An Chen

Affiliations Institute of Information Management, National Yang Ming Chiao Tung University; R. B. Annis School of Engineering, University of Indianapolis

AI summary This paper proposes the SSEV pipeline, which improves SQL generation accuracy through self-refinement and weighted majority voting and performs competitively on several benchmarks. It further proposes the ReCAPAgent-SQL framework to handle the complexity of enterprise databases in real-world use.

Comments 29 pages, 22 figures

Journal ref 2026 International Conference on Information Management

English abstract

Text-to-SQL has emerged as a prominent research area, particularly with the rapid advancement of large language models (LLMs). By enabling users to query databases through natural language rather than SQL, this technology significantly lowers the barrier to data analysis. However, generating accurate SQL from natural language remains challenging due to ambiguity in user queries, the complexity of schema linking, limited generalization across SQL dialects, and the need for domain-specific understanding. In this study, we propose a Single-Agent Self-Refinement with Ensemble Voting (SSEV) pipeline built on PET-SQL that operates without ground-truth data, integrating self-refinement with Weighted Majority Voting (WMV) and its randomized variant (RWMA). Experimental results show that the SSEV achieves competitive performance across multiple benchmarks, attaining execution accuracies of 85.5% on Spider 1.0-Dev, 86.4% on Spider 1.0-Test, and 66.3% on BIRD-Dev. Building on insights from the SSEV pipeline, we further propose ReCAPAgent-SQL (Refinement-Critique-Act-Plan agent-based SQL framework) to address the growing complexity of enterprise databases and real-world Text-to-SQL tasks. The framework integrates multiple specialized agents for planning, external knowledge retrieval, critique, action generation, self-refinement, schema linking, and result validation, enabling iterative refinement of SQL predictions through agent collaboration. ReCAPAgent-SQL's WMA results achieve 31% execution accuracy on the first 100 queries of Spider 2.0-Lite, demonstrating significant improvements in handling real-world enterprise scenarios. Overall, our work facilitates the deployment of scalable Text-to-SQL systems in practical settings, supporting better data-driven decision-making at lower cost and with greater efficiency.
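
Weighted majority voting over candidate queries can be sketched briefly. The Python example below is an assumption-laden illustration, not the SSEV implementation: each candidate SQL query is represented by a hashable signature of its execution result, votes are summed per signature, and candidates that disagree with the winner are down-weighted by a hypothetical factor `beta`. Penalizing against the consensus rather than a gold answer is a choice made here because the abstract says the pipeline operates without ground-truth data.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Tuple


def weighted_majority_vote(results: List[Hashable],
                           weights: List[float]) -> Tuple[Hashable, Dict[Hashable, float]]:
    """Return the result signature with the largest total weight."""
    totals: Dict[Hashable, float] = defaultdict(float)
    for result, weight in zip(results, weights):
        totals[result] += weight
    winner = max(totals, key=totals.get)
    return winner, dict(totals)


def update_weights(results: List[Hashable], weights: List[float],
                   winner: Hashable, beta: float = 0.5) -> List[float]:
    """Multiply the weight of every dissenting candidate by `beta`."""
    return [w if r == winner else w * beta for r, w in zip(results, weights)]


# Hypothetical execution-result signatures from four candidate generators.
results = [("alice",), ("bob",), ("alice",), ("carol",)]
weights = [1.0, 1.5, 1.0, 0.5]
winner, totals = weighted_majority_vote(results, weights)
new_weights = update_weights(results, weights, winner)
```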

2601.16736 2026-05-11 cs.CV

A Step to Decouple Optimization in 3DGS

Renjie Ding, Yaonan Wang, Min Liu, Jialin Zhu, Jiazheng Wang, Jiahao Zhao, Wenting Shen, Feixiang He, Xiang Chen

Affiliations National Engineering Research Center of Robot Visual Perception and Control Technology, School of Artificial Intelligence and Robotics, Hunan University; Baidu Inc.; Central South University

AI summary This paper addresses coupling in 3DGS optimization by decoupling the process into Sparse Adam, Re-State Regularization, and Decoupled Attribute Regularization, improving optimization efficiency and representation effectiveness.

Comments Accepted by ICLR 2026 (fixed typo)

English abstract

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for real-time novel view synthesis. As an explicit representation optimized through gradient propagation among primitives, 3DGS adopts optimization practices widely accepted for deep neural networks (DNNs), such as synchronous weight updating and Adam with adaptive gradients. However, considering the physical significance and specific design of 3DGS, two details of its optimization have been overlooked: (i) update step coupling, which induces optimizer state rescaling and costly attribute updates outside the viewpoints, and (ii) gradient coupling in the moment, which may lead to under- or over-effective regularization. Such complex coupling remains under-explored. After revisiting the optimization of 3DGS, we take a step to decouple it and recompose the process into Sparse Adam, Re-State Regularization, and Decoupled Attribute Regularization. Through extensive experiments under the 3DGS and 3DGS-MCMC frameworks, our work provides a deeper understanding of these components. Finally, based on the empirical analysis, we re-design the optimization and propose AdamW-GS by re-coupling the beneficial components, which achieves better optimization efficiency and representation effectiveness simultaneously.
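
One way to read "update step coupling" is that a dense Adam step touches every primitive, including those outside the current view. The Python sketch below shows a masked variant under that reading: only visible entries have their moments and parameters updated. It is an illustrative assumption rather than the paper's Sparse Adam, and the per-entry step counter used for bias correction is a design choice made here.

```python
import math
from typing import List


def sparse_adam_step(params: List[float], grads: List[float],
                     m: List[float], v: List[float], steps: List[int],
                     visible: List[bool], lr: float = 1e-2,
                     b1: float = 0.9, b2: float = 0.999, eps: float = 1e-8) -> None:
    """In-place Adam update restricted to entries flagged as visible."""
    for i, seen in enumerate(visible):
        if not seen:
            continue  # invisible primitives keep their parameters and state
        steps[i] += 1
        m[i] = b1 * m[i] + (1 - b1) * grads[i]
        v[i] = b2 * v[i] + (1 - b2) * grads[i] ** 2
        m_hat = m[i] / (1 - b1 ** steps[i])  # bias-corrected first moment
        v_hat = v[i] / (1 - b2 ** steps[i])  # bias-corrected second moment
        params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)


params = [1.0, 1.0]
m, v, steps = [0.0, 0.0], [0.0, 0.0], [0, 0]
sparse_adam_step(params, [2.0, 2.0], m, v, steps, visible=[True, False])
```

After this call the first parameter has moved by roughly one learning rate, while the second parameter and its optimizer state are untouched.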

2601.15884 2026-05-11 cs.CV

Contrast-X: A Multi-Modal Contrast Image Synthesis Benchmark and Universal Modality Flow Matching

Yifan Chen, Fei Yin, Hao Chen, Jia Wu, Chao Li

Affiliations University of Cambridge; MD Anderson Cancer Center; University of Dundee

AI summary Contrast-X is a benchmark built on multi-organ CT and breast DCE-MRI data. The accompanying FlowMI model handles missing modalities, and the evaluation covers synthesized image quality and cross-organ generalization.

English abstract

Contrast-enhanced imaging is central to oncologic diagnosis, but contrast agents can be contraindicated for many of the patients who need them most. Synthesizing contrast scans from non-contrast inputs is the natural response. Two obstacles stand in the way: no benchmark provides paired contrast data with lesion-level evaluation, and no single model handles the arbitrary missing patterns seen in practice. We introduce Contrast-X, a benchmark of paired contrast-enhanced and non-contrast imaging spanning 10 organs in CT (1,526 patients) and multi-phase breast DCE-MRI (1,116 patients). Every case carries radiologist-verified phase labels and tumor masks. We further propose FlowMI, a single model that handles arbitrary subsets of available modalities through a unified multi-modal latent space and flow matching. We benchmark a range of missing-modality configurations, reporting standard image-quality metrics, radiologist reader studies, and downstream lesion analysis on the synthesized scans. We further evaluate cross-organ generalization to test whether the model has learned a transferable contrast-enhancement operation. Dataset, code, and leaderboard will be released. Our code is available at https://github.com/YifanChen02/Contrast-X.

2601.15507 2026-05-11 cs.CV

A Unified and Controllable Framework for Layered Image Generation with Visual Effects

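Jinrui Yang, Qing Liu, Yijun Li, Mengwei Ren, Letian Zhang, Zhe Lin, Cihang Xie, Yuyin Zhou

Affiliations UC Santa Cruz; Adobe Research

AI summary This paper proposes LASAGNA, a framework that generates a photorealistic background and an RGBA foreground with visual effects in a single forward pass, supports a range of editing operations, and eliminates identity drift. The authors also release two community resources.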

English abstract

Recent image generation models produce impressive composites, but often fail to preserve the identity of user-provided content when editing specific elements: the surrounding scene may shift, and even the edited object's appearance can drift from the original. Layered representations offer a natural remedy--they allow users to independently manipulate individual elements--but existing layered methods typically produce transparent foregrounds without realistic visual effects such as shadows and reflections, forcing the use of a second harmonization model after every edit, which in turn introduces drift. To overcome these limitations, we present LASAGNA, which generates a photorealistic background (BG) and an RGBA foreground with compelling visual effects in a single forward pass. By treating object-associated visual effects as part of the foreground (FG) layer, LASAGNA supports the dominant class of consumer edits (e.g., translation, scaling, recoloring, duplication) via alpha compositing alone, without invoking any model post-edit, thereby eliminating identity drift introduced by cascade editing pipelines. This single-pass design contrasts with prior layered methods that rely on separate expert models for each task. LASAGNA handles diverse conditional inputs--text prompts, FG, BG, and location masks--within a unified architecture. We further release two community resources: LASAGNA-48K, the first public dataset of 48K layered image triplets with photorealistic visual effects, and LASAGNA-BENCH, the first standardized benchmark for layer-centric generation and editing, comprising 242 expert-annotated samples across six diverse sources. Experiments show that LASAGNA outperforms both general-purpose editors and prior layered methods across three generation modes, and supports a wide range of post-edits without any model re-inference.
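
The claim that edits such as translation need only alpha compositing can be made concrete. The Python sketch below applies the standard "over" operator with straight (non-premultiplied) alpha to tiny images stored as nested lists; translating the foreground is an index shift with no model call. The data layout and function name are assumptions for illustration and are unrelated to LASAGNA's code.

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]
RGBA = Tuple[float, float, float, float]


def composite(fg: List[List[RGBA]], bg: List[List[RGB]],
              dx: int = 0, dy: int = 0) -> List[List[RGB]]:
    """Alpha-composite `fg` over an opaque `bg`, shifted by (dx, dy) pixels."""
    height, width = len(bg), len(bg[0])
    out = [list(row) for row in bg]
    for y, row in enumerate(fg):
        for x, (r, g, b, a) in enumerate(row):
            ty, tx = y + dy, x + dx
            if 0 <= ty < height and 0 <= tx < width:
                br, bgn, bb = bg[ty][tx]
                # out = alpha * foreground + (1 - alpha) * background
                out[ty][tx] = (a * r + (1 - a) * br,
                               a * g + (1 - a) * bgn,
                               a * b + (1 - a) * bb)
    return out


background = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
foreground = [[(1.0, 1.0, 1.0, 0.5), (0.0, 0.0, 0.0, 0.0)]]
original = composite(foreground, background)
shifted = composite(foreground, background, dx=1)  # translate FG one pixel right
```

Because a shadow or reflection stored in the foreground layer carries its own alpha, it moves with the object under the same operation.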

2601.15050 2026-05-11 cs.CL

Beyond Factual Accuracy: Evaluating Global Reasoning Integrity in RAG Systems with LogicScore

Zhichao Yan, Yunxiao Zhao, Jiapu Wang, Jiaoyan Chen, Xiaoli Li, Ru Li, Jeff Z. Pan

Affiliations Shanxi University; Nanjing University of Science and Technology; University of Manchester; Singapore University of Technology and Design; University of Edinburgh

AI summary This paper proposes LogicScore, which scrutinizes global reasoning to address the neglect of logical integrity in long-form RAG generation, and shows that models with high factual accuracy can still have poor global reasoning quality.

English abstract

Current evaluation methods for Retrieval Augmented Generation (RAG) suffer from factual myopia: they relentlessly emphasize factual accuracy yet neglect global logical integrity in long-form answer generation. This drives models to force unnatural connections, producing factually grounded yet logically incoherent responses with unaddressed gaps, ambiguous links, or redundant premises. To mitigate this, we present LogicScore, shifting from local, fact-by-fact assessment to rigorous global reasoning scrutiny. Grounded in Horn Rules, our approach integrates a backward verification mechanism to systematically evaluate three key reasoning dimensions: Completeness (logically sound deduction), Essentiality (non-redundancy), and Determinateness (consistent answer entailment). Extensive experiments across three multi-hop QA datasets (HotpotQA, MusiQue, and 2WikiMultiHopQA) and over 20 LLMs (including GPT-5, Gemini-3-Pro, LLaMA3, and task-specific tuned models) reveal a critical capability gap: leading models often achieve high factual accuracy (e.g., 92.85% precision for Gemini-3 Pro) but struggle with global reasoning quality (e.g., 35.11% Essentiality for Gemini-3 Pro). Our work establishes a robust standard for logical evaluation, highlighting the need to prioritize reasoning coherence alongside factual grounding in LLM development.
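
Backward verification over Horn rules can be illustrated with propositional backward chaining. The Python sketch below is a simplified stand-in, not LogicScore itself: `proves` checks whether a goal follows from the stated facts, and `essential_facts` uses a hypothetical proxy for non-redundancy by testing whether the goal still follows when a fact is removed. The atoms and the rule are made up for the example.

```python
from typing import FrozenSet, List, Set, Tuple

Rule = Tuple[str, Tuple[str, ...]]  # (head, body): body atoms imply head


def proves(goal: str, facts: Set[str], rules: List[Rule],
           _seen: FrozenSet[str] = frozenset()) -> bool:
    """Backward chaining: is `goal` derivable from `facts` using `rules`?"""
    if goal in facts:
        return True
    if goal in _seen:  # guard against cyclic rules
        return False
    for head, body in rules:
        if head == goal and all(proves(atom, facts, rules, _seen | {goal})
                                for atom in body):
            return True
    return False


def essential_facts(goal: str, facts: Set[str], rules: List[Rule]) -> Set[str]:
    """Facts whose removal breaks the derivation of `goal`."""
    return {f for f in facts if not proves(goal, facts - {f}, rules)}


facts = {"film_directed_by_nolan", "nolan_born_in_london", "nolan_has_brother"}
rules = [("answer_is_london", ("film_directed_by_nolan", "nolan_born_in_london"))]
complete = proves("answer_is_london", facts, rules)
needed = essential_facts("answer_is_london", facts, rules)
```

Here the derivation succeeds, and the fact about a brother is flagged as redundant because the answer is still derivable without it.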

2601.14958 2026-05-11 cs.CL cs.AI

Script Sensitivity: Benchmarking Language Models on Unicode, Romanized and Mixed-Script Sinhala

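Minuri Rajapakse, Ruvan Weerasinghe

Affiliations School of Computing, Informatics Institute of Technology

AI summary This study evaluates language models on Sinhala, a low-resource, morphologically rich language. It finds substantial script sensitivity, no correlation between model size and script-handling competence, and that Unicode performance predicts mixed-script robustness but not Romanized capability.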

Comments Published at SCSE 2026 (9th IEEE International Research Conference on Smart Computing and Systems Engineering). Best Paper Award - Text Analytics Track

Journal ref 2026 9th IEEE International Research Conference on Smart Computing and Systems Engineering (SCSE), vol. 9, pp. 1-6

English abstract

The performance of Language Models (LMs) on low-resource, morphologically rich languages like Sinhala remains largely unexplored, particularly regarding script variation in digital communication. Sinhala exhibits script duality, with Unicode used in formal contexts and Romanized text dominating social media, while mixed-script usage is common in practice. This paper benchmarks 24 open-source LMs on Unicode, Romanized and mixed-script Sinhala using perplexity evaluation across diverse text sources. Results reveal substantial script sensitivity, with median performance degradation exceeding 300 times from Unicode to Romanized text. Critically, model size shows no correlation with script-handling competence, as smaller models often outperform architectures 28 times larger. Unicode performance strongly predicts mixed-script robustness but not Romanized capability, demonstrating that single-script evaluation substantially underestimates real-world deployment challenges. These findings establish baseline LM capabilities for Sinhala and provide practical guidance for model selection in multi-script low-resource environments.
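
Perplexity, the metric used in this benchmark, is the exponential of the average negative log-likelihood per token. The Python sketch below computes it from per-token log-probabilities and forms a degradation ratio between two scripts. The log-probabilities are made-up numbers, and treating "degradation" as a perplexity ratio is an assumption about how the paper reports it.

```python
import math
from typing import Sequence


def perplexity(token_log_probs: Sequence[float]) -> float:
    """exp of the mean negative natural-log probability per token."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_log_probs) / len(token_log_probs))


# Hypothetical per-token probabilities for the same sentence in two scripts.
unicode_ppl = perplexity([math.log(0.25)] * 4)    # about 4
romanized_ppl = perplexity([math.log(0.01)] * 4)  # about 100
degradation = romanized_ppl / unicode_ppl
```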

2601.04731 2026-05-11 cs.AI cs.CL

Miner: Mining Intrinsic Mastery for Data-Efficient RL in Large Reasoning Models

Shuyang Jiang, Yuhao Wang, Ya Zhang, Yanfeng Wang, Yu Wang

Affiliations Fudan University; Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory

AI summary This paper proposes Miner, which uses the policy's intrinsic uncertainty as a self-supervised reward signal to make reinforcement learning for large reasoning models more data-efficient, achieving state-of-the-art results.

Comments 24 pages

English abstract

Current critic-free RL methods for large reasoning models suffer from severe inefficiency when training on positive homogeneous prompts (where all rollouts are correct), resulting in waste of rollouts due to zero advantage estimates. We introduce a radically simple yet powerful solution to Mine intrinsic mastery (Miner), that repurposes the policy's intrinsic uncertainty as a self-supervised reward signal, with no external supervision, auxiliary models, or additional inference cost. Our method pioneers two key innovations: (1) a token-level focal credit assignment mechanism that dynamically amplifies gradients on critical uncertain tokens while suppressing overconfident ones, and (2) adaptive advantage calibration to seamlessly integrate intrinsic and verifiable rewards. Evaluated across six reasoning benchmarks on Qwen3-4B and Qwen3-8B base models, Miner outperforms the four other algorithms compared, yielding up to 4.58 absolute gains in Pass@1 and 6.66 gains in Pass@K compared to GRPO. Comparison with other methods targeted at exploration enhancement further discloses the superiority of the two newly proposed innovations. This demonstrates that latent uncertainty exploitation is both necessary and sufficient for efficient and scalable RL training of reasoning models. Code is available at https://github.com/pixas/Miner.
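
The zero-advantage problem is easy to reproduce. The Python sketch below computes group-normalized advantages in the style commonly associated with GRPO (reward minus group mean, divided by group standard deviation); the exact normalization and the `eps` constant are assumptions. When every rollout for a prompt is correct, all advantages are zero and the prompt contributes no gradient signal.

```python
import statistics
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


all_correct = group_advantages([1.0, 1.0, 1.0, 1.0])  # positive homogeneous prompt
mixed = group_advantages([1.0, 0.0, 1.0, 0.0])
```

For the all-correct group every entry is exactly 0.0, whereas the mixed group yields advantages close to +1 and -1. Miner's intrinsic reward is meant to supply a signal in the first case.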

2601.03728 2026-05-11 cs.CV cs.AI

CSMCIR: CoT-Enhanced Symmetric Alignment with Memory Bank for Composed Image Retrieval

Zhipeng Qian, Zihan Liang, Yufei Ma, Ben Chen, Huangyu Dai, Yiwei Ma, Jiayi Ji, Chenyi Lei, Han Li, Xiaoshuai Sun

Affiliations Kuaishou Technology; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University

AI summary This paper proposes CSMCIR, which combines a multi-level chain-of-thought prompting strategy, a symmetric dual-tower architecture, and an entropy-based memory bank strategy to achieve efficient query-target alignment and improve composed image retrieval performance.

English abstract

Composed Image Retrieval (CIR) enables users to search for target images using both a reference image and manipulation text, offering substantial advantages over single-modality retrieval systems. However, existing CIR methods suffer from representation space fragmentation: queries and targets comprise heterogeneous modalities and are processed by distinct encoders, forcing models to bridge misaligned representation spaces only through post-hoc alignment, which fundamentally limits retrieval performance. This architectural asymmetry manifests as three distinct, well-separated clusters in the feature space, directly demonstrating how heterogeneous modalities create fundamentally misaligned representation spaces from initialization. In this work, we propose CSMCIR, a unified representation framework that achieves efficient query-target alignment through three synergistic components. First, we introduce a Multi-level Chain-of-Thought (MCoT) prompting strategy that guides Multimodal Large Language Models to generate discriminative, semantically compatible captions for target images, establishing modal symmetry. Building upon this, we design a symmetric dual-tower architecture where both query and target sides utilize the identical shared-parameter Q-Former for cross-modal encoding, ensuring consistent feature representations and further reducing the alignment gap. Finally, this architectural symmetry enables an entropy-based, temporally dynamic Memory Bank strategy that provides high-quality negative samples while maintaining consistency with the evolving model state. Extensive experiments on four benchmark datasets demonstrate that our CSMCIR achieves state-of-the-art performance with superior training efficiency. Comprehensive ablation studies further validate the effectiveness of each proposed component.

2601.01822 2026-05-11 cs.RO cs.CV

DisCo-FLoc: Semantic-Free Floorplan Localization via $SE(2)$-Aware Contrastive Disambiguation

Ping Zhong, Shiyong Meng, Bolei Chen, Tao Zou, Chaoxu Mu, Jianxin Wang

Affiliations School of Computer Science and Engineering, Central South University; School of Electrical and Information Engineering, Tianjin University

AI summary This paper proposes DisCo-FLoc, a semantic-free visual-geometric contrastive disambiguation method that uses a depth-aware ray regression predictor and a spatially perturbed contrastive objective to improve the accuracy and robustness of floorplan localization.

Comments 9 pages, 3 figures

English abstract

Visual Floorplan Localization (FLoc) struggles with severe structural aliasing caused by repetitive minimalist layouts. This occurs because physically distant poses share highly similar visual-geometric features, which degrades spatial separability and angular discriminability. While existing methods attempt to mitigate these ambiguities by relying on costly semantic annotations, the resulting performance gains remain inherently limited. To address the above issues, we propose DisCo-FLoc, a semantic-free method for visual-geometric Contrastive Disambiguation. First, we introduce a depth-aware Ray Regression Predictor (RRP) that serves as a dense-to-ray geometric projector. By explicitly suppressing visual clutter along the vertical dimension, RRP projects monocular RGB images into 2D ray primitives, which are matched with floorplans to produce geometry-aware FLoc candidates. Second, to resolve the remaining ambiguity among these candidates, we propose a spatially perturbed contrastive objective to align RGB images with local floorplan structures and formulate a visual-geometric compatibility function. In particular, we meticulously construct positive and negative samples at both positional and directional levels through $SE(2)$ pose perturbations for contrastive learning, effectively achieving pose smoothness, spatial separability, and angular discriminability. The compatibility function enables DisCo-FLoc to disambiguate FLoc by using richer visual context beyond pure geometric layouts, without requiring any semantic annotations. Extensive experiments on two challenging visual FLoc benchmarks demonstrate that DisCo-FLoc significantly outperforms state-of-the-art semantic-based methods, especially narrowing the performance gap between positional and directional FLoc accuracy.
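
Constructing positives and negatives through $SE(2)$ pose perturbations can be sketched as follows. The Python example composes a planar pose $(x, y, \theta)$ with a perturbation expressed in the pose's own frame, then draws a small perturbation for a positive sample and a large one for a negative sample. The ranges are hypothetical thresholds chosen for illustration, not values from the paper.

```python
import math
import random
from typing import Tuple

Pose = Tuple[float, float, float]  # (x, y, heading in radians)


def compose(pose: Pose, delta: Pose) -> Pose:
    """Apply `delta`, given in the local frame of `pose`, as an SE(2) product."""
    x, y, th = pose
    dx, dy, dth = delta
    nx = x + math.cos(th) * dx - math.sin(th) * dy
    ny = y + math.sin(th) * dx + math.cos(th) * dy
    nth = (th + dth + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
    return (nx, ny, nth)


def perturb(pose: Pose, rng: random.Random,
            trans_range: Tuple[float, float],
            rot_range: Tuple[float, float]) -> Pose:
    """Sample a perturbation with bounded translation radius and rotation."""
    radius = rng.uniform(*trans_range)
    bearing = rng.uniform(-math.pi, math.pi)
    dth = rng.choice((-1.0, 1.0)) * rng.uniform(*rot_range)
    return compose(pose, (radius * math.cos(bearing),
                          radius * math.sin(bearing), dth))


rng = random.Random(0)
anchor = (2.0, 1.0, 0.5)
positive = perturb(anchor, rng, (0.0, 0.1), (0.0, 0.1))      # nearby pose
negative = perturb(anchor, rng, (1.0, 3.0), (0.5, math.pi))  # distant pose
```

A contrastive objective would then pull the image embedding toward the floorplan patch at `positive` and push it away from the patch at `negative`.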

2601.00889 2026-05-11 cs.LG

FANoS-v2: Feedback-Controlled Momentum with Thermostat Damping for Lightweight Neural Optimization

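Nalin Dhiman

Affiliations School of Computing and Electrical Engineering, Indian Institute of Technology Mandi

AI summary FANoS-v2 is a lightweight neural network optimizer that combines feedback-controlled momentum with thermostat damping. In reduced-sample experiments on MNIST, Fashion-MNIST, and CIFAR-10 it reaches higher top-1 accuracy than AdamW, at the cost of longer wall-clock time.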

Comments 17 pages, 3 figures, 5 tables

英文摘要

FANoS is a PyTorch optimizer that augments RMS-preconditioned momentum with a scalar feedback controller over update energy. The public reference implementation stores momentum in parameter-update units, applies a non-negative thermostat damping coefficient, supports diagonal, factored, and raw-gradient preconditioning, and exposes diagnostics intended for stability audits. This study gives a complete mathematical specification of the released optimizer, including the exact parameter-unit update, the study-equation physical update mode, bounded log-ratio thermostat control, adaptive preconditioner softening, warmup guardrails, and the experimental Fast profile. We report the v0.2 evidence: five-seed reduced-sample MNIST, Fashion-MNIST, and CIFAR-10 experiments show mean top-1 gains of 0.889, 2.197, and 2.666 percentage points over AdamW for Fast, but with 49.8%, 61.6%, and 56.8% higher wall-clock time. Preliminary scientific, PINN, and EEG smoke tests are mixed and are treated as hypothesis-generating only. The evidence supports FANoS as an alpha-stage research optimizer with a reproducible lightweight-vision signal and an explicit runtime bottleneck.

2512.23770 2026-05-11 cs.LG cs.AI

SB-TRPO: Towards Safe Reinforcement Learning with Hard Constraints

Dominik Wagner, Ankit Kanwar, Luke Ong

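Affiliations College of Computing and Data Science, Nanyang Technological University, Singapore; Sony Corporation

AI summary SB-TRPO is a principled algorithm for hard-constrained reinforcement learning that dynamically balances cost reduction against reward improvement, ensuring progress on safety while still improving reward. Experiments show that it achieves the best balance of safety and task performance under hard constraints.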

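AI abstract (translated from Chinese)

In safety-critical domains, reinforcement learning (RL) agents must satisfy strict, zero-cost safety constraints while completing their tasks. Existing model-free methods often either fail to reach near-zero safety violations or become overly conservative. We introduce Safety-Biased Trust Region Policy Optimisation (SB-TRPO), a principled algorithm for hard-constrained RL. Each update uses a dynamic convex combination of the reward and cost natural policy gradients, guaranteeing a fixed fraction of the optimal cost reduction while spending the remaining update capacity on reward improvement. The method provides guarantees of local progress on safety while still improving reward whenever the gradients are suitably aligned. Experiments on standard and challenging Safety Gymnasium tasks show that SB-TRPO consistently achieves the best balance of safety and task performance under hard constraints.

English abstract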

In safety-critical domains, reinforcement learning (RL) agents must often satisfy strict, zero-cost safety constraints while accomplishing tasks. Existing model-free methods frequently either fail to achieve near-zero safety violations or become overly conservative. We introduce Safety-Biased Trust Region Policy Optimisation (SB-TRPO), a principled algorithm for hard-constrained RL that dynamically balances cost reduction with reward improvement. At each step, SB-TRPO updates via a dynamic convex combination of the reward and cost natural policy gradients, ensuring a fixed fraction of optimal cost reduction while using remaining update capacity for reward improvement. Our method comes with formal guarantees of local progress on safety, while still improving reward whenever gradients are suitably aligned. Experiments on standard and challenging Safety Gymnasium tasks demonstrate that SB-TRPO consistently achieves the best balance of safety and task performance in the hard-constrained regime.
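
The convex-combination update described in this abstract can be illustrated with a simplified first-order sketch. This is not the authors' algorithm: plain gradients stand in for natural policy gradients, the trust-region step size is omitted, and the function name `safety_biased_direction` and the parameter `beta` (the guaranteed fraction of the pure cost-descent improvement) are hypothetical.

```python
import numpy as np

def safety_biased_direction(g_reward, g_cost, beta=0.5):
    """Convex combination of a reward-ascent and a cost-descent direction.

    Hypothetical simplification: plain gradients replace natural policy
    gradients, and no trust-region scaling is applied.
    Returns (direction, mu), where mu is the weight on the reward direction.
    """
    g_reward = np.asarray(g_reward, dtype=float)
    g_cost = np.asarray(g_cost, dtype=float)
    # First-order cost decrease achieved by the pure cost step -g_cost.
    cost_sq = float(g_cost @ g_cost)
    if cost_sq == 0.0:
        return g_reward.copy(), 1.0
    # For d = mu * g_reward - (1 - mu) * g_cost the first-order cost decrease is
    #   (1 - mu) * cost_sq - mu * (g_cost @ g_reward),
    # and we pick the largest mu in [0, 1] keeping it >= beta * cost_sq.
    denom = cost_sq + float(g_cost @ g_reward)
    mu = 1.0 if denom <= 0.0 else min(1.0, (1.0 - beta) * cost_sq / denom)
    direction = mu * g_reward - (1.0 - mu) * g_cost
    return direction, mu

# Conflicting gradients: raising reward also raises cost.
direction, mu = safety_biased_direction([1.0, 0.0], [1.0, 0.0], beta=0.5)
```

When the two gradients conflict, the weight on the reward direction shrinks until the cost-reduction floor is met; when they are aligned, the full reward step is kept.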

2512.20974 2026-05-11 cs.LG cs.AI cs.RO

Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions

Jingyang You, Hanna Kurniawati

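Affiliations School of Computing

AI summary This paper proposes GLiBRL, which uses learnable basis functions to make Bayesian inference tractable in deep Bayesian reinforcement learning, relates task representations to an empirical kernel-based correspondence, and improves performance on the MuJoCo and MetaWorld benchmarks.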

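AI abstract (translated from Chinese)

Bayesian reinforcement learning (BRL), a subclass of meta-reinforcement learning (Meta-RL), offers a principled framework for generalisation by explicitly incorporating Bayesian task parameters into the transition and reward models. Classical BRL methods, however, assume that the forms of these models are known. Recent deep BRL methods address this through model learning, but applying neural networks directly to the joint data and task parameters requires variational inference, which often yields indistinct task representations and harms the resulting BRL policy. To overcome these limitations, we introduce Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions (GLiBRL). The approach features fully tractable Bayesian inference over task parameters and model noise, and exact marginal-likelihood evaluation for learning the transition and reward models. The permutation-invariant nature of exact Bayesian inference in GLiBRL allows seamless integration with both on-policy and off-policy RL algorithms. We further show a closed-form relationship between the $\mathcal{L}_2$ distance of GLiBRL's task representations and an empirical kernel-based correspondence, which is to our knowledge the first such structural result for online deep BRL. Compared against representative and recent Meta-RL methods, GLiBRL improves performance on the MuJoCo and MetaWorld benchmarks by up to 1.8$\times$.

English abstract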

Bayesian Reinforcement Learning (BRL), a subclass of Meta-Reinforcement Learning (Meta-RL), provides a principled framework for generalisation by explicitly incorporating Bayesian task parameters into transition and reward models. However, classical BRL methods assume known forms of transition and reward models. While recent deep BRL methods incorporate model learning to address this, applying neural networks directly to joint data and task parameters necessitates variational inference. This often yields indistinct task representations, compromising the resulting BRL policies. To overcome these limitations, we introduce Generalised Linear Models in Deep Bayesian RL with Learnable Basis Functions (GLiBRL). Our approach features fully tractable Bayesian inference over task parameters and model noise, alongside exact marginal likelihood evaluation for learning transition and reward models. The permutation-invariant nature of exact Bayesian inference in GLiBRL enables seamless integration with both on-policy and off-policy RL algorithms. We further show that GLiBRL admits a closed-form relationship between the $\mathcal{L}_2$ distance of its task representations and empirical kernel-based correspondence between task samples, which is to our knowledge the first such structural result for online deep BRL. GLiBRL is compared against representative and recent Meta-RL methods, and improves state-of-the-art performance on both MuJoCo and MetaWorld benchmarks by up to 1.8$\times$.
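
To make the "tractable, permutation-invariant" claim concrete, here is a minimal sketch of closed-form Bayesian linear regression on basis-function features. It is only an illustration under stated assumptions: the basis is a fixed RBF map (GLiBRL learns its basis), the noise variance is treated as known (the paper also infers it), and the names `rbf_features` and `posterior` are hypothetical.

```python
import numpy as np

def rbf_features(x, centers, width=1.0):
    """Fixed radial basis features (a stand-in for a learned basis network)."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    return np.exp(-0.5 * ((x - centers.reshape(1, -1)) / width) ** 2)

def posterior(phi, y, noise_var=0.1, prior_var=1.0):
    """Closed-form Gaussian posterior over weights w in y = phi @ w + noise."""
    d = phi.shape[1]
    # The data enter only through the sums phi.T @ phi and phi.T @ y,
    # so reordering the samples cannot change the result.
    precision = np.eye(d) / prior_var + phi.T @ phi / noise_var
    cov = np.linalg.inv(precision)
    mean = cov @ (phi.T @ y) / noise_var
    return mean, cov

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=20)
y = np.sin(x) + 0.1 * rng.normal(size=20)
centers = np.linspace(-2.0, 2.0, 5)

mean, cov = posterior(rbf_features(x, centers), y)
perm = rng.permutation(20)
mean_p, cov_p = posterior(rbf_features(x[perm], centers), y[perm])
```

Up to floating-point rounding, `mean_p` and `cov_p` match `mean` and `cov`, which is why such a posterior can be fed with transitions in any order, for example from a replay buffer.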

2512.19991 2026-05-11 cs.LG

Bloom Filter Encoding for Machine Learning

John Cartmell, Mihaela Cardei, Ionut Cardei

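Affiliations Department of Electrical Engineering and Computer Science; Florida Atlantic University

AI summary This paper proposes preprocessing data with a Bloom filter transform: hash-based encoding converts each sample into a compact bit array, reducing memory usage and obfuscating the original feature values. Evaluated on six datasets, Bloom encoding performs comparably to raw data or standard dimensionality-reduction techniques on several of them while providing consistent memory savings.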

Comments 14 pages, 7 figures

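AI abstract (translated from Chinese)

We present a method that uses a Bloom filter transform to preprocess data for machine learning. Each sample is converted into a compact bit-array representation through hash-based encoding, producing a fixed-length feature space that reduces memory usage and obfuscates the original feature values. The encoding does not depend on keyed hashing; a key can optionally be used to control the mapping, and is then required to reproduce the representation. We evaluate the method on six datasets spanning text, time-series, tabular, and image domains: SMS Spam Collection, ECG200, Adult 50K, CDC Diabetes, MNIST, and Fashion MNIST. Four classifiers are considered: Extreme Gradient Boosting, deep neural networks, convolutional neural networks, and logistic regression. The results show that models trained on Bloom filter encodings perform comparably to models trained on raw data or on standard dimensionality-reduction techniques across several datasets, while providing consistent memory savings. These findings suggest that Bloom filter encoding can serve as an efficient, general-purpose preprocessing representation that preserves similarity structure useful for learning while offering a degree of data obfuscation.

English abstract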

We present a method that uses a Bloom filter transform to preprocess data for machine learning. Each sample is encoded into a compact bit-array representation using hash-based encoding, producing a fixed-length feature space that reduces memory usage and obfuscates original feature values. The encoding does not rely on keyed hashing; however, a key can optionally be used to control the mapping and would be required to reproduce the representation. We evaluate the approach on six datasets spanning text, time-series, tabular, and image domains: SMS Spam Collection, ECG200, Adult 50K, CDC Diabetes, MNIST, and Fashion MNIST. Four classifiers are considered: Extreme Gradient Boosting, Deep Neural Networks, Convolutional Neural Networks, and Logistic Regression. Results show that models trained on Bloom filter encodings achieve performance comparable to models trained on raw data or standard dimensionality reduction techniques across several datasets, while providing consistent memory savings. These findings suggest that Bloom filter encodings can serve as an efficient, general-purpose pre-processing representation that preserves useful similarity structure for learning tasks while providing a degree of data obfuscation.
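
A Bloom-filter-style encoding of one sample can be sketched as follows. The abstract does not give the paper's tokenization, hash functions, or sizes, so everything here is an assumption for illustration: the function name `bloom_encode`, the salted SHA-256 hashes, the `feature=value` tokens, and the parameter defaults.

```python
import hashlib

def bloom_encode(tokens, num_bits=64, num_hashes=3, key=b""):
    """Encode an iterable of string tokens as a fixed-length list of 0/1 bits."""
    bits = [0] * num_bits
    for token in tokens:
        for i in range(num_hashes):
            # Salting with the hash index i simulates num_hashes separate hashes;
            # the optional key changes the whole mapping.
            digest = hashlib.sha256(key + bytes([i]) + token.encode("utf-8")).digest()
            bits[int.from_bytes(digest[:8], "big") % num_bits] = 1
    return bits

# Hypothetical tabular row, turned into "feature=value" tokens.
sample = {"age": 39, "workclass": "State-gov"}
tokens = [f"{name}={value}" for name, value in sample.items()]
encoded = bloom_encode(tokens)
```

Every sample maps to the same `num_bits`-long vector regardless of how many tokens it has, and samples that share tokens share set bits, which is the similarity structure a downstream classifier can exploit.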

2512.17129 2026-05-11 cs.LG cs.MA cs.RO q-bio.QM

DiffeoMorph: Learning to Morph 3D Shapes Using Differentiable Agent-Based Simulations

Seong Ho Pahng, Guoye Guan, Benjamin Fefferman, Sahand Hormoz

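Affiliations Department of Systems Biology, Harvard Medical School; Department of Data Science, Dana-Farber Cancer Institute; Department of Biomedical Informatics, Harvard Medical School; Broad Institute of MIT and Harvard

AI summary This paper proposes DiffeoMorph, a framework that learns morphogenesis protocols through differentiable agent-based simulation so that a population of agents morphs into a target 3D shape. It uses an SE(3)-equivariant graph neural network and a shape-matching loss based on 3D Zernike polynomials that is rotation-invariant yet reflection-sensitive.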

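AI abstract (translated from Chinese)

Biological systems can form complex three-dimensional structures through the collective behavior of agents that share the same update rule. How distributed control produces precise global patterns is a central question in developmental biology, distributed robotics, programmable matter, and multi-agent learning. This paper introduces DiffeoMorph, an end-to-end differentiable framework for learning a morphogenesis protocol that guides a population of agents to morph into a target 3D shape. Each agent updates its position and internal state with an SE(3)-equivariant graph neural network, based on its own state and the signals it receives from other agents. To train the system, we introduce a new shape-matching loss based on 3D Zernike polynomials that compares the predicted and target shapes as continuous spatial distributions rather than discrete point clouds, and that is invariant to agent ordering, agent count, and global orientation. To obtain rotation invariance while preserving reflection sensitivity, an alignment step optimally rotates the predicted Zernike spectrum to match the target before the loss is computed. Benchmarks establish the advantage of this shape-matching loss over other standard distance metrics on shape-comparison tasks. We then show that DiffeoMorph can form a variety of complex shapes from minimally patterned initial conditions. DiffeoMorph provides a general framework for learning distributed control strategies for morphogenesis, swarm robotics, and programmable self-assembly.

English abstract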

Biological systems can form complex three-dimensional structures through the collective behavior of agents that share a common update rule and operate without central control. How such distributed control gives rise to precise global patterns remains a central question not only in developmental biology but also in distributed robotics, programmable matter, and multi-agent learning. Here, we introduce DiffeoMorph, an end-to-end differentiable framework for learning a morphogenesis protocol that guides a population of agents to morph into a target 3D shape. Each agent updates its position and internal state using an SE(3)-equivariant graph neural network, based on its own internal state and signals received from other agents. To train this system, we introduce a new shape-matching loss based on 3D Zernike polynomials, which compares the predicted and target shapes as continuous spatial distributions, not as discrete point clouds, and is invariant to agent ordering, number of agents, and global orientation. To achieve rotation invariance while preserving reflection sensitivity, we include an alignment step that optimally rotates the predicted Zernike spectrum to match the target before computing the loss. We perform benchmarking to establish the advantages of our shape-matching loss over other standard distance metrics for shape comparison tasks. We then demonstrate that DiffeoMorph can form a range of complex shapes from minimally patterned initial conditions. DiffeoMorph provides a general framework for learning distributed control strategies for morphogenesis, swarm robotics, and programmable self-assembly.

2512.15840 2026-05-11 cs.RO cs.CV

Large Video Planner Enables Generalizable Robot Control

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Caiyi Zhang, Peihao Li, Kiwhan Song, William T. Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, Vincent Sitzmann, Yilun Du

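Affiliations MIT; UC Berkeley; Harvard

AI summary This paper builds a general-purpose robot foundation model from large-scale video pretraining, generating video plans to achieve control that generalizes across tasks and environments, and validates the approach on real robots.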

Comments 29 pages, 16 figures

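AI abstract (translated from Chinese)

General-purpose robots need decision-making models that generalize across diverse tasks and environments. Recent work builds vision-language-action (VLA) systems by extending multimodal large language models (MLLMs) with action outputs. This paper explores an alternative paradigm that uses large-scale video pretraining as the primary modality for building robot foundation models. Unlike static images and language, video captures spatio-temporal sequences of states and actions in the physical world and aligns naturally with robot behavior. We curate an internet-scale video dataset of human activities and task demonstrations and, for the first time at foundation-model scale, train an open video model for generative robot planning. The model produces zero-shot video plans for novel scenes and tasks, from which we extract executable robot actions through post-processing. Task-level generalization is evaluated on third-party-selected tasks in the wild and in real-robot experiments, which demonstrate successful physical execution. These results show robust instruction following, strong generalization, and real-world feasibility. We release the model and dataset to support open, reproducible video-based robot learning. The project website is at https://www.boyuan.space/large-video-planner/.

English abstract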

General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.

2512.15567 2026-05-11 cs.AI cond-mat.mtrl-sci cs.LG physics.chem-ph

Evaluating Large Language Models in Scientific Discovery

Zhangde Song, Jieyu Lu, Yuanqi Du, Botao Yu, Thomas M. Pruyn, Yue Huang, Kehan Guo, Xiuzhe Luo, Yuanhao Qu, Yi Qu, Yinkai Wang, Haorui Wang, Jeff Guo, Jingru Gan, Parshin Shojaee, Di Luo, Andres M Bran, Gen Li, Qiyuan Zhao, Shao-Xiong Lennon Luo, Yuxuan Zhang, Xiang Zou, Wanru Zhao, Yifan F. Zhang, Wucheng Zhang, Shunan Zheng, Saiyang Zhang, Sartaaj Takrim Khan, Mahyar Rajabi-Kochi, Samantha Paradi-Maropakis, Tony Baltoiu, Fengyu Xie, Tianyang Chen, Kexin Huang, Weiliang Luo, Meijing Fang, Xin Yang, Lixue Cheng, Jiajun He, Soha Hassoun, Xiangliang Zhang, Wei Wang, Chandan K. Reddy, Chao Zhang, Zhiling Zheng, Mengdi Wang, Le Cong, Carla P. Gomes, Chang-Yu Hsieh, Aditya Nandy, Philippe Schwaller, Heather J. Kulik, Haojun Jia, Huan Sun, Seyed Mohamad Moosavi, Chenru Duan

Affiliations Deep Principle; Department of Computer Science, Cornell University; Department of Computer Science and Engineering, The Ohio State University; Department of Chemical Engineering & Applied Chemistry, University of Toronto; Department of Computer Science and Engineering, University of Notre Dame; QuEra Computing Inc.; Department of Pathology, Department of Genetics, Cancer Biology Program, Stanford University School of Medicine; Harvard Law School; Department of Computer Science, Tufts University; School of Computational Science and Engineering, Georgia Institute of Technology; Department of Computer Science, University of California, Los Angeles; Department of Computer Science, Virginia Tech; Department of Physics, Tsinghua University; Institute for Advanced Study, Tsinghua University; Laboratory of Artificial Chemical Intelligence, Ecole Polytechnique Federale de Lausanne

AI summary This paper proposes a scenario-grounded benchmark that evaluates the scientific-discovery capabilities of LLMs across biology, chemistry, materials science, and physics, revealing performance gaps on scientific-discovery tasks and directions for improvement.

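AI abstract (translated from Chinese)

Large language models (LLMs) are increasingly applied to scientific research, yet existing science benchmarks mainly probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials science, and physics: domain experts define research projects of practical interest and decompose them into modular research scenarios, from which vetted questions are sampled. The framework evaluates models at two levels: (i) question-level accuracy on scenario-tied items, and (ii) project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Applying this two-phase scientific discovery evaluation (SDE) framework to state-of-the-art LLMs reveals a performance gap relative to general science benchmarks, diminishing returns from scaling model size and reasoning, and systematic weaknesses shared across top-tier model providers. Large performance variation across research scenarios changes which model performs best, indicating that all current LLMs remain far from general scientific "superintelligence". Nevertheless, LLMs already show great promise in a variety of scientific discovery projects, including cases where constituent scenario scores are low, which highlights the role of guided exploration and serendipity in discovery. The SDE framework provides a reproducible benchmark for discovery-relevant evaluation of LLMs and charts paths for advancing them toward scientific discovery.

English abstract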

Large language models (LLMs) are increasingly applied to scientific research, yet prevailing science benchmarks probe decontextualized knowledge and overlook the iterative reasoning, hypothesis generation, and observation interpretation that drive scientific discovery. We introduce a scenario-grounded benchmark that evaluates LLMs across biology, chemistry, materials, and physics, where domain experts define research projects of genuine interest and decompose them into modular research scenarios from which vetted questions are sampled. The framework assesses models at two levels: (i) question-level accuracy on scenario-tied items and (ii) project-level performance, where models must propose testable hypotheses, design simulations or experiments, and interpret results. Applying this two-phase scientific discovery evaluation (SDE) framework to state-of-the-art LLMs reveals a consistent performance gap relative to general science benchmarks, diminishing returns from scaling up model size and reasoning, and systematic weaknesses shared across top-tier models from different providers. Large performance variation across research scenarios changes which model performs best on the scientific discovery projects evaluated, suggesting that all current LLMs are far from general scientific "superintelligence". Nevertheless, LLMs already demonstrate promise in a great variety of scientific discovery projects, including cases where constituent scenario scores are low, highlighting the role of guided exploration and serendipity in discovery. This SDE framework offers a reproducible benchmark for discovery-relevant evaluation of LLMs and charts practical paths to advance their development toward scientific discovery.

2512.12116 2026-05-11 cs.LG stat.ML

Neural CDEs as Correctors for Learned Time Series Models

Muhammad Bilal Shahid, Zhanhong Jiang, Prajwal Koirala, Soumik Sarkar, Cody Fleming

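Affiliations Department of Mechanical Engineering; Iowa State University; Sibley School of Mechanical and Aerospace Engineering; Cornell University

AI summary This paper proposes a predictor-corrector framework that uses neural controlled differential equations to correct time-series forecast errors, with regularization strategies that improve performance and theoretical stability guarantees. Experiments show improved forecasts across a range of predictors.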

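AI abstract (translated from Chinese)

Learned time-series models, whether continuous or discrete, are widely used to forecast the states of dynamical systems, but errors accumulate in multi-step forecasts. To address this, we propose a predictor-corrector framework in which the predictor is a learned time-series model that generates multi-step forecasts and the corrector is a neural controlled differential equation that corrects the forecast errors. The corrector handles irregularly sampled time series and is compatible with both continuous-time and discrete-time predictors. We further introduce two regularization strategies that improve the corrector's extrapolation performance and accelerate its training, and we provide theoretical guarantees on the stability and convergence of the framework. Experiments on synthetic, physics-based, and real-world datasets show that the framework consistently improves forecasting performance across diverse predictors, including neural ordinary differential equations, ContiFormer, and DLinear, demonstrating that it is predictor-agnostic.

English abstract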

Learned time-series models, whether continuous or discrete, are widely used for forecasting the states of dynamical systems but suffer from error accumulation in multi-step forecasts. To address this issue, we propose a Predictor-Corrector framework in which the Predictor is a learned time-series model that generates multi-step forecasts and the Corrector is a neural controlled differential equation that corrects the forecast errors. The Corrector works with irregularly sampled time series and is compatible with both continuous- and discrete-time Predictors. We further introduce two regularization strategies that improve the Corrector's extrapolation performance and accelerate its training. We also provide theoretical guarantees on the stability and convergence of the proposed framework. Experiments on synthetic, physics-based, and real-world datasets show that the proposed framework consistently improves forecasting performance across diverse Predictors, including neural ordinary differential equations, ContiFormer, and DLinear, demonstrating its predictor-agnostic nature.

2512.10371 2026-05-11 cs.AI

AgentProg: Empowering Long-Horizon GUI Agents with Program-Guided Context Management

Shizuo Tian, Hao Wen, Yuxuan Chen, Jiacheng Liu, Shanhui Zhao, Guohong Liu, Ju Ren, Yunxin Liu, Yuanchun Li

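Affiliations Tsinghua University; Peking University

AI summary This paper proposes AgentProg, which restructures the interaction history as a program to manage context efficiently in long-horizon GUI tasks. Experiments show state-of-the-art performance on AndroidWorld and an extended task suite, with robust behavior on long-horizon tasks.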

Comments 16 pages, 8 figures

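AI abstract (translated from Chinese)

The rapid development of mobile GUI agents has driven research on long-horizon task automation. Building such agents faces a key bottleneck: reliance on an ever-expanding interaction history incurs substantial context overhead. Existing context management and compression techniques struggle to preserve key semantic information, which degrades task performance. We propose AgentProg, a program-guided approach to agent context management that restructures the interaction history as a program with variables and control flow. Organizing information according to the program structure provides a principled mechanism for deciding which information to retain and which to discard. We further integrate a global belief-state mechanism, inspired by the Belief MDP framework, to handle partial observability and adapt to unexpected environmental changes. Experiments on AndroidWorld and our extended long-horizon task suite show that AgentProg achieves state-of-the-art success rates on these benchmarks. More importantly, it maintains robust performance on long-horizon tasks, whereas baseline methods degrade catastrophically. The system is open-sourced at https://github.com/MobileLLM/AgentProg.

English abstract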

The rapid development of mobile GUI agents has stimulated growing research interest in long-horizon task automation. However, building agents for these tasks faces a critical bottleneck: the reliance on ever-expanding interaction history incurs substantial context overhead. Existing context management and compression techniques often fail to preserve vital semantic information, leading to degraded task performance. We propose AgentProg, a program-guided approach for agent context management that reframes the interaction history as a program with variables and control flow. Organizing information according to the structure of the program provides a principled mechanism to determine which information should be retained and which can be discarded. We further integrate a global belief state mechanism inspired by the Belief MDP framework to handle partial observability and adapt to unexpected environmental changes. Experiments on AndroidWorld and our extended long-horizon task suite demonstrate that AgentProg achieves state-of-the-art success rates on these benchmarks. More importantly, it maintains robust performance on long-horizon tasks while baseline methods experience catastrophic degradation. Our system is open-sourced at https://github.com/MobileLLM/AgentProg.

2512.05439 2026-05-11 cs.AI cs.FL

BEAVER: An Efficient Deterministic LLM Verifier

Tarun Suresh, Nalin Wadhwa, Debangshu Banerjee, Gagandeep Singh

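Affiliations University of Illinois, Urbana-Champaign

AI summary BEAVER is a framework for computing deterministic probability bounds on LLM satisfaction of safety properties. It systematically explores the model's output space with novel Token trie and Frontier data structures, providing sound guarantees and identifying more high-risk instances.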

English abstract

As large language models (LLMs) transition from research prototypes to production systems, practitioners often need reliable methods to verify model outputs and characterize tail risk for safe deployment. While sampling-based estimates provide an ad hoc intuition of model behavior, they offer no sound guarantees. We present BEAVER, the first practical framework for computing deterministic, sound probability bounds on LLM satisfaction of safety properties. Given a prompt and any safety property, BEAVER systematically explores the model output space using novel Token trie and Frontier data structures, maintaining provably sound bounds at every iteration. We formalize the verification problem, prove soundness of our approach, and evaluate BEAVER on 4 safety properties across 12 open-weight LLMs. BEAVER identifies 2-3$\times$ more risky instances than baselines while using 1/10 of the compute budget, surfacing tail risks that loose bounds and ad hoc evaluation miss.
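
The idea of bounds that stay sound at every iteration can be illustrated with a toy best-first exploration of an output space. This is not BEAVER's Token trie or Frontier implementation: the model `toy_model`, the property `violates`, and the function `probability_bounds` are hypothetical, and the sketch assumes a prefix-closed property (once a prefix is unsafe, every extension is unsafe).

```python
import heapq

EOS = "<eos>"

def toy_model(prefix):
    """Hypothetical next-token distribution; a real LLM would go here."""
    if len(prefix) >= 3:
        return {EOS: 1.0}
    return {"safe": 0.6, "bad": 0.1, EOS: 0.3}

def violates(prefix):
    """Prefix-closed safety property: any sequence containing 'bad' is unsafe."""
    return "bad" in prefix

def probability_bounds(model, violates, max_expansions):
    """Return (lower, upper) bounds on P(completed output is safe)."""
    safe_mass = 0.0      # probability of finished sequences known to be safe
    unsafe_mass = 0.0    # probability of prefixes known to be unsafe
    frontier = [(-1.0, ())]   # max-heap on prefix probability (negated)
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_p, prefix = heapq.heappop(frontier)
        for token, p in model(prefix).items():
            mass = -neg_p * p
            if token == EOS:
                safe_mass += mass          # frontier prefixes are always safe
            elif violates(prefix + (token,)):
                unsafe_mass += mass        # whole subtree is unsafe
            else:
                heapq.heappush(frontier, (neg_p * p, prefix + (token,)))
    # Unexplored frontier mass could still go either way, hence the gap.
    return safe_mass, 1.0 - unsafe_mass

loose = probability_bounds(toy_model, violates, max_expansions=1)
tight = probability_bounds(toy_model, violates, max_expansions=10)
```

With one expansion the interval is wide because most probability mass is still on the frontier; once the frontier is exhausted the two bounds coincide. Unlike a sampling estimate, the interval contains the true probability after any number of expansions.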

2512.03476 2026-05-11 cs.LG cs.AI cs.MA cs.NA math.NA physics.comp-ph

ATHENA: Agentic Team for Hierarchical Evolutionary Numerical Algorithms

Juan Diego Toscano, Daniel T. Chen, George Em Karniadakis

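Affiliations Division of Applied Mathematics, Brown University

AI summary ATHENA automates scientific computing and scientific machine learning through a knowledge-driven diagnostic process that combines expert blueprints with combinatorial action spaces, producing high-accuracy numerical solutions and stable solvers and reaching validation errors of $10^{-14}$.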

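AI abstract (translated from Chinese)

The gap between theoretical concepts and computational implementation is a major bottleneck in scientific computing and scientific machine learning. We introduce ATHENA (Agentic Team for Hierarchical Evolutionary Numerical Algorithms), an agentic framework that acts as an autonomous lab managing the end-to-end computational research lifecycle. Its core is the HENA loop, a knowledge-driven diagnostic process framed as a contextual bandit problem. Acting as an online learner, the system analyzes prior trials, selects structural 'actions' ($A_n$) from combinatorial spaces, and translates them into executable code ($S_n$) to generate scientific rewards ($R_n$). ATHENA goes beyond standard automation: in scientific computing, it autonomously identifies mathematical symmetries to obtain exact analytical solutions or derives stable numerical solvers, even where foundation models fail. In scientific machine learning, it performs deep diagnosis to handle ill-posed formulations and combines hybrid symbolic-numeric workflows (for example, coupling PINNs with FEM) to solve multiphysics problems. The framework achieves super-human performance, reaching validation errors of $10^{-14}$. In addition, collaborative human-in-the-loop intervention lets the system bridge stability gaps, improving results by an order of magnitude. This paradigm shifts the focus from implementation mechanics to methodological innovation, accelerating scientific discovery.

English abstract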

Bridging the gap between theoretical conceptualization and computational implementation is a major bottleneck in Scientific Computing (SciC) and Scientific Machine Learning (SciML). We introduce ATHENA (Agentic Team for Hierarchical Evolutionary Numerical Algorithms), an agentic framework designed as an Autonomous Lab to manage the end-to-end computational research lifecycle. Its core is the HENA loop, a knowledge-driven diagnostic process framed as a Contextual Bandit problem. Acting as an online learner, the system analyzes prior trials to select structural 'actions' ($A_n$) from combinatorial spaces guided by expert blueprints (e.g., Universal Approximation, Physics-Informed constraints). These actions are translated into executable code ($S_n$) to generate scientific rewards ($R_n$). ATHENA transcends standard automation: in SciC, it autonomously identifies mathematical symmetries for exact analytical solutions or derives stable numerical solvers where foundation models fail. In SciML, it performs deep diagnosis to tackle ill-posed formulations and combines hybrid symbolic-numeric workflows (e.g., coupling PINNs with FEM) to resolve multiphysics problems. The framework achieves super-human performance, reaching validation errors of $10^{-14}$. Furthermore, collaborative "human-in-the-loop" intervention allows the system to bridge stability gaps, improving results by an order of magnitude. This paradigm shifts the focus from implementation mechanics to methodological innovation, accelerating scientific discovery.

2512.03454 2026-05-11 cs.CV cs.AI

Think Before You Drive: World Model-Inspired Multimodal Grounding for Autonomous Vehicles

Haicheng Liao, Huanming Shen, Bonan Wang, Yongkang Li, Yihong Tang, Chengyue Wang, Dingyi Zhuang, Kehua Chen, Hai Yang, Chengzhong Xu, Zhenning Li

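Affiliations University of Macau; UESTC; Purdue University; McGill University; Massachusetts Institute of Technology; University of Washington; The Hong Kong University of Science and Technology

AI summary This paper proposes ThinkDeeper, a framework that improves autonomous vehicles' understanding of natural-language commands by predicting future spatial states and fusing multimodal inputs with a hypergraph-guided decoder. It also introduces the DrivePilot dataset and performs strongly on multiple benchmarks.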

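AI abstract (translated from Chinese)

Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent commands because they cannot reason about 3D spatial relations and anticipated scene evolution. Building on the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making a grounding decision. At its core is a Spatial-Aware World Model (SA-WM) that distills the current scene into a command-aware latent state and rolls out a sequence of future latent states, providing forward-looking cues for disambiguation. A hypergraph-guided decoder then hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. We also present DrivePilot, a multi-source visual grounding dataset for autonomous driving whose semantic annotations are generated by an LLM pipeline that uses retrieval-augmented generation (RAG) and chain-of-thought (CoT) prompting. In extensive evaluations on six benchmarks, ThinkDeeper ranks first on the Talk2Car leaderboard and surpasses state-of-the-art baselines on the DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it is robust and efficient in challenging scenes (long text, multiple agents, ambiguity) and retains superior performance even when trained on only 50% of the data.

English abstract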

Interpreting natural-language commands to localize target objects is critical for autonomous driving (AD). Existing visual grounding (VG) methods for autonomous vehicles (AVs) typically struggle with ambiguous, context-dependent instructions, as they lack reasoning over 3D spatial relations and anticipated scene evolution. Grounded in the principles of world models, we propose ThinkDeeper, a framework that reasons about future spatial states before making grounding decisions. At its core is a Spatial-Aware World Model (SA-WM) that learns to reason ahead by distilling the current scene into a command-aware latent state and rolling out a sequence of future latent states, providing forward-looking cues for disambiguation. Complementing this, a hypergraph-guided decoder hierarchically fuses these states with the multimodal input, capturing higher-order spatial dependencies for robust localization. In addition, we present DrivePilot, a multi-source VG dataset in AD, featuring semantic annotations generated by a Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT)-prompted LLM pipeline. In extensive evaluations on six benchmarks, ThinkDeeper ranks #1 on the Talk2Car leaderboard and surpasses state-of-the-art baselines on the DrivePilot, MoCAD, and RefCOCO/+/g benchmarks. Notably, it shows strong robustness and efficiency in challenging scenes (long-text, multi-agent, ambiguity) and retains superior performance even when trained on 50% of the data.

2511.22316 2026-05-11 cs.LG

Outlier Smoothing with Closed-Form Rotations for W4A4 Large Language Model Quantization

Jinying Xiao, Bin Ji, Shasha Li, Xiaodong Liu, Ma Jun, Chao Wang, Wei Li, Ye Zhong, Xuan Xie, Nyima Tashi, Jie Yu

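Affiliations National University of Defense Technology; Qinghai Normal University; Tibet University

AI summary This paper proposes SingleQuant, a framework that smooths activation outliers with closed-form optimal rotations to improve quantized LLM performance. Experiments show that it outperforms the baselines across a variety of tasks.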

Comments 9 pages, 4 figures

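AI abstract (translated from Chinese)

Quantization helps deploy large language models (LLMs) in resource-constrained environments, but existing methods that combine mutually incompatible gradient optimization and quantization truncation suffer from severe convergence problems. This study confirms that the Straight-Through Estimator (STE) on Stiefel manifolds introduces non-smoothness and gradient noise, which obstruct convergence and hinder the development of high-fidelity quantized LLMs. To address these limitations, we propose SingleQuant, a single-pass quantization framework that is decoupled from quantization truncation and thereby eliminates these sources of non-smoothness and gradient noise. Specifically, SingleQuant constructs an Alignment Rotation Transformation (ART) and a Uniformity Rotation Transformation (URT) that target different activation outliers: ART smooths outliers through closed-form optimal rotations, and URT reshapes distributions through a geometric mapping. Both matrices are composed of strictly formulated Givens rotations with predefined dimensions and rotation angles, which yields promising LLM task performance in a short time. Experiments show that SingleQuant outperforms the selected baselines on a variety of tasks for 7B-70B LLMs: quantized LLMs reach higher task performance in less quantization time. For example, when quantizing LLaMA-2-13B, SingleQuant achieves a 1,400$\times$ quantization speedup and improves average task performance by 0.57% over the best selected baseline.

English abstract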

Quantization of large language models (LLMs) facilitates deploying them in resource-limited settings, but existing methods that combine incompatible gradient optimization and quantization truncation lead to serious convergence pathology. This prolongs quantization time and degrades LLMs' task performance. Our studies confirm that the Straight-Through Estimator (STE) on Stiefel manifolds introduces non-smoothness and gradient noise, obstructing optimization convergence and blocking high-fidelity quantized LLM development despite extensive training. To tackle the above limitations, we propose SingleQuant, a single-pass quantization framework that decouples from quantization truncation, thereby eliminating the above non-smoothness and gradient noise factors. Specifically, SingleQuant constructs an Alignment Rotation Transformation (ART) and a Uniformity Rotation Transformation (URT) targeting distinct activation outliers, where ART smooths outlier values via closed-form optimal rotations, and URT reshapes distributions through geometric mapping. Both matrices comprise strictly formulated Givens rotations with predetermined dimensions and rotation angles, enabling promising LLM task performance within a short time. Experimental results demonstrate SingleQuant's superiority over the selected baselines across diverse tasks on 7B-70B LLMs. More precisely, SingleQuant enables quantized LLMs to achieve higher task performance while requiring less time for quantization. For example, when quantizing LLaMA-2-13B, SingleQuant achieves a 1,400$\times$ quantization speedup and improves average task performance by 0.57% compared to the best selected baseline.
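
How a single Givens rotation with a closed-form angle can flatten an activation outlier is sketched below. This is not the paper's ART or URT construction: the angle rule (rotate a coordinate pair so both entries become equal), the function name `equalizing_givens`, and the example vectors are assumptions chosen for illustration.

```python
import numpy as np

def equalizing_givens(x, i, j):
    """Return a Givens rotation G acting on coordinates i and j of x such that
    the rotated entries (G @ x)[i] and (G @ x)[j] are equal."""
    # Closed-form angle: rotate the point (x[i], x[j]) onto the 45-degree line.
    theta = np.pi / 4 - np.arctan2(x[j], x[i])
    c, s = np.cos(theta), np.sin(theta)
    G = np.eye(len(x))
    G[i, i], G[i, j] = c, -s
    G[j, i], G[j, j] = s, c
    return G

x = np.array([10.0, 0.1, 0.2, -0.1])   # activation vector with an outlier in slot 0
w = np.array([0.5, -1.0, 2.0, 0.3])    # one weight row
G = equalizing_givens(x, 0, 1)
x_rot, w_rot = G @ x, G @ w
```

Because `G` is orthogonal, rotating the activation and the weight row together leaves their inner product unchanged while the largest activation magnitude shrinks, which narrows the range a low-bit quantizer has to cover.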

2511.18085 2026-05-11 cs.RO cs.AI

Continually Evolving Skill Knowledge in Vision Language Action Model

Yuxuan Wu, Guangming Wang, Zhiheng Yang, Tianchen Deng, Maoqing Yao, Brian Sheil, Hesheng Wang

Affiliations Shanghai Jiao Tong University; Shanghai Innovation Institute; University of Cambridge; Beihang University; Nanyang Technological University; MIT SMART; AgiBot

AI summary This paper proposes Stellar VLA, a continual imitation learning framework that adds no parameters and lets vision-language-action models accumulate knowledge continually. Experiments show strong task specialization and knowledge transfer.

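AI abstract (translated from Chinese)

Vision-language-action (VLA) models show promising knowledge accumulation from pretraining, but continual learning remains challenging, especially efficient adaptation. Existing continual imitation learning (CIL) methods usually rely on additional parameters or external modules, which limits scalability for large VLA models. We propose Stellar VLA, a knowledge-driven CIL framework that does not increase the number of network parameters. Two progressively extended variants are designed: T-Stellar for flat, task-centric modeling and TS-Stellar for a hierarchical task-skill structure. Stellar VLA achieves self-evolving knowledge learning by jointly optimizing task representations and a learned knowledge space. A knowledge-guided expert routing mechanism, conditioned on knowledge relations and Top-K semantic embeddings, enables task specialization without increasing model size. On the LIBERO benchmark, Stellar VLA performs strongly among VLA and CIL baselines while using only 1% data replay. Real-world evaluation on a dual-arm platform validates effective knowledge transfer. TS-Stellar excels at hierarchical manipulation, and visualizations reveal robust knowledge retention and task discovery. Project website: https://stellarvla.github.io/

English abstract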

Vision-language-action (VLA) models show promising knowledge accumulation ability from pretraining, yet continual learning in VLA remains challenging, especially for efficient adaptation. Existing continual imitation learning (CIL) methods often rely on additional parameters or external modules, limiting scalability for large VLA models. We propose Stellar VLA, a knowledge-driven CIL framework that does not increase network parameters. Two progressively extended variants are designed: T-Stellar for flat task-centric modeling and TS-Stellar for hierarchical task-skill structure. Stellar VLA enables self-evolving knowledge learning by jointly optimizing task representations and a learned knowledge space. We propose a knowledge-guided expert routing mechanism conditioned on knowledge relation and Top-K semantic embeddings, enabling task specialization without increasing model size. Experiments on the LIBERO benchmark show that Stellar VLAs achieve strong performance among both VLA and CIL baselines, using only 1% data replay. Real-world evaluation on a dual-arm platform with distinct embodiment and scene configurations validates effective knowledge transfer. TS-Stellar excels in hierarchical manipulation, and visualizations reveal robust knowledge retention and task discovery. Project Website: https://stellarvla.github.io/

2511.09907 2026-05-11 cs.AI cs.CV

Learning to Pose Problems: Reasoning-Driven and Solver-Adaptive Data Synthesis

Yongxian Wei, Yilin Zhao, Zixuan Hu, Li Shen, Xinrui Chen, Runxi Cheng, Sinan Du, Hao Yu, Chun Yuan, Dian Li

Affiliations * Tsinghua University; Tencent; Nanyang Technological University; Sun Yat-sen University

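AI summary This paper proposes a reasoning-driven, solver-adaptive data synthesis method that builds related problem pairs and uses a reasoning model to generate intermediate problem-design steps, improving the generator's problem-design strategy; experiments report a 3.4% cumulative average improvement across 10 benchmarks.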

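Details
AI abstract (translated from the Chinese)

Data synthesis for training large reasoning models is a scalable alternative to limited, human-curated datasets and can produce high-quality data. However, existing methods face challenges: (i) indiscriminate generation that ignores solver ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; (ii) a lack of reasoning in problem generation, which leads to shallow problem variants. This paper develops a problem generator that reasons explicitly to plan problem directions before synthesis and adjusts difficulty to the solver's ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data are used to bootstrap the generator's problem-design strategies. We then use the solver's feedback on synthetic problems as a reward signal, so that the generator can calibrate difficulty and produce complementary problems near the edge of the solver's competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that the proposed framework achieves a cumulative average improvement of 3.4%, demonstrating robust generalization across language and vision-language models.

English abstract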

Data synthesis for training large reasoning models offers a scalable alternative to limited, human-curated datasets, enabling the creation of high-quality data. However, existing approaches face several challenges: (i) indiscriminate generation that ignores the solver's ability and yields low-value problems, or reliance on complex data pipelines to balance problem difficulty; and (ii) a lack of reasoning in problem generation, leading to shallow problem variants. In this paper, we develop a problem generator that reasons explicitly to plan problem directions before synthesis and adapts difficulty to the solver's ability. Specifically, we construct related problem pairs and augment them with intermediate problem-design CoT produced by a reasoning model. These data are used to bootstrap problem-design strategies in the generator. Then, we treat the solver's feedback on synthetic problems as a reward signal, enabling the generator to calibrate difficulty and produce complementary problems near the edge of the solver's competence. Extensive experiments on 10 mathematical and general reasoning benchmarks show that our proposed framework achieves a cumulative average improvement of 3.4%, demonstrating robust generalization across both language and vision-language models.
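The abstract says solver feedback is used as a reward so that problems land near the edge of the solver's competence, but it does not give the reward formula. As a rough illustration only, the sketch below assumes a hypothetical shaping in which the reward peaks when the solver succeeds on about half of its attempts; the function name `difficulty_reward` and the 0.5 target are assumptions, not the authors' design.

```python
def difficulty_reward(solver_outcomes, target_rate=0.5):
    """Hypothetical reward: highest when the solver's pass rate equals target_rate.

    solver_outcomes: list of booleans, one per solver attempt on a synthetic problem.
    Returns a value in [0, 1]; trivially easy or impossible problems score lowest.
    """
    if not solver_outcomes:
        raise ValueError("need at least one solver attempt")
    pass_rate = sum(solver_outcomes) / len(solver_outcomes)
    # Largest possible distance from the target, used to scale the reward into [0, 1].
    max_gap = max(target_rate, 1.0 - target_rate)
    return 1.0 - abs(pass_rate - target_rate) / max_gap


# A problem solved in 2 of 4 attempts sits at the assumed competence edge.
edge = difficulty_reward([True, False, True, False])
too_easy = difficulty_reward([True, True, True, True])
too_hard = difficulty_reward([False, False, False, False])
```

Under this assumed shaping, problems the solver always or never solves earn no reward, which pushes a generator toward intermediate difficulty.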

2511.09598 2026-05-11 cs.LG

Amortized Multi-Objective Optimization Across Tasks with Generative Solution Modeling

Tingyang Wei, Jiao Liu, Abhishek Gupta, Chin Chun Ooi, Puay Siew Tan, Yew-Soon Ong

Affiliations * College of Computing and Data Science, Nanyang Technological University, Singapore; School of Mechanical Sciences, Indian Institute of Technology, Goa, India; Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore; Singapore Institute of Manufacturing Technology, Agency for Science, Technology and Research, Singapore

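AI summary This paper proposes a new parametric multi-objective Bayesian optimizer that combines generative solution sampling with acquisition-driven search exploiting inter-task synergies, enabling direct solution prediction over a continuous task-preference space without repeated expensive evaluations.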

Comments Accepted by IJCAI 2026

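Details
AI abstract (translated from the Chinese)

Many real-world applications require solving a family of expensive multi-objective optimization problems (EMOPs) under varying operating conditions. This can be formulated as parametric expensive multi-objective optimization problems (P-EMOPs), in which each task parameter defines a distinct optimization instance. Current multi-objective Bayesian optimization methods are widely used to find a finite set of Pareto-optimal solutions for each task. P-EMOPs, however, pose a fundamental challenge: the continuous task-parameter space can contain infinitely many distinct problems, each requiring its own expensive evaluations. To address this, we propose learning an inverse model that amortizes the cost of multi-objective optimization across the continuous task-preference space, so that solutions can be predicted directly for any query without expensive re-evaluation. This paper introduces a novel parametric multi-objective Bayesian optimizer that learns this inverse model by alternating between (1) generative solution sampling via a conditional generative model and (2) acquisition-driven search that exploits inter-task synergies. The approach optimizes effectively across multiple tasks and ultimately predicts solutions directly for unseen parameterized EMOPs without re-evaluation. We theoretically justify faster convergence from inter-task synergies using task-aware Gaussian processes. Building on this, empirical studies on synthetic and real-world benchmarks further validate the proposed parametric optimizer.

English abstract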

Many real-world applications require solving families of expensive multi-objective optimization problems (EMOPs) under varying operational conditions. This can be formulated as parametric expensive multi-objective optimization problems (P-EMOPs) where each task parameter defines a distinct optimization instance. Current multi-objective Bayesian optimization methods have been widely used for finding finite sets of Pareto optimal solutions for each task. However, P-EMOPs present a fundamental challenge: the continuous task parameter space can contain infinitely many distinct problems, each requiring separate expensive evaluations. To address this, we propose learning an inverse model to amortize the multi-objective optimization cost across the continuous task-preference space, enabling direct solution prediction for any query without the need for expensive re-evaluation. This paper introduces a novel parametric multi-objective Bayesian optimizer that learns this inverse model by alternating between (1) generative solution sampling via conditional generative models and (2) acquisition-driven search leveraging inter-task synergies. This approach enables effective optimization across multiple tasks and finally achieves direct solution prediction for unseen parameterized EMOPs without re-evaluations. We theoretically justify the faster convergence by leveraging inter-task synergies through task-aware Gaussian processes. Based on that, empirical studies on synthetic and real-world benchmarks further verify the effectiveness of the proposed parametric optimizer.
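For readers unfamiliar with the term, the "Pareto optimal solutions" mentioned above are the candidates that no other candidate beats in every objective. The sketch below shows only that standard definition for a finite set of evaluated points, assuming all objectives are minimized; it is not the paper's optimizer, and the function names and sample points are illustrative.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and strictly better in
    at least one (all objectives are assumed to be minimized)."""
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that are not dominated by any other point."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Four evaluated candidates, each with two objective values (illustrative data).
candidates = [(1.0, 4.0), (2.0, 2.0), (3.0, 3.0), (4.0, 1.0)]
front = pareto_front(candidates)
```

Here `(3.0, 3.0)` is dropped because `(2.0, 2.0)` is better in both objectives, while the other three each represent a different trade-off.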

2511.02805 2026-05-11 cs.CL cs.AI

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, Xianpei Han

Affiliations * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China; Xiaohongshu Inc

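AI summary MemSearcher trains LLMs to reason, search, and manage memory via end-to-end reinforcement learning; by maintaining a compact memory, it keeps context length stable across multi-turn interactions and outperforms history-concatenation baselines.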

Comments Accepted to ACL 2026

Details
English abstract

LLM-based search agents often concatenate the full interaction history into the context, producing long and noisy inputs, and increasing compute cost and GPU memory overhead. To address this issue, we propose MemSearcher, an agent framework that maintains a compact memory during multi-turn interactions, retaining only question-relevant information and thereby keeping the context length stable across turns. Training MemSearcher is challenging because each trajectory spans multiple turns under different LLM contexts, making each turn an independent optimization target in reinforcement learning. We introduce multi-context GRPO, which propagates trajectory-level advantages to all turns for end-to-end optimization. Experiments demonstrate that MemSearcher outperforms strong history-concatenation (ReAct-style) baselines on a range of public datasets while maintaining nearly constant token counts across multi-turn interactions. The code and models will be publicly available at https://github.com/icip-cas/MemSearcher
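The idea of propagating trajectory-level advantages to all turns can be sketched as follows. This is a minimal illustration that assumes the usual GRPO-style normalization (reward minus group mean, divided by group standard deviation) and one scalar reward per trajectory; the abstract does not give the exact formula, and the function names and sample rewards are hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each trajectory's reward against its sampled group
    (assumed GRPO-style: subtract the group mean, divide by the group std)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def propagate_to_turns(advantages, turns_per_trajectory):
    """Copy each trajectory-level advantage to every turn of that trajectory,
    so each turn (its own LLM context) can be optimized with the same signal."""
    return [[adv] * n for adv, n in zip(advantages, turns_per_trajectory)]


# One group of four sampled trajectories for the same question (illustrative).
rewards = [1.0, 0.0, 1.0, 0.0]
turns = [3, 2, 1, 4]
advantages = group_relative_advantages(rewards)
per_turn = propagate_to_turns(advantages, turns)
```

Each inner list in `per_turn` has one entry per turn, all equal to that trajectory's advantage, which is what lets turns with different contexts share a single end-to-end training signal.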