arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.26862 2026-05-27 cs.CV

RoadGIE: Towards A Global-Scale Aerial Benchmark for Generalizable Interactive Road Extraction

Chenxu Peng, Chenxu Wang, Yimian Dai, Yongxiang Liu, Ming-Ming Cheng, Xiang Li

Affiliations: NKIARI, Shenzhen Futian; VCIP, CS, Nankai University; AAIS, Nankai University; College of Electronic Engineering, National University of Defense Technology, Changsha, China

AI summary: Introduces WorldRoadSeg-360K, the largest and most diverse road segmentation dataset to date, and RoadGIE, an interactive method that supports connectivity-aware prompts and achieves state-of-the-art segmentation accuracy and topological consistency.

Abstract

Accurate road segmentation from aerial imagery is fundamental to many geospatial applications. However, existing datasets often suffer from limited scene diversity, low semantic granularity, and poor structural continuity, restricting their generalization across environments. To address these challenges, we introduce WorldRoadSeg-360K, the largest and most diverse road segmentation dataset to date, comprising 366,947 high-resolution images collected from 38 countries and 223 cities across various terrains and continents. WorldRoadSeg-360K serves as a comprehensive benchmark and reveals key challenges in handling diverse and structurally complex scenes. Automated approaches often struggle to preserve road connectivity, while current interactive methods lack efficient, topology-sensitive tools for real-world road editing. To this end, we present RoadGIE, establishing a novel interactive paradigm for road extraction in remote sensing. Unlike prior point- or box-based prompting strategies, RoadGIE supports connectivity-aware prompts, including clicks and scribbles, which inherently align with the topology of road networks. To improve structural consistency and mitigate performance degradation during iterative interactions, RoadGIE integrates an expert-guided prompting strategy and adapts the skeleton-based recall loss for interactive scenarios. RoadGIE achieves state-of-the-art performance in both segmentation accuracy and topological consistency on WorldRoadSeg-360K and other benchmarks, while maintaining efficient operation with only 3.7M parameters. The code is publicly available at: https://github.com/chaineypung/RoadGIE

2605.26861 2026-05-27 cs.CV

REVERSE: Reinforcing Evidence Verification and Search for Agentic Image geo-localization

Yong Li, Furong Jia, Dacheng Yin, Kang Rong, Fengyun Rao, Jing Lyu, Fan Zhang

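Affiliations: Peking University; The Hong Kong University of Science and Technology; WeChat Vision, Tencent Inc

AI summary: Proposes REVERSE, a framework that reinforces the interplay between evidence search and verification through multi-turn agentic reasoning; on image geo-localization it outperforms strong retrieval-augmented baselines and, with a 4B model, rivals substantially larger models.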

Abstract

Image geo-localization aims to determine where a photograph was taken, a task that often requires more than recognizing visible landmarks. Human experts typically solve it through an iterative workflow: they inspect informative regions, form location hypotheses, seek external evidence, and revise their judgments as new clues appear. Existing methods only partially capture this process: direct prediction methods bypass evidence acquisition altogether, while retrieval-augmented methods introduce external evidence but usually provide limited supervision on the intermediate decisions of where to search, how to query, and how to filter noisy results. We present REVERSE, a framework that reinforces the interplay between evidence search and verification to enable multi-turn agentic reasoning. REVERSE teaches three intermediate decisions: where to look, what to query, and what evidence to trust. To support this, we construct tool-grounded trajectories with annotated region selections, search observations, and geo-informative evidence labels, and introduce process rewards for visual grounding, query utility, and evidence discrimination. An offline search cache makes retrieval observations stable and reusable during reinforcement learning, enabling dense supervision over noisy search results. With a 4B model, REVERSE outperforms strong retrieval-augmented baselines and rivals substantially larger models on Im2GPS3k and YFCC4k. Code is available at https://github.com/yonglleee/REVERSE.

2605.26857 2026-05-27 cs.LG

Generalist Graph Anomaly Detection via Prototype-Based Distillation

Yiming Xu, Zihan Chen, Zhen Peng, Song Wang, Bin Shi, Bo Dong, Chao Shen

Affiliations: School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China; National Engineering Research Center for Visual Information and Applications, Xi'an, China; University of Virginia, Charlottesville, USA; University of Central Florida, Orlando, USA; School of Distance Education, Xi'an Jiaotong University, Xi'an, China; School of Cyber Science and Engineering, Xi'an Jiaotong University, Xi'an, China

AI summary: Proposes ProMoS, the first unsupervised generalist graph anomaly detection framework, which distills normality priors from a frozen self-supervised GNN teacher and uses prototype-guided soft-label distillation to enable zero-shot anomaly detection across graphs.

Comments: Accepted by ICML 2026

Abstract

Driven by the pressing demand for graph anomaly detection (GAD) in high-stakes domains, the generalist GAD paradigm, which trains a single detector transferable across new graphs, has recently gained growing attention. However, existing methods often rely on scarce and costly annotations for training and sometimes even require few-shot support at inference, which limits their robustness to diverse and unseen anomaly patterns. To address this limitation, we introduce ProMoS, the first unsupervised generalist GAD framework, which detects anomalies by modeling the abundant normality in unlabeled data. ProMoS adopts a knowledge-distillation paradigm to distill normality priors from a frozen self-supervised graph neural network (GNN) teacher to a mixture-of-students model with shared global and lightweight personalized branches, enabling efficient and expressive normality modeling without learning from scratch. We further propose prototype-guided soft-label distillation to align teacher and student in a shared prototype space, enhancing cross-graph generalizability. During inference, ProMoS performs zero-shot anomaly detection on unseen graphs via distillation bias and prototype geometric deviation. Extensive experiments show the effectiveness and efficiency of ProMoS, charting a practical path toward label-free, zero-shot generalist GAD.
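The prototype-guided soft-label idea can be illustrated with a minimal sketch. It assumes, without confirmation from the abstract, that a soft label is a temperature-scaled softmax over dot-product similarities to shared prototypes and that teacher-student mismatch is measured with a KL divergence. The names `soft_assign` and `distillation_gap`, the temperature, and the toy prototypes are hypothetical and are not taken from ProMoS.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def soft_assign(embedding, prototypes, temperature=0.5):
    """Soft label: distribution over prototypes by dot-product similarity (assumed form)."""
    sims = [sum(e * p for e, p in zip(embedding, proto)) for proto in prototypes]
    return softmax([s / temperature for s in sims])


def distillation_gap(teacher_emb, student_emb, prototypes):
    """KL(teacher || student) between prototype assignments (hypothetical anomaly signal)."""
    t = soft_assign(teacher_emb, prototypes)
    s = soft_assign(student_emb, prototypes)
    return sum(ti * math.log(ti / si) for ti, si in zip(t, s))


# Toy prototypes and embeddings, chosen only for illustration.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
aligned = distillation_gap([0.9, 0.1], [0.9, 0.1], prototypes)
shifted = distillation_gap([0.9, 0.1], [0.1, 0.9], prototypes)
```

In this toy setting a node whose student embedding matches the teacher has a gap of zero, while a node the student cannot reproduce has a positive gap, which is the kind of signal a distillation-based detector could threshold.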

2605.26855 2026-05-27 cs.CV

Receipt Replay OOD: A Small Benchmark for Screen Replay Detection Under Domain Shift

Alexander Vinogradov

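Affiliations: IU International University of Applied Science

AI summary: Addresses domain shift in screen replay attack detection by introducing a small receipt-based OOD benchmark and evaluating cross-domain generalization.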

Abstract

Public datasets such as DLC-2021, SynID, and KID34K have significantly contributed to research on presentation attack detection for identity documents, including screen replay attacks. However, evaluation of out-of-domain (OOD) robustness remains insufficiently explored, especially under realistic domain shifts. In this work, we introduce Receipt Replay OOD, a small out-of-domain benchmark for screen replay detection. Receipts share several characteristics with identity documents, including planar geometry, curved corners, wear-and-tear artifacts, and text or logo patterns, while avoiding personally identifiable information constraints commonly associated with identity documents. We evaluate document replay detection models under cross-domain conditions and demonstrate the impact of domain shift on generalization performance. The dataset is publicly available.

2605.26854 2026-05-27 cs.LG

RAPNet: Accelerating Algebraic Multigrid with Learned Sparse Corrections

Yali Fink, Ido Ben-Yair, Lars Ruthotto, Eran Treister

Affiliations: Institute for Interdisciplinary Computational Sciences, Faculty of Computer and Information Science, Ben-Gurion University of the Negev, Be'er Sheva, Israel; Department of Mathematics and Computer Science, Emory University, Atlanta, GA, USA

AI summary: Proposes RAPNet, a graph neural network framework that learns to generate sparse, robust coarse-grid operators directly from the sparse algebraic system, addressing the trade-off between sparsity and convergence quality in algebraic multigrid; a level-wise training strategy lets it generalize to large-scale problems.

Comments: Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. Code available at https://github.com/idoby/rapnet

Abstract

The scalable solution of large sparse linear systems is a bottleneck in scientific computing and graph analysis. While algebraic multigrid (AMG) offers optimal linear scaling, its performance is severely constrained by the trade-off between the sparsity and convergence quality of coarse-grid operators. Classical AMG heuristics struggle to balance these objectives, often sacrificing stability or performance for sparsity. We propose RAPNet, a graph neural network (GNN) framework that resolves this trade-off by learning to generate sparse, robust coarse operators directly from the sparse algebraic system. Key to our approach is a level-wise training strategy that enables learning from small subgraphs and generalization to million-node domains, bypassing the bottlenecks of prior neural AMG attempts. RAPNet executes exclusively during the solver setup phase, ensuring that the solve phase retains its favorable computational properties. We show that our method outperforms classical non-Galerkin baselines on diverse PDE discretizations and graph Laplacians, making it particularly effective for multi-query tasks such as eigenproblems, time-dependent simulations, and inverse or design problems.
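For orientation, the standard Galerkin coarse-grid operator in multigrid is the triple product A_c = R A P, where P is the prolongation and R the restriction. The sketch below builds it for the unscaled 1D Poisson stencil with linear interpolation and R taken as the transpose of P. It only illustrates the classical construction that non-Galerkin and learned approaches try to sparsify; it is not RAPNet itself, and the helper names are illustrative.

```python
def matmul(a, b):
    """Dense matrix product for small list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    """Transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*a)]


def poisson_1d(n):
    """Tridiagonal (-1, 2, -1) matrix, the unscaled 1D Poisson stencil."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
            for i in range(n)]


def linear_interpolation(n_coarse):
    """Prolongation P from n_coarse points to 2 * n_coarse + 1 fine points."""
    n_fine = 2 * n_coarse + 1
    p = [[0.0] * n_coarse for _ in range(n_fine)]
    for j in range(n_coarse):
        p[2 * j][j] = 0.5
        p[2 * j + 1][j] = 1.0
        p[2 * j + 2][j] = 0.5
    return p


def nonzeros(a):
    """Number of nonzero entries, a simple sparsity measure."""
    return sum(1 for row in a for v in row if v != 0.0)


A = poisson_1d(7)
P = linear_interpolation(3)
R = transpose(P)
A_coarse = matmul(matmul(R, A), P)  # Galerkin triple product R A P
```

In 1D the coarse operator stays tridiagonal, so `nonzeros(A_coarse)` is small. On 2D and 3D problems and on general graphs the same product typically produces denser stencils on coarser levels, which is the sparsity versus convergence tension the abstract describes.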

2605.26850 2026-05-27 cs.LG

Learning Energy-Based Models from Stochastic Interpolants using Spatiotemporal Differences

Hanlin Yu, RuiKang OuYang, Partha Kaushik, Arto Klami, Michael U. Gutmann, Omar Chehab

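Affiliations: University of Helsinki; University of Cambridge; Carnegie Mellon University; University of Edinburgh

AI summary: Proposes Spatiotemporal Noise-Contrastive Estimation (stNCE), a framework that learns energy functions from stochastic interpolants through joint spatiotemporal differences, unifies existing methods, and is competitive with state-of-the-art density estimation methods.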

Abstract

Learning an energy-based model from data samples is a central problem in machine learning. Many recent and popular methods, such as denoising score matching for training energy-based diffusion models, use stochastic interpolants to corrupt data samples at different noise levels indexed by a time variable. This defines a joint density over both the data space and time, and most methods learn its energy through either spatial or temporal differences. We identify distinct failure modes for both of these approaches. To solve them, we propose Spatiotemporal Noise-Contrastive Estimation (stNCE), a framework for learning the energy through joint spatiotemporal differences. stNCE unifies many existing methods and leads to new training objectives. Experiments on images and molecules demonstrate performance competitive with state-of-the-art density estimation methods.

2605.26849 2026-05-27 cs.CL

Uncertainty-Aware Budget Allocation for Adaptive Test-Time Reasoning

Manh Nguyen, Sunil Gupta, Hung Le

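Affiliations: Applied Artificial Intelligence Initiative

AI summary: Proposes Uncertainty-Aware Budget Allocation (UAB), a concave integer optimization framework that reallocates a fixed sampling budget according to per-question uncertainty estimated at no additional inference cost, improving accuracy by up to 3% on average and up to 5% on individual reasoning benchmarks.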

Abstract

Sampling multiple responses improves language model reasoning, but uniform compute allocation is inefficient: easy questions are over-sampled while hard questions remain under-explored. We propose Uncertainty-Aware Budget Allocation (UAB), a concave integer optimization framework that reallocates a fixed sampling budget based on per-question uncertainty estimated at no additional inference cost. In Phase 1, every question receives one generation; its average negative log-likelihood (ANLL), extracted directly from output log-probabilities, serves as a difficulty signal while the generation contributes to the final vote. In Phase 2, the remaining budget is allocated by a marginal-greedy algorithm that solves a concave coverage-maximization surrogate exactly: uncertain questions receive more sampling budget while confident questions receive fewer additional samples. Evaluated on six open-weight and black-box models spanning 1.5B to 27B parameters and five reasoning benchmarks covering math, logic, and preference tasks, UAB outperforms baselines by up to +3% in average accuracy and up to +5% on individual benchmarks, with the largest gains in low-resource settings, requiring no auxiliary model or additional LLM call. Code is publicly available at https://github.com/manhitv/UAB.
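The two-phase procedure can be sketched as follows. The abstract does not state the surrogate objective, so this sketch assumes a coverage objective of the form sum_i 1 - (1 - p_i)^(n_i), with a per-sample success probability p_i = exp(-ANLL_i). Both the objective and the mapping from ANLL to p_i are assumptions made for illustration, as are the function names. Because the marginal gains of each question shrink as it receives more samples, a greedy pass over the largest marginal gain solves this separable concave problem exactly.

```python
import heapq
import math


def anll(token_logprobs):
    """Average negative log-likelihood of one generation (Phase 1 difficulty signal)."""
    return -sum(token_logprobs) / len(token_logprobs)


def allocate_budget(uncertainties, total_budget):
    """Marginal-greedy allocation under an assumed concave coverage surrogate.

    Assumption (not from the paper): question i succeeds per sample with
    probability p_i = exp(-uncertainty_i), and the objective is the sum over
    questions of 1 - (1 - p_i) ** n_i.
    """
    n_questions = len(uncertainties)
    if total_budget < n_questions:
        raise ValueError("budget must cover one Phase 1 sample per question")
    probs = [math.exp(-u) for u in uncertainties]
    counts = [1] * n_questions  # Phase 1: every question gets one generation
    # Marginal gain of one more sample for a question with k samples: (1 - p)^k * p.
    heap = [(-((1.0 - p) ** 1) * p, i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    for _ in range(total_budget - n_questions):  # Phase 2: spend the remainder greedily
        _, i = heapq.heappop(heap)
        counts[i] += 1
        heapq.heappush(heap, (-((1.0 - probs[i]) ** counts[i]) * probs[i], i))
    return counts


# One confident question (ANLL 0.1) and one uncertain question (ANLL 1.0).
counts = allocate_budget([0.1, 1.0], total_budget=6)
```

Under this particular surrogate the uncertain question ends up with more samples than the confident one, while every question keeps its Phase 1 generation. A different surrogate would change the exact split but not the greedy structure.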

2605.26844 2026-05-27 cs.LG

Not All Disagreement Is Learnable: Token Teachability in On-Policy Distillation

Yuanyi Wang, Su Lu, Yanggan Gu, Pengkai Wang, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Hongxia Yang

Affiliations: The Hong Kong Polytechnic University (PolyU); Daya Bay Technology and Innovation Research Institute

AI summary: Proposes Teachability-Aware On-Policy Distillation (TA-OPD), which identifies and selects the token positions where the teacher signal is learnable, and often surpasses full-token distillation while retaining only 5% of tokens.

Abstract

On-policy distillation (OPD) trains a student on its own rollouts with token-level teacher supervision. Recent selective OPD methods exploit the non-uniformity of OPD signals by prioritizing high-entropy or high-disagreement tokens. We revisit this principle and ask: which token-level teacher signals are actually learnable? Using a fixed-context diagnostic that measures same-context teacher-student KL reduction, we show that raw KL disagreement is a coarse proxy for learning value. It conflates learnable disagreement, where the teacher assigns corrective mass to the student's top-K candidates, with incompatible disagreement, where the teacher places mass mostly off the student's current support. We formalize this local compatibility as token teachability and show that it better predicts fixed-context improvement than raw KL alone. Motivated by this finding, we propose Teachability-Aware OPD (TA-OPD), a lightweight token-position selection method that applies OPD loss to high-teachability positions without reward models or verifiers. Across Qwen2.5 and Qwen 3 teacher-student settings, TA-OPD often surpasses full-token OPD with only 5% retained tokens and improves over entropy- and divergence-based baselines. Our results reframe selective OPD as selecting learnable teacher signals rather than merely salient tokens.
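The distinction between learnable and incompatible disagreement can be made concrete with a small sketch. The abstract does not give the formula for token teachability, so the proxy here (teacher probability mass on the student's top-K tokens) and the ranking score (raw KL multiplied by that mass) are assumptions for illustration only, as are the function names and toy distributions.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def teachability(teacher, student, k):
    """Teacher mass on the student's top-k tokens (assumed proxy, not the paper's definition)."""
    top_k = sorted(range(len(student)), key=lambda i: student[i], reverse=True)[:k]
    return sum(teacher[i] for i in top_k)


def select_positions(teacher_dists, student_dists, k=2, keep_fraction=0.05):
    """Keep the positions with the highest hypothetical score KL * teachability."""
    scores = [kl_divergence(t, s) * teachability(t, s, k)
              for t, s in zip(teacher_dists, student_dists)]
    n_keep = max(1, round(keep_fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_keep])


student = [0.5, 0.4, 0.05, 0.05]
teacher_learnable = [0.1, 0.8, 0.05, 0.05]     # corrective mass inside the student's top-2
teacher_incompatible = [0.05, 0.05, 0.1, 0.8]  # mass mostly off the student's support
kept = select_positions([teacher_learnable, teacher_incompatible],
                        [student, student], k=2, keep_fraction=0.5)
```

In this toy case the incompatible position has the larger raw KL, yet the teachability-weighted score keeps the learnable position, mirroring the abstract's argument that raw KL is only a coarse proxy for learning value.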

2605.26842 2026-05-27 cs.LG cs.CL

MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

Jiacheng Li, Jianchao Tan, Hongtao Xu, Jiaqi Zhang, Yifan Lu, Yerui Sun, Yuchen Xie, Xunliang Cai

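Affiliations: Meituan

AI summary: Proposes the MONA optimizer, which adds a Nesterov-style acceleration term to Muon's gradient processing pipeline for curvature-aware acceleration that helps escape sharp local minima, and reports better convergence and downstream performance in Mixture-of-Experts pretraining from 1B to 68B parameters.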

Abstract

The Muon optimizer has recently offered a promising alternative to AdamW for large language model training, leveraging matrix orthogonalization to produce geometry-aware updates. However, like all first-order methods, Muon can become trapped in sharp local minima. In this work, we present MONA, an optimizer that bridges Muon's orthogonalization framework with curvature-aware acceleration. MONA adds an acceleration term directly into Muon's gradient processing pipeline. This term is calculated from the exponential moving average of gradient differences. We provide a detailed convergence analysis for MONA, showing that the acceleration term enables escape from sharp minima while preserving Muon's spectral-norm regularization. Empirically, MONA achieves better convergence and downstream task performance compared to both Muon and AdamW across three scales of Mixture-of-Experts pretraining, spanning from 1B to 68B parameters, with the largest model trained on 1 trillion tokens. Furthermore, we conduct supervised fine-tuning on the MOE-68B-A3B model and evaluate it on general capability, mathematical reasoning, and code generation benchmarks, where MONA achieves SOTA performance.
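A rough sketch of the pipeline described above, under several stated assumptions: the abstract gives no update equations, so the way the acceleration term is combined with momentum, the coefficients `beta`, `beta_diff`, and `gamma`, and the function names are all hypothetical. The orthogonalization uses a plain cubic Newton-Schulz iteration as a simple stand-in, not Muon's tuned iteration.

```python
import math


def matmul(a, b):
    """Dense matrix product for small list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    """Transpose of a list-of-lists matrix."""
    return [list(row) for row in zip(*a)]


def axpby(alpha, x, beta, y):
    """Elementwise alpha * x + beta * y for equally shaped matrices."""
    return [[alpha * xv + beta * yv for xv, yv in zip(xr, yr)] for xr, yr in zip(x, y)]


def orthogonalize(m, steps=10):
    """Approximate the orthogonal polar factor of m with cubic Newton-Schulz steps."""
    norm = math.sqrt(sum(v * v for row in m for v in row)) or 1.0
    x = [[v / norm for v in row] for row in m]  # singular values now lie in (0, 1]
    for _ in range(steps):
        x = axpby(1.5, x, -0.5, matmul(matmul(x, transpose(x)), x))
    return x


def accelerated_step(weight, grad, state, lr=0.1, beta=0.9, beta_diff=0.9, gamma=0.5):
    """One hypothetical update: momentum plus an EMA of gradient differences, then orthogonalize."""
    zeros = [[0.0] * len(grad[0]) for _ in grad]
    prev_grad = state.get("prev_grad", grad)
    diff = axpby(1.0, grad, -1.0, prev_grad)
    diff_ema = axpby(beta_diff, state.get("diff_ema", zeros), 1.0 - beta_diff, diff)
    momentum = axpby(beta, state.get("momentum", zeros), 1.0, grad)
    direction = axpby(1.0, momentum, gamma, diff_ema)  # assumed way of adding the term
    state.update(prev_grad=grad, diff_ema=diff_ema, momentum=momentum)
    return axpby(1.0, weight, -lr, orthogonalize(direction))


state = {}
w = accelerated_step([[0.0, 0.0], [0.0, 0.0]], [[3.0, 1.0], [1.0, 2.0]], state)
```

On the first step the gradient difference is zero, so the update reduces to an orthogonalized gradient step. The acceleration term only contributes once consecutive gradients differ, which is where curvature information would enter.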

2605.26840 2026-05-27 cs.CL

Optimising Factual Consistency in Summarisation via Preference Learning from Multiple Imperfect Metrics

Yuxuan Ye, Raul Santos-Rodriguez, Edwin Simpson

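Affiliations: Intelligent System Laboratory

AI summary: Proposes an automated training pipeline that aggregates scores from multiple weak factuality metrics, maps them to preferences, filters out high-disagreement samples, and runs preference learning on lexically similar summary pairs to improve the factual consistency of summaries.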

Comments: EMNLP 2025 Findings

Abstract

Reinforcement learning with evaluation metrics as rewards is widely used to enhance specific capabilities of language models. However, for tasks such as factually consistent summarisation, existing metrics remain underdeveloped, limiting their effectiveness as signals for shaping model behaviour. While individual factuality metrics are unreliable, their combination can more effectively capture diverse factual errors. We leverage this insight to introduce an automated training pipeline that improves factual consistency in summaries by aggregating scores from different weak metrics. Our approach avoids the need for complex reward shaping by mapping scores to preferences and filtering out cases with high disagreement between metrics. For each source document, we generate lexically similar summary pairs by varying decoding strategies, enabling the model to learn from factual differences caused by subtle lexical differences. This approach constructs a high-quality preference dataset using only source documents. Experiments demonstrate consistent factuality gains across models, ranging from early encoder-decoder architectures to modern large language models, with smaller models reaching comparable factuality to larger ones.
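The score-to-preference step can be sketched in a few lines. The abstract does not say how scores are aggregated or how disagreement is measured, so this sketch assumes a simple majority vote across metrics with a minimum agreement threshold. The function name, the threshold, and the example scores are hypothetical.

```python
def to_preference(scores_a, scores_b, min_agreement=0.75):
    """Map per-metric scores for two summaries to a preference, or None if filtered out.

    Assumption: each weak metric casts one vote for the summary it scores higher,
    and pairs where too few metrics agree with the majority are discarded.
    """
    votes = [1 if a > b else -1 if a < b else 0 for a, b in zip(scores_a, scores_b)]
    net = sum(votes)
    if net == 0:
        return None  # metrics are evenly split, so there is no usable signal
    majority = 1 if net > 0 else -1
    agreement = sum(1 for v in votes if v == majority) / len(votes)
    if agreement < min_agreement:
        return None  # high disagreement between metrics
    return ("a", "b") if majority > 0 else ("b", "a")  # (chosen, rejected)


# Four hypothetical factuality metrics scoring two summaries of one document.
clear_pair = to_preference([0.9, 0.8, 0.7, 0.6], [0.5, 0.4, 0.3, 0.65])
split_pair = to_preference([0.9, 0.2], [0.1, 0.8])
```

Pairs that survive the filter become (chosen, rejected) examples for a preference-learning objective, while split pairs are dropped instead of being forced into a noisy label.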

2605.26835 2026-05-27 cs.AI

Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs

Yunbo Long, Haolang Zhao, Ge Zheng, Alexandra Brintrup

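Affiliations: Department of Engineering, University of Cambridge; The Alan Turing Institute

AI summary: Proposes Helicase, an autonomous multi-agent LLM system that uses uncertainty-guided iterative verification and knowledge graph construction to solve structural inference problems in supply chains that require multi-hop reasoning, and introduces the SCQA benchmark for evaluation.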

Abstract

LLM-based multi-agent systems have been widely adopted for knowledge retrieval and report generation, synthesizing known information through web search and textual reasoning. However, many critical information tasks in supply chains are not simple one-shot queries: they are structural inference problems requiring multi-hop reasoning across complex, fragmented web resources. Questions such as "Which Tesla components use lithium from Australian mines?" have no answer in any single document; answers must be computationally synthesized through the autonomous construction and analysis of dynamic knowledge graphs assembled from fragmented, heterogeneous sources. Moreover, such discovery processes must be uncertainty-aware: decisions depend not only on answers but on calibrated confidence in their reliability, traceable to source quality and reasoning consistency. To address this capability gap, we propose Helicase, an autonomous multi-agent LLM system for uncertainty-guided supply chain knowledge graph construction. Helicase decomposes high-level supply-chain queries into executable investigation plans, coordinates specialized web-search, reasoning, and coding agents through iterative verification loops, and incrementally constructs query-specific supply chain knowledge graphs with per-fact uncertainty annotations. Its three-layer uncertainty framework tracks uncertainty at the action, trajectory, and memory layers, enabling both structural inference and calibrated confidence assessment. To evaluate autonomous reasoning across the full complexity spectrum, we introduce SCQA (Supply Chain Query Assessment), a benchmark of 80 supply chain queries organized into four quadrants spanning single-hop to multi-hop inference under both high and low data visibility.

2605.26833 2026-05-27 cs.LG cs.AI

Periodic Topological Deep Learning for Polymer Design and Discovery

Yasharth Yadav, Tze Kwang Gerald Er, Atsushi Goto, Kelin Xia

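Affiliations: School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371; School of Chemistry, Chemical Engineering and Biotechnology (CCEB), Nanyang Technological University, Singapore 637371

AI summary: Proposes Periodic-TDL, a deep learning framework built on periodic Vietoris-Rips complexes and hierarchical simplicial message passing that captures many-body interactions and long-range information, outperforms existing models on polymer property prediction, and validates that ester-to-amide substitution and α-methylation improve thermal stability.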

Comments: 19 pages, 3 figures, 3 tables

Abstract

Polymers underpin applications across energy, healthcare, and materials science, yet their vast chemical space makes systematic discovery challenging. Most machine learning approaches represent polymers as molecular graphs of a single repeating unit, thereby missing both the periodicity of polymer chains and many-body interactions beyond pairwise bonds. We introduce Periodic-TDL, a deep learning framework built on periodic Vietoris-Rips complexes that capture many-body interactions across multiple spatial scales, followed by a hierarchical simplicial message-passing (HSMP) encoder that propagates information from long-range interactions to covalent bonds, yielding representations enriched by higher-order topological features. Periodic-TDL outperforms all state-of-the-art models across polymer property prediction tasks spanning electronic, optical, physical, and thermal targets. Furthermore, we quantitatively validate how ester-to-amide substitution and $\alpha$-methylation enhance thermal stability. Using a computationally synthesized dataset of 48,208 structures, generated via systematic substitution of acrylate and acrylamide polymers, we observed a mean $T_g$ increase of $\sim 55^\circ$C for ester-to-amide substitutions and $\sim 14^\circ$C for backbone $\alpha$-methylation across matched polymer pairs. To verify these predicted trends, we use our Periodic-TDL model to analyze six novel polymer pairs from independent experimental measurements, including three newly synthesized polymers previously unreported in the literature. The experimental data successfully confirmed the model's predictions. Ultimately, these findings demonstrate that Periodic-TDL captures the underlying physical effects of specific functional group modifications, rather than merely optimizing predictive performance on benchmark datasets.

2605.26831 2026-05-27 cs.CV cs.RO

OSMa-Bench++: Toward Open-Ended Benchmarking of Semantic Mapping for Manipulation with Prompt-Generated Synthetic Scenes

Regina Kurkova, Maxim Popov, Sergey Kolyubin

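Affiliations: Biomechatronics and Energy-Efficient Robotics (BE2R) Lab, ITMO University

AI summary: Extends OSMa-Bench to controllable benchmarking with prompt-generated synthetic indoor scenes and adds a prompt-grounded VQA category for stress-testing semantic mapping methods under clutter, small objects, partial occlusions, and lighting variation.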

Comments: Code: https://github.com/be2rlab/OSMa-Bench-v2

Abstract

Semantic mapping methods are increasingly used as intermediate scene representations for downstream robotic reasoning and manipulation, yet their evaluation is still largely tied to fixed benchmark datasets with limited coverage of manipulation-relevant corner cases. In this work, we extend OSMa-Bench toward controllable benchmarking with prompt-generated synthetic indoor scenes. Our pipeline automatically generates scene descriptions, synthesizes corresponding environments with SceneSmith, and adapts the resulting assets into an OSMa-Bench-compatible simulation format. This adaptation requires a nontrivial intermediate layer, including semantic normalization, material and texture repair, shader fallback policies, floor handling, navigation setup, and controlled lighting configuration. A key advantage of the proposed setup is that the original scene-generation prompt is known in advance and can therefore serve as an auxiliary semantic specification of the intended scene. We use this property to extend the VQA component of OSMa-Bench with a prompt-grounded question category. The resulting framework supports targeted stress-testing of semantic scene representations under conditions such as clutter, small objects, partial occlusions, and lighting variation, and makes benchmarking more extensible and better aligned with downstream manipulation requirements. Our code is available at https://github.com/be2rlab/OSMa-Bench-v2.

2605.26830 2026-05-27 cs.LG cs.AI cs.CV

The Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery

Vasileios Saketos, Ming Xiao

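Affiliations: KTH Royal Institute of Technology

AI summary: To address the degradation of Kalman filtering in nonlinear sensing scenarios, proposes Kalman Evolve, a framework that jointly optimizes noise parameters and the update structure and uses large language models to generate interpretable non-affine modifications, achieving up to 12% RMSE reduction across several benchmarks.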

Abstract

State estimation is a fundamental problem in control and signal processing, for which the Kalman filter provides an optimal solution under linear dynamics, Gaussian noise, and known noise covariances. However, these assumptions often fail in realistic sensing settings such as Doppler radar and LiDAR. In these cases, the optimal estimator is inherently nonlinear, which leads to systematic performance degradation. This creates a performance gap that cannot be eliminated by tuning the noise covariance parameters (i.e., the process and measurement noise in the Kalman filter) alone. To address this limitation, we propose Kalman Evolve, a framework for discovering improved filtering algorithms by jointly optimizing both noise parameters and the update structure. Our approach leverages large language models (LLMs) as a structured prior over program space, enabling the generation of interpretable, non-affine modifications to the classical Kalman filter while preserving its recursive form. We provide analytical results establishing the suboptimality of affine estimators under common nonlinear sensing models, motivating the need for structure-aware updates. Across a range of synthetic and real-world tracking benchmarks, including Doppler radar, LiDAR-based localization, and pedestrian tracking, the discovered algorithms consistently improve over strong baselines such as the Optimized Kalman Filter, achieving up to 12% reduction in RMSE. These results suggest that optimizing the structure of the Kalman filter, rather than only its parameters, provides a practical and interpretable way to improve state estimation.
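As background for the affine versus non-affine distinction, the sketch below shows one predict/update cycle of the classical scalar Kalman filter. It is the textbook baseline, not the Kalman Evolve framework or any of its discovered algorithms. The default noise values are arbitrary illustrations.

```python
def kalman_step(x, p, z, a=1.0, h=1.0, q=0.01, r=1.0):
    """One predict/update cycle of a scalar Kalman filter.

    x, p: prior state estimate and its variance; z: new measurement.
    a, h: state-transition and measurement coefficients.
    q, r: process and measurement noise variances.
    """
    # Predict: propagate the estimate and its variance through the linear model.
    x_pred = a * x
    p_pred = a * p * a + q
    # Update: the gain depends only on variances, never on the measurement value.
    gain = p_pred * h / (h * p_pred * h + r)
    x_new = x_pred + gain * (z - h * x_pred)
    p_new = (1.0 - gain * h) * p_pred
    return x_new, p_new, gain


x_new, p_new, gain = kalman_step(x=0.0, p=1.0, z=2.0, q=0.0, r=1.0)
```

Because the gain is independent of the measurement, the updated estimate is an affine function of `z`. Tuning `q` and `r` changes the gain but cannot change that affine form, which is the limitation the abstract argues must be addressed through the update structure.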

2605.26828 2026-05-27 cs.RO

Learning Compositional Symbolic Task Rules from Demonstrations with Inductive Logic Programming

通过归纳逻辑编程从演示中学习组合符号任务规则

Oleh Borys, Karla Stepanova

发表机构 * Czech Institute of Informatics, Robotics and Cybernetics(捷克信息学、机器人学与自动控制研究所)

AI总结 提出一种基于归纳逻辑编程的分解学习方法,从演示中学习可解释、可重用且支持强泛化的符号任务规则。

Comments In: ICRA 2026 Workshop on Semantics for Reliable Robot Autonomy: From Environment Understanding and Reasoning to Safe Interaction, Vienna, 2026 In: ICRA 2026, International Joint Workshop on Ontologies, Semantic Maps and Autonomous Robotics Standardization (J-WOSMARS 2026), Vienna, 2026

详情
AI中文摘要

从演示中学习不仅应捕捉任务如何执行,还应解释演示行为的高层任务结构。随着机器人变得更加自主,这种任务表示必须可检查、可重用且人类可解释。为此,我们研究如何通过归纳逻辑编程(ILP)表示和学习机器人任务,将复杂任务分解为不同抽象(本体)层次上的一系列更简单的学习目标。该系统从演示和先验(领域)知识中推断符号规则,并在学习更高层任务结构时重用已学习的规则。我们在一个合成的积木组装场景中评估了该方法,结果表明学习到的抽象是可解释的,并支持对更难的、包含未见物体的保留任务进行强泛化。这些结果初步证明分解的ILP是实现任务级LfD的可行方法。

英文摘要

Learning from Demonstration (LfD) should capture not only how a task is executed, but also its high-level task structure that explains the demonstrated behavior. As robots become more autonomous, such task representations must be inspectable, reusable, and human-interpretable. To address this, we study how to represent and learn robotic tasks with inductive logic programming (ILP) by decomposing a complex task into a series of simpler learning objectives at different abstraction (ontological) levels. The system infers symbolic rules from demonstrations and prior (domain) knowledge, and reuses learned rules when learning higher-level task structure. We evaluate the approach in a synthetic block-assembly scenario and show that the learned abstractions are interpretable and support strong generalization to harder, held-out tasks with unseen objects. These results provide preliminary evidence that decomposed ILP is a feasible approach to task-level LfD.

2605.26827 2026-05-27 cs.CL cs.AI

ContextGuard: Structured Self-Auditing for Context Learning in Language Models

ContextGuard: 语言模型中上下文学习的结构化自我审计

Hongbo Jin, Chi Wang, Haoran Tang, Zhongjing Du, Xu Jiang, Jingqi Tian, Qiaoman Zhang, Jiayu Ding

发表机构 * Peking University(北京大学) SCUT(华南理工大学) Tsinghua University(清华大学)

AI总结 提出ContextGuard框架,通过结构化自我审计机制使大语言模型在复杂上下文任务中忠实遵循所有上下文约束,包括外围、持久和格式敏感要求。

详情
AI中文摘要

最近的基准测试揭示,尽管大语言模型(LLMs)具有强大的推理能力,但在忠实应用复杂上下文知识方面仍存在困难。这些失败通常不是整体推理崩溃:在上下文丰富的任务中,模型可能遵循中心推理路径,同时遗漏外围、持久或格式敏感的要求。

英文摘要

Recent benchmarks reveal that despite strong reasoning capabilities, large language models (LLMs) still struggle to faithfully apply complex contextual knowledge. These failures are often not wholesale reasoning collapses: in context-rich tasks, models may follow the central reasoning path while missing peripheral, persistent, or format-sensitive requirements.

2605.26823 2026-05-27 cs.CL

Generating Logically Consistent Synthetic Supply Chain Data with LLM-Driven Knowledge Graph Reasoning

基于LLM驱动知识图谱推理生成逻辑一致的合成供应链数据

Yunbo Long, Ge Zheng, Liming Xu, Alexandra Brintrup

发表机构 * Department of Engineering, University of Cambridge(剑桥大学工程系) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 针对合成供应链数据需保持操作逻辑一致性的问题,提出TabKG框架,通过构建列关系知识图谱并利用多LLM集成验证关系,结合潜在扩散模型生成逻辑一致的表格数据。

详情
AI中文摘要

合成数据为供应链分析中两个长期存在的障碍(数据稀缺和数据隐私)提供了一种有前景的解决方案。然而,要使合成数据支持运营模拟和决策,它必须不仅再现真实记录的统计分布,还要保留支配供应链流程的操作逻辑,包括时间顺序、数学依赖、层次分类和条件规则,这些使记录在操作上合理。我们将这种逻辑视为供应链数据的“物理”。现有的表格生成模型主要针对分布保真度和下游预测效用进行优化,因此通常生成统计上看似真实但违反基本操作约束的记录。本文介绍了TabKG,一个知识图谱引导的框架,用于生成逻辑一致的合成供应链表格数据。TabKG构建了一个列关系知识图谱(CR-KG)来表示数据操作依赖。它使用多LLM集成和多数投票从列元数据中提出候选关系,通过真实数据验证这些关系以去除幻觉或未支持的边,然后使用验证后的CR-KG指导生成。具体而言,TabKG将原始表压缩为独立列,使用潜在扩散模型生成这些列,并根据验证后的关系确定性地重建依赖列,从而通过构造强制与发现的操作规则保持逻辑一致性。

英文摘要

Synthetic data offers a promising solution to two persistent barriers in supply chain analytics: data scarcity and data privacy. However, for synthetic data to support operational simulation and decision-making, it must not only reproduce the statistical distributions of real records but also preserve the operational logic that governs supply chain processes, including the temporal orderings, mathematical dependencies, hierarchical taxonomies, and conditional rules that make a record operationally plausible. We consider this logic as the "physics" of supply chain data. Existing tabular generative models are primarily optimized for distributional fidelity and downstream predictive utility, and therefore often generate records that appear statistically realistic but violate fundamental operational constraints. This paper introduces TabKG, a knowledge-graph-guided framework for logically consistent synthetic supply chain tabular data generation. TabKG constructs a Column Relationship Knowledge Graph (CR-KG) to represent data operational dependencies. It uses a multi-LLM ensemble with majority voting to propose candidate relationships from column metadata, validates these relationships against real data to remove hallucinated or unsupported edges, and then uses the validated CR-KG to guide generation. Specifically, TabKG compresses the original table into independent columns, generates these columns using a latent diffusion model, and deterministically reconstructs dependent columns according to the validated relationships, enforcing logical consistency by construction with respect to the discovered operational rules.

2605.26820 2026-05-27 cs.RO

Can VLA Models Learn from Real-World Data Continually without Forgetting?

VLA 模型能否从现实世界数据中持续学习而不遗忘?

Jiarun Zhu, Yijun Hong, Xiaoquan Sun, Zetian Xu, Mingqi Yuan, Zhiyong Wang, Wenjun Zeng, Jiayu Chen

发表机构 * HKU(香港大学) INFIFORCE EIT, Ningbo(宁波东方理工大学) HUST(华中科技大学) SUSTech(南方科技大学) HITSZ(哈尔滨工业大学(深圳))

AI总结 本研究通过构建包含四个顺序操作任务的真实世界持续学习数据集,实证发现视觉-语言-动作(VLA)模型在持续学习异构真实世界演示时存在严重灾难性遗忘,并系统评估了经验回放方法的关键实施因素。

详情
AI中文摘要

视觉-语言-动作(VLA)模型为通用机器人提供了有前景的基础。然而,它们在现实场景中的成功部署需要能够持续获取新技能,同时保留先前学习的行为。虽然开创性研究在狭窄的模拟环境中研究了VLA模型的持续学习,但在现实条件下这一挑战仍未得到充分探索。为解决这一局限,我们构建了一个真实世界的持续学习数据集,包含四个顺序操作任务,涵盖刚体抓取放置、接触式按压和可变形物体折叠。利用该数据集,我们进行了全面实验,发现VLA模型在持续学习异构真实世界演示时遭受显著的灾难性遗忘。然后,我们系统评估了经验回放,并揭示了决定其成功的关键实施因素。总之,这项工作提供了真实世界持续VLA学习的首次实证研究,并为部署长期运行的机器人策略提供了实用指导。

英文摘要

Vision-language-action (VLA) models provide a promising foundation for general-purpose robotics. However, their successful deployment in real-world scenarios requires the ability to continually acquire new skills while retaining previously learned behaviors. While pioneering research has studied the continual learning of VLA models in narrowly simulated environments, this challenge remains largely unexplored under realistic conditions. To address this limitation, we construct a real-world continual learning dataset comprising four sequential manipulation tasks, spanning rigid-object pick-and-place, contact-rich pressing, and deformable-object folding. Using this dataset, we conduct comprehensive experiments and find that VLA models suffer significant catastrophic forgetting when continually learning from heterogeneous real-world demonstrations. We then systematically evaluate experience replay and uncover key implementation factors that govern its success. In summary, this work provides the first empirical study of real-world continual VLA learning and offers practical guidance for deploying long-lived robot policies.

2605.26808 2026-05-27 cs.LG cs.AI cs.IT math.IT

Innovation: An Almost Characterization of Hallucination

创新:幻觉的几乎刻画

Nishant P. Das, Piyush Srivastava

发表机构 * School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai, Maharashtra - 400 005, India(塔塔基础研究所技术与计算机科学学院,孟买,马哈拉施特拉邦 - 400 005,印度)

AI总结 本文引入“创新”属性来刻画大语言模型幻觉的必然性,证明创新与幻觉几乎等价,并基于创新率给出新的幻觉率下界。

详情
AI中文摘要

幻觉是大语言模型(LLMs)的一个核心局限,大量工作致力于理解和缓解它。为此,Kalai 和 Vempala(STOC 2024)引入了一个概率框架来形式化校准和幻觉,并证明高概率下,校准的 LLM 大致以“缺失质量”(衡量训练数据相对于其来源的不完整程度)的速率产生幻觉。这引出了两个基本问题:(i) 校准的 LLM 的什么属性使得幻觉不可避免?(ii) 能否通过放弃校准来避免幻觉?我们通过引入一个更简单的属性——我们称之为“创新”——来回答这些问题,该属性衡量模型产生训练数据之外输出的倾向。我们证明,创新由 Kalai 和 Vempala 识别的幻觉条件蕴含,并且进一步,它是幻觉的几乎刻画:幻觉蕴含创新,反之,创新高概率地蕴含幻觉。我们还基于“创新率”给出了幻觉率的下界,并通过将创新率与缺失质量联系起来,获得了基于缺失质量的新的幻觉率下界,扩展了 Kalai 和 Vempala 的结果。

英文摘要

Hallucination is a central limitation of large language models (LLMs), and substantial effort has been devoted to understanding and mitigating it. Towards this, Kalai and Vempala (STOC 2024) introduced a probabilistic framework formalizing calibration and hallucination, and showed that, with high probability, calibrated LLMs hallucinate roughly at the rate of the "missing mass", a measure of how incomplete the training data is relative to its source. This raises two fundamental questions: (i) what property of a calibrated LLM makes hallucinations unavoidable? and (ii) can hallucinations be avoided by giving up calibration? We answer these questions by introducing a simpler property we call innovation that measures the tendency of a model to produce outputs outside the training data. We show that innovation is implied by the condition for hallucination identified by Kalai and Vempala, and, further, that it is an almost characterization of hallucination: hallucination implies innovation, and conversely, innovation implies hallucination with high probability. We also provide lower bounds on the hallucination rate based on the "innovation rate", and by relating innovation rate back to missing mass, we obtain new hallucination rate lower bounds based on missing mass that extend the results of Kalai and Vempala.
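The "missing mass" mentioned above is a standard quantity in statistics: the total probability of items that never appeared in the sample. A common way to estimate it is the Good-Turing estimator, the fraction of observations that are singletons. The sketch below illustrates that estimator only; whether the paper uses exactly this form is not stated in the abstract, and the function name is hypothetical.

```python
from collections import Counter


def good_turing_missing_mass(samples):
    """Good-Turing estimate of the missing mass of a sample.

    Returns the fraction of observations whose value occurs exactly once,
    which estimates the probability that a fresh draw is a value not seen
    in `samples`.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(samples)
```

For example, in the sample `["a", "a", "b", "c"]` two of the four observations are singletons, so the estimate is 0.5.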

2605.26802 2026-05-27 cs.LG

PATE-TabTransGAN: Differentially Private Synthetic Tabular Data Generation via Transformer-Based Student Discrimination

PATE-TabTransGAN:基于Transformer学生鉴别的差分隐私合成表格数据生成

M. Youssef, M. Woźniak

发表机构 * Wrocław University of Science and Technology(弗罗茨瓦夫科技大学)

AI总结 提出PATE-TabTransGAN框架,结合教师集成私有聚合(PATE)机制与基于Transformer的学生鉴别器,在正式差分隐私保证下生成高质量合成表格数据,并在四个基准数据集上取得最优或并列最优的AUROC。

Comments 16 pages, 3 figures, 4 tables. Submitted for publication

详情
AI中文摘要

在正式差分隐私保证下生成高保真合成表格数据仍然是一个开放挑战。提供强理论保护的方法通常牺牲了真实合成所需的特征间依赖建模,而擅长捕获复杂列关系的架构仅提供经验隐私保证。我们提出PATE-TabTransGAN,一个生成框架,将教师集成私有聚合(PATE)机制与基于Transformer的学生鉴别器相结合,以共同满足这两个要求,并采用GNMax RDP会计进行数值稳定的隐私核算。在不相交分区上训练的Logistic回归教师集成通过噪声聚合标签监督学生,残差生成器针对这个差分隐私学生进行优化,通过后处理继承正式的(ε, δ)-DP保证。将PATE-TabTransGAN与PATE-GAN、DP-GAN和DP-CTGAN(被认为是差分隐私表格合成的最先进方法)进行比较。在四个表格基准(Adult、Breast、Cardio、Cervical)上进行的实验证实了所提方法的高质量:PATE-TabTransGAN在所有四个数据集上达到最佳或并列最佳的AUROC。在AUCPR上,它在Cardio上与最强基线持平,在Cervical上领先,在Breast上落后;在Adult上,我们证明AUCPR对正类惯例高度敏感,观察到的差距与评估流程之间的惯例差异一致,而非合成缺陷。

英文摘要

Generating high-fidelity synthetic tabular data under formal differential privacy guarantees remains an open challenge. Methods that provide strong theoretical protection typically sacrifice the modeling of inter-feature dependencies required for realistic synthesis, while architectures that excel at capturing complex column relationships offer only empirical privacy guarantees. We present PATE-TabTransGAN, a generative framework that integrates the Private Aggregation of Teacher Ensembles (PATE) mechanism with a Transformer-based student discriminator to jointly address both requirements, and employs a GNMax RDP accountant for numerically stable privacy accounting. An ensemble of Logistic Regression teachers trained on disjoint partitions supervises the student via noisy-aggregated labels, and a residual generator is optimized against this differentially private student, inheriting formal (ε, δ)-DP guarantees by post-processing. PATE-TabTransGAN was compared with PATE-GAN, DP-GAN, and DP-CTGAN, considered state-of-the-art in differentially private tabular synthesis. Experiments conducted on four tabular benchmarks (Adult, Breast, Cardio, Cervical) confirmed the high quality of the proposed method: PATE-TabTransGAN attains the best or tied-best AUROC on all four datasets. On AUCPR it matches the strongest baseline on Cardio, leads on Cervical, and trails on Breast; on Adult, we demonstrate that AUCPR is highly sensitive to positive-class convention, and that the observed gap is consistent with a convention difference between evaluation pipelines rather than a synthesis deficit.
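The noisy label aggregation step can be illustrated with a small sketch of Gaussian noisy-max voting (GNMax), as it is commonly described in the PATE literature: each teacher votes for a class, Gaussian noise is added to the per-class vote counts, and the class with the largest noisy count is released. This is an illustration under that assumption, not the paper's implementation; the function name and signature are hypothetical, and no privacy accounting is performed here.

```python
import random


def gnmax_label(teacher_votes, num_classes, sigma, rng=None):
    """Aggregate teacher votes with Gaussian noisy-max.

    teacher_votes : iterable of class indices, one per teacher
    num_classes   : number of classes
    sigma         : standard deviation of the Gaussian noise added to each count
    rng           : optional random.Random instance for reproducibility
    """
    rng = rng or random.Random()
    counts = [0] * num_classes
    for vote in teacher_votes:
        counts[vote] += 1
    # Perturb every count independently, then release only the argmax.
    noisy = [c + rng.gauss(0.0, sigma) for c in counts]
    return max(range(num_classes), key=lambda k: noisy[k])
```

A larger `sigma` gives stronger privacy at the cost of more label noise for the student; with `sigma = 0` the function reduces to a plain majority vote.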

2605.26801 2026-05-27 cs.CL

Psychological Constructs in Shared Semantic Space

共享语义空间中的心理构念

Hubert Plisiecki

发表机构 * IDEAS Research Institute(IDEAS研究院)

AI总结 本文提出一个框架,通过将心理构念表示为共享词嵌入空间中的方向,并使用监督语义微分从文本-结果关联中估计构念特定的语义梯度,从而实现跨不同测量工具和研究传统的心理构念的语义可比性。

详情
AI中文摘要

心理构念通常在不同的测量工具、数据集和研究传统中进行测量,这使得直接比较变得困难。本文提出了一个框架,通过将心理构念表示并比较为共享词嵌入空间中的方向,使这些构念在语义上具有可比性。使用监督语义微分,我们从文本-结果关联中估计构念特定的语义梯度,并将其投影到理论驱动的参考轴上。作为初始测试案例,我们使用效价、唤醒度和支配度(VAD)作为情感坐标系。首先,我们从英语词汇级情感规范中恢复可解释的VAD方向。其次,我们将27个GoEmotions类别的语义梯度投影到该空间中,并恢复预期的情感组织,特别是在效价和唤醒度维度上。第三,我们将相同程序应用于源自IPIP-NEO-300项目-因子关联的大五人格领域和子域。领域层面的定位大体一致,而子域层面的结果更具探索性,因为它们依赖于稀疏的问卷文本。结果表明,只要语义定位的稳定性和可解释性得到评估,嵌入空间可以支持在其他不可比较的心理测量之间进行构念层面的比较。

英文摘要

Psychological constructs are often measured in separate instruments, datasets, and research traditions, which makes direct comparison difficult. This paper proposes a framework for making such constructs semantically commensurate by representing and comparing them as directions in a shared word-embedding space. Using Supervised Semantic Differential, we estimate construct-specific semantic gradients from text-outcome associations and project them onto theoretically motivated reference axes. As an initial test case, we use Valence, Arousal, and Dominance (VAD) as an affective coordinate system. First, we recover interpretable VAD directions from English word-level affective norms. Second, we project semantic gradients for 27 GoEmotions categories into this space and recover the expected organization of emotions, especially along valence and arousal. Third, we apply the same procedure to Big Five personality domains and facets derived from IPIP-NEO-300 item-factor associations. Domain-level placements are broadly coherent, while facet-level results are more exploratory because they rely on sparse questionnaire text. The results suggest that embedding spaces can support construct-level comparison across otherwise incommensurable psychological measurements, provided that semantic placements are assessed for stability and interpretability.

2605.26797 2026-05-27 cs.LG cs.CL

Latent Recurrent Transformer: Architecture Exploration, Training Strategies, and Scaling Behavior

潜在循环Transformer:架构探索、训练策略与扩展行为

Zeyi Huang, Xuehai He, LiLiang Ren, Yiping Wang, Baolin Peng, Hao Cheng, Shuohang Wang, Pengcheng He, Jianfeng Gao, Yong Jae Lee, Yelong Shen

发表机构 * Microsoft(微软公司) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) University of Washington(华盛顿大学)

AI总结 提出潜在循环Transformer(LRT),通过跨层循环潜在路径重用前一token的高层隐藏状态作为记忆,在不增加暂停token或额外深度循环的情况下,以约2倍基线计算实现并行训练,在匹配有效计算下提升语言建模损失和上下文学习能力,仅增加0.3%参数。

详情
AI中文摘要

我们研究潜在循环Transformer(LRT),一种自回归Transformer的轻量级增强,它重用来自前一个token的高层源层隐藏状态作为下一个token的循环记忆。由于该源状态在普通解码过程中已经计算,LRT跨位置添加跨层循环潜在路径,无需插入暂停token或额外深度循环,并且保留了标准注意力机制和KV-cache接口。为了在不顺序展开Transformer的情况下大规模预训练这种循环,我们引入了交错并行训练:一次完整的全序列初始化前向传播构建共享缓冲区;然后不相交的位置子集并行细化并写回,使得所有token在约2倍基线计算下获得循环记忆感知的监督。在nanochat风格的主干网络和广泛的每参数token预算范围内,LRT在匹配有效计算下改进了语言建模损失和上下文学习,同时仅增加0.3%的参数。

英文摘要

We study Latent Recurrent Transformer (LRT), a lightweight augmentation of autoregressive transformers that reuses a high-level source-layer hidden state from the previous token as recurrent memory for the next token. Because this source state is already computed during ordinary decoding, LRT adds a cross-layer recurrent latent pathway across positions without inserting pause tokens or extra depth loops, and the standard attention mechanism and KV-cache interface are preserved. To pretrain this recurrence at scale without sequentially unrolling the transformer, we introduce interleaved parallel training: a single full-sequence initialization forward pass builds a shared buffer; then disjoint position subsets are refined in parallel and written back, so that all tokens receive recurrent-memory-aware supervision at roughly 2 times baseline compute. Across nanochat style backbones and a wide range of tokens-per-parameter budgets, LRT improves both language-modeling loss and in-context learning under matched effective compute while adding as little as 0.3% parameters.

2605.26795 2026-05-27 cs.AI

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

链式思维在探测时为何有效?局部共现而非全局推导

Xiang Wang, Wei Wei

发表机构 * Huazhong University of Science and Technology(华中科技大学)

AI总结 研究链式思维提示在探测时提升语言模型准确率的原因,发现增益主要来自词汇激活和短距离标记共现,而非句子级逻辑推导。

详情
AI中文摘要

链式思维提示可靠地提高了语言模型的准确性,但推理文本的哪些属性驱动了这种改进尚不清楚。先前的工作主要研究生成本身的行为。我们转而提出一个探测时问题:给定上下文中的固定推理文本,该文本中的什么改变了答案?我们确定了增益的两个互补来源。首先,即使是全局词序打乱的推理文本也显著优于无推理基线,表明存在强烈的词汇激活效应。更重要的是,结构化文本带来的额外增益似乎较少来自句子级的逻辑排序,而更多来自短距离标记邻接。保留仅n* = 2至3个标记的连续窗口即可恢复向完整链式思维性能的大部分剩余增益。支持性实验排除了显式答案声明或答案值的复制以及完整的语法实现作为主要驱动因素。进一步的泛化实验表明,这种定性模式在多个模型家族、参数规模和数据集上保持稳定。这些结果支持探测时链式思维的局部共现激活解释,其中观察到的增益主要来自词汇激活和短距离标记共现,而非句子级逻辑推导。

英文摘要

Chain-of-thought (CoT) prompting reliably improves language-model accuracy, but which properties of a rationale text drive the improvement is poorly understood. Prior work has largely studied generation-time behavior. We instead ask a probe-time question: given a fixed rationale in context, what in that text changes the answer? We identify two complementary sources of the gain. First, even a globally word-shuffled rationale substantially outperforms the no-rationale baseline, indicating a strong lexical activation effect. More importantly, the additional gain from structured text appears to arise less from sentence-level logical ordering and more from short-range token adjacency. Preserving contiguous windows of just n* = 2 to 3 tokens recovers most of the remaining gain toward full CoT performance. Supporting experiments rule out copying of explicit answer declarations or answer values, as well as full grammatical realization, as primary drivers. Further generalization experiments show that the qualitative pattern remains stable across multiple model families, parameter scales, and datasets. These results support a local co-occurrence activation (LCA) account of probe-time CoT, in which the observed gains appear to arise primarily from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.
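One way to picture the window-preserving perturbation is to cut the rationale into contiguous chunks of n tokens and shuffle the chunks while keeping the order inside each chunk. The abstract does not say how the windows are constructed, so the non-overlapping chunking below is an assumption, and the function name is hypothetical.

```python
import random


def shuffle_keep_windows(tokens, n, rng=None):
    """Shuffle a token list globally while keeping n-token windows intact.

    The list is split into consecutive, non-overlapping chunks of length n
    (the last chunk may be shorter); chunk order is randomized, token order
    inside each chunk is preserved.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = rng or random.Random()
    windows = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    rng.shuffle(windows)
    return [tok for window in windows for tok in window]
```

With `n = 1` this is a plain global word shuffle, and with `n` equal to the rationale length the text is left unchanged, so varying `n` interpolates between the two conditions the abstract compares.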

2605.26789 2026-05-27 cs.AI

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

组合崩溃:稳定的事实知识并不意味着组合推理

Zhe Yu, Wenpeng Xing, Yunzhao Wei, Jie Chen, Hongzhi Wang, Xuyang Teng, Meng Han

发表机构 * Zhejiang University(浙江大学) Binjiang Institute of Zhejiang University(浙江大学滨江研究院) Hong Kong Baptist University(香港浸会大学) Harbin Institute of Technology(哈尔滨工业大学) Hangzhou Dianzi University(杭州电子科技大学)

AI总结 本文提出组合崩溃现象,即模型在稳定掌握原子事实的情况下仍无法将其组合成链式推理,并通过双门控协议分解后训练增益,揭示聚合指标掩盖的组合能力变化。

详情
AI中文摘要

后训练通常通过聚合基准分数来评估,这些分数将多跳推理视为单一能力——仿佛回答更多问题的模型必然更擅长组合事实。我们表明这种假设可能具有误导性:在统计上无法区分的原子知识配方下,组合行为差异超过40个百分点,我们将这种现象称为组合崩溃:即系统性地无法将稳定已知的事实组合成链,而这种失败对聚合指标不可见。我们引入双门控协议,将估计量从聚合组合性差距转变为基于稳定原子访问的残差组合失败,将后训练收益分解为三个独立通道:原子稳定性、残差组合和关键深度。在一个涵盖深度2-11的时序事实链基准上,对四种后训练配方进行分解,揭示了后训练目标以聚合指标掩盖的方式改变组合能力,并表明关于多跳推理改进的主张应伴随原子门控控制的组合指标。诊断探针进一步显示,测量到的组合失败中相当一部分反映了生成时的计算约束,而非永久性的组合能力缺失。

英文摘要

Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability -- as if a model that answers more questions correctly must be better at assembling facts. We show that this assumption can be misleading: recipes with statistically indistinguishable atomic knowledge produce composition behaviour separated by over 40 percentage points, a phenomenon we call composition collapse: the systematic failure to assemble stably-known facts into chains, invisible to aggregate metrics. We introduce a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access, decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2--11 across four post-training recipes, this decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask, and suggests that claims about multi-hop reasoning improvement should be accompanied by atomic-gate-controlled composition metrics. Diagnostic probes further show that a substantial share of measured composition failure reflects generation-time computation constraints rather than permanent inability to compose.
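The idea of conditioning composition failure on stable atomic access can be sketched as a gated failure rate: only chains whose constituent facts are answered correctly on every trial are counted, and the composition failure rate is computed over that subset. The record layout (`atomic_trials`, `composed_correct`) and the "correct on every trial" stability criterion are assumptions made for illustration; the abstract does not specify the protocol's details.

```python
def residual_composition_failure(chains):
    """Composition failure rate among chains with stably known atomic facts.

    Each chain is a dict with:
      "atomic_trials"    : list of per-fact lists of booleans (repeated probes)
      "composed_correct" : whether the multi-hop question was answered correctly
    Returns None when no chain passes the atomic gate.
    """
    # Gate: keep only chains where every fact was correct on every probe.
    gated = [
        c for c in chains
        if all(all(trials) for trials in c["atomic_trials"])
    ]
    if not gated:
        return None
    failures = sum(1 for c in gated if not c["composed_correct"])
    return failures / len(gated)
```

Chains that fail because a single fact is unknown are excluded by the gate, so the remaining failures cannot be attributed to missing atomic knowledge.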

2605.26788 2026-05-27 cs.CL cs.AI

SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability

SeDT: 基于句子变换器的决策变换器条件化用于多轮对话可靠性

Ramakrishna Vamsi Setti, Jagadeesh Rachapudi, Sachin Chaudhary, Praful Hambarde, Amit Shukla

发表机构 * Independent Researcher(独立研究者) Drone Lab, IIT Mandi(IIT曼迪无人机实验室) UPES, Dehradun(德拉敦UPES)

AI总结 针对大语言模型在多轮对话中性能下降的问题,提出一种无需训练和额外数据的推理方法SeDT,通过引入离线强化学习中的return-to-go条件化,利用语义、词汇和位置信号计算累积相关性得分并注释对话历史,显著提升模型性能并降低不可靠性。

详情
AI中文摘要

大语言模型(LLMs)在单轮任务完全指定时表现令人印象深刻,但当相同任务在多轮中逐步揭示时,同一模型性能下降高达39%,这一现象在规模上被记录为“迷失在对话中”。关键的是,这种崩溃几乎完全是可靠性失败;最佳情况下,能力仅下降16%,而不可靠性增加超过一倍(+112%)。我们认为根本原因是结构性的:扁平化的对话历史对每个先前轮次赋予相等隐式权重,使模型无法区分关键约束与无关对话。我们提出SeDT(句子变换器-决策变换器),一种无需训练的推理时方法,通过从离线强化学习中引入return-to-go条件化来解决此问题。SeDT使用来自三种互补信号(语义、词汇和位置)的累积相关性得分注释每个对话片段,并在最后一轮向模型呈现完整的注释历史,无需权重更改、无需训练数据、无需丢弃上下文。在三个LLM和三个生成任务的Lost-in-Conversation基准上评估,SeDT在所有九个模型-任务组合中均优于分片基线,平均性能P提升高达+37.7%,同时在九个组合中的七个中降低了不可靠性。简而言之,告诉模型哪些过去的轮次重要足以显著恢复对话中丢失的性能。

英文摘要

Large language models (LLMs) achieve impressive performance when a task is fully specified in a single turn, yet the same models lose up to 39% of that performance when the identical task is revealed incrementally across multiple turns, a phenomenon documented at scale as Lost in Conversation. Crucially, this collapse is almost entirely a reliability failure: in the best case, aptitude falls by only 16%, while unreliability more than doubles (+112%). We argue that the root cause is structural: a flat conversation history assigns equal implicit weight to every prior turn, giving the model no signal to distinguish a critical constraint from incidental dialog. We present SeDT (Sentence-transformer Decision-Transformer), a training-free inference-time method that resolves this by importing return-to-go conditioning from offline reinforcement learning. SeDT annotates each conversation shard with a cumulative relevance score derived from three complementary signals (semantic, lexical, and positional) and presents the full annotated history to the model at the final turn, without weight changes, without training data, and without discarding context. Evaluated on the Lost-in-Conversation benchmark across three LLMs and three generation tasks, SeDT outperforms the sharded baseline in all nine model-task combinations, with gains up to +37.7% in mean performance P and simultaneous reductions in unreliability in seven of the nine combinations. In short, telling the model which past turns matter is sufficient to substantially recover the performance lost in conversation.
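In offline reinforcement learning, the return-to-go at a step is the sum of rewards from that step to the end of the trajectory. A rough sketch of how that idea could carry over to conversation shards is shown below, treating a per-shard relevance score as the reward. How SeDT combines its semantic, lexical, and positional signals is not given in the abstract, so the relevance values are taken as inputs, and the annotation format and function names are hypothetical.

```python
def returns_to_go(relevance):
    """Suffix sums of per-shard relevance scores.

    Element i is the total relevance from shard i through the last shard.
    """
    rtg = []
    running = 0.0
    for score in reversed(relevance):
        running += score
        rtg.append(running)
    rtg.reverse()
    return rtg


def annotate_shards(shards, relevance):
    """Prefix each shard with its return-to-go score (hypothetical format)."""
    return [
        f"[rtg={g:.2f}] {text}"
        for text, g in zip(shards, returns_to_go(relevance))
    ]
```

The annotated shards would then be concatenated into the final-turn prompt, so the full history is kept and the model weights are untouched.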

2605.26785 2026-05-27 cs.CL cs.AI

EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation

EmoDistill: 对抗性谈判中语言模型代理的离线情感技能蒸馏

Yunbo Long, Haolang Zhao, Lukas Beckenbauer, Liming Xu, Alexandra Brintrup

发表机构 * University of Cambridge(剑桥大学) Technical University of Munich(慕尼黑工业大学) Exiger LLC The Alan Turing Institute(艾伦·图灵研究所)

AI总结 提出EmoDistill离线框架,通过隐式Q学习选择情感和低秩适应策略表达情感,蒸馏情感谈判技能到语言模型代理,在四个高风险谈判领域取得最高效用。

详情
AI中文摘要

后训练的LLM通常被优化以对齐响应与人类偏好,使其安全、礼貌且适合对话。然而,在对抗性谈判中,这种对齐可能成为漏洞:情感框架语言可能引导代理朝向对手方利益。使用基于GoEmotions的情感提示,我们表明情感显著改变谈判结果,表明情感是战略行动渠道而非表面风格。因此,我们引入EmoDistill,一个用于将情感谈判技能蒸馏到语言模型代理中的离线框架。EmoDistill将情感策略分解为情感选择和情感表达:隐式Q学习(IQL)选择器学习表达哪种情感,而基于低秩适应(LoRA)的策略通过监督微调(SFT)和裁判策略优化(JPO)学习如何表达它。在四个情感敏感、高风险的谈判领域,在EmoDistill框架下训练的SLM策略实现了最高效用,优于普通SLM/LLM基线和仅IQL情感选择。消融实验表明情感条件化是必要的,迁移研究展示了跨领域、未见对手和训练对训练锦标赛的泛化能力。总体而言,EmoDistill从离线代理间交互中学习技能,避免了训练期间昂贵的在线谈判。

英文摘要

Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability: emotionally framed language may steer agents toward the counterparty's interests. Using GoEmotions-based affective prompting, we show that emotion substantially shifts negotiation outcomes, suggesting that emotion is a strategic action channel rather than a surface style. Thus, we introduce EmoDistill, an offline framework for distilling emotional negotiation skills into language model agents. EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns which emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns how to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO). Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection. Ablations show that emotion conditioning is essential, and transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments. Overall, EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training.

2605.26784 2026-05-27 cs.LG cs.AI

Ratio-Variance Regularized Policy Optimization

比率方差正则化策略优化

Yu Luo, Shuo Han, Yihan Hu, Lei Lv, Huaping Liu, Fuchun Sun, Jianye Hao, Dong Li

发表机构 * Department of Foundation Model, 2012 Labs, Huawei(华为基础模型部门,2012实验室) Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University(上海智能自主系统研究院,同济大学) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) College of Intelligence and Computing, Tianjin University(天津大学智能与计算学院)

AI总结 提出R²VPO方法,通过约束策略比率方差作为信任区域的局部近似,替代启发式裁剪,在LLM和机器人控制任务中提升性能与样本效率。

详情
AI中文摘要

标准的同策略强化学习依赖启发式裁剪来强制信任区域,但这种机制通过不加区分地截断高回报但高散度的更新而施加了严重代价。我们证明,显式约束策略比率方差为信任区域约束提供了原则性的局部近似,消除了二元硬裁剪的需要。通过作为分布式的“软刹车”,这种方法保留了来自新颖发现的关键梯度信号,同时自然降低权重并允许重用陈旧的离策略数据。我们引入了R²VPO(比率方差正则化策略优化),它通过原始-对偶优化框架实现这一约束。在跨越快速和慢速推理范式的7个LLM规模以及10个机器人控制任务上的广泛评估证明了所提出方法的通用性。R²VPO在数学推理基准上取得了显著的性能提升,特别是在较小模型上改进尤为明显,同时显著提高了样本效率。此外,它在连续控制领域(特别是稀疏奖励和动态环境)中始终优于PPO基线。这些发现共同确立了比率方差正则化作为稳定且数据高效策略优化的原则性基础。

英文摘要

Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost by indiscriminately truncating high-return yet high-divergence updates. We demonstrate that explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints, eliminating the need for binary hard clipping. By acting as a distributional "soft brake", this approach preserves critical gradient signals from novel discoveries while naturally down-weighting and enabling the reuse of stale, off-policy data. We introduce R²VPO (Ratio-Variance Regularized Policy Optimization), which implements this constraint via a primal-dual optimization framework. Extensive evaluations across 7 LLM scales, spanning both fast and slow reasoning paradigms, and 10 robotic control tasks demonstrate the generality of the proposed approach. R²VPO achieves substantial performance gains on mathematical reasoning benchmarks, with particularly pronounced improvements on smaller models, while significantly improving sample efficiency. Furthermore, it consistently outperforms PPO baselines in continuous control domains, particularly in sparse-reward and dynamic environments. Together, these findings establish ratio-variance regularization as a principled foundation for stable and data-efficient policy optimization.
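The contrast with hard clipping can be made concrete with a small sketch of an unclipped importance-weighted surrogate penalized by the variance of the policy ratio. This is a simplified illustration, not the paper's objective: a fixed multiplier `lam` stands in for the dual variable that a primal-dual scheme would adapt, and the function name and exact form of the penalty are assumptions.

```python
import math


def ratio_variance_objective(logp_new, logp_old, advantages, lam):
    """Surrogate objective with a ratio-variance penalty (to be maximized).

    logp_new, logp_old : per-sample log-probabilities under the new and old policy
    advantages         : per-sample advantage estimates
    lam                : penalty weight (stand-in for a learned dual variable)
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    m = len(ratios)
    # Unclipped importance-weighted surrogate: no sample is truncated.
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / m
    # Population variance of the ratios acts as a soft trust-region penalty.
    mean_ratio = sum(ratios) / m
    variance = sum((r - mean_ratio) ** 2 for r in ratios) / m
    return surrogate - lam * variance
```

When the new and old policies agree, every ratio is 1, the penalty vanishes, and the objective is just the mean advantage; the penalty grows smoothly as the ratios spread out rather than cutting off individual samples.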

2605.26782 2026-05-27 cs.RO cs.HC

Manipulating Tangible Virtual Object Dynamics to Promote Learning of Precision Force Generation

操控有形虚拟物体动力学以促进精确力生成的学习

Alberto Garzás-Villar, Alba Riera-Cardona, Alexis Derumigny, J. Micah Prendergast, Jane Murray Cramm, Laura Marchal-Crespo

发表机构 * Department of Cognitive Robotics, Delft University of Technology, Delft, 2628 CD, The Netherlands; Department of Socio-medical Sciences, Erasmus School of Health Policy & Management, Erasmus University Rotterdam, Rotterdam, 3062 PA, The Netherlands; Department of Applied Mathematics, Delft University of Technology, Delft, 2628 CD, The Netherlands; Department of Rehabilitation Medicine, Erasmus Medical Center, Rotterdam, 3015 GD, The Netherlands

AI总结 本研究提出通过操控有形虚拟物体的动力学(线性、高斯或反对称高斯弹簧模型)来训练精确力控制,实验表明反对称高斯组在训练中力精度最高,但长期保留无显著差异,且参与者主要依赖学习到的目标伸长而非目标力。

详情
AI中文摘要

机器人触觉设备结合虚拟现实为训练精细力生成提供了新机会,这是中风后康复中重要但常被忽视的部分。本研究提出,操控有形虚拟物体的渲染动力学可用于训练精确力控制,同时激活体感系统。我们进行了一项实验,50名健康参与者执行一项类似冰壶的任务,他们必须拉伸虚拟弹簧以产生目标释放力,将石头推至冰面上预定义位置。在训练中,弹簧的力-伸长关系被建模为线性或非线性函数,即高斯或反对称高斯函数,在释放目标力处导数为零。结果表明,反对称高斯组在训练中始终比线性组获得更高的力精度,而高斯组仅在训练后期优于线性组。人格特质分析显示,在高斯动力学下,更高的自由精神得分与较差的表现和减少的任务探索相关,而更高的挑战转化得分与增加的探索相关。尽管存在这些训练效应,但在不同弹簧类型或人格特质之间,长期保留没有显著差异。参与者主要依赖学习到的目标伸长而非目标力,这通过在不同刚度但相同目标力的转移任务中的表现得以证实。虽然这些方法对体感神经康复有前景,但在对神经疾病患者进行测试之前,需要改进以减少对本体感觉线索的依赖。

英文摘要

Robotic haptic devices combined with virtual reality offer novel opportunities to train fine force generation, an essential yet overlooked component of post-stroke rehabilitation. This study proposes that manipulating the rendered dynamics of tangible virtual objects can be leveraged to train precise force control while engaging the somatosensory system. We conducted an experiment with fifty healthy participants who performed a curling-inspired task in which they had to stretch a virtual spring to generate a target release force to propel the stone to a predefined location on the ice sheet. During training, the spring's force-elongation relationship was modeled as either a linear or non-linear function, i.e., a Gaussian or antisymmetric Gaussian (AS-Gaussian) function with zero derivative at the release target force. Results indicate that the AS-Gaussian group consistently achieved higher force accuracy during training than the linear group, while the Gaussian group only outperformed the linear group toward the end of training. Analysis of personality traits revealed that higher Free Spirit scores were associated with poorer performance and reduced task exploration under Gaussian dynamics, whereas higher Transform-of-Challenge scores correlated with increased exploration. Despite these training effects, no significant differences in long-term retention were found across spring types or personality traits. Participants primarily relied on learned target elongation rather than target force, as evidenced by performance in a transfer task with a different stiffness but the same target force. While promising for somatosensory neurorehabilitation, these methods require refinement to reduce reliance on proprioceptive cues before testing with neurological patients.

2605.26781 2026-05-27 cs.AI cs.MM

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

LiveK12Bench: 大型多模态模型真的征服了高中水平的考试吗?

Xiaohan Wang, Mingze Yin, Yilin Zhao, Gang Liu, Dian Li

发表机构 * Tencent PCG(腾讯PCG) College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院)

AI总结 本文提出动态多学科基准LiveK12Bench,通过自动化流水线和新颖的模拟考试评估方案,揭示大型多模态模型在真实考试场景下性能显著下降,尤其对复杂视觉布局敏感。

详情
AI中文摘要

先进的大型多模态模型(LMMs)在K-12推理任务中展示了令人印象深刻的表现,展现出作为智能导师的巨大潜力。实现这一潜力需要模型有效应对真实世界的考试,但大多数现有基准未能捕捉真实考试环境的复杂性。具体来说,大多数数据集是静态的,容易受到数据污染,并且通常局限于受限的模态、学科和评估标准。为了解决这些问题,我们引入了LiveK12Bench,这是一个动态、全面、多学科的基准,旨在评估LMMs在真实考试场景中的推理能力。LiveK12Bench包含2000多道经过验证的题目,涵盖数学、物理、化学和生物,来源于最新的真实考试试卷,并设计为随时间增长。我们的框架具有几个核心创新:1)采用自动化流水线,持续摄取和解析最新考试试卷以减轻数据泄露;2)提出一种新颖的“模拟考试”评估方案,评估模型自主完成端到端考试并具有准确高效推理路径的能力。在12个LMMs上的大量实验表明,先进模型在考试真实约束下性能大幅下降:当过程严谨性和效率共同评估时,GPT-5的分数从79降至53(满分100)。我们的发现暴露了关键漏洞,例如对复杂视觉布局的敏感性,凸显了理想化推理能力与真正教育准备之间的差距。代码和数据集均已公开。

英文摘要

Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contamination, and often confined to restricted modalities, disciplines, and evaluation criteria. To address these issues, we introduce LiveK12Bench, a dynamic, holistic, multi-disciplinary benchmark designed to evaluate the reasoning abilities of LMMs in realistic examination scenarios. LiveK12Bench comprises 2K+ verified questions spanning Mathematics, Physics, Chemistry, and Biology, sourced from the latest real-world exam papers and designed to grow over time. Our framework features several core innovations: 1) an automated pipeline that continuously ingests and parses the latest examination papers to mitigate data leakage; and 2) a novel 'Mock Exam' evaluation scheme, which assesses the ability to complete end-to-end exams autonomously with accurate and efficient reasoning paths. Extensive experiments on 12 LMMs reveal that advanced models suffer substantial performance degradation under exam-realistic constraints: GPT-5's score drops from 79 to 53 (out of 100) when process rigor and efficiency are jointly evaluated. Our findings expose critical vulnerabilities, such as sensitivity to complex visual layouts, highlighting the gap between idealized reasoning capabilities and true educational readiness. Both code and dataset are publicly available.

2605.26778 2026-05-27 cs.AI

The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

归因盲点:检测语言模型何时依赖记忆而非检索到的上下文

Zhe Yu, Wenpeng Xing, Yunzhao Wei, Bo Yang, Chen Ye, Gaolei Li, Meng Han

Affiliations: Zhejiang University (浙江大学); Binjiang Institute of Zhejiang University (浙江大学滨江研究院); National Fintech Evaluation Center (国家金融科技评估中心); Hangzhou Dianzi University (杭州电子科技大学); Shanghai Jiao Tong University (上海交通大学)

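AI summary: This paper proposes Computational Reality Monitoring (CRM), which compares a language model's internal representations with and without context to detect whether generation relies on pretraining memory rather than retrieved context, addressing the attribution blind spot that output-level monitoring cannot identify.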

详情
AI中文摘要

检索增强生成承诺将语言模型输出锚定于外部证据,然而该领域缺乏可靠方法来验证检索到的上下文是否实际主导了生成——这是任何高风险部署的前提。标准假设(上下文一致的输出意味着上下文主导的输出)在检索到的文档与模型预训练数据重叠时失效:模型可以完全从参数化记忆中生成看似忠实的文本,且两种途径产生无法区分的输出。我们将此失败命名为归因盲点,并引入计算现实监控(CRM)来解决它。CRM 操作化了源自认知科学现实监控框架的一个原则:比较有上下文和无上下文时的内部表征,揭示了输出级监控系统系统性遗漏的基于成员条件的表征分歧。CRM 并不证明单个生成使用了哪个来源;它检测预训练暴露是否留下可测量的内部轨迹特征,从而为来源归因建立必要的基础。在跨越三个系列的九个模型变体中,这种分歧集中在架构特定的层模式中,得到块级噪声干预的汇聚支持,并在任务和数据集上泛化,而在领域混淆的基准上消失。归因盲点是可以测量且部分可解决的:内部表征携带输出级不可见的诊断信号,为系统建立基础,使其对证据来源的内部意识支配其外部行为。

English abstract

Retrieval-augmented generation promises to ground language model outputs in external evidence, yet the field has no reliable way to verify whether retrieved context actually governs generation, a prerequisite for any high-stakes deployment. The standard assumption, that context-consistent output implies context-governed output, breaks when the retrieved document overlaps with the model's pretraining data: the model can produce faithful-looking text entirely from parametric memory, and both pathways yield indistinguishable output. We name this failure the attribution blind spot and introduce Computational Reality Monitoring (CRM) to address it. CRM operationalizes a principle adapted from cognitive science's reality monitoring framework: comparing internal representations with and without context reveals membership-conditioned representational divergence that output-level monitors systematically miss. CRM does not certify which source an individual generation used; it detects whether pretraining exposure leaves a measurable internal trajectory signature, establishing a necessary substrate for source attribution. Across nine model variants spanning three families, this divergence concentrates in architecture-specific layer patterns, receives converging support from block-level noise intervention, and generalizes across tasks and datasets while collapsing on domain-confounded benchmarks. The attribution blind spot is measurable and partially addressable: internal representations carry a diagnostic signal invisible at the output level, establishing a foundation for systems whose internal awareness of evidence provenance governs their external behavior.
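The core idea of comparing internal representations with and without context can be illustrated with a toy calculation. The sketch below is not the authors' CRM implementation: the abstract does not specify the divergence measure, so cosine distance is an assumption made here, the function names `cosine_distance` and `layerwise_divergence` are hypothetical, and the hidden-state vectors are made-up numbers rather than activations from a real model.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def layerwise_divergence(with_context, without_context):
    """Per-layer distance between two runs of the same model.

    Each argument is a list with one hidden-state vector per layer
    (assumed here to be pooled to a single vector per layer).
    """
    if len(with_context) != len(without_context):
        raise ValueError("both runs must cover the same number of layers")
    return [
        cosine_distance(h_ctx, h_no_ctx)
        for h_ctx, h_no_ctx in zip(with_context, without_context)
    ]


# Toy hidden states for a 3-layer model (illustrative values only).
with_ctx = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
no_ctx = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]

profile = layerwise_divergence(with_ctx, no_ctx)
peak_layer = max(range(len(profile)), key=profile.__getitem__)
print(profile, peak_layer)
```

In this toy setup the divergence grows with depth and peaks at the last layer. A real analysis would extract hidden states from an actual model for matched prompts with and without the retrieved document, and would compare the resulting per-layer profiles across documents that were or were not part of the pretraining data.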