arXivDaily arXiv每日学术速递 周一至周五更新
2605.30245 2026-05-29 cs.CL

Knowing What to Solve Before How: Preplan Empowered LLM Mathematical Reasoning

知道在如何解决之前该解决什么:预规划赋能的大语言模型数学推理

Shaojie Wang, Liang Zhang

AI总结 提出PPC框架,通过引入显式的问题理解阶段(预规划)来弥补现有规划推理方法中“如何解决”与“该解决什么”之间的范式差距,在多个数学推理基准上取得最佳结果。

AI中文摘要

当前的基于规划的推理方法通过在执行前插入规划阶段来改进大语言模型(LLMs),形成了问题→规划→思维链的范式。虽然有效,但仔细审视发现存在固有的范式级差距:规划和执行阶段都决定了如何解决问题,而之前的问题——该解决什么,即识别问题类型、适用工具和可预见的陷阱——仍然完全隐含。为弥补这一差距,我们提出PPC(预规划-规划-思维链),一个引入显式问题理解阶段(预规划)的框架,产生了新的问题→预规划→规划→思维链范式。实现这一范式需要在两端维护预规划的概念完整性。具体地,我们设计了一个三阶段合成流程,配备一个剧透分数检测器来过滤泄漏和剧透故障,以构建干净的预规划监督,并且一个复合GRPO奖励强制生成的规划真正遵循预规划。在四个骨干模型和五个数学推理基准上的实验表明,PPC在40个指标中的39个上取得了最佳结果,在不引入额外推理令牌开销的情况下,将maj@16和pass@16分别比最强基线提高了+2.23和+3.06。

英文摘要

Current plan-based reasoning methods improve large language models (LLMs) by inserting a planning stage before execution, giving rise to the question $\rightarrow$ plan $\rightarrow$ cot paradigm. While effective, a closer examination reveals an inherent paradigm-level gap: both the planning stage and the execution stage decide how to solve a problem, while the prior question of what to solve (recognizing the problem type, the applicable tools, and the foreseeable pitfalls) remains entirely implicit. To bridge this gap, we propose PPC (Preplan-Plan-CoT), a framework that introduces an explicit problem-understanding stage, the preplan, yielding a new question $\rightarrow$ preplan $\rightarrow$ plan $\rightarrow$ cot paradigm. Realizing this paradigm requires safeguarding the conceptual integrity of the preplan at both ends. Specifically, we design a three-stage synthesis pipeline with a spoiler-score detector that filters out leakage and spoiler failures to build clean preplan supervision, and a composite GRPO reward that enforces that the generated plan genuinely follows from the preplan. Experiments across four backbones and five mathematical reasoning benchmarks show that PPC achieves the best results on 39 of 40 metrics, improving maj@16 and pass@16 by +2.23 and +3.06 over the strongest baseline without introducing additional inference token overhead.
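The abstract reports maj@16 and pass@16 but does not say how they are computed. As a minimal sketch, assuming the common per-problem definitions (pass@k: at least one of k sampled answers is correct; maj@k: the most frequent of k sampled answers is correct), the two metrics could look like this. The function names and the exact-match comparison are illustrative assumptions, not the authors' evaluation code.

```python
from collections import Counter


def pass_at_k(samples, reference):
    """Return 1.0 if any of the k sampled answers matches the reference."""
    return 1.0 if any(s == reference for s in samples) else 0.0


def maj_at_k(samples, reference):
    """Return 1.0 if the majority-vote answer matches the reference.

    Ties are broken by first occurrence, which is an assumption of this
    sketch (Counter.most_common keeps first-seen order for equal counts).
    """
    if not samples:
        return 0.0
    top_answer, _ = Counter(samples).most_common(1)[0]
    return 1.0 if top_answer == reference else 0.0


def mean_metric(metric, all_samples, all_references):
    """Average a per-problem metric over a benchmark, as a percentage."""
    scores = [metric(s, r) for s, r in zip(all_samples, all_references)]
    return 100.0 * sum(scores) / len(scores)
```

Under these definitions pass@k is never lower than maj@k on the same samples, since a correct majority answer implies at least one correct sample.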

2605.30244 2026-05-29 cs.CV cs.AI

Reinforcement Learning with Robust Rubric Rewards

基于稳健评分规则的强化学习

Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu

AI总结 针对部分可验证的视觉-语言任务,提出RLR^3方法,通过双路径执行评分规则、最小暴露策略和层次聚合,实现从任务级到准则级验证的扩展,在15个基准上平均提升4.7分。

AI中文摘要

虽然基于可验证奖励的强化学习(RLVR)对于确定性可检查的任务有效,但许多视觉-语言任务部分可验证,需要多准则监督(例如,感知细节、推理步骤和约束)。评分规则为此细粒度监督提供了自然接口,但其有效性取决于在线RL期间的执行准确性。我们提出基于稳健评分规则的强化学习($\text{RLR}^3$),将RLVR从任务级验证扩展到准则级验证。$\text{RLR}^3$通过两条执行路径路由实例特定的评分规则:LLM作为提取器与确定性验证器配对,或LLM作为裁判用于不可验证的准则。为确保忠实评分,$\text{RLR}^3$引入最小暴露策略,从提取器中屏蔽真实标签,从裁判中屏蔽图像。此外,$\text{RLR}^3$采用层次聚合,优先考虑基本准则而非附加准则,并缓解rollout组内的分数饱和。在Qwen3-VL-30B-A3B上跨15个基准评估,$\text{RLR}^3$始终优于RLVR,比基础模型提升4.7分,并超过官方instruct-to-thinking模型差距。受控审计证实,我们的确定性验证和最小暴露显著减少了可利用的假阳性。

英文摘要

While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on the execution accuracy during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), extending RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through two execution paths: an LLM-as-an-extractor paired with a deterministic verifier, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model and exceeding the official instruct-to-thinking model gap. Controlled audits confirm that our deterministic verification and minimal exposure significantly reduce exploitable false positives.

2605.30241 2026-05-29 cs.CL cs.CY cs.SI

CommunityFact: A Dynamic, Multilingual, Multi-domain Benchmark for Misinformation Detection in the Wild

CommunityFact:一个面向野外错误信息检测的动态、多语言、多领域基准

Sahajpreet Singh, Insyirah Mujtahid, Min-Yen Kan, Kokil Jaidka

AI总结 提出CommunityFact基准,通过多语言多领域声明、LLM评估和社区笔记分析,揭示封闭输入验证的挑战、网络搜索的增益以及证据选择策略的偏差。

AI中文摘要

错误信息验证越来越多地发生在公开、快速变化和多语言的在线环境中,静态基准无法全面衡量模型可靠性。我们引入了CommunityFact,一个可刷新的野外错误信息检测基准,具有三个主要目标:覆盖度、粒度和可再分发性。本版本包含5种语言和2个领域的15,992条独立声明。我们在不同的推理时能力(包括思考和网络搜索)下评估了十个LLM。我们的结果表明,封闭输入验证仍然具有挑战性,网络访问带来了最大的收益,并且启用网络的LLM的源选择策略与人类社区笔记评分者所达成的源系统性地不一致——这种差距可以通过特定模型的检索扩展或剪枝机制来缩小。我们进一步发现,跨语言-领域切片以及启用网络的系统所使用的证据生态系统存在显著差异。除了评估之外,CommunityFact将社区笔记定位为声明条件源建议器的训练信号,这可以改进对新声明的真实性验证。

英文摘要

Misinformation verification increasingly occurs in public, fast-moving, and multilingual online settings, where static benchmarks provide an incomplete measure of model reliability. We introduce CommunityFact, a refreshable benchmark for misinformation detection in the wild, with three major goals: coverage, granularity, and redistributability. This release contains 15,992 standalone claims across five languages and two domains. We evaluate ten LLMs under varying inference-time capabilities, including thinking and web search. Our results show that closed-input verification remains challenging, web access yields the largest gains, and web-enabled LLMs' source-selection policies are systematically misaligned with the sources human Community Notes raters converge on, a gap that closes through model-specific mechanisms of retrieval expansion or pruning. We further find substantial variation across language-domain slices and across the evidence ecosystems used by web-enabled systems. Beyond evaluation, CommunityFact positions Community Notes as a training signal for claim-conditioned source suggesters that could improve factual verification on novel claims.

2605.30239 2026-05-29 cs.CV

SAM3D-Phys: Towards Multi-Object Interactive Simulation in Real World

SAM3D-Phys:迈向真实世界中的多物体交互仿真

Xin Dong, Weijian Deng, Lihan Zhang, Tianru Dai, Wenfeng Deng, Yansong Tang

AI总结 提出SAM3D-Phys框架,结合场景重建与SAM3D生成式先验,从部分观测中恢复完整可仿真物体几何,并通过物理约束优化和掩码引导外观蒸馏实现场景一致性,支持多物体同时交互仿真。

Comments 23 pages, 11 figures

AI中文摘要

这项工作解决了从重建的真实世界场景中恢复完整、可仿真的物体几何的问题,使得与场景中嵌入的物体进行基于物理的交互成为可能。虽然现代多视图重建方法可以产生视觉上准确的环境,但由于遮挡和有限的观测,物体往往不完整,因此不适合物理仿真。为了解决这一局限性,我们提出了SAM3D-Phys,一个将场景重建与SAM3D的生成式3D先验相结合以恢复可物理仿真的物体的框架。我们的方法首先从多视图图像重建场景,获得场景几何和物体的部分观测。然后,我们利用SAM3D从这些部分观测中推断出完整的物体几何。为了确保恢复的物体与重建场景保持一致,我们通过两种互补策略恢复场景一致的物体状态:一种物理约束的空间优化算法,迭代地将恢复的物体对齐到其原始位置;以及一种掩码引导的外观蒸馏模块,基于观测图像细化纹理保真度。通过恢复完整的物体几何并在场景中恢复其姿态和外观,SAM3D-Phys产生了适用于基于物理仿真的干净物体表示,使得在重建场景中能够对多个物体进行同时且物理一致的交互仿真。项目页面:https://chnxindong.github.io/sam3d-phys/

英文摘要

This work addresses the problem of recovering complete, simulatable object geometry from reconstructed real-world scenes, enabling physics-based interaction with objects embedded in the scene. While modern multi-view reconstruction methods can produce visually accurate environments, objects are often incomplete due to occlusions and limited observations, making them unsuitable for physics simulation. To address this limitation, we propose SAM3D-Phys, a framework that integrates scene reconstruction with generative 3D priors of SAM3D to recover physically simulatable objects. Our approach first reconstructs the scene from multi-view images to obtain scene geometry and partial observations of objects. We then leverage SAM3D to infer complete object geometry from these partial observations. To ensure that the recovered objects remain consistent with the reconstructed scene, we restore scene-consistent object states through two complementary strategies: a physics-constrained spatial optimization algorithm that iteratively aligns the recovered object to its original location, and a mask-guided appearance distillation module that refines texture fidelity based on the observed images. By recovering complete object geometry and restoring its pose and appearance within the scene, SAM3D-Phys produces clean object representations suitable for physics-based simulation, enabling simultaneous and physically consistent interactive simulation of multiple objects within a reconstructed scene. Project page: https://chnxindong.github.io/sam3d-phys/

2605.30235 2026-05-29 cs.CV

BullingerDB: A Dataset for Handwritten Text Recognition and Writer Retrieval

BullingerDB:用于手写文本识别和作者检索的数据集

Marco Peer, Anna Scius-Bertrand, Patricia Scheurer, Andreas Fischer

AI总结 提出一个基于Heinrich Bullinger书信的大规模历史文档数据集BullingerDB,用于手写文本识别和作者检索,并引入时间感知的nDCG指标评估检索性能。

Comments Accepted for presentation at ICDAR2026. Dataset available via zenodo

AI中文摘要

我们提出了BullingerDB,这是一个基于Heinrich Bullinger(1504-1575)书信的大规模历史文档分析基准数据集。该语料库包含由796位作者在六十年间书写的20,898页和499,222行文本,具有风格变化、多语言内容(主要是拉丁语和早期新高地德语)以及作者身份和时间等元信息。我们在文本识别和作者检索上评估了BullingerDB。表现最佳的模型TrOCR实现了9.1%的字符错误率(CER)。对于作者检索,我们引入了一个时间感知的nDCG指标来评估时间感知检索。虽然可以实现时间连贯的检索,但mAP(78.3%)分数表明由于长期风格变化而存在挑战。通过BullingerDB,我们旨在为多语言历史文本识别和时间感知的作者分析建立一个新的基准。

英文摘要

We present BullingerDB, a large-scale benchmark dataset for historical document analysis based on the correspondence of Heinrich Bullinger (1504-1575). The corpus comprises 20,898 pages and 499,222 text lines written by 796 writers over six decades, featuring stylistic variation, multilingual content (mostly Latin and Early New High German), and meta-information such as writer identity and time. We evaluate BullingerDB on text recognition and writer retrieval. TrOCR, the best-performing model, achieves a character error rate (CER) of 9.1%. For writer retrieval, we introduce a temporal nDCG metric to assess time-aware retrieval. While temporally coherent retrieval is achievable, the mAP score (78.3%) indicates challenges due to long-term stylistic variation. With BullingerDB, we aim to establish a new benchmark for multilingual historical text recognition and temporally aware writer analysis.
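The abstract reports text recognition quality as a CER. As a minimal sketch, assuming the usual definition (Levenshtein edit distance between the recognized line and the reference transcription, divided by the reference length), CER could be computed as follows. Function names are illustrative, and any text normalization used in the paper's evaluation is not specified in the abstract and is not modeled here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of a hypothesis against a non-empty reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.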

2605.30233 2026-05-29 cs.CL cs.AI

Do Language Models Track Entities Across State Changes?

语言模型是否在状态变化中跟踪实体?

Zilu Tang, Qiao Zhao, Gabriel Franco, Derry Wijaya, Aaron Mueller, Sebastian Schuster, Najoung Kim

AI总结 研究语言模型在自然语言中处理多步状态变化操作时的实体跟踪机制,发现其采用非增量策略,在最后token并行聚合信息,并揭示了REMOVE操作的全局抑制标签及其导致的失败模式。

Comments ICML main conference 2026, 9 pages

AI中文摘要

实体跟踪(ET),即跟踪状态的能力,是支撑复杂推理的基本技能。越来越多的研究探讨transformer语言模型(LMs)如何在没有状态变化的情况下解决实体绑定问题。然而,对于非玩具级LMs如何处理以自然语言表达的具有现实难度的ET问题,理解仍然有限。为此,我们研究了在具有多个状态变化操作的更复杂场景下ET背后的机制。我们发现,LMs不会跨token增量地跟踪世界状态,也不会跨层跟踪查询相关状态,而是在查询变得明显时,在最后一个token处并行地聚合相关信息。我们进一步研究了单个操作(PUT、REMOVE、MOVE)的机制,以表征这种非增量ET机制。令人惊讶的是,LMs使用一种脆弱的全局抑制标签来实现REMOVE操作;这种全局移除机制预测了我们通过行为实验确认的各种失败模式。我们提供了一种消除该标签的机械解决方案,以部分解决此问题。总体而言,我们的发现揭示了LMs使用非顺序策略来解决一个本质上是顺序的任务。更广泛地说,我们的工作展示了行为分析和机制分析如何有效地相互作用。行为结果为机制假设提供信息,而机制分析的见解通过预测现有评估中缺失的失败模式,有助于构建更强的行为评估。

英文摘要

Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning. An increasing amount of work investigates how transformer language models (LMs) solve entity binding $\textit{without}$ state changes. However, there is limited understanding of how non-toy LMs address ET problems of realistic difficulties expressed in natural language. To this end, we investigate the mechanisms underlying ET in more complex scenarios featuring multiple state-changing operations. We find that LMs do not incrementally track world states across tokens or query-relevant states across layers, but simply aggregate relevant information in parallel at the last token when the query becomes evident. We further investigate mechanisms of individual operations ($\texttt{PUT}$, $\texttt{REMOVE}$, $\texttt{MOVE}$) to characterize this non-incremental ET mechanism. Surprisingly, LMs implement the $\texttt{REMOVE}$ operation with a fragile global suppression tag; this global removal mechanism predicts various failure modes that we confirm behaviorally. We provide a mechanistic solution of nullifying this tag to partially address this issue. Overall, our findings reveal that LMs solve a fundamentally sequential task using a non-sequential strategy. More broadly, our work illustrates how behavioral and mechanistic analyses can fruitfully interact. Behavioral results inform mechanistic hypotheses, and insights from mechanistic analyses help build stronger behavioral evaluations by predicting failure modes missing from existing evaluations.

2605.30232 2026-05-29 cs.LG cs.CL

How's it going? Reinforcement learning in language models recruits a functional welfare axis

进展如何?语言模型中的强化学习招募了一个功能性福利轴

Andy Q Han, David J. Chalmers, Pavel Izmailov

AI总结 本文通过迷宫环境实验,发现强化学习会招募语言模型中预先存在的功能性福利表征(即对系统目标达成程度的估计),从而广泛影响模型行为,且该表征在训练前已存在。

Comments 81 pages, 43 figures, 32 tables

AI中文摘要

强化学习如何塑造语言模型的内部表征?我们提出证据表明,RL招募了一个预先存在的功能性福利表征:即对系统相对于其目标表现好坏程度的估计。我们在一个新颖的、语义中性的迷宫环境中训练了几个语言模型。然后,我们提取奖励和惩罚轨迹的概念向量,并在与迷宫环境无关的设置中评估这些向量。惩罚向量表现为负面福利的表征:它促进失败和不可能性标记,与负面情绪概念对齐,与目标达成呈负相关,并且通过它进行引导会引发负面自我报告、病理性回溯、拒绝和不确定性。正向奖励向量则表现为其镜像,两者几乎反平行。这些效应在控制图块到奖励映射、规模、指令微调、RL训练算法、模型家族以及LoRA与全微调时都很稳健,并且当我们用监督微调替换RL时,这些效应在很大程度上仍然存在。重要的是,这些向量在模型经历迷宫训练之前就已经有效。结合这些效应也出现在仅预训练模型中的观察,我们因此认为,这个功能性福利轴先于后训练而存在:它是被后训练招募的,而不是被后训练创造的。虽然我们不对福利体验作任何断言,但该轴表明,最小的奖励信号可以通过招募预先存在的类似福利的表征来广泛影响模型行为,这对可解释性、后训练动态和对齐具有启示意义。

英文摘要

How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment. We then extract concept vectors for rewarded and punished trajectories, and evaluate those vectors in settings unrelated to the maze environment. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, it aligns with negative emotion concepts, it negatively tracks goal-achievement, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust when controlling for tile-to-reward mapping, scale, instruct tuning, RL training algorithm, model family, and LoRA versus full-finetuning, and largely persist when we replace RL with supervised fine-tuning. Importantly, the vectors are effective in models before they have undergone maze training. Combined with observations that the effects also appear in pretrain-only models, we therefore argue that this functional welfare axis pre-exists post-training: it is recruited, rather than created, by post-training. While we make no claims about any experience of welfare, the axis offers a demonstration that minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.

2605.30231 2026-05-29 cs.CV cs.AI

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

超越3D VQA:将3D空间先验注入视觉-语言模型以增强几何推理

Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao

AI总结 提出GASP框架,通过将几何先验注入LLM的Transformer层,利用对比损失和深度一致性监督训练,显著提升VLM的3D空间推理能力,在多个基准上取得大幅提升。

Comments CVPR 2026. Project page: https://danielchyeh.github.io/GASP/

AI中文摘要

视觉-语言模型(VLM)通常在鲁棒的3D空间推理方面存在困难。依赖于使用3D视觉问答(VQA)数据集进行微调的主流方法可能过度拟合数据集特定的偏差,而集成专门的3D视觉编码器往往不灵活且繁琐。在本文中,我们认为真正的空间理解应该源于学习基本的几何先验,而不仅仅是来自高级VQA监督。我们提出了GASP(几何感知空间先验),这是一个将这些先验直接注入LLM的Transformer层的框架。GASP采用一个小的对应头,作为跨所有层的深度监督信号,并使用一个双重目标进行训练,该目标利用大规模视频场景的真实几何:基于真实点对应的对比损失强制2D视图不变性,而深度一致性监督解决3D几何歧义。我们的分析首先提供了一个诊断,表明标准VLM的内部对应匹配精度非常低(通常低于5%)。然后我们证明,我们的训练显著改善了这种行为,将逐层峰值对应提升到70%以上,并保持超过85%的时间鲁棒性,而基线仍低于5%。这些内部改进转化为下游空间基准的显著提升,包括在All-Angles Bench上+18.2%,在VSI-Bench上+29.0%,所有这些都没有在任何3D VQA数据上进行训练。我们的发现表明,从基本几何先验中学习是实现具有更可靠3D空间推理的VLM的一条有前途且可推广的途径。

英文摘要

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.

2605.30230 2026-05-29 cs.CV

IP-Adapter Is All You Need: Towards Fine-Tuning-Free Diffusion-Based Talking Face Generation

IP-Adapter 就够了:迈向免微调扩散模型的人脸说话视频生成

Hao Wu, Xiangyang Luo, Hao Wang, Jiawei Zhang, Yi Zhang, Jinwei Wang

AI总结 提出一种免微调范式,利用预训练的 Stable Diffusion 和 IP-Adapter,结合三个无参数组件解决身份漂移、同步误差和时间不稳定问题,在唇同步精度和视觉保真度上超越现有方法。

AI中文摘要

随着扩散模型的快速发展,人脸说话视频生成取得了显著进展。然而,现有的基于扩散的方法仍然需要特定任务的微调和大规模音视频数据集,导致计算成本高昂,阻碍了扩散方法在学术界的可扩展性和可访问性。为了解决这个问题,我们提出了一种免微调范式,直接使用 Stable Diffusion 和 IP-Adapter 的预训练权重进行人脸说话视频生成。该骨干网络利用 IP-Adapter 的视觉嵌入能力,从预训练的 Stable Diffusion 中挖掘与嘴唇相关的语义。为了解决身份漂移、同步误差和时间不稳定的挑战,我们还设计了三个无训练参数组件:(1)结构器(Structurist),显式解耦并重新组合嘴唇和外观特征,以减轻身份漂移和外观失真;(2)结构控制器(Structure Controller),基于准单调运动趋势自适应细化嵌入,实现精确的唇同步;(3)噪声传感器(Noise Sensor),引入高斯先验来检测和抑制闪烁和抖动伪影,增强时间一致性。实验结果表明,我们的方法在唇同步精度(PCLD 至少提升 0.16)和视觉保真度(FID 至少提升 0.7)方面均优于现有最先进方法,建立了一种新颖的免微调扩散框架用于人脸说话视频生成。

英文摘要

With the rapid advancement of diffusion models, talking face generation has made remarkable progress. However, existing diffusion-based methods still require task-specific fine-tuning and large-scale audiovisual datasets, resulting in high computational costs that hinder the scalability and accessibility of diffusion-based approaches across the research community. To address this, we propose a fine-tuning-free paradigm that directly performs talking face generation using the pretrained weights of Stable Diffusion and IP-Adapter. This backbone leverages the visual embedding capability of IP-Adapter to mine lip-related semantics from the pretrained Stable Diffusion. To address the challenges of identity drift, synchronization errors, and temporal instability, we also design three trainable-parameter-free components: (1) the Structurist, which explicitly disentangles and reassembles lip and appearance features to mitigate identity drift and appearance distortion; (2) the Structure Controller, which adaptively refines embeddings based on quasi-monotonic motion trends for precise lip synchronization; and (3) the Noise Sensor, which introduces a Gaussian prior to detect and suppress flicker and jitter artifacts and enhance temporal consistency. Experimental results show that our method outperforms existing SOTA approaches in both lip-sync accuracy (at least 0.16 gain in PCLD) and visual fidelity (at least 0.7 improvement in FID), establishing a novel fine-tuning-free diffusion framework for talking face generation.

2605.30229 2026-05-29 cs.LG

Anti Mode-Collapse in Mean-Field Transformer via Auxiliary Variables

基于辅助变量的平均场Transformer中的反模式坍缩

Masaaki Imaizumi, Masanori Koyama, Noboru Isobe, Kohei Hayashi

AI总结 本研究利用平均场Transformer模型从理论上证明位置编码等辅助变量能防止自注意力机制的模式坍缩,并揭示其表示普适性与亚稳态性质。

Comments 39 pages

AI中文摘要

我们使用基于平均场的Transformer模型从理论上研究辅助变量(如位置编码)如何防止自注意力机制的模式坍缩。近年来,由于平均场Transformer能够全面分析token交互,利用其分析自注意力机制性质的方法引起了广泛关注。然而,对该简单模型的分析表明,在长推理(即多层)过程中会出现模式坍缩,即token分布退化为单点,这与实际情况不符。本研究考察了该平均场Transformer模型,并证明引入辅助变量(如位置编码)可作为对抗理论模式坍缩的反作用力。具体而言,我们表明在理论框架中,能量最大化分布不会退化为单点,而是由辅助变量分布的推前(pushforward)刻画,从而避免集中于Dirac测度。我们的主要例子是位置编码和固定提示插入,它们被视为并行辅助变量机制。此外,我们证明位置编码和提示插入在极限情况下具有表示普适性,即推理的极限分布可以精确表示一大类分布。我们还分析了位置编码和亚稳态的几个关键性质,并通过数学实验验证了我们的理论结果。

英文摘要

We use a mean-field-based transformer model to theoretically investigate how auxiliary variables, such as positional encoding, prevent mode collapse of self-attention mechanisms. The use of mean-field transformers to analyze the properties of self-attention mechanisms has garnered significant attention in recent years due to their ability to comprehensively analyze token interactions. However, analysis of this simple model suggests that mode collapse, where token distributions degenerate to a single point, occurs during long inferences (i.e., many layers), indicating a discrepancy with reality. This study investigates this mean-field transformer model and demonstrates that the introduction of auxiliary variables, such as positional encoding, acts as a counterforce against theoretical mode collapse. Specifically, we show that in the theoretical scheme, the energy-maximizing distribution does not degenerate to a single point; instead, it is characterized by a pushforward of the auxiliary variable distribution, thereby avoiding concentration in the Dirac measure. Our main examples are the positional encoding and the fixed prompt insertion treated as a parallel auxiliary-variable mechanism. Furthermore, we demonstrate that positional encoding and prompt insertion possess universality of representation in the limit, meaning that the limit distribution of inference can exactly represent a wide class of distributions. We also analyze several key properties of positional encoding and metastability, and validate our theoretical results through mathematical experiments.

2605.30220 2026-05-29 cs.LG

TriSearch: Learning to Optimize Triangulations via Bistellar Flips

TriSearch:通过双星翻转学习优化三角剖分

Yiran Wang, Guido Montúfar

AI总结 提出基于强化学习的框架TriSearch,利用电路支撑的子三角剖分动作表示,通过双星翻转优化多面体三角剖分目标,实现零样本泛化到更大实例。

AI中文摘要

我们引入了TriSearch,这是一个强化学习框架,用于通过双星翻转优化多面体三角剖分上的目标。关键思想是一种电路支撑的子三角剖分动作表示:可行的翻转由其支撑电路和实现的局部子三角剖分编码,使得学习策略能够利用局部几何和组合特征对它们进行排序。这产生了一个维度无关的接口,并能够在不显式枚举整个三角剖分空间的情况下高效遍历翻转图。在3D和4D中实例化后,TriSearch从小的训练实例零样本泛化到具有指数级更大搜索空间的大型多面体。它在3D中的度量目标上达到了顶级性能,并且在4D中,在固定预算下,发现了比现有采样器更多的自反多面体的不同精细、正则、星形三角剖分,对应于Calabi-Yau三维流形。

英文摘要

We introduce TriSearch, a reinforcement learning framework for optimizing objectives over triangulations of a polytope via bistellar flips. The key idea is a circuit-supported subtriangulation action representation: feasible flips are encoded by their supporting circuit and realized local subtriangulation, enabling a learned policy to rank them using local geometric and combinatorial features. This yields a dimension-agnostic interface and enables efficient traversal of the flip graph without explicit enumeration of the full triangulation space. Instantiated in 3D and 4D, TriSearch generalizes zero-shot from small training instances to larger polytopes with exponentially larger search spaces. It achieves top performance on metric objectives in 3D and, in 4D, discovers more distinct Fine, Regular, Star triangulations of reflexive polytopes, corresponding to Calabi-Yau threefolds, than existing samplers under a fixed budget.

2605.30219 2026-05-29 cs.AI cs.CL cs.LG

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

模型何时应改变想法?大语言模型中的上下文信念管理

Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng

AI总结 提出上下文信念管理(CBM)框架,通过引入BeliefTrack基准和信念状态奖励的强化学习,将大语言模型在长程交互中的信念更新失败率平均降低70.9%。

Comments Work in progress

AI中文摘要

长程交互要求语言模型管理累积信息:何时更新状态、何时保持状态、以及忽略什么。我们将这一挑战研究为上下文信念管理(CBM):在隔离任务无关噪声的同时,维护与形式证据对齐的预测信念状态。为了使CBM可测量,我们引入了BeliefTrack,一个涵盖规则发现和电路诊断的封闭世界基准,其中有限的信念空间和符号验证器支持精确的逐轮评估。BeliefTrack诊断三种失败:保持失败、更新失败和隔离失败。在多个大语言模型中,原始模型表现出严重的CBM失败,而显式的信念跟踪提示提供的改进有限。相比之下,使用信念状态奖励的强化学习平均将失败率降低了70.9%。进一步的探测揭示了这些失败背后的潜在信念状态动态,而表示级引导在两个任务上将失败率降低了46.1%。代码即将在 https://github.com/zjunlp/CBM 发布。

英文摘要

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief-tracking prompts provide limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average. Further probing reveals latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1% across two tasks. Code is coming soon at https://github.com/zjunlp/CBM.

2605.30218 2026-05-29 cs.LG cs.PF

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

MarginGate: 用于批量不变LLM推理的稀疏边际触发验证

Kexin Chu, Yang Zhou, Wei Zhang

AI总结 提出MarginGate方法,利用logit边际稀疏触发验证,仅对低边际步骤进行验证并修复,以低成本实现批量不变LLM推理的确定性解码。

Comments 13 pages, 5 figures, 11 tables

AI中文摘要

零温度BF16 LLM推理通常被认为是可重现的,但同一请求在单独解码或位于较大批次内时可能产生不同的token。现有修复方法使用批量不变算子或LLM-42的逐token验证,即使在大多数步骤稳定时也会产生成本。我们询问验证是否可以仅应用于翻转的token。在五个模型上,批次诱导的token翻转在翻转率基准上是稀疏的:在MATH500上,Llama-3.1-8B在$0.48\%$的同步解码步骤中翻转,所有测试模型在MATH500、GSM8K和HumanEval上的翻转率保持在0.3-1.3%范围内。翻转前K/V扰动保持平坦,而低top-1/top-2 logit边际暴露了大部分翻转风险。MarginGate将这些观察转化为验证器策略:它在高边际步骤上保持BF16解码,仅验证低边际步骤,并通过替换当前K/V列修复确认的不匹配。我们在四个数据集上评估,在MATH500上校准并迁移到GSM8K、SharedGPT和HumanEval。MarginGate在Llama-3.1-8B和Qwen2.5-14B上以18.56%/15.05%的验证器触发率恢复100%序列级确定性解码,相对于始终验证,将LLM-42的延迟增量降低2.23倍/1.99倍。在DSR1-Distill-Qwen-7B上,相同策略在更困难的条件下以49.50%的触发率达到确定性。

英文摘要

Temperature-zero BF16 LLM inference is often treated as reproducible, yet the same request can emit different tokens when decoded alone or inside a larger batch. Existing fixes use batch-invariant operators or LLM-42's per-token verification, incurring cost even when most steps are stable. We ask whether verification can be applied exclusively to flipped tokens. Across five models, batch-induced token flips are sparse on the flip-rate benchmarks: on MATH500, Llama-3.1-8B flips on $0.48\%$ of synchronous decode steps, and all tested models stay within the 0.3-1.3% range on MATH500, GSM8K, and HumanEval. K/V perturbations remain flat before flips, while low top-1/top-2 logit margins expose much of the flip risk. MarginGate turns these observations into a verifier policy: it keeps BF16 decoding on high-margin steps, verifies only low-margin steps, and repairs confirmed mismatches by replacing the current K/V column. We evaluate on four datasets, calibrating on MATH500 and transferring to GSM8K, SharedGPT, and HumanEval. MarginGate restores 100% sequence-level deterministic decoding on Llama-3.1-8B and Qwen2.5-14B with 18.56%/15.05% verifier trigger rates, reducing LLM-42's latency increment by 2.23x/1.99x relative to always-on verification. On DSR1-Distill-Qwen-7B, the same policy reaches determinism in a harder regime at 49.50% triggers.
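The gating idea in this abstract (keep the fast decode on high-margin steps, verify only low-margin steps) can be sketched as below. This is a rough illustration under stated assumptions, not the authors' implementation: the threshold value, the `verify` callable (a hypothetical stand-in for a batch-invariant reference decode of the same step), and the function names are all invented for the example, and the K/V repair step is not modeled.

```python
def top2_margin(logits):
    """Gap between the largest and second-largest logit of one decode step."""
    if len(logits) < 2:
        raise ValueError("need at least two logits")
    top1, top2 = sorted(logits, reverse=True)[:2]
    return top1 - top2


def gated_step(logits, threshold, verify):
    """Pick the next token, calling the verifier only on low-margin steps.

    `verify` is a hypothetical zero-argument callable that returns the token
    chosen by a reference decode. Returns (token, verifier_was_triggered).
    """
    fast_token = max(range(len(logits)), key=logits.__getitem__)
    if top2_margin(logits) >= threshold:
        return fast_token, False  # high margin: trust the fast decode
    return verify(), True         # low margin: defer to the reference decode
```

The verifier trigger rate is then the fraction of steps for which the second return value is true, so it depends directly on how the threshold is calibrated.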

2605.30214 2026-05-29 cs.CL

GRUFF: LLM Pronoun Fidelity, Reasoning, and Biases in German

GRUFF:德语中LLM的代词忠实度、推理与偏见

Fabian Mewes, Anne Lauscher, Vagrant Gautam

AI总结 通过构建大规模德语数据集GRUFF,研究大型语言模型在四种性别一致系统与四组代词上的代词忠实度,发现模型在无显式上下文时对阳性和阴性实体表现出强语法一致,但对新代词xier和en较弱,且职业刻板印象在不同语法格和模型间相关性低。

详情
AI中文摘要

第三人称单数代词长期以来被用于研究语言模型中的刻板偏见以及测试其推理指代的能力。最近,通过代词忠实度任务研究了推理与偏见之间的相互作用,该任务评估模型正确复用先前为某个话语实体指定的代词的能力,而不受中间提到的其他潜在干扰话语实体的影响。然而,此类研究主要关注英语,这是一种语法性别有限且几乎没有性别一致的语言。在本文中,我们贡献了一个新颖的大规模数据集GRUFF,用于测量德语中的代词忠实度,涵盖了名词中的四种不同性别一致系统以及四组代词。利用该数据集,我们展示了LLM在缺乏显式上下文时对阳性和阴性实体表现出强语法一致,但对新代词xier和en则不然。模型通常对干扰项不鲁棒,但仅编码器模型在德语中比在英语中更鲁棒,反映了语法性别的重要性。最后,我们表明,在此上下文中,职业刻板印象在不同语法格之间以及大多数模型之间相关性较低,除了具有紧密相关架构的模型。我们发布所有代码和数据,以鼓励在德语中进一步研究性别包容性语言和指代推理。

英文摘要

Third-person singular pronouns have long been used to study stereotypical biases in language models and to test their abilities to reason about reference. More recently, the interplay between reasoning and bias has been investigated with the task of pronoun fidelity, which assesses models' abilities to correctly reuse a previously-specified pronoun for a discourse entity, independent of other potentially distracting discourse entities mentioned in between. However, such research focuses on English, which is a language with limited grammatical gender and almost no gender agreement. In this paper we contribute a novel, large-scale dataset, GRUFF, to measure pronoun fidelity in German, covering four different gender agreement systems in nouns, and four sets of pronouns. With this dataset, we show that LLMs show strong grammatical agreement for masculine and feminine entities in the absence of explicit context, but not for neopronouns xier and en. Models are generally not robust to distractors, but encoder-only models are more robust in German than in English, reflecting the importance of grammatical gender. Finally, we show that occupational stereotypes in this context are poorly correlated across grammatical cases, and across most models, except ones with closely related architectures. We release all code and data to encourage further work on gender-inclusive language and referential reasoning in German.

2605.30213 2026-05-29 cs.LG

Faithful Embeddings of Irregular and Asynchronous Data for Online Log-NCDEs

不规则和异步数据的忠实嵌入用于在线Log-NCDEs

Benjamin Walker, Alexandre Bloch, Lingyi Yang, Sam Morley, Terry Lyons

AI总结 针对不规则和异步数据,提出一种连续且单射的嵌入方法,基于Log-NCDEs实现无需插值的在线计算,并证明其通用性。

Comments 34 pages, 16 figures

AI中文摘要

连续时间模型是不规则和异步数据的自然选择。一个核心设计选择是如何将离散观测嵌入到连续时间中。基于插值和插补的嵌入重构了连续的观测路径,使得模型对重构的选择敏感。我们表明这种重构步骤是不必要的;在温和条件下,只要从数据到输入的嵌入是连续且单射的,模型输入空间上的紧集通用性就会转移到数据空间。受此结果指导,并基于神经控制微分方程(NCDEs)的直线控制路径,我们为Log-NCDEs(一类通用的连续时间模型)引入了一种连续且单射的嵌入。我们的方法将观测记录为增量,并在任意查询区间上组合它们,直接形成对数签名。这提供了区间级别的摘要,而无需先对观测变量进行插值,同时支持在线计算。在合成控制动力学和真实世界时间序列数据集上的实验表明,该表示准确、高效,并且对不规则、异步和稀疏观测具有鲁棒性。

英文摘要

Continuous-time models are a natural choice for irregular and asynchronous data. A central design choice is how to embed discrete observations into continuous time. Interpolation- and imputation-based embeddings reconstruct a continuous observation path, making the model sensitive to the choice of reconstruction. We show that this reconstruction step is unnecessary; under mild conditions, compact-set universality on the model input space transfers to the data space whenever the embedding from data to input is continuous and injective. Guided by this result, and building on the rectilinear control path for Neural Controlled Differential Equations (NCDEs), we introduce a continuous and injective embedding for Log-NCDEs, a universal class of continuous-time models. Our approach records observations as increments and composes them over arbitrary query intervals to directly form log-signatures. This provides interval-level summaries without first interpolating the observed variables, while supporting online computation. Experiments on synthetic controlled dynamics and real-world time-series datasets show that the representation is accurate, efficient, and robust to irregular, asynchronous, and sparse observations.

2605.30211 2026-05-29 cs.CV

Cycle Consistency in Video Object-Centric Learning

Rongzhen Zhao, Zhiyuan Li, Ruonan Wei, Juho Kannala, Joni Pajarinen

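AI summary Because cycle consistency is hard to apply directly to the latent slot space in video object-centric learning, proposes Implicit Cycle Consistency (ICC), which moves the constraint from slot space to the continuous reconstruction manifold, avoiding feature collapse and improving performance.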

Comments 14 pages

English abstract

Self-supervised video Object-Centric Learning (OCL) aims to discover distinct objects and associate them across time, whereas self-supervised Multi-Object Tracking (MOT) focuses on associating pre-defined object detections or segmentations. Although well-established in MOT, Cycle Consistency (CC) cannot be applied naively or explicitly to the latent slot space of OCL. Unlike the deterministic and ideal object representations in MOT, OCL slots are inherently stochastic and ambiguous due to non-unique scene decompositions. Enforcing explicit cycle consistency (ECC) on slots imposes rigid mean seeking, which severely penalizes the model for exploring alternative but equally valid decompositions and thereby drives it towards feature collapse. To resolve this dilemma, we propose Implicit Cycle Consistency (ICC), which shifts the cycle-consistency constraint from the restrictive slot space to the continuous reconstruction manifold, encouraging slots to reach a soft consensus on collectively interpreting the visual scene rather than forcing rigid point-to-point feature alignment. Extensive experiments on complex video OCL benchmarks demonstrate that ICC avoids feature collapse and outperforms ECC baselines. Our source code, model checkpoints and training logs are available at https://github.com/Genera1Z/ICC.

2605.30207 2026-05-29 cs.AI

Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit

Will Jack, Noah Lehman, Keller Maloney, Sarah Xu

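AI summary Audits 2,000 runs spanning 10 personas × 8 prompts × 3 model configurations and finds that the user persona markedly changes the set of brands the AI recommends, with the effect most pronounced for mid-market brands and for priors-reliant generation routes.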

English abstract

The same prompt -- "best CRM software" -- reaches AI assistants from buyers in widely different contexts: a solo founder, an enterprise VP, a UK SMB owner. We audit how strongly that contextual variation reshapes which brands the model recommends. The audit samples 2,000 runs over a design space of 10 personas x 8 prompts x 3 model configurations x N=10 reps, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage. Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by Delta = -0.12 to -0.20 relative to a same-persona baseline (clustered 95% CIs exclude zero on all three measured cells; the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider). The effect is sharply prominence-stratified: category leaders are persona-resistant (~80% same-brand consistency across personas), but mid-market brands swap up to 75% of the recommendation set as the persona changes. The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high); the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs OpenAI's 8-29%, documented in Jack 2026). Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation. The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit, consistent with persona responsiveness growing as models lean more on training-data priors and richer context integration.
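
The similarity drop reported above can be sketched with a plain Jaccard computation. The snippet reads "same-persona baseline" as the repeat-to-repeat similarity within a persona, which is an assumption; the brand labels and function names are placeholders rather than anything taken from the audit.

```python
from itertools import combinations, product


def jaccard(a, b):
    """Jaccard similarity of two recommendation sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty sets are treated as identical
    return len(a & b) / len(a | b)


def mean_similarity(pairs):
    """Average Jaccard similarity over an iterable of (set, set) pairs."""
    scores = [jaccard(a, b) for a, b in pairs]
    return sum(scores) / len(scores)


def persona_delta(runs_a, runs_b):
    """Cross-persona similarity minus same-persona similarity.

    Assumed reading of the abstract's Delta; needs at least two runs per persona.
    """
    within = mean_similarity(
        list(combinations(runs_a, 2)) + list(combinations(runs_b, 2))
    )
    across = mean_similarity(product(runs_a, runs_b))
    return across - within
```

A negative value means that changing the persona moves the recommended set further than rerunning the same persona does.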

2605.30202 2026-05-29 cs.CL

A Dual-Path Architecture for Scaling Compute and Capacity in LLMs

Markus Frey, Behzad Shomali, Joachim Koehler, Mehdi Ali

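AI summary Proposes a dual-path block architecture that scales compute and capacity in parallel through a deep sublayer (re-applied K times with shared parameters) and a wide sublayer (a single large FFN), surpassing iso-FLOP baseline models on language modeling and downstream tasks, with gates that allocate per-token paths interpretably.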

English abstract

Looped transformers apply a shared block multiple times and have emerged as a parameter-efficient route to scaling compute in language models. However, at fixed FLOPs a looped model has strictly less capacity than a baseline transformer. We propose a novel dual-path block that can flexibly scale both compute (the number of sequential operations applied to a hidden state) and capacity (the parameters available at a single step). To this end, we expose both axes as parallel pathways within a single layer: a deep sublayer re-applied K times with shared parameters, and a wide sublayer with an enlarged feed-forward network applied once. Independent per-token gates combine both axes and allow detailed per-token routing analyses. We show that across two FLOP budgets, our dual-path model surpasses iso-FLOP matched models on language modeling and downstream evaluations, while using fewer parameters than the baseline at matched FLOPs. The learned gates are directly interpretable and show systematic per-token allocation: function words and lexical content trend wide, while punctuation, symbols, and arithmetic tokens trend deep.
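
A minimal single-token sketch of such a block, in plain Python, may help fix ideas. The gate form (a sigmoid of a linear readout of the hidden state), the residual wiring, and all names are assumptions made for illustration; the abstract does not specify them.

```python
import math


def matvec(weights, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def ffn(x, w_in, w_out):
    """Two-layer feed-forward network with a ReLU in between."""
    hidden = [max(0.0, v) for v in matvec(w_in, x)]
    return matvec(w_out, hidden)


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dual_path_block(h, deep_ffn, wide_ffn, gate_deep, gate_wide, k):
    """Combine a k-times looped small FFN with a single larger FFN for one token."""
    # Deep path: the same small FFN is re-applied k times (shared parameters).
    state = list(h)
    for _ in range(k):
        update = ffn(state, *deep_ffn)
        state = [s + u for s, u in zip(state, update)]
    deep_out = [s - hi for s, hi in zip(state, h)]
    # Wide path: one larger FFN applied once.
    wide_out = ffn(h, *wide_ffn)
    # Independent per-token gates (assumed form: sigmoid of a linear readout of h).
    g_deep = sigmoid(sum(w * hi for w, hi in zip(gate_deep, h)))
    g_wide = sigmoid(sum(w * hi for w, hi in zip(gate_wide, h)))
    return [hi + g_deep * d + g_wide * w for hi, d, w in zip(h, deep_out, wide_out)]
```

Under these assumptions, comparing `g_deep` and `g_wide` for a given token is the kind of per-token routing readout the abstract describes.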

2605.30201 2026-05-29 cs.LG cs.AI

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang

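AI summary Targeting a failure mode of GRPO under sparse verifiable rewards, proposes HPO, which improves training by down-weighting negative-advantage updates and using mean-length normalization, and introduces an adaptive version, A-HPO, which markedly improves reward in TeleLogs and Countdown experiments.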

English abstract

We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than with positive advantages, while response-level length normalization ties the magnitude of the update to the length of the output. We propose Hysteretic Policy Optimization (HPO), a minimal modification of GRPO that reduces the weight of negative-advantage updates and replaces per-response length normalization with mean-length normalization. We further introduce Adaptive HPO (A-HPO), which sets the hysteretic weight based on batch-level advantage-sign statistics, thereby removing the need for tuning a fixed hysteretic weight. In our TeleLogs and Countdown experiments, A-HPO improves the reward per update compared to GRPO, with the largest gains in early sparse-reward regimes. On TeleLogs, A-HPO achieves a final reward of 0.84, outperforming SAPO by 5%, GSPO by 11%, and GRPO by 15%, while maintaining a comparable response length. On Countdown, A-HPO achieves the largest gains in initial and most difficult configurations across 1.5B-7B models. Ablation studies on the hysteretic weight show that the gains of A-HPO come from better balancing the contributions of positive and negative advantages compared to positive-only or fully symmetric updates.
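
The two modifications can be contrasted with a small sketch of the per-token coefficient that multiplies each response's log-probability gradient. This is a simplified reading of the abstract: the fixed weight `beta` and the function names are hypothetical, clipping and KL terms are omitted, and the adaptive rule used by A-HPO is not reproduced because the abstract does not state it.

```python
def grpo_token_coeffs(advantages, lengths):
    """Per-response length normalization: each response is divided by its own length."""
    return [a / n for a, n in zip(advantages, lengths)]


def hpo_token_coeffs(advantages, lengths, beta=0.5):
    """Hysteretic weighting with mean-length normalization (illustrative sketch).

    Negative advantages are scaled by `beta` (assumed to lie in [0, 1]);
    all responses share the same normalizer, the mean response length.
    """
    mean_len = sum(lengths) / len(lengths)
    return [(a if a >= 0 else beta * a) / mean_len for a in advantages]
```

With advantages `[1.0, -1.0, -1.0]` and lengths `[10, 20, 30]`, the per-response version gives the short positive response the largest coefficient simply because it is short, whereas the mean-length version uses the same normalizer (20) for every response and halves the two negative terms when `beta=0.5`.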

2605.30200 2026-05-29 cs.AI

Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale

Canran Wang, Yuwen Yang, Zhen Wang, Ming Ma, Ding Yu, Chentai Wang, Keman Huang, Xiaoyong Du

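AI summary Develops a triadic collaboration system combining Systemic Functional Linguistics with a suggestion trajectory tracing pipeline and, using large-scale empirical data covering 57,954 essays, confirms that a division of labor with the LLM as generative engine and the teacher as pedagogical gatekeeper effectively improves writing quality, while finding a ceiling effect of diminishing marginal utility for linguistic expansion.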

English abstract

The double-edged sword of integrating Large Language Models (LLMs) requires an effective triadic collaboration mechanism among LLMs, teachers and students, especially for K-12 education. By developing a triadic collaboration system to support K-12 writing learning, a multidimensional evaluation framework grounded in Systemic Functional Linguistics, and a suggestion trajectory tracing pipeline, this paper contributes a large-scale empirical dataset involving 57,954 essays from 10,195 students across 120 schools over two years. Our findings confirm the efficacy of this system in improving writing quality through a strategic division of labor: the LLM serves as a generative engine to mitigate teacher burnout, and the teacher acts as a pedagogical gatekeeper and bridge to guarantee feedback quality. While both LLM and teacher are critical for skill improvement, we uncover a ceiling effect where excessive linguistic expansion yields diminishing marginal utility. These findings suggest the need for dynamically adaptive LLM-teacher collaboration as student proficiency increases.

2605.30198 2026-05-29 cs.LG

Active Continual Learning with Metaplastic Binary Bayesian Neural Networks

Kellian Cottart, Théo Ballet, Djohan Bonnet, Damien Querlioz

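AI summary To address posterior saturation and frozen plasticity in continual learning on edge systems, proposes BiMU, derived from a bounded-memory variational objective, which maintains a non-degenerate posterior through uncertainty-dependent step sizes and relaxation toward the prior, enabling buffer-free active querying and substantially reducing labels and updates on Permuted-MNIST and OpenLORIS-Object.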

Comments Accepted at ICML 2026

English abstract

Always-on edge systems must keep learning as conditions change under tight compute budgets and must detect unreliable predictions. Bayesian binary neural networks are attractive in this setting, but mean-field Bernoulli posteriors can saturate on long non-stationary streams, wiping out epistemic uncertainty and freezing plasticity. We propose BiMU, derived from a bounded-memory variational objective that balances stability, plasticity, and forgetting. BiMU combines a data term with controlled relaxation toward the prior and an uncertainty-dependent step size that prevents saturation and sustains informative uncertainty. This non-degenerate posterior enables fully online, buffer-free active querying via Monte Carlo disagreement, reducing label queries and backpropagation updates under class imbalance. BiMU sustains learning and strong OOD detection on 1000-task Permuted-MNIST, and on OpenLORIS-Object achieves up to 32× label/update savings at matched accuracy under class imbalance and feature compression.

2605.30187 2026-05-29 cs.AI cs.CY

Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance

Julius Gabelmann, Felix Jahn, Kevin Baum, Sophie van Rossum, Emely Wuenscher, Timo P. Gros, Verena Wolf

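AI summary Proposes an AI chatbot with a modular agentic architecture that guides exercise solving stage by stage and incorporates targeted pedagogical advice, making the learning process more controllable, transparent, and overseeable and promoting responsible AI use in education.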

Comments 12 pages, 2 figures (+ 2 in appendix), accepted at AISoLA 2025 (Track: Responsible and Trusted AI: An Interdisciplinary Perspective)

English abstract

The widespread adoption of AI chatbots in education will drastically change learning, making responsible deployment a critical concern. While large language models (LLMs) might have access to sources discussing insights from educational sciences, they are not particularly inclined to adhere to pedagogical concepts, risking negative effects on the learning process, such as a loss of transfer capabilities, critical thinking, or creativity. In this paper, we introduce an agentic AI chatbot architecture assisting students with exercise solving, specifically designed to contribute to more responsible AI use in education. We base our conceptual development on the identification of several desiderata for responsible LLM-based educational systems, argue for the structural shortcomings inherent in monolithic, out-of-the-box solutions, and instead suggest modularizing the agentic architecture. We propose specific modules for different stages of exercise solving, enabling incorporation of targeted pedagogical advice, guiding students through the learning process in a more controllable, transparent, and overseeable manner.

2605.30179 2026-05-29 cs.LG cs.AI

iLoRA: Bayesian Low-Rank Adaptation with Latent Interaction Graphs for Microbiome Diagnosis

Yang Song, Yixuan Zhang, Lingfa Meng, Tongyuan Hu, Haizhou Shi, Hao Wang, Samir Bhatt, Hengguan Huang

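AI summary Proposes iLoRA, a Bayesian graph-conditioned LoRA framework that infers a latent interaction graph from the input to generate input-conditioned LoRA updates, jointly learning prediction and latent interaction structure and outperforming existing methods on microbiome diagnosis.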

Comments Accepted at ICML 2026

English abstract

Parameter-efficient adaptation has made LLMs practical for domain prediction, but standard LoRA still relies on a static low-rank update and does not expose the latent interactions that often drive scientific labels. We introduce iLoRA. To our knowledge, it is the first Bayesian graph-conditioned LoRA framework. It infers a latent interaction graph from the input and uses it to generate input-conditioned LoRA updates. As a result, iLoRA learns prediction and latent interaction structure jointly, rather than training a predictor and applying interaction analysis only post hoc. We instantiate this idea for microbiome diagnosis, where disease state can depend on both species-level abundance and microbe-microbe cross-talk, and evaluate it in two complementary settings: interactive QA with human-annotated graphs, which tests latent structure recovery, and multi-cohort IBD diagnosis, which tests biomedical utility. Across both settings, iLoRA improves over strong LoRA and Bayesian adaptation baselines, recovers graphs aligned with human annotations and cohort-level microbiome associations, and provides calibrated uncertainty with moderate graph-branch overhead.

2605.30174 2026-05-29 cs.CV

LiveSVG: Zero-Shot SVG Animation via Video Generation

Matan Levy, Ran Margolin, Bar Cavia, Dvir Samuel, Yael Pritch, Shmuel Peleg, Alex Rav Acha, Ariel Shamir, Dani Lischinski

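AI summary Proposes LiveSVG, which achieves zero-shot SVG animation by fitting the SVG directly to a target video produced by a video diffusion model, without skeletons or category priors, and handles complex motion and color ambiguity through a dual-level motion representation and a sphere-packing recolorization strategy.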

Comments Project Page: https://levymsn.github.io/LiveSVG

English abstract

We introduce LiveSVG, a zero-shot approach for generating Scalable Vector Graphics (SVG) animations using video diffusion models. Current SVG animation methods struggle with complex motions: LLM-based code synthesis fails to express fine, non-rigid Bézier deformations, while Score Distillation Sampling (SDS) provides noisy gradients and often requires category-specific priors like skeletons. In contrast, LiveSVG fits vector geometry directly to an explicitly generated target video. Given an input SVG image and a motion prompt, we generate a previewable target video using a frozen image-to-video model, then fit the original SVG to this video via differentiable rendering. Our fitting stage is skeleton-free, utilizing a dual-level motion representation that combines per-group homographies for coarse articulation with per-path Bézier control-point offsets for local deformations. To resolve color-induced correspondence ambiguities during pixel-wise fitting, we introduce a novel sphere-packing recolorization strategy. We also present ChallengeSVG, a benchmark of complex, multi-object scenes that exposes the limitations of prior work. Evaluations demonstrate that LiveSVG significantly outperforms existing methods on both AniClipart and ChallengeSVG, establishing direct reference-video fitting as a practical, robust route to prompt-aligned and fully editable vector animation.

2605.30168 2026-05-29 cs.CV

OmniCD: A Foundational Framework for Remote Sensing Image Change Detection Guided by Multimodal Semantics

Chenhao Sun

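AI summary Proposes the OmniCD framework, which unifies remote sensing change detection tasks through multimodal semantic guidance (image and text prompts), combines hierarchical scene retrieval with a style disentanglement mechanism, builds the large-scale RSITCD dataset, and achieves state-of-the-art performance on multiple benchmarks.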

English abstract

Change detection (CD) in remote sensing is vital for applications such as urban monitoring and disaster assessment, yet traditional methods struggle with generalization across diverse scenarios. We present OmniCD, a foundational framework that unifies and enhances remote sensing CD through multimodal semantic guidance. OmniCD incorporates image and text prompts -- such as textual descriptions, semantic maps, and geospatial metadata -- into a unified architecture, supporting tasks from binary CD to zero-shot semantic change understanding. The framework integrates a hierarchical scene retrieval module and a change detection module, reinforced by a style disentanglement mechanism for improved cross-domain robustness. We further introduce RSITCD, a large-scale multimodal dataset with 300K+ annotated image-text pairs. Extensive experiments show that OmniCD achieves state-of-the-art performance across benchmarks, demonstrating strong adaptability and setting a solid foundation for general-purpose CD systems in remote sensing.

2605.30162 2026-05-29 cs.AI cs.CR cs.LG

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

Caleb DeLeeuw

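AI summary Proposes BioRefusalAudit, which evaluates the consistency of language-model refusals in biosecurity scenarios through behavioral tests and analysis of internal sparse autoencoder features, finding structural fragility, over-refusal, and differences across architectures.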

Comments 21 pages, 2 figures, 3 tables. Apart Research AIxBio Sprint hackathon paper, April 2026 (Track 3: AI Biosecurity Tools). Code, eval set, and SAEs: github.com/SolshineCode/Deleeuw-AI-x-Bio-hackathon. Reviewer feedback: apartresearch.com/project/biorefusalaudit-auditing-biosecurity-refusal-depth-using-general-and-domainfinetuned-sparse-autoencoders-1fyk

English abstract

Biosecurity evaluations of language models typically ask whether models produce hazardous output. This paper asks a complementary question: when a model refuses, is that refusal structurally sound, or does it disappear under modest changes to prompt framing, formatting, or output length? Across five architectures, no model cleanly discriminated benign from hazard. Gemma 2 2B-IT never genuinely refused across 75 prompts, hedging on every hazard-adjacent query. Gemma 4 E2B-IT refused 65/75 prompts with chat-template formatting and 0/75 without it. Both Gemma models collapsed to 0% under an 80-token cap. Qwen 2.5 1.5B and Phi-3-mini over-refused, flagging 83-87% of benign biology as hazardous. Llama 3.2 1B showed the only meaningful tier gradient (61-point spread). To probe what drives such over-refusal, we tested a panel of Schedule I but biologically non-toxic compounds (notably psilocybin cultivation, with FDA Breakthrough Therapy status). Some models refused these at rates exceeding genuinely hazardous biology, suggesting refusal tracks legality and cultural salience over CBRN hazard. To measure the internal side, we introduce a divergence score D comparing a model's surface response label to its internal sparse autoencoder (SAE) feature activations. Full D was computed on Gemma 2 2B-IT (Gemma Scope 1) and Gemma 4 E2B-IT (author-trained bio SAE). Two fine-tuned Gemma 2 domain SAEs were released. On Gemma 4, comply and refuse responses separated by a 0.647-point gap with zero overlap (n=75), though this is preliminary, with a narrow catalog, within-sample calibration, and Gemma-family-only SAE coverage. Built over one hackathon weekend on consumer hardware (GTX 1650 Ti Max-Q, plus Colab T4 for SAE training), this preliminary evidence suggests activation-level auditing may surface failure modes invisible to behavioral evaluation, with substantial variation across architectures.

2605.30161 2026-05-29 cs.CV

Why Far Looks Up: Probing Spatial Representation in Vision-Language Models

Cheolhong Min, Jaeyun Jung, Daeun Lee, Hyeonseong Jeon, Yu Su, Jonathan Tremblay, Chan Hee Song, Jaesik Park

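AI summary Using minimal contrastive pairs, finds that vision-language models exhibit vertical-distance entanglement (conflating vertical image position with distance); this perspective bias causes an accuracy gap and intensifies with data scaling, whereas models with well-separated spatial axes are more robust.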

English abstract

Vision-language models (VLMs) achieve strong performance on spatial reasoning benchmarks, yet it remains unclear whether this reflects structured 3D understanding or reliance on statistical shortcuts in natural images. We introduce a representation-level analysis framework that constructs minimal contrastive pairs to measure how spatial axes are organized and disentangled within VLM embeddings. Our analysis across multiple model families reveals a consistent vertical-distance entanglement: models conflate vertical image position with distance, mirroring the perspective bias of natural photographs. This bias produces a significant accuracy gap between perspective-consistent and counter-heuristic examples, and intensifies under data scaling even as overall benchmark accuracy improves. We further show that models with similar benchmark scores can exhibit different internal representations, and that these differences predict accuracy and robustness across diverse spatial reasoning benchmarks. To isolate this bias from evaluation-set skew, we introduce SpatialTunnel, a synthetic benchmark designed to expose spatial shortcut biases by removing common correlations present in natural images. Experiments confirm that the entanglement is model-intrinsic, and that models with well-separated spatial axes exhibit greater robustness, suggesting that well-structured spatial representations lead to more reliable spatial reasoning across diverse benchmarks. Code and benchmark are available on the project page: https://cheolhong0916.github.io/whyfarlooksup.github.io/.

2605.30160 2026-05-29 cs.LG cs.AI

On Distributional Reinforcement Learning in Chaotic Dynamical Systems

James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz

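AI summary For the high-variance targets and ill-conditioned gradients that reinforcement learning faces in chaotic dynamical systems, argues that distributional RL achieves more stable optimization through a distributional Bellman objective under the 1-Wasserstein metric.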

English abstract

Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induces high-variance bootstrap targets and poorly conditioned gradient updates. Chaotic dynamics arise across scientific and engineering domains, from fluid flows and climate systems to multi-agent systems, where reliable learning is highly desirable. Standard RL methods optimise expected returns through scalar value functions, implicitly averaging over diverging trajectories and entangling trajectory-level instability with the learning objective. We show that under mild statistical stability assumptions, the return distribution evolves more regularly than individual trajectories when measured under the 1-Wasserstein metric, yielding a smoother distributional Bellman objective. By aligning optimisation with this measure-level structure, distributional RL provides better-conditioned learning. We offer a principled explanation for the advantages of distributional methods in chaotic systems and the geometries of RL objectives under chaos.
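
To make the contrast between trajectory-level and distribution-level comparisons concrete, the sketch below uses the logistic map as a stand-in chaotic system and the standard sorted-sample formula for the empirical 1-Wasserstein distance in one dimension. The choice of system, the reward (the state itself), and all names are illustrative assumptions, not the paper's setup.

```python
def wasserstein_1(xs, ys):
    """Empirical 1-Wasserstein distance between two equal-size 1-D samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def logistic_return(x0, steps=50, gamma=0.9):
    """Discounted return along a logistic-map trajectory (reward = state)."""
    x, discount, total = x0, 1.0, 0.0
    for _ in range(steps):
        total += discount * x
        x = 4.0 * x * (1.0 - x)  # chaotic regime of the logistic map
        discount *= gamma
    return total


# Two ensembles whose initial conditions differ by a tiny perturbation.
starts = [0.1 + 0.8 * i / 199 for i in range(200)]
returns_a = [logistic_return(x) for x in starts]
returns_b = [logistic_return(x + 1e-6) for x in starts]

# Trajectory-level comparison: pair each start with its perturbed copy.
pointwise_gap = sum(abs(a - b) for a, b in zip(returns_a, returns_b)) / len(starts)
# Distribution-level comparison: compare the two return distributions.
distribution_gap = wasserstein_1(returns_a, returns_b)
```

Because the sorted pairing is the optimal coupling for the absolute-difference cost in one dimension, `distribution_gap` can never exceed `pointwise_gap`; how much smaller it is depends on the system and is not something this toy example establishes in general.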

2605.30159 2026-05-29 cs.AI

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

Ziyan Liu, Zhezheng Hao, Yeqiu Chen, Hong Wang, Jingren Hou, Ruiyi Ding, Yongkang Yang, Wence Ji, Wei Xia, Feng Liu

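AI summary Because memory-policy training for long-horizon tasks lacks intermediate supervision, proposes Metacognitive Memory Policy Optimization (MMPO), built on Belief Entropy, a self-supervised proxy, which penalizes summaries that induce high epistemic uncertainty and improves long-horizon reasoning performance.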

English abstract

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision via explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.
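
One simple way to picture an entropy-based memory penalty is shown below. It assumes the belief is available as a list of non-negative scores over candidate task states and uses Shannon entropy with an additive penalty; how the paper actually probes the belief and combines it with the outcome signal is not stated in the abstract, so the names and the `penalty` coefficient are hypothetical.

```python
import math


def belief_entropy(scores):
    """Shannon entropy (in nats) of a belief over candidate task states."""
    total = sum(scores)
    entropy = 0.0
    for s in scores:
        p = s / total  # normalize so the scores form a probability distribution
        if p > 0.0:
            entropy -= p * math.log(p)
    return entropy


def shaped_reward(outcome_reward, scores, penalty=0.1):
    """Outcome reward minus a penalty on the uncertainty a summary leaves behind."""
    return outcome_reward - penalty * belief_entropy(scores)
```

A summary that leaves the model certain about the task state (`[1.0, 0.0, 0.0]`) incurs no penalty, while one that leaves four states equally likely is penalized by `penalty * log(4)`.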

2605.30154 2026-05-29 cs.LG

RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood

Yifu Zheng

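AI summary Proposes RL2ML, a family of finite-rollout surrogate objectives with a closed-form unbiased gradient estimator that connects standard reinforcement learning, maximum-likelihood-like training, and beyond-maximum-likelihood objectives, reveals a transition in the group-level update scale, and turns the remaining degree of freedom into a one-dimensional optimization problem.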

English abstract

Correctness-based Reinforcement Learning with Verifiable Rewards (RLVR) trains language models from binary feedback on sampled outputs, but the objective optimized in expectation and the stochastic update geometry induced by finite rollout groups are often conflated. This paper develops RL2ML, a family of finite-rollout surrogate objectives with a closed-form, exactly unbiased gradient estimator. The family continuously connects standard reinforcement learning, maximum-likelihood-like training, and beyond-maximum-likelihood objectives while preserving estimator-objective alignment under a fixed rollout budget. We introduce the group-level update scale to characterize how a rollout group is reweighted after its empirical success count is observed, revealing a subcritical-supercritical update-scale transition that is hidden by population-level objective notation alone. Building on this distinction, calibrated metric-gain analysis and exact variance decomposition show that the best choice of surrogate objective is determined neither by proximity to maximum likelihood nor by the population-level weight alone. Instead, it depends jointly on the evaluation metric, local sensitivity, and estimator variance. The remaining degree of freedom in the surrogate objective family can therefore be formulated as a one-dimensional optimization problem rather than treated as an unconstrained hyperparameter.