arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.05885 2026-06-05 cs.LG

When Denser Credit Is Not Enough: Evidence-Calibrated Policy Optimization for Long-Horizon LLM Agent Training

Yuanfan Li, Qi Zhou, Wenjing Duan, Lu Chen

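Affiliations: X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Faculty of Electronic and Information Engineering, Xi’an Jiaotong University

AI summary: To address credit assignment for long-horizon LLM agents under sparse, delayed rewards, the paper proposes ECPO, a critic-free policy optimization algorithm that corrects the statistical unreliability of dense credit through evidence-calibrated action advantages and variance-gated credit weighting, significantly improving performance on ALFWorld and WebShop.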

Abstract

Long-horizon LLM agents require reinforcement learning methods that can assign credit to intermediate decisions under sparse and delayed rewards. Recent group-based methods such as GiGPO improve over GRPO by constructing step-level advantages at repeated anchor states. However, we show that such dense credit can be statistically unreliable: under limited rollouts, rare but lucky actions may receive overly large advantages, producing divergent anchor bias and late-stage training oscillation. We propose Evidence-Calibrated Policy Optimization (ECPO), a critic-free policy optimization algorithm that calibrates step-level credit before policy updates. ECPO combines Evidence-Calibrated Action Advantage, which groups rollouts by canonical actions and shrinks low-count estimates, with Variance-Gated Credit Weighting, which suppresses anchor states dominated by within-action noise. Experiments on ALFWorld and WebShop with Qwen2.5-1.5B/7B show that ECPO consistently outperforms strong baselines, improving GiGPO by +5.2/+7.3 success points on ALFWorld/WebShop with Qwen2.5-1.5B while adding only 0.1% additional advantage-computation overhead.
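
The shrinkage and gating ideas in this abstract can be illustrated with a small sketch. This is not the authors' implementation: the count-based shrinkage factor `n / (n + prior_strength)`, the variance-ratio gate, and every function and parameter name below are assumptions chosen for illustration only.

```python
from collections import defaultdict
from statistics import mean, pvariance


def group_by_action(samples):
    """Group (action, return) pairs collected at one anchor state by action."""
    groups = defaultdict(list)
    for action, ret in samples:
        groups[action].append(ret)
    return groups


def calibrated_advantages(samples, prior_strength=2.0):
    """Per-action advantage, shrunk toward zero when an action has few samples.

    prior_strength is a hypothetical pseudo-count: an action seen n times keeps
    the fraction n / (n + prior_strength) of its raw advantage.
    """
    baseline = mean(ret for _, ret in samples)
    advantages = {}
    for action, returns in group_by_action(samples).items():
        n = len(returns)
        raw = mean(returns) - baseline
        advantages[action] = raw * n / (n + prior_strength)
    return advantages


def variance_gate(samples):
    """Share of return variance explained by the action choice, in [0, 1].

    A value near 0 means within-action noise dominates, so the anchor state
    would be down-weighted.
    """
    returns = [ret for _, ret in samples]
    total = pvariance(returns)
    if total == 0:
        return 0.0
    within = sum(len(g) * pvariance(g) for g in group_by_action(samples).values())
    return 1.0 - (within / len(returns)) / total


# One "lucky" rollout for action "a", three rollouts for action "b".
samples = [("a", 1.0), ("b", 0.0), ("b", 0.0), ("b", 1.0)]
adv = calibrated_advantages(samples)
gate = variance_gate(samples)
```

In this toy case the single lucky rollout of action "a" has a raw advantage of 0.5, but shrinkage keeps only one third of it, which is the kind of damping of rare but lucky actions the abstract describes.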

2606.05883 2026-06-05 cs.CV

Geometry-Aware Dataset Condensation for Diffusion Model Training

Xiao Cui, Yulei Qin, Mo Zhu, Wengang Zhou, Hongsheng Li, Houqiang Li

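AI summary: For diffusion model training, the paper proposes a real-subset selection method based on geometry-aware distribution alignment. It uses one-sided partial optimal transport to preserve geometric structure, adds lightweight feature-statistics and semantic-consistency regularization, and solves the objective with a two-stage discrete optimization for efficient condensation.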

Comments ICML 2026

Abstract

Dataset condensation aims to construct compact datasets from real data via synthesis or selection. However, existing approaches are ill-suited for diffusion model training: synthetic data generation often yields low-fidelity samples unsuitable for authentic modeling, while real subset selection typically fails to preserve the distributional geometry required by diffusion likelihood objectives. To address this, we reformulate real subset selection as a geometry-aware distribution alignment problem. By incorporating one-sided partial optimal transport, our method selectively aligns a compact subset with the full data distribution while allowing unmatched mass in low-density regions, preserving the geometric structure necessary for effective diffusion model training. To further ensure distributional fidelity, we complement geometric alignment with lightweight feature-statistics and semantic consistency regularization. We propose an efficient two-stage discrete optimization strategy to solve this alignment objective. Extensive experiments across diffusion variants, subset sizes, image resolutions, and training rounds show that our method achieves superior fidelity and distributional coverage in diffusion model training. Code is available at https://github.com/2018cx/GADC.

2606.05880 2026-06-05 cs.RO

TAGA: Terrain-aware Active Gaze Learning for Generalizable Agile Humanoid Locomotion

Peizhuo Li, Hongyi Li, Mingfeng Fan, Fangzhou Xu, Shuhao Liao, Yuxuan Ma, Zicheng Zeng, Ze Wang, Yongbin Jin, Yuhong Cao, Hongtao Wang, Guillaume Sartoretti

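Affiliations: MarmotLab, National University of Singapore; Center of X-Mechanics, Zhejiang University; South China University of Technology

AI summary: Proposes TAGA, a framework that fuses vision, proprioception, and motion commands so the model learns to actively gaze at key terrain regions, raising perceptual information density under a limited compute budget and achieving robust, generalizable agile humanoid locomotion.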

Abstract

Agile humanoid locomotion across diverse challenging terrain demands both wide perceptual coverage and precise local geometry understanding. Motivated by the way humans selectively look at relevant terrain during locomotion, we introduce TAGA, a Terrain-aware Active Gaze learning framework for Attention-based humanoid control. By fusing vision, proprioception, and motion commands, our framework guides the model to learn anticipatory cues and actively attend to specific areas of the height scan, selectively using these informative regions for the downstream network. This adaptively increases the information density of observations under tight onboard computational constraints, thus enabling fine-grained perceptive locomotion over larger-scale terrains. We find that such gaze behaviors can naturally emerge through reinforcement learning alone, without requiring additional supervision or explicit guidance, and that they significantly improve training efficiency. As a result, the trained policy demonstrates robust and generalizable locomotion in simulation and on hardware, including reliable terrain-aware foothold selection, elevated-platform traversal, competitive sparse-foothold traversal, and the largest reported real-world gap traversal distance of 1.2 m among perceptive humanoid locomotion systems, while maintaining stability under severe perceptual disturbances and environmental interference.

2606.05875 2026-06-05 cs.AI cs.DB

QCFuse: Query-Aware Cache Fusion via Compressed View for Efficient RAG Serving

Jianxin Yan, Wangze Ni, Zhenxin Li, Jiabao Jin, Zhitao Shen, Haoyang Li, Jia Zhu, Peng Cheng, Xuemin Lin, Lei Chen, Kui Ren

Affiliations: Zhejiang University; East China Normal University; Ant Group; The Hong Kong Polytechnic University; Zhejiang Normal University; Tongji University; The Chinese University of Hong Kong, Shenzhen; The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology

AI summary: Proposes QCFuse, a compressed-view query-aware selector that uses chunk-anchor query probing and critical-layer profiling for efficient RAG cache fusion, achieving an average 1.7x prefill speedup while preserving full-prefill quality.

Abstract

Retrieval-augmented generation (RAG) improves large language model (LLM) answer quality by grounding generation in external evidence, but processing retrieved contexts makes the prefill stage a dominant serving cost. RAG cache fusion reduces this cost by reusing precomputed key-value (KV) caches for retrieved chunks and selectively recomputing tokens under the current prompt. Existing selectors, however, face a dilemma between quality and efficiency: fast query-agnostic or final-layer query-to-context selectors can miss request-relevant evidence, whereas full-view query-aware selectors require broad context and layer visibility before recomputation and therefore stall the layer-wise cache-fusion pipeline. We present QCFuse, a compressed-view query-aware selector for RAG cache fusion. QCFuse uses chunk-anchor query probing to condition user-query states on compact per-chunk anchors and critical-layer profiling to identify recomputation tokens without all-layer inspection. We implement QCFuse in SGLang and evaluate it on four open-weight LLMs across six datasets. QCFuse reaches full-prefill-level quality. At matched quality, QCFuse achieves an average prefill-time speedup of 1.7x over full prefill and 1.5x over ProphetKV, the strongest quality-preserving baseline.

2606.05874 2026-06-05 cs.CL

Evaluating Stochastic Collapse and Implicit Bias in Multimodal Large Language Models

Huiyuan Zheng, Houtao Zhang, Boyang Wang, Qingyi Si, Hongcheng Guo

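Affiliations: Fudan University; Beihang University; JD.com

AI summary: Proposes the RandomBench benchmark, which uses entropy and distributional-bias metrics to show that multimodal LLMs exhibit stochastic collapse in logic-neutral scenarios, that is, they fail to maintain uniform randomness.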

Abstract

Current evaluations of Multimodal Large Language Models (MLLMs) overwhelmingly focus on utility-driven objectives, leaving model behavior under logic-neutral scenarios largely underexplored. Stochasticity is essential in scenarios where multiple actions are equally valid, such as recommending travel itineraries or daily schedules where multiple options have similar utility. In such settings, deterministic policies may lead to repetitive behaviors and reduced coverage of valid alternatives. To bridge this gap, we propose RandomBench, a benchmark designed to evaluate whether MLLMs can maintain distributionally neutral behavior when selecting among equivalent options. We further introduce three metrics, RI, BCI, and BII, to quantify entropy and distributional bias. Experiments reveal a pervasive phenomenon termed Stochastic Collapse, where MLLMs fail to maintain uniform randomness under explicit random instructions: in Claude Sonnet 4.6, top-1 probabilities reach 97% against an ideal baseline of one quarter, and RI drops to 0.068. Extensive ablation studies further demonstrate that these deviations persist across languages and representation formats, highlighting the robustness of distributional collapse in logic-neutral decision settings.
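
To make the idea of an entropy-based randomness measure concrete, the sketch below computes normalized Shannon entropy over a four-option choice distribution. It is not the paper's RI, BCI, or BII, whose definitions are not given in the abstract; the function name and the way the remaining 3% is split across the other options are assumptions for illustration.

```python
import math


def normalized_entropy(probs):
    """Shannon entropy divided by log(k): 1.0 is uniform, 0.0 is deterministic."""
    k = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(k)


# Ideal behavior over four equivalent options.
uniform = [0.25, 0.25, 0.25, 0.25]

# A collapsed distribution with a 97% top-1 option; the even split of the
# remaining mass is a made-up example, not data from the paper.
collapsed = [0.97, 0.01, 0.01, 0.01]

uniform_score = normalized_entropy(uniform)
collapsed_score = normalized_entropy(collapsed)
```

A model that always picks the same option scores close to 0 on this measure, while one that samples evenly scores 1, which is the contrast the benchmark is designed to expose.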

2606.05873 2026-06-05 cs.RO cs.AI cs.CV cs.LG

LadderMan: Learning Humanoid Perceptive Ladder Climbing

Siheng Zhao, Yuanhang Zhang, Ziqi Lu, Pieter Abbeel, Rocky Duan, Koushil Sreenath, Yue Wang, C. Karen Liu, Guanya Shi

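Affiliations: Amazon FAR; USC; UC Berkeley; Stanford University; CMU

AI summary: Proposes LadderMan, a system that uses a two-stage learning pipeline and vision foundation models to let humanoid robots robustly climb diverse ladders and perform manipulation while on the ladder.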

Abstract

Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at https://ladderman-robot.github.io.

2606.05868 2026-06-05 cs.CL

YouZhi: Towards High-Concurrency Financial LLMs via Adaptive GQA-to-MLA Transition

PSBC LLM Team, Huawei LLM Team, Ruihan Long, Junjie Wu, Tianan Zhang, Duo Zhang, Yaozong Wu, Jinbin Fu, Chang Liu, Zhentao Tang, Wenshuang Yang, Xin Wang, Zhihao Song, Ning Huang, Wenjing Xu, Shuai Zong, Shupei Sun, Sen Wang, Jing Hu, Bin Wang, Xinyu Wang, Junkui Ju, Zequn Ding, Jie Ran, Man Luo, Shixiong Kai, Linkai Hou, Kaichao Liang, Hu Zhao, Yang Zhao, Shucheng Lin, Wei Yu, Chenghan Jiang, Jingjing Ding, Jiahui Zhang, Tian Jin, Yuhang Zhang, Dong Guo, Wei Sun, Jun Xie, Jianwei Li, Lei Cao, Pei Li, Jiabin Li, Jia Yuan, Rui Yuan, Jing Zhu, Mingxuan Yuan, Zhangcheng Lv, Xin Jiang, Xiuhong Fei, Xiaozhe Ren, Yulong Li, Zhipeng Zhang, Hang Wang, Zhaohui Xu, Rui Zhao, Yibo He, Xinzhuang Niu

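Affiliations: Postal Savings Bank of China & Huawei LLM Team; Postal Savings Bank of China; Huawei Technologies

AI summary: Proposes YouZhi-LLM, which uses a layer-adaptive GQA-to-MLA transition framework and an Ascend-based training pipeline to substantially compress the KV cache and improve high-concurrency inference efficiency in the financial domain.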

Abstract

Large language models (LLMs) drive significant financial innovations, yet their high-concurrency deployment is severely bottlenecked by KV cache memory overhead, which inflates infrastructure costs and throttles scalability. To address this, we propose YouZhi-LLM, a highly efficient financial LLM empowered by a comprehensive structural transition and training pipeline natively built on the Huawei Ascend ecosystem. At its algorithmic core, YouZhi-LLM features a layer-adaptive GQA-to-MLA transition framework that dynamically assigns per-layer FreqFold sizes, maximizing KV-cache compression while minimizing perplexity degradation. To recover representation capacity and inject domain expertise, the Ascend-based training pipeline seamlessly integrates generalized knowledge distillation with financial-specific supervised fine-tuning. Evaluations demonstrate the superiority of this systematic approach, with the adaptive transition reducing perplexity degradation by up to 35% over uniform baselines. Crucially, when evaluated on Ascend NPUs via vLLM-Ascend, the massive KV-cache reduction translates directly into deployment efficiency. Compared to their respective base models, YouZhi-7B yields a 12.3% improvement in average financial benchmark score alongside a 2.69× increase in maximum concurrency; similarly, YouZhi-14B achieves a 7.0% accuracy gain and a 2.43× concurrency boost, establishing a new paradigm for cost-effective, high-throughput financial inference.
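
The link between KV-cache size and concurrency can be shown with back-of-the-envelope arithmetic. The sketch below compares per-token cache size for grouped-query attention (keys and values for every KV head) against multi-head latent attention (one compressed latent plus a decoupled RoPE key per layer). All dimensions, the 16 GiB budget, and the function names are hypothetical values picked for illustration; they are not YouZhi's configuration and do not model FreqFold.

```python
def gqa_kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """GQA caches one key and one value vector per KV head in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes


def mla_kv_bytes_per_token(latent_dims, rope_dim, dtype_bytes=2):
    """MLA caches a compressed latent plus a RoPE key per layer.

    latent_dims lists the latent width of each layer, so a layer-adaptive
    scheme can be expressed by giving different layers different widths.
    """
    return sum((d + rope_dim) * dtype_bytes for d in latent_dims)


def max_concurrency(cache_budget_bytes, bytes_per_token, seq_len):
    """Number of full-length sequences whose cache fits in the budget."""
    return cache_budget_bytes // (bytes_per_token * seq_len)


# Hypothetical model: 28 layers, 4 KV heads of width 128, fp16 cache.
gqa = gqa_kv_bytes_per_token(n_layers=28, n_kv_heads=4, head_dim=128)

# Hypothetical MLA variant: latent width 256 in every layer, RoPE width 64.
mla = mla_kv_bytes_per_token(latent_dims=[256] * 28, rope_dim=64)

budget = 16 * 2**30  # 16 GiB reserved for the KV cache
gqa_seqs = max_concurrency(budget, gqa, seq_len=4096)
mla_seqs = max_concurrency(budget, mla, seq_len=4096)
```

Under these made-up numbers the per-token cache shrinks from 57,344 to 17,920 bytes, so roughly three times as many 4,096-token sequences fit in the same memory, which is the mechanism behind the concurrency gains the abstract reports.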

2606.05864 2026-06-05 cs.CL

Analysis of the Neglect-Zero Effect in Large Language Models

Jin Tanaka, Daiki Matsuoka, Ryoma Kumon, Hitomi Yanaka

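Affiliations: The University of Tokyo; RIKEN; Tohoku University

AI summary: Using a structural priming paradigm, this study examines whether large language models show the neglect-zero effect as humans do, that is, whether they ignore zero-models that make a proposition vacuously true by virtue of an empty set.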

Comments 14 pages (10 pages main text), 8 figures. To appear in the Proceedings of the ACL2026 Student Research Workshop (SRW)

Abstract

We investigate the extent to which the language processing of LLMs resembles human cognitive processes, focusing on a human cognitive bias called the neglect-zero effect. This effect refers to the human tendency to ignore zero-models, which are configurations that render a proposition vacuously true by virtue of an empty set. We focus on two types of inferences driven by the neglect-zero effect, and examine how LLMs process these inferences by comparing their behavior with that in an inference that does not involve the neglect-zero effect. For this purpose, we employ a paradigm based on structural priming, where recent exposure to a preceding sentence (the prime) facilitates the processing of a subsequent sentence (the target) due to their structural similarity. We prepare primes to force LLMs to consider the zero-model, and analyze whether they also consider it in the target. The results suggest that the neglect-zero effect may not occur in the LLMs analyzed in this study. Our code is available at https://github.com/ynklab/neglect_zero.

2606.05863 2026-06-05 cs.LG cs.AI

Deciphering Two Training Clocks in Grokking via Deep Linear Network Theory with Conditional ReLU Reduction

Hu Tan, Kuo Gai, Shihua Zhang

Affiliations: State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China; School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China; Shanghai Institute for Mathematics and Interdisciplinary Sciences (SIMIS), Shanghai, China; Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Hangzhou 310024, China

AI summary: The paper formalizes grokking by separating the fast decay of the classification loss from the slower simplification of the learned representation, defining "two training clocks", and explains this two-stage process using deep linear network theory and a conditional ReLU reduction.

Abstract

Grokking suggests that fitting the training data and learning a simple underlying rule may occur on different time scales. We formalize this phenomenon by separating the fast decay of the classification loss from the slower simplification of the learned representation, and we call the resulting pair of stopping times two training clocks. For deep linear networks, we show that a post-margin gap-growth or one-step tail-contraction condition reduces the cross-entropy loss to level epsilon on a logarithmic time scale. In contrast, when layerwise weight decay is present, the induced regularization on the end-to-end map can be expressed as a Schatten-type penalty; under a sharp late-time Kurdyka-Lojasiewicz tail, this structural energy closes on a polynomial time scale. The two clocks, therefore, separate fitting from representation simplification. We then explain how the same mechanism can appear in ReLU MLPs. In regions where the activation patterns on the training set remain fixed, the network reduces to a linear model in the active coordinates. In a two-layer ReLU embedding model, chain-rule estimates further show that the classifier head can receive larger effective gradients than the embedding block under controlled downstream norms. This supports a two-stage mechanism in which the classifier fits first, while the representation continues to simplify later. We use modular addition as the main experimental setting. The deep linear theory provides the rigorous core of the analysis. But the ReLU results are formulated as conditional reductions that account for empirical behavior without claiming a global proof for nonlinear training dynamics.
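
The statement that layerwise weight decay induces a Schatten-type penalty on the end-to-end map can be illustrated by a standard identity for deep linear networks. The form below is a sketch under stated assumptions (depth L, hidden widths at least rank(W), a common decay coefficient λ for every layer); the symbols are chosen here for illustration and the paper's exact statement may differ.

```latex
% Depth-L linear network with end-to-end map W = W_L \cdots W_1 and
% singular values \sigma_i(W). Minimizing the layerwise weight decay over
% all factorizations of a fixed W gives a Schatten-(2/L) quasi-norm penalty:
\min_{W_L \cdots W_1 = W} \; \frac{\lambda}{2} \sum_{\ell=1}^{L} \lVert W_\ell \rVert_F^2
  \;=\; \frac{\lambda L}{2} \sum_{i} \sigma_i(W)^{2/L}
  \;=\; \frac{\lambda L}{2} \, \lVert W \rVert_{S_{2/L}}^{2/L}
```

For L = 2 the right-hand side is λ times the nuclear norm of W, so deeper networks apply a progressively more low-rank-favoring penalty to the end-to-end map.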

2606.05859 2026-06-05 cs.CL

TARPO: Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization

Liting Zhang, Shiwan Zhao, Xuyang Zhao, Zichen Xu, Jianye Wang, Qicheng Li

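Affiliations: TMCC, College of Computer Science, Nankai University, Tianjin, China

AI summary: Proposes TARPO, a framework that uses action-routing policy optimization to switch adaptively at each step between discrete token generation and continuous latent reasoning, addressing the way continuous representations limit policy exploration in latent reasoning; experiments show it outperforms existing explicit and latent reasoning baselines.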

Comments 18 pages, 12 figures. Code available at https://github.com/NKU-LITI/TARPO-master

Abstract

Latent reasoning has emerged as a promising alternative to discrete Chain-of-Thought (CoT) in large language models (LLMs), enabling more expressive reasoning by operating over continuous representations. However, the inherently deterministic nature of continuous representations limits policy exploration in reinforcement learning (RL). To address this, we propose TARPO (Token-Wise Latent-Explicit Reasoning via Action-Routing Policy Optimization), a pure RL framework that adaptively switches between discrete token generation and continuous latent reasoning at each step. TARPO introduces a lightweight action head router that observes the current hidden state and samples a routing decision from a binary mode-selection space, preserving the stochasticity of discrete token sampling from the vocabulary. The LLM backbone and router are jointly optimized end-to-end with a shared group-relative advantage signal. Extensive experiments across Qwen2.5 (from 1.5B to 7B) and Llama-3.1-8B backbones demonstrate that TARPO consistently outperforms existing explicit and latent reasoning RL baselines across diverse benchmarks. Further analysis shows that TARPO learns adaptive token-wise switching behaviors while maintaining stable training dynamics. Our code is available at https://github.com/NKU-LITI/TARPO-master.

2606.05858 2026-06-05 cs.CL

ReverseEOL: Improving Training-free Text Embeddings via Text Reversal in Decoder-only LLMs

Ailiang Lin, Zhuoyun Li, Yusong Wang, Keyu Mao, Kotaro Funakoshi, Manabu Okumura

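Affiliations: Institute of Science Tokyo; Tencent

AI summary: Proposes ReverseEOL, which reverses the input text to produce a complementary embedding and combines it with the forward embedding to improve the text representations of frozen decoder-only LLMs, significantly improving training-free baselines on the STS and MTEB benchmarks.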

Abstract

Recent advances in Large Language Models (LLMs) have opened new avenues for generating training-free text embeddings. However, the causal attention in decoder-only LLMs prevents earlier tokens from attending to future context, leading to biased contextualized representations. In this work, we propose Reverse prompting with Explicit One-word Limitation (ReverseEOL), a simple yet effective method for enhancing the representational capability of frozen LLMs. ReverseEOL augments the standard forward embedding with an additional reversed embedding derived from the reversed input text. Since reversing the input exposes each token to context inaccessible in the original order, the resulting reversed embedding effectively provides complementary information to the original one. As a result, combining the forward and reversed embeddings yields a richer final representation. Comprehensive experiments on STS and MTEB benchmarks demonstrate that ReverseEOL significantly improves the performance of existing training-free baselines across a broad range of LLMs with diverse architectures and scales. Extensive ablations and analyses further confirm the necessity of our reversal mechanism.

2606.05857 2026-06-05 cs.CL

Forgive or forget: Understanding the context of hate in audio retrieval systems

Arghya Pal, Sailaja Rajanala, Raphael C. -W. Phan, Shekhar Nayak

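Affiliations: University of California, Berkeley

AI summary: Proposes a post hoc causal debiasing framework that uses a sentiment-controlled mediator to suppress harmful speech while preserving semantic relevance; experiments show consistent toxicity reduction with minimal loss in retrieval accuracy.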

Abstract

Handling toxic retrieval in text-to-audio systems is challenging due to contextual dependencies. Existing strategies (e.g., rephrasing, summarization) risk altering intent or omitting details. We propose a post hoc causal debiasing framework with a sentiment-controlled mediator to preserve semantic relevance while suppressing harmful speech. Our approach is model-agnostic and integrates seamlessly with existing retrieval pipelines. We introduce two variants: Forgive, which re-ranks and filters toxic audio via logit adjustment, and Forget, which generates counterfactual toxic prompts to mitigate harmful retrievals. Experiments show consistent toxicity reduction with minimal loss in retrieval accuracy, improving both safety and reliability.
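
The re-rank-and-filter behavior attributed to the Forgive variant can be pictured with a simple score adjustment. The sketch below is not the paper's causal debiasing method: it subtracts a fixed toxicity penalty from a relevance logit and drops items above a threshold. The field names, `penalty`, and `max_toxicity` are all hypothetical.

```python
def rerank(candidates, penalty=2.0, max_toxicity=0.8):
    """Filter highly toxic items, then sort by penalized relevance.

    candidates: list of dicts with 'id', 'relevance' (a logit) and
    'toxicity' (a score in [0, 1] from some external classifier).
    """
    kept = [c for c in candidates if c["toxicity"] <= max_toxicity]
    return sorted(
        kept,
        key=lambda c: c["relevance"] - penalty * c["toxicity"],
        reverse=True,
    )


candidates = [
    {"id": "a", "relevance": 3.0, "toxicity": 0.9},  # most relevant, but filtered
    {"id": "b", "relevance": 2.5, "toxicity": 0.1},
    {"id": "c", "relevance": 2.8, "toxicity": 0.5},
]
ranked_ids = [c["id"] for c in rerank(candidates)]
```

Here item "c" starts with higher relevance than "b" but ends up ranked below it once the toxicity penalty is applied, showing how a logit adjustment can trade a small amount of relevance for lower toxicity.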

2606.05852 2026-06-05 cs.SD cs.AI eess.AS

UniVoice: A Unified Model for Speech and Singing Voice Generation

Junjie Zheng, Huixin Xue, Shihong Ren, Chaofan Ding, Hao Liu, Zihao Chen

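Affiliations: Giant Network; Shanghai Conservatory of Music

AI summary: Proposes UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. By factorizing the condition into content, melody, and timbre and introducing a null melody token, a single model generates both natural speech and controllable singing.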

Comments 9 pages, 2 figures

Abstract

Text-to-speech (TTS) and singing voice synthesis (SVS) both aim to generate human vocal audio from symbolic inputs, but they impose different requirements on the generation process. Speech generation relies on flexible, language-driven prosody, whereas singing generation requires explicit melody control and accurate rhythmic alignment. This mismatch makes it challenging to train a single model that can generate both natural speech and controllable singing, since melody-related conditions should strongly constrain singing but should not restrict speech prosody. We present UniVoice, a unified speech and singing voice generation framework based on conditional flow matching. Instead of using a single undifferentiated conditioning representation, UniVoice factorizes the condition into content, melody, and timbre, which are encoded by modality-appropriate encoders and consumed by a shared Diffusion Transformer (DiT) backbone. For singing, the melody condition is represented by MIDI note sequences; for speech, it is replaced with a learned null melody token, allowing the model to infer prosody from linguistic and acoustic context. This design preserves explicit melody control for singing while avoiding the need to impose melody constraints on speech. We further analyze the null melody token as an approximation to melody marginalization in the conditional flow. Trained on 30k hours of speech and 35k hours of singing data, UniVoice achieves a speech PER of 5.26%, comparable to dedicated TTS systems such as F5-TTS (5.21%) and CosyVoice3 (5.30%). On singing generation, UniVoice achieves a PER of 16.22%, outperforming the unified baseline Vevo1.5 (24.72%).
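
For readers unfamiliar with conditional flow matching, the training objective in its common straight-line form, together with the factorized condition described above, can be written as follows. This is a generic sketch: the linear interpolation path, the symbol names, and the notation for the null token are assumptions, and the paper's exact formulation may differ.

```latex
\begin{align}
  x_t &= (1 - t)\, x_0 + t\, x_1,
      \qquad x_0 \sim \mathcal{N}(0, I), \quad t \sim \mathcal{U}[0, 1] \\
  \mathcal{L}_{\mathrm{CFM}}(\theta)
      &= \mathbb{E}_{t,\, x_0,\, x_1}
         \bigl\lVert v_\theta(x_t, t, c) - (x_1 - x_0) \bigr\rVert^2 \\
  c &= \bigl(c_{\text{content}},\; c_{\text{melody}},\; c_{\text{timbre}}\bigr),
      \qquad
      c_{\text{melody}} =
      \begin{cases}
        \text{MIDI note sequence} & \text{for singing} \\
        e_{\varnothing} \ \text{(learned null token)} & \text{for speech}
      \end{cases}
\end{align}
```

Here x_1 is the target acoustic feature sequence and v_θ is the DiT backbone. Substituting the learned null token for the melody condition lets the same network be evaluated without melody input, which is what the abstract interprets as an approximation to averaging over melodies.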

2606.05848 2026-06-05 cs.RO

Visuotactile and Explicitly Force-Controlled Robotic Ultrasound for Abdominal Volumetric Reconstruction

Adrian Piedra, R Brooke Jeffrey, Oussama Khatib

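Affiliations: Stanford Robotics Laboratory, Computer Science Department, Stanford University; Department of Radiology, School of Medicine, Stanford University

AI summary: Proposes a robotic ultrasound acquisition system that combines stereo vision, tactile feedback, and expert strategies, uses a force-controlled manipulator for adaptive abdominal scanning, and performs three-dimensional volumetric reconstruction to enhance diagnostic capability.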

Abstract

In this paper, we present a robotic ultrasound acquisition system that integrates stereo vision, touch-based feedback, and expert-informed strategies to perform autonomous and adaptive abdominal scans. The system records freehand motion and force data from expert radiologists, creating a framework to capture transducer motion, applied forces, and anatomical scanning strategies. This expert data is replayed to replicate characteristic scans with the robot, forming a foundation for further autonomous capabilities. Using stereo vision, the system generates three-dimensional topography maps of the patient's abdomen, which are refined through stiffness measurements at key points to delineate the rib cage boundary. These combined techniques enable the robot to execute two distinct scanning paths: an upward-angled sweep beneath the rib cage to visualize structures near the upper abdomen and a perpendicular sweep across soft tissue regions. A compliant, torque-controlled seven degree-of-freedom robotic manipulator is controlled to maintain consistent probe contact through closed-loop force control over the varied anatomical surfaces. Physical experiments demonstrate that the system achieves high-quality imaging comparable to expert scans while dynamically adapting to patient-specific topographies. Furthermore, the robotic system surpasses expert capabilities by enabling three-dimensional volume acquisition, which enhances diagnostic potential and provides volumetric data for advanced analyses. This work highlights the integration of expert knowledge into autonomous robotic systems and underscores the potential of combining perception-based autonomy with physical reasoning for enhanced diagnostic performance.

2606.05847 2026-06-05 cs.AI

Agentic Molecular Recovery via Molecule-Aware Exploration

通过分子感知探索实现智能体分子恢复

Suwan Yoon, Changhee Lee

发表机构 * Department of Artificial Intelligence, Korea University(高丽大学人工智能系)

AI总结 针对文本引导分子生成中无效SMILES问题,提出AMREC方法,通过分子感知失配追踪、扩展候选探索和轨迹级选择,在恢复化学有效性的同时保留目标相关结构线索和分子身份。

Comments Preprint

详情
AI中文摘要

使用LLM进行文本引导的分子生成常常产生无效的SMILES。我们认为,无效草稿应通过从面向有效性的修复转向保持身份的分子恢复来解决:目标不仅是恢复化学有效性,还要保留目标相关的结构线索并恢复描述所暗示的分子身份。这一视角揭示了现有修正策略的局限性。事后修复可以在恢复有效性的同时扭曲关键结构,仅LLM修正可能引入意外的全局漂移,而即使配备了可执行的RDKit编辑工具,通用智能体修正仍受限于贪婪的单候选轨迹。为了解决这些局限性,我们提出了AMREC,它将分子感知失配追踪与扩展候选探索和轨迹级选择相结合。在来自三个骨干模型的无效ChEBI-20草稿上,AMREC在结构、精确匹配和字符串级指标上实现了最强的整体恢复性能。

英文摘要

Text-guided molecular generation with LLMs often yields invalid SMILES. We argue that invalid drafts should be addressed through a shift from validity-oriented repair to identity-preserving molecular recovery: the objective is not only to restore chemical validity, but also to preserve target-relevant structural cues and recover the molecular identity implied by the description. This perspective reveals the limitations of existing correction strategies. Post-hoc repair can recover validity while distorting key structures, LLM-only correction can introduce unintended global drift, and generic agentic correction remains constrained by greedy single-candidate trajectories even when equipped with executable RDKit edit tools. To address these limitations, we propose AMREC, which couples molecule-aware mismatch tracking with expanded candidate exploration and trajectory-level selection. On invalid ChEBI-20 drafts from three backbone models, AMREC achieves the strongest overall recovery profile across structural, exact-match, and string-level metrics.

2606.05846 2026-06-05 cs.CL eess.AS

Towards Truly Multilingual ASR: Generalizing Code-Switching ASR to Unseen Language Pairs

迈向真正的多语言ASR:将代码切换ASR泛化到未见语言对

Gio Paik, Hyunseo Shin, Soungmin Lee

发表机构 * University of Tokyo(东京大学)

AI总结 通过模型合并和领域泛化方法,研究从有限语言对中学到的代码切换能力能否泛化到未见语言对,实验表明双语CS-ASR模型对未见语言对有一定泛化能力但有限。

Comments ICML 2026 Workshop on Machine Learning for Audio

详情
AI中文摘要

自动语音识别(ASR)已成为人机交互的关键技术。然而,由于跨多种语言对的代码切换(CS)语音资源严重稀缺,代码切换ASR(CS-ASR)仍然特别具有挑战性。现有方法主要通过合成CS语音生成或在有限双语数据集上进行特定语言对微调来提高CS-ASR性能。然而,这些方法面临固有的可扩展性限制,因为对CS的支持必须针对语言对单独开发,而语言对的数量随支持的语言数量呈组合增长。在这项工作中,我们研究通过模型合并和领域泛化方法,从一组有限的已见语言对中学到的CS能力是否可以泛化到未见语言对。我们的实验表明,合并的双语CS-ASR模型对未见语言对有一定程度的泛化,表明双语CS能力在语言对之间的迁移有限。

英文摘要

Automatic Speech Recognition (ASR) has become a key technology for human-AI interaction. However, code-switching ASR (CS-ASR) remains particularly challenging due to the severe scarcity of multilingual CS speech resources across diverse language pairs. Existing approaches primarily improve CS-ASR performance through synthetic CS speech generation or pair-specific fine-tuning on limited bilingual datasets. Nevertheless, these approaches face an inherent scalability limitation, as support for CS must be developed separately for language pairs whose number grows combinatorially with the number of supported languages. In this work, we investigate whether CS capabilities learned from a limited set of seen language pairs can generalize to unseen language pairs through model merging and domain generalization methods. Our experiments show that merged bilingual CS-ASR models modestly generalize to unseen language pairs, suggesting limited transfer of bilingual CS capabilities across language pairs.

2606.05843 2026-06-05 cs.CL cs.AI

Mechanistic Insights into Functional Sparsity in Multimodal LLMs via CoRe Heads

多模态大语言模型中通过CoRe头的功能稀疏性机制洞察

Ruoxi Sun, Quantong Qiu, Juntao Li, Zecheng Tang, Yihang Lou, Min Zhang

发表机构 * Soochow University(苏州大学) Peking University(北京大学)

AI总结 通过识别和分析CoRe头,揭示多模态大语言模型在跨模态检索中功能稀疏的结构特性,并验证其必要性及加速推理的潜力。

详情
AI中文摘要

虽然多模态大语言模型(MLLMs)在复杂的视觉-语言任务上表现出卓越的能力,但它们从复杂、嘈杂的上下文中提取与查询相关的视觉特征的机制仍然不透明。在本文中,我们进行了一项深入的可解释性研究,揭示了MLLMs中一个深刻的结构属性:跨模态检索中的功能稀疏性。利用一种称为检索注意力质量(RAM)的令牌级指标,我们识别并描述了一组高度专业化的注意力头,称为上下文感知检索(CoRe)头。在不同的视觉领域和模型规模中,我们观察到明确的功能划分:CoRe头充当专用的信息提取器,而大多数其他头则将注意力分布在更广泛的上下文区域。因果干预进一步证明了这些专业化头的必要性。仅消融前5%的CoRe头就会导致多模态推理性能显著下降,而消融排名较低的头则影响甚微。此外,加速实验验证了CoRe头的实用性,表明利用这种局部稀疏性可以显著加速推理,同时保持稳健的任务性能。我们的发现揭示了MLLMs中功能稀疏性的结构原理,完善了当前对机制可解释性的理解,并为未来的架构设计和模型优化奠定了理论基础。

英文摘要

While Multimodal Large Language Models (MLLMs) demonstrate remarkable proficiency on complex vision-language tasks, the mechanisms by which they extract query-relevant visual features from complex, noisy contexts remain opaque. In this paper, we present an in-depth interpretability study that uncovers a profound structural property within MLLMs: functional sparsity in cross-modal retrieval. Leveraging a token-level metric termed Retrieval Attention Mass (RAM), we identify and characterize a highly specialized subset of attention heads, referred to as Context-aware Retrieval (CoRe) heads. Across diverse visual domains and model scales, we observe a clear functional division: CoRe heads act as dedicated information extractors, while most other heads distribute attention over broader contextual regions. Causal interventions further demonstrate the necessity of these specialized heads. Ablating only the top 5% of CoRe heads causes significant degradation in multimodal reasoning performance, whereas ablating lower-ranked heads has minimal effect. Moreover, acceleration experiments validate the utility of CoRe heads, showing that leveraging this localized sparsity significantly accelerates inference while maintaining robust task performance. Our findings reveal a structural principle of functional sparsity within MLLMs, refining the current understanding of mechanistic interpretability and laying a theoretical foundation that can inspire future architecture design and model optimization.
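The abstract does not give the exact definition of Retrieval Attention Mass, so the following is only a minimal sketch under a stated assumption: RAM for a head is taken to be the share of that head's attention that falls on positions marked as query-relevant. The function names, the head labels, and the toy attention weights are all hypothetical.

```python
import math

def retrieval_attention_mass(attn, relevant):
    """Share of one head's attention that lands on query-relevant positions.

    attn: attention weights of a single head over context tokens (sums to 1).
    relevant: set of positions assumed to hold the query-relevant content.
    """
    return sum(w for i, w in enumerate(attn) if i in relevant)

def top_heads(per_head_attn, relevant, fraction=0.05):
    """Rank heads by attention mass and keep the top `fraction` (at least one)."""
    scores = {h: retrieval_attention_mass(a, relevant)
              for h, a in per_head_attn.items()}
    k = max(1, math.ceil(fraction * len(scores)))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores

# Toy example: 4 heads, 5 context tokens, positions 1 and 2 are relevant.
heads = {
    "L0H0": [0.2, 0.2, 0.2, 0.2, 0.2],
    "L0H1": [0.05, 0.5, 0.4, 0.03, 0.02],
    "L1H0": [0.3, 0.1, 0.1, 0.25, 0.25],
    "L1H1": [0.1, 0.3, 0.2, 0.2, 0.2],
}
selected, scores = top_heads(heads, relevant={1, 2}, fraction=0.05)
```

In this toy setup a single head concentrates most of its attention on the relevant positions while the others spread it out, which mirrors the functional division the abstract describes. An ablation study would then compare task performance with the selected heads masked against masking lower-ranked ones.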

2606.05836 2026-06-05 cs.CL

ProSPy: A Profiling-Driven SQL-Python Agentic Framework for Enterprise Text-to-SQL

ProSPy: 面向企业级Text-to-SQL的剖析驱动的SQL-Python智能体框架

Zhaorui Yang, Huawei Zheng, Sen Yang, Yuhui Zhang, Haoxuan Li, Zhizhen Yu, Xuan Yi, Chen Hou, Defeng Xie, Chao Hu, Minfeng Zhu, Dazhen Deng, Haozhe Feng, Danqing Huang, Yingcai Wu, Peng Chen, Wei Chen

发表机构 * State Key Lab of CAD&CG(计算机辅助设计与图形学国家重点实验室) School of Software Technology(软件技术学院) Tencent TEG(腾讯技术工程事业群) School of Mathematical Sciences, Peking University(北京大学数学科学学院) Zhejiang University(浙江大学)

AI总结 提出ProSPy框架,通过自动剖析、模式剪枝、中间视图获取和Python分析四阶段,结合SQL高效性与Python灵活性,解决企业级数据库Text-to-SQL中的模式异构、元数据不完整和复杂分析问题。

Comments 24 pages, 12 figures

详情
AI中文摘要

大型语言模型显著推进了Text-to-SQL系统,但将其应用于企业级数据库仍具挑战。现实数据库通常包含大型异构模式、不完整元数据、方言特定SQL语法以及难以用单个SQL查询解决的复杂分析问题。为应对这些挑战,我们提出ProSPy,一个面向企业级Text-to-SQL的剖析驱动的SQL-Python智能体框架。ProSPy将推理过程分为四个阶段:首先通过自动剖析提取细粒度数据证据,逐步将大型模式剪枝为任务相关上下文,通过方言无关的SQL接口获取中间视图,最后使用Python进行灵活的下游分析。该设计结合了SQL在大型数据库上的高效性与基于Python的分析的灵活性,同时减少了对不可靠元数据的依赖,并提高了跨SQL方言的鲁棒性。在Spider 2.0-Lite和Spider 2.0-Snow上的实验表明,ProSPy在使用开源和专有模型时均持续优于强基线,使用Claude-4.5-Opus时无需多数投票即可达到60.15%和60.51%的执行准确率。进一步分析表明,ProSPy对SQL方言变化具有鲁棒性,并在模式召回率和精确率之间取得了有利的权衡。

英文摘要

Large language models have substantially advanced Text-to-SQL systems, yet applying them to enterprise-scale databases remains challenging. Real-world databases often contain large and heterogeneous schemas, incomplete metadata, dialect-specific SQL syntax, and complex analytical questions that are difficult to solve with a single SQL query. To address these challenges, we propose ProSPy, a Profiling-driven SQL-Python agentic framework for enterprise-scale Text-to-SQL. ProSPy structures the reasoning process into four stages: it first extracts fine-grained data evidence through automatic profiling, progressively prunes large schemas into task-relevant contexts, fetches intermediate views through a dialect-agnostic SQL interface, and finally performs flexible downstream analysis with Python. This design combines the efficiency of SQL over large databases with the flexibility of Python-based analysis, while reducing reliance on unreliable metadata and improving robustness across SQL dialects. Experiments on Spider 2.0-Lite and Spider 2.0-Snow show that ProSPy consistently outperforms strong baselines with both open-source and proprietary models, achieving execution accuracies of 60.15% and 60.51% with Claude-4.5-Opus, without majority voting. Further analysis shows that ProSPy is robust to SQL dialect variations and achieves a favorable trade-off between schema recall and precision.
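The division of labor described above (profile the data, fetch a small intermediate view with SQL, finish the analysis in Python) can be illustrated with a small sketch. It is not ProSPy's implementation: it uses an in-memory SQLite table, skips the schema-pruning stage, and the table, column, and function names are hypothetical.

```python
import sqlite3
from statistics import median

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("east", 10.0), ("east", 30.0), ("west", 5.0), ("west", None)],
)

def profile_column(conn, table, column):
    """Collect simple data evidence instead of trusting schema metadata.

    Identifiers are interpolated directly, so this is only safe for
    trusted table and column names.
    """
    row = conn.execute(
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    return {"rows": row[0], "non_null": row[1], "distinct": row[2]}

# Stage 1 (profiling): discover that `amount` contains NULLs.
profile = profile_column(conn, "orders", "amount")

# Stage 3 (intermediate view): let SQL do the filtering over the full table.
view = conn.execute(
    "SELECT region, amount FROM orders WHERE amount IS NOT NULL"
).fetchall()

# Stage 4 (Python analysis): a median is awkward in plain SQL, easy in Python.
by_region = {}
for region, amount in view:
    by_region.setdefault(region, []).append(amount)
medians = {region: median(values) for region, values in by_region.items()}
```

The profiling step is what tells the later stages to filter out NULL amounts, which is the kind of evidence the abstract says replaces unreliable metadata.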

2606.05833 2026-06-05 cs.CV cs.AI

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

从视频中学习几何表示以实现空间智能多模态大语言模型

Haibo Wang, Lifu Huang

发表机构 * University of California, Davis(加州大学戴维斯分校)

AI总结 提出GeoVR框架,通过从2D视频序列中蒸馏3D几何知识(包括相机姿态、深度图、尺度因子和多尺度3D特征),重塑多模态大语言模型的内部表示以赋予其空间智能,在空间推理基准上达到最先进性能。

详情
AI中文摘要

多模态大语言模型(MLLMs)在2D语义理解方面表现出色,但缺乏内在的3D感知能力,导致其表示无法在视频帧间保持几何和空间一致性。鉴于大规模3D数据的稀缺性,我们提出了GeoVR,一种新颖的框架,仅使用2D视频序列学习几何表示。该方法有效地重构了MLLMs内部的语义潜在空间,以解锁空间智能。GeoVR并非采用浅层的特征混合,而是通过从预训练的3D基础模型中蒸馏几何知识来重塑MLLM的内部表示。这是通过一种多目标学习策略实现的,该策略由四个互补的几何目标驱动:(1)估计帧间相机姿态以嵌入变化的视角动态,(2)回归密集深度图以锚定物理距离,(3)预测度量尺度因子以进行真实世界校准,以及(4)蒸馏多尺度3D特征以对齐中间特征空间。在这些显式的物理和几何约束的引导下,模型的内部表示自然地发展出强大的3D感知能力。在空间推理基准上的大量实验表明,GeoVR实现了最先进的性能,为赋予基础模型空间智能建立了一种新范式。

英文摘要

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

2606.05829 2026-06-05 cs.CV

Gender Artifacts from Art History to Text-to-Image Generation

从艺术史到文本到图像生成中的性别伪影

Piera Riccio, Miriam Doh, Benedikt Höltgen, Noa Garcia, Nanne van Noord

发表机构 * University of Amsterdam(阿姆斯特丹大学) Université Libre de Bruxelles(布鲁塞尔自由大学) Hasso Plattner Institut, University of Potsdam(波茨坦大学哈索·普拉特纳研究所) The University of Osaka(大阪大学)

AI总结 通过提出性别伪影度量(PixelSGA和MaskSGA),研究了艺术风格中性别表征与视觉特征的关系,并发现文本到图像生成模型会放大历史来源中的性别伪影。

详情
AI中文摘要

艺术风格植根于特定的社会历史背景,这些背景编码了社会等级,包括不同的性别建构。然而,在人工智能研究中,风格长期以来被视为一种表面层次的视觉属性:一种应用于内容中性场景的颜色、笔触和纹理的滤镜。我们引入了第一个数据集来研究历史图像和生成图像中性别表征与风格之间的相互作用。StyleGender包含跨越19种艺术风格的74k张图像,包括带有风格和性别注释的艺术历史图像、在受控风格和性别提示下由T2I生成的图像,以及一个语义对齐集,使得可以直接比较艺术史与生成结果。通过提出两种集合性别伪影(SGA)度量(PixelSGA和MaskSGA),在像素级别和构图结构中捕捉性别信号,我们展示了:(1) 性别表征塑造了不同艺术风格的视觉特征,(2) 风格关键词将这些模式带入T2I生成中,(3) 生成模型倾向于放大历史来源中观察到的性别伪影。

英文摘要

Artistic styles are rooted in specific socio-historical contexts that encode social hierarchies, including distinct constructions of gender. Yet in AI research, style has long been treated as a surface-level visual property: a filter of color, brushstroke, and texture applied to otherwise content-neutral scenes. We introduce the first dataset to investigate the interplay between gender representation and style in both historical and generated images. StyleGender comprises 74k images spanning 19 artistic styles: art historical images with style and gender annotations, T2I-generated images under controlled style and gender prompts, and a semantically aligned set enabling direct art history-to-generation comparison. By proposing two Set Gender Artifact (SGA) metrics (PixelSGA and MaskSGA), capturing gender signals at the pixel level and in compositional structure, we show that (1) gender representation shapes visual features across artistic styles, (2) style keywords carry these patterns into T2I generation, and (3) generative models tend to amplify gender artifacts beyond what is observed in historical sources.

2606.05828 2026-06-05 cs.AI cs.CL

Statistical Priors for Implicit Preferences: Decoupling Skill Selection as a Local Harness in Personal Agents

隐式偏好的统计先验:在个人代理中解耦技能选择作为局部调控机制

Zeyu Gan, Huayi Tang, Yong Liu

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学人工智能学院)

AI总结 针对本地部署的个人代理中隐式用户偏好学习问题,提出一种解耦统计偏好学习与语义意图解析的轻量级架构,通过局部统计结果影响远程LLM的选择决策,显著降低累积遗憾并提高测试准确率。

详情
AI中文摘要

随着大型语言模型(LLM)能力的提升,依赖基于API的远程模型和外部技能的本地部署个人代理成为一种新范式。随着可用技能的快速扩展,使个人代理能够学习并适应隐式用户偏好成为关键挑战。然而,本地部署的限制排除了复杂的集中式选择算法,迫切需要一种轻量级的局部偏好调控机制。本文通过一种严格解耦统计偏好学习与语义意图解析的新颖架构,探索了这种调控机制的实现。具体而言,我们利用局部统计结果来影响和调节远程LLM的选择决策。大量评估表明,我们的解耦方法实现了最低的累积遗憾和最高的测试准确率,显著优于传统的记忆增强型代理。

英文摘要

As Large Language Model (LLM) capabilities advance, locally deployed personal agents relying on API-based remote models and external skills have emerged as a novel paradigm. With the rapid expansion of available skills, enabling personal agents to learn and adapt to implicit user preferences becomes a critical challenge. However, local deployment constraints preclude complex centralized selection algorithms, creating an urgent need for a lightweight local preference harness. This paper explores the implementation of such a harness through a novel architecture that strictly decouples statistical preference learning from semantic intent parsing. Specifically, we leverage localized statistical results to influence and modulate the selection decisions of the remote LLM. Extensive evaluations demonstrate that our decoupled approach achieves the lowest cumulative regret and highest test accuracy, significantly outperforming traditional memory-augmented agents.

2606.05817 2026-06-05 cs.LG cs.AI

Consistency Training Along the Transformer Stack

沿Transformer堆栈的一致性训练

Sukrati Gautam, Neil Shah, Arav Dhoot, Bryan Maruyama, Caroline Wei, Rohan Kapoor, Robert Sidey, Prakhar Gupta, Zi Cheng Huang, David Demitri Africa

发表机构 * Purdue University(普渡大学) Independent(独立) Columbia University(哥伦比亚大学) University of California, San Diego(加州大学圣地亚哥分校) University of California, Los Angeles(加州大学洛杉矶分校) Dartmouth College(达特茅斯学院) University of Michigan, Ann Arbor(密歇根大学安娜堡分校)

AI总结 本文通过引入MLP状态和注意力分布的一致性目标,将一致性训练扩展到多种安全威胁,并发现跨威胁泛化及共享机制,证明其作为灵活对齐框架的有效性。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

一致性训练鼓励模型在不同上下文中表现相似,并已显示出减少对齐问题的潜力。我们以两种方式扩展一致性训练的范围。首先,我们引入两个新的内部一致性目标:MLP一致性训练(MLPCT),匹配激活后的MLP状态;以及注意力一致性训练(AttCT),匹配每个头的注意力分布。其次,我们将一致性训练应用于四种额外的安全威胁:角色上下文学习攻击、对抗性挫败、预填充攻击和条件性对齐错误。在多个模型和威胁设置中,我们发现一致性训练减少对齐问题的效果远不限于先前工作所研究的谄媚和越狱设置。我们还发现了跨威胁泛化的案例,即针对一种失败模式的训练提高了对另一种模式的鲁棒性,并识别了ACT、MLPCT和AttCT共享的残差流机制,同时将BCT区分为机制上不同的方法。我们的结果表明,一致性训练是一个灵活且可扩展的对齐框架,能够统一防御更广泛的模型病理类别。

英文摘要

Consistency training encourages models to behave similarly across different contexts, and has shown promise for reducing misalignment. We broaden the scope of consistency training in two ways. First, we introduce two new internal consistency targets: MLP Consistency Training (MLPCT), which matches post-activation MLP states, and Attention Consistency Training (AttCT), which matches per-head attention distributions. Second, we apply consistency training to four additional safety threats: persona in-context learning attacks, adversarial frustration, prefill attacks, and conditional misalignment. Across several models and threat settings, we find that consistency training reduces misalignment well beyond the sycophancy and jailbreak settings studied in prior work. We also find cases of cross-threat generalization, where training against one failure mode improves robustness to another, and identify a shared residual-stream mechanism underlying ACT, MLPCT, and AttCT, while distinguishing BCT as mechanistically distinct. Our results suggest that consistency training is a flexible and extensible framework for alignment, capable of unifying defenses against a broader class of model pathologies.
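As a rough illustration of what "matching per-head attention distributions" could look like, the sketch below averages a KL divergence over heads between attention on a clean prompt and on a perturbed one. The abstract does not state which divergence AttCT uses or how positions are aligned, so the KL choice, the assumption that both prompts share token positions, and all names here are hypothetical.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same positions."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def attention_consistency_loss(clean_heads, perturbed_heads):
    """Mean per-head KL between attention on the clean and perturbed prompt."""
    losses = [kl_divergence(p, q) for p, q in zip(clean_heads, perturbed_heads)]
    return sum(losses) / len(losses)

# Two heads attending over three aligned token positions.
clean = [[0.7, 0.2, 0.1], [0.1, 0.1, 0.8]]
perturbed = [[0.2, 0.7, 0.1], [0.1, 0.1, 0.8]]

same = attention_consistency_loss(clean, clean)
shifted = attention_consistency_loss(clean, perturbed)
```

The loss is zero when the perturbation leaves every head's attention unchanged and grows as heads redirect attention, so minimizing it pushes the model to attend the same way in both contexts.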

2606.05814 2026-06-05 cs.LG

Robust and sparse support vector machine via hybrid truncated loss for supervised classification

基于混合截断损失的鲁棒稀疏支持向量机用于监督分类

Yuliang Yang, Chen Chen, Yuxiang Liu, Huiru Wang

发表机构 * School of Science, Beijing Forestry University(北京林业大学理学院) Translational Cancer Research Center, Peking University First Hospital(北京大学第一医院转化肿瘤研究中心)

AI总结 提出一种稀疏且有界的混合截断损失函数L_ht,构建L_ht-SVM模型用于单视图分类,并扩展为多视图MvL_ht-SVM,通过P-平稳点和交替方向乘子法实现高效优化,实验表明在准确率、稀疏性和鲁棒性上优于对比方法。

详情
AI中文摘要

支持向量机(SVM)是一种广泛使用的分类器,但选择合适的损失函数仍然困难。凸损失如hinge损失和最小二乘损失对异常值敏感,而有界非凸损失通常导致高计算成本。为解决这一问题,我们提出一种混合截断损失函数($L_{\mathrm{ht}}$),该函数既稀疏又有界,并构建了用于单视图分类的$L_{\mathrm{ht}}$-SVM模型。我们引入P-平稳点,并利用它建立一阶必要和充分最优性条件。基于这些条件,我们设计了一种带有工作集策略的交替方向乘子法,降低了计算成本并实现了全局收敛。我们进一步通过添加结构信息和视图权重将$L_{\mathrm{ht}}$-SVM扩展到多视图学习,得到Mv$L_{\mathrm{ht}}$-SVM,该方法遵循共识和互补原则。在合成、真实世界和图像数据集上的实验表明,$L_{\mathrm{ht}}$-SVM在准确率更高、支持向量更少和噪声鲁棒性更好方面优于五种单视图方法,而Mv$L_{\mathrm{ht}}$-SVM在准确率、精确率、召回率和F1分数上优于六种多视图方法。

英文摘要

The support vector machine (SVM) is a widely used classifier, but choosing an appropriate loss function remains difficult. Convex losses such as the hinge loss and least-squares loss are sensitive to outliers, while bounded non-convex losses often lead to high computational cost. To address this, we propose a hybrid truncated loss function ($L_{\mathrm{ht}}$) that is both sparse and bounded, and build the $L_{\mathrm{ht}}$-SVM model for single-view classification. We introduce the P-stationary point and use it to establish the first-order necessary and sufficient optimality conditions. Based on these conditions, we design an alternating direction method of multipliers with a working-set strategy that reduces computational cost and achieves global convergence. We further extend $L_{\mathrm{ht}}$-SVM to multi-view learning by adding structural information and view weights, resulting in Mv$L_{\mathrm{ht}}$-SVM, which follows both the consensus and complementarity principles. Experiments on synthetic, real-world, and image datasets show that $L_{\mathrm{ht}}$-SVM achieves higher accuracy with fewer support vectors and better noise robustness than five single-view methods, while Mv$L_{\mathrm{ht}}$-SVM outperforms six multi-view methods in accuracy, precision, recall, and F1-score.
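The abstract does not define $L_{\mathrm{ht}}$, so the sketch below does not reproduce it. It only contrasts the standard hinge loss with a generic truncated (ramp-style) hinge loss to show why boundedness limits the influence of outliers; the cap value and function names are illustrative assumptions.

```python
def hinge_loss(margin):
    """Standard hinge loss on the margin y * f(x); unbounded for large errors."""
    return max(0.0, 1.0 - margin)

def truncated_hinge_loss(margin, cap=2.0):
    """Hinge loss clipped at `cap`, so a single outlier contributes at most `cap`."""
    return min(cap, hinge_loss(margin))

# Margins: confidently correct, inside the margin, misclassified, extreme outlier.
margins = [2.0, 0.5, -1.0, -10.0]
hinge = [hinge_loss(m) for m in margins]
truncated = [truncated_hinge_loss(m) for m in margins]
```

Both losses are exactly zero for well-classified points, which is the source of sparsity in the support vectors, but only the truncated version stops growing for the extreme outlier. The clipping makes the loss non-convex, which is why a specialized solver such as the ADMM scheme mentioned above becomes necessary.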

2606.05806 2026-06-05 cs.AI

When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents

当工具失效时:LLM智能体动态重规划与异常恢复的基准测试

Dongsheng Zhu, Xuchen Ma, Yucheng Shen, Xiang Li, Yukun Zhao, Shuaiqiang Wang, Lingyong Yan, Dawei Yin

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) East China Normal University(华东师范大学) Soochow University(苏州大学) Shandong University(山东大学) Baidu Inc.(百度公司)

AI总结 本文提出ToolMaze基准,通过有向无环图拓扑复杂度和工具扰动分类法,评估LLM智能体在工具失效时的动态重规划与错误恢复能力,发现模型对隐式语义故障的恢复率下降约37%,且智能体容错性随模型规模增长的速度远慢于基本任务执行。

详情
AI中文摘要

现有基准在理想化的“快乐路径”上评估LLM中的工具集成推理(TIR),很大程度上忽视了现实中的工具故障。我们引入ToolMaze,一个用于TIR智能体动态路径发现和错误恢复的基准。为了将系统性重规划与盲目试错区分开来,ToolMaze采用二维设计:基于DAG的拓扑复杂度和一个$2 \times 2$的工具扰动分类法(显式/隐式,瞬态/永久)。评估表明,扰动几乎在所有模型上降低了性能,在隐式语义故障下下降最为剧烈。由于对受损输出的系统性过度信任,这些场景中的扰动恢复率(PRR)骤降约37%,而复杂拓扑将智能体困在徒劳的试错循环中。关键的是,智能体容错性随模型规模增长的速度比基本任务执行慢$3.66\times$,凸显了动态重规划作为一个独立瓶颈,无法通过模型缩放或提示工程解决。数据和代码见https://github.com/Zhudongsheng75/ToolMaze。

英文摘要

Existing benchmarks evaluate Tool-Integrated Reasoning (TIR) in LLMs on idealized "happy paths", largely overlooking real-world tool failures. We introduce ToolMaze, a benchmark for dynamic path discovery and error recovery in TIR agents. To separate systematic replanning from blind trial-and-error, ToolMaze adopts a two-dimensional design: DAG-based topological complexity and a $2 \times 2$ taxonomy of tool perturbations (explicit/implicit, transient/permanent). Evaluations show that perturbations degrade performance across nearly all models, with the sharpest drops under implicit semantic failures. Driven by systemic over-trust in corrupted outputs, Perturbation Recovery Rate (PRR) plummets by around 37% in these scenarios, while complex topologies trap agents in futile trial-and-error loops. Crucially, agentic fault-tolerance improves with model scale $3.66\times$ slower than basic task execution, highlighting dynamic replanning as a distinct bottleneck unaddressed by model scaling or prompting. Data and code are available at https://github.com/Zhudongsheng75/ToolMaze.

2606.05805 2026-06-05 cs.AI

From Risk Classification to Action Plan Remediation: A Guardrail Feedback Driven Framework for LLM Agents

从风险分类到行动方案修复:一种基于护栏反馈驱动的LLM代理框架

Yuhao Sun, Jiacheng Zhang, Shaanan Cohney, Zhexin Zhang, Feng Liu, Xingliang Yuan

发表机构 * The University of Melbourne(墨尔本大学) Tsinghua University(清华大学)

AI总结 提出TRIAD框架,通过护栏生成的言语反馈引导代理在规划步骤中保持良性目标,实现安全与效用的最佳平衡。

Comments 32 pages

详情
AI中文摘要

基于LLM的护栏通常通过在执行前评估提议的行动或输入来保护代理,产生安全信号,如二元允许/拒绝决策、风险类别和/或关于潜在政策违规的解释性理由。然而,当原本良性的任务被不可信的外部内容、不安全的指令或风险工具使用污染时,代理风险常常出现。现有护栏通常将整个任务统一标记为不安全,从而阻止威胁但牺牲了良性部分。此外,现有工作大多孤立地评估护栏,不清楚其干预是否导致更安全的下游代理行为。为解决此问题,我们引入TRIAD(三方响应用于迭代代理护栏),一个护栏集成代理框架,利用护栏生成的言语反馈作为引导信号,使代理在每个规划步骤中保持与良性目标一致。我们在自策训练数据集上微调语言模型,输出三种决策之一:继续、拒绝或更新,并附带结构化的自然语言反馈。更新不仅允许或阻止执行,还指导代理修改其计划,避免有害组件,并尽可能保留良性任务。TRIAD将此反馈注入代理的上下文,实现后续计划修订,并在护栏反馈与代理规划之间形成闭环。在ASB和AgentHarm上的大量实验表明,TRIAD将平均攻击成功率降低至10.42%,同时在护栏集成基线中实现了最佳的安全-效用权衡。我们的代码可在https://github.com/YUHAOSUNABC/TRIAD获取。

英文摘要

LLM-based guardrails typically safeguard agents by evaluating proposed actions or inputs before execution, producing safety signals such as binary allow/deny decisions, risk categories, and/or explanatory rationales about potential policy violations. However, agent risks often arise when otherwise benign tasks are contaminated by untrusted external content, unsafe instructions, or risky tool use. Existing guardrails often flag the entire task uniformly as unsafe, thereby blocking the threat but sacrificing the benign part. Moreover, existing work largely evaluates guardrails in isolation, leaving unclear whether their interventions lead to safer downstream agent behavior. To address this, we introduce TRIAD (Tripartite Response for Iterative Agent Guardrailing), a guardrail-integrated agent framework that leverages guardrail-generated verbal feedback as a guiding signal to keep the agent aligned with benign objectives at each planning step. We finetune a language model on a self-curated training dataset to output one of three decisions: proceed, refuse, or update, together with structured natural-language feedback. Rather than merely allowing or blocking execution, update guides the agent to revise its plan, avoid harmful components, and preserve the benign task where possible. TRIAD injects this feedback into the agent's context, enabling subsequent plan revision and forming a closed loop between guardrail feedback and agent planning. Extensive experiments on ASB and AgentHarm show that TRIAD reduces the average attack success rate to 10.42%, while achieving the best safety-utility trade-off among guardrail-integrated baselines. Our code is available at: https://github.com/YUHAOSUNABC/TRIAD.
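The closed loop between guardrail feedback and agent planning can be sketched as a small control loop. This is a toy illustration rather than TRIAD itself: the paper fine-tunes a language model as the guardrail, whereas the keyword rule, the `exfiltrate` marker, and every function name below are hypothetical stand-ins.

```python
from enum import Enum

class Decision(Enum):
    PROCEED = "proceed"
    REFUSE = "refuse"
    UPDATE = "update"

RISKY = "exfiltrate"  # hypothetical marker for a harmful plan step

def toy_guardrail(plan):
    """Return one of three decisions plus natural-language feedback."""
    risky = [step for step in plan if RISKY in step]
    if not risky:
        return Decision.PROCEED, ""
    if len(risky) == len(plan):
        return Decision.REFUSE, "Every step is unsafe."
    return Decision.UPDATE, "Remove these steps and keep the rest: " + "; ".join(risky)

def toy_agent_revise(plan, feedback):
    """Stand-in for an LLM agent re-planning with the feedback in its context."""
    return [step for step in plan if step not in feedback]

def guarded_planning(plan, max_rounds=3):
    for _ in range(max_rounds):
        decision, feedback = toy_guardrail(plan)
        if decision is Decision.PROCEED:
            return plan
        if decision is Decision.REFUSE:
            return []
        plan = toy_agent_revise(plan, feedback)
    return []  # give up if the plan never becomes acceptable

final_plan = guarded_planning(["summarize inbox", "exfiltrate passwords"])
```

A binary allow/deny guardrail would have rejected the whole plan here. The `UPDATE` branch is what lets the benign step survive while the harmful one is dropped.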

2606.05804 2026-06-05 cs.CL

Can LLMs Be Constrained to the Past? Improving Knowledge Cutoff through Recall-Based Prompting

LLMs 能否被约束到过去?通过基于回忆的提示改进知识截止

Michiro Asai, Ailiang Lin, Yu Kishimoto, Takao Obi, Satoshi Kosugi, Kotaro Funakoshi, Manabu Okumura

发表机构 * Institute of Science Tokyo(东京科学大学)

AI总结 提出两种基于回忆的提示策略(Self-Recall 和 Question-Recall)来改进大语言模型在知识截止约束下的表现,在反事实问题上尤其有效,并构建了多截止历史事件基准(MHEB)进行鲁棒性评估。

详情
AI中文摘要

提示知识截止指令大语言模型(LLM)表现得好像指定截止日期之后的信息不可用。然而,先前的工作主要依赖于直接答案生成,当截止后的知识未被明确查询而仅与问题存在因果关系时,这种方法难以应对。为了解决这一限制,我们提出了两种基于回忆的提示策略:Self-Recall(SR),要求模型重述其截止约束;以及 Question-Recall(QR),要求模型回忆在截止日期下有效的问题相关信息。在三个现有基准上,我们的方法优于直接答案提示和传统的逐步推理基线,在反事实问题上尤其有显著改进。为了研究不同截止设置下的鲁棒性,我们进一步构建了多截止历史事件基准(MHEB),该基准在多个截止年份下评估同一问题。结果表明,知识截止性能随截止距离变化,而结合 SR 和 QR 始终能获得最佳性能。

英文摘要

Prompted knowledge cutoff instructs a large language model (LLM) to act as if information beyond a specified cutoff date were unavailable. However, prior work mainly relies on direct-answer generation, which struggles when post-cutoff knowledge is not explicitly queried but is only causally related to the question. To address this limitation, we propose two recall-based prompting strategies: Self-Recall (SR), which asks the model to restate its cutoff constraint, and Question-Recall (QR), which requires the model to recall question-relevant information valid under the cutoff. Across three existing benchmarks, our methods outperform both direct-answer prompting and conventional step-by-step reasoning baselines, with particularly strong improvements on counterfactual questions. To investigate robustness across different cutoff settings, we further construct the Multi-cutoff Historical Event Benchmark (MHEB), which evaluates the same question under multiple cutoff years. Results show that knowledge cutoff performance varies with cutoff distance, while combining SR and QR consistently yields the best performance.

2606.05800 2026-06-05 cs.LG

SALT: When More Rollouts Don't Help in Group-Based Policy Optimization and How to Make Them Matter

SALT: 当更多 rollout 在基于组的策略优化中无益时如何使其发挥作用

Powei Chang, Jinpeng Zhang, Chaoqun Sun, MiniWell Tsao, Lianrui Li, Jianxiang Xiang, Chenyu Wang, Yukang Gao, Dongying Kong

发表机构 * Bilibili Inc.(哔哩哔哩公司) Fudan University(复旦大学) Zhejiang University(浙江大学)

AI总结 针对 GRPO 风格组归一化中增加 rollout 数量导致梯度抵消的问题,提出 SALT 组件,通过子空间自适应重加权组相对更新系数,改善更新几何并提升性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)通常采用 GRPO 风格的组相对更新,为每个提示采样多个 rollout 以构建归一化学习信号。然而,仅仅增加 rollout 数量并不能可靠地增强学习:在 GRPO 风格组归一化下,每个 rollout 的策略梯度特征可能集中到低秩、有符号的几何结构中,导致聚合时大量抵消,削弱有效更新。我们通过 SALT(子空间自适应几何插件组件)解决这种失效模式,该组件利用样本梯度几何对组相对更新的系数进行重新加权。SALT 从小批量 Gram 几何中估计主导共享子空间,将组相对系数分解为共享通道和残差通道,并在符号抵消严重时自适应放大残差通道。在多种推理导向的 RLVR 基准和模型规模上,SALT 在不修改奖励模型或 rollout 采样过程的情况下,改善了有效更新几何和性能。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) often adopts GRPO-style group-relative updates, sampling multiple rollouts per prompt to construct normalized learning signals. However, merely increasing the number of rollouts does not reliably strengthen learning: under GRPO-style group normalization, per-rollout policy-gradient features can concentrate into a low-rank, signed geometry, causing substantial cancellation during aggregation and weakening the effective update. We address this failure mode with SALT, a Subspace-Adaptive geometry pLug-in componenT that uses sample-wise gradient geometry to reweight the coefficients of group-relative updates. SALT estimates a dominant shared subspace from the mini-batch Gram geometry, decomposes group-relative coefficients into shared and residual channels, and adaptively amplifies the residual channel when signed cancellation is severe. Across diverse reasoning-oriented RLVR benchmarks and model scales, SALT improves effective update geometry and performance without modifying the reward model or the rollout sampling procedure.
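The cancellation effect can be seen in a few lines. The sketch below computes GRPO-style group-normalized coefficients (reward minus group mean, divided by group standard deviation) and aggregates them against per-rollout feature vectors. The extreme case where every rollout shares the same feature vector is a deliberately simplified assumption, and the code illustrates the failure mode only, not SALT's reweighting.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's rollout group (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def aggregate_update(coeffs, features):
    """Sum of coefficient-weighted per-rollout feature vectors."""
    dim = len(features[0])
    return [sum(c * f[d] for c, f in zip(coeffs, features)) for d in range(dim)]

# Eight rollouts for one prompt with binary verifiable rewards.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
adv = group_relative_advantages(rewards)

# Rank-1 extreme: every rollout has the same gradient feature vector.
shared_features = [[1.0, 2.0]] * len(rewards)
update = aggregate_update(adv, shared_features)
```

Because the normalized coefficients are signed and sum to zero, any component of the gradient features that is shared across rollouts cancels in the aggregate. Adding more rollouts with the same shared direction therefore adds nothing, and only the residual, rollout-specific directions carry signal, which is the channel the abstract says SALT amplifies.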

2606.05799 2026-06-05 cs.LG cs.CL

CaliDist: Calibrating Large Language Models via Behavioral Robustness to Distraction

CaliDist: 通过抗干扰行为鲁棒性校准大型语言模型

Mohammad Anas Jawad, Cornelia Caragea

发表机构 * Cornelia Caragea(卡伦·卡雷亚) Mohammad Anas Jawad(穆罕默德·安斯·贾瓦德)

AI总结 提出CaliDist方法,通过测量和惩罚模型对语义干扰的敏感性来校准LLM,在7个NLU基准上平均将ECE从23%降至7%。

详情
AI中文摘要

现有的大型语言模型(LLM)校准方法常常忽略可信度的一个关键维度:模型对无关或误导信息的行为鲁棒性。在本文中,我们认为模型的真实置信度应反映其在认知压力下的稳定性。我们引入CaliDist,一种新颖的事后校准方法,直接测量并惩罚模型对干扰的敏感性。CaliDist量化了当输入提示被语义干扰项扰动时,LLM的预测和不确定性如何变化。然后利用这种稳定性(或不稳定性)信号来自适应地缩放模型的初始置信度分数。我们在六个不同LLM的七个自然语言理解分类基准上进行的广泛实验表明,与强基线相比,CaliDist一致地实现了更低的期望校准误差(ECE)和Brier分数。值得注意的是,我们的方法平均将ECE从23%降至7%(相对改进70%),表明行为稳定性是校准的有力信号。我们在github.com/m-anas-j/CaliDist提供代码和数据集。

英文摘要

Existing calibration methods for Large Language Models (LLMs) often overlook a critical dimension of trustworthiness: a model's behavioral robustness to irrelevant or misleading information. In this paper, we argue that a model's true confidence should reflect its stability under cognitive pressure. We introduce CaliDist, a novel post-hoc calibration approach that directly measures and penalizes a model's susceptibility to distraction. CaliDist quantifies how an LLM's predictions and uncertainty change when its input prompt is perturbed with semantic distractors. This stability (or lack thereof) signal is then used to adaptively scale the model's initial confidence score. Our extensive experiments on seven Natural Language Understanding classification benchmarks using six distinct LLMs show that CaliDist consistently achieves lower Expected Calibration Error (ECE) and Brier Score compared with strong baselines. Remarkably, our method reduces the ECE from 23% to 7% on average, a relative improvement of 70%, demonstrating that behavioral stability is a powerful signal for calibration. We make our code and datasets available at github.com/m-anas-j/CaliDist.
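For readers unfamiliar with the two reported metrics, here is a minimal sketch of Expected Calibration Error with equal-width confidence bins and of the Brier score on binary correctness labels. The bin count and binning convention are common defaults and are assumptions here, not details taken from the paper. The last line checks the relative improvement quoted above.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and mean confidence per bin."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        ece += (len(b) / n) * abs(accuracy - avg_conf)
    return ece

def brier_score(confidences, correct):
    """Mean squared difference between confidence and the 0/1 outcome."""
    return sum((c - y) ** 2 for c, y in zip(confidences, correct)) / len(confidences)

# An overconfident model: 90% confidence but only 3 of 4 answers correct.
conf = [0.9, 0.9, 0.9, 0.9]
hits = [1, 1, 1, 0]
ece = expected_calibration_error(conf, hits)
brier = brier_score(conf, hits)

# Relative ECE reduction reported in the abstract: from 23% to 7%.
relative_improvement = (0.23 - 0.07) / 0.23
```

The relative improvement works out to roughly 0.696, consistent with the 70% figure in the abstract. A post-hoc method in the spirit described above would lower the confidence values for inputs whose predictions change under distractors before these metrics are computed.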

2606.05793 2026-06-05 cs.CL cs.AI cs.CY cs.LG

CollabBench: Benchmarking and Unleashing Collaborative Ability of LLMs with Diverse Players via Proactive Engagement

CollabBench: 通过主动参与与多样化玩家基准测试和释放LLMs的协作能力

Hong Qian, Yuanhao Liu, Zihan Zhou, Zongbao Zhang, Hanjie Ge, Haotian Shi, Liang Dou, Xiangfeng Wang, Jingwen Yang, Aimin Zhou

发表机构 * Shanghai Institute of AI for Education(上海人工智能教育研究院) School of Computer Science(计算机科学学院) East China Normal University(华东师范大学) Tencent Inc.(腾讯公司) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出CollabBench基准,通过多样化玩家模拟和协作智能体训练范式,提升LLM在合作游戏中的任务效率和情感适应能力。

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管基于LLM的智能体在个体任务上表现出色,但与真实人类伙伴的有效协作仍然具有挑战性。现有的对话级协作研究大多缺乏基于交互和行为执行,这促使需要能够实现情境化和沉浸式协作的合作游戏环境。为此,本文提出了CollabBench,一个用于评估和训练合作游戏中协作智能体的基准。CollabBench具有多样化玩家档案模拟管道,用于建模不同的玩家行为,以及一种协作智能体训练范式,通过智能体展开统一推理、沟通和行动,并使用混合奖励优化任务效率和情感适应。我们进一步将经典环境扩展到CWAH-MultiPlayer和Cook-MultiPlayer,以在多样化个性下进行系统评估。使用效率和情感指标的实验表明,我们训练的模型优于基础模型,效率提高了19.5%,情感表现提高了24.4%。进一步分析揭示了现有模型的关键协作局限性,并为未来的协作训练提供了见解。

英文摘要

While LLM-based agents excel at individual tasks, effective collaboration with realistic human partners remains challenging. Most existing conversation-level collaborative studies lack grounded interaction and behavioral execution, motivating the need for cooperative game environments that enable contextualized and immersive collaboration. To this end, this paper proposes CollabBench, a benchmark for evaluating and training collaborative agents in cooperative games. CollabBench features a Diverse Player Profile Simulation pipeline to model varied player behaviors, and a Collaborative Agentic Training paradigm that unifies reasoning, communication, and action via agentic rollouts, optimized with a hybrid reward balancing task efficiency and affective adaptation. We further extend classic environments to CWAH-MultiPlayer and Cook-MultiPlayer for systematic evaluation under diverse personalities. Experiments with efficiency and affective metrics show that our trained models outperform base models, achieving 19.5% higher efficiency and 24.4% improved affective performance. Further analysis reveals key collaborative limitations of existing models and offers insights for future collaborative training.

2606.05792 2026-06-05 cs.AI cs.LG cs.LO cs.SE

Can LLMs Write Correct TLA+ Specifications? Evaluating Natural-Language-to-TLA+ Generation

LLM 能写出正确的 TLA+ 规范吗?自然语言到 TLA+ 生成的评估

Arslan Bisharat, Brian Ortiz, Eric Spencer, Khushboo Bhadauria, TaiNing Wang, George K. Thiruvathukal, Konstantin Laufer, Mohammed Abuhamad

发表机构 * Department of Computer Science, Loyola University Chicago(芝加哥洛约拉大学计算机科学系)

AI总结 本文首次系统评估基于 LLM 从自然语言合成 TLA+ 规范的能力,发现模型在语义正确性上仅达 8.6%,且成功依赖于渐进式提示,揭示了模型大小与质量无关、代码专用模型表现不佳等关键发现。

Comments 12 pages, 11 tables. Accepted at the 21st International Conference on Software Technologies (ICSOFT 2026); Recommended as Best Paper Award Candidate

详情
AI中文摘要

TLA+ 已支持亚马逊和微软等公司的工业验证,但从自然语言编写正确的 TLA+ 规范仍需时间和专业知识,这限制了其采用。LLM 显示出潜力,但尚无先前研究衡量它们是否能从自然语言生成语义正确的 TLA+ 规范。本文首次系统评估基于 LLM 的 TLA+ 规范合成。我们的研究在精心策划的 205 个 TLA+ 规范数据集上评估了来自八个系列的 30 个 LLM:四种提示策略下的 25 个开放权重模型(2600 次运行)和少样本提示下的 5 个专有模型(130 次运行),所有结果均由 SANY 解析器和 TLC 模型检查器验证。LLM 达到高达 26.6% 的语法正确性,但仅 8.6% 的语义正确性,成功仅出现在渐进式提示中。结果表明模型大小不能预测质量,例如 DeepSeek r1:8b 在所有策略上优于其 70B 变体,这表明推理对齐对形式语言的重要性。由于主流语言训练的负迁移,代码专用模型始终表现不佳。我们识别出五类重复出现的幻觉,所有幻觉均可追溯到特定的训练数据偏差。这些结果表明,当前 LLM 在没有专家监督的情况下无法生成可靠的 TLA+ 规范。我们发布了评估框架、代码和数据集,以支持可重复性和未来研究。

英文摘要

TLA+ has supported industrial verification at companies such as Amazon and Microsoft, yet writing correct TLA+ specifications from natural language still requires time and expertise, which limits adoption. LLMs show promise, but no prior study measures whether they produce semantically correct TLA+ specifications from natural language. This paper presents the first systematic evaluation of LLM-based TLA+ specification synthesis from natural language. Our study evaluates 30 LLMs across eight families on a curated dataset of 205 TLA+ specifications: 25 open-weight models across four prompting strategies (2,600 runs) and 5 proprietary models under few-shot prompting (130 runs), all validated by the SANY parser and TLC model checker. LLMs achieve up to 26.6% syntactic correctness but only 8.6% semantic correctness, with successes exclusive to progressive prompting. Results show that model size does not predict quality, e.g., DeepSeek r1:8b outperforms its 70B variant across all strategies, which suggests the importance of reasoning alignment for formal languages. Code-specialized models consistently underperform due to negative transfer from mainstream language training. We identify five recurring hallucination categories, all traceable to specific training data biases. These results suggest that current LLMs do not generate reliable TLA+ specifications without expert oversight. We release the evaluation framework, code, and dataset to support reproducibility and future research.