arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.09079 2026-06-10 cs.LG cs.AI Version update

FlashMemory-DeepSeek-V4: Lightning Index Ultra-Long Context via Lookahead Sparse Attention

Yan Wang, Qifan Zhang, Jiachen Yu, Tian Liang, Dongyang Ma, Xiang Hu, Zibo Lin, Chunyang Li, Zhichao Wang, Miao Peng, Nuo Chen, Jia Li, Yujiu Yang, Haitao Mi, Dong Yu

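Affiliations Independent Researchers; Tencent; The Hong Kong University of Science and Technology (Guangzhou); Tsinghua University

AI summary Proposes Lookahead Sparse Attention (LSA), a neural memory indexer built on the DeepSeek-V4 architecture that predicts future context demands and keeps only the critical KV chunks, compressing the physical KV cache in ultra-long-context settings to 13.5% of the full-context baseline while preserving or slightly improving downstream accuracy.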

Comments Technical report. 11 pages. Code and model available at https://github.com/libertywing/FlashMemory-Deepseek-V4 and https://huggingface.co/libertywing/FlashMemory-Deepseek-V4

English abstract

Conventional LLMs keep the full KV cache loaded during decoding, which creates a severe GPU memory bottleneck for ultra-long-context serving. In this report, we propose Lookahead Sparse Attention (LSA), a novel inference paradigm powered by a Neural Memory Indexer built upon the DeepSeek-V4 architecture. Rather than passively attending to all historical tokens, LSA proactively predicts future context demands and keeps only the query-critical KV chunks in GPU memory. We instantiate this architecture via a backbone-free decoupled training strategy: by formulating the indexer as a standard dual-encoder architecture, we train it independently with standard retrieval training frameworks, without ever loading the massive backbone model into GPU memory. We demonstrate that this "less is more" paradigm substantially improves serving efficiency while acting as an effective attention denoiser in tasks that rely on long-term global memory. Across primary long-context evaluation suites (e.g., LongBench-v2, LongMemEval, and RULER), FlashMemory-DeepSeek-V4 (FM-DS-V4) compresses the average physical KV cache footprint to 13.5% of the full-context baseline while consistently preserving or slightly improving downstream accuracy (+0.6% absolute on average). At the extreme 500K scale, FlashMemory reduces the physical KV cache overhead by over 90% without destabilizing the backbone's core reasoning capacities.
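
The chunk-retention idea in this abstract can be illustrated with a toy ranking step. This is a minimal sketch, not the report's implementation: the real indexer is a trained dual encoder, whereas here cosine similarity over hand-written vectors stands in for its scores, and `select_chunks` and `keep_ratio` are hypothetical names.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def select_chunks(query_emb, chunk_embs, keep_ratio):
    """Return the sorted indices of the KV chunks to keep in GPU memory.

    Assumption: one embedding per KV chunk and a fixed fraction kept.
    """
    k = max(1, round(len(chunk_embs) * keep_ratio))
    ranked = sorted(
        range(len(chunk_embs)),
        key=lambda i: cosine(query_emb, chunk_embs[i]),
        reverse=True,
    )
    return sorted(ranked[:k])

# Four toy chunk embeddings; keep the half most similar to the query.
chunk_embs = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
kept = select_chunks([1.0, 0.0], chunk_embs, keep_ratio=0.5)
print(kept)  # [0, 2]
```

A fixed keep ratio is the simplest policy to show; the abstract's 13.5% figure is an average footprint, so the actual system presumably retains a varying number of chunks per query.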

2606.09026 2026-06-10 cs.LG Version update

Structural Grid Descriptors Predict Within-Task Solver Success on ARC-AGI

Ayan Pendharkar

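Affiliations ARC-AGI Report

AI summary Using a conditional mutual information test, finds that structural properties of intermediate grid states predict whether symbolic ARC-AGI solvers succeed or fail within the same task; most of the predictive information lies along a single grid-complexity axis and generalizes across solver architectures.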

English abstract

We ask whether structural properties of intermediate grid states predict whether a symbolic ARC-AGI solver will succeed, framed as a test of conditional mutual information I(X;Y|task) > 0. Across 44,800 runs spanning two architecturally distinct solvers (beam search and Stochastic DFS, SDFS), 400 ARC tasks, 28 configurations per solver, and both training and evaluation splits, hand-crafted grid descriptors measured at 50% trajectory completion discriminate successful from failed runs within the same task (mean within-task best-feature AUC = 0.885, p < 0.001 under within-task label permutation). Most predictive content lies along a single grid-complexity axis. The result generalizes across solver architectures: a feature selected on one solver predicts success on the other with AUC 0.747-0.762 in all four transfer directions (p < 0.001, leakage controlled). On a pre-registered held-out set of 41 reliable tasks, the frozen feature n_components_final achieves AUC = 0.765 (95% CI [0.717, 0.810], p < 0.001), robust under task-clustered bootstrap resampling and cross-solver task collapsing. The signal is not explained by solver capacity (configuration-residualized AUC = 0.927 and 0.896 for beam search and SDFS, p < 0.001) and is only weakly coupled to score trajectories (R^2 approximately 0). Early stopping at 50% completion reduces beam-search compute by 33.6% while retaining 98.9% of solves; degenerate-trajectory detection reduces SDFS compute by 65.3% with no solve loss. Finally, on 229 of 400 evaluation tasks the DSL primitive library produces no valid transition from the input grid. This 0-step collapse is invariant to search budget, and beam search fails on every affected task, indicating a DSL coverage limitation rather than a search-budget effect.
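
The within-task AUC and label-permutation test mentioned above can be sketched for a single task as follows. This is an illustrative version under stated assumptions: the feature values and outcomes are invented, and the paper's exact test statistic and permutation scheme are not given in the abstract.

```python
import random

def auc(scores, labels):
    """Probability that a random successful run outscores a random failed run (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one success and one failure")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def permutation_p(scores, labels, n_perm=2000, seed=0):
    """Two-sided p-value from shuffling labels within one task."""
    rng = random.Random(seed)
    observed = abs(auc(scores, labels) - 0.5)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(auc(scores, shuffled) - 0.5) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Hypothetical descriptor values for six runs of one task (1 = solved).
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
task_auc = auc(scores, labels)          # 8 of 9 success/failure pairs are ordered correctly
task_p = permutation_p(scores, labels)
```

Shuffling labels only inside a task keeps task difficulty fixed, which is what makes the comparison a within-task one.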

2606.08982 2026-06-10 cs.AI Version update

Baichuan-M4: A Clinical-Grade Medical Agent System for Continuous Care

Aiyuan Yang, Canbin Piao, Chengfeng Dou, Da Pan, Dian Wang, Fan Yang, Fei Deng, Fei Li, Guangwei Ai, Hui Liu, Hongda Zhang, Jinyang Tai, Kai Lu, Lijun Liu, Linwei Chen, Linyu Li, Meiqing Guo, Peidong Guo, Qiang Ju, Rihui Xin, Shuai Wang, XinKai Ma, Xudong Chen, Yichuan Mo, Yijie Zhou, Leyi Pan, Yihe Luo, Zian Wang

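Affiliations Baichuan AI; THUBPM Group, Tsinghua University

AI summary Presents Baichuan-M4, a clinical-grade medical large model built as an agent system around three pillars (a unified runtime, a continuous-care reinforcement-learning framework, and a clinical tool layer); it attains leading results across multiple medical evaluations and lowers the hallucination rate to 3.3%.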

English abstract

Baichuan-M4 is Baichuan Intelligence's clinical-grade medical large model, designed for continuous care rather than single-turn medical question answering. It is built as a coordinated medical agent system around three pillars: Baichuan-Harness, a unified runtime that keeps reinforcement-learning training and real-world deployment consistent while enforcing action constraints, tool use, long-term patient memory, and multi-agent coordination; a core reasoning model trained with a continuous-care reinforcement-learning framework that integrates span-level reward modeling (SPAR++), reasoning-path compression, curriculum learning, and stabilized policy optimization; and a clinical tool layer for patient-memory management, authoritative evidence-based retrieval, and multimodal medical perception across documents, X-rays, and dermatology. On a cross-dimensional medical evaluation suite, Baichuan-M4 attains leading results in static medical knowledge and safety, dynamic OSCE-style consultation, long-context clinical memory, evidence-based retrieval, medical document OCR, and multimodal image understanding, while lowering the hallucination rate to 3.3%.

2606.08779 2026-06-10 cs.LG Version update

Reformulate LLM Reinforcement Learning for Efficient Training under Black-box Discrepancy

Jiashun Liu, Runze Liu, Xu Wan, Jing Liang, Hongyao Tang, Ling Pan

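Affiliations Hong Kong University of Science and Technology; Zhejiang University; Tianjin University

AI summary To address the train-inference discrepancy in reinforcement learning, proposes the Discrepancy-Constrained Markov Decision Process (DCMDP), which uses Lagrangian relaxation to adaptively balance performance improvement against discrepancy control for stable, efficient training.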

English abstract

Reinforcement Learning (RL) has emerged as a pivotal post-training paradigm, yet it frequently suffers from unpredictable suboptimal performance or even training collapse. Recent findings attribute these failures to a hidden train-inference discrepancy (or mismatch) stemming from the disparate underlying engines and architecture. We find that the training policy can actively self-correct such a discrepancy when provided with an appropriate learning signal. We further empirically identify a discrepancy tolerance region: within this region, aggressively narrowing the discrepancy can suppress policy exploration and reduce learning efficiency, whereas outside this region, reducing excessive discrepancy improves optimization consistency and raises the achievable local performance ceiling. Based on these findings, we formulate the problem as a Discrepancy-Constrained Markov Decision Process (DCMDP), in which reward maximization is coupled with a constraint that aligns train-inference behavior. To adaptively balance performance improvement and discrepancy control, we introduce a Lagrangian relaxation mechanism that dynamically adjusts the relative weight of the two objectives according to the current degree of discrepancy violation. This enables stable dual-objective optimization: the policy is allowed to explore freely within the tolerance region and is guided back when the discrepancy exceeds the safe boundary. Empirically, DCMDP significantly improves the performance of an 8B dense model (Qwen-3-8b) and a 30B Mixture-of-Experts model (Qwen-3-30bA3b), and enables a heterogeneous training paradigm in which LLMs can be optimized in a high-fidelity training setup while being explicitly aligned for low-cost, resource-constrained inference deployment.
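
The Lagrangian mechanism described here can be pictured with a textbook projected dual-ascent update. This is a sketch under assumptions: the abstract does not state DCMDP's actual update rule, and the function names, tolerance, step size, and discrepancy values below are all hypothetical.

```python
def update_multiplier(lam, discrepancy, tolerance, step_size):
    """Projected dual ascent: raise lam when the constraint is violated,
    lower it otherwise, and never let it drop below zero."""
    return max(0.0, lam + step_size * (discrepancy - tolerance))

def penalized_reward(reward, discrepancy, lam, tolerance):
    """Lagrangian-relaxed objective: reward minus the weighted constraint violation."""
    return reward - lam * (discrepancy - tolerance)

lam = 0.0
history = []
# Made-up per-step discrepancy measurements; the tolerance region ends at 0.05.
for measured in [0.02, 0.08, 0.12, 0.03]:
    lam = update_multiplier(lam, measured, tolerance=0.05, step_size=10.0)
    history.append(lam)

print([round(x, 3) for x in history])  # [0.0, 0.3, 1.0, 0.8]
```

While the measured discrepancy stays inside the tolerance region the multiplier decays toward zero, so the reward term dominates; once the boundary is crossed the penalty weight grows, which matches the "explore freely, then guide back" behavior the abstract describes.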

2606.08674 2026-06-10 cs.CV cs.AI Version update

BioVid: Autoregressive Video Generation with Biological Behavior Semantic Comprehension

Tsung-Wei Pan, Jung-Hua Wang

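Affiliations Department of Electrical Engineering, National Taiwan Ocean University; AI research center, National Taiwan Ocean University

AI summary Proposes BioVid, a data-driven autoregressive video generation framework that uses an FSQ-R3GAN tokenizer and a causal Transformer to learn the natural duration distribution of biological behaviors without preset length constraints.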

English abstract

Existing video generation frameworks treat sequence duration as an externally prescribed parameter -- fixed frame counts or text prompts -- producing clips whose temporal boundaries are decoupled from the statistical structure of real behavioral data. This assumption is fundamentally misaligned with biological behavior, where action duration varies naturally across individuals and instances and is encoded in the data itself. We present BioVid, a data-driven autoregressive video generation framework that learns the temporal structure of biological behaviors directly from training data, including their natural length distributions. In the first stage, a Finite Scalar Quantization GAN (FSQ-R3GAN) tokenizer encodes each video frame into a compact discrete representation, combining the stabilized relativistic training objective of R3GAN with FSQ's guaranteed codebook utilization to achieve high-fidelity spatial reconstruction without codebook collapse. In the second stage, a causal Transformer models the resulting token sequences autoregressively and learns to emit an End-of-Sequence (EOS) token when the behavioral event reaches semantic closure, with the termination distribution emerging naturally from the training data rather than any human-specified constraint. Experiments on a human drinking behavior dataset (NTU RGB+D, A001, n=94) demonstrate that BioVid's generated length distribution closely matches that of held-out test data, achieving a Wasserstein-1 distance of 1.24 against the ground truth -- compared to 6.05 for a fixed-length baseline and 15.48 for VideoGPT -- while maintaining competitive spatial fidelity.
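
The Wasserstein-1 distance used to compare length distributions can be computed for 1-D samples as the area between their empirical CDFs. The sketch below is a generic implementation with invented clip lengths; it does not reproduce the paper's evaluation code or data.

```python
def wasserstein_1(xs, ys):
    """W1 between two empirical 1-D distributions: area between their CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    points = sorted(set(xs) | set(ys))
    total = 0.0
    i = j = 0
    for left, right in zip(points, points[1:]):
        # Advance counters so i and j count samples <= left.
        while i < len(xs) and xs[i] <= left:
            i += 1
        while j < len(ys) and ys[j] <= left:
            j += 1
        total += abs(i / len(xs) - j / len(ys)) * (right - left)
    return total

# Hypothetical clip lengths in frames.
real_lengths = [10, 12, 14]
shifted = [12, 14, 16]        # every clip two frames too long
fixed_length = [12, 12, 12]   # a fixed-length generator

d_shifted = wasserstein_1(real_lengths, shifted)      # about 2.0
d_fixed = wasserstein_1(real_lengths, fixed_length)   # about 1.33
```

A fixed-length generator can match the mean length and still incur a nonzero distance, because the metric penalizes the missing spread; that is the effect the abstract's fixed-length baseline comparison relies on.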

2606.07936 2026-06-10 cs.CL cs.AI Version update

Illusions of the Gold Standard: A Large-scale Analysis of Human Evaluation Protocols for Long-form Text Generation

Katelyn Xiaoying Mei, Yi-Li Hsu, Minjoon Choi, Zongwan Cao, Chenjun Xu, Bingbing Wen, Su Lin Blodgett, Lucy Lu Wang

Affiliations University of Washington; National Tsing Hua University; Seoul National University; Mila - Québec AI Institute; Allen Institute for AI

AI summary Analyzes human evaluation protocols in *CL conference papers from 2023 to 2025, finds opaque reporting and poor reproducibility, and proposes recommendations for improvement.

Comments Accepted to ACL 2026 Main

English abstract

Human evaluation plays a critical role in assessing the quality of generated text. However, the reliability and reproducibility of these evaluations depend on transparent and well-documented protocols -- details that are frequently missing in current practice. In this work, we conduct a large-scale analysis of human evaluation protocols for evaluating long-form generation tasks in *CL conference publications from 2023-2025, including a full manual review of 284 papers and LLM-assisted analysis for another 1.8k+ papers. We define a set of 20 reportable criteria related to reproducibility of human evaluation studies, and apply these criteria to systematically examine reporting norms and practices within the community. We find widespread under-reporting of important aspects of human evaluation study design, leading to ambiguity about what was measured and how, who contributed judgments, and how judgments should be interpreted. Based on these findings, we outline actionable recommendations to support more transparent and reproducible reporting in future research. Our analysis code and annotated dataset can be found at: https://github.com/larchlab/Illusions-of-the-Gold-Standard

2606.07605 2026-06-10 cs.LG cs.AI Version update

SRT: Super-Resolution for Time Series via Disentangled Rectified Flow

Jufang Duan, Shenglong Xiao, Yuren Zhang

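Affiliations Bytedance

AI summary Proposes SRT, a framework that reconstructs high-resolution time series from low-resolution inputs via disentangled rectified flow: it decomposes the input into trend and seasonal components, aligns them to the target resolution with an implicit neural representation, and introduces a cross-resolution attention mechanism to generate fine details.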

Comments Accepted to the International Conference on Learning Representations (ICLR) 2026

Journal ref The Fourteenth International Conference on Learning Representations (ICLR 2026)

English abstract

Fine-grained time series data with high temporal resolution is critical for accurate analytics across a wide range of applications. However, the acquisition of such data is often limited by cost and feasibility. This problem can be tackled by reconstructing high-resolution signals from low-resolution inputs based on specific priors, known as super-resolution. While extensively studied in computer vision, directly transferring image super-resolution techniques to time series is not trivial. To address this challenge at a fundamental level, we propose Super-Resolution for Time series (SRT), a novel framework that reconstructs temporal patterns lost in low-resolution inputs via disentangled rectified flow. SRT decomposes the input into trend and seasonal components, aligns them to the target resolution using an implicit neural representation, and leverages a novel cross-resolution attention mechanism to guide the generation of high-resolution details. We further introduce SRT-large, a scaled-up version with extensive pre-training, which enables strong zero-shot super-resolution capability. Extensive experiments on nine public datasets demonstrate that SRT and SRT-large consistently outperform existing methods across multiple scale factors, showing both robust performance and the effectiveness of each component in our architecture.
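
The trend/seasonal split that SRT starts from can be illustrated with a classical moving-average decomposition. This is an assumption for illustration only: the abstract does not say how SRT performs the decomposition, and `moving_average` and `decompose` are hypothetical helpers.

```python
def moving_average(series, window):
    """Centered moving average; the window shrinks at the edges. Use an odd window."""
    half = window // 2
    out = []
    for t in range(len(series)):
        lo, hi = max(0, t - half), min(len(series), t + half + 1)
        out.append(sum(series[lo:hi]) / (hi - lo))
    return out

def decompose(series, window):
    """Split a series into a smooth trend and the residual (seasonal) part."""
    trend = moving_average(series, window)
    seasonal = [x - m for x, m in zip(series, trend)]
    return trend, seasonal

series = [1.0, 3.0, 1.0, 3.0, 1.0]
trend, seasonal = decompose(series, window=3)
# Adding the two components back together recovers the input.
rebuilt = [t + s for t, s in zip(trend, seasonal)]
```

Treating the two components separately lets a model upsample the slow trend and the fast oscillation with different machinery, which is the motivation the abstract gives for disentangling them.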

2606.07586 2026-06-10 cs.LG cs.AI cs.AR cs.MA Version update

From Human Guidance to Autonomy: Agent Skill System for End-to-End LLM Deployment on Spatial NPUs

Jiajie Li, Erwei Wang, Zhiru Zhang, Samuel Bayliss

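Affiliations AMD Research and Advanced Development

AI summary Proposes a two-stage methodology that moves from human-guided, agent-assisted deployment to an autonomous agent skill system, deploying eight additional LLMs end-to-end on the AMD XDNA 2 NPU; three of the eight match or exceed the sustained performance of the Llama-3.2-1B reference deployment.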

Comments Accepted to the Machine Learning for Architecture and Systems Workshop (MLArchSys), co-located with ISCA 2026

English abstract

Spatial neural processing units (NPUs) provide an energy-efficient platform for edge LLM inference, but efficiently deploying an LLM end-to-end on such hardware remains labor-intensive. Although AI coding agents have begun to lower this cost, existing studies have largely focused on single-kernel optimization rather than end-to-end LLM deployment on resource-constrained spatial NPUs. We present a two-stage methodology, instantiated on the AMD XDNA 2 NPU, that progresses from human-guided development to agent autonomy. In the first stage, we develop a reference deployment of Llama-3.2-1B through human-guided agent assistance. The resulting implementation achieves a speedup of 2.2x on prefill and 4.0x on decode over the hand-optimized baseline, with the optimization trajectory and its lessons recorded as structured documentation throughout. In the second stage, we distill the documentation into an agent skill system consisting of eight phases, orchestrating the optimization and debugging skill sets, with numerical correctness strictly enforced at each phase. Using our agent skill system, we autonomously deploy eight additional decoder-only LLMs (Llama-3.2-3B, SmolLM2-1.7B, Qwen2.5-{0.5B, 1.5B, 3B}, Qwen3-{0.6B, 1.7B, 4B}) end-to-end on the AMD XDNA 2 NPU using the open-source compiler stack. To our knowledge, these models have not previously been deployed on AMD NPUs via any open-source software stack. Each deployment completes in 0.5-4 hours of agent wall time with almost no human guidance, and passes the numerical-correctness gates, demonstrating functional generalization to previously unencountered LLMs. Three of the eight match or exceed the sustained performance of our Llama-3.2-1B reference deployment, suggesting that the resulting implementations can be competitive without additional model-specific human engineering.

2606.07532 2026-06-10 cs.CL cs.AI Version update

Durable Evaluation Framework: Adversarial Arbitration for Sycophancy Reduction in Large Language Models

Chinese title (translated) Principled Agent Debate: Adversarial Arbitration for Sycophancy Reduction in Large Language Models

Sam Ryan

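Affiliations Novel Systems Engineering LLC

AI summary Proposes Principled Agent Debate (PAD), a multi-agent architecture that arbitrates between two models with opposing dispositions and evaluates their arguments blind, significantly reducing sycophancy bias on SycophancyEval; the DeWin variant reaches 48.5% accuracy.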

Comments 25 pages, 3 figures. Code and data available at github.com/NovelSystems/CANDOR

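AI abstract (translated from Chinese)

RLHF-trained models are systematically biased toward agreement over accuracy, a structural property of the training process. We present Principled Agent Debate (PAD), a multi-agent architecture that mitigates identity-framed sycophancy by arbitrating between two models tuned to opposing philosophical dispositions, with a pragmatist synthesizer evaluating both arguments blind to their origins. This paper evaluates a prompt-based instantiation of PAD. The key mechanisms are static disposition tuning, identity stripping before synthesis, single-round independent argumentation, and blind arbitration. We evaluate five instantiations on 200 stratified questions from SycophancyEval. All PAD variants (AnCifer, DeWin, FeynStein, BurGal, Trident) significantly outperform the single-model baseline (18.5%) and the instructed-opposition baseline (29.0%), with DeWin achieving 48.5% accuracy (z=6.36, p<0.001 versus both). The variants are not significantly different from each other at n=200. The BurGal variant achieves 53.0% but functions as an architectural validity check; its consensus/heterodox axis structurally favors the heterodox model on every benchmark question. A pre-training floor affects roughly 40% of questions; fine-tuned disposition models are identified as the next step.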

English abstract

RLHF-trained models are systematically biased toward agreement over accuracy, a structural property of the training process. We present Durable Evaluation Framework (DEF) Arbitration, a multi-agent architecture that mitigates identity-framed sycophancy by arbitrating between two models tuned to opposing DEFs, with a pragmatist synthesizer evaluating both arguments blind to their origins. This paper evaluates a prompt-based instantiation of DEF Arbitration. The key mechanisms are static DEF tuning, identity stripping before synthesis, single-round independent argumentation, and blind arbitration. We evaluate five instantiations on 200 stratified questions from SycophancyEval. All tested DEF variants (AnCifer, DeWin, FeynStein, BurGal, Trident) significantly outperform the single-model baseline (18.5%) and instructed-opposition baseline (29.0%), with DeWin achieving 48.5% accuracy (z=6.36, p<0.001 versus both). The variants are not significantly different from each other at n=200. The BurGal variant achieves 53.0% but functions as an architectural validity check; its consensus/heterodox axis structurally favors the heterodox model on every benchmark question. A pre-training floor affects an estimated 40% of questions; fine-tuned DEF models are the identified next step.
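
The reported z statistic can be sanity-checked with a pooled two-proportion z-test. Which test the author used is not stated in the abstract, so the pooled form and the assumption of n=200 questions per condition are guesses; under them, the comparison against the single-model baseline comes out at about 6.36.

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """Pooled two-proportion z statistic for the difference p1 - p2."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# DeWin (48.5%) versus the two baselines, assuming 200 questions per condition.
z_single = two_proportion_z(0.485, 0.185, 200, 200)
z_opposed = two_proportion_z(0.485, 0.290, 200, 200)
print(round(z_single, 2))  # 6.36
```

Under the same assumptions the comparison against the instructed-opposition baseline gives a smaller statistic of roughly 4, which is still well beyond the usual p<0.001 threshold.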

2606.07422 2026-06-10 cs.CL cs.AI Version update

The Masked Advantage: Uncovering Local-Language Access to Cultural Knowledge in LLMs

Yang Zhang, Xiao Fei, Amr Mohamed, Sarah Almeida Carneiro, Mersin Konomi, Mingmeng Geng, Ahmed Asaad, Guokan Shang, Michalis Vazirgiannis

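Affiliations Ecole Polytechnique; MBZUAI; ENS-PSL; Durham University

AI summary Uses controlled experiments and an item response theory model to separate language proficiency from access to cultural knowledge, finding that local languages hold an advantage in cultural knowledge access that is often masked by weaker language proficiency.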

English abstract

Large language models are increasingly used to answer culturally grounded questions across languages, yet it remains unclear whether local cultural knowledge is better accessed through English or the local language. Existing evaluations face two key limitations: many rely on parallel template-based questions that may not reflect how cultural knowledge naturally appears, and raw accuracy conflates general language proficiency with language-conditioned knowledge access. We address these issues with a controlled framework built on real-world cultural questions collected from regional benchmarks and local sources. By crossing question type (culture-agnostic vs. culture-specific) with query language (English vs. local language), and estimating ability with a shared 1PL item response theory model, we separate proficiency from localized knowledge access. Across 13 locales and roughly 80 models, we find a consistent English advantage on culture-agnostic questions, indicating stronger English proficiency. However, after accounting for this proficiency gap, local languages show a positive knowledge-access advantage in nearly all locale-model settings. This advantage is often masked in raw accuracy but becomes more visible for frontier, regionally aligned, or language-adapted models. Our results suggest that weaker local-language performance does not necessarily imply weaker cultural knowledge; rather, local cultural knowledge may be more accessible through the local language but hidden by limited language proficiency.
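
A 1PL (Rasch) item response model, the family this abstract names, gives the probability of a correct answer as a logistic function of the gap between a respondent's ability and an item's difficulty. The sketch below shows only that textbook formula; how the paper shares parameters across languages and question types is not described in the abstract, and the numbers are arbitrary.

```python
import math

def p_correct(ability, difficulty):
    """1PL / Rasch model: P(correct) = sigmoid(ability - difficulty)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

# When ability equals item difficulty, the item is answered correctly half the time.
p_even = p_correct(ability=0.0, difficulty=0.0)
# One logit of extra ability raises that to about 0.73.
p_stronger = p_correct(ability=1.0, difficulty=0.0)
```

Because ability and difficulty sit on a common logit scale, a proficiency gap estimated on culture-agnostic items can be subtracted out before comparing performance on culture-specific items, which is the kind of separation the abstract describes.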

2606.07135 2026-06-10 cs.LG Version update

Explaining Unsupervised Disease Staging in Huntington's Disease: Insights into Model Representations and Clusters

Lubna Mahmoud Abu Zohair, Hind Zantout

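Affiliations Heriot-Watt University

AI summary Extends an unsupervised disease staging framework with an explainability analysis on the Enroll-HD dataset, showing that the model embeddings align with clinical progression and using SHAP to quantify feature importance, which identifies disease stages ranging from early cognitive-motor impairment to severe functional dependency.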

Comments Accepted for oral presentation and as a full-length paper at the International Conference on AI in Healthcare 2026 (26-28 August 2026, Imperial College London) and will be published by Springer in the Lecture Notes in Computer Science (LNCS) series

English abstract

Huntington's disease (HD) is a progressive neurodegenerative disorder that affects motor, cognitive, and behavioral functions, and accurate characterization of its progression remains essential to improving patient outcomes and quality of life. Unsupervised machine learning (ML) approaches have demonstrated the ability to uncover disease progression trajectories and meaningful latent stages from longitudinal data; however, their limited interpretability restricts clinical trust and translation. We extend a previously proposed ML-based disease staging framework by applying an explainability analysis to the extracted feature representations and discovered disease stages. Using the Enroll-HD dataset, we first project the learned representations into a lower-dimensional space to intuitively assess whether the resulting clusters align with the progression of established clinical measures. We then use saliency maps to identify the clinical features that most strongly contribute to the learned embeddings over time. Finally, we train a surrogate classifier and apply SHAP to quantify feature importance for cluster assignments and to analyze which clinical variables drive transitions between disease stages. The explainability analysis indicates that the learned embeddings capture clinically meaningful disease structure, aligning with established motor and functional severity scores and exhibiting progressive deterioration across clusters. Within this analysis, SHAP reveals a stratification of disease stages, ranging from early cognitive-motor impairment to severe functional dependency, consistent with known clinical progression patterns, while also highlighting intra-stage variability.

2606.07088 2026-06-10 cs.LG math.OC Version update

Residual-Controlled Multiplier Learning for Stochastic Constrained Decision-Making

Kang Liu, Jianchen Hu, Ziyu Qu, Edward Hengzhou Yan, Lun Yang, Meng Zhang

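Affiliations Xi’an Jiaotong University; Tencent; China University of Geosciences

AI summary Proposes Residual-Controlled Multiplier Learning (RCML), which reformulates multiplier updating as projected-pressure feedback and adds modular stochastic stabilization components to address the unstable multiplier updates that mini-batch noise causes in primal-dual methods for stochastic constrained decision-making, with finite-gain convergence and a local KKT-residual interpretation.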

English abstract

Stochastic constrained decision-making requires optimizing performance objectives while enforcing statistical requirements such as safety or fairness. However, standard primal-dual methods struggle to update multipliers robustly under stochastic mini-batch feedback, as the noise of mini-batch gradients and constraint estimates can be directly accumulated into the multiplier memory. To address this issue, we propose Residual-Controlled Multiplier Learning (RCML), which reformulates multiplier updating as projected-pressure feedback. The central idea is to decompose the projected multiplier into an effective pressure signal for primal descent and a pressure-memory residual for finite-gain multiplier tracking. To handle heterogeneous and noisy observations, we further augment this residual-integral backbone with modular stochastic stabilization components. For the convex-affine backbone, we establish finite-gain convergence, derive a stochastic residual bound under mini-batch feedback, and show that the residual feedback law admits a local KKT-residual interpretation near regular KKT points of nonconvex problems. Experiments across optimization, allocation, and fair-ranking tasks show that RCML improves feasibility control and multiplier stability while maintaining competitive objective performance. Code is released at https://anonymous.4open.science/r/RCML-3114/.

2606.06888 2026-06-10 cs.LG Version update

Data-Constrained Language Model Pretraining: Improved Regularization and Scaling Laws

Zhiwei Xu, Shihao Wu, Hanseul Cho, Wei Hu, Yixin Wang

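Affiliations University of Michigan; KAIST AI

AI summary Studies regularization and scaling for data-constrained language model pretraining: proposes masked-input regularization (MIR) to improve validation loss and SoftQ, a scaling law that more accurately fits the interaction between model size and data size under repeated data.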

English abstract

Classical scaling laws for language model pretraining balance model size against training dataset size under a fixed compute budget, assuming abundant data and a single pass over the corpus. As training compute grows faster than the supply of natural language data, pretraining is likely to enter a data-constrained, compute-rich regime where models train for multiple epochs over a finite dataset. We study data-constrained pretraining along two axes, regularization and scaling. For regularization, we study masked-input regularization (MIR), an auxiliary next-token prediction loss on randomly masked inputs. MIR tests whether the random masking central to diffusion language models can benefit autoregressive pretraining without architectural changes or inference overhead. Across 72M to 1.4B parameter models, we find that MIR added on top of strong weight decay improves validation loss over autoregressive strong-weight-decay-only models, with downstream gains at 1.4B. For scaling, we propose SoftQ, a scaling law that couples model size and data size to capture their interaction under repeated data. Classical alternatives such as the Chinchilla law use an additive form that decouples these terms, making them misspecified in the data-constrained regime. We find that SoftQ fits data-constrained experiments substantially better than these alternatives, and estimates MIR's gains as equivalent to roughly 1.3 times as much unique training data. We release our code at https://github.com/yixinw-lab/dc_pretrain.
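
For reference, the additive form that this abstract attributes to the Chinchilla law is usually written as below, where $N$ is the parameter count, $D$ is the number of training tokens, and $E$, $A$, $B$, $\alpha$, $\beta$ are fitted constants. The symbols follow the common presentation of that law rather than this paper, and SoftQ's own coupled form is not given in the abstract, so it is not reproduced here.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Because the model-size and data-size terms are simply added, the formula contains no term in which $N$ and $D$ interact, which is the decoupling the abstract identifies as a misspecification when a finite dataset is repeated.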

2606.06758 2026-06-10 cs.CL 版本更新

Diagnosing Evidence Utilization in Long-Context and Retrieval-Augmented Language Models under Matched Evidence Conditions

长上下文与检索增强语言模型中证据利用的四条件诊断协议

Haizhou Xia

发表机构 * University of Western Ontario(西方大学)

AI总结 提出四条件证据可用性协议,通过ONCU估计器分离无证据、全上下文、检索证据和Oracle证据四种条件下的模型表现,诊断长上下文与检索增强语言模型的证据利用瓶颈。

Comments 46 pages, 37 tables, 1 figure

详情
AI中文摘要

最终答案准确性、检索召回率和引用重叠本身并不能确定长上下文或检索增强语言模型是否使用了所提供的证据。模型可能从参数记忆中进行回答,尽管接收到正确的段落却失败,或者引用证据但未将其转换为所请求的答案。本文提出了一种匹配的四条件证据可用性协议——无证据、全上下文、检索证据和Oracle证据参考——用于在固定示例、提示、评分字段、检索设置和有效性检查下诊断证据利用情况。ONCU被用作协议绑定的估计器,用于估计恢复的Oracle参考证据优势,并且仅针对分母有效的组进行计算;无分母的答案、证据、检索和失败审计指标分别报告。实证研究评估了来自Qwen、Gemma、Llama和Mistral家族的五个本地开源模型,在Controlled-ONCU-safe16K、HotpotQA-ONCU和2WikiMultiHopQA-ONCU上进行了评估,共产生18,000个ONCU兼容预测。主要发现是任务相关的瓶颈分裂:受控合成设置主要暴露全上下文利用失败,而测试的真实多跳设置主要暴露无分母答案和证据指标中的检索链覆盖失败,ONCU在Oracle改进组上支持相同方向。贡献在于提供了一个诊断协议,用于分离无证据可回答性、Oracle证据可恢复性、全上下文利用和检索条件利用,而不是为长上下文或检索增强系统提供单一分数排行榜。

英文摘要

Final-answer accuracy, retrieval recall, and citation overlap do not reveal how much answer advantage a long-context or retrieval-augmented language model actually recovers from supplied evidence. A model may answer from parametric priors, fail to use evidence that is present, or cite relevant text without converting it into the final answer. This paper introduces a four-condition diagnostic protocol for evidence-utilization evaluation under matched examples, models, prompts, and scoring rules. The protocol compares no-evidence, full-context, retrieved-evidence, and oracle-evidence reference conditions, and uses Oracle-Reference Normalized Context Utilization (ONCU) as a denominator-valid estimate of recovered oracle-reference evidence advantage. The empirical study evaluates five local open-weight models from the Qwen, Gemma, Llama, and Mistral families over Controlled-ONCU-safe16K, HotpotQA-ONCU, and 2WikiMultiHopQA-ONCU, comprising 18,000 ONCU-compatible predictions. Results show a task-dependent diagnostic pattern: controlled synthetic settings expose reduced recovery when the same evidence is embedded in long input rather than supplied compactly, while realistic multi-hop reconstructions show that full-context inputs outperform the tested retrieved inputs in denominator-free answer and evidence metrics, with ONCU supporting the same direction on oracle-improving groups. Sensitivity audits with stronger retrieval settings narrow some gaps but do not overturn the scoped interpretation. The main contribution is therefore not a single utilization ratio, but a matched diagnostic protocol that separates no-evidence answerability, oracle-evidence recoverability, full-context recovery, retrieval-conditioned recovery, denominator validity, and companion answer/evidence diagnostics.
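
A minimal sketch of what a denominator-valid normalized utilization score could look like. The ratio used here (advantage over no-evidence divided by the oracle advantage over no-evidence) is an assumed reading of the abstract, not a formula quoted from the paper, and the function name, `min_gap` threshold, and accuracies are hypothetical.

```python
from typing import Optional


def oncu(no_evidence: float, condition: float, oracle: float,
         min_gap: float = 1e-9) -> Optional[float]:
    """Share of the oracle-over-no-evidence advantage that `condition` recovers.

    Assumed formula (not quoted from the paper):
        (condition - no_evidence) / (oracle - no_evidence)
    Returns None when the denominator is not positive, i.e. the oracle
    evidence did not improve on the no-evidence score for this group.
    """
    gap = oracle - no_evidence
    if gap <= min_gap:
        return None
    return (condition - no_evidence) / gap


# Hypothetical per-group accuracies under the four matched conditions.
group = {"none": 0.20, "full_context": 0.50, "retrieved": 0.35, "oracle": 0.80}
full_ctx = oncu(group["none"], group["full_context"], group["oracle"])
retrieved = oncu(group["none"], group["retrieved"], group["oracle"])
invalid = oncu(0.80, 0.80, 0.80)  # oracle adds nothing, so no ratio is reported
print(full_ctx, retrieved, invalid)
```

Returning `None` instead of a number for invalid groups keeps them out of any average, which is one way to respect the denominator-validity condition the abstract emphasizes.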

2606.06744 2026-06-10 cs.LG cs.GT cs.MA econ.TH 版本更新

Learn to Match: Two-Sided Matching with Temporally Extended Feedback

学会匹配:具有时间扩展反馈的双边匹配

Haijing Zong, Yancheng Liang, Boyang Zhou, Natasha Jaques

发表机构 * Department of Economics, University of Washington(华盛顿大学经济系) Paul G. Allen School of Computer Science & Engineering, University of Washington(华盛顿大学保罗·G·艾伦计算机科学与工程学院)

AI总结 提出一个具有时间扩展反馈的双边匹配框架,将其建模为部分可观测马尔可夫博弈,并基于多智能体强化学习构建Learn2Match基准,实验表明独立PPO优于bandit基线,但存在信息摩擦损失。

详情
AI中文摘要

双边匹配市场通常涉及随时间通过面试、重复互动、学习和分离而展开的信息。现有的匹配模型通常将此过程简化为关于固定偏好的即时亚高斯反馈,忽略了支付相关信息逐渐揭示并改变未来匹配决策的场景。我们引入了一个具有时间扩展反馈的框架,将双边匹配建模为一个部分可观测马尔可夫博弈,其中包含昂贵的匹配前筛选、有噪声的匹配后观测、演变的潜在特征以及内生的延续或解散。我们在Learn2Match中实例化该框架,这是一个用于动态匹配市场的多智能体强化学习基准。Learn2Match支持关于面试谁、与谁匹配以及何时解散匹配的分散决策,同时使用遗憾、社会福利和信息摩擦损失(衡量由潜在偏好不完全揭示引起的福利差距)来评估策略。我们发现,在时间扩展反馈下,独立PPO比bandit风格的CA-ETC基线实现了更高的累积社会福利和更低的累积遗憾,展示了MARL在动态匹配市场中的前景。然而,PPO仍然产生更高的信息摩擦损失,表明端到端MARL尚未提供匹配bandit方法的协调探索结构。这些结果将Learn2Match定位为开发下一代匹配市场算法的基准:像RL智能体一样自适应、像bandit算法一样统计严谨、像稳定匹配机制一样结构感知的方法。

英文摘要

Two-sided matching markets often involve information that unfolds over time through interviews, repeated interaction, learning, and separation. Existing matching models typically reduce this process to immediate sub-Gaussian feedback about fixed preferences, missing settings where payoff-relevant information is revealed gradually and changes future matching decisions. We introduce a framework with temporally extended feedback that formulates two-sided matching as a partially observable Markov game with costly pre-match screening, noisy post-match observations, evolving latent profiles, and endogenous continuation or dissolution. We instantiate this framework in Learn2Match, a multi-agent reinforcement-learning benchmark for dynamic matching markets. Learn2Match supports decentralized decision making over whom to interview, whom to match with, and when to dissolve a match, while evaluating policies using regret, social welfare, and an information-friction loss that measures the welfare gap caused by incomplete revelation of latent preferences. We find that independent PPO achieves higher cumulative social welfare and lower cumulative regret than the bandit-style CA-ETC baseline under temporally extended feedback, demonstrating the promise of MARL for dynamic matching markets. However, PPO still incurs higher information-friction loss, revealing that end-to-end MARL does not yet provide the coordinated exploration structure of matching-bandit methods. These results position Learn2Match as a benchmark for developing the next generation of matching-market algorithms: methods that are adaptive like RL agents, statistically disciplined like bandit algorithms, and structurally aware like stable-matching mechanisms. The official website and code link are at https://sites.google.com/view/learn-to-match/home.

2606.06742 2026-06-10 cs.LG stat.ML 版本更新

TorchKM: A GPU-Oriented Library for Kernel Learning and Model Selection

TorchKM:面向GPU的核学习与模型选择库

Yikai Zhang, Gaoxiang Jia, Jie Ding, Boxiang Wang

发表机构 * University of Iowa(爱荷华大学) University of Minnesota(明尼苏达大学) Individual Researcher(独立研究者) AIScientists, Inc. (MorphMind)(AIScientists公司(MorphMind)) Department of Statistics and Actuarial Science, University of Iowa(爱荷华大学统计与精算科学系)

AI总结 提出GPU加速的核学习库TorchKM,通过智能复用矩阵运算加速SVM、核逻辑回归等模型的训练与模型选择,性能优于标准基线。

Comments 14 pages, 2 figures

详情
AI中文摘要

TorchKM是一个用于核机器的开源库,包括支持向量机、核逻辑回归和核分位数回归,并具有GPU加速。该库采用scikit-learn风格的API,旨在利用GPU友好的线性代数,通过智能复用矩阵运算加速完整的训练和模型选择流程。基准测试显示,与标准基线相比,具有竞争力的预测性能以及显著的加速效果。代码和文档可在 https://github.com/YikaiZhang95/torchkm 获取,并且该包可以通过PyPI轻松安装。

英文摘要

TorchKM is an open-source library for kernel machines, including support vector machines, kernel logistic regression, and kernel quantile regression, with GPU acceleration. The library features a scikit-learn-style API and is designed to exploit GPU-friendly linear algebra, accelerating the full training and model-selection pipeline through intelligent reuse of matrix operations. Benchmarks show competitive predictive performance with substantial speedups over standard baselines. The efficiency and programmable design also make TorchKM a kernel-learning component for AI-driven workflows. Code and documentation are available at https://github.com/YikaiZhang95/torchkm, and the package can be easily installed via PyPI.

2606.06735 2026-06-10 cs.AI 版本更新

A Geometric Account of Activation Steering through Angle-Norm Decomposition

通过角度-范数分解的激活引导的几何解释

Georgii Aparin, Tatiana Gaintseva

发表机构 * Huawei Noah’s Ark Lab(华为诺亚实验室) Queen Mary University of London(女王玛丽大学)

AI总结 本文通过控制实验分离角度和径向分量,发现概念主要编码在角度结构中,但范数对引导的稳定性和下游效应至关重要,建议将激活引导参数化为可解释的角度和径向分量。

详情
AI中文摘要

线性激活引导作为一种简单且经验有效的控制语言模型行为的方法已受到广泛关注。最近,球形引导范式被提出来解决加性干预的局限性,其动机通常是假设隐藏状态范数不携带概念相关信息。在这项工作中,我们通过一项旨在分离角度和径向分量作用的受控实证研究重新审视了这一假设。我们表明,引导方法的主要区别在于它们如何耦合两种几何效应:改变令牌与概念方向的角度对齐以及改变其隐藏状态范数。在七个语言模型上,我们发现概念主要表示在角度结构中,这支持了球形方法的动机,但范数对于引导的稳定性和下游效应仍然重要。我们的结果解释了为什么具有相似概念级别效果的干预可能表现不同,并建议激活引导应由干预的可解释角度和径向分量参数化,而不是由纠缠这两种效应的单个加性系数参数化。

英文摘要

Linear activation steering has gained popularity as a simple and empirically effective way to control language model behavior. More recently, spherical steering paradigms have been proposed to address limitations of additive interventions, often motivated by the assumption that hidden-state norm does not carry concept-relevant information. In this work, we revisit this assumption through a controlled empirical study designed to disentangle the roles of angular and radial components. We show that steering methods differ mainly in how they couple two geometric effects: changing a token's angular alignment with a concept direction and changing its hidden-state norm. Across seven language models, we find that concepts are represented primarily in angular structure, supporting the motivation for spherical methods, but that norm remains important for the stability and downstream effects of steering. Our results explain why interventions with similar concept-level effects can behave differently, and suggest that activation steering should be parameterized by interpretable angular and radial components of the intervention, rather than by a single additive coefficient that entangles these two effects.
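
The coupling the abstract describes can be illustrated with a small sketch. It assumes a toy 2-D hidden state and a hand-picked concept direction; the function names are hypothetical, and "additive step followed by rescaling" is just one simple way to change the angle at fixed norm, not necessarily the spherical method studied in the paper.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    return math.sqrt(dot(a, a))


def angle_norm(h, d):
    """Return (norm of h, angle in radians between h and concept direction d)."""
    cos = dot(h, d) / (norm(h) * norm(d))
    return norm(h), math.acos(max(-1.0, min(1.0, cos)))


def additive_steer(h, d, alpha):
    """h + alpha * unit(d): moves the angle and the norm at the same time."""
    d_len = norm(d)
    return [x + alpha * y / d_len for x, y in zip(h, d)]


def angular_steer(h, d, alpha):
    """Same direction as the additive result, rescaled to the original norm."""
    steered = additive_steer(h, d, alpha)
    scale = norm(h) / norm(steered)
    return [x * scale for x in steered]


h = [3.0, 4.0]          # toy hidden state with norm 5
d = [1.0, 0.0]          # toy concept direction
base = angle_norm(h, d)
added = angle_norm(additive_steer(h, d, 2.0), d)
rotated = angle_norm(angular_steer(h, d, 2.0), d)
print(base, added, rotated)
```

Both interventions reach the same angle to the concept direction, but only the additive one also changes the norm, which is the kind of entanglement a single additive coefficient hides.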

2606.06698 2026-06-10 cs.LG cs.CL 版本更新

RECAP: Regression Evaluation for Continual Adaptation of Prompts

RECAP: 提示持续适应的回归评估

Harsh Deshpande, Kushal Chawla, Sangwoo Cho, William Campbell, Sambit Sahu

发表机构 * Capital One

AI总结 提出RECAP基准,在严格主动适应-测试协议下评估提示优化方法对约束变化的持续学习能力,发现现有方法在主动场景下性能无显著提升,强调设计主动提示适应方法的必要性。

详情
AI中文摘要

生产中的代理系统经常面临不断变化的约束,并且必须从下一次交互开始就遵守。诸如工具调用通知更改合规阈值或策略更新添加披露要求等场景符合这一标准,在生产中几乎没有出错的空间。这种主动适应设置在部署中很常见,但在当前的基准测试中却不存在,这些基准测试假设要么是静态约束集,要么是带有评估反馈的反应式协议。我们引入了RECAP,这是一个基准测试,在严格主动适应-测试协议下,在约束级别测量持续学习现象(遗忘、回归、前向转移):提示优化方法仅接收约束规范,并且必须在看到任何测试数据之前进行泛化。我们在四个LLM和三个具有不断变化的约束的调度上评估了六种方法,发现这些方法在性能上没有显著改善,即使在产生更高延迟之后也是如此。这些为离线或反应式设置设计的方法不足以应对主动范式。我们的工作强调了设计主动提示适应方法的日益增长的需求,其中模型必须对部署中不断变化的需求保持鲁棒性。

英文摘要

Production agentic systems routinely face evolving constraints and must comply from the very next interaction. Scenarios such as a tool-call notification changing a compliance threshold or a policy update adding disclosure requirements fit this criterion and leave almost no room for error in production. This proactive adaptation setting is common in deployment, but absent from current benchmarks, which assume either static constraint sets or reactive protocols with evaluation feedback. We introduce RECAP, a benchmark that measures continual-learning phenomena (forgetting, regression, forward transfer) at the constraint level under a strictly proactive adapt-then-test protocol: prompt optimization methods receive only the constraint specification and must generalize before seeing any test data. Evaluating six methods across four LLMs and three schedules with evolving constraints, we find that these methods show no significant improvement in performance, even after incurring higher latency. These methods, designed for offline or reactive settings, are inadequate for the proactive paradigm. Our work emphasizes the growing need for proactive prompt adaptation methods, where the models must remain robust to evolving needs in deployment.

2606.06622 2026-06-10 cs.CL 版本更新

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

UnpredictaBench:评估大语言模型分布随机性的基准

Amirhossein Abaskohi, Amirhossein Dabiriaghdam, Liang Luo, Ellie Dingqiao Wen, Lele Wang, Giuseppe Carenini, Peter West

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Independent Researcher(独立研究者)

AI总结 提出UnpredictaBench基准,通过KS@N指标评估LLM从目标分布(统计分布、随机程序、自然语言场景)采样的能力,发现模型表现差异大且无模型超过40%准确率,表明分布采样能力仍有显著提升空间。

详情
AI中文摘要

我们引入了UnpredictaBench,这是一个评估大语言模型(LLM)捕捉真实潜在分布能力的测试。随着LLM越来越多地被用作其他实体的替代品(例如,在经济模拟中替代人类),许多模型倾向于坍缩到单一合理答案,这导致无法捕捉真实系统的不可预测性。最近关于提高输出多样性的工作对于这种设置是不够的:模拟需要从目标分布中校准的样本,而不仅仅是多样化的输出。UnpredictaBench提炼了该问题的一个简化但基础的版本:从单个目标分布中采样结果,包括经典统计分布、随机程序诱导的分布以及描述随机过程的自然语言场景。我们引入了448个这样的问题,以及KS@N,一个通用评估指标,通过Kolmogorov-Smirnov统计检验量化模型输出近似黑盒目标分布的程度。这是我们在样本量为N时未能拒绝模型样本与真实样本之间差异的比率,N越大表示难度越大。在开源和专有模型上的测试中,我们发现分布能力存在很大差异。例如,当模型生成样本量为100(KS@100,我们的标准指标)时,得分范围从接近0到超过20%。没有模型能在KS@100上达到40%以上,这表明分布采样作为一种能力仍有显著的提升空间。尽管增加推理可以在一定程度上提高得分,但我们发现这个问题没有立即可行的解决方案。UnpredictaBench表明,即使是简单的分布模拟仍然具有挑战性,这使得它成为使用LLM作为复杂系统替代品的必要第一步。

英文摘要

We introduce UnpredictaBench, an evaluation that tests the ability of large language models (LLMs) to capture true underlying distributions. As LLMs are increasingly used as substitutes for other entities (e.g., for humans in economic simulations), the tendency of many models to collapse towards a single plausible answer means a failure to capture the unpredictability of real systems. Recent work on improving output diversity is insufficient for this setting: simulation requires samples that are calibrated to a target distribution, not merely varied outputs. UnpredictaBench isolates a simplified but fundamental version of this problem: sampling outcomes from individual target distributions, including canonical statistical distributions, distributions induced by stochastic programs, and natural-language scenarios that describe random processes. We introduce 448 such problems together with KS@N, a general-purpose evaluation metric that quantifies how well a model's outputs approximate black-box target distributions via the Kolmogorov-Smirnov statistical test. This is the rate at which we fail to reject model samples of size N against ground-truth samples, with larger N indicating greater difficulty. Tested across open and proprietary models, we find a large spread in distributional capabilities. For instance, when models generate samples of size 100 (KS@100, our standard metric), scores range from near 0 to over 20%. No model is able to achieve over 40% at KS@100, showing significant headroom in distributional sampling as a capability. Although adding reasoning can somewhat increase scores, we find no immediate solution for this issue. UnpredictaBench shows that even simple distributional simulation remains challenging, making it a necessary first step toward using LLMs as stand-ins for complex systems.
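
A rough sketch of how a KS@N-style score could be computed, using only the standard library. The significance level (0.05), the asymptotic critical value, and the function names are assumptions made for illustration; the benchmark's exact test configuration is not stated in the abstract.

```python
import math
import random


def ks_statistic(a, b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def passes_ks(model, truth, alpha=0.05):
    """True when the test fails to reject (asymptotic two-sample threshold)."""
    n, m = len(model), len(truth)
    critical = math.sqrt(-math.log(alpha / 2) / 2) * math.sqrt((n + m) / (n * m))
    return ks_statistic(model, truth) <= critical


def ks_at_n(problems, alpha=0.05):
    """Fraction of (model_samples, truth_samples) problems that are not rejected."""
    return sum(passes_ks(m, t, alpha) for m, t in problems) / len(problems)


rng = random.Random(0)
truth = [rng.gauss(0.0, 1.0) for _ in range(1000)]
collapsed = [0.0] * 100          # a model that always returns the same answer
problems = [(collapsed, truth), (truth, truth)]
print(ks_at_n(problems))
```

A collapsed model that always answers with the mean is rejected even though each individual answer is plausible, which is the failure mode the benchmark targets.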

2606.06493 2026-06-10 cs.RO cs.AI cs.LG 版本更新

HANDOFF: Humanoid Agentic Task-Space Whole-Body Control via Distilled Complementary Teachers

HANDOFF: 通过蒸馏互补教师实现人形机器人任务空间全身控制

Lizhi Yang, Junheng Li, Nehar Poddar, Yiling Hou, Gio Huh, Robert Griffin, Georgia Gkioxari, Aaron Ames

发表机构 * California Institute of Technology(加州理工学院) The Institute for Human & Machine Cognition(人机认知研究院)

AI总结 提出HANDOFF框架,通过多教师KL蒸馏和上下文门控机制,将全身运动跟踪、行走和跌倒恢复三个专家策略融合为混合专家学生策略,实现基于紧凑显式接口的全身控制,在Unitree G1上达到先进的速度跟踪性能并扩展了操作工作空间。

Comments 22 pages, 9 figures, Project page: https://lzyang2000.github.io/HANDOFF/

详情
AI中文摘要

对于要在现实世界中部署的人形机器人,命令空间(即任务规划与全身控制之间的接口)的选择至关重要。现有的全身控制器通常需要密集的运动学或空间参考,而规划器难以从任务语义中合成这些参考。我们提出了一种紧凑、显式的接口,该接口直观、通用、模块化且具有足够的表达能力,适用于多种操作技能。为此,我们引入了HANDOFF,这是一个单一的人形全身控制器,遵循该接口,并通过多教师KL蒸馏,在上下文条件门控方案下,从三个互补专家(具有安全过滤数据的全身运动跟踪、行走和跌倒恢复)中蒸馏出混合专家学生。在Unitree G1上,HANDOFF达到了最先进的速度跟踪性能,并提供了最大的鲁棒操作工作空间之一。我们进一步通过多个自然语言驱动的任务执行演示了硬件可行性,这些任务由VLM驱动的智能体规划器提供支持,无需特定任务数据或控制器微调。

英文摘要

For a humanoid robot to be deployed in the real world, the choice of command space (i.e., the interface between task planning and whole-body control) is crucial. Existing whole-body controllers typically demand dense kinematic or spatial references that planners struggle to synthesize from task semantics. We instead propose a compact, explicit interface that is intuitive, general, modular, and expressive enough for diverse loco-manipulation skills. To this end, we introduce HANDOFF, a single humanoid whole-body controller that follows this interface and is distilled via multi-teacher KL distillation under a context-conditioned gating scheme into a mixture-of-experts student from three complementary specialists: whole-body motion tracking with safety-filtered data, locomotion, and fall-recovery. On the Unitree G1, HANDOFF matches state-of-the-art velocity tracking and offers one of the largest robust manipulation workspaces. We further demonstrate hardware feasibility through multiple natural-language-driven task roll-outs, powered by a VLM-driven agentic planner with no task-specific data or controller fine-tuning.

2606.06323 2026-06-10 cs.RO 版本更新

VOLT: Vision and Language Trajectory Segmentation for Faster-than-Demonstration Policies

VOLT: 面向超演示速度策略的视觉与语言轨迹分割

Robert Ramirez Sanchez, Daniel J. Evans, Dylan P. Losey, Siddarth Jain

发表机构 * Collab, Dept. of Mechanical Engineering, Virginia Tech, Blacksburg, VA 24061(机械工程系,弗吉尼亚理工学院,布莱克斯堡,VA 24061) Mitsubishi Electric Research Laboratories (MERL), Cambridge, MA 02139(三菱电机研究实验室(MERL),剑桥,MA 02139)

AI总结 提出VOLT方法,通过视觉与语言线索对演示轨迹进行分割,选择性下采样安全加速部分,保留需要精细操作的慢速段,从而训练出比演示更快的机器人策略。

详情
AI中文摘要

人类演示任务所需的时间通常比机器人执行任务的时间长。许多工业和实际应用要求机器人尽可能快地执行任务,而不是学习以相同速度复制演示。本文研究了实现超演示速度策略的几种假设。实验表明,最有效的策略是对记录的演示进行下采样,并在加速后的数据上训练机器人策略。然而,均匀下采样整个轨迹可能存在问题:任务的某些部分可以安全加速(例如无约束运动),而其他部分则需要更慢、更精确的运动(例如物体交互或精细操作)。为解决这一挑战,我们提出了VOLT,一种视觉与语言轨迹分割方法,它推理视频演示,并利用上下文线索确定何时加速合适以及何时需要小心精确。VOLT识别需要缓慢、谨慎运动的分段,然后选择性地对剩余分段进行下采样。得到的重新格式化轨迹可用于标准模仿学习方法,如扩散策略。我们的结果强调分割质量至关重要——基线方法常常错误判断何时可以加速,导致策略过于谨慎或不可靠。与最先进的替代方法相比,VOLT使机器人能够更快地执行任务,同时保持强劲性能。

英文摘要

Humans often take longer to demonstrate a task than a robot would need to execute it. Rather than learning to replicate the demonstration at the same pace, many industrial and practical applications require robots to perform tasks as quickly as possible. In this paper, we investigate several hypotheses for learning policies that operate faster than the demonstrations. Our experiments show that the most effective strategy is to downsample recorded demonstrations and train the robot's policy on this accelerated data. However, uniformly downsampling an entire trajectory can be problematic. Some parts of a task can be safely sped up (e.g., unconstrained motion), while others demand slower, more precise motion (e.g., object interactions or fine manipulation). To address this challenge, we introduce VOLT, a vision-and-language trajectory segmentation method that reasons over video demonstrations and leverages contextual cues to determine when acceleration is appropriate and when careful precision is required. VOLT identifies segments where slow, deliberate motion is necessary, then selectively downsamples the remaining segments. The resulting reformatted trajectories can be used with standard imitation learning approaches, such as diffusion policies. Our results highlight that segmentation quality is critical: baseline methods often misidentify when acceleration is possible, leading to overly cautious or unreliable policies. Compared to state-of-the-art alternatives, VOLT allows robots to execute tasks faster while maintaining strong performance.
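
The selective downsampling step can be sketched as follows. The slow segments are supplied by hand here as `(start, end)` index ranges, whereas VOLT derives them from vision and language cues; the function name, the segment format, and the downsampling factor are hypothetical.

```python
def selective_downsample(trajectory, slow_segments, factor=2):
    """Keep every frame inside slow segments; keep every `factor`-th frame elsewhere.

    trajectory:    list of frames (any objects), one per demonstration timestep
    slow_segments: list of (start, end) index ranges, end exclusive
    """
    def is_slow(t):
        return any(start <= t < end for start, end in slow_segments)

    kept = []
    phase = 0  # position within the current run of fast frames
    for t, frame in enumerate(trajectory):
        if is_slow(t):
            kept.append(frame)
            phase = 0  # restart the skip pattern after a slow segment
        else:
            if phase == 0:
                kept.append(frame)
            phase = (phase + 1) % factor
    return kept


demo = list(range(10))                       # stand-in for 10 recorded frames
fast_demo = selective_downsample(demo, slow_segments=[(4, 7)], factor=2)
print(fast_demo)
```

Frames 4 to 6 survive untouched while the free-space motion on either side is thinned, so a policy trained on the result moves faster only where the segmentation says it is safe.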

2606.06021 2026-06-10 cs.LG cs.AI 版本更新

OPRD: On-Policy Representation Distillation

OPRD: 在线策略表示蒸馏

Shenzhi Yang, Guangcheng Zhu, Bowen Song, Haobo Wang, Mingxuan Xia, Xing Zheng, Yingfan Ma, Zhongqi Chen, Weiqiang Wang, Gang Chen

发表机构 * Zhejiang University(浙江大学) Ant Group(蚂蚁集团)

AI总结 针对在线策略蒸馏中输出空间监督的采样方差和忽略中间隐藏状态的问题,提出OPRD方法,通过在隐藏状态空间对齐师生表示,消除采样方差、提供更丰富的逐层结构信息,并在AIME等基准上缩小师生差距,训练速度提升1.44倍,内存减少54%。

详情
AI中文摘要

在线策略蒸馏(OPD)仅通过匹配下一个词元的概率在输出空间监督学生。这种仅输出范式有两个限制:(1)在大词汇量(例如Qwen约15万个词元)上,蒙特卡洛KL估计的采样方差在整个训练过程中持续存在;(2)它将教师视为黑盒,丢弃了LM头之后的所有中间隐藏状态。我们提出在线策略表示蒸馏(OPRD),通过在相同轨迹上选择层对齐学生和教师的表示,将蒸馏提升到隐藏状态空间,完全绕过LM头。理论上,OPRD消除了采样方差,并提供了更丰富的逐层结构信息。实验上,OPRD在AIME 2024/2025和AIMO上缩小了学生与教师之间的差距,而输出空间OPD基线停滞在教师水平以下。OPRD的训练速度也比top-k OPD快1.44倍,内存使用减少54%。代码:https://github.com/ShenzhiYang2000/OPRD。

英文摘要

On-policy distillation (OPD) supervises the student only in output space by matching next-token probabilities. This output-only paradigm has two limits: (1) sampling variance from Monte Carlo KL estimates over large vocabularies (e.g., Qwen's ~150k tokens) persists throughout training, and (2) it treats the teacher as a black-box, discarding all intermediate hidden states after the LM head. We propose On-Policy Representation Distillation (OPRD), which lifts distillation into hidden-state space by aligning student and teacher representations across selected layers on the same rollouts, bypassing the LM head entirely. Theoretically, OPRD eliminates sampling variance and provides richer per-layer structural information. Empirically, OPRD closes the student-teacher gap on AIME 2024/2025 and AIMO, while output-space OPD baselines plateau below the teacher. OPRD also trains 1.44x faster and uses 54% less memory than top-k OPD. Code: https://github.com/ShenzhiYang2000/OPRD.
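
As a rough illustration of aligning representations rather than output probabilities, the sketch below averages a per-layer mean squared error over selected layer pairs. The choice of MSE, the assumption that student and teacher share a hidden width (no projection), and all names are hypothetical; the abstract does not state the distance or the layer-selection rule.

```python
def layer_alignment_loss(student_layers, teacher_layers, layer_pairs):
    """Average over selected (student_idx, teacher_idx) pairs of the hidden-state MSE.

    Each element of *_layers is a [tokens][dim] nested list computed on the
    same rollout, so tokens line up position by position.
    """
    total = 0.0
    for s_idx, t_idx in layer_pairs:
        s, t = student_layers[s_idx], teacher_layers[t_idx]
        squared = [(a - b) ** 2 for sv, tv in zip(s, t) for a, b in zip(sv, tv)]
        total += sum(squared) / len(squared)
    return total / len(layer_pairs)


# Toy hidden states: 2 layers, 2 tokens, width 2.
student = [[[0.0, 0.0], [1.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]]]
teacher = [[[0.0, 0.0], [1.0, 1.0]], [[1.0, 2.0], [3.0, 6.0]]]
loss = layer_alignment_loss(student, teacher, layer_pairs=[(0, 0), (1, 1)])
print(loss)
```

Nothing in this loss is sampled from the vocabulary, so for a fixed rollout it is a deterministic function of the hidden states, which is the intuition behind the abstract's claim about removing sampling variance.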

2606.05645 2026-06-10 cs.RO 版本更新

Discrete-WAM: Unified Discrete Vision-Action Token Editing for World-Policy Learning

Discrete-WAM:面向世界-策略学习的统一离散视觉-动作标记编辑

Ziyang Yao, Haochen Liu, Yuncheng Jiang, Zeyu Zhu, Zibin Guo, Jingru Wang, Tianle Liu, Jianwei Cui, Kuiyuan Yang, Hongwei Xie, Jingwei Zhao, Guang Chen, Hangjun Ye

发表机构 * Xiaomi EV(小米电动车)

AI总结 提出Discrete-WAM,通过将未来视觉状态和自车动作对齐为离散标记,构建统一离散扩散框架,实现世界建模、世界-动作策略和分层决策策略的联合学习,支持可控生成和反事实推理,提升自动驾驶决策可靠性。

详情
AI中文摘要

自动驾驶需要对自车动作如何影响周围世界的演变进行推理。然而,大多数端到端方法依赖于直接的状态到动作映射,捕捉相关性而没有显式建模动作条件动力学。相反,连续潜在世界模型通常缺乏用于跨反事实未来进行因果推理的组合结构。我们提出Discrete-WAM,一种统一的潜在视觉-动作世界策略,将未来视觉状态和自车动作表示为对齐的离散标记,实现跨替代未来的组合因果推理。基于这种统一的离散对齐,Discrete-WAM建立了一个具有统一生成任务的共享离散扩散框架,共同制定世界建模、世界-动作策略和分层决策使能策略,支持跨多种驾驶场景的组合泛化。在大型自动驾驶基准上的实验表明,Discrete-WAM在实现竞争性能的同时,支持可控生成和反事实推理,为更可靠的决策提供了一条原则性路径。

英文摘要

Autonomous driving requires reasoning about how ego actions shape future world evolution, rather than merely mapping observations to actions. However, most end-to-end methods rely on direct state-to-action imitation, while existing world models often remain weakly aligned with downstream policy generation. We introduce Discrete-WAM, a unified discrete vision-action world-policy framework that represents visual observations, future states, high-level decisions, and ego actions within a shared token space. Built on this discrete alignment, Discrete-WAM jointly trains world modeling, world-policy modeling, and policy modeling through multi-task and multi-stage pretraining, allowing action-conditioned future prediction to directly support policy generation. For downstream planning, Discrete-WAM further decomposes policy generation into hierarchical decision prediction and parallel action-token editing, where the decision token provides a high-level planning skeleton and confidence-based scheduling refines dense future actions efficiently. Experiments on large-scale autonomous-driving benchmarks show that Discrete-WAM achieves strong planning performance while supporting controllable future generation, counterfactual evaluation, surprise-based world-model analysis, and efficient parallel policy decoding. These results suggest that discrete representation alignment, unified world-policy training, and hierarchical token editing provide a promising design paradigm for physical AI.

2606.05597 2026-06-10 cs.LG 版本更新

AsyncWebRL: Efficient Multi-Step RL for Visual Web Agents

AsyncWebRL: 面向视觉网页智能体的高效多步强化学习

Hao Bai, Rui Yang, Chenlu Ye, Spencer Whitehead, Aviral Kumar, Tong Zhang

发表机构 * UIUC(伊利诺伊大学香槟分校) Microsoft(微软) CMU(卡内基梅隆大学)

AI总结 提出异步系统设计和算法改进,解决多步强化学习中GPU空闲和轨迹过长问题,实现训练吞吐量提升2.9倍,并在WebGym测试集上取得新最优结果。

Comments Updated logo and code link

详情
AI中文摘要

使用多步强化学习训练视觉语言网页智能体计算密集,存在两种主要低效形式:同步强化学习中的GPU空闲,以及使用比必要更多步数和令牌的轨迹。我们提出AsyncWebRL,同时解决这两个问题。在系统方面,异步设计在迭代间重叠展开、梯度更新和策略刷新,并配合两种针对网页智能体的特定适配,即永久展开池和轻量级截图处理,共同实现端到端训练吞吐量比先前最快的开源同步流水线(WebGym)提升高达2.9倍。在算法方面,我们识别出多步GRPO中的每轨迹归一化器$1/|τ_i|$是轨迹级和令牌级低效的根本原因:因为失败轨迹系统性地比成功轨迹长,它降低了失败令牌上负梯度的权重,导致策略持续生成冗长的记忆模式。将$1/|τ_i|$替换为常数$1/k$打破了这种耦合,在保持总体成功率的同时缩短了轨迹。这些贡献共同在WebGym分布外测试集上设立了新的开源最优水平(相对先前最佳42.9%提升5.8%),在更难子集上提升最大(中等难度相对提升42%,困难难度相对提升48%)。

英文摘要

Training vision-language web agents with multi-step RL is compute-intensive, with two dominant forms of inefficiency: idle GPUs in synchronous RL, and trajectories that use more steps and tokens than necessary. We present AsyncWebRL, which addresses both. On the system side, an asynchronous design overlaps rollout, gradient update, and policy refresh across iterations, paired with two web-agent-specific adaptations, namely an everlasting rollout pool and lightweight screenshot handling, that together deliver up to a $2.9\times$ end-to-end training-throughput speedup over the previously fastest open synchronous pipeline (WebGym). On the algorithmic side, we identify the per-trajectory normalizer $1/|τ_i|$ in multi-step GRPO as the root cause of trajectory-level and token-level inefficiency: because failures are systematically longer than successes, it down-weights the negative gradient on failed tokens, so the policy keeps producing verbose memory schemas. Replacing $1/|τ_i|$ with a constant $1/k$ breaks this coupling, contracting trajectories while preserving aggregate success. Together, these contributions set a new open-source state of the art on the WebGym out-of-distribution test split (+5.8% relative over the 42.9% prior best), with the largest gains on the harder slices (+42% relative on Medium, +48% relative on Hard).
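
The effect of the normalizer can be seen with a small numeric sketch. It assumes $|τ_i|$ counts the tokens of trajectory $i$ and that every token in a trajectory shares that trajectory's advantage; the lengths, advantages, and the value of `k` are made up for illustration.

```python
def token_weights(lengths, advantages, k=None):
    """Per-token gradient weight for each trajectory.

    k=None  -> per-trajectory normalizer 1/|tau_i|
    k=value -> constant normalizer 1/k shared by all trajectories
    """
    weights = []
    for n_tokens, adv in zip(lengths, advantages):
        denom = n_tokens if k is None else k
        weights.append(adv / denom)
    return weights


lengths = [5, 20]            # a short success and a long failure
advantages = [1.0, -1.0]

per_traj = token_weights(lengths, advantages)        # 1/|tau_i|
constant = token_weights(lengths, advantages, k=10)  # 1/k
print(per_traj, constant)
```

With $1/|τ_i|$, each token of the long failure is pushed down four times more weakly than each token of the short success is pushed up; with a constant $1/k$ the per-token magnitudes match, so long failures are no longer shielded by their own length.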

2606.05463 2026-06-10 cs.AI 版本更新

PSEBench: A Controllable and Verifiable Benchmark for Evaluating LLMs in Patient Safety Event Triage

PSEBench: 一个用于评估大语言模型在患者安全事件分类中的可控且可验证的基准

Keqi Han, Ryan Young, Annabel Strauss, Lindsey Hughes, Katharine M. Nesbitt, Nicole Schueler, Che Ngufor, Carl Yang, Yuan Xue, Zhijun Yin

发表机构 * Emory University(埃默里大学) Scale AI Mayo Clinic(梅奥诊所) Vanderbilt University Medical Center(范德比大学医学中心)

AI总结 提出基于政策条款卡的结构化构建方法,通过锚点驱动实例化和闭环验证生成带真实标签的叙事,并创建包含5074个案例的基准PSEBench,评估15个代表性LLM在患者安全事件分类中的能力。

详情
AI中文摘要

患者安全事件分类,即根据特定管辖政策判断临床事件是否需要报告,是一项高风险任务,通常由患者安全专家手动完成。尽管大语言模型(LLM)可能支持这一工作流程,但由于缺乏能够捕捉基于证据的政策推理、针对不完整报告的主动信息寻求以及在不可简化模糊情况下原则性弃权的基准,可靠评估受到限制。我们通过一种基于政策的结构化构建方法来解决这一差距,该方法以条款卡(clause card)为核心,这是一种将监管文本分解为可审计决策规范的结构化表示。结合条款卡与锚点驱动实例化和闭环验证,我们的可扩展流水线生成具有构造性真实标签的叙事,并自然支持生成缺失信息和不确定变体。我们将该方法应用于明尼苏达州29项可报告不良健康事件,创建了PSEBench,一个包含5074个案例的基准,并配备代理评估环境。对15个代表性LLM的评估揭示了一致的能力趋势,展示了基准的实用性,并指出了实现基于LLM的可靠患者安全事件分类的可操作差距。

英文摘要

Patient safety event triage, determining whether a clinical event is reportable under jurisdiction-specific policy, is a high-stakes task typically performed manually by patient safety experts. Although LLMs may support this workflow, reliable evaluation is limited by the lack of benchmarks to capture evidence-grounded policy reasoning, proactive information seeking for incomplete reports, and principled abstention in irreducibly ambiguous cases. We address this gap with a policy-grounded construction methodology centered on the clause card, a structured representation that factorizes regulatory text into auditable decision specifications. Combining clause cards with anchor-driven instantiation and closed-loop verification, our scalable pipeline produces narratives with by-construction ground truth and naturally supports generating missing information and uncertain variants. We instantiate this method on Minnesota's 29 Reportable Adverse Health Events, producing PSEBench, a 5,074-case benchmark with an agentic evaluation environment. Evaluation on 15 representative LLMs reveals consistent capability trends, demonstrates the benchmark's utility, and identifies actionable gaps toward reliable LLM-based patient safety event triage.

2606.05399 2026-06-10 cs.CV 版本更新

UniPixie: Unified and Probabilistic 3D Physics Learning via Flow Matching

UniPixie: 基于流匹配的统一概率三维物理学习

Qilin Huang, Quynh Anh Huynh, Long Le, Chen Wang, Chuhao Chen, Ryan Lucas, Eric Eaton, Lingjie Liu

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Southern University of Science and Technology(南方科技大学) MIT(麻省理工学院)

AI总结 提出UniPixie框架,通过流匹配学习从单张视觉输入到连续可控材料属性分布的映射,实现多样物理场生成并降低杨氏模量预测误差超50%。

Comments Published at CVPR 2026 as a Highlight. Project page: https://unipixie.github.io/

详情
AI中文摘要

现有的前馈网络擅长从视觉外观预测单一物理属性集,但这种点估计范式从根本上无法捕捉现实世界固有的物理模糊性。我们通过将物理预测重构为学习可控、连续的材料属性分布任务来解决这一问题。我们引入UNIPIXIE框架,该框架训练用于从单张视觉输入预测一条连续且参数化的物理合理材料属性路径。通过在我们的PIXIEMULTIVERSE数据集上学习沿物体从最软到最硬谱的直接映射,UNIPIXIE允许通过单个直观参数可控地生成多样、物理有效的材料场。关键的是,UNIPIXIE引入了一种新颖的统一架构,为多种物理求解器生成可模拟的参数,包括基于连续介质的物质点法(MPM)、基于线性混合蒙皮(LBS)的降阶变形以及基于锚点的弹簧-质量系统,解决了先前工作中的关键可移植性问题。实验表明,我们的方法不仅生成丰富多样的合理动力学,而且相比最强的确定性基线,将杨氏模量预测误差降低了50%以上,弥合了静态点估计与物理现实连续性之间的差距。项目页面:https://unipixie.github.io/

英文摘要

Existing feed-forward networks excel at predicting a single set of physical properties from visual appearance, but this point-estimate paradigm fundamentally fails to capture the real world's inherent physical ambiguity. We address this by reframing physics prediction as a task of learning a controllable, continuous distribution of material properties. We introduce UNIPIXIE, a framework trained to predict a continuous and parameterized path of physically plausible material properties from a single visual input. By learning a direct mapping along an object's softest-to-stiffest spectrum on our PIXIEMULTIVERSE dataset, UNIPIXIE allows for controllable generation of diverse, physically valid material fields via a single intuitive parameter. Crucially, UNIPIXIE introduces a novel unified architecture to produce simulation-ready parameters for diverse physics solvers, including continuum-based Material Point Method (MPM), reduced-order deformation based on Linear Blend Skinning (LBS), and anchor-based Spring-Mass systems, addressing a key portability issue in prior work. Experiments show our approach not only generates a rich variety of plausible dynamics but also reduces Young's Modulus prediction error by over 50% against the strongest deterministic baseline, bridging the gap between static point estimates and the continuous nature of physical reality. Project page: https://unipixie.github.io/

2605.03217 2026-06-10 cs.LG cs.CY 版本更新

Moral Sensitivity in LLMs: A Tiered Evaluation of Contextual Bias via Behavioral Profiling and Mechanistic Interpretability

大语言模型中的道德敏感性:通过行为剖析和机制可解释性对上下文偏见进行分层评估

Yash Aggarwal, Atmika Gorti, Vinija Jain, Aman Chadha, Krishnaprasad Thirunarayan, Manas Gaur

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校) Purdue University(普渡大学) Meta Apple(苹果公司) Wright State University(怀特州立大学) University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校)

AI总结 提出道德敏感性指数(MSI)量化大语言模型在七级压力测试中的偏见概率,并通过行为剖析和机制验证揭示模型偏见随情境变化的U型曲线,发现推理蒸馏会重新激活浅层统计关联。

详情
AI中文摘要

大型语言模型(LLM)越来越多地部署在需要细致伦理推理的环境中,但现有的偏见评估将模型输出简单地视为“有偏见”或“无偏见”。这种二元框架忽略了偏见实际出现的渐进、情境敏感的方式。我们分两个阶段解决这一差距:行为剖析和机制验证。在行为阶段,我们引入了道德敏感性指数(MSI),该指标量化了在从抽象数值问题到基于历史和社会经济不公正场景的七级压力测试中产生偏见输出的概率。评估四个领先模型(Claude 3.5、Qwen 3.5、Llama 3和Gemini 1.5),我们识别出由对齐设计塑造的不同行为特征:例如,Gemini 1.5在社会经济框架下达到第5级时MSI为72.7%,而Claude表现出与基于身份的安全训练一致的强烈抑制。然后,我们在机制上验证这些行为模式。我们选择在所有模型中产生最高MSI分数的犯罪偏见场景作为探针,并将logit透镜、注意力分析、激活修补和语义探针应用于一组受控的六个模型,涵盖三个能力层级:小型语言模型(SLM)、指令微调基础模型和推理蒸馏变体。电路级分析揭示了偏见的U型曲线:SLM表现出强烈的犯罪偏见;扩展到指令微调模型消除了偏见;推理蒸馏将偏见重新引入到类似SLM的水平,尽管参数数量相同,这表明蒸馏以重新激活浅层统计关联的方式压缩了推理轨迹。关键的是,驱动高MSI分数的社会负载线索激活了与机制识别出的相同偏见驱动电路,提供了跨阶段验证。

英文摘要

Large language models (LLMs) are increasingly deployed in settings that require nuanced ethical reasoning, yet existing bias evaluations treat model outputs as simply "biased" or "unbiased." This binary framing misses the gradual, context-sensitive way bias actually emerges. We address this gap in two stages: behavioral profiling and mechanistic validation. In the behavioral stage, we introduce the Moral Sensitivity Index (MSI), a metric that quantifies the probability of biased output across a graduated, seven-tier stress test ranging from abstract numerical problems to scenarios rooted in historical and socioeconomic injustice. Evaluating four leading models (Claude 3.5, Qwen 3.5, Llama 3, and Gemini 1.5), we identify distinct behavioral signatures shaped by alignment design: for instance, Gemini 1.5 reaches 72.7% MSI by Tier 5 under socioeconomic framing, while Claude exhibits sharp suppression consistent with identity-based safety training. We then verify these behavioral patterns mechanistically. We select criminal-bias scenarios, which produced the highest MSI scores across models, as probes and apply logit lens, attention analysis, activation patching, and semantic probing to a controlled set of six models spanning three capability tiers: small language models (SLMs), instruction-tuned base models, and reasoning-distilled variants. Circuit-level analysis reveals a U-curve of bias: SLMs exhibit strong criminal bias; scaling to instruction-tuned models eliminates it; reasoning distillation reintroduces bias to SLM-like levels despite identical parameter counts, suggesting distillation compresses reasoning traces in ways that reactivate shallow statistical associations. Critically, the socially loaded cues that drive high MSI scores activate the same bias-driving circuits identified mechanistically, providing cross-stage validation.

2606.04746 2026-06-10 cs.RO Version update

CADENCE: Predicting Realized MAPF Execution Time Beyond Sum of Costs

CADENCE: Predicting Actual MAPF Execution Time Beyond the Sum of Costs

Abhishek S, Badrikanath Praharaj, Sreeram MV

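Affiliations * University of California, Berkeley

AI summary Proposes the CADENCE framework. By analyzing primitive motion burden and interaction-aware coordination features, it finds that primitive motion burden substantially improves the prediction of multi-agent path finding execution time beyond the traditional Sum of Costs metric.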

Comments 7 pages, 4 figures, 3 tables. Accepted at "Multi-Agent Robotic Systems: Real-World Collaboration and Interaction," a workshop at the International Conference on Robotics and Automation (ICRA 2026)

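Details
AI abstract (translated from Chinese)

Multi-Agent Path Finding (MAPF) algorithms are increasingly used to plan the motion of robot teams in industrial warehouses and shared robotic workspaces, but standard MAPF evaluation metrics such as Sum of Costs (SoC), makespan, and planner runtime can obscure how planning choices translate into real execution performance. We present CADENCE (Coordination and Action-Driven Estimation for Networked Continuous Execution), a hardware study on a fixed 7×7 workcell with seven differential-drive robots that asks which features available before execution best predict the final wall-clock completion time. We compare SoC, total planned travel cost, primitive motion burden (the amount of basic motion a plan requires, such as makespan, turns, consecutive moves, and start-stop transitions), and interaction-aware coordination structure (the amount of inter-robot coordination a plan induces, such as dependency links, interacting robot pairs, dependency depth, and crowding exposure). To test this, we generate 120 plans across 15 scenarios (5 Empty, 5 Medium Random, and 5 Bottleneck) and execute each plan four times, yielding a hardware corpus of 480 trials. Using a scenario-held-out ridge regression model and a trial-level mixed-effects model, we find that SoC alone is informative but incomplete, while primitive motion burden provides the strongest improvement, reducing held-out error by about 48.6%-59.8% in MAE and 44.2%-61.4% in RMSE relative to SoC-only models. Interaction-aware coordination features add smaller and less uniform gains, most evident in the mixed-effects analysis. Across both models and the uncertainty checks, primitive motion burden is the most reliable additional signal beyond SoC, indicating that much of the execution-time gap is already visible in the offline plan before the robots start moving.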

English abstract

Multi-Agent Path Finding (MAPF) algorithms are increasingly used to plan motion for robot teams in industrial warehouses and shared robotic workspaces, but standard MAPF evaluation metrics, such as Sum of Costs (SoC), makespan, and planner runtime, can obscure how planner choices translate into realistic execution performance. We present CADENCE (Coordination and Action-Driven Estimation for Networked Continuous Execution), a hardware study of this evaluation gap on a fixed 7 by 7 workcell with seven differential-drive robots, asking which features available before execution best predict final wall-clock completion time. We compare SoC, total planned travel cost, primitive motion burden (how much basic motion the plan requires, such as makespan, turns, consecutive moves, and start-stop transitions), and interaction-aware coordination structure (how much inter-robot coordination the plan induces, such as dependency links, interacting robot pairs, dependency depth, and crowding exposure). To test this, we generate 120 plans across 15 scenarios (5 Empty, 5 Medium Random, and 5 Bottleneck) and execute each plan four times, yielding a 480-trial hardware corpus. Using both a scenario-held-out ridge model and a trial-level mixed-effects model, we find that SoC alone is informative but incomplete, while primitive motion burden gives the strongest improvement, reducing held-out error by about 48.6%-59.8% in MAE and 44.2%-61.4% in RMSE relative to SoC-only models. Interaction-aware coordination features add smaller, less uniform gains, most clearly in the mixed-effects analysis. Across both models and uncertainty checks, primitive motion burden is the most reliable additional signal beyond SoC, suggesting that much of the execution-time gap is already visible in the offline plan before any robot starts moving.
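The reported gains are relative reductions in held-out MAE and RMSE, measured with scenarios held out of training. The sketch below shows one plausible way to form such splits and compute the reduction. The helper names and all numbers are made up for illustration, and the paper's exact split protocol is an assumption here.

```python
import math


def leave_one_scenario_out(scenario_ids):
    """Yield (held_out, train_idx, test_idx); no scenario appears in both splits."""
    for held_out in sorted(set(scenario_ids)):
        train = [i for i, s in enumerate(scenario_ids) if s != held_out]
        test = [i for i, s in enumerate(scenario_ids) if s == held_out]
        yield held_out, train, test


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def relative_reduction(baseline_error, new_error):
    """Percent by which new_error is lower than baseline_error."""
    return 100.0 * (1.0 - new_error / baseline_error)


# Made-up wall-clock times (seconds) for four held-out trials.
actual = [100.0, 120.0, 90.0, 110.0]
soc_only = [110.0, 100.0, 100.0, 90.0]     # hypothetical SoC-only predictions
with_motion = [105.0, 110.0, 95.0, 100.0]  # hypothetical SoC + motion-burden predictions

mae_gain = relative_reduction(mae(actual, soc_only), mae(actual, with_motion))
rmse_gain = relative_reduction(rmse(actual, soc_only), rmse(actual, with_motion))
print(round(mae_gain, 1), round(rmse_gain, 1))

# Hypothetical scenario labels for six trials.
splits = list(leave_one_scenario_out(["empty", "empty", "random", "random", "neck", "neck"]))
```

Splitting by scenario rather than by trial matters because repeated executions of the same plan are correlated; a trial-level split would let near-duplicates of the test data into training.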

2606.04212 2026-06-10 cs.LG stat.ML Version update

Edge of Stability Selectively Shapes Learning Across the Data Distribution

The Edge of Stability Selectively Shapes Learning Across the Data Distribution

Shauna Kwag, Anakha Ganesh, Tomaso Poggio, Pierfrancesco Beneventano

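Affiliations * MIT

AI summary This paper finds that the edge of stability (EoS) in optimization is selective. A branching intervention shows causally that EoS redistributes learning across subsets of the training data, and the paper identifies two conditions a group must meet to benefit: its gradient aligns with the top Hessian eigenvector, and its gradient magnitude stays non-zero over time.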

Comments ICML HiLD 2026; 27 pages, 22 figures

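Details
AI abstract (translated from Chinese)

Existing analyses of the edge of stability (EoS) treat it as a global property of optimization. We show that it is also selective: the stability constraint redistributes learning across subsets of the training distribution, amplifying progress on some groups while suppressing progress on others. Using a branching intervention that enters or exits the EoS regime from the same training state, we causally demonstrate this trade-off and identify two necessary conditions for a group to benefit. First, its aggregate gradient must align with the top Hessian eigenvector. We isolate this mechanism with a controlled perturbation that preserves distance but randomizes direction, which destroys the alignment and eliminates the advantage. Second, the group must maintain a non-zero gradient magnitude over time. Under cross-entropy loss, gradient saturation decouples high-confidence groups, shifting the advantage to output outliers, whose gradients persist. Together, these results show that EoS acts not only as a stability boundary but also as a mechanism that controls how learning is allocated across the data distribution.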

English abstract

Existing analyses of the edge of stability (EoS) treat it as a global property of optimization. We show that it is also selective: the stability constraint redistributes learning across subsets of the training distribution, amplifying progress on some groups while suppressing progress on others. Using a branching intervention that enters or exits the EoS regime from the same training state, we causally demonstrate this trade-off and identify two necessary conditions for a group to benefit. First, its aggregate gradient must align with the top Hessian eigenvector. We isolate this mechanism with a controlled perturbation that preserves distance but randomizes direction, destroying alignment and eliminating the advantage. Second, the group must sustain non-vanishing gradient magnitude over time. Under cross-entropy loss, gradient saturation decouples confidently classified groups, shifting the advantage to output-outliers, whose gradients persist. Together, these results show that EoS functions not only as a stability boundary, but as a mechanism governing the allocation of learning across the data distribution.
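The two conditions can be made concrete on a toy problem. The sketch below measures (1) how well a group's gradient aligns with the top Hessian eigenvector and (2) how the cross-entropy gradient with respect to the logits, `softmax(logits) - one_hot(label)`, shrinks for confidently classified examples. The 2x2 Hessian, the gradients, and the function names are illustrative assumptions, and the logit gradient is only a proxy for the parameter-space magnitude the paper refers to.

```python
import math


def top_eigenvector(hessian, iters=200):
    """Power iteration for the leading eigenvector of a symmetric PSD matrix.

    Assumes the uniform start vector is not orthogonal to that eigenvector.
    """
    n = len(hessian)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(hessian[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def alignment(gradient, direction):
    """Absolute cosine similarity (an eigenvector's sign is arbitrary)."""
    dot = sum(g * d for g, d in zip(gradient, direction))
    g_norm = math.sqrt(sum(g * g for g in gradient))
    d_norm = math.sqrt(sum(d * d for d in direction))
    return abs(dot) / (g_norm * d_norm)


def ce_logit_grad_norm(logits, label):
    """Euclidean norm of softmax(logits) - one_hot(label)."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    grad = [e / total - (1.0 if k == label else 0.0) for k, e in enumerate(exps)]
    return math.sqrt(sum(g * g for g in grad))


# Toy Hessian with one sharp direction (curvature 10) and one flat direction.
hessian = [[10.0, 0.0], [0.0, 1.0]]
v_top = top_eigenvector(hessian)

aligned_group = alignment([3.0, 0.1], v_top)     # gradient mostly along the sharp direction
misaligned_group = alignment([0.1, 3.0], v_top)  # gradient mostly along the flat direction

confident = ce_logit_grad_norm([8.0, 0.0, 0.0], label=0)  # saturated example
uncertain = ce_logit_grad_norm([0.5, 0.0, 0.0], label=0)  # gradient persists
```

In this toy setting the first group satisfies the alignment condition and the second does not, while the confident example shows the saturation that the abstract says decouples well-classified groups.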

2606.03963 2026-06-10 cs.RO cs.AI Version update

AgenticRL: Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

Self-Refining Agentic Reinforcement Learning for Vision-Conditioned UAV Navigation

Roohan Ahmed Khan, Yasheerah Yaqoot, Amir Atef Habel, Muhammad Ahsan Mustafa, Dzmitry Tsetserukou

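Affiliations * University of Science and Technology of China

AI summary Proposes the AgenticRL framework, which uses a multimodal GPT agent to design reward functions automatically and refine policies through closed-loop self-improvement, improving performance and achieving high success rates across several UAV navigation tasks.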

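Details
AI abstract (translated from Chinese)

Deep reinforcement learning has shown great potential for enabling autonomous robots to learn complex navigation tasks. However, its practical application still relies heavily on hand-designed reward functions and repeated manual fine-tuning, which is time-consuming and does not guarantee a high success rate on the target task. This paper presents AgenticRL, an agent-guided reinforcement learning framework that increases autonomy in reward design, policy optimization, and real-world deployment for UAV navigation tasks. AgenticRL uses a multimodal generative pre-trained transformer (GPT) agent to interpret task information and visual scene observations, generate task-specific reward functions, train policies with the Proximal Policy Optimization (PPO) algorithm, and then act as a critic that evaluates the trained policy through diagnosis packets and generates feedback. Based on this feedback, the agent identifies failure modes and refines the reward function in a closed-loop self-improvement process. To further exploit the multimodal GPT agent at inference time, AgenticRL uses real-world images and natural-language task information to automatically identify the active scenario and select the appropriate trained policy for execution. The framework is evaluated on a range of navigation tasks, including gate traversal, obstacle avoidance, wall-barrier crossing with landing, trajectory following, and motion behavior learning. Experimental results show that the closed-loop refinement process improves policy behavior by 71% compared with the initial rewards. We also demonstrate sim-to-real transfer of the proposed framework, achieving a real-world success rate of 91% and a sim-to-real accuracy of 94%.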

English abstract

Deep reinforcement learning has shown strong potential for enabling autonomous robots to learn complex navigational tasks. However, its practical use still depends heavily on human-designed reward functions and repeated manual fine-tuning, which is time-consuming and does not guarantee high success in the desired task. This paper presents AgenticRL, an agent-guided reinforcement learning framework that increases autonomy in reward design, policy refinement, and real-world deployment for unmanned aerial vehicle (UAV) navigation tasks. AgenticRL uses a multimodal generative pre-trained transformer (GPT) agent to interpret task information and visual scene observations, generate task-specific reward functions, train policies using the Proximal Policy Optimization (PPO) algorithm, and then act as a critic by evaluating the trained policy through diagnosis packets to generate feedback. Based on this feedback, the agent identifies failure modes and refines the reward function in a closed-loop self-improvement process. To further leverage the multimodal GPT agent during inference, AgenticRL uses real-world images and natural-language task information to automatically identify the active scenario and select the appropriate trained policy for execution. The framework is evaluated on multiple navigational tasks, including gate traversal, obstacle avoidance, wall-barrier crossing with landing, trajectory following, and motion behavior learning. Experimental results show that the closed-loop refinement process improves policy behavior by 71% compared with the initial rewards. We also demonstrate sim-to-real transfer of the proposed framework, achieving a real-world success rate of 91% and a sim-to-real accuracy of 94%.
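The closed loop described above (generate a reward, train, diagnose, refine) can be sketched as plain control flow. This is only an illustration under stated assumptions: `refine_reward`, the stub callables, the reward-as-weights representation, and the stopping rule are all hypothetical. The stubs stand in for PPO training and for the GPT critic and do not reflect the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple

Reward = Dict[str, float]


def refine_reward(
    initial_reward: Reward,
    train_and_diagnose: Callable[[Reward], Tuple[float, List[str]]],
    propose_update: Callable[[Reward, List[str]], Reward],
    target_success: float = 0.9,
    max_rounds: int = 5,
) -> Tuple[Reward, float, int]:
    """Iterate train -> diagnose -> refine; return (best reward, success, rounds used)."""
    reward = dict(initial_reward)
    best_reward, best_success = reward, -1.0
    for round_idx in range(1, max_rounds + 1):
        success, failure_modes = train_and_diagnose(reward)
        if success > best_success:
            best_reward, best_success = reward, success
        if success >= target_success or not failure_modes:
            return best_reward, best_success, round_idx
        reward = propose_update(reward, failure_modes)
    return best_reward, best_success, max_rounds


def stub_train_and_diagnose(reward: Reward) -> Tuple[float, List[str]]:
    """Stand-in for policy training plus critic: success grows with the collision weight."""
    weight = reward.get("collision", 0.0)
    success = min(1.0, 0.4 + 0.2 * weight)
    failures = ["collision"] if weight < 3.0 else []
    return success, failures


def stub_propose_update(reward: Reward, failure_modes: List[str]) -> Reward:
    """Stand-in for the agent's revision: add weight to each reported failure mode."""
    updated = dict(reward)
    for mode in failure_modes:
        updated[mode] = updated.get(mode, 0.0) + 1.0
    return updated


final_reward, final_success, rounds = refine_reward(
    {"progress": 1.0, "collision": 0.0}, stub_train_and_diagnose, stub_propose_update
)
print(final_reward, final_success, rounds)
```

Passing the trainer and critic in as callables keeps the loop independent of any particular simulator or language model, which is convenient for testing the control flow on its own.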