arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12280 2026-06-15 cs.LG New submission

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

Deep Gandhi, Ali Asaria, Tony Salomone

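Affiliations: Transformer Lab

AI summary: This paper applies INT8 W8A8 quantization to Ideogram 4.0, reaching FP8-level quality on Ampere GPUs that lack FP8 tensor cores and outperforming NF4, while GGUF Q4_K is Pareto-optimal on the quality-memory frontier.

AI abstract (translated from Chinese)

Post-training quantization lets large text-to-image diffusion transformers run on consumer GPUs, yet hardware-specific trade-offs are rarely measured directly. We quantize Ideogram 4.0, a 9.3B flow-matching diffusion transformer (DiT) that runs a single-stream 34-layer backbone as two separate-weight copies for classifier-free guidance and is conditioned by a Qwen3-VL-8B encoder, targeting Ampere RTX 3090 GPUs, which lack FP8 tensor cores. Our INT8 W8A8 recipe (per-channel weights, per-token dynamic activations, SmoothQuant, and mixed-precision protection of a small set of high-fragility layers) holds the FP8 quality ceiling: on a 200-prompt benchmark, the paired same-seed bootstrap confidence intervals for INT8 versus FP8 include zero on both Pick and CLIP, while INT8 improves on NF4 by $+1.9$ CLIP (95% confidence interval $[+1.21, +2.64]$, excluding zero). To our knowledge, a per-category OCR analysis has not previously been reported for this class of model; it confirms that text legibility is preserved, and ablations isolate protection of the FFN down-projection as the main quality lever. Our GGUF Q4_K quantization outperforms NF4 at the same on-disk size and is Pareto-optimal on the quality-memory frontier, with paired confidence intervals excluding zero (Q8_0 is quality-neutral). Finally, we characterize where 8-bit quantization helps and where it does not: INT8 weights match the FP8 footprint rather than shrinking it, so speedups on Ampere require fused INT8 kernels.

English abstract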

We study post-training quantization (PTQ) of Ideogram 4.0, a 9.3B flow-matching diffusion transformer (DiT) that realizes classifier-free guidance with two separate-weight copies of a single-stream backbone and is conditioned by a Qwen3-VL text encoder, targeting Ampere RTX 3090 GPUs, which lack FP8 tensor cores. Because Ideogram 4.0 is trained on structured JSON captions, we evaluate every variant under schema-valid JSON prompts produced by an LLM expander built to Ideogram's published caption specification, and score them with a battery spanning human-preference (HPSv2), CLIP, and PickScore for standalone quality; PP-OCR exact-match and edit distance for text; and PSNR/SSIM/LPIPS for fidelity to the output of the FP8 reference (the highest-precision public checkpoint). On a 300-prompt benchmark with paired bootstrap confidence intervals, an INT8 W8A8 recipe (per-channel weights, per-token dynamic activations, SmoothQuant, and bf16 protection of a small high-fragility layer set) is statistically indistinguishable from FP8 on CLIP and PickScore (paired CIs include zero) and within ~0.004 HPSv2, and, at its 8-bit size, is the most faithful reproduction of the FP8 output (LPIPS 0.243 vs 0.277/0.306 for the half-size 4-bit baselines; the INT8-Q4_K gap excludes zero). A GGUF Q4_K quantization reaches the same standalone quality as the published NF4 baseline at the same on-disk size, making it the Pareto choice on the quality-memory frontier. We further show that under JSON prompts all four variants reach parity on standalone quality, so the variants separate on fidelity and text rendering rather than on aggregate image-quality scores, and that text legibility, near zero when the model is prompted with raw strings, reaches 55% OCR exact-match under the JSON captions it expects. We release the INT8 W8A8 and GGUF Q4_K quantized weights on Hugging Face under a gated, non-commercial license.
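
The two scaling choices named in the recipe (per-channel weights, per-token dynamic activations) can be illustrated with a small pure-Python sketch. It is not the authors' implementation: symmetric absmax scaling and round-to-nearest are assumptions, SmoothQuant and the bf16-protected layers are omitted, and the function names `quantize_rows` and `w8a8_linear` are hypothetical.

```python
# Sketch of symmetric INT8 W8A8 for a linear layer y = x @ W^T.
# Assumptions (not from the abstract): absmax scaling, round-to-nearest,
# no SmoothQuant, no mixed-precision layer protection.

def quantize_rows(matrix):
    """Quantize each row to the int8 range [-127, 127] with its own scale."""
    q_rows, scales = [], []
    for row in matrix:
        absmax = max(abs(v) for v in row)
        scale = absmax / 127.0 if absmax > 0 else 1.0
        q_rows.append([max(-127, min(127, round(v / scale))) for v in row])
        scales.append(scale)
    return q_rows, scales


def w8a8_linear(x, w):
    """x: tokens x in_features, w: out_features x in_features."""
    qx, sx = quantize_rows(x)  # one scale per token, computed at run time
    qw, sw = quantize_rows(w)  # one scale per output channel
    out = []
    for i, x_row in enumerate(qx):
        out_row = []
        for j, w_row in enumerate(qw):
            acc = sum(a * b for a, b in zip(x_row, w_row))  # integer accumulate
            out_row.append(acc * sx[i] * sw[j])  # dequantize with both scales
        out.append(out_row)
    return out


def reference_linear(x, w):
    """Full-precision result for comparison."""
    return [[sum(a * b for a, b in zip(xr, wr)) for wr in w] for xr in x]


x = [[0.5, -1.0, 0.25], [2.0, 0.1, -0.3]]
w = [[0.2, -0.4, 0.1], [1.5, 0.3, -0.7]]
approx = w8a8_linear(x, w)
exact = reference_linear(x, w)
max_err = max(abs(a - e) for ar, er in zip(approx, exact) for a, e in zip(ar, er))
```

Because every token and every output channel carries its own scale, one outlier row does not inflate the quantization step of the others, which is the usual motivation for this granularity.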

2606.11898 2026-06-15 cs.CL cs.LG New submission

GraspLLM: Towards Zero-Shot Generalization on Text-Attributed Graphs with LLMs

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Yang Li, Wentao Zhang

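Affiliations: Peking University; National University of Singapore; University of California, Berkeley

AI summary: Proposes the GraspLLM framework, which fuses graph structural understanding with the semantic capabilities of LLMs and uses motif-aware contrastive learning and optimal contextual subgraph alignment to achieve zero-shot generalization across datasets and tasks.

AI abstract (translated from Chinese)

Research on text-attributed graphs (TAGs) has attracted considerable attention in recent years because of its broad applications across real-world data scenarios such as citation networks, e-commerce platforms, social media, and web pages. Inspired by the remarkable semantic understanding of large language models (LLMs), there have been many attempts to integrate LLMs into TAGs. However, existing methods still struggle to generalize across different graphs and tasks, and their ability to capture transferable graph structural patterns is limited. To address this, we propose the GraspLLM framework, which combines graph structural understanding with the semantic understanding of LLMs to improve cross-dataset and cross-task generalization. Specifically, we use a frozen general-purpose embedding model to represent node texts from different graphs in a unified semantic space, on top of which we perform motif-aware contrastive learning over multiple motif-induced adjacency matrices to extract dataset-agnostic structural information. Then, using our proposed optimal contextual subgraph, we extract the most relevant contextual subgraph for each target node and align these subgraphs to the LLM's token space through an alignment projector. Extensive experiments on TAG benchmark datasets covering different domains show that GraspLLM consistently outperforms previous LLM-based TAG methods in zero-shot scenarios, highlighting its strong generalization across datasets and tasks. Our code is available at https://github.com/Heinz217/GraspLLM.

English abstract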

Research on Text-Attributed Graphs (TAGs) has gained significant attention recently due to its broad applications across various real-world data scenarios, such as citation networks, e-commerce platforms, social media, and web pages. Inspired by the remarkable semantic understanding ability of Large Language Models (LLMs), there have been numerous attempts to integrate LLMs into TAGs. However, existing methods still struggle to generalize across diverse graphs and tasks, and their ability to capture transferable graph structural patterns remains limited. To address this, we introduce GraspLLM, a framework that combines Graph structural comprehension with the semantic understanding prowess of LLMs to enhance cross-dataset and cross-task generalizability. Specifically, we represent node texts from different graphs in a unified semantic space with a frozen general embedding model, on top of which we perform motif-aware contrastive learning across multiple motif-induced adjacency matrices to extract dataset-agnostic structural information. Then, with our proposed optimal contextual subgraph, we extract the most contextually relevant subgraph for each target node and align these subgraphs to the token space of the LLM via an alignment projector. Extensive experiments on TAG benchmark datasets spanning diverse domains reveal that GraspLLM consistently outperforms previous LLM-based methods for TAGs, especially in zero-shot scenarios, highlighting its strong generalizability across different datasets and tasks. Our code is available at https://github.com/Heinz217/GraspLLM.
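
To make "motif-induced adjacency matrix" concrete, the sketch below builds one for the triangle motif on an undirected graph, where each edge is weighted by the number of triangles it belongs to. The abstract does not say which motifs GraspLLM uses, so the choice of triangles and the function name `triangle_motif_adjacency` are illustrative assumptions.

```python
# Triangle motif adjacency: M[i][j] = A[i][j] * (A @ A)[i][j], i.e. the number
# of common neighbours of i and j, kept only where i and j are connected.
# Assumption: undirected simple graph given as a 0/1 adjacency matrix.

def triangle_motif_adjacency(adj):
    n = len(adj)
    motif = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if adj[i][j]:
                motif[i][j] = sum(adj[i][k] * adj[k][j] for k in range(n))
    return motif


# Triangle 0-1-2 plus a pendant edge 2-3.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
motif = triangle_motif_adjacency(adj)
```

In this toy graph the pendant edge 2-3 gets weight zero because it closes no triangle, so a model trained on `motif` sees only the triangle structure; repeating this for several motifs gives the multiple structural views the abstract refers to.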

2606.11502 2026-06-15 cs.CL cs.AI New submission

When Roleplaying, Do Models Believe What They Say?

Benjamin Sturgeon, David Africa, Sid Black

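Affiliations: MATS

AI summary: Uses linear truth probes to study how role-play affects the internal representations of LLMs, finding that role-play mainly changes outputs rather than internal truth representations, whereas emergent misalignment changes internal representations more substantially.

AI abstract (translated from Chinese)

A language model can state that "the Earth orbits the Sun" and, when playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models operate, with models constantly selecting the most appropriate persona for a given context. Does such role-play merely change the model's outputs, or does it also affect what the model internally represents as true? We study this question with linear truth probes, applying them to LLMs that play historical figures whose likely beliefs differ from modern consensus. For each persona, we compare false statements the persona would likely endorse (*era-believed*) with topic-matched false statements the persona would not endorse (*era-false*). Across prompting, in-context learning, and supervised fine-tuning, persona induction suppresses era-believed statements less than equally false alternatives, but overall they are still classified as false. Role-play therefore changes what the model says more than what it internally represents as true. We contrast this with models trained on harmful advice that exhibit emergent misalignment (EM). Across three model families (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), their false statements move markedly toward the true region of probe space, are defended under challenge about half the time (versus about one sixth for role-play), and are used in downstream reasoning. Role-play and emergent misalignment are therefore points on a spectrum of belief internalization: role-play changes what the model says with little representational change, while emergent misalignment shifts the internal representation of false statements without fully marking them as true.

English abstract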

Language models can state that "the Earth orbits the Sun" and, when role-playing Aristotle, assert the opposite. Recent work argues that persona adoption is fundamental to how language models operate, with models constantly selecting the most appropriate persona for a given context. Does such role-playing merely change the model's outputs, or does it also affect what the model internally represents as truthful? We study this question with linear truth probes, applying them to LLMs role-playing historical personas whose likely beliefs differ from modern consensus. For each persona, we compare false claims the persona would likely have endorsed (*era-believed*) with topic-matched false claims they would not have endorsed (*era-false*). Across prompting, in-context learning, and supervised fine-tuning, persona induction suppresses era-believed statements less than equally false alternatives, yet they remain classified as false overall. Role-play therefore shifts what these models say more than what they internally represent as true. We contrast this with models trained on harmful advice that exhibit Emergent Misalignment (EM). Across three model families (Qwen 2.5 14B, Qwen 3 8B, and Llama 3.3 70B), their false claims move substantially toward the true region of probe space, are defended under challenge roughly half the time versus about a sixth for role-play, and are used in downstream reasoning. Role-play and Emergent Misalignment thus are points on a spectrum of belief internalization, where role-play changes what a model says with little representational change, while Emergent Misalignment shifts the internal representation of false claims without fully marking them as true.
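
A linear truth probe scores a hidden activation with a single direction and a bias. The sketch below fits one of the simplest such probes, a difference-of-means direction, on made-up two-dimensional "activations". The abstract does not specify how the probes were trained, so the fitting method, the toy data, and the names `fit_mean_difference_probe` and `probe_score` are all assumptions for illustration.

```python
# Difference-of-means linear probe on toy activations.
# Assumption: the probe direction is mean(true) - mean(false) and the decision
# boundary passes through the midpoint of the two class means.

def mean_vector(vectors):
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def fit_mean_difference_probe(true_acts, false_acts):
    mu_true = mean_vector(true_acts)
    mu_false = mean_vector(false_acts)
    direction = [t - f for t, f in zip(mu_true, mu_false)]
    midpoint = [(t + f) / 2 for t, f in zip(mu_true, mu_false)]
    bias = -sum(d * m for d, m in zip(direction, midpoint))
    return direction, bias


def probe_score(direction, bias, activation):
    """Positive score: represented as true; negative: represented as false."""
    return sum(d * a for d, a in zip(direction, activation)) + bias


# Hypothetical activations for statements known to be true or false.
true_acts = [[1.0, 0.2], [0.8, -0.1], [1.2, 0.0]]
false_acts = [[-1.0, 0.1], [-0.9, -0.2], [-1.1, 0.3]]
direction, bias = fit_mean_difference_probe(true_acts, false_acts)

# Hypothetical activation for a false claim asserted while role-playing.
role_play_act = [-0.6, 0.5]
score = probe_score(direction, bias, role_play_act)
```

In the paper's framing, a role-played false claim whose score moves toward zero but stays negative would correspond to "suppressed less, yet still classified as false".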

2606.10881 2026-06-15 cs.AI New submission

Large-scale semantic mapping of learner agency and autonomy reveals what measurement and generative AI research overlook

Fei Qin, Xiaobo Liu, Yaowen Zhang, Xuming Li, Fei Wang, Mutlu Cukurova, Jingjing Chen, Yu Zhang

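Affiliations: School of Education, Tsinghua University; Department of Psychological and Cognitive Sciences, Tsinghua University; Institute of Education, University College London

AI summary: Using a semantic analysis pipeline to extract definitions and scale items from more than 14,000 publications, the study finds that learner agency and autonomy span three dimensions (task, person, and sociocultural), that existing scales neglect the sociocultural dimension, and that generative AI research focuses too narrowly on regulation and control of learning.

Comments: 45 pages, 12 figures, 1 table, including appendices, added funding information

AI abstract (translated from Chinese)

Learner agency and autonomy are foundational to personal development, yet a pervasive "jingle-jangle fallacy" (identical terms denoting different constructs, and different terms denoting the same construct) has seriously hindered the accumulation of knowledge. Treating meaning as a phenomenon constituted through use in linguistic practice, we extracted 8,954 definitions and 2,700 scale items from more than 14,000 publications and used a semantic analysis pipeline to study how researchers actually use learner agency and autonomy. The definitional landscape of the two constructs resolves into three dimensions: regulation and control of learning (task), intrinsic motivation and internal decision-making (person), and social-relational action (sociocultural), thereby empirically quantifying the jingle-jangle fallacy. Existing scales, however, systematically underrepresent the sociocultural dimension. Critically, current generative AI research in education concentrates on regulation and control of learning, narrowing the behavioral repertoire that AI-mediated learning environments aim to cultivate. Beyond conceptual clarification, this work has direct implications for conceptualization, measurement, and practice in support of multidimensional learner agency and autonomy.

English abstract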

Learner agency and autonomy are foundational to personal development, yet a pervasive "jingle-jangle" fallacy (i.e., identical terms denoting different constructs, distinct terms denoting identical ones) has substantially hindered cumulative knowledge. Treating meaning as a phenomenon constituted through use in linguistic practice, we extracted 8,954 definitions and 2,700 scale items from over 14,000 publications and used a semantic analysis pipeline to investigate how researchers actually use learner agency and autonomy. The definitional landscape of the two constructs resolves into three dimensions: regulation and control of learning (task), intrinsic motivation and internal decision-making (person), and social-relational action (sociocultural), thereby empirically quantifying the jingle-jangle fallacy. Existing scales, however, systematically underrepresent the sociocultural dimension. Critically, current generative AI research in education concentrates on learning regulation and control, narrowing the behavioral repertoire that AI-mediated learning environments are designed to cultivate. Beyond conceptual clarification, this work carries direct implications for conceptualization, measurement, and practice towards supporting multidimensional learner agency and autonomy.

2606.08881 2026-06-15 cs.RO cs.AI New submission

Benchmarking Vision-Language-Action Models on SO-101: Failure and Recovery Analysis

Yi Yu, Xinchuan Qiu

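Affiliations: Graduate School of Advanced Science and Engineering, Hiroshima University

AI summary: Proposes a benchmark on the low-cost SO-101 robot platform that uses a failure taxonomy and recovery-aware evaluation metrics to systematically compare VLA and imitation learning policies, finding that execution instability is the main source of failure.

Comments: 13 pages, 9 figures

AI abstract (translated from Chinese)

Vision-language-action (VLA) models have shown strong generalization in robotic manipulation, but existing evaluations are mainly carried out in simulation or on expensive robot platforms, and their robustness on low-cost real robots remains underexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the low-cost SO-101 robot platform. The benchmark comprises four representative manipulation tasks and a unified evaluation protocol, enabling systematic comparison under embodiment uncertainty. Using real teleoperated demonstrations, we fine-tune and evaluate $π_{0.5}$, SmolVLA, Wall-X, and ACT directly on the physical platform. Beyond conventional task success rates, the benchmark includes a structured failure taxonomy, a decomposition into semantic-level and execution-level failures, and recovery-aware evaluation metrics to characterize policy robustness. Experimental results show that stronger pretrained VLA policies generally outperform the imitation learning baseline, although performance is highly task-dependent under low-cost robot deployment conditions. Execution instability is the main source of failure, while recovery capability differs markedly across architectures. These results underscore the importance of failure and recovery analysis beyond binary task success and establish SO-101 as a practical benchmark for evaluating embodied AI systems under realistic low-cost robot deployment conditions.

English abstract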

Vision-Language-Action (VLA) models have demonstrated strong generalization in robotic manipulation, yet existing evaluations are primarily conducted in simulation or on expensive robotic platforms, leaving their robustness on affordable real-world robots largely unexplored. We present a standardized real-world benchmark for evaluating representative VLA and imitation learning policies on the low-cost SO-101 robotic platform. The benchmark comprises four representative manipulation tasks together with unified evaluation protocols, enabling systematic comparison under embodiment uncertainty. Using real-world teleoperated demonstrations, we fine-tune and evaluate $π_{0.5}$, SmolVLA, Wall-X, and ACT directly on the physical platform. Beyond conventional task success rates, the benchmark incorporates a structured failure taxonomy, semantic- and execution-level failure decomposition, and recovery-aware evaluation metrics to characterize policy robustness. Experimental results show that stronger pretrained VLA policies generally outperform the imitation learning baseline, although performance remains highly task-dependent under low-cost robotic deployment conditions. Execution instability emerges as the dominant failure source, while recovery capability varies substantially across architectures. These results highlight the importance of failure and recovery analysis beyond binary task success and establish SO-101 as a practical benchmark for evaluating embodied AI systems under realistic low-cost robotic deployment conditions.

2606.08663 2026-06-15 cs.SD eess.AS New submission

Probing Token Spaces under Generator Shift in AI-Generated Music Detection

Joonyong Park, Jungwoo Kim, Junyoung Koh, Yuki Saito

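Affiliations: KAIST

AI summary: To address the performance drop of AI music detectors under generator shift, proposes CoMoE, a compact classifier for comparing different audio token spaces, and finds that codec-style discrete token spaces should be treated as a primary experimental variable.

Comments: Accepted to ICML 2026 ML4Audio workshop

AI abstract (translated from Chinese)

AI-generated music detectors may appear robust on standard benchmark splits, but deployment requires transfer to generator sources that were absent during training. We study this problem through source-restricted evaluation on MoM-open, an open reconstruction of MoM-CLAM that replaces the non-redistributable real corpus with FMA and MTG-Jamendo while preserving the fake-generator protocol. To isolate the role of representation, we introduce CoMoE, a compact fixed classifier for comparing heterogeneous audio token spaces while holding the downstream architecture and training recipe unchanged. Experiments show that the standard and real-source-restricted splits are nearly saturated, whereas fake-source restriction exposes large differences between token spaces: X-Codec tokens are strongest when training on Udio only, while MERT-derived tokens are strongest when training on Suno-v3.5 only. These results suggest that, in AI-generated music detection, codec-style discrete token spaces should be treated as a primary experimental axis under generator shift. Our code and data are available at https://github.com/MAAP-LAB/CoMoE.

English abstract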

AI-generated music detectors can appear robust on standard benchmark splits, yet their deployments require transfer to generator sources absent during training. We study this problem with source-restricted evaluation on MoM-open, an open reconstruction of MoM-CLAM that replaces the non-redistributable real corpus with FMA and MTG-Jamendo while preserving the fake-generator protocol. To isolate the role of representation, we introduce CoMoE, a compact fixed classifier for comparing heterogeneous audio token spaces while keeping the downstream architecture and training recipe unchanged. Experiments show that standard and real-source-restricted splits are nearly saturated, whereas fake-source restriction exposes large differences between token spaces: X-Codec tokens are strongest when training on Udio alone, while MERT-derived tokens are stronger when training on Suno-v3.5 alone. These results suggest that codec-style discrete token spaces should be treated as a primary experimental axis under generator shift in AI-generated music detection. Our code and data are available at https://github.com/MAAP-LAB/CoMoE.

2606.08555 2026-06-15 cs.RO New submission

FAWAM: Force-Aware World Action Models for Closed-Loop Contact-Rich Manipulation

Haotian He, Zeyu Yan, Qipeng Liu, Ning Guo, Wenzhao Lian

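Affiliations: School of Mathematical Sciences, Peking University; School of Artificial Intelligence, Shanghai Jiao Tong University

AI summary: Proposes FAWAM, which incorporates force information at three levels (perception, prediction, and closed-loop execution) and improves success rates in contact-rich manipulation by jointly predicting actions and end-effector wrenches and adding a residual correction module.

AI abstract (translated from Chinese)

Force signals provide key interaction cues for contact-rich robotic manipulation. However, existing methods mostly treat force as an additional observation modality and do not fully exploit its role in modeling future interaction dynamics or guiding execution-time feedback correction. This paper proposes FAWAM, a force-aware world action model that incorporates force information at three levels: perception, prediction, and closed-loop execution. FAWAM first encodes historical six-axis force/torque signals to modulate action generation, then jointly predicts future actions and end-effector wrenches to model contact evolution explicitly. It further introduces a residual correction module that uses the predicted wrench trajectory as an execution-time reference and refines actions online based on real-time force feedback. Real-world experiments across multiple contact-rich tasks show that FAWAM improves the average success rate by 36.25% over vision-only baselines and by 21.25% over existing force-aware baselines, demonstrating the effectiveness of our force-aware framework for robust contact-rich manipulation.

English abstract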

Force signals provide critical interaction cues for contact-rich robotic manipulation. However, existing methods mostly use force as an additional observation modality, without fully exploiting its role in modeling future interaction dynamics or guiding execution-time feedback correction. In this paper, we propose FAWAM, a force-aware world action model that incorporates force information at three levels: perception, prediction, and closed-loop execution. FAWAM first encodes historical 6-axis force/torque signals to modulate action generation, then jointly predicts future actions and end-effector wrenches to explicitly model contact evolution. It further introduces a residual correction module that uses the predicted wrench trajectory as an execution-time reference to refine actions online based on real-time force feedback. Real-world experiments across multiple contact-rich tasks show that FAWAM improves the average success rate by 36.25% over vision-only baselines and 21.25% over existing force-aware baselines, demonstrating the effectiveness of our force-aware framework for robust contact-rich manipulation.

2606.07157 2026-06-15 cs.AI New submission

Think Fast: Estimating No-CoT Task-Completion Time Horizons of Frontier AI Models

Dewi Gould, Francis Rhys Ward, Anders Cairns Woodruff, Rauno Arike, Josh Hills, Alex Serrano, Ida Caspary, Jason Ross Brown, Jo J. Jiao, Patrick Leask, Twm Stone, Ram Potham, Ionut Gabriel Stan, Harry Mayne, Simeon Hellsten, Shubhorup Biswas, Ariana Azarbal, William L. Anderson, Elle Najt, Ryan Greenblatt, Julian Stastny

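Affiliations: Redwood Research; Astra Fellows Program; Aether Research; MATS Research; Polytechnic University of Catalonia; Imperial College London; University of Cambridge; University of Chicago; Durham University; MIT; University of Oxford; University of Glasgow; Constellation

AI summary: Tests frontier AI models without chain-of-thought reasoning on more than 30,000 questions and estimates their 50% task-completion time horizon, finding that it doubles about every two years and that GPT-5.5 has reached more than 3 minutes.

AI abstract (translated from Chinese)

Many efforts to keep frontier AI models safe rely on monitoring their chain-of-thought (CoT) reasoning. If models could carry out sufficiently complex reasoning internally, without explicit thinking tokens, this oversight would be undermined. We measure the no-CoT reasoning ability of frontier models on more than 30,000 questions from 43 benchmarks spanning math, coding, puzzles, causal reasoning, theory of mind, and strategic reasoning. To compare models with humans, we estimate the 50% task-completion time horizon (TH): the human time required for tasks that a model completes with a 50% success rate. We complement this with a 50% reasoning token horizon: the minimum number of o3-mini reasoning tokens required for tasks that a model solves with a 50% success rate. We find that over the past six years the no-CoT 50% TH of frontier models has doubled about every two years, with GPT-5.5's TH exceeding 3 minutes and its reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that the frontier no-CoT TH could exceed 7 minutes by 2028 and 25 minutes by 2030, although these projections carry large uncertainty. We recommend that frontier developers track this metric explicitly.

English abstract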

Many efforts to ensure frontier AI models are safe rely on monitoring their chain-of-thought (CoT) reasoning. If models become able to perform sufficiently complex reasoning internally, without explicit thinking tokens, this would undermine such oversight. We measure how well frontier models reason without CoT across a suite of over 30,000 questions spanning 43 benchmarks in domains including math, coding, puzzles, causality, theory-of-mind, and strategic reasoning. To compare models against humans, we estimate the $50\%$-task-completion time horizon (TH): the human time required for tasks a model completes with $50\%$ success rate. We complement this with a $50\%$ reasoning token horizon: the minimum number of o3-mini reasoning tokens needed for tasks a model solves with $50\%$ success rate. We find that the no-CoT $50\%$ TH of frontier models has been doubling roughly every year over the past six years, with GPT-5.5's TH reaching over 3 minutes and reasoning token horizon exceeding 1,500 tokens. Our median estimates predict that frontier no-CoT THs could exceed 7 minutes by 2028, and 25 minutes by 2030, though these projections carry substantial uncertainty. We recommend frontier developers track this explicitly.
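
The 50% time horizon is the human task duration at which a model's success rate crosses 50%. The sketch below locates that crossing by interpolating in log-time between duration buckets. This is a deliberate simplification: the abstract does not describe the estimation procedure, the bucket data are invented, and `fifty_percent_horizon` is a hypothetical helper.

```python
import math

# Estimate the 50% time horizon from (human_minutes, success_rate) buckets by
# linear interpolation in log2(time). Assumption: success falls as tasks get
# longer, and made-up buckets stand in for real benchmark results.

def fifty_percent_horizon(buckets):
    ordered = sorted(buckets)
    for (t0, p0), (t1, p1) in zip(ordered, ordered[1:]):
        if p0 >= 0.5 > p1:
            frac = (p0 - 0.5) / (p0 - p1)
            log_t = math.log2(t0) + frac * (math.log2(t1) - math.log2(t0))
            return 2.0 ** log_t
    return None  # success never crosses 50% inside the measured range


buckets = [(1, 0.9), (2, 0.7), (4, 0.3), (8, 0.1)]
horizon_minutes = fifty_percent_horizon(buckets)
```

Working in log-time reflects that task durations span orders of magnitude; here the crossing falls halfway between 2 and 4 minutes on a log scale, which is about 2.83 minutes.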

2606.07040 2026-06-15 cs.CL New submission

Beyond Rubrics: Exploration-Guided Evaluation Skills for Reward Modeling

Xing Yue, Linjuan Wu, Daoxin Zhang, Yongliang Shen, Weiming Lu

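Affiliations: Zhejiang University; Xiaohongshu Inc.

AI summary: Proposes Eval-Skill, which uses exploration-guided synthesis of reusable domain-level evaluation skills in place of per-query rubric generation, improving Qwen3-8B and DeepSeek-V4-Flash on RewardBench 2 by 13.44% and 18.51%, respectively.

AI abstract (translated from Chinese)

Open-ended reward modeling requires judges to follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods usually address this by generating criteria online for each query, but the extra generation step adds inference overhead and can produce rigid or misaligned guidance. We introduce Eval-Skill, an exploration-guided method that synthesizes reusable evaluation skills for reward modeling and reframes reward guidance as context evolution rather than parameter training or per-query rubric generation. Using only 100 cases per domain for skill evolution, Eval-Skill synthesizes reusable domain-level evaluation skills through two progressive stages (workflow generation and principle generation), interleaving exploration and selection in both stages. Once generated, a skill is injected directly into the judge's context. Across multiple reward modeling benchmarks, Eval-Skill consistently improves different judge backbones; on RewardBench 2, every main backbone shows a significant gain over vanilla judging (13.44% for Qwen3-8B and 18.51% for DeepSeek-V4-Flash). Further analyses of evolution-time scaling, generalization, and transferability show that compact evaluation skills offer an efficient new paradigm for LLM-based evaluation. Code is available at https://github.com/xing-stellus-yue/Eval-Skill.

English abstract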

Open-ended reward modeling requires judges that can follow subtle, domain-specific preferences when verifiable answers are unavailable. Existing rubric-based methods often address this by generating criteria online for each query, but the extra generation step can add inference overhead and produce rigid or misaligned guidance. We introduce Eval-Skill, an exploration-guided method that synthesizes reusable evaluation skills for reward modeling and reframes reward guidance as context evolution rather than parameter training or per-query rubric generation. Using only 100 cases per domain for skill evolution, Eval-Skill synthesizes reusable domain-level evaluation skills through two progressive stages, workflow generation followed by principle generation, with exploration and selection interleaved across both stages. Once generated, a skill is directly injected into the judge context. Across multiple RM benchmarks, Eval-Skill consistently improves diverse judge backbones; on RewardBench 2, it yields significant gains over vanilla judging for each main backbone (+13.44% for Qwen3-8B and +18.51% for DeepSeek-V4-Flash). Further analyses of evolution-time scaling, generalizability, and transferability show that compact evaluation skills offer an efficient new paradigm for LLM-based evaluation. Code is available at https://github.com/xing-stellus-yue/Eval-Skill.

2606.07027 2026-06-15 cs.AI New submission

StainFlow: Entity-Stain Tracking and Evidence Linking for Process Rewards in GUI Agents

Haojie Hao, Longkun Hao, Yihang Lou, Yan Bai, Zhenyang Li, Zhichao Yang, Dongshuo Huang, Hongyu Lin, Lanqing Hong, Jiakai Wang, Xianglong Liu

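Affiliations: Beihang University; Peking University; Renmin University of China; Northwestern Polytechnical University; Institute of Software, Chinese Academy of Sciences; National University of Singapore; Zhongguancun Laboratory

AI summary: Proposes the StainFlow model, which uses global entity stain tracking and local evidence linking to address the subjectivity of milestone decomposition and the evidence missed by local windows in process rewards for GUI agents, improving online reinforcement learning success by 3.2%.

AI abstract (translated from Chinese)

Reinforcement learning has become a promising way to improve GUI agents in long-horizon, stochastic digital environments, but trajectory-level success feedback is too sparse to provide reliable credit assignment for intermediate exploration steps. To mitigate this, recent studies introduce process reward models, which provide finer-grained training feedback through global milestone verification or local step-level evaluation. However, these methods still have two level-specific limitations: global milestone decomposition is subjective and singular, making it hard to accommodate the multiple valid execution paths in real GUI tasks, while fixed local judging windows may miss long-range key evidence or dilute the decision signal with irrelevant frames. Inspired by stain-tracing mechanisms in network flow analysis, we propose StainFlow, an entity-stain-flow process reward model for GUI agents. To reduce the subjectivity of global partitioning, we introduce a Global Entity Stain Tracking module, which extracts visually verifiable task entities and tracks how their stain concentrations and states evolve along the trajectory, so that task phases are separated objectively by changes in the entity evidence flow. To improve the accuracy of local verification, we introduce a Local Stain Evidence Linking module. Centered on the triggering entities of each candidate key node, the module retrieves relevant steps according to their stain concentrations and state changes and dynamically builds high-density evidence windows to verify true key nodes. Extensive experiments on AndroidWorld and OGRBench show that StainFlow improves the online reinforcement learning success rate by a relative 3.2% and trajectory completion judgment accuracy by 1.8%.

English abstract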

Reinforcement Learning (RL) has become a promising approach for improving GUI Agents in long-horizon, stochastic digital environments, but trajectory-level success feedback is too sparse to provide reliable credit assignment for intermediate exploration steps. To mitigate this issue, recent studies introduce Process Reward Models (PRMs), which provide finer-grained training feedback through global milestone verification or local step-level evaluation. However, these methods still suffer from two level-specific limitations: global milestone decomposition is subjective and singular, making it difficult to accommodate the multiple valid execution paths in real GUI tasks, while fixed local judging windows may miss long-range key evidence or dilute the decision signal with irrelevant frames. Inspired by stain-tracing mechanisms in network flow analysis, we propose StainFlow, an entity-stain-flow process reward model for GUI Agents. To reduce the subjectivity of global partitioning, we introduce the Global Entity Stain Tracking module, which extracts visually verifiable task entities and tracks how their stain concentrations and states evolve along the trajectory, allowing task phases to be objectively separated by changes in the entity evidence flow. To improve the accuracy of local verification, we introduce the Local Stain Evidence Linking module. Centered on the triggering entities of each candidate key node, it retrieves relevant steps based on their stain concentrations and state changes, and dynamically constructs high-density evidence windows for verifying true key nodes. Extensive experiments on AndroidWorld and OGRBench show that StainFlow yields relative improvements of 3.2% in online RL success and 1.8% in trajectory completion judgment accuracy.

2606.06794 2026-06-15 cs.CL cs.IR New submission

TA-RAG: Tone-Aware Retrieval-Augmented Generation for Peer-Support Health Communication

Yong-Bin Kang, Anthony McCosker

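Affiliations: Swinburne University of Technology

AI summary: Proposes the TA-RAG framework, which uses lightweight prompts to embed tone control (stigma-free rewriting, readability adjustment, audience adaptation, and empathetic rephrasing) into a RAG pipeline without model fine-tuning, improving the quality of sensitive health communication.

AI abstract (translated from Chinese)

Retrieval-augmented generation (RAG) successfully grounds the outputs of large language models (LLMs) in trusted documents, but factual grounding alone is not enough to support sensitive peer-support health communication. In domains such as HIV peer support, responses must also be easy to understand, stigma-free, empathetic, and tailored to the recipient. This paper presents TA-RAG, a lightweight, prompt-based, tone-aware RAG framework that embeds explicit tone control into a RAG pipeline without model fine-tuning. We operationalize tone through four core components: stigma-free rewriting, readability adjustment, recipient adaptation, and empathetic rephrasing. We evaluate TA-RAG with component-level tests using questions drawn from HIV Online Learning Australia (HOLA), UNAIDS terminology guidance, readability metrics, peer-support standards from the National Association of People with HIV Australia (NAPWHA), and a public empathy dataset. The results show that TA-RAG's components improve their targeted communication qualities while preserving key content. These findings highlight prompt-based tone control as a potential direction for making RAG outputs suitable for sensitive peer-support health communication.

English abstract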

Retrieval-augmented generation (RAG) successfully grounds large language model (LLM) outputs in trusted documents, but factual grounding alone is insufficient for sensitive peer-support health communication. In domains such as HIV peer support, responses must also be accessible, stigma-free, empathetic, and tailored to the recipient. This paper presents TA-RAG, a lightweight, prompt-based tone-aware RAG framework that embeds explicit tone control into a RAG pipeline without requiring model fine-tuning. We operationalise tone across four core components: stigma-free rewriting, readability adjustment, recipient adaptation, and empathy rephrasing. We evaluate TA-RAG through component-level tests using questions derived from HIV Online Learning Australia (HOLA), UNAIDS terminology guidance, readability metrics, peer-support standards from the National Association of People with HIV Australia (NAPWHA), and a public empathy dataset. Results show that TA-RAG's components improve their targeted communication quality while preserving key content. These findings emphasise that prompt-based tone control is a potential direction for making RAG outputs suitable for sensitive peer-support health communication.

2605.13217 2026-06-15 cs.CL cs.AI cs.LG Cross-list

GAGPO: Generalized Advantage Grouped Policy Optimization

Siyuan Zhu, Chao Yu, Rongxin Yang, Zongkai Liu, Jinjun Hu, Qiwen Chen, Yibo Zhang

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; Meituan

AI summary: GAGPO is a reinforcement learning method that needs no value model; it uses a grouped value proxy and an action-level importance ratio to achieve precise temporal credit assignment in multi-turn tasks, and experiments show that it outperforms existing baselines on ALFWorld and WebShop.

English abstract

Reinforcement learning has become a powerful paradigm for post-training large language model agents, yet credit assignment in multi-turn environments remains a challenge. Agents often receive sparse, trajectory-level rewards only at the end of an episode, making it difficult to determine which intermediate actions contributed to success or failure. As a result, propagating delayed outcomes back to individual decision steps without relying on costly auxiliary value models remains an open problem. We propose Generalized Advantage Grouped Policy Optimization (GAGPO), a critic-free reinforcement learning method for precise, step-aligned temporal credit assignment. GAGPO constructs a non-parametric grouped value proxy from sampled rollouts and uses it to compute TD/GAE-style temporal advantages, recursively propagating outcome supervision backward through time. Combined with group-wise advantage normalization and an action-level importance ratio, GAGPO extracts stable, localized optimization signals directly from multi-turn trajectories. Experiments on ALFWorld and WebShop show that GAGPO outperforms strong reinforcement learning baselines. Further analyses demonstrate faster early-stage learning, improved interaction efficiency, and smoother optimization dynamics, suggesting that GAGPO offers a simple yet effective framework for multi-turn agentic reinforcement learning.
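
The idea of computing GAE-style advantages from a critic-free, group-based value estimate can be sketched as follows. The abstract does not define the grouped value proxy, so this sketch assumes the simplest candidate: the mean discounted return-to-go at each step index across rollouts of the same task. Group-wise normalization and the action-level importance ratio are omitted, and the function names are hypothetical.

```python
# GAE-style advantages with a non-parametric value proxy built from a group
# of rollouts. Assumption (not from the abstract): V[t] is the mean discounted
# return-to-go at step index t over rollouts that are still running at t.

def grouped_value_proxy(group_rewards, gamma):
    returns_to_go = []
    for rewards in group_rewards:
        rtg, running = [0.0] * len(rewards), 0.0
        for t in reversed(range(len(rewards))):
            running = rewards[t] + gamma * running
            rtg[t] = running
        returns_to_go.append(rtg)
    horizon = max(len(r) for r in group_rewards)
    values = []
    for t in range(horizon):
        alive = [rtg[t] for rtg in returns_to_go if t < len(rtg)]
        values.append(sum(alive) / len(alive))
    return values


def gae_advantages(rewards, values, gamma, lam):
    advantages, running = [0.0] * len(rewards), 0.0
    for t in reversed(range(len(rewards))):
        # The value after this rollout's final step is zero (episode over).
        next_value = values[t + 1] if t + 1 < len(rewards) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


# Two rollouts of one task with a sparse terminal reward.
group = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0]]
values = grouped_value_proxy(group, gamma=1.0)
adv_success = gae_advantages(group[0], values, gamma=1.0, lam=0.5)
adv_failure = gae_advantages(group[1], values, gamma=1.0, lam=0.5)
```

With `lam` below 1, the terminal outcome is propagated backward with geometric decay, so later steps receive larger credit than earlier ones even though the reward arrives only at the end of the episode.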

2605.04847 2026-06-15 cs.LG cs.AI Updated version

Quantile-Free Uncertainty Quantification in Graph Neural Networks

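Soyoung Park, Hwanjun Song, Sungsu Lim

AI summary: Proposes the QpiGNN framework, which uses a quantile-free joint loss to optimize coverage and interval width directly, achieving efficient and robust uncertainty quantification for graph neural networks with theoretical guarantees of asymptotic coverage and near-optimal width.

Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

AI abstract (translated from Chinese)

Uncertainty quantification (UQ) in graph neural networks (GNNs) is crucial in high-stakes domains but remains a major challenge. In graph settings, message passing usually relies on strong assumptions (such as exchangeability) that are rarely satisfied in practice, and achieving reliable UQ typically requires expensive resampling or post-hoc calibration. To address these problems, we introduce the Quantile-free Prediction Interval GNN (QpiGNN), a framework based on quantile regression (QR) that achieves GNN-based UQ by directly optimizing coverage and interval width, with no quantile inputs or post-processing. QpiGNN uses a dual-head architecture that decouples prediction from uncertainty and is trained with label-only supervision through a quantile-free joint loss. This design allows efficient training and produces robust prediction intervals, with theoretical guarantees of asymptotic coverage and near-optimal width under mild assumptions. Experiments on 19 synthetic and real-world benchmarks show that QpiGNN achieves on average 22% higher coverage and 50% narrower intervals than baselines, while remaining efficient and robust to noise and structural shifts.

English abstract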

Uncertainty quantification (UQ) in graph neural networks (GNNs) is crucial in high-stakes domains but remains a significant challenge. In graph settings, message passing often relies on strong assumptions such as exchangeability, which are rarely satisfied in practice, and achieving reliable UQ typically requires costly resampling or post-hoc calibration. To address these issues, we introduce Quantile-free Prediction Interval GNN (QpiGNN), a framework that builds on quantile regression (QR) to enable GNN-based UQ by directly optimizing coverage and interval width without requiring quantile inputs or post-processing. QpiGNN employs a dual-head architecture that decouples prediction and uncertainty, and is trained with label-only supervision through a quantile-free joint loss. This design allows efficient training and yields robust prediction intervals, with theoretical guarantees of asymptotic coverage and near-optimal width under mild assumptions. Experiments on 19 synthetic and real-world benchmarks show that QpiGNN achieves on average 22% higher coverage and 50% narrower intervals than baselines, while ensuring efficiency and robustness to noise and structural shifts.
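
To show what "directly optimizing coverage and interval width" can look like, the sketch below scores a set of predicted intervals with a two-term objective: a penalty for labels that fall outside their interval plus a weighted mean width. This is not the QpiGNN loss, whose form is not given in the abstract; the hinge-style penalty, the `width_weight` parameter, and both function names are assumptions chosen for illustration.

```python
# Hypothetical coverage-plus-width objective for prediction intervals.
# Assumption: a hinge penalty measures how far each label lies outside its
# interval, and width_weight trades coverage against interval width.

def interval_loss(lower, upper, targets, width_weight=0.5):
    n = len(targets)
    violation = sum(
        max(lo - y, 0.0) + max(y - hi, 0.0)
        for lo, hi, y in zip(lower, upper, targets)
    ) / n
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / n
    return violation + width_weight * mean_width


def empirical_coverage(lower, upper, targets):
    hits = sum(1 for lo, hi, y in zip(lower, upper, targets) if lo <= y <= hi)
    return hits / len(targets)


targets = [1.0, 2.0, 3.0, 4.0]
lower = [0.5, 1.5, 3.5, 3.0]
upper = [1.5, 2.5, 4.5, 5.0]
loss = interval_loss(lower, upper, targets, width_weight=0.5)
coverage = empirical_coverage(lower, upper, targets)
```

No target quantile level appears anywhere in the objective, which is the sense in which such a loss is "quantile-free"; the two terms pull in opposite directions, since widening intervals raises coverage and the width term.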

2605.03065 2026-06-15 cs.LG cs.RO Updated version

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

Sarvesh Patil, Mitsuhiko Nakamoto, Manan Agarwal, Shashwat Saxena, Jesse Zhang, Giri Anantharaman, Cleah Winston, Chaoyi Pan, Douglas Chen, Nai-Chieh Huang, Zeynep Temel, Oliver Kroemer, Sergey Levine, Abhishek Gupta, Hongkai Dai, Paarth Shah, Max Simchowitz

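Affiliations: University of California, Berkeley

AI summary: Proposes the OGPO algorithm, which uses off-policy critic networks and a modified PPO objective to fine-tune generative control policies sample-efficiently, achieving state-of-the-art performance on a range of manipulation tasks and fine-tuning poorly initialized behavior cloning policies without expert data.

AI abstract (translated from Chinese)

Generative control policies (GCPs), such as diffusion-based and flow-based control policies, have become effective parameterizations for robot learning. This paper introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for fine-tuning GCPs that maintains off-policy critic networks to maximize data reuse and propagates policy gradients through the policy's full generative process via a modified PPO objective, using the critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks covering multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can fine-tune poorly initialized behavior cloning policies to near-complete task success with no expert data in the online replay buffer, and it needs very little task-specific hyperparameter tuning. Through extensive empirical study, we show that OGPO significantly outperforms alternative methods for policy steering and residual learning, and we identify the key mechanisms behind its performance. We further introduce practical stabilization tricks, including success-buffer regularization, two-sided conservative advantages, and Q-variance reduction, to mitigate critic over-exploitation in state-based and pixel-based settings. Beyond proposing OGPO, we carry out a systematic empirical study of GCP fine-tuning and identify the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement.

English abstract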

Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagates policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can fine-tune poorly-initialized behavior cloning policies to near full task-success with no expert data in the online replay buffer, and does so with little task-specific hyperparameter tuning. Through extensive empirical investigations, we demonstrate that OGPO drastically outperforms alternative methods on policy steering and learning residual corrections, and identify the key mechanisms behind its performance. We further introduce practical stabilization tricks, including success-buffer regularization, two-sided conservative advantages, and Q-variance reduction, to mitigate critic over-exploitation across state- and pixel-based settings. Beyond proposing OGPO, we conduct a systematic empirical study of GCP finetuning, identifying the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement.

2606.06010 2026-06-15 cs.LG cs.DB 版本更新

Adaptive Oscillatory-State Alignment for Time Series Forecasting

自适应振荡状态对齐用于时间序列预测

Zhangyao Song, Chaofeng Qu, Chao Zha, Xiaoyu Zhao, Yinfei Xu, Tao Guo

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出AOSNET框架,通过希尔伯特变换将固定模板匹配改为自适应振荡状态对齐,以处理实际时间序列中的非平稳振荡行为,在多个基准上达到先进或竞争性精度。

详情
AI中文摘要

长期时间序列预测受益于揭示重复时间结构的归纳偏置。现有的周期性预测方法通常通过预定义周期、全局频谱分量或固定可学习模板来建模重复性。然而,现实世界的时间动态很少是严格周期性的:振荡行为通常通过幅度调制、相位漂移和局部频率变化而演变。在这些条件下,固定模板的周期性建模可能与底层时间状态根本性不匹配。我们提出了AOSNet,一个希尔伯特引导的预测框架,将周期性预测从固定模板匹配重新表述为自适应振荡状态对齐。AOSNet从观测序列和可学习的全局振荡先验中提取解析信号描述符,然后通过描述符条件门自适应地对齐局部状态,该门选择性地保留可靠观测,同时软性纠正不匹配区域。学习到的先验不是作为刚性的重复模板,而是作为通过局部状态动力学解释的灵活振荡参考。在八个基准上的实验表明,该方法在保持快速推理速度的同时达到了最先进或极具竞争力的准确性。控制合成研究分离幅度调制、相位漂移和局部频率变化,证实振荡状态对齐的优势随着非平稳性加剧而持续增加。

英文摘要

Long-term time series forecasting benefits from inductive biases that expose recurring temporal structure. Existing periodic forecasting methods typically model recurrence through predefined periods, global spectral components, or fixed learnable templates. However, real-world temporal dynamics are rarely rigidly periodic: around a nominal cycle, oscillatory behavior often exhibits \emph{non-rigid periodicity} (NRP), where cycle magnitude, cycle alignment, and local cycle duration vary over time. Under these conditions, fixed-template periodic modeling can become fundamentally mismatched to the underlying temporal states. We propose AOSNet, a Hilbert-guided forecasting framework that reformulates periodic forecasting from fixed template matching to adaptive oscillatory-state alignment. AOSNet extracts analytic-signal descriptors from both the observed sequence and a learnable global oscillatory prior, then adaptively aligns local states through a descriptor-conditioned gate that selectively preserves reliable observations while softly correcting mismatched regions. The learned prior serves not as a rigid repeated template but as a flexible oscillatory reference interpreted through local state dynamics. Experiments on eight public benchmarks and two cloud workload traces demonstrate leading or highly competitive accuracy with a compact model size and low inference latency, supporting repeated forecasting settings such as capacity planning and autoscaling. Controlled synthetic studies that isolate cycle-magnitude and cycle-alignment variation and combine them with cycle-duration changes show that the advantage of oscillatory-state alignment increases as NRP intensifies.
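
As a rough illustration of the "analytic-signal descriptors" mentioned in the abstract above, the following sketch computes an amplitude envelope and instantaneous phase from a real series with a discrete Hilbert transform. It is not AOSNet's implementation: the function names (`dft`, `analytic_signal`, `descriptors`) are hypothetical, and a naive O(n²) DFT is used only to keep the example free of dependencies.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); fine for short demo series."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += x[t] * cmath.exp(sign * 2j * math.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def analytic_signal(x):
    """Analytic signal: keep DC (and Nyquist), double positive frequencies, zero negative ones."""
    n = len(x)
    spectrum = dft(x)
    weights = [0.0] * n
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        for k in range(1, n // 2):
            weights[k] = 2.0
    else:
        for k in range(1, (n + 1) // 2):
            weights[k] = 2.0
    return dft([s * w for s, w in zip(spectrum, weights)], inverse=True)


def descriptors(x):
    """Return (amplitude envelope, instantaneous phase) of a real-valued series."""
    z = analytic_signal(x)
    return [abs(v) for v in z], [cmath.phase(v) for v in z]


# A pure cosine has a constant envelope equal to its amplitude.
series = [1.5 * math.cos(2 * math.pi * 3 * t / 32) for t in range(32)]
amplitude, phase = descriptors(series)
```

For a series whose cycle magnitude or alignment drifts over time, the envelope and phase returned here vary with time, which is the kind of local state a fixed template cannot express.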

2606.05774 2026-06-15 cs.CV 版本更新

LiAuto-GeoX: Efficient Grounded Driving Transformer

LiAuto-GeoX: 高效接地驾驶Transformer

Jiawei Lian, Haoyi Sun, Yang Wu, Lifu Mu, Siyuan Wang, Le Hui, Ning Mao, Tao Wei, Pan Zhou, Kun Zhan, Jian Yang

发表机构 * Nanjing University of Science and Technology(南京理工大学) Li Auto Inc.(理想汽车) Northwestern Polytechnical University(西北工业大学) Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算学院)

AI总结 提出LiAuto-GeoX,通过稀疏激光雷达先验和几何保持蒸馏框架,实现高效、实时的自车中心密集3D重建,并显著提升下游自动驾驶任务性能。

详情
AI中文摘要

密集3D重建在空间理解方面展现出巨大潜力,但其作为自动驾驶实时车载表示的可行性仍是一个开放挑战。现有大规模视觉几何模型通常需要大量计算资源,且缺乏动态驾驶环境所需的远距离几何保真度、环视一致性和实时效率。为弥补这一差距,我们提出LiAuto-GeoX,一种为可部署的自车中心3D场景理解设计的高效接地驾驶Transformer。我们的方法首先从大规模环视数据中学习高容量驾驶几何模型,利用稀疏激光雷达先验在远处、模糊或结构稀疏区域提供稳健的几何接地。然后,通过一种新颖的几何保持蒸馏框架,将这一能力实例化为高度紧凑的1.55亿参数车载模型。该框架采用掩码引导的深度感知蒸馏,通过强调几何信息丰富的区域来保留细粒度度量结构,以及相对姿态关系蒸馏,通过姿态诱导的几何关系强制跨视图空间一致性。大量评估表明,LiAuto-GeoX在KITTI上以220 FPS运行,同时保持高保真密集重建,实现实时部署。学习到的几何结构无缝迁移到下游自主任务,在轨迹预测中达到90.6 PDMS,在占用预测中达到24.63 mIoU,在未来帧预测中达到47.67 IoU。这些结果表明,高效的密集3D重建可以超越其作为感知目标的传统角色,作为下一代自动驾驶的可扩展基础几何表示。

英文摘要

Dense 3D reconstruction has demonstrated immense potential for spatial understanding, yet its viability as a real-time, onboard representation for autonomous driving remains an open challenge. Existing large-scale visual geometry models typically require substantial computational resources and lack the long-range geometric fidelity, surround-view consistency, and real-time efficiency demanded by dynamic driving environments. To bridge this gap, we present LiAuto-GeoX, an efficient grounded driving transformer designed for deployable, ego-centric 3D scene understanding. Our approach begins by learning a high-capacity driving geometry model from large-scale surround-view data, utilizing sparse LiDAR priors to provide robust geometric grounding in distant, ambiguous, or structure-sparse regions. We then instantiate this capability into a highly compact 155M-parameter onboard model through a novel geometry-preserving distillation framework. This framework employs mask-guided depth-aware distillation to retain fine-grained metric structures by emphasizing geometrically informative regions, and relative-pose relational distillation to enforce cross-view spatial consistency through pose-induced geometric relations. Extensive evaluations reveal that LiAuto-GeoX runs at 220 FPS on KITTI while maintaining high-fidelity dense reconstruction, enabling real-time deployment. The learned geometry transfers seamlessly to downstream autonomy tasks, achieving 90.6 PDMS in trajectory prediction, 24.63 mIoU in occupancy prediction, and 47.67 IoU in future-frame prediction. These results demonstrate that efficient dense 3D reconstruction can transcend its traditional role as a perception target to serve as a scalable, foundational geometric representation for next-generation autonomous driving.

2606.05461 2026-06-15 cs.AI 版本更新

Output Type Before Quality: A Standards-Derived XAI Admissibility Rubric for Autonomous-Driving Safety

先输出类型,后质量:基于标准的自动驾驶安全XAI可接受性评估标准

Abhinaw Priyadershi, Mandar Pitale, Jelena Frtunikj, Maria Spence

发表机构 * NVIDIA Corporation(英伟达公司) NVIDIA GmbH(英伟达德国分公司)

AI总结 针对基于ML的自动驾驶安全标准与XAI方法输出类型不匹配的证据类型缺口,从多个安全标准推导出19项可测试证据标准,评估六类XAI方法,发现因果XAI在三个生命周期阶段结构上必需,并提出了结构可接受性概念。

Comments Accepted at SAFECOMP 2026 Workshops (SASSUR); to appear in Springer LNCS

详情
AI中文摘要

基于ML的自动驾驶安全标准规定了保证案例必须包含的证据类型(有向因果链、量化的干预效应、命名的根因变量),然而XAI文献是按输出类型和技术族(显著性图、特征归因、反事实、因果图、语言痕迹)组织的。最受推荐的ADS XAI方法SHAP返回一个排序的特征列表,任何实现努力都无法将其转换为有向链(图1)。我们将这种不匹配称为证据类型缺口。 从AMLAS、ISO 26262、ISO 21448、ISO/PAS 8800中,我们推导出19项可测试的证据标准,涵盖7个生命周期阶段,并附有代表性的条款引用推导,对六类XAI方法进行了结构性评分。 因果XAI在结构上被证明是满足推导标准的必要条件,涉及三个阶段:危害识别(+62%标准缺口)、事件调查(+50%)和数据管理(+50%);判定集在阈值T∈(0%, 50%]内稳定,并在最坏情况下的单单元翻转下存活至T=25%。在其余四个阶段,相关或基于语言的方法是可比较或足够的。该标准识别了结构可接受性(合规的必要但非充分条件):一个可接受方法的具体输出内容仍可能是错误的,验证其保真度(拟合SCM产生的边、痕迹命名的原因)是开放的保证挑战。基于1,996个真实驾驶片段(79,840行,十个分割)的单VLA概念验证与每种方法观察到的输出类型匹配其标准预测一致。ADS安全保证的XAI方法选择应由生命周期阶段的证据需求驱动,而非方法流行度。

英文摘要

Safety standards for ML-based autonomous driving specify the kind of evidence an assurance case must contain (directed cause-and-effect chains, quantified interventional effects, named root-cause variables), yet the XAI literature is organised by output type and technique family (saliency maps, feature attribution, counterfactuals, causal graphs, language traces). SHAP, the most-recommended ADS XAI method, returns a ranked feature list that no implementation effort can convert into a directed chain (Fig. 1). We name this mismatch the evidence-type gap. From AMLAS, ISO 26262, ISO 21448, and ISO/PAS 8800 we derive 19 testable evidentiary criteria across 7 lifecycle stages with representative clause-cited derivations and score six XAI method classes structurally. Causal XAI emerges as structurally required to satisfy the derived criteria at three stages: hazard identification (+62% rubric gap), incident investigation (+50%), and data management (+50%); the verdict set is stable across thresholds T in (0%, 50%] and survives a worst-case single-cell flip down to T = 25%. At the remaining four stages, correlational or language-based methods are comparable or sufficient. The rubric identifies structural admissibility (necessary but not sufficient for compliance): an admissible method's specific output content may still be wrong, and validating that fidelity (the edges a fitted SCM produces, the cause a trace names) is the open assurance challenge. A single-VLA proof of concept on 1,996 real-world driving clips (79,840 rows, ten splits) is consistent with each method's observed output type matching its rubric prediction. XAI method selection for ADS safety assurance should be driven by lifecycle-stage evidence demand, not by method popularity.

2606.05102 2026-06-15 cs.CV 版本更新

ZipSplat: Fewer Gaussians, Better Splats

ZipSplat: 更少的高斯,更好的泼溅

Alexander Veicht, Sunghwan Hong, Dániel Baráth, Marc Pollefeys

发表机构 * ETH Zürich(苏黎世联邦理工学院) Microsoft(微软)

AI总结 提出 ZipSplat,一种基于令牌的前馈模型,通过聚类压缩视觉令牌并解码为高斯组,在无需重训练的情况下实现质量-效率权衡,以约6倍更少的高斯数在DL3DV和RealEstate10K上达到新最优。

详情
AI中文摘要

前馈式3D高斯泼溅方法能够在单次前向传递中从有姿态或无姿态图像重建场景,但当前方法为每个输入像素预测一个高斯,将表示预算与相机分辨率而非场景复杂度绑定。因此,一面平坦的墙壁和一块纹理丰富的物体会产生同样多的高斯,尽管几何需求截然不同。我们提出ZipSplat,一种基于令牌的前馈模型,将高斯放置与像素网格解耦。多视图骨干网络提取密集的视觉令牌,k-means聚类将其压缩为紧凑的场景令牌集。交叉注意力和自注意力精炼这些令牌,轻量级MLP将每个令牌解码为一组具有无约束3D位置的高斯。由于聚类在推理时应用,单个训练模型无需重训练即可覆盖质量-效率曲线。ZipSplat无需真实姿态或内参,但在DL3DV和RealEstate10K上以比像素对齐方法少约6倍的高斯数达到新最优,分别超过最佳无姿态基线2.1dB和1.2dB PSNR。它进一步零样本泛化到Mip-NeRF360和ScanNet++,优于所有可比基线。我们的项目页面位于https://veichta.com/zipsplat。

英文摘要

Feed-forward 3D Gaussian Splatting methods reconstruct a scene from posed or pose-free images in a single forward pass, yet current approaches predict one Gaussian per input pixel, tying the representation budget to camera resolution rather than scene complexity. A flat wall and a richly textured object thus produce equally many Gaussians despite very different geometric needs. We propose ZipSplat, a token-based feed-forward model that decouples Gaussian placement from the pixel grid. A multi-view backbone extracts dense visual tokens, and k-means clustering compresses them into a compact set of scene tokens. Cross- and self-attention refine these tokens, and a lightweight MLP decodes each into a group of Gaussians with unconstrained 3D positions. Because clustering is applied at inference, a single trained model spans the quality-efficiency curve without retraining. ZipSplat operates without ground-truth poses or intrinsics, yet sets a new state of the art on DL3DV and RealEstate10K with ${\sim}6{\times}$ fewer Gaussians than pixel-aligned methods, surpassing the best pose-free baseline by 2.1dB and 1.2dB PSNR, respectively. It further generalizes zero-shot to Mip-NeRF360 and ScanNet++, outperforming all comparable baselines. Our project page is at https://veichta.com/zipsplat.
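
To make the inference-time compression step in the abstract above concrete, here is a minimal sketch of k-means over token vectors, where the number of clusters `k` sets the number of scene tokens. It is an illustration under stated assumptions, not ZipSplat's code: the names `kmeans` and `sq_dist`, the evenly spaced initialization, and the toy 2-D "tokens" are all hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(tokens, k, iters=20):
    """Compress `tokens` into `k` centroids; returns (centroids, assignment)."""
    # Deterministic initialization (an assumption of this sketch): evenly spaced tokens.
    step = len(tokens) / k
    centroids = [list(tokens[int(i * step)]) for i in range(k)]
    assign = [0] * len(tokens)
    for _ in range(iters):
        # Assignment step: each token joins its nearest centroid.
        for i, tok in enumerate(tokens):
            assign[i] = min(range(k), key=lambda c: sq_dist(tok, centroids[c]))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [tokens[i] for i in range(len(tokens)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assign


# Toy "visual tokens": two well-separated groups in 2-D.
tokens = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
coarse, coarse_assign = kmeans(tokens, k=2)   # fewer scene tokens, cheaper decode
fine, fine_assign = kmeans(tokens, k=3)       # more scene tokens, higher fidelity
```

Because `k` is only an argument at call time, the same upstream features can be compressed to different budgets without retraining anything, which mirrors the quality-efficiency trade-off the abstract describes.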

2606.04883 2026-06-15 cs.CL cs.LO 版本更新

Optimizing the Cost-Quality Tradeoff of Agentic Theorem Provers in Lean

优化 Lean 中智能定理证明器的成本-质量权衡

Kári Rögnvaldsson, Chenhao Sun, Jasper Dekoninck, Martin Vechev

发表机构 * University of Washington(华盛顿大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种包含数据平面和控制平面的动作路由智能体,通过观察失败轨迹并估计成功概率与成本来动态决定继续证明或重新分解,在 PutnamBench 子集上平均降低 25.8% 成本且保持性能。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于在 Lean 中生成形式化证明的工作流程。这些工作流程通常将问题分解为更小的引理,采样许多证明尝试,并使用编译器反馈来指导搜索。然而,它们可能成本高昂,往往在最终失败的尝试上花费大量计算。在这项工作中,我们通过一个包含数据平面和控制平面的动作路由智能体来解决这个问题。数据平面生成自然语言的引理分解,在 Lean 中形式化它们,并为由此产生的定理和引理目标采样证明尝试。控制平面观察之前失败的 Lean 尝试,估计成功可能性和另一次尝试的成本,并决定是继续证明当前目标还是从新的分解重新开始。在 PutnamBench 的一个子集上,我们的智能体平均比固定步长基线降低 25.8% 的成本,在显著减少计算量的同时保持性能。这些结果表明,失败的 Lean 轨迹为智能定理证明中的成本感知资源分配提供了可操作的信号。

英文摘要

Large language models (LLMs) are increasingly used in workflows for generating formal proofs in Lean. These workflows often decompose problems into smaller lemmas, sample many proof attempts, and use compiler feedback to guide search. However, they can be prohibitively expensive, often spending substantial compute on attempts that ultimately fail. In this work, we address this problem with an action routing agent that consists of a data plane and a control plane. The data plane generates natural-language lemma decompositions, formalizes them in Lean, and samples proof attempts for the resulting theorem and lemma targets. The control plane observes previous failed Lean attempts, estimates both the likelihood of success and cost of another attempt, and decides whether to continue proving the current target or restart from a new breakdown. On a subset of PutnamBench, our agent decreases the cost by $28.9\%$ over a fixed-step baseline on average, preserving performance while using substantially less compute. These results suggest that failed Lean trajectories provide actionable signals for cost-aware resource allocation in agentic theorem proving.
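
The control-plane decision described above (continue the current target or restart from a new breakdown) can be sketched as a comparison of expected cost per success. The abstract does not state the actual routing rule, so everything here is an assumption: the `Estimate` record, the `route` function, and the use of cost divided by success probability (the expected total cost of repeating independent attempts until one succeeds).

```python
import math
from dataclasses import dataclass


@dataclass
class Estimate:
    """Hypothetical control-plane estimate for one candidate action."""
    p_success: float  # estimated probability that the next attempt succeeds
    cost: float       # estimated cost of that attempt (e.g. tokens or dollars)


def expected_cost_per_success(e: Estimate) -> float:
    """Expected spend until success if independent attempts are repeated."""
    if e.p_success <= 0.0:
        return math.inf
    return e.cost / e.p_success


def route(continue_est: Estimate, restart_est: Estimate) -> str:
    """Pick the action with the lower expected cost per success."""
    if expected_cost_per_success(continue_est) <= expected_cost_per_success(restart_est):
        return "continue"
    return "restart"


# After several failed attempts the estimated success probability has dropped,
# so restarting from a new decomposition looks cheaper despite its higher cost.
decision = route(Estimate(p_success=0.05, cost=1.0), Estimate(p_success=0.30, cost=2.0))
```

In a real system the two estimates would come from a model that reads the failed Lean attempts; here they are plain numbers so the decision rule can be inspected in isolation.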

2606.04718 2026-06-15 cs.RO cs.AI 版本更新

CoRe-MoE: Contrastive Reweighted Mixture of Experts for Multi-Terrain Humanoid Locomotion with Gait Adaptation

CoRe-MoE: 面向多地形人形机器人步态适应的对比重加权专家混合

Kailun Huang, Zikang Xie, Yanzhe Xie, Panpan Liao, Fanghai Zhang, Yanheng Mai, Wenhao Xu, Yunheng Wang, Renjing Xu, Haohui Huang, Chenguang Yang

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) South China Agricultural University(华南农业大学) Guangdong University of Technology(广东工业大学)

AI总结 提出CoRe-MoE两阶段强化学习框架,通过解耦步态生成与地形适应,利用对比学习促进专家专业化,实现人形机器人在多地形下的稳定行走和跑步。

Comments Kailun Huang, Zikang Xie, Yanzhe Xie and Panpan Liao contributed equally to this work. Corresponding authors: Renjing Xu, Haohui Huang and Chenguang Yang

详情
AI中文摘要

人类主要依靠行走和跑步穿越复杂地形,而无需采用不必要复杂的运动模式。类似地,人形机器人应在行走和跑步之间实现平滑过渡,同时保持自然稳定的运动。然而,由于梯度干扰以及地形相关的视觉和动态变化引起的分布偏移,在单一策略中统一步态转换和多地形适应仍然具有挑战性。尽管专家混合(MoE)架构可以缓解多技能干扰,但简单的联合训练往往无法产生清晰的专家专业化,限制了其有效性。为解决这些问题,我们提出了CoRe-MoE,一个两阶段强化学习框架,将步态生成与地形适应解耦。在第一阶段,学习一个稳定的运动策略,以产生具有平滑过渡的自然行走和跑步行为。在第二阶段,引入一个地形感知的MoE分支,并通过对比目标进行训练以塑造门控网络,使其能够捕捉结构化地形表示并促进专家专业化。最终动作通过基础步态策略和地形感知分支的加权融合获得,使策略在适应复杂地形的同时保持稳定的运动模式。大量仿真结果表明,所提方法在成功率、运动稳定性和多地形适应性方面优于基线方法。此外,在Unitree G1人形机器人上的零样本部署验证了我们框架的有效性,实现了在楼梯、斜坡、台阶、障碍物和非结构化户外地形上的稳健行走和跑步,同时在外界干扰下保持精确的落脚点和动态稳定性。

英文摘要

Humans primarily rely on walking and running to traverse complex terrains. Similarly, humanoid robots should be able to smoothly transition between walking and running while maintaining natural and stable locomotion. However, unifying gait transition and multi-terrain adaptation within a single policy remains challenging due to gradient interference between tasks and the distribution shift caused by terrain variations. Although Mixture-of-Experts (MoE) architectures can mitigate multi-skill interference, direct joint training often fails to achieve clear expert specialization. To address these challenges, we propose CoRe-MoE, a two-stage reinforcement learning framework that decouples gait generation from terrain adaptation. In the first stage, a stable locomotion policy is learned to produce natural walking and running behaviors with smooth transitions. In the second stage, a terrain-aware MoE branch is introduced, and the gating network is trained with a contrastive objective to learn structured terrain representations and promote expert specialization. The final action is obtained through weighted fusion of the base gait policy and the terrain-aware branch, enabling the policy to preserve stable locomotion while adapting to complex terrains. Extensive simulation results demonstrate that the proposed method outperforms baseline approaches in terms of success rate, locomotion stability, and multi-terrain adaptability. Furthermore, zero-shot deployment on a Unitree G1 humanoid robot validates the effectiveness of our framework, achieving robust walking and running across stairs, slopes, steps, obstacles, and unstructured outdoor terrains while maintaining accurate foothold control and dynamic stability.
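
The "weighted fusion of the base gait policy and the terrain-aware branch" can be pictured with the following sketch. The abstract does not give the fusion formula, so the convex combination, the softmax gate, and the names `fuse_action` and `branch_weight` are assumptions made purely for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_action(base_action, expert_actions, gate_logits, branch_weight):
    """Blend the base gait action with a gated mixture of expert actions.

    branch_weight in [0, 1] is a hypothetical scalar: 0 keeps the base gait
    unchanged, 1 uses only the terrain-aware mixture-of-experts branch.
    """
    gates = softmax(gate_logits)
    dim = len(base_action)
    # Terrain branch: gate-weighted sum of the experts' proposed actions.
    branch = [sum(g * a[j] for g, a in zip(gates, expert_actions)) for j in range(dim)]
    return [(1.0 - branch_weight) * base_action[j] + branch_weight * branch[j]
            for j in range(dim)]


base = [0.2, -0.1]
experts = [[1.0, 0.0], [0.0, 1.0]]
action = fuse_action(base, experts, gate_logits=[0.0, 0.0], branch_weight=0.5)
```

A sharper gate (one logit much larger than the others) makes a single expert dominate the branch, which is the specialization behavior the contrastive objective is meant to encourage.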

2606.03108 2026-06-15 cs.AI 版本更新

EvoTrainer: Co-Evolving LLM Policies and Training Harnesses for Autonomous Agentic Reinforcement Learning

EvoTrainer: 协同进化LLM策略与训练框架以实现自主智能体强化学习

Guhong Chen, Yingcheng Shi, Yongbin Li, Binhua Li, Xander Xu, Hu Wei, Shiwen Ni, Min Yang, Jieping Ye

发表机构 * Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(深圳先进技术研究院,中国科学院) Tongyi Lab, Alibaba Group(通义实验室,阿里巴巴集团) Alibaba Group(阿里巴巴集团) SUAT(深圳理工大学)

AI总结 提出EvoTrainer框架,通过协同进化LLM策略和训练端框架,基于经验反馈自动诊断、修正并积累可复用技能,在数学推理、编程竞赛和仓库级软件工程任务上匹配或超越人工设计的RL基线。

详情
AI中文摘要

自主LLM训练通常被表述为配方搜索,这使训练框架基本保持静态。这种局限性在智能体RL中尤为突出,其中不断变化的瓶颈和标量奖励掩盖了多种失败模式。我们引入了EvoTrainer,一个通过经验反馈协同进化LLM策略和训练端框架的自主训练框架:它诊断rollout级别的证据、修正诊断、回测干预并积累可复用技能。在数学推理、竞赛编程代码生成和仓库级软件工程上的评估表明,在相同数据、代码库和评估协议下,EvoTrainer匹配或超过了人工设计的RL参考,其中在长周期智能体SWE上增益最大。轨迹分析显示,保留的策略在不同领域分化,进化的诊断阻止了无效的高分分支被提升,而可复用技能塑造了后续搜索。自主LLM RL应超越配方搜索,转向策略和解释它们的训练框架的联合进化。

英文摘要

Autonomous LLM training is often framed as recipe search, which leaves the training harness largely static. This limitation sharpens in agentic RL, where shifting bottlenecks and scalar rewards mask diverse failure modes. We introduce EvoTrainer, an autonomous training framework that co-evolves LLM policies and training-side harnesses through empirical feedback: it diagnoses rollout-level evidence, revises diagnostics, backtests interventions, and accumulates reusable skills. Evaluated on mathematical reasoning, competitive-programming code generation, and repository-level software engineering, EvoTrainer matches or exceeds the human-engineered RL references under the same data, codebase, and evaluation protocol, with the largest gain on long-horizon agentic SWE. Trajectory analyses show that retained strategies diverge across domains, evolving diagnostics prevent invalid high-scoring branches from being promoted, and reusable skills shape later search. Autonomous LLM RL should move beyond recipe search toward joint evolution of policies and the training harnesses that interpret them.

2606.03085 2026-06-15 cs.LG cs.CL 版本更新

Multi-component Causal Tracing in Large Language Models

大型语言模型中的多组件因果追踪

Zirui Yan, Dennis Wei, Dmitriy A. Katz, Prasanna Sattigeri, Ali Tajer

发表机构 * Rensselaer Polytechnic Institute(拉特拉姆技术学院) IBM Research(IBM研究院)

AI总结 本文提出一个统一框架,通过软干预和度量转换高效识别对目标性能指标最关键的多组件子集,优于现有基线方法。

Comments Accepted to ACL 2026 main conference

详情
AI中文摘要

因果追踪通过系统地干预大型语言模型(LLM)的内部表示,揭示并量化将特定输入或计算与特定感兴趣指标联系起来的因果路径,从而量化LLM的行为。在先前单组件或单层研究的基础上,本文提出了一个同时因果追踪多个组件的统一框架。该框架系统地识别对期望目标性能指标(如准确性和公平性)最关键的组件子集(例如注意力头和多层感知器神经元)。这是通过将灵活的干预应用于广泛期望的指标来实现的。为了解决多组件问题的组合复杂性,设计了一种高效算法,该算法利用软干预和精心设计的度量转换,将组合搜索问题转化为一个连续问题,该问题可以在适当约束下高效求解,从而为选择组件生成适当的二元决策。实验结果表明,所提出的方法高效地识别出对目标指标具有高影响力的模型组件子集,优于现有基线方法。我们的代码可在 https://github.com/ZiruiYan/multi-component-causal-tracing 获取。

英文摘要

Causal tracing systematically intervenes on a large language model's (LLM's) internal representations to uncover and quantify the causal pathways linking specific inputs or computations to specific metrics of interest, quantifying the LLM's behavior. Building on previous single-component or single-layer studies, this paper presents a unified framework for causally tracing multiple components simultaneously. This framework systematically identifies the subsets of components (e.g., attention heads and multi-layer perceptron neurons) most critical to a desired target performance metric (e.g., accuracy and fairness). This is achieved by incorporating flexible interventions applied to a wide range of desired metrics. To address the combinatorial complexity of the multi-component problem, an efficient algorithm is designed that leverages soft interventions and a carefully designed metric transformation, converting the combinatorial search problem into a continuous one that can be solved efficiently under proper constraints, thereby generating proper binary decisions for selecting components. Experimental results demonstrate that the proposed method efficiently identifies subsets of the model's components that have a high impact on the target metric, outperforming existing baseline approaches. Our code is available at https://github.com/ZiruiYan/multi-component-causal-tracing.

2606.02320 2026-06-15 cs.CL 版本更新

TVIR: Building Deep Research Agents Towards Text-Visual Interleaved Report Generation

TVIR:构建面向文本-视觉交错报告生成的深度研究智能体

Xinkai Ma, Zhiqi Bai, Dingling Zhang, Pei Liu, Yishuo Yuan, He Zhu, Jiakai Wang, Qianqian Xie, Yifan Zhao, Xinlong Yang, Hao Cong, Zhiheng Yao, Fengxia Xie, Zihao Xu, Haoran Xu, Zhaohui Wang, Minghao Liu, Shirong Lin, Yingshui Tan, Yuchi Xu, Wenbo Su, Zhaoxiang Zhang, Bo Zheng, Jiaheng Liu

发表机构 * Nanjing University(南京大学) Alibaba Group(阿里巴巴集团)

AI总结 提出TVIR基准和层次化多智能体框架,解决深度研究报告中视觉元素的事实可靠性与对齐问题。

详情
AI中文摘要

深度研究智能体在多步信息检索、推理和长文本报告生成方面表现出强大能力,但现有基准和系统仍以文本为中心,对视觉元素是否事实可靠且与周围分析良好对齐的评估有限。为填补这一空白,我们引入了TVIR(文本-视觉交错报告生成),包括TVIR-Bench(一个包含100个专家策划的多模态深度研究任务的基准,要求视觉元素服务于特定的分析子目标)和TVIR-Agent(一个层次化多智能体框架,作为构建大纲、检索图像、生成可溯源图表以及通过上下文感知的顺序写作撰写报告的强基线)。我们进一步开发了结合文本评估和视觉评估的双路径评估框架。在九个深度研究系统上的实验表明,TVIR-Agent实现了强大的整体性能,凸显了显式多模态设计和评估对于证据驱动报告生成的重要性。

英文摘要

Deep Research Agents have shown strong capability in multi-step information retrieval, reasoning, and long-form report generation, but existing benchmarks and systems remain predominantly text-centric, with limited evaluation of whether visual elements are factually reliable and well aligned with the surrounding analysis. To address this gap, we introduce TVIR (Text-Visual Interleaved Report Generation), which includes TVIR-Bench, a benchmark of 100 expert-curated multimodal deep research tasks that require visual elements to serve specific analytical sub-goals, and TVIR-Agent, a hierarchical multi-agent framework that serves as a strong baseline for constructing outlines, retrieving images, generating charts with traceable sources, and composing reports through context-aware sequential writing. We further develop a dual-path evaluation framework that combines Textual Assessment and Visual Assessment. Experiments across nine deep research systems show that TVIR-Agent achieves strong overall performance, underscoring the importance of explicit multimodal design and evaluation for evidence-driven report generation.

2606.01730 2026-06-15 cs.AI cs.LG 版本更新

Evidence-Gated LLM Priors for Multi-Objective Bayesian Optimization

证据门控的LLM先验用于多目标贝叶斯优化

Jiangyu Chen, Ban Yi

发表机构 * State Key Laboratory for Novel Software Technology(新型软件技术国家重点实验室)

AI总结 针对多目标贝叶斯优化中LLM先验可能误导的问题,提出一种目标级声誉市场机制,通过在线反馈动态校准专家权重,并引入解耦反事实门控,在合成测试和分子优化基准上验证了动态校准的鲁棒性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作黑箱优化的启发式顾问,但其建议和自我报告的置信度不一定与下游目标值校准。在多目标贝叶斯优化中,这一问题更加突出,因为不同目标可能需要不同的专家知识,而LLM专家可能对一个目标有用,但对另一个目标产生误导。 我们研究如何在离散多目标贝叶斯优化中使用LLM生成的专家先验,而不盲目信任它们。我们提出了一种目标级声誉市场机制,将每个专家-目标对视为可证伪的先验来源。专家权重根据观察到的目标反馈在线更新,随时间衰减,并由市场级信任门控。然后,我们引入一个解耦的反事实门控,可以在不使用置信度的情况下使用LLM先验,在置信度下使用,或完全放弃LLM先验。 在受控的合成压力测试和三个使用\qwenflash{}生成的专家先验的分子优化基准上,我们发现动态目标级校准比固定LLM先验提高了鲁棒性。然而,原始LLM置信度并不总是有益的:在ESOL上,置信度与预测误差正相关;在FreeSolv上,置信度可能有帮助;在Lipophilicity上,忽略置信度仍然最强。我们的固定三臂反事实门控在ESOL和FreeSolv上优于第一个反事实变体,而尝试的边际组合暴露了一个有用的负面结果:边际选择应基于采集感知,而不是仅基于一步先验误差。

英文摘要

Large language models (LLMs) are increasingly used as heuristic advisors for black-box optimization, yet their suggestions and self-reported confidence are not necessarily calibrated to downstream objective values. This issue becomes more pronounced in multi-objective Bayesian optimization, where different objectives may require different expert knowledge and where an LLM expert can be useful for one objective but misleading for another. We study how to use LLM-generated expert priors in discrete multi-objective Bayesian optimization without blindly trusting them. We propose an objective-wise reputation-market mechanism that treats each expert-objective pair as a falsifiable prior source. Expert weights are updated online from observed objective feedback, discounted over time, and gated by market-level trust. We then introduce a decoupled counterfactual gate that can use the LLM prior without confidence, use it with confidence, or abstain from the LLM prior entirely. Across controlled synthetic stress tests and three molecule optimization benchmarks with \qwenflash{}-generated expert priors, we find that dynamic objective-wise calibration improves robustness over fixed LLM priors. However, raw LLM confidence is not reliably beneficial: on ESOL, confidence is positively correlated with prediction error; on FreeSolv, confidence can help; and on Lipophilicity, ignoring confidence remains strongest. Our fixed three-arm counterfactual gate improves over the first counterfactual variant on ESOL and FreeSolv, while an attempted margin portfolio exposes a useful negative result: margin selection should be acquisition-aware rather than based only on one-step prior error.
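
The idea of expert weights that are "updated online from observed objective feedback, discounted over time, and gated by market-level trust" can be sketched as below. The abstract does not give the update rule, so the class `ReputationMarket`, the evidence score 1 / (1 + |error|), the exponential discount, and the `trust_floor` gate are all hypothetical choices made for illustration.

```python
class ReputationMarket:
    """Tracks one discounted reputation score per (expert, objective) pair."""

    def __init__(self, discount=0.9, initial=0.5):
        self.discount = discount
        self.initial = initial
        self.score = {}  # (expert, objective) -> reputation in (0, 1]

    def update(self, expert, objective, predicted, observed):
        """Fold one observed objective value into the expert's reputation."""
        evidence = 1.0 / (1.0 + abs(predicted - observed))  # 1.0 means a perfect prior
        old = self.score.get((expert, objective), self.initial)
        self.score[(expert, objective)] = (
            self.discount * old + (1.0 - self.discount) * evidence
        )

    def weights(self, objective, trust_floor=0.0):
        """Normalized weights over experts whose reputation clears the gate."""
        raw = {e: s for (e, o), s in self.score.items()
               if o == objective and s >= trust_floor}
        total = sum(raw.values())
        return {e: s / total for e, s in raw.items()} if total > 0 else {}


market = ReputationMarket()
for _ in range(5):
    market.update("expert_a", "solubility", predicted=1.0, observed=1.0)  # accurate
    market.update("expert_b", "solubility", predicted=5.0, observed=1.0)  # misleading
mixed = market.weights("solubility")                   # both experts, a weighted more
gated = market.weights("solubility", trust_floor=0.5)  # only the trusted expert
```

Keeping scores per objective is what lets an expert stay influential on one objective while being down-weighted on another; an empty result from `weights` corresponds to abstaining from the LLM prior for that objective.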

2606.01476 2026-06-15 cs.LG cs.CL 版本更新

OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification

OmniOPD:通过推测性验证实现无Logit的在线策略蒸馏

Yuhang Zhou, Lizhu Zhang, Yifan Wu, Mingyi Wang, Bo Peng, Jiayi Liu, Xiangjun Fan, Zhuokai Zhao

发表机构 * Meta AI

AI总结 提出OmniOPD框架,通过基于蒙特卡洛展开的块级语义相似度替代token级logit匹配,结合峰值熵调度器和贝叶斯先验,解决在线策略蒸馏中logit不可获取和信号脆弱问题,在数学任务上超越标准OPD达28.64%。

Comments 26 pages, 3 figures

详情
AI中文摘要

在线策略蒸馏(OPD)在强教师模型的密集token级反馈下,基于学生模型自身的生成轨迹进行训练,缓解了监督微调(SFT)的离策略分布偏移和强化学习(RL)的稀疏信用分配问题。然而,标准OPD面临两个耦合的限制。首先,它需要直接访问教师模型的token级logit,将一大类有能力的专有模型排除在教师之外。其次,token级logit信号本身是脆弱的,依赖于教师和学生之间合理下一个token的狭窄重叠,并且容易放大重复循环等退化模式。在本文中,我们引入了OmniOPD,一种通过无logit的块级监督信号解决这两个限制的新框架。OmniOPD用蒙特卡洛展开替代确定性logit匹配,通过多token块上的连续语义相似性度量近似教师的局部偏好,并通过峰值熵调度器集中这种监督,仅在学生的高不确定性推理分叉处进行审计。Dirichlet-Multinomial贝叶斯先验和基础模型KL锚进一步限制了离散采样的方差,并防止了未审计token上的策略崩溃。在竞争性基准测试中,OmniOPD在数学任务上超越标准OPD方法高达28.64%,证实了块级语义验证提取了比token级logit匹配更可靠的学习信号,后者高信息密度被显著的噪声和脆弱性所抵消。此外,当与更强的黑盒教师(如Claude-4.5-Haiku和Gemini-2.5-Flash)配对时,OmniOPD在数学任务上相对于其开放权重教师对应物额外获得了9.54%的相对提升,使学生超越了自我探索RL的性能。

英文摘要

On-Policy Distillation (OPD) trains a student model on its own generative trajectories under dense token-level feedback from a stronger teacher, mitigating both the off-policy distribution shift of Supervised Fine-Tuning (SFT) and the sparse credit assignment of Reinforcement Learning (RL). However, standard OPD faces two coupled limitations. First, it requires direct access to the teacher's token-level logits, excluding a broad class of capable proprietary models from serving as teachers. Second, the token-level logit signal itself is brittle, depending on a narrow overlap of plausible next tokens between teacher and student, and prone to amplifying degenerate patterns such as repetition loops. In this paper, we introduce OmniOPD, a novel framework that addresses both limitations through a logit-free, chunk-level supervision signal. OmniOPD replaces deterministic logit matching with Monte Carlo rollouts that approximate the teacher's local preferences through a continuous semantic similarity metric over multi-token chunks, and concentrates this supervision via a peak-entropy scheduler that audits the student only at its high-uncertainty reasoning forks. A Dirichlet-Multinomial Bayesian prior and a base-model KL anchor further bound the variance of discrete sampling and prevent policy collapse across unaudited tokens. Across competitive benchmarks, OmniOPD surpasses the standard OPD approach by up to +28.64% on math, confirming that chunk-level semantic verification extracts a more reliable learning signal than token-level logit matching, whose high information density is offset by significant noise and brittleness. Furthermore, when paired with stronger black-box teachers such as Claude-4.5-Haiku and Gemini-2.5-Flash, OmniOPD achieves an additional +9.54% relative on math over its open-weight teacher counterpart, advancing the student past the performance of self-exploratory RL.
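
The peak-entropy scheduler described above audits the student only at its high-uncertainty positions. A minimal sketch of that selection step follows; it assumes per-position next-token distributions are available as lists of probabilities, and the names `entropy`, `peak_entropy_positions`, and `budget` are hypothetical rather than taken from OmniOPD.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def peak_entropy_positions(distributions, budget):
    """Return the `budget` positions with the highest entropy, in sequence order."""
    ranked = sorted(range(len(distributions)),
                    key=lambda i: entropy(distributions[i]),
                    reverse=True)
    return sorted(ranked[:budget])


# Toy student distributions over a 3-token vocabulary at four positions.
student = [
    [1.0, 0.0, 0.0],        # certain
    [0.5, 0.5, 0.0],        # a two-way fork
    [1 / 3, 1 / 3, 1 / 3],  # maximally uncertain
    [0.9, 0.1, 0.0],        # nearly certain
]
audit_at = peak_entropy_positions(student, budget=2)
```

Only the selected positions would then be sent to the teacher for chunk-level comparison, which keeps the number of teacher calls bounded by `budget` regardless of sequence length.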

2606.00947 2026-06-15 cs.LG cs.AI 版本更新

Silent Failures in Federated Personalization of Foundation Models

联邦基础模型个性化中的静默失败

YongKyung Oh, Alex Bui

发表机构 * Medical & Imaging Informatics (MII) Group, University of California, Los Angeles (UCLA)(医学与影像信息学(MII)组,加州大学洛杉矶分校(UCLA))

AI总结 本文提出联邦基础模型个性化中因隐私约束导致的一类信任失败——静默失败,包括偏差放大、公平性崩溃和对齐侵蚀,并引入六种静默失败模式的分类法,强调隐私保护训练不足以保障可信部署。

详情
AI中文摘要

基础模型通过联邦学习在分散的私有数据上越来越个性化,并在日益增长的上市后监管要求下大规模部署。我们认为这种趋同产生了一类独特且未被充分认识的信任失败,我们称之为“静默失败”。这些包括偏差放大、公平性崩溃和对齐侵蚀,这些可能仍然难以检测,因为联邦学习的隐私约束限制了对模型行为的可见性。对现有基准的景观分析揭示了结构性鸿沟。联邦基准评估系统性能,但对模型行为的洞察有限,而集中式信任基准评估行为,但需要与联邦隐私不兼容的模型访问。我们引入了一个由基础模型个性化、数据集偏移和核心联邦约束相互作用产生的六种静默失败模式的分类法。我们的分析表明,仅靠隐私保护训练不足以实现可信部署。最后,我们提出了一个隐私保护行为评估的研究议程,并建议将静默失败作为可信联邦人工智能的标准诊断类别。

英文摘要

Foundation models are increasingly personalized on decentralized private data through federated learning and are now deployed at scale under growing regulatory requirements for post-market monitoring. We argue that this convergence creates a distinct and under-recognized class of trustworthiness failures, which we term "Silent Failures." These include amplified bias, fairness collapse, and alignment erosion that may remain difficult to detect because federated learning's privacy constraints limit visibility into model behavior. A landscape analysis of existing benchmarks reveals a structural divide. Federated benchmarks evaluate system performance but provide limited insight into model behavior, whereas centralized trustworthiness benchmarks assess behavior but require model access incompatible with federated privacy. We introduce a taxonomy of six silent failure modes arising from the interaction of foundation model personalization, dataset shift, and core federated constraints. Our analysis shows that privacy-preserving training alone is insufficient for trustworthy deployment. We conclude with a research agenda for privacy-preserving behavioral evaluation and propose that silent failures become a standard diagnostic category for trustworthy federated artificial intelligence.

2605.31604 2026-06-15 cs.CV 版本更新

Representation Forcing for Bottleneck-Free Unified Multimodal Models

表示强制:无瓶颈统一多模态模型

Yuqing Wang, Zhijie Lin, Ceyuan Yang, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Zihan Ding, Fuyun Wang, Shuai Wang, Youliang Zhang, Haoqi Fan, Xihui Liu

发表机构 * University of Hong Kong(香港大学) ByteDance Seed(字节跳动种子) The Chinese University of Hong Kong(香港中文大学) Nanjing University(南京大学) Tsinghua University(清华大学)

AI总结 提出表示强制(RF)技术,通过让解码器自回归预测视觉表示作为中间令牌,再在相同骨干网络中引导像素扩散,从而消除统一多模态模型对预训练VAE的依赖,实现无瓶颈的端到端模型。

Comments Project page: https://yuqingwang1029.github.io/RepresentationForcing

详情
AI中文摘要

统一多模态模型(UMMs)旨在单个模型中处理感知和生成。然而,现有的UMMs仍然依赖一个冻结的、单独预训练的VAE进行图像生成,造成了结构瓶颈。简单地移除它会导致质量差距,因为模型必须从原始像素中同时学习高级结构和低级细节。在本文中,我们提出了表示强制(RF),一种通过使表示预测成为模型原生能力来缩小这一差距的技术。具体来说,RF强制解码器在像素之前自回归地预测视觉表示作为中间令牌;这些令牌随后保留在上下文中,在相同骨干网络内引导像素扩散。通过将表示从感知输出转变为生成目标,RF消除了任何外部生成潜在空间的需求。我们发现RF对理解和生成都有益。在图像生成上,我们的像素空间模型与RF匹配了基于VAE的最先进统一模型。在图像理解上,像素空间RF通常优于其基于VAE的变体。这些结果共同为迈向端到端、无瓶颈的UMMs提供了有效的一步。

英文摘要

Unified multimodal models (UMMs) aim to handle perception and generation in a single model. Yet existing UMMs still rely on a frozen, separately pretrained VAE for image generation, imposing a structural bottleneck. Naively removing it introduces a quality gap, as the model must learn both high-level structure and low-level details from raw pixels. In this paper, we propose Representation Forcing (RF), a technique that closes this gap by making representation prediction a native capability of the model. Concretely, RF forces the decoder to autoregressively predict visual representations as intermediate tokens before pixels; these tokens then stay in context to guide pixel diffusion within the same backbone. By turning representations from perception outputs into generation targets, RF eliminates the need for any external generative latent space. We find that RF benefits both understanding and generation. On image generation, our pixel-space model with RF matches state-of-the-art VAE-based unified models. On image understanding, pixel-space RF generally outperforms its VAE-based variant. Together, these results offer an effective step toward end-to-end, bottleneck-free UMMs.

2605.30931 2026-06-15 cs.CL 版本更新

MineExplorer: Evaluating Open-World Exploration of MLLM Agents in Minecraft

MineExplorer: 评估MLLM智能体在Minecraft中的开放世界探索能力

Tianjie Ju, Yueqing Sun, Zheng Wu, Wei Zhang, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Gongshen Liu, Zhuosheng Zhang

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学学院) Meituan(美团)

AI总结 提出MineExplorer基准,通过多智能体合成工作流构建隐式多跳任务,评估多模态大语言模型在Minecraft中的开放世界探索能力,发现长轨迹协调仍是挑战。

Comments Working in progress

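Details
AI-generated abstract (translated from Chinese)

Multimodal large language models (MLLMs) have demonstrated strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks usually compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce the MineExplorer benchmark for evaluating the open-world exploration ability of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions depend heavily on Minecraft-specific knowledge, so that the benchmark better reflects general open-world reasoning. We then organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To further construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that the multi-agent synthesis workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging: strong models can handle many single-hop tasks, but their performance drops sharply on long trajectories that require coordinating hidden prerequisites. Further analysis finds that task difficulty correlates with agent completion, while larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.

English abstract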

Multimodal large language models (MLLMs) have shown strong capabilities in perception, reasoning, and action generation. However, their ability to sustain exploration in dynamic open worlds remains unclear. Existing embodied and game-based benchmarks often compress interaction into short-horizon tasks or entangle success with domain-specific game mechanics. In this paper, we introduce the MineExplorer benchmark for evaluating the open-world exploration capabilities of MLLM agents in Minecraft. We first filter out atomic tasks whose solutions rely heavily on Minecraft-specific knowledge, to better reflect general open-world reasoning. Then we organize the benchmark around a ReAct-style capability formulation and compose atomic tasks into implicit multi-hop tasks. To further construct reliable instances, MineExplorer uses a multi-agent synthesis workflow that jointly designs task graphs, sandbox scenes, and rule-based milestone evaluators. Human evaluation shows that the multi-agent synthesis workflow produces significantly more reliable instances than a single-agent baseline. Experiments with advanced MLLM agents show that open-world exploration remains challenging, as strong models can handle many single-hop tasks but degrade sharply when hidden prerequisites must be coordinated over longer trajectories. Further analysis finds that task difficulty tracks agent completion, and that larger models or thinking modes do not consistently translate into better performance. Code and dataset are available at https://github.com/Jometeorie/MineExplorer.
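The abstract mentions task graphs, implicit multi-hop tasks, and rule-based milestone evaluators without giving their format. As a rough illustration only, the sketch below models a multi-hop task as milestones with hidden prerequisites and scores an agent's event log against them. The `Milestone` class, the `requires` field, the event names, and the scoring rule are all hypothetical and are not taken from the MineExplorer code.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Milestone:
    name: str
    requires: Tuple[str, ...] = ()  # hypothetical: hidden prerequisites


def hop_count(milestones: List[Milestone], goal: str) -> int:
    """Length of the longest prerequisite chain that ends at `goal`."""
    by_name = {m.name: m for m in milestones}

    def depth(name: str) -> int:
        reqs = by_name[name].requires
        return 1 + max((depth(r) for r in reqs), default=0)

    return depth(goal)


def evaluate(milestones: List[Milestone], events: List[str]):
    """Rule-based scoring: an event completes a milestone only if all of
    that milestone's prerequisites were completed earlier in the log."""
    by_name = {m.name: m for m in milestones}
    done: List[str] = []
    for event in events:
        m = by_name.get(event)
        if m is None or event in done:
            continue  # not a milestone, or already credited
        if all(r in done for r in m.requires):
            done.append(event)
    return len(done) / len(milestones), done


# Hypothetical three-hop task: wood -> planks -> crafting table.
TASK = [
    Milestone("collect_wood"),
    Milestone("craft_planks", requires=("collect_wood",)),
    Milestone("craft_table", requires=("craft_planks",)),
]

score, done = evaluate(
    TASK, ["craft_planks", "collect_wood", "craft_planks", "look_around"]
)
print(hop_count(TASK, "craft_table"), done, round(score, 2))
```

In this toy log the first `craft_planks` event earns no credit because its prerequisite has not been met yet, which is the kind of ordering constraint a milestone evaluator over a task graph would need to enforce.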

2605.28591 2026-06-15 cs.CL cs.AI updated version

Models That Know How Evaluations Are Designed Score Safer

Models That Know How Evaluations Are Designed Are Safer

Katharina Deckenbach, Haritz Puerto, Jonas Geiping, Sahar Abdelnabi

Affiliations * ELLIS Institute Tübingen, Max Planck Institute for Intelligent Systems, Tübingen AI Center

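AI summary This paper fine-tunes models to acquire meta-knowledge about evaluations (such as verifiable structures or moral dilemmas) and finds that this makes them behave more safely on safety benchmarks, introducing a new confounder that is independent of explicit memorization or evaluation awareness.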

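Details
AI-generated abstract (translated from Chinese)

The validity of AI safety evaluations depends on models behaving consistently in controlled and deployment settings. Prior work has identified test-time contextual cues (for example, hypothetical scenarios) as a source of verbalized evaluation awareness and subsequent behavioral shifts. In this paper, we study a potential explanation for this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits of evaluations. By analogy with dataset contamination (where benchmark exposure leads to higher performance through memorization), we hypothesize that models trained on text describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for example through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents that describe evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on six safety benchmarks, we find that it is significantly safer than the base model and the control model. This behavioral shift persists even when the analysis is restricted to responses that lack explicit verbalization of evaluation awareness. Our results show that evaluation meta-knowledge may inflate safety benchmark performance, introducing a new confounder that is independent of explicit memorization or verbalized evaluation awareness and is therefore hard to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.

English abstract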

The validity of AI safety evaluations depends on models behaving consistently across controlled and deployment settings. Prior work has identified test-time contextual cues, such as hypothetical scenarios, as a source of verbalized evaluation awareness and subsequent behavioral shift. In this paper, we investigate a potential explanation of this phenomenon: evaluation meta-knowledge, defined as parametric knowledge about the structural traits that characterize evaluations. Similar to dataset contamination, where benchmark exposure leads to higher performance through memorization, we hypothesize that models trained on texts describing evaluation practices may implicitly learn to recognize and respond to evaluation-like contexts, for instance through exposure to scientific articles or social media posts about AI benchmarking. To test this, we fine-tune models on synthetic documents describing evaluation traits such as verifiable structures or moral dilemmas. Evaluating this fine-tuned model on five safety benchmarks, we find that it is significantly safer than the base model and control model. This behavioral shift persists even when restricting the analysis to responses lacking explicit verbalization of evaluation awareness. Our results demonstrate that evaluation meta-knowledge may inflate safety benchmark performance, introducing a novel confounder that is independent of explicit memorization or verbalized evaluation awareness and is thus challenging to detect. These findings have important implications for the design and interpretation of AI safety evaluations. Our code and models are available at https://github.com/compass-group-tue/arxiv2026_evaluation_meta_knowledge.

2605.28477 2026-06-15 cs.CV updated version

SA4Depth: Consistent Pose-Depth Scale Alignment for Self-Supervised Monocular Depth Estimation

SA4Depth: Consistent Pose-Depth Scale Alignment in Self-Supervised Monocular Depth Estimation

Changxuan Li, Nadine Berner, Nassir Navab, Federico Tombari, Stefano Gasperini

Affiliations * Technical University of Munich; BMW Group; Munich Center for Machine Learning (MCML); Google; VisualAIs Labs GmbH

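AI summary Proposes SA4Depth, which uses differentiable reprojection of visual features and pose refinement to align the scene scales estimated by the depth and pose networks in self-supervised depth estimation, improving depth prediction accuracy without increasing inference time.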

Comments Accepted by IEEE RA-L 2026

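Details
AI-generated abstract (translated from Chinese)

Self-supervised depth estimation from monocular sequences relies on jointly learning a depth network and a pose network. Although a large body of research has focused on improving the depth network, work on the pose network remains limited. In this context, even when depth is estimated only up to scale, we emphasize the importance of aligning the scene scales estimated by the pose network and the depth network. We then introduce SA4Depth, a method that improves this alignment and boosts depth prediction while keeping inference time unchanged. Our method uses the depth estimated during training to reproject learnable visual features across consecutive frames and refines the pose estimates by reducing feature alignment residuals. With our method, the scene scales estimated by the separate depth and pose networks are aligned, and the consistency of the predicted scale across different sequences improves. Our differentiable refinement integrates seamlessly into existing self-supervised pipelines and substantially improves their depth estimates. We demonstrate this with extensive outdoor and indoor experiments on KITTI, Cityscapes, and NYUv2. In addition, results on KITTI Odometry confirm the effectiveness of our pose refinement. Our code is available at https://github.com/Runningchauncey/SA4Depth.

English abstract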

Self-supervised depth estimation from monocular sequences relies on the joint learning of a depth network and a pose network. Despite abundant research on improving the depth network, efforts on the pose network remain limited. In this context, even when depth is estimated only up to scale, we highlight the importance of the alignment between the scene scales estimated by the pose and depth networks. We then introduce SA4Depth, an approach that improves this alignment and boosts the depth predictions while keeping the inference time unchanged. Our proposed method uses the depth estimated during training to reproject learnable visual features across consecutive frames and refines the pose estimates by reducing feature alignment residuals. With our method, the scene scales estimated by the separate depth and pose networks are aligned, and the prediction scale consistency is improved across different sequences. Our differentiable refinement integrates seamlessly into existing self-supervised pipelines and substantially improves their depth estimates. We demonstrate this with extensive experiments both outdoors and indoors on KITTI, Cityscapes, and NYUv2. Additionally, results on KITTI Odometry confirm the effectiveness of our pose refinement. Our code is available at https://github.com/Runningchauncey/SA4Depth.
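To make the scale-alignment point concrete, here is a minimal sketch of the standard pinhole reprojection that self-supervised depth pipelines commonly rely on: a pixel is back-projected with its depth, moved by the relative pose, and projected into the other frame. It is not the SA4Depth implementation (which reprojects learnable features and refines the pose); the function name, camera intrinsics, and numeric values are assumptions chosen for illustration. The example shows that scaling depth and translation by the same factor leaves the reprojection unchanged, while scaling only one of them shifts it.

```python
def reproject(pixel, depth, intrinsics, R, t):
    """Warp one pixel from a source frame into a target frame.

    intrinsics = (fx, fy, cx, cy) of an assumed pinhole camera,
    R = 3x3 rotation (nested lists), t = translation (3 values).
    """
    fx, fy, cx, cy = intrinsics
    u, v = pixel
    # Back-project the pixel to a 3D point in the source camera frame.
    point = ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)
    # Apply the rigid transform given by the relative pose.
    moved = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    # Project back onto the image plane of the target camera.
    return (fx * moved[0] / moved[2] + cx, fy * moved[1] / moved[2] + cy)


# Illustrative values only (not from the paper).
INTRINSICS = (500.0, 500.0, 320.0, 240.0)
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
T = [0.1, 0.0, 0.0]
PIXEL, DEPTH = (400.0, 300.0), 10.0

aligned = reproject(PIXEL, DEPTH, INTRINSICS, IDENTITY, T)
# Depth and translation share one scale factor: same reprojection.
rescaled = reproject(PIXEL, 2.0 * DEPTH, INTRINSICS, IDENTITY, [2.0 * x for x in T])
# Only depth is rescaled: the two scales disagree and the pixel shifts.
mismatched = reproject(PIXEL, 2.0 * DEPTH, INTRINSICS, IDENTITY, T)

print(aligned, rescaled, mismatched)
```

Because the reprojection depends only on the ratio between depth and translation magnitude, a photometric or feature residual computed from it can only be driven down if the two networks agree on scale, which is one way to read the alignment the abstract describes.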