arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.00571 2026-06-02 cs.LG cs.AI cs.CV

On the Difficulty of Learning a Meta-network for Training Data Selection

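Zilin Du, Junqi Zhao, Boyang Albert Li

Affiliations: University of California, Berkeley; Stanford University

AI summary: To explain why meta-learning for training-data selection (MTS) underperforms in practice, the paper uses mathematical analysis to identify two obstacles, a poor gradient signal-to-noise ratio and a lack of informative features, and proposes a larger batch size and informative features as remedies.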

English abstract

Synthetic data are increasingly used to train neural networks, yet distributional mismatch with real data limits their effectiveness when used indiscriminately. A common strategy is to learn data weights via bi-level optimization, which we refer to as Meta-learning for Training-data Selection (MTS). Interestingly, in practice, MTS often performs below expectation. We identify two obstacles in properly training MTS: a poor gradient signal-to-noise ratio (GSNR), which causes optimization difficulties, and lack of informative features that correlate with data quality. We present a mathematical analysis of MTS, which reveals the dynamics of normalized data weights and the relation between disparate data quality and poor GSNR. The analysis suggests a simple yet effective solution: increasing the batch size. Further, we propose a set of informative features that capture the positions of training data in their distributions and training dynamics. Experiments across four benchmarks show consistent improvements, achieving average gains of 5.49% over training without selection and 2.89% over the strongest baseline.
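
The link between batch size and GSNR can be illustrated with a toy calculation. This is a minimal sketch, not the paper's analysis: it assumes per-example gradients are independent scalar draws with a small mean and a large spread, and it uses the common definition of GSNR as the squared mean of the gradient estimate divided by its variance. The names `minibatch_gradient` and `gsnr` and all numeric defaults are hypothetical.

```python
import random
import statistics


def minibatch_gradient(rng, batch_size, true_mean=0.1, noise_std=1.0):
    """Average of `batch_size` noisy per-example gradients (toy scalar model)."""
    total = sum(rng.gauss(true_mean, noise_std) for _ in range(batch_size))
    return total / batch_size


def gsnr(batch_size, trials=2000, seed=0):
    """Empirical GSNR of the mini-batch gradient: squared mean over variance."""
    rng = random.Random(seed)
    grads = [minibatch_gradient(rng, batch_size) for _ in range(trials)]
    mean = statistics.fmean(grads)
    variance = statistics.variance(grads)
    return mean * mean / variance
```

Under these assumptions the variance of the averaged gradient shrinks roughly as one over the batch size, so calling `gsnr(1)`, `gsnr(16)`, and `gsnr(256)` should show the ratio growing with the batch size.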

2606.00570 2026-06-02 cs.CL cs.AI

Revisiting Parameter-Based Knowledge Editing in Large Language Models: Theoretical Limits and Empirical Evidence

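Wanying Ren, Xin Song, Futing Wang, Guoxiu He, Aixin Sun

Affiliations: University of Science and Technology of China

AI summary: Through theoretical analysis and empirical evaluation, the paper shows that parameter-based knowledge editing can cause global interference and reasoning collapse, as explained by the Dimensional Collapse Hypothesis, while a simple retrieval-based baseline performs better under all evaluated conditions.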

Comments Accepted to ICML 2026. Equal contribution by the first two authors. 9 pages main paper, 10 figures, with appendix

English abstract

Parameter-based knowledge editing updates the internal knowledge of large language models (LLMs) via localized weight modifications and has attracted significant attention. However, most existing methods overlook fundamental theoretical limitations and are rarely evaluated under realistic, practice-oriented settings. In this paper, we first present a theoretical analysis based on the Dimensional Collapse Hypothesis, explaining how localized parameter edits can propagate along fragile directions in the representation space, inducing global interference and ultimately causing reasoning collapse. Building on this insight, we conduct a comprehensive empirical evaluation by systematically varying knowledge complexity, number of edits, evaluation dimensions, and baseline methods. Our results show that parameter-based editing methods consistently damage core LLM capabilities. In contrast, a simple retrieval-based baseline achieves consistently stronger performance than all parameter-editing methods across all evaluated conditions. These findings highlight that preserving the fundamental capabilities of LLMs after knowledge editing should be a central concern for future research.

2606.00566 2026-06-02 cs.LG cs.CL cs.CR

Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models

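Mohammed Sameer Syed, Rozhin Yasaei

Affiliations: University of Arizona

AI summary: The study introduces the Safety Asymmetry Score (SAS), which uses matched malicious payloads and varies only the delivery context to measure how a language model's vulnerability to adversarial content differs across channels (user message, tool metadata, tool output). Agent-native models are more vulnerable through the tool-description channel, general-purpose models show the reverse, and a mechanistic study indicates that the safety-relevant representation is non-linearly encoded in the deeper layers of the network.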

Comments 13 pages, 1 figure. Submitted to EMNLP 2026

English abstract

As language models take on agentic roles that span calling external APIs, reading tool outputs, and acting on instructions embedded in third-party content, their attack surface expands well beyond what users type. Whether a model treats a malicious instruction the same way regardless of where it arrives has not been systematically studied. We introduce the Safety Asymmetry Score (SAS), which measures how much a model's susceptibility to adversarial content shifts depending on whether that content arrives in the user message, tool metadata, or tool output, using matched payload pairs that keep the malicious text identical and vary only the context of delivery. Across six production LLMs and three attack families, we find a consistent and informative asymmetry: agent-native models are substantially more vulnerable when adversarial content arrives via tool descriptions than via user messages, while general-purpose models show the reverse. This asymmetry further inverts when the same content is delivered through tool outputs rather than descriptions, suggesting models implicitly treat tool metadata as trusted instructions and tool results as ordinary data. A mechanistic study on Llama 3.3 70B reveals that the safety-relevant representation is causally present at mid-to-late network depths but non-linearly encoded, explaining why linear probes fail to detect it. These findings expose a systematic, channel-dependent blind spot in how current tool-using models handle adversarial content.
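
The matched-payload design can be sketched in a few lines. The abstract does not give the SAS formula, so the score below is an assumption made only for illustration: the attack success rate in one channel minus the rate in another, computed over the same payloads. The function names, channel labels, and trial outcomes are all hypothetical and do not come from the paper.

```python
from collections import defaultdict


def attack_success_rates(trials):
    """Per-channel success rate from (payload_id, channel, succeeded) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for _payload_id, channel, succeeded in trials:
        totals[channel] += 1
        hits[channel] += int(succeeded)
    return {channel: hits[channel] / totals[channel] for channel in totals}


def asymmetry(trials, channel_a, channel_b):
    """Assumed score: success rate in channel_a minus success rate in channel_b."""
    rates = attack_success_rates(trials)
    return rates[channel_a] - rates[channel_b]


# Made-up outcomes: each payload is delivered once per channel, text unchanged.
trials = [
    ("p1", "user_message", False), ("p1", "tool_description", True),
    ("p2", "user_message", False), ("p2", "tool_description", True),
    ("p3", "user_message", True), ("p3", "tool_description", True),
    ("p4", "user_message", False), ("p4", "tool_description", False),
]
```

Because every payload appears in both channels, a nonzero `asymmetry(trials, "tool_description", "user_message")` reflects the delivery context and not the payload text.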

2606.00564 2026-06-02 cs.CV cs.CL

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

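Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom, Ji Woo Hong, Mark Hasegawa-Johnson, Qi Dai, Chong Luo, Chang D. Yoo

Affiliations: University of California, Berkeley

AI summary: The paper decomposes the vision-language model distillation loss into a language-prior component and a visual-grounding component whose gradients are nearly orthogonal, and proposes Visual Gradient Steering (VGS), which dynamically adjusts the update direction to prioritize the visual subspace and improves the grounding of small models on complex multimodal tasks.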

Comments ICML 2026 Spotlight

English abstract

While on-policy distillation offers dense supervision for training small reasoning models, its optimization dynamics in the multimodal domain remain under-explored. In this work, we challenge the standard monolithic view of Vision-Language Model (VLM) distillation by mathematically decomposing the loss into two distinct components: the language prior and visual grounding. Our analysis uncovers that gradient vectors for these components are nearly orthogonal, indicating that the objective of aligning with the teacher's language distribution is geometrically independent from the objective of matching its visual perception. Consequently, standard optimization passively follows a suboptimal compromise trajectory that implicitly balances the two objectives. Hypothesizing that visual grounding constitutes the primary bottleneck for vision-language reasoning, we introduce Visual Gradient Steering (VGS), a method that dynamically reorients the update vector to prioritize the visual subspace. Experimental results on multiple distillation settings and complex multimodal benchmarks demonstrate that VGS significantly outperforms the standard monolithic formulation of on-policy distillation, achieving superior grounding with minimal training overhead.
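
The geometric claim can be made concrete with toy vectors. The sketch below measures the cosine similarity between a language-prior gradient and a visual-grounding gradient, then tilts the combined update toward the visual component. The steering rule (upweight the visual gradient, then rescale to the norm of the plain sum) is an assumption chosen for illustration and is not the paper's VGS update; `steer`, `visual_weight`, and the example vectors are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def norm(u):
    return math.sqrt(dot(u, u))


def cosine(u, v):
    """Cosine similarity; values near 0 mean the gradients are nearly orthogonal."""
    return dot(u, v) / (norm(u) * norm(v))


def steer(g_lang, g_vis, visual_weight=2.0):
    """Assumed rule: upweight the visual gradient, keep the norm of the plain sum."""
    combined = [a + b for a, b in zip(g_lang, g_vis)]
    steered = [a + visual_weight * b for a, b in zip(g_lang, g_vis)]
    scale = norm(combined) / norm(steered)
    return [scale * x for x in steered]


# Toy gradients that are exactly orthogonal.
g_lang = [1.0, 0.0, 0.0]
g_vis = [0.0, 1.0, 0.0]
```

With orthogonal components the plain sum sits halfway between the two objectives, while `steer(g_lang, g_vis)` has the same length but points closer to `g_vis`.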

2606.00563 2026-06-02 cs.LG cs.AI stat.ML

A Practical Upper Bound on Selection Bias Effects in Medical Prediction Models

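Kara Liu, Maggie Wang, Russ B. Altman

Affiliations: Stanford University

AI summary: To address poor generalization caused by selection bias, the paper proposes a new upper bound on worst-case model performance in the target population under the realistic condition that the selection mechanism and the target distribution are only partially observed, and validates its soundness and practical utility on synthetic and real-world data.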

Comments 32 pages, 27 figures, will be published at ACM SIGKDD '26

English abstract

Selection bias is a common and often unavoidable aspect of real-world data that challenges the generalizability of machine learning models. When models trained on biased data are deployed in the broader target population, poor model generalization may lead to real harm, particularly in high-risk settings such as healthcare. This risk highlights the need for practitioners to reliably assess model generalizability prior to deployment. However, existing methods for predicting model performance rely on unrealistic access to the target distribution or knowledge of the selection mechanism causing bias. To address these limitations, we propose a novel upper bound on the worst-case model performance on the target population under the realistic setting where the selection mechanism and the target population data are only partially observed. We demonstrate the validity and practical utility of our method through experiments on fully synthetic data, semi-synthetic data derived from the All of Us Research Program, and real-world selection bias in MIMIC-IV. Our work offers a principled and practical tool to estimate the impact of selection bias in an otherwise intractable setting, thereby enabling practitioners to build safer and more generalizable models in healthcare and beyond.

2606.00562 2026-06-02 cs.CV cs.LG

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

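Dongchen Lu, Zhimo Li, Mao Shu, Huo Cao

Affiliations: Baidu Inc.; Peking University

AI summary: The paper proposes DeepLatent, a framework that generates latent visual states in parallel with LatentFormer and optimizes latent representations with continuous-space reinforcement learning, reaching state-of-the-art performance on multiple benchmarks.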

English abstract

The emerging paradigm of "thinking with images" embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply explicit visual operations but suffer from high latency and restricted manipulation types. Latent reasoning methods autoregressively produce implicit visual states, but underperform tool-assisted methods, and their latent tokens fail to capture effective visual information. In this work, we propose DeepLatent, a parallel framework for latent visual reasoning. First, we introduce LatentFormer. It uses learnable 2D tokens to generate context-conditioned latent states in parallel, anchoring every visual update directly in the original image features. Second, we design a continuous-space reinforcement learning algorithm. It optimizes latent modulation parameters directly in the embedding space, significantly improving latent representation quality. The framework is trained via knowledge distillation followed by this continuous-space RL algorithm. Furthermore, we contribute DeepLatent-180K, a large-scale dataset tailored for latent visual reasoning. Extensive evaluations across multiple benchmarks demonstrate that DeepLatent achieves state-of-the-art performance.

2606.00561 2026-06-02 cs.LG cs.AI

Interpretable Policy Distillation for Power Grid Topology Control

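Aleksandra Dmitruka, Karlis Freivalds

Affiliations: University of Latvia, Faculty of Exact Sciences and Technology

AI summary: The paper distills a deep reinforcement learning policy into lightweight decision-tree and random-forest surrogates, improving interpretability while preserving performance, and reveals a representational shift.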

English abstract

Deep reinforcement learning (RL) offers a promising route to real-time power grid operation, yet large neural policies are costly to evaluate, hard to deploy on constrained hardware, and opaque to operators. We ask whether a Proximal Policy Optimization (PPO) agent for grid topology control can be compressed into compact tree-based surrogates without losing operational performance. A PPO teacher is trained on Grid2Op's standard 14-bus environment with a stability-oriented reward, using stress-focused data collection on critical, high-loading states. The policy is then distilled into a decision tree and a random forest. Across held-out validation episodes, both surrogates exceed the teacher in mean reward and survival length at a fraction of the inference cost. The decision tree shows high exact-action agreement with the PPO argmax and near-complete agreement within its top-ranked actions, while remaining small enough to be inspected directly. Feature-importance analysis reveals a representational shift: the PPO policy relies mainly on line-loading signals, while the distilled tree is driven primarily by bus-topology variables. These results suggest that stress-focused distillation can convert a black-box neural controller into a lightweight, auditable rule-like surrogate suited for real-time deployment, while also surfacing risks tied to deterministic actions and topology-specific generalization.

2606.00559 2026-06-02 cs.LG cs.AI

Richer Representations for Neural Algorithmic Reasoning via Auxiliary Reconstruction

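Jiafu Huang, Chao Peng, Chenyang Xu, Zhengfeng Yang, Kecheng Cai, Chenhao Zhang, Yi Wang, Yiwei Gong, Wanqin Zhou, Irene Zheng

Affiliations: sei.ecnu.edu.cn (East China Normal University)

AI summary: The paper proposes an auxiliary reconstruction module and a self-supervised variant that help the encoder retain input-state information and capture dependencies among features, improving neural algorithmic reasoning performance.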

Comments Appeared at AAAI 2026

English abstract

Neural algorithmic reasoning has emerged as a popular research direction. It aims to train neural networks to mimic the step-by-step behavior of classical rule-based algorithms. More specifically, the execution of such algorithms can be abstracted as a sequence of states, where each state represents the intermediate outcome after an execution step. The training objective is to generate state sequences that replicate the underlying algorithmic process. A common framework for this task adopts an encoder-processor-decoder architecture, where the encoder learns representations of states, the processor simulates algorithmic steps, and the decoder reconstructs output states. While prior work has focused on improving the processor, the role of the encoder in representation learning has received little attention. Most methods rely on simple MLP encoders, raising the question of whether such representations are sufficiently informative for supporting algorithmic reasoning. This paper investigates how to improve encoder representations for neural algorithmic reasoning. We propose a reconstruction module that aims to recover the input state from its encoded representation. This auxiliary reconstruction task encourages the encoder to retain critical information about the input. We demonstrate that incorporating this task during training improves the performance of existing neural architectures on standard benchmarks. Furthermore, we observe that current encoders often underutilize the correlations among features within a state. To address this, we draw inspiration from self-supervised learning and design an enhanced variant of the auxiliary task that encourages the encoder to capture intra-state feature dependencies. Experimental results show that our method enables the encoder to learn richer representations, thereby enhancing the performance of existing processors on algorithmic reasoning tasks.
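
The auxiliary objective described above can be written as a weighted sum of two losses. The sketch below is an illustration under stated assumptions, not the authors' implementation: both losses use mean squared error, `recon_weight` is a hypothetical hyperparameter, and the feature-masking helper is only one guess at what a self-supervised variant that captures intra-state dependencies might look like.

```python
def mse(pred, target):
    """Mean squared error between two equal-length lists of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def total_loss(task_pred, task_target, reconstruction, state, recon_weight=0.1):
    """Task loss plus a weighted auxiliary loss for reconstructing the input state."""
    return mse(task_pred, task_target) + recon_weight * mse(reconstruction, state)


def mask_features(state, masked_indices, fill=0.0):
    """Assumed variant: hide some features so they must be inferred from the rest."""
    hidden = set(masked_indices)
    return [fill if i in hidden else x for i, x in enumerate(state)]
```

In this reading, the encoder would receive `mask_features(state, ...)` while the reconstruction target stays the unmasked `state`, which rewards representations that encode how features within a state relate to each other.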

2606.00557 2026-06-02 cs.LG

Normalized Relevance Measure as a Unifying Framework to Explain Neural Network Latent Structures

Ping Xiong, Thomas Schnake, Grégoire Montavon, Klaus-Robert Müller, Shinichi Nakajima

Affiliations: Berlin Institute for the Foundations of Learning and Data; Machine Learning Group, Technical University of Berlin; Department of Artificial Intelligence, Korea University; Max Planck Institute for Informatics; Department of Chemistry, Chemical Physics Theory Group, University of Toronto; Vector Institute for Artificial Intelligence; Acceleration Consortium, University of Toronto

AI summary: The paper proposes the normalized relevance measure (NRM) framework, which attributes relevance to neurons in arbitrary layers by defining a normalized signed measure, unifies existing propagation-based explanation algorithms, and uses joint multi-layer analysis to reveal information flows in VGG16.

English abstract

To understand how a neural network (NN) functions and makes predictions, it has become increasingly clear that analyzing only the input domain is insufficient -- one must also examine its internal inference mechanisms to capture the complete picture. To explain the internal inference mechanisms of such models, it is essential to analyze the importance of latent representations for a given task. In this paper, we propose the normalized relevance measure (NRM) framework -- a novel general explanation procedure that attributes relevance to arbitrary sets of neurons across layers of arbitrary architectures. In the NRM framework, relevance of selected neurons is explicitly defined as a normalized signed measure, constructed using simple operations -- marginalization and conditioning based on additive and multiplicative laws -- in analogy to probability measures. The normalization property further guarantees comparability across layers. The NRM framework subsumes existing propagation-based explanation algorithms by explicitly identifying the underlying quantity being computed. We demonstrate the utility of the framework in computer vision applications, where joint relevance analysis across multiple layers reveals key information flows in VGG16 networks. Overall, the NRM framework provides a general, mathematically grounded approach to understanding how modern NNs propagate information, offering a versatile and broadly applicable foundation for explainable artificial intelligence.

2606.00556 2026-06-02 cs.CV

Improving Visual Grounding in Remote Sensing via Cluster-Guided Refinement and Model Ensemble Voting

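Panav Shah, Geet Sethi, Ashutosh Gandhe

Affiliations: Indian Institute of Technology Bombay

AI summary: The paper proposes two visual grounding pipelines (SGR and CGR) that combine the remote-sensing model RemoteSAM with the general-purpose segmentation model SAM3, and improves localization accuracy through multi-model ensemble voting.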

Comments Accepted at CVPR 2026 Workshop MORSE

English abstract

Visual grounding aims to locate image regions that correspond to natural language descriptions and is a key component of interpretable vision systems. In remote sensing imagery, grounding is particularly challenging due to complex scenes, small objects, and large variations in scale. Relying on a single model is often insufficient to address these diverse challenges. In this work, we propose two grounding pipelines, Sequential Grounding Refinement (SGR) and Cluster-Aware Grounding Refinement (CGR), that combine the complementary strengths of RemoteSAM, a visual grounding model specialized for remote sensing, and SAM3, a powerful general-purpose segmentation model. Our approach first uses RemoteSAM to obtain an initial estimate of object location, which is then refined using SAM3 to produce more accurate and spatially consistent segmentations. Additionally, we explore an ensemble strategy based on majority voting across six diverse grounding pipelines, each with distinct capabilities. This multi-model framework improves robustness and significantly enhances localization accuracy. Experimental results demonstrate that the proposed pipelines and ensemble approach outperform individual models, leading to more reliable and precise visual grounding predictions.
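
Majority voting across pipelines can be sketched as follows. The abstract does not say whether votes are cast per pixel, per box, or per region, so this example assumes binary segmentation masks of identical shape and a pixel-wise vote; `majority_vote` is a hypothetical helper, not code from the paper.

```python
def majority_vote(masks):
    """Pixel-wise strict majority over binary masks (lists of rows of 0/1)."""
    n = len(masks)
    rows, cols = len(masks[0]), len(masks[0][0])
    return [
        [1 if 2 * sum(mask[r][c] for mask in masks) > n else 0 for c in range(cols)]
        for r in range(rows)
    ]


# Toy 1x3 masks from three hypothetical pipelines.
example_masks = [
    [[1, 1, 0]],
    [[1, 0, 0]],
    [[1, 1, 1]],
]
```

With an even number of voters, such as the six pipelines mentioned above, a 3-3 tie falls to background under this strict-majority rule; a real system would need an explicit tie-breaking policy.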

2606.00548 2026-06-02 cs.CV cs.AI cs.LG

CAFOSat: A Strongly Annotated Dataset for Infrastructure-Aware CAFO Mapping Using High-Resolution Imagery

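Oishee Bintey Hoque, Nibir Chandra Mandal, Mandy L Wilson, Samarth Swarup, Madhav Marathe, Abhijin Adiga

Affiliations: University of Virginia; Biocomplexity Institute, University of Virginia

AI summary: To address the difficulty of mapping Concentrated Animal Feeding Operations (CAFOs) at scale, the paper introduces the CAFOSat dataset, which integrates high-resolution NAIP imagery with multi-source CAFO inventories, refines weakly geolocated records through human-in-the-loop annotation, GradCAM-based localization, and geometric clustering, and adds a synthetic augmentation pipeline, yielding infrastructure-level annotations and robust classification.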

Comments Accepted at CVPR Workshop-2026. The first two authors contributed equally.

English abstract

Concentrated Animal Feeding Operations (CAFOs) play an important role in agricultural production but are also associated with environmental, public health, and disease surveillance concerns. Large-scale mapping of CAFOs from remote sensing imagery remains challenging due to heterogeneous infrastructure layouts, noisy location records, inconsistent annotations, and incomplete inventories. We introduce CAFOSat, a strongly annotated, infrastructure-aware dataset for CAFO mapping across the United States. CAFOSat integrates high-resolution National Agriculture Imagery Program (NAIP) imagery with multi-source CAFO inventories collected across multiple states and transforms weak geolocation records into refined annotations through a human-in-the-loop pipeline combining AI-assisted annotation, GradCAM-based localization, and geometric clustering. To improve dataset quality, we curate challenging negative samples using land-cover-guided sampling with spatial exclusion constraints and provide infrastructure-level annotations, including barns, manure ponds, and grazing-related features, through manual verification. The resulting dataset contains more than 45,000 image patches spanning 20 states and four major CAFO categories. We benchmark a diverse set of convolutional, transformer-based, and vision-language models, demonstrating the value of refined annotations and curated negative samples for CAFO classification and generalization. In addition, we introduce a synthetic augmentation pipeline that generates infrastructure-aware variations to increase training diversity and improve robustness under distribution shifts. CAFOSat provides a large-scale benchmark for advancing infrastructure-aware agricultural monitoring and CAFO mapping from high-resolution remote sensing imagery.

2605.01797 2026-06-02 cs.AI

Neural Decision-Propagation for Answer Set Programming

Thomas Eiter, Katsumi Inoue, Sota Moriyama

Affiliations: Vienna University of Technology (TU Wien); National Institute of Informatics; The Graduate University for Advanced Studies, SOKENDAI

AI summary: The paper proposes decision-propagation (DProp) and its differentiable extension Neural DProp (NDProp), which compute stable models by alternating falsity decisions and truth propagations, improving the scalability and accuracy of neuro-symbolic reasoning.

Comments This is the full version (with appendix) of a paper appearing at the 35th International Joint Conference on Artificial Intelligence (IJCAI-ECAI 2026)

English abstract

Integration of Answer Set Programming (ASP) with neural networks has emerged as a promising tool in Neuro-symbolic AI. While existing approaches extend the capabilities of ASP to real-world domains, their reasoning pipelines depend on classical solvers, which is a bottleneck for scalability. To tackle this problem, we propose a new method to compute stable models, called decision-propagation (DProp), which alternates falsity decisions and truth propagations. Successful DProp computations are shown to capture the stable model semantics. We then develop Neural DProp (NDProp), a differentiable extension of DProp with neural computation for decisions and fuzzy evaluation for propagations. We evaluate the capabilities of NDProp for learning decision heuristics as well as neuro-symbolic integration, and compare it with existing neuro-symbolic approaches. The results show that NDProp can learn to efficiently compute stable models, and it improves accuracy and scalability on neuro-symbolic benchmarks.
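
For readers unfamiliar with the stable model semantics that DProp targets, the textbook check based on the Gelfond-Lifschitz reduct is short enough to show. This is background only and is not the DProp or NDProp procedure: a candidate set of atoms is a stable model of a normal logic program when it equals the least model of the program's reduct with respect to that candidate. The rule encoding `(head, positive_body, negative_body)` and the function names are assumptions made for this sketch.

```python
def least_model(definite_rules):
    """Least model of a negation-free program, computed by forward chaining."""
    model = set()
    changed = True
    while changed:
        changed = False
        for head, positive_body in definite_rules:
            if head not in model and positive_body <= model:
                model.add(head)
                changed = True
    return model


def is_stable_model(rules, candidate):
    """Check a candidate against rules given as (head, pos_body, neg_body) sets."""
    candidate = set(candidate)
    # Reduct: drop rules blocked by the candidate, then drop negative bodies.
    reduct = [
        (head, set(pos)) for head, pos, neg in rules if not (set(neg) & candidate)
    ]
    return least_model(reduct) == candidate


# p :- not q.    q :- not p.
program = [("p", set(), {"q"}), ("q", set(), {"p"})]
```

For this two-rule program the check accepts `{"p"}` and `{"q"}` and rejects both the empty set and `{"p", "q"}`. Enumerating candidates this way is exponential in the number of atoms, which is the kind of cost that dedicated solvers and methods such as DProp aim to avoid.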

2606.00547 2026-06-02 cs.CL

Learning to Retrieve: Dual-Level Long-Term Memory for Text-to-SQL Agents

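Yibo Wang, Nikki Lijing Kuang, Philip S. Yu, Zhewei Yao, Yuxiong He

Affiliations: University of Illinois Chicago; Snowflake AI Research

AI summary: The paper proposes MERIT, a framework that uses reinforcement learning to optimize two-level memory retrieval (global strategy and local decisions), raising the success rate of interactive text-to-SQL agents while reducing interaction turns.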

English abstract

Interactive text-to-SQL agents solve database tasks through multi-turn interactions involving schema exploration, query execution, feedback interpretation, and decision revision. Long-term memory helps agents reuse past experiences, but existing retrieval methods remain limited. Static methods rely on fixed similarity heuristics that do not optimize downstream utility, while dynamic methods often learn from sparse final outcomes and retrieve memories at a single decision horizon. This is insufficient when memory usefulness changes across interaction stages, since memories useful for initial planning may differ from those needed for local, state-conditioned execution. We propose MERIT, a dynamic multi-horizon memory retrieval framework. MERIT maintains episode-level memory for global strategic guidance and turn-level memory for local decision support. Both levels use learned retrieval policies optimized with reinforcement learning. To train turn-level retrieval despite limited intermediate supervision, MERIT uses a lightweight Process Reward Model to provide dense proxy rewards for local memory selection. Experiments on BIRD-Interact show that MERIT outperforms no-memory, static-retrieval, and dynamic-retrieval baselines in success rate while reducing average interaction turns. Transfer results on Spider2-Snow further show positive cross-benchmark transfer without benchmark-specific tuning. These results suggest that multi-horizon retrieval improves experience reuse in interactive text-to-SQL agents.

2606.00545 2026-06-02 cs.LG

The Assistant as a Privileged Persona: A canonical reference in cross-persona self-recognition

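Asvin G

Affiliations: Institute for Advanced Study, Princeton

AI summary: The paper studies how post-trained language models judge authorship across personas and finds that the Assistant persona serves as the canonical reference: its entropy signal and persona-vector distance are tightly coupled, and this coupling holds only for the Assistant persona.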

Comments Project out of Anthropic Fellows

English abstract

Post-trained language models can recognize their own outputs from a sentence or two out of context. In a companion paper \citep{jack2026twomodes} we showed they can also recognize when they are currently acting on-policy, through the sharp entropy drop of assistant-mode generation. Both signals are tied to the Assistant persona that post-training mainly shapes. This paper widens the frame to cross-persona authorship judgement on Llama-3.1-70B-Instruct. We measure a matrix of authorship claim rates over a panel of evaluator and generator personas spanning librarian to dragon to Shakespeare, and make two claims. First, on the Assistant's own row of the matrix, the Assistant's claim rate, the persona-vector distance from the Assistant in activation space, and the entropy gap between the Assistant's surprise on a persona's text and the persona's surprise on its own text are all tightly coupled. This extends the entropy signature of acting from the companion paper to a retrospective signature of having acted. Second, this coupling fails off the Assistant's row: the natural symmetric extension of the entropy gap does not predict authorship for distinctive evaluators (pirate, dragon, Shakespeare); what does is asymmetric -- the evaluator's surprise compared to the Assistant's surprise on the same text, not to the generator's. We rule out the alternative that any persona could play this reference role by trying many candidate substitutes; none does. We interpret the asymmetry as the model performing an implicit Bayesian likelihood-ratio test against the Assistant as the canonical alternative hypothesis, with the persona-vector geometry of \citet{chen2025persona} (every persona a delta off the Assistant) ensuring that the Assistant is the only persona universally accessible to that test.

2606.00544 2026-06-02 cs.LG cs.CL

Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization

逃离模式抽彩:多响应训练提升语言模型泛化能力

Hasan Amin, Kian Ahrabian, Ming Yin, Rajiv Khanna

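Affiliations: Department of Computer Science, Purdue University

AI summary: This paper proposes multi-response training (MRT), which retains multiple valid responses per prompt to mitigate the "mode lottery" caused by conventional single-response fine-tuning, and gives a statistical account of why and when it improves distributional generalization.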

详情
AI中文摘要

现代语言模型微调通常为每个提示配对单个响应,尽管许多提示允许多个有效补全。这实际上将多模态条件分布简化为单样本视图,我们称之为“模式抽彩”现象,其中训练强调一部分合理模式而忽略其他模式。我们研究了多响应训练(MRT),该方法保留每个提示的多个响应,并建立了关于何时以及为何有帮助的原则性解释。我们的关键见解是,提示和响应是不同的统计资源:额外的提示减少输入分布的不确定性,而额外的响应减少条件输出分布的不确定性。这产生了方差-预算权衡,预测了何时保留多个响应是有价值的,显示了随着提示级不确定性占主导地位而收益递减,并解释了为什么大型冗余语料库可以表现出隐式的多响应效应。我们进一步分析了响应选择,并表明Random-K-of-N是分布微调的无偏默认选择,基于奖励的选择可能导致模式坍缩,而子模质量-多样性目标提供了一种具有理论保证的高效替代方案。受控模拟验证了预测的方差和选择效应,包括一个惊人的失败模式,其中仅奖励选择产生的梯度与真实目标不一致。在结构化和真实世界数据集上,包括一个新的多提示、多响应基准,MRT一致地改善了分布泛化,在响应多样性高、提示冗余性低的场景中收益最大。MRT将响应多重性重新定义为数据分配问题,并提供了明确的指导:当响应廉价且多样时,保留多个响应不是启发式方法,而是基于统计的选择。

英文摘要

Modern language-model fine-tuning typically pairs each prompt with a single response, even though many prompts admit multiple valid completions. This effectively reduces a multi-modal conditional distribution to a one-sample view, a phenomenon we call the "mode lottery," where training emphasizes a subset of plausible modes while leaving others underrepresented. We study multi-response training (MRT), which retains multiple responses per prompt, and develop a principled account of when and why it helps. Our key insight is that prompts and responses are distinct statistical resources: additional prompts reduce uncertainty about the input distribution, while additional responses reduce uncertainty about the conditional output distribution. This yields a variance-budget tradeoff that predicts when retaining multiple responses is worthwhile, shows diminishing returns as prompt-level uncertainty dominates, and explains why large redundant corpora can exhibit an implicit multi-response effect. We further analyze response selection, and show that Random-K-of-N is the unbiased default for distributional fine-tuning, reward-based selection can induce mode collapse, and a submodular quality-diversity objective provides an efficient alternative with theoretical guarantees. Controlled simulations validate the predicted variance and selection effects, including a striking failure mode where reward-only selection produces gradients misaligned with the true objective. Across structured and real-world datasets, including a new multi-prompt, multi-response benchmark, MRT consistently improves distributional generalization, with the largest gains in high response-diversity, low prompt-redundancy regimes. MRT reframes response multiplicity as a data-allocation problem with clear guidance: when responses are cheap and diverse, keeping more than one is not a heuristic, but a statistically grounded choice.
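The Random-K-of-N selection named in the abstract can be illustrated with a minimal sketch. The abstract does not describe the implementation, so the function names `random_k_of_n` and `build_mrt_dataset`, the dictionary layout of the data, and the seeding are all assumptions made for illustration.

```python
import random


def random_k_of_n(responses, k, rng):
    """Uniformly pick k of the available responses without replacement.

    If fewer than k responses exist, all of them are kept.
    """
    if k >= len(responses):
        return list(responses)
    return rng.sample(responses, k)


def build_mrt_dataset(prompt_to_responses, k, seed=0):
    """Expand {prompt: [responses]} into (prompt, response) training pairs.

    Hypothetical helper: each prompt contributes up to k responses, chosen
    uniformly at random rather than by a reward score.
    """
    rng = random.Random(seed)
    pairs = []
    for prompt, responses in prompt_to_responses.items():
        for response in random_k_of_n(responses, k, rng):
            pairs.append((prompt, response))
    return pairs
```

Uniform sampling is used here because the abstract describes Random-K-of-N as the unbiased default; a reward-ranked top-k would be the selection rule the abstract warns can induce mode collapse.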

2606.00543 2026-06-02 cs.CV

ETC: Extreme Token Compression via Task-aware Visual Information Distillation in VLMs

ETC: 通过任务感知的视觉信息蒸馏实现视觉语言模型中的极端令牌压缩

Yiling Gao, Hongchen Wei, Zhenzhong Chen

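Affiliations: School of Remote Sensing and Information Engineering, Wuhan University

AI summary: Proposes the ETC framework, which follows the principle of variational information distillation to minimize task loss while reducing the number of input tokens. It weights visual features with text-to-image cross-attention and adds variational information distillation, retaining strong task performance even under single-token compression.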

详情
AI中文摘要

在视觉语言模型(VLM)中,高分辨率图像会产生大量视觉令牌,导致推理时的高计算成本和KV缓存开销。为解决此问题,我们提出极端令牌压缩(ETC)框架,基于变分信息蒸馏原理,在减少输入令牌数量时最小化任务损失。具体而言,从信息论角度,我们表明最小化任务损失需要紧凑表示保留用于预测的指令感知充分统计量。在实践中,ETC利用文本-图像交叉注意力加权原始视觉特征以近似潜在的指令感知预测统计量。此外,ETC引入变分信息蒸馏,使紧凑表示保留必要信息以恢复该预测统计量。在LLaVA-1.5-7B和Qwen3-VL-2B上的实验表明,即使在单令牌压缩下,ETC仍保持有效性,大幅减少KV缓存开销同时保留强任务性能。

英文摘要

In Vision-Language Models (VLMs), high-resolution images produce a large number of visual tokens, resulting in high computational costs and KV-cache overhead during inference. To address this problem, we propose an Extreme Token Compression (ETC) framework that minimizes task loss when reducing the number of input tokens based on the principle of variational information distillation. Specifically, from an information-theoretic perspective, we show that minimizing task loss requires the compact representation to preserve the instruction-aware sufficient statistic of the task-relevant visual information for prediction. In practice, ETC leverages text-to-image cross-attention to weight the original visual features to approximate the latent instruction-aware predictive statistic. Moreover, ETC introduces a variational information distillation, enabling the compact representation to preserve the essential information to recover this predictive statistic. Experiments on LLaVA-1.5-7B and Qwen3-VL-2B show that ETC remains effective even under single-token compression, substantially reducing KV-cache overhead while retaining strong task performance.

2606.00539 2026-06-02 cs.LG math.OC stat.ML

GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

GNMR: 低精度大语言模型训练的运行时稳定性控制

Boao Kong, Weichen Jia, Engao Zhang, Guohong Li, Yonghan Dong, Yao Wang, Yaoyuan Wang, Yunke Peng, Kun Yuan

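Affiliations: Peking University; Huawei Technologies Ltd.

AI summary: To address the stability bottleneck in low-precision language model training, proposes a lightweight runtime controller based on the Gradient Norm-to-Mean Ratio (GNMR) that maps local risk signals to bounded recovery actions, improving training stability without changing the numerical format or backend.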

Comments 29 pages, 4 figures, 15 tables

详情
AI中文摘要

训练稳定性是低精度语言模型训练的关键瓶颈:高效的低成本路径仍可能在少量算子处产生短暂的数值风险。我们将此问题形式化为运行时稳定性控制,并提出梯度范数与历史均值之比(GNMR),一种轻量级控制器,将每个可恢复单元的当前梯度范数与其历史均值进行比较。结合用于检测短窗口内突增的$Δ$-GNMR,GNMR在硬$\mathrm{maxO}$预算和短锁定间隔下将局部风险信号映射到有界恢复动作,而不改变数值格式、内核或后端方案。在激活量化压力测试、DeepSeek风格的配方级训练以及LLaMA-2 13B微调中,GNMR以稀疏且预算受限的恢复保持了高保真质量。这些结果支持GNMR作为一种与后端无关的控制器,在保持低成本执行的同时提高低精度训练的稳定性。

英文摘要

Training stability is a key bottleneck in low-precision language model training: efficient low-cost paths can still produce short-lived numerical risks at a small set of operators. We formulate this as runtime stability control and present Gradient Norm-to-Mean Ratio (GNMR), a lightweight controller that compares each recoverable unit's current gradient norm with its historical mean. Together with $Δ$-GNMR for abrupt short-window increases, GNMR maps local risk signals to bounded recovery actions under a hard $\mathrm{maxO}$ budget and a short lock interval, without changing the numerical format, kernel, or backend recipe. Across activation-quantization stress, DeepSeek-style recipe-level training, and LLaMA-2 13B fine-tuning, GNMR preserves high-fidelity quality with sparse, budgeted recovery. These results support GNMR as a backend-agnostic controller to improve low-precision training stability while preserving low-cost execution.
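The controller described above compares each unit's current gradient norm with its historical mean and reacts within a budget and a lock interval. The sketch below is one possible reading, not the paper's implementation: the thresholds, the running-mean update, the interpretation of the budget as a cap on simultaneously recovering units (`max_active`), and the class name are all assumptions.

```python
class GNMRController:
    """Toy runtime controller based on a gradient norm-to-mean ratio."""

    def __init__(self, ratio_threshold=4.0, delta_threshold=2.0,
                 max_active=2, lock_steps=3):
        self.ratio_threshold = ratio_threshold  # assumed trigger on the ratio
        self.delta_threshold = delta_threshold  # assumed trigger on its jump
        self.max_active = max_active            # assumed budget on recoveries
        self.lock_steps = lock_steps            # steps a recovery stays active
        self.mean = {}          # unit -> running mean of gradient norms
        self.count = {}         # unit -> number of observations
        self.prev_ratio = {}    # unit -> ratio seen at the previous step
        self.locked_until = {}  # unit -> step index at which recovery ends

    def step(self, step_idx, grad_norms):
        """Return the set of units that should use the recovery path."""
        active = {u for u, end in self.locked_until.items() if step_idx < end}
        for unit, norm in grad_norms.items():
            mean = self.mean.get(unit)
            if mean is not None and mean > 0:
                ratio = norm / mean
                delta = ratio - self.prev_ratio.get(unit, 1.0)
                risky = (ratio > self.ratio_threshold
                         or delta > self.delta_threshold)
                if risky and unit not in active and len(active) < self.max_active:
                    self.locked_until[unit] = step_idx + self.lock_steps
                    active.add(unit)
                self.prev_ratio[unit] = ratio
            # Update the running mean after the risk check.
            n = self.count.get(unit, 0) + 1
            self.count[unit] = n
            self.mean[unit] = norm if mean is None else mean + (norm - mean) / n
        return active
```

What a "recovery action" does (for example, rerunning a unit in higher precision) is left to the caller, since the abstract only states that actions are bounded and do not change the numerical format, kernel, or backend recipe.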

2606.00537 2026-06-02 cs.RO

PACE: Phase-Aware Chunk Execution for Robot Policies with Action Chunking

PACE: 面向动作分块策略的相位感知分块执行方法

Junnan Nie, Jiayi Li, Jiachen Zhang, Junyi Lao, Chenghao Liu, Tianle Zhang, Songfang Huang

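Affiliations: Peking University; JD Explore Academy

AI summary: Proposes PACE, which predicts low-speed transition points within an action chunk online and uses them as replanning boundaries, adaptively choosing the execution horizon and raising robot policy success rates without retraining.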

Comments 21 pages, 7 figures, 6 tables. Preprint

详情
AI中文摘要

最近的视觉-语言-动作和基于扩散的机器人策略通常使用动作分块,其中每次策略查询预测一系列未来动作,机器人执行一个开环前缀后再重新查询。虽然这种接口改善了局部运动连续性,但部署时仍需选择执行步长:在获取新观测之前应执行每个预测分块的多少。然而,我们的实验表明,成功率强烈依赖于任务且相对于执行步长非单调,这使得单一恒定步长成为不可靠的部署规则。我们提出PACE(相位感知分块执行),一种无需训练的测试时执行方法,从预测分块本身在线选择执行步长。PACE通过识别预测速度剖面中的低速过渡点,利用操作轨迹的相位相关运动学结构,将其作为候选重规划边界。由于PACE仅使用预测的动作分块,因此即插即用,无需重新训练或访问策略内部。我们通过在仿真和真实机器人环境中的大规模评估验证了PACE。在50个RoboTwin2.0任务上,PACE将平均成功率从57.8%提升至64.2%。在双臂ALOHA和单臂Franka平台上的真实机器人实验中,PACE将平均任务得分从60.7提升至77.7,平均成功率从50.7%提升至70.4%。消融实验和轨迹级分析表明,PACE跨操作阶段自适应调整执行步长,在过渡附近缩短执行,同时在连贯运动中保持较长执行。

英文摘要

Recent vision-language-action and diffusion-based robot policies often use action chunking, where each policy query predicts a sequence of future actions and the robot executes an open-loop prefix before re-querying. While this interface improves local motion continuity, deployment still requires choosing the execution horizon: how much of each predicted chunk should be executed before acquiring a new observation. However, our experiments show that success is strongly task-dependent and non-monotonic with respect to the execution horizon, making a single constant horizon an unreliable deployment rule. We propose PACE (Phase-Aware Chunk Execution), a training-free test-time execution method that selects the execution horizon online from the predicted chunk itself. PACE exploits the phase-dependent kinematic structure of manipulation trajectories by identifying low-speed transition points in the predicted speed profile and using them as candidate replanning boundaries. Because PACE uses only the predicted action chunk, it is plug-and-play and requires no retraining or access to policy internals. We validate PACE through large-scale evaluations in both simulation and real-robot settings. On 50 RoboTwin2.0 tasks, PACE raises the average success rate from 57.8% to 64.2%. In real-robot experiments on bimanual ALOHA and single-arm Franka platforms, PACE improves the average task score from 60.7 to 77.7 and the average success rate from 50.7% to 70.4%. Ablations and rollout-level analyses show that PACE adapts execution horizons across manipulation phases, shortening near transitions while preserving longer execution during coherent motion.
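The idea of choosing the execution horizon from low-speed points in the predicted chunk can be sketched as follows. This is an illustrative simplification under stated assumptions: actions are treated as position-like tuples, speed is the distance between consecutive actions, and the `low_speed_ratio` and `min_horizon` parameters are hypothetical, not values from the paper.

```python
import math


def speed_profile(chunk):
    """Per-step speed: Euclidean distance between consecutive actions."""
    return [math.dist(a, b) for a, b in zip(chunk, chunk[1:])]


def select_horizon(chunk, min_horizon=2, low_speed_ratio=0.3):
    """Return how many actions of the chunk to execute before re-querying.

    Picks the first interior local minimum of the speed profile that is slow
    relative to the chunk's peak speed; otherwise executes the whole chunk.
    """
    speeds = speed_profile(chunk)
    if not speeds:
        return len(chunk)
    peak = max(speeds)
    if peak == 0:
        return len(chunk)
    for i in range(1, len(speeds) - 1):
        is_local_min = speeds[i] <= speeds[i - 1] and speeds[i] <= speeds[i + 1]
        is_slow = speeds[i] <= low_speed_ratio * peak
        if is_local_min and is_slow and i + 1 >= min_horizon:
            return i + 1  # execute actions 0..i, then replan at the slow point
    return len(chunk)
```

Because the function only reads the predicted chunk, it needs no access to policy internals, which mirrors the plug-and-play property claimed in the abstract.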

2606.00535 2026-06-02 cs.LG

DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation

DREAM-S: 基于可搜索草稿与目标感知精炼的推测解码用于多模态生成

Zining Liu, Yunhai Hu, Tianhua Xia, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang

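Affiliations: New York University; Cerebras Systems Inc.; University of Pennsylvania

AI summary: Proposes the DREAM-S framework, which uses neural architecture search and target-aware supernet training to automatically optimize the draft model architecture and interaction strategy, combined with adaptive intermediate feature distillation guided by attention entropy, for efficient speculative decoding in vision-language models, with speedups of up to 3.85x.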

详情
AI中文摘要

推测解码(SD)已被证明是加速大型语言模型(LLM)自回归生成的有效技术,然而其在视觉语言模型(VLM)中的应用仍相对未被探索。我们提出 DREAM-S,一个专门为VLM中快速高效解码设计的新型SD框架。DREAM-S利用神经架构搜索(NAS)框架与目标感知超网训练,自动识别草稿模型与目标模型之间的最优交互策略,以及最适合底层硬件实现平台的草稿模型架构。此外,DREAM-S还结合了由注意力熵引导的自适应中间特征蒸馏,以实现高效的草稿训练。在一系列成熟的VLM上的实验表明,与标准解码方法相比,DREAM-S实现了高达$3.85\times$的加速,并显著优于现有的SD基线。代码已公开:https://github.com/SAI-Lab-NYU/DREAM-S 。

英文摘要

Speculative decoding (SD) has proven to be an effective technique for accelerating autoregressive generation in large language models (LLMs); however, its application to vision-language models (VLMs) remains relatively unexplored. We propose DREAM-S, a novel SD framework designed specifically for fast and efficient decoding in VLMs. DREAM-S leverages a neural architecture search (NAS) framework with target-aware supernet training to automatically identify both the optimal interaction strategy between the draft and target models, and the most suitable draft model architecture for the underlying hardware implementation platform. DREAM-S additionally incorporates adaptive intermediate feature distillation, guided by attention entropy, to enable efficient draft training. Experiments on a range of well-established VLMs show that DREAM-S achieves up to a $3.85\times$ speedup compared to standard decoding approaches and significantly outperforms existing SD baselines. The code is publicly available at: https://github.com/SAI-Lab-NYU/DREAM-S

2606.00532 2026-06-02 cs.AI

KACE: Knowledge-Adaptive Context Engineering for Mathematical Reasoning

KACE: 知识自适应上下文工程用于数学推理

Jayant Parashar, Suchendra M. Bhandarkar

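Affiliations: School of Computing, University of Georgia

AI summary: Proposes KACE, which uses a difficulty- and domain-stratified knowledge base together with tiered self-consistency to address context bloat in mathematical reasoning, reaching 62.2% accuracy on AIME 2025.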

Comments 9 pages, 1 figure, 6 tables

详情
AI中文摘要

上下文工程可以在不更新权重的情况下改进大型语言模型,但数学推理暴露了一个关键限制:在一个不断增长的提示中累积的反馈会导致上下文膨胀,并限制了可使用的学习指导量。现有方法常常混淆存储(跨运行学习的内容)与使用(针对特定问题包含的内容),因此继承了这种提示大小上限。我们引入了知识自适应上下文工程(KACE),通过基于难度和领域的组织将存储与使用分离。离线时,一个自我反思的学习循环将训练轨迹提炼成认知树:一个按问题难度和认知领域分层的类型化卡片知识库。每张卡片被分配到与其起源失败对应的难度-领域节点。在评估时,具有每层一致性门控的分层自一致性将每个问题动态分类为简单、中等或困难。简单问题无需检索卡片即可退出,而较难的问题仅检索树的匹配分支。这种分层方案在计算量相当的情况下匹配或超过Best-of-N,并以78%的成对一致性对问题难度进行分类。主要的实证贡献是通过分层自一致性构建和使用了一个难度和领域分层的知识库。在AIME 2025上,KACE达到了62.2%的准确率,在可比的求解器调用预算下,比固定的Best-of-5自一致性绝对提高了10.4个百分点,比最强的学习上下文基线Tiered + GEPA提高了5.6个百分点。我们还在MATH-HARD和OlymMATH的可验证子集上观察到一致的提升。

英文摘要

Context engineering can improve large language models without updating their weights, but mathematical reasoning exposes a key limitation: feedback accumulated in one growing prompt causes context bloat and limits the amount of learned guidance that can be used. Existing methods often conflate storage, what is learned across runs, with usage, what is included for a particular problem, and therefore inherit this prompt-size ceiling. We introduce Knowledge-Adaptive Context Engineering (KACE), which separates storage from usage through difficulty- and domain-based organization. Offline, a self-reflective learning loop distills training traces into an epistemic tree: a knowledge base of typed cards stratified by problem difficulty and epistemic domain. Each card is assigned to the difficulty-domain node corresponding to the failure from which it originated. At evaluation time, tiered self-consistency with per-tier agreement gates dynamically classifies each problem as easy, medium, or hard. Easy problems exit without retrieved cards, while harder problems retrieve only the matching branch of the tree. This tiered scheme matches or exceeds Best-of-N while using comparable compute, and it classifies problem difficulty with 78 percent pairwise concordance. The main empirical contribution is the construction and use of a difficulty- and domain-stratified knowledge base enabled by tiered self-consistency. On AIME 2025, KACE achieves 62.2 percent accuracy, a 10.4-point absolute gain over fixed Best-of-5 self-consistency at a comparable solver-call budget and a 5.6-point gain over the strongest learned-context baseline, Tiered + GEPA. We also observe consistent gains on MATH-HARD and the verifiable subset of OlymMATH.
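The tiered self-consistency with per-tier agreement gates described above can be sketched roughly as below. The solver and retrieval callables (`solve`, `retrieve_cards`), the sample counts, and the gate values are hypothetical placeholders; the abstract does not give the actual settings.

```python
from collections import Counter


def majority(answers):
    """Return the most common answer and the fraction of votes it received."""
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / len(answers)


def tiered_self_consistency(problem, solve, retrieve_cards,
                            samples_per_tier=(3, 3, 5), gates=(1.0, 0.66)):
    """Classify a problem as easy/medium/hard while answering it.

    solve(problem, cards) and retrieve_cards(problem, tier) are assumed
    interfaces supplied by the caller.
    """
    # Tier 1: no retrieved cards; exit early if the samples agree.
    answers = [solve(problem, []) for _ in range(samples_per_tier[0])]
    answer, agreement = majority(answers)
    if agreement >= gates[0]:
        return answer, "easy"

    # Tier 2: retrieve only the medium branch of the knowledge base.
    cards = retrieve_cards(problem, "medium")
    answers = [solve(problem, cards) for _ in range(samples_per_tier[1])]
    answer, agreement = majority(answers)
    if agreement >= gates[1]:
        return answer, "medium"

    # Tier 3: fall through to the hard branch with a larger sample budget.
    cards = retrieve_cards(problem, "hard")
    answers = [solve(problem, cards) for _ in range(samples_per_tier[2])]
    answer, _ = majority(answers)
    return answer, "hard"
```

Easy problems spend only the first tier's samples and never touch the card store, which is the separation of storage from usage that the abstract emphasizes.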

2606.00523 2026-06-02 cs.CL

ProactiveLLM: Learning Active Interaction for Streaming Large Language Models

ProactiveLLM: 学习流式大语言模型的主动交互

Junlong Tong, Yao Zhang, Anhao Zhao, Yingqi Fan, Yunpu Ma, Xiaoyu Shen

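Affiliations: University of Science and Technology of China

AI summary: Proposes ProactiveLLM, which uses the model's endogenous states to guide interaction decisions and achieves active interaction through mask-based streaming modeling and synchronized privileged self-distillation, reducing latency while maintaining quality.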

Comments ICML 2026

详情
AI中文摘要

标准大语言模型(LLMs)遵循先读取后生成的范式,导致不必要的延迟和计算。流式LLMs通过在接收输入的同时生成来缓解这一问题,但仍难以决定何时与流交互。现有方法要么硬编码交互时机,要么依赖昂贵的外部对齐信号,如时间标签、推理轨迹或更强的教师模型。本文提出ProactiveLLM,通过利用模型内生状态来指导交互决策,实现主动交互。模型首先通过两种互补的训练机制学习从部分输入中感知语义充分性:基于掩码的流式建模和同步特权自蒸馏(SPSD)。前者在训练期间对输入应用单调随机掩码,模拟逐步揭示的流式输入,使模型能够从部分输入视角学习局部语义依赖。后者将部分上下文的学生视图与同一演化模型生成的全上下文教师视图对齐,允许特权全上下文证据指导学生在不完整观察下的理解。这些机制共同诱导内生充分性线索,无需外部教师或标注,为多种决策头的即插即用集成提供了通用基础。在文本和语音流式任务上的广泛评估证实,ProactiveLLM在保持质量的同时显著降低了交互延迟,验证了其动态主动交互的能力。代码公开于https://github.com/EIT-NLP/StreamingLLM/tree/main/ProactiveLLM。

英文摘要

Standard Large Language Models (LLMs) follow a read-then-generate paradigm, causing unnecessary latency and computation. Streaming LLMs alleviate this issue by generating while receiving inputs, but still struggle to decide when to interact with the stream. Existing methods either hard-code interaction timing or rely on costly external alignment signals, such as timing labels, reasoning trajectories, or stronger teachers. In this paper, we propose ProactiveLLM, which achieves active interaction by leveraging the model's endogenous states to guide interaction decisions. The model first learns to perceive semantic sufficiency from partial inputs through two complementary training mechanisms: mask-based streaming modeling and synchronized privileged self-distillation (SPSD). The former applies monotonic random masking to the input during training, simulating progressively revealed streaming inputs and enabling the model to learn local semantic dependencies from partial-input views. The latter aligns the partial-context student view with a full-context teacher view generated by the same evolving model, allowing privileged full-context evidence to guide the student's understanding under incomplete observations. Together, these mechanisms induce endogenous sufficiency cues without requiring external teachers or annotations, providing a versatile foundation for the plug-and-play integration of diverse decision heads. Extensive evaluation across text and speech streaming tasks confirms that ProactiveLLM significantly reduces interaction latency while maintaining quality, validating its capacity for dynamic and active interaction. Code is publicly available at https://github.com/EIT-NLP/StreamingLLM/tree/main/ProactiveLLM.

2606.00519 2026-06-02 cs.RO

DriveAnchor: Progressive Anchor-based Flow Learning for Autonomous Driving Planning

DriveAnchor: 用于自动驾驶规划的渐进式基于锚点的流学习

Limin Yan, Haoyun Tang, Yutao Qiu, Hongqing Liu, Haoyu Xu

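Affiliations: Meituan Autonomous Driving; Xi’an Jiaotong University; Beijing Institute of Technology

AI summary: Proposes DriveAnchor, a three-stage framework (Demonstration Flow Pretraining, Guided Flow Post-training, and Reward-Refined Flow Fine-tuning) that achieves behavioral diversity, controllability, and safety. On about 2 million scenarios it reduces near-range collision rates by 89% and improves mean reward by 32%.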

详情
AI中文摘要

我们提出DriveAnchor,一个用于自动驾驶规划的三阶段框架,在可组合流水线中实现行为多样性、可控性和安全性。示范流预训练通过最远点采样构建的2398个轨迹形状词汇表替代无结构高斯先验,在词汇覆盖中结构化地奠定行为多样性基础。引导流后训练联合后训练一个能量场模块与流匹配(FM),仅以静态道路几何为条件,在流生成前将锚点重新定位到用户指定的走廊多边形,无需可微引导即可增加可控性;在第二阶段后,新的走廊预设只需更新能量场,无需重新训练FM。奖励精炼流微调应用零阶强化学习,使每个锚点的输出与避碰目标对齐:由于流匹配模型在单步模式下是确定性前馈网络,每个锚点唯一确定输出轨迹,将奖励优化简化为锚点空间中的方向搜索,无需对数似然计算或ODE到SDE转换。在约200万个保留驾驶场景上的评估表明,DriveAnchor将近距碰撞率降低89%,平均奖励提升32%,且模仿精度不下降,在NVIDIA Drive Orin上推理时间为2.06毫秒。DriveAnchor已通过真实车辆测试验证,确认其适用于生产部署。

英文摘要

We present DriveAnchor, a three-stage framework for autonomous driving planning that achieves behavioral diversity, controllability, and safety in a composable pipeline. Demonstration Flow Pretraining replaces the unstructured Gaussian prior with a vocabulary of 2,398 trajectory shapes constructed by farthest-point sampling, structurally grounding behavioral diversity in vocabulary coverage. Guided Flow Post-training jointly post-trains an Energy Field module with flow matching (FM), conditioning the Energy Field on static road geometry alone, to relocate anchors toward user-specified corridor polygons before flow generation, adding controllability without differentiable guidance; after Stage 2, new corridor presets require only Energy Field updates, not FM retraining. Reward-Refined Flow Fine-tuning applies zeroth-order reinforcement learning to align each anchor's output with collision-avoidance objectives: because the flow-matching model is a deterministic feedforward network in single-step mode, each anchor uniquely determines the output trajectory, reducing reward optimization to a direction search in anchor space without log-likelihood computation or ODE-to-SDE conversion. Evaluated on approximately 2 million held-out driving scenarios, DriveAnchor reduces near-range collision rates by 89% and improves mean reward by 32% without degradation in imitation accuracy, with 2.06 ms inference on NVIDIA Drive Orin. DriveAnchor has been validated through real-world vehicle testing, confirming its practicality for production deployment.
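The anchor vocabulary above is built by farthest-point sampling. A minimal, generic sketch of that sampling step follows; representing each trajectory as a flattened tuple of coordinates and using Euclidean distance are assumptions, since the abstract does not specify the trajectory encoding or metric.

```python
import math


def farthest_point_sampling(trajectories, k, start=0):
    """Greedily select indices of k mutually distant trajectories.

    trajectories: list of equal-length tuples of floats (assumed encoding).
    """
    chosen = [start]
    # Distance from every trajectory to its nearest already-chosen one.
    min_dist = [math.dist(t, trajectories[start]) for t in trajectories]
    while len(chosen) < min(k, len(trajectories)):
        nxt = max(range(len(trajectories)), key=lambda i: min_dist[i])
        if min_dist[nxt] == 0:
            break  # only duplicates of chosen trajectories remain
        chosen.append(nxt)
        for i, t in enumerate(trajectories):
            min_dist[i] = min(min_dist[i], math.dist(t, trajectories[nxt]))
    return chosen
```

Each new pick maximizes its distance to the set already chosen, so the resulting vocabulary spreads over the space of shapes instead of concentrating on the most frequent ones.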

2606.00518 2026-06-02 cs.AI

Acting with AI: An Interaction-Based Framework for Agentic Tort Liability

与AI行动:基于交互的代理侵权责任框架

Yiheng Yao

发表机构 * Yiheng Yao(姚艺恒)

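AI summary: Drawing on Bratman's planning theory and the common law's principles of human-human concerted action, this paper proposes an interaction taxonomy (autonomous drift, pure tool use, collaborative planning) for allocating tort liability for agentic AI systems, and introduces a "Reasonable Agent" standard.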

详情
AI中文摘要

代理AI系统能够多步规划、使用工具并随时间执行任务。当此类系统造成损害时,侵权法难以分配责任,因为有害路径可能既非用户完全选择,也非开发者特别预见。本文借鉴Michael Bratman的规划理论和普通法对人类协同行动的处理,提出一个基于交互的代理侵权框架。我们区分三种交互类型:自主漂移、纯工具使用和协作规划。纯工具案例仍受普通产品缺陷和警告原则管辖;协作规划案例映射到独立承包商控制测试、专业过失和过失性虚假陈述;自主漂移则映射到雇主责任下的“擅自行动”和严格产品责任。该框架将有状态交互日志作为主要证据线索,使法院能够推断人-AI轨迹何时偏离授权行为以及责任应归于何处。我们解决了四个事件锚定案例,将该观点与严格责任和基于保险的提案并列,指出其与监管监督的关系,并提出了一个围绕约束验证、认知透明度、运行时基础和取证日志构建的“合理代理”标准。

英文摘要

Agentic AI systems can plan over multiple steps, use tools, and execute tasks over time. When such systems cause harm, tort law struggles to allocate responsibility because the harmful path may be neither fully chosen by the user nor specifically foreseen by the developer. This paper proposes an interaction-based framework for agentic torts, drawing on Michael Bratman's planning theory and on the common law's treatment of human-human concerted action. We distinguish three interaction types: autonomous drift, pure tool use, and collaborative planning. Pure tool cases remain governed by ordinary product-defect and warning doctrines; collaborative planning cases map onto the independent contractor control test, professional malpractice, and negligent misrepresentation; autonomous drift maps onto frolic and detour under respondeat superior and strict product liability. The framework treats the stateful interaction log as the primary evidentiary trace, allowing courts to infer where the human-AI trajectory departed from the authorized undertaking and where liability should attach. We resolve four incident-anchored cases, situate the account alongside strict-liability and insurance-based proposals, note its relationship to regulatory oversight, and propose a "Reasonable Agent" standard built around constraint verification, epistemic transparency, runtime grounding, and forensic logging.

2606.00516 2026-06-02 cs.AI

Threshold-Based Exclusive Batching for LLM Inference

基于阈值的独占批处理用于LLM推理

Weifang Zhang, Yuzhou Nie, Bowen Pang, Guangrui Ma, Shining Wu

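Affiliations: The Hong Kong Polytechnic University; Engineering Center for Space Utilization, Chinese Academy of Sciences

AI summary: Prefill-decode interference in mixed batching (MB) raises marginal cost. The paper derives an EB-MB performance crossover condition and optimal switching thresholds in terms of GPU memory bandwidth, model size, and workload; optimized exclusive batching (EB) achieves up to 41.9% higher throughput on bandwidth-constrained GPUs.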

Comments 37 pages, 12 figures. Accepted at ICML 2026

详情
AI中文摘要

混合批处理(MB)——将预填充和解码交错在单个批次中——已成为大型语言模型(LLM)推理的标准调度策略,因其在最大化计算和内存利用率方面的效率。然而,通过受控实验,我们发现预填充-解码干扰使MB的每步边际成本高于纯解码。在高带宽H200(4.8 TB/s)上,这仅在解码token超过批次的80%时发生;然而,在带宽受限的RTX PRO 6000(1.792 TB/s)上,该阈值骤降至仅20%。因此,MB与独占批处理(EB)之间的最优选择根本上取决于GPU内存带宽、模型大小和工作负载组成。我们推导了该EB-MB性能交叉的闭式条件,以及渐近最优的相位切换阈值和EB的内存安全批次大小。优化的EB在带宽受限GPU上吞吐量提升高达41.9%,而MB在具有更大模型的高带宽硬件上保持优势。我们的混合调度器EB+在线应用该条件,在无需人工干预的情况下动态切换EB和MB。在分布或并发度变化的非平稳流量下,EB+在每个设置中达到最高或接近最高的吞吐量,比MB高出高达36.4%。

英文摘要

Mixed batching (MB), which interleaves prefill and decode in a single batch, has become the standard scheduling strategy for large language model (LLM) inference due to its efficiency in maximizing compute and memory utilization. However, through controlled experiments, we find that prefill-decode interference inflates MB's per-step marginal cost above that of pure decode. On the high-bandwidth H200 (4.8 TB/s), this occurs only when decode tokens exceed 80% of the batch; on the bandwidth-constrained RTX PRO 6000 (1.792 TB/s), this threshold plummets to just 20%. Consequently, the optimal choice between MB and exclusive batching (EB) fundamentally depends on GPU memory bandwidth, model size, and workload composition. We derive a closed-form condition for this EB-MB performance crossover, along with asymptotically optimal phase-switching thresholds and memory-safe batch sizing for EB. Optimized EB achieves up to 41.9% higher throughput on bandwidth-constrained GPUs, while MB retains its advantage on high-bandwidth hardware with larger models. Our hybrid scheduler EB+ applies this condition online to dynamically switch between EB and MB without manual intervention. Under non-stationary traffic with distribution or concurrency shifts, EB+ attains the highest or near-highest throughput in every setting, outperforming MB by up to 36.4%.

2606.00515 2026-06-02 cs.RO cs.AI cs.SY eess.SY

PaCo-VLA: Passivity-Shielded Compliance Prior for Contact-Rich Vision-Language-Action Manipulation

PaCo-VLA: 用于富接触视觉-语言-动作操控的被动屏蔽柔顺先验

Haofan Cao, Zhaoyang Li, Zhichao You, Liang Guo, Tianrui Li

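Affiliations: Southwest Jiaotong University; University of Leeds

AI summary: Proposes the PaCo-VLA framework, which converts VLA model outputs into task-level compliance proposals behind a passivity shield, using energy-tank accounting and boundary checks to keep invalid predictions from bypassing low-level contact physics, for safe and precise contact-rich manipulation.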

Comments Under review, code will be available soon

详情
AI中文摘要

富接触操控既需要高层语义推理,也需要对高频接触动态的安全调节。虽然视觉-语言-动作(VLA)模型提供了前所未有的语义泛化能力,但其低速率输出缺乏在力敏感任务中直接控制执行器所需的可靠性。为弥合这一语义到控制的鸿沟,我们引入PaCo-VLA,一种被动屏蔽的柔顺先验,重新定义了VLA接口。PaCo-VLA不将直接电机指令托付给VLA,而是将网络输出视为任务级柔顺建议:语义绑定、任务阶段和导纳调度。一个高频、建议无关的被动屏蔽通过能量罐核算和边界检查来管理这些建议,防止无效、过时或未经验证的模型预测绕过底层接触物理。这种解耦架构还支持因果评估,将语义贡献与几何捷径分离。大量仿真和真实世界的连接器插入实验表明,PaCo-VLA在无屏蔽VLA基线上实现了卓越的精度,即使在对抗性柔顺偏移下也能保持零被动违规。该框架在导纳端口建立了一个可证明的采样被动运行时契约,并为在富接触领域部署基础模型提供了运行时接口。

英文摘要

Contact-rich manipulation demands both high-level semantic reasoning and the safe regulation of high-frequency contact dynamics. While Vision-Language-Action (VLA) models provide unprecedented semantic generalization, their low-rate outputs lack the reliability required for direct plant authority in force-sensitive tasks. To bridge this semantic-to-control gap, we introduce PaCo-VLA, a passivity-shielded compliance prior that recasts the VLA interface. Rather than trusting VLAs with direct motor commands, PaCo-VLA treats network outputs as task-level compliance proposals: semantic bindings, task stages, and admittance schedules. A high-frequency, proposal-independent passivity shield governs these proposals through energy-tank accounting and boundary checks, preventing invalid, stale, or unverified model predictions from bypassing low-level contact physics. This decoupled architecture also enables causal evaluation, isolating semantic contributions from geometric shortcuts. Extensive simulated and real-world connector-insertion experiments demonstrate that PaCo-VLA achieves superior precision over unshielded VLA baselines, sustaining zero passivity violations even under adversarial compliance shifts. This framework establishes a provably sampled-passive runtime contract at the admittance port and provides a runtime interface for deploying foundation models in contact-rich domains.

2606.00514 2026-06-02 cs.LG cs.CV

Generate in Reconstruction Space, Match in Semantic Space: Transport Geometry for One-Step Generation

在重建空间中生成,在语义空间中匹配:一步生成的传输几何

Hugues Van Assel, Edward De Brouwer, Saeed Saremi, Gabriele Scalia, Aviv Regev

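Affiliations: Genentech

AI summary: Studies the role of self-supervised representation learning (SSL) features in one-step generative models, proposes distribution matching with the Sinkhorn divergence in a semantic feature space, substantially lowers ImageNet FID, and reveals a potential conflict between evaluation metrics and training features.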

Comments 26 pages, 4 figures

详情
AI中文摘要

生成建模和自监督表示学习(SSL)优化结构不同的目标:生成训练奖励分布保真度,而SSL奖励语义一致性。然而,最近的研究反复发现SSL特征改善了生成训练,尽管这种协同作用的机制仍不清楚。在这里,我们在一步生成的框架下研究SSL在生成建模中的优势,其中表示的作用是明确的:冻结的SSL特征用于将生成的样本与真实数据匹配。我们在该特征空间中使用Sinkhorn散度,为Wasserstein距离提供了一个可处理的代理,这是由Fréchet风格评估指标(如FID)近似的总体差异。我们发现,当在语义结构化的SSL特征空间中计算时,这个目标变得非常有效(ImageNet FID降低39倍)。我们将这种行为主要归因于匹配估计:抑制无关重建细节的语义SSL特征诱导出更紧凑的几何结构,使分布匹配更易处理。因此,最佳的训练SSL特征不一定与评估指标使用的特征匹配。特别是,我们表明使用Inception作为特征提取器可以改善FID,同时降低匹配稳定性和样本质量,揭示了一种形式的指标黑客攻击。通过在ImageNet上的大量实验,我们确定了哪些SSL特征族能带来最佳的生成性能,并表明匹配稳定性是选择它们的定量标准。代码可在https://github.com/Genentech/semantic-transport-generation获取。

英文摘要

Generative modeling and self-supervised representation learning (SSL) optimize structurally different objectives: generative training rewards distributional fidelity, while SSL rewards semantic coherence. Yet recent work repeatedly finds that SSL features improve generative training, though the mechanism of this synergy remains unclear. Here, we study the benefits of SSL in generative modeling in the framework of one-step generation where the role of representation is explicit: frozen SSL features are used to match generated samples to real data. We use the Sinkhorn divergence in that feature space, providing a tractable surrogate for the Wasserstein distance, the population-level discrepancy approximated by Fréchet-style evaluation metrics (such as FID). We find that this objective becomes highly effective when computed in a semantically structured SSL feature space (a 39$\times$ reduction in ImageNet FID). We trace this behavior primarily to matching estimation: semantic SSL features that suppress nuisance reconstruction details induce a more compact geometry, making distribution matching more tractable. As a consequence, the best training SSL features need not match the features used by the evaluation metric. In particular, we show that using Inception as the feature extractor can improve FID while degrading matching stability and sample quality, revealing a form of metric hacking. Using extensive experiments on ImageNet, we identify which SSL feature families lead to best generation performance and show that matching stability is a quantitative criterion for selecting them. Code is available at https://github.com/Genentech/semantic-transport-generation.
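For readers unfamiliar with the Sinkhorn divergence used as the matching objective, here is a small pure-Python sketch of the standard debiased form, S(x, y) = OT(x, y) - OT(x, x)/2 - OT(y, y)/2, with entropic OT solved by log-domain Sinkhorn iterations. The squared-Euclidean cost, uniform weights, `eps`, and iteration count are assumptions; in the paper's setting the inputs would be frozen SSL features of generated and real samples, whereas raw points stand in for features here.

```python
import math


def _logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def entropic_ot(x, y, eps=1.0, iters=200):
    """Entropic OT value between uniform point clouds via dual potentials."""
    n, m = len(x), len(y)
    cost = [[math.dist(xi, yj) ** 2 for yj in y] for xi in x]
    log_a, log_b = -math.log(n), -math.log(m)
    f, g = [0.0] * n, [0.0] * m
    for _ in range(iters):
        f = [-eps * _logsumexp([log_b + (g[j] - cost[i][j]) / eps
                                for j in range(m)]) for i in range(n)]
        g = [-eps * _logsumexp([log_a + (f[i] - cost[i][j]) / eps
                                for i in range(n)]) for j in range(m)]
    # Dual objective with uniform weights on both clouds.
    return sum(f) / n + sum(g) / m


def sinkhorn_divergence(x, y, eps=1.0, iters=200):
    """Debiased Sinkhorn divergence: zero when the two clouds coincide."""
    return (entropic_ot(x, y, eps, iters)
            - 0.5 * entropic_ot(x, x, eps, iters)
            - 0.5 * entropic_ot(y, y, eps, iters))
```

The two self-terms remove the entropic bias of OT(x, y), which is why the divergence vanishes for identical inputs and can serve as a training loss.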

2606.00512 2026-06-02 cs.LG cs.IT math.IT stat.ML

Semi-Supervised Learning with Noisy Proxy Covariates: Generalization Bounds and Distribution Regression

带噪声代理协变量的半监督学习:泛化界与分布回归

Kwangho Kim, Jisu Kim

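Affiliations: Department of Statistics, Korea University, Seoul, Korea; Department of Statistics, Seoul National University, Seoul, Korea

AI summary: For semi-supervised regression with noisy proxy covariates, proposes a two-stage estimator that learns kernel eigenfeatures from all proxy covariates and fits a ridge regression on labeled data. Theory shows that fast labeled-sample rates are recovered when proxy perturbation is controlled and unlabeled proxy covariates are abundant; experiments show gains over supervised and semi-supervised baselines at low label rates.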

Journal ref Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

在许多现代机器学习流程中,丰富的预训练表示充当有噪声的代理协变量,而任务特定标签仍然稀缺。我们研究这种设置下的半监督回归,并提出一个简单的两阶段估计器,该估计器从所有代理协变量中学习核本征特征,并在标记数据上拟合岭预测器。我们推导出有限样本界,表明当代理扰动受控且未标记代理协变量足够丰富时,可以恢复快速的标记样本率。我们还表明,分布回归是一个直接的特例,当有限袋大小足够大时具有类似的保证。实验表明,在低标记率情况下,相比监督和半监督基线有持续改进。

英文摘要

In many modern machine learning pipelines, abundant pretrained representations serve as noisy proxy covariates, while task-specific labels remain scarce. We study semi-supervised regression in this setting, and propose a simple two-stage estimator that learns kernel eigenfeatures from all proxy covariates and fits a ridge predictor on labeled data. We derive finite-sample bounds showing that fast labeled-sample rates are recovered when proxy perturbation is controlled and unlabeled proxy covariates are sufficiently abundant. We also show that distribution regression is a direct special case, with analogous guarantees when the finite bag size is large enough. Experiments show consistent gains over supervised and semi-supervised baselines, especially in low-label regimes.

2606.00511 2026-06-02 cs.LG cs.CV

Saliency-Aware Model Merging

显著性感知模型合并

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

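Affiliations: Yonsei University, Seoul, South Korea; Ewha Womans University, Seoul, South Korea

AI summary: Proposes SA-Merging, which applies connectivity-based saliency from structural pruning (e.g., SynFlow) to data-free model merging, reducing task interference through saliency scores over task vectors and merge-aware modulation, validated on vision and language tasks.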

Comments ICML 2026 Camera-ready

详情
AI中文摘要

模型合并旨在将多个在不同数据集上微调的任务特定模型整合到一个统一架构中,以实现跨领域能力。当前的数据无关模型合并方法通常难以扩展,因为它们依赖于忽略层间依赖性和非均匀专业知识分布的简单参数级启发式方法。本文提出SA-Merging,它基于结构剪枝(如SynFlow)中的连通性显著性公式,并将其扩展到数据无关模型合并设置。我们相对于共享基础模型定义任务向量上的显著性分数,并进一步引入合并感知调制,该调制结合专家间的一致性以减轻任务干扰。基于此公式,迭代的显著性感知合并过程逐步移除非信息性更新,同时保留端到端连通性。此外,我们将SA-Merging扩展到为LoRA引入秩级显著性分解,而不损害其结构完整性。在视觉和语言任务上的大量实验证明了我们基于显著性方法的有效性,进一步缩小了数据无关方法和测试时自适应方法之间的差距。

英文摘要

Model merging aims to consolidate multiple task-specific models fine-tuned on different datasets into a unified architecture that achieves cross-domain proficiency. Current data-free model merging methods often struggle to scale because they rely on simple parameter-level heuristics that ignore inter-layer dependencies and the non-uniform distribution of expertise. This work proposes SA-Merging, which is built upon connectivity-based saliency formulations from structural pruning (e.g., SynFlow) and extends them to the data-free model merging setting. We define a saliency score over task vectors relative to a shared base model, and further introduce merge-aware modulation that incorporates agreement across experts to mitigate task interference. Based on this formulation, an iterative saliency-aware merging procedure progressively removes non-informative updates while preserving end-to-end connectivity. Furthermore, we extend SA-Merging to introduce rank-wise saliency decomposition for LoRAs without compromising their structural integrity. Extensive experiments on vision and language tasks demonstrate the effectiveness of our saliency-based approach, further reducing the gap between data-free and test-time adaptation methods.

2606.00510 2026-06-02 cs.CL cs.AI

Skill or Skip? Learning Selective Skill Invocation in Agentic Tasks via Dual-Granularity Preference Learning

技能还是跳过?通过双粒度偏好学习在智能体任务中学习选择性技能调用

Chishui Chen, Jiaye Lin, Te Sun, Junxi Wang, Yi Yang, Cong Qin, Yangen Hu, Lu Pan, Ke Zeng

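Affiliations: Meituan; Fudan University; Shanghai Jiao Tong University; Nanjing University; Peking University

AI summary: Proposes the SelSkill framework, which learns selective skill invocation through dual-granularity preference learning and substantially improves task success and execution precision on ALFWorld and BFCL.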

Comments 18 pages, 4 figures, 10 tables

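Details
AI abstract (translated from Chinese)

Agent skills are callable procedural modules that provide reusable knowledge and execution strategies for complex agentic tasks. Existing methods, however, focus mainly on selecting relevant skills or improving the skills themselves, and overlook whether a relevant skill should actually be invoked at the current decision point. An unhelpful invocation can introduce irrelevant context and disrupt an execution that would otherwise have been correct. To address this, we propose SelSkill, a dual-granularity preference-learning framework for selective skill invocation. SelSkill formulates skill use as a skill-or-skip decision, uses predictive uncertainty to prioritize candidate decision points, and builds controlled invoke-skip preference pairs from shared trajectory prefixes. It further combines episode-level outcome preferences with step-level invocation preferences to capture both overall trajectory quality and the local effectiveness of skill invocation. On ALFWorld with Qwen3-8B, SelSkill raises task success rate by 10.9 percentage points and execution precision by 29.1 percentage points. On BFCL, it raises task success rate by 5.7 percentage points and execution precision by 29.5 percentage points. Zero-shot results on Tau-bench and PopQA further indicate that the learned invocation policy transfers to new domains with unseen skills.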

English abstract

Agent skills are callable procedural modules that provide reusable knowledge and execution policies for complex agentic tasks. However, existing methods mainly focus on selecting relevant skills or improving the skills themselves, while overlooking whether a relevant skill should actually be invoked at the current decision point. Unhelpful invocations may introduce irrelevant context and disrupt an otherwise correct execution process. To address this issue, we propose SelSkill, a dual-granularity preference-learning framework for selective skill invocation. SelSkill formulates skill use as a skill-or-skip decision, uses predictive uncertainty to prioritize candidate decision points, and constructs controlled invoke-skip preference pairs from shared trajectory prefixes. It further combines episode-level outcome preferences with step-level invocation preferences to capture both overall trajectory quality and the local effectiveness of skill invocation. On ALFWorld with Qwen3-8B, SelSkill improves task success by 10.9 percentage points and execution precision by 29.1 percentage points. On BFCL, it improves task success by 5.7 percentage points and execution precision by 29.5 percentage points. Zero-shot results on Tau-bench and PopQA further suggest that the learned invocation policy transfers to new domains with previously unseen skills.
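The two data-construction steps named in the abstract, prioritizing decision points by predictive uncertainty and building invoke-skip pairs from a shared prefix, can be sketched as follows. This is an illustration only. It assumes uncertainty is measured as the binary entropy of a hypothetical invoke probability, and that a pair is kept only when the two branches end with different outcomes. The abstract confirms neither detail, and all names here are made up for the example.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Tuple


def binary_entropy(p: float) -> float:
    """Entropy in bits of a Bernoulli(p) decision; 0 when the policy is certain."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def most_uncertain_steps(invoke_probs: List[float], top_k: int) -> List[int]:
    """Indices of the top_k steps whose invoke probability is closest to a coin flip."""
    ranked = sorted(range(len(invoke_probs)),
                    key=lambda t: binary_entropy(invoke_probs[t]),
                    reverse=True)
    return sorted(ranked[:top_k])


@dataclass
class PreferencePair:
    prefix: Tuple[str, ...]  # actions shared by both branches
    chosen: str              # "invoke" or "skip"
    rejected: str


def build_pair(trajectory: List[str], step: int,
               invoke_succeeded: bool, skip_succeeded: bool) -> Optional[PreferencePair]:
    """Build a pair from two rollouts that differ only in the decision at `step`.

    Returns None when both branches share the same outcome, since such a
    comparison carries no preference signal (an assumption of this sketch).
    """
    if invoke_succeeded == skip_succeeded:
        return None
    chosen, rejected = ("invoke", "skip") if invoke_succeeded else ("skip", "invoke")
    return PreferencePair(tuple(trajectory[:step]), chosen, rejected)
```

Because both branches start from the same prefix, any difference in outcome can be attributed to the invoke-or-skip decision itself rather than to earlier actions.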

2606.00509 2026-06-02 cs.CV

Structure-Aware Consistency Priors for Shape from Polarization in Complex Media


Kaimin Yu, Puyun Wang, Huayang He, Xianyu Wu

Affiliations The School of Mechanical Engineering and Automation, Fuzhou University, Fuzhou, China; Research Institute of Highway, Ministry of Transport, Beijing, China

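AI summary To address the nonlinear mapping between polarization observations and surface normals in complex media (with ice as the example), the paper proposes a structure-aware polarization prior based on autocorrelation functions and designs a dual-branch network, IceSfP, that estimates normals accurately through cross-modal attention and multi-scale feature fusion, reaching a mean angular error of 16.01° on the first real-world ice SfP dataset.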

Journal ref ICML 2026

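Details
AI abstract (translated from Chinese)

Recovering surface normals from single-view polarization images in complex media remains challenging. This paper takes ice as a representative complex medium, in which intricate light-matter interactions produce a nonlinear mapping between polarization observations and surface normals. To address this, a structure-aware polarization prior based on autocorrelation functions is proposed to capture the local spatial consistency of AoLP. Building on it, a dual-branch network (IceSfP) is designed that integrates raw polarization features with the prior through cross-modal attention and multi-scale feature fusion, achieving accurate surface normal estimation under complex media conditions. To evaluate the method, the first real-world ice SfP dataset is constructed. Experimental results show that the method outperforms existing methods on all metrics, with a mean absolute error (MAE) of 16.01°, which is 2.74° lower than the second-best method. The framework offers a generalizable solution for high-precision geometric perception in complex media.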

English abstract

Recovering surface normals from single-view polarization images in complex media remains challenging. This paper focuses on ice as a representative complex medium, where intricate light-matter interactions lead to a nonlinear mapping between polarization observations and surface normals. To address this, a structure-aware polarization prior based on autocorrelation functions is proposed to capture the local spatial consistency of the angle of linear polarization (AoLP). Building on this, a dual-branch network (IceSfP) is designed to integrate raw polarization features with the prior via cross-modal attention and multi-scale feature fusion, enabling accurate surface normal estimation under complex media conditions. To evaluate the method, the first real-world ice shape-from-polarization (SfP) dataset is constructed. Experimental results show that the method outperforms existing approaches across all metrics, achieving an MAE of 16.01°, which is 2.74° lower than the second-best method. The framework provides a generalizable solution for high-precision geometric perception in complex media.
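For readers unfamiliar with the quantities involved, the sketch below shows how AoLP is commonly computed from four polarizer-angle intensities and how an angular error between normals is measured in degrees. It assumes a camera that captures intensities at 0°, 45°, 90°, and 135° and uses the standard linear Stokes relations. It does not reproduce the paper's autocorrelation prior or the IceSfP network, and the function names are hypothetical.

```python
import math
from typing import Sequence, Tuple


def linear_stokes(i0: float, i45: float, i90: float, i135: float) -> Tuple[float, float, float]:
    """Linear Stokes components from intensities behind polarizers at 0/45/90/135 degrees."""
    s0 = 0.5 * (i0 + i45 + i90 + i135)  # total intensity
    s1 = i0 - i90
    s2 = i45 - i135
    return s0, s1, s2


def aolp_dolp(i0: float, i45: float, i90: float, i135: float) -> Tuple[float, float]:
    """Angle (radians) and degree of linear polarization for one pixel."""
    s0, s1, s2 = linear_stokes(i0, i45, i90, i135)
    aolp = 0.5 * math.atan2(s2, s1)
    dolp = math.hypot(s1, s2) / s0 if s0 > 0 else 0.0
    return aolp, dolp


def angular_error_deg(n_pred: Sequence[float], n_true: Sequence[float]) -> float:
    """Angle in degrees between a predicted and a reference surface normal."""
    dot = sum(a * b for a, b in zip(n_pred, n_true))
    norm = math.sqrt(sum(a * a for a in n_pred)) * math.sqrt(sum(b * b for b in n_true))
    cos_angle = max(-1.0, min(1.0, dot / norm))  # guard against rounding outside [-1, 1]
    return math.degrees(math.acos(cos_angle))


def mean_angular_error_deg(preds: Sequence[Sequence[float]],
                           truths: Sequence[Sequence[float]]) -> float:
    """Average per-pixel angular error over a set of normals."""
    errors = [angular_error_deg(p, t) for p, t in zip(preds, truths)]
    return sum(errors) / len(errors)
```

AoLP is only defined up to 180°, which is one reason a single polarization image leaves the surface normal ambiguous and why additional priors are useful.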