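arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.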

2605.30925 2026-06-01 cs.CV cs.GR

MultiAct: Text-to-Motion Generation from Composite Text via Tailored Attention Guidance

Nathan Sala, Ofir Abramovich, Ariel Shamir, Daniel Cohen-Or, Andreas Aristidou, Sigal Raab

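AI summary: Proposes MultiAct, an inference-time framework that requires no retraining or architectural changes. It adaptively amplifies the cross-attention scores of underrepresented prompt components to address incomplete semantic coverage in composite text-to-motion generation.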

Comments Accepted to SIGGRAPH 2026 conference. Project page: https://natsala13.github.io/multiact.github.io

Abstract

Text-to-motion generation has progressed rapidly in recent years, offering an expressive interface for animation and human-computer interaction. However, current models remain brittle when handling prompts that describe multiple actions occurring at the same time. Rather than realizing all components of a composite description, models frequently prioritize a single dominant action and neglect the rest, leading to incomplete or ambiguous motion. We present MultiAct, an unpaired, inference-time framework for compositional text-to-motion synthesis that operates directly on pretrained motion generators without retraining or architectural modification. Our method counteracts semantic collapse by adaptively amplifying cross-attention scores associated with underrepresented prompt components. We note that effective modulation depends on prompt-specific choices, such as which tokens and layers to target, and introduce a lightweight auxiliary decision scheme that determines the most effective attention-strengthening parametrization. Extensive quantitative and qualitative evaluations demonstrate that MultiAct consistently outperforms existing baselines on composite prompts, achieving improved semantic coverage while preserving motion realism. Project page: https://natsala13.github.io/multiact.github.io.
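
As a rough illustration of what amplifying cross-attention for an underrepresented prompt component could look like, the sketch below adds log(gain) to the pre-softmax scores of chosen text tokens for a single motion query. This is an assumption made for illustration, not MultiAct's actual modulation rule: the function names, token indices, scores, and gain value are all hypothetical, and the abstract does not say how tokens, layers, or strengths are selected.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def boosted_attention(scores, target_tokens, gain):
    """Scale the unnormalized weight of the target tokens by `gain` (hypothetical rule)."""
    shifted = [s + math.log(gain) if i in target_tokens else s
               for i, s in enumerate(scores)]
    return softmax(shifted)

# Hypothetical scores of one motion query over three text tokens;
# token 0 dominates and token 1 stands for the neglected action.
scores = [2.0, 0.5, 0.1]
baseline = softmax(scores)
boosted = boosted_attention(scores, target_tokens={1}, gain=3.0)
```

Because the softmax renormalizes, raising the weight of token 1 necessarily lowers the share of the dominant token, which is the kind of rebalancing the abstract describes.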

2605.30924 2026-06-01 cs.CL

EMBGuard: Constructing Hazard-Aware Guardrails for Safe Planning in Embodied Agents

Dongwook Choi, Taeyoon Kwon, Bogyung Jeong, Minju Kim, Yeonjun Hwang, Hyojun Kim, Byungchul Kim, Young Kyun Jang, Jinyoung Yeo

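AI summary: Proposes EMBGuard, the first MLLM-based safety guardrail for embodied agents. It decouples physical risk reasoning from the agent policy and evaluates (visual observation, action) pairs to identify hazardous configurations and explain them in natural language. The authors also build the EMBHazard training dataset and the EMBGuardTest benchmark. At compact model sizes, EMBGuard performs competitively with proprietary MLLMs while lowering the false-positive rate.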

Comments Accepted at ICML 2026

Abstract

MLLM-powered embodied agents deployed in real-world environments encounter physical hazards. However, existing approaches lack explicit mechanisms for identifying hazards and reasoning about action-conditioned risks, leading agents to either miss risky interactions or over-identify risks. To address this, we propose EMBGuard, the first MLLM-based safety guardrail for embodied agents designed to decouple physical risk reasoning from agent policy. By evaluating a (visual observation, action) pair, EMBGuard identifies hazardous configurations and provides natural language explanations of potential risks. Alongside EMBGuard, we contribute EMBHazard, a training dataset of 15.1K action-conditioned pairs, and EMBGuardTest, a benchmark of 329 manually curated real-world scenarios spanning seven physical risk categories. Through compositional variation of hazards and actions, we generate diverse risky and benign scenarios that agents may encounter during planning. Despite its compact size (2B, 4B), EMBGuard achieves performance competitive with proprietary MLLMs (e.g., GPT-5.1, Gemini-2.5-Pro) while significantly reducing the false-positive rates that hinder real-time deployment. We make the code, data, and models publicly available at https://github.com/dongwxxkchoi/EMBGuard

2605.30919 2026-06-01 cs.LG cs.AI

De-attribute to Forget for LLM Unlearning

Xinyang Lu, Jiabao Pan, Rachael Hwee Ling Sim, See-Kiong Ng, Anthony Kum Hoe Tung, Bryan Kian Hsiang Low

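AI summary: Proposes DareU, an LLM unlearning framework based on data attribution rewards. It uses reinforcement learning to lower the attribution score between generated responses and the forget data, achieving effective unlearning while preserving model utility.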

Abstract

The rapid development of large language models (LLMs) has raised concerns about the use of inappropriate data for training, which has led to growing interest in LLM unlearning. Many existing LLM unlearning approaches rely on optimizing prediction losses, such as maximizing the loss on the forget set, but often face critical issues like over-forgetting and poor model utility. To address these issues, this paper instead frames the optimization objective for LLM unlearning as one of zeroing out data attribution. In particular, we propose DareU, the first LLM unlearning framework based on data attribution rewards, which uses reinforcement learning to update the LLM by reducing the attribution score of its generated responses to the forget data owners (i.e., de-attributing). Empirical evaluation using an LLM classifier as an efficient approximation of attribution shows that DareU outperforms existing baselines, achieving effective unlearning while balancing forget quality and model utility well.

2605.30916 2026-06-01 cs.LG cs.GT econ.TH

Welfare, Improvability, and Variance: A Principal-Agent Approach to Optimal Benchmark Item Aggregation

Andreas Haupt, Justin Hartenstein, Anka Reuel, Mykel Kochenderfer, Sanmi Koyejo

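AI summary: Models benchmarking as a multitask principal-agent game, evaluates items along three dimensions (welfare, improvability, and variance), and applies the framework to OLMES to identify Pareto-inferior items.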

Abstract

AI benchmarks have well-documented limitations, with prior work examining contamination, saturation, and construct underspecification. Aggregation has received far less attention: benchmarks are typically summarized by uniformly averaging item-level scores, implicitly treating every test item as equally valuable. We model benchmarking as a multitask principal-agent game and show that the welfare loss from a benchmark is determined jointly by three item-level primitives: alignment with normative welfare priorities, marginal improvability, and performance variance. We translate the theory into an audit framework that ranks items along each of these three axes, and apply it to OLMES items using WORKBank for welfare, the EvoLM 4B suite for improvability, and the PolyPythias 410M panel for variance. The framework surfaces items that are Pareto-inferior within OLMES subject to a pro-worker welfare operationalization. All code is available at https://github.com/stair-lab/principal-agent-benchmarks.
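
To make the aggregation point concrete, the sketch below compares uniform averaging with a weighted average of item-level scores. The scores and weights are invented for illustration. In particular, the weights are not derived from the paper's welfare, improvability, or variance analysis, which the abstract does not spell out.

```python
def aggregate(item_scores, weights=None):
    """Weighted mean of item-level scores; uniform weights when none are given."""
    if weights is None:
        weights = [1.0] * len(item_scores)
    return sum(w * s for w, s in zip(weights, item_scores)) / sum(weights)

# Hypothetical item-level scores for two models on three benchmark items.
model_a = [0.9, 0.5, 0.2]
model_b = [0.4, 0.5, 0.6]
# Hypothetical item weights; NOT taken from the paper.
weights = [0.1, 0.3, 0.6]

uniform_a, uniform_b = aggregate(model_a), aggregate(model_b)
weighted_a, weighted_b = aggregate(model_a, weights), aggregate(model_b, weights)
```

With these numbers, model A ranks first under uniform averaging while model B ranks first under the weighted scheme, so the choice of aggregation alone can flip a leaderboard.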

2605.30914 2026-06-01 cs.LG cs.SE

Automating Formal Verification with Reinforcement Learning and Recursive Inference

Max Tan

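AI summary: Studies how reinforcement learning from verifiable rewards and verifier-guided inference-time search improve the ability of large language models to generate verified programs and proofs, with substantial gains on Dafny and Lean.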

Comments Master's thesis, 140 pages, 16 figures, 17 tables

Abstract

Automated formal verification remains challenging for large language models because data for proof assistants and verification-aware languages is scarce, and correctness depends on satisfying precise machine-checkable specifications rather than producing plausible code. This thesis studies how verifier environments can improve LLM generation of verified programs and proofs through reinforcement learning from verifiable rewards (RLVR) and verifier-guided inference-time search. First, we train open-source models in Dafny with RLVR using Group Relative Policy Optimization (GRPO) and related variants, assembling generated candidates into complete programs and scoring them with compiler and verifier outcomes. Initial experiments on an APPS-derived Dafny dataset increased verified reward from 2.2% to 58.1%, but revealed specification hacking, where models exploit weak formal specifications instead of implementing the intended solutions. After filtering underspecified and vulnerable tasks, multi-turn RLVR on the refined benchmark improves the verified pass rate from 9.7% to 31.1%. Second, we develop a verifier-guided inference scaffold in Lean that treats proof generation as structured search over decomposed subgoals, verifier feedback, diagnostics, and repair. With a fixed base model, the full scaffold with proof reviser improves pass rate on an initial VeriCoding pilot set from 46.2% under direct repair to 69.2%. On the larger VERINA dataset, whole-task decomposition plus proof reviser solves 7 of 42 previously unsolved tasks. We also introduce Dalek-Bench, a repository-scale Lean benchmark derived from the Rust curve25519-dalek verification project; preliminary results remain weak, indicating that stronger progress evaluation and task-specific tool-use policies are still needed.
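
For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage computation in its commonly described form: rewards for several candidates sampled from the same prompt are normalized by the group mean and standard deviation. The reward values are hypothetical verifier outcomes, and the thesis's own variants may differ from this basic form.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of candidates sampled for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every candidate gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical verifier outcomes for four candidates: 1.0 = verified, 0.0 = rejected.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
# A group with identical rewards carries no learning signal.
flat = group_relative_advantages([0.0, 0.0, 0.0, 0.0])
```

A reward that only checks verifier acceptance would score a program that satisfies a weak specification as highly as an intended solution, which is consistent with the specification hacking the abstract reports.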

2605.30913 2026-06-01 cs.CL cs.AI cs.CY cs.HC

Toxic HallucinAItions: Perturbing Prompts and Tracing LLM Circuits

Soorya Ram Shimgekar, Agam Goyal, Amruta Parulekar, Joshua Chen, Yian Wang, Navin Kumar, Hari Sundaram, Eshwar Chandrasekharan, Koustuv Saha

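AI summary: Studies how toxic-language perturbations affect the factual reliability of LLMs, finds that toxic wording lowers accuracy and increases uncertainty, and uses attribution-graph analysis to examine the internal mechanisms.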

Abstract

Large language models (LLMs) are increasingly deployed in conversational settings where user tone ranges from polite to adversarial or toxic, yet less is known about whether toxic language in otherwise semantically equivalent prompts can degrade factual reliability. We study how lexical and tone-based prompt perturbations affect the factual reliability of LLMs. Using controlled prompt variations across polite, random, and three toxicity levels, we evaluate five LLMs on ARC-Easy, GSM8K, and MMLU. We find that toxic lexical perturbations consistently reduce factual accuracy and increase uncertainty, while polite phrasing yields limited and inconsistent changes. To examine whether these answer inconsistencies correspond to internal changes, we conduct attribution-graph analyses of model activations and influences. We find that increasing toxicity selectively amplifies perturbation-sensitive variant nodes while relatively stable core reasoning nodes remain more invariant. These findings position prompt tone as a critical dimension of LLM reliability and provide behavioral and mechanistic evidence that surface-level lexical variation can alter factual outputs and internal computation.

2605.30912 2026-06-01 cs.CV cs.CL

Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

Ruina Hu, Chen Wang, Lai Wei, Jionghao Bai, Bin Yu, Weiran Huang, Kai Wang, Yue Wang

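AI summary: Proposes EASE, which converts annotated evidence regions into a smoothed visual-token target and uses it to guide response-to-image attention during multimodal reinforcement learning, improving vision-language models on perception, hallucination, visual math, and multimodal reasoning benchmarks.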

Abstract

Reinforcement learning with verifiable rewards (RLVR) improves vision-language models (VLMs) by optimizing outcome rewards derived from final answers. However, such outcome-only rewards do not tell the model which image regions justify an answer. For questions that require visual grounding, these rewards cannot distinguish responses supported by relevant visual evidence from those produced by language-prior shortcuts or lucky guesses. We introduce EASE (Evidence-Anchored Spatial Attention), which augments multimodal RLVR with visual-evidence process supervision. EASE converts annotated evidence regions into a smoothed visual-token target and uses it to guide response-to-image attention during RL training, but only on high-reward trajectories. The annotations are used solely as privileged training labels, while inference requires only the original image and question. Across Qwen2.5-VL-7B, Qwen3-VL-4B, and Qwen3-VL-8B, EASE raises average scores over DAPO by 2.5 to 3.1 points on perception, hallucination, visual math, and multimodal reasoning benchmarks. Diagnostics and ablations show that EASE better aligns visual attention with annotated evidence regions.

2605.30911 2026-06-01 cs.CV cs.AI

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

Yusheng He, Jizhe Zhou, Xia Du, Zheng Lin, Jun Luo, Jiancheng Lv

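AI summary: Decomposes architecture design into three dimensions (linguistic foundation, visual representation, and semantic alignment) and introduces the CoSimUE benchmark to systematically study how architectural factors affect the hallucination robustness of LVLMs. Scaling model parameters has limited effect, whereas stronger visual encoders, language foundations, and semantic alignment each reduce a different type of hallucination.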

Abstract

Hallucination remains one of the key challenges undermining the reliability of Large Vision-Language Models (LVLMs). But what makes an LVLM hallucinate less? Many existing efforts focus on improving internal components of the model. We argue that hallucination fundamentally stems from how the model architecture is designed. To investigate this, we factor the architecture design into three dimensions: Linguistic Foundation (LF), Visual Representation (VR), and Semantic Alignment (SA), and categorize hallucinations into Co-occurrence, Similarity, and previously overlooked Uncertainty types. Building on this formulation, we propose CoSimUE, a benchmark that creates fine-grained hallucination scenarios through controlled textual perturbations and random perturbations, enabling mapping between design choices and hallucination behaviors. Experiments across 7 design aspects show that: 1) the widely emphasized scaling of model parameters has only limited impact on reducing all three types of hallucinations; 2) larger and better-trained language foundations can reduce co-occurrence hallucinations; 3) stronger visual encoders and higher resolutions mitigate similarity errors; 4) effective alignment strategies alleviate uncertainty hallucinations; and 5) jointly enhancing visual fidelity and alignment quality yields the most comprehensive improvements, as revealed by cross-dimensional analysis. This study provides the first systematic exploration linking architecture-level design to hallucination robustness, offering practical guidance for developing reliable and efficient LVLMs.

2605.30910 2026-06-01 cs.LG

PINNs Failure Modes are Overfitting

Nigel T. Andersen, Takashi Matsubara

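AI summary: Visualizes the residual to show that the failure modes of physics-informed neural networks stem from overfitting, and proposes a method based on regularization and double backpropagation that removes the failure modes, reaching state-of-the-art performance on standard equations with fewer collocation points.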

Abstract

Physics-Informed Neural Networks (PINNs) are a common class of machine learning-based partial differential equation (PDE) solvers which train a network to represent a solution by minimizing a residual loss that encodes the PDE. Despite their successes, they are known to fail on certain simple equations, converging to an incorrect solution despite low loss. These failure modes have garnered significant attention in the literature over the past several years, motivating both architectural and optimization-based solutions. By directly visualizing the residual, we show that failure modes are the result of overfitting: the loss is minimized on the collocation points, but not elsewhere. Applying regularization causes the failure modes to vanish. Finally, we extend double backpropagation over the full set of residuals, and use it to achieve state-of-the-art performance on four standard failure mode equations with up to $23\times$ fewer collocation points and a vanilla architecture.
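
The overfitting claim can be pictured with a toy residual that is zero at every collocation point yet large in between. The sketch below is a hand-written stand-in, not a trained PINN and not the paper's experiment: the residual function, the number of collocation points, and the dense evaluation grid are all assumptions chosen to make the effect visible.

```python
import math

N = 8  # number of collocation intervals (hypothetical)

def residual(x):
    """Toy PDE residual that vanishes at x = k / N but not in between."""
    return math.sin(N * math.pi * x)

def mean_squared(values):
    """Mean of squared values, the usual form of a residual loss."""
    return sum(v * v for v in values) / len(values)

collocation = [k / N for k in range(N + 1)]   # points the loss is trained on
dense_grid = [i / 1000 for i in range(1000)]  # points the loss never sees

train_loss = mean_squared([residual(x) for x in collocation])
held_out_loss = mean_squared([residual(x) for x in dense_grid])
```

Here the training loss is numerically zero while the loss on the dense grid is about 0.5, which mirrors the abstract's description of a low loss coexisting with an incorrect solution.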

2605.30906 2026-06-01 cs.RO cs.SY eess.SY

Trajectory Planning for Non-Communicating Mobile Robots using Inverse Optimal Control

Nina Majer, Yannick Epple, Xin Ye, Stefan Schwab, Sören Hohmann

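AI summary: For efficient interaction among non-communicating mobile robots in collision-avoidance scenarios, proposes a trajectory planning and prediction algorithm that incorporates inverse optimal control. It estimates unknown goal states and predicts jointly, which lets the planning problem be solved faster.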

2605.30904 2026-06-01 cs.CV

MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging

Luyuan Zhang, Siyuan Li, Zedong Wang, Qingsong Xie, Cheng Tan, Anna Wang, Yanhao Zhang, Chen Chen, Haonan Lu, Haoqian Wang

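AI summary: Proposes MergeTok, a unified tokenizer that uses token merging to jointly optimize continuous VAE and discrete VQ tokenizers, combining high-fidelity reconstruction with semantically controllable discrete representations.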

Comments 11 pages (main text), 7 figures. Preprint. Under review at NeurIPS 2026

Abstract

Most visual tokenizers for image generation are bifurcated into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer that jointly optimizes continuous (VAE) and discrete (VQ) tokenizers within an encoder-decoder architecture, leveraging token merging techniques as a semantic bridge. By clustering similar tokens during encoding, MergeTok establishes a structural prior that provides dual supervision signals: (i) it imposes merged-token semantic alignment in the VAE branch, regularizing its latent space toward disentangled, semantic-aware representations; (ii) it derives group-wise constraints, promoting intra-group diversity and inter-group exclusivity that stabilize VQ training. MergeTok shows competitive reconstruction and generation performance on ImageNet-256, with substantially lower rFID than strong VAE and VQ models under matched token budgets, while producing semantically organized token representations compatible with both autoregressive and diffusion generators. This shows that a single architecture can endow visual tokenizers with robust semantic organization and generator-friendly discreteness.

2605.30903 2026-06-01 cs.LG cs.AI

Inverse Reinforcement Learning without an Optimal Demonstrator: A Feasible Reward Set Approach

Kihyun Kim, Shripad Deshmukh, Nikos Vlassis, Jiawei Zhang

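AI summary: For data from multiple suboptimal demonstrators, proposes a feasible-reward-set framework in which the joint feasible set, defined by linear constraints, shrinks monotonically. It also provides recovery guarantees and an offline algorithm for high-dimensional environments.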

Abstract

Inverse reinforcement learning (IRL) typically assumes demonstrations from a single optimal demonstrator, but in many applications data come from multiple imperfect demonstrators with heterogeneous suboptimality levels. We study reward learning in this setting through a feasible-reward-set framework: for each demonstrator, we encode its declared suboptimality level as a linear constraint and intersect the resulting feasible sets across demonstrators. Our theoretical analysis shows that the joint feasible set shrinks monotonically as data are added, and we give an exact characterization of when a new demonstrator strictly tightens it. We further establish two recovery guarantees for the feasible reward set of the ground-truth optimal demonstrator: one bound depends on closeness to the optimal occupancy, while the other requires only sufficient coverage and no near-optimal demonstrator. On the practical side, we introduce strategies to address the inherent reward ambiguity in the obtained reward set and provide an offline algorithm with function approximation for high-dimensional environments. Experiments in tabular grid-world and large language model (LLM) fine-tuning settings are consistent with the theoretical predictions and demonstrate the effectiveness of the proposed framework over baselines.

2605.30901 2026-06-01 cs.LG

Density-Guided Robust Counterfactual Explanations on Tabular Data under Model Multiplicity

Jun Tan, Qing Guo, Zicheng Xu, Jinglin Li, Qi Fang, Ning Gui

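AI summary: Proposes DensityFlow, a generative framework that uses a Neural ODE and a density score to construct robust counterfactual explanations that avoid low-density regions and remain valid under model multiplicity.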

Comments 26 pages, 11 figures, accepted by ICML 2026

Abstract

Counterfactual explanations (CEs) are essential for actionable recourse, yet their reliability is often compromised in low-density regions, where classifiers exhibit high variance. Unlike existing methods that rely on expensive ensemble intersections to define stability, we propose DensityFlow, a generative framework that constructs robust CEs by adhering to the high-confidence data manifold. Specifically, we model the counterfactual generation as continuous-time dynamics parameterized by a Neural ODE, guided by a differentiable density score to actively avoid uncertain, low-density areas. This density score is learned via Noise Contrastive Estimation, effectively leveraging a $(K+1)$-way discriminator to estimate density ratios. For black-box settings, we introduce a local proxy distillation mechanism that aligns a lightweight surrogate with the target model strictly within the trajectory of CE generation, enabling efficient gradient-based optimization with minimal queries. Experiments demonstrate that DensityFlow achieves superior validity under model multiplicity while significantly reducing query costs compared to ensemble-based baselines. Our implementation is available at https://github.com/G-AILab/DensityFlow.

2605.30900 2026-06-01 cs.AI physics.app-ph

BilliardPhys-Bench: Benchmarking Physical Reasoning and Visual Dynamics of Multimodal LLMs

Ben Wang, Xiaogang Li, Ruochen Gao, Peiyao Xiao, Chengliang Xu, Zeyu Wang, Zichao Chen, Bing Zhao, Hu Wei

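AI summary: Proposes the BilliardPhys-Bench benchmark, which uses synthetic billiards environments to evaluate the physical reasoning of multimodal LLMs (collision, bounce, and final-position prediction). Models show a "stasis bias", and performance degrades as simulation time and scene complexity grow.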

Abstract

Current multimodal models handle static image recognition well, but intuitive physical reasoning remains a weakness. Predicting how objects will move and interact from a single image is still difficult for these systems. We present BilliardPhys-Bench, a benchmark for physical reasoning in synthetic billiards environments. Its procedural engine generates randomized scenarios with friction and elastic collisions. The benchmark tests three abilities: (1) predicting ball-to-ball collisions, (2) reasoning about wall bounces, and (3) estimating final ball positions after motion stops. We evaluate recent MLLMs from the GPT, Claude, Gemini, and Qwen families. Performance drops as simulation time increases and scene geometry grows more complex. We also observe a consistent failure mode we call "stasis bias": when the correct physical outcome is harder to infer, models tend to predict no interaction. These findings show where current MLLMs break down on visual dynamics and point toward the need for better physical inductive biases in multimodal architectures.

2605.30898 2026-06-01 cs.AI cs.CL

UniScale: Adaptive Unified Inference Scaling via Online Joint Optimization of Model Routing and Test-Time Scaling

Kaiyu Huang, Xingyu Wang, Mingze Kong, Zhubo Shi, Yuqian Hou, Hong Xu, Zhongxiang Dai, Minchen Yu, Qingjiang Shi

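AI summary: Proposes UniScale, a framework that unifies model routing and test-time scaling as a contextual multi-armed bandit problem and learns inference policies online via LinUCB, achieving a fine-grained and better quality-cost trade-off.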

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

In real-world deployments of large language models (LLMs), balancing inference quality and computational cost has become a central challenge. Existing approaches tackle this trade-off along two largely independent dimensions: model routing, which switches among models of different scales to match request complexity, and test-time scaling (TTS), which adjusts inference-time compute within a fixed model for fine-grained control. However, this decoupled design introduces inherent limitations. Model routing yields coarse-grained, discrete performance changes due to the sparse set of model scales, while single-model TTS often encounters capacity ceilings and exhibits diminishing returns as compute increases. Moreover, treating the two mechanisms separately restricts adaptability in dynamic inference environments. To overcome these limitations, we introduce Unified Inference Scaling (UIS), which unifies model routing and TTS in a single optimization space. Building on this formulation, we propose UniScale, an online framework that models adaptive UIS as a contextual multi-armed bandit problem and learns inference policies via LinUCB. The framework incorporates efficiency-aware learning and cost modeling to ensure stable and scalable optimization over high-dimensional action spaces. Evaluation shows that UniScale effectively exploits the synergy in the UIS space to deliver a fine-grained and consistently better quality-cost trade-off across diverse, dynamic inference scenarios.
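
The bandit formulation can be sketched with textbook per-arm LinUCB, where each arm is one point in the joint routing and test-time-scaling space. Everything specific below is an assumption for illustration: the arm names, the two-dimensional context, the reward values (standing in for quality minus cost), and the Sherman-Morrison update used to avoid a matrix library. UniScale's efficiency-aware learning and cost modeling are not reproduced here.

```python
import math

class LinUCBArm:
    """One arm of disjoint LinUCB, storing the inverse design matrix directly."""

    def __init__(self, dim, alpha=1.0):
        self.alpha = alpha
        # A = identity (ridge term), so its inverse is also the identity.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    @staticmethod
    def _matvec(matrix, vector):
        return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

    def ucb(self, x):
        """Estimated reward plus an exploration bonus for context x."""
        theta = self._matvec(self.A_inv, self.b)
        a_inv_x = self._matvec(self.A_inv, x)
        estimate = sum(t * xi for t, xi in zip(theta, x))
        variance = max(0.0, sum(a * xi for a, xi in zip(a_inv_x, x)))
        return estimate + self.alpha * math.sqrt(variance)

    def update(self, x, reward):
        """Rank-one update of A_inv (Sherman-Morrison) and of b."""
        a_inv_x = self._matvec(self.A_inv, x)
        denom = 1.0 + sum(a * xi for a, xi in zip(a_inv_x, x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.A_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]

def choose(arms, context):
    """Pick the arm with the highest upper confidence bound."""
    return max(arms, key=lambda name: arms[name].ucb(context))

# Hypothetical arms: each pairs a model with a test-time compute budget.
arms = {name: LinUCBArm(dim=2, alpha=0.1)
        for name in ("small-1-sample", "small-8-samples", "large-1-sample")}
context = [1.0, 0.5]  # hypothetical request features
for _ in range(5):
    arms["small-8-samples"].update(context, reward=1.0)  # assumed quality minus cost
    arms["large-1-sample"].update(context, reward=0.0)
selected = choose(arms, context)
```

Treating every (model, budget) pair as its own arm is the simplest way to place both mechanisms in one action space. It also shows why the action space grows quickly, which is presumably what the abstract's scalability measures address.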

2605.30896 2026-06-01 cs.LG

Zero Collapse: A Failure Mode of Policy Gradient Methods in Discontinuous Reward Environments

Nishant Kumar, Enrique Areyan Viqueira, Amy Greenwald

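AI summary: Finds that policy gradient methods exhibit a "zero collapse" failure mode in discontinuous-reward environments such as auctions, where the policy becomes trapped in a zero-reward region because the gradient signal vanishes, and proposes mitigation strategies.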

Comments 20 pages, 7 figures; includes Appendix

Abstract

Bidding in repeated auctions is a central challenge for reinforcement learning (RL), combining continuous control with the strategic complexities of digital advertising. While policy gradient and value-based methods seem well-suited for these settings, they often struggle with the discontinuous, "cliff-like" nature of auction reward landscapes. In a first-price auction, for example, a bidder receives zero reward until they cross a specific threshold, after which the reward decreases as the bid increases. This creates a landscape of flat, zero-reward regions separated by sharp boundaries. We identify a fundamental failure mode in this setting termed "zero collapse." We show that stochastic exploration and gradient-based updates can cause policies to overshoot optimal high-reward regions and enter flat, zero-reward regimes. Once there, the lack of an informative gradient signal makes recovery extremely sample-inefficient, effectively trapping the agent. We find that actor-critic methods are particularly susceptible, as biased value estimates can accelerate this movement toward unstable regions. Our contributions include: (1) a mechanistic explanation of how discontinuous rewards lead to vanishing signals and zero collapse; (2) an analysis of the interaction between policy stochasticity and step size; and (3) an empirical demonstration of this phenomenon across REINFORCE and actor-critic variants. We propose practical mitigation strategies involving initialization and architectural choices to improve stability. Finally, we introduce a formal RL framework for auction environments highlighting their unique structural properties.
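
The cliff-shaped reward described for the first-price auction can be written down directly. The sketch below assumes a single bidder with a fixed valuation and a fixed winning threshold, both hypothetical numbers, and uses a finite difference to show where a gradient signal exists. It illustrates the landscape only and does not reproduce the paper's experiments.

```python
def first_price_reward(bid, value, threshold):
    """Zero reward for a losing bid; value minus bid once the bid reaches the threshold."""
    return value - bid if bid >= threshold else 0.0

def finite_difference(f, x, h=1e-3):
    """Central-difference estimate of df/dx."""
    return (f(x + h) - f(x - h)) / (2 * h)

VALUE, THRESHOLD = 1.0, 0.6  # hypothetical valuation and winning threshold

def reward(bid):
    return first_price_reward(bid, VALUE, THRESHOLD)

slope_losing = finite_difference(reward, 0.3)   # inside the flat zero-reward region
slope_winning = finite_difference(reward, 0.8)  # inside the winning region
best_reward = reward(THRESHOLD)                 # highest reward sits on the cliff edge
```

In the winning region the slope is negative, so improving the reward means lowering the bid toward the threshold. One step too far lands in the flat region, where the slope is exactly zero and offers no direction for recovery.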

2605.30894 2026-06-01 cs.CV

SteerFace: Debiasing Synthetic Face Generation via Adaptive Residue Perturbation

Yuxi Mi, Qiuyang Yuan, Jianqing Xu, Yichun Zhou, Xuan Zhao, Jun Wang, Rizen Guo, Shuigeng Zhou

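AI summary: To address the visual-tendency gap between synthetic and real face data distributions, the paper proposes SteerFace, a framework that perturbs identity embeddings toward random orthogonal directions as a regularizer, suppressing the generator's reliance on non-identity visual cues and narrowing the synthetic-real gap.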

English abstract

The shortage of legally compliant data for face recognition training has sparked growing interest in using synthetic data as an alternative. While recent diffusion-based methods enable the generation of photorealistic face images with strong identity adherence and data diversity, their downstream recognition performance still exhibits a significant synthetic-real gap. This paper identifies visual tendency as a previously underexplored limitation, whereby synthetic data exhibit an unrealistic prevalence of visual attributes and thus deviate from the real-data distribution. Visual tendency can be attributed to the generator's conditioning on identity embeddings, through which co-occurring residual visual cues are unintentionally absorbed into learned identity semantics. To discourage the generator from exploiting such visual cues, this paper proposes SteerFace, a simple and efficient training framework that perturbs identity embeddings by steering them toward random orthogonal directions on the embedding hypersphere. The perturbation serves as an identity-preserving regularizer that penalizes the generator's reliance on non-identity components, as supported by theoretical analysis. This paper further introduces an adaptive strategy that learns perturbation strengths with both sample-wise preference and favorable overall statistics. Extensive experiments show that SteerFace effectively mitigates visual tendency, outperforms prior methods in downstream face recognition, and generalizes well across different training datasets and generation pipelines.
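
The perturbation step described above, steering an identity embedding toward a random orthogonal direction on the hypersphere, can be sketched geometrically. This is a minimal illustration, not the SteerFace implementation: the function name and the use of a rotation angle `strength` (in radians) as the perturbation strength are assumptions.

```python
import numpy as np

def steer_embedding(identity, strength, rng):
    """Rotate a unit identity embedding toward a random orthogonal direction.

    `strength` is a rotation angle in radians (an assumed parametrization).
    """
    e = identity / np.linalg.norm(identity)
    noise = rng.standard_normal(e.shape)
    # Remove the component along e so the direction is orthogonal to it.
    u = noise - np.dot(noise, e) * e
    u /= np.linalg.norm(u)
    # Moving along the great circle keeps the result on the unit hypersphere.
    return np.cos(strength) * e + np.sin(strength) * u

rng = np.random.default_rng(0)
identity = rng.standard_normal(512)  # stand-in for a face identity embedding
steered = steer_embedding(identity, strength=0.3, rng=rng)
```

Under this parametrization the cosine similarity between the original and steered embeddings equals `cos(strength)`, so the strength directly controls how much identity signal is retained.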

2605.30893 2026-06-01 cs.CV

Foundation VAEs for 3D CT Reconstruction, Augmentation, and Generation

Qi Chen, Shuhan Ding, Yu Gu, Nan Liu, Jiang Bian, Alan Yuille, Zongwei Zhou, Jingjing Fu

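AI summary: The paper finds that a Foundation VAE pretrained on natural images can be applied directly to CT reconstruction, augmentation, and generation without training or fine-tuning; with the encoder and decoder frozen, it preserves anatomy and suppresses noise, and yields notable gains on segmentation and generation tasks.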

Comments ICML 2026 Accepted

English abstract

Variational autoencoders (VAEs) compress high-resolution CT volumes into compact latents while preserving clinically relevant structure. However, training CT-specific VAEs from scratch or heavily fine-tuning them incurs substantial computational and engineering cost, and the resulting models often degrade under heterogeneous scanners, protocols, and diseases. This paper takes a step toward training-free medical VAEs by leveraging a critical observation: a single Foundation VAE, pretrained at scale on natural images and videos, can serve as a unified interface for CT Reconstruction, Augmentation, and Generation. With both encoder and decoder frozen, the Foundation VAE reconstructs CT volumes with preserved anatomy while suppressing acquisition noise; training segmentation models on these reconstructions improves surface accuracy by 3.9% NSD on average for pancreatic and lung tumors. Within the same Foundation VAE latent space, a conditional latent diffusion model achieves 3.9% lower average FVD with 36.2% higher CT CLIP score, and improves multi-disease generation faithfulness across 18 types by 2.76% AUC. These results demonstrate Foundation VAEs as a practical interface for scalable CT representation reuse and faithful CT generation. Our code and demo are available at https://github.com/qic999/Foundation-VAE.

2605.30892 2026-06-01 cs.LG

Bandwidth Allocation with Device Partitioning for Federated Learning over Industrial IoT networks

Kangmin Kim, Jaeyoung Song

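AI summary: To address the communication bottleneck of federated learning in the Industrial IoT, the paper proposes a bandwidth allocation policy that partitions devices by computing capability and sequentially grants the full bandwidth to each subset to minimize training time; it proves that the policy outperforms schemes without partitioning and also reduces uplink energy consumption.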

English abstract

We consider a federated learning (FL) system in which Industrial Internet-of-Things (IIoT) devices collaboratively train a global model over wireless channels without sharing local data. In such systems, communication time is a primary bottleneck that constrains overall training efficiency. Unlike conventional networks that prioritize individual quality-of-service requirements, FL systems collectively aim to converge to an optimal global model as efficiently as possible, which calls for a fundamentally different approach to bandwidth allocation. In this paper, we propose a novel bandwidth allocation policy that exploits the heterogeneity of device computing capabilities to minimize total training time. Rather than distributing bandwidth among all selected devices simultaneously, the proposed policy partitions the participating devices into ordered subsets and sequentially grants each subset exclusive access to the full bandwidth. We formally prove that this partitioning-based policy achieves a strictly lower training time than any bandwidth allocation scheme without partitioning, irrespective of the underlying scheduling algorithm. Furthermore, by reducing per-device transmission duration, the proposed policy also minimizes uplink energy consumption, which is particularly beneficial for battery-constrained IIoT devices. Extensive experiments on real-world datasets (GC10-Det, an industrial surface defect benchmark, and CIFAR-10, a standard image classification benchmark) demonstrate that the proposed policy consistently reduces training time and energy consumption compared to existing bandwidth allocation schemes, approaching the theoretical lower bound on round time.
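
The intuition behind sequential full-band access can be sketched with a toy timing model. Everything here is an illustrative assumption rather than the paper's system model: the function names, the made-up compute times, a fixed equal split of the band as the baseline, and subsets of size one in the partitioned case.

```python
def round_time_shared(compute_times, upload_time_full_band):
    """All devices split the band equally for the whole round (toy baseline)."""
    n = len(compute_times)
    # Each device gets 1/n of the band, so its upload takes n times longer.
    return max(c + n * upload_time_full_band for c in compute_times)

def round_time_partitioned(compute_times, upload_time_full_band):
    """Devices upload one at a time, fastest computer first, on the full band."""
    clock = 0.0
    for c in sorted(compute_times):
        # A device starts once it has finished computing and the band is free.
        clock = max(clock, c) + upload_time_full_band
    return clock

compute_times = [1.0, 2.0, 4.0]  # seconds of local training (made-up values)
shared = round_time_shared(compute_times, upload_time_full_band=1.0)
partitioned = round_time_partitioned(compute_times, upload_time_full_band=1.0)
```

With these numbers the shared round takes 7 seconds and the partitioned round 5 seconds, because fast devices finish uploading while slow devices are still computing.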

2605.30888 2026-06-01 cs.CL

The Flip Side of RLHF: On-Policy Feedback for Reward Model Self-Supervised Improvement

Xiaobo Wang, Tong Wu, Min Tang, Jiaqi Li, Qi Liu, Zilong Zheng

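AI summary: Proposes SAVE, a framework that uses the value function to generate on-policy feedback and updates the reward model through contrastive learning, outperforming existing methods on six benchmarks.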

English abstract

Building strong reward models (RMs) for language model alignment is bottlenecked by the cost and difficulty of acquiring diverse and reliable preference data from human annotation or judge models. The problem worsens dramatically as the policy evolves beyond what the static RM was trained on. Therefore, we propose SAVE (Self-supervised reward model improvement via Value-Anchored On-policy feedback), a framework for on-policy RM training that uses the value function to grade on-policy responses as feedback. SAVE converts the reward-graded on-policy responses into supervision, using a prompt-specific value head as an adaptive anchor. It computes RM advantages and filters ambiguous samples to update the RM via a contrastive objective. Empirical evaluation across six diverse benchmarks validates the effectiveness of SAVE for enhancing RM training. It outperforms existing methods on all datasets while maintaining consistent improvements across three RL algorithms (GRPO, RLOO, GSPO) and different policy backbones.

2605.30884 2026-06-01 cs.CV

GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning

Junlong Li, Chao Hao, Lap-Pui Chau, Yi Wang

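AI summary: Proposes GUI-C$^2$, a framework that combines difficulty-aware data selection with a coarse-to-fine reinforcement learning mechanism to address uneven training-sample difficulty and the crop-size trade-off in GUI grounding, achieving state-of-the-art performance.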

English abstract

Existing agentic reinforcement learning methods for GUI grounding have limitations at two levels. At the data level, current approaches typically treat all training samples equally, although their training value to the baseline model varies with difficulty. Overlooking this can greatly reduce training efficiency or even cause collapse. At the strategy level, existing frameworks struggle to balance the trade-off between cropping larger regions for sufficient context and smaller ones for reduced redundancy, a tension inherent to tool-augmented grounding agents. In addition, overly complex decision-making is difficult for small-parameter models and significantly increases inference time. To address these issues, at the data level, we propose GUI-D, a data mining and difficulty scoring pipeline that identifies the training-worthy samples by proper testing and assigns difficulty scores to guide subsequent training weights. At the strategy level, we propose GUI-C$^2$, which employs an area-gated coarse-to-fine refinement mechanism that progressively narrows the visual field via model-internal uncertainty signals, adaptively reserving context for large targets while amplifying precision for small ones, reinforced by improvement-aware stage rewards that ensure each refinement genuinely advances grounding. Meanwhile, we simplify the decision-making process to greatly reduce additional inference time. Finally, extensive experiments show that our method achieves state-of-the-art performance. The code and data will be publicly available.

2605.30876 2026-06-01 cs.CL

dMoE: dLLMs with Learnable Block Experts

Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma, Xinchao Wang

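AI summary: To address the inference memory bottleneck caused by the mismatch between block parallel decoding and token-level expert selection when diffusion LLMs are combined with Mixture-of-Experts architectures, the paper proposes dMoE, which aggregates token-level expert distributions within each block into a unified block-level distribution to reduce the number of activated experts, significantly lowering memory usage and latency while preserving performance.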

Comments Work in progress. Code is available at: https://github.com/fscdc/dMoE

English abstract

Diffusion Large Language Models (dLLMs) have recently emerged as a promising alternative to autoregressive models, offering competitive performance while naturally supporting parallel decoding. However, as dLLMs are increasingly integrated with Mixture-of-Experts (MoE) architectures to scale model capacity, a fundamental mismatch arises between block parallel decoding and token-level expert selection. Specifically, each dLLM forward pass processes multiple tokens with bidirectional dependencies, whereas conventional MoE layers route each token independently. This mismatch substantially increases the number of uniquely activated experts, making inference increasingly memory-bound. To address this, we propose dMoE, a simple yet effective block-level MoE framework. The central idea of dMoE is to aggregate token-level expert distributions within each block into a unified block-level expert distribution, which is then used to guide expert routing in a more coherent manner. In this way, dMoE substantially reduces the number of uniquely activated experts during inference without sacrificing performance, thereby mitigating the memory-bound bottleneck. Extensive experiments across a variety of benchmarks demonstrate the effectiveness of dMoE. On average, dMoE reduces the number of uniquely activated experts from 69.5 to 14.6 while retaining 99.11% of the original performance. Meanwhile, it reduces memory usage by 76.64% to 79.84% and achieves 1.14$\times$ to 1.66$\times$ end-to-end latency speedup. Code is available at: https://github.com/fscdc/dMoE
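
The contrast between per-token routing and block-level routing can be sketched as follows. This is an illustration under stated assumptions, not the dMoE implementation: the abstract does not say how the token-level distributions are aggregated, so a plain mean is assumed here, and the function names, block size, expert count, and `k` are hypothetical.

```python
import numpy as np

def token_level_experts(router_probs, k):
    """Union of each token's top-k experts (conventional per-token routing)."""
    top = np.argsort(router_probs, axis=-1)[:, -k:]
    return set(top.ravel().tolist())

def block_level_experts(router_probs, k):
    """Top-k experts of the block-averaged router distribution."""
    # Aggregate over the tokens in the block (mean is an assumed choice).
    block_dist = router_probs.mean(axis=0)
    return set(np.argsort(block_dist)[-k:].tolist())

rng = np.random.default_rng(0)
logits = rng.standard_normal((32, 64))  # 32 tokens in a block, 64 experts
probs = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)
per_token = token_level_experts(probs, k=8)
per_block = block_level_experts(probs, k=8)
```

Per-token routing can touch up to `min(32 * 8, 64)` distinct experts in this setup, whereas block-level routing loads exactly `k` experts for the whole block, which is the source of the memory saving the abstract describes.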

2605.30873 2026-06-01 cs.LG cs.AI cs.DC

Federated Variational Preference Alignment with Gumbel-Softmax Prior for Personalized User Preferences

Jabin Koo, Hoyoung Kim, Minwoo Jang, Jungseul Ok

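AI summary: Proposes FedVPA-GP, a framework that uses a Federated Mixture Prior and an Orthogonal Loss to handle conflicting user preferences and personalization in federated learning, outperforming monolithic models on the HH-RLHF dataset.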

Comments 21 pages, 4 figures. Accepted to ICML 2026

English abstract

Federated Learning (FL) offers a privacy-preserving pathway for aligning Large Language Models (LLMs); however, existing frameworks typically enforce a monolithic reward model, inevitably averaging out inherently conflicting user preferences (e.g., helpfulness vs. harmlessness). While Variational Preference Learning (VPL) offers a pathway to personalization, adapting it to decentralized settings presents a fundamental challenge: posterior collapse driven by severe local data scarcity and heterogeneity. In this paper, we propose Federated Variational Preference Alignment with Gumbel-Softmax Prior (FedVPA-GP), a framework designed to disentangle diverse preferences without compromising privacy. To stabilize variational inference, we introduce a Federated Mixture Prior that enables clients to leverage the aggregate population distribution as a dynamic prior. Furthermore, we incorporate an Orthogonal Loss that explicitly enforces the separation of preference prototypes in the latent space. Experiments on the HH-RLHF dataset demonstrate that FedVPA-GP significantly outperforms monolithic baselines, successfully disentangling conflicting user intents and enabling dynamic preference switching.
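
Two ingredients named in the abstract can be sketched generically. The first function is the standard Gumbel-Softmax relaxation for sampling a soft one-hot vector over discrete modes. The second is one plausible form of an orthogonality penalty on preference prototypes; its exact form in FedVPA-GP is not given in the abstract, so the squared off-diagonal Gram penalty, like all names here, is an assumption.

```python
import numpy as np

def gumbel_softmax_sample(logits, temperature, rng):
    """Draw a relaxed one-hot sample over discrete preference modes."""
    uniform = rng.uniform(low=1e-12, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(uniform))
    scaled = (logits + gumbel) / temperature
    scaled -= scaled.max()  # subtract the max for numerical stability
    weights = np.exp(scaled)
    return weights / weights.sum()

def orthogonal_loss(prototypes):
    """Penalize overlap between preference prototypes (one per row)."""
    unit = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    gram = unit @ unit.T
    # Off-diagonal entries are cosine similarities between prototypes.
    off_diagonal = gram - np.eye(len(prototypes))
    return float(np.sum(off_diagonal ** 2))

rng = np.random.default_rng(0)
sample = gumbel_softmax_sample(np.array([0.5, 1.5]), temperature=0.5, rng=rng)
aligned = np.array([[1.0, 0.0], [1.0, 0.0]])    # identical prototypes
separated = np.array([[1.0, 0.0], [0.0, 1.0]])  # orthogonal prototypes
```

The penalty is zero for orthogonal prototypes and grows as prototypes align, which matches the stated goal of separating preference prototypes in the latent space.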

2605.30865 2026-06-01 cs.LG

GlucoFM: A Dual-Stream Foundation Model for Continuous Glucose Monitoring

Zechen Li, Keerthana Natarajan, Weizhi Zhang, Menglian Zhou, Simon A. Lee, Yuwei Zhang, Maxwell A. Xu, Zeinab Esmaeilpour, Flora D. Salim, Mark Malhotra, Lindsey Sunden, Shwetak Patel, Yuzhe Yang, Ahmed A. Metwally

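AI summary: Proposes GlucoFM, a lightweight CGM foundation model that decomposes glucose dynamics into slow physiological state and transient event streams, improving average PR-AUC by 4.1 points over the best CGM-specific model across seven clinical prediction tasks.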

English abstract

Continuous glucose monitoring (CGM) provides a dense view of daily metabolic physiology, yet existing generic time-series and CGM-specific foundation models often encode glucose traces as entangled single-stream sequences, leaving the distinct temporal structure of glycemic dynamics only implicitly modeled. We present GlucoFM, a lightweight CGM foundation model that aligns irregular recordings to a 24-hour chronological grid, preserves observation masks, and decomposes glucose dynamics into slow physiological state and transient event streams, capturing low-frequency glycemic baselines and short-term deviations that may reflect acute physiological responses or sensor artifacts. GlucoFM is pretrained on 109,066 hours of unlabeled CGM recordings from 477 subjects with two complementary objectives: masked contextual latent prediction over fused daily representations and temporal dynamics prediction over state and event streams. Across four diverse cohorts and seven clinical prediction tasks, GlucoFM achieves the strongest subject-disjoint linear-probing performance among evaluated baselines, improving average PR-AUC by 4.1 points over the best CGM-specific foundation model. Its gains are most pronounced on core metabolic outcomes, leading PR-AUC on all diabetes-risk and $β$-cell dysfunction tasks and on 3 of 4 insulin-resistance tasks. GlucoFM also achieves the best overall cross-dataset transfer performance and strong few-shot adaptation among evaluated methods, and consistent gains when aggregating multiple days for subject-level prediction, highlighting physiology-aware decomposition as an effective inductive bias for transferable CGM representation learning.

2605.30863 2026-06-01 cs.CV cs.GR

DSD-GS: Dynamic-Static Decomposition of Gaussian Splatting for Efficient and High-Fidelity Dynamic Scene Reconstruction

Youngtae Han, Sung-hwan Han, Youngmin Yi

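AI summary: Proposes a dynamic-static decomposition framework built on a feed-forward Gaussian Splatting encoder and an optical flow model; by eliminating redundant computation on static regions, it achieves state-of-the-art rendering quality, training and rendering speed, and storage efficiency.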

Comments 23 pages, 9 figures, 7 tables

English abstract

Dynamic scene reconstruction and novel view synthesis are fundamental to next-generation visual intelligence applications such as virtual reality, robotics, and digital twins. However, high-fidelity reconstruction of complex, time-varying scenes from arbitrary viewpoints remains a significant challenge. Existing dynamic 3DGS methods suffer from computational inefficiency, since they model all Gaussians as dynamic components. While recent decomposition-based approaches address this issue, they still struggle with degraded reconstruction quality and prolonged training time. To mitigate these limitations, we propose a novel dynamic reconstruction framework built upon an efficient static-dynamic decomposition strategy using a Feed-Forward Gaussian Splatting encoder and an optical flow model. By eliminating redundant computations on static regions, our method achieves state-of-the-art performance, outperforming existing baselines across rendering quality, training and rendering speed, and storage efficiency. Notably, on the Neural 3D dataset, our framework requires only 10 minutes for training and achieves a rendering speed of over 700 FPS on a single NVIDIA RTX 5090 GPU at a resolution of 1352x1014. Furthermore, our decomposition strategy eliminates the need for COLMAP preprocessing and enables deterministic initialization, thereby enhancing both efficiency and reproducibility.

2605.30861 2026-06-01 cs.AI

Distilling LLM Feedback for Lean Theorem Proving

Gaetan Narozniak, Gérard Biau, Rémi Munos, Ahmad Rammal, Pierre Marion

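AI summary: Proposes Feedback Distillation, which trains the model to match, at the token level, its own distribution conditioned on privileged feedback from a language model, to address the sparse rewards and mode collapse of GRPO in reasoning post-training; it achieves better results on Lean4 theorem proving.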

English abstract

Post-training for reasoning models typically combines supervised fine-tuning with reinforcement learning from verifiable rewards, most commonly with GRPO. However, this algorithm suffers from sparse rewards, limited exploration, and mode collapse. Building upon recent works on self-distillation, we propose Feedback Distillation, a training method where the model is trained to match, at the token level, its own distribution conditioned on privileged feedback produced by a language model. Feedback Distillation offers token-level supervision and can inject external knowledge. Evaluating our method for Lean4 theorem-proving, we find that Feedback Distillation maintains greater diversity in generated trajectories than GRPO, yielding higher policy entropy and better pass@k scaling. The two methods are complementary: initializing GRPO from a Feedback Distillation checkpoint outperforms either method alone. All in all, our results suggest a promising avenue to improve post-training for complex reasoning.
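
The token-level matching objective described above can be sketched as a per-token divergence between two sets of logits from the same model: one computed from the prompt alone and one computed with feedback added to the context. This is a minimal sketch, not the paper's loss: the KL direction, the averaging over tokens, and all names are assumptions, and random arrays stand in for real model outputs.

```python
import numpy as np

def log_softmax(logits):
    """Numerically stable log-softmax over the vocabulary axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))

def token_level_distillation_loss(student_logits, teacher_logits):
    """Mean per-token KL(teacher || student) over a sequence.

    student_logits: outputs given the prompt alone, shape (T, V).
    teacher_logits: the same model's outputs when the context also holds
    feedback, shape (T, V); treated as a fixed target.
    """
    teacher_logp = log_softmax(teacher_logits)
    student_logp = log_softmax(student_logits)
    kl_per_token = (np.exp(teacher_logp) * (teacher_logp - student_logp)).sum(axis=-1)
    return float(kl_per_token.mean())

rng = np.random.default_rng(0)
student = rng.standard_normal((5, 10))  # 5 tokens, vocabulary of 10
teacher = rng.standard_normal((5, 10))
loss = token_level_distillation_loss(student, teacher)
```

Because every token position contributes its own term, the signal is dense, in contrast to a single sequence-level reward.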

2605.30859 2026-06-01 cs.LG cs.AI

DARTS: Distribution-Aware Active Rollout Trajectory Shaping for Accelerating LLM Reinforcement Learning

Yujie Wang, Siwei Chen, Longzan Luo, Xinyi Liu, Xupeng Miao, Fangcheng Fu, Bin Cui

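AI summary: To address the efficiency bottleneck caused by long-tail response length distributions in reinforcement learning, the paper proposes distribution-aware active trajectory shaping, which identifies intra-prompt long tails at fine granularity and trims ineffective verbosity, achieving up to 1.77x speedup without loss of model performance.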

Comments 16 pages, 14 figures, 5 tables. Accepted to ICML 2026

English abstract

Reinforcement Learning (RL) has become pivotal for improving model capabilities yet suffers from rollout efficiency bottlenecks due to the long-tail response length distribution. While existing works mitigate the impact of long tails via prompt-level tail scheduling, we focus on the root source of inefficiency: the distribution itself. Specifically, we characterize the long-tail distribution at a finer granularity, identifying intra-prompt long tails, and revealing that they frequently consist of ineffective verbosity. To address this, we propose a novel paradigm of active distribution shaping to shape the rollout distribution towards conciseness and certainty, thereby fundamentally resolving tail-induced overheads. We achieve this through a distribution-aware trajectory sampling mechanism, which selects trajectories from a redundant exploration space for each prompt, and an adaptive redundancy allocation scheme to maximize both shaping effectiveness and system efficiency. Experiments demonstrate significant acceleration over state-of-the-art systems by up to 1.77x without compromising model performance.

2605.30858 2026-06-01 cs.LG

ForecastCompass: Guiding Agentic Forecasting with Adaptive Factor Memory

Yurui Chang, Yongkang Du, Yuanpu Cao, Jinghui Chen, Lu Lin

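AI summary: Proposes ForecastCompass, a framework that combines a hierarchical forecasting-task taxonomy with a two-component memory (factor memory and reasoning memory) and iterative revision driven by retrospective analyses, improving the accuracy and calibration of agents' probabilistic forecasts in dynamic environments.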

English abstract

Agentic forecasting is important for decision-making in dynamic environments, but it remains challenging because agents must reason from incomplete, time-limited evidence and produce calibrated probabilities before outcomes are resolved. Memory provides a natural mechanism for transferring experience from resolved forecasts to future prediction tasks. However, existing agent-memory methods are not tailored to forecasting, as they typically store past interactions, reflections, or factual associations without explicitly representing reusable predictive factors or calibration knowledge. We propose ForecastCompass (FoCo), an adaptive factor-based memory framework for agentic forecasting. FoCo organizes forecasting experience with a hierarchical forecasting-task taxonomy, enabling retrieval of task-relevant forecasting knowledge. It maintains two complementary memory components: factor memory, which captures reusable predictive dimensions, and reasoning memory, which encodes probability updating, uncertainty handling, and calibration principles. Using retrospective analyses as learning signals, FoCo iteratively revises memory through a verbalized memory-revision procedure, enabling the agent to accumulate transferable forecasting knowledge over time. Experiments on Prophet Arena and FutureX with GPT-5-mini and Gemini-2.5-Flash show that FoCo improves both probabilistic accuracy and calibration.

2605.30857 2026-06-01 cs.CL

MADS: Model-Aware Diverse Core Set Selection for Instruction Tuning

Yi Bai, Wenhao Zhang, Yao Chen, Jiao Xue, Zhumin Chen, Pengjie Ren

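AI summary: Proposes a diverse core set selection method that distinguishes data features by neural activation states during model inference, improving LLM performance on multiple downstream tasks while reducing the amount of data.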

English abstract

Instruction fine-tuning is employed to enhance the instruction-following ability of large language models (LLMs). As the amount of instruction fine-tuning data increases, selecting the optimal core set becomes particularly important. However, ensuring the diversity of the core set remains a significant challenge. Existing methods predominantly distinguish different training data based on the text features themselves, decoupled from LLMs' own understanding and representation of the data. To address this issue, we propose a Model-Aware Diverse Core Set Selection method, which distinguishes data features based on the neural activation states during LLM inference. This approach serves as an efficient instantiation of coverage-based selection, using model-intrinsic activation features to ensure diversity in the core set. We extensively evaluate our method on six benchmarks that cover five distinct tasks. A core set selected by a 3B-parameter LLM remains effective when used to fine-tune larger models with 7B, 8B, and 13B parameters. Experimental results on the Alpaca-GPT4 dataset, which comprises 52K instruction-response pairs, show that the core set, sized at 15% of the original dataset and selected by Llama-3.2-3B-Instruct, achieves an average improvement of 2.5% when fine-tuning four larger base models compared with training on the full dataset. The experimental results demonstrate that our method enhances model performance on multiple downstream tasks while reducing data requirements.
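
Coverage-based selection over activation features can be illustrated with the classic farthest-point (greedy k-center) heuristic. This is a swapped-in generic technique, not the MADS algorithm: the abstract does not specify the selection rule, and the function name, the random stand-in features, and the budget are assumptions.

```python
import numpy as np

def greedy_k_center(features, budget):
    """Pick `budget` rows that cover the feature space (farthest-point greedy)."""
    selected = [0]  # start from an arbitrary example
    # Distance from every example to its nearest selected example.
    nearest = np.linalg.norm(features - features[0], axis=1)
    while len(selected) < budget:
        # Add the example that is currently covered worst.
        candidate = int(np.argmax(nearest))
        selected.append(candidate)
        distance = np.linalg.norm(features - features[candidate], axis=1)
        nearest = np.minimum(nearest, distance)
    return selected

rng = np.random.default_rng(0)
# Stand-in for per-example activation features extracted from a small LLM.
activations = rng.standard_normal((200, 16))
core_set = greedy_k_center(activations, budget=30)  # 15% of 200 examples
```

In practice the rows of `activations` would come from the selecting model's hidden states for each instruction-response pair, so that two examples count as similar when the model itself represents them similarly.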

2605.30852 2026-06-01 cs.CL

Speculative Pipeline Decoding: Higher-Accuracy and Zero-Bubble Speculation via Pipeline Parallelism

Yijiong Yu, Huazheng Wang, Shuai Yuan, Ruilong Ren, Ji Pei

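AI summary: Proposes Speculative Pipeline Decoding (SPD), a framework that uses pipeline parallelism to partition the target LLM into n pipeline stages that process n tokens in parallel; a speculation module aggregates intermediate features to predict the next token, giving bounded difficulty, high acceptance rates, and zero latency bubbles, and a significantly higher theoretical speedup.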

English abstract

Speculative Decoding (SD) accelerates low-concurrency LLM inference by employing a draft-then-verify paradigm. However, mainstream methods typically rely on multi-token prediction, which introduces escalating prediction difficulty and serial drafting latency. To address these issues, we propose Speculative Pipeline Decoding (SPD), a groundbreaking framework that unlocks the true potential of pipeline parallelism. By partitioning the target LLM into $n$ pipeline stages, SPD allows the LLM to process $n$ tokens in parallel to accelerate decoding. To continuously fill the pipeline in single-sequence decoding, a speculation module aggregates intermediate features across different pipeline depths to predict the next token, executing strictly in parallel with the target model's pipeline step, thereby realizing bounded difficulty, higher acceptance rates, and zero latency bubbles. Our experiments demonstrate that SPD achieves a significantly higher theoretical speedup compared to mainstream baselines, offering a highly scalable solution for LLM decoding acceleration. Our code is available at https://github.com/yuyijiong/speculative_pipeline_decoding
