arXivDaily: arXiv daily academic digest, updated Monday through Friday

2606.04045 2026-06-04 cs.LG cs.AI

Bayes-Sufficient Representations in Supervised Learning

Vasileios Sevetlidis

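Affiliations: Athena Research Center, Kimmeria Campus, Xanthi, Greece; Democritus University of Thrace, Vas. Sofias Campus, Xanthi, Greece; International Hellenic University, Serres, Greece

AI summary: Defines Bayes-sufficiency of a representation with respect to a loss function in supervised learning, introduces the Bayes quotient, and shows that a Bayes-minimal representation is informationally equivalent to that quotient; experiments separate sufficiency, minimality, and retained non-required information.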

Abstract

Representation learning is often described as preserving the information in an input that is relevant for prediction. This work asks what relevance means for a fixed supervised decision problem. A representation is defined to be Bayes-sufficient for a joint distribution and loss if some prediction head can use it to implement a Bayes-optimal action rule. This makes the target information loss-dependent. In the almost-surely unique Bayes-action case, the relevant object is a Bayes quotient, which identifies inputs that require the same Bayes-optimal action. A representation is sufficient when it refines this quotient, and Bayes-minimal when it is informationally equivalent to it. The framework connects naturally to property elicitation: zero-one loss requires the Bayes class, squared loss the conditional mean, Brier loss the conditional probability in binary prediction, and log loss or strictly proper scoring rules the predictive distribution. Controlled finite experiments, learned neural bottleneck experiments, and a real-data iNaturalist taxonomic refinement experiment illustrate the distinction between sufficiency, minimality, and retained non-required information. For a fixed supervised problem, the distribution and the loss determine the Bayes action, the Bayes action determines the quotient, and the quotient determines the minimal information required for Bayes-optimal prediction.
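
A minimal sketch of the Bayes quotient on a finite input space, assuming binary labels and a hypothetical table of posteriors P(Y = 1 | x). The function names and example values are illustrative and do not come from the paper. Under zero-one loss the Bayes action is the more probable class, while under squared loss it is the conditional mean, so the second quotient comes out finer.

```python
from collections import defaultdict

def bayes_quotient(posteriors, bayes_action):
    """Group inputs that share the same Bayes-optimal action."""
    cells = defaultdict(list)
    for x, p in posteriors.items():
        cells[bayes_action(p)].append(x)
    return sorted(sorted(cell) for cell in cells.values())

def zero_one_action(p):
    # Bayes action under zero-one loss: predict the more probable class.
    return int(p > 0.5)

def squared_loss_action(p):
    # Bayes action under squared loss: the conditional mean E[Y | x] = p.
    return p

# Hypothetical posteriors P(Y = 1 | x) for four inputs.
posteriors = {"a": 0.9, "b": 0.7, "c": 0.2, "d": 0.2}

q01 = bayes_quotient(posteriors, zero_one_action)
qsq = bayes_quotient(posteriors, squared_loss_action)
print(q01)  # [['a', 'b'], ['c', 'd']]
print(qsq)  # [['a'], ['b'], ['c', 'd']]
```

In this toy case a representation that merges `a` and `b` stays sufficient for zero-one loss but is no longer sufficient for squared loss, which is the loss dependence the abstract describes.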

2606.04040 2026-06-04 cs.SD cs.AI eess.AS

Channel-Oriented Design for EEG-to-Music Reconstruction

Jiaxin Qing, Junwei Lu, Lexin Li

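Affiliations: UC Berkeley; Harvard University

AI summary: EEG signals are weak and susceptible to noise and channel variability. The paper proposes a channel-oriented design (channel-wise tokenization, multi-view self-distillation, and data augmentation) that achieves stable alignment to a semantic music space within an encoding-alignment-decoding pipeline and significantly improves reconstruction performance.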

Abstract

Brain-computer interfaces aim to decode naturalistic stimuli from neural signals, yet most progress to date has focused on vision and language. In this article, we study a more challenging but far less explored setting, EEG-to-music reconstruction, where signals are weak, distributed, and highly susceptible to noise and channel variability. Our central finding is that early channel mixing destroys weak but discriminative EEG signals. To address this, we propose a channel-oriented design with three key components. Specifically, channel-wise tokenization treats each electrode as an explicit token to retain spatially localized neural evidence, channel-wise multi-view self-distillation enforces consistency across temporal crops and random channel subsets to learn robust and distributed representations, and channel-wise data augmentation introduces structured channel dropout to improve invariance to noise, artifacts, and missing electrodes. Together, these components preserve weak yet informative signals across channels and enable stable alignment to a semantic music representation space. We integrate this channel-oriented design within an encoding-alignment-decoding pipeline for EEG-to-music reconstruction. Theoretically, we characterize when preserving channel-level structure leads to improved alignment. Empirically, we compare with a range of state-of-the-art baselines and demonstrate consistent and significant performance gains.

2606.04036 2026-06-04 cs.LG

Self-Distilled Policy Gradient

Yifeng Liu, Shiyuan Zhang, Yifan Zhang, Quanquan Gu

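Affiliations: Department of Computer Science, University of California, Los Angeles, CA, USA; Princeton AI Laboratory, Princeton University, Princeton, NJ, USA

AI summary: Proposes the SDPG framework, which combines group-relative verifier advantages with normalized standard deviation, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization to improve stability and performance in sparse-reward reinforcement learning.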

Abstract

On-policy self-distillation, where a language model conditions on privileged context to supervise its own generations, is a promising source of dense supervision for sparse-reward reinforcement learning. In practice, it can be instantiated as an auxiliary full-vocabulary student-to-teacher reverse Kullback-Leibler divergence loss. We therefore propose SDPG, a self-distilled policy-gradient framework that combines group-relative verifier advantages with normalized standard deviation, exact full-vocabulary on-policy self-distillation, and reference-policy KL regularization. Empirically, SDPG improves stability and performance over RLVR and self-distillation baselines. The code is available at https://github.com/lauyikfung/SDPG.
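
A rough sketch of two of the loss ingredients named above, under stated assumptions. Advantages are standardized within a group of sampled responses (a GRPO-style normalization is assumed here; the abstract does not spell out its exact form), and the self-distillation term is the exact reverse KL divergence KL(student || teacher) summed over the full vocabulary at one token position. All function names and numbers are hypothetical and are not taken from the SDPG repository.

```python
import math

def group_relative_advantages(rewards, eps=1e-6):
    """Standardize verifier rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]

def reverse_kl(student_probs, teacher_probs):
    """Exact KL(student || teacher) over the full vocabulary at one position."""
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0.0
    )

# Hypothetical sparse 0/1 rewards for four responses to the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Hypothetical next-token distributions over a three-word vocabulary.
student = [0.7, 0.2, 0.1]
teacher = [0.5, 0.3, 0.2]
kl = reverse_kl(student, teacher)
```

The reverse direction weights the log-ratio by the student's own probabilities, so the penalty concentrates on tokens the student actually tends to produce.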

2606.04033 2026-06-04 cs.LG

Inverse Critical Experiment Design via Gradient Optimization and a Multigroup Attention-Based Neural Network Architecture

Will Savage, Logan Burnett, Dean Price

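Affiliations: Massachusetts Institute of Technology

AI summary: Proposes a method that combines a deep neural network surrogate model with nonparametric gradient optimization to inverse-design critical experiment geometries that maximize the neutronic similarity coefficient $c_k$; applied to validation of the TN-LC transportation cask, it reaches $c_k$ scores of 0.97757, 0.81324, and 0.93276 for three configurations.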

Abstract

The validation of advanced nuclear reactor designs and fuel concepts requires critical experiments with high neutronic similarity to the target technology. Neutronic similarity is quantified by the correlation coefficient $c_k$, which captures the shared bias in $k_\text{eff}$ induced by uncertainties in nuclear data. Generally, a $c_k\geq0.9$ is needed for an experiment to be sufficiently similar to a target technology. This work presents a methodology for the inverse design of critical experiments. Deep neural network surrogate modeling and nonparametric gradient optimization are used to generate experiment geometries that maximize $c_k$. A deep neural network is trained on OpenMC-calculated sensitivity vectors for grid-based critical experiment geometries. The model architecture combines a U-Net convolutional encoder-decoder with a novel multigroup attention pooling layer, introduced to capture the differing spatial dependencies of sensitivities. Multigroup attention pooling is shown to achieve better performance than traditional pooling, as well as interpretable internal behavior. The differentiability of the surrogate enables gradient-based optimization of the full combinatorial design space, allowing $c_k$ to be maximized by directly changing the material assignment of each position in the geometry grid. The method is applied to the validation of the TN-Americas TN-LC transportation cask with HALEU fuel, for which existing critical experiment coverage is limited. The optimization procedure is shown to produce experiment geometries achieving $c_k$ scores of 0.97757, 0.81324, and 0.93276 for three configurations of interest. This approach demonstrates the potential of deep learning and gradient optimization to accelerate the development of advanced nuclear technology.
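
For orientation, $c_k$ is commonly computed from two sensitivity vectors and a nuclear-data covariance matrix as $c_k = s_a^\top C s_e / \sqrt{(s_a^\top C s_a)(s_e^\top C s_e)}$. The sketch below assumes that standard form; the two-entry vectors and the covariance matrix are made-up toy values, not data from the paper, and the function names are hypothetical.

```python
import math

def quad_form(u, cov, v):
    """Compute u^T C v for plain Python lists."""
    return sum(
        u[i] * cov[i][j] * v[j]
        for i in range(len(u))
        for j in range(len(v))
    )

def c_k(s_app, s_exp, cov):
    """Correlation of the k_eff uncertainty shared by an application and an experiment."""
    shared = quad_form(s_app, cov, s_exp)
    return shared / math.sqrt(
        quad_form(s_app, cov, s_app) * quad_form(s_exp, cov, s_exp)
    )

# Toy covariance of two nuclear-data parameters and two sensitivity vectors.
cov = [[0.04, 0.01], [0.01, 0.09]]
s_app = [0.5, 0.2]   # target application (hypothetical)
s_exp = [0.4, 0.3]   # candidate experiment (hypothetical)

similarity = c_k(s_app, s_exp, cov)
meets_threshold = similarity >= 0.9  # the usual acceptance level quoted above
```

An inverse-design loop of the kind described would adjust the experiment geometry so that its sensitivity vector raises this ratio.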

2606.04031 2026-06-04 cs.LG math.OC stat.ML

Pseudospectral Bounds for Transient Amplification in Coupled Gradient Descent

Ahanaf Hasan Ariq

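Affiliations: Ideal School and College

AI summary: For the transient amplification caused by non-normality of block-triangular Jacobians in coupled gradient descent, the paper develops a sharp pseudospectral theory, gives an upper bound on the Kreiss constant with matching minimax lower bounds, and derives a finite-horizon iteration-complexity bound for stochastic coupled descent.

Comments: 11 pages, 3 tables. Accepted as poster at HiLD 2026 (4th Workshop on High-dimensional Learning Dynamics, ICML 2026)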

Abstract

Coupled gradient descent--where the update of one parameter block depends on another--underlies bilevel optimization, two-time-scale stochastic approximation, and adversarial training. When the coupled Jacobian is block-triangular, asymptotic stability is governed by the spectral radii of the diagonal blocks, yet transient amplification before convergence can be arbitrarily large due to non-normality. We develop a sharp pseudospectral theory for such block-triangular Jacobians, proving that the Kreiss constant satisfies $K(J) \leq 2/(1-\gamma) + \|C\|/(4(1-\gamma))$ when the diagonal blocks are symmetric with spectral radii at most $\gamma < 1$, and we establish matching minimax lower bounds. We characterize the critical coupling threshold for spectral instability and extend the analysis to nearly self-referential systems via a Neumann-series perturbation framework. As a consequence, we obtain a finite-horizon iteration-complexity bound of $O(K(J)^2 \log(1/\delta))$ for stochastic coupled descent. Framed as scaling laws for non-stationary two-time-scale optimization, our results expose a non-asymptotic, instance-dependent regime of high-dimensional learning dynamics that is invisible to spectral-radius analysis. Experiments on linear-quadratic problems, IQC-based comparisons, and neural-network training confirm the theory.
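
A small numerical illustration of transient amplification, assuming the simplest block-triangular case with scalar diagonal blocks, J = [[gamma, c], [0, gamma]], whose k-th power has off-diagonal entry k * c * gamma^(k-1). The helper names are hypothetical; `kreiss_upper_bound` only evaluates the inequality quoted in the abstract and does not reproduce the paper's proof.

```python
def kreiss_upper_bound(gamma, coupling_norm):
    """Evaluate 2/(1 - gamma) + ||C|| / (4 (1 - gamma)) from the abstract."""
    return 2.0 / (1.0 - gamma) + coupling_norm / (4.0 * (1.0 - gamma))

def off_diagonal_of_power(gamma, c, k):
    """Entry (1, 2) of J^k for J = [[gamma, c], [0, gamma]]."""
    return k * c * gamma ** (k - 1)

gamma, c = 0.9, 1.0

# The spectral radius is 0.9 < 1, yet the coupling term grows before it decays.
growth = [off_diagonal_of_power(gamma, c, k) for k in range(1, 200)]
peak = max(growth)
peak_step = growth.index(peak) + 1

bound = kreiss_upper_bound(gamma, c)
print(f"peak |J^k[0][1]| = {peak:.3f} at k = {peak_step}")
print(f"Kreiss bound from the abstract = {bound:.3f}")
```

The off-diagonal entry starts at 1, climbs to almost 4 around the ninth or tenth step, and only then decays, which spectral-radius analysis alone would not reveal.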

2606.04028 2026-06-04 cs.LG

Novel Aspects of IEEE SA P3109 Arithmetic Formats for Machine Learning

Andrew Fitzgibbon, Christoph M. Wintersteiger, Jeffrey Sarnoff

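Affiliations: Imandra, Inc.

AI summary: Introduces the IEEE P3109 draft standard, which defines a parameterized family of binary floating-point formats and associated operations aimed at machine learning, and presents kappa-approximation, a novel measure for describing approximate implementations.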

Abstract

The IEEE P3109 draft standard defines a parameterized family of binary floating-point formats and associated operations, with a focus on facilitating machine learning. These formats allow efficient and consistent representation of values in a small number of bits. The defined formats are parameterized over width and precision in bits, signedness, and the presence of infinities. Operations are defined by decoding floating-point values to the set of closed extended reals: the reals augmented with positive and negative infinity and NaN (Not a Number). Explicit treatment of NaN and infinite operands ensures that only real arithmetic is invoked in operation definitions. Extensive rounding and saturation modes are defined; stochastic rounding is included. Operations are exception-free, accelerating throughput, with exceptional situations communicated through return values, e.g., NaN. Operations on blocks of values sharing a common scale factor are defined in terms of the underlying operations in a uniform manner. System vendors may describe approximate implementations via a novel scale-invariant measure, akin to units in the last place, called kappa-approximation. Standard function definitions and various other properties are mechanically verified and generated using formal specifications.
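
Stochastic rounding, one of the modes mentioned above, can be sketched generically: a value is rounded up to the next representable point with probability equal to its fractional distance from the lower point, so the result is unbiased in expectation. The sketch rounds to a uniform grid of spacing `step`, which is a simplifying assumption (real floating-point grids are non-uniform), and it is not the normative P3109 definition.

```python
import math
import random

def stochastic_round(x, step, rng):
    """Round x to a multiple of step, rounding up with probability equal to the fractional part."""
    scaled = x / step
    lower = math.floor(scaled)
    frac = scaled - lower
    round_up = rng.random() < frac
    return (lower + (1 if round_up else 0)) * step

rng = random.Random(0)  # fixed seed so the run is reproducible

# 0.3 lies between the grid points 0.25 and 0.5, one fifth of the way up.
samples = [stochastic_round(0.3, 0.25, rng) for _ in range(20000)]
mean = sum(samples) / len(samples)
```

Round-to-nearest would always return 0.25 here, whereas the stochastic mode returns 0.5 about a fifth of the time so that the average stays close to 0.3.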

2606.03995 2026-06-04 cs.LG cs.AI q-bio.QM

Early Detection of Alzheimer's Disease Using Explainable Machine Learning on Clinical Biomarkers: A Multi-Class Classification Study Using the Alzheimer's Disease Neuroimaging Initiative (ADNI) Dataset

Afshan Hashmi

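Affiliations: TRDC, Tuwaiq Academy

AI summary: Uses an XGBoost classifier on eight clinical features from the ADNI dataset (MMSE, CDR Global, CDR-SB, MoCA, FAQ, age, sex, education) for three-class detection (normal cognition, mild cognitive impairment, Alzheimer's disease), with SMOTE for class imbalance, Optuna for hyperparameter optimization, and SHAP for explainability. It reaches macro AUC 0.982 and accuracy 0.943 on the held-out test set and reveals clinically plausible feature-importance patterns.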

Abstract

Background: Alzheimer's disease (AD) affects over 55 million people worldwide. Accurate, interpretable detection of normal cognition (NC), mild cognitive impairment (MCI), and AD from routine clinical assessments remains a critical unmet need. Methods: An XGBoost classifier was developed for three-class detection using eight clinical features from the Alzheimer's Disease Neuroimaging Initiative (ADNI): MMSE, CDR Global, CDR Sum of Boxes (CDR-SB), MoCA, FAQ, age, sex, and education. Hyperparameters were optimised using Optuna (50 trials); class imbalance was addressed with SMOTE. Performance was evaluated by macro AUC-ROC with 1,000-iteration bootstrap 95% confidence intervals, macro F1, balanced accuracy, and Cohen's kappa. SHAP values provided feature-level explainability. Results: The dataset comprised 1,641 baseline subjects (608 NC, 767 MCI, 266 AD). On five-fold cross-validation, mean macro AUC was 0.983 (SD 0.007), accuracy 0.944 (SD 0.006), and macro F1 0.929 (SD 0.008). On the held-out test set (n = 247), macro AUC was 0.982 (95% CI: 0.965--0.995), accuracy 0.943, balanced accuracy 0.932, macro F1 0.927, and Cohen's kappa 0.909. SHAP analysis identified CDR Global as the dominant predictor for NC and MCI, while CDR-SB and MMSE together drove AD classification. Conclusion: An explainable machine learning model trained on routine clinical assessments achieves near-perfect three-class Alzheimer's detection. SHAP analysis reveals clinically plausible, class-specific feature importance patterns supporting clinical validity. Future work will extend this framework with speech biomarkers for multimodal detection.
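
Two of the reported metrics, balanced accuracy and Cohen's kappa, can both be read off a confusion matrix. The sketch below uses a hypothetical 3x3 matrix (rows are true NC/MCI/AD, columns are predictions); the counts are illustrative only and are not the study's results.

```python
def balanced_accuracy(cm):
    """Mean per-class recall; rows are true classes, columns are predictions."""
    return sum(row[i] / sum(row) for i, row in enumerate(cm)) / len(cm)

def cohens_kappa(cm):
    """Agreement between labels and predictions, corrected for chance agreement."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(n)) / total
    expected = sum(
        sum(cm[i]) * sum(row[i] for row in cm) for i in range(n)
    ) / total ** 2
    return (observed - expected) / (1.0 - expected)

# Hypothetical confusion matrix: true NC, MCI, AD by predicted NC, MCI, AD.
cm = [
    [90, 10, 0],
    [8, 100, 7],
    [0, 5, 35],
]

bal_acc = balanced_accuracy(cm)
kappa = cohens_kappa(cm)
```

Balanced accuracy averages recall over classes, so the small AD group counts as much as the large MCI group, which matters for an imbalanced cohort like the one described.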

2605.04356 2026-06-04 cs.LG cs.AI

Efficiently Aligning Language Models with Online Natural Language Feedback

Christine Ye, Joe Benton

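Affiliations: GitHub

AI summary: Proposes replacing verifiable rewards with online natural language feedback: the model is trained by iteratively optimizing against a proxy reward model and collecting fresh expert supervision at the point of over-optimization, which aligns language models efficiently in fuzzy domains. Experiments show substantially improved data efficiency of expert supervision.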

Abstract

Reinforcement learning with verifiable rewards has been used to elicit impressive performance from language models in many domains. However, broadly beneficial deployment of AI may require us to train models with strong capabilities in "fuzzy", hard-to-supervise domains. In this paper, we use online natural language feedback to develop methods for aligning language models in fuzzy domains where human experts can still provide a high-quality supervision signal, but only for a small number of model outputs. Specifically, we train models by iteratively optimizing against proxy reward signals, stopping at the point of over-optimization, collecting fresh expert supervision, and updating the proxy reward. We construct proxy reward models from language models using in-context learning (ICL) and fine-tuning. We test our methods by eliciting creative writing and alignment research capabilities in Qwen3-8B and Haiku 4.5 respectively. For Qwen3-8B, ICL methods recover up to 35% of performance with 50x fewer expert samples, while fine-tuning methods recover 80% with up to 20x fewer samples and 100% with 3x fewer samples. For Haiku 4.5, ICL methods recover up to 35% of performance with 30x fewer samples, and fine-tuning methods recover 100% with 10x fewer samples. Our results suggest that online natural language feedback can substantially improve the data efficiency of expert supervision.

2606.03988 2026-06-04 cs.AI

Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Mahtab Bigverdi, Linjie Li, Weikai Huang, Yiming Liu, Jaemin Cho, Jieyu Zhang, Tuhin Kundu, Chris Dangjoo Kim, Zelun Luo, Linda Shapiro, Ranjay Krishna

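Affiliations: University of Washington; Allen Institute for AI; Microsoft; OpenAI

AI summary: Proposes Imaginative Perception Tokens (IPT) as intermediate perceptual representations. Supervising on them improves multimodal language models on spatial reasoning tasks such as reasoning from unseen viewpoints and tracing occluded paths, and often outperforms textual chain-of-thought training on three newly constructed datasets.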

Abstract

Vision language models (VLMs) excel at many tasks but still struggle with spatial reasoning when critical information is not directly observable. Many such problems require imaginative perception: inferring what would be seen from an unseen viewpoint, tracing paths through occluded spaces, or integrating partial observations into a coherent spatial representation. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under alternative spatial configurations while remaining consistent with the observed input. To study this capability, we formulate three tasks, Perspective Taking (PET), Path Tracing (PT), and Multiview Counting (MVC), and construct datasets of approximately 20K examples with ground truth imaginations, answers, and evaluation benchmarks. Using the unified VLM BAGEL as the backbone, IPT supervision consistently improves spatial reasoning and often outperforms textual chain of thought training, even without generating images at inference time. On MVC, IPT improves accuracy by 3.4% and achieves competitive performance with strong closed-source models on PT. We further find that combining IPT and label-only supervision yields additional gains, whereas textual chain of thought can substantially degrade performance, suggesting a modality mismatch when spatial computation is forced through language. Overall, IPT provides a principled supervision signal for reasoning about unobserved spatial structure, improving generalization while producing interpretable intermediate representations.

2606.03972 2026-06-04 cs.CV

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

Haobo Li, Yanhong Zeng, Yunhong Lu, Jiapeng Zhu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yujun Shen, Zhipeng Zhang

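Affiliations: University of Science and Technology of China

AI summary: Proposes AAD-1, an asymmetric adversarial distillation framework that breaks the symmetry between generator and discriminator and uses a phased training strategy to address motion collapse and training instability in one-step autoregressive video generation, achieving state-of-the-art performance.

Comments: ICML 2026. Project page: https://aad-1.github.io/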

Abstract

We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video generation. State-of-the-art methods adopt adversarial distillation but suffer from motion collapse and training instability, resulting in static videos. AAD-1 addresses these challenges through two key designs in architecture and training strategy. Our key architectural insight is to break the symmetry between generator and discriminator. While the generator remains causal to preserve autoregressive sampling capability, the discriminator attends bidirectionally over the full spatiotemporal context and produces a single holistic realism score for the entire video sequence. This asymmetric design enables the discriminator to effectively detect global temporal failures and long-range drift that cause motion collapse in autoregressive generation. To stabilize training, we introduce a phased strategy that first uses distribution matching to bootstrap a stable one-step generator, providing a warm-up phase that brings the student distribution closer to the teacher before adversarial distillation begins. Extensive experiments on VBench demonstrate that AAD-1 achieves state-of-the-art performance in one-step autoregressive video generation.
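
The asymmetry can be pictured with attention masks over frames: the generator sees only the current and past frames, while the discriminator sees all of them. This is a schematic at frame granularity with hypothetical function names; the actual models attend over spatiotemporal tokens, and the paper's mask layout may differ.

```python
def causal_mask(num_frames):
    """Generator-style mask: frame i may attend to frames 0..i only."""
    return [[j <= i for j in range(num_frames)] for i in range(num_frames)]

def bidirectional_mask(num_frames):
    """Discriminator-style mask: every frame may attend to every frame."""
    return [[True] * num_frames for _ in range(num_frames)]

gen_mask = causal_mask(4)
disc_mask = bidirectional_mask(4)

# Count the allowed (query, key) pairs under each mask.
gen_pairs = sum(map(sum, gen_mask))
disc_pairs = sum(map(sum, disc_mask))
print(gen_pairs, disc_pairs)  # 10 16
```

Because the last row of the causal mask is the only one that covers the whole clip, a causal critic would judge most frames without seeing what follows them, whereas the bidirectional critic can score drift across the entire sequence.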

2606.03943 2026-06-04 cs.RO cs.CV cs.LG

PointAction: 3D Points as Universal Action Representations for Robot Control

Mutian Tong, Han Jiang, Qiao Feng, Lingjie Liu, Jiatao Gu

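Affiliations: University of Pennsylvania

AI summary: Proposes PointAction, which fine-tunes a video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, uses the point dynamics as an embodiment-agnostic action interface, and maps them to executable actions with a diffusion action decoder, reducing the ambiguity of RGB-only action grounding and enabling transfer across tasks and embodiments.

Comments: Project page: https://oriontmt.github.io/pointaction/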

Abstract

Video-Action Models (VAMs) leverage the broad visual dynamics captured by pre-trained video diffusion models, offering a promising path toward generalizable robot manipulation. However, RGB-only video rollouts are not directly actionable: they leave metric 3D motion, contact geometry, and fine-grained spatial constraints under-specified, making action grounding ambiguous. Meanwhile, scaling action supervision across diverse tasks and embodiments remains costly. We present PointAction, a framework that bridges video predictions to robot actions through explicit point-based 4D modeling. PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, producing temporally consistent 3D motion of task-relevant scene geometry. These point dynamics serve as a structured, embodiment-agnostic action interface, which a diffusion-based action decoder maps to executable robot actions. By using metric 3D point dynamics as the interface between video prediction and control, PointAction reduces the ambiguity of RGB-only action grounding and supports transfer across tasks and embodiments with limited action supervision. Experiments show that PointAction achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.

2606.03938 2026-06-04 cs.LG cs.AI

q0: Primitives for Hyper-Epoch Pretraining

Bishwas Mandal, Shmuel Berman, Akshay Vegesna, Samip Dahal

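Affiliations: Q Labs; Princeton University

AI summary: To address the saturation of single-model performance in multi-epoch training, proposes hyper-epoch pretraining (q0), which uses three primitives (a cyclic schedule, chain distillation, and a learned prior) to turn a multi-epoch budget into a diverse population of models and aggregate their predictions, significantly improving data efficiency.

Comments: 22 pages, 5 figures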

Abstract

Multi-epoch training is becoming the standard now that compute is growing faster than the supply of high-quality text. But pretraining a single model saturates within a few passes, long before the compute budget is exhausted. We argue this calls for a conceptual shift from training a single model toward exploring a population of models and aggregating their predictions. We introduce hyper-epoch pretraining (q0), which turns a multi-epoch budget into a population of diverse models whose combined predictions reach a lower validation loss than a single refined model. q0 reduces to three core primitives. A cyclic schedule with anti-correlated learning rate and weight decay collects diverse models from a few parallel trajectories. Chain distillation trains each model against its predecessor so that model quality compounds across the population. A learned prior, fit on a held-out set, selects and weights members for any inference budget. On a 1.8B-parameter model trained on 100M FineWeb tokens, q0 matches a strong 256-epoch ensemble baseline using only ~56 epochs (~4.6x fewer), or ~67 epochs (~3.8x fewer) when matched to the baseline's ensemble size, and continues to improve beyond it. These gains reach cumulative ~12.9x data efficiency under the Slowrun setting and transfer to downstream benchmarks. Crucially, the optimal allocation shifts with the budget, so we give prescriptive recipes for how to spend a given epoch budget to maximize generalization, from a single epoch up to the largest budgets.
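
One way to picture the cyclic schedule is a cosine cycle in which weight decay moves opposite to the learning rate, with a model snapshot taken at the end of each cycle. The functional form, the value ranges, and the names below are assumptions for illustration; the abstract states only that the two quantities are anti-correlated.

```python
import math

def cyclic_schedule(step, cycle_len, lr_range=(1e-4, 1e-3), wd_range=(0.01, 0.1)):
    """Return (lr, weight_decay) for a step; the two move in opposite directions."""
    phase = (step % cycle_len) / cycle_len        # position within the cycle, in [0, 1)
    wave = 0.5 * (1.0 + math.cos(math.pi * phase))  # 1 at cycle start, near 0 at cycle end
    lr = lr_range[0] + (lr_range[1] - lr_range[0]) * wave
    wd = wd_range[0] + (wd_range[1] - wd_range[0]) * (1.0 - wave)
    return lr, wd

# Three cycles of 100 steps each; a snapshot would be saved at steps 99, 199, 299.
schedule = [cyclic_schedule(s, cycle_len=100) for s in range(300)]
snapshot_steps = [s for s in range(300) if s % 100 == 99]
```

Each cycle starts with a high learning rate and low weight decay and ends with the reverse, after which the schedule restarts; the restart is what would push successive snapshots apart in this sketch.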

2606.03937 2026-06-04 cs.AI

Entropy Is Not Enough: Unlocking Effective Reinforcement Learning for Visual Reasoning via Vision-Anchored Token Selection

Senjie Jin, Peixin Wang, Boyang Liu, Xiaoran Fan, Shuo Li, Zhiheng Xi, Jiazheng Zhang, Yuhao Zhou, Tao Gui, Qi Zhang, Xuanjing Huang

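Affiliations: College of Computer Science and Artificial Intelligence, Fudan University

AI summary: To address the failure of entropy-based credit assignment in visual reasoning, proposes the VEPO framework, which redirects gradient credit through a multiplicative coupling of visual sensitivity and token entropy, significantly improving multimodal reinforcement learning performance.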

Abstract

While token-level entropy is commonly recognized as effective for credit assignment in text-only reinforcement learning with verifiable rewards (RLVR), it remains unclear whether this mechanism still holds in visual reasoning. Our controlled study shows that this mechanism collapses in visual reasoning due to the omission of vision-sensitive tokens with naturally low entropy. Although existing multimodal RL methods increasingly acknowledge the importance of visual perception, they struggle to satisfy the inherent demand for interleaving precise perceptual grounding with semantic reasoning, either lacking systematic visual measurements or overlooking that token entropy primarily drives semantic exploration. To address this, we introduce VEPO (Vision-Entropy token-selection for Policy Optimization), an effective RL framework that explicitly integrates visual sensitivity with token entropy via a principled multiplicative coupling, redirecting gradient credit toward tokens that are simultaneously visually grounded and highly informative. Extensive experiments demonstrate VEPO's leading performance, significantly outperforming the entropy-only baseline by 2.28 points at 7B-scale and 3.15 points at 3B-scale. Ablations further substantiate the soundness of our method.
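
A toy sketch of multiplicative coupling for token selection. Token entropy is computed from the policy's next-token distribution; visual sensitivity is approximated here as the KL divergence between predictions made with and without the image, which is an assumption of this sketch rather than the paper's stated measurement. Names and numbers are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def kl(p, q):
    """KL(p || q), used here as a stand-in for visual sensitivity."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0.0)

def select_tokens(with_image, without_image, keep):
    """Rank token positions by entropy * visual sensitivity and keep the top ones."""
    scores = [entropy(p) * kl(p, q) for p, q in zip(with_image, without_image)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep]), scores

# Three token positions, two-word vocabulary (hypothetical distributions).
with_image = [[0.5, 0.5], [0.9, 0.1], [0.6, 0.4]]
without_image = [[0.5, 0.5], [0.2, 0.8], [0.55, 0.45]]

selected, scores = select_tokens(with_image, without_image, keep=1)
entropy_only = max(range(3), key=lambda i: entropy(with_image[i]))
```

An entropy-only rule picks position 0 here, the highest-entropy token, even though removing the image does not change its distribution; the product instead selects position 1, whose prediction depends strongly on the image.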

2606.03899 2026-06-04 cs.LG

Denoise First, Orthogonalize Later: Understanding Momentum in Muon via Spectral Filtering

Xianliang Li, Zihan Zhang, Weiyang Liu, Han Bao

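Affiliations: The Institute of Statistical Mathematics; The Graduate Institute for Advanced Studies, SOKENDAI; National Institute of Informatics; The Chinese University of Hong Kong; Tohoku University; RIKEN AIP

AI summary: Using spectral filtering theory, the paper proves that momentum in the Muon optimizer suppresses gradient perturbations and enlarges the spectral gap, which stabilizes the orthogonalization step, and that applying momentum before orthogonalization is better than reversing the order or removing momentum.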

Abstract

Muon has recently demonstrated strong empirical performance in large language model training, but the theoretical role of momentum in Muon remains unclear. Existing analyses of Muon either remove momentum to study spectral updates in isolation, or retain momentum without explaining why it improves empirical performance. Our work bridges this gap by showing that momentum in Muon acts as a spectral filter. Under a structured signal-plus-perturbation gradient model, we prove that momentum suppresses perturbations while preserving the dominant signal, thereby enlarging the spectral gap between them. This enlarged gap stabilizes the singular subspaces of the matrix passed to Muon's orthogonalization step, making the resulting update more reliable. We further show that applying momentum before orthogonalization achieves provably stronger alignment with the signal component of the gradient than either reversing this order or simply removing momentum. Experiments across diverse tasks, including LLM pretraining, support our theoretical analysis. More broadly, our theory offers a starting point for understanding the benefits of momentum in other matrix-based optimizers.
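
The "momentum first, orthogonalize later" ordering can be sketched in a few lines of plain Python. Orthogonalization is done here with the textbook cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X, a simplification chosen for clarity; practical Muon implementations use a tuned polynomial, and the function names and the 2x2 example are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def transpose(a):
    return [list(col) for col in zip(*a)]

def orthogonalize(m, iters=30):
    """Approximate the polar factor U V^T of m with cubic Newton-Schulz steps."""
    norm = sum(v * v for row in m for v in row) ** 0.5
    x = [[v / norm for v in row] for row in m]  # singular values now lie in (0, 1]
    for _ in range(iters):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [
            [1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(len(x[0]))]
            for i in range(len(x))
        ]
    return x

def muon_like_step(momentum, grad, beta=0.9):
    """Update the momentum buffer first, then orthogonalize the buffer."""
    momentum = [
        [beta * mv + gv for mv, gv in zip(m_row, g_row)]
        for m_row, g_row in zip(momentum, grad)
    ]
    return momentum, orthogonalize(momentum)

grad = [[3.0, 1.0], [1.0, 2.0]]          # toy full-rank gradient
momentum = [[0.0, 0.0], [0.0, 0.0]]
momentum, update = muon_like_step(momentum, grad)
gram = matmul(update, transpose(update))  # should be close to the identity
```

Reversing the order would orthogonalize each noisy gradient before averaging, which is the variant the abstract argues aligns less well with the signal component.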

2606.03892 2026-06-04 cs.CL cs.AI cs.LG

Synthesize and Reward -- Reinforcement Learning for Multi-Step Tool Use in Live Environments

Ibrahim Abdelaziz, Asim Munawar, Kinjal Basu, Maxwell Crouse, Chulaka Gunasekara, Suneet Katrekar, Pavan Kapanipathi

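Affiliations: IBM Research

AI summary: Proposes the PROVE framework, which uses 20 stateful MCP servers, an automated data synthesis pipeline, and a multi-component programmatic reward to address environment construction, query generation, and reward design for multi-step tool calling, improving BFCL Multi-Turn, tau2-bench, and T-Eval by up to +10.2, +6.8, and +6.5 points respectively.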

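Translated Chinese abstract

Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), and recall-based RL rewards encourage verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a library of 20 stateful MCP (Model Context Protocol) servers exposing 343 tools, supporting live-execution RL training with session-scoped state isolation; (2) an automated data synthesis pipeline that generates verified multi-turn tool-call trajectories against these servers through dependency-graph-guided dialogue simulation grounded in live-sampled server state, so that every generated query references entities that actually exist; and (3) a multi-component programmatic reward (a progressive validity score, dependency-aware coverage, an adaptive efficiency penalty with a complexity-scaled call budget, a tool-name signal, and an argument-value matching reward) that requires no external judge model. We train four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with GRPO using the same reward hyperparameters and about 13K training examples; only the learning rate is tuned, from a three-point sweep per model family. On BFCL Multi-Turn, tau2-bench, and T-Eval, PROVE yields improvements of up to +10.2, +6.8, and +6.5 points respectively, showing that compact programmatic rewards yield consistent gains on multi-step tool orchestration across two model families.

Abstract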

Training LLMs to orchestrate multi-step tool calls is held back by three coupled obstacles: realistic stateful execution environments are costly to build, synthetic training queries are often detached from the server's actual state (so the generated tool calls fail to execute), and recall-based RL rewards incentivize verbose tool-calling patterns. We present PROVE (Programmatic Rewards On Verified Environments), a framework with three contributions: (1) a library of 20 stateful MCP (Model Context Protocol) servers exposing 343 tools, enabling live-execution RL training with session-scoped state isolation; (2) a state-machine data synthesis pipeline that generates multi-turn tool-call trajectories grounded in live-sampled server state, so generated queries reference entities that actually exist; and (3) a multi-component programmatic reward with an adaptive efficiency penalty that counters the verbosity incentive of recall-based rewards. We train four models (Qwen3-4B, Qwen3-8B, Qwen2.5-7B, Granite-4.1-8B) with GRPO on the resulting ~13K training examples. On BFCL Multi-Turn, tau2-bench, and T-Eval, PROVE yields improvements of up to +10.2, +6.8, and +6.5 points respectively, demonstrating that this framework yields consistent gains on multi-step tool orchestration across two model families.

2606.03810 2026-06-04 cs.CL cs.AI

Consistency Training Can Entrench Misalignment

一致性训练可能固化不对齐

David Demitri Africa, Arathi Mani

发表机构 * UK AI Security Institute(英国人工智能安全研究所)

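AI summary: Across experiments with seven consistency training methods on 108 fine-tuned models, the study finds that consistency training generally suppresses reward hacking and emergent misalignment but amplifies sycophancy, and it presents a unifying theoretical framework to explain these alignment effects.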

Comments Accepted to ICML 2026

详情
AI中文摘要

一致性训练鼓励模型在相关输入或采样过程中产生相似输出。这类方法简单、可扩展且基本无需标签,但其对模型对齐的影响仍知之甚少。这些方法的自引导特性是否会放大模型中的不良行为?我们在108个“模型生物体”(经过微调以展示各种受控不对齐行为的开源模型,7B-70B)上测试了七种一致性训练方法。我们发现结果差异显著:一致性训练通常抑制奖励黑客和新兴不对齐,但会放大谄媚行为。我们提供的证据表明,由一致性标注过程引起的分布偏移(而非选择算子的变化)可能是系统性对齐效应的主要驱动因素。最后,我们提出了一个统一的理论框架,推导出一致性训练放大或抑制不对齐的条件。总之,我们的研究确立了一致性训练并非对齐中立的,其在关键系统中的使用应受到仔细审计。

英文摘要

Consistency training encourages a model to produce similar outputs across related inputs or sampling procedures. Such methods are simple, scalable, and largely label-free, but their effects on model alignment remain poorly understood. Could the self-bootstrapping nature of these methods amplify undesired behavior in models? We test seven consistency training methods on 108 model organisms: open-source models (7B--70B) fine-tuned to exhibit various forms of controlled misaligned behavior. We find that outcomes vary significantly: consistency training generally suppresses reward hacking and emergent misalignment but amplifies sycophancy. We present evidence that distribution shifts induced by the consistency labeling process, rather than variation in the selection operators, may be the primary driver of systematic alignment effects. Finally, we present a unifying theoretical framework to derive conditions under which consistency training will amplify or suppress misalignment. In total, our study establishes that consistency training is not alignment-neutral, and that its use in critical systems should be carefully audited.

2606.03784 2026-06-04 cs.RO

Revisiting Embodied Chain-of-Thought for Generalizable Robot Manipulation

重新审视具身思维链以实现可泛化的机器人操作

Nan Sun, Yuan Zhang, Yongkun Yang, Wentao Zhao, Peiyan Li, Jun Guo, Wenxuan Song, Pengxiang Ding, Runze Suo, Yifei Su, Xin Xiao, Xinghang Li, Huaping Liu

发表机构 * Tsinghua University(清华大学) Xiaomi Robotics(小米机器人) Peking University(北京大学) CASIA HKUST(GZ)(香港科技大学(广州)) Zhejiang University(浙江大学) Fudan University(复旦大学) Wuhan University(武汉大学) Shanghai Innovation Institute(上海创新研究院)

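AI summary: Builds the largest embodied chain-of-thought corpus to date and proposes ERVLA, a model that uses chain-of-thought as representation-shaping supervision rather than test-time reasoning, markedly improving the generalization of robot manipulation.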

详情
AI中文摘要

具身思维链旨在连接语言推理与机器人控制,但其有效形式和集成策略仍待探索。本文在大规模视觉-语言-动作模型上重新审视具身思维链。我们构建了迄今为止最大的具身思维链语料库,包含978,743条轨迹、226.3M样本和2592.5小时机器人数据。通过大量实验,我们发现有效的具身思维链应将高层语义理解具体化为可操作的动作指导,例如末端执行器运动描述和图像空间轨迹,而仅靠高层推理带来的收益有限。我们进一步表明,当显式思维链用作自回归动作前缀时,其扩展性不可靠,因为存在复合推理错误和不稳定的推理-动作耦合。为解决这些问题,我们提出ERVLA,一种将具身思维链用作表征塑造监督而非强制性测试时推理的VLA模型。ERVLA采用推理丢弃策略进行训练,使模型在训练期间吸收丰富的推理痕迹,同时在推理时无需思维链解码直接预测动作。这种设计提高了随预训练数据增加的扩展性,并避免了自回归不稳定性。ERVLA在LIBERO-Plus上达到86.9%的成功率,在VLABench上达到53.2%的成功率,展现出强大的分布外泛化能力。在真实机器人实验中,ERVLA进一步优于竞争性基线,尤其是在需要语义消歧和长时域执行的任务上。代码、数据和模型检查点将发布。

英文摘要

Embodied chain-of-thought (CoT) aims to bridge linguistic reasoning and robotic control, but its effective form and integration strategy remain underexplored. In this paper, we revisit embodied CoT for vision-language-action (VLA) models at large scale. We construct the largest embodied CoT corpus to date, comprising 978,743 trajectories, 226.3M samples, and 2592.5 hours of robot data. Through extensive experiments, we find that effective embodied CoT should ground high-level semantic understanding into concrete action guidance, such as end-effector movement descriptions and image-space trajectories, while high-level reasoning alone brings only marginal gains. We further show that explicit CoT does not scale reliably when used as an autoregressive action prefix, as it suffers from compounding inference errors and unstable reasoning-action coupling. To address these limitations, we propose ERVLA, a VLA model that uses embodied CoT as representation-shaping supervision rather than mandatory test-time reasoning. ERVLA is trained with a reasoning-dropout strategy, enabling the model to absorb rich reasoning traces during training while predicting actions directly without CoT decoding during inference. This design improves scalability with increasing pre-training data and avoids autoregressive instability. ERVLA achieves state-of-the-art performance on LIBERO-Plus with an 86.9% success rate and reaches 53.2% success rate on VLABench, demonstrating strong out-of-distribution generalization. In real-robot experiments, ERVLA further outperforms competitive state-of-the-art baselines, especially on tasks requiring semantic disambiguation and long-horizon execution.

2606.03746 2026-06-04 cs.CV cs.AI cs.GR cs.LG

Qwen-Image-Flash: Beyond Objective Design

Qwen-Image-Flash:超越目标设计

Tianhe Wu, Kun Yan, Zikai Zhou, Lihan Jiang, Jiahao Li, Jie Zhang, Kaiyuan Gao, Ningyuan Tang, Shengming Yin, Xiaoyue Chen, Xiao Xu, Yilei Chen, Yuxiang Chen, Yan Shu, Yixian Xu, Yanran Zhang, Zihao Liu, Zhendong Wang, Zekai Zhang, Deqing Li, Liang Peng, Yi Wang, Jingren Zhou, Chenfei Wu

发表机构 * alibaba-inc.com(阿里巴巴公司)

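AI summary: Through a systematic study of three factors (data composition, teacher guidance, and task mixture), the paper develops Qwen-Image-Flash and shows that effective few-step distillation requires not only carefully designed objectives but also principled organization of the broader training pipeline.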

详情
AI中文摘要

少步蒸馏已成为加速先进视觉生成模型的有效策略,但先前的工作主要集中在蒸馏目标上。在这项工作中,我们从互补的角度重新审视少步蒸馏,重点关注关键影响学生表现的训练方案。以Qwen-Image-2.0为代表案例,我们系统地研究了统一文本到图像生成和指令引导图像编辑蒸馏中的三个因素:数据组成、教师指导和任务混合。我们的实证分析揭示了若干非直观行为,这些行为推动了Qwen-Image-Flash的开发。总体而言,我们的结果表明,有效的少步蒸馏不仅需要精心设计的目标,还需要对更广泛的训练流程进行原则性组织。

英文摘要

Few-step distillation has become an effective strategy for accelerating advanced visual generative models, yet prior work has largely focused on distillation objectives. In this work, we revisit few-step distillation from a complementary perspective, focusing on the training recipe that critically shapes student performance. Using Qwen-Image-2.0 as a representative case, we systematically investigate three factors in unified text-to-image generation and instruction-guided image editing distillation: data composition, teacher guidance, and task mixture. Our empirical analysis reveals several non-obvious behaviors, which motivate the development of Qwen-Image-Flash. Overall, our results suggest that effective few-step distillation requires not only carefully designed objectives, but also principled organization of the broader training pipeline.

2606.03660 2026-06-04 cs.AI

From Answers to States: Verifiable Process-Level Evaluation of Chemical Reasoning in Large Language Models

从答案到状态:大语言模型中化学推理的可验证过程级评估

Hongyu Guo, Hao Li, He Cao, Gongbo Zhang, Li Yuan

发表机构 * Peking University, Shenzhen Graduate School(北京大学深圳研究生院) International Digital Economy Academy (IDEA)(粤港澳大湾区数字经济研究院)

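AI summary: Proposes the ChemCoTBench-V2 benchmark, which verifies structured chemical reasoning steps with deterministic rules and reference traces, revealing a gap between final-answer correctness and reasoning-state consistency.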

Comments 23 pages, 6 figures, 14 tables

详情
AI中文摘要

大语言模型越来越多地被用作化学助手,然而大多数化学基准仍然只对最终答案评分。这掩盖了一个关键的失败模式:模型可能输出正确的分子、产物或选项,但其推理过程违反了化学逻辑。现有的过程级评估器难以扩展,因为LLM评判者和人工步骤级过程注释成本高、不一致且容易产生幻觉。我们引入了ChemCoTBench-V2,一个规则可验证的诊断基准,用于对结构化、可验证的化学推理轨迹进行低成本、可审计的评估。它涵盖分子理解、分子编辑、分子优化和反应预测,包含18个报告任务中的5620个评估样本。模型必须在专家设计的模板中暴露关键中间步骤,这些步骤通过确定性化学规则进行检查,对于封闭答案任务,还使用参考轨迹而非另一个LLM评判者。开放式的分子优化通过预言机可验证的状态约束而非严格的轨迹匹配进行评估。该基准报告三个独立的信号:最终答案正确性、模板遵循度和基于专家精炼中间步骤的逐步骤验证器正确性。对前沿模型的实验揭示了最终答案成功与结构化推理状态一致性之间的持续差距:模型通常遵循要求的格式但未能通过化学步骤检查,或者正确回答但支持性推理薄弱。ChemCoTBench-V2支持细粒度模型比较,并识别轨迹首次违反验证器的具体步骤。

英文摘要

Large language models are increasingly used as chemistry assistants, yet most chemistry benchmarks still score only final answers. This masks a critical failure mode: a model may output the correct molecule, product, or option while its reasoning violates chemical logic. Existing process-level evaluators are hard to scale because LLM judges and human step-level process annotation are costly, inconsistent, and vulnerable to hallucination. We introduce ChemCoTBench-V2, a rule-verifiable diagnostic benchmark for low-cost, auditable evaluation of structured, verifier-addressable chemical reasoning traces. It spans molecular understanding, molecule editing, molecular optimization, and reaction prediction, with 5,620 evaluation samples across 18 reporting tasks. Models must expose key intermediate steps in expert-designed templates, and those steps are checked with deterministic chemistry rules and, for closed-answer tasks, reference traces rather than another LLM judge. Open-ended molecular optimization is evaluated with oracle-verifiable state constraints rather than strict trace matching. The benchmark reports three separate signals: final-answer correctness, template adherence, and step-wise verifier correctness over expert-refined intermediate commitments. Experiments on frontier models reveal a persistent gap between final-answer success and structured-reasoning-state consistency: models often follow the requested format while failing chemical-step checks, or answer correctly with weak supporting reasoning. ChemCoTBench-V2 enables fine-grained model comparison and identifies the concrete step at which the trace first violates the verifier.

2606.03631 2026-06-04 cs.LG cs.AI

AnchorMoE: Interpretable Time Series Classification via Anchor-Routed MoE

AnchorMoE: 基于锚点路由的混合专家模型实现可解释时间序列分类

Tao Xie, Zexi Tan, Haoyi Xiao, Mengke Li, Yiqun Zhang, Yang Lu, Cuie Yang, Yiu-ming Cheung

发表机构 * School of Automation, Guangdong University of Technology(广东工业大学自动化学院) School of Computer Science and Technology, Guangdong University of Technology(广东工业大学计算机科学与技术学院) College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院) School of Informatics, Xiamen University(厦门大学信息学院) State Key Laboratory of Synthetical Automation for Process Industries, Northeastern University(东北大学过程工业综合自动化国家重点实验室) Department of Computer Science, Hong Kong Baptist University(香港浸会大学计算机科学系)

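AI summary: Proposes AnchorMoE, a framework that uses a Mixture-of-Experts architecture to build multi-view representations of local patches and route them to specialized experts, achieves ante-hoc interpretability through additive decomposition, and introduces a geometric orthogonality constraint and an uncertainty-aware gating mechanism to improve decomposition reliability and noise suppression under sparse signals.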

Comments Accepted by KDD 2026, 12 pages

详情
AI中文摘要

多变量时间序列分类(MTSC)在高风险领域(如临床诊断和工业故障检测)中至关重要,这些领域的安全部署需要透明的决策过程。然而,隔离驱动模型预测的时间段具有挑战性,因为现实世界时间序列中的判别信号通常是稀疏、异构且被背景噪声严重掩盖的。因此,本文提出了AnchorMoE,一种天生可解释的分类框架。基于混合专家(MoE)架构,AnchorMoE编码局部补丁的多视角表示并将其路由到专门专家,确保最终预测被表述为输入段上的精确加性分解,从而促进前向透明度,而非依赖事后估计。为了在稀疏信号分布下保持这种分解的可靠性,我们引入了几何正交约束,惩罚表示冗余,迫使不同专家专门处理异构预测模式。此外,设计了一个不确定性感知的可靠性门控,动态校准每个段的贡献,有效抑制残余背景噪声。在真实世界和合成基准上的大量实验表明,AnchorMoE在实现高度竞争的分类性能的同时,忠实于原始时间序列进行决策。

英文摘要

Multivariate time series classification (MTSC) is pivotal in high-stakes domains, such as clinical diagnosis and industrial fault detection, where safe deployment necessitates transparent decision-making. However, isolating the temporal segments that drive model predictions is challenging because discriminative signals in real-world time series are typically sparse, heterogeneous, and heavily obscured by background noise. This paper, therefore, proposes AnchorMoE, an interpretable-by-construction classification framework. Built upon a Mixture-of-Experts (MoE) architecture, AnchorMoE encodes multi-view representations of local patches and routes them to specialized experts, ensuring that the final prediction is formulated as an exact additive decomposition over the input segments, facilitating ante-hoc transparency rather than relying on post-hoc estimations. To maintain the reliability of this decomposition under sparse signal distributions, we introduce a geometric orthogonality constraint that penalizes representational redundancy, compelling distinct experts to specialize in heterogeneous predictive patterns. Furthermore, an uncertainty-aware reliability gate is designed to dynamically calibrate the contribution of each segment, effectively suppressing residual background noise. Extensive experiments on real-world and synthetic benchmarks demonstrate that AnchorMoE achieves highly competitive classification performance while faithfully grounding its decisions in the raw time series.
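What an "exact additive decomposition over the input segments" buys can be seen in a toy sketch. The code below is not AnchorMoE: the per-segment score vectors and gate values are made-up numbers, and the function name is hypothetical. It only shows that when logits are a gated sum of per-segment contributions, each segment's attribution can be read off directly instead of being estimated after the fact.

```python
def additive_predict(segment_contribs, gates):
    """segment_contribs: one list of per-class scores per segment
    (hypothetical expert outputs); gates: one reliability weight per segment.
    Returns the logits and the per-segment attributions that sum to them."""
    n_classes = len(segment_contribs[0])
    attributions = [[g * s for s in contrib]
                    for g, contrib in zip(gates, segment_contribs)]
    logits = [sum(a[c] for a in attributions) for c in range(n_classes)]
    return logits, attributions

# Three segments, two classes; the third segment is gated off as noise.
contribs = [[0.2, -0.1], [2.0, -1.5], [0.1, 0.3]]
gates = [0.5, 1.0, 0.0]
logits, attributions = additive_predict(contribs, gates)
```

Here the second segment accounts for almost all of the class-0 logit, and the gated-off third segment contributes exactly zero, which is the kind of statement a post-hoc attribution method can only approximate.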

2606.03598 2026-06-04 cs.RO cs.AI cs.CV

PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

PHASER: 面向视觉-语言-动作模型的相位感知与语义经验回放

Ziyang Chen, Shaoguang Wang, Weiyu Guo, Qianyi Cai, He Zhang, Pengteng Li, Yiren Zhao, Yandong Guo

发表机构 * Thrust of AI, HKUST(Guangzhou)(香港科技大学(广州)人工智能学域) AI² Robotics, Shenzhen, China(AI² Robotics,深圳,中国)

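AI summary: Proposes PHASER, a framework that combines phase-aware capacity allocation, a multi-modal interference routing strategy, and the automatic phase-extraction pipeline Auto-PC to address catastrophic forgetting in continual learning of VLA models, raising average success rate on the LIBERO benchmark by up to 31%.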

Comments 20 pages, 8 figures, 12 tables

详情
AI中文摘要

视觉-语言-动作(VLA)模型在语言条件机器人操作中取得了显著成功。然而,在开放环境中部署这些模型需要持续获取新技能,这一过程不可避免地会严重遗忘先前学习的行为。虽然经验回放(ER)是一种标准的缓解策略,但简单的均匀采样从根本上与操作轨迹的时间特征不一致。它系统性地欠采样短暂但因果关键的子技能,导致相位饥饿,并完全忽略了历史任务中不同程度的遗忘。为克服这些限制,我们提出PHASER,一种架构无关的持续学习框架。PHASER采用以相位为中心的容量分配,确保所有子技能获得平等的记忆支持,并结合多模态干扰路由策略,动态优先处理遗忘风险高的历史相位。此外,为实现完全自主的终身适应,我们集成了Auto-PC,一种轻量级管线,结合无监督动作信号变化点检测和基于VLM的语义验证,无需大量人工监督即可提取时间边界。在LIBERO持续学习套件上对三个VLA骨干网络的评估表明,PHASER取得了显著的实证改进,与匹配预算的ER相比,平均成功率(ASR)提升高达31%,并在LIBERO-Goal CL设置中达到87.8%的最终ASR。

英文摘要

Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these models in open-ended environments requires continuously acquiring novel skills, a process that inevitably triggers severe catastrophic forgetting of previously learned behaviors. While experience replay (ER) serves as a standard mitigating strategy, naive uniform sampling fundamentally misaligns with the temporal characteristics of manipulation trajectories. It systematically under-samples brief but causally critical sub-skills, leading to phase starvation, and completely overlooks the varying degrees of forgetting across historical tasks. To overcome these limitations, we introduce PHASER, an architecture-agnostic continual learning framework. PHASER employs a phase-centric capacity allocation to guarantee equal memory support for all sub-skills, coupled with a multi-modal interference routing strategy that dynamically prioritizes historical phases at high risk of forgetting. Furthermore, to enable fully autonomous lifelong adaptation, we integrate Auto-PC, a lightweight pipeline combining unsupervised action-signal change-point detection with VLM-based semantic verification to extract temporal boundaries without intensive manual supervision. Evaluated across three VLA backbones on LIBERO continual learning suites, PHASER yields substantial empirical improvements, increasing Average Success Rate (ASR) by up to 31% over matched-budget ER and achieving an 87.8% final ASR on the LIBERO-Goal CL setting.
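The "phase starvation" argument can be made concrete with a minimal sketch of a phase-balanced replay buffer. This is an assumption-laden illustration, not the PHASER allocation rule: the phase names, step counts, and the equal-quota policy are hypothetical, chosen only to contrast with uniform sampling over time steps.

```python
import random
from collections import defaultdict

def phase_balanced_buffer(steps, capacity, seed=0):
    """steps: list of (phase_name, payload) records from past tasks.
    Gives every phase the same quota instead of sampling steps uniformly."""
    rng = random.Random(seed)
    by_phase = defaultdict(list)
    for phase, payload in steps:
        by_phase[phase].append((phase, payload))
    quota = capacity // len(by_phase)
    buffer = []
    for phase in sorted(by_phase):
        items = by_phase[phase]
        buffer.extend(rng.sample(items, min(quota, len(items))))
    return buffer

# A long "reach" phase, a brief but critical "grasp" phase, and a "place" phase.
steps = ([("reach", i) for i in range(90)]
         + [("grasp", i) for i in range(6)]
         + [("place", i) for i in range(24)])
buffer = phase_balanced_buffer(steps, capacity=12)
counts = {p: sum(1 for q, _ in buffer if q == p)
          for p in ("reach", "grasp", "place")}
```

With these made-up numbers, uniform sampling of 12 steps out of 120 would keep on average 0.6 grasp steps, whereas the equal quota keeps 4 per phase.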

2606.03564 2026-06-04 cs.CV cs.AI

CR-Seg: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation

CR-Seg:注意力引导与CoT增强的由粗到精推理分割

Yifan Cao, Xiaocui Yang, Faxian Wan, Shi Feng, Daling Wang, Yifei Zhang

发表机构 * School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院)

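AI summary: Proposes CR-Seg, a two-stage framework that performs coarse-to-refined reasoning segmentation through attention-map extraction and a global-to-local chain of thought, addressing cross-modal alignment and reasoning-answer inconsistency.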

详情
AI中文摘要

推理分割旨在通过联合视觉-文本推理来分割复杂语言描述的目标对象。现有方法通常依赖学习到的语义标记来桥接多模态大语言模型(MLLMs)和分割模型,但面临困难的跨模态对齐问题;或者依赖显式空间提示(如边界框),但可能丢失整体响应语义。为解决这些限制,我们提出注意力引导与CoT增强的由粗到精推理分割(CR-Seg),一个两阶段框架。具体地,我们设计了提取注意力图和点(EAP)模块,用于提取粗目标定位的注意力图并选择信息点,两者都输入SAM进行掩码细化。为缓解推理-答案不一致,我们进一步引入全局到局部思维链(GLCoT),引导模型从全局场景上下文逐步推理到局部目标细节。在推理分割基准上的大量实验证明了CR-Seg的有效性。

英文摘要

Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing methods typically rely on either learned semantic tokens to bridge Multimodal Large Language Models (MLLMs) and segmentation models, suffering from difficult cross-modal alignment, or explicit spatial prompts such as bounding boxes, which may lose holistic response semantics. To address these limitations, we propose Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation, termed CR-Seg, a two-stage framework for coarse-to-refined reasoning segmentation. Specifically, we design an Extract Attention Maps and Points (EAP) module to extract attention maps for coarse target localization and select informative points, both of which are fed into SAM for mask refinement. To alleviate reasoning--answer inconsistency, we further introduce Global-to-Local Chain-of-Thought (GLCoT), which guides the model to reason progressively from global scene context to local target details. Extensive experiments on reasoning segmentation benchmarks demonstrate the effectiveness of CR-Seg.

2606.03441 2026-06-04 cs.RO cs.LG

PerchRL: Vision-Based Agile Perching on Inclined Platforms under Rapid and Irregular Motion

PerchRL:基于视觉的快速不规则运动倾斜平台敏捷着陆

Zihong Lu, Zongzhuo Liu, Huaxu Li, Jinqiang Cui, Jie Mei, Youmin Gong, U Kei Cheang, Boyu Zhou

发表机构 * SUSTech(南方科技大学) HITSZ(哈尔滨工业大学(深圳)) PCL(鹏城实验室) Differential Robotics

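AI summary: Proposes PerchRL, a reinforcement learning framework that uses a two-stage learning strategy (state-based pre-training followed by vision-based fine-tuning) and a hybrid learning framework (visibility-aware state augmentation plus active perception rewards) to achieve autonomous vision-based perching of quadrotors on inclined platforms under rapid and irregular motion.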

详情
AI中文摘要

自主视觉引导的四旋翼在移动倾斜平台上的着陆对于空地协作至关重要,但由于有限的视场角(FOV)而具有挑战性。本文提出PerchRL,一种基于强化学习(RL)的框架,用于在快速和不规则运动下的倾斜平台上进行基于视觉的敏捷着陆。具体而言,我们采用两阶段学习策略,包括基于状态的预训练和基于视觉的微调。为了提高对不同平台运动的泛化能力,我们使用随机化的平台轨迹来防止过拟合,并采用时间增强方法从历史观测中捕捉潜在运动模式。在基于视觉的微调过程中,提出了一种混合学习框架,包括可见性感知状态增强和主动感知奖励,以提高在间歇性视觉丢失下的鲁棒性。大量的仿真和真实世界实验证明了PerchRL的可行性、稳定性和实时性能,而在不同四旋翼平台上的成功部署进一步验证了其适应性。源代码将发布以惠及社区。

英文摘要

Autonomous vision-based perching of quadrotors on moving inclined platforms is critical for air-ground collaboration but remains challenging due to the limited field of view (FOV). In this paper, we propose PerchRL, a reinforcement learning (RL) framework for vision-based agile perching on inclined platforms under rapid and irregular motion. Specifically, we employ a two-stage learning strategy consisting of state-based pre-training followed by vision-based fine-tuning. To improve generalization across diverse platform motions, we employ randomized platform trajectories to prevent overfitting and temporal augmentation methods to capture latent motion patterns from historical observations. During vision-based fine-tuning, a hybrid learning framework consisting of visibility-aware state augmentation and active perception rewards is presented to improve robustness under intermittent visual loss. Extensive simulation and real-world experiments demonstrate the feasibility, stability, and real-time performance of PerchRL, while successful deployment across distinct quadrotor platforms further validates its adaptability. The source code will be released to benefit the community.

2606.03402 2026-06-04 cs.CV

Mamba-Enhanced Implicit Motion Learning for Audio-Driven Portrait Animation

Mamba增强的隐式运动学习用于音频驱动肖像动画

Xuan Wei, Jiahui Chen, Kaiheng Li, Mingyu Shao, Qingqi Hong

发表机构 * Fujian Provincial Natural Science Foundation of China(福建省自然科学基金委员会) Giant Interactive Group Inc.(巨匠互动集团有限公司) National Natural Science Foundation of China(国家自然科学基金委员会)

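AI summary: Proposes a two-stage implicit-motion framework that combines a region-aware attention mechanism with a Mamba-enhanced diffusion model to generate realistic, temporally coherent human motion videos from a single static image and audio, reaching state-of-the-art performance on multiple benchmarks.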

Comments accepted by 2026 IEEE International Conference on Multimedia and Expo (ICME)

详情
AI中文摘要

音频驱动的人体运动视频生成旨在从单张静态图像合成逼真且时间一致的人体动画,应用于说话头生成、共语手势生成和动态演示。超越传统基于关键点的方法(这些方法往往难以捕捉细微的运动动态),我们提出了一种新颖的隐式运动框架,用于从单张静态图像和音频生成逼真且时间一致的人体运动视频。我们的方法采用两阶段流水线,将运动预测与渲染解耦。第一阶段将外观先验和层次深度线索整合到区域感知注意力机制中,以建模潜在运动特征。第二阶段采用Mamba增强的扩散模型直接从音频和源图像预测这些特征,实现细粒度运动模式的无监督学习。这种解耦架构增强了灵活性和效率。在一个新的380小时高质量数据集上训练,我们的方法在准确性、自然性和时间一致性方面优于多个公共基准和我们收集的数据上的先前工作,达到了新的最先进水平。

英文摘要

Audio-driven human motion video generation aims to synthesize realistic and temporally coherent human animations from a single static image, with applications in talking-head synthesis, co-speech gesture generation, and dynamic presentations. Moving beyond conventional keypoint-based methods that often struggle to capture subtle motion dynamics, we propose a novel implicit-motion framework for generating realistic and temporally coherent human motion videos from a single static image and audio. Our approach uses a two-stage pipeline that decouples motion prediction from rendering. The first stage integrates appearance priors and hierarchical depth cues into a region-aware attention mechanism to model latent motion features. The second stage employs a Mamba-enhanced diffusion model to directly predict these features from audio and the source image, enabling unsupervised learning of fine-grained motion patterns. This decoupled architecture enhances flexibility and efficiency. Trained on a new 380-hour high-quality dataset, our method outperforms prior work across multiple public benchmarks and our collected data in accuracy, naturalness, and temporal coherence, setting a new state of the art.

2606.03393 2026-06-04 cs.LG

Flicker-DDPM: Accelerating Denoising Diffusion via 1/f Colored Noise Injection

Flicker-DDPM:通过1/f彩色噪声注入加速去噪扩散

KeXiang Mao, FanCheng Li

发表机构 * School of Physics and Technology, Wuhan University(武汉大学物理科学技术学院) Hongyi Honor College, Wuhan University(弘毅荣誉学院)

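AI summary: Proposes Flicker-DDPM, which replaces isotropic white noise with 1/f colored noise inspired by self-organized criticality, generating power-law-spectrum noise through a spatial correlation kernel; on CIFAR-10 it matches or surpasses the generation quality of standard DDPM with 3.33 times fewer sampling steps, and a frequency-domain linear theory explains the acceleration mechanism.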

Comments 16 pages, 8 figures. Code available at https://github.com/Mao-Kexiang/Flicker_DDPM

详情
AI中文摘要

我们提出了一种新颖的扩散模型Flicker-DDPM,它引入了受自组织临界性(SOC)启发的闪烁(1/f)噪声,SOC是自然系统中广泛观察到的现象。与在前向过程中采用各向同性白噪声的去噪扩散概率模型(DDPM)不同,Flicker-DDPM采用具有幂律谱的彩色噪声,以更好地匹配自然图像的频谱统计,其功率谱通常遵循P(k)正比于1/k^α。为此,我们基于空间相关核σ(d) = (d + 1)^{-η}开发了一个彩色噪声模块,并从理论上证明调整η可以控制生成的1/f^α噪声的谱指数α,从而适应具有不同频谱特征的数据集。在CIFAR-10上,Flicker-DDPM使用3.33倍更少的采样步数即可达到或超越标准DDPM基线的生成质量,且每步的额外计算成本可忽略不计。我们进一步开发了一种频域线性理论,证明频谱匹配的彩色噪声使反向轨迹线性化,从理论上解释了所观察到的采样加速现象。

英文摘要

We propose a novel diffusion model, Flicker-DDPM, which incorporates flicker (1/f) noise inspired by self-organized criticality (SOC), a widely observed phenomenon in natural systems. Unlike denoising diffusion probabilistic models (DDPMs), which employ isotropic white noise in the forward process, Flicker-DDPM adopts colored noise with power-law spectra to better match the spectral statistics of natural images, whose power spectra typically follow P(k) ∝ 1/k^α. To this end, we develop a colored-noise module based on a spatial correlation kernel, σ(d) = (d + 1)^{-η}, and theoretically establish that adjusting η controls the spectral exponent α of the generated 1/f^α noise, enabling adaptation to datasets with diverse spectral characteristics. On CIFAR-10, Flicker-DDPM matches or surpasses the generation quality of a standard DDPM baseline using 3.33 times fewer sampling steps, with negligible additional computational cost per step. We further develop a frequency-domain linear theory demonstrating that spectrally matched colored noise linearizes the reverse trajectory, theoretically explaining the observed sampling acceleration.
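For readers unfamiliar with power-law spectra, a rough one-dimensional sketch may help. Note the substitution: the abstract builds its noise from a spatial correlation kernel, whereas the code below uses random-phase Fourier synthesis, a different and simpler construction, and all function names are hypothetical. It produces a signal whose power at frequency k scales as 1/k^α, which is the property the abstract refers to.

```python
import cmath
import math
import random

def power_law_noise(n, alpha, seed=0):
    """Random-phase Fourier synthesis: sum cosines with amplitude k^(-alpha/2),
    so the power at frequency k scales as 1/k^alpha."""
    rng = random.Random(seed)
    signal = [0.0] * n
    for k in range(1, n // 2):
        amp = k ** (-alpha / 2.0)
        phase = rng.uniform(0.0, 2.0 * math.pi)
        for t in range(n):
            signal[t] += amp * math.cos(2.0 * math.pi * k * t / n + phase)
    return signal

def power_spectrum(signal):
    """Naive O(n^2) DFT power for frequencies 0 .. n/2 - 1."""
    n = len(signal)
    return [abs(sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n // 2)]

sig = power_law_noise(64, alpha=1.0)
spec = power_spectrum(sig)
ratio = spec[1] / spec[8]  # 8**alpha for an exact 1/f^alpha spectrum
```

Setting `alpha=0.0` gives every frequency the same power (a white spectrum); larger values concentrate power at low frequencies, closer to the statistics the abstract attributes to natural images.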

2606.03376 2026-06-04 cs.CV cs.AI cs.CL cs.LG

P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

P²-DPO:通过校准直接偏好优化在感知处理中锚定幻觉

Ruipeng Zhang, Zhihao Li, Haozhang Yuan, C. L. Philip Chen, Tong Zhang

发表机构 * Guangdong Provincial Key Laboratory of Computational AI Models and Cognitive Intelligence, School of Computer Science & Engineering, South China University of Technology(广东省计算人工智能模型与认知智能重点实验室,计算机科学与工程学院,华南理工大学) Pazhou Lab, Guangzhou, China(琶洲实验室,广州,中国) Engineering Research Center of the Ministry of Education on Health Intelligent Perception and Paralleled Digital-Human, Guangzhou, China(教育部健康智能感知与并行数字人工程研究中心,广州,中国)

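AI summary: To address hallucination in large vision-language models, proposes the P²-DPO training paradigm, which uses model-generated preference pairs and a calibration loss to directly target the perceptual bottleneck and visual robustness without costly human feedback.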

详情
AI中文摘要

幻觉最近在大型视觉语言模型(LVLMs)中引起了广泛的研究关注。直接偏好优化(DPO)旨在直接从人类提供的纠正偏好中学习,从而解决幻觉问题。尽管取得了成功,但这种范式尚未专门针对关注区域中的感知瓶颈或解决图像退化下的视觉鲁棒性不足问题。此外,现有的偏好对通常是视觉无关的,其固有的离策略性质限制了它们在指导模型学习方面的有效性。为了解决这些挑战,我们提出了感知处理直接偏好优化(P²-DPO),一种新颖的训练范式,其中模型生成并学习自己的偏好对,从而直接解决已识别的视觉瓶颈,同时固有地避免视觉无关和离策略数据的问题。它引入了:(1)一种针对焦点增强感知和视觉鲁棒性的在策略偏好对构建方法,以及(2)一种精心设计的校准损失,以精确地将视觉信号与文本的因果生成对齐。实验结果表明,在相当数量的训练数据和成本下,P²-DPO在基准测试中优于依赖昂贵人工反馈的强基线。此外,对注意力区域保真度(ARF)和图像退化场景的评估验证了P²-DPO在解决关注区域感知瓶颈和提高对退化输入的视觉鲁棒性方面的有效性。

英文摘要

Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from the corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm has yet to specifically target the perceptual bottleneck in attended regions or address insufficient Visual Robustness against image degradation. Furthermore, existing preference pairs are often vision-agnostic, and their inherently off-policy nature limits their effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P$^2$-DPO), a novel training paradigm in which the model generates and learns from its own preference pairs, thereby directly addressing the identified visual bottlenecks while inherently avoiding the issues of vision-agnostic and off-policy data. It introduces: (1) an on-policy preference-pair construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss to precisely align visual signals with the causal generation of text. Experimental results demonstrate that, with a comparable amount of training data and cost, P$^2$-DPO outperforms strong baselines that rely on costly human feedback across benchmarks. Furthermore, evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate the effectiveness of P$^2$-DPO in addressing the perceptual bottleneck in attended regions and improving Visual Robustness against degraded inputs.
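As background, the standard DPO objective that this line of work builds on can be written in a few lines. The sketch below is the generic per-pair DPO loss only; the abstract does not specify the form of the Calibration Loss, so it is not reproduced here, and the argument names and the default `beta` are illustrative assumptions.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss for one preference pair.
    Inputs are sequence log-probabilities under the policy and under a
    frozen reference model; beta scales the implicit reward margin."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

# No preference learned yet: policy equals reference, loss is log(2).
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)
# Policy raised the chosen response and lowered the rejected one.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

In an on-policy setup like the one the abstract describes, both responses in a pair would be sampled from the current model rather than taken from a fixed human-annotated dataset, but the loss on each pair has this same shape.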

2606.03318 2026-06-04 cs.CL

Beyond Ideal Instruction: A Comprehensive Framework for Evaluating LLMs in Realistic Interactions

超越理想指令:现实交互中评估大语言模型的综合框架

Xuan Yang, Hao Xu, Tingfeng Hui, Hongsheng Xin, Kaike Zhang, Chunxiao Liu, Ning Miao

发表机构 * Department of Data Science, City University of Hong Kong(香港城市大学数据科学系) Hong Kong Institute of AI for Science, City University of Hong Kong(香港城市大学人工智能科学研究院) Li Auto Inc.(理想汽车) Beijing University of Posts and Telecommunications(北京邮电大学) Independent Researcher(独立研究员)

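AI summary: Proposes the RUT-Bench benchmark, which evaluates large language models in realistic tool-calling scenarios through high-fidelity simulation of ideal and non-ideal user behavior, finding that no tested model exceeds a 40% success rate and that performance drops noticeably on complex non-ideal inputs.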

详情
AI中文摘要

尽管大语言模型(LLMs)在工具使用能力上取得了巨大进步,但现有的评估基准难以完全对齐真实世界场景。这些基准大多依赖于模拟的理想化用户假设,缺乏面向经验的评估。这些局限性未能考虑到真实用户特有的模糊性、不合作行为和意图转变。为填补这一空白,我们提出了RUT-Bench,一个专门用于评估LLMs在多样化真实用户工具调用场景下的基准。RUT-Bench支持高保真模拟,涵盖单轮和多轮对话中的理想理性模式和非理想异质行为。我们使用该基准对19个广泛采用的开源和专有LLMs进行了全面评估。实验结果显示,没有测试的LLM实现超过40%的整体成功率,并且几乎所有模型在面对更复杂的非理想用户输入时都经历了明显的性能下降。我们的代码和数据可在 https://github.com/Miaow-Lab/RUT-Bench 获取。

英文摘要

Despite great advances in the tool-use capabilities of large language models (LLMs), existing evaluation benchmarks struggle to fully align with real-world scenarios. Such benchmarks mostly rely on simulated, idealized user assumptions and lack experience-oriented evaluation. These limitations fail to account for the ambiguity, uncooperative behaviors, and shifting intentions characteristic of real-world users. To fill this gap, we propose RUT-Bench, a dedicated benchmark designed to assess LLMs under diverse Real-world User Tool calling scenarios. RUT-Bench supports high-fidelity simulations covering both ideal rational patterns and heterogeneous non-ideal behaviors across single-turn and multi-turn dialogues. We conduct comprehensive evaluations on 19 widely adopted open-source and proprietary LLMs using our benchmark. Experimental results reveal that no tested LLM achieves an overall success rate above 40%, and nearly all of them experience noticeable performance drops when facing more complicated non-ideal user inputs. Our code and data are available at https://github.com/Miaow-Lab/RUT-Bench.

2606.03303 2026-06-04 cs.AI

LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

LEAP:利用智能体框架增强形式化数学的大语言模型

Po-Nien Kung, Linfeng Song, Dawsen Hwang, Jinsung Yoon, Chun-Liang Li, Simone Severini, Mirek Olšák, Edward Lockhart, Quoc V Le, Burak Gokturk, Thang Luong, Tomas Pfister, Nanyun Peng

发表机构 * Google Cloud AI Research(谷歌云人工智能研究) Google Cloud(谷歌云) Google DeepMind(谷歌DeepMind)

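AI summary: Proposes LEAP, an agentic framework that lets general-purpose large models reach state-of-the-art performance in formal theorem proving through problem decomposition, interaction with the Lean compiler, and self-refinement, solving all 12 problems of the 2025 Putnam Competition and surpassing a specialized system on Lean-IMO-Bench.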

详情
AI中文摘要

大语言模型(LLMs)在非正式数学推理中表现强劲,但在生成如Lean等形式语言中可机械验证的证明方面存在困难。我们提出LEAP,一个智能体框架,使通用基础模型在自动化形式定理证明上达到最先进性能。LEAP利用基础模型的能力,如非正式推理、指令遵循和迭代自我优化。通过将复杂问题分解为更小的单元,该系统通过与Lean编译器的持续交互,将形式化证明构建与非正式蓝图连接起来。为了在日益饱和的基准之外提供严格评估,我们引入了Lean-IMO-Bench,一个用Lean形式化的IMO风格问题基准,其陈述简短但证明高度非常规且多步,涵盖广泛难度级别。实验上,在最新2025年Putnam竞赛(北美本科生年度数学竞赛)中,LEAP解决了所有12个问题,匹配了前沿形式化数学模型的最新突破。在Lean-IMO-Bench上,LEAP将通用LLM的一次性形式化解决率从低于10%提升至70%,显著超过了由专业金牌级IMO系统设定的48%基准。此外,我们通过自主形式化开放组合挑战的复杂证明,包括Knuth偶阶Cayley图哈密顿分解中关键子问题的验证证明,展示了LEAP的研究级实用性。

英文摘要

Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. We present LEAP, an agentic framework that enables general-purpose foundation models to achieve state-of-the-art performance on automated formal theorem proving. LEAP leverages foundation model capabilities, such as informal reasoning, instruction following, and iterative self-refinement. By decomposing complex problems into smaller units, the system bridges formal proof construction with informal blueprints through continuous interaction with the Lean compiler. To provide a rigorous evaluation beyond increasingly saturated benchmarks, we introduce Lean-IMO-Bench, a benchmark of IMO-style problems formalized in Lean, with short statements yet highly non-routine and multi-step proofs across a wide range of difficulty levels. Empirically, on the latest 2025 Putnam Competition, an annual mathematics competition for undergraduate students in North America, LEAP solves all 12 problems, matching recent breakthroughs by frontier formal mathematical models. On Lean-IMO-Bench, LEAP boosts the one-shot formal solve rate of general-purpose LLMs from below 10% to 70%, notably surpassing the 48% benchmark set by a specialized, gold-medal-caliber IMO system. Furthermore, we demonstrate LEAP's research-level utility by autonomously formalizing complex proofs for open combinatorial challenges, including a verified proof for a key subproblem in Knuth's Hamiltonian decomposition of even-order Cayley graphs.

2606.03201 2026-06-04 cs.CV cs.AI

Reinforcement Learning from Cross-domain Videos with Video Prediction Model

基于视频预测模型的跨领域视频强化学习

Zhao Yang, Xinrui Zu, Jacob E. Kooi, Thomas Delliaux, He Liu, Shujian Yu, Kevin Sebastian Luck, Vincent François-Lavet

发表机构 * VU Amsterdam(阿姆斯特丹自由大学) ISAE-SUPAERO

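AI summary: Proposes XIPER, a reward model that maps agent observations into the expert domain through cross-domain video prediction and uses the prediction likelihood as a reward signal, addressing the absence of reward signals and the domain gap when learning from visually distinct domains.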

详情
AI中文摘要

由于缺乏奖励信号以及存在领域差距,从视觉上截然不同的领域的专家视频中进行强化学习具有挑战性。我们引入了XIPER(跨领域视频预测奖励),这是一种奖励模型,用于从视觉不同领域收集的专家视频中进行学习,其中智能体的外观因颜色、形态或仿真到现实差距等因素而不同。更具体地说,XIPER训练了一个跨领域视频预测模型,将智能体观测映射到专家领域,并使用预测似然作为奖励信号。在DMC Color Suite(8个任务)和DMC Body Suite(3个任务)上的实验表明,尽管存在智能体颜色和形态等领域的差距,XIPER始终优于基线方法。我们进一步在仿真到现实迁移数据集上分析了XIPER,证明它仅凭模拟专家视频就能为真实机器人观测产生有意义的奖励信号。代码、预训练模型、数据集和视频演示可在我们的项目网页上找到:https://sites.google.com/view/xiper

英文摘要

Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the presence of domain gaps. We introduce XIPER (Cross-domain Video Prediction Reward), a reward model for learning from expert videos collected in a visually different domain, where the agent's appearance differs due to factors such as color, morphology, or the sim-to-real gap. More specifically, XIPER trains a cross-domain video prediction model that maps agent observations into the expert domain and uses the prediction likelihood as a reward signal. Experiments on the DMC Color Suite (8 tasks) and DMC Body Suite (3 tasks) show that XIPER consistently outperforms baselines despite domain gaps such as differences in agent color and morphology. We further analyze XIPER on a sim-to-real transfer dataset, demonstrating that it produces meaningful reward signals for real-robot observations given only simulated expert videos. Code, pretrained models, datasets and video demonstrations can be found on our project webpage: https://sites.google.com/view/xiper

2606.03175 2026-06-04 cs.CV cs.RO

Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation

Xunyi Zhao, Sihao Lin, Gengze Zhou, Zerui Li, Shijie Li, Wei Tao, Jiajun Liu, Qi Wu

Affiliations: Adelaide University; Responsible AI Research Centre, Australian Institute for Machine Learning; Institute for Infocomm Research (I2R), A*STAR; iMotion; CSIRO Data61

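AI summary: To address language ambiguity in Instance Goal Navigation, proposes a cost-sensitive uncertainty-reduction approach that uses information-gain analysis to identify effective question types, builds a benchmark and a Weighted Success Rate metric, and presents a zero-shot MLLM navigator that queries only when the expected benefit outweighs the cost.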

Abstract

Instance Goal Navigation (IGN) requires an embodied agent to find a specific object instance among distractors from an under-specified natural-language description. Such ambiguity often cannot be resolved from perception and language alone, making interaction with an oracle a natural mechanism for disambiguation. Prior interactive methods allow oracle queries but treat lightweight clarification and route-level guidance alike, letting agents boost success rate through repeated high-information questions rather than by resolving the underlying ambiguity efficiently. We recast interactive IGN as a cost-sensitive uncertainty-reduction problem, where the agent should ask the question whose answer provides the largest reduction in navigation uncertainty relative to its penalty. To this end, we apply an information-gain analysis to existing navigation corpora to identify which cues reduce navigation uncertainty, yielding a compact set of question types and data-derived weights. However, existing interactive navigation benchmarks do not model the cost of different question types or evaluate how efficiently agents use interaction, making them unsuitable for studying cost-sensitive interaction. Based on this taxonomy of question types, we construct a benchmark for diagnosing interaction behavior and efficiency, together with a Weighted Success Rate metric that penalizes each query by its derived cost. We further propose a zero-shot MLLM navigator that selectively queries at each decision step only when the expected uncertainty reduction justifies the interaction cost.
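
The "ask only when it pays" rule can be sketched as an expected-information-gain calculation over candidate instances. This is a minimal illustration, not the paper's navigator: the abstract does not say how uncertainty or cost is measured, so the belief vector, the answer-likelihood tables, the use of Shannon entropy in bits, and the "gain minus cost" comparison are assumptions, and the names `expected_information_gain` and `choose_question` are hypothetical.

```python
import math
from typing import Dict, List, Optional


def entropy(probs: List[float]) -> float:
    """Shannon entropy in bits, skipping zero-probability entries."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_information_gain(
    prior: List[float],
    answer_likelihoods: List[List[float]],
) -> float:
    """Expected entropy reduction from asking one question.

    answer_likelihoods[a][i] is P(answer a | candidate i is the target).
    """
    gain = entropy(prior)
    for likelihood in answer_likelihoods:
        p_answer = sum(l * p for l, p in zip(likelihood, prior))
        if p_answer == 0:
            continue  # this answer can never occur under the prior
        posterior = [l * p / p_answer for l, p in zip(likelihood, prior)]
        gain -= p_answer * entropy(posterior)
    return gain


def choose_question(
    prior: List[float],
    questions: Dict[str, List[List[float]]],
    costs: Dict[str, float],
) -> Optional[str]:
    """Return the question with the largest positive net benefit, else None.

    Net benefit is assumed to be (expected gain - cost); None means
    "do not ask, keep navigating".
    """
    best_name, best_net = None, 0.0
    for name, likelihoods in questions.items():
        net = expected_information_gain(prior, likelihoods) - costs[name]
        if net > best_net:
            best_name, best_net = name, net
    return best_name
```

With two equally likely candidates, a question whose answer separates them perfectly has an expected gain of 1 bit, so it is asked when its cost is below 1 and skipped otherwise; a question whose answer is the same for both candidates has zero gain and is never worth any positive cost.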