arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.02277 2026-06-02 cs.RO

RoboSemanticBench: Diagnosing Semantic Grounding in Action Prediction for VLA Models

Bin Yu, Yao Zhang, Haishan Liu, Shijie Lian, Yuliang Wei, Xiaopeng Lin, Zhaolong Shen, Changti Wu, Ruina Hu, Bailing Wang, Cong Huang, Kai Chen

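Affiliations HIT (Harbin Institute of Technology); ZGCA; ZGCI; WHU (Wuhan University); HUST (Huazhong University of Science and Technology); HKUST(GZ) (Hong Kong University of Science and Technology (Guangzhou)); BUAA (Beihang University); ECNU (East China Normal University); DeepCybo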

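AI summary Proposes the RoboSemanticBench benchmark, which uses multiple-choice question-answering tasks to evaluate whether VLA models use instruction semantics to select the correct object. It finds that models select the semantically correct object at near-random rates, revealing a gap between semantic understanding and action prediction.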

Comments GitHub: https://github.com/ZGC-EmbodyAI/RoboSemanticBench

Abstract

Vision-language-action (VLA) models are built on the premise that semantic understanding from pretrained language or vision-language backbones should guide robot action prediction. Yet robot fine-tuning is optimized as imitation over task-specific action distributions, and many evaluations can be solved through visual or instruction-action shortcuts. We introduce RoboSemanticBench (RSB), an embodied benchmark for diagnosing semantic grounding in action prediction: whether post-trained VLA models can use complex instruction semantics to select and manipulate the correct physical target. In each episode, a robot receives a multiple-choice math or general-knowledge question, observes candidate answer blocks, and must grasp the block corresponding to the correct answer. RSB covers controlled arithmetic, grade-school mathematical understanding, and commonsense or factual understanding under four-choice and ten-choice suites. Across representative VLA models, we find that many policies learn to grasp candidate blocks but select the semantically correct block at near-random or below-random rates after controlling for grasp success, revealing a persistent gap between backbone-level semantic competence and action prediction.
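
The measurement described above, selection accuracy conditioned on a successful grasp, can be sketched roughly as follows. This is a minimal illustration and not the benchmark's evaluation code: the `Episode` record, its field names, and the toy episodes are assumptions made for the example. Only the four-choice and ten-choice suites come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Episode:
    # Hypothetical record of one evaluation episode (field names are assumptions).
    grasped_index: Optional[int]  # block the policy grasped, or None if no grasp succeeded
    correct_index: int            # block carrying the correct answer
    num_choices: int              # 4 or 10 in the suites the abstract mentions


def conditional_selection_accuracy(episodes: List[Episode]) -> Optional[float]:
    """Share of successful grasps that picked the correct block."""
    grasped = [e for e in episodes if e.grasped_index is not None]
    if not grasped:
        return None  # undefined when the policy never grasps anything
    hits = sum(e.grasped_index == e.correct_index for e in grasped)
    return hits / len(grasped)


def chance_level(num_choices: int) -> float:
    """Accuracy of picking one of num_choices blocks uniformly at random."""
    return 1.0 / num_choices


# Toy four-choice episodes: four successful grasps, one of them correct.
episodes = [
    Episode(0, 0, 4),
    Episode(2, 1, 4),
    Episode(None, 3, 4),
    Episode(3, 2, 4),
    Episode(1, 0, 4),
]
accuracy = conditional_selection_accuracy(episodes)
print(accuracy, chance_level(4))
```

Dropping failed grasps from the denominator separates manipulation skill from semantic choice: a policy can grasp reliably and still sit at the chance level of its suite.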

2606.02276 2026-06-02 cs.CV cs.AI cs.CL cs.LG

Cross-modal linkage risk in clinical vision-language models

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

Affiliations Lab for AI in Medicine; RWTH Aachen University; Department of Diagnostic and Interventional Radiology

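AI summary Studies the risk that clinical vision-language models (VLMs) allow cross-modal re-linkage via cosine similarity when images and reports are kept separate, and applies differentially private fine-tuning to the projection heads only, which substantially lowers the re-linkage rate while preserving image utility.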

Abstract

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate after acquisition, such as image-only data sharing or access-controlled reports, because a de-identified image may be re-linked to its original narrative report through cosine similarity alone. We formalized this as image-to-report retrieval and used public paired cohorts, in which the true pairing is known by design, as ground-truth benchmarks to audit the risk rather than as the privacy scenario. Evaluating VLMs of increasing clinical specialization on 406,241 paired examples from 126,804 patients across MIMIC-CXR (43,793 held-out pairs) and external CheXpert Plus (29,296 pairs), we found that re-linkage rose systematically with specialization: the strongest VLM retrieved the correct report at 15 times chance at a candidate pool of N = 100, 50 times chance at N = 10,000, and well above chance at full-database scale. The signal persisted under pathology-matched hard negatives that removed disease-label shortcuts, indicating correspondence beyond broad diagnostic categories. To reduce it without retraining, we froze both encoders and applied differentially private optimization only to the projection heads defining the alignment layer (epsilon = 0.34, delta = 6 × 10^-6). This reduced Recall@1 by 61.8% at N = 10,000 on MIMIC-CXR and transferred to CheXpert Plus without retraining, while image-side utility was largely preserved: macro AUROC for linear-probe classification across 14 labels shifted only from 79.63% to 79.43%. Targeted DP finetuning of the shared alignment layer can substantially reduce cross-modal re-linkage without materially degrading the image representations that make these models clinically useful.
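
The re-linkage audit described above is image-to-report retrieval scored by Recall@1, and a rough sketch may help make the "times chance" figures concrete. The embeddings below are made-up three-dimensional toy vectors, and the function names are assumptions for the example. In the paper's setting the vectors would come from the VLM's image and text encoders, which are not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def recall_at_1(image_embs, report_embs):
    """Fraction of images whose most similar report is their own (same index)."""
    hits = 0
    for i, img in enumerate(image_embs):
        sims = [cosine(img, rep) for rep in report_embs]
        best = max(range(len(sims)), key=sims.__getitem__)
        hits += best == i
    return hits / len(image_embs)


def times_chance(recall, pool_size):
    """Random guessing in a pool of N candidates succeeds with probability 1/N."""
    return recall * pool_size


# Toy data: image i is paired with report i; the last pair is deliberately mismatched.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]]
reports = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9], [0.8, 0.1, 0.6]]

r1 = recall_at_1(images, reports)
print(r1, times_chance(r1, len(reports)))
```

Read this way, "15 times chance at N = 100" corresponds to a Recall@1 of 0.15, and "50 times chance at N = 10,000" to a Recall@1 of 0.005.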

2606.02273 2026-06-02 cs.CV

Vision-language Models for Driver Monitoring Systems: A Driver Activity Description Dataset

David J. Lerch, Sarath Mulugurthi, Manuel Martin, Frederik Diederichs, Rainer Stiefelhagen

Affiliations Fraunhofer IOSB; Technische Hochschule Ingolstadt; Karlsruhe Institute of Technology (KIT)

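AI summary Creates a detailed natural-language version of the Drive&Act dataset, then evaluates and fine-tunes vision-language models to improve recognition of subtle driver actions; the fine-tuned model performs better in cross-dataset evaluation.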

Comments Accepted at IEEE ITSC 2026

Abstract

Understanding subtle driver actions is essential for building reliable driver monitoring systems. Existing vision-language models (VLMs) are trained on general datasets and struggle to recognize fine distinctions in driver behaviors. This paper addresses this limitation by creating a detailed natural language version of the Drive&Act dataset. We evaluate three VLMs on our new benchmark using LLM-based scoring methods. Their performance on the new benchmark shows that they cannot reliably generate accurate fine-grained driver activity descriptions. Based on the labeled Drive&Act dataset, we create a new Drive&Act description dataset containing fine-grained descriptions to train VLMs on driver activity understanding. Cross-dataset evaluation on the Driver Monitoring Dataset (DMD) shows that the VLM fine-tuned on our new Drive&Act description dataset generalizes well to actions in the DMD dataset. The VLM fine-tuned on our Drive&Act description dataset achieves an ACCR score of 76, outperforming the zero-shot VLM baseline with an ACCR score of 66. These findings demonstrate that adapting VLMs with richly described driver actions can significantly improve their ability to interpret driver behavior while also highlighting the need for more diverse datasets to support broader generalization in future applications. Our Drive&Act description dataset and code will be publicly available on GitHub.

2606.02268 2026-06-02 cs.CV

From Extrinsic to Intrinsic: Geodesic-Guided Representation Learning for 3D Geometric Data

Yuming Zhao, Junhui Hou, Qijian Zhang, Jia Qin, Ying He

Affiliations University of Science and Technology of China

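AI summary Proposes PRISM, a pre-training paradigm that learns isometric embeddings by recovering the intrinsic surface geodesic metric, addressing the disconnect between extrinsic space and intrinsic topology in 3D representation learning; it performs well on geodesic distance prediction and downstream tasks.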

Abstract

Geometric analysis fundamentally distinguishes between extrinsic and intrinsic perspectives. The dominant paradigm in current 3D representation learning relies on either extrinsic spatial structures or high-level semantics, struggling to capture the essence of shape identity and underlying manifold topology. To bridge this gap, we introduce a novel 3D representation learning paradigm, namely PRISM, for Pre-training, which learns isometric embeddings by Recovering the Intrinsic Surface geodesic Metric. PRISM incorporates a topology-enforcing objective that explicitly constrains the structure of latent space, alongside a specialized two-stage training recipe mitigating sample imbalance inherent in the distribution of geodesic distances. Experiments demonstrate that our approach shows satisfactory accuracy, robustness, and high efficiency in geodesic distance prediction and achieves superior performance across diverse downstream tasks, including shape recognition, surface parameterization, and non-rigid correspondence. The code will be publicly available at https://github.com/AidenZhao/PRISM.

2606.02267 2026-06-02 cs.LG cs.CV

A combination of noise and bilateral filters achieve supralinear and scalable adversarial robustness in CNNs

Nicolas Stalder, Benjamin F. Grewe, Matteo Saponati, Pau Vilimelis Aceituno

Affiliations Institute of Neuroinformatics, ETH Zürich, University of Zürich

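AI summary Proposes a preprocessing method combining Gaussian noise and bilateral filtering that achieves supralinear gains in adversarial robustness through complementary mechanisms, and shows that, combined with adversarial training, it reaches performance comparable to state-of-the-art defenses at lower computational cost.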

Comments Main: 8 pages, 3 figures, 2 Tables. Supplement: 10 pages, 7 figures, 6 Tables

Abstract

The vulnerability of deep neural networks to adversarial examples poses a significant challenge for real-world deployment. Existing techniques to enhance deep network robustness rely on adversarial training, an approach that is powerful but computationally intensive and typically tailored to specific attack types. To address these limitations, existing works have explored techniques such as adding Gaussian noise or filtering images, both of which can boost the network robustness to various adversarial attacks, albeit modestly. Here, we theoretically demonstrate that these two approaches enhance robustness against adversarial attacks through complementary mechanisms, resulting in supralinear robustness when combined. Building on this insight, we experimentally show that a simple preprocessor combining Gaussian noise and bilateral filtering yields supralinear improvements in adversarial robustness with minimal computational cost. Next, we combine our preprocessor with adversarial training and test on RobustBench to assess its supralinear improvement over state-of-the-art defenses. First, this combination ranks second on AutoAttack and third overall, while using only ~35% of the training FLOPs, using a model with ~50% fewer parameters, trained with ~33% of the epochs and ~15% of the data compared to state-of-the-art defenses. Second, our method scales efficiently, matching the accuracy of competing models with roughly 2-8x less total compute across 3 orders of magnitude. Overall, our approach provides a principled and easily integrable framework for enhancing adversarial robustness, offering negligible computational overhead and a simple yet theoretically grounded design.

2606.02256 2026-06-02 cs.LG

ArrythML: An Autoencoder-Based TinyML Approach for On-Device Arrhythmia Detection on Resource-Constrained Embedded Systems

Nagarajan S, Kurian Polachan

Affiliations International Institute of Information Technology, Bangalore, India

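AI summary Proposes a TinyML model based on an INT8-quantized autoencoder that performs real-time, low-power ECG segmentation and arrhythmia detection on an ESP32-S3 microcontroller, reaching 84% recall and a 79% F1-score.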

Comments 19 pages

Abstract

Our work presents a method for ECG segmentation and arrhythmia detection using Tiny Machine Learning (TinyML) models for real-time, on-device inference on resource-constrained embedded systems. We develop INT8 quantized autoencoder-based TinyML models with minimal layers and parameters for embedded deployment. These models are evaluated using a custom dataset derived from the MIT-BIH Arrhythmia Database and validated in both PC-based simulations and on-device environments. For the evaluations, over 95,000 ECG segments are processed on an ESP32-S3 microcontroller running the TensorFlow Lite Micro runtime. Post-evaluation, detailed analysis, including annotation-wise and record-wise failure analysis, is conducted to characterize model behavior across diverse ECG morphologies and rhythm patterns and to explain missed detections. In several cases, apparent misclassifications may correspond to early or subtle anomaly patterns labeled as normal in the reference annotations, highlighting the model's sensitivity. A refined evaluation by filtering out ambiguous cases in the dataset shows that the best-performing DNN-based autoencoder achieves a recall of 84%, an F1-score of 79%, a model size of approximately 180 KB, and an inference latency of 9 ms on-device. These results demonstrate the feasibility of low-power, privacy-preserving embedded wearable systems capable of performing accurate arrhythmia detection entirely on-device.
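
The abstract reports recall and F1 but not precision. A short sketch of how the two reported figures relate is given below, assuming the F1-score is the usual harmonic mean of precision and recall for the same class (the abstract does not say how it is averaged). The function names are chosen for this example only.

```python
def precision_recall_f1(tp, fp, fn):
    """Standard binary detection metrics from confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def implied_precision(recall, f1):
    """Solve F1 = 2PR / (P + R) for P, given R and F1."""
    return f1 * recall / (2 * recall - f1)


# Figures quoted in the abstract: recall 84%, F1-score 79%.
p = implied_precision(0.84, 0.79)
print(round(p, 3))
```

Under that assumption the implied precision is roughly 75%, so the model would be tuned to miss few arrhythmic segments at the price of more false alarms, which fits the abstract's remark about model sensitivity.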

2606.02255 2026-06-02 cs.CL cs.AI

Who Annotates in NLP? A Large-scale Assessment of Human Annotation Reporting between 2018 and 2025

Maria Kunilovskaya, Gagan Bhatia, Lisa Sophie Albertelli, Yanran Chen, Christian Greisinger, Lotta Kiefer, Christoph Leiter, Subhadeep Roy, Tewodros Achamaleh, Muhammad Arslan Manzoor, Sebastian Pohl, Yufang Hou, Steffen Eger

Affiliations NLLG Lab, University of Technology Nuremberg; Interdisciplinary Transformation University

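AI summary Through a large-scale audit of human annotation reporting in NLP papers, this study finds that annotation details are reported incompletely and proposes a framework and recommendations for improving reporting quality.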

Abstract

Human annotation is the empirical foundation of much NLP research, from dataset construction to model evaluation, but papers often leave unclear who produced the annotations and how the annotation process was controlled. We provide the first large-scale, task-level audit of human annotation reporting across major NLP venues, asking which annotation details are documented, which are missing, and how reporting varies across time, topic, venue, and intended use of human judgment. We introduce a unified taxonomy of annotation-reporting practices and validate an LLM-assisted extraction pipeline against Annotated-gold, a human-adjudicated gold standard of 41 papers and 72 annotation tasks, where the best model reaches human-comparable agreement with adjudicated labels, with Krippendorff's alpha of 0.606 versus 0.585 for human-human agreement. Using this pipeline, we construct Annotated-llm, a dataset covering ACL-venue papers from 2018-2025, with 2,667 extracted annotation tasks from 1,603 papers, and find that papers frequently report operational details such as recruitment strategies, annotator expertise, and annotation volume, but often omit details needed to assess annotation validity, including training, language proficiency, compensation, socio-demographics, adjudication, and agreement values, especially in model-evaluation studies. Our results show that annotation reporting in NLP has improved over time but remains uneven, and they establish a scalable framework and bare-minimum reporting recommendations for making human annotation more reliable, reproducible, and interpretable.
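
The agreement figures above are Krippendorff's alpha values. For readers who want to see what the statistic computes, here is a minimal sketch for nominal labels using the standard coincidence-matrix formulation. It is not the authors' pipeline: the abstract does not state which distance metric they used, and the toy labels are invented for the example.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    units: list of lists; each inner list holds the labels that different
    annotators gave to one item. Items with fewer than two labels are skipped.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single label cannot be paired with anything
        # Every ordered pair of labels from different annotators, weighted by 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b] for a in totals for b in totals if a != b)
    if expected == 0:
        raise ValueError("alpha is undefined when only one category is used")
    return 1.0 - (n - 1) * observed / expected


# Toy example: two annotators, four items, one disagreement.
toy = [["a", "a"], ["b", "b"], ["a", "b"], ["a", "a"]]
alpha = krippendorff_alpha_nominal(toy)
print(round(alpha, 4))
```

Alpha is 1 for perfect agreement and 0 when agreement is what chance alone would produce.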

2606.02252 2026-06-02 cs.CL

ResMerge: Residual-based Spectral Merging of Large Language Models

Yandu Sun, Zhiyan Hou, Haokai Ma, Yuheng Jia, Junfeng Fang, Haiyun Guo, Hongyan An, weizhen wang, Jinqiao Wang

Affiliations Southeast University; Institute of Automation, Chinese Academy of Sciences; University of Chinese Academy of Sciences; National University of Singapore; Wuhan University of Technology; Peking University; Wuhan AI Research

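AI summary Proposes the ResMerge framework, which decomposes RL task vectors into a spectral head and a residual component, builds a stable backbone from the residual components, and selectively reintroduces head information, enabling training-free merging of expert models.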

Comments 14 pages including appendix

Abstract

Model merging offers a training-free way to combine multiple post-trained expert models, but merging experts obtained through reinforcement learning (RL) remains challenging. Existing spectral merging methods often assume that leading singular directions contain the main task signal, while lower-energy residual components can be compressed, selected, or attenuated to reduce interference. We find that this assumption does not hold for RL task vectors: after decomposing each task vector into a leading spectral head and a residual component, both parts can independently recover substantial behavior knowledge, while exhibiting different merging properties. The head is highly concentrated and informative but more prone to sharp cross-expert conflicts, whereas the residual component is more dispersed and provides a more stable basis for aggregation. Based on this observation, we propose ResMerge, a residual-based spectral merging framework for RL experts. ResMerge first constructs a stable residual backbone with Spherical Residual Consensus Adaptation, which estimates a reliability-weighted consensus direction on the Frobenius sphere. It then reintroduces leading-head information through a Lightweight Head Correction module gated by positive cross-expert agreement. Experiments across multiple RL expert groups and capability domains show that ResMerge better preserves expert capabilities than representative task-vector and spectral merging baselines. The implementation of ResMerge is publicly available at https://github.com/sunyd0303-cpu/ResMerge-release.

2606.02251 2026-06-02 cs.RO cs.AI eess.SP

FW-NKF: Frequency-Weighted Neural Kalman Filters

Adnan Harun Dogan, Berken Utku Demirel, Christian Holz

Affiliations Department of Computer Science, ETH Zürich

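AI summary Proposes the Frequency-Weighted Neural Kalman Filter (FW-NKF), which embeds a causal spectral-shaping operator into the Kalman measurement residual and jointly learns observation and transition networks to suppress band-limited noise, reducing localization error by up to 10% on tasks such as chaotic systems and inertial pose estimation.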

Comments Published at ICRA 2026

Abstract

Robust state estimation is central to robotic autonomy, yet classical Kalman filters struggle with frequency-dependent disturbances and model mismatch such as sensor vibrations, electromagnetic interference, and periodic noise. Although Deep Kalman Filter (DKF) variants extend the Extended Kalman Filtering (EKF) framework by learning latent transitions, they lack explicit mechanisms to suppress band-limited noise components that typically corrupt sensor measurements in real-world scenarios. We introduce the Frequency-Weighted Neural Kalman Filter (FW-NKF), a unified hybrid approach that embeds a causal spectral-shaping operator into the Kalman measurement residual and jointly learns observation and transition networks. By adapting both the filter spectrum and the latent state representation, FW-NKF attenuates the noise-dominated frequency bands while capturing complex residual structures. We conduct extensive experiments on four heterogeneous benchmarks, including chaotic systems such as multi-dimensional Lorenz systems and full-body inertial pose estimation, and find a reduction in localization error of up to 10% as well as marked improvements in orientation accuracy. Our ablation studies confirm that frequency weighting and deep latent-state modeling contribute to overall performance.
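
The idea of shaping the measurement residual before the Kalman update can be illustrated with a toy scalar filter. The sketch below is an assumption-heavy stand-in, not FW-NKF itself: a hand-picked one-pole low-pass filter replaces the learned spectral-shaping operator, the state model is a plain random walk, there are no neural networks, and the covariance update is left unchanged, so the shaped variant is no longer an optimal Kalman filter. The parameter names `q`, `r`, and `alpha` are chosen for this example.

```python
import math


def kalman_1d(measurements, q=1e-3, r=0.5, alpha=None):
    """Scalar random-walk Kalman filter.

    If alpha is given, the innovation is passed through the causal filter
    s[t] = alpha * s[t-1] + (1 - alpha) * e[t] before the state update.
    """
    x, p = 0.0, 1.0      # state estimate and its variance
    shaped = 0.0         # internal state of the residual filter
    estimates = []
    for z in measurements:
        p = p + q                      # predict: x_t = x_{t-1} + process noise
        innovation = z - x             # measurement residual
        if alpha is not None:
            shaped = alpha * shaped + (1.0 - alpha) * innovation
            innovation = shaped        # use the spectrally shaped residual
        k = p / (p + r)                # Kalman gain
        x = x + k * innovation
        p = (1.0 - k) * p
        estimates.append(x)
    return estimates


def tail_rmse(estimates, truth, tail=100):
    """Root-mean-square error over the last `tail` estimates."""
    last = estimates[-tail:]
    return math.sqrt(sum((e - truth) ** 2 for e in last) / len(last))


# Constant true state 1.0 corrupted by a disturbance at the highest frequency.
zs = [1.0 + 0.5 * (-1) ** t for t in range(400)]
plain = kalman_1d(zs)
shaped_est = kalman_1d(zs, alpha=0.8)
print(tail_rmse(plain, 1.0), tail_rmse(shaped_est, 1.0))
```

On this toy signal the shaped residual leaves less ripple in the estimate than the plain update, because the low-pass filter attenuates the alternating disturbance before it reaches the state.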

2606.02248 2026-06-02 cs.CL

Geometric Latent Reasoning Induces Shorter Generations in LLMs

Shashi Kumar, Yacouba Kaloga, Petr Motlicek, Ina Kodrasi, Andrea Cavallaro

Affiliations Idiap Research Institute; EPFL; BUT

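AI summary Proposes Geometric Latent Reasoning (GLR), which uses a lightweight transition head to predict iterative direction updates in embedding space and approximate discrete reasoning trajectories, substantially reducing the number of generation steps in LLMs without explicitly optimizing for length.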

Abstract

Large language models solve complex problems by generating lengthy chains of explicit reasoning tokens. While effective, this makes reasoning expensive, length-sensitive, and constrained to (discrete) natural language. While latent reasoning offers a continuous alternative, determining useful structures for intermediate latent states is an open challenge. In this paper, we formulate latent reasoning as a geometric path-approximation problem within the model's pretrained token-embedding space. We introduce Geometric Latent Reasoning (GLR), which uses a lightweight transition head to predict iterative direction updates in embedding space. Using textual chain-of-thought traces as anchors, GLR learns to approximate discrete reasoning trajectories while permitting continuous deviations from exact token embeddings. Evaluations on mathematical reasoning benchmarks using Qwen3 models reveal an emergent phenomenon: geometric latent reasoning induces substantially shorter generations without an explicit length objective. By replacing early explicit reasoning with continuous latent steps, models often reach correct answers using substantially fewer total generation steps. These findings suggest that continuous trajectories act as compact intermediate reasoning states, exposing a new tradeoff between latent computation budget, output length, and accuracy.

2606.02246 2026-06-02 cs.CV

Ego-METAS: Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark

Maria Santos-Villafranca, Jesus Bermudez-cameo, Alejandro Perez-Yus, Giovanni Maria Farinella, Antonino Furnari

Affiliations University of Zaragoza - I3A; Department of Mathematics and Computer Science, University of Catania

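AI summary To address energy-aware perception on resource-constrained devices, proposes Ego-METAS, the first egocentric online multimodal energy-efficient temporal action segmentation benchmark. It contains more than 100 hours of untrimmed video across 5 modalities and requires models to select sensors dynamically within an energy budget. Evaluations show that optimal routing is highly scenario-dependent and that existing methods struggle to adapt to continuous environments.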

Comments Project Page: https://maria-sanvil.github.io/Ego-METAS-website/

Abstract

To operate in the physical world, embodied agents must perceive their environment in an "always-on" fashion, selectively accessing the most informative sensors to balance energy constraints and task accuracy. Despite its importance for resource-constrained devices, energy-aware perception remains under-explored, with most prior work assuming unlimited compute. To address this, we introduce Ego-METAS: the first Egocentric online Multimodal Energy-efficient Temporal Action Segmentation benchmark. Ego-METAS provides a unified testbed of more than 100 hours of untrimmed egocentric video from EgoExo4D, CMU-MMAC, and CaptainCook4D, spanning 5 modalities (RGB, audio, gaze, IMU, and monochrome camera). We formulate an online temporal action segmentation task where models must dynamically select which sensors to activate at each timestep while strictly adhering to hardware-representative energy budgets. Alongside the benchmark, we release unified splits, cleaned annotations, pre-extracted features, and a diverse suite of baseline routing policies. Our evaluations show that optimal routing is highly scenario-dependent, and that existing policy-learning methods, designed primarily for trimmed clips, struggle to adapt to continuous, untrimmed environments. However, even simple dynamic fusion of complementary modalities (e.g., via random routing) proves critical for balancing predictive accuracy against strict energy budgets. Ultimately, Ego-METAS provides a standardized foundation to develop robust, cost-aware policies for autonomous, always-on embodied AI.
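
The routing task above, choosing which sensors to switch on at each timestep without exceeding an energy budget, can be sketched with a random-routing baseline. The five modality names follow the abstract, but the per-step energy costs, the budget, and the function name are hypothetical values picked for illustration. They are not the benchmark's hardware figures or its released baselines.

```python
import random

# Hypothetical per-timestep energy costs in arbitrary units (assumed, not from the paper).
SENSOR_COST = {"rgb": 5.0, "audio": 1.0, "gaze": 0.5, "imu": 0.2, "mono": 2.0}


def random_routing(num_steps, budget_per_step, costs=SENSOR_COST, seed=0):
    """At each timestep, activate sensors in random order while they fit the budget."""
    rng = random.Random(seed)  # seeded so the schedule is reproducible
    schedule = []
    for _ in range(num_steps):
        order = list(costs)
        rng.shuffle(order)
        active, spent = [], 0.0
        for name in order:
            if spent + costs[name] <= budget_per_step:
                active.append(name)
                spent += costs[name]
        schedule.append((active, spent))
    return schedule


schedule = random_routing(num_steps=20, budget_per_step=3.0)
for active, spent in schedule[:3]:
    print(sorted(active), spent)
```

With a budget of 3.0 units the 5.0-unit RGB stream can never be activated, which shows how a tight budget forces a policy onto cheaper modalities; a learned policy would replace the shuffle with a decision based on recent observations.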

2606.02245 2026-06-02 cs.CL

When Knowledge Is Not Free: Cost-Aware Evidence Selection in Retrieval-Augmented Generation

Mingyan Wu, Han Yang, Omer Ben-Porat, Yftah Ziser

Affiliations Northeastern University; Technical University of Munich; GESIS – Leibniz Institute for the Social Sciences; Technion–Israel Institute of Technology; NVIDIA Research; University of Groningen

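AI summary Proposes a cost-aware RAG setting with access-cost tiers and budget constraints to study evidence-selection strategies, finding that static selection is brittle while agentic approaches show promise but depend on the model and task.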

Abstract

Retrieval-Augmented Generation (RAG) typically assumes that external knowledge is free, but many high-quality sources are paywalled, licensed, restricted, or otherwise costly to access. We introduce cost-aware RAG, a setting where retrieved evidence is assigned access-cost tiers and systems must answer under an explicit evidence-access budget. We instantiate this setting by augmenting MS MARCO v2.1 with access-friction tiers and evaluate budgeted evidence selection across general-domain and domain-specific QA benchmarks. Our results show that static selection is brittle: no fixed selector uniformly dominates, and larger budgets do not reliably improve answer quality, even when costly evidence is domain-matched. We then study agentic cost-aware RAG, where an LLM decides when to retrieve, which tier to access, and when to stop. Agents show strong promise as adaptive evidence-acquisition controllers, but their behavior remains highly model- and task-dependent. These findings suggest that cost-aware evidence acquisition is a central challenge for the next generation of RAG systems. All code and data are available at https://github.com/Mignonmy/Cost-Aware.
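
Budgeted evidence selection, as set up above, can be made concrete with one simple static selector: rank retrieved passages by relevance per unit cost and take them greedily until the budget runs out. The `Passage` fields, the cost values, and the document IDs are assumptions for this sketch. The abstract does not specify the tier costs or the selectors the authors evaluated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Passage:
    doc_id: str
    relevance: float  # retriever score, higher is better
    cost: float       # access cost implied by the passage's tier (hypothetical units)


def select_evidence(passages, budget):
    """Greedy static selector: best relevance-per-cost first, within the budget."""
    def value_density(p):
        return p.relevance / p.cost if p.cost > 0 else float("inf")  # free passages first

    chosen, spent = [], 0.0
    for p in sorted(passages, key=value_density, reverse=True):
        if spent + p.cost <= budget:
            chosen.append(p)
            spent += p.cost
    return chosen, spent


retrieved = [
    Passage("open-1", 0.60, 0.0),
    Passage("licensed-1", 0.90, 3.0),
    Passage("paywalled-1", 0.95, 5.0),
    Passage("open-2", 0.40, 0.0),
]
chosen, spent = select_evidence(retrieved, budget=4.0)
print([p.doc_id for p in chosen], spent)
```

A fixed rule like this is exactly the kind of static selection the abstract reports as brittle; the agentic variant would instead let an LLM decide after each retrieval whether to pay for another tier or stop.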

2606.02242 2026-06-02 cs.CV cs.AI cs.LG

Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification

Karina Kvanchiani, Timur Mamedov

Affiliations Tevian, Russia; Lomonosov Moscow State University, Russia

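AI summary Image-based and text-based person re-identification suffer from suboptimal shared representations caused by modality discrepancies and conflicting objectives. The paper proposes a decoupled two-stage training pipeline with a single vision encoder that avoids cross-task interference; experiments show that image pre-training and textual supervision improve performance on both tasks.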

Abstract

The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and conflicting training objectives, leading to suboptimal shared representations. While I2I ReID focuses on identity-level invariance across images of the same person, T2I ReID is driven by instance-specific textual descriptions tied to unique visual traits. This paper explores the fundamental differences between the two ReID tasks and their optimization processes for effective training. Since I2I and T2I ReID are often studied separately, the loss functions optimized for one retrieval setting may negatively affect the representation quality required by the other. Motivated by these findings, we propose a decoupled two-stage training pipeline for learning a shared representation across image and text modalities. The pipeline is based on a single vision encoder that supports both I2I and T2I retrieval while avoiding cross-task interference during training. We provide extensive experiments across multiple configurations, varying domain mixing procedures, learning strategies, and task objectives. We observe that I2I ReID pre-training positively impacts the generalization ability to T2I data. In addition, we find that incorporating textual supervision during the vision encoder training stage enhances both I2I and T2I performance. We believe our insights provide a meaningful step toward unified ReID systems and cross-modal retrieval overall.

2606.02241 2026-06-02 cs.LG

BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers

Justin Deschenaux, Caglar Gulcehre

Affiliations EPFL Lausanne, Switzerland; Microsoft AI

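AI summary Proposes the BlockGen framework, which uses blockwise sequence modeling and AR-informed predictor-corrector sampling to compare uniform-state diffusion and masked diffusion in blockwise generation.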

Abstract

Is the uniform-state diffusion framework a more powerful paradigm for discrete diffusion? Recent studies indicate that this may be the case. In combination with predictor-corrector samplers, uniform-state diffusion models (USDMs) produce higher-quality samples than masked diffusion models (MDMs), and USDMs match or outperform MDMs in downstream tasks, even though they exhibit greater perplexity. Two issues remain unresolved. First, existing work compares uniform and masked diffusion with uninformed correctors that re-inject noise at random positions, rather than targeting the tokens most likely to be wrong. Second, prior work compares full-sequence diffusion models, so we do not know whether the same conclusion holds when tokens are generated block by block. To address these issues, we introduce BlockGen, a blockwise sequence model that we instantiate with both masked and uniform diffusion. BlockGen trains on a mixture of block sizes, and its likelihood interpolates between AR and pure diffusion more finely than models with a fixed block size. BlockGen enables AR-informed predictor-corrector sampling (ARPC), which combines AR and diffusion predictions to re-generate unlikely tokens without an auxiliary verifier. Under ancestral sampling, uniform outperforms masked in the block-by-block setting, especially in the few-step regime. Under ARPC, the gap closes and reverses at high NFE. With block size 16 on GSM8K, MDMs reach slightly higher accuracy than USDMs, and we observe a similar trend in Generative Perplexity on OpenWebText. Find our code at https://github.com/jdeschena/blockgen.

2606.02237 2026-06-02 cs.LG

Why Are DMD Students Lazy? Understanding the Copying Behavior in Few-Step Distillation

为什么 DMD 学生懒惰?理解少步蒸馏中的复制行为

Shucheng Li, Iolo Jones, Alexander Tong, Michael M. Bronstein

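Affiliations: Department of Computer Science, University of Oxford; AITHYRA, Research Institute for Biomedical AI

AI summary: Studies why high-dimensional students in Distribution Matching Distillation (DMD) spontaneously copy the teacher's noise-data pairings, and explains the behavior by the student's limited geometric freedom.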

详情
AI中文摘要

分布匹配蒸馏(DMD)通过在所有尺度上对齐噪声分布,将预训练扩散模型压缩为高效的少步生成器。原则上,这种分布级监督对教师的特定噪声-数据配对保持不可知;这为学生提供了重新映射潜在噪声的自由度,这一行为在低维设置中一直被观察到。令人惊讶的是,我们发现,在高维设置中,蒸馏学生自发地再现了教师的原始噪声-数据配对,我们将这种现象称为复制。我们证明复制既不是对抗性目标的副产品,也不是教师记忆的结果。相反,我们的证据表明,复制是高维蒸馏过程中学生模型几何自由度有限而产生的一种涌现特性。

英文摘要

Distribution Matching Distillation (DMD) compresses pretrained diffusion models into efficient few-step generators by aligning their noised distributions across all scales. In principle, such distribution-level supervision remains agnostic to specific noise-data pairings of the teacher; this provides the student the freedom to remap latent noise, a behavior consistently observed in low-dimensional settings. Surprisingly, we find that in high-dimensional settings, distilled students spontaneously reproduce the original noise-data pairings of the teacher, a phenomenon we term copying. We demonstrate that copying is neither a byproduct of adversarial objectives nor a result of teacher memorization. Instead, our evidence suggests that copying is an emergent property arising from the limited geometric freedom of the student model during high-dimensional distillation.

2606.02232 2026-06-02 cs.LG

A Doeblin-Anchored Contrastive Chart for Learning Markov Transition Kernels

Doeblin锚定对比图:学习马尔可夫转移核

Ao Xu

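Affiliations: School of Artificial Intelligence, Jilin University; Zhongguancun Academy

AI summary: Proposes a Doeblin-anchored contrastive coordinate framework that learns valid Markov transition kernels from a contrastive objective, introduces a measurable Markovization operator to guarantee kernel validity, and derives nonparametric convergence rates and finite-horizon error bounds.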

详情
AI中文摘要

学习马尔可夫转移模型不仅仅是条件密度估计:学习到的对象必须是一个有效的转移核,才能在后续动力学中迭代。本文介绍了一种Doeblin锚定对比图,这是一个从统计到动力学的坐标框架,用于从对比目标学习转移核。给定一个重启律和一个锚定强度,该图将目标转移与重启律混合。得到的锚定核同时是一个Doeblin小化马尔可夫核、一个二元对比实验中的正条件律,以及原始转移律的一个显式可逆坐标。我们证明了锚定对比风险识别锚定转移密度,并将超额风险校准为密度误差。由于学习得分的反演可能产生有符号或未归一化的对象,我们引入了一个可测马尔可夫化算子,在保持积分$L^1$精度(最多一个常数因子)的同时恢复核有效性。Oracle不等式和Hölder-ReLU逼近界给出了独立转移对的非参数速率。对于平稳几何$\beta$-混合轨迹,一个保守的稀疏化与耦合扩展以有效样本量提供了相同的重建接口。在显式覆盖下,占用加权扰动界将一步核误差转化为有限时域边际、路径律和占用测度误差。

英文摘要

Learning a Markov transition model is not merely conditional density estimation: the learned object must be a valid transition kernel before it is iterated in downstream dynamics. This paper introduces a Doeblin-anchored contrastive chart, a statistical-to-dynamical coordinate framework for learning transition kernels from contrastive objectives. Given a restart law and an anchor strength, the chart mixes the target transition with the restart law. The resulting anchored kernel is simultaneously a Doeblin-minorized Markov kernel, the positive conditional law in a binary contrastive experiment, and an explicitly invertible coordinate for the original transition law. We prove that the anchored contrastive risk identifies the anchored transition density and calibrates excess risk to density error. Since inversion of a learned score may produce a signed or unnormalized object, we introduce a measurable Markovization operator that restores kernel validity while preserving integrated $L^1$ accuracy up to a constant factor. Oracle inequalities and Hölder-ReLU approximation bounds yield nonparametric rates for independent transition pairs. For stationary geometrically $\beta$-mixing trajectories, a conservative thinning-and-coupling extension yields the same reconstruction interface with an effective sample size. Occupancy-weighted perturbation bounds transfer one-step kernel error to finite-horizon marginal, path-law, and occupation-measure errors under explicit coverage.
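
The anchoring idea in this abstract can be illustrated on a finite state space. The sketch below is not the paper's construction. It assumes the chart is the convex mixture K = (1 - alpha) * P + alpha * nu, and it uses a simple clip-and-renormalize projection as a stand-in for the Markovization operator. The function names `anchor`, `invert`, and `markovize`, the matrix `P`, and the noise values are all hypothetical.

```python
def anchor(P, nu, alpha):
    """Mix each row of transition matrix P with restart law nu (assumed form)."""
    return [[(1 - alpha) * p + alpha * v for p, v in zip(row, nu)] for row in P]


def invert(K, nu, alpha):
    """Undo the mixture to recover the original kernel; requires alpha < 1."""
    return [[(k - alpha * v) / (1 - alpha) for k, v in zip(row, nu)] for row in K]


def markovize(Q):
    """Clip negative entries and renormalize each row (illustrative stand-in)."""
    out = []
    for row in Q:
        clipped = [max(q, 0.0) for q in row]
        total = sum(clipped)
        # Fall back to the uniform law if a row has no positive mass left.
        out.append([c / total for c in clipped] if total > 0
                   else [1.0 / len(row)] * len(row))
    return out


P = [[0.9, 0.1, 0.0], [0.2, 0.5, 0.3], [0.0, 0.4, 0.6]]
nu = [1 / 3, 1 / 3, 1 / 3]
alpha = 0.2

K = anchor(P, nu, alpha)  # every entry of K is at least alpha * nu

# A noisy estimate of K can invert to a signed object; markovize repairs it.
noise = [0.05, 0.0, -0.05]
K_hat = [[k + e for k, e in zip(row, noise)] for row in K]
P_signed = invert(K_hat, nu, alpha)
P_hat = markovize(P_signed)
```

With exact `K`, `invert` returns `P` up to rounding. With the perturbed `K_hat`, the first row of `P_signed` contains a negative entry, which is the situation the abstract's Markovization step is meant to handle.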

2606.02223 2026-06-02 cs.LG math.ST stat.ME stat.TH

Network Learning with Semi-relaxed Gromov-Wasserstein

半松弛Gromov-Wasserstein的网络学习

Charles Dufour, Ulysse Naepels, Leonardo V. Santoro

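Affiliations: EPFL, Institute of Mathematics, Lausanne, Switzerland

AI summary: To handle missing node labels when estimating the generative mechanism of large networks, proposes a semi-relaxed Gromov-Wasserstein objective that relaxes the assignment problem through probabilistic couplings and is solved with a block-coordinate conditional gradient algorithm. The optimality gap between the relaxed solution and a deterministic assignment is shown to vanish at rate O(1/n), giving consistency and minimax-optimal convergence rates for stochastic block models and Hölder-smooth graphons.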

详情
AI中文摘要

估计大规模网络的生成机制是统计机器学习中的一个基本挑战。由于缺乏规范的节点标签,识别潜在连接结构通常是一个NP难的组合问题。我们通过允许概率耦合来应对这一挑战,从而松弛了分配问题。我们的估计框架可以表述为半松弛Gromov-Wasserstein目标,并提供了生成结构的低维表示。我们通过块坐标条件梯度算法求解该问题。尽管进行了松弛,但所得解通常是确定性的:事实上,我们证明了松弛解与确定性分配之间的最优性差距以$O(1/n)$的速率消失,其中$n$是节点数。这使得潜在模型的可处理恢复成为可能,并能够进行严格的统计分析:我们为随机块模型和Hölder光滑图模型建立了相合性和极小化最优收敛速率。我们的实现在合成和真实数据集上均展示了随$n$的高效扩展能力。

英文摘要

Estimating the generative mechanism of large-scale networks is a fundamental challenge in statistical machine learning. It requires the identification of the latent connectivity structure, which is in general an NP-hard combinatorial problem due to the absence of canonical node labels. We address this challenge by allowing for probabilistic couplings, thereby relaxing the assignment problem. Our estimation framework can be formulated as a semi-relaxed Gromov-Wasserstein objective and provides a low-dimensional representation of the generative structure. We solve this via a block-coordinate conditional gradient algorithm. Despite the relaxation, the resulting solution is typically deterministic: in fact, we show that the optimality gap between the relaxed solution and the deterministic assignment vanishes at rate $O(1/n)$, where $n$ is the number of nodes. This allows for tractable recovery of the underlying model and enables rigorous statistical analysis: we establish consistency and minimax-optimal convergence rates for both stochastic block models and Hölder-smooth graphons. Our implementation scales efficiently with $n$, as demonstrated on both synthetic and real-world datasets.
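
A small numerical example may help make the relaxed assignment concrete. The sketch below assumes the standard squared-loss form of the Gromov-Wasserstein objective and uniform node weights 1/n, with only the row marginal of the coupling constrained. The paper's exact objective and weighting may differ, and the toy two-block graph and the names `srgw_objective` and `is_semi_relaxed_coupling` are hypothetical.

```python
def srgw_objective(C1, C2, T):
    """Sum over i, j, k, l of (C1[i][j] - C2[k][l])^2 * T[i][k] * T[j][l]."""
    n, m = len(C1), len(C2)
    return sum((C1[i][j] - C2[k][l]) ** 2 * T[i][k] * T[j][l]
               for i in range(n) for j in range(n)
               for k in range(m) for l in range(m))


def is_semi_relaxed_coupling(T, tol=1e-12):
    """Rows must carry mass 1/n each; column sums are left free."""
    n = len(T)
    rows_ok = all(abs(sum(row) - 1.0 / n) <= tol for row in T)
    nonneg = all(t >= 0 for row in T for t in row)
    return rows_ok and nonneg


# Four nodes forming two fully connected groups (self-loops kept for simplicity).
C1 = [[1, 1, 0, 0],
      [1, 1, 0, 0],
      [0, 0, 1, 1],
      [0, 0, 1, 1]]
# Two-block connectivity matrix that the graph is matched against.
C2 = [[1, 0],
      [0, 1]]

# Deterministic assignment: nodes 0, 1 -> block 0 and nodes 2, 3 -> block 1.
T_det = [[0.25, 0.0], [0.25, 0.0], [0.0, 0.25], [0.0, 0.25]]
# Fully soft coupling: every node splits its mass evenly over both blocks.
T_soft = [[0.125, 0.125] for _ in range(4)]

loss_det = srgw_objective(C1, C2, T_det)    # 0.0
loss_soft = srgw_objective(C1, C2, T_soft)  # 0.5
```

Both couplings satisfy the row-marginal constraint, but only the deterministic one reaches zero loss on this toy graph. This is consistent with the abstract's remark that relaxed solutions are typically deterministic.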

2606.02221 2026-06-02 cs.CV cs.LG

CORE-MTL: Rethinking Gradient Balancing via Causal Orthogonal Representations

CORE-MTL: 通过因果正交表示重新思考梯度平衡

Chengfeng Wu, Tao Zou, Yanru Wu, Jingge Wang

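Affiliations: Tsinghua University

AI summary: Proposes CORE-MTL, a framework that uses causal orthogonal representations to factor the shared representation into a semantic stream and a residual stream, separating task-relevant structure from spurious context to reduce negative transfer and improve generalization.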

Comments Accepted by ICML 2026

详情
AI中文摘要

多任务学习旨在通过跨领域共享共同表示来构建联合模型。为实现这一目标,现有的优化中心方法要么平衡任务梯度,要么修改共享架构。然而,由于这些方法对共享表示的内容不可知,它们无法将任务相关结构与虚假上下文分离,导致负迁移和泛化能力差。为克服这一限制,我们提出了用于多任务学习的因果正交表示(CORE-MTL),这是一个因果驱动的表示中心框架,鼓励对共享表示进行结构化的语义-残差分解,将任务相关结构集中在语义流中,而将干扰变化归入残差流。我们通过利用结构化场景的物理先验和属性的统计约束,在视觉领域实例化了该框架。理论上,我们的方法比优化中心方法具有更紧的分布外泛化界,并且无需显式梯度投影或重新加权即可减少任务梯度干扰。实验上,CORE-MTL在视觉多任务基准测试中,在分布内和分布外设置下均持续优于现有方法。代码公开于 https://github.com/Hope-Rita/CORE-MTL。

英文摘要

Multi-task learning (MTL) aims to construct a joint model for multiple tasks by sharing a common representation across domains. To achieve this goal, existing optimization-centric methods either balance task gradients or modify the shared architecture. However, as these approaches remain agnostic to the content of the shared representation, they fail to disentangle task-relevant structure from spurious context, leading to negative transfer and poor generalization. To overcome this limitation, we propose Causal Orthogonal Representations for Multi-Task Learning (CORE-MTL), a causally motivated representation-centric framework that encourages a structured semantic-residual factorization of the shared representation, concentrating task-relevant structure in the semantic stream while relegating nuisance variation to the residual stream. We instantiate this framework in the visual domain by leveraging physical priors for structured scenes and statistical constraints for attributes. Theoretically, our method enjoys a tighter out-of-distribution generalization bound than optimization-centric methods and reduces task gradient interference without explicit gradient projection or reweighting. Empirically, CORE-MTL consistently outperforms existing methods on visual multi-task benchmarks in both in-distribution and out-of-distribution settings. Code is publicly available at https://github.com/Hope-Rita/CORE-MTL.

2606.02219 2026-06-02 cs.CV

Symmetry-Aware 9D Pose Estimation with Sim(3)-Consistent Feature and Spherical Inception Convolution

对称感知的9D姿态估计:Sim(3)一致特征与球形Inception卷积

Panfei Cheng, Hongshan Yu, Wenrui Chen, Xiaojun Tang, Jian Liu, Naveed Akhtar

Affiliations: National Engineering Research Center for Robot Visual Perception and Control, School of Robotics and Artificial Intelligence, Hunan University; Beijing Spacecrafts, China Academy of Space Technology; School of Computing and Information Systems, The University of Melbourne

AI summary: Proposes a category-level object pose estimation method that combines a semantic-guided symmetry-aware module with feature fusion based on spherical large-kernel inception convolution, giving accurate translation and size estimates without shape priors and robust rotation estimates, with state-of-the-art results on benchmarks and real-world scenes.

Comments 12 pages, 7 figures

详情
AI中文摘要

物体姿态估计是智能系统感知或操作图像/视频中物体的基本问题。然而,当前的实例级方法难以泛化到未见物体。类别级方法试图解决这一问题,但仍受限于非线性Sim(3)空间的学习复杂性和类内变化。为应对这些挑战,我们提出一种有效的类别级物体姿态估计方法,包含两项关键创新:(1) 一个平移/尺寸估计器,具有语义引导的对称感知模块,利用大型视觉模型(LVM)的鲁棒泛化能力推断对称点,从而无需形状先验即可获得精确的平移和尺寸。该结果作为旋转估计的预计算线索,降低了在非线性Sim(3)空间学习的难度,并为处理更具挑战性的旋转估计奠定坚实基础。(2) 一个特征融合模块,基于我们提出的球形大核Inception卷积,将LVM的语义特征与系统计算的几何特征融合,通过建模长程依赖关系以较低计算成本从类内变化中提取关键姿态特征。基于这些创新,我们在基准和真实场景中达到最优性能,并开发了一个能够处理多样物体的鲁棒机器人抓取系统。我们的代码将在项目页面提供:https://panfei-cheng.github.io/SSH-Pose。

英文摘要

Object pose estimation is a fundamental problem for an agent system to perceive or manipulate objects in images or videos. However, current instance-level methods struggle with generalization to unseen objects. Category-level methods seek to address this, but remain constrained by the complexities of learning in the non-linear Sim(3) space and intra-class variations. To address these challenges, we propose an effective method for category-level object pose estimation with two key innovations: (1) A translation/size estimator, featuring a semantic-guided symmetry-aware module that leverages the robust generalization capabilities of a large vision model (LVM) to infer symmetry points, resulting in accurate translation and size without shape priors. This result serves as a precomputed cue for rotation estimation, thereby reducing the difficulty of learning in the non-linear Sim(3) space and laying a robust foundation for tackling the inherently more challenging rotation estimation. (2) A feature fusion module, based on our proposed spherical large-kernel inception convolution, fuses semantic features from the LVM with systematically computed geometric features to extract essential pose features from intra-class variations by modeling long-range dependencies without excessive computational cost. Built on these innovations, we achieve SOTA on benchmarks and real-world scenes, while developing a robust robotic picking system capable of handling diverse objects. Our code will be available at the project page: https://panfei-cheng.github.io/SSH-Pose.

2606.02218 2026-06-02 cs.LG cs.AI

Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

通过感知掉队者的组大小调整实现更快的同步在线策略强化学习

Azal Ahmad Khan, Ammar Ahmed, Zeshan Fayyaz, Sheng Di, Mingyi Hong, Ali Anwar

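Affiliations: University of Minnesota; University of Waterloo; Argonne National Laboratory

AI summary: Proposes SAGC, a dynamic group-size controller that adjusts group size through online constrained optimization, reducing straggler events in synchronous on-policy reinforcement learning and improving wall-clock efficiency while matching or improving training reward and model quality.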

详情
AI中文摘要

同步强化学习方法如组相对策略优化(GRPO)提供稳定且可复现的在线策略训练,但极易受到掉队者的影响——单个异常长的轨迹可能延迟整个组的奖励计算和参数更新。随着组大小增加,这个问题变得更加严重,在更大组的好处与同步停滞的墙钟成本之间产生矛盾。我们提出感知掉队者的组控制(SAGC),一种动态组大小控制器,根据观察到的轨迹行为在线调整训练组。SAGC将组大小选择形式化为一个在线约束优化问题,旨在保留更大组的好处,同时控制掉队者事件的长期发生率。在同步GRPO和DAPO训练中,以及在普通和强工程基线上,SAGC一致地减少了掉队者发生率并提高了墙钟效率,同时实现了有竞争力或更好的训练奖励。我们进一步表明这些收益转化为最终模型质量:在下游推理基准上,SAGC与最强的静态组大小基线相比具有竞争力或更好,并且通常在没有显式长度惩罚的情况下产生更短的输出。这些结果将动态组控制定位为使同步在线策略强化学习更高效和更稳健的实用方法。

英文摘要

Synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO) provide stable and reproducible on-policy training, but they are highly vulnerable to stragglers: a single unusually long rollout can delay reward computation and parameter updates for the entire group. This problem becomes more severe as group size increases, creating a tension between the benefits of larger groups and the wall-clock cost of synchronization stalls. We propose Straggler-Aware Group Control (SAGC), a dynamic group-size controller that adapts the training group online based on observed rollout behavior. SAGC formulates group-size selection as an online constrained optimization problem, seeking to retain the benefits of larger groups while controlling the long-term rate of straggler events. Across synchronous GRPO and DAPO training, and on top of both vanilla and strong engineered baselines, SAGC consistently reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward. We further show that these gains transfer to final model quality: SAGC is competitive with or better than the strongest static group-size baseline on downstream reasoning benchmarks, and often produces shorter outputs without any explicit length penalty. These results position dynamic group control as a practical way to make synchronous on-policy RL more efficient and robust.
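
The synchronization stall described in this abstract can be made concrete with a toy cost model. This is only an illustration and not SAGC itself. It assumes one worker per rollout and that a synchronous step lasts as long as its longest rollout. The helper names `step_time` and `idle_fraction` are hypothetical and the rollout lengths are made up.

```python
def step_time(rollout_lengths):
    """A synchronous step finishes only when the longest rollout finishes."""
    return max(rollout_lengths)


def idle_fraction(rollout_lengths):
    """Share of worker time spent waiting for the slowest rollout."""
    busy = sum(rollout_lengths)
    capacity = len(rollout_lengths) * step_time(rollout_lengths)
    return 1.0 - busy / capacity


balanced = [120, 130, 110]             # rollout lengths in tokens (made up)
with_straggler = [120, 130, 110, 900]  # same group plus one long rollout

idle_balanced = idle_fraction(balanced)         # 1 - 360 / 390
idle_straggler = idle_fraction(with_straggler)  # 1 - 1260 / 3600
```

In this toy setting, adding a single long rollout raises the step time from 130 to 900 and the idle share from about 8% to 65%. This is the kind of cost a group-size controller has to weigh against the benefits of larger groups.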

2606.02215 2026-06-02 cs.CL cs.SI

Better with Experience: Self-Evolving LLM Agents for Evidence-Grounded Health Community Notes

经验更优:用于证据基础健康社区笔记的自进化LLM智能体

Zihang Fu, Fanxiao Li, Jianyang Gu, Haonan Wang, Preslav Nakov, Bryan Hooi, Min-Yen Kan, Jiaying Wu

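Affiliations: National University of Singapore; Yunnan University; The Ohio State University; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes EvoNote, a framework that uses a self-evolving experience memory and fine-grained credit assignment to acquire evidence, analyze claims, and write health Community Notes, improving note quality and cutting generation time.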

详情
AI中文摘要

大型语言模型增强的社区笔记为社交媒体上健康错误信息的及时、基于证据的纠正提供了一条可扩展的路径。然而,它们仍然在每条帖子后重置,留下了先前案例中未使用的有用纠正经验。我们引入了EvoNote,一个智能体框架,通过先前错误信息纠正事件的自进化经验记忆,使健康社区笔记生成能够自我进化。其核心是细粒度信用分配:EvoNote将轨迹级反馈基于健康特定的笔记质量进行归因,并将其提炼为行动级记忆,用于声明分析、证据获取和笔记撰写。我们在MM-HealthCN上评估了EvoNote,这是一个包含1200个实例的多模态基准测试,包括用户标记的健康帖子、人工撰写的社区笔记和众包有用性标签。在一个人工验证的分层效用评判器下,EvoNote生成的笔记在89.6%的案例中优于对应的人工笔记;在一组没有众包有用性判定的“需要更多评分”帖子中,EvoNote为82.0%的案例生成了有用的笔记。它还将生成候选纠正所需的中位时间从人工笔记流程的超过13小时减少到不到2分钟。分析将这些收益归因于更强的证据使用和可复用的纠正策略,将自进化笔记生成定位为健康错误信息治理的一种有前景的范式。

英文摘要

Large Language Model (LLM)-augmented Community Notes offer a scalable path for timely, evidence-grounded correction of health misinformation on social platforms. However, they still reset at every post, leaving useful correction experience from prior cases unused. We introduce EvoNote, an agentic framework that enables health Community Notes generation to self-evolve through an evolving experience memory of prior misinformation correction episodes. Its core is fine-grained credit assignment: EvoNote grounds trajectory-level feedback in health-specific note qualities and distills it into action-level memory for claim analysis, evidence acquisition, and note writing. We evaluate EvoNote on MM-HealthCN, a 1.2K-instance multimodal benchmark of user-flagged health posts with human-written Community Notes and crowd-derived helpfulness labels. Under a human-validated hierarchical utility judge, EvoNote-generated notes are preferred over corresponding human-written notes in 89.6% of cases; on a separate set of Needs More Ratings posts without a crowd helpfulness verdict, EvoNote produces helpful notes for 82.0% of cases. It also reduces the median time needed to produce a candidate correction from over 13 hours in the human-note pipeline to under 2 minutes. Analyses link these gains to stronger evidence use and reusable correction strategies, positioning self-evolving note generation as a promising paradigm for health misinformation governance.

2606.02214 2026-06-02 cs.CL

Do Gender Cues Affect LLM Value Trade-offs? Evidence from a Controlled Decision Benchmark

性别线索是否影响LLM价值权衡?来自受控决策基准的证据

Yangyang Liu, Dong Yu, Pengyuan Liu

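Affiliations: Beijing Language and Culture University

AI summary: Builds the controlled benchmark RVDB to test whether gender cues cause systematic flips in the value decisions of large language models. Gender effects concentrate where value boundaries are less determinate and decision severity is high, and model self-attributions often mask the influence of gender.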

详情
AI中文摘要

大型语言模型越来越多地用于价值敏感的决策场景,在这些场景中,无关的人口统计线索不应改变判断。我们构建了现实价值决策基准(RVDB),这是一个受控基准,仅改变角色-性别配置,同时保持场景、有序价值对、角色、候选决策、价值距离和决策严重性固定。使用跨七个模型的位置平衡评估,我们测试模型是否在性别扰动下保持决策不变性,以及它们的自我归因是否反映观察到的行为变化。我们发现,显式性别线索会引起有限但系统的决策翻转,包括在显式性别归因提示下,该提示要求模型报告性别是否影响其选择。跨性别角色交换揭示了一致的女性提议决策不对称性,而模型通常将翻转的决策归因于无影响或其他非性别因素。进一步分析表明,性别效应集中在价值边界较不明确和决策背景更严重的情况下,表明性别线索作为局部边界移动因素,而非价值推理的全局覆盖。价值排名基本保持稳定,但有序价值对权衡在不同角色-性别配置中不均匀地变化。这些结果表明,性别可以在行为上进入LLM价值权衡,同时在自我归因中被掩盖,这促使在基于解释的评估之外进行受控行为审计。

英文摘要

Large language models are increasingly used in value-sensitive decision settings, where irrelevant demographic cues should not alter judgments. We construct the Realistic Value Decision Benchmark (RVDB), a controlled benchmark that varies only the role-gender configuration while holding the scenario, ordered value pair, roles, candidate decisions, Value Distance, and Decision Severity fixed. Using a position-balanced evaluation across seven models, we test whether models preserve decision invariance under gender perturbations and whether their self-attributions reflect observed behavioral changes. We find that explicit gender cues induce bounded but systematic decision flips, including under an explicit gender-attribution prompt that asks models to report whether gender influenced their choice. Cross-gender role swaps reveal a consistent female-proposed-decision asymmetry, while models often attribute flipped decisions to No Influence or other non-gender factors. Further analysis shows that gender effects concentrate near less determinate value boundaries and under more severe decision contexts, suggesting that gender cues act as local boundary-shifting factors rather than global overrides of value reasoning. Value rankings remain largely stable, but ordered value-pair trade-offs shift unevenly across role-gender configurations. These results show that gender can enter LLM value trade-offs behaviorally while remaining obscured in self-attribution, motivating controlled behavioral audits beyond explanation-based evaluation.

2606.02212 2026-06-02 cs.SD

C2GA: A Class-Controllable Generative Augmentation Framework for Respiratory Sound Classification

C2GA:一种用于呼吸音分类的类别可控生成式增强框架

Ziqi Ma, Mengyu Han, Anteng Cai, Zhanchong Liu, Bowen Feng, Hang Yu, Sheng Hu

Affiliations: School of Computer Engineering and Science, Shanghai University; School of AI and Advanced Computing (AIAC), XJTLU Entrepreneur College (Taicang), Xi’an Jiaotong-Liverpool University; ISIR, Osaka University

AI summary: To address limited data, heavy noise, and class imbalance in respiratory sound classification, proposes C2GA, a class-controllable generative augmentation framework built on a conditional VQ-VAE and a Transformer autoregressive prior that generates high-fidelity, semantically consistent samples.

Comments 18 pages, 5 figures, submitted to Computer Methods and Programs in Biomedicine

详情
AI中文摘要

背景:呼吸音分类在肺部病理的临床识别中起着关键作用。然而,其性能常受限于真实听诊数据集的规模小、噪声严重和类别不平衡。尽管传统的音频增强技术易于实现,但它们可能无意中扭曲微妙的病理特征。同时,现有的基于变分自编码器(VAE)或生成对抗网络(GAN)的生成方法往往面临样本保真度有限和对类别语义的可控性不足的问题,特别是在监督稀缺的情况下。方法:为克服这些限制,我们提出C2GA,一个类别可控的生成式增强框架。C2GA首先使用条件向量量化变分自编码器(VQ-VAE)构建一个语义丰富的离散潜在空间,其中局部声学标记与全局类别原型显式解耦。随后,训练一个基于Transformer的自回归先验以生成标签一致的标记序列。这些生成的标记随后与相应的类别原型融合,并解码为高保真Mel频谱图用于数据增强。结论:这些结果表明,C2GA为呼吸音分析提供了一种有效且语义可靠的增强策略。通过实现可控且高质量的数据生成,所提框架为提高真实临床场景中呼吸音分类的鲁棒性和泛化能力提供了一种有前景的解决方案。

英文摘要

Background: Respiratory sound classification plays a critical role in the clinical identification of pulmonary pathologies. However, its performance is often hindered by the limited size, severe noise, and class imbalance of real-world auscultation datasets. Although conventional audio augmentation techniques are easy to implement, they may inadvertently distort subtle pathological characteristics. Meanwhile, existing Variational Autoencoder (VAE)- or Generative Adversarial Network (GAN)-based generative approaches often suffer from limited sample fidelity and insufficient controllability over class semantics, particularly under conditions of scarce supervision. Methods: To overcome these limitations, we propose C2GA, a class-controllable generative augmentation framework. C2GA first constructs a semantically rich discrete latent space using a conditional Vector-Quantized Variational Autoencoder (VQ-VAE), in which local acoustic tokens are explicitly decoupled from global class prototypes. Subsequently, a Transformer-based autoregressive prior is trained to generate label-consistent token sequences. These generated tokens are then fused with the corresponding class prototypes and decoded into high-fidelity Mel-spectrograms for data augmentation. Conclusion: These results indicate that C2GA provides an effective and semantically reliable augmentation strategy for respiratory sound analysis. By enabling controllable and high-quality data generation, the proposed framework offers a promising solution for improving the robustness and generalization of respiratory sound classification in realistic clinical scenarios.
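
For readers unfamiliar with the discrete latent space mentioned in the Methods, the sketch below shows the generic vector-quantization lookup used in VQ-VAE models: each continuous feature vector is replaced by the index of its nearest codebook entry. It is not C2GA's conditional design, which additionally separates class prototypes from local tokens. The codebook, the feature vectors, and the name `quantize` are made up for illustration.

```python
def quantize(vectors, codebook):
    """Return, for each vector, the index of its nearest codeword (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [min(range(len(codebook)), key=lambda k: sqdist(v, codebook[k]))
            for v in vectors]


# Hypothetical 2-D codebook with three entries.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
# Hypothetical encoder outputs for three spectrogram frames.
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]

tokens = quantize(frames, codebook)  # -> [0, 1, 2]
```

An autoregressive prior such as the Transformer described in the abstract would then be trained on sequences of these integer tokens.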

2606.02211 2026-06-02 cs.CL cs.AI

Consistency Training while Mitigating Obfuscation via Rate Matching

通过速率匹配缓解混淆的一致性训练

Sohaib Imran, Prakhar Gupta, Jannes Elstner, David Demitri Africa

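Affiliations: University of California, Berkeley

AI summary: Proposes Rate Matching Consistency Training (RMCT), which matches the rate of a target behavior rather than constraining how it is expressed, reducing the influence of extraneous features while avoiding obfuscation and preserving monitorability.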

详情
AI中文摘要

大型语言模型常常受到无关输入特征的影响,例如揭示用户偏好答案的线索。一致性训练通过训练模型在具有和不具有无关特征的输入上表现相似来减少这种影响。然而,现有方法在整个响应或内部激活上训练一致性,这也限制了模型是否表达这些无关特征。我们表明这会导致混淆,即模型学会不提及线索但仍受其影响,这可能削弱可监控性。为了解决这个问题,我们引入了速率匹配一致性训练(RMCT),它在选定的行为属性上训练一致性,而不约束这种行为如何表达。RMCT匹配模型在输入扰动下表现出目标行为(例如,遵循偏见线索)的速率,而不是要求具有和不具有无关特征的配对输入,从而将一致性训练扩展到无法移除无关特征的场景。我们在两个开放权重语言模型上评估了RMCT在减少谄媚方面的效果,在保留的偏见类型上实现了与标准一致性训练基线相当的偏见遵循减少,同时很大程度上保留了模型表达偏见线索的倾向。此外,我们发现RMCT在我们的实验中更节省数据,但计算效率较低。总体而言,RMCT表明一致性训练可以在不直接牺牲可监控性的情况下提高行为鲁棒性。

英文摘要

Large language models are often influenced by extraneous input features, such as cues revealing a user's preferred answer. Consistency training reduces this influence by training models to behave similarly across inputs with and without the extraneous feature. However, existing methods train for consistency over entire responses or internal activations, which also constrains whether the model verbalises said extraneous features. We show this leads to obfuscation, where the model learns not to mention a cue while remaining influenced by it, which may undermine monitorability. To address this, we introduce Rate Matching Consistency Training (RMCT), which trains for consistency over selected behavioural properties without constraining how this behaviour is expressed. RMCT matches the rate at which the model exhibits a target behaviour (e.g., following a bias cue) across input perturbations, rather than requiring paired inputs with and without the extraneous feature, extending consistency training to settings where the extraneous features cannot be removed. We evaluate RMCT on sycophancy reduction in two open-weight language models, achieving reductions in bias-following comparable to a standard consistency-training baseline on held-out bias types, while largely preserving the model's tendency to verbalise the bias cue. Further, we find that RMCT is more data-efficient at the expense of being less compute-efficient in our experiments. Overall, RMCT shows that consistency training can improve behavioural robustness without directly trading off against monitorability.

2606.02204 2026-06-02 cs.CL

Cross-Environment Neural Reranking for Sample-Efficient Action Selection in Text-Based Agents

跨环境神经重排序用于文本智能体中样本高效的动作选择

Kan Shao

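Affiliations: Jinglue Technology Development (Nanjing) Co., Ltd.

AI summary: Studies sample-efficient action selection by jointly training a lightweight neural reranker (DeBERTa-v3) across several text environments (ALFWorld, WebShop, ScienceWorld). Rebalanced joint training substantially improves performance, cross-environment adaptation recovers 93% of full-data performance with only 9.2% of target-domain data, and data diversity is the main driver.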

Comments 11 pages, 4 figures, 6 tables

详情
AI中文摘要

大型语言模型智能体在基于文本的基准测试中取得了强劲性能,但推理成本过高,促使使用紧凑的神经重排序器进行动作选择。我们研究单个轻量级模型是否能在多个不同环境中执行动作选择,这种能力将消除每个环境模型的维护成本。在ALFWorld、WebShop和ScienceWorld上联合训练DeBERTa-v3(184M-434M参数),并采用少数类上采样,我们发现重平衡的双环境联合训练显著提升了单环境ALFWorld性能(净增益+0.412),同时保持了具有竞争力的WebShop性能(+0.214 vs. 单环境+0.249)。三环境训练在4个种子上平均组合净增益为+0.551 +/- 0.024,每个环境的结果接近专门的单环境模型,同时提供正向跨域迁移。跨环境适应具有极高的样本效率:仅用9.2%的目标域数据微调即可恢复93%的全数据性能,且扩展模型容量收益有限,表明数据多样性是主要驱动力。基于环境感知的LoRA适配器路由结合PCGrad实现了最佳种子结果+0.611(种子42),种子456和789分别为+0.554和+0.559,但由于种子123降至+0.263(4种子均值+0.497 +/- 0.158),表现出高方差,这是一个有前景但目前不稳定的方向。使用干净分割和数据重平衡的联合训练是关键要素。我们将发布包含51,580个训练实例(41,740个原始唯一状态,带少数类上采样)的三环境基准测试,以及所有模型检查点(接收后)。

英文摘要

Large language model agents achieve strong performance on text-based benchmarks but incur prohibitive inference costs, motivating the use of compact neural rerankers for action selection. We investigate whether a single lightweight model can perform action selection across multiple diverse environments, a capability that would eliminate per-environment model maintenance. Training DeBERTa-v3 (184M-434M parameters) jointly on ALFWorld, WebShop, and ScienceWorld with minority-class upsampling, we find that rebalanced two-environment joint training substantially improves over single-environment ALFWorld performance (net gain +0.412) while maintaining competitive WebShop performance (+0.214 vs. +0.249 single-environment). Three-environment training yields a mean combined net gain of +0.551 +/- 0.024 across 4 seeds, with per-environment results approaching specialized single-environment models while providing positive cross-domain transfer. Cross-environment adaptation is highly sample-efficient: fine-tuning on only 9.2% of target-domain data recovers 93% of full-data performance, and scaling model capacity yields limited benefits, indicating data diversity is the primary driver. Environment-aware LoRA adapter routing with PCGrad achieves a best-seed result of +0.611 (seed 42), with seeds 456 and 789 at +0.554 and +0.559, but exhibits high variance due to seed 123 collapsing to +0.263 (4-seed mean +0.497 +/- 0.158), representing a promising but currently unstable direction. Joint training with clean splits and data rebalancing is a key ingredient. We will release our three-environment benchmark of 51,580 training instances (41,740 raw unique states with minority-class upsampling) and all model checkpoints upon acceptance.
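
The four per-seed net gains quoted for the LoRA adapter routing variant are enough to recompute the reported summary. The short check below uses only the numbers in the abstract. The variable names are hypothetical, and the use of the sample standard deviation (n - 1 denominator) is inferred from the arithmetic rather than stated by the authors.

```python
from statistics import mean, stdev

# Net gain per random seed, as quoted in the abstract.
lora_pcgrad_gain = {42: 0.611, 123: 0.263, 456: 0.554, 789: 0.559}

values = list(lora_pcgrad_gain.values())
seed_mean = mean(values)   # about 0.497
seed_std = stdev(values)   # sample standard deviation, about 0.158

# Effect of the one collapsed seed on the average.
mean_without_123 = mean(v for s, v in lora_pcgrad_gain.items() if s != 123)
```

The recomputed values agree with the reported +0.497 +/- 0.158. Dropping seed 123 lifts the mean to about 0.575, which shows how much of the variance comes from that single run.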

2606.02194 2026-06-02 cs.LG

Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

基于学习奖励的大规模行为模型的一致性离策略改进

Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek, Ingmar Posner, Jan Peters

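Affiliations: Technical University of Darmstadt; University of Oxford; Zuse School ELIZA; hessian.AI; German Research Center for AI (DFKI); Robotics Institute Germany (RIG)

AI summary: Proposes an inverse reinforcement learning method that learns a dense reward function from expert demonstrations and, using the theoretical guarantees of coherent imitation learning, improves a pretrained policy off-policy, outperforming reinforcement learning baselines on sparse-reward manipulation tasks.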

Comments 13 pages, 7 figures

详情
AI中文摘要

使用行为克隆将专家演示数据蒸馏到大规模生成模型中是一种可扩展的学习机器人控制能力策略的方法,特别是对于灵巧操作。强化学习(RL)可以作为一种利用额外经验进一步微调这些策略的手段。一个开放的问题是RL是否比收集更多人类演示更具样本效率。先前的工作通过将RL应用于一个较小的残差策略来大规模微调预训练策略,该残差策略纠正预训练模型。然而,对于典型的稀疏奖励任务,RL算法可能难以以样本高效的方式优化行为。我们探索逆强化学习,其中从专家演示中学习稠密奖励函数,可能降低RL微调的挑战。我们特别考虑一致性模仿学习,这是一种IRL方法,通过使用具有理论保证的特定奖励公式来促进BC策略的改进。我们展示了我们的IRL方法在所有六个稀疏操作任务上保持或提高了pi-0.5的性能,并在六个复杂操作任务中的五个上实现了≥90%的成功率,优于使用稀疏奖励的基于RL的基线。通过确保我们的初始预训练微调策略对于初始奖励和评论家是最优的,我们的方法避免了RL微调中常见的初始下降,并实现了更快的改进。

英文摘要

Distilling expert demonstration data into large generative models using behavioral cloning is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL) can be used as a means to finetune these policies further using additional experience. An open question is whether RL is more sample-efficient than collecting more human demonstrations. Prior work has finetuned large pretrained policies in a scalable fashion by applying RL to a smaller residual policy that corrects the pretrained model. However, for the typical sparse reward tasks, RL algorithms can struggle to optimize the behavior in a sample-efficient manner. We explore inverse reinforcement learning, where a dense reward function is learned from expert demonstrations, potentially reducing the challenge of RL finetuning. We specifically consider coherent imitation learning, an IRL method that facilitates improvement of the BC policy through using a specific reward formulation with theoretical guarantees. We show that our IRL method maintains or improves the performance of pi-0.5 on all six sparse manipulation tasks and achieves a $\geq 90\%$ success rate on five out of six complex manipulation tasks, outperforming RL-based baselines using sparse rewards. By ensuring our initial pretrained finetuning policy is optimal for our initial reward and critic, our method circumvents the initial drop commonly seen in RL finetuning and enables faster improvement.

2606.02179 2026-06-02 cs.LG cs.AI cs.CE

On the Generalization in Topology Optimization via Sensitivity-Conditioned Bernoulli Flow Matching

关于拓扑优化中通过敏感性条件伯努利流匹配的泛化性

Mohammad Rashed, Duarte F. Valoroso Madeira, Babak Gholami, Caglar Guerbuez, Yunjia Yang, Nils Thuerey

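Affiliations: University of California, San Diego; Max Planck Institute for Informatics

AI summary: Using an information-theoretic analysis, introduces the notion of pseudo-sensitivities and shows that a sensitivity-conditioned Bernoulli flow-matching generator achieves the best out-of-distribution generalization in topology optimization.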

Comments ICML Paper

详情
AI中文摘要

拓扑优化(TO)的代理模型在分布偏移(如载荷或边界条件变化)下表现出高度可变的分布外(OOD)泛化能力,但这一变异性的来源尚不清楚。我们假设OOD性能取决于条件信号保留关于驱动经典TO的伴随敏感性(简化梯度)的信息量。将TO流程建模为因果马尔可夫链,数据处理不等式表明,在该抽象下,敏感性场是拓扑预测的信息论最优条件信号。然而,计算精确的伴随敏感性在实践中可能昂贵或不可用;我们观察到某些物理场可以通过单调变换近似敏感性。为形式化这一点,我们引入伪敏感性来区分哪些场能够实现泛化,哪些信息贫乏。然后,我们展示了一个敏感性条件的伯努利流匹配生成器实证地证实了这些预测:以敏感性为条件可获得最先进的OOD性能,而越来越远的物理场性能退化至原始参数条件。结果在载荷偏移下的结构TO基准测试和我们新的CFD-TO数据集(边界条件偏移如多出口配置)中均成立。代码和数据集见https://tum-pbs.github.io/topotransformer/。

英文摘要

Surrogate models for topology optimization (TO) exhibit highly variable out-of-distribution (OOD) generalization under distribution shifts such as changing loads or boundary conditions, yet the source of this variability remains unclear. We hypothesize that OOD performance is governed by how much information the conditioning signal preserves about the adjoint sensitivity (reduced gradient) that drives classical TO. Modeling the TO pipeline as a causal Markov chain, the Data Processing Inequality establishes that, under this abstraction, the sensitivity field is an information-theoretically optimal conditioning signal for topology prediction. However, computing exact adjoint sensitivities can be expensive or unavailable in practice; we observe that certain physical fields can approximate sensitivities through monotone transformations. To formalize this, we introduce pseudo-sensitivities to characterize which fields enable generalization versus those that are information-poor. We then show that a sensitivity-conditioned Bernoulli flow-matching generator empirically confirms these predictions: conditioning on sensitivities yields state-of-the-art OOD performance, while increasingly distant physical fields degrade toward raw parameter conditioning. Results hold across structural TO benchmarks under load shifts and our new CFD-TO dataset under boundary-condition shifts such as multi-outlet configurations. Code and datasets are available at https://tum-pbs.github.io/topotransformer/.

2606.02177 2026-06-02 cs.LG

Low-Pass Flow Matching

低通流匹配

Francesco M. Ruscio, T. Konstantin Rusch

发表机构 * ELLIS Institute Tübingen(图宾根ELLIS研究所) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所) Tübingen AI Center(图宾根人工智能中心) Liquid AI(液体AI)

AI总结 针对流匹配中白噪声源与自然数据频谱不匹配的问题,提出基于算子调制插值的低通流匹配方法,引入时变频谱偏差,在保持或提升样本质量的同时显著降低采样成本。

Comments ICLR 2026 Delta Workshop

AI中文摘要

流匹配通常依赖于白噪声源,这一选择往往与自然数据的功率谱不一致,自然数据的功率谱倾向于随频率衰减。为了解决这个问题,我们引入了低通流匹配,这是基于算子调制插值的流匹配的一种变体。该公式引入了一种时变频谱偏差,随着路径接近数据,该偏差从源频谱过渡到频率衰减偏差。我们在无条件图像生成任务上验证了我们的方法,包括科学数据集Galaxy10。实验表明,我们的方法与自适应ODE求解器配合使用时特别有效,与标准基线相比,在提高或保持样本质量的同时,大幅降低了采样成本。

英文摘要

Flow Matching typically relies on white noise sources, a choice often misaligned with the power spectra of natural data, which tend to decay with frequency. To address this, we introduce Low-Pass Flow Matching, a variant of Flow Matching based on an operator-modulated interpolant. This formulation induces a time-varying spectral bias that transitions from the source spectrum to a frequency-decaying bias as the path approaches the data. We validate our method on unconditional image generation tasks, including the scientific Galaxy10 dataset. Empirically, we show that our method is particularly effective when paired with adaptive ODE solvers, where it improves or preserves sample quality while substantially reducing sampling cost compared to standard baselines.
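
The spectral mismatch described above can be illustrated with a small sketch. It uses the standard linear interpolant x_t = (1 - t) * noise + t * data, not the paper's operator-modulated interpolant, and assumes zero-mean, independent noise and data so that their powers add per frequency. The function names, the `cutoff` parameter and the shape of the data spectrum are all hypothetical stand-ins.

```python
def data_spectrum(freq, cutoff=4.0):
    """Hypothetical decaying power spectrum standing in for natural data."""
    return 1.0 / (1.0 + (freq / cutoff) ** 2)


def path_spectrum(t, freq, cutoff=4.0):
    """Expected power at `freq` for x_t = (1 - t) * noise + t * data.

    Assumes unit-power white noise (flat spectrum) and zero-mean noise and
    data that are independent, so the cross term vanishes and powers add.
    """
    noise_power = 1.0  # white noise: same power at every frequency
    return (1.0 - t) ** 2 * noise_power + t ** 2 * data_spectrum(freq, cutoff)


if __name__ == "__main__":
    freqs = [0, 1, 2, 4, 8, 16, 32]
    for t in (0.0, 0.5, 1.0):
        row = [round(path_spectrum(t, f), 3) for f in freqs]
        print(f"t={t}: {row}")
```

With a white-noise source the spectrum is flat at t = 0 and only takes on the decaying shape of the data as t approaches 1, so high frequencies carry noise-dominated power for most of the path. This is the mismatch the abstract attributes to white-noise sources.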

2606.02172 2026-06-02 cs.LG cs.CV

Closing the Alignment-Maturity Gap in Federated Prototype Learning

缩小联邦原型学习中的对齐-成熟度差距

Mario Casado-Diez, Alejandro Dopico-Castro, Verónica Bolón-Canedo, Bertha Guijarro-Berdiñas

发表机构 * CITIC, Universidade da Coruña(CITIC,科鲁纳大学)

AI总结 针对联邦学习中原型对齐压力抑制局部判别结构的问题,提出FedSAP框架,通过确定性对齐课程和几何驱动代理分离损失稳定表征学习,在多种异质性条件下提升分类性能。

AI中文摘要

从分布式异质数据中学习判别性视觉表示是联邦学习(FL)中的一个基本挑战。基于原型的方法通过跨客户端共享类级表示来解决统计异质性,但在早期训练轮次中会产生距离依赖的梯度压力,这种压力尤其严重:对从噪声局部表示聚合而来的不成熟全局原型施加的对齐压力会产生大梯度,从而抑制局部判别结构的出现。结果导致嵌入空间组织不良,识别性能下降,尤其是在严重的非独立同分布(non-IID)条件下。我们提出FedSAP,一个通过两种互补机制稳定联邦表示学习的框架:一个确定性对齐课程,将全局对齐延迟到局部表示变得稳定;以及一个几何驱动的代理分离损失,利用现有原型库在单位超球面上强制执行类间结构,而不引入额外参数或通信开销。这些机制共同产生紧凑、分离良好的类簇,而不改变联邦参与者之间的底层通信协议。在三个基准测试和不同程度的异质性下的实验表明,与评估的原型基线相比,性能提升高达4个百分点,在高异质性下改进最为显著。我们框架的表示性质还使其能够直接扩展到半监督设置,其中未标记数据只需最小修改即可纳入,突显了调度对齐作为设计原则的通用性。

英文摘要

Learning discriminative visual representations from distributed, heterogeneous data is a fundamental challenge in Federated Learning (FL). Prototype-based methods address statistical heterogeneity by sharing class-level representations across clients but create a distance-dependent gradient pressure that is particularly severe during early training rounds: alignment pressure applied to immature global prototypes, aggregated from noisy local representations, generates large gradients that suppress the emergence of local discriminative structure. The result is a poorly organized embedding space and degraded recognition performance, particularly under severe non-IID conditions. We propose FedSAP, a framework that stabilises federated representation learning through two complementary mechanisms: a deterministic alignment curriculum that delays global alignment until local representations become stable, and a geometry-driven proxy separation loss that enforces inter-class structure on the unit hypersphere using the existing prototype bank without introducing additional parameters or communication overhead. Together, these mechanisms produce compact, well-separated class clusters without altering the underlying communication protocol between the federation's participants. Experiments across three benchmarks and varying degrees of heterogeneity show gains of up to 4 percentage points over the prototype-based baselines evaluated, with improvements most pronounced under high heterogeneity. The representational nature of our framework further enables a straightforward extension to semi-supervised settings, where unlabelled data is incorporated with minimal modification, underscoring the generality of scheduled alignment as a design principle.
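
The two mechanisms named in the abstract above might look roughly like the following sketch. It is not FedSAP's actual formulation: the warm-up-then-linear-ramp schedule, the hinge on pairwise cosine similarity, and every function name and default value here are assumptions made for illustration.

```python
import math


def alignment_weight(round_idx, warmup_rounds=10, ramp_rounds=10, max_weight=1.0):
    """Hypothetical deterministic schedule for the global alignment term.

    The weight is zero during warm-up (local features form first), then
    ramps linearly up to `max_weight` over `ramp_rounds` rounds.
    """
    if round_idx < warmup_rounds:
        return 0.0
    progress = (round_idx - warmup_rounds) / ramp_rounds
    return max_weight * min(1.0, progress)


def normalize(vector):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def separation_loss(prototypes):
    """Hypothetical separation penalty over a bank of class prototypes.

    Averages the positive part of the cosine similarity over all pairs, so
    orthogonal or opposed prototypes cost nothing and aligned ones cost most.
    """
    unit = [normalize(p) for p in prototypes]
    total, pairs = 0.0, 0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cosine = sum(a * b for a, b in zip(unit[i], unit[j]))
            total += max(0.0, cosine)
            pairs += 1
    return total / pairs if pairs else 0.0
```

In a training loop, a client objective could then combine a local classification loss, `alignment_weight(round_idx)` times a prototype alignment term, and the separation penalty. Because the penalty reuses the prototypes already being exchanged, a term of this kind adds no parameters or messages, which is the property the abstract emphasises.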

2606.02171 2026-06-02 cs.CV

InsightVQA: High-Dimensional Emotion-Cognitive Visual Question Answering Benchmark

InsightVQA: 高维情感认知视觉问答基准

Shiyu Wang, Ziyu Liu, Chaoyi Yu, Yujie Yin, Zhongqian Mao, Jing Chen, Jiaqi Song, Yunshi Lan, Yan Wang

发表机构 * East China Normal University(华东师范大学)

AI总结 为解决现有基准仅关注情感识别而缺乏深层认知推理的问题,提出大规模层次化视觉问答数据集InsightVQA,包含725K问答对,并构建评估基准InsightVQA-Bench和基线模型InsightNet。

Comments 16 pages, 22 figures

AI中文摘要

视觉情感理解要求模型不仅识别情感状态,还要理解其产生原因并进行更高层次的认知推理。然而,现有基准主要关注情感识别,对基于依据的理解和面向响应的分析支持有限。为弥补这一差距,我们引入了 InsightVQA,一个用于情感理解和认知推理的层次化视觉问答大规模数据集。我们从六个公开来源收集的351K图像出发,应用严格的多阶段过滤流程,筛选出138K高置信度图像。每张图像在三个层次上进行标注:用于情感和效价识别的感知QA、通过约束引导生成从视觉触发提取构建的基于依据的理解QA,以及以响应意图预测和序列洞察推理为中心的认知QA。总计,InsightVQA包含725K个问答对。我们还提出了 InsightVQA-Bench,一个包含30K样本的高质量评估基准,用于细粒度评估。为支持评估,我们引入了 InsightNet,一个针对多模态大语言模型的情感调优基线。结果表明,InsightVQA对基于依据的情感理解和推理提出了重大挑战。

英文摘要

Visual emotion understanding requires models not only to recognize emotional states, but also to understand why they arise and perform higher-level cognitive reasoning. However, existing benchmarks mainly focus on emotion recognition, offering limited support for grounded understanding and response-oriented analysis. To address this gap, we introduce InsightVQA, a large-scale dataset for hierarchical visual question answering on emotion understanding and cognitive reasoning. Building from 351K images collected from six public sources, we apply a rigorous multi-stage filtering pipeline to curate 138K high-confidence images. Each image is annotated at three hierarchical levels: perception QA for emotion and valence recognition, grounded understanding QA constructed from visual trigger extraction through constraint-guided generation, and cognition QA centered on response intent prediction and sequential insight reasoning. In total, InsightVQA contains 725K QA pairs. We further present InsightVQA-Bench, a high-quality evaluation benchmark comprising 30K samples for fine-grained evaluation. To support evaluation, we introduce InsightNet, an emotion-tuned baseline for MLLMs. Results demonstrate that InsightVQA poses significant challenges for grounded emotion understanding and reasoning.