arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.08106 2026-06-09 cs.AI cs.MA New submission

PACE: Anytime-Valid Acceptance Tests for Self-Evolving Agents

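Zayx Shawn

Affiliations: Independent Researcher

AI summary: Proposes PACE, which recasts the change-acceptance problem for self-evolving agents as sequential hypothesis testing and uses paired anytime-valid commit evaluation to control the false-commit probability, substantially reducing spurious commits and lowering evaluation cost on several benchmarks.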

Abstract

Self-evolving agents improve by repeatedly proposing changes to their own prompts, skills, or workflows and keeping those that score higher on a small held-out set. Almost all effort has gone into the proposer that generates candidates; we argue the weak point is the acceptor, the rule that decides whether to commit a change. Applied hundreds of times against the same noisy dev estimate, the ubiquitous "keep it if the score went up" rule is uncontrolled adaptive multiple testing: the agent effectively p-hacks itself, accumulating false commits that make it churn and drift rather than improve. We recast committing as a sequential hypothesis test and propose PACE (Paired Anytime-valid Commit Evaluation), a training-free, anytime-valid commit gate. Each candidate is compared to the incumbent on identical instances and committed only when a testing-by-betting e-process accumulates decisive evidence, stopping early to save evaluations and controlling each candidate's false-commit probability at a user-set level even under optional stopping (a per-decision guarantee). On Qwen2.5 agents (0.5B-3B) self-evolving at the prompt level on GSM8K, SVAMP, and ARC-Challenge, greedy acceptance commits 30-42% false and 10-33% harmful edits when a genuine improvement is hidden among noisy proposals, while PACE commits the real one and essentially nothing else, matching greedy's held-out accuracy at sharply lower variance and about 18% lower evaluation cost. With no real gain available, greedy commits 13-21 spurious self-modifications per run (72-100% false) and degrades the most fragile agent by 4.9 points, while PACE holds at baseline. Reliability of self-evolution depends on the acceptor, not only on the proposer.
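
The commit gate described in this abstract can be illustrated with a minimal testing-by-betting sketch. This is not the authors' implementation: the function name `anytime_valid_gate`, the fixed bet fraction `bet`, and the encoding of each paired outcome as a difference in [-1, 1] are assumptions made for illustration. Under the null hypothesis that the candidate is no better than the incumbent, the wealth process is a nonnegative supermartingale, so by Ville's inequality it reaches 1/alpha with probability at most alpha regardless of when evaluation stops.

```python
def anytime_valid_gate(paired_diffs, alpha=0.05, bet=0.5):
    """Sequential commit gate (illustrative sketch, not the paper's code).

    paired_diffs: iterable of (candidate score - incumbent score) on identical
                  instances, each assumed to lie in [-1, 1].
    alpha:        user-set bound on the false-commit probability.
    bet:          fixed betting fraction in (0, 1); a hypothetical choice.
    Returns (decision, evaluations_used, final_wealth).
    """
    if not 0.0 < bet < 1.0:
        raise ValueError("bet must lie strictly between 0 and 1")
    wealth = 1.0
    used = 0
    for d in paired_diffs:
        if not -1.0 <= d <= 1.0:
            raise ValueError("paired differences must lie in [-1, 1]")
        # Wealth grows when the candidate wins and shrinks when it loses.
        wealth *= 1.0 + bet * d
        used += 1
        # Stop early as soon as the evidence is decisive.
        if wealth >= 1.0 / alpha:
            return "commit", used, wealth
    return "reject", used, wealth
```

With these illustrative settings, a candidate that beats the incumbent on eight consecutive paired instances is committed after eight evaluations, while a candidate that ties or alternates wins and losses is never committed.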

2606.08105 2026-06-09 cs.LG New submission

A Unifying View of Attention Sinks: Two Algorithms, Two Solutions

Lukas Fesser, Mozes Jacobs, Thomas Fel, Andy Keller, Sham Kakade

Affiliations: Kempner Institute; Harvard University

AI summary: Shows that attention sinks can correspond to two distinct mechanisms, adaptive nop and broadcast, and derives diagnostics from this; demonstrates that interventions such as gating and registers each target a different mechanism and work better in combination.

Abstract

When attention concentrates on a single token, a sink, what is the model actually computing? Attention sinks are ubiquitous in softmax transformers, yet this shared visual signature can hide fundamentally different algorithms. We show that visually similar sink patterns can reflect two distinct mechanisms: (i) adaptive nop, where a head suppresses its update by routing to a null token, and (ii) broadcast, where a sink aggregates and redistributes global information. In that case, sinks serve an analogous role: a safe destination when there is nothing useful to compute. Proposed interventions like gating or registers work because they implicitly target one or the other, revealing a duality between method and assumed mechanism: gating implicitly assumes nop; registers implicitly assume broadcast. Each mechanism leaves distinct traces (nop sinks exhibit negligible value norms; broadcast sinks induce low-rank outputs) which we formalize on synthetic tasks and use to derive practical diagnostics. Applied to pretrained vision transformers, these diagnostics reveal that both mechanisms exist at scale: sinks transition from CLS in early layers to patches in deeper layers, and concentrate in specialized heads. Strikingly, register tokens, designed for broadcast, are repurposed to also serve nop, confirming that neither intervention alone suffices. Combining gating with registers yields complementary gains in stability and performance. Overall, we find that the same attention pattern can reflect two very different computations and effective intervention requires first asking what the model is actually computing.
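
The value-norm trace mentioned in this abstract suggests a simple per-head check, sketched below. The function name `diagnose_sink` and both thresholds are assumptions, not the authors' diagnostics. The sketch only tests whether the sink token's value vector is negligible relative to the other tokens; a non-negligible norm is merely consistent with broadcast, and the low-rank-output trace would need a separate singular-value check that is omitted here.

```python
import math


def _norm(vec):
    return math.sqrt(sum(x * x for x in vec))


def diagnose_sink(attn, values, sink_threshold=0.5, nop_ratio=0.1):
    """Classify one attention head (illustrative sketch).

    attn:   n x n list of rows, each a softmax distribution over key tokens.
    values: n x d list of value vectors, one per token (n >= 2).
    sink_threshold, nop_ratio: hypothetical cut-offs chosen for illustration.
    """
    n = len(attn)
    # Mean attention mass received by each key token across all queries.
    received = [sum(attn[q][k] for q in range(n)) / n for k in range(n)]
    sink = max(range(n), key=received.__getitem__)
    if received[sink] < sink_threshold:
        return {"sink": None, "mechanism": "no sink"}
    others = [_norm(values[k]) for k in range(n) if k != sink]
    mean_other = sum(others) / len(others)
    ratio = _norm(values[sink]) / mean_other if mean_other > 0 else float("inf")
    # A negligible sink value norm means attending to the sink adds almost nothing.
    mechanism = "nop-like" if ratio < nop_ratio else "broadcast-like"
    return {"sink": sink, "value_norm_ratio": ratio, "mechanism": mechanism}
```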

2606.08104 2026-06-09 cs.RO New submission

Reinforcement learning in linear embedding space unlocks generalizable control across soft robot configurations

Xinglong Zhang, Cong Li, Hangjie Mo, Yue Jiang, Xin Xu, Wei Jiang, Zhenshan Bing, Yihe Yang, Xiaojian Li, Yueneng Yang, Huimin Lu, Ling-li Zeng, Alois Knoll, Dewen Hu, Li Wen, Wei Pan

Affiliations: National University of Defense Technology; Hefei University of Technology; Nanjing University (Suzhou Campus); Technical University of Munich; Beihang University; Newcastle University

AI summary: Proposes a reinforcement learning framework based on a shared linear Koopman embedding space that decouples control policies from robot morphology, enabling rapid transfer across 33 soft robot configurations with a 75-fold reduction in transfer samples and robust control under high-speed motion, heavy payloads, and multi-actuator faults.

Comments: An updated version of this paper has been accepted by Nature Communications

Abstract

Soft-bodied organisms such as octopuses and elephant trunks exhibit remarkable morphological adaptability, dynamically reconfiguring body shape and stiffness, and flexibly adjusting their control strategies to enable versatile behaviors. Inspired by these biological systems, various soft robots have emerged in recent decades, featuring diverse materials, stiffnesses, and morphologies tailored to specific tasks. Despite substantial advances in the materials and structural designs of soft robots, developing a generalizable control framework capable of rapid adaptation across diverse configurations remains a long-standing challenge. Existing controllers are limited to fixed configurations, demanding laborious configuration-specific remodelling and policy redesign for new configurations. Here, we introduce a generalizable control system that enables rapid adaptation across diverse soft robot configurations via reinforcement learning in a shared linear Koopman embedding space. By encoding robot dynamics into this embedding space, our method decouples control policies from specific morphologies, allowing real-time, model-free policy adaptation across diverse configurations without retraining from scratch. We validate our system across 33 distinct robot configurations. Our system achieves a 75 times reduction in transfer samples across configurations, while sustaining robust performance under high-speed motion, heavy payloads, and multiactuator faults, and achieving real-world skills previously unattainable in soft robotics. This work establishes a unified and adaptable control paradigm for diverse soft robot configurations, bridging mechanical reconfigurability with control flexibility, and may offer broader insights for generalizable control in complex physical systems.

2606.08103 2026-06-09 cs.RO cs.CV New submission

Revisiting Articulated Parts Perception in Robot Manipulation

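Xiaoqian Wu, Yejie Guo, Xiaoyang Chen, Lixin Yang, Cewu Lu, Yong-Lu Li

Affiliations: Shanghai Jiao Tong University

AI summary: Proposes Geometric Primary Structure (GPS) as a new representation of articulated parts, pairs it with a VR device for efficient annotation, and trains a generalizable model that reaches a 73% manipulation success rate zero-shot.

Comments: CVPR2026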

Abstract

We are surrounded by various objects with movable, articulated parts, e.g., boxes, handles, and doors. An accurate and generalizable perception of articulated parts is essential to enhance robotic manipulation capabilities. Building on this need, recent efforts in articulated parts perception have followed two main directions: one line of work uses pose-based representations, which incur high manual cost; in parallel, affordance-based methods extract future object motion from point tracking without additional manual effort, but suffer from low-quality data. In this paper, we propose a new representation of articulated parts, Geometric Primary Structure (GPS), an abstraction of the part geometry structure that balances scalability and quality. For efficient and scalable data collection, GPS is integrated with a portable Virtual Reality (VR) device and requires only one minute to annotate one object sequence. This direct human annotation provides higher quality than the estimated affordance. With this efficient VR-GPS system, we collect 41K frames for 234 objects across six part classes, and train a generalizable GPS model with a single RGB-D object image as input. For object manipulation, we deploy a heuristic policy based on GPS prediction. Without any in-domain fine-tuning, our method achieves a 73% success rate, covering 270 initial states for 9 objects. Our code, data, and reusable tool are available at https://enlighten0707.github.io/gps.

2606.08100 2026-06-09 cs.LG New submission

Constraint-Aware Optimization for Robust Protein Stability Prediction

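A Shivram, Aneesh S. Chivukula, Manik Gupta, Sourav Chowdhury

Affiliations: Birla Institute of Technology and Science Pilani, Hyderabad Campus

AI summary: Proposes a constraint-aware optimization framework combining Balanced Mean Squared Error, a Siamese anti-symmetric regularizer, and an OOD-margin consistency loss, improving the robustness of protein stability prediction without changing the SPURS architecture, with significant gains on several benchmarks.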

Abstract

Multimodal $\Delta\Delta G$ predictors integrating protein language models with inverse-folding representations achieve strong in-distribution accuracy on the Megascale dataset but exhibit limited robustness on out-of-distribution (OOD) proteins, persistent forward-reverse bias on paired-mutation benchmarks, and under-representation of rare stabilizing mutations. Existing approaches address these limitations primarily through additional architectural components, leaving optimization-level intervention comparatively underexplored. We introduce a constraint-aware optimization framework combining Balanced Mean Squared Error, a Siamese anti-symmetric regularizer, and a novel OOD-margin consistency loss on the per-position feature representation, requiring no architectural changes to the SPURS backbone. Across eleven benchmarks and three random seeds, the framework improves Spearman correlation on S669 from 0.486 to 0.540 ($\sigma=0.002$ across seeds), matching the published SPURS baseline (0.50) without architectural modification, and on S461 from 0.653 to 0.711, with consistent smaller gains on five additional OOD datasets. A controlled diagnostic on Ssym reveals that anti-symmetric training does not eliminate systematic forward-reverse bias, indicating that gains arise through implicit regularization rather than exact thermodynamic constraint enforcement.
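
The anti-symmetric regularizer named in this abstract rests on the thermodynamic identity that a reverse mutation should have the opposite stability change of the forward one. A minimal sketch of such a penalty follows; the function names, the plain (unbalanced) MSE term, and the `weight` value are assumptions for illustration, and the Balanced MSE and OOD-margin terms from the abstract are not reproduced.

```python
def antisymmetry_penalty(pred_forward, pred_reverse):
    """Mean squared violation of pred(forward) = -pred(reverse)."""
    if len(pred_forward) != len(pred_reverse) or not pred_forward:
        raise ValueError("need equally sized, non-empty prediction lists")
    return sum((f + r) ** 2 for f, r in zip(pred_forward, pred_reverse)) / len(pred_forward)


def regularized_loss(pred_forward, pred_reverse, target_forward, weight=0.1):
    """Plain MSE on forward mutations plus the weighted antisymmetry penalty.

    weight is a hypothetical trade-off coefficient.
    """
    mse = sum((p - t) ** 2 for p, t in zip(pred_forward, target_forward)) / len(pred_forward)
    return mse + weight * antisymmetry_penalty(pred_forward, pred_reverse)
```

Because the penalty is a soft term in the loss rather than a hard constraint, a trained model can still show residual forward-reverse bias, which is consistent with the Ssym diagnostic reported in the abstract.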

2606.08099 2026-06-09 cs.RO New submission

Cybernetic Android Avatar "Yui": System Integration, Field Deployment, and Evaluation

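Kaoruko Shinkawa, Mizuki Nakajima, Taisei Mogi, Yoshihiro Nakata

Affiliations: The University of Electro-Communications; Tokyo Denki University

AI summary: Presents Yui, a full-body cybernetic android avatar that integrates operator-side immersive teleoperation with interlocutor-side human-like social signaling; real-world deployments, including a long-term Expo exhibition and a remote educational exchange, demonstrate feasibility, with positive ratings for co-presence and emotion transmission.

Comments: 47 pages, 20 figures, 10 tables. Submitted to International Journal of Social Robotics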

Abstract

Remote communication technologies have become widely used; however, supporting a sense of shared physical space and conveying rich non-verbal cues remain challenging in many social interaction scenarios. This study presents "Yui," a full-body cybernetic android avatar designed to integrate operator-side immersive teleoperation with interlocutor-side human-like social signaling. Yui combines a 55-degree-of-freedom full-body mechanism with a previously developed android head, facial expression and gaze control, upper-body and arm motion, hand actuation, and a mobile platform. It can be operated through either an immersive mode using a head-mounted-display-based interface or a desktop mode using a webcam-based interface. We evaluated the system through three real-world deployments: a long-term public exhibition at Expo 2025 in Osaka, Kansai, Japan; a remote educational exchange between elementary school students; and a public interaction study with general participants. During the Expo deployment, two units accumulated approximately 1131 h of operation, demonstrating both operational feasibility and maintenance challenges. In the public study, both operators and interlocutors reported positive impressions of co-presence and willingness to use the system. Interlocutors also rated the avatar positively in terms of human likeness and the transmission of emotions and intentions. The results indicate usability for general operators while suggesting room for improvement in precise controllability. These findings provide field-derived evidence and design implications for socially deployable full-body android avatars.

2606.08094 2026-06-09 cs.RO cs.AI cs.LG cs.SY eess.SY New submission

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

Khanh D. Nguyen, Hung T. Ho, Chinh T. Nguyen, Thanh Q. Duong, Linh D. Le, Duy M. H. Nguyen, Vien A. Ngo, An T. Le

Affiliations: VinRobotics Center for AI Research, VinUniversity; Intelligent Autonomous Systems, TU Darmstadt; Max Planck Research School for Intelligent Systems; University of Stuttgart; German Research Center for Artificial Intelligence

AI summary: Presents vla.cpp, a portable C++ inference runtime built on llama.cpp that supports multiple VLA architectures, comes close to state-of-the-art performance on LIBERO-Object with only 1.3 GiB of memory, and deploys across hardware tiers.

Comments: 17 pages, 3 figures, 12 tables

Abstract

Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runtime built on llama.cpp. To our knowledge, it is the first ggml-class engine to natively serve the flow-matching and diffusion VLA inference pattern, in which a cached vision-language prefix is consumed by a cross-attending action expert integrated over several solver steps. A single runtime serves seven architectures spanning five backbone and four action-head families behind one request/response protocol, with each model packaged as a self-contained bundle. On LIBERO-Object, the engine matches a state-of-the-art checkpoint to within one episode out of 200, and runs BitVLA at 100% success in 1.3 GiB of memory. The same bundle runs unchanged across three hardware tiers, from a consumer GPU down to an 8 GB embedded module. A cross-hardware roofline analysis shows that batch-1 VLA inference is compute-bound, so utilization rather than bandwidth is the deployment lever; an IMMA ladder GEMM derived from this analysis cuts BitVLA per-step latency by 4.5x. We then frame an on-robot stress test on an ALOHA arm that isolates the latency constraint under which a learned VLA must replan against a moving target on the hardware it was trained for. Code, demo videos, and the reproducible benchmark scaffold are available at https://fai-modelopt-tech.github.io/vla-cpp.github.io/.
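
The roofline argument in this abstract compares a workload's arithmetic intensity (FLOPs per byte moved) with the hardware's ridge point (peak FLOP/s divided by peak bandwidth). The sketch below shows that standard classification; the function name and every number used with it are hypothetical and are not measurements from the paper.

```python
def roofline_bound(flops, bytes_moved, peak_flops, peak_bandwidth):
    """Classify a kernel under the roofline model (illustrative sketch).

    flops, bytes_moved:         work and memory traffic of one kernel call.
    peak_flops, peak_bandwidth: hardware peaks in FLOP/s and bytes/s.
    Returns (regime, attainable_flops_per_second).
    """
    intensity = flops / bytes_moved      # FLOPs per byte
    ridge = peak_flops / peak_bandwidth  # intensity where the two roofs meet
    attainable = min(peak_flops, intensity * peak_bandwidth)
    regime = "compute-bound" if intensity >= ridge else "memory-bound"
    return regime, attainable
```

When a kernel falls in the compute-bound regime, adding memory bandwidth does not help, which is why the abstract treats utilization of the compute units as the deployment lever.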

2606.08093 2026-06-09 cs.AI New submission

A Multi-modal Agentic Co-pilot for Evidence Grounded Computational Pathology

Zhe Xu, Zhengyu Zhang, Zhiyuan Cai, Jiahao Xu, Yijie Lin, Ziyi Liu, Junlin Hou, Hongyi Wang, Yuxiang Nie, Ling Liang, Yihui Wang, Yingxue Xu, Ronald Cheong Kin Chan, Li Liang, Hao Chen

Affiliations: Department of Computer Science and Engineering, Hong Kong University of Science and Technology; Department of Pathology, Nanfang Hospital, Southern Medical University; Department of Pathology, School of Basic Medical Sciences, Southern Medical University; Department of Anatomical and Cellular Pathology, Chinese University of Hong Kong; Guangdong Provincial Key Laboratory of Molecular Tumor Pathology; Jinfeng Laboratory; Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology; Division of Life Science, Hong Kong University of Science and Technology; State Key Laboratory of Nervous System Disorders, The Hong Kong University of Science and Technology; HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute, The Hong Kong University of Science and Technology

AI summary: Presents PathPocket, a multimodal AI co-pilot that builds a pathology evidence corpus of about 110,000 documents and a hypergraph of 4.55 million entities to support evidence-grounded pathology diagnosis, outperforming existing methods on 200,000 real-world cases.

Abstract

Pathology is the cornerstone of modern medicine, where accurate decision-making relies heavily on evidence-based practices. While artificial intelligence (AI) has the potential to transform clinical workflows, the intersection of AI and evidence-based medicine remains under-explored, with early attempts restricted to text-only general medicine. In this work, we present PathPocket, a multimodal AI agentic co-pilot designed specifically for evidence-grounded pathology. We construct the most comprehensive pathology evidence corpus to date, encompassing approximately 110,472 public and authorized documents structured across a rigorous hierarchy of evidence from clinical guidelines to expert opinion. From this meticulously graded foundation, we build a large-scale multimodal pathology hypergraph containing over 4.55 million entities and 7.10 million relations. Serving as a robust knowledge engine, this hypergraph provides traceable evidence for a collaborative multi-agent reasoning framework integrating input understanding, evidence retrieval, filtering, and diagnosis generation. This enables PathPocket to seamlessly resolve a wide spectrum of clinical tasks, ranging from text-only queries to complex multimodal diagnostics involving region-of-interest (ROI) and gigapixel whole-slide images (WSIs). We rigorously evaluate the system on a multidimensional benchmark of over 200,000 real-world cases, where it significantly outperforms existing state-of-the-art methods. Crucially, extensive user studies demonstrate that PathPocket substantially improves the diagnostic accuracy and confidence of pathologists. By directly grounding pathology interpretations in verifiable literature, PathPocket offers a practical and scalable solution for the future of evidence-grounded computational pathology.

2606.08092 2026-06-09 cs.CL New submission

When Languages Disagree: Self-Evolving Multilingual LLM Judges

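Xiyan Fu, Wei Lu

Affiliations: Nanyang Technological University

AI summary: Proposes SEMJ, which exploits cross-lingual inconsistency in multilingual judging for iterative self-reflection and re-evaluation, outperforming voting and reflection baselines on multiple benchmarks and improving accuracy and cross-lingual consistency.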

Abstract

Multilingual LLM-as-a-judge is widely used to evaluate model outputs across languages, but suffers from cross-lingual inconsistency (Fu and Liu, 2025). Existing methods typically treat this inconsistency as noise and mitigate it through voting or aggregation. In this work, we instead show that multilingual inconsistency can provide complementary evaluation signals. Our oracle analysis finds that sampling judgments across languages yields a higher performance upper bound than single-language judging, indicating that different languages potentially include complementary judgments. Motivated by this finding, we propose SEMJ, a self-evolving multilingual judge that leverages cross-lingual inconsistency for iterative refinement. SEMJ constructs multilingual variants of each input, collects independent judgments and rationales, and feeds inconsistent outputs back for self-reflection and re-evaluation. Experiments on multiple benchmarks show that SEMJ consistently outperforms voting and reflection baselines in both accuracy and cross-lingual consistency. Further analysis shows that inconsistency triggers useful re-evaluation, which improves judgment quality.
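
The loop this abstract describes (judge each language variant, detect disagreement, feed the conflicting outputs back for re-evaluation) can be sketched as follows. The `judge` callable, its signature, the round limit, and the majority-vote fallback are all assumptions for illustration; the abstract does not specify how SEMJ terminates when disagreement persists.

```python
from collections import Counter


def self_evolving_judge(judge, variants, max_rounds=3):
    """Iterate until the per-language judgments agree (illustrative sketch).

    judge:    hypothetical callable (lang, text, feedback) -> (label, rationale).
    variants: dict mapping language code to the translated input text.
    Returns (final_label, rounds_used).
    """
    feedback = None
    labels = {}
    for round_idx in range(max_rounds):
        results = {lang: judge(lang, text, feedback) for lang, text in variants.items()}
        labels = {lang: label for lang, (label, _) in results.items()}
        if len(set(labels.values())) == 1:
            # All languages agree, so no further reflection is needed.
            return next(iter(labels.values())), round_idx + 1
        # Disagreement: pass every label and rationale back for self-reflection.
        feedback = results
    # Assumed fallback when disagreement persists: majority vote.
    return Counter(labels.values()).most_common(1)[0][0], max_rounds
```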

2606.08091 2026-06-09 cs.CV New submission

VideoWeaver: Evaluating and Evolving Skills for Agentic Long Video Generation

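Jianhui Wei, Jie Tan, Hengchuan Zhu, Xiaotian Zhang, Yan Zhang, Ziyi Chen, Daoan Zhang, Wei Xu, Zuozhu Liu

Affiliations: Zhejiang University; ByteDance

AI summary: Presents the VideoWeaver framework, in which an agent autonomously composes foundation skills to generate videos; designs an agent-as-judge that evaluates both process and outcome, and improves generation quality through a skill evolution algorithm.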

Abstract

Recent agent frameworks such as Claude Code, Codex, and OpenClaw are strong at tool use and orchestration, but whether they can handle long video generation, a long-horizon multimodal task, remains underexplored. Unlike earlier video agents whose pipeline is handcrafted, these frameworks can build and refine their own workflows. We introduce VideoWeaver, an agent harness and benchmark that evaluates and evolves skills for long video generation, where an agent turns a single instruction into a long video by composing foundation skills into its own workflow rather than following a predefined pipeline. The benchmark has 16 task categories and 285 cases, with references spanning text, image, audio, video, and their combinations. Because errors can arise at any stage and not just in the final video, we propose an agent-as-judge that inspects both the execution trace and the final video, grounding its scores in evidence such as metadata and intermediate files. Using this feedback, we further design a skill evolution algorithm that refines and merges the agent's skills. Across multiple frameworks and models, we find that an explicit composition skill improves the generation process over using foundation skills alone, that skill evolution further improves output quality, and that performance varies notably across harness and model choices. The proposed agent-as-judge also aligns well with human judgments, especially on process metrics. Code and dataset are available at https://github.com/JianhuiWei7/VideoWeaver.

2606.08088 2026-06-09 cs.LG cs.CL New submission

ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning

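Qing Miao, Yiming Zhao, Jing Yang, Chenxi Liu, Yuehai Chen, Yuewen Liu, Shaoyi Du, Badong Chen

Affiliations: Xi'an Jiaotong University; University of Science and Technology of China

AI summary: Proposes the ConSteer-RL framework, which incorporates token-level confidence signals from model log-probabilities into GRPO and uses a confidence-aware reward shaping mechanism to penalize overconfident errors and reinforce correct, confident reasoning, with average gains of 2.3%-4.0% across model scales.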

Abstract

Reinforcement Learning from Verifiable Rewards (RLVR) has recently become a key paradigm for improving the reasoning abilities of Large Language Models (LLMs), yet it remains limited by sparse binary rewards and its ignorance of model-internal uncertainty. In this paper, we propose ConSteer-RL, a simple yet effective framework that integrates token-level confidence signals derived from model log-probabilities into RLVR training. Specifically, building upon the Group Relative Policy Optimization (GRPO) framework, we construct a confidence-aware reward by aggregating per-token probabilities into a scalar confidence score and incorporating it into an awareness-based reward shaping mechanism that penalizes overconfident errors while reinforcing correct and confident reasoning. Experimental results demonstrate that ConSteer-RL consistently outperforms strong GRPO baselines, achieving average improvements of 2.3%-4.0% across different model scales.
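
One way to read the reward construction in this abstract is sketched below. The aggregation (geometric mean of token probabilities), the shaping formula, and the coefficient `beta` are assumptions chosen for illustration, since the abstract does not give the exact form. The last function is the usual group-relative normalization used in GRPO.

```python
import math


def confidence(token_logprobs):
    """Scalar confidence: geometric mean of token probabilities (an assumed choice)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def shaped_reward(correct, token_logprobs, beta=0.5):
    """Binary verifiable reward adjusted by confidence (hypothetical formula).

    Correct answers earn a bonus that grows with confidence; wrong answers
    are penalized more heavily the more confident the model was.
    """
    c = confidence(token_logprobs)
    return 1.0 + beta * c if correct else -beta * c


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group, as in GRPO."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    return [(r - mean) / (std + eps) for r in rewards]
```

Under this shaping, two wrong answers no longer receive the same reward: the more confident one gets the larger penalty, which gives the policy gradient a signal that a purely binary reward lacks.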

2606.08087 2026-06-09 cs.SD cs.CL New submission

Assessing the Energy and Carbon Emissions of Neural Speaker Verification Model in Training and Inference

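Hugo Leguillier, Driss Matrouf, Guillaume Lechien, Mickael Rouvier

Affiliations: LIA, UPR 4128 Aday Avignon University

AI summary: Measures the energy consumption and carbon emissions of different ResNet architectures on VoxCeleb2, finding that deeper or wider models bring marginal accuracy gains at sharply higher energy cost, while mid-sized networks such as ResNet-50 strike a good balance between performance and environmental impact.

Comments: Accepted to Speaker Odyssey 2026 Lisbon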

Abstract

Deep-learning speaker verification (SV) increasingly relies on deep neural network backbones, whose environmental impact remains largely undocumented. In this paper, we conduct an evaluation of ResNet architectures trained on VoxCeleb2, varying depth, channel width, and stage distribution, and measure energy consumption and carbon footprint using node-level sensors. Results show a clear point of diminishing returns: deeper or wider models bring only marginal accuracy gains while energy consumption grows steeply. In contrast, mid-sized networks such as ResNet-50 and stage-concentrated variants achieve favorable trade-offs between performance and environmental impact. These findings provide actionable guidelines for designing energy-efficient SV systems.

2606.08081 2026-06-09 cs.CL cs.AI New submission

Aligned but Not Partner-Specific: Distinguishing How Multimodal LLM Agents Succeed in Reference Games Without Human-Like Conventions

Po-Ya Angela Wang, Chinmaya Mishra, Aslı Özyürek, Paula Rubio-Fernández, Esam Ghaleb

Affiliations: National Taiwan University; Max Planck Institute for Psycholinguistics; Radboud University; Institut Jean Nicod

AI summary: Uses a constrained pseudo-dyad baseline to determine whether label alignment among multimodal LLM agents in reference games stems from partner-specific interaction or from a shared task vocabulary, finding that agents coordinate through verbose descriptions rather than compressed expressions.

Abstract

Repeated reference games test whether interlocutors replace their initially long descriptions with shorter, partner-specific conventions grounded in shared interaction history. Prior work shows that multimodal LLMs fail to become more efficient across rounds, although they align on the labels they use. How can we determine whether this alignment reflects partner-specific grounding rather than a shared task vocabulary? We address this question by comparing capable multimodal agent dyads with human dyads from the KTH Tangrams corpus. Our novel methodological contribution is a constrained pseudo-dyad baseline that matches the original referential task structure, but breaks partner history. This baseline enables us to test whether the observed label alignment depends on interaction with a specific partner. Across three analytic layers (task competence, description strategy, alignment dynamics), we find clear differences. Humans reduce effort through entrainment, compressing descriptions and increasing label alignment with partners. Agents instead maintain fixed effort levels, producing verbose descriptions from round one, with near-ceiling label overlap that is statistically indistinguishable between real and pseudo dyads. MLLMs thus achieve coordination without convention, succeeding by verbose description rather than by forming the compact, history-dependent referring expressions characteristic of human dialogue.

2606.08078 2026-06-09 cs.SD cs.CL New submission

On Low-Bit Quantization Errors in Speaker Verification: Diagnostic and Mitigation

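Hugo Leguillier, Driss Matrouf, Guillaume Lechien, Mickael Rouvier

Affiliations: LIA, UPR 4128 Avignon University; Aday

AI summary: Diagnoses the impact of low-bit quantization on speaker verification through layer-wise and score-level analyses, identifies 2 bits as the critical knee point, and proposes a calibrated multi-precision cascade that approaches full-precision performance while retaining the efficiency of low-bit inference.

Comments: Accepted at Speaker Odyssey 2026 Lisbon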

详情
AI中文摘要

尽管低比特量化为在资源受限设备上部署说话人验证提供了实用手段,但其对说话人验证性能的影响仍知之甚少。本文通过联合逐层和得分级分析,研究了ResNet-36和ResNet-200的均匀K-means量化感知训练。我们的逐层分析突出了脆弱组件,并表明得分退化不能仅由权重失真完全解释。我们在2比特处识别出一个明显的拐点,较大的得分漂移和有害决策翻转集中在FP32阈值附近。我们的得分级分析揭示了在极端量化下得分误差产生的位置和方式。基于这些发现,我们提出了一种校准的多精度级联方法,该方法在2比特下解决大多数试验,仅升级模糊情况,实现了接近FP32的性能,同时以显著降低的计算和内存成本保留了低位推理的效率优势。

英文摘要

Although low-bit quantization provides practical means to deploy speaker verification on resource-constrained devices, its effects on speaker verification performance remain poorly understood. In this paper, we study uniform K-means quantization-aware training of ResNet-36 and ResNet-200 through joint layer-wise and score-level analyses. Our layer-wise analysis highlights fragile components and shows that score degradation is not fully explained by weight distortion alone. We identify a clear knee point at 2 bits, with larger score drift and harmful decision flips concentrated near the FP32 threshold. Our score-level analysis reveals where and how score errors emerge under extreme quantization. Building on these findings, we propose a calibrated multi-precision cascade that resolves most trials at 2 bits and escalates only ambiguous cases, achieving performance close to FP32 while preserving the efficiency benefits of low-bit inference with substantially lower compute and memory costs.
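The cascade idea can be sketched as a simple decision rule. This is a minimal illustration, assuming the 2-bit and FP32 scores are calibrated to a shared threshold and that "ambiguous" means the 2-bit score falls within a fixed margin of that threshold. The function name, the margin rule, and the example scores are assumptions, not the authors' implementation.

```python
def cascade_decision(score_2bit, score_fp32_fn, threshold, margin):
    """Decide a verification trial with the cheap model when possible.

    score_2bit    -- calibrated score from the 2-bit model
    score_fp32_fn -- zero-argument callable that computes the FP32 score
                     (only invoked when the trial is escalated)
    threshold     -- accept/reject decision threshold
    margin        -- half-width of the ambiguity band around the threshold
    """
    if abs(score_2bit - threshold) > margin:
        # Confident trial: resolve it at 2 bits.
        return score_2bit >= threshold, "2bit"
    # Ambiguous trial near the threshold: escalate to full precision.
    return score_fp32_fn() >= threshold, "fp32"

# Hypothetical scores for two trials.
confident = cascade_decision(0.90, lambda: 0.10, threshold=0.5, margin=0.1)
ambiguous = cascade_decision(0.55, lambda: 0.40, threshold=0.5, margin=0.1)
print(confident, ambiguous)
```

Passing the FP32 scorer as a callable keeps the expensive model idle for trials that the 2-bit model resolves on its own.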

2606.08077 2026-06-09 cs.CL 新提交

Support Vector Rubrics: Closing the Gap Between Self-Generated and Human Rubrics

支持向量评分准则:弥合自生成与人工评分准则之间的差距

Mengyuan Sun, Yu Li, Zhuohao Yu, Shikun Zhang, Wei Ye

发表机构 * National Engineering Research Center for Software Engineering, Peking University(北京大学软件工程国家工程研究中心) University of Science and Technology of China(中国科学技术大学)

AI总结 针对自生成评分准则在困难实例上落后于人工标注的问题,提出SVR框架,将准则构建转化为偏好数据上的最大间隔边界学习,通过对比特征挖掘、提示条件选择器和迭代优化,显著缩小与人工准则的差距,并展现出广泛的奖励建模能力。

详情
AI中文摘要

基于评分准则的评估是评判大语言模型(LLM)输出的一种有前景的范式,然而在困难实例上,自生成准则落后于人工标注的准则。我们认为这一判别差距反映了目标不匹配:自生成准则描述好的回答,而有效的准则必须区分相近的候选。为弥合这一差距,我们引入SVR(支持向量评分准则),一个将准则构建重新表述为偏好数据上的最大间隔边界学习的框架。SVR从偏好对中挖掘对比特征存入准则库,学习一个提示条件化的选择器以及全局准则权重,并通过支持对选择和对抗性探测困难负例来迭代优化准则库。在推理时,仅给定提示,SVR从库中检索顶级准则并对回答进行评分。在RubricBench上,SVR将差距从24.1分缩小到0.3分,并优于强自生成准则和评判基线,且学习到的准则库无需重新训练即可跨评判迁移。在RewardBench 1&2和RM-Bench上,它仍与专用奖励模型保持竞争力,展示了更广泛的奖励建模能力。总体而言,边界定义的准则为弥合LLM评估中的判别差距提供了一条原则性路径。

英文摘要

Rubric-based evaluation is a promising paradigm for judging large language model (LLM) outputs, yet self-generated rubrics lag behind human-annotated criteria on hard instances. We argue this discriminative gap reflects an objective mismatch: self-generated rubrics describe good responses, whereas effective criteria must discriminate between close candidates. To close this gap, we introduce SVR (Support Vector Rubrics), a framework that recasts rubric construction as max-margin boundary learning over preference data. SVR mines contrastive features from preference pairs into a rubric bank, learns a prompt-conditioned selector together with global rubric weights, and iteratively refines the bank through support-pair selection and adversarial probing of hard negatives. At inference, given only the prompt, SVR retrieves the top-k rubrics from the bank and scores responses. On RubricBench, SVR narrows the gap to human reference rubrics from 24.1 to 0.3 points and outperforms strong self-rubric and judge baselines, and the learned bank transfers across judges without retraining. On RewardBench 1&2 and RM-Bench, it remains competitive with dedicated reward models, demonstrating broader reward modeling capability. Overall, boundary-defining rubrics offer a principled route to closing the discriminative gap in LLM evaluation.

2606.08076 2026-06-09 cs.CL cs.AI cs.CY 新提交

"I understand your perspective": LLM Persuasion and Sycophancy through the Lens of Communicative Action Theory

“我理解你的观点”:通过交往行动理论视角看LLM的说服与谄媚

Esra Dönmez, Agnieszka Falenska

发表机构 * Institute for Natural Language Processing, University of Stuttgart(斯图加特大学自然语言处理研究所) Interchange Forum for Reflecting on Intelligent Systems, University of Stuttgart(斯图加特大学智能系统反思交流论坛)

AI总结 本研究基于哈贝马斯的交往行动理论,通过模拟Reddit讨论,发现LLM能有效传达言外之意(如建立信任),其谄媚策略与观点改变强相关,且人类更偏好LLM生成的论证。

详情
Journal ref
Findings of the Association for Computational Linguistics: ACL 2025
AI中文摘要

大型语言模型(LLM)能够生成高质量的论证,但它们在参与细致入微且有说服力的交往行动方面的能力仍在很大程度上未被探索。本研究通过尤尔根·哈贝马斯的交往行动理论框架探索LLM的说服潜力。它考察LLM是否以与人类交流可比的方式表达言外之意(即语言的语用功能,如传达知识、建立信任或表明相似性)。我们使用来自说服性子论坛ChangeMyView的对话,模拟意见持有者与LLM之间的在线讨论。然后,我们比较人类撰写和LLM生成的反驳论证中言外之意的可能性,特别是那些成功改变了原帖作者观点的论证。我们发现,所有三个LLM都能有效传达言外之意(通常比人类更甚),可能增加其拟人化程度。此外,LLM精心制作谄媚回应,与意见持有者的意图紧密对齐,这种策略与观点改变强相关。最后,众包工作者认为LLM生成的反驳论证更易于认同,并且一致偏好它们胜过人类撰写的论证。这些发现表明,LLM的说服力不仅仅在于生成高质量论证。确切地说,用人类偏好训练LLM有效地调整它们以模仿人类交流模式,特别是细微的交往行动,可能增加个体对其影响的易感性。

英文摘要

Large Language Models (LLMs) can generate high-quality arguments, yet their ability to engage in nuanced and persuasive communicative actions remains largely unexplored. This work explores the persuasive potential of LLMs through the framework of Jürgen Habermas' Theory of Communicative Action. It examines whether LLMs express illocutionary intent (i.e., pragmatic functions of language such as conveying knowledge, building trust, or signaling similarity) in ways that are comparable to human communication. We simulate online discussions between opinion holders and LLMs using conversations from the persuasive subreddit ChangeMyView. We then compare the likelihood of illocutionary intents in human-written and LLM-generated counter-arguments, specifically those that successfully changed the original poster's view. We find that all three LLMs effectively convey illocutionary intent -- often more so than humans -- potentially increasing their anthropomorphism. Further, LLMs craft sycophantic responses that closely align with the opinion holder's intent, a strategy strongly associated with opinion change. Finally, crowd-sourced workers find LLM-generated counter-arguments more agreeable and consistently prefer them over human-written ones. These findings suggest that LLMs' persuasive power extends beyond merely generating high-quality arguments. Rather, training LLMs with human preferences effectively tunes them to mirror human communication patterns, particularly nuanced communicative actions, potentially increasing individuals' susceptibility to their influence.

2606.08071 2026-06-09 cs.CL 新提交

SurgiQ: A Large-Scale Multi-Domain Benchmark for Evaluating Surgical Understanding in Large Language Models

SurgiQ: 用于评估大语言模型手术理解的大规模多领域基准

Ayah Al-Naji, Edoardo Fazzari, Saif Alkindi, Hamdan Alhadhrami, Preslav Nakov, Cesare Stefanini

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出SurgiQ基准,包含13,055道多选题,覆盖六个外科领域和四种题型,用于评估LLM的手术推理能力。实验显示最佳模型准确率仅68.1%,通用模型优于多数生物医学模型,表明当前医学专业化未能充分覆盖手术知识。

详情
AI中文摘要

大语言模型在外科领域的可靠评估仍不成熟。广泛的医学基准测试临床知识,而手术需要程序性推理、管理权衡、否定处理以及在合理手术决策中的选择。我们提出SurgiQ,一个纯文本、基于来源的基准,包含13,055道四选一多选题,涵盖六个外科领域和四种题型:基于案例、推理、最佳选项和否定题。SurgiQ通过多阶段生成、验证和专家审核流程,从外科教科书、开放获取论文和考试材料构建。我们在统一的log-likelihood协议下评估了35个开源权重LLM。结果显示仍有很大提升空间:较小模型通常接近25%的随机基线,而最佳模型达到68.1%的准确率。通用模型,尤其是Qwen2.5,优于大多数生物医学模型,表明当前的医学专业化尚未提供足够广泛的外科覆盖。校准和错误分析进一步表明,即使是强模型也会在临床合理的干扰项上犯自信的错误,这促使进行更可靠和更广泛的外科LLM评估。

英文摘要

Reliable evaluation of large language models in surgery remains underdeveloped. Broad medical benchmarks test clinical knowledge, while surgery requires procedural reasoning, management trade-offs, negation handling, and selection among plausible operative decisions. We present SurgiQ, a text-only, source-grounded benchmark of 13,055 four-option multiple-choice questions spanning six surgical domains and four question formats: case-based, reasoning, best-option, and negative. SurgiQ is constructed from surgical textbooks, open-access papers, and examination material using a multi-stage generation, verification, and expert-audit pipeline. We evaluate 35 open-weight LLMs under a unified log-likelihood protocol. Our results show substantial remaining headroom: smaller models often remain near the 25% random baseline, while the best model reaches 68.1% accuracy. General-purpose models, especially Qwen2.5, outperform most biomedical models, suggesting that current medical specialization does not yet provide sufficiently broad surgical coverage. Calibration and error analysis further show that even strong models make confident mistakes on clinically plausible distractors, motivating more reliable and broader surgical LLM evaluation.
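A log-likelihood protocol for four-option questions can be sketched as follows. This assumes the model's log-likelihood for each answer option has already been computed and that the prediction is the highest-scoring option. The example numbers are made up, and details such as length normalization are not specified in the abstract, so they are left out.

```python
def predict(option_logliks):
    """Return the index of the option with the highest log-likelihood."""
    return max(range(len(option_logliks)), key=lambda i: option_logliks[i])

def accuracy(items):
    """items: list of (option_logliks, gold_index) pairs."""
    correct = sum(predict(logliks) == gold for logliks, gold in items)
    return correct / len(items)

# Hypothetical per-option log-likelihoods for three four-option questions.
items = [
    ([-3.2, -1.1, -4.0, -2.5], 1),  # model prefers option 1, gold is 1
    ([-0.9, -2.0, -1.5, -3.0], 2),  # model prefers option 0, gold is 2
    ([-2.2, -2.1, -2.4, -0.7], 3),  # model prefers option 3, gold is 3
]

random_baseline = 1 / 4  # four options per question
acc = accuracy(items)
print(f"accuracy={acc:.3f} random_baseline={random_baseline:.2f}")
```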

2606.08068 2026-06-09 cs.LG 新提交

DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

DICE: 用于稳定多智能体LLM协调的熵正则化均衡选择

Yi Xie, Zhanke Zhou, Chentao Cao, Bo Liu, Bo Han

发表机构 * University of Arizona(亚利桑那大学) Hong Kong Baptist University(香港浸会大学)

AI总结 提出DICE框架,通过熵正则化均衡选择(HQRE)解决多智能体LLM协调中的不稳定性,实现线性收敛和有限贝叶斯遗憾,在11个基准上平均提升4.3-8.5个百分点。

详情
AI中文摘要

多智能体大语言模型(LLM)系统通常无法可靠地超越配备最佳N采样的单个强模型。我们认为这种不稳定性的一个核心来源是病态的均衡选择:当前系统指定了智能体共享哪些信息,但没有指定应选择哪种协调约定。我们将一大类此类系统形式化为折扣不完全信息马尔可夫博弈,并表明两种常见病理(竞争约定之间的振荡和跨约定漂移)均可导致不稳定的学习和线性贝叶斯遗憾。为了获得一个良定义的目标,我们引入了异质量化响应均衡(HQRE),这是一种具有智能体和状态依赖温度的熵正则化均衡概念。在单调性条件下,HQRE是唯一的,允许线性收敛的镜像更新,并产生有界的贝叶斯遗憾;相同的条件产生可通过rollout测量的稳定性诊断。我们在两种算法中实例化这一目标:DICE-PC,通过提示控制动作协调冻结模型,以及DICE-FT,执行参数高效的镜像微调。在四个领域的十一个基准测试中,DICE在准确性-成本权衡上优于同类强基线;在推理和规划任务上,DICE-PC平均提高4.3个百分点,DICE-FT提高8.5个百分点。

英文摘要

Multi-agent large language model (LLM) systems often fail to reliably outperform a single strong model equipped with best-of-N sampling. We argue that a core source of this instability is ill-posed equilibrium selection: current systems specify what information agents share, but not which coordination convention should be selected. We formalize a broad class of such systems as discounted incomplete-information Markov games and show that two common pathologies, oscillation between competing conventions and drift across them, can both induce unstable learning and linear Bayesian regret. To obtain a well-posed target, we introduce the Heterogeneous Quantal Response Equilibrium (HQRE), an entropy-regularized equilibrium concept with agent- and state-dependent temperatures. Under a monotonicity condition, HQRE is unique, admits linearly convergent mirror updates, and yields bounded Bayesian regret; the same condition yields rollout-measurable stability diagnostics. We instantiate this objective in two algorithms: DICE-PC, which coordinates frozen models through prompt-control actions, and DICE-FT, which performs parameter-efficient mirror fine-tuning. Across eleven benchmarks in four domains, DICE improves accuracy-cost trade-offs over strong within-class baselines; on reasoning and planning tasks, DICE-PC improves by 4.3 percentage points on average and DICE-FT by 8.5 points.
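The role of temperature in an entropy-regularized (quantal) response can be shown with a short sketch. It uses the standard logit response, in which action probabilities are a softmax of payoffs divided by a temperature. The payoff values and the two temperatures are illustrative assumptions; this is not the paper's HQRE solver, only the single-agent response that such an equilibrium concept builds on.

```python
import math

def quantal_response(payoffs, temperature):
    """Logit (softmax) response: p_i proportional to exp(payoff_i / temperature).

    This distribution maximizes expected payoff plus temperature times entropy.
    """
    m = max(payoffs)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / temperature) for q in payoffs]
    z = sum(weights)
    return [w / z for w in weights]

payoffs = [1.0, 0.0]  # hypothetical payoffs of two coordination conventions

# A low temperature commits almost fully to the better convention.
sharp = quantal_response(payoffs, temperature=0.1)
# A high temperature stays close to uniform.
diffuse = quantal_response(payoffs, temperature=100.0)
print(sharp, diffuse)
```

Giving each agent and state its own temperature, as the abstract describes, would amount to calling this response with a different `temperature` argument per agent and state.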

2606.08067 2026-06-09 cs.LG 新提交

Beyond Homophily: Towards Generalized Graph Reconstruction Attack and Defense

超越同质性:迈向广义图重构攻击与防御

Zhanke Zhou, Bo Han, Xuan Li, Jiangchao Yao, Sanmi Koyejo, Michael K. Ng

发表机构 * Hong Kong Baptist University(香港浸会大学) Shanghai Jiao Tong University(上海交通大学) Stanford University(斯坦福大学)

AI总结 针对图神经网络可能泄露训练图邻接信息的问题,提出基于马尔可夫链近似的攻击方法MC-GRA(+)和防御方法MC-GPB(+),在异质图上实现高保真重构攻击并有效防御。

详情
AI中文摘要

图神经网络(GNN)广泛部署于关系数据上,但它们可能泄露关于训练图邻接的敏感或专有信息,例如社交关系、交易和交互。本文研究图重构攻击(GRA),这是一种模型反演形式,从训练好的GNN中重构训练邻接,给定不同级别的攻击方信息。我们首先系统地表征了邻接何时以及为何通过特征、标签、嵌入和预测变得可恢复,其中泄漏由图的同质性、异质性和模型的归纳偏差调节。受这些发现启发,我们通过马尔可夫链近似视角审视GNN推理,将分层前向计算视为一个拓扑依赖表示的链。基于此视角,我们开发了互补的攻击和防御方法。在攻击方面,我们提出MC-GRA(+),通过优化一个替代邻接来重构邻接,该替代邻接的GNN诱导表示在各层与目标模型的表示对齐。在防御方面,我们提出MC-GPB(+),在整个表示链中抑制邻接依赖的信息,同时旨在在隐私-效用权衡下保持分类准确性。在同质/异质图基准和GNN上的实验表明,我们的攻击比先前方法提高了重构保真度,而我们的防御仅以轻微精度损失降低了重构成功率。

英文摘要

Graph neural networks (GNNs) are widely deployed on relational data, yet they can leak sensitive or proprietary information about the training graph adjacency, e.g., social ties, transactions, and interactions. This work studies graph reconstruction attacks (GRA), a form of model inversion that reconstructs the training adjacency from a trained GNN, given different levels of attacker-side information. We first provide a systematic characterization of when and why adjacency becomes recoverable through features, labels, embeddings, and predictions, with leakage modulated by graph homophily, heterophily, and the model's inductive bias. Motivated by these findings, we view GNN inference through a Markov chain approximation lens, treating the layered forward computation as a chain of topology-dependent representations. Building on this view, we develop complementary attack and defense methods. On the attack side, we propose MC-GRA (+), which reconstructs the adjacency by optimizing a surrogate adjacency whose GNN-induced representations align with those of the target model at each layer. On the defense side, we propose MC-GPB (+), which suppresses adjacency-dependent information throughout the representation chain while aiming to preserve classification accuracy under a privacy-utility trade-off. Experiments across homophilic/heterophilic graph benchmarks and GNNs show that our attacks improve reconstruction fidelity over prior methods, while our defenses reduce reconstruction success with only minor accuracy loss.

2606.08064 2026-06-09 cs.RO 新提交

Cooperative Long Rope Skipping via Multi-Agent Reinforcement Learning

基于多智能体强化学习的协作长绳跳绳

Zihao Wang, Shijie Peng, Kerui Wu, Yu Huang, Ruiqi Xue, Dong Liu, Tian Xu, Lei Yuan, Yang Yu

发表机构 * National Key Laboratory of Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室) School of Artificial Intelligence, Nanjing University(南京大学人工智能学院) Beijing Academy of Artificial Intelligence, BAAI(北京智源人工智能研究院)

AI总结 提出Marope框架,采用分层强化学习实现多个人形机器人的协作长绳跳绳,通过多智能体强化学习训练分散的摇绳策略,上层调度策略协调执行,并融入多样跳跃策略提升泛化能力,在仿真和真实实验中优于基线方法。

详情
AI中文摘要

人类展现出卓越的运动敏捷性,能够完成跑步、跳跃等多种动态技能,这凸显了人形机器人在运动方面的巨大潜力。在竞技体育中,长绳跳绳需要两名摇绳者协同摇绳,同时适应不同跳跃节奏的玩家,这对人形机器人来说是一项有意义但具有挑战性的任务。尽管现有的人形机器人运动方法在单智能体和无交互场景(如跑步、舞蹈和跑酷)中取得了成功,但需要多参与者精确协调的任务场景仍鲜有探索。为此,我们提出Marope,一个用于多个人形机器人协作长绳跳绳的多智能体强化学习框架。具体而言,Marope采用分层强化学习框架进行策略训练。在底层,通过多智能体强化学习学习分散的摇绳操作策略;在顶层,训练集中调度策略以协调底层策略的执行。为了提高对不同玩家行为风格的泛化能力,Marope进一步将多样化的跳跃策略融入协作博弈训练中。我们在仿真和真实环境中对宇树G1人形机器人进行了评估。实验结果表明,Marope优于多种基线方法,实现了更高效稳定的摇绳操作以及与不同玩家更鲁棒和自适应的协作。

英文摘要

Humans exhibit remarkable motor agility, enabling a wide range of dynamic skills such as running and jumping, which highlights the great potential of humanoid robots for athletic locomotion. Among athletic sports, long rope skipping requires two rope turners to cooperatively swing the rope while adapting to a player under different jumping rhythms, making it a meaningful yet challenging task for humanoid robots. Although existing methods for humanoid sports have achieved success in single-agent and interaction-free settings, such as running, dancing, and parkour, task scenarios that require precise coordination among multiple participants remain largely unexplored. To this end, we propose Marope, a multi-agent reinforcement learning (MARL) framework for cooperative long rope skipping with multiple humanoid robots. Specifically, Marope adopts a hierarchical reinforcement learning framework for policy training. At the lower level, it learns decentralized rope manipulation policies through MARL, while at the upper level, a centralized scheduling policy is trained to coordinate the execution of the lower-level policies. To improve generalization across different player behavioral styles, Marope further incorporates diverse jumping policies into cooperative game training. We evaluate our approach on Unitree G1 humanoid robots in both simulation and real-world settings. Experimental results demonstrate that Marope outperforms various baselines, achieving more efficient and stable rope manipulation as well as more robust and adaptable cooperation with varied players.

2606.08063 2026-06-09 cs.CV cs.AI cs.CL 新提交

Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

Robust-U1: MLLMs能否自我恢复受损视觉内容以实现鲁棒理解?

Jiaqi Tang, Jianmin Chen, Youyang Zhai, Wei Wei, Runtao Liu, Mengjie Zhao, Xiangyu Wu, Qingfa Xiao, Qifeng Chen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出Robust-U1框架,通过监督微调、强化学习和多模态推理,使多模态大模型具备显式视觉自恢复能力,在真实和对抗性损坏下达到最先进鲁棒性。

Comments Accepted by ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉理解方面取得了显著成功,但在真实世界的视觉损坏下其性能会大幅下降。尽管存在现有的鲁棒性增强方法,但它们存在局限性:黑盒特征对齐缺乏可解释性,而白盒基于文本的推理无法恢复丢失的像素级细节。本文研究一个基本研究问题:MLLMs能否自行恢复受损的视觉内容?为此,我们提出Robust-U1,一种新颖框架,赋予MLLMs显式的视觉自恢复能力以实现鲁棒理解。该方法包含三个核心阶段:用于初始重建的监督微调、具有双重奖励(像素级SSIM和语义级CLIP相似度)的强化学习以对齐高视觉质量,以及联合考虑受损输入和恢复图像的多模态推理。大量实验表明,Robust-U1在真实世界损坏基准上达到了最先进的鲁棒性,并在一般VQA基准上的对抗性损坏下保持了优越性能。分析证实,高质量的视觉恢复直接提升了推理性能,将自恢复确立为鲁棒视觉理解的关键机制。源代码可在https://github.com/jqtangust/Robust-U1获取。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable success in visual understanding, yet their performance degrades significantly under real-world visual corruptions. Existing robustness-enhancement approaches are limited: black-box feature alignment lacks interpretability, and white-box text-based reasoning cannot restore lost pixel-level details. This work investigates a fundamental research question: Can MLLMs recover corrupted visual content by themselves? To address this, we propose Robust-U1, a novel framework that equips MLLMs with explicit visual self-recovery capability for robust understanding. The approach comprises three core stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards (pixel-level SSIM and semantic-level CLIP similarity) for aligning high visual quality, and multimodal reasoning that jointly considers both the corrupted input and the recovered image. Extensive experiments demonstrate that Robust-U1 achieves state-of-the-art robustness on the real-world corruption benchmark and maintains superior performance under adversarial corruptions on general VQA benchmarks. Analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. The source code is available at https://github.com/jqtangust/Robust-U1.

2606.08057 2026-06-09 cs.RO cs.AI 新提交

EgoAERO: Learning Dexterous Manipulation from a Single Egocentric Video without Object Assets

EgoAERO:无需物体资产,从单个第一人称视频学习灵巧操作

Yichen Niu, Haoran Lv, Xinrui Zhang, Xueyao Wan, Shiyu Gao, Ying Ai, Hui Xu, Yongqi Hu, Hengyi Zhang, Yang Xie, Zhaxizhuoma, Yue Zhao, Zhenshan Bing, Yan Ding, Jianxing Liu

发表机构 * School of Astronautics, Harbin Institute of Technology(哈尔滨工业大学航天学院) Lumos Robotic Suzhou Research Institute, Harbin Institute of Technology(哈尔滨工业大学苏州研究院) Shanghai Jiao Tong University(上海交通大学) Shanghai AI Lab(上海人工智能实验室) Nanjing University(南京大学) Xi’an Jiaotong-Liverpool University(西交利物浦大学) Fudan University(复旦大学)

AI总结 提出EgoAERO框架,无需物体资产,从单个第一人称RGB-D视频中通过无资产物体跟踪与重建、自我运动补偿和自适应接触优化重建接触一致的手-物轨迹,并利用两阶段残差学习转化为机器人策略,实现单次演示的灵巧操作。

详情
AI中文摘要

第一人称RGB-D视频提供了人类灵巧操作演示的自然来源,但现有数据难以用于机器人学习,因为物体姿态、几何和接触信息常常缺失或需要预先扫描的物体资产。我们提出EgoAERO,这是第一个无需物体资产、从单个第一人称RGB-D人类演示中学习灵巧操作的框架。EgoAERO通过无资产物体跟踪与重建、自我运动补偿和自适应接触优化重建接触一致的手-物轨迹,然后利用两阶段残差学习将其转化为机器人策略。我们进一步引入在线质量评估机制,并构建EgoDex-R,一个包含430万RGB-D帧的大规模第一人称数据集,用于灵巧策略学习。仿真和真实世界实验表明,EgoAERO能够实现单次演示的灵巧操作,并在HOI4D上达到接近基于CAD重建的下游性能。

英文摘要

Egocentric RGB-D videos offer a natural source of human dexterous manipulation demonstrations, but existing data is difficult to use for robot learning because object pose, geometry, and contact information are often missing or require pre-scanned object assets. We present EgoAERO, the first framework that learns dexterous manipulation from a single egocentric RGB-D human demonstration without object assets. EgoAERO reconstructs contact-consistent hand-object trajectories through asset-free object tracking and reconstruction, ego motion compensation, and adaptive contact optimization, then converts them into robot policies using two-stage residual learning. We further introduce an online quality assessment mechanism and construct EgoDex-R, a large-scale egocentric dataset with 4.3M RGB-D frames for dexterous policy learning. Simulation and real-world experiments show that EgoAERO enables single-demonstration dexterous manipulation and achieves downstream performance close to CAD-based reconstructions on HOI4D.

2606.08056 2026-06-09 cs.CL cs.AI 新提交

What's the Point? Spatial Grammar & Index Resolution for Sign Language Processing

要点何在?手语处理中的空间语法与索引解析

Oline Ranum, Simon Hadfield, Richard Bowden

发表机构 * Centre for Vision, Speech and Signal Processing, University of Surrey(萨里大学视觉、语音与信号处理中心)

AI总结 针对手语中占10-15%但被忽视的空间索引现象,提出索引检测与话语实体链接的分解框架,建立索引感知手语建模基线,并作为辅助专家提升冻结手语识别模型性能。

详情
AI中文摘要

手语模型主要使用词汇序列或文本监督进行训练,因此对非词汇和构式性结构的建模不足。一个相对易处理的情况是空间索引:将话语实体分配给空间位置以供后续共指的指向手势,而以词汇为中心的目标在很大程度上未能捕捉到这一点。我们对手语识别中的索引进行了有针对性的评估,显示尽管索引占手语内容的10-15%,但其恢复效果很差。我们引入了一个用于训练和评估索引专家的框架,为索引感知手语建模建立了基线。我们的方法将空间指代解析分解为索引检测和话语实体链接。由此产生的提及表示支持自动标注和非词汇结构建模,并在推理时作为辅助索引专家增强冻结的SLR模型。

英文摘要

Sign language models are predominantly trained with gloss-sequence or text supervision, thereby under-modeling non-lexical and productive constructions. One comparatively tractable instance is spatial indexing: pointing gestures that assign discourse entities to spatial loci for subsequent co-reference, which lexicon-centric objectives largely fail to capture. We present a targeted evaluation of indexing in Sign Language Recognition, showing that despite comprising 10-15% of signing content, indexing is poorly recovered. We introduce a framework for training and evaluating indexing experts, establishing a baseline for index-aware sign language modeling. Our approach decomposes spatial reference resolution into index detection and discourse entity linking. The resulting mention representations enable automatic annotation and non-lexical structure modeling, and serve as an auxiliary indexing expert that augments a frozen SLR model at inference time.

2606.08051 2026-06-09 cs.AI cs.LG 新提交

How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions

你能做到多小?面向金融交易中商户信息抽取的 270M-8B 模型 LoRA 微调

Donghao Huang, Tomas Drietomsky, Benjamin Barrett, Zhaoxia Wang

发表机构 * Singapore Management University(新加坡管理大学) Mastercard(万事达卡) A*STAR Centre for Frontier AI Research(新加坡科技研究局前沿人工智能研究中心)

AI总结 针对金融交易中从嘈杂银行字符串提取结构化商户信息的生产需求,系统评估 24 种模型变体,发现 Qwen 3.5 4B 在参数量减半下 F1 仅低 0.35 点,0.8B 模型匹配 2.5-4 倍大模型性能,且思维链微调提升有限。

Comments 9 pages, 5 figures, 5 tables. Submitted to the IEEE International Conference on Data Mining (ICDM) 2026

详情
AI中文摘要

金融交易处理需要从嘈杂、缩写的银行交易字符串中大规模提取结构化商户信息。我们当前的生产系统是 LoRA 微调的 LLaMA 3.1-8B,在该任务上达到了 96.95% 的 F1 分数,但部署 80 亿参数模型带来了高昂的内存、延迟和成本约束。为了识别更高效的替代方案,我们进行了一项以部署为中心的研究,涵盖四个模型家族的 24 种模型变体:Gemma 3(270M、1B、4B)、Qwen 3.5(0.8B、2B、4B)、Aya(3.35B)和 LLaMA 3.1-8B,系统评估了准确率、推理吞吐量、训练成本和硬件行为,以评估生产适用性。我们的发现表明:(1)使用 LoRA 秩为 8 复现 LLaMA 3.1-8B 微调达到 96.75% F1,仅比秩为 32 的基线低 0.20 个点;(2)仅使用 JSON 提示的 Qwen 3.5 4B 达到 96.60% F1,比 8B 基线低 0.35 个点,同时参数量大约减半;(3)0.8B 的 Qwen 3.5 模型达到 94.75% F1,与 2.5-4 倍大的模型性能相当,提供了有吸引力的延迟-准确率权衡;(4)思维链微调通常使大多数模型的 F1 提升 0.3-1.8 个点,尽管 Qwen 3.5 4B 在直接仅 JSON 提示下表现最佳;(5)Qwen 3.5 的 Think 和 Nothink 训练模板产生几乎相同的结果(F1 差异 <0.004),表明对于结构化抽取任务,显式推理监督是不必要的。我们进一步将所有 14 个微调后的子 8B 模型部署为 Databricks Model Serving 端点,并观察到基准性能可靠地迁移到生产环境,平均 F1 变化仅为 0.8 个点。基于 Cohere2 架构的 Aya 3.35B 是唯一的例外,在服务条件下 F1 下降了 3-5 个点。基于这些结果,我们提供了跨准确率和延迟需求的部署建议,……

英文摘要

Financial transaction processing requires extracting structured merchant information from noisy, abbreviated bank transaction strings at scale. Our current production system, a LoRA-fine-tuned LLaMA 3.1-8B, achieves 96.95% F1 on this task, but deploying 8-billion-parameter models imposes prohibitive memory, latency, and cost constraints. To identify more efficient alternatives, we conduct a deployment-focused study of 24 model variants spanning four model families: Gemma 3 (270M, 1B, 4B), Qwen 3.5 (0.8B, 2B, 4B), Aya (3.35B), and LLaMA 3.1-8B, systematically evaluating accuracy, inference throughput, training cost, and hardware behavior to assess production suitability. Our findings show that: (1) reproducing the LLaMA 3.1-8B fine-tune with a LoRA rank of 8 achieves 96.75% F1, only 0.20 points below the rank-32 baseline; (2) Qwen 3.5 4B with JSON-only prompting reaches 96.60% F1, within 0.35 points of the 8B baseline while using roughly half the parameters; (3) the 0.8B Qwen 3.5 model achieves 94.75% F1, matching models 2.5-4x larger and offering an attractive latency-accuracy trade-off; (4) chain-of-thought fine-tuning generally improves F1 by 0.3-1.8 points across most models, although Qwen 3.5 4B performs best with direct JSON-only prompting; and (5) Qwen 3.5 Think and Nothink training templates produce nearly identical results (F1 differences <0.004), indicating that explicit reasoning supervision is unnecessary for structured extraction tasks. We further deploy all 14 fine-tuned sub-8B models as Databricks Model Serving endpoints and observe that benchmark performance transfers reliably to production, with an average F1 change of only 0.8 points. Aya 3.35B, based on the Cohere2 architecture, is the sole exception, exhibiting a 3-5 point decline under serving conditions. Based on these results, we provide deployment recommendations across accuracy and latency requirements, ...
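The rank-8 versus rank-32 comparison in finding (1) can be put in perspective with a small parameter count. LoRA replaces a weight update with two low-rank factors, so one adapted matrix of shape d_out x d_in gains rank * (d_in + d_out) trainable parameters. The 4096 x 4096 projection below is a hypothetical example size, and which matrices the authors adapted is not stated in the abstract.

```python
def lora_added_params(d_in, d_out, rank):
    """Trainable parameters LoRA adds to one weight matrix.

    The update is factored as B @ A with B of shape (d_out, rank)
    and A of shape (rank, d_in).
    """
    return rank * (d_in + d_out)

# Hypothetical square projection of size 4096 x 4096.
rank8 = lora_added_params(4096, 4096, rank=8)
rank32 = lora_added_params(4096, 4096, rank=32)
print(rank8, rank32, rank32 // rank8)
```

The adapter size scales linearly with rank, so the rank-8 adapter carries a quarter of the rank-32 adapter's trainable parameters for the same set of adapted matrices.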

2606.08049 2026-06-09 cs.AI cs.MA 新提交

SKILL.nb: Selective Formalization and Gated Execution for Durable Agent Workflows

SKILL.nb:用于持久代理工作流的选择性形式化与门控执行

Amine El Hattami, Nicolas Chapados, Christopher Pal

发表机构 * ServiceNow Research Mila Polytechnique Montréal(蒙特利尔综合理工学院) Canada CIFAR AI Chair(加拿大CIFAR人工智能讲席)

AI总结 提出SKILL.nb框架,通过选择性形式化和门控执行管理代理工作流的生命周期可靠性,在WebArena-Verified上单轮成功率达53.7%,重执行保留率91.7%。

详情
AI中文摘要

AI代理越来越多地将过去的经验转化为可重用的工件,如代码、工作流和程序记忆。重用可以提高效率,但也带来了生命周期可靠性问题:曾经成功的工件可能在环境漂移、任务说明不充分或任务分布变化时失败,尤其是在Web自动化中。我们引入了SKILL.nb,一个通过证据校准的生命周期策略来管理可重用代理工作流的框架。SKILL.nb使用选择性形式化:执行证据决定哪些工作流步骤应成为可执行代码,哪些应保留自然语言指导,以及何时应修订这些选择。工作流存储为可审计、版本化的笔记本,交织自然语言指导、多语言可执行单元格、验证门、回退路径以及多模态证据(如输出、截图和错误轨迹)。在运行时,门控执行让每个步骤在门验证时运行代码,或在漂移使可执行实现失效时本地回退。在WebArena-Verified上,SKILL.nb实现了53.7%的单轮成功率,比最强基线提高了3.9个百分点。在三次重新执行中,它保留了91.7%的初始成功任务,比次优方法高出15.5个百分点。在有界修复下,它恢复了72.9%的后续失败,同时将修复后回归限制在4.2%,而持久基线为15.0%至17.0%。它还在Mind2Web跨网站和跨领域分割上领先。在GitLab迁移测试中,SKILL.nb在重用基于GitLab 15.7学习的冻结状态时保持性能,冻结与新鲜目标版本的差距在GitLab 16.11上为-1.7个百分点,在GitLab 18.9上为+0.6个百分点。这些结果将生命周期治理和门控执行确定为超越一次性任务成功之外的可靠性轴。

英文摘要

AI agents increasingly turn past experience into reusable artifacts such as code, workflows, and procedural memories. Reuse can improve efficiency, but it also creates a lifecycle reliability problem: artifacts that succeed once may fail under environment drift, underspecified tasks, or changing task distributions, especially in web automation. We introduce SKILL.nb, a framework for governing reusable agent workflows with evidence-calibrated lifecycle policies. SKILL.nb uses selective formalization: execution evidence decides which workflow steps should become executable code, which should remain natural-language guided, and when those choices should be revised. Workflows are stored as auditable, versioned notebooks that interleave natural-language guidance, multi-language executable cells, validation gates, fallback paths, and multimodal evidence such as outputs, screenshots, and error traces. At runtime, gate-conditioned execution lets each step run code when its gates validate, or fall back locally when drift invalidates the executable realization. On WebArena-Verified, SKILL.nb achieves 53.7% single-round success, improving over the strongest baseline by 3.9 percentage points. Across three re-executions, it retains 91.7% of initially successful tasks, 15.5 points above the next best method. Under bounded repair, it recovers 72.9% of subsequent failures while limiting post-repair regressions to 4.2%, compared with 15.0% to 17.0% for persistent baselines. It also leads on Mind2Web cross-website and cross-domain splits. In a GitLab migration test, SKILL.nb preserves performance when reusing frozen state learned on GitLab 15.7, with frozen-versus-fresh target-version gaps of -1.7 points on GitLab 16.11 and +0.6 points on GitLab 18.9. These results identify lifecycle governance and gate-conditioned execution as reliability axes beyond one-shot task success.
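Gate-conditioned execution can be sketched as a step that runs its code path only when every validation gate passes and otherwise falls back locally. The step layout (a dict with `gates`, `code`, and `fallback`), the page-state format, and the example gate are all hypothetical; the abstract does not describe the notebook schema at this level of detail.

```python
def run_step(step, state):
    """Run the executable path if all gates validate, else the fallback path.

    Returns (result, path_taken) so the outcome can be logged as evidence.
    """
    if all(gate(state) for gate in step["gates"]):
        return step["code"](state), "code"
    return step["fallback"](state), "fallback"

# Hypothetical workflow step: click a submit button if it is still present.
step = {
    "gates": [lambda s: "submit_button" in s["page_elements"]],
    "code": lambda s: "clicked submit_button",
    "fallback": lambda s: "follow natural-language guidance to find the submit control",
}

# The page matches what the code expects, so the code path runs.
stable = run_step(step, {"page_elements": ["title", "submit_button"]})
# The page drifted (button renamed), so the step falls back locally.
drifted = run_step(step, {"page_elements": ["title", "send_button"]})
print(stable, drifted)
```

Returning which path was taken gives a simple form of execution evidence that could inform whether a step should stay formalized as code.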

2606.08048 2026-06-09 cs.CL 新提交

Diffusion Language Model Parallel Decoding via Product-of-Experts Bridge

通过专家乘积桥接的扩散语言模型并行解码

Juntong Shi, Brian L. Trippe, Jure Leskovec, Stefano Ermon, Minkai Xu

发表机构 * Stanford University(斯坦福大学)

AI总结 提出PoE-Bridge框架,通过专家乘积构建中间分布,结合扩散语言模型并行解码和自回归模型质量,实现5倍加速并恢复至少95%的AR性能。

Comments ICML 2026

详情
AI中文摘要

扩散语言模型(DLM)通过并行解码提供了显著的速度优势,但与自回归(AR)模型相比,缺乏令牌依赖性限制了生成质量。最近的进展试图通过重要性采样来弥合差距,其中DLM作为提议分布,AR作为目标分布。然而,由于它们分布之间的巨大差距,采样需要大量粒子,因此计算成本高昂。在本文中,我们引入了PoE-Bridge,一种新颖的解码框架,通过引入中间分布来弥合差距,从而大幅提高生成速度和准确性。该分布被构建为DLM提议和AR目标的专家乘积(PoE)。借助中间分布,我们首先使用DLM并行起草多个续写,然后应用拒绝采样验证起草的令牌,并将结果候选向PoE移动。接着,我们使用重要性采样进一步将PoE对齐的候选向AR目标校正。我们还提出了若干改进技术,包括用于增强多样性的混合温度采样和用于减少浪费验证的弹性拒绝窗口。实验上,PoE-Bridge在标准DLM解码方法上实现了显著提高的准确性,速度提升5倍,并恢复了目标AR模型至少95%的性能,在具有挑战性的数学推理和编码任务上高效地推进了大部分质量差距。我们的代码可在https://github.com/juntongshi48/poe-bridge获取。

英文摘要

Diffusion language models (DLMs) offer substantial speed advantages through parallel decoding, but the lack of token dependencies limits generation quality compared to autoregressive (AR) models. Recent work attempts to bridge the gap via importance sampling, with the DLM as the proposal and the AR model as the target. However, due to the huge gap between their distributions, the sampling requires a large number of particles and is thus expensive to compute. In this paper, we introduce PoE-Bridge, a novel decoding framework that drastically improves generation speed and accuracy by introducing an intermediate distribution to bridge the gap. The distribution is constructed as a Product-of-Experts (PoE) of the DLM proposal and the AR target. With the intermediate distribution, we first use the DLM to draft multiple continuations in parallel, then apply rejection sampling to verify the drafted tokens and move the resulting candidates toward the PoE. We then use importance sampling to further correct the PoE-aligned candidates toward the AR target. We further propose several improved techniques, including mixed-temperature sampling for enhanced diversity and elastic rejection windows for reducing wasted verification. Empirically, PoE-Bridge achieves significantly improved accuracy with a 5x speedup over the standard DLM decoding approach, and recovers at least 95% of the target AR model's performance, efficiently closing most of the quality gap on challenging mathematical reasoning and coding tasks. Our code is available at https://github.com/juntongshi48/poe-bridge.
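A Product-of-Experts over two next-token distributions can be sketched as a normalized element-wise product. This is a minimal illustration on a three-token vocabulary, assuming an unweighted product; the abstract does not say whether the experts are exponent-weighted, and the probabilities below are made up.

```python
def product_of_experts(p, q):
    """Normalized element-wise product of two categorical distributions."""
    unnormalized = [a * b for a, b in zip(p, q)]
    z = sum(unnormalized)
    if z == 0.0:
        raise ValueError("the two distributions share no support")
    return [u / z for u in unnormalized]

# Hypothetical next-token probabilities over a three-token vocabulary.
p_dlm = [0.6, 0.3, 0.1]  # diffusion model proposal
p_ar = [0.2, 0.3, 0.5]   # autoregressive target

p_poe = product_of_experts(p_dlm, p_ar)
print([round(x, 4) for x in p_poe])
```

The product puts mass only where both experts agree a token is plausible, which is why it can act as an intermediate between proposal and target.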

2606.08046 2026-06-09 cs.AI cs.CV cs.LG 新提交

OSMGraphCLIP: Learning Global Location Representations from OpenStreetMap Graphs

OSMGraphCLIP:从OpenStreetMap图学习全局位置表示

Dimitrios Michail, Eleni Saka, Ioannis Giannopoulos, Ioannis Papoutsis

发表机构 * Harokopio University of Athens(雅典哈罗科皮奥大学) National Technical University of Athens(雅典国家技术大学) Vienna University of Technology(维也纳技术大学) National Observatory of Athens(雅典国家天文台)

AI总结 提出OSMGraphCLIP模型,利用OpenStreetMap异构图结构学习全局位置嵌入,通过多尺度图编码器和对比学习对齐,在气候、生态、社会经济等下游任务中达到或超越卫星基线方法。

详情
AI中文摘要

我们提出了OSMGraphCLIP,一种CLIP风格的地理空间表示模型,从免费可用的OpenStreetMap(OSM)数据中学习全局位置嵌入。OSMGraphCLIP将地理环境表示为带类型的OSM特征的异构图,保留了道路、建筑物、土地利用区域和兴趣点之间的拓扑和语义关系。多尺度图编码器捕获细粒度的局部结构和更广泛的景观组成,并通过对比对齐目标监督球谐位置编码器。我们在涵盖气候、生态、社会经济指标、公共卫生、土地覆盖、生物多样性和野火预测等一系列下游地理空间回归和分类任务中评估了OSMGraphCLIP,并表明仅结构化OSM数据就支持跨领域的强全局位置表示。OSMGraphCLIP在大多数基准测试中达到或超过了基于卫星的基线,在社会经济和公共卫生任务中优势最为明显,因为OSM对建成环境的显式语义注释编码了卫星像素只能间接捕获的人类活动模式。在生态和环境任务中,尽管未使用地球观测数据,该模型仍与基于图像的方法保持紧密竞争。定性分析证实,学习到的嵌入连贯地组织了地理空间,仅从地图拓扑中恢复了生物群落边界、城市梯度和热带-温带区别。

英文摘要

We present OSMGraphCLIP, a CLIP-style geospatial representation model that learns global location embeddings from freely available OpenStreetMap (OSM) data. OSMGraphCLIP represents geographic environments as heterogeneous graphs of typed OSM features, preserving the topological and semantic relationships among roads, buildings, land-use regions, and points of interest. A multi-scale graph encoder captures both fine-grained local structure and broader landscape composition, and supervises a spherical-harmonics location encoder through a contrastive alignment objective. We evaluate OSMGraphCLIP across a diverse suite of downstream geospatial regression and classification tasks spanning climate, ecology, socioeconomic indicators, public health, land cover, biodiversity, and wildfire forecasting, and show that structured OSM data alone supports strong global location representations across domains. OSMGraphCLIP matches or exceeds satellite-based baselines on the majority of benchmarks, with the most pronounced advantage on socioeconomic and public-health tasks, where OSM's explicit semantic annotation of the built environment encodes patterns of human activity that satellite pixels can only capture indirectly. On ecological and environmental tasks, the model remains closely competitive with imagery-based methods despite using no Earth observation data. Qualitative analysis confirms that the learned embeddings organize geographic space coherently, recovering biome boundaries, urban gradients, and tropical--temperate distinctions from map topology alone.

2606.08044 2026-06-09 cs.LG cs.AI cs.CL New submission

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

Enyi Jiang, Anders Gjølbye, Yibo Jacky Zhang, Sanmi Koyejo

Affiliations: Stanford University; University of Illinois Urbana-Champaign; Technical University of Denmark

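AI summary: This paper identifies an "audit gap" between behavioral safety and robustness under intervention. By constructing dissociated models and introducing the Latent Vulnerability Score (LVS), it shows that behavioral safety metrics are insufficient measures of representation-level robustness.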

Comments Preprint

AI-generated abstract (translated from Chinese)

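The safety of large language models (LLMs) is usually evaluated at the behavioral level, which provides limited evidence of internal robustness because these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework that tests model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS), which measures how easily harmful behavior can be elicited by bounded latent perturbations. Using this framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVS under harmful intervention despite comparable refusal behavior, with intermediate representations being the most sensitive to intervention. Our results indicate that behavioral safety evaluation alone cannot fully reflect model robustness, which motivates representation-aware audits that assess both latent vulnerability and observable behavior.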

English abstract

Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework to test model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS) to measure how easily harmful behavior can be elicited by bounded latent perturbations. Using this evaluation framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVSs despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits of latent vulnerability and observable behavior.
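
The abstract describes the LVS only informally, as a measure of how easily bounded latent perturbations elicit harmful behavior. The sketch below is a toy stand-in for that idea, not the paper's definition: it assumes a hypothetical linear "harm readout" `w·h + b` on a latent vector `h`, where a negative value means refusal, and reports the fraction of latents that an L2 perturbation of radius `eps` can push across the decision boundary. All names and numbers are invented for illustration.

```python
import math


def min_flip_radius(h, w, b):
    """Smallest L2 perturbation of h that raises the readout w.h + b to zero.

    For a linear readout this is the distance from h to the hyperplane.
    """
    score = sum(wi * hi for wi, hi in zip(w, h)) + b
    if score >= 0:
        return 0.0  # already on the harmful side, no perturbation needed
    return -score / math.sqrt(sum(wi * wi for wi in w))


def toy_vulnerability(latents, w, b, eps):
    """Fraction of latents whose readout can be flipped within radius eps."""
    flipped = sum(1 for h in latents if min_flip_radius(h, w, b) <= eps)
    return flipped / len(latents)


# Two hypothetical models: both refuse every prompt (all readouts negative),
# but one keeps its latents far from the boundary and the other does not.
w, b = [1.0, 0.0], 0.0
robust_latents = [[-4.0, 0.0], [-5.0, 1.0]]
fragile_latents = [[-0.5, 0.0], [-0.2, 3.0]]

robust_score = toy_vulnerability(robust_latents, w, b, eps=1.0)
fragile_score = toy_vulnerability(fragile_latents, w, b, eps=1.0)
```

Both toy models look identical under a purely behavioral check, since every readout is negative, yet the second is fully flippable within the perturbation budget. That mirrors, in miniature, the kind of gap between outward behavior and latent robustness that the abstract describes.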

2606.08039 2026-06-09 cs.RO New submission

MuJoCo-Drones-Gym: A GPU-Accelerated Multi-Drone Simulator for Control and Reinforcement Learning

Manan Tayal

Affiliations: TAU-Intelligence

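AI summary: Presents MuJoCo-Drones-Gym, a GPU-accelerated multi-drone simulator built on the MuJoCo physics engine. It supports an arbitrary number of Crazyflie 2.x nano-quadcopters, provides modular physics models, action interfaces, and observation spaces, integrates PettingZoo for multi-agent reinforcement learning, and covers seven task environments, including hover and velocity tracking.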

Comments 18 pages, 8 figures, 7 tables

AI-generated abstract (translated from Chinese)

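Robotic simulators are a cornerstone of modern aerial robotics research, serving both as a tool for developing new control algorithms and as a data source for training reinforcement learning policies. However, existing quadcopter learning environments usually face a trade-off between physical fidelity, multi-agent support, and the throughput required by modern deep reinforcement learning pipelines. This paper presents MuJoCo-Drones-Gym, an open-source, Gymnasium-compatible multi-drone environment built on the MuJoCo physics engine. MuJoCo-Drones-Gym supports an arbitrary number of Bitcraze Crazyflie 2.x nano-quadcopters and exposes a modular API for selecting (i) the physics model (rigid-body MuJoCo, explicit Python dynamics, or any subset of ground effect, blade drag, and inter-drone downwash), (ii) the action interface (per-motor RPMs, collective normalized thrust, velocity setpoints, or PID waypoint commands), and (iii) the observation space (kinematic state vectors, RGB/depth/segmentation cameras, or neighbourhood adjacency information). A PettingZoo ParallelEnv wrapper supports drop-in multi-agent reinforcement learning, while a suite of seven task environments (hover, velocity tracking, multi-drone hover, waypoint navigation, formation flight, gate racing, and a generic multi-agent template) demonstrates the breadth of the interface. We describe the environment design, the underlying physics, and the quadcopter dynamics, and illustrate its use through control and learning examples similar to those of the closely related gym-pybullet-drones project, while taking advantage of MuJoCo's improved contact handling, rendering, and parallelization.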

English abstract

Robotic simulators are a cornerstone of modern research in aerial robotics, serving both as a vehicle for the development of new control algorithms and as the data source for training reinforcement learning (RL) policies. Yet, existing quadcopter learning environments often face a trade-off between physical fidelity, multi-agent support, and the throughput required by modern deep RL pipelines. In this paper, we present MuJoCo-Drones-Gym, an open-source Gymnasium-compatible multi-drone environment built on top of the MuJoCo physics engine. MuJoCo-Drones-Gym supports an arbitrary number of Bitcraze Crazyflie 2.x nano-quadcopters and exposes a modular API for selecting (i) the physics model (rigid-body MuJoCo, explicit Python dynamics, or any subset of ground effect, blade drag, and inter-drone downwash), (ii) the action interface (per-motor RPMs, collective normalized thrust, velocity setpoints, or PID waypoint commands), and (iii) the observation space (kinematic state vectors, RGB/depth/segmentation cameras, or neighbourhood adjacency information). A PettingZoo ParallelEnv wrapper enables drop-in multi-agent reinforcement learning, while a suite of seven task environments (hover, velocity tracking, multi-drone hover, waypoint navigation, formation flight, gate racing, and a generic multi-agent template) demonstrates the breadth of the interface. We describe the environment design, the underlying physics and quadcopter dynamics, and illustrate its use through control and learning examples that mirror those of the closely related gym-pybullet-drones project, while taking advantage of MuJoCo's improved contact handling, rendering, and parallelizability.
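
To give a feel for what a hover task with a collective-thrust action interface involves, here is a minimal, self-contained sketch. It does not use MuJoCo-Drones-Gym or reproduce its API. It assumes a 1-D point mass with a Crazyflie-like mass of 0.027 kg, a PD controller with gravity feedforward, and semi-implicit Euler integration. The gains, time step, and thrust limit are illustrative assumptions.

```python
def simulate_hover(z_ref=1.0, steps=1000, dt=0.01, kp=4.0, kd=4.0):
    """Drive a 1-D point mass from the ground to altitude z_ref.

    Hypothetical toy model: vertical motion only, no attitude or motor dynamics.
    Returns the final altitude (m) and vertical velocity (m/s).
    """
    mass = 0.027      # kg, assumed Crazyflie-like mass
    gravity = 9.81    # m/s^2
    max_thrust = 2.0 * mass * gravity  # assumed thrust limit (2x hover thrust)

    z, vz = 0.0, 0.0
    for _ in range(steps):
        # PD feedback on altitude error plus gravity feedforward.
        thrust = mass * (gravity + kp * (z_ref - z) - kd * vz)
        # Collective thrust cannot be negative or exceed the limit.
        thrust = min(max(thrust, 0.0), max_thrust)
        az = thrust / mass - gravity
        # Semi-implicit Euler: update velocity first, then position.
        vz += az * dt
        z += vz * dt
    return z, vz


final_z, final_vz = simulate_hover()
```

With these gains the closed loop behaves like a critically damped second-order system, so the altitude settles at the target without sustained oscillation. A full simulator replaces the point mass with rigid-body dynamics and the effects the abstract lists, such as ground effect, blade drag, and downwash.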

2606.08038 2026-06-09 cs.SD New submission

Exploring the Scale and Diversity of Speech Anti-spoofing Datasets: Experiments and Analysis

Zhuolin Yi, Jun Xue, Yanzhen Ren, Yihuan Huang, Yi Chai, Daixian Li, Guanxiang Feng, Jiajun Liu

Affiliations: School of Cyber Science and Engineering, Wuhan University

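AI summary: By decoupling training data scale from diversity, this study finds that data diversity matters more than scale: excessive scale can lead to overfitting, while smaller but more diverse datasets perform better in cross-domain evaluation.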

Comments Accepted by Interspeech 2026

AI-generated abstract (translated from Chinese)

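Over the past decade, the scale of speech anti-spoofing datasets has grown exponentially, on the assumption that more data brings better performance. However, it remains unclear whether indiscriminate scaling improves model generalization to a matching degree. This study challenges the "scale-first" paradigm by decoupling the effects of training data scale and diversity. Through experiments on representative datasets, we report two key findings: (1) Larger is not always better. Expanding data scale excessively under fixed generation methods yields negligible gains and may even reduce cross-domain generalization due to overfitting. (2) Diversity outweighs scale. In cross-dataset evaluation, a smaller composite training set containing diverse attacks significantly outperforms larger datasets with limited diversity. We conclude that future dataset construction should prioritize diversity of generation methods over scale in order to improve model generalization effectively.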

English abstract

The scale of speech anti-spoofing datasets has grown exponentially over the past decade, driven by the assumption that larger data leads to better performance. However, it remains unclear whether indiscriminate scaling commensurately improves model generalization. This study challenges the "scale-first" paradigm by decoupling the impacts of training data scale versus diversity. Through experiments on representative datasets, we report two key findings: (1) Larger is not always better. Expanding data scale excessively under fixed generation methods yields negligible returns and may even degrade cross-domain generalization due to overfitting. (2) Diversity outweighs scale. A smaller composite training set featuring diverse attacks significantly outperforms larger-scale datasets with limited diversity in cross-dataset evaluations. We conclude that future dataset construction should prioritize the diversity of generation methods over scale to effectively enhance model generalization.