arXivDaily: arXiv daily academic digest, updated Monday through Friday

2606.13681 2026-06-12 cs.CL New submission

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, Zhiyuan Hu

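Affiliations: National University of Singapore; Singapore Management University; University of Washington; University College London; University of Pennsylvania; Nanyang Technological University; Recursive; Massachusetts Institute of Technology

AI summary: Proposes EvoArena, a benchmark suite that simulates progressive environment changes across terminal, software, and social domains, and EvoMem, a patch-based memory paradigm that records structured update histories so that agents can reason about environmental evolution through changes in memory. Experiments show that current agents perform poorly in dynamic environments and that EvoMem consistently improves performance.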

English abstract

Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.
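
The abstract does not spell out EvoMem's data structures, so the following is only a minimal sketch of what a patch-based memory could look like: each update is appended as a structured patch instead of overwriting the old value, which keeps the change history available for later reasoning. The names `PatchMemory`, `update`, and `history`, and the example key `cli.flag`, are hypothetical and not taken from the paper.

```python
from typing import Any, Dict, List, Tuple

# One patch: (step, key, old_value, new_value). old_value is None for a new key.
Patch = Tuple[int, str, Any, Any]


class PatchMemory:
    """Hypothetical patch-based memory: current state plus an append-only patch log."""

    def __init__(self) -> None:
        self.state: Dict[str, Any] = {}
        self.patches: List[Patch] = []

    def update(self, step: int, key: str, value: Any) -> None:
        """Record a patch only when the observed value differs from the stored one."""
        if key in self.state and self.state[key] == value:
            return
        self.patches.append((step, key, self.state.get(key), value))
        self.state[key] = value

    def history(self, key: str) -> List[Patch]:
        """Return every recorded change for one key, oldest first."""
        return [p for p in self.patches if p[1] == key]


memory = PatchMemory()
memory.update(1, "cli.flag", "--out")
memory.update(2, "cli.flag", "--output")  # the environment changed: a patch is logged
memory.update(3, "cli.flag", "--output")  # unchanged observation: nothing is logged
changes = [(old, new) for _, _, old, new in memory.history("cli.flag")]
```

An agent reading `changes` sees that the flag was renamed rather than only seeing its latest value, which is the kind of evolution-aware evidence the abstract describes.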

2606.13680 2026-06-12 cs.CL cs.AI New submission

Learning to Reason by Analogy via Retrieval-Augmented Reinforcement Fine-Tuning

Zilin Xiao, Qi Ma, Chun-cheng Jason Chen, Xintao Chen, Avinash Atreya, Hanjie Chen, Vicente Ordonez

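Affiliations: Meta Superintelligence Labs; Rice University

AI summary: Proposes the RA-RFT framework, which trains a retriever with gold-relevance distillation and combines it with reinforcement fine-tuning on analogous reasoning traces to improve mathematical reasoning performance.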

English abstract

Retrieval-augmented generation (RAG) has become a standard mechanism for grounding language models in external knowledge, yet conventional retrieval based on lexical or semantic similarity is poorly suited for complex reasoning tasks: a semantically similar problem may demand an entirely different solution strategy, while a superficially different problem may share the same underlying reasoning pattern. We propose Retrieval-Augmented Reinforcement Fine-Tuning (RA-RFT), a post-training framework that teaches language models to reason by analogy. RA-RFT uses gold-relevance distillation to train a retriever that ranks contexts by expected reasoning benefit rather than semantic overlap, and then fine-tunes the policy model via reinforcement fine-tuning methods with retrieved analogous demonstrations, so the model learns to leverage reasoning traces under verifiable outcome rewards. We further analyze the diversity of retrieved contexts and find that reasoning-aware retrieval surfaces complementary solution strategies that provide distinct reasoning scaffolds for individual problems. Across challenging mathematical reasoning benchmarks, RA-RFT consistently outperforms standard reinforcement fine-tuning methods. For example, it improves AIME 2025 average@32 accuracy by 7.1 and 2.8 points over GRPO for Qwen3-1.7B and Qwen3-4B respectively -- suggesting that reasoning-aware retrieval is a complementary axis of improvement and orthogonal to advances in reward design or training curricula.

2606.13677 2026-06-12 cs.RO cs.AI cs.CV cs.LG New submission

Mana: Dexterous Manipulation of Articulated Tools

Zhao-Heng Yin, Guanya Shi, Pieter Abbeel, C. Karen Liu

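Affiliations: UC Berkeley; CMU; Stanford University; Amazon FAR

AI summary: Proposes Mana, a framework that recasts dexterous manipulation as an animation problem and automatically generates manipulation trajectories through a coarse-to-fine pipeline, achieving zero-shot sim-to-real transfer for articulated tools.

Comments: Project page: https://zhaohengyin.github.io/mana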

English abstract

Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (<1 minute per tool). Across four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.

2606.13676 2026-06-12 cs.CV New submission

Modality Forcing for Scalable Spatial Generation

Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski, Justin Johnson, Keunhong Park

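Affiliations: Carnegie Mellon University; World Labs

AI summary: Proposes Modality Forcing, which assigns an independent noise level to each modality to enable joint image-depth generation with a single DiT. The model trains on sparse depth data, inherits the scalability of T2I pre-training, and achieves competitive depth-estimation performance.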

English abstract

Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception. https://modality-forcing.github.io/
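
One way to read "separate noise levels per modality" is sketched below. This is an interpretation, not the authors' implementation: it assumes noise levels in [0, 1], assumes that a level of 0.0 marks a modality supplied as clean conditioning, and uses hypothetical function and task names throughout.

```python
import random
from typing import Dict


def sample_training_levels(rng: random.Random) -> Dict[str, float]:
    """Draw an independent noise level in [0, 1) for each modality (assumed scheme)."""
    return {"image": rng.random(), "depth": rng.random()}


def inference_levels(task: str, t: float) -> Dict[str, float]:
    """Noise levels at sampling time; 0.0 marks a modality that is given as clean input."""
    if task == "joint":            # generate image and depth together
        return {"image": t, "depth": t}
    if task == "image_to_depth":   # depth prediction conditioned on a clean image
        return {"image": 0.0, "depth": t}
    if task == "depth_to_image":   # image generation conditioned on clean depth
        return {"image": t, "depth": 0.0}
    raise ValueError(f"unknown task: {task}")


rng = random.Random(0)
train = sample_training_levels(rng)
depth_estimation = inference_levels("image_to_depth", t=1.0)
```

Under these assumptions, a single model trained with independently sampled levels covers all three tasks, because each task is just a different choice of which modality starts clean.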

2606.13673 2026-06-12 cs.CV cs.AI New submission

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen

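Affiliations: KAIST; NVIDIA

AI summary: Proposes SpatialClaw, a framework that uses code as the action interface. A stateful Python kernel with perception and geometry primitives lets a VLM agent execute step by step and flexibly compose intermediate results, reaching 59.9% average accuracy across 20 3D/4D spatial reasoning benchmarks, 11.2 points above existing methods.

Comments: Project page: https://spatialclaw.github.io/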

English abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
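
The "one executable cell per step" idea can be illustrated with a toy kernel whose cells share a namespace, so a later cell can build on what an earlier cell computed and printed. This is a sketch under simplifying assumptions: `StatefulKernel`, `run_cell`, and the `distance` primitive are hypothetical stand-ins, and a real system would sandbox execution rather than call `exec` directly.

```python
import contextlib
import io
from typing import Dict, List


class StatefulKernel:
    """Toy stand-in for a stateful Python kernel: all cells share one namespace."""

    def __init__(self, preloaded: Dict[str, object]) -> None:
        self.namespace: Dict[str, object] = dict(preloaded)
        self.outputs: List[str] = []

    def run_cell(self, source: str) -> str:
        """Execute one cell, capture what it prints, and keep its variables."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)  # trusted toy input only
        text = buffer.getvalue()
        self.outputs.append(text)
        return text


# Hypothetical pre-loaded geometry primitive: Euclidean distance between 3D points.
def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


kernel = StatefulKernel({"distance": distance})
# Step 1: compute and inspect an intermediate result.
first = kernel.run_cell(
    "cup = (0.0, 0.0, 0.0); lamp = (3.0, 4.0, 0.0); print(distance(cup, lamp))"
)
# Step 2: a later cell reuses variables defined in the earlier cell.
second = kernel.run_cell("print('far' if distance(cup, lamp) > 2.0 else 'near')")
```

The contrast with single-pass execution is that the second cell is written only after the first cell's output has been observed.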

2606.13672 2026-06-12 cs.RO New submission

$\texttt{WEAVER}$, Better, Faster, Longer: An Effective World Model for Robotic Manipulation

Arnav Kumar Jain, Yilin Wu, Jesse Farebrother, Gokul Swamy, Andrea Bajcsy

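Affiliations: Mila - Québec AI Institute; Université de Montréal; Carnegie Mellon University; McGill University

AI summary: Proposes WEAVER, a world-model architecture that trains multi-view latent prediction with a flow-matching loss and simultaneously achieves high fidelity, long-horizon consistency, and efficient inference, substantially improving policy evaluation, policy improvement, and test-time planning on robotic manipulation tasks.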

English abstract

The potential impacts of world models (WMs, i.e., learned simulators) on robotics are far-reaching -- policy evaluation, policy improvement, and test-time planning -- all with limited real-world interaction. To unlock these downstream capabilities, a WM needs to jointly satisfy three desiderata: (i) fidelity (i.e., producing simulated trajectories that correlate with reality), (ii) consistency (i.e., producing simulated trajectories that are coherent over long horizons), and (iii) efficiency (i.e., producing simulated trajectories quickly). We propose $\texttt{WEAVER}$ (World Estimation Across Views for Embodied Reasoning): a WM architecture that simultaneously achieves all three desiderata, providing state-of-the-art results on robotic manipulation tasks. $\texttt{WEAVER}$ is a multi-view WM trained to predict future latents and reward values via a flow-matching loss. We distill the key design decisions across model architecture, memory, and prediction objectives required to unlock the kinds of long-horizon dynamic manipulation tasks that have confounded prior world modeling approaches. We apply $\texttt{WEAVER}$ in robotic hardware, demonstrating its effectiveness at policy evaluation ($\rho = 0.870$ correlation with real-world success rate), policy improvement (real-world success rate improvement of $38\%$ on top of the $\pi_{0.5}$ robot foundation model), and test-time planning (real-world success rate improvement of $14\%$ with a $5$-$10\times$ speedup over prior WMs). $\texttt{WEAVER}$ also demonstrates better performance than prior WMs when evaluated on out-of-distribution scenarios. Code, models, and videos at: https://arnavkj1995.github.io/WEAVER/

2606.13671 2026-06-12 cs.LG New submission

Understanding Truncated Positional Encodings for Graph Neural Networks

James Flora, Mitchell Black, Weng-Keen Wong, Amir Nayyeri

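Affiliations: University of California, Berkeley

AI summary: Studies how truncated positional encodings (such as the first k eigenspaces or powers of the adjacency matrix) affect the expressive power of graph neural networks. The theory shows that, under truncation, several families of positional encodings differ fundamentally in expressive power and that truncated spectral positional encodings are no longer stronger than the 1-WL test; experiments show that mixing truncated encodings outperforms any single type.

Comments: 28 pages, 4 figures, ICML 2026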

English abstract

Positional encodings (PEs) enhance the power of graph neural networks (GNNs), both theoretically and empirically. Two of the most popular families of PEs - spectral (e.g., Laplacian eigenspaces, effective resistance) and walk-based (polynomials of the adjacency matrix) - are theoretically equivalent in expressive power, with expressivity between the 1-WL and 3-WL tests. However, this equivalence assumes the GNN uses the "complete" version of these PEs, which requires $O(n^3)$ time and space complexity. Instead, practitioners commonly use truncated variants of these encodings, such as the first $k$ eigenspaces or powers of the adjacency matrix. However, the theoretical properties of these truncated PEs are unknown. In this work, we initiate the study of these truncated PEs. Theoretically, we show that, under truncation, several families of PEs are fundamentally different in expressive power. As a corollary, we show that truncated spectral PEs are no longer stronger than the 1-WL test. We also study a family of spectral PEs, the $k$-harmonic distances, to highlight the differences in expressive power of even closely related truncated PEs. Finally, we experimentally show that a mix of truncated PEs is preferable to any single family on real-world datasets.
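
To make "truncated walk-based PE" concrete, the sketch below computes one simple encoding of that kind: for each node, the number of closed walks of length 1 to k, read off the diagonals of the first k powers of the adjacency matrix. The choice of the diagonal as the node feature and the function name `truncated_walk_pe` are illustrative assumptions; the paper may analyze other polynomial features.

```python
from typing import List

Matrix = List[List[int]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain square-matrix product, adequate for small example graphs."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def truncated_walk_pe(adj: Matrix, k: int) -> List[List[int]]:
    """For each node, count closed walks of length 1..k (diagonals of A^1..A^k)."""
    n = len(adj)
    power = [row[:] for row in adj]           # A^1
    features: List[List[int]] = [[] for _ in range(n)]
    for _ in range(k):
        for node in range(n):
            features[node].append(power[node][node])
        power = matmul(power, adj)            # advance to the next power of A
    return features


# Path graph 0 - 1 - 2
path = [[0, 1, 0],
        [1, 0, 1],
        [0, 1, 0]]
pe = truncated_walk_pe(path, k=4)
```

Here the two end nodes receive identical features while the middle node differs. The truncation level k caps the cost at k matrix products instead of the full $O(n^3)$ computation mentioned in the abstract.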

2606.13670 2026-06-12 cs.AI New submission

Automated reproducibility assessments in the social and behavioral sciences using large language models

Tobias Holtdirk, Pietro Marcolongo, Anna Steinberg Schulten, Felix Henninger, Stefan Rose, Sarah Ball, Bolei Ma, Frauke Kreuter, Markus Weinmann, Stefan Feuerriegel

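Affiliations: LMU Munich; Munich Center for Machine Learning; University of Cologne

AI summary: Uses large language models (LLMs) to automate reproducibility assessment of social and behavioral science research. In a sample of 76 studies, the LLM pipeline recovered the original effect size in 41% of the studies for which it produced an estimate and reached the same qualitative conclusion as the original study in 96% of cases, outperforming human reanalysis.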

English abstract

Reproducibility in the social and behavioral sciences is typically evaluated by independent researchers who reanalyze the original data to assess whether the published findings can be recovered. However, such approaches are resource-intensive and difficult to scale. Here, we show that large language models (LLMs) can automate reproducibility assessments. Using N=76 published studies with predefined claims from the behavioral and social sciences, we compare LLM-generated analysis with the original findings and human reanalysis. For 7 studies, the LLM could not produce a viable effect size estimate. For the remaining studies, our LLM pipeline recovered the original effect sizes in 41% of studies using a +/-0.05 tolerance in Cohen's d. Further, our LLM pipeline reached the same qualitative conclusion as the original study in 96% of cases, where conclusions indicate whether the reanalysis supports the original claim. For comparison, human reanalysts recovered the original effect sizes in 34% of studies and reached the same qualitative conclusion in 74% of cases. Together, these results show that LLMs can serve as a scalable tool for automated reproducibility assessment and provide a foundation for systematic auditing of empirical results in the social and behavioral sciences.
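
The recovery criterion (a reanalysis within +/-0.05 of the original Cohen's d) can be written down directly. The sketch below uses the textbook pooled-standard-deviation form of Cohen's d for two independent groups; how the paper's pipeline actually computes d for each study design is not stated in the abstract, and the sample data are made up for illustration.

```python
from statistics import mean, variance
from typing import Sequence


def cohens_d(group_a: Sequence[float], group_b: Sequence[float]) -> float:
    """Cohen's d for two independent groups, using the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


def recovered(original_d: float, reanalysis_d: float, tolerance: float = 0.05) -> bool:
    """True when the reanalysis lands within +/- tolerance of the original effect size."""
    return abs(original_d - reanalysis_d) <= tolerance


# Illustrative data only.
treatment = [5.0, 6.0, 7.0, 8.0]
control = [4.0, 5.0, 6.0, 7.0]
d = cohens_d(treatment, control)
```

With this criterion, a reanalysis reporting d = 0.54 against an original d = 0.50 counts as recovered, while d = 0.56 does not.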

2606.13669 2026-06-12 cs.AI New submission

Agents-K1: Towards Agent-native Knowledge Orchestration

Zongsheng Cao, Bihao Zhan, Jinxin Shi, Jiong Wang, Fangchen Yu, Zhijie Zhong, Zijie Guo, Tianshuo Peng, Zhuo Liu, Yi Xie, Xiang Zhuang, Yue Fan, Runmin Ma, Shiyang Feng, Xiangchao Yan, Anran Liu, Peng Ye, Wenlong Zhang, Shufei Zhang, Chunfeng Song, Fenghua Ling, Jie Zhou, Liang He, Bo Zhang, Lei Bai

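Affiliations: PJLab (Shanghai AI Laboratory)

AI summary: Proposes Agents-K1, a pipeline that converts raw documents into agent-native scientific knowledge graphs. It combines a multimodal parser, a 4B information-extraction backbone trained with GRPO, and a tri-source agent interface to support scientific information extraction, knowledge graph construction, and multi-hop reasoning.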

English abstract

Current LLM-based research agents have advanced through agent orchestration, yet largely overlook scientific knowledge orchestration. Existing works often reduce papers to abstracts, surface mentions, and flat "cites" edges, omitting key entities, claims, evidence, mechanisms, and method lineages essential for scientific reasoning. To this end, we introduce Agents-K1, an end-to-end knowledge orchestration pipeline that converts raw documents into agent-native scientific knowledge graphs. Agents-K1 integrates three components under a unifying theoretical foundation: a multimodal parser whose five-module schema captures entities, multimodal evidence, citations, and typed inter-entity relations across the full paper rather than abstracts alone; a 4B information-extraction backbone trained with GRPO under a rule-based reward; and a graphanything CLI, a tri-source agent interface that unifies web search, multimodal graph retrieval, and cross-document traversal. On top of this, we process 2.46 million scientific papers across six subjects to produce Scholar-KG, of which we release a one-million-paper subset, and the full Scholar-KG is accessible via the SCP link below. The same pipeline can be extended to general-domain corpora and to schema-conformant data synthesis. Extensive experiments demonstrate that Agents-K1 achieves superior performance in scientific information extraction, knowledge graph construction, and multi-hop scientific reasoning.

2606.13668 2026-06-12 cs.CL New submission

Influcoder: Distilling Decoders' Gradient Influence Rankings into an Encoder for Data Attribution

Dimitri Kachler, Damien Sileo, Pascal Denis

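Affiliations: Centre Inria de l’Université de Lille, CRIStAL, Université de Lille

AI summary: Influence-function methods for attributing large language model outputs to training data are costly in computation and storage. Influcoder addresses this by distilling decoders' gradient influence rankings into an encoder, enabling fast, cost-effective data attribution at scale.

Comments: 8 pages, 2 figures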

English abstract

With the growing capabilities of large language models (LLMs), there has been an increasing push to curate high-quality datasets by filtering samples in the training data. In general, Data Attribution (DA) methods aim to estimate how individual samples in a training dataset can precondition a model to generate certain outputs. As an example, one might be interested in which samples in the data could be the source of toxic behavior after training the LLM. Many methods quantify this conditioning through the paradigm of influence functions. While methods of this family are effective at their task, they lack the processing speed and storage compactness needed to be practical on large datasets. We propose a method, Influcoder, as a quick and cost-effective approach to influence-based Data Attribution at scale.

2606.13663 2026-06-12 cs.CL New submission

HyperTool: Beyond Step-Wise Tool Calls for Tool-Augmented Agents

Yaxin Du, Yifan Zhou, Yujie Ge, Jiajun Wang, Xianghe Pang, Shuo Tang, Tuney Zheng, Bryan Dai, Jian Yang, Siheng Chen

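Affiliations: Shanghai Jiao Tong University; IQuest Research; Beijing University of Aeronautics and Astronautics

AI summary: Step-wise tool calls in tool-augmented LLMs create an execution-granularity mismatch. HyperTool, a unified executable interface, folds deterministic tool subroutines into a single call and substantially improves accuracy on multi-step tool tasks.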

English abstract

Tool-augmented LLM agents commonly rely on step-wise atomic tool calls, where each invocation, observation, and value transfer is exposed in the main reasoning trace. This creates an execution-granularity mismatch: locally deterministic tool workflows are unfolded into repeated model-visible decisions, consuming context and forcing the model to manage low-level dataflow in the trace. We introduce HyperTool, a unified executable MCP-style tool interface that changes the model-visible unit of tool execution. A model invokes HyperTool with a code block that can call existing tools through their original schemas, manipulate returned values, and pass intermediate results locally, folding deterministic tool subroutines into a single outer call. To train models to use this interface, we synthesize HyperTool-format trajectories from cross-tool compositional tasks and verify them in real MCP environments. On MCP-Universe, HyperTool improves average accuracy from 15.69% to 35.29% on Qwen3-32B and from 9.93% to 33.33% on Qwen3-8B, and surpasses GPT-OSS and Kimi-k2.5 on average accuracy, showing that HyperTool can substantially improve multi-step tool use.
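
The folding idea can be sketched as follows: instead of two model-visible tool calls with an intermediate observation in between, the model submits one code block that chains both tools, and only the final value reaches the trace. Everything here is a hypothetical illustration; `hyper_tool`, the two toy tools, and the convention of reading a variable named `result` are assumptions, not the paper's interface.

```python
from typing import Callable, Dict, List


# Hypothetical atomic tools; a real MCP server would expose these through schemas.
def search_city(name: str) -> Dict[str, float]:
    return {"lat": 48.85, "lon": 2.35} if name == "Paris" else {"lat": 0.0, "lon": 0.0}


def get_temperature(lat: float, lon: float) -> float:
    return 21.5 if (lat, lon) == (48.85, 2.35) else 0.0


TOOLS: Dict[str, Callable] = {"search_city": search_city, "get_temperature": get_temperature}
trace: List[str] = []  # what the model would see in its reasoning trace


def hyper_tool(code: str) -> object:
    """Run a code block that may call any tool; only the final `result` is model-visible."""
    scope = dict(TOOLS)
    exec(code, scope)  # trusted toy input only; a real system would sandbox this
    trace.append(f"hyper_tool -> {scope['result']}")
    return scope["result"]


block = """
city = search_city("Paris")  # intermediate value stays local to the block
result = get_temperature(city["lat"], city["lon"])
"""
answer = hyper_tool(block)
```

The trace ends up with a single entry, whereas the step-wise version would expose the coordinates as a separate observation that the model has to copy into the next call.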

2606.13652 2026-06-12 cs.CV cs.GR New submission

World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

Hao Zhang, Mohamed El Banani, Jen-Hao Cheng, Paul Zhang, Yi Hua, Ben Mildenhall, Christoph Lassner, Narendra Ahuja, Gengshan Yang

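Affiliations: World Labs; University of Illinois Urbana-Champaign

AI summary: Proposes World Tracing, a generative pixel-aligned geometry representation in which a diffusion transformer predicts an ordered stack of points, reconstructing visible surfaces while generating occluded geometry. It outperforms depth-prediction and image-to-3D methods on multiple benchmarks.

Comments: World Labs Technical Report; Page: https://haoz19.github.io/world-tracing-page/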

English abstract

Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.
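
The target representation, an ordered stack of camera-space points per pixel, can be illustrated analytically for a known scene. The sketch below intersects one camera ray with a sphere and returns the hits front to back, so the first entry is the visible surface and the second is the occluded back surface. This only shows what such a stack contains; the paper predicts it with a diffusion transformer, and the function name and scene are hypothetical.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float, float]


def ray_sphere_hits(direction: Vec, center: Vec, radius: float) -> List[Vec]:
    """Front-to-back intersections of a camera ray (from the origin) with one sphere."""
    dx, dy, dz = direction
    norm = math.sqrt(dx * dx + dy * dy + dz * dz)
    dx, dy, dz = dx / norm, dy / norm, dz / norm
    # Solve |t*d - c|^2 = r^2 for t, with d normalized: t^2 - 2*b*t + c_term = 0.
    b = dx * center[0] + dy * center[1] + dz * center[2]
    c_term = sum(v * v for v in center) - radius * radius
    disc = b * b - c_term
    if disc < 0:
        return []  # the ray misses the sphere
    root = math.sqrt(disc)
    depths = [t for t in (b - root, b + root) if t > 0]  # keep hits in front of the camera
    return [(t * dx, t * dy, t * dz) for t in depths]


# A pixel whose ray points straight down the optical axis at a unit sphere 5 units away.
stack = ray_sphere_hits((0.0, 0.0, 1.0), (0.0, 0.0, 5.0), 1.0)
miss = ray_sphere_hits((1.0, 0.0, 0.0), (0.0, 0.0, 5.0), 1.0)
```

A depth estimator would stop at the first point of `stack`; the layered representation also keeps the second.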

2606.13649 2026-06-12 cs.CL cs.LG New submission

Operadic consistency: a label-free signal for compositional reasoning failures in LLMs

Nathaniel Bottman, Yinhong Liu, Kyle Richardson

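Affiliations: Incubilate; University of Cambridge; Allen Institute for Artificial Intelligence

AI summary: Proposes operadic consistency (OC), a label-free signal for detecting compositional reasoning failures in large language models. On four multi-hop QA datasets it correlates strongly with accuracy (Pearson r ≥ 0.86) and outperforms methods such as self-consistency.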

English abstract

Detecting LLM reasoning failures at inference time without ground-truth labels has motivated a wide range of confidence baselines, including self-consistency, semantic entropy, and P(True), built on within-question sampling and self-evaluation. Operad theory, the formalism for systems built by iterated substitution, suggests a complementary diagnostic: a model's direct answer to a compositional query should agree with the answer it produces by composing a stated decomposition of the same query. We instantiate this idea as operadic consistency (OC), a per-question signal. Across twelve instruction-tuned LLMs (4B to 671B parameters, open-weights and closed-source) on four multi-hop QA datasets, OC is strongly correlated with accuracy on every dataset (Pearson $r \in [0.86, 0.94]$, all $p \leq 0.0004$), and is the only signal we evaluate with $r \geq 0.85$ uniformly across all four datasets. Chain-of-thought self-consistency (CoT-SC; Wang et al., 2023) matches OC on HotpotQA and DROP ($r = 0.93, 0.87$) but drops to $r \approx 0.45$ on MuSiQue and StrategyQA. At the per-question level, OC contributes information beyond CoT-SC and semantic entropy on every dataset (cluster-robust $p \leq 10^{-16}$ for the OC coefficient), and the conclusion is robust to additionally controlling for constructed decomposition-aware baselines ($p \leq 10^{-13}$). The same signal yields selective-prediction improvements (accuracy at fixed coverage) over a tuned CoT-SC baseline at the equal-cost $K = 3$ budget (AUARC lifts of +0.086 to +0.096 and AUROC lifts of +0.092 to +0.164; 95% CIs exclude zero on every cell). On five frontier thinking models, where the decomposition is extracted from the model's own chain of thought, the same equal-cost comparison gives positive selective-prediction point-estimate lift on all 16 (dataset, budget, metric) cells tested, with 95% CIs excluding zero on 12 of the 16.
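
The core check, whether a direct answer agrees with the answer composed from a decomposition, can be sketched without any labels. The version below is a deliberately simplified, binary reading of the idea: the `{prev}` templating, the exact-match comparison, and the lookup-table "model" are all assumptions for illustration, and the paper's actual scoring may differ.

```python
from typing import Callable, List


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings compare equal."""
    return " ".join(answer.lower().split())


def operadic_consistency(ask: Callable[[str], str], question: str,
                         sub_questions: List[str]) -> bool:
    """Compare the direct answer with the answer obtained by chaining sub-questions.

    Each sub-question may contain "{prev}", which is filled with the previous sub-answer.
    """
    direct = ask(question)
    prev = ""
    for template in sub_questions:
        prev = ask(template.format(prev=prev))
    return normalize(direct) == normalize(prev)


# Toy stand-in for an LLM: a lookup table with one deliberately inconsistent direct answer.
TOY_MODEL = {
    "Who directed Inception?": "Christopher Nolan",
    "Where was Christopher Nolan born?": "London",
    "Where was the director of Inception born?": "london",
    "Who wrote Dune?": "Frank Herbert",
    "Where was Frank Herbert born?": "Tacoma",
    "Where was the author of Dune born?": "Seattle",
}
ask = TOY_MODEL.__getitem__

consistent = operadic_consistency(
    ask, "Where was the director of Inception born?",
    ["Who directed Inception?", "Where was {prev} born?"],
)
inconsistent = operadic_consistency(
    ask, "Where was the author of Dune born?",
    ["Who wrote Dune?", "Where was {prev} born?"],
)
```

No ground-truth label is consulted: the second question is flagged purely because the direct and composed answers disagree.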

2606.13647 2026-06-12 cs.CL cs.AI cs.LG New submission

SkMTEB: Slovak Massive Text Embedding Benchmark and Model Adaptation

Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková, Viktória Ondrejová

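Affiliations: Comenius University in Bratislava; Cisco Systems; Technical University of Košice; Kempelen Institute of Intelligent Technologies

AI summary: Builds SkMTEB, the first MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, with 31 datasets across 7 task types, and develops the efficient, locally deployable models e5-sk-small and e5-sk-large. Through vocabulary trimming and fine-tuning, the models stay competitive with commercial APIs at up to 62% smaller size.

Comments: ACL 2026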

English abstract

We introduce SkMTEB, the first comprehensive MTEB-style text embedding benchmark for Slovak, a low-resource West Slavic language, comprising 31 datasets across 7 task types -- nearly 4$\times$ the depth of existing multilingual benchmark coverage for Slovak. Our evaluation of 31 embedding models reveals that large instruction-tuned multilingual models achieve the strongest performance, while existing Slovak-specific models trained for NLU tasks transfer poorly to embedding tasks. To address the need for efficient, locally-deployable Slovak embeddings, we develop e5-sk-small (45M parameters) and e5-sk-large (365M) by applying vocabulary trimming and fine-tuning to Multilingual E5 models. Despite size reductions of up to 62%, our open-source models achieve competitive performance with proprietary APIs while remaining locally deployable for semantic search and retrieval-augmented generation (RAG). We release the benchmark, models, datasets, and code openly, hoping our approach offers a replicable path for other under-resourced languages.

2606.13644 2026-06-12 cs.CV 新提交

Surflo: Consistent 3D Surface Flow Model with Global State

Surflo:具有全局状态的一致3D表面流模型

Antoine Guédon, Shu Nakamura, Nicolas Dufour, Jiahui Lei, Ko Nishino, Angjoo Kanazawa

发表机构 * LIX, École polytechnique(LIX,巴黎综合理工学院) Kyoto University(京都大学) Kyutai UC Berkeley(加州大学伯克利分校)

AI总结 提出Surflo模型,通过将可变数量的无位姿RGB视图压缩为全局潜变量,并利用流匹配从噪声中独立传输3D表面点,实现任意分辨率的一致表面重建,推理时通过光度梯度引导消除局部不一致性。

Comments Project webpage: https://anttwo.github.io/surflo/

详情
AI中文摘要

几何形状对视角具有不变性,这使得任何图像集合都是单个3D状态的冗余编码。现有的前馈重建模型未能充分利用这一点:逐视角方法会生成重叠且未对齐的点云,其数量随输入数量线性增长;而全局潜在方法则局限于固定的低分辨率输出。我们提出Surflo,它将可变数量的无位姿RGB视图压缩为K个潜在令牌(一个全局状态),并通过流匹配将带方向的3D表面点从噪声独立传输到表面上进行解码。这使得输出不受任何固定网格或令牌预算的限制:相同的潜在变量在单次前向传播中即可生成从几千到一百万个点。为了抑制独立逐点解码固有的局部不一致性,我们在ODE积分过程中注入光度梯度,通过推理时的引导项关联邻近点。Surflo在表面指标上匹配或超越前馈基线,运行速度比需要数百个视图的基于优化的方法快一个数量级,并且是唯一结合全局潜在变量与任意分辨率解码的前馈方法。

英文摘要

Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps that grow linearly with input count, while global-latent methods commit to a fixed, low-resolution output. We introduce Surflo, which compresses a variable number of unposed RGB views into K latent tokens (one global state) and decodes oriented 3D surface points by independently transporting them from noise onto the surface via flow matching. This frees the output from any fixed grid or token budget: the same latent yields from a few thousand to a million points in a single forward pass. To suppress the local inconsistencies inherent to independent per-point decoding, an inference-time guidance term correlates nearby points by injecting a photometric gradient during ODE integration. Surflo matches or surpasses feed-forward baselines on surface metrics, runs an order of magnitude faster than optimization-based methods that require hundreds of views, and is the only feed-forward approach to combine a global latent with arbitrary-resolution decoding.

2606.13643 2026-06-12 cs.CL 新提交

Recursive Agent Harnesses

递归智能体框架

Elias Lumer, Sahil Sen, Kevin Paul, Vamse Kumar Subbiah

发表机构 * PricewaterhouseCoopers, U.S.(普华永道(美国))

AI总结 提出递归智能体框架(RAH),通过代码优先的框架递归扩展模型递归,在长上下文推理中显著提升编码智能体性能。

详情
AI中文摘要

递归语言模型(RLM)表明,模型调用的递归是长上下文推理的有效策略,而生产级编码智能体已开始编写大规模生成子智能体的代码,最近如Anthropic的动态工作流。我们命名并研究了这两条工作线之间的模式,其中递归单元是一个完整的智能体框架,包含文件系统工具、代码执行和规划,而不是没有工具的模型调用。我们将其称为递归智能体框架(RAH),并将其视为框架递归,即RLM模型递归的代码优先扩展。父智能体生成并执行一个可执行脚本,该脚本并行生成子智能体框架以处理细粒度工作负载,并使用结构化函数调用处理小子任务。我们在长上下文推理上提供了受控评估。在固定主干为GPT-5以匹配已发布的Codex和RLM基线的情况下,RAH在Oolong-Synthetic(199个样本,13个上下文长度桶,最高4M令牌)上将Codex编码智能体基线从71.75%提高到81.36%,这一增益归因于框架而非模型。使用更强的骨干Claude Sonnet 4.5,同一设计达到89.77%。

英文摘要

Recursive language models (RLMs) showed that recursion over model calls is an effective strategy for long-context reasoning, and production coding agents have begun to write code that spawns subagents at scale, most recently in Anthropic's dynamic workflows. We name and study the pattern between these two lines of work, where the recursive unit is a full agent harness with filesystem tools, code execution, and planning rather than a model call with no tools. We call this the Recursive Agent Harness (RAH) and frame it as harness recursion, the code-first extension to the model recursion of RLMs. A parent agent generates and runs an executable script that spawns subagent harnesses in parallel for fine-grained workloads and uses structured function calls for small subtasks. We provide a controlled evaluation on long-context reasoning. With the backbone held fixed at GPT-5 to match the published Codex and RLM baselines, RAH improves the Codex coding-agent baseline from 71.75% to 81.36% on Oolong-Synthetic (199 samples, 13 context-length buckets up to 4M tokens), a gain attributable to the harness rather than the model. With a stronger backbone, Claude Sonnet 4.5, the same design reaches 89.77%.

2606.13640 2026-06-12 cs.SD 新提交

The Moving Drone: Negotiating Agency Between the Voice and the Virtual

移动的持续音:在人声与虚拟之间协商能动性

Nithya Shikarpur, Victor Arul, Anna Huang

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Harvard University(哈佛大学)

AI总结 基于印度斯坦音乐,通过Max/MSP循环器和生成式AI模型GaMaDHaNi,将传统上静态的持续音(drone)变为动态、主动的虚拟音乐代理,探讨人机协作中的能动性。

Comments Published in NIME music track 2026

详情
AI中文摘要

印度斯坦音乐中的旋律材料总是相对于一个主音呈现,该主音通常由坦布拉琴(一种四弦持续音乐器)持续维持。植根于印度斯坦音乐,《移动的持续音》(The Moving Drone)让传统上静态的持续音运动起来,使其在表演过程中逐渐获得能动性,从反应性角色过渡到更主动的角色。该作品在Max/MSP中使用四个独立的循环器作为“虚拟”持续音。当歌手即兴演唱时,这些循环器被实时循环填充,在人声与虚拟持续音之间形成一个有机且不断演变的反馈回路。这种关系先通过对循环进行音高移位在旋律上进一步演变,引入突然而显式的运动维度;随后通过集成GaMaDHaNi(一种以歌手为条件的音高到人声生成式AI模型)重新合成循环音频,在音色上发生变化。当前的音乐AI方法优先追求生成内容的高保真度和逼真度,这引发了音乐界对工作被替代的焦虑;本作品则有意使用低保真的生成输出,使其需要人类的诠释和情境背景才能完整。《移动的持续音》将技术和生成式AI置于既定的社会文化音乐实践之中,提出虚拟持续音可以成为一种主动、响应且共同创造的音乐代理。

英文摘要

Melodic material in Hindustani music is presented in relation to a tonic, usually sustained by the tanpura, a four-stringed drone instrument. Rooted in Hindustani music, 'The Moving Drone' sets the traditionally static drone into motion; over the course of the performance, the drone gains increasing agency, transitioning from reactive to more proactive roles. The work employs four independent loopers in Max/MSP to function as 'virtual' drones. They are populated cyclically in real time as the vocalist improvises, creating an organic and evolving feedback loop between the voice and the virtual drone. This relationship further evolves melodically by pitch-shifting the loops, which introduces a dimension of sudden, explicit movement. It then changes timbrally through the integration of GaMaDHaNi, a singer-conditioned pitch-to-voice generative AI model, which resynthesizes the looped audio. While current music AI approaches prioritize the fidelity and realism of generated content, which has sparked anxiety over job replacement in the music community, this work intentionally uses low-fidelity generative outputs that require human interpretation and situational context in order to be complete. 'The Moving Drone' positions technology and generative AI within established socio-cultural musical practices, proposing a virtual drone as an active, responsive, and co-creative musical agent.

2606.13637 2026-06-12 cs.LG 新提交

The Stable Recovery Manifold: Geometric Principles Governing Recoverability in Continual Learning

稳定恢复流形:持续学习中可恢复性的几何原理

Ayushman Trivedi, Bhavika Melwani


AI总结 通过分析Split CIFAR-100上ResNet-18的顺序学习,发现遗忘知识在表示重组后仍可紧凑解码,提出稳定恢复流形假说,表明灾难性遗忘主要是可访问性和流形对齐问题。

Comments 9 pages, 8 figures, 8 tables

详情
AI中文摘要

灾难性遗忘通常被视为顺序学习过程中先前学习知识的破坏。基于可访问性崩溃框架,我们研究了持续学习中可恢复性的几何结构。使用Split CIFAR-100和顺序训练的ResNet-18,我们分析了十个任务上的可恢复性、表示漂移和恢复复杂度。我们引入了恢复子空间维度(k_t),即保持完整探针性能90%所需的最小奇异方向数量。与我们的可恢复性扩散假说相反,尽管存在显著的表示漂移,恢复维度在整个训练过程中保持稳定(平均k_t = 8.0)。主角度漂移强烈预测可恢复性(r = -0.862),一个简单的几何模型解释了82.2%的可恢复性方差。这些发现支持稳定恢复流形假说,表明遗忘的知识在表示重组后仍可紧凑解码。结果表明,灾难性遗忘主要是一个可访问性和流形对齐问题,而非信息破坏。

英文摘要

Catastrophic forgetting is often viewed as the destruction of previously learned knowledge during sequential learning. Building on the Accessibility Collapse framework, we investigate the geometric structure of recoverability in continual learning. Using Split CIFAR-100 and a sequentially trained ResNet-18, we analyze recoverability, representational drift, and recovery complexity across ten tasks. We introduce Recovery Subspace Dimensionality (k_t), a measure of the minimum number of singular directions required to preserve 90 percent of full probe performance. Contrary to our Recoverability Diffusion hypothesis, recovery dimensionality remains stable throughout training (mean k_t = 8.0) despite substantial representational drift. Principal-angle drift strongly predicts recoverability (r = -0.862), and a simple geometric model explains 82.2 percent of recoverability variance. These findings support the Stable Recovery Manifold hypothesis, suggesting that forgotten knowledge remains compactly decodable despite representational reorganization. The results indicate that catastrophic forgetting is primarily an accessibility and manifold-alignment problem rather than information destruction.

2606.13630 2026-06-12 cs.CL 新提交

From Tokens to Faces: Investigating Discrete Speech Representations for 3D Facial Animation

从词元到面部:探究用于3D面部动画的离散语音表示

Pedro Correa, Olivier Perrotin, Samir Sadok, Paula Costa, Thomas Hueber

发表机构 * Univ. Estadual de Campinas (UNICAMP), Brazil(巴西坎皮纳斯州立大学(UNICAMP)) Univ. Grenoble Alpes, CNRS, Grenoble INP, GIPSA-lab, France(法国格勒诺布尔阿尔卑斯大学,CNRS,格勒诺布尔国立理工学院,GIPSA实验室) Inria at Univ. Grenoble Alpes, CNRS, LJK, France(法国格勒诺布尔阿尔卑斯大学Inria,CNRS,LJK)

AI总结 研究评估四种语音表示在3D面部合成中的效果,发现编码音素类别有利于准确预测面部动画,并基于此提出音频视觉文本到语音管线。

Comments This work has been accepted in Interspeech 2026

详情
AI中文摘要

语音表示的选择在语音驱动的3D面部动画中至关重要。不同表示在编码内容上有所差异:SSL特征强调音段和语义线索,神经编解码器产生优化用于声学重建的潜在表示,而ASR风格的目标产生基于标签的空间。我们评估了四种用于3D面部合成的语音表示族,通过客观指标和感知评估比较了它们在两个面部解码器上的面部重建质量。此外,我们进行了探测分析,将分词表示与音素单元和发音变形联系起来。我们发现,编码音素类别有利于在语义和基于标签的表示上准确预测面部动画,且面部动画质量相当。基于后者,我们引入了一个音频视觉文本到语音(AVTTS)管线,该管线利用离散表示作为共享空间来解码语音和3D面部运动。

英文摘要

The choice of speech representation is critical in speech-driven 3D facial animation. Representations differ in what they encode: SSL features emphasize segmental and semantic cues, neural codecs yield latents optimized for acoustic reconstruction, and ASR-style objectives produce label-based spaces. We evaluate four speech representation families for 3D facial synthesis, comparing their facial reconstruction quality across two facial decoders using objective metrics and a perceptual evaluation. We additionally conduct probing analyses that relate tokenized representations to phonetic units and to articulatory deformations. We found that encoding phonetic classes is beneficial for accurate facial animation prediction on both semantic and label-based representations with comparable facial animation quality. From the latter, we introduce an Audio Visual Text-to-Speech (AVTTS) pipeline that leverages, as a shared space, discrete representations to decode speech and 3D facial motion.

2606.13625 2026-06-12 cs.CV 新提交

Revisiting Vehicle Color Recognition in Long-Tailed Surveillance Scenarios

重新审视长尾监控场景中的车辆颜色识别

Vinícius Orrú, Bruno H. Foggiatto, Gabriel E. Lima, David Menotti, Rayson Laroca

发表机构 * Pontifical Catholic University of Paraná(巴拉那天主教大学) National High Court of Brazil(巴西国家高等法院) Federal University of Paraná(巴拉那联邦大学)

AI总结 针对监控场景中车辆颜色分布高度不平衡的问题,本文提出结合生成式数据增强、视觉表征、损失重加权等方法的综合方案,在UFPR-VeSV数据集上实现94.6%微平均和79.7%宏平均准确率,宏平均比近期文献提升8.2个百分点。

Comments Accepted for presentation at the 2026 International Conference on Pattern Recognition (ICPR) - V3SC Workshop

详情
AI中文摘要

车辆颜色识别是监控系统中车辆识别的重要线索,尤其是在车牌因低分辨率、遮挡、运动模糊或光照不足而难以辨认时。然而,真实世界的车辆颜色分布高度不平衡,使得整体准确率不足以评估在罕见但操作相关的颜色上的性能。本文利用UFPR-VeSV(一个具有挑战性的真实世界监控数据集),对严重类别不平衡下的车辆颜色识别进行了全面研究。我们通过两种现成的生成策略探索合成少数类增强:使用RunDiffusion/JuggernautXL的文本条件图像生成和使用Gemini 2.0 Flash的图像条件颜色编辑。精心策划的合成数据与现代视觉表征、损失重加权、学习率调度、颜色安全增强、前景感知预处理和集成融合相结合。表现最佳的方法达到了94.6%的微平均准确率和79.7%的宏平均准确率,宏平均准确率比近期文献提高了8.2个百分点。手动错误分析进一步表明,许多剩余的失败即使在人工标注者看来也是视觉上模糊的,这凸显了在无约束监控图像中基于颜色的车辆识别的实际局限性。生成的图像和源代码可在以下网址公开获取:https://github.com/viniciusorru/vcr-synthetic

英文摘要

Vehicle color recognition is an important cue for vehicle identification in surveillance systems, especially when license plates are illegible due to low resolution, occlusion, motion blur, or poor illumination. However, real-world vehicle color distributions are highly imbalanced, making overall accuracy insufficient to assess performance on rare but operationally relevant colors. This paper presents a comprehensive study of vehicle color recognition under severe class imbalance using UFPR-VeSV, a challenging real-world surveillance dataset. We investigate synthetic minority-class augmentation through two off-the-shelf generative strategies: text-conditioned image generation with RunDiffusion/JuggernautXL and image-conditioned color editing with Gemini 2.0 Flash. The curated synthetic data are combined with modern visual representations, loss reweighting, learning-rate scheduling, color-safe augmentation, foreground-aware preprocessing, and ensemble fusion. The best-performing approach achieves 94.6% micro accuracy and 79.7% macro accuracy, improving macro accuracy by 8.2 percentage points over recent literature. A manual error analysis further shows that many remaining failures are visually ambiguous even for human annotators, highlighting the practical limits of color-based vehicle identification in unconstrained surveillance imagery. The generated images and source code are publicly available at https://github.com/viniciusorru/vcr-synthetic
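
The gap between the micro and macro figures in this abstract is easier to read with a small worked example. The sketch below is illustrative only: it assumes "macro accuracy" means the unweighted mean of per-class accuracy (the abstract does not spell out the definition), and the function name and toy labels are invented here rather than taken from the paper's code.

```python
from collections import defaultdict


def micro_macro_accuracy(y_true, y_pred):
    """Return (micro, macro) accuracy for two parallel label lists.

    micro: fraction of all samples classified correctly.
    macro: unweighted mean of per-class accuracy (assumed definition).
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    micro = sum(correct.values()) / len(y_true)
    macro = sum(correct[c] / total[c] for c in total) / len(total)
    return micro, macro


# Toy data (invented): nine common white cars and one rare pink car.
# A classifier that always answers "white" looks strong on micro accuracy
# but is penalized on macro accuracy because the rare class is always missed.
y_true = ["white"] * 9 + ["pink"]
y_pred = ["white"] * 10
micro, macro = micro_macro_accuracy(y_true, y_pred)
print(f"micro={micro:.2f} macro={macro:.2f}")
```

Under this definition, a long-tailed color distribution can keep micro accuracy high while macro accuracy stays low, which is why the abstract reports both.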

2606.13624 2026-06-12 cs.CL 新提交

Beyond Uniform Tokens: Adaptive Compression for Time Series Language Models

超越统一令牌:时间序列语言模型的自适应压缩

Jialin Gan, Xin Qiu, Guangzhe Chen, Xue Wang

发表机构 * Zhejiang University(浙江大学) Harbin Institute of Technology(哈尔滨工业大学) Shandong University(山东大学)

AI总结 针对时间序列语言模型中令牌效率低的问题,提出自适应令牌预算框架,通过频域结构压缩时间序列令牌并逐层减少提示令牌,实现高达7.68倍推理加速并在78%设置中提升性能。

详情
AI中文摘要

大型语言模型(LLM)通过共享令牌接口联合建模数值观测和文本上下文,实现了时间序列(TS)分析。然而,TS令牌和提示令牌表现出根本不同的信息结构,使得统一令牌处理效率低下。在本文中,我们从非对称令牌的角度研究TS语言建模中的令牌效率。我们表明,TS令牌具有高度不均匀的频谱贡献,其中许多令牌共享冗余频率模式,而一小部分保留了关键的时间证据。我们还观察到,提示令牌的影响随模型深度衰减,表明在所有层中完全保留提示是不必要的。基于这些发现,我们开发了一个自适应令牌预算框架,通过频域结构压缩TS令牌,并逐层减少提示令牌。在预测、分类、插补和异常检测上的实验表明,该方法实现了高达7.68倍的推理加速,并在78%的评估设置中取得了性能提升,显示了非对称令牌压缩对于可扩展TS基础模型的有效性。

英文摘要

Large language models (LLMs) have enabled time series (TS) analysis by jointly modeling numerical observations and textual context through a shared token interface. However, TS tokens and prompt tokens exhibit fundamentally different information structures, making uniform token processing inefficient. In this paper, we study token efficiency in TS language modeling from an asymmetric-token perspective. We show that TS tokens have highly uneven spectral contributions, where many tokens share redundant frequency patterns while a small subset preserves critical temporal evidence. We also observe that prompt-token influence attenuates with model depth, suggesting that full prompt retention across all layers is unnecessary. Based on these findings, we develop an adaptive token budgeting framework that compresses TS tokens via frequency-domain structure and progressively reduces prompt tokens across layers. Experiments across forecasting, classification, imputation, and anomaly detection demonstrate up to 7.68× inference acceleration and performance gains in 78% of evaluated settings, showing the effectiveness of asymmetric token compression for scalable TS foundation models.

2606.13621 2026-06-12 cs.AI cs.CR cs.GT cs.LG cs.MA 新提交

Beyond Runtime Enforcement: Shield Synthesis as Defensibility Analysis for Adversarial Networks

超越运行时强制:作为对抗网络可防御性分析的盾牌合成

Achraf Hsain, Sultan Almuhammadi

发表机构 * Information and Computer Science Department, King Fahd University of Petroleum and Minerals(信息与计算机科学系,法赫德国王石油矿产大学)

AI总结 提出将盾牌合成重新解释为设计时分析工具,通过约束双人安全博弈生成可防御性判定,并融合拓扑度量和强化学习行为形成可防御性指纹,揭示系统安全的结构性见解。

Comments 26 pages, 7 figures, 7 tables. Under review at JAIR. Code: https://github.com/AchrafHsain7/Bastion

详情
AI中文摘要

盾牌强化学习通常被呈现为一种运行时安全机制,它将时序逻辑规范编译成限制智能体行为的自动机。我们认为这是错误的产品。同样的自动机理论机制——规范编译、乘积博弈构建、吸引子计算和获胜区域提取——更适合被解读为一种设计时分析工具,其输出是关于系统的结构性见解,而非对已部署智能体的运行时约束。我们通过一个用于网络防御的约束双人安全博弈来实例化这一点。两个规范被不对称地执行:防御者规范定义了博弈的不安全区域,而攻击者规范在吸引子计算期间限制了对手的合法行为。求解该博弈产生一个可防御性判定——一个形式化证书,表明拓扑-规范对是否可防御——以及相关的获胜区域和盾牌。除了二元判定,我们还从吸引子结构中推导出拓扑级度量,并将其与盾牌约束的对抗性多智能体强化学习的后收敛行为相结合。这些共同构成了一个可防御性指纹,捕捉了网络的形式安全属性及其在自适应博弈下的操作行为。假设分析表明,形式可防御性和操作有效性捕捉了安全的不同方面:小的架构变化可能导致操作结果的巨大变化,而形式安全裕度几乎不变。因此,盾牌合成最有价值的不是作为安全智能体的部署机制,而是作为回答关于系统是否、在哪里以及如何可以被防御的架构问题的框架。可防御性判定是输出,而非安全策略。

英文摘要

Shielded reinforcement learning is typically presented as a runtime safety mechanism that compiles temporal-logic specifications into automata restricting an agent's actions. We argue this is the wrong product. The same automata-theoretic machinery -- specification compilation, product game construction, attractor computation, and winning-region extraction -- is better read as a design-time analytical instrument whose outputs are structural insights about a system rather than runtime constraints on a deployed agent. We instantiate this through a constrained two-player safety game for network defense. The two specifications are enforced asymmetrically: the defender specification defines the unsafe region of the game, whereas the attacker specification restricts the adversary's legal actions during attractor computation. Solving the game yields a defensibility verdict -- a formal certificate that a topology-specification pair is or is not defensible -- with the associated winning region and shield. Beyond the binary verdict, we derive topology-level metrics from the attractor structure and combine them with post-convergence behavior from shield-constrained adversarial multi-agent reinforcement learning. Together these form a defensibility fingerprint capturing both a network's formal safety properties and its operational behavior under adaptive play. A what-if analysis shows that formal defensibility and operational effectiveness capture distinct aspects of security: small architectural changes can produce large shifts in operational outcomes while leaving formal safety margins nearly unchanged. Shield synthesis is thus most valuable not as a deployment mechanism for safe agents, but as a framework for answering architectural questions about whether, where, and how a system can be defended. The defensibility verdict is the output, not the safe policy.
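
The attractor computation and winning-region extraction named in this abstract can be sketched on a toy two-player safety game. This is a generic textbook-style fixpoint, not the authors' Bastion implementation: the graph encoding, the function names, and the four-node example are assumptions made for illustration, and the attacker specification is modeled only implicitly, through which attacker edges exist.

```python
def attacker_attractor(succ, owner, unsafe):
    """Vertices from which the attacker can force a visit to `unsafe`.

    succ:  dict vertex -> list of successor vertices (every vertex is a key)
    owner: dict vertex -> "attacker" or "defender" (who moves at that vertex)
    """
    attr = set(unsafe)
    changed = True
    while changed:  # iterate to a fixpoint
        changed = False
        for v, nxt in succ.items():
            if v in attr:
                continue
            if owner[v] == "attacker":
                # Attacker needs only one move into the attractor.
                pulled = any(u in attr for u in nxt)
            else:
                # Defender is trapped only if every move enters the attractor.
                pulled = bool(nxt) and all(u in attr for u in nxt)
            if pulled:
                attr.add(v)
                changed = True
    return attr


def synthesize_shield(succ, owner, unsafe):
    """Return (defender winning region, shield of allowed defender moves)."""
    win = set(succ) - attacker_attractor(succ, owner, unsafe)
    shield = {
        v: [u for u in succ[v] if u in win]
        for v in win
        if owner[v] == "defender"
    }
    return win, shield


# Toy topology (invented): the defender at d0 can route to a1 or a2;
# the attacker at a2 has an edge into the unsafe state "bad".
succ = {"d0": ["a1", "a2"], "a1": ["d0"], "a2": ["bad", "d0"], "bad": ["bad"]}
owner = {"d0": "defender", "a1": "attacker", "a2": "attacker", "bad": "defender"}
win, shield = synthesize_shield(succ, owner, unsafe={"bad"})
defensible_from_d0 = "d0" in win  # the "verdict" for initial state d0
print(sorted(win), shield, defensible_from_d0)
```

In this reading, the verdict is simply whether the initial state lies in the defender's winning region, and the shield is the by-product listing which defender moves stay inside it.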

2606.13610 2026-06-12 cs.CL cs.AI 新提交

One Polluted Page Is Enough: Evaluating Web Content Pollution in Generative Recommenders

一个被污染的页面就够了:评估生成式推荐系统中的网页内容污染

Minghao Luo, Liang Chen

发表机构 * The Chinese University of Hong Kong(香港中文大学)

AI总结 本研究提出FORGE基准,评估搜索增强LLM在检索结果被污染时推荐虚假产品的脆弱性,发现单个污染页面即可导致高达27%的推荐错误率,且推理能力无法缓解此问题。

详情
AI中文摘要

搜索增强的大语言模型通过检索实时网页内容越来越多地介入日常消费者推荐。这带来了新的风险:生成式推荐系统可能消费被污染的网页内容,例如旨在误导推荐的虚假评论和推广页面。我们提出:在消费被污染的检索结果时,搜索增强的LLM在多大程度上会成为虚假产品的无意推广者?为此,我们引入FORGE(生成环境中的虚假在线推荐),这是一个在受控网页内容污染下衡量虚假产品推荐的基准。给定上游搜索结果,FORGE将检索到的网页中的真实产品本地重写为虚假产品,以模拟网页内容污染,并测量LLM推荐虚假产品的频率。FORGE涵盖15个类别和5个消费者场景下的225个真实世界产品。在12个商业和开源LLM中,所有模型都易受影响:单个被污染的页面即可导致高达27%的被欺骗率,而完全替换前三个结果则将此比例提升至73.8%。不同类别间的脆弱性差异显著,当模型缺乏相关产品的稳定先验知识时,脆弱性增加。推理并不能缓解这种脆弱性;相反,它常常生成虚假的社会证明来为错误推荐辩护。我们评估了三种防御措施:怀疑提示和共识过滤(基于模型先验或跨文档证据)。怀疑可能加剧脆弱性,类似于推理,而过滤则可能抑制合法产品。我们在 https://github.com/leoluolol/forge-benchmark 发布FORGE。

英文摘要

Search-augmented LLMs increasingly mediate everyday consumer recommendations by retrieving live web content. This creates a new risk: generative recommenders may consume polluted web content, such as fake reviews and promotional pages crafted to mislead recommendations. We ask: to what extent do search-augmented LLMs become unwitting promoters of fake products when consuming polluted retrieval results? To answer this, we introduce FORGE (Fake Online Recommendations in Generative Environments), a benchmark for measuring fake-product promotion under controlled web-content pollution. Given an upstream search result, FORGE locally rewrites real products in retrieved web pages into fake ones to simulate web-content pollution, and measures how often the LLM recommends the fake product. FORGE covers 225 real-world products across 15 categories and 5 consumer scenarios. Across 12 commercial and open-weights LLMs, all models are vulnerable: a single polluted page yields fooled rates of up to 27%, while the full top-3 replacement raises this to 73.8%. Vulnerability varies substantially across categories, increasing when models lack stable prior knowledge of the relevant products. Reasoning does not mitigate this vulnerability; instead, it often generates spurious social proof to justify false recommendations. We evaluate three defenses: skepticism prompting and consensus filtering (over model priors or cross-document evidence). Skepticism can exacerbate vulnerability, much like reasoning, while filtering risks suppressing legitimate products. We release FORGE at https://github.com/leoluolol/forge-benchmark.

2606.13604 2026-06-12 cs.AI cs.LG cs.MA 新提交

Multi-Agent Reinforcement Learning from Delayed Marketplace Feedback for Objective-Weight Adaptation in Three-Sided Dispatch

基于延迟市场反馈的多智能体强化学习在三方调度中的目标权重自适应

Haochen Wu, Yi Hou, Shiguang Xie

发表机构 * DoorDash

AI总结 提出在DoorDash部署的强化学习系统,利用延迟信号自适应调整调度目标权重,通过离线策略学习在噪声和耦合反馈下优化配送质量与批处理效率的权衡。

Comments Accepted at ICML 2026 Workshop on Reinforcement Learning from World Feedback (RLxF)

详情
AI中文摘要

三方市场中的调度为从世界反馈中进行强化学习提供了自然场景:决策通过延迟的操作结果(如配送速度、骑手利用率和商家拥堵)进行评估。我们介绍了DoorDash部署的一个强化学习系统,该系统利用延迟信号在大规模食品配送市场中自适应调整调度目标权重。该系统并非取代组合分配优化器,而是通过从记录的市场数据中学习的店铺级策略选择一个离散乘数,该乘数改变调度优化器在配送质量与批处理效率之间的权衡。这种接口使得在噪声、延迟和耦合反馈下进行离线策略学习成为可能,同时保留生产可行性约束和操作保障。我们使用集中式离线数据和分散式店铺级执行训练共享价值函数,采用双Q学习目标和保守正则化器以减少分布外价值高估。在生产切换实验中,离线训练的策略增加了批处理并减少了骑手侧时间成本,而不会降低面向客户的配送质量。结果展示了如何利用来自实时经济和物流系统的世界反馈安全地在线调整决策策略。

英文摘要

Dispatch in three-sided marketplaces provides a natural setting for reinforcement learning from world feedback: decisions are evaluated by delayed operational outcomes such as delivery speed, courier utilization, and merchant congestion. We present a deployed reinforcement learning system at DoorDash that adapts dispatch objective weights in a large-scale food-delivery marketplace using delayed signals. Rather than replacing the combinatorial assignment optimizer, a store-level policy learned from logged marketplace data selects a discrete multiplier that shifts the dispatch optimizer's tradeoff between delivery quality and batching efficiency. This interface enables offline policy learning under noisy, delayed, and coupled feedback while preserving production feasibility constraints and operational safeguards. We train a shared value function using centralized offline data and decentralized store-level execution, with Double Q-learning targets and a conservative regularizer to reduce out-of-distribution value overestimation. In a production switchback experiment, the offline-trained policy increases batching and reduces courier-side time costs without degrading customer-facing delivery quality. Results illustrate how world feedback from a live economic and logistics system can be used to safely adapt decision policies online.
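
The Double Q-learning target mentioned in this abstract can be written out in a few lines. The sketch below shows the standard Double Q-learning bootstrap only; the conservative regularizer is omitted, and the function name, the discount value, and the three candidate multipliers are hypothetical rather than details of the DoorDash system.

```python
def double_q_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Standard Double Q-learning bootstrap target for one transition.

    The online network selects the next action; the target network
    evaluates it. Decoupling selection from evaluation reduces the
    overestimation bias of the plain max operator.
    """
    if done:
        return reward
    best_action = max(range(len(next_q_online)), key=next_q_online.__getitem__)
    return reward + gamma * next_q_target[best_action]


# Hypothetical discrete multipliers a store-level policy might choose from.
multipliers = [0.8, 1.0, 1.2]
next_q_online = [0.2, 0.9, 0.5]   # online net prefers action index 1
next_q_target = [0.3, 0.4, 0.8]   # target net's values for the same actions
y = double_q_target(reward=1.0, next_q_online=next_q_online,
                    next_q_target=next_q_target, gamma=0.5)
print(multipliers[1], y)
```

Note that the target uses the target network's value for the action the online network picked (0.4 here), not the target network's own maximum (0.8).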

2606.13603 2026-06-12 cs.LG cs.AI cs.CL 新提交

Beyond the Commitment Boundary: Probing Epiphenomenal Chain-of-Thought in Large Reasoning Models

超越承诺边界:探究大型推理模型中的附带思维链

Daniel Scalena, Sara Candussio, Luca Bortolussi, Elisabetta Fersini, Malvina Nissim, Gabriele Sarti

发表机构 * CLCG, University of Groningen(格罗宁根大学CLCG) University of Milano-Bicocca(米兰比可卡大学) University of Trieste(的里雅斯特大学) Khoury College of Computer Sciences, Northeastern University(东北大学Khoury计算机科学学院)

AI总结 通过早期退出估计思维链步骤的因果重要性,发现推理中存在从瞬态猜测到稳定答案的“承诺边界”,后续步骤为附带现象,可提前退出以缩短推理长度达55%而不影响性能。

详情
AI中文摘要

思维链推理是语言模型推理时扩展的主导范式,但每个步骤对最终答案的因果影响尚不明确。我们通过早期退出估计每个步骤的因果重要性,并利用这一度量研究多个模型家族的推理轨迹中答案如何形成。在多种任务中,我们发现推理通常会跨越一个“承诺边界”——从瞬态中间猜测到稳定、高置信度答案的急剧转变。这种转变通常发生在单个步骤中,远在模型推理块结束之前,随后是“附带”的思维链步骤,这些步骤不改变最终答案概率。利用注意力探针,我们表明答案形成阶段可以从中间推理步骤中以高精度线性解码,并稳健地泛化到未见过的推理任务。我们利用这一信号在承诺边界处提前退出推理块,平均将思维链长度减少高达55%,而对模型性能影响微乎其微。

英文摘要

Chain-of-thought (CoT) reasoning is the dominant paradigm for inference-time scaling in language models, yet the causal influence of individual steps on the final answer remains poorly understood. We estimate each step's causal importance via early exit and use this measure to study how answers form across the reasoning traces of several model families. Across diverse tasks, we find that reasoning typically crosses a commitment boundary: a sharp transition from transient intermediate guesses to a stable, high-confidence answer. This transition often happens in a single step, well before the model's reasoning block ends, and is followed by epiphenomenal CoT steps that leave the final answer probability unaltered. Using attention probes, we show that answer-formation stages can be linearly decoded from intermediate reasoning steps with high accuracy and generalize robustly to unseen reasoning tasks. We exploit this signal to early-exit reasoning blocks at the commitment boundary, reducing the length of CoTs by up to 55% on average with negligible impact on model performance.

2606.13602 2026-06-12 cs.AI 新提交

EpiBench: Verifiable Evaluation of AI Agents on Epigenomics Analysis

EpiBench:人工智能代理在表观基因组学分析中的可验证评估

Harihara Muralidharan, Reema Baskar, Soo Hee Lee, Tim Proctor, Kenny Workman

发表机构 * LatchBio

AI总结 提出EpiBench基准,通过106个评估任务测试AI代理在表观基因组学工作流中的决策能力,发现最佳系统GPT-5.5/Pi通过率仅45%,失败多因缺乏深度科学判断。

详情
AI中文摘要

我们介绍了EpiBench,一个用于短程表观基因组学分析的可验证基准。EpiBench评估代理是否能够从真实工作流状态中做出明确定义的分析决策,并返回可确定性评分的答案。该基准包含CUT&Tag/CUT&RUN、ATAC-seq、ChIP-seq和DNA甲基化工作流中的106个评估。在来自16个模型-框架对的5,088条有效轨迹中,没有系统通过大多数尝试:GPT-5.5 / Pi以45.0%(143/318次尝试;95%置信区间(CI)为36.3至53.7)领先,其次是GPT-5.5 / OpenAI Codex的39.9%(127/318次尝试;95% CI为31.6至48.3)。Claude Opus 4.8 Max / Pi和GPT-5.4 / Pi各自通过了39.0%(124/318次尝试;95% CI分别为30.2至47.8和31.0至47.0)。性能因检测类型而异,许多失败的运行仍包含部分正确答案。代理通常能找到正确的文件并计算出有用的中间结果,但当任务需要更深入、特定于检测的科学判断时,它们就会失败。

英文摘要

We introduce EpiBench, a verifiable benchmark for short-horizon epigenomics analysis. EpiBench evaluates whether agents can make well-defined analysis decisions from realistic workflow states and return deterministically gradable answers. The benchmark includes 106 evaluations across CUT&Tag/CUT&RUN, ATAC-seq, ChIP-seq, and DNA methylation workflows. Across 5,088 valid trajectories from 16 model-harness pairs, no system passed a majority of attempts: GPT-5.5 / Pi led at 45.0% (143/318 attempts; 95% confidence interval (CI), 36.3 to 53.7), followed by GPT-5.5 / OpenAI Codex at 39.9% (127/318 attempts; 95% CI, 31.6 to 48.3). Claude Opus 4.8 Max / Pi and GPT-5.4 / Pi each passed 39.0% (124/318 attempts; 95% CI, 30.2 to 47.8 and 31.0 to 47.0, respectively). Performance varies across assay types, and many failed runs still contain parts of the correct answer. Agents often found the right files and computed useful intermediate results, but failed when the task required deeper, assay-specific scientific judgment.

2606.13601 2026-06-12 cs.RO cs.SY eess.SY 新提交

MCR-Bionic Hand: Anatomical Structural Priors for Dexterous Manipulation

MCR-Bionic Hand: 用于灵巧操作的解剖结构先验

Haosen Yang, Guowu Wei

发表机构 * University of Salford(索尔福德大学)

AI总结 本文提出MCR-Bionic Hand,一种基于人体手部解剖结构先验的仿生机械手,通过结构智能实现低维控制到灵巧操作的映射,在接触密集型任务中验证了其有效性。

详情
AI中文摘要

灵巧机器人手通常被表述为由自由度、驱动和控制算法支配的高维主动控制系统。然而,人类手的灵巧性部分编码在骨骼、韧带、肌腱、腱膜和内在肌肉的物理结构中。本文将这种贡献描述为两种相互关联的结构智能形式:结构先验生成,其中腕指腱固定、FDS/FDP路径和背侧伸肌腱帽将低维姿态输入转换为默认抓取构型及PIP到DIP协调;以及肌肉介导的调节,其中外在肌、蚓状肌和骨间肌围绕该默认状态调节MCP姿态、远端稳定性、指尖力路径和接触状态。基于此框架,MCR-Bionic Hand被开发为一个1:1肌肉骨骼仿生手,在一个主体内集成了两排八骨手腕、跨腕肌腱、解剖屈肌腱路径、掌板和侧副韧带约束、背侧伸肌腱帽以及内在肌通路。功能演示和几何力学模型表明,手腕姿态诱导多关节预塑形,伸肌腱帽将PIP姿态映射为耦合的DIP响应,而内在肌通路在抓取形成后调节远端稳定性和指尖动作方向。接触密集型任务,包括硬币旋转、笔传递、手背翻硬币和立方体操作,表明MCR-Bionic将低维状态生成与精细接触后调节联系起来。这些结果表明,解剖仿生学的价值不在于视觉相似性,而在于识别执行部分控制功能的人手结构。

英文摘要

Dexterous robotic hands are usually formulated as high-dimensional active control systems governed by degrees of freedom, actuation, and algorithms. Human hand dexterity, however, is partly encoded in the physical architecture of bones, ligaments, tendons, aponeuroses, and intrinsic muscles. This work describes that contribution as two linked forms of structural intelligence: structural prior generation, in which wrist-to-finger tenodesis, FDS/FDP routing, and the dorsal extensor hood transform low-dimensional posture inputs into default grasp configurations and PIP-to-DIP coordination; and muscle-mediated modulation, in which extrinsic muscles, lumbricals, and interossei regulate MCP posture, distal stability, fingertip force paths, and contact states around that default state. Based on this framework, MCR-Bionic Hand is developed as a 1:1 musculoskeletal biomimetic hand integrating a two-row, eight-bone wrist, cross-wrist tendons, anatomical flexor routing, volar plate and collateral ligament constraints, the dorsal extensor hood, and intrinsic muscle pathways within one body. Functional demonstrations and geometric mechanical models show that wrist posture induces multi-joint pre-shaping, the extensor hood maps PIP posture to a coupled DIP response, and intrinsic plus pathways modulate distal stability and fingertip action direction after grasp formation. Contact-rich tasks, including coin rotation, pen transfer, dorsal coin flipping, and cube manipulation, show that MCR-Bionic links low-dimensional state generation with fine post-contact modulation. These results suggest that anatomical biomimetics is valuable not for visual similarity, but for identifying human hand structures that perform part of the control function.

2606.13598 2026-06-12 cs.AI cs.CL cs.LG cs.MA 新提交

Reward Modeling for Multi-Agent Orchestration

多智能体编排的奖励建模

King Yeung Tsang, Zihao Zhao, Vishal Venkataramani, Haizhou Shi, Zixuan Ke, Semih Yavuz, Shafiq Joty, Hao Wang

发表机构 * Rutgers University(罗杰斯大学) Salesforce AI Research(Salesforce人工智能研究)

AI总结 提出OrchRM框架,通过自监督学习从多智能体执行中间产物构建奖励模型,无需人工标注,实现高效编排器训练和测试时扩展,在多个领域提升性能并降低计算成本。

Comments Preprint; work in progress

详情
AI中文摘要

基于大型语言模型(LLM)的多智能体系统(MAS)需要有效的编排来协调专门化的智能体,然而训练这样的编排器受到有限监督和高计算成本的阻碍。我们提出了编排奖励建模(OrchRM),一种无需人工标注即可评估编排质量的自监督框架。OrchRM利用多智能体执行过程中的中间产物来构建Bradley-Terry奖励模型训练的胜负对。与现有的依赖昂贵子智能体展开的MAS测试时扩展和编排器训练框架不同,OrchRM直接在编排层面操作,实现了高效且高性能的奖励引导编排器训练和MAS测试时扩展。OrchRM在token使用上提高了高达10倍的训练效率,同时将MAS测试时扩展的准确率提升了高达8%。这些增益在多个领域(包括数学推理、基于网络的问答和多跳推理)中一致迁移,证明了编排级奖励建模作为鲁棒多智能体编排的可扩展方向。代码将在 https://github.com/Wang-ML-Lab/OrchRM 提供。

英文摘要

Multi-Agent Systems (MAS) built on Large Language Models (LLMs) require effective orchestration to coordinate specialized agents, yet training such orchestrators is hindered by limited supervision and high computational cost. We propose Orchestration Reward Modeling (OrchRM), a self-supervised framework for evaluating orchestration quality without human annotations. OrchRM leverages intermediate artifacts from multi-agent executions to construct win-lose pairs for Bradley-Terry reward model training. Unlike existing MAS test-time scaling and orchestrator training frameworks that rely on costly sub-agent rollouts, OrchRM operates directly at the orchestration level, enabling efficient and high-performing reward-guided orchestrator training and MAS test-time scaling. OrchRM improves training efficiency by up to 10x in token usage while improving MAS test-time scaling performance by up to 8% in accuracy. These gains consistently transfer across multiple domains, including mathematical reasoning, web-based question answering, and multi-hop reasoning, demonstrating orchestration-level reward modeling as a scalable direction for robust multi-agent orchestration. Code will be available at https://github.com/Wang-ML-Lab/OrchRM.
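
The Bradley-Terry objective named in this abstract trains a reward model so that the preferred item in each win-lose pair receives the higher score. The sketch below shows only that standard pairwise loss on scalar scores; how OrchRM builds the pairs from intermediate artifacts and how it parameterizes the scorer are not described in the abstract, so the function names and the example scores here are hypothetical.

```python
import math


def bradley_terry_loss(score_win, score_lose):
    """Pairwise loss -log(sigmoid(score_win - score_lose)).

    Written in the numerically stable softplus form so that large
    margins in either direction do not overflow math.exp.
    """
    margin = score_win - score_lose
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(pairs):
    """Average Bradley-Terry loss over (score_win, score_lose) tuples."""
    return sum(bradley_terry_loss(w, l) for w, l in pairs) / len(pairs)


# Hypothetical reward-model scores for two win-lose orchestration pairs.
pairs = [(2.0, 0.0), (0.5, 1.5)]
print(mean_pairwise_loss(pairs))
```

The loss equals log 2 when the two scores tie and shrinks toward zero as the winning score pulls ahead, which is the signal a reward model would be trained to minimize.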

2606.13591 2026-06-12 cs.AI cs.LG cs.MA 新提交

Multiagent Protocols with Aggregated Confidence Signals

带有聚合置信信号的多智能体协议

Ali Elahi, Barbara Di Eugenio

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校)

AI总结 提出三种协议,通过转换原始置信信号并采用软投票或贝叶斯融合,为多智能体系统输出聚合置信度,在保持正确性的同时显著提升判别能力。

Comments 22 pages and 5 figures, 9 pages and 2 figures before the appendix

详情
AI中文摘要

置信度在自然语言处理中用于可靠性、监督和一系列下游决策任务,但目前没有方法能够为多智能体系统的输出产生或评估置信度。先前的工作在多智能体辩论中使用置信度来加权消息、触发辩论或校准单个智能体,但从未将这些置信度聚合成系统本身的单一置信度。我们引入了三种协议,通过首先转换原始置信信号使其在不同模型间可比,然后通过软投票或称为贝叶斯融合的概率融合方法将它们组合,从而产生最终答案和单一的聚合置信度。这种聚合置信度在判别性(AUARC)上显著优于最佳单个智能体或标准辩论基线,同时正确性(F1分数)保持稳定,并恢复了多智能体辩论在更模糊任务上的损失。通过分析两种估计器(序列概率和自我报告)以及参数和非参数校准器,我们发现校准提高了两种估计器的F1分数,而AUARC对其依赖较小。我们在五个基准测试和四种任务类型上评估了每基准六对同质和异质辩论对,涵盖了多种模型能力和大小。

英文摘要

Confidence is used for reliability, oversight, and a range of downstream decision tasks in Natural Language Processing (NLP), yet no existing method produces or evaluates a confidence for the output of a multiagent system. Prior work uses confidence within multiagent debate (MAD) to weight messages, trigger debate, or calibrate individual agents, but it never aggregates these into a single confidence for the system itself. We introduce three protocols that produce a final answer along with a single aggregated confidence by first transforming raw confidence signals to make them comparable across models, then combining them via soft voting or a probability fusion we call Bayesian fusion. This aggregated confidence is substantially more discriminative (AUARC) than that of the best single agent or the standard debate baselines, while correctness (F1-score) stays stable and recovers the losses MAD incurs on more ambiguous tasks. Analyzing two estimators, sequence probability and self-report, alongside parametric and non-parametric calibrators, we find that calibration improves F1 for both estimators while AUARC is less reliant on it. We evaluate six homogeneous and heterogeneous debating pairs per benchmark, across five benchmarks and four task types, spanning a range of model capabilities and sizes.
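
The two aggregation styles described in this abstract can be illustrated with a toy example. This is a sketch under stated assumptions, not the paper's protocols: it assumes each agent's confidence has already been transformed into a comparable probability distribution over candidate answers, uses plain averaging for soft voting, and stands in a generic normalized product rule for the probability fusion, since the abstract does not give the exact "Bayesian fusion" formula. All names are hypothetical.

```python
def soft_vote(dists):
    """Average per-answer probabilities across agents.

    dists: non-empty list of dicts mapping answer -> probability.
    Returns (best answer, aggregated confidence).
    """
    if not dists:
        raise ValueError("need at least one agent distribution")
    answers = set().union(*dists)
    avg = {a: sum(d.get(a, 0.0) for d in dists) / len(dists) for a in answers}
    best = max(avg, key=avg.get)
    return best, avg[best]


def product_fusion(dists, eps=1e-6):
    """Multiply per-answer probabilities across agents, then renormalize.

    A generic naive-Bayes-style rule with a uniform prior (assumption);
    eps keeps one agent's zero from vetoing an answer outright.
    """
    if not dists:
        raise ValueError("need at least one agent distribution")
    answers = set().union(*dists)
    raw = {}
    for a in answers:
        p = 1.0
        for d in dists:
            p *= max(d.get(a, 0.0), eps)
        raw[a] = p
    z = sum(raw.values())
    best = max(raw, key=raw.get)
    return best, raw[best] / z


# Two hypothetical debating agents that agree on "yes" with different confidence.
agents = [{"yes": 0.7, "no": 0.3}, {"yes": 0.6, "no": 0.4}]
vote_answer, vote_conf = soft_vote(agents)
fuse_answer, fuse_conf = product_fusion(agents)
print(vote_answer, round(vote_conf, 3), fuse_answer, round(fuse_conf, 3))
```

In this toy case the product rule yields a higher system-level confidence than averaging when agents agree, which is one reason the choice of aggregation can matter for a discrimination metric such as AUARC.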

2606.13587 2026-06-12 cs.CV New submission

Towards Effective Waste Segmentation for Automated Waste Recycling in Cluttered Background

Mamoona Javaid, Mubashir Noman, Abdul Hannan, Shah Nawaz, Mustansar Fiaz, Sajid Ghuffar

Institutions: University of Science and Technology Beijing

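AI summary: Proposes a cascaded segmentation network that combines the spatial and spectral domains and introduces an auxiliary feature enhancement module, achieving efficient waste segmentation in cluttered scenes; effectiveness is validated on three datasets.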

Comments: Accepted at ICML 2026

Abstract

Rapid expansion of urban areas and population growth are causing an immense increase in waste production, which demands efficient and automated waste management. In this scenario, automated waste recycling (AWR) using deep learning methods can assist humans in optimal waste management. Recent deep learning approaches for AWR provide promising waste segmentation performance; however, these methods rely on large backbone networks that are inefficient for AWR systems and suffer from performance deterioration in cluttered scenes. To this end, an optimal waste segmentation network is introduced which effectively utilizes the spatial domain to capture localized structural dependencies and the spectral domain to efficiently extract global contextual relationships. This cascaded design allows the network to progressively leverage both local and global representations across complementary domains to highlight the semantic information necessary for effective segmentation of various waste objects. Furthermore, an auxiliary feature enhancement module (AFEM) is introduced to enhance the target objects' boundaries and blob amplification for better segmentation in cluttered scenarios. Extensive experiments on the ZeroWaste-aug, ZeroWaste-f, and SpectralWaste datasets demonstrate the merits of the proposed method.