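arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems science.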

2605.12334 2026-05-21 cs.AI

Reinforcing VLAs in Task-Agnostic World Models

Yucen Wang, Rui Yu, Fengming Zhang, Junjie Lu, Xinyao Qin, Tianxiang Zhang, Kaixin Wang, Li Zhao

Affiliations: Microsoft Research Asia; Nanjing University; University of Illinois at Urbana-Champaign; Wuhan University; University of Technology Sydney; Tsinghua University

AI summary: This paper proposes RAW-Dream, which decouples world-model learning from downstream task dependencies. It uses a pre-trained world model and an off-the-shelf vision-language model for zero-shot inference, improving VLA adaptation without task-specific data.

Abstract

Post-training Vision-Language-Action (VLA) models via reinforcement learning (RL) in learned world models has emerged as an effective strategy to adapt to new tasks without costly real-world interactions. However, while using imagined trajectories reduces the sample complexity of policy training, existing methods still heavily rely on task-specific data to fine-tune both the world and reward models, fundamentally limiting their scalability to unseen tasks. To overcome this, we argue that world and reward models should capture transferable physical priors that enable zero-shot inference. We propose RAW-Dream (Reinforcing VLAs in task-Agnostic World Dreams), a new paradigm that completely disentangles world model learning from downstream task dependencies. RAW-Dream utilizes a world model pre-trained on diverse task-free behaviors for predicting future rollouts, and an off-the-shelf Vision-Language Model (VLM) for reward generation. Because both components are task-agnostic, VLAs can be readily finetuned for any new task entirely within this zero-shot imagination. Furthermore, to mitigate world model hallucinations, we introduce a dual-noise verification mechanism to filter out unreliable rollouts. Extensive experiments across simulation and real-world settings demonstrate consistent performance gains, proving that generalized physical priors can effectively substitute for costly task-dependent data, offering a highly scalable roadmap for VLA adaptation.

2605.12321 2026-05-21 cs.AI cs.CY cs.ET

LIDSA: Cognitive Arbitration for Signal-Free Autonomous Intersection Management

Abderrahmane Lakas, Mohamed Amine Ferrag, Merouane Debbah

Affiliations: Department of Computer and Network Engineering, United Arab Emirates University, UAE; Research Institute for Digital Future, Khalifa University, UAE

AI summary: This paper proposes the LIDSA framework, which uses a large language model to issue intent-driven speed advisories for signal-free autonomous intersection management. Comparisons against fixed-cycle control, SCATS, AIM, and GLOSA support the effectiveness of LLMs for real-time intersection management.

Comments: Renamed LISA to LIDSA to avoid naming ambiguity with existing traffic-control software. No technical changes.

Abstract

Large language models (LLMs) show strong potential for Intelligent Transportation Systems (ITS), particularly in tasks requiring situational reasoning and multi-agent coordination. These capabilities make them well suited for cooperative driving, where rule-based approaches struggle in complex and dynamic traffic environments. Intersection management remains especially challenging due to conflicting right-of-way demands, heterogeneous vehicle priorities, and vehicle-specific kinematic constraints that must be resolved in real time. However, existing approaches typically use LLMs as auxiliary components on top of signal-based systems rather than as primary decision-makers. Signal controllers remain vehicle-agnostic, reservation-based methods lack intent awareness, and recent LLM-based systems still depend on signal infrastructure. In addition, LLM inference latency limits their use in sub-second control settings. We propose LIDSA (LLM-Based Intent-Driven Speed Advisory), a signal-free cognitive arbitration framework for autonomous intersection management. LIDSA uses an LLM to reason over declared vehicle intents, incorporating priority classes, queue pressure, and energy preferences. We evaluate LIDSA against fixed-cycle control, SCATS, AIM, and GLOSA across varying traffic loads. Results show that LIDSA reduces mean control delay by up to 89.1% and maintains Level of Service C while all non-LLM baselines degrade to Level of Service F. Under near-saturated demand, LIDSA reduces mean waiting time by 93% and peak queue length by 60.6% relative to fixed-cycle control. It also lowers fuel consumption by up to 48.8% and achieves 86.2% intent satisfaction, compared to 61.2% for the best non-LLM method. These results demonstrate that LLM-based reasoning can enable real-time, signal-free intersection management.

2605.12196 2026-05-21 cs.LG

ECTO: Exogenous-Conditioned Temporal Operator for Ultra-Short-Term Wind Power Forecasting

Cao Yuan, Junjun Wang

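Affiliations: Wuhan Polytechnic University; Wuhan Public Meteorological Service Center

AI summary: This paper proposes ECTO, a unified framework that uses Physically-Grounded Variable Selection and Exogenous-Conditioned Regime Refinement modules to model non-stationary, condition-dependent wind generation for ultra-short-term wind power forecasting, achieving the lowest MSE across wind farms that differ in climate, capacity, and number of exogenous variables.

Comments: 42 pages, 10 figures, 9 tables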

Abstract

Accurate ultra-short-term wind power forecasting is critical for grid dispatch and reserve management, yet remains challenging due to the non-stationary, condition-dependent nature of wind generation. Meteorological exogenous variables carry substantial predictive information, but the most informative variable combination varies across sites, operating conditions, and prediction horizons. Existing deep learning approaches either treat exogenous inputs as generic auxiliary channels through uniform mixing or soft gating, or rely on fixed preprocessing steps such as PCA, without exploiting the physical structure of meteorological variables. We propose ECTO (Exogenous-Conditioned Temporal Operator), a unified framework that decomposes exogenous variable modeling into two complementary modules. Physically-Grounded Variable Selection (PGVS) performs hierarchical, group-aware sparse selection over exogenous variables using a domain-informed physical prior and sparsemax activations, producing a compact, condition-adaptive exogenous context. Exogenous-Conditioned Regime Refinement (ECRR) routes the forecast through learned regime experts that apply gain-bias calibration and horizon-specific corrections via a mixture-of-experts paradigm. Experiments on three wind farms spanning different climates, capacities (66-200 MW), and exogenous dimensions (11-13 variables) demonstrate that ECTO achieves the lowest MSE across all sites, with relative improvements over the strongest baseline ranging from 2.2% to 5.2%, widening to 8.6% at the longer prediction horizon ($H=32$). Ablation analysis confirms that each exogenous-related component contributes positively (PGVS +1.84%, ECRR +2.86%), and interpretability analysis reveals that PGVS learns physically meaningful, site-specific variable selection patterns, while ECRR converges to well-separated calibration strategies consistent across sites.
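
The sparsemax activation named in the abstract can be illustrated with a short sketch. This is only the standard sparsemax projection onto the probability simplex, not the paper's PGVS module; the variable names and scores in the usage lines are hypothetical.

```python
def sparsemax(scores):
    """Euclidean projection of a score vector onto the probability simplex.

    Unlike softmax, low-scoring entries receive exactly zero weight,
    which is what makes the resulting selection sparse.
    """
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    tau = 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        # Entry j stays in the support while 1 + j * z_j > sum of the top j scores.
        if 1.0 + j * zj > cumsum:
            tau = (cumsum - 1.0) / j  # threshold for the current support size
    return [max(s - tau, 0.0) for s in scores]


# Hypothetical relevance scores for three exogenous variables.
variables = ["wind_speed", "temperature", "humidity"]
weights = sparsemax([1.0, 0.8, 0.1])
selected = [name for name, w in zip(variables, weights) if w > 0.0]
print(selected)
```

In this toy case the lowest-scoring variable receives a weight of exactly zero, so it drops out of the selection rather than being merely down-weighted.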

2605.11866 2026-05-21 cs.SD

AuDirector: A Self-Reflective Closed-Loop Framework for Immersive Audio Storytelling

Yiming Ren, Xuenan Xu, Ziyang Zhang, Wen Wu, Baoxiang Li, Chao Zhang

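Affiliations: Shanghai Artificial Intelligence Laboratory; Tsinghua University

AI summary: This paper proposes AuDirector, a self-reflective closed-loop multi-agent framework that addresses coherence, emotional expressiveness, and audio fidelity in long-form audio storytelling, improving speech generation quality and user interactivity.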

Abstract

Despite advances in text and visual generation, creating coherent long-form audio narratives remains challenging. Existing frameworks often exhibit limitations such as mismatched character settings with voice performance, insufficient self-correction mechanisms, and limited human interactivity. To address these challenges, we propose AuDirector, a self-reflective closed-loop multi-agent framework. Specifically, it involves an Identity-Aware Pre-production mechanism that transforms narrative texts into character profiles and utterance-level emotional instructions to retrieve suitable voice candidates and guide expressive speech synthesis, thereby promoting context-aligned voice adaptation. To enhance quality, a Collaborative Synthesis and Correction module introduces a closed-loop self-correction mechanism to systematically audit and regenerate defective audio components. Furthermore, a Human-Guided Interactive Refinement module facilitates user control by interpreting natural language feedback to interactively refine the underlying scripts. Experiments demonstrate that AuDirector achieves superior performance compared to state-of-the-art baselines in structural coherence, emotional expressiveness, and acoustic fidelity. Audio samples can be found at https://anonymous-itsh.github.io/.

2605.11302 2026-05-21 cs.LG cs.AI cs.CL

A Theory of Time-Sensitive Language Generation: Sparse Hallucination Beats Mode Collapse

Atul Ganju, Travis McVoy, Shaddin Dughmi, Shang-Hua Teng

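Affiliations: University of Southern California

AI summary: This paper studies language generation in the limit under a global preference ordering and proposes a time-sensitive formulation of generation. It shows that sparse hallucination overcomes mode collapse and that optimal density is achievable under specific conditions.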

Abstract

We study language generation in the limit under a global preference ordering on strings, as introduced by Kleinberg and Wei. As is done in previous work, we aim for breadth, but impose an additional requirement of timeliness: higher-ranked strings should be generated earlier. A string is then only credited if it is generated before a deadline, where its deadline is defined by a function that maps a string's rank in the target language to the time by which it must be produced. This is in keeping with a central consideration in machine learning, where inductive bias favors "simpler" or "more plausible" outputs, all else being equal. We show that timely generation is impossible in a strong sense for eventually consistent generators, the protagonists of most prior related work. Under what is perhaps the mildest natural relaxation of consistency, a hallucination rate that vanishes over time, we show that we can circumvent our impossibility result. In particular, we can achieve optimal density with respect to any superlinear deadline function. We also show this is tight by ruling out timely generation with linear deadlines and vanishing hallucination rate.
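
The deadline rule described in the abstract can be made concrete with a toy sketch. It assumes a finite output sequence, treats "before a deadline" as "at or before time deadline(rank)", and uses a simple credited fraction as a stand-in for the paper's density notion, which the abstract does not define; the function name and inputs are hypothetical.

```python
def credited_fraction(output_ranks, deadline, horizon_rank):
    """Fraction of the top `horizon_rank` strings emitted by their deadline.

    output_ranks[t - 1] is the target-language rank of the string emitted at
    time t (1-indexed), or None for a hallucination (a string outside the language).
    """
    first_seen = {}
    for t, rank in enumerate(output_ranks, start=1):
        if rank is not None and rank not in first_seen:
            first_seen[rank] = t  # only the first emission of a string matters
    credited = sum(
        1
        for r in range(1, horizon_rank + 1)
        if r in first_seen and first_seen[r] <= deadline(r)
    )
    return credited / horizon_rank


# One hallucination at time 3; ranks 1..5 are all produced eventually.
outputs = [2, 1, None, 3, 5, 4]
print(credited_fraction(outputs, lambda r: 2 * r, 5))  # lenient deadline: 1.0
print(credited_fraction(outputs, lambda r: r, 5))      # strict deadline: 0.4
```

The same output sequence earns full credit under the lenient deadline and only partial credit under the strict one, which is the sensitivity to the deadline function that the abstract's results are about.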

2605.11151 2026-05-21 cs.AI cs.RO

RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

Andrew Choi, Wei Xu

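Affiliations: Horizon Robotics

AI summary: This work proposes RankQ, which augments temporal-difference learning with a self-supervised multi-term ranking loss to learn a more accurate critic in large state-action spaces, enabling more efficient offline-to-online fine-tuning on sparse-reward D4RL benchmarks and in vision-based robot learning.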

Abstract

Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state-action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essentially acts as a behavior cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors. Across sparse-reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. In vision-based robot learning, RankQ enables effective offline-to-online fine-tuning of a pretrained vision-language-action (VLA) model in a low-data regime, achieving on average a 42.7% higher simulation success rate than the next best method. In a high-data setting, RankQ improves simulation performance by 13.7% over the next best method and achieves strong sim-to-real transfer, increasing real-world cube stacking success from 43.1% to 88.9% relative to the VLA's initial performance.

2605.10830 2026-05-21 cs.CV cs.LG

Predicting 3D structure by latent posterior sampling

Azmi Haider, Dan Rosenbaum

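Affiliations: Department of Computer Science; University of Haifa; Department of Computational Science

AI summary: This paper proposes a probabilistic modeling approach that combines a NeRF representation with diffusion models to accurately predict 3D structure from different types of observations (single-view, multi-view, noisy images, sparse pixels, and sparse depth data).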

Abstract

The remarkable achievements of both generative models of 2D images and neural field representations for 3D scenes present a compelling opportunity to integrate the strengths of both approaches. In this work, we propose a methodology that combines a NeRF-based representation of 3D scenes with probabilistic modeling and reasoning using diffusion models. We view 3D reconstruction as a perception problem with inherent uncertainty that can thereby benefit from probabilistic inference methods. The core idea is to represent the 3D scene as a stochastic latent variable for which we can learn a prior and use it to perform posterior inference given a set of observations. We formulate posterior sampling using the score-based inference method of diffusion models in conjunction with a likelihood term computed from a reconstruction model that includes volumetric rendering. We train the model using a two-stage process: first we train the reconstruction model while auto-decoding the latent representations for a dataset of 3D scenes, and then we train the prior over the latents using a diffusion model. By using the model to generate samples from the posterior we demonstrate that various 3D reconstruction tasks can be performed, differing by the type of observation used as inputs. We showcase reconstruction from single-view, multi-view, noisy images, sparse pixels, and sparse depth data. These observations vary in the amount of information they provide for the scene and we show that our method can model the varying levels of inherent uncertainty associated with each task. Our experiments illustrate that this approach yields a comprehensive method capable of accurately predicting 3D structure from diverse types of observations.

2605.10787 2026-05-21 cs.AI cs.SE

ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

Yuanyang Li, Xue Yang, Longyue Wang, Weihua Luo, Hongyang Chen

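Affiliations: Zhejiang University; Zhejiang Lab; Alibaba Group

AI summary: This paper proposes the ComplexMCP benchmark for evaluating LLM agents in dynamic, interdependent, large-scale tool environments. It exposes the shortcomings of existing models on complex tasks and identifies three key bottlenecks: tool retrieval saturation, over-confidence, and strategic defeatism.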

Abstract

Current LLM agents are proficient at calling isolated APIs but struggle with the "last mile" of commercial software automation. In real-world scenarios, tools are not independent; they are atomic, interdependent, and prone to environmental noise. We introduce ComplexMCP, a benchmark designed to evaluate agents in these rigorous conditions. Built on the Model Context Protocol (MCP), ComplexMCP provides over 300 meticulously tested tools derived from 7 stateful sandboxes, ranging from office suites to financial systems. Unlike existing datasets, our benchmark utilizes a seed-driven architecture to simulate dynamic environment states and unpredictable API failures, ensuring a deterministic yet diverse evaluation. We evaluate various LLMs across full-context and RAG paradigms, revealing a stark performance gap: even top-tier models fail to exceed a 60% success rate, trailing far behind human performance of 90%. Granular trajectory analysis identifies three fundamental bottlenecks: (1) tool retrieval saturation as action spaces scale; (2) over-confidence, where agents skip essential environment verifications; and (3) strategic defeatism, a tendency to rationalize failure rather than pursuing recovery. These findings underscore the insufficiency of current agents for interdependent workflows, positioning ComplexMCP as a critical testbed for the next generation of resilient autonomous systems.
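
One way to read "seed-driven" failure injection is sketched below: a tool draws its failures from a random generator seeded per episode, so failures look unpredictable to the agent yet replay identically for a given seed. The class, tool name, and failure rate are hypothetical and are not taken from the ComplexMCP code base.

```python
import random


class FlakyTool:
    """Hypothetical sandbox tool whose failures are reproducible per seed."""

    def __init__(self, name, seed, failure_rate=0.2):
        self.name = name
        self.failure_rate = failure_rate
        # Seeding with the (seed, name) pair gives each tool its own stream.
        self._rng = random.Random(f"{seed}:{name}")

    def call(self, payload):
        if self._rng.random() < self.failure_rate:
            return {"ok": False, "error": "transient failure"}
        return {"ok": True, "result": payload}


def run_episode(seed, steps=20):
    """Call the tool `steps` times and record which calls succeeded."""
    tool = FlakyTool("create_invoice", seed)
    return [tool.call(i)["ok"] for i in range(steps)]


# The same seed replays the same failure pattern.
print(run_episode(7) == run_episode(7))  # True
```

Varying the seed across episodes gives diversity, while fixing it makes any single episode repeatable for debugging and fair comparison between agents.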

2605.10603 2026-05-21 cs.CV

Segment Anything with Robust Uncertainty-Accuracy Correlation

Hongyou Zhou, Marc Toussaint, Ling Shao, Zihan Ye

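Affiliations: Technical University of Berlin; University of Chinese Academy of Sciences

AI summary: This paper proposes RUAC, a segmentation method that adds a lightweight uncertainty head and adversarial training to improve pixel-wise uncertainty estimation under appearance and deformation shifts, improving segmentation quality and uncertainty-accuracy correlation.

Comments: ICML 2026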

Abstract

Despite strong zero-shot performance, SAM is unreliable under domain shift due to Mask-level Confidence Confusion (MCC), where a single IoU-based mask score fails to reflect pixel-wise reliability near boundaries. Motivated by the contrast between texture-biased shortcuts in neural networks and shape-centric processing in human vision, we model out-of-domain variation as appearance shifts and non-rigid deformations that jointly stress calibration. We propose Segment Anything with Robust Uncertainty-Accuracy Correlation (RUAC) for robust pixel-wise uncertainty estimation under appearance and deformation shifts. RUAC adds a lightweight uncertainty head, trains it with a collaborative style-deformation attack that jointly perturbs texture and geometry, and applies Uncertainty-Accuracy Alignment to ensure uncertainty consistently highlights erroneous pixels even under adversarial perturbations. Across 23 zero-shot domains, RUAC improves segmentation quality and yields more faithful uncertainty with stronger uncertainty-accuracy correlation. Project page: https://hongyouzhou.github.io/ruac/.

2605.10181 2026-05-21 cs.CV cs.AI

A Comparative Study of Machine Learning and Deep Learning for Out-of-Distribution Detection

Jihyeon Baek, Seunghoon Lee, Gitaek Kwon, Doohyun Park

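Affiliations: VUNO Inc.

AI summary: This paper compares traditional machine learning and deep learning on out-of-distribution detection and finds that lightweight machine learning methods match the accuracy of deep learning at substantially lower computational cost, making them suitable for tasks of limited visual complexity.

Comments: Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore.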

Abstract

Out-of-distribution (OOD) detection is essential for building reliable AI systems, as models that produce outputs for invalid inputs cannot be trusted. Although deep learning (DL) is often assumed to outperform traditional machine learning (ML), medical imaging data are typically acquired under standardized protocols, leading to relatively constrained image variability in OOD detection tasks. This motivates a direct comparison between ML and DL approaches in this setting. The two approaches are evaluated on open datasets comprising over 60,000 fundus and non-fundus images across multiple resolutions. Both approaches achieved an AUROC of 1.000 and accuracies between 0.999 and 1.000 on internal and external validation sets, showing comparable detection performance. The ML approach, however, exhibited substantially lower end-to-end latency while maintaining equivalent accuracy, indicating greater computational efficiency. These results suggest that for OOD detection tasks of limited visual complexity, lightweight ML approaches can achieve DL-level performance with significantly reduced computational cost, supporting practical real-world deployment.
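
For readers unfamiliar with the AUROC figure quoted above, here is a minimal sketch of how it can be computed from detector scores. It uses the pairwise-comparison definition, which is quadratic in the number of samples and meant only for illustration; the scores and labels are made up and unrelated to the paper's data.

```python
def auroc(scores, labels):
    """Probability that a random positive scores above a random negative.

    labels use 1 for in-distribution (e.g. fundus) and 0 for OOD; ties count 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Perfect separation gives 1.0; one misordered pair out of four gives 0.75.
print(auroc([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))  # 1.0
print(auroc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))  # 0.75
```

An AUROC of 1.000 therefore means every in-distribution image was scored above every OOD image, which is a statement about ranking and says nothing about latency.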

2605.10165 2026-05-21 cs.CV cs.AI

Task-Agnostic Noisy Label Detection via Standardized Loss Aggregation

Inhyuk Park, Doohyun Park

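Affiliations: VUNO Inc.

AI summary: This paper proposes SLA, a task-agnostic noisy-label detection method that quantifies label reliability by aggregating standardized cross-validation losses. Experiments show that SLA outperforms the hard-counting baseline at every noise level and converges faster at low noise ratios, supporting efficient re-annotation and improved dataset reliability.

Comments: Accepted to IEEE ISBI 2026. The final published version will appear in IEEE Xplore.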

Abstract

Noisy labels are common in large-scale medical imaging datasets due to inter-observer variability and ambiguous cases. We propose a statistically grounded and task-agnostic framework, Standardized Loss Aggregation (SLA), for detecting noisy labels at the sample level. SLA quantifies label reliability by aggregating standardized fold-level validation losses across repeated cross-validation runs. This formulation generalizes discrete hard-counting schemes into a continuous estimator that captures both the frequency and magnitude of performance deviations, yielding interpretable and statistically stable noisiness scores. Experiments on a public fundus dataset demonstrate that SLA consistently outperforms the hard-counting baseline across all noise levels and converges substantially faster, especially under low noise ratios where subtle loss variations are informative. Samples with high SLA scores indicate potentially ambiguous or mislabeled cases, guiding efficient re-annotation and improving dataset reliability for any classification task.
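
The aggregation idea in the abstract can be sketched as follows. The abstract does not give the exact formula, so this sketch assumes that each validation loss is z-scored within its own fold and that a sample's score is the mean of its z-scores across repeats; the function name, data layout, and loss values are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def sla_scores(runs):
    """Average within-fold standardized validation loss per sample.

    runs: list of repeated CV runs; each run is a list of folds; each fold
    maps sample_id -> validation loss for the samples held out in that fold.
    """
    z_by_sample = defaultdict(list)
    for folds in runs:
        for fold in folds:
            mu = mean(fold.values())
            sigma = pstdev(fold.values())
            for sample_id, loss in fold.items():
                # Assumption: a constant-loss fold contributes a neutral z of 0.
                z = 0.0 if sigma == 0 else (loss - mu) / sigma
                z_by_sample[sample_id].append(z)
    return {sid: mean(zs) for sid, zs in z_by_sample.items()}


# Two repeats of 2-fold CV over five samples; "c" has consistently high loss.
runs = [
    [{"a": 0.1, "b": 0.2, "c": 2.0}, {"d": 0.1, "e": 0.3}],
    [{"a": 0.2, "c": 1.8, "d": 0.1}, {"b": 0.1, "e": 0.2}],
]
scores = sla_scores(runs)
print(max(scores, key=scores.get))  # c
```

Because the score is continuous, it keeps information about how far a sample's loss deviates, whereas a hard count would only record whether a threshold was crossed.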

2605.09860 2026-05-21 cs.AI

When to Re-Commit: Temporal Abstraction Discovery for Long-Horizon Vision-Language Reasoning

Chen Li, Zhantao Yang, Fangyi Chen, Han Zhang, Anudeepsekhar Bolimera, Marios Savvides

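Affiliations: Carnegie Mellon University

AI summary: This paper proposes a learnable, state-conditioned commitment depth for long-horizon vision-language reasoning. Adjusting the commitment depth dynamically raises the solve rate while reducing the number of primitive actions, outperforming fixed-depth baselines and existing models.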

Abstract

Long-horizon reasoning requires deciding not only what actions to take, but how deeply to commit before the next observation. We formalize this as commitment depth: the number of primitive actions executed open-loop between replans. Commitment depth induces a trade-off between replanning cost and compounding execution error, yet most existing long-horizon systems fix it as a hand-designed scalar. In this work, we instead treat commitment depth as a learnable, state-conditioned variable of the policy itself. We instantiate this within a model-native vision-language policy that jointly predicts both what to execute and for how long. Across Sliding Puzzle and Sokoban, the resulting adaptive policy Pareto-dominates every non-degenerate fixed-depth baseline, achieving up to 12.5 percentage points higher solve rate while using approximately 25% fewer primitive actions per episode. Despite using a 7B backbone, our method outperforms GPT-5.5 and Claude Sonnet on both tasks, while every tested open-weight vision-language model achieves 0% zero-shot success. We further present a theoretical analysis showing that, under the standard commitment-depth surrogate, state-conditioned commitment strictly dominates any fixed depth whenever the locally optimal depth varies across states.

2605.09586 2026-05-21 cs.CV cs.RO

DeformMaster: An Interactive Physics-Neural World Model for Deformable Objects from Videos

Can Li, Zhoujian Li, Ren Li, Jie Gu, Lei Lei, Jingmin Chen, Lei Sun

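Affiliations: Nankai University; Zhejiang University; Southern University of Science and Technology; Rightly Robotics, A4X; University of Science and Technology of China

AI summary: This work proposes DeformMaster, a video-derived interactive physics-neural world model that turns real interaction videos into a unified dynamics-and-appearance framework for deformable objects. It preserves structured physical rollout, uses a neural residual to compensate for unmodeled effects, and drives high-fidelity 4D appearance; experiments show that it outperforms existing methods in dynamics prediction and appearance rendering.

Comments: Project page: https://can-lee.github.io/deformmaster-web/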

Abstract

World models for deformable objects should recover not only geometry and appearance, but also underlying physical dynamics, interaction grounding, and material behavior. Learning such a model from real videos is challenging because deformable linear, planar, and volumetric objects evolve under high-dimensional deformation, noisy interactions, and complex material response. The model must therefore infer a physical state from visual observations, roll it forward under new interactions, and render the resulting dynamics with high visual fidelity. We present DeformMaster, a video-derived interactive physics-neural world model that turns real interaction videos into an online interactive model of deformable objects within a unified dynamics-and-appearance framework. DeformMaster preserves structured physical rollout while using a neural residual to compensate for unmodeled effects, grounds sparse hand motion as a distributed compliant actuator for hand-continuum interaction, represents material response with spatially varying constitutive experts, and drives high-fidelity 4D appearance from the predicted physical evolution. Experiments on real-world deformable-object sequences demonstrate DeformMaster's ability to roll out future dynamics and render dynamic appearance, outperforming state-of-the-art baselines while supporting novel action rollout, material-parameter variation, and dynamic novel-view synthesis. Project page: https://can-lee.github.io/deformmaster-web/

2605.08858 2026-05-21 cs.CV

ProDG: Prototypes for Data-Free Generative Post-Hoc Explainability

Piotr Borycki, Magdalena Trędowicz, Jacek Tabor, Łukasz Struski, Przemysław Spurek

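Affiliations: Jagiellonian University; IDEAS Research Institute; Centre for Credible AI; Warsaw University of Technology

AI summary: This paper proposes ProDG, a data-free post-hoc explainability framework that uses generative models to synthesize pure, high-fidelity prototypes directly from a frozen model's weights, removing the dependency on any external data and providing robust visual interpretability for privacy-sensitive domains.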

Abstract

Ante-hoc interpretability methods based on prototypes provide highly accurate explanations by utilizing the intuitive "this looks like that" reasoning paradigm. On the other hand, post-hoc models can explain predictions for a single image without relying on an underlying dataset or requiring costly neural network retraining. Recent approaches successfully solve the retraining problem for prototype-based networks. However, they still face a fundamental limitation: they require access to a subset of data (e.g., a test or validation set) to search for and extract the visual prototypes. In this paper, we address this issue and introduce ProDG: Generative Prototypes for Data-Free Post-Hoc Explainability, a novel framework that leverages generative models to synthesize pure, high-fidelity prototypes directly from the frozen model's weights, completely eliminating the dependency on any external data. By establishing this new frontier in Data-Free XAI, ProDG unlocks robust visual interpretability for privacy-sensitive domains, where original data is strictly restricted or fundamentally inaccessible. Project page: https://github.com/piotr310100/ProDG

2605.08123 2026-05-21 cs.LG cs.CL

Block-Wise Differentiable Sinkhorn Attention: Tail-Refinement Gradients with a Gap-Aware Dustbin Bridge

Dylan Forde

Affiliations: Independent Researcher

AI summary: This paper studies long-context balanced entropic optimal transport (OT) attention on TPU hardware through a stopped-base, fixed-depth tail-refinement surrogate: after a stopped T-step Sinkhorn solve, a short refinement tail is unrolled and differentiated exactly. It proves an exact one-reference-tile schedule for the R=2 backward pass, giving block-wise cost O((T+R)LW), O(Ld) input storage, and O(L) additional HBM usage, and formalizes the dustbin_block path as the same surrogate on an augmented support. On TPU v6e-8, held-out Pfam test shards improve reconstruction from 5.57 to 2.05 and sparse CE from 5.53 to 5.30 relative to step 0, while target-barycenter alignment metrics do not materially improve.

Abstract

We study long-context balanced entropic optimal transport (OT) attention on TPU hardware through a stopped-base, fixed-depth tail-refinement surrogate. After a stopped $T$-step Sinkhorn solve, we unroll a short refinement tail and differentiate that surrogate exactly. For the reported $R=2$ TPU path, the backward pass contains four staircase plan factors. We prove an exact one-reference-tile schedule: the $R=2$ score cotangent is a single reference plan tile times an explicit modifier field built from vector cotangents and dual differences. This yields block-wise cost $O((T+R)LW)$, $O(Ld)$ input storage, and $O(L)$ additional HBM usage for fixed head dimension $d$ and band width $W$ on the balanced fixed-support path. We also formalize the current \texttt{dustbin\_block} path as the same unit-target surrogate on an augmented support, so the adjoint schedule lifts to the single-active-dustbin path used in our TPU runs; this bridge is algebraic and does not claim a general KL-unbalanced or arbitrary-capacity gap model. We provide a local surrogate-bias bound, an a posteriori bias certificate, and a projective contraction certificate for strictly positive active blocks. On synthetic masked problems, the optimized kernel matches exact autodiff of the same centered surrogate to within $10^{-5}$--$10^{-10}$. On TPU v6e-8, a four-configuration Pfam screen completes end-to-end, and a promoted balanced $R=2$ run sustains roughly $8.5$ examples per second through a three-hour budget, reaching step $1437$. Held-out Pfam test shards improve reconstruction from $5.57$ to $2.05$ and sparse CE from $5.53$ to $5.30$ relative to step $0$, with CE logged diagnostically rather than optimized directly; target-barycenter alignment metrics do not materially improve, and a deterministic diagonal reference remains stronger on those metrics.
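To make the "stopped base plus short tail" idea concrete, here is a minimal forward-only sketch of dense balanced Sinkhorn with uniform marginals. It is an illustration under stated assumptions, not the paper's block-wise TPU kernel: the function name `sinkhorn_plan`, the temperature `eps`, and the toy score matrix are all hypothetical, and no gradients are computed. The first `T` iterations stand in for the stopped solve and the last `R` for the tail a surrogate gradient would be unrolled through.

```python
import math

def sinkhorn_plan(scores, eps=1.0, T=20, R=2):
    """Dense balanced entropic OT with uniform marginals (forward pass only).

    T base iterations play the role of the stopped solve; the final R
    iterations are the refinement tail. Names and defaults are illustrative.
    """
    n, m = len(scores), len(scores[0])
    K = [[math.exp(s / eps) for s in row] for row in scores]  # Gibbs kernel
    a, b = 1.0 / n, 1.0 / m                                   # uniform marginals
    u, v = [1.0] * n, [1.0] * m
    for _ in range(T + R):
        # Alternate row and column scaling updates.
        u = [a / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]

plan = sinkhorn_plan([[2.0, 0.0, 1.0], [0.0, 1.0, 3.0], [1.0, 2.0, 0.0]])
row_sums = [sum(r) for r in plan]
col_sums = [sum(plan[i][j] for i in range(3)) for j in range(3)]
```

Because the loop ends on a column update, the column sums match the target marginal up to rounding, while the row sums are only approximately matched; that residual is the kind of quantity a bias certificate would need to bound.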

2605.07926 2026-05-21 cs.AI

AgentEscapeBench: Evaluating Out-of-Domain Tool-Grounded Reasoning in LLM Agents

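Zhengkang Guo, Yiyang Li, Lin Qiu, Xiaohua Wang, Jingwen Xv, Dongyu Ru, Xiaoyu Li, Xiaoqing Zheng, Xuezhi Cao, Xunliang Cai

Affiliations: Fudan University; Meituan Longcat Team

AI summary: This paper introduces AgentEscapeBench, a benchmark for evaluating whether LLM agents can sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. Its escape-room-style tasks test whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints; results show that performance drops sharply as dependency depth increases.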

Abstract

As LLM-based agents increasingly rely on external tools, it is important to evaluate their ability to sustain tool-grounded reasoning beyond familiar workflows and short-range interactions. We introduce AgentEscapeBench, an escape-room-style benchmark that tests whether agents can infer, execute, and revise novel tool-use procedures under explicit long-range dependency constraints. Each task defines a directed acyclic dependency graph over tools and items, requiring agents to invoke real external functions, track hidden state revealed incrementally, propagate intermediate results, and submit a deterministically verifiable final answer. AgentEscapeBench includes 270 instances across five difficulty tiers and supports fully automated evaluation. Experiments with sixteen LLM agents and human participants show that performance drops sharply as dependency depth increases: humans decline from 98.3% success at difficulty-5 to 80.0% at difficulty-25, while the best model drops from 90.0% to 60.0%. Trajectory analysis attributes model failures mainly to breakdowns in long-range state tracking, clue adherence, and intermediate-result propagation. These findings suggest that current agents can often handle local tool use but still struggle with deep contextual dependencies. We hope AgentEscapeBench can serve as a diagnostic testbed for measuring current agent capabilities and informing future training efforts toward more robust general-purpose reasoning, action, and adaptation.
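The notion of dependency depth in a directed acyclic graph can be sketched as follows. This is only an illustration: the toy task, its node names, and the helper `dependency_depth` are hypothetical and not taken from the benchmark, and the abstract does not say that its difficulty levels are computed this way.

```python
from graphlib import TopologicalSorter

def dependency_depth(deps):
    """deps maps each node to the set of nodes it depends on (a DAG).

    Returns, for every node, the length of the longest dependency chain
    that ends at that node.
    """
    depth = {}
    # static_order() yields every node after all of its dependencies.
    for node in TopologicalSorter(deps).static_order():
        depth[node] = 1 + max((depth[d] for d in deps.get(node, ())), default=0)
    return depth

# Hypothetical escape-room task; names are illustrative only.
task = {
    "key": set(),
    "uv_lamp": set(),
    "open_drawer": {"key"},
    "read_code": {"open_drawer", "uv_lamp"},
    "unlock_safe": {"read_code"},
}
depths = dependency_depth(task)
```

An agent solving this toy task has to carry the result of `open_drawer` forward through `read_code` before it can call `unlock_safe`, which is the kind of intermediate-result propagation the trajectory analysis singles out.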

2605.07731 2026-05-21 cs.CL cs.AI

Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs

Andrea Sassella, Andrea Chizzola, Tommaso Bianchi, Luca Alessandrelli, Mark James Carman

Affiliations: AIRIC, Politecnico di Milano; DEIB, Politecnico di Milano

AI summary: This paper benchmarks EngGPT2-16B-A3B across a range of benchmarks, comparing it with comparably sized open-source MoE and dense models and reporting its results on both international and Italian benchmarks.

Abstract

This report benchmarks the performance of ENGINEERING Ingegneria Informatica S.p.A.'s EngGPT2MoE-16B-A3B LLM, a 16B parameter Mixture of Experts (MoE) model with 3B active parameters. Performance is investigated across a wide variety of representative benchmarks, and is compared against comparably-sized open-source MoE and dense models. In comparison with popular Italian models, namely FastwebMIIA-7B, Minerva-7B, Velvet-14B, and LLaMAntino-3-ANITA-8B, EngGPT2MoE-16B-A3B performs as well as or better than them on international benchmarks: ARC-Challenge, GSM8K, AIME24, AIME25, MMLU, and HumanEval (HE). It achieves the best performance for the longest context setting (32k) of the RULER benchmark. On the Italian benchmark dataset ITALIC, the model performs as well as or better than the other models except for Velvet-14B, which outperforms it. Compared with popular MoE models of comparable size, the new model reports higher values than DeepSeek-MoE-16B-Chat on all considered benchmarks. It has higher values than Moonlight-16B-A3B on HE, MMLU, AIME24, AIME25, GSM8K, and the 32k RULER setting, but lower values on BFCL and some ARC and ITALIC settings. Finally, it has lower values than GPT-OSS-20B on most benchmarks, including HE, MMLU, AIME24, AIME25, GSM8K, ARC, BFCL, and RULER 32k. When compared with popular dense models, EngGPT2MoE-16B-A3B reports higher values on AIME24 and AIME25 than Llama-3.1-8B-Instruct, Gemma-3-12b-it, and Ministral-3-8BInstruct-2512-BF16, but lower values on ITALIC, BFCL, and RULER with a 32k context. When performance is aggregated across all benchmark metrics, EngGPT2MoE-16B-A3B shows higher performance than the Italian models under evaluation while achieving lower results than some of the most performant international models, in particular GPT-5 nano and Qwen3-8B. Taken together, our findings indicate that the new model is a step forward for native Italian Large Language Models.

2605.07021 2026-05-21 cs.AI

Behavior Cue Reasoning: Monitorable Reasoning Improves Efficiency and Safety through Oversight

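Christopher Z. Cui, Taylor W. Killian, Prithviraj Ammanabrolu

Affiliations: University of California, San Diego; Brigham Young University

AI summary: This work proposes Behavior Cue Reasoning, which trains LLMs to emit Behavior Cues that make their reasoning more controllable and monitorable. With these cues, a monitor can prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving, and safe-action recovery raises the success rate from 46% to 96%.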

Abstract

Reasoning in Large Language Models (LLMs) poses a challenge for oversight as many misaligned behaviors do not surface until reasoning concludes. To address this, we introduce Behavior Cue Reasoning for making LLM reasoning more controllable and monitorable. Behavior Cues are special token sequences that a model is trained to emit immediately before specific implicit and explicit behaviors, acting as dual-purpose signal and control levers. When fine-tuning a weaker external monitor with Reinforcement Learning for reasoning oversight, a compressed view of only the information surfaced by Behavior Cues is a sufficient signal for the monitor to prune up to 50% of otherwise wasted reasoning tokens in complex math problem solving. When leveraged by an almost optimal rule-based monitor in an environment where excessive constraint violations result in failure, Behavior Cues allow for the recovery of safe actions from 80% of reasoning traces that would otherwise end with the proposal of an unsafe action, more than doubling the success rate from 46% to 96%. Through evaluation across two model families and three domains, we show that Behavior Cue Reasoning improves reasoning monitorability and controllability with no cost to performance. More broadly, our work advances scalable oversight by demonstrating how the monitored model itself can be trained to reason in a way that is more tractable to oversight. Code: https://github.com/christopherzc/behavior-cues

2605.06139 2026-05-21 cs.LG cs.AI

Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex

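Yun Qu, Qi Wang, Yixiu Mao, Heming Zou, Yuhang Jiang, Yingyue Li, Wutong Xu, Lizhou Cai, Weijie Liu, Clive Bai, Kai Yang, Yangkun Chen, Saiyong Yang, Xiangyang Ji

Affiliations: Department of Automation, Tsinghua University; LLM Department, Tencent

AI summary: This paper proposes Listwise Policy Optimization (LPO), which conducts the target-projection explicitly: it makes the implicit target explicit by restricting the proximal RL objective to the response simplex, then projects the policy via exact divergence minimization. Across diverse reasoning tasks and LLM backbones, it improves training performance while preserving optimization stability and response diversity.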

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a standard approach for large language models (LLMs) post-training to incentivize reasoning capacity. Among existing recipes, group-based policy gradient is prevalent, which samples a group of responses per prompt and updates the policy via group-relative advantage signals. This work reveals that these optimization strategies share a common geometric structure: each implicitly defines a target distribution on the response simplex and projects toward it via first-order approximation. Building on this insight, we propose Listwise Policy Optimization (LPO) to explicitly conduct the target-projection, which demystifies the implicit target by restricting the proximal RL objective to the response simplex, and then projects the policy via exact divergence minimization. This framework provides (i) monotonic improvement on the listwise objective with bounded, zero-sum, and self-correcting projection gradients, and (ii) flexibility in divergence selection with distinct structural properties through the decoupled projection step. On diverse reasoning tasks and LLM backbones, LPO consistently improves training performance over typical policy gradient baselines under matched targets, while intrinsically preserving optimization stability and response diversity.
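One way to picture a target distribution on the response simplex is the standard closed form for a KL-regularized objective: maximizing <q, r> - beta * KL(q || p) over distributions q on a finite set gives q_i proportional to p_i * exp(r_i / beta). The sketch below uses that textbook result as an assumption; the abstract does not state that LPO's target takes exactly this form, and the names `simplex_target`, `beta`, and the toy numbers are hypothetical.

```python
import math

def simplex_target(old_probs, rewards, beta=1.0):
    """Maximizer of <q, r> - beta * KL(q || p) over the simplex:
    q_i is proportional to p_i * exp(r_i / beta)."""
    weights = [p * math.exp(r / beta) for p, r in zip(old_probs, rewards)]
    z = sum(weights)
    return [w / z for w in weights]

def kl(q, p):
    """KL(q || p) for two discrete distributions; p must be positive where q is."""
    return sum(qi * math.log(qi / pi) for qi, pi in zip(q, p) if qi > 0)

# Four sampled responses to one prompt, with the old policy's probabilities
# renormalized over the group (toy numbers).
p_old = [0.4, 0.3, 0.2, 0.1]
rewards = [1.0, 0.0, 1.0, 0.0]      # verifiable 0/1 rewards
target = simplex_target(p_old, rewards)
gap = kl(target, p_old)             # divergence a projection step would reduce
```

In this toy group the target shifts mass toward the two rewarded responses while keeping their relative odds under the old policy, and `gap` measures how far the current policy is from that target.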

2605.05863 2026-05-21 cs.LG cs.AI

SOPE: Stabilizing Off-Policy Evaluation for Online RL with Prior Data

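Carlo Romeo, Girolamo Macaluso, Alessandro Sestini, Andrew D. Bagdanov

Affiliations: Media Integration and Communication Center – University of Florence; SEED – Electronic Arts

AI summary: This paper proposes SOPE, an algorithm that uses an actor-aligned Off-Policy Policy Evaluation (OPE) signal as an automated early-stopping mechanism to dynamically control the length of offline training phases, improving baseline performance and reducing compute on continuous control tasks.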

Abstract

Incorporating prior data into online reinforcement learning accelerates training but typically forces a difficult trade-off between high computational costs and long, multi-stage training pipelines. While fixed-length stabilization phases are significantly more computationally efficient than static update schedules, they require task-dependent manual tuning, risking either the waste of prior knowledge or severe overfitting. To address this, we propose SOPE, an algorithm that uses an actor-aligned Off-Policy Policy Evaluation (OPE) signal as an automated early-stopping mechanism to dynamically control the length of offline training phases. By evaluating the critic on a held-out validation split under the current policy's action distribution, SOPE halts gradient updates exactly when out-of-distribution benefits saturate, eliminating the need for manual schedule tuning. Evaluated on 25 continuous control tasks from the Minari benchmark suite, SOPE improves baseline performance by up to 45.6% while reducing the required TFLOPs by up to 22x, thus balancing the tradeoff between sample and computational efficiency. These findings demonstrate that adaptive, evaluation-driven update schedules are more effective than relying on static, exhaustive update schedules.
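The early-stopping idea can be sketched with a generic patience rule over a held-out signal. This is an assumption-laden illustration: the abstract does not give SOPE's actual stopping criterion, so `patience`, `min_delta`, the function name, and the validation curve are hypothetical, and the signal here is just a list of numbers rather than a critic evaluated under the current policy's actions.

```python
def offline_phase_length(ope_values, patience=3, min_delta=1e-3):
    """Return how many updates to run before halting.

    Training stops once the held-out signal has failed to improve by at
    least min_delta for `patience` consecutive checks.
    """
    best, stale = float("-inf"), 0
    for step, value in enumerate(ope_values, start=1):
        if value > best + min_delta:
            best, stale = value, 0      # meaningful improvement: reset counter
        else:
            stale += 1
            if stale >= patience:
                return step             # signal has saturated
    return len(ope_values)              # never saturated within the budget

# Made-up validation curve that rises and then flattens.
curve = [0.10, 0.25, 0.40, 0.48, 0.50, 0.50, 0.499, 0.50, 0.51]
stop_at = offline_phase_length(curve)
```

The point of such a rule is that the phase length adapts to the curve instead of being a fixed, hand-tuned number of updates.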

2605.05405 2026-05-21 cs.CV

Zero-Shot Satellite Image Retrieval through Joint Embeddings: Application to Crisis Response

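James Walsh, William Fawcett, Grace Colverd, Raúl Ramos-Pollán

Affiliations: Trillium Technologies; University of Cambridge; Universidad de Antioquia

AI summary: This paper presents GeoQuery, a zero-shot retrieval system that enables natural-language querying at global scale through a two-stage semantic and visual search, sidestepping the paired data and compute that full contrastive training would require. It uses a natural-language embedding of a proxy subset of global data and optimises the description-generation prompt so that text-embedding distances correlate with distances in the frozen CLAY visual-embedding space. On 76 disaster-location queries it achieves 31.6% accuracy within 50 km.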

DOI: 10.56272/a2c9ee39

Abstract

Semantic search of Earth observation archives remains challenging. Visual foundation models such as CLAY produce rich embeddings of satellite imagery but lack the natural-language grounding needed for intuitive query, and full contrastive training of a remote-sensing CLIP-style model requires paired data and compute that are unavailable at global scale. To allow natural language querying at global scales, we present GeoQuery, a zero-shot retrieval system that sidesteps data and compute constraints through a two-stage semantic and visual search, leveraging a natural language embedding of a subset (proxy) of global data. Rather than training a joint encoder, we generate language descriptions for a 100k proxy subset of global Sentinel-2 tiles and optimise the description-generation prompt so that distances in the resulting text-embedding space correlate with distances in the frozen CLAY visual-embedding space. Queries are resolved in two stages, with a text-similarity search over the proxy subset followed by a visual nearest-neighbour search over worldwide CLAY embeddings. On 76 disaster-location queries covering UK floods, US wildfires, and US droughts, GeoQuery achieves 31.6% accuracy within 50 km, with the strongest performance on floods (50% within 50 km) where terrain features are well captured by RGB embeddings. Deployed within a crisis response system called ECHO, GeoQuery identified vulnerable areas during Brisbane's 2025 Cyclone Alfred, with downstream flood simulations reproducing historical patterns. Prompt-aligned proxies offer a practical bridge between EO foundation models and operational retrieval when full contrastive training is out of reach.
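The two-stage query resolution described above can be sketched with plain cosine similarity. Everything concrete here is hypothetical: the tiny 2-D and 3-D vectors, the tile ids, and the function `two_stage_search` are made up for illustration, and a real system would use learned text and CLAY embeddings with an approximate nearest-neighbour index.

```python
import math

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def two_stage_search(query_text_emb, proxy, world, k=1):
    """proxy: list of (text_emb, visual_emb); world: dict tile_id -> visual_emb."""
    # Stage 1: best-matching proxy tile in text-embedding space.
    _, anchor_visual = max(proxy, key=lambda item: cosine(query_text_emb, item[0]))
    # Stage 2: nearest neighbours of that tile's visual embedding worldwide.
    ranked = sorted(world, key=lambda tid: cosine(anchor_visual, world[tid]), reverse=True)
    return ranked[:k]

proxy = [([1.0, 0.0], [0.9, 0.1, 0.0]),   # e.g. a "flooded river plain" description
         ([0.0, 1.0], [0.0, 0.2, 0.9])]   # e.g. a "dry scrubland" description
world = {"tile_a": [1.0, 0.0, 0.0], "tile_b": [0.0, 0.0, 1.0], "tile_c": [0.7, 0.7, 0.0]}
hits = two_stage_search([0.8, 0.2], proxy, world, k=2)
```

The proxy subset is the only place where text and imagery meet, which is why the alignment between text-space and visual-space distances on that subset matters so much.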

2605.03690 2026-05-21 cs.LG cs.AI q-bio.QM

Graph Neural Network based Hierarchy-Aware Embeddings of Knowledge Graphs: Applications to Yeast Phenotype Prediction

Filip Kronström, Alexander H. Gower, Daniel Brunnsåker, Ievgeniia A. Tiukova, Ross D. King

Affiliations: Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg; Department of Life Sciences, Chalmers University of Technology; Department of Industrial Biotechnology, KTH Royal Institute of Technology; Department of Chemical Engineering and Biotechnology, University of Cambridge

AI summary: This paper presents a method that uses graph neural networks with a semantic loss derived from underlying ontologies to produce hierarchy-aware knowledge graph embeddings, applied to yeast phenotype prediction. It demonstrates the approach on predicting the effects of gene deletions and on evaluating knowledge graph revisions.

Abstract

We present a method for finding hierarchy-aware embeddings of knowledge graphs (KGs) using graph neural networks (GNNs) enriched with a semantic loss derived from underlying ontologies. This method yields embeddings that better reflect domain knowledge. To demonstrate their utility, we predict and interpret the effects of gene deletions in the yeast Saccharomyces cerevisiae and learn box embeddings for KGs in the absence of a prediction task. We further show how box embeddings can serve as the basis for evaluating KG revisions. Our yeast KG is constructed from community databases and ontology terms. Low-dimensional box embeddings combined with GNNs are used to predict cell growth for double gene knockouts. Over 10-fold cross validation, these predictions have a mean $R^2$ score of 0.360, significantly higher than baseline comparisons, demonstrating that high-level qualitative knowledge is informative about experimental outcomes. Incorporating semantic loss terms in the training of the models improves their predictive performance ($R^2 = 0.377$) by aligning embeddings with ontology structure. This shows that class hierarchies from ontologies can be exploited for quantitative prediction. We also test the trained models on triple gene knockouts, showing they generalise to data beyond those seen in training. Additionally, by identifying co-occurring relations in the yeast KG important for the cell-growth predictions, we construct hypotheses about interacting traits in yeast. A biological experiment validates one such finding, revealing an association between inositol utilisation and osmotic stress resistance, highlighting the model's potential to guide biological discovery.
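A common way to make box embeddings respect a class hierarchy is to penalize a child box for sticking out of its parent box. The sketch below shows that hinge-style containment penalty as an assumed example; the abstract does not specify the form of the paper's semantic loss, and the function name and the 2-D boxes are hypothetical.

```python
def containment_violation(child, parent):
    """Boxes are (lower, upper) pairs of coordinate lists.

    Returns 0.0 when the child box lies inside the parent box, otherwise
    the summed overshoot across dimensions.
    """
    (c_lo, c_hi), (p_lo, p_hi) = child, parent
    below = sum(max(0.0, pl - cl) for cl, pl in zip(c_lo, p_lo))
    above = sum(max(0.0, ch - ph) for ch, ph in zip(c_hi, p_hi))
    return below + above

# Hypothetical 2-D boxes for an ontology edge "child is-a parent".
parent = ([0.0, 0.0], [4.0, 4.0])
inside = ([1.0, 1.0], [2.0, 3.0])
leaking = ([3.0, -1.0], [5.0, 2.0])
```

Summing such a penalty over all subclass edges of an ontology gives a term that can be added to a prediction loss, so that training pulls the embeddings toward the hierarchy.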

2605.01486 2026-05-21 cs.AI

MAP-Law: Coverage-Driven Retrieval Control for Multi-Turn Legal Consultation

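Qinchuan Cheng, Jiaqi Liu, Ruixuan Xie, Xiaoya Yuan, Yuxin Liu

Affiliations: Xi’an Jiaotong University; Sichuan University; Southwestern University of Finance and Economics; Northeastern University

AI summary: This paper proposes a coverage-driven retrieval-control framework for multi-turn legal consultation. It maintains a structured map over user facts, legal elements, retrieval goals, and retrieved evidence, and uses element coverage, evidence validity coverage, and marginal retrieval gain to decide whether to retrieve, clarify, reformulate, or stop. In a synthetic pilot with fixed legal-element schemas, the approach achieves full measured element coverage.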

Abstract

Legal consultation is inherently iterative: before giving advice, a system must identify relevant legal elements, gather missing facts and authorities, and determine whether the current evidence is sufficient. Existing retrieval-augmented legal agents often use fixed retrieval budgets or single-shot search, making them insensitive to the evolving coverage state of a consultation. This paper introduces a coverage-driven retrieval-control framework for multi-turn legal consultation. The framework maintains a structured map over user facts, legal elements, retrieval goals, and retrieved evidence, and uses element coverage, evidence validity coverage, and marginal retrieval gain to decide whether to retrieve, clarify, reformulate, or stop. On a 50-case synthetic Chinese labor-law consultation pilot with fixed legal-element schemas, a DeepSeek V4-Pro action-selection variant achieves full measured element coverage under the pilot metric while requiring 3.4 retrieval rounds and 7.1 evidence snippets on average. Diagnostic analyses show that model-backed action selection recovers rule-policy failure cases with a small retrieval-budget increase, while forced continuation mainly increases token and latency costs. These results suggest that legal-element coverage is a useful control signal for adaptive legal retrieval, while remaining bounded to retrieval-control behavior under synthetic fixed-schema conditions rather than deployment-level legal correctness.
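The four-way decision the framework makes at each turn can be pictured as a small rule over the three coverage signals. The rule below is a guess made for illustration: the thresholds `cov_target` and `gain_floor`, the priority order of the checks, and the `missing_facts` argument are all assumptions, and the abstract notes that a model-backed selector recovers cases where a rule policy fails.

```python
def next_action(element_cov, evidence_cov, marginal_gain, missing_facts,
                cov_target=1.0, gain_floor=0.05):
    """Pick one of: 'stop', 'clarify', 'reformulate', 'retrieve'."""
    if element_cov >= cov_target and evidence_cov >= cov_target:
        return "stop"          # every legal element is covered by valid evidence
    if missing_facts:
        return "clarify"       # ask the user before spending more retrieval budget
    if marginal_gain < gain_floor:
        return "reformulate"   # the last search added little; rewrite the query
    return "retrieve"          # coverage is incomplete and search is still paying off

# Hypothetical consultation state after two retrieval rounds.
action = next_action(element_cov=0.6, evidence_cov=0.5,
                     marginal_gain=0.3, missing_facts=[])
```

Tying the stop condition to coverage rather than to a fixed number of rounds is what lets the retrieval budget vary from case to case.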

2604.27505 2026-05-21 cs.CV

Leveraging Verifier-Based Reinforcement Learning in Image Editing

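Hanzhong Guo, Jie Wu, Jie Liu, Yu Gao, Zilyu Ye, Linxiao Yuan, Xionghui Wang, Yizhou Yu, Weilin Huang

Affiliations: School of Computing and Data Science, The University of Hong Kong; Center for Embodied AI and Computer Vision, Shenzhen Loop Area Institute

AI summary: This paper proposes Edit-R1, a framework that addresses the lack of a robust reward model for image editing by building a chain-of-thought verifier-based reasoning reward model (RRM). The RRM breaks an instruction into distinct principles and evaluates the edited image against each one to produce a fine-grained reward; experiments show that it surpasses strong VLMs as an editing-specific reward model.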

Abstract

While Reinforcement Learning from Human Feedback (RLHF) has become a pivotal paradigm for text-to-image generation, its application to image editing remains largely unexplored. A key bottleneck is the lack of a robust general reward model for all editing tasks. Existing edit reward models usually give overall scores without detailed checks, ignoring different instruction requirements and causing biased rewards. To address this, we argue that the key is to move from a simple scorer to a reasoning verifier. We introduce Edit-R1, a framework that builds a chain-of-thought (CoT) verifier-based reasoning reward model (RRM) and then leverages it for downstream image editing. The Edit-RRM breaks instructions into distinct principles, evaluates the edited image against each principle, and aggregates these checks into an interpretable, fine-grained reward. To build such an RRM, we first apply supervised fine-tuning (SFT) as a ``cold-start'' to generate CoT reward trajectories. Then, we introduce Group Contrastive Preference Optimization (GCPO), a reinforcement learning algorithm that leverages human pairwise preference data to reinforce our pointwise RRM. After building the RRM, we use GRPO to train editing models with this non-differentiable yet powerful reward model. Extensive experiments demonstrate that our Edit-RRM surpasses powerful VLMs such as Seed-1.5-VL and Seed-1.6-VL as an editing-specific reward model, and we observe a clear scaling trend, with performance consistently improving from 3B to 7B parameters. Moreover, Edit-R1 delivers gains to editing models like FLUX.1-kontext, highlighting its effectiveness in enhancing image editing.
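The flow from per-principle checks to a training signal can be sketched in two steps: aggregate the checks for each edited image into a scalar reward, then standardize rewards within the sampled group, as GRPO does. The aggregation rule (fraction of principles satisfied) and the example principles are assumptions made for illustration; the abstract does not say how Edit-RRM combines its checks.

```python
from statistics import mean, pstdev

def aggregate_reward(checks):
    """checks: dict principle -> bool. Returns the fraction satisfied."""
    return sum(checks.values()) / len(checks)

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one sampled group (GRPO-style)."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical principles for "make the car red, keep the background".
edits = [
    {"car is red": True, "background unchanged": True, "no artifacts": True},
    {"car is red": True, "background unchanged": False, "no artifacts": True},
    {"car is red": False, "background unchanged": True, "no artifacts": False},
]
rewards = [aggregate_reward(c) for c in edits]
advantages = group_relative_advantages(rewards)
```

Because the advantages are computed only from reward values, the reward model never needs to be differentiable, which matches the abstract's description of the RRM as non-differentiable.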

2604.27375 2026-05-21 cs.CV

VeraRetouch: A Lightweight Fully Differentiable Framework for Multi-Task Reasoning Photo Retouching

Yihong Guo, Youwei Lyu, Jiajun Tang, Yizhuo Zhou, Hongliang Wang, Jinwei Chen, Changqing Zou, Qingnan Fan

Affiliations: Zhejiang University; vivo BlueImage Lab; University of Chinese Academy of Sciences; Zhejiang Lab

AI summary: This paper proposes VeraRetouch, a lightweight, fully differentiable framework for multi-task photo retouching. It uses a 0.5B vision-language model and a fully differentiable Retouch Renderer to enable end-to-end pixel-level training, and introduces the AetherRetouch-1M+ dataset and the DAPO-AE reinforcement learning strategy to improve retouching performance and generalization.

Abstract

Reasoning photo retouching has gained significant traction, requiring models to analyze image defects, give reasoning processes, and execute precise retouching enhancements. However, existing approaches often rely on non-differentiable external software, creating optimization barriers and suffering from high parameter redundancy and limited generalization. To address these challenges, we propose VeraRetouch, a lightweight and fully differentiable framework for multi-task photo retouching. We employ a 0.5B Vision-Language Model (VLM) as the central intelligence to formulate retouching plans based on instructions and scene semantics. Furthermore, we develop a fully differentiable Retouch Renderer that replaces external tools, enabling direct end-to-end pixel-level training through decoupled control latents for lighting, global color, and specific color adjustments. To overcome data scarcity, we introduce AetherRetouch-1M+, the first million-scale dataset for professional retouching, constructed via a new inverse degradation workflow. Furthermore, we propose DAPO-AE, a reinforcement learning post-training strategy that enhances autonomous aesthetic cognition. Extensive experiments demonstrate that VeraRetouch achieves state-of-the-art performance across multiple benchmarks while maintaining a significantly smaller footprint, enabling mobile deployment. Our code and models are publicly available at https://github.com/OpenVeraTeam/VeraRetouch.

2604.26052 2026-05-21 cs.CL

From Prompt Risk to Response Risk: Paired Analysis of Safety Behavior of Large Language Model

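Mengya Hu, Qiong Wei, Sandeep Atluri

Affiliations: Microsoft

AI summary: This study conducts a paired analysis of human-labeled prompt and response records across four harm categories (Sexual, Self harm, Hate, and Violence) and ordinal severity levels. It finds that 61% of responses reduce harm relative to the prompt and 3% escalate, and it exposes a tradeoff between helpfulness and harmlessness.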

Abstract

Safety evaluations of large language models (LLMs) typically report binary outcomes, i.e. attack success rate (ASR), refusal rate, or harmful versus safe classification, which hide how risk changes between prompt and response. We present a paired analysis over human labeled prompt and response records across four harm categories (Sexual, Self harm, Hate and Violence) and ordinal severity levels (Safe, Low, Medium, High). 61% of responses reduce harm relative to the prompt, 36% preserve severity, and 3% escalate. The escalation splits into two mechanisms: benign prompts triggering unrequested harmful detail, and answers that stay on task at higher severity than the prompt. Category decomposition shows that Sexual content exhibits the highest harm persistence in this sample, driven by compliance at the same severity rather than drift from benign inputs. Joint relevance analysis exposes a helpfulness versus harmlessness tradeoff: compliance escalations remain highly relevant, whereas safe responses include generic refusals with low relevance. Finally, few-shot LLM graders exhibit a prompt/response detection asymmetry that data calibration does not close. Grader prompts are shared at https://github.com/microsoft/PairedSafety.
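The paired comparison amounts to checking how the ordinal severity moves from prompt to response. The sketch below uses the four levels named in the abstract; the numeric encoding, the function `transition`, and the labelled pairs are made up for illustration and are not drawn from the paper's data.

```python
from collections import Counter

SEVERITY = {"Safe": 0, "Low": 1, "Medium": 2, "High": 3}

def transition(prompt_level, response_level):
    """Compare the ordinal severity of a prompt with that of its response."""
    delta = SEVERITY[response_level] - SEVERITY[prompt_level]
    return "reduce" if delta < 0 else "escalate" if delta > 0 else "preserve"

# Made-up (prompt severity, response severity) pairs, for illustration only.
pairs = [("High", "Safe"), ("Medium", "Medium"), ("Safe", "Low"),
         ("Low", "Safe"), ("High", "Medium")]
counts = Counter(transition(p, r) for p, r in pairs)
```

A binary harmful-versus-safe label would treat the `("High", "Medium")` pair the same as a fully compliant response, whereas the paired view records it as a reduction.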

2604.24697 2026-05-21 cs.AI

Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft

Zhou Ziheng, Huacong Tang, Jinyuan Zhang, Haowei Lin, Bangcheng Yang, Qian Long, Fang Sun, Yizhou Sun, Yitao Liang, Ying Nian Wu, Demetri Terzopoulos, Xiaofeng Gao

Affiliations: University of California, Los Angeles; Peking University; Amazon

AI summary: This paper uses SciCrafter, a Minecraft-based benchmark, to examine whether agents can discover causal regularities and apply them to build functional systems (the discovery-to-application loop). Frontier models plateau at approximately 26% success, and knowledge gap identification is becoming a major hurdle for them.

Comments Preprint, under review. 41 pages. Project page: https://scicrafter-bench.github.io/. Code: https://github.com/scicrafter-bench/scicraft-bench

Abstract

Discovering causal regularities and applying them to build functional systems--the discovery-to-application loop--is a hallmark of general intelligence, yet evaluating this capacity has been hindered by the vast complexity gap between scientific discovery and real-world engineering. We introduce SciCrafter, a Minecraft-based benchmark that operationalizes this loop through parameterized redstone circuit tasks. Agents must ignite lamps in specified patterns (e.g., simultaneously or in timed sequences); scaling target parameters substantially increases construction complexity and required knowledge, forcing genuine discovery rather than reliance on memorized solutions. Evaluating frontier models including GPT-5.2, Gemini-3-Pro, and Claude-Opus-4.5 under a general-purpose code agent scaffold, we find that all plateau at a success rate of approximately 26%. To diagnose these failures, we decompose the loop into four capacities--knowledge gap identification, experimental discovery, knowledge consolidation, and knowledge application--and design targeted interventions whose marginal contributions serve as proxies for the corresponding gaps. Our analysis reveals that although general knowledge application remains the biggest gap across all models, for frontier models knowledge gap identification is becoming a major hurdle--indicating that the bottleneck for current AI is shifting from solving problems right to raising the right problems. We release SciCrafter as a diagnostic probe for future research on AI systems that navigate the full discovery-to-application loop.

2604.21060 2026-05-21 cs.CV

Clinically-Informed Modeling for Pediatric Brain Tumor Classification from Whole-Slide Histopathology Images

Joakim Nguyen, Jian Yu, Jinrui Fang, Nicholas Konz, Tianlong Chen, Sanjay Krishnan, Chandra Krishnan, Ying Ding, Hairong Wang, Ankita Shukla

Affiliations: Dept. of Computer Science; University of Texas at Austin; School of Information; Dell Children's Medical Center; Dept. of OREI; University of Nevada, Reno

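AI summary: This paper proposes a clinically informed contrastive learning framework to improve fine-grained classification of pediatric brain tumor whole-slide images under limited data and class imbalance.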

Comments Accepted at the IEEE International Conference on Healthcare Informatics (ICHI), 2026

Abstract

Accurate diagnosis of pediatric brain tumors, starting with histopathology, presents unique challenges for deep learning, including severe data scarcity, class imbalance, and fine-grained morphologic overlap across diagnostically distinct subtypes. While pathology foundation models have advanced patch-level representation learning, their effective adaptation to weakly supervised pediatric brain tumor classification under limited data remains underexplored. In this work, we introduce an expert-guided contrastive fine-tuning framework for pediatric brain tumor diagnosis from whole-slide images (WSI). Our approach integrates contrastive learning into slide-level multiple instance learning (MIL) to explicitly regularize the geometry of slide-level representations during downstream fine-tuning. We propose both a general supervised contrastive setting and an expert-guided variant that incorporates clinically informed hard negatives targeting diagnostically confusable subtypes. Through comprehensive experiments on pediatric brain tumor WSI classification under realistic low-sample and class-imbalanced conditions, we demonstrate that contrastive fine-tuning yields measurable improvements in fine-grained diagnostic distinctions. Our experimental analyses reveal complementary strengths across different contrastive strategies, with expert-guided hard negatives promoting more compact intra-class representations and improved inter-class separation. This work highlights the importance of explicitly shaping slide-level representations for robust fine-grained classification in data-scarce pediatric pathology settings.

2604.20985 2026-05-21 cs.LG cs.AI cs.CR stat.ML

Differentially Private Model Merging

Qichuan Yin, Manzil Zaheer, Tian Li

Affiliations: The University of Chicago; Google DeepMind

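AI summary: This paper proposes two post-processing techniques, random selection and linear combination, that produce a final private model satisfying any target differential privacy requirement without additional training, and analyzes their privacy/utility tradeoffs on general problems and on private mean estimation.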

Abstract

In machine learning, privacy requirements at inference or deployment time often evolve due to changing policies, regulations, or user preferences. In this work, we aim to construct a multitude of models to satisfy any target differential privacy (DP) requirement without additional training, given a set of existing models trained on the same dataset with different privacy/utility tradeoffs. We propose two post-processing techniques, namely random selection and linear combination, to generate final private models satisfying any target privacy parameter. We provide privacy accounting of these approaches through the lens of Rényi DP and privacy loss distributions on general problems, as well as on private mean estimation, where we precisely characterize the privacy/utility tradeoffs and compare the two mechanisms. Empirically, we demonstrate the effectiveness of our approaches and validate our analyses on several models and both synthetic and real-world datasets.

2604.12239 2026-05-21 cs.CV eess.IV

Physics-Grounded Monocular Vehicle Distance Estimation Using Standardized License Plate Typography

Manognya Lokesh Reddy, Zheng Liu

Affiliations: Department of Computer and Information Science, University of Michigan-Dearborn; Department of Industrial and Manufacturing Systems Engineering, University of Michigan-Dearborn

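AI summary: This paper proposes using the standardized typography of United States license plates as passive markers for vehicle distance estimation. Explicit geometric priors resolve scale ambiguity without training data or active illumination, yielding robust estimates of distance and relative velocity for collision warning.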

Comments 21 pages, 12 figures

Abstract

Accurate inter-vehicle distance estimation is a cornerstone of Advanced Driver Assistance Systems (ADAS) and autonomous driving. While LiDAR and radar provide high precision, their high cost prohibits widespread adoption in mass-market vehicles. Monocular camera-based estimation offers a low-cost alternative but suffers from fundamental scale ambiguity. Recent deep learning methods for monocular depth achieve impressive results yet require expensive supervised training, suffer from domain shift, and produce predictions that are difficult to certify for safety-critical deployment. This paper presents a framework that exploits the standardized typography of United States license plates as passive fiducial markers for metric ranging, resolving scale ambiguity through explicit geometric priors without any training data or active illumination. First, a four-method parallel plate detector achieves robust plate reading across the full automotive lighting range. Second, a three-stage state identification engine fusing optical character recognition text matching, multi-design color scoring, and a lightweight neural network classifier provides robust identification across all ambient conditions. Third, hybrid depth fusion with inverse-variance weighting and online scale alignment, combined with a one-dimensional constant-velocity Kalman filter, delivers smoothed distance, relative velocity, and time-to-collision for collision warning. Baseline validation on a controlled static dataset reproduces a 2.3% coefficient of variation in character height measurements and a 36% reduction in distance-estimate variance compared with plate-width methods from prior work.
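
The geometric idea in this abstract, a known physical size resolving monocular scale, can be sketched under a simple pinhole camera model. This is an illustrative sketch, not the paper's pipeline: the focal length, character height, variances, and closing speed below are placeholder values, and the function names are hypothetical.

```python
def pinhole_distance(focal_px: float, real_height_m: float, pixel_height: float) -> float:
    """Distance from the pinhole relation Z = f * H / h.

    focal_px: focal length in pixels; real_height_m: known physical height
    of the marker; pixel_height: its measured height in the image.
    """
    return focal_px * real_height_m / pixel_height


def inverse_variance_fuse(estimates):
    """Fuse (value, variance) pairs, weighting each value by 1 / variance.

    Returns the fused value and its variance, assuming independent errors.
    """
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    fused = sum(w * value for w, (value, _) in zip(weights, estimates)) / total
    return fused, 1.0 / total


def time_to_collision(distance_m: float, closing_speed_mps: float) -> float:
    """Seconds until contact at constant closing speed; inf if not closing."""
    if closing_speed_mps <= 0:
        return float("inf")
    return distance_m / closing_speed_mps


# Placeholder numbers for illustration only.
z_geom = pinhole_distance(focal_px=1000.0, real_height_m=0.07, pixel_height=7.0)
z_fused, var_fused = inverse_variance_fuse([(z_geom, 1.0), (12.0, 4.0)])
print(z_geom, z_fused, var_fused, time_to_collision(z_fused, 5.2))
```

Inverse-variance weighting gives the lower-variance estimate more influence, so the fused distance stays closer to whichever source is currently more reliable.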
