arXivDaily: arXiv daily academic digest, updated Monday to Friday

2605.19593 2026-05-20 cs.AI cs.DC

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

Mert Yildiz, Pietro Spadaccino, Alexey Rolich, Francesca Cuomo, Andrea Baiocchi

Affiliations: Sapienza University of Rome

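AI summary: This paper empirically studies how different LLMs behave across hardware platforms, focusing on how layer offloading and preemption affect performance. It shows that offloading and preemption have non-linear effects on decode throughput that differ across models and hardware platforms, offering guidance for designing efficient multi-model LLM serving systems.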

Comments: The 2026 Mediterranean Artificial Intelligence and Networking Conference (MAIN 2026)

Abstract

Modern deployments of Large Language Models (LLMs) increasingly require serving multiple models with diverse architectures, sizes, and specialization on shared, heterogeneous hardware. This setting introduces new challenges for resource allocation, dispatching, and scheduling, particularly under GPU memory constraints where partial CPU-GPU offloading and preemption become necessary. While existing systems primarily optimize throughput for a single model, comparatively little work addresses multi-model scheduling under these conditions. In this paper, we present an empirical study of how different LLMs behave across hardware platforms, focusing on the performance implications of layer offloading and preemption. We show that offloading leads to strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU residency. We further demonstrate that preemption incurs substantial overhead, largely dominated by model state reload rather than key-value cache transfer, and that this cost varies significantly across models and hardware platforms. Additionally, we highlight the role of sequence length and interconnect bandwidth in amplifying data movement and execution inefficiencies. Based on these findings, we identify a set of key features that future schedulers must consider, including model-specific offloading sensitivity, workload characteristics, and the cost structure of preemption and data transfer. These insights provide guidance for the design of next-generation LLM serving systems capable of efficiently managing heterogeneous, multi-model workloads with hybrid CPU-GPU execution.
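
The claim that preemption cost is dominated by model state reload rather than key-value cache transfer can be made plausible with a back-of-the-envelope estimate. The sketch below is not from the paper: the 8B-parameter FP16 model, the attention dimensions, the 4096-token sequence, and the 25 GB/s link are all illustrative assumptions, and transfer time is idealized as bytes divided by bandwidth.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    """Size of the key-value cache for one sequence."""
    # Factor 2 covers keys and values, stored per layer, KV head, and token.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def transfer_seconds(n_bytes, bandwidth_gb_per_s):
    """Idealized transfer time: bytes divided by sustained bandwidth."""
    return n_bytes / (bandwidth_gb_per_s * 1e9)


# Hypothetical configuration (assumed for illustration, not from the paper).
weight_bytes = 8e9 * 2  # 8B parameters at 2 bytes each
kv_bytes = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
bandwidth = 25.0  # GB/s, assumed sustained host-to-device bandwidth

reload_s = transfer_seconds(weight_bytes, bandwidth)
kv_s = transfer_seconds(kv_bytes, bandwidth)
print(f"weight reload: {reload_s:.3f} s, KV-cache transfer: {kv_s:.3f} s")
```

Under these assumptions the weights are roughly thirty times larger than the cache of a single sequence. The gap narrows as sequence length or batch size grows, which is consistent with the abstract's remark that sequence length amplifies data movement.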

2605.19592 2026-05-20 cs.RO cs.AI

Implicit Action Chunking for Smooth Continuous Control

Bosun Liang, Shuo Pei, Zirui Chen, Chuanzhi Fan, Chen Sun, Yuankai Wu, Huachun Tan, Yong Wang

Affiliations: Department of Data and Systems Engineering, The University of Hong Kong, Hong Kong SAR, China; Beijing Institute of Technology, Zhuhai, China; College of Computer Science, Sichuan University, Chengdu, China

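AI summary: This paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Its dual-window design ensures physical smoothness and consistent temporal-difference targets without expanding the action space, addressing the optimization difficulties of explicit action chunking and its incompatibility with standard step-wise interaction.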

Abstract

Reinforcement learning often produces high-frequency oscillatory control signals that undermine the safety and stability required for physical deployment. Explicit action chunking addresses this by predicting fixed-horizon trajectories but scales the policy output dimension proportionally with the horizon length, leading to optimization difficulties and incompatibility with standard step-wise interaction. To overcome these challenges, this paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Unlike explicit methods, DWS enforces temporal coherence without expanding the action space. It uses a dual-window design: an execution window that ensures physical smoothness through deterministic modulation, and a value window that aligns temporal-difference targets over the horizon to correct critic bias caused by open-loop execution. DWS also includes a lightweight actor-side temporal regularizer based on first-order action differences to promote global continuity. This design effectively bridges the gap between temporal abstraction and reactive step-wise control. Experiments on benchmarks including the DeepMind Control Suite and industrial energy management tasks show that DWS outperforms state-of-the-art (SOTA) baselines. In complex vision-based autonomous driving tasks, DWS achieves smoother control and safer behavior with reduced jitter, and attains a 100% success rate.
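
A regularizer "based on first-order action differences" can be pictured as a penalty on how much consecutive actions change. The abstract does not give the exact form used in DWS, so the function below is only one plausible instance (mean squared difference between consecutive action vectors); its name and the example trajectories are hypothetical.

```python
def first_order_smoothness_penalty(actions):
    """Mean squared difference between consecutive action vectors.

    `actions` is a list of equal-length lists, one action vector per time step.
    This is an assumed form of the penalty, not the paper's definition.
    """
    if len(actions) < 2:
        return 0.0
    total = 0.0
    for prev, curr in zip(actions, actions[1:]):
        total += sum((c - p) ** 2 for p, c in zip(prev, curr))
    return total / (len(actions) - 1)


smooth = [[0.0], [0.1], [0.2]]   # slowly ramping control signal
jittery = [[0.0], [1.0], [0.0]]  # oscillating control signal
print(first_order_smoothness_penalty(smooth))
print(first_order_smoothness_penalty(jittery))
```

Adding such a term to the actor loss penalizes the oscillating trajectory far more than the ramp, which is the behavior the abstract attributes to the temporal regularizer.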

2605.19589 2026-05-20 cs.LG physics.flu-dyn

Physics-Informed Graph Neural Network Surrogates for Turbulent Nanoparticle Dispersion in Dental Clinical Environments

Takshak Shende, Viktor Popov

Affiliations: Department of Mechanical Engineering, University College London (UCL); Ascend Technologies Ltd

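AI summary: This paper proposes a physics-informed graph neural network surrogate for predicting turbulent nanoparticle dispersion in dental clinical environments, improving computational efficiency and accuracy through an enhanced graph network combined with physical models.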

Comments: 40 pages, 12 figures

Abstract

Dental aerosol procedures produce sub-50 micrometre nuclei that can remain airborne for long periods in enclosed clinics, creating pathways for airborne pathogen transmission. Reynolds-Averaged Navier-Stokes (RANS) simulations with Euler-Lagrange particle tracking capture this transport accurately but require very long run times per scenario, which precludes real-time clinical decision support in 3D. We present the Eulerian-Lagrangian Graph Interaction Network (ELGIN), a physics-informed graph surrogate that jointly predicts carrier-flow dynamics on the OpenFOAM polyhedral mesh and the per-parcel motion of the polydisperse spray cloud. ELGIN couples a multi-head Graph Transformer with Jacobi-preconditioned learnable pressure projection and a turbulence-closure head to a sigmoid-gated Lagrangian Interaction Network through differentiable inverse-distance mesh-parcel coupling, and advances parcels with a symplectic Störmer-Verlet integrator. A four-stage physics-informed curriculum stabilises 260-step autoregressive rollouts without gradient explosion. A parameter sweep with foam-extend 4.1 OpenFOAM reactingParcelFoam across clinically relevant ventilation rates and handpiece spray speeds provides CFD ground truth. This article reports a single-case demonstration in which both ELGIN and a Lagrangian-only baseline (M0) are trained and evaluated on Sweep_Case_03 of a twenty-case sweep; full 16/2/2 retraining is in progress and will replace all reported metrics. On this case, ELGIN tracks the foam-extend particle cloud more closely than M0: mean parcel displacement error falls from 19.56% to 16.20% of room width and cloud radius-of-gyration error from 9.85% to 6.58%. A 26-second rollout completes in ~64 s on a 4 GB GPU, approximately 37x faster than the foam-extend reference pipeline, toward per-appointment infection-risk screening once the multi-case checkpoint is in place.
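
The Störmer-Verlet scheme mentioned in the abstract is a standard symplectic integrator. A minimal sketch of one step in its velocity-Verlet form follows, assuming a scalar position and an acceleration that depends only on position. In ELGIN the parcel acceleration presumably comes from the learned interaction network and the carrier flow, which this toy example does not model; the harmonic-oscillator test case is an illustrative assumption.

```python
def verlet_step(x, v, accel, dt):
    """Advance position x and velocity v by one velocity-Verlet step."""
    v_half = v + 0.5 * dt * accel(x)          # half kick with the old acceleration
    x_new = x + dt * v_half                   # full drift
    v_new = v_half + 0.5 * dt * accel(x_new)  # half kick with the new acceleration
    return x_new, v_new


# Toy check on a unit harmonic oscillator (acceleration = -x).
x, v = 1.0, 0.0
for _ in range(1000):
    x, v = verlet_step(x, v, lambda p: -p, 0.1)
energy = 0.5 * (x * x + v * v)
print(f"energy after 1000 steps: {energy:.5f} (initial 0.5)")
```

The energy stays close to its initial value over many steps instead of drifting, which is the property that makes symplectic schemes attractive for long autoregressive rollouts.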

2605.19587 2026-05-20 cs.AI

SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

Puyi Wang, Yuhao Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin, Yangguang Li, Yu Cheng

Affiliations: The Chinese University of Hong Kong; Shanghai Jiao Tong University; Shanghai AI Laboratory; Microsoft; University of Oxford

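AI summary: This paper proposes SceneCode, which generates editable indoor scenes through executable programs, addressing the limited control over object structure in existing methods and improving the precision and interactivity of scene generation.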

Abstract

Indoor scene synthesis underpins embodied AI, robotic manipulation, and simulation-based policy evaluation, where a useful scene must specify not only what the environment looks like, but also how its objects are structured. Existing pipelines, however, typically represent generated content as static meshes and inherit articulation only from curated asset libraries, which limits object-level controllability and prevents new interactable assets from being produced on demand. We address this gap by formulating physically interactable indoor scene synthesis as programmatic world generation, and present SceneCode, a framework that compiles a natural language prompt into an executable, code-driven indoor world rather than a collection of opaque meshes. A room-level agentic backbone first turns the prompt into a structured house layout and emits per-object AssetRequests through a planner-designer-critic loop. Each request is then routed to one of five code-generation strategies and converted into synthesized part-wise Blender Python programs that are validated through an execution-guided repair-and-refine loop. The resulting programs are compiled into simulation-ready assets and exported as SDF for physics simulation. A persistent scene-state registry links object requests, executable programs, rendered geometry, and simulation assets, turning scene assembly into a traceable and locally editable world-building process. We evaluate SceneCode across scene-level synthesis, object-level asset quality, human judgment, and downstream robot interaction. Results show that executable world programs improve prompt-faithful indoor scene generation and produce assets with cleaner mesh structure and simulator-loadable articulation metadata. Project page: https://scene-code.github.io/.

2605.19584 2026-05-20 cs.LG stat.ML

Online Market Making and the Value of Observing the Order Book

Davide Maran, Marcello Restelli

Affiliations: Politecnico di Milano

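AI summary: This paper studies an online market-making problem in which a learner sequentially posts bid and ask prices while interacting with traders holding private valuations. Unlike existing online learning formulations that assume fully censored feedback, it introduces an action-dependent feedback model inspired by real limit order books and shows that this additional information fundamentally changes the learnability of the problem. In the stochastic setting, an elimination-based algorithm achieves O(√T) regret with high probability without any smoothness assumptions on the distribution of trader valuations. The result extends to a broad class of mean-reverting price processes, under both local autoregressive dynamics and a weaker global drift condition based on cumulative deviations from the mean; under either assumption, high-probability O(√T) regret bounds hold, relying on a new concentration inequality of independent interest. In the adversarial setting, an explore-then-perturb algorithm guarantees O(T^{2/3}) regret in expectation.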

Comments: Accepted at COLT 2026

Abstract

We study an online market-making problem in which a learner sequentially posts bid and ask prices for a single asset while interacting with traders holding private valuations. Unlike existing online learning formulations that assume fully censored feedback, we introduce an action-dependent feedback model inspired by real limit order books: when a trade occurs, the trader's valuation remains hidden, whereas when no trade occurs, informative feedback about supply and demand is revealed. We show that this additional information fundamentally changes the learnability of the problem. In the stochastic setting with i.i.d. market prices, we propose an elimination-based algorithm that achieves $O(\sqrt T)$ regret with high probability, without requiring any smoothness assumptions on the distribution of trader valuations. We then extend this result to a broad class of mean-reverting price processes by considering both local, autoregressive dynamics and a weaker global drift condition based on cumulative deviations from the mean. Under either assumption, we establish high-probability $O(\sqrt T)$ regret bounds, relying on a new concentration inequality of independent interest. Finally, in the adversarial setting with oblivious prices, we design an explore-then-perturb algorithm that guarantees $O(T^{2/3})$ regret in expectation. Our results quantify the value of observing the order book in online market making and demonstrate that even limited, action-dependent feedback can substantially improve regret guarantees compared to standard bandit feedback models.
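
To get a feel for the gap between the two regret rates quoted above, the snippet below evaluates the leading terms √T and T^(2/3) for two horizons. It ignores the constants and logarithmic factors hidden by the O-notation, so it only illustrates scaling and is not a statement about the paper's actual bounds; the function name is hypothetical.

```python
def regret_rates(horizon):
    """Return (sqrt(T), T^(2/3)) for T rounds, with all constants ignored."""
    return horizon ** 0.5, horizon ** (2.0 / 3.0)


for T in (10**3, 10**6):
    stochastic, adversarial = regret_rates(T)
    print(f"T={T}: sqrt(T)~{stochastic:.0f}, T^(2/3)~{adversarial:.0f}")
```

At a million rounds the adversarial rate is about ten times the stochastic one, and the ratio grows as T^(1/6).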

2605.19580 2026-05-20 cs.RO

PAPO-VLA: Planning-Aware Policy Optimization for Vision-Language-Action Models

Peizheng Guo, Jingyao Wang, Changwen Zheng, Wenwen Qiang

Affiliations: Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences

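AI summary: This paper proposes PAPO-VLA, a planning-aware policy optimization method for vision-language-action (VLA) models that identifies planning actions and emphasizes them during optimization to improve the reliability of VLA policies.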

Abstract

Vision-Language-Action (VLA) models show promising ability in language-guided robotic tasks. However, making VLA policies reliable remains challenging, because a manipulation task is completed through closed-loop interaction, where each action affects subsequent execution. To analyze this problem, we revisit VLA policy during execution and argue that a VLA policy acts both as a planner, which makes task-oriented decisions that change the direction of execution, and as an executor, which realizes these decisions through dense continuous actions. This view suggests that improving VLA reliability requires particular attention to planning actions. Existing optimization methods can imitate actions or improve complete trajectories, but they usually do not explicitly identify planning actions or measure their importance for task success. To address this issue, we propose Planning-Aware Policy Optimization for VLA models (PAPO-VLA). PAPO-VLA first identifies planning actions by jointly considering action variation and trajectory outcome, then estimates their importance through causal sufficiency and causal necessity, and finally incorporates this importance into GRPO advantage estimation. In this way, more important planning actions receive stronger optimization emphasis, while the whole trajectory is still optimized by trajectory-level feedback. Experiments on multiple benchmarks demonstrate the effectiveness of PAPO-VLA.
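
The idea of folding per-action importance into GRPO advantage estimation can be sketched as follows. The first function is the usual group-relative advantage (reward minus group mean, divided by group standard deviation; the population standard deviation is used here, and implementations differ on this detail). The second function is purely hypothetical: the abstract does not say how importance enters the advantage, so scaling by `1 + importance` is an assumption chosen for illustration.

```python
import statistics


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages for one group of sampled trajectories."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def weight_steps(advantage, importance):
    """Hypothetical per-step weighting: planning steps with higher importance
    receive a proportionally larger share of the trajectory-level advantage."""
    return [advantage * (1.0 + w) for w in importance]


advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# Steps 0 and 2 are treated as executor actions, step 1 as a planning action.
step_advantages = weight_steps(advantages[0], [0.0, 0.5, 0.0])
print(advantages, step_advantages)
```

Every step still inherits the trajectory-level signal, and the planning step is merely emphasized, which matches the abstract's description of trajectory-level feedback with stronger emphasis on important planning actions.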

2605.19577 2026-05-20 cs.CL

GoLongRL: Capability-Oriented Long Context Reinforcement Learning with Multitask Alignment

Minxuan Lv, Tiehua Mei, Tanlong Du, Junmin Chen, Zhenpeng Su, Ziyang Chen, Ziqi Wang, Zhennan Wu, Ruotong Pan, Jian Liang, Ruiming Tang, Han Li

Affiliations: Kuaishou Technology; University of Chinese Academy of Sciences

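AI summary: This paper presents GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, which leads to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. The contributions are: (1) capability-oriented data construction with full open release, comprising a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric, and combines curated open-source samples from established corpora with synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, the dataset alone outperforms the closed-source QwenLong-L1.5 dataset, and a Qwen3-30B-A3B model trained on it delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability. (2) TMN-Reweight for heterogeneous multitask optimization, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.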

Abstract

We present GoLongRL, a fully open-source, capability-oriented post-training recipe for long-context reinforcement learning with verifiable rewards (RLVR). Existing long-context RL methods often treat data construction as a matter of designing increasingly complex retrieval paths, leading to homogeneous task coverage and reward formulations that inadequately reflect practical long-context requirements. Our work offers two contributions. (1) Capability-oriented data construction with full open release. We openly release a dataset of 23K RLVR samples, the complete construction pipeline, and all training code. Guided by a taxonomy of long-context capabilities, the dataset spans 9 task types, each paired with its natural evaluation metric. It comprises curated open-source samples from established corpora and synthetic samples whose QA pairs are generated from real source documents such as books, academic papers, and multi-turn dialogues. Under the same vanilla GRPO setup, our dataset alone outperforms the closed-source QwenLong-L1.5 dataset. Moreover, our Qwen3-30B-A3B model trained on this data delivers long-context performance comparable to DeepSeek-R1-0528 and Qwen3-235B-A22B-Thinking-2507, suggesting that broader coverage and greater reward diversity substantially benefit long-context capability improvement. (2) TMN-Reweight for heterogeneous multitask optimization. To address optimization challenges from heterogeneous rewards, we propose TMN-Reweight, which combines task-level mean normalization for cross-task reward scale alignment with difficulty-adaptive weighting for more reliable advantage estimation. TMN-Reweight further improves average performance over vanilla GRPO, with general capabilities preserved or improved across reported evaluations.

2605.19576 2026-05-20 cs.AI cs.CL cs.SE

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

Affiliations: AWS Generative AI Innovation Center; HSBC Holdings Plc., HSBC Technology Center, China

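AI summary: This paper studies library drift, a silent failure mode in self-evolving LLM skill libraries. Through a reproducible trigger experiment, fine-grained diagnostics, and a verified fix, it shows that unmanaged skill accumulation causes retrieval degradation, false-positive injections, and performance stagnation, and it proposes a verified remedy that substantially improves skill-library performance.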

Abstract

Self-evolving skill libraries face a silent failure mode we term library drift: unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom (LLM-authored skills deliver +0.0pp gain while human-curated ones deliver +16.2pp on SkillsBench), yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift, where one disables skill injection (flat floor, +0.002) and one imposes premature retirement (active harm, -0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain +0.328) on MBPP+ hard-100 over 100 rounds. Eight ablations decompose which governance mechanisms are load-bearing and which are subsumed, providing a concrete playbook for diagnosing library drift in any self-evolving agent.
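
Two ingredients of the governance recipe, outcome-driven retirement and a bounded active cap, can be sketched in a few lines. The abstract does not specify how contribution scores are computed or which thresholds are used, so the function name, the zero retirement threshold, and the example scores below are all hypothetical.

```python
def select_active_skills(contribution, active_cap, retire_below=0.0):
    """Pick the skills that remain active after one governance round.

    contribution: dict mapping skill name to its measured contribution score
    active_cap:   maximum number of skills that may stay active
    retire_below: skills scoring at or below this value are retired
    """
    # Outcome-driven retirement: drop skills that have not helped.
    survivors = {name: score for name, score in contribution.items()
                 if score > retire_below}
    # Bounded active cap: keep only the highest-scoring survivors.
    ranked = sorted(survivors, key=lambda name: survivors[name], reverse=True)
    return ranked[:active_cap]


scores = {"a": 0.30, "b": -0.10, "c": 0.05, "d": 0.20}
active = select_active_skills(scores, active_cap=2)
print(active)
```

Here skill `b` is retired for hurting outcomes and skill `c` is excluded by the cap, so the library cannot grow without bound even when new skills keep being authored.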

2605.19568 2026-05-20 cs.CL

m3BERT: A Modern, Multi-lingual, Matryoshka Bidirectional Encoder

Yaoxiang Wang, Simiao Zuo, Qingguo Hu, Yucheng Ding, Yeyun Gong, Jian Jiao, Jinsong Su

Affiliations: Xiamen University; Microsoft; Shanghai Jiao Tong University

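AI summary: This paper proposes m3BERT, a modern, multilingual Matryoshka bidirectional encoder that jointly optimizes representations across transformer layers and multiple embedding dimensions, addressing the poor adaptability of existing pretrained models to different deployment scenarios and demonstrating efficiency and practicality in industrial retrieval.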

Comments: KDD 2026

Abstract

Embedding models are pivotal in industrial information retrieval systems like search and advertising. However, existing pretrained models often exhibit fixed architectures and embedding dimensionalities, posing significant challenges when adapting them to diverse deployment scenarios with varying business-driven constraints. A common practice involves fine-tuning with partial parameter initialization from larger pretrained models for resource-constrained tasks. This method is often suboptimal as the misalignment between pretraining and downstream usage prevents full realization of pretraining benefits. To address this limitation, we introduce m3BERT: a Modern, Multi-lingual, Matryoshka Bidirectional Encoder, which features a novel pretraining strategy that jointly optimizes representations across both transformer layers and multiple embedding dimensions. This enables a single model to be tailored to varied resource and accuracy targets while maintaining consistency with pretraining. Incorporating recent architectural improvements, m3BERT uses a three-stage pretraining: monolingual pretraining, multilingual adaptation to serve diverse user bases, and crucial continual pretraining on a massive web domain corpus to enhance utility in commercial retrieval. m3BERT significantly outperforms state-of-the-art embedding models in Bing-Click, a large-scale industrial retrieval dataset, showcasing its practical versatility as an efficient foundation for resource-aware industrial retrieval systems. Further experiments on public datasets also confirm the general effectiveness of our multigranular Matryoshka pretraining strategy.
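
At inference time, Matryoshka-style embeddings are typically consumed by keeping a prefix of the full vector and renormalizing it, so one model can serve several dimension budgets. The sketch below shows only that generic usage pattern; it is not m3BERT's API, and the function name and toy vector are assumptions. Truncation is only meaningful for models trained with a Matryoshka objective.

```python
import math


def truncate_embedding(vec, dim):
    """Keep the first `dim` coordinates and rescale to unit length."""
    head = list(vec[:dim])
    norm = math.sqrt(sum(x * x for x in head))
    if norm == 0.0:
        return head
    return [x / norm for x in head]


full = [3.0, 4.0, 12.0]          # toy stand-in for a full-size embedding
small = truncate_embedding(full, 2)
print(small)
```

A retrieval system with a tight memory or latency budget would index the truncated vectors, while a higher-accuracy tier could index the full ones produced by the same model.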

2605.19562 2026-05-20 cs.RO cs.LG math.OC

Learning-Accelerated Optimization-based Trajectory Planning for Cooperative Aerial-Ground Handover Missions

Jingshan Chen, Bochen Yu, Henrik Ebel, Peter Eberhard

Affiliations: Institute of Engineering and Computational Mechanics, University of Stuttgart, 70569 Stuttgart, Germany; Mechanical Engineering, LUT University, 53850 Lappeenranta, Finland

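AI summary: This paper proposes a learning-augmented trajectory planning framework for cooperative handover missions between unmanned aerial and ground vehicles. Decoupled encoder-decoder LSTM networks generate coordinated handover trajectory predictions that accelerate the optimization, yielding faster convergence and a higher optimization success rate.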

Comments: Preprint of a contribution accepted for publication in the RoManSy 2026 Springer proceedings

Abstract

This paper presents a learning-augmented trajectory planning framework for cooperative unmanned aerial vehicle (UAV) and unmanned ground vehicle (UGV) handover missions. While centralized trajectory optimization ensures dynamic feasibility and task optimality, its high computational cost limits real-time applicability. We propose a neural surrogate planner utilizing decoupled encoder-decoder long short-term memory (LSTM) networks to generate coordinated handover trajectory predictions from the task specifications. These predictions serve as informed warm starts for the downstream centralized optimizer, thereby accelerating convergence to dynamically feasible solutions. Benchmark evaluations demonstrate that the learning-augmented planning framework achieves more than a threefold speedup and 100% optimization success rate compared to cold start optimization. The results indicate that combining data-driven inference with model-based refinement enables fast and reliable trajectory generation for heterogeneous multi-robot systems.

2605.19561 2026-05-20 cs.LG cs.AI

TORQ: Two-Level Orthogonal Rotation for MXFP4 Quantization

Zukang Xu, Xing Hu, Dawei Yang

Affiliations: Open Compute Project

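AI summary: This paper proposes the TORQ framework, which reshapes the geometric properties of the activation space through optimized coordinate transformations to address the accuracy degradation of MXFP4 activation quantization, substantially improving quantization accuracy.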

Comments: 17 pages, 4 figures, 13 tables

Abstract

As Large Language Models (LLMs) advance toward practical deployment, the Microscaling FP4 (MXFP4) format has emerged as a cornerstone for next-generation low-bit inference, owing to its ability to balance high dynamic range with hardware efficiency. However, directly applying MXFP4 to LLM activation quantization inevitably leads to significant accuracy degradation. In this paper, we theoretically analyze the error structure of MXFP4 activation quantization, revealing that the root cause of this performance drop lies in two structural imbalances between activation distributions and the MXFP4 block floating-point format: (1) extreme inter-block variance imbalance and (2) intra-block codebook utilization imbalance. To address these challenges, we propose TORQ (Two-level Orthogonal Rotation for MXFP4 Quantization), a training-free Post-Training Quantization (PTQ) framework designed to reshape the geometric properties of the activation space through optimal coordinate transformations. At the macroscopic level, TORQ leverages the Schur-Horn theorem to redistribute activation energy via inter-block orthogonal rotation, preventing high-variance blocks from driving up shared scaling factors and thereby preserving the precision of small-magnitude elements. At the microscopic level, TORQ employs maximum-entropy-guided intra-block rotation to alleviate codebook collapse and maximize the MXFP4 codebook's information capacity. Experiments on mainstream LLMs such as LLaMA3 and Qwen3 show that TORQ significantly improves the accuracy of MXFP4 activation quantization compared to existing methods: on Qwen3-32B, the perplexity on WikiText is reduced to 8.43 (vs. 7.61 for BF16), and the average accuracy increases from 38.40% with direct RTN to 73.63% (vs. 74.82% for BF16), substantially narrowing the gap between 4-bit floating-point quantization and full-precision inference.
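
The problem of a high-variance block "driving up shared scaling factors" can be reproduced with a toy MXFP4-style quantizer. The sketch assumes the commonly described layout of the format (32-element blocks, one shared power-of-two scale, FP4 E2M1 elements with magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) and plain round-to-nearest. It is a simplified illustration rather than a bit-exact implementation, and it does not implement TORQ's rotations.

```python
import math

# Representable magnitudes of an FP4 E2M1 element.
FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def quantize_block(block):
    """Quantize one block with a shared power-of-two scale, then dequantize."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    # Shared exponent: align the block maximum with E2M1's top binade (2^2).
    scale = 2.0 ** (math.floor(math.log2(amax)) - 2)
    out = []
    for x in block:
        mag = min(abs(x) / scale, 6.0)                      # clip to the FP4 range
        q = min(FP4_E2M1, key=lambda g: abs(g - mag))       # round to nearest
        out.append(math.copysign(q * scale, x))
    return out


calm = [0.1] * 32             # block without outliers
spiky = [100.0] + [0.1] * 31  # same block with one large outlier
print(quantize_block(calm)[0])     # 0.09375
print(quantize_block(spiky)[:2])   # [96.0, 0.0]
```

With the outlier present, the shared scale rises from 2^-6 to 2^4 and every small element is flushed to zero. This is the loss of small-magnitude precision that the abstract says the inter-block rotation is designed to prevent.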

2605.19559 2026-05-20 cs.CV cs.AI

EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs

Yang Dai, Dian Jiao, Tianwei Lin, Wenqiao Zhang

Affiliations: Zhejiang University

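AI summary: This paper proposes EgoCoT-Bench, a benchmark for evaluating the fine-grained, operation-centric reasoning of MLLMs from a first-person perspective. It contains 3,172 verifiable QA pairs covering perception, anticipation, and high-level reasoning tasks, and targets the shortcomings of existing benchmarks in fine-grained reasoning and evidence verification.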

Abstract

The rapid development of Multimodal Large Language Models (MLLMs) has led to growing interest in egocentric video understanding, specifically the ability of MLLMs to recognize fine-grained hand-object interactions, track object state changes over time, and reason about manipulative processes in dynamic environments from a first-person perspective. However, existing egocentric video benchmarks suffer from limited grounded rationale evaluation, offering limited support for fine-grained operation-centric reasoning and rarely examining whether model rationales are grounded in explicit spatio-temporal evidence. To address this gap, we introduce EgoCoT-Bench, a fine-grained egocentric benchmark for grounded and verifiable operation-centric reasoning with explicit step-by-step rationale annotations. Overall, EgoCoT-Bench comprises 3,172 verifiable QA pairs over 351 egocentric videos separated into four task groups for a total of 12 sub-task groups, encompassing perception and retrospection, anticipation, and high-level reasoning. The benchmark is constructed through a spatio-temporal scene graph (STSG)-guided generation framework and is further refined by human annotators to ensure correctness, egocentric relevance, and fine-grained quality. Experimental results show continuing difficulties with egocentric fine-grained reasoning and further reveal that many multimodal models produce explanations whose answers are correct but whose cited evidence is inconsistent with the answer. We hope EgoCoT-Bench can serve as a useful testbed for grounded and verifiable reasoning in egocentric video understanding. Project page and supplementary materials are available at: https://dstardust.github.io/EgoCoT/.

2605.19556 2026-05-20 cs.CV

EpiDiffVO: Geometry-Aware Epipolar Diffusion for Robust Visual Odometry

EpiDiffVO:面向鲁棒视觉里程计的几何感知对极扩散

Prateeth Rao

发表机构 * International Institute of Information Technology Bangalore(国际信息科技学院班加罗尔)

AI总结 本文提出了一种稀疏对极匹配框架,通过优化几何一致性来减少冗余,并结合对极扩散过程和图神经网络实现高效的视觉里程计。

Comments 8 pages, 5 figures, in revision to be submitted to IEEE RA-L

详情
AI中文摘要

从图像对中估计相对姿态本质上只需要一组几何上一致的对应点的最小子集。然而,大多数基于学习的方法依赖于密集匹配或直接回归,导致冗余并降低几何可解释性。在本工作中,我们提出了一种稀疏对极匹配框架,预测一组紧凑的对应点,并针对不同时间基线下的几何一致性进行优化。为了处理残余噪声和未对齐问题,我们引入了对极扩散过程,该过程建模对应点的不确定性,并将关键点朝对极一致性方向细化。经过细化的对应点与深度线索一起被提升为图表示,形成一个Steiner图,用于编码点之间的关系结构。图神经网络学习出一组紧凑的、信息量大的对应点,并将其传递给可微的奇异值分解求解器进行端到端的几何估计。相对姿态从得到的本质矩阵中恢复,并在TartanAir和KITTI SLAM数据集上以视觉里程计的设置进行评估。实验结果表明,结合稀疏匹配、基于扩散的细化和基于图的子集选择可以减少对应点的冗余,同时在具有挑战性的基线下保持稳健的姿态估计。

英文摘要

Estimating relative pose from image pairs fundamentally requires only a minimal subset of geometrically consistent correspondences. However, most learning-based approaches rely on dense matching or direct regression, leading to redundancy and reduced geometric interpretability. In this work, we propose a sparse epipolar matching framework that predicts a compact set of correspondences optimized for geometric consistency across varying temporal baselines. To address residual noise and misalignment, we introduce an epipolar diffusion process that models correspondence uncertainty and refines keypoints toward epipolar consistency. The refined correspondences, along with depth cues, are lifted into a graph representation forming a Steiner graph that encodes relational structure between points. A graph neural network learns a compact subset of informative correspondences, which are passed to a differentiable singular value decomposition solver for end-to-end geometric estimation. Relative pose is recovered from the resulting essential matrix and evaluated in a visual odometry setting on the TartanAir and KITTI SLAM datasets. Experimental results demonstrate that combining sparse matching, diffusion-based refinement, and graph-based subset selection reduces correspondence redundancy while maintaining robust pose estimation across challenging baselines.
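The epipolar consistency that the abstract refers to can be illustrated with a small sketch. The code below is not the paper's pipeline; it only assumes the standard convention that camera-2 coordinates are `X2 = R @ X1 + t`, builds the essential matrix `E = [t]_x R` from a synthetic pose, and checks that normalized correspondences satisfy `x2^T E x1 = 0`. All function and variable names are hypothetical.

```python
import numpy as np

def skew(t):
    """Return the 3x3 cross-product matrix [t]_x so that skew(t) @ v == np.cross(t, v)."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def essential_from_pose(R, t):
    """Essential matrix for the convention X2 = R @ X1 + t (an assumed convention)."""
    return skew(t) @ R

def epipolar_residuals(E, x1, x2):
    """Per-correspondence residual x2_i^T E x1_i for (N, 3) normalized homogeneous points."""
    return np.einsum("ni,ij,nj->n", x2, E, x1)

# Synthetic relative pose: a small rotation about the y-axis plus a translation.
angle = 0.1
R = np.array([[np.cos(angle), 0.0, np.sin(angle)],
              [0.0, 1.0, 0.0],
              [-np.sin(angle), 0.0, np.cos(angle)]])
t = np.array([1.0, 0.0, 0.2])

# Random 3D points in front of camera 1, expressed in both camera frames.
rng = np.random.default_rng(0)
P1 = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(20, 3))
P2 = P1 @ R.T + t

# Normalized image coordinates (divide by depth).
x1 = P1 / P1[:, 2:3]
x2 = P2 / P2[:, 2:3]

E = essential_from_pose(R, t)
residuals = epipolar_residuals(E, x1, x2)
singular_values = np.linalg.svd(E, compute_uv=False)
```

For noise-free correspondences the residuals vanish, and `E` has two equal singular values and one zero singular value, which is the structure an SVD-based solver can enforce.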

2605.19554 2026-05-20 cs.CV

Self-Creative Text-to-Object Generation using Semantic-Aware Spatial Weighting

基于语义感知空间加权的自创文本到物体生成

Yue Yu, Haibo Chen, Shuo Chen, Jian Yang, Jun Li

发表机构 * Nanjing University of Science and Technology(南京理工大学)

AI总结 本文提出了一种自创扩散模型SCDiff,通过学习空间加权模块和视觉-语义混合损失模块,提升文本到图像生成的创意性和语义对齐性。

详情
AI中文摘要

在文本到图像(T2I)生成中注入创造力是一个重大挑战,因为合成图像不仅要具有视觉新颖性和惊喜,还应具有艺术价值。然而,当前T2I模型主要优化于字面文本-图像对齐,其噪声预测网络限制生成到高概率区域,导致生成结果缺乏真实创造力。为此,我们提出了一种自创扩散(SCDiff)模型,用于有意义的T2I生成,包含两个核心模块:可学习的空间加权(LSW)模块和视觉-语义混合损失(VSML)。LSW模块设计了一个参数化的Kaiser-Bessel窗,以强化中心图像特征,促进新颖和令人惊讶的生成。VSML模块引入了双重损失函数:相似性损失约束新图像与文本描述对齐,而多样性损失最大化其与原始图像的区别,从而增强语义价值和视觉新颖性。大量实验表明,我们的模型显著提高了创造力、语义对齐性和视觉一致性,提供了一个简单但强大的框架用于生成创意物体。

英文摘要

Instilling creativity in text-to-image (T2I) generation presents a significant challenge, as it requires synthesized images to exhibit not only visual novelty and surprise, but also artistic value. Current T2I models, however, are largely optimized for literal text-image alignment with their data distribution, and their noise prediction networks constrain the generation to high-probability regions, consequently generating outputs that lack authentic creativity. To address this, we propose a Self-Creative Diffusion (SCDiff) model for meaningful T2I generations featuring two core modules: a learnable spatial weighting (LSW) module and a visual-semantic mixing loss (VSML). The LSW module designs a parametric Kaiser-Bessel window to reinforce central image features, fostering novel and surprising generation. The VSML module introduces a dual loss function: a similarity loss constrains that the new images align with its textual description, while a diversity loss maximizes its distinction from the original image, enhancing both semantic value and visual novelty. Extensive experiments demonstrate that our model substantially improves creativity, semantic alignment, and visual coherence, offering a simple yet powerful framework for generating creative objects.
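As a rough illustration of a Kaiser-Bessel spatial weighting, the sketch below builds a separable 2D window with `numpy.kaiser`. It assumes a fixed shape parameter `beta` and a separable outer-product construction; in the paper the window is parametric and learnable, and how it is applied to image features is not specified in the abstract, so the function name and usage here are hypothetical.

```python
import numpy as np

def kaiser_spatial_weights(height, width, beta):
    """Separable 2D Kaiser-Bessel window; larger beta concentrates weight at the center."""
    window_y = np.kaiser(height, beta)
    window_x = np.kaiser(width, beta)
    return np.outer(window_y, window_x)

# Example: a 9x9 weight map that emphasizes the central region of a feature map.
weights = kaiser_spatial_weights(9, 9, beta=4.0)
center_weight = weights[4, 4]
corner_weight = weights[0, 0]
```

Multiplying a feature map elementwise by `weights` would down-weight the borders relative to the center, which is the qualitative effect the abstract describes.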

2605.19541 2026-05-20 cs.SD

Optimising Neural Speech Codecs for 300bps Communication using Reinforcement Learning

利用强化学习优化神经语音编解码器用于300bps通信

Junyi Wang, Chi Zhang, Jing Qian, Haifeng Luo, Hao Wang, Zengrui Jin, Chao Zhang

发表机构 * Tsinghua University(清华大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 本文提出ClariCodec,一种在300bps下工作的神经语音编解码器,通过将量化视为随机策略,利用强化学习优化可懂度,从而在极端压缩水平下减少词错误率。

详情
AI中文摘要

在带宽受限的通信中,如卫星和水下信道,语音往往需要在超低比特率下传输,其中可懂度是主要目标。在如此极端的压缩水平下,用声学重建损失训练的编解码器倾向于将比特分配给感知细节,导致词错误率(WER)显著恶化。本文提出了ClariCodec,一种工作在300比特每秒(bps)的神经语音编解码器,将量化重新表述为随机策略,从而能够通过强化学习(RL)优化可懂度。具体来说,编码器使用由WER驱动的奖励进行微调,而声学重建流程保持冻结。即使不使用强化学习,ClariCodec在300bps下于LibriSpeech test-clean集上也实现了4.64%的WER,已经可与工作在更高比特率下的编解码器相竞争。进一步的强化学习微调将WER降低到test-clean上的3.55%和test-other上的10.4%,对应23%的相对降幅,同时保持感知质量。

英文摘要

In bandwidth-constrained communication such as satellite and underwater channels, speech must often be transmitted at ultra-low bitrates where intelligibility is the primary objective. At such extreme compression levels, codecs trained with acoustic reconstruction losses tend to allocate bits to perceptual detail, leading to substantial degradation in word error rate (WER). This paper proposes ClariCodec, a neural speech codec operating at 300 bits per second (bps) that reformulates quantisation as a stochastic policy, enabling reinforcement learning (RL)-based optimisation of intelligibility. Specifically, the encoder is fine-tuned using WER-driven rewards while the acoustic reconstruction pipeline remains frozen. Even without RL, ClariCodec achieves 4.64% WER on the LibriSpeech test-clean set at 300 bps, already competitive with codecs operating at higher bitrates. Further RL fine-tuning reduces WER to 3.55% on test-clean and 10.4% on test-other, corresponding to a 23% relative reduction while preserving perceptual quality.
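The quoted 23% figure can be checked against the test-clean numbers in the abstract. The short sketch below only does that arithmetic; the abstract does not give the pre-RL WER on test-other, so no corresponding check is attempted for that set.

```python
def relative_reduction(before, after):
    """Relative reduction of a metric, as a fraction of its starting value."""
    return (before - after) / before

# test-clean WER without RL (4.64%) and after RL fine-tuning (3.55%), from the abstract.
wer_reduction = relative_reduction(4.64, 3.55)
wer_reduction_percent = round(100 * wer_reduction, 1)
```

The test-clean figures give a relative reduction of about 23.5%, consistent with the reported 23%.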

2605.19539 2026-05-20 cs.CV

Trust It or Not: Evidential Uncertainty for Feed-Forward 3D Reconstruction with Trust3R

信还是不信:基于Trust3R的前馈3D重建证据不确定性

Zihao Zhu, Wenyuan Zhao, Nuo Chen, Chao Tian, Zhiwen Fan

发表机构 * Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA(电子与计算机工程系,德克萨斯农工大学,学院站,德克萨斯州,美国)

AI总结 本文提出Trust3R,一种轻量级的证据不确定性框架,用于前馈3D重建,通过结合门控残差均值细化和正态-逆 Wishart 证据头,生成点云不确定性估计,提升几何重建的准确性和可靠性。

Comments Accepted at ICML 2026. 10 pages main paper, with appendix

详情
AI中文摘要

几何基础模型有望从未经标定的图像中进行无约束的稠密几何预测。然而,在当前的前馈设计中,其预测的置信度分数是启发式的,缺乏概率解释,且往往无法指示预测几何在何处、在多大程度上可信。为弥补这一差距,我们提出了Trust3R,一种用于前馈3D重建的轻量级证据不确定性框架。Trust3R将门控残差均值细化与正态-逆Wishart证据头相结合,为逐点几何不确定性给出闭式的多元Student-t分布。这种设计在仅增加适度推理开销的同时,提供了有概率依据的点图(pointmap)不确定性估计。我们在多样化的室内和室外基准上进行了评估,并与MASt3R内置的置信度图以及常见的不确定性感知基线进行了比较,后者涵盖单次前向的异方差回归和基于采样的方法(如MC dropout和深度集成)。实验结果表明,Trust3R持续改善了风险-覆盖和稀疏化指标,并且总体上提高了几何精度。这些收益体现为跨基准更强的不确定性排序能力,在ScanNet++上AURC降低25%、AUSE降低41%,为下游几何流程中的不确定性感知加权提供了实用的可靠性信号。项目页面和代码可在 https://trust3r-z.github.io/ 获取。

英文摘要

Geometric foundation models hold promise for unconstrained dense geometry prediction from uncalibrated images. However, in current feed-forward designs, their predicted confidence scores are heuristic, lack probabilistic interpretation, and often fail to indicate where and how much the predicted geometry can be trusted. To address this gap, we present Trust3R, a lightweight evidential uncertainty framework for feed-forward 3D reconstruction. Trust3R combines gated residual mean refinement with a Normal-Inverse-Wishart evidential head, yielding a closed-form multivariate Student-t distribution for per-point geometric uncertainty. This design provides probabilistically grounded pointmap uncertainty estimates while adding moderate inference overhead. We evaluate on diverse indoor and outdoor benchmarks and compare against MASt3R's built-in confidence map as well as common uncertainty-aware baselines spanning single-pass heteroscedastic regression and sampling-based methods such as MC dropout and deep ensembles. Experimental results show that Trust3R consistently improves risk-coverage and sparsification, and generally improves geometric accuracy. These gains are reflected in stronger uncertainty ranking across benchmarks, with 25% lower AURC and 41% lower AUSE on ScanNet++, providing a practical reliability signal for uncertainty-aware weighting in downstream geometry pipelines. The project page and code are available at https://trust3r-z.github.io/.
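For readers unfamiliar with AURC, the sketch below shows one common way to compute an area under the risk-coverage curve: points are sorted from most to least confident, the mean error of the retained points is tracked as coverage grows, and those selective risks are averaged. This is a generic formulation with hypothetical names; the paper's exact evaluation protocol may differ.

```python
import numpy as np

def aurc(errors, uncertainties):
    """Average selective risk over all coverage levels (lower is better)."""
    order = np.argsort(uncertainties)  # most confident (lowest uncertainty) first
    sorted_errors = np.asarray(errors, dtype=float)[order]
    retained_counts = np.arange(1, len(sorted_errors) + 1)
    selective_risk = np.cumsum(sorted_errors) / retained_counts
    return selective_risk.mean()

# Toy per-point errors with two uncertainty rankings of the same points.
errors = [0.1, 0.2, 0.3, 0.4]
well_ranked = aurc(errors, [0.1, 0.2, 0.3, 0.4])   # uncertainty tracks error
badly_ranked = aurc(errors, [0.4, 0.3, 0.2, 0.1])  # uncertainty is inverted
```

An uncertainty estimate that ranks high-error points as uncertain yields the lower value, which is why a reduction in AURC indicates better uncertainty ranking.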

2605.19538 2026-05-20 cs.CV cs.AI

CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision

CaptchaMind: 通过强化学习与显式推理监督训练CAPTCHA求解器

Pengcheng Wang, Haoxiang Liu, Yang Dai, Xiangxiang Zeng, Guanhua Chen, Baotian Hu, Longyue Wang, Weihua Luo

发表机构 * Alibaba Group(阿里巴巴集团) Southern University of Science and Technology(南方科技大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 本文提出CaptchaMind,一种基于强化学习的CAPTCHA求解器,通过显式推理监督训练,实现了82.9%的平均成功率,显著优于现有方法。

Comments 17 pages, 12 figures

详情
AI中文摘要

CAPTCHA被广泛部署为人机验证机制,经常阻碍智能代理在真实网络环境中完成端到端自动化。解决现代CAPTCHA需要稳健的多步骤视觉推理和交互能力,但由于缺乏大规模训练数据和过程级标注,基于训练的方法一直缺位。我们提出了CaptchaBench,第一个旨在支持大规模训练的CAPTCHA基准,包含16,000个程序生成的样本,覆盖八个任务类别,并带有详细的区域级和过程级标注。在CaptchaBench上的系统评估表明,现有方法在需要捕捉精细视觉细节和进行区域级比较的任务上持续失败。因此,我们提出了CaptchaMind,一种基于强化学习的求解器,通过显式推理过程监督进行训练,在八个任务上实现了82.9%的平均成功率,在真实实例上达到71.0%,在不依赖闭源API的情况下显著优于所有现有方法。

英文摘要

CAPTCHAs are widely deployed as human verification mechanisms and frequently block intelligent agents from completing end-to-end automation in real-world web environments. Solving modern CAPTCHAs requires robust multi-step visual reasoning and interaction capabilities, yet training-based approaches have remained absent due to the lack of large-scale training data and process-level annotations. We introduce CaptchaBench, the first CAPTCHA benchmark designed to support large-scale training, comprising 16,000 programmatically generated samples across eight task categories with detailed region and process-level annotations. Systematic evaluation on CaptchaBench reveals that existing methods fail consistently on tasks requiring fine-grained visual detail capture and region-level comparison. We therefore present CaptchaMind, an RL-based solver trained with explicit reasoning process supervision, achieving 82.9% average success rate across eight tasks and 71.0% on real-world instances, substantially outperforming all existing methods without closed-source APIs.

2605.19533 2026-05-20 cs.CV

Replacement Learning: Training Neural Networks with Fewer Parameters

替代学习:用更少的参数训练神经网络

Yuming Zhang, Peizhe Wang, Tianyang Han, Hengyu Shi, Junhao Su, Dongzhi Guan, Jiabin Liu, Jiaji Wang

发表机构 * The University of Hong Kong(香港大学) Southeast University(东南大学)

AI总结 本文提出替代学习(RepL)方法,通过替换而非删除神经网络中的部分模块来减少全深度反向传播的冗余,从而在保持性能的同时降低参数量、内存使用和训练时间。

Comments 16 pages

详情
AI中文摘要

端到端训练结合全深度反向传播仍然是优化深度神经网络的主要范式,但随着模型变深,其效率会下降。由于每个块必须在单一全局目标下执行和微分,全深度反向传播引入了显著的参数冗余、激活-内存成本和训练延迟,尤其是在相邻层具有高度相关学习模式时。直接跳过或删除层可以降低成本,但通常会削弱表示能力或需要特定架构的重用设计。在本文中,我们提出了替代学习(RepL),一种训练时的范式,通过替换选定的块而不是简单地删除它们来减少全深度冗余。对于每个被移除的块,RepL插入一个轻量级计算层,通过可学习的转换从其相邻前序和后序块的参数合成一个替代操作符,并将该合成操作符应用于前序激活。这样,RepL在保持局部上下文连续性的同时避免了不必要的全层计算。我们为CNNs和ViTs实例化RepL,使用定制化的参数融合块来处理卷积通道、特征分辨率和Transformer子模块。在CIFAR-10、SVHN、STL-10、ImageNet、COCO和CityScapes等数据集上的广泛实验表明,RepL在减少可训练参数、GPU内存使用和训练时间的同时,在分类、检测和分割任务中与标准端到端训练相匹配或超越。此外,在WikiText-2、迁移学习、推理吞吐量、检查点、随机深度和INT8量化等额外结果中进一步展示了其通用性和兼容性。

英文摘要

End-to-end training with full-depth backpropagation remains the dominant paradigm for optimizing deep neural networks, but its efficiency deteriorates as models grow deeper. Since every block must be executed and differentiated under a single global objective, full-depth BP introduces substantial parameter redundancy, activation-memory cost, and training latency, especially when neighboring layers exhibit highly correlated learning patterns. Directly skipping or removing layers can reduce cost, but often weakens representation capacity or requires architecture-specific reuse designs. In this paper, we propose Replacement Learning (RepL), a training-time paradigm that reduces full-depth redundancy by replacing selected blocks rather than simply discarding them. For each removed block, RepL inserts a lightweight computing layer that synthesizes a surrogate operator from the parameters of its adjacent preceding and succeeding blocks through a learnable transformation, and applies the synthesized operator to the preceding activation. In this way, RepL preserves local contextual continuity while avoiding unnecessary full-layer computation. We instantiate RepL for CNNs and ViTs with tailored parameter-fusion blocks that handle convolutional channels, feature resolutions, and transformer submodules. Extensive experiments on CIFAR-10, SVHN, STL-10, ImageNet, COCO, and CityScapes show that RepL reduces trainable parameters, GPU memory usage, and training time while matching or surpassing standard end-to-end training across classification, detection, and segmentation. Additional results on WikiText-2, transfer learning, inference throughput, checkpointing, stochastic depth, and INT8 quantization further demonstrate its generality and compatibility.

2605.19532 2026-05-20 cs.CV cs.LG

Boosting Text-to-Image Diffusion Models via Core Token Attention-Based Seed Selection

通过基于核心标记注意力的种子选择提升文本到图像扩散模型

Yunzhe Zhang, Hongfu Liu, Pengyu Hong

发表机构 * Brandeis University(布兰迪大学)

AI总结 本文研究了文本到图像扩散模型中种子对生成质量的影响,提出基于核心标记注意力的种子选择方法,无需训练即可提升文本与图像的一致性及视觉质量。

Comments Preprint

详情
AI中文摘要

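文本到图像扩散模型能够生成高质量的图像,但其输出对随机种子极为敏感:不同的初始种子往往导致图像质量和提示词与图像的一致性产生显著差异。我们重新审视这一“种子效应”,并表明在最初几个去噪步骤中测得的、针对提示词核心标记(即承载内容的词)的注意力动态能够很好地预测最终生成质量。基于这一观察,我们提出了基于注意力的种子选择(ABSS),一种无需训练、即插即用的方法,利用去噪过程中对核心标记的交叉注意力为给定提示词的种子排序。ABSS无需微调,也不改变初始噪声;它对所有候选种子打分并排序,只保留前k个用于完整生成,其余丢弃,且不依赖固定的接受/拒绝阈值。由于完全在推理阶段运行,ABSS可以作为现有种子优化流程的轻量级预筛选附加组件,带来额外收益。在三个基准上的大量实验表明,ABSS能够持续提升Stable Diffusion各变体的文本-图像一致性和视觉质量,这一点得到了人类偏好和对齐指标的印证。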

英文摘要

Text-to-image diffusion models can synthesize high-quality images, yet the outcome is notoriously sensitive to the random seed: different initial seeds often yield large variations in image quality and prompt-image alignment. We revisit this "seed effect" and show that attention dynamics over prompt core tokens, the content-bearing words, measured during the first few denoising steps, strongly predict final generation quality. Building on this observation, we introduce Attention-Based Seed Selection (ABSS), a training-free, plug-and-play method that ranks seeds for a given prompt by leveraging cross-attention to core tokens during the denoising process. ABSS requires no finetuning and does not alter the initial noise; it scores and ranks all candidate seeds, keeps only the top-k for full generation, and discards the rest, without relying on a fixed accept/reject threshold. Operating purely at inference time, ABSS can serve as a lightweight pre-selection add-on for existing seed-optimization pipelines, enabling additional gains. Across three benchmarks, extensive experiments show that ABSS enables consistent improvements in text-image alignment and visual quality for Stable Diffusion variants, as corroborated by human preference and alignment metrics.
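The score-rank-keep-top-k procedure described above can be sketched as follows. The scoring rule here (mean cross-attention mass on core tokens over the first few denoising steps) is an assumption made for illustration; the abstract does not give the exact statistic, and the array layout and function name are hypothetical.

```python
import numpy as np

def rank_seeds(attention, core_token_mask, top_k):
    """Rank candidate seeds by early cross-attention on core tokens and keep the top k.

    attention: array of shape (num_seeds, num_early_steps, num_tokens).
    core_token_mask: boolean array of shape (num_tokens,) marking content-bearing tokens.
    Returns (indices of kept seeds in descending score order, all scores).
    """
    core_attention = attention[:, :, core_token_mask]
    scores = core_attention.mean(axis=(1, 2))  # assumed scoring rule
    ranked = np.argsort(-scores)
    return ranked[:top_k], scores

# Toy example: 3 candidate seeds, 2 early steps, 4 prompt tokens (tokens 1 and 2 are core).
core_token_mask = np.array([False, True, True, False])
attention = np.zeros((3, 2, 4))
attention[0, :, 1:3] = 0.10
attention[1, :, 1:3] = 0.40
attention[2, :, 1:3] = 0.25

kept_seeds, seed_scores = rank_seeds(attention, core_token_mask, top_k=2)
```

Only the kept seeds would then be run through the full denoising schedule, so no fixed accept/reject threshold is involved.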

2605.19529 2026-05-20 cs.AI

Generative-Evaluative Agreement: A Necessary Validity Criterion for LLM-Enabled Adaptive Assessment

生成-评估一致性:为LLM赋能的自适应评估的必要有效性标准

Grandee Lee, Yue Wang, Che Yee Lye, Luke Peh

发表机构 * Singapore University of Social Sciences(新跃社科大学)

AI总结 本文提出生成-评估一致性(GEA)作为LLM赋能自适应评估的有效性标准,通过测量LLM评分函数是否能恢复其生成函数所指示的技能水平,发现其在不同技能层面的有效性存在差异,并提出细粒度、技能分解的评分标准作为提升GEA的主要方法。

Comments BEA 2026

详情
AI中文摘要

当同一个LLM既生成评估项目、模拟学生作答,又负责评分时,验证循环是自我参照的。我们引入生成-评估一致性(GEA),作为一种效度标准,用于衡量LLM的评分函数是否能恢复其生成函数被指示产生的技能水平。在首次针对两阶段自适应评估的GEA直接测量中,模型恢复了约一半的预期方差(r = 0.698),并存在系统性正偏差。GEA在可通过语法验证的技能上表现较强(r > 0.7),但在设计层面的技能上接近于零,并且对低技能的高估会抬高路由阈值附近的分数。我们认为细粒度、按技能分解的评分标准是增强GEA的主要机制,并概述了互补的缓解措施。

英文摘要

When the same LLM generates assessment items, simulates student responses, and scores them, the validation loop is self-referential. We introduce Generative-Evaluative Agreement (GEA), a validity criterion measuring whether an LLM's scoring function recovers the skill levels its generative function was instructed to produce. In the first direct measurement of GEA on a two-stage adaptive assessment, the model recovers roughly half the intended variance (r = 0.698) with systematic positive bias. GEA is strong (r > 0.7) for syntactically verifiable skills but near zero for design-level skills, and low-skill overestimation inflates scores near the routing threshold. We argue that granular, skill-decomposed rubrics are the principal proposed mechanism for strengthening GEA and outline complementary mitigations.
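The link between the reported correlation and "roughly half the intended variance" is the squared correlation coefficient. The sketch below only reproduces that arithmetic from the value of r given in the abstract; the function name is hypothetical.

```python
def variance_explained(r):
    """Fraction of variance shared between two variables with Pearson correlation r."""
    return r ** 2

gea_r = 0.698
shared_variance = variance_explained(gea_r)
```

With r = 0.698 the shared variance is about 0.487, which is the "roughly half" quoted above.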

2605.19528 2026-05-20 cs.CV

Towards Camera-Robust 3D Localization: Equation-Anchored Tool-Use for MLLMs

面向相机鲁棒的3D定位:基于方程的工具使用用于MLLMs

Xueying Jiang, Wenhao Li, Quanhao Qian, Deli Zhao, Shijian Lu, Gongjie Zhang, Ran Xu

发表机构 * Nanyang Technological University(南洋理工大学) DAMO Academy, Alibaba Group(阿里巴巴达摩院) HuPan Lab(湖畔实验室) Alibaba Group(阿里巴巴集团)

AI总结 本文提出了一种基于方程的工具使用框架,通过将空间工具作为公式变量重新利用,以解决多模态大语言模型(MLLMs)中3D定位的相机固有模糊问题,从而在3D物体检测和3D视觉定位任务中取得了显著提升。

详情
AI中文摘要

多模态大语言模型(MLLMs)中的3D定位,包括3D物体检测和3D视觉定位,本质上受限于相机内参歧义:相同图像在不同相机下可以对应不同的3D场景。现有的MLLMs要么忽略相机参数并过拟合于规范的训练内参,要么从外部工具检索深度和3D线索,但将返回值视为参考线索(模型可以自由隐式解释的数值提示),两者都使相机信息无法被确定性地传播到预测中。我们提出了一种基于方程的工具使用框架,将空间工具重新用作公式变量。该框架主动检索相机内参并采样多点度量深度,在思维链(Chain-of-Thought, CoT)中显式写出针孔反投影方程$\hat{X} = (u_c - c_x)\bar{Z}/f_x$,并在回归最终的9自由度包围盒之前将工具输出代入公式。在相机内参从$0.5\times$缩放到$1.5\times$的设置下,我们的方法在3D物体检测和3D视觉定位任务上均优于仅使用RGB和工具增强的基线方法,在相机偏离训练尺度最大时提升尤为显著。代码和数据将被发布。

英文摘要

3D localization in Multimodal Large Language Models (MLLMs), including 3D object detection and 3D visual grounding, is fundamentally limited by camera intrinsic ambiguity: the same image admits different 3D scenes under different cameras. Existing MLLMs either ignore camera parameters and overfit to a canonical training intrinsic, or retrieve depth and 3D cues from external tools but treat the returned values as reference cues (numerical hints that the model is free to interpret implicitly), both preventing camera information from being deterministically propagated into the prediction. We propose an equation-anchored tool-use framework that re-purposes spatial tools as formula variables. The proposed framework proactively retrieves camera intrinsics and samples multi-point metric depths, writes the pinhole back-projection equation $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ explicitly in Chain-of-Thought (CoT), and substitutes tool outputs into the formula before regressing the final 9-DoF bounding box. On both 3D object detection and 3D visual grounding tasks under rescaled camera intrinsics from $0.5\times$ to $1.5\times$, our method outperforms RGB-only and tool-augmented baselines, with significant gains where the camera deviates most from the training scale. Code and data will be released.
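The pinhole back-projection equation quoted in the abstract can be worked through numerically. The sketch below implements $\hat{X} = (u_c - c_x)\bar{Z}/f_x$ and, as an assumption not stated in the abstract, the analogous expression for the vertical axis. The pixel, depth, and intrinsic values are made up for illustration.

```python
def back_project(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) at metric depth into camera coordinates (X, Y, Z)."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy  # vertical analogue of the quoted equation (assumed)
    return x, y, depth

# Same pixel and same depth under two focal lengths (hypothetical values).
pixel_u, pixel_v, depth_m = 800.0, 360.0, 5.0
x_train, y_train, _ = back_project(pixel_u, pixel_v, depth_m,
                                   fx=1000.0, fy=1000.0, cx=640.0, cy=360.0)
x_half, y_half, _ = back_project(pixel_u, pixel_v, depth_m,
                                 fx=500.0, fy=500.0, cx=640.0, cy=360.0)
```

Halving the focal length doubles the recovered lateral offset for the same pixel and depth, which is the camera intrinsic ambiguity the abstract describes: the image alone does not fix the 3D position.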

2605.19527 2026-05-20 cs.CV

Dual-Prompt CLIP with Hybrid Visual Encoders for Occluded Person Re-Identification

双提示CLIP与混合视觉编码器用于遮挡行人重识别

Zhangjian Ji, Shaotong Qiao, Kai Feng, Wei Wei

发表机构 * School of Computer & Information Technology, Shanxi University, Wucheng Rd. 92, Taiyuan 030006, Shanxi, China(山西大学计算机与信息技术学院) Key Laboratory of Computational Intelligence

AI总结 本文提出了一种双提示学习重识别模型DPL-ReID,通过双提示学习策略和现实遮挡增强方法,提升遮挡行人重识别的鲁棒性和准确性。

详情
AI中文摘要

遮挡行人重识别旨在在多个摄像头视图中匹配部分可见的行人。然而,遮挡会破坏身体区域线索,从而复杂化跨视图匹配。大多数基于预训练视觉-语言模型的行人重识别方法只关注增强基于提示的特征学习,而忽略遮挡物的语义信息。基于CLIP-ReID的成功,我们提出了一种新的双提示学习重识别(DPL-ReID)模型用于遮挡行人重识别。它结合了双提示学习(Dual-PL)策略,可以利用文本线索捕捉完整的行人语义并保持对遮挡的鲁棒性,以及现实世界遮挡增强(RWOA)方法,该方法真实模拟现实世界中遇到的遮挡场景以丰富遮挡样本。此外,我们还设计了加权门控特征融合(WGFF)方法,它结合LSNet来捕捉全局信息并作为特征门控机制。该机制可以有效引导CLIP视觉编码器生成更全面的特征表示。在多个基准遮挡重识别数据集上的广泛实验表明,所提出的DPL-ReID实现了最先进的性能。遮挡实例库可在https://github.com/stone-qiao/DPL-ReID上获取。

英文摘要

Occluded person re-identification focuses on matching partially visible pedestrians across multiple camera views. However, occlusions disrupt body-region cues, thereby complicating cross-view matching. Most person ReID methods built on pretrained vision-language models only focus on enhancing prompt-based feature learning while ignoring the semantic information of occluders. Based on the success of CLIP-ReID, we propose a novel Dual Prompt Learning ReID (DPL-ReID) model for occluded person ReID. It incorporates a Dual Prompt Learning (Dual-PL) strategy, which can utilize textual cues to capture complete pedestrian semantics and maintain robustness against occlusion, and a Real-World Occlusion Augmentation (RWOA) method that realistically simulates occlusion scenarios encountered in the real world to enrich occluded samples. In addition, we also design a Weighted Gated Feature Fusion (WGFF) method, which incorporates LSNet to capture global information and act as a feature-gating mechanism. This mechanism can effectively guide the CLIP visual encoder toward generating more comprehensive feature representations. Extensive experiments on several benchmark occluded ReID datasets show that our proposed DPL-ReID achieves state-of-the-art performance. The occlusion instance library is available at https://github.com/stone-qiao/DPL-ReID.

2605.19524 2026-05-20 cs.RO cs.CV

SafeAlign-VLA: A Negative-Enhanced Safe Alignment Framework for Risk-Aware Autonomous Driving

SafeAlign-VLA: 一种增强负样本的安全对齐框架用于风险感知的自动驾驶

Kefei Tian, Yuansheng Lian, Kai Yang, Xiangdong Chen, Shen Li

发表机构 * College of Transportation, Tongji University(同济大学交通运输学院) Department of Civil Engineering, Tsinghua University(清华大学土木工程系) School of Vehicle and Mobility, Tsinghua University(清华大学车辆与运载学院) Department of Civil and Environmental Engineering, National University of Singapore(新加坡国立大学土木与环境工程系)

AI总结 本文提出SafeAlign-VLA框架,通过整合负样本数据提升自动驾驶系统对安全边界的理解,通过生成安全标签和反事实轨迹,结合两阶段训练策略和基于锚点的群体相对策略优化,提高了自动驾驶的安全性和鲁棒性。

详情
AI中文摘要

端到端的自动驾驶系统在常见场景中表现优异,但在安全关键的长尾案例中表现不佳。视觉-语言-动作(VLA)模型因其强大的推理能力而具有前景。然而,大多数基于VLA的方法依赖于正专家演示,很少利用负样本,导致对危险行为和安全边界的理解不足。为了解决这一限制,我们提出了SafeAlign-VLA,一种统一的增强负样本的安全对齐框架,将负数据整合到监督学习和强化学习中。首先,我们开发了一种反事实安全配对范式,通过反事实推理从危险场景中生成结构化的安全标签和反事实正轨迹。然后采用两阶段训练策略:负样本增强的监督微调用于故障反馈和轨迹修正,接着是基于锚点的群体相对策略优化,利用正负轨迹作为对比锚点,引导采样并惩罚高风险行为。在NAVSIM和DeepAccident上的实验验证了所提框架。SafeAlign-VLA在NAVSIM v1测试集上达到89.1 PDMS,比无负样本基线提高了1.3%。在DeepAccident上,碰撞率降低到3.36%,同时达到84.2%的语言准确率和85.8%的风险预测准确率。这些结果证明了所提增强负样本的安全对齐框架在安全和鲁棒自动驾驶中的有效性。

英文摘要

End-to-end autonomous driving systems excel in common scenarios but struggle with safety-critical long-tail cases. Vision-Language-Action (VLA) models are promising due to their strong reasoning capabilities. However, most VLA-based approaches rely on positive expert demonstrations, rarely exploiting negative samples, leading to insufficient understanding of risky behaviors and safety boundaries. To address this limitation, we propose SafeAlign-VLA, a unified negative-enhanced safe alignment framework that incorporates negative data into supervised learning and reinforcement learning. First, we develop a counterfactual safety pairing paradigm to generate structured safety labels and counterfactual positive trajectories from risky scenarios via counterfactual reasoning. Then, a two-stage training strategy is adopted: negative-enhanced supervised fine-tuning for failure feedback and trajectory correction, followed by anchor-based group relative policy optimization that uses positive and negative trajectories as contrastive anchors to steer sampling and penalize high-risk behaviors via group-relative advantages. Experiments on NAVSIM and DeepAccident validate the proposed framework. SafeAlign-VLA achieves 89.1 PDMS on the NAVSIM v1 test set, improving over the baseline without negative data by 1.3%. On DeepAccident, it reduces the collision rate to 3.36%, while achieving 84.2% language accuracy and 85.8% risk prediction accuracy. These results demonstrate the effectiveness of the proposed negative-enhanced safe alignment framework for safe and robust autonomous driving.
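The group-relative advantages mentioned above can be sketched in their generic form: each sampled trajectory's reward is normalized by the mean and standard deviation of its sampling group. This is the usual group relative policy optimization normalization, shown under that assumption; the paper's anchor mechanism (positive and negative trajectories as contrastive anchors) is not modeled here, and the reward values are invented.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampling group: (r - mean) / (std + eps)."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical rewards for four sampled trajectories; the last one ends in a collision.
group_rewards = [1.0, 0.5, 0.0, -1.0]
advantages = group_relative_advantages(group_rewards)
```

Trajectories that score below the group average receive negative advantages, so the policy update pushes probability mass away from them without needing a learned value function.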

2605.19523 2026-05-20 cs.CL cs.AI cs.CV

Investigating Cross-Modal Skill Injection: Scenarios, Methods, and Hyperparameters

探究跨模态技能注入:场景、方法与超参数

Zhiyu Xu, Lean Wang, Yuanxin Liu, Lei Li, Hao Zhou, Fandong Meng, Jie Zhou, Xu Sun

发表机构 * State Key Laboratory for Multimedia Information Processing, School of Computer Science, Peking University(多媒体信息处理国家重点实验室,计算机科学学院,北京大学) WeChat AI, Tencent Inc., China(腾讯公司,中国) The University of Hong Kong(香港大学)

AI总结 本文研究了跨模态技能注入在不同场景下的表现,分析了其方法和超参数的影响,发现其在指令遵循和跨语言任务中表现良好,但在数学推理中存在困难,同时指出经典方法如TA和DARE在性能上优于其他融合方法。

详情
AI中文摘要

视觉-语言模型(VLMs)在一般多模态理解方面表现出色;然而,它们在高效获取持续演化的领域特定技能方面存在困难。传统增强VLM能力的方法,如监督微调(SFT),需要大量的数据集整理和大量的计算资源。模型合并作为一种高效的替代方法,能够将领域专家的LLM专业知识转移到VLMs上,而无需额外的数据集要求或显著的计算开销。与传统合并同质LLM的方法不同,跨模态技能注入旨在通过将领域专家LLM整合到VLM中来诱导出新的跨模态能力。然而,现有研究缺乏对跨模态技能注入的适用性和方法的系统分析。在本研究中,我们从三个主要方面探讨了跨模态技能注入:场景、方法和超参数。在场景方面,我们发现跨模态技能注入在指令遵循和跨语言设置中表现良好,但在数学推理中表现不佳。在方法方面,我们发现经典方法如TA和DARE在性能上优于其他融合方法。我们还提供了这些经典方法所依赖的超参数调优的系统和定量分析。

英文摘要

Vision-Language Models (VLMs) have demonstrated remarkable proficiency in general multi-modal understanding; yet they struggle to efficiently acquire continually evolving domain-specific skills. Conventional approaches to enhancing VLM capabilities, such as Supervised Fine-Tuning (SFT), require extensive dataset curation and substantial computational resources. Model merging has emerged as an efficient alternative that enables the transfer of domain-specific expertise from Large Language Models (LLMs) to VLMs without incurring additional training data requirements or significant computational overhead. Unlike conventional merging of homogeneous LLMs, which mainly aggregates existing capabilities, cross-modal skill injection aims to induce emergent cross-modal capabilities by integrating a domain-expert LLM into a VLM. However, existing research lacks a systematic analysis of the applicability and methodology of cross-modal skill injection. In this study, we investigate cross-modal skill injection across three main aspects: scenarios, methods, and hyperparameters. For scenarios, we find that cross-modal skill injection generally performs well in instruction-following and cross-lingual settings, yet struggles with mathematical reasoning. For methods, we find that classic approaches such as TA and DARE consistently achieve superior performance over alternative merging methods. We also provide a systematic and quantitative analysis of the hyperparameter tuning that these classic methods critically depend on.
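The two classic methods named above, TA (task arithmetic) and DARE, can be sketched on toy weight vectors. The sketch assumes the expert LLM and the VLM's language backbone share a common base checkpoint, that the task vector is `expert - base`, and that it is added to the VLM weights with a scaling coefficient; the paper's exact injection recipe and hyperparameter values are not given in the abstract, and all names below are hypothetical.

```python
import numpy as np

def dare_sparsify(task_vector, drop_rate, rng):
    """DARE: randomly drop delta entries with probability drop_rate, rescale the rest."""
    keep = rng.random(task_vector.shape) >= drop_rate
    return np.where(keep, task_vector / (1.0 - drop_rate), 0.0)

def inject_task_vector(target_weights, task_vector, scale):
    """Task arithmetic: add a scaled task vector to the target model's weights."""
    return target_weights + scale * task_vector

# Toy 4-parameter "models" (made-up numbers).
base = np.array([0.0, 1.0, -1.0, 2.0])      # shared base LLM
expert = np.array([0.5, 1.0, -2.0, 2.5])    # domain-expert LLM fine-tuned from base
vlm = np.array([0.1, 0.9, -1.1, 2.2])       # language backbone of the VLM

task_vector = expert - base
rng = np.random.default_rng(0)
sparse_vector = dare_sparsify(task_vector, drop_rate=0.5, rng=rng)

merged_ta = inject_task_vector(vlm, task_vector, scale=0.5)
merged_dare = inject_task_vector(vlm, sparse_vector, scale=0.5)
```

The `scale` and `drop_rate` arguments are the kind of hyperparameters whose tuning the abstract says these methods critically depend on.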

2605.19522 2026-05-20 cs.CV

iDiff: Interpretable Difference-aware Framework for Pairwise Image Quality Assessment

iDiff:用于成对图像质量评估的可解释差异感知框架

Xinli Yue, JianHui Sun, Tao Shao, Liangchao Yao, Fan Xia, Yuetang Deng

发表机构 * Tencent(腾讯)

AI总结 本文提出iDiff框架,通过双分支设计结合可解释的差异建模和结构化多模态推理,提升成对图像质量评估的鲁棒性和可解释性,并在NTIRE 2026 RAIM挑战中取得第一名。

Comments Accepted to CVPR 2026 Workshop

详情
AI中文摘要

成对图像质量评估(IQA)在专业摄影中需要一个模型不仅能够识别两个候选图像之间的优选图像,还能提供有说服力且基于图像的推理。在NTIRE 2026 RAIM挑战中,这一要求进一步通过联合评估偏好预测和推理生成被强调。为了解决这一任务,我们提出了iDiff,一个用于成对图像质量评估的可解释差异感知框架。我们的方法采用由答案模型和推理模型组成的双分支设计。答案模型通过显式地将每个样本分解为左右全局和局部视图,随后进行内容感知的专业化处理,针对人物和场景图像,并通过跨主干的集成方法进行聚合,以实现稳健的偏好预测。推理模型专注于推理生成,并逐步增强,通过专家式模板、多源质量特征以及基于答案模型预测的条件监督进行优化。通过这种方式,iDiff联合建模了判别性决策和结构化解释,提高了鲁棒性和可解释性。广泛的实验表明,所提出的框架在准确性和推理质量指标上都有效。我们的方法在NTIRE 2026 RAIM挑战中取得了第一名,展示了将显式差异建模与结构化多模态推理整合用于成对IQA的有效性。

英文摘要

Pairwise image quality assessment (IQA) in professional photography requires a model not only to identify the preferred image between two candidates, but also to provide convincing and image-grounded reasoning. In the NTIRE 2026 RAIM challenge, this requirement is further emphasized by jointly evaluating preference prediction and rationale generation. To address this task, we propose iDiff, an Interpretable Difference-aware framework for pairwise image quality assessment. Our method adopts a dual-branch design consisting of an Answer Model and a Thinking Model. The Answer Model performs robust preference prediction by explicitly decomposing each sample into left/right global and local views, followed by content-aware specialization for person and scene images and ensemble-based aggregation across backbones. The Thinking Model focuses on rationale generation and is progressively enhanced with expert-style templates, multi-source quality features, and answer-aware supervision conditioned on the Answer Model prediction. In this way, iDiff jointly models discriminative decision making and structured explanation, improving both robustness and interpretability. Extensive experiments demonstrate the effectiveness of the proposed framework on both accuracy and reasoning-quality metrics. Our method achieved first place in the NTIRE 2026 RAIM challenge, showing the effectiveness of integrating explicit difference modeling with structured multimodal reasoning for pairwise IQA.

2605.19521 2026-05-20 cs.AI cs.GT

Efficient Elicitation of Collective Disagreements

高效获取集体分歧

Mohamed Ouaguenouni, Felipe Garrido-Lucero, Umberto Grandi, César Hidalgo, Magdalena Tydrichova

发表机构 * IRIT, Université Toulouse Capitole(IRIT,图卢兹Capitole大学) Center for Collective Learning, IAST, Toulouse School of Economics(集体学习中心,IAST,图卢兹经济学院) Center for Collective Learning, CIAS, Corvinus University of Budapest(集体学习中心,CIAS,布达佩斯考文纽斯大学) AMBS, University of Manchester(AMBS,曼彻斯特大学) Centrale Supélec, Paris Saclay(巴黎萨克雷中央理工-高等电力学院)

AI总结 本文研究了群体在备选方案上的分歧结构,提出了一种分层框架来确定计算现有分歧度量所需的最小聚合偏好信息,引入了 plurality 矩阵并展示了超越三级分歧度量的理论和实验价值。

详情
AI中文摘要

我们分析了在一组替代方案上,一群选民之间的分歧结构。调查通常要求进行成对比较,这简单直观,或者要求对替代方案进行完整排序,以获取选民的全部偏好。基于成对比较无法区分结构性分歧与噪声的观察,我们提出了一种分层框架,以确定计算文献中若干分歧度量所需的最小聚合偏好信息。具体而言,我们引入了 plurality 矩阵,这是成对比较的推广,记录了对于每一个替代方案的子集 S,每个 a ∈ S 在 S 中排名第一的概率。我们定义分歧度量的级别为表达该度量所需的最小子集大小,证明了许多现有概念,包括排名方差和分裂度,处于级别 3,证明成对比较不足以表达这些度量。此外,我们展示了超越级别 3 的理论和实验价值。为了使这些结果具有可操作性,我们设计了两种获取 plurality 矩阵的协议,探索了所需参与者数量与每个参与者认知负荷之间的权衡。

英文摘要

We analyze the structure of the disagreement among a population of voters over a set of alternatives. Surveys typically ask either for pairwise comparisons, which are simple and intuitive for participants, or for full rankings over alternatives, which elicit the voters' entire preferences. Building on the observation that pairwise comparisons cannot distinguish structural disagreement from noise, we propose a stratified framework to identify the minimal aggregated preference information needed to compute a number of disagreement measures from the literature. Specifically, we introduce the plurality matrix, a generalization of pairwise comparisons that records, for every subset $S$ of alternatives, the probability that each $a \in S$ ranks first in $S$. We define the level of a disagreement measure as the smallest subset size needed to express it, showing that many existing notions, including rank-variance and divisiveness, sit at level $3$, proving that pairwise comparisons are not enough. In addition, we demonstrate the interest of going beyond level $3$ both theoretically and experimentally. To make these results actionable, we design two elicitation protocols to estimate the plurality matrix, exploring the trade-off between the number of required participants and the cognitive load imposed on each of them.
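The plurality matrix defined above can be computed directly from full rankings. The sketch below is a minimal illustration, assuming each voter submits a complete ranking and that the matrix is stored as a dictionary keyed by subset; the data structure and the toy electorate are hypothetical, not taken from the paper.

```python
from collections import Counter
from itertools import combinations

def plurality_matrix(rankings, alternatives, subset_size):
    """For each subset S of the given size, the share of voters ranking each a in S first within S."""
    table = {}
    for subset in combinations(sorted(alternatives), subset_size):
        members = set(subset)
        # For each voter, find the highest-ranked alternative that belongs to the subset.
        firsts = Counter(next(a for a in ranking if a in members) for ranking in rankings)
        table[subset] = {a: firsts[a] / len(rankings) for a in subset}
    return table

# Toy electorate of four voters over alternatives a, b, c (best first).
rankings = [("a", "b", "c"), ("a", "c", "b"), ("b", "c", "a"), ("c", "b", "a")]
level_2 = plurality_matrix(rankings, "abc", subset_size=2)  # pairwise comparisons
level_3 = plurality_matrix(rankings, "abc", subset_size=3)
```

In this toy profile every pairwise comparison is an exact tie, yet the size-3 entry shows that `a` is ranked first by half of the voters, a small example of information that pairwise data alone does not expose.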

2605.19518 2026-05-20 cs.AI

BLINKG: A Benchmark for LLM-Integrated Knowledge Graph Generation

BLINKG:一个用于集成大语言模型的知识图谱生成基准

Carla Castedo, Enrique Iglesias, Manuel Lama, Alberto Bugarin-Diz, Maria-Esther Vidal, David Chaves-Fraga

发表机构 * Centro Singular de Investigación en Tecnoloxías Intelixentes (CiTIUS), Universidade de Santiago de Compostela, Spain(圣地亚哥-德孔波斯特拉大学智能技术研究中心(CiTIUS)) L3S Research Center Germany, Hannover, Germany(德国汉诺威L3S研究中心) TIB Leibniz Information Centre for Science and Technology, Hannover, Germany(德国汉诺威TIB莱布尼茨科学技术信息中心) Leibniz University, Hannover, Germany(德国汉诺威莱布尼茨大学) Departamento de Electrónica e Computación, Universidade de Santiago de Compostela, Spain(圣地亚哥-德孔波斯特拉大学电子与计算系)

AI总结 本文提出BLINKG基准,用于评估大语言模型在从异构数据源生成知识图谱中的映射能力,通过复杂度递增的场景和实验评估,揭示了LLM在知识图谱构建中的潜力与局限。

详情
AI中文摘要

生成知识图谱(KGs)仍然是知识工程师最耗时和劳动密集的任务之一,因为他们需要在输入数据源和本体术语之间识别语义等价性。虽然声明式解决方案(如RML、SPARQL-Anything)有助于使这一过程通用化,但将输入模式元素与本体术语对齐仍涉及复杂的转换并需要大量手动工作。随着大语言模型(LLMs)的出现,人们越来越关注利用其能力来协助KG工程师。尽管一些研究探索了使用LLMs自动化KG构建,但尚无标准化框架来评估它们在数据模式和本体概念之间建立对应关系的有效性。因此,在本文中,我们提出了BLINKG,一个用于评估LLMs在从异构数据源构建KG时映射能力的基准。该基准包含一系列基于真实世界用例的复杂度递增的场景。我们使用BLINKG对几种最先进的LLMs进行了广泛的实验评估,观察到它们已经提供了有前途的解决方案。然而,它们在复杂场景中的表现仍然有限。得益于这一基准,我们已经能够评估当前LLMs在KG构建中的能力。此外,我们定义了一套要求,以实现(半)自动(LLM驱动)的KG构建,为该领域开辟了新的研究方向。

英文摘要

Generating Knowledge Graphs (KGs) remains one of the most time-consuming and labor-intensive tasks for knowledge engineers, as they need to identify semantic equivalences between input data sources and ontology terms. While declarative solutions (e.g., RML, SPARQL-Anything) have helped to generalize this process, aligning input schema elements with ontology terms still involves intricate transformations and requires considerable manual effort. With the advent of Large Language Models (LLMs), there is growing interest in leveraging their capabilities to assist KG engineers. Although some studies have explored using LLMs to automate KG construction, there is still no standardized framework for assessing how effectively they establish correspondences between data schemas and ontology concepts. Therefore, in this paper, we propose BLINKG, a benchmark designed to evaluate the mapping capabilities of LLMs in constructing KGs from heterogeneous data sources. The benchmark includes a set of scenarios with increasing complexity, based on real-world use cases. We conduct an extensive experimental evaluation of several state-of-the-art LLMs using BLINKG and observe that they already offer promising solutions. However, their performance remains limited in complex scenarios. Thanks to this benchmark, we can already assess the current capabilities of LLMs for KG construction. Additionally, we define a set of requirements for achieving (semi)automated (LLM-driven) KG construction, opening new research lines in this area.

2605.19516 2026-05-20 cs.CL cs.AI cs.LG

Base Models Look Human To AI Detectors

Yixuan Even Xu, Ziqian Zhong, Aditi Raghunathan, Fei Fang, J. Zico Kolter

Affiliations * Carnegie Mellon University

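AI summary This study finds that AI detectors often misclassify text generated by base models as human-written. It proposes HIP, a method that improves detector evasion through iterative paraphrasing, and shows that current detectors track instruction tuning and local context rather than general features of machine-generated text.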

Comments 39 pages, 9 figures

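AI-generated Chinese abstract (English translation)

As AI-generated text is deployed at scale in the real world, institutions increasingly use commercial AI-text detectors, especially in education and academic-integrity workflows. We report a surprising empirical finding: when evaluated with GPTZero and Pangram, text generated by base models is often judged highly human-like, whereas text generated by instruction-tuned models is not. Building on this observation, we propose Humanization by Iterative Paraphrasing (HIP), a pipeline that does not depend on any particular detector: it minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines we tested, HIP achieves a better balance between semantic preservation and detector evasion on commercial detectors. Across the Llama-3 and Qwen-3 model families, at sizes from 0.6B to 70B, HIP consistently raises the human-likeness scores that detectors assign. Our findings suggest that current detectors attend more to instruction tuning and local context than to any invariant feature of machine-generated text in general. This in turn calls for detector designs that model these factors more explicitly.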

English abstract

As AI-generated text enters the real world at scale, institutions increasingly use commercial AI-text detectors, especially in education and academic-integrity workflows. We report a surprising empirical finding about such systems: when evaluated by GPTZero and Pangram, text generated by base models is often judged overwhelmingly human, whereas text generated by their instruction-tuned counterparts is not. Building on this observation, we propose Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines we test, HIP yields a stronger trade-off between semantic preservation and detector evasion on commercial detectors. Across Llama-3 and Qwen-3 families, spanning model sizes from 0.6B to 70B, HIP consistently improves detector human-likeness. Our findings suggest that current detectors are tracking artifacts of instruction tuning and local context more than any invariant notion of machine-generated text. This, in turn, calls for detector designs that model these factors more explicitly.

2605.19511 2026-05-20 cs.CV

Are Watermarked Images Editable? SafeMark for Watermark-Preserving Text-Guided Image Editing

Xiaodong Wu, Qi Li, Xiangman Li, Zelin Zhang, Lingshuang Liu, Jianbing Ni

Affiliations * Queen’s University; University of Waterloo

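AI summary This paper studies a fundamental but underexplored question: can watermarked images remain editable without compromising watermark integrity? We propose SafeMark, a framework that explicitly integrates watermark integrity into the image-editing process. Specifically, SafeMark adds a thresholded watermark-decoding loss directly to the diffusion editor's training objective and fine-tunes the editor so that semantically valid edits also preserve the embedded watermark in the final output. The design has a clear information-theoretic justification: maintaining high bit accuracy on the edited image lower-bounds the mutual information that the editing channel preserves between the watermark and the edited output, the quantity that fundamentally governs watermark recoverability. SafeMark is compatible with differentiable diffusion editors and requires no architectural changes. Extensive evaluation across multiple datasets, text-guided editing methods, and post-edit distortion settings shows that SafeMark achieves high watermark bit accuracy in diverse editing settings while maintaining high-quality semantic edits, without sacrificing robustness to common post-edit distortions. These results indicate that semantic editability and watermark integrity are fundamentally compatible, making image provenance trustworthy in generative editing pipelines.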

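AI-generated Chinese abstract (English translation)

This paper studies a fundamental but underexplored question: can watermarked images remain editable without compromising watermark integrity? We propose SafeMark, a framework for watermark-preserving text-guided image editing that explicitly integrates watermark integrity into the editing process. Specifically, SafeMark adds a thresholded watermark-decoding loss directly to the diffusion editor's training objective and fine-tunes the editor so that semantically valid edits also preserve the embedded watermark in the final output. The design has a clear information-theoretic justification: maintaining high bit accuracy on the edited image lower-bounds the mutual information that the editing channel preserves between the watermark and the edited output, the quantity that fundamentally governs watermark recoverability. SafeMark is compatible with differentiable diffusion editors and requires no architectural changes. Extensive evaluation across multiple datasets, text-guided editing methods, and post-edit distortion settings shows that SafeMark achieves high watermark bit accuracy in diverse editing settings while maintaining high-quality semantic edits, without sacrificing robustness to common post-edit distortions. These results indicate that semantic editability and watermark integrity are fundamentally compatible, making image provenance trustworthy in generative editing pipelines.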

English abstract

This paper investigates a fundamental yet underexplored question: can watermarked images remain editable without compromising watermark integrity? We propose SafeMark, a framework for watermark-preserving text-guided image manipulation that explicitly integrates watermark integrity into the editing process. Specifically, SafeMark adds a thresholded watermark-decoding loss directly to the diffusion editor's training objective, fine-tuning the editor so that semantically valid edits also preserve the embedded watermark at the final output. This design admits a clean information-theoretic justification: maintaining high bit-accuracy on the edited image lower-bounds the mutual information that the editor channel preserves between watermark and edited output, the quantity that fundamentally controls watermark recoverability. SafeMark is compatible with differentiable diffusion-based editors, and requires no architectural modification. Extensive evaluations across multiple datasets, text-guided editing methods, and post-edit distortion settings demonstrate that SafeMark achieves high watermark bit accuracy across diverse editing settings while maintaining high-quality semantic edits, without sacrificing robustness to common post-edit distortions. These results demonstrate that semantic editability and watermark integrity are fundamentally compatible, enabling trustworthy image provenance in generative editing pipelines.
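The information-theoretic claim in this abstract can be illustrated with a standard Fano-type argument. The sketch below is not taken from the paper and its exact statement may differ. It assumes a watermark of k independent, uniformly random bits and a deterministic decoder applied to the edited image; all symbols are introduced here for illustration only.

```latex
% Assumptions (not from the abstract): W = (W_1, ..., W_k) are i.i.d. uniform bits,
% Y is the edited image, \hat{W}_i is the decoder's estimate of bit i computed from Y,
% p_i = \Pr[\hat{W}_i \neq W_i], and \bar{p} = \frac{1}{k}\sum_{i=1}^{k} p_i.
% h(p) = -p\log_2 p - (1-p)\log_2(1-p) is the binary entropy, in bits.
\begin{align*}
I(W;Y) &= H(W) - H(W \mid Y) \\
       &\ge k - \sum_{i=1}^{k} H(W_i \mid Y)
            && \text{(subadditivity of conditional entropy)} \\
       &\ge k - \sum_{i=1}^{k} H(W_i \mid \hat{W}_i)
            && \text{($\hat{W}_i$ is a function of $Y$)} \\
       &\ge k - \sum_{i=1}^{k} h(p_i)
            && \text{(Fano's inequality for a binary variable)} \\
       &\ge k\,\bigl(1 - h(\bar{p})\bigr)
            && \text{(concavity of $h$)}
\end{align*}
```

Under these assumptions, an average bit accuracy of 95% (so that the average error rate is 0.05) guarantees roughly 0.71 bits of mutual information per embedded bit, and the bound approaches the full k bits as bit accuracy approaches 100%.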

2605.19510 2026-05-20 cs.CV

Return of Frustratingly Easy Unsupervised Video Domain Adaptation

Pengfei Wei, Yiqun Sun, Zhiqiang Xu, Yiping Ke, Lawrence B. Hsieh

Affiliations * Magellan Technology Research Institute (MTRI); Mohamed bin Zayed University of Artificial Intelligence; Nanyang Technological University

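AI summary This paper proposes MetaTrans, a simple unsupervised video domain adaptation method that, through a carefully designed model architecture, handles the spatial and temporal divergence of cross-domain videos separately, achieving substantial performance gains on multiple cross-domain action recognition tasks.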

Comments To appear in ICML 2026

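AI-generated Chinese abstract (English translation)

Unsupervised video domain adaptation (UVDA) is a practical but under-studied problem. In this paper we propose a frustratingly easy UVDA method called MetaTrans. Specifically, MetaTrans adopts a concise learning objective containing only two basic loss terms. Despite the simplicity of the objective, MetaTrans embodies an advanced UVDA idea: handling the spatial and temporal divergence of cross-domain videos separately through a subtle model architecture design. By implementing a temporal-static subtraction module, MetaTrans effectively removes spatial and temporal divergence. Extensive empirical evaluation, particularly on various cross-domain action recognition tasks, shows substantial absolute gains in adaptation performance and significantly better relative gains than state-of-the-art UVDA baselines.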

English abstract

Unsupervised video domain adaptation (UVDA) is a practical but under-explored problem. In this paper, we propose a frustratingly easy UVDA method, called MetaTrans. Specifically, MetaTrans adopts a concise learning objective that contains only two fundamental loss terms. Despite the simplicity of the learning objective, MetaTrans embodies an advanced UVDA idea, that is, handling the spatial and temporal divergence of cross-domain videos separately, through a subtle model architecture design. By implementing a temporal-static subtraction module, MetaTrans effectively removes spatial and temporal divergence. Extensive empirical evaluations, particularly on various cross-domain action recognition tasks, show substantial absolute adaptation performance enhancement and significantly superior relative performance gain compared with state-of-the-art UVDA baselines.