arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.12334 2026-06-11 cs.LG cs.RO New submission

Fourier Features Let Agents Learn High Precision Policies with Imitation Learning

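Balázs Gyenes, Emiliyan Gospodinov, Jan Frieling, Enrico Krohmer, Nicolas Schreiber, Xiaogang Jia, Niklas Freymuth, Gerhard Neumann

Affiliations: Karlsruhe Institute of Technology; FZI Research Center for Information Technology

AI summary: Proposes using a Fourier feature mapping in the point cloud encoder to address the difficulty that the low-frequency bias of neural networks causes for high-precision manipulation, substantially improving performance on multiple benchmarks and on a real robot.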

Comments Published as a conference paper at ICML 2026

Abstract

High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues. Policies that leverage 3D information directly, such as those based on point clouds, offer a stronger geometric prior than purely image-based ones, yet their performance remains highly task-dependent. We hypothesize that this discrepancy may be due to the spectral bias of neural networks towards learning low-frequency functions, which especially affects architectures conditioned on slow-moving Cartesian features. We thus propose to map point clouds from Cartesian space into high-dimensional Fourier space, effectively equipping the point cloud encoder with direct access to high-frequency features. We experimentally validate the use of Fourier features on challenging manipulation tasks from the RoboCasa and ManiSkill3 benchmarks and on a real robot setup. Despite their simplicity, we find that Fourier features provide significant benefits across diverse encoder architectures and benchmarks and are robust across hyperparameters. Our results indicate that Fourier features let policies leverage geometric details more effectively than Cartesian features, showing their potential as a general-purpose tool for point cloud-based imitation learning. We provide source code and videos on our project page: https://fourier-il.github.io/fourier-il
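
The mapping described in the abstract above can be illustrated with a minimal sketch. It assumes an axis-aligned sin/cos encoding at octave-spaced frequencies, in the style of NeRF positional encodings; the paper's actual mapping, frequency choice, and encoder are not stated in this listing, and `fourier_features`, `num_bands`, and `base_freq` are hypothetical names.

```python
import numpy as np


def fourier_features(points, num_bands=6, base_freq=1.0):
    """Map (N, 3) Cartesian points to (N, 3 * 2 * num_bands) Fourier features.

    Assumed mapping (not taken from the paper): per-axis sin/cos at
    frequencies base_freq * 2**k * pi for k = 0 .. num_bands - 1.
    """
    points = np.asarray(points, dtype=np.float64)
    freqs = base_freq * (2.0 ** np.arange(num_bands)) * np.pi   # (B,)
    angles = points[:, :, None] * freqs[None, None, :]          # (N, 3, B)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (N, 3, 2B)
    return feats.reshape(points.shape[0], -1)


if __name__ == "__main__":
    # A synthetic point cloud with coordinates normalized to [-1, 1].
    cloud = np.random.default_rng(0).uniform(-1.0, 1.0, size=(1024, 3))
    print(fourier_features(cloud).shape)
```

With the defaults, each 3D point becomes a 36-dimensional vector (3 axes, 6 bands, sin and cos) that would be passed to the point cloud encoder in place of the raw coordinates.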

2606.12332 2026-06-11 cs.CL cs.LG New submission

Measuring Semantic Progress in Multi-turn Dialogue via Information Gain

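Paul He, Shiva Kasiviswanathan, Dominik Janzing

Affiliations: NTU Singapore; Amazon; Amazon Research, Tübingen, Germany

AI summary: Proposes an information-theoretic information gain metric that uses a Gaussian approximation in embedding space to quantify question-relevant semantic progress in multi-turn dialogue; it requires no LLM inference and achieves competitive agreement with human judgments on several benchmarks.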

Comments Preprint. 26 pages

Abstract

Evaluating multi-turn dialogue is challenging because quality emerges across turns rather than within individual responses. We focus on a key dimension of information-seeking dialogue: semantic progress, defined as the accumulation of new, question-relevant, and non-redundant information over the course of a conversation. We formalize semantic progress as question-conditioned uncertainty reduction and introduce an information-theoretic metric that approximates it in embedding space. Our main estimator uses a tractable Gaussian formulation with closed-form updates, while a complementary maximum-entropy argument shows why log-determinant structure arises more broadly when only second-order embedding information is retained. This formulation yields desirable theoretical properties, including monotonicity, additive decomposition of total information gain across turns, and diminishing returns for redundant evidence. Unlike LLM-as-a-judge approaches, our metric requires no autoregressive inference at evaluation time and is fully reproducible for a fixed embedding model. Experiments on MT-Bench, Chatbot Arena, and UltraFeedback show that the proposed metric achieves competitive agreement with human judgments despite targeting only semantic progress, with improved alignment on MT-Bench and UltraFeedback compared to several LLM-based judges. Notably, the method remains effective with lightweight embedding models under CPU-only execution, indicating that semantic progress can be captured without reliance on large model capacity.
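
The properties listed in the abstract above (monotonicity, additive decomposition, diminishing returns) can be seen in a generic linear-Gaussian sketch. This is an assumption for illustration, not the paper's estimator: each turn is treated as a noisy scalar observation along its embedding direction, the question-conditioning step is omitted, and `information_gains`, `prior_var`, and `noise_var` are hypothetical names. The per-turn gain is 0.5 * log(1 + e^T Sigma e / noise_var), and the covariance Sigma receives a closed-form rank-one update.

```python
import numpy as np


def information_gains(embeddings, prior_var=1.0, noise_var=0.1):
    """Per-turn Gaussian information gain in nats, plus the final covariance.

    Assumed model (illustrative only): a latent vector with prior
    N(0, prior_var * I), observed once per turn along that turn's embedding
    with Gaussian noise of variance noise_var.
    """
    embeddings = np.asarray(embeddings, dtype=np.float64)
    cov = prior_var * np.eye(embeddings.shape[1])
    gains = []
    for e in embeddings:
        cov_e = cov @ e
        explained = float(e @ cov_e)                 # e^T Sigma e
        gains.append(0.5 * np.log1p(explained / noise_var))
        # Closed-form rank-one posterior covariance update.
        cov = cov - np.outer(cov_e, cov_e) / (noise_var + explained)
    return np.array(gains), cov


if __name__ == "__main__":
    turns = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])  # second turn repeats the first
    gains, _ = information_gains(turns)
    print(gains)
```

In this toy run the repeated second turn earns less than the first, while the third turn, which points in a new direction, earns as much as the first. The per-turn gains sum to half the drop in log-determinant of the covariance.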

2606.12329 2026-06-11 cs.AI New submission

PROJECTMEM: A Local-First, Event-Sourced Memory and Judgment Layer for AI Coding Agents

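Ripon Chandra Malo, Tong Qiu

Affiliations: University of Utah

AI summary: Presents PROJECTMEM, a local-first, event-sourced memory and judgment layer that records an event log and generates compact summaries to help AI coding agents avoid repeating mistakes, an approach the authors call Memory-as-Governance.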

Comments 12 pages, 5 figures, 1 table. Code: https://github.com/riponcm/projectmem

Abstract

AI coding assistants now support a growing share of software work, from quick scripts to production applications. Yet these agents remain largely stateless: each new session re-reads project files, re-derives prior decisions, and - most costly - may repeat debugging attempts that already failed. Reconstructing this context can consume an estimated 5,000-20,000 tokens per session; the bottleneck is often not model capability but missing project memory. We present projectmem, an open-source, local-first memory and judgment layer for AI coding agents. projectmem records development as an append-only, plain-text event log of typed events - issues, attempts, fixes, decisions, and notes - and deterministically projects that log into compact, AI-readable summaries served through the Model Context Protocol (MCP). Beyond storage, projectmem adds a deterministic pre-action gate that warns an agent before it repeats a previously failed fix or edits a known-fragile file. We frame this as Memory-as-Governance: memory that does not merely answer the agent but acts on its next action. The system runs fully offline with no telemetry; its immutable log also serves as a provenance trail for reproducible, auditable AI-assisted development. projectmem ships as a three-dependency Python package (14 MCP tools, 19 CLI commands, 37 automated tests) and is evaluated through a two-month self-study across 10 projects comprising 207 logged events. Source code: https://github.com/riponcm/projectmem.
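
The idea of an append-only typed event log with a deterministic pre-action gate can be sketched as follows. This is not projectmem's API or file format: the JSON Lines storage, the `outcome` field, the exact-match rule, and the function names `append_event`, `read_events`, and `pre_action_warnings` are all assumptions made for illustration.

```python
import json
from pathlib import Path

# Event types named in the abstract.
EVENT_TYPES = {"issue", "attempt", "fix", "decision", "note"}


def append_event(log_path, event_type, text, **fields):
    """Append one typed event as a JSON line; earlier lines are never rewritten."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    record = {"type": event_type, "text": text, **fields}
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record, sort_keys=True) + "\n")


def read_events(log_path):
    """Replay the log in order; the result depends only on the file contents."""
    path = Path(log_path)
    if not path.exists():
        return []
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def pre_action_warnings(log_path, proposed_fix):
    """Hypothetical gate: warn when the proposed fix matches a failed attempt.

    Matching is a case-insensitive exact comparison, a deliberate simplification.
    """
    key = proposed_fix.strip().lower()
    return [
        f"previously failed: {ev['text']}"
        for ev in read_events(log_path)
        if ev["type"] == "attempt"
        and ev.get("outcome") == "failed"
        and ev["text"].strip().lower() == key
    ]
```

Because the gate only reads and replays the log, the same log always yields the same warnings, which is the deterministic behavior the abstract emphasizes.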

2606.12320 2026-06-11 cs.AI cs.CC cs.CR cs.SE New submission

A Five-Plane Reference Architecture for Runtime Governance of Production AI Agents

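Krti Tallam

Affiliations: Kamiwaza

AI summary: Observing that production AI agents break the data-boundary assumption of traditional governance, proposes a five-plane reference architecture consisting of a reasoning plane and four enforcement planes, which uses composable primitives for runtime governance, forecloses seven threats, and states and argues for four correctness invariants.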

Comments 65 pages, 3 figures, 5 tables. Reference architecture with a reference implementation of the policy-engine core and microbenchmark results; full-system evaluation identified as future work

Abstract

Enterprise security was built to govern data boundaries: the protected surface was data at rest and in transit, and the controls -- access control, data-loss prevention, perimeter inspection -- governed crossings of that boundary. Production AI agents dissolve this assumption. An agent reads context, calls tools, invokes connectors, and modifies systems of record on an enterprise's behalf, so risk moves inside the workflow, into sequences of individually-permitted actions that may transform a business process no one authorized. Existing policy engines do not extend to this regime: they evaluate request-time decisions against atomic principals, where agentic systems require stateful evaluation against composite principals whose authority attenuates through delegation chains. We present a reference architecture for the runtime governance of production agents, built from four composable primitives: a five-plane decomposition (a reasoning plane that adjudicates intent, and four enforcement planes -- network, identity, endpoint, data -- that realize the decision), stop-anywhere mediation, composite principals with capability attenuation, and audit as a structured evidence substrate. We define a taxonomy of six interruption primitives that generalize allow and deny, state and argue for four correctness invariants, and demonstrate the foreclosure of seven production-agent threats across five concrete workflows. A reference implementation of the policy-engine core supplies measured evidence: attenuation correctness and evidence reconstructability hold on every trial, adjudication runs in single-digit microseconds, and the audit substrate's tamper-evidence behaves exactly as designed. We are explicit about scope: the architecture governs delegated action, not model behavior, and a full-system evaluation against a live agent benchmark is the invited next step.
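
One way to picture composite principals whose authority attenuates through delegation chains is the sketch below. It assumes that attenuation means set intersection along the chain, which the abstract does not specify; the `Principal` class, its fields, and the capability strings are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Principal:
    name: str
    granted: frozenset                        # capabilities requested at this hop
    delegator: Optional["Principal"] = None   # who delegated to this principal

    def effective(self) -> frozenset:
        """Capabilities usable at this hop: never more than any ancestor holds."""
        if self.delegator is None:
            return self.granted
        return self.granted & self.delegator.effective()


def is_permitted(principal: Principal, capability: str) -> bool:
    """Stateful check against the whole delegation chain, not a single identity."""
    return capability in principal.effective()


if __name__ == "__main__":
    user = Principal("user", frozenset({"crm:read", "crm:write", "mail:send"}))
    agent = Principal("agent", frozenset({"crm:read", "mail:send", "db:drop"}), user)
    tool = Principal("tool", frozenset({"crm:read", "crm:write"}), agent)
    print(sorted(tool.effective()))
```

Under this assumption a capability can only be lost along the chain: the tool asks for `crm:write`, but the agent that delegated to it never held it, so the request is denied.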

2606.12319 2026-06-11 cs.CV New submission

Anatomically Conditioned Recurrent Refinement for Topology-Aware Circle of Willis Segmentation

Juraj Perić, Marija Habijan, Dario Mužević, Irena Galić, Danilo Babin, Aleksandra Pižurica

Affiliations: Faculty of Electrical Engineering, Computer Science and Information Technology, Osijek, Croatia; Clinical Medical Center Osijek, Osijek, Croatia; Ghent University, Dept. of Telecommunications and Information Processing, imec-TELIN-IPI, Ghent, Belgium; Ghent University, Dept. of Telecommunications and Information Processing, TELIN-GAIM, Ghent, Belgium

AI summary: Proposes AC2RUNet, a dual-stream architecture (static and dynamic) combined with curriculum learning, which substantially reduces Hausdorff distance and Betti number errors on the TopCoW dataset and improves topological connectivity.

Comments 9 pages, 4 figures, 1 table. Accepted at EUSIPCO 2026

Abstract

Segmenting the Circle of Willis (CoW) from Magnetic Resonance Angiography (MRA) is challenging due to complex topology and thin vascular structures that are prone to fragmentation. Standard Convolutional Neural Networks (CNNs) often fail to capture these topological constraints, resulting in "broken vessel" artifacts. To address this, we propose the Anatomically Conditioned Recurrent Refinement U-Net (AC2RUNet). Our architecture decouples segmentation into two streams: a Static Stream that extracts invariant anatomical features and a lightweight Dynamic Stream that iteratively refines topological errors over time. We further introduce a dynamic curriculum learning strategy that transitions from high-recall geometric supervision to topology-aware constraints. Validated on the TopCoW dataset, AC2RUNet substantially reduces Hausdorff Distance (4.72 mm vs 9.17 mm) and Betti number errors (0.19 vs 0.40), improving topological connectivity over the nnU-Net baseline while maintaining comparable volumetric Dice.

2606.12318 2026-06-11 cs.LG cs.AI New submission

Harness In-Context Operator Learning with Chain of Operators

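Minghui Yang, Ling Guo, Liu Yang

Affiliations: Department of Mathematics, Shanghai Normal University; Department of Mathematics, National University of Singapore

AI summary: Proposes the Chain of Operators (CHOP) framework, which builds a chain of explicit elementary transformations and a frozen ICON to improve the generalization of in-context operator networks on out-of-distribution operator tasks without fine-tuning, reducing inference error on a scalar conservation law and a mean-field control problem.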

Abstract

Neural operators approximate mappings between function spaces, but often generalize poorly to other operators and usually require fine-tuning or retraining. In-Context Operator Networks (ICON) address this issue by prompting the model with numerical context, so that the model learns specific operators from prompts and adapts to different operators without fine-tuning. However, ICON may still fail to generalize to out-of-distribution (OOD) operator tasks. Inspired by the success of harness engineering for Large Language Models (LLMs), we introduce Chain of Operators (CHOP), a framework that harnesses a frozen ICON for OOD operator tasks without updating its parameters. Specifically, CHOP constructs a chain of operators consisting of explicit elementary transformations and the frozen ICON. Experiments on a scalar conservation law and a mean-field control problem show that CHOP reduces relative inference error over direct ICON evaluation, while each operator in the chain remains interpretable and in closed form. A chain constructed on one PDE family further generalizes to a different family, indicating shared mechanisms across harness systems.

2606.12316 2026-06-11 cs.CV New submission

Slots, Transitions, Loops: Learning Composable World Models for ARC

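Gege Gao, Bernhard Schölkopf, Andreas Geiger

Affiliations: University of Tübingen; ETH Zürich

AI summary: Proposes the Loop-OWM architecture, which learns the visual-symbolic rules of ARC tasks through color-prototype slots, demonstration-conditioned task summaries, and a looped transition model, outperforming baselines on ARC-1 and ARC-2.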

Abstract

ARC tests in-context rule induction: given a few input-output demonstrations, a model must infer the hidden rule and apply it to a new query. While many approaches express ARC rules through language, code, or symbolic programs, ARC itself is visual-symbolic: rules appear as grid transitions over objects, colors, shapes, and spatial relations. We introduce Loop-OWM, an object-centric world-modeling architecture that learns these rules as composable transitions over structured states. It combines color-prototype slots, demonstration-conditioned task summaries, and a looped transition model with dense propagation and slot-conditioned correction. On both ARC-1 and ARC-2, Loop-OWM outperforms non-looped and looped baselines with comparable or fewer parameters. These results suggest that ARC rules can be learned not only as language descriptions or searched programs, but also as transitions over visual-symbolic world states.

2606.12306 2026-06-11 cs.RO New submission

UGV-Conditioned Multi-UAV Informative Planning on a Shared Exposure Belief

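Lars Oerlemans, Moji Shi, Marija Popovic

Affiliations: MAVLab, Faculty of Aerospace Engineering, TU Delft

AI summary: Proposes a method for coordinating a UAV team to reduce the risk a ground vehicle faces when navigating unknown threat zones; a shared exposure belief directs sensing and reduces redundant coverage. In simulation, cumulative exposure falls by 38% and redundant coverage drops from 38.8% to 3.7%.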

Comments 8 pages, 6 figures

Abstract

Safe ground navigation in large, threat-augmented environments requires aerial support that actively reduces the risks that a ground vehicle faces along its route. Existing aerial reconnaissance systems focus on mapping or covering the environment, but do not direct sensing toward regions that are most relevant for ground vehicle safety. In this paper, we address the problem of coordinating a team of unmanned aerial vehicles (UAVs) to improve the safety of an unmanned ground vehicle (UGV) navigating through unknown threat zones. A key aspect of our approach is a shared exposure belief that is updated online from aerial observations and used jointly by the UAV team and the ground vehicle. This enables us to direct aerial sensing towards route-relevant regions while allowing the UGV to replan around newly revealed threats. We coordinate the UAV team through spatial region assignment to avoid redundant sensing. Simulation experiments show that our approach reduces cumulative UGV exposure by 38% compared to a system that does not account for hazard levels, and reduces redundant aerial coverage from 38.8% to 3.7% under our multi-UAV coordination scheme.

2606.12303 2026-06-11 cs.CV New submission

From 2D Grids to 1D Tokens: Reforming Shared Representations for Multimodal Image Fusion

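Yuchen Xian, Yunqiu Xu, Yang He, Yi Yang

Affiliations: University of Science and Technology of China

AI summary: Proposes a compact 1D token interface based on a frozen pretrained image tokenizer; Selective Token Editing (STE) sparsely updates critical tokens to steer global appearance coherence while leaving the fusion backbone unchanged, balancing global coherence and local fidelity.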

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Multimodal image fusion aims to integrate complementary information from different modalities into a fused image that preserves rich local details while maintaining globally consistent appearance. Existing approaches build shared representations on 2D feature grids, which excel at modeling local structures but offer limited leverage over image-level global appearance factors. To balance these objectives, we introduce a compact 1D token interface based on a frozen pretrained image tokenizer for modeling non-local appearance/base factors. Rather than using the tokenizer as a reconstruction backbone, our design uses the 1D token space as a global carrier while retaining the 2D spatial pathway for local structure restoration. Specifically, we introduce Selective Token Editing (STE), which sparsely updates/replaces a small set of critical tokens, providing a lightweight mechanism to steer global appearance coherence while keeping the fusion backbone unchanged and avoiding extra losses. Experiments on four commonly used benchmarks show that our method achieves the best overall performance, with consistent, multi-metric improvements in both global coherence and local fidelity. Project page: https://zju-xyc.github.io/1D-Fusion-Project-Page/

2606.12300 2026-06-11 cs.CV cs.AI New submission

Natural-Language Temporal Grounding in Hour-Long Videos is a Search Problem: A Benchmark and Empirical Decomposition

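Sukmin Seo, Geewook Kim

Affiliations: NAVER Cloud AI; KAIST AI

AI summary: For natural-language temporal grounding in hour-long videos, argues that search rather than recognition is the main bottleneck, releases ExtremeWhenBench, the first open hour-scale grounding benchmark, and shows that a retrieve-then-ground hybrid substantially improves performance.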

Comments 10 pages, 6 figures, Code and benchmark: https://github.com/naver-ai/ExtremeWhenBench

Abstract

Temporal grounding--returning the interval $[t_s, t_e]$ for a natural-language query over a video--is the language interface to long-form video, yet has been studied on short videos; the dynamics of hour-scale natural-language grounding remain underexplored. We take the position that at hour-scale, the binding constraint is search, not recognition: Video-LLMs are bottlenecked not by localizing a nearby event, but--given a natural-language query--by searching for the relevant region of a long video. To test this, we release ExtremeWhenBench, the first open hour-scale grounding benchmark (2,273 queries over 194 videos, mean 75.7 min, max 9 hr) with an open-form query distribution. Every open Video-LLM collapses while a frame-level retrieval baseline outperforms them; a failure taxonomy attributes 85% of failures to search; and a retrieve-then-ground hybrid recovers 6.7x over the monolithic Video-LLM--mirroring retrieve-then-read in open-domain QA.

2606.12299 2026-06-11 cs.RO cs.LG New submission

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

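Hyun Joe Jeong, Gokul Swamy, Andrea Bajcsy

Affiliations: Robotics Institute, Carnegie Mellon University

AI summary: Proposes a framework that interactively searches for language sequences that improve closed-loop VLA task performance and learns an improvement head that predicts when language steering will help; the head is conformalized to prevent harmful interventions.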

Comments 22 pages, 14 tables, 14 figures

Abstract

Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, while some capabilities may not be elicitable through prompting alone. As a result, both human instructions and zero-shot language models can fail to reliably steer VLAs toward successful task execution. In this work, we propose a framework that interactively searches for language sequences that improve closed-loop VLA task performance, distills these sequences into a test-time language feedback policy (LFP), and learns an improvement head that predicts when language steering will improve performance. We conformalize this improvement head to prevent harmful steering interventions, where the LFP decreases task performance relative to the original instruction on out-of-distribution scenarios. Crucially, our approach operates on arbitrary frozen pre-trained VLAs, requiring neither access to the original training distribution nor fine-tuning of the underlying model. On seen environments, our conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% in hardware. On visual and semantic perturbations, our conformalized LFP has strong harmlessness guarantees, and produces recovery behaviors not observed with open-loop prompting.

2606.12295 2026-06-11 cs.CV cs.CL cs.IR New submission

Findings of the MAGMaR 2026 Shared Task

Alexander Martin, Dengjia Zhang, Joel Brogan, Francis Ferraro, Jeremy Gwinnup, Reno Kriz, Teng Long, Kenton Murray, Andrew Yates, Xiang Xiang

Affiliations: Johns Hopkins University; OpenAI; University of Maryland, Baltimore County; Air Force Research Laboratory; Human Language Technology Center of Excellence, Johns Hopkins University; University of Amsterdam; Huazhong University of Science and Technology

AI summary: Presents the results of the MAGMaR 2026 shared task, which covers video retrieval and article generation grounded in retrieved videos; all submitted retrieval systems beat a baseline derived from last year's winning system.

Comments Findings of the 2nd workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR); Resources at this url: https://github.com/rekriz11/MAGMAR_2026

Abstract

This overview paper presents the results of the shared task for the second workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR). In this shared task participants submitted systems focused on either (i) video retrieval or (ii) grounded generation of articles given retrieved videos. Teams could submit to either task. For the retrieval task, we had 2 participating teams that submitted a total of 17 systems -- all of which beat a baseline derived from the winner of last year's shared task. On the generation side, we had 4 teams submit 16 systems. All teams had at least one generated report that was labeled the best by a human annotator.

2606.12294 2026-06-11 cs.CV eess.IV New submission

Bridging the Modality Gap in Forensic Image Retrieval

Ricardo González-Gazapo, Annette Morales-González, Yoanna Martínez-Díaz, Heydi Méndez-Vázquez, Milton García-Borroto

Affiliations: Advanced Technologies Application Center (CENATAV); Centro de Sistemas Complejos, Facultad de Física, Universidad de La Habana

AI summary: Proposes a unified retrieval framework that uses a multimodal large language model to generate textual descriptions and fuses visual and text similarity scores, improving retrieval precision and robustness on forensic tasks such as tattoo and face-sketch retrieval.

Comments 23 pages, 5 figures, paper submitted to Elsevier journal

Abstract

Automated image retrieval plays an increasingly critical role in modern forensic analysis, supporting investigative workflows that rely on efficient comparison of visual evidence. While prior work has focused primarily on developing and optimizing multimodal retrieval systems, limited attention has been paid to evaluating the forensic applicability of these technologies across diverse real-world scenarios. In this study, we present a unified retrieval framework adapted to four key forensic tasks: (1) tattoo image retrieval given a tattoo query image; (2) tattoo retrieval guided by human-expert textual descriptions, modelling the common situation where a witness verbally describes a tattoo; (3) tattoo retrieval from hand-drawn sketches; and (4) face retrieval from forensic face sketches. Our system leverages a multimodal large language model (MLLM) to automatically generate structured textual descriptions for all queries and gallery images, followed by sentence-transformer embedding for text-based comparison. We evaluate retrieval using visual-only embeddings, text-only embeddings and a multimodal fusion strategy that combines text- and image-based similarity scores derived from state-of-the-art visual feature extractors relevant to each task. The fusion of modalities consistently improves retrieval precision and robustness, especially in scenarios where visual information is limited or noisy (e.g., sketches, partial tattoos, or fragmented witness statements). This work highlights the forensic value of a unified multimodal retrieval pipeline and demonstrates how modern MLLMs can operationalize challenging forensic tasks that traditionally rely on manual expert analysis. Our results position multimodal retrieval as a promising tool for supporting investigative workflows involving tattoos, facial composites, and witness descriptions.
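
The score-level fusion mentioned in the abstract above might look like the following sketch. The min-max normalization, the weighted sum, and the `text_weight` parameter are assumptions, since the listing does not say how the two similarity scores are combined; `min_max`, `fuse_scores`, and `rank` are hypothetical names.

```python
import numpy as np


def min_max(scores):
    """Rescale scores to [0, 1] so the two modalities are comparable."""
    scores = np.asarray(scores, dtype=np.float64)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span


def fuse_scores(visual, text, text_weight=0.5):
    """Weighted sum of normalized visual and text similarities per gallery item."""
    if not 0.0 <= text_weight <= 1.0:
        raise ValueError("text_weight must lie in [0, 1]")
    return (1.0 - text_weight) * min_max(visual) + text_weight * min_max(text)


def rank(scores):
    """Gallery indices sorted from most to least similar."""
    return np.argsort(-np.asarray(scores, dtype=np.float64), kind="stable")


if __name__ == "__main__":
    visual = [0.9, 0.8, 0.1]   # e.g. a sketch query matched against gallery images
    text = [0.2, 0.9, 0.3]     # similarity of generated descriptions
    print(rank(fuse_scores(visual, text)))
```

In this toy case the visual scores alone would rank item 0 first, but the text description strongly favors item 1, so the fused ranking puts item 1 on top.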

2606.12286 2026-06-11 cs.CV New submission

CellNet -- Localizing Cells using Sparse and Noisy Point Annotations

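Benjamin Eckhardt, Dmytro Fishman, Stuart Fawke, Andrew Curtis, Bo Fussing, Constantin Pape

Affiliations: University of Göttingen; Wellcome Sanger Institute; University of Tartu

AI summary: Proposes CellNet, a regression-based deep learning algorithm that uses sparse point annotations to detect and count cells in phase-contrast microscopy images, reducing annotation effort and offering a promising alternative to zero-shot methods in low-data regimes.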

Comments Conference poster at Biology at Scale: From Variants to Cellular Programs and Functions

Abstract

Counting living cells is an important step in many biological research workflows. Our collaborators at the Wellcome Sanger Institute study vital genes in humans via large-scale saturation genome editing screening, which requires counting cells many times over. Computer-vision-based automation is crucial for high throughput and resource efficiency. In this work, we develop a regression-based deep learning computer vision algorithm to detect and count cells in phase-contrast microscopy images. To reduce annotation effort, which in practice often becomes a bottleneck, we focus on counting cells using only sparse point annotations, which are fast and easy to acquire. By comparison to state-of-the-art zero-shot methods, we show that regression-based counting is a promising alternative in low-data regimes. Through developing methods to automatically count living cells in microscopy images, we contribute to valuable research on the human genome. The code is available at https://github.com/beijn/cellnet.

2606.12282 2026-06-11 cs.SD cs.LG New submission

PianoKontext: Expressive Performance Rendering from Deadpan Context

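Dmitrii Gavrilev

Affiliations: Dmitrii Gavrilev

AI summary: Proposes PianoKontext, a flow-matching piano performance rendering model that uses Dynamic Time Warping to align latent representations of scores and performances and generates variable-length expressive performances.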

Comments ICML 2026 Workshop on Machine Learning for Audio (Oral)

Abstract

Expressive performance rendering (EPR) aims to generate realistic performances constrained on sequences of notes. However, flow matching audio editing models manipulate only synchronized music samples of the same duration, limiting their understanding of expressive timing. We introduce PianoKontext, a flow matching rendering model for classical piano music that generates variable-length performances in the latent space of a pretrained Music2Latent model. We synthesize MIDI scores into deadpan audio and employ Dynamic Time Warping (DTW) in the latent space to construct paired data for training. The aligned embeddings are concatenated in DiT blocks, allowing for a simple and effective learning of the dependencies between the score and performances. Audio samples are available at our demo page: https://realfolkcode.github.io/pianokontext_demo/.
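
The alignment step named in the abstract above can be illustrated with textbook Dynamic Time Warping over two sequences of latent frames. This sketch assumes Euclidean frame distance and the standard three-step recursion; the paper's distance, step constraints, and Music2Latent features are not given in this listing, and `dtw_path` is a hypothetical name.

```python
import numpy as np


def dtw_path(score_latents, perf_latents):
    """Align two (T, D) latent sequences; return (total_cost, list of index pairs)."""
    a = np.asarray(score_latents, dtype=np.float64)
    b = np.asarray(perf_latents, dtype=np.float64)
    n, m = len(a), len(b)
    dist = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (n, m)

    # acc[i, j] = cheapest cost of aligning a[:i] with b[:j].
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]
            )

    # Backtrack from the end to the first frame pair.
    i, j = n, m
    path = [(i - 1, j - 1)]
    while (i, j) != (1, 1):
        steps = [
            (acc[i - 1, j - 1], i - 1, j - 1),
            (acc[i - 1, j], i - 1, j),
            (acc[i, j - 1], i, j - 1),
        ]
        _, i, j = min(steps)
        path.append((i - 1, j - 1))
    path.reverse()
    return float(acc[n, m]), path
```

The returned path pairs every score frame with one or more performance frames. A held note in the performance shows up as one score index repeated across several performance indices, which is how sequences of different lengths can be turned into paired training data.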

2606.12278 2026-06-11 cs.CV cs.LG New submission

Finding Sparse Subnetworks in One Training Cycle via Progressive Magnitude-Based Pruning

Romana Qureshi, Hafida Benhidour, Said Kerrache, Nahlah Aljeraisy

Affiliations: King Abdullah University of Science and Technology; University of Jeddah; King Fahd University of Petroleum and Minerals; King Saud University

AI summary: Proposes progressive magnitude-based pruning, which increases sparsity linearly within a single training cycle and updates masks based on weight magnitudes; evaluated on CIFAR-10 and MNIST, it compares favorably with baselines such as LTH, SNIP, and GraSP.

详情
AI中文摘要

神经网络剪枝通过移除不太重要的参数来减小模型大小,同时旨在保持预测性能。尽管彩票假说(LTH)表明,当从合适的初始化训练时,稀疏子网络可以匹配密集网络,但其迭代剪枝过程需要多个完整的训练周期。本工作评估了渐进式幅度剪枝作为一种单周期替代方案。该方法在训练期间使用线性调度逐渐增加稀疏度,并基于活跃权重幅度更新剪枝掩码。我们在CIFAR-10和MNIST上,针对ResNet、VGG风格和LeNet架构进行了系统实验,将所提方法与代表性的迭代和基于初始化的剪枝基线(包括LTH、SNIP和GraSP)进行比较。在CIFAR-10上,该方法在ResNet-18上以72.9%稀疏度达到95.12%的准确率,而LTH报告为90.5%。在极端稀疏度下,它在VGG类架构上以97%稀疏度达到93.13%的准确率,而SNIP约为92.0%;在VGG-19上以97.97%稀疏度达到93.44%的准确率,而GraSP在98%稀疏度下为92.19%。在ResNet-18上的稀疏度-准确率分析进一步表明,在70-85%稀疏度范围内,准确率保持在密集基线的0.1个百分点以内。这些结果表明,在所评估的设置下,渐进式幅度剪枝为神经网络稀疏化提供了一种有效的单周期方法。

英文摘要

Neural network pruning reduces model size by removing less important parameters while aiming to preserve predictive performance. Although the Lottery Ticket Hypothesis (LTH) shows that sparse subnetworks can match dense networks when trained from suitable initializations, its iterative pruning procedure requires multiple complete training cycles. This work evaluates progressive magnitude-based pruning as a single-cycle alternative. The method gradually increases sparsity during training using a linear schedule and updates pruning masks based on active weight magnitudes. We conduct systematic experiments on CIFAR-10 and MNIST across ResNet, VGG-style, and LeNet architectures, comparing the proposed method with representative iterative and initialization-based pruning baselines, including LTH, SNIP, and GraSP. On CIFAR-10, the method achieves 95.12% accuracy on ResNet-18 at 72.9% sparsity, compared with 90.5% reported for LTH. At extreme sparsity, it achieves 93.13% accuracy on a VGG-like architecture at 97% sparsity, compared with approximately 92.0% for SNIP, and 93.44% accuracy on VGG-19 at 97.97% sparsity, compared with 92.19% for GraSP at 98% sparsity. A sparsity-accuracy analysis on ResNet-18 further shows that accuracy remains within 0.1 percentage points of the dense baseline across 70-85% sparsity. These results indicate that progressive magnitude-based pruning provides an effective single-cycle approach for neural network sparsification under the evaluated settings.
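
The linear schedule and magnitude-based mask update described in the abstract can be sketched roughly as below. This is an illustration under simplifying assumptions, not the authors' implementation: the weights are a flat list that is not retrained between updates, the ramp starts at zero sparsity, and the names `linear_sparsity` and `update_mask` are hypothetical.

```python
def linear_sparsity(step, total_steps, final_sparsity):
    """Target sparsity at `step`, ramping linearly from 0 to `final_sparsity`."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return final_sparsity * frac

def update_mask(weights, mask, target_sparsity):
    """Prune the smallest-magnitude active weights until the target is met.

    Already-pruned weights stay pruned; only active weights are ranked.
    """
    n = len(weights)
    want_pruned = int(round(target_sparsity * n))
    active = [i for i in range(n) if mask[i]]
    to_prune = want_pruned - (n - len(active))
    new_mask = list(mask)
    if to_prune <= 0:
        return new_mask
    active.sort(key=lambda i: abs(weights[i]))
    for i in active[:to_prune]:
        new_mask[i] = 0
    return new_mask

# Toy run: ten weights, 70% final sparsity reached over ten steps.
# A real training loop would also update `weights` between mask updates.
weights = [0.9, -0.05, 0.4, 0.01, -0.7, 0.2, -0.3, 0.08, 0.6, -0.15]
mask = [1] * len(weights)
for step in range(11):
    mask = update_mask(weights, mask, linear_sparsity(step, 10, 0.7))
```

After the final step only the three largest-magnitude weights remain active. In actual training the ranking could change between updates because the surviving weights keep moving.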

2606.12277 2026-06-11 cs.LG 新提交

Finding Multiple Interpretations in Datasets

在数据集中寻找多种解释

Matthew Chak, Paul Anderson

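Affiliations Department of Computer Science, California Polytechnic State University

AI summary Proposes a method for finding sets of models with similar performance but different context-aware characteristics, in order to extract insight into the underlying phenomenon.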

详情
AI中文摘要

在本文中,我们提出了一种方法,用于寻找在损失/准确率测量方面表现相似但具有高度不同上下文感知特征的模型集。通过在METABRIC数据集上的实验,我们表明所提出的方法找到了多个模型,这些模型的基因表达与对照组方法找到的模型高度不同,且没有性能损失。我们认为,只要目标是分析模型的任何全局特征以提取对正在研究的潜在现象的洞察,所提出的方法就很重要。

英文摘要

In this paper, we propose an approach to finding sets of similar-performing models (in terms of loss/accuracy measurements) with highly different context-aware characteristics. Through experiments on the METABRIC dataset, we show that the proposed method finds multiple models whose gene expressions differ markedly from those found by the control methodology, without performance penalties. We argue that the proposed methodology is important whenever one aims to analyze any global characteristic of a model to extract insight into the underlying phenomenon being studied.

2606.12273 2026-06-11 cs.CL 新提交

Beyond Fully Random Masking: Attention-Guided Denoising and Optimization for Diffusion Language Models

超越完全随机掩码:扩散语言模型的注意力引导去噪与优化

Jia Deng, Junyi Li, Wayne Xin Zhao, Jinpeng Wang, Hongyu Lu, Ji-Rong Wen

Affiliations Gaoling School of Artificial Intelligence, Renmin University of China; Department of Data Science, City University of Hong Kong; Meituan; WeChat, Tencent; Beijing Key Laboratory of Research on Large Models and Intelligent Governance

AI summary Proposes the AGDO framework, which uses attention structure to guide the denoising order and emphasize critical tokens, improving the reasoning performance of diffusion language models on mathematical and coding benchmarks.

Comments 13 pages. Accepted to ACL 2026 Main Conference

详情
AI中文摘要

扩散大语言模型(dLLMs)通过并行解码提供了自回归模型的高效替代方案,然而现有的后训练方法大多依赖随机掩码策略,忽略了内在的令牌依赖关系。在这项工作中,我们对dLLMs中的注意力进行了实证分析,表明对未掩码上下文关注更强的令牌表现出更高的生成稳定性,并在推理中发挥关键作用。受这些发现启发,我们提出了AGDO,一种注意力引导的去噪与优化框架,将训练和优化与注意力导出的依赖关系对齐。AGDO基于注意力结构确定去噪顺序,并在监督微调和强化学习过程中强调注意力关键令牌。在数学和编码基准上的实验表明,AGDO持续提升推理性能,优于dLLMs的最先进后训练方法。

英文摘要

Diffusion large language models (dLLMs) offer an efficient alternative to autoregressive models through parallel decoding, yet existing post-training methods largely rely on random masking strategies that overlook intrinsic token dependencies. In this work, we present an empirical analysis of attention in dLLMs and show that tokens attending more strongly to unmasked context exhibit greater generation stability and play a critical role in reasoning. Motivated by these findings, we propose AGDO, an attention-guided denoising and optimization framework that aligns both training and optimization with attention-derived dependencies. AGDO determines the denoising order based on attention structure and emphasizes attention-critical tokens during supervised fine-tuning and reinforcement learning. Experiments on mathematical and coding benchmarks demonstrate that AGDO consistently improves reasoning performance, outperforming state-of-the-art post-training methods for dLLMs.
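
One way to picture an attention-derived denoising order is sketched below. The abstract does not give AGDO's exact rule, so the scoring here is an assumption: each masked position is scored by the attention mass it places on unmasked positions, and higher-scoring positions are denoised first. The function name and the toy attention matrix are hypothetical.

```python
def denoising_order(attn, is_masked):
    """Rank masked positions by attention mass on unmasked context.

    attn[i][j] is the attention weight from position i to position j
    (rows assumed to sum to 1). Returns masked indices, highest score first.
    """
    scores = {}
    for i, row in enumerate(attn):
        if is_masked[i]:
            scores[i] = sum(w for j, w in enumerate(row) if not is_masked[j])
    # Sort by descending score; break ties by position for determinism.
    return sorted(scores, key=lambda i: (-scores[i], i))

# Toy example: position 0 is the only unmasked token.
attn = [
    [1.0, 0.0, 0.0, 0.0],
    [0.6, 0.2, 0.1, 0.1],
    [0.1, 0.3, 0.3, 0.3],
    [0.4, 0.2, 0.2, 0.2],
]
is_masked = [False, True, True, True]
order = denoising_order(attn, is_masked)
```

Here position 1 attends most strongly to the unmasked token and is therefore denoised first, followed by positions 3 and 2.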

2606.12268 2026-06-11 cs.AI 新提交

The Impossibility of Eliciting Latent Knowledge

引出潜在知识的不可能性

Korbinian Friedl, Francis Rhys Ward, Paul Yushin Rapoport, Tom Everitt, Jonathan Richens

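Affiliations The London School of Economics and Political Science; Independent

AI summary Formalizes the problem of eliciting latent knowledge using causal influence diagrams and proves that no training strategy relying only on behavioural feedback can guarantee that an agent honestly reports its beliefs.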

Comments 24 pages, 3 figures. Includes proofs in appendix

详情
AI中文摘要

高级AI系统对其环境拥有广泛的知识;事实上,它们的知识可能(远远)超过其开发者或用户。因此,AI系统的一个理想属性是诚实——即它准确报告其对世界的信念。设计一个诚实的AI系统可能很困难,特别是当我们想询问关于环境中潜在变量的问题时——这些变量对与之交互的人类是隐藏的。这就引出了引出潜在知识(ELK)问题:训练AI智能体诚实报告其信念的问题。在本文中,我们使用因果影响图(CID)使ELK在形式上精确化。CID可用于描述智能体的训练环境与其主观世界表征之间的关系。我们使用CID来形式化可观测变量和潜在变量之间的区别,明确指定智能体诚实的确切含义,并正式定义目标泛化错误。我们证明,在某些情况下,开发者可以通过在训练期间提供正确的反馈来激励智能体诚实回答问题。然而,智能体泛化的一种自然但不理想的方式是提供人类会评估为真实的答案,而不是诚实的答案。我们证明了一个不可能性定理:不存在仅依赖于智能体行为且能确保产生诚实智能体的基于反馈的训练策略,即使在训练期间反馈是完美的。

英文摘要

Advanced AI systems have extensive knowledge of their environments; in fact, their knowledge may (far) exceed that of their developers or users. Consequently, a desirable property for an AI system is that it is honest -- that it accurately reports its beliefs about the world. Designing an AI system to be honest may be difficult, especially if we want to ask it questions about latent variables in the environment -- variables which are hidden from the human interacting with it. This gives rise to the problem of eliciting latent knowledge (ELK): the problem of training an AI agent to honestly report its beliefs. In this paper, we make ELK formally precise using Causal Influence Diagrams (CIDs). CIDs can be used to describe the relationship between an agent's training environment and its subjective representation of the world. We use CIDs to formalise the distinction between observable and latent variables, to specify what exactly it means for an agent to be honest, and to formally define goal misgeneralisation. We show that, under certain circumstances, developers can incentivise an agent to honestly answer questions by providing correct feedback during training. However, a natural, but undesirable, way for an agent to generalise is to provide answers which humans would evaluate as true, rather than honest answers. We prove an impossibility theorem stating: There is no feedback-based training strategy that depends only on agent behaviour and with certainty produces an honest agent, even if feedback is perfect during training.

2606.12258 2026-06-11 cs.CV 新提交

Bridging Day and Night: Unsupervised Cross-Domain Re-Identification with Synergistic Prompt and Prototype Learning

连接昼夜:基于协同提示与原型学习的无监督跨域重识别

Jiyang Xu, Rui Liu, Hang Dai

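Affiliations School of Computer Science, Wuhan University

AI summary Proposes an unsupervised day-night re-identification framework that combines prompt learning with prototype-based representation learning and uses two-stage training to associate identities across domains without annotations, with performance comparable to fully supervised methods.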

详情
AI中文摘要

跨域昼夜重识别(ReID)面临昼夜场景间显著视觉外观差异的根本挑战。现有的全监督方法严重依赖劳动密集型标注,成本高昂且跨域泛化能力有限。本文研究无监督昼夜重识别,提出一种新颖框架,协同结合提示学习和基于原型的表示学习,无需人工标注即可关联跨域身份。我们的方法采用渐进式两阶段训练策略。第一阶段,利用视觉语言模型以无标注方式生成实例特定的文本提示。我们采用实例级对齐机制,将视觉特征和文本提示嵌入统一语义空间,通过实例感知的动态偏差适应将未标注的昼夜图像与可学习提示对齐。第二阶段,构建域特定原型记忆库,并引入两个互补模块:i) 域内身份关联模块,增强每个域内的特征判别性;ii) 跨域原型匹配模块,可靠识别正负原型对,从而建立昼夜间的鲁棒身份对应关系。在公开基准上的大量实验验证了方法的有效性。在无监督设置下,我们的框架取得了与最先进全监督方法相当的Rank-1准确率。

英文摘要

Cross-domain day-night re-identification (ReID) is fundamentally challenged by the substantial visual appearance discrepancies between daytime and nighttime scenes. Existing fully supervised methods rely heavily on labor-intensive annotations, which are costly and exhibit limited generalization across domains. In this work, we investigate unsupervised day-night ReID and propose a novel framework that synergistically combines prompt learning and prototype-based representation learning to associate identities across domains without requiring manual labels. Our approach follows a progressive two-stage training strategy. In the first stage, we exploit the vision-language model to generate instance-specific textual prompts in an annotation-free manner. We employ an instance-level alignment mechanism to embed visual features and textual prompts into a unified semantic space, aligning unlabeled day/night images with learnable prompts via instance-aware dynamic-bias adaptation. In the second stage, we construct domain-specific prototype memory banks and introduce two complementary modules: i) an intra-domain identity association module to enhance feature discriminability within each domain, and ii) a cross-domain prototype matching module to reliably identify positive and negative prototype pairs, thereby establishing robust identity correspondences across day and night. Extensive experiments on public benchmarks validate the effectiveness of our method. Under the unsupervised setting, our framework attains Rank-1 accuracy comparable to state-of-the-art fully supervised methods.

2606.12252 2026-06-11 cs.LG cs.AI 新提交

Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

使用可解释性作为训练时可靠性信号实现高效心电图分类

Veerendhra Kumar Dangeti, Xiao Gu, Ying Weng, Shreyank N Gowda

Affiliations School of Computer Science, University of Nottingham; Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford; School of Computer Science, University of Nottingham Ningbo China

AI summary Proposes ERTS, which uses explanation quality during training (Grad-CAM attention maps) to distinguish informative from unreliable uncertainty and filters out low-focus samples, improving macro-F1 and reducing training cost on three ECG datasets.

详情
AI中文摘要

训练用于临床时间序列分析的深度神经网络计算需求高,但许多医疗环境缺乏重复模型开发和部署所需的资源。这一挑战在心电图分类中尤为明显,大数据集和长训练计划使效率变得重要。渐进式数据丢弃通过从梯度更新中排除已学习的样本来降低训练成本,但它依赖模型置信度,可能保留因噪声或歧义而难以处理而非有用信号的样本。在这项工作中,我们引入了ERTS,一种基于可解释性的可靠性训练信号,用于高效心电图分类。ERTS在训练期间利用解释质量来区分信息性和不可靠的不确定性。基于渐进式数据选择,我们计算候选样本的Grad-CAM注意力图,并推导出一个聚焦分数,衡量模型预测是否得到连贯且局部化模式的支持。低聚焦样本被过滤掉,而具有有意义注意力的样本优先进行梯度更新。我们在三个ECG数据集和多个骨干架构上评估ERTS,显示macro-F1的一致提升以及有效训练成本的降低。这些结果表明,解释质量可以作为改善临床时间序列学习中效率和可靠性的实用信号。代码将发布。

英文摘要

Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly evident in electrocardiogram classification, where large datasets and long training schedules make efficiency practically important. Progressive Data Dropout reduces training cost by excluding samples from gradient updates once they are learned, but it relies on model confidence and may retain samples that are difficult due to noise or ambiguity rather than useful signal. In this work, we introduce ERTS, an explainability-based reliability training signal for efficient ECG classification. ERTS uses explanation quality during training to distinguish between informative and unreliable uncertainty. Building on progressive data selection, we compute Grad-CAM attention maps for candidate samples and derive a focus score that measures whether model predictions are supported by coherent and localised patterns. Samples with low focus are filtered out, while those with meaningful attention are prioritised for gradient updates. We evaluate ERTS across three ECG datasets and multiple backbone architectures, showing consistent improvements in macro-F1 alongside reduced effective training cost. These results suggest that explanation quality can serve as a practical signal for improving both efficiency and reliability in clinical time-series learning. Code will be released.
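
The abstract does not define the focus score, so the following is only a hypothetical stand-in showing how a localisation measure over a 1-D attention map could drive sample filtering. It assumes a non-negative Grad-CAM-style map along the time axis and scores it by the share of total attention that falls in the best contiguous window; `focus_score`, the window width, and the 0.5 threshold are all illustrative choices.

```python
def focus_score(cam, window):
    """Share of total attention mass inside the best contiguous window.

    `cam` is a non-negative 1-D attention map. Values near 1 mean the
    attention is concentrated; a flat map of length n scores window / n.
    """
    total = sum(cam)
    if total <= 0 or window <= 0 or window > len(cam):
        return 0.0
    best = max(sum(cam[i:i + window]) for i in range(len(cam) - window + 1))
    return best / total

def keep_sample(cam, window=3, threshold=0.5):
    """Keep a candidate sample only if its attention is sufficiently focused."""
    return focus_score(cam, window) >= threshold

peaked = [0.0, 0.0, 0.1, 0.8, 0.1, 0.0, 0.0, 0.0]  # localised around one beat
diffuse = [0.125] * 8                              # spread evenly over the signal
```

With these toy maps the peaked sample is kept and the diffuse one is filtered out.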

2606.12251 2026-06-11 cs.LG cs.AI cs.CR 新提交

Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

强化学习破坏基于梯度的对抗优化

Xinhai Zou, Chang Zhao, Alireza Aghabagherloo, Dave Singelée, Robin Degraeve, Bart Preneel

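Affiliations COSIC, KU Leuven; Imec Brubotics, VUB; DistriNet, KU Leuven

AI summary Studies training image classifiers with reinforcement learning to disrupt the gradient structure that attackers rely on, and finds that RL acts as an implicit regularizer that yields unstable gradient directions and smaller gradient magnitudes, defeating gradient-based attacks; combining it with adversarial training gives a dual-layer defense.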

详情
AI中文摘要

基于梯度的对抗攻击仍然是对深度神经网络(DNN)的主要威胁,因为它们利用梯度信息高效优化对抗扰动。为了解决这个问题,我们研究了强化学习(RL)训练是否可以通过使用策略梯度目标和epsilon-贪婪探索来训练图像分类器,从而破坏攻击者使用的梯度结构。通过在CIFAR-10、CIFAR-100和ImageNet-100上使用多种架构进行系统实验,我们发现RL训练的分类器显著破坏了基于梯度的对抗优化。为了解释这一点,我们使用损失景观可视化、静态和动态梯度指标以及预测熵进行了全面的机制分析。我们的分析揭示,RL充当隐式正则化器,产生具有高度不稳定梯度方向和较小梯度幅度的模型。这种组合使得每个PGD步骤在方向上不可靠且幅度有限,导致基于梯度的攻击在实际迭代预算内失败。我们进一步表明,将RL与对抗训练(RL-adv)结合提供了在两个互补层面运作的双层防御:RL退化攻击者可用的梯度信息(梯度级防御),而对抗训练强化决策边界(边界级防御)。RL-adv在所有评估的主要攻击类型(包括基于梯度的PGD、AutoAttack、基于迁移和基于查询的攻击)中实现了最高的鲁棒性,显著优于SL-adv。这些发现将RL诱导的梯度破坏识别为一种互补的鲁棒性机制,并激励未来研究结合SL效率与RL梯度正则化特性的混合SL-RL训练调度。

英文摘要

Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradient structure used by attackers by training image classifiers with policy-gradient objectives and epsilon-greedy exploration. Through systematic experiments across CIFAR-10, CIFAR-100, and ImageNet-100 with multiple architectures, we find that RL-trained classifiers significantly disrupt gradient-based adversarial optimization. To explain this, we conduct a comprehensive mechanism analysis using loss landscape visualization, static and dynamic gradient indicators, and predictive entropy. Our analysis reveals that RL acts as an implicit regularizer, producing models with highly unstable gradient directions and smaller gradient magnitudes. This combination makes each PGD step both unreliable in direction and limited in magnitude, causing gradient-based attacks to fail within practical iteration budgets. We further show that combining RL with adversarial training (RL-adv) provides a dual-layer defense operating at two complementary levels: RL degrades gradient information available to attackers (gradient-level defense), while adversarial training strengthens decision boundaries (boundary-level defense). RL-adv achieves the highest robustness across all major attack types evaluated, including gradient-based (PGD, AutoAttack), transfer-based, and query-based attacks, outperforming SL-adv by a significant margin. These findings identify RL-induced gradient disruption as a complementary robustness mechanism and motivate future research on hybrid SL-RL training schedules that combine SL's efficiency with RL's gradient-regularization properties.

2606.12250 2026-06-11 cs.CL 新提交

Reassessing High-Performing LLMs on Polish Medical Exams: True Competence or Bias-Driven Performance?

重新评估高性能大语言模型在波兰医学考试中的表现:真实能力还是偏差驱动?

Antoni Lasik, Jakub Pokrywka, Łukasz Grzybowski, Jeremi Ignacy Kaczmarek, Gabriela Korzańska, Janusz Świeczkowski-Feiz, Oskar Pastuszek, Paulina Hoffman, Jakub Tomasz Dąbrowski, Wojciech Kusa

Affiliations NASK National Research Institute; Adam Mickiewicz University; ARAAI Poland; Poznań University of Medical Sciences; Centre of Postgraduate Medical Education, Poland; T. Marciniak Lower Silesian Specialist Hospital; Medical University of Warsaw

AI summary Introduces an expanded, more challenging benchmark based on Polish medical exams that reduces MCQA artifacts, and finds that standard MCQA scores overestimate the true clinical ability of LLMs: the best model drops by 28.4 and 31 percentage points under the harder setup.

Comments 26 pages total with references and appendix, preprint

详情
AI中文摘要

医学领域的大语言模型(LLM)主要通过多项选择题问答(MCQA)进行评估,但由于猜测策略和答案偏差,这种方法可能高估真实的临床能力。为解决这些局限性,我们引入了一个基于波兰医学考试的扩展且更具挑战性的基准,增加了超过15,000道题目、两个新领域和四项结构修改,以减少MCQA特定伪影并更好地测试推理能力。我们评估了21个LLM,结果表明评估设计对结果影响很大。在我们的更难设置下,最佳模型(Qwen3.5-122B)在英语和波兰语考试中分别下降了28.4和31个百分点。尽管数据污染证据不足,但标准MCQA分数并不能可靠地反映真实的医学能力。为促进进一步研究,我们公开了该基准。

英文摘要

Large language models (LLMs) in medicine are mainly evaluated using multiple-choice question answering (MCQA), which can overestimate real clinical ability due to guessing strategies and answer biases. To address these limitations, we introduce an expanded and more challenging benchmark based on Polish medical exams, adding over 15,000 questions, two new domains, and four structural modifications that reduce MCQA-specific artifacts and better test reasoning. We evaluate 21 LLMs and show that evaluation design strongly affects results. Under our harder setup, the best model (Qwen3.5-122B) drops by 28.4 and 31 pp on English and Polish exams, respectively. Despite low evidence of data contamination, standard MCQA scores do not reliably reflect true medical competence. To facilitate further research, we make our benchmark publicly available.

2606.12248 2026-06-11 cs.CV 新提交

Damage-TriageFormer: A Foundation-Model Framework for Typology-Based Building Damage Assessment from Mono-Temporal Imagery

Damage-TriageFormer:基于类型学的单时相影像建筑损伤评估的基础模型框架

Yiming Xiao, Yu-Hsuan Ho, Sanjay Thasma, Junwei Ma, Ali Mostafavi

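Affiliations Texas A&M University; Resilitix Intelligence LLC; Institute for a Disaster Resilient Texas

AI summary Proposes Damage-TriageFormer, a model for typology-based building damage assessment from a single post-event image; by extending a DINOv3 ViT-L backbone with a two-stage gated damage head, it reaches a macro F1 of about 0.62 on data from three disasters and supports emergency response without pre-event imagery.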

详情
AI中文摘要

决策相关的建筑损伤评估对于灾后资源优先分配和恢复至关重要,但大多数自动化方法要么将损伤扁平化为单一严重程度等级(无损伤、轻微、严重、摧毁),要么需要成对的灾前和灾后影像,而这对于突发灾害通常不可用。本文提出了Damage-TriageFormer,一种基于单张灾后影像、足迹条件化的模型,它生成损伤类型学而非严重程度等级。我们的贡献包括:(1)DamageTriage-Bench,一个基于NOAA应急响应影像(涵盖2018年迈克尔飓风、2024年海伦飓风和2025年洛杉矶野火复合灾害)构建的新基准,包含五个类型学类别,区分屋顶损伤和结构损伤,并在每个类别内区分部分和全部范围;(2)Damage-TriageFormer,它扩展了DINOv3 ViT-L骨干网络,结合简单特征金字塔进行更高分辨率的实例池化、两阶段门控损伤头以及辅助严重程度回归目标。我们的模型在验证集上达到宏观F1为0.624,在保留的分层测试集上为0.619,在运营分类最需要的地方表现最强,无损伤建筑和完全结构倒塌的每类F1分别为0.91和0.84。尽管罕见的完全屋顶损伤类别由于样本有限和固有的模糊标签边界仍然困难,但我们的结果表明,单张灾后影像可以支持可操作的建筑损伤分类,无需灾前参考即可实现有针对性的应急响应和资源分配。

英文摘要

Decision-relevant building damage assessment is critical for prioritizing resources and recovery after a disaster, yet most automated methods either flatten damage into a single severity scale (no damage, minor, major, destroyed) or require paired pre- and post-event imagery that is often unavailable for emerging hazards. This paper presents Damage-TriageFormer, a single-image, post-event, footprint-conditioned model that produces a damage typology rather than a severity scale. We contribute: (1) DamageTriage-Bench, a new benchmark built from NOAA Emergency Response Imagery across Hurricane Michael (2018), Hurricane Helene (2024), and the 2025 Los Angeles wildfire complex, with five typology classes that distinguish roof damage from structural damage and, within each, partial from total extent; and (2) Damage-TriageFormer, which extends a DINOv3 ViT-L backbone with a Simple Feature Pyramid for higher-resolution instance pooling, a two-stage gated damage head, and an auxiliary severity-regression objective. Our model achieves macro F1 of 0.624 on validation and 0.619 on a held-out stratified test set, performing strongest where operational triage needs it most, with per-class F1 of 0.91 and 0.84 on undamaged buildings and total structural collapse, respectively. While the rare Total Roof Damage class remains difficult due to its limited examples and an inherently ambiguous label boundary, our results show that single-image post-event imagery can support actionable building damage typing, enabling targeted emergency response and resource allocation without a pre-event reference.

2606.12243 2026-06-11 cs.CL cs.AI 新提交

VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

VIA-SD:通过模型内路由进行推测解码的验证

Yuchen Xian, Yang He, Yunqiu Xu, Yi Yang

Affiliations ReLER, The State Key Lab of Brain Machine Intelligence, Zhejiang University; College of Artificial Intelligence, Zhejiang University; CFAR, Agency for Science, Technology and Research, Singapore; National University of Singapore

AI summary Proposes VIA-SD, a multi-tier verification framework that uses a slim verifier derived from the full verifier to handle medium-confidence tokens, reducing large-model calls and achieving 10-20% speedups across multiple tasks.

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

推测解码(SD)通过让轻量级草稿模型生成候选,由大型验证器并行验证,解决了LLM的高推理成本问题。现有的草稿-验证方法使用二元决策:接受或完全重新计算。然而,我们发现许多被拒绝的令牌可以通过从完整验证器通过模型内路由派生的精简子模型正确验证,而不是完整验证器。这促使我们使用精简验证器来处理需要中等验证资源的令牌,减少昂贵的大模型调用。我们提出了VIA-SD(通过模型内路由进行推测解码的验证),一种使用路由精简验证器的多级框架。草稿令牌分层处理:高置信度情况直接接受,中等置信度情况由精简验证器重新生成,不确定情况由完整模型验证。在四个代表性任务和多个模型家族中,VIA-SD将拒绝率降低了0.10-0.22,并在强SD基线基础上实现了10-20%的加速,同时相比非草稿解码实现了2.5-3倍的加速。此外,VIA-SD与现有SD框架兼容,无需修改其训练过程。我们的结果表明,多级SD是一种可扩展且高效的LLM推理通用范式。项目页面:此https URL

英文摘要

Speculative decoding (SD) addresses the high inference costs of LLMs by having lightweight drafters generate candidates for large verifiers to validate in parallel. Existing draft-verify methods use binary decisions: accept or fully recompute. Yet we find that many rejected tokens can be verified correctly by a slim submodel derived from the full verifier via intra-model routing, instead of the full verifier. This motivates our slim-verifier to handle tokens requiring moderate verification resources, reducing expensive large-model calls. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework using a routed slim-verifier. Draft tokens are processed hierarchically: direct acceptance for high-confidence cases, slim-verifier regeneration for medium-confidence cases, and full-model verification for uncertain cases. Across four representative tasks and multiple model families, VIA-SD reduces rejection rates by 0.10-0.22 and delivers 10-20% speedups over strong SD baselines, while achieving 2.5-3x acceleration over non-drafting decoding. Moreover, VIA-SD is compatible with existing SD frameworks without modifying their training procedures. Our results suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference. Project page: https://zju-xyc.github.io/VIA-SD-Project-Page/
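
The three-tier handling of draft tokens can be sketched as a simple router. The abstract does not say how confidence is measured or where the tier boundaries lie, so the scalar confidence in [0, 1] and the thresholds 0.9 and 0.5 below are assumptions, as are the function names.

```python
from collections import Counter

def route_token(confidence, accept_at=0.9, slim_at=0.5):
    """Pick a verification tier for one draft token.

    High confidence: accept directly. Medium: hand to the slim verifier.
    Otherwise: fall back to the full verifier. Thresholds are illustrative.
    """
    if confidence >= accept_at:
        return "accept"
    if confidence >= slim_at:
        return "slim_verifier"
    return "full_verifier"

def route_draft(confidences, accept_at=0.9, slim_at=0.5):
    """Route every token of a draft and return the chosen tiers in order."""
    return [route_token(c, accept_at, slim_at) for c in confidences]

draft_confidences = [0.97, 0.62, 0.31, 0.93, 0.55]
routes = route_draft(draft_confidences)
tier_counts = Counter(routes)
```

In this toy draft only one of five tokens reaches the full verifier, which is the kind of reduction in large-model calls the multi-tier design aims for.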

2606.12240 2026-06-11 cs.LG cs.AI 新提交

Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training

多速率专家混合模型加速液态神经网络训练

Shilong Zong, Almuatazbellah Boker, Hoda Eldardiry

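Affiliations Virginia Tech

AI summary Proposes a multi-rate mixture-of-experts framework that combines the multi-scale dynamics of liquid neural networks with attention mechanisms to improve the accuracy and efficiency of multivariate time-series modeling.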

详情
AI中文摘要

多变量时间序列数据通常表现出复杂的时间依赖、不规则采样和跨多个时间尺度的异质动态,使得精确序列建模特别具有挑战性。传统的循环神经网络(RNN),如长短期记忆网络(LSTM),在离散时间下运行,可能难以有效捕捉连续和不规则的时间行为。液态神经网络(LNN)通过连续时间动态解决了其中一些限制,但标准LNN架构通常依赖单一动力系统,限制了其建模异质时间模式的能力。为了解决这些挑战,我们提出了一个基于液态神经网络的多速率专家混合(MR-MoE)框架。在所提出的架构中,多个基于LNN的专家以不同的时间尺度运行,使模型能够明确分离快速变化的动态和缓慢演变的时间趋势。门控网络进一步实现了基于输入条件的自适应专家专业化。此外,我们结合了特征级和时间注意力机制,以提高鲁棒性、可解释性和长程依赖建模能力。特征级注意力抑制噪声或无关变量,而时间注意力则选择性地关注信息丰富的历史状态。我们在一个复杂的多变量时间序列预测任务上评估了所提出的框架,并与强基线模型(包括LSTM、单体LNN和标准MoE模型)进行了比较。实验结果表明,所提出的MR-MoE框架在保持良好计算效率的同时,持续实现了改进的AUROC和AUPRC性能。这些结果突显了结合连续时间动态、多尺度专家分解和自适应注意力机制对时间序列建模的有效性。

英文摘要

Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple time scales, making accurate sequence modeling particularly challenging. Traditional recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) networks, operate in discrete time and may struggle to effectively capture continuous and irregular temporal behaviors. Liquid Neural Networks (LNNs) address some of these limitations through continuous-time dynamics, but standard LNN architectures typically rely on a single dynamical system, limiting their ability to model heterogeneous temporal patterns. To address these challenges, we propose a Multi-Rate Mixture-of-Experts (MR-MoE) framework built on top of Liquid Neural Networks. In the proposed architecture, multiple LNN-based experts operate at distinct time scales, enabling the model to explicitly separate fast-changing dynamics from slow-evolving temporal trends. A gating network further enables adaptive expert specialization based on input conditions. In addition, we incorporate both feature-level and temporal attention mechanisms to improve robustness, interpretability, and long-range dependency modeling. Feature-level attention suppresses noisy or irrelevant variables, while temporal attention selectively focuses on informative historical states. We evaluate the proposed framework on a complex multivariate time-series prediction task and compare it against strong baselines, including LSTM, monolithic LNN, and standard MoE models. Experimental results demonstrate that the proposed MR-MoE framework consistently achieves improved AUROC and AUPRC performance while maintaining favorable computational efficiency. These results highlight the effectiveness of combining continuous-time dynamics, multi-scale expert decomposition, and adaptive attention mechanisms for time-series modeling.
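
The idea of experts running at distinct time scales under a gate can be illustrated with a toy model. This is not the MR-MoE architecture: simple leaky integrators stand in for LNN cells, the gate is a softmax over fixed logits rather than a learned network, and all names and time constants are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def run_experts(xs, taus, dt=1.0):
    """Euler-integrate one leaky unit per time constant: dh/dt = (x - h) / tau.

    Small tau tracks the input quickly; large tau follows slow trends.
    Returns the list of expert states after each input.
    """
    states = [0.0] * len(taus)
    history = []
    for x in xs:
        states = [h + dt * (x - h) / tau for h, tau in zip(states, taus)]
        history.append(list(states))
    return history

def mix(states, gate_logits):
    """Combine expert states with softmax gate weights."""
    weights = softmax(gate_logits)
    return sum(w * h for w, h in zip(weights, states))

# A step input seen by a fast expert (tau=2) and a slow expert (tau=20).
history = run_experts([1.0] * 5, taus=[2.0, 20.0])
fast_state, slow_state = history[-1]
blended = mix(history[-1], gate_logits=[0.0, 0.0])  # equal gate weights
```

After five steps the fast expert has almost reached the input level while the slow expert has moved only part of the way, which is the separation of fast and slow dynamics the abstract describes.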

2606.12234 2026-06-11 cs.CL 新提交

On The Effectiveness-Fluency Trade-Off In LLM Conditioning: A Systematic Study

论LLM条件控制中的效果-流畅性权衡:一项系统性研究

Iuri Macocco, Pau Rodríguez, Arno Blaas, Luca Zappella, Marco Baroni, Xavier Suau

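Affiliations Universitat Pompeu Fabra; Apple; ICREA

AI summary Systematically studies the effectiveness-fluency trade-off of LLM conditioning methods when injecting and removing target concepts, finding that efficient steering methods often sacrifice fluency and that activation steering is less effective on instruction-tuned models.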

Comments 8 pages, 2 figures

详情
AI中文摘要

控制大型语言模型(LLM)的输出是其可靠部署的核心挑战,然而对所涉及权衡的清晰理解仍然难以捉摸。当前的条件控制方法通常在评估时狭隘地关注其注入或移除目标概念的有效性,而忽略了生成质量。我们系统性地研究了注入和移除场景中的一系列条件控制方法。我们发现,高效的引导方法通常以流畅性的大幅损失为代价来实现条件控制。此外,我们识别出一个关键但先前被忽视的与训练范式的交互:激活引导方法在指令调优模型上的效果远不如在基础模型上。另一方面,简单的提示和全面的监督微调是概念注入的可行选择,但在概念移除方面效果不佳。最后,廉价计算的文本指标与昂贵的LLM作为评判者的评分高度相关,并为条件控制方法的行为提供了见解。

英文摘要

Controlling the output of Large Language Models (LLMs) is a central challenge for their reliable deployment, yet a clear understanding of the involved trade-offs remains elusive. Current approaches to conditioning are often evaluated with a narrow focus on their effectiveness at injecting or removing a target concept, neglecting generation quality. We systematically investigate a range of conditioning methods in both injection and removal scenarios. We find that efficient steering methods frequently achieve conditioning at a steep cost to fluency. Furthermore, we identify a critical yet previously overlooked interaction with the training paradigm: activation steering methods are far less effective on instruction-tuned models than on their base counterparts. Simple prompting and full-fledged supervised fine-tuning, on the other hand, are viable options for concept injection, but are not as good at concept removal. Finally, cheaply computed textual metrics correlate strongly with costly LLM-as-judge scores and provide insight into the behavior of conditioning methods.

2606.12232 2026-06-11 cs.LG 新提交

Re-evaluating Confidence Remasking in Masked Diffusion Language Models

重新评估掩蔽扩散语言模型中的置信度重新掩蔽

Stipe Frkovic, Metod Jazbec, Dan Zhang, Christian A. Naesseth, Ilija Bogunovic, Eric Nalisnick

Affiliations UvA-Bosch Delta Lab, University of Amsterdam; Bosch Center for AI; University of Basel; Johns Hopkins University

AI summary Re-evaluates WINO, a training-free, post-hoc confidence-based remasking method for masked diffusion language models, and finds that it brings little benefit under standard decoding settings and exacerbates diversity collapse.

详情
AI中文摘要

掩蔽扩散语言模型(dLLMs)最近已成为自回归语言模型的有竞争力的替代方案,其通过并行令牌生成实现更快的推理。然而,掩蔽公式的一个显著限制是,一旦令牌被解除掩蔽,就无法再修改,这使得dLLMs容易受到早期采样错误的影响。为了解决这个问题,越来越多的研究试图扩展掩蔽dLLMs,使其具有自我纠正(重新掩蔽)能力。其中一类有吸引力的方法以无需训练、事后方式基于令牌置信度实现,早期报告的结果令人鼓舞。在这项工作中,我们重新审视了代表性事后重新掩蔽方法WINO [Hong et al., 2026]的实证评估,发现在标准解码设置(较短的块长度)下,它相比于仅基于置信度的解除掩蔽 [Wu et al., 2025] 几乎没有带来好处。将评估扩展到非贪婪解码,我们发现虽然基于置信度的重新掩蔽可以在一定程度上减轻由增加随机性引入的错误,但它也加剧了先前报道的基于置信度的解除掩蔽导致的多样性坍塌。总体而言,我们的结果表明,事后基于置信度的重新掩蔽的好处高度依赖于设置,这凸显了需要更全面的评估框架。

英文摘要

Masked diffusion language models (dLLMs) have recently emerged as a competitive alternative to autoregressive language models, with the promise of faster inference via parallel token generation. A notable limitation of the masked formulation, however, is that once a token has been unmasked it can no longer be revised, leaving dLLMs vulnerable to early sampling mistakes. To address this, a growing body of work has sought to extend masked dLLMs with self-correcting (remasking) capabilities. One appealing subset of these methods does so in a training-free, post-hoc manner based on token confidences, with encouraging early reported results. In this work, we revisit the empirical evaluation of a representative post-hoc remasking method, WINO [Hong et al., 2026], and find that under standard decoding settings (shorter block lengths) it brings little-to-no benefit over confidence-based unmasking alone [Wu et al., 2025]. Extending the evaluation to non-greedy decoding, we find that while confidence-based remasking can mitigate errors introduced by increased stochasticity to some extent, it also exacerbates the diversity collapse previously reported for confidence-based unmasking. Overall, our results show that the benefits of post-hoc confidence-based remasking are highly setting-dependent, underscoring the need for a more comprehensive evaluation framework.
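
The difference between confidence-based unmasking alone and unmasking with remasking can be shown with a generic single decoding step. This is not WINO's actual procedure, which the abstract does not spell out: the thresholds, the use of `None` as the mask symbol, and the function name are assumptions made for illustration.

```python
MASK = None  # placeholder mask symbol (assumption)

def decode_step(tokens, proposals, confidences,
                unmask_at=0.8, remask_at=None):
    """Apply one confidence-based decoding step.

    Masked positions are filled with the proposal when confidence reaches
    `unmask_at`. If `remask_at` is given, already-unmasked positions whose
    current confidence falls below it are masked again so they can be revised.
    """
    out = list(tokens)
    for i, tok in enumerate(tokens):
        if tok is MASK:
            if confidences[i] >= unmask_at:
                out[i] = proposals[i]
        elif remask_at is not None and confidences[i] < remask_at:
            out[i] = MASK
    return out

tokens = ["the", MASK, "sat", MASK]
proposals = ["the", "cat", "sat", "down"]
confidences = [0.95, 0.85, 0.20, 0.40]

unmask_only = decode_step(tokens, proposals, confidences)
with_remask = decode_step(tokens, proposals, confidences, remask_at=0.3)
```

Without remasking the low-confidence token "sat" is frozen once emitted. With remasking it is returned to the mask state and can be resampled in a later step.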

2606.12226 2026-06-11 cs.CV eess.IV 新提交

An Electric Potential-Augmented Benchmark Dataset for Physics-Guided Image Reconstruction of Electrical Capacitance Tomography

一种电势增强的基准数据集,用于电容层析成像的物理引导图像重建

Xinqi Zhang, Qiming Ma, Lihui Peng

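Affiliations Department of Automation, Tsinghua University

AI summary To address the fact that data-driven methods for electrical capacitance tomography (ECT) ignore the electric potential field, proposes a benchmark dataset that includes potential maps, generates 20,000 samples with a COMSOL-MATLAB pipeline, and shows that it improves modeling accuracy and robustness.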

详情
AI中文摘要

虽然深度学习显著推进了电容层析成像(ECT)的图像重建,但大多数数据驱动方法直接映射电容和介电常数分布,将传感器视为黑箱。这忽略了电势场——控制非线性和病态“软场”效应的基本物理联系。为解决此问题,我们提出一个电势增强的ECT基准数据集,旨在将ECT背后的潜在物理显式集成到学习过程中。通过COMSOL-MATLAB管道为八电极传感器生成示例,数据集包含20,000个随机样本,涵盖四种典型流型。关键的是,除了传统的电容向量和以图像形式描绘的介电常数分布外,每个样本还保留了八个激励方向的全场电势图。除了数据发布,我们还提供了ECT正问题和逆问题的说明性评估协议。通过在分布内(IID)和分布外(OOD)场景下的全面测试,我们系统地展示了包含电势图如何增强建模精度和鲁棒性。从根本上说,潜在场信息的显式包含显著降低了将物理定律集成到ECT建模中的障碍,从而为未来ECT图像重建的物理引导机器学习建立了标准化基础。

英文摘要

While deep learning has significantly advanced image reconstruction of Electrical Capacitance Tomography (ECT), most data-driven methods map directly between capacitance and permittivity distribution, treating the sensor as a black box. This overlooks the electric potential field -- the fundamental physical link governing the nonlinear and ill-posed "soft-field" effect. To address this, we propose an electric potential-augmented ECT benchmark dataset designed to explicitly integrate latent physics behind ECT into the learning process. Generated via a COMSOL-MATLAB pipeline for an eight-electrode sensor as an example, the dataset comprises 20,000 randomized samples across four typical flow patterns. Crucially, alongside the conventional capacitance vectors and permittivity distributions depicted as images, each sample preserves eight excitation-wise full-field potential maps. Beyond data release, we provide illustrative evaluation protocols for both forward and inverse problems of ECT. Through comprehensive testing on both in-distribution (IID) and out-of-distribution (OOD) scenarios, we systematically demonstrate how the inclusion of electric potential maps enhances modeling accuracy and robustness. Fundamentally, the explicit inclusion of latent field information significantly lowers the barrier to integrating physical laws into ECT modeling, thereby establishing a standardized foundation for future physics-guided machine learning of ECT image reconstruction.

2606.12218 2026-06-11 cs.CV cs.AI 新提交

Adapting Prithvi-EO for Fallow Detection for Food-Water Nexus: ViT-Adapter Necks and Parameter-Efficient Backbone tuning of Geospatial Foundation Model

为食物-水关系调整Prithvi-EO用于休耕地检测:地理空间基础模型的ViT-Adapter颈部与参数高效骨干微调

Sk Muhammad Asif, Orhun Aydin

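Affiliations Earth, Atmospheric and Geospatial Science, Saint Louis University

AI summary To address the mismatch between the multi-scale features needed for fallow detection and the single-scale ViT backbone of the foundation model, evaluates two parameter-efficient fine-tuning schemes (LoRA and a hybrid PEFT) with three neck designs; Lite ViT-Adapter with a one-stage detection head reaches a mAP@50 of 0.9479, 25.70% better than the adapter-free baseline.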

Comments 10 pages, 6 figures. Preprint. Submitted to ACM SIGSPATIAL 2026

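Details
AI-generated abstract (translated from Chinese)

Understanding the spatial distribution of fallow land is essential for optimizing the food-water nexus, because fallowing plays a role in crop rotation and water conservation. Fallow is a low-accuracy class in the USDA Cropland Data Layer. The geospatial foundation model Prithvi-EO has shown strong transferability across computer vision tasks. However, its Vision Transformer backbone produces features at a single spatial scale, which is ill-suited to the multi-scale features that object detection heads require. Existing methods synthesize multi-scale pyramids by rescaling single-stride tokens, sacrificing spatial heterogeneity, while full backbone fine-tuning is computationally too expensive for geospatial foundation models. We evaluate a fallow detection pipeline that combines two parameter-efficient fine-tuning schemes, Low-Rank Adaptation and a hybrid PEFT, with three neck designs: pseudo multi-scale, Lite ViT-Adapter, and Full ViT-Adapter. Our best configuration, Lite ViT-Adapter with a one-stage detection head, achieves an mAP@50 of 0.9479 with the DIoU loss, indicating the effectiveness of center-aware localization for detecting irregular fallow fields. Under LoRA, one-stage detection unlocked by ViT-Adapter improves on the adapter-free anchor-based approach by 6.42%, and the best configuration improves on the baseline adapter-free anchor-based approach by 25.70%. These results show that lightweight spatial-prior fusion and selective backbone unfreezing let Prithvi-EO capture local fallow patterns more effectively, outperforming methods that rely on reshaped single-stride ViT tokens.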

English abstract

Understanding the spatial distribution of fallow land is important for optimizing the food-water (FW) nexus, given fallowing's role in crop rotation and water conservation. Fallow is a low-accuracy class in the USDA Cropland Data Layer (CDL). The geospatial foundation model (GFM) Prithvi-EO has shown strong transferability across computer vision tasks. However, its Vision Transformer (ViT) backbone produces features at a single spatial scale that are ill-suited for the multi-scale features required by object detection heads. Existing approaches synthesise multi-scale pyramids through scaling of single-stride tokens, sacrificing spatial heterogeneity, and full backbone fine-tuning is computationally prohibitive for GFMs. We evaluate a fallow detection pipeline combining two parameter-efficient fine-tuning (PEFT) schemes, Low-Rank Adaptation (LoRA) and a hybrid PEFT, with three neck designs: pseudo multi-scale, Lite ViT-Adapter, and Full ViT-Adapter. Our best configuration, Lite ViT-Adapter with a one-stage head, achieves an mAP@50 of 0.9479 with the DIoU loss, suggesting the effectiveness of center-aware localization for irregular fallow field detection. ViT-Adapter free one-stage detection under LoRA improves on the adapter-free anchor-based approach by 6.42%, and the best configuration improves on the baseline adapter-free anchor-based approach by 25.70%. These results demonstrate that lightweight spatial prior fusion and selective backbone unfreezing enable Prithvi-EO to capture local fallow patterns more effectively, outperforming approaches that rely on reshaped single-stride ViT tokens.
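
The abstract credits the DIoU loss with center-aware localization. As a rough illustration only, the sketch below computes the Distance-IoU loss in its commonly cited form, 1 - IoU + d^2 / c^2, where d is the distance between box centers and c is the diagonal of the smallest enclosing box. The function name `diou_loss`, the `(x1, y1, x2, y2)` box format, and the `eps` stabilizer are assumptions made for this example; the abstract does not say how the authors implement the loss.

```python
def diou_loss(box_a, box_b, eps=1e-9):
    """Distance-IoU loss for two axis-aligned boxes given as (x1, y1, x2, y2).

    Hypothetical helper for illustration; not taken from the paper.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection area (zero when the boxes do not overlap).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    # Union area and plain IoU.
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    iou = inter / (area_a + area_b - inter + eps)

    # Squared distance between the two box centers.
    acx, acy = (ax1 + ax2) / 2.0, (ay1 + ay2) / 2.0
    bcx, bcy = (bx1 + bx2) / 2.0, (by1 + by2) / 2.0
    center_dist_sq = (acx - bcx) ** 2 + (acy - bcy) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    enc_w = max(ax2, bx2) - min(ax1, bx1)
    enc_h = max(ay2, by2) - min(ay1, by1)
    diag_sq = enc_w ** 2 + enc_h ** 2

    return 1.0 - iou + center_dist_sq / (diag_sq + eps)
```

Unlike plain IoU, the center-distance term still varies when the boxes do not overlap, which is one plausible reading of why a center-aware loss suits irregular field shapes.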