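arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.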

2605.26484 2026-05-27 cs.LG

Extra-Merge: Tracing the Rank-1 Subspace of Model Merging in Language Model Pre-Training

Wenjie Zhou, Bohan Wang, Hongtao Zhang, Chenxi Jia, Wei Chen, Xueqi Cheng

Affiliations: School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China; State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences, China; University of Chinese Academy of Sciences, China; Alibaba Group, China; School of Mathematics, Southeast University, Nanjing, China

AI summary: By analyzing late-stage pre-training trajectories, this paper identifies a rank-1 subspace phenomenon and proposes Extra-Merge, a training-free method that extrapolates along this subspace to minimize loss, outperforming standard merging baselines on the GPT-2 and LLaMA families.

English abstract

Model merging has emerged as a lightweight paradigm for enhancing Large Language Models (LLMs), yet its underlying mechanisms remain poorly understood. In this work, we analyze late-stage pre-training trajectories and uncover a Rank-1 Subspace phenomenon: while raw optimization steps oscillate violently, consecutive merged checkpoints collapse onto a stable, approximately one-dimensional linear manifold. We theoretically ground this observation in a river-valley landscape analysis: averaging acts as a geometric low-pass filter that dampens high-curvature noise to reveal the optimal descent direction. Capitalizing on this insight, we propose Extra-Merge, a training-free strategy that extrapolates along this subspace to minimize loss without additional gradient updates. Extensive experiments across GPT-2 and LLaMA families (124M to 2B) demonstrate that Extra-Merge consistently outperforms standard merging baselines. Notably, it yields consistent zero-shot accuracy gains on Pythia-12B downstream tasks and generalizes effectively to the Muon optimizer (Jordan et al., 2024).
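
Illustrative sketch (not from the paper): the idea of merging checkpoints and then extrapolating along the line through consecutive merged checkpoints can be written in a few lines. The window size, the coefficient `lam`, and the function names are assumptions made for this example; the abstract does not say how Extra-Merge picks the direction or the step size, so this should not be read as the authors' algorithm.

```python
from typing import List

Vector = List[float]


def merge(checkpoints: List[Vector]) -> Vector:
    """Uniformly average a window of checkpoints (plain weight averaging)."""
    n = len(checkpoints)
    return [sum(ws) / n for ws in zip(*checkpoints)]


def extrapolate(prev_merged: Vector, curr_merged: Vector, lam: float) -> Vector:
    """Step past curr_merged along the line joining two merged checkpoints.

    lam = 0 returns curr_merged; lam > 0 moves further along the direction
    (curr_merged - prev_merged). lam is a hypothetical knob for this sketch.
    """
    return [c + lam * (c - p) for p, c in zip(prev_merged, curr_merged)]


# Toy trajectory: the first coordinate drifts steadily, the second oscillates.
raw = [[0.0, 1.0], [1.0, -1.0], [2.0, 1.0], [3.0, -1.0]]
merged_a = merge(raw[0:2])   # averaging cancels the oscillating coordinate
merged_b = merge(raw[2:4])
candidate = extrapolate(merged_a, merged_b, lam=0.5)
```

In this toy case the raw steps flip sign in the second coordinate while the merged points lie on a single line, which is the behavior the abstract describes. A real use would score several values of `lam` on held-out loss; that selection step is omitted here.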

2605.26483 2026-05-27 cs.CV

Clinically-Grounded Counterfactual Reasoning for Medical Video Diagnosis

Jianzhe Gao, Churan Wang, Weiyi Zhang, Jianghua Li, Li-An Li, Wenguan Wang, Yixin Zhu, Yizhou Wang

Affiliations: Center for Data Science in Clinical Medicine; The State Key Lab of Brain-Machine Intelligence; Department of Gynecology and Obstetrics, 7th Medical Center of Chinese PLA General Hospital; School of Computer Science, Peking University; School of Psychological and Cognitive Sciences, Peking University; State Key Lab of General AI, Peking University; National Engineering Research Center of Visual Technology; Beijing Key Laboratory of Behavior and Mental Health; Embodied Intelligence Lab, PKU-Wuhan Institute for Artificial Intelligence

AI summary: Proposes MedVCR, a counterfactual reasoning framework that combines a diffusion-based generator synthesizing pathological tissue evolution, clinical rules encoding diagnostic knowledge, and a dual diagnostic prediction strategy, improving performance on medical video diagnosis tasks by 2.6%-10.2%.

English abstract

Medical video diagnosis involves inferring clinical decisions from dynamic tissue responses throughout examination processes. Existing methods rely on an end-to-end learning paradigm that i) focuses on appearance rather than pathology, ii) lacks clinical priors, and iii) reasons solely from observations without counterfactual comparison. This work introduces MedVCR, a counterfactual reasoning framework that mimics clinical diagnostic thinking. MedVCR comprises three components: a Counterfactual Generator that synthesizes tissue evolution under specified pathological states in a diffusion-based manner; a Counterfactual Representation Learning module that encodes diagnostic knowledge through clinical rules (i.e., temporal consistency, pathological separability, and counterfactual alignment); and a Dual Diagnostic Prediction strategy that integrates video-level assessment with frame-level counterfactual analysis. MedVCR is evaluated under both fully supervised (e.g., colposcopy) and weakly supervised (e.g., colonoscopy) video diagnosis settings, yielding 2.6%-10.2% performance gains compared with leading baselines. Comprehensive ablation studies further validate the effectiveness of each component. The code will be released.

2605.26478 2026-05-27 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY

Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

Haoxiang You, Yilang Liu, Davis Zong, Qian Wang, Teeratham Vitchutripop, Qi Wang, Daniel Rakita, Ian Abraham

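Affiliations: Yale University; Shanghai Jiao Tong University; University of Sydney

AI summary: Proposes Stochastic Decoupled Policy Gradient (SDPG), which estimates the policy gradient from stochastic perturbations of trajectory rollouts, trains diverse visuomotor control policies end to end within hours on a single GPU, substantially reduces compute and memory overhead, and outperforms baselines on visual MuJoCo benchmarks.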

2605.26477 2026-05-27 cs.LG

Variational Inference for Evidential Deep Learning

Jiawei Tang, Xinyan Du, Hui Liu, Junhui Hou, Yuheng Jia

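Affiliations: School of Computer Science and Engineering, Southeast University; School of Computing Information Sciences, Saint Francis University; Department of Computer Science, City University of Hong Kong; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education, China

AI summary: To address two problems in conventional Evidential Deep Learning (EDL), a KL penalty that leads to excessively high evidence and a parameter setting that lacks theoretical guarantees, this paper proposes VI-EDL, a variational-inference framework that derives an Evidence Lower Bound (ELBO) to curb excessive evidence growth and establishes a generalization bound, achieving state-of-the-art performance in out-of-distribution detection, noise detection, and autonomous driving scenarios on visual and medical datasets.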

English abstract

While Deep Neural Networks (DNNs) achieve remarkable performance, they tend to produce overconfident predictions. Evidential Deep Learning (EDL) mitigates this by formulating predictions as a Dirichlet distribution over class probabilities to explicitly quantify epistemic uncertainty. However, we found that conventional EDL suffers from two fundamental limitations: a Kullback-Leibler (KL) penalty that only suppresses the evidence of negative classes, producing excessively high evidence and therefore decreasing the model's ability to quantify uncertainty, and the absence of a theoretical guarantee for setting the Dirichlet parameter $\alpha = e + 1$. In this paper, we propose a mathematically principled framework, Variational Inference Evidential Deep Learning (VI-EDL). By reformulating evidential learning through the lens of variational inference, we derive an Evidence Lower Bound (ELBO), which prevents the evidence from growing excessively. Theoretically, we rigorously establish a generalization bound and reveal how the predicted uncertainty, feature and network complexity affect this bound, and why setting $\boldsymbol{\alpha} = \mathbf{e} + \mathbf{1}$ can minimize it. Extensive experiments on standard visual and medical datasets demonstrate that VI-EDL achieves state-of-the-art performance, showing excellent performance in out-of-distribution detection, noise detection, and autonomous driving scenarios. The code is available at https://github.com/seutjw/VI-EDL.
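
Illustrative sketch (not from the paper): the parameterization $\alpha = e + 1$ mentioned in the abstract can be made concrete with a small calculation. The uncertainty measure used here, vacuity $K / S$ with $S = \sum_k \alpha_k$, is the definition commonly used in the EDL literature and is an assumption of this example, since the abstract does not state which measure VI-EDL reports. The function name is hypothetical.

```python
from typing import List, Tuple


def dirichlet_summary(evidence: List[float]) -> Tuple[List[float], float]:
    """Return expected class probabilities and vacuity for alpha = e + 1."""
    if any(e < 0 for e in evidence):
        raise ValueError("evidence must be non-negative")
    alpha = [e + 1.0 for e in evidence]
    strength = sum(alpha)                      # Dirichlet strength S
    probs = [a / strength for a in alpha]      # mean of the Dirichlet
    vacuity = len(alpha) / strength            # K / S, equals 1 with no evidence
    return probs, vacuity


probs_none, u_none = dirichlet_summary([0.0, 0.0, 0.0])
probs_conf, u_conf = dirichlet_summary([27.0, 0.0, 0.0])
```

With zero evidence the prediction is uniform and the vacuity is at its maximum; as evidence for one class grows, the vacuity shrinks toward zero. That is why evidence that grows too large, the failure mode the abstract attributes to the conventional KL penalty, makes the model look certain everywhere.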

2605.26475 2026-05-27 cs.CV cs.AI

Comparative Study of Vision-Based Metric Measurement for Large-Scale Planar Scenes

ZhiXin Sun

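Affiliations: PowerChina Zhongnan Engineering Corporation Limited

AI summary: For large-scale outdoor scenes, this paper uses PTZ cameras to compare three vision-based planar metric measurement methods (monocular ranging, image stitching, and stereo ranging) and analyzes their accuracy and applicability.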

English abstract

Vision-based metric distance and area measurement remains challenging in large-scale outdoor environments due to long-range sensing, camera zoom, and unstable imaging conditions. This work studies planar metric measurement in a real-world reservoir monitoring scenario using PTZ cameras and compares three representative approaches: geometry-based monocular ranging, image stitching with bird's-eye-view transformation, and stereo-based ranging using two jointly calibrated monocular cameras. For monocular ranging, planar localization models are derived from camera geometry and the effect of camera pitch angle is analyzed. Image stitching is investigated for large-area mapping, while a stereo-based scheme is developed for long-range measurement without dedicated stereo hardware. Experiments show clear trade-offs: monocular ranging achieves meter-level accuracy under sufficiently large pitch angles, stereo-based ranging achieves decimeter-level accuracy with reduced sensitivity to pitch variations, and image stitching is effective for small-scale scenes but degrades in stability and scalability as scene size increases.
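
Illustrative sketch (not from the paper): the dependence of monocular ranging on pitch angle can be seen in the textbook flat-ground pinhole model. The model below assumes a camera at a known height above a flat plane, pitch measured downward from the horizontal, and image rows increasing downward. The paper's own localization models are not given in the abstract, so the function names and the numbers are hypothetical.

```python
import math


def ground_distance(height_m: float, pitch_deg: float,
                    v_px: float, cy_px: float, focal_px: float) -> float:
    """Horizontal distance to a ground-plane point seen at image row v_px.

    Assumes a pinhole camera, a flat ground plane, and pitch measured
    downward from the horizontal; image rows increase downward.
    """
    ray_angle = math.radians(pitch_deg) + math.atan((v_px - cy_px) / focal_px)
    if ray_angle <= 0:
        raise ValueError("ray does not intersect the ground plane")
    return height_m / math.tan(ray_angle)


def pitch_sensitivity(height_m: float, pitch_deg: float, err_deg: float) -> float:
    """Distance error at the image centre caused by a pitch error of err_deg."""
    true_d = ground_distance(height_m, pitch_deg, 0.0, 0.0, 1.0)
    biased_d = ground_distance(height_m, pitch_deg + err_deg, 0.0, 0.0, 1.0)
    return abs(true_d - biased_d)


# Same 0.1 degree pitch error, camera 20 m above the plane.
shallow = pitch_sensitivity(height_m=20.0, pitch_deg=3.0, err_deg=0.1)
steep = pitch_sensitivity(height_m=20.0, pitch_deg=30.0, err_deg=0.1)
```

Under these assumptions the same small pitch error costs far more distance accuracy at a shallow pitch than at a steep one, because the viewing ray is nearly parallel to the ground. This matches the abstract's observation that monocular ranging needs sufficiently large pitch angles.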

2605.26471 2026-05-27 cs.RO

Heterogeneous AAV Logistics Task Allocation: A Reinforcement Learning Enhanced Overlapping Coalition Formation Game Approach

Yuze Zhou, Jingliang Sun, Junzhi Li, Jianxin Zhong, Zihan Wang, Teng Long

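Affiliations: Beijing Institute of Technology; Key Laboratory of Dynamics and Control of Flight Vehicle, Ministry of Education

AI summary: To address the optimality challenge that randomly arriving time-sensitive tasks pose for heterogeneous AAV task allocation in dynamic urban logistics, this paper proposes an overlapping coalition formation game approach enhanced by a Transformer-based soft actor-critic network, which improves task-allocation optimality and is proven to converge to a Nash-stable equilibrium.

Comments: 12 pages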

English abstract

In dynamic urban logistics, the stochastic emergence of time-sensitive tasks poses a significant optimality challenge for heterogeneous AAV logistics task allocation. To address this problem, a reinforcement learning enhanced overlapping coalition formation game approach is proposed. A dynamic task allocation model is established, where global optimality is mathematically quantified by a generalized logistics cost coupling service quality and resource consumption. To deal with the time-varying task sets induced by stochastic order arrivals, a transformer-based soft actor-critic network is designed. By leveraging multi-head self-attention to encode variable-length logistics states and capture task-wise spatiotemporal dependencies, the learned policy adaptively guides coalition updates, replacing heuristic rules in the overlapping coalition formation game. On this basis, heterogeneous AAVs can form more efficient overlapping coalitions for dynamic logistics tasks. The resulting coalition formation process is proven to constitute an exact potential game, which guarantees convergence to a Nash-stable equilibrium within a finite number of iterations. Numerical simulations demonstrate that the proposed algorithm effectively improves the optimality of task allocation under the generalized logistics cost criterion. In a scenario with 32 AAVs and 80 tasks, our algorithm achieves a 39.76% cost reduction compared with the heuristic OCF baseline. Indoor flight experiments further validate its practicality.

2605.26470 2026-05-27 cs.CV

Triadic Dynamics Aware Diffusion Posterior Sampling for Inverse Problems: Optimizing Guidance and Stochasticity Schedules

Junseo Bang, Dong Ju Mun, Hoigi Seo, Seongmin Hong, Se Young Chun

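Affiliations: IPAI & AIIS, Seoul National University, Republic of Korea

AI summary: Proposes TriPS, which casts posterior sampling as a time-varying control problem and optimizes the schedules of data-consistency guidance, classifier-free guidance, and stochasticity, substantially improving performance on imaging inverse problems.

Comments: ICML 2026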

English abstract

Generative posterior sampling using diffusion models has emerged as a dominant paradigm for solving inverse problems in imaging, which usually consists of three main components: data consistency (DC) guidance, classifier-free guidance (CFG) and stochasticity. While prior work has focused on how to develop each or all components, less attention has been given to how to schedule them, leading to heuristically fixed or partially adjusted suboptimal schedules. In this work, we argue that the interactions among all three components in terms of scheduling are crucial for significantly improved performance in solving inverse problems in imaging. Our analysis shows that aggressive CFG early in sampling conflicts with DC guidance, while stochasticity brings the trajectory back to higher-probability regions. Based on these findings, we propose Triadic Dynamics Aware Posterior Sampling (TriPS), which reformulates posterior sampling as a time-varying control problem and optimizes schedules following a triadic trend of decreasing DC and stochasticity scales alongside increasing CFG scale. TriPS achieves this through two strategies: template-based search over functional priors for reliable baseline schedules, and Group Relative Policy Optimization (GRPO)-based reinforcement learning for more flexible temporal curves. Experiments demonstrate TriPS outperforms state-of-the-art baselines in data fidelity and perceptual realism.
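
Illustrative sketch (not from the paper): the triadic trend described in the abstract, DC and stochasticity scales falling while the CFG scale rises, can be written as one simple schedule template. The linear form, the endpoint values, and the names `dc_scale`, `noise_scale`, and `cfg_scale` are assumptions for this example. The abstract says TriPS searches over functional templates and also learns curves with GRPO, neither of which is reproduced here.

```python
from typing import Dict, List


def linear_schedule(start: float, end: float, steps: int) -> List[float]:
    """Linearly interpolate from start to end over the sampling steps."""
    if steps < 2:
        raise ValueError("need at least two steps")
    return [start + (end - start) * i / (steps - 1) for i in range(steps)]


def triadic_schedules(steps: int) -> Dict[str, List[float]]:
    """Hypothetical template: DC and noise scales fall, CFG scale rises."""
    return {
        "dc_scale": linear_schedule(1.0, 0.2, steps),
        "noise_scale": linear_schedule(1.0, 0.0, steps),
        "cfg_scale": linear_schedule(1.0, 5.0, steps),
    }


schedules = triadic_schedules(steps=5)
```

A template search in this spirit would vary the endpoints and the curve family and keep the combination that scores best on a validation set. Keeping CFG low early avoids the conflict with DC guidance that the abstract reports.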

2605.26468 2026-05-27 cs.LG cs.AI

Diffuse to Detect: Generative Diffusion Models for Unsupervised IC Anomaly Detection

Yuxuan Yin, Chen He, Todd Jacobs, Jialei He, Boxun Xu, Robert Jin, Peng Li

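Affiliations: Department of Electrical and Computer Engineering, University of California Santa Barbara, CA, USA

AI summary: Proposes the first unsupervised anomaly detection framework built on a diffusion Transformer, using autoencoder compression, structured token sequences, and noise-prediction error for fast wafer-level screening, and achieving the best performance on 16nm IC test data.

Comments: 9 pages, 5 figures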

2605.26463 2026-05-27 cs.CL cs.AI

Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records

Yeonsu Kwon, Jiho Kim, Junseong Choi, Paloma Rabaey, Minseo Kim, Sujeong Im, Jeewon Yang, Jun-Min Lee, Sangji Lee, Jiwon Kim, Hangyul Yoon, Hyunwook Kwon, Edward Choi

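Affiliations: KAIST; Ghent University; Samsung Medical Center; Samsung Changwon Hospital; Asan Medical Center

AI summary: To address inconsistencies between clinical notes and structured tables in electronic health records, this paper proposes the reasoning-intensive benchmark EHR-ReasonCon and the LLM-based framework EHR-Inspector, which verifies consistency using anchor-entity extraction, temporal references, and table-exploration tools.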

English abstract

Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note-table consistency verification mainly relies on surface-level matching of numeric values or simple events. Such approaches fail to capture the reasoning underlying real-world EHR documentation, including clinical interpretation, event relations, and temporal changes. To address this gap, we introduce EHR-ReasonCon, a reasoning-intensive benchmark for note-table consistency verification. Built on MIMIC-III with expert-guided annotations, it comprises 8,048 entities derived from clinical notes and provides high-quality ground-truth labels. The annotation protocol is supported by specialized table-exploration tools to ensure systematic evidence retrieval and reliable consistency assessment. We also propose EHR-Inspector, an LLM-based framework that segments notes, extracts anchor entities and temporal references, and uses table-exploration tools to verify consistency against structured tables. Evaluated using expert-validated LLM-as-a-judge metrics under harsh and lenient criteria, EHR-Inspector achieves state-of-the-art performance across multiple model backbones. Analyses further demonstrate the effectiveness of its components and highlight differences from human verification.

2605.26460 2026-05-27 cs.CV cs.AI

AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation

Jian Zhang, Zhijun Zhang

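Affiliations: School of Automation Science and Engineering

AI summary: Proposes AnchorDiff, which decouples semantic localization from structural refinement through anchor selection and hybrid-graph propagation to address concept leakage between visually confusable concepts in multi-modal diffusion Transformers.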

English abstract

Multi-Modal Diffusion Transformers (MM-DiTs) encode rich representations for training-free concept grounding, but existing attention-based methods often produce overlapping activations on visually confusable concepts, a failure mode we call concept leakage, where target responses spill over to non-target objects. To address this issue, we propose AnchorDiff, a training-free grounding method that decouples semantic localization from structural refinement. AnchorDiff selects a high-confidence anchor from the concept-to-image attention map and propagates it as a one-hot seed over a hybrid graph derived from image-to-image self-attention. The graph uses output-space similarity for dense within-object propagation and a row-wise attention gate to suppress cross-object connections. Additionally, we introduce the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and separate masks, enabling explicit evaluation of concept leakage. Experiments show that AnchorDiff achieves strong grounding performance on ImageNet-Segmentation and PascalVOC, while substantially reducing concept leakage on our Multi-Concept Confusion Dataset.
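
Illustrative sketch (not from the paper): the two steps named in the abstract, picking one high-confidence anchor and spreading it as a one-hot seed over a patch graph, can be shown on a toy example. The affinity matrix is supplied by hand here; how AnchorDiff builds its hybrid graph from output-space similarity and a row-wise attention gate is not specified in the abstract and is not reproduced. All names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def select_anchor(concept_attention: List[float]) -> int:
    """Index of the patch with the highest concept-to-image attention."""
    return max(range(len(concept_attention)), key=concept_attention.__getitem__)


def propagate(seed_index: int, affinity: Matrix, steps: int) -> List[float]:
    """Spread a one-hot seed over a row-normalised patch affinity graph."""
    n = len(affinity)
    rows = [[w / sum(row) for w in row] for row in affinity]
    score = [1.0 if i == seed_index else 0.0 for i in range(n)]
    for _ in range(steps):
        # Each patch collects mass from the patches that link to it.
        score = [sum(score[i] * rows[i][j] for i in range(n)) for j in range(n)]
    return score


# Patches 0-1 belong to one object, patches 2-3 to a look-alike object.
affinity = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
attention = [0.40, 0.10, 0.35, 0.15]   # raw attention leaks onto the look-alike
anchor = select_anchor(attention)
mask = propagate(anchor, affinity, steps=3)
```

The raw attention puts substantial weight on the look-alike object, while the propagated seed stays inside the anchor's object because the graph has no cross-object edges. This is the decoupling of localization from refinement that the abstract describes, in its simplest form.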

2605.26459 2026-05-27 cs.LG

MuCon: Clipped Muon Updates for LLM Training

Albert Yi

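Affiliations: Albert Yi

AI summary: This paper proposes the MuCon optimizer, which replaces Muon's polar-factor direction with singular-value clipping, and studies its approximate computation and numerical stability.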

English abstract

Muon-style optimizers take a matrix-valued momentum or preconditioned update $B = U \operatorname{diag}(\sigma_1,\ldots,\sigma_r) V^\top$ and replace it with its canonical partial polar factor $\operatorname{Pol}(B) = U V^\top$. This maps every nonzero singular value to one. MuCon is the clipped-Muon variant studied here: it applies singular-value clipping to the same Muon matrix, $D^{\mathrm{MuCon}}_\tau(B) = \operatorname{MClip}_\tau(B) = U \operatorname{diag}\bigl(\min\{\sigma_i,\tau\}\bigr) V^\top, \qquad \tau > 0$. Thus, $\operatorname{MClip}_\tau$ denotes the mathematical clipping operator, while MuCon denotes the optimizer primitive that substitutes this clipped direction for Muon's polar direction. The Muon/MuCon scaling parameterization used in this work is called $\text{SpectralP}$: it is the hidden-matrix scaling recipe under which polar Muon or clipped MuCon directions are applied. The map $\operatorname{MClip}_\tau$ is the Frobenius projection onto the spectral-norm ball $\{X : \|X\|_2 \le \tau\}$: it leaves singular values at or below $\tau$ unchanged and modifies only the violating singular directions. This paper asks when the MuCon clipping step can be approximated without a full dense SVD. We record two exact identities, a polar/absolute-value formula and a scalar-root formulation leading to a rational Newton filter for the clipped positive-semidefinite factor, and identify the numerical obstruction common to both: singular values near the threshold make sign decisions and rational solves ill-conditioned. Matrix-function methods are therefore useful only when paired with stable polar/square-root primitives or explicit regularization near the clipping boundary.
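
Illustrative sketch (not from the paper): the two operators defined in the abstract, $\operatorname{Pol}(B)$ and $\operatorname{MClip}_\tau(B)$, can be computed directly from a dense SVD with NumPy. This is the reference computation that the paper seeks to avoid, not one of its SVD-free approximations. The function names, the rank tolerance, and the test matrix are assumptions for this example.

```python
import numpy as np


def polar_factor(b: np.ndarray) -> np.ndarray:
    """Pol(B) = U V^T from a thin SVD, dropping numerically zero directions."""
    u, s, vt = np.linalg.svd(b, full_matrices=False)
    keep = s > 1e-12 * max(s.max(), 1.0)
    return u[:, keep] @ vt[keep, :]


def mclip(b: np.ndarray, tau: float) -> np.ndarray:
    """MClip_tau(B) = U diag(min(sigma_i, tau)) V^T."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    u, s, vt = np.linalg.svd(b, full_matrices=False)
    # Broadcasting scales the columns of U by the clipped singular values.
    return (u * np.minimum(s, tau)) @ vt


rng = np.random.default_rng(0)
b = rng.standard_normal((6, 4))
tau = 1.5
clipped = mclip(b, tau)

sigma_before = np.linalg.svd(b, compute_uv=False)
sigma_after = np.linalg.svd(clipped, compute_uv=False)
```

Two limiting cases tie the operators together. When $\tau$ is at least the largest singular value, clipping leaves $B$ unchanged; when $\tau$ is below every nonzero singular value, $\operatorname{MClip}_\tau(B) = \tau \operatorname{Pol}(B)$, which is a scaled Muon direction. Singular values close to $\tau$ are where the `np.minimum` decision flips, which is the ill-conditioning near the threshold that the abstract points to.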

2605.26456 2026-05-27 cs.CV

Sparse-LiDAR Prompting of Monocular Geometry Foundations: An Empirical Study Toward Long-Range Driving Depth

Kai Zheng, Qiang Feng, Xingjian Liu, Wenquan Tan, Yuan Li

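Affiliations: Benewake (Beijing) Co., Ltd.

AI summary: This paper proposes SLIM, the first adaptation of MoGe-2 to accept truly sparse LiDAR input; using a partial-convolution sparse encoder and a multi-scale fusion neck, it reduces absolute relative error by approximately 39-51% at long range (100-150 m).

Comments: 6 pages, 3 figures, 2 tables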

English abstract

Sparse-LiDAR-prompted depth foundation models (PromptDA, Prior Depth Anything, DMD3C) have shown strong results on indoor scenes or within KITTI's standard 80-meter evaluation cap. However, two limitations remain: (i) systematic distance-stratified evaluation in long-range driving regimes (50-150 m) is largely absent; (ii) prior approaches built on disparity-based foundations rely on pre-interpolated dense priors, leaving truly sparse LiDAR injection on point-map foundations (e.g., MoGe-2, NeurIPS 2025) unexplored. We present SLIM (Sparse-LiDAR Injected Monocular geometry), the first adaptation of MoGe-2 to accept truly sparse LiDAR input. SLIM integrates a partial-convolution sparse encoder with a multi-scale fusion neck that fuses LiDAR features into the point-map decoder at five scales. We adopt density-agnostic training (random injection ratio in [0.005, 0.30]) so a single model serves diverse input densities. On Virtual KITTI and CARLA, SLIM reduces the absolute relative error of the MoGe-2 baseline by approximately 39-51% at 100-150 m. Ablation across six injection ratios shows partial-convolution injection improves both AbsRel and RMSE on Virtual KITTI in all six settings; on CARLA, AbsRel improves in five of six settings (the one near-tie, at ratio 0.015, differs by 0.0013), and RMSE is comparable across encoders, with partial-convolution improving in three settings (by up to 0.31 units) and losing by at most 0.11 units in the other three.
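
Illustrative sketch (not from the paper): distance-stratified evaluation with the absolute relative error (AbsRel) metric, as the abstract calls for, amounts to grouping pixels by true depth before averaging. AbsRel is the standard mean of |pred - gt| / gt; the bin edges, function name, and depth values below are made up for this example and are not results from the paper.

```python
from typing import Dict, List, Tuple

Bin = Tuple[float, float]


def abs_rel_by_range(pred: List[float], gt: List[float],
                     bins: List[Bin]) -> Dict[Bin, float]:
    """Mean |pred - gt| / gt over valid pixels whose true depth falls in each bin."""
    result: Dict[Bin, float] = {}
    for lo, hi in bins:
        errors = [abs(p - g) / g for p, g in zip(pred, gt) if g > 0 and lo <= g < hi]
        if errors:                      # skip bins with no ground-truth pixels
            result[(lo, hi)] = sum(errors) / len(errors)
    return result


# Hypothetical depths in metres; not data from the paper.
gt_depth = [10.0, 40.0, 80.0, 120.0, 140.0]
pred_depth = [10.5, 38.0, 88.0, 108.0, 154.0]
report = abs_rel_by_range(
    pred_depth, gt_depth, [(0.0, 50.0), (50.0, 100.0), (100.0, 150.0)]
)
```

Reporting the metric per bin keeps the many near-range pixels from hiding the long-range error, which is the motivation for stratified evaluation given in the abstract.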

2605.26454 2026-05-27 cs.CL

Model Unlearning Objectives Vary for Distinct Language Functions

Berk Atil, Vipul Gupta, Rebecca J. Passonneau

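Affiliations: Pennsylvania State University; Scale AI

AI summary: This paper argues that different language functions (hazardous-knowledge unlearning and toxicity unlearning) call for different unlearning methods, and proposes a cosine-based meta-learning variant of RMU and a multi-layer objective method, respectively, with good results on several 7-8B models.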

2605.26449 2026-05-27 cs.CV cs.AI

Cross-scale Aligned Supervision for Training GANs

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

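Affiliations: Sungkyunkwan University

AI summary: To address cross-scale trajectory misalignment in multi-scale GAN generation, proposes CAT (Cross-scale Aligned Transformer), which aligns intermediate outputs with the final output through generator-side consistency regularization and achieves an FID-50K of 1.56 on ImageNet-256.

Comments: Preprint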

English abstract

Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as coarse-to-fine hierarchical generation. In this work, we challenge this interpretation. We argue that standard scale-wise adversarial supervision does not construct a proper coarse-to-fine hierarchy: each intermediate image is independently pushed toward the real distribution at its own resolution, but this scale-wise realism does not ensure that outputs across stages represent the identical generated sample. Moreover, the scale-specific image produced at each stage is not used as an explicit refinement target for the subsequent stage. Therefore, its adversarial loss can improve a scale-specific output without constraining later stages to preserve the same sample trajectory, allowing them to move toward a different sample rather than refine the previous output. We refer to this problem as a cross-scale trajectory misalignment problem. To resolve it, we propose CAT, a Cross-scale Aligned Transformer for multi-scale adversarial generation. CAT keeps the discriminator scale-wise, so each intermediate output is evaluated at its own resolution, while adding a simple generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with one-step inference after only 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.

2605.26447 2026-05-27 cs.CV

Underwater360: Reconstructing Underwater Scenes from Panoramic Images with Omnidirectional Gaussian Splatting

Jiangbei Hu, Weichao Song, Shibo Yu, Mohan Wang, Zihan Yi, Rui Wu, Mingkang Xiang, Na Lei, Shengfa Wang, Zhongxuan Luo, Ying He

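Affiliations: School of Software, Dalian University of Technology; College of Computing and Data Science, Nanyang Technological University

AI summary: Proposes Underwater360, a framework that uses physics-informed omnidirectional Gaussian splatting, with spherical ray casting and appearance-medium modeling, to achieve high-quality reconstruction and appearance restoration of underwater panoramic scenes.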

English abstract

Underwater scene reconstruction is essential for immersive exploration of aquatic environments, yet remains challenging due to complex participating-media effects such as absorption and scattering, as well as the limited field of view (FoV) of conventional cameras. Although combining panoramic imaging with 3D Gaussian Splatting (3DGS) offers a promising direction for photorealistic underwater rendering, traditional 3DGS struggles with both spherical projection distortion and underwater medium degradation. In this paper, we propose Underwater360, a physics-informed omnidirectional 3DGS framework for underwater panoramic scene reconstruction. First, we introduce an Omnidirectional Gaussian Splatting module that performs ray casting directly in spherical camera space instead of relying on 2D projection approximations, thereby reducing geometric distortions under 360$^\circ$ FoV. Second, we design a physics-based appearance-medium modeling architecture with pose-conditioned appearance embeddings to explicitly decouple intrinsic scene radiance from depth-dependent backscatter and attenuation, enabling physically grounded scene appearance restoration. Finally, we establish a new panoramic underwater benchmark dataset containing both synthetic and real-world scenes. Extensive experiments demonstrate that Underwater360 achieves superior performance in underwater novel view synthesis and scene appearance restoration, delivering improved rendering quality and cross-view consistency in complex underwater environments. The code and datasets are released at https://github.com/SwcK423/Underwater360

2605.26446 2026-05-27 cs.LG cs.AI

DDGAD: Trajectory Dynamics for Diffusion-Based Graph Anomaly Detection

Yuxin Yang, Limei Hu, Feng Chen

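Affiliations: College of Artificial Intelligence; Southwest University

AI summary: Proposes DDGAD, a framework that uses trajectory dynamics under diffusion regularization and reliability-aware neighborhood consensus to distinguish normal from anomalous nodes, detecting anomalies through three complementary anomaly signals.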

English abstract

Graph anomaly detection (GAD) aims to identify nodes or substructures whose behavior or attributes deviate significantly from the overall pattern in graph-structured data, with critical applications in financial risk control, social network analysis, and cybersecurity. However, existing GCN-based methods suffer from the fundamental problem of contamination propagation, where anomalous nodes pollute the representations of their neighbors through message passing, leading to degraded detection performance. In this paper, we propose DDGAD, a novel diffusion-based graph anomaly detection framework that leverages trajectory dynamics to distinguish normal and anomalous nodes. Our key insight is that normal nodes exhibit consistent and stable representation trajectories under the coupled effects of diffusion regularization and reliability-aware neighborhood consensus, while anomalous nodes exhibit unstable and conflicting dynamics due to the directional disagreement between the global manifold prior and locally contaminated message passing. To mitigate contamination propagation, we introduce a distributed reliability-aware consensus refinement mechanism and define three complementary anomaly signals: neighbor inconsistency, reliability weight, and dynamical conflict energy. We further provide a preliminary theoretical analysis on normal node stability under the coupled dynamics. These signals collectively characterize anomalous behaviors from the perspectives of local inconsistency, consensus reliability, and dynamical instability. Extensive experiments on five real-world datasets demonstrate the effectiveness of the proposed framework.


2605.26445 2026-05-27 cs.CL

Curation and Extraction of Drug-Related Entities from Reddit Platform

Zewei Wang, Zihan Xu, Yishu Wei, Michael Chary, Yifan Peng

Affiliations: Population Health Sciences, Weill Cornell Medicine, New York City, USA; School of Computing and Information Systems, University of Melbourne, Melbourne, Australia; Emergency Medicine, Weill Cornell Medicine, New York City, USA

AI summary: To address physicians' limited understanding of real-world illicit drug use, this paper builds the ReDose dataset (6,435 Reddit posts) and applies BERT-based, LLM, and RAG models to extract drug, dose, and effect entities. BiomedBERT reaches an F1 of 0.843 on drug entities and Llama-3 70B outperforms GPT-4, but effect extraction remains challenging.

Comments Accepted by IEEE International Conference on Healthcare Informatics (ICHI 2026)

Abstract

Physicians learn about illicit drugs primarily from clinical overdose cases, limiting their understanding of real-world usage. Meanwhile, drug users share first-hand experiences online, offering insights into dosage and effects of drugs. To bridge this gap, we introduce ReDose (REddit Drug DOSe and Effect), a dataset of 6,435 Reddit posts on substance use. A board-certified toxicologist primarily annotated both the training and test sets, while two medical science students contributed to the test set, labeling DRUG, DOSE, and EFFECT entities. We benchmarked 6,267 annotations using BERT-based, large language model (LLM)-based, and Retrieval-Augmented Generation (RAG) models. BiomedBERT achieved an F1-score of 0.843 for DRUG, while Llama-3 70B outperformed GPT-4 (F1 = 0.79 vs. 0.72). EFFECT extraction remains challenging, with GPT-4 achieving a recall of 0.41. ReDose captures patient-curated narratives to advance medical data extraction from social media.


2605.26442 2026-05-27 cs.CL cs.AI

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

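Hwanjun Song

Affiliations: KAIST

AI summary: Taking a data-centric view, this paper reframes alignment tuning as a pipeline design problem and decomposes it into three stages: response synthesis, preference evaluation, and preference instantiation. It uses this framework to categorize existing alignment methods in a unified way, summarizes design trade-offs and failure modes, distills high-level principles, and closes with open challenges.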

Comments Accepted at the Findings of ACL 2026

2605.26441 2026-05-27 cs.CV cs.AI

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu, Chen Chen, Jianfeng Dong, Keke Tang, Pan Zhou, Yu Cheng, Daizong Liu

Affiliations: Hubei Key Laboratory of Distributed System Security; Hubei Engineering Research Center on Big Data Security; School of Cyber Science and Engineering; Huazhong University of Science and Technology; University of Central Florida; Zhejiang Gongshang University; Guangzhou University; The Chinese University of Hong Kong; Peking University

AI summary: From a game-theoretic perspective, this paper models the uncertain correspondence between frames and words as a multivariate cooperative game, enabling multi-level cross-modal interaction and improving video temporal grounding accuracy under weak supervision.

Comments Published in ECCV 2024

Abstract

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework, which uses contrastive learning and reconstruction paradigms to score pre-defined moment proposals. Although they have achieved significant progress, we argue that current frameworks have overlooked two indispensable issues: 1) Coarse-grained cross-modal learning: previous methods solely capture the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words needed to accurately ground the moment boundaries. 2) Complex moment proposals: their performance relies heavily on the quality of proposals, which are also time-consuming and complicated to select. To this end, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationship between each vision-language pair with diverse granularity and flexible combination for multi-level cross-modal interaction. Specifically, we model each video frame and query word as game players with multivariate cooperative game theory to learn their contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondences between frames and words. Finally, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for better moment localization. Experiments show that our method achieves superior performance on both Charades-STA and ActivityNet Caption datasets.
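The abstract does not define its game-theoretic interaction measure, so the following is only a minimal sketch of one standard choice, the Banzhaf interaction index, to make the idea of "frame-word cooperation within a coalition" concrete. The player names and the toy `similarity` value function are hypothetical and stand in for the learned cross-modal similarity score.

```python
from itertools import combinations


def banzhaf_interaction(value, players, i, j):
    """Average synergy of players i and j over all coalitions of the others.

    For each context S (a subset of the remaining players) the synergy is
    value(S + {i, j}) - value(S + {i}) - value(S + {j}) + value(S).
    """
    others = [p for p in players if p not in (i, j)]
    total, count = 0.0, 0
    for size in range(len(others) + 1):
        for context in combinations(others, size):
            s = frozenset(context)
            total += value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            count += 1
    return total / count


# Hypothetical players: one video frame and two query words.
players = ["frame_3", "word_run", "word_dog"]


def similarity(coalition):
    """Toy value function: an additive part plus a bonus when frame_3 and word_run co-occur."""
    bonus = 1.0 if {"frame_3", "word_run"} <= coalition else 0.0
    return bonus + 0.1 * len(coalition)


strong = banzhaf_interaction(similarity, players, "frame_3", "word_run")
weak = banzhaf_interaction(similarity, players, "frame_3", "word_dog")
```

In this toy game the additive part cancels, so only the pair that actually cooperates receives a non-zero interaction. Exact enumeration is exponential in the number of players, so a real system would need sampling or another approximation.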


2605.26440 2026-05-27 cs.CL cs.SE

Conv-to-Bench: Evaluating Language Models Via User-Assistant Dialogues In Code Tasks

Victor M. dos Santos, Andre C. Castro, Samuel L. de S. Toledo, Bruno M. L. Calura, Lisandra C. de M. Menezes, Raul C. R. Mata, Telma W. de L. Soares, Bryan L. M. de Oliveira

Affiliations: Institute of Mathematics and Computer Science, University of São Paulo; Institute of Informatics, Federal University of Goiás; HUG Labs; Advanced Knowledge Center for Immersive Technologies (AKCIT)

AI summary: Proposes Conv-to-Bench, a framework that automatically converts multi-turn user-assistant dialogues into structured requirement checklists for evaluating large language models. In the programming domain it aligns closely with human-authored standards at low computational overhead.

Abstract

The rapid advancement of Large Language Models (LLMs) has outpaced the scalability of traditional evaluation benchmarks, which remain heavily dependent on labor-intensive expert curation. We address this bottleneck with Conv-to-Bench, a multi-stage framework that automatically transforms authentic multi-turn user-assistant dialogues into structured, verifiable requirement checklists. By leveraging the "instructional evolution" found in real-world conversational logs, our approach deconstructs fragmented user intent into consolidated instructions and binary evaluation criteria. Applied to the programming domain, Conv-to-Bench produces evaluation sets that demonstrate near-perfect alignment with human-authored standards like BigCodeBench, achieving Spearman correlations of up to $\rho = 1.000$ with significantly lower computational overhead. Validation of the LLM-as-a-judge framework further confirms its reliability, with the primary evaluator achieving substantial agreement with human-verified ground truth ($\kappa = 0.705$). Our comprehensive ablation studies reveal that while multi-turn interactions capture the iterative evolution of user intent, instruction-centric extraction provides a more robust foundation. Ultimately, Conv-to-Bench provides a scalable, cost-effective paradigm for maintaining high-fidelity evaluation standards as user-centric AI applications continue to diversify.
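The two agreement statistics quoted above can be computed as in the following minimal sketch. It uses the textbook definitions of Spearman's rank correlation (Pearson correlation of average ranks) and Cohen's kappa. The model scores and judge labels are made-up illustrative values, not data from the paper.

```python
from collections import Counter


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda idx: values[idx])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in set(labels_a) | set(labels_b)) / (n * n)
    return (observed - expected) / (1 - expected)


# Illustrative values only: four models scored by two benchmarks.
generated_benchmark = [0.31, 0.45, 0.52, 0.60]
reference_benchmark = [40, 55, 61, 70]
rho = spearman(generated_benchmark, reference_benchmark)

# Illustrative values only: binary checklist verdicts from a judge and a human.
judge = [1, 1, 0, 0, 1, 0, 1, 1]
human = [1, 1, 0, 1, 1, 0, 0, 1]
kappa = cohen_kappa(judge, human)
```

A rho of 1.000 only says that the two benchmarks order the models identically. With a small number of models this is easier to reach than with many, which is worth keeping in mind when reading the headline figure.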


2605.26438 2026-05-27 cs.CL cs.AI

LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

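Igor Ivanov, David Demitri Africa

Affiliations: Meridian Cambridge

AI summary: Proposes LURE, which builds deployment-like evaluations by replaying realistic agentic interaction trajectories and appending an evaluation prompt, in order to reduce evaluation awareness in large language models. It also introduces an automated pipeline for measuring evaluation realism.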

Abstract

Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay Evaluations), a method for constructing deployment-like evaluations by replaying realistic agentic interaction trajectories and appending an evaluation prompt at the end. We also introduce an automated pipeline for measuring evaluation realism, combining detection of verbalized evaluation awareness with judge-model estimates of the probability that a log is an evaluation, and validate it on a large dataset of deployment and evaluation transcripts. We find that LURE-based evaluations are substantially less distinguishable from deployment than widely used benchmarks and synthetic evaluation generators, and can approach the realism of real conversations with users. We instantiate LURE in scheming, AI safety sabotage, and sycophancy settings. Our results suggest that evaluation realism is a crucial property of alignment benchmarks and should be reported alongside benchmark results, especially when such results are used in safety cases.


2605.26434 2026-05-27 cs.LG cs.AI

Aperiodic and Low-Frequency Spectral Bias in Reconstruction based EEG Foundation Models

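Aditya Kommineni, Emily Zhou, Kleanthis Avramidis, Simon Bock Segaard, Jeppe Roden Münster, Andreas Peter Juhl Hansen, Takfarinas Medani, Tiantian Feng, Richard Leahy, Shrikanth Narayanan

Affiliations: University of Southern California; Aalborg University

AI summary: Shows that EEG foundation models pre-trained with reconstruction objectives are biased toward aperiodic and low-frequency components, which leads to weak performance in low-resource settings, and suggests auxiliary losses targeting high-frequency oscillatory structure as a remedy.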

Comments 18 pages, 13 figures, 3 tables

Abstract

EEG foundation models, pre-trained on large-scale unlabelled EEG data, have emerged as a promising direction towards learning generalizable EEG representations. Despite showing positive results in data-rich regimes, they often fail to outperform significantly smaller, fully supervised models in low-resource settings. We provide a mechanistic account of this shortcoming, attributing it to a fundamental mismatch between reconstruction-based pretext tasks and the idiosyncratic spectral structure of EEG signals, which decompose into distinct high-power aperiodic and low-power oscillatory components. Using controlled, synthetically generated EEG inputs, we demonstrate that EEG foundation model embeddings are biased to capture the aperiodic components of the EEG signal while under-representing oscillatory components, particularly at higher frequencies. Additionally, linear probe evaluations on real-world BCI datasets further reveal that embeddings encode subject identity more strongly than task-relevant information, thereby reinforcing the low-frequency and aperiodic component bias in foundation model embeddings trained primarily on reconstruction-based objectives. Together, these findings elucidate a failure mode in reconstruction-based EEG foundation models and motivate future work to incorporate auxiliary losses explicitly targeting high-frequency oscillatory structure as a path toward more capable and generalizable EEG representations.


2605.26433 2026-05-27 cs.CL

Vectors Are Not Neutral: Sensitive-Information Inference from Exported LLM Representations in Summarization

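Weixin Liu, Bowen Qu, Juming Xiong, Congning Ni, Bradley A. Malin, Zhijun Yin

Affiliations: Vanderbilt University; Vanderbilt University Medical Center

AI summary: Studies the risk of sensitive-information leakage from vectors exported by LLM summarization systems and proposes SurfaceLoRA, a fine-tuning method that reduces recoverability from a targeted vector, although untargeted pooled vectors remain at risk.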

Comments 30 pages, 2 figures; preprint

Abstract

Large language model (LLM) summarization systems may pass compact vector representations of private inputs to downstream retrieval, monitoring, audit, or analytic workflows. Even when source documents remain access-restricted, derived vectors may be handled under different access controls and still support sensitive-information inference, creating a residual information-disclosure risk. We study this issue in clinical discharge-summary generation as a high-stakes case study, using electronic health record (EHR)-recorded race as a controlled sensitive-label audit. We audit two artifacts that a system might retain or expose to downstream components: the final prompt-token hidden state and the mean-pooled prompt representation. Our results show that reducing recoverability of the case-study sensitive label from one exported artifact does not necessarily reduce recoverability from another. As a mitigation case study, we introduce SurfaceLoRA, an exported-vector-targeted parameter-efficient fine-tuning method that uses a gradient-reversal discriminator attached to a designated exported vector. Under a balanced five-way probing protocol, SurfaceLoRA reduces EHR-recorded race recoverability from the targeted final-token artifact toward chance while preserving summarization utility, yet recoverability remains substantially higher from untargeted pooled artifacts. These findings show that privacy auditing and mitigation should be performed on the exact vector artifact retained or exposed to downstream components.


2605.26423 2026-05-27 cs.LG eess.IV

FM-fMRI: Event Conditioned Flow Matching for Rest-to-Task fMRI Time-Series Synthesis

Peiyu Duan, Jiyao Wang, Nicha C. Dvornek, Junlin Yang, Ziqi Gao, Lawrence H. Staib, James S. Duncan

Affiliations: Department of Biomedical Engineering; Department of Radiology & Biomedical Imaging; Department of Electrical Engineering

AI summary: Proposes FM-fMRI, which uses event-conditioned flow matching to generate task fMRI time series from resting-state fMRI and task event information. It outperforms diffusion, GAN, and VAE baselines in spectral, connectivity, and distributional agreement and improves autism classification.

Comments MICCAI 2026 Early Accepted

Abstract

Task-based fMRI provides a direct readout of task-evoked neural dynamics, but it is expensive and difficult to acquire at scale, motivating rest-to-task synthesis from widely available resting-state fMRI (rsfMRI). We propose FM-fMRI, an event-conditioned flow-matching model that learns a continuous-time conditional vector field to generate task ROI time series from a subject's rsfMRI and the task event information. The formulation enables fast ODE-based sampling and flexible conditioning over heterogeneous event schedules. Rather than optimizing for pointwise reconstruction, we evaluate generated signals using complementary criteria that probe temporal and spectral structure, subject- and group-level connectome consistency, and distributional alignment. On the public Human Connectome Project and the internal BioPoint autism cohort, FM-fMRI achieves the strongest spectral and connectivity agreement and improved distribution-level matching over conditional diffusion, generative adversarial network (GAN), and variational autoencoder (VAE) baselines. Furthermore, we augment the BioPoint cohort by synthesizing task-fMRI ROI time series with our method, improving downstream autism classification and demonstrating practical utility in data-limited clinical settings. The code will be available on GitHub.
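To illustrate the general mechanics of flow matching and ODE-based sampling, here is a minimal sketch under stated assumptions. It uses the common straight-line interpolation path between a source sample and a target sample. The abstract does not say which path or solver FM-fMRI uses, and the `oracle_velocity` function is a hypothetical stand-in for the trained network, which would also receive the rsfMRI and event conditioning.

```python
def interpolate(x0, x1, t):
    """Point on the straight-line path from x0 (t=0) to x1 (t=1)."""
    return [(1 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Regression target for the vector field on the straight-line path."""
    return [b - a for a, b in zip(x0, x1)]


def euler_sample(velocity_fn, x0, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with explicit Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Illustrative values only: a 3-sample "time series" for one ROI.
noise = [0.0, 0.0, 0.0]
task_series = [0.5, -1.0, 2.0]


def oracle_velocity(x, t):
    """Hypothetical stand-in for a trained, conditioned velocity network."""
    return target_velocity(noise, task_series)


midpoint = interpolate(noise, task_series, 0.5)
sample = euler_sample(oracle_velocity, noise, steps=8)
```

During training, a network would be fit to `target_velocity` at randomly drawn `t` and `interpolate(x0, x1, t)`. At sampling time only the ODE integration is needed, which is why a small number of solver steps can suffice.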


2605.26421 2026-05-27 cs.CV

HydraPrompt: An Adaptive and Asymmetric Framework of Vision-Language Models for Synthetic Image Detection

Senyuan Shi, Hao Tan, Zichang Tan, Shuhan Feng, Ajian Liu, Sergio Escalera, Jun Wan

Affiliations: Beijing University of Posts and Telecommunications; School of Advanced Interdisciplinary Sciences (SAIS), University of Chinese Academy of Sciences; Shenzhen Institute of Advanced Technology (SIAT), Chinese Academy of Sciences; MAIS, Institute of Automation, Chinese Academy of Sciences; University of Barcelona

AI summary: Proposes HydraPrompt, an asymmetric prompting framework that dynamically adjusts category centers by aligning with fine-grained image cues and adds a conditional supervised contrastive objective, achieving state-of-the-art synthetic image detection performance.

Comments 8 pages, 6 figures

Abstract

The rapid evolution of generative models has precipitated a proliferation of fabricated content, posing significant challenges to existing Synthetic Image Detection (SID) methods. Capitalizing on advancements in vision-language models (e.g., CLIP), recent attempts have leveraged learnable textual prompts to identify synthetic images. However, they still use a static prompt as a fixed boundary between real and fake images, failing to adapt to the varying types of forgery that emerge during inference. To overcome this issue, we propose **HydraPrompt**, an asymmetric prompting framework that dynamically adjusts the category centers by aligning with fine-grained image cues. Specifically, we propose an Asymmetric Prompt Adapter (**APA**): (1) for the authentic category, we introduce a single set of prompts to capture the consistent representative patterns, which serves as a unified anchor for real content; (2) for the fake category, we construct sample-adaptive prompts that specialize in capturing diverse cues from different samples, enabling adaptive modeling of forgery image variations. To increase discriminability among different synthetic images, we further introduce a Conditional Supervised Contrastive (**CSC**) objective, which compacts the authentic representations while capturing fine-grained forgery clues. Extensive experiments on popular SID benchmarks demonstrate the state-of-the-art performance of our framework.


2605.26419 2026-05-27 cs.LG

Amortized Factor Inference Networks for Posterior Inference

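Joohwan Ko, Justin Domke

Affiliations: Manning College of Information and Computer Sciences; University of Massachusetts Amherst

AI summary: Proposes Amortized Factor Inference Networks (AFINs), which use an encode-merge-decode architecture to generalize posterior inference across different priors, likelihoods, and dimensionalities, greatly reducing test-time computation while maintaining posterior accuracy.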

2605.26415 2026-05-27 cs.CV cs.AI

The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP

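Kahyeon Nam, Hyesong Choi

Affiliations: Soongsil University

AI summary: To address the representation collapse caused by INT8 quantization of CLIP, proposes LRA-EE, which performs early exit through Spatio-Semantic Aggregation, a multi-feature gate, and layer-adaptive thresholds. On ImageNet-1K zero-shot classification it reduces FLOPs by 13.4% and improves accuracy by 2.44 percentage points.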

Abstract

Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architectures such as CLIP this introduces a failure mode distinct from quantized CNN classifiers: activation noise accumulated across transformer blocks perturbs the direction of the multimodal embedding, eroding the cosine alignment on which zero-shot retrieval depends. We characterize this as Quantization-Induced Representation Collapse (QIRC) and quantify it on INT8 CLIP ViT-B/32, where the layer-wise noise-to-signal ratio grows from below 10% in shallow blocks to 52% at Layer 11. We propose LRA-EE (Layer-wise Representation-Aware Early Exit), which bypasses noise-saturated deep layers via Spatio-Semantic Aggregation (replacing the immature shallow [CLS] with a global patch-token average), a learned multi-feature gate (confidence, top-2 margin, spatial-activation variance), and Layer-adaptive Confidence Thresholding calibrated to each layer's Information-to-Noise Ratio. On ImageNet-1K zero-shot classification, LRA-EE reduces FLOPs by 13.4% and improves Top-1 accuracy by +2.44%p (58.72% -> 61.16%) over the INT8 baseline. A four-quadrant decomposition isolates the Rescue Effect: 9.5% of samples are correctly classified at shallow exits but lost to noise at full depth, against only 7.1% suffering the inverse.
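The four-quadrant decomposition mentioned at the end of the abstract can be sketched as below. This is a minimal illustration, assuming each sample has a correctness flag for the shallow-exit prediction and one for the full-depth prediction. The function names and the toy flags are hypothetical.

```python
def four_quadrant(early_correct, full_correct):
    """Fraction of samples in each (shallow exit, full depth) correctness quadrant."""
    counts = {"both": 0, "rescued": 0, "harmed": 0, "neither": 0}
    for early, full in zip(early_correct, full_correct):
        if early and full:
            counts["both"] += 1
        elif early and not full:
            counts["rescued"] += 1  # right at the shallow exit, wrong at full depth
        elif full and not early:
            counts["harmed"] += 1   # wrong at the shallow exit, right at full depth
        else:
            counts["neither"] += 1
    n = len(early_correct)
    return {name: count / n for name, count in counts.items()}


def net_gain(fractions):
    """Accuracy difference (shallow exit minus full depth) implied by the quadrants."""
    return fractions["rescued"] - fractions["harmed"]


# Illustrative flags only, not data from the paper.
early = [True, True, False, False, True]
full = [True, False, True, False, False]
fractions = four_quadrant(early, full)
gain = net_gain(fractions)
```

Applied to the reported figures, 9.5% rescued against 7.1% harmed gives a net of about 2.4 percentage points, which is roughly in line with the reported +2.44%p Top-1 improvement. The match is approximate because the quoted quadrant percentages are rounded.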


2605.26405 2026-05-27 cs.CL

Towards Just-in-Time Adaptive Feedback: Enhancing Student Learning via Knowledge-Grounded LLM

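Younghun Lee, Amir Bralin, Nobel Sanjay Rebello, Dan Goldwasser

Affiliations: Department of Computer Science; Department of Physics and Astronomy; College of Education

AI summary: Proposes a framework that grounds large language models in domain-expert knowledge to deliver just-in-time adaptive feedback in authentic instructional settings. Deployed in a large university course, it improved student performance by over 80%.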

Comments 8 pages, Accepted to 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

Abstract

Educational interventions are effective tools for enhancing student learning. While Large Language Models (LLMs) allow for generating adaptive feedback at scale, current studies lack clear methodologies for providing Just-in-Time (JiT) feedback in authentic instructional settings. In this paper, we present a framework that provides adaptive feedback by grounding LLMs with domain-specific expert knowledge. Our approach collects written reasoning logic (strategy essays) from students, analyzes potential error types based on the content of that reasoning, and delivers non-intrusive feedback designed to clarify missing or incorrect concepts. We deploy this framework in a large-scale university course (N > 1000), where it improved student performance by over 80% compared to previous semesters. Lastly, we validate the framework's pedagogical utility by analyzing learning trajectories; we demonstrate how iterative conversations with an LLM help students shift from misconceptions to correct understanding.


2605.26403 2026-05-27 cs.AI

From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

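Xiaohua Wang, Jiakang Yuan, Zisu Huang, Muzhao Tian, Changze Lv, Kaitao Song, Tao Chen, Xiaoqing Zheng

Affiliations: Fudan University

AI summary: Proposes Calibrated Interactive RL, a framework that couples interactive reinforcement learning with simulator alignment to mitigate the policy-induced and simulator-induced distribution shift in multi-turn dialogue and improve dialogue quality.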

Abstract

A long-standing goal of the research community is to develop highly interactive LLM-based dialogue agents. Recent research focuses on optimizing policies based on fixed offline logs (Static Context RL) or using a prompt-based simulator (Interactive RL). In this work, we theoretically show that both paradigms are fundamentally limited by context distribution shift--a mismatch between dialogue histories observed during training and those encountered in real conversations. This shift compounds quadratically over turns and severely degrades dialogue quality. Specifically, we attribute this shift to two distinct sources: (i) policy-induced shift, arising from training on static histories rather than self-generated trajectories; and (ii) simulator-induced shift, stemming from discrepancies between simulated and real human behaviors. To address these challenges, we propose Calibrated Interactive RL, a unified framework that couples interactive RL with simulator alignment. By aligning the simulator with human interaction patterns, our approach reduces the sim-to-real gap and mitigates compounding distribution shifts. Experiments across multiple dialogue tasks confirm our theoretical analysis: (i) Interactive RL significantly outperforms the Static Context baseline by mitigating policy distribution shift; and (ii) calibrating simulators with our alignment method further bridges the sim-to-real gap, yielding state-of-the-art downstream performance.
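The claim that the shift "compounds quadratically over turns" can be made concrete with a toy model. The sketch below is not the paper's bound, which the abstract does not state. It assumes that every turn introduces a fixed mismatch `eps` that persists in all later dialogue histories, so the context at turn `t` carries up to `t * eps` of accumulated mismatch.

```python
def cumulative_shift(eps, turns):
    """Sum of per-turn context mismatch when turn t carries t * eps of accumulated shift."""
    return sum(eps * t for t in range(1, turns + 1))


def closed_form(eps, turns):
    """Closed form of the same sum: eps * T * (T + 1) / 2, which grows as O(eps * T^2)."""
    return eps * turns * (turns + 1) / 2


# Illustrative values only.
short_dialogue = cumulative_shift(0.01, 10)
long_dialogue = cumulative_shift(0.01, 20)
growth = long_dialogue / short_dialogue
```

Under this assumption, doubling the dialogue length roughly quadruples the accumulated shift. This is the same flavor of argument as the well-known quadratic error compounding in imitation learning from fixed demonstrations, and it gives some intuition for why training on self-generated trajectories can help.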


2605.26399 2026-05-27 cs.CV

OmniGF: A Dual-Branch Vision-Language Framework for Unified Gaze Following

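Qiaomu Miao, Haoyu Wu, Jingyi Xu, Minh Hoai, Dimitris Samaras

Affiliations: Stony Brook University; The University of Adelaide

AI summary: Proposes OmniGF, which combines a dual-branch decoding strategy (a language branch that generates discrete reasoning states and a continuous spatial branch that uses dense hidden states) with head embeddings to achieve precise spatial gaze estimation, semantic gaze prediction, and complex social gaze reasoning in multi-person scenes, setting a new state of the art on multiple benchmarks.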

Abstract

Understanding human gaze behavior is essential for complex scene comprehension and human-computer interaction. Traditional gaze following models are typically restricted to pure spatial localization, lacking the high-level capacity to reason about semantic targets or complex social contexts. Furthermore, these models often process individuals sequentially, requiring redundant computations over the same scene image for multi-person inference. While recent Vision-Language Models (VLMs) offer the exceptional semantic reasoning needed to address gaze-related semantic tasks, their reliance on discrete text generation inherently limits precision in continuous spatial tasks like gaze localization. To bridge this gap, we propose OmniGF, a unified vision-language framework that adapts foundational VLMs for highly scalable multi-person gaze reasoning. The model adopts a dual-branch decoding strategy: a structured language branch generates discrete reasoning states, while a continuous spatial branch directly taps into the VLM's dense hidden states. Supervising these extracted representations with high-resolution gaze target heatmaps effectively overcomes the spatial bottleneck of text-only coordinate generation. Furthermore, to explicitly ground the model in multi-person scenes, we augment the input with head embeddings encoded from cropped head images, providing fine-grained appearance and orientation cues for all individuals simultaneously. By modeling all individuals and leveraging the strong semantic capability of VLMs, OmniGF seamlessly integrates precise spatial gaze target estimation, semantic gaze prediction, and complex social gaze reasoning. Extensive experiments demonstrate that our framework establishes new state-of-the-art performance across multiple standard benchmarks. Code is available at https://github.com/cvlab-stonybrook/omnigf.

