arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.16767 2026-06-16 cs.CV New submission

Text-Vision Co-Instructed Image Editing

Chenxi Xie, Yuhui Wu, Qiaosi Yi, Lei Zhang

Affiliations: The Hong Kong Polytechnic University; OPPO Research Institute

AI summary: Proposes TV-Edit, a framework that combines the semantic expressiveness of text instructions with the spatial guidance of sparse visual instructions for precise, intent-faithful image editing, and reports significant gains over existing methods.

Abstract

Existing image editing methods can be generally categorized into textual instruction-based and visual prompt-based ones. Textual instructions are semantically expressive, but offer only coarse-grained spatial control over the editing results. In contrast, visual prompts such as drags and points can provide precise spatial guidance, but are limited by the inherent ambiguity of their semantic intent. To combine the strengths of textual and visual prompts, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming to achieve precise and intent-faithful image manipulation. To this end, we first construct a textual-visual instruction paired dataset with more than 23K samples derived from dynamic videos, enabling aligned supervision for cross-modal instruction. We then propose TV-Edit, a Textual-Visual instruction unified Editing framework to contextualize drag or point-based visual instructions with image-text semantics and lift them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent and spatial constraints, TV-Edit leads to more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-based alternatives. Finally, we establish TV-Edit-Bench, a deliberately designed benchmark to evaluate semantic faithfulness, spatial alignment, and visual consistency with ground-truth references and controlled textual-visual variations for reliable assessment. Our experiments across multiple editing backbones demonstrate that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.

2606.16765 2026-06-16 cs.LG physics.flu-dyn New submission

A Validated LBM Dataset and Pipeline for Surrogate Modeling of Turbulent 3D Obstructed Channel Flows

Lukas Schröder, Shubham Kavane, Harald Köstler

Affiliations: Chair of Computer Science 10 (System Simulation); Friedrich-Alexander-Universität Erlangen-Nürnberg

AI summary: Presents a reproducible pipeline that generates training data for 3D channel flows at Reynolds numbers 1,000-10,000, using a lattice Boltzmann solver with cumulant collision operators that is validated against experimental measurements and grid convergence studies, providing a basis for standardized comparison of neural operators.

Comments 4 pages + appendix, 9 figures, Accepted at the 1st Workshop on Differentiable Systems and Scientific Machine Learning (SysDiff) @ EurIPS 2025, OpenReview: https://openreview.net/forum?id=rdmHT72NQH

Abstract

Evaluating neural operators for 3D turbulent flow requires validated datasets with physical benchmarks. We present a reproducible pipeline generating training data for 3D channel flows around generated geometries at Re=1,000-10,000. Our lattice Boltzmann solver with cumulant collision operators is rigorously verified against experimental measurements (Strouhal number, drag coefficients, turbulent fluctuations) with comprehensive grid convergence studies at resolution 1024x512x512. Building upon an established framework, this validated pipeline enables standardized surrogate model comparison. We outline planned systematic evaluation of Fourier Neural Operator and U-Net variants on forecasting, super-resolution, and error correction tasks, using physics-informed metrics to assess turbulent energy cascade representation. Future work will compare computational efficiency between numerical solvers and neural surrogates, exploring practical application. We seek community feedback on our validation approach, planned benchmark methodology, and evaluation priorities for neural operators in turbulent flows.
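
The two dimensionless quantities named in this abstract, the Reynolds number (Re = U L / ν) and the Strouhal number (St = f L / U), can be illustrated with a short sketch. The function names and the lattice-unit values below are hypothetical and are not taken from the paper's pipeline; the code only shows how a target Re fixes the viscosity once a velocity and obstacle size are chosen.

```python
def reynolds_number(velocity, length, kinematic_viscosity):
    """Re = U * L / nu (dimensionless)."""
    return velocity * length / kinematic_viscosity


def strouhal_number(shedding_frequency, length, velocity):
    """St = f * L / U (dimensionless vortex-shedding frequency)."""
    return shedding_frequency * length / velocity


def viscosity_for_reynolds(target_re, velocity, length):
    """Viscosity that yields the target Re for a given velocity and length."""
    return velocity * length / target_re


# Hypothetical values in lattice units (assumptions, not from the paper).
inflow_velocity = 0.05
obstacle_size = 64.0

for target_re in (1_000, 10_000):
    nu = viscosity_for_reynolds(target_re, inflow_velocity, obstacle_size)
    print(target_re, nu, reynolds_number(inflow_velocity, obstacle_size, nu))
```

Raising Re at fixed velocity and obstacle size lowers the required viscosity, which is one reason high-Re runs demand fine grids and stable collision operators.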

2606.16759 2026-06-16 cs.LG New submission

Maximum Entropy Inverse Reinforcement Learning for Mean-Field Games with Average Reward

Şevket Kaan Alkır, Naci Saldı, Berkay Anahtarcı, Can Deha Karıksız

Affiliations: Bilkent University; Özyeğin University

AI summary: For discrete-time, infinite-horizon mean-field games under the average-reward criterion, proposes an inverse reinforcement learning method based on maximum causal entropy that handles finite-dimensional linear rewards and infinite-dimensional RKHS rewards in a unified occupation-measure framework, with a gradient ascent algorithm for policy recovery.

Comments 49 pages, 2 figures, 2 tables

Abstract

We study inverse reinforcement learning for discrete-time, infinite-horizon mean-field games (MFGs) under an average-reward criterion. Expert demonstrations are assumed to arise from a stationary mean-field equilibrium under an unknown reward, and the goal is to recover a policy explaining the observed behaviour via the maximum causal entropy principle. We formulate the inverse problem by enforcing consistency with the expert mean-field term and long-run feature expectations, treating two reward classes within a unified occupation-measure framework. For finite-dimensional linear rewards, we give a convex dual reformulation with an explicit log-partition objective, and prove smoothness and curvature properties justifying constant-step-size gradient descent. For infinite-dimensional RKHS rewards, we develop a Lagrangian relaxation whose inner-maximising policy is characterised by a soft Bellman equation. The main obstacle is the absence of a discount-factor contraction. We resolve this by introducing a minorisation-based sub-stochastic kernel that yields a strict contraction of the soft Bellman operator. We establish Fréchet differentiability and Lipschitz smoothness of the log-likelihood score, leading to a gradient ascent algorithm with convergence guarantees. Two numerical examples, a malware-spread MFG and an RKHS-based consumer-choice model, show that the recovered policies closely match expert behaviour.

2606.16756 2026-06-16 cs.CV New submission

3D Classification of Paramagnetic Rim Lesions in Multiple Sclerosis via Asymmetric QSM-FLAIR Modeling

Veronica Pignedoli, Giacomo Boffa, Nicoletta Noceti, Matilde Inglese, Francesca Odone, Matteo Moro

Affiliations: MaLGa, DIBRIS, University of Genova; DINOGMI, University of Genova; IRCCS Azienda Ospedaliera Metropolitana

AI summary: Proposes a 3D multimodal deep learning framework that uses asymmetric QSM-FLAIR modeling to automatically classify paramagnetic rim lesions in multiple sclerosis, with self-supervised pretraining and contrastive regularization for robustness under limited data, evaluated on a cohort of 88 patients.

Comments 10 pages, 3 figures, accepted at MICCAI 2026. GitHub link: https://github.com/veronicapignedoli/FRODO

Abstract

Paramagnetic rim lesions (Rim$^+$) identified on susceptibility-sensitive MRI have recently emerged as a specific biomarker of chronic active inflammation in Multiple Sclerosis (MS) and are associated with long-term disability progression. However, susceptibility imaging and expert interpretation remain limited to specialized centers, visual assessment is time-consuming and variable, and the low prevalence of Rim$^+$ lesions poses severe class imbalance challenges for automated analysis. We propose a 3D multimodal deep learning framework for lesion-level Rim$^+$/Rim$^-$ classification from Quantitative Susceptibility Mapping (QSM) and FLAIR MRI. The architecture explicitly models modality asymmetry by treating QSM as the primary susceptibility-driven signal and conditioning it with FLAIR-derived structural context. To improve robustness under limited data, we employ self-supervised multimodal pretraining followed by supervised fine-tuning with contrastive regularization. The method was evaluated on a clinically acquired cohort of 88 people with MS with expert lesion annotations as reference standard. Results highlight improved performance compared to prior architectures, supporting the effectiveness of asymmetric multimodal modeling for automated chronic active lesion identification.

2606.16753 2026-06-16 cs.CL cs.AI cs.LG New submission

P3B3: A Multi-Turn Conversational Benchmark for Measuring European and Brazilian Portuguese Variety Bias in LLMs

Rafael Ferreira, Inês Vieira, Inês Calvo, James Furtado, Iago Paulo, Diogo Tavares, Diogo Glória-Silva, David Semedo, João Magalhães

Affiliations: NOVA University of Lisbon; NOVA LINCS

AI summary: Proposes the P3B3 benchmark, which uses expert-curated conversational prompts and an evaluation framework to measure LLM bias and controllability across Portuguese varieties (European vs. Brazilian), and finds that most models lean toward Brazilian Portuguese.

Comments Accepted at MeLLM Workshop at ACL 2026

Abstract

As Large Language Models (LLMs) become embedded in everyday communication, capturing regional linguistic variation is essential for reliable and equitable language use. In Portuguese, European (pt-PT) and Brazilian (pt-BR) varieties remain unevenly represented, with pt-BR dominating in data quantity, while LLM preference between Portuguese varieties remains underexplored. To address this gap, we introduce P3B3, an expert-curated, language-variety-agnostic benchmark of conversational prompts, along with an evaluation framework for measuring variety bias and controllability. Experiments on several models show that most LLMs exhibit a strong bias toward pt-BR, with variation in controllability across models. These results highlight the need for more balanced multilingual representation across language varieties.

2606.16749 2026-06-16 cs.CV New submission

Structure-aware Knowledge-guided Heterogeneous Mamba for Zygomaticomaxillary Suture Assessment

Xiaoqi Guo, Birui Chen, Xinquan Yang, Chaoyun Zhang, Xuefen Liu, Mianjie Zheng, Kun Tang, Xuguang Li, Wen Ma, Yanhua Xu, Linlin Shen

Affiliations: College of Computer Science and Software Engineering, Shenzhen University; School of Artificial Intelligence, Shenzhen University; Affiliated Stomatology Hospital of Kunming Medical University; Shenzhen University General Hospital

AI summary: Presents the first public ZMS dataset (3,790 images covering ages 4-24) and the SKMamba framework, which uses a decoupled dual-path architecture, an Implicit Edge Extractor, and a Cross-Modal Semantic Alignment module for automated ZMS maturation assessment, outperforming existing methods.

Abstract

The Zygomaticomaxillary Suture (ZMS) is a key circummaxillary structure that connects the zygomatic bone and the maxilla and serves as a primary site of resistance during maxillary advancement; its maturation status directly influences the timing and efficacy of orthopedic interventions. However, accurate staging of ZMS maturation remains challenging due to subtle high-frequency transitions in suture lines and the global semantic ambiguity between adjacent stages. To address this, we present the first public ZMS dataset, comprising 3,790 ZMS images covering the entire age range from 4 to 24 years. Based on this dataset, we propose SKMamba, a Structure-aware and Knowledge-guided Mamba-based multi-modal framework for automated ZMS maturation assessment. SKMamba adopts a decoupled dual-path architecture that mimics the hierarchical diagnostic process used by experienced orthodontists. We first introduce an Implicit Edge Extractor (IEE), which leverages structural pre-training to reduce trabecular noise and accentuate sutural boundaries. Complementarily, a Cross-Modal Semantic Alignment (CSA) module is designed to incorporate anatomical descriptions from a large language model (LLM). This module helps align local morphological cues with global semantic descriptions while ensuring that objective morphological evidence remains the primary basis for decisions. Extensive experiments on our ZMS dataset demonstrate that SKMamba achieves state-of-the-art performance compared to existing methods. Code is available at https://github.com/galaxygxq1116/SKMamba.

2606.16748 2026-06-16 cs.LG cs.CL New submission

MyPCBench: A Benchmark for Personally Intelligent Computer-Use Agents

Lawrence Keunho Jang, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov

Affiliations: Carnegie Mellon University

AI summary: Proposes the MyPCBench benchmark, which tests personal computer-use agents in a simulated realistic desktop environment with 17 web applications; the best model, Claude Opus 4.6, solves only 55.4% of tasks, with failures concentrated on multi-application and long-trajectory tasks.

Abstract

Current benchmarks for computer-use agents evaluate models in impersonal environments. This leaves a gap between evaluation and deployment where personal assistants are expected to work across a user's whole digital life, including their context, historical data, and logged-in accounts. This gap is widest on web tasks, where live web evaluations cannot exercise sites that require logging in or personal information, the kind of site a real personal assistant has to drive. We introduce MyPCBench, which tests computer-use agents as personal assistants on a Linux desktop populated with 17 simulated real-world web applications and a full desktop stack, all seeded for one canonical persona, Michael Scott from The Office. We define 184 tasks in this environment, each inspired by a real request drawn from the OpenClaw community, and benchmark six closed and open-weight models with a uniform computer+bash tool surface. We find that the best model, Claude Opus 4.6, fully solves 55.4% of the tasks, the only model above 50%. Model failures cluster on tasks that span many applications and on long trajectories, where personalization stresses an assistant the most. We release the environment, task set, and agent harness at https://mypcbench.com.

2606.16742 2026-06-16 cs.CV cs.AI New submission

Revealing Artifacts via Noise Amplification: A Novel Perspective for AI-Generated Video Detection

Renxi Cheng, Jie Gui, Hongsong Wang

Affiliations: School of Cyber Science and Engineering, Southeast University; Purple Mountain Laboratories; Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education; School of Computer Science and Engineering, Southeast University; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education

AI summary: For AI-generated video detection, proposes a bit-plane-based Noise Amplification method that combines pixel-level intensity enhancement, region-level spatial amplification, and frame-level temporal aggregation, and outperforms existing methods on the GenVidBench and HardGVD benchmarks.

Comments 13 pages, 5 figures

Abstract

With the rapid advancement of video generation models, distinguishing AI-generated videos from authentic ones has become challenging. Most existing research concentrates on detectors for samples generated by generative adversarial networks. The detection of AI-generated videos, particularly those produced by text-to-video models, remains unexplored territory. Although state-of-the-art text-to-video models can generate realistic visual content similar to real videos, they fall short of reproducing fine image details and the way those details change across frames. Inspired by this, we address AI-generated video detection from the novel perspective of bit-planes, which can effectively describe the details or noise in images or videos. To this end, we propose a simple yet effective approach called Noise Amplification. This approach first extracts noise signals based on bit-planes, then amplifies these noise signals, and finally feeds them into discriminator networks for fake-video classification. Noise Amplification combines three components: pixel-level intensity enhancement, region-level spatial amplification, and frame-level temporal aggregation. To evaluate AI-generated video detection methods in challenging scenarios, we also introduce a benchmark named HardGVD. Extensive experiments on both the large-scale dataset GenVidBench and HardGVD show that our simple approach significantly outperforms state-of-the-art methods.
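
To make the bit-plane idea concrete, here is a minimal sketch of splitting an 8-bit grayscale frame into bit-planes and stretching the low-order bits so they become visible. It illustrates the general notion of bit-planes only: the function names, the choice of two low bits, and the linear stretch are assumptions, and the abstract does not specify the paper's extraction, region-level, or temporal steps.

```python
def bit_plane(frame, k):
    """Return the k-th bit-plane of an 8-bit frame (k=0 is the least significant bit)."""
    return [[(pixel >> k) & 1 for pixel in row] for row in frame]


def low_bit_residual(frame, num_low_bits=2):
    """Keep only the lowest bits, which mostly carry fine detail and noise."""
    mask = (1 << num_low_bits) - 1
    return [[pixel & mask for pixel in row] for row in frame]


def amplify(residual, num_low_bits=2):
    """Stretch the residual to the 0..255 range (a simple intensity enhancement)."""
    scale = 255 // ((1 << num_low_bits) - 1)
    return [[value * scale for value in row] for row in residual]


# A tiny hypothetical 2x2 grayscale frame.
frame = [[200, 201], [37, 255]]

lsb_plane = bit_plane(frame, 0)        # [[0, 1], [1, 1]]
residual = low_bit_residual(frame)     # [[0, 1], [1, 3]]
amplified = amplify(residual)          # [[0, 85], [85, 255]]
```

A detector would then consume the amplified residual instead of (or alongside) the raw pixels; how the paper aggregates such signals over regions and frames is not shown here.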

2606.16735 2026-06-16 cs.RO New submission

Pride and Prejudice: Toward an Information-Theoretic Framework for Mutually Communicative Driver Behavior Modeling

Tingjun Li, Nan Xu, Shuo Feng, Hassan Askari, Bruno Henrique Groenner Barbosa, Konghui Guo

Affiliations: State Key Laboratory of Automotive Chassis Integration and Bionics, Jilin University; Beijing National Research Center for Information Science and Technology, Tsinghua University; Department of Engineering, Brock University; Department of Automatics, Federal University of Lavras

AI summary: To address the safety and efficiency problems caused by misread intentions between autonomous and human-driven vehicles, proposes an information-theoretic model of implicit mutual communication that combines a Bayesian persuasion game with information-theoretic rewards, reducing the prediction error of mandatory lane changes by up to 20% on the NGSIM dataset.

Comments 16 pages, 10 figures. Accepted for the IEEE Transactions on Intelligent Transportation Systems (T-ITS), June 2026

Abstract

Mixed autonomy driving becomes unsafe and inefficient when autonomous vehicles (AVs) and human-driven vehicles (HVs) misread each other's intentions. We study this problem as implicit mutual communication in lane changes. The proposed framework models how the ego vehicle both expresses its intent and probes the other driver's preference under epistemic uncertainty. It combines a level-k Bayesian persuasion game with virtual features for proactive signaling, information-theoretic rewards for mutual communication, and adaptive weights of communication affordances. We further introduce the Pride-Inquiry (P-I) and Pride-Prejudice (P-P) planes to analyze communication intensity and tendency. The model is calibrated with a Communication-Based Multi-Agent Inverse Reinforcement Learning algorithm (C-MIRL) on the naturalistic NGSIM dataset. Compared with the non-communicative baseline, the proposed model reduces the prediction error of mandatory lane changes by up to 20% while maintaining strong generalization. Driver-In-the-Loop questionnaire scores are positively correlated with the calibrated communication variables, supporting the subjective validity of the model. The learned rewards further show that inquiry and listening affordances contribute more than pride and expression alone, and that inquiry preference varies more strongly across drivers. These results support explicit modeling of mutual communication and epistemic uncertainty in interactive driving.

2606.16733 2026-06-16 cs.AI New submission

A First-Principles Derivation of LLM Policy Optimization: From Expected Reward to GRPO and Its Structural Extensions

Jianghan Shen, Siqi Luo, Yue Li, Jiyao Liu, Wanying Qu, Yi Zhang, Ziyan Huang, Tianbin Li, Ming Hu, Xiaohong Liu, Yirong Chen, Junjun He

Affiliations: Nanjing University; Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University; Peking University; Fudan University; Nanjing University of Aeronautics and Astronautics; Shanghai Innovation Institute

AI summary: Starting from first principles, uses two axes, trajectory probability and reward, to give a unified analysis of LLM policy optimization methods from REINFORCE and PPO to GRPO and its variants, exposing the rationale behind design choices and compound failure modes.

Abstract

Policy gradient algorithms for language models optimize the same objective $J(θ) = \mathbb{E}_{τ\sim p_θ(τ)}[R(τ)]$, which has exactly two factors: the trajectory probability $p_θ(τ)$ and the reward $R(τ)$. Every method from REINFORCE to PPO to GRPO and their descendants modifies one or both factors to address a specific failure in the preceding formulation. Existing surveys organize these methods by domain or chronology, which obscures the rationale behind each design choice and the precise location of its intervention within the gradient estimator. This survey revisits the landscape of LLM policy optimization from $J(θ)$ on first principles and uses the trajectory side, induced by $p_θ(τ)$, and the reward side, induced by $R(τ)$, as the two axes along which methods are located. It covers the path from REINFORCE and PPO to GRPO, as well as post-GRPO variants, Agentic RL, and GRPO-OPD. The resulting framework is unified, diagnostic, and extensible: it analyzes methods from a shared objective, identifies which side each method modifies and why, and applies the same trajectory and reward axes across these settings. Across these settings, the framework also exposes compound failures that no single-side fix resolves and that therefore require joint design of the trajectory side and the reward side. The boundary cases and coupled failures identified by this map mark where existing solutions run out and provide a principled starting point for designing the next generation of LLM policy optimization algorithms.
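
The two factors of the objective can be sketched in a few lines. The standard score-function (REINFORCE) identity is $\nabla_θ J(θ) = \mathbb{E}_{τ\sim p_θ(τ)}[R(τ)\,\nabla_θ \log p_θ(τ)]$, and GRPO-style methods replace the raw reward with a group-normalized advantage. The Python below is a minimal illustration rather than the survey's formulation: the function names are hypothetical, and the population standard deviation and the `eps` constant are assumptions, since implementations differ on both.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Reward side: normalize rewards within one group of sampled responses."""
    mean = fmean(rewards)
    std = pstdev(rewards)  # assumption: population standard deviation
    return [(r - mean) / (std + eps) for r in rewards]


def surrogate_loss(log_probs, advantages):
    """Trajectory side: negative advantage-weighted log-likelihood.

    With the advantages held constant, the gradient of this value with respect
    to the policy parameters is a sample-based estimate of the negative policy
    gradient, up to the baseline and scaling introduced by the normalization.
    """
    return -fmean([lp * a for lp, a in zip(log_probs, advantages)])


# Hypothetical group of four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
log_probs = [-1.2, -0.7, -2.0, -0.9]  # log p_theta(tau) for each response

advantages = group_relative_advantages(rewards)
loss = surrogate_loss(log_probs, advantages)
```

Responses with above-average reward get positive advantages, so minimizing the loss raises their log-probability. PPO-style clipping and KL penalties, which many of the surveyed methods add, are omitted here.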

2606.16731 2026-06-16 cs.SD cs.AI cs.HC New submission

MuVAP: Multimodal Multiparty Voice Activity Projection for Turn-taking Prediction in the Wild

Haotian Qi, Gabriel Skantze

Affiliations: Department of Speech Music and Hearing, KTH Stockholm, Sweden

AI summary: Proposes the MuVAP framework, which grounds acoustic predictions in face tracks to enable speaker-aware turn-taking prediction from monaural audio and a single camera view, and introduces Role-Relative Projection and the AVCC dataset to address multiparty modeling and causal tracking.

Abstract

Current multiparty turn-taking models often rely on complex microphone arrays or multi-camera setups, limiting their applicability in human-robot interaction scenarios. We introduce MuVAP, a causal multimodal framework that extends Voice Activity Projection by grounding acoustic predictions in face tracks, enabling speaker-aware turn-taking predictions from a monaural audio stream and a single camera view. To address the combinatorial complexity of modeling multiple speakers, we propose Role-Relative Projection, which maps any N-speaker interaction onto a fixed current versus next floor-holder state. Because existing audiovisual datasets contain disruptive editing cuts that break causal tracking, we introduce the Audio-Visual Conversation Corpus, a 31-hour dataset of unedited, single-camera multiparty conversations. Evaluations demonstrate that MuVAP outperforms strong baselines on Shift-Hold and next-speaker prediction tasks across two- and three-speaker settings.

2606.16729 2026-06-16 cs.LG math.OC New submission

Learning Policy from a Single Trajectory in Average-Reward Markov Decision Process

Jongmin Lee, Ernest K. Ryu, Vaneet Aggarwal

Affiliations: Seoul National University; UCLA; Purdue University

AI summary: For weakly communicating average-reward MDPs, establishes the first finite-sample complexity guarantees from a single trajectory and proposes model-free methods; the value-based and policy-based methods achieve sample complexities of $\widetilde{O}(1/\varepsilon^2)$ and $\widetilde{O}(1/\varepsilon^4)$, respectively.

Abstract

While there is an extensive body of work characterizing the sample complexity of discounted cumulative-reward MDPs, finite sample analyses for average-reward MDPs have been limited, and most existing works rely on restrictive assumptions such as ergodicity or access to a generative model. In this work, we establish the first finite sample complexity guarantees from a single trajectory for weakly communicating average-reward MDPs. To this end, we study the dynamics of a single trajectory in weakly communicating MDPs and based on this analysis, we develop novel model-free methods. Notably, our value-based and policy-based methods provide finite sample complexity guarantees of $\widetilde{O}(1/\varepsilon^2)$ and $\widetilde{O}(1/\varepsilon^4)$ from a single trajectory in weakly communicating MDPs, respectively. Furthermore, we introduce the first model-free method that requires no prior knowledge of problem-dependent quantities for communicating MDPs.

2606.16723 2026-06-16 cs.AI New submission

AgentFairBench: Do LLM Agents Discriminate When They Act?

Triveni Morla, Rohith Reddy Bellibaltu, Manpreet Singh, Manmeet Singh Kapoor

Affiliations: Florida International University; Boston University; Department of Computer Science and Engineering, Indian Institute of Technology Patna

AI summary: Proposes the AgentFairBench benchmark, which uses counterfactual matched sets and the Bias Conduction Framework to evaluate the fairness of LLM agents' actions in hiring, lending, and medical triage; finds that a mismatch in statistic arity overstates disparity, and that after arity matching Claude Haiku shows no significant demographic effect.

Comments Submitted to IEEE Access

Abstract

Large language model (LLM) agents increasingly take actions (screening applicants, recommending credit, triaging patients), yet fairness for LLMs is still measured by grading answers. We introduce AgentFairBench, a cheap, reproducible, multi-domain benchmark for demographic disparity in the actions of LLM agents. Grounded in a companion framework, the Bias Conduction Framework (BCF, restated here), it spans three regulator-anchored domains: hiring, lending, and medical triage. Synthetic, demographic-neutral profiles are evaluated in counterfactual matched sets that vary only a name-coded race × gender signal (in the Bertrand-Mullainathan tradition), under four agent scaffolds of increasing agency (direct, chain-of-thought, multi-agent deliberation, tool-augmented). A NumPy-only harness computes counterfactual flip rate, mean absolute score difference (MASD), action-rate disparity, and tool-invocation disparity, with bootstrap confidence intervals, paired tests, and false-discovery-rate control, for single-digit dollars per model. A live leaderboard with a held-out private split and a contamination canary admits external models by submission. Our pilot (864 decisions plus a test-retest replication) carries a methodological lesson: comparing a six-group score spread against a two-run noise difference overstates disparity by about 2.4x through statistic arity alone. Against an arity-matched noise floor and an omnibus group test, Claude Haiku 4.5 shows no demographic effect above sampling noise (0 of 120 pairwise and 0 of 9 omnibus contrasts survive correction); a planted-bias test confirms the instrument detects disparity when present. The contribution is a sound, sensitive, adoption-ready instrument, the arity-matched null methodology, and open artifacts to scale it. Code, data, and harness are released under open licenses, with an anonymized review artifact.
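
Two of the metrics named above can be sketched as follows. The abstract does not give their exact definitions, so the ones used here are assumptions: the flip rate is taken as the share of matched sets in which the decision is not identical across all demographic variants, and MASD as the mean absolute pairwise score difference within a set, averaged over sets. The function names and data are hypothetical, and the sketch uses only the standard library rather than the paper's NumPy harness.

```python
from itertools import combinations
from statistics import fmean


def counterfactual_flip_rate(matched_decisions):
    """Share of matched sets whose decisions differ across demographic variants."""
    flipped = sum(1 for decisions in matched_decisions if len(set(decisions)) > 1)
    return flipped / len(matched_decisions)


def mean_absolute_score_difference(matched_scores):
    """Mean absolute pairwise score gap within each set, averaged over sets."""
    per_set = [
        fmean([abs(a - b) for a, b in combinations(scores, 2)])
        for scores in matched_scores
    ]
    return fmean(per_set)


# Two hypothetical matched sets; each inner list holds the agent's output for
# the same profile under three different name-coded variants.
decisions = [["hire", "hire", "hire"], ["hire", "reject", "hire"]]
scores = [[7, 7, 7], [6, 8, 7]]

flip_rate = counterfactual_flip_rate(decisions)   # 0.5
masd = mean_absolute_score_difference(scores)     # 2/3, about 0.667
```

Any such value only becomes meaningful next to a noise floor computed with the same number of compared groups, which is the arity-matching point the abstract makes.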

2606.16721 2026-06-16 cs.AI New submission

Medical world models: representing medical states, modelling clinical dynamics and guiding intervention policies

Ke Liu, Mengxuan Li, Yanyi Bao, Tianyun Zhang, Chong Chu, Jiajun Bu, Haishuai Wang

Affiliations: College of Computer Science, Zhejiang University; School of Medicine, Zhejiang University; Department of Biomedical Informatics, Harvard University

AI summary: Proposes a medical world model framework that advances medical AI from static diagnosis toward dynamic simulation by constructing patient states, modeling clinical dynamics, and supporting intervention decisions.

Abstract

Medical diagnosis and treatment are dynamic processes in which patient states evolve over time and clinical interventions alter future outcomes. Although current medical AI can detect disease, estimate risk and generate reports, many systems still return static labels or scores, offering limited insight into how illness may progress or how alternative interventions may reshape its trajectory. Medical world models adapt the world-model idea from artificial intelligence to healthcare by learning internal simulators of patient-state dynamics. Their long-term goal is to help clinicians anticipate deterioration, compare treatment-conditioned futures and tailor care to individual patients. Yet relevant work remains scattered across foundation models, longitudinal modelling, disease simulation, treatment-effect estimation, reinforcement learning and digital twins. To bridge this gap, this review outlines a roadmap for advancing medical AI from isolated diagnosis and prediction toward medical world models that simulate disease evolution and support intervention decisions. This roadmap is organized around three coupled capabilities: patient-state construction, clinical dynamics modelling and intervention decision support. Across representative systems, the comparison highlights what each capability contributes and how partial components can be integrated into more mature perception-dynamics-planning systems. Finally, we identify the challenges involved in turning plausible rollouts into clinically useful simulators. Related literature is available at https://github.com/1999kevin/awesome_medical_world_models.

2606.16707 2026-06-16 cs.AI 新提交

User as Code: Executable Memory for Personalized Agents

用户即代码:面向个性化智能体的可执行记忆

Bojie Li

发表机构 * Pine AI

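AI summary Proposes User as Code, an executable-memory paradigm that turns the user model into runnable Python code and uses a two-phase pipeline for exact aggregation and rule enforcement; it reaches 78.8% recall on a long-conversation benchmark and 99% accuracy on aggregate questions, and can proactively raise safety alerts.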

详情
AI中文摘要

个性化AI智能体需要用户记忆:一个关于用户是谁的持久模型,通过多次对话构建并在每次新对话中查询。如今,这种记忆几乎总是以非结构化文本、知识图谱或扁平事实存储的形式保存,并通过检索——获取与当前请求最相似的条目——来查询。这种“事实袋”记忆能很好地回忆单个事实,但由于存储事实和基于事实行动是分离的步骤,它在解决矛盾、聚合多条记录或执行规则方面存在困难。我们认为用户记忆应该是可执行的。我们引入用户即代码(UaC)范式,其中智能体对用户的模型是一个活的软件项目:类型化的Python对象保存用户状态,普通的Python函数编码管理状态的规则,因此表示和推理用户发生在同一个可由解释器运行的媒介中。实现机制是一个两阶段流水线:一个只追加的日志从不丢弃任何事实,并定期检查点化为类型化代码。这改变了记忆的能力。在标准的长对话基准测试中,UaC在召回率上匹配全上下文上限和最强的先前记忆系统(LOCOMO上78.8%)。其优势在表示至关重要的地方显现。在关于用户历史的聚合问题(如“我去年进行了多少次国际旅行?”)上,基于检索的记忆崩溃(6-43%),而UaC保持近乎完美(99%),因为答案是对类型化状态的一行计算,而不是对文本的搜索。而且,由于其规则在状态变化时确定性执行,UaC能够呈现未经请求的、安全关键的警报——例如新开的药物与数月前记录的过敏相冲突——这是查询驱动记忆无法提供的能力。

英文摘要

A personalized AI agent needs a user memory: a persistent model of who the user is, built across many conversations and consulted on each new one. Today this memory is almost always stored as unstructured text, a knowledge graph, or a flat store of facts, and consulted by retrieval -- fetching the entries most similar to the current request. Such "bag-of-facts" memory recalls individual facts well, but because storing a fact and acting on it are separate steps, it struggles to resolve contradictions, aggregate over many records, or enforce rules. We argue that user memory should instead be executable. We introduce User as Code (UaC), a paradigm in which an agent's model of a user is a living software project: typed Python objects hold the user's state and ordinary Python functions encode the rules that govern it, so representing and reasoning about the user happen in one medium an interpreter can run. The enabling mechanism is a two-phase pipeline: an append-only log that never discards a fact, periodically checkpointed into typed code. This changes what memory can do. On standard long-term conversation benchmarks, UaC matches both a full-context upper bound and the strongest prior memory systems on recall (78.8% on LOCOMO). Its advantage emerges where representation matters most. On aggregate questions over a user's history -- "how many international trips did I take last year?" -- retrieval-based memory collapses (6-43%) while UaC stays near-perfect (99%), because the answer is a one-line computation over typed state rather than a search over text. And because its rules execute deterministically whenever the state changes, UaC can surface unsolicited, safety-critical alerts -- such as a newly prescribed drug that conflicts with an allergy recorded months earlier -- a capability query-driven memory cannot provide.
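The idea of typed state plus ordinary functions can be made concrete with a small sketch. The code below is an illustration written for this digest, not the paper's implementation: the `Trip` and `UserState` classes, the `record` method, and the `DRUG_CLASS` lookup table are all hypothetical names, and the state is updated on every fact rather than through the periodic checkpointing the abstract describes.

```python
from dataclasses import dataclass, field
from datetime import date

# Hypothetical toy lookup from drug name to allergen class (illustration only).
DRUG_CLASS = {"amoxicillin": "penicillin", "ibuprofen": "nsaid"}


@dataclass(frozen=True)
class Trip:
    country: str
    start: date


@dataclass
class UserState:
    home_country: str
    trips: list = field(default_factory=list)
    allergies: set = field(default_factory=set)
    medications: list = field(default_factory=list)
    log: list = field(default_factory=list)  # append-only: facts are never removed

    def record(self, kind, value):
        """Append a fact to the log, update the typed state, then run the rules."""
        if kind not in ("trip", "allergy", "medication"):
            raise ValueError(f"unknown fact kind: {kind}")
        self.log.append((kind, value))
        if kind == "trip":
            self.trips.append(value)
        elif kind == "allergy":
            self.allergies.add(value)
        else:
            self.medications.append(value)
        return self.alerts()

    def international_trips(self, year):
        """Aggregate question answered by a computation over typed state."""
        return sum(
            1 for t in self.trips
            if t.start.year == year and t.country != self.home_country
        )

    def alerts(self):
        """Rule that runs on every state change: drug versus recorded allergy."""
        return [
            f"{drug} conflicts with recorded {DRUG_CLASS[drug]} allergy"
            for drug in self.medications
            if DRUG_CLASS.get(drug) in self.allergies
        ]


user = UserState(home_country="US")
user.record("allergy", "penicillin")
user.record("trip", Trip("FR", date(2025, 3, 2)))
user.record("trip", Trip("US", date(2025, 6, 1)))
user.record("trip", Trip("JP", date(2025, 11, 20)))
alerts = user.record("medication", "amoxicillin")
```

In this sketch the trip count is a short computation over `user.trips`, and the allergy alert appears as a side effect of recording the medication, without any query asking for it.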

2606.16700 2026-06-16 cs.CL 新提交

Multi-Turn Reflective Masking Elicits Reasoning in Mask Diffusion Models

多轮反射掩码激发掩码扩散模型中的推理能力

Yanming Zhang, Yihan Bian, Jingyuan Qi, Yuguang Yao, Lifu Huang, Tianyi Zhou

发表机构 * University of Maryland(马里兰大学) Virginia Tech(弗吉尼亚理工大学) Intuit UC Davis(加州大学戴维斯分校) MBZUAI(穆罕默德·本·扎耶德人工智能大学)

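AI summary Proposes Reflective Masking (RM), a lightweight post-training method that gives mask diffusion models multi-turn masking and denoising for iterative local revision with no architectural changes; it outperforms baselines on text generation, Sudoku, and image editing.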

Comments 22 pages, 6 figures, 5 tables

详情
AI中文摘要

尽管自回归(AR)模型通常通过思维链推理和反思来执行推理,但它们对先前输出的改进仍然依赖于完全顺序生成,即使只需要局部编辑。相比之下,掩码扩散模型(MDMs)中的掩码机制自然支持对先前输出进行显式局部编辑,允许选择性细化而无需丢弃先前答案并从头生成另一个。虽然这一特性更接近人类通过迭代局部修正来纠正错误的方式,但现有的MDMs不支持多轮掩码和去噪。我们提出反射掩码(RM),通过轻量级后训练激发MDMs中这种内在的推理能力。RM提供了一种原生的测试时扩展,其中MDM基于不断演化的上下文迭代地重新审视和修正其先前的输出。为了利用来自先前轮次的见解(如AR推理),我们进一步引入了历史参考,这是一种参数无关的机制,在修正过程中利用中间去噪状态。我们的方法不需要架构改变,并且易于应用于现有的MDMs。在包括文本生成、数独和图像编辑在内的多种任务和模态中,反射掩码始终优于基于标准掩码的基线,并展现出强大的通用性,将RM定位为MDMs上推理的基本原语。

英文摘要

While reasoning on autoregressive (AR) models is often performed by chain-of-thought reasoning and reflection, their refinement of previous outputs still relies on fully sequential generation, even when only local edits are needed. In contrast, the masking mechanism in Mask Diffusion Models (MDMs) naturally supports explicit local edits on previous outputs, allowing selective refinement without discarding previous answers and generating another from scratch. While this property more closely aligns with how humans correct mistakes by iterative local refinement, existing MDMs do not support multi-turn masking and denoising. We propose Reflective Masking (RM), which elicits such an intrinsic reasoning capability in MDMs via lightweight post-training. RM provides a native test-time scaling, where an MDM iteratively revisits and revises its prior outputs based on evolving context. To exploit insights from previous turns like AR reasoning, we further introduce History Reference, a parameter-free mechanism that leverages intermediate denoising states during revision. Our approach requires no architectural changes and is easily applicable to existing MDMs. Across diverse tasks and modalities, including text generation, Sudoku, and image editing, Reflective Masking consistently outperforms standard masking-based baselines and demonstrates strong generality, positioning RM as a fundamental primitive for reasoning on MDMs.

2606.16696 2026-06-16 cs.RO 新提交

VENOM: Versatile Embodied Network for Omni-bodied Motion tracking

VENOM: 用于全身运动追踪的多功能具身网络

Siddharth Padmanabhan, Kazuki Miyazawa, Takato Horii

发表机构 * Graduate School of Engineering Science, University of Osaka(大阪大学工学研究科)

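AI summary Proposes VENOM, a GPT-based cross-embodiment full-body motion tracking model that tracks full-body motion across multiple humanoids in simulation without splitting control into upper and lower body.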

详情
AI中文摘要

仅从演示数据中实现跨多个人形机器人的专家级表现力全身运动追踪,在人形机器人学习中仍然是一个具有挑战性且相对未充分探索的问题。跨具身运动追踪策略主要通过将控制问题分解为上身和下身控制来训练。本文提出VENOM,一种用于仿真中的人形机器人的跨具身全身运动追踪模型。VENOM是一种基于GPT的运动追踪器,在多人形机器人数据上训练,可以追踪整个身体,无需分割为上身和下身控制。我们整理了一个名为VENOM数据集的多人形机器人运动追踪数据集,包含状态、动作和奖励,并在此数据集上训练VENOM和基线模型。在本文中,我们评估了VENOM相对于基线的性能,并表明我们能够实现一个稳定的运动追踪器,其能力优于仅通过监督学习在多人形机器人数据上训练的MLP,并且还表明,尽管缺乏奖励反馈,VENOM与使用非对称演员-评论家强化学习训练的专家的追踪能力紧密匹配。

英文摘要

Achieving expert-level expressive full-body motion tracking across multiple humanoids solely from demonstration data remains a challenging and relatively underexplored problem in humanoid robot learning. Cross-embodiment motion tracking policies are mostly trained by decoupling the control problem into upper and lower body control. This work proposes VENOM, a cross-embodiment full-body motion tracking model for humanoids in simulation. VENOM is a GPT-based motion tracker trained on multi-humanoid data that can track the entire body without splitting control into upper and lower body. We curate a multi-humanoid motion tracking dataset called the VENOM dataset that contains states, actions, and rewards, and train VENOM and the baselines on this dataset. In this letter, we evaluate VENOM's performance against baselines and show that we can achieve a stable motion tracker across different humanoids that is more capable than an MLP trained on multi-humanoid data with supervised learning alone. We also show that, despite the lack of reward feedback, VENOM closely matches the tracking capability of experts trained using asymmetric actor-critic reinforcement learning.

2606.16694 2026-06-16 cs.LG cs.AI physics.app-ph q-bio.NC 新提交

Adaptive inference and function vectors in deep transformers

深度变换器中的自适应推理与函数向量

Ravin Raj, Gautam Reddy

发表机构 * Joseph Henry Laboratories of Physics, Princeton University(普林斯顿大学约瑟夫·亨利物理实验室)

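AI summary Develops a theory of deep transformers as mean-field interacting systems that implement distributed inference, using function vectors to infer latent context variables layer by layer; for in-context regression it predicts how non-Gaussian hierarchical structure relates to depth, and tests the predictions with constrained linear attention transformers.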

详情
AI中文摘要

变换器被广泛用作学习大量耦合变量间复杂相关性的通用基础架构,但其内部机制仍不明确。我们提出了一种深度变换器作为平均场交互系统的理论,该系统在通信、局部性和深度约束下实现分布式推理。我们证明,这样的系统可以利用内部状态表示(“函数向量”)在其层上以越来越精细的尺度推断潜在上下文变量。在上下文回归任务中,该理论预测了潜在上下文变量中的非高斯分层结构与变换器深度之间的非平凡关系。使用约束线性注意力变换器对预测进行了测试,并展示了深度架构中的自适应推理。前馈模块和深度使变换器能够实现比先前描述的更丰富的上下文学习算法类别。

英文摘要

Transformers are widely used as a general-purpose substrate for learning complex correlations between a large collection of coupled variables, but their internal mechanisms have remained mysterious. We introduce a theory of a deep transformer as a mean-field interacting system that implements distributed inference, subject to constraints on communication, locality and depth. We show that such a system can exploit internal state representations ('function vectors') to infer a latent context variable at increasingly finer scales over its layers. In an in-context regression task, the theory predicts a non-trivial relationship between non-Gaussian, hierarchical structure in the latent context variable, and transformer depth. Predictions are tested using constrained linear attention transformers and demonstrate adaptive inference in deep architectures. Feedforward blocks and depth enable transformers to implement a much richer class of in-context learning algorithms than previously described.

2606.16690 2026-06-16 cs.RO cs.AI cs.CV 新提交

PATCH: Action-Chunk-Conditioned Latent Patch Innovation Monitoring for Robot Manipulation

PATCH: 基于动作块条件潜在补丁创新的机器人操作监控

Yanan Zhou, Ranpeng Qiu, Yincong Chen, Jiajie Cui, Weiming Zhi

发表机构 * School of Computer Science, The University of Sydney(悉尼大学计算机科学学院) Australian Centre For Robotics, The University of Sydney(悉尼大学澳大利亚机器人中心)

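AI summary Proposes PATCH, a monitor that detects local scene dynamics through action-chunk-conditioned latent patch innovation, enabling disturbance-aware intervention and recovery in robot manipulation.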

详情
AI中文摘要

基于学习的操作策略在真实世界机器人操作中取得了实质性进展,特别是在短视界动作生成方面。然而,在开放工作空间中部署时,面对意外的局部场景动态(如移动物体、短暂遮挡或预期运动附近的干扰)仍然脆弱。现有的运行时监控器通常依赖全局观测异常、策略不确定性或帧级视觉变化,难以区分任务相关的执行风险与良性的视觉变化。我们提出PATCH,一种用于部署时干预的基于动作块条件的潜在补丁创新监控器。给定当前动作块,PATCH定义了一个投影执行走廊,预测其内部的潜在补丁演化,并累积机器人自身运动无法解释的持续残差。这些残差形成局部化的干预信号,使PATCH-Router能够暂停执行、选择可用的恢复源,并在局部创新消退后恢复原始策略。在真实机器人 rollout 数据上的实验表明,PATCH 比竞争性运行时监控器产生更稳定且上下文相关的触发信号。真实机器人部署进一步展示了监控驱动的干预和策略恢复,用于扰动感知的操作。项目页面:https://yananzhou5555.github.io/PATCH/。

英文摘要

Learning-based manipulation policies have made substantial progress in real-world robot manipulation, particularly for short-horizon action generation. However, deployment in open workspaces remains fragile under unexpected local scene dynamics, such as moving objects, transient occlusions, or disturbances near the intended motion. Existing runtime monitors often rely on global observation anomalies, policy uncertainty, or frame-level visual changes, and struggle to distinguish task-relevant execution risk from benign visual variation. We introduce PATCH, an action-chunk-conditioned latent patch innovation monitor for deployment-time intervention. Given the active action chunk, PATCH defines a projected execution corridor, predicts latent patch evolution inside it, and accumulates persistent residuals unexplained by the robot's own motion. These residuals form a localized intervention signal that allows PATCH-Router to pause execution, select an available recovery source, and resume the original policy once localized innovation subsides. Experiments on real robot rollout data show that PATCH produces more stable and context-relevant triggers than competing runtime monitors. Real-robot deployment further demonstrates monitor-driven intervention and policy resumption for disturbance-aware manipulation. Project Page: https://yananzhou5555.github.io/PATCH/.
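The step "accumulates persistent residuals" can be illustrated with a one-sided CUSUM accumulator, which is a standard change-detection technique swapped in here for illustration; the abstract does not say which accumulation rule PATCH uses. The function name and the `drift` and `threshold` parameters are assumptions, and the residuals are taken to be non-negative per-step prediction errors inside the execution corridor.

```python
def cusum_trigger(residuals, drift, threshold):
    """Return the first step where accumulated excess residual exceeds threshold.

    Each step adds (residual - drift) to a running sum that is clipped at zero,
    so isolated spikes decay while persistent residuals build up.
    Returns None if the threshold is never crossed.
    """
    acc = 0.0
    for step, r in enumerate(residuals):
        acc = max(0.0, acc + r - drift)
        if acc > threshold:
            return step
    return None


# A single spike is forgotten; a sustained run of large residuals triggers.
spike_only = cusum_trigger([0.1, 0.9, 0.1, 0.1], drift=0.5, threshold=1.0)
sustained = cusum_trigger(
    [0.1, 0.9, 0.1, 0.1, 0.8, 0.9, 0.9, 0.1], drift=0.5, threshold=1.0
)
```

The clipping at zero is what separates a brief, benign visual change from a disturbance that persists long enough to warrant pausing execution.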

2606.16687 2026-06-16 cs.AI cs.CL 新提交

From Affect Prediction to Affect Forecasting: Evidence for Distinct Information Sources in Longitudinal Text

从情感预测到情感预报:纵向文本中不同信息源的证据

Sadia Noor, Seemab Latif, Raja Khurram Shahzad, Mehwish Fatima

发表机构 * School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST)(国立科技大学电气工程与计算机科学学院) Department of Communication, Quality Management and Information Systems, Mid Sweden University(中瑞典大学通信、质量管理和信息系统系)

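AI summary Separates current affect estimation from forecasting future affective change and proposes the TSAP framework and the ACF-Hybrid model; experiments indicate that text semantics support current prediction, while numeric trajectory dynamics are better suited to forecasting future change.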

详情
AI中文摘要

对纵向文本中的维度情感建模需要区分当前情感估计与未来情感变化预报。现有方法通常将每个文本视为独立观测,并对两个任务应用类似假设,而不检验它们是否依赖不同的信息源。本文利用纵向自我报告生态短文和情感词条目研究这一区别。我们提出特质-状态情感预测(TSAP)框架及其时间扩展E-TSAP用于逐文本效价和唤醒度预测,在来自91名用户的1737条条目的保留预测测试集上评估。我们进一步提出情感变化预报混合模型(ACF-Hybrid)用于下一步情感变化预报,在来自46名用户的保留预报测试集上评估。对于预测,E-TSAP在效价上达到复合皮尔逊相关系数0.670,在唤醒度上达到0.449。对于预报,文本表示的表现不如紧凑的数值轨迹基线:包含文本的模型在效价上仅达到r=0.316,在唤醒度上达到r=0.284,而简单的先前状态基线分别达到r=0.615和r=0.670。ACF-Hybrid使用维度特定的数值轨迹特征,在效价上达到r=0.659,在唤醒度上达到r=0.658。这些结果表明,文本语义支持当前情感预测,而未来情感变化通过先前数值轨迹动力学能更好地捕获。

英文摘要

Modeling dimensional affect in longitudinal text requires distinguishing current affect estimation from future affective change forecasting. Existing approaches often treat each text as an independent observation and apply similar assumptions to both tasks, without testing whether they rely on different information sources. This paper investigates that distinction using longitudinal self-reported ecological essays and feeling-word entries. We propose the Trait-State Affective Prediction (TSAP) framework and its temporal extension E-TSAP for per-text valence and arousal prediction, evaluated on a held-out prediction test set of 1,737 entries from 91 users. We further propose the Affective Change Forecaster Hybrid (ACF-Hybrid) for next-step affective change forecasting, evaluated on a held-out forecasting test set of 46 users. For prediction, E-TSAP achieves composite Pearson correlations of 0.670 for valence and 0.449 for arousal. For forecasting, textual representations perform worse than compact numeric trajectory baselines: the text-inclusive model achieves only r=0.316 for valence and r=0.284 for arousal, whereas a simple prior-state baseline reaches r=0.615 and r=0.670, respectively. ACF-Hybrid, using dimension-specific numeric trajectory features, achieves r=0.659 for valence and r=0.658 for arousal. These results show that textual semantics support current affect prediction, whereas future affective change is better captured through prior numeric trajectory dynamics.

2606.16684 2026-06-16 cs.CL 新提交

Progressive Knowledge-Guided Large Language Model Framework for Bearing Fault Diagnosis

渐进式知识引导的大型语言模型框架用于轴承故障诊断

Jinghan Wang, Gaoliang Peng, Yanjun Chen, Wei Zhang, Wentao Wu, Tianchen Liu

发表机构 * Harbin Institute of Technology, China(哈尔滨工业大学,中国) Eastern Institute of Technology, China(东方技术研究所,中国)

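AI summary Proposes a progressive physics-guided multi-scale vibration signal processing framework that combines an 81-dimensional measurement descriptor, fault-adaptive segmentation, and implicit knowledge encoding, reaching 98.49% diagnostic accuracy on four datasets with a 12.6-fold reduction in computational cost.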

详情
AI中文摘要

基于振动的轴承故障诊断需要解决三个相互关联的测量挑战,包括全局统计特征效率与局部瞬态信号保真度之间的权衡、测量特征对底层故障物理的可追溯性不足,以及跨诊断尺度的多源测量信息融合无效。本文提出一个渐进式物理引导的多尺度振动信号处理框架,在统一诊断流程中解决所有三个挑战。一个源自轴承运动学和特征缺陷频率的81维测量描述符,建立了物理可追溯的特征空间,实现每样本约20毫秒的实时故障筛查。然后,一种故障自适应信号分割机制将分析注意力引导至基于物理先验的故障相关波形区域,无需手动特征工程。在训练过程中,结构化的故障机制知识进一步隐式编码到模型参数中,实现自主多尺度测量融合,推理时无需外部知识依赖。在四个公开基准数据集上,在不同运行条件下验证,该框架实现了98.49%的诊断准确率,相对于信号级基线计算成本降低了12.6倍。可解释性分析证实诊断特征激活与已建立的轴承故障力学一致,支持安全关键工业系统中的测量可追溯性。

英文摘要

Vibration-based bearing fault diagnosis requires resolving three interrelated measurement challenges, including the trade-off between global statistical feature efficiency and local transient signal fidelity, insufficient traceability of measurement features to underlying fault physics, and ineffective multi-source measurement information fusion across diagnostic scales. This paper presents a progressive physics-guided multi-scale vibration signal processing framework that addresses all three challenges within a unified diagnostic pipeline. An 81-dimensional measurement descriptor, derived from bearing kinematic theory and characteristic defect frequencies, establishes a physically traceable feature space enabling real-time fault screening at approximately 20 ms per sample. A fault-adaptive signal segmentation mechanism then directs analytical attention toward fault-relevant waveform regions guided by physics-based priors, without manual feature engineering. Structured fault mechanism knowledge is further encoded implicitly in model parameters during training, enabling autonomous multi-scale measurement fusion without external knowledge dependencies at inference. Validated on four public benchmark datasets under diverse operating conditions, the framework achieves 98.49% diagnostic accuracy with a 12.6-fold reduction in computational cost relative to signal-level baselines. Interpretability analysis confirms that diagnostic feature activations align with established bearing fault mechanics, supporting measurement traceability in safety-critical industrial systems.
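The "characteristic defect frequencies" that the descriptor draws on are standard results of bearing kinematics. As a sketch of what such physics-derived quantities look like, the function below computes the four textbook frequencies for a rolling-element bearing with a stationary outer race and a rotating inner race. The function name, argument names, and the example geometry are illustrative assumptions; the abstract does not list which of these enter the 81-dimensional descriptor.

```python
import math


def bearing_defect_frequencies(shaft_hz, n_balls, ball_d, pitch_d, contact_angle_deg=0.0):
    """Textbook defect frequencies (Hz), assuming a fixed outer race.

    shaft_hz: shaft rotation frequency; n_balls: number of rolling elements;
    ball_d / pitch_d: rolling-element and pitch diameters (same length unit).
    """
    ratio = (ball_d / pitch_d) * math.cos(math.radians(contact_angle_deg))
    return {
        "FTF": 0.5 * shaft_hz * (1.0 - ratio),             # cage (fundamental train)
        "BPFO": 0.5 * n_balls * shaft_hz * (1.0 - ratio),  # outer-race defect
        "BPFI": 0.5 * n_balls * shaft_hz * (1.0 + ratio),  # inner-race defect
        "BSF": 0.5 * (pitch_d / ball_d) * shaft_hz * (1.0 - ratio ** 2),  # ball spin
    }


# Illustrative geometry only: 9 rolling elements, diameters in millimetres.
freqs = bearing_defect_frequencies(shaft_hz=30.0, n_balls=9, ball_d=7.94, pitch_d=39.04)
```

Two identities make a quick sanity check: BPFO plus BPFI equals the number of rolling elements times the shaft frequency, and BPFO equals the number of rolling elements times FTF.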

2606.16682 2026-06-16 cs.LG cs.CL 新提交

Multimodal Evaluator Preference Collapse: Cross-Modal Contagion in Self-Evolving Agents

多模态评估者偏好坍缩:自进化智能体中的跨模态传染

Zewen Liu

发表机构 * Qilu Institute of Technology, School of Software Engineering(齐鲁理工学院软件工程学院)

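AI summary Studies how preference collapse intensifies in multimodal self-evaluation, finds that cross-modal contagion distorts strategy selection, and introduces a contagion matrix to quantify the risk.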

Comments 19 pages, 0 figures

详情
AI中文摘要

当AI智能体使用语言模型在反馈循环中评估自身输出时,会出现系统性偏差。我们表明,评估者偏好坍缩(EPC)在多模态设置中被显著放大。使用GPT-4o评估DeepSeek-chat在文本和视觉任务上的表现,我们发现单一策略(step_by_step)吸收了48.4%的权重——是纯文本自评估中坍缩的3.2倍——而三个视觉域策略合计仅获得9.1%的权重。然后,我们展示了一种称为跨模态传染的新现象:在一个模态上获得的评估者偏好会迁移到另一个模态并破坏其策略选择。通过一个四阶段隔离训练范式,我们测量了传染系数并记录了策略反转——一个模态的最优策略在跨模态暴露后发生逆转。跨四种评估者配置(总计53次独立重复,15,592次API调用)的第3阶段统计验证揭示了一个清晰的层次结构:跨模型评估(GPT-4o,N=8)产生强但对称的双向传染(平均gamma_{T->V}=1.176,gamma_{V->T}=1.089,Delta=-0.088,p=0.575,Cohen's d=0.29);高轮次(DashScope,50轮)导致坍缩为单一策略主导(70%零传染);而自评估提供近乎完全的免疫——97%的运行(N=30,DeepSeek-chat)产生恰好为零的传染(平均gamma=0.033,95% CI [-0.031, 0.010],p=0.642,d=0.07)。没有评估者条件显示出统计显著的方向不对称性。我们引入了由评估者身份索引的传染矩阵,发布了MM-EPC实验框架,并将跨模型评估者架构确定为偏好传染的主要风险因素。

英文摘要

When AI agents use language models to evaluate their own outputs in a feedback loop, systematic biases emerge. We show that Evaluator Preference Collapse (EPC) is dramatically amplified in multimodal settings. Using GPT-4o to evaluate DeepSeek-chat across text and visual tasks, we find that a single strategy (step_by_step) absorbs 48.4% of all weight -- 3.2x the collapse observed in text-only self-evaluation -- while three visual-domain strategies receive only 9.1% combined weight. We then demonstrate a novel phenomenon we term cross-modal contagion: evaluator preferences acquired on one modality transfer to and corrupt strategy selection on another. Through a four-phase isolation training paradigm, we measure contagion coefficients and document strategy inversion -- the optimal strategy for a modality reverses after cross-modal exposure. A Phase 3 statistical validation across four evaluator configurations (N=53 total independent repetitions, 15,592 API calls) reveals a clear hierarchy: cross-model evaluation (GPT-4o, N=8) produces strong but symmetric bidirectional contagion (mean gamma_{T->V}=1.176, gamma_{V->T}=1.089, Delta=-0.088, p=0.575, Cohen's d=0.29); high round counts (DashScope, 50 rounds) cause collapse to single-strategy dominance (70% zero contagion); and self-evaluation provides near-complete immunity -- 97% of runs (N=30, DeepSeek-chat) yield exactly zero contagion (mean gamma=0.033, 95% CI [-0.031, 0.010], p=0.642, d=0.07). No evaluator condition shows statistically significant directional asymmetry. We introduce the contagion matrix indexed by evaluator identity, release the MM-EPC experimental framework, and identify cross-model evaluator architecture as the primary risk factor for preference contagion.

2606.16673 2026-06-16 cs.CV 新提交

MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

MMDiff: 扩展扩散变换器用于多模态生成

Yagmur Akarken, Orest Kupyn, Christian Rupprecht

发表机构 * University of Oxford, Visual Geometry Group(牛津大学视觉几何组)

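AI summary Proposes MMDiff, a framework in which a frozen diffusion transformer with lightweight decoder heads jointly generates images and multiple dense perceptual modalities; multi-timestep feature fusion with spatially varying aggregation weights proves key, and the framework performs strongly on tasks such as semantic segmentation.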

详情
AI中文摘要

扩散变换器已展现出卓越的生成能力,然而在其去噪轨迹中计算出的丰富感知表示在内容渲染后被丢弃。我们提出了MMDiff,一个将冻结的扩散变换器转化为多模态生成系统的框架,该系统使用轻量级解码器头联合生成图像以及任意组合的密集感知模态。我们的核心发现是,感知信息在去噪轨迹上呈时间分布,并且具有空间变化聚合权重的多时间步特征融合至关重要,相比单时间步提取,语义分割结果提高了高达28.7% mIoU。我们进一步采用概念驱动的注意力提取以实现可解释的空间引导,并表明冻结的扩散特征与最先进的编码器(如DINOv3)具有竞争力和互补性。通过在冻结的骨干网络上仅训练轻量级解码器头,我们在语义分割、显著目标检测和深度估计中取得了强劲性能,并证明了该框架能够有效生成大规模合成数据。

英文摘要

Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
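One plausible reading of "multi-timestep feature fusion with spatially varying aggregation weights" is a per-pixel softmax over timesteps followed by a weighted sum. The NumPy sketch below shows that reading only; the function name and tensor layout are assumptions, and the random `logits` stand in for whatever learned weights the paper actually uses.

```python
import numpy as np


def fuse_timesteps(features, logits):
    """Fuse per-timestep feature maps with per-pixel softmax weights.

    features: (T, C, H, W) features extracted at T denoising timesteps.
    logits:   (T, H, W) unnormalised per-pixel scores for each timestep.
    Returns the fused map (C, H, W) and the normalised weights (T, H, W).
    """
    shifted = logits - logits.max(axis=0, keepdims=True)  # stabilise the softmax
    weights = np.exp(shifted)
    weights = weights / weights.sum(axis=0, keepdims=True)
    fused = (weights[:, None, :, :] * features).sum(axis=0)
    return fused, weights


T, C, H, W = 4, 8, 16, 16
rng = np.random.default_rng(0)
features = rng.normal(size=(T, C, H, W))
logits = rng.normal(size=(T, H, W))  # placeholder for learned weights
fused, weights = fuse_timesteps(features, logits)
```

With all logits equal, the fusion reduces to a plain average over timesteps, so single-timestep extraction and uniform averaging are both special cases of this form.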

2606.16672 2026-06-16 cs.CV 新提交

Sinkhorn-CPD: Robust point cloud registration via unbalanced entropic optimal transport

Sinkhorn-CPD:通过非平衡熵最优传输实现鲁棒点云配准

Jin Zhang, Mingyang Zhao, Bing Liu, Xin Jiang

发表机构 * LMIB & School of Mathematical Sciences, Beihang University(北京航空航天大学数学科学学院与LMIB) State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences(中国科学院数学与系统科学研究院数学科学国家重点实验室) Beijing Key Laboratory of Artificial Intelligence Innovation and Application in the Machine Tool Industry, School of Artificial Intelligence, Beihang University(北京航空航天大学人工智能学院北京市机床行业人工智能创新与应用重点实验室) University of Chinese Academy of Sciences(中国科学院大学)

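AI summary Proposes Sinkhorn-CPD, which replaces CPD's target-side marginal constraint with dual KL-divergence penalties and performs robust point cloud registration via unbalanced entropic optimal transport and generalized Sinkhorn iterations; the variance anneals automatically, with no manual tuning.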

Comments 14 pages, 10 figures; journal version published in Computer-Aided Design

详情
Journal ref
Computer-Aided Design 199 (2026) 104104
AI中文摘要

相干点漂移(CPD)因其软对应和闭式参数更新而被广泛用于刚性点云配准。然而,CPD的目标边际约束迫使每个观测值(包括离群点)恰好接收单位概率质量。在严重离群点和部分重叠情况下,这一假设会降低配准精度。最优传输(OT)方法可以通过非平衡公式处理缺失质量,但需要手动调整退火调度。本文提出Sinkhorn-CPD,用双Kullback-Leibler惩罚替代CPD的目标边际约束,使算法能够丢弃两侧的离群点。由此得到的公式是一个完全非平衡的熵最优传输问题,可通过广义Sinkhorn迭代高效求解。此外,Sinkhorn-CPD保留了CPD的闭式Procrustes和方差更新。在我们的方法中,方差sigma^2扮演熵正则化参数的角色,从而自动产生从扩散到尖锐对应的退火调度,无需手动调节温度。在合成、跨类别和扫描到CAD基准上的实验表明,Sinkhorn-CPD达到了最先进的精度,对离群点和部分重叠具有强鲁棒性。

英文摘要

Coherent Point Drift (CPD) is widely used for rigid point cloud registration because of its soft correspondences and closed-form parameter updates. However, CPD's target-side marginal constraint forces every observation, including outliers, to receive exactly unit probability mass. This assumption degrades registration accuracy under heavy outliers and partial overlap. Optimal transport (OT) methods can handle missing mass through unbalanced formulations, but require hand-tuned annealing schedules. In this paper, we propose Sinkhorn-CPD, which replaces CPD's target-side marginal constraint with dual Kullback-Leibler penalties, allowing the algorithm to discard outliers on both sides. The resulting formulation is a fully unbalanced entropic optimal transport problem, which can be efficiently solved by generalized Sinkhorn iterations. Moreover, Sinkhorn-CPD preserves the closed-form Procrustes and variance updates of CPD. In our method, the variance sigma^2 plays the role of the entropic regularization parameter, which induces an automatic annealing schedule from diffuse to sharp correspondences without manual temperature tuning. Experiments on synthetic, cross-category, and scan-to-CAD benchmarks show that Sinkhorn-CPD achieves state-of-the-art accuracy, with strong robustness to outliers and partial overlap.
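To show what "generalized Sinkhorn iterations" look like for an entropic transport problem with KL-relaxed marginals, here is a minimal NumPy sketch of the standard scaling updates. It covers only the correspondence step, not the Procrustes or variance updates, and it is not the authors' code. The function name, the penalty weight `rho`, and the mapping `eps = 2 * sigma2` (chosen so the kernel matches a Gaussian with variance sigma^2) are assumptions made for illustration.

```python
import numpy as np


def unbalanced_sinkhorn(cost, a, b, eps, rho, n_iters=200):
    """Scaling iterations for entropic OT with KL penalties on both marginals.

    cost: (n, m) pairwise costs; a: (n,) and b: (m,) reference masses.
    eps: entropic regularisation; rho: weight of the KL marginal penalties.
    Returns the transport plan diag(u) @ K @ diag(v).
    """
    K = np.exp(-cost / eps)
    u = np.ones_like(a)
    v = np.ones_like(b)
    power = rho / (rho + eps)  # power < 1 softens the marginal constraints
    for _ in range(n_iters):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]


# Toy example: three source points, the same three targets plus one far outlier.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Y = np.vstack([X, [[6.0, 6.0]]])
cost = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
sigma2 = 0.5  # assumed variance; eps = 2 * sigma2 is an illustrative mapping
plan = unbalanced_sinkhorn(cost, np.full(3, 1 / 3), np.full(4, 1 / 4), eps=2 * sigma2, rho=1.0)
col_mass = plan.sum(axis=0)  # mass received by each target point
```

Because the marginals are penalised rather than enforced, the distant target point receives almost no mass, whereas a hard target-side constraint would force it to receive its full share. For very small `eps` the kernel underflows, so a log-domain version would be needed in practice.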

2606.16667 2026-06-16 cs.CV 新提交

Look Again Before You Abstain: Budgeted Conformal Evidence Acquisition for Reliable Vision-Language Model

在放弃之前再看一眼:预算约束下的共形证据获取用于可靠的视觉-语言模型

Jian Xu, Delu Zeng, John Paisley, Qibin Zhao

发表机构 * South China University of Technology(华南理工大学) RIKEN Center for Advanced Intelligence Project(RIKEN先进智能研究中心) Columbia University(哥伦比亚大学)

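AI summary To address hallucination in vision-language models, proposes Budgeted Conformal Evidence Acquisition (BCEA), which makes a three-way decision (answer, abstain, or acquire additional visual evidence) to control the hallucination rate under a bounded compute budget while restoring the finite-sample guarantee.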

详情
AI中文摘要

大型视觉-语言模型(LVLMs)会产生幻觉:它们断言图像不支持的视觉细节。一个原则性的解决方案是使用无分布保证的选择性预测——验证每个声明,当声明没有依据时放弃,从而使断言声明中的幻觉率有可证明的界限。然而,我们表明,这个保证是以残酷的代价换来的:为了在平衡的对象存在基准上将幻觉率保持在5%以下,最先进的共形过滤器必须在超过80%的声明上放弃。我们认为,当更多视觉证据可以廉价获取时,放弃是浪费的,并引入了预算约束共形证据获取(BCEA),它将二元回答/放弃决策替换为三向选择:回答、放弃或在有限计算预算下通过重新检查图像(缩放、裁剪或应用特定声明的干预)获取额外视觉证据。我们有两个观察。首先,天真地将获取插入到校准的过滤器中会破坏统计保证——实际风险超过目标多达17个百分点——因为获取步骤破坏了共形校准所依赖的可交换性。其次,将整个获取策略折叠到得分函数中,并在获取后得分上重新校准,恢复了有限样本保证,同时仍然恢复覆盖。BCEA进一步使用结构化的、声明类型特定的干预。在POPE基准和COCO构建的存在性和空间关系声明上,针对四个开源VLM,BCEA将幻觉率控制在目标水平,并持续提高覆盖,优于保证放弃的基线。

英文摘要

Large vision-language models (LVLMs) hallucinate: they assert visual details that the image does not support. A principled remedy is selective prediction with a distribution-free guarantee: verify each claim and abstain when the claim is not grounded, so that the hallucination rate among asserted claims is provably bounded. We show, however, that this guarantee is bought at a brutal price: to keep the hallucination rate below 5% on a balanced object-existence benchmark, a state-of-the-art conformal filter must abstain on more than 80% of claims. We argue that abstention is wasteful when more visual evidence is cheaply available, and introduce Budgeted Conformal Evidence Acquisition (BCEA), which replaces the binary answer/abstain decision with a three-way choice: answer, abstain, or acquire additional visual evidence by re-examining the image (zooming, cropping, or applying a claim-specific intervention) under a bounded compute budget. We make two observations. First, acquisition that is plugged naively into a calibrated filter breaks the statistical guarantee -- realized risk overshoots the target by up to 17 points -- because the acquisition step destroys the exchangeability that conformal calibration relies on. Second, folding the entire acquisition policy into the score function and re-calibrating on post-acquisition scores restores the finite-sample guarantee while still recovering coverage. BCEA further uses structured, claim-type-specific interventions. Across the POPE benchmark and COCO-constructed existence and spatial-relation claims, on four open VLMs, BCEA controls the hallucination rate at the target level and consistently improves coverage over a guaranteed-abstention baseline.

2606.16663 2026-06-16 cs.LG 新提交

Beyond Defensive Reporting: Machine Learning for Active Anti-Money Laundering Control in Insurance

超越防御性报告:机器学习在保险主动反洗钱控制中的应用

Dara Goldar, Geir Kjetil Ferkingstad Sandve, Martin Jullum

发表机构 * Fremtind Insurance(Fremtind保险) University of Oslo(奥斯陆大学) Norwegian Computing Center(挪威计算中心)

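AI summary Using production data from a Norwegian insurer, trains gradient-boosted decision tree models to detect money-laundering claims, with fraud labels as an auxiliary training signal; under the Budget-Weighted Capture Rate metric, the best model captures nearly two-thirds of laundering cases within the top 2-6% of claims reviewed.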

详情
AI中文摘要

通过保险索赔进行洗钱对保险公司构成威胁,既包括欺诈性赔付,也包括声誉和监管风险。尽管如此,很少有研究探讨如何预防此类洗钱行为。本文考察了机器学习是否可以帮助保险公司在赔付前标记可疑索赔,将重点从被动报告转向主动预防。使用一家挪威主要保险公司的生产数据,我们训练梯度提升决策树模型来检测后来被报告给当局涉嫌洗钱的索赔。由于欺诈和洗钱可能共享行为模式,我们还考察了保险欺诈标签是否可以作为辅助训练信号。我们使用预算加权捕获率(本文引入的指标)比较了不同的学习设置,该指标衡量在只能手动审查一小部分索赔时捕获了多少洗钱案例。结果表明,纳入与欺诈相关的调查标签显著改善了洗钱检测。表现最佳的模型在排名前2%至6%的选定调查索赔中捕获了近三分之二的洗钱案例。据我们所知,这是首个关于机器学习在保险索赔中检测洗钱的实证研究。

英文摘要

Money laundering through insurance claims poses a threat to insurers both through fraudulent payouts and reputational and regulatory risk. Despite this, little research has examined how such laundering can be prevented. This paper examines whether machine learning can help insurers flag suspicious claims before payout, shifting the focus from passive reporting to active prevention. Using production data from a major Norwegian insurer, we train gradient-boosted decision tree models to detect claims later reported to authorities for suspected money laundering. Because fraud and laundering may share behavioural patterns, we also examine whether insurance fraud labels can serve as an auxiliary training signal. We compare different learning setups using the Budget-Weighted Capture Rate, a metric introduced in this paper to measure how many laundering cases are captured when only a small share of claims can be manually reviewed. The results show that incorporating fraud-related investigation labels substantially improves laundering detection. The best-performing model captures nearly two-thirds of laundering cases within the top-ranked 2 to 6 percent of claims selected for investigation. To our knowledge, this is the first empirical study of machine learning for money laundering detection in insurance claims.
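The abstract does not define the Budget-Weighted Capture Rate, so the sketch below shows only the simpler quantity it builds on: the share of positive cases captured when a fixed fraction of top-ranked claims is reviewed. The function names are hypothetical, and the uniform average over budgets in `mean_capture_rate` is a placeholder assumption, not the paper's weighting.

```python
def capture_rate(scores, labels, budget):
    """Share of positive cases found when the top `budget` fraction is reviewed.

    scores: model scores (higher means more suspicious); labels: 0/1 ground truth;
    budget: fraction of claims that can be manually reviewed, e.g. 0.02.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        return 0.0
    n_review = max(1, round(budget * len(scores)))
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    captured = sum(label for _, label in ranked[:n_review])
    return captured / total_pos


def mean_capture_rate(scores, labels, budgets):
    """Uniform average over several budgets (placeholder weighting)."""
    return sum(capture_rate(scores, labels, b) for b in budgets) / len(budgets)


# 100 claims scored 0.00 .. 0.99; positives sit at ranks 1, 2 and 50.
scores = [i / 100 for i in range(100)]
labels = [0] * 100
for idx in (99, 98, 50):
    labels[idx] = 1
at_2_percent = capture_rate(scores, labels, 0.02)
at_50_percent = capture_rate(scores, labels, 0.5)
```

Reporting capture at a small review budget, rather than a threshold-free metric such as AUC, reflects the operational constraint that only a few percent of claims can be investigated by hand.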

2606.16659 2026-06-16 cs.CL 新提交

FraudSMSWalker: Benchmarking Agentic Large Language Models for SMS-to-Webpage Fraud Detection

FraudSMSWalker: 用于短信到网页欺诈检测的智能体大语言模型基准测试

Y. H. Zhou, Z. M. Ma, Y. J. Zhou, Y. T. Li, H. X. Xiang, Y. M. Cheng, T. L. Chen, K. J. Zhang, Z. H. Nan, J. H. Ni, Z. Wu, Q. Y. Pan, S. Zhang, S. Cheng, M. Y. Luo

发表机构 * Baimaohui(白猫汇) PPSUC(中国人民公安大学)

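AI summary Proposes the FraudSMSWalker benchmark, which uses URL-masked SMS-webpage pairs to evaluate the evidence-based reasoning of agentic large language models in cross-channel fraud detection; models detect suspicious cues but struggle to preserve benign recall.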

详情
AI中文摘要

短信欺诈日益跨渠道:一条消息引导用户访问网页,最终风险取决于短信声明与页面内容及请求用户操作的一致性。然而,现有评估要么专注于仅消息的钓鱼短信分类,要么暴露URL和域名线索,使模型能够依赖声誉捷径。为弥补这一空白,我们引入了FraudSMSWalker,一个用于URL屏蔽的短信到网页欺诈判断的受控基准。FraudSMSWalker包含699条双语链,包括332个欺诈和367个良性案例,涵盖十个服务场景。模型可见输入包括短信上下文和经过处理的网页证据,而原始URL、主机、域名、IP、重定向和声誉元数据被隐藏。该基准进一步包含硬良性案例,其页面包含登录、支付、验证或账户管理元素,这些元素在服务上下文中看似合理,但也出现在诈骗流程中。我们在屏蔽浏览器代理协议下评估了九个网络代理,并进行了URL可见性消融实验。结果表明,当前代理可以检测可疑线索,但难以保持良性召回,并且经常产生观察证据弱支持的正面预测。这些发现将FraudSMSWalker定位为一个基准,用于衡量当直接声誉捷径被抑制时,网络代理能否做出既准确又有证据基础的欺诈判断。相关代码和数据集可在匿名链接处获取。

英文摘要

SMS fraud is increasingly cross-channel: a message directs the user to a webpage, and the final risk depends on how the SMS claim aligns with the page content and requested user action. However, existing evaluations either focus on message-only smishing classification or expose URL and domain cues that allow models to rely on reputation shortcuts. To address this gap, we introduce FraudSMSWalker, a controlled benchmark for URL-masked SMS-to-webpage fraud judgment. FraudSMSWalker contains 699 bilingual chains, including 332 fraudulent and 367 benign cases, across ten service scenarios. The model-visible input consists of the SMS context and sanitized webpage evidence, while raw URLs, hosts, domains, IPs, redirects, and reputation metadata are withheld. The benchmark further includes hard benign cases whose pages contain login, payment, verification, or account-management elements that are plausible under the service context but also appear in scam flows. We evaluate nine web agents under masked browser-agent protocols and conduct URL-visibility ablations. The results show that current agents can detect suspicious cues, but struggle to preserve benign recall and often produce positive predictions that are weakly supported by the observed evidence. These findings position FraudSMSWalker as a benchmark for measuring whether web agents can make fraud judgments that remain both accurate and evidence-grounded when direct reputation shortcuts are suppressed. The associated code and dataset are accessible at the anonymous link https://anonymous.4open.science/w/FraudMessageWalker-Bench.
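Withholding URLs and IPs from the model-visible input can be approximated with a small regular-expression pass. The sketch below is a simplified illustration and not the benchmark's sanitization pipeline: the `[URL]` and `[IP]` placeholder tokens and the `mask_links` function are hypothetical, and the patterns do not cover bare domains without a scheme or `www.` prefix.

```python
import re

# Matches http(s) links and www-prefixed hosts up to the next whitespace.
URL_RE = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)
# Matches dotted-quad IPv4 addresses (no range validation).
IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def mask_links(text):
    """Replace URLs and IPv4 addresses with neutral placeholder tokens."""
    text = URL_RE.sub("[URL]", text)
    return IP_RE.sub("[IP]", text)


masked_sms = mask_links("Your parcel is held. Pay at https://example.com/pay?id=1 now")
masked_ip = mask_links("Visit 192.168.0.1 today")
```

Masking this way removes the reputation shortcut while keeping the sentence structure, so a model still sees that a link was present and what the message asks the user to do. Note that the URL pattern also swallows punctuation attached to the end of a link.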

2606.16658 2026-06-16 cs.CV 新提交

Vision-Language Models as Zero-Annotation Oracles in Histopathology

视觉-语言模型作为组织病理学中的零标注预言机

Vishal Jain, Giorgio Buzzanca, Sarah Cechnicka, Maarten Naesens, Priyanka Koshy, Tri Nguyen, Jesper Kers, Candice Roufosse, Bernhard Kainz

发表机构 * Imperial College London(帝国理工学院) Leiden University Medical Center(莱顿大学医学中心) KU Leuven(鲁汶大学) University Hospitals Leuven(鲁汶大学医院) University Medical Center Utrecht(乌得勒支大学医学中心) Friedrich-Alexander University Erlangen-Nürnberg(埃尔朗根-纽伦堡大学)

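AI summary Proposes a coarse-to-fine method that uses general-purpose vision-language models as zero-annotation oracles for foreground segmentation, outperforming supervised baselines on special stains and distilling lightweight student models from the resulting pseudo-labels.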

Comments 11 pages, 1 figure, 6 tables. Code available at https://github.com/VishalJ99/vlm-wsi-auto-context

详情
AI中文摘要

前景分割是每个计算病理学流程的关键第一步,但现有方法依赖于手工调整的启发式规则或监督模型,这些模型过度拟合狭窄的染色和扫描仪分布,在特殊染色(如Jones银染或Elastica van Gieson)上无声失败。我们提出一种粗到细方法,将前景分割重新定义为视觉感知任务,并利用通用视觉-语言模型(VLM)作为零标注预言机。我们的关键洞察是,组织与背景的区分是一个自然图像识别问题,而非组织病理学问题,因此在互联网规模语料上训练的VLM能够泛化到领域特定模型无法处理的场景。我们引入了Leica-75基准,包含跨越三种染色家族的75张肾移植全切片图像。在Leica-75上,我们的方法在分布外染色上实现了最高分割质量(Jones Dice 0.858 +/- 0.027,EVG Dice 0.853 +/- 0.041),交叉染色方差比最佳监督基线低7倍,同时在分布内H&E上保持竞争力。使用自动筛选示例(Auto-context)的少样本提示挽救了Stress-32(n=32)上的困难案例,Stress-32是一个精心设计的压力测试子集(2B模型Dice从0.470提升至0.819)。基于VLM的标注审查与人类专家共识一致(模糊检测kappa=0.989;分割掩码审查的平均精确率/召回率分级准确率0.708 vs. 人类0.646)。生成的伪标签用于蒸馏轻量学生模型,其性能与教师模型相当,而运行成本仅为教师模型的一小部分。我们的框架为数字病理学中持续存在的基础设施瓶颈提供了原则性、可扩展的解决方案。

英文摘要

Foreground segmentation is the critical first step of every computational pathology pipeline, yet existing methods rely on hand-tuned heuristics or supervised models that overfit to narrow stain and scanner distributions, failing silently on specialised stains such as Jones silver or Elastica van Gieson. We propose a coarse-to-fine approach that recasts foreground segmentation as a visual perception task and leverages general-purpose vision-language models (VLMs) as zero-annotation oracles. Our key insight is that tissue-versus-background discrimination is a natural-image recognition problem, not a histopathological one, so VLMs trained on internet-scale corpora generalise where domain-specific models cannot. We introduce Leica-75, a benchmark of 75 renal transplant whole-slide images spanning three stain families. On Leica-75, our method achieves the highest segmentation quality on out-of-distribution stains (Dice 0.858 +/- 0.027 on Jones, 0.853 +/- 0.041 on EVG) with 7x lower cross-stain variance than the best supervised baseline, while remaining competitive on in-distribution H&E. Few-shot prompting with automatically curated exemplars (Auto-context) rescues hard cases on Stress-32 (n=32), a curated stress-test subset (Dice 0.470 to 0.819 for the 2B model). VLM-based annotation review matches human expert consensus (kappa=0.989 for blur detection; mean precision/recall grading accuracy 0.708 vs. human 0.646 for segmentation mask review). The resulting pseudo-labels are used to distil lightweight student models that are as performant as the teacher model while running for a fraction of the cost. Our framework provides a principled, scalable solution to a persistent infrastructure bottleneck in digital pathology.
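The segmentation results above are reported as Dice scores. For reference, the Dice coefficient between a predicted and a reference binary mask is twice the overlap divided by the sum of the two mask sizes. The pure-Python sketch below takes masks as nested lists of 0/1; the function name and the convention of returning 1.0 for two empty masks are choices made for this illustration.

```python
def dice(pred, target):
    """Dice coefficient 2|A and B| / (|A| + |B|) for two binary masks.

    pred, target: nested lists (rows of 0/1 values) with the same shape.
    """
    flat_pred = [bool(v) for row in pred for v in row]
    flat_target = [bool(v) for row in target for v in row]
    if len(flat_pred) != len(flat_target):
        raise ValueError("masks must have the same number of pixels")
    overlap = sum(p and t for p, t in zip(flat_pred, flat_target))
    size = sum(flat_pred) + sum(flat_target)
    return 1.0 if size == 0 else 2.0 * overlap / size


# Prediction marks two pixels, reference marks one of them: Dice = 2*1 / (2+1).
score = dice([[1, 1], [0, 0]], [[1, 0], [0, 0]])
```

A score of 1.0 means the masks coincide exactly and 0.0 means they share no foreground pixel.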

2606.16656 2026-06-16 cs.LG New submission

Near-Optimal Stochastic Linear Bandits with Delay

带延迟的近最优随机线性赌博机

Ofir Schlisselberg, Mengxiao Zhang, Yishay Mansour

Affiliations: Tel Aviv University; University of Iowa; Tel Aviv University and Google Research

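AI summary: Studies stochastic linear bandits under several delay models, gives near-optimal regret bounds, and shows how the interaction between delay and linear structure affects the dependence on dimension.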

Details
AI abstract (Chinese)

我们研究了在几种延迟模型下具有延迟反馈的随机线性赌博机,并建立了近最优的遗憾保证。我们的结果确定了延迟线性赌博机何时表现出与多臂赌博机(MAB)相同的定性行为,以及线性结构何时产生根本性的新挑战。具体来说,(1)对于\emph{损失无关延迟},其中延迟不依赖于实现的损失(但可能依赖于臂),我们表明延迟仅引起加性遗憾惩罚。在随机延迟下,该惩罚与期望延迟成比例,而在对抗性延迟下,它与最大未完成观测数成比例。值得注意的是,两种延迟惩罚都是无维度的,改进了现有最优结果;(2)对于\emph{损失相关延迟},我们表明线性赌博机比MAB困难得多:与MAB不同,我们在线性赌博机中证明了匹配(最多对数因子)的上界和下界,其延迟惩罚依赖于维度的平方根。(3)对于\emph{延迟作为收益模型},这是损失相关延迟的一个特例,我们表明仅依赖于最优臂延迟的最优MAB保证在线性赌博机中也是无法实现的。这些结果共同提供了延迟反馈如何与线性泛化相互作用的清晰刻画。

English abstract

We study stochastic linear bandits with delayed feedback under several delay models and establish near-optimal regret guarantees. Our results identify when delayed linear bandits exhibit the same qualitative behavior as multi-armed bandits (MAB), and when the linear structure creates fundamentally new challenges. Specifically: (1) for loss-independent delays, where the delay does not depend on the realized loss (but may depend on the arm), we show that delays incur only an additive regret penalty. Under stochastic delays, this penalty scales with the expected delay, while under adversarial delays, it scales with the maximum number of outstanding observations. Notably, both delay penalties are dimension-free, improving upon the state-of-the-art results; (2) for loss-dependent delays, we show that linear bandits are substantially harder than MAB: unlike in MAB, we prove matching (up to log factors) upper and lower bounds in linear bandits, whose delay penalty depends on the square root of the dimension; (3) for the delay-as-payoff model, a special case of loss-dependent delay, we show that the optimal MAB guarantee, which depends only on the delay of the optimal arm, is also unattainable in linear bandits. Together, these results provide a sharp characterization of how delayed feedback interacts with linear generalization.
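The two quantities that the delay penalties scale with, the expected delay and the maximum number of outstanding observations, can be made concrete with a small sketch. The timing convention used here is an assumption for illustration, not taken from the paper: feedback for round s arrives at the end of round s + d_s, and an observation is outstanding at the start of round t if it was played earlier and has not yet arrived. The function name and the toy delay sequence are hypothetical.

```python
def max_outstanding(delays):
    """Largest number of earlier rounds whose feedback is still missing
    at the start of any round (assumed convention: feedback for round s
    arrives at the end of round s + delays[s])."""
    worst = 0
    for t in range(len(delays)):
        # Rounds s < t whose feedback has not arrived before round t starts.
        pending = sum(1 for s in range(t) if s + delays[s] >= t)
        worst = max(worst, pending)
    return worst


# Hypothetical delay sequence, one entry per round.
delays = [3, 0, 2, 0, 1, 0]
mean_delay = sum(delays) / len(delays)
print(mean_delay, max_outstanding(delays))
```

Under this convention a constant delay d over a long horizon gives exactly d outstanding observations, so the two quantities coincide in that simple case and differ once delays vary across rounds.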

2606.16655 2026-06-16 cs.LG New submission

Distribution Alignment for One-Shot Federated Learning via Optimal Transport

基于最优传输的单轮联邦学习分布对齐

Daniele Berardini, Vito Paolo Pastore, Vittorio Murino

Affiliations: AI for Good (AIGO), Italian Institute of Technology; MaLGa-DIBRIS, University of Genoa; Department of Computer Science, University of Verona

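AI summary: To address the feature misalignment caused by heterogeneous client data in one-shot federated learning, proposes SLOT-Align, which uses a shared frozen encoder, Bures-Wasserstein barycenters, and geodesic optimal transport maps to align features without training, improving model accuracy and robustness.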

Comments: Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Details
AI abstract (Chinese)

单轮联邦学习(OSFL)应对极端通信场景,其中客户端仅与服务器交互一次,放大了异构客户端数据分布的影响。特别是,客户端间的域偏移和标签偏移相互作用导致特征表示错位,而无法通过迭代优化纠正。现有OSFL方法依赖于蒸馏、服务器端生成或基于集成的聚合,但假设表示已对齐或分别处理域偏移和标签偏移。我们提出SLOT-Align(单轮、无学习的最优传输对齐),一种面向OSFL的几何感知特征协调框架。SLOT-Align使用共享冻结编码器提取紧凑特征统计,通过Bures-Wasserstein重心构建全局参考,并利用闭式测地最优传输映射对齐局部表示。该方法计算高效,可与依赖冻结编码器的现有OSFL流程结合,无需修改其训练过程。跨多个基准、预训练骨干网络和OSFL方法的大量实验表明,SLOT-Align在联合域偏移和标签偏移下持续提升准确率和鲁棒性。

English abstract

One-Shot Federated Learning (OSFL) addresses extreme communication regimes in which clients interact with the server only once, amplifying the impact of heterogeneous client data distributions. In particular, the interaction of domain shift and label shift across clients induces misaligned feature representations that cannot be corrected through iterative optimization. Existing OSFL methods rely on distillation, server-side generation or ensemble-based aggregation, but assume aligned representations or address domain and label shift separately. We introduce SLOT-Align (Single-round, Learning-free Optimal Transport Alignment), a geometry-aware feature harmonization framework for OSFL. SLOT-Align uses a shared frozen encoder to extract compact feature statistics, constructs a global reference via Bures-Wasserstein barycenters, and aligns local representations using closed-form geodesic optimal transport maps. The method is computationally efficient and can be combined with existing OSFL pipelines relying on frozen encoders without modifying their training procedures. Extensive experiments across multiple benchmarks, pretrained backbones, and OSFL methods show that SLOT-Align consistently improves accuracy and robustness under joint domain and label shift.
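The closed-form ingredients named in the abstract can be illustrated in a heavily simplified setting. The sketch below assumes each client summarizes its features as a Gaussian with diagonal covariance, in which case the Bures-Wasserstein barycenter and the optimal transport map both reduce to per-dimension formulas. The diagonal restriction, the function names, the interpolation parameter t (read here as displacement interpolation along the transport geodesic), and the toy statistics are all assumptions for illustration; this is not the SLOT-Align implementation, which the abstract does not detail.

```python
import math


def diag_barycenter(stats, weights):
    """Wasserstein-2 barycenter of diagonal Gaussians.

    stats: list of (mean, var) pairs, each a list of floats per dimension.
    For diagonal covariances the barycenter mean is the weighted mean and
    the barycenter standard deviation is the weighted mean of the
    standard deviations.
    """
    dim = len(stats[0][0])
    total = sum(weights)
    w = [x / total for x in weights]
    mean = [sum(wi * m[d] for wi, (m, _) in zip(w, stats)) for d in range(dim)]
    std = [sum(wi * math.sqrt(v[d]) for wi, (_, v) in zip(w, stats))
           for d in range(dim)]
    return mean, [s * s for s in std]


def transport(x, src, ref, t=1.0):
    """Move feature x from Gaussian src toward Gaussian ref.

    t=1.0 applies the full closed-form map; 0 < t < 1 stops part-way
    along the straight line between x and its image. Assumes every
    source variance is strictly positive.
    """
    (m_s, v_s), (m_r, v_r) = src, ref
    mapped = [mr + math.sqrt(vr / vs) * (xi - ms)
              for xi, ms, vs, mr, vr in zip(x, m_s, v_s, m_r, v_r)]
    return [(1.0 - t) * xi + t * yi for xi, yi in zip(x, mapped)]


# Hypothetical per-client feature statistics (mean, diagonal variance).
client_a = ([0.0, 0.0], [1.0, 4.0])
client_b = ([2.0, 2.0], [9.0, 4.0])
reference = diag_barycenter([client_a, client_b], [1.0, 1.0])
print(reference)
print(transport([1.0, 2.0], client_a, reference))
```

With full covariance matrices the barycenter generally has no closed form and is computed by a fixed-point iteration involving matrix square roots, which this sketch deliberately avoids.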