arXivDaily: arXiv daily academic digest, updated Monday through Friday
2511.13207 2026-06-11 cs.RO cs.CV Updated version

PIGEON: VLM-Driven Object Navigation via Points of Interest Selection

Cheng Peng, Zhenzhe Zhang, Xiaobao Wei, Yanhao Zhang, Heng Wang, Pengwei Wang, Zhongyuan Wang, Cheng Chi, Shanghang Zhang, Jing Liu

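Affiliations: Institute of Automation, Chinese Academy of Sciences; Beijing Academy of Artificial Intelligence (BAAI); Peking University; School of Artificial Intelligence, University of Chinese Academy of Sciences

AI summary: Proposes the PIGEON framework, which formulates object navigation as a sparse decision problem grounded in raw observations. Points of Interest (PoIs) serve as visual decision units from which a VLM selects key points, giving state-of-the-art zero-shot performance and transfer to Active Embodied Question Answering.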

Abstract

Object navigation in unseen indoor environments requires agents to perform semantic search under partial observability. Vision-language models (VLMs) provide strong semantic-spatial priors for this task, but how to interface them with robot navigation remains challenging: dense VLM inference is expensive, while abstracting environments into symbolic memories often separates high-level reasoning from the raw visual evidence that supports it. We propose PIGEON (Point of Interest Guided Exploration for Object Navigation), a VLM-driven framework that formulates object navigation as a raw-observation-grounded sparse decision problem. PIGEON introduces Points of Interest (PoIs) as sparse visual decision units that couple geometrically executable waypoints with raw egocentric observations. Rather than using VLMs as dense controllers or restricting them to frontier ranking, PIGEON enables VLMs to select among task-critical PoIs, including exploration frontiers, suspected target objects, traversable stairs, and floor-level summaries, while low-level planners execute continuous motion between them. This PoI interface further makes high-level navigation decisions verifiable, allowing us to develop an RLVR pipeline that improves local VLMs without manual Chain-of-Thought annotations. Extensive experiments on Habitat ObjectNav benchmarks show that PIGEON achieves state-of-the-art zero-shot performance, scales consistently with foundation model capacity, and transfers to Active Embodied Question Answering with only prompt modifications. Real-world deployments on physical robots further demonstrate its robustness and efficiency.

2511.08195 2026-06-11 cs.CV Updated version

UI2Code^N: UI-to-Code Generation as Interactive Visual Optimization

Zhen Yang, Wenyi Hong, Mingde Xu, Xinyue Fan, Weihan Wang, Jiale Cheng, Xiaotao Gu, Jie Tang

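Affiliations: Zhejiang University

AI summary: Reformulates the UI-screenshot-to-code task as an interactive visual optimization problem and uses RVPO, a preference-based reinforcement learning method, to optimize visual rankings, reaching state-of-the-art results on UI drafting, polishing, and editing tasks.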

Comments 27 pages

Abstract

UI-to-code aims to translate UI screenshots into executable front-end code. Despite progress with vision-language models (VLMs), most existing methods formulate UI-to-code as a single-pass generation, which mismatches real-world UI development that is inherently iterative and feedback-driven. We reformulate UI-to-code as an interactive visual optimization problem, where code generation is embedded in a closed-loop process of execution, visual inspection, and iterative refinement driven by rendered visual feedback. To address the non-differentiability of visual objectives and the noise of absolute visual evaluators, we propose Relative Visual Policy Optimization (RVPO), a preference-based reinforcement learning method that optimizes relative visual rankings among rendered candidates under execution feedback. We instantiate this paradigm in UI2Code^N, an open-source 9B model trained via continual pre-training, supervised fine-tuning, and reinforcement learning. Experiments demonstrate state-of-the-art performance on UI drafting, UI polishing, and UI editing benchmarks, even outperforming larger models, with performance consistently improving through iterative visual optimization. Our code and models are available at https://github.com/zai-org/UI2Code_N.
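
The abstract motivates optimizing relative rankings because absolute visual evaluators are noisy. The following is a minimal sketch of generic rank-based reward shaping, not the RVPO objective itself: the function name `rank_rewards` and the linear mapping to [-1, 1] are assumptions made for illustration, and ties are broken by candidate order.

```python
def rank_rewards(scores):
    """Map evaluator scores to zero-centered rewards that depend only on rank order.

    Hypothetical helper for illustration; not taken from the UI2Code^N code base.
    """
    n = len(scores)
    if n == 1:
        return [0.0]
    # Candidate indices sorted from worst to best score.
    order = sorted(range(n), key=lambda i: scores[i])
    rewards = [0.0] * n
    for rank, idx in enumerate(order):
        # Worst candidate gets -1, best gets +1, the rest are spaced evenly.
        rewards[idx] = 2.0 * rank / (n - 1) - 1.0
    return rewards


# Two evaluators that disagree on scale but agree on ordering give the same rewards.
rewards_a = rank_rewards([0.71, 0.69, 0.90])
rewards_b = rank_rewards([7.1, 6.9, 9.0])
```

Because only the ordering matters, any monotone rescaling of the evaluator's scores leaves the rewards unchanged, which is one way to read the abstract's preference for relative over absolute judgments.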

2511.08299 2026-06-11 cs.RO Updated version

Phase-Based Multi-Gait Learning for a Salamander-Like Robot

Zhiang Liu, Yang Liu, Yongchun Fang, Xian Guo

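Affiliations: Nankai University

AI summary: Proposes a phase-based, reference-free motion learning framework that uses phase variables and a phase coverage reward, combined with morphological-symmetry data augmentation, so that a salamander-like robot acquires 22 dynamic and symmetric gaits on its own.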

Abstract

Salamander-like robots are modeled on the skeletal structure of their biological counterparts. However, existing controllers cannot fully exploit these morphological features and largely rely on predefined patterns or joint trajectories, which prevents the generation of diverse and flexible gaits and limits their applicability in real-world scenarios. In this paper, we propose a phase-based learning framework that enables the robot to acquire a diverse repertoire of gaits without using reference motions. Each body part is controlled by a phase variable capable of forward and backward evolution, and a phase coverage reward promotes exploration of the leg phase space. Additionally, the morphological symmetry of the robot is incorporated via data augmentation, improving sample efficiency and enforcing both motion-level and task-level symmetry in learned behaviors. Extensive experiments show that the robot successfully acquires 22 representative gaits exhibiting both dynamic and symmetric movements, demonstrating the effectiveness of the proposed learning framework.
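
The phase variable and the phase coverage reward described above can be sketched as follows. This is an illustrative guess at the mechanics, not the paper's implementation: the names `advance_phase` and `coverage_reward`, the [0, 1) phase range, and the bin-counting reward are all assumptions.

```python
def advance_phase(phase, rate, dt):
    """Advance a phase variable kept in [0, 1); a negative rate moves it backward."""
    return (phase + rate * dt) % 1.0


def coverage_reward(phases, num_bins=8):
    """Fraction of phase bins visited by a sequence of phase values (assumed reward form)."""
    # min() guards against a phase that rounds up to exactly 1.0.
    visited = {min(int(p * num_bins), num_bins - 1) for p in phases}
    return len(visited) / num_bins


# A leg phase that first runs forward, then backward.
trajectory = [0.0]
for rate in [2.0, 2.0, 2.0, -2.0, -2.0]:
    trajectory.append(advance_phase(trajectory[-1], rate, dt=0.1))
reward = coverage_reward(trajectory, num_bins=4)
```

A reward of this shape grows only when the phase visits new regions, which is the kind of pressure toward exploring the leg phase space that the abstract describes.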

2511.07332 2026-06-11 cs.LG cs.AI Updated version

Grounding Computer Use Agents on Human Demonstrations

Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lù, Johan Obando-Ceron, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero-Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar

Affiliations: Mila - Quebec AI Institute; McGill University; Université de Montréal; ServiceNow Research; University of Waterloo; University of Oxford; National University of Singapore; Polytechnique Montréal; École de Technologie Supérieure; CIFAR AI Chair

AI summary: To address the scarcity of high-quality grounding data for desktop environments, builds the GroundCUA dataset (87 applications, 56K screenshots, 3.56M human annotations) and trains the GroundNext models on it, achieving the best results on five benchmarks with less than one-tenth of the data used by prior work.

Comments Accepted at ICLR 2026

Abstract

Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as planner, GroundNext attains comparable or superior results to models trained with substantially more data. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.

2507.23534 2026-06-11 cs.LG cs.CV Updated version

Continual Learning with Support Boundary Experience Blending

Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen

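Affiliations: National Taiwan University

AI summary: Proposes the Experience Blending framework, which generates Support Boundary Data through differential-privacy-inspired noise and trains jointly on exemplars and boundary data to regularize decision boundaries, improving continual learning accuracy on multiple datasets.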

Abstract

Continual learning (CL) seeks to mitigate catastrophic forgetting when models are trained on sequential tasks. A common approach, experience replay (ER), stores past exemplars but only sparsely approximates the data distribution, yielding fragile and oversimplified decision boundaries. We address this limitation by introducing Support Boundary Data (SBD), generated by injecting differential-privacy-inspired noise into latent features to create boundary-adjacent representations that implicitly regularize decision boundaries. Building on this idea, we propose Experience Blending (EB), a framework that jointly trains on exemplars and SBD through a dual-model aggregation strategy. EB has two components: (1) latent-space noise injection to generate support boundary data, and (2) end-to-end training that jointly leverages exemplars and SBD. Unlike standard experience replay, SBD enriches the feature space near decision boundaries, leading to more stable and robust continual learning. Extensive experiments on CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet1K demonstrate consistent accuracy improvements of 10%, 6%, 13%, and 2%, respectively.
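
Component (1), latent-space noise injection, can be illustrated with a short sketch. It assumes the "differential-privacy-inspired" noise resembles the standard Gaussian mechanism (clip the vector's L2 norm, then add Gaussian noise scaled to the clipping bound); the abstract does not specify the mechanism, and the names `clip_l2`, `make_boundary_feature`, and `noise_multiplier` are hypothetical.

```python
import math
import random


def clip_l2(vec, max_norm):
    """Scale a vector down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm:
        return list(vec)
    scale = max_norm / norm
    return [v * scale for v in vec]


def make_boundary_feature(latent, max_norm=1.0, noise_multiplier=0.5, rng=None):
    """Perturb a latent feature with clipped-then-noised values (assumed recipe)."""
    if rng is None:
        rng = random.Random()
    clipped = clip_l2(latent, max_norm)
    # Noise scale is tied to the clipping bound, as in the Gaussian mechanism.
    sigma = noise_multiplier * max_norm
    return [v + rng.gauss(0.0, sigma) for v in clipped]


exemplar_latent = [3.0, 4.0]  # toy latent feature of a stored exemplar
boundary_latent = make_boundary_feature(exemplar_latent, rng=random.Random(0))
```

In a replay setup, perturbed features like `boundary_latent` would be trained on alongside the stored exemplars; how the paper labels and weights them is not stated in the abstract.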

2509.11575 2026-06-11 cs.AI Updated version

A Survey of Reasoning and Agentic Systems in Time Series with Large Language Models

Ching Chang, Yidan Shi, Defu Cao, Wei Yang, Jeehyun Hwang, Haixin Wang, Jiacheng Pang, Wei Wang, Yan Liu, Wen-Chih Peng, Tien-Fu Chen

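Affiliations: University of California, Los Angeles; University of Southern California; National Yang Ming Chiao Tung University

AI summary: Defines the time series reasoning problem and divides it by reasoning topology into direct, linear chain, and branch-structured families. These are crossed with objectives such as traditional analysis, explanation, causal inference, and generation. The survey covers methods, systems, datasets, and evaluation practices, and offers guidance on topology selection and deployment trade-offs.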

Comments Accepted to Transactions on Machine Learning Research (TMLR)

Abstract

Time series reasoning treats time as a first-class axis and incorporates intermediate evidence directly into the answer. This survey defines the problem and organizes the literature by reasoning topology with three families: direct reasoning in one step, linear chain reasoning with explicit intermediates, and branch-structured reasoning that explores, revises, and aggregates. The topology is crossed with the main objectives of the field, including traditional time series analysis, explanation and understanding, causal inference and decision making, and time series generation, while a compact tag set spans these axes and captures decomposition and verification, ensembling, tool use, knowledge access, multimodality, agent loops, and LLM alignment regimes. Methods and systems are reviewed across domains, showing what each topology enables and where it breaks down in faithfulness or robustness, along with curated datasets, benchmarks, and resources that support study and deployment (https://github.com/blacksnail789521/Time-Series-Reasoning-Survey). Evaluation practices that keep evidence visible and temporally aligned are highlighted, and guidance is distilled on matching topology to uncertainty, grounding with observable artifacts, planning for shift and streaming, and treating cost and latency as design budgets. We emphasize that reasoning structures must balance capacity for grounding and self-correction against computational cost and reproducibility, while future progress will likely depend on benchmarks that tie reasoning quality to utility and on closed-loop testbeds that trade off cost and risk under shift-aware, streaming, and long-horizon settings. Taken together, these directions mark a shift from narrow accuracy toward reliability at scale, enabling systems that not only analyze but also understand, explain, and act on dynamic worlds with traceable evidence and credible outcomes.

2506.17137 2026-06-11 cs.CV Updated version

Towards Conditional Feature Alignment for Cross-Domain Counting

Zhuonan Liang, Dongnan Liu, Jianan Fan, Yaxuan Song, Qiang Qu, Runnan Chen, Yu Yao, Peng Fu, Weidong Cai

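Affiliations: Institute of Automation, Chinese Academy of Sciences; School of Electronic Engineering and Information Science, Beijing Institute of Technology

AI summary: Proposes the Conditional Feature Alignment (CFA) framework, which uses label-induced conditional alignment rather than global domain invariance to handle shifts in density distribution in cross-domain counting, and reports significant gains on unsupervised domain adaptation and source-domain generalization tasks.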

Comments 12 pages, 6 figures, 4 tables

Abstract

Object counting models often degrade under cross-domain deployment because density composition varies across domains and is itself task-relevant. Standard feature alignment methods tend to suppress such variation by encouraging global domain invariance, which can be harmful when source and target domains contain different proportions of background, sparse foreground, and dense foreground. We propose Conditional Feature Alignment (CFA), a cross-domain counting framework that aligns representations within label-induced conditions rather than across full marginal feature distributions. Given density annotations or pseudo-density predictions, CFA constructs foreground/background or density-level conditions and aligns only features belonging to matching conditions. We formalise this idea through a conditional divergence perspective, showing that conditional alignment removes within-condition discrepancy while preserving condition-marginal density shift. For unsupervised domain adaptation, CFA estimates source conditions from annotations and target conditions from detached pseudo-density maps, then performs condition-wise adversarial alignment with full-image consistency regularisation. For source-domain generalisation, we instantiate the same principle with MPCount by enforcing condition-wise memory-consistency between generated source-domain views. Experiments on crowd and cell counting benchmarks show competitive or improved performance across diverse UDA and DG settings. For example, on JHU-CROWD++ FH→SN, CFA-DG reduces MAE/RMSE from MPCount's 216.3/421.4 to 90.5/169.9, indicating that condition-wise alignment is especially effective under large weather- and density-induced shifts. These results suggest that condition-wise alignment is a promising design principle for domain-adaptive counting.

2510.24515 2026-06-11 cs.RO Updated version

Learning Ordinal Response Policies in Rank-Based Stochastic Prize-Collecting Games

Malintha Fernando, Petter Ögren, Silun Zhang

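Affiliations: KTH Royal Institute of Technology

AI summary: Proposes Stochastic Prize-Collecting Orienteering Games (SPCOG), extending the Team Orienteering Problem to self-interested agents, uses Ordinal Rank (OR) as a strong inductive bias, and designs the Fictitious Ordinal Response Learning (FORL) algorithm to obtain convergent policies.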

Abstract

The Team Orienteering Problem (TOP) generalizes many real-world multi-agent scheduling and routing tasks that occur in autonomous mobility, aerial logistics, and surveillance applications. While many flavors of the TOP exist for planning in multi-agent systems, they assume that all the agents cooperate toward a single objective; therefore, they do not extend to settings where agents compete in reward-scarce environments. We propose Stochastic Prize-Collecting Orienteering Games (SPCOG) as an extension of the TOP to plan in the presence of self-interested agents operating on a graph, under energy constraints and stochastic transitions. A theoretical discussion on complete and star graphs establishes that there is a unique pure Nash equilibrium in SPCOGs that coincides with the optimal routing solution of an equivalent TOP under rank-based conflict resolution. We propose the concept of Ordinal Rank (OR) as a concise representation of an agent's global rank and its location within a topological, well-defined neighborhood. Empirical evaluations conducted on real-world road-network graphs under both dynamic and stationary prize distributions show that, in parameter-sharing settings, policies that leverage local information can outperform those that leverage global information when the former are conditioned on the OR rather than the global rank, indicating that the OR acts as a strong inductive bias in multi-agent games on graphs. The OR-conditioned policies also generalize much better to games with a large number of agents than global-rank-conditioned policies. Finally, we propose Fictitious Ordinal Response Learning (FORL) as an entropy-regulated algorithm to obtain convergent policies in independent-learning settings in prize-collecting games on graphs.
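
The abstract describes the Ordinal Rank only at a high level, so the sketch below shows one possible reading rather than the paper's definition: an agent's position among the agents located within `k` hops of it, ordered by global rank. The functions `k_hop_nodes` and `ordinal_rank`, the parameter `k`, and the toy graph are all assumptions.

```python
from collections import deque


def k_hop_nodes(adjacency, start, k):
    """Return all nodes within k hops of start, found by breadth-first search."""
    seen = {start}
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nxt in adjacency[node]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen


def ordinal_rank(agent, positions, global_rank, adjacency, k=1):
    """Position of agent among agents in its k-hop neighborhood (0 = best ranked)."""
    nearby_nodes = k_hop_nodes(adjacency, positions[agent], k)
    nearby_agents = [a for a, node in positions.items() if node in nearby_nodes]
    ordered = sorted(nearby_agents, key=lambda a: global_rank[a])
    return ordered.index(agent)


# Toy path graph 0-1-2-3-4 with three agents (hypothetical data).
adjacency = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
positions = {"A": 0, "B": 1, "C": 4}
global_rank = {"A": 2, "B": 0, "C": 1}
local_ranks = {a: ordinal_rank(a, positions, global_rank, adjacency) for a in positions}
```

In this toy case agent C is second in the global ranking but first within its own neighborhood, which is the sort of local, compact signal the abstract credits with acting as an inductive bias.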

2510.22335 2026-06-11 cs.CV cs.AI Updated version

Moving Beyond Diffusion: Hierarchy-to-Hierarchy Autoregression for fMRI-to-Image Reconstruction

Xu Zhang, Ruijie Quan, Wenguan Wang, Yi Yang

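Affiliations: The State Key Lab of Brain-Machine Intelligence, Zhejiang University, China; ReLER, CCAI, College of Artificial Intelligence, Zhejiang University, China

AI summary: Proposes the MindHier framework, which uses a hierarchical fMRI encoder, hierarchy-to-hierarchy alignment, and a scale-aware coarse-to-fine guidance strategy to perform coarse-to-fine fMRI-to-image reconstruction, outperforming diffusion-based methods.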

Comments ICLR 2026

Abstract

Reconstructing visual stimuli from fMRI signals is a central challenge bridging machine learning and neuroscience. Recent diffusion-based methods typically map fMRI activity to a single neural embedding, using it as static guidance throughout the entire generation process. However, this fixed guidance collapses hierarchical neural information and is misaligned with the stage-dependent demands of image reconstruction. In response, we propose MindHier, a coarse-to-fine fMRI-to-image reconstruction framework built on scale-wise autoregressive modeling. MindHier introduces three components: a Hierarchical fMRI Encoder to extract multi-level neural embeddings, a Hierarchy-to-Hierarchy Alignment scheme to enforce layer-wise correspondence with CLIP features, and a Scale-Aware Coarse-to-Fine Neural Guidance strategy to inject these embeddings into autoregression at matching scales. These designs make MindHier an efficient and cognitively aligned alternative to diffusion-based methods by enabling a hierarchical reconstruction process that synthesizes global semantics before refining local details, akin to human visual perception. Extensive experiments on the NSD dataset show that MindHier achieves superior semantic fidelity, 4.67× faster inference, and more deterministic results than the diffusion-based baselines.

2510.14828 2026-06-11 cs.AI cs.RO Updated version

RoboGPT-R1: Enhancing Robot Task Planning with Reinforcement Learning

Jinrui Liu, Bingyan Nie, Boyu Li, Yaran Chen, Yuze Wang, Shunsen He, Haoran Li

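Affiliations: Institute of Automation, CASIA; School of Artificial Intelligence, UCAS; Huawei Cloud Technology Co., Ltd

AI summary: Proposes RoboGPT-R1, a two-stage fine-tuning framework that first acquires foundational knowledge through supervised learning and then improves visual-spatial understanding and reasoning through reinforcement learning, surpassing GPT-4o-mini by 21.33% on EmbodiedBench.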

Journal ref
Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026), pp. 2827-2837, IFAAMAS, 2026
Abstract

Improving the reasoning capabilities of embodied agents is crucial for robots to successfully complete complex human instructions in long-horizon manipulation tasks. Despite the success of large language models and vision language models based on Supervised Fine-Tuning (SFT) in planning tasks, they continue to face challenges in performing long-horizon manipulation tasks in complex real-world environments, owing to their restricted common sense and reasoning capabilities. Considering that aligning general-purpose vision language models to robotic planning tasks via supervised fine-tuning suffers from poor generalization and insufficient physical understanding, we propose RoboGPT-R1, a two-stage fine-tuning framework for embodied planning. In this framework, supervised training acquires foundational knowledge through expert sequences, followed by reinforcement learning (RL) to address the model's shortcomings in visual-spatial understanding and reasoning. To achieve physical understanding and action sequence consistency in multi-step reasoning tasks, we design a rule-based reward function that simultaneously considers long-horizon performance and action constraints in the environment. The reasoning model, trained on Qwen2.5-VL-3B, significantly outperforms the larger-scale model, GPT-4o-mini, by 21.33% and surpasses other work trained on Qwen2.5-VL-7B by 20.33% on the EmbodiedBench benchmark.

2509.16456 2026-06-11 cs.AI Updated version

GPO: Learning from Critical Steps to Improve LLM Reasoning

Jiahao Yu, Zelei Cheng, Xian Wu, Xinyu Xing

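Affiliations: Department of Computer Science, Northwestern University; AI Foundations, Capital One; Meta AI

AI summary: Proposes Guided Pivotal Optimization (GPO), a fine-tuning strategy that identifies critical steps in reasoning trajectories and prioritizes learning from them, significantly improving the multi-step reasoning of large language models.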

Comments 39th Conference on Neural Information Processing Systems (NeurIPS 2025)

Abstract

Large language models (LLMs) are increasingly used in various domains, showing impressive potential on different tasks. Recently, reasoning LLMs have been proposed to improve the reasoning or thinking capabilities of LLMs to solve complex problems. Despite the promising results of reasoning LLMs, enhancing the multi-step reasoning capabilities of LLMs remains a significant challenge. While existing optimization methods have advanced LLM reasoning capabilities, they often treat reasoning trajectories as a whole, without considering the underlying critical steps within the trajectory. In this paper, we introduce Guided Pivotal Optimization (GPO), a novel fine-tuning strategy that dives into the reasoning process to enable more effective improvements. GPO first identifies the "critical step" within a reasoning trajectory: a point at which the model must proceed carefully to succeed at the problem. We locate the critical step by estimating the advantage function. GPO then resets the policy to the critical step, samples new rollouts, and prioritizes learning on those rollouts. This focus allows the model to learn more effectively from pivotal moments within the reasoning process to improve reasoning performance. We demonstrate that GPO is a general strategy that can be integrated with various optimization methods to improve reasoning performance. Besides theoretical analysis, our experiments across challenging reasoning benchmarks show that GPO can consistently and significantly enhance the performance of existing optimization methods, showcasing its effectiveness and generalizability in improving LLM reasoning by concentrating on pivotal moments within the generation process.
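
Locating a step through estimated advantages can be sketched as below. The abstract does not give the selection rule, so this example assumes per-prefix value estimates (for instance, success rates from sampled continuations), zero intermediate reward, and a rule that picks the step with the largest-magnitude advantage. The names `step_advantages` and `pick_critical_step` are hypothetical.

```python
def step_advantages(values):
    """A_t = V(s_{t+1}) - V(s_t), assuming zero intermediate reward and no discounting."""
    return [values[t + 1] - values[t] for t in range(len(values) - 1)]


def pick_critical_step(values):
    """Index of the step whose advantage has the largest magnitude (assumed rule)."""
    advantages = step_advantages(values)
    return max(range(len(advantages)), key=lambda t: abs(advantages[t]))


# Toy value estimates for the prefixes of one reasoning trajectory:
# the estimated success rate collapses after step 1.
prefix_values = [0.60, 0.62, 0.20, 0.25, 0.30]
critical = pick_critical_step(prefix_values)
```

Under these assumptions, new rollouts would be sampled starting from the prefix just before step `critical`, which matches the abstract's description of resetting the policy to the critical step.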

2510.08073 2026-06-11 cs.CV cs.LG Updated version

Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

Shuhai Zhang, ZiHao Lian, Jiahao Yang, Daiyuan Li, Guoxuan Pang, Feng Liu, Bo Han, Shutao Li, Mingkui Tan

Affiliations: South China University of Technology; University of Science and Technology of China; Key Laboratory of Big Data and Intelligent Robot, Ministry of Education; Pazhou Lab; University of Melbourne; Hunan University; Hong Kong Baptist University

AI summary: Proposes a physics-driven paradigm for detecting AI-generated video based on probability flow conservation. A Normalized Spatiotemporal Gradient (NSG) statistic captures physical anomalies, a pre-trained diffusion model is used to estimate NSG, and Maximum Mean Discrepancy (MMD) is used for detection, improving Recall and F1-Score by 16.00% and 10.75%, respectively.

Comments Accepted at NeurIPS 2025 spotlight

Abstract

AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose the first physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradient approximation and motion-aware temporal modeling, without complex motion decomposition and while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Finally, we derive an upper bound on NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at https://github.com/ZSHsh98/NSG-VD.
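
The detection metric named above, Maximum Mean Discrepancy, has a standard sample estimate that is easy to show in isolation. The sketch below uses the biased estimator with a Gaussian kernel on toy vectors standing in for NSG features; the kernel choice, the bandwidth, and the feature values are assumptions, and the NSG estimator itself is not reproduced here.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased sample estimate of squared MMD between two sets of feature vectors."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical stand-ins for NSG features of real videos and of a test video.
real_feats = [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.05]]
test_feats = [[2.0, 2.1], [2.1, 1.9], [1.9, 2.0]]

score_same = mmd_squared(real_feats, real_feats)
score_shifted = mmd_squared(real_feats, test_feats)
```

A detector built on this would flag a test video when its score against the real-video reference set exceeds a threshold; how NSG-VD sets that threshold is not described in the abstract.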

2510.06596 2026-06-11 cs.CV cs.AI cs.IT cs.LG math.IT Updated version

SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin

Affiliations: Northeastern University, Khoury College of Computer Sciences; Binghamton University, School of Computing; Air Force Research Laboratory, Mission Applications and Infrastructure Section

AI summary: Proposes the SDQM metric, which evaluates synthetic data quality without training a model to convergence, correlates strongly with the mAP of YOLO11, and outperforms existing metrics.

Comments Accepted and Published at SPIE: Journal of Electronic Imaging, Vol. 35, Issue 3

Journal ref
Journal of Electronic Imaging 35(3), 033014 (2026)
Abstract

The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. We introduce the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean average precision (mAP) scores of YOLO11, a leading object detection model, whereas previous metrics only exhibited moderate or weak correlations. In addition, it provides actionable insights into improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at https://github.com/ayushzenith/SDQM

2510.01529 2026-06-11 cs.LG cs.CR 版本更新

Bypassing Prompt Guards in Production with Controlled-Release Prompting

绕过生产环境中的提示守卫:受控释放提示攻击

Jaiden Fairoze, Sanjam Garg, Keewoo Lee, Mingyuan Wang

发表机构 * UC Berkeley(加州大学伯克利分校) zkBricks Inc(zkBricks公司) Ethereum Foundation(以太坊基金会) NYU Shanghai(纽约大学上海分校)

AI总结 针对AI对齐的提示过滤存在理论上的不可能性,本文提出受控释放提示攻击,利用轻量级输入过滤器与主模型之间的资源不对称性,在实际部署的大语言模型系统中成功绕过提示守卫。

Comments Accepted to USENIX Security 2026

详情
AI中文摘要

Ball等人最近指出,用于AI对齐的提示过滤面临一个根本性障碍:在标准密码学假设下,任何运行速度远快于被保护模型的过滤器都无法普遍区分对抗性提示和良性提示。我们研究这一不可能性结果是否转化为已部署的大语言模型(LLM)系统中的现实漏洞。我们通过引入受控释放提示攻击给出了肯定答案,这是理论框架的一种实际实例化,利用了轻量级输入过滤器与其保护的主模型之间的资源不对称性。与理论构造不同,我们的攻击不需要修改模型:它生成任何有界过滤器无法解读但对目标LLM仍然可处理的恶意提示。我们发现,在基线方法失败的四个主要聊天平台(Google Gemini、DeepSeek Chat、xAI Grok和Mistral Le Chat)上,我们的攻击均成功。此外,我们将攻击应用于从Gemini提取受版权保护的数据。最后,我们对14个开源提示守卫模型进行了系统评估,揭示即使具有推理能力的过滤器也无法在不产生过高资源开销的情况下可靠地检测我们的攻击。

英文摘要

Ball et al. recently established that prompt filtering for AI alignment faces a fundamental barrier: under standard cryptographic assumptions, no filter running significantly faster than the protected model can universally distinguish adversarial prompts from benign ones. We investigate whether this impossibility result translates to real-world vulnerabilities in deployed large language model (LLM) systems. We answer affirmatively by introducing controlled-release prompting, a practical instantiation of the theoretical framework that exploits the resource asymmetry between lightweight input filters and the main models they protect. Unlike the theoretical construction, our attack does not require model modification: it generates malicious prompts that are indecipherable by any bounded filter yet remain tractable to the target LLM. We find our attack to be successful on four major chat platforms (Google Gemini, DeepSeek Chat, xAI Grok, and Mistral Le Chat) where baseline methods fail. Additionally, we apply our attack to extract copyrighted data from Gemini. Finally, we provide a systematic evaluation of 14 open-weight prompt guard models, revealing that even reasoning-capable filters cannot reliably detect our attack without incurring prohibitive resource overhead.

2510.03520 2026-06-11 cs.LG cs.AI cs.SY eess.SY 版本更新

Certifiable Safe RLHF: Semantic Grounding and Fixed Penalty Constraint Optimization for Safer LLM Alignment

可认证安全RLHF:基于语义基础与固定惩罚约束优化的更安全大语言模型对齐

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系) New Jersey Institute of Technology(新泽西理工学院) Department of Computer Engineering(计算机工程系) Heritage Institute of Technology(遗产理工学院)

AI总结 针对现有RLHF方法依赖奖励/成本函数和双变量调优导致性能敏感且缺乏可证明安全保证的问题,提出CS-RLHF,通过语义基础成本模型和固定惩罚约束优化,实现可认证安全对齐,效率提升至少5倍。

详情
AI中文摘要

确保安全是大语言模型(LLMs)的基本要求。在增强模型输出效用与减轻其潜在危害之间取得适当平衡是一个复杂且持续的挑战。当代方法通常将这个问题形式化为约束马尔可夫决策过程(CMDP)框架,并采用成熟的CMDP优化技术。然而,这些方法表现出两个显著的限制。首先,它们对奖励和成本函数的依赖使得性能对底层评分机制高度敏感,而该机制必须捕捉语义含义,而不是被表面关键词触发。其次,基于CMDP的训练需要调整双变量,这一过程计算成本高昂,并且对于可能通过对抗性越狱利用的固定双变量,不提供任何可证明的安全保证。为了克服这些限制,我们引入了可认证安全RLHF(CS-RLHF),它引入了一个在大规模语料库上训练的成本模型,以分配基于语义的安全分数。与基于拉格朗日的方法相比,CS-RLHF采用了一种修正的基于惩罚的公式。该设计借鉴了约束优化中精确惩罚函数理论,其中约束满足直接通过适当选择的惩罚项来强制执行。通过适当缩放的惩罚,可以在优化器处保证安全约束的可行性,从而消除了双变量更新的需要。实证评估表明,CS-RLHF优于最先进的LLM模型响应,对正常和越狱提示的效率至少提高5倍。

英文摘要

Ensuring safety is a foundational requirement for large language models (LLMs). Achieving an appropriate balance between enhancing the utility of model outputs and mitigating their potential for harm is a complex and persistent challenge. Contemporary approaches frequently formalize this problem within the framework of Constrained Markov Decision Processes (CMDPs) and employ established CMDP optimization techniques. However, these methods exhibit two notable limitations. First, their reliance on reward and cost functions renders performance highly sensitive to the underlying scoring mechanism, which must capture semantic meaning rather than being triggered by superficial keywords. Second, CMDP-based training entails tuning a dual variable, a process that is computationally expensive and provides no provable safety guarantee for a fixed dual variable, which can be exploited through adversarial jailbreaks. To overcome these limitations, we introduce Certifiable Safe-RLHF (CS-RLHF), which uses a cost model trained on a large-scale corpus to assign semantically grounded safety scores. In contrast to the Lagrangian-based approach, CS-RLHF adopts a rectified penalty-based formulation. This design draws on the theory of exact penalty functions in constrained optimization, wherein constraint satisfaction is enforced directly through a suitably chosen penalty term. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed at the optimizer, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art LLM responses, being at least 5 times more efficient against nominal and jailbreaking prompts.
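As a rough illustration of the rectified penalty idea described in the abstract, the sketch below subtracts a fixed penalty times the constraint violation `max(0, cost - budget)` from the reward. The function name, the parameter names, and the numeric values are all hypothetical; the abstract does not give the exact objective used by CS-RLHF, so this shows only the general exact-penalty form.

```python
def rectified_penalty_objective(reward: float, cost: float,
                                budget: float, penalty: float) -> float:
    """Reward minus a fixed penalty on the violation max(0, cost - budget).

    Hypothetical helper: the penalty weight is a fixed constant, so no
    dual variable is updated during training.
    """
    violation = max(0.0, cost - budget)  # zero whenever the constraint holds
    return reward - penalty * violation


# A response within the safety budget keeps its full reward.
safe = rectified_penalty_objective(reward=1.0, cost=0.2, budget=0.5, penalty=10.0)

# A response over the budget is penalized in proportion to the violation,
# even though its raw reward is higher.
unsafe = rectified_penalty_objective(reward=1.5, cost=0.9, budget=0.5, penalty=10.0)
```

With a sufficiently large fixed penalty, the unsafe response scores below the safe one, which is the intuition behind enforcing feasibility without dual-variable updates.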

2508.09459 2026-06-11 cs.CV cs.AI 版本更新

RelayFormer: A Unified Local-Global Attention Framework for Scalable Image and Video Manipulation Localization

RelayFormer: 一种用于可扩展图像和视频篡改定位的统一局部-全局注意力框架

Wen Huang, Jiarui Yang, Tao Dai, Jiawei Li, Shaoxiong Zhan, Bin Wang, Shu-Tao Xia

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学) College of Artificial Intelligence, Nankai University(南开大学人工智能学院) College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 提出RelayFormer统一框架,通过全局局部中继(GLR)令牌和中继注意力机制,适应不同分辨率并统一处理图像与视频,在篡改定位任务中实现高效且性能优越。

详情
AI中文摘要

视觉篡改定位(VML)旨在识别图像和视频中被篡改的区域,随着高级编辑工具的兴起,这一任务变得日益具有挑战性。现有方法面临两个核心问题。首先是分辨率多样性。调整大小或填充可能会扭曲微妙的取证线索,并引入不必要的计算成本。其次是将图像的空间模型扩展到视频的时空输入的困难,这通常导致为两种数据类型维护单独的架构。为了解决这些挑战,我们提出了RelayFormer,一个统一框架,能够适应不同分辨率并自然处理静态和时态视觉数据。RelayFormer将输入划分为固定大小的子图像,并引入全局局部中继(GLR)令牌,通过基于中继的注意力机制传播结构化上下文。这种设计使得全局线索(如语义或时间一致性)的高效交换成为可能,同时保留细粒度的篡改伪影。与依赖统一调整大小或稀疏注意力的先前方法不同,RelayFormer以最小的开销扩展到可变分辨率和视频序列。跨多个基准的实验表明,其具有优越的性能和强大的效率,结合了无需插值或过多填充的分辨率适应性、图像和视频的统一处理,以及准确性和计算成本之间的有利平衡。代码可在 https://github.com/WenOOI/RelayFormer 获取。

英文摘要

Visual manipulation localization (VML) aims to identify tampered regions in images and videos, a task that has become increasingly challenging with the rise of advanced editing tools. Existing methods face two central issues. The first is resolution diversity. Resizing or padding can distort subtle forensic cues and introduce unnecessary computational cost. The second is the difficulty of extending spatial models for images to spatio-temporal inputs in videos, which often results in maintaining separate architectures for the two data types. To address these challenges, we propose RelayFormer, a unified framework that adapts to varying resolutions and naturally handles both static and temporal visual data. RelayFormer partitions inputs into fixed-size sub-images and introduces Global Local Relay (GLR) tokens that propagate structured context through a relay-based attention mechanism. This design enables efficient exchange of global cues, such as semantic or temporal consistency, while preserving fine-grained manipulation artifacts. Unlike prior approaches that depend on uniform resizing or sparse attention, RelayFormer scales to variable resolutions and video sequences with minimal overhead. Experiments across diverse benchmarks demonstrate superior performance and strong efficiency, combining resolution adaptivity without interpolation or excessive padding, unified processing for images and videos, and a favorable balance between accuracy and computational cost. Code is available at https://github.com/WenOOI/RelayFormer.

2510.02149 2026-06-11 cs.LG math.OC stat.ML 版本更新

Reinforcement Learning with Action-Triggered Observations

具有动作触发观测的强化学习

Alexander Ryabchenko, Wenlong Mou

发表机构 * Department of Statistical Sciences, University of Toronto(多伦多大学统计科学系) Vector Institute(向量研究所)

AI总结 提出动作触发稀疏可追踪MDP框架,推导Bellman方程并证明最优策略存在,利用观测间动作序列的线性表示实现基于回归的方法,在几何分布情节下达到与完全可观测线性MDP匹配的遗憾界。

详情
AI中文摘要

我们引入了动作触发稀疏可追踪马尔可夫决策过程(ATST-MDPs),这是一种用于部分可观测性的强化学习框架,其中完整状态观测在每个步骤以由所选动作决定的概率随机发生。我们推导了针对该设置的Bellman方程,并证明了最优策略的存在性。利用稀疏观测揭示完整状态的事实,我们提供了一个等价公式,其中智能体在连续观测之间承诺动作序列。在线性MDP假设下,我们证明了这些动作序列上的值函数在有限维特征映射中具有线性表示,从而能够使用标准的基于回归的方法。作为一个应用,我们推导了ATST-LSVI-UCB,一种乐观算法,在几何分布的情节学习中实现了遗憾界$\widetilde{O}(\sqrt{Kd^3(1-\gamma)^{-3}})$,其中$K$是情节数,$d$是特征维度,$\gamma$是折扣因子(情节继续概率),与完全可观测线性MDP的已知速率相匹配。

英文摘要

We introduce Action-Triggered Sporadically Traceable Markov Decision Processes (ATST-MDPs), a reinforcement learning framework for partial observability in which full state observations occur stochastically at each step, with probability determined by the chosen action. We derive Bellman equations tailored to this setting and establish the existence of an optimal policy. Exploiting the fact that sporadic observations reveal the full state, we provide an equivalent formulation in which agents commit to action-sequences between consecutive observations. Under the linear MDP assumption, we show that the value function over such action-sequences admits a linear representation in a finite-dimensional feature map, enabling standard regression-based methods. As an application, we derive ATST-LSVI-UCB, an optimistic algorithm achieving regret $\widetilde{O}(\sqrt{Kd^3(1-\gamma)^{-3}})$ for episodic learning with geometrically distributed horizons, where $K$ is the number of episodes, $d$ the feature dimension, and $\gamma$ the discount factor (episode continuation probability), matching the known rate for linear MDPs with full observability.
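To make the stated rate concrete, the sketch below evaluates the leading term $\sqrt{K d^3 (1-\gamma)^{-3}}$ while ignoring constants and logarithmic factors. It also computes the mean of a geometric horizon under the assumption that every episode contains at least one step and then continues with probability $\gamma$. Both function names are hypothetical and are not taken from the paper.

```python
import math


def expected_horizon(gamma: float) -> float:
    """Mean episode length when each step continues with probability gamma.

    Assumes the episode always contains a first step, so the length is
    geometric on {1, 2, ...} with mean 1 / (1 - gamma).
    """
    return 1.0 / (1.0 - gamma)


def regret_scale(num_episodes: int, feature_dim: int, gamma: float) -> float:
    """Leading term sqrt(K * d^3 * (1 - gamma)^-3); constants and logs omitted."""
    return math.sqrt(num_episodes * feature_dim ** 3 / (1.0 - gamma) ** 3)


# Quadrupling the number of episodes doubles the leading term of the bound.
ratio = regret_scale(400, 2, 0.9) / regret_scale(100, 2, 0.9)
```

The square-root dependence on $K$ means the per-episode regret shrinks as more episodes are played, while a continuation probability close to 1 (long horizons) inflates the bound through the $(1-\gamma)^{-3}$ factor.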

2509.26294 2026-06-11 cs.LG cs.AI 版本更新

Noise-Guided Transport for Imitation Learning

噪声引导的模仿学习传输方法

Lionel Blondé, Joao A. Candido Ramos, Alexandros Kalousis

发表机构 * University of Cambridge(剑桥大学) University of Oxford(牛津大学)

AI总结 针对低数据场景下的模仿学习,提出噪声引导传输(NGT)方法,通过对抗训练将模仿问题转化为最优传输问题,无需预训练或特殊架构,在极低数据量下实现强性能。

Comments Accepted at ICML 2026. Code: https://github.com/lionelblonde/ngt

详情
AI中文摘要

我们考虑低数据场景下的模仿学习,其中只有有限数量的专家演示可用。在这种情况下,依赖大规模预训练或高容量架构的方法难以应用,对演示数据的效率变得至关重要。我们引入了噪声引导传输(NGT),一种轻量级的离策略方法,将模仿问题转化为通过对抗训练解决的最优传输问题。NGT不需要预训练或专门架构,通过设计包含不确定性估计,并且易于实现和调优。尽管简单,NGT在具有挑战性的连续控制任务(包括高维人形任务)中,在仅有20个转换的超低数据场景下取得了强劲的性能。

英文摘要

We consider imitation learning in the low-data regime, where only a limited number of expert demonstrations are available. In this setting, methods that rely on large-scale pretraining or high-capacity architectures can be difficult to apply, and efficiency with respect to demonstration data becomes critical. We introduce Noise-Guided Transport (NGT), a lightweight off-policy method that casts imitation as an optimal transport problem solved via adversarial training. NGT requires no pretraining or specialized architectures, incorporates uncertainty estimation by design, and is easy to implement and tune. Despite its simplicity, NGT achieves strong performance on challenging continuous control tasks, including high-dimensional Humanoid tasks, under ultra-low data regimes with as few as 20 transitions.

2509.25359 2026-06-11 cs.CL cs.AI 版本更新

Geometric Metrics and LLMs: What They Measure and When They Work

几何度量与大语言模型:它们测量什么以及何时有效

Viacheslav Yusupov, Anna Antipina, Ameliia Alaeva, Danil Maksimov, Anna Vasileva, Tatyana Zaitseva, Alina Ermilova, Evgeny Burnaev, Egor Shvetsov

发表机构 * Moscow Institute of Physics and Technology(莫斯科物理技术学院) Russian Academy of Sciences(俄罗斯科学院)

AI总结 本文系统测试了用于大语言模型评估的几何度量,发现部分度量主要反映输出长度,而几何度量在文本统计基础上提供有限但真实的信息,并指出故障检测是最有前景的应用。

详情
AI中文摘要

我们提出了对大语言模型评估中几何度量的系统性压力测试。基于排名的内部表示几何特性作为无参考质量信号显示出前景,但其可靠的条件仍不清楚。我们评估了八种常用度量:内在维度估计器、谱范数及相关量,在六个测试模型(0.5-8B)和八个生成器上对比任务,将真实的几何信号与文本长度效应以及标准文本统计已捕获的信息区分开。三个发现出现。首先,一些度量(特别是Schatten范数和MOM)主要反映输出长度,一旦控制长度,其明显的区分能力就崩溃。其次,几何度量在文本统计之外增加了适度但真实的信息:结合它们,分类器在6路生成器识别上达到78%的准确率,而仅用文本统计为69%。第三,度量并不追踪文本质量的通用概念,而是显示内在维度与词汇多样性(RTTR)之间仅存在中等关联。我们给出了特定用例的建议,并指出故障检测是最有前景的近期应用。

英文摘要

We present a systematic stress test of geometric metrics for LLM evaluation. Rank-based geometric properties of internal representations have shown promise as reference-free quality signals, but the conditions under which they are reliable remain unclear. We evaluate eight commonly used metrics (intrinsic-dimensionality estimators, spectral norms, and related quantities) across six tester models (0.5-8B) and eight generators on contrasting tasks, separating genuine geometric signal from text-length effects and from what standard text statistics already capture. Three findings emerge. First, some metrics (notably Schatten Norm and MOM) mainly reflect output length, and their apparent discriminative power collapses once length is controlled. Second, geometric metrics add modest but real information beyond text statistics: combined with them, a classifier reaches 78% accuracy on 6-way generator identification versus 69% for text statistics alone. Third, rather than tracking a general notion of text quality, the metrics show only a moderate association between intrinsic dimensionality and lexical diversity (RTTR). We give use-case-specific recommendations and identify failure detection as the most promising near-term application.

2509.23982 2026-06-11 cs.CL cs.AI cs.CY cs.LG cs.NE 版本更新

Toward Preference-aligned Large Language Models via Residual-based Model Steering

基于残差模型引导的偏好对齐大型语言模型

Lucio La Cava, Andrea Tagarelli

发表机构 * DIMES Dept., University of Calabria, Italy(意大利卡拉布里亚大学DIMES系)

AI总结 提出PaLRS方法,利用残差流中的偏好信号提取轻量级引导向量,无需训练即可在推理时对齐模型偏好,在数学推理和代码生成任务上取得一致提升,同时节省大量时间。

Comments Accepted at IJCAI 2026

详情
AI中文摘要

偏好对齐是使大型语言模型(LLMs)有用且与(人类)偏好一致的关键步骤。现有方法如基于人类反馈的强化学习或直接偏好优化通常需要精心策划的数据和对数十亿参数进行昂贵的优化,最终导致持久性的任务特定模型。在这项工作中,我们引入了基于残差引导的LLM偏好对齐(PaLRS),这是一种无需训练的方法,利用LLM残差流中编码的偏好信号。从仅一百个偏好对中,PaLRS提取出轻量级、即插即用的引导向量,可在推理时应用以将模型推向偏好行为。我们在各种中小型开源LLM上评估了PaLRS,显示PaLRS对齐的模型在数学推理和代码生成基准上取得了一致的提升,同时保持了基线通用性能。此外,与使用DPO和SimPO对齐的模型相比,它们表现更好且节省大量时间。我们的发现强调,PaLRS为标准偏好优化流程提供了一种有效、更高效且灵活的替代方案,提供了一种无需训练、即插即用的对齐机制,且数据需求极少。

英文摘要

Preference alignment is a critical step in making Large Language Models (LLMs) useful and aligned with (human) preferences. Existing approaches such as Reinforcement Learning from Human Feedback or Direct Preference Optimization typically require curated data and expensive optimization over billions of parameters, and eventually lead to persistent task-specific models. In this work, we introduce Preference alignment of Large Language Models via Residual Steering (PaLRS), a training-free method that exploits preference signals encoded in the residual streams of LLMs. From as few as one hundred preference pairs, PaLRS extracts lightweight, plug-and-play steering vectors that can be applied at inference time to push models toward preferred behaviors. We evaluate PaLRS on various small-to-medium-scale open-source LLMs, showing that PaLRS-aligned models achieve consistent gains on mathematical reasoning and code generation benchmarks while preserving baseline general-purpose performance. Moreover, when compared to models aligned with DPO and SimPO, they perform better with great time-savings. Our findings highlight that PaLRS offers an effective, much more efficient and flexible alternative to standard preference optimization pipelines, offering a training-free, plug-and-play mechanism for alignment with minimal data.

2509.19463 2026-06-11 cs.RO 版本更新

CU-Multi: A Dataset for Multi-Robot Collaborative Perception

CU-Multi:多机器人协同感知数据集

Doncey Albin, Daniel McGann, Miles Mena, Annika Thomas, Harel Biggie, Xuefei Sun, Steve McGuire, Jonathan P. How, Christoffer Heckman

发表机构 * Autonomous Robotics and Perception Group at the University of Colorado Boulder(科罗拉多大学博尔德分校自主机器人与感知组) Robot Perception Lab at Carnegie Mellon University(卡内基梅隆大学机器人感知实验室) Aerospace Controls Laboratory at Massachusetts Institute of Technology(麻省理工学院航空航天控制实验室) Computer Science and Artificial Intelligence Laboratory at Massachusetts Institute of Technology(麻省理工学院计算机科学与人工智能实验室) Human-Aware Robotic Exploration Lab at University of California Santa Cruz(加州大学圣克鲁兹分校人感知机器人探索实验室)

AI总结 针对多机器人协同感知基准测试缺乏专用数据集的问题,提出CU-Multi数据集,包含多天采集的同步多机器人轨迹、RGB-D、RTK GPS、语义LiDAR及精确里程计,支持可重复评估。

Comments 8 pages, 11 figures. arXiv admin note: text overlap with arXiv:2505.17576

详情
AI中文摘要

多机器人系统的一个核心挑战是将独立收集的感知数据融合成统一表示。尽管协同SLAM(C-SLAM)取得了进展,但由于缺乏专用的多机器人数据集,基准测试仍然受到阻碍。许多评估转而分割单机器人轨迹,这种做法可能仅部分反映真实的多机器人操作,更关键的是缺乏标准化,导致结果难以解释或跨研究比较。虽然最近引入了几个多机器人数据集,但它们大多包含短轨迹,机器人间重叠有限且机器人内闭环稀疏。为克服这些限制,我们引入了CU-Multi,这是一个在科罗拉多大学博尔德分校两个大型户外场地多天收集的数据集。CU-Multi包含四个同步运行,具有对齐的起始时间和受控的轨迹重叠,复现了机器人团队的不同视角。它包括RGB-D感知、RTK GPS、语义LiDAR和精化的地面真实里程计。通过将重叠变化与密集语义标注相结合,CU-Multi为多机器人协同感知任务中的可重复评估提供了坚实基础。

英文摘要

A central challenge for multi-robot systems is fusing independently gathered perception data into a unified representation. Despite progress in Collaborative SLAM (C-SLAM), benchmarking remains hindered by the scarcity of dedicated multi-robot datasets. Many evaluations instead partition single-robot trajectories, a practice that may only partially reflect true multi-robot operations and, more critically, lacks standardization, leading to results that are difficult to interpret or compare across studies. While several multi-robot datasets have recently been introduced, they mostly contain short trajectories with limited inter-robot overlap and sparse intra-robot loop closures. To overcome these limitations, we introduce CU-Multi, a dataset collected over multiple days at two large outdoor sites on the University of Colorado Boulder campus. CU-Multi comprises four synchronized runs with aligned start times and controlled trajectory overlap, replicating the distinct perspectives of a robot team. It includes RGB-D sensing, RTK GPS, semantic LiDAR, and refined ground-truth odometry. By combining overlap variation with dense semantic annotations, CU-Multi provides a strong foundation for reproducible evaluation in multi-robot collaborative perception tasks.

2509.14860 2026-06-11 cs.CV cs.AI cs.CL cs.MA 版本更新

MARIC: Multi-Agent Reasoning for Image Classification

MARIC:用于图像分类的多智能体推理

Wonduk Seo, Minhyeong Yu, Hyunjin An, Seunghyun Lee

发表机构 * Enhans, Seoul, South Korea(韩国首尔Enhans) Peking University, Beijing, China(中国北京北京大学)

AI总结 提出多智能体框架MARIC,通过分解图像分类为协作推理过程,利用大纲智能体、方面智能体和推理智能体进行多视角分析与综合,在四个基准数据集上显著优于基线方法。

Comments 11 pages, preprint

详情
AI中文摘要

图像分类传统上依赖于参数密集型模型训练,需要大规模标注数据集和大量微调才能达到有竞争力的性能。虽然最近的视觉语言模型(VLM)缓解了其中一些限制,但它们仍然受限于对单次表示的依赖,往往无法捕捉视觉内容的互补方面。在本文中,我们介绍了基于多智能体的图像分类推理(MARIC),这是一个多智能体框架,将图像分类重新表述为协作推理过程。MARIC首先利用大纲智能体分析图像的全局主题并生成有针对性的提示。基于这些提示,三个方面智能体沿着不同的视觉维度提取细粒度描述。最后,推理智能体通过集成反思步骤综合这些互补输出,产生用于分类的统一表示。通过明确地将任务分解为多个视角并鼓励反思性综合,MARIC减轻了参数繁重训练和单一VLM推理的缺点。在4个不同的图像分类基准数据集上的实验表明,MARIC显著优于基线,突出了多智能体视觉推理在鲁棒且可解释的图像分类中的有效性。

英文摘要

Image classification has traditionally relied on parameter-intensive model training, requiring large-scale annotated datasets and extensive fine-tuning to achieve competitive performance. While recent vision-language models (VLMs) alleviate some of these constraints, they remain limited by their reliance on single-pass representations, often failing to capture complementary aspects of visual content. In this paper, we introduce Multi Agent based Reasoning for Image Classification (MARIC), a multi-agent framework that reformulates image classification as a collaborative reasoning process. MARIC first utilizes an Outliner Agent to analyze the global theme of the image and generate targeted prompts. Based on these prompts, three Aspect Agents extract fine-grained descriptions along distinct visual dimensions. Finally, a Reasoning Agent synthesizes these complementary outputs through an integrated reflection step, producing a unified representation for classification. By explicitly decomposing the task into multiple perspectives and encouraging reflective synthesis, MARIC mitigates the shortcomings of both parameter-heavy training and monolithic VLM reasoning. Experiments on 4 diverse image classification benchmark datasets demonstrate that MARIC significantly outperforms baselines, highlighting the effectiveness of multi-agent visual reasoning for robust and interpretable image classification.

2509.11548 2026-06-11 cs.CV 版本更新

How Auxiliary Reasoning Unleashes GUI Grounding in VLMs

辅助推理如何释放VLM中的GUI定位能力

Weiming Li, Yan Shao, Jing Yang, Yujing Lu, Ling Zhong, Yuhan Wang, Min Yu, Tongxiao Ruan, Manni Duan

发表机构 * Zhejiang Lab(浙江实验室) Hangzhou Research and Development Center(杭州研发中心) China Mobile(中国移动) Innovation Center of Yangtze River Delta(长江三角洲创新中心) Zhejiang University(浙江大学)

AI总结 针对VLM在GUI定位任务中隐式空间理解强但显式坐标输出弱的问题,提出三种零样本辅助推理方法(如标记网格),通过输入图像添加空间线索,显著提升定位性能,在多个基准上达到接近最优微调方法的效果。

详情
AI中文摘要

图形用户界面(GUI)定位是构建GUI代理的基础任务。然而,通用视觉语言模型(VLM)由于缺乏特定优化,在此任务上表现不佳。本文识别出一个关键差距:尽管VLM表现出显著的潜在定位能力(如通过Pointing Game衡量的性能所示),但在输出显式坐标时表现不佳。为了解决这一差异并绕过当前微调方法的高数据和高标注成本,我们提出了三种零样本辅助推理方法。通过提供显式空间线索(如轴、网格和标记交点)作为输入图像的一部分,这些方法使VLM能够更好地表达其隐式空间理解能力。我们在四个GUI定位基准上评估了这些方法,涉及七个开源和专有VLM。实验结果表明,辅助推理带来了显著的性能提升。Mark-Grid Scaffold将Gemini-3.1-Pro在ScreenSpot-v2上的直接推理准确率从11.72%提升至95.20%,在ScreenSpot上达到最先进性能,并在ScreenSpot-v2和UI-I2E-Bench上接近最强的微调方法。我们的代码可在 https://github.com/liweim/AuxiliaryReasoning 获取。

英文摘要

Graphical user interface (GUI) grounding is a fundamental task for building GUI agents. However, general vision-language models (VLMs) struggle with this task due to a lack of specific optimization. We identify a key gap in this paper: while VLMs exhibit significant latent grounding potential, as demonstrated by their performance measured by Pointing Game, they underperform when tasked with outputting explicit coordinates. To address this discrepancy and bypass the high data and annotation costs of current fine-tuning approaches, we propose three zero-shot auxiliary reasoning methods. By providing explicit spatial cues such as axes, grids and labeled intersections as part of the input image, these methods enable VLMs to better articulate their implicit spatial understanding capabilities. We evaluate these methods on four GUI grounding benchmarks across seven open-source and proprietary VLMs. Experimental results show substantial gains from auxiliary reasoning. Mark-Grid Scaffold boosts Gemini-3.1-Pro from 11.72% under direct inference to 95.20% on ScreenSpot-v2, achieves state-of-the-art performance on ScreenSpot, and approaches the strongest fine-tuned methods on ScreenSpot-v2 and UI-I2E-Bench. Our code is available at https://github.com/liweim/AuxiliaryReasoning.
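The "labeled intersections" cue mentioned above can be pictured as a lookup from a short label to pixel coordinates: the model names an intersection, and the label is resolved to an explicit point. The sketch below is only an illustration of that idea; the labeling scheme (row letter plus column number), the function name, and the grid spacing are assumptions, since the abstract does not describe the actual scaffold layout.

```python
def grid_intersections(width: int, height: int, step: int) -> dict[str, tuple[int, int]]:
    """Map labels such as 'B3' to the (x, y) pixel position of a grid intersection.

    Hypothetical scheme: rows are lettered A, B, C, ... from the top and
    columns are numbered from 1 at the left. Supports at most 26 rows.
    """
    labels: dict[str, tuple[int, int]] = {}
    for row, y in enumerate(range(0, height + 1, step)):
        for col, x in enumerate(range(0, width + 1, step)):
            labels[f"{chr(ord('A') + row)}{col + 1}"] = (x, y)
    return labels


# A 400x200 screenshot with lines every 100 pixels has 3 rows and 5 columns.
points = grid_intersections(width=400, height=200, step=100)
target = points["B3"]  # (200, 100): second row, third column
```

Answering with a label instead of raw coordinates turns coordinate regression into a selection among visible marks, which is one plausible reading of why such cues help a model express what it already localizes implicitly.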

2509.10303 2026-06-11 cs.LG cs.AI 版本更新

Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions

超越次优性:离线强化学习通过随机解决方案学习有效调度

Jesse van Remmerden, Zaharah Bukhsh, Yingqian Zhang

发表机构 * Eindhoven University of Technology(埃因霍温理工大学)

AI总结 提出离线RL算法CDQAC,从次优静态数据集学习调度策略,在JSP/FJSP上超越在线RL和强启发式方法,仅需1-5%数据,发现状态-动作覆盖比轨迹质量更重要。

详情
AI中文摘要

在线强化学习(RL)方法通过与模拟环境直接交互学习调度策略,在作业车间调度(JSP)和柔性作业车间调度(FJSP)问题上表现出色。然而,这些方法通常需要大量的训练交互,限制了其样本效率和实际适用性。受此挑战的启发,我们引入了保守离散分位数演员-评论家(CDQAC),这是一种离线RL算法,可以直接从静态、次优数据集中学习有效的调度策略。CDQAC将基于分位数的评论家与延迟策略更新相结合,以估计机器-操作对的回报分布。在JSP和FJSP基准上的大量实验表明,CDQAC始终优于生成数据的启发式方法,超越了最先进的离线和在线RL基线,并且具有很高的样本效率,仅需原始数据集的1%到5%即可学习高质量策略。我们的分析表明,在调度中,离线RL的性能主要受状态-动作覆盖范围而非单个轨迹质量的影响。调度将密集奖励(与完工时间目标对齐)与跨启发式方法的等长轨迹相结合,从而能够从广泛的行为中有效学习。与此观察一致,由简单随机启发式方法生成的具有更广覆盖范围的数据集,使其性能优于在由更强启发式方法(如遗传算法)生成的数据集上训练的策略。

英文摘要

Online reinforcement learning (RL) approaches have demonstrated strong performance on Job Shop Scheduling (JSP) and Flexible JSP (FJSP) problems by learning scheduling policies through direct interaction with simulated environments. However, these methods often require extensive training interactions, limiting their sample efficiency and practical applicability. Motivated by this challenge, we introduce Conservative Discrete Quantile Actor-Critic (CDQAC), an offline RL algorithm that learns effective scheduling policies directly from static, suboptimal datasets. CDQAC couples a quantile-based critic with delayed policy updates to estimate the return distribution of machine-operation pairs. Extensive experiments on JSP and FJSP benchmarks demonstrate that CDQAC consistently outperforms the data-generating heuristics, surpasses state-of-the-art offline and online RL baselines, and is highly sample efficient, requiring only 1 to 5% of the original dataset to learn high-quality policies. Our analysis suggests that, in scheduling, offline RL performance is governed mainly by state-action coverage rather than the quality of individual trajectories. Scheduling couples a dense reward aligned with the makespan objective with equal-length trajectories across heuristics, enabling effective learning from a broad range of behaviors. Consistent with this observation, policies trained on datasets generated by a simple random heuristic, which offer broader coverage, outperform policies trained on datasets produced by stronger heuristics such as Genetic Algorithms.

2507.21164 2026-06-11 cs.LG cs.AI eess.IV stat.ML 版本更新

OCSVM-Guided Representation Learning for Unsupervised Anomaly Detection

OCSVM引导的无监督异常检测表示学习

Nicolas Pinon, Robin Trombetta, Carole Lartizien

发表机构 * Univ. Lyon(里昂大学) CNRS UMR 5220(国家科学研究中心UMR 5220) Inserm U1294(法国国家医学研究院U1294) INSA Lyon(里昂国立应用科学学院) UCBL(克洛德·贝尔纳里昂第一大学) CREATIS(里昂大学生物医学图像研究中心)

AI总结 提出一种将表示学习与可解析求解的一类SVM耦合的方法,通过定制损失函数直接对齐潜在特征与决策边界,在MNIST-C和脑MRI病变检测任务上展现了鲁棒性和性能。

详情
AI中文摘要

无监督异常检测(UAD)旨在无需标签数据检测异常,这在许多机器学习应用中是必要的,因为异常样本稀少或不可用。大多数最先进的方法分为两类:基于重构的方法(通常重构异常过于完美)和与密度估计器解耦的表示学习(可能遭受次优特征空间)。虽然一些近期方法尝试耦合特征学习和异常检测,但它们通常依赖替代目标、限制核选择或引入近似,从而限制了表达能力和鲁棒性。为解决这一挑战,我们提出了一种新颖方法,通过自定义损失公式将表示学习与可解析求解的一类SVM(OCSVM)耦合,该损失直接使潜在特征与OCSVM决策边界对齐。该模型在两个任务上评估:基于MNIST-C的基准,以及具有挑战性的脑MRI病变检测任务。与大多数关注图像级别大而高信号病变的方法不同,我们的方法成功针对小而非高信号的病变,同时我们评估体素级别的指标,处理了更具临床相关性的场景。两个实验评估了对领域偏移的鲁棒性形式,包括MNIST-C中的损坏类型以及MRI中的纹理或人群年龄变化。结果展示了我们提出模型的性能和鲁棒性,突显了其在通用UAD和现实医学成像应用中的潜力。源代码可在 https://github.com/Nicolas-Pinon/uad_ocsvm_guided_repr_learning 获取。

英文摘要

Unsupervised anomaly detection (UAD) aims to detect anomalies without labeled data, a necessity in many machine learning applications where anomalous samples are rare or not available. Most state-of-the-art methods fall into two categories: reconstruction-based approaches, which often reconstruct anomalies too well, and decoupled representation learning with density estimators, which can suffer from suboptimal feature spaces. While some recent methods attempt to couple feature learning and anomaly detection, they often rely on surrogate objectives, restrict kernel choices, or introduce approximations that limit their expressiveness and robustness. To address this challenge, we propose a novel method that couples representation learning with an analytically solvable One-Class SVM (OCSVM), through a custom loss formulation that directly aligns latent features with the OCSVM decision boundary. The model is evaluated on two tasks: a benchmark based on MNIST-C, and a challenging brain MRI lesion detection task. Unlike most methods that focus on large, hyperintense lesions at the image level, our approach succeeds in targeting small, non-hyperintense lesions, and we evaluate voxel-wise metrics, addressing a more clinically relevant scenario. Both experiments evaluate a form of robustness to domain shifts, including corruption types in MNIST-C and texture or population age variations in MRI. Results demonstrate the performance and robustness of our proposed model, highlighting its potential for general UAD and real-world medical imaging applications. The source code is available at https://github.com/Nicolas-Pinon/uad_ocsvm_guided_repr_learning.

2405.06995 2026-06-11 cs.SD cs.CV cs.MM eess.AS 版本更新

Benchmarking Cross-Domain Audio-Visual Deception Detection

跨域音视频欺骗检测基准测试

Xiaobao Guo, Zitong Yu, Nithish Muthuchamy Selvaraj, Bingquan Shen, Adams Wai-Kin Kong, Alex C. Kot

发表机构 * Rapid-Rich Object Search (ROSE) Lab and the College of Computing and Data Science, Nanyang Technological University (NTU)(快速丰富对象搜索(ROSE)实验室和南洋理工大学计算与数据科学学院) School of Computing and Information Technology and Dongguan Key Laboratory for Intelligence and Information Technology, Great Bay University(计算与信息科技学院和东莞智能与信息技术重点实验室,大湾大学) DSO National Laboratories(国防科学实验室) College of Computing and Data Science, Nanyang Technological University (NTU)(计算与数据科学学院,南洋理工大学) SMBU, Shenzhen 518172, China(中国深圳SMBU) VinUniversity, Hanoi 100000, Vietnam and NTU, Singapore(越南河内VinUniversity和新加坡NTU)

AI总结 提出首个跨域音视频欺骗检测基准,评估不同场景下的泛化能力,并设计MM-IDGM算法和Attention-Mixer融合方法提升性能。

Comments 17 pages

详情
AI中文摘要

自动欺骗检测对于帮助人类准确评估真实性和识别欺骗行为至关重要。传统的接触式技术,如测谎仪,依赖生理信号来确定个体陈述的真实性。然而,自动欺骗检测的最新进展表明,从音频和视频模态中提取的多模态特征在公开数据集上可能优于人类观察者。尽管有这些积极发现,现有音视频欺骗检测方法在不同场景下的泛化能力在很大程度上仍未被探索。为弥补这一空白,我们提出了首个跨域音视频欺骗检测基准,使我们能够评估这些方法在现实场景中的泛化能力。我们使用了广泛采用的音频和视觉特征以及不同的架构进行基准测试,比较了单到单和多到单域泛化性能。为了进一步利用来自多个源域的数据进行训练的影响,我们研究了三种域采样策略,包括域同步、域交替和逐域采样,用于多到单域泛化评估。我们还提出了一种通过最大化模态编码器之间的梯度内积来增强泛化性能的算法,称为“MM-IDGM”。此外,我们提出了Attention-Mixer融合方法来提高性能,并相信这一新的跨域基准将促进未来音视频欺骗检测的研究。

英文摘要

Automated deception detection is crucial for assisting humans in accurately assessing truthfulness and identifying deceptive behavior. Conventional contact-based techniques, like polygraph devices, rely on physiological signals to determine the authenticity of an individual's statements. Nevertheless, recent developments in automated deception detection have demonstrated that multimodal features derived from both audio and video modalities may outperform human observers on publicly available datasets. Despite these positive findings, the generalizability of existing audio-visual deception detection approaches across different scenarios remains largely unexplored. To close this gap, we present the first cross-domain audio-visual deception detection benchmark, which enables us to assess how well these methods generalize for use in real-world scenarios. We use widely adopted audio and visual features and different architectures for benchmarking, comparing single-to-single and multi-to-single domain generalization performance. To further explore the impact of using data from multiple source domains for training, we investigate three types of domain sampling strategies, including domain-simultaneous, domain-alternating, and domain-by-domain, for multi-to-single domain generalization evaluation. We also propose an algorithm, named "MM-IDGM", that enhances generalization performance by maximizing the gradient inner products between modality encoders. Furthermore, we propose the Attention-Mixer fusion method to improve performance, and we believe that this new cross-domain benchmark will facilitate future research in audio-visual deception detection.

2507.17012 2026-06-11 cs.AI cs.CE 版本更新

Sustainability assessment using multimodal AI agents

使用多模态AI代理进行可持续性评估

Zhihan Zhang, Alexander Metzger, Yuxuan Mei, Felix Hähnlein, Zachary Englhardt, Tingyu Cheng, Gregory D. Abowd, Shwetak Patel, Adriana Schulz, Vikram Iyer

发表机构 * Paul G. Allen School of Computer Science & Engineering, University of Washington(保罗·G·艾伦计算机科学与工程学院,华盛顿大学) Computer Science and Engineering, University of Notre Dame(计算机科学与工程,圣母大学) Electrical and Computer Engineering, Northeastern University(电气与计算机工程,东北大学)

AI总结 提出多模态多代理AI系统,模拟生命周期评估专家与利益相关者协作,自动估算电子设备碳足迹,将数据收集时间从数周缩短至一分钟,误差在19%以内。

Comments This article is published in Nature Electronics, and is available online at: https://www.nature.com/articles/s41928-026-01653-w

详情
AI中文摘要

减少计算行业快速增长的环境影响需要大规模评估电子产品的排放。然而,传统的电子设备生命周期评估(LCA)需要专有或不可用的数据。在这里,我们通过引入一个多模态多代理AI系统重新构想传统的可持续性评估,该系统模拟LCA专业人员与利益相关者(如产品经理和工程师)之间的协作过程,自动估算电子设备的碳足迹。代理通过利用结构化数据抽象和从公共互联网(包括维修社区和政府监管数据库)挖掘信息的软件工具,迭代构建完整的生命周期清单。这将数据收集时间从数周或数月减少到不到一分钟。该系统可以在零专有数据的情况下,以专家LCA的19%误差范围内计算碳足迹(典型的人类LCA之间的差异)。我们还表明,通过编码领域特定知识,环境影响估算可以重新定义为数据驱动的预测任务,其中未知产品和排放因子都被表示为具有已知排放的相似产品的加权组合。

英文摘要

Reducing the rapidly growing environmental impact of the computing industry requires assessing the emissions of electronics at scale. However, a traditional life cycle assessment (LCA) of an electronic device, which maps materials and processes to environmental impacts, often requires proprietary or unavailable data. Here, we reimagine conventional sustainability assessment by introducing a multimodal multi-agent AI system that emulates the collaborative process between LCA professionals and stakeholders (such as product managers and engineers) to automatically estimate the carbon footprint of electronic devices. The agents iteratively construct a complete life-cycle inventory by leveraging a structured data abstraction and software tools that mine information from the public internet, including repair communities and government regulatory databases. This reduces data gaps and data collection from weeks or months of expert time to under one minute. The system can calculate carbon footprint within 19% of expert LCAs with zero proprietary data (typical of the variation between human LCAs). We also show that by encoding domain-specific knowledge, environmental impact estimation can be reframed as a data-driven prediction task, in which both unknown products and emission factors are represented as weighted combinations of similar ones with known emissions.

2506.20040 2026-06-11 cs.LG cs.AI cs.CL Updated version

Cross-Layer Discrete Concept Discovery for Interpreting Language Models

Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou

Affiliations * University of Washington

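AI summary Proposes the Cross-Layer Vector Quantized-Variational Autoencoder (CLVQ-VAE), which uses a discrete vector-quantization bottleneck to collapse duplicated residual-stream features into compact, interpretable concept vectors, and outperforms clustering, single-layer VQ-VAE, and sparse autoencoder baselines on three datasets.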

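Details
AI abstract

Interpreting language models remains challenging because of the residual stream, which linearly mixes and duplicates features across adjacent layers, so single-layer analyses miss this cross-layer structure. Cross-layer sparse autoencoders (SAEs) address layer mixing but operate in continuous space, where concepts are spread across many neurons without clear boundaries. We introduce the Cross-Layer Vector Quantized-Variational Autoencoder (CLVQ-VAE), a novel framework that maps representations from a lower layer to a higher layer through a discrete vector-quantization bottleneck, collapsing duplicated residual-stream features into compact, interpretable concept vectors. Our approach combines top-k temperature-based sampling with exponential moving average (EMA) codebook updates, allowing controlled exploration of the discrete latent space while maintaining codebook diversity. On both encoder- and decoder-based models, across the ERASER-Movie, Jigsaw, and AGNews datasets, CLVQ-VAE outperforms clustering, single-layer vector quantized-variational autoencoder (VQ-VAE), and sparse autoencoder (SAE) baselines on three evaluation axes: removing the identified concepts lowers model accuracy by up to 93%, LLM judges rank our concepts first in 66.7% of comparisons, and human annotators recover model predictions from our visualizations with 78% accuracy, versus 54% for clustering.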

English abstract

Interpreting language models remains challenging because of the residual stream, which linearly mixes and duplicates features across adjacent layers, causing single-layer analyses to miss this cross-layer structure. Cross-layer sparse autoencoders (SAEs) address layer mixing but operate in continuous space, where concepts split across many neurons without clear boundaries. We introduce the Cross-Layer Vector Quantized-Variational Autoencoder (CLVQ-VAE), a novel framework that maps representations from a lower layer to a higher layer through a discrete vector-quantization bottleneck, collapsing duplicated residual-stream features into compact, interpretable concept vectors. Our approach combines top-k temperature-based sampling with exponential moving average (EMA) codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. Across both encoder- and decoder-based models on ERASER-Movie, Jigsaw, and AGNews, CLVQ-VAE outperforms clustering, single-layer vector quantized-variational autoencoder (VQ-VAE), and sparse autoencoder (SAE) baselines across three evaluation axes: removing identified concepts drops model accuracy by up to 93%, LLM judges rank our concepts first in 66.7% of comparisons, and human annotators recover model predictions from our visualizations with 78% accuracy versus 54% for clustering.
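
The abstract names two mechanisms: top-k temperature-based sampling over the codebook and EMA codebook updates. The sketch below shows one common way these could look, assuming Euclidean distance, a softmax over negative distances, and the standard VQ-VAE EMA rule (running counts and running sums per code). The function names and toy values are hypothetical, and the paper's exact formulation may differ.

```python
import math
import random

def sample_code(z, codebook, k=2, temperature=1.0, rng=random):
    """Sample a code index among the k codebook vectors nearest to z."""
    dists = [math.dist(z, c) for c in codebook]
    # Indices of the k closest codes.
    top = sorted(range(len(codebook)), key=dists.__getitem__)[:k]
    # Softmax over negative distances; a lower temperature approaches argmin.
    logits = [-dists[i] / temperature for i in top]
    peak = max(logits)
    weights = [math.exp(l - peak) for l in logits]
    return rng.choices(top, weights=weights, k=1)[0]

def ema_update(codebook, counts, sums, assignments, decay=0.99, eps=1e-5):
    """Move each code toward the mean of its assigned vectors (in place)."""
    for j in range(len(codebook)):
        members = [z for z, idx in assignments if idx == j]
        # Running count of vectors assigned to code j.
        counts[j] = decay * counts[j] + (1 - decay) * len(members)
        for d in range(len(codebook[j])):
            batch_sum = sum(z[d] for z in members)
            # Running sum of assigned vectors, then renormalize by the count.
            sums[j][d] = decay * sums[j][d] + (1 - decay) * batch_sum
            codebook[j][d] = sums[j][d] / (counts[j] + eps)

# Toy data (hypothetical): three 2-D codes and a batch of encoder outputs.
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
counts = [1.0, 1.0, 1.0]
sums = [list(c) for c in codebook]
rng = random.Random(0)
batch = [[0.1, 0.0], [0.9, 1.1], [1.2, 0.8]]
assignments = [(z, sample_code(z, codebook, k=2, temperature=0.5, rng=rng))
               for z in batch]
ema_update(codebook, counts, sums, assignments)
```

Sampling among the top-k codes instead of always taking the nearest one lets rarely used codes receive assignments, which is one plausible reading of how exploration and codebook diversity interact here.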

2507.03065 2026-06-11 cs.LG Updated version

Persistent Homology as a Theory of Emergent Structure

Xin Li

Affiliations * Department of Computer Science, University at Albany

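AI summary Defines emergent properties as persistent nontrivial homology classes and, using persistence bars, a contractive-similarity graph operator, and Hodge decomposition, gives a unified account of six signatures of emergence along with testable predictions.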

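Details
AI abstract

Why do some macroscopic structures remain identifiable while their microscopic constituents keep changing? Vortices persist while fluid parcels turn over, neural memories persist while spikes and synapses fluctuate, and institutions persist while individuals come and go. We propose a scale-relative answer: an emergent property is a persistent nontrivial homology class $[z]\in H_p=\ker\partial_p/\operatorname{im}\partial_{p+1}$, that is, a macro-feature that is closed but not exact across a filtration of descriptions. This identification turns emergence into a \emph{measurement} problem. Persistent bars detect stable macro-features, and we introduce a contractive-similarity (CS) graph operator to supply scaffold spectral gaps that predict robustness. Hodge decomposition separates the harmonic macro-scaffold from exact and co-exact micro-flow, and functorial condensation explains when one level's emergent class becomes a unit for the next. The resulting scaffold-flow framework expresses six familiar signatures of emergence (inevitability, coherence, irreducibility, complementarity, robustness, and hierarchy) in a single mathematical language. It also yields falsifiable predictions for atmospheric, neural, and social systems: genuine emergent structures should persist across filtrations, remain spectrally stable, respond disproportionately to harmonic interventions, and require timescale separation for hierarchical autonomy.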

English abstract

Why do some macroscopic structures remain identifiable even though their microscopic constituents continually change? Vortices persist while fluid parcels turn over, neural memories persist while spikes and synapses fluctuate, and institutions persist while individuals enter and leave. We propose a scale-relative answer: an emergent property is a persistent nontrivial homology class $[z]\in H_p=\ker\partial_p/\operatorname{im}\partial_{p+1}$, a macro-feature that is closed but not exact across a filtration of descriptions. This identification turns emergence into a \emph{measurement} problem. Persistent bars detect stable macro-features, and we introduce a contractive-similarity (CS) graph operator to supply scaffold spectral gaps that predict robustness. Hodge decomposition separates harmonic macro-scaffold from exact and co-exact micro-flow, and functorial condensation explains when one level's emergent class becomes a unit for the next. The resulting scaffold-flow framework expresses six familiar signatures of emergence (i.e., inevitability, coherence, irreducibility, complementarity, robustness, and hierarchy) within one mathematical language. It also yields falsifiable predictions across atmospheric, neural, and social systems: genuine emergent structures should persist across filtrations, remain spectrally stable, respond disproportionately to harmonic interventions, and require timescale separation for hierarchical autonomy.
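
The quotient $\ker\partial_p/\operatorname{im}\partial_{p+1}$ in the abstract can be made concrete on a tiny example. The sketch below computes the first Betti number (the dimension of $H_1$ over the rationals) for a hollow and a filled triangle: the loop around the hollow triangle is closed but not exact, and filling the triangle makes it exact. This illustrates ordinary simplicial homology only, not the persistence computation or the CS operator from the paper, and the helper names are hypothetical.

```python
from fractions import Fraction

def rank(matrix):
    """Rank of a matrix over the rationals via Gaussian elimination."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    ncols = len(rows[0]) if rows else 0
    r = 0
    for c in range(ncols):
        # Find a row at or below r with a nonzero entry in column c.
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        # Eliminate column c from the rows below the pivot.
        for i in range(r + 1, len(rows)):
            factor = rows[i][c] / rows[r][c]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r

def betti_1(d1, d2, n_edges):
    """dim H_1 = dim ker(d1) - rank(d2) = (n_edges - rank(d1)) - rank(d2)."""
    return (n_edges - rank(d1)) - rank(d2)

# d1: rows are vertices (a, b, c), columns are oriented edges (ab, bc, ac).
d1 = [
    [-1, 0, -1],
    [1, -1, 0],
    [0, 1, 1],
]
# d2 for the filled triangle abc: its boundary is ab + bc - ac.
d2_filled = [[1], [1], [-1]]

print(betti_1(d1, [], 3))         # hollow triangle: 1 (one independent loop)
print(betti_1(d1, d2_filled, 3))  # filled triangle: 0 (the loop is a boundary)
```

In a filtration, the hollow triangle would appear first and the filled one later, so the loop class is born at one scale and dies at another. The length of that interval is what a persistence bar records.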

2506.03933 2026-06-11 cs.CV cs.AI Updated version

Diffusion-based Cumulative Adversarial Purification for Vision Language Models

Jia Fu, Yongtao Wu, Yihang Chen, Kunyu Peng, Xiao Zhang, Volkan Cevher, Sepideh Pashami, Anders Holst

Affiliations * KTH Royal Institute of Technology; Swiss Federal Institute of Technology Lausanne; University of California, Los Angeles; Karlsruhe Institute of Technology; CISPA Helmholtz Center for Information Security; RISE Research Institutes of Sweden; Halmstad University

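AI summary Proposes DiffCAP, a diffusion-based adversarial purification strategy that proves adversarial effects fade monotonically as diffusion proceeds and purifies adaptively by injecting noise until a similarity threshold on VLM embeddings is met, substantially improving defense performance and speeding up denoising.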

Comments Accepted to Transactions on Machine Learning Research (TMLR 2026)

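Details
AI abstract

Vision language models (VLMs) show remarkable capabilities in multimodal understanding, but their susceptibility to adversarial perturbations poses a significant threat to their reliability in real-world applications. Although these perturbations are often imperceptible to humans, they can drastically alter model outputs, leading to erroneous interpretations and decisions. This paper introduces DiffCAP, a novel diffusion-based purification strategy that effectively neutralizes adversarial corruptions in VLMs. We theoretically establish a provable recovery region in the forward diffusion process and quantify the convergence rate of semantic variation with respect to VLMs. These findings indicate that adversarial effects fade monotonically as diffusion proceeds. Building on this principle, DiffCAP injects noise, using a similarity threshold on VLM embeddings as an adaptive criterion, and then applies reverse diffusion to restore a clean, reliable representation for VLM inference. In extensive experiments on six datasets with three VLMs, under varying attack strengths and in three task scenarios, DiffCAP outperforms existing defense techniques by a substantial margin. Notably, DiffCAP significantly reduces both hyperparameter tuning complexity and the required diffusion time, thereby accelerating the denoising process. Backed by theorems and empirical evidence, DiffCAP offers a robust and practical solution for securely deploying VLMs in adversarial environments. The source code is available at https://github.com/JasonFu1998/DiffCAP.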

English abstract

Vision Language Models (VLMs) have shown remarkable capabilities in multimodal understanding, yet their susceptibility to adversarial perturbations poses a significant threat to their reliability in real-world applications. Despite often being imperceptible to humans, these perturbations can drastically alter model outputs, leading to erroneous interpretations and decisions. This paper introduces DiffCAP, a novel diffusion-based purification strategy that can effectively neutralize adversarial corruptions in VLMs. We theoretically establish a provable recovery region in the forward diffusion process and quantify the convergence rate of semantic variation with respect to VLMs. These findings indicate that adversarial effects fade monotonically as diffusion unfolds. Guided by this principle, DiffCAP leverages noise injection with a similarity threshold of VLM embeddings as an adaptive criterion, before reverse diffusion restores a clean and reliable representation for VLM inference. Through extensive experiments across six datasets with three VLMs under varying attack strengths in three task scenarios, we show that DiffCAP outperforms existing defense techniques by a substantial margin. Notably, DiffCAP significantly reduces both hyperparameter tuning complexity and the required diffusion time, thereby accelerating the denoising process. Equipped with theorems and empirical support, DiffCAP provides a robust and practical solution for securely deploying VLMs in adversarial environments. The source code is available at https://github.com/JasonFu1998/DiffCAP.
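
The adaptive criterion in the abstract (inject noise until a similarity threshold on VLM embeddings is met) can be sketched as a simple loop. Everything below is an assumption made for illustration: the abstract does not say which embeddings are compared, so this version compares consecutive noising steps, `toy_embed` is a hypothetical stand-in for a real VLM image encoder, and the reverse-diffusion step that would follow is omitted. For the authors' implementation, see the repository linked in the abstract.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def toy_embed(pixels):
    """Hypothetical stand-in for a VLM image encoder: mean-centered pixels."""
    mean = sum(pixels) / len(pixels)
    return [p - mean for p in pixels]

def adaptive_noise_injection(image, embed, threshold=0.99, sigma=0.1,
                             max_steps=1000, seed=0):
    """Add Gaussian noise step by step until consecutive embeddings agree.

    Returns (noisy_image, steps_taken). Stops at max_steps if the
    threshold is never reached.
    """
    rng = random.Random(seed)
    current = list(image)
    prev_emb = embed(current)
    for step in range(1, max_steps + 1):
        # One cumulative noising step.
        current = [x + rng.gauss(0.0, sigma) for x in current]
        emb = embed(current)
        # Stop once one more step barely changes the embedding.
        if cosine(prev_emb, emb) >= threshold:
            return current, step
        prev_emb = emb
    return current, max_steps

# A flattened 64-pixel gradient standing in for an input image.
image = [i / 63 for i in range(64)]
noisy, steps = adaptive_noise_injection(image, toy_embed)
print(steps)
```

Because the stopping point depends on the input, the number of noising steps is chosen per image, which is consistent with the abstract's claim of reduced hyperparameter tuning and shorter diffusion time.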