arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.19138 2026-05-21 cs.RO cs.AI cs.LG

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

Ayush Agarwal, Ansh Gandhi, Jeremy A. Collins, Omar Rayyan, Aryan Sarswat, Ranjani Koushik, Masoud Moghani, Ajay Mandlekar, Animesh Garg

Affiliations: Georgia Institute of Technology; University of California, Berkeley; New York University Abu Dhabi (NYUAD); University of Toronto; NVIDIA

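AI summary: This paper presents COBALT, a platform that uses cloud-based teleoperation from smartphones and other common devices to collect high-quality robot learning data at scale, improving the efficiency of robot learning in simulation and the real world.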

Abstract

The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An in-memory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency for up to 8 concurrent users per GPU. We also demonstrate stable operation supporting 256 simulated clients across 8 GPUs, underscoring the system's ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit https://cobalt-teleop.github.io/ for more details.
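
As a rough illustration of the load-balanced, capped-per-GPU placement the abstract describes, the following Python sketch assigns each incoming operator to the least-loaded GPU that still has a free slot. The names `GpuWorker` and `assign_user` are hypothetical, and the least-loaded policy is an assumption; the abstract does not say how COBALT's scheduler actually works. Only the cap of 8 users per GPU comes from the abstract.

```python
from dataclasses import dataclass, field

MAX_USERS_PER_GPU = 8  # cap quoted in the abstract for sub-100 ms latency


@dataclass
class GpuWorker:
    """One GPU hosting a vectorized environment; each user occupies one slot."""
    gpu_id: int
    users: list = field(default_factory=list)

    def has_room(self) -> bool:
        return len(self.users) < MAX_USERS_PER_GPU


def assign_user(workers, user_id):
    """Place a user on the least-loaded GPU with a free slot; return its id or None."""
    candidates = [w for w in workers if w.has_room()]
    if not candidates:
        return None  # every GPU is full; a real system might queue the user
    # Ties are broken by GPU id so the placement is deterministic.
    target = min(candidates, key=lambda w: (len(w.users), w.gpu_id))
    target.users.append(user_id)
    return target.gpu_id


# Two GPUs can host 16 users; the 17th request is rejected.
workers = [GpuWorker(gpu_id=i) for i in range(2)]
placements = [assign_user(workers, f"user-{n}") for n in range(17)]
```

With this policy the users alternate between the two GPUs until both hold 8, which keeps per-GPU load even without any global coordination beyond a count per worker.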

2605.18833 2026-05-21 cs.LG cs.AI

Automated Big Data Quality Assessment using Knowledge Graph Embeddings

Hadi Fadlallah, Rima Kilany, Mitri Haber, Ali Jaber

Affiliations: Saint-Joseph University; Lebanese University

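AI summary: This paper proposes an automated big data quality assessment method based on knowledge graph embeddings; by integrating diverse knowledge graph representations, it uses contextual information to generate a comprehensive data quality assessment plan for each context.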

Comments 17 pages, 10 figures

Journal ref International Journal of Data Mining, Modelling and Management 17.4 (2025) 383-405

Abstract

Automated data quality assessment is crucial for managing big data, but existing solutions face challenges in achieving accurate context-aware assessment. This paper presents a novel knowledge-based approach to enhance automated data quality assessment. Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset's context representation and the relevant quality rules and dimensions within a knowledge graph representing contextual data characteristics and the required quality assessment operations. We go beyond conventional practice by integrating diverse representations within the knowledge graph, drawing on contextual information gathered through a thorough literature investigation. This integration allows us to develop a comprehensive and context-specific data quality assessment plan tailored to each context. Leveraging the knowledge graph improves our understanding of the input dataset's context, overcoming the limitations of traditional methods that rely solely on strict matching and overlook contextual characteristics. By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement, providing a comprehensive data quality assessment plan for the input dataset. To evaluate our approach, we leverage AmpliGraph, a framework developed and benchmarked by Accenture Labs. The evaluation involves employing a real-world radiation sensor dataset provided by the Lebanese Atomic Energy Commission (LAEC-CNRS). The results obtained from this evaluation demonstrate the capability of our solution to generate a comprehensive data quality assessment plan for the given input dataset.
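
To make "predicting missing edges with knowledge graph embeddings" concrete, here is a minimal sketch that ranks candidate quality dimensions for a dataset node using a TransE-style score. TransE is used only as a familiar example; the abstract does not state which embedding model the authors chose. The entity names, the relation `requires_dimension`, and the hand-set 2-D vectors are all hypothetical, and the sketch does not call AmpliGraph.

```python
import numpy as np

# Hypothetical, hand-set 2-D embeddings; a real system would learn these
# from the knowledge graph's triples.
entity = {
    "radiation_sensor_data": np.array([1.0, 0.0]),
    "completeness": np.array([1.0, 1.0]),
    "timeliness": np.array([1.1, 0.8]),
    "uniqueness": np.array([-1.0, 0.5]),
}
relation = {"requires_dimension": np.array([0.0, 1.0])}


def transe_score(head, rel, tail):
    """TransE plausibility: the closer h + r is to t, the higher the score."""
    return -float(np.linalg.norm(entity[head] + relation[rel] - entity[tail]))


def rank_tails(head, rel, candidates):
    """Sort candidate tail entities from most to least plausible."""
    return sorted(candidates, key=lambda t: transe_score(head, rel, t), reverse=True)


ranking = rank_tails(
    "radiation_sensor_data",
    "requires_dimension",
    ["completeness", "timeliness", "uniqueness"],
)
```

The ranked list plays the role of the predicted edges; in the paper's setting, the numerical edge attributes would additionally weight each predicted quality measurement.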

2605.18743 2026-05-21 cs.AI

WorldString: Actionable World Representation

Kunqi Xu, Jitao Li, Jianglong Ye, Tianshu Tang, Isabella Liu, Sifei Liu, Xueyan Zou

Affiliations: CalTech; UC San Diego; Tsinghua University - IEI Lab; NVIDIA

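AI summary: This paper proposes WorldString, a neural architecture that learns the state manifold of real-world objects directly from point clouds or RGB-D video streams, providing a foundational building block for actionable world models.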

Abstract

Inspired by the emergent behaviors in large language models that generalized human intelligence, the research community is pursuing similar emergent capabilities within world models, with an emphasis on modeling the physical world. Within the scope of physical world modeling, objects are the fundamental primitives that constitute physical reality. From humans to computers, nearly everything we interact with is an object. These objects are rarely static; they are actionable entities with varying states determined by their intrinsic properties. While current methods approach object action states either via video generation or dynamic scene reconstruction, none explicitly model this basic element in a unified, principled way to build an actionable object representation. We propose WorldString, a neural architecture capable of modeling the state manifold of real-world objects by learning directly from point clouds or RGB-D video streams. Serving as a versatile digital twin, it acts as a foundational building block for physical world models; thus, we name it WorldString. Notably, its fully differentiable structure seamlessly enables future integration with policy learning and neural dynamics.

2605.18736 2026-05-21 cs.CV

Spectral Progressive Diffusion for Efficient Image and Video Generation

Howard Xiao, Brian Chao, Lior Yariv, Gordon Wetzstein

Affiliations: Stanford University

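AI summary: This paper proposes a spectral progressive diffusion framework that progressively increases resolution along the denoising trajectory of pretrained diffusion models, enabling efficient image and video generation while improving both efficiency and quality.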

Comments Project website at https://howardxiao.ca/speed

Abstract

Diffusion models have been shown to implicitly generate visual content autoregressively in the frequency domain, where low-frequency components are generated earlier in the denoising process while high-frequency details emerge only in later timesteps. This structure offers a natural opportunity for efficient generation, as high-resolution computation on noise-dominated frequencies is largely redundant. We propose Spectral Progressive Diffusion, a general framework that progressively grows resolution along the denoising trajectory of pretrained diffusion models. To this end, we develop a spectral noise expansion mechanism and derive an optimal resolution schedule from the model's power spectrum. Our framework supports training-free acceleration and a novel fine-tuning recipe that further improves efficiency and quality. We demonstrate significant speedups on state-of-the-art pretrained image and video generation models while preserving visual quality.
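
The observation that early denoising steps only need low frequencies can be illustrated with a small NumPy sketch that lowers resolution by cropping the centered Fourier spectrum. This is not the paper's spectral noise expansion mechanism or its resolution schedule; it only shows the generic operation of discarding high-frequency content. The function name `spectral_downsample` is hypothetical, and the sketch assumes the input and target sizes have the same parity (for example, both even).

```python
import numpy as np


def spectral_downsample(image, size):
    """Keep only the central size x size low-frequency block of the spectrum.

    Assumes a 2-D array whose side lengths have the same parity as `size`,
    so that the zero-frequency term stays centered after cropping.
    """
    h, w = image.shape
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    top, left = (h - size) // 2, (w - size) // 2
    cropped = spectrum[top:top + size, left:left + size]
    # Rescale so the mean intensity is preserved at the smaller resolution.
    scale = (size * size) / (h * w)
    return np.real(np.fft.ifft2(np.fft.ifftshift(cropped))) * scale


rng = np.random.default_rng(0)
image = rng.random((8, 8))
small = spectral_downsample(image, 4)  # 4 x 4 view holding only low frequencies
```

Running the early, noise-dominated steps on an array like `small` touches a quarter of the pixels, which is the source of the efficiency argument made in the abstract.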

2605.18678 2026-05-21 cs.CV cs.AI

Lance: Unified Multimodal Modeling by Multi-Task Synergy

Fengyi Fu, Mengqi Huang, Shaojin Wu, Yunsheng Jiang, Yufei Huo, Hao Li, Yinghang Song, Fei Ding, Jianzhu Guo, Qian He, Zheren Fu, Zhendong Mao, Yongdong Zhang

Affiliations: Intelligent Creation Lab, ByteDance

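AI summary: This paper presents Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for images and videos. It pursues unified multimodal modeling through a practical paradigm of collaborative multi-task training, built on two core principles, unified context modeling and decoupled capability pathways, and uses a dual-stream mixture-of-experts architecture to learn context jointly while decoupling the understanding and generation pathways.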

Comments 34 pages, 14 figures, 10 tables; homepage: https://lance-project.github.io; code: https://github.com/bytedance/Lance

Abstract

We present Lance, a lightweight native unified model supporting multimodal understanding, generation, and editing for both images and videos. Rather than relying on model capacity scaling or text-image-dominant designs, Lance explores a practical paradigm for unified multimodal modeling via collaborative multi-task training. It is grounded in two core principles: unified context modeling and decoupled capability pathways. Specifically, Lance is trained from scratch and employs a dual-stream mixture-of-experts architecture on shared interleaved multimodal sequences, enabling joint context learning while decoupling the pathways for understanding and generation. We further introduce modality-aware rotary positional encoding to mitigate interference among heterogeneous visual tokens and boost cross-task alignment. During training, Lance adopts a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling to strengthen both semantic comprehension and visual generation performance. Experimental results demonstrate that Lance substantially outperforms existing open-source unified models in image and video generation, while retaining strong multimodal understanding capabilities. The homepage is available at https://lance-project.github.io.

2605.18579 2026-05-21 cs.LG

S2Aligner: Pair-Efficient and Transferable Pre-Training for Sparse Text-Attributed Graphs

Yuhan Wang, Haopeng Zhang, Yibo Ding, Jiaqi Yu, Xinyu Zhao, Yuhang Liu, Ziwei Zhang, Xiao Wang, Ruijie Wang

Affiliations: Beihang University; Tianjin University

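AI summary: This paper proposes S2Aligner, an efficient and transferable pre-training method for sparse text-attributed graphs that decouples semantic alignment from structural modeling, strengthening alignment without contaminating the shared semantic space and thereby reducing the cross-domain generalization gap.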

Comments 19 pages

Abstract

Pre-training on text-attributed graphs (TAGs) is central to building transferable graph foundation models, where LLM-as-Aligner methods align graph and text representations through the semantic knowledge of large language models. However, these methods usually assume that node texts provide sufficient and reliable supervision, an assumption often violated in real-world sparse TAGs. When textual anchors are missing, noisy, or uneven across domains, graph structures must be aligned with weak semantic evidence, leading to unreliable structure-semantics correspondence and sparsity-induced transfer bias. This paper presents S2Aligner, a sparsity-aware and structure-enhanced LLM-as-Aligner framework for graph-text pre-training on sparse TAGs. The key idea is to decouple semantic alignment from structural modeling, allowing topology-aware signals to enhance alignment without contaminating the shared semantic space. Specifically, S2Aligner decomposes graph-text representations into semantic and structural components, uses structure-oriented reconstruction with consistency control to inject reliable topology cues into text representations, and suppresses inconsistent structural signals under textual sparsity. Moreover, S2Aligner introduces sparsity-aware cross-domain risk balancing, which calibrates domain risks through a global-domain density ratio and downweights unreliable sparse samples via graph reliability estimation. Theoretical analysis shows that this objective reduces cross-domain generalization gaps by controlling domain risk discrepancy. Extensive experiments across diverse graph domains, sparsity levels, and downstream tasks demonstrate that S2Aligner consistently outperforms existing baselines.

2605.18447 2026-05-21 cs.CV

NeRF-based Spacecraft Reconstruction from Monocular Imagery Under Illumination Variability and Pose Uncertainty

Antoine Legrand, Renaud Detry, Christophe De Vleeschouwer

Affiliations: ICTEAM, UCLouvain; ESAT, KU Leuven; MECH, KU Leuven; Aerospacelab; UCLouvain

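AI summary: This paper proposes a NeRF-based method that adds per-image appearance embeddings and pose correction terms to make spacecraft reconstruction more robust to illumination variability and pose uncertainty, validates its effectiveness for offline reconstruction, and shows its potential for online reconstruction.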

Comments (under review)

Abstract

Autonomous rendezvous and proximity operations around uncooperative, unknown spacecraft are critical for active debris removal and on-orbit servicing missions. A key component of such operations is the offline reconstruction of a 3D model of the target from a set of 2D images. This task is challenging due to two main factors. First, in-orbit illumination conditions exhibit considerable variability and change rapidly over time. Second, the inaccuracy of pose information in the images results in 3D reconstruction uncertainty. To overcome these challenges, we propose to extend Neural Radiance Fields with per-image degrees of freedom: a learnable appearance embedding that captures the illumination conditions specific to each image, and an image-specific pose correction term that refines its noisy pose label to increase 3D consistency across images. These parameters add minimal complexity, as they are learned jointly with the NeRF, yet they substantially improve robustness to illumination variability and pose inaccuracies. We validate our approach on three image sets representative of in-orbit operations, demonstrating its effectiveness for offline reconstruction and highlighting its suitability for online reconstruction, an open problem in the field.

2605.17946 2026-05-21 cs.AI cs.CV cs.LG

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

Lingtao Mao, Huangyu Dai, Xinyu Sun, Zihan Liang, Ben Chen, Chenyi Lei, Wenwu Ou

Affiliations: Kuaishou Technology

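AI summary: This paper presents SVFSearch, the first multimodal, knowledge-intensive benchmark for short-video frame search in the Chinese gaming domain; using 5,000 four-choice test examples and 4,198 auxiliary training examples, it evaluates approaches ranging from direct QA to Plan-Act-Replan agents.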

Abstract

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflow to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.

2605.17776 2026-05-21 cs.RO

CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

Xiangyue Wang, Hanxuan Chen, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Kangli Wang, Ji Pei

Affiliations: Autel Robotics; Nanjing University; Peking University; Southern University of Science and Technology; University of Hong Kong

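AI summary: This paper presents CosFly-Track, a dataset for UAV visual tracking whose large-scale multi-modal data is generated through multi-constraint trajectory optimization, improving dynamic target tracking performance.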

Abstract

Recent aerial vision-language navigation (VLN) datasets have grown rapidly, but they primarily address goal-oriented navigation to static destinations, leaving UAV visual tracking -- continuously following a moving target while maintaining visibility -- largely without dedicated training data. We introduce CosFly-Track, a large-scale multi-modal dataset and scalable generation pipeline for UAV visual tracking in urban environments. The dataset provides approximately 12,000 expert and perturbed UAV trajectories generated from 6,000 pedestrian paths, comprising 2.4 million timesteps (approximately 334 hours) with seven aligned data channels: RGB, metric depth, semantic segmentation, six-degree-of-freedom drone pose, target state with visibility flag, bilingual (Chinese-English) instructions, and trajectory-pair metadata. To generate high-quality expert trajectories, we develop MuCO, a multi-constraint optimizer that plans directly in continuous three-dimensional space with BVH-accelerated collision and visibility queries, jointly enforcing target visibility, viewpoint quality, collision avoidance, smoothness, and kinematic feasibility, avoiding the discretization artifacts and post-hoc smoothing of grid-based planners. Fine-tuning experiments on seven vision-language models show that CosFly-Track improves tracking performance to 78.3 to 95.6 percent SR@1 meter, a 53 to 69 percentage point gain over zero-shot baselines, supporting the dataset as a training resource for dynamic target-following agents. The dataset is publicly available at https://huggingface.co/datasets/AutelRobotics/CosFly; evaluation scripts and pre-trained checkpoints are hosted at https://huggingface.co/AutelRobotics/CosFly-Track.
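
The dataset figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below derives the sampling rate and average trajectory length that the rounded numbers imply; these derived values are inferences from the approximate figures, not numbers stated by the authors.

```python
# Approximate figures quoted in the abstract.
timesteps = 2_400_000
hours = 334
trajectories = 12_000

seconds = hours * 3600
rate_hz = timesteps / seconds                 # implied sampling rate, about 2 Hz
steps_per_traj = timesteps / trajectories     # average timesteps per trajectory
seconds_per_traj = seconds / trajectories     # average trajectory duration

print(f"{rate_hz:.2f} Hz, {steps_per_traj:.0f} steps, {seconds_per_traj:.1f} s")
```

Under these assumptions the data works out to roughly 2 Hz, with an average trajectory of about 200 timesteps lasting around 100 seconds.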

2605.17694 2026-05-21 cs.CL

Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations?

Anvesh Rao Vijjini, Sagar Manjunath, Snigdha Chaturvedi

Affiliations: UNC Chapel Hill

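AI summary: This study examines whether large language models assigned high- or low-status personas show socio-cognitive effects similar to those seen in humans. It simulates multi-turn, power-asymmetric dialogues and analyzes language coordination, pronoun usage, persuasion success, and compliance with unsafe requests, finding that the models show key effects of power, with nuances and variability.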

Comments ACL 2026 (main)

Abstract

Power differences shape human communication through well-documented socio-cognitive effects, including language coordination, pronoun usage, authority bias, and harmful compliance. We examine whether large language models (LLMs) exhibit similar behaviors when assigned high- or low-status personas. Using personas from diverse professions, we simulate multi-turn, power-asymmetric dialogues (e.g., principal-teacher, justice-lawyer) and measure (i) language coordination, (ii) pronoun usage, (iii) persuasion success, and (iv) compliance with unsafe requests. Our results show that LLMs exhibit key socio-cognitive effects of power, albeit with nuances and variability, linking simulated interactions to both desirable and unsafe behaviors.

2605.17472 2026-05-21 cs.CV

Weighted Reverse Convolution for Feature Upsampling

Wentong Li, Zhiyuan Qi, Zichen Zhao, Kai Zhang, Lei Zhang

Affiliations: Nanjing University of Aeronautics and Astronautics; The Hong Kong Polytechnic University; Nanjing University

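AI summary: This paper proposes Weighted Reverse Convolution (WRC), which revisits feature upsampling in vision foundation models as an inverse problem and densifies high-level visual descriptors with a spatially adaptive inverse operator, improving performance on tasks that require fine-grained localization, dense prediction, and point-wise correspondence.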

Comments 18 pages, 7 figures; code: https://github.com/PolyU-VCLab/WRC

Abstract

Pre-trained vision foundation models (VFMs) provide strong semantic representations, yet their patch-level features are inherently coarse, limiting their effectiveness on tasks requiring fine-grained localization, dense prediction, and point-wise correspondence. In this work, we revisit feature upsampling for VFMs from the perspective of inverse problems and propose Weighted Reverse Convolution (WRC), a spatially adaptive inverse operator for densifying high-level visual descriptors. Specifically, we formulate feature upsampling as a weighted Tikhonov-regularized least-squares problem, where spatially varying weights modulate both data fidelity and prior strength at each spatial location. This allows WRC to adapt the reconstruction to spatially varying feature characteristics, thereby preserving critical structures while mitigating over-smoothing. Moreover, WRC retains an efficient, fully differentiable closed-form FFT solution, making it a practical drop-in upsampling operator. Integrated into a lightweight self-supervised densification framework, WRC consistently improves dense feature quality across various downstream benchmarks, including segmentation, depth estimation, video object segmentation, object discovery, and keypoint correspondence, while maintaining high computational efficiency.
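
For readers unfamiliar with closed-form FFT solutions to Tikhonov-regularized least squares, the sketch below solves the classic unweighted, same-resolution case: minimise ||k * x - y||^2 + lam * ||x||^2 under circular convolution. It deliberately omits WRC's spatially varying weights and its upsampling operator, which the abstract does not specify in enough detail to reproduce; the function name and the test kernel are hypothetical.

```python
import numpy as np


def tikhonov_deconv_fft(y, kernel, lam):
    """Minimise ||k * x - y||^2 + lam * ||x||^2 under circular convolution.

    In the Fourier domain the problem decouples per frequency, giving
    X = conj(K) * Y / (|K|^2 + lam).
    """
    K = np.fft.fft2(kernel, s=y.shape)  # zero-pad the kernel to the signal size
    Y = np.fft.fft2(y)
    X = np.conj(K) * Y / (np.abs(K) ** 2 + lam)
    return np.real(np.fft.ifft2(X))


rng = np.random.default_rng(0)
x_true = rng.random((8, 8))
kernel = np.array([[0.6, 0.2], [0.2, 0.0]])  # small blur with no spectral zeros
y = np.real(np.fft.ifft2(np.fft.fft2(x_true) * np.fft.fft2(kernel, s=x_true.shape)))

x_sharp = tikhonov_deconv_fft(y, kernel, lam=1e-12)  # near-exact inversion
x_smooth = tikhonov_deconv_fft(y, kernel, lam=10.0)  # heavy regularisation shrinks x
```

The single scalar `lam` trades data fidelity against the prior everywhere at once; the abstract's weighted formulation replaces that global trade-off with one that varies per spatial location.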

2605.16962 2026-05-21 cs.CV cs.AI

OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

Jinjie Shen, Zheng Huang, Yuchen Zhang, Yujiao Wu, Yaxiong Wang, Lechao Cheng, Shengeng Tang, Tianrui Hui, Nan Pu, Zhun Zhong

Affiliations: School of Computer Science and Information Engineering, Hefei University of Technology, Hefei, China; Wuhan University, Wuhan, China; Lab for Intelligence and visiON (LION); Xi'an Jiaotong University

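AI summary: This study proposes OmniVL-Guard Pro, a tool-augmented agent for omnibus vision-language forensics; by integrating a diverse tool environment and introducing a new reinforcement learning method, it enables clue-driven reasoning in the open world and achieves state-of-the-art performance across multiple tasks.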

Comments 29 pages

Abstract

Existing vision-language forgery detection and grounding methods operate under a closed-world paradigm, assuming verification can be completed by the model alone. However, self-contained MLLMs are constrained by finite parametric knowledge, static training corpora, and limited perceptual resolution, creating a practical ceiling in dynamic open-world forensics -- particularly for real-time event verification requiring external clues and forgery segmentation demanding fine-grained scrutiny of local manipulations. To address these limitations, we shift from scaling up the self-contained model toward reaching beyond it. We propose OmniVL-Guard Pro, a tool-augmented agent that extends unified forensics from closed-world prediction to open-world clue-driven reasoning. OmniVL-Guard Pro integrates a tool environment spanning real-time event search, local cropping and zooming, edge-anomaly screening, face detection, video frame extraction, and SAM3-based segmentation. To generate high-quality tool-reasoning trajectories, we introduce Tree-Structured Self-Evolving Tool Trajectory Generation, which produces diverse trajectories through seed guidance, guider-free self-evolution, and weakly-hinted hard sample synthesis, yielding the Full-Spectrum Tool Reasoning (FSTR) dataset for training. We further propose Checker-Guided Agentic Reinforcement Learning (CGARL), which provides process-level supervision to penalize cases where the answer is correct but the reasoning is distorted. Extensive experiments demonstrate that OmniVL-Guard Pro achieves state-of-the-art performance across various tasks and exhibits strong zero-shot generalization. The FSTR dataset and code for OmniVL-Guard Pro will be publicly released at https://github.com/shen8424/OmniVL-Guard-Pro.

2605.16812 2026-05-21 cs.LG cs.CR

Jacobian-Guided Anisotropic Noise Reshaping for Enhancing Representation Utility under Local Differential Privacy

Youngmok Ha, Viktor Schlegel, Yidan Sun, Anil Anthony Bharath

Affiliations: Imperial College London; Imperial College London, Imperial Global Singapore

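AI summary: This paper proposes a Jacobian-guided anisotropic noise reshaping method that improves representation utility under local differential privacy. It identifies task-critical subspaces, selectively attenuates noise there, and reshapes the isotropic noise of standard LDP into an anisotropic distribution, modulating noise impact heterogeneously while preserving the per-dimension privacy budget and substantially improving data utility.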

Abstract

While Local Differential Privacy (LDP) serves as a foundational primitive for distributed data collection, its stringent noise injection requirement often leads to severe degradation in data utility. This degradation stems from the task-agnostic nature of conventional LDP mechanisms, which inject noise uniformly across all dimensions regardless of their relative importance to the downstream objective. To address this issue, we propose a novel approach that mitigates noise in task-relevant subspaces of the data representation. Our method identifies task-critical subspaces via the Jacobian matrix of the public downstream model, selectively attenuates noise along those dimensions, and reshapes the isotropic noise of standard LDP into an anisotropic distribution. This method preserves the uniform per-dimension privacy budget while heterogeneously modulating noise impact across dimensions, thereby substantially enhancing data utility. Furthermore, our approach generalizes to both linear and non-linear models and integrates seamlessly with existing mechanisms. Extensive experiments on CIFAR-10-C (Brightness corruption at the highest severity level 5) demonstrate that integrating our approach improves the utility of PrivUnit2 and PrivUnitG by approximately 20% at ε = 7.5. The source code is available at https://github.com/ymha/jacobian-anr-ldp.
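
The idea of finding task-critical directions from a Jacobian and damping noise along them can be sketched for the simplest case, a linear model whose Jacobian is its weight matrix. This toy example is an assumption-laden illustration: the model `W`, the helper `reshape_noise`, and the attenuation factor are hypothetical, Gaussian noise stands in for an LDP mechanism, and the sketch makes no claim about privacy guarantees, which the paper handles separately.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical public linear model f(z) = W z, so its Jacobian is W itself.
W = rng.normal(size=(3, 16))

# Right singular vectors with non-zero singular values span the directions
# in representation space that can change the model output.
_, _, vt = np.linalg.svd(W, full_matrices=False)
task_basis = vt  # shape (3, 16), orthonormal rows


def reshape_noise(noise, basis, attenuation):
    """Scale the noise component inside span(basis) by `attenuation`."""
    inside = basis.T @ (basis @ noise)
    return noise - (1.0 - attenuation) * inside


noise = rng.normal(size=16)  # isotropic noise (illustrative stand-in only)
shaped = reshape_noise(noise, task_basis, attenuation=0.25)

output_shift_before = np.linalg.norm(W @ noise)
output_shift_after = np.linalg.norm(W @ shaped)
```

Because only the component inside the task subspace reaches the model output, attenuating it by 0.25 reduces the output perturbation by the same factor, while the noise in the remaining 13 directions is left untouched.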

2605.16793 2026-05-21 cs.LG

PULSE: Generative Phase Evolution for Non-Stationary Time Series Forecasting

Yangyou Liu, Zezhi Shao, Xinyu Chen, Hu Chen, Fei Wang, Yuankai Wu

Affiliations: College of Computer Science, Sichuan University, Chengdu, China; University of Chinese Academy of Sciences, Beijing, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Institute of Artificial Intelligence, University of Central Florida, Orlando, USA

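AI summary: To address the tension between stable representations and distribution shift in non-stationary time series forecasting, this paper proposes PULSE, a framework that guides phase evolution with physical hypotheses. It follows a Disentangle-Evolve-Simulate design philosophy and improves robustness through phase-anchored disentanglement, a Phase Router, and Statistic-Aware Mixup; experiments indicate that a physics-informed inductive bias matters more than raw architectural complexity.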

Abstract

Time series forecasting under non-stationarity faces a fundamental tension between capturing stable representations and adapting to distribution shifts. Existing methods implicitly rely on static historical assumptions, leading to a critical failure mode we term Phase Amnesia, where models become blind to the evolving global context. To resolve this, we formalize non-stationary dynamics through three physical hypotheses: Wold decomposition, dynamical phase evolution, and heteroscedastic manifold generation. These principles inspire PULSE, a physics-informed, plug-and-play framework adopting a Disentangle-Evolve-Simulate design philosophy. Specifically, PULSE utilizes phase-anchored disentanglement to resolve optimization interference caused by dominant trends, employs a Phase Router to actively generate future trajectories, and introduces Statistic-Aware Mixup (SAM) to ensure robustness against out-of-distribution volatility. Empirically, PULSE enables a simple MLP backbone to achieve state-of-the-art or highly competitive performance across 12 real-world benchmarks. This validates that a correct physics-informed inductive bias is far more critical than raw architectural complexity for non-stationary forecasting. The code is available at: https://github.com/Gemost/PULSE.

2605.16530 2026-05-21 cs.CV

SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

SWoMo:用于白内障手术模拟的神经符号世界模型

Ssharvien Kumar Sivakumar, Akwele Johnson, Anirudh Dhingra, Yannik Frisch, Ghazal Ghazaei, Anirban Mukhopadhyay

发表机构 * Technical University Darmstadt(达姆施塔特工业大学) Carl Zeiss AG(蔡司股份有限公司) AICM, Medical Faculty of Heidelberg University(海德堡大学医学院)

AI总结 本文提出SWoMo,一种用于白内障手术模拟的神经符号世界模型,通过分离运动生成与视觉真实性,结合规则基模拟器和场景图表示来建模运动动态和工具-组织交互,同时使用扩散模型生成逼真的视觉效果,从而提升手术模拟的真实性和临床适用性。

详情
AI中文摘要

现实手术模拟在培训初学者外科医生和开发自主代理方面起着至关重要的作用。世界模型可以通过根据当前观察和手术动作预测未来患者状态,将此类模拟环境扩展到真实且多样的程序中。然而,当前最先进的方法往往无法满足临床应用所需的关键标准,包括视觉真实性、物理基础的交互以及模拟超出训练分布的场景的能力。因此,我们引入SWoMo,一种用于白内障手术模拟的神经符号世界模型,该模型将运动生成与视觉真实性解耦。符号组件包括基于规则的模拟器和场景图表示,用于建模运动动态和工具-组织交互,而扩散模型则生成逼真的视觉外观,包括纹理和组织变形。我们提出了一种逆配对策略,通过在模拟器中重建真实的手术视频以获得配对的模拟和真实视频,然后用于训练我们的视频扩散模型,以实现反向的仿真到现实的翻译目标。我们的实验表明,与先前工作相比,既有定性也有定量的改进。我们证明,我们的模拟器进一步满足了关键标准,包括对未见交互几何的泛化、下游阶段检测的改进以及无监督的视频风格迁移。代码、数据和模型权重可在:https://ssharvienkumar.github.io/SWoMo/上获取。

英文摘要

Realistic surgical simulation plays a crucial role in training novice surgeons and in the development of autonomous agents. World models can scale such simulation environments to realistic and diverse procedures by predicting future patient states conditioned on current observations and surgical actions. However, current state-of-the-art approaches often fail to satisfy key criteria required for clinical applicability, including visual realism, physically grounded interactions, and the ability to simulate scenarios beyond the training distribution. Hence, we introduce SWoMo, a neuro-symbolic world model for cataract surgery simulation that decouples motion generation from visual realism. The symbolic component, consisting of a rule-based simulator and scene graph representations, models motion dynamics and tool-tissue interactions, while a diffusion model produces realistic visual appearance, including textures and tissue deformations. We propose an inverse pairing strategy that reconstructs real surgical videos in the simulator to obtain paired simulated and real videos, which are then used to train our video diffusion model for the reverse objective of sim-to-real translation. Our experiments show both qualitative and quantitative improvements over prior work. We demonstrate that our simulator further satisfies the key criteria, including generalisation to unseen interaction geometries, improvements in downstream phase detection, and unsupervised video style transfer. The code, data, and model weights are available at: https://ssharvienkumar.github.io/SWoMo/

2605.16217 2026-05-21 cs.CL cs.AI cs.IR

Argus: Evidence Assembly for Scalable Deep Research Agents

Argus:可扩展深度研究代理的证据组装

Zhen Zhang, Liangcai Su, Zhuo Chen, Xiang Lin, Haotian Xu, Simon Shaolei Du, Kaiyu Yang, Bo An, Lidong Bing, Xinyu Wang

发表机构 * MiroMind AI

AI总结 Argus通过将深度研究视为拼图碎片的组装过程,而非并行暴力求解整个答案,提高了大规模信息检索任务的效率和效果。

详情
AI中文摘要

深度研究代理在复杂信息检索任务上取得了显著进展。即使是很长的ReAct风格推演也只探索单一轨迹,而近期最先进的系统通过并行搜索和聚合来扩展推理时计算。然而,深度研究答案由互补的证据片段组成,而并行推演往往重复而非补全这些片段,导致收益递减,并使聚合上下文逼近模型极限。我们提出Argus,一种代理系统,其中搜索者(Searcher)和导航者(Navigator)相互协作,将深度研究视为用互补证据片段拼装拼图,而非并行暴力求解整个答案。搜索者通过ReAct风格交互为给定子查询收集证据轨迹。导航者维护共享证据图,核验哪些片段仍然缺失,派遣搜索者收集它们,并在补全后的图上推理以生成可溯源的最终答案。我们用强化学习训练导航者进行验证、派遣和综合,同时独立训练搜索者,使其保持为标准ReAct代理。所得到的导航者无需重新训练即可支持单个搜索者或多个并行搜索者的推演。搜索者和导航者均基于35B-A3B MoE骨干,Argus在使用单个搜索者时提升5.5分,在使用8个并行搜索者时提升12.7分(八个基准上的平均值)。使用64个搜索者时,其在BrowseComp上达到86.2分,超越我们评测的所有专有代理,同时导航者的推理上下文保持在21.5K tokens以下。

英文摘要

Deep research agents have achieved remarkable progress on complex information-seeking tasks. Even long ReAct-style rollouts explore only a single trajectory, while recent state-of-the-art systems scale inference-time compute via parallel search and aggregation. Yet deep research answers are composed of complementary pieces of evidence, which parallel rollouts often duplicate rather than complete, yielding diminishing returns while pushing the aggregation context toward the model's limit. We propose Argus, an agentic system in which a Searcher and a Navigator cooperate to treat deep research as assembling a jigsaw from complementary evidence pieces, rather than brute-forcing the whole answer in parallel. The Searcher collects evidence traces for a given sub-query through ReAct-style interaction. The Navigator maintains a shared evidence graph, verifying which pieces are still missing, dispatching Searchers to gather them, and reasoning over the completed graph to produce a source-traced final answer. We train the Navigator with reinforcement learning to verify, dispatch, and synthesize, while independently training the Searcher to remain a standard ReAct agent. The resulting Navigator supports rollouts with a single Searcher or many in parallel without retraining. With both Searcher and Navigator built on a 35B-A3B MoE backbone, Argus gains 5.5 points with a single Searcher and 12.7 points with 8 parallel Searchers, averaged over eight benchmarks. With 64 Searchers it reaches 86.2 on BrowseComp, surpassing every proprietary agent we benchmark, while the Navigator's reasoning context stays under 21.5K tokens.

2605.15944 2026-05-21 cs.RO cs.LG

FocalPolicy: Frequency-Optimized Chunking and Locally Anchored Flow Matching for Coherent Visuomotor Policy

FocalPolicy: 频率优化的分块和局部锚定的流匹配用于连贯的视觉-运动策略

Qian He, Zhenshuo Yang, Wenqi Liang, Chunhui Hao, Nicu Sebe, Jiandong Tian

发表机构 * State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences(机器人与智能系统国家重点实验室,沈阳自动化研究所,中国科学院) University of the Chinese Academy of Sciences(中国科学院大学) University of Trento(特伦托大学)

AI总结 本文提出FocalPolicy,一种具有远见的视觉-运动策略,通过频率优化的分块和局部锚定的流匹配,在近端精度与远端远见之间取得平衡,并提升动作块之间的连贯性。

详情
AI中文摘要

视觉-运动策略旨在从专家示范中学习复杂的操作任务。然而,生成平滑且连贯的轨迹仍然具有挑战性,因为它需要在近端精度与远端远见之间进行平衡。现有方法通常专注于优化块内动作分布,往往忽略了块间连贯性。因此,块间不连续性显著阻碍了连贯长周期动作的学习。为克服这一限制并实现精度与远见之间的协同平衡,我们提出了FocalPolicy,一种具有远见的视觉-运动策略,结合了频率优化的分块与局部锚定的流匹配。我们引入了一个远见复合目标,监督时间域内近端动作的对齐,同时在多个未来动作块上正则化频率域结构以提高跨块连贯性。为了高效学习复杂动作分布,我们设计了局部锚定采样,以提高一致性流匹配训练期间的目标信号传播效率。广泛的实验表明,FocalPolicy优于现有方法,并验证了我们的模块对其他基线的通用性。项目网站:https://focalpolicy.github.io/

英文摘要

Visuomotor policies aim to learn complex manipulation tasks from expert demonstrations. However, generating smooth and coherent trajectories remains challenging, as it requires balancing proximal precision with distal foresight. Existing approaches typically focus on optimizing intra-chunk action distributions, often neglecting the inter-chunk coherence. Consequently, inter-chunk discontinuities significantly impede the learning of coherent long-horizon actions. To overcome this limitation and achieve a synergetic balance between precision and foresight, we propose FocalPolicy, a foresight-aware visuomotor policy that combines Frequency-Optimized Chunking with Locally Anchored flow matching. We introduce a foresight composite objective that supervises time-domain alignment within the proximal actions while regularizing frequency-domain structure over multiple future action chunks to improve cross-chunk coherence. To efficiently learn complex action distributions, we design locally anchored sampling to enhance target signal propagation efficiency during consistency flow matching training. Extensive experiments demonstrate that FocalPolicy outperforms existing approaches and confirm the generalizability of our modules to other baselines. Project website: https://focalpolicy.github.io/

2605.15876 2026-05-21 cs.CV

Unlocking Dense Metric Depth Estimation in VLMs

解锁VLMs中的密集度量深度估计

Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke

发表机构 * Zhejiang University(浙江大学) Tencent Hunyuan LLM(腾讯混元大模型) HKUST(香港科技大学) Shenzhen Loop Area Institute(深圳河套学院)

AI总结 本文提出DepthVLM,一种将单个VLM转换为原生密集几何预测器的简单有效框架,同时保持其多模态能力。通过在LLM主干上附加轻量级深度头,并在统一的视觉-文本监督范式下进行两阶段训练,DepthVLM能够在单次前向传递中生成全分辨率深度图和语言输出。此外,还引入了一个统一的室内-室外度量深度基准。实验表明DepthVLM显著优于现有VLMs且推理效率更高,超越领先的纯视觉模型,并提升复杂3D空间推理能力。

Comments Project Page: https://depthvlm.github.io/

详情
英文摘要

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified multimodal foundation model. The project page is available at https://depthvlm.github.io/

2605.15691 2026-05-21 cs.LG

SEED: Targeted Data Selection by Weighted Independent Set

SEED:通过加权独立集实现目标数据选择

Yuan Zhang, Lifeng Guo, Junwen Pan, Wenzhao Zheng, Wen Zhou, Kuan Cheng, Kurt Keutzer, Shanghang Zhang

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) Beijing University of Posts and Telecommunications(北京邮电大学) Tianjin University(天津大学) EECS, UC Berkeley(加州大学伯克利分校电气工程与计算机科学系) Chinese Academy of Sciences(中国科学院)

AI总结 本文提出SEED方法,通过将数据选择问题建模为加权独立集(WIS)在相似性图上,解决样本质量与多样性之间的平衡问题,并引入节点价值校准和局部尺度归一化来提升数据选择的鲁棒性和可扩展性。

Comments 20 pages

详情
AI中文摘要

数据选择旨在从大规模训练语料中识别出紧凑且信息丰富的子集,在样本质量与集合多样性之间取得平衡。我们将该问题建模为相似性图上的加权独立集(WIS),其中节点代表数据样本并按影响力加权,边连接语义冗余的样本对。这种建模自然产生同时具备高质量和多样性的子集。然而,实践中存在两个挑战:朴素的节点权重无法区分信息信号与梯度噪声,且在异构领域分布下构造边会产生结构不平衡的图,使选择偏向稀疏区域。为解决这些问题,我们从统一的图视角引入两项有原则的改进:(1)节点价值校准,将影响力估计限制在双侧显著子空间内,使节点重要性建立在任务相关信号而非表层统计之上;(2)局部尺度归一化,使边阈值适应局部邻域密度,缓解跨领域分布偏移引起的图不平衡。这些组件共同构成了一个稳健且可扩展的数据选择流程,称为SEED。我们进一步构建了Honeybee-Remake-SEED-200K,一个由SEED筛选的紧凑多模态数据集。大量实验表明,SEED在指令微调、视觉指令微调和语义分割任务上,跨多种模型家族始终优于现有最先进方法。

英文摘要

Data selection seeks to identify a compact yet informative subset from large-scale training corpora, balancing sample quality against collection diversity. We formulate this problem as a Weighted Independent Set (WIS) on a similarity graph, where nodes represent data samples weighted by influence, and edges connect semantically redundant pairs. This formulation naturally yields subsets that are simultaneously high-quality and diverse. However, two challenges arise in practice: naive node weights fail to distinguish informative signals from gradient noise, and edge construction under heterogeneous domain distributions produces structurally imbalanced graphs that bias selection toward sparse regions. To address these issues, we introduce two principled refinements from a unified graph perspective: (1) node value calibration that restricts influence estimation to the bilateral salient subspace to ground node importance in task-relevant signals rather than surface-level statistics; (2) local scale normalization that adapts edge thresholds to local neighborhood density, mitigating graph imbalance induced by cross-domain distribution shifts. Together, these components yield a robust and scalable data selection pipeline dubbed SEED. We further construct Honeybee-Remake-SEED-200K, a compact multimodal dataset curated by SEED. Extensive experiments show that SEED consistently outperforms state-of-the-art methods on instruction tuning, visual instruction tuning, and semantic segmentation across diverse model families.
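
The weighted-independent-set formulation in this abstract can be illustrated with a small sketch. The code below is a plain greedy heuristic on a toy graph, not the SEED pipeline: the abstract does not say which solver the authors use, and the function name, node weights, and edges here are all hypothetical. It only shows why an independent set on a redundancy graph trades quality (high weights) against diversity (no two selected samples share an edge).

```python
from typing import Dict, List, Set, Tuple


def greedy_weighted_independent_set(
    weights: Dict[str, float],
    edges: List[Tuple[str, str]],
) -> Set[str]:
    """Greedy heuristic for weighted independent set (illustrative only).

    Nodes are visited in order of descending weight; a node is kept only
    if none of its neighbours has already been kept.
    """
    neighbours: Dict[str, Set[str]] = {node: set() for node in weights}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)

    selected: Set[str] = set()
    # Ties are broken by node name so the result is deterministic.
    for node in sorted(weights, key=lambda n: (-weights[n], n)):
        if neighbours[node].isdisjoint(selected):
            selected.add(node)
    return selected


# Hypothetical toy data: "a"/"b" and "b"/"c" are redundant pairs, "d" is unrelated.
toy_weights = {"a": 0.9, "b": 0.8, "c": 0.3, "d": 0.5}
toy_edges = [("a", "b"), ("b", "c")]
subset = greedy_weighted_independent_set(toy_weights, toy_edges)
print(sorted(subset))
```

In this toy graph the heuristic keeps "a", drops its redundant neighbour "b", and still keeps "c" and "d" because neither conflicts with a selected node. A greedy pass gives no optimality guarantee in general; it is used here only because it is short and easy to follow.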

2605.15157 2026-05-21 cs.RO cs.LG

Hand-in-the-Loop: Improving VLA Policies for Dexterous Manipulation via Seamless Hand-Arm Intervention

手在环中:通过无缝手-臂干预改进面向灵巧操作的VLA策略

Zhuohang Li, Liqun Huang, Wei Xu, Zhengming Zhu, Nie Lin, Xiao Ma, Xinjun Sheng, Ruoshi Wen

发表机构 * State Key Laboratory of Mechanical System and Vibration, School of Mechanical Engineering, Shanghai Jiao Tong University(机械系统与振动国家重点实验室,机械工程学院,上海交通大学) Shanghai Key Laboratory of Intelligent Robotics, Meta Robotics Institute, Shanghai Jiao Tong University, Shanghai 200240, China(智能机器人上海市重点实验室,元机器人研究院,上海交通大学,上海200240,中国) The University of Tokyo(东京大学)

AI总结 本文提出Hand-in-the-Loop方法,通过无缝整合人类干预与自主策略执行,减少手部操作中的突兀变化,提升双臂灵巧操作的鲁棒性和效率。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在灵巧操作中容易出现误差累积,高维动作空间和接触丰富的动力学会在长时域内放大微小的策略偏差。虽然交互式模仿学习(IIL)可通过人类修正数据改进策略,但将其应用于高自由度(DoF)机械手仍具有挑战性,因为在干预时刻人类遥操作与策略执行之间存在指令不匹配,导致机器人手部构型突变,即“手势跳变”。我们提出了Hand-in-the-Loop(HandITL),一种无缝的人在回路干预方法,将人类的修正意图与自主策略执行相融合,以避免双臂灵巧操作中的手势跳变。与通过直接遥操作接管控制相比,HandITL将干预抖动减少了99.8%,并保持了干预后的稳健操作,将抓取失败减少了87.5%,平均完成时间减少了19.1%。我们在需要双臂协调、工具使用和精细长时域操作的任务上验证了HandITL。当用于收集策略改进所需的修正数据时,HandITL得到的策略在三个长时域灵巧任务上平均比使用标准遥操作数据训练的策略高出19%。

英文摘要

Vision-Language-Action (VLA) models are prone to compounding errors in dexterous manipulation, where high-dimensional action spaces and contact-rich dynamics amplify small policy deviations over long horizons. While Interactive Imitation Learning (IIL) can refine policies through human correction data, applying it to high-degree-of-freedom (DoF) robotic hands remains challenging due to a command mismatch between human teleoperation and policy execution at the intervention moment, which causes abrupt robot-hand configuration changes, or "gesture jumps". We present Hand-in-the-Loop (HandITL), a seamless human-in-the-loop intervention method that blends human corrective intent with autonomous policy execution to avoid gesture jumps during bimanual dexterous manipulation. Compared with taking over control using direct teleoperation, HandITL reduces intervention jitter by 99.8% and preserves robust post-intervention manipulation, reducing grasp failures by 87.5% and mean completion time by 19.1%. We validate HandITL on tasks requiring bimanual coordination, tool use, and fine-grained long-horizon manipulation. When used to collect correction data for policy refinement, HandITL yields policies that outperform those trained with standard teleoperation data by 19% on average across three long-horizon dexterous tasks.

2605.15104 2026-05-21 cs.CL

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

从文本到声音:一个可重复和可验证的评估工具调用LLM代理的框架

Md Tahmid Rahman Laskar, Xue-Yong Fu, Seyyed Saeed Sarfjoo, Quinten McNamara, Jonas Robertson, Shashi Bhushan TN

发表机构 * Dialpad Inc.(Dialpad公司)

AI总结 本文提出了一种可重复和可验证的框架,用于评估基于语音的工具调用LLM代理,通过文本到语音转换和环境噪声模拟,无需重新标注工具模式和黄金标签,从而在不重新标注的情况下评估工具调用性能。

详情
AI中文摘要

语音代理越来越需要基于语音的可靠工具调用,而主流的工具调用基准仍以文本为主。我们研究能否在不重新标注工具模式和标准答案标签的情况下,将经过验证的文本基准转换为受控的音频工具调用评测。我们的数据集无关框架使用文本到语音、说话人变化和环境噪声来创建配对的文本-音频实例,同时保留原始数据集标注。基于对7个全模态模型在Confetti和When2Call音频转换版本上的广泛评估,我们的框架表明性能强烈依赖于模型和任务:Gemini-3.1-Flash-Live获得最高的Confetti分数(70.4),而GPT-Realtime-1.5在When2Call上表现最佳(71.9)。在Confetti上,文本到语音的性能差距从Qwen3-Omni的1.8分到GPT-Realtime-1.5的4.8分不等。对失败案例的针对性分析表明,性能下降最常源于对语音中参数值的误解。考虑到真实部署场景,我们进一步报告了纯文本结果、基于歧义的改述压力测试,以及经人类偏好验证的无参考LLM-as-judge协议。值得注意的是,我们发现参数量至少为8B的开源Qwen3评判模型与专有评判模型的一致率超过80%,支持隐私保护的评估。总体而言,我们的框架提供了一种可验证、可复现的第一阶段诊断,是对专门构建的音频语料库的补充。

英文摘要

Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.

2605.14417 2026-05-21 cs.RO cs.CV

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

在身体移动之前:为语言条件的人形控制学习预见性关节意图

Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan, Shaofeng Liang, Lei Wang, Shuxu Jin, Kuimou Yu, Zinuo Zhang, Jianfei Song, Wenshuo Chen, Yutao Yue

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) LimX Dynamics Technology Co., Ltd.(LimX动态技术有限公司) Shandong University(山东大学) Data61/CSIRO Griffith University(格里菲斯大学) Institute of Deep Perception Technology, Jiangsu Industrial Technology Research Institute (JITRI)(深度感知技术研究院,江苏省工业技术研究院(JITRI))

AI总结 该研究提出DAJI框架,通过学习语言生成与闭环控制之间的预见性关节意图接口,解决语言条件人形机器人中预见未来物理转换的需求,实现了在HumanML3D风格生成和BABEL任务中的高性能表现。

详情
AI中文摘要

自然语言是人形机器人的直观接口,但流式全身控制需要既能即时执行、又能预见未来物理转换的控制表示。现有语言条件人形系统通常生成运动学参考,需要低级跟踪器被动地修复,或使用潜变量/动作策略,其输出不显式编码即将发生的接触变化、支撑转移和平衡准备。我们提出DAJI(Dynamics-Aligned Joint Intent),一个分层框架,学习语言生成与闭环控制之间的预见性关节意图接口。DAJI-Act通过学生驱动的推演将具有未来感知能力的教师蒸馏为可部署的扩散动作策略,而DAJI-Flow则根据语言和意图历史自回归地生成未来意图块。实验表明,DAJI在预见性潜变量学习、单指令生成和流式指令跟随中表现优异,在HumanML3D风格生成中达到94.42%的推演成功率,在BABEL上达到0.152的子序列FID。

英文摘要

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

2605.14382 2026-05-21 cs.CV cs.GR cs.MM

Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

Delta Forcing:交互式自回归视频生成中的信任区域引导

Yuheng Wu, Xiangbo Gao, Tianhao Chen, Xinghao Chen, Qing Yin, Zhengzhong Tu, Dongman Lee

发表机构 * Texas A&M University(德克萨斯A&M大学) University of Washington(华盛顿大学)

AI总结 本文提出Delta Forcing方法,通过约束不可靠的教师监督在适应性信任区域中,以提高自回归视频生成的一致性并保持对新事件的响应性。

详情
AI中文摘要

交互式实时自回归视频生成对于内容创作和世界建模等应用至关重要,其中视觉内容必须适应动态变化的事件条件。一个基本挑战在于在反应性和稳定性之间取得平衡:模型必须迅速响应新事件,同时在长时间范围内保持时间一致性。现有方法将双向模型蒸馏为自回归生成器,并进一步通过流式长调优进行适应,但往往在条件变化后仍会出现持续漂移。我们发现原因在于条件偏差,其中教师可能提供与条件对齐但轨迹无关的指导,使生成偏向于局部有效但全局不一致的模式。受信任区域策略优化的启发,我们提出Delta Forcing,一种简单而有效的框架,它将不可靠的教师监督限制在适应性信任区域内。具体而言,Delta Forcing从教师和生成器轨迹之间的潜在delta估计转移一致性,并利用它来平衡教师监督与单调连续性目标。这抑制了不可靠的教师诱导的偏移,同时保持对新事件的响应性。广泛的实验表明,Delta Forcing在提高一致性的同时保持了事件的响应性。

英文摘要

Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppresses unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.

2605.14364 2026-05-21 cs.LG

MoRe: Modular Representations for Principled Continual Representation Learning on Sequential Data

MoRe:模块化表示用于序列数据的原理化持续表示学习

Jiaqi Sun, Boyang Sun, Rasmy M. H., Xiangchen Song, Kun Zhang

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文提出MoRe框架,通过模块化表示方法实现序列数据的原理化持续学习,其核心贡献是通过分解知识为可识别的模块层级,实现模块的重用、对齐和扩展,从而在保持旧模块的同时提升模型的可塑性和稳定性。

详情
AI中文摘要

持续学习要求模型在适应新数据的同时保持已获得的知识。其核心挑战可以视为原理化的一步适应:在最小干扰的情况下将新信息整合到现有表示中。大多数现有方法通过监督、任务特定的方式修改模型参数或架构来解决这一挑战。然而,根本问题在于表示层面:任务需要具有不同但结构化的表示,这些表示可以被选择性更新而不破坏表示,同时结构应反映数据中的内在组织而非任务边界。在序列数据中,时间延迟依赖性提供了一种自然的信号,用于揭示这种组织,展示如何基本表示产生更具体的表示。受人类大脑模块化组织的启发,我们提出MoRe,一个框架,它在表示本身中识别模块性,而不是在架构层面分配。MoRe将知识分解为具有可识别保证的基本和特定模块层级,使在适应过程中能够实现原理化的模块重用、对齐和扩展,同时通过构造保留旧模块。在合成基准和真实世界LLM激活数据上的实验表明了可解释的层次结构,改进了可塑性-稳定性权衡,表明MoRe是持续适应的原理化基础。

英文摘要

Continual learning requires models to adapt to new data while preserving previously acquired knowledge. At its core, this challenge can be viewed as principled one-step adaptation: incorporating new information with minimal interference to existing representations. Most existing approaches address this challenge by modifying model parameters or architectures in a supervised, task-specific manner. However, the underlying issue is representational: tasks require distinct yet structured representations that can be selectively updated without disrupting representations, while structure should reflect intrinsic organization in the data rather than task boundaries. In sequential data, time-delayed dependencies provide a natural signal for uncovering this organization, revealing how fundamental representations give rise to more specific ones. Inspired by the modular organization of the human brain, we propose MoRe, a framework that identifies modularity in the representation itself rather than allocating it at the architectural level. MoRe decomposes knowledge into a hierarchy of fundamental and specific modules with identifiability guarantees, enabling principled module reuse, alignment, and expansion during adaptation while preserving old modules by construction. Experiments on synthetic benchmarks and real-world LLM activations demonstrate interpretable hierarchical structure and improved plasticity-stability trade-offs, suggesting MoRe as a principled foundation for continual adaptation.

2605.14259 2026-05-21 cs.AI cs.CL

Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

面向异构业务系统的超图企业代理推理器

Ling Wang, Xin Liu, Songnan Liu, Jianan Wang, Cheng Cheng, Yihan Zhu, Enyu Li, Yu Xiao, Jiangyong Xie, Duogong Yan, Jiangyi Chen

发表机构 * SUPCON

AI总结 本文提出HEAR,一种基于分层超图本体的企业代理推理器,通过基础图层和超边层实现结构化多跳分析,无需重新训练LLM,在供应链任务中最高达到94.7%的准确率,并展现出自适应效率。

详情
AI中文摘要

将大语言模型(LLMs)应用于异构企业系统时,多跳、n元推理中的幻觉和失败构成了阻碍。现有范式(如GraphRAG、NL2SQL)缺乏这类复杂环境所需的语义基础和可审计执行。我们引入HEAR,一种基于分层超图本体的企业代理推理器。其基础图层对具备溯源能力的数据接口进行虚拟化,而超边层编码n元业务规则和流程协议。通过证据驱动的推理循环,HEAR动态编排本体工具进行结构化多跳分析,无需重新训练LLM。在供应链任务(包括订单履行阻塞的根本原因分析,RCA)上的评估显示,HEAR最高达到94.7%的准确率。关键的是,HEAR展现出自适应效率:利用流程超边最小化token成本,同时借助拓扑探索保证复杂查询的严格正确性。通过以开放权重骨干达到与专有模型相当的性能并将人工诊断自动化,HEAR为企业智能建立了可扩展、可审计的基础。

英文摘要

Applying Large Language Models (LLMs) to heterogeneous enterprise systems is hindered by hallucinations and failures in multi-hop, n-ary reasoning. Existing paradigms (e.g., GraphRAG, NL2SQL) lack the semantic grounding and auditable execution required for these complex environments. We introduce HEAR, an enterprise agentic reasoner built on a Stratified Hypergraph Ontology. Its base Graph Layer virtualizes provenance-aware data interfaces, while the Hyperedge Layer encodes n-ary business rules and procedural protocols. Operating an evidence-driven reasoning loop, HEAR dynamically orchestrates ontology tools for structured multi-hop analysis without requiring LLM retraining. Evaluations on supply-chain tasks, including order fulfillment blockage root cause analysis (RCA), show HEAR achieves up to 94.7% accuracy. Crucially, HEAR demonstrates adaptive efficiency: utilizing procedural hyperedges to minimize token costs, while leveraging topological exploration for rigorous correctness on complex queries. By matching proprietary model performance with open-weight backbones and automating manual diagnostics, HEAR establishes a scalable, auditable foundation for enterprise intelligence.

2605.14201 2026-05-21 cs.RO cs.CV

MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

MAPLE:基于潜在空间的多智能体交互用于端到端自动驾驶

Rajeev Yasarla, Deepti Hegde, Hsin-Pai Cheng, Shizhong Han, Yunxiao Shi, Meysam Sadeghigooghari, Hanno Ackermann, Litian Liu, Pranav Desai, Fatih Porikli, Mohammad Ghavamzadeh, Hong Cai

发表机构 * Qualcomm AI Research(高通人工智能研究院)

AI总结 本文提出MAPLE框架,通过在视觉-语言-动作模型的潜在空间中进行反应式多智能体推演,解决传统模仿学习框架下训练的模型在闭环设置中表现脆弱的问题;通过监督微调和带多样性奖励的强化学习,实现了可扩展且无需外部模拟器的闭环训练,提升了端到端自动驾驶系统的鲁棒性。

Comments 19 pages, 9 figures

详情
AI中文摘要

视觉-语言-动作(VLA)模型作为端到端运动规划器十分有效,但由于是在传统模仿学习框架下训练的,在闭环设置中评估时可能表现脆弱。现有的闭环监督方法缺乏可扩展性,且无法完整建模反应式环境。我们提出MAPLE,一种新框架,用于在VLA模型的潜在空间中对动态驾驶场景进行反应式多智能体推演。自车和附近的交通参与者在多步时间范围内被独立控制,同时对场景中的其他智能体作出反应,从而实现闭环训练。MAPLE包含两个训练阶段:(1)基于真实轨迹对潜在推演进行监督微调,随后是(2)带有全局奖励和智能体特定奖励的强化学习,这些奖励鼓励安全性、行驶进度和交互真实感。我们进一步提出多样性奖励,鼓励模型生成记录的驾驶数据中可能不存在的规划行为。值得注意的是,我们的闭环训练框架具有可扩展性,且无需外部模拟器,后者运行的计算成本高,且对真实世界的视觉保真度有限。MAPLE在Bench2Drive上实现了最先进的驾驶性能,并展示了可扩展的闭环多智能体交互,可用于构建鲁棒的端到端自动驾驶系统。

英文摘要

Vision-language-action (VLA) models are effective as end-to-end motion planners, but can be brittle when evaluated in closed-loop settings due to being trained under a traditional imitation learning framework. Existing closed-loop supervision approaches lack scalability and fail to completely model a reactive environment. We propose MAPLE, a novel framework for reactive, multi-agent rollout of a dynamic driving scenario in the latent space of the VLA model. The ego vehicle and nearby traffic agents are independently controlled over multi-step horizons, while being reactive to other agents in the scene, enabling closed-loop training. MAPLE consists of two training stages: (1) supervised fine-tuning on the latent rollouts based on ground-truth trajectories, followed by (2) reinforcement learning with global and agent-specific rewards that encourage safety, progress, and interaction realism. We further propose diversity rewards that encourage the model to generate planning behaviors that may not be present in logged driving data. Notably, our closed-loop training framework is scalable and does not require external simulators, which can be computationally expensive to run and have limited visual fidelity to the real world. MAPLE achieves state-of-the-art driving performance on Bench2Drive and demonstrates scalable, closed-loop multi-agent play for robust E2E autonomous driving systems.

2605.13475 2026-05-21 cs.CV

FedHPro: Federated Hyper-Prototype Learning via Gradient Matching

FedHPro: 通过梯度匹配实现联邦超原型学习

Huan Wang, Jun Shen, Haoran Li, Zhenyu Yang, Jun Yan, Ousman Manjang, Yanlong Zhai, Di Wu, Guansong Pang

发表机构 * School of Computing and Information Technology, University of Wollongong, Wollongong, Australia(计算机与信息科技学院,伍伦贡大学,澳大利亚) School of Computing and Information Systems, Singapore Management University, Singapore, Singapore(计算机与信息系统学院,新加坡管理大学,新加坡) Monash University, Australia(莫纳什大学,澳大利亚) Macquarie University, Australia(麦考瑞大学,澳大利亚) Beijing Institute of Technology, China(北京理工大学,中国) La Trobe University, Australia(拉筹伯大学,澳大利亚)

AI总结 本文提出FedHPro框架,通过引入超原型和梯度匹配来提升联邦学习中的类别分离度和类别内一致性,实验表明其在多个基准数据集上达到最优性能。

Comments 23 pages, ICML 2026 Camera-ready Version

详情
AI中文摘要

联邦学习(FL)能够在保护隐私的同时实现分布式客户端的协同训练。为了增强FL的泛化能力,基于原型的FL受到关注,因为共享的全局原型为对齐客户端特定的局部原型提供了语义锚点。然而,现有方法通过平均局部原型或细化全局锚点来更新全局原型,这通常导致客户端间的语义漂移,从而产生不一致的全局信号。为了解决这个问题,我们引入了超原型,由一组可学习的全局类别级原型定义,以在客户端间保留底层语义知识。超原型通过梯度匹配进行优化,以对齐从客户端真实样本中直接提取的类别相关特征,而不是原型级描述符。我们进一步提出了FedHPro,一个联邦超原型学习框架,以利用超原型通过互对比学习和客户端特定的边距来促进类别间分离度,同时通过一致性惩罚促进类别内均匀性。在多样化的异构场景中的全面实验表明,1)超原型产生更一致的全局信号,2)FedHPro在多个基准数据集上达到最优性能。代码可在https://github.com/mala-lab/FedHPro获取。

英文摘要

Federated Learning (FL) enables collaborative training of distributed clients while protecting privacy. To enhance generalization capability in FL, prototype-based FL is in the spotlight, since shared global prototypes offer semantic anchors for aligning client-specific local prototypes. However, existing methods update global prototypes at the prototype level via averaging local prototypes or refining global anchors, which often leads to semantic drift across clients and subsequently yields a misaligned global signal. To alleviate this issue, we introduce hyper-prototypes, defined by a set of learnable global class-wise prototypes to preserve underlying semantic knowledge across clients. The hyper-prototypes are optimized via gradient matching to align with class-relevant characteristics distilled directly from clients' real samples, rather than prototype-level descriptors. We further propose FedHPro, a Federated Hyper-Prototype Learning framework, to leverage hyper-prototypes to promote inter-class separability via mutual-contrastive learning with client-specific margin, while encouraging intra-class uniformity through a consistency penalty. Comprehensive experiments under diverse heterogeneous scenarios confirm that 1) hyper-prototypes produce a more semantically consistent global signal, and 2) FedHPro achieves state-of-the-art performance on several benchmark datasets. Code is available at https://github.com/mala-lab/FedHPro.

2605.13302 2026-05-21 cs.LG cs.SY eess.SY

Safe Bayesian Optimization for Uncertain Correlation Matrices in Linear Models of Co-Regionalization

安全的贝叶斯优化用于线性共区域化模型中的不确定相关矩阵

Jannis Lübsen, Annika Eichler

Affiliations * Institute of Control Systems, Hamburg University of Technology

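AI summary This paper extends safety guarantees for multi-task Bayesian optimization from intrinsic co-regionalization models to linear models of co-regionalization, which model inter-task correlations more flexibly by composing multiple features. It derives uniform error bounds for vector-valued functions sampled from a Gaussian process with a linear-model-of-co-regionalization kernel, and a numerical comparison on a safe multi-task Bayesian optimization benchmark shows the potential performance gains of the linear model.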

Comments Accepted at IFAC WC26

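Details
AI-generated abstract

This paper extends safety guarantees for multi-task Bayesian optimization with uncertain co-regionalization matrices from intrinsic co-regionalization models to linear models of co-regionalization. The latter model inter-task correlations more flexibly by composing multiple features. We derive uniform error bounds for vector-valued functions sampled from a Gaussian process with a linear-model-of-co-regionalization kernel. In addition, a numerical comparison on a safe multi-task Bayesian optimization benchmark shows the potential performance gains of linear models of co-regionalization.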

English abstract

This paper extends safety guarantees for multi-task Bayesian optimization with uncertain co-regionalization matrices from intrinsic co-regionalization models to linear models of co-regionalization. The latter allows for more flexible modeling of the inter-task correlations by composing multiple features. We derive uniform error bounds for vector-valued functions sampled from a Gaussian process with a linear model of co-regionalization kernel. Furthermore, we show the potential performance gains of linear models of co-regionalization in a numerical comparison on a safe multi-task Bayesian optimization benchmark.
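As background on the two model classes named above, the standard linear model of co-regionalization builds a multi-task covariance as a sum of Kronecker products, one per latent feature: each term pairs a positive semidefinite task matrix B_q with an input kernel k_q. The intrinsic co-regionalization model is the special case with a single term. The sketch below assumes 1-D inputs, RBF input kernels, and made-up task matrices; it illustrates the textbook kernel only and does not reproduce the paper's safety bounds.

```python
import numpy as np

def rbf(x1, x2, lengthscale):
    """Squared-exponential kernel matrix for 1-D inputs."""
    d2 = (x1[:, None] - x2[None, :]) ** 2
    return np.exp(-0.5 * d2 / lengthscale**2)

def lmc_covariance(x, coreg_matrices, lengthscales):
    """Sum over q of kron(B_q, K_q(x, x)), with task-major ordering of outputs."""
    n_tasks = coreg_matrices[0].shape[0]
    cov = np.zeros((n_tasks * len(x), n_tasks * len(x)))
    for B, ell in zip(coreg_matrices, lengthscales):
        cov += np.kron(B, rbf(x, x, ell))
    return cov

x = np.linspace(0.0, 1.0, 5)

# Illustrative (assumed) task matrices for two tasks; both are PSD by construction.
a1 = np.array([1.0, 0.8])
a2 = np.array([0.2, -0.5])
B1 = np.outer(a1, a1)
B2 = np.outer(a2, a2) + 0.1 * np.eye(2)

cov_lmc = lmc_covariance(x, [B1, B2], lengthscales=[1.0, 0.3])   # two latent features
cov_icm = lmc_covariance(x, [B1], lengthscales=[1.0])            # single feature (ICM)
```

With two terms, the tasks can be strongly correlated on the long lengthscale and weakly or negatively correlated on the short one, which a single task matrix cannot express.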

2605.13081 2026-05-21 cs.CV

PRA-PoE: Robust Multimodal Alzheimer's Diagnosis with Arbitrary Missing Modalities

PRA-PoE: 基于任意缺失模态的鲁棒多模态阿尔茨海默病诊断

Guangqian Yang, Ye Du, Wenlong Hou, Qian Niu, Shujun Wang

Affiliations * Department of Biomedical Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China; Department of Technology Management for Innovation, The University of Tokyo, Japan; Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR, China

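AI summary This study proposes the PRA-PoE framework, which uses prototype-anchored representation alignment and an uncertainty-aware Product of Experts fusion mechanism to address the representation shift caused by missing modalities in multimodal learning, improving diagnostic robustness and accuracy under different missingness patterns.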

Comments Early accepted by MICCAI 2026

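Details
AI-generated abstract

Missing modalities are common in real-world Alzheimer's disease assessment and pose a major challenge for multimodal learning, especially when the distribution of observed modality subsets differs between training and deployment. This mismatch in missingness patterns induces a conditional representation shift across modality subsets. Existing methods rely on implicit imputation or modality synthesis and often fail to model modality availability and uncertainty explicitly, which leads to over-reliance on synthesized features, reduced robustness, and miscalibrated uncertainty estimates. To address these limitations, we propose PRA-PoE, an incomplete multimodal learning framework equipped with Prototype-anchored Representation Alignment (PRA) and an Uncertainty-aware Product of Experts (UA-PoE) fusion mechanism. First, PRA uses learnable global prototypes and availability-conditioned tokens to encode modality availability, distinguish observed from missing modalities, re-synthesize features for missing modalities, and adaptively refine observed representations so that latent spaces align across modality subsets, with the goal of reducing representation shift under varying missingness patterns. Second, UA-PoE models each modality as a Gaussian expert and performs closed-form Product of Experts fusion, in which experts with higher uncertainty are automatically down-weighted through their lower precision, improving the reliability of uncertainty estimates. We evaluate PRA-PoE under a clinically realistic protocol, training on naturally missing data and testing on all non-empty modality combinations. Across all non-empty modality subsets, PRA-PoE outperforms the strongest baseline, with a 5.4% relative improvement in average accuracy on ADNI and a 10.9% relative gain in average F1 on OASIS-3.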

English abstract

Missing modalities are prevalent in real-world Alzheimer's disease (AD) assessment and pose a significant challenge to multimodal learning, particularly when the distribution of observed modality subsets differs between training and deployment. Such missingness pattern mismatch induces a conditional representation shift across modality subsets. Existing approaches that rely on implicit imputation or modality synthesis often fail to explicitly model modality availability and uncertainty, leading to overconfident dependence on synthesized features, reduced robustness, and miscalibrated uncertainty estimates. To address these limitations, we propose PRA-PoE, an incomplete multimodal learning framework that is equipped with Prototype-anchored Representation Alignment (PRA) and an Uncertainty-aware Product of Experts (UA-PoE) fusion mechanism. First, PRA uses learnable global prototypes and availability-conditioned tokens to encode modality availability, distinguish observed from missing modalities, re-synthesize features for missing modalities, and adaptively refine observed representations to align latent spaces across modality subsets, with the goal of reducing representation shift under varying missingness patterns. Second, UA-PoE models each modality as a Gaussian expert and performs closed-form Product of Experts fusion, where experts with higher uncertainty are automatically down-weighted via lower precision, improving uncertainty reliability. We evaluate PRA-PoE under a clinically realistic protocol by training with naturally missing data and testing on all non-empty modality combinations. PRA-PoE consistently outperforms the state-of-the-art across datasets, achieving a 5.4% relative improvement in average accuracy on ADNI and a 10.9% relative gain in average F1 on OASIS-3 over the strongest baseline across all non-empty modality subsets.
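The closed-form fusion mentioned above can be illustrated with the standard product of diagonal Gaussian experts: the fused precision is the sum of the experts' precisions, and the fused mean is the precision-weighted average of their means. The sketch below assumes diagonal covariances, no prior expert, and a binary availability mask that drops missing modalities; the function name is hypothetical, and the paper's UA-PoE may differ in these details.

```python
import numpy as np

def poe_fuse(means, variances, available):
    """Fuse diagonal Gaussian experts; rows are modalities, columns are latent dims."""
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mask = np.asarray(available, dtype=float)[:, None]
    if mask.sum() == 0:
        raise ValueError("at least one modality must be available")
    precisions = mask / variances            # missing modalities contribute zero precision
    fused_var = 1.0 / precisions.sum(axis=0)
    fused_mean = fused_var * (precisions * means).sum(axis=0)
    return fused_mean, fused_var

# Two modalities, one latent dimension (made-up numbers).
means = [[0.0], [4.0]]
variances = [[1.0], [3.0]]

both_mean, both_var = poe_fuse(means, variances, available=[1, 1])
only_first_mean, only_first_var = poe_fuse(means, variances, available=[1, 0])
```

The second expert is three times less certain, so the fused mean sits closer to the first expert's mean; this is the automatic down-weighting by lower precision that the abstract describes.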

2605.12960 2026-05-21 cs.CL

DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

DiM³: 通过方向和幅度感知融合连接多语言和多模态模型

Zijing Wang, Mingyang Wang, Ercong Nie, Yongkang Liu, Shi Feng, Mengjie Zhao, Daling Wang, Xiaocui Yang, Hinrich Schütze

Affiliations * Northeastern University, China; CIS, LMU Munich, Germany; Munich Center for Machine Learning (MCML), Germany; Shanghai Jiao Tong University, China

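AI summary This study proposes DiM3, which composes residual updates in the shared language model backbone to integrate multilingual and multimodal capabilities, improving multilingual performance while retaining multimodal ability.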

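Details
AI-generated abstract

Toward more general, human-like intelligence, large language models should seamlessly integrate multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires costly multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in text-only and vision-language settings, covering 57 languages with LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We also show that DiM3 can be applied directly to already trained multilingual multimodal models and still yields additional gains. Interpretability analysis further shows that DiM3 mainly reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is at https://github.com/wzj1718/DiM3.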

English abstract

Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on https://github.com/wzj1718/DiM3.
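To make "composing residual updates at each parameter dimension" more concrete, here is a small NumPy sketch. A residual update is the difference between a fine-tuned weight and the shared base weight. The per-element rule shown (average the two updates where their signs agree, keep the larger-magnitude update where they conflict) is a hypothetical heuristic chosen only to illustrate what a direction- and magnitude-based selection can look like. It is not DiM3's actual rule, which the abstract does not specify; all names and numbers are assumptions.

```python
import numpy as np

def residual_update(finetuned, base):
    """Per-parameter difference between a fine-tuned model and the shared base."""
    return {name: finetuned[name] - base[name] for name in base}

def merge_elementwise(base, delta_a, delta_b):
    """Hypothetical rule: average where signs agree, else keep the larger-magnitude update."""
    merged = {}
    for name, weight in base.items():
        a, b = delta_a[name], delta_b[name]
        agree = np.sign(a) == np.sign(b)
        larger = np.where(np.abs(a) >= np.abs(b), a, b)
        merged[name] = weight + np.where(agree, 0.5 * (a + b), larger)
    return merged

# Toy "models" with a single weight vector each (made-up values).
base = {"w": np.zeros(3)}
multilingual = {"w": np.array([1.0, 2.0, -1.0])}
multimodal = {"w": np.array([3.0, -1.0, -3.0])}

delta_ml = residual_update(multilingual, base)
delta_mm = residual_update(multimodal, base)
merged = merge_elementwise(base, delta_ml, delta_mm)
```

In a real vision-language model, only the language backbone parameters would be merged this way; the abstract states that the vision encoder and multimodal projector are kept from the original multimodal model.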