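arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and more.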

2605.19138 2026-05-21 cs.RO cs.AI cs.LG

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

Ayush Agarwal, Ansh Gandhi, Jeremy A. Collins, Omar Rayyan, Aryan Sarswat, Ranjani Koushik, Masoud Moghani, Ajay Mandlekar, Animesh Garg

Affiliations: Georgia Institute of Technology; University of California, Berkeley; New York University Abu Dhabi (NYUAD); University of Toronto; NVIDIA

AI summary: This paper presents COBALT, a cloud-based teleoperation platform that uses smartphones and other common devices to collect high-quality robot learning data at scale, both in simulation and in the real world.

Abstract

The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present COBALT, a teleoperation platform designed to democratize robot learning at scale both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU, yielding a significant reduction in teleoperation cost. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An in-memory data cache and efficient video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency for up to 8 concurrent users per GPU. We also demonstrate stable operation supporting 256 simulated clients across 8 GPUs, underscoring the system's ability to scale across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, COBALT logs a suite of real-time metrics to automatically filter suboptimal demonstrations. We further demonstrate that a structured user training curriculum significantly improves data collection quality. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over five days. We validate the dataset's quality by training state-of-the-art imitation learning algorithms. Please visit https://cobalt-teleop.github.io/ for more details.

2605.18833 2026-05-21 cs.LG cs.AI

Automated Big Data Quality Assessment using Knowledge Graph Embeddings

Hadi Fadlallah, Rima Kilany, Mitri Haber, Ali Jaber

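Affiliations: Saint-Joseph University; Lebanese University

AI summary: This paper proposes an automated big data quality assessment method based on knowledge graph embeddings. By integrating diverse representations within a knowledge graph, it uses contextual information to generate a comprehensive, context-specific data quality assessment plan.

Comments: 17 pages, 10 figures

Journal ref: International Journal of Data Mining, Modelling and Management 17.4 (2025) 383-405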

DOI: 10.1504/IJDMMM.2025.150987

Abstract

Automated data quality assessment is crucial for managing big data, but existing solutions face challenges in achieving accurate context-aware assessment. This paper presents a novel knowledge-based approach to enhance automated data quality assessment. Our approach utilizes knowledge graph embeddings to predict missing edges between the input dataset's context representation and the relevant quality rules and dimensions within a knowledge graph representing contextual data characteristics and the required quality assessment operations. We surpass conventional practices by integrating diverse representations within the knowledge graph, drawing insights from contextual information from a thorough literature investigation. This integration allows us to develop a comprehensive and context-specific data quality assessment plan tailored to each context. Leveraging the knowledge graph improves our understanding of the input dataset's context, overcoming the limitations of traditional methods that rely solely on strict matching and overlook contextual characteristics. By injecting numerical edge attributes, we assign corresponding weights to each predicted quality measurement, providing a comprehensive data quality assessment plan for the input dataset. To evaluate our approach, we leverage AmpliGraph, a framework developed and benchmarked by AccentureLabs. The evaluation involves employing a real-world radiation sensors dataset provided by the Lebanese Atomic Energy Commission (LAEC-CNRS). The results obtained from this evaluation demonstrate the capability of our solution to generate a comprehensive data quality assessment plan for the given input dataset.

2605.18743 2026-05-21 cs.AI

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

Lingtao Mao, Huangyu Dai, Xinyu Sun, Zihan Liang, Ben Chen, Chenyi Lei, Wenwu Ou

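Affiliations: Kuaishou Technology

AI summary: This paper introduces SVFSearch, the first multimodal, knowledge-intensive benchmark for short-video frame search in the Chinese gaming domain. Using 5,000 four-choice test examples and 4,198 auxiliary training examples, it evaluates approaches ranging from direct QA to Plan-Act-Replan agents.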

Abstract

Multimodal large language models are increasingly used as agent backbones that understand multimodal inputs, plan retrieval actions, invoke external tools, and reason over retrieved information. Yet existing benchmarks rarely evaluate this ability in short-video applications, where a paused frame is often visually ambiguous and answering requires vertical, long-tail, and fast-evolving domain knowledge. We introduce SVFSearch, the first open benchmark for short-video frame search in the Chinese gaming domain. SVFSearch contains 5,000 four-choice test examples and 4,198 auxiliary training examples, each centered on a paused game scene from a real short-video clip. To support fair and reproducible evaluation, SVFSearch provides a frozen offline retrieval environment with a game-domain text corpus, a topic-linked image gallery, and text, image, and multimodal retrieval interfaces, avoiding reliance on uncontrolled web search APIs. We evaluate representative paradigms ranging from direct QA and RAG workflow to Plan-Act-Replan agents and learned search models. Results reveal a large gap between model-only answering, practical agentic search, and oracle knowledge: the best open-source direct-QA model reaches 66.4%, the best practical agent achieves 79.1%, and oracle knowledge reaches 95.4%. Further analysis exposes bottlenecks in visual grounding, retrieval quality, evidence-grounded reasoning, and tool-use behavior, including over-search, answer-only shortcuts, and retrieval-induced misleading.

2605.17776 2026-05-21 cs.RO

CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

Xiangyue Wang, Hanxuan Chen, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Kangli Wang, Ji Pei

Affiliations: Autel Robotics; Nanjing University; Peking University; Southern University of Science and Technology; University of Hong Kong

AI summary: This paper presents CosFly-Track, a large-scale multi-modal dataset for UAV visual tracking generated via multi-constraint trajectory optimization, which improves dynamic target tracking performance.

Unlocking Dense Metric Depth Estimation in VLMs

Hanxun Yu, Xuan Qu, Yuxin Wang, Jianke Zhu, Lei Ke

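Affiliations: Zhejiang University; Tencent Hunyuan LLM; HKUST; Shenzhen Loop Area Institute

AI summary: This paper proposes DepthVLM, a simple yet effective framework that turns a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm, DepthVLM produces full-resolution depth maps alongside language outputs in a single forward pass. It also introduces a unified indoor-outdoor metric depth benchmark; experiments show that DepthVLM outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning.

Comments: Project page: https://depthvlm.github.io/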

Abstract

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, introducing error accumulation, or enable direct prediction with inefficient per-pixel query or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that transforms a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs with higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified multimodal foundation model. The project page is available at https://depthvlm.github.io/

2605.15691 2026-05-21 cs.LG

SEED: Targeted Data Selection by Weighted Independent Set

Yuan Zhang, Lifeng Guo, Junwen Pan, Wenzhao Zheng, Wen Zhou, Kuan Cheng, Kurt Keutzer, Shanghang Zhang

Affiliations: School of Computer Science, Peking University; Beijing University of Posts and Telecommunications; Tianjin University; EECS, UC Berkeley; Chinese Academy of Sciences

AI summary: This paper proposes SEED, which formulates data selection as a Weighted Independent Set (WIS) problem on a similarity graph to balance sample quality against diversity, and introduces node value calibration and local scale normalization to improve the robustness and scalability of data selection.

Comments: 20 pages

Abstract

Data selection seeks to identify a compact yet informative subset from large-scale training corpora, balancing sample quality against collection diversity. We formulate this problem as a Weighted Independent Set (WIS) on a similarity graph, where nodes represent data samples weighted by influence, and edges connect semantically redundant pairs. This formulation naturally yields subsets that are simultaneously high-quality and diverse. However, two challenges arise in practice: naive node weights fail to distinguish informative signals from gradient noise, and edge construction under heterogeneous domain distributions produces structurally imbalanced graphs that bias selection toward sparse regions. To address these issues, we introduce two principled refinements from a unified graph perspective: (1) node value calibration that restricts influence estimation to the bilateral salient subspace to ground node importance in task-relevant signals rather than surface-level statistics; (2) local scale normalization that adapts edge thresholds to local neighborhood density, mitigating graph imbalance induced by cross-domain distribution shifts. Together, these components yield a robust and scalable data selection pipeline dubbed SEED. We further construct Honeybee-Remake-SEED-200K, a compact multimodal dataset curated by SEED. Extensive experiments show that SEED consistently outperforms state-of-the-art methods on instruction tuning, visual instruction tuning, and semantic segmentation across diverse model families.
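
To make the Weighted Independent Set formulation concrete, the following is a minimal sketch of a greedy heuristic on a toy similarity graph. It is not the solver used by SEED, which the abstract does not specify; exact WIS is NP-hard, and the function name, the `budget` parameter, and the toy weights and edges are all assumptions made for illustration.

```python
from typing import Dict, List, Set, Tuple


def greedy_weighted_independent_set(
    weights: Dict[str, float],
    edges: List[Tuple[str, str]],
    budget: int,
) -> List[str]:
    """Greedy heuristic (hypothetical helper): take the heaviest remaining
    node, then block all of its neighbors so no two selected nodes share an edge."""
    neighbors: Dict[str, Set[str]] = {node: set() for node in weights}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    selected: List[str] = []
    blocked: Set[str] = set()
    # Visit nodes from highest to lowest weight; ties are broken by node id.
    for node in sorted(weights, key=lambda n: (-weights[n], n)):
        if len(selected) >= budget:
            break
        if node in blocked:
            continue
        selected.append(node)
        blocked.add(node)
        blocked.update(neighbors[node])
    return selected


# Toy data (assumed): node weights stand in for influence scores,
# edges stand in for "semantically redundant" pairs.
weights = {"a": 0.9, "b": 0.8, "c": 0.5, "d": 0.4}
edges = [("a", "b"), ("c", "d")]
subset = greedy_weighted_independent_set(weights, edges, budget=3)
print(subset)
```

In this toy graph the redundant pair ("a", "b") contributes only its heavier member, which is the quality-versus-diversity trade-off the formulation is meant to capture.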

2605.15157 2026-05-21 cs.RO cs.LG

Hand-in-the-Loop: Improving VLA Policies for Dexterous Manipulation via Seamless Hand-Arm Intervention

Zhuohang Li, Liqun Huang, Wei Xu, Zhengming Zhu, Nie Lin, Xiao Ma, Xinjun Sheng, Ruoshi Wen

Affiliations: State Key Laboratory of Mechanical System and Vibration, School of Mechanical Engineering, Shanghai Jiao Tong University; Shanghai Key Laboratory of Intelligent Robotics, Meta Robotics Institute, Shanghai Jiao Tong University, Shanghai 200240, China; The University of Tokyo

AI summary: This paper proposes Hand-in-the-Loop, which blends human intervention with autonomous policy execution to avoid abrupt hand-configuration changes, improving the robustness and efficiency of bimanual dexterous manipulation.

Abstract

Vision-Language-Action (VLA) models are prone to compounding errors in dexterous manipulation, where high-dimensional action spaces and contact-rich dynamics amplify small policy deviations over long horizons. While Interactive Imitation Learning (IIL) can refine policies through human correction data, applying it to high-degree-of-freedom (DoF) robotic hands remains challenging due to a command mismatch between human teleoperation and policy execution at the intervention moment, which causes abrupt robot-hand configuration changes, or "gesture jumps". We present Hand-in-the-Loop (HandITL), a seamless human-in-the-loop intervention method that blends human corrective intent with autonomous policy execution to avoid gesture jumps during bimanual dexterous manipulation. Compared with taking over control using direct teleoperation, HandITL reduces intervention jitter by 99.8% and preserves robust post-intervention manipulation, reducing grasp failures by 87.5% and mean completion time by 19.1%. We validate HandITL on tasks requiring bimanual coordination, tool use, and fine-grained long-horizon manipulation. When used to collect correction data for policy refinement, HandITL yields policies that outperform those trained with standard teleoperation data by 19% on average across three long-horizon dexterous tasks.

2605.15104 2026-05-21 cs.CL

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Md Tahmid Rahman Laskar, Xue-Yong Fu, Seyyed Saeed Sarfjoo, Quinten McNamara, Jonas Robertson, Shashi Bhushan TN

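Affiliations: Dialpad Inc.

AI summary: This paper proposes a reproducible and verifiable framework for evaluating voice-based tool-calling LLM agents. It uses text-to-speech and environmental noise simulation to convert text benchmarks into audio, so tool-calling performance can be evaluated without re-annotating the tool schema or gold labels.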

Abstract

Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.
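
As an illustration of the environmental-noise step, the sketch below mixes a noise signal into synthesized speech at a chosen signal-to-noise ratio. The abstract does not say how noise is added, so the SNR-based mixing, the function name `mix_at_snr`, and the toy sample values are all assumptions; real audio would be loaded from files rather than written inline.

```python
import math
from typing import List


def mean_power(samples: List[float]) -> float:
    """Average squared amplitude of a signal."""
    return sum(x * x for x in samples) / len(samples)


def mix_at_snr(speech: List[float], noise: List[float], snr_db: float) -> List[float]:
    """Scale `noise` so that speech power / noise power matches `snr_db`,
    then add it to `speech` sample by sample (hypothetical helper)."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0.0:
        raise ValueError("noise signal is silent")
    target_noise_power = mean_power(speech) / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


# Toy signals (assumed): a square-wave "speech" and a slower square-wave "noise".
speech = [1.0, -1.0, 1.0, -1.0]
noise = [0.5, 0.5, -0.5, -0.5]
mixed = mix_at_snr(speech, noise, snr_db=10.0)

# Recover the realized SNR from the mixture as a sanity check.
residual = [m - s for m, s in zip(mixed, speech)]
realized_snr_db = 10.0 * math.log10(mean_power(speech) / mean_power(residual))
print(round(realized_snr_db, 3))
```

Because the transcript and tool-call labels are untouched by this step, the paired text and audio instances can share the original annotations, which is the property the framework relies on.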

2605.14417 2026-05-21 cs.RO cs.CV

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

Haozhe Jia, Honglei Jin, Yuan Zhang, Youcheng Fan, Shaofeng Liang, Lei Wang, Shuxu Jin, Kuimou Yu, Zinuo Zhang, Jianfei Song, Wenshuo Chen, Yutao Yue

Affiliations: The Hong Kong University of Science and Technology (Guangzhou); LimX Dynamics Technology Co., Ltd.; Shandong University; Data61/CSIRO; Griffith University; Institute of Deep Perception Technology, Jiangsu Industrial Technology Research Institute (JITRI)

AI summary: This work proposes DAJI, a framework that learns an anticipatory joint-intent interface between language generation and closed-loop control, addressing the need for language-conditioned humanoid robots to anticipate future physical transitions. It reports strong results on HumanML3D-style generation and on BABEL.

Abstract

Natural language is an intuitive interface for humanoid robots, yet streaming whole-body control requires control representations that are executable now and anticipatory of future physical transitions. Existing language-conditioned humanoid systems typically generate kinematic references that a low-level tracker must repair reactively, or use latent/action policies whose outputs do not explicitly encode upcoming contact changes, support transfers, and balance preparation. We propose DAJI (Dynamics-Aligned Joint Intent), a hierarchical framework that learns an anticipatory joint-intent interface between language generation and closed-loop control. DAJI-Act distills a future-aware teacher into a deployable diffusion action policy through student-driven rollouts, while DAJI-Flow autoregressively generates future intent chunks from language and intent history. Experiments show that DAJI achieves strong results in anticipatory latent learning, single-instruction generation, and streaming instruction following, reaching 94.42% rollout success on HumanML3D-style generation and 0.152 subsequence FID on BABEL.

2605.14382 2026-05-21 cs.CV cs.GR cs.MM

Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

Yuheng Wu, Xiangbo Gao, Tianhao Chen, Xinghao Chen, Qing Yin, Zhengzhong Tu, Dongman Lee

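Affiliations: Texas A&M University; University of Washington

AI summary: This paper proposes Delta Forcing, which constrains unreliable teacher supervision within an adaptive trust region to improve the consistency of autoregressive video generation while preserving responsiveness to new events.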

Abstract

Interactive real-time autoregressive video generation is essential for applications such as content creation and world modeling, where visual content must adapt to dynamically evolving event conditions. A fundamental challenge lies in balancing reactivity and stability: models must respond promptly to new events while maintaining temporal coherence over long horizons. Existing approaches distill bidirectional models into autoregressive generators and further adapt them via streaming long tuning, yet often exhibit persistent drift after condition changes. We identify the cause as conditional bias, where the teacher may provide condition-aligned but trajectory-agnostic guidance, biasing generation toward locally valid yet globally inconsistent modes. Inspired by Trust Region Policy Optimization, we propose Delta Forcing, a simple yet effective framework that constrains unreliable teacher supervision within an adaptive trust region. Specifically, Delta Forcing estimates transition consistency from the latent delta between teacher and generator trajectories, and uses it to balance teacher supervision with a monotonic continuity objective. This suppresses unreliable teacher-induced shifts while preserving responsiveness to new events. Extensive experiments demonstrate that Delta Forcing significantly improves consistency while maintaining event reactivity.

2605.14364 2026-05-21 cs.LG

MoRe: Modular Representations for Principled Continual Representation Learning on Sequential Data

Jiaqi Sun, Boyang Sun, Rasmy M. H., Xiangchen Song, Kun Zhang

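Affiliations: Carnegie Mellon University; Mohamed bin Zayed University of Artificial Intelligence

AI summary: This paper proposes MoRe, a modular-representation framework for principled continual learning on sequential data. Its core contribution is decomposing knowledge into a hierarchy of modules with identifiability guarantees, enabling module reuse, alignment, and expansion while preserving old modules and improving the plasticity-stability trade-off.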

Abstract

Continual learning requires models to adapt to new data while preserving previously acquired knowledge. At its core, this challenge can be viewed as principled one-step adaptation: incorporating new information with minimal interference to existing representations. Most existing approaches address this challenge by modifying model parameters or architectures in a supervised, task-specific manner. However, the underlying issue is representational: tasks require distinct yet structured representations that can be selectively updated without disrupting existing representations, while structure should reflect intrinsic organization in the data rather than task boundaries. In sequential data, time-delayed dependencies provide a natural signal for uncovering this organization, revealing how fundamental representations give rise to more specific ones. Inspired by the modular organization of the human brain, we propose MoRe, a framework that identifies modularity in the representation itself rather than allocating it at the architectural level. MoRe decomposes knowledge into a hierarchy of fundamental and specific modules with identifiability guarantees, enabling principled module reuse, alignment, and expansion during adaptation while preserving old modules by construction. Experiments on synthetic benchmarks and real-world LLM activations demonstrate interpretable hierarchical structure and improved plasticity-stability trade-offs, suggesting MoRe as a principled foundation for continual adaptation.

2605.14259 2026-05-21 cs.AI cs.CL

Safe Bayesian Optimization for Uncertain Correlation Matrices in Linear Models of Co-Regionalization

Jannis Lübsen, Annika Eichler

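Affiliations: Institute of Control Systems, Hamburg University of Technology

AI summary: This paper extends the safety guarantees of multi-task Bayesian optimization from the intrinsic model of co-regionalization to the linear model of co-regionalization, which models inter-task correlations more flexibly by combining multiple features. It derives a uniform error bound for vector-valued functions sampled from a Gaussian process with a linear co-regionalization kernel, and numerical comparisons on safe multi-task Bayesian optimization benchmarks show the potential performance advantages of the linear model.

Comments: Accepted at IFAC WC26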

2605.13081 2026-05-21 cs.CV

PRA-PoE: Robust Multimodal Alzheimer's Diagnosis with Arbitrary Missing Modalities

Guangqian Yang, Ye Du, Wenlong Hou, Qian Niu, Shujun Wang

Affiliations: Department of Biomedical Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China; Department of Technology Management for Innovation, The University of Tokyo, Japan; Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR, China

AI summary: This work proposes the PRA-PoE framework, which uses prototype-anchored representation alignment and an uncertainty-aware Product of Experts fusion mechanism to address the representation shift caused by missing modalities in multimodal learning, improving diagnostic robustness and accuracy under varying missingness patterns.

Comments: Early accepted by MICCAI 2026

Abstract

Missing modalities are prevalent in real-world Alzheimer's disease (AD) assessment and pose a significant challenge to multimodal learning, particularly when the distribution of observed modality subsets differs between training and deployment. Such missingness pattern mismatch induces a conditional representation shift across modality subsets. Existing approaches that rely on implicit imputation or modality synthesis often fail to explicitly model modality availability and uncertainty, leading to overconfident dependence on synthesized features, reduced robustness, and miscalibrated uncertainty estimates. To address these limitations, we propose PRA-PoE, an incomplete multimodal learning framework that is equipped with Prototype-anchored Representation Alignment (PRA) and an Uncertainty-aware Product of Experts (UA-PoE) fusion mechanism. First, PRA uses learnable global prototypes and availability-conditioned tokens to encode modality availability, distinguish observed from missing modalities, re-synthesize features for missing modalities, and adaptively refine observed representations to align latent spaces across modality subsets, with the goal of reducing representation shift under varying missingness patterns. Second, UA-PoE models each modality as a Gaussian expert and performs closed-form Product of Experts fusion, where experts with higher uncertainty are automatically down-weighted via lower precision, improving uncertainty reliability. We evaluate PRA-PoE under a clinically realistic protocol by training with naturally missing data and testing on all non-empty modality combinations. PRA-PoE consistently outperforms the state-of-the-art across datasets, achieving a 5.4% relative improvement in average accuracy on ADNI and a 10.9% relative gain in average F1 on OASIS-3 over the strongest baseline across all non-empty modality subsets.
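
The closed-form Product of Experts fusion mentioned above can be illustrated with the standard result for Gaussian experts: precisions (inverse variances) add, and the fused mean is the precision-weighted average of the expert means. The sketch below is a one-dimensional illustration under that textbook identity, not the paper's implementation; the function name, the omission of any prior expert, and the example means and variances are assumptions.

```python
from typing import List, Tuple


def product_of_gaussian_experts(experts: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Fuse 1-D Gaussian experts given as (mean, variance) pairs.

    Hypothetical helper: a missing modality is handled by leaving its
    expert out of the list, so only observed modalities contribute.
    """
    if not experts:
        raise ValueError("at least one expert is required")
    precisions = [1.0 / variance for _, variance in experts]
    total_precision = sum(precisions)
    fused_mean = sum(p * mean for p, (mean, _) in zip(precisions, experts)) / total_precision
    fused_variance = 1.0 / total_precision
    return fused_mean, fused_variance


# Assumed example: a confident expert (variance 1.0) and an uncertain one (variance 4.0).
experts = [(0.0, 1.0), (4.0, 4.0)]
fused_mean, fused_var = product_of_gaussian_experts(experts)
print(fused_mean, fused_var)
```

The uncertain expert has precision 0.25 against 1.0 for the confident one, so the fused mean stays much closer to the confident expert. This is the automatic down-weighting by lower precision that the abstract describes.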

2605.12960 2026-05-21 cs.CL

DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging

Zijing Wang, Mingyang Wang, Ercong Nie, Yongkang Liu, Shi Feng, Mengjie Zhao, Daling Wang, Xiaocui Yang, Hinrich Schütze

Affiliations: Northeastern University, China; CIS, LMU Munich, Germany; Munich Center for Machine Learning (MCML), Germany; Shanghai Jiao Tong University, China

AI summary: This study proposes DiM3, which composes residual updates in the shared language model backbone to integrate multilingual and multimodal capabilities, improving multilingual performance while retaining multimodal ability.

Abstract

Towards more general and human-like intelligence, large language models should seamlessly integrate both multilingual and multimodal capabilities; however, extending an existing multimodal model to many languages typically requires expensive multilingual multimodal data construction and repeated end-to-end retraining. We study a training-free alternative: injecting multilingual capability into an existing multimodal model by composing residual updates in the shared language model backbone. The key challenge is that multilingual and multimodal updates are heterogeneous, reflecting different functional roles in the shared model. To address this, we propose Direction- and Magnitude-aware Multilingual Multimodal merging (DiM3), which selectively composes the two updates at each parameter dimension while preserving the original vision encoder and multimodal projector. Experiments on multilingual benchmarks in both text-only and vision-language settings, covering 57 languages across LLaVA- and Qwen-based backbones, show that DiM3 consistently outperforms existing merging baselines, substantially improves multilingual performance over the original multimodal model, and remains competitive with dedicated multilingual multimodal fine-tuning while largely retaining general multimodal ability. We further show that DiM3 can be directly applied to already trained multilingual multimodal models and still yield additional gains. Further interpretability analysis shows that DiM3 primarily reshapes intermediate-layer semantic representations, strengthening cross-lingual alignment under both text-only and multimodal inputs while preserving higher-layer task-sensitive structure. Our repository is on https://github.com/wzj1718/DiM3.

COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

Automated Big Data Quality Assessment using Knowledge Graph Embeddings

WorldString: Actionable World Representation

Spectral Progressive Diffusion for Efficient Image and Video Generation

Lance: Unified Multimodal Modeling by Multi-Task Synergy

S2Aligner: Pair-Efficient and Transferable Pre-Training for Sparse Text-Attributed Graphs

NeRF-based Spacecraft Reconstruction from Monocular Imagery Under Illumination Variability and Pose Uncertainty

SVFSearch: A Multimodal Knowledge-Intensive Benchmark for Short-Video Frame Search in the Gaming Vertical Domain

CosFly-Track: A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

Do LLM Agents Mirror Socio-Cognitive Effects in Power-Asymmetric Conversations?

Weighted Reverse Convolution for Feature Upsampling

OmniVL-Guard Pro: A Tool-Augmented Agent for Omnibus Vision-Language Forensics

Jacobian-Guided Anisotropic Noise Reshaping for Enhancing Representation Utility under Local Differential Privacy

PULSE: Generative Phase Evolution for Non-Stationary Time Series Forecasting

SWoMo: Neuro-Symbolic World Model for Cataract Surgery Simulation

Argus: Evidence Assembly for Scalable Deep Research Agents

FocalPolicy: Frequency-Optimized Chunking and Locally Anchored Flow Matching for Coherent Visuomotor Policy

Unlocking Dense Metric Depth Estimation in VLMs

SEED: Targeted Data Selection by Weighted Independent Set

Hand-in-the-Loop: Improving VLA Policies for Dexterous Manipulation via Seamless Hand-Arm Intervention

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Before the Body Moves: Learning Anticipatory Joint Intent for Language-Conditioned Humanoid Control

Delta Forcing: Trust Region Steering for Interactive Autoregressive Video Generation

MoRe: Modular Representations for Principled Continual Representation Learning on Sequential Data

Hypergraph Enterprise Agentic Reasoner over Heterogeneous Business Systems

MAPLE: Latent Multi-Agent Play for End-to-End Autonomous Driving

FedHPro: Federated Hyper-Prototype Learning via Gradient Matching

Safe Bayesian Optimization for Uncertain Correlation Matrices in Linear Models of Co-Regionalization

PRA-PoE: Robust Multimodal Alzheimer's Diagnosis with Arbitrary Missing Modalities

DiM³: Bridging Multilingual and Multimodal Models via Direction- and Magnitude-Aware Merging