arXivDaily arXiv每日学术速递 周一至周五更新

视觉与机器人

多模态信息融合

面向图像、视频、多传感器和跨模态感知的信息融合,包括 Image Fusion、红外可见光、遥感、医学影像、LiDAR/雷达/相机和音视频融合。

今日/当前日期收录 67 信号源:cs.CV, eess.IV, eess.SP, cs.RO, cs.MM

1. 音视频/视觉语言融合 14 篇

2606.19100 2026-06-18 cs.CV 新提交 80%

AMALIA-VL: A Native European Portuguese Open-Source Vision and Language Model

AMALIA-VL: 一个原生欧洲葡萄牙语开源视觉与语言模型

Diogo Glória-Silva, João Cardeira, Manuel Letras da Luz, Afonso Simplício, Gonçalo Vinagre, Diogo Tavares, Rafael Ferreira, Inês Calvo, Inês Vieira, David Semedo, João Magalhães

发表机构 * NOVA School of Science and Technology(NOVA科学与技术学校) NOVA LINCS

专题命中 音视频/视觉语言融合 :构建欧洲葡萄牙语视觉语言模型

AI总结 针对欧洲葡萄牙语缺乏开源多模态模型的问题,提出AMALIA-VL,通过三阶段训练和葡萄牙语中心数据混合,建立强基线并开源所有资源。

详情
AI中文摘要

大型视觉与语言模型(LVLMs)发展迅速,但欧洲葡萄牙语(pt-PT)在现有的开源多模态模型中仍系统性地未被充分服务,这些模型要么将其与巴西葡萄牙语混为一谈,要么在其训练数据混合中严重缺乏代表性。我们推出了AMALIA-VL,这是第一个原生为pt-PT构建的开源指令微调LVLM,通过可学习的连接器将高分辨率视觉编码器与动态图像平铺以及完全开放的pt-PT优化语言模型配对。我们贡献了一个精心设计的三阶段训练过程——视觉-语言对齐、通用视觉指令微调和偏好优化——以及一个以pt-PT为中心的多模态数据混合,该混合结合了策划和翻译的公共数据集与新颖的数据集,以解决欧洲葡萄牙语多模态资源几乎完全缺失的问题。我们的评估表明,AMALIA-VL为开源pt-PT LVLM建立了强基线。我们将发布模型权重、训练数据和构建流程,以及机器翻译的pt-PT评估基准,以帮助民主化pt-PT LVLM的开发。

英文摘要

Large Vision and Language Models (LVLMs) have advanced rapidly, yet European Portuguese (pt-PT) remains systematically underserved by existing open-source multimodal models, which either conflate it with Brazilian Portuguese or severely under-represent it in their training data mixes. We introduce AMALIA-VL, the first open-source instruction-tuned LVLM built natively for pt-PT, pairing a high-resolution vision encoder with dynamic image tiling and a fully open pt-PT-optimized language model via a learned connector. We contribute with a purposefully designed three-stage training process - vision-language alignment, general visual instruction tuning, and preference optimization - together with a pt-PT-centric multimodal data mix combining curated and translated public datasets with novel datasets that address the near-total absence of European Portuguese multimodal resources. Our evaluation shows that AMALIA-VL establishes a strong baseline for open-source pt-PT LVLMs.We will release model weights, training data, and construction pipelines along with machine-translated pt-PT evaluation benchmarks to help democratize pt-PT LVLM development.

2606.18992 2026-06-18 cs.CV 新提交 80%

Show, Don't Ask: Generative Visual Disambiguation for Composed Image Retrieval with Turn-Valid Coverage

展示,而非询问:基于轮次有效覆盖的生成式视觉消歧用于组合图像检索

Amsisan Tran, Baogh Le, Tuan Kiet Pham, Sui Yang Guang

发表机构 * Amsisan Tran Baogh Le Tuan Kiet Pham Sui Yang Guang

专题命中 音视频/视觉语言融合 :组合图像检索涉及视觉与文本跨模态融合。

AI总结 提出CLARA框架,通过展示视觉备选面板让用户选择,结合似然比重校准实现多轮覆盖保证,在组合图像检索中有效消歧,优于文本提问基线。

详情
AI中文摘要

组合图像检索(CIR)使用参考图像和文本修改来搜索目标图像。然而,此类查询通常描述多个可能的图像而非一个确切目标,使得用户意图模糊。近期方法通过使用共形预测估计模糊性并向用户提问澄清文本来解决此问题。但这些方法有两个局限:其覆盖保证仅在第一轮交互中成立,且文本问题通常不足以解决细粒度视觉差异,如外观、属性或视角。我们提出CLARA,一种通过向用户展示小型视觉备选面板来消歧的澄清框架。用户无需回答文本问题,只需选择最接近预期目标的原型图像。这提供了直接的视觉信号,并避免依赖模型预测用户答案。为在多轮交互中维持有效的共形保证,CLARA使用用户选择引起的似然比对校准进行重加权。显示的原型也被约束为代表当前候选集,并映射到真实语料库图像,确保生成的图像不能人为提高覆盖。在开放域和时尚基准上的实验表明,CLARA匹配单轮最先进的检索性能,在多轮交互中维持名义覆盖,并在比强文本问题基线更少的轮次中找到预期目标。其优势在模糊性涉及视角或细粒度属性时尤为明显,此时视觉消歧比文本提问更有效。

英文摘要

Composed image retrieval (CIR) uses a reference image and a text modification to search for a target image. However, such queries often describe several possible images rather than one exact target, making the user's intent ambiguous. Recent methods address this by using conformal prediction to estimate ambiguity and by asking users clarifying text questions. However, these methods have two limitations: their coverage guarantee only holds at the first interaction, and text questions are often insufficient for resolving fine-grained visual differences such as appearance, attributes, or viewpoint. We propose CLARA, a clarification framework that resolves ambiguity by showing users a small panel of visual alternatives. Instead of answering text questions, the user simply selects the prototype image closest to the intended target. This provides a direct visual signal and avoids relying on a model to predict the user's answer. To maintain valid conformal guarantees across multiple interaction rounds, CLARA reweights calibration using the likelihood ratio induced by the user's selection. The displayed prototypes are also constrained to represent the current candidate set and are snapped to real corpus images, ensuring that generated images cannot artificially improve coverage. Experiments on open-domain and fashion benchmarks show that CLARA matches single-turn state-of-the-art retrieval performance, maintains nominal coverage across interaction rounds, and finds the intended target in fewer rounds than strong text-question baselines. Its advantage is especially clear when ambiguity involves viewpoint or fine-grained attributes, where visual clarification is more effective than textual questioning.

2606.18885 2026-06-18 cs.CV cs.IR 新提交 80%

LARE: Low-Attention Region Encoding for Text-Image Retrieval

LARE: 低注意力区域编码用于文本-图像检索

Abdulmalik Alquwayfili, Faisal Almeshal, Jumanah Almajnouni, Leena Alotaibi, Faisal Alhajari, Mohammed Alkhrashi, Alreem Almuhrij, Abdullah Aldwyish, Raied Aljadaany, Huda Alamri, Muhammad Kamran J. Khan

发表机构 * Saudi Data and Artificial Intelligence Authority (SDAIA)(沙特数据与人工智能局)

专题命中 音视频/视觉语言融合 :文本-图像检索,低注意力区域编码增强跨模态检索。

AI总结 提出LARE框架,通过并行编码低注意力区域和完整图像,解决拥挤场景下视觉编码器忽视关键细节的问题,在密集场景子集上提升检索性能。

Comments Accepted at the ICML 2026 Workshop on Efficient Multimodal Question Answering (EMM-QA). Code: https://github.com/AbdulmalikDS/LARE ; Dataset: https://huggingface.co/datasets/AbdulmalekDS/Dense-Set

详情
AI中文摘要

拥挤场景中的图像检索尤其具有挑战性,因为传统视觉编码器存在显著性偏差,倾向于关注主要对象而忽略低注意力区域,而这些区域通常对细粒度检索至关重要。我们提出了LARE(低注意力区域编码),一个显式建模这些被忽略区域的框架。LARE采用双编码策略,并行编码图像的低注意力区域和完整图像,从而产生更多样化和信息丰富的图像嵌入。为了评估拥挤场景下的图像检索性能,我们引入了Dense-Set,一个源自COCO和Flickr30K的具有挑战性的子集。在该子集中,图像被重新标注,以提供对低注意力或先前被忽略区域的更丰富描述。该数据集突显了现有检索模型的局限性,并能够在密集拥挤场景条件下进行更严格的评估。实验结果表明,所提出的框架通过在共享潜在空间中保留微妙的非主导视觉线索来提高检索性能。

英文摘要

Image retrieval in crowded scenes is particularly challenging due to the salience bias of conventional visual encoders, which tend to focus on dominant objects while neglecting low-attention regions that are often crucial for fine-grained retrieval. We propose LARE (Low-Attention Region Encoding), a framework that explicitly models these overlooked regions. LARE adopts a dual-encoding strategy that encodes low-attention regions of an image and the full image in parallel, leading to more diverse and informative image embeddings. To evaluate image retrieval performance in challenging crowded scenes, we introduce Dense-Set, a challenging subset derived from COCO and Flickr30K. In this subset, images are re-captioned to provide richer descriptions of low-attention or previously overlooked regions. This dataset highlights the limitations of existing retrieval models and enables a more rigorous evaluation under densely crowded scene conditions. Experimental results demonstrate that the proposed framework improves retrieval performance by preserving subtle, non-dominant visual cues within the shared latent space.

2606.18558 2026-06-18 cs.CV 新提交 80%

MolmoMotion: Forecasting Point Trajectories in 3D with Language Instruction

MolmoMotion: 基于语言指令的3D点轨迹预测

Jianing Zhang, Chenhao Zheng, Yajun Yang, Max Argus, Rustin Soraki, Winson Han, Taira Anderson, Chun-Liang Li, Shuo Liu, Jiafei Duan, Zhongzheng Ren, Jieyu Zhang, Ranjay Krishna

发表机构 * Allen Institute for AI(艾伦人工智能研究所) University of Washington(华盛顿大学) UNC-Chapel Hill(北卡罗来纳大学教堂山分校)

专题命中 音视频/视觉语言融合 :基于语言指令预测3D点轨迹,涉及视觉与语言融合。

AI总结 提出一种基于语言指令的3D点运动预测方法,通过构建大规模数据集和基准,实现类无关、视角稳定的运动轨迹预测,并在机器人操作和视频生成中验证其有效性。

详情
AI中文摘要

运动预测是视觉智能的核心:智能体必须预测物体如何运动,以规划行动、推理物理交互并合成逼真的未来场景。我们认为,世界坐标系中的3D点提供了一种通用表示,具有类无关、视角稳定、紧凑且对下游任务直接有用的特性。我们形式化了目标条件3D点运动预测任务:给定一段短视觉历史、目标物体上的一组3D查询点以及预期目标的语言描述,模型预测每个点的未来3D轨迹。我们引入了一个完整的堆栈来大规模研究此任务:(1) MolmoMotion-1M是一个大型语料库,包含从116万无约束视频中标注的动作描述、物体锚定的3D点轨迹;(2) PointMotionBench是一个人工验证的基准,涵盖111个物体类别和61种运动类型;(3) MolmoMotion是一个通用运动预测模型,支持自回归坐标预测和基于流匹配的轨迹生成。MolmoMotion能准确预测不同语言指令下的多样运动模式,并在PointMotionBench上显著优于现有运动预测基线。最后,我们展示了学习到的3D运动先验能很好地迁移到下游应用:它提高了机器人操作的训练效率和泛化能力,其预测轨迹为生成模型提供了有效的运动指导,以合成具有更真实物体运动的视频。

英文摘要

Motion forecasting is central to visual intelligence: agents must anticipate how objects will move in order to plan actions, reason about physical interactions, and synthesize realistic futures. We argue that 3D points in world coordinates provide a general representation that is class-agnostic, view-stable, compact, and directly useful for downstream tasks. We formalize the task of goal-conditioned 3D point motion forecasting: given a short visual history, a set of 3D query points on an object of interest, and a language description of the intended goal, the model predicts the future 3D trajectory of each point. We introduce a full stack to study this task at scale: (1) MolmoMotion-1M is a large corpus of action-described, object-grounded 3D point trajectories annotated from 1.16M unconstrained videos; (2) PointMotionBench is a human-verified benchmark spanning 111 object categories and 61 motion types; and (3) MolmoMotion is a general motion forecasting model that supports both autoregressive coordinate prediction and flow-matching-based trajectory generation. MolmoMotion accurately predicts diverse motion patterns with different language instructions, and significantly outperforms existing motion prediction baselines on PointMotionBench. Finally, we show that the learned 3D motion prior transfers well to downstream applications: it improves training efficiency and generalization for robot manipulation, and its predicted trajectories provide effective motion guidance for generative models to synthesize videos with more realistic object motion.

2606.18441 2026-06-18 cs.CV 新提交 80%

Reasoning as Intersection: Consensus-Frame Alignment for Visual Focus in Video-MLLMs

推理即交集:视频多模态大语言模型中视觉焦点的一致性帧对齐

Chengwen Liu, Zhe Huang, Jisheng Dang, Hong Peng, Qi Tian, Tat-Seng Chua

发表机构 * School of Information Science and Engineering, Lanzhou University(兰州大学信息科学与工程学院) Beijing University of Posts and Telecommunications(北京邮电大学) Cloud and AI BU, Huawei(华为云与AI业务部) School of Computing, National University of Singapore(新加坡国立大学计算机学院)

专题命中 音视频/视觉语言融合 :视频多模态大语言模型推理,融合视频帧与语言。

AI总结 提出无时间标注的过程级奖励框架CF-GRPO,通过视频内在线索构建一致性帧先验,并利用一致性帧奖励优化模型帧使用与先验的对齐,提升视频推理性能。

详情
AI中文摘要

强化学习提升了大型语言模型的推理能力,但将仅结果奖励应用于视频多模态大语言模型(Video-MLLMs)时,对哪些视觉证据应支持答案提供的指导有限。受多感官整合启发(其中一致的线索可以增强感知估计的显著性和可靠性),我们引入了一致性帧GRPO(CF-GRPO),一种无需时间标注的过程级奖励框架,用于证据感知的视频推理。CF-GRPO从内在视频线索中构建一致性帧先验,包括时间覆盖、场景转换线索和查询条件化的视觉相关性。然后,它从视觉和响应表示中计算模型侧的帧使用分数,并通过一致性帧奖励(CFR)优化它们的一致性。通过显著性感知的稀疏聚合和分布锐化,CFR提供了高对比度的奖励信号,无需人工时间标注。实验表明,VideoCFR在复杂视频推理基准上取得了有竞争力的性能,并在多个指标上优于代表性的Video-MLLM和RL基线,同时一致性先验提供了训练中强调的证据帧的可解释视图。实现代码见:https://this https URL。

英文摘要

Reinforcement learning has improved the reasoning ability of large language models, but applying outcome-only rewards to video multimodal large language models (Video-MLLMs) provides limited guidance on which visual evidence should support the answer. Inspired by multisensory integration, where consistent cues can enhance the salience and reliability of perceptual estimates, we introduce Consensus Frame GRPO (CF-GRPO), a temporal-annotation-free process-level reward framework for evidence-aware video reasoning. CF-GRPO constructs a consensus frame prior from intrinsic video cues, including temporal coverage, scene-transition cues, and query-conditioned visual relevance. It then computes a model-side frame-use score from visual and response representations and optimizes their agreement through the Consensus Frame Reward (CFR). With salience-aware sparse aggregation and distribution sharpening, CFR provides a high-contrast reward signal without requiring human temporal annotations. Experiments show that VideoCFR achieves competitive performance across complex video reasoning benchmarks and improves several metrics over representative Video-MLLM and RL baselines, while the consensus prior provides an interpretable view of the evidence frames emphasized during training. The implementation is available at https://github.com/1Pansy/VideoCFR.

2606.19338 2026-06-18 cs.CV 新提交 75%

Beyond the Current Observation: Evaluating Multimodal Large Language Models in Controllable Non-Markov Games

超越当前观测:评估多模态大语言模型在可控非马尔可夫博弈中的表现

Shengyuan Ding, Xilin Wei, Xinyu Fang, Haodong Duan, Dahua Lin, Jiaqi Wang, Yuhang Zang

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Zhejiang University(浙江大学) The Chinese University of Hong Kong(香港中文大学)

专题命中 音视频/视觉语言融合 :评估多模态大模型在非马尔可夫博弈中的表现

AI总结 提出RNG-Bench基准套件,通过配对记忆和3D迷宫两个博弈,评估多模态大模型在非马尔可夫环境中重建历史观测并据此行动的能力,发现主要错误源于遗忘而非决策,微调可提升性能。

详情
AI中文摘要

将多模态基础模型部署为闭环策略时,越来越需要基于不再可见的观测来调节动作。然而,现有基准要么暴露完整状态,将隐藏状态重建与其他智能体技能混为一谈,要么仅在回合结束后测试记忆。我们引入了RNG-Bench(重建性非马尔可夫博弈),这是一个基准套件,旨在隔离基础模型在多步交互中重建过去观测并据此行动的能力。RNG-Bench包含两个互补的博弈:配对记忆,其中卡片身份在特定位置短暂显示后需被回忆;以及3D迷宫,其中自我中心视图需整合为空间地图。两个博弈都在统一的测试框架下评估,具有三个可控难度轴:网格大小、视觉模式和观测模态。该基准进一步引入了头对头对决协议以控制实例级方差,以及记忆差距指标,将遗忘与不良动作选择区分开来。最难的配置需要大约128K个token和每回合350个图像输入,前沿MLLMs远未饱和。记忆差距分析表明,大多数残余错误源于遗忘较早的观测,而非次优决策。最后,在最优策略轨迹和过滤后的模型演示上微调Qwen3.5-9B,提高了RNG-Bench的性能,并迁移到现有基准,而不降低通用多模态能力。

英文摘要

Deploying multimodal foundation models as closed-loop policies increasingly requires conditioning actions on observations that are no longer visible. However, existing benchmarks either expose the full state, conflate hidden-state reconstruction with other agent skills, or test recall only after an episode has ended. We introduce RNG-Bench (Reconstructive Non-Markov Games), a benchmark suite designed to isolate a base model's ability to reconstruct past observations and act on them during multi-step interaction. RNG-Bench includes two complementary games: Matching Pairs, where card identities briefly revealed at specific locations must later be recalled, and 3D Maze, where egocentric views must be integrated into a spatial map. Both games are evaluated under a unified harness with three controlled difficulty axes: grid size, visual pattern, and observation modality. The benchmark further introduces a head-to-head duel protocol to control for instance-level variance and a Memory Gap metric that disentangles forgetting from poor action selection. The hardest configurations require contexts of roughly 128K tokens and 350 image inputs per episode, and remain far from saturated by frontier MLLMs. Memory Gap analysis shows that most residual errors stem from forgetting earlier observations rather than from suboptimal decision making. Finally, fine-tuning Qwen3.5-9B on optimal-policy rollouts and filtered model demonstrations improves performance on RNG-Bench and transfers to existing benchmarks without degrading general multimodal capability.

2606.19297 2026-06-18 cs.LG cs.RO 新提交 75%

Does VLA Even Know the Basics? Measuring Commonsense and World Knowledge Retention in Vision-Language-Action Models

VLA 甚至知道基础知识吗?衡量视觉-语言-动作模型中的常识和世界知识保留

Nikita Kachaev, Andrey Moskalenko, Matvey Skripkin, Nikita Kurlaev, Daria Pugacheva, Albina Burlova, Mikhail Kolosov, Denis Shepelev, Andrey Kuznetsov, Elena Tutubalina, Aleksandr I. Panov, Alexey K. Kovalev, Vlad Shakhuro

发表机构 * CogAI Lab(CogAI实验室) FusionBrain Lab(FusionBrain实验室) IAI MSU(MSU人工智能研究所) Lomonosov MSU(Lomonosov莫斯科大学) NUST MISIS Applied AI Institute(应用人工智能研究所) HSE University(俄罗斯高等经济大学) Generalizable AI Systems(可泛化人工智能系统) ISP RAS(俄罗斯科学院信息与自动化过程研究所) MIRAI Domain-specific NLP Group(领域特定自然语言处理小组)

专题命中 音视频/视觉语言融合 :评估视觉-语言-动作模型的知识保留

AI总结 提出 Act2Answer 协议,通过动作回答评估 VLA 模型的知识保留,发现模型在简单概念上表现良好,但在丰富语义类别上存在差距,且 VQA 联合训练有助于知识保留。

Comments Project page: https://tttonyalpha.github.io/act2answer/

详情
AI中文摘要

具身视觉-语言-动作(VLA)模型通常通过在机器人数据上微调强大的预训练 VLM 获得,但目前尚不清楚它们在适应后保留了多少常识和事实知识。在知识敏感任务上的失败是模糊的,混淆了知识缺失与低级控制泛化能力差。我们引入 Act2Answer,一种轻量级协议,通过要求智能体通过动作来回答,将 VLM 知识基准适配到 VLA 评估。每个问题变成一个简短的桌面场景,其中智能体执行单个物体放置动作以选择候选答案,从而产生动作基础的、减少控制混淆的成功率。我们在不同的常识和世界知识类别中策划了这样的环境测试套件,并引入逐层意图探测以定位 VLM 骨干和动作头中与答案相关的信息。在对 7 个 VLA 模型和 9 个 VLM 基线的大规模研究中,我们系统地跨类别对模型进行排名,发现 VLA 在简单概念上表现稳健,但在更丰富的语义类别上相对于其源 VLM 显示出更大的差距,VQA 联合训练与更好的知识保留相关,并且答案相关信号在 VLA 中间层达到峰值,但在上层减弱。Act2Answer 可在以下网址获取:此 https URL。

英文摘要

Embodied Vision-Language-Action (VLA) models are typically obtained by fine-tuning powerful pretrained VLMs on robotics data, yet it is unclear how much commonsense and factual knowledge they retain after adaptation. Failures on knowledge-sensitive tasks are ambiguous, conflating missing knowledge with poor generalization of low-level control. We introduce Act2Answer, a lightweight protocol that adapts VLM knowledge benchmarks to VLA evaluation by requiring agents to answer through action. Each question becomes a short tabletop episode where the agent performs a single object-placement action to select among candidate answers, yielding an action-grounded success rate with reduced control confounds. We curate a test suite of such environments across diverse commonsense and world-knowledge categories and introduce layerwise intent probing to localize answer-relevant information across the VLM backbone and action head. In a large-scale study of 7 VLA models and 9 VLM baselines, we systematically rank models across categories, finding that VLAs show solid performance on simple concepts while exhibiting larger gaps on richer semantic categories relative to their source VLMs, that VQA co-training is associated with better knowledge retention, and that answer-relevant signals peak in middle VLA layers but attenuate in upper layers. Act2Answer is available at https://tttonyalpha.github.io/act2answer/.

2606.19161 2026-06-18 cs.RO 新提交 75%

HT-Bench: Benchmarking and Learning Dexterous Full-Hand Tactile Representations with Egocentric Vision

HT-Bench:基于自我中心视觉的灵巧全手触觉表示基准与学习

Yuzhe Huang, Jiaping Wu, Jiaming Jiang, Hezhe Lin, Aikebaier Aierken, Yunlong Wang, Kun Cheng, Ziyuan Jiao, Yuanxin Zhong

发表机构 * Beihang University(北航) Rimbot BUPT(北邮) ShanghaiTech University(上海科技大学) Tsinghua University(清华大学) CAS(中国科学院)

专题命中 音视频/视觉语言融合 :对齐触觉与视觉信息,多模态表示学习

AI总结 提出HT-Bench多任务基准和HandTouch编码器,通过大规模自我中心视觉与全手触觉数据,在触觉相似性检索、掩码修复、视觉到触觉合成等任务上验证了触觉表示的有效性。

Comments 9pages, 4figures

详情
AI中文摘要

由于触觉传感器设计、数据格式和机器人形态的多样性,为机器人操作中的触觉表示学习建立通用基准仍然具有挑战性。我们并未试图建立这样的基准,而是探索了一个可扩展且有前景的未来发展方向:将自我中心视觉与全手触觉数据配对。为此,我们引入了\ extbf{HT-Bench},一个用于灵巧全手触觉感知的大规模多任务基准,包含在226个任务中收集的1000万RGB帧和780万触觉帧。HT-Bench从三个关键角度评估触觉表示:它们是否编码有意义的接触几何、是否能够将触觉观测与视觉信息对齐、以及是否能够泛化到未见任务。为评估这些能力,HT-Bench包含四个任务:细粒度触觉相似性检索、掩码触觉修复、视觉到触觉合成以及多模态触觉帧预测。我们进一步提出了\ extbf{HandTouch},一个矢量量化视觉-触觉编码器,通过渐进的空间、跨模态和时间训练学习触觉表示。在HT-Bench上,HandTouch始终优于代表性的触觉编码器基线,将细粒度触觉相似性检索的Recall@5从74.65%提高到85.23%,将掩码触觉修复的RMSE从0.022降低到0.010,并将视觉到触觉合成的OOD cIoU从0.628提高到0.705。这些结果证明了HandTouch的有效性,并表明大规模自我中心全手触觉数据为评估和推进灵巧操作中的触觉表示学习提供了可扩展的基础。

英文摘要

Establishing a universal benchmark for tactile representation learning in robotic manipulation remains challenging due to the diversity of tactile sensor designs, data formats, and robot embodiments. Rather than seeking to establish such, we explore a scalable and promising direction for future development: egocentric vision paired with full-hand tactile data. To this end, we introduce \textbf{HT-Bench}, a large-scale multi-task benchmark for dexterous full-hand tactile sensing, comprising 10M RGB frames and 7.8M tactile frames collected across 226 tasks. HT-Bench evaluates tactile representations from three key perspectives: whether they encode meaningful contact geometry, whether they can align tactile observations with visual information, and whether they generalize to unseen tasks. To assess these capabilities, HT-Bench includes four tasks: fine-grained tactile similarity retrieval, masked tactile inpainting, vision-to-tactile synthesis, and multimodal tactile frame prediction. We further propose \textbf{HandTouch}, a vector-quantized vision--tactile encoder that learns tactile representations through progressive spatial, cross-modal, and temporal training. Across HT-Bench, HandTouch consistently outperforms representative tactile encoder baselines, improving Recall@5 on fine-grained tactile similarity retrieval from 74.65\% to 85.23\%, reducing RMSE on masked tactile inpainting from 0.022 to 0.010, and increasing OOD cIoU on vision-to-tactile synthesis from 0.628 to 0.705. These results demonstrate the effectiveness of HandTouch and suggest that large-scale egocentric full-hand tactile data provides a scalable basis for evaluating and advancing tactile representation learning in dexterous manipulation.

2606.19088 2026-06-18 cs.RO 新提交 75%

ReSiReg: Towards Spatially Consistent Semantics in Language-Conditioned Robotic Tasks

ReSiReg:面向语言条件机器人任务的空间一致语义

Simon Schwaiger, David Seyser, Alessandro Scherl, Wilfried Wöber, Gerald Steinbauer-Wagner

发表机构 * Graz University of Technology, Institute of Software Engineering and Artificial Intelligence(格拉茨技术大学,软件工程与人工智能研究所) University of Applied Sciences Technikum Wien, Department of Industrial Engineering(维也纳应用科技大学,工业工程系) University of Alicante, Department of Computer Technology(阿利坎特大学,计算机技术系) University of Natural Resources and Life Sciences, Institute for Integrative Nature Conservation Research(自然资源与生命科学大学,整合自然保护研究 institute)

专题命中 音视频/视觉语言融合 :改进VLM特征空间一致性用于语言接地

AI总结 提出ReSiReg方法,通过重构空间一致的VLM中间特征,改善密集语言接地检索,在OVSS和3D映射中提升空间一致性,并发布紧凑的25M参数VLM模型。

详情
AI中文摘要

视觉-语言模型(VLM)使机器人能够遵循开放语言指令。然而,密集的VLM嵌入已被证明存在噪声且缺乏空间一致性。这对于需要同时推理语义和3D空间的机器人应用来说是有问题的。我们研究了近期VLM的空间结构,并提出了ReSiReg,一种特征重构方法,利用空间一致的VLM中间特征来改善密集语言接地检索。ReSiReg将中间特征聚类为视觉原型,推导其语言描述符,并将每个补丁重构为原型级语言嵌入的软混合。我们在OVSS和3D映射上跨骨干网络进行定量评估,并在真实世界操作场景中进行定性评估。定量结果显示密集检索得到改善;操作场景显示出更空间一致的目标激活。我们进一步为机器人应用提供了一个紧凑的25M密集VLM,远小于ViT-B基线且具有竞争力。可从此网址获取。

英文摘要

Vision-Language Models (VLMs) enable robots to follow open-language instructions. However, dense VLM embeddings have shown to be noisy and lack spatial consistency. This is problematic for robotic applications, which require simultaneous reasoning over semantics and 3D space. We examine spatial structure across recent VLMs and propose ReSiReg, a feature reconstruction method that uses spatially consistent VLM intermediates to improve dense language-grounded retrieval. ReSiReg clusters intermediates into visual prototypes, derives their language descriptors, and reconstructs each patch as a soft mixture of prototype-level language embeddings. We evaluate quantitatively on OVSS and 3D mapping across backbones, and qualitatively in real-world manipulation scenes. Quantitative results show improved dense retrieval; manipulation scenes show more spatially consistent target activations. We further provide a compact 25M dense VLM for robotic applications, substantially smaller than and competitive with ViT-B baselines. Available at https://resireg.github.io

2606.18955 2026-06-18 cs.CV cs.RO 新提交 75%

Motion-Focused Latent Action Enables Cross-Embodiment VLA Training from Human EgoVideos

运动聚焦的潜在动作使跨实体VLA训练能从人类自我中心视频中学习

Runze Xu, Yiluo Zhang, Jian Wang, Yu Wang, Jincheng Yu

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) Tianfu Jiangxi Laboratory(天府江西实验室)

专题命中 音视频/视觉语言融合 :从人类视频提取动作先验,涉及视觉语言融合。

AI总结 提出基于潜在动作的框架,利用混合解耦VQ-VAE从无标签人类视频中提取通用动作先验,通过意图-感知解耦策略减少动作幻觉,仅需50条轨迹即可适配下游任务。

Comments Accepted to IROS 2026

详情
AI中文摘要

训练通用视觉-语言-动作(VLA)模型通常需要大量、多样化的机器人数据集,并带有高保真动作标注。尽管自我中心的人类操作视频丰富且捕捉了显著的环境多样性,但缺乏动作标签使其难以在传统训练范式下使用。为解决这一问题,我们提出了一种基于潜在动作的框架,旨在从无标签人类视频中提取通用动作先验。该架构采用混合解耦VQ-VAE,通过物理掩码将运动动态与环境背景解耦,从而构建跨实体动作码本。通过在人类视频上使用码本进行预训练,VLM骨干网络学习到动作意图的深层表示。为了适应特定实体,我们引入了一种意图-感知解耦策略,其中VLM预测动作意图,而一个独立的冻结视觉编码器为动作专家提供状态特定特征,从而减少动作幻觉。在仿真和真实环境中的结果表明,我们的方法仅在无标签人类视频上预训练,与在大量标注数据集上训练的最先进VLA模型相比具有竞争力,且仅需50条轨迹进行下游适配。

英文摘要

Training generalist Vision-Language-Action(VLA) models typically requires massive, diverse robotic datasets with high-fidelity action annotations. While egocentric human manipulation videos are abundant and capture significant environmental diversity, the absence of action labels makes them difficult to use in conventional training paradigms. To address this, we propose a latent-action-based framework designed to extract general action priors from unlabeled human videos. The architecture features a Hybrid Disentangled VQ-VAE that decouples motion dynamics from environmental backgrounds through physical masks, enabling the construction of a cross-embodiment action codebook. By pre-training on human videos with the codebook, the VLM backbone learns deep representations of action intent. For adaptation to specific embodiments, we introduce an intent-perception decoupling strategy where the VLM predicts the action intent while a separate frozen visual encoder provides state-specific features to the action expert, thereby reducing action hallucinations. Results in simulation and real-world environments show that our method, pre-trained exclusively on unlabeled human videos, performs competitively with state-of-the-art VLA models trained on massive annotated datasets, requiring only 50 trajectories for downstream adaptation.

2606.18846 2026-06-18 cs.CV 新提交 75%

From Bounding Boxes to Visual Reasoning: An On-Policy Data Annotation Tool for Vision-Language Models

从边界框到视觉推理:一种用于视觉语言模型的在线策略数据标注工具

Like Zhang, Runliang Niu, Shiqi Wang, Xiyu Hu, Qianli Xing, Pan Wang, Qingzu He, Qi Wang

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) College of Computer Science, Jilin University(吉林大学计算机科学与技术学院) OPPO

专题命中 音视频/视觉语言融合 :提出视觉语言模型标注工具,涉及视觉与语言模态融合。

AI总结 提出ScreenAnnotator,通过统一标注原子模式、在线策略循环与贝叶斯验证器,解决现有工具表达力不足、标注-训练脱节和数据复用性差的问题,实现高效多任务数据生成。

Comments 14 pages, 7 figures

详情
AI中文摘要

视觉语言模型(VLM)正快速向复杂的基于基础的结构化视觉推理发展。训练具备此类高级能力的模型需要一种新型数据,该数据能将空间坐标、开放词汇描述、结构化属性和拓扑关系无缝统一为单一表示。然而,现有数据标注工具从根本上无法满足这些复杂需求,存在三个系统性瓶颈:表达力有限、严重的标注-训练解耦以及数据复用性差。为弥补这一基础设施差距,我们引入了一个开源标注工具ScreenAnnotator。首先,我们定义了一个统一的标注原子模式,将空间、语义和结构基元绑定为单个单元。其次,我们实现了一个嵌入贝叶斯标注验证器(BAV)的在线策略标注循环。最后,我们设计了一个模板驱动的多任务数据合成过程,动态地将静态原子转化为多样化的多维推理任务,消除了冗余的重新标注。在线策略循环将流程图上的标注接受率提升至近100%,GUI截图上的接受率达到77%,同时随着标注数据的积累,每张图像的标注时间稳步减少。在流程图场景中,微调VLM的平均准确率达到76.1%,绝对提升了35.1个百分点。我们的代码可在以下网址获取:this https URL。

英文摘要

Vision-language models (VLMs) are rapidly advancing toward sophisticated grounded structured visual reasoning. Training models for such advanced capabilities demands a new genre of data that seamlessly unifies spatial coordinates, open-vocabulary descriptions, structured attributes, and topological relationships into a singular representation. However, existing data annotation tools fundamentally fail to meet these intricate demands, suffering from three systematic bottlenecks: limited expressiveness, severe annotation-training decoupling, and poor data reusability. To bridge this infrastructure gap, we introduce an open-source annotation tool, ScreenAnnotator. First, we define a unified annotation atom schema that binds spatial, semantic, and structural primitives into a single unit. Second, we implement an on-policy annotation loop embedded with a Bayesian Annotation Verifier (BAV). Finally, we design a template-driven multi-task data synthesis process dynamically transforms static atoms into diverse multi-dimensional reasoning tasks, eliminating redundant re-annotation. The on-policy loop drives the annotation accept rate to nearly 100% on flowcharts and 77% on GUI screenshots, while steadily reducing per-image annotation time as labeled data accumulate. In the flowchart scenario, fine-tuning a VLM yields 76.1% average accuracy, which is a 35.1% point absolute gain. Our code is available at: https://github.com/WnQinm/Annotator.

2606.06926 2026-06-18 cs.CV cs.MM 新提交 75%

SVHighlights: Towards Extremely Long Sport Video Highlight Detection

SVHighlights: 迈向极长体育视频精彩片段检测

Donggyu Lee, Youngbin Ki, Jeonghun Kang, Taehwan Kim

发表机构 * Ulsan National Institute of Science and Technology(釜山国立科学研究院)

专题命中 音视频/视觉语言融合 :利用大语言模型融合多模态信息检测体育视频精彩片段

AI总结 针对现有方法无法处理超长视频精彩片段检测的问题,提出首个基准SVHighlights(包含320个平均时长2小时的体育视频)以及无训练的分段方法TF-SELECTOR,通过大语言模型融合多模态信息预测片段级显著性分数,在多个指标上超越现有基线。

Comments Accepted to KDD 2026 (Datasets and Benchmarks Track). Project Page: https://leedongkyu2019.github.io/SVHighlights/

详情
AI中文摘要

尽管长视频的精彩片段检测具有重要的实际意义,但现有方法大多局限于短视频内容,这主要是由于缺乏合适的基准。为了填补这一空白,我们引入了SVHighlights,据我们所知,这是首个针对极长体育视频(每段时长超过一小时,涵盖多种体育类别)精彩片段检测的基准。SVHighlights是通过一个数据集生成流水线,从完整体育视频及其对应的官方精彩片段视频对构建而成,无需传统的逐片段显著性标注即可实现可扩展的标签生成。该基准包含320个视频,平均时长2.00小时,总时长640.18小时,显著超过以往的数据集。现有方法在长视频上也面临根本性挑战:在短视频片段上训练的模型无法泛化到小时级内容,并且它们的片段级评分缺乏识别精彩片段所需的更广泛上下文。为了解决这一问题并提供一个强基线,我们提出了TF-SELECTOR,一种无需训练的基于分段的方法,该方法通过合并相邻的具有相同语义内容的镜头,将每个视频划分为上下文感知的分段,并使用多模态输入(包括视觉描述、转录文本和音频音量)的大语言模型预测分段级显著性分数。实验表明,与视频时间定位(VTG)微调的基线相比,TF-SELECTOR在大多数指标上取得了更优的性能,在HIT@1上提升+3.12,在HIT@K上提升+4.06,在IoU上提升+2.95。这些结果确立了SVHighlights作为长视频精彩片段检测的具有挑战性的测试平台,并证明了简单的基于分段的策略可以有效地扩展到小时级视频。

英文摘要

While highlight detection for long-form videos is of great practical importance, most existing methods remain limited to short-form content, largely due to the absence of a suitable benchmark. To bridge this gap, we introduce SVHighlights, to the best of our knowledge, the first benchmark for highlight detection in extremely long sports videos, each exceeding one hour in duration, across multiple sports categories. SVHighlights is constructed from pairs of full-length sports videos and their corresponding official highlight videos using a dataset generation pipeline, enabling scalable label generation without conventional per-clip saliency annotation. The benchmark comprises 320 videos with an average duration of 2.00 hours and a total of 640.18 hours, substantially exceeding previous datasets. Existing methods also face fundamental challenges on long videos: models trained on short clips fail to generalize to hour-long content, and their clip-level scoring lacks the broader context needed to identify highlights. To address this and provide a strong baseline, we present TF-SELECTOR, a training-free segment-based approach that divides each video into context-aware segments by merging adjacent shots sharing the same semantic content, and predicts segment-level saliency scores using a large language model with multimodal inputs including visual captions, transcripts, and audio volume. Experiments demonstrate that TF-SELECTOR achieves superior performance across most metrics compared to Video Temporal Grounding (VTG)-tuned baselines, with improvements of +2.50 in HIT@1, +4.04 in HIT@K, and +2.95 in IoU. These results establish SVHighlights as a challenging testbed for long-form highlight detection and demonstrate that a simple segment-based strategy can effectively scale to hour-long videos.

2606.19120 2026-06-18 cs.LG cs.CV 新提交 70%

Seeing Before Reasoning: Decoupling Perception and Reasoning for Shortcut-Resilient Multimodal On-Policy Self-Distillation

先看后思:解耦感知与推理以实现抗捷径的多模态在策略自蒸馏

Sihan Wang, Xiyao Liu, Lianqing Liu, Zhi Han

发表机构 * State Key Laboratory of Robotics and Intelligent Systems, Shenyang Institute of Automation, Chinese Academy of Sciences(机器人与智能系统国家重点实验室,沈阳自动化研究所,中国科学院) University of Chinese Academy of Sciences(中国科学院大学)

专题命中 音视频/视觉语言融合 :多模态大语言模型后训练,融合视觉与语言

AI总结 提出ViGOS框架,通过解耦感知和推理,在MLLM后训练中避免文本捷径,提升图像依赖行为。

Comments 29 pages, 5 figures, 8 tables

详情
AI中文摘要

在策略自蒸馏(OPSD)训练模型在其自身rollouts上,并使用冻结副本提供基于参考目标的密集token级目标。这对于LLM推理效果良好,但直接扩展到多模态大语言模型(MLLMs)可能产生捷径:特权目标可能主要基于文本参考目标而非图像来引导token。我们提出ViGOS,一种视觉引导的OPSD框架用于MLLM后训练。学生首先编写视觉描述,然后推理出最终答案。对于有效rollouts,仅图像的感知教师监督描述,而特权推理教师监督同一学生前缀上的推理和最终答案。仅对无效rollouts使用参考教师以恢复输出格式。在通用视觉-语言、专家推理、视觉数学、空间定位和视觉-语言先验基准测试中,ViGOS保持了OPSD的主要优势,并在易产生捷径的设置中改善了图像引导行为。

英文摘要

On-policy self-distillation (OPSD) trains a model on its own rollouts and uses a frozen copy to provide dense token-level targets conditioned on a reference target. This works well for LLM reasoning, but a direct extension to multimodal large language models (MLLMs) can create a shortcut: the privileged target may guide tokens mainly based on the text reference target rather than the image. We propose ViGOS, a visually grounded OPSD framework for MLLM post-training. The student first writes a visual description and then reasons toward the final answer. For valid rollouts, an image-only perception teacher supervises the description, while a privileged reasoning teacher supervises the reasoning and final answer on the same student prefix. A reference teacher is used only for invalid rollouts to recover the output format. Across general vision-language, expert reasoning, visual math, spatial grounding, and visual-language-prior benchmarks, ViGOS keeps the main benefits of OPSD and improves image-grounded behavior in shortcut-prone settings.

2606.18839 2026-06-18 cs.LG cs.CV 新提交 70%

Semantic Robustness Certification for Vision-Language Models

视觉语言模型的语义鲁棒性认证

Peiyu Yang, Paul Montague, Feng Liu, Andrew C. Cullen, Amardeep Kaur, Christopher Leckie, Sarah M. Erfani

发表机构 * School of Computing \& Information Systems, University of Melbourne, Australia

专题命中 音视频/视觉语言融合 :认证视觉语言模型鲁棒性,涉及视觉与文本语义融合。

AI总结 提出首个无需额外数据即可认证视觉语言模型在语义层面(如形状、大小、风格)鲁棒性的框架,通过文本提示作为语义代理并量化决策边界,确保预测类别在语义变换下不变。

Comments Accepted to ICML

详情
AI中文摘要

视觉语言模型(VLM)现在被广泛用于下游任务。然而,现实世界的应用常常使VLM面临由语义变化(例如形状、大小和风格)引起的分布偏移。鲁棒性认证确定当对输入应用变换时模型的预测是否改变。虽然大多数认证框架研究输入的几何或像素级变换,但本文提出了一种新颖的框架,能够在语义级变换下认证VLM的鲁棒性。利用VLM的开放词汇能力,我们使用文本提示作为语义代理来构建由控制语义变化程度的范围参数化的变换。通过以封闭形式表征VLM决策边界,我们的框架定量地认证了在语义变换下预测类别保持不变的范围区间。我们的框架是第一个在语义级变化下认证VLM鲁棒性而无需为每种变化提供额外数据的框架,使其易于应用。在合成数据和真实数据上的实验表明,我们的框架能够在各种场景下认证针对多种语义变化的鲁棒性。

英文摘要

Vision-language models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification determines if a model's prediction changes when transformations are applied to its input. While most certification frameworks study geometric or pixel-level transformations over inputs, this work proposes a novel framework that enables certifying VLM robustness under semantic-level transformations. Leveraging the open-vocabulary capability of VLMs, we use text prompts as semantic proxies to construct transformations parameterized by an extent that controls the degree of semantic variation. By characterizing the VLM decision boundary in closed form, our framework quantitatively certifies extent intervals for which the predicted class remains unchanged under the semantic transformation. Our framework is the first to certify VLM robustness under semantic-level variations without requiring additional data for each variation, making it practical to apply. Experiments on both synthetic and real-world data show that our framework enables certifying robustness under diverse semantic variations across scenarios.

2. 多传感器融合 8 篇

2606.18959 2026-06-18 cs.RO 新提交 80%

TactSpace: Learning a Physics-enriched Shared Latent Space for Tactile Sim-to-Real Transfer

TactSpace: 学习富含物理信息的共享潜在空间以实现触觉模拟到现实的迁移

Arunim Joarder, Arjun Bhardwaj, René Zurbrügg, Mayank Mittal, Florin Püntener, Sira Bielefeldt, Cosmin Roman, Vaishakh Patil, Marco Hutter

发表机构 * Robotic Systems Lab, ETH Zürich(瑞士苏黎世联邦理工学院机器人系统实验室) Micro- and Nanosystems Lab, ETH Zürich(瑞士苏黎世联邦理工学院微纳系统实验室) ETH AI Center(苏黎世联邦理工学院人工智能中心) NVIDIA(NVIDIA公司)

专题命中 多传感器融合 :对齐触觉与视觉模态,实现模拟到现实迁移。

AI总结 提出多模态表示学习框架TactSpace,通过共享潜在空间对齐异构触觉模态,实现零样本模拟到现实迁移,在力预测和形状重建任务中分别降低误差16.7%和45.8%。

Comments 9 pages, 6 figures, 4 tables, accepted into IROS 2026

详情
AI中文摘要

触觉传感提供了对机器人操作至关重要的接触相互作用的直接测量。然而,当前的模拟器缺乏足够保真度来忠实模拟触觉传感器的复杂变形和换能机制,严重阻碍了机器人学习流程中的模拟到现实迁移。为了解决这一挑战,我们提出了一种多模态表示学习框架,该框架在共享潜在空间内对齐异构触觉模态,消除了对精确原始信号模拟的需求,同时保留了相关的接触信息。我们的方法采用模态特定编码器将不同的触觉观测(例如模拟穿透深度和真实电容)投影到公共嵌入空间中。该模型使用自重建和交叉重建目标以及对比对齐进行训练,鼓励模态不变且信息丰富的表示。我们在压头形状识别、力预测和几何重建任务上评估学习到的嵌入,仅在模拟中训练并直接在真实传感器测量上测试。我们的结果展示了跨物理不同表示的零样本模拟到现实迁移。此外,结合多物理模拟模态产生了更信息丰富的嵌入,这些嵌入可跨不同下游任务迁移,力预测误差降低16.7%,形状重建误差降低45.8%。最后,我们为Isaac Lab发布了一个基于Warp的高效罚函数触觉模拟模型实现,支持可扩展的触觉数据生成。

英文摘要

Tactile sensing provides direct measurements of contact interactions that are essential for robotic manipulation. However, current simulators lack the fidelity to faithfully model the complex deformation and transduction mechanics of tactile sensors, severely hindering sim-to-real transfer in robot learning pipelines. To address this challenge, we propose a multi-modal representation learning framework that aligns heterogeneous tactile modalities within a shared latent space, eliminating the need for accurate raw-signal simulation while preserving relevant contact information. Our approach employs modality-specific encoders to project diverse tactile observations, such as simulated penetration depth and real-world capacitance, into a common embedding space. The model is trained using self- and cross-reconstruction objectives alongside contrastive alignment, encouraging modality-invariant yet information-rich representations. We evaluate the learned embeddings on indenter shape identification, force prediction, and geometric reconstruction tasks, training exclusively in simulation and testing directly on real sensor measurements. Our results demonstrate zero-shot sim-to-real transfer across physically dissimilar representations. Furthermore, incorporating multi-physics simulation modalities yields more informative embeddings that transfer across diverse downstream tasks, demonstrating a 16.7% reduction in force prediction error and a 45.8% reduction in shape reconstruction error. Finally, we release an efficient Warp-based implementation of a penalty-based tactile simulation model for Isaac Lab, enabling scalable tactile data generation.

2606.18841 2026-06-18 cs.CV 新提交 80%

Rethinking Air-Ground Collaboration: A Progressive Cross-Task Benchmark and Socialized Learning Framework

重新思考空地协作:渐进式跨任务基准与社会化学习框架

Zhoupeng Guo, Yunqi Zhu, Zhihe Fan, Xinjie Yao, Ruipu Zhao, Boan Tao, Yiming Sun, Zhen Wang, Pengfei Zhu

发表机构 * School of Automation, Southeast University(东南大学自动化学院) School of Computer Science and Engineering, University of New South Wales(新南威尔士大学计算机科学与工程学院) School of Sports Training, Tianjin University of Sport(天津体育学院运动训练学院) Faculty of Information Engineering and Automation, Kunming University of Science and Technology(昆明理工大学信息工程与自动化学院) School of Artificial Intelligence, Tianjin University(天津大学人工智能学院) School of Artificial Intelligence, Hebei University of Technology(河北工业大学人工智能学院)

专题命中 多传感器融合 :空地协作感知,融合空中与地面视角的多传感器信息。

AI总结 提出空地渐进协作基准AGPC和社会化协同感知框架SCP,通过双层级路由器实现跨视角跨任务选择性交互,在异构空地感知中提升下游性能7.86%。

详情
AI中文摘要

空地协同感知对于真实世界动态环境中的鲁棒视觉理解至关重要。然而,现有研究通常将协作建模为单任务跨视角融合,忽视了定位、目标关联和细粒度解析之间的功能依赖关系。此外,空中和地面视角的异构性引入了显著的几何、尺度和遮挡差异,使得统一特征共享容易受到负迁移的影响。为解决这些问题,我们将空地感知建模为渐进式跨任务协作任务,并构建了空地渐进协作(AGPC)基准,这是一个包含超过745K原始视频帧的时空对齐基准。基于该基准,我们提出了社会化协同感知(SCP),一个从空中全局定位到地面目标关联和身份感知解析的渐进式协作框架。其核心模块——双层级路由器(DLR),将输入侧的多尺度专家选择与输出侧的任务条件调制解耦,实现了选择性的跨视角和跨任务交互,同时抑制有害干扰。大量实验证明了SCP的有效性。它实现了3.73%的协同进化增益和7.86%的平均下游性能提升。这些结果表明,对于异构空地感知,任务条件协作比统一融合更有效。代码可在该网址获取。

英文摘要

Air-ground collaborative perception is crucial for robust visual understanding in real-world dynamic environments. However, existing studies typically formulate collaboration as single-task cross-view fusion, overlooking the functional dependencies among localization, target association, and fine-grained parsing. In addition, the heterogeneous nature of aerial and ground views introduces substantial geometric, scale, and occlusion discrepancies, making uniform feature sharing vulnerable to negative transfer. To tackle these issues, we model air-ground perception as a progressive cross-task collaboration task and construct the Air-Ground Progressive Collaboration (AGPC) benchmark, a spatio-temporally aligned benchmark comprising more than 745K raw video frames. Built upon this benchmark, we propose Socialized Co-Perception (SCP), a coarse-to-fine framework that organizes collaboration progressively from aerial global localization to ground target association and identity-aware parsing. Its core module, the Dual-Layer Router (DLR), decouples input-side multi-scale expert selection from output-side task-conditioned modulation, enabling selective cross-view and cross-task interaction while suppressing harmful interference. Extensive experiments demonstrate the effectiveness of SCP. It achieves a 3.73\% coevolutionary gain and a 7.86\% improvement in average downstream performance. These results show that task-conditioned collaboration is more effective than uniform fusion for heterogeneous air-ground perception. The code is available at https://github.com/g1136639260-spec/AGSCP.

2606.18948 2026-06-18 cs.RO 新提交 75%

C-ARC: Continuous-Adaptive Range Clustering for Non-Repetitive LiDAR Sensors

C-ARC: 面向非重复式LiDAR传感器的连续自适应范围聚类

Nick B. Schroeder, Jonathan Lichtenfeld, Oskar von Stryk

发表机构 * Technical University of Darmstadt(德累斯顿技术大学) Simulation, Systems Optimization and Robotics Group(仿真、系统优化与机器人组)

专题命中 多传感器融合 :非重复式LiDAR点云聚类,属于传感器融合。

AI总结 提出C-ARC框架,通过滑动窗口上的持久双图结构解耦高频点插入与按需聚类检索,并利用指数控制环自适应校准网格分辨率,实现非重复式LiDAR点云的实时聚类。

Comments Submitted to IEEE Robotics and Automation Letters. This work has been submitted to the IEEE for possible publication. 8 pages, 7 figures

详情
AI中文摘要

实时LiDAR聚类识别点云中的结构,是许多移动机器人算法的重要前提。当前方法主要针对重复式机械LiDAR传感器开发。近年来,由于成本和外形尺寸小,非重复式LiDAR传感器的使用显著增加。这类基于Risley棱镜的非重复传感器违反了重复式机械传感器的两个关键假设:结构化的扫描线和明确的帧边界。其Rhodonea曲线轨迹产生非均匀点分布,且缺乏旋转周期使得传统扫描线索引无法适用。为满足这些新需求,我们开发了C-ARC,一个连续自适应范围聚类框架,它在滑动窗口上维护一个持久双图,将高频点插入与按需聚类检索解耦。这对于SLAM或跟踪等关键功能至关重要。自适应范围网格分辨率机制在初始化时使用指数控制环校准网格尺寸,无需预先了解扫描模式即可平衡稀疏-碰撞权衡。作为开源的单线程C++17库实现,C-ARC在商用硬件上对Livox Mid-360以20 Hz产生实时聚类输出。在Livox Avia上的评估表明,对于扫描模式高度集中的传感器,无界单元占用是主要限制。自适应分辨率机制还提高了现有基于网格的方法在非重复数据上的聚类质量。

英文摘要

Real-time LiDAR clustering identifies structures in point clouds, which is an essential prerequisite for many mobile robotics algorithms. Current methods are mostly developed for repetitive mechanical LiDAR sensors. Recently, the use of non-repetitive LiDAR sensors is strongly increasing due to their small cost and form factor. Such non-repetitive Risley prism-based sensors violate two key assumptions of repetitive mechanical sensors: structured scan lines and well-defined frame boundaries. Their Rhodonea-curve trajectories produce non-uniform point distributions, and the absence of a rotation cycle renders conventional scan line indexing inapplicable. To meet such new requirements, we developed C-ARC, a Continuous-Adaptive Range Clustering framework that maintains a persistent dual-graph over a sliding window, decoupling high-frequency point insertion from on-demand cluster retrieval. This is crucial for key functionalities like SLAM or tracking. An adaptive range grid resolution mechanism calibrates grid dimensions at initialization using an exponential control loop, balancing the sparsity-collision trade-off without prior knowledge of the scanning pattern. Implemented as an open-sourced single-threaded C++17 library, C-ARC produces real-time cluster output at 20 Hz on commodity hardware for the Livox Mid-360. Evaluation on the Livox Avia identifies unbounded cell occupancy as the primary limitation for sensors with strongly concentrated scan patterns. The adaptive resolution mechanism additionally improves clustering quality for existing grid-based methods on non-repetitive data.

2606.18506 2026-06-18 cs.LG eess.SP stat.AP 新提交 75%

Beyond AHI: An Interpretable Causal-Discovery-Guided Framework for Sleep Recovery in Connected Health

超越AHI:一种可解释的因果发现引导的睡眠恢复框架在互联健康中的应用

Saba A. Farahani, Elahe Khatibi, Manoj Vishwanath, Amir M. Rahmani, Hung Cao

发表机构 * University of California, Irvine(加州大学尔湾分校)

专题命中 多传感器融合 :从多模态PSG信号推导睡眠恢复评分,融合多种生理信号。

AI总结 提出一种可解释的因果发现引导框架,从多模态PSG中推导层次化睡眠恢复评分(SRS),在两大队列中SRS与感知恢复的关联强度是AHI的2.5倍。

Comments 6 pages, 2 figures, 2 tables. Accepted at the 2nd Workshop on Sensing and Computing for Smart and Connected Health (SCH), co-located with IEEE/ACM CHASE 2026

详情
AI中文摘要

客观睡眠评估依赖于多导睡眠图(PSG),但临床影响通常更好地反映在患者报告结局(PROs)如嗜睡和疲劳中。现有的总结指标,包括呼吸暂停低通气指数(AHI),对功能恢复背后的多域生理学提供的洞察有限。我们提出了一种可解释的、因果发现引导的框架,用于从多模态PSG中推导层次化睡眠恢复评分(SRS)。利用两个大型人群队列(MESA: n=1540; MrOS: n=825),我们应用有向无环图(DAG)学习来识别候选生理驱动因素,涵盖呼吸负担、缺氧负担、睡眠碎片化、睡眠结构和自主神经调节。尽管源自临床PSG,这些域自然映射到互联健康技术中日益可用的传感流,包括可穿戴心电图、血氧测定和睡眠阶段估计设备。为了保持机制合理性,我们引入了一个两阶段筛选过程,结合基于生理学的约束和受约束的LLM辅助审计,以识别和消除结构混杂因素以及构造重叠变量。跨队列,这五个域作为与恢复相关的重复生理域出现,所得SRS与感知恢复的关联强度高达AHI的2.5倍。通过将多模态睡眠生理学与以患者为中心的结果通过可解释、偏差感知和域结构化的框架联系起来,这项工作为临床睡眠研究和新兴智能互联健康环境中的恢复建模提供了实用基础。

英文摘要

Objective sleep assessment relies on polysomnography (PSG), yet clinical impact is often better reflected in patient-reported outcomes (PROs) such as sleepiness and fatigue. Existing summary indices, including the Apnea-Hypopnea Index (AHI), provide limited insight into the multidomain physiology underlying functional recovery. We propose an interpretable, causal-discovery--guided framework for deriving a hierarchical Sleep Recovery Score (SRS) from multimodal PSG. Using two large population cohorts (MESA: n=1540; MrOS: n=825), we apply directed acyclic graph (DAG) learning to identify candidate physiological drivers spanning respiratory burden, hypoxic burden, sleep fragmentation, sleep architecture, and autonomic regulation. Although derived from clinical PSG, these domains map naturally to sensing streams increasingly available in connected health technologies, including wearable ECG, oximetry, and sleep-stage estimation devices. To preserve mechanistic plausibility, we introduce a two-stage screening process that combines physiology-based constraints with constrained LLM-assisted auditing to identify and remove structural confounders and construct-overlapping variables. Across cohorts, these five domains emerge as recurrent physiological domains associated with recovery, and the resulting SRS shows up to 2.5$\times$ stronger alignment with perceived recovery than AHI. By linking multimodal sleep physiology to patient-centered outcomes through an interpretable, bias-aware, and domain structured framework, this work provides a practical foundation for recovery modeling across both clinical sleep studies and emerging smart and connected health settings.

2606.19333 2026-06-18 cs.RO cs.CV 新提交 70%

Do as I Do: Dexterous Manipulation Data from Everyday Human Videos

Do as I Do: 从日常人类视频中获取灵巧操作数据

Bhawna Paliwal, Haritheja Etukuru, William Liang, Pieter Abbeel, Nur Muhammad Mahi Shafiullah, Jitendra Malik

发表机构 * UC Berkeley(加州大学伯克利分校)

专题命中 多传感器融合 :从单目RGB视频重建手-物交互并重定向

AI总结 提出DO AS I DO算法,从单目RGB人类视频中重建手-物交互并重定向到多指灵巧机器人手,生成可执行的操作数据,优于现有方法。

Comments Project website: https://do-as-i-do.com/

详情
AI中文摘要

我们如何可扩展地生成机器人操作数据,特别是在像多指灵巧手这样的人形平台上?从人类视频中学习最近成为这个问题的可能答案。然而,估计手-物交互和跨越人-机器人具身差距的困难阻碍了将丰富的单目RGB人类视频作为机器人操作数据的主要来源。在这项工作中,我们提出了DO AS I DO,一种将单目RGB人类视频重建并重定向到多指灵巧机器人手的算法。DO AS I DO从各种自我中心和外部中心的野外视频源中重建手-物交互。然后,该算法将这些手-物交互估计重定向为一系列可在现实世界中执行的动作,从不同的人类视频中生成机器人完整的操作数据。总体而言,DO AS I DO在从RGB视频中估计手-物交互和提取灵巧操作轨迹方面优于先前的最先进技术,正如我们在具有真实标签的数据集和在线收集的视频片段数据集上的实验所示。我们的实验使我们能够为从业者收集人类操作数据提出一个有效性指南。

英文摘要

How can we scalably generate data for robotic manipulation, especially on human-like platforms such as dexterous multi-fingered hands? Learning from human videos has recently emerged as a likely answer to this question. However, difficulties in estimating hand-object interaction and crossing the human-to-robot embodiment gap have hindered the adoption of abundant monocular RGB-only human videos as the primary source of robot manipulation data. In this work, we present DO AS I DO, an algorithm to reconstruct and retarget monocular RGB human videos to multi-fingered dexterous robotic hands. DO AS I DO reconstructs hand-object interactions from various egocentric and exocentric in-the-wild video sources. The algorithm then retargets these hand-object interaction estimates into a sequence of actions executable in the real world, yielding robot-complete manipulation data from disparate human videos. Overall, DO AS I DO outperforms previous state of the art in estimating hand-object interactions and extracting dexterous manipulation trajectories from RGB videos, as we show in experiments on datasets with ground truths and on a dataset of video clips collected online. Our experiments enable us to propose an efficacy playbook for practitioners collecting human data for manipulation.

2606.19267 2026-06-18 cs.RO cs.SY eess.SY 新提交 70%

A Mixed-Reality Testbed for Autonomous Vehicles

自动驾驶汽车的混合现实测试平台

H. M. Sabbir Ahmad, Ehsan Sabouni, Emrullah Celik, Zean Wan, Damola Ajeyemi, Christos G. Cassandras, Wenchao Li

发表机构 * Division of Systems Engineering(系统工程系) Boston University(波士顿大学) Department of Electrical and Computer Engineering(电气与计算机工程系)

专题命中 多传感器融合 :混合现实测试平台集成物理机器人与仿真环境

AI总结 提出一种混合现实硬件在环测试平台,集成物理移动机器人与高保真仿真环境,用于验证感知、规划和控制算法,并支持多智能体系统研究。

Comments 9 pages, 7 figures, 1 table

详情
AI中文摘要

我们提出了一种用于自动驾驶汽车的混合现实、硬件在环(HIL)测试平台,该平台将物理移动机器人测试平台与高保真仿真环境无缝集成。虚拟仿真能够创建多样化的、安全关键的驾驶场景,以验证最先进的感知、规划和控制算法,同时通过配备多模态传感器的物理机器人在逼真的虚拟环境中增强仿真,进一步促进严格的验证。我们的测试平台还利用无线通信实现车辆连接,并通过物理机器人和虚拟仿真代理的组合容纳大量代理,支持包括网联自动驾驶汽车(CAV)在内的多智能体系统研究。最后,我们提出了一种结合感知、规划和一种新颖的基于控制障碍函数(CBF)的在线学习控制器的安全保证框架,用于CAV。使用所提出框架的实验用于验证和展示测试平台的关键功能以及其在弥合仿真与真实世界硬件部署之间差距方面的整体效用。

英文摘要

We propose a mixed-reality, hardware-in-the-loop (HIL) testbed for autonomous vehicles that seamlessly integrates a physical testbed of mobile robots with a high-fidelity simulation environment. The virtual simulation enables the creation of diverse, safety-critical driving scenarios to validate state-of-the-art perception, planning, and control algorithms, while augmenting simulations with physical robots equipped with multimodal sensors in photorealistic virtual environments further facilitating rigorous validation. Our testbed also features vehicular connectivity using wireless communication and can accommodate a large number of agents through the combination of physical robots and virtual simulated agents, supporting research on multi-agent systems including Connected and Autonomous Vehicles (CAVs). Finally, we present a safety-guaranteed framework combining perception, planning and a novel online learning-based controller using Control Barrier Functions (CBFs) for CAVs. Experiments using the proposed framework are used to validate and demonstrate the key functionalities and the overall utility of the testbed to bridge the gap between simulation and real-world hardware deployment.

2606.19176 2026-06-18 cs.RO cs.AI cs.SY eess.SY 新提交 70%

Hardware- and Vision-in-the-Loop Validation of Deep Monocular Pose Estimation for Autonomous Maritime UAV Flight

用于自主海上无人机飞行的深度单目位姿估计的硬件与视觉在环验证

Maneesha Wickramasuriya, Beomyeol Yu, Jaden Shin, Mason Huslig, Taeyoung Lee, Murray Snyder

发表机构 * Mechanical and Aerospace Engineering, George Washington University(机械与航空航天工程,乔治华盛顿大学)

专题命中 多传感器融合 :融合视觉与IMU数据用于位姿估计

AI总结 提出硬件验证的视觉在环框架,结合深度变换器单目位姿估计器和延迟卡尔曼滤波器,在模拟逼真海上环境中实现自主室内飞行,验证了感知延迟等嵌入式效应。

Comments 6 pages 9 figues

详情
AI中文摘要

船舶上的自主无人机操作需要可靠的基于视觉的相对位姿估计,然而海上验证成本高、依赖天气且风险大。本文提出一个硬件验证的视觉在环框架,能够在模拟逼真海上环境的同时实现完全自主的室内飞行。渲染的海上视图由板载的基于深度变换器的单目位姿估计器处理。延迟的视觉测量与高频率IMU数据通过延迟卡尔曼滤波器融合,为几何控制提供一致的状态估计。该系统捕捉了纯仿真中缺失的关键嵌入式效应,包括感知延迟、异步更新和计算约束。自主起飞、轨迹跟踪和着陆实验证明了稳定的闭环飞行。结果建立了一个安全且硬件真实的中间阶段,用于在船上部署之前开发海上无人机自主性。

英文摘要

Autonomous UAV operations on ships require reliable vision-based relative pose estimation, yet at-sea validation is costly, weather-dependent, and risky. This paper presents a hardware-validated vision-in-the-loop framework that enables fully autonomous indoor flight while emulating photorealistic maritime environments. Rendered maritime views are processed onboard by a deep transformer-based monocular pose estimator. Delayed vision measurements are fused with high-rate IMU data using a delayed Kalman filter to provide consistent state estimates for geometric control. The system captures critical embedded effects, including perception latency, asynchronous updates, and computational constraints, that are absent in pure simulation. Autonomous takeoff, trajectory tracking, and landing experiments demonstrate stable closed-loop flight. The results establish a safe and hardware-realistic intermediate stage for developing maritime UAV autonomy prior to shipboard deployment.

2606.18953 2026-06-18 cs.RO 新提交 70%

Object-Centric Residual RL for Zero-Shot Sim-to-Real VLA Enhancement

面向零样本仿真到现实VLA增强的以对象为中心的残差强化学习

Kinam Kim, Namiko Saito, Heecheol Kim, Katsushi Ikeuchi, Jaegul Choo, Yasuyuki Matsushita

发表机构 * KAIST(韩国科学技术院) Microsoft Research Asia - Tokyo(微软亚洲研究院-东京) The University of Tokyo(东京大学)

专题命中 多传感器融合 :对象位姿与视觉语言动作融合,增强机器人策略。

AI总结 提出以对象为中心的残差强化学习框架,在仿真中训练策略,零样本迁移到真实机器人,将VLA模型成功率从42%提升至76%。

Comments 8 pages, 7 figures, 2 tables; 8-page appendix

详情
AI中文摘要

视觉-语言-动作(VLA)模型能够泛化到多种操作任务,但其基于模仿学习的策略在精确物理交互中因执行误差累积而脆弱;能否仅在仿真中训练的强化学习策略零样本提升真实世界VLA的鲁棒性?残差强化学习在冻结的VLA之上学习修正策略,提供了一个自然框架,但现有方法面临根本的仿真到现实困境:特权状态方法需要有损蒸馏才能部署;基于图像的方法存在视觉域差距;而真实世界强化学习成本高且不安全。我们提出一种以对象为中心的残差强化学习框架,利用对象姿态优化VLA动作,从而构建一个在仿真和现实之间一致迁移的紧凑观测空间。为对齐两个域,我们额外在仿真中重放相同的遥操作演示,以训练真实世界VLA的仿真对应物。残差强化学习策略仅在仿真中通过姿态噪声注入和丢弃进行训练,并零样本迁移到真实机器人。在真实Franka Research 3(FR3)机器人的五个操作任务上,我们的方法将成功率从42%零样本提升至76%,且改进后的轨迹可进一步用于重新训练基础VLA以实现自我改进,无需额外遥操作。项目页面:此https URL

英文摘要

Vision-Language-Action (VLA) models can generalize across diverse manipulation tasks, but their imitation-learning-based policies remain brittle in precise physical interactions due to compounding execution errors; Can a reinforcement learning policy trained purely in simulation improve the robustness of real-world VLAs zero-shot? Residual RL, which learns a corrective policy on top of a frozen VLA, offers a natural framework, but existing approaches face a fundamental sim-to-real dilemma: privileged-state methods require lossy distillation for deployment; image-based methods suffer from the visual domain gap; and real-world RL is costly and unsafe. We propose an object-centric residual RL framework that refines VLA actions using object poses, enabling a compact observation space that transfers consistently between simulation and reality. To align the two domains, we additionally replay the same teleoperation demonstrations in simulation to train a sim counterpart of the real-world VLA. The residual RL policy is trained only in simulation with pose noise injection and dropout, and transfers zero-shot to the real robot. Across five manipulation tasks on a real Franka Research 3 (FR3) robot, our method improves the success rate from 42% to 76% zero-shot, and the improved rollouts can be further reused to retrain the base VLA for self-improvement without additional teleoperation. Project page: https://www.microsoft.com/en-us/research/articles/object-centric-residual-rl/

3. 医学影像融合 4 篇

2606.18876 2026-06-18 cs.CV cs.LG 新提交 80%

Test-Time Adaptation in Optical Coherence Tomography Using Trajectory-Aligned Time-Independent Flow

光学相干断层扫描中基于轨迹对齐的时间无关流的测试时自适应

Veit Hucke, Thomas Pinetz, Gregor Reiter, Ursula Schmidt-Erfurth, Hrvoje Bogunović

发表机构 * Institute of Artificial Intelligence, Center for Medical Data Science, Medical University of Vienna, Austria(人工智能研究所、医学数据科学中心、维也纳医学大学,奥地利) Comprehensive Center for Artificial Intelligence in Medicine, Medical University of Vienna, Austria(医学人工智能综合中心、维也纳医学大学,奥地利) Department of Ophthalmology and Optometry, Medical University of Vienna, Austria(眼科与视光学部、维也纳医学大学,奥地利) Laboratory for Ophthalmic Image Analysis, Medical University of Vienna, Austria(眼科图像分析实验室、维也纳医学大学,奥地利)

专题命中 医学影像融合 :OCT图像质量自适应,属于医学影像融合。

AI总结 提出一种基于流匹配的测试时自适应方法,通过直方图匹配和去除时间条件,生成高质量替代图像,在AMD分割中达到最优性能。

Comments Accepted in MICCAI

详情
AI中文摘要

光学相干断层扫描(OCT)在眼科中至关重要,但图像质量不一致,尤其是在低成本设备中,阻碍了自动化分析。为了解决这个问题,我们引入了一种基于流匹配的测试时自适应方法,从噪声输入生成高质量替代图像。通常,测试数据和训练数据之间的域差距会导致去噪过程中像素分布不匹配。我们通过将测试图像的直方图与合成参考轨迹匹配来克服这一问题,成功地将输入与预期分布对齐。此外,我们移除了网络的时间条件,以考虑真实世界噪声分布的轻微偏差。我们的方法在分割年龄相关性黄斑变性(AMD)两个阶段的关键生物标志物方面达到了最先进的性能。代码地址:this https URL。

英文摘要

Optical coherence tomography (OCT) is essential in ophthalmology, but inconsistent image quality especially in low-cost devices hinders automated analysis. To address this, we introduce a flow-matching-based test-time adaptation method that generates high-quality surrogate images from noisy inputs. Typically, domain gaps between test and training data cause pixel distribution mismatches during the denoising process. We overcome this by matching the test image's histogram to synthetic reference trajectories, successfully aligning the input with expected distributions. Additionally, we remove the network's time conditioning to account for slight deviations in real-world noise distributions. Our approach achieves state-of-the-art performance in segmenting critical biomarkers for two stages of Age-related Macular Degeneration (AMD). Code is available: https://github.com/Veit21/tta-flow.

2606.18872 2026-06-18 cs.CV 新提交 80%

Bridging Single Distortion Artifacts and Mmultifactorial Clinical Quality: Few-shot Biparametric MRI Quality Assessment via Distortion-trained Prototypical Networks

桥接单一失真伪影与多因素临床质量:基于失真训练的原型网络的少样本双参数MRI质量评估

Yuheng Tang, Alexander Ng, Wen Yan, Natasha Thorley, Pawel Rajwa, Yipei Wang, Aqua Asif, Clare Allen, Louise Dickinson, Francesco Giganti, Shonit Punwani, Daniel Alexander, Veeru Kasivisvanathan, Yipeng Hu

发表机构 * UCL Hawkes Institute(UCL Hawkes研究所) Department of Medical Physics and Biomedical Engineering(医学物理与生物医学工程系) University College London(伦敦大学学院) Division of Surgery and Interventional Science(外科与介入科学分会) Centre for Medical Imaging(医学成像中心) British Urology Researchers in Surgical Training (BURST)(英国泌尿外科手术培训研究人员(BURST)) Department of Radiology(放射科) University College London Hospitals NHS Foundation Trust(伦敦大学学院医院国家健康服务信托基金) Centre of Medical Imaging, Division of Medicine(医学成像中心,医学分会) Centre for Medical Image Computing(医学图像计算中心) Department of Computer Science(计算机科学系) Department of Urology(泌尿科)

专题命中 医学影像融合 :双参数MRI质量评估,融合T2和DWI特征。

AI总结 提出一种少样本双参数原型网络,利用失真标签元训练,通过特征融合和域对齐,仅用5个样本即可预测PI-QUAL临床质量评分,解决临床数据稀缺问题。

详情
AI中文摘要

临床前列腺多参数MRI高度依赖高质量扩散加权成像(DWI),但DWI读图常因几何失真(通常由直肠气体引起)而受损。通过PI-QUAL评分系统评估质量是新兴的临床标准,但该方法主观、耗时,且存在类别不平衡问题,其中低质量病例多样且相对稀少。以PRIME临床试验为例,6%的图像PI-QUAL评分低于4,87%的DWI问题源于失真,许多其他临床质量问题代表性不足。为解决这种标注临床数据的双重稀缺性,我们提出了一种用于自动图像质量评估(IQA)的少样本双参数原型网络。我们的框架利用双分支3D ResNet融合T2加权和DWI特征,提供解剖背景以区分真实形态与失真。为处理现实异质性,我们引入特征级线性调制(FiLM)和梯度反转层(GRL),以对齐基于不同b值的特征分布,同时抑制采集相关偏差。我们证明,仅基于相对客观、易于获取的失真标签进行元训练的模型,能够仅使用五个代表性样本有效适应预测复杂的多因素临床质量评分(如PI-QUAL)。在两个数据集上的实验结果表明,我们的方法在此具有挑战性的IQA任务中显著优于少样本学习基线,为临床工作流程中标准化前列腺MRI质量控制提供了实际可行且数据高效的解决方案。

英文摘要

Clinical prostate multi-parametric MRI relies heavily on high-quality diffusion-weighted imaging (DWI), yet reading DWI is frequently compromised by geometric distortion, often caused by rectal air. Assessing quality via the PI-QUAL scoring system is an emerging clinical standard, but it is subjective, time-consuming and suffers from a class imbalance where low-quality cases are diverse and relatively scarce. Using the PRIME clinical trial as an example, there are $6\%$ images with PI-QUAL scores lower than 4, $87\%$ of DWI issues are due to distortion. Many of the other clinical quality issues are under-represented. To address this common dual-scarcity of annotated clinical data, we propose a few-shot biparametric prototypical network for automated image quality assessment (IQA). Our framework utilizes a dual-branch 3D ResNet to fuse T2-weighted and DWI features, providing anatomical context to distinguish true morphology from distortion. To handle real-world heterogeneity, we introduce feature-wise linear modulation (FiLM) and a gradient reversal layer (GRL) to align feature distributions conditioned on varying b-values while suppressing acquisition-related biases. We demonstrate that a model meta-trained solely on comparatively objective, readily obtainable distortion labels can effectively adapt to predicting complex, multi-factorial clinical quality scores such as PI-QUAL using only five representative samples. Experimental results on two datasets show that our method significantly outperforms few-shot learning baselines for this challenging IQA task, offering a practically feasible and data-efficient solution for standardizing prostate MRI quality control in clinical workflows.

2606.18753 2026-06-18 cs.CV 新提交 80%

SMART: A Flexible, Interpretable, and Scalable Spatio-temporal Brain Atlas from High-Resolution Imaging Data

SMART:一种灵活、可解释且可扩展的高分辨率成像数据时空脑图谱

John Kalkhof, Boris Gutman, Emile d'Angremont, Daniel C. Alexander, Marco Lorenzi

发表机构 * Illinois Institute of Technology(伊利诺伊理工学院) Amsterdam University Medical Center(阿姆斯特丹大学医学中心) University College London(伦敦大学学院)

专题命中 医学影像融合 :时空脑图谱学习,处理高分辨率3D医学图像。

AI总结 提出SMART框架,通过解耦全局疾病动态与患者特定解剖表现,学习连续疾病时间图谱,实现高分辨率3D医学图像中时空变化的灵活、可解释和可扩展建模。

详情
AI中文摘要

我们介绍了SMART,一个从纵向高分辨率3D医学图像中学习灵活、可解释且可扩展的时空脑图谱的框架。现有的时空图谱构建方法依赖于黑盒生成模型,缺乏灵活性、限制可解释性,并且难以扩展到高维数据。SMART通过学习一个连续的疾病时间图谱来解决这些挑战,该图谱将全局群体级疾病动态与患者特定的解剖表现解耦。在解剖学启发先验的指导下,SMART通过区域特异性微分方程,沿着共享的疾病时间线建模可解释的全局区域进展轨迹。全局轨迹进一步通过由灵活且可扩展的多尺度神经细胞自动机参数化的密集微分同胚位移,个性化到个体解剖结构。在阿尔茨海默病的五个纵向MRI数据集(ADNI-1/GO/2、OASIS-3、AIBL;>1300名受试者)上评估,SMART产生了解剖学上有意义的疾病进展预测,并实现了最先进的预测准确性和比对抗性和扩散基线更好的时间一致性。我们的方法为高维医学图像时间序列中时空变化的灵活、可解释和可扩展建模建立了一个新范式。

英文摘要

We introduce SMART, a framework for learning a flexible, interpretable, and scalable spatio-temporal brain atlas from longitudinal high-resolution 3D medical images. Existing approaches to spatio-temporal atlas construction rely on black-box generative models that lack flexibility, limit interpretability, and struggle to scale to high-dimensional data. SMART addresses these challenges by learning a continuous disease-time atlas that decouples global group-wise disease dynamics from their patient-specific anatomical manifestation. Guided by anatomically inspired priors, SMART models interpretable global trajectories of regional progression along a shared disease timeline through region-specific differential equations. Global trajectories are further personalized to individual anatomies via dense diffeomorphic displacements parameterized by a flexible and scalable multi-scale Neural Cellular Automata. Evaluated on five longitudinal MRI datasets in Alzheimer's disease (ADNI-1/GO/2, OASIS-3, AIBL; > 1,300 subjects), SMART produces anatomically meaningful predictions of disease progression and achieves state-of-the-art forecasting accuracy and improved temporal consistency over adversarial and diffusion baselines. Our approach establishes a new paradigm for flexible, interpretable, and scalable modeling of spatio-temporal change in high-dimensional medical image time-series.

2606.18869 2026-06-18 cs.CV 新提交 75%

Learning to Distort: Weakly-Supervised Image Quality Transfer for Prostate DWI Correction

学习扭曲:用于前列腺DWI校正的弱监督图像质量迁移

YuCheng Tang, Wen Yan, Alexander Ng, Natasha Thorley, Pawel Rajwa, Yipei Wang, Aqua Asif, Clare Allen, Louise Dickinson, Francesco Giganti, David Atkinson, Shonit Punwani, Daniel Alexander, Shaheer Ullah Saeed, Veeru Kasivisvanathan, Yipeng Hu

发表机构 * UCL Hawkes Institute(UCL哈维斯研究所) Department of Medical Physics and Biomedical Engineering(医学物理与生物医学工程系) University College London(伦敦大学学院) Division of Surgery and Interventional Science(外科与介入科学分会) Centre for Medical Imaging(医学成像中心) British Urology Researchers in Surgical Training (BURST)(英国泌尿外科手术培训研究人员(BURST)) Department of Radiology(放射科) University College London Hospitals NHS Foundation Trust(伦敦大学学院医院国家健康服务信托基金) Centre for Medical Image Computing(医学图像计算中心) Department of Computer Science(计算机科学系) Department of Urology(泌尿科)

专题命中 医学影像融合 :前列腺DWI校正,涉及图像质量迁移与融合。

AI总结 提出弱监督图像质量迁移框架,利用图像质量评估信号从无失真图像学习生成真实失真,并训练校正模型,在PI-RADS和Gleason评分分类任务中优于现有无配对方法。

详情
AI中文摘要

单次激发平面回波前列腺弥散加权成像(DWI)常因几何失真而复杂化,影响从这些图像中获得可靠诊断的能力。开发自动化校正方法面临缺乏配对的失真和未失真临床扫描的挑战。本文首先提出一种新颖的弱监督图像质量迁移(IQT)框架,从无失真图像到失真图像,利用图像质量评估(IQA)信号监督迁移过程。与传统方法需要昂贵的体素级配对数据或采用无配对算法不同,我们的方法利用图像级质量标签(此处为失真与无失真)在预训练特征空间中建立潜在质量原型。认识到模拟真实失真比直接无配对校正更可靠,我们描述了一种弱监督原型流匹配算法,显式正则化生成轨迹朝向失真原型,产生模拟临床退化的真实磁敏感伪影。通过合成这些真实配对,我们能够训练第二个IQT模型进行正向失真校正。实验结果表明,我们生成的图像成功模拟了真实伪影的诊断干扰,从而产生更强大的失真校正IQT模型。除定性比较外,我们还通过评估临床下游任务性能(PI-RADS和Gleason评分分类),使用分布内和外部数据集,将我们的方法与现有无配对方法(如CycleGAN、UNIT-DDPM和OT-FM)作为正向或反向替代方案进行详尽的定量评估。

英文摘要

Single-shot echo-planar prostate diffusion-weighted imaging (DWI) is frequently complicated by geometric distortions, which impact the ability to derive reliable diagnoses from such images. Developing automated correction methods is challenged by the absence of paired distorted and undistorted clinical scans. In this paper, we first propose a novel weakly-supervised image quality transfer (IQT) framework from undistorted to distorted images that utilizes image quality assessment (IQA) signals to supervise the transfer process. Unlike traditional methods that require expensive, voxel-wise paired data or resort to developing unpaired algorithms, our approach utilizes image-level quality labels (here, distorted vs. undistorted) to establish latent quality prototypes within a pre-trained feature space. Recognizing that simulating realistic distortions is more reliable than direct unpaired correction, we describe a weakly-supervised prototype flow matching algorithm to explicitly regularize generative trajectories towards distorted prototypes, producing realistic susceptibility artifacts that mimic clinical degradations. By synthesizing these realistic pairs, we enable a second IQT model to be trained in the forward direction for distortion correction. Experimental results demonstrate that our generated images successfully mimic the diagnostic interference of real-world artifacts, which leads to more capable distortion correction IQT models. In addition to qualitative comparisons, we also conduct exhaustive quantitative evaluations that compare our approach with existing unpaired approaches (e.g., CycleGAN, UNIT-DDPM, and OT-FM) - as either forward or reverse alternatives - by assessing clinical downstream task performance in PI-RADS and Gleason score classification, using both in-distribution and external data sets.

4. 红外-可见光融合 1 篇

2606.18783 2026-06-18 cs.CV 新提交 80%

SCR-Guided Difficulty-Aware Optimization for Infrared Small Target Detection

SCR引导的困难感知优化用于红外小目标检测

Yunus Sevim, Behçet Uğur Töreyin

发表机构 * Aselsan(阿塞尔桑公司) Istanbul Technical University(伊斯坦布尔理工大学)

专题命中 红外-可见光融合 :红外小目标检测,利用信杂比优化,涉及红外图像处理。

AI总结 提出REEM框架,利用信杂比作为可见性先验,通过可微调制软IoU损失,提升低可见性目标检测性能,无需额外参数或推理开销。

Comments Accepted at CVPR 2026 Workshops (PBVS). Published version: https://openaccess.thecvf.com/content/CVPR2026W/PBVS/html/Sevim_SCR-Guided_Difficulty-Aware_Optimization_for_Infrared_Small_Target_Detection_CVPRW_2026_paper.html

详情
AI中文摘要

红外小目标检测由于严重的背景杂波、低对比度和弱空间响应仍然具有挑战性,其中几何重叠单独不足以表征检测质量。在这项工作中,我们提出了REEM(重加权显式可见性增强调制),一种轻量级的SCR引导的困难感知优化框架,在训练期间将信杂比(SCR)作为物理上有意义的可见性先验。REEM不修改网络架构或直接优化SCR,而是从输入图像计算真实局部SCR,并对软IoU学习信号应用可微调制,强调低可见性目标,同时保持稳定优化和相同的推理行为。REEM集成到基于U-Net的MSHNet中,无需引入额外参数、架构修改或推理时开销。大量实验表明,与基线相比,REEM实现了持续改进,获得了更高的IoU和检测概率(Pd),同时大幅减少了虚警(FA),特别是在具有挑战性的低可见性条件下。这些结果表明,SCR引导的困难感知优化为红外小目标检测提供了有效且物理基础的补充,超越了传统的基于重叠的目标函数。代码可在https://github.com/yall-in-one/Reemm获取。

英文摘要

Infrared small target detection remains challenging due to severe background clutter, low contrast, and weak spatial responses where geometric overlap alone is insufficient to characterize detection quality. In this work, we propose REEM (Reweighted Explicit-visibility Enhanced Modulation), a lightweight SCR-guided difficulty-aware optimization framework that incorporates Signal-to-Clutter Ratio (SCR) as a physically meaningful visibility prior during training. Instead of modifying the network architecture or directly optimizing SCR, REEM computes a ground-truth local SCR from the input image and applies a differentiable modulation to the soft-IoU learning signal, emphasizing low-visibility targets while preserving stable optimization and identical inference behavior. REEM is integrated into a U-Net-based MSHNet without introducing additional parameters, architectural modifications, or inference-time overhead. Extensive experiments demonstrate consistent improvements over the baseline, achieving higher IoU and detection probability (Pd) together with substantially reduced false alarms (FA), particularly under challenging low-visibility conditions. These results suggest that SCR-guided difficulty-aware optimization provides an effective and physically grounded complement to conventional overlap-based objectives for infrared small target detection. The code is available at https://github. com/yall-in-one/Reemm.

5. Image Fusion 1 篇

2606.18496 2026-06-18 cs.CV cs.AI 新提交 80%

Neural Phase Correlation

神经相位相关

Cole Reynolds

发表机构 * Weyl Labs(Weyl实验室)

专题命中 Image Fusion :学习相位相关泛化,可处理非刚性形变,适用于图像配准融合。

AI总结 提出相位相关的学习泛化,通过可学习基函数将变换分解,适用于非刚性形变和幺正动力学,在心脏MRI和超声数据集上达到或超越现有方法。

详情
AI中文摘要

对应关系本质上是关系性的:它寻求同一场景两次观测之间的未知变换,而非任一观测的内容。然而,主流的基于学习的方法并未将变换表示为架构中的一等对象。它们独立编码每幅图像,让学习的相似度函数或深度解码器隐式地发现映射。相位相关是典型的例外,它直接在傅里叶域测量图像间关系,但其固定基的刚性将其限制于全局平移。我们引入相位相关的学习泛化,通过学习变换分解所基于的基来解除这一限制。相同的代数原语可扩展到密集非刚性形变和幺正动力学。在ACDC心脏MRI基准上,该框架在两个配准方向上匹配或超越先前发表的基线。在CAMUS超声心动图上,它无需辅助评分或自适应平滑机制即可达到最先进水平。应用于一维量子谐振子的时间演化波函数对时,同一框架仅从观测对中恢复未知哈密顿量的埃尔米特函数本征态和量子化能级。

英文摘要

Correspondence is fundamentally relational: it seeks the unknown transformation between two observations of a common scene, not the content of either. Yet the dominant learning-based methods do not represent the transformation as a first-class object in the architecture. They encode each image independently and let a learned similarity function or a deep decoder discover the mapping implicitly. Phase correlation is the canonical exception, measuring the inter-image relationship directly in the Fourier domain, but the rigidity of its fixed basis confines it to global translation. We introduce a learned generalization of phase correlation that lifts this restriction by learning the basis on which the transformation decomposes. The same algebraic primitive extends to dense non-rigid deformations and to unitary dynamics. On the ACDC cardiac-MRI benchmark the framework matches or exceeds prior published baselines on both registration directions. On CAMUS echocardiography it matches state-of-the-art without auxiliary scoring or adaptive-smoothness mechanisms. Applied to time-evolved wavefunction pairs of the 1-D quantum harmonic oscillator, the same framework recovers the Hermite-function eigenstates and the quantized energy levels of the unknown Hamiltonian from observation pairs alone.

6. 融合架构与评测 2 篇

2606.19253 2026-06-18 cs.CV cs.AI cs.LG cs.RO 新提交 75%

OneCanvas: 3D Scene Understanding via Panoramic Reprojection

OneCanvas: 通过全景重投影实现3D场景理解

Bartłomiej Baranowski, Dave Zhenyu Chen, Matthias Nießner

发表机构 * Technical University of Munich(慕尼黑技术大学) Huawei(华为)

专题命中 融合架构与评测 :多视图补丁特征聚合到全景画布实现3D理解

AI总结 提出OneCanvas方法,将多视图补丁特征聚合到全景画布上,利用深度和相机位姿进行重投影,无需复杂几何编码器或大量训练,在SQA3D等基准上达到最先进精度。

Comments Project page: https://baranowskibrt.github.io/onecanvas/

详情
AI中文摘要

现有的视觉语言模型(VLM)中的3D场景理解方法要么依赖复杂的、模型特定的几何编码器,要么为了追求空间推理而需要大量的训练预算。相反,OneCanvas将所有视图的补丁特征聚合到一个单一的等距柱状全景画布上。具体来说,每个补丁利用其深度和相机位姿被反投影到3D世界坐标,然后根据从画布原点看到的该点的连续经度和纬度放置在画布上,无需对重叠视图进行光栅化或聚合。补丁的度量坐标的3D位置嵌入被添加到其特征中,从而恢复了将世界位置压缩到角度画布坐标时丢失的深度。因此,来自所有帧的补丁共享一个空间坐标系,无需融合或对主干网络进行重大架构修改。预训练的VLM将此表示视为普通图像。由于画布可以以任何感兴趣的姿态为中心,相同的表示直接支持从特定视角进行情境推理,这是机器人和具身AI中的常见需求。得益于这种表示,我们还可以引入空间预训练课程:通过程序化地将从真实图像中提取的对象的补丁特征放置在原本空白的画布上的选定3D世界位置,我们生成了涵盖广泛空间推理任务的即时监督,并控制答案分布以减少空间推理捷径。OneCanvas在SQA3D和VSI-Bench上达到了最先进的准确率,并在SPBench上泛化到分布外数据,其训练计算量比最强竞争方法少一个数量级。

英文摘要

Existing approaches to 3D scene understanding in Vision-Language Models (VLMs) either rely on complex, model-specific geometry encoders or large training budgets in pursuit of spatial reasoning. Instead, OneCanvas aggregates patch features from all views onto a single equirectangular panoramic canvas. Namely, each patch is unprojected to a 3D world coordinate using its depth and camera pose, then placed on the canvas at the continuous longitude and latitude of that point as seen from the canvas origin, with no rasterization or aggregation across overlapping views. A 3D position embedding of the patch's metric coordinates is added to its feature, restoring the depth lost when collapsing the world position to an angular canvas coordinate. Patches from all frames thus share one spatial coordinate system with no fusion or major architectural modifications of the backbone. The pretrained VLM consumes this representation as if it were an ordinary image. Because the canvas can be centered on any pose of interest, the same representation directly supports situated reasoning from a specific viewpoint, a common requirement in robotics and embodied AI. Thanks to this representation, we can also introduce a spatial pretraining curriculum: by procedurally placing patch features of objects, drawn from real images, at chosen 3D world positions on an otherwise empty canvas, we generate on-the-fly supervision spanning a broad range of spatial reasoning tasks, with answer distributions controlled to reduce spatial reasoning shortcuts. OneCanvas achieves state-of-the-art accuracy on SQA3D and VSI-Bench, and generalizes to out-of-distribution data on SPBench, using an order of magnitude less training compute than the strongest competing methods.

2606.19316 2026-06-18 cs.CV 新提交 70%

NeuMesh++: Towards Versatile and Efficient Volumetric Editing with Disentangled Neural Mesh-based Implicit Field

NeuMesh++:基于解耦神经网格隐式场的多功能高效体积编辑

Chong Bao, Yuan Li, Bangbang Yang, Yujun Shen, Hujun Bao, Zhaopeng Cui, Yinda Zhang, Guofeng Zhang

发表机构 * State Key Lab of CAD&CG, College of Computer Science, Zhejiang University(浙江大学计算机科学学院CAD&CG国家重点实验室) Ant Research(蚂蚁研究院) Google(谷歌) ByteDance(字节跳动)

专题命中 融合架构与评测 :解耦神经网格隐式场实现多模态编辑

AI总结 提出一种基于网格顶点的解耦神经辐射场表示,实现几何、纹理和语义引导的高效体积编辑,包括网格引导几何编辑、纹理交换填充绘制及语义编辑。

Comments TPAMI 2025; Project Page: https://zju3dv.github.io/neumeshplusplus/

详情
AI中文摘要

近年来,神经隐式渲染技术迅速发展,在新视角合成和3D场景重建方面展现出显著优势。然而,现有的用于编辑目的的神经渲染方法功能有限,例如刚性变换和类别特定编辑。在本文中,我们提出了一种新颖的基于网格的表示方法,通过在网格顶点上编码解耦的几何、纹理和语义码来编码神经辐射场,从而实现一系列高效且全面的编辑功能,包括网格引导的几何编辑、通过纹理交换、填充和绘制操作进行的指定纹理编辑,以及语义引导的编辑。为此,我们开发了几种技术,包括一种新颖的局部空间参数化以提高渲染质量和训练稳定性,一种可学习的顶点修改颜色以提高纹理编辑的保真度,一种空间感知优化策略以实现精确的纹理编辑,以及一种语义辅助区域选择以减轻隐式场编辑的繁琐标注。在真实和合成数据集上的大量实验和编辑示例证明了我们的方法在表示质量和编辑能力上的优越性。项目页面:此 https URL

英文摘要

Recently neural implicit rendering techniques have evolved rapidly and demonstrated significant advantages in novel view synthesis and 3D scene reconstruction. However, existing neural rendering methods for editing purposes offer limited functionalities, e.g., rigid transformation and category-specific editing. In this paper, we present a novel mesh-based representation by encoding the neural radiance field with disentangled geometry, texture, and semantic codes on mesh vertices, which empowers a set of efficient and comprehensive editing functionalities, including mesh-guided geometry editing, designated texture editing with texture swapping, filling and painting operations, and semantic-guided editing. To this end, we develop several techniques including a novel local space parameterization to enhance rendering quality and training stability, a learnable modification color on vertex to improve the fidelity of texture editing, a spatial-aware optimization strategy to realize precise texture editing, and a semantic-aided region selection to ease the laborious annotation of implicit field editing. Extensive experiments and editing examples on both real and synthetic datasets demonstrate the superiority of our method on representation quality and editing ability. Project page: https://zju3dv.github.io/neumeshplusplus/