arXivDaily: arXiv daily academic digest, updated Monday through Friday

1. Multimodal and Vision-Language Models (20 papers)

2606.12590 2026-06-12 cs.CV cs.AI 新提交

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

分析与改进医学LVLMs中的细粒度偏好优化

Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi

发表机构 * York University(约克大学) University of British Columbia(不列颠哥伦比亚大学) Vector Institute(向量研究所) Queen’s University(女王大学)

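AI summary: To address the shortcomings of medical large vision-language models in factual consistency, visual grounding, and clinical alignment, the paper proposes a fine-grained online preference optimization framework that combines bidirectional token-wise KL regularization with a visual-contrastive grounding objective. Preference pairs are built by minimally editing model outputs so that only clinically erroneous spans are corrected, significantly improving diagnostic accuracy.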

AI中文摘要

大型视觉语言模型(LVLMs)在医学影像任务中取得了强劲性能,但仍容易出现事实不一致、视觉定位差以及与临床有意义反馈对齐不足的问题。现有的后训练对齐方法,包括直接偏好优化(DPO)及其变体,在医学领域面临三个关键限制:(1)序列级奖励信号将临床关键令牌与通用填充文本等同对待;(2)依赖静态监督微调参考作为偏好响应引入了离策略分布偏移,将优化导向风格伪影而非临床正确性;(3)对齐目标缺乏明确的视觉定位约束,使模型对微妙但诊断决定性的病理特征不敏感。我们的方法利用双向令牌级KL正则化以及视觉对比定位目标,该目标将干净图像与病变破坏图像配对,以惩罚缺乏足够视觉证据生成的响应。这些组件共同构成了一个细粒度的在线对齐框架,通过最小编辑模型生成的输出来构建偏好对,仅修正临床错误片段,同时保留原始语言风格。在医学影像任务和临床文本生成基准上的大量实验验证了我们方法的有效性。

英文摘要

Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.
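Limitation (1) in the abstract can be made concrete with the standard DPO loss, in which a response's log-probability is the plain sum of its token log-probabilities, so every token contributes equally. The sketch below computes that loss for a single preference pair. The optional per-token weights are a hypothetical illustration of how clinically critical tokens could be emphasized; they are an assumption and not the paper's objective, which the abstract describes only as a bidirectional token-wise KL regularizer plus a visual-contrastive term.

```python
import math


def dpo_loss(policy_w, ref_w, policy_l, ref_l, beta=0.1,
             weights_w=None, weights_l=None):
    """DPO loss for one preference pair, from per-token log-probabilities.

    policy_* / ref_*: per-token log-probs of the preferred (w) and
    dispreferred (l) responses under the policy and the frozen reference.
    weights_*: HYPOTHETICAL per-token weights (not from the paper); leaving
    them as None reproduces the standard sequence-level objective.
    """
    def weighted_log_ratio(policy, ref, weights):
        if weights is None:
            weights = [1.0] * len(policy)  # every token counts equally
        return sum(w * (p - r) for w, p, r in zip(weights, policy, ref))

    margin = beta * (weighted_log_ratio(policy_w, ref_w, weights_w)
                     - weighted_log_ratio(policy_l, ref_l, weights_l))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; raising the weight on a token where the policy already favors the preferred response lowers the loss further.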

2606.12633 2026-06-12 cs.CV cs.LG 新提交

ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation

ECA:面向开放图像到文本生成的高效持续对齐

Jiangtao Kong, Peijun Zhao, Chun-Fu Chen, Youngwook Do, Shaohan Hu, Tianyi Zhou, Huajie Shao

发表机构 * University of Science and Technology of China(中国科学技术大学)

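AI summary: Proposes ECA, which uses a Mixture of Query module, Fisher Dynamic Expansion, and Dictionary Replay to achieve continual alignment without access to old data, mitigating catastrophic forgetting and improving incremental learning performance for open-ended image-to-text generation.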

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

开放图像到文本生成(OpenITG)的增量学习(IL)使模型能够持续为新的图像生成准确、上下文相关的文本,同时保留先前获得的知识。与先前研究不同,本文处理了一个更实际的场景,其中视觉数据的主要类别随时间推移而演变。在此背景下,我们引入了持续对齐的新概念,它逐步调整预训练VLM中的对齐模块,以保持高质量的跨模态表示。基于这一思想,我们提出了高效持续对齐(ECA),一种用于OpenITG的无样本IL方法。关键挑战是使模型能够获取新的任务特定特征,同时最小化对已建立对齐的干扰,且无需访问先前任务的原始数据。为此,ECA采用了三种核心机制:混合查询(MoQ)模块,用于适应任务特定的查询令牌;Fisher动态扩展(FeDEx),基于Fisher信息矩阵(FIM)度量动态扩展模型结构;以及带有字典重放(DR)的嵌入字典,以保留过去的知识。为了评估ECA的性能,我们构建了四个新的IL OpenITG基准,更好地反映了现实场景。实验结果表明,与基线方法相比,ECA显著缓解了灾难性遗忘并提高了IL性能。代码和基准可在 https://github.com/Snowball0823/ECA 获取。

英文摘要

Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained VLMs to preserve high-quality cross-modal representations. Based on this idea, we propose Efficient Continual Alignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is enabling the model to acquire new, task-specific features while minimizing interference with the established alignment without accessing raw data from previous tasks. To address this, ECA employs three core mechanisms: a Mixture of Query (MoQ) module that adapts task-specific query tokens, a Fisher Dynamic Expansion (FeDEx) that dynamically expands model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary with Dictionary Replay (DR) to retain past knowledge. To evaluate ECA's performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios. Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Code and benchmarks are available at https://github.com/Snowball0823/ECA.
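To illustrate what a Fisher Information Matrix (FIM)-based metric can look like, the sketch below estimates the diagonal of the empirical Fisher as the mean of squared per-sample gradients, then applies a threshold rule to decide whether to expand. The abstract does not specify FeDEx's actual metric or expansion rule, so `should_expand` and its threshold are purely hypothetical stand-ins.

```python
def diagonal_fisher(per_sample_grads):
    """Diagonal empirical Fisher: mean of squared gradients per parameter.

    per_sample_grads: list of gradient vectors (one list of floats per sample),
    each taken with respect to the same parameters.
    """
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def should_expand(fisher_diag, threshold):
    """HYPOTHETICAL rule: expand when mean parameter importance is high.

    High Fisher values mean existing parameters matter for old tasks, so
    adding capacity may interfere less than overwriting them.
    """
    return sum(fisher_diag) / len(fisher_diag) > threshold
```

For gradients `[[1.0, 0.0], [3.0, 2.0]]` the diagonal estimate is `[5.0, 2.0]`, with a mean importance of 3.5.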

2606.12744 2026-06-12 cs.CV 新提交

GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models

GRIP:面向大型多模态模型的反馈引导提示检索

Garvita Allabadi, Matteo Sodano, Roberto Estevão, Yuxiong Wang, Vikram Adve, Emre Kiciman, Ranveer Chandra

发表机构 * University of Illinois Urbana Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Bonn(波恩大学) Microsoft(微软)

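AI summary: Proposes GRIP, a learnable visual retrieval framework that uses feedback from multimodal models to identify examples that genuinely improve in-context learning performance, outperforming similarity-based retrieval on classification, captioning, and VQA tasks.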

AI中文摘要

上下文学习(ICL)已成为一种强大的机制,使大型语言模型(LLMs)无需微调即可适应新任务。将此概念扩展到大型多模态模型(LMMs),多模态上下文学习(M-ICL)依赖于检索相关示例(如图像、标题或问答对)来指导分类、描述和视觉问答(VQA)等任务的预测。现有方法大多基于特征空间相似性选择上下文示例,假设语义相似的样本提供最有用的上下文。然而,我们的系统分析表明,这一假设并不总是成立:视觉上相似的示例并不一定是那些最有效增强上下文学习性能的示例。为解决此问题,我们提出了上下文提示的引导检索(GRIP),一种可学习的纯视觉检索框架,利用LMMs的反馈来识别真正改善模型预测的示例。GRIP通过对比训练学习区分有益和有害的上下文示例,将检索优化到超越纯相似性。在三个多模态任务(分类、描述和VQA)上,GRIP在Qwen2.5-VL-7B上持续优于基于相似度的检索,在Idefics2-8B上的分类任务中提升最为显著。此外,我们证明了从一个开放LMM训练得到的检索器可以迁移到其他模型(包括闭源的GPT-4o和Gemini)而无需重新训练,从而实现了M-ICL的可扩展且经济高效的部署。代码将在接收后发布。

英文摘要

In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) relies on retrieving relevant examples, such as images, captions, or question-answer pairs, to guide predictions across tasks like classification, captioning, and visual question answering (VQA). Most existing approaches select in-context examples based on feature-space similarity, assuming that semantically similar samples provide the most useful context. However, our systematic analysis reveals that this assumption does not always hold: visually similar examples are not necessarily those that most effectively enhance in-context learning performance. To address this, we propose the Guided Retrieval of In-context Prompts (GRIP), a learnable vision-only retrieval framework that leverages feedback from LMMs to identify examples that truly improve model predictions. GRIP learns to distinguish beneficial from detrimental in-context examples through contrastive training, refining retrieval beyond pure similarity. Across three multimodal tasks, namely classification, captioning, and VQA, GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B. Moreover, we demonstrate that retrievers trained with feedback from one open LMM can be transferred to other models without retraining, including closed-source GPT-4o and Gemini, enabling scalable and cost-efficient deployment of M-ICL. Code will be published upon acceptance.
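The idea of contrastively separating beneficial from detrimental in-context examples can be sketched with a multi-positive InfoNCE-style loss: candidates that LMM feedback marked as helpful act as positives, and harmful ones as negatives. This is one common contrastive form chosen for illustration; the abstract does not state GRIP's exact loss, and the embeddings and temperature here are assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(query, beneficial, detrimental, temperature=0.1):
    """Multi-positive InfoNCE-style loss for one query embedding.

    beneficial / detrimental: candidate embeddings labeled (hypothetically,
    via LMM feedback) as helping or hurting the prediction for this query.
    Returns -log( sum_pos exp(s) / sum_all exp(s) ), which is always >= 0.
    """
    pos = [dot(query, e) / temperature for e in beneficial]
    neg = [dot(query, e) / temperature for e in detrimental]
    scores = pos + neg
    m = max(scores)  # subtract the max before exponentiating for stability
    log_all = m + math.log(sum(math.exp(s - m) for s in scores))
    log_pos = m + math.log(sum(math.exp(s - m) for s in pos))
    return log_all - log_pos
```

Minimizing this loss pulls the query toward helpful examples even when a harmful example is closer in the original feature space, which is the gap between usefulness and similarity that the abstract highlights.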

2606.12830 2026-06-12 cs.CV cs.AI 新提交

Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

感知、交互、推理:构建工具增强的视觉智能体用于空间推理

Changye Li, Meng Lu, Yi Wu, Ligeng Zhu

发表机构 * Tsinghua University(清华大学) Virginia Tech(弗吉尼亚理工大学) NVIDIA(英伟达)

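AI summary: Proposes the PERIA agent, which strengthens the spatial reasoning of VLMs with vision perception and interaction tools and outperforms comparable models by 7.0%-14.8% across 13 benchmarks.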

AI中文摘要

尽管最近的视觉语言模型(VLM)展示了强大的多模态理解能力,但在需要主动证据获取和多步视觉交互的空间推理任务中仍存在局限。这种局限性表明,仅依赖视觉编码器的隐式视觉表示不足以恢复细粒度的空间证据。我们引入了PERception-Interaction-reason Agent(PERIA),一种用于地图推理、视觉探测和视觉重建等空间推理任务的工具增强视觉智能体。PERIA使用两类轻量工具:视觉感知工具用于暴露文本、符号和空间证据,以及视觉交互工具用于操作视觉上下文、追踪路径和验证空间关系。为了训练PERIA,我们开发了一种统一方案,结合了监督式工具使用轨迹合成、复合奖励和观察松弛的组内组策略优化(OR-GIGPO),以实现有效的多工具行为。在来自8个数据集的13个基准上的实验表明,PERIA-8B在分布内基准上比Qwen3-8B骨干网络提高了10.0%,在分布外基准上提高了4.4%,同时比之前类似规模的先进基线高出7.0%-14.8%。它还实现了与更大模型(如Qwen3-VL-235B-A22B-Thinking和GPT-5)相当的性能,证明了PERIA在增强空间推理能力方面的有效性。

英文摘要

While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence. We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. To train PERIA, we develop a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%. It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.

2606.12847 2026-06-12 cs.CV 新提交

Language-Guided Abstraction for Visual Reasoning

语言引导的视觉推理抽象

Xu-Jing Ye, Yuan-Gen Wang, Ruping Wang

发表机构 * School of Artificial Intelligence, Guangzhou University(广州大学人工智能学院) Traditional Chinese Medicine Hospital of Zengcheng District(广州市增城区中医医院)

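AI summary: Proposes the L-VARC framework, which enhances visual reasoning through a language-guided Learning Using Privileged Information branch, with a Semantic Compression Module and a Cross-Attention Projector, and surpasses existing methods on ARC tasks with 18M parameters.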

AI中文摘要

抽象与推理语料库(ARC)被视为通往通用人工智能(AGI)的关键途径,因为它使模型能够从少量示例中学习抽象转换规则,然后泛化到新任务。然而,主流的ARC方法要么是纯语言,要么是纯视觉(即VARC)。前者严重依赖大语言模型,消耗数十亿参数;后者通常难以捕捉高层语义,导致在像素级模式上过拟合。为弥合这一差距,我们提出L-VARC,一种通过语言引导的特权信息学习(LUPI)分支增强视觉推理的新框架。具体来说,我们通过将统一的任务无关提示输入DeepSeek-V3来设计语义压缩模块。这样,原始的LARC(一个众包语言描述数据集)可以被大幅精炼和结构化,以适应标准文本编码器(如CLIP)的上下文长度约束。此外,我们设计了交叉注意力投影器来对齐视觉特征与语义嵌入,旨在指导ARC模型的训练。值得注意的是,LUPI分支在训练过程中使用,推理时被丢弃,从而产生一个仅1800万参数的轻量级模型。大量实验表明,我们的L-VARC有效利用语言先验提升视觉推理,并超越现有最优方法。消融研究进一步证实了这两个新设计对L-VARC框架的贡献。代码见 https://github.com/GZHU-DVL/L-VARC。

英文摘要

The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it enables models to learn abstract transformation rules from few-shot examples and then generalize to new tasks. However, prevalent ARC methods are either purely language-based or vision-only (i.e., VARC). The former depend heavily on LLMs, consuming billions of parameters. The latter often struggle to capture high-level semantics, leading to overfitting on pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module by feeding a unified, task-agnostic prompt into DeepSeek-V3. In this way, the raw LARC (a crowd-sourced language description dataset) can be substantially refined and structured to fit the context-length constraint of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector to align visual features with semantic embeddings, aiming to guide the training of the ARC model. Notably, the LUPI branch is used only during training and is discarded at inference, yielding a lightweight model with a mere 18 million parameters. Extensive experiments demonstrate that L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art methods. Ablation studies further confirm the contribution of the two new designs to the L-VARC framework. The code is available at https://github.com/GZHU-DVL/L-VARC.

2606.12886 2026-06-12 cs.CV cs.AI 新提交

Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement

交错思维中的模态隔离桥接:通过逐步强化监督模态转换

Tingyu Li, Le Zhou, Siyuan Li, Yujun Wu, Xinglong Xu, Jingxuan Wei, Conghui He, Cheng Tan

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Shanghai Jiaotong University(上海交通大学) Zhejiang University(浙江大学) University of Chinese Academy of Sciences(中国科学院大学)

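AI summary: Proposes the MoTiF framework, which optimizes modality transition fidelity through Reflective SFT and Flow-GRPO to address modal isolation, the disconnect between images and text in interleaved thinking, improving cross-modal coherence and task accuracy.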

Comments 22 pages, 5 figures, 6 tables

AI中文摘要

交错思维是一种统一的多模态模型交替进行文本推理和视觉生成的方法,在空间和物理任务上显示出潜力。然而,在复杂的长链场景中,我们识别出一个基本故障模式:生成的图像偏离文本上下文,而后续文本忽略视觉证据,导致两种模态交替但并未真正相互通知。我们将其称为模态隔离,并归因于模态边界处的信息损失累积。我们将每个推理循环分解为原子操作,并定义模态转换损失,量化每个边界处的跨模态幻觉(文本到图像)和视觉利用不足(图像到文本)。我们提出MoTiF(模态转换保真度),一个两阶段训练框架,直接优化这些转换:反射式SFT训练模型检测和恢复错误的视觉输出;Flow-GRPO通过强化学习提高图像生成保真度。MoTiF中的所有训练信号来自转换级保真度而非最终任务准确性。在四个视觉谜题基准测试中,这种转换级监督显著提高了跨模态一致性和最终任务准确性。结果表明,有效的交错推理需要在模态边界处进行明确的结构监督,而不仅仅是扩展或最终任务优化。

英文摘要

Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spatial and physical tasks. However, in complex long-chain scenarios, we identify a fundamental failure mode: generated images diverge from the textual context while subsequent text ignores the visual evidence, causing the two modalities to alternate without genuinely informing each other. We term this Modal Isolation and attribute it to compounding information loss at modality boundaries. We decompose each reasoning cycle into atomic operations and define modality transition loss, quantifying cross-modal hallucination (text-to-image) and visual utilization deficit (image-to-text) at each boundary. We propose MoTiF (Modality Transition Fidelity), a two-stage training framework that directly optimizes these transitions: Reflective SFT trains the model to detect and recover from erroneous visual outputs; Flow-GRPO improves image generation fidelity via reinforcement learning. All training signals in MoTiF derive from transition-level fidelity rather than end-task accuracy. Across four visual puzzle benchmarks, this transition-level supervision substantially improves both cross-modal coherence and final task accuracy. The results demonstrate that effective interleaved reasoning requires explicit structural supervision at modality boundaries, not merely scaling or end-task optimization.

2606.12898 2026-06-12 cs.CV cs.CL 新提交

Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension

放大关键信息:面向视觉文本理解的注意力引导自适应渲染

Shenglai Zeng, Qirui Wang, Kai Guo, Xinnan Dai, Xianxuan Long, Hui Liu

发表机构 * Michigan State University(密歇根州立大学) Xi’an Jiaotong University(西安交通大学)

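AI summary: To address the gap between localization and utilization that vision-language models show on visual text comprehension tasks, the paper proposes AGAR, a training-free, model-agnostic attention-guided adaptive rendering method that improves model performance by enlarging key text spans.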

AI中文摘要

视觉文本理解(VTC)将文本渲染为图像供视觉语言模型(VLM)阅读,绕过了LLM的上下文窗口限制,并支持从长页OCR到多页记忆问答等应用。然而,现有的VTC流水线将渲染和布局视为固定的、内容无关的预处理步骤,并且对VLM内部如何处理可视化文本的机制理解甚少。通过对VTC问答任务的聚焦实证研究,我们揭示了VLM存在一种“定位而不利用”的模式:证据定位注意力在中间到后期层中急剧出现,并且与答案正确性在很大程度上解耦,然而仅仅放大渲染页面上定位的跨度就能恢复大部分失败。基于这些观察,我们提出了AGAR(注意力引导自适应渲染),一种无需训练、模型无关的方法,该方法利用VLM自身的中间到后期层注意力来识别前K个重要的视觉补丁,将它们映射回单词跨度,并在重新推理答案之前重新渲染页面,放大这些跨度。在九个VTC基准测试(短文本、长上下文和多页记忆问答)和四个VLM骨干上的大量实验表明,AGAR(i)作为即插即用的增强,持续改进了现成的VLM,(ii)与VLM后训练相结合可带来进一步收益,并且(iii)在视觉和文本侧输入退化下保持鲁棒性。

英文摘要

Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. Through a focused empirical study on VTC QA tasks, we reveal that VLMs exhibit a localization-without-utilization regime: evidence-localizing attention emerges sharply in the middle-to-late layers and is largely decoupled from answer correctness, yet simply enlarging the localized spans on the rendered page recovers a large fraction of the failures. Building on these observations, we propose AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that leverages a VLM's own middle-to-late layer attention to identify the top-K important visual patches, maps them back to word spans, and re-renders the page with those spans enlarged before re-inferring the answer. Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR (i) consistently improves off-the-shelf VLMs as a plug-and-play enhancement, (ii) composes with VLM post-training to yield further gains, and (iii) remains robust under both visual- and text-side input degradation.
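The selection step described above (take the top-K attended patches, then map them back to word spans) can be sketched as follows. The `patch_to_word` lookup is a hypothetical artifact of the rendering step, assumed here to record which word each image patch covers; the abstract does not describe how AGAR stores this mapping or how it aggregates attention across layers and heads.

```python
def top_k_patches(attention, k):
    """Indices of the k patches with the highest attention mass."""
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return order[:k]


def patches_to_word_spans(patch_ids, patch_to_word):
    """Map patches to word indices and merge neighbors into (start, end) spans.

    patch_to_word: HYPOTHETICAL dict from patch index to the index of the word
    rendered under that patch; patches over blank margins are simply absent.
    """
    words = sorted({patch_to_word[p] for p in patch_ids if p in patch_to_word})
    spans = []
    for w in words:
        if spans and w == spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], w)  # extend the current span
        else:
            spans.append((w, w))           # start a new span
    return spans
```

The resulting spans are what a renderer would enlarge before the second inference pass.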

2606.12985 2026-06-12 cs.CV 新提交

Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video

物体先于词汇:用于从儿童视角视频中语言接地学习的物体优先归纳偏置

Sathira Silva, Abrham Kahsay Gebreselasie, Muhammad Umer Sheikh, Kartik Kuckreja, Daniel Harari, Muhammad Haris Khan

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Weizmann Institute of Science(魏茨曼科学研究所)

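AI summary: To resolve the dual ambiguity of when and where a named referent appears in infant-view video, the paper proposes BabyMind, which improves language grounding under sparse, weak supervision through an object-first inductive bias, a mask-based region interface, and prototype-space multiple-instance contrastive learning.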

AI中文摘要

从自然经验中学习接地词汇含义需要解决婴儿视角记录中的两个歧义:命名参照物何时出现以及在杂乱画面中的位置。在SAYCam风格的数据中,看护者的语言稀疏且与自我中心视频弱同步,因此单帧对比配对会产生噪声正样本,其中目标物体缺失或被干扰物纠缠。我们提出BabyMind,一种在稀疏、噪声监督下用于儿童视角对比学习的物体优先偏置。BabyMind使用离线掩码区域接口提取候选物体嵌入,通过跟踪将短话语中心窗口内的候选物体链接成轻量级物体文件,并使用原型空间多实例对比目标将话语与物体文件袋对齐。轨迹一致性和全局物体一致性正则化器稳定学习,并将物体文件结构转移到评估时使用的全局帧嵌入中。在SAYCam-S上,BabyMind将Labeled-S 15强制选择准确率比CVCL提高了+2.6个点,并在词汇内分布外基准测试中取得一致提升。代码可在 https://github.com/sathiiii/BabyMind 获取。

英文摘要

Learning grounded word meaning from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors. We propose BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at https://github.com/sathiiii/BabyMind.

2606.13061 2026-06-12 cs.CV 新提交

LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck

LaME: 通过信息瓶颈在潜在空间中进行多模态嵌入的推理学习

Peixi Wu, Biao Yang, Feipeng Ma, Bosong Chai, Bo Lin, Wei Yuan, Fan Yang, Tingting Gao, Hebei Li, Xiaoyan Sun

发表机构 * University of Science and Technology of China(中国科学技术大学) Kuaishou Technology(快手科技) Zhejiang University(浙江大学) Tsinghua University(清华大学)

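AI summary: Proposes LaME, which formulates embedding-oriented latent reasoning as a weakly supervised information bottleneck and uses learnable reasoning tokens to complete reasoning in a single forward pass, avoiding the high computational cost and annotation dependence of explicit CoT and achieving a 60x speedup.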

AI中文摘要

基于推理的通用多模态嵌入通过将思维链(CoT)推理引入嵌入流程取得了快速进展。尽管在通用和复杂任务上表现强劲,该范式存在两个核心限制:(i) 自回归CoT推理计算成本高,使其不适用于低延迟检索;(ii) 嵌入性能与CoT标注质量高度耦合,导致大规模训练不可靠。这些引出了基本问题:文本CoT是否是嵌入的最优推理形式,以及有效的嵌入推理能否在潜在空间中完成?为此,我们提出LaME(潜在推理多模态嵌入),将面向嵌入的潜在推理建模为弱监督信息瓶颈。LaME采用K个可学习推理令牌作为固定容量瓶颈,在单次前向传播中完成所有推理。两个弱监督信号在结构上解耦了对比目标和自回归目标,消除了对CoT标注的依赖,而两阶段训练流程确保了稳定收敛。在MMEB-v2和MRMR上的实验表明,LaME达到了有竞争力的性能,超越了某些显式CoT模型,同时推理速度比显式CoT方法快60倍,比潜在基线快2倍,吞吐量与判别式嵌入模型相当。代码将开源。

英文摘要

Reasoning-driven universal multimodal embedding has advanced rapidly by introducing Chain-of-Thought (CoT) reasoning into the embedding pipeline. Despite the strong performance across both general and complex tasks, this paradigm suffers from two core limitations: (i) autoregressive CoT reasoning incurs high computational cost, making it impractical for low-latency retrieval; and (ii) embedding performance is heavily coupled with CoT annotation quality, making large-scale training unreliable. These raise fundamental questions: Is textual CoT the optimal form of reasoning for embedding, and can effective embedding reasoning be accomplished in latent space? To this end, we propose LaME (Latent Reasoning Multimodal Embedding), which formulates embedding-oriented latent reasoning as a weakly supervised information bottleneck. LaME employs K learnable reason tokens as a fixed-capacity bottleneck, completing all reasoning within a single forward pass. The two weak supervision signals structurally decouple contrastive from autoregressive objectives and eliminate dependence on CoT annotations, while a two-stage training pipeline ensures stable convergence. Experiments on MMEB-v2 and MRMR show that LaME achieves competitive performance, surpassing some explicit CoT-based models, while delivering 60x faster inference than explicit CoT methods and 2x faster than latent baselines with throughput comparable to discriminative embedding models. Code will be released.

2606.13156 2026-06-12 cs.CV cs.AI 新提交

Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

迭代视觉思维:通过视觉反馈教会视觉语言模型空间自我修正

Animesh Tripathy, Aswanth Krishnan

发表机构 * QpiAI India Pvt. Ltd(QpiAI印度私人有限公司)

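AI summary: Proposes the Iterative Visual Thinking (IVT) framework, which gives vision-language models spatial self-correction ability through a closed visual-feedback loop and two-phase training (SFT + GRPO), improving metrics by 2.4-3.2 percentage points across three benchmarks.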

AI中文摘要

视觉语言模型(VLM)在单次空间定位上表现强劲,但缺乏观察和修正自身预测的机制。我们发现,简单地提示VLM在其预测的渲染可视化上迭代会导致灾难性失败:指代表达理解的Acc@0.5从79.6%骤降至48.7%(下降31个百分点),揭示了定位能力与自我修正能力之间的根本差距。我们提出迭代视觉思维(IVT),一种闭环框架,其中模型预测边界框,观察预测在图像上的渲染结果,并通过视觉反馈迭代优化。两阶段训练方案弥合了自我修正差距:首先,我们利用基础模型自身的预测作为真实错误,并提示教师VLM生成修正推理轨迹,从而无需人工标注即可获得监督数据;其次,我们应用组相对策略优化(GRPO)和简单的IoU奖励来稳定多步优化。在涵盖RefCOCOg、Ref-Adv和Ref-L4的混合基准(505个测试样本)上,使用IVT的SFT预热在每个指标上都超过了单次基础模型:Acc@0.5升至82.0%(+2.4个百分点),Acc@0.7升至74.1%(+3.2个百分点),Acc@0.9升至48.3%(+2.8个百分点)。GRPO进一步将每步IoU退化减少了5倍,稳定了优化轨迹。所有训练仅使用单个GPU上的2400个样本,表明空间自我修正是一种可学习的能力,可以在适度规模下灌输。

英文摘要

Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: Acc@0.5 on referring expression comprehension collapses from 79.6% to 48.7% (a 31 percentage point drop), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes the prediction rendered on the image, and iteratively refines through visual feedback. A two-phase training recipe closes the self-correction gap: first, we exploit the base model's own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised data without human annotation; second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), SFT warm-up with IVT surpasses the single-shot base model on every metric: Acc@0.5 rises to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.
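The quantities named above can be sketched in a few lines: box IoU (the reward signal), Acc@t (the fraction of predictions with IoU at or above a threshold t), and the group-relative advantage that gives GRPO its name, in which each sampled response's reward is normalized against the other samples for the same prompt. This is a minimal sketch of the standard definitions; the box format `(x1, y1, x2, y2)`, the `>=` threshold convention, and the `eps` constant are assumptions, and the paper's multi-step reward shaping is not reproduced.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(ious, threshold):
    """Acc@threshold: share of predictions whose IoU reaches the threshold."""
    return sum(v >= threshold for v in ious) / len(ious)


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

As a consistency check on the reported numbers, the +2.4pp gain at Acc@0.5 matches the single-shot figure quoted earlier (82.0 - 2.4 = 79.6), and the naive-iteration drop is 79.6 - 48.7 = 30.9 points, which the abstract rounds to 31.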

2606.13288 2026-06-12 cs.CV cs.AI cs.CL 新提交

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

跨模态掩码组合概念建模以增强视觉-语言组合性

Wei Li, Zhen Huang, Xinmei Tian

发表机构 * MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China(中国科学技术大学,教育部脑启发智能感知与认知重点实验室) Independent Researcher(独立研究员)

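AI summary: Proposes the MACCO framework, which masks compositional concepts in one modality and reconstructs them from the full context of the other, strengthening the compositional understanding of vision-language models with significant gains on five benchmarks.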

Comments Accepted to ACL 2026 Main Conference, 25 pages

AI中文摘要

对比训练的视觉-语言模型(如CLIP)在学习联合图像-文本表示方面取得了显著进展,但在组合理解方面仍面临挑战。它们通常表现出“词袋”行为:难以捕捉对象关系、属性-对象绑定和词序依赖。这一限制不仅源于优化时依赖全局单向量表示,还源于对配对图像文本数据中固有丰富组合信息的利用和建模不足。在这项工作中,我们提出了MACCO(掩码组合概念建模)框架,该框架掩码一个模态中的组合概念,并基于另一模态的完整上下文信息重建它们,从而使模型能够更有效地捕捉和对齐跨模态组合结构。为促进这一过程,我们引入了两个辅助目标,在模态间和模态内联合对齐和正则化掩码特征。在五个组合基准上的大量实验和深入分析表明,我们的方法不仅显著增强了VLM的组合性,还提高了它们捕捉句法结构和语言信息的能力。此外,改进的组合性也有利于文本到图像生成和多模态大语言模型。代码可在 https://github.com/hiker-lw/MACCO 获取。

英文摘要

Contrastively trained vision-language models such as CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit "bag-of-words" behavior, struggling to capture object relations, attribute-object bindings, and word-order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from the insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. The improved compositionality also benefits text-to-image generation and multimodal large language models. Code is available at https://github.com/hiker-lw/MACCO.

2606.13289 2026-06-12 cs.CV cs.AI 新提交

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

HYDRA-X: 具有整体视觉分词器的原生统一多模态模型

Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, Junzhe Li, Tao Huang, Xiao Zhang, Yang Li, Jianbing Wu, Miles Yang, Zhao Zhong, Liefeng Bo, Limin Wang

发表机构 * Nanjing University(南京大学) CASIA(中国科学院自动化研究所) Tencent Hunyuan(腾讯混元) Zhongguancun Academy(中关村学院) Shanghai AI Lab(上海人工智能实验室)

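AI summary: Proposes HYDRA-X, the first native unified multimodal model to unify image and video tokenization within a single ViT. It achieves efficient reconstruction through causal temporal attention and hierarchical temporal compression, and injects semantics with a lightweight decompressor, significantly improving editing consistency and convergence speed.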

AI中文摘要

整体视觉分词器是统一多模态模型(UMMs)的基础,因为它们将多样的视觉输入映射到统一的表示空间。在本文中,我们提出HYDRA-X,这是首个在单一视觉变换器(ViT)中统一图像和视频分词的原生UMM。我们的设计由两个核心挑战驱动:高效地将时空重建能力注入原生ViT,以及将图像级和视频级语义感知嵌入到潜在空间中。为解决第一个挑战,全面的消融实验揭示了两个关键发现:(1)帧级因果时间注意力足以用于视觉重建,而全时空注意力会降低重建质量;(2)分层时间压缩显著优于单步替代方案。为解决第二个挑战,我们提出了一种轻量化解压缩器,在联合图像-视频教师监督下对时间压缩特征进行上采样,从而在紧凑的潜在空间中强制实施互补的语义结构。基于这种整体分词器,我们进一步提出了编辑流程的原则性改进:源-目标交互应在分词器内部的潜在级别发生,而不是在LLM内部的语义级别,从而显著提高编辑一致性并加速收敛。在7B密集模型上实例化,HYDRA-X在图像和视频理解及生成任务上均取得了强劲性能,为未来的统一分词器UMM铺平了道路。

英文摘要

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement of the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, substantially improving editing consistency and accelerating convergence. Instantiated as a 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.
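One plausible reading of frame-level causal temporal attention is a block-causal mask: a token may attend to every token in its own frame and in earlier frames, but to none in later frames. The sketch below builds such a mask over a flattened token sequence. This interpretation and the flattened frame-major layout are assumptions; the abstract does not give the exact attention pattern used in HYDRA-X.

```python
def frame_causal_mask(num_frames, tokens_per_frame):
    """Boolean mask[q][k]: True when query token q may attend to key token k.

    Tokens are assumed to be laid out frame by frame, so token i belongs to
    frame i // tokens_per_frame. Attention is allowed within a frame and
    toward earlier frames only (block-causal over time).
    """
    n = num_frames * tokens_per_frame
    frame_of = [i // tokens_per_frame for i in range(n)]
    return [[frame_of[k] <= frame_of[q] for k in range(n)] for q in range(n)]
```

With a single frame the mask is all True, so an image is treated as a one-frame video under the same mechanism.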

2606.13673 2026-06-12 cs.CV cs.AI 新提交

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

SpatialClaw:重新思考智能体空间推理的动作接口

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen

发表机构 * KAIST(韩国科学技术院) NVIDIA(英伟达)

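AI summary: Proposes the SpatialClaw framework, which uses code as the action interface: with a stateful Python kernel and perception and geometry primitives, a VLM agent executes step by step and flexibly composes intermediate results, reaching 59.9% average accuracy on 20 3D/4D spatial reasoning benchmarks, 11.2 percentage points higher than existing methods.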

Comments Project page: https://spatialclaw.github.io/

AI中文摘要

空间推理——确定物体在3D空间中的位置、关系及运动方式的能力——仍然是视觉语言模型(VLM)面临的基本挑战。工具增强型智能体试图通过为VLM添加专业感知模块来解决这一问题,但其有效性受限于调用这些工具的动作接口。本文研究该接口的设计如何影响智能体进行开放式空间推理的能力。现有的空间智能体要么采用单次代码执行,即在观察到任何中间结果之前就确定完整的分析策略;要么依赖结构化的工具调用接口,这通常缺乏自由组合操作或针对每个任务定制分析的灵活性。这两种设计对开放式、复杂的3D/4D空间推理的灵活性有限。因此,我们提出SpatialClaw,一个无需训练的空间推理框架,采用代码作为动作接口。SpatialClaw维护一个状态化的Python内核,预加载输入帧和一套感知与几何原语,让基于VLM的智能体在每一步根据所有先前输出编写一个可执行单元,从而灵活地组合和操作感知结果,并根据中间文本和视觉观察以及每个问题的需求调整其分析。在涵盖广泛静态和动态3D/4D空间推理任务的20个空间推理基准上评估,SpatialClaw实现了59.9%的平均准确率,比最新的空间智能体高出11.2个百分点,并且在来自两个模型家族的六个VLM骨干网络上均取得一致提升,无需任何基准或模型特定的适配。

英文摘要

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface, which makes it harder to freely compose operations or tailor the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, and lets a VLM-backed agent write one executable cell per step conditioned on all prior outputs. The agent can thus flexibly compose and manipulate perception results and adapt its analysis both to intermediate textual and visual observations and to the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the most recent spatial agent by 11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
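The contrast between single-pass execution and a stateful kernel can be illustrated with a toy kernel that keeps one namespace alive across cells, so each new cell can build on variables defined by earlier ones and the agent can read every intermediate output before writing the next step. This is a minimal sketch under stated assumptions: the class name, the preloaded `frames` variable, and the error format are hypothetical and not SpatialClaw's implementation, and `exec` on model-written code would need sandboxing in any real system.

```python
import contextlib
import io


class StatefulKernel:
    """Toy stand-in for a persistent Python kernel (illustrative only)."""

    def __init__(self, preloaded=None):
        # One shared namespace persists across all cells.
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return its printed output (or the error)."""
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(source, self.namespace)
        except Exception as exc:
            # Surface the failure as an observation instead of crashing.
            return buffer.getvalue() + f"{type(exc).__name__}: {exc}"
        return buffer.getvalue()


kernel = StatefulKernel({"frames": [0, 1, 2]})
first = kernel.run_cell("n = len(frames)\nprint(n)")
second = kernel.run_cell("print(n * 2)")  # reuses n from the previous cell
```

Returning errors as text mirrors the idea that every observation, including a failure, can condition the next cell the agent writes.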

2606.12555 2026-06-12 cs.SD cs.CV cs.MM 交叉投稿

AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation

AudioX-Turbo:高效任意到音频生成的统一框架

Zeyue Tian, Lei Ke, Zhaoyang Liu, Ruibin Yuan, Liumeng Xue, Yujiu Yang, Weijia Chen, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Tsinghua University(清华大学) Noiz AI Independent Researcher(独立研究员)

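AI summary: Proposes AudioX-Turbo, a unified and efficient framework built on a teacher-student paradigm. It combines a Multimodal Diffusion Transformer with Distribution Matching Distillation to generate audio from text, video, and audio inputs in only 4 sampling steps, requiring approximately 25x fewer function evaluations (NFE) than multi-step baselines.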

详情
AI中文摘要

基于灵活的多模态控制信号生成音频和音乐是一个广泛适用的课题,面临以下关键挑战:1) 统一的多模态建模框架,2) 大规模、高质量的训练数据,3) 多步扩散采样的高昂推理成本。为此,我们提出AudioX-Turbo,一个统一且高效的任意到音频生成框架,集成了多种多模态条件(即文本、视频和音频信号)。AudioX-Turbo遵循教师-学生范式。教师模型AudioX-Base基于多模态扩散Transformer,并带有多模态自适应融合(Multimodal Adaptive Fusion)模块,用于对齐多样化的多模态输入以实现高保真合成,然后通过适用于流匹配的分布匹配蒸馏将其蒸馏为少步学生模型AudioX-Turbo,并辅以基于扩散的判别器以实现高质量的少步生成。为支持AudioX-Turbo的训练,我们构建了一个大规模、高质量的数据集IF-caps-Pro,包含约920万个样本,通过两阶段数据收集和标注流程整理而成。我们在广泛的任务上对AudioX-Turbo进行基准测试,发现我们的模型实现了优越的性能,尤其是在文本到音频和文本到音乐生成方面,同时仅需4个采样步骤,所需的函数评估次数(NFE)约为多步基线的1/25。这些结果表明,我们的方法能够在灵活的多模态控制下进行音频生成,展现出高效且强大的指令跟随能力。代码和数据集将在 https://zeyuet.github.io/AudioX-Turbo/ 上提供。

英文摘要

Audio and music generation based on flexible multimodal control signals is a widely applicable topic with three key challenges: 1) a unified multimodal modeling framework, 2) large-scale, high-quality training data, and 3) the prohibitive inference cost of multi-step diffusion sampling. To address these challenges, we propose AudioX-Turbo, a unified and efficient framework for anything-to-audio generation that integrates varied multimodal conditions (i.e., text, video, and audio signals). AudioX-Turbo follows a teacher-student paradigm. The teacher, AudioX-Base, is built on a Multimodal Diffusion Transformer with a Multimodal Adaptive Fusion module that aligns diverse multimodal inputs for high-fidelity synthesis. It is then distilled into the few-step student, AudioX-Turbo, via Distribution Matching Distillation adapted to flow matching, complemented by a diffusion-based discriminator for high-quality few-step generation. To support the training of AudioX-Turbo, we construct a large-scale, high-quality dataset, IF-caps-Pro, comprising approximately 9.2M samples curated through a two-stage data collection and annotation pipeline. We benchmark AudioX-Turbo across a wide range of tasks, finding that our model achieves superior performance, especially on text-to-audio and text-to-music generation, while operating at only 4 sampling steps and requiring approximately 25x fewer function evaluations (NFE) than multi-step baselines. These results demonstrate that our method is capable of audio generation under flexible multimodal control, showing efficient and powerful instruction-following capabilities. The code and datasets will be available at https://zeyuet.github.io/AudioX-Turbo/.

2504.21561 2026-06-12 cs.CV 版本更新

Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

通过逐步偏好调优的多模态智能体迭代工具使用探索

Pengxiang Li, Zhi Gao, Bofei Zhang, Yapeng Mi, Xiaojian Ma, Chenrui Shi, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, Qing Li

发表机构 * Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology(北京智能信息科技重点实验室,计算机科学与技术学院,北京理工大学) State Key Laboratory of General Artificial Intelligence, BIGAI(通用人工智能国家重点实验室,BIGAI) State Key Laboratory of General Artificial Intelligence, Peking University(通用人工智能国家重点实验室,北京大学) Harbin Institute of Technology(哈尔滨工业大学) Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University(广东机器感知与智能计算实验室,深圳MSU-BIT大学) Department of Automation, Tsinghua University(自动化系,清华大学)

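AI summary: Proposes SPORT, an iterative loop of task synthesis, step sampling, step verification, and preference tuning that lets multimodal agents explore and refine tool-usage strategies autonomously without pre-collected data, improving results on the GTA and GAIA benchmarks by 6.41% and 3.64%, respectively.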

Comments 24 pages

详情
AI中文摘要

多模态智能体将控制器(例如视觉语言模型)与外部工具集成,在解决复杂多模态任务方面展现了卓越的能力。现有训练这些智能体的方法,包括监督微调和强化学习,都依赖于大量人工标注的任务-答案对和工具轨迹。然而,对于复杂多模态任务,此类标注成本过高或难以实现。本文提出一种无需任何预收集数据的多模态智能体迭代工具使用探索方法,即SPORT,通过逐步偏好优化来改进工具使用轨迹。我们的方法使多模态智能体能够通过自我探索和优化自主发现有效的工具使用策略,消除了人工标注的瓶颈。SPORT包含四个迭代组件:任务合成、步骤采样、步骤验证和偏好调优。我们首先使用语言模型合成多模态任务。然后,我们引入一种新颖的轨迹探索方案,其中步骤采样和步骤验证交替执行以解决合成任务。在步骤采样中,智能体尝试不同的工具并获取相应结果。在步骤验证中,我们使用验证器提供AI反馈以构建逐步偏好数据。该数据随后通过偏好调优用于更新控制器的工具使用,生成SPORT智能体。通过与真实环境交互,SPORT智能体逐渐演化为更精细和更有能力的系统。在GTA和GAIA基准上的评估显示,SPORT智能体分别实现了6.41%和3.64%的提升,突显了我们方法的泛化性和有效性。项目页面见 https://SPORT-Agents.github.io。

英文摘要

Multimodal agents, which integrate a controller (e.g., a vision-language model) with external tools, have demonstrated remarkable capabilities in tackling complex multimodal tasks. Existing approaches for training these agents, both supervised fine-tuning and reinforcement learning, depend on extensive human-annotated task-answer pairs and tool trajectories. However, for complex multimodal tasks, such annotations are prohibitively expensive or impractical to obtain. In this paper, we propose SPORT, an iterative tool usage exploration method for multimodal agents that requires no pre-collected data and refines tool-usage trajectories via step-wise preference optimization. Our method enables multimodal agents to autonomously discover effective tool usage strategies through self-exploration and optimization, eliminating the bottleneck of human annotation. SPORT has four iterative components: task synthesis, step sampling, step verification, and preference tuning. We first synthesize multimodal tasks using language models. Then, we introduce a novel trajectory exploration scheme, where step sampling and step verification are executed alternately to solve synthesized tasks. In step sampling, the agent tries different tools and obtains corresponding results. In step verification, we employ a verifier to provide AI feedback to construct step-wise preference data. The data is subsequently used to update the controller for tool usage through preference tuning, producing a SPORT agent. By interacting with real environments, the SPORT agent gradually evolves into a more refined and capable system. Evaluation on the GTA and GAIA benchmarks shows that the SPORT agent achieves 6.41% and 3.64% improvements, respectively, underscoring the generalization and effectiveness of our method. The project page is https://SPORT-Agents.github.io.
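As a rough illustration of how step sampling and step verification could yield step-wise preference data, the sketch below keeps the best- and worst-scored candidate at each step as a (chosen, rejected) pair. It is an assumption-laden toy, not the SPORT pipeline: `build_step_preferences`, the `min_margin` filter, and the score table standing in for the verifier are all hypothetical.

```python
def build_step_preferences(candidates_per_step, verifier, min_margin=0.0):
    """Turn scored candidate actions into (step, chosen, rejected) tuples.

    candidates_per_step: one non-empty list of candidate actions per step.
    verifier: callable mapping an action to a numeric score (higher is better).
    min_margin: steps whose best and worst scores differ by no more than this
        are skipped, because they carry no preference signal.
    """
    pairs = []
    for step_index, candidates in enumerate(candidates_per_step):
        scored = sorted(candidates, key=verifier, reverse=True)
        best, worst = scored[0], scored[-1]
        if verifier(best) - verifier(worst) > min_margin:
            pairs.append((step_index, best, worst))
    return pairs


# Hypothetical verifier: a lookup table standing in for AI feedback.
toy_scores = {
    "ocr(image)": 0.9,
    "caption(image)": 0.4,
    "search('cat')": 0.1,
    "calc('3*7')": 0.8,
    "calc('3+7')": 0.8,
}
steps = [
    ["caption(image)", "ocr(image)", "search('cat')"],
    ["calc('3*7')", "calc('3+7')"],
]
pairs = build_step_preferences(steps, toy_scores.get)
print(pairs)  # [(0, 'ocr(image)', "search('cat')")]
```

The second step is dropped because both candidates score the same; pairs of this kind are the usual input to a preference-tuning objective.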

2602.00462 2026-06-12 cs.CV cs.AI 版本更新

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

LatentLens: 揭示大语言模型中高度可解释的视觉标记

Benno Krojer, Shravan Nayak, Oscar Mañas, Vaibhav Adlakha, Desmond Elliott, Siva Reddy, Marius Mosbach

发表机构 * University of Cambridge(剑桥大学)

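AI summary: Proposes LatentLens, which interprets visual tokens by matching them to their nearest neighbors among contextualized token representations from a text corpus, and finds that most visual tokens are interpretable at every layer.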

Comments ICML 2026 (Camera Ready)

详情
AI中文摘要

将大型语言模型(LLM)转换为视觉语言模型(VLM)可以通过将视觉编码器输出的视觉标记映射到LLM的嵌入空间来实现。有趣的是,这种映射可以简单到浅层MLP变换。为了理解LLM为何能如此容易地处理视觉标记,我们需要可解释性方法来揭示在LLM处理的每一层中视觉标记表示所编码的内容。在这项工作中,我们引入了LatentLens,一种将潜在表示映射到自然语言描述的新方法。LatentLens编码一个大型文本语料库,并存储该语料库中每个标记的上下文化标记表示。然后将视觉标记表示与这些上下文化表示进行比较,并将最邻近的表示作为视觉标记的描述。我们在15个不同的VLM上评估了该方法,结果表明,常用的方法(如LogitLens)大大低估了视觉标记的可解释性。相反,使用LatentLens,大多数视觉标记在所有研究的模型和所有层中都是可解释的。定性上,我们展示了LatentLens产生的描述在语义上有意义,并且与单个标记相比,为人类提供了更细粒度的解释。更广泛地说,我们的发现为视觉和语言表示之间的对齐提供了新的证据,并为分析LLM的潜在表示开辟了新的方向。

英文摘要

Transforming a large language model (LLM) into a vision-language model (VLM) can be achieved by mapping the visual tokens from a vision encoder into the embedding space of an LLM. Intriguingly, this mapping can be as simple as a shallow MLP transformation. To understand why LLMs can so readily process visual tokens, we need interpretability methods that reveal what is encoded in the visual token representations at every layer of LLM processing. In this work, we introduce LatentLens, a novel approach for mapping latent representations to descriptions in natural language. LatentLens encodes a large text corpus and stores contextualized token representations for each token in that corpus. Visual token representations are then compared to these contextualized representations and the top-nearest neighbor representations serve as descriptions of the visual token. We evaluate this method on 15 different VLMs, showing that commonly used methods, such as LogitLens, substantially underestimate the interpretability of visual tokens. With LatentLens instead, the majority of visual tokens are interpretable across all studied models and all layers. Qualitatively, we show that the descriptions produced by LatentLens are semantically meaningful and provide more fine-grained interpretations for humans compared to individual tokens. More broadly, our findings contribute new evidence on the alignment between vision and language representations and open up new directions for analyzing the latent representations of LLMs.
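The lookup step described above (compare a visual token representation with stored contextualized text-token representations and keep the nearest neighbors as its description) can be sketched in a few lines. The vectors, the bracketed context labels, and the choice of cosine similarity are illustrative assumptions; the abstract does not specify the similarity measure or the storage format.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_descriptions(visual_vec, text_bank, k=2):
    """Return the labels of the k bank entries most similar to visual_vec.

    text_bank: list of (label, vector) pairs, one per token-in-context.
    """
    ranked = sorted(text_bank, key=lambda item: cosine(visual_vec, item[1]), reverse=True)
    return [label for label, _ in ranked[:k]]


# Hypothetical 3-dimensional representations; real ones would be model hidden states.
bank = [
    ("the red [apple]", [0.9, 0.1, 0.0]),
    ("a ripe [tomato]", [0.8, 0.3, 0.1]),
    ("blue [sky] above", [0.0, 0.2, 0.9]),
]
visual_token = [1.0, 0.2, 0.0]
print(nearest_descriptions(visual_token, bank))  # ['the red [apple]', 'a ripe [tomato]']
```

Because each bank entry is a token in its sentence context, the returned neighbors read as short descriptions rather than isolated vocabulary items.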

2602.07106 2026-06-12 cs.CV cs.AI cs.CL 版本更新

Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

Ex-Omni:为全模态大语言模型赋能3D面部动画生成

Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) LIGHTSPEED Independent Researcher(独立研究员)

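AI summary: Proposes Ex-Omni, which decouples semantic reasoning from temporal generation through a blendshape-aware speech unit generator and a blendshape decoder, and introduces a unified token-as-query gated fusion (TQGF) mechanism so that omni-modal large language models can generate speech together with 3D facial animation.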

详情
AI中文摘要

全模态大语言模型(OLLM)旨在统一多模态理解和生成;然而,尽管联合生成语音和3D面部动画对自然的人机交互十分重要,这一扩展方向仍鲜有探索。一个关键挑战是LLM的离散语义推理与3D面部运动所需的密集时间动态之间的不匹配。我们提出Expressive Omni (Ex-Omni),一个开源模型,通过原生语音伴随的3D面部动画增强OLLM。Ex-Omni通过混合形状感知语音单元生成器和混合形状解码器将语义推理与时间生成解耦,其中语音单元提供时间支架,隐藏语音表示携带面部相关线索。我们进一步引入统一的令牌即查询门控融合(TQGF)机制用于受控语义注入,以及InstructS2SF-1200K,一个包含1200K样本的预训练数据集。大量实验表明,Ex-Omni在保持竞争性语音理解和生成能力的同时,实现了比级联管道更好的音视频同步和更低的面部生成延迟。

英文摘要

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet extending them to jointly produce speech and 3D facial animation remains largely unexplored despite its importance for natural human-computer interaction. A key challenge is the mismatch between the discrete semantic reasoning of LLMs and the dense temporal dynamics required for 3D facial motion. We propose Expressive Omni (Ex-Omni), an open-source model that augments OLLMs with native speech-accompanied 3D facial animation. Ex-Omni decouples semantic reasoning from temporal generation through a blendshape-aware speech unit generator and a blendshape decoder, where speech units provide temporal scaffolding and hidden speech representations carry facially relevant cues. We further introduce a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection, as well as InstructS2SF-1200K, a dataset consisting of 1200K samples for pre-training. Extensive experiments show that Ex-Omni maintains competitive speech understanding and generation ability while achieving better audio-visual synchronization and lower face-generation latency than cascaded pipelines.

2603.06652 2026-06-12 cs.CV cs.AI 版本更新

PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

PaLMR: 通过多模态过程对齐实现忠实视觉推理

Yantao Li, Qiang Hui, Chenyang Yan, Kanzhi Cheng, Fang Zhao, Chao Tan, Huanling Gao, Jianbing Zhang, Kai Wang, Xinyu Dai, Shiguo Lian

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) Data Science & Artificial Intelligence Research Institute, China Unicom(中国联通数据科学与人工智能研究院) Unicom Data Intelligence, China Unicom(中国联通数据智能)

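AI summary: Proposes PaLMR, a framework with a perception-aligned data layer and a process-aligned optimisation layer that reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse.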

详情
Journal ref
CVPR 2026 Findings
AI中文摘要

强化学习近期提升了大语言模型和多模态大语言模型的推理能力,但现有的奖励设计强调最终答案的正确性,因此容忍过程幻觉——即模型在得到正确答案的同时错误感知视觉证据的情况。我们通过PaLMR框架解决这种过程层面的不对齐,该框架不仅对齐结果,还对齐推理过程本身。PaLMR包含两个互补组件:一个感知对齐数据层,构建具有结构化伪真值和可验证视觉事实的过程感知推理数据;以及一个过程对齐优化层,构建具有过程感知评分函数的分层奖励融合方案,以鼓励视觉上可信的思维链并提高训练稳定性。在Qwen2.5-VL-7B上的实验表明,我们的方法显著减少了推理幻觉并提高了视觉推理忠实度,在HallusionBench上取得了最先进的结果,同时在MMMU、MathVista和MathVerse上保持了强劲性能。这些发现表明,PaLMR为过程对齐的多模态推理提供了一条原则性且实用的路径,推进了MLLM的可靠性和可解释性。

英文摘要

Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations--cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that constructs a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.

2605.16713 2026-06-12 cs.CV cs.AI 版本更新

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

GeoWorld-VLM:从世界模型中获取几何结构用于视觉-语言模型

Renjie Gu, Kaichen Zhou, Yan Luo, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab(哈佛人工智能与机器人实验室) Kempner Institute for the Study of Natural and Artificial Intelligence(凯普纳自然与人工智能研究所) Harvard University(哈佛大学)

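AI summary: GeoWorld-VLM transfers geometric structure from frozen camera-conditioned video world models into vision-language models to improve spatial-relation reasoning. Applied to two distinct VLM architectures, it improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks.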

详情
AI中文摘要

现代视觉-语言模型(VLMs)在语义识别方面表现优异,但在“在……左侧”“在……上”“在……后面”“在……之间”等基本空间关系上仍显脆弱。这一失败的原因之一出现在语言推理开始之前:视觉路径在特征提取过程中可能压缩或丢弃关键的3D结构线索,导致语言模型接收到的图像表示已不足以支持可靠的空间判断。我们引入GeoWorld-VLM,一种VLM侧蒸馏框架,将冻结的摄像机条件视频世界模型的几何结构转移到VLMs中。GeoWorld-VLM仅微调图像编码器和多模态投影器,使后投影器图像特征与中间世界模型表示对齐,同时保持主骨干冻结。给定图像、提示和采样的摄像机轨迹,世界模型教师将静态视觉输入转换为合成多视角空间信号。训练结合空间答案监督、教师-学生特征对齐和对原VLM的保留锚点。由于语言模型保持冻结,GeoWorld-VLM保留原始模型的语言能力,同时将空间改进归因于增强的视觉路径。为了评估所提方法的有效性和通用性,我们将GeoWorld-VLM应用于两种不同的VLM架构,并在两个骨干上观察到一致的改进。GeoWorld-VLM在What'sUp和VSR基准上均提升了约4%的性能,表明世界模型引导的视觉对齐在模型结构和空间推理数据集上具有泛化能力。

英文摘要

Modern Vision-Language Models (VLMs) achieve strong semantic recognition, yet remain brittle on elementary spatial relations such as left of, on, behind, and between. One cause of this failure arises before language reasoning begins: the visual pathway may compress or discard critical 3D structural cues during feature extraction, so the language model receives image representations that are already insufficient for reliable spatial judgment. We introduce GeoWorld-VLM, a VLM-side distillation framework that transfers geometric structure from frozen camera-conditioned video world models into VLMs. GeoWorld-VLM fine-tunes only the image encoder and multimodal projector, aligning post-projector image features with intermediate world-model representations while leaving the main backbone frozen. Given images, a prompt, and a sampled camera trajectory, the world-model teacher converts static visual input into a synthetic multi-view spatial signal. Training combines spatial answer supervision, teacher-student feature alignment, and a preservation anchor to the original VLM. Since the language model remains frozen, GeoWorld-VLM preserves the original model's linguistic capabilities while attributing spatial improvements to the enhanced visual pathway. To evaluate the effectiveness and generality of the proposed method, we apply GeoWorld-VLM to two distinct VLM architectures and observe consistent improvements across both backbones. GeoWorld-VLM improves performance by approximately 4 percent on both the What'sUp and VSR benchmarks, suggesting that world-model-guided visual alignment generalizes across model structures and spatial reasoning datasets.

2606.11792 2026-06-12 cs.CV cs.AI cs.CL 版本更新

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

MultiToP:学习修补视觉令牌以减轻视频大型多模态模型中的幻觉

Yuansheng Gao, Wenbin Xing, Jiahao Yuan, Kaiwen Zhou, Han Bao, Zonghui Wang, Wenzhi Chen

发表机构 * Zhejiang University(浙江大学) Sun Yat-sen University(中山大学) East China Normal University(华东师范大学)

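AI summary: Proposes MultiToP, in which a lightweight Visual Token Patcher selectively replaces unreliable visual tokens and is trained with information-guided rank calibration and sparsity regularization. It reduces hallucinations in video large multimodal models without modifying the original model, improving F1 on Vript-HAL by 50.60% for Qwen3-VL-4B-Instruct and yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.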

Comments Preprint

详情
AI中文摘要

视频大型多模态模型在视频理解方面取得了显著进展,但仍容易产生幻觉,即生成的响应未能忠实于输入视频。在本文中,我们提出MultiToP,一种多模态上下文感知的视觉令牌修补框架,通过在语言生成之前优化不可靠的视觉令牌来减轻幻觉。MultiToP引入了一个轻量级的视觉令牌修补器,用于预测令牌级替换分布,并选择性地用动态全局修补令牌替换不可靠的视觉令牌。为了有效训练修补器,我们进一步提出了信息引导的排名校准,利用从主干网络派生的答案条件帧级信息线索来指导令牌替换。结合真实答案监督和稀疏正则化,MultiToP实现了局部视觉证据优化,而无需修改原始模型。大量实验表明,MultiToP在Vript-HAL上有效减少了幻觉,且推理开销可忽略不计,将Qwen3-VL-4B-Instruct的F1分数相比原始模型提高了50.60%。同时,MultiToP保持了通用的视频理解能力,在ActivityNet-QA上为Video-LLaVA-7B带来了18.58%的相对准确率提升。

英文摘要

Video Large Multimodal Models have achieved remarkable progress in video understanding, yet they remain prone to hallucinations, where generated responses are not faithfully supported by the input video. In this paper, we propose MultiToP, a multimodal-context-aware visual token patching framework that mitigates hallucinations by refining unreliable visual tokens before language generation. MultiToP introduces a lightweight Visual Token Patcher to predict token-level replacement distributions and selectively substitute unreliable visual tokens with a dynamic global patch token. To train the patcher effectively, we further propose information-guided rank calibration, which uses answer-conditioned frame-level information cues derived from the backbone to guide token replacement. Combined with ground-truth answer supervision and sparsity regularization, MultiToP enables localized visual evidence refinement without modifying the original model. Extensive experiments demonstrate that MultiToP effectively reduces hallucinations on Vript-HAL with negligible inference overhead, improving the F1 scores of Qwen3-VL-4B-Instruct by 50.60% over the vanilla model. Meanwhile, MultiToP preserves general video understanding ability, yielding an 18.58% relative accuracy gain on ActivityNet-QA for Video-LLaVA-7B.
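The selective substitution step can be pictured with the toy sketch below. It rests on several assumptions not stated in the abstract: the replacement distribution is reduced to one probability per token, a fixed threshold decides which tokens are replaced, and the global patch token is taken to be the mean of all tokens. The names `patch_tokens` and `threshold` are hypothetical.

```python
def patch_tokens(tokens, replace_prob, threshold=0.5):
    """Replace tokens whose replacement probability exceeds the threshold.

    tokens: list of equal-length feature vectors (lists of floats).
    replace_prob: one probability per token, as a patcher might predict.
    Returns (patched_tokens, global_patch).
    """
    dim = len(tokens[0])
    # Assumption: the global patch token is the mean over all visual tokens.
    global_patch = [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]
    patched = [
        list(global_patch) if prob > threshold else list(tok)
        for tok, prob in zip(tokens, replace_prob)
    ]
    return patched, global_patch


# Toy input: the third token is an outlier flagged as unreliable.
tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
probs = [0.1, 0.2, 0.9]
patched, global_patch = patch_tokens(tokens, probs)
print(patched)  # [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
```

Only the flagged token changes, so the backbone receives the same number of tokens and needs no modification.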

2. Embodied AI, Robotics, and Autonomous Driving (19 papers)

2606.12473 2026-06-12 cs.CV 新提交

Stereo Vision-Based Fall Prediction and Detection using Human Pose Estimation on the AMD Kria K26 SOM

基于人体姿态估计的立体视觉跌倒预测与检测在AMD Kria K26 SOM上的实现

Shreyas Narasimhiah Ramesh, P. D. Rathika, Mahasweta Sarkar, Kristen Wells, Michel Audette, Christopher Paolini

发表机构 * San Diego State University(圣地亚哥州立大学) PSG College of Technology(PSG理工学院) Old Dominion University(欧道明大学)

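AI summary: Presents a low-power, portable stereo-vision fall prediction and detection system on the AMD Kria K26 SOM. A three-stage pipeline of quantized YOLOX, A2J, and CNN models targets real-time, privacy-preserving fall detection, with the multi-threaded version reaching 4.5 FPS.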

Comments 19 pages; 31 figures

详情
AI中文摘要

背景与目标:老年人跌倒可能导致严重伤害并降低生活质量。及时的预测和检测对于预防伤害和支持健康至关重要。我们提出了一种便携式、低功耗、电池供电的基于视觉的跌倒预测与检测系统,在AMD Kria K26系统模块(SOM)上使用人体姿态估计(HPE)。目标是实现非侵入性、保护隐私的实时跌倒检测系统。方法:系统使用Intel RealSense D455距离感应摄像头,通过USB连接到K26 SOM。它捕获同步的RGB和深度帧,分辨率分别为640×480×3和640×480像素,帧率为60 FPS。SOM运行一个三级流水线,包括量化的YOLOX、Anchor-to-Joint(A2J)和跌倒检测模型。YOLOX从RGB帧中识别人体边界框,然后丢弃RGB帧以保护隐私。A2J使用深度帧估计每个人的15个关节点。CNN使用选定的关节坐标(x, y, z)对跌倒活动进行分类。YOLOX在CrowdHuman上训练;A2J在ITOP、MP-3DHP、UR Fall Detection和自定义的SDSU PSG数据集上训练;CNN在UR Fall Detection和SDSU PSG上训练。设计使用了单核DPU的串行流水线和双核DPU运行YOLOX和A2J的多线程版本。结果:量化精度通过YOLOX的IoU≥50%、A2J的10厘米规则mAP以及CNN的分类准确率(TP+TN)/(TP+TN+FP+FN)进行评估。准确率分别为74%、84.13%和75.85%。吞吐量从单线程流水线的2.5 FPS提高到多线程版本的4.5 FPS。结论:结果证明了在AMD Kria K26边缘设备上实现隐私保护跌倒检测的可行性。设备上的HPE和跌倒分类无需依赖云端,支持老年人监测和辅助医疗。未来工作将提高模型精度和速度。

英文摘要

Background and Objective: Falls among elderly people can cause serious injury and reduce quality of life. Timely prediction and detection are essential to prevent harm and support well-being. We propose a portable, low-power, battery-operated, vision-based fall prediction and detection system using human pose estimation (HPE) on an AMD Kria K26 System-on-Module (SOM). The objective is a non-intrusive, privacy-preserving system for real-time fall detection. Methods: The system uses an Intel RealSense D455 range-sensing camera connected to the K26 SOM by USB. It captures synchronized RGB and depth frames, 640 x 480 x 3 and 640 x 480 pixels, at 60 FPS. The SOM runs a three-stage pipeline with quantized YOLOX, Anchor-to-Joint (A2J), and fall-detection models. YOLOX identifies human bounding boxes from RGB frames, then discards the RGB frames to preserve privacy. A2J uses depth frames to estimate 15 joint keypoints per person. A CNN uses selected joint coordinates (x, y, z) to classify fall activity. YOLOX was trained on CrowdHuman; A2J on ITOP, MP-3DHP, UR Fall Detection, and a custom SDSU PSG dataset; and the CNN on UR Fall Detection and SDSU PSG. Two designs were evaluated: a single-core DPU with a serial pipeline, and a dual-core DPU running YOLOX and A2J in multiple threads. Results: Quantized accuracy was evaluated using IoU >= 50% for YOLOX, mAP with a 10-cm rule for A2J, and classification accuracy, (TP + TN)/(TP + TN + FP + FN), for the CNN. Accuracies were 74%, 84.13%, and 75.85%, respectively. Throughput improved from 2.5 FPS for the single-threaded pipeline to 4.5 FPS for the multi-threaded version. Conclusion: Results demonstrate the feasibility of privacy-preserving fall detection on an AMD Kria K26 edge device. On-device HPE and fall classification run without cloud dependency, supporting elderly monitoring and assistive healthcare. Future work will improve model accuracy and speed.
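Two of the metrics named above are simple enough to spell out: classification accuracy, (TP + TN)/(TP + TN + FP + FN), and the IoU >= 50% matching rule for bounding boxes. The sketch below uses made-up counts and boxes rather than the paper's data, and the helper names are illustrative.

```python
def accuracy(tp, tn, fp, fn):
    """Classification accuracy: (TP + TN) / (TP + TN + FP + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def is_match(pred, truth, iou_threshold=0.5):
    """A detection counts as correct when its IoU with the truth box is >= threshold."""
    return box_iou(pred, truth) >= iou_threshold


# Toy confusion counts (not from the paper): 85 of 100 clips classified correctly.
print(accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85

# Toy boxes: the first prediction overlaps well, the second is shifted too far.
truth = (0, 0, 10, 10)
print(is_match((0, 0, 10, 8), truth))   # True  (IoU = 0.8)
print(is_match((5, 0, 15, 10), truth))  # False (IoU = 1/3)
```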

2606.12981 2026-06-12 cs.CV 新提交

Camera and LiDAR BEV Fusion for Cooperative 3D Object Detection on TUMTraf V2X

用于TUMTraf V2X协同3D目标检测的相机与LiDAR BEV融合

Muhammad Shahbaz, Shaurya Agarwal

发表机构 * Department of Civil, Environmental and Construction Engineering, University of Central Florida(中佛罗里达大学土木、环境与建筑工程系)

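AI summary: Describes a detector that fuses three roadside cameras with a fused infrastructure-plus-vehicle point cloud in BEV space, using a CenterPoint-style head and IoU quality re-ranking. It reaches 0.85 3D mAP on the public Codabench test split of the DriveX 2026 challenge and analyzes how overlap between the test split and the train/validation splits affects the score.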

详情
AI中文摘要

我们描述了一种为DriveX 2026挑战赛的TUMTraf V2X协同3D目标检测赛道开发的相机与LiDAR融合检测器。该检测器在共享的鸟瞰视图空间中融合三个路边相机与一个融合的基础设施-车辆点云,并通过带有广义IoU回归损失和IoU质量重排序头的CenterPoint风格头部预测边界框。在提供的训练和验证分割上训练后,模型在公开Codabench测试分割上达到了0.85的3D mAP。在迭代系统时,我们观察到50个测试帧中有44个也出现在已发布的训练(40个)和验证(4个)分割中并带有标签。因此,我们进行了两项额外研究来量化这种重叠对最终分数的影响:(1)一个微调运行,对44个重叠帧进行过采样,达到0.89 mAP;(2)一个后处理运行,将这些帧上的预测替换为已发布的真实值,达到0.99 mAP(上传到我们的Codabench账户进行测试,但未在排行榜上发布)。报告了所有三种配置及其每类结果。

英文摘要

We describe a Camera and LiDAR fusion detector developed for the TUMTraf V2X cooperative 3D object detection track of the DriveX 2026 challenge. The detector fuses three roadside cameras with a fused infrastructure-plus-vehicle point cloud in a shared bird's-eye-view space and predicts boxes through a CenterPoint-style head with a generalized IoU regression loss and an IoU quality re-ranking head. Trained on the provided train and validation splits, the model reaches a 3D mAP of 0.85 on the public Codabench test split. While iterating on the system, we observed that 44 of the 50 test frames are also present in the released train (40) and validation (4) splits with their labels. We therefore conducted two additional studies to quantify how this overlap affects the final score: (1) a finetuning run that oversamples the 44 overlapping frames, reaching 0.89 mAP, and (2) a post-processing run that replaces predictions on those frames with the released ground truth, reaching 0.99 mAP (uploaded to our Codabench account for testing but not published on the leaderboard). All three configurations and their per-class results are reported.
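An overlap check of the kind reported above comes down to set intersections over frame identifiers. The sketch below reproduces the stated counts (40 test frames in train, 4 in validation, 6 in neither) with synthetic IDs; the ID format and the `split_overlap` helper are hypothetical, and real frames would first need a stable key such as a timestamp or file hash.

```python
def split_overlap(test_ids, train_ids, val_ids):
    """Count how many test frames also appear in the train or validation split."""
    test, train, val = set(test_ids), set(train_ids), set(val_ids)
    in_train = test & train
    in_val = (test & val) - in_train  # avoid double counting
    clean = test - train - val
    leaked = len(in_train) + len(in_val)
    return {
        "train": len(in_train),
        "val": len(in_val),
        "clean": len(clean),
        "leak_fraction": leaked / len(test) if test else 0.0,
    }


# Synthetic IDs arranged to match the counts in the abstract.
test_ids = [f"frame_{i:03d}" for i in range(50)]
train_ids = [f"frame_{i:03d}" for i in range(40)] + [f"train_only_{i}" for i in range(100)]
val_ids = [f"frame_{i:03d}" for i in range(40, 44)] + [f"val_only_{i}" for i in range(20)]

report = split_overlap(test_ids, train_ids, val_ids)
print(report)  # {'train': 40, 'val': 4, 'clean': 6, 'leak_fraction': 0.88}
```

With 88% of the test frames seen during training, a test score mostly measures memorization, which is consistent with the jump from 0.85 to 0.99 mAP once ground truth is substituted on the overlapping frames.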

2606.12987 2026-06-12 cs.CV cs.AI cs.LG cs.RO 新提交

Diffusion Transformer World-Action Model for AV Scene Prediction

扩散Transformer世界-动作模型用于自动驾驶场景预测

Ruslan Sharifullin, Benjamin Jiang, Kai Xi Chew

发表机构 * Stanford University(斯坦福大学)

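AI summary: Proposes a compact latent world model with a Diffusion Transformer (DiT) that predicts future scenes on nuScenes, attaining a 4.8x better KID than regression (0.078 versus 0.375) and action controllability (Spearman ρ = 0.81 between steering and scene displacement).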

Comments 10 pages, 9 figures, 2 tables

详情
AI中文摘要

动作条件世界模型使自动驾驶车辆能够根据自身规划的控制预测未来摄像头场景,从而无需真实世界部署即可进行规划和仿真,但在紧凑、可训练的规模下,未来具有模糊性,且该领域的标准失真度量具有误导性:它们奖励模糊的回归均值而非逼真的预测。我们通过一个紧凑的潜世界模型应对这一问题,该模型给定当前前摄像头潜变量和一系列自我动作,预测未来场景潜变量,由冻结解码器渲染为$256 \times 256$帧,最多提前8秒,在150个保留的nuScenes场景上评估。我们首先基准测试预测位置:在跨越四个表示族的六个冻结编码器中,具有时间上下文的V-JEPA2将转向RMSE比最佳单帧编码器降低40%。然后我们训练一个潜扩散Transformer(DiT),并通过受控诊断识别其所需的四个要素:空间token、$x_0$目标、残差锚定以及与目标不确定性匹配的采样。在Stable-Diffusion-VAE编码-预测-解码流水线中,我们揭示了核心矛盾:失真度量(余弦相似度、SSIM)倾向于模糊均值,掩盖了扩散模型更接近真实帧分布的事实。基于Inception的FID和KID揭示了清晰的感知-失真边界:扩散模型达到KID 0.078,而回归为0.375(好$4.8\times$),且可部署的训练校准使其无需测试时真实值即可实用。该模型真正具有动作可控性(转向驱动场景位移,Spearman $\rho = 0.81$,而回归为$-0.18$)。我们将有限的单次运动归因于共享当前锚点,并设计了一个紧凑的170万参数“跳跃”模型,恢复完整的真实运动幅度($1.02\times$ GT),而单次模型捕获不到一半。

英文摘要

Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts. At compact, trainable scale, however, the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents that a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the $x_0$ objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression ($4.8\times$ better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman $\rho = 0.81$, vs $-0.18$ for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter "jump" model that recovers full ground-truth motion magnitude ($1.02\times$ GT), where single-pass models capture less than half.
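The two headline numbers can be unpacked with a short sketch: the "4.8x better" figure is the ratio of the two reported KID values, and the controllability score is a Spearman rank correlation between steering input and measured scene displacement. The KID values come from the abstract; the steering and displacement samples below are invented for illustration, and the helper names are not from the paper.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman rho: the Pearson correlation of the rank-transformed inputs."""
    return pearson(average_ranks(x), average_ranks(y))


# Ratio of the KID values reported in the abstract (lower KID is better).
kid_regression, kid_diffusion = 0.375, 0.078
print(round(kid_regression / kid_diffusion, 1))  # 4.8

# Invented samples: steering command versus horizontal scene shift in pixels.
steering = [-0.8, -0.3, 0.0, 0.4, 0.9]
displacement = [-12.0, -2.0, -5.0, 6.0, 20.0]
print(round(spearman(steering, displacement), 2))  # 0.9
```

A rank correlation near 1 means larger steering commands consistently produce larger scene shifts, regardless of whether the relationship is linear.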

2606.13460 2026-06-12 cs.CV 新提交

VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models

VISA: VLM引导的实例语义审计用于3D占据世界模型

Ruiqi Xian, Yuehan Xian, Jing Liang, Xuewei Qi, Dinesh Manocha

发表机构 * University of Maryland College Park(马里兰大学帕克分校) Nanjing University of Posts and Telecommunications(南京邮电大学) Stanford University(斯坦福大学) Motional AD Inc.(Motional AD公司)

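AI summary: Proposes VISA, which uses an offline VLM to produce a structured semantic audit of each physical object instance and distills it into 3D occupancy models through reliability-weighted losses, improving closed-set occupancy mIoU with no VLM required at inference.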

详情
AI中文摘要

语义3D占据为自动驾驶和机器人决策提供体素化世界状态,但对象和稀有类错误会影响自由空间解释、碰撞检测和时间状态传播。我们表明,常见的VLM策略(将3D体素或对象特征与裁剪-标题嵌入对齐)提高了文本-空间相似性,但未能可靠地改善封闭集占据mIoU。受此不匹配启发,我们提出VISA,一种针对现有占据世界模型的训练时语义审计方法。VISA对每个物理对象实例的代表性裁剪查询离线VLM,获得包含类别假设、可能混淆、可靠性、属性和证据的结构化审计,并将其沿对象轨迹传播。审计被关联到匹配的3D对象体素,并通过可靠性加权分类、属性因子和场景级审计图损失蒸馏到语义logits中,而推理保持不变且无需VLM。在nuScenes上,三次运行平均,VISA将OccWorld从19.06提升到20.05 mIoU,GaussianWorld从21.36提升到21.91 mIoU;在GaussianWorld上,对象mIoU从18.18提升到19.16,稀有类mIoU从15.60提升到16.79。这些结果表明,VLM更适合作为可靠性感知的语义审计器而非通用标题嵌入目标用于封闭集占据。

英文摘要

Semantic 3D occupancy provides a voxelized world state for autonomous driving and robot decision making, but object and rare-class errors can affect free-space interpretation, collision checking, and temporal state propagation. We show that a common VLM strategy, aligning 3D voxel or object features with crop-caption embeddings, improves text-space similarity without reliably improving closed-set occupancy mIoU. Motivated by this mismatch, we propose VISA, a training-time semantic auditing approach for existing occupancy world models. VISA queries an offline VLM on a representative crop of each physical object instance, obtains a structured audit with class hypotheses, plausible confusions, reliability, attributes, and evidence, and propagates it along the object track. The audit is grounded to matched 3D object voxels and distilled into semantic logits through reliability-weighted taxonomy, attribute-factor, and scene-level audit graph losses, while inference remains unchanged and requires no VLM. On nuScenes, averaged across three runs, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU; on GaussianWorld, object mIoU improves from 18.18 to 19.16 and rare-class mIoU from 15.60 to 16.79. These results suggest that VLMs are better suited to closed-set occupancy as reliability-aware semantic auditors than as generic caption-embedding targets.

2606.13503 2026-06-12 cs.CV cs.AI cs.RO 新提交

Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments

异构激光雷达早期融合与学习重排序策略用于非结构化环境中的鲁棒长期地点识别

Judith Vilella-Cantos, Juan José Cabrera, Mónica Ballesta, David Valiente, Luis Payá

发表机构 * Miguel Hernández University of Elche(米格尔·埃尔南德斯·德埃尔切大学)

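AI summary: Proposes MinkUNeXt-VINE++, which combines early fusion of heterogeneous LiDAR data with a learned re-ranking strategy for long-term place recognition in unstructured environments such as vineyards, improving Recall@1 over single-sensor approaches by 20%, or by 30% when re-ranking is included.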

详情
AI中文摘要

在非结构化环境(如农田)中,鲁棒定位是自主系统的关键挑战。LiDAR传感器提供环境的详细3D信息,且不受光照条件影响,因此基于LiDAR的地点识别方法备受关注。本文提出MinkUNeXt-VINE++,一种结合两个传感器(Livox Mid-360和Velodyne VLP-16)异构LiDAR数据早期融合与推理时学习重排序策略的新方法。这种融合利用每个传感器的优势,提供更全面的环境表示。此外,重排序方法在重复环境(如葡萄园)中尤为重要,因为找到真正匹配是一项重大挑战。我们使用TEMPO-VINE数据集评估了该方法,该数据集提供了不同物候阶段葡萄园环境中的异构LiDAR数据。结果表明,与单传感器方法和现有最优方法相比,MinkUNeXt-VINE++显著提升了地点识别性能。与单传感器方法相比,MinkUNeXt-VINE++在Recall@1指标上提升了20%,加入重排序后提升30%。我们的方法代码已公开,可复现结果。

英文摘要

Robust localization in unstructured environments, such as agricultural fields, is a critical challenge for autonomous systems. LiDAR sensors provide detailed 3D information about the environment and are invariant to lighting conditions. For this reason, LiDAR-based place recognition methods have gained significant attention. In this paper, we propose MinkUNeXt-VINE++, a novel approach that combines early fusion of heterogeneous LiDAR data from two sensors (Livox Mid-360 and Velodyne VLP-16) with a learned re-ranking strategy at inference time. This fusion leverages the strengths of each sensor to provide a more comprehensive representation of the environment. Additionally, the re-ranking approach is particularly important in repetitive environments, such as vineyards, where finding true positives is a major challenge. We evaluated our approach using the TEMPO-VINE dataset, which provides heterogeneous LiDAR data in vineyard environments across different phenological stages. Our results demonstrate that MinkUNeXt-VINE++ significantly improves place recognition performance compared to single-sensor approaches and state-of-the-art methods. MinkUNeXt-VINE++ achieves a 20% improvement in the Recall@1 metric compared to single-sensor approaches, and a 30% improvement when re-ranking is included. The code of our method is publicly available for reproduction.
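Recall@1 and the effect of re-ranking can be shown with a toy example. Recall@k is the fraction of queries with at least one true match among the top k retrieved places; re-ranking reorders only the head of each retrieval list using a second scorer. The place IDs and scores below are invented, and a plain lookup table stands in for the learned re-ranker.

```python
def recall_at_k(rankings, positives, k=1):
    """Fraction of queries with at least one true match in the top k results.

    rankings: dict mapping query -> list of retrieved place IDs, best first.
    positives: dict mapping query -> set of place IDs that are true matches.
    """
    hits = sum(1 for query, ranked in rankings.items() if set(ranked[:k]) & positives[query])
    return hits / len(rankings)


def rerank(ranked, score, top_n=3):
    """Reorder the first top_n candidates by a second-stage score, keep the tail."""
    head = sorted(ranked[:top_n], key=score, reverse=True)
    return head + ranked[top_n:]


# Toy first-stage retrieval: q2's true match "e" is ranked second.
rankings = {"q1": ["a", "b", "c"], "q2": ["d", "e", "f"]}
positives = {"q1": {"a"}, "q2": {"e"}}
print(recall_at_k(rankings, positives))  # 0.5

# Hypothetical second-stage scores standing in for a learned re-ranker.
rerank_scores = {
    "q1": {"a": 0.9, "b": 0.2, "c": 0.1},
    "q2": {"d": 0.3, "e": 0.8, "f": 0.1},
}
reranked = {q: rerank(r, rerank_scores[q].get) for q, r in rankings.items()}
print(recall_at_k(reranked, positives))  # 1.0
```

Re-ranking cannot recover a match that the first stage never retrieved, so its benefit is bounded by Recall@top_n of the initial list.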

2606.13509 2026-06-12 cs.CV cs.AI 新提交

Measurement-Calibrated Multi-Camera Fusion for Vision-Based Indoor Localization

基于测量校准的多相机融合用于视觉室内定位

Mateo Toro Diz, Jonathan Hoss, Noah Klarmann

发表机构 * Rosenheim Technical University of Applied Sciences(罗森海姆应用技术大学)

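AI summary: Proposes a measurement-calibrated fusion approach that optimizes multi-camera data fusion by explicitly quantifying single-camera localization errors (homography calibration, human detection, motion tracking). Experiments show only limited improvement in absolute accuracy over standard fusion, but substantially lower trajectory variance and smoother motion.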

Comments This paper has been accepted for presentation at the IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)

详情
AI中文摘要

基于视觉的室内定位系统受到检测噪声、遮挡和有限相机覆盖的影响,导致流程多个阶段存在不确定性。虽然多相机数据融合被广泛用于缓解这些问题,但通常被视为黑箱组件并仅通过端到端评估,掩盖了其机制贡献。为弥补这一不足,本文研究是否可以利用显式表征单相机定位误差来校准和优化多相机数据融合。我们提出了一种测量校准融合方法,该方法集成了组件级误差量化,具体分离了单应校准、人体检测和运动跟踪。进行了组件级评估以量化单应校准、人体检测和运动跟踪的误差贡献。实验结果表明,与单相机基线相比,数据融合提高了定位精度。虽然测量校准融合在绝对精度上相比标准融合仅提供有限的改进,但它显著降低了轨迹方差并提高了运动平滑性,这对于需要稳定连续运动估计的应用至关重要。这些结果突显了在设计基于视觉的室内定位系统的数据融合策略时,显式误差表征的价值。

英文摘要

Indoor vision-based localization systems are affected by detection noise, occlusions, and limited camera coverage, leading to uncertainty at multiple stages of the pipeline. While multi-camera data fusion is widely used to mitigate these issues, it is typically treated as a black-box component and evaluated solely end-to-end, obscuring its mechanistic contributions. To address this gap, this work investigates whether explicitly characterizing single-camera localization errors can be leveraged to calibrate and optimize multi-camera data fusion. We introduce a measurement-calibrated fusion approach that integrates component-wise error quantification, and we conduct a component-wise evaluation that isolates the error contributions of homography calibration, human detection, and motion tracking. Experimental results show that data fusion improves localization accuracy compared to single-camera baselines. While measurement-calibrated fusion provides only limited improvement in absolute accuracy over standard fusion, it substantially reduces trajectory variance and improves motion smoothness, which are critical for applications requiring stable and continuous motion estimates. These results highlight the value of explicit error characterization when designing data fusion strategies for vision-based indoor positioning systems.
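One common way to let measured per-camera error calibrate a fusion step is inverse-variance weighting, sketched below. The abstract does not say which fusion rule the authors use, so this is an assumed stand-in: `fuse_positions` and the variance values are hypothetical, and each camera's variance is taken to summarize its measured localization error.

```python
def fuse_positions(estimates):
    """Inverse-variance weighted fusion of 2D position estimates.

    estimates: list of ((x, y), variance) pairs, one per camera, where the
        variance (in squared metres) must be positive.
    Returns the fused (x, y) and the variance of the fused estimate.
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    if any(var <= 0 for _, var in estimates):
        raise ValueError("variances must be positive")
    weights = [1.0 / var for _, var in estimates]
    total = sum(weights)
    x = sum(w * pos[0] for w, (pos, _) in zip(weights, estimates)) / total
    y = sum(w * pos[1] for w, (pos, _) in zip(weights, estimates)) / total
    return (x, y), 1.0 / total


# Toy case: camera A (std 0.2 m) is trusted four times more than camera B (std 0.4 m).
camera_a = ((2.0, 1.0), 0.04)
camera_b = ((2.4, 1.2), 0.16)
fused, fused_var = fuse_positions([camera_a, camera_b])
print(fused, fused_var)
```

The fused point lies closer to the lower-variance camera, and the fused variance is smaller than either input, which is one mechanism by which calibrated fusion could reduce trajectory variance without much change in mean accuracy.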

2606.13515 2026-06-12 cs.CV cs.LG cs.RO 新提交

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

MaskWAM:统一掩码提示与预测的世界-动作模型

Hanyang Yu, Haitao Lin, Jingbo Zhang, Wenyao Zhang, Chenghao Gu, Heng Li, Ping Tan

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Tencent Robotics X(腾讯机器人X实验室) Tsinghua University(清华大学)

AI总结 提出MaskWAM,通过统一掩码输入与预测的混合Transformer架构,解决世界-动作模型的空间瓶颈,提升策略泛化能力,在LIBERO等任务上显著优于基线。

详情
AI中文摘要

世界-动作模型(WAMs)通过视频预测为机器人控制提供了一种有前景的范式。然而,当前的WAMs存在根本性的空间瓶颈:标准文本输入在杂乱场景中引入指代歧义,而非结构化的RGB预测缺乏语义基础,并受任务无关背景的偏差影响。为克服这些限制,我们引入了MaskWAM,一种以对象为中心的世界-动作模型。通过统一的混合Transformer(MoT)将掩码同时作为显式输入和预测进行联合集成,MaskWAM实现了鲁棒的策略泛化。该设计提供两个关键优势:(1)预测未来掩码产生以对象为中心的语义监督,抑制视觉噪声,显著增强甚至标准文本条件的WAMs;(2)将此预测监督与第一帧视觉提示(如目标对象掩码)耦合,建立精确的空间锚点,大幅减少语言歧义。关键在于,由于WAMs本质上是视觉驱动的架构,直接掩码条件化比单独文本提供更强的引导,为操作未见对象建立了精确且鲁棒的范式。在LIBERO、RoboTwin和真实世界任务上的评估表明,MaskWAM在语言清晰和语言模糊任务中均显著优于基线。

英文摘要

World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.

2606.12849 2026-06-12 cs.DC cs.CV cs.RO 交叉投稿

SemanticXR: Low Power and Real-time Queryable Semantic Mapping with an Object-Level Device-Cloud Architecture

SemanticXR: 低功耗实时可查询语义建图与对象级设备-云架构

Rahul Singh, Devdeep Ray, Connor Smith, Sarita Adve

AI总结 提出首个设备-云协同系统SemanticXR,通过对象级通信、执行和内存管理,在XR功耗、带宽和内存约束下实现实时开放词汇语义建图与查询,服务器端建图延迟改善2.2倍,设备功耗仅增加2%。

详情
AI中文摘要

语义建图是新兴扩展现实(XR)应用(如AI助手和空间对象搜索)中实现具身交互的核心服务。在移动XR设备上部署此功能需要系统具备开放词汇、实时和低功耗特性。现有方法计算密集且假设服务器级资源。云卸载提供了一条实用路径,但现有系统未在设备-云边界拆分语义建图或管理其通信、执行和内存占用。我们提出SemanticXR,首个在XR功耗、带宽和内存约束下实现实时开放词汇语义建图与查询的设备-云系统。我们的关键洞察是将语义可识别对象提升为跨设备和服务器的通信、执行和内存的一级单元。在服务器端,对象级并行和几何下采样改善了建图延迟,而对象级深度建图协同设计降低了上行带宽。在设备端,具有增量更新和更新优先级的对象级稀疏局部地图实现了网络鲁棒的查询,并限制了内存和下行带宽。对象级可配置的资源使用与质量权衡让应用和系统分别根据应用需求和运行条件调整建图。与使用相同感知模型的设备-云基线相比,对象级组织在同等语义质量下将服务器端建图延迟提升了2.2倍。深度建图协同设计将上行带宽维持在2.5 Mbps以下。在设备端,SemanticXR即使在网络中断时也能为多达10,000个对象维持低于100 ms的查询延迟,在500 MB内支持数万个对象,并将下行带宽随地图变化而非总场景大小缩放。系统在正常运行时仅增加2%的设备功耗。

英文摘要

Semantic mapping is a core service that enables grounded interactions in emerging Extended Reality (XR) applications such as AI assistants and spatial object search. Deploying this capability on mobile XR devices requires a system that is open-vocabulary, real-time, and low-power. Existing approaches are compute-intensive and assume server-class resources. Cloud offloading offers a practical path, but no existing system splits semantic mapping across the device-cloud boundary or manages its communication, execution, and memory footprint. We present SemanticXR, the first device-cloud system for real-time, open-vocabulary semantic mapping and querying under XR power, bandwidth, and memory constraints. Our key insight is to elevate semantically identifiable objects to first-class units of communication, execution, and memory across the device and server. On the server, object-level parallelism and geometry downsampling improve mapping latency, while object-level depth-mapping co-design reduces upstream bandwidth. On the device, an object-level sparse local map with incremental updates and update prioritization enables network-robust querying with bounded memory and downstream bandwidth. Object-level configurable resource usage vs. quality trade-offs let applications and the system adapt mapping to application requirements and operating conditions, respectively. Against a device-cloud baseline with the same perception models, object-level organization improves server-side mapping latency by 2.2X at equal semantic quality. Depth-mapping co-design maintains upstream bandwidth under 2.5 Mbps. On the device, SemanticXR sustains sub-100 ms query latency for up to 10,000 objects even under network drops, supports tens of thousands of objects within 500 MB, and scales downstream bandwidth with map changes, not total scene size. The system adds only 2% device power during normal operation.

2606.13494 2026-06-12 cs.RO cs.CV 交叉投稿

NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation

NavWAM:用于目标条件视觉导航的导航世界动作模型

Daichi Azuma, Taiki Miyanishi, Koya Sakamoto, Shuhei Kurita, Yaonan Zhu, Petr Khrapchenkov, Motoaki Kawanabe, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The University of Tokyo(东京大学) National Institute of Informatics(国立信息学研究所) AIRoA ATR

AI总结 提出NavWAM,一种扩散变换器策略,通过联合学习未来观测、目标进度值和动作块,将导航世界模型预测直接转化为可执行动作,在离线基准和真实机器人部署中优于基于规划的世界模型基线。

Comments Project page: https://dachii-azm.github.io/navwam/

详情
AI中文摘要

目标条件视觉导航要求机器人在部分可观测性下行动,通过预测其运动将如何改变未来的自我中心视图以及这种变化是否使其更接近目标。导航世界模型提供了这种视觉预见,但它们仍然是预测模块,需要外部规划器将预测的未来转化为闭环控制。我们提出导航世界动作模型(NavWAM),一种扩散变换器策略,通过将未来观测、目标进度值和动作块表示为共享的潜在序列,将导航世界模型预测转化为可执行动作。通过联合学习未来预测与决定闭环行为的动作和价值目标,NavWAM使视觉预见可直接用于机器人控制。我们通过模拟预训练和真实机器人适应构建NavWAM,并在图像目标导航任务上将其与基于规划的世界模型和代表性直接导航策略进行对比评估。在离线基准和闭环真实机器人部署中,NavWAM在使用默认策略模式(无CEM式动作搜索)的情况下,在我们的评估中优于基于规划的世界模型基线。项目页面:https://dachii-azm.github.io/navwam/

英文摘要

Goal-conditioned visual navigation requires a robot to act under partial observability by anticipating how its motion will change the future egocentric view and whether that change brings it closer to the goal. Navigation world models provide such visual foresight, but they remain prediction modules that require an external planner to convert predicted futures into closed-loop control. We propose Navigation World Action Model (NavWAM), a diffusion-transformer policy that turns navigation world-model prediction into executable action by representing future observations, goal-progress values, and action chunks in a shared latent sequence. By learning future prediction jointly with the action and value targets that determine closed-loop behavior, NavWAM makes visual foresight directly usable for robot control. We build NavWAM through simulation pretraining and real-robot adaptation, and evaluate it on image-goal navigation against planning-based world models and a representative direct navigation policy. Across offline benchmarks and closed-loop real-robot deployment, NavWAM improves over planning-based world-model baselines in our evaluations while using the default policy mode without CEM-style action search. Project page: https://dachii-azm.github.io/navwam/

2606.13497 2026-06-12 cs.RO cs.CV 交叉投稿

SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale

SPARC:大规模生成来自机器人演示的可靠空间标注

Nils Blank, Paul Mattes, Maximilian Xiling Li, Jakub Suliga, Thomas Roth, Moritz Reuss, Pankhuri Vanjani, Rudolf Lioutikov

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) NVIDIA(英伟达) Robotics Institute Germany(德国机器人研究所)

AI总结 提出SPARC框架,利用机器人任务的时空结构生成可靠性评分,自动标注演示中的空间信息,减少噪声标签并保留更多有用样本,在物体定位基准上优于纯检测基线。

详情
AI中文摘要

本文介绍了一种具有可靠性校准的机器人演示空间标注方法(SPARC),这是一个风险感知框架,能够自动为机器人演示标注结构化的空间信息,并为每个标注分配可靠性评分。结构化的空间标注,如边界框、物体轨迹和操作阶段标签,有益于广泛的机器人应用,从训练接地机器人策略和具身基础模型到运动规划和层次化任务组合。现有的自动化流水线可以大规模生成此类标注,但无法提供可靠的质量信号:检测器置信度对于标注正确性的校准不佳,迫使人们在接受噪声标签或丢弃有用样本之间做出选择。与现有的自动化流水线不同,SPARC利用机器人任务固有的时空结构生成可靠性信号,减少噪声标签并保留更多有用样本。我们进一步引入了交互感知基准(IA-Bench),这是一个衡量模型在机器人演示中接地交互物体位置准确性的基准。在涵盖多种实体和场景的1.7k个人工标注演示上,SPARC在定位准确性上显著优于纯检测基线,同时在高精度操作点保留了三倍的样本。我们的实验表明,基于我们的标注微调的模型在物体接地和指向基准上达到了类似规模模型中的最先进结果,同时在更广泛的空间推理套件上保持竞争力,无需手动验证或标注的训练数据。此外,基于SPARC生成的标注训练的策略在杂乱、视觉模糊的真实场景中优于基线。代码、数据和模型可从 intuitive-robots.github.io/sparc-labeling 获取。

英文摘要

This work introduces Spatial Annotations from Robot Demonstrations with Reliability Calibration (SPARC), a risk-aware framework that automatically labels robot demonstrations with structured spatial annotations and assigns each annotation a reliability score. Structured spatial annotations, such as bounding boxes, object trajectories, and manipulation phase labels, benefit a broad range of robotics applications from training grounded robot policies and embodied foundation models to motion planning and hierarchical task composition. Existing automated pipelines generate such annotations at scale but provide no reliable quality signal: detector confidence is poorly calibrated for annotation correctness, forcing a choice between accepting noisy labels or discarding useful samples. In contrast to existing automated pipelines, SPARC leverages the spatio-temporal structure inherent to robot tasks to generate a reliability signal, reducing noisy labels and retaining more useful samples. We further introduce Interaction-Aware Bench (IA-Bench), a benchmark that measures model accuracy in grounding the locations of interacted objects in robot demonstrations. On 1.7k human-annotated demonstrations spanning diverse embodiments and scenarios, SPARC significantly outperforms detection-only baselines in localization accuracy while retaining three times more samples at high-precision operating points. Our experiments demonstrate that models finetuned on our annotations achieve state-of-the-art results on object-grounding and pointing benchmarks among similarly sized models, while remaining competitive on broader spatial-reasoning suites without manually verified or annotated training data. Furthermore, policies trained on SPARC-generated annotations outperform baselines in cluttered, visually ambiguous real-world scenes. Code, data, and models are available at intuitive-robots.github.io/sparc-labeling.

2606.13677 2026-06-12 cs.RO cs.AI cs.CV cs.LG 交叉投稿

Mana: Dexterous Manipulation of Articulated Tools

Mana: 铰接工具的灵巧操作

Zhao-Heng Yin, Guanya Shi, Pieter Abbeel, C. Karen Liu

发表机构 * UC Berkeley(加州大学伯克利分校) CMU(卡内基梅隆大学) Stanford University(斯坦福大学) Amazon FAR(亚马逊FAR)

AI总结 提出Mana框架,将灵巧操作重解释为动画问题,通过粗到细的流水线自动生成操作轨迹,实现铰接工具的零样本仿真到现实迁移。

Comments Project Page: https://zhaohengyin.github.io/mana

详情
AI中文摘要

铰接工具的操作由于需要协调内部自由度与接触丰富的交互,仍然是灵巧机器人学中的一个主要挑战。虽然先前的工作主要集中在刚性物体上,但铰接工具的使用由于其物理复杂性以及学习功能性抓取和操作策略的困难,仍未得到充分探索。我们提出了Mana(操作动画器),一个通用的仿真到现实框架,将灵巧操作重新解释为动画问题。受计算机动画启发,Mana采用粗到细的流水线,通过运动规划和强化学习将程序生成的抓取关键帧转化为操作轨迹。数据生成过程基本自动化,仅需几次鼠标点击即可指定功能可供性(每个工具不到1分钟)。在跨越不同尺度和关节类型的四个铰接工具上,Mana实现了抓取和手内操作的零样本仿真到现实迁移,展示了灵巧铰接工具操作的可扩展方法。

英文摘要

Articulated tool manipulation remains a major challenge in dexterous robotics due to the need to coordinate internal degrees of freedom and contact-rich interactions. While prior work has largely focused on rigid objects, articulated tool use remains underexplored because of its physical complexity and the difficulty of learning functional grasping and manipulation policies. We present Mana (Manipulation Animator), a general sim-to-real framework that reinterprets dexterous manipulation as an animation problem. Inspired by computer animation, Mana employs a coarse-to-fine pipeline that transforms procedurally-generated grasp keyframes into manipulation trajectories through motion planning and reinforcement learning. The data generation process is largely automatic, requiring only a few mouse clicks to specify functional affordances (<1 minute per tool). Across four articulated tools spanning different scales and joint types, Mana achieves zero-shot sim-to-real transfer for both grasping and in-hand manipulation, demonstrating a scalable approach to dexterous articulated tool use.

2507.22028 2026-06-12 cs.CV cs.RO 版本更新

From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

从看见到体验:通过强化学习扩展导航基础模型

Honglin He, Yukai Ma, Brad Squicciarini, Wayne Wu, Bolei Zhou

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Coco Robotics(Coco机器人)

AI总结 提出S2E框架,结合离线视频预训练和模拟环境强化学习,通过锚点引导分布匹配和残差注意力模块,提升导航基础模型的交互性和安全性。

Comments 27 pages, 20 figures, 9 tables, conference

详情
AI中文摘要

基于大规模网络数据训练的导航基础模型使智能体能够跨不同环境和实体进行泛化。然而,这些仅基于离线数据训练的模型往往缺乏推理其行为后果或通过反事实理解进行适应的能力。因此,它们在现实世界城市导航中面临重大限制,其中交互性和安全行为(如避开障碍物和移动行人)至关重要。为解决这些挑战,我们引入了从看见到体验(S2E)学习框架,通过强化学习扩展导航基础模型的能力。S2E结合了离线视频预训练和强化学习后训练的优势。它保持了从大规模真实世界视频中获得的模型泛化能力,同时通过模拟环境中的强化学习增强了其交互性。具体而言,我们引入了两项创新:(1)用于离线预训练的锚点引导分布匹配策略,通过基于锚点的监督稳定学习并建模多样化的运动模式;(2)用于强化学习的残差注意力模块,从模拟环境中获得反应性行为,同时不抹除模型的预训练知识。此外,我们建立了一个全面的端到端评估基准NavBench-GS,该基准基于真实世界场景的光照逼真3D高斯溅射重建,并融入了物理交互。它可以系统评估导航基础模型的泛化能力和安全性。

英文摘要

Navigation foundation models trained on massive web-scale data enable agents to generalize across diverse environments and embodiments. However, these models, which are trained solely on offline data, often lack the capacity to reason about the consequences of their actions or adapt through counterfactual understanding. They thus face significant limitations in real-world urban navigation, where interactive and safe behaviors, such as avoiding obstacles and moving pedestrians, are critical. To tackle these challenges, we introduce the Seeing-to-Experiencing (S2E) learning framework to scale the capability of navigation foundation models with reinforcement learning. S2E combines the strengths of pretraining on offline videos and post-training through reinforcement learning. It maintains the model's generalizability acquired from large-scale real-world videos while enhancing its interactivity through reinforcement learning in simulation environments. Specifically, we introduce two innovations: (1) an Anchor-Guided Distribution Matching strategy for offline pretraining, which stabilizes learning and models diverse motion patterns through anchor-based supervision; and (2) a Residual-Attention Module for reinforcement learning, which obtains reactive behaviors from simulation environments without erasing the model's pretrained knowledge. Moreover, we establish a comprehensive end-to-end evaluation benchmark, NavBench-GS, built on photorealistic 3D Gaussian Splatting reconstructions of real-world scenes that incorporate physical interactions. It can systematically assess the generalizability and safety of navigation foundation models.

2510.03896 2026-06-12 cs.CV cs.RO 版本更新

GAE: Unleashing Physical Potential of VLM with Generalizable Action Expert

GAE: 利用可泛化动作专家释放VLM的物理潜力

Mingyu Liu, Zheng Huang, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Yating Wang, Haoyi Zhu, Hao Chen, Chunhua Shen

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出通用动作专家(GAE),通过稀疏几何接口将VLM的高层意图转化为连续动作轨迹,采用动作预训练-点云微调(APPF)方案解耦动作动力学与几何基础,实现跨视觉域、视角和指令的强泛化。

详情
AI中文摘要

视觉语言模型展示了强大的推理和规划能力,但将这些预测转化为精确的机器人动作仍是一个核心挑战。现有的视觉-语言-动作方法通常将推理和动作生成纠缠在一起,导致泛化能力有限。我们提出了通用动作专家(GAE),一个任务无关的模型,将稀疏几何规划转化为密集的机器人动作。我们的方法引入了一个稀疏几何接口:VLM预测代表高层意图的稀疏3D路点,而GAE将这些路点与实时点云观测一起映射到连续动作轨迹。GAE在一个包含来自仿真和真实世界机器人的15万条轨迹的大规模点云-轨迹数据集上进行预训练。为了进一步提高效率和泛化能力,我们引入了动作预训练-点云微调(APPF)方案,将学习动作动力学与几何基础解耦。预训练后,GAE被冻结并在下游任务中重用,只需对VLM进行轻量级微调以生成稀疏接口。实验表明,我们的方法在多样化的视觉域、相机视角和自然语言指令下实现了强大的性能和泛化能力。

英文摘要

Vision-language models demonstrate strong reasoning and planning abilities, yet grounding these predictions into precise robot actions remains a central challenge. Existing Vision-Language-Action methods typically entangle reasoning and action generation, leading to limited generalization. We propose Generalizable Action Expert (GAE), a task-agnostic model that converts sparse geometric plans into dense robot actions. Our approach introduces a sparse geometric interface: the VLM predicts sparse 3D waypoints representing high-level intention, while GAE maps these waypoints together with real-time point cloud observations to continuous action trajectories. GAE is pretrained on a large-scale pointcloud-trajectory dataset comprising 150k trajectories from both simulation and real-world robots. To further improve efficiency and generalization, we introduce an Action Pre-training, Pointcloud Fine-tuning (APPF) scheme that decouples learning action dynamics from geometry grounding. After pretraining, GAE is frozen and reused across downstream tasks, requiring only lightweight fine-tuning of the VLM to produce the sparse interface. Experiments show that our method achieves strong performance and generalization across diverse visual domains, camera viewpoints, and natural language instructions.

2511.17221 2026-06-12 cs.CV cs.RO 版本更新

QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

QueryOcc:基于查询的3D语义占据自监督方法

Adam Lilja, Ji Lan, Junsheng Fu, Lars Hammarstrand

发表机构 * Chalmers University of Technology(查尔姆斯理工大学) Zenseact

AI总结 提出QueryOcc,一种基于查询的自监督框架,通过相邻帧的4D时空查询直接学习连续3D语义占据,利用视觉基础模型或激光雷达数据提供监督,并引入收缩场景表示以在恒定内存下实现远程监督,在Occ3D-nuScenes基准上语义RayIoU提升26%。

详情
AI中文摘要

从图像学习3D场景几何和语义是计算机视觉的核心挑战,也是自动驾驶的关键能力。由于大规模3D标注成本过高,近期研究探索直接从传感器数据中进行自监督学习,无需人工标签。现有方法要么依赖2D渲染一致性(3D结构仅隐式出现),要么依赖来自累积激光雷达点云的离散化体素网格,限制了空间精度和可扩展性。我们提出QueryOcc,一种基于查询的自监督框架,通过跨相邻帧采样的独立4D时空查询直接学习连续3D语义占据。该框架支持来自视觉基础模型导出的伪点云或原始激光雷达数据的监督。为了实现恒定内存下的远程监督和推理,我们引入了一种收缩场景表示,在平滑压缩远处区域的同时保留近场细节。QueryOcc在自监督Occ3D-nuScenes基准上以11.6 FPS运行,语义RayIoU比之前的基于相机的方法提升26%,表明直接4D查询监督能够实现强大的自监督占据学习。

英文摘要

Learning 3D scene geometry and semantics from images is a core challenge in computer vision and a key capability for autonomous driving. Since large-scale 3D annotation is prohibitively expensive, recent work explores self-supervised learning directly from sensor data without manual labels. Existing approaches either rely on 2D rendering consistency, where 3D structure emerges only implicitly, or on discretized voxel grids from accumulated lidar point clouds, limiting spatial precision and scalability. We introduce QueryOcc, a query-based self-supervised framework that learns continuous 3D semantic occupancy directly through independent 4D spatio-temporal queries sampled across adjacent frames. The framework supports supervision from either pseudo-point clouds derived from vision foundation models or raw lidar data. To enable long-range supervision and reasoning under constant memory, we introduce a contractive scene representation that preserves near-field detail while smoothly compressing distant regions. QueryOcc surpasses previous camera-based methods by 26% in semantic RayIoU on the self-supervised Occ3D-nuScenes benchmark while running at 11.6 FPS, demonstrating that direct 4D query supervision enables strong self-supervised occupancy learning. https://research.zenseact.com/publications/queryocc/
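The abstract describes a contractive scene representation but does not give its formula. As a rough illustration only, the sketch below uses the contraction commonly applied to unbounded scenes in the neural radiance field literature: points inside the unit ball are left unchanged and everything farther away is squeezed into a shell of radius 2. Whether QueryOcc uses this exact mapping is an assumption, and the function name `contract` is hypothetical.

```python
import math


def contract(point):
    """Map an unbounded 3D point into a ball of radius 2.

    Points with norm <= 1 (the near field) are returned unchanged.
    Farther points are rescaled to norm 2 - 1/norm, so distance is
    compressed smoothly and the output always has norm below 2.
    Assumed form for illustration; not taken from the paper.
    """
    norm = math.sqrt(sum(c * c for c in point))
    if norm <= 1.0:
        return tuple(point)
    scale = (2.0 - 1.0 / norm) / norm
    return tuple(c * scale for c in point)


# Near-field point is preserved; a distant point is pulled inward.
print(contract((0.5, 0.0, 0.0)))
print(contract((4.0, 0.0, 0.0)))
```

A mapping of this kind lets a fixed-size representation cover arbitrarily distant regions, which matches the stated goal of long-range supervision under constant memory.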

2606.01621 2026-06-12 cs.CV cs.RO 版本更新

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

Goal2Pixel: 将目标锚定到像素以实现视觉语言导航

Muyi Bao, Yuxin Cai, Hang Xu, Zongtai Li, Jinxi He, Jingfan Tang, Chen Lv, Ji Zhang, Yaqi Xie, Wenshan Wang

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Nanyang Technological University(南洋理工大学)

AI总结 提出Goal2Pixel范式,通过将连续环境中的视觉语言导航(VLN-CE)重新定义为可导航像素锚定,利用图像平面作为统一空间接口,预测可见导航像素并反投影为3D航点,结合可见性感知关键帧记忆和坐标感知辅助损失,在减少VLM调用次数的同时实现竞争性性能。

Comments 8 pages

详情
AI中文摘要

视觉语言模型(VLM)已成为连续环境中视觉语言导航(VLN-CE)的常见基础。然而,大多数基于VLM的方法将导航视为低级动作预测,这种接口模糊、受限于短视运动基元,且由于重复的VLM查询而效率低下。我们提出Goal2Pixel,一种纯基于像素的范式,将VLN-CE重新定义为可导航像素锚定。Goal2Pixel不预测动作,而是使用图像平面作为VLM推理与机器人运动之间的统一空间接口:模型预测一个对智能体可见的可导航像素,该像素被反投影为3D航点以进行前向导航。对于非前向动作,我们在图像平面上附加辅助指令区域,其中左/右/下区域分别解释为左转、右转和停止。为了实现长程导航,我们提出了一种可见性感知的关键帧记忆,用于紧凑且信息丰富的历史表示。为了将预训练的VLM适应于可导航像素锚定,我们引入了语义嵌入和坐标感知辅助损失。Goal2Pixel在需要比先前方法更少的VLM推理调用的情况下,实现了具有竞争力的最新性能。在R2R-CE Val-Unseen上,它以每集仅7.75次VLM调用达到54.1%的SR和52.5%的SPL,而直接动作预测在32.9%的SR下需要46.62次调用,减少了6倍。同样的趋势在RxR-CE上也成立。项目页面:https://baobao0926.github.io/Goal2Pixel/。

英文摘要

Vision-language models (VLMs) have become a common foundation for vision-and-language navigation in continuous environments (VLN-CE). Yet most VLM-based methods cast navigation as low-level action prediction, an interface that is ambiguous, tied to short-horizon motion primitives, and inefficient due to repeated VLM querying. We propose Goal2Pixel, a pure pixel-based paradigm that reformulates VLN-CE as navigable pixel grounding. Rather than predicting actions, Goal2Pixel uses the image plane as a unified spatial interface between VLM reasoning and robot motion: the model predicts a navigable pixel visible to the agent, which is back-projected into a 3D waypoint for forward navigation. For non-forward actions, we append auxiliary directive regions to the image plane, where the left/right/bottom regions are interpreted as turning left, turning right, and stopping, respectively. To enable long-horizon navigation, we propose a visibility-aware keyframe memory for compact and informative history representation. To adapt pretrained VLMs to navigable pixel grounding, we introduce semantic embeddings and coordinate-aware auxiliary losses. Goal2Pixel achieves competitive state-of-the-art performance while requiring fewer VLM inference calls than prior methods. On R2R-CE Val-Unseen it achieves 54.1% SR and 52.5% SPL with just 7.75 VLM calls per episode, 6x fewer than the 46.62 required by direct action prediction at 32.9% SR. The same trend holds on RxR-CE. Project Page: https://baobao0926.github.io/Goal2Pixel/.
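The back-projection step mentioned in the abstract can be illustrated with the standard pinhole camera model. This is a minimal sketch under assumptions: it presumes a metric depth value is available at the predicted pixel and that the intrinsics `fx`, `fy`, `cx`, `cy` are known. The function name and all numbers are hypothetical and are not the authors' implementation.

```python
def backproject_pixel(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with metric depth into a camera-frame point.

    Pinhole convention: x points right, y points down, z points forward.
    Returns an (x, y, z) waypoint in metres, in the camera frame.
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


# Hypothetical 640x480 camera and a predicted pixel 2 m away.
waypoint = backproject_pixel(u=400, v=300, depth=2.0,
                             fx=500.0, fy=500.0, cx=320.0, cy=240.0)
print(waypoint)
```

The reported efficiency figure can be checked the same way: 46.62 / 7.75 is roughly 6.0, which agrees with the stated 6x reduction in VLM calls per episode.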

2511.18322 2026-06-12 cs.RO cs.CV cs.LG 版本更新

Learning Visually Interpretable Oscillator Networks for Soft Continuum Robots from Video

从视频中学习软体连续体机器人的视觉可解释振荡器网络

Henrik Krauss, Johann Licher, Naoya Takeishi, Annika Raatz, Takehisa Yairi

发表机构 * Department of Advanced Interdisciplinary Studies, The University of Tokyo(东京大学先进跨学科研究系) Institute of Assembly Technology and Robotics, Leibniz University Hannover(汉诺威莱布尼茨大学装配技术与机器人研究所) Research Center for Advanced Science and Technology, The University of Tokyo(东京大学先端科学技术研究中心)

AI总结 提出注意力广播解码器(ABCD)和视觉振荡器网络(VONs),实现从视频中学习软体连续体机器人动力学的视觉和机械可解释性,多步预测误差降低5.8倍。

Comments Code available at: https://github.com/UThenrik/visual_oscillators_for_SCR Dataset available at: https://zenodo.org/records/17812071 Video available at: https://youtu.be/i80H8erVISM

详情
AI中文摘要

从视频中学习软体连续体机器人(SCR)动力学提供了灵活性,但现有方法缺乏可解释性或依赖先验假设。基于模型的方法需要先验知识和手动设计。我们通过引入以下内容来弥补这一差距:(1)注意力广播解码器(ABCD),一种用于基于自编码器的潜在动力学学习的即插即用模块,生成像素级注意力图,定位每个潜在维度的贡献,同时过滤静态背景,通过空间接地潜在变量和图像叠加实现视觉可解释性。(2)视觉振荡器网络(VONs),一种二维潜在振荡器网络,与ABCD注意力图耦合,用于学习到的质量、耦合刚度和力的图像可视化,从而实现机械可解释性。我们在单段和双段SCR上验证了我们的方法,表明基于ABCD的模型显著提高了多步预测精度,在双段机器人上,Koopman算子的误差降低了5.8倍,振荡器网络的误差降低了3.5倍。VONs自主发现了振荡器的链式结构。这种完全数据驱动的方法产生了紧凑、机械可解释的模型,对未来的控制应用具有潜在意义。

英文摘要

Learning soft continuum robot (SCR) dynamics from video offers flexibility but existing methods lack interpretability or rely on prior assumptions. Model-based approaches require prior knowledge and manual design. We bridge this gap by introducing: (1) The Attention Broadcast Decoder (ABCD), a plug-and-play module for autoencoder-based latent dynamics learning that generates pixel-accurate attention maps localizing each latent dimension's contribution while filtering static backgrounds, enabling visual interpretability via spatially grounded latents and on-image overlays. (2) Visual Oscillator Networks (VONs), a 2D latent oscillator network coupled to ABCD attention maps for on-image visualization of learned masses, coupling stiffness, and forces, thereby enabling mechanical interpretability. We validate our approach on single- and double-segment SCRs, demonstrating that ABCD-based models significantly improve multi-step prediction accuracy with 5.8x error reduction for Koopman operators and 3.5x for oscillator networks on a two-segment robot. VONs autonomously discover a chain structure of oscillators. This fully data-driven approach yields compact, mechanically interpretable models with potential relevance for future control applications.

2606.08765 2026-06-12 cs.RO cs.CV 版本更新

RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

RGB-S: 用于鲁棒灵巧操作的图像对齐触觉显著性

Shengcheng Luo, Kefei Wu, Xiaoying Zhou, Wanlin Li, Ziyuan Jiao, Chenxi Xiao

发表机构 * ShanghaiTech University(上海科技大学) Beijing Institute for General Artificial Intelligence(北京通用人工智能研究院)

AI总结 提出RGB-S框架,通过正向运动学和相机标定将触觉传感器位置投影到RGB图像平面,生成力调制高斯显著性图,显式对齐触觉与视觉,在严重遮挡下灵巧操作成功率提升26.7个百分点。

Comments 20 pages, 7 figures

详情
AI中文摘要

有效的视觉-触觉整合对于机器人灵巧操作至关重要,尤其是在视觉观测不可靠或被遮挡时。然而,将稀疏、异构的触觉测量与密集的视觉表示鲁棒地对齐仍然是一个基本挑战。大多数现有方法需要策略从有限的演示中隐式学习跨模态对应关系,而不利用几何先验。因此,它们在视觉观测退化时往往数据效率低且泛化能力差。为解决这一限制,我们提出一个框架,显式地将物理接触锚定在图像域中。利用机器人正向运动学和相机标定,我们将触觉传感器位置直接投影到RGB图像平面上。然后,我们渲染力调制的高斯显著性图,以模拟由运动学和标定误差引起的空间不确定性。通过零初始化的条件架构整合这些2D空间锚点,我们的方法将物理接触先验注入标准视觉骨干网络,同时保留预训练的视觉表示。我们在模拟和现实世界的六项灵巧操作任务中评估了我们的方法,在严重视觉遮挡下。现实世界实验表明,在图像域中显式的RGB-S锚定将现实世界遮挡操作成功率比最强的隐式视觉-触觉基线提高了26.7个百分点,表明其空间推理能力和对遮挡的鲁棒性得到了改善。项目页面:touch-as-saliency.github.io

英文摘要

Effective visuo-tactile integration is critical for robotic dexterous manipulation, especially when visual observations are unreliable or occluded. However, robustly aligning sparse, heterogeneous tactile measurements with dense visual representations remains a fundamental challenge. Most existing approaches require policies to learn cross-modal correspondences implicitly from limited demonstrations, without leveraging geometric priors. As a result, they are often data-inefficient and generalize poorly when visual observations are degraded. To address this limitation, we propose a framework that explicitly grounds physical contacts in the image domain. Using robot forward kinematics and camera calibration, we project tactile sensor locations directly onto the RGB image plane. We then render force-modulated Gaussian saliency maps to model spatial uncertainty arising from kinematic and calibration errors. By integrating these 2D spatial anchors through a zero-initialized conditioning architecture, our method injects physical contact priors into standard visual backbones while preserving pre-trained visual representations. We evaluate our method on six dexterous manipulation tasks in both simulation and the real world under severe visual occlusions. Real-world experiments show that explicit RGB-S grounding in the image domain improves occluded manipulation success rates by 26.7 percentage points over the strongest implicit visuo-tactile baseline, suggesting improved spatial reasoning and robustness to occlusion. Project page: touch-as-saliency.github.io
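The projection and rendering steps described above can be sketched in a few lines. This is an illustrative sketch only: it assumes the tactile sensor position has already been expressed in the camera frame (the output of forward kinematics plus extrinsic calibration), and it assumes the contact force scales the peak amplitude of the Gaussian. The abstract does not say how force modulates the map, so that choice, the function names, and the numbers are all hypothetical.

```python
import math


def project_point(point_cam, fx, fy, cx, cy):
    """Project a camera-frame 3D point onto the image plane (pinhole model)."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must be in front of the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def gaussian_saliency(width, height, center, force, sigma):
    """Render a height x width saliency map as nested lists.

    The map peaks at `center` (u, v) with value `force` and decays with
    standard deviation `sigma` pixels, standing in for the spatial
    uncertainty from kinematic and calibration errors.
    """
    u0, v0 = center
    two_sigma_sq = 2.0 * sigma * sigma
    return [
        [force * math.exp(-((u - u0) ** 2 + (v - v0) ** 2) / two_sigma_sq)
         for u in range(width)]
        for v in range(height)
    ]


# Hypothetical fingertip sensor 0.5 m in front of a small 64x48 camera.
center = project_point((0.02, -0.01, 0.5), fx=100.0, fy=100.0, cx=32.0, cy=24.0)
saliency = gaussian_saliency(64, 48, center, force=1.5, sigma=3.0)
print(center, max(max(row) for row in saliency))
```

A map like this could be stacked with the RGB channels as an extra input; how the paper's zero-initialized conditioning consumes it is not specified in the abstract.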

2606.10683 2026-06-12 cs.RO cs.AI cs.CV 版本更新

UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

UniDexTok:基于真实数据的统一灵巧手分词器

Dong Fang, Youjun Wu, Yuanxin Zhong, Rui Zhang, Yunlong Wang, Xiaosong Jia, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Hefei University of Technology(合肥工业大学) Rimbot Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出统一灵巧手模型(UDHM)将人手和机器人手状态映射到共享22自由度语义接口,并基于此开发UniDexTok,一种免重定向的状态分词器,学习基于真实关节状态的离散token,实现异构灵巧手的统一表示,误差降低98%以上。

详情
AI中文摘要

灵巧手对于精细操作至关重要,但其硬件设计在不同实施例之间存在显著差异。运动学、关节定义和自由度方面的差异使得定义共享状态表示变得困难,与平行夹爪相比更是如此。因此,灵巧手数据仍然碎片化,难以用于联合训练。在这项工作中,我们提出了统一灵巧手模型(UDHM),它将人手和机器人手状态映射到一个共享的22自由度语义接口。基于UDHM,我们引入了UniDexTok,一种免重定向的状态分词器,它从标准化的真实关节状态中学习基于实施例的离散token。UniDexTok为异构灵巧手提供了统一表示,无需依赖重定向或仿真数据。与最近的基线UniHM相比,UniDexTok将MPJAE从15.63度降低到0.16度,MPJPE从18.51毫米降低到0.18毫米,误差分别减少了98.98%和99.03%。这些结果将重建精度从厘米级提升到亚毫米级。实验进一步表明,来自其他实施例的数据提高了目标实施例的重建精度,证明了跨实施例分词的优势。当引入新的灵巧手时,UniDexTok还表现出强大的零样本和少样本重建能力。

英文摘要

Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
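The quoted error reductions follow directly from the reported before and after values. The short check below recomputes them; the helper name `relative_reduction` is introduced here for illustration only.

```python
def relative_reduction(baseline, new):
    """Return the percentage reduction from `baseline` to `new`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - new) / baseline


# Values as reported in the abstract (UniHM baseline vs. UniDexTok).
mpjae = relative_reduction(15.63, 0.16)  # joint angle error, degrees
mpjpe = relative_reduction(18.51, 0.18)  # joint position error, mm
print(f"MPJAE reduction: {mpjae:.2f}%")
print(f"MPJPE reduction: {mpjpe:.2f}%")
```

Both values round to the figures stated in the abstract, 98.98% and 99.03%.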

2606.12236 2026-06-12 cs.RO cs.CV 版本更新

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

DrivingAgent: 自动驾驶系统的设计与调度智能体

Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机技术研究所) University of California, Merced(加州大学默塞德分校)

AI总结 提出DrivingAgent框架,通过自动化模块开发(设计阶段)和强化学习训练的轻量级LLM实时调度(调度阶段),解决自动驾驶系统集成新模型和满足实时约束的挑战,在nuScenes和Bench2Drive上取得更优速度-精度权衡。

详情
AI中文摘要

许多自动驾驶系统越来越多地整合基础模型以提高泛化能力并处理长尾场景。然而,这一趋势带来了两个关键挑战:(i)设计和集成新模型的手动且劳动密集型过程,以及(ii)缺乏智能、动态的调度机制以满足严格的实时约束。虽然基于大语言模型(LLM)的智能体为自动化提供了有前景的途径,但现有框架并不适合自动驾驶。具体来说,它们未能区分系统设计和实时调度的根本不同需求,将模块视为不透明的黑盒,并且并非为持续运行而设计。为了解决这些局限性,我们提出了DrivingAgent,这是一个针对自动驾驶系统设计和调度双重挑战的新型智能体框架。在设计阶段,DrivingAgent通过解释系统架构、生成代码以及通过超网络训练验证模块来自动化模块开发。在调度阶段,它采用一个通过强化学习训练的轻量级LLM来实时动态编排系统模块,并由一个集成长期存储与带时间戳短期上下文的结构化记忆支持。实验结果表明,DrivingAgent在nuScenes和Bench2Drive基准测试上实现了更优的速度-精度权衡。

英文摘要

Many autonomous driving systems are increasingly incorporating foundation models to improve generalization and handle long-tail scenarios. However, this trend introduces two key challenges: (i) the manual and labor-intensive process of designing and integrating new models, and (ii) the lack of intelligent, dynamic scheduling mechanisms to meet strict real-time constraints. While Large Language Model (LLM)-based agents offer a promising avenue for automation, existing frameworks are ill-suited for autonomous driving. Specifically, they fail to distinguish between the fundamentally different requirements of system design and real-time scheduling, treat modules as opaque black boxes, and are not designed for continuous operation. To address these limitations, we propose DrivingAgent, a novel agent framework tailored to the dual challenges of autonomous driving system design and scheduling. In the design phase, DrivingAgent automates module development by interpreting system architecture, generating code, and validating modules via super-network training. In the scheduling phase, it employs a lightweight LLM trained with reinforcement learning to dynamically orchestrate system modules in real time, supported by a structured memory that integrates long-term storage with timestamped short-term context. Experimental results demonstrate that DrivingAgent achieves a superior speed--accuracy trade-off on both the nuScenes and Bench2Drive benchmarks.

3. 图像识别、检索与分类 8 篇

2606.13206 2026-06-12 cs.CV cs.RO 新提交

Visual Place Recognition in Forests with Depth-Aware Distillation

基于深度感知蒸馏的森林视觉地点识别

Walter Nedov, Saimunur Rahman, Kavindie Katuwandeniya, David Hall, Kaushik Roy, Peyman Moghadam

发表机构 * CSIRO Robotics, Brisbane, Australia(澳大利亚联邦科学与工业研究组织机器人实验室,布里斯班,澳大利亚) University of Queensland, Brisbane, Australia(昆士兰大学,布里斯班,澳大利亚) Queensland University of Technology, Brisbane, Australia(昆士兰科技大学,布里斯班,澳大利亚)

AI总结 针对森林环境中视觉地点识别因植被重复、结构线索弱及外观变化大而困难的问题,提出轻量级深度感知蒸馏框架,将几何线索注入DINOv2模型,在WildCross基准上提升鲁棒性。

Comments IEEE ICRA Workshop on Field Robotics 2026

详情
AI中文摘要

在自然森林环境中,由于植被重复、结构线索弱以及穿越过程中外观变化显著,视觉地点识别仍然具有挑战性。为解决这一限制,本文提出了一种轻量级的深度感知蒸馏框架,该框架将几何线索注入基于DINOv2的地点识别模型,同时保持其预训练的描述符空间。在最近的WildCross基准上进行评估,所提出的方法相比仅依赖外观的对应方法取得了性能提升,对外观变化具有鲁棒性。这些结果证明了深度作为自然环境中地点识别的强互补模态的重要性,并指出深度感知蒸馏是迈向更鲁棒森林感知的一个有前景的方向。

英文摘要

Visual place recognition in natural forest environments remains challenging due to repetitive vegetation, weak structural cues, and significant appearance variation across traversals. To address this limitation, this paper proposes a lightweight depth-aware distillation framework that injects geometric cues into a DINOv2-based place recognition model, while maintaining its pre-trained descriptor space. Evaluated on the recent WildCross benchmark, the proposed approach yields gains over an appearance-only counterpart, providing robustness to appearance variations. These results demonstrate the importance of depth as a strong complementary modality for place recognition in natural environments and identify depth-aware distillation as a promising direction for more robust forest perception.

2606.13267 2026-06-12 cs.CV cs.CL cs.IR 新提交

TimeLens: On-Device Artifact Recognition with Retrieval-Augmented Question Answering for the Grand Egyptian Museum

TimeLens: 面向大埃及博物馆的基于检索增强问答的设备端文物识别

Rawan Hesham, Ali Ashraf, Amr Ahmed, Malak Alaa, Omar Ahmed, Omar Wagih

发表机构 * Grand Egyptian Museum(大埃及博物馆)

AI总结 针对博物馆场景中的细粒度视觉相似性、训练数据与手持相机差距以及AI幻觉问题,提出设备端文物检测器与双语检索增强生成(RAG)问答系统,实现实时识别与可靠问答。

Comments 6 pages, 4 figures, 5 tables. Submitted to AIVRCH 2026

详情
AI中文摘要

TimeLens 是一款面向大埃及博物馆(GEM)的 AI 驱动双语移动导览应用。游客将手机对准展品时,可实时识别文物,并针对后续问题获得英语或阿拉伯语回答。本工作解决了馆内部署特有的三个问题:51 件编目文物(许多近乎相同的拉美西斯雕像)间的细粒度视觉相似性、策展训练数据与手持相机条件之间的差距,以及 AI 导览陈述未经证实的历史事实的风险。报告了两项工程贡献。首先,通过数据质量驱动的迭代研究(从使用 YOLO-World 的基础模型自动标注,经过空间标签清理规则,到完全人工标注的数据集)开发了设备端文物检测器,将标签质量确定为决定性因素:最终的 YOLOv8n 模型解决了所有先前失败的类别,同时保持为 5.97 MB 的 TensorFlow Lite 资产,可在中端手机上实时运行(mAP@0.5 = 0.995,mAP@0.5:0.95 = 0.924)。其次,基于 108 条记录的 ChromaDB 知识库的双语检索增强生成(RAG)导览,在七个候选语言模型上进行了基准测试,选定了 Gemma 4 E2B(Q4_K_M);十项针对性优化将端到端延迟从超过 30 秒降低到约 10 秒。两个子系统集成在一个生产级 Flutter 应用中,具有双语界面、博物馆位置门控和文本转语音支持。

英文摘要

TimeLens is an AI-powered bilingual mobile guide for the Grand Egyptian Museum (GEM). Pointing a phone at an exhibit, a visitor sees the artifact recognized in real time and can ask follow-up questions answered in English or Arabic. The work addresses three problems specific to in-gallery deployment: fine-grained visual similarity among 51 catalogued artifacts (many near-identical Ramesside statues), the gap between curated training data and handheld camera conditions, and the risk of an AI guide stating unsupported historical facts. Two engineering contributions are reported. First, an on-device artifact detector was developed through a data-quality-driven iteration study (from foundation-model auto-annotation with YOLO-World, through spatial label-cleaning rules, to a fully hand-annotated dataset), isolating label quality as the decisive factor: the final YOLOv8n model resolves every previously failing class while remaining a 5.97 MB TensorFlow Lite asset that runs in real time on a mid-range phone (mAP@0.5 = 0.995, mAP@0.5:0.95 = 0.924). Second, a bilingual Retrieval-Augmented Generation (RAG) guide, grounded in a 108-record ChromaDB knowledge base, was benchmarked across seven candidate language models, with Gemma 4 E2B (Q4_K_M) selected; ten targeted optimizations reduce end-to-end latency from over 30 s to approximately 10 s. Both subsystems are integrated in a production Flutter application with bilingual interface, museum location gating, and text-to-speech support.

2606.13275 2026-06-12 cs.CV 新提交

Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing

文化遗产的零样本描述:印度尼西亚传统服装的自动化图像分析

Anugrah Aidin Yotolembah, Novanto Yudistira, Gembong Edhi Setyawan

发表机构 * University of Technology, Sydney(悉尼科技大学)

AI总结 提出Custom ZeroCLIP框架,利用检索增强的视觉-语言模型,在零样本设置下为印度尼西亚传统服装生成描述,在8个未见省份上取得优于基线的性能。

Comments Accepted to the ICME Workshop on AIART 2026

详情
AI中文摘要

本文提出了Custom ZeroCLIP,一个用于印度尼西亚传统服装零样本描述的检索增强视觉-语言框架。数据集包含来自印度尼西亚所有38个省份的3,800张专家标注图像。采用省份级归纳零样本协议,模型在24个可见省份上训练,在6个可见省份上验证,并在8个未见省份上评估。该框架结合了冻结的CLIP ViT-B/32图像编码器、CLIP文本编码器、BERT文本编码器和LSTM描述解码器。在推理过程中,未见省份的标签和描述不可用,检索仅使用训练省份的描述。训练、验证或检索库构建过程中未使用任何未见省份的图像、标签或描述。Custom ZeroCLIP实现了0.8536的CLIPScore、0.3342的BLEU-4和0.4859的METEOR,优于现有基线。消融实验表明,检索提高了文化词汇的恢复能力,METEOR提升了19.3%,而人工评估证实了更强的文化准确性和流畅性。结果证明了检索增强的领域自适应在低资源文化遗产环境下生成文化基础描述的有效性。数据集可在以下网址公开获取:https://github.com/AnugrahAidinYotolembah/Traditional-Indonesian-Clothing-Captioning-Dataset

英文摘要

This paper presents Custom ZeroCLIP, a retrieval-augmented vision-language framework for zero-shot captioning of Indonesian traditional garments. The dataset contains 3,800 expert-annotated images from all 38 Indonesian provinces. Using a province-level inductive zero-shot protocol, the model is trained on 24 seen provinces, validated on 6 seen provinces, and evaluated on 8 unseen provinces. The framework combines a frozen CLIP ViT-B/32 image encoder, a CLIP text encoder, a BERT text encoder, and an LSTM caption decoder. During inference, unseen-province labels and captions are unavailable, and retrieval uses only captions from training provinces. No unseen-province image, label, or caption is used during training, validation, or retrieval-bank construction. Custom ZeroCLIP achieves a CLIPScore of 0.8536, BLEU-4 of 0.3342, and METEOR of 0.4859, outperforming existing baselines. Ablation results show that retrieval improves cultural vocabulary recovery with a 19.3% METEOR gain, while human evaluation confirms stronger cultural accuracy and fluency. The results demonstrate the effectiveness of retrieval-augmented domain adaptation for culturally grounded caption generation in low-resource heritage settings. The dataset is publicly available at https://github.com/AnugrahAidinYotolembah/Traditional-Indonesian-Clothing-Captioning-Dataset.
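The province-level protocol described in this abstract can be illustrated with a minimal sketch. It assumes only the split sizes stated above (24 + 6 + 8 = 38 provinces); the placeholder province names, the random shuffling, and the function name `split_provinces` are hypothetical, and the paper may well use a fixed, hand-chosen split instead.

```python
import random

def split_provinces(provinces, n_train=24, n_val=6, n_test=8, seed=0):
    """Partition province names into disjoint train/val/test groups."""
    if n_train + n_val + n_test != len(provinces):
        raise ValueError("split sizes must cover every province exactly once")
    shuffled = list(provinces)
    random.Random(seed).shuffle(shuffled)  # random assignment is an assumption
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Placeholder names; the real dataset uses the 38 Indonesian provinces.
provinces = [f"province_{i:02d}" for i in range(38)]
train, val, test = split_provinces(provinces)

# The retrieval bank is built from training provinces only, so no caption
# from an unseen (test) province can leak into inference.
retrieval_bank_provinces = set(train)
```

Splitting by province rather than by image is what makes the evaluation inductive: every test image comes from a province whose labels and captions were never available to the model or the retrieval bank.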

2606.13625 2026-06-12 cs.CV 新提交

Revisiting Vehicle Color Recognition in Long-Tailed Surveillance Scenarios

重新审视长尾监控场景中的车辆颜色识别

Vinícius Orrú, Bruno H. Foggiatto, Gabriel E. Lima, David Menotti, Rayson Laroca

发表机构 * Pontifical Catholic University of Paraná(巴拉那天主教大学) National High Court of Brazil(巴西国家高等法院) Federal University of Paraná(巴拉那联邦大学)

AI总结 针对监控场景中车辆颜色分布高度不平衡的问题,本文提出结合生成式数据增强、视觉表征、损失重加权等方法的综合方案,在UFPR-VeSV数据集上实现94.6%微平均和79.7%宏平均准确率,宏平均比近期文献提升8.2个百分点。

Comments Accepted for presentation at the 2026 International Conference on Pattern Recognition (ICPR) - V3SC Workshop

详情
AI中文摘要

车辆颜色识别是监控系统中车辆识别的重要线索,尤其是在车牌因低分辨率、遮挡、运动模糊或光照不足而难以辨认时。然而,真实世界的车辆颜色分布高度不平衡,使得整体准确率不足以评估在罕见但操作相关的颜色上的性能。本文利用UFPR-VeSV(一个具有挑战性的真实世界监控数据集),对严重类别不平衡下的车辆颜色识别进行了全面研究。我们通过两种现成的生成策略探索合成少数类增强:使用RunDiffusion/JuggernautXL的文本条件图像生成和使用Gemini 2.0 Flash的图像条件颜色编辑。精心策划的合成数据与现代视觉表征、损失重加权、学习率调度、颜色安全增强、前景感知预处理和集成融合相结合。表现最佳的方法达到了94.6%的微平均准确率和79.7%的宏平均准确率,宏平均准确率比近期文献提高了8.2个百分点。手动错误分析进一步表明,许多剩余的失败即使在人工标注者看来也是视觉上模糊的,这凸显了在无约束监控图像中基于颜色的车辆识别的实际局限性。生成的图像和源代码可在以下网址公开获取:https://github.com/viniciusorru/vcr-synthetic

英文摘要

Vehicle color recognition is an important cue for vehicle identification in surveillance systems, especially when license plates are illegible due to low resolution, occlusion, motion blur, or poor illumination. However, real-world vehicle color distributions are highly imbalanced, making overall accuracy insufficient to assess performance on rare but operationally relevant colors. This paper presents a comprehensive study of vehicle color recognition under severe class imbalance using UFPR-VeSV, a challenging real-world surveillance dataset. We investigate synthetic minority-class augmentation through two off-the-shelf generative strategies: text-conditioned image generation with RunDiffusion/JuggernautXL and image-conditioned color editing with Gemini 2.0 Flash. The curated synthetic data are combined with modern visual representations, loss reweighting, learning-rate scheduling, color-safe augmentation, foreground-aware preprocessing, and ensemble fusion. The best-performing approach achieves 94.6% micro accuracy and 79.7% macro accuracy, improving macro accuracy by 8.2 percentage points over recent literature. A manual error analysis further shows that many remaining failures are visually ambiguous even for human annotators, highlighting the practical limits of color-based vehicle identification in unconstrained surveillance imagery. The generated images and source code are publicly available at https://github.com/viniciusorru/vcr-synthetic
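The gap between the two reported metrics is easier to see on a toy example. The sketch below assumes the usual definitions: micro accuracy is the fraction of all samples classified correctly, and macro accuracy is the unweighted mean of per-class accuracy. The labels and function names are illustrative only and are not taken from the paper or from UFPR-VeSV.

```python
from collections import defaultdict

def micro_accuracy(y_true, y_pred):
    """Fraction of all samples that are classified correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

def macro_accuracy(y_true, y_pred):
    """Unweighted mean of per-class accuracy, so rare classes count equally."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        hits[t] += int(t == p)
    return sum(hits[c] / total[c] for c in total) / len(total)

# Hypothetical imbalanced toy data: 8 white vehicles, 2 green vehicles.
y_true = ["white"] * 8 + ["green"] * 2
# Every white vehicle is correct; one of the two green vehicles is missed.
y_pred = ["white"] * 8 + ["white", "green"]

micro = micro_accuracy(y_true, y_pred)  # 9 of 10 samples correct
macro = macro_accuracy(y_true, y_pred)  # mean of 8/8 (white) and 1/2 (green)
```

In this toy case, a single error on the rare class barely moves micro accuracy but costs a quarter of the macro score, which is why the abstract treats macro accuracy as the more informative figure under severe imbalance.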

2601.02177 2026-06-12 cs.CV cs.CR 版本更新

Why Commodity WiFi Sensors Fail at Multi-Person Gait Identification: A Systematic Analysis Using ESP32

为什么商用WiFi传感器在多人步态识别中失败:基于ESP32的系统分析

Oliver Custance, Saad Khan, Simon Parkinson

发表机构 * University of Cambridge(剑桥大学)

AI总结 通过ESP32实验发现,多人步态识别性能差主要源于商用WiFi的感知质量限制,而非算法选择。

详情
AI中文摘要

WiFi信道状态信息(CSI)在单人步态识别中展现出潜力,引发了对其在非接触式生物识别、持续认证和被动识别中应用的兴趣。然而,在低成本商用设备上进行多人识别的可行性仍不清楚。一个关键问题是,较差的多人性能主要是算法限制,还是反映了商用WiFi硬件更根本的感知上限。我们通过使用商用ESP32 WiFi传感器的系统实证研究来回答这个问题。我们评估了六种不同的信号分离方法——FastICA、SOBI、PCA-ICA、NMF、小波和张量分解——在七个场景中,覆盖1-10人,包括受控和现实室内环境。为了超越分类准确率进行研究,我们引入了三个诊断指标:受试者内变异性(ISV)、受试者间可区分性(ISD)和性能退化率(PDR)。所有方法的性能均中等(39%-56%准确率),几乎没有证据表明仅靠算法选择能解决问题。表现最佳的方法NMF达到56%准确率,而所有方法都表现出极高的特征空间重叠(97%-99%)、不稳定的受试者内表示以及显著的环境敏感性。这些发现表明,在商用ESP32 CSI约束下,密集多人步态识别更多受限于感知质量和空间多样性,而非所选分离算法。我们的结果对安全和隐私有直接影响:它们质疑了商用WiFi CSI作为稳健的多用户生物识别基元的实用性,同时也对低成本现成WiFi硬件可实现的被动识别能力施加了重要限制。

英文摘要

WiFi Channel State Information (CSI) has shown promise for single-person gait identification, raising interest in its use for contactless biometrics, continuous authentication, and passive identification. However, the feasibility of multi-person identification on low-cost commodity devices remains unclear. A critical question is whether weak multi-person performance is primarily an algorithmic limitation, or whether it reflects a more fundamental sensing ceiling on commodity WiFi hardware. We address this question through a systematic empirical study using commodity ESP32 WiFi sensors. We evaluated six different signal separation methods (FastICA, SOBI, PCA-ICA, NMF, Wavelet, and Tensor decomposition) across seven scenarios spanning 1-10 people in both controlled and realistic indoor environments. To investigate beyond classification accuracy, we introduce three diagnostic metrics: intra-subject variability (ISV), inter-subject distinguishability (ISD), and performance degradation rate (PDR). Across all methods, performance remains moderate (39%-56% accuracy), with limited evidence that algorithmic choice alone solves the problem. The best-performing method, NMF, reaches 56% accuracy, while all methods exhibit extremely high feature-space overlap (97%-99%), unstable within-subject representations, and marked environmental sensitivity. These findings suggest that, under commodity ESP32 CSI constraints, dense multi-person gait identification is limited more by sensing quality and spatial diversity than by the chosen separation algorithm. Our results have direct implications for security and privacy: they call into question the practicality of commodity WiFi CSI as a robust multi-user biometric primitive for authentication, while also placing important bounds on the passive identification capabilities achievable with low-cost off-the-shelf WiFi hardware.

2601.06279 2026-06-12 cs.CV 版本更新

EyeTheia: A Lightweight and Accessible Eye-Tracking Toolbox

EyeTheia:一个轻量级且易用的眼动追踪工具箱

Stevenson Pather, Niels Martignène, Arnaud Bugnet, Fouad Boutaleb, Fabien D'Hondt, Deise Santana Maia

发表机构 * Univ. Lille, Inserm, CHU Lille, U1172 - LilNCog - Lille Neuroscience & Cognition(里尔大学、法国国家医学研究院、里尔大学医院、U1172 - 里尔神经科学与认知中心) Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL(里尔大学、法国国家科学研究中心、里尔中央理工大学、UMR 9189 CRIStAL) Centre national de ressources et de résilience (CN2R)(资源与韧性国家研究中心)

AI总结 提出基于网络摄像头的轻量级眼动追踪管道EyeTheia,结合MediaPipe特征提取和CNN模型,通过用户微调降低预测误差,在点探测任务中与商业方案表现一致。

Comments Code for EyeTheia: https://github.com/patherstevenson/EyeTheia. Experimental platform for the cognitive neuroscience task (BAWEB IAPS): https://git.interactions-team.fr/INTERACTIONS/calypso/src/branch/main/src/medita/

详情
AI中文摘要

我们介绍了EyeTheia,一个用于基于网络摄像头的视线估计的轻量级开源深度学习管道,专为基于浏览器的实验平台和现实世界的认知与临床研究设计。EyeTheia仅使用标准笔记本电脑摄像头即可实现实时视线追踪,结合基于MediaPipe的关键点提取和受iTracker启发的卷积神经网络,并支持可选的用户特定微调。我们研究了两种互补策略:适配在移动数据上预训练的模型,以及在面向桌面的数据集上从头训练相同架构。在MPIIFaceGaze上的验证结果显示,在标定前两种方法性能相当,而轻量级的用户特定微调持续降低了视线预测误差。我们还在一个真实的点探测任务中评估了EyeTheia,并与商业网络摄像头追踪器SeeSo SDK进行了比较。结果表明,在刺激呈现期间左右视线分配上具有高度一致性,尽管时间变异性更高。总体而言,EyeTheia为低成本视线追踪提供了一个透明且可扩展的解决方案,适用于可扩展和可重复的实验与临床研究。代码、训练模型和实验材料均已公开。

英文摘要

We introduce EyeTheia, a lightweight and open deep learning pipeline for webcam-based gaze estimation, designed for browser-based experimental platforms and real-world cognitive and clinical research. EyeTheia enables real-time gaze tracking using only a standard laptop webcam, combining MediaPipe-based landmark extraction with a convolutional neural network inspired by iTracker and optional user-specific fine-tuning. We investigate two complementary strategies: adapting a model pretrained on mobile data and training the same architecture from scratch on a desktop-oriented dataset. Validation results on MPIIFaceGaze show comparable performance between both approaches prior to calibration, while lightweight user-specific fine-tuning consistently reduces gaze prediction error. We further evaluate EyeTheia in a realistic Dot-Probe task and compare it to the commercial webcam-based tracker SeeSo SDK. Results indicate strong agreement in left-right gaze allocation during stimulus presentation, despite higher temporal variability. Overall, EyeTheia provides a transparent and extensible solution for low-cost gaze tracking, suitable for scalable and reproducible experimental and clinical studies. The code, trained models, and experimental materials are publicly available.

2606.04364 2026-06-12 cs.CV cs.LG 版本更新

Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

通过部分分解注意力的空间基础概念瓶颈模型

Dhanesh Ramachandram

发表机构 * Vector Institute(向量研究所)

AI总结 提出一种部分分解的概念瓶颈模型,通过空间先验约束注意力,在细粒度识别中实现可解释性并提升定位精度。

Comments Updated results with GlobalAttention Tokens

详情
AI中文摘要

概念瓶颈模型(CBM)在预测类别之前预测一层人类命名的属性,从而使其决策可审计。在细粒度识别任务中,概念头通常可以自由关注图像中的任何位置,因此以某个身体区域命名的头可能被其他区域的证据满足。本研究通过构造一个部分分解的CBM来消除这种自由度。该方法基于冻结的DINOv3视觉变换器,包含三个组件。一个学习到的前景门控,基于DINOv3块特征训练,抑制部分注意力内的背景块。一组部分查询交叉关注块特征,并且312个CUB属性中的每一个通过固定的概念到部分映射被路由,仅从其名称所暗示的部分令牌读取。一个可学习的二维高斯先验,以对数空间加性注入注意力logits,打破部分查询之间的排列对称性;其均值从每个部分的数据集平均关键点位置初始化,在训练或测试时不需要每张图像的关键点监督。在CUB-200-2011上,空间先验模型匹配完全监督基线(top-1准确率88.85%对88.95%),同时将指向精度提高16个百分点(52.6%对36.4%)。用PCA前景目标替换边界框监督,并与高斯先验结合,消除了所有每张图像监督,达到88.6%的top-1准确率和约70%的指向精度。关键点分数扫描显示,训练集的0.5%(约27张图像)足以初始化先验,且无显著损失。完全移除部分身份是更困难的情况:没有任何空间先验,指向精度降至2.9%。

英文摘要

Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks, the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction. The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies. A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011 the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%). Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.
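The phrase "injected additively in log space into the attention logits" can be made concrete with a small sketch. It assumes an axis-aligned, unnormalized Gaussian over a 4 x 4 patch grid and a single part query with uninformative content logits; the grid size, mean, sigma, and all function names are hypothetical, and nothing here is learned, unlike the prior in the paper.

```python
import math

def gaussian_log_prior(h, w, mean, sigma):
    """Log of an unnormalized axis-aligned 2D Gaussian over an h x w patch grid."""
    my, mx = mean
    sy, sx = sigma
    return [
        -((y - my) ** 2) / (2 * sy ** 2) - ((x - mx) ** 2) / (2 * sx ** 2)
        for y in range(h)
        for x in range(w)
    ]

def softmax(values):
    """Numerically stable softmax over a flat list."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def part_attention(logits, log_prior):
    """Add the log-prior to the content logits, then normalize."""
    return softmax([l + p for l, p in zip(logits, log_prior)])

H, W = 4, 4
content_logits = [0.0] * (H * W)  # uninformative: every patch looks the same
prior = gaussian_log_prior(H, W, mean=(1.0, 2.0), sigma=(1.0, 1.0))
attn = part_attention(content_logits, prior)

# With flat content logits the attention peak sits at the prior mean.
peak = max(range(H * W), key=attn.__getitem__)
peak_yx = divmod(peak, W)
```

Adding the log-prior before the softmax is equivalent to multiplying the unnormalized attention weights by the Gaussian itself, so two part queries with identical content scores but different prior means end up attending to different regions. That is one way to read the abstract's claim that the prior breaks the permutation symmetry among part queries.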

2606.09855 2026-06-12 cs.MM cs.CV cs.LG 版本更新

MinhwaNet: Faithful but Insufficient Object Grounding in Korean Folk Painting

MinhwaNet: 韩国民俗画中忠实但不足的对象定位

Joonhyung Bae

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院)

AI总结 提出MinhwaNet,通过部分级检测器生成对象证据图,发现韩国民俗画中符号列表不足以预测画作类型,而符号布局更重要,揭示了忠实但不足的解离现象。

详情
AI中文摘要

韩国民俗画(minhwa)由少量吉祥符号构成——老虎代表保护、一对鸟代表婚姻和谐、牡丹代表财富——这些符号在其许多绘画类型中反复出现。这暗示了一种直观的计算方法:识别画作中出现的符号,并从符号清单中读取画作类型。我们使用一个公开语料库,包含整幅画作、八字段双语策展说明以及一组独立的专家对象裁剪图,发现这种方法并不奏效。仅给定画作包含的符号列表的模型,其预测画作类型的效果远不如将图像与策展文本融合的模型,而强制类型表示基于对象定位反而会损害准确性。然而,类型预测所依赖的视觉证据仍然是局部化的且可检查的。从部分级检测器投影出的无泄漏对象证据图,在空间上忠实于策展人隔离符号对象的位置以及基于补丁的替代模型的梯度显著性。我们将这种配置称为忠实但不足的解离。部分级解释诚实地反映了部分级模型所见,但类型目标取决于符号的排列方式而非出现的符号。相同的视角区分了内容标签(在转移到保留的源机构时仍然有效,即类型)和风格标签(无效,即时代),我们通过语料库中的另外两个标签验证了这一预测。我们发布了多模态系统、一幅画作的证据图与其目录的工作示例解读,以及在长尾遗产收藏中反复出现的一系列评估注意事项。

英文摘要

Korean folk painting (minhwa) is built from a small vocabulary of auspicious symbols (a tiger for protection, a pair of birds for marital harmony, a peony for wealth) that recur across many of its painted genres. This suggests an obvious computational approach: identify which symbols appear in a painting and read the genre from the inventory. Working with a public corpus that pairs whole paintings, eight-field bilingual curatorial captions, and a separate set of expert object crops, we find that this approach does not work. A model given only a list of which symbols a painting contains predicts the genre far worse than a model that fuses the image with the curatorial text, and forcing the genre representation to be object-grounded actively hurts accuracy. The visual evidence on which the genre prediction rests is nonetheless localized and inspectable. A leakage-safe object evidence map projected from a part-level detector is spatially faithful to where curators isolated symbolic objects and to a patch-based surrogate's own gradient saliency. We name this configuration a faithful-but-insufficient dissociation. The part-level explanation is honest about what the part-level model sees, yet the genre target turns on how symbols are arranged rather than on which ones appear. The same lens separates a content label that survives transfer to held-out source institutions (genre) from a style label that does not (era), a prediction we confirm on two further labels in the corpus. We release the multimodal system, a worked-example reading of one painting's evidence map against its catalogue, and a set of evaluation cautions that recur in long-tailed heritage collections.

4. 目标检测、分割与定位 6 篇

2606.12628 2026-06-12 cs.CV 新提交

Context-Aware Feature-Fusion for Co-occurring Object Detection in Autonomous Driving

面向自动驾驶中共现对象检测的上下文感知特征融合

Binay Kumar Singh, Niels Da Vitoria Lobo

发表机构 * Department of Computer Science, University of Central Florida(中佛罗里达大学计算机科学系)

AI总结 提出上下文中心特征融合框架CCFF,通过局部上下文融合模块和全局上下文注意力模块分别处理小/遮挡对象与共现先验,提升共现对象检测性能,在Cityscapes和BDD100K上实现类别一致性策略0.973和0.969,小目标检测AP_S提升14.1%。

Comments 8 pages, 3 figures, CVPR 2026 Precognition Workshop

详情
AI中文摘要

自动驾驶中的目标检测需要精确定位以及对共现对象之间关系上下文的固有理解。在极其复杂的异构环境中,稀有类别、小尺度对象和频繁出现的对象对于标准目标检测框架来说难以处理。在本文中,我们提出了一种新颖的框架,称为上下文中心特征融合(CCFF),它利用两个基于注意力的模块:局部上下文融合模块(LCFM)使用RoI到RoI的自注意力机制来解决空间交互,主要考虑小且部分遮挡的对象;而全局上下文注意力模块(GCAM)通过将top-K RoI特征池化为全局上下文注意力标记来转换对象的共现先验,避免了像素级全局池化的计算开销。这种局部和以对象为中心的全局特征的融合产生了上下文化的嵌入,增强了分类结果和共现对象检测。我们的方法在两个数据集Cityscapes和BDD100K上进行了评估,在关系一致性上显示出显著改进,分别达到了0.973和0.969的类别级一致性策略(CCS)。此外,我们的方法在小目标检测(AP_S: 14.1%)上取得了实质性提升,并成功恢复了通常在大分布中丢失的稀有类别,如“火车”。我们的效率报告显示,该框架以0.2 FPS的开销实时处理图像。代码可在 https://github.com/BinayKSingh/CCFF 获取。

英文摘要

Object detection in autonomous driving requires precise localization and an inherent understanding of the relational context between co-occurring objects. In extremely complex, heterogeneous environments, rare classes, small-scale objects, and frequently appearing objects are difficult for standard object detection frameworks to handle. In this paper, we propose a novel framework called Context-Centric Feature Fusion (CCFF), which uses two attention-based modules. The Local Context Fusion Module (LCFM) uses an RoI-to-RoI self-attention mechanism to resolve spatial interactions, mainly for small and partially obscured objects, while the Global Context Attention Module (GCAM) converts object co-occurrence priors into a global context attention token by pooling top-K RoI features, avoiding the computational overhead of pixel-level global pooling. This fusion of local and object-centric global features yields contextualized embeddings that enhance classification results and the detection of co-occurring objects. Our method is evaluated on two datasets, Cityscapes and BDD100K, where it demonstrates significant improvement in relational consistency, achieving a Category-level Consistency Strategy (CCS) of 0.973 and 0.969, respectively. Furthermore, our approach produces substantial gains in small object detection (AP_S: 14.1%) and successfully recovers rare classes such as "Train" that are typically lost in large distributions. Our efficiency report shows that the framework processes images in real time with a 0.2 FPS overhead. The code is available at https://github.com/BinayKSingh/CCFF.
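As a rough illustration of pooling top-K RoI features into a single global token, the sketch below selects the K highest-scoring RoIs and averages their feature vectors. The abstract does not state which pooling operator or ranking score GCAM uses, so mean pooling, ranking by detection score, and the name `global_context_token` are all assumptions made here for illustration.

```python
def global_context_token(roi_features, roi_scores, k):
    """Mean-pool the features of the k highest-scoring RoIs into one vector."""
    order = sorted(range(len(roi_scores)), key=lambda i: roi_scores[i], reverse=True)
    top = order[:k]
    dim = len(roi_features[0])
    return [sum(roi_features[i][d] for i in top) / len(top) for d in range(dim)]

# Hypothetical data: four RoIs with 2-dimensional features.
features = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [9.0, 9.0]]
scores = [0.9, 0.2, 0.8, 0.1]

# RoIs 0 and 2 have the highest scores, so the token is their average.
token = global_context_token(features, scores, k=2)
```

The cost of this step depends on the number of RoIs rather than the number of pixels, which is consistent with the abstract's point about avoiding pixel-level global pooling.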

2606.12826 2026-06-12 cs.CV cs.AI 新提交

DIMOS: Disentangling Instance-level Moving Object Segmentation

DIMOS: 解耦实例级运动目标分割

Hongxiang Huang, Hongwei Ren, Xiaopeng Lin, Yulong Huang, Zeke Xie, Bojun Cheng

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出双解耦特征提取框架分离图像与事件模态的外观和运动信息,并通过多粒度跨模态对齐实现有效融合,在运动实例分割任务中尤其对快速运动和低光下的小目标取得最优性能。

详情
AI中文摘要

运动实例分割(MIS)因其在交通监控、自动驾驶和动物追踪等领域的广泛应用而日益受到关注。事件相机记录异步亮度变化,提供高时间分辨率和动态范围,使其对运动信息高度敏感。通过融合事件和图像特征,事件中的运动线索可以补充图像中的空间细节,从而提升MIS的性能。然而,当前的多模态MIS方法仍然难以分割小的运动实例,因为事件相机在有限分辨率下往往产生稀疏特征。此外,事件特征将外观属性与运动线索纠缠在一起,进一步限制了有效的跨模态融合。为解决这些挑战,我们首先提出一个双解耦特征提取框架,在图像和事件模态内分离并提取外观和运动信息,从而改善特征密度。随后,引入多粒度跨模态对齐,以对齐跨模态分布和语义一致的特征,实现具有丰富空间和时间细节的更有效融合。实验结果表明,我们的方法在多模态MIS中达到了最先进的性能,特别是在快速运动和低光等挑战性条件下的小实例分割方面。

英文摘要

Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. Experimental results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.

2606.12958 2026-06-12 cs.CV 新提交

YOLO-AMC: An Improved YOLO Architecture with Attention Mechanisms for Building Crack Detection

YOLO-AMC:一种改进的带有注意力机制的YOLO架构用于建筑裂缝检测

Ching-Yu Tsai, Chia-Min Lin, Chih-Hsiang Yang, Yung-Che Wang, Jen-Shiun Chiang

发表机构 * Department of Electrical and Computer Engineering, Tamkang University(淡江大学电机与计算机工程系)

AI总结 提出YOLO-AMC,在YOLOv11中移除C2PSA并引入GAM、Res-CBAM、SA等注意力机制,增强裂缝检测性能,在测试集上mAP@0.5达0.9917,速度110.95 FPS,兼顾精度与部署效率。

Comments 14 pages, 8 tables, 6 figures. Expanded version of IET ICETA 2025 conference paper

详情
AI中文摘要

裂缝检测在基础设施检查和结构健康监测(SHM)中起着重要作用。然而,裂缝通常表现为薄、低对比度的结构,且容易受到背景噪声的影响,给现有目标检测模型带来了挑战。本研究提出了一种改进的基于YOLO的架构,集成了注意力机制,称为YOLO-AMC(用于裂缝检测的YOLO注意力机制),以增强自动裂缝检测性能。基于YOLOv11,移除了原始的C2PSA模块,并在Neck的多尺度特征融合层中引入了多种注意力机制,包括全局注意力机制(GAM)、残差卷积块注意力模块(Res-CBAM)和Shuffle Attention(SA),以加强跨尺度特征整合。实验结果表明,YOLO-AMC在多个评估指标上始终优于基线模型YOLOv11n和YOLOv8n。在评估的注意力模块中,GAM取得了最佳检测性能,在测试数据集上获得了mAP@0.5 = 0.9917和mAP@0.5:0.95 = 0.9506,高于YOLOv11(0.9833 / 0.9112)和YOLOv8(0.9707 / 0.8921)。此外,在保持7.6 GFLOPs计算复杂度的同时,所提出的模型在NVIDIA RTX 4090平台上达到了110.95 FPS,在Raspberry Pi 5边缘设备上约为5 FPS,展示了准确性与部署效率之间的良好权衡。本研究的实现代码可在GitHub上获取,网址为:https://github.com/CY-Tsai24/YOLO-AMC

英文摘要

Crack detection plays an important role in infrastructure inspection and Structural Health Monitoring (SHM). However, cracks typically appear as thin, low-contrast structures and are easily affected by background noise, posing challenges for existing object detection models. This study proposes an improved YOLO-based architecture with integrated attention mechanisms, termed YOLO-AMC (YOLO with Attention Mechanisms for Crack Detection), to enhance automated crack detection performance. Based on YOLOv11, the original C2PSA module is removed, and multiple attention mechanisms, including Global Attention Mechanism (GAM), Residual Convolutional Block Attention Module (Res-CBAM), and Shuffle Attention (SA), are introduced into the multi-scale feature fusion layers of the Neck to strengthen cross-scale feature integration. Experimental results demonstrate that YOLO-AMC consistently outperforms baseline models YOLOv11n and YOLOv8n across multiple evaluation metrics. Among the evaluated attention modules, GAM achieves the best detection performance, obtaining mAP@0.5 = 0.9917 and mAP@0.5:0.95 = 0.9506 on the test dataset, which are higher than those of YOLOv11 (0.9833 / 0.9112) and YOLOv8 (0.9707 / 0.8921). Furthermore, while maintaining a computational complexity of 7.6 GFLOPs, the proposed model achieves 110.95 FPS on an NVIDIA RTX 4090 platform and approximately 5 FPS on a Raspberry Pi 5 edge device, demonstrating a favorable trade-off between accuracy and deployment efficiency. The implementation code for this study is available on GitHub at https://github.com/CY-Tsai24/YOLO-AMC.

2606.13033 2026-06-12 cs.CV 新提交

SAM-Deep-EIoU: Selective Mask Propagation for Multi-Object Tracking

SAM-Deep-EIoU:面向多目标跟踪的选择性掩码传播

Alexander Holmberg

发表机构 * KTH Royal Institute of Technology(瑞典皇家理工学院)

AI总结 提出选择性掩码传播算法,仅在不确定性高的帧调用视频目标分割模型,以轻量级基跟踪器为主,在DanceTrack和SportsMOT上提升性能,SportsMOT达86.8 HOTA。

详情
AI中文摘要

多目标跟踪的难度分布呈重尾特性:大多数帧对于轻量级基跟踪器是容易的,而一小部分帧本质上是困难的。视频目标分割(VOS)模型通常能在基跟踪器失败的困难帧中保持身份,但其计算和内存成本高得多。我们提出选择性掩码传播,一种跟踪算法,仅在分配不确定性信号触发的窗口上从基跟踪器调度到VOS模型。仅当VOS模型做出与基跟踪器身份分配相矛盾的置信预测时,才修改基跟踪器的输出;弱或不确定的预测保留基输出。该方法无需训练,将基跟踪器和VOS模型均视为黑盒,并且可以通过用更强大的模型替换VOS组件而受益。在DanceTrack上,选择性掩码传播改进了三种不同的基跟踪器。在SportsMOT上,身份保持是体育分析的核心,使用全局轨迹关联的SAM3-Deep-EIoU以86.8 HOTA达到基准上的最先进性能。

英文摘要

Multi-object tracking has a heavy-tailed difficulty distribution: most frames are easy for a lightweight base tracker, while a small fraction are intrinsically hard. Video object segmentation (VOS) models can often preserve identity through the hard frames where the base tracker fails, but they are much more expensive in compute and memory. We propose selective mask propagation, a tracking algorithm that dispatches from a base tracker to a VOS model only on windows where an assignment-uncertainty signal fires. The base tracker's output is modified only when the VOS model makes a confident prediction that contradicts the base tracker's identity assignment; weak or inconclusive predictions preserve the base output. The method is training-free, treats both the base tracker and the VOS model as black boxes, and can benefit from replacing the VOS component with a more capable model. On DanceTrack, selective mask propagation improves three different base trackers. On SportsMOT, where identity preservation is central to sports analytics, SAM3-Deep-EIoU with global track association achieves state-of-the-art performance on the benchmark with 86.8 HOTA.
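The dispatch rule described in this abstract can be sketched as a small decision function. This is a simplified reading, not the authors' implementation: the two thresholds, the per-track (rather than per-window) granularity, and the names `select_identity` and `fake_vos` are hypothetical, and the VOS model is stood in for by a stub.

```python
def select_identity(base_id, uncertainty, vos_predict,
                    uncertainty_threshold=0.5, confidence_threshold=0.8):
    """Return the identity to keep for one track.

    vos_predict is a zero-argument callable returning (vos_id, confidence).
    It is invoked only when the base tracker's assignment looks uncertain.
    """
    if uncertainty < uncertainty_threshold:
        return base_id  # easy case: the expensive model is never called
    vos_id, confidence = vos_predict()
    if vos_id != base_id and confidence >= confidence_threshold:
        return vos_id   # confident contradiction: override the base tracker
    return base_id      # weak or agreeing prediction: keep the base output

calls = []  # records every time the stubbed VOS model is actually run

def fake_vos(result):
    """Build a stub VOS callable that logs its invocation."""
    def run():
        calls.append(result)
        return result
    return run

easy = select_identity(3, 0.1, fake_vos((7, 0.99)))    # low uncertainty
weak = select_identity(3, 0.9, fake_vos((7, 0.40)))    # uncertain, weak VOS
strong = select_identity(3, 0.9, fake_vos((7, 0.95)))  # uncertain, confident VOS
```

Only the two uncertain cases invoke the stubbed VOS model, and only the confident contradiction changes the output. That asymmetry is the property the abstract emphasizes: the base tracker remains the default, and the expensive model acts as a selective correction.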

2606.13587 2026-06-12 cs.CV 新提交

Towards Effective Waste Segmentation for Automated Waste Recycling in Cluttered Background

面向杂乱背景下的自动废物回收的有效废物分割

Mamoona Javaid, Mubashir Noman, Abdul Hannan, Shah Nawaz, Mustansar Fiaz, Sajid Ghuffar

发表机构 * University of Science and Technology Beijing(北京科技大学)

AI总结 提出一种结合空间域和谱域的级联分割网络,并引入辅助特征增强模块,在杂乱场景下实现高效废物分割,在三个数据集上验证了有效性。

Comments Accepted at ICML 2026

详情
AI中文摘要

城市区域的快速扩张和人口增长导致废物产量急剧增加,这需要高效自动化的废物管理。在此背景下,使用深度学习的自动废物回收(AWR)可以帮助人类实现最优废物管理。最近的AWR深度学习方法提供了有前景的废物分割性能,但这些方法依赖大型骨干网络,对AWR系统效率低下,且在杂乱场景中性能下降。为此,本文引入了一种最优废物分割网络,该网络有效利用空间域捕获局部结构依赖性和谱域高效提取全局上下文关系。这种级联设计使网络能够逐步利用互补域中的局部和全局表示,突出有效分割各种废物对象所需的语义信息。此外,引入了辅助特征增强模块(AFEM),以增强目标对象的边界和斑点放大,从而在杂乱场景中实现更好的分割。在ZeroWaste-aug、ZeroWaste-f和SpectralWaste数据集上的大量实验揭示了所提出方法的优势。

英文摘要

Rapid expansion of urban areas and population growth are causing an immense increase in waste production, which demands efficient and automated waste management. In this scenario, automated waste recycling (AWR) using deep learning methods can assist humans in optimal waste management. Recent deep learning approaches for AWR provide promising waste segmentation performance; however, these methods rely on large backbone networks that are inefficient for AWR systems and suffer from performance deterioration in cluttered scenes. To this end, an optimal waste segmentation network is introduced that effectively utilizes the spatial domain to capture localized structural dependencies and the spectral domain to efficiently extract global contextual relationships. This cascaded design allows the network to progressively leverage both local and global representations across complementary domains to highlight the semantic information necessary for effective segmentation of various waste objects. Furthermore, an auxiliary feature enhancement module (AFEM) is introduced to enhance target-object boundaries and amplify blobs for better segmentation in cluttered scenarios. Extensive experimentation on the ZeroWaste-aug, ZeroWaste-f, and SpectralWaste datasets reveals the merits of the proposed method.

2606.13042 2026-06-12 cs.AI cs.CV 交叉投稿

Augmentation techniques for video surveillance in the visible and thermal spectral range

可见光和热红外光谱范围内视频监控的增强技术

Vanessa Buhrmester, Ann-Kristin Grosselfinger, David Munch, Michael Arens

发表机构 * Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB(弗劳恩霍夫光学、系统技术与图像处理研究所)

AI总结 针对多光谱CNN目标检测,研究可见光与热红外图像差异,探索数据增强技术对分类精度的影响,以提升监控性能。

Comments 8 pages

详情
Journal ref
SPIE Security + Defence, Strasbourg, 10th September 2019
AI中文摘要

在智能视频监控中,摄像机在白天和夜晚记录图像序列。通常,这需要不同的传感器。为了获得更好的性能,将它们结合起来并不罕见。我们关注的情况是,长波红外摄像机连续记录,此外,另一台摄像机在白天记录可见光谱范围内的图像,并且智能算法监控采集的图像。更准确地说,我们的任务是基于多光谱CNN的目标检测。乍一看,可见光谱范围内的图像与热红外图像的区别在于,前者具有颜色和清晰的纹理信息,但不包含物体发出的热辐射信息。尽管颜色可以为分类任务提供有价值的信息,但诸如光照变化和不同传感器的特性等因素仍然构成重大问题。无论如何,获取足够且实用的热红外数据集来训练深度神经网络仍然是一个挑战。这就是为什么借助可见光谱范围内的数据进行训练可能是有利的,特别是当待评估的数据同时包含可见光和红外数据时。然而,目前尚不清楚热辐射、形状或颜色信息的强烈变化如何影响分类精度。为了更深入地了解卷积神经网络如何做出决策以及它们从不同传感器输入数据中学到什么,我们研究了不同增强技术的适用性和鲁棒性。

英文摘要

In intelligent video surveillance, cameras record image sequences during day and night. Commonly, this demands different sensors. To achieve better performance, it is not unusual to combine them. We focus on the case in which a long-wave infrared camera records continuously while another camera records in the visible spectral range during daytime, and an intelligent algorithm supervises the captured imagery. More precisely, our task is multispectral CNN-based object detection. At first glance, images from the visible spectral range differ from thermal infrared ones in that they contain color and distinct texture information on the one hand, but no information about the thermal radiation emitted by objects on the other. Although color can provide valuable information for classification tasks, effects such as varying illumination and the characteristics of different sensors still represent significant problems. Moreover, obtaining sufficient and practical thermal infrared datasets for training a deep neural network still poses a challenge. For this reason, training with the help of data from the visible spectral range could be advantageous, particularly if the data to be evaluated contains both visible and infrared data. However, there is no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from different sensor input data, we investigate the suitability and robustness of different augmentation techniques...

5. 视频理解与时序视觉 8 篇

2606.12601 2026-06-12 cs.CV 新提交

Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

双状态槽注意力:解耦外观与身份用于视频目标中心学习

Sieu Tran, Duc Nguyen, Hao Vo, Khoa Vo, Ngan Le

发表机构 * University of Arkansas(阿肯色大学)

AI总结 提出双状态槽注意力(DSSA),通过分离每个槽为局部状态(外观)和身份状态(稳定身份),并采用竞争调制聚合减少弱匹配槽的干扰,提升视频目标分割质量与时间一致性。

详情
AI中文摘要

无监督视频目标中心学习旨在无需监督地将动态场景分解为持久的目标级表示。然而,现有的基于槽的方法在快速运动和部分遮挡等挑战性场景中难以维持稳定的目标身份。首先,它们通常将目标的每帧外观和跨帧身份编码在单个槽向量中,造成优化目标之间的冲突并导致槽交换:重建需要对瞬态视觉变化敏感,而时间一致性需要对它们不变。其次,槽注意力中使用的令牌重归一化可能放大弱注意力槽,使其吸收其他目标的令牌,破坏槽与目标的对应关系。我们提出双状态槽注意力(DSSA),一种完全自监督框架,通过分离外观与身份并减少弱匹配槽的虚假更新来解决这些限制。DSSA将每个槽分解为用于每帧外观的局部状态和用于时间稳定目标信息的身份状态,从而用分离的表示分别对应重建和时间一致性。身份状态通过学习的循环转换更新,该转换作为局部状态的时间滤波器,而竞争调制聚合(CMA)降低弱匹配槽的更新权重,防止它们吸收其他目标的令牌。在MOVi-C、MOVi-D和YouTube-VIS上的实验表明,DSSA在分割质量和时间一致性上持续优于先前方法,同时在下游目标识别和视频动态预测中表现更强。代码和模型将在论文被接收后公开。

英文摘要

Unsupervised video object-centric learning aims to decompose dynamic scenes into persistent, object-level representations without supervision. However, existing slot-based methods struggle to maintain stable object identity in challenging settings such as rapid motion and partial occlusion. First, they typically encode both the per-frame appearance of an object and its identity across frames in a single slot vector, creating an objective conflict that leads to slot swapping: reconstruction requires sensitivity to transient visual changes, whereas temporal consistency requires invariance to them. Second, the token renormalization used in Slot Attention can amplify weakly attending slots, allowing them to absorb tokens from other objects and destabilize slot-to-object correspondence. We propose Dual-State Slot Attention (DSSA), a fully self-supervised framework that addresses these limitations by separating appearance from identity and by reducing spurious updates from weakly matching slots. DSSA decomposes each slot into a local state for per-frame appearance and an identity state for temporally stable object information, thereby aligning reconstruction and temporal consistency with separate representations. The identity state is updated through a learned recurrent transition that acts as a temporal filter on the local state, while competition-modulated aggregation (CMA) down-weights updates from weakly matching slots and prevents them from absorbing tokens from other objects. Experiments on MOVi-C, MOVi-D, and YouTube-VIS demonstrate that DSSA consistently improves segmentation quality and temporal consistency over prior methods, while also yielding stronger downstream object recognition and video dynamics prediction. Code and models will be made publicly available upon acceptance.

2606.13030 2026-06-12 cs.CV 新提交

A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition

一种结合跨主体伪标签与语义对齐的多模态微手势识别框架

Haoran Zhang, Haokun Zhang, Pengyu Liu, Yujia Zhang, Weibao Xue, Yanbin Hao

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology (HFUT)(合肥工业大学计算机科学与信息工程学院) School of Computer Science, University of Auckland (UOA)(奥克兰大学计算机科学学院)

AI总结 针对微手势识别中低信噪比、长尾分布和跨主体域偏移问题,提出多模态框架,通过显著性引导提取、平方根平滑加权、正交语义嵌入损失和跨模态伪标签策略,实现有效识别,F1分数达68.13%。

Comments 14 pages, 2 figures

详情
AI中文摘要

微手势(MGs)是自发的、细微的身体动作,经常传达隐藏的人类情感。在未修剪视频中识别MGs仍然极具挑战性,因为其极低的信噪比、严重的长尾类分布以及跨主体评估场景中固有的域偏移。在本文中,我们为第四届MiGA-IJCAI挑战赛的Track 1提出了一个全面的多模态框架。为了捕捉细粒度表示,我们设计了一个显著性引导的多模态提取流程,整合了68关键点骨架关节坐标、3D热图体积和高分辨率RGB视觉特征。我们引入了一种温和的平方根平滑加权机制,配合正交语义嵌入损失,以保护尾部类别而不损害整体识别能力。更重要的是,为了弥合跨主体泛化差距,我们提出了一种跨模态伪标签(CMPL)策略用于无监督域适应,显著提升了单模态鲁棒性。最后,采用温度缩放软投票机制以减轻后期融合中的过度自信。大量实验表明,我们的框架达到了具有竞争力的68.13%的F1分数,获得第四名。

英文摘要

Micro-gestures (MGs) are spontaneous and subtle body movements that frequently convey hidden human emotions. Recognizing MGs in untrimmed videos remains highly challenging due to their extremely low signal-to-noise ratio, severe long-tailed class distribution, and the inherent domain shift encountered in cross-subject evaluation scenarios. In this paper, we propose a comprehensive multi-modal framework for Track 1 of the 4th MiGA-IJCAI Challenge. To capture fine-grained representations, we design a saliency-guided multi-modal extraction pipeline integrating 68-keypoint skeleton joint coordinates, 3D heatmap volumes, and high-resolution RGB visual features. We introduce a gentle square-root smoothed weighting mechanism paired with an Orthogonal Semantic Embedding Loss to protect tail classes without compromising overall recognition capabilities. More importantly, to bridge the cross-subject generalization gap, we propose a Cross-Modal Pseudo-Labeling (CMPL) strategy for unsupervised domain adaptation, which significantly boosts single-modal robustness. A temperature-scaled soft-voting mechanism is finally utilized to alleviate overconfidence during late fusion. Extensive experiments demonstrate that our framework achieves a competitive F1-score of 68.13%, securing 4th place.
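
The abstract above does not give formulas for the square-root smoothed weighting or the temperature-scaled soft voting, so the following is only a minimal sketch under stated assumptions: class weights are taken as 1/sqrt(count) normalised to mean 1, and late fusion averages per-modality softmax outputs computed at a shared temperature. The function names and the default temperature are hypothetical and not taken from the paper.

```python
import math


def sqrt_smoothed_weights(class_counts):
    """Assumed form: weight each class by 1/sqrt(count), rescaled to mean 1.

    Compared with plain inverse-frequency weights, the square root
    compresses the gap between head and tail classes.
    """
    raw = [1.0 / math.sqrt(c) for c in class_counts]
    mean = sum(raw) / len(raw)
    return [w / mean for w in raw]


def softmax(logits, temperature=1.0):
    """Numerically stable softmax; temperature > 1 flattens the output."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def soft_vote(per_modality_logits, temperature=2.0):
    """Average the temperature-softened probabilities of all modalities."""
    probs = [softmax(logits, temperature) for logits in per_modality_logits]
    n_classes = len(probs[0])
    return [sum(p[k] for p in probs) / len(probs) for k in range(n_classes)]
```

With counts `[100, 25, 4]`, plain inverse-frequency weighting would give the rarest class 25 times the weight of the most frequent one, whereas the square-root form gives it 5 times. That is one plausible reading of "gentle" in the abstract.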

2606.13332 2026-06-12 cs.CV 新提交

OR-Action: Multi-Role Video Understanding with Fine-Grained Actions

OR-Action: 细粒度动作的多角色视频理解

Felix Tristram, Ege Özsoy, Christian Benz, Marcel Walch, Ghazal Ghazaei, Nassir Navab

发表机构 * Technical University of Munich(慕尼黑工业大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Carl Zeiss AG(卡尔蔡司股份公司)

AI总结 针对手术室活动理解中场景图方法缺乏时间建模的问题,提出基于公开数据集的细粒度多角色动作基准,并引入纯视觉时序模型,显著优于图方法,同时提出多视角到单视角特征对齐策略提升单视角性能。

详情
AI中文摘要

对手术室(OR)活动的细粒度理解能够实现工作流感知的辅助,但由于杂乱、遮挡和有限的感知,仍然困难。建模该环境的主流方法是使用场景图作为OR交互的可解释表示。然而,在没有显式时间建模的情况下,将它们的逐帧关系预测转换为时间上延伸的细粒度动作是具有挑战性的。为了对当前OR理解方法进行原则性的时间评估,我们引入了第一个以动作为中心的基准,该基准基于公开可用的自我中心-外部中心OR数据集,通过定义细粒度的多角色动作分类法,并通过从真值(ground-truth)场景图状态变化中蒸馏生成密集动作片段。在该基准上的实验表明,当前的场景图预测方法难以建模时间结构,即使通过图神经网络添加显式建模也是如此。因此,我们引入了一种纯视觉时间模型,当使用所有可用的自我中心视频作为输入时,该模型显著优于基于图的方法。在此模型基础上,我们还引入了一种新颖的多视角到单视角特征对齐策略,提高了多角色动作识别的单视角性能,减少了对大量自我中心视频采集的需求。基准和代码将在论文被接收后发布。

英文摘要

Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to modeling this environment uses scene graphs as an interpretable representation of OR interactions. However, converting their frame-wise relational predictions into temporally extended, fine-grained actions is challenging without explicit temporal modeling. To enable a principled temporal evaluation of current OR understanding methods, we introduce the first action-centric benchmark built on a publicly available ego-exocentric OR dataset by defining a fine-grained, multi-role action taxonomy and generating dense action segments via distillation from ground-truth scene graph state changes. Experiments on this benchmark show that current scene graph prediction methods struggle to model temporal structure, even when adding explicit modeling through Graph Neural Networks. We therefore introduce a vision-only temporal model that outperforms graph-based methods significantly when using all available egocentric video as input. Building on this model, we also introduce a novel multi- to single-view feature alignment strategy that improves single-view performance on multi-role action recognition, mitigating the need for extensive egocentric video capture. Benchmark and code will be released upon acceptance.

2606.13410 2026-06-12 cs.CV cs.HC 新提交

Person Identification from Contextual Motion

基于情境运动的人物识别

Igor Kviatkovsky, Ehud Rivlin, Ilan Shimshoni

发表机构 * Technion – Israel Institute of Technology(以色列理工学院) University of Haifa(海法大学)

AI总结 提出一种生成模型描述动作实例创建过程,并针对监控和认证应用推导概率身份推断方案;引入交互式人物识别场景,通过序列化消息交换最大化互信息,实现高识别率。

详情
AI中文摘要

我们考虑基于运动风格识别人的问题。我们提出了一个描述动作实例创建过程的生成模型,并针对监控和认证应用所驱动的两种常见人物识别场景推导了概率身份推断方案。我们引入了一种新颖的、交互式的人物运动模式识别场景。为此,我们将识别过程形式化为受试者与系统之间的顺序消息交换会话。受试者的行为使用受人类信息处理(HIP)范式启发的概率生成模型建模。在每个阶段,系统向受试者呈现视觉刺激(线索)并记录其运动响应。线索的选择旨在最大化预期响应与受试者身份的互信息。一旦记录,响应用于更新可能受试者身份的后验概率。一旦达到足够的分类置信水平,该过程终止。据我们所知,这是首次在这种交互式设置中解决人物识别问题。我们在五个公开数据集和我们自己的新数据集(包含22名受试者对15个线索的4,476条记录)上报告了高识别率。

英文摘要

We consider the problem of identifying people based on their motion styles. We present a generative model describing the action instance creation process and derive a probabilistic identity inference scheme for two common person identification scenarios motivated by surveillance and authentication applications. We introduce a novel, interactive scenario for person identification from motion patterns. To this end, we formalize the identification process in the context of a sequential message exchange session between the subject and the system. The subject's behavior is modeled using a probabilistic generative model inspired by the Human Information Processing (HIP) paradigm. At each stage, the system presents a visual stimulus (a cue) to the subject and records their motion response. The cue is selected so as to maximize the mutual information of the expected response and the subject's identity. Once recorded, the response is used to update the a posteriori probability over possible subjects' identities. The process terminates once a sufficient classification confidence level is reached. To the best of our knowledge, this is the first time person identification is addressed in such an interactive setting. We report high recognition rates on five publicly available datasets and our own novel dataset consisting of 4,476 recordings of 22 test subjects responding to 15 cues.
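
The interactive loop described above (pick the cue that maximizes mutual information, record the response, update the posterior, stop at a confidence threshold) can be sketched as follows. This is an illustration only: it assumes responses have been discretised into a finite set and that a likelihood table `cue_liks[c][s][r] = P(response r | subject s, cue c)` is available, neither of which is stated in the abstract. All names, the threshold, and the round limit are hypothetical.

```python
import math


def mutual_information(prior, lik):
    """I(R; S) for one cue, where lik[s][r] = P(response r | subject s)."""
    n_resp = len(lik[0])
    marginal = [sum(prior[s] * lik[s][r] for s in range(len(prior)))
                for r in range(n_resp)]
    mi = 0.0
    for s, p_s in enumerate(prior):
        for r in range(n_resp):
            p = lik[s][r]
            if p_s > 0 and p > 0:
                mi += p_s * p * math.log(p / marginal[r])
    return mi


def select_cue(prior, cue_liks):
    """Index of the cue whose expected response is most informative."""
    return max(range(len(cue_liks)),
               key=lambda c: mutual_information(prior, cue_liks[c]))


def update_posterior(prior, lik, response):
    """Bayes update of the identity posterior after observing a response."""
    unnorm = [p * lik[s][response] for s, p in enumerate(prior)]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def identify(prior, cue_liks, respond, threshold=0.95, max_rounds=20):
    """Present cues until one identity exceeds the confidence threshold."""
    post = list(prior)
    for _ in range(max_rounds):
        if max(post) >= threshold:
            break
        cue = select_cue(post, cue_liks)
        post = update_posterior(post, cue_liks[cue], respond(cue))
    return post.index(max(post)), post
```

Here `respond` stands in for the recorded subject: a callable that takes a cue index and returns the index of the observed response. A cue on which all subjects respond identically has zero mutual information and is never preferred over a discriminative one.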

2506.01274 2026-06-12 cs.CV cs.AI 版本更新

ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

ReFoCUS: 用于上下文理解的强化引导帧优化

Hosu Lee, Junho Kim, Hyunjun Kim, Yong Man Ro

发表机构 * Korea Advanced Institute of Science & Technology(韩国科学技术院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出ReFoCUS框架,首次将在线策略梯度强化学习集成到视频大语言模型的帧级优化中,通过自回归和查询条件选择架构学习帧选择策略,无需显式帧级监督,提升视频问答推理准确性。

Comments Project page: https://interlive-team.github.io/ReFoCUS/

详情
AI中文摘要

近期大型多模态模型(LMMs)的进展实现了有效的视觉-语言推理,然而视频理解能力仍受限于次优的帧选择策略,尽管视频专用LMMs发展迅速。先前的工作尝试通过静态启发式或外部检索模块来提供帧级信息,但这些方法往往无法捕捉与给定用户查询相关的视觉线索,混淆了原始视觉动态与真正的语义相关性。在本文中,我们介绍了ReFoCUS(用于上下文理解的强化引导帧优化),这是首个将在线策略梯度强化学习集成到视频-LLMs帧级优化的框架。ReFoCUS旨在学习帧选择策略,利用来自参考模型的奖励信号来捕捉其对帧组合的潜在评分行为,即哪些帧组合最能支持具有时间定位的回答。为了高效探索巨大的组合帧空间,我们采用了一种自回归且查询条件的选择架构,确保上下文一致性的同时降低复杂度。我们的策略学习无需显式帧级监督,因为它隐式地发现了最优且语义一致的帧组合。ReFoCUS在多个视频问答基准测试中持续提高了推理准确性,证明了将帧选择与模型内部效用对齐的优势。

英文摘要

Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet video understanding remains constrained by suboptimal frame selection strategies despite the rapid development of video-specialized LMMs. Prior works attempted to address this with static heuristics or external retrieval modules that feed frame-level information, but these approaches often fail to capture visual cues grounded in the given user queries, conflating raw visual dynamics with true semantic relevance. In this paper, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), the first framework to integrate online policy-gradient reinforcement learning into frame-level optimization for video-LLMs. ReFoCUS aims to learn a frame selection policy, leveraging reward signals derived from reference models to capture their underlying scoring behavior over frame combinations that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive and query-conditional selection architecture that ensures contextual consistency while reducing complexity. Our policy learning removes the need for explicit frame-level supervision, as it implicitly discovers optimal and semantically consistent frame compositions. ReFoCUS consistently improves reasoning accuracy across multiple video QA benchmarks, demonstrating the advantage of aligning frame selection with model-internal utility.

2506.21855 2026-06-12 cs.CV 版本更新

Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation

Periodic-MAE: 用于rPPG估计的周期性视频掩码自编码器

Jiho Choi, Sang Jun Lee

发表机构 * Division of Electronics and Information Engineering, Jeonbuk National University, Republic of Korea(电子与信息工程系,全州国立大学)

AI总结 提出Periodic-MAE,一种自监督框架,通过周期性感知掩码和生理频带约束,从无标签面部视频学习可泛化的时空表示,提升远程光电容积描记法(rPPG)估计性能。

详情
AI中文摘要

在本文中,我们提出Periodic-MAE,一种自监督框架,用于从无标签面部视频中学习周期性生理信号的通用时空表示。该方法利用掩码自编码器(MAE),通过重建掩码视频令牌学习高维面部表示,而不依赖远程光电容积描记法(rPPG)特定监督。为了明确地将表示学习与rPPG特征对齐,我们引入了一种基于视频重采样的周期性感知帧掩码策略,使编码器能够学习捕获与脉搏信号估计相关的准周期性时间模式的表示。此外,生理频带约束被集成到MAE预训练框架中,利用脉搏信号在频域的稀疏性,引导学习到的表示朝向生理上有意义的模式。预训练后,学习到的表示被迁移到下游rPPG估计任务,其中编码器作为通用特征提取器,从面部视频中恢复脉搏相关信号。我们在四个基准数据集(包括PURE、UBFC-rPPG、MMPD和V4V)上进行了广泛实验。此外,我们在无约束光照条件和受试者运动下收集的真实世界rPPG数据集上评估了所提方法。实验结果表明,Periodic-MAE持续改善了rPPG估计性能,特别是在具有挑战性的跨数据集和真实世界评估场景中。我们的代码可在 https://github.com/ziiho08/Periodic-MAE 获取。

英文摘要

In this paper, we propose Periodic-MAE, a self-supervised framework for learning generalizable spatio-temporal representations of periodic physiological signals from unlabeled facial videos. The proposed method leverages a masked autoencoder (MAE), which learns high-dimensional facial representations by reconstructing masked video tokens without relying on remote photoplethysmography (rPPG) specific supervision. To explicitly align representation learning with the characteristics of rPPG, we introduce a periodicity-aware frame masking strategy based on video resampling, enabling the encoder to learn representations that capture quasi-periodic temporal patterns relevant to pulse signal estimation. In addition, physiological bandlimit constraints are integrated into the MAE pre-training framework, exploiting the sparsity of pulse signals in the frequency domain to guide the learned representations toward physiologically meaningful patterns. After pre-training, the learned representations are transferred to downstream rPPG estimation, where the encoder serves as a generic feature extractor for recovering pulse-related signals from facial videos. We conduct extensive experiments on four benchmark datasets, including PURE, UBFC-rPPG, MMPD, and V4V. Moreover, we evaluate the proposed approach on a real-world rPPG dataset collected under unconstrained lighting conditions and subject motion. Experimental results demonstrate that Periodic-MAE consistently improves rPPG estimation performance, particularly in challenging cross-dataset and real-world evaluation settings. Our code is available at https://github.com/ziiho08/Periodic-MAE.
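
The abstract mentions physiological bandlimit constraints that exploit the frequency-domain sparsity of pulse signals, without giving the loss. As an illustration of the general idea only, the sketch below measures the fraction of spectral power that falls outside an assumed heart-rate band. The 0.7 to 3.0 Hz band, the function names, and the use of a naive DFT are all assumptions made here and are not taken from the paper or its repository.

```python
import cmath
import math


def power_spectrum(signal, fs):
    """One-sided power spectrum of a mean-removed signal (naive DFT)."""
    n = len(signal)
    mean = sum(signal) / n
    centred = [x - mean for x in signal]
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(centred))
        freqs.append(k * fs / n)
        power.append(abs(coeff) ** 2)
    return freqs, power


def out_of_band_fraction(signal, fs, low_hz=0.7, high_hz=3.0):
    """Share of spectral power outside [low_hz, high_hz].

    A value near 0 means the signal is concentrated in the assumed
    pulse band; a penalty of this kind could serve as a bandlimit term.
    """
    freqs, power = power_spectrum(signal, fs)
    total = sum(power)
    if total == 0.0:
        return 0.0
    in_band = sum(p for f, p in zip(freqs, power) if low_hz <= f <= high_hz)
    return 1.0 - in_band / total
```

For a 10-second clip sampled at 30 frames per second, a clean 1.2 Hz sinusoid (72 beats per minute) yields a fraction close to 0, while a 5 Hz sinusoid yields a fraction close to 1. In a training setting one would use an FFT on tensors rather than this quadratic-time loop.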

2605.24488 2026-06-12 cs.CV cs.GR 版本更新

Appearance-Invariant Detection of Suggestive Motion via Laban Movement Descriptors

基于拉班运动描述子的暗示性运动外观不变检测

Jaehoon Ahn, Jeonghan Kong, Moon-Ryul Jung

发表机构 * Sogang University(西江大学)

AI总结 提出一种仅基于SMPL骨架轨迹和拉班运动分析描述子的运动分类流程,用于检测暗示性和露骨动作;在无泄漏评估协议下,基于61个LMA特征的逻辑回归达到68%的SFW/NSFW二分类准确率。

Comments 5 pages, 2 figures, 3 tables. Extended version of a poster accepted to SIGGRAPH 2026

详情
AI中文摘要

在线多人3D虚拟环境中的内容审核日益自动化,但检测工作主要集中于图像、视频和音频,暗示性运动仍是盲点。我们提出一种仅基于运动的分类流程,使用拉班运动分析(LMA)描述子从SMPL骨架轨迹中检测暗示性和露骨动作。在涵盖日常、艺术、暗示和露骨动作的数据集(17小时以上视频)上,基于61个LMA特征训练的逻辑回归在无泄漏评估协议下达到68%的SFW/NSFW二分类准确率(随机森林为70%)。在这一水平上,我们的描述子与一个学习型视频模型表现相当,该模型在重新渲染为无外观视频(无衣物、皮肤或场景的灰色人形)的同一运动数据上训练。每个关节轨迹的间接性(曲折度)以该关节的路径长度与净位移之比度量,在暗示层级达到峰值,表明拉班空间因素中从直接到间接(Direct-to-Indirect)的极性为从功能性运动到暗示性运动的转变提供了可解释的标志。总之,基于拉班的运动学描述子为暗示性运动检测提供了一种轻量、可解释的方法:每个决策都可分解为有名称、有理论依据的特征。由于分类器仅作用于姿态轨迹,审核可以直接在虚拟环境中的化身姿态上运行,无需任何外观数据。

英文摘要

Content moderation in online multiplayer 3D virtual environments is increasingly automated, yet detection has focused on images, video, and audio, leaving suggestive motion a blind spot. We present a motion-only classification pipeline that detects suggestive and explicit movement from SMPL skeleton trajectories using Laban Movement Analysis (LMA) descriptors. On a dataset spanning everyday, artistic, suggestive, and explicit movement (17+ hours of video), a logistic regression trained on 61-feature LMA descriptors reaches 68% binary SFW/NSFW accuracy (70% random forest) under a leak-free evaluation protocol. At this level, our descriptor performs comparably to a learned video model trained on the same motion re-rendered as appearance-free video, a gray figure with no clothing, skin, or scene. The indirectness (tortuosity) of each joint's trajectory, measured as the ratio of the joint's path length to its net displacement, peaks at the suggestive tier, showing that the Direct-to-Indirect polarity of Laban's Space factor provides an interpretable marker of the shift from functional to suggestive motion. Ultimately, Laban-based kinematic descriptors offer a lightweight, interpretable approach to suggestive-motion detection: every decision decomposes into named, theory-grounded features. Because the classifier operates on pose trajectories alone, moderation can run directly on avatar poses in virtual environments, with no appearance data.
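
The indirectness (tortuosity) measure defined in the abstract, the ratio of a joint's path length to its net displacement, is simple enough to write down directly. The sketch below assumes a trajectory is a list of 3D joint positions, one per frame. The function name and the `eps` guard for near-closed paths are additions made here for illustration.

```python
import math


def tortuosity(trajectory, eps=1e-8):
    """Path length divided by net displacement for one joint.

    A value of 1.0 is a perfectly direct (straight) movement; larger
    values indicate a more indirect path. `eps` avoids division by zero
    when the joint ends where it started.
    """
    path = sum(math.dist(a, b) for a, b in zip(trajectory, trajectory[1:]))
    net = math.dist(trajectory[0], trajectory[-1])
    return path / max(net, eps)
```

A straight reach such as `[(0, 0, 0), (1, 0, 0), (2, 0, 0)]` gives exactly 1.0, while an L-shaped path of two unit steps gives the square root of 2.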

2606.08436 2026-06-12 cs.CV 版本更新

CACR: Reinforcing Temporal Answer Grounding in Instructional Video via Candidate-Aware Causal Reasoning

CACR: 通过候选感知因果推理增强教学视频中的时间答案定位

Muge Qi, Rong Fu, Pengbin Feng, Xianda Li, Yu Cai, Yifu Guo, Shizhe Zhang, Simon James Fong, Lei Ma, Bin Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出候选感知因果推理框架,通过视觉-语言预训练候选选择和基于GRPO的时序逻辑推理,解决教学视频中复杂问题理解和长视频片段定位挑战,在六个基准上取得最优mIoU。

详情
AI中文摘要

教学视频中的时间答案定位任务旨在定位响应自然语言查询的精确视频片段,对于直接视频答案检索日益重要。由于需要理解语义复杂的问题并解决未修剪视频与短目标时刻之间的显著长度不匹配,该任务仍然具有挑战性。现有方法通常对无关内容敏感或视觉推理能力不足。为了解决这些局限性,我们提出了候选感知因果推理框架。我们的方法首先采用基于视觉-语言预训练的候选选择算法高效生成K个候选片段,然后应用由拒绝奖励机制增强并通过组相对策略优化优化的时序逻辑推理模块进行稳健推理。在六个基准上的大量实验表明,我们的方法在平均交并比方面达到了最先进的性能,为长视频中基于推理的检索提供了新视角。

英文摘要

The task of temporal answer grounding in instructional video (TAGV), which aims to locate precise video segments that respond to natural language queries, is increasingly important for direct video answer retrieval. This task remains challenging due to the need to comprehend semantically complex questions and to address the significant length mismatch between untrimmed videos and short target moments. Existing methods often suffer from sensitivity to irrelevant content or insufficient visual reasoning capabilities. To tackle these limitations, we propose a Candidate-Aware Causal Reasoning (CACR) framework. Our approach first employs a Visual-Language Pre-training based Candidate Selection (VBCS) algorithm to efficiently generate K candidate segments, then applies a temporal logic reasoning module enhanced by a rejection reward mechanism and optimized via Group Relative Policy Optimization (GRPO) for robust inference. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art performance in terms of mean Intersection-over-Union (mIoU), providing a new perspective for reasoning-based retrieval in long videos.
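
The evaluation metric named above, mean Intersection-over-Union over temporal segments, can be computed as in the following sketch. It assumes each segment is a `(start, end)` pair in seconds with `start <= end`, and that predictions and ground truths are paired one to one. The function names are illustrative.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) pairs."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    return sum(temporal_iou(p, g) for p, g in zip(preds, gts)) / len(preds)
```

For example, a prediction of `(0, 10)` against a ground truth of `(5, 15)` overlaps for 5 seconds out of a 15-second union, giving an IoU of one third.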

6. 生成式视觉与世界模型 19 篇

2606.12562 2026-06-12 cs.CV cs.GR 新提交

HairPort: In-context 3D-aware Hair Import and Transfer for Images

HairPort: 上下文感知的3D发型导入与迁移

Alireza Heidari, Amirhossein Alimohammadi, Wallace Michel Pinto Lira, Adi Bar-Lev, Ali Mahdavi-Amiri

发表机构 * Simon Fraser University(西蒙菲莎大学) Huawei Canada(华为加拿大)

AI总结 提出HairPort框架,通过显式分离发型移除与迁移,并利用3D感知管道实现大姿态差异下的发型迁移,结合LoRA适配的秃头转换器和条件流匹配生成器,实现高质量、身份保持的发型迁移。

Comments Accepted to SIGGRAPH 2026 (Conference Papers Track). 23 pages, 15 figures, 10 tables, including supplementary material as appendices. Project page: https://deepmancer.github.io/HairPort/

详情
AI中文摘要

在图像之间迁移发型是计算机图形学、计算机视觉和视觉效果中一个重要但具有挑战性的任务。它使用户能够在无需实际改变发型的情况下探索新造型,应用于虚拟试穿系统、增强现实和娱乐等领域。大多数先前的方法在姿态差异较小时表现最佳,但在视角和尺度差异较大时效果不佳,此时缺失的发型内容必须合成而非迁移。我们提出HairPort,一个3D感知的发型迁移框架,通过显式分离发型移除与迁移,并在合成前强制几何一致性来解决这些问题。我们引入了一个秃头转换器,通过基于LoRA的上下文适配FLUX.1 Kontext生成逼真的秃头人脸版本。为了训练我们的秃头转换器,我们引入了一个新数据集Baldy,包含6000对在不同身份和条件下的秃头和原始图像。我们还使用了一个3D感知迁移管道,在将参考发型合成到源图像之前,从目标视角重建并重新渲染该发型。由于具有3D感知能力,我们的方法支持源和目标之间的大姿态和尺度差异。最后,一个条件流匹配生成器从秃头源和几何对齐的参考引导中合成迁移结果。综合来看,我们的方法实现了准确、姿态一致且身份保持的发型迁移,在定性和定量上均优于现有方法。

英文摘要

Transferring hairstyles between images is an important but challenging task in computer graphics, computer vision, and visual effects. It enables users to explore new looks without physically altering their hair, with applications in virtual try-on systems, augmented reality, and entertainment. Most prior works operate best under small pose gaps, and they fall short under large viewpoint and scale differences, where missing hair content must be synthesized rather than transferred. We propose HairPort, a 3D-aware hairstyle transfer framework that attempts to solve these issues by explicitly separating hair removal from transfer and enforcing geometric consistency before synthesis. We introduce a Bald Converter, which produces realistic bald versions of faces through LoRA-based in-context adaptation of FLUX.1 Kontext. To train our Bald Converter, we introduce a new dataset, Baldy, containing 6,000 paired bald and original images across diverse identities and conditions. We also use a 3D-Aware Transfer Pipeline that reconstructs and re-renders the reference hairstyle from the target viewpoint before compositing it onto the source image. Being 3D aware, our method supports large pose and scale discrepancies between the source and target. Finally, a conditional flow-matching generator synthesizes the transferred result from the bald source and geometry-aligned reference guidance. Together, our method enables accurate, pose-consistent, and identity-preserving hairstyle transfer, outperforming existing methods both qualitatively and quantitatively.

2606.12575 2026-06-12 cs.CV 新提交

High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation

高保真两步图像生成:通过教师对齐的端到端蒸馏

Dongyang Liu, Ruoyi Du, David Liu, Dengyang Jiang, Liangchen Li, Qilong Wu, Zhen Li, Steven C. H. Hoi, Hongsheng Li, Peng Gao

发表机构 * Z-Image Team, Alibaba Group(阿里巴巴集团Z-Image团队) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出Z-Image Turbo++,通过分布对齐对抗学习、步解耦参数化和迭代正则化端到端训练,将8步教师模型蒸馏为2步生成模型,显著缩小质量差距。

详情
AI中文摘要

少步扩散蒸馏在4-8步生成中已日趋成熟,但进一步推进到2步仍具挑战。本文介绍Z-Image Turbo++,一种从8步Z-Image Turbo教师模型蒸馏得到的高质量2步图像生成模型。我们的方法通过三个针对该场景简单而有效的设计选择,解决了2步生成中任务难度增加和模型容量有限的核心瓶颈。首先,我们提出分布对齐对抗学习,使用教师生成的图像而非外部真实图像作为GAN训练的真实样本,提供更易实现且信息量更大的对抗目标。其次,我们采用步解耦参数化,为两个去噪步骤分配独立的模型参数,以更好地匹配它们不同的容量需求。第三,我们执行带迭代正则化的端到端训练,使第一步能够接收来自最终图像质量的梯度,同时通过显式的步1损失保留有意义的中间生成。这些设计共同在定性和定量评估中显著缩小了2步与8步生成之间的质量差距,凸显了精心定制的蒸馏策略在改善少步生成中质量-效率权衡方面的潜力。

英文摘要

Few-step diffusion distillation has become increasingly mature for 4-8-step generation, yet pushing further to 2 steps remains challenging. In this work, we introduce Z-Image Turbo++, a high-quality 2-step image generation model distilled from the 8-step Z-Image Turbo teacher. Our method addresses the central bottlenecks of increased task difficulty and limited model capacity in 2-step generation through three simple but effective design choices tailored to this regime. First, we propose Distribution-Aligned Adversarial Learning, which uses teacher-generated images rather than external real images as real samples for GAN training, providing a more attainable and informative adversarial target. Second, we adopt Step-Decoupled Parameterization, assigning independent model parameters to the two denoising steps to better match their distinct capacity demands. Third, we perform End-to-End Training with Iterative Regularization, allowing the first step to receive gradients from final image quality while preserving a meaningful intermediate generation through an explicit step-1 loss. Together, these designs substantially narrow the quality gap between 2-step and 8-step generation in both qualitative and quantitative evaluations, highlighting the potential of carefully tailored distillation strategies for improving the quality-efficiency trade-off in few-step generation.

2606.13035 2026-06-12 cs.CV cs.AI 新提交

TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment

TetherCache: 基于门控召回与可信对齐的自回归长视频生成稳定性方法

Yu Meng, Xiangyang Luo, Letian Li, Wenyuan Jiang, Chen Gao, Xinlei Chen, Yong Li, Xiao-Ping Zhang

发表机构 * Tsinghua University(清华大学) D-INFK, ETH Zürich(苏黎世联邦理工学院计算机科学系)

AI总结 提出TetherCache,一种无需训练、即插即用的缓存管理策略,通过门控召回(GRAB)和可信对齐编辑(TAME)缓解自回归视频扩散模型中的上下文漂移,实现稳定长视频生成。

Comments 17 pages, 8 figures

详情
AI中文摘要

自回归视频扩散模型通过将新生成帧的条件建立在先前生成内容上,为流式变长视频生成提供了自然框架。然而,将这些模型扩展到分钟级生成仍具挑战:有限的KV缓存预算使模型无法保留完整历史,而反复以自生成帧为条件会导致上下文分布偏移随时间累积,引发视觉伪影、质量下降和时间漂移。本文提出TetherCache,一种无需训练、即插即用的缓存管理策略,用于抗漂移长视频生成。TetherCache将缓存组织为sink、memory和recent区域,并引入两种互补机制。首先,GRAB(基于注意力多样性平衡的门控召回)使用结合注意力相关性与时间多样性的门控分数选择长程记忆帧,在固定缓存预算下保留信息丰富且多样化的历史上下文。其次,TAME(通过记忆编辑的可信对齐)通过将新召回的记忆令牌的统计量对齐到可信上下文分布来对其进行轻量编辑,减少漂移历史特征造成的污染。基于Self-Forcing,TetherCache在VBench-Long的30秒、60秒和240秒设置上持续提升长视频生成质量。特别地,在240秒生成中,它显著提高了整体和语义分数,同时将质量漂移从7.84降至1.33,证明了其在稳定长程自回归视频扩散中的有效性。

英文摘要

Autoregressive video diffusion models provide a natural formulation for streaming and variable-length video generation by conditioning newly generated frames on previously generated content. However, extending these models to minute-level generation remains challenging: the limited KV-cache budget prevents the model from retaining the full history, while repeatedly conditioning on self-generated frames induces a context distribution shift that accumulates over time, leading to visual artifacts, quality degradation, and temporal drift. In this paper, we propose TetherCache, a training-free and plug-and-play cache management strategy for drift-resistant long video generation. TetherCache organizes the cache into sink, memory, and recent regions, and introduces two complementary mechanisms. First, GRAB (Gated Recall with Attention-Diversity Balancing) selects long-range memory frames using a gated score that combines attention-based relevance with temporal diversity, preserving informative yet diverse historical context under a fixed cache budget. Second, TAME (Trusted Alignment via Memory Editing) lightly edits newly recalled memory tokens by aligning their statistics to a trusted context distribution, reducing the pollution caused by drifted historical features. Built on Self-Forcing, TetherCache consistently improves long-video generation quality on VBench-Long across 30s, 60s, and 240s settings. In particular, for 240s generation, it substantially improves overall and semantic scores while reducing quality drift from 7.84 to 1.33, demonstrating its effectiveness for stable long-horizon autoregressive video diffusion.
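
The abstract says TAME aligns the statistics of recalled memory tokens to a trusted context distribution but does not say which statistics or how. One common way to do this kind of alignment is per-channel mean and standard deviation matching, and the sketch below shows that technique as an assumption, not as the paper's actual operator. Tokens are plain lists of channel values, and `strength` is a hypothetical blend factor standing in for "lightly edits".

```python
import math


def channel_stats(tokens):
    """Per-channel mean and (population) standard deviation."""
    n, d = len(tokens), len(tokens[0])
    means = [sum(t[c] for t in tokens) / n for c in range(d)]
    stds = [math.sqrt(sum((t[c] - means[c]) ** 2 for t in tokens) / n)
            for c in range(d)]
    return means, stds


def align_statistics(recalled, trusted, strength=1.0, eps=1e-6):
    """Shift and rescale recalled tokens toward the trusted statistics.

    strength=0 leaves the tokens untouched; strength=1 matches the
    trusted per-channel mean and (approximately) standard deviation.
    """
    r_mean, r_std = channel_stats(recalled)
    t_mean, t_std = channel_stats(trusted)
    aligned_tokens = []
    for token in recalled:
        new_token = []
        for c, x in enumerate(token):
            aligned = (x - r_mean[c]) / (r_std[c] + eps) * t_std[c] + t_mean[c]
            new_token.append((1.0 - strength) * x + strength * aligned)
        aligned_tokens.append(new_token)
    return aligned_tokens
```

Because the operation is a closed-form rescaling, it needs no training, which is at least consistent with the training-free, plug-and-play framing in the abstract.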

2606.13041 2026-06-12 cs.CV cs.GR cs.MM 新提交

SeamEdit: A Black-Box VLM-Agnostic Pipeline for Large-Image Semantic Editing

SeamEdit: 一种用于大图像语义编辑的黑盒VLM无关流水线

Xiangyu Lyu, Dan Lei

发表机构 * Technische Universität Darmstadt(达姆施塔特工业大学) Fine-Arts Educator, Yuncheng Middle School(运城中学美术教师)

AI总结 提出SeamEdit,一种无需训练、模型无关的流水线,通过五阶段后处理解决大图像分块编辑中的语义变形、对齐漂移和接缝伪影问题,实现高质量语义编辑。

Comments 19 pages, 9 figures, 2 tables

详情
AI中文摘要

大图像的语义区域编辑必须同时满足两个要求:高生成质量和与周围内容的自然融合。一些相关方法依赖于白盒模型,而忽略了闭源模型的强大生成能力。然而,直接将闭源模型应用于分块编辑会引入几种失败模式:语义变形、画布级对齐漂移和可见接缝伪影。本文提出SeamEdit,一种无需训练且模型无关的流水线,将任何具有修补能力的VLM视为黑盒预言机。SeamEdit通过五阶段后处理流水线缓解这些问题:基于覆盖的分块分解、黑盒VLM修补、几何和颜色一致性校正、基于接缝风险的多候选排序以及动态规划曲线接缝融合。该流水线降低了接缝可见性,并支持任意分块区域的语义修改。

英文摘要

Semantic region editing for large images must satisfy two requirements at the same time: high generative quality and natural integration with surrounding content. Some related methods rely on white-box models and leave the strong generation capability of closed-source models underexplored. Directly applying closed-source models to tiled editing, however, introduces several failure modes: semantic deformation, canvas-level alignment drift, and visible seam artifacts. This paper presents SeamEdit, a training-free and model-agnostic pipeline that treats any VLM with inpainting capability as a black-box oracle. SeamEdit mitigates these issues through a five-stage post-hoc pipeline: overlay-based tile decomposition, black-box VLM inpainting, geometric and color-consistency correction, seam-risk-based multi-candidate ranking, and dynamic-programming curved seam fusion. The pipeline reduces seam visibility and supports semantic modification of arbitrary tile regions.
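
The last stage listed above is a dynamic-programming curved seam fusion. The abstract gives no details, so the sketch below shows only the classic minimum-cost vertical seam search (as used in seam carving and image quilting) over a hypothetical per-pixel cost map for the overlap between two tiles, for instance their absolute color difference. The function name, the cost definition, and the one-column-per-row movement rule are assumptions.

```python
def min_cost_seam(cost):
    """Return one column index per row tracing the cheapest vertical seam.

    cost[y][x] is the penalty for cutting through pixel (y, x) of the
    overlap region. The seam may shift by at most one column per row.
    """
    rows, cols = len(cost), len(cost[0])
    # Forward pass: acc[y][x] = cheapest total cost of a seam ending at (y, x).
    acc = [list(cost[0])]
    for y in range(1, rows):
        prev = acc[-1]
        acc.append([cost[y][x] + min(prev[max(0, x - 1):min(cols, x + 2)])
                    for x in range(cols)])
    # Backward pass: start at the cheapest bottom pixel and trace upward.
    x = min(range(cols), key=lambda i: acc[-1][i])
    seam = [x]
    for y in range(rows - 1, 0, -1):
        lo, hi = max(0, x - 1), min(cols, x + 2)
        x = min(range(lo, hi), key=lambda i: acc[y - 1][i])
        seam.append(x)
    seam.reverse()
    return seam
```

Pixels left of the seam would then be taken from one tile and pixels right of it from the other, so the cut follows low-difference regions instead of a straight tile boundary.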

2606.13303 2026-06-12 cs.CV 新提交

DuET: Dual Expert Trajectories for Diffusion Image Editing

DuET: 双专家轨迹用于扩散图像编辑

Lidia Troeshestova, Alexander Ustyuzhanin, Sergey Kastryulin

发表机构 * HSE University(高等经济大学) Yandex

AI总结 提出训练自由的DuET方法,通过临时切换到文本到图像阶段再返回编辑模式,缓解源图像条件限制,提升编辑指令相关性、语义保真度和感知质量。

详情
AI中文摘要

最近的扩散编辑器在每一步去噪过程中以源图像为条件执行多样化的基于指令的编辑。然而,持续以源图像为条件可能会限制编辑的执行程度和结果的自然性,尤其是当目标场景与输入差异较大时。我们提出了DuET(双专家轨迹),一种无需训练的推理方法,通过过渡到文本到图像阶段再返回编辑模式,暂时放松源图像条件,使得去噪轨迹能够向目标分布移动,同时保留图像条件编辑的结构优势。在不修改模型权重或增加采样成本的情况下,DuET在多种模型和基准上持续改善了指令相关性、语义保真度和感知质量。在某些情况下,这些改进伴随着源图像保留的适度降低,揭示了源保留与编辑保真度之间可预测的权衡。

英文摘要

Recent diffusion editors perform diverse instruction-based edits while conditioning on the source image at every denoising step. Yet persistent source-image conditioning can limit how fully an edit is executed and how natural the result appears, especially when the target scene diverges substantially from the input. We introduce DuET (Dual Expert Trajectories), a training-free inference method that temporarily relaxes source-image conditioning by transitioning through a text-to-image phase before returning to edit mode, allowing the denoising trajectory to move toward the target distribution while retaining the structural benefits of image-conditioned editing. Without modifying model weights or increasing sampling cost, DuET consistently improves instruction relevance, semantic fidelity, and perceptual quality across diverse models and benchmarks. In some cases, these gains come with a modest reduction in source-image preservation, revealing a predictable trade-off between source preservation and edit fidelity.

2606.13304 2026-06-12 cs.CV 新提交

ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance

ReFree: 通过无奖励强化学习和多级语音引导实现逼真的共语音视频生成

Salaheldin Mohamed, M. Hamza Mughal, Rishabh Dabral, Christian Theobalt

发表机构 * Télécom Paris, Institut Polytechnique de Paris(巴黎高等电信学院,巴黎综合理工学院) Max Planck Institute for Informatics(马克斯·普朗克信息学研究所)

AI总结 提出ReFree-S2V框架,利用流匹配和预训练视频生成模型,通过多级语音表示和可学习选择器实现精细唇同步与自然表情,并引入无奖励强化学习生成自然头部运动,在唇同步准确性和自然度上达到最优。

详情
AI中文摘要

语音驱动的说话角色动画旨在生成逼真的肖像视频,传达自然的对话行为,使面部运动与语音音频对齐。尽管视频生成的最新进展显著提高了基于视频的动画的真实感,但实现准确的唇部发音和富有表现力的行为仍然具有挑战性。现有方法通常在精确的音素到唇同步与动态面部表情和头部运动之间进行权衡,产生要么准确但僵硬,要么富有表现力但同步性差的动画。我们通过提出ReFree-S2V来解决这一挑战,这是一个流匹配语音到肖像动画框架,基于预训练的视频生成模型,在语音驱动的肖像动画中实现细粒度的语音发音和高层次的表现力线索。该模型引入了一种多级语音表示,在局部和全局粒度上捕捉语音和韵律信息。这些表示通过可学习的级别选择器选择性地注入到Transformer块中,从而实现准确的唇同步和自然的表达性运动。为了实现自然的头部运动,我们进一步在流匹配训练中引入了一种新颖的无奖励强化学习方案,在不依赖手工制作的同步指标或奖励模型以及人类偏好标注的高成本的情况下,抑制感知上不合理的运动。大量实验表明,ReFree-S2V实现了最先进的性能,在定量唇同步准确性和定性人类评估的自然度和表现力方面显著优于现有方法。

英文摘要

Speech-driven talking character animation seeks to generate life-like portrait videos that convey natural conversation behavior, aligning facial motion with spoken audio. Although recent advances in video generation have substantially improved realism in video-based animation, achieving both accurate lip articulation and expressive behavior remains challenging. Existing approaches typically trade off precise phoneme-to-lip synchronization against dynamic facial expressions and head motion, yielding animations that are either accurate yet rigid, or expressive but poorly synchronized. We address this challenge by proposing ReFree-S2V, a flow-matching speech-to-portrait animation framework that builds upon a pretrained video generation model to achieve fine-grained speech articulation and high-level expressive cues in speech-driven portrait animation. This model introduces a multi-level speech representation capturing phonetic and prosodic information at both local and global granularities. These representations are selectively injected into transformer blocks via learnable level selectors, enabling both accurate lip synchronization and natural expressive motion. To achieve natural head movements, we further introduce a novel reward-free reinforcement learning scheme into flow-matching training to discourage perceptually implausible motion without relying on handcrafted synchronization metrics or reward models, or the high cost of human preference annotation. Extensive experiments demonstrate that ReFree-S2V achieves state-of-the-art performance, significantly outperforming existing methods in both quantitative lip-sync accuracy and qualitative human evaluations of naturalness and expressivity.

2606.13312 2026-06-12 cs.CV cs.GR 新提交

MagPlus: Bridging Micro-to-Regular Facial Expressions through Learnable Magnification

MagPlus: 通过可学习放大桥接微表情到常规表情

Sliman Jammal, Andrei Sharf

发表机构 * Ben-Gurion University of the Negev(内盖夫本-古里安大学)

AI总结 提出MagPlus管道,通过可学习放大将微表情运动映射到常规表情范围,再利用标准表情模型处理,最后用DeMagPlus恢复强度,无需重新训练即可生成逼真微表情。

详情
AI中文摘要

面部微表情是短暂而细微的面部运动,为真实人类情感提供重要线索。然而,由于标注的微表情数据有限且底层面部运动极其微弱,建模和生成微表情仍然困难。现有的微表情生成方法因此常面临质量有限、鲁棒性弱和泛化能力差的问题。我们提出MagPlus,一个可迁移的微表情处理管道,将微表情分析与标准面部动画模型连接起来。MagPlus不是从头训练专用生成器,而是学习将细微面部运动放大到常规表情范围,将微表情转换为与现有面部表情处理模型兼容的信号。放大后的序列随后被标准面部表情模型用于迁移和合成等任务。互补的DeMagPlus模块将生成的运动恢复为逼真的微表情强度水平,同时保留合成的动态。我们使用四个面部动画模型评估该框架:FOMM、FSRT、MetaPortrait和EmoPortraits。这些模型均未在微表情数据上训练。实验表明,MagPlus-DeMagPlus使预训练的宏表情模型能够生成更逼真的微表情运动,而无需重新训练主干网络。

英文摘要

Facial micro-expressions are subtle and short-lived facial movements that provide important cues about genuine human emotions. However, modeling and generating them remains difficult because annotated micro-expression data is limited and the underlying facial motions are extremely weak. Existing micro-expression generation methods therefore often suffer from limited quality, weak robustness, and poor generalization. We propose MagPlus, a transferable micro-expression processing pipeline that connects micro-expression analysis with standard facial animation models. Instead of training a dedicated generator from scratch, MagPlus learns to magnify subtle facial motions into the range of regular facial expressions, transforming micro-expressions into signals that are compatible with existing facial expression processing models. The magnified sequence is then used by a standard facial expression model for tasks such as transfer and synthesis. A complementary DeMagPlus module then restores the generated motion back to realistic micro-expression intensity levels while preserving the synthesized dynamics. We evaluate the framework using four facial animation models: FOMM, FSRT, MetaPortrait, and EmoPortraits. None of these models are trained on micro-expression data. Experiments show that MagPlus-DeMagPlus enables pretrained macro-expression models to generate more realistic micro-expression motion without retraining the backbones.

2606.13382 2026-06-12 cs.CV cs.AI 新提交

SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation

SmartFont: 少样本字体生成的动态条件分配

Zian Yang, Zixin Wang

发表机构 * Fudan University(复旦大学)

AI总结 提出SmartFont扩散框架,通过全局内容-风格生成与弱监督局部校正专家结合,并引入去噪状态条件分配模块动态加权全局与局部特征,实现少样本字体生成的全局完整性与局部细节保真度平衡。

详情
AI中文摘要

少样本字体生成同时需要全局结构完整性和细粒度局部风格保真度。现有方法通常要么依赖全局内容-风格建模(鲁棒但解耦不完美),要么强调组件/局部建模(捕捉细节但严重依赖局部先验和参考覆盖)。我们认为关键挑战不仅在于学习更纯净的条件,而在于通过生成过程中的多级分配来组织互补但有偏的全局和局部条件。为此,我们提出SmartFont,一个基于扩散的少样本字体生成框架,结合全局内容-风格生成与弱监督局部校正专家。局部分支通过弱组件监督学习专家级局部概念和语义有意义的空间图,实现无需显式组件条件推理的细粒度校正。在此基础上,去噪状态条件分配模块在时间步和注入块上自适应地加权全局内容、全局风格和局部校正特征。大量实验表明,SmartFont实现了更好的全局-局部平衡,提高了字形质量和局部细节保真度。

英文摘要

Few-shot font generation simultaneously requires global structural completeness and fine-grained local style fidelity. Existing methods usually either rely on global content-style modeling, which is robust but imperfectly disentangled, or emphasize component/local modeling, which captures fine details but relies heavily on local priors and reference coverage. We argue that the key challenge is not merely to learn purer conditions, but to organize complementary yet biased global and local conditions through multi-level allocation during generation. To this end, we propose SmartFont, a diffusion-based few-shot font generation framework that combines global content-style generation with weakly supervised local corrective experts. The local branch performs semantic-spatial allocation by learning expert-wise local concepts and semantically meaningful spatial maps under weak component supervision, enabling fine-grained correction without requiring explicit component-conditioned inference. On top of this, a denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Extensive experiments show that SmartFont achieves a better global-local balance and improves glyph quality and local detail fidelity.

2606.13432 2026-06-12 cs.CV cs.AI 新提交

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

OmniDirector: 无需交叉配对数据的通用多镜头相机克隆

Jiwen Liu, Shujuan Li, Zhixue Fang, Xiaohan Li, Yan Zhou, Zijie Meng, Zhimin Zhang, Yawen Luo, Guoxin Zhang, Yu-Shen Liu, Pengfei Wan

发表机构 * Kuaishou Technology(快手科技) Tsinghua University(清华大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出OmniDirector框架,通过将相机参数编码为网格运动视频,并利用百万级相机网格-视频对训练,实现无需交叉配对数据的多镜头相机运动克隆,具备卓越的控制性能。

Comments 12 pages, 8 figures

详情
AI中文摘要

从参考视频中克隆相机运动是视频生成中的一项重要任务,因为视频提供了直观且精确的控制。现有方法要么直接使用无法处理多镜头生成的参数化表示,要么合成交叉配对数据,但受限于数据稀缺性,导致在复杂相机运动克隆中表现不佳。为解决这些问题,我们引入了一种通用的相机运动表示,将相机编码为网格运动视频。该相机网格以视觉方式表示相机参数,并支持集成多样化的轨迹以进行多镜头视频生成。基于此,我们提出了OmniDirector,一个在百万级相机网格-视频对上训练的统一框架,该框架协调角色、动作和相机,为多模态扩散变换器提供导演级别的控制。此外,我们设计了一种新颖的分层提示扩展代理,通过理解信号关系系统地描述相机运动和视觉内容,从而和谐地整合不同的控制信号。大量实验证明了我们框架的卓越性能和出色的可控性。项目页面:https://ymlinfeng.github.io/OmniDirector.github.io/

英文摘要

Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations that fail to handle multi-shot generation or synthesize cross-paired data, which suffers from data scarcity, resulting in poor performance in complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: https://ymlinfeng.github.io/OmniDirector.github.io/

2606.13558 2026-06-12 cs.CV cs.CL 新提交

Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models

编辑比特,差异编码:面向视觉自回归模型的逐比特残差编辑

Shengqiang Zhang, Ruotong Liao, Volker Tresp, Barbara Plank, Hinrich Schütze

发表机构 * LMU Munich & Munich Center for Machine Learning (MCML)(慕尼黑大学 & 慕尼黑机器学习中心 (MCML))

AI总结 提出BitResEdit,一种无需训练的视觉自回归图像编辑方法,通过比特级源负引导和残差编码注入,在保持背景的同时实现强文本对齐。

详情
AI中文摘要

基于文本引导的图像编辑与视觉自回归(VAR)生成器需要控制模型采样的内容以及将采样变化写回图像代码的位置。现有的VAR编辑器主要操作于令牌流、特征或扁平的下一个令牌对数几率,忽略了逐比特残差VAR模型的两个原生结构:逐比特伯努利预测头和图像组装所用的加性多尺度残差代码域。我们提出BitResEdit,一种针对逐比特残差VAR生成器(如Infinity)的无训练编辑器。BitEdit通过沿共享编辑前缀上计算的源-目标对比倾斜后CFG的逐比特对数几率,执行源负引导,然后将每个更新投影到干净CFG采样器周围的闭式伯努利-KL信任域中。ResEdit将采样的比特转换为每尺度连续代码残差,用定位掩码对其进行门控,并通过生成器的原生尺度求和重新注入。它们共同将决策时的比特引导与组合时的代码组合耦合,使得被掩码的潜在特征通过代码算术精确保留,同时在目标区域内应用局部化的尺度感知编辑。在PIE-Bench上使用Infinity-2B,BitResEdit在相同骨干的VAR编辑器中实现了最强的文本对齐,在编辑区域上的CLIP比最强先前的编辑器提高了+1.07,同时背景保持与其相当。消融实验表明BitEdit和ResEdit在目标对齐和背景保持中发挥互补作用。

英文摘要

Text-guided image editing with visual autoregressive (VAR) generators requires controlling both what the model samples and where the sampled change is written back into the image code. Existing VAR editors mainly operate on token streams, features, or flat next-token logits, leaving two native structures of bitwise-residual VAR models underused: the per-bit Bernoulli prediction head and the additive multi-scale residual code field from which the image is assembled. We propose BitResEdit, a training-free editor for bitwise-residual VAR generators such as Infinity. BitEdit performs source-negative guidance by tilting the post-CFG per-bit log-odds along a source--target contrast computed on a shared edited prefix, then projects each update into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit converts the sampled bits into per-scale continuous-code residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. Together they couple decision-time bit guidance with combination-time code composition, so masked-out latent features are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving CLIP on the edited region by +1.07 over the strongest prior editor while keeping background preservation competitive with it. Ablations show BitEdit and ResEdit play complementary roles in target alignment and background preservation.

2606.13676 2026-06-12 cs.CV 新提交

Modality Forcing for Scalable Spatial Generation

模态强制实现可扩展的空间生成

Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski, Justin Johnson, Keunhong Park

发表机构 * Carnegie Mellon University(卡内基梅隆大学) World Labs

AI总结 提出Modality Forcing方法,通过为每个模态分配独立噪声水平,实现单DiT的联合图像-深度生成,利用稀疏深度数据训练,继承T2I预训练的可扩展性,在深度估计上取得竞争性能。

详情
AI中文摘要

文本到图像(T2I)模型包含丰富的空间先验。合成逼真、杂乱的场景需要理解几何,包括透视和相对尺度。先前的工作通过调整T2I模型利用这一先验进行深度预测,但需要密集深度数据并涉及复杂的方案。我们提出Modality Forcing,一种简单、可扩展的后训练方案,使用在稀疏深度数据上训练的单个DiT进行联合图像-深度生成。Modality Forcing通过为每个模态分配独立的噪声水平,允许以任意排列进行图像和深度的条件生成和联合生成。每个模态的解码器使我们能够在稀疏的真实世界深度上训练,并实现强大的、可泛化的深度预测。我们进一步表明,Modality Forcing继承了T2I预训练的可扩展性:通过从头训练一组T2I模型(370M到3.3B参数),我们发现更大的模型在更多图像数据上训练产生更准确的深度。我们的最强模型与最先进的单目深度估计器竞争,并将现有联合图像-深度生成模型的AbsRel降低了57%。这些结果提供了强有力的证据,表明图像生成是空间感知的可扩展预训练目标。

英文摘要

Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception. https://modality-forcing.github.io/
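The idea of assigning separate noise levels per modality can be illustrated with a small sketch. This is not the authors' implementation: the linear blend between clean data and Gaussian noise, the uniform sampling of noise levels, and the function names `add_noise`, `training_inputs`, and `depth_prediction_inputs` are all assumptions made for illustration. It shows only how independent noise levels let one model cover joint generation (both modalities noisy) and conditional generation (one modality kept clean).

```python
import random


def add_noise(clean, t, rng):
    """Blend clean values with Gaussian noise; t=0 is clean, t=1 is pure noise."""
    return [(1.0 - t) * v + t * rng.gauss(0.0, 1.0) for v in clean]


def training_inputs(image, depth, rng):
    """Draw one noise level per modality, independently of each other (assumed uniform)."""
    t_image, t_depth = rng.random(), rng.random()
    return {
        "image": add_noise(image, t_image, rng),
        "depth": add_noise(depth, t_depth, rng),
        "t_image": t_image,
        "t_depth": t_depth,
    }


def depth_prediction_inputs(image, rng):
    """Condition on a clean image (t=0) and start the depth from pure noise (t=1)."""
    return {
        "image": add_noise(image, 0.0, rng),
        "depth": add_noise([0.0] * len(image), 1.0, rng),
        "t_image": 0.0,
        "t_depth": 1.0,
    }


rng = random.Random(0)
image = [0.2, 0.5, 0.9]  # hypothetical stand-in for image latents
depth = [1.0, 2.0, 4.0]  # hypothetical stand-in for depth latents
print(training_inputs(image, depth, rng))
print(depth_prediction_inputs(image, rng))
```

Swapping the roles (clean depth, noisy image) would give depth-conditioned image generation under the same assumptions.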

2606.12858 2026-06-12 cs.IT cs.AI cs.CV math.IT 交叉投稿

JSCGC: Joint Source-Channel-Generation Coding for Wireless Generative Communications

JSCGC:面向无线生成式通信的联合源信道生成编码

Tong Wu, Zhiyong Chen, Guo Lu, Li Song, Feng Yang, Meixia Tao, Wenjun Zhang

发表机构 * Cooperative Medianet Innovation Center, the School of Information Science and Electronic Engineering, Shanghai Jiao Tong University(上海交通大学信息科学与电子工程学院未来媒体网络协同创新中心)

AI总结 提出联合源信道生成编码(JSCGC),用生成模型替换传统解码器,将通信重构问题转化为受感知约束下的受控生成问题,通过联合训练和随机采样框架最大化互信息,在潜空间图像传输中提升特征、语义和分布质量。

Comments submitted to IEEE Journal

详情
AI中文摘要

传统通信系统,包括基于分离的编码和基于学习的联合源信道编码(JSCC),通常是在香农率失真理论下设计的。然而,依赖通用失真度量无法捕捉复杂的人类视觉感知,常常导致模糊或不真实的复原。在本文中,我们提出联合源信道生成编码(JSCGC),一种生成式通信范式,用接收端的生成模型替换传统解码器。接收信号被视为一个条件,控制采样过程进入学习到的条件分布,将通信从用于失真最小化的确定性重构重新表述为在感知约束下用于互信息最大化的受控生成。基于这一表述,我们开发了一个统一的联合训练和高效随机采样框架,并提供了其在学习和推理阶段有效性的理论分析。在潜空间图像传输上的大量实验表明,JSCGC在不同信道条件下持续改善基于特征、语义层面和分布的质量,同时表现出一种以语义不一致而非失真为特征的独特错误行为。

英文摘要

Conventional communication systems, including both separation-based coding and learning-based joint source-channel coding (JSCC), are typically designed under Shannon's rate-distortion theory. However, relying on generic distortion metrics fails to capture complex human visual perception, often resulting in blurred or unrealistic reconstructions. In this paper, we propose Joint Source-Channel-Generation Coding (JSCGC), a generative communication paradigm that replaces the conventional decoder with a generative model at the receiver. The received signal is treated as a condition that controls the sampling process into the learned conditional distribution, reformulating communication from deterministic reconstruction for distortion minimization to controlled generation for mutual information maximization under perceptual constraints. Based on this formulation, we develop a unified joint training and efficient stochastic sampling framework, and provide theoretical analysis of its effectiveness in both learning and inference stages. Extensive experiments on latent-space image transmission demonstrate that the JSCGC consistently improves feature-based, semantic-level, and distributional quality across diverse channel conditions, while exhibiting a distinct error behavior characterized by semantic inconsistency rather than distortion.

2606.13240 2026-06-12 cs.LG cs.AI cs.CV stat.ME stat.ML 交叉投稿

Towards More General Control of Diffusion Models Using Jeffrey Guidance

使用 Jeffrey 引导实现扩散模型的更通用控制

Raphaël Razafindralambo, Rémy Sun, Frédéric Precioso, Jes Frellsen, Pierre-Alexandre Mattei

发表机构 * Inria, CNRS, I3S, Maasai Université Côte d’Azur(法国国家信息与自动化研究所、法国国家科学研究中心、信息、信号与系统实验室、Maasai 团队、蔚蓝海岸大学) Technical University of Denmark(丹麦技术大学) Inria, CNRS, LJAD, Maasai Université Côte d’Azur(法国国家信息与自动化研究所、法国国家科学研究中心、让-亚历山大·迪厄多内实验室、Maasai 团队、蔚蓝海岸大学)

AI总结 提出 Jeffrey 引导框架,通过 Jeffrey 条件规则更新边缘分布,扩展扩散模型控制到标准引导无法表达的应用,在 CIFAR-10 和 FFHQ 上显著降低 FID,并在 CelebA-HQ 上实现公平性控制。

详情
AI中文摘要

扩散模型的一个关键优势在于其灵活性,因为其输出可以在采样时通过引导进行控制。然而,除了条件采样等简单情况外,目标分布通常隐含地定义,仅通过采样规则或启发式能量函数给出。为了解决这个问题,我们提出了 Jeffrey 引导,这是一个原则性框架,将扩散模型控制扩展到标准引导无法表达的应用。它利用 Jeffrey 条件规则将边际分布更新到指定的目标,保持条件结构并最小化对联合分布的扰动。我们首先通过针对指定的嵌入分布来演示 Jeffrey 引导。以 Inception 嵌入为目标,这导致在 CIFAR-10 和 FFHQ 上 FID 显著降低。我们进一步将 Jeffrey 引导应用于 CelebA-HQ 上的公平性,更新无条件扩散模型以强制属性之间的独立性。

英文摘要

A key strength of diffusion models lies in their flexibility, since their outputs can be controlled at sampling time through guidance. However, beyond simple cases such as conditional sampling, the target distribution is often left implicit, defined only through a sampling rule or a heuristic energy function. To address this, we propose Jeffrey guidance, a principled framework that extends diffusion-model control to applications beyond what standard guidance can express. It leverages Jeffrey's rule of conditioning to update marginal distributions towards a prescribed target, preserving the conditional structure and minimally perturbing the joint distribution. We first demonstrate Jeffrey guidance by targeting a prescribed embedding distribution. With Inception embeddings as the target, this leads to substantial reductions in FID on both CIFAR-10 and FFHQ. We further apply Jeffrey guidance to fairness on CelebA-HQ, updating an unconditional diffusion model to enforce independence between attributes.
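Jeffrey's rule of conditioning itself is simple to state on a discrete distribution: given a joint P(x, y) and a prescribed marginal Q(y), the update is P'(x, y) = P(x | y) Q(y). The sketch below is a toy illustration of that rule only, not of the paper's diffusion guidance; the function name `jeffrey_update` and the numbers are made up for the example.

```python
def jeffrey_update(joint, target_marginal):
    """Apply Jeffrey's rule to a discrete joint distribution.

    joint[y][x] holds P(x, y); target_marginal[y] holds the prescribed Q(y).
    Returns P'(x, y) = P(x | y) * Q(y), which keeps every conditional P(x | y)
    unchanged while moving the marginal over y to Q.
    """
    updated = {}
    for y, row in joint.items():
        p_y = sum(row.values())  # current marginal P(y)
        if p_y == 0.0:
            if target_marginal[y] > 0.0:
                raise ValueError(f"P(y={y!r}) is zero, so P(x | y) is undefined")
            updated[y] = dict(row)
            continue
        updated[y] = {x: (p / p_y) * target_marginal[y] for x, p in row.items()}
    return updated


# Toy joint: y is a binary attribute, x is a binary sample feature.
joint = {
    "a": {"x0": 0.60, "x1": 0.20},  # P(y=a) = 0.8
    "b": {"x0": 0.05, "x1": 0.15},  # P(y=b) = 0.2
}
balanced = jeffrey_update(joint, {"a": 0.5, "b": 0.5})
for y, row in balanced.items():
    print(y, {x: round(p, 3) for x, p in row.items()})
```

After the update the attribute marginal is balanced, while the within-attribute proportions (for example 3:1 for x0 versus x1 when y = a) are untouched, which is the "preserving the conditional structure" property the abstract refers to.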

2606.13364 2026-06-12 cs.LG cs.CV 交叉投稿

VideoMDM: Towards 3D Human Motion Generation From 2D Supervision

VideoMDM: 从2D监督走向3D人体运动生成

Amir Mann, Gal Michael Harari, Merav Keidar, Or Litany

发表机构 * Technion(以色列理工学院) NVIDIA(英伟达)

AI总结 提出VideoMDM框架,利用单目视频的2D姿态通过扩散模型学习3D运动先验,使用深度加权的2D重投影损失近似3D监督,在HumanML3D上接近全3D监督性能。

Comments https://videomdm.github.io/

详情
AI中文摘要

我们提出VideoMDM,一个基于扩散的框架,直接从单目视频中提取的精确2D姿态训练3D人体运动先验,无需任何3D真实数据。预训练的2D到3D提升器提供近似的3D姿态序列,作为有噪声的教师:这些序列被扩散,模型在3D空间去噪,并通过重投影预测并与精确关键点比较在2D空间进行监督。我们证明,在温和假设下,深度加权的2D重投影损失在期望上等价于直接3D监督,并将标准3D运动正则化器——速度一致性和过参数化表示对齐——适应到这一2D设置。与仅在推理时将2D提升到3D的方法不同,VideoMDM在训练期间学习一个连贯的3D运动流形。在HumanML3D上,它几乎缩小了与完全3D监督的MDM的差距(FID 0.88 vs 0.54);在真实视频数据集Fit3D和NBA上,该方法学习生成一致被人类偏好的运动,并取得了强定量结果。

英文摘要

We introduce VideoMDM, a diffusion-based framework that trains 3D human motion priors directly from accurate 2D poses extracted from monocular videos, without any 3D ground truth. A pretrained 2D-to-3D lifter provides approximate 3D pose sequences that serve as a noisy teacher: these are diffused, denoised by the model in 3D, and supervised in 2D by reprojecting the prediction and comparing against accurate keypoints. We show that, under mild assumptions, a depth-weighted 2D reprojection loss is equivalent in expectation to direct 3D supervision, and we adapt standard 3D motion regularizers - velocity consistency and over-parameterized representation alignment - to this 2D setting. Unlike methods that lift 2D to 3D only at inference, VideoMDM learns a coherent 3D motion manifold during training. On HumanML3D, it nearly closes the gap to fully 3D-supervised MDM (FID 0.88 vs 0.54); on the real video datasets Fit3D and NBA, the method learns to generate motions consistently preferred by humans, with strong quantitative results.
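The intuition behind a depth-weighted reprojection loss can be sketched as follows. The abstract does not give the exact weighting, so this is an assumption-laden illustration: it assumes a pinhole camera with focal length `focal` and the principal point at the origin, and it weights each joint's pixel error by depth / focal. The names `project` and `depth_weighted_reprojection_loss` are hypothetical.

```python
def project(joint, focal):
    """Pinhole projection with the principal point at the origin (assumed camera model)."""
    x, y, z = joint
    return (focal * x / z, focal * y / z)


def depth_weighted_reprojection_loss(pred_joints_3d, keypoints_2d, focal):
    """Mean squared 2D error, with each joint's pixel error scaled by depth / focal.

    Scaling by z / focal converts a pixel offset back into an approximate
    metric offset at that joint's depth, so near and far joints count alike.
    """
    total = 0.0
    for joint, (u_gt, v_gt) in zip(pred_joints_3d, keypoints_2d):
        u, v = project(joint, focal)
        weight = joint[2] / focal
        total += (weight * (u - u_gt)) ** 2 + (weight * (v - v_gt)) ** 2
    return total / len(pred_joints_3d)


focal = 1000.0
true_joints = [(0.0, 0.0, 2.0), (0.0, 0.0, 8.0)]   # one near joint, one far joint
keypoints = [project(j, focal) for j in true_joints]
pred_joints = [(0.1, 0.0, 2.0), (0.1, 0.0, 8.0)]   # both off by 0.1 units sideways
print(depth_weighted_reprojection_loss(pred_joints, keypoints, focal))
```

In this toy case both joints have the same 3D error, but the near joint's raw pixel error is four times larger; the depth weight removes that imbalance, which is the sense in which such a loss can stand in for 3D supervision.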

2506.18438 2026-06-12 cs.CV 版本更新

CPAM: Context-Preserving Adaptive Manipulation for Zero-Shot Real Image Editing

CPAM: 保持上下文的自适应操作用于零样本真实图像编辑

Dinh-Khoi Vo, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

发表机构 * Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam(越南科学大学信息科技学院) Vietnam National University, Ho Chi Minh City, Vietnam(越南国家大学) Faculty of Information Technology, Monash University, Melbourne, Victoria, Australia(莫纳什大学信息科技学院) Department of Computer Science, University of Dayton, Dayton, Ohio, US(Dayton 大学计算机科学系)

AI总结 提出CPAM零样本框架,通过保持上下文的自适应操作和掩码引导,实现复杂非刚性真实图像的编辑,保留纹理和身份,无需微调。

Comments Accepted to IEEE Transactions on Multimedia. Project page: https://vdkhoi20.github.io/CPAM

详情
AI中文摘要

使用文本描述在文本到图像扩散模型中编辑自然图像仍然是一个重大挑战,特别是在实现一致生成和处理复杂非刚性对象方面。现有方法通常难以保留纹理和身份,需要大量微调,并且在编辑特定空间区域或对象的同时保留背景细节方面存在局限性。本文提出了保持上下文的自适应操作(CPAM),一种用于复杂非刚性真实图像编辑的新型零样本框架。具体来说,我们提出了一个保留适应模块,该模块调整自注意力机制以有效保留并独立控制对象和背景。这确保了在编辑过程中使用掩码引导技术时,对象的形状、纹理和身份得以保持,同时背景不变形。此外,我们开发了一个局部提取模块,以减轻在交叉注意力机制的条件化过程中对非期望修改区域的干扰。我们还引入了各种掩码引导策略,以简单的方式促进多样化的图像操作任务。CPAM可以无缝集成到多个扩散骨干网络中,包括SD1.5、SD2.1和SDXL,展示了跨不同模型架构的强大泛化能力。在我们新构建的图像操作基准(IMBA)上进行的广泛实验表明,我们提出的方法是人类评估者的首选,优于现有的最先进编辑技术。源代码和数据将在项目页面公开发布:https://vdkhoi20.github.io/CPAM

英文摘要

Editing natural images using textual descriptions in text-to-image diffusion models remains a significant challenge, particularly in achieving consistent generation and handling complex, non-rigid objects. Existing methods often struggle to preserve textures and identity, require extensive fine-tuning, and exhibit limitations in editing specific spatial regions or objects while retaining background details. This paper proposes Context-Preserving Adaptive Manipulation (CPAM), a novel zero-shot framework for complicated, non-rigid real image editing. Specifically, we propose a preservation adaptation module that adjusts self-attention mechanisms to preserve and independently control the object and background effectively. This ensures that the objects' shapes, textures, and identities are maintained while keeping the background undistorted during the editing process using the mask guidance technique. Additionally, we develop a localized extraction module to mitigate the interference with the non-desired modified regions during conditioning in cross-attention mechanisms. We also introduce various mask-guidance strategies to facilitate diverse image manipulation tasks in a simple manner. CPAM can be seamlessly integrated with multiple diffusion backbones, including SD1.5, SD2.1, and SDXL, demonstrating strong generalization across different model architectures. Extensive experiments on our newly constructed Image Manipulation BenchmArk (IMBA), a robust benchmark dataset specifically designed for real image editing, demonstrate that our proposed method is the preferred choice among human raters, outperforming existing state-of-the-art editing techniques. The source code and data will be publicly released at the project page: https://vdkhoi20.github.io/CPAM

2506.18493 2026-06-12 cs.CV 版本更新

ShowFlow: From Robust Single Concept to Condition-Free Multi-Concept Generation

ShowFlow: 从鲁棒的单概念到无条件的多概念生成

Trong-Vu Hoang, Quang-Binh Nguyen, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science(科学大学) Vietnam National University(越南国家大学) Monash University(莫纳什大学) University of Dayton(代顿大学)

AI总结 提出ShowFlow框架,通过KronA-WED适配器和语义感知注意力正则化增强单概念生成,并利用SAMA和布局一致性指导实现无额外条件的多概念生成。

详情
AI中文摘要

定制化图像生成仍然是可控图像合成中的核心挑战。对于单概念生成,保持身份保留和提示对齐是困难的。在多概念场景中,仅依赖提示而不使用布局框或语义掩码等额外条件,通常会导致身份丢失和概念遗漏。在本文中,我们介绍了ShowFlow,一个旨在应对这些挑战的全面框架。我们提出了用于单概念图像生成的ShowFlow-S,以及用于处理多个概念的ShowFlow-M。ShowFlow-S引入了一个KronA-WED适配器,它将Kronecker适配器与权重和嵌入分解相结合,并配合一种新颖的语义感知注意力正则化(SAR)训练目标,以增强单概念生成。在此基础上,ShowFlow-M直接重用由ShowFlow-S学习的鲁棒模型,以支持无需额外条件的多概念生成,并集成了主体自适应匹配注意力(SAMA)和布局一致性指导作为即插即用模块。大量实验和用户研究验证了ShowFlow的有效性,突显了其在广告和虚拟试穿等实际应用中的潜力。我们的源代码将在以下网址公开:https://htrvu.github.io/showflow

英文摘要

Customizing image generation remains a core challenge in controllable image synthesis. For single-concept generation, maintaining both identity preservation and prompt alignment is challenging. In multi-concept scenarios, relying solely on a prompt without additional conditions like layout boxes or semantic masks often leads to identity loss and concept omission. In this paper, we introduce ShowFlow, a comprehensive framework designed to tackle these challenges. We propose ShowFlow-S for single-concept image generation, and ShowFlow-M for handling multiple concepts. ShowFlow-S introduces a KronA-WED adapter, which integrates a Kronecker adapter with weight and embedding decomposition, and pairs it with a novel Semantic-Aware Attention Regularization (SAR) training objective to enhance single-concept generation. Building on this foundation, ShowFlow-M directly reuses robust models learned by ShowFlow-S to support multi-concept generation without extra conditions, incorporating Subject-Adaptive Matching Attention (SAMA) and Layout Consistency guidance as plug-and-play modules. Extensive experiments and user studies validate ShowFlow's effectiveness, highlighting its potential in real-world applications like advertising and virtual dressing. Our source code will be publicly available at: https://htrvu.github.io/showflow.

2606.06113 2026-06-12 cs.CV 版本更新

Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

位置、类型、原因与重要性:面向文本到图像反馈的结构化缺陷定位

Huaisong Zhang, Hao Yu, Yuxuan Zhang, Jiahe Wang, Xinrui Chen, Haoxiang Cao, Feng Lu, Wendong Zhang, Changqian Yu, Chun Yuan

发表机构 * Tsinghua University(清华大学) Kolors Team, Kuaishou Technology(快手科技Kolors团队) University of British Columbia(不列颠哥伦比亚大学) Vector Institute(向量研究所) South China Normal University(华南师范大学)

AI总结 提出结构化缺陷定位(SDG)方法,将文本到图像生成中的缺陷诊断建模为结构化集合预测,通过构建SDG-30K数据集和SDG-Eval评估协议,并利用视觉语言模型作为检测器,结合BoxFlow-GRPO将预测的缺陷集合转化为空间奖励以改进扩散模型对齐。

Comments 25 pages, 9 figures

详情
AI中文摘要

尽管文本到图像(T2I)模型生成的图像越来越逼真,但它们仍然存在局部、细微且结构复杂的失败。诊断这些失败需要实例级别的反馈,回答缺陷发生的位置、类型、原因及其对整体图像质量的重要性。虽然最近的密集反馈方法超越了标量监督,但其以热图为中心的表示仍将诊断公式化为像素场回归,这使得定位可变数量的缺陷并将语义原因绑定到单个失败变得困难。为了解决这一表示瓶颈,我们提出了结构化缺陷定位(SDG),通过将每个缺陷建模为(位置、类型、原因、重要性)元组,将T2I诊断转化为结构化集合预测。为了使这一公式可训练和可测量,我们引入了SDG-30K,一个包含30K张图像的数据集,具有跨四个现代T2I生成器的框级标注,以及一个专用的评估协议SDG-Eval。基于这种结构化表示,我们进一步提出了一个诊断到对齐的框架,其中视觉语言模型(VLM)作为SDG检测器,BoxFlow-GRPO将预测的缺陷集合转化为基于框的、重要性加权的空间奖励,用于扩散模型对齐。大量实验表明,我们的SDG检测器在结构化缺陷定位上优于领先的专有VLM,而SDG引导的奖励一致地改善了T2I对齐并支持局部图像细化。这些结果确立了SDG作为诊断、评估和增强现代生成模型的统一实例级接口。

英文摘要

Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.
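The (location, type, reason, importance) tuple and the notion of a box-derived, importance-weighted spatial reward can be made concrete with a short sketch. The field names, the box convention, and the rule "subtract each defect's importance inside its box" are assumptions for illustration; the abstract does not specify how BoxFlow-GRPO actually builds its reward.

```python
from dataclasses import dataclass


@dataclass
class Defect:
    """One element of the predicted defect set (hypothetical field names)."""
    box: tuple          # (x0, y0, x1, y1) in pixels, upper bounds exclusive
    defect_type: str
    reason: str
    importance: float   # larger means more harmful to overall quality


def spatial_reward(defects, width, height):
    """Build a per-pixel reward map: each defect lowers the reward inside its box."""
    reward = [[0.0] * width for _ in range(height)]
    for d in defects:
        x0, y0, x1, y1 = d.box
        for y in range(max(0, y0), min(height, y1)):
            for x in range(max(0, x0), min(width, x1)):
                reward[y][x] -= d.importance
    return reward


defects = [
    Defect((0, 0, 2, 2), "anatomy", "hand has six fingers", 0.5),
    Defect((1, 1, 3, 3), "texture", "smeared fabric pattern", 0.25),
]
for row in spatial_reward(defects, width=4, height=4):
    print(row)
```

Because the prediction is a set of tuples rather than a heatmap, any number of defects can be represented, and overlapping boxes simply accumulate their penalties.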

2606.09639 2026-06-12 cs.CV 版本更新

CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

CineDance: 迈向下一代多镜头长片电影级音视频生成

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Zhucun Xue, Qianyu Zhou, Jason Li, Lizhuang Ma, Jiangning Zhang, Dacheng Tao

发表机构 * Shanghai Jiao Tong University(上海交通大学) University of Electronic Science and Technology of China(电子科技大学) Zhejiang University(浙江大学) The University of Tokyo(东京大学) Nanyang Technological University(南洋理工大学)

AI总结 提出CineDance-1M大规模多镜头长片音视频数据集,通过三阶段筛选流程和CineBench评估体系,实现高质量联合生成。

详情
AI中文摘要

训练数据集的保真度和结构多样性从根本上决定了视频生成模型的能力。尽管商业系统在生成电影叙事方面表现出色,但开源模型的进展仍受限于高质量训练数据的稀缺性。为弥合这一差距,我们引入了CineDance-1M,一个大规模、开放研究文本到音视频(T2AV)数据集,专门用于多镜头、长片联合音视频生成。每个视频平均时长92.8秒,包含24.2个连续镜头,并提供音频和视频模态的可配置、结构化标注。这一卓越质量通过严格的三个阶段筛选流程实现:i) 多样化来源和全面清洗,ii) 基于电影理论的叙事解析,以及iii) 层次化双模态字幕生成。为进行全面评估,我们提出了CineBench,包含多样化的提示套件和六维、与人类对齐的度量系统,专为复杂叙事音视频评估而设计。此外,我们将LTX-2.3适配为CineDance,展示了卓越的单模态质量以及精确的音视频对齐和稳健的主体与环境一致性,有效验证了我们的筛选策略和CineDance-1M的高质量。我们预期这项工作将为加速未来多镜头、长片联合音视频生成研究奠定坚实基础。我们的项目页面可在https://aliothchen.github.io/projects/CineDance/获取。

英文摘要

The fidelity and structural diversity of training datasets fundamentally determine the capabilities of video generation models. While commercial systems show remarkable ability to generate cinematic narratives, the progress of open-source models remains limited by the scarcity of high-quality training data. To bridge this gap, we introduce CineDance-1M, a large-scale, open research Text-to-Audio-Video (T2AV) dataset designed specifically for multi-shot, long-form joint audio-video generation. Averaging 92.8 seconds and 24.2 continuous shots per video, it provides configurable, structured annotations for both audio and video modalities. This exceptional quality is achieved through a rigorous three-stage curation pipeline: i) diverse sourcing and comprehensive cleansing, ii) film-theory-inspired narrative parsing, and iii) hierarchical dual-modal captioning. For a comprehensive assessment, we propose CineBench, featuring a diverse prompt suite and a six-dimensional, human-aligned metric system tailored for complex narrative audio-video evaluation. Furthermore, we adapt LTX-2.3 into CineDance, which demonstrates exceptional single-modality quality alongside precise audio-video alignment and robust subject and environment consistency, effectively validating our curation strategy and the high quality of CineDance-1M. We anticipate that this work will serve as a solid foundation for accelerating future research in multi-shot, long-form joint audio-video generation. Our project page is available at https://aliothchen.github.io/projects/CineDance/.

2606.01538 2026-06-12 cs.GR cs.CV cs.LG 版本更新

MPMWorlds: Material-Point-Method Simulations for Inferring and Extrapolating Physical Dynamics

MPMWorlds: 用于推断和外推物理动力学的物质点法模拟

Žiga Kovačič, Kevin Ellis

发表机构 * Cornell University(康奈尔大学)

AI总结 通过构建2D物质点法(MPM)模拟数据集,研究从视频推断物理动力学并外推时间演化的能力,比较代码生成与视频扩散方法的优劣。

Comments 16 pages, 13 figures. Project page: https://zzigak.github.io/mpmworlds/

详情
AI中文摘要

为了研究从视频推断物理动力学并将其向前外推的能力,我们组装了一个包含丰富物理现象(如可变形物体、流体、运动物体和发射器)的2D物质点法(MPM)物理模拟数据集。我们在此数据集上研究了代码生成和视频扩散方法,通过改变物理相关辅助信息的数量来识别它们的优缺点。代码生成模型除了提供自动合成MPM模拟的工作演示外,还揭示了这种方法在从视觉输入推断物理参数方面存在困难,但相对于视频扩散,它能产生物理和时间上稳定的向前外推结果,而视频扩散模型能更强烈地从视觉输入中识别几何属性,但会产生物理上不可信的外推结果。

英文摘要

To study the ability to infer physical dynamics from videos and extrapolate them forward in time, we assemble a dataset of 2D Material Point Method (MPM) physical simulations covering rich physical phenomena such as deformable objects, fluids, kinetic objects, and emitters. We study code generation and video diffusion approaches on this dataset, identifying their strengths and weaknesses by varying the amount of physically relevant side information. The code generation model, beyond giving a working demonstration of automatic synthesis of MPM simulations, struggles to infer physical parameters from visual input but, relative to video diffusion, produces physically and temporally stable extrapolations forward in time. The video diffusion model, by contrast, identifies geometric properties from visual input more strongly but produces physically implausible extrapolations.

7. 3D视觉、点云与空间智能 14 篇

2606.12939 2026-06-12 cs.CV 新提交

MAMVI: 3D Test-Time Adaptation via Masked Multi-View Point Clouds

MAMVI:通过掩蔽多视角点云实现3D测试时自适应

Inseok Kong, Geunyoung Jung, Jiyoung Jung

发表机构 * Department of Geo Informatics, University of Seoul(首尔大学地理信息学系) Department of Artificial Intelligence, University of Seoul(首尔大学人工智能系)

AI总结 针对3D点云在分布偏移下性能下降的问题,提出MAMVI方法,用统一单步自适应替代顺序优化,结合混合掩蔽策略和多视角损失聚合,实现快速且高精度的测试时自适应。

Comments Accepted by ICPR 2026

详情
AI中文摘要

3D点云模型在传感器噪声、遮挡和环境变化引起的分布偏移下会出现显著的性能下降。测试时自适应(TTA)已成为在推理过程中缓解此问题的实用范式。最近,利用多视角增强在提升3D TTA性能方面显示出潜力。然而,现有的多视角方法通常受限于将每个视角独立处理的顺序优化。这种顺序优化由于重复的优化步骤导致显著的推理延迟,使得实时自适应不切实际。为了解决这个问题,我们提出了掩蔽多视角测试时自适应(MAMVI),它用统一的单步自适应替代顺序优化。具体来说,MAMVI利用一种混合掩蔽策略,结合固定比例以保持稳定性,以及Beta分布采样以增加多样性。通过聚合多个视角的损失,MAMVI基于多视角共识通过单次反向传播执行自适应。此外,使用基于置信度的自适应学习率来动态调整每个样本的自适应强度。在ModelNet-40C、ShapeNet-C和ScanObjectNN-C上的大量实验表明,MAMVI在ShapeNet-C和ScanObjectNN-C上达到了最先进的准确率。同时,它在ModelNet-40C上保持竞争力,同时推理速度提高了4.9-8.9倍,使其非常适合实时应用。我们的代码可在以下网址获取:https://github.com/Inseok-kong/MAMVI

英文摘要

3D point cloud models suffer significant performance degradation under distribution shifts caused by sensor noise, occlusions, and environmental changes. Test-time adaptation (TTA) has emerged as a practical paradigm for mitigating this issue during inference. Recently, leveraging multi-view augmentation has shown promise in improving 3D TTA performance. However, existing multi-view approaches are often constrained by sequential optimization that treats each view independently. This sequential optimization leads to substantial inference latency due to repetitive optimization steps, making real-time adaptation impractical. To address this, we propose Masked Multi-View Test-Time Adaptation (MAMVI), which replaces sequential optimization with a unified single-step adaptation. Specifically, MAMVI utilizes a hybrid masking strategy that combines fixed ratios for stability with Beta-distributed sampling for diversity. By aggregating losses across multiple views, MAMVI performs adaptation through a single backward pass based on multi-view consensus. Additionally, a confidence-based adaptive learning rate is used to dynamically adjust the adaptation intensity for each sample. Extensive experiments on ModelNet-40C, ShapeNet-C, and ScanObjectNN-C demonstrate that MAMVI achieves state-of-the-art accuracy on ShapeNet-C and ScanObjectNN-C. Moreover, it remains competitive on ModelNet-40C while delivering 4.9-8.9 times faster inference, making it highly suitable for real-time applications. Our code is available at https://github.com/Inseok-kong/MAMVI
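The hybrid masking strategy and loss aggregation described above can be sketched in a few lines. This is a simplified illustration under assumptions, not the released MAMVI code: the specific fixed ratios, the Beta parameters, random point dropping as the masking operation, and the `toy_loss` stand-in for the adaptation loss are all hypothetical, and the single backward pass is only indicated by the fact that one scalar is returned for all views.

```python
import random


def sample_mask_ratios(fixed_ratios, n_random, alpha, beta, rng):
    """Fixed ratios give stable views; Beta-distributed draws add diverse ones."""
    return list(fixed_ratios) + [rng.betavariate(alpha, beta) for _ in range(n_random)]


def mask_points(points, ratio, rng):
    """Drop roughly `ratio` of the points at random, always keeping at least one."""
    keep = max(1, round(len(points) * (1.0 - ratio)))
    return rng.sample(points, keep)


def aggregated_loss(points, ratios, loss_fn, rng):
    """Average the per-view losses into one scalar, so a single backward pass covers all views."""
    views = [mask_points(points, r, rng) for r in ratios]
    return sum(loss_fn(v) for v in views) / len(views)


def toy_loss(view):
    """Hypothetical stand-in for the model's adaptation loss on one masked view."""
    return sum(x for x, _, _ in view) / len(view)


rng = random.Random(0)
points = [(float(i), 0.0, 0.0) for i in range(8)]  # toy point cloud
ratios = sample_mask_ratios([0.25, 0.5], n_random=2, alpha=2.0, beta=2.0, rng=rng)
print(ratios)
print(aggregated_loss(points, ratios, toy_loss, rng))
```

In a real setup the returned scalar would be differentiated once with an autograd framework, in contrast to the sequential per-view optimization the abstract criticizes.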

2606.13345 2026-06-12 cs.CV 新提交

JointEdit3D: Feed-Forward 3D Scene Editing in a Unified Latent Space

JointEdit3D:统一潜在空间中的前馈3D场景编辑

Xinnan Zhu, Ruijie Xu, Jiayu Ying, Daoguo Dong, Jiachen Xu, Yuan Xie, Xin Tan

发表机构 * East China Normal University(华东师范大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Fudan University(复旦大学) Tencent(腾讯)

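AI summary: Proposes JointEdit3D, which performs feed-forward 3D scene editing through asymmetric latent inpainting in a unified RGB-geometry reconstruction-generation latent space. It introduces a SceneAnchor branch and edit/background-aware losses, and builds the SceneEdit3D-15K dataset and the SceneEdit3D-Bench benchmark; edited-region quality and 3D structural completeness improve over prior baselines.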

Comments Preprint. Project page: https://xinnan-zhu.github.io/JointEdit3D-Page/

详情
AI中文摘要

现有的3D场景编辑方法通常依赖于对显式3D表示进行逐场景优化或级联编辑-重建流水线,导致测试时成本高、3D感知有限以及结构不一致。为了在编辑过程中耦合外观合成和几何预测,我们构建了一个统一的RGB-几何重建生成潜在空间,并将其适应于前馈3D场景编辑。由此产生的框架JointEdit3D通过仅观察单个编辑后的RGB参考潜在变量,并在源场景锚定下生成剩余的RGB视图和编辑后的几何潜在变量,执行非对称潜在修复。JointEdit3D引入了一个专门的SceneAnchor分支来注入源场景结构而不强制直接复制,并采用编辑/背景感知损失来平衡编辑区域的保真度与未编辑内容的保持。为了解决缺乏用于标准化3D场景编辑评估的配对资源的问题,我们引入了SceneEdit3D-15K数据集,该数据集包含15K个配对编辑样本和渲染器提供的3D注释,以及SceneEdit3D-Bench,一个精心挑选的100样本基准。实验表明,JointEdit3D在保持竞争性背景保留的同时,在编辑区域质量和3D结构完整性方面优于先前基线。

英文摘要

Existing 3D scene editing methods typically rely on per-scene optimization over explicit 3D representations or cascaded edit-and-reconstruct pipelines, resulting in high test-time cost, limited 3D awareness, and structural inconsistencies. To couple appearance synthesis and geometry prediction during editing, we build on a unified RGB-geometry reconstruction-generation latent space and adapt it to feed-forward 3D scene editing. The resulting framework, JointEdit3D, performs asymmetric latent inpainting by observing only a single edited RGB reference latent and generating the remaining RGB views and edited geometry latent under source-scene anchoring. JointEdit3D introduces a dedicated SceneAnchor Branch to inject source-scene structure without forcing direct copying, and adopts edit/background-aware losses to balance edited-region fidelity with unedited-content preservation. To address the lack of paired resources for standardized 3D scene editing evaluation, we introduce SceneEdit3D-15K, a dataset with 15K paired editing samples and renderer-provided 3D annotations, together with SceneEdit3D-Bench, a curated 100-sample benchmark. Experiments show that JointEdit3D improves edited-region quality and 3D structural completeness over prior baselines while maintaining competitive background preservation.

2606.13488 2026-06-12 cs.CV 新提交

Point-Wise Geometry-Aware Transformer for Partial-to-Full Point Cloud Registration in Computer-Assisted Surgery

面向计算机辅助手术中部分到完整点云配准的点级几何感知Transformer

Siyu Zhou, Zhongliang Jiang

发表机构 * The Chair for Computer Aided Medical Procedures, Technical University of Munich(慕尼黑工业大学计算机辅助医疗程序教席) The University of Hong Kong(香港大学)

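AI summary: Proposes GAPR-Net, a coarse-to-fine framework that combines convolution and Transformer modules, fuses local and global information through cross-attention, and uses a transformation-invariant point-wise geometric feature. It reaches 94.2% registration recall and an RMSE of 1.992 mm on four geometrically distinct bones.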

详情
AI中文摘要

由于重叠率变化、点密度波动以及噪声的存在,部分到完整配准仍然具有挑战性。尽管Transformer在点云处理中展现出强大潜力,但先前的方法通常将其局限于全局上下文聚合,忽略了对于精确对应至关重要的细粒度局部几何信息。我们提出GAPR-Net,一种基于学习的点云配准框架,采用粗到细架构,结合卷积和Transformer模块,通过交叉注意力机制在部分和完整点云之间融合局部和全局信息。为此,提出了一种变换不变的点级几何特征表示,能够鲁棒地捕获单个点相对于其邻域点的相对几何特征。为了评估所提方法的有效性,在四个几何上不同的骨骼(包括胫骨、股骨、骨盆和胸软骨)上进行了实验。整体配准召回率达到94.2%,该方法实现了低RMSE 1.992 mm,旋转和平移的R²值分别为0.908和0.974。结果表明,所提方法有效解决了部分到完整点云配准问题。该方法利用部分观测实现高精度3D点云配准,为计算机辅助手术中的精确手术导航和机器人干预提供了关键基础。代码将在双盲评审后公开。

英文摘要

Partial-to-full registration remains challenging due to varying overlap ratios, fluctuating point densities, and the presence of noise. While transformers have shown strong potential for point cloud processing, prior methods typically confine them to global context aggregation, overlooking fine-grained local geometry crucial for accurate correspondence. We propose GAPR-Net, a learning-based point cloud registration framework with a coarse-to-fine architecture that combines convolution and transformer modules, in which local and global information is fused between the partial and full point clouds using a cross-attention mechanism. To achieve this, a transformation-invariant point-wise geometric feature representation is proposed, which can robustly capture relative geometric features for individual points with respect to their neighboring points. To evaluate the effectiveness of the proposed approach, experiments are conducted on four geometrically distinct bones, including the tibia, femur, pelvis, and thoracic cartilage. The overall registration recall reaches 94.2%, and the method achieves a low RMSE of 1.992 mm and $R^2$ values of 0.908 and 0.974 for rotation and translation, respectively. The results demonstrate that the proposed method effectively addresses the partial-to-full point cloud registration problem. The proposed method enables highly accurate 3D point cloud registration using partial observation, providing a critical foundation for precise surgical navigation and robotic interventions in computer-assisted surgery. The code will be released after the double-blind review process.
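
To make "transformation-invariant point-wise feature" concrete, here is a small sketch of one such feature: the sorted distances from a point to its nearest neighbors, which do not change under rotation or translation. This is an illustrative stand-in chosen for simplicity, not the feature GAPR-Net uses; `knn_distance_feature` and `rigid_transform` are hypothetical helper names.

```python
import math

def knn_distance_feature(points, index, k):
    """Sorted distances from points[index] to its k nearest neighbors."""
    p = points[index]
    dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != index)
    return dists[:k]

def rigid_transform(points, angle, translation):
    """Rotate about the z-axis by `angle` (radians), then translate."""
    c, s = math.cos(angle), math.sin(angle)
    tx, ty, tz = translation
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in points]

cloud = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0), (1.0, 1.0, 1.0)]
moved = rigid_transform(cloud, angle=0.7, translation=(5.0, -2.0, 1.5))

# The feature of the same point agrees before and after the rigid motion.
before = knn_distance_feature(cloud, index=4, k=3)
after = knn_distance_feature(moved, index=4, k=3)
```

Because the descriptor depends only on relative geometry, corresponding points in the partial and full clouds get comparable features regardless of the unknown pose, which is the property a registration network needs for matching.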

2606.13644 2026-06-12 cs.CV 新提交

Surflo: Consistent 3D Surface Flow Model with Global State

Surflo:具有全局状态的一致3D表面流模型

Antoine Guédon, Shu Nakamura, Nicolas Dufour, Jiahui Lei, Ko Nishino, Angjoo Kanazawa

发表机构 * LIX, École polytechnique(LIX,巴黎综合理工学院) Kyoto University(京都大学) Kyutai UC Berkeley(加州大学伯克利分校)

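AI summary: Proposes Surflo, which compresses a variable number of unposed RGB views into a global latent and uses flow matching to transport 3D surface points independently from noise, yielding consistent surface reconstruction at arbitrary resolution; photometric gradient guidance at inference time suppresses local inconsistencies.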

Comments Project webpage: https://anttwo.github.io/surflo/

详情
AI中文摘要

几何形状对视角具有不变性,这使得任何图像集合都是单个3D状态的冗余编码。现有的前馈重建模型未能充分利用这一点:逐视角方法会生成重叠且未对齐的点云,其数量随输入数量线性增长;而全局潜在方法则局限于固定的低分辨率输出。我们提出Surflo,它将可变数量的无位姿RGB视图压缩为K个潜在令牌(一个全局状态),并通过流匹配将带方向的3D表面点从噪声独立传输到表面上进行解码。这使得输出不受任何固定网格或令牌预算的限制:相同的潜在变量在单次前向传播中即可生成从几千到一百万个点。为了抑制独立逐点解码固有的局部不一致性,我们在ODE积分过程中注入光度梯度,通过推理时的引导项关联邻近点。Surflo在表面指标上匹配或超越前馈基线,运行速度比需要数百个视图的基于优化的方法快一个数量级,并且是唯一结合全局潜在变量与任意分辨率解码的前馈方法。

英文摘要

Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps that grow linearly with input count, while global-latent methods commit to a fixed, low-resolution output. We introduce Surflo, which compresses a variable number of unposed RGB views into K latent tokens (one global state) and decodes oriented 3D surface points by independently transporting them from noise onto the surface via flow matching. This frees the output from any fixed grid or token budget: the same latent yields from a few thousand to a million points in a single forward pass. To suppress the local inconsistencies inherent to independent per-point decoding, an inference-time guidance term correlates nearby points by injecting a photometric gradient during ODE integration. Surflo matches or surpasses feed-forward baselines on surface metrics, runs an order of magnitude faster than optimization-based methods that require hundreds of views, and is the only feed-forward approach to combine a global latent with arbitrary-resolution decoding.

2606.13652 2026-06-12 cs.CV cs.GR 新提交

World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

世界追踪:超越可见表面的生成式像素对齐几何

Hao Zhang, Mohamed El Banani, Jen-Hao Cheng, Paul Zhang, Yi Hua, Ben Mildenhall, Christoph Lassner, Narendra Ahuja, Gengshan Yang

发表机构 * World Labs University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

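AI summary: Proposes World Tracing, a generative pixel-aligned geometry representation in which a diffusion transformer predicts an ordered stack of points per pixel, reconstructing the visible surface and generating occluded geometry together; it outperforms depth predictors and image-to-3D methods on multiple benchmarks.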

Comments World Labs Technical Report; Page: https://haoz19.github.io/world-tracing-page/

详情
AI中文摘要

图像到3D方法常常在忠实度和完整性之间权衡:深度估计器锚定于输入像素但止于可见表面,而图像到3D模型生成完整形状却往往与输入不对齐。我们引入世界追踪(World Tracing),一种生成式像素对齐几何表示,它预测与观测像素对齐的3D点,同时完成可见表面之外的几何。对于每个输入像素,世界追踪预测一个有序的相机空间3D点栈,其中第一层表示可见表面,后续层表示与遮挡表面的从前到后交点。我们通过一个世界追踪扩散Transformer(WT-DiT)实例化该表示,该模型将多个几何层视为独立的去噪令牌,并通过分解注意力和全局注意力将其耦合。WT-DiT使用像素空间流匹配和混合噪声调度进行训练,平衡可见表面重建与遮挡几何生成。世界追踪在物体、场景和动态基准上,在可见表面重建和完整几何生成方面均取得了强劲性能,超越了深度预测器和图像到3D生成器。它还保留了2D到3D对应关系,实现了文本驱动的3D场景编辑、几何条件的新视角视频合成,以及与纹理网格生成器的无训练集成。

英文摘要

Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.

2503.17182 2026-06-12 cs.CV 版本更新

Radar-Guided Polynomial Fitting for Metric Depth Estimation

雷达引导的多项式拟合用于度量深度估计

Patrick Rim, Hyoungseob Park, Vadim Ezhov, Jeffrey Moon, Alex Wong

发表机构 * Yale University(耶鲁大学) University of Pennsylvania(宾夕法尼亚大学)

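AI summary: Proposes POLAR, which predicts polynomial coefficients from radar data to correct the scaleless output of monocular depth estimation non-uniformly and obtain metric depth; across three datasets it improves on existing methods by an average of 24.9% in MAE and 33.2% in RMSE.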

Comments CVPR 2026

详情
AI中文摘要

我们提出POLAR,一种新颖的雷达引导深度估计方法,引入多项式拟合以高效地将预训练单目深度估计(MDE)模型的无尺度深度预测转换为度量深度图。与依赖复杂架构或昂贵传感器的现有方法不同,我们的方法基于一个基本洞察:尽管MDE模型通常能在每个物体或局部区域内推断合理的局部深度结构,但它们可能使这些区域相互错位,使得在三个或更多区域的情况下线性尺度和偏移(仿射)变换不足。为解决这一限制,我们使用从廉价、普遍存在的雷达数据预测的多项式系数,在深度范围内非均匀地自适应调整预测。通过这种方式,POLAR超越了仿射变换,并能够通过引入拐点来纠正此类错位。重要的是,我们的多项式拟合框架通过一种新颖的训练目标保持结构一致性,该目标通过一阶导数正则化强制局部单调性。POLAR在三个数据集上实现了最先进的性能,在MAE和RMSE上平均优于现有方法24.9%和33.2%,同时在延迟和计算成本方面也实现了最先进的效率。

英文摘要

We propose POLAR, a novel radar-guided depth estimation method that introduces polynomial fitting to efficiently transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a fundamental insight: although MDE models often infer reasonable local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale and shift (affine) transformation insufficient given three or more of these regions. To address this limitation, we use polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust predictions non-uniformly across depth ranges. In this way, POLAR generalizes beyond affine transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces local monotonicity via first-derivative regularization. POLAR achieves state-of-the-art performance across three datasets, outperforming existing methods by an average of 24.9% in MAE and 33.2% in RMSE, while also achieving state-of-the-art efficiency in terms of latency and computational cost.
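
The claim that a scale-and-shift transform is insufficient for three or more misaligned regions can be checked numerically. The sketch below uses three made-up anchor pairs (relative depth, metric depth) and fits the polynomial by direct interpolation; POLAR instead predicts the coefficients from radar with a network, so treat this only as an illustration of the degrees-of-freedom argument. All function names and values are hypothetical.

```python
def affine_from_two(p0, p1):
    """Scale and shift mapping relative depth to metric depth through two anchors."""
    (x0, y0), (x1, y1) = p0, p1
    scale = (y1 - y0) / (x1 - x0)
    return lambda x: scale * (x - x0) + y0

def lagrange_polynomial(anchors):
    """Polynomial of degree len(anchors) - 1 passing through all anchors."""
    def evaluate(x):
        total = 0.0
        for i, (xi, yi) in enumerate(anchors):
            term = yi
            for j, (xj, _) in enumerate(anchors):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return evaluate

def is_monotonic(f, lo, hi, steps=100):
    """Finite-difference check that f is non-decreasing on [lo, hi]."""
    xs = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return all(f(b) >= f(a) for a, b in zip(xs, xs[1:]))

# Hypothetical anchors: (relative depth from the MDE model, metric depth in metres).
anchors = [(0.2, 4.0), (0.5, 10.0), (0.8, 22.0)]
affine = affine_from_two(anchors[0], anchors[1])
poly = lagrange_polynomial(anchors)

affine_error = abs(affine(0.8) - 22.0)   # affine fit misses the third region
poly_error = abs(poly(0.8) - 22.0)       # quadratic passes through all three
```

With these anchors the quadratic is increasing on the fitted range [0.2, 0.8] but not on all of [0, 1], which is one way to see why the abstract pairs polynomial fitting with a first-derivative regularizer that enforces local monotonicity.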

2602.22629 2026-06-12 cs.CV 版本更新

CRAG: Can 3D Generative Models Help 3D Assembly?

CRAG: 3D生成模型能否辅助3D装配?

Zeyu Jiang, Sihang Li, Siqi Tan, Chenyang Xu, Juexiao Zhang, Julia Galway-Witham, Xue Wang, Scott A. Williams, Radu Iovita, Chen Feng, Jing Zhang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

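AI summary: Proposes CRAG, which treats 3D assembly and shape generation as a joint problem; generating complete shapes and predicting part poses reinforce each other, giving state-of-the-art results across diverse geometries, part counts, and missing pieces.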

Comments 15 pages, 8 figures

详情
AI中文摘要

大多数现有的3D装配方法将问题视为纯姿态估计,通过刚性变换重新排列观察到的部件。相比之下,人类装配自然地将结构推理与整体形状推断相结合。受此直觉启发,我们将3D装配重新表述为装配和生成的联合问题。我们表明这两个过程相互增强:装配为生成提供部件级结构先验,而生成注入整体形状上下文,解决装配中的歧义。与无法合成缺失几何形状的先前方法不同,我们提出了CRAG,它同时生成合理的完整形状并预测输入部件的姿态。大量实验表明,在具有不同几何形状、不同部件数量和缺失部件的野外物体上,该方法达到了最先进的性能。项目页面:https://ai4ce.github.io/CRAG/

英文摘要

Most existing 3D assembly methods treat the problem as pure pose estimation, rearranging observed parts via rigid transformations. In contrast, human assembly naturally couples structural reasoning with holistic shape inference. Inspired by this intuition, we reformulate 3D assembly as a joint problem of assembly and generation. We show that these two processes are mutually reinforcing: assembly provides part-level structural priors for generation, while generation injects holistic shape context that resolves ambiguities in assembly. Unlike prior methods that cannot synthesize missing geometry, we propose CRAG, which simultaneously generates plausible complete shapes and predicts poses for input parts. Extensive experiments demonstrate state-of-the-art performance across in-the-wild objects with diverse geometries, varying part counts, and missing pieces. Project Page: https://ai4ce.github.io/CRAG/

2603.23502 2026-06-12 cs.CV 版本更新

OccAny: Generalized Unconstrained Urban 3D Occupancy

OccAny: 广义无约束城市3D占据预测

Anh-Quan Cao, Tuan-Hung Vu

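AI summary: Proposes OccAny, the first generalized unconstrained urban 3D occupancy model. Using Segmentation Forcing and Novel View Rendering, it predicts and completes metric occupancy coupled with segmentation features on uncalibrated scenes, and generalizes across domains better than visual geometry baselines.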

Comments Accepted to CVPR 2026. Project page: https://valeoai.github.io/OccAny/

详情
AI中文摘要

依赖于域内标注和精确传感器先验,现有的3D占据预测方法在可扩展性和域外泛化方面均受限。虽然最近的视觉几何基础模型展现出强大的泛化能力,但它们主要针对通用目的设计,缺乏城市占据预测所需的一个或多个关键要素,即度量预测、杂乱场景中的几何完成以及城市场景的适应性。我们解决了这一差距,并提出了OccAny,这是第一个无约束城市3D占据模型,能够在域外无标定场景上运行,预测并完成与分割特征耦合的度量占据。OccAny具有通用性,可以从序列、单目或环视图像预测占据。我们的贡献有三方面:(i) 提出了第一个广义3D占据框架,(ii) 提出了分割强制(Segmentation Forcing)方法,在提高占据质量的同时实现掩码级预测,以及(iii) 提出了一种新视图渲染管线,用于推断新视图几何以实现测试时视图增强,从而完成几何。大量实验表明,OccAny在3D占据预测任务上优于所有视觉几何基线,同时在两个已建立的城市占据预测数据集上的三种输入设置下,与域内自监督方法保持竞争力。我们的代码可在以下网址获取:https://github.com/valeoai/OccAny

英文摘要

Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes, and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii) Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on the 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets. Our code is available at https://github.com/valeoai/OccAny

2605.31419 2026-06-12 cs.CV cs.RO 版本更新

Triangle Splatting SLAM

三角形泼溅SLAM

Nicholas Fry, Eric Dexheimer, Kirill Mazur, Paul H. J. Kelly, Andrew J. Davison

发表机构 * Software Performance Optimisation Group(软件性能优化组) Department of Computing(计算部门)

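AI summary: Proposes the first dense RGB-D SLAM system that uses differentiable triangles as the 3D map representation, performing tracking and mapping through online differentiable rendering and supporting on-the-fly mesh conversion and editing.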

Comments 26 pages, 11 figures

详情
AI中文摘要

我们提出了一种密集RGB-D SLAM系统,使用可微三角形作为3D地图表示。虽然3D高斯泼溅已成为新颖视角合成的主要方法,但三角形仍然是传统渲染硬件、游戏引擎以及需要显式几何的下游任务(如模拟、碰撞和编辑)的标准图元。最近的离线方法表明,通过在一组带姿态的图像上进行Delaunay三角剖分,可以将非结构化的“三角形汤”优化为照片级逼真的网格。基于这一见解,我们提出了第一个密集SLAM系统,通过在线可微渲染三角形汤来执行跟踪和建图。地图可以通过受限Delaunay三角剖分实时转换为连通网格,从而实现网格变形和碰撞检测等新的在线功能。在Replica和TUM-RGBD数据集上,我们的系统在3D几何方面优于基线,匹配相机跟踪精度,并支持基于网格的在线场景编辑。

英文摘要

We present a dense RGB-D SLAM system using differentiable triangles as the 3D map representation. While 3D Gaussian Splatting has emerged as the leading method for novel-view synthesis, triangles remain the standard primitive for traditional rendering hardware, game engines, and downstream tasks requiring explicit geometry such as simulation, collision, and editing. Recent offline methods have demonstrated that an unstructured 'triangle soup' can be optimised into a photorealistic mesh via Delaunay triangulation across a set of posed images. Building upon this insight, we present the first dense SLAM system to employ Triangle Splatting to perform both tracking and mapping through online differentiable rendering of a triangle soup. The map can be converted into a connected mesh on-the-fly via restricted Delaunay triangulation, enabling new online capabilities such as mesh deformation and collision checking. On Replica and TUM-RGBD, our system outperforms baselines on 3D geometry, matches the camera-tracking accuracy, and enables online mesh-based scene editing.

2606.07436 2026-06-12 cs.CV 版本更新

Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Skill-3D:面向智能体3D空间推理的场景感知技能进化

Haoyuan Li, Zhengdong Hu, Jun Wang, Hehe Fan, Yi Yang

发表机构 * Zhejiang University(浙江大学) University of Technology Sydney(技术悉尼大学) OPPO Research Institute(OPPO研究院)

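AI summary: Proposes Skill-3D, a framework in which a scene memory and a skill library co-evolve so that the agent selects tools according to the scene, substantially improving correct and sufficient tool use in 3D spatial reasoning.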

详情
AI中文摘要

本文探索智能体3D空间理解,即MLLM智能体通过工具使用进行3D推理。现有方法在3D场景下常误用工具并表现出有偏的工具偏好,使得智能体范式相比非智能体策略仅有边际提升。我们揭示3D空间推理任务在不同场景下具有异质性,而这些智能体对所有场景采用统一的工具使用策略,而非根据具体场景和任务选择工具。为解决此问题,我们提出Skill-3D,一种学习自进化场景感知技能的框架。具体而言,Skill-3D识别任务场景并将智能体的工具使用轨迹记录到场景记忆中,其中来自相似场景的成功轨迹被聚合和蒸馏成可复用的场景感知技能,失败的轨迹作为教训附加到该技能上。在训练过程中,一旦相似场景再次出现,注入相应技能以引导智能体,产生新轨迹,其成功和失败进一步优化技能,形成记忆和技能库共同进化的循环。实验表明,Skill-3D显著提升了3D空间推理中的工具利用率(在VSI-Bench上从39%提升至78%),推动智能体正确且充分地使用工具。例如,在MMSI-Bench上,它将Gemini-3-Flash提升了67%。此外,我们在技能引导的轨迹上进行智能体后训练,使Qwen3-VL-8B在VSI-Bench上提升了43%。

英文摘要

This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 60% on VSI-Bench.

2606.11894 2026-06-12 cs.CV 版本更新

Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

Wild3R: 从无约束稀疏照片集合进行前馈式3D高斯泼溅

Yuto Furutani, Takashi Otonari, Kaede Shiohara, Toshihiko Yamasaki

发表机构 * The University of Tokyo(东京大学)

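AI summary: Proposes Wild3R, a feed-forward 3D Gaussian Splatting method for unconstrained sparse photo collections. Trained on the new WildCity dataset, which covers diverse lighting and transient objects, it learns cross-view appearance consistency and removes transient content, outperforming existing feed-forward methods and remaining competitive with per-scene optimization.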

Comments Project page: https://furuschool.github.io/wild3r-page/

详情
AI中文摘要

前馈式3D高斯泼溅(3DGS)消除了传统3DGS所需的耗时逐场景优化。然而,现有的前馈方法难以处理包含多样光照条件和瞬态物体的真实世界照片集合。在本文中,我们提出了Wild3R,一种针对无约束稀疏照片集合的前馈方法。主要瓶颈在于缺乏提供多视角、多种光照和瞬态变化的训练数据,而这些是学习鲁棒场景表示所必需的。为解决这一问题,我们引入了WildCity数据集,该数据集包含200个场景、170种光照条件和瞬态物体,总计337,500张图像。通过利用该数据集,我们的模型在参考视图条件下学习跨视角的外观一致性,同时移除瞬态内容。大量实验表明,我们的方法优于现有的前馈方法,并取得了与先前基于逐场景优化的方法相竞争的结果。

英文摘要

Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.

2606.12368 2026-06-12 cs.CV 版本更新

DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images

DepthMaster: 统一透视与全景图像的单目深度估计

Pengfei Wang, Shihao Wang, Liyi Chen, Zhiyuan Ma, Guowen Zhang, Lei Zhang

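AI summary: Proposes DepthMaster, a unified framework that decomposes panoramas into overlapping perspective patches and adds a Correspondence Consistency Loss plus virtual projection cameras as geometric priors, addressing the geometric discrepancy and data scarcity that separate perspective and panoramic depth estimation; it achieves state-of-the-art zero-shot performance on 13 datasets.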

详情
AI中文摘要

虽然单目深度估计取得了显著进展,但对于窄视场(FoV)透视图像和$360^\circ$全景图像实现通用的度量深度估计仍然是一个未解决的挑战。现有方法通常针对特定相机类型设计,难以在多样化场景中生成准确的度量深度。这一限制源于两个关键挑战:透视相机与全景相机之间的固有几何差异,以及带有度量标注的全景训练数据的稀缺性。在这项工作中,我们引入了DepthMaster,一个统一的度量深度估计框架。我们不采用专门网络来学习球形畸变,而是通过将全景图像分解为重叠的透视块来重新表述问题。关键的是,与先前依赖临时架构修改来处理边界的基于投影的方法不同,我们引入了一种新颖的对应一致性损失(CCL),并注入虚拟投影相机作为几何先验,从而能够无缝拼接这些块,同时避免专用算子并保持主干与标准Transformer设计高度兼容。该策略通过将所有输入统一为规范透视表示来解决几何差异,并通过直接从大量透视数据集中解锁强大的度量先验来有效规避数据稀缺问题。在仅包含一个全景数据集的混合数据集上训练后,DepthMaster在13个多样化数据集上实现了最先进的零样本性能,不仅在透视和全景领域超越了通用方法,还领先于领先的专家模型。

英文摘要

While monocular depth estimation has achieved significant progress, achieving generalized metric depth estimation for both narrow field-of-view (FoV) perspectives and $360^\circ$ panoramas remains an unsolved challenge. Existing methods are often tailored to specific camera types and struggle to produce accurate metric depth that generalizes across diverse settings. This limitation stems from two key challenges: the inherent geometric discrepancy between perspective and panoramic cameras, and the scarcity of panoramic training data with metric annotations. In this work, we introduce DepthMaster, a unified metric depth estimation framework. Rather than employing specialized networks to learn spherical distortions, we reformulate the problem by decomposing panoramic images into overlapping perspective patches. Crucially, distinct from prior projection-based methods that rely on ad-hoc architectural modifications to handle boundaries, we introduce a novel Correspondence Consistency Loss (CCL) and inject virtual projection cameras as geometric priors, allowing us to seamlessly stitch the patches while avoiding specialized operators and keeping the backbone largely compatible with standard Transformer designs. This strategy also resolves the geometric differences by unifying all inputs into a canonical perspective representation, and effectively circumvents data scarcity by directly unlocking powerful metric priors from vast perspective datasets. Trained on a mixed dataset that contains only one panorama dataset, DepthMaster achieves state-of-the-art zero-shot performance on 13 diverse datasets, outperforming not only universal methods but also leading specialist models in both perspective and panoramic domains.

2511.23030 2026-06-12 cs.RO cs.CV 版本更新

DiskChunGS: Large-Scale 3D Gaussian SLAM Through Chunk-Based Memory Management

DiskChunGS:基于分块内存管理的大规模3D高斯SLAM

Casimir Feldmann, Maximum Wilder-Smith, Vaishakh Patil, Michael Oechsle, Michael Niemeyer, Keisuke Tateno, Marco Hutter

发表机构 * Robotic Systems Lab, ETH Zurich(机器人系统实验室,瑞士苏黎世联邦理工学院) Google(谷歌)

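AI summary: Proposes DiskChunGS, which partitions the scene into spatial chunks and stores inactive regions on disk to get past GPU memory limits, enabling large-scale 3D Gaussian SLAM; it completes full-sequence reconstruction on multiple datasets with improved visual quality.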

详情
Journal ref
IEEE Robotics and Automation Letters, vol. 11, no. 4, 2026
AI中文摘要

近期3D高斯溅射(3DGS)的进展在实时渲染的新视角合成中展现了令人印象深刻的结果。然而,将3DGS与SLAM系统集成面临根本的可扩展性限制:方法受限于GPU内存容量,只能重建小规模环境。我们提出DiskChunGS,一种可扩展的3DGS SLAM系统,通过一种外核方法克服这一瓶颈,该方法将场景划分为空间块,并在GPU内存中仅维护活跃区域,同时将非活跃区域存储在磁盘上。我们的架构与现有的用于位姿估计和闭环检测的SLAM框架无缝集成,实现大规模全局一致的重建。我们在室内场景(Replica、TUM-RGBD)、城市驾驶场景(KITTI)以及资源受限的Nvidia Jetson平台上验证了DiskChunGS。我们的方法独特地完成了所有11个KITTI序列,没有出现内存故障,同时实现了卓越的视觉质量,证明了算法创新可以克服先前限制3DGS SLAM方法的内存约束。

英文摘要

Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated impressive results for novel view synthesis with real-time rendering capabilities. However, integrating 3DGS with SLAM systems faces a fundamental scalability limitation: methods are constrained by GPU memory capacity, restricting reconstruction to small-scale environments. We present DiskChunGS, a scalable 3DGS SLAM system that overcomes this bottleneck through an out-of-core approach that partitions scenes into spatial chunks and maintains only active regions in GPU memory while storing inactive areas on disk. Our architecture integrates seamlessly with existing SLAM frameworks for pose estimation and loop closure, enabling globally consistent reconstruction at scale. We validate DiskChunGS on indoor scenes (Replica, TUM-RGBD), urban driving scenarios (KITTI), and resource-constrained Nvidia Jetson platforms. Our method uniquely completes all 11 KITTI sequences without memory failures while achieving superior visual quality, demonstrating that algorithmic innovation can overcome the memory constraints that have limited previous 3DGS SLAM methods.
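
The out-of-core idea in this abstract (spatial chunks, active regions in memory, inactive ones on disk) can be sketched with a small in-memory/disk store. This is a simplified illustration under stated assumptions, not the DiskChunGS implementation: `ChunkStore`, the cubic chunking, the cell-distance eviction rule, and the pickle files are all hypothetical choices, and ordinary RAM stands in for GPU memory.

```python
import math
import os
import pickle
import tempfile

class ChunkStore:
    """Keep chunks near the camera in memory; spill the rest to disk."""

    def __init__(self, chunk_size, active_radius, directory):
        self.chunk_size = chunk_size          # edge length of a cubic chunk (metres)
        self.active_radius = active_radius    # chunks within this many cells stay in memory
        self.directory = directory
        self.active = {}                      # chunk key -> list of primitives

    def key_for(self, position):
        return tuple(math.floor(c / self.chunk_size) for c in position)

    def _path(self, key):
        return os.path.join(self.directory, "chunk_{}_{}_{}.pkl".format(*key))

    def _load(self, key):
        path = self._path(key)
        if os.path.exists(path):
            with open(path, "rb") as f:
                return pickle.load(f)
        return []

    def add(self, position, primitive):
        key = self.key_for(position)
        if key not in self.active:
            self.active[key] = self._load(key)
        self.active[key].append(primitive)

    def update_camera(self, position):
        """Evict far chunks to disk and load nearby chunks that exist on disk."""
        cam = self.key_for(position)
        for key in list(self.active):
            if max(abs(a - b) for a, b in zip(key, cam)) > self.active_radius:
                with open(self._path(key), "wb") as f:
                    pickle.dump(self.active.pop(key), f)
        r = self.active_radius
        for dx in range(-r, r + 1):
            for dy in range(-r, r + 1):
                for dz in range(-r, r + 1):
                    key = (cam[0] + dx, cam[1] + dy, cam[2] + dz)
                    if key not in self.active and os.path.exists(self._path(key)):
                        self.active[key] = self._load(key)

store = ChunkStore(chunk_size=10.0, active_radius=1, directory=tempfile.mkdtemp())
store.add((1.0, 2.0, 0.5), "gaussian_a")      # falls in chunk (0, 0, 0)
store.add((95.0, 2.0, 0.5), "gaussian_b")     # falls in chunk (9, 0, 0)
store.update_camera((0.0, 0.0, 0.0))          # chunk (9, 0, 0) is written to disk
store.update_camera((93.0, 0.0, 0.0))         # it is reloaded; chunk (0, 0, 0) is evicted
```

Memory use is bounded by the number of chunks within `active_radius` of the camera rather than by the size of the whole map, which is the property that lets reconstruction scale past device memory.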

2603.05965 2026-06-12 cs.RO cs.CV 版本更新

PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition

PROBE: 具有解析平移鲁棒性的概率占用BEV编码用于3D地点识别

Jinseop Lee, Byoungho Lee, Gichul Yoo

发表机构 * SK Intellix

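AI summary: Proposes PROBE, a learning-free LiDAR place descriptor that analytically marginalizes over continuous translations via the polar Jacobian, giving a distance-adaptive angular uncertainty and high accuracy in cross-sensor generalization.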

Comments 8 pages, 8 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L). © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

详情
AI中文摘要

我们提出PROBE(概率占用BEV编码),一种无学习的LiDAR地点识别描述符,将每个BEV单元的占用建模为伯努利随机变量。PROBE不依赖于离散点云扰动,而是通过极坐标雅可比解析边缘化连续笛卡尔平移,在O(R·S)时间内得到距离自适应角度不确定性σ_θ = σ_t / r。主要参数σ_t表示以米为单位的预期平移不确定性,这是一种与传感器无关的物理量,增强了跨传感器泛化能力,同时减少了对每个数据集大量调参的需求。成对相似性结合了伯努利-KL Jaccard与指数不确定性门控以及基于FFT的高度余弦相似性用于旋转对齐。在涵盖四种不同LiDAR类型的四个数据集上评估,PROBE在多会话评估中实现了手工描述符中最高的精度,并且在单会话性能上与手工和监督基线相比具有竞争力。源代码和补充材料可在 https://sites.google.com/view/probe-pr 获取。

英文摘要

We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell's occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty $\sigma_\theta = \sigma_t / r$ in $\mathcal{O}(R \cdot S)$ time. The primary parameter $\sigma_t$ represents the expected translational uncertainty in meters, a sensor-independent physical quantity that enhances cross-sensor generalization while reducing the need for extensive per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance relative to both handcrafted and supervised baselines. The source code and supplementary materials are available at https://sites.google.com/view/probe-pr.
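
The relation $\sigma_\theta = \sigma_t / r$ follows from the polar angle $\theta = \operatorname{atan2}(y, x)$ having gradient magnitude $1/r$, so isotropic translation noise of standard deviation $\sigma_t$ maps to an angular standard deviation of $\sigma_t / r$ to first order. The sketch below checks this numerically with a seeded Monte Carlo simulation; the function names and the chosen values of $\sigma_t$ and $r$ are illustrative assumptions, and nothing here reproduces PROBE's descriptor itself.

```python
import math
import random

def angular_sigma(sigma_t, r):
    """First-order angular std. dev. (radians) from translation noise sigma_t at range r."""
    return sigma_t / r

def simulated_angular_sigma(sigma_t, r, n_samples, rng):
    """Monte Carlo estimate: perturb a point at range r by isotropic 2D Gaussian noise."""
    angles = []
    for _ in range(n_samples):
        x = r + rng.gauss(0.0, sigma_t)
        y = rng.gauss(0.0, sigma_t)
        angles.append(math.atan2(y, x))
    mean = sum(angles) / n_samples
    return math.sqrt(sum((a - mean) ** 2 for a in angles) / n_samples)

rng = random.Random(0)
sigma_t = 0.5                                  # metres, illustrative value
near = angular_sigma(sigma_t, 5.0)             # wide angular spread close to the sensor
far = angular_sigma(sigma_t, 50.0)             # ten times narrower at ten times the range
estimate = simulated_angular_sigma(sigma_t, 50.0, 20000, rng)
```

The same translational error therefore blurs near-range cells across many angular sectors and far-range cells across few, which is why a single physical parameter in metres can stand in for per-ring tuning.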

8. 医学影像与生物视觉 16 篇

2606.12635 2026-06-12 cs.CV 新提交

CD-RCM: Generalizable Continuous-Depth Novel View Synthesis for Reflectance Confocal Microscopy

CD-RCM:面向反射共聚焦显微镜的泛化连续深度新视角合成

Tooba Imtiaz, Milind Rajadhyaksha, Kivanc Kose, Jennifer Dy

发表机构 * Northeastern University(东北大学) Memorial Sloan Kettering Cancer Center(纪念斯隆凯特琳癌症中心)

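AI summary: For the anisotropic 3D volumes produced by reflectance confocal microscopy, proposes CD-RCM, the first RCM-specific novel-view synthesis method; a feed-forward model predicts continuous-depth slices from a sparse z-stack, with high-fidelity synthesis in under a second.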

详情
AI中文摘要

反射共聚焦显微镜(RCM)通过获取连续深度处的正面图像,形成稀疏z-stack,从而提供人体皮肤体内的无创、细胞分辨率“光学活检”。由于光学限制,这些堆栈是各向异性的3D体积,横向分辨率(0.5 $\mu$m)比轴向分辨率(由光学切片定义,3 $\mu$m)高约6倍,限制了组织解释。我们的目标是通过插值中间切片并使3D体积各向同性,提供连续深度可视化。这种表示允许任意方向切片,包括类似组织病理学的横截面检查,无需针对每位患者进行优化。为此,我们引入了首个RCM特定的新视角合成(NVS)方法CD-RCM,这是一种前馈模型,可从稀疏采样的RCM堆栈预测逼真的、未见过的深度。经典神经渲染方法侧重于从表面级多视角观测进行重建。与表面级相机视图不同,RCM可以获取组织表面以下至200 $\mu$m的光学切片正面图像。然而,在可视化RCM堆栈时,较浅切片(朝向表面)的观测会遮挡较深切片。这种独特的轴向成像几何和层依赖性解剖结构促使我们开发了定制的架构和训练框架,明确考虑了RCM的深度分辨、遮挡成像物理特性。实验表明,CD-RCM实现了高保真新视角合成,推理时间低于一秒。

英文摘要

Reflectance confocal microscopy (RCM) provides noninvasive, cellular-resolution "optical biopsies" of human skin in vivo by acquiring en-face images at successive depths, forming a sparse z-stack. Due to optical limitations, these stacks are anisotropic 3D volumes whose lateral resolution (0.5 $\mu$m) is $\sim$6 times finer than the axial resolution (3 $\mu$m, set by the optical sectioning), which limits tissue interpretation. Our goal is to provide continuous-depth visualization by interpolating intermediate sections and making the 3D volume isotropic. Such a representation permits arbitrary-direction sectioning, including histopathology-like cross-sectional examination, without requiring per-patient optimization. To that end, we introduce the first RCM-specific novel-view synthesis (NVS) approach, CD-RCM, a feedforward model that predicts realistic, unseen depths from sparsely sampled RCM stacks. Classical neural rendering methods focus on reconstruction from surface-level multi-view observations. In contrast to surface-level camera views, RCM can acquire optically sectioned en-face images of tissue up to 200 $\mu$m below the surface. However, during visualization of the RCM stacks, observations of the shallower sections (towards the surface) obscure the deeper ones. This unique axial imaging geometry and layer-dependent anatomical organization motivated our development of a tailored architectural and training framework that explicitly accounts for RCM's depth-resolved, occlusive imaging physics. Experiments demonstrate that CD-RCM achieves high-fidelity novel-view synthesis with sub-second inference time.
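
The anisotropy figures above translate into a simple slice count. Assuming slices are acquired every 3 µm and the target spacing equals the 0.5 µm lateral figure, each gap needs 3 / 0.5 = 6 steps, that is, five new slices between neighbouring acquired ones. The sketch below computes the target depths and includes a pixel-wise linear blend purely as a naive baseline for comparison; it is not how CD-RCM synthesizes slices, and both function names are hypothetical.

```python
def isotropic_depths(acquired_depths_um, lateral_res_um):
    """Depths (in micrometres) at which slices are needed for isotropic sampling."""
    depths = []
    for start, stop in zip(acquired_depths_um, acquired_depths_um[1:]):
        n_steps = round((stop - start) / lateral_res_um)
        depths += [start + i * lateral_res_um for i in range(n_steps)]
    depths.append(acquired_depths_um[-1])
    return depths

def linear_slice(slice_a, slice_b, weight):
    """Pixel-wise linear blend of two slices; a naive baseline, not CD-RCM."""
    return [[(1.0 - weight) * a + weight * b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(slice_a, slice_b)]

acquired = [0.0, 3.0, 6.0]                 # z-stack sampled every 3 um (assumed spacing)
targets = isotropic_depths(acquired, 0.5)  # 0.5 um matches the lateral resolution
```

For a stack covering the full 200 µm depth range mentioned in the abstract, the same arithmetic implies roughly six times as many slices as were acquired, which is why a fast feed-forward predictor matters.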

2606.13032 2026-06-12 cs.CV 新提交

GeoCFNet: Geometry-Aware Confidence Field Network for Robot-Assisted Endoscopic Submucosal Dissection

GeoCFNet: 几何感知置信场网络用于机器人辅助内镜黏膜下剥离术

Rui Tang, Guankun Wang, Long Bai, Haochen Yin, Huxin Gao, Jiewen Lai, Jiazheng Wang, Hongliang Ren

发表机构 * Department of Electronic Engineering, The Chinese University of Hong Kong(香港中文大学电子工程系) Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co. Ltd.(华为技术有限公司中央研究院2012实验室理论实验室)

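AI summary: Proposes GeoCFNet, which addresses dissection guidance in dynamic endoscopic scenes through geometry-aware confidence field estimation, combining Token-Differentiated Fusion with Geometry-Aware Spatial Regularization for accurate and stable confidence field prediction.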

Comments IEEE ICIA 2026

详情
AI中文摘要

先进的手术机器人技术使机器人辅助内镜黏膜下剥离术(ESD)成为整块切除大病变的有前景方法,具有降低复发率和改善长期预后的潜力。然而,ESD的技术复杂性和并发症风险需要稳定精确的视觉引导,以维持准确的解剖通道和安全组织边界。密集置信场通过描述优选解剖区域及其向周围组织的空间过渡,为此提供了有效表示。然而,在动态内镜场景中,由于烟雾、镜面高光、组织变形、弱纹理以及目标区域的薄几何结构,可靠的置信场估计仍然具有挑战性。为解决这些问题,我们将解剖引导表述为几何感知置信场估计问题,并提出GeoCFNet,一种基于预训练DINOv3骨干网络的几何感知置信场网络。GeoCFNet集成了Token差异化融合模块以聚合类别令牌上下文与密集补丁表示、用于置信回归的SegFormer解码器,以及几何感知空间正则化(GASR)以保持空间一致性和局部几何过渡。实验结果表明,GeoCFNet实现了RMSE 0.0480、PSNR 27.1995、SSIM 0.3397和CC 0.2466,表明其能够为机器人辅助ESD引导提供精确且几何稳定的置信场估计。

英文摘要

Advanced surgical robotics has made robot-assisted endoscopic submucosal dissection (ESD) a promising approach for the en-bloc resection of large lesions, with the potential to reduce recurrence and improve long-term outcomes. However, the technical complexity and risk of complications in ESD demand stable and precise visual guidance to maintain an accurate dissection corridor and a safe tissue margin. Dense confidence fields provide an effective representation for this purpose by describing both the preferred dissection region and its spatial transition to surrounding tissue. However, reliable confidence field estimation remains challenging in dynamic endoscopic scenes due to smoke, specular highlights, tissue deformation, weak texture, and the thin geometric structure of the target region. To address these challenges, we formulate dissection guidance as a geometry-aware confidence field estimation problem and propose GeoCFNet, a geometry-aware confidence field network built on a pretrained DINOv3 backbone. GeoCFNet integrates a Token-Differentiated Fusion module to aggregate class-token context with dense patch representations, a SegFormer decoder for confidence regression, and Geometry-Aware Spatial Regularization (GASR) to preserve spatial coherence and local geometric transitions. Experimental results show that GeoCFNet achieves RMSE 0.0480, PSNR 27.1995, SSIM 0.3397, and CC 0.2466, indicating accurate and geometrically stable confidence field estimation for robot-assisted ESD guidance.

2606.13096 2026-06-12 cs.CV 新提交

Unified MRI Brain Image Translation via Hierarchical Tumor Structure Comparison

基于层级肿瘤结构比较的统一MRI脑图像翻译

Yupeng Cai, Jia Wei, Jianlong Zhou

发表机构 * South China University of Technology(华南理工大学) UTS Data Science Institute, University of Technology Sydney(悉尼科技大学UTS数据科学研究所)

AI总结 提出HTSCGAN模型,通过层级肿瘤结构比较和多种损失函数,提高多模态MRI脑图像翻译质量,在BraTS2020/2021上表现优异。

详情
AI中文摘要

利用可用模态进行多模态MRI脑图像翻译在现代医学中具有重要的实际意义,为疾病的早期诊断、治疗计划和结果评估提供有力支持。为此,确保翻译后肿瘤区域的保真度至关重要。然而,现有的脑图像翻译方法忽略了不同肿瘤区域的结构信息,而利用这些信息有助于翻译模型提高翻译图像的质量和临床适用性。在这项工作中,我们提出了一种新颖的翻译模型HTSCGAN,这是一个统一的多模态脑图像翻译生成对抗模型,整合了肿瘤区域内的结构信息,旨在提高脑图像翻译的质量。具体地,生成器采用三个不同补丁大小的补丁对比模块(PCM)来捕获肿瘤区域的层级结构信息。此外,使用预训练的补丁分类器(PC)和预训练的结构感知编码器(SAE),分别通过补丁分类损失和肿瘤感知损失,使生成的图像包含与真实图像相同的肿瘤区域结构。在BraTS2020和BraTS2021上的实验表明,我们的模型在翻译任务和下游分割任务中均表现出强大的性能,突显了其在提高翻译脑图像质量和临床相关性方面的有效性。我们的代码可在以下网址获取:https://anonymous.4open.science/r/HTSCGAN

英文摘要

Multi-modal MRI brain image translation via available modalities holds significant practical importance in modern medicine, providing robust support for early diagnosis, treatment planning, and outcome assessment of diseases. For this purpose, it is important to ensure the fidelity of the tumor regions after translation. However, existing brain image translation methods ignore the structural information of different tumor regions, which could help translation models enhance the quality and clinical applicability of the translated images. In this work, we propose a novel translation model called HTSCGAN, a unified multi-modal brain image translation generative adversarial model that integrates the structural information within tumor regions to improve the quality of brain image translation. Specifically, the generator employs three Patch Contrast Modules (PCMs) with different patch sizes to capture the hierarchical structural information of the tumor regions. In addition, a pretrained Patch Classifier (PC) and a pretrained Structure-Aware Encoder (SAE) are employed to encourage the generated image to contain the same tumor region structure as the ground truth image, via a patch classification loss and a tumor perceptual loss, respectively. The experiments on BraTS2020 and BraTS2021 demonstrate strong performance of our model in both translation tasks and downstream segmentation tasks, highlighting its effectiveness in enhancing the quality and clinical relevance of the translated brain images. Our code is available at https://anonymous.4open.science/r/HTSCGAN.

2606.13135 2026-06-12 cs.CV cs.AI 新提交

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

皮肤肿瘤皮肤镜图像的级联分类:可控敏感度与外部临床验证

Elena S. Kozachok, Sergey S. Seregin, Aleksandr V. Kozachok, Ilya P. Latyshev, Oleg I. Samovarov

发表机构 * Ivannikov Institute for System Programming of the Russian Academy of Sciences (ISP RAS)(俄罗斯科学院伊万尼科夫系统编程研究所) Orel Oncological Dispensary(奥廖尔肿瘤医院)

AI总结 本研究比较了四种深度学习架构在皮肤镜图像分类中的表现,提出一种两阶段级联分类方案,通过可调分诊阈值实现敏感度控制,并在外部临床数据集上验证了泛化差距。

Comments 28 pages, 8 figures, 10 tables

详情
AI中文摘要

目的:比较皮肤肿瘤皮肤镜图像的深度学习架构和分类方案,并评估从开放国际数据集到俄罗斯临床独立数据集的泛化能力。方法:在三种方案中比较四种架构(ViT-B/16、Swin-S、ConvNeXt-S、EfficientNetV2-S):二分类(恶性/良性)、单阶段四分类(良性、MEL、SCC、BCC)和两阶段级联(二分类分诊,然后三分类MEL/SCC/BCC)。所有模型使用ImageNet预训练权重和单一增强协议,在聚合的开放ISIC Archive数据上训练,并在内部保留样本和两个临床数据集(Melanoscope AI移动系统;谢切诺夫大学)上评估。结果:内部二分类阶段达到ROC-AUC 0.952-0.966;在谢切诺夫大学数据集上降至0.797-0.893,敏感度降至0.53-0.67,ECE从0.02升至0.27-0.39,且低估恶性,量化了排序和校准中的泛化差距。配对检验确认了临床数据上的一个架构间结果:二分类阶段ViT-B/16的缺陷(p<0.05);在区分阶段,没有架构显示出显著优势。级联方案在大多数架构上提高了宏F1,但仅对ViT-B/16显著,通过恢复被分配到主导良性类别的恶性病变。在ISIC MILK10k上,直接11分类的平均类别敏感度为0.525。结论:可调分诊阈值提供了标准单阶段(argmax)分类无法实现的敏感度控制,并更好地再现了临床鉴别诊断逻辑。持续的泛化差距要求在部署前进行外部临床验证和重新校准。

英文摘要

Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.
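
The two-stage cascade and its tunable triage threshold can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the class ordering, and the default threshold of 0.5 are assumptions made for illustration. The point it shows is that lowering the triage threshold sends more lesions to the second stage and so raises malignant sensitivity, which a single-stage argmax does not allow.

```python
from typing import Sequence

# Hypothetical ordering of the second-stage classes (assumption).
DIFF_CLASSES = ("MEL", "SCC", "BCC")


def cascade_predict(p_malignant: float,
                    diff_probs: Sequence[float],
                    triage_threshold: float = 0.5) -> str:
    """Stage 1: binary triage. Stage 2: three-class differentiation."""
    if p_malignant < triage_threshold:
        return "benign"
    # Only lesions flagged as malignant reach the second stage.
    best = max(range(len(DIFF_CLASSES)), key=lambda i: diff_probs[i])
    return DIFF_CLASSES[best]


def triage_sensitivity(scores: Sequence[float],
                       is_malignant: Sequence[bool],
                       triage_threshold: float) -> float:
    """Fraction of truly malignant lesions flagged at the triage stage."""
    positives = [s for s, m in zip(scores, is_malignant) if m]
    if not positives:
        return float("nan")
    flagged = sum(1 for s in positives if s >= triage_threshold)
    return flagged / len(positives)
```

With stage-one scores `[0.9, 0.3, 0.6, 0.1]` where only the first two lesions are malignant, a threshold of 0.5 flags one of the two malignant lesions, while a threshold of 0.25 flags both, at the cost of more benign lesions entering the second stage.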

2606.13188 2026-06-12 cs.CV cs.AI 新提交

Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework

Transformer引导的图注意力直接心脏网格重建:一种结构数字孪生框架

Abhishek H S, Akash Ganamukhi, Abhimanyu Suresh, Aditya G Hiremath, Prasad B Honnavalli, Adithya Balasubramanyam

发表机构 * CAVE Labs, C-IoT, Dept. of CSE, PES University(PES大学计算机科学与工程系C-IoT实验室CAVE实验室) C-IoT, Dept. of CSE, PES University(PES大学计算机科学与工程系C-IoT实验室)

AI总结 提出端到端网络,结合3D Swin Transformer和GAT,直接从医学图像生成平滑的心脏表面网格,避免传统后处理,在MM-WHS 2017上实现1.8 mm平均Chamfer距离。

详情
AI中文摘要

构建患者特异性心脏模型是精准心脏病学的核心,但这些模型在临床应用中始终面临同一障碍:网格生成缓慢、混乱且令人沮丧。标准工作流程——分割图像、运行Marching Cubes、然后手动清理结果——耗时、操作者间不一致,并且需要大多数临床团队不具备的专业知识。我们采取了一种根本不同的方法。我们不将分割和网格生成视为两个独立问题,而是训练一个单一的端到端网络,直接从原始3D医学图像生成平滑、可用于模拟的心脏表面网格。核心是一个3D Swin Transformer编码器-解码器,从CT或MRI体积中提取体积特征,配以一个图注意力网络(GAT)头,迭代变形模板网格以拟合患者心脏边界。我们在MM-WHS 2017基准上使用CT和MRI进行了测试。分割分数具有竞争力(CT上Dice为0.84,MRI上为0.83),但主要关注点是网格质量:平均Chamfer距离为1.8 mm,95%分位数表面距离低于5 mm。每个网格通过单次前向传播生成——无需Marching Cubes、平滑滤波器或手动清理。我们认为,对于心脏数字孪生管道,几何保真度和拓扑正确性比像素级Dice分数更重要。通过消除后处理瓶颈,该方法使患者特异性心脏模拟在临床使用中变得更加可行。

英文摘要

Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps running into the same wall: mesh generation is slow, messy, and frustrating. The standard workflow -- segmenting the image, running Marching Cubes, and then manually cleaning up the result -- is time-consuming, inconsistent across operators, and demands specialist knowledge most clinical teams do not have. We take a fundamentally different approach. Instead of treating segmentation and mesh generation as two separate problems, we train a single end-to-end network that goes directly from a raw 3D medical image to a smooth, simulation-ready cardiac surface mesh. The core is a 3D Swin Transformer encoder-decoder that extracts volumetric features from CT or MRI volumes, paired with a Graph Attention Network (GAT) head that iteratively deforms a template mesh to fit the patient's cardiac boundary. We tested on the MM-WHS 2017 benchmark using both CT and MRI. Segmentation scores were competitive (Dice of 0.84 on CT, 0.83 on MRI), but the primary focus is mesh quality: mean Chamfer distance of 1.8 mm, with 95th-percentile surface distance below 5 mm. Every mesh is produced in a single forward pass -- no Marching Cubes, no smoothing filters, no manual cleanup. We argue that for cardiac digital twin pipelines, geometric fidelity and topological correctness matter more than pixel-level Dice scores. By removing the post-processing bottleneck, this approach makes patient-specific cardiac simulation substantially more accessible for clinical use.
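
For readers unfamiliar with the mesh metric quoted above, the following Python sketch computes one common symmetric form of the Chamfer distance between two point sets sampled from surfaces. The abstract does not say which variant the authors use (squared or unsquared distances, sum or average of the two directions), so the choice here, the average of the two mean nearest-neighbour distances, is an assumption, and the function names are illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _mean_nearest(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Mean distance from each point in src to its nearest point in dst."""
    return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Chamfer distance: average of the two directed means.

    Brute force, O(len(a) * len(b)); adequate for small point samples.
    The result is in the same unit as the coordinates (e.g. mm).
    """
    return 0.5 * (_mean_nearest(a, b) + _mean_nearest(b, a))
```

For two point sets that are exact copies shifted by 1 unit along one axis, this returns 1.0; for identical sets it returns 0.0.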

2606.13315 2026-06-12 cs.CV eess.IV 新提交

Masked and Predictive Self-Supervised Foundation Models for 3D Brain MRI

用于3D脑部MRI的掩码和预测自监督基础模型

Esra Ergün, Hersh Chandarana, Dan Sodickson, Gözde Ünal

发表机构 * Istanbul Technical University(伊斯坦布尔理工大学) NYU Langone Health(纽约大学朗格尼医学中心)

AI总结 研究自监督基础模型在MRI疾病检测中的应用,提出频谱域重建损失(MAE)和方差-协方差正则化(JEPA)两种方法,在五个下游任务中验证了目标设计对任务结构匹配的重要性。

详情
AI中文摘要

自监督基础模型在医学影像中展现出巨大潜力。然而,现有的MRI基础模型研究主要强调分割和密集预测任务,而针对基于MRI的疾病检测的自监督基础模型的系统研究仍然有限。在这项工作中,我们研究了两种主要的自监督预训练范式用于基于MRI的疾病检测:通过掩码自编码器(MAE)的基于重建的学习和通过联合嵌入预测架构(JEPA)的预测表示学习。我们通过引入一种新颖的MAE频谱域重建损失来增强对细粒度解剖结构的敏感性,并通过在我们的JEPA框架中集成方差-协方差正则化(VCR)来鼓励去相关的潜在表示,从而研究辅助目标的作用。我们的模型在对比度无关的设置下,在异质单对比度MRI体积上进行预训练,无需模态拼接。在五个下游疾病检测任务中,我们的结果突出了自监督目标设计对医学基础模型预训练的重要性,表明每个目标的下游收益由其与任务结构的相关性决定。具体来说,当下游判别信号以强高频解剖结构为特征时,频谱正则化带来最大的改进;而当判别信息跨越多个去相关的特征维度时,协方差正则化最为有益。具有频谱域监督的MAE在基于MRI的疾病检测中始终实现优越的下游性能。这些发现表明,医学影像中的自监督目标编码了特定的偏差,其下游收益根本上取决于任务的结构。

英文摘要

Self-supervised foundation models have shown strong promise in medical imaging. However, existing MRI foundation-model studies have primarily emphasized segmentation and dense prediction tasks, while systematic investigation of self-supervised foundation models for MRI-based disease detection remains limited. In this work, we investigate two major self-supervised pretraining paradigms for MRI-based disease detection: reconstruction-based learning via Masked Autoencoders (MAE) and predictive representation learning via Joint Embedding Predictive Architectures (JEPA). We study the role of auxiliary objectives by introducing a novel spectral-domain reconstruction loss for MAE to enhance sensitivity to fine-grained anatomical structure, and by integrating variance--covariance regularization (VCR) within our JEPA framework to encourage decorrelated latent representations. Our models are pretrained on heterogeneous single-contrast MRI volumes in a contrast-agnostic setting, without modality concatenation. Across five downstream disease detection tasks, our results highlight the importance of self-supervised objective design for medical foundation model pretraining, demonstrating that the downstream benefit of each objective is determined by its relevance to the task's structure. Specifically, spectral regularization yields the largest improvements when the downstream discriminative signal is characterized by strong high-frequency anatomical structures, while covariance regularization is most beneficial when discriminative information spans multiple decorrelated feature dimensions. MAE with spectral-domain supervision consistently achieves superior downstream performance for MRI-based disease detection. These findings suggest that self-supervised objectives in medical imaging encode specific biases, and their downstream benefit is fundamentally conditioned on the task's structure.

2606.13341 2026-06-12 cs.CV cs.AI physics.med-ph 新提交

Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis

双域等变生成对抗网络用于多模态CT-PET合成

Gabriel Steele, Alzahra Altalib, Alessandro Perelli

发表机构 * arXiv

AI总结 提出双域等变生成对抗网络(DDE-GAN),联合空间与频域学习并融入旋转等变性,实现高保真多模态CT-PET图像合成。

Comments 4 pages, 3 figures, 1 table, 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

详情
AI中文摘要

我们提出了一种用于多模态CT-PET图像合成的双域等变生成对抗网络(DDE-GAN)。传统的基于GAN的方法通常仅在空间域中操作,忽略了几何一致性,导致结构保真度有限。DDE-GAN通过联合学习空间域和频率(傅里叶)域,捕捉互补的解剖和频谱信息,解决了这些挑战。此外,嵌入在CT和PET测量物理中的旋转等变性被整合到生成器和判别器的损失中,以确保在旋转下的一致响应,从而提高解剖准确性。一种分层双域训练策略通过多阶段损失函数强制实现域内和域间一致性。在HECKTOR 2022 CT-PET数据集上的评估表明,DDE-GAN在CT-PET图像合成中取得了优于基线模型的合成质量。结果表明,将双域学习与几何等变性相结合,显著增强了多模态图像合成的准确性和鲁棒性,为PET补全和数据增强等实际应用提供了可能。

英文摘要

We present a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. Traditional GAN-based approaches often operate solely in the spatial domain and ignore geometric consistency, resulting in limited structural fidelity. DDE-GAN addresses these challenges by jointly learning from both spatial and frequency (Fourier) domains, capturing complementary anatomical and spectral information. Furthermore, the rotational equivariance embedded in the physics of the CT and PET measurements is integrated into the losses of both the generator and the discriminator to ensure consistent responses under rotations, improving anatomical accuracy. A hierarchical dual-domain training strategy enforces intra- and inter-domain consistency through multi-stage loss functions. Evaluated on the HECKTOR 2022 CT-PET dataset, DDE-GAN achieves superior synthesis quality over baseline models for CT-PET image synthesis. The results demonstrate that combining dual-domain learning with geometric equivariance substantially enhances multimodal image synthesis accuracy and robustness, enabling practical applications in PET completion and data augmentation.

2606.13562 2026-06-12 cs.CV cs.AI 新提交

Contrast-Informed Augmentation and Domain-Adversarial Training for Adult-to-Neonatal MR Reconstruction Generalization

对比信息增强和域对抗训练用于成人到新生儿MR重建泛化

Stephen Moore, Lara Leijser, Richard Frayne, Roberto Souza

发表机构 * University of Calgary(卡尔加里大学) Seaman Family MR Research Centre, Foothills Medical Centre(Seaman家族磁共振研究中心,山麓医疗中心) Hotchkiss Brain Institute, University of Calgary(Hotchkiss脑研究所,卡尔加里大学) Pediatrics, Division of Neonatology, University of Calgary(卡尔加里大学儿科学系新生儿科) Alberta Children’s Hospital Research Institute, University of Calgary(阿尔伯塔儿童医院研究所,卡尔加里大学) Radiology and Clinical Neuroscience, University of Calgary(卡尔加里大学放射学与临床神经科学系) Electrical and Software Engineering, University of Calgary(卡尔加里大学电气与软件工程系)

AI总结 研究对比信息增强和域对抗训练提升E2E-VarNet从成人到新生儿MR重建的泛化能力,在加速因子R=4和R=8下,混合域对抗训练在SSIM和PSNR指标上表现最优。

Comments 24 pages, 1 table, 7 figures

详情
AI中文摘要

目的:研究对比信息数据增强和域对抗训练是否能改善E2E-VarNet从成人到新生儿的泛化能力。方法:研究了三种训练方案:(1) 仅使用未增强的成人数据进行成人单独训练,(2) 使用配对的未增强和新生儿信息增强的成人数据进行混合训练,(3) 使用域对抗目标进行混合训练。模型在回顾性欠采样的多线圈成人T2加权脑MR数据上训练,并在新生儿和成人测试数据上以加速因子$R=4$和$R=8$进行评估,使用定量指标和定性评估。特征分析评估了域对抗训练是否改变了未增强成人、增强成人和新生儿测试样本的潜在表示。结果:在新生儿数据上评估时,混合训练(Mixed)和混合域对抗训练(Mixed-DAT)优于仅未增强的成人单独训练(Unaug-Only)。在R=4时,Mixed-DAT取得最佳性能(SSIM = 0.924 +/- 0.027,PSNR = 33.98 +/- 1.15 dB)。在R=8时,Mixed-DAT在SSIM指标上表现最佳(0.848 +/- 0.031,对比Unaug-Only的0.766 +/- 0.037和Mixed的0.814 +/- 0.035),而Mixed在PSNR指标上表现最佳(29.56 +/- 0.83 dB,对比Unaug-Only的26.26 +/- 0.78 dB和Mixed-DAT的29.43 +/- 0.83 dB)。t-SNE图的定性评估表明,Mixed-DAT增加了未增强成人、增强成人和新生儿测试数据的潜在表示之间的重叠。结论:对比信息增强和域对抗训练改善了基于深度学习的MR重建从成人到新生儿的泛化能力。这些发现表明,对比信息数据增强结合对抗训练可能提高欠采样新生儿MR重建中对域偏移的鲁棒性。

英文摘要

Purpose: To investigate whether contrast-informed data augmentation and domain-adversarial training improve the adult-to-neonatal generalization of the E2E-VarNet. Methods: Three training regimes were investigated: (1) adult-only training with unaugmented adult data, (2) mixed training with paired unaugmented and neonatal-informed augmented adult data, and (3) mixed training with a domain-adversarial objective. Models were trained on retrospectively undersampled multi-coil adult T2-weighted brain MR data and evaluated on neonatal and adult test data at acceleration factors $R=4$ and $R=8$ using quantitative metrics and qualitative evaluation. Feature analyses assessed whether domain-adversarial training altered the latent representations of unaugmented adult, augmented adult, and neonatal test samples. Results: Mixed training (Mixed) and mixed domain-adversarial training (Mixed-DAT) outperformed unaugmented adult-only training (Unaug-Only) when evaluated on neonatal data. At R=4, Mixed-DAT achieved the best performance (SSIM = 0.924 +/- 0.027, PSNR = 33.98 +/- 1.15 dB). At R=8, Mixed-DAT performed best when measured using SSIM (0.848 +/- 0.031 vs. 0.766 +/- 0.037 for Unaug-Only and 0.814 +/- 0.035 for Mixed) and Mixed performed best when measured using PSNR (29.56 +/- 0.83 dB vs. 26.26 +/- 0.78 dB for Unaug-Only and 29.43 +/- 0.83 dB for Mixed-DAT). Qualitative assessment of t-SNE plots suggested that Mixed-DAT increased the overlap among the latent representations of the unaugmented adult, augmented adult, and neonatal test data. Conclusion: Contrast-informed augmentation and domain-adversarial training improved adult-to-neonatal generalization of deep learning-based MR reconstruction. These findings suggest that contrast-informed data augmentation combined with adversarial training may improve robustness to domain shift in undersampled neonatal MR reconstruction.
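
The PSNR values quoted in dB above follow the standard definition PSNR = 10 log10(R^2 / MSE), where R is the data range of the reference image. The Python sketch below applies that definition to flattened pixel sequences. Taking R as the reference image's max minus min is an assumption, since the abstract does not state how the data range or normalization was chosen, and the function names are illustrative.

```python
import math
from typing import Optional, Sequence


def mse(ref: Sequence[float], rec: Sequence[float]) -> float:
    """Mean squared error between reference and reconstructed pixels."""
    return sum((r - x) ** 2 for r, x in zip(ref, rec)) / len(ref)


def psnr(ref: Sequence[float], rec: Sequence[float],
         data_range: Optional[float] = None) -> float:
    """Peak signal-to-noise ratio in dB.

    data_range defaults to max(ref) - min(ref), which is an assumption;
    pass an explicit value to match a specific evaluation protocol.
    """
    if data_range is None:
        data_range = max(ref) - min(ref)
    err = mse(ref, rec)
    if err == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(data_range ** 2 / err)
```

Because the scale is logarithmic, a gap of about 3 dB corresponds to roughly halving the mean squared error at a fixed data range.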

2606.12824 2026-06-12 eess.IV cs.AI cs.CV physics.med-ph 交叉投稿

Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata

采集状态作为结构化、可测量变量影响肺结节AI:核驱动的测量不稳定性和噪声驱动的检测脆弱性,DICOM元数据不可见

Daniel Soliman

发表机构 * Daniel Soliman, M.S(丹尼尔·索利曼,硕士)

AI总结 研究通过LUNA16训练的RetinaNet检测器,发现CT采集状态(重建核与噪声)独立影响AI的测量与检测性能,且无法从DICOM元数据恢复,提出采集感知的输入验证层。

详情
AI中文摘要

医学影像AI治理正在规范化:2026年ACR-SIIM实践参数建议本地验收测试和持续漂移监测,ACR Assess-AI注册使用DICOM元数据监测AI输出。我们认为在输出指标之下存在一个必要但目前未监测的层:输入研究是否保持在模型验证过的采集范围内。使用LUNA16训练的MONAI RetinaNet肺结节检测器,我们测试采集状态是否表现为结构化的可测量变量。在仅重建核不同的真实配对CT(NLST B30f vs B80f)上,核单独使AI测量的直径发生偏移,并在5.2%(155个结节中的8个)中翻转了Fleischner尺寸类别,而检测置信度不变(Wilcoxon p=0.22)。在受控的LIDC-IDRI扰动下,效应按轴分离:噪声轴降低检测置信度(p=5.9e-32,集中在6mm以下结节)但不影响测量,而频率/核轴破坏测量(p=8.6e-13)但不影响检测。一个4特征像素指纹恢复了重建身份(真实CT上患者级AUC约0.95,QIBA体模上0.995),而ConvolutionKernel DICOM标签无信息(不同重建标签相同)。核轴跨四个制造商传输(留一制造商AUC 0.94-0.98,与制造商内上限匹配)。因此采集状态映射到不同的AI故障模式:频率内容对应测量可靠性,噪声对应检测灵敏度,且无法从元数据恢复。采集感知的输入侧验证是现在进入影像AI认证的验收测试和漂移监测要求中缺失的层。

英文摘要

AI governance for medical imaging is formalizing: the 2026 ACR-SIIM Practice Parameter recommends local acceptance testing and ongoing drift monitoring, and the ACR Assess-AI registry monitors AI outputs using DICOM metadata for context. We argue that a necessary, currently unmonitored layer sits beneath output metrics: whether incoming studies remain within the acquisition envelope a model was validated on. Using a LUNA16-trained MONAI RetinaNet lung-nodule detector, we test whether acquisition state behaves as a structured, measurable variable. On real paired CT differing only in reconstruction kernel (NLST B30f vs B80f), kernel alone shifted AI-measured diameter and flipped a Fleischner size category in 5.2% (8 of 155) of nodules at fixed patient and acquisition, while detection confidence was unchanged (Wilcoxon p=0.22). Under controlled LIDC-IDRI perturbations the effects dissociated by axis: the noise axis degraded detection confidence (p=5.9e-32, concentrated in nodules under 6 mm) but not measurement, while the frequency/kernel axis corrupted measurement (p=8.6e-13) but not detection. A 4-feature pixel fingerprint recovered reconstruction identity (patient-level AUC about 0.95 on real CT, 0.995 on a QIBA phantom) where the ConvolutionKernel DICOM tag was uninformative (identical labels across reconstructions). The kernel axis transported across four manufacturers (leave-one-vendor-out AUC 0.94-0.98, matching the within-vendor ceiling). Acquisition state thus maps to distinct AI failure modes, frequency content to measurement reliability and noise to detection sensitivity, and is not recoverable from metadata. Acquisition-aware, input-side validation is the missing layer for the acceptance-testing and drift-monitoring requirements now entering imaging-AI accreditation.

2606.12953 2026-06-12 cs.AI cs.CV cs.LG eess.IV 交叉投稿

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

OpenMedQ:面向医学视觉语言模型的广泛开放预训练

Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert

发表机构 * Stanford University(斯坦福大学) Stanford University School of Medicine(斯坦福大学医学院) Ghent University(根特大学)

AI总结 提出OpenMedQ,在14个数据集(约335万样本)上预训练医学视觉语言模型,在PathVQA上BLEU-1达75.9,超越562B参数的Med-PaLM M,并在8个未见医学分类任务上取得最高平均macro-F1(0.757)。

Comments Medical Imaging with Deep Learning (MIDL) 2026, Short Paper Track

详情
AI中文摘要

我们提出OpenMedQ,一个在迄今为止最广泛的完全开放医学混合数据集上预训练的医学视觉语言模型:包含14个数据集,总计约335万预训练样本,涵盖病理学、放射学、显微镜和纯文本临床问答。OpenMedQ在PathVQA上达到最先进的BLEU-1(75.9),击败了参数多达562B(约大80倍)的Med-PaLM M变体,并在VQA-MED上匹配了最佳报告的BLEU-1(64.5)。其视觉编码器在相同的下游配方下迁移到8个未见过的医学分类基准,获得了最高的平均macro-F1(0.757),优于BiomedCLIP(0.745)、PMC-CLIP(0.745)、PubMedCLIP(0.746)和从头训练的基线(0.616)。我们公开了代码,并提供了一个交互式演示,作为社区的可复现基线。

英文摘要

We present OpenMedQ, a medical vision-language model pretrained on the broadest fully-open medical mix to date: 14 datasets totaling ~3.35M pretraining samples spanning pathology, radiology, microscopy, and text-only clinical QA. OpenMedQ reaches state-of-the-art BLEU-1 on PathVQA (75.9), beating Med-PaLM M variants up to 562B parameters (~80x larger), and matches the best reported VQA-MED BLEU-1 (64.5). Its vision encoder, transferred to 8 unseen medical classification benchmarks under an identical downstream recipe, obtains the highest average macro-F1 (0.757), ahead of BiomedCLIP (0.745), PMC-CLIP (0.745), PubMedCLIP (0.746), and a from-scratch baseline (0.616). We release our code, and an interactive demo is publicly available, as a reproducible baseline for the community.

2606.13028 2026-06-12 cs.RO cs.CV 交叉投稿

Comparing Commercial Depth Sensor Accuracy for Medical Applications

面向医疗应用的商用深度传感器精度比较

Pit Henrich, Maximilian Weiherer, Franziska Hansen, Bernhard Egger, Franziska Mathis-Ullrich

AI总结 本文在猪骨、猪肚和硅胶肾模型上,以触针采样为参考,比较了立体视觉、结构光和飞行时间四类深度传感器在50cm距离下的精度,发现Zivid 2M+ 60在所有物体和指标上表现最佳。

Comments 4 Pages

详情
AI中文摘要

深度估计在医疗和外科手术中有众多应用。我们使用触针采样的参考数据,在猪骨标本、猪肚标本和硅胶肾脏模型上对四种深度传感器进行了基准测试。这些物体包含多个现实挑战,包括均匀表面、镜面反射表面和次表面散射。比较包括距离约50厘米处的立体视觉、结构光和飞行时间传感器。具体而言,比较了Intel RealSense D405(美国Intel RealSense)、PMD Flexx2(德国pmdtechnologies)、Stereolabs ZED 2i(法国Stereolabs)和Zivid 2M+ 60(挪威Zivid)。在本研究考虑的所有物体和指标中,Zivid 2M+ 60表现最佳。ZED在真实组织上排名第二,但在模型上排名最后。

英文摘要

Depth estimation has numerous medical and surgical applications. We benchmark four depth sensors on a porcine bone specimen, a porcine belly specimen, and a silicone kidney phantom using stylus-sampled references. These objects contain several real-world challenges, including homogeneous surfaces, specular surfaces, and subsurface scattering. The comparison includes stereo, structured-light, and time-of-flight sensors at a distance of approximately 50 cm. Specifically, the Intel RealSense D405 (Intel RealSense, United States), PMD Flexx2 (pmdtechnologies, Germany), Stereolabs ZED 2i (Stereolabs, France), and Zivid 2M+ 60 (Zivid, Norway) are compared. The Zivid 2M+ 60 performed best across all objects and metrics considered in this work. The ZED ranked second for real tissue, but last on the phantom.

2511.19652 2026-06-12 cs.CV 版本更新

Navigating Gigapixel Pathology Images with Large Multimodal Models

利用大型多模态模型导航千兆像素病理图像

Thomas A. Buckley, Kian R. Weihrauch, Katherine Latham, Andrew Z. Zhou, Padmini A. Manrai, Arjun K. Manrai

发表机构 * Department of Biomedical Informatics, Harvard Medical School(哈佛医学院生物医学信息学系) Department of Pathology, Massachusetts General Hospital(麻省总医院病理学系) Department of Pathology and Laboratory Medicine, Brown University(布朗大学病理学与实验室医学系)

AI总结 提出GIANT方法,无需训练即可让通用多模态模型自主导航WSI,通过迭代选择多放大倍数裁剪并聚合证据,在MultiPathQA基准上实现SOTA。

详情
AI中文摘要

近期大型多模态模型的进展使得开发能够对话和推理病理全切片图像(WSI)的交互式聊天模型成为可能。然而,现有的切片级聊天系统通常高度专业化,通常将WSI压缩为固定的切片级嵌入或依赖多组件流水线,这可能会丢失多尺度细节并限制目标任务之外的泛化能力。我们提出GIANT(千兆像素图像组织导航代理),一种简单、无需训练的方法,让通用多模态模型自主导航WSI,迭代选择多放大倍数裁剪并随时间聚合证据。为了评估WSI问答中的泛化能力并促进可重复性,我们引入了MultiPathQA,一个涵盖五个临床挑战和934个问题(涉及868个独特WSI)的基准套件。其中包括128道由病理学家编写的多项选择题,旨在模拟真实的诊断搜索和多尺度推理。使用GPT-5,GIANT在五个基准中的四个上取得了最先进的性能,优于专门用于病理问答的模型。

英文摘要

Recent advances in large multimodal models have allowed for the development of interactive chat models that can converse and reason about pathology whole-slide images (WSIs). However, existing slide-level chat systems are often highly specialized, typically compressing WSIs into fixed slide-level embeddings or relying on multi-component pipelines, which can lose multi-scale detail and limit generalizability beyond the target task. We present GIANT (Gigapixel Image Agent for Navigating Tissue), a simple, training-free approach that lets general-purpose multimodal models navigate WSIs on their own, iteratively selecting multi-magnification crops and aggregating evidence over time. To evaluate generalizability in WSI question answering and to promote reproducibility, we introduce MultiPathQA, a benchmark suite spanning five clinical challenges and 934 questions over 868 unique WSIs. This includes a new set of 128 pathologist-authored multiple-choice questions designed to mirror real diagnostic search and multi-scale reasoning. Using GPT-5, GIANT outperforms models specialized for pathology question answering, achieving state-of-the-art performance on four out of five benchmarks.

2512.14648 2026-06-12 cs.CV eess.IV 版本更新

Adaptable Segmentation Pipeline for Diverse Brain Tumors with Radiomic-Guided Subtyping and Lesion-Wise Model Ensemble

适用于多样化脑肿瘤的自适应分割流程:放射组学引导的亚型分类与病灶级模型集成

Daniel Capellán-Martín, Abhijeet Parida, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer, Austin Tapp, Syed Muhammad Anwar, María J. Ledesma-Carbayo, Marius George Linguraru

发表机构 * Sheikh Zayed Institute for Pediatric Surgical Innovation(Sheikh Zayed儿童外科创新研究所) Children’s National Hospital(儿童医院) University of Washington(华盛顿大学) Universidad Politécnica de Madrid(马德里理工大学) CIBER-BBN ISCIII School of Medicine and Health Sciences(医学与健康科学学院)

AI总结 提出一种灵活模块化的自适应分割流程,通过放射组学特征检测肿瘤亚型并平衡训练,结合病灶级性能指标优化模型集成与后处理,在BraTS 2025挑战赛中达到顶尖性能,支持临床定量肿瘤测量。

Comments 12 pages, 5 figures, 3 tables. Algorithm presented at MICCAI BraTS 2025

详情
AI中文摘要

在多参数磁共振成像(MRI)上对脑肿瘤进行鲁棒且可泛化的分割仍然困难,因为肿瘤类型差异很大。BraTS 2025 Lighthouse挑战赛在多种高质量成人及儿童肿瘤数据集上对分割方法进行基准测试:多联盟国际儿童脑肿瘤分割(PED)、术前脑膜瘤肿瘤分割(MEN)、脑膜瘤放射治疗分割(MEN-RT)以及治疗前后脑转移瘤分割(MET)。我们提出了一种灵活、模块化且自适应的流程,通过选择和组合最先进的模型,并在训练前后应用肿瘤和病灶特定的处理,来提高分割性能。从MRI中提取的放射组学特征有助于检测肿瘤亚型,确保更平衡的训练。自定义的病灶级性能指标决定了每个模型在集成中的影响力,并优化了进一步细化预测的后处理,使工作流能够针对每个病例定制每一步。在BraTS测试集上,我们的流程在多个挑战中取得了与顶尖算法相当的性能。这些发现证实,自定义的病灶感知处理与模型选择能够产生鲁棒的分割,而无需将方法锁定在特定的网络架构上。我们的方法在临床实践中具有定量肿瘤测量的潜力,支持诊断和预后。

英文摘要

Robust and generalizable segmentation of brain tumors on multi-parametric magnetic resonance imaging (MRI) remains difficult because tumor types differ widely. The BraTS 2025 Lighthouse Challenge benchmarks segmentation methods on diverse high-quality datasets of adult and pediatric tumors: multi-consortium international pediatric brain tumor segmentation (PED), preoperative meningioma tumor segmentation (MEN), meningioma radiotherapy segmentation (MEN-RT), and segmentation of pre- and post-treatment brain metastases (MET). We present a flexible, modular, and adaptable pipeline that improves segmentation performance by selecting and combining state-of-the-art models and applying tumor- and lesion-specific processing before and after training. Radiomic features extracted from MRI help detect tumor subtype, ensuring more balanced training. Custom lesion-level performance metrics determine the influence of each model in the ensemble and optimize post-processing that further refines the predictions, enabling the workflow to tailor every step to each case. On the BraTS testing sets, our pipeline achieved performance comparable to top-ranked algorithms across multiple challenges. These findings confirm that custom lesion-aware processing and model selection yield robust segmentations without locking the method to a specific network architecture. Our method has the potential for quantitative tumor measurement in clinical practice, supporting diagnosis and prognosis.

2512.14937 2026-06-12 cs.CV cs.AI 版本更新

Improving Pre-trained Adult Glioma Segmentation Models Using only Post-processing Techniques

仅使用后处理技术改进预训练的成人胶质瘤分割模型

Abhijeet Parida, Daniel Capellán-Martín, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer, Austin Tapp, Syed Muhammad Anwar, María J. Ledesma-Carbayo, Marius George Linguraru

发表机构 * Sheikh Zayed Institute for Pediatric Surgical Innovation(Sheikh Zayed儿童手术创新研究所) Children’s National Hospital(儿童医院) University of Madrid(马德里大学) CIBER-BBN ISCIII School of Medicine and Health Sciences(医学与健康科学学院) George Washington University(乔治·华盛顿大学)

AI总结 针对预训练模型在胶质瘤分割中的系统误差,提出自适应后处理技术,在BraTS 2025挑战中使排名指标提升14.9%(撒哈拉以南非洲)和0.9%(成人胶质瘤),推动向高效、公平、可持续的后处理策略转变。

详情
AI中文摘要

胶质瘤是成人中最常见的恶性脑肿瘤,也是最致命的肿瘤之一。尽管积极治疗,中位生存率仍低于15个月。准确的多参数MRI(mpMRI)肿瘤分割对于手术规划、放疗和疾病监测至关重要。虽然深度学习模型提高了自动分割的准确性,但大规模预训练模型泛化能力差且常表现不佳,产生系统性错误,如假阳性、标签交换和切片不连续。这些问题因GPU资源获取不平等和大规模模型训练日益增长的环境成本而进一步加剧。在这项工作中,我们提出自适应后处理技术,以改进为各种肿瘤类型开发的大规模预训练模型产生的胶质瘤分割质量。我们在多个BraTS 2025分割挑战任务中展示了这些技术,使撒哈拉以南非洲挑战的排名指标提升了14.9%,成人胶质瘤挑战提升了0.9%。该方法推动脑肿瘤分割研究从日益复杂的模型架构转向精确、计算公平且可持续的高效临床后处理策略。

英文摘要

Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, median survival is less than 15 months. Accurate multiparametric MRI (mpMRI) tumor segmentation is critical for surgical planning, radiotherapy, and disease monitoring. While deep learning models have improved the accuracy of automated segmentation, large-scale pre-trained models generalize poorly and often underperform, producing systematic errors such as false positives, label swaps, and slice discontinuities. These limitations are further compounded by unequal access to GPU resources and the growing environmental cost of large-scale model training. In this work, we propose adaptive post-processing techniques to refine the quality of glioma segmentations produced by large-scale pre-trained models developed for various types of tumors. We demonstrated the techniques in multiple BraTS 2025 segmentation challenge tasks, with the ranking metric improving by 14.9% for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge. This approach promotes a shift in brain tumor segmentation research from increasingly complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.

2509.04682 2026-06-12 cs.SD cs.AI cs.CV cs.IR cs.LG eess.AS 版本更新

GetNetUPAM: Ecologically Informed Nested Cross-Validation and Noise-Robust Attention for Marine Bioacoustic Monitoring

GetNetUPAM:生态信息嵌套交叉验证与噪声鲁棒注意力用于海洋生物声学监测

Nicholas R. Rasmussen, Rodrigue Rizk, Longwei Wang, KC Santosh

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出GetNetUPAM框架,通过分层嵌套交叉验证保持生态异质性,并集成CBAM空间注意力的ARPA-N网络,在高噪声低信噪比条件下实现鲁棒泛化,在零训练区域将误报率降低约10倍。

Comments Resubmitted and under review as an anonymous submission to IEEETAI - We are allowed an archive submission. Final formatting is yet to be determined

详情
AI中文摘要

部署可靠的生物声学监测系统需要能够在高噪声、低信噪比条件下泛化的模型,以及能够暴露部署相关故障模式的评估协议,这些在当前UPAM实践中基本未得到解决。内在噪声、可变传播以及混合的生物和人为源会导致分布偏移,而传统模型和单次划分评估会掩盖这些偏移,夸大性能并掩盖不稳定性。我们提出GetNetUPAM,一种分层嵌套交叉验证框架,它利用嵌套阶段来量化模型稳定性,而不是调整以获取夸大的保留分数。通过将数据划分为站点-年份块,GetNetUPAM保留了生态异质性,并迫使每个外层折代表不同的环境条件,防止过拟合局部噪声或传感器伪影。内层分层折衡量整个UPAM信号分布上的泛化能力,强制模型开发与外层保留部署条件严格分离。使用GetNetUPAM,我们评估了自适应分辨率池化和注意力网络(ARPA-N),一种用于不规则频谱图维度的CNN架构。ARPA-N将CBAM空间注意力集成为学习型噪声抑制器,生成注意力图以定位真实叫声结构,并避免标准CNN在长窗口数据上利用的全局非生物线索。在GetNetUPAM下,ARPA-N在不同环境条件下鲁棒泛化。在零训练的Balleny Islands区域,它在固定90%召回率下将每小时误报率降低超过一个数量级(约10倍),并在各折上持续改进指标。这些进展提供了可重复的基准,推动UPAM向可扩展、部署可靠的生态监测发展。

英文摘要

Deploying reliable bioacoustic monitoring systems requires models that generalize under high-noise, low-SNR conditions and evaluation protocols that expose deployment-relevant failure modes, gaps largely unaddressed in current UPAM practice. Intrinsic noise, variable propagation, and mixed biological and anthropogenic sources induce distribution shifts that conventional models and single-split evaluations obscure, inflating performance and masking instability. We introduce GetNetUPAM, a hierarchical nested cross-validation framework that uses the nested stage to quantify model stability rather than tune for inflated hold-out scores. By partitioning data into site-year blocks, GetNetUPAM preserves ecological heterogeneity and forces each outer fold to represent a distinct environmental regime, preventing overfitting to localized noise or sensor artifacts. Inner stratified folds measure generalization across the full UPAM signal distribution, enforcing strict separation between model development and the outer held-out deployment condition. Using GetNetUPAM, we evaluate the Adaptive Resolution Pooling and Attention Network (ARPA-N), a CNN architecture for irregular spectrogram dimensions. ARPA-N integrates CBAM spatial attention as a learned noise suppressor, producing attention maps that localize true call structure and avoid the global, non-biological cues exploited by standard CNNs on long-window data. Under GetNetUPAM, ARPA-N generalizes robustly across diverse environmental regimes. In the zero-training support Balleny Islands region, it reduces false positives per hour by over an order of magnitude (approximately 10x) at fixed 90 percent recall, yielding consistently improved metrics across folds. These advances provide a reproducible benchmark and move UPAM toward scalable, deployment-reliable ecological monitoring.
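
The site-year blocking described above can be sketched in a few lines of Python. This is an illustrative reading of the abstract, not the GetNetUPAM code: the record fields `site` and `year`, the leave-one-block-out outer loop, and the round-robin stratification of the inner folds are all assumptions. It shows the key property that every outer test fold comes from a single site-year block that never appears in the corresponding training data.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterator, List, Sequence, Tuple


def site_year_outer_folds(
    records: Sequence[Dict],
) -> Iterator[Tuple[Tuple, List[int], List[int]]]:
    """Yield (block_key, train_idx, test_idx), holding out one block at a time."""
    blocks: Dict[Tuple, List[int]] = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[(rec["site"], rec["year"])].append(idx)
    for key in sorted(blocks):
        test_idx = blocks[key]
        # Training data is every record outside the held-out block.
        train_idx = sorted(i for k, idxs in blocks.items() if k != key for i in idxs)
        yield key, train_idx, test_idx


def stratified_inner_folds(indices: Sequence[int],
                           labels: Sequence[Hashable],
                           n_folds: int) -> List[List[int]]:
    """Split indices into n_folds, dealing each label out round-robin."""
    by_label: Dict[Hashable, List[int]] = defaultdict(list)
    for i in indices:
        by_label[labels[i]].append(i)
    folds: List[List[int]] = [[] for _ in range(n_folds)]
    for label in sorted(by_label):
        for pos, i in enumerate(by_label[label]):
            folds[pos % n_folds].append(i)
    return folds
```

In use, the inner folds would be built only from each outer fold's `train_idx`, so model selection never sees the held-out site-year block.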

2604.27277 2026-06-12 cs.LG cs.AI cs.CV 版本更新

BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

BrainDINO:一种用于通用临床表征学习的脑MRI基础模型

Yizhou Wu, Shansong Wang, Yuheng Li, Mojtaba Safari, Mingzhe Hu, Chih-Wei Chang, Harini Veeraraghavan, Xiaofeng Yang

发表机构 * Department of Radiation Oncology and Winship Cancer Institute, Emory University(放射肿瘤科和Winship癌症研究所,埃默里大学) Department of Radiation and Cellular Oncology, The University of Chicago(放射肿瘤学与细胞肿瘤学部,芝加哥大学) Department of Electrical and Computer Engineering, Georgia Institute of Technology(电气与计算机工程系,佐治亚理工学院) Department of Biomedical Engineering, Georgia Institute of Technology(生物医学工程系,佐治亚理工学院) Department of Biomedical Informatics, Emory University(生物医学信息学系,埃默里大学) Department of Medical Physics, Memorial Sloan Kettering Cancer Center(医学物理系,纪念斯隆凯特琳癌症中心)

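AI summary: Proposes BrainDINO, a self-distilled foundation model trained on approximately 6.6 million unlabeled axial slices; with a frozen encoder and lightweight task heads, it matches or exceeds baselines across a range of brain MRI tasks, with the largest advantages when labels are scarce.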

Comments 25 pages, 5 figures

详情
AI中文摘要

脑MRI支撑着广泛的神经科学和临床应用,然而大多数基于学习的方法仍针对特定任务且需要大量标注数据。本文表明,单一的自监督表征可以泛化到异质的脑MRI终点。我们训练了BrainDINO,一个自蒸馏的基础模型,使用了来自20个数据集的约660万张未标记轴向切片,这些数据集涵盖了人群、疾病和采集设置的广泛变异。通过使用冻结编码器加轻量任务头,BrainDINO支持肿瘤分割、神经退行性和神经发育性疾病分类、脑年龄估计、卒中后时间预测、分子状态预测、MRI序列分类和生存建模等任务的迁移。在各种任务和监督机制下,BrainDINO始终等于或超过自然图像和MRI特定自监督基线,在标签稀缺时尤其具有优势。表征分析进一步显示,在缺乏任务特定监督的情况下,特征结构具有解剖学组织和病理敏感性。我们的发现表明,大规模切片级自监督学习可以产生统一的脑MRI表征,支持多样化的神经影像任务,无需体积预训练或全网络微调,为稳健且数据高效的脑影像分析建立了可扩展的基础。代码可在 https://github.com/mclwu22/BrainDINO 获取。

英文摘要

Brain MRI underpins a wide range of neuroscientific and clinical applications, yet most learning-based methods remain task-specific and require substantial labeled data. Here we show that a single self-supervised representation can generalize across heterogeneous brain MRI endpoints. We trained BrainDINO, a self-distilled foundation model, on approximately 6.6 million unlabeled axial slices from 20 datasets encompassing broad variation in population, disease, and acquisition setting. Using a frozen encoder with lightweight task heads, BrainDINO supported transfer across tumor segmentation, neurodegenerative and neurodevelopmental conditions classification, brain age estimation, post-stroke temporal prediction, molecular status prediction, MRI sequence classification, and survival modeling. Across tasks and supervision regimes, BrainDINO consistently equaled or exceeded natural-image and MRI-specific self-supervised baselines, with particularly strong advantages under label scarcity. Representation analyses further showed anatomically organized and pathology-sensitive feature structure in the absence of task-specific supervision. Our findings indicate that large-scale slice-wise self-supervised learning can yield a unified brain MRI representation that supports diverse neuroimaging tasks without volumetric pretraining or full-network fine-tuning, establishing a scalable foundation for robust and data-efficient brain imaging analysis. Code is available at https://github.com/mclwu22/BrainDINO

9. Document Images, OCR, and Chart Understanding (2 papers)

2606.13108 2026-06-12 cs.CV 新提交

PP-OCRv6: From 1.5M to 34.5M Parameters, Surpassing Billion-Scale VLMs on OCR Tasks

PP-OCRv6: 从1.5M到34.5M参数,在OCR任务上超越十亿级视觉语言模型

Yubo Zhang, Xueqing Wang, Manhui Lin, Yue Zhang, Penglongyi Deng, Ting Sun, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Changda Zhou, Hongen Liu, Suyin Liang, Cheng Cui, Yi Liu, Dianhai Yu, Yanjun Ma

发表机构 * PaddlePaddle Team, Baidu Inc.(百度公司飞桨团队)

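AI summary: Proposes PP-OCRv6, a lightweight OCR system built on a unified MetaFormer-style block with structural reparameterization; across deployments from server to edge it surpasses billion-scale VLMs with orders of magnitude fewer parameters, and the medium model reaches 83.2% recognition accuracy and 86.2% detection Hmean.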

详情
AI中文摘要

视觉语言模型(VLM)在通用视觉语言任务上取得了令人印象深刻的结果,但在应用于专用OCR场景时,它们存在幻觉、定位不精确和计算成本过高的问题。本文提出PP-OCRv6,一个轻量级OCR系统,结合了架构创新和数据中心优化。PP-OCRv6围绕统一的MetaFormer风格构建块重新设计了骨干网络、检测颈和识别颈,采用结构化重参数化,将空间token混合与通道混合解耦,并通过任务特定的步长配置支持两个任务。三个模型层级(中、小、微)共享相同的构建块原语,覆盖从服务器到边缘的部署场景。在我们的内部基准测试中,PP-OCRv6_medium实现了83.2%的识别准确率和86.2%的检测Hmean,分别比PP-OCRv5_server高出+5.1%和+4.6%,同时以数量级更少的参数超越了Qwen3-VL-235B、GPT-5.5和Gemini-3.1-Pro。微层级在Intel Xeon CPU上实现了比PP-OCRv5_mobile快3.9倍的推理速度,同时保持相当的准确率。

英文摘要

Vision-Language Models (VLMs) have achieved impressive results on general vision-language tasks, yet they suffer from hallucination, imprecise localization, and prohibitive computational cost when applied to dedicated OCR scenarios. This paper presents PP-OCRv6, a lightweight OCR system that combines architectural innovation with data-centric optimization. PP-OCRv6 redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization, decoupling spatial token mixing from channel mixing and supporting both tasks through task-specific stride configurations. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge. On our in-house benchmarks, PP-OCRv6_medium achieves 83.2% recognition accuracy and 86.2% detection Hmean, outperforming PP-OCRv5_server by +5.1% and +4.6% respectively while surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier achieves 3.9$\times$ faster inference than PP-OCRv5_mobile on Intel Xeon CPU while maintaining comparable accuracy.

2602.00122 2026-06-12 cs.CV cs.AI cs.MM 版本更新

VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

VDE Bench: 评估图像编辑模型对视觉文档进行修改的能力

Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Tong Li, Zhenyu Guan, Tianyu Zong, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Haopeng Jin, Yixuan Yuan, Xinming Wang, Tao Yu, Ruilin Gao, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani, Xinyu Zuo, Jungang Xu

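Affiliations * UCAS (University of Chinese Academy of Sciences); CASIA (Institute of Automation, Chinese Academy of Sciences); Tencent; CMU (Carnegie Mellon University); WashU (Washington University in St. Louis); SJTU (Shanghai Jiao Tong University); XDU (Xidian University)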

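AI summary: Proposes VDE Bench, a benchmark for evaluating image editing models on bilingual Chinese-English and structurally complex visual document editing; a high-quality dataset and a new evaluation framework systematically quantify the accuracy of text modifications.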

详情
AI中文摘要

近年来,图像编辑模型取得了显著进展,使用户能够通过自然语言指令灵活地交互式地操作视觉内容。然而,一个重要的但尚未充分研究的研究方向是密集的视觉文档图像编辑,这涉及在图像中修改文本内容,同时忠实保留原始文本风格和背景上下文。现有方法主要集中在英语场景和文本相对稀疏的图像上,因此无法充分解决密集、结构复杂的文档或非拉丁文字如中文。为弥合这一差距,我们提出了VDE Bench(视觉文档编辑基准),这是一个严格人工标注和评估的基准,专门设计用于评估图像编辑模型在双语中文-英文和复杂视觉文档编辑任务上的性能。该基准包含942个基于指令的图像编辑样本数据集,其种子图像涵盖密集的中文和英文文本文档,包括学术论文、海报、演示文稿、考试材料和报纸。此外,我们引入了一个新的评估框架,系统地量化了在OCR解析层面的编辑性能,从而实现了对文本修改准确性的细粒度评估。基于此基准,我们对代表性图像编辑模型进行了全面评估。人类验证显示,人类判断与自动化评估指标之间有一致性。VDE Bench构成了评估图像编辑模型在双语密集文本视觉文档性能的首个系统性基准。

英文摘要

In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplored research direction remains dense visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing methods primarily focus on English scenarios and images with relatively sparse text, and thus cannot adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human annotated and evaluated benchmark specifically designed to assess the performance of image editing models on bilingual Chinese-English and complex visual document editing tasks. The benchmark comprises a high quality dataset of 942 instruction based image editing samples, whose seed images encompass dense Chinese and English text documents including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a novel evaluation framework that systematically quantifies editing performance at the OCR parsing level, thereby enabling fine grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative image editing models. Human verification demonstrates a high degree of consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating the performance of image editing models on bilingual dense text visual documents.
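
To make "quantifying editing performance at the OCR parsing level" concrete, here is a small sketch of how OCR output from an edited image might be scored against the text the instruction asked for. It is not the benchmark's actual metric: the function names and score names are hypothetical, the OCR step itself is assumed to have already produced plain strings, and the similarity measure is simply the standard library's `difflib.SequenceMatcher` ratio.

```python
from difflib import SequenceMatcher


def char_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; 1.0 means the strings are identical."""
    return SequenceMatcher(None, a, b).ratio()


def edit_scores(source_text: str, expected_text: str, ocr_of_edited: str) -> dict:
    """Score one edit from OCR strings (all three inputs are plain text)."""
    return {
        # How close the edited image's text is to what the instruction requested.
        "edit_accuracy": char_similarity(expected_text, ocr_of_edited),
        # How close it still is to the original; 1.0 would mean nothing changed.
        "similarity_to_source": char_similarity(source_text, ocr_of_edited),
        "exact_match": expected_text == ocr_of_edited,
    }
```

Working on characters rather than whitespace-separated words keeps the same code usable for Chinese text, which has no word delimiters.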

10. Low-Level Vision, Computational Imaging, and Image Enhancement (8 papers)

2606.13136 2026-06-12 cs.CV cs.LG eess.IV 新提交

An Extensible and Lightweight Unified Architecture for Demosaicing Pixel-bin Image Sensors

一种可扩展且轻量级的统一架构用于像素合并图像传感器的去马赛克

Saurabh Kumar, Nutan Sairam Yenneti

发表机构 * Samsung Research Institute Bangalore(三星研究院班加罗尔分院)

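AI summary: Proposes a modular unified architecture that, with a learning-free CFA-identification module and a lightweight design, demosaics multiple pixel-bin sensor types, improving image quality while reducing resource consumption.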

详情
AI中文摘要

像素合并图像传感器因其分辨率与聚光能力的权衡,正成为智能手机相机的默认选择。然而,与拜耳彩色滤光片阵列(CFA)相比,它们更大的颜色间分离使得去马赛克更具挑战性。此外,现有的基于深度学习的去马赛克方法是CFA特定的,需要多个独立模型,占用宝贵的板载资源,并需要更大的开发和维护工作。在这项工作中,我们提出了一种模块化的统一架构,用于对各种像素合并传感器进行去马赛克,该架构在可扩展且轻量级的同时提供更高的图像质量。此外,为了实现即插即用操作,我们引入了一个无学习的CFA识别模块,以准确检测原始数据的CFA类型。

英文摘要

Pixel-bin image sensors are becoming the default choice for smartphone cameras due to their resolution vs light-gathering trade-off. However, their larger inter-color separation compared to the Bayer color filter array (CFA) makes them challenging to demosaic. Furthermore, existing deep learning-based demosaicing methods are CFA-specific, requiring multiple individual models that take up precious onboard resources and demand larger development and maintenance efforts. In this work, we propose a modular unified architecture for demosaicing various pixel-bin sensors that provides higher image quality while being extensible and lightweight. Additionally, to enable plug-and-play operation, we introduce a learning-free CFA-identification module to detect the CFA type of raw data accurately.

2606.13366 2026-06-12 cs.CV cs.MM 新提交

Dual-Constrained Diffusion Image Compression for Operational Rate-Distortion-Perception Optimization

双约束扩散图像压缩用于操作率失真感知优化

Sanxin Jiang, Jiro Katto, Heming Sun

发表机构 * Shanghai University of Electric Power(上海电力大学) Waseda University(早稻田大学) Institute of Science Tokyo(东京科学大学)

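AI summary: Proposes the DCIC framework, which combines a learned codec with a diffusion-based decoder and uses joint distortion and idempotence constraints to navigate the rate-distortion-perception Pareto frontier continuously, with no additional rate overhead.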

详情
AI中文摘要

率失真感知(RDP)权衡通过施加重建的分布约束扩展了经典率失真理论,为联合控制保真度和感知真实性的神经图像压缩提供了统一框架。虽然先前的工作实现了接近最优的率感知权衡,但明确实现完整RDP曲面的实用框架仍然很少,主要由于在解码器引入公共随机性的困难。我们提出DCIC(双约束扩散图像压缩),它将学习编解码器与基于扩散的解码器相结合,受联合失真和等幂约束的支配。失真约束限制了相对于基础编解码器输出的重建保真度;等幂约束——要求重新编码恢复图像恢复基础编解码器重建——作为分布感知要求的可处理替代。它们通过一致噪声注入的迭代优化引导反向去噪过程,实现公共随机性而无需额外码率开销。在固定码率下,双衰减因子$(K_D, K_P)$共同导航失真感知平面的帕累托前沿,从单个比特流实现连续可调的保真度-真实感权衡。DCIC$_{RD}$($K_P{=}0$)和DCIC$_{RP}$($K_D{=}0$)作为边界曲线出现,DCIC$_{RDP}$($K_D = K_P=1$)实现最优内部工作点。在CelebA-HQ、CLIC2020和ImageNet-1K上,跨CNN、Transformer和混合架构的实验证实,DCIC$_{RDP}$在所有感知编解码器中实现了优越的BD-PSNR,而DCIC$_{RP}$在BD-FID上与专用感知方法相匹配,验证了完整RDP曲面导航的实用价值。

英文摘要

The rate-distortion-perception (RDP) trade-off extends classical rate--distortion theory by imposing a distributional constraint on reconstructions, providing a unified framework for neural image compression that jointly governs fidelity and perceptual realism. While prior work achieves near-optimal rate--perception trade-offs, practical frameworks explicitly realizing the full RDP surface remain scarce, primarily due to the difficulty of introducing common randomness at the decoder. We propose DCIC (Dual-Constrained Diffusion Image Compression), which integrates a learned codec with a diffusion-based decoder governed by joint distortion and idempotence constraints. The distortion constraint bounds reconstruction fidelity relative to the base codec output; the idempotence constraint -- requiring that re-encoding the restored image recovers the base codec reconstruction -- serves as a tractable surrogate for the distributional perception requirement. Together, they steer the reverse denoising process via iterative optimization with consistent noise injection, realizing common randomness without additional rate overhead. At fixed rate, dual attenuation factors $(K_D, K_P)$ jointly navigate the Pareto frontier of the distortion-perception plane, enabling continuously adjustable fidelity-realism trade-offs from a single bitstream. DCIC$_{RD}$ ($K_P{=}0$) and DCIC$_{RP}$ ($K_D{=}0$) arise as boundary curves, with DCIC$_{RDP}$ ($K_D = K_P=1$) realizing the optimal interior operating point. Experiments on CelebA-HQ, CLIC2020, and ImageNet-1K across CNN, Transformer, and hybrid architectures confirm that DCIC$_{RDP}$ achieves superior BD-PSNR over all perceptual codecs, while DCIC$_{RP}$ matches dedicated perception-oriented methods in BD-FID, validating the practical value of full RDP surface navigation.

2606.13580 2026-06-12 cs.CV cs.AI 新提交

EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution

EvTexture++: 事件驱动的视频超分辨率纹理增强

Dachun Kai, Jiayao Lu, Yueyi Zhang, Xiaoyan Sun

发表机构 * MOE Key Laboratory of Brain-Inspired Intelligent Perception and Cognition, University of Science and Technology of China(中国科学技术大学,脑启发智能感知与认知教育部重点实验室) Midea Group(美的集团)

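AI summary: Proposes EvTexture++, the first event-driven texture enhancement framework for video super-resolution; it uses the high-frequency spatiotemporal details of events to restore texture progressively, adds a temporal texture alignment module to improve inter-frame consistency, and achieves state-of-the-art performance on multiple datasets.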

Comments IEEE TPAMI 2026. Extended version of arXiv:2406.13457 (ICML 2024). Project page: https://dachunkai.github.io/evtexture-project-page/

详情
Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 6, pp. 6642-6659, June 2026
AI中文摘要

基于事件的视觉因其独特特性(包括超高时间分辨率和极端动态范围)而受到越来越多的关注。最近的工作将其引入视频超分辨率(VSR)以增强光流估计和时间对齐。相比之下,本文将事件信号的关注点从运动细化转向VSR中的纹理增强。我们提出了EvTexture++,这是首个专用于VSR中纹理增强的事件驱动框架。它利用事件的高频时空细节来改善纹理恢复。EvTexture++包含一个定制的纹理增强分支,以及一个迭代纹理增强模块,该模块逐步利用高时间分辨率的事件信息进行纹理恢复。这使得纹理区域在迭代中逐渐细化,从而产生更准确、更详细的高分辨率输出。除了帧内纹理恢复外,大运动可能会降低帧间时间一致性,尤其是在纹理区域,导致纹理闪烁。为了缓解这一问题,我们进一步利用事件的连续时间运动线索来增强时间一致性,引入了一个时间纹理对齐模块,该模块估计事件引导的纹理感知光流,以实现精确的帧间纹理对齐。此外,EvTexture++被设计为即插即用工具,可灵活提升现有VSR模型的性能。在五个数据集上的实验表明,EvTexture++达到了最先进的性能。当集成到最近的VSR模型中时,它带来了显著的改进,在纹理丰富的Vid4数据集上PSNR提升高达1.55 dB。代码:https://github.com/DachunKai/EvTexture

英文摘要

Event-based vision has drawn increasing attention owing to its distinctive properties, including ultra-high temporal resolution and extreme dynamic range. Recent works have introduced it to video super-resolution (VSR) to enhance flow estimation and temporal alignment. In contrast, this paper shifts the focus of event signals from motion refinement to texture enhancement in VSR. We propose EvTexture++, the first event-driven framework dedicated to texture enhancement in VSR. It leverages high-frequency spatiotemporal details from events to improve texture recovery. EvTexture++ incorporates a customized texture enhancement branch, along with an iterative texture enhancement module that progressively exploits high-temporal-resolution event information for texture restoration. This enables gradual refinement of texture regions across iterations, yielding more accurate and detailed high-resolution outputs. Besides intra-frame texture recovery, large motions could degrade inter-frame temporal consistency, particularly in texture regions, leading to texture flickering. To mitigate this, we further exploit the continuous-time motion cues of events to enhance temporal consistency, introducing a temporal texture alignment module that estimates event-guided texture-aware flow for precise inter-frame texture alignment. Moreover, EvTexture++ is designed as a plug-and-play tool to flexibly boost the performance of existing VSR models. Experiments on five datasets demonstrate that EvTexture++ achieves state-of-the-art performance. When integrated into recent VSR models, it yields significant improvements, with gains of up to 1.55 dB in PSNR on the texture-rich Vid4 dataset. Code: https://github.com/DachunKai/EvTexture.

2505.01869 2026-06-12 cs.CV 版本更新

Visual enhancement and 3D representation for underwater scenes: a review

水下场景的视觉增强与三维表示:综述

Guoxi Huang, Haoran Wang, Brett Seymour, Evan Kovacs, John Ellerbroc, Dave Blackham, Nantheera Anantrasirichai

Affiliations * Visual Information Laboratory, University of Bristol; Submerged Resources Center, National Park Service; Marine Imaging Technologies, LLC; Gates Underwater Products, Inc; Esprit film and television Ltd

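AI summary: Reviews underwater visual enhancement and 3D reconstruction methods, from physical models through non-learning and data-driven techniques (such as NeRF and 3D Gaussian Splatting), evaluates a range of algorithms on benchmark datasets, and outlines future research directions.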

详情
AI中文摘要

水下视觉增强(UVE)和水下三维重建由于水生环境中复杂的成像条件,在计算机视觉和基于AI的任务中面临重大挑战。尽管开发了许多增强算法,但涵盖UVE和水下三维重建的全面系统性综述仍然缺失。为了推动这些领域的研究,我们从多个角度进行了深入综述。首先,我们介绍了基本的物理模型,强调了挑战传统技术的特殊性。我们调查了专门为水下场景设计的视觉增强和三维重建的先进方法。本文评估了从非学习方法到先进数据驱动技术(包括神经辐射场和3D高斯溅射)的各种方法,讨论了它们在处理水下失真方面的有效性。随后,我们在多个基准数据集上对最先进的UVE和水下三维重建算法进行了定量和定性评估。最后,我们指出了水下视觉未来发展的关键研究方向。

英文摘要

Underwater visual enhancement (UVE) and underwater 3D reconstruction pose significant challenges in computer vision and AI-based tasks due to complex imaging conditions in aquatic environments. Despite the development of numerous enhancement algorithms, a comprehensive and systematic review covering both UVE and underwater 3D reconstruction remains absent. To advance research in these areas, we present an in-depth review from multiple perspectives. First, we introduce the fundamental physical models, highlighting the peculiarities that challenge conventional techniques. We survey advanced methods for visual enhancement and 3D reconstruction specifically designed for underwater scenarios. The paper assesses various approaches from non-learning methods to advanced data-driven techniques, including Neural Radiance Fields and 3D Gaussian Splatting, discussing their effectiveness in handling underwater distortions. We then conduct both quantitative and qualitative evaluations of state-of-the-art UVE and underwater 3D reconstruction algorithms across multiple benchmark datasets. Finally, we highlight key research directions for future advancements in underwater vision.

2509.25787 2026-06-12 cs.CV 版本更新

Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

自进化视觉语言模型用于图像质量评估:基于投票与排序

Wen Wen, Tianwu Zhi, Kanglong Fan, Yang Li, Xinge Peng, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang

发表机构 * City University of Hong Kong(香港城市大学) ByteDance Inc.(字节跳动公司)

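AI summary: Proposes the EvoQuality framework, which generates pseudo-labels through self-consistency and uses group relative policy optimization to iteratively improve a VLM's image quality perception; without supervision, it outperforms supervised methods on multiple IQA benchmarks.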

Comments Published as a conference paper at ICLR 2026

详情
AI中文摘要

在训练后阶段改进视觉语言模型(VLM)通常依赖于监督微调或强化学习,这些方法需要昂贵的人工标注数据。虽然自监督技术已被证明能有效增强推理能力,但其在图像质量评估(IQA)等感知领域的应用仍鲜有探索。在这项工作中,我们引入了EvoQuality,一种新颖的框架,使VLM能够自主优化其质量感知能力,无需任何真实标签。EvoQuality将自一致性原则适应于IQA的排序本质。它通过对VLM自身输出进行成对多数投票来生成伪标签,建立相对质量的共识。这些伪排序随后被转化为保真度奖励,通过群体相对策略优化(GRPO)指导模型的迭代进化。通过迭代利用自身预测,EvoQuality逐步优化VLM的感知能力。大量实验表明,EvoQuality在多个IQA基准上将基础VLM的零样本性能提升了31.8%(PLCC)。值得注意的是,尽管完全自监督,EvoQuality的性能可与最先进的基于监督VLM的IQA模型相媲美,甚至超越后者,在7个IQA基准中的5个上表现更优。此外,该框架展现出显著的灵活性,可与预训练IQA模型堆叠以增强在未见数据集上的泛化能力。代码和检查点将在 https://github.com/bytedance/EvoQuality 提供。

英文摘要

Improving vision-language models (VLMs) in the post-training stage typically relies on supervised fine-tuning or reinforcement learning, methods that necessitate costly, human-annotated data. While self-supervised techniques have proven effective for enhancing reasoning capabilities, their application to perceptual domains such as image quality assessment (IQA) remains largely unexplored. In this work, we introduce EvoQuality, a novel framework that enables a VLM to autonomously refine its quality perception capabilities without any ground-truth labels. EvoQuality adapts the principle of self-consistency to the ranking-based nature of IQA. It generates pseudo-labels by performing pairwise majority voting on the VLM's own outputs to establish a consensus on relative quality. These pseudo-rankings are then formulated into a fidelity reward that guides the model's iterative evolution through group relative policy optimization (GRPO). By iteratively leveraging its own predictions, EvoQuality progressively refines the VLM's perceptual capability. Extensive experiments show that EvoQuality boosts the base VLM's zero-shot performance by 31.8% on PLCC across diverse IQA benchmarks. Remarkably, despite being entirely self-supervised, EvoQuality achieves performance that is competitive with, or even surpasses, state-of-the-art supervised VLM-based IQA models, outperforming these models on 5 out of 7 IQA benchmarks. Furthermore, the framework demonstrates significant flexibility, allowing it to be stacked with pre-trained IQA models to bolster generalization on unseen datasets. Codes and checkpoints will be available at https://github.com/bytedance/EvoQuality.
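
The pseudo-labeling step described in this abstract can be illustrated with a short sketch. It assumes that each image pair has been compared several times by the VLM and that the answers are already reduced to 0/1 votes; the function names are hypothetical. The reward below uses the fidelity measure common in the IQA literature, sqrt(p·q) + sqrt((1−p)·(1−q)), which is an assumption about the form of the "fidelity reward" rather than something the abstract states.

```python
import math


def majority_vote(votes):
    """Return the consensus pseudo-label for one image pair.

    votes: list of 0/1 answers, where 1 means "image A looks better than image B".
    Returns 1 or 0, or None when the votes are tied and no consensus exists.
    """
    ones = sum(votes)
    if 2 * ones == len(votes):
        return None
    return 1 if 2 * ones > len(votes) else 0


def fidelity_reward(p, q):
    """Agreement between a predicted preference probability p and a pseudo-label q.

    Equals 1.0 when the prediction matches the pseudo-label with full confidence
    and 0.0 when it contradicts it with full confidence.
    """
    return math.sqrt(p * q) + math.sqrt((1.0 - p) * (1.0 - q))
```

Pairs whose votes tie carry no consensus, so in this sketch they return `None` and would simply be skipped when building the training signal.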

2602.09730 2026-06-12 cs.CV cs.LG cs.NA math.NA 版本更新

Allure of Craquelure: A Variational-Generative Approach to Crack Detection in Paintings

龟裂的魅力:一种变分-生成式绘画裂纹检测方法

Laura Paul, Holger Rauhut, Martin Burger, Samira Kabri, Tim Roith

Affiliations * Dept. of Mathematics, LMU Munich; Munich Center for Machine Learning; Helmholtz Imaging, Deutsches Elektronen-Synchrotron DESY; Fachbereich Mathematik, University of Hamburg; CIT School, Technical University of Munich

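AI summary: Proposes a hybrid method that models crack detection as an inverse problem, using a deep generative model as the prior for the painting together with a Mumford-Shah-type variational functional and a crack prior; joint optimization yields a pixel-level crack localization map.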

详情
AI中文摘要

近期成像技术、深度学习与数值性能的进步使得对艺术品的非侵入性详细分析成为可能,支持其记录与保护。特别是,数字化绘画中龟裂的自动检测对于评估退化和指导修复至关重要,但由于可能复杂的场景以及裂纹与类似裂纹的艺术特征(如笔触或毛发)之间的视觉相似性,这仍然具有挑战性。我们提出一种混合方法,将裂纹检测建模为一个逆问题,将观测图像分解为无裂纹绘画和裂纹分量。采用深度生成模型作为底层艺术品的有力先验,同时使用Mumford-Shah型变分泛函结合裂纹先验来捕捉裂纹结构。联合优化得到绘画中裂纹定位的像素级图。

英文摘要

Recent advances in imaging technologies, deep learning and numerical performance have enabled non-invasive detailed analysis of artworks, supporting their documentation and conservation. In particular, automated detection of craquelure in digitized paintings is crucial for assessing degradation and guiding restoration, yet remains challenging due to the possibly complex scenery and the visual similarity between cracks and crack-like artistic features such as brush strokes or hair. We propose a hybrid approach that models crack detection as an inverse problem, decomposing an observed image into a crack-free painting and a crack component. A deep generative model is employed as powerful prior for the underlying artwork, while crack structures are captured using a Mumford--Shah-type variational functional together with a crack prior. Joint optimization yields a pixel-level map of crack localizations in the painting.

2606.10200 2026-06-12 cs.CV cs.AI cs.LG 版本更新

An Improved Generative Adversarial Network for Micro-Resistivity Imaging Logging Restoration

一种改进的生成对抗网络用于微电阻率成像测井恢复

Ahmed Faizul Haque, S. M. Riaz Rahman Antu, Saif Ahmed, Asadullah Hil Galib, Souvik Pramanik, Mohammad Ashrafuzzaman Khan, Mohammad Abdul Qayum, Mohsin Sajjad

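AI summary: Proposes an improved GAN-based restoration method for imaging logging images; an FCN generator with depthwise separable convolutional residual blocks, an Inception module, and multi-scale feature extraction with spatial attention, combined with global and local discriminators, restores missing regions effectively, reaching a structural similarity of 0.903.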

Comments Mistakes in citations and references. Further we want to submit in conference with improved experiments and results

详情
AI中文摘要

本文提出了一种改进的基于GAN的成像测井图像恢复方法,用于解决微电阻率成像测井图像部分缺失的问题。该方法采用FCN作为生成网络基础设施,并添加深度可分离卷积残差块以学习和保留更有效的像素与语义信息;添加Inception模块以增加网络的多尺度感知场并减少参数数量;添加多尺度特征提取模块和空间注意力残差块,结合通道注意力机制与残差块实现多尺度特征提取。设计了全局判别网络和局部判别网络,通过相互对抗与生成网络逐步提高恢复部分与整体图像之间的内容和语义结构一致性。实验结果表明,测试集中五组不同大小缺失区域的成像测井图像的平均结构相似性度量为0.903,相比其他类似方法提高了约0.3。研究表明,该方法可用于微电阻率成像测井图像的恢复,在语义结构一致性和纹理细节方面有良好改善,从而为保障微电阻率成像测井图像后续解释的顺利进行提供了一种新的深度学习方法。

英文摘要

This paper presents an improved GAN-based restoration method for micro-resistivity imaging logging images with partially missing regions. The generator is built on an FCN and adds three components: depthwise separable convolutional residual blocks, to learn and retain more effective pixel and semantic information; an Inception module, to enlarge the network's multi-scale receptive field while reducing its parameter count; and a multi-scale feature extraction module with a spatial attention residual block, which combines a channel attention mechanism with the residual block to achieve multi-scale feature extraction. A global discriminator and a local discriminator are trained adversarially against the generator to gradually improve the content and semantic structural coherence between the restored parts and the whole image. In the experiments, the average structural similarity over five test sets of imaging logging images with different sizes of missing regions is 0.903, an improvement of about 0.3 over other similar methods. The results show that the method can restore micro-resistivity imaging logging images with good improvement in semantic structural coherence and texture detail, providing a new deep learning method that supports the subsequent interpretation of these images.

2402.01779 2026-06-12 eess.IV cs.CV cs.LG stat.ML 版本更新

Plug-and-Play image restoration with Stochastic deNOising REgularization

即插即用图像恢复:随机去噪正则化

Marien Renaud, Jean Prost, Arthur Leclaire, Nicolas Papadakis


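AI summary: Proposes the SNORE framework, which applies the denoiser only to images at the appropriate noise level and combines an explicit stochastic regularization with gradient descent to solve inverse problems, performing competitively with state-of-the-art methods on deblurring and inpainting.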

详情
AI中文摘要

即插即用(PnP)算法是一类迭代算法,通过结合物理模型和深度神经网络进行正则化来解决图像逆问题。尽管它们能产生令人印象深刻的图像恢复结果,但这些算法依赖于在迭代过程中噪声逐渐减小的图像上非标准地使用去噪器,这与最近基于扩散模型(DM)的算法形成对比,后者仅在重新加噪的图像上应用去噪器。我们提出了一种新的PnP框架,称为随机去噪正则化(SNORE),该框架仅在具有适当噪声水平的图像上应用去噪器。它基于显式的随机正则化,从而产生一种随机梯度下降算法来解决不适定逆问题。提供了该算法及其退火扩展的收敛性分析。实验上,我们证明SNORE在去模糊和修复任务上与最先进方法相比具有竞争力,无论是在定量还是定性方面。

英文摘要

Plug-and-Play (PnP) algorithms are a class of iterative algorithms that address image inverse problems by combining a physical model and a deep neural network for regularization. Even if they produce impressive image restoration results, these algorithms rely on a non-standard use of a denoiser on images that are less and less noisy along the iterations, which contrasts with recent algorithms based on Diffusion Models (DM), where the denoiser is applied only on re-noised images. We propose a new PnP framework, called Stochastic deNOising REgularization (SNORE), which applies the denoiser only on images with noise of the adequate level. It is based on an explicit stochastic regularization, which leads to a stochastic gradient descent algorithm to solve ill-posed inverse problems. A convergence analysis of this algorithm and its annealing extension is provided. Experimentally, we prove that SNORE is competitive with respect to state-of-the-art methods on deblurring and inpainting tasks, both quantitatively and qualitatively.
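
The core idea, applying the denoiser only to a re-noised copy of the current iterate, can be shown on a one-dimensional toy problem. This is a sketch under stated assumptions, not the paper's algorithm: the scalar signal has a Gaussian prior so that the exact MMSE denoiser is known in closed form, the regularization gradient is estimated as (noisy − denoised) / σ², and every name and default value is hypothetical.

```python
import random


def gaussian_mmse_denoiser(z, sigma, prior_std=1.0):
    """Exact MMSE denoiser when the clean signal is N(0, prior_std^2)."""
    return z * prior_std**2 / (prior_std**2 + sigma**2)


def snore_like_descent(y, tau=1.0, sigma=0.5, alpha=1.0, step=0.01, iters=2000, seed=0):
    """Stochastic gradient descent on data fidelity plus a denoiser-based regularizer.

    y:     noisy observation of the unknown scalar x
    tau:   observation noise level, giving the data term (x - y)^2 / (2 tau^2)
    sigma: noise level at which the denoiser is applied
    alpha: weight of the regularization term
    """
    rng = random.Random(seed)
    x = 0.0
    for _ in range(iters):
        data_grad = (x - y) / tau**2
        # Re-noise the iterate so the denoiser sees an input at its own noise level.
        x_noisy = x + sigma * rng.gauss(0.0, 1.0)
        reg_grad = (x_noisy - gaussian_mmse_denoiser(x_noisy, sigma)) / sigma**2
        x -= step * (data_grad + alpha * reg_grad)
    return x
```

With these toy choices the expected update has a closed-form fixed point, y·(s² + σ²) / (s² + σ² + α·τ²) with prior standard deviation s, which makes the sketch easy to sanity-check; the iterate hovers around that value because each step uses a fresh noise draw.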

11. Robustness, Security, Privacy, and Trustworthy Vision (8 papers)

2606.12977 2026-06-12 cs.CV cs.AI cs.CR cs.LG 新提交

Efficient, Robust, and Anti-Collusion Fingerprinting of Image Diffusion Models

图像扩散模型的高效、鲁棒且抗共谋指纹识别

Jianwei Fei, Yunshu Dai, Zhihua Xia, Xiaochun Cao, Jiantao Zhou, Alessandro Piva, Benedetta Tondi

发表机构 * University of Florence(佛罗伦萨大学) Shenzhen Campus of Sun Yat-sen University(中山大学深圳校区) College of Cyber Security, Jinan University(暨南大学网络空间安全学院) State Key Laboratory of Internet of Things for Smart City, University of Macau(澳门大学智慧城市物联网国家重点实验室) Department of Computer and Information Science, Faculty of Science and Technology, University of Macau(澳门大学科技学院计算机与信息科学系) University of Siena(锡耶纳大学)

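AI summary: Existing fingerprinting methods for generative text-to-image models lack robustness to collusion attacks. This work encodes fingerprints in a personalized normalization module and adds an anti-collusion mechanism based on lossless function-invariant parameter transformations, achieving high-fidelity, robust fingerprinting that, for the first time, proactively resists collusion.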

详情
AI中文摘要

模型指纹识别,即将用户特定标识(指纹)嵌入生成输出中,最近已成为保护生成式文本到图像(T2I)模型知识产权并防止未经授权重新分发的流行解决方案。在这项工作中,我们揭示了现有生成模型指纹识别方法中一个先前未被探索的系统性漏洞:它们缺乏对共谋攻击的鲁棒性,其中多个攻击者结合他们的模型以移除或掩盖指纹。为了解决这个问题,我们迈出了为T2I模型开发具有抗共谋能力的鲁棒指纹识别方法的第一步。所提出的方法将比特串(即指纹)编码到集成到T2I模型中的个性化归一化模块(PNM)的系数中,从而可以从任何生成的图像中可靠地恢复指纹。为了防御共谋攻击并防止未经授权的模型重新分发,我们引入了一种基于无损函数不变参数变换的抗共谋机制。该机制显著降低了共谋模型的图像生成质量,使其实际上无法使用。此外,我们的方法允许开发者通过重新参数化PNM高效地创建多个带指纹的T2I模型副本,而无需重新训练。我们还引入了一种最坏情况优化策略,以提高对模型级攻击的鲁棒性。实验表明,所提出的方法在多个T2I图像生成和编辑任务中实现了高保真度和鲁棒性,指纹提取准确率超过99.5%。与现有方法相比,我们的方法首次通过显著增加共谋模型的FID,展示了对共谋攻击的显著主动鲁棒性。

英文摘要

Model fingerprinting, embedding user-specific identifiers (fingerprints) into generated outputs, has recently emerged as a popular solution to protect the intellectual property rights (IPR) of generative text-to-image (T2I) models and prevent unauthorized redistribution. In this work, we reveal a previously unexplored systematic vulnerability in existing generative model fingerprinting methods: they lack robustness against collusion attacks, where multiple attackers combine their models to remove or obscure the fingerprints. To address this issue, we take the first step towards a robust fingerprinting method for T2I models with anti-collusion capabilities. The proposed method encodes strings of bits, namely fingerprints, into the coefficients of a personalized normalization module (PNM) incorporated into T2I models, so that fingerprints can be reliably recovered from any generated image. To defend against collusion attacks and prevent unauthorized model redistribution, we introduce an anti-collusion mechanism based on lossless function-invariant parameter transformations. This mechanism significantly degrades the image generation quality of colluded models, making them effectively unusable. Moreover, our method allows developers to efficiently create multiple copies of fingerprinted T2I models by reparameterizing the PNM without the need for retraining. We also introduce a worst-case optimization strategy to improve robustness against model-level attacks. Our experiments demonstrate that the proposed method achieves high fidelity and robustness across multiple T2I image generation and editing tasks, with fingerprint extraction accuracy exceeding 99.5%. Compared with existing methods, our method demonstrates, for the first time, a notable proactive robustness to collusion attacks by significantly increasing the FID of colluded models.
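
A toy example helps show why a function-invariant parameter transformation can defeat collusion. The sketch below is an illustration only: it uses a two-unit ReLU network, takes hidden-unit permutation as the invariant transformation, and takes weight averaging as the collusion attack. All three choices are assumptions, since the abstract does not specify the transformations or the attack model.

```python
def mlp(x, w1, b1, w2):
    """One-input, one-output network with a single ReLU hidden layer."""
    hidden = [max(0.0, w * x + b) for w, b in zip(w1, b1)]
    return sum(v * h for v, h in zip(w2, hidden))


def permute_hidden(w1, b1, w2, perm):
    """Reorder hidden units; the network computes exactly the same function."""
    return [w1[i] for i in perm], [b1[i] for i in perm], [w2[i] for i in perm]


def average_models(a, b):
    """Naive collusion: average two parameter sets element by element."""
    return tuple([(u + v) / 2 for u, v in zip(pa, pb)] for pa, pb in zip(a, b))


original = ([1.0, -1.0], [0.0, 0.0], [2.0, 3.0])
user_copy = permute_hidden(*original, perm=[1, 0])
colluded = average_models(original, user_copy)
```

Here `mlp(1.0, *original)` and `mlp(1.0, *user_copy)` both evaluate to 2.0, while `mlp(1.0, *colluded)` evaluates to 0.0: each distributed copy behaves identically, but mixing two copies destroys the function.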

2606.13022 2026-06-12 cs.CV cs.LG 新提交

Quality-Preserving Imperceptible Adversarial Attack on Skeleton-based Human Action Recognition

基于骨架的人体动作识别中保质量不可察觉对抗攻击

Ziyi Chang, Kanglei Zhou, Xiaohui Liang, Hubert P. H. Shum

发表机构 * Durham University(杜伦大学) Tsinghua University(清华大学) Beihang University(北京航空航天大学) Zhongguancun Laboratory(中关村实验室)

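AI summary: Adversarial attacks on skeleton-based action recognition often introduce noise-like perturbations that degrade motion quality. This work proposes a distribution-based attack that preserves motion quality by minimizing the gap between empirical and true risk, along with a new metric for naturalness; experiments show it outperforms existing methods in both attack success rate and motion quality.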

详情
AI中文摘要

针对骨架人体动作识别的对抗攻击已受到广泛关注。然而,现有方法通常引入类似噪声的扰动,导致攻击后动作质量下降,从而在S-HAR系统的最新进展中本质上是可察觉的。我们发现这种退化源于先前对抗攻击优化过程中经验风险与真实风险之间的差距。为解决此问题,我们提出一种在不损害动作质量的情况下获得对抗动作的攻击方法。为最小化风险差距并保持动作质量,我们提出一种基于分布的对抗攻击方法,不引入类似噪声的扰动。为忠实评估动作质量,我们提出一种新指标,该指标与人类对真实世界自然性的感知一致。在两个数据集上对最先进的S-HAR方法进行了实验,通过定性和定量分析证明了我们的方法在攻击成功率和攻击后动作质量方面的优越性。我们的保质量攻击应用和基于分布的方法的成功引发了关于动作识别器鲁棒性的严重担忧,强调了在该领域进一步改进的必要性。

英文摘要

Adversarial attacks on skeleton-based human action recognition (S-HAR) have received significant attention. However, existing methods typically introduce noise-like perturbations that degrade motion quality after the attack, which makes them inherently perceptible given recent advances in S-HAR systems. We find that this degradation stems from the gap between empirical and true risks in the optimization of previous adversarial attacks. To minimize this risk gap and preserve motion quality, we propose a distribution-based adversarial attack that obtains adversarial motions without introducing noise-like perturbations. To evaluate motion quality faithfully, we propose a new metric that aligns with human perception of real-world naturalness. Experiments on state-of-the-art S-HAR methods across two datasets demonstrate, through qualitative and quantitative analyses, the superiority of our method in both attack success rate and post-attack motion quality. The success of our quality-preserving attack and distribution-based method raises serious concerns about the robustness of action recognizers, highlighting the need for further improvements in this domain.

2606.13528 2026-06-12 cs.CV 新提交

What's Old is New Again: Classical Dimensionality Reduction for Efficient Saliency-Guided Biometric Attack Detection

旧法新用:经典降维方法用于高效显著性引导的生物特征攻击检测

Samuel Webster, Walter Scheirer

发表机构 * University of Notre Dame(圣母大学)

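AI summary: Proposes generating saliency maps directly from training data with classical dimensionality reduction methods such as PCA and LDA, with no human annotation; across five biometric attack detection domains the approach exceeds baselines and in some cases state-of-the-art saliency methods.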

Comments 16 pages (8 main, 2 references, 6 appendix), 4 figures (3 main, 1 appendix), 13 tables (3 main, 10 appendix)

详情
AI中文摘要

显著性引导训练是一种视觉识别范式,鼓励模型在学习过程中关注最相关的图像区域。尽管其在生物特征呈现攻击检测(PAD)中的应用在鲁棒性和泛化性方面显示出显著优势,但由于现有显著性获取方法(如有限数据集上的人工标注)成本高、领域特异性强且可扩展性有限,其采用往往受到限制。我们提出了一种新颖、成本效益高且高度可扩展的显著性获取方法,使用受经典降维技术PCA和LDA启发的图。我们提出的方法直接从原始训练数据生成显著性图,无需人工标注或领域知识。我们在三个显著性探索领域(虹膜PAD、合成人脸检测、指纹PAD)中情境化这些显著性源的有效性,并在两个显著性新颖领域(指纹静脉PAD和身份证PAD)中展示了其可扩展性。在所有测试领域中,使用降维来源的显著性图训练的模型在没有任何资源投入或特定领域工具的情况下,超过了基线甚至有时是最先进的显著性方法。我们的发现克服了显著性引导训练在生物特征攻击检测及更广泛领域中一个重要但尚未解决的障碍。

英文摘要

Saliency-guided training is a paradigm in visual recognition that encourages models to focus on the most relevant image regions during learning. While its application in biometric presentation attack detection (PAD) has shown strong benefits in robustness and generalization, adoption is often limited by the high cost, domain specificity, and limited scalability of existing saliency acquisition methods, such as human annotations over a limited dataset. We present a novel, cost-efficient, and highly scalable approach to saliency acquisition using maps inspired by classical dimensionality reduction techniques: PCA and LDA. Our proposed methods generate saliency maps directly from raw training data, requiring neither human annotation nor domain knowledge. We contextualize the effectiveness of these saliency sources in three saliency-explored domains (iris PAD, synthetic face detection, fingerprint PAD) and demonstrate their scalability in two saliency-novel domains (fingerprint vein PAD and ID card PAD). Across all domains tested, models trained using dimensionality reduction-sourced saliency maps exceed baseline and sometimes SOTA saliency methods without any resource investment or domain-specific tooling. Our findings overcome an important yet unaddressed barrier to saliency-guided training for biometric attack detection and beyond.
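
The abstract does not say how the PCA- and LDA-inspired maps are computed, so the following is only a rough sketch of one LDA-flavoured possibility: a per-pixel Fisher ratio (between-class variance over within-class variance) computed directly from labelled training images. The function name `fisher_saliency` and the flat-list image format are assumptions made for illustration, not the authors' method.

```python
def fisher_saliency(images, labels, eps=1e-8):
    """Per-pixel Fisher ratio, rescaled to [0, 1].

    images: list of equal-length flat pixel lists (hypothetical input format).
    labels: list of class ids, e.g. 0 = bona fide, 1 = attack.
    """
    n_pixels = len(images[0])
    classes = sorted(set(labels))
    scores = []
    for p in range(n_pixels):
        values = [img[p] for img in images]
        overall_mean = sum(values) / len(values)
        between = 0.0  # spread of the class means around the overall mean
        within = 0.0   # spread of the samples around their own class mean
        for c in classes:
            group = [img[p] for img, y in zip(images, labels) if y == c]
            mean_c = sum(group) / len(group)
            between += len(group) * (mean_c - overall_mean) ** 2
            within += sum((v - mean_c) ** 2 for v in group)
        scores.append(between / (within + eps))
    top = max(scores)
    return [s / top if top > 0 else 0.0 for s in scores]
```

Pixels whose intensity separates the classes get scores near 1, while constant or class-independent pixels get scores near 0. No human annotation is involved, which is the property the abstract emphasises.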

2606.12655 2026-06-12 cs.CR cs.CV 交叉投稿

Amnesia: A Stealthy Replay Attack on Continual Learning Dreams

Amnesia: 一种针对持续学习梦境的重放隐蔽攻击

Ahmed Sharshar, Naveen Kumar Kummari, Mohsen Guizani

AI总结 提出Amnesia攻击,通过仅控制重放索引选择,在审计约束下最大化持续学习模型性能下降,揭示了索引级重放控制的威胁。

AI中文摘要

持续学习(CL)模型常使用经验重放来减少灾难性遗忘,但其对重放采样干扰的鲁棒性尚未充分探索。现有的CL攻击会改变输入或训练流程(投毒/后门),且很少包含明确的审计约束,限制了真实性。这里,审计性意味着监控者可以通过检查采样器可见的遥测数据(例如,记录的重放索引/标签统计)来验证合规性,即检查实现的重放类别直方图是否接近名义基线,以及重放率在每个批次和/或滚动窗口内是否不变。我们研究了一个权限受限的内部人员,其仅控制重放索引选择,而不控制像素、标签或模型参数,同时保持在审计限制内(如队列优先级)。我们提出了Amnesia,一种重放组合攻击,在两种预算下最大化性能下降:可见性预算δ,限制与名义类别直方图p0的TV/KL散度;以及质量预算f,固定重放率。Amnesia有两个步骤:(i)计算轻量级类别效用(如EMA损失或置信度),将p0向有害类别倾斜;(ii)使用高效的KL(指数倾斜)或TV(平衡质量重分配)优化器将倾斜投影回δ-球内。窗口调度器强制执行滚动审计。在具有挑战性的CL基准测试和强重放基线中,Amnesia持续降低最终准确率(ACC)并恶化反向迁移(-BWT)。KL变体在多种审计方案(包括每批次和滚动窗口检查)下实现高影响且基本未被检测到。TV变体更具破坏性但更易检测,尤其是在严格的每类别约束下。这些结果揭示了仅索引重放控制是CL系统中一个实用且可审计的威胁面,并建立了原则性的影响-可见性权衡。

英文摘要

Continual learning (CL) models often use experience replay to reduce catastrophic forgetting, but their robustness to replay sampling interference remains underexplored. Existing CL attacks alter inputs or training pipelines (poisoning/backdoors) and rarely include explicit auditable constraints, limiting realism. Here, auditability means a monitor can verify compliance from sampler-visible telemetry (e.g., logged replay index/label statistics) by checking that the realized replay class histogram stays close to a nominal baseline and that the replay rate is unchanged per batch and/or over a rolling window. We study a limited-privilege insider who controls only replay index selection, not pixels, labels, or model parameters, while staying within auditable limits such as queue priorities. We introduce Amnesia, a replay composition attack that maximizes degradation under two budgets: a visibility budget delta bounding the TV/KL divergence from a nominal class histogram p0, and a mass budget f fixing the replay rate. Amnesia has two steps: (i) compute lightweight class utilities, such as EMA loss or confidence, to tilt p0 toward harmful classes; and (ii) project the tilt back into the delta-ball using efficient KL (exponential tilt) or TV (balanced mass redistribution) optimizers. A windowed scheduler enforces rolling audits. Across challenging CL benchmarks and strong replay baselines, Amnesia consistently lowers final accuracy (ACC) and worsens backward transfer (-BWT). The KL variant delivers high impact while remaining largely undetected under multiple audit schemes, including per-batch and rolling-window checks. The TV variant is more damaging but easier to detect, especially under tight per-class constraints. These results expose index-only replay control as a practical, auditable threat surface in CL systems and establish a principled impact-visibility trade-off.
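
The KL step described in the abstract (exponentially tilting p0 and keeping the result inside the delta-ball) can be sketched as below. This is a minimal illustration under assumptions, not the paper's optimizer: the tilt has the form p_i ∝ p0_i · exp(λ · u_i), λ is found by bisection, and the names `kl_budgeted_tilt`, `utility`, and `lam_max` are hypothetical.

```python
import math


def kl(p, q):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def tilt(p0, utility, lam):
    """Exponential tilt of p0: p_i proportional to p0_i * exp(lam * utility_i)."""
    weights = [pi * math.exp(lam * u) for pi, u in zip(p0, utility)]
    z = sum(weights)
    return [w / z for w in weights]


def kl_budgeted_tilt(p0, utility, delta, lam_max=50.0, iters=60):
    """Strongest tilt toward high-utility classes with KL(p || p0) <= delta.

    KL(p_lam || p0) grows monotonically with lam >= 0, so bisection on lam
    finds the largest admissible tilt. Utilities are assumed to lie in [0, 1].
    """
    if kl(tilt(p0, utility, lam_max), p0) <= delta:
        return tilt(p0, utility, lam_max)
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        mid = (lo + hi) / 2
        if kl(tilt(p0, utility, mid), p0) <= delta:
            lo = mid  # still within the visibility budget
        else:
            hi = mid  # too visible, back off
    return tilt(p0, utility, lo)
```

Because only the class composition of the replay batch changes, the replay rate (the mass budget f in the abstract) is untouched by this step.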

2606.12949 2026-06-12 cs.CR cs.CV 交叉投稿

ViPER: Vision-based Packing-Aware Encoder for Robust Malware Detection

ViPER:基于视觉的打包感知编码器用于鲁棒恶意软件检测

Fatima Qaiser, Bisma Tahir, Muhammad Abid Mughal, Nauman Shamim

AI总结 提出ViPER,一种基于LoRA适配ViT-B/14的双头架构,联合学习恶意软件分类和打包检测,通过打包感知门控机制和频率加权损失处理打包标签偏斜,在20万Windows PE图像上达到0.8521平衡准确率、0.9260 ROC-AUC和0.9279 AUPR。

详情
AI中文摘要

基于可视化的恶意软件检测将原始二进制字节映射为灰度图像,并应用学习的视觉分类器,为传统分析流程提供了一种抗规避且无需反汇编的替代方案。然而,可执行文件打包仍然是一个关键的失效模式:打包后的二进制文件产生高熵图像,掩盖了这些模型所依赖的结构模式。由于打包在良性软件中也很常见(例如用于压缩或复制保护),仅凭打包状态并不能可靠地指示恶意性,且现有方法未在统一的监督框架内解决这一挑战。我们提出了ViPER,一种基于视觉的打包感知编码器,用于鲁棒的恶意软件检测。ViPER构建在LoRA适配的ViT-B/14骨干网络上,采用双头架构,联合学习恶意软件分类和打包检测。打包感知门控机制根据推断的打包状态调节恶意软件预测,从而为打包和未打包输入实现不同的决策边界。为了解决训练期间打包标签偏斜的问题,我们采用了频率加权损失,并在联合类别-打包层上进行分层采样。在20万张Windows PE字节图图像上的评估中,ViPER达到了0.8521的平衡准确率、0.9260的ROC-AUC和0.9279的AUPR,在所有主要指标上均优于代表性的最先进基线,同时打包检测AUC达到0.9949。

英文摘要

Visualization-based malware detection maps raw binary bytes to grayscale images and applies learned visual classifiers, providing an evasion-resistant and disassembly-free alternative to conventional analysis pipelines. However, executable packing remains a critical failure mode: packed binaries produce high-entropy images that obscure the structural patterns these models rely on. Because packing is also prevalent in benign software (e.g., for compression or copy protection), packing state alone is not a reliable indicator of maliciousness, and existing approaches do not address this challenge within a unified supervised framework. We present ViPER, a Vision-based Packing-Aware Encoder for Robust malware detection. ViPER builds on a LoRA-adapted ViT-B/14 backbone with a dual-head architecture that jointly learns malware classification and packing detection. A packing-aware gating mechanism conditions malware predictions on the inferred packing state, enabling distinct decision boundaries for packed and unpacked inputs. To address packing label skew during training, we employ frequency-weighted losses with stratified sampling over joint class-packing strata. Evaluated on 200,000 Windows PE byteplot images, ViPER achieves a balanced accuracy of 0.8521, ROC-AUC of 0.9260, and AUPR of 0.9279, outperforming representative state-of-the-art baselines across all primary metrics, while attaining a packing detection AUC of 0.9949.

2511.04260 2026-06-12 cs.CV cs.AI 版本更新

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Proto-LeakNet:面向合成人脸图像中信号泄漏感知的归因方法

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

发表机构 * Department of Mathematics and Computer Science(数学与计算机科学系) University of Catania(卡塔尼亚大学)

AI总结 提出Proto-LeakNet,利用扩散模型中的信号泄漏痕迹,结合闭集分类与密度开集评估,实现可解释的生成器归因,在闭集上训练后对未见生成器也有效。

Comments 44 pages, 27 figures, 11 tables

AI中文摘要

合成图像和深度伪造生成模型的日益复杂使得源归因和真实性验证成为现代计算机视觉系统的关键挑战。最近的研究表明,扩散管道会在其输出中无意中留下持久的统计痕迹,称为信号泄漏,特别是在潜在表示中。基于这一观察,我们提出了Proto-LeakNet,一个信号泄漏感知且可解释的归因框架,它将闭集分类与在学习到的嵌入上进行的基于密度的开集评估相结合,从而无需重新训练即可分析未见过的生成器。我们的方法作用于扩散模型的潜在域,重新模拟部分前向扩散以暴露残留的生成器特定线索。一个时间注意力编码器聚合多步潜在特征,而一个特征加权原型头则结构化嵌入空间并实现透明的归因。仅在闭集数据上训练并达到98.13%的宏AUC,Proto-LeakNet学习到的潜在几何结构在后处理下保持鲁棒,超越了最先进的方法,并且在真实图像与已知生成器之间以及已知与未见生成器之间实现了强可分离性。代码库可在 https://github.com/claudiunderthehood/Proto-LeakNet 获取。

英文摘要

The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates Closed-set classification with a density-based Open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators, and between known and unseen ones. The codebase is available at the following link: https://github.com/claudiunderthehood/Proto-LeakNet.

2606.12263 2026-06-12 cs.CV 版本更新

VOID: Defeating Unauthorized Mimicry in Latent Diffusion Models

VOID: 击败潜在扩散模型中的未授权模仿

Chunlin Qiu, Ang Li, Tianxiao Huang, Ruilin Gan, Yunjie Ge, Shenyi Zhang, Huayi Duan, Lingchen Zhao, Chao Shen, Qian Wang

发表机构 * School of Cyber Science and Engineering, Wuhan University(武汉大学网络空间安全学院) School of Computer Science, Wuhan University(武汉大学计算机学院) Institute for Math&AI, Wuhan University(武汉大学数学与人工智能研究所) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) School of Cyber Science and Engineering, Xi’an Jiaotong University(西安交通大学网络空间安全学院)

AI总结 针对潜在扩散模型被用于未授权模仿的问题,提出VOID防御框架,通过操纵模型内在随机性,放大潜在编码误差并抵消目标引导信号,实现语义破坏,阻止未授权模仿,同时将扰动限制在人眼不可感知区域。

Comments Extended full version with more comprehensive experimental results. To appear in the 35th USENIX Security Symposium (USENIX Security 2026)

AI中文摘要

虽然潜在扩散模型(LDM)彻底改变了视觉合成,但它们越来越多地被用于对个人的未授权模仿。现有防御通过注入欺骗性扰动,将生成图像引导至无关目标。然而,这种方法基于一个无根据的假设:微小的扰动能在LDM的整个生成过程中保持其欺骗效果。实际上,模型固有的恢复机制会移除这些扰动,导致个体身份在生成的图像中重新出现。我们提出VOID,一种通过操纵LDM内在随机性克服这一难题的防御框架。VOID以两种新颖方式扰动扩散管道:1)放大潜在编码误差以破坏图像的语义结构,以及2)抵消目标引导信号以抑制模型的恢复能力。这导致语义破坏,阻止任何未授权模仿。值得注意的是,安全增益不以视觉效用为代价,因为VOID同时设法将扰动限制在受保护图像的人眼不可感知区域。我们在5个数据集上对10种模仿攻击的24种最先进防御进行了全面评估,证明了VOID前所未有的保护能力:它将平均Frechet Inception Distance(FID)从113提高到365,比迄今为止最强的防御提升了223%。

英文摘要

While Latent Diffusion Models (LDMs) have revolutionized visual synthesis, they are increasingly exploited for unauthorized mimicry of individuals. Existing defenses inject deceptive perturbations to steer the generated images toward irrelevant targets. However, this approach hinges on an ungrounded assumption: subtle perturbations can maintain their deceptive efficacy throughout an LDM's extensive generation process. In reality, the model's innate restoration mechanism will remove such perturbations and cause individual identities to re-emerge in the images generated. We propose VOID, a defense framework that overcomes this conundrum by manipulating an LDM's intrinsic stochasticity. VOID perturbs the diffusion pipeline in two novel ways: 1) amplifying the latent encoding errors to shatter an image's semantic structure, and 2) counteracting the target guidance signals to suppress the model's restoration capabilities. This results in a semantic corruption that thwarts any unauthorized mimicry. Notably, the security gain does not come at the price of visual utility, as VOID simultaneously manages to confine perturbations to human-imperceptible regions of protected images. Our comprehensive evaluation of 24 state-of-the-art defenses against 10 mimicry attacks on 5 datasets demonstrates VOID's unprecedented protection power: it increases the average Frechet Inception Distance (FID) from 113 to 365, a 223% improvement over the strongest defense to date.

2605.00600 2026-06-12 cs.LG cs.AI cs.CV 版本更新

Possibilistic Predictive Uncertainty for Deep Learning

深度学习的可能性预测不确定性

Yao Ni, Jeremie Houssineau, Yew-Soon Ong, Piotr Koniusz

发表机构 * University of Cambridge(剑桥大学) National University of Singapore(新加坡国立大学) University of Warsaw(华沙大学)

AI总结 提出基于可能性理论的Dirichlet近似可能性后验预测(DAPPr)框架,通过投影-近似策略实现高效且原则性的认知不确定性量化,在多个基准上达到竞争性能。

Comments Accepted by ICML 2026, 20 pages

AI中文摘要

深度神经网络在多种应用中取得了令人印象深刻的结果,然而它们对未见输入的过度自信需要可靠的认知不确定性建模。现有的不确定性建模方法面临一个基本困境:贝叶斯方法提供原则性的估计,但计算成本高昂,而高效的二阶预测器在其特定目标与认知不确定性量化之间缺乏严格联系。为解决这一困境,我们引入了Dirichlet近似可能性后验预测(DAPPr),一个基于可能性理论的原则性框架。我们定义了参数上的可能性后验,通过上确界算子将其投影到预测空间,并使用可学习的Dirichlet可能性函数近似投影后的后验。这种投影-近似策略产生了一个具有闭式解的简单训练目标。尽管简单,跨多个不同基准的大量实验表明,DAPPr在保持原则性推导和计算效率的同时,实现了与最先进的二阶预测器相当或更优的不确定性量化性能。代码可在 https://github.com/MaxwellYaoNi/DAPPr 获取。

英文摘要

Deep neural networks achieve impressive results across diverse applications, yet their overconfidence on unseen inputs necessitates reliable epistemic uncertainty modeling. Existing methods for uncertainty modeling face a fundamental dilemma: Bayesian approaches provide principled estimates but remain computationally prohibitive, while efficient second-order predictors lack rigorous connections between their specific objectives and epistemic uncertainty quantification. To resolve this dilemma, we introduce Dirichlet-approximated possibilistic posterior predictions (DAPPr), a principled framework grounded in possibility theory. We define a possibilistic posterior over parameters, project it to the prediction space via supremum operators, and approximate the projected posterior using learnable Dirichlet possibility functions. This projection-and-approximation strategy yields a simple training objective with closed-form solutions. Despite its simplicity, extensive experiments across diverse benchmarks show that DAPPr achieves competitive or superior uncertainty quantification performance over state-of-the-art second-order predictors while maintaining both principled derivation and computational efficiency. Code is available at https://github.com/MaxwellYaoNi/DAPPr.

12. 数据集、基准、评测与训练方法 15 篇

2606.12671 2026-06-12 cs.CV 新提交

SalArt-VQA: Diagnosing Whether VLMs Understand Salient Artifacts in Generated Images

SalArt-VQA: 诊断VLM是否理解生成图像中的显著伪影

Xiaoxiao Sun, Ruotian Zhang, Junzhe Huang, James Burgess, Serena Yeung-Levy

AI总结 提出SalArt-VQA基准,通过950张图像和3681道多选题,从检测、定位、空间基础、缺陷识别四方面评估VLM对生成图像伪影的理解,揭示高检测准确率下隐藏的失败模式。

Comments 23 pages, 7 figures, 7 tables. Dataset: https://huggingface.co/datasets/salartvqa/SalArt-VQA

AI中文摘要

视觉语言模型(VLM)越来越多地被用于检测AI生成图像是否包含可见伪影,然而它们分析此类伪影的能力仍然知之甚少。正确的图像级决策仍可能隐藏重要失败:模型可能正确标记伪影,但依赖于错误的视觉线索、选择错误的区域,或描述图像中不存在的缺陷。为了直接评估这些行为,我们引入了SalArt-VQA,一个用于细粒度理解AI生成图像中显著伪影的诊断基准。SalArt-VQA包含950张图像和3,681道人工编写的多项选择题,涵盖伪影图像、匹配的真实参考图像和配对的生成参考图像。四种对齐的问题类型评估存在检测、语义定位、空间基础和证据基础的缺陷识别,而参考分割测试了当注释缺陷不存在时的校准和弃权能力。在20个VLM上,SalArt-VQA揭示了图像级检测准确率所隐藏的失败:最强的模型在伪影图像上达到99.37%的检测召回率,但仅在53.26%的图像上正确回答了所有四个伪影侧问题。比较伪影图像与无伪影参考揭示了灵敏度-校准权衡:敏感模型经常做出无根据的伪影声明,而保守模型主要通过遗漏真实伪影来避免误报。这些结果表明,高伪影检测准确率本身并不意味着有基础的伪影理解。SalArt-VQA暴露了这些隐藏的失败模式,并提供了对VLM伪影声明是否得到局部视觉证据支持的细粒度评估。

英文摘要

Vision-language models (VLMs) are increasingly used to detect whether AI-generated images contain visible artifacts, yet their ability to analyze such artifacts remains poorly understood. A correct image-level decision can still hide important failures: a model may correctly flag an artifact while relying on the wrong visual cue, selecting the wrong region, or describing a defect that the image does not support. To evaluate these behaviors directly, we introduce SalArt-VQA, a diagnostic benchmark for fine-grained SALient ARTifact understanding in AI-generated images. SalArt-VQA contains 950 images and 3,681 human-authored multiple-choice questions spanning artifact images, matched real reference images, and paired generated reference images. Four aligned question types evaluate presence detection, semantic localization, spatial grounding, and evidence-grounded defect identification, while the reference splits test calibration and abstention when the annotated defect is absent. Across 20 VLMs, SalArt-VQA reveals failures that image-level detection accuracy hides: the strongest model reaches 99.37% detection recall on artifact images but answers all four artifact-side questions correctly on only 53.26% of images. Comparing artifact images with artifact-free references reveals a sensitivity-calibration tradeoff: sensitive models often make unsupported artifact claims, while conservative models avoid false alarms largely by missing real artifacts. These results show that high artifact detection accuracy alone does not imply grounded artifact understanding. SalArt-VQA exposes these hidden failure modes and provides a fine-grained evaluation of whether VLM artifact claims are supported by local visual evidence.
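
The gap the abstract reports between detection recall and answering all four artifact-side questions correctly comes down to two different aggregations over the same per-image results. A small sketch follows, assuming a hypothetical result format of one dict per artifact image that maps question type to correctness. The key names are illustrative and not taken from the benchmark.

```python
def summarize(results):
    """Return (detection recall, fraction of images with all questions correct).

    results: one dict per artifact image, mapping a question type
    (hypothetical keys such as "detection") to True/False correctness.
    """
    n = len(results)
    detection_recall = sum(r["detection"] for r in results) / n
    all_correct = sum(all(r.values()) for r in results) / n
    return detection_recall, all_correct
```

A model can score well on the first number while the second stays low, since a single wrong localization, grounding, or defect answer removes the image from the all-correct count.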

2606.12706 2026-06-12 cs.CV 新提交

VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving

VLADriveBench:评估自动驾驶VLA中的CoT-动作关系

Thach Nguyen, Danhua Guo, Tom Lampo, Fei Wu, Burhan Yaman

发表机构 * Uber AV Labs(优步自动驾驶实验室)

AI总结 提出VLADriveBench框架,结合观察指标和CoT干预协议评估VLA模型中思维链与驾驶动作的相关性和因果性,发现不同模型表现差异显著。

AI中文摘要

视觉-语言-动作(VLA)模型在生成驾驶轨迹的同时产生思维链(CoT)推理,但现有基准仅评估轨迹质量,不评估CoT是否与驾驶动作相关、一致或具有因果联系。我们引入VLADriveBench,一个结合观察指标(提及、幻觉、矛盾、动作对齐)与CoT干预协议的框架,以提供CoT-动作关系的互补视角。将VLADriveBench应用于两种架构的三个模型,我们发现两种分析可能产生显著分歧:ORION在观察对齐上得分最高,但其CoT是附带现象;而Alpamayo v1.5得分较低,但其CoT具有很强的因果性,视觉显著性控制着CoT影响的程度。

英文摘要

Vision-language-action (VLA) models generate chain-of-thought (CoT) reasoning alongside driving trajectories, but existing benchmarks evaluate only trajectory quality and do not assess whether the CoT is relevant, consistent, or causally connected to the driving action. We introduce VLADriveBench, a framework that combines observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to provide complementary views of the CoT-action relationship. Applying VLADriveBench to three models across two architectures, we find that the two analyses can diverge sharply: ORION scores highest on observational alignment yet its CoT is epiphenomenal, while Alpamayo v1.5 scores lower yet its CoT is strongly causal, with visual salience gating the extent of CoT influence.

2606.12869 2026-06-12 cs.CV 新提交

Learning Task-Aware Sampling with Shared Saliency through Density-Equalizing Mappings

通过密度均衡映射学习具有共享显著性的任务感知采样

Tsz Lok Ip, Han Zhang, Lok Ming Lui

发表机构 * Department of Mathematics, The Chinese University of Hong Kong(香港中文大学数学系) Department of Mathematics, City University of Hong Kong(香港城市大学数学系)

AI总结 提出DECNN框架,利用密度均衡映射根据数据空间重要性动态重分配卷积计算资源,实现任务自适应采样,提升模型效率与可解释性。

Comments 16 pages, 10 figures

AI中文摘要

在基于图像和表面的学习任务中,卷积特征通常使用在整个域上均匀采样的感受野来提取。然而,信息丰富的结构在实践中很少均匀分布,通常集中在局部区域。这种现象在医学影像中尤为常见,其中病理变化在空间上受限。因此,均匀卷积将相同的计算量分配给信息丰富和信息不丰富的区域,导致特征提取效率低下和模型容量利用不充分。为了解决这个问题,我们提出了一个任务自适应采样框架,根据数据的空间重要性动态重分配计算注意力。具体来说,我们引入了密度均衡卷积神经网络(DECNN),它通过密度均衡映射,利用学习到的密度函数来引导卷积。密度函数编码了不同区域的相对重要性,并诱导一种变换,放大信息丰富的区域,同时压缩不太相关的区域。结果,卷积感受野在域上非均匀地重新分布,使得在任务相关区域能够进行更密集的采样。通过将这种重要性驱动的变换与卷积相结合,DECNN执行自适应特征提取,将计算资源集中在信息丰富的结构上。这导致更有效地利用模型容量,产生一个轻量级但表达力强的架构,同时生成可解释的显著性图。在图像分类和颅面表面分析上的实验表明,DECNN以更少的参数实现了竞争性或更优的性能,准确识别任务相关区域,并在复杂的几何变化下保持鲁棒性。

英文摘要

In image and surface-based learning tasks, convolutional features are typically extracted using receptive fields that are sampled uniformly across the entire domain. However, informative structures are rarely distributed uniformly in practice and are often concentrated in localized regions. Such phenomena are particularly common in medical imaging, where pathological changes are spatially confined. Consequently, uniform convolution allocates equal computational effort to both informative and uninformative regions, resulting in inefficient feature extraction and suboptimal utilization of model capacity. To address this issue, we propose a framework for task-adaptive sampling that dynamically redistributes computational attention according to the spatial importance of the data. Specifically, we introduce the Density-Equalizing Convolutional Neural Network (DECNN), which employs density-equalizing mappings to guide convolution through a learned density function. The density function encodes the relative importance of different regions and induces a transformation that enlarges informative areas while compressing less relevant ones. As a result, convolutional receptive fields are redistributed non-uniformly over the domain, enabling denser sampling in task-relevant regions. By coupling this importance-driven transformation with convolution, DECNN performs adaptive feature extraction that focuses computational resources on informative structures. This leads to more efficient use of model capacity, yielding a lightweight yet expressive architecture while simultaneously producing an interpretable saliency map. Experiments on image classification and craniofacial surface analysis demonstrate that DECNN achieves competitive or superior performance with fewer parameters, accurately identifies task-relevant regions, and remains robust under complex geometric variations.
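
The idea of enlarging informative regions so that uniform sampling becomes dense there has a simple one-dimensional analogue: placing samples at equal increments of the cumulative density. The sketch below is only that 1D analogue with a fixed, hand-specified density. The paper's mappings act on images and surfaces with a learned density, which this does not attempt to reproduce, and the function name is hypothetical.

```python
def density_equalizing_samples(density, n_samples):
    """Place n_samples points in [0, 1] so each carries equal density mass.

    density: strictly positive values, one per equal-width cell of [0, 1].
    """
    total = sum(density)
    cdf = [0.0]
    for d in density:
        cdf.append(cdf[-1] + d / total)  # cumulative mass at each cell edge
    n_cells = len(density)
    samples = []
    cell = 0
    for k in range(n_samples):
        u = (k + 0.5) / n_samples  # uniform position in the equalized domain
        while cell < n_cells - 1 and cdf[cell + 1] < u:
            cell += 1
        # Invert the piecewise-linear CDF inside the current cell.
        frac = (u - cdf[cell]) / (cdf[cell + 1] - cdf[cell])
        samples.append((cell + frac) / n_cells)
    return samples
```

With a density that is eight times higher in the middle cell than elsewhere, most sample points land in that cell, which mirrors the denser sampling in task-relevant regions described above.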

2606.12925 2026-06-12 cs.CV cs.LG 新提交

Multi-Label Test-Time Adaptation with Bayesian Conditional Priors

基于贝叶斯条件先验的多标签测试时自适应

Qiru Li, Ao Zhou, Zhiwei Jiang, Zifeng Cheng, Cong Wang, Yafeng Yin, Qing Gu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出贝叶斯条件先验估计(BCP),一种无梯度的测试时自适应方法,通过在线估计锚定条件先验注入标签依赖性,提升冻结视觉语言模型在多标签识别中的分布偏移鲁棒性。

Comments accepted by ICML2026

AI中文摘要

多标签识别中,冻结的视觉语言模型(VLM)在分布偏移下表现脆弱:标准零样本推理独立评分每个标签,忽略共现结构,产生不连贯的标签集,其中主导概念抑制较弱但兼容的标签。我们引入贝叶斯条件先验(BCP)估计,一种无梯度的测试时自适应方法,在不调整主干网络的情况下注入标签依赖性。BCP将零样本logits视为在固定图像-文本似然下的边缘后验代理,并将偏移引起的误差主要归因于不匹配的标签先验。对于每个测试图像,它选择一个高置信度的锚定标签,并应用锚定条件的贝叶斯精炼。该更新在logit空间中是闭式的,并具有点互信息(PMI)解释,明确促进兼容标签并抑制不兼容标签。BCP通过从无标签测试流中在线估计锚定条件先验(使用轻量级二阶共现统计)来运行,无需目标标注,且仅增加单个前向传递之外的微不足道的开销。在标准多标签基准和多个CLIP主干网络上,BCP持续优于强TTA基线,例如将RN50的平均mAP从57.31提升至69.22,ViT-B/16从62.61提升至71.79。

英文摘要

Multi-label recognition with frozen Vision-Language Models (VLMs) is brittle under distribution shift: standard zero-shot inference scores labels independently, ignoring co-occurrence structure and producing incoherent label sets where dominant concepts suppress weaker but compatible labels. We introduce Bayesian Conditional Priors (BCP) Estimation, a gradient-free test-time adaptation method that injects label dependency without tuning the backbone. BCP views zero-shot logits as a proxy for marginal posteriors under a fixed image-text likelihood and attributes shift-induced errors mainly to a mismatched label prior. For each test image, it selects a high-confidence anchor label and applies an anchor-conditioned Bayesian refinement. This update is closed-form in logit space and admits a pointwise mutual information (PMI) interpretation, explicitly promoting compatible labels and suppressing incompatible ones. BCP operates without target annotations by estimating anchor-conditioned priors online from the unlabeled test stream via lightweight second-order co-occurrence statistics, adding negligible overhead beyond a single forward pass. Across standard multi-label benchmarks and multiple CLIP backbones, BCP consistently outperforms strong TTA baselines, e.g., improving RN50 average mAP from 57.31 to 69.22 and ViT-B/16 from 62.61 to 71.79.
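
An anchor-conditioned, PMI-based logit update of the kind the abstract describes might look roughly like the following. This is a sketch under assumptions, not the BCP algorithm itself: co-occurrence counts are accumulated from predicted label sets, Laplace smoothing keeps PMI at zero before any data is seen, and the names `CooccurrenceStats`, `refine_logits`, and `alpha` are hypothetical.

```python
import math


class CooccurrenceStats:
    """Online first- and second-order label counts with Laplace smoothing."""

    def __init__(self, n_labels, smoothing=1.0):
        self.smoothing = smoothing
        self.total = 0.0
        self.count = [0.0] * n_labels
        self.pair = [[0.0] * n_labels for _ in range(n_labels)]

    def update(self, present):
        """Add one test image; `present` is the set of labels predicted present."""
        self.total += 1.0
        for i in present:
            self.count[i] += 1.0
            for j in present:
                if i != j:
                    self.pair[i][j] += 1.0

    def pmi(self, j, anchor):
        """Smoothed pointwise mutual information between label j and the anchor."""
        s = self.smoothing
        p_j = (self.count[j] + s) / (self.total + 2 * s)
        p_a = (self.count[anchor] + s) / (self.total + 2 * s)
        p_ja = (self.pair[j][anchor] + s) / (self.total + 4 * s)
        return math.log(p_ja / (p_j * p_a))


def refine_logits(logits, stats, alpha=1.0):
    """Shift every non-anchor logit by alpha * PMI(label, anchor)."""
    anchor = max(range(len(logits)), key=lambda i: logits[i])
    return [
        z if j == anchor else z + alpha * stats.pmi(j, anchor)
        for j, z in enumerate(logits)
    ]
```

Labels that tend to co-occur with the anchor receive a positive shift and labels that rarely appear with it receive a negative one. No gradients or backbone updates are needed, only the running counts.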

2606.13427 2026-06-12 cs.CV 新提交

VietFashion: Benchmarking Sketch-Text Composed Image Retrieval for Cultural Outfits

VietFashion:面向文化服饰的草图-文本组合图像检索基准

Hoang-Nguyen Cao, Le-Hoang Bui, Dinh-Khoi Vo, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science, Ho Chi Minh City, Vietnam(胡志明市理科大学) Vietnam National University, Ho Chi Minh City, Vietnam(越南国家大学胡志明市分校)

AI总结 提出VietFashion基准,针对越南传统服饰奥黛,结合手绘草图和文本描述进行多目标检索,揭示现有方法在细粒度文化语义和跨模态组合上的不足。

Comments ICMR 2026. Project page: https://hng0303.github.io/VietFashion

AI中文摘要

文化服饰对视觉检索系统提出了独特挑战,因为其身份往往依赖于标准AI模型难以捕捉的微妙结构和符号细节。我们引入VietFashion,一个以越南传统服饰奥黛为中心的草图-文本组合图像检索新基准。VietFashion使设计师和研究人员能够通过手绘草图(传达服装结构)和文本描述(编码文化语义)的组合来检索具有文化意义的服装。数据集初始包含650张草图,并通过生成模型扩展至超过21,000张带有对齐标题的照片级真实图像。文本提示描述了详细的服装属性,这些属性从时尚杂志中提取以确保真实性和多样性。为了更好地反映设计意图固有的模糊性,VietFashion采用多目标检索设置,其中单个查询可能对应多个有效结果。我们建立了标准化的评估协议,并对最先进的组合图像检索方法进行了基准测试。实验结果表明,在建模细粒度文化语义和多模态组合方面存在显著性能差距,使VietFashion成为细粒度时尚检索的一个具有挑战性的基准。数据集公开于:this https URL。

英文摘要

Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch-text composed image retrieval centered on the Ao Dai, a traditional Vietnamese garment. VietFashion enables designers and researchers to retrieve culturally meaningful outfits using a combination of hand-drawn sketches, which convey garment structure, and textual descriptions, which encode cultural semantics. The dataset is initialized with 650 sketches and expanded using generative models to produce over 21,000 photorealistic images with aligned captions. Textual prompts describe detailed outfit attributes, which are extracted from fashion magazines to ensure authenticity and diversity. To better reflect the inherent ambiguity of design intent, VietFashion adopts a multi-target retrieval setting, where a single query may correspond to multiple valid results. We establish standardized evaluation protocols and benchmark state-of-the-art composed image retrieval methods. Experimental results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition, positioning VietFashion as a challenging benchmark for fine-grained fashion retrieval. The dataset is publicly available at: https://hng0303.github.io/VietFashion.

2606.13496 2026-06-12 cs.CV 新提交

Budget-Constrained Step-Level Diffusion Caching

预算约束的步骤级扩散缓存

Mingkun Lei, Tong Zhao, Liangyu Yuan, Chi Zhang

发表机构 * Westlake-AGI-Lab(西湖大学AGI实验室)

AI总结 提出BudCache方法,通过离线搜索(模拟退火+爬山)在固定计算预算下优化缓存策略,并引入缓存感知调度对齐,以提升扩散模型生成质量。

Comments Accepted by ICML 2026

AI中文摘要

步骤级缓存通过利用去噪步骤间的时间冗余来加速扩散模型。现有方法使用基于阈值的启发式方法进行每步缓存决策,没有直接优化最终输出质量。因此,它们的推理延迟随输入变化,在部署时难以控制。在这项工作中,我们提出了BudCache,它反转了这一公式:不是让每步误差阈值决定运行成本,而是预先固定计算预算,并搜索最能保留最终输出的缓存策略。为了应对步骤选择的组合复杂性,我们将模拟退火与确定性爬山相结合。这种离线搜索在几分钟内找到高质量的缓存策略,并且在推理过程中不引入在线搜索或阈值开销。当计算预算非常紧张时,我们进一步引入缓存感知调度对齐,它使时间离散化适应所选的缓存策略,以减少缓存引起的轨迹不匹配。在FLUX.1-dev和Wan2.1上的实验表明,在相同推理预算下,BudCache比启发式缓存基线实现了更好的生成质量。代码可在 https://github.com/Westlake-AGI-Lab/BudCache 获取。

英文摘要

Step-level caching accelerates diffusion models by exploiting temporal redundancy across denoising steps. Existing methods make per-step cache decisions using threshold-based heuristics, without directly optimizing for final output quality. As a result, their inference latency varies across inputs and is difficult to control at deployment. In this work, we propose BudCache, which inverts this formulation: rather than letting per-step error thresholds dictate the runtime cost, we fix the compute budget in advance and search for the cache policy that best preserves the final output. To tackle the combinatorial complexity of step selection, we combine Simulated Annealing with deterministic Hill Climbing. This offline search identifies high-quality cache policies within minutes and introduces no online search or thresholding overhead during inference. When the compute budget is very tight, we further introduce cache-aware schedule alignment, which adapts the time discretization to the selected cache policy to reduce cache-induced trajectory mismatch. Experiments on FLUX.1-dev and Wan2.1 show that BudCache achieves better generation quality than heuristic caching baselines under the same inference budgets. Code is available at https://github.com/Westlake-AGI-Lab/BudCache
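
The search described above (fix the number of fully computed steps, then look for the best placement with Simulated Annealing followed by deterministic Hill Climbing) can be sketched as follows. Everything concrete here is an assumption for illustration: the swap move, the linear cooling schedule, and especially `toy_cost`, a stand-in proxy for the final-output error that a real search would measure on the diffusion model.

```python
import math
import random


def toy_cost(computed, weights):
    """Hypothetical proxy: staleness of cached features, weighted per step."""
    cost, last = 0.0, 0
    for t, w in enumerate(weights):
        if t in computed:
            last = t
        cost += w * (t - last)
    return cost


def search_policy(n_steps, budget, cost, iters=2000, t0=1.0, seed=0):
    """Pick `budget` steps to compute fully (step 0 always); assumes 2 <= budget < n_steps."""
    rng = random.Random(seed)
    all_steps = set(range(1, n_steps))
    current = {0} | set(rng.sample(range(1, n_steps), budget - 1))
    cur_cost = cost(current)
    best, best_cost = set(current), cur_cost
    # Simulated annealing: swap one computed step with one cached step.
    for i in range(iters):
        temp = t0 * (1 - i / iters) + 1e-9
        out_step = rng.choice(sorted(current - {0}))
        in_step = rng.choice(sorted(all_steps - current))
        cand = (current - {out_step}) | {in_step}
        cand_cost = cost(cand)
        if cand_cost <= cur_cost or rng.random() < math.exp((cur_cost - cand_cost) / temp):
            current, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = set(current), cur_cost
    # Deterministic hill climbing: accept improving swaps until none remain.
    improved = True
    while improved:
        improved = False
        for out_step in sorted(best - {0}):
            for in_step in sorted(all_steps - best):
                cand = (best - {out_step}) | {in_step}
                cand_cost = cost(cand)
                if cand_cost < best_cost:
                    best, best_cost, improved = cand, cand_cost, True
                    break
            if improved:
                break
    return best, best_cost
```

Since every candidate policy computes exactly `budget` steps, the inference cost is the same for all of them and only the quality proxy varies. This is the inversion the abstract describes relative to threshold-based caching.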

2606.12595 2026-06-12 cs.LG cs.AI cs.CV 交叉投稿

Emerging Flexible Designs for Geospatial Multimodal Foundation Models

地理空间多模态基础模型的新兴灵活设计

Philipe Dias, Waqwoya Abebe, Abhishek Potnis, Aristeidis Tsaris, Dan Lu, Xiao Wang, Dalton Lunga

发表机构 * Oak Ridge National Laboratory(橡树岭国家实验室)

AI总结 本文系统比较了不同架构的地理空间基础模型,在统一设置下评估其灵活性与性能,为多模态推理提供设计指导。

AI中文摘要

基础模型通过跨多样未标记地理空间模态的可扩展预训练,正在迅速改变地球观测。然而,其架构多样性(从仅编码器到编码器-解码器以及掩码自编码范式)使得以一致方式评估性能权衡变得具有挑战性。在这项工作中,我们对领先的、专为地理空间多模态推理设计的基础模型架构进行了同类比较,特别关注不同光谱波段配置下的灵活性。我们使用相同的自监督学习目标和训练数据集标准化预训练,并在GEOBench基准测试上,在一致参数化下评估所有模型的分类和分割任务。我们的结果为模型灵活性、模态对齐和下游任务性能之间的设计权衡提供了新见解。通过强调受控条件下的架构优势和局限性,本研究为构建能够进行鲁棒多模态推理的下一代地理空间基础模型提供了实用指导。

英文摘要

Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading foundation model (FM) architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.

2606.12913 2026-06-12 cs.LG cs.CV 交叉投稿

Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration

图上的样本选择:用于无损训练加速的统一数据集剪枝框架

Dongyue Wu, Zilin Guo, Xiaoyu Li, Jiajia Liu, Jingdong Chen, Nong Sang, Changxin Gao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出基于图的统一数据集剪枝框架,将数据集建模为加权图,通过最大权重团问题选择样本,并设计贪心算法,在多种剪枝比例下优于现有方法,实现ImageNet-1k上40%以上训练加速且不损失精度。

Comments ICML 2026

AI中文摘要

现代训练数据集的快速增长显著增加了计算成本,促使数据集剪枝(DP)方法仅保留信息量丰富的样本子集以减少训练成本。现有的剪枝标准通常依赖于独立评估样本的内在信号,或通过成对关系促进多样性的外在信号。虽然在各自特定的适用范围内有效,但每种方法仅捕捉样本效用的一方面,且在不同剪枝比例或数据分布下缺乏鲁棒性。在这项工作中,我们提出了一个统一的基于图的DP框架。通过将数据集建模为加权图,其中节点权重编码内在价值,边权重编码外在价值,DP可以转化为最大权重团问题(MWCP)。尽管MWCP是NP难的,但其结构允许基于样本边际增益的原则性贪心解法。在几个温和条件下,我们进一步证明该统一目标具有形式化的近似保证,适用于广泛的重要性度量族,并提供了实用设计指南。大量实验表明,我们的方法优于现有DP方法,同时显著降低训练成本,在ImageNet-1k上使用ResNet-50时,训练时间减少超过40%且不损失精度。

英文摘要

The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations. While effective in their own specific regimes, each captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost, reducing training time by over 40% without sacrificing accuracy on ImageNet-1k with ResNet-50.
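
A greedy selection driven by sample-wise marginal gains, as mentioned in the abstract, can be sketched as below. The objective used here (sum of selected node weights plus the edge weights between selected pairs) is an assumed simple form for illustration, since the abstract does not give the exact formulation, and `greedy_prune` is a hypothetical name.

```python
def greedy_prune(node_w, edge_w, keep):
    """Select `keep` samples by repeatedly taking the largest marginal gain.

    node_w[i]: intrinsic value of sample i (e.g. an importance score).
    edge_w[i][j]: extrinsic value of keeping i and j together (e.g. dissimilarity),
    assumed symmetric.
    """
    gain = list(node_w)  # marginal gain of each candidate given the current selection
    remaining = set(range(len(node_w)))
    selected = []
    for _ in range(keep):
        # Ties are broken in favour of the lowest index for determinism.
        best = max(remaining, key=lambda i: (gain[i], -i))
        selected.append(best)
        remaining.remove(best)
        for i in remaining:
            gain[i] += edge_w[best][i]  # pairing with the new member adds edge value
    return selected
```

In a small case with two high-scoring near-duplicates and one lower-scoring but dissimilar sample, the second pick goes to the dissimilar sample, showing how node and edge weights trade off within a single objective.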

2606.13223 2026-06-12 cs.LG cs.CV 交叉投稿

Distributional Loss for Robust Classification

分布损失用于鲁棒分类

Kathleen Anderson, Thomas Martinetz

发表机构 * Institute for Neuro- and Bioinformatics(神经与生物信息学研究所)

AI总结 提出一种基于双峰高斯分布的分布损失概念,通过软化目标隐式捕捉类别模糊性,缓解过拟合,提升决策边界鲁棒性,尤其在低数据场景下效果显著。

Comments ICANN 2026

详情
AI中文摘要

本文提出了一种用于监督分类任务的新型损失概念。我们不是强制每个输入样本直接映射到单个分配标签,而是将分类器输出的优化目标定义为双峰高斯分布。这种更柔和的目标公式隐式地捕捉了类别模糊性,减轻了过拟合,并鼓励学习更鲁棒的决策边界,所有这些都不需要额外的标签信息。实验结果表明,鲁棒性持续提升,在低数据场景下尤其明显,同时仅需对标准训练流程进行最小修改。

英文摘要

This paper proposes a novel loss concept for supervised classification tasks. Rather than enforcing a direct mapping from each input sample to a single assigned label, we define an optimization objective over all classifier outputs as a bimodal Gaussian distribution. This softer target formulation implicitly captures class ambiguity, mitigates overfitting, and encourages the learning of more robust decision boundaries, all without requiring additional label information. Experimental results demonstrate consistent improvements in robustness, with particularly pronounced gains in low-data regimes, while requiring only minimal modifications to standard training pipelines.

2606.13461 2026-06-12 cs.LG cs.CV 交叉投稿

Reinforcement Learning for Neural Model Editing

神经模型编辑的强化学习

Shaivi Malik

发表机构 * Shaivi Malik

AI总结 提出将神经模型编辑形式化为强化学习问题,通过奖励反馈学习编辑策略,在偏见缓解和机器遗忘任务上取得良好效果。

AI中文摘要

编辑预训练神经网络需要针对特定目标定制的专用算法。设计此类算法通常耗时且需要大量精力。我们提出了一个探索性框架,将神经模型编辑形式化为强化学习问题,其中智能体使用奖励反馈修改模型。我们引入了两个环境:MaskWorld,其中智能体以乘法方式缩放权重;以及ShiftWorld,其中智能体应用加法权重更新。奖励函数结合了效用保持目标和任务特定编辑目标,使智能体能够在保持整体模型性能的同时学习有针对性的修改。我们在文本分类中的偏见缓解和图像分类中的机器遗忘上评估了该框架,这两者传统上都依赖于专用算法。我们的结果表明,在遗忘任务中,学习到的策略将遗忘集准确率降至接近0%,同时保留集准确率保持在90%以上。在偏见缓解设置中,学习到的策略将偏见相关性能提高了5%以上,同时保持了一般分类效用。我们的发现表明,神经模型编辑可以转化为强化学习问题,从而可以从奖励反馈中学习编辑策略,而不是为每个任务手动设计。

英文摘要

Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formulates neural model editing as a reinforcement learning problem, where agents modify models using reward feedback. We introduce two environments: MaskWorld, where agents scale weights multiplicatively, and ShiftWorld, where agents apply additive weight updates. The reward function combines a utility-preservation objective with a task-specific editing objective, enabling agents to learn targeted modifications while maintaining overall model performance. We evaluate the framework on bias mitigation in text classification and machine unlearning in image classification, both of which traditionally rely on specialized algorithms. Our results show that the learned policies reduce forget set accuracy to nearly 0% while preserving over 90% retain set accuracy on the unlearning task. In the bias mitigation setting, the learned policies improve bias-related performance by more than 5% while maintaining general classification utility. Our findings show that neural model editing can be cast as a reinforcement learning problem, allowing editing policies to be learned from reward feedback rather than manually engineered for each task.
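
The two action types and the combined reward described in this abstract can be sketched as follows. This is only an illustration: the function names and the reward form `retain_acc - lam * forget_acc` are hypothetical, since the abstract states that the reward combines a utility-preservation term with a task-specific editing term but does not give its formula.

```python
def mask_edit(weights, scales):
    """MaskWorld-style action: scale each weight multiplicatively."""
    return [w * s for w, s in zip(weights, scales)]


def shift_edit(weights, deltas):
    """ShiftWorld-style action: apply an additive update to each weight."""
    return [w + d for w, d in zip(weights, deltas)]


def unlearning_reward(retain_acc, forget_acc, lam=1.0):
    """Hypothetical reward for the unlearning task: reward accuracy on the
    retain set (utility) and penalize accuracy on the forget set (editing goal)."""
    return retain_acc - lam * forget_acc


weights = [1.0, -2.0]
print(mask_edit(weights, [0.5, 2.0]))     # multiplicative edit
print(shift_edit(weights, [0.25, 1.0]))   # additive edit
print(unlearning_reward(retain_acc=0.9, forget_acc=0.0))
```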

2512.12571 2026-06-12 cs.CV 版本更新

Measurement Plasticity: Sensor-Level Adaptation for Vision-Language Models

测量塑性:面向视觉-语言模型的传感器级自适应

Boyeong Im, Wooseok Lee, Yoojin Kwon, Hyung-Sin Kim

发表机构 * University of Seoul(首尔市立大学)

AI总结 提出多视角物理提示(MVP)用于测试时自适应,通过将相机曝光三角(ISO、快门速度、光圈)作为物理提示,在传感器层面进行自适应,无需梯度或模型修改,在ImageNet-ES上优于数字方法。

Comments Accepted to the ICML 2026 Workshop on Continual Adaptation at Scale

AI中文摘要

我们提出用于测试时自适应(TTA)的多视角物理提示(MVP),这是一种前向传播框架,通过将相机曝光三角(即ISO、快门速度、光圈)视为物理提示,将TTA从令牌层面转移到光子层面。在推理时,MVP使用源亲和度得分获取选定的多个物理视角,评估每个保留视角的数字增强变体并过滤最低熵预测,然后通过硬投票聚合预测。这种先选择后投票的设计简单、易于校准,且无需梯度或模型修改。在ImageNet-ES和ImageNet-ES-Diverse上,MVP在自动曝光以及与传统传感器控制结合的情况下均优于纯数字TTA。在减少参数候选以降低捕获延迟的情况下,MVP仍然有效,展示了其实用性。

英文摘要

We propose Multi-View Physical-prompt (MVP) for Test-Time Adaptation (TTA), a forward-only framework that moves TTA from tokens to photons by treating the camera exposure triangle (i.e., ISO, shutter speed, and aperture) as physical prompts. At inference, MVP acquires multiple physical views selected using a source-affinity score, evaluates digitally augmented variants of each retained view and filters the lowest-entropy predictions, and aggregates predictions with hard voting. This selection-then-vote design is simple, calibration-friendly, and requires no gradients or model modifications. On ImageNet-ES and ImageNet-ES-Diverse, MVP outperforms digital-only TTA on both Auto-Exposure and a combination with conventional sensor control. MVP remains effective under reduced parameter candidates that lower capture latency, demonstrating its practicality.
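
The selection-then-vote step can be sketched in a few lines. This sketch assumes that "filters the lowest-entropy predictions" means keeping the most confident (lowest-entropy) predictions before voting; the function names and the `keep` parameter are hypothetical, and the source-affinity scoring of physical views is not modeled here.

```python
import math
from collections import Counter


def entropy(p):
    """Shannon entropy of a probability vector (natural log)."""
    return -sum(x * math.log(x) for x in p if x > 0)


def select_then_vote(prob_vectors, keep):
    """Keep the `keep` lowest-entropy predictions, then hard-vote on their argmax."""
    confident = sorted(prob_vectors, key=entropy)[:keep]
    votes = Counter(max(range(len(p)), key=p.__getitem__) for p in confident)
    return votes.most_common(1)[0][0]


# One confident prediction for class 0, two uncertain ones leaning to class 1.
preds = [[0.95, 0.05], [0.40, 0.60], [0.45, 0.55]]
print(select_then_vote(preds, keep=1))  # only the most confident prediction votes
print(select_then_vote(preds, keep=3))  # all predictions vote
```

The example shows why filtering matters: with `keep=1` the single confident view decides the label, whereas voting over all views lets two uncertain predictions outvote it.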

2603.10834 2026-06-12 cs.CV cs.AI 版本更新

On the Reliability of Cue Conflict and Beyond

论线索冲突的可靠性及其超越

Pum Jun Kim, Seung-Ah Lee, Seongho Park, Dongyoon Han, Jaejun Yoo

发表机构 * Ulsan National Institute of Science and Technology(蔚山科学技术院) College of Medicine, Hanyang University(汉阳大学医学院) NAVER AI Lab(NAVER AI实验室)

AI总结 针对现有线索冲突基准在评估形状-纹理偏好时存在不稳定和模糊的问题,提出REFINED-BIAS数据集与评估框架,通过显式定义形状和纹理、构建平衡的线索对及基于排序的度量,实现更可靠和可解释的偏差诊断。

Comments Shape-Texture Bias, Cue Conflict Benchmark

AI中文摘要

理解神经网络如何依赖视觉线索提供了其内部决策过程的人类可解释视角。线索冲突基准在探究形状-纹理偏好以及激发更强、类人形状偏差通常与改进的域内性能相关的见解方面具有影响力。然而,我们发现当前基于风格化的实例化可能产生不稳定和模糊的偏差估计。具体来说,风格化可能无法可靠地实例化感知上有效且可分离的线索,也无法控制其相对信息量;基于比率的偏差可能掩盖绝对线索敏感性;将评估限制在预选类别可能忽略完整决策空间而扭曲模型预测。这些因素共同可能将偏好与线索有效性、线索平衡和可识别性伪影混淆。我们引入了REFINED-BIAS,一个用于可靠和可解释的形状-纹理偏差诊断的集成数据集和评估框架。REFINED-BIAS使用形状和纹理的显式定义构建平衡的、人类和模型可识别的线索对,并通过基于排序的度量测量完整标签空间上的线索特定敏感性,从而实现更公平的跨模型比较。在不同的训练范式和架构中,REFINED-BIAS实现了更公平的跨模型比较、更忠实的形状和纹理偏差诊断以及更清晰的实证结论,解决了先前线索冲突评估无法可靠区分的矛盾。

英文摘要

Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-conflict benchmark has been influential in probing shape-texture preference and in motivating the insight that stronger, human-like shape bias is often associated with improved in-domain performance. However, we find that the current stylization-based instantiation can yield unstable and ambiguous bias estimates. Specifically, stylization may not reliably instantiate perceptually valid and separable cues nor control their relative informativeness, ratio-based bias can obscure absolute cue sensitivity, and restricting evaluation to preselected classes can distort model predictions by ignoring the full decision space. Together, these factors can confound preference with cue validity, cue balance, and recognizability artifacts. We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis. REFINED-BIAS constructs balanced, human- and model-recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric, enabling fairer cross-model comparisons. Across diverse training regimes and architectures, REFINED-BIAS enables fairer cross-model comparison, more faithful diagnosis of shape and texture biases, and clearer empirical conclusions, resolving inconsistencies that prior cue-conflict evaluations could not reliably disambiguate.

2603.14482 2026-06-12 cs.CV 版本更新

V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

V-JEPA 2.1: 解锁视频自监督学习中的密集特征

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, Adrien Bardes

发表机构 * FAIR at Meta(Meta的FAIR) Universidad de Zaragoza(萨拉戈萨大学)

AI总结 提出V-JEPA 2.1系列自监督模型,通过密集预测损失、深度自监督、多模态分词器和有效缩放,学习图像和视频的密集高质量视觉表示,在多个基准上取得最优性能。

AI中文摘要

我们提出V-JEPA 2.1,一系列自监督模型,能够学习图像和视频的密集、高质量视觉表示,同时保持强大的全局场景理解。该方法结合了四个关键组件。首先,密集预测损失使用基于掩码的目标,其中可见和掩码令牌都贡献于训练信号,鼓励显式的空间和时间接地。其次,深度自监督在多个中间编码器层上分层应用自监督目标,以提高表示质量。第三,多模态分词器实现了图像和视频的统一训练。最后,该模型受益于模型容量和训练数据的有效缩放。这些设计选择共同产生了空间结构、语义一致和时间连贯的表示。实验上,V-JEPA 2.1在几个具有挑战性的基准上取得了最先进的性能,包括在Ego4D上短期物体交互预测的7.71 mAP,在EPIC-KITCHENS上高级动作预测的40.8 Recall@5,以及在实际机器人抓取成功率上比V-JEPA-2 AC提高了20个百分点。该模型还在机器人导航(TartanDrive上5.687 ATE)、深度估计(NYUv2上线性探针0.307 RMSE)和全局识别(Something-Something-V2上77.7)方面表现出强大的性能。这些结果表明,V-JEPA 2.1显著推进了密集视觉理解和世界建模的最新技术。

英文摘要

We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling in both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks, including 7.71 mAP on Ego4D for short-term object-interaction anticipation and 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, as well as a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also demonstrates strong performance in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.

2605.01391 2026-06-12 cs.CV 版本更新

VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

VISTA:视频交互时空分析基准

Alejandro Aparcedo, Akash Kumar, Aaryan Garg, Dalton Pham, Wen-Kai Chen, Anirudh Bharadwaj, Aman Chadha, Yogesh Rawat

发表机构 * University of Central Florida(中央佛罗里达大学) BITS Pilani(比特斯理工学院) Ho Chi Minh City University of Science(胡志明市科学大学) Amazon GenAI Project(亚马逊生成人工智能项目)

AI总结 提出VISTA基准,通过分解视频为实体、动作和关系,实现开放集多实体多动作的时空理解评估,揭示传统指标掩盖的偏差。

Comments Accepted to CVPR 2026 Workshop on Pixel-level Video Understanding in the Wild (PVUW)

AI中文摘要

现有的视觉-语言模型(VLM)基准主要评估简单单动作视频、封闭属性集和受限实体类型的时空理解,未能捕捉真实世界视频理解中多样实体之间的自由形式多动作交互。此外,缺乏一个系统性的框架来分析模型在互补时空轴上的失败,阻碍了全面评估。为解决这些问题,我们引入了VISTA,一个视频交互时空分析基准,专为VLM中的开放集、多实体和多动作时空理解设计。VISTA将视频分解为可解释的实体、其关联动作和关系动态,实现多轴诊断以及关系、空间和时间理解的统一评估。我们的基准将多个数据集整合到一个单一的交互感知分类法中,包含约12K个精心策划的视频-查询对,涵盖多样场景和复杂性。我们在VISTA上系统评估了11个最先进的VLM,并分解了跨分类法的聚合性能,揭示了传统指标掩盖的缺陷和显著的时空偏差。通过在具有挑战性的数据集上提供详细的、分类法驱动的诊断,VISTA提供了一个精细的框架来指导模型设计、预训练策略和评估协议的进步。总体而言,VISTA是第一个大规模、交互感知的VLM时空理解诊断基准。

英文摘要

Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets and restricted entity types, failing to capture the freeform, multi-action interactions between diverse entities that characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises ~12K curated video-query pairs spanning diverse scenes and complexities. We systematically evaluate 11 state-of-the-art VLMs on VISTA, and break down aggregate performance across our taxonomy to reveal shortcomings and pronounced spatio-temporal biases obscured by traditional metrics. By providing detailed, taxonomy-driven diagnostics on a challenging dataset, VISTA offers a nuanced framework to guide advances in model design, pretraining strategies, and evaluation protocols. Overall, VISTA is the first large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.

2304.13836 2026-06-12 cs.LG cs.AI cs.CV stat.ME 版本更新

On Pitfalls of RemOve-And-Retrain: Data Processing Inequality Perspective

论 RemOve-And-Retrain 的陷阱:数据处理不等式视角

Junhwa Song, Keumgang Cha, Junghoon Seo

发表机构 * KAIST(韩国科学技术院)

AI总结 从信息论角度揭示ROAR基准的缺陷:数据无关的后处理可提升ROAR分数,导致对归因图信息量的误判,并发现模糊性偏差。

Comments Accepted at the 2026 ICML Workshop on Mechanistic Interpretability

AI中文摘要

RemOve-And-Retrain (ROAR) 基准被广泛用于评估特征归因方法,但其有效性尚未从信息论角度得到充分探索。我们证明,对归因图进行模型和数据无关的后处理(根据数据处理不等式,这些变换不能增加关于决策函数的信息)通常可以改善ROAR分数。这意味着ROAR排名的提升本身并不能证明归因图携带更多关于模型的信息。我们将这种失败模式归因于对空间模糊掩膜的偏好。在CIFAR-10、SVHN和CUB-200上的实验显示,模糊度与ROAR性能之间存在一致的关联,这种模式也出现在ROAD变体中。我们为更谨慎的基于移除的基准测试提供了指导方针,这对验证神经网络内部机制的机制性理解具有重要意义。

英文摘要

The RemOve-And-Retrain (ROAR) benchmark is widely used to evaluate feature attribution methods, yet its validity remains underexplored from an information-theoretic perspective. We show that model- and data-agnostic post-processing of attribution maps (transformations that, by the data processing inequality, cannot add information about the decision function) can often improve ROAR scores. This means that an improved ROAR ranking is not, by itself, evidence that an attribution map carries more information about the model. We trace this failure mode to a bias toward spatially blurry masks. Experiments on CIFAR-10, SVHN, and CUB-200 show a consistent association between blurriness and ROAR performance, a pattern that also appears in the ROAD variant. We provide guidelines for more cautious removal-based benchmarking, with implications for validating mechanistic understanding of neural network internals.
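
The information-theoretic step this abstract relies on can be written out as a short sketch. The symbols are notation assumed here rather than taken from the paper: $f$ is the decision function, $A$ an attribution map computed from it, and $g$ a post-processing step that depends on neither the model nor the data (for example, a fixed blur). Because $g(A)$ is computed from $A$ alone, $f \to A \to g(A)$ forms a Markov chain, and the data processing inequality gives:

```latex
\[
  f \;\to\; A \;\to\; g(A)
  \quad\Longrightarrow\quad
  I\bigl(f;\, g(A)\bigr) \;\le\; I\bigl(f;\, A\bigr).
\]
```

Under this reading, if $g(A)$ nevertheless scores higher on ROAR than $A$, the gain cannot come from additional information about $f$, which is the mismatch the abstract points to.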

13. 其他/综合视觉 14 篇

2606.12988 2026-06-12 cs.CV cs.AI 新提交

A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis

一种用于实时个性化人体工学姿态分析的机器学习框架

Manex Atxa, Bruno Simoes, Julen Balzategui

发表机构 * Vicomtech Foundation(Vicomtech基金会) Basque Research and Technology Alliance(巴斯克研究与技术联盟) BRTA

AI总结 提出利用三维体积视频数据实时预测人体工学/非工学姿态的方法,结合3D点云多角度分析与个性化深度学习分类器,克服固定视角遮挡问题,实现实时评估。

Comments 13 pages, 7 figures, conference 24CMH

AI中文摘要

本文介绍了一种利用三维体积视频数据实时预测人体工学和非工学姿态的新方法。尽管该方法是为人体工学评估设计的,但它可以适应其他需要实时分析人体姿态的应用。该系统的一个突出特点是能够在评估过程中分析3D点云,从而实现多角度计算。这克服了相机通常提供固定视角的关键限制,从而限制了全面姿态评估可用的数据,尤其是在发生遮挡时。系统持续自动地对实时流数据使用选定的视角进行姿态推断;然而,只有用户手动选择和标记的姿态用于训练个性化深度学习分类器。该方法通过一个案例研究进行了优化,其中RGB-D相机捕捉了执行负重任务的受试者,实现了实时骨骼标记。模型在此数据上训练,并在训练阶段后对新流数据实时进行推断。本研究通过结合最先进的3D数据技术和传统的2D姿态估计算法,为实时人体工学评估提供了一种可扩展且实用的方法。它解决了工作场所环境中日益增长的安全与健康监测需求,标志着对该领域的显著贡献。

英文摘要

This paper introduces a new methodology for real-time prediction of ergonomic and non-ergonomic human poses using volumetric video data in three dimensions. Although the methodology was designed for ergonomic assessments, it can be adapted to other applications requiring real-time analysis of human posture. One aspect that makes this system stand out is its ability to analyze 3D point clouds during the assessment, enabling computation from multiple angles. This overcomes a critical limitation of cameras, which often provide only a fixed viewpoint and thereby restrict the data available for a thorough postural evaluation, especially when occlusions occur. The system continuously and automatically performs pose inference using the chosen perspective on the real-time streaming data; however, only the poses manually selected and labeled by the user are used to train the personalized deep learning classifier. The methodology has been refined through a case study in which RGB-D cameras captured subjects performing load-lifting tasks, enabling real-time skeletal labeling. The model was trained on this data and, following the training phase, performs inference on new streaming data in real time. This research offers a scalable and pragmatic approach for real-time ergonomic evaluation by combining state-of-the-art 3D data technologies and traditional 2D pose estimation algorithms. It addresses the increasing need for safety and health monitoring in workplace environments, marking a notable contribution to the domain.

2606.13239 2026-06-12 cs.SE cs.AI cs.CL cs.CV 交叉投稿

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

ComAct: 通过COM即行动范式重构专业软件操作

Jiaxin Ai, Tao Hu, Xuemeng Yang, Shu Zou, Hairong Zhang, Daocheng Fu, Yu Yang, Hongbin Zhou, Nianchen Deng, Pinlong Cai, Zhongyuan Wang, Botian Shi, Kaipeng Zhang, Licheng Wen

AI总结 提出COM即行动范式,将专业软件交互转化为确定性程序合成,解决GUI代理的脆弱性和API代理的异构性问题;构建ComCADBench基准和ComActor自校正代理,在工业CAD软件上实现SOTA性能。

AI中文摘要

现有的计算机使用代理在专业软件操作上仍然存在根本性限制:基于GUI的代理受困于脆弱的视觉基础和长程错误累积,而基于API的方法则难以应对异构协议和不可访问的商业接口。在这项工作中,我们将组件对象模型(COM)识别为统一的、可执行的抽象,提出了COM即行动:一种新的范式,将专业软件交互重新定义为确定性程序合成,而非顺序视觉控制。为了在最苛刻的环境中验证这一范式,我们引入了ComCADBench,这是首个针对操作真实工业CAD软件的代理的基准测试。我们的实验揭示了显著的范式差距:前沿的专有模型在基于GUI的交互下几乎无法成功,而基于COM的执行则带来了实质性的即时收益。为了弥合语法正确性与几何精度之间的剩余差距,我们开发了ComActor,一个通过渐进式三阶段框架训练的自校正代理,以及ComForge,一个用于在Windows容器中进行大规模训练的可扩展平台。大量实验表明,ComActor在ComCADBench上达到了最先进的性能,在基线崩溃的长程任务中表现出强大的韧性,并泛化到外部CAD基准测试。

英文摘要

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesis rather than sequential visual control. To validate this paradigm in the most demanding environments, we introduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Our experiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero success under GUI-based interaction, whereas COM-based execution yields substantial immediate gains. To bridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, a self-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalable platform for large-scale training in Windows containers. Extensive experiments show that ComActor achieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon tasks where baselines collapse, and generalizes to external CAD benchmarks.

2606.13368 2026-06-12 cs.AI cs.CV 交叉投稿

IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

IterCAD:一种用于视觉引导的CAD生成与编辑的迭代多模态智能体

Tao Hu, Jiaxin Ai, Licheng Wen, Xueheng Li, Shu Zou, Siqi Li, Nianchen Deng, Xinyu Cai, Hongbin Zhou, Pinlong Cai, Daocheng Fu, Yu Yang, Hairong Zhang, Botian Shi, Xuemeng Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出IterCAD,一种闭环交互式CAD生成与编辑的多模态智能体框架,通过渐进式SFT和几何感知强化学习优化,在代码可执行性和几何精度上显著超越现有方法。

AI中文摘要

计算机辅助设计在现代制造业中至关重要,然而现有的自动化方法主要依赖于开环、一次性生成,与迭代的实际实践不匹配。在本文中,我们提出了IterCAD,一个统一的闭环交互式CAD生成与编辑的多模态智能体框架。我们将任务形式化为多模态智能体与可执行CAD沙箱之间的多轮交互,涵盖三个任务:绘图到代码、文本到代码和交互式编辑。为此,我们开发了一个数据合成流水线,结合先进的工业制造特征,生成符合标准的多视图工程图纸、复杂的代码编辑任务和高保真交互轨迹。我们通过渐进式SFT,然后结合几何感知强化学习和可行前缀掩码来优化智能体,以增强代码可执行性和几何保真度。最后,我们引入了IterCAD-Bench评估套件,并提出了Chamfer距离容忍度-召回率(CD-TR)曲线及其AUC-TR指标,建立了一个无幸存者偏差的标准,统一了代码有效性和几何精度。大量实验表明,IterCAD在多个基准测试中取得了极具竞争力的性能,在代码可执行性和几何精度上显著优于现有方法,并在闭环迭代优化中展现出卓越的能力。

英文摘要

Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.
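
One way to read the tolerance-recall idea in this abstract is sketched below. It is an interpretation, not the paper's definition: recall at a tolerance is taken as the fraction of all samples whose Chamfer distance falls within that tolerance, and non-executable outputs are assigned an infinite distance so that failures are never dropped from the denominator. The function names, the tolerance grid, and the normalized trapezoidal AUC are all assumptions.

```python
import math


def tolerance_recall_curve(chamfer, tolerances):
    """Recall at each tolerance: share of ALL samples with Chamfer distance <= t.
    Non-executable outputs are passed as math.inf, so they never count as recalled."""
    n = len(chamfer)
    return [sum(cd <= t for cd in chamfer) / n for t in tolerances]


def auc_trapezoid(xs, ys):
    """Trapezoidal area under the curve, normalized by the tolerance range."""
    area = sum((xs[i + 1] - xs[i]) * (ys[i] + ys[i + 1]) / 2
               for i in range(len(xs) - 1))
    return area / (xs[-1] - xs[0])


# Four generated models; the third failed to execute.
chamfer = [0.01, 0.05, math.inf, 0.2]
tolerances = [0.0, 0.1, 0.2]
recall = tolerance_recall_curve(chamfer, tolerances)
print(recall)
print(auc_trapezoid(tolerances, recall))
```

Keeping failed generations in the denominator is what makes a metric of this kind insensitive to survivor bias: a system cannot improve its score by producing fewer, safer outputs.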

2507.22791 2026-06-12 cs.CV 版本更新

Modality-Aware Feature Matching in Visual and Vision-Language Applications: A Comprehensive Survey

视觉与视觉-语言应用中的模态感知特征匹配:全面综述

Weide Liu, Wei Zhou, Jun Liu, Ping Hu, Jun Cheng, Jungong Han, Weisi Lin

发表机构 * School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics(江西财经大学计算机与人工智能学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院) School of Computer Science and Informatics, Cardiff University(卡迪夫大学计算机科学与信息学院) School of Computing and Communications, Lancaster University(兰卡斯特大学计算机与通讯学院) School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院) Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR)(新加坡资讯研究院,科技研究局(A*STAR)) Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 综述基于模态的特征匹配,涵盖传统手工方法和现代深度学习方法,重点讨论跨RGB、深度、3D点云、LiDAR、医学图像及视觉-语言模态的进展,突出模态感知技术。

Comments CSUR

AI中文摘要

特征匹配是计算机视觉中的一项基础任务,对于图像检索、立体匹配、三维重建和SLAM等应用至关重要。本综述全面回顾了基于模态的特征匹配,探索了传统手工方法,并强调了当代深度学习方法在各种模态中的应用,包括RGB图像、深度图像、3D点云、LiDAR扫描、医学图像和视觉-语言交互。传统方法利用Harris角点等检测器和SIFT、ORB等描述符,在中等模态内变化下表现出鲁棒性,但在显著模态差距下表现不佳。当代基于深度学习的方法,例如基于CNN的SuperPoint和基于Transformer的LoFTR等无检测器策略,显著提高了跨模态的鲁棒性和适应性。我们重点介绍了模态感知的进展,例如用于深度图像的几何和深度特定描述符、用于3D点云的稀疏和密集学习方法、用于LiDAR扫描的注意力增强神经网络,以及用于复杂医学图像匹配的MIND描述符等专门解决方案。跨模态应用,特别是在医学图像配准和视觉-语言任务中,突显了特征匹配处理日益多样化数据交互的演变。

英文摘要

Feature matching is a cornerstone task in computer vision, essential for applications such as image retrieval, stereo matching, 3D reconstruction, and SLAM. This survey comprehensively reviews modality-based feature matching, exploring traditional handcrafted methods and emphasizing contemporary deep learning approaches across various modalities, including RGB images, depth images, 3D point clouds, LiDAR scans, medical images, and vision-language interactions. Traditional methods, leveraging detectors like Harris corners and descriptors such as SIFT and ORB, demonstrate robustness under moderate intra-modality variations but struggle with significant modality gaps. Contemporary deep learning-based methods, exemplified by detector-free strategies like CNN-based SuperPoint and transformer-based LoFTR, substantially improve robustness and adaptability across modalities. We highlight modality-aware advancements, such as geometric and depth-specific descriptors for depth images, sparse and dense learning methods for 3D point clouds, attention-enhanced neural networks for LiDAR scans, and specialized solutions like the MIND descriptor for complex medical image matching. Cross-modal applications, particularly in medical image registration and vision-language tasks, underscore the evolution of feature matching to handle increasingly diverse data interactions.

2509.21398 2026-06-12 cs.CV eess.IV 版本更新

Skeleton Sparsification and Densification Scale-Spaces

骨架稀疏化和致密化尺度空间

Julia Gierke, Pascal Peter

发表机构 * Mathematical Image Analysis Group, Saarland University(萨尔兰大学数学图像分析组) Department of Mathematics and Computer Science, Saarland University(萨尔兰大学数学与计算机科学系)

AI总结 提出骨架化尺度空间,通过稀疏化中轴实现形状层次简化,并引入致密化实现从粗到细的逆过程,应用于鲁棒骨架化、形状压缩和增材制造刚度增强。

AI中文摘要

Hamilton-Jacobi骨架,也称为中轴,是一种强大的形状描述符,它根据最大内切圆的中心来表示二值对象。尽管应用广泛,但中轴对噪声敏感:微小的边界变化可能导致骨架不成比例地扩大和产生不必要的分支。经典的剪枝方法通过系统地移除多余的骨架分支来缓解这一缺陷。这种骨架的顺序简化类似于稀疏化尺度空间的原理,该空间将图像嵌入到从越来越稀疏的像素表示重建的族中。我们通过引入骨架化尺度空间将两者结合起来:它们利用中轴的稀疏化来实现形状的层次简化。与传统的剪枝不同,我们的框架固有地满足关键的尺度空间特性,如层次结构、可控简化和对几何变换的等变性。我们在连续和离散公式中提供了严格的理论基础,并通过致密化进一步扩展了这一概念。通过逐步增长骨架而不是收缩它,我们允许从粗到细尺度的逆过程。致密化尺度空间甚至可以超越原始骨架,产生与实际问题相关的过完备形状表示。通过概念验证实验,我们展示了我们的框架在实际任务中的有效性,包括鲁棒骨架化、形状压缩和增材制造的刚度增强。

英文摘要

The Hamilton-Jacobi skeleton, also known as the medial axis, is a powerful shape descriptor that represents binary objects in terms of the centres of maximal inscribed discs. Despite its broad applicability, the medial axis suffers from sensitivity to noise: Minor boundary variations can lead to disproportionately large and undesirable expansions of the skeleton. Classical pruning methods mitigate this shortcoming by systematically removing extraneous skeletal branches. This sequential simplification of skeletons resembles the principle of sparsification scale-spaces that embed images into a family of reconstructions from increasingly sparse pixel representations. We combine both worlds by introducing skeletonisation scale-spaces: They leverage sparsification of the medial axis to achieve hierarchical simplification of shapes. Unlike conventional pruning, our framework inherently satisfies key scale-space properties such as hierarchical architecture, controllable simplification, and equivariance to geometric transformations. We provide a rigorous theoretical foundation in both continuous and discrete formulations and extend the concept further with densification. By growing the skeleton successively instead of shrinking it, we allow inverse progression from coarse to fine scales. Densification scale-spaces can even reach beyond the original skeleton to produce overcomplete shape representations with relevancy for practical applications. Through proof-of-concept experiments, we demonstrate the effectiveness of our framework for practical tasks including robust skeletonisation, shape compression, and stiffness enhancement for additive manufacturing.

2604.23165 2026-06-12 cs.CV 版本更新

BSViT: A Burst Spiking Vision Transformer for Expressive and Efficient Visual Representation Learning

BSViT:用于富有表达力且高效的视觉表征学习的爆发脉冲视觉Transformer

Hongxiang Peng, Dewei Bai, Hong Qu

发表机构 * School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院)

AI总结 提出BSViT,通过双通道爆发脉冲自注意力机制和局部邻域掩码策略,解决脉冲视觉Transformer中二进制脉冲信息容量有限和全局自注意力密集交互的问题,在静态和事件视觉基准上取得更高精度和能效。

Comments Accepted by ECML PKDD 2026

AI中文摘要

脉冲视觉Transformer(S-ViT)为节能视觉学习提供了有前景的框架。然而,现有设计仍受限于两个基本问题:二进制脉冲编码的信息容量有限以及全局自注意力引入的密集令牌交互。为应对这些挑战,本文提出BSViT,一种爆发脉冲驱动的视觉Transformer,具有双通道爆发脉冲自注意力(DBSSA)机制。DBSSA用二进制脉冲编码查询,用爆发脉冲编码键以增强表示能力。值通路采用双兴奋性和抑制性二进制通道,实现有符号调制和更丰富的脉冲交互。重要的是,整个注意力操作保持仅加法计算,确保与节能神经形态硬件的兼容性。为进一步降低脉冲活动并融入空间先验,引入补丁邻域掩码策略将注意力限制在局部邻域,实现结构感知稀疏性并减少计算开销。此外,爆发脉冲编码被系统地集成到网络中,以提升脉冲级表示能力,超越传统二进制脉冲。在静态和事件视觉基准上的大量实验表明,BSViT在精度上持续优于现有脉冲Transformer,同时保持有竞争力的能效。

英文摘要

Spiking Vision Transformers (S-ViTs) offer a promising framework for energy-efficient visual learning. However, existing designs remain limited by two fundamental issues: the restricted information capacity of binary spike coding and the dense token interactions introduced by global self-attention. To address these challenges, this work proposes BSViT, a burst spiking-driven Vision Transformer featuring a Dual-Channel Burst Spiking Self-Attention (DBSSA) mechanism. DBSSA encodes queries with binary spikes and keys with burst spikes to enhance representational capacity. The value pathway adopts dual excitatory and inhibitory binary channels, enabling signed modulation and richer spike interactions. Importantly, the entire attention operation preserves addition-only computation, ensuring compatibility with energy-efficient neuromorphic hardware. To further reduce spike activity and incorporate spatial priors, a patch adjacency masking strategy is introduced to restrict attention to local neighborhoods, resulting in structure-aware sparsity and reduced computational overhead. In addition, burst spike coding is systematically integrated across the network to increase spike-level representational capacity beyond conventional binary spiking. Extensive experiments on both static and event-based vision benchmarks demonstrate that BSViT consistently outperforms existing spiking Transformers in accuracy while maintaining competitive energy efficiency.

2604.13924 2026-06-12 cs.LG cs.AI cs.CV 版本更新

ASTER: Latent Pseudo-Anomaly Generation for Unsupervised Time-Series Anomaly Detection

ASTER: 用于无监督时间序列异常检测的潜在伪异常生成

Romain Hermary, Samet Hicsonmez, Dan Pineau, Abd El Rahman Shabayek, Djamila Aouada

发表机构 * University of Montreal(蒙特利尔大学)

AI总结 提出ASTER框架,在潜在空间生成伪异常训练Transformer分类器,结合预训练LLM增强表示,在三个基准数据集上达到最优性能。

Comments Published in ICPR 2026

AI中文摘要

时间序列异常检测(TSAD)在工业监控、医疗保健和网络安全等领域至关重要,但由于罕见且异质的异常以及标记数据的稀缺性,它仍然具有挑战性。这种稀缺性使得无监督方法占主导地位,但现有方法通常依赖于重建或预测(难以处理复杂数据),或依赖于需要领域特定异常合成和固定距离度量的基于嵌入的方法。我们提出ASTER,一个直接在潜在空间中生成伪异常的框架,避免了手工制作的异常注入和对领域专业知识的需求。潜在空间解码器生成定制的伪异常,用于训练基于Transformer的异常分类器,而预训练的LLM丰富了该空间的时间和上下文表示。在三个基准数据集上的实验表明,ASTER达到了最先进的性能,并为基于LLM的TSAD设立了新标准。

英文摘要

Time-series anomaly detection (TSAD) is critical in domains such as industrial monitoring, healthcare, and cybersecurity, but it remains challenging due to rare and heterogeneous anomalies and the scarcity of labelled data. This scarcity makes unsupervised approaches predominant, yet existing methods often rely on reconstruction or forecasting, which struggle with complex data, or on embedding-based approaches that require domain-specific anomaly synthesis and fixed distance metrics. We propose ASTER, a framework that generates pseudo-anomalies directly in the latent space, avoiding handcrafted anomaly injections and the need for domain expertise. A latent-space decoder produces tailored pseudo-anomalies to train a Transformer-based anomaly classifier, while a pre-trained LLM enriches the temporal and contextual representations of this space. Experiments on three benchmark datasets show that ASTER achieves state-of-the-art performance and sets a new standard for LLM-based TSAD.

2605.26144 2026-06-12 cs.SE cs.AI cs.CV 版本更新

VISTA: An End-to-End Benchmark for Visual Spec-to-Web-App Coding Agents

VISTA:面向视觉规格到网页应用编码智能体的端到端基准

JunJia Guo, Yuhang Yao, Jiawei, Zhou, Jingdi Chen

发表机构 * University of Arizona(亚利桑那大学) Zoom Stony Brook University(石溪大学)

AI总结 提出VISTA基准,通过多维度输入条件和评估指标,衡量基于LLM的智能体从视觉规格生成功能完整、视觉一致的网页应用的能力。

Comments Project page: https://kaboider.github.io/VIS_APP/; Code: https://github.com/kaboider/VIS_APP_Code; Dataset: https://huggingface.co/datasets/JunJiaGuo/VIS-APP-Bench

AI中文摘要

我们提出了VISTA(视觉规格到应用基准),这是一个用于评估基于LLM的智能体端到端网页应用生成能力的基准。与以往关注算法任务的代码生成基准不同,VISTA针对以UI为中心的现实开发场景,要求智能体从不明确的输入中生成功能完整、视觉一致的应用。我们定义了五种提示信息条件,沿视觉/结构保真度和技术栈约束两个轴变化:(1)仅文本,自由选择技术栈;(2)文本加参考截图,指定三种技术栈;(3)文本加参考截图,自由选择技术栈;(4)文本加截图和精简的Figma结构,指定单一技术栈;(5)文本加截图和精简的Figma结构,自由选择技术栈。为实现稳健评估,基准中的每个页面都手动标注了交互式UI组件和大约三个视觉锚点,解决了Playwright等基于脚本的测试工具在开放式代码生成设置中的已知局限性。评估结合了基于DOM的参考匹配、行为特定的浏览器测试和基于CLIP的视觉相似性,共同衡量结构对齐、行为完整性和整体视觉保真度。我们使用VISTA评估了来自两个模型家族和两个框架的四个智能体系统,发现视觉保真度和功能正确性在输入条件和智能体之间部分解耦,并且智能体的编辑风格差异显著,但大体上与任务质量正交。VISTA为推进基于智能体的软件工程研究建立了严谨且可重复的基础。

英文摘要

We present VISTA (VIsual Spec-To-App Benchmark), a benchmark for evaluating the end-to-end web-app generation capabilities of LLM-based agents. Unlike prior code generation benchmarks that focus on algorithmic tasks, VISTA targets realistic UI-centric development, where agents must produce functional, visually coherent applications from underspecified inputs. We define five prompt-information conditions that vary along two axes, visual/structural fidelity and stack constraint: (1) text only with free stack choice, (2) text with reference screenshots under three specified stacks, (3) text with reference screenshots under free stack choice, (4) text with screenshots and pruned Figma structure under a single specified stack, and (5) text with screenshots and pruned Figma structure under free stack choice. To enable robust evaluation, each page in the benchmark is manually annotated with interactive UI components and around three visual anchor points, addressing the well-known limitations of script-based testing tools such as Playwright in open-ended code generation settings. Evaluation combines DOM-grounded reference matching, behavior-specific browser tests, and CLIP-based visual similarity, jointly measuring structural alignment, behavioral completeness, and overall visual fidelity. We use VISTA to assess four agent systems drawn from two model families and two harnesses, finding that visual fidelity and functional correctness are partially decoupled across both input conditions and agents, and that agent editing style varies sharply but is largely orthogonal to task quality. VISTA establishes a rigorous and reproducible foundation for advancing agent-based software engineering research.

2606.11930 2026-06-12 cs.HC cs.AI cs.CV updated version

Frozen Multimodal Embeddings for AI-Assisted Interview Assessment of Personality and Cognitive Ability

冻结多模态嵌入用于异步视频面试中的个性与认知能力评估

Kuo-En Hung, Hung-Yue Suen, Shih-Ching Yeh, Hsiang-Wen Wang

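Affiliations: Technology Application and Human Resource Development, National Taiwan Normal University; Computer Science and Information Engineering, National Central University; Institute of Photonic System, National Yang Ming Chiao Tung University

AI summary: To address high-dimensional multimodal learning with limited labeled data in asynchronous video interviews, the authors pair frozen multimodal encoders (CLIP, Whisper, RoBERTa, and others) with low-capacity downstream models. On personality prediction this yields a 19.1% relative MSE reduction over the official baseline, and the results on cognitive ability prediction point to possible dataset shortcuts.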

Comments 9 pages, 1 figure, 5 tables

Details
AI Chinese abstract

从异步视频面试(AVI)中预测心理特质是一个具有挑战性的多模态学习问题,因为标注数据集有限,而每个回答包含高维的视觉、声学和语言信号。本文介绍了我们针对ACM Multimedia AVI Challenge 2026的解决方案,该挑战评估两个任务:Track 1从与个性相关的面试回答中预测自我报告的HEXACO个性特质,Track 2从结构化AVI回答中对认知能力水平进行分类。我们将该问题视为小样本表示学习任务。我们不微调大型预训练模型,而是使用冻结的多模态编码器,包括用于视觉特征的CLIP、用于声学特征和转录的Whisper,以及用于文本表示的RoBERTa、E5和DeBERTaV3,随后使用低容量下游模型。对于Track 1,我们的特质特定回归和晚期融合系统取得了0.2696的平均验证MSE,优于官方基线0.3334。消融结果显示,从全局模型(0.3189)到逐特质建模(0.2871)再到逐特质晚期融合(0.2696)的三步改进,相对于官方基线MSE相对降低了19.1%。对于Track 2,一个紧凑的受试者属性基线达到了0.5781的准确率,而我们的多模态集成达到了0.5313,两者均高于官方基线0.4062。我们将这一结果解释为验证分割中可能存在受试者属性捷径的证据,而非从AVI内容中进行的稳健认知推理。总体而言,我们的发现表明,基于AVI的心理评估受益于特质特定的多模态建模,但认知能力预测需要仔细控制数据集捷径。

English abstract

Predicting psychological traits from asynchronous video interviews (AVIs) is a challenging problem in AI-assisted interview assessment because labeled datasets are limited while each response contains high-dimensional visual, acoustic, and verbal signals. This paper presents our solution for the ACM Multimedia AVI Challenge 2026, which evaluates two tasks: Track 1 predicts self-reported HEXACO personality traits from personality-related interview responses, and Track 2 classifies cognitive ability levels from structured AVI responses. We treat the problem as a small-sample representation learning task. Instead of fine-tuning large pretrained models, we use frozen multimodal encoders, including CLIP for visual features, Whisper for acoustic features and transcripts, and RoBERTa, E5, and DeBERTaV3 for textual representations, followed by low-capacity downstream models. For Track 1, our trait-specific regression and late-fusion system achieves an average validation MSE of 0.2696, improving over the official baseline of 0.3334. Ablation results show a three-step improvement from a global model (0.3189), to per-trait modeling (0.2871), to per-trait late fusion (0.2696), corresponding to a 19.1% relative MSE reduction over the official baseline. For Track 2, a compact subject-attribute baseline reaches 0.5781 accuracy, while our multimodal ensemble reaches 0.5313, both above the official baseline of 0.4062. We interpret this result as evidence of possible subject-attribute shortcuts in the validation split rather than robust cognitive inference from AVI content. Overall, our findings suggest that AVI-based psychological assessment benefits from trait-specific multimodal modeling, but cognitive ability prediction requires careful control of dataset shortcuts.
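
The 19.1% figure can be checked directly from the MSE values quoted in the abstract. The short sketch below uses only those reported numbers; the helper name `relative_reduction` is an assumption introduced here for illustration.

```python
def relative_reduction(baseline: float, value: float) -> float:
    """Fraction by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline


# Validation MSE values as reported in the abstract (Track 1).
OFFICIAL_BASELINE = 0.3334
ABLATION_STAGES = {
    "global model": 0.3189,
    "per-trait modeling": 0.2871,
    "per-trait late fusion": 0.2696,
}

for name, mse in ABLATION_STAGES.items():
    reduction = relative_reduction(OFFICIAL_BASELINE, mse)
    print(f"{name}: MSE {mse:.4f}, {reduction:.1%} below the official baseline")
```

For the final system this is (0.3334 - 0.2696) / 0.3334, which rounds to the 19.1% stated in the abstract.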

2511.20162 2026-06-12 cs.CV cs.AI q-bio.NC updated version

Action Without Interaction: Probing the Physical Foundations of Video LMMs via Contact-Release Detection

无交互行动:通过接触-释放检测探测视频LMMs的物理基础

Daniel Harari, Michael Sidorov, Chen Shterental, Liel David, Abrham Kahsay Gebreselasie, Muhammad Haris Khan

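Affiliations: Weizmann Institute of Science; Mohamed bin Zayed University of Artificial Intelligence

AI summary: The study examines how far video LMMs ground their semantic understanding in the actual visual input, and uses contact-release detection to show that the models fall short on physical grounding.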

Details
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 workshop on Cognitive Foundations for Multimodal Models (CogVL)
AI Chinese abstract

大型多模态模型(LMMs)在现实视觉任务中表现出越来越强的性能,例如在视频中描述对象、周围环境和动态动作。本研究探讨了这些模型在多大程度上将语义理解建立在实际视觉输入之上。具体来说,给定手与物体互动的序列,我们询问模型互动在何时、何处开始或结束。为此,我们引入了一个前所未有的大规模数据集,包含来自Something-Something-V2数据集的视频中超过20,000个标注的互动。250名AMTurk人工标注者标记了核心互动事件,特别是物体和代理在何时、何处结合(“接触”)或分离(“释放”)。我们要求最先进的LMMs,包括GPT、Gemini和Qwen,在短视频中定位这些事件,每个视频只有一个事件。结果表明,尽管模型能够可靠地命名目标对象并识别动作,但它们表现出一种“捷径学习”现象,即语义成功掩盖了在物理基础方面的失败。具体来说,它们始终无法识别互动开始或结束的帧,并且在场景中对物理事件的定位较差。这种脱节表明,尽管LMMs在系统1直观模式识别(命名动作和对象)方面表现出色,但它们缺乏系统2认知基础,无法对如“接触”和“释放”这样的物理原始要素进行推理,因此无法真正将动态场景扎根于物理现实。

English abstract

Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe objects, the surroundings, and dynamic actions in detail. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first-of-its-kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked SoTA LMMs, including GPT, Gemini, and Qwen, to locate these events in short videos, each with a single event. The results show that while models reliably name target objects and identify actions, they exhibit a form of 'shortcut learning' where semantic success masks a failure in physical grounding. Specifically, they consistently fail to identify the frame where the interaction begins or ends and poorly localize the physical event within the scene. This disconnect suggests that while LMMs excel at System 1 intuitive pattern recognition (naming the action and objects), they lack the System 2 cognitive foundations required to reason about physical primitives like 'contact' and 'release', and hence to truly ground dynamic scenes in physical reality.

2507.11936 2026-06-12 cs.CL cs.AI cs.CV cs.LG updated version

A Survey of Deep Learning for Geometry Problem Solving

深度学习在几何问题求解中的应用综述

Jianzhe Ma, Wenxuan Wang, Qin Jin

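Affiliations: Renmin University of China

AI summary: This survey covers deep learning for geometry problem solving, including the relevant tasks, methods, evaluation metrics, and future directions, and aims to offer a practical reference for the field.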

Comments ACL 2026 Main Conference

Details
AI Chinese abstract

几何问题求解作为数学推理的重要组成部分,在教育、评估AI数学能力及多模态能力评估中具有关键作用。近期深度学习技术,尤其是多模态大语言模型的出现,显著加速了该领域的研究。本文综述了深度学习在几何问题求解中的应用,包括(i)几何问题求解相关任务的全面总结;(ii)相关深度学习方法的深入回顾;(iii)评估指标和方法的详细分析;以及(iv)最先进性能、现有挑战和有前景的未来方向的批判性讨论。我们的目标是提供一个全面且实用的深度学习在几何问题求解中的参考,从而推动该领域进一步发展。我们维护了一个相关论文列表:https://github.com/majianz/dl4gps。

English abstract

Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI's mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper presents a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of state-of-the-art performance, existing challenges, and promising future directions. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We maintain a list of relevant papers: https://github.com/majianz/dl4gps.

2508.03721 2026-06-12 cs.CV eess.IV updated version

Enhancing Diameter Measurement Accuracy in Machine Vision Applications

提升机器视觉应用中直径测量精度

Ahmet Gokhan Poyraz, Ahmet Emir Dirik, Hakan Gurkan, Mehmet Kacmaz

Affiliations: Department of Electrical and Electronics Engineering, Bursa Technical University; Doğu Pres R&D; Department of Computer Engineering, Bursa Uludağ University; Institute of Electrical Information Technology, Clausthal University of Technology

AI summary: The paper proposes two methods that use multiple known reference parts to improve measurement accuracy, one based on conversion factors and one on pixel information. In experiments, measurement errors fall from 13-114 micrometers to 1-2 micrometers.

Comments Preprint

Details
Journal ref
Measurement 278 (2026) 121646
AI Chinese abstract

在相机测量系统中,通常使用远心镜头等特殊设备来测量公差较小的零件。然而,由于系统内的机械和软件因素,测量误差仍可能发生,特别是在使用相同设置测量不同直径零件时。本文提出两种创新方法,通过多个已知参考零件提高测量精度:基于转换因子的方法和基于像素的方法。第一种方法通过已知参考零件估计转换因子以计算未知零件的直径(毫米)。第二种方法则直接利用参考零件的像素直径信息估算直径(毫米)。实验设置包括工业级相机和远心镜头。对玻璃样品(1-12 mm)和金属工件(3-24 mm)的测试显示,使用所提出的方法后,原本范围为13-114微米的测量误差被降至1-2微米。仅使用少量已知参考零件,该方法能够实现相机视野内所有零件的高精度测量。此外,该方法通过显著降低误差率和提高测量可靠性,丰富了现有的直径测量文献。

English abstract

In camera measurement systems, specialized equipment such as telecentric lenses is often employed to measure parts with narrow tolerances. However, despite the use of such equipment, measurement errors can occur due to mechanical and software-related factors within the system. These errors are particularly evident in applications where parts of different diameters are measured using the same setup. This study proposes two innovative approaches to enhance measurement accuracy using multiple known reference parts: a conversion factor-based method and a pixel-based method. In the first approach, the conversion factor is estimated from known references to calculate the diameter (mm) of the unknown part. In the second approach, the diameter (mm) is directly estimated using pixel-based diameter information from the references. The experimental setup includes an industrial-grade camera and telecentric lenses. Tests conducted on glass samples (1-12 mm) and metal workpieces (3-24 mm) show that measurement errors, which originally ranged from 13-114 micrometers, were reduced to 1-2 micrometers using the proposed methods. By utilizing only a few known reference parts, the proposed approach enables high-accuracy measurement of all parts within the camera's field of view. Additionally, this method enhances the existing diameter measurement literature by significantly reducing error rates and improving measurement reliability.
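
The abstract does not spell out how either method uses the references, so the following is only a minimal sketch under stated assumptions: both approaches are modelled here as piecewise-linear interpolation over the reference parts, and the calibration values, the helper `interp`, and the function names are all hypothetical, not the authors' implementation.

```python
from bisect import bisect_right

# Hypothetical calibration data (not from the paper): known reference
# diameters in mm and the pixel diameters measured for them on one setup.
REF_MM = [3.0, 12.0, 24.0]
REF_PX = [300.9, 1200.6, 2397.6]


def interp(x, xs, ys):
    """Piecewise-linear interpolation; xs must be strictly increasing.

    Values outside the reference range are clamped to the end points.
    """
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    i = bisect_right(xs, x)  # xs[i - 1] <= x < xs[i]
    t = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
    return ys[i - 1] + t * (ys[i] - ys[i - 1])


def diameter_conversion_factor(px, ref_px=REF_PX, ref_mm=REF_MM):
    """Assumed variant 1: interpolate the mm-per-pixel factor, then scale."""
    factors = [mm / p for mm, p in zip(ref_mm, ref_px)]
    return px * interp(px, ref_px, factors)


def diameter_pixel_based(px, ref_px=REF_PX, ref_mm=REF_MM):
    """Assumed variant 2: map pixel diameter to mm directly via the references."""
    return interp(px, ref_px, ref_mm)


measured_px = 750.75  # pixel diameter of an unknown part (made-up value)
print("conversion-factor estimate (mm):", diameter_conversion_factor(measured_px))
print("pixel-based estimate (mm):", diameter_pixel_based(measured_px))
```

In this sketch, a single global mm-per-pixel factor would give a size-dependent error because the made-up factor drifts slightly with diameter; using several references lets either variant follow that drift.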

2505.18060 2026-06-12 cs.CV updated version

Semantic Correspondence: Unified Benchmarking and a Strong Baseline

语义对应:统一的基准测试与强大的基线

Kaiyan Zhang, Xinghui Li, Jingyi Lu, Kai Han

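Affiliations: The University of Hong Kong

AI summary: This paper presents the first extensive survey of semantic correspondence methods. It proposes a taxonomy, aggregates results across benchmarks, and introduces a strong baseline as a foundation for future research.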

Details
Journal ref
IEEE Trans. Pattern Anal. Mach. Intell. 48, no. 3 (2026) 3911-3930
AI Chinese abstract

建立语义对应是计算机视觉中的一个具有挑战性任务,旨在在不同图像中匹配具有相同语义信息的关键点。得益于深度学习的快速发展,过去十年来取得了显著进展。然而,对这一任务的全面回顾和分析仍然缺失。本文首次对语义对应方法进行了广泛的调查。我们首先提出一个分类体系,根据方法设计的类型对现有方法进行分类。这些方法随后被相应归类,并对每种方法进行详细分析。此外,我们汇总并总结了文献中各种基准测试方法的结果,形成一个统一的比较表格,并提供详细的配置以突出性能差异。此外,为了深入了解现有的语义匹配方法,我们彻底进行了受控实验,以分析不同方法组件的有效性。最后,我们提出了一种简单而有效的基线,该基线在多个基准测试中实现了最先进的性能,为该领域未来的研究奠定了坚实基础。我们希望本文的调查能为未来的发展提供全面的参考和统一的基线。代码已公开在:https://github.com/Visual-AI/Semantic-Correspondence。

English abstract

Establishing semantic correspondence is a challenging task in computer vision, aiming to match keypoints with the same semantic information across different images. Benefiting from the rapid development of deep learning, remarkable progress has been made over the past decade. However, a comprehensive review and analysis of this task remains absent. In this paper, we present the first extensive survey of semantic correspondence methods. We first propose a taxonomy to classify existing methods based on the type of their method designs. These methods are then categorized accordingly, and we provide a detailed analysis of each approach. Furthermore, we aggregate and summarize the results of methods in the literature across various benchmarks into a unified comparative table, with detailed configurations to highlight performance variations. Additionally, to provide a detailed understanding of existing methods for semantic matching, we conduct thorough controlled experiments to analyse the effectiveness of the components of different methods. Finally, we propose a simple yet effective baseline that achieves state-of-the-art performance on multiple benchmarks, providing a solid foundation for future research in this field. We hope this survey serves as a comprehensive reference and consolidated baseline for future development. Code is publicly available at: https://github.com/Visual-AI/Semantic-Correspondence.

2412.14631 2026-06-12 cs.CV updated version

Review of Fruit Tree Image Segmentation

水果树图像分割综述

Il-Seok Oh

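Affiliations: Department of Computer Science and Artificial Intelligence/CAIIT, Jeonbuk National University, South Korea

AI summary: This paper reviews research on segmenting front-view images of fruit trees, identifies the lack of a versatile dataset and segmentation model as the main deficiency of existing work, and suggests six future research tasks toward a versatile tree segmentation module.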

Details
Journal ref
Agriculture, Volume 15, Issue 21, 2025
AI Chinese abstract

水果树图像分割是自动化农业任务如表型分析、采摘、喷洒和修剪中的关键问题。许多论文提出了适用于特定任务和环境的多样化解决方案。本文综述范围限定在水果树前视图,基于158篇通过新设计的爬虫方法收集的相关论文。这些论文基于一种按方法、图像、任务和水果顺序考虑的分类法进行系统回顾。该分类法将帮助读者直观理解这些研究活动的整体情况。本文指出,先前研究的主要不足是缺乏适用于多种任务和环境的通用数据集和分割模型。本文建议六个重要的未来研究任务,期望这些将为构建通用的树分割模块铺平道路。

English abstract

Fruit tree image segmentation is an essential problem in automating a variety of agricultural tasks such as phenotyping, harvesting, spraying, and pruning. Many research papers have proposed a diverse spectrum of solutions suited to specific tasks and environments. The review scope of this paper is confined to the front views of fruit trees and is based on 158 relevant papers collected using a newly designed crawling review method. These papers are systematically reviewed based on a taxonomy that sequentially considers the method, image, task, and fruit. This taxonomy will help readers intuitively grasp the big picture of these research activities. Our review reveals that the most noticeable deficiency of the previous studies was the lack of a versatile dataset and segmentation model that could be applied to a variety of tasks and environments. Six important future research tasks are suggested, with the expectation that these will pave the way to building a versatile tree segmentation module.