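arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and more.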

2606.12590 2026-06-12 cs.CV cs.AI New submission

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

Shayan Mohammadizadehsamakosh, Pritam Sarkar, Leonid Sigal, Ali Etemad, Elham Dolatabadi

Affiliations: York University; University of British Columbia; Vector Institute; Queen’s University

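AI summary: To address the shortcomings of medical large vision-language models in factual consistency, visual grounding, and clinical alignment, the paper proposes a fine-grained on-policy preference optimization framework that combines a bidirectional token-wise KL regularizer with a visual-contrastive grounding objective. Preference pairs are built by minimally editing model outputs so that only clinically erroneous spans are corrected, significantly improving diagnostic accuracy.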

Abstract

Large Vision-Language Models (LVLMs) have achieved strong performance across medical imaging tasks, yet they remain prone to factual inconsistencies, poor visual grounding, and misalignment with clinically meaningful feedback. Existing post-training alignment approaches, including Direct Preference Optimization (DPO) and its variants, face three critical limitations in the medical domain: (1) sequence-level reward signals treat clinically critical tokens identically to generic filler text; (2) reliance on static supervised fine-tuning references as preferred responses introduces an off-policy distribution shift, steering optimization toward stylistic artifacts over clinical correctness; and (3) alignment objectives lack explicit visual grounding constraints, leaving models insensitive to subtle yet diagnostically decisive pathological features. Our method leverages a bidirectional token-wise KL regularizer alongside a visual-contrastive grounding objective that pairs clean and lesion-corrupted images to penalize responses generated without adequate visual evidence. Together, these components form a fine-grained, on-policy alignment framework that constructs preference pairs by minimally editing model-generated outputs, correcting only clinically erroneous spans while preserving the original linguistic style. Extensive experiments across medical imaging tasks and clinical text generation benchmarks validate the effectiveness of our approach.
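For reference, limitation (1) can be seen in the standard sequence-level DPO loss, sketched below under stated assumptions. This is the generic DPO formulation, not the paper's token-wise method. The function name, argument names, and `beta` value are illustrative. Each response is represented as a list of per-token log-probabilities. Because the loss only sees their sum, a clinically critical token and a filler token contribute in exactly the same way.

```python
import math


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Generic sequence-level DPO loss for a single preference pair.

    Each argument is a list of per-token log-probabilities (hypothetical inputs).
    """
    # Sequence-level log-ratio between the policy and the frozen reference:
    # summing over tokens weights every token equally.
    chosen_ratio = sum(policy_chosen) - sum(ref_chosen)
    rejected_ratio = sum(policy_rejected) - sum(ref_rejected)
    margin = beta * (chosen_ratio - rejected_ratio)
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


# No preference signal: the loss equals log(2).
even = dpo_loss([-1.0, -1.0], [-1.0, -1.0], [-1.0, -1.0], [-1.0, -1.0])
# The policy favors the chosen response more than the reference does: lower loss.
better = dpo_loss([-0.5, -0.5], [-2.0, -2.0], [-1.0, -1.0], [-1.0, -1.0])
```

A token-wise variant would replace the two sums with per-token terms, which is the direction the abstract describes. Its exact form is not given here.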

2606.12633 2026-06-12 cs.CV cs.LG New submission

ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation

Jiangtao Kong, Peijun Zhao, Chun-Fu Chen, Youngwook Do, Shaohan Hu, Tianyi Zhou, Huajie Shao

Affiliations: University of Science and Technology of China

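AI summary: Proposes ECA, which uses a Mixture of Query module, Fisher Dynamic Expansion, and Dictionary Replay to achieve continual alignment without access to old data, mitigating catastrophic forgetting and improving incremental learning for open-ended image-to-text generation.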

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Incremental Learning (IL) for Open-ended Image-to-Text Generation (OpenITG) enables models to continuously generate accurate, contextually relevant text for new images while preserving previously acquired knowledge. Unlike prior studies, this paper addresses a more practical scenario in which the predominant category of visual data shifts over time as environments evolve. In this context, we introduce a new notion of continual alignment, which incrementally adapts the alignment module within pre-trained VLMs to preserve high-quality cross-modal representations. Based on this idea, we propose Efficient Continual Alignment (ECA), a novel exemplar-free IL approach for OpenITG. The key challenge is enabling the model to acquire new, task-specific features while minimizing interference with the established alignment without accessing raw data from previous tasks. To address this, ECA employs three core mechanisms: a Mixture of Query (MoQ) module that adapts task-specific query tokens, a Fisher Dynamic Expansion (FeDEx) that dynamically expands model structure based on a Fisher Information Matrix (FIM)-based metric, and an embedding dictionary with Dictionary Replay (DR) to retain past knowledge. To evaluate ECA's performance, we construct four new IL OpenITG benchmarks that better reflect real-world scenarios. Experimental results demonstrate that ECA significantly mitigates catastrophic forgetting and improves IL performance compared to baseline methods. Code and benchmarks are available at https://github.com/Snowball0823/ECA.
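The abstract does not spell out the FIM-based metric behind FeDEx. As a rough illustration only, the sketch below computes the common empirical diagonal Fisher estimate: the mean of squared per-sample gradients of the log-likelihood. The expansion rule `should_expand` and its threshold are hypothetical placeholders, not the paper's criterion.

```python
def diagonal_fisher(per_sample_grads):
    """Empirical diagonal Fisher: mean of squared per-sample gradients.

    per_sample_grads is a list of gradient vectors, one per sample,
    each a list of floats (gradients of the log-likelihood).
    """
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[i] ** 2 for g in per_sample_grads) / n for i in range(dim)]


def should_expand(fisher_diag, threshold):
    """Hypothetical rule: expand when mean parameter importance is high."""
    return sum(fisher_diag) / len(fisher_diag) > threshold


fisher = diagonal_fisher([[1.0, 2.0], [3.0, 0.0]])
```

Here a large Fisher value marks a parameter that earlier tasks are sensitive to, which is why such estimates are often used to decide where new capacity is safer than overwriting.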

2606.12744 2026-06-12 cs.CV New submission

GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models

Garvita Allabadi, Matteo Sodano, Roberto Estevão, Yuxiong Wang, Vikram Adve, Emre Kiciman, Ranveer Chandra

Affiliations: University of Illinois Urbana-Champaign; University of Bonn; Microsoft

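AI summary: Proposes GRIP, a learnable visual retrieval framework that uses feedback from multimodal models to identify examples that genuinely improve in-context learning performance, outperforming similarity-based retrieval on classification, captioning, and VQA.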

Abstract

In-Context Learning (ICL) has become a powerful mechanism for adapting Large Language Models (LLMs) to new tasks without fine-tuning. Extending this concept to Large Multimodal Models (LMMs), Multimodal In-Context Learning (M-ICL) relies on retrieving relevant examples, such as images, captions, or question-answer pairs, to guide predictions across tasks like classification, captioning, and visual question answering (VQA). Most existing approaches select in-context examples based on feature-space similarity, assuming that semantically similar samples provide the most useful context. However, our systematic analysis reveals that this assumption does not always hold: visually similar examples are not necessarily those that most effectively enhance in-context learning performance. To address this, we propose the Guided Retrieval of In-context Prompts (GRIP), a learnable vision-only retrieval framework that leverages feedback from LMMs to identify examples that truly improve model predictions. GRIP learns to distinguish beneficial from detrimental in-context examples through contrastive training, refining retrieval beyond pure similarity. Across three multimodal tasks, namely classification, captioning, and VQA, GRIP improves consistently over similarity-based retrieval on Qwen2.5-VL-7B, with its strongest gains in classification on Idefics2-8B. Moreover, we demonstrate that retrievers trained with feedback from one open LMM can be transferred to other models without retraining, including closed-source GPT-4o and Gemini, enabling scalable and cost-efficient deployment of M-ICL. Code will be published upon acceptance.
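The contrastive idea described above can be sketched with a generic InfoNCE-style loss. This is an assumption about the general shape of such an objective, not GRIP's actual loss. The function name, the temperature, and the toy embeddings are illustrative. An example that LMM feedback marked as beneficial acts as the positive, and examples marked as detrimental act as negatives.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_retrieval_loss(query, beneficial, detrimental, temperature=0.1):
    """InfoNCE-style loss: pull the beneficial example toward the query,
    push detrimental examples away. Embeddings are plain lists of floats."""
    pos = dot(query, beneficial) / temperature
    logits = [pos] + [dot(query, d) / temperature for d in detrimental]
    # Stable log-sum-exp over the positive and all negatives.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - pos


# Retriever already ranks the beneficial example first: small loss.
aligned = contrastive_retrieval_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
# Retriever prefers the detrimental example: large loss.
misaligned = contrastive_retrieval_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The labels come from model feedback rather than from visual similarity, so a visually close example can still end up among the negatives.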

2606.12830 2026-06-12 cs.CV cs.AI New submission

Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

Changye Li, Meng Lu, Yi Wu, Ligeng Zhu

Affiliations: Tsinghua University; Virginia Tech; NVIDIA

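AI summary: Proposes the PERIA agent, which strengthens the spatial reasoning of VLMs with visual perception and interaction tools, outperforming comparable models by 7.0%-14.8% across 13 benchmarks.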

Abstract

While recent vision-language models (VLMs) demonstrate strong multimodal understanding, they remain limited in spatial reasoning tasks that require active evidence acquisition and multi-step visual interaction. This limitation suggests that relying solely on implicit visual representations from vision encoders is insufficient for recovering fine-grained spatial evidence. We introduce PERception-Interaction-reason Agent (PERIA), a tool-augmented visual agent for spatial reasoning tasks across map reasoning, visual probing, and vision reconstruction. PERIA uses two lightweight tool families: vision perception tools for exposing textual, symbolic, and spatial evidence, and vision interaction tools for manipulating visual context, tracing paths, and verifying spatial relations. To train PERIA, we develop a unified recipe that combines supervised tool-use trajectory synthesis, composite rewards, and Observation-Relaxed Group-in-Group Policy Optimization (OR-GIGPO) for effective multi-tool behavior. Experiments on 13 benchmarks from 8 datasets show that PERIA-8B improves over the Qwen3-8B backbone by 10.0% on in-distribution benchmarks and 4.4% on out-of-distribution benchmarks, while outperforming previous state-of-the-art baselines of similar size by 7.0%-14.8%. It also achieves performance comparable to much larger models such as Qwen3-VL-235B-A22B-Thinking and GPT-5, demonstrating the effectiveness of PERIA in enhancing spatial reasoning capabilities.

2606.12847 2026-06-12 cs.CV New submission

Language-Guided Abstraction for Visual Reasoning

Xu-Jing Ye, Yuan-Gen Wang, Ruping Wang

Affiliations: School of Artificial Intelligence, Guangzhou University; Traditional Chinese Medicine Hospital of Zengcheng District

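AI summary: Proposes the L-VARC framework, which enhances visual reasoning through a language-guided Learning Using Privileged Information branch, with a Semantic Compression Module and a Cross-Attention Projector, and surpasses existing methods on ARC tasks with 18M parameters.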

Abstract

The Abstraction and Reasoning Corpus (ARC) is viewed as a critical avenue to Artificial General Intelligence (AGI), as it enables models to learn abstract transformation rules from few-shot examples and then generalize to new tasks. However, prevalent ARC methods are either purely language-based or vision-only (i.e., VARC). The former depend heavily on LLMs, consuming billions of parameters. The latter often struggle to capture high-level semantics, leading to overfitting on pixel-level patterns. To bridge this gap, we propose L-VARC, a novel framework that enhances visual reasoning via a language-guided Learning Using Privileged Information (LUPI) branch. Specifically, we design a Semantic Compression Module by feeding a unified, task-agnostic prompt into DeepSeek-V3. In this way, the raw LARC (a crowd-sourced language description dataset) can be substantially refined and structured to fit within the context-length constraint of standard text encoders (e.g., CLIP). Moreover, we design a Cross-Attention Projector to align visual features with semantic embeddings, aiming to guide the training of the ARC model. Notably, the LUPI branch is used during training and discarded at inference, thereby yielding a lightweight model with a mere 18 million parameters. Extensive experiments demonstrate that our L-VARC effectively leverages linguistic priors to boost visual reasoning and outperforms state-of-the-art methods. Ablation studies further confirm the contribution of the two new designs to the L-VARC framework. The code is available at https://github.com/GZHU-DVL/L-VARC.

2606.12886 2026-06-12 cs.CV cs.AI New submission

Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement

Tingyu Li, Le Zhou, Siyuan Li, Yujun Wu, Xinglong Xu, Jingxuan Wei, Conghui He, Cheng Tan

Affiliations: Shanghai Artificial Intelligence Laboratory; Shanghai Jiao Tong University; Zhejiang University; University of Chinese Academy of Sciences

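AI summary: Proposes the MoTiF framework, which optimizes modality transition fidelity through Reflective SFT and Flow-GRPO, addressing the modal isolation problem in which images and text become disconnected during interleaved thinking, and improving cross-modal coherence and task accuracy.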

Comments 22 pages, 5 figures, 6 tables

Abstract

Interleaved thinking, where a unified multimodal model alternates between textual reasoning and visual generation, has shown promise on spatial and physical tasks. However, in complex long-chain scenarios, we identify a fundamental failure mode: generated images diverge from the textual context while subsequent text ignores the visual evidence, causing the two modalities to alternate without genuinely informing each other. We term this Modal Isolation and attribute it to compounding information loss at modality boundaries. We decompose each reasoning cycle into atomic operations and define modality transition loss, quantifying cross-modal hallucination (text-to-image) and visual utilization deficit (image-to-text) at each boundary. We propose MoTiF (Modality Transition Fidelity), a two-stage training framework that directly optimizes these transitions: Reflective SFT trains the model to detect and recover from erroneous visual outputs; Flow-GRPO improves image generation fidelity via reinforcement learning. All training signals in MoTiF derive from transition-level fidelity rather than end-task accuracy. Across four visual puzzle benchmarks, this transition-level supervision substantially improves both cross-modal coherence and final task accuracy. The results demonstrate that effective interleaved reasoning requires explicit structural supervision at modality boundaries, not merely scaling or end-task optimization.

2606.12898 2026-06-12 cs.CV cs.CL New submission

Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension

Shenglai Zeng, Qirui Wang, Kai Guo, Xinnan Dai, Xianxuan Long, Hui Liu

Affiliations: Michigan State University; Xi’an Jiaotong University

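AI summary: Targets the gap between localization and utilization that vision-language models show on visual text comprehension tasks, and proposes AGAR, a training-free, model-agnostic attention-guided adaptive rendering method that improves model performance by enlarging key text spans.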

Abstract

Visual Text Comprehension (VTC) renders text into images for a vision-language model (VLM) to read, sidestepping LLM context-window limits and powering applications from long-page OCR to multi-page memory QA. Yet existing VTC pipelines treat rendering and layout as a fixed, content-agnostic preprocessing step and offer little mechanistic understanding of how VLMs internally process visualized text. Through a focused empirical study on VTC QA tasks, we reveal that VLMs exhibit a localization-without-utilization regime: evidence-localizing attention emerges sharply in the middle-to-late layers and is largely decoupled from answer correctness, yet simply enlarging the localized spans on the rendered page recovers a large fraction of the failures. Building on these observations, we propose AGAR (Attention-Guided Adaptive Rendering), a training-free, model-agnostic method that leverages a VLM's own middle-to-late layer attention to identify the top-K important visual patches, maps them back to word spans, and re-renders the page with those spans enlarged before re-inferring the answer. Extensive experiments across nine VTC benchmarks (short-form, long-context, and multi-page memory QA) and four VLM backbones show that AGAR (i) consistently improves off-the-shelf VLMs as a plug-and-play enhancement, (ii) composes with VLM post-training to yield further gains, and (iii) remains robust under both visual- and text-side input degradation.
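The patch-to-span step described above might look roughly like the sketch below. It assumes the attention has already been reduced to one score per visual patch and that the renderer kept a patch-to-word index map. Both inputs, and the function name, are hypothetical. The re-rendering itself is left out.

```python
def spans_to_enlarge(patch_scores, patch_to_word, k):
    """Pick the top-k patches by attention score and merge the words they
    cover into contiguous (start, end) word-index spans, inclusive."""
    top = sorted(range(len(patch_scores)),
                 key=lambda i: patch_scores[i], reverse=True)[:k]
    # Patches that cover no word (margins, whitespace) map to None.
    words = sorted({patch_to_word[i] for i in top if patch_to_word[i] is not None})
    spans = []
    for w in words:
        if spans and w == spans[-1][1] + 1:
            spans[-1] = (spans[-1][0], w)  # extend the current span
        else:
            spans.append((w, w))           # start a new span
    return spans


scores = [0.1, 0.9, 0.8, 0.05, 0.7]
mapping = [0, 3, 4, None, 7]
selected = spans_to_enlarge(scores, mapping, k=3)
```

With these toy inputs the top three patches cover words 3, 4, and 7, so words 3 to 4 form one span and word 7 forms another.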

2606.12985 2026-06-12 cs.CV New submission

Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video

Sathira Silva, Abrham Kahsay Gebreselasie, Muhammad Umer Sheikh, Kartik Kuckreja, Daniel Harari, Muhammad Haris Khan

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; Weizmann Institute of Science

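AI summary: Addresses the twofold ambiguity of when and where a named referent appears in infant-view video, and proposes BabyMind, which improves language grounding under sparse, weak supervision through an object-first inductive bias, a mask-based region interface, and prototype-space multiple-instance contrastive learning.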

Abstract

Learning grounded word meaning from natural experience requires resolving two ambiguities in infant-view recordings: when the named referent appears and where it is in a cluttered frame. In SAYCam-style data, caregiver speech is sparse and weakly synchronized with egocentric video, so single-frame contrastive pairing yields noisy positives in which the intended object is absent or entangled with distractors. We propose BabyMind, an object-first bias for child-view contrastive learning under sparse, noisy supervision. BabyMind extracts candidate object embeddings using an offline mask-based region interface, links candidates across a short utterance-centered window into lightweight object files via tracking, and aligns utterances to bags of object files with a prototype-space multiple-instance contrastive objective. Track-coherence and global-object agreement regularizers stabilize learning and transfer object-file structure into the global frame embedding used at evaluation. On SAYCam-S, BabyMind improves Labeled-S 15 forced-choice accuracy by +2.6 points over CVCL and yields consistent gains on in-vocabulary out-of-distribution benchmarks. Code is available at https://github.com/sathiiii/BabyMind.

2606.13061 2026-06-12 cs.CV New submission

LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck

Peixi Wu, Biao Yang, Feipeng Ma, Bosong Chai, Bo Lin, Wei Yuan, Fan Yang, Tingting Gao, Hebei Li, Xiaoyan Sun

Affiliations: University of Science and Technology of China; Kuaishou Technology; Zhejiang University; Tsinghua University

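AI summary: Proposes LaME, which formulates embedding-oriented latent reasoning as a weakly supervised information bottleneck and uses learnable reasoning tokens to complete reasoning in a single forward pass, avoiding the high computational cost and annotation dependence of explicit CoT and achieving a 60x speedup.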

Abstract

Reasoning-driven universal multimodal embedding has advanced rapidly by introducing Chain-of-Thought (CoT) reasoning into the embedding pipeline. Despite the strong performance across both general and complex tasks, this paradigm suffers from two core limitations: (i) autoregressive CoT reasoning incurs high computational cost, making it impractical for low-latency retrieval; and (ii) embedding performance is heavily coupled with CoT annotation quality, making large-scale training unreliable. These raise fundamental questions: Is textual CoT the optimal form of reasoning for embedding, and can effective embedding reasoning be accomplished in latent space? To this end, we propose LaME (Latent Reasoning Multimodal Embedding), which formulates embedding-oriented latent reasoning as a weakly supervised information bottleneck. LaME employs K learnable reason tokens as a fixed-capacity bottleneck, completing all reasoning within a single forward pass. The two weak supervision signals structurally decouple contrastive from autoregressive objectives and eliminate dependence on CoT annotations, while a two-stage training pipeline ensures stable convergence. Experiments on MMEB-v2 and MRMR show that LaME achieves competitive performance, surpassing some explicit CoT-based models, while delivering 60x faster inference than explicit CoT methods and 2x faster than latent baselines with throughput comparable to discriminative embedding models. Code will be released.

2606.13156 2026-06-12 cs.CV cs.AI New submission

Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

Animesh Tripathy, Aswanth Krishnan

Affiliations: QpiAI India Pvt. Ltd

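AI summary: Proposes the Iterative Visual Thinking (IVT) framework, which gives vision-language models spatial self-correction through a visual feedback loop and two-phase training (SFT + GRPO), improving metrics by 2.4-3.2 percentage points on three benchmarks.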

Abstract

Vision-language models (VLMs) achieve strong single-shot spatial grounding, yet lack any mechanism to observe and correct their own predictions. We find that naively prompting a VLM to iterate over rendered visualizations of its predictions causes catastrophic failure: Acc@0.5 on referring expression comprehension collapses from 79.6% to 48.7% (a 31-percentage-point drop), revealing a fundamental gap between grounding capability and self-correction ability. We propose Iterative Visual Thinking (IVT), a closed-loop framework in which the model predicts a bounding box, observes the prediction rendered on the image, and iteratively refines through visual feedback. A two-phase training recipe closes the self-correction gap: first, we exploit the base model's own predictions as realistic errors and prompt a teacher VLM to generate corrective reasoning traces, yielding supervised data without human annotation; second, we apply Group Relative Policy Optimization (GRPO) with a simple IoU reward to stabilize multi-step refinement. On a mixed benchmark spanning RefCOCOg, Ref-Adv, and Ref-L4 (505 test samples), SFT warm-up with IVT surpasses the single-shot base model on every metric: Acc@0.5 rises to 82.0% (+2.4pp), Acc@0.7 to 74.1% (+3.2pp), and Acc@0.9 to 48.3% (+2.8pp). GRPO further reduces per-step IoU degradation by 5x, stabilizing the refinement trajectory. All training uses only 2,400 samples on a single GPU, demonstrating that spatial self-correction is a learnable capability that can be instilled at modest scale.
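Both the IoU reward and the Acc@t metrics rest on intersection-over-union between a predicted and a ground-truth box. A minimal sketch follows, assuming boxes are `(x1, y1, x2, y2)` tuples and that a prediction counts as correct when IoU is at least the threshold. That is the usual convention, but the paper's exact reward shaping and tie handling are not stated here.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at(predictions, targets, threshold):
    """Fraction of predictions whose IoU with the target meets the threshold."""
    hits = sum(iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / len(predictions)


overlap = iou((0, 0, 2, 2), (1, 1, 3, 3))  # intersection 1, union 7
acc = accuracy_at([(0, 0, 2, 2), (0, 0, 2, 2)],
                  [(0, 0, 2, 2), (1, 1, 3, 3)], threshold=0.5)
```

Used as a reward, `iou` gives a dense signal in the range 0 to 1 at every refinement step, while `accuracy_at` only reports whether the final box cleared a fixed bar.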

2606.13288 2026-06-12 cs.CV cs.AI cs.CL New submission

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

Wei Li, Zhen Huang, Xinmei Tian

Affiliations: MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition, University of Science and Technology of China; Independent Researcher

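AI summary: Proposes the MACCO framework, which masks compositional concepts in one modality and reconstructs them from the full context of the other, strengthening the compositional understanding of vision-language models and yielding significant gains on five benchmarks.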

Comments Accepted to ACL 2026 Main Conference, 25 pages

Abstract

Contrastively trained vision-language models like CLIP have made remarkable progress in learning joint image-text representations, but still face challenges in compositional understanding. They often exhibit a "bag-of-words" behavior, struggling to capture object relations, attribute-object bindings, and word-order dependencies. This limitation arises not only from the reliance on global, single-vector representations for optimization, but also from the insufficient exploitation and modeling of the rich compositional information inherently present in paired image-text data. In this work, we propose MACCO (MAsked Compositional Concept MOdeling), a framework that masks compositional concepts in one modality and reconstructs them conditioned on the full contextual information from the other, enabling the model to capture and align cross-modal compositional structures more effectively. To facilitate this process, we introduce two auxiliary objectives that jointly align and regularize masked features both inter-modally and intra-modally. Extensive experiments on five compositional benchmarks, along with in-depth analyses, demonstrate that our approach not only significantly enhances compositionality in VLMs but also improves their ability to capture syntactic structure and linguistic information. Additionally, the improved compositionality benefits text-to-image generation and multimodal large language models. Code is available at https://github.com/hiker-lw/MACCO.

2606.13289 2026-06-12 cs.CV cs.AI New submission

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

Guozhen Zhang, Xuerui Qiu, Yutao Cui, Tianhui Song, Changlin Li, Junzhe Li, Tao Huang, Xiao Zhang, Yang Li, Jianbing Wu, Miles Yang, Zhao Zhong, Liefeng Bo, Limin Wang

Affiliations: Nanjing University; CASIA; Tencent Hunyuan; Zhongguancun Academy; Shanghai AI Lab

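AI summary: Proposes HYDRA-X, the first native unified multimodal model to unify image and video tokenization within a single ViT. It achieves efficient reconstruction through causal temporal attention and hierarchical temporal compression, and injects semantics with a lightweight decompressor, significantly improving editing consistency and convergence speed.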

Abstract

Holistic visual tokenizers are fundamental to unified multimodal models (UMMs) as they map diverse visual inputs into a unified representation space. In this paper, we present HYDRA-X, the first UMM that unifies image and video tokenization within a single Vision Transformer (ViT). Our design is driven by two core challenges: efficiently injecting spatiotemporal reconstruction capability into a native ViT, and embedding image- and video-level semantic awareness into the latent space. To address the first, comprehensive ablations reveal two key findings: (1) frame-level causal temporal attention suffices for visual reconstruction, whereas full spatiotemporal attention degrades it; and (2) hierarchical temporal compression substantially outperforms single-step alternatives. To tackle the second, we propose a lightweight decompressor that upsamples temporally compressed features under joint image-video teacher supervision, thereby enforcing complementary semantic structures within the compact latent space. Building on this holistic tokenizer, we further propose a principled improvement of the editing pipeline: source-target interaction should occur at the latent level inside the tokenizer rather than at the semantic level inside the LLM, substantially improving editing consistency and accelerating convergence. Instantiated as a 7B dense model, HYDRA-X achieves strong performance across image and video understanding and generation tasks, paving the way for future unified-tokenizer UMMs.

2606.13673 2026-06-12 cs.CV cs.AI New submission

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

Seokju Cho, Ryo Hachiuma, Abhishek Badki, Hang Su, Byung-Kwan Lee, Chan Hee Song, Sifei Liu, Subhashree Radhakrishnan, Seungryong Kim, Yu-Chiang Frank Wang, Min-Hung Chen

Affiliations: KAIST; NVIDIA

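AI summary: Proposes the SpatialClaw framework, which uses code as the action interface. With a stateful Python kernel and perception and geometry primitives, a VLM agent executes step by step and flexibly composes intermediate results, reaching 59.9% average accuracy on 20 3D/4D spatial reasoning benchmarks, 11.2 percentage points above existing methods.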

Comments Project page: https://spatialclaw.github.io/

Abstract

Spatial reasoning, the ability to determine where objects are, how they relate, and how they move in 3D, remains a fundamental challenge for vision-language models (VLMs). Tool-augmented agents attempt to address this by augmenting VLMs with specialist perception modules, yet their effectiveness is bounded by the action interface through which those tools are invoked. In this work, we study how the design of this interface shapes the agent's capacity for open-ended spatial reasoning. Existing spatial agents either employ single-pass code execution, which commits to a full analysis strategy before any intermediate result is observed, or rely on a structured tool-call interface that often offers less flexibility for freely composing operations or tailoring the analysis to each task. Both designs offer limited flexibility for open-ended, complex 3D/4D spatial reasoning. We therefore propose SpatialClaw, a training-free framework for spatial reasoning that adopts code as the action interface. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives, letting a VLM-backed agent write one executable cell per step conditioned on all prior outputs, enabling the agent to flexibly compose and manipulate perception results and adapt its analysis to both intermediate text and visual observations and the demands of each problem. Evaluated across 20 spatial reasoning benchmarks spanning a broad range of static and dynamic 3D/4D spatial reasoning tasks, SpatialClaw achieves 59.9% average accuracy, outperforming the recent spatial agent by +11.2 points, with consistent gains across six VLM backbones from two model families without any benchmark- or model-specific adaptation.
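The "one executable cell per step" idea can be illustrated with a toy stateful kernel, sketched below. This is not SpatialClaw's implementation. The class name, the `frames` variable, and the cell contents are placeholders. It also runs cells with plain `exec`, which is unsafe for untrusted code, so a real system would presumably sandbox execution.

```python
import contextlib
import io


class StatefulKernel:
    """Toy kernel: every cell runs in the same namespace, so variables
    defined in one step stay available to later steps."""

    def __init__(self, preloaded=None):
        # Pre-loaded inputs (e.g. frames, helper primitives) live here.
        self.namespace = dict(preloaded or {})

    def run_cell(self, source):
        """Execute one cell and return whatever it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(source, self.namespace)
        return buffer.getvalue()


kernel = StatefulKernel({"frames": ["frame0", "frame1"]})
# Step 1: the agent inspects the input and sees the printed result.
first = kernel.run_cell("n = len(frames)\nprint(n)")
# Step 2: a later cell builds on state left behind by step 1.
second = kernel.run_cell("print(n * 2)")
```

The printed output of each cell is what an agent would condition on before writing the next one, in contrast to single-pass execution, where the whole program is fixed before any intermediate result is seen.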


2606.12555 2026-06-12 cs.SD cs.CV cs.MM 交叉投稿

Ex-Omni：为全模态大语言模型赋能3D面部动画生成

Haoyu Zhang, Zhipeng Li, Yiwen Guo, Tianshu Yu

发表机构 * The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； LIGHTSPEED ； Independent Researcher（独立研究员）

AI总结提出Ex-Omni模型，通过混合形状感知语音单元生成器和解码器解耦语义推理与时间生成，并引入统一令牌查询门控融合机制，实现全模态大语言模型同步生成语音和3D面部动画。


AI中文摘要

全模态大语言模型（OLLM）旨在统一多模态理解和生成；然而，尽管联合生成语音和3D面部动画对自然的人机交互十分重要，这一扩展方向在很大程度上仍未被探索。一个关键挑战是LLM的离散语义推理与3D面部运动所需的密集时间动态之间的不匹配。我们提出Expressive Omni (Ex-Omni)，一个开源模型，通过原生的语音伴随3D面部动画增强OLLM。Ex-Omni通过混合形状感知语音单元生成器和混合形状解码器将语义推理与时间生成解耦，其中语音单元提供时间支架，隐藏语音表示携带面部相关线索。我们进一步引入统一的令牌即查询门控融合（TQGF）机制用于受控语义注入，以及InstructS2SF-1200K，一个包含1200K样本的预训练数据集。大量实验表明，Ex-Omni在保持有竞争力的语音理解和生成能力的同时，实现了比级联管道更好的音视频同步和更低的面部生成延迟。

英文摘要

Omni-modal large language models (OLLMs) aim to unify multimodal understanding and generation, yet extending them to jointly produce speech and 3D facial animation remains largely unexplored despite its importance for natural human-computer interaction. A key challenge is the mismatch between the discrete semantic reasoning of LLMs and the dense temporal dynamics required for 3D facial motion. We propose Expressive Omni (Ex-Omni), an open-source model that augments OLLMs with native speech-accompanied 3D facial animation. Ex-Omni decouples semantic reasoning from temporal generation through a blendshape-aware speech unit generator and a blendshape decoder, where speech units provide temporal scaffolding and hidden speech representations carry facially relevant cues. We further introduce a unified token-as-query gated fusion (TQGF) mechanism for controlled semantic injection, as well as InstructS2SF-1200K, a dataset consisting of 1200K samples for pre-training. Extensive experiments show that Ex-Omni maintains competitive speech understanding and generation ability while achieving better audio-visual synchronization and lower face-generation latency than cascaded pipelines.


2603.06652 2026-06-12 cs.CV cs.AI 版本更新

PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

PaLMR: 通过多模态过程对齐实现忠实视觉推理

Yantao Li, Qiang Hui, Chenyang Yan, Kanzhi Cheng, Fang Zhao, Chao Tan, Huanling Gao, Jianbing Zhang, Kai Wang, Xinyu Dai, Shiguo Lian

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University（南京大学新型软件技术国家重点实验室）； Data Science & Artificial Intelligence Research Institute, China Unicom（中国联通数据科学与人工智能研究院）； Unicom Data Intelligence, China Unicom（中国联通数据智能）

AI总结提出PaLMR框架，通过感知对齐数据层和过程对齐优化层，减少推理幻觉并提升视觉推理忠实度，在多个基准上取得最优结果。


Journal ref: CVPR 2026 Findings

AI中文摘要

强化学习近期提升了大语言模型和多模态大语言模型的推理能力，但现有的奖励设计强调最终答案的正确性，因此容忍过程幻觉——即模型在得到正确答案的同时错误感知视觉证据的情况。我们通过PaLMR框架解决这种过程层面的不对齐，该框架不仅对齐结果，还对齐推理过程本身。PaLMR包含两个互补组件：一个感知对齐数据层，构建具有结构化伪真值和可验证视觉事实的过程感知推理数据；以及一个过程对齐优化层，构建具有过程感知评分函数的分层奖励融合方案，以鼓励视觉上可信的思维链并提高训练稳定性。在Qwen2.5-VL-7B上的实验表明，我们的方法显著减少了推理幻觉并提高了视觉推理忠实度，在HallusionBench上取得了最先进的结果，同时在MMMU、MathVista和MathVerse上保持了强劲性能。这些发现表明，PaLMR为过程对齐的多模态推理提供了一条原则性且实用的路径，推进了MLLM的可靠性和可解释性。

英文摘要

Reinforcement learning has recently improved the reasoning ability of Large Language Models and Multimodal LLMs, yet prevailing reward designs emphasise final-answer correctness and consequently tolerate process hallucinations--cases where models reach the right answer while misperceiving visual evidence. We address this process-level misalignment with PaLMR, a framework that aligns not only outcomes but also the reasoning process itself. PaLMR comprises two complementary components: a perception-aligned data layer that constructs process-aware reasoning data with structured pseudo-ground-truths and verifiable visual facts, and a process-aligned optimisation layer that constructs a hierarchical reward fusion scheme with a process-aware scoring function to encourage visually faithful chains-of-thought and improve training stability. Experiments on Qwen2.5-VL-7B show that our approach substantially reduces reasoning hallucinations and improves visual reasoning fidelity, achieving state-of-the-art results on HallusionBench while maintaining strong performance on MMMU, MathVista, and MathVerse. These findings indicate that PaLMR offers a principled and practical route to process-aligned multimodal reasoning, advancing the reliability and interpretability of MLLMs.


2605.16713 2026-06-12 cs.CV cs.AI 版本更新

用于TUMTraf V2X协同3D目标检测的相机与LiDAR BEV融合

Muhammad Shahbaz, Shaurya Agarwal

发表机构 * Department of Civil, Environmental and Construction Engineering, University of Central Florida（中佛罗里达大学土木、环境与建筑工程系）

AI总结提出一种融合路边相机与基础设施-车辆点云的BEV空间检测器，采用CenterPoint风格头部和IoU重排序，在DriveX 2026挑战赛公开测试集上达到0.85 mAP，并分析了训练/验证与测试集重叠对分数的影响。


AI中文摘要

我们描述了一种为DriveX 2026挑战赛的TUMTraf V2X协同3D目标检测赛道开发的相机与LiDAR融合检测器。该检测器在共享的鸟瞰视图空间中融合三个路边相机与一个融合的基础设施-车辆点云，并通过带有广义IoU回归损失和IoU质量重排序头的CenterPoint风格头部预测边界框。在提供的训练和验证分割上训练后，模型在公开Codabench测试分割上达到了0.85的3D mAP。在迭代系统时，我们观察到50个测试帧中有44个也出现在已发布的训练（40个）和验证（4个）分割中并带有标签。因此，我们进行了两项额外研究来量化这种重叠对最终分数的影响：（1）一个微调运行，对44个重叠帧进行过采样，达到0.89 mAP；（2）一个后处理运行，将这些帧上的预测替换为已发布的真实值，达到0.99 mAP（上传到我们的Codabench账户进行测试，但未在排行榜上发布）。报告了所有三种配置及其每类结果。

英文摘要

We describe a Camera and LiDAR fusion detector developed for the TUMTraf V2X cooperative 3D object detection track of the DriveX 2026 challenge. The detector fuses three roadside cameras with a fused infrastructure-plus-vehicle point cloud in a shared bird's-eye-view space and predicts boxes through a CenterPoint-style head with a generalized IoU regression loss and an IoU quality re-ranking head. Trained on the provided train and validation splits, the model reaches a 3D mAP of 0.85 on the public Codabench test split. While iterating on the system, we observed that 44 of the 50 test frames are also present in the released train (40) and validation (4) splits with their labels. We therefore conducted two additional studies to quantify how this overlap affects the final score: (1) a finetuning run that oversamples the 44 overlapping frames, reaching 0.89 mAP, and (2) a post-processing run that replaces predictions on those frames with the released ground truth, reaching 0.99 mAP (uploaded to our Codabench account for testing but not published on the leaderboard). All three configurations and their per-class results are reported.
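
For readers unfamiliar with the generalized IoU term mentioned above, here is a minimal sketch of the standard GIoU formula. It is simplified to axis-aligned 2D boxes with positive area, which is an assumption made for illustration; the abstract does not say how the detector applies the loss to its 3D boxes, and the function names are hypothetical.

```python
def giou(box_a, box_b):
    """Generalized IoU for axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    # Smallest axis-aligned box enclosing both inputs.
    hull = (max(ax2, bx2) - min(ax1, bx1)) * (max(ay2, by2) - min(ay1, by1))
    return inter / union - (hull - union) / hull


def giou_loss(box_a, box_b):
    """Regression loss: 0 for a perfect match, approaching 2 for distant boxes."""
    return 1.0 - giou(box_a, box_b)
```

Unlike plain IoU, the value keeps changing when the boxes do not overlap at all, which is why it is commonly used as a regression loss.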


2606.12987 2026-06-12 cs.CV cs.AI cs.LG cs.RO 新提交

Diffusion Transformer World-Action Model for AV Scene Prediction

扩散Transformer世界-动作模型用于自动驾驶场景预测

Ruslan Sharifullin, Benjamin Jiang, Kai Xi Chew

发表机构 * Stanford University（斯坦福大学）

AI总结提出紧凑潜世界模型，结合扩散Transformer（DiT）预测未来场景，在nuScenes上实现4.8倍更好的KID，并实现动作可控性（转向ρ=0.81）。

Comments 10 pages, 9 figures, 2 tables


AI中文摘要

动作条件世界模型使自动驾驶车辆能够根据自身规划的控制预测未来摄像头场景，从而无需真实世界部署即可进行规划和仿真，但在紧凑、可训练的规模下，未来具有模糊性，且该领域的标准失真度量具有误导性：它们奖励模糊的回归均值而非逼真的预测。我们通过一个紧凑的潜世界模型应对这一问题，该模型给定当前前摄像头潜变量和一系列自我动作，预测未来场景潜变量，由冻结解码器渲染为$256 \times 256$帧，最多提前8秒，在150个保留的nuScenes场景上评估。我们首先基准测试预测位置：在跨越四个表示族的六个冻结编码器中，具有时间上下文的V-JEPA2将转向RMSE比最佳单帧编码器降低40%。然后我们训练一个潜扩散Transformer（DiT），并通过受控诊断识别其所需的四个要素：空间token、$x_0$目标、残差锚定以及与目标不确定性匹配的采样。在Stable-Diffusion-VAE编码-预测-解码流水线中，我们揭示了核心矛盾：失真度量（余弦相似度、SSIM）倾向于模糊均值，掩盖了扩散模型更接近真实帧分布的事实。基于Inception的FID和KID揭示了清晰的感知-失真边界：扩散模型达到KID 0.078，而回归为0.375（好4.8倍），且可部署的训练校准使其无需测试时真实值即可实用。该模型真正具有动作可控性（转向驱动场景位移，Spearman $\rho = 0.81$，而回归为$-0.18$）。我们将有限的单次运动归因于共享当前锚点，并设计了一个紧凑的170万参数“跳跃”模型，恢复完整的真实运动幅度（$1.02\times$ GT），而单次模型捕获不到一半。

英文摘要

Action-conditioned world models let an autonomous vehicle predict future camera scenes from its own planned controls, enabling planning and simulation without real-world rollouts, but at compact, trainable scale the futures are ambiguous and the field's standard distortion metrics actively mislead: they reward a blurry regression mean over a realistic prediction. We confront this with a compact latent world model that, given the present front-camera latent and a sequence of ego-actions, predicts future scene latents a frozen decoder renders to $256 \times 256$ frames up to 8 seconds ahead, evaluated on 150 held-out nuScenes scenes. We first benchmark where to predict: across six frozen encoders spanning four representation families, V-JEPA2 with temporal context reduces steering RMSE by 40% over the best single-frame encoder. We then train a latent Diffusion Transformer (DiT) and, through a controlled diagnosis, identify the four ingredients it needs: spatial tokens, the $x_0$ objective, residual anchoring, and sampling matched to target uncertainty. In a Stable-Diffusion-VAE encode-predict-decode pipeline we expose the central tension: distortion metrics (cosine similarity, SSIM) favor the blurry mean, masking that the diffusion model is far closer to the real frame distribution. Inception-based FID and KID reveal a clean perception-distortion frontier: diffusion attains KID 0.078 versus 0.375 for regression ($4.8\times$ better), and a deployable train-derived calibration makes this practical without test-time ground truth. The model is genuinely action-controllable (steering drives scene displacement, Spearman $ρ= 0.81$, vs $-0.18$ for regression). We trace limited single-pass motion to a shared-present anchor and engineer a compact 1.7M-parameter "jump" model that recovers full ground-truth motion magnitude ($1.02\times$ GT), where single-pass models capture less than half.
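
Two of the numbers above can be checked or illustrated directly. The sketch below recomputes the KID ratio from the two reported values and gives a dependency-free Spearman rank correlation, the statistic quoted for action controllability. The helper names are hypothetical and the paper's own evaluation code is not shown in the abstract. Note that `spearman` is undefined when either input is constant.

```python
def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2.0 + 1.0
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Spearman rho = Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Regression KID divided by diffusion KID, both taken from the abstract.
kid_ratio = 0.375 / 0.078
```

Here `kid_ratio` rounds to 4.8, matching the stated factor. A rho near 1 would mean larger steering commands consistently rank with larger scene displacement, regardless of whether the relationship is linear.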


2606.13460 2026-06-12 cs.CV 新提交

VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models

VISA: VLM引导的实例语义审计用于3D占据世界模型

Ruiqi Xian, Yuehan Xian, Jing Liang, Xuewei Qi, Dinesh Manocha

发表机构 * University of Maryland College Park（马里兰大学帕克分校）； Nanjing University of Posts and Telecommunications（南京邮电大学）； Stanford University（斯坦福大学）； Motional AD Inc.（Motional AD公司）

AI总结提出VISA方法，利用离线VLM对每个物理对象实例进行结构化语义审计，并通过可靠性加权损失蒸馏到3D占据模型中，无需VLM推理即可提升封闭集占据mIoU。


AI中文摘要

语义3D占据为自动驾驶和机器人决策提供体素化世界状态，但对象和稀有类错误会影响自由空间解释、碰撞检测和时间状态传播。我们表明，常见的VLM策略（将3D体素或对象特征与裁剪-标题嵌入对齐）提高了文本空间中的相似性，但未能可靠地改善封闭集占据mIoU。受此不匹配启发，我们提出VISA，一种针对现有占据世界模型的训练时语义审计方法。VISA对每个物理对象实例的代表性裁剪查询离线VLM，获得包含类别假设、可能混淆、可靠性、属性和证据的结构化审计，并将其沿对象轨迹传播。审计被关联到匹配的3D对象体素，并通过可靠性加权分类、属性因子和场景级审计图损失蒸馏到语义logits中，而推理保持不变且无需VLM。在nuScenes上，三次运行平均，VISA将OccWorld从19.06提升到20.05 mIoU，GaussianWorld从21.36提升到21.91 mIoU；在GaussianWorld上，对象mIoU从18.18提升到19.16，稀有类mIoU从15.60提升到16.79。这些结果表明，对于封闭集占据，VLM更适合作为可靠性感知的语义审计器，而非通用的标题嵌入目标。

英文摘要

Semantic 3D occupancy provides a voxelized world state for autonomous driving and robot decision making, but object and rare-class errors can affect free-space interpretation, collision checking, and temporal state propagation. We show that a common VLM strategy, aligning 3D voxel or object features with crop-caption embeddings, improves text-space similarity without reliably improving closed-set occupancy mIoU. Motivated by this mismatch, we propose VISA, a training-time semantic auditing approach for existing occupancy world models. VISA queries an offline VLM on a representative crop of each physical object instance, obtains a structured audit with class hypotheses, plausible confusions, reliability, attributes, and evidence, and propagates it along the object track. The audit is grounded to matched 3D object voxels and distilled into semantic logits through reliability-weighted taxonomy, attribute-factor, and scene-level audit graph losses, while inference remains unchanged and requires no VLM. On nuScenes, averaged across three runs, VISA improves OccWorld from 19.06 to 20.05 mIoU and GaussianWorld from 21.36 to 21.91 mIoU; on GaussianWorld, object mIoU improves from 18.18 to 19.16 and rare-class mIoU from 15.60 to 16.79. These results suggest that VLMs are better suited to closed-set occupancy as reliability-aware semantic auditors than as generic caption-embedding targets.
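
The abstract does not give the form of VISA's losses. As a rough illustration of what "reliability-weighted" distillation into semantic logits could look like, the sketch below weights each voxel's cross-entropy by the reliability attached to its audit. The function names, the tuple layout, and the normalisation by total reliability are all assumptions, not the paper's formulation.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one voxel, computed with the log-sum-exp trick."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_norm - logits[target]


def reliability_weighted_loss(voxels):
    """voxels: list of (logits, audited_class, reliability in [0, 1])."""
    weighted = sum(r * cross_entropy(logits, cls) for logits, cls, r in voxels)
    total_reliability = sum(r for _, _, r in voxels)
    # With no trusted audits there is nothing to distil, so the loss is zero.
    return weighted / total_reliability if total_reliability > 0 else 0.0
```

Under this weighting, an audit the VLM itself flags as unreliable (reliability near 0) contributes almost nothing to the gradient, so noisy labels are down-weighted rather than discarded by a hard threshold.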


2606.13503 2026-06-12 cs.CV cs.AI cs.RO 新提交

Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments

异构激光雷达早期融合与学习重排序策略用于非结构化环境中的鲁棒长期地点识别

Judith Vilella-Cantos, Juan José Cabrera, Mónica Ballesta, David Valiente, Luis Payá

发表机构 * Miguel Hernández University of Elche（米格尔·埃尔南德斯·德埃尔切大学）

AI总结提出MinkUNeXt-VINE++方法，通过异构LiDAR数据早期融合和学习重排序策略，在非结构化环境（如葡萄园）中显著提升长期地点识别性能，Recall@1指标提升20%-30%。


AI中文摘要

在非结构化环境（如农田）中，鲁棒定位是自主系统的关键挑战。LiDAR传感器提供环境的详细3D信息，且不受光照条件影响，因此基于LiDAR的地点识别方法备受关注。本文提出MinkUNeXt-VINE++，一种结合两个传感器（Livox Mid-360和Velodyne VLP-16）异构LiDAR数据早期融合与推理时学习重排序策略的新方法。这种融合利用每个传感器的优势，提供更全面的环境表示。此外，重排序方法在重复环境（如葡萄园）中尤为重要，因为找到真正匹配是一项重大挑战。我们使用TEMPO-VINE数据集评估了该方法，该数据集提供了不同物候阶段葡萄园环境中的异构LiDAR数据。结果表明，与单传感器方法和现有最优方法相比，MinkUNeXt-VINE++显著提升了地点识别性能。与单传感器方法相比，MinkUNeXt-VINE++在Recall@1指标上提升了20%，加入重排序后提升30%。我们的方法代码已公开，可复现结果。

英文摘要

Robust localization in unstructured environments, such as agricultural fields, is a critical challenge for autonomous systems. LiDAR sensors provide detailed 3D information about the environment and are invariant to lighting conditions. For this reason, LiDAR-based place recognition methods have gained significant attention. In this paper, we propose MinkUNeXt-VINE++, a novel approach that combines early fusion of heterogeneous LiDAR data from two sensors (Livox Mid-360 and Velodyne VLP-16) and a learned re-ranking strategy in inference time. This fusion leverages the strengths of each sensor to provide a more comprehensive representation of the environment. Additionally, the re-ranking approach is particularly important in repetitive environments, such as vineyards, as finding true positives is a major challenge. We evaluated our approach using the TEMPO-VINE dataset, which provides heterogeneous LiDAR data in vineyard environments across different phenological stages. Our results demonstrate that MinkUNeXt-VINE++ significantly improves place recognition performance compared to single-sensor approaches and state-of-the-art methods. MinkUNeXt-VINE++ achieves a 20% improvement in the Recall@1 metric compared to single-sensor approaches, and +30% including re-ranking. The code of our method is publicly available for reproduction.
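
Recall@1, the metric quoted above, can be sketched as follows. This is a generic illustration and not the paper's evaluation code: the `(descriptor, position)` tuple layout, the function name, and the positive radius are assumptions.

```python
import math


def recall_at_1(queries, database, positive_radius):
    """Fraction of queries whose nearest database descriptor is a true positive.

    Each item is (descriptor, position). A retrieval counts as correct when the
    retrieved place lies within positive_radius metres of the query position.
    """
    hits = 0
    for q_desc, q_pos in queries:
        # Top-1 retrieval by Euclidean distance in descriptor space.
        _, best_pos = min(database, key=lambda item: math.dist(q_desc, item[0]))
        if math.dist(q_pos, best_pos) <= positive_radius:
            hits += 1
    return hits / len(queries)


# Toy data: two places 50 m apart whose descriptors are easy to confuse.
database = [((1.0, 0.0), (0.0, 0.0)), ((0.0, 1.0), (50.0, 0.0))]
queries = [((0.9, 0.1), (1.0, 0.0)), ((0.8, 0.2), (49.0, 0.0))]
toy_recall = recall_at_1(queries, database, positive_radius=5.0)
```

The second toy query is taken near the far place but its descriptor resembles the near one, so top-1 retrieval fails. This is the kind of perceptual aliasing that a re-ranking stage over the top candidates is meant to correct in repetitive scenes such as vineyard rows.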


2606.13509 2026-06-12 cs.CV cs.AI 新提交

Measurement-Calibrated Multi-Camera Fusion for Vision-Based Indoor Localization

基于测量校准的多相机融合用于视觉室内定位

Mateo Toro Diz, Jonathan Hoss, Noah Klarmann

发表机构 * Rosenheim Technical University of Applied Sciences（罗森海姆应用技术大学）

AI总结提出测量校准融合方法，通过显式量化单相机定位误差（单应校准、人体检测、运动跟踪）来优化多相机数据融合，实验表明该方法虽未显著提升绝对精度，但有效降低了轨迹方差并提高了运动平滑性。

Comments This paper has been accepted for presentation at the IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)


AI中文摘要

基于视觉的室内定位系统受到检测噪声、遮挡和有限相机覆盖的影响，导致流程多个阶段存在不确定性。虽然多相机数据融合被广泛用于缓解这些问题，但通常被视为黑箱组件并仅通过端到端评估，掩盖了其机制贡献。为弥补这一不足，本文研究是否可以利用显式表征单相机定位误差来校准和优化多相机数据融合。我们提出了一种测量校准融合方法，该方法集成了组件级误差量化，具体分离了单应校准、人体检测和运动跟踪。进行了组件级评估以量化单应校准、人体检测和运动跟踪的误差贡献。实验结果表明，与单相机基线相比，数据融合提高了定位精度。虽然测量校准融合在绝对精度上相比标准融合仅提供有限的改进，但它显著降低了轨迹方差并提高了运动平滑性，这对于需要稳定连续运动估计的应用至关重要。这些结果突显了在设计基于视觉的室内定位系统的数据融合策略时，显式误差表征的价值。

英文摘要

Indoor vision-based localization systems are affected by detection noise, occlusions, and limited camera coverage, leading to uncertainty at multiple stages of the pipeline. While multi-camera data fusion is widely used to mitigate these issues, it is typically treated as a black-box component and evaluated solely end-to-end, obscuring its mechanistic contributions. To address this gap, this work investigates whether explicitly characterizing single-camera localization errors can be leveraged to calibrate and optimize multi-camera data fusion. We introduce a measurement-calibrated fusion approach that integrates component-wise error quantification, specifically isolating homography calibration, human detection, and motion tracking. A component-wise evaluation is conducted to quantify error contributions from homography calibration, human detection, and motion tracking. Experimental results show that data fusion improves localization accuracy compared to single-camera baselines. While measurement-calibrated fusion provides only limited improvement in absolute accuracy over standard fusion, it substantially reduces trajectory variance and improves motion smoothness, which are critical for applications requiring stable and continuous motion estimates. These results highlight the value of explicit error characterization when designing data fusion strategies for vision-based indoor positioning systems.
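
One common way to let measured per-camera errors drive fusion is inverse-variance weighting. The sketch below shows that textbook rule for independent 1-D estimates. The abstract does not state which fusion rule the authors use, so this is an assumed stand-in, and `fuse_estimates` is a hypothetical name.

```python
def fuse_estimates(estimates):
    """Inverse-variance fusion of independent 1-D estimates given as (value, variance) pairs."""
    weights = [1.0 / variance for _, variance in estimates]
    total = sum(weights)
    fused_value = sum(w * value for w, (value, _) in zip(weights, estimates)) / total
    # The fused variance is never larger than the smallest input variance.
    fused_variance = 1.0 / total
    return fused_value, fused_variance
```

For example, `fuse_estimates([(1.0, 1.0), (3.0, 3.0)])` pulls the result toward the more precise camera. The shrinking fused variance is consistent with the abstract's observation that calibration mainly helps trajectory stability rather than absolute accuracy.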


2606.13515 2026-06-12 cs.CV cs.LG cs.RO 新提交

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

MaskWAM：统一掩码提示与预测的世界-动作模型

Hanyang Yu, Haitao Lin, Jingbo Zhang, Wenyao Zhang, Chenghao Gu, Heng Li, Ping Tan

发表机构 * The Hong Kong University of Science and Technology（香港科技大学）； Tencent Robotics X（腾讯机器人X实验室）； Tsinghua University（清华大学）

AI总结提出MaskWAM，通过统一掩码输入与预测的混合Transformer架构，解决世界-动作模型的空间瓶颈，提升策略泛化能力，在LIBERO等任务上显著优于基线。


AI中文摘要

世界-动作模型（WAMs）通过视频预测为机器人控制提供了一种有前景的范式。然而，当前的WAMs存在根本性的空间瓶颈：标准文本输入在杂乱场景中引入指代歧义，而非结构化的RGB预测缺乏语义基础，并受任务无关背景的偏差影响。为克服这些限制，我们引入了MaskWAM，一种以对象为中心的世界-动作模型。通过统一的混合Transformer（MoT）将掩码同时作为显式输入和预测进行联合集成，MaskWAM实现了鲁棒的策略泛化。该设计提供两个关键优势：（1）预测未来掩码产生以对象为中心的语义监督，抑制视觉噪声，显著增强甚至标准文本条件的WAMs；（2）将此预测监督与第一帧视觉提示（如目标对象掩码）耦合，建立精确的空间锚点，大幅减少语言歧义。关键在于，由于WAMs本质上是视觉驱动的架构，直接掩码条件化比单独文本提供更强的引导，为操作未见对象建立了精确且鲁棒的范式。在LIBERO、RoboTwin和真实世界任务上的评估表明，MaskWAM在语言清晰和语言模糊任务中均显著优于基线。

英文摘要

World Action Models (WAMs) present a promising paradigm for robotic control via video prediction. However, current WAMs suffer from fundamental spatial bottlenecks: standard text inputs introduce referential ambiguity in cluttered scenes, while unstructured RGB predictions lack semantic grounding and remain biased by task-irrelevant backgrounds. To overcome these limitations, we introduce MaskWAM, an object-centric world-action model. By jointly integrating masks as both explicit inputs and predictions via a unified Mixture of Transformers (MoT), MaskWAM unlocks robust policy generalization. This design provides two key benefits: (1) predicting future masks yields object-centric semantic supervision that suppresses visual noise, significantly enhancing even standard text-conditioned WAMs; and (2) coupling this predictive supervision with first-frame visual prompts, such as target object masks, establishes a precise spatial anchor that substantially reduces language ambiguity. Crucially, as WAMs are inherently vision-driven architectures, direct mask conditioning yields substantially stronger guidance than text alone, establishing a precise and robust paradigm for manipulating unseen objects. Evaluations on LIBERO, RoboTwin, and real-world tasks demonstrate that MaskWAM significantly outperforms baselines in both language-clear and language-ambiguous tasks.


2606.12849 2026-06-12 cs.DC cs.CV cs.RO 交叉投稿

SemanticXR: Low Power and Real-time Queryable Semantic Mapping with an Object-Level Device-Cloud Architecture

SemanticXR: 低功耗实时可查询语义建图与对象级设备-云架构

Rahul Singh, Devdeep Ray, Connor Smith, Sarita Adve

AI总结提出首个设备-云协同系统SemanticXR，通过对象级通信、执行和内存管理，在XR功耗、带宽和内存约束下实现实时开放词汇语义建图与查询，服务器端建图延迟改善2.2倍，设备功耗仅增加2%。


AI中文摘要

语义建图是新兴扩展现实（XR）应用（如AI助手和空间对象搜索）中实现具身交互的核心服务。在移动XR设备上部署此功能需要系统具备开放词汇、实时和低功耗特性。现有方法计算密集且假设服务器级资源。云卸载提供了一条实用路径，但现有系统未在设备-云边界拆分语义建图或管理其通信、执行和内存占用。我们提出SemanticXR，首个在XR功耗、带宽和内存约束下实现实时开放词汇语义建图与查询的设备-云系统。我们的关键洞察是将语义可识别对象提升为跨设备和服务器的通信、执行和内存的一级单元。在服务器端，对象级并行和几何下采样改善了建图延迟，而对象级深度建图协同设计降低了上行带宽。在设备端，具有增量更新和更新优先级的对象级稀疏局部地图实现了网络鲁棒的查询，并限制了内存和下行带宽。对象级可配置的资源使用与质量权衡让应用和系统分别根据应用需求和运行条件调整建图。与使用相同感知模型的设备-云基线相比，对象级组织在同等语义质量下将服务器端建图延迟改善了2.2倍。深度建图协同设计将上行带宽维持在2.5 Mbps以下。在设备端，SemanticXR即使在网络中断时也能为多达10,000个对象维持低于100 ms的查询延迟，在500 MB内支持数万个对象，并使下行带宽随地图变化而非总场景大小增长。系统在正常运行时仅增加2%的设备功耗。

英文摘要

Semantic mapping is a core service that enables grounded interactions in emerging Extended Reality (XR) applications such as AI assistants and spatial object search. Deploying this capability on mobile XR devices requires a system that is open-vocabulary, real-time, and low-power. Existing approaches are compute-intensive and assume server-class resources. Cloud offloading offers a practical path, but no existing system splits semantic mapping across the device-cloud boundary or manages its communication, execution, and memory footprint. We present SemanticXR, the first device-cloud system for real-time, open-vocabulary semantic mapping and querying under XR power, bandwidth, and memory constraints. Our key insight is to elevate semantically identifiable objects to first-class units of communication, execution, and memory across the device and server. On the server, object-level parallelism and geometry downsampling improve mapping latency, while object-level depth-mapping co-design reduces upstream bandwidth. On the device, an object-level sparse local map with incremental updates and update prioritization enables network-robust querying with bounded memory and downstream bandwidth. Object-level configurable resource usage vs. quality trade-offs let applications and the system adapt mapping to application requirements and operating conditions, respectively. Against a device-cloud baseline with the same perception models, object-level organization improves server-side mapping latency by 2.2X at equal semantic quality. Depth-mapping co-design maintains upstream bandwidth under 2.5 Mbps. On the device, SemanticXR sustains sub-100 ms query latency for up to 10,000 objects even under network drops, supports tens of thousands of objects within 500 MB, and scales downstream bandwidth with map changes, not total scene size. The system adds only 2% device power during normal operation.


2606.13494 2026-06-12 cs.RO cs.CV 交叉投稿

NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation

NavWAM：用于目标条件视觉导航的导航世界动作模型

Daichi Azuma, Taiki Miyanishi, Koya Sakamoto, Shuhei Kurita, Yaonan Zhu, Petr Khrapchenkov, Motoaki Kawanabe, Yusuke Iwasawa, Yutaka Matsuo

发表机构 * The University of Tokyo（东京大学）； National Institute of Informatics（国立信息学研究所）； AIRoA ； ATR

AI总结提出NavWAM，一种扩散变换器策略，通过联合学习未来观测、目标进度值和动作块，将导航世界模型预测直接转化为可执行动作，在离线基准和真实机器人部署中优于基于规划的世界模型基线。

Comments Project page: https://dachii-azm.github.io/navwam/


AI中文摘要

目标条件视觉导航要求机器人在部分可观测性下行动，通过预测其运动将如何改变未来的自我中心视图以及这种变化是否使其更接近目标。导航世界模型提供了这种视觉预见，但它们仍然是预测模块，需要外部规划器将预测的未来转化为闭环控制。我们提出导航世界动作模型（NavWAM），一种扩散变换器策略，通过将未来观测、目标进度值和动作块表示为共享的潜在序列，将导航世界模型预测转化为可执行动作。通过联合学习未来预测与决定闭环行为的动作和价值目标，NavWAM使视觉预见可直接用于机器人控制。我们通过模拟预训练和真实机器人适应构建NavWAM，并在图像目标导航任务上将其与基于规划的世界模型和代表性直接导航策略进行评估。在离线基准和闭环真实机器人部署中，NavWAM在使用默认策略模式（无CEM式动作搜索）的情况下，在我们的评估中优于基于规划的世界模型基线。项目页面：https://dachii-azm.github.io/navwam/

英文摘要

Goal-conditioned visual navigation requires a robot to act under partial observability by anticipating how its motion will change the future egocentric view and whether that change brings it closer to the goal. Navigation world models provide such visual foresight, but they remain prediction modules that require an external planner to convert predicted futures into closed-loop control. We propose Navigation World Action Model (NavWAM), a diffusion-transformer policy that turns navigation world-model prediction into executable action by representing future observations, goal-progress values, and action chunks in a shared latent sequence. By learning future prediction jointly with the action and value targets that determine closed-loop behavior, NavWAM makes visual foresight directly usable for robot control. We build NavWAM through simulation pretraining and real-robot adaptation, and evaluate it on image-goal navigation against planning-based world models and a representative direct navigation policy. Across offline benchmarks and closed-loop real-robot deployment, NavWAM improves over planning-based world-model baselines in our evaluations while using the default policy mode without CEM-style action search. Project page: https://dachii-azm.github.io/navwam/


2606.13497 2026-06-12 cs.RO cs.CV 交叉投稿

GAE: 利用可泛化动作专家释放VLM的物理潜力

Mingyu Liu, Zheng Huang, Xiaoyi Lin, Muzhi Zhu, Canyu Zhao, Yating Wang, Haoyi Zhu, Hao Chen, Chunhua Shen

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结提出可泛化动作专家（GAE），通过稀疏几何接口将VLM的高层意图转化为连续动作轨迹，采用动作预训练-点云微调（APPF）方案解耦动作动力学与几何基础，实现跨视觉域、视角和指令的强泛化。


AI中文摘要

视觉语言模型展示了强大的推理和规划能力，但将这些预测转化为精确的机器人动作仍是一个核心挑战。现有的视觉-语言-动作方法通常将推理和动作生成纠缠在一起，导致泛化能力有限。我们提出了可泛化动作专家（GAE），一个任务无关的模型，将稀疏几何规划转化为密集的机器人动作。我们的方法引入了一个稀疏几何接口：VLM预测代表高层意图的稀疏3D路点，而GAE将这些路点与实时点云观测一起映射到连续动作轨迹。GAE在一个包含来自仿真和真实世界机器人的15万条轨迹的大规模点云-轨迹数据集上进行预训练。为了进一步提高效率和泛化能力，我们引入了动作预训练-点云微调（APPF）方案，将学习动作动力学与几何基础解耦。预训练后，GAE被冻结并在下游任务中重用，只需对VLM进行轻量级微调以生成稀疏接口。实验表明，我们的方法在多样化的视觉域、相机视角和自然语言指令下实现了强大的性能和泛化能力。

英文摘要

Vision-language models demonstrate strong reasoning and planning abilities, yet grounding these predictions into precise robot actions remains a central challenge. Existing Vision-Language-Action methods typically entangle reasoning and action generation, leading to limited generalization. We propose Generalizable Action Expert (GAE), a task-agnostic model that converts sparse geometric plans into dense robot actions. Our approach introduces a sparse geometric interface: the VLM predicts sparse 3D waypoints representing high-level intention, while GAE maps these waypoints together with real-time point cloud observations to continuous action trajectories. GAE is pretrained on a large-scale pointcloud-trajectory dataset comprising 150k trajectories from both simulation and real-world robots. To further improve efficiency and generalization, we introduce an Action Pre-training, Pointcloud Fine-tuning (APPF) scheme that decouples learning action dynamics from geometry grounding. After pretraining, GAE is frozen and reused across downstream tasks, requiring only lightweight fine-tuning of the VLM to produce the sparse interface. Experiments show that our method achieves strong performance and generalization across diverse visual domains, camera viewpoints, and natural language instructions.


2511.17221 2026-06-12 cs.CV cs.RO 版本更新

UniDexTok：基于真实数据的统一灵巧手分词器

Dong Fang, Youjun Wu, Yuanxin Zhong, Rui Zhang, Yunlong Wang, Xiaosong Jia, Yu-Gang Jiang

发表机构 * Fudan University（复旦大学）； Hefei University of Technology（合肥工业大学）； Rimbot ； Beijing University of Posts and Telecommunications（北京邮电大学）

AI总结提出统一灵巧手模型(UDHM)将人手和机器人手状态映射到共享22自由度语义接口，并基于此开发UniDexTok，一种免重定向的状态分词器，学习基于真实关节状态的离散token，实现异构灵巧手的统一表示，误差降低98%以上。


AI中文摘要

灵巧手对于精细操作至关重要，但其硬件设计在不同本体之间存在显著差异。与平行夹爪相比，运动学、关节定义和自由度方面的差异使得定义共享状态表示变得困难。因此，灵巧手数据仍然碎片化，难以用于联合训练。在这项工作中，我们提出了统一灵巧手模型（UDHM），它将人手和机器人手状态映射到一个共享的22自由度语义接口。基于UDHM，我们引入了UniDexTok，一种免重定向的状态分词器，它从标准化的真实关节状态中学习以本体为条件的离散token。UniDexTok为异构灵巧手提供了统一表示，无需依赖重定向或仿真数据。与最近的基线UniHM相比，UniDexTok将MPJAE从15.63度降低到0.16度，MPJPE从18.51毫米降低到0.18毫米，误差分别减少了98.98%和99.03%。这些结果将重建精度从厘米级提升到亚毫米级。实验进一步表明，来自其他本体的数据提高了目标本体的重建精度，证明了跨本体分词的优势。当引入新的灵巧手时，UniDexTok还表现出强大的零样本和少样本重建能力。

英文摘要

Dexterous hands are essential for fine-grained manipulation, but their hardware designs vary substantially across embodiments. Differences in kinematics, joint definitions, and degrees of freedom make it difficult to define a shared state representation compared with parallel grippers. As a result, dexterous-hand data remains fragmented and difficult to use for joint training. In this work, we propose the Unified Dexterous Hand Model (UDHM), which maps human and robot hand states into a shared 22-DoF semantic interface. Based on UDHM, we introduce UniDexTok, a retargeting-free state tokenizer that learns embodiment-conditioned discrete tokens from standardized real joint states. UniDexTok provides a unified representation for heterogeneous dexterous hands without relying on retargeting or simulation data. Compared with the recent baseline UniHM, UniDexTok reduces MPJAE from 15.63 degrees to 0.16 degrees and MPJPE from 18.51 mm to 0.18 mm, corresponding to error reductions of 98.98% and 99.03%, respectively. These results improve reconstruction from centimeter-scale to sub-millimeter accuracy. Experiments further show that data from other embodiments improves target-embodiment reconstruction accuracy, demonstrating the benefit of cross-embodiment tokenization. UniDexTok also shows strong zero-shot and few-shot reconstruction ability when new dexterous hands are introduced.
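
The two percentage reductions quoted above follow directly from the reported errors. A small check, using only the four numbers in the abstract (the function name is illustrative):

```python
def percent_reduction(baseline, improved):
    """Relative error reduction, in percent, when going from baseline to improved."""
    return (baseline - improved) / baseline * 100.0


mpjae_reduction = percent_reduction(15.63, 0.16)  # degrees, UniHM -> UniDexTok
mpjpe_reduction = percent_reduction(18.51, 0.18)  # millimetres, UniHM -> UniDexTok
```

Rounded to two decimals these give 98.98 and 99.03, matching the stated figures. The "centimeter-scale to sub-millimeter" wording also checks out, since 18.51 mm is about 1.85 cm and 0.18 mm is below 1 mm.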


2606.12236 2026-06-12 cs.RO cs.CV 版本更新

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

DrivingAgent: 自动驾驶系统的设计与调度智能体

Zhongyu Xia, Wenhao Chen, Yongtao Wang, Ming-Hsuan Yang

发表机构 * Wangxuan Institute of Computer Technology, Peking University（北京大学王选计算机技术研究所）； University of California, Merced（加州大学默塞德分校）

AI总结提出DrivingAgent框架，通过自动化模块开发（设计阶段）和强化学习训练的轻量级LLM实时调度（调度阶段），解决自动驾驶系统集成新模型和满足实时约束的挑战，在nuScenes和Bench2Drive上取得更优速度-精度权衡。


AI中文摘要

为什么商用WiFi传感器在多人体步态识别中失败：基于ESP32的系统分析

Oliver Custance, Saad Khan, Simon Parkinson

发表机构 * University of Cambridge（剑桥大学）

AI总结通过ESP32实验发现，多人体步态识别性能差主要源于商用WiFi的感知质量限制，而非算法选择。


AI中文摘要

WiFi信道状态信息（CSI）在单人步态识别中展现出潜力，引发了对其在非接触式生物识别、持续认证和被动识别中应用的兴趣。然而，在低成本商用设备上进行多人识别的可行性仍不清楚。一个关键问题是，较差的多人性能主要是算法限制，还是反映了商用WiFi硬件更根本的感知上限。我们通过使用商用ESP32 WiFi传感器的系统实证研究来回答这个问题。我们评估了六种不同的信号分离方法——FastICA、SOBI、PCA-ICA、NMF、小波和张量分解——在七个场景中，覆盖1-10人，包括受控和现实室内环境。为了超越分类准确率进行研究，我们引入了三个诊断指标：受试者内变异性（ISV）、受试者间可区分性（ISD）和性能退化率（PDR）。所有方法的性能均中等（39%-56%准确率），几乎没有证据表明仅靠算法选择能解决问题。表现最佳的方法NMF达到56%准确率，而所有方法都表现出极高的特征空间重叠（97%-99%）、不稳定的受试者内表示以及显著的环境敏感性。这些发现表明，在商用ESP32 CSI约束下，密集多人步态识别更多受限于感知质量和空间多样性，而非所选分离算法。我们的结果对安全和隐私有直接影响：它们质疑了商用WiFi CSI作为稳健的多用户生物识别基元的实用性，同时也对低成本现成WiFi硬件可实现的被动识别能力施加了重要限制。

英文摘要

WiFi Channel State Information (CSI) has shown promise for single-person gait identification, raising interest in its use for contactless biometrics, continuous authentication, and passive identification. However, the feasibility of multi-person identification on low-cost commodity devices remains unclear. A critical question is whether weak multi-person performance is primarily an algorithmic limitation, or whether it reflects a more fundamental sensing ceiling on commodity WiFi hardware. We address this question through a systematic empirical study using commodity ESP32 WiFi sensors. We evaluated six different signal separation methods--FastICA, SOBI, PCA-ICA, NMF, Wavelet, and Tensor decomposition--across seven scenarios spanning 1-10 people in both controlled and realistic indoor environments. To investigate beyond classification accuracy, we introduce three diagnostic metrics: intra-subject variability (ISV), inter-subject distinguishability (ISD), and performance degradation rate (PDR). In all methods, performance remains moderate (39%-56% accuracy), with limited evidence that algorithmic choice alone solves the problem. The best-performing method, NMF, reaches 56% accuracy, while all methods exhibit extremely high feature-space overlap (97%-99%), unstable within-subject representations, and marked environmental sensitivity. These findings suggest that, under commodity ESP32 CSI constraints, dense multi-person gait identification is limited more by sensing quality and spatial diversity than by the chosen separation algorithm. Our results have direct implications for security and privacy: they call into question the practicality of commodity WiFi CSI as a robust multi-user biometric primitive for authentication, while also placing important bounds on the passive identification capabilities achievable with low-cost off-the-shelf WiFi hardware.


2601.06279 2026-06-12 cs.CV 版本更新

EyeTheia: A Lightweight and Accessible Eye-Tracking Toolbox

EyeTheia：一个轻量级且易用的眼动追踪工具箱

Stevenson Pather, Niels Martignène, Arnaud Bugnet, Fouad Boutaleb, Fabien D'Hondt, Deise Santana Maia

发表机构 * Univ. Lille, Inserm, CHU Lille, U1172 - LilNCog - Lille Neuroscience & Cognition（里尔大学、法国国家医学研究院、里尔大学医院、U1172 - 里尔神经科学与认知中心）； Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL（里尔大学、法国国家科学研究中心、里尔中央理工大学、UMR 9189 CRIStAL）； Centre national de ressources et de résilience (CN2R)（资源与韧性国家研究中心）

AI总结提出基于网络摄像头的轻量级眼动追踪管道EyeTheia，结合MediaPipe特征提取和CNN模型，通过用户微调降低预测误差，在点探测任务中与商业方案表现一致。

Comments Code for the EyeTheia: https://github.com/patherstevenson/EyeTheia. Experimental platform for the cognitive neuroscience task (BAWEB IAPS): https://git.interactions-team.fr/INTERACTIONS/calypso/src/branch/main/src/medita/

AI中文摘要

我们介绍了EyeTheia，一个用于基于网络摄像头的视线估计的轻量级开源深度学习管道，专为基于浏览器的实验平台和现实世界的认知与临床研究设计。EyeTheia仅使用标准笔记本电脑摄像头即可实现实时视线追踪，结合基于MediaPipe的 landmarks 提取和受iTracker启发的卷积神经网络，并支持可选的用户特定微调。我们研究了两种互补策略：在移动数据上预训练模型，以及在桌面数据集上从头训练相同架构。在MPIIFaceGaze上的验证结果显示，在标定前两种方法性能相当，而轻量级的用户特定微调持续降低了视线预测误差。我们还在一个真实的点探测任务中评估了EyeTheia，并与商业网络摄像头追踪器SeeSo SDK进行了比较。结果表明，在刺激呈现期间左右视线分配上具有高度一致性，尽管时间变异性更高。总体而言，EyeTheia为低成本视线追踪提供了一个透明且可扩展的解决方案，适用于可扩展和可重复的实验与临床研究。代码、训练模型和实验材料均已公开。

英文摘要

We introduce EyeTheia, a lightweight and open deep learning pipeline for webcam-based gaze estimation, designed for browser-based experimental platforms and real-world cognitive and clinical research. EyeTheia enables real-time gaze tracking using only a standard laptop webcam, combining MediaPipe-based landmark extraction with a convolutional neural network inspired by iTracker and optional user-specific fine-tuning. We investigate two complementary strategies: adapting a model pretrained on mobile data and training the same architecture from scratch on a desktop-oriented dataset. Validation results on MPIIFaceGaze show comparable performance between both approaches prior to calibration, while lightweight user-specific fine-tuning consistently reduces gaze prediction error. We further evaluate EyeTheia in a realistic Dot-Probe task and compare it to the commercial webcam-based tracker SeeSo SDK. Results indicate strong agreement in left-right gaze allocation during stimulus presentation, despite higher temporal variability. Overall, EyeTheia provides a transparent and extensible solution for low-cost gaze tracking, suitable for scalable and reproducible experimental and clinical studies. The code, trained models, and experimental materials are publicly available.

2606.04364 2026-06-12 cs.CV cs.LG 版本更新

Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

通过部分分解注意力的空间基础概念瓶颈模型

Dhanesh Ramachandram

发表机构 * Vector Institute（向量研究所）

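AI summary: Proposes a part-factorized concept bottleneck model that constrains attention with a spatial prior, keeping fine-grained recognition interpretable while improving pointing accuracy.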

Comments Updated results with GlobalAttention Tokens

AI中文摘要

概念瓶颈模型（CBM）在预测类别之前预测一层人类命名的属性，从而使其决策可审计。在细粒度识别任务中，概念头通常可以自由关注图像中的任何位置，因此以某个身体区域命名的头可能被其他区域的证据满足。本研究通过构造一个部分分解的CBM来消除这种自由度。该方法基于冻结的DINOv3视觉变换器，包含三个组件。一个学习到的前景门控，基于DINOv3块特征训练，抑制部分注意力内的背景块。一组部分查询交叉关注块特征，并且312个CUB属性中的每一个通过固定的概念到部分映射被路由，仅从其名称所暗示的部分令牌读取。一个可学习的二维高斯先验，以对数空间加性注入注意力logits，打破部分查询之间的排列对称性；其均值从每个部分的数据集平均关键点位置初始化，在训练或测试时不需要每张图像的关键点监督。在CUB-200-2011上，空间先验模型匹配完全监督基线（top-1准确率88.85%对88.95%），同时将指向精度提高16个百分点（52.6%对36.4%）。用PCA前景目标替换边界框监督，并与高斯先验结合，消除了所有每张图像监督，达到88.6%的top-1准确率和约70%的指向精度。关键点分数扫描显示，训练集的0.5%（约27张图像）足以初始化先验，且无显著损失。完全移除部分身份是更困难的情况：没有任何空间先验，指向精度降至2.9%。

英文摘要

Concept bottleneck models (CBMs) predict a layer of human-named attributes before predicting a class, which makes their decisions auditable. On fine-grained recognition tasks, the concept heads are usually free to attend anywhere in the image, so a head named for one body region can be satisfied by evidence on another. This work studies a part-factorized CBM that removes that freedom by construction. The method has three components built on a frozen DINOv3 vision transformer. A learned foreground gate, trained on DINOv3 patch features, suppresses background patches inside the part attention. A set of part queries cross-attends to patch features, and each of the 312 CUB attributes is routed, through a fixed concept-to-part map, to read only from the part token its name implies. A learnable two-dimensional Gaussian prior, injected additively in log space into the attention logits, breaks the permutation symmetry among part queries; its means are initialized from the dataset-average keypoint location of each part, which requires no per-image keypoint supervision at training or test time. On CUB-200-2011, the spatial-prior model matches a fully supervised baseline (88.85% versus 88.95% top-1) while raising pointing accuracy by 16 points (52.6% versus 36.4%). Replacing bounding-box supervision with a PCA foreground target and combining it with the Gaussian prior removes all per-image supervision and reaches 88.6% top-1 at about 70% pointing accuracy. A keypoint-fraction sweep shows that 0.5% of the training set (about 27 images) suffices to initialize the prior with no measurable loss. Removing part identity entirely is the harder case: without any spatial prior, pointing accuracy collapses to 2.9%.
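
The log-space Gaussian prior described in this abstract can be illustrated with a minimal sketch. It assumes an isotropic Gaussian over a small patch grid and fixed (not learned) parameters; the function names `gaussian_log_prior` and `part_attention` are hypothetical and are not taken from the paper's code.

```python
import math


def gaussian_log_prior(grid_h, grid_w, mean, sigma):
    """Log of an unnormalized isotropic 2D Gaussian at every patch position.

    mean is (row, col) in patch units. In the paper the prior is learnable;
    here mean and sigma are fixed numbers (an assumption for illustration).
    """
    mean_r, mean_c = mean
    return [
        -((r - mean_r) ** 2 + (c - mean_c) ** 2) / (2.0 * sigma ** 2)
        for r in range(grid_h)
        for c in range(grid_w)
    ]


def softmax(values):
    """Numerically stable softmax over a flat list."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def part_attention(logits, log_prior):
    """Add the log-prior to the content logits, then normalize over patches."""
    return softmax([l + p for l, p in zip(logits, log_prior)])


# Toy example: 4x4 patch grid, uninformative content logits,
# prior centred on patch (row 1, col 2).
logits = [0.0] * 16
prior = gaussian_log_prior(4, 4, mean=(1.0, 2.0), sigma=1.0)
attn = part_attention(logits, prior)
peak = max(range(16), key=lambda i: attn[i])  # flat index of the strongest patch
```

Because the prior is added before the softmax, it multiplies the attention weights by a Gaussian bump: with identical content logits, two part queries that start from different means attend to different regions, which is one way to read the "breaks the permutation symmetry" claim.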

2606.09855 2026-06-12 cs.MM cs.CV cs.LG 版本更新

MinhwaNet: Faithful but Insufficient Object Grounding in Korean Folk Painting

MinhwaNet: 韩国民俗画中忠实但不足的对象定位

Joonhyung Bae

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)（韩国科学技术院）

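AI summary: Introduces MinhwaNet, which derives object evidence maps from a part-level detector and finds that, in Korean folk painting, a list of symbols is insufficient to predict genre and symbol arrangement matters more, a faithful-but-insufficient dissociation.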

AI中文摘要

韩国民俗画（minhwa）由少量吉祥符号构成——老虎代表保护、一对鸟代表婚姻和谐、牡丹代表财富——这些符号在其许多绘画类型中反复出现。这暗示了一种直观的计算方法：识别画作中出现的符号，并从符号清单中读取画作类型。我们使用一个公开语料库，包含整幅画作、八字段双语策展说明以及一组独立的专家对象裁剪图，发现这种方法并不奏效。仅给定画作包含的符号列表的模型，其预测画作类型的效果远不如将图像与策展文本融合的模型，而强制类型表示基于对象定位反而会损害准确性。然而，类型预测所依赖的视觉证据仍然是局部化的且可检查的。从部分级检测器投影出的无泄漏对象证据图，在空间上忠实于策展人隔离符号对象的位置以及基于补丁的替代模型的梯度显著性。我们将这种配置称为忠实但不足的解离。部分级解释诚实地反映了部分级模型所见，但类型目标取决于符号的排列方式而非出现的符号。相同的视角区分了内容标签（在转移到保留的源机构时仍然有效，即类型）和风格标签（无效，即时代），我们通过语料库中的另外两个标签验证了这一预测。我们发布了多模态系统、一幅画作的证据图与其目录的工作示例解读，以及在长尾遗产收藏中反复出现的一系列评估注意事项。

英文摘要

Korean folk painting (minhwa) is built from a small vocabulary of auspicious symbols (a tiger for protection, a pair of birds for marital harmony, a peony for wealth) that recur across many of its painted genres. This suggests an obvious computational approach: identify which symbols appear in a painting and read the genre from the inventory. Working with a public corpus that pairs whole paintings, eight-field bilingual curatorial captions, and a separate set of expert object crops, we find that this approach does not work. A model given only a list of which symbols a painting contains predicts the genre far worse than a model that fuses the image with the curatorial text, and forcing the genre representation to be object-grounded actively hurts accuracy. The visual evidence on which the genre prediction rests is nonetheless localized and inspectable. A leakage-safe object evidence map projected from a part-level detector is spatially faithful to where curators isolated symbolic objects and to a patch-based surrogate's own gradient saliency. We name this configuration a faithful-but-insufficient dissociation. The part-level explanation is honest about what the part-level model sees, yet the genre target turns on how symbols are arranged rather than on which ones appear. The same lens separates a content label that survives transfer to held-out source institutions (genre) from a style label that does not (era), a prediction we confirm on two further labels in the corpus. We release the multimodal system, a worked-example reading of one painting's evidence map against its catalogue, and a set of evaluation cautions that recur in long-tailed heritage collections.

2606.12628 2026-06-12 cs.CV 新提交

Context-Aware Feature-Fusion for Co-occurring Object Detection in Autonomous Driving

面向自动驾驶中共现对象检测的上下文感知特征融合

Binay Kumar Singh, Niels Da Vitoria Lobo

发表机构 * Department of Computer Science, University of Central Florida（中佛罗里达大学计算机科学系）

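AI summary: Proposes Context-Centric Feature Fusion (CCFF), in which a Local Context Fusion Module handles small and occluded objects and a Global Context Attention Module encodes co-occurrence priors; it reaches a Category-level Consistency Strategy (CCS) of 0.973 on Cityscapes and 0.969 on BDD100K and reports small-object detection gains (AP_S: 14.1%).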

Comments 8 pages, 3 figures, CVPR 2026 Precognition Workshop

详情

AI中文摘要

自动驾驶中的目标检测需要精确定位以及对共现对象之间关系上下文的固有理解。在极其复杂的异构环境中，稀有类别、小尺度对象和频繁出现的对象对于标准目标检测框架来说难以处理。在本文中，我们提出了一种新颖的框架，称为上下文中心特征融合（CCFF），它利用两个基于注意力的模块：局部上下文融合模块（LCFM）使用RoI到RoI的自注意力机制来解决空间交互，主要考虑小且部分遮挡的对象；而全局上下文注意力模块（GCAM）通过将top-K RoI特征池化为全局上下文注意力标记来转换对象的共现先验，避免了像素级全局池化的计算开销。这种局部和以对象为中心的全局特征的融合产生了上下文化的嵌入，增强了分类结果和共现对象检测。我们的方法在两个数据集Cityscapes和BDD100K上进行了评估，在关系一致性上显示出显著改进，分别达到了0.973和0.969的类别级一致性策略（CCS）。此外，我们的方法在小目标检测（AP_S: 14.1%）上取得了实质性提升，并成功恢复了通常在大分布中丢失的稀有类别，如“火车”。我们的效率报告显示，该框架以0.2 FPS的开销实时处理图像。代码可在 https://github.com/BinayKSingh/CCFF 获取。

英文摘要

Object detection in autonomous driving requires precise localization and an inherent understanding of the relational context between co-occurring objects. In extremely complex, heterogeneous environments, rare classes, small-scale objects, and frequently appearing objects are difficult for standard object detection frameworks to handle. In this paper, we propose a novel framework called Context-Centric Feature Fusion (CCFF), which uses two attention-based modules. The Local Context Fusion Module (LCFM) uses RoI-to-RoI self-attention to resolve spatial interactions, mainly for small and partially obscured objects. The Global Context Attention Module (GCAM) encodes object co-occurrence priors by pooling top-K RoI features into a global context attention token, avoiding the computational overhead of pixel-level global pooling. This fusion of local and object-centric global features yields contextualized embeddings that improve classification and the detection of co-occurring objects. We evaluate our method on two datasets, Cityscapes and BDD100K, where it shows a significant improvement in relational consistency, achieving a Category-level Consistency Strategy (CCS) of 0.973 and 0.969, respectively. Furthermore, our approach produces substantial gains in small object detection (AP_S: 14.1%) and successfully recovers rare classes such as "Train" that are typically lost in large distributions. Our efficiency report shows that the framework processes images in real time with a 0.2 FPS overhead. The code is available at https://github.com/BinayKSingh/CCFF.
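
The GCAM pooling step can be pictured with a minimal sketch. It assumes that the top-K RoIs are chosen by detection score and that their features are mean-pooled into one token; the abstract specifies neither choice, and the names `global_context_token`, `roi_features`, and `roi_scores` are hypothetical.

```python
def global_context_token(roi_features, roi_scores, k):
    """Mean-pool the features of the k highest-scoring RoIs into one vector.

    Assumption: ranking by score and mean pooling stand in for whatever
    selection and pooling the paper actually uses.
    """
    if not roi_features or k < 1:
        raise ValueError("need at least one RoI and k >= 1")
    k = min(k, len(roi_features))
    # Indices of the k RoIs with the largest scores.
    top = sorted(range(len(roi_scores)), key=lambda i: roi_scores[i], reverse=True)[:k]
    dim = len(roi_features[0])
    return [sum(roi_features[i][d] for i in top) / k for d in range(dim)]


# Toy example: three RoIs with 2-D features; the low-scoring third RoI is ignored.
features = [[1.0, 0.0], [3.0, 2.0], [100.0, 100.0]]
scores = [0.9, 0.8, 0.1]
token = global_context_token(features, scores, k=2)
```

The cost of this step grows with K and the feature dimension rather than with the number of pixels, which matches the abstract's stated motivation of avoiding pixel-level global pooling.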

2606.12826 2026-06-12 cs.CV cs.AI 新提交

DIMOS: Disentangling Instance-level Moving Object Segmentation

DIMOS: 解耦实例级运动目标分割

Hongxiang Huang, Hongwei Ren, Xiaopeng Lin, Yulong Huang, Zeke Xie, Bojun Cheng

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

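AI summary: Proposes a dual-disentangling feature extraction framework that separates appearance and motion information in the image and event modalities, fused through multi-granularity cross-modal alignment; it achieves state-of-the-art moving instance segmentation, especially for small instances under fast motion and low light.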

详情

AI中文摘要

运动实例分割（MIS）因其在交通监控、自动驾驶和动物追踪等领域的广泛应用而日益受到关注。事件相机记录异步亮度变化，提供高时间分辨率和动态范围，使其对运动信息高度敏感。通过融合事件和图像特征，事件中的运动线索可以补充图像中的空间细节，从而提升MIS的性能。然而，当前的多模态MIS方法仍然难以分割小的运动实例，因为事件相机在有限分辨率下往往产生稀疏特征。此外，事件特征将外观属性与运动线索纠缠在一起，进一步限制了有效的跨模态融合。为解决这些挑战，我们首先提出一个双解耦特征提取框架，在图像和事件模态内分离并提取外观和运动信息，从而改善特征密度。随后，引入多粒度跨模态对齐，以对齐跨模态分布和语义一致的特征，实现具有丰富空间和时间细节的更有效融合。实验结果表明，我们的方法在多模态MIS中达到了最先进的性能，特别是在快速运动和低光等挑战性条件下的小实例分割方面。

英文摘要

Moving instance segmentation (MIS) attracts increasing attention due to its broad applications in traffic surveillance, autonomous driving, and animal tracking. Event cameras record asynchronous brightness changes, providing high temporal resolution and dynamic range, which makes them highly sensitive to motion information. By fusing event and image features, motion cues from events can complement spatial details from images, enhancing the performance of MIS. However, current multimodal MIS methods still struggle to segment small moving instances, as event cameras often yield sparse features under limited resolution. Moreover, event features entangle appearance attributes with motion cues, which further restricts effective cross-modal fusion. To address these challenges, we first propose a dual-disentangling feature extraction framework that separates and extracts appearance and motion information within both image and event modalities, thereby improving feature density. Subsequently, a multi-granularity cross-modal alignment is introduced to align distributionally and semantically consistent features across modalities, enabling more effective fusion with rich spatial and temporal details. The experiment results demonstrate that our method achieves state-of-the-art performance in multimodal MIS, especially for small instances under challenging conditions such as fast motion and low-light settings.

2606.12958 2026-06-12 cs.CV 新提交

YOLO-AMC: An Improved YOLO Architecture with Attention Mechanisms for Building Crack Detection

YOLO-AMC：一种改进的带有注意力机制的YOLO架构用于建筑裂缝检测

Ching-Yu Tsai, Chia-Min Lin, Chih-Hsiang Yang, Yung-Che Wang, Jen-Shiun Chiang

发表机构 * Department of Electrical and Computer Engineering, Tamkang University（淡江大学电机与计算机工程系）

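AI summary: Proposes YOLO-AMC, which removes C2PSA from YOLOv11 and adds attention mechanisms (GAM, Res-CBAM, SA) to improve crack detection, reaching mAP@0.5 of 0.9917 on the test set at 110.95 FPS and balancing accuracy with deployment efficiency.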

Comments 14 pages, 8 tables, 6 figures. Expanded version of IET ICETA 2025 conference paper

详情

AI中文摘要

裂缝检测在基础设施检查和结构健康监测（SHM）中起着重要作用。然而，裂缝通常表现为薄、低对比度的结构，且容易受到背景噪声的影响，给现有目标检测模型带来了挑战。本研究提出了一种改进的基于YOLO的架构，集成了注意力机制，称为YOLO-AMC（用于裂缝检测的YOLO注意力机制），以增强自动裂缝检测性能。基于YOLOv11，移除了原始的C2PSA模块，并在Neck的多尺度特征融合层中引入了多种注意力机制，包括全局注意力机制（GAM）、残差卷积块注意力模块（Res-CBAM）和Shuffle Attention（SA），以加强跨尺度特征整合。实验结果表明，YOLO-AMC在多个评估指标上始终优于基线模型YOLOv11n和YOLOv8n。在评估的注意力模块中，GAM取得了最佳检测性能，在测试数据集上获得了mAP@0.5 = 0.9917和mAP@0.5:0.95 = 0.9506，高于YOLOv11（0.9833 / 0.9112）和YOLOv8（0.9707 / 0.8921）。此外，在保持7.6 GFLOPs计算复杂度的同时，所提出的模型在NVIDIA RTX 4090平台上达到了110.95 FPS，在Raspberry Pi 5边缘设备上约为5 FPS，展示了准确性与部署效率之间的良好权衡。本研究的实现代码可在GitHub上获取，网址为：https://github.com/CY-Tsai24/YOLO-AMC。

英文摘要

Crack detection plays an important role in infrastructure inspection and Structural Health Monitoring (SHM). However, cracks typically appear as thin, low-contrast structures and are easily affected by background noise, posing challenges for existing object detection models. This study proposes an improved YOLO-based architecture with integrated attention mechanisms, termed YOLO-AMC (YOLO with Attention Mechanisms for Crack Detection), to enhance automated crack detection performance. Based on YOLOv11, the original C2PSA module is removed, and multiple attention mechanisms, including Global Attention Mechanism (GAM), Residual Convolutional Block Attention Module (Res-CBAM), and Shuffle Attention (SA), are introduced into the multi-scale feature fusion layers of the Neck to strengthen cross-scale feature integration. Experimental results demonstrate that YOLO-AMC consistently outperforms baseline models YOLOv11n and YOLOv8n across multiple evaluation metrics. Among the evaluated attention modules, GAM achieves the best detection performance, obtaining mAP@0.5 = 0.9917 and mAP@0.5:0.95 = 0.9506 on the test dataset, which are higher than those of YOLOv11 (0.9833 / 0.9112) and YOLOv8 (0.9707 / 0.8921). Furthermore, while maintaining a computational complexity of 7.6 GFLOPs, the proposed model achieves 110.95 FPS on an NVIDIA RTX 4090 platform and approximately 5 FPS on a Raspberry Pi 5 edge device, demonstrating a favorable trade-off between accuracy and deployment efficiency. The implementation code for this study is available on GitHub at https://github.com/CY-Tsai24/YOLO-AMC.

2606.13033 2026-06-12 cs.CV 新提交

SAM-Deep-EIoU: Selective Mask Propagation for Multi-Object Tracking

SAM-Deep-EIoU：面向多目标跟踪的选择性掩码传播

Alexander Holmberg

发表机构 * KTH Royal Institute of Technology（瑞典皇家理工学院）

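AI summary: Proposes selective mask propagation, which relies mainly on a lightweight base tracker and invokes a video object segmentation model only on frames with high assignment uncertainty; it improves results on DanceTrack and reaches 86.8 HOTA on SportsMOT.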

AI中文摘要

多目标跟踪的难度分布呈重尾特性：大多数帧对于轻量级基跟踪器是容易的，而一小部分帧本质上是困难的。视频目标分割（VOS）模型通常能在基跟踪器失败的困难帧中保持身份，但其计算和内存成本高得多。我们提出选择性掩码传播，一种跟踪算法，仅在分配不确定性信号触发的窗口上从基跟踪器调度到VOS模型。仅当VOS模型做出与基跟踪器身份分配相矛盾的置信预测时，才修改基跟踪器的输出；弱或不确定的预测保留基输出。该方法无需训练，将基跟踪器和VOS模型均视为黑盒，并且可以通过用更强大的模型替换VOS组件而受益。在DanceTrack上，选择性掩码传播改进了三种不同的基跟踪器。在SportsMOT上，身份保持是体育分析的核心，使用全局轨迹关联的SAM3-Deep-EIoU以86.8 HOTA达到基准上的最先进性能。

英文摘要

Multi-object tracking has a heavy-tailed difficulty distribution: most frames are easy for a lightweight base tracker, while a small fraction are intrinsically hard. Video object segmentation (VOS) models can often preserve identity through the hard frames where the base tracker fails, but they are much more expensive in compute and memory. We propose selective mask propagation, a tracking algorithm that dispatches from a base tracker to a VOS model only on windows where an assignment-uncertainty signal fires. The base tracker's output is modified only when the VOS model makes a confident prediction that contradicts the base tracker's identity assignment; weak or inconclusive predictions preserve the base output. The method is training-free, treats both the base tracker and the VOS model as black boxes, and can benefit from replacing the VOS component with a more capable model. On DanceTrack, selective mask propagation improves three different base trackers. On SportsMOT, where identity preservation is central to sports analytics, SAM3-Deep-EIoU with global track association achieves state-of-the-art performance on the benchmark with 86.8 HOTA.
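
The override rule described in this abstract can be sketched as a small decision function. This is an illustration only: the function name `resolve_identity`, the scalar uncertainty and confidence values, and both thresholds are hypothetical, since the abstract does not define the signals or their scales.

```python
def resolve_identity(base_id, vos_id, vos_confidence, uncertainty,
                     uncertainty_threshold=0.5, confidence_threshold=0.8):
    """Pick the identity for one detection.

    base_id        : identity assigned by the lightweight base tracker
    vos_id         : identity suggested by the VOS model (None if it abstains)
    vos_confidence : confidence of the VOS suggestion, assumed to lie in [0, 1]
    uncertainty    : assignment-uncertainty signal of the base tracker (assumed scalar)
    """
    # Easy frame: the uncertainty signal does not fire, so the VOS model is not consulted.
    if uncertainty < uncertainty_threshold:
        return base_id
    # Weak or inconclusive VOS prediction: preserve the base output.
    if vos_id is None or vos_confidence < confidence_threshold:
        return base_id
    # Confident VOS prediction: it changes the result only when it contradicts the base tracker.
    return vos_id


# Toy cases: (base_id, vos_id, vos_confidence, uncertainty)
easy_frame = resolve_identity(3, 7, 0.99, uncertainty=0.1)
weak_vos = resolve_identity(3, 7, 0.40, uncertainty=0.9)
confident_vos = resolve_identity(3, 7, 0.95, uncertainty=0.9)
```

Because the function only reads identities and scores, both trackers stay black boxes, in line with the training-free design the abstract describes.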

2606.13587 2026-06-12 cs.CV 新提交

Towards Effective Waste Segmentation for Automated Waste Recycling in Cluttered Background

面向杂乱背景下的自动废物回收的有效废物分割

Mamoona Javaid, Mubashir Noman, Abdul Hannan, Shah Nawaz, Mustansar Fiaz, Sajid Ghuffar

发表机构 * University of Science and Technology Beijing（北京科技大学）

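AI summary: Proposes a cascaded segmentation network that combines the spatial and spectral domains, plus an auxiliary feature enhancement module, for efficient waste segmentation in cluttered scenes, validated on three datasets.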

Comments accepted at ICML 2026

AI中文摘要

城市区域的快速扩张和人口增长导致废物产量急剧增加，这需要高效自动化的废物管理。在此背景下，使用深度学习的自动废物回收（AWR）可以帮助人类实现最优废物管理。最近的AWR深度学习方法提供了有前景的废物分割性能，但这些方法依赖大型骨干网络，对AWR系统效率低下，且在杂乱场景中性能下降。为此，本文引入了一种最优废物分割网络，该网络有效利用空间域捕获局部结构依赖性和谱域高效提取全局上下文关系。这种级联设计使网络能够逐步利用互补域中的局部和全局表示，突出有效分割各种废物对象所需的语义信息。此外，引入了辅助特征增强模块（AFEM），以增强目标对象的边界和斑点放大，从而在杂乱场景中实现更好的分割。在ZeroWaste-aug、ZeroWaste-f和SpectralWaste数据集上的大量实验揭示了所提出方法的优势。

英文摘要

The rapid expansion of urban areas and population growth are causing an immense increase in waste production, which demands efficient and automated waste management. In this scenario, automated waste recycling (AWR) using deep learning methods can assist humans in optimal waste management. Recent deep learning approaches for AWR provide promising waste segmentation performance; however, these methods rely on large backbone networks that are inefficient for AWR systems and suffer from performance deterioration in cluttered scenes. To this end, an optimal waste segmentation network is introduced that uses the spatial domain to capture localized structural dependencies and the spectral domain to efficiently extract global contextual relationships. This cascaded design allows the network to progressively leverage both local and global representations across complementary domains to highlight the semantic information necessary for effective segmentation of various waste objects. Furthermore, an auxiliary feature enhancement module (AFEM) is introduced to enhance target object boundaries and amplify blobs for better segmentation in cluttered scenarios. Extensive experiments on the ZeroWaste-aug, ZeroWaste-f, and SpectralWaste datasets reveal the merits of the proposed method.

2606.13042 2026-06-12 cs.AI cs.CV 交叉投稿

Augmentation techniques for video surveillance in the visible and thermal spectral range

可见光和热红外光谱范围内视频监控的增强技术

Vanessa Buhrmester, Ann-Kristin Grosselfinger, David Munch, Michael Arens

发表机构 * Fraunhofer Institute of Optronics, System Technologies and Image Exploitation IOSB（弗劳恩霍夫光学、系统技术与图像处理研究所）

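AI summary: For multispectral CNN-based object detection, examines the differences between visible and thermal infrared imagery and how data augmentation techniques affect classification accuracy, with the aim of improving surveillance performance.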

Comments 8 pages

Journal ref: SPIE Security + Defence, Strasbourg, 10th September 2019

AI中文摘要

在智能视频监控中，摄像机在白天和夜晚记录图像序列。通常，这需要不同的传感器。为了获得更好的性能，将它们结合起来并不罕见。我们关注的情况是，长波红外摄像机连续记录，此外，另一台摄像机在白天记录可见光谱范围内的图像，并且智能算法监控采集的图像。更准确地说，我们的任务是基于多光谱CNN的目标检测。乍一看，可见光谱范围内的图像与热红外图像的区别在于，前者具有颜色和清晰的纹理信息，而后者不包含物体发出的热辐射信息。尽管颜色可以为分类任务提供有价值的信息，但诸如光照变化和不同传感器的特性等因素仍然构成重大问题。无论如何，获取足够且实用的热红外数据集来训练深度神经网络仍然是一个挑战。这就是为什么借助可见光谱范围内的数据进行训练可能是有利的，特别是当待评估的数据同时包含可见光和红外数据时。然而，目前尚不清楚热辐射、形状或颜色信息的强烈变化如何影响分类精度。为了更深入地了解卷积神经网络如何做出决策以及它们从不同传感器输入数据中学到什么，我们研究了不同增强技术的适用性和鲁棒性。

英文摘要

In intelligent video surveillance, cameras record image sequences during day and night. Commonly, this demands different sensors, and it is not unusual to combine them to achieve better performance. We focus on the case in which a long-wave infrared camera records continuously, a second camera records in the visible spectral range during daytime, and an intelligent algorithm monitors the captured imagery. More precisely, our task is multispectral CNN-based object detection. At first glance, images from the visible spectral range differ from thermal infrared ones in that they contain color and distinct texture information but no information about the thermal radiation emitted by objects. Although color can provide valuable information for classification tasks, effects such as varying illumination and the characteristics of different sensors still represent significant problems. Moreover, obtaining sufficient and practical thermal infrared datasets for training a deep neural network remains a challenge. For this reason, training with the help of data from the visible spectral range could be advantageous, particularly if the data to be evaluated contains both visible and infrared data. However, there is no clear evidence of how strongly variations in thermal radiation, shape, or color information influence classification accuracy. To gain deeper insight into how Convolutional Neural Networks make decisions and what they learn from different sensor input data, we investigate the suitability and robustness of different augmentation techniques...

2606.12601 2026-06-12 cs.CV 新提交

Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

双状态槽注意力：解耦外观与身份用于视频目标中心学习

Sieu Tran, Duc Nguyen, Hao Vo, Khoa Vo, Ngan Le

发表机构 * University of Arkansas（阿肯色大学）

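AI summary: Proposes Dual-State Slot Attention (DSSA), which splits each slot into a local state (appearance) and an identity state (stable identity) and uses competition-modulated aggregation to reduce interference from weakly matching slots, improving video object segmentation quality and temporal consistency.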

详情

AI中文摘要

无监督视频目标中心学习旨在无需监督地将动态场景分解为持久的目标级表示。然而，现有的基于槽的方法在快速运动和部分遮挡等挑战性场景中难以维持稳定的目标身份。首先，它们通常将目标的每帧外观和跨帧身份编码在单个槽向量中，造成目标冲突导致槽交换：重建需要对瞬态视觉变化敏感，而时间一致性需要对它们不变。其次，槽注意力中使用的令牌重归一化可能放大弱注意力槽，使其吸收其他目标的令牌，破坏槽与目标的对应关系。我们提出双状态槽注意力（DSSA），一种完全自监督框架，通过分离外观与身份并减少弱匹配槽的虚假更新来解决这些限制。DSSA将每个槽分解为用于每帧外观的局部状态和用于时间稳定目标信息的身份状态，从而用分离的表示对齐重建和时间一致性。身份状态通过学习的循环转换更新，该转换作为局部状态的时间滤波器，而竞争调制聚合（CMA）降低弱匹配槽的更新权重，防止它们吸收其他目标的令牌。在MOVi-C、MOVi-D和YouTube-VIS上的实验表明，DSSA在分割质量和时间一致性上持续优于先前方法，同时在下游目标识别和视频动态预测中表现更强。代码和模型将在接收后公开。

英文摘要

Unsupervised video object-centric learning aims to decompose dynamic scenes into persistent, object-level representations without supervision. However, existing slot-based methods struggle to maintain stable object identity in challenging settings such as rapid motion and partial occlusion. First, they typically encode both the per-frame appearance of an object and its identity across frames in a single slot vector, creating an objective conflict that leads to slot swapping: reconstruction requires sensitivity to transient visual changes, whereas temporal consistency requires invariance to them. Second, the token renormalization used in Slot Attention can amplify weakly attending slots, allowing them to absorb tokens from other objects and destabilize slot-to-object correspondence. We propose Dual-State Slot Attention (DSSA), a fully self-supervised framework that addresses these limitations by separating appearance from identity and by reducing spurious updates from weakly matching slots. DSSA decomposes each slot into a local state for per-frame appearance and an identity state for temporally stable object information, thereby aligning reconstruction and temporal consistency with separate representations. The identity state is updated through a learned recurrent transition that acts as a temporal filter on the local state, while competition-modulated aggregation (CMA) down-weights updates from weakly matching slots and prevents them from absorbing tokens from other objects. Experiments on MOVi-C, MOVi-D, and YouTube-VIS demonstrate that DSSA consistently improves segmentation quality and temporal consistency over prior methods, while also yielding stronger downstream object recognition and video dynamics prediction. Code and models will be made publicly available upon acceptance.

2606.13030 2026-06-12 cs.CV 新提交

A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition

一种结合跨主体伪标签与语义对齐的多模态微手势识别框架

Haoran Zhang, Haokun Zhang, Pengyu Liu, Yujia Zhang, Weibao Xue, Yanbin Hao

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology (HFUT)（合肥工业大学计算机科学与信息工程学院）； School of Computer Science, University of Auckland (UOA)（奥克兰大学计算机科学学院）

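AI summary: To address low signal-to-noise ratio, long-tailed class distribution, and cross-subject domain shift in micro-gesture recognition, proposes a multi-modal framework with saliency-guided extraction, square-root smoothed weighting, an Orthogonal Semantic Embedding Loss, and cross-modal pseudo-labeling, reaching an F1-score of 68.13%.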

Comments 14 pages, 2 figures

详情

AI中文摘要

微手势（MGs）是自发的、细微的身体动作，经常传达隐藏的人类情感。在未修剪视频中识别MGs仍然极具挑战性，因为其极低的信噪比、严重的长尾类分布以及跨主体评估场景中固有的域偏移。在本文中，我们为第四届MiGA-IJCAI挑战赛的Track 1提出了一个全面的多模态框架。为了捕捉细粒度表示，我们设计了一个显著性引导的多模态提取流程，整合了68关键点骨架关节坐标、3D热图体积和高分辨率RGB视觉特征。我们引入了一种温和的平方根平滑加权机制，配合正交语义嵌入损失，以保护尾部类别而不损害整体识别能力。更重要的是，为了弥合跨主体泛化差距，我们提出了一种跨模态伪标签（CMPL）策略用于无监督域适应，显著提升了单模态鲁棒性。最后，采用温度缩放软投票机制以减轻后期融合中的过度自信。大量实验表明，我们的框架达到了具有竞争力的68.13%的F1分数，获得第四名。

英文摘要

Micro-gestures (MGs) are spontaneous and subtle body movements that frequently convey hidden human emotions. Recognizing MGs in untrimmed videos remains highly challenging due to their extremely low signal-to-noise ratio, severe long-tailed class distribution, and the inherent domain shift encountered in cross-subject evaluation scenarios. In this paper, we propose a comprehensive multi-modal framework for Track 1 of the 4th MiGA-IJCAI Challenge. To capture fine-grained representations, we design a saliency-guided multi-modal extraction pipeline integrating 68-keypoint skeleton joint coordinates, 3D heatmap volumes, and high-resolution RGB visual features. We introduce a gentle square-root smoothed weighting mechanism paired with an Orthogonal Semantic Embedding Loss to protect tail classes without compromising overall recognition capabilities. More importantly, to bridge the cross-subject generalization gap, we propose a Cross-Modal Pseudo-Labeling (CMPL) strategy for unsupervised domain adaptation, which significantly boosts single-modal robustness. Finally, a temperature-scaled soft-voting mechanism is used to alleviate overconfidence during late fusion. Extensive experiments demonstrate that our framework achieves a competitive F1-score of 68.13%, securing 4th place.

2606.13332 2026-06-12 cs.CV 新提交

OR-Action: Multi-Role Video Understanding with Fine-Grained Actions

OR-Action: 细粒度动作的多角色视频理解

Felix Tristram, Ege Özsoy, Christian Benz, Marcel Walch, Ghazal Ghazaei, Nassir Navab

发表机构 * Technical University of Munich（慕尼黑工业大学）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）； Carl Zeiss AG（卡尔蔡司股份公司）

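AI summary: Noting that scene-graph methods for operating room activity understanding lack temporal modeling, proposes a fine-grained multi-role action benchmark built on a public dataset and a vision-only temporal model that significantly outperforms graph-based methods, along with a multi- to single-view feature alignment strategy that improves single-view performance.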

详情

AI中文摘要

对手术室活动的细粒度理解能够实现工作流感知的辅助，但由于杂乱、遮挡和有限的感知，仍然困难。建模该环境的主流方法是使用场景图作为OR交互的可解释表示。然而，在没有显式时间建模的情况下，将它们的逐帧关系预测转换为时间上延伸的细粒度动作是具有挑战性的。为了对当前OR理解方法进行原则性的时间评估，我们引入了第一个以动作为中心的基准，该基准基于公开可用的自我中心-外部中心OR数据集，通过定义细粒度的多角色动作分类法，并通过从地面真实场景图状态变化中蒸馏生成密集动作片段。在该基准上的实验表明，当前的场景图预测方法难以建模时间结构，即使通过图神经网络添加显式建模也是如此。因此，我们引入了一种纯视觉时间模型，当使用所有可用的自我中心视频作为输入时，该模型显著优于基于图的方法。在此模型基础上，我们还引入了一种新颖的多视角到单视角特征对齐策略，提高了多角色动作识别的单视角性能，减少了对大量自我中心视频采集的需求。基准和代码将在接收后发布。

英文摘要

Fine-grained understanding of operating room (OR) activity could enable workflow-aware assistance, yet remains difficult due to clutter, occlusions, and limited sensing. The prevailing approach to modeling this environment uses scene graphs as an interpretable representation of OR interactions. Converting their frame-wise relational predictions into temporally extended, fine-grained actions, however, is challenging without explicit temporal modeling. To enable a principled temporal evaluation of current OR understanding methods, we introduce the first action-centric benchmark built on a publicly available ego-exocentric OR dataset by defining a fine-grained, multi-role action taxonomy and generating dense action segments via distillation from ground-truth scene graph state changes. Experiments on this benchmark show that current scene graph prediction methods struggle to model temporal structure, even when adding explicit modeling through Graph Neural Networks. We therefore introduce a vision-only temporal model that outperforms graph-based methods significantly when using all available egocentric video as input. Building on this model, we also introduce a novel multi- to single-view feature alignment strategy that improves single-view performance on multi-role action recognition, mitigating the need for extensive egocentric video capture. Benchmark and code will be released upon acceptance.

2606.13410 2026-06-12 cs.CV cs.HC 新提交

Person Identification from Contextual Motion

基于情境运动的人物识别

Igor Kviatkovsky, Ehud Rivlin, Ilan Shimshoni

发表机构 * Technion – Israel Institute of Technology（以色列理工学院）； University of Haifa（海法大学）

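AI summary: Presents a generative model of how action instances are created and derives probabilistic identity inference schemes for surveillance and authentication; also introduces an interactive identification scenario, framed as a sequential message exchange in which cues are chosen to maximize mutual information, and reports high recognition rates.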

详情

AI中文摘要

我们考虑基于运动风格识别人的问题。我们提出了一个描述动作实例创建过程的生成模型，并针对监控和认证应用所驱动的两种常见人物识别场景推导了概率身份推断方案。我们引入了一种新颖的、交互式的人物运动模式识别场景。为此，我们将识别过程形式化为受试者与系统之间的顺序消息交换会话。受试者的行为使用受人类信息处理（HIP）范式启发的概率生成模型建模。在每个阶段，系统向受试者呈现视觉刺激（线索）并记录其运动响应。线索的选择旨在最大化预期响应与受试者身份的互信息。一旦记录，响应用于更新可能受试者身份的后验概率。一旦达到足够的分类置信水平，该过程终止。据我们所知，这是首次在这种交互式设置中解决人物识别问题。我们在五个公开数据集和我们自己的新数据集（包含22名受试者对15个线索的4,476条记录）上报告了高识别率。

英文摘要

We consider the problem of identifying people based on their motion styles. We present a generative model describing the action instance creation process and derive a probabilistic identity inference scheme for two common person identification scenarios motivated by surveillance and authentication applications. We introduce a novel, interactive scenario for person identification from motion patterns. To this end, we formalize the identification process in the context of a sequential message exchange session between the subject and the system. The subject's behavior is modeled using a probabilistic generative model inspired by the Human Information Processing (HIP) paradigm. At each stage, the system presents a visual stimulus (a cue) to the subject and records their motion response. The cue is selected so as to maximize the mutual information of the expected response and the subject's identity. Once recorded, the response is used to update the a posteriori probability over possible subjects' identities. The process terminates once a sufficient classification confidence level is reached. To the best of our knowledge, this is the first time person identification is addressed in such an interactive setting. We report high recognition rates on five publicly available datasets and our own novel dataset consisting of 4,476 recordings of 22 test subjects responding to 15 cues.
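
The interactive loop in this abstract (choose the cue with the highest mutual information, observe a response, update the posterior, stop at a confidence level) can be sketched for a toy discrete case. This is not the authors' model: it assumes responses fall into a small number of discrete categories with known likelihood tables, whereas the paper models continuous motion responses, and every name below is hypothetical.

```python
import math


def entropy(dist):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def mutual_information(belief, likelihood):
    """I(identity; response) for one cue.

    belief[s] is the current probability of subject s;
    likelihood[s][r] is P(response r | subject s) under this cue.
    """
    n_subjects, n_responses = len(belief), len(likelihood[0])
    marginal = [
        sum(belief[s] * likelihood[s][r] for s in range(n_subjects))
        for r in range(n_responses)
    ]
    conditional = sum(belief[s] * entropy(likelihood[s]) for s in range(n_subjects))
    return entropy(marginal) - conditional


def pick_cue(belief, cue_models):
    """Index of the cue whose expected response is most informative about identity."""
    return max(range(len(cue_models)),
               key=lambda c: mutual_information(belief, cue_models[c]))


def update_posterior(belief, likelihood, response):
    """Bayes rule; assumes the observed response has non-zero total probability."""
    unnormalized = [p * likelihood[s][response] for s, p in enumerate(belief)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def identify(prior, cue_models, respond, confidence=0.95, max_rounds=20):
    """Present cues until one identity reaches the confidence level."""
    belief = list(prior)
    for _ in range(max_rounds):
        if max(belief) >= confidence:
            break
        cue = pick_cue(belief, cue_models)
        belief = update_posterior(belief, cue_models[cue], respond(cue))
    return belief


# Toy setup: 2 subjects, 2 cues, 2 response categories.
cue_models = [
    [[0.5, 0.5], [0.5, 0.5]],  # cue 0: both subjects respond alike (uninformative)
    [[0.9, 0.1], [0.1, 0.9]],  # cue 1: subjects respond differently
]
first_cue = pick_cue([0.5, 0.5], cue_models)
final_belief = identify([0.5, 0.5], cue_models, respond=lambda cue: 0)
```

In this toy setup the uninformative cue is never selected, and two consistent responses are enough to push the belief in subject 0 past the 0.95 stopping level.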

2506.01274 2026-06-12 cs.CV cs.AI 版本更新

ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

ReFoCUS: 用于上下文理解的强化引导帧优化

Hosu Lee, Junho Kim, Hyunjun Kim, Yong Man Ro

发表机构 * Korea Advanced Institute of Science & Technology（韩国科学技术院）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

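AI summary: Proposes ReFoCUS, the first framework to integrate online policy-gradient reinforcement learning into frame-level optimization for video LLMs; an autoregressive, query-conditional selection architecture learns a frame selection policy without explicit frame-level supervision and improves video QA reasoning accuracy.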

Comments Project page: https://interlive-team.github.io/ReFoCUS/

详情

AI中文摘要

近期大型多模态模型（LMMs）的进展实现了有效的视觉-语言推理，然而视频理解能力仍受限于次优的帧选择策略，尽管视频专用LMMs发展迅速。先前的工作尝试通过静态启发式或外部检索模块来提供帧级信息，但这些方法往往无法捕捉与给定用户查询相关的视觉线索，混淆了原始视觉动态与真正的语义相关性。在本文中，我们介绍了ReFoCUS（用于上下文理解的强化引导帧优化），这是首个将在线策略梯度强化学习集成到视频-LLMs帧级优化的框架。ReFoCUS旨在学习帧选择策略，利用来自参考模型的奖励信号来捕捉其对最佳支持时间接地响应的帧组合的潜在评分行为。为了高效探索巨大的组合帧空间，我们采用了一种自回归且查询条件的选择架构，确保上下文一致性的同时降低复杂度。我们的策略学习无需显式帧级监督，因为它隐式地发现了最优且语义一致的帧组合。ReFoCUS在多个视频问答基准测试中持续提高了推理准确性，证明了将帧选择与模型内部效用对齐的优势。

英文摘要

Recent progress in Large Multi-modal Models (LMMs) has enabled effective vision-language reasoning, yet video understanding remains constrained by suboptimal frame selection strategies, despite the rapid development of video-specialized LMMs. Prior works attempted to solve this with static heuristics or external retrieval modules that feed frame-level information, but these approaches often fail to capture visual cues grounded in the given user queries, conflating raw visual dynamics with true semantic relevance. In this paper, we introduce ReFoCUS (Reinforcement-guided Frame Optimization for Contextual UnderStanding), the first framework to integrate online policy-gradient reinforcement learning into frame-level optimization for video-LLMs. ReFoCUS aims to learn a frame selection policy, leveraging reward signals derived from reference models to capture their underlying scoring behavior over frame combinations that best support temporally grounded responses. To efficiently explore the large combinatorial frame space, we employ an autoregressive and query-conditional selection architecture that ensures contextual consistency while reducing complexity. Our policy learning removes the need for explicit frame-level supervision, as it implicitly discovers optimal and semantically consistent frame compositions. ReFoCUS consistently improves reasoning accuracy across multiple video QA benchmarks, demonstrating the advantage of aligning frame selection with model-internal utility.

2506.21855 2026-06-12 cs.CV 版本更新

Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation

Periodic-MAE：用于rPPG估计的周期性视频掩码自编码器

Jiho Choi, Sang Jun Lee

发表机构 * Division of Electronics and Information Engineering, Jeonbuk National University, Republic of Korea（韩国全北国立大学电子与信息工程学部）

AI总结提出Periodic-MAE，一种自监督框架，通过周期性感知掩码和生理频带约束，从无标签面部视频学习可泛化的时空表示，提升远程光电容积描记法（rPPG）估计性能。

AI中文摘要

在本文中，我们提出Periodic-MAE，一种自监督框架，用于从无标签面部视频中学习周期性生理信号的通用时空表示。该方法利用掩码自编码器（MAE），通过重建掩码视频令牌学习高维面部表示，而不依赖远程光电容积描记法（rPPG）特定监督。为了明确地将表示学习与rPPG特征对齐，我们引入了一种基于视频重采样的周期性感知帧掩码策略，使编码器能够学习捕获与脉搏信号估计相关的准周期性时间模式的表示。此外，生理频带约束被集成到MAE预训练框架中，利用脉搏信号在频域的稀疏性，引导学习到的表示朝向生理上有意义的模式。预训练后，学习到的表示被迁移到下游rPPG估计任务，其中编码器作为通用特征提取器，从面部视频中恢复脉搏相关信号。我们在四个基准数据集（PURE、UBFC-rPPG、MMPD和V4V）上进行了广泛实验。此外，我们在无约束光照条件和受试者运动下收集的真实世界rPPG数据集上评估了所提方法。实验结果表明，Periodic-MAE持续改善了rPPG估计性能，特别是在具有挑战性的跨数据集和真实世界评估场景中。我们的代码可在以下网址获取：https://github.com/ziiho08/Periodic-MAE。

英文摘要

In this paper, we propose Periodic-MAE, a self-supervised framework for learning generalizable spatio-temporal representations of periodic physiological signals from unlabeled facial videos. The proposed method leverages a masked autoencoder (MAE), which learns high-dimensional facial representations by reconstructing masked video tokens without relying on remote photoplethysmography (rPPG) specific supervision. To explicitly align representation learning with the characteristics of rPPG, we introduce a periodicity-aware frame masking strategy based on video resampling, enabling the encoder to learn representations that capture quasi-periodic temporal patterns relevant to pulse signal estimation. In addition, physiological bandlimit constraints are integrated into the MAE pre-training framework, exploiting the sparsity of pulse signals in the frequency domain to guide the learned representations toward physiologically meaningful patterns. After pre-training, the learned representations are transferred to downstream rPPG estimation, where the encoder serves as a generic feature extractor for recovering pulse-related signals from facial videos. We conduct extensive experiments on four benchmark datasets, including PURE, UBFC-rPPG, MMPD, and V4V. Moreover, we evaluate the proposed approach on a real-world rPPG dataset collected under unconstrained lighting conditions and subject motion. Experimental results demonstrate that Periodic-MAE consistently improves rPPG estimation performance, particularly in challenging cross-dataset and real-world evaluation settings. Our code is available at https://github.com/ziiho08/Periodic-MAE.

2605.24488 2026-06-12 cs.CV cs.GR 版本更新

Appearance-Invariant Detection of Suggestive Motion via Laban Movement Descriptors

基于拉班运动描述子的暗示性运动外观不变检测

Jaehoon Ahn, Jeonghan Kong, Moon-Ryul Jung

发表机构 * Sogang University（西江大学）

AI总结提出一种仅基于运动的分类流程，利用拉班运动分析（LMA）描述子从SMPL骨架轨迹中检测暗示性和露骨动作；在无泄漏评估协议下，基于61维LMA描述子的逻辑回归达到68%的二元SFW/NSFW准确率（随机森林为70%）。

Comments 5 pages, 2 figures, 3 tables. Extended version of a poster accepted to SIGGRAPH 2026

DOI: 10.1145/3799825.3818709

AI中文摘要

在线多人3D虚拟环境中的内容审核日益自动化，但检测工作主要集中于图像、视频和音频，暗示性运动仍是盲点。我们提出一种仅基于运动的分类流程，使用拉班运动分析（LMA）描述子从SMPL骨架轨迹中检测暗示性和露骨动作。在涵盖日常、艺术、暗示和露骨四类动作的数据集（17小时以上视频）上，基于61维LMA描述子训练的逻辑回归在无泄漏评估协议下达到68%的二元SFW/NSFW准确率（随机森林为70%）。在这一水平上，我们的描述子与一个学习式视频模型表现相当，该模型在将同一运动重新渲染为无外观视频（不含服装、皮肤或场景的灰色人形）后训练得到。每个关节轨迹的间接性（曲折度）定义为关节路径长度与其净位移之比，它在暗示层级达到峰值，表明拉班空间（Space）因素中从直接到间接的极性可作为从功能性运动转向暗示性运动的可解释标志。总之，基于拉班的运动学描述子为暗示性运动检测提供了一种轻量、可解释的方法：每个决策都可分解为有名称、有理论依据的特征。由于分类器仅作用于姿态轨迹，审核可直接在虚拟环境中的化身姿态上运行，无需任何外观数据。

英文摘要

Content moderation in online multiplayer 3D virtual environments is increasingly automated, yet detection has focused on images, video, and audio, leaving suggestive motion a blind spot. We present a motion-only classification pipeline that detects suggestive and explicit movement from SMPL skeleton trajectories using Laban Movement Analysis (LMA) descriptors. On a dataset spanning everyday, artistic, suggestive, and explicit movement (17+ hours of video), a logistic regression trained on 61-feature LMA descriptors reaches 68% binary SFW/NSFW accuracy (70% random forest) under a leak-free evaluation protocol. At this level, our descriptor performs comparably to a learned video model trained on the same motion re-rendered as appearance-free video, a gray figure with no clothing, skin, or scene. The indirectness (tortuosity) of each joint's trajectory, measured as the ratio of the joint's path length to its net displacement, peaks at the suggestive tier, showing that the Direct-to-Indirect polarity of Laban's Space factor provides an interpretable marker of the shift from functional to suggestive motion. Ultimately, Laban-based kinematic descriptors offer a lightweight, interpretable approach to suggestive-motion detection: every decision decomposes into named, theory-grounded features. Because the classifier operates on pose trajectories alone, moderation can run directly on avatar poses in virtual environments, with no appearance data.
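The indirectness measure in the abstract above is defined as a joint's path length divided by its net displacement. A minimal sketch of that ratio follows; the function name, the tuple-based trajectory format, and the handling of zero net displacement are illustrative assumptions, not the authors' implementation.

```python
import math

def tortuosity(points):
    """Path length divided by net displacement for one joint trajectory.

    `points` is a sequence of (x, y, z) positions over time (assumed format).
    Returns math.inf when the joint ends where it started (assumed convention).
    """
    # Sum of distances between consecutive positions.
    path_length = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
    # Straight-line distance from first to last position.
    net_displacement = math.dist(points[0], points[-1])
    if net_displacement == 0:
        return math.inf
    return path_length / net_displacement

# Hypothetical trajectories: a straight reach and a detour to the same endpoint.
straight = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
detour = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (2.0, 0.0, 0.0)]
print(tortuosity(straight), tortuosity(detour))
```

A perfectly direct movement gives a ratio of 1, and the ratio grows as the path wanders, which is why a higher value can be read as a more "Indirect" use of space.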

2606.08436 2026-06-12 cs.CV 版本更新

CACR: Reinforcing Temporal Answer Grounding in Instructional Video via Candidate-Aware Causal Reasoning

通过候选感知因果推理增强教学视频中的时间答案定位

Muge Qi, Rong Fu, Pengbin Feng, Xianda Li, Yu Cai, Yifu Guo, Shizhe Zhang, Simon James Fong, Lei Ma, Bin Li

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出候选感知因果推理框架，通过视觉-语言预训练候选选择和基于GRPO的时序逻辑推理，解决教学视频中复杂问题理解和长视频片段定位挑战，在六个基准上取得最优mIoU。

AI中文摘要

教学视频中的时间答案定位任务旨在定位响应自然语言查询的精确视频片段，对于直接视频答案检索日益重要。由于需要理解语义复杂的问题并解决未修剪视频与短目标时刻之间的显著长度不匹配，该任务仍然具有挑战性。现有方法通常对无关内容敏感或视觉推理能力不足。为了解决这些局限性，我们提出了候选感知因果推理框架。我们的方法首先采用基于视觉-语言预训练的候选选择算法高效生成K个候选片段，然后应用由拒绝奖励机制增强并通过组相对策略优化优化的时序逻辑推理模块进行稳健推理。在六个基准上的大量实验表明，我们的方法在平均交并比方面达到了最先进的性能，为长视频中基于推理的检索提供了新视角。

英文摘要

The task of temporal answer grounding in instructional video (TAGV), which aims to locate precise video segments that respond to natural language queries, is increasingly important for direct video answer retrieval. This task remains challenging due to the need to comprehend semantically complex questions and to address the significant length mismatch between untrimmed videos and short target moments. Existing methods often suffer from sensitivity to irrelevant content or insufficient visual reasoning capabilities. To tackle these limitations, we propose a Candidate-Aware Causal Reasoning (CACR) framework. Our approach first employs a Visual-Language Pre-training based Candidate Selection (VBCS) algorithm to efficiently generate K candidate segments, then applies a temporal logic reasoning module enhanced by a rejection reward mechanism and optimized via Group Relative Policy Optimization (GRPO) for robust inference. Extensive experiments on six benchmarks demonstrate that our method achieves state-of-the-art performance in terms of mean Intersection-over-Union (mIoU), providing a new perspective for reasoning-based retrieval in long videos.
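The abstract reports results as mean Intersection-over-Union (mIoU) between predicted and ground-truth video segments. As a rough sketch of how that metric is commonly computed for temporal segments, assuming each segment is a (start, end) pair in seconds (the function names and sample values are hypothetical):

```python
def temporal_iou(pred, gt):
    """IoU of two temporal segments given as (start, end) in seconds."""
    pred_start, pred_end = pred
    gt_start, gt_end = gt
    # Overlap is zero when the segments do not intersect.
    intersection = max(0.0, min(pred_end, gt_end) - max(pred_start, gt_start))
    union = (pred_end - pred_start) + (gt_end - gt_start) - intersection
    return intersection / union if union > 0 else 0.0

def mean_iou(preds, gts):
    """Average temporal IoU over paired predictions and ground truths."""
    scores = [temporal_iou(p, g) for p, g in zip(preds, gts)]
    return sum(scores) / len(scores)

# Hypothetical predictions for two queries.
preds = [(10.0, 20.0), (0.0, 5.0)]
gts = [(15.0, 25.0), (0.0, 5.0)]
print(mean_iou(preds, gts))
```

Because the target moments are short relative to the untrimmed video, small boundary errors reduce IoU sharply, which is part of why candidate selection before reasoning can help.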

2606.12562 2026-06-12 cs.CV cs.GR 新提交

HairPort: In-context 3D-aware Hair Import and Transfer for Images

HairPort: 上下文感知的3D发型导入与迁移

Alireza Heidari, Amirhossein Alimohammadi, Wallace Michel Pinto Lira, Adi Bar-Lev, Ali Mahdavi-Amiri

发表机构 * Simon Fraser University（西蒙菲莎大学）； Huawei Canada（华为加拿大）

AI总结提出HairPort框架，通过显式分离发型移除与迁移，并利用3D感知管道实现大姿态差异下的发型迁移，结合LoRA适配的秃头转换器和条件流匹配生成器，实现高质量、身份保持的发型迁移。

Comments Accepted to SIGGRAPH 2026 (Conference Papers Track). 23 pages, 15 figures, 10 tables, including supplementary material as appendices. Project page: https://deepmancer.github.io/HairPort/

DOI: 10.1145/3799902.3811046

AI中文摘要

在图像之间迁移发型是计算机图形学、计算机视觉和视觉效果中一个重要但具有挑战性的任务。它使用户能够在无需实际改变发型的情况下探索新造型，应用于虚拟试穿系统、增强现实和娱乐等领域。大多数先前的方法在姿态差异较小时表现最佳，但在视角和尺度差异较大时效果不佳，此时缺失的发型内容必须合成而非迁移。我们提出HairPort，一个3D感知的发型迁移框架，通过显式分离发型移除与迁移，并在合成前强制几何一致性来解决这些问题。我们引入了一个秃头转换器，通过基于LoRA的上下文适配FLUX.1 Kontext生成逼真的秃头人脸版本。为了训练我们的秃头转换器，我们引入了一个新数据集Baldy，包含6000对在不同身份和条件下的秃头和原始图像。我们还使用了一个3D感知迁移管道，在将参考发型合成到源图像之前，从目标视角重建并重新渲染该发型。由于具有3D感知能力，我们的方法支持源和目标之间的大姿态和尺度差异。最后，一个条件流匹配生成器从秃头源和几何对齐的参考引导中合成迁移结果。综合来看，我们的方法实现了准确、姿态一致且身份保持的发型迁移，在定性和定量上均优于现有方法。

英文摘要

Transferring hairstyles between images is an important but challenging task in computer graphics, computer vision, and visual effects. It enables users to explore new looks without physically altering their hair, with applications in virtual try-on systems, augmented reality, and entertainment. Most prior works operate best under small pose gaps, and they fall short under large viewpoint and scale differences, where missing hair content must be synthesized rather than transferred. We propose HairPort, a 3D-aware hairstyle transfer framework that attempts to solve these issues by explicitly separating hair removal from transfer and enforcing geometric consistency before synthesis. We introduce a Bald Converter, which produces realistic bald versions of faces through LoRA-based in-context adaptation of FLUX.1 Kontext. To train our Bald Converter, we introduce a new dataset, Baldy, containing 6,000 paired bald and original images across diverse identities and conditions. We also use a 3D-Aware Transfer Pipeline that reconstructs and re-renders the reference hairstyle from the target viewpoint before compositing it onto the source image. Being 3D aware, our method supports large pose and scale discrepancies between the source and target. Finally, a conditional flow-matching generator synthesizes the transferred result from the bald source and geometry-aligned reference guidance. Together, our method enables accurate, pose-consistent, and identity-preserving hairstyle transfer, outperforming existing methods both qualitatively and quantitatively.

2606.12575 2026-06-12 cs.CV 新提交

DuET: 双专家轨迹用于扩散图像编辑

Lidia Troeshestova, Alexander Ustyuzhanin, Sergey Kastryulin

发表机构 * HSE University（高等经济大学）； Yandex

AI总结提出训练自由的DuET方法，通过临时切换到文本到图像阶段再返回编辑模式，缓解源图像条件限制，提升编辑指令相关性、语义保真度和感知质量。

AI中文摘要

最近的扩散编辑器在每一步去噪过程中都以源图像为条件，执行多样化的基于指令的编辑。然而，持续的源图像条件可能会限制编辑的完成程度和结果的自然性，尤其是当目标场景与输入差异较大时。我们提出了DuET（双专家轨迹），一种无需训练的推理方法，通过先过渡到文本到图像阶段再返回编辑模式，暂时放松源图像条件，使得去噪轨迹能够向目标分布移动，同时保留图像条件编辑的结构优势。在不修改模型权重或增加采样成本的情况下，DuET在多种模型和基准上持续改善了指令相关性、语义保真度和感知质量。在某些情况下，这些改进伴随着源图像保留的适度降低，揭示了源保留与编辑保真度之间可预测的权衡。

英文摘要

Recent diffusion editors perform diverse instruction-based edits while conditioning on the source image at every denoising step. Yet persistent source-image conditioning can limit how fully an edit is executed and how natural the result appears, especially when the target scene diverges substantially from the input. We introduce DuET (Dual Expert Trajectories), a training-free inference method that temporarily relaxes source-image conditioning by transitioning through a text-to-image phase before returning to edit mode, allowing the denoising trajectory to move toward the target distribution while retaining the structural benefits of image-conditioned editing. Without modifying model weights or increasing sampling cost, DuET consistently improves instruction relevance, semantic fidelity, and perceptual quality across diverse models and benchmarks. In some cases, these gains come with a modest reduction in source-image preservation, revealing a predictable trade-off between source preservation and edit fidelity.

2606.13304 2026-06-12 cs.CV 新提交

ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance

ReFree: 通过无奖励强化学习和多级语音引导实现逼真的共语音视频生成

Salaheldin Mohamed, M. Hamza Mughal, Rishabh Dabral, Christian Theobalt

发表机构 * Télécom Paris, Institut Polytechnique de Paris（巴黎高等电信学院，巴黎综合理工学院）； Max Planck Institute for Informatics（马克斯·普朗克信息学研究所）

AI总结提出ReFree-S2V框架，利用流匹配和预训练视频生成模型，通过多级语音表示和可学习选择器实现精细唇同步与自然表情，并引入无奖励强化学习生成自然头部运动，在唇同步准确性和自然度上达到最优。

AI中文摘要

语音驱动的说话角色动画旨在生成逼真的肖像视频，传达自然的对话行为，使面部运动与语音音频对齐。尽管视频生成的最新进展显著提高了基于视频的动画的真实感，但实现准确的唇部发音和富有表现力的行为仍然具有挑战性。现有方法通常在精确的音素到唇同步与动态面部表情和头部运动之间进行权衡，产生要么准确但僵硬，要么富有表现力但同步性差的动画。我们通过提出ReFree-S2V来解决这一挑战，这是一个流匹配语音到肖像动画框架，基于预训练的视频生成模型，在语音驱动的肖像动画中实现细粒度的语音发音和高层次的表现力线索。该模型引入了一种多级语音表示，在局部和全局粒度上捕捉语音和韵律信息。这些表示通过可学习的级别选择器选择性地注入到Transformer块中，从而实现准确的唇同步和自然的表达性运动。为了实现自然的头部运动，我们进一步在流匹配训练中引入了一种新颖的无奖励强化学习方案，在不依赖手工制作的同步指标或奖励模型以及人类偏好标注的高成本的情况下，抑制感知上不合理的运动。大量实验表明，ReFree-S2V实现了最先进的性能，在定量唇同步准确性和定性人类评估的自然度和表现力方面显著优于现有方法。

英文摘要

Speech-driven talking character animation seeks to generate life-like portrait videos that convey natural conversation behavior, aligning facial motion with spoken audio. Although recent advances in video generation have substantially improved realism in video-based animation, achieving both accurate lip articulation and expressive behavior remains challenging. Existing approaches typically trade off precise phoneme-to-lip synchronization against dynamic facial expressions and head motion, yielding animations that are either accurate yet rigid, or expressive but poorly synchronized. We address this challenge by proposing ReFree-S2V, a flow-matching speech-to-portrait animation framework that builds upon a pretrained video generation model to achieve fine-grained speech articulation and high-level expressive cues in speech-driven portrait animation. This model introduces a multi-level speech representation capturing phonetic and prosodic information at both local and global granularities. These representations are selectively injected into transformer blocks via learnable level selectors, enabling both accurate lip synchronization and natural expressive motion. To achieve natural head movements, we further introduce a novel reward-free reinforcement learning scheme into flow-matching training to discourage perceptually implausible motion without relying on handcrafted synchronization metrics or reward models, or the high cost of human preference annotation. Extensive experiments demonstrate that ReFree-S2V achieves state-of-the-art performance, significantly outperforming existing methods in both quantitative lip-sync accuracy and qualitative human evaluations of naturalness and expressivity.

2606.13312 2026-06-12 cs.CV cs.GR 新提交

MagPlus: Bridging Micro-to-Regular Facial Expressions through Learnable Magnification

MagPlus: 通过可学习放大桥接微表情到常规表情

Sliman Jammal, Andrei Sharf

发表机构 * Ben-Gurion University of the Negev（内盖夫本-古里安大学）

AI总结提出MagPlus管道，通过可学习放大将微表情运动映射到常规表情范围，再利用标准表情模型处理，最后用DeMagPlus恢复强度，无需重新训练即可生成逼真微表情。

AI中文摘要

面部微表情是短暂而细微的面部运动，为真实人类情感提供重要线索。然而，由于标注的微表情数据有限且底层面部运动极其微弱，建模和生成微表情仍然困难。现有的微表情生成方法因此常面临质量有限、鲁棒性弱和泛化能力差的问题。我们提出MagPlus，一个可迁移的微表情处理管道，将微表情分析与标准面部动画模型连接起来。MagPlus不是从头训练专用生成器，而是学习将细微面部运动放大到常规表情范围，将微表情转换为与现有面部表情处理模型兼容的信号。放大后的序列随后被标准面部表情模型用于迁移和合成等任务。互补的DeMagPlus模块将生成的运动恢复为逼真的微表情强度水平，同时保留合成的动态。我们使用四个面部动画模型评估该框架：FOMM、FSRT、MetaPortrait和EmoPortraits。这些模型均未在微表情数据上训练。实验表明，MagPlus-DeMagPlus使预训练的宏表情模型能够生成更逼真的微表情运动，而无需重新训练主干网络。

英文摘要

Facial micro-expressions are subtle and short-lived facial movements that provide important cues about genuine human emotions. However, modeling and generating them remains difficult because annotated micro-expression data is limited and the underlying facial motions are extremely weak. Existing micro-expression generation methods therefore often suffer from limited quality, weak robustness, and poor generalization. We propose MagPlus, a transferable micro-expression processing pipeline that connects micro-expression analysis with standard facial animation models. Instead of training a dedicated generator from scratch, MagPlus learns to magnify subtle facial motions into the range of regular facial expressions, transforming micro-expressions into signals that are compatible with existing facial expression processing models. The magnified sequence is then used by a standard facial expression model for tasks such as transfer and synthesis. A complementary DeMagPlus module then restores the generated motion back to realistic micro-expression intensity levels while preserving the synthesized dynamics. We evaluate the framework using four facial animation models: FOMM, FSRT, MetaPortrait, and EmoPortraits. None of these models are trained on micro-expression data. Experiments show that MagPlus-DeMagPlus enables pretrained macro-expression models to generate more realistic micro-expression motion without retraining the backbones.

2606.13382 2026-06-12 cs.CV cs.AI 新提交

SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation

SmartFont: 少样本字体生成的动态条件分配

Zian Yang, Zixin Wang

发表机构 * Fudan University（复旦大学）

AI总结提出SmartFont扩散框架，通过全局内容-风格生成与弱监督局部校正专家结合，并引入去噪状态条件分配模块动态加权全局与局部特征，实现少样本字体生成的全局完整性与局部细节保真度平衡。

AI中文摘要

少样本字体生成同时需要全局结构完整性和细粒度局部风格保真度。现有方法通常要么依赖全局内容-风格建模（鲁棒但解耦不完美），要么强调组件/局部建模（捕捉细节但严重依赖局部先验和参考覆盖）。我们认为关键挑战不仅在于学习更纯净的条件，而在于通过生成过程中的多级分配来组织互补但有偏的全局和局部条件。为此，我们提出SmartFont，一个基于扩散的少样本字体生成框架，结合全局内容-风格生成与弱监督局部校正专家。局部分支通过弱组件监督学习专家级局部概念和语义有意义的空间图，实现无需显式组件条件推理的细粒度校正。在此基础上，去噪状态条件分配模块在时间步和注入块上自适应地加权全局内容、全局风格和局部校正特征。大量实验表明，SmartFont实现了更好的全局-局部平衡，提高了字形质量和局部细节保真度。

英文摘要

Few-shot font generation simultaneously requires global structural completeness and fine-grained local style fidelity. Existing methods usually either rely on global content-style modeling, which is robust but imperfectly disentangled, or emphasize component/local modeling, which captures fine details but relies heavily on local priors and reference coverage. We argue that the key challenge is not merely to learn purer conditions, but to organize complementary yet biased global and local conditions through multi-level allocation during generation. To this end, we propose SmartFont, a diffusion-based few-shot font generation framework that combines global content-style generation with weakly supervised local corrective experts. The local branch performs semantic-spatial allocation by learning expert-wise local concepts and semantically meaningful spatial maps under weak component supervision, enabling fine-grained correction without requiring explicit component-conditioned inference. On top of this, a denoising-state condition allocation module adaptively weights global content, global style, and local corrective features across timesteps and injection blocks. Extensive experiments show that SmartFont achieves a better global-local balance and improves glyph quality and local detail fidelity.

2606.13432 2026-06-12 cs.CV cs.AI 新提交

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

OmniDirector: 无需交叉配对数据的通用多镜头相机克隆

Jiwen Liu, Shujuan Li, Zhixue Fang, Xiaohan Li, Yan Zhou, Zijie Meng, Zhimin Zhang, Yawen Luo, Guoxin Zhang, Yu-Shen Liu, Pengfei Wan

发表机构 * Kuaishou Technology（快手科技）； Tsinghua University（清华大学）； University of Science and Technology of China（中国科学技术大学）

AI总结提出OmniDirector框架，通过将相机参数编码为网格运动视频，并利用百万级配对数据训练，实现无需交叉配对数据的多镜头相机运动克隆，具备卓越的控制性能。

Comments 12 pages, 8 figures

AI中文摘要

从参考视频中克隆相机运动是视频生成中的一项重要任务，因为视频提供了直观且精确的控制。现有方法要么直接使用无法处理多镜头生成的参数化表示，要么合成交叉配对数据，但受限于数据稀缺性，导致在复杂相机运动克隆中表现不佳。为解决这些问题，我们引入了一种通用的相机运动表示，将相机编码为网格运动视频。该相机网格以视觉方式表示相机参数，并支持集成多样化的轨迹以进行多镜头视频生成。基于此，我们提出了OmniDirector，一个在百万级相机网格-视频对上训练的统一框架，该框架协调角色、动作和相机，为多模态扩散变换器提供导演级别的控制。此外，我们设计了一种新颖的分层提示扩展代理，通过理解信号关系系统地描述相机运动和视觉内容，从而和谐地整合不同的控制信号。大量实验证明了我们框架的卓越性能和出色的可控性。项目页面：https://ymlinfeng.github.io/OmniDirector.github.io/

英文摘要

Cloning camera motion from reference videos is an important task in video generation, as videos provide intuitive and precise control. Existing methods either directly use parametric representations that fail to handle multi-shot generation or synthesize cross-paired data, which suffers from data scarcity, resulting in poor performance in complicated camera motion cloning. To address these issues, we introduce a general camera motion representation that encodes cameras as grid motion videos. This camera grid represents the camera parameters visually and supports the integration of diverse trajectories for multi-shot video generation. Building upon this, we propose OmniDirector, a unified framework trained on million-scale camera grid-video pairs that coordinates characters, actions, and cameras to provide director-level control for multimodal diffusion transformers. Furthermore, we design a novel hierarchical prompt expansion agent that harmoniously integrates different control signals by systematically describing camera motion and visual content through understanding signal relationships. Extensive experiments demonstrate the superior performance and outstanding controllability of our framework. Project page: https://ymlinfeng.github.io/OmniDirector.github.io/

2606.13558 2026-06-12 cs.CV cs.CL 新提交

Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models

编辑比特，差异编码：面向视觉自回归模型的逐比特残差编辑

Shengqiang Zhang, Ruotong Liao, Volker Tresp, Barbara Plank, Hinrich Schütze

发表机构 * LMU Munich & Munich Center for Machine Learning (MCML)（慕尼黑大学 & 慕尼黑机器学习中心 (MCML)）

AI总结提出BitResEdit，一种无需训练的视觉自回归图像编辑方法，通过比特级源负引导和残差编码注入，在保持背景的同时实现强文本对齐。

AI中文摘要

使用视觉自回归（VAR）生成器进行文本引导的图像编辑，需要同时控制模型采样的内容，以及采样到的变化被写回图像代码的位置。现有的VAR编辑器主要操作于令牌流、特征或扁平的下一个令牌对数几率，未能充分利用逐比特残差VAR模型的两个原生结构：逐比特伯努利预测头和用于组装图像的加性多尺度残差代码域。我们提出BitResEdit，一种针对逐比特残差VAR生成器（如Infinity）的无训练编辑器。BitEdit沿着在共享编辑前缀上计算的源-目标对比方向倾斜CFG之后的逐比特对数几率，以此执行源负引导，然后将每个更新投影到干净CFG采样器周围的闭式伯努利-KL信任域中。ResEdit将采样的比特转换为每尺度连续代码残差，用定位掩码对其进行门控，并通过生成器的原生尺度求和重新注入。它们共同将决策时的比特引导与组合时的代码组合耦合，使得被掩码排除的潜在特征通过代码算术精确保留，同时在目标区域内应用局部化的尺度感知编辑。在PIE-Bench上使用Infinity-2B，BitResEdit在相同骨干的VAR编辑器中实现了最强的文本对齐，在编辑区域上的CLIP比此前最强的编辑器提高了+1.07，同时背景保持与其相当。消融实验表明BitEdit和ResEdit在目标对齐和背景保持中发挥互补作用。

英文摘要

Text-guided image editing with visual autoregressive (VAR) generators requires controlling both what the model samples and where the sampled change is written back into the image code. Existing VAR editors mainly operate on token streams, features, or flat next-token logits, leaving two native structures of bitwise-residual VAR models underused: the per-bit Bernoulli prediction head and the additive multi-scale residual code field from which the image is assembled. We propose BitResEdit, a training-free editor for bitwise-residual VAR generators such as Infinity. BitEdit performs source-negative guidance by tilting the post-CFG per-bit log-odds along a source--target contrast computed on a shared edited prefix, then projects each update into a closed-form Bernoulli-KL trust region around the clean CFG sampler. ResEdit converts the sampled bits into per-scale continuous-code residuals, gates them with a localization mask, and re-injects them through the generator's native sum-of-scales. Together they couple decision-time bit guidance with combination-time code composition, so masked-out latent features are preserved exactly by code arithmetic while localized, scale-aware edits are applied inside the target region. On PIE-Bench with Infinity-2B, BitResEdit attains the strongest text alignment among same-backbone VAR editors, improving CLIP on the edited region by +1.07 over the strongest prior editor while keeping background preservation competitive with it. Ablations show BitEdit and ResEdit play complementary roles in target alignment and background preservation.
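To make the idea of tilting a per-bit log-odds value and keeping the update inside a Bernoulli-KL trust region more concrete, here is a rough single-bit sketch. It is not the paper's method: the abstract mentions a closed-form projection, whereas this illustration shrinks the step by bisection, and the KL direction, function names, and parameters are all assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def bernoulli_kl(p, q):
    """KL(Bernoulli(p) || Bernoulli(q)), clamped away from 0 and 1."""
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1.0 - p) * math.log((1.0 - p) / (1.0 - q))

def tilt_with_trust_region(base_logit, contrast, strength, budget, iters=50):
    """Tilt one bit's log-odds, shrinking the step until KL(tilted || base) <= budget.

    Bisection stand-in for a closed-form projection (assumption, see lead-in).
    """
    p_base = sigmoid(base_logit)
    full_step = strength * contrast
    if bernoulli_kl(sigmoid(base_logit + full_step), p_base) <= budget:
        return base_logit + full_step
    # The KL grows monotonically with the step size, so bisect on a scale in [0, 1].
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        if bernoulli_kl(sigmoid(base_logit + mid * full_step), p_base) <= budget:
            lo = mid
        else:
            hi = mid
    return base_logit + lo * full_step

# Hypothetical values: a large tilt is pulled back to respect a small KL budget.
print(tilt_with_trust_region(base_logit=0.0, contrast=1.0, strength=5.0, budget=0.01))
```

The trust region bounds how far the edited sampler can drift from the unedited one on each bit, which is one plausible reading of how guidance strength and fidelity to the base sampler are traded off.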

2606.13676 2026-06-12 cs.CV 新提交

Modality Forcing for Scalable Spatial Generation

模态强制实现可扩展的空间生成

Bardienus Pieter Duisterhof, Deva Ramanan, Jeffrey Ichnowski, Justin Johnson, Keunhong Park

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； World Labs

AI总结提出Modality Forcing方法，通过为每个模态分配独立噪声水平，实现单DiT的联合图像-深度生成，利用稀疏深度数据训练，继承T2I预训练的可扩展性，在深度估计上取得竞争性能。

AI中文摘要

文本到图像（T2I）模型包含丰富的空间先验。合成逼真、杂乱的场景需要理解几何，包括透视和相对尺度。先前的工作通过调整T2I模型利用这一先验进行深度预测，但需要密集深度数据并涉及复杂的方案。我们提出Modality Forcing，一种简单、可扩展的后训练方案，使用在稀疏深度数据上训练的单个DiT进行联合图像-深度生成。Modality Forcing通过为每个模态分配独立的噪声水平，允许以任意排列进行图像和深度的条件生成和联合生成。每个模态的解码器使我们能够在稀疏的真实世界深度上训练，并实现强大的、可泛化的深度预测。我们进一步表明，Modality Forcing继承了T2I预训练的可扩展性：通过从头训练一组T2I模型（370M到3.3B参数），我们发现更大的模型在更多图像数据上训练产生更准确的深度。我们的最强模型与最先进的单目深度估计器竞争，并将现有联合图像-深度生成模型的AbsRel降低了57%。这些结果提供了强有力的证据，表明图像生成是空间感知的可扩展预训练目标。

英文摘要

Text-to-image (T2I) models contain rich spatial priors. Synthesizing photorealistic, cluttered scenes requires an understanding of geometry, including perspective and relative scale. Prior works adapt T2I models to leverage this prior for depth prediction, but they require dense depth data and involve complex recipes. We propose Modality Forcing, a simple, scalable post-training recipe for joint image-depth generation using a single DiT trained on sparse depth data. Modality Forcing enables conditional and joint generation of image and depth in any permutation by assigning separate noise levels per modality. Per-modality decoders let us train on sparse, real-world depth and achieve strong, generalizable depth prediction. We further show that Modality Forcing inherits the scalability of T2I pre-training: by training a set of T2I models from scratch (370M to 3.3B parameters), we find that larger models trained on more image data produce more accurate depth. Our strongest model is competitive with state-of-the-art monocular depth estimators and reduces AbsRel by 57% relative to existing joint image-depth generative models. These results provide strong evidence that image generation is a scalable pre-training objective for spatial perception. https://modality-forcing.github.io/
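The headline depth result is stated in terms of AbsRel (absolute relative error). A minimal sketch of that metric on sparse ground truth follows, assuming that pixels without a depth measurement are encoded as values less than or equal to zero and are skipped; the function name and the encoding of missing depth are illustrative choices, not taken from the paper.

```python
def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over pixels that have ground-truth depth.

    Ground-truth values <= 0 are treated as missing (assumed convention
    for sparse depth) and excluded from the average.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not valid:
        raise ValueError("no valid ground-truth depth values")
    return sum(abs(p - g) / g for p, g in valid) / len(valid)

# Hypothetical flattened depth maps in meters; the middle pixel has no measurement.
pred_depth = [1.1, 2.0, 4.0]
gt_depth = [1.0, 0.0, 5.0]
print(abs_rel(pred_depth, gt_depth))
```

Averaging only over measured pixels is what lets a metric (or a loss) be computed on sparse, real-world depth without dense annotations.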

2606.12858 2026-06-12 cs.IT cs.AI cs.CV math.IT 交叉投稿

JSCGC: Joint Source-Channel-Generation Coding for Wireless Generative Communications

JSCGC：面向无线生成式通信的联合源信道生成编码

Tong Wu, Zhiyong Chen, Guo Lu, Li Song, Feng Yang, Meixia Tao, Wenjun Zhang

发表机构 * Cooperative Medianet Innovation Center, the School of Information Science and Electronic Engineering, Shanghai Jiao Tong University（上海交通大学信息科学与电子工程学院未来媒体网络协同创新中心）

AI总结提出联合源信道生成编码（JSCGC），用生成模型替换传统解码器，将通信重构问题转化为受感知约束下的受控生成问题，通过联合训练和随机采样框架最大化互信息，在潜空间图像传输中提升特征、语义和分布质量。

Comments submitted to IEEE Journal

AI中文摘要

传统通信系统，包括基于分离的编码和基于学习的联合源信道编码（JSCC），通常是在香农率失真理论下设计的。然而，依赖通用失真度量无法捕捉复杂的人类视觉感知，常常导致模糊或不真实的复原。在本文中，我们提出联合源信道生成编码（JSCGC），一种生成式通信范式，用接收端的生成模型替换传统解码器。接收信号被视为一个条件，控制采样过程进入学习到的条件分布，将通信从用于失真最小化的确定性重构重新表述为在感知约束下用于互信息最大化的受控生成。基于这一表述，我们开发了一个统一的联合训练和高效随机采样框架，并提供了其在学习和推理阶段有效性的理论分析。在潜空间图像传输上的大量实验表明，JSCGC在不同信道条件下持续改善基于特征、语义层面和分布的质量，同时表现出一种以语义不一致而非失真为特征的独特错误行为。

英文摘要

Conventional communication systems, including both separation-based coding and learning-based joint source-channel coding (JSCC), are typically designed under Shannon's rate-distortion theory. However, relying on generic distortion metrics fails to capture complex human visual perception, often resulting in blurred or unrealistic reconstructions. In this paper, we propose Joint Source-Channel-Generation Coding (JSCGC), a generative communication paradigm that replaces the conventional decoder with a generative model at the receiver. The received signal is treated as a condition that steers the sampling process toward the learned conditional distribution, reformulating communication from deterministic reconstruction for distortion minimization to controlled generation for mutual information maximization under perceptual constraints. Based on this formulation, we develop a unified joint training and efficient stochastic sampling framework, and provide theoretical analysis of its effectiveness in both learning and inference stages. Extensive experiments on latent-space image transmission demonstrate that JSCGC consistently improves feature-based, semantic-level, and distributional quality across diverse channel conditions, while exhibiting a distinct error behavior characterized by semantic inconsistency rather than distortion.

2606.13240 2026-06-12 cs.LG cs.AI cs.CV stat.ME stat.ML 交叉投稿

ShowFlow: 从鲁棒的单概念到无条件的多概念生成

Trong-Vu Hoang, Quang-Binh Nguyen, Thanh-Toan Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science（科学大学）； Vietnam National University（越南国家大学）； Monash University（莫纳什大学）； University of Dayton（代顿大学）

AI总结提出ShowFlow框架，通过KronA-WED适配器和语义感知注意力正则化增强单概念生成，并利用SAMA和布局一致性指导实现无额外条件的多概念生成。

AI中文摘要

定制化图像生成仍然是可控图像合成中的核心挑战。对于单概念生成，同时保持身份保留和提示对齐是困难的。在多概念场景中，仅依赖提示而不使用布局框或语义掩码等额外条件，通常会导致身份丢失和概念遗漏。在本文中，我们介绍了ShowFlow，一个旨在应对这些挑战的全面框架。我们提出了用于单概念图像生成的ShowFlow-S，以及用于处理多个概念的ShowFlow-M。ShowFlow-S引入了一个KronA-WED适配器，它将Kronecker适配器与权重和嵌入分解相结合，并配合一种新颖的语义感知注意力正则化（SAR）训练目标，以增强单概念生成。在此基础上，ShowFlow-M直接重用由ShowFlow-S学习的鲁棒模型，以支持无需额外条件的多概念生成，并集成了主体自适应匹配注意力（SAMA）和布局一致性指导作为即插即用模块。大量实验和用户研究验证了ShowFlow的有效性，突显了其在广告和虚拟试穿等实际应用中的潜力。我们的源代码将在以下网址公开：https://htrvu.github.io/showflow。

英文摘要

Customizing image generation remains a core challenge in controllable image synthesis. For single-concept generation, maintaining both identity preservation and prompt alignment is challenging. In multi-concept scenarios, relying solely on a prompt, without additional conditions like layout boxes or semantic masks, often leads to identity loss and concept omission. In this paper, we introduce ShowFlow, a comprehensive framework designed to tackle these challenges. We propose ShowFlow-S for single-concept image generation, and ShowFlow-M for handling multiple concepts. ShowFlow-S introduces a KronA-WED adapter, which integrates a Kronecker adapter with weight and embedding decomposition, together with a novel Semantic-Aware Attention Regularization (SAR) training objective, to enhance single-concept generation. Building on this foundation, ShowFlow-M directly reuses robust models learned by ShowFlow-S to support multi-concept generation without extra conditions, incorporating Subject-Adaptive Matching Attention (SAMA) and layout consistency guidance as plug-and-play modules. Extensive experiments and user studies validate ShowFlow's effectiveness, highlighting its potential in real-world applications like advertising and virtual dressing. Our source code will be publicly available at: https://htrvu.github.io/showflow.

2606.06113 2026-06-12 cs.CV 版本更新

Where, What, Why, and Importance: Structured Defect Grounding for Text-to-Image Feedback

位置、类型、原因与重要性：面向文本到图像反馈的结构化缺陷定位

Huaisong Zhang, Hao Yu, Yuxuan Zhang, Jiahe Wang, Xinrui Chen, Haoxiang Cao, Feng Lu, Wendong Zhang, Changqian Yu, Chun Yuan

发表机构 * Tsinghua University（清华大学）； Kolors Team, Kuaishou Technology（快手科技Kolors团队）； University of British Columbia（不列颠哥伦比亚大学）； Vector Institute（向量研究所）； South China Normal University（华南师范大学）

AI总结提出结构化缺陷定位（SDG）方法，将文本到图像生成中的缺陷诊断建模为结构化集合预测，通过构建SDG-30K数据集和SDG-Eval评估协议，并利用视觉语言模型作为检测器，结合BoxFlow-GRPO将预测的缺陷集合转化为空间奖励以改进扩散模型对齐。

Comments 25 pages, 9 figures

AI中文摘要

尽管文本到图像（T2I）模型生成的图像越来越逼真，但它们仍然存在局部、细微且结构复杂的失败。诊断这些失败需要实例级别的反馈，回答缺陷发生的位置、类型、原因及其对整体图像质量的重要性。虽然最近的密集反馈方法超越了标量监督，但其以热图为中心的表示仍将诊断公式化为像素场回归，这使得定位可变数量的缺陷并将语义原因绑定到单个失败变得困难。为了解决这一表示瓶颈，我们提出了结构化缺陷定位（SDG），通过将每个缺陷建模为（位置、类型、原因、重要性）元组，将T2I诊断转化为结构化集合预测。为了使这一公式可训练和可测量，我们引入了SDG-30K，一个包含30K张图像的数据集，具有跨四个现代T2I生成器的框级标注，以及一个专用的评估协议SDG-Eval。基于这种结构化表示，我们进一步提出了一个诊断到对齐的框架，其中视觉语言模型（VLM）作为SDG检测器，BoxFlow-GRPO将预测的缺陷集合转化为基于框的、重要性加权的空间奖励，用于扩散模型对齐。大量实验表明，我们的SDG检测器在结构化缺陷定位上优于领先的专有VLM，而SDG引导的奖励一致地改善了T2I对齐并支持局部图像细化。这些结果确立了SDG作为诊断、评估和增强现代生成模型的统一实例级接口。

英文摘要

Despite generating increasingly photorealistic images, text-to-image (T2I) models still exhibit localized, subtle, and structurally complex failures. Diagnosing these failures requires instance-level feedback that answers where a defect occurs, what type it is, why it is defective, and its importance to overall image quality. While recent dense-feedback methods move beyond scalar supervision, their heatmap-centric representations still formulate diagnosis as pixel-field regression, making it difficult to localize variable-cardinality defects and bind semantic reasons to individual failures. To address this representation bottleneck, we propose Structured Defect Grounding (SDG), which casts T2I diagnosis as structured set prediction by modeling each defect as a (location, type, reason, importance) tuple. To make this formulation trainable and measurable, we introduce SDG-30K, a 30K-image dataset with box-grounded annotations across four modern T2I generators, together with a dedicated evaluation protocol, SDG-Eval. Building on this structured representation, we further present a diagnosis-to-alignment framework in which a Vision-Language Model (VLM) serves as the SDG detector, and BoxFlow-GRPO converts predicted defect sets into box-derived, importance-weighted spatial rewards for diffusion model alignment. Extensive experiments show that our SDG detector outperforms leading proprietary VLMs on structured defect grounding, while SDG-guided rewards consistently improve T2I alignment and support localized image refinement. These results establish SDG as a unified, instance-level interface for diagnosing, evaluating, and enhancing modern generative models.


2606.09639 2026-06-12 cs.CV 版本更新

CineDance: Towards Next-Generation Multi-Shot Long-Form Cinematic Audio-Video Generation

CineDance: 迈向下一代多镜头长片电影级音视频生成

Yuheng Chen, Teng Hu, Yuji Wang, Qingdong He, Zhucun Xue, Qianyu Zhou, Jason Li, Lizhuang Ma, Jiangning Zhang, Dacheng Tao

发表机构 * Shanghai Jiao Tong University（上海交通大学）； University of Electronic Science and Technology of China（电子科技大学）； Zhejiang University（浙江大学）； The University of Tokyo（东京大学）； Nanyang Technological University（南洋理工大学）

AI总结提出CineDance-1M大规模多镜头长片音视频数据集，通过三阶段筛选流程和CineBench评估体系，实现高质量联合生成。

详情

AI中文摘要

面向计算机辅助手术中部分到完整点云配准的点级几何感知Transformer

Siyu Zhou, Zhongliang Jiang

发表机构 * The Chair for Computer Aided Medical Procedures, Technical University of Munich（慕尼黑工业大学计算机辅助医疗程序教席）； The University of Hong Kong（香港大学）

AI总结提出GAPR-Net，一种结合卷积与Transformer的粗到细框架，通过交叉注意力融合局部与全局信息，并设计变换不变的点级几何特征，在四个骨骼数据集上实现94.2%配准召回率、1.992mm RMSE。

详情

AI中文摘要

由于重叠率变化、点密度波动以及噪声的存在，部分到完整配准仍然具有挑战性。尽管Transformer在点云处理中展现出强大潜力，但先前的方法通常将其局限于全局上下文聚合，忽略了对于精确对应至关重要的细粒度局部几何信息。我们提出GAPR-Net，一种基于学习的点云配准框架，采用粗到细架构，结合卷积和Transformer模块，通过交叉注意力机制在部分和完整点云之间融合局部和全局信息。为此，提出了一种变换不变的点级几何特征表示，能够鲁棒地捕获单个点相对于其邻域点的相对几何特征。为了评估所提方法的有效性，在四个几何上不同的骨骼（包括胫骨、股骨、骨盆和胸软骨）上进行了实验。整体配准召回率达到94.2%，该方法实现了低RMSE 1.992 mm，旋转和平移的R²值分别为0.908和0.974。结果表明，所提方法有效解决了部分到完整点云配准问题。该方法利用部分观测实现高精度3D点云配准，为计算机辅助手术中的精确手术导航和机器人干预提供了关键基础。代码将在双盲评审后公开。

英文摘要

Partial-to-full registration remains challenging due to varying overlap ratios, fluctuating point densities, and the presence of noise. While transformers have shown strong potential for point cloud processing, prior methods typically confine them to global context aggregation, overlooking fine-grained local geometry crucial for accurate correspondence. We propose GAPR-Net, a learning-based point cloud registration framework with a coarse-to-fine architecture that combines convolution and transformer modules, in which local and global information is fused between the partial and full point clouds using a cross-attention mechanism. To achieve this, a transformation-invariant point-wise geometric feature representation is proposed, which can robustly capture relative geometric features for individual points with respect to their neighboring points. To evaluate the effectiveness of the proposed approach, experiments are conducted on four geometrically distinct bones, including the tibia, femur, pelvis, and thoracic cartilage. The overall registration recall reaches 94.2%, and the method achieves a low RMSE of 1.992 mm and $R^2$ values of 0.908 and 0.974 for rotation and translation, respectively. The results demonstrate that the proposed method effectively addresses the partial-to-full point cloud registration problem. The proposed method enables highly accurate 3D point cloud registration from partial observations, providing a critical foundation for precise surgical navigation and robotic interventions in computer-assisted surgery. The code will be released after the double-blind review process.


2606.13644 2026-06-12 cs.CV 新提交

Surflo: Consistent 3D Surface Flow Model with Global State

Surflo：具有全局状态的一致3D表面流模型

Antoine Guédon, Shu Nakamura, Nicolas Dufour, Jiahui Lei, Ko Nishino, Angjoo Kanazawa

发表机构 * LIX, École polytechnique（LIX，巴黎综合理工学院）； Kyoto University（京都大学）； Kyutai ； UC Berkeley（加州大学伯克利分校）

AI总结提出Surflo模型，通过将可变数量的无位姿RGB视图压缩为全局潜变量，并利用流匹配从噪声中独立传输3D表面点，实现任意分辨率的一致表面重建，推理时通过光度梯度引导消除局部不一致性。

Comments Project webpage: https://anttwo.github.io/surflo/

详情

AI中文摘要

几何形状对视角具有不变性，这使得任何图像集合都是单个3D状态的冗余编码。现有的前馈重建模型未能充分利用这一点：逐视角方法会生成重叠且未对齐的点云，其数量随输入数量线性增长；而全局潜在方法则局限于固定的低分辨率输出。我们提出Surflo，它将可变数量的无位姿RGB视图压缩为K个潜在令牌（一个全局状态），并通过流匹配将带方向的3D表面点从噪声独立传输到表面上进行解码。这使得输出不受任何固定网格或令牌预算的限制：相同的潜在变量在单次前向传播中即可生成从几千到一百万个点。为了抑制独立逐点解码固有的局部不一致性，我们在ODE积分过程中注入光度梯度，通过推理时的引导项关联邻近点。Surflo在表面指标上匹配或超越前馈基线，运行速度比需要数百个视图的基于优化的方法快一个数量级，并且是唯一结合全局潜在变量与任意分辨率解码的前馈方法。

英文摘要

Geometry is invariant to viewpoint, which makes any collection of images a redundant encoding of a single 3D state. Existing feed-forward reconstruction models fail to exploit this: per-view methods emit overlapping, unaligned pointmaps that grow linearly with input count, while global-latent methods commit to a fixed, low-resolution output. We introduce Surflo, which compresses a variable number of unposed RGB views into K latent tokens-one global state-and decodes oriented 3D surface points by independently transporting them from noise onto the surface via flow matching. This frees the output from any fixed grid or token budget: the same latent yields from a few thousand to a million points in a single forward pass. To suppress the local inconsistencies inherent to independent per-point decoding, an inference-time guidance term correlates nearby points by injecting a photometric gradient during ODE integration. Surflo matches or surpasses feed-forward baselines on surface metrics, runs an order of magnitude faster than optimization-based methods that require hundreds of views, and is the only feed-forward approach to combine a global latent with arbitrary-resolution decoding.


2606.13652 2026-06-12 cs.CV cs.GR 新提交

World Tracing: Generative Pixel-Aligned Geometry Beyond the Visible

世界追踪：超越可见表面的生成式像素对齐几何

Hao Zhang, Mohamed El Banani, Jen-Hao Cheng, Paul Zhang, Yi Hua, Ben Mildenhall, Christoph Lassner, Narendra Ahuja, Gengshan Yang

发表机构 * World Labs ； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结提出世界追踪（World Tracing），一种生成式像素对齐几何表示，通过扩散Transformer预测有序点栈，同时重建可见表面和生成遮挡几何，在多个基准上超越深度预测和图像到3D方法。

Comments World Labs Technical Report; Page: https://haoz19.github.io/world-tracing-page/

详情

AI中文摘要

图像到3D方法常常在忠实度和完整性之间权衡：深度估计器锚定于输入像素但止于可见表面，而图像到3D模型生成完整形状却往往与输入不对齐。我们引入世界追踪（World Tracing），一种生成式像素对齐几何表示，它预测与观测像素对齐的3D点，同时补全可见表面之外的几何。对于每个输入像素，世界追踪预测一个有序的相机空间3D点栈，其中第一层表示可见表面，后续层表示与遮挡表面的从前到后交点。我们通过一个世界追踪扩散Transformer（WT-DiT）实例化该表示，该模型将多个几何层视为独立的去噪令牌，并通过分解注意力和全局注意力将它们耦合。WT-DiT使用像素空间流匹配和混合噪声调度进行训练，平衡可见表面重建与遮挡几何生成。世界追踪在物体、场景和动态基准上，在可见表面重建和完整几何生成方面均取得了强劲性能，超越了深度预测器和图像到3D生成器。它还保留了2D到3D对应关系，实现了文本驱动的3D场景编辑、几何条件的新视角视频合成，以及与纹理网格生成器的无训练集成。

英文摘要

Image-to-3D methods often trade off faithfulness and completeness: depth estimators are anchored to input pixels but stop at the visible surface, while image-to-3D models generate complete shapes that are often misaligned with the input. We introduce World Tracing, a generative pixel-aligned geometry representation that predicts 3D points aligned with observed pixels while completing geometry beyond the visible surface. For each input pixel, World Tracing predicts an ordered stack of camera-space 3D points, where the first layer represents the visible surface and subsequent layers represent front-to-back intersections with occluded surfaces. We instantiate this representation with a world-tracing diffusion transformer, WT-DiT, which treats multiple geometry layers as separate denoising tokens coupled through factorized and global attention. WT-DiT is trained with pixel-space flow matching and a mixed noise schedule that balances visible-surface reconstruction with occluded-geometry generation. World Tracing achieves strong performance on visible-surface reconstruction and complete geometry generation across object, scene, and dynamic benchmarks, outperforming both depth predictors and image-to-3D generators. It also preserves 2D-to-3D correspondence, enabling text-driven 3D scene editing, geometry-conditioned novel-view video synthesis, and training-free integration with textured-mesh generators.


2503.17182 2026-06-12 cs.CV 版本更新

Radar-Guided Polynomial Fitting for Metric Depth Estimation

雷达引导的多项式拟合用于度量深度估计

Patrick Rim, Hyoungseob Park, Vadim Ezhov, Jeffrey Moon, Alex Wong

发表机构 * Yale University（耶鲁大学）； University of Pennsylvania（宾夕法尼亚大学）

AI总结提出POLAR方法，利用雷达数据预测多项式系数，对单目深度估计的无尺度深度进行非均匀校正，实现度量深度估计，性能在三个数据集上平均提升24.9% MAE和33.2% RMSE。

Comments CVPR 2026

详情

AI中文摘要

我们提出POLAR，一种新颖的雷达引导深度估计方法，引入多项式拟合以高效地将预训练单目深度估计（MDE）模型的无尺度深度预测转换为度量深度图。与依赖复杂架构或昂贵传感器的现有方法不同，我们的方法基于一个基本洞察：尽管MDE模型通常能在每个物体或局部区域内推断合理的局部深度结构，但它们可能使这些区域相互错位，使得在三个或更多区域的情况下线性尺度和偏移（仿射）变换不足。为解决这一限制，我们使用从廉价、普遍存在的雷达数据预测的多项式系数，在深度范围内非均匀地自适应调整预测。通过这种方式，POLAR超越了仿射变换，并能够通过引入拐点来纠正此类错位。重要的是，我们的多项式拟合框架通过一种新颖的训练目标保持结构一致性，该目标通过一阶导数正则化强制局部单调性。POLAR在三个数据集上实现了最先进的性能，在MAE和RMSE上平均优于现有方法24.9%和33.2%，同时在延迟和计算成本方面也实现了最先进的效率。

英文摘要

We propose POLAR, a novel radar-guided depth estimation method that introduces polynomial fitting to efficiently transform scaleless depth predictions from pretrained monocular depth estimation (MDE) models into metric depth maps. Unlike existing approaches that rely on complex architectures or expensive sensors, our method is grounded in a fundamental insight: although MDE models often infer reasonable local depth structure within each object or local region, they may misalign these regions relative to one another, making a linear scale and shift (affine) transformation insufficient given three or more of these regions. To address this limitation, we use polynomial coefficients predicted from cheap, ubiquitous radar data to adaptively adjust predictions non-uniformly across depth ranges. In this way, POLAR generalizes beyond affine transformations and is able to correct such misalignments by introducing inflection points. Importantly, our polynomial fitting framework preserves structural consistency through a novel training objective that enforces local monotonicity via first-derivative regularization. POLAR achieves state-of-the-art performance across three datasets, outperforming existing methods by an average of 24.9% in MAE and 33.2% in RMSE, while also achieving state-of-the-art efficiency in terms of latency and computational cost.
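To make the contrast with an affine fit concrete, the following Python sketch applies a polynomial to scaleless depth values and computes a hinge penalty on negative first derivatives. It is a minimal sketch only: the function names, the hinge form of the penalty, and the sample points are assumptions, and in POLAR the coefficients are predicted from radar data rather than fixed by hand.

```python
def poly_eval(coeffs, z):
    """Evaluate c0 + c1*z + c2*z**2 + ... using Horner's rule."""
    result = 0.0
    for c in reversed(coeffs):
        result = result * z + c
    return result


def poly_derivative(coeffs):
    """Coefficients of the first derivative of the polynomial."""
    return [k * c for k, c in enumerate(coeffs)][1:]


def monotonicity_penalty(coeffs, samples):
    """Mean hinge penalty on negative slope at the sampled depths (assumed form)."""
    deriv = poly_derivative(coeffs)
    return sum(max(0.0, -poly_eval(deriv, z)) for z in samples) / len(samples)


# An affine map (scale 2.0, shift 0.5) is the degree-1 special case.
affine = [0.5, 2.0]
metric_depth = poly_eval(affine, 1.0)

# z - z**2 decreases for z > 0.5, so it is penalized at z = 1.0.
non_monotone = [0.0, 1.0, -1.0]
penalty = monotonicity_penalty(non_monotone, [0.0, 1.0])
```

A degree-1 polynomial can only rescale and shift all regions together, while higher-degree terms let the correction vary across depth ranges; the penalty discourages corrections that would reverse the depth ordering.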


2602.22629 2026-06-12 cs.CV 版本更新

CRAG: Can 3D Generative Models Help 3D Assembly?

CRAG: 3D生成模型能否辅助3D装配？

Zeyu Jiang, Sihang Li, Siqi Tan, Chenyang Xu, Juexiao Zhang, Julia Galway-Witham, Xue Wang, Scott A. Williams, Radu Iovita, Chen Feng, Jing Zhang

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出CRAG方法，将3D装配与形状生成联合优化，通过生成完整形状和预测部件姿态实现相互增强，在多种几何、部件数和缺失情况下达到最优性能。

Comments 15 pages, 8 figures

2603.23502 2026-06-12 cs.CV 版本更新

OccAny: Generalized Unconstrained Urban 3D Occupancy

OccAny: 广义无约束城市3D占据预测

Anh-Quan Cao, Tuan-Hung Vu

AI总结提出首个广义无约束城市3D占据模型OccAny，通过分割强制和新视图渲染技术，在无标定场景下实现度量占据预测与分割特征完成，跨域泛化优于视觉几何基线。

Comments Accepted to CVPR 2026. Project page: https://valeoai.github.io/OccAny/

详情

AI中文摘要

依赖于域内标注和精确传感器先验，现有的3D占据预测方法在可扩展性和域外泛化方面均受限。虽然最近的视觉几何基础模型展现出强大的泛化能力，但它们主要针对通用目的设计，缺乏城市占据预测所需的一个或多个关键要素，即度量预测、杂乱场景中的几何完成以及城市场景的适应性。我们解决了这一差距，并提出了OccAny，这是第一个无约束城市3D占据模型，能够在域外无标定场景上运行，预测并完成与分割特征耦合的度量占据。OccAny具有通用性，可以从序列、单目或环视图像预测占据。我们的贡献有三方面：(i) 提出了第一个广义3D占据框架，(ii) 提出了分割强制（Segmentation Forcing）方法，在提高占据质量的同时实现掩码级预测，以及(iii) 提出了一种新视图渲染管线，用于推断新视图几何以实现测试时视图增强，从而完成几何。大量实验表明，OccAny在3D占据预测任务上优于所有视觉几何基线，同时在两个已建立的城市占据预测数据集上的三种输入设置下，与域内自监督方法保持竞争力。我们的代码可在以下网址获取：https://github.com/valeoai/OccAny 。

英文摘要

Relying on in-domain annotations and precise sensor-rig priors, existing 3D occupancy prediction methods are limited in both scalability and out-of-domain generalization. While recent visual geometry foundation models exhibit strong generalization capabilities, they were mainly designed for general purposes and lack one or more key ingredients required for urban occupancy prediction, namely metric prediction, geometry completion in cluttered scenes and adaptation to urban scenarios. We address this gap and present OccAny, the first unconstrained urban 3D occupancy model capable of operating on out-of-domain uncalibrated scenes to predict and complete metric occupancy coupled with segmentation features. OccAny is versatile and can predict occupancy from sequential, monocular, or surround-view images. Our contributions are three-fold: (i) we propose the first generalized 3D occupancy framework with (ii) Segmentation Forcing that improves occupancy quality while enabling mask-level prediction, and (iii) a Novel View Rendering pipeline that infers novel-view geometry to enable test-time view augmentation for geometry completion. Extensive experiments demonstrate that OccAny outperforms all visual geometry baselines on 3D occupancy prediction task, while remaining competitive with in-domain self-supervised methods across three input settings on two established urban occupancy prediction datasets. Our code is available at https://github.com/valeoai/OccAny .


2605.31419 2026-06-12 cs.CV cs.RO 版本更新

Triangle Splatting SLAM

三角形泼溅SLAM

Nicholas Fry, Eric Dexheimer, Kirill Mazur, Paul H. J. Kelly, Andrew J. Davison

发表机构 * Software Performance Optimisation Group（软件性能优化组）； Department of Computing（计算部门）

AI总结提出首个使用可微三角形作为3D地图表示的密集RGB-D SLAM系统，通过在线可微渲染实现跟踪与建图，并支持实时网格转换与编辑。

Comments 26 pages, 11 figures

详情

AI中文摘要

我们提出了一种密集RGB-D SLAM系统，使用可微三角形作为3D地图表示。虽然3D高斯泼溅已成为新颖视角合成的主要方法，但三角形仍然是传统渲染硬件、游戏引擎以及需要显式几何的下游任务（如模拟、碰撞和编辑）的标准图元。最近的离线方法表明，通过在一组带姿态的图像上进行Delaunay三角剖分，可以将非结构化的“三角形汤”优化为照片级逼真的网格。基于这一见解，我们提出了第一个密集SLAM系统，通过在线可微渲染三角形汤来执行跟踪和建图。地图可以通过受限Delaunay三角剖分实时转换为连通网格，从而实现网格变形和碰撞检测等新的在线功能。在Replica和TUM-RGBD数据集上，我们的系统在3D几何方面优于基线，匹配相机跟踪精度，并支持基于网格的在线场景编辑。

英文摘要

We present a dense RGB-D SLAM system using differentiable triangles as the 3D map representation. While 3D Gaussian Splatting has emerged as the leading method for novel-view synthesis, triangles remain the standard primitive for traditional rendering hardware, game engines, and downstream tasks requiring explicit geometry such as simulation, collision, and editing. Recent offline methods have demonstrated that an unstructured 'triangle soup' can be optimised into a photorealistic mesh via Delaunay triangulation across a set of posed images. Building upon this insight, we present the first dense SLAM system to employ Triangle Splatting to perform both tracking and mapping through online differentiable rendering of a triangle soup. The map can be converted into a connected mesh on-the-fly via restricted Delaunay triangulation, enabling new online capabilities such as mesh deformation and collision checking. On Replica and TUM-RGBD, our system outperforms baselines on 3D geometry, matches the camera-tracking accuracy, and enables online mesh-based scene editing.


2606.07436 2026-06-12 cs.CV 版本更新

Skill-3D: Evolving Scene-Aware Skills for Agentic 3D Spatial Reasoning

Skill-3D：面向智能体3D空间推理的场景感知技能进化

Haoyuan Li, Zhengdong Hu, Jun Wang, Hehe Fan, Yi Yang

发表机构 * Zhejiang University（浙江大学）； University of Technology Sydney（悉尼科技大学）； OPPO Research Institute（OPPO研究院）

AI总结提出Skill-3D框架，通过场景记忆和技能库的协同进化，使智能体根据场景自适应选择工具，显著提升3D空间推理中工具使用的正确性和充分性。

详情

AI中文摘要

本文探索智能体3D空间理解，即MLLM智能体通过工具使用进行3D推理。现有方法在3D场景下常误用工具并表现出有偏的工具偏好，使得智能体范式相比非智能体策略仅有边际提升。我们揭示3D空间推理任务在不同场景下具有异质性，而这些智能体对所有场景采用统一的工具使用策略，而非根据具体场景和任务选择工具。为解决此问题，我们提出Skill-3D，一种学习自进化场景感知技能的框架。具体而言，Skill-3D识别任务场景并将智能体的工具使用轨迹记录到场景记忆中，其中来自相似场景的成功轨迹被聚合和蒸馏成可复用的场景感知技能，失败的轨迹作为教训附加到该技能上。在训练过程中，一旦相似场景再次出现，注入相应技能以引导智能体，产生新轨迹，其成功和失败进一步优化技能，形成记忆和技能库共同进化的循环。实验表明，Skill-3D显著提升了3D空间推理中的工具利用率（在VSI-Bench上从39%提升至78%），推动智能体正确且充分地使用工具。例如，在MMSI-Bench上，它将Gemini-3-Flash提升了67%。此外，我们在技能引导的轨迹上进行智能体后训练，使Qwen3-VL-8B在VSI-Bench上提升了43%。

英文摘要

This paper explores agentic 3D spatial understanding, i.e., MLLM agents performing 3D reasoning through tool use. Existing methods often misuse tools and exhibit biased tool preferences under 3D scenarios, leaving the agentic paradigm with only marginal gains over non-agentic strategies. We reveal that 3D spatial reasoning tasks are heterogeneous across scenes, while these agents apply a uniform tool-use strategy to all scenes rather than selecting tools according to the specific scene and task. To address this, we propose Skill-3D, a framework that learns self-evolving scene-aware skills. Specifically, Skill-3D identifies the task scene and records the agent's tool-use trajectory into a Scene Memory, where successful trajectories from similar scenes are aggregated and distilled into a reusable scene-aware skill, with failed ones attached to the skill as lessons. During training, once a similar scene recurs, the corresponding skill is injected to guide the agent, producing new trajectories whose successes and failures further refine the skill, forming a loop in which the memory and the skill library co-evolve. Experiments show that Skill-3D substantially improves tool utilization in 3D spatial reasoning (from 39% to 78% on VSI-Bench), driving the agent toward correct and sufficient tool use. For instance, it improves Gemini-3-Flash by 67% on MMSI-Bench. Furthermore, we conduct agentic post-training over skill-guided trajectories, which boosts Qwen3-VL-8B by 60% on VSI-Bench.


2606.11894 2026-06-12 cs.CV 版本更新

Wild3R: Feed-Forward 3D Gaussian Splatting from Unconstrained Sparse Photo Collection

Wild3R: 从无约束稀疏照片集合进行前馈式3D高斯泼溅

Yuto Furutani, Takashi Otonari, Kaede Shiohara, Toshihiko Yamasaki

发表机构 * The University of Tokyo（东京大学）

AI总结提出Wild3R，一种针对无约束稀疏照片集合的前馈式3D高斯泼溅方法，通过引入包含多样光照和瞬态物体的WildCity数据集，学习跨视角外观一致性并移除瞬态内容，性能优于现有前馈方法，与基于逐场景优化的方法相当。

Comments Project page: https://furuschool.github.io/wild3r-page/

详情

AI中文摘要

前馈式3D高斯泼溅（3DGS）消除了传统3DGS所需的耗时逐场景优化。然而，现有的前馈方法难以处理包含多样光照条件和瞬态物体的真实世界照片集合。在本文中，我们提出了Wild3R，一种针对无约束稀疏照片集合的前馈方法。主要瓶颈在于缺乏提供多视角、多种光照和瞬态变化的训练数据，而这些是学习鲁棒场景表示所必需的。为解决这一问题，我们引入了WildCity数据集，该数据集包含200个场景、170种光照条件和瞬态物体，总计337,500张图像。通过利用该数据集，我们的模型在参考视图条件下学习跨视角的外观一致性，同时移除瞬态内容。大量实验表明，我们的方法优于现有的前馈方法，并取得了与先前基于逐场景优化的方法相竞争的结果。

英文摘要

Feed-forward 3D Gaussian Splatting (3DGS) removes the need for time-consuming per-scene optimization required by traditional 3DGS. However, existing feed-forward approaches struggle with real-world photo collections that include diverse lighting conditions and transient objects. In this paper, we present Wild3R, a feed-forward approach for unconstrained sparse photo collections. The main bottleneck is the lack of training data that provides multiple viewpoints, a variety of illuminations, and transient variations necessary for learning robust scene representations. To address this, we introduce the WildCity dataset, which comprises 200 scenes, 170 lighting conditions, and transient objects, resulting in 337,500 images in total. By leveraging the dataset, our model learns appearance consistency across viewpoints conditioned on reference views, while removing transient content. Extensive experiments demonstrate that our method outperforms existing feed-forward approaches and achieves results competitive with prior per-scene optimization-based methods.


2606.12368 2026-06-12 cs.CV 版本更新

DepthMaster: Unified Monocular Depth Estimation for Perspective and Panoramic Images

DepthMaster: 统一透视与全景图像的单目深度估计

Pengfei Wang, Shihao Wang, Liyi Chen, Zhiyuan Ma, Guowen Zhang, Lei Zhang

AI总结提出DepthMaster统一框架，通过将全景图分解为重叠透视块并引入对应一致性损失和虚拟投影相机几何先验，解决透视与全景深度估计的几何差异和数据稀缺问题，在13个数据集上实现零样本最优性能。

详情

AI中文摘要

虽然单目深度估计取得了显著进展，但对于窄视场（FoV）透视图像和$360^\circ$全景图像实现通用的度量深度估计仍然是一个未解决的挑战。现有方法通常针对特定相机类型设计，难以在多样化场景中生成准确的度量深度。这一限制源于两个关键挑战：透视相机与全景相机之间的固有几何差异，以及带有度量标注的全景训练数据的稀缺性。在这项工作中，我们引入了DepthMaster，一个统一的度量深度估计框架。我们不采用专门网络来学习球形畸变，而是通过将全景图像分解为重叠的透视块来重新表述问题。关键的是，与先前依赖临时架构修改来处理边界的基于投影的方法不同，我们引入了一种新颖的对应一致性损失（CCL），并注入虚拟投影相机作为几何先验，从而能够无缝拼接这些块，同时避免专用算子并保持主干与标准Transformer设计高度兼容。该策略通过将所有输入统一为规范透视表示来解决几何差异，并通过直接从大量透视数据集中解锁强大的度量先验来有效规避数据稀缺问题。在仅包含一个全景数据集的混合数据集上训练后，DepthMaster在13个多样化数据集上实现了最先进的零样本性能，不仅在透视和全景领域超越了通用方法，还领先于领先的专家模型。

英文摘要

While monocular depth estimation has achieved significant progress, achieving generalized metric depth estimation for both narrow field-of-view (FoV) perspectives and $360^\circ$ panoramas remains an unsolved challenge. Existing methods are often tailored to specific camera types and struggle to produce accurate metric depth that generalizes across diverse settings. This limitation stems from two key challenges: the inherent geometric discrepancy between perspective and panoramic cameras, and the scarcity of panoramic training data with metric annotations. In this work, we introduce DepthMaster, a unified metric depth estimation framework. Rather than employing specialized networks to learn spherical distortions, we reformulate the problem by decomposing panoramic images into overlapping perspective patches. Crucially, distinct from prior projection-based methods that rely on ad-hoc architectural modifications to handle boundaries, we introduce a novel Correspondence Consistency Loss (CCL) and inject virtual projection cameras as geometric priors, allowing us to seamlessly stitch the patches while avoiding specialized operators and keeping the backbone largely compatible with standard Transformer designs. This strategy also resolves the geometric differences by unifying all inputs into a canonical perspective representation, and effectively circumvents data scarcity by directly unlocking powerful metric priors from vast perspective datasets. Trained on a mixed dataset that contains only one panorama dataset, DepthMaster achieves state-of-the-art zero-shot performance on 13 diverse datasets, outperforming not only universal methods but also leading specialist models in both perspective and panoramic domains.


2511.23030 2026-06-12 cs.RO cs.CV 版本更新

DiskChunGS: Large-Scale 3D Gaussian SLAM Through Chunk-Based Memory Management

DiskChunGS：基于分块内存管理的大规模3D高斯SLAM

Casimir Feldmann, Maximum Wilder-Smith, Vaishakh Patil, Michael Oechsle, Michael Niemeyer, Keisuke Tateno, Marco Hutter

发表机构 * Robotic Systems Lab, ETH Zurich（机器人系统实验室，瑞士苏黎世联邦理工学院）； Google（谷歌）

AI总结提出DiskChunGS，通过将场景划分为空间块并将非活跃区域存储于磁盘，突破GPU内存限制，实现大规模3D高斯SLAM，在多个数据集上完成全序列重建并提升视觉质量。

详情

DOI: 10.1109/LRA.2026.3668704
Journal ref: IEEE Robotics and Automation Letters, vol. 11, no. 4, 2026

AI中文摘要

近期3D高斯溅射（3DGS）的进展在实时渲染的新视角合成中展现了令人印象深刻的结果。然而，将3DGS与SLAM系统集成面临根本的可扩展性限制：方法受限于GPU内存容量，只能重建小规模环境。我们提出DiskChunGS，一种可扩展的3DGS SLAM系统，通过一种核外（out-of-core）方法克服这一瓶颈，该方法将场景划分为空间块，并在GPU内存中仅维护活跃区域，同时将非活跃区域存储在磁盘上。我们的架构与现有的用于位姿估计和闭环检测的SLAM框架无缝集成，实现大规模全局一致的重建。我们在室内场景（Replica、TUM-RGBD）、城市驾驶场景（KITTI）以及资源受限的Nvidia Jetson平台上验证了DiskChunGS。我们的方法独特地完成了所有11个KITTI序列，没有出现内存故障，同时实现了卓越的视觉质量，证明了算法创新可以克服先前限制3DGS SLAM方法的内存约束。

英文摘要

Recent advances in 3D Gaussian Splatting (3DGS) have demonstrated impressive results for novel view synthesis with real-time rendering capabilities. However, integrating 3DGS with SLAM systems faces a fundamental scalability limitation: methods are constrained by GPU memory capacity, restricting reconstruction to small-scale environments. We present DiskChunGS, a scalable 3DGS SLAM system that overcomes this bottleneck through an out-of-core approach that partitions scenes into spatial chunks and maintains only active regions in GPU memory while storing inactive areas on disk. Our architecture integrates seamlessly with existing SLAM frameworks for pose estimation and loop closure, enabling globally consistent reconstruction at scale. We validate DiskChunGS on indoor scenes (Replica, TUM-RGBD), urban driving scenarios (KITTI), and resource-constrained Nvidia Jetson platforms. Our method uniquely completes all 11 KITTI sequences without memory failures while achieving superior visual quality, demonstrating that algorithmic innovation can overcome the memory constraints that have limited previous 3DGS SLAM methods.
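The out-of-core idea above can be sketched in a few lines of Python: positions map to integer chunk keys, a neighborhood around the camera defines the active set, and a set difference decides what to load and what to evict. All names here (`chunk_key`, `active_chunks`, `plan_swap`) and the cubic neighborhood rule are hypothetical; the abstract does not describe DiskChunGS's actual chunk selection policy.

```python
import math


def chunk_key(x, y, z, chunk_size):
    """Integer key of the spatial chunk that contains the point (x, y, z)."""
    return (
        math.floor(x / chunk_size),
        math.floor(y / chunk_size),
        math.floor(z / chunk_size),
    )


def active_chunks(camera_pos, chunk_size, radius):
    """Keys of all chunks within `radius` chunks of the camera (cubic neighborhood)."""
    cx, cy, cz = chunk_key(*camera_pos, chunk_size)
    span = range(-radius, radius + 1)
    return {(cx + dx, cy + dy, cz + dz) for dx in span for dy in span for dz in span}


def plan_swap(resident, wanted):
    """Return (chunks to load from disk, chunks to evict from GPU memory)."""
    return wanted - resident, resident - wanted


wanted = active_chunks((5.0, 5.0, 5.0), chunk_size=10.0, radius=1)
resident = {(0, 0, 0), (5, 5, 5)}
to_load, to_evict = plan_swap(resident, wanted)
```

With a fixed neighborhood radius, the number of resident chunks stays bounded no matter how far the trajectory extends, which is what decouples map size from GPU memory in this kind of design.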


2603.05965 2026-06-12 cs.RO cs.CV 版本更新

PROBE: Probabilistic Occupancy BEV Encoding with Analytical Translation Robustness for 3D Place Recognition

PROBE: 具有解析平移鲁棒性的概率占用BEV编码用于3D地点识别

Jinseop Lee, Byoungho Lee, Gichul Yoo

发表机构 * SK Intellix

AI总结提出无学习的LiDAR地点描述符PROBE，通过极坐标雅可比解析边缘化连续平移，实现距离自适应角度不确定性，在跨传感器泛化中取得高精度。

Comments 8 pages, 8 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L). © 2026 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses

详情

DOI: 10.1109/LRA.2026.3703245

AI中文摘要

我们提出PROBE（概率占用BEV编码），一种无学习的LiDAR地点识别描述符，将每个BEV单元的占用建模为伯努利随机变量。PROBE不依赖于离散点云扰动，而是通过极坐标雅可比解析边缘化连续笛卡尔平移，在O(R·S)时间内得到距离自适应角度不确定性σ_θ = σ_t / r。主要参数σ_t表示以米为单位的预期平移不确定性，这是一种与传感器无关的物理量，增强了跨传感器泛化能力，同时减少了对每个数据集大量调参的需求。成对相似性结合了伯努利-KL Jaccard与指数不确定性门控以及基于FFT的高度余弦相似性用于旋转对齐。在涵盖四种不同LiDAR类型的四个数据集上评估，PROBE在多会话评估中实现了手工描述符中最高的精度，并且在单会话性能上与手工和监督基线相比具有竞争力。源代码和补充材料可在 https://sites.google.com/view/probe-pr 获取。

英文摘要

We present PROBE (PRobabilistic Occupancy BEV Encoding), a learning-free LiDAR place recognition descriptor that models each BEV cell's occupancy as a Bernoulli random variable. Rather than relying on discrete point-cloud perturbations, PROBE analytically marginalizes over continuous Cartesian translations via the polar Jacobian, yielding a distance-adaptive angular uncertainty $\sigma_\theta = \sigma_t / r$ in $\mathcal{O}(R{\cdot}S)$ time. The primary parameter $\sigma_t$ represents the expected translational uncertainty in meters, a sensor-independent physical quantity that enhances cross-sensor generalization while reducing the need for extensive per-dataset tuning. Pairwise similarity combines a Bernoulli-KL Jaccard with exponential uncertainty gating and FFT-based height cosine similarity for rotation alignment. Evaluated on four datasets spanning four diverse LiDAR types, PROBE achieves the highest accuracy among handcrafted descriptors in multi-session evaluation and competitive single-session performance relative to both handcrafted and supervised baselines. The source code and supplementary materials are available at https://sites.google.com/view/probe-pr.
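The relation between translational and angular uncertainty quoted above can be evaluated per range ring with a short Python sketch. The helper names and the choice of equal-width rings evaluated at their center radius are assumptions for illustration; only the formula, angular uncertainty equals translational uncertainty divided by range, comes from the abstract.

```python
def angular_sigma(sigma_t, r):
    """Angular std. dev. (radians) for translational std. dev. sigma_t (m) at range r (m)."""
    if r <= 0:
        raise ValueError("range must be positive")
    return sigma_t / r


def ring_sigmas(sigma_t, r_max, num_rings):
    """Evaluate the angular std. dev. at the center radius of each equal-width ring."""
    width = r_max / num_rings
    return [angular_sigma(sigma_t, (i + 0.5) * width) for i in range(num_rings)]


near_far = ring_sigmas(sigma_t=1.0, r_max=4.0, num_rings=2)  # ring centers at 1 m and 3 m
```

The same translational offset subtends a larger angle close to the sensor than far from it, so inner rings receive more angular smoothing than outer rings.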


2606.12635 2026-06-12 cs.CV 新提交

CD-RCM: Generalizable Continuous-Depth Novel View Synthesis for Reflectance Confocal Microscopy

CD-RCM：面向反射共聚焦显微镜的泛化连续深度新视角合成

Tooba Imtiaz, Milind Rajadhyaksha, Kivanc Kose, Jennifer Dy

发表机构 * Northeastern University（东北大学）； Memorial Sloan Kettering Cancer Center（纪念斯隆凯特琳癌症中心）

AI总结针对反射共聚焦显微镜各向异性3D体积，提出首个RCM专用新视角合成方法CD-RCM，通过前馈模型从稀疏z-stack预测连续深度切片，实现亚秒级高保真合成。

详情

AI中文摘要

反射共聚焦显微镜（RCM）通过获取连续深度处的正面图像，形成稀疏z-stack，从而提供人体皮肤体内（in vivo）的无创、细胞分辨率“光学活检”。由于光学限制，这些堆栈是各向异性的3D体积，横向分辨率（0.5 $\mu$m）比轴向分辨率（由光学切片定义，3 $\mu$m）高约6倍，限制了组织解释。我们的目标是通过插值中间切片并使3D体积各向同性，提供连续深度可视化。这种表示允许任意方向切片，包括类似组织病理学的横截面检查，无需针对每位患者进行优化。为此，我们引入了首个RCM特定的新视角合成（NVS）方法CD-RCM，这是一种前馈模型，可从稀疏采样的RCM堆栈预测逼真的、未见过的深度。经典神经渲染方法侧重于从表面级多视角观测进行重建。与表面级相机视图不同，RCM可以获取组织表面以下至200 $\mu$m的光学切片正面图像。然而，在可视化RCM堆栈时，较浅切片（朝向表面）的观测会遮挡较深切片。这种独特的轴向成像几何和层依赖性解剖结构促使我们开发了定制的架构和训练框架，明确考虑了RCM的深度分辨、遮挡成像物理特性。实验表明，CD-RCM实现了高保真新视角合成，推理时间低于一秒。

英文摘要

Reflectance confocal microscopy (RCM) provides noninvasive, cellular-resolution "optical biopsies" of human skin in vivo by acquiring en-face images at successive depths, forming a sparse z-stack. Due to optical limitations, these stacks are anisotropic 3D volumes whose lateral resolution (0.5 $\mu$m) is roughly 6 times finer than the axial resolution (3 $\mu$m, defined by the optical sectioning), which limits the interpretation of tissue. Our goal is to provide continuous-depth visualization by interpolating intermediate sections and making the 3D volume isotropic. Such a representation permits arbitrary-direction sectioning, including histopathology-like cross-sectional examination, without requiring per-patient optimization. To that end, we introduce the first RCM-specific novel-view synthesis (NVS) approach, CD-RCM, a feedforward model that predicts realistic, unseen depths from sparsely sampled RCM stacks. Classical neural rendering methods focus on reconstruction from surface-level multi-view observations. In contrast to surface-level camera views, RCM can acquire optically sectioned en-face images of tissue beyond the surface up to 200 $\mu$m. However, during visualization of the RCM stacks, observations of the shallower sections (towards the surface) obscure the deeper ones. This unique axial imaging geometry and layer-dependent anatomical organization motivated our development of a tailored architectural and training framework that explicitly accounts for RCM's depth-resolved, occlusive imaging physics. Experiments demonstrate that CD-RCM achieves high-fidelity novel-view synthesis with sub-second inference time.


2606.13032 2026-06-12 cs.CV 新提交

GeoCFNet: Geometry-Aware Confidence Field Network for Robot-Assisted Endoscopic Submucosal Dissection

GeoCFNet: 几何感知置信场网络用于机器人辅助内镜黏膜下剥离术

Rui Tang, Guankun Wang, Long Bai, Haochen Yin, Huxin Gao, Jiewen Lai, Jiazheng Wang, Hongliang Ren

发表机构 * Department of Electronic Engineering, The Chinese University of Hong Kong（香港中文大学电子工程系）； Theory Lab, Central Research Institute, 2012 Labs, Huawei Technologies Co. Ltd.（华为技术有限公司中央研究院2012实验室理论实验室）

AI总结提出GeoCFNet，通过几何感知置信场估计解决动态内镜场景下的解剖引导问题，集成Token差异化融合和几何感知空间正则化，实现精确稳定的置信场预测。

Comments IEEE ICIA 2026

详情

AI中文摘要

先进的手术机器人技术使机器人辅助内镜黏膜下剥离术（ESD）成为整块切除大病变的有前景方法，具有降低复发率和改善长期预后的潜力。然而，ESD的技术复杂性和并发症风险需要稳定精确的视觉引导，以维持准确的解剖通道和安全组织边界。密集置信场通过描述优选解剖区域及其向周围组织的空间过渡，为此提供了有效表示。然而，在动态内镜场景中，由于烟雾、镜面高光、组织变形、弱纹理以及目标区域的薄几何结构，可靠的置信场估计仍然具有挑战性。为解决这些问题，我们将解剖引导表述为几何感知置信场估计问题，并提出GeoCFNet，一种基于预训练DINOv3骨干网络的几何感知置信场网络。GeoCFNet集成了Token差异化融合模块以聚合类别令牌上下文与密集补丁表示、用于置信回归的SegFormer解码器，以及几何感知空间正则化（GASR）以保持空间一致性和局部几何过渡。实验结果表明，GeoCFNet实现了RMSE 0.0480、PSNR 27.1995、SSIM 0.3397和CC 0.2466，表明其能够为机器人辅助ESD引导提供精确且几何稳定的置信场估计。

英文摘要

Advanced surgical robotics has made robot-assisted endoscopic submucosal dissection (ESD) a promising approach for the en-bloc resection of large lesions, with the potential to reduce recurrence and improve long-term outcomes. However, the technical complexity and risk of complications in ESD demand stable and precise visual guidance to maintain an accurate dissection corridor and a safe tissue margin. Dense confidence fields provide an effective representation for this purpose by describing both the preferred dissection region and its spatial transition to surrounding tissue. However, reliable confidence field estimation remains challenging in dynamic endoscopic scenes due to smoke, specular highlights, tissue deformation, weak texture, and the thin geometric structure of the target region. To address these challenges, we formulate dissection guidance as a geometry-aware confidence field estimation problem and propose GeoCFNet, a geometry-aware confidence field network built on a pretrained DINOv3 backbone. GeoCFNet integrates a Token-Differentiated Fusion module to aggregate class-token context with dense patch representations, a SegFormer decoder for confidence regression, and Geometry-Aware Spatial Regularization (GASR) to preserve spatial coherence and local geometric transitions. Experimental results show that GeoCFNet achieves RMSE 0.0480, PSNR 27.1995, SSIM 0.3397, and CC 0.2466, indicating accurate and geometrically stable confidence field estimation for robot-assisted ESD guidance.


2606.13096 2026-06-12 cs.CV 新提交

Unified MRI Brain Image Translation via Hierarchical Tumor Structure Comparison

基于层级肿瘤结构比较的统一MRI脑图像翻译

Yupeng Cai, Jia Wei, Jianlong Zhou

发表机构 * South China University of Technology（华南理工大学）； UTS Data Science Institute, University of Technology Sydney（悉尼科技大学UTS数据科学研究所）

AI总结提出HTSCGAN模型，通过层级肿瘤结构比较和多种损失函数，提高多模态MRI脑图像翻译质量，在BraTS2020/2021上表现优异。

详情

AI中文摘要

利用可用模态进行多模态MRI脑图像翻译在现代医学中具有重要的实际意义，为疾病的早期诊断、治疗计划和结果评估提供有力支持。为此，确保翻译后肿瘤区域的保真度至关重要。然而，现有的脑图像翻译方法忽略了不同肿瘤区域的结构信息，而利用这些信息有助于翻译模型提高翻译图像的质量和临床适用性。在这项工作中，我们提出了一种新颖的翻译模型HTSCGAN，这是一个统一的多模态脑图像翻译生成对抗模型，整合了肿瘤区域内的结构信息，旨在提高脑图像翻译的质量。具体地，生成器采用三个不同补丁大小的补丁对比模块（PCM）来捕获肿瘤区域的层级结构信息。此外，使用预训练的补丁分类器（PC）和预训练的结构感知编码器（SAE），分别通过补丁分类损失和肿瘤感知损失，使生成的图像包含与真实图像相同的肿瘤区域结构。在BraTS2020和BraTS2021上的实验表明，我们的模型在翻译任务和下游分割任务中均表现出强大的性能，突显了其在提高翻译脑图像质量和临床相关性方面的有效性。我们的代码可在以下网址获取：https://anonymous.4open.science/r/HTSCGAN 。

英文摘要

Multi-modal MRI brain image translation via available modalities holds significant practical importance in modern medicine, providing robust support for early diagnosis, treatment planning, and outcome assessment of diseases. For this purpose, it is important to ensure the fidelity of the tumor regions after translation. However, existing brain image translation methods ignore the structural information of different tumor regions, which could help translation models enhance the quality and clinical applicability of the translated images. In this work, we propose a novel translation model called HTSCGAN, a unified multi-modal brain image translation generative adversarial model that integrates the structural information within tumor regions to improve the quality of brain image translation. Specifically, the generator employs three Patch Contrast Modules (PCMs) with different patch sizes to capture the hierarchical structural information of the tumor regions. In addition, a pretrained Patch Classifier (PC) and a pretrained Structure-Aware Encoder (SAE) are employed so that the generated image contains the same tumor region structure as the ground truth image, via a patch classification loss and a tumor perceptual loss, respectively. The experiments on BraTS2020 and BraTS2021 demonstrate the strong performance of our model in both translation tasks and downstream segmentation tasks, highlighting its effectiveness in enhancing the quality and clinical relevance of the translated brain images. Our code is available at https://anonymous.4open.science/r/HTSCGAN.


2606.13135 2026-06-12 cs.CV cs.AI 新提交

Cascade Classification of Dermoscopic Images of Skin Neoplasms with Controllable Sensitivity and External Clinical Validation

皮肤肿瘤皮肤镜图像的级联分类：可控敏感度与外部临床验证

Elena S. Kozachok, Sergey S. Seregin, Aleksandr V. Kozachok, Ilya P. Latyshev, Oleg I. Samovarov

发表机构 * Ivannikov Institute for System Programming of the Russian Academy of Sciences (ISP RAS)（俄罗斯科学院伊万尼科夫系统编程研究所）； Orel Oncological Dispensary（奥廖尔肿瘤医院）

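AI summary: This study compares four deep learning architectures for dermoscopic image classification, proposes a two-stage cascade classification scheme that controls sensitivity through a tunable triage threshold, and quantifies the generalization gap on external clinical datasets.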

Comments 28 pages, 8 figures, 10 tables

AI中文摘要

目的：比较皮肤肿瘤皮肤镜图像的深度学习架构和分类方案，并评估从开放国际数据集到俄罗斯临床独立数据集的泛化能力。方法：在三种方案中比较四种架构（ViT-B/16、Swin-S、ConvNeXt-S、EfficientNetV2-S）：二分类（恶性/良性）、单阶段四分类（良性、MEL、SCC、BCC）和两阶段级联（二分类分诊，然后三分类MEL/SCC/BCC）。所有模型使用ImageNet预训练权重和单一增强协议，在聚合的开放ISIC Archive数据上训练，并在内部保留样本和两个临床数据集（Melanoscope AI移动系统；谢切诺夫大学）上评估。结果：内部二分类阶段达到ROC-AUC 0.952-0.966；在谢切诺夫大学数据集上降至0.797-0.893，敏感度降至0.53-0.67，ECE从0.02升至0.27-0.39，且低估恶性，量化了排序和校准中的泛化差距。配对检验确认了临床数据上的一个架构间结果：二分类阶段ViT-B/16的缺陷（p<0.05）；在区分阶段，没有架构显示出显著优势。级联方案在大多数架构上提高了宏F1，但仅对ViT-B/16显著，通过恢复被分配到主导良性类别的恶性病变。在ISIC MILK10k上，直接11分类的平均类别敏感度为0.525。结论：可调分诊阈值提供了标准单阶段（argmax）分类无法实现的敏感度控制，并更好地再现了临床鉴别诊断逻辑。持续的泛化差距要求在部署前进行外部临床验证和重新校准。

英文摘要

Purpose. To compare deep learning architectures and classification schemes for dermoscopic images of skin neoplasms and assess their generalization on transfer from open international datasets to independent clinical datasets of Russian practice. Methods. Four architectures (ViT-B/16, Swin-S, ConvNeXt-S, EfficientNetV2-S) were compared in three schemes: binary (malignant/benign), single-stage four-class (benign, MEL, SCC, BCC), and a two-stage cascade (binary triage, then three-class differentiation MEL/SCC/BCC). All models used ImageNet-pretrained weights and a single augmentation protocol on aggregated open ISIC Archive data, and were evaluated on an internal held-out sample and two clinical datasets (Melanoscope AI mobile system; Sechenov University). Results. Internally the binary stage attains ROC-AUC 0.952-0.966; on Sechenov University it drops to 0.797-0.893, sensitivity to 0.53-0.67, and ECE rises from 0.02 to 0.27-0.39 with underestimation of malignancy, quantifying a generalization gap in ranking and calibration. Paired tests confirm one inter-architecture result on clinical data: the deficit of ViT-B/16 at the binary stage (p<0.05); at the differentiation stage no architecture has a proven advantage. The cascade raises macro F1 over single-stage four-class classification for most architectures, but significantly only for ViT-B/16, by recovering malignant lesions assigned to the dominant benign class. On ISIC MILK10k, direct 11-class classification yields mean-class sensitivity 0.525. Conclusion. A tunable triage threshold gives sensitivity control not attainable in standard single-stage (argmax) classification and better reproduces clinical differential-diagnosis logic. The persistent generalization gap mandates external clinical validation and recalibration before deployment.
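
The two-stage cascade described in the abstract can be illustrated with a minimal Python sketch. It assumes the binary stage outputs a malignancy probability and the second stage outputs three probabilities in the order MEL, SCC, BCC. The function names, the default threshold, and the example scores are hypothetical and are not taken from the paper.

```python
from typing import Sequence

SUBTYPES = ("MEL", "SCC", "BCC")  # assumed output order of the second stage


def cascade_predict(p_malignant: float,
                    subtype_probs: Sequence[float],
                    triage_threshold: float = 0.5) -> str:
    """Stage 1: binary triage. Stage 2: three-class differentiation."""
    if p_malignant < triage_threshold:
        return "benign"
    best = max(range(len(SUBTYPES)), key=lambda i: subtype_probs[i])
    return SUBTYPES[best]


def triage_sensitivity(scores: Sequence[float],
                       is_malignant: Sequence[bool],
                       triage_threshold: float) -> float:
    """Fraction of truly malignant lesions that pass the triage stage."""
    passed = [s >= triage_threshold
              for s, m in zip(scores, is_malignant) if m]
    return sum(passed) / len(passed)


# Hypothetical scores: lowering the threshold raises sensitivity.
scores = [0.9, 0.4, 0.2, 0.1]
labels = [True, True, True, False]
print(triage_sensitivity(scores, labels, 0.5))  # 1 of 3 malignant lesions pass
print(triage_sensitivity(scores, labels, 0.3))  # 2 of 3 malignant lesions pass
```

In a single-stage argmax classifier the operating point is fixed by the class scores, whereas here `triage_threshold` is a free parameter that can be tuned on validation data to reach a target sensitivity.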


2606.13188 2026-06-12 cs.CV cs.AI 新提交

Transformer-Guided Graph Attention for Direct Cardiac Mesh Reconstruction: A Structural Digital Twin Framework

Transformer引导的图注意力直接心脏网格重建：一种结构数字孪生框架

Abhishek H S, Akash Ganamukhi, Abhimanyu Suresh, Aditya G Hiremath, Prasad B Honnavalli, Adithya Balasubramanyam

发表机构 * CAVE Labs, C-IoT, Dept. of CSE, PES University（PES大学计算机科学与工程系C-IoT实验室CAVE实验室）； C-IoT, Dept. of CSE, PES University（PES大学计算机科学与工程系C-IoT实验室）

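AI summary: Proposes an end-to-end network combining a 3D Swin Transformer with a GAT that generates smooth cardiac surface meshes directly from medical images without conventional post-processing, achieving a mean Chamfer distance of 1.8 mm on MM-WHS 2017.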

AI中文摘要

构建患者特异性心脏模型是精准心脏病学的核心，但这些模型在临床应用中始终面临同一障碍：网格生成缓慢、混乱且令人沮丧。标准工作流程——分割图像、运行Marching Cubes、然后手动清理结果——耗时、操作者间不一致，并且需要大多数临床团队不具备的专业知识。我们采取了一种根本不同的方法。我们不将分割和网格生成视为两个独立问题，而是训练一个单一的端到端网络，直接从原始3D医学图像生成平滑、可用于模拟的心脏表面网格。核心是一个3D Swin Transformer编码器-解码器，从CT或MRI体积中提取体积特征，配以一个图注意力网络（GAT）头，迭代变形模板网格以拟合患者心脏边界。我们在MM-WHS 2017基准上使用CT和MRI进行了测试。分割分数具有竞争力（CT上Dice为0.84，MRI上为0.83），但主要关注点是网格质量：平均Chamfer距离为1.8 mm，95%分位数表面距离低于5 mm。每个网格通过单次前向传播生成——无需Marching Cubes、平滑滤波器或手动清理。我们认为，对于心脏数字孪生管道，几何保真度和拓扑正确性比像素级Dice分数更重要。通过消除后处理瓶颈，该方法使患者特异性心脏模拟在临床使用中变得更加可行。

英文摘要

Building patient-specific cardiac models sits at the heart of precision cardiology, yet getting those models into clinical use keeps running into the same wall: mesh generation is slow, messy, and frustrating. The standard workflow -- segmenting the image, running Marching Cubes, and then manually cleaning up the result -- is time-consuming, inconsistent across operators, and demands specialist knowledge most clinical teams do not have. We take a fundamentally different approach. Instead of treating segmentation and mesh generation as two separate problems, we train a single end-to-end network that goes directly from a raw 3D medical image to a smooth, simulation-ready cardiac surface mesh. The core is a 3D Swin Transformer encoder-decoder that extracts volumetric features from CT or MRI volumes, paired with a Graph Attention Network (GAT) head that iteratively deforms a template mesh to fit the patient's cardiac boundary. We tested on the MM-WHS 2017 benchmark using both CT and MRI. Segmentation scores were competitive (Dice of 0.84 on CT, 0.83 on MRI), but the primary focus is mesh quality: mean Chamfer distance of 1.8 mm, with 95th-percentile surface distance below 5 mm. Every mesh is produced in a single forward pass -- no Marching Cubes, no smoothing filters, no manual cleanup. We argue that for cardiac digital twin pipelines, geometric fidelity and topological correctness matter more than pixel-level Dice scores. By removing the post-processing bottleneck, this approach makes patient-specific cardiac simulation substantially more accessible for clinical use.
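
For readers unfamiliar with the mesh metric quoted above, here is a minimal sketch of a Chamfer distance between two sets of surface points. Several variants exist (squared or unsquared distances, sums or means); this one uses the symmetric mean of unsquared nearest-neighbour distances, which is an assumption and may differ from the paper's definition. The sample points are hypothetical.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric mean nearest-neighbour distance between two point sets."""
    def one_way(src: Sequence[Point], dst: Sequence[Point]) -> float:
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(a, b) + one_way(b, a))


# Hypothetical surface samples in millimetres.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(predicted, reference))  # 0.5
```

The brute-force search is quadratic in the number of points; for dense meshes a spatial index such as a k-d tree would normally replace the inner `min`.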


2606.13315 2026-06-12 cs.CV eess.IV 新提交

Masked and Predictive Self-Supervised Foundation Models for 3D Brain MRI

用于3D脑部MRI的掩码和预测自监督基础模型

Esra Ergün, Hersh Chandarana, Dan Sodickson, Gözde Ünal

发表机构 * Istanbul Technical University（伊斯坦布尔理工大学）； NYU Langone Health（纽约大学朗格尼医学中心）

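AI summary: Studies self-supervised foundation models for MRI-based disease detection, introducing a spectral-domain reconstruction loss for MAE and variance-covariance regularization for JEPA, and shows across five downstream tasks that an objective helps most when it matches the structure of the task.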

AI中文摘要

自监督基础模型在医学影像中展现出巨大潜力。然而，现有的MRI基础模型研究主要强调分割和密集预测任务，而针对基于MRI的疾病检测的自监督基础模型的系统研究仍然有限。在这项工作中，我们研究了两种主要的自监督预训练范式用于基于MRI的疾病检测：通过掩码自编码器（MAE）的基于重建的学习和通过联合嵌入预测架构（JEPA）的预测表示学习。我们通过引入一种新颖的MAE频谱域重建损失来增强对细粒度解剖结构的敏感性，并通过在我们的JEPA框架中集成方差-协方差正则化（VCR）来鼓励去相关的潜在表示，从而研究辅助目标的作用。我们的模型在对比度无关的设置下，在异质单对比度MRI体积上进行预训练，无需模态拼接。在五个下游疾病检测任务中，我们的结果突出了自监督目标设计对医学基础模型预训练的重要性，表明每个目标的下游收益由其与任务结构的相关性决定。具体来说，当下游判别信号以强高频解剖结构为特征时，频谱正则化带来最大的改进；而当判别信息跨越多个去相关的特征维度时，协方差正则化最为有益。具有频谱域监督的MAE在基于MRI的疾病检测中始终实现优越的下游性能。这些发现表明，医学影像中的自监督目标编码了特定的偏差，其下游收益根本上取决于任务的结构。

英文摘要

Self-supervised foundation models have shown strong promise in medical imaging. However, existing MRI foundation-model studies have primarily emphasized segmentation and dense prediction tasks, while systematic investigation of self-supervised foundation models for MRI-based disease detection remains limited. In this work, we investigate two major self-supervised pretraining paradigms for MRI-based disease detection: reconstruction-based learning via Masked Autoencoders (MAE) and predictive representation learning via Joint Embedding Predictive Architectures (JEPA). We study the role of auxiliary objectives by introducing a novel spectral-domain reconstruction loss for MAE to enhance sensitivity to fine-grained anatomical structure, and by integrating variance--covariance regularization (VCR) within our JEPA framework to encourage decorrelated latent representations. Our models are pretrained on heterogeneous single-contrast MRI volumes in a contrast-agnostic setting, without modality concatenation. Across five downstream disease detection tasks, our results highlight the importance of self-supervised objective design for medical foundation model pretraining, demonstrating that the downstream benefit of each objective is determined by its relevance to the task's structure. Specifically, spectral regularization yields the largest improvements when the downstream discriminative signal is characterized by strong high-frequency anatomical structures, while covariance regularization is most beneficial when discriminative information spans multiple decorrelated feature dimensions. MAE with spectral-domain supervision consistently achieves superior downstream performance for MRI-based disease detection. These findings suggest that self-supervised objectives in medical imaging encode specific biases, and their downstream benefit is fundamentally conditioned on the task's structure.
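
The variance-covariance regularization mentioned in the abstract can be sketched as follows. The abstract does not give the exact formula, so this sketch follows the commonly used VICReg-style formulation (a hinge on the per-dimension standard deviation plus a penalty on squared off-diagonal covariances); `gamma`, `eps`, and the toy batches are assumptions.

```python
import math
from typing import List, Tuple


def variance_covariance_terms(z: List[List[float]],
                              gamma: float = 1.0,
                              eps: float = 1e-4) -> Tuple[float, float]:
    """Return (variance term, covariance term) for a batch of embeddings.

    z has shape [batch][dim]. VICReg-style formulation, which is an
    assumption about the regularizer used in the paper.
    """
    n, d = len(z), len(z[0])
    mean = [sum(row[j] for row in z) / n for j in range(d)]
    centered = [[row[j] - mean[j] for j in range(d)] for row in z]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Hinge on the per-dimension standard deviation (discourages collapse).
    var_term = sum(max(0.0, gamma - math.sqrt(cov[j][j] + eps))
                   for j in range(d)) / d
    # Squared off-diagonal covariances (encourages decorrelation).
    cov_term = sum(cov[i][j] ** 2
                   for i in range(d) for j in range(d) if i != j) / d
    return var_term, cov_term


correlated = [[1.0, 1.0], [-1.0, -1.0]]
decorrelated = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
print(variance_covariance_terms(correlated))    # covariance term is 4.0
print(variance_covariance_terms(decorrelated))  # covariance term is 0.0
```

The two toy batches have similar per-dimension spread, so only the covariance term separates them, which is the decorrelation pressure the abstract refers to.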


2606.13341 2026-06-12 cs.CV cs.AI physics.med-ph 新提交

Dual-Domain Equivariant Generative Adversarial Network for Multimodal CT-PET Synthesis

双域等变生成对抗网络用于多模态CT-PET合成

Gabriel Steele, Alzahra Altalib, Alessandro Perelli

发表机构 * arXiv

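AI summary: Proposes a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) that learns jointly in the spatial and frequency domains and incorporates rotational equivariance for high-fidelity multimodal CT-PET image synthesis.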

Comments 4 pages, 3 figures, 1 table, 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI)

DOI: 10.1109/ISBI61048.2026.11515956

AI中文摘要

我们提出了一种用于多模态CT-PET图像合成的双域等变生成对抗网络（DDE-GAN）。传统的基于GAN的方法通常仅在空间域中操作，忽略了几何一致性，导致结构保真度有限。DDE-GAN通过联合学习空间域和频率（傅里叶）域，捕捉互补的解剖和频谱信息，解决了这些挑战。此外，嵌入在CT和PET测量物理中的旋转等变性被整合到生成器和判别器的损失中，以确保在旋转下的一致响应，从而提高解剖准确性。一种分层双域训练策略通过多阶段损失函数强制实现域内和域间一致性。在HECKTOR 2022 CT-PET数据集上的评估表明，DDE-GAN在CT-PET图像合成中取得了优于基线模型的合成质量。结果表明，将双域学习与几何等变性相结合，显著增强了多模态图像合成的准确性和鲁棒性，为PET补全和数据增强等实际应用提供了可能。

英文摘要

We present a Dual-Domain Equivariant Generative Adversarial Network (DDE-GAN) for multimodal CT-PET image synthesis. Traditional GAN-based approaches often operate solely in the spatial domain and ignore geometric consistency, resulting in limited structural fidelity. DDE-GAN addresses these challenges by jointly learning from both spatial and frequency (Fourier) domains, capturing complementary anatomical and spectral information. Furthermore, the rotational equivariance embedded in the physics of the CT and PET measurements is integrated into the losses of both the generator and the discriminator to ensure consistent responses under rotations, improving anatomical accuracy. A hierarchical dual-domain training strategy enforces intra- and inter-domain consistency through multi-stage loss functions. Evaluated on the HECKTOR 2022 CT-PET dataset, DDE-GAN achieves superior synthesis quality over baseline models. The results demonstrate that combining dual-domain learning with geometric equivariance substantially enhances multimodal image synthesis accuracy and robustness, enabling practical applications in PET completion and data augmentation.
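
To make the rotational-equivariance idea concrete, here is a small sketch of an equivariance penalty: a generator G is rotation-equivariant if G(rot(x)) equals rot(G(x)), and the gap between the two can be used as a loss term. The abstract does not state the exact loss, so the restriction to 90-degree rotations, the mean-absolute-difference measure, and the two toy generators are all assumptions made for illustration.

```python
from typing import Callable, List

Image = List[List[float]]


def rot90(img: Image) -> Image:
    """Rotate a 2D image by 90 degrees counterclockwise."""
    return [list(row) for row in zip(*img)][::-1]


def equivariance_gap(generator: Callable[[Image], Image], img: Image) -> float:
    """Mean absolute difference between G(rot(x)) and rot(G(x))."""
    a = generator(rot90(img))
    b = rot90(generator(img))
    diffs = [abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def pointwise(img: Image) -> Image:
    """Toy generator that commutes with rotation."""
    return [[2.0 * v for v in row] for row in img]


def row_shift(img: Image) -> Image:
    """Toy generator that does not commute with rotation."""
    return [row[1:] + row[:1] for row in img]


x = [[1.0, 2.0], [3.0, 4.0]]
print(equivariance_gap(pointwise, x))  # 0.0
print(equivariance_gap(row_shift, x))  # 2.0
```

A penalty of this kind is zero for an equivariant mapping and grows as the generator's output depends on image orientation.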


2606.13562 2026-06-12 cs.CV cs.AI 新提交

Contrast-Informed Augmentation and Domain-Adversarial Training for Adult-to-Neonatal MR Reconstruction Generalization

对比信息增强和域对抗训练用于成人到新生儿MR重建泛化

Stephen Moore, Lara Leijser, Richard Frayne, Roberto Souza

发表机构 * University of Calgary（卡尔加里大学）； Seaman Family MR Research Centre, Foothills Medical Centre（Seaman家族磁共振研究中心，山麓医疗中心）； Hotchkiss Brain Institute, University of Calgary（Hotchkiss脑研究所，卡尔加里大学）； Pediatrics, Division of Neonatology, University of Calgary（卡尔加里大学儿科学系新生儿科）； Alberta Children’s Hospital Research Institute, University of Calgary（阿尔伯塔儿童医院研究所，卡尔加里大学）； Radiology and Clinical Neuroscience, University of Calgary（卡尔加里大学放射学与临床神经科学系）； Electrical and Software Engineering, University of Calgary（卡尔加里大学电气与软件工程系）

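AI summary: Investigates whether contrast-informed augmentation and domain-adversarial training improve the adult-to-neonatal generalization of E2E-VarNet MR reconstruction. Mixed domain-adversarial training performs best at R=4 and gives the best SSIM at R=8, while mixed training without the adversarial objective gives the best PSNR at R=8.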

Comments 24 pages, 1 table, 7 figures

AI中文摘要

目的：研究对比信息数据增强和域对抗训练是否能改善E2E-VarNet从成人到新生儿的泛化能力。方法：研究了三种训练方案：(1) 仅使用未增强的成人数据进行成人单独训练，(2) 使用配对的未增强和新生儿信息增强的成人数据进行混合训练，(3) 使用域对抗目标进行混合训练。模型在回顾性欠采样的多线圈成人T2加权脑MR数据上训练，并在新生儿和成人测试数据上以加速因子$R=4$和$R=8$进行评估，使用定量指标和定性评估。特征分析评估了域对抗训练是否改变了未增强成人、增强成人和新生儿测试样本的潜在表示。结果：在新生儿数据上评估时，混合训练（Mixed）和混合域对抗训练（Mixed-DAT）优于仅未增强的成人单独训练（Unaug-Only）。在R=4时，Mixed-DAT取得最佳性能（SSIM = 0.924 +/- 0.027，PSNR = 33.98 +/- 1.15 dB）。在R=8时，Mixed-DAT在SSIM指标上表现最佳（0.848 +/- 0.031，对比Unaug-Only的0.766 +/- 0.037和Mixed的0.814 +/- 0.035），而Mixed在PSNR指标上表现最佳（29.56 +/- 0.83 dB，对比Unaug-Only的26.26 +/- 0.78 dB和Mixed-DAT的29.43 +/- 0.83 dB）。t-SNE图的定性评估表明，Mixed-DAT增加了未增强成人、增强成人和新生儿测试数据的潜在表示之间的重叠。结论：对比信息增强和域对抗训练改善了基于深度学习的MR重建从成人到新生儿的泛化能力。这些发现表明，对比信息数据增强结合对抗训练可能提高欠采样新生儿MR重建中对域偏移的鲁棒性。

英文摘要

Purpose: To investigate whether contrast-informed data augmentation and domain-adversarial training improve the adult-to-neonatal generalization of the E2E-VarNet. Methods: Three training regimes were investigated: (1) adult-only training with unaugmented adult data, (2) mixed training with paired unaugmented and neonatal-informed augmented adult data, and (3) mixed training with a domain-adversarial objective. Models were trained on retrospectively undersampled multi-coil adult T2-weighted brain MR data and evaluated on neonatal and adult test data at acceleration factors $R=4$ and $R=8$ using quantitative metrics and qualitative evaluation. Feature analyses assessed whether domain-adversarial training altered the latent representations of unaugmented adult, augmented adult, and neonatal test samples. Results: Mixed training (Mixed) and mixed domain-adversarial training (Mixed-DAT) outperformed unaugmented adult-only training (Unaug-Only) when evaluated on neonatal data. At R=4, Mixed-DAT achieved the best performance (SSIM = 0.924 +/- 0.027, PSNR = 33.98 +/- 1.15 dB). At R=8, Mixed-DAT performed best when measured using SSIM (0.848 +/- 0.031 vs. 0.766 +/- 0.037 for Unaug-Only and 0.814 +/- 0.035 for Mixed) and Mixed performed best when measured using PSNR (29.56 +/- 0.83 dB vs. 26.26 +/- 0.78 dB for Unaug-Only and 29.43 +/- 0.83 dB for Mixed-DAT). Qualitative assessment of t-SNE plots suggested that Mixed-DAT increased the overlap among the latent representations of the unaugmented adult, augmented adult, and neonatal test data. Conclusion: Contrast-informed augmentation and domain-adversarial training improved adult-to-neonatal generalization of deep learning-based MR reconstruction. These findings suggest that contrast-informed data augmentation combined with adversarial training may improve robustness to domain shift in undersampled neonatal MR reconstruction.
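
As a reminder of how the PSNR figures above are defined, here is a minimal sketch of the standard formula, PSNR = 10 log10(data_range^2 / MSE). The paper's choice of data range and any masking or normalization are not stated in the abstract, so `data_range` and the toy intensity values are assumptions.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[float],
         reconstruction: Sequence[float],
         data_range: float) -> float:
    """Peak signal-to-noise ratio in dB for flattened images."""
    mse = sum((r - x) ** 2
              for r, x in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)


# Hypothetical normalized intensities.
reference = [0.0, 1.0, 0.0, 1.0]
reconstruction = [0.0, 0.9, 0.1, 1.0]
print(round(psnr(reference, reconstruction, data_range=1.0), 2))  # 23.01
```

Because PSNR is logarithmic in the mean squared error, the gap of roughly 3 dB between Unaug-Only and the mixed regimes at R=8 corresponds to about a halving of the MSE.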


2606.12824 2026-06-12 eess.IV cs.AI cs.CV physics.med-ph 交叉投稿

Acquisition state behaves as a structured, measurable variable governing lung-nodule AI: kernel-driven measurement instability and noise-driven detection fragility, invisible to DICOM metadata

采集状态作为结构化、可测量变量影响肺结节AI：核驱动的测量不稳定性和噪声驱动的检测脆弱性，DICOM元数据不可见

Daniel Soliman

发表机构 * Daniel Soliman, M.S（丹尼尔·索利曼，硕士）

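AI summary: Using a LUNA16-trained RetinaNet detector, the study finds that CT acquisition state (reconstruction kernel and noise) independently affects AI measurement and detection performance and cannot be recovered from DICOM metadata, and it proposes an acquisition-aware input validation layer.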

AI中文摘要

医学影像AI治理正在规范化：2026年ACR-SIIM实践参数建议本地验收测试和持续漂移监测，ACR Assess-AI注册使用DICOM元数据监测AI输出。我们认为在输出指标之下存在一个必要但目前未监测的层：输入研究是否保持在模型验证过的采集范围内。使用LUNA16训练的MONAI RetinaNet肺结节检测器，我们测试采集状态是否表现为结构化的可测量变量。在仅重建核不同的真实配对CT（NLST B30f vs B80f）上，核单独使AI测量的直径发生偏移，并在5.2%（155个结节中的8个）中翻转了Fleischner尺寸类别，而检测置信度不变（Wilcoxon p=0.22）。在受控的LIDC-IDRI扰动下，效应按轴分离：噪声轴降低检测置信度（p=5.9e-32，集中在6mm以下结节）但不影响测量，而频率/核轴破坏测量（p=8.6e-13）但不影响检测。一个4特征像素指纹恢复了重建身份（真实CT上患者级AUC约0.95，QIBA体模上0.995），而ConvolutionKernel DICOM标签无信息（不同重建标签相同）。核轴跨四个制造商传输（留一制造商AUC 0.94-0.98，与制造商内上限匹配）。因此采集状态映射到不同的AI故障模式：频率内容对应测量可靠性，噪声对应检测灵敏度，且无法从元数据恢复。采集感知的输入侧验证是现在进入影像AI认证的验收测试和漂移监测要求中缺失的层。

英文摘要

AI governance for medical imaging is formalizing: the 2026 ACR-SIIM Practice Parameter recommends local acceptance testing and ongoing drift monitoring, and the ACR Assess-AI registry monitors AI outputs using DICOM metadata for context. We argue that a necessary, currently unmonitored layer sits beneath output metrics: whether incoming studies remain within the acquisition envelope a model was validated on. Using a LUNA16-trained MONAI RetinaNet lung-nodule detector, we test whether acquisition state behaves as a structured, measurable variable. On real paired CT differing only in reconstruction kernel (NLST B30f vs B80f), kernel alone shifted AI-measured diameter and flipped a Fleischner size category in 5.2% (8 of 155) of nodules at fixed patient and acquisition, while detection confidence was unchanged (Wilcoxon p=0.22). Under controlled LIDC-IDRI perturbations the effects dissociated by axis: the noise axis degraded detection confidence (p=5.9e-32, concentrated in nodules under 6 mm) but not measurement, while the frequency/kernel axis corrupted measurement (p=8.6e-13) but not detection. A 4-feature pixel fingerprint recovered reconstruction identity (patient-level AUC about 0.95 on real CT, 0.995 on a QIBA phantom) where the ConvolutionKernel DICOM tag was uninformative (identical labels across reconstructions). The kernel axis transported across four manufacturers (leave-one-vendor-out AUC 0.94-0.98, matching the within-vendor ceiling). Acquisition state thus maps to distinct AI failure modes, frequency content to measurement reliability and noise to detection sensitivity, and is not recoverable from metadata. Acquisition-aware, input-side validation is the missing layer for the acceptance-testing and drift-monitoring requirements now entering imaging-AI accreditation.
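
The category-flip statistic reported above can be illustrated with a short sketch that counts how often a paired measurement of the same nodule lands in a different size bin. The bins used here (<6 mm, 6-8 mm, >8 mm) are the solid-nodule size categories as commonly summarized from the Fleischner guidelines; whether the paper uses exactly these cut-offs is an assumption, and the diameters are hypothetical.

```python
from typing import Sequence, Tuple


def size_category(diameter_mm: float) -> str:
    """Assumed solid-nodule bins: <6 mm, 6-8 mm, >8 mm."""
    if diameter_mm < 6.0:
        return "<6 mm"
    if diameter_mm <= 8.0:
        return "6-8 mm"
    return ">8 mm"


def category_flip_rate(pairs: Sequence[Tuple[float, float]]) -> float:
    """Fraction of paired measurements that land in different bins."""
    flips = sum(size_category(a) != size_category(b) for a, b in pairs)
    return flips / len(pairs)


# Hypothetical (soft kernel, sharp kernel) diameters in mm for the same nodules.
pairs = [(5.8, 6.1), (7.0, 7.4), (4.2, 4.0), (8.0, 8.3)]
print(category_flip_rate(pairs))  # 0.5
```

Flips occur only for nodules close to a bin boundary, so a small kernel-induced shift in diameter can change the management category even when detection confidence is unchanged.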


2606.12953 2026-06-12 cs.AI cs.CV cs.LG eess.IV 交叉投稿

OpenMedQ: Broad Open Pretraining for Medical Vision-Language Models

OpenMedQ：面向医学视觉语言模型的广泛开放预训练

Ibrahim Gulluk, Max Van Puyvelde, Olivier Gevaert

发表机构 * Stanford University（斯坦福大学）； Stanford University School of Medicine（斯坦福大学医学院）； Ghent University（根特大学）

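AI summary: Proposes OpenMedQ, a medical vision-language model pretrained on 14 datasets (about 3.35 million samples). It reaches a BLEU-1 of 75.9 on PathVQA, surpassing the 562B-parameter Med-PaLM M, and attains the highest average macro-F1 (0.757) on 8 unseen medical classification tasks.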

Comments Medical Imaging with Deep Learning (MIDL) 2026, Short Paper Track

2606.13028 2026-06-12 cs.RO cs.CV 交叉投稿

Comparing Commercial Depth Sensor Accuracy for Medical Applications

面向医疗应用的商用深度传感器精度比较

Pit Henrich, Maximilian Weiherer, Franziska Hansen, Bernhard Egger, Franziska Mathis-Ullrich

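AI summary: Using stylus-sampled points as the reference, this paper compares the accuracy of four depth sensors covering stereo vision, structured light, and time-of-flight technologies at a distance of 50 cm on pig bone, pig stomach, and a silicone kidney phantom, and finds that the Zivid 2M+ 60 performs best across all objects and metrics.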

Comments 4 Pages

2511.19652 2026-06-12 cs.CV 版本更新

Navigating Gigapixel Pathology Images with Large Multimodal Models

利用大型多模态模型导航千兆像素病理图像

Thomas A. Buckley, Kian R. Weihrauch, Katherine Latham, Andrew Z. Zhou, Padmini A. Manrai, Arjun K. Manrai

发表机构 * Department of Biomedical Informatics, Harvard Medical School（哈佛医学院生物医学信息学系）； Department of Pathology, Massachusetts General Hospital（麻省总医院病理学系）； Department of Pathology and Laboratory Medicine, Brown University（布朗大学病理学与实验室医学系）

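AI summary: Proposes GIANT, a training-free method that lets general-purpose multimodal models navigate WSIs autonomously by iteratively selecting multi-magnification crops and aggregating evidence, achieving state-of-the-art results on four of the five MultiPathQA benchmarks.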

AI中文摘要

近期大型多模态模型的进展使得开发能够对话和推理病理全切片图像（WSI）的交互式聊天模型成为可能。然而，现有的切片级聊天系统通常高度专业化，通常将WSI压缩为固定的切片级嵌入或依赖多组件流水线，这可能会丢失多尺度细节并限制目标任务之外的泛化能力。我们提出GIANT（千兆像素图像组织导航代理），一种简单、无需训练的方法，让通用多模态模型自主导航WSI，迭代选择多放大倍数裁剪并随时间聚合证据。为了评估WSI问答中的泛化能力并促进可重复性，我们引入了MultiPathQA，一个涵盖五个临床挑战和934个问题（涉及868个独特WSI）的基准套件。其中包括128道由病理学家编写的多项选择题，旨在模拟真实的诊断搜索和多尺度推理。使用GPT-5，GIANT在五个基准中的四个上取得了最先进的性能，优于专门用于病理问答的模型。

英文摘要

Recent advances in large multimodal models have allowed for the development of interactive chat models that can converse and reason about pathology whole-slide images (WSIs). However, existing slide-level chat systems are often highly specialized, typically compressing WSIs into fixed slide-level embeddings or relying on multi-component pipelines, which can lose multi-scale detail and limit generalizability beyond the target task. We present GIANT (Gigapixel Image Agent for Navigating Tissue), a simple, training-free approach that lets general-purpose multimodal models navigate WSIs on their own, iteratively selecting multi-magnification crops and aggregating evidence over time. To evaluate generalizability in WSI question answering and to promote reproducibility, we introduce MultiPathQA, a benchmark suite spanning five clinical challenges and 934 questions over 868 unique WSIs. This includes a new set of 128 pathologist-authored multiple-choice questions designed to mirror real diagnostic search and multi-scale reasoning. Using GPT-5, GIANT outperforms models specialized for pathology question answering, achieving state-of-the-art performance on four out of five benchmarks.


2512.14648 2026-06-12 cs.CV eess.IV 版本更新

Adaptable Segmentation Pipeline for Diverse Brain Tumors with Radiomic-Guided Subtyping and Lesion-Wise Model Ensemble

适用于多样化脑肿瘤的自适应分割流程：放射组学引导的亚型分类与病灶级模型集成

Daniel Capellán-Martín, Abhijeet Parida, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer, Austin Tapp, Syed Muhammad Anwar, María J. Ledesma-Carbayo, Marius George Linguraru

发表机构 * Sheikh Zayed Institute for Pediatric Surgical Innovation（Sheikh Zayed儿童外科创新研究所）； Children’s National Hospital（儿童医院）； University of Washington（华盛顿大学）； Universidad Politécnica de Madrid（马德里理工大学）； CIBER-BBN ； ISCIII ； School of Medicine and Health Sciences（医学与健康科学学院）

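AI summary: Proposes a flexible, modular, and adaptable segmentation pipeline that uses radiomic features to detect tumor subtype and balance training, and lesion-level performance metrics to optimize model ensembling and post-processing. It achieves performance comparable to top-ranked algorithms in the BraTS 2025 challenge and supports quantitative tumor measurement in clinical practice.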

Comments 12 pages, 5 figures, 3 tables. Algorithm presented at MICCAI BraTS 2025

DOI: 10.1007/978-3-032-16365-3_41

AI中文摘要

在多参数磁共振成像（MRI）上对脑肿瘤进行鲁棒且可泛化的分割仍然困难，因为肿瘤类型差异很大。BraTS 2025 Lighthouse挑战赛在多种高质量成人及儿童肿瘤数据集上对分割方法进行基准测试：多联盟国际儿童脑肿瘤分割（PED）、术前脑膜瘤肿瘤分割（MEN）、脑膜瘤放射治疗分割（MEN-RT）以及治疗前后脑转移瘤分割（MET）。我们提出了一种灵活、模块化且自适应的流程，通过选择和组合最先进的模型，并在训练前后应用肿瘤和病灶特定的处理，来提高分割性能。从MRI中提取的放射组学特征有助于检测肿瘤亚型，确保更平衡的训练。自定义的病灶级性能指标决定了每个模型在集成中的影响力，并优化了进一步细化预测的后处理，使工作流能够针对每个病例定制每一步。在BraTS测试集上，我们的流程在多个挑战中取得了与顶尖算法相当的性能。这些发现证实，自定义的病灶感知处理与模型选择能够产生鲁棒的分割，而无需将方法锁定在特定的网络架构上。我们的方法在临床实践中具有定量肿瘤测量的潜力，支持诊断和预后。

英文摘要

Robust and generalizable segmentation of brain tumors on multi-parametric magnetic resonance imaging (MRI) remains difficult because tumor types differ widely. The BraTS 2025 Lighthouse Challenge benchmarks segmentation methods on diverse high-quality datasets of adult and pediatric tumors: multi-consortium international pediatric brain tumor segmentation (PED), preoperative meningioma tumor segmentation (MEN), meningioma radiotherapy segmentation (MEN-RT), and segmentation of pre- and post-treatment brain metastases (MET). We present a flexible, modular, and adaptable pipeline that improves segmentation performance by selecting and combining state-of-the-art models and applying tumor- and lesion-specific processing before and after training. Radiomic features extracted from MRI help detect tumor subtype, enabling more balanced training. Custom lesion-level performance metrics determine the influence of each model in the ensemble and optimize the post-processing that further refines the predictions, enabling the workflow to tailor every step to each case. On the BraTS testing sets, our pipeline achieved performance comparable to top-ranked algorithms across multiple challenges. These findings confirm that custom lesion-aware processing and model selection yield robust segmentations without locking the method to a specific network architecture. Our method has the potential to support quantitative tumor measurement in clinical practice, aiding diagnosis and prognosis.


2512.14937 2026-06-12 cs.CV cs.AI 版本更新

Improving Pre-trained Adult Glioma Segmentation Models Using only Post-processing Techniques

仅使用后处理技术改进预训练的成人胶质瘤分割模型

Abhijeet Parida, Daniel Capellán-Martín, Zhifan Jiang, Nishad Kulkarni, Krithika Iyer, Austin Tapp, Syed Muhammad Anwar, María J. Ledesma-Carbayo, Marius George Linguraru

发表机构 * Sheikh Zayed Institute for Pediatric Surgical Innovation（Sheikh Zayed儿童手术创新研究所）； Children’s National Hospital（儿童医院）； University of Madrid（马德里大学）； CIBER-BBN ； ISCIII ； School of Medicine and Health Sciences（医学与健康科学学院）； George Washington University（乔治·华盛顿大学）

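AI summary: Targets systematic errors of pre-trained models in glioma segmentation with adaptive post-processing techniques, improving the ranking metric in BraTS 2025 challenges by 14.9% (sub-Saharan Africa) and 0.9% (adult glioma) and promoting a shift toward efficient, fair, and sustainable post-processing strategies.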

DOI: 10.1007/978-3-032-16365-3_22

AI中文摘要

胶质瘤是成人中最常见的恶性脑肿瘤，也是最致命的肿瘤之一。尽管积极治疗，中位生存率仍低于15个月。准确的多参数MRI（mpMRI）肿瘤分割对于手术规划、放疗和疾病监测至关重要。虽然深度学习模型提高了自动分割的准确性，但大规模预训练模型泛化能力差且常表现不佳，产生系统性错误，如假阳性、标签交换和切片不连续。这些问题因GPU资源获取不平等和大规模模型训练日益增长的环境成本而进一步加剧。在这项工作中，我们提出自适应后处理技术，以改进为各种肿瘤类型开发的大规模预训练模型产生的胶质瘤分割质量。我们在多个BraTS 2025分割挑战任务中展示了这些技术，使撒哈拉以南非洲挑战的排名指标提升了14.9%，成人胶质瘤挑战提升了0.9%。该方法推动脑肿瘤分割研究从日益复杂的模型架构转向精确、计算公平且可持续的高效临床后处理策略。

英文摘要

Gliomas are the most common malignant brain tumors in adults and are among the most lethal. Despite aggressive treatment, median survival is less than 15 months. Accurate multiparametric MRI (mpMRI) tumor segmentation is critical for surgical planning, radiotherapy, and disease monitoring. While deep learning models have improved the accuracy of automated segmentation, large-scale pre-trained models generalize poorly and often underperform, producing systematic errors such as false positives, label swaps, and slice discontinuities. These limitations are further compounded by unequal access to GPU resources and the growing environmental cost of large-scale model training. In this work, we propose adaptive post-processing techniques to refine the quality of glioma segmentations produced by large-scale pre-trained models developed for various types of tumors. We demonstrate these techniques on multiple BraTS 2025 segmentation challenge tasks, with the ranking metric improving by 14.9% for the sub-Saharan Africa challenge and 0.9% for the adult glioma challenge. This approach promotes a shift in brain tumor segmentation research from increasingly complex model architectures to efficient, clinically aligned post-processing strategies that are precise, computationally fair, and sustainable.


2509.04682 2026-06-12 cs.SD cs.AI cs.CV cs.IR cs.LG eess.AS 版本更新

GetNetUPAM: Ecologically Informed Nested Cross-Validation and Noise-Robust Attention for Marine Bioacoustic Monitoring

GetNetUPAM：生态信息嵌套交叉验证与噪声鲁棒注意力用于海洋生物声学监测

Nicholas R. Rasmussen, Rodrigue Rizk, Longwei Wang, KC Santosh

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）

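AI summary: Proposes the GetNetUPAM framework, which preserves ecological heterogeneity through hierarchical nested cross-validation, and evaluates ARPA-N, a network with CBAM spatial attention that generalizes robustly under high-noise, low-SNR conditions and reduces false positives roughly 10-fold in a region with zero training support.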

Comments Resubmitted and under review as an anonymous submission to IEEETAI - We are allowed an archive submission. Final formatting is yet to be determined

AI中文摘要

部署可靠的生物声学监测系统需要能够在高噪声、低信噪比条件下泛化的模型，以及能够暴露部署相关故障模式的评估协议，这些在当前UPAM实践中基本未得到解决。内在噪声、可变传播以及混合的生物和人为源会导致分布偏移，而传统模型和单次划分评估会掩盖这些偏移，夸大性能并掩盖不稳定性。我们提出GetNetUPAM，一种分层嵌套交叉验证框架，它利用嵌套阶段来量化模型稳定性，而不是调整以获取夸大的保留分数。通过将数据划分为站点-年份块，GetNetUPAM保留了生态异质性，并迫使每个外层折代表不同的环境条件，防止过拟合局部噪声或传感器伪影。内层分层折衡量整个UPAM信号分布上的泛化能力，强制模型开发与外层保留部署条件严格分离。使用GetNetUPAM，我们评估了自适应分辨率池化和注意力网络（ARPA-N），一种用于不规则频谱图维度的CNN架构。ARPA-N将CBAM空间注意力集成为学习型噪声抑制器，生成注意力图以定位真实叫声结构，并避免标准CNN在长窗口数据上利用的全局非生物线索。在GetNetUPAM下，ARPA-N在不同环境条件下鲁棒泛化。在零训练的Balleny Islands区域，它在固定90%召回率下将每小时误报率降低超过一个数量级（约10倍），并在各折上持续改进指标。这些进展提供了可重复的基准，推动UPAM向可扩展、部署可靠的生态监测发展。

英文摘要

Deploying reliable bioacoustic monitoring systems requires models that generalize under high-noise, low-SNR conditions and evaluation protocols that expose deployment-relevant failure modes, gaps largely unaddressed in current UPAM practice. Intrinsic noise, variable propagation, and mixed biological and anthropogenic sources induce distribution shifts that conventional models and single-split evaluations obscure, inflating performance and masking instability. We introduce GetNetUPAM, a hierarchical nested cross-validation framework that uses the nested stage to quantify model stability rather than tune for inflated hold-out scores. By partitioning data into site-year blocks, GetNetUPAM preserves ecological heterogeneity and forces each outer fold to represent a distinct environmental regime, preventing overfitting to localized noise or sensor artifacts. Inner stratified folds measure generalization across the full UPAM signal distribution, enforcing strict separation between model development and the outer held-out deployment condition. Using GetNetUPAM, we evaluate the Adaptive Resolution Pooling and Attention Network (ARPA-N), a CNN architecture for irregular spectrogram dimensions. ARPA-N integrates CBAM spatial attention as a learned noise suppressor, producing attention maps that localize true call structure and avoid the global, non-biological cues exploited by standard CNNs on long-window data. Under GetNetUPAM, ARPA-N generalizes robustly across diverse environmental regimes. In the zero-training support Balleny Islands region, it reduces false positives per hour by over an order of magnitude (approximately 10x) at fixed 90 percent recall, yielding consistently improved metrics across folds. These advances provide a reproducible benchmark and move UPAM toward scalable, deployment-reliable ecological monitoring.
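
The site-year blocking described above can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: each outer fold holds out one (site, year) block, and the remaining samples are dealt into inner folds stratified by label. The function names, the round-robin stratification, and the toy data are all assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Sequence, Tuple

Block = Tuple[str, int]  # (site, year)


def outer_folds(blocks_of: Sequence[Block]
                ) -> Iterator[Tuple[Block, List[int], List[int]]]:
    """Leave one site-year block out per outer fold."""
    groups: Dict[Block, List[int]] = defaultdict(list)
    for idx, block in enumerate(blocks_of):
        groups[block].append(idx)
    for held_out in sorted(groups):
        train = [i for b in sorted(groups) if b != held_out for i in groups[b]]
        yield held_out, train, groups[held_out]


def inner_stratified_folds(train: Sequence[int],
                           labels: Sequence[int],
                           k: int) -> List[List[int]]:
    """Deal each class round-robin into k inner folds."""
    by_label: Dict[int, List[int]] = defaultdict(list)
    for i in train:
        by_label[labels[i]].append(i)
    folds: List[List[int]] = [[] for _ in range(k)]
    for label in sorted(by_label):
        for j, i in enumerate(by_label[label]):
            folds[j % k].append(i)
    return folds


# Hypothetical recordings tagged with (site, year) and a binary call label.
blocks = [("A", 2019), ("A", 2019), ("A", 2020),
          ("B", 2019), ("B", 2019), ("A", 2020)]
labels = [1, 0, 1, 0, 1, 0]
for held_out, train, test in outer_folds(blocks):
    print(held_out, test, inner_stratified_folds(train, labels, k=2))
```

Because the held-out block never contributes to the inner folds, model selection cannot adapt to the noise or sensor characteristics of the site-year it is finally scored on.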


2604.27277 2026-06-12 cs.LG cs.AI cs.CV 版本更新

BrainDINO: A Brain MRI Foundation Model for Generalizable Clinical Representation Learning

BrainDINO：一种用于通用临床表征学习的脑MRI基础模型

Yizhou Wu, Shansong Wang, Yuheng Li, Mojtaba Safari, Mingzhe Hu, Chih-Wei Chang, Harini Veeraraghavan, Xiaofeng Yang

发表机构 * Department of Radiation Oncology and Winship Cancer Institute, Emory University（放射肿瘤科和Winship癌症研究所，埃默里大学）； Department of Radiation and Cellular Oncology, The University of Chicago（放射肿瘤学与细胞肿瘤学部，芝加哥大学）； Department of Electrical and Computer Engineering, Georgia Institute of Technology（电气与计算机工程系，佐治亚理工学院）； Department of Biomedical Engineering, Georgia Institute of Technology（生物医学工程系，佐治亚理工学院）； Department of Biomedical Informatics, Emory University（生物医学信息学系，埃默里大学）； Department of Medical Physics, Memorial Sloan Kettering Cancer Center（医学物理系，纪念斯隆凯特琳癌症中心）

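AI summary: Proposes BrainDINO, a self-distilled foundation model trained on approximately 6.6 million unlabeled axial slices. With a frozen encoder and lightweight task heads, it matches or exceeds baselines across diverse brain MRI tasks, with particularly strong advantages when labels are scarce.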

Comments 25 pages, 5 figures

AI中文摘要

脑MRI支撑着广泛的神经科学和临床应用，然而大多数基于学习的方法仍针对特定任务且需要大量标注数据。本文表明，单一的自监督表征可以泛化到异质的脑MRI终点。我们训练了BrainDINO，一个自蒸馏的基础模型，使用了来自20个数据集的约660万张未标记轴向切片，这些数据集涵盖了人群、疾病和采集设置的广泛变异。通过使用冻结编码器加轻量任务头，BrainDINO支持肿瘤分割、神经退行性和神经发育性疾病分类、脑年龄估计、卒中后时间预测、分子状态预测、MRI序列分类和生存建模等任务的迁移。在各种任务和监督机制下，BrainDINO始终等于或超过自然图像和MRI特定自监督基线，在标签稀缺时尤其具有优势。表征分析进一步显示，在缺乏任务特定监督的情况下，特征结构具有解剖学组织和病理敏感性。我们的发现表明，大规模切片级自监督学习可以产生统一的脑MRI表征，支持多样化的神经影像任务，无需体积预训练或全网络微调，为稳健且数据高效的脑影像分析建立了可扩展的基础。代码可在 https://github.com/mclwu22/BrainDINO 获取。

英文摘要

Brain MRI underpins a wide range of neuroscientific and clinical applications, yet most learning-based methods remain task-specific and require substantial labeled data. Here we show that a single self-supervised representation can generalize across heterogeneous brain MRI endpoints. We trained BrainDINO, a self-distilled foundation model, on approximately 6.6 million unlabeled axial slices from 20 datasets encompassing broad variation in population, disease, and acquisition setting. Using a frozen encoder with lightweight task heads, BrainDINO supported transfer across tumor segmentation, neurodegenerative and neurodevelopmental conditions classification, brain age estimation, post-stroke temporal prediction, molecular status prediction, MRI sequence classification, and survival modeling. Across tasks and supervision regimes, BrainDINO consistently equaled or exceeded natural-image and MRI-specific self-supervised baselines, with particularly strong advantages under label scarcity. Representation analyses further showed anatomically organized and pathology-sensitive feature structure in the absence of task-specific supervision. Our findings indicate that large-scale slice-wise self-supervised learning can yield a unified brain MRI representation that supports diverse neuroimaging tasks without volumetric pretraining or full-network fine-tuning, establishing a scalable foundation for robust and data-efficient brain imaging analysis. Code is available at https://github.com/mclwu22/BrainDINO


2606.13108 2026-06-12 cs.CV 新提交

PP-OCRv6: From 1.5M to 34.5M Parameters, Surpassing Billion-Scale VLMs on OCR Tasks

PP-OCRv6: 从1.5M到34.5M参数，在OCR任务上超越十亿级视觉语言模型

Yubo Zhang, Xueqing Wang, Manhui Lin, Yue Zhang, Penglongyi Deng, Ting Sun, Tingquan Gao, Zelun Zhang, Jiaxuan Liu, Changda Zhou, Hongen Liu, Suyin Liang, Cheng Cui, Yi Liu, Dianhai Yu, Yanjun Ma

发表机构 * PaddlePaddle Team, Baidu Inc.（百度公司飞桨团队）

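AI summary: Proposes PP-OCRv6, a lightweight OCR system built on a unified MetaFormer-style architecture with structural reparameterization that covers server-to-edge deployment and surpasses billion-scale VLMs with orders of magnitude fewer parameters. The medium model reaches 83.2% recognition accuracy and 86.2% detection Hmean.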

详情

AI中文摘要

视觉语言模型（VLM）在通用视觉语言任务上取得了令人印象深刻的结果，但在应用于专用OCR场景时，它们存在幻觉、定位不精确和计算成本过高的问题。本文提出PP-OCRv6，一个轻量级OCR系统，结合了架构创新和数据中心优化。PP-OCRv6围绕统一的MetaFormer风格构建块重新设计了骨干网络、检测颈和识别颈，采用结构化重参数化，将空间token混合与通道混合解耦，并通过任务特定的步长配置支持两个任务。三个模型层级（中、小、微）共享相同的构建块原语，覆盖从服务器到边缘的部署场景。在我们的内部基准测试中，PP-OCRv6_medium实现了83.2%的识别准确率和86.2%的检测Hmean，分别比PP-OCRv5_server高出+5.1%和+4.6%，同时以数量级更少的参数超越了Qwen3-VL-235B、GPT-5.5和Gemini-3.1-Pro。微层级在Intel Xeon CPU上实现了比PP-OCRv5_mobile快3.9倍的推理速度，同时保持相当的准确率。

英文摘要

Vision-Language Models (VLMs) have achieved impressive results on general vision-language tasks, yet they suffer from hallucination, imprecise localization, and prohibitive computational cost when applied to dedicated OCR scenarios. This paper presents PP-OCRv6, a lightweight OCR system that combines architectural innovation with data-centric optimization. PP-OCRv6 redesigns the backbone, detection neck, and recognition neck around a unified MetaFormer-style building block with structural reparameterization, decoupling spatial token mixing from channel mixing and supporting both tasks through task-specific stride configurations. Three model tiers (medium, small, tiny) share the same block primitives, covering deployment scenarios from server to edge. On our in-house benchmarks, PP-OCRv6_medium achieves 83.2% recognition accuracy and 86.2% detection Hmean, outperforming PP-OCRv5_server by +5.1% and +4.6% respectively while surpassing Qwen3-VL-235B, GPT-5.5, and Gemini-3.1-Pro with orders of magnitude fewer parameters. The tiny tier achieves 3.9$\times$ faster inference than PP-OCRv5_mobile on Intel Xeon CPU while maintaining comparable accuracy.


2602.00122 2026-06-12 cs.CV cs.AI cs.MM 版本更新

VDE Bench: Evaluating The Capability of Image Editing Models to Modify Visual Documents

VDE Bench: 评估图像编辑模型对视觉文档进行修改的能力

Hongzhu Yi, Yujia Yang, Yuanxiang Wang, Tong Li, Zhenyu Guan, Tianyu Zong, Jiahuan Chen, Chenxi Bao, Tiankun Yang, Haopeng Jin, Yixuan Yuan, Xinming Wang, Tao Yu, Ruilin Gao, Ruiwen Tao, Haijin Liang, Jin Ma, Jinwen Luo, Yeshani, Xinyu Zuo, Jungang Xu

发表机构 * UCAS（中国科学院大学）； CASIA（中国科学院自动化研究所）； Tencent（腾讯）； CMU（卡内基梅隆大学）； WashU（圣路易斯华盛顿大学）； SJTU（上海交通大学）； XDU（西安电子科技大学）

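AI summary: Proposes VDE Bench, a benchmark for evaluating image editing models on bilingual Chinese-English and complex visual document editing tasks; a high-quality dataset and a new evaluation framework systematically quantify the accuracy of text modifications.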

AI中文摘要

近年来，图像编辑模型取得了显著进展，使用户能够通过自然语言指令灵活地交互式地操作视觉内容。然而，一个重要的但尚未充分研究的研究方向是密集的视觉文档图像编辑，这涉及在图像中修改文本内容，同时忠实保留原始文本风格和背景上下文。现有方法主要集中在英语场景和文本相对稀疏的图像上，因此无法充分解决密集、结构复杂的文档或非拉丁文字如中文。为弥合这一差距，我们提出了VDE Bench（视觉文档编辑基准），这是一个严格人工标注和评估的基准，专门设计用于评估图像编辑模型在双语中文-英文和复杂视觉文档编辑任务上的性能。该基准包含942个基于指令的图像编辑样本数据集，其种子图像涵盖密集的中文和英文文本文档，包括学术论文、海报、演示文稿、考试材料和报纸。此外，我们引入了一个新的评估框架，系统地量化了在OCR解析层面的编辑性能，从而实现了对文本修改准确性的细粒度评估。基于此基准，我们对代表性图像编辑模型进行了全面评估。人类验证显示，人类判断与自动化评估指标之间有一致性。VDE Bench构成了评估图像编辑模型在双语密集文本视觉文档性能的首个系统性基准。

英文摘要

In recent years, image editing models have made significant progress, enabling users to manipulate visual content in a flexible and interactive manner through natural language instructions. However, an important yet underexplored research direction remains dense visual document image editing, which involves modifying textual content within images while faithfully preserving the original text style and background context. Existing methods primarily focus on English scenarios and images with relatively sparse text, and thus cannot adequately address dense, structurally complex documents or non-Latin scripts such as Chinese. To bridge this gap, we propose VDE Bench (Visual Doc Edit Bench), a rigorously human annotated and evaluated benchmark specifically designed to assess the performance of image editing models on bilingual Chinese-English and complex visual document editing tasks. The benchmark comprises a high quality dataset of 942 instruction based image editing samples, whose seed images encompass dense Chinese and English text documents including academic papers, posters, presentation slides, examination materials, and newspapers. Furthermore, we introduce a novel evaluation framework that systematically quantifies editing performance at the OCR parsing level, thereby enabling fine grained assessment of text modification accuracy. Based on this benchmark, we conduct a comprehensive evaluation of representative image editing models. Human verification demonstrates a high degree of consistency between human judgments and automated evaluation metrics. VDE Bench constitutes the first systematic benchmark for evaluating the performance of image editing models on bilingual dense text visual documents.


2606.13136 2026-06-12 cs.CV cs.LG eess.IV 新提交

An Extensible and Lightweight Unified Architecture for Demosaicing Pixel-bin Image Sensors

一种可扩展且轻量级的统一架构用于像素合并图像传感器的去马赛克

Saurabh Kumar, Nutan Sairam Yenneti

发表机构 * Samsung Research Institute Bangalore（三星研究院班加罗尔分院）

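AI summary: Proposes a modular unified architecture that, with a learning-free CFA identification module and a lightweight design, demosaics multiple pixel-bin sensors, improving image quality while reducing resource consumption.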

2606.13366 2026-06-12 cs.CV cs.MM 新提交

Dual-Constrained Diffusion Image Compression for Operational Rate-Distortion-Perception Optimization

双约束扩散图像压缩用于操作率失真感知优化

Sanxin Jiang, Jiro Katto, Heming Sun

发表机构 * Shanghai University of Electric Power（上海电力大学）； Waseda University（早稻田大学）； Institute of Science Tokyo（东京科学大学）

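AI summary: Proposes the DCIC framework, which combines a learned codec with a diffusion-based decoder and uses joint distortion and idempotence constraints to navigate the rate-distortion-perception Pareto frontier continuously, with no additional rate overhead.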

AI中文摘要

率失真感知（RDP）权衡通过施加重建的分布约束扩展了经典率失真理论，为联合控制保真度和感知真实性的神经图像压缩提供了统一框架。虽然先前的工作实现了接近最优的率感知权衡，但明确实现完整RDP曲面的实用框架仍然很少，主要由于在解码器引入公共随机性的困难。我们提出DCIC（双约束扩散图像压缩），它将学习编解码器与基于扩散的解码器相结合，受联合失真和等幂约束的支配。失真约束限制了相对于基础编解码器输出的重建保真度；等幂约束——要求重新编码恢复图像恢复基础编解码器重建——作为分布感知要求的可处理替代。它们通过一致噪声注入的迭代优化引导反向去噪过程，实现公共随机性而无需额外码率开销。在固定码率下，双衰减因子$(K_D, K_P)$共同导航失真感知平面的帕累托前沿，从单个比特流实现连续可调的保真度-真实感权衡。DCIC$_{RD}$（$K_P{=}0$）和DCIC$_{RP}$（$K_D{=}0$）作为边界曲线出现，DCIC$_{RDP}$（$K_D = K_P=1$）实现最优内部工作点。在CelebA-HQ、CLIC2020和ImageNet-1K上，跨CNN、Transformer和混合架构的实验证实，DCIC$_{RDP}$在所有感知编解码器中实现了优越的BD-PSNR，而DCIC$_{RP}$在BD-FID上与专用感知方法相匹配，验证了完整RDP曲面导航的实用价值。

英文摘要

The rate-distortion-perception (RDP) trade-off extends classical rate--distortion theory by imposing a distributional constraint on reconstructions, providing a unified framework for neural image compression that jointly governs fidelity and perceptual realism. While prior work achieves near-optimal rate--perception trade-offs, practical frameworks explicitly realizing the full RDP surface remain scarce, primarily due to the difficulty of introducing common randomness at the decoder. We propose DCIC (Dual-Constrained Diffusion Image Compression), which integrates a learned codec with a diffusion-based decoder governed by joint distortion and idempotence constraints. The distortion constraint bounds reconstruction fidelity relative to the base codec output; the idempotence constraint -- requiring that re-encoding the restored image recovers the base codec reconstruction -- serves as a tractable surrogate for the distributional perception requirement. Together, they steer the reverse denoising process via iterative optimization with consistent noise injection, realizing common randomness without additional rate overhead. At fixed rate, dual attenuation factors $(K_D, K_P)$ jointly navigate the Pareto frontier of the distortion-perception plane, enabling continuously adjustable fidelity-realism trade-offs from a single bitstream. DCIC$_{RD}$ ($K_P{=}0$) and DCIC$_{RP}$ ($K_D{=}0$) arise as boundary curves, with DCIC$_{RDP}$ ($K_D = K_P=1$) realizing the optimal interior operating point. Experiments on CelebA-HQ, CLIC2020, and ImageNet-1K across CNN, Transformer, and hybrid architectures confirm that DCIC$_{RDP}$ achieves superior BD-PSNR over all perceptual codecs, while DCIC$_{RP}$ matches dedicated perception-oriented methods in BD-FID, validating the practical value of full RDP surface navigation.


2606.13580 2026-06-12 cs.CV cs.AI 新提交

EvTexture++: Event-Driven Texture Enhancement for Video Super-Resolution

EvTexture++: 事件驱动的视频超分辨率纹理增强

Dachun Kai, Jiayao Lu, Yueyi Zhang, Xiaoyan Sun

发表机构 * MOE Key Laboratory of Brain-Inspired Intelligent Perception and Cognition, University of Science and Technology of China（中国科学技术大学，脑启发智能感知与认知教育部重点实验室）； Midea Group（美的集团）

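AI summary: Proposes EvTexture++, the first event-driven texture enhancement framework for video super-resolution; it uses the high-frequency spatiotemporal details of events to restore texture progressively and a temporal texture alignment module to improve inter-frame consistency, achieving state-of-the-art performance on multiple datasets.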

Comments IEEE TPAMI 2026. Extended version of arXiv:2406.13457 (ICML 2024). Project page: https://dachunkai.github.io/evtexture-project-page/


DOI: 10.1109/TPAMI.2026.3660020
Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 6, pp. 6642-6659, June 2026

AI中文摘要

基于事件的视觉因其独特特性（包括超高时间分辨率和极端动态范围）而受到越来越多的关注。最近的工作将其引入视频超分辨率（VSR）以增强光流估计和时间对齐。相比之下，本文将事件信号的关注点从运动细化转向VSR中的纹理增强。我们提出了EvTexture++，这是首个专用于VSR中纹理增强的事件驱动框架。它利用事件的高频时空细节来改善纹理恢复。EvTexture++包含一个定制的纹理增强分支，以及一个迭代纹理增强模块，该模块逐步利用高时间分辨率的事件信息进行纹理恢复。这使得纹理区域在迭代中逐渐细化，从而产生更准确、更详细的高分辨率输出。除了帧内纹理恢复外，大运动可能会降低帧间时间一致性，尤其是在纹理区域，导致纹理闪烁。为了缓解这一问题，我们进一步利用事件的连续时间运动线索来增强时间一致性，引入了一个时间纹理对齐模块，该模块估计事件引导的纹理感知光流，以实现精确的帧间纹理对齐。此外，EvTexture++被设计为即插即用工具，可灵活提升现有VSR模型的性能。在五个数据集上的实验表明，EvTexture++达到了最先进的性能。当集成到最近的VSR模型中时，它带来了显著的改进，在纹理丰富的Vid4数据集上PSNR提升高达1.55 dB。代码：https://github.com/DachunKai/EvTexture

英文摘要

Event-based vision has drawn increasing attention owing to its distinctive properties, including ultra-high temporal resolution and extreme dynamic range. Recent works have introduced it to video super-resolution (VSR) to enhance flow estimation and temporal alignment. In contrast, this paper shifts the focus of event signals from motion refinement to texture enhancement in VSR. We propose EvTexture++, the first event-driven framework dedicated to texture enhancement in VSR. It leverages high-frequency spatiotemporal details from events to improve texture recovery. EvTexture++ incorporates a customized texture enhancement branch, along with an iterative texture enhancement module that progressively exploits high-temporal-resolution event information for texture restoration. This enables gradual refinement of texture regions across iterations, yielding more accurate and detailed high-resolution outputs. Besides intra-frame texture recovery, large motions could degrade inter-frame temporal consistency, particularly in texture regions, leading to texture flickering. To mitigate this, we further exploit the continuous-time motion cues of events to enhance temporal consistency, introducing a temporal texture alignment module that estimates event-guided texture-aware flow for precise inter-frame texture alignment. Moreover, EvTexture++ is designed as a plug-and-play tool to flexibly boost the performance of existing VSR models. Experiments on five datasets demonstrate that EvTexture++ achieves state-of-the-art performance. When integrated into recent VSR models, it yields significant improvements, with gains of up to 1.55 dB in PSNR on the texture-rich Vid4 dataset. Code: https://github.com/DachunKai/EvTexture.


2505.01869 2026-06-12 cs.CV 版本更新

Visual enhancement and 3D representation for underwater scenes: a review

水下场景的视觉增强与三维表示：综述

Guoxi Huang, Haoran Wang, Brett Seymour, Evan Kovacs, John Ellerbroc, Dave Blackham, Nantheera Anantrasirichai

发表机构 * Visual Information Laboratory, University of Bristol（布里斯托尔大学视觉信息实验室）； Submerged Resources Center, National Park Service（美国国家公园管理局水下资源中心）； Marine Imaging Technologies, LLC（海洋成像技术有限公司）； Gates Underwater Products, Inc（盖茨水下产品公司）； Esprit film and television Ltd（Esprit影视有限公司）

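AI summary: Reviews underwater visual enhancement and 3D reconstruction methods, from physical models to non-learning and data-driven techniques (such as NeRF and 3D Gaussian Splatting), evaluates a range of algorithms on benchmark datasets, and identifies future research directions.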

AI中文摘要

水下视觉增强（UVE）和水下三维重建由于水生环境中复杂的成像条件，在计算机视觉和基于AI的任务中面临重大挑战。尽管开发了许多增强算法，但涵盖UVE和水下三维重建的全面系统性综述仍然缺失。为了推动这些领域的研究，我们从多个角度进行了深入综述。首先，我们介绍了基本的物理模型，强调了挑战传统技术的特殊性。我们调查了专门为水下场景设计的视觉增强和三维重建的先进方法。本文评估了从非学习方法到先进数据驱动技术（包括神经辐射场和3D高斯溅射）的各种方法，讨论了它们在处理水下失真方面的有效性。最后，我们在多个基准数据集上对最先进的UVE和水下三维重建算法进行了定量和定性评估。最后，我们指出了水下视觉未来发展的关键研究方向。

英文摘要

Underwater visual enhancement (UVE) and underwater 3D reconstruction pose significant challenges in computer vision and AI-based tasks due to complex imaging conditions in aquatic environments. Despite the development of numerous enhancement algorithms, a comprehensive and systematic review covering both UVE and underwater 3D reconstruction remains absent. To advance research in these areas, we present an in-depth review from multiple perspectives. First, we introduce the fundamental physical models, highlighting the peculiarities that challenge conventional techniques. We survey advanced methods for visual enhancement and 3D reconstruction specifically designed for underwater scenarios. The paper assesses various approaches from non-learning methods to advanced data-driven techniques, including Neural Radiance Fields and 3D Gaussian Splatting, discussing their effectiveness in handling underwater distortions. We then conduct both quantitative and qualitative evaluations of state-of-the-art UVE and underwater 3D reconstruction algorithms across multiple benchmark datasets. Finally, we highlight key research directions for future advancements in underwater vision.


2509.25787 2026-06-12 cs.CV 版本更新

Self-Evolving Vision-Language Models for Image Quality Assessment via Voting and Ranking

自进化视觉语言模型用于图像质量评估：基于投票与排序

Wen Wen, Tianwu Zhi, Kanglong Fan, Yang Li, Xinge Peng, Yabin Zhang, Yiting Liao, Junlin Li, Li Zhang

发表机构 * City University of Hong Kong（香港城市大学）； ByteDance Inc.（字节跳动公司）

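AI summary: Proposes the EvoQuality framework, which generates pseudo-labels through self-consistency and iteratively improves a VLM's image quality perception with group relative policy optimization; without supervision, it surpasses supervised methods on multiple IQA benchmarks.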

Comments Published as a conference paper at ICLR 2026


AI中文摘要

旧法新用：经典降维方法用于高效显著性引导的生物特征攻击检测

Samuel Webster, Walter Scheirer

发表机构 * University of Notre Dame（圣母大学）

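AI summary: Proposes generating saliency maps directly from training data with classical dimensionality reduction methods such as PCA and LDA, with no human annotation; across five biometric attack detection domains, the approach exceeds baselines and, in some cases, state-of-the-art saliency methods.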

Comments 16 pages (8 main, 2 references, 6 appendix), 4 figures (3 main, 1 appendix), 13 tables (3 main, 10 appendix)


AI中文摘要

显著性引导训练是一种视觉识别范式，鼓励模型在学习过程中关注最相关的图像区域。尽管其在生物特征呈现攻击检测（PAD）中的应用在鲁棒性和泛化性方面显示出显著优势，但由于现有显著性获取方法（如有限数据集上的人工标注）成本高、领域特异性强且可扩展性有限，其采用往往受到限制。我们提出了一种新颖、成本效益高且高度可扩展的显著性获取方法，使用受经典降维技术PCA和LDA启发的图。我们提出的方法直接从原始训练数据生成显著性图，无需人工标注或领域知识。我们在三个显著性探索领域（虹膜PAD、合成人脸检测、指纹PAD）中情境化这些显著性源的有效性，并在两个显著性新颖领域（指纹静脉PAD和身份证PAD）中展示了其可扩展性。在所有测试领域中，使用降维来源的显著性图训练的模型在没有任何资源投入或特定领域工具的情况下，超过了基线甚至有时是最先进的显著性方法。我们的发现克服了显著性引导训练在生物特征攻击检测及更广泛领域中一个重要但尚未解决的障碍。

英文摘要

Saliency-guided training is a paradigm in visual recognition that encourages models to focus on the most relevant image regions during learning. While its application in biometric presentation attack detection (PAD) has shown strong benefits in robustness and generalization, adoption is often limited by the high cost, domain specificity, and limited scalability of existing saliency acquisition methods, such as human annotations over a limited dataset. We present a novel, cost-efficient, and highly-scalable approach to saliency acquisition using maps inspired by classical dimensionality reduction techniques: PCA and LDA. Our proposed methods generate saliency maps directly from raw training data, requiring no human annotation nor domain knowledge. We contextualize the effectiveness of these saliency sources in three saliency-explored domains (iris PAD, synthetic face detection, fingerprint PAD) and demonstrate its scalability in two saliency-novel domains (fingerprint vein PAD and ID card PAD). Across all domains tested, models trained using dimensionality reduction-sourced saliency maps exceed baseline and sometimes SOTA saliency methods without any resource investment or domain-specific tooling. Our findings overcome an important yet unaddressed barrier to saliency-guided training for biometric attack detection and beyond.


2606.12655 2026-06-12 cs.CR cs.CV 交叉投稿

Amnesia: A Stealthy Replay Attack on Continual Learning Dreams

Amnesia: 一种针对持续学习梦境的重放隐蔽攻击

Ahmed Sharshar, Naveen Kumar Kummari, Mohsen Guizani

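AI summary: Proposes the Amnesia attack, which controls only replay index selection to maximize the performance degradation of continual learning models under audit constraints, exposing index-level replay control as a threat.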

AI中文摘要

持续学习（CL）模型常使用经验重放来减少灾难性遗忘，但其对重放采样干扰的鲁棒性尚未充分探索。现有的CL攻击会改变输入或训练流程（投毒/后门），且很少包含明确的审计约束，限制了真实性。这里，审计性意味着监控者可以通过检查采样器可见的遥测数据（例如，记录的重放索引/标签统计）来验证合规性，即检查实现的重放类别直方图是否接近名义基线，以及重放率在每个批次和/或滚动窗口内是否不变。我们研究了一个权限受限的内部人员，其仅控制重放索引选择，而不控制像素、标签或模型参数，同时保持在审计限制内（如队列优先级）。我们提出了Amnesia，一种重放组合攻击，在两种预算下最大化性能下降：可见性预算δ，限制与名义类别直方图p0的TV/KL散度；以及质量预算f，固定重放率。Amnesia有两个步骤：（i）计算轻量级类别效用（如EMA损失或置信度），将p0向有害类别倾斜；（ii）使用高效的KL（指数倾斜）或TV（平衡质量重分配）优化器将倾斜投影回δ-球内。窗口调度器强制执行滚动审计。在具有挑战性的CL基准测试和强重放基线中，Amnesia持续降低最终准确率（ACC）并恶化反向迁移（-BWT）。KL变体在多种审计方案（包括每批次和滚动窗口检查）下实现高影响且基本未被检测到。TV变体更具破坏性但更易检测，尤其是在严格的每类别约束下。这些结果揭示了仅索引重放控制是CL系统中一个实用且可审计的威胁面，并建立了原则性的影响-可见性权衡。

英文摘要

Continual learning (CL) models often use experience replay to reduce catastrophic forgetting, but their robustness to replay sampling interference remains underexplored. Existing CL attacks alter inputs or training pipelines (poisoning/backdoors) and rarely include explicit auditable constraints, limiting realism. Here, auditability means a monitor can verify compliance from sampler-visible telemetry - e.g., logged replay index/label statistics - by checking that the realized replay class histogram stays close to a nominal baseline and that replay rate is unchanged per batch and/or over a rolling window. We study a limited-privilege insider who controls only replay index selection, not pixels, labels, or model parameters, while staying within auditable limits such as queue priorities. We introduce Amnesia, a replay composition attack that maximizes degradation under two budgets: a visibility budget delta bounding the TV/KL divergence from a nominal class histogram p0, and a mass budget f fixing the replay rate. Amnesia has two steps: (i) compute lightweight class utilities, such as EMA loss or confidence, to tilt p0 toward harmful classes; and (ii) project the tilt back into the delta-ball using efficient KL (exponential tilt) or TV (balanced mass redistribution) optimizers. A windowed scheduler enforces rolling audits. Across challenging CL benchmarks and strong replay baselines, Amnesia consistently lowers final accuracy (ACC) and worsens backward transfer (-BWT). The KL variant delivers high impact while remaining largely undetected under multiple audit schemes, including per-batch and rolling-window checks. The TV variant is more damaging but easier to detect, especially under tight per-class constraints. These results expose index-only replay control as a practical, auditable threat surface in CL systems and establish a principled impact-visibility trade-off.
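
The KL variant's two steps (tilt the nominal class histogram p0 toward high-utility classes, then stay inside the delta-ball) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the bisection search over the tilt strength `lam`, and the `lam_max` cap are assumptions made for illustration. It relies on the fact that, for an exponential tilt, KL(p || p0) does not decrease as `lam` grows, so the largest admissible tilt can be found by bisection. The mass budget f is untouched here because only the class mix changes, not the number of replayed samples.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions; q must be positive where p is."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def exponential_tilt(p0, utility, lam):
    """Reweight p0 by exp(lam * utility) and renormalise."""
    m = max(utility)  # subtract the max for numerical stability
    w = [pi * math.exp(lam * (u - m)) for pi, u in zip(p0, utility)]
    z = sum(w)
    return [wi / z for wi in w]

def tilt_within_kl_budget(p0, utility, delta, lam_max=50.0, iters=60):
    """Strongest tilt (found by bisection on lam) with KL(p || p0) <= delta.

    lam_max and iters are illustrative defaults, not values from the paper.
    """
    if kl(exponential_tilt(p0, utility, lam_max), p0) <= delta:
        return exponential_tilt(p0, utility, lam_max)
    lo, hi = 0.0, lam_max
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if kl(exponential_tilt(p0, utility, mid), p0) <= delta:
            lo = mid  # still inside the budget: tilt harder
        else:
            hi = mid  # budget exceeded: back off
    return exponential_tilt(p0, utility, lo)
```

For example, `tilt_within_kl_budget([0.25] * 4, [0.1, 0.9, 0.3, 0.2], delta=0.05)` shifts replay mass toward the second class while keeping the divergence from the uniform baseline within 0.05.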


SalArt-VQA: 诊断VLM是否理解生成图像中的显著伪影

Xiaoxiao Sun, Ruotian Zhang, Junzhe Huang, James Burgess, Serena Yeung-Levy

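AI summary: Proposes the SalArt-VQA benchmark, which uses 950 images and 3,681 multiple-choice questions to evaluate how well VLMs understand artifacts in generated images along four dimensions (detection, localization, spatial grounding, and defect identification), revealing failure modes hidden behind high detection accuracy.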

Comments 23 pages, 7 figures, 7 tables. Dataset: https://huggingface.co/datasets/salartvqa/SalArt-VQA


VietFashion：面向文化服饰的草图-文本组合图像检索基准

Hoang-Nguyen Cao, Le-Hoang Bui, Dinh-Khoi Vo, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science, Ho Chi Minh City, Vietnam（胡志明市理科大学）； Vietnam National University, Ho Chi Minh City, Vietnam（越南国家大学胡志明市分校）

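AI summary: Proposes the VietFashion benchmark for the Ao Dai, a traditional Vietnamese garment, which combines hand-drawn sketches and text descriptions for multi-target retrieval and reveals the shortcomings of existing methods in fine-grained cultural semantics and cross-modal composition.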

Comments ICMR 2026. Project page: https://hng0303.github.io/VietFashion


DOI: 10.1145/3805622.3810590

AI中文摘要

文化服饰对视觉检索系统提出了独特挑战，因为其身份往往依赖于标准AI模型难以捕捉的微妙结构和符号细节。我们引入VietFashion，一个以越南传统服饰奥黛为中心的草图-文本组合图像检索新基准。VietFashion使设计师和研究人员能够通过手绘草图（传达服装结构）和文本描述（编码文化语义）的组合来检索具有文化意义的服装。数据集初始包含650张草图，并通过生成模型扩展至超过21,000张带有对齐标题的照片级真实图像。文本提示描述了详细的服装属性，这些属性从时尚杂志中提取以确保真实性和多样性。为了更好地反映设计意图固有的模糊性，VietFashion采用多目标检索设置，其中单个查询可能对应多个有效结果。我们建立了标准化的评估协议，并对最先进的组合图像检索方法进行了基准测试。实验结果表明，在建模细粒度文化语义和多模态组合方面存在显著性能差距，使VietFashion成为细粒度时尚检索的一个具有挑战性的基准。数据集公开于：https://hng0303.github.io/VietFashion

英文摘要

Cultural garments pose a unique challenge for visual retrieval systems, as their identity often depends on subtle structural and symbolic details that are poorly captured by standard AI models. We introduce VietFashion, a new benchmark for sketch-text composed image retrieval centered on the Ao Dai, a traditional Vietnamese garment. VietFashion enables designers and researchers to retrieve culturally meaningful outfits using a combination of hand-drawn sketches, which convey garment structure, and textual descriptions, which encode cultural semantics. The dataset is initialized with 650 sketches and expanded using generative models to produce over 21,000 photorealistic images with aligned captions. Textual prompts describe detailed outfit attributes, which are extracted from fashion magazines to ensure authenticity and diversity. To better reflect the inherent ambiguity of design intent, VietFashion adopts a multi-target retrieval setting, where a single query may correspond to multiple valid results. We establish standardized evaluation protocols and benchmark state-of-the-art composed image retrieval methods. Experimental results reveal significant performance gaps in modeling fine-grained cultural semantics and multi-modal composition, positioning VietFashion as a challenging benchmark for fine-grained fashion retrieval. The dataset is publicly available at: https://hng0303.github.io/VietFashion.


2606.13496 2026-06-12 cs.CV 新提交

Budget-Constrained Step-Level Diffusion Caching

预算约束的步骤级扩散缓存

Mingkun Lei, Tong Zhao, Liangyu Yuan, Chi Zhang

发表机构 * Westlake-AGI-Lab（西湖大学AGI实验室）

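AI summary: Proposes BudCache, which optimizes the cache policy under a fixed compute budget through offline search (simulated annealing plus hill climbing) and introduces cache-aware schedule alignment to improve diffusion model generation quality.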

Comments Accepted by ICML 2026


AI中文摘要

步骤级缓存通过利用去噪步骤间的时间冗余来加速扩散模型。现有方法使用基于阈值的启发式方法进行每步缓存决策，没有直接优化最终输出质量。因此，它们的推理延迟随输入变化，在部署时难以控制。在这项工作中，我们提出了BudCache，它反转了这一公式：不是让每步误差阈值决定运行成本，而是预先固定计算预算，并搜索最能保留最终输出的缓存策略。为了应对步骤选择的组合复杂性，我们将模拟退火与确定性爬山相结合。这种离线搜索在几分钟内找到高质量的缓存策略，并且在推理过程中不引入在线搜索或阈值开销。当计算预算非常紧张时，我们进一步引入缓存感知调度对齐，它使时间离散化适应所选的缓存策略，以减少缓存引起的轨迹不匹配。在FLUX.1-dev和Wan2.1上的实验表明，在相同推理预算下，BudCache比启发式缓存基线实现了更好的生成质量。代码可在以下网址获取：https://github.com/Westlake-AGI-Lab/BudCache

英文摘要

Step-level caching accelerates diffusion models by exploiting temporal redundancy across denoising steps. Existing methods make per-step cache decisions using threshold-based heuristics, without directly optimizing for final output quality. As a result, their inference latency varies across inputs and is difficult to control at deployment. In this work, we propose BudCache, which inverts this formulation: rather than letting per-step error thresholds dictate the runtime cost, we fix the compute budget in advance and search for the cache policy that best preserves the final output. To tackle the combinatorial complexity of step selection, we combine Simulated Annealing with deterministic Hill Climbing. This offline search identifies high-quality cache policies within minutes and introduces no online search or thresholding overhead during inference. When the compute budget is very tight, we further introduce cache-aware schedule alignment, which adapts the time discretization to the selected cache policy to reduce cache-induced trajectory mismatch. Experiments on FLUX.1-dev and Wan2.1 show that BudCache achieves better generation quality than heuristic caching baselines under the same inference budgets. Code is available at https://github.com/Westlake-AGI-Lab/BudCache
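
The budget-first search described above can be sketched on a toy problem. This is an illustrative sketch, not the BudCache code: `toy_cost` is a hypothetical stand-in for the real objective (which compares final model outputs), and the swap move, the linear cooling schedule, and all function names are assumptions. A policy is a 0/1 list over denoising steps, where 1 means "run the full model" and 0 means "reuse the cache"; every move swaps one computed step with one cached step, so the number of full computations never leaves the budget.

```python
import math
import random

def toy_cost(policy, weights):
    """Hypothetical stand-in for output error: each cached step costs its
    weight times the number of steps since the last full computation."""
    cost, gap = 0.0, 0
    for computed, w in zip(policy, weights):
        gap = 0 if computed else gap + 1
        cost += w * gap
    return cost

def swap(policy, i, j):
    """Return a copy with steps i and j exchanged (keeps the budget fixed)."""
    new = list(policy)
    new[i], new[j] = new[j], new[i]
    return new

def anneal(policy, cost_fn, iters=2000, t0=1.0, seed=0):
    """Simulated annealing over budget-preserving swaps."""
    rng = random.Random(seed)
    cur, cur_cost = list(policy), cost_fn(policy)
    best, best_cost = cur, cur_cost
    for k in range(iters):
        temp = t0 * (1.0 - k / iters) + 1e-9  # linear cooling (assumption)
        # Step 0 always stays computed, so only indices >= 1 are swapped.
        on = [i for i in range(1, len(cur)) if cur[i]]
        off = [i for i in range(1, len(cur)) if not cur[i]]
        if not on or not off:
            break
        cand = swap(cur, rng.choice(on), rng.choice(off))
        cand_cost = cost_fn(cand)
        accept_worse = rng.random() < math.exp(min(0.0, cur_cost - cand_cost) / temp)
        if cand_cost <= cur_cost or accept_worse:
            cur, cur_cost = cand, cand_cost
            if cur_cost < best_cost:
                best, best_cost = cur, cur_cost
    return best

def hill_climb(policy, cost_fn):
    """Deterministic refinement: take the best improving swap until none is left."""
    cur, cur_cost = list(policy), cost_fn(policy)
    while True:
        best_nb, best_nb_cost = None, cur_cost
        for i in range(1, len(cur)):
            for j in range(1, len(cur)):
                if cur[i] and not cur[j]:
                    nb = swap(cur, i, j)
                    c = cost_fn(nb)
                    if c < best_nb_cost:
                        best_nb, best_nb_cost = nb, c
        if best_nb is None:
            return cur
        cur, cur_cost = best_nb, best_nb_cost

def search_policy(num_steps, budget, cost_fn, seed=0):
    """Offline search: anneal from a front-loaded policy, then hill-climb."""
    init = [1] * budget + [0] * (num_steps - budget)
    return hill_climb(anneal(init, cost_fn, seed=seed), cost_fn)
```

Because the search runs offline and returns a fixed policy, inference would only need to look up whether a given step is computed or cached, which is the property the abstract highlights.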


2606.12595 2026-06-12 cs.LG cs.AI cs.CV 交叉投稿

Emerging Flexible Designs for Geospatial Multimodal Foundation Models

地理空间多模态基础模型的新兴灵活设计

Philipe Dias, Waqwoya Abebe, Abhishek Potnis, Aristeidis Tsaris, Dan Lu, Xiao Wang, Dalton Lunga

发表机构 * Oak Ridge National Laboratory（橡树岭国家实验室）

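AI summary: Systematically compares geospatial foundation models with different architectures, evaluating their flexibility and performance in a unified setting to offer design guidance for multimodal reasoning.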

AI中文摘要

基础模型通过跨多样未标记地理空间模态的可扩展预训练，正在迅速改变地球观测。然而，其架构多样性——从编码器-only到编码器-解码器以及掩码自编码范式——使得以一致方式评估性能权衡变得具有挑战性。在这项工作中，我们对领先的、专为地理空间多模态推理设计的基础模型架构进行了同类比较，特别关注不同光谱波段配置下的灵活性。我们使用相同的自监督学习目标和训练数据集标准化预训练，并在GEOBench基准测试上，在一致参数化下评估所有模型的分类和分割任务。我们的结果为模型灵活性、模态对齐和下游任务性能之间的设计权衡提供了新见解。通过强调受控条件下的架构优势和局限性，本研究为构建能够进行鲁棒多模态推理的下一代地理空间基础模型提供了实用指导。

英文摘要

Foundation models are rapidly transforming Earth observation by enabling scalable pretraining across diverse unlabeled geospatial modalities. However, their architectural diversity, ranging from encoder-only to encoder-decoder and masked autoencoding paradigms, makes it challenging to assess performance trade-offs in a consistent manner. In this work, we present an apples-to-apples comparison of leading FM architectures designed for geospatial multimodal reasoning, with a particular focus on flexibility across varied spectral band configurations. We standardize pretraining using identical self-supervised learning objectives and training datasets, and evaluate all models under consistent parameterization on the GEOBench benchmark across classification and segmentation tasks. Our results offer new insights into the design trade-offs between model flexibility, modality alignment, and downstream task performance. By highlighting architectural strengths and limitations under controlled conditions, this study provides practical guidance for building next-generation geospatial foundation models capable of robust multimodal reasoning.


2606.12913 2026-06-12 cs.LG cs.CV 交叉投稿

Selecting Samples on Graphs: A Unified Dataset Pruning Framework for Lossless Training Acceleration

图上的样本选择：用于无损训练加速的统一数据集剪枝框架

Dongyue Wu, Zilin Guo, Xiaoyu Li, Jiajia Liu, Jingdong Chen, Nong Sang, Changxin Gao

发表机构 * University of Science and Technology of China（中国科学技术大学）

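AI summary: Proposes a unified graph-based dataset pruning framework that models the dataset as a weighted graph, selects samples via the maximum weight clique problem, and solves it with a greedy algorithm; it outperforms existing methods across pruning ratios and cuts training time on ImageNet-1k by over 40% without losing accuracy.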

Comments ICML 2026


AI中文摘要

现代训练数据集的快速增长显著增加了计算成本，促使数据集剪枝（DP）方法仅保留信息量丰富的样本子集以减少训练成本。现有的剪枝标准通常依赖于评估样本独立性的内在信号或通过成对关系促进多样性的外在信号。虽然在其特定领域有效，但每种方法仅捕捉样本效用的一方面，且在不同剪枝比例或数据分布下缺乏鲁棒性。在这项工作中，我们提出了一个统一的基于图的DP框架。通过将数据集建模为加权图，其中节点权重编码内在价值，边权重编码外在价值，DP可以转化为最大权重团问题（MWCP）。尽管MWCP是NP难的，但其结构允许基于样本边际增益的原则性贪心解法。在几个温和条件下，我们进一步证明该统一目标具有形式化的近似保证，适用于广泛的度量族，并提供了实用设计指南。大量实验表明，我们的方法优于现有DP方法，同时显著降低训练成本，在ImageNet-1k上使用ResNet-50时，训练时间减少超过40%且不损失精度。

英文摘要

The rapid growth of modern training datasets has significantly increased computational cost, motivating dataset pruning (DP) methods, which retain only a subset of informative samples to reduce training cost. Existing pruning criteria typically rely on either intrinsic signals that assess samples independently or extrinsic signals that promote diversity via pairwise relations. While effective in their own specific regimes, each captures only one aspect of sample utility and lacks robustness across different pruning ratios or data distributions. In this work, we present a unified graph-based DP framework. By modeling the dataset as a weighted graph, where node weights encode intrinsic value and edge weights encode extrinsic value, DP can be cast as a Maximum Weight Clique Problem (MWCP). Although MWCP is NP-hard, its structure admits a principled greedy solution based on sample-wise marginal gains. Under a few mild conditions, we further prove that this unified objective enjoys a formal approximation guarantee, which applies to a broad family of importance metrics and provides practical design guidelines. Extensive experiments show that our method outperforms existing DP methods while substantially reducing training cost, reducing training time by over 40% without sacrificing accuracy on ImageNet-1k with ResNet-50.
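
A greedy selection by sample-wise marginal gain can be sketched as follows. This is a hedged illustration rather than the paper's algorithm: the objective (sum of node weights over the kept set plus sum of edge weights over all kept pairs), the function name `greedy_prune`, and the toy weights are assumptions. The gain of adding a sample is its own node weight plus its edge weights to everything already selected, so gains can be updated incrementally after each pick.

```python
def greedy_prune(node_w, edge_w, keep):
    """Pick `keep` samples greedily by marginal gain.

    node_w[i]    : intrinsic value of sample i (e.g. a difficulty score)
    edge_w[i][j] : extrinsic value of keeping i and j together
                   (e.g. a dissimilarity); assumed symmetric
    Objective (an assumption): node weights of the kept set plus the
    edge weights over all kept pairs.
    """
    n = len(node_w)
    gain = list(node_w)            # marginal gain of adding each sample
    selected, chosen = [], [False] * n
    for _ in range(min(keep, n)):
        best = max((i for i in range(n) if not chosen[i]), key=lambda i: gain[i])
        chosen[best] = True
        selected.append(best)
        for j in range(n):         # adding `best` changes every other gain
            if not chosen[j]:
                gain[j] += edge_w[best][j]
    return selected
```

With `node_w = [1.0, 0.9, 0.8]` and edge weights that mark samples 0 and 1 as near-duplicates (weight 0.0) but sample 2 as distinct from both (weight 1.0), keeping two samples selects 0 and then 2: the slightly less valuable but more diverse sample beats the redundant one.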


2606.13223 2026-06-12 cs.LG cs.CV 交叉投稿

Distributional Loss for Robust Classification

分布损失用于鲁棒分类

Kathleen Anderson, Thomas Martinetz

发表机构 * Institute for Neuro- and Bioinformatics（神经与生物信息学研究所）

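AI summary: Proposes a distributional loss based on a bimodal Gaussian distribution; by softening the targets, it implicitly captures class ambiguity, mitigates overfitting, and makes decision boundaries more robust, with especially notable gains in low-data settings.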

Comments ICANN 2026

2606.13461 2026-06-12 cs.LG cs.CV 交叉投稿

Reinforcement Learning for Neural Model Editing

神经模型编辑的强化学习

Shaivi Malik

发表机构 * Shaivi Malik

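AI summary: Formalizes neural model editing as a reinforcement learning problem in which editing policies are learned from reward feedback, with good results on bias mitigation and machine unlearning tasks.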

AI中文摘要

编辑预训练神经网络需要针对特定目标定制的专用算法。设计此类算法通常耗时且需要大量精力。我们提出了一个探索性框架，将神经模型编辑形式化为强化学习问题，其中智能体使用奖励反馈修改模型。我们引入了两个环境：MaskWorld，其中智能体以乘法方式缩放权重；以及ShiftWorld，其中智能体应用加法权重更新。奖励函数结合了效用保持目标和任务特定编辑目标，使智能体能够在保持整体模型性能的同时学习有针对性的修改。我们在文本分类中的偏见缓解和图像分类中的机器遗忘上评估了该框架，这两者传统上都依赖于专用算法。我们的结果表明，在遗忘任务中，学习到的策略将遗忘集准确率降至接近0%，同时保留集准确率保持在90%以上。在偏见缓解设置中，学习到的策略将偏见相关性能提高了5%以上，同时保持了一般分类效用。我们的发现表明，神经模型编辑可以转化为强化学习问题，从而可以从奖励反馈中学习编辑策略，而不是为每个任务手动设计。

英文摘要

Editing pretrained neural networks requires specialized algorithms tailored to specific objectives. Designing such algorithms is often time-consuming and demands significant effort. We present an exploratory framework that formulates neural model editing as a reinforcement learning problem, where agents modify models using reward feedback. We introduce two environments: MaskWorld, where agents scale weights multiplicatively, and ShiftWorld, where agents apply additive weight updates. The reward function combines a utility-preservation objective with a task-specific editing objective, enabling agents to learn targeted modifications while maintaining overall model performance. We evaluate the framework on bias mitigation in text classification and machine unlearning in image classification, both of which traditionally rely on specialized algorithms. Our results show that the learned policies reduce forget set accuracy to nearly 0% while preserving over 90% retain set accuracy on the unlearning task. In the bias mitigation setting, the learned policies improve bias-related performance by more than 5% while maintaining general classification utility. Our findings show that neural model editing can be cast as a reinforcement learning problem, allowing editing policies to be learned from reward feedback rather than manually engineered for each task.
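
The two action types and the combined reward can be illustrated with a minimal sketch. The abstract only states that MaskWorld scales weights multiplicatively, that ShiftWorld adds updates, and that the reward mixes utility preservation with a task-specific objective; the function names, the flat weight list, the linear reward form, and the coefficients `alpha` and `beta` below are all assumptions for the unlearning case.

```python
def mask_world_step(weights, action):
    """MaskWorld-style edit: scale each weight by the agent's factor."""
    return [w * a for w, a in zip(weights, action)]

def shift_world_step(weights, action):
    """ShiftWorld-style edit: add the agent's update to each weight."""
    return [w + a for w, a in zip(weights, action)]

def unlearning_reward(retain_acc, forget_acc, alpha=1.0, beta=1.0):
    """Hypothetical reward: keep retain-set accuracy high (utility term)
    and push forget-set accuracy toward zero (editing term)."""
    return alpha * retain_acc + beta * (1.0 - forget_acc)
```

Under this assumed reward, a policy that reaches 90% retain accuracy and 0% forget accuracy scores higher than one that leaves the model untouched at 95% and 95%, which is the direction of the trade-off the abstract reports.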


2512.12571 2026-06-12 cs.CV 版本更新

Measurement Plasticity: Sensor-Level Adaptation for Vision-Language Models

测量塑性：面向视觉-语言模型的传感器级自适应

Boyeong Im, Wooseok Lee, Yoojin Kwon, Hyung-Sin Kim

发表机构 * University of Seoul（首尔大学）

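AI summary: Proposes multi-view physical prompts (MVP) for test-time adaptation, using the camera exposure triangle (ISO, shutter speed, aperture) as physical prompts to adapt at the sensor level without gradients or model modification; it outperforms digital methods on ImageNet-ES.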

Comments Accepted to the ICML 2026 Workshop on Continual Adaptation at Scale

2603.10834 2026-06-12 cs.CV cs.AI 版本更新

On the Reliability of Cue Conflict and Beyond

论线索冲突的可靠性及其超越

Pum Jun Kim, Seung-Ah Lee, Seongho Park, Dongyoon Han, Jaejun Yoo

发表机构 * Ulsan National Institute of Science and Technology（蔚山科学技术院）； College of Medicine, Hanyang University（汉阳大学医学院）； NAVER AI Lab（NAVER AI实验室）

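AI summary: To address the unstable and ambiguous shape-texture preference estimates of existing cue-conflict benchmarks, proposes the REFINED-BIAS dataset and evaluation framework, which uses explicit definitions of shape and texture, balanced cue pairs, and a ranking-based metric for more reliable and interpretable bias diagnosis.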

Comments Shape-Texture Bias, Cue Conflict Benchmark


AI中文摘要

理解神经网络如何依赖视觉线索提供了其内部决策过程的人类可解释视角。线索冲突基准在探究形状-纹理偏好以及激发更强、类人形状偏差通常与改进的域内性能相关的见解方面具有影响力。然而，我们发现当前基于风格化的实例化可能产生不稳定和模糊的偏差估计。具体来说，风格化可能无法可靠地实例化感知上有效且可分离的线索，也无法控制其相对信息量；基于比率的偏差可能掩盖绝对线索敏感性；将评估限制在预选类别可能忽略完整决策空间而扭曲模型预测。这些因素共同可能将偏好与线索有效性、线索平衡和可识别性伪影混淆。我们引入了REFINED-BIAS，一个用于可靠和可解释的形状-纹理偏差诊断的集成数据集和评估框架。REFINED-BIAS使用形状和纹理的显式定义构建平衡的、人类和模型可识别的线索对，并通过基于排序的度量测量完整标签空间上的线索特定敏感性，从而实现更公平的跨模型比较。在不同的训练范式和架构中，REFINED-BIAS实现了更公平的跨模型比较、更忠实的形状和纹理偏差诊断以及更清晰的实证结论，解决了先前线索冲突评估无法可靠区分的矛盾。

英文摘要

Understanding how neural networks rely on visual cues offers a human-interpretable view of their internal decision processes. The cue-conflict benchmark has been influential in probing shape-texture preference and in motivating the insight that stronger, human-like shape bias is often associated with improved in-domain performance. However, we find that the current stylization-based instantiation can yield unstable and ambiguous bias estimates. Specifically, stylization may not reliably instantiate perceptually valid and separable cues nor control their relative informativeness, ratio-based bias can obscure absolute cue sensitivity, and restricting evaluation to preselected classes can distort model predictions by ignoring the full decision space. Together, these factors can confound preference with cue validity, cue balance, and recognizability artifacts. We introduce REFINED-BIAS, an integrated dataset and evaluation framework for reliable and interpretable shape-texture bias diagnosis. REFINED-BIAS constructs balanced, human- and model- recognizable cue pairs using explicit definitions of shape and texture, and measures cue-specific sensitivity over the full label space via a ranking-based metric, enabling fairer cross-model comparisons. Across diverse training regimes and architectures, REFINED-BIAS enables fairer cross-model comparison, more faithful diagnosis of shape and texture biases, and clearer empirical conclusions, resolving inconsistencies that prior cue-conflict evaluations could not reliably disambiguate.
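
The remark that a ratio-based bias can obscure absolute cue sensitivity can be made concrete with a small sketch. It uses the commonly used cue-conflict definition of shape bias (shape-consistent decisions divided by all shape- or texture-consistent decisions); the helper names and the counts are made up for illustration, and the ranking-based metric of REFINED-BIAS itself is not reproduced here.

```python
def shape_bias(shape_hits, texture_hits):
    """Fraction of cue-consistent decisions that follow the shape cue."""
    total = shape_hits + texture_hits
    return shape_hits / total if total else float("nan")

def cue_hit_rate(shape_hits, texture_hits, num_images):
    """Fraction of all images where the prediction matched either cue."""
    return (shape_hits + texture_hits) / num_images

# Two hypothetical models evaluated on 1000 cue-conflict images.
model_a = (400, 100)   # matches a cue on half of the images
model_b = (40, 10)     # matches a cue on only 5% of the images
```

Both hypothetical models have a shape bias of 0.8, yet model B recognises either cue on only 5% of the images, so the ratio alone says little about how sensitive it is to shape or texture in absolute terms.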

2603.14482 2026-06-12 cs.CV version update

V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, Adrien Bardes

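Institutions: FAIR at Meta; Universidad de Zaragoza

AI summary: Presents V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for images and videos through a dense predictive loss, deep self-supervision, multi-modal tokenizers, and effective scaling, achieving state-of-the-art performance on several benchmarks.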

Abstract

We present V-JEPA 2.1, a family of self-supervised models that learn dense, high-quality visual representations for both images and videos while retaining strong global scene understanding. The approach combines four key components. First, a dense predictive loss uses a masking-based objective in which both visible and masked tokens contribute to the training signal, encouraging explicit spatial and temporal grounding. Second, deep self-supervision applies the self-supervised objective hierarchically across multiple intermediate encoder layers to improve representation quality. Third, multi-modal tokenizers enable unified training across images and videos. Finally, the model benefits from effective scaling in both model capacity and training data. Together, these design choices produce representations that are spatially structured, semantically coherent, and temporally consistent. Empirically, V-JEPA 2.1 achieves state-of-the-art performance on several challenging benchmarks, including 7.71 mAP on Ego4D for short-term object-interaction anticipation and 40.8 Recall@5 on EPIC-KITCHENS for high-level action anticipation, as well as a 20-point improvement in real-robot grasping success rate over V-JEPA-2 AC. The model also demonstrates strong performance in robotic navigation (5.687 ATE on TartanDrive), depth estimation (0.307 RMSE on NYUv2 with a linear probe), and global recognition (77.7 on Something-Something-V2). These results show that V-JEPA 2.1 significantly advances the state of the art in dense visual understanding and world modeling.

2605.01391 2026-06-12 cs.CV version update

VISTA: Video Interaction Spatio-Temporal Analysis Benchmark

Alejandro Aparcedo, Akash Kumar, Aaryan Garg, Dalton Pham, Wen-Kai Chen, Anirudh Bharadwaj, Aman Chadha, Yogesh Rawat

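Institutions: University of Central Florida; BITS Pilani; Ho Chi Minh City University of Science; Amazon GenAI Project

AI summary: Proposes the VISTA benchmark, which decomposes videos into entities, actions, and relations to evaluate open-set, multi-entity, multi-action spatio-temporal understanding, revealing biases that traditional metrics obscure.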

Comments Accepted to CVPR 2026 Workshop on Pixel-level Video Understanding in the Wild (PVUW)

Abstract

Existing benchmarks for Vision-Language Models (VLMs) primarily evaluate spatio-temporal understanding on simple single-action videos, closed attribute sets and restricted entity types, failing to capture the freeform, multi-action interactions between diverse entities which characterize real-world video understanding. Furthermore, the lack of a systematic framework for analyzing model failures across complementary spatio-temporal axes hinders comprehensive evaluation. To address these gaps, we introduce VISTA, a Video Interaction Spatio-Temporal Analysis benchmark designed for open-set, multi-entity and multi-action spatio-temporal understanding in VLMs. VISTA decomposes videos into interpretable entities, their associated actions, and relational dynamics, enabling multi-axis diagnostics and unified assessment of relational, spatial, and temporal understanding. Our benchmark integrates multiple datasets into a single interaction-aware taxonomy and comprises ~12K curated video-query pairs spanning diverse scenes and complexities. We systematically evaluate 11 state-of-the-art VLMs on VISTA, and break down aggregate performance across our taxonomy to reveal shortcomings and pronounced spatio-temporal biases obscured by traditional metrics. By providing detailed, taxonomy-driven diagnostics on a challenging dataset, VISTA offers a nuanced framework to guide advances in model design, pretraining strategies, and evaluation protocols. Overall, VISTA is the first, large-scale, interaction-aware diagnostic benchmark for spatio-temporal understanding in VLMs.

2304.13836 2026-06-12 cs.LG cs.AI cs.CV stat.ME version update

On Pitfalls of $\textit{RemOve-And-Retrain}$: Data Processing Inequality Perspective

Junhwa Song, Keumgang Cha, Junghoon Seo

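Institutions: KAIST

AI summary: Exposes flaws in the ROAR benchmark from an information-theoretic perspective: data-independent post-processing can raise ROAR scores, leading to misjudgment of how informative attribution maps are, and a blurriness bias is identified.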

Comments Accepted at the 2026 ICML Workshop on Mechanistic Interpretability

2606.12988 2026-06-12 cs.CV cs.AI new submission

A Machine Learning Framework for Real-Time Personalized Ergonomic Pose Analysis

Manex Atxa, Bruno Simoes, Julen Balzategui

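Institutions: Vicomtech Foundation; Basque Research and Technology Alliance (BRTA)

AI summary: Proposes a method for real-time prediction of ergonomic and non-ergonomic poses from 3D volumetric video data, combining multi-angle analysis of 3D point clouds with a personalized deep learning classifier to overcome fixed-viewpoint occlusion and enable real-time assessment.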

Comments 13 pages, 7 figures, conference 24CMH

Abstract

This paper introduces a new methodology for real-time prediction of ergonomic and non-ergonomic human poses using volumetric video data in three dimensions. Although the methodology was designed for ergonomic assessments, it can be adapted to other applications requiring real-time analysis of human posture. One aspect that makes this system stand out is its ability to analyze 3D point clouds during the assessment, enabling computation from multiple angles. This overcomes a critical limitation of cameras, which often provide a fixed viewpoint, thereby restricting the data available for a thorough postural evaluation, especially when occlusions occur. The system continuously and automatically performs pose inference using the chosen perspective on the real-time streaming data; however, only the poses manually selected and labeled by the user are used to train the personalized deep learning classifier. The methodology has been refined through a case study in which RGB-D cameras captured subjects performing load-lifting tasks, enabling real-time skeletal labeling. The model was trained on this data and, following the training phase, performs inference on new streaming data in real time. This research offers a scalable and pragmatic approach for real-time ergonomic evaluation by combining state-of-the-art 3D data technologies and traditional 2D pose estimation algorithms. It addresses the increasing need for safety and health monitoring in workplace environments, marking a notable contribution to the domain.

2606.13239 2026-06-12 cs.SE cs.AI cs.CL cs.CV cross-list

ComAct: Reframing Professional Software Manipulation via COM-as-Action Paradigm

Jiaxin Ai, Tao Hu, Xuemeng Yang, Shu Zou, Hairong Zhang, Daocheng Fu, Yu Yang, Hongbin Zhou, Nianchen Deng, Pinlong Cai, Zhongyuan Wang, Botian Shi, Kaipeng Zhang, Licheng Wen

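AI summary: Proposes the COM-as-Action paradigm, which recasts professional software interaction as deterministic program synthesis to address the fragility of GUI agents and the heterogeneity of API agents; builds the ComCADBench benchmark and the self-correcting agent ComActor, achieving state-of-the-art performance on industrial CAD software.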

Abstract

Existing computer-use agents remain fundamentally limited in professional software manipulation: GUI-based agents suffer from fragile visual grounding and long-horizon error accumulation, while API-based approaches struggle with heterogeneous protocols and inaccessible commercial interfaces. In this work, we identify the Component Object Model (COM) as a unified executable abstraction, proposing COM-as-Action: a new paradigm that reframes professional software interaction as deterministic program synthesis rather than sequential visual control. To validate this paradigm in the most demanding environments, we introduce ComCADBench, the first benchmark for agents operating real industrial CAD software. Our experiments reveal a substantial paradigm gap: frontier proprietary models achieve near-zero success under GUI-based interaction, whereas COM-based execution yields substantial immediate gains. To bridge the remaining gap between syntactic correctness and geometric accuracy, we develop ComActor, a self-correcting agent trained through a progressive three-stage framework, alongside ComForge, a scalable platform for large-scale training in Windows containers. Extensive experiments show that ComActor achieves state-of-the-art performance on ComCADBench, with strong resilience in long-horizon tasks where baselines collapse, and generalizes to external CAD benchmark.

2606.13368 2026-06-12 cs.AI cs.CV cross-list

IterCAD: An Iterative Multimodal Agent for Visually-Grounded CAD Generation and Editing

Tao Hu, Jiaxin Ai, Licheng Wen, Xueheng Li, Shu Zou, Siqi Li, Nianchen Deng, Xinyu Cai, Hongbin Zhou, Pinlong Cai, Daocheng Fu, Yu Yang, Hairong Zhang, Botian Shi, Xuemeng Yang

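Institutions: University of Science and Technology of China

AI summary: Proposes IterCAD, a multimodal agent framework for closed-loop, interactive CAD generation and editing, optimized with progressive SFT and geometry-aware reinforcement learning, which significantly outperforms existing methods in code executability and geometric precision.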

Abstract

Computer-Aided Design is pivotal in modern manufacturing, yet existing automated methods predominantly rely on open-loop, one-shot generation, creating a mismatch with iterative real-world practices. In this paper, we present IterCAD, a unified multimodal agent framework for closed-loop, interactive CAD generation and editing. We formulate the task as a multi-turn interaction between a multimodal agent and an executable CAD sandbox, covering three tasks: Drawing-to-Code, Text-to-Code, and Interactive Editing. To support this, we develop a data synthesis pipeline incorporating advanced industrial manufacturing features to generate standard-compliant multi-view engineering drawings, complex code-editing tasks, and high-fidelity interaction trajectories. We optimize the agent via progressive SFT followed by geometry-aware reinforcement learning with viable-prefix masking to enhance code executability and geometric fidelity. Finally, we introduce the IterCAD-Bench evaluation suite and propose the Chamfer Distance Tolerance-Recall (CD-TR) curve alongside its AUC-TR metric, establishing a survivor-bias-free standard that unifies code validity and geometric precision. Extensive experiments demonstrate that IterCAD achieves highly competitive performance across multiple benchmarks, significantly outperforming existing approaches in both code executability and geometric precision, while exhibiting superior capabilities in closed-loop iterative refinement.
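
The abstract does not define the CD-TR curve or AUC-TR, so the following is only a rough sketch of how a tolerance-recall curve of this kind might be computed. It assumes that recall at a tolerance is the share of all test samples whose Chamfer distance is within that tolerance, that samples with non-executable code are counted as misses (represented here as `np.inf`), and that the area is normalized by the tolerance range. The function names and numbers are hypothetical.

```python
import numpy as np


def tolerance_recall_curve(chamfer_distances, tolerances):
    """Recall at each tolerance: share of ALL samples with distance <= tolerance.

    Assumption: samples whose generated code failed to execute are passed in
    as np.inf, so they count as misses at every tolerance instead of being
    dropped from the denominator.
    """
    distances = np.asarray(chamfer_distances, dtype=float)
    taus = np.asarray(tolerances, dtype=float)
    # Compare every tolerance against every sample, then average over samples.
    return (distances[None, :] <= taus[:, None]).mean(axis=1)


def auc_tr(recall, tolerances):
    """Trapezoidal area under the curve, normalized by the tolerance range."""
    recall = np.asarray(recall, dtype=float)
    taus = np.asarray(tolerances, dtype=float)
    area = np.sum((recall[1:] + recall[:-1]) / 2.0 * np.diff(taus))
    return float(area / (taus[-1] - taus[0]))


# Hypothetical distances for four samples; the third one did not execute.
distances = [0.0, 0.5, np.inf, 1.5]
tolerances = [0.0, 1.0, 2.0]

recall = tolerance_recall_curve(distances, tolerances)
score = auc_tr(recall, tolerances)
print(recall, score)
```

Keeping failed samples in the denominator is one plausible reading of "survivor-bias-free": a model cannot raise its score by producing fewer but cleaner executable programs.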

2507.22791 2026-06-12 cs.CV version update

Modality-Aware Feature Matching in Visual and Vision-Language Applications: A Comprehensive Survey

Weide Liu, Wei Zhou, Jun Liu, Ping Hu, Jun Cheng, Jungong Han, Weisi Lin

Institutions: School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics; College of Computing and Data Science, Nanyang Technological University; School of Computer Science and Informatics, Cardiff University; School of Computing and Communications, Lancaster University; School of Computer Science and Engineering, University of Electronic Science and Technology of China; Institute for Infocomm Research, Agency for Science, Technology and Research (A*STAR); Department of Automation, Tsinghua University

AI summary: Surveys modality-based feature matching, covering traditional handcrafted methods and modern deep learning approaches, with a focus on advances across RGB, depth, 3D point cloud, LiDAR, medical image, and vision-language modalities, highlighting modality-aware techniques.

Comments CSUR

Abstract

Feature matching is a cornerstone task in computer vision, essential for applications such as image retrieval, stereo matching, 3D reconstruction, and SLAM. This survey comprehensively reviews modality-based feature matching, exploring traditional handcrafted methods and emphasizing contemporary deep learning approaches across various modalities, including RGB images, depth images, 3D point clouds, LiDAR scans, medical images, and vision-language interactions. Traditional methods, leveraging detectors like Harris corners and descriptors such as SIFT and ORB, demonstrate robustness under moderate intra-modality variations but struggle with significant modality gaps. Contemporary deep learning-based methods, exemplified by detector-free strategies like CNN-based SuperPoint and transformer-based LoFTR, substantially improve robustness and adaptability across modalities. We highlight modality-aware advancements, such as geometric and depth-specific descriptors for depth images, sparse and dense learning methods for 3D point clouds, attention-enhanced neural networks for LiDAR scans, and specialized solutions like the MIND descriptor for complex medical image matching. Cross-modal applications, particularly in medical image registration and vision-language tasks, underscore the evolution of feature matching to handle increasingly diverse data interactions.

2509.21398 2026-06-12 cs.CV eess.IV version update

Action Without Interaction: Probing the Physical Foundations of Video LMMs via Contact-Release Detection

Daniel Harari, Michael Sidorov, Chen Shterental, Liel David, Abrham Kahsay Gebreselasie, Muhammad Haris Khan

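Institutions: Weizmann Institute of Science; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Examines how deeply video LMMs ground their semantic understanding in actual visual input, using contact-release detection to reveal shortcomings in physical grounding.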

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 workshop on Cognitive Foundations for Multimodal Models (CogVL)

Abstract

Large multi-modal models (LMMs) show increasing performance in realistic visual tasks for images and, more recently, for videos. For example, given a video sequence, such models are able to describe in detail objects, the surroundings and dynamic actions. In this study, we explored the extent to which these models ground their semantic understanding in the actual visual input. Specifically, given sequences of hands interacting with objects, we asked models when and where the interaction begins or ends. For this purpose, we introduce a first-of-its-kind, large-scale dataset with more than 20K annotated interactions on videos from the Something-Something-V2 dataset. 250 AMTurk human annotators labeled core interaction events, particularly when and where objects and agents become attached ('contact') or detached ('release'). We asked SoTA LMMs, including GPT, Gemini and Qwen, to locate these events in short videos, each with a single event. The results show that while models reliably name target objects and identify actions, they exhibit a form of 'shortcut learning' where semantic success masks a failure in physical grounding. Specifically, they consistently fail to identify the frame where the interaction begins or ends and poorly localize the physical event within the scene. This disconnect suggests that while LMMs excel at System 1 intuitive pattern recognition (naming the action and objects), they lack the System 2 cognitive foundations required to reason about physical primitives like 'contact' and 'release', and hence to truly ground dynamic scenes in physical reality.

2507.11936 2026-06-12 cs.CL cs.AI cs.CV cs.LG version update

A Survey of Deep Learning for Geometry Problem Solving

Jianzhe Ma, Wenxuan Wang, Qin Jin

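Institutions: Renmin University of China

AI summary: Surveys the application of deep learning to geometry problem solving, covering related tasks, methods, evaluation metrics, and future directions, with the aim of providing a practical reference to advance the field.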

Comments ACL 2026 Main Conference

Abstract

Geometry problem solving, a crucial aspect of mathematical reasoning, is vital across various domains, including education, the assessment of AI's mathematical abilities, and multimodal capability evaluation. The recent surge in deep learning technologies, particularly the emergence of multimodal large language models, has significantly accelerated research in this area. This paper presents a survey of the applications of deep learning in geometry problem solving, including (i) a comprehensive summary of the relevant tasks in geometry problem solving; (ii) a thorough review of related deep learning methods; (iii) a detailed analysis of evaluation metrics and methods; and (iv) a critical discussion of state-of-the-art performance, existing challenges, and promising future directions. Our objective is to offer a comprehensive and practical reference of deep learning for geometry problem solving, thereby fostering further advancements in this field. We maintain a list of relevant papers: https://github.com/majianz/dl4gps.

2508.03721 2026-06-12 cs.CV eess.IV version update

Enhancing Diameter Measurement Accuracy in Machine Vision Applications

Ahmet Gokhan Poyraz, Ahmet Emir Dirik, Hakan Gurkan, Mehmet Kacmaz

Institutions: Department of Electrical and Electronics Engineering, Bursa Technical University; Doğu Pres R&D; Department of Computer Engineering, Bursa Uludağ University; Institute of Electrical Information Technology, Clausthal University of Technology

AI summary: Proposes two new methods that use multiple reference parts to improve measurement accuracy, reducing error through conversion factors and pixel information; experiments show errors falling from 13-114 micrometers to 1-2 micrometers.

Comments Preprint

DOI: 10.1016/j.measurement.2026.121646
Journal ref: Measurement 278 (2026) 121646

Abstract

In camera measurement systems, specialized equipment such as telecentric lenses is often employed to measure parts with narrow tolerances. However, despite the use of such equipment, measurement errors can occur due to mechanical and software-related factors within the system. These errors are particularly evident in applications where parts of different diameters are measured using the same setup. This study proposes two innovative approaches to enhance measurement accuracy using multiple known reference parts: a conversion factor-based method and a pixel-based method. In the first approach, the conversion factor is estimated from known references to calculate the diameter (mm) of the unknown part. In the second approach, the diameter (mm) is directly estimated using pixel-based diameter information from the references. The experimental setup includes an industrial-grade camera and telecentric lenses. Tests conducted on glass samples (1-12 mm) and metal workpieces (3-24 mm) show that measurement errors, which originally ranged from 13-114 micrometers, were reduced to 1-2 micrometers using the proposed methods. By utilizing only a few known reference parts, the proposed approach enables high-accuracy measurement of all parts within the camera's field of view. Additionally, this method enhances the existing diameter measurement literature by significantly reducing error rates and improving measurement reliability.
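
The abstract names the two approaches but not how the reference parts are combined, so the sketch below is only one possible reading. It assumes linear interpolation between neighboring references (via `numpy.interp`): the first function interpolates a mm-per-pixel conversion factor, the second interpolates the diameter in mm directly from pixel diameters. All function names and reference values are hypothetical.

```python
import numpy as np


def _sorted_refs(ref_px, ref_mm):
    """Return reference pixel/mm diameters sorted by pixel diameter."""
    ref_px = np.asarray(ref_px, dtype=float)
    ref_mm = np.asarray(ref_mm, dtype=float)
    order = np.argsort(ref_px)  # np.interp needs increasing x values
    return ref_px[order], ref_mm[order]


def diameter_from_conversion_factor(px, ref_px, ref_mm):
    """Conversion-factor reading: interpolate mm/pixel, then scale the pixels."""
    ref_px, ref_mm = _sorted_refs(ref_px, ref_mm)
    factors = ref_mm / ref_px  # mm per pixel at each reference part
    factor = np.interp(px, ref_px, factors)  # clamps outside the reference range
    return float(px * factor)


def diameter_from_pixels(px, ref_px, ref_mm):
    """Pixel-based reading: interpolate the mm diameter directly."""
    ref_px, ref_mm = _sorted_refs(ref_px, ref_mm)
    return float(np.interp(px, ref_px, ref_mm))


# Hypothetical reference parts: measured pixel diameters and known mm diameters.
ref_px = [100.0, 200.0, 400.0]
ref_mm = [1.0, 2.1, 4.4]
unknown_px = 300.0

by_factor = diameter_from_conversion_factor(unknown_px, ref_px, ref_mm)
by_pixels = diameter_from_pixels(unknown_px, ref_px, ref_mm)
print(by_factor, by_pixels)
```

In this toy setup the mm-per-pixel factor drifts with part size, which is why a single global factor would give size-dependent errors; both variants agree exactly at the reference parts and differ slightly in between.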

2505.18060 2026-06-12 cs.CV version update

Semantic Correspondence: Unified Benchmarking and a Strong Baseline

Kaiyan Zhang, Xinghui Li, Jingyi Lu, Kai Han

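Institutions: The University of Hong Kong

AI summary: Presents the first comprehensive survey of semantic correspondence methods, proposing a taxonomy and aggregating results across benchmarks, and introduces a high-performing baseline as a foundation for future research.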

DOI: 10.1109/TPAMI.2025.3640429
Journal ref: IEEE Trans. Pattern Anal. Mach. Intell. 48, no. 3 (2026) 3911-3930

Abstract

Establishing semantic correspondence is a challenging task in computer vision, aiming to match keypoints with the same semantic information across different images. Benefiting from the rapid development of deep learning, remarkable progress has been made over the past decade. However, a comprehensive review and analysis of this task remains absent. In this paper, we present the first extensive survey of semantic correspondence methods. We first propose a taxonomy to classify existing methods based on the type of their method designs. These methods are then categorized accordingly, and we provide a detailed analysis of each approach. Furthermore, we aggregate and summarize the results of methods in literature across various benchmarks into a unified comparative table, with detailed configurations to highlight performance variations. Additionally, to provide a detailed understanding on existing methods for semantic matching, we thoroughly conduct controlled experiments to analyse the effectiveness of the components of different methods. Finally, we propose a simple yet effective baseline that achieves state-of-the-art performance on multiple benchmarks, providing a solid foundation for future research in this field. We hope this survey serves as a comprehensive reference and consolidated baseline for future development. Code is publicly available at: https://github.com/Visual-AI/Semantic-Correspondence.

2412.14631 2026-06-12 cs.CV version update

Review of Fruit Tree Image Segmentation

Il-Seok Oh

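Institutions: Department of Computer Science and Artificial Intelligence/CAIIT, Jeonbuk National University, South Korea

AI summary: Reviews research on segmenting front-view images of fruit trees, notes that existing methods lack a versatile dataset and model, and proposes six future research directions toward a general-purpose segmentation module.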

DOI: 10.3390/agriculture15212239
Journal ref: Agriculture, Volume 15, Issue 21, 2025

Abstract

Fruit tree image segmentation is an essential problem in automating a variety of agricultural tasks such as phenotyping, harvesting, spraying, and pruning. Many research papers have proposed a diverse spectrum of solutions suitable to specific tasks and environments. The review scope of this paper is confined to the front views of fruit trees and based on 158 relevant papers collected using a newly designed crawling review method. These papers are systematically reviewed based on a taxonomy that sequentially considers the method, image, task, and fruit. This taxonomy will assist readers to intuitively grasp the big picture of these research activities. Our review reveals that the most noticeable deficiency of the previous studies was the lack of a versatile dataset and segmentation model that could be applied to a variety of tasks and environments. Six important future research tasks are suggested, with the expectation that these will pave the way to building a versatile tree segmentation module.

1. Multimodal and Vision-Language Models (20 papers)

Analyzing and Improving Fine-grained Preference Optimization in Medical LVLMs

ECA: Efficient Continual Alignment for Open-Ended Image-to-Text Generation

GRIP: Feedback-Guided Prompt Retrieval for Large Multimodal Models

Perceive, Interact, Reason: Building Tool-Augmented Visual Agents for Spatial Reasoning

Language-Guided Abstraction for Visual Reasoning

Bridging Modal Isolation in Interleaved Thinking: Supervising Modality Transitions via Stepwise Reinforcement

Magnifying What Matters: Attention-Guided Adaptive Rendering for Visual Text Comprehension

Objects Before Words: Object-First Inductive Biases for Grounding Language in Child-View Video

LaME: Learning to Think in Latent Space for Multimodal Embedding via Information Bottleneck

Iterative Visual Thinking: Teaching Vision-Language Models Spatial Self-Correction through Visual Feedback

Cross-Modal Masked Compositional Concept Modeling for Enhancing Visio-Linguistic Compositionality

HYDRA-X: Native Unified Multimodal Models with Holistic Visual Tokenizers

SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

AudioX-Turbo: A Unified Framework for Efficient Anything-to-Audio Generation

Iterative Tool Usage Exploration for Multimodal Agents via Step-wise Preference Tuning

LatentLens: Revealing Highly Interpretable Visual Tokens in LLMs

Ex-Omni: Enabling 3D Facial Animation Generation for Omni-modal Large Language Models

PaLMR: Towards Faithful Visual Reasoning via Multimodal Process Alignment

GeoWorld-VLM: Geometry from World Models for Vision-Language Models

MultiToP: Learning to Patch Visual Tokens to Mitigate Hallucinations in Video Large Multimodal Models

2. Embodied AI, Robotics, and Autonomous Driving (19 papers)

Stereo Vision-Based Fall Prediction and Detection using Human Pose Estimation on the AMD Kria K26 SOM

Camera and LiDAR BEV Fusion for Cooperative 3D Object Detection on TUMTraf V2X

Diffusion Transformer World-Action Model for AV Scene Prediction

VISA: VLM-Guided Instance Semantic Auditing for 3D Occupancy World Models

Heterogeneous LiDAR Early Fusion and Learned Re-Ranking Strategy for Robust Long-Term Place Recognition in Unstructured Environments

Measurement-Calibrated Multi-Camera Fusion for Vision-Based Indoor Localization

MaskWAM: Unifying Mask Prompting and Prediction for World-Action Models

SemanticXR: Low Power and Real-time Queryable Semantic Mapping with an Object-Level Device-Cloud Architecture

NavWAM: A Navigation World Action Model for Goal-Conditioned Visual Navigation

SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale

Mana: Dexterous Manipulation of Articulated Tools

From Seeing to Experiencing: Scaling Navigation Foundation Models with Reinforcement Learning

GAE: Unleashing Physical Potential of VLM with Generalizable Action Expert

QueryOcc: Query-based Self-Supervision for 3D Semantic Occupancy

Goal2Pixel: Grounding Goals to Pixels for Vision-Language Navigation

Learning Visually Interpretable Oscillator Networks for Soft Continuum Robots from Video

RGB-S: Image-Aligned Tactile Saliency for Robust Dexterous Manipulation

UniDexTok: A Unified Dexterous Hand Tokenizer from Real Data

DrivingAgent: Design and Scheduling Agents for Autonomous Driving Systems

3. Image Recognition, Retrieval, and Classification (8 papers)

Visual Place Recognition in Forests with Depth-Aware Distillation

TimeLens: On-Device Artifact Recognition with Retrieval-Augmented Question Answering for the Grand Egyptian Museum

Zero-Shot Captioning for Cultural Heritage: Automated Image Analysis of Traditional Indonesian Clothing

Revisiting Vehicle Color Recognition in Long-Tailed Surveillance Scenarios

Why Commodity WiFi Sensors Fail at Multi-Person Gait Identification: A Systematic Analysis Using ESP32

EyeTheia: A Lightweight and Accessible Eye-Tracking Toolbox

Spatially Grounded Concept Bottleneck Models via Part-Factorized Attention

MinhwaNet: Faithful but Insufficient Object Grounding in Korean Folk Painting

4. Object Detection, Segmentation, and Localization (6 papers)

Context-Aware Feature-Fusion for Co-occurring Object Detection in Autonomous Driving

DIMOS: Disentangling Instance-level Moving Object Segmentation

YOLO-AMC: An Improved YOLO Architecture with Attention Mechanisms for Building Crack Detection

SAM-Deep-EIoU: Selective Mask Propagation for Multi-Object Tracking

Towards Effective Waste Segmentation for Automated Waste Recycling in Cluttered Background

Augmentation techniques for video surveillance in the visible and thermal spectral range

5. Video Understanding and Temporal Vision (8 papers)

Dual-State Slot Attention: Decoupling Appearance and Identity for Video Object-Centric Learning

A Multi-Modal Framework with Cross-Subject Pseudo-Labeling and Semantic Alignment for Micro-Gesture Recognition

OR-Action: Multi-Role Video Understanding with Fine-Grained Actions

Person Identification from Contextual Motion

ReFoCUS: Reinforcement-guided Frame Optimization for Contextual Understanding

Periodic-MAE: Periodic Video Masked Autoencoder for rPPG Estimation

Appearance-Invariant Detection of Suggestive Motion via Laban Movement Descriptors

CACR: Reinforcing Temporal Answer Grounding in Instructional Video via Candidate-Aware Causal Reasoning

6. Generative Vision and World Models (19 papers)

HairPort: In-context 3D-aware Hair Import and Transfer for Images

High-Fidelity Two-Step Image Generation via Teacher-Aligned End-to-End Distillation

TetherCache: Stabilizing Autoregressive Long-Form Video Generation with Gated Recall and Trusted Alignment

SeamEdit: A Black-Box VLM-Agnostic Pipeline for Large-Image Semantic Editing

DuET: Dual Expert Trajectories for Diffusion Image Editing

ReFree: Towards Realistic Co-Speech Video Generation via Reward-Free RL and Multilevel Speech Guidance

MagPlus: Bridging Micro-to-Regular Facial Expressions through Learnable Magnification

SmartFont: Dynamic Condition Allocation for Few-Shot Font Generation

OmniDirector: General Multi-Shot Camera Cloning without Cross-Paired Data

Edit the Bits, Diff the Codes: Bitwise Residual Editing for Visual Autoregressive Models

Modality Forcing for Scalable Spatial Generation

JSCGC: Joint Source-Channel-Generation Coding for Wireless Generative Communications

Towards More General Control of Diffusion Models Using Jeffrey Guidance