arXivDaily: arXiv daily academic digest, updated Monday to Friday

1. Multimodal and Vision-Language Models (42 papers)

2606.14728 2026-06-16 cs.CV New submission

FUSE: Quantifying Uncertainty in Vision-Language Models by Bayesian Fusing Epistemic and Aleatoric Uncertainty

Harry Zhang, Luca Carlone

Affiliation: Massachusetts Institute of Technology

AI summary: Proposes FUSE, a probabilistic framework that uses Bayesian fusion to combine aleatoric and epistemic uncertainty in vision-language models into a scalar uncertainty measure, which reliably predicts output correctness and achieves SOTA uncertainty calibration.

Abstract

Vision-language models (VLMs) are playing an increasingly important role across multiple domains. In many applications, such as robotics, it is crucial to quantify the uncertainty in the output of these models. We develop FUSE, a probabilistic framework for capturing two complementary sources of uncertainty in vision-language modeling: (i) aleatoric embedding-level uncertainty derived from input data vision-language ambiguity, and (ii) epistemic model-level uncertainty estimated from the semantic response diversity of VLMs. Our approach formulates a Bayesian fusion mechanism that analytically combines these uncertainty sources to produce a scalar measure of uncertainty. This measure can be used to reliably predict the model's output correctness for downstream applications. We demonstrate that our method outperforms baselines and achieves SOTA uncertainty calibration.

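2606.14741 2026-06-16 cs.CV cs.LG New submission

HorusEye: Language as Dynamic Attention for Emergency Visual Analysis

Armel Yara

Affiliation: Armel Yara

AI summary: Proposes the HorusEye framework, which uses language feedback to dynamically guide visual analysis; evaluates multiple VLMs in emergency scenarios, finds that the effect of language feedback is model-dependent, and identifies a cropping paradox in thermal imagery.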

Comments 18 pages, 9 figures, 11 tables

Abstract

We introduce HorusEye, Language as Dynamic Attention for Emergency Visual Analysis. Our investigation follows five stages. The first stage builds the RefCOCO-Degraded benchmark, a dataset of 15,244 images (3,811 base images x 4 conditions: Clean, Fog, Smoke, and Thermal) with systematic visual degradation. Through four research questions, we evaluate multiple VLMs (Gemini, Qwen2-VL, BLIP-2, LLaVA, Kosmos-2) on visual grounding (the second stage), language feedback recovery (the third), health VQA tasks (the fourth), and hallucination analysis (the final stage). Our key finding is that language feedback effectiveness is model-dependent: Gemini achieves a +47.3% improvement in thermal conditions through iterative language feedback, while Qwen2-VL shows a -5.1% degradation under the same protocol. We also identify the 'Thermal Paradox', in which cropping strategies that improve RGB performance fail catastrophically on thermal imagery. Furthermore, BLIP-2 uniquely hallucinates more under degradation, making it unsuitable for emergency deployment.

2606.14753 2026-06-16 cs.CV cs.AI New submission

Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning

Chiradeep Ghosh, Dakshina Ranjan Kisku

Affiliation: Department of Computer Science and Engineering, National Institute of Technology Durgapur, Durgapur, India

AI summary: Proposes a probabilistic Transformer based on a Gaussian Mixture Model and the EM algorithm that reduces self-attention complexity from quadratic to linear, achieving efficient image captioning on Flickr30K.

Comments 8 pages, 8 figures

Abstract

Image captioning is a challenging and significant task that aims to generate coherent and semantically meaningful textual descriptions for given images. Accomplishing it requires a deep understanding of visual content along with the ability to express that understanding in natural language. Despite remarkable progress with transformer-based architectures, existing approaches often suffer from limitations, such as a lack of rich local feature representations and the high computational cost of quadratic self-attention. The proposed model focuses on improving computational efficiency by restructuring the vision transformer architecture. The standard self-attention mechanism in Vision Transformers is replaced with a probabilistic transformer approach based on a Gaussian Mixture Model (GMM), a soft-clustering technique. Instead of computing pairwise attention among all image patches, the model groups similar patches into a fixed number of clusters using an Expectation-Maximization (EM) algorithm. This clustering-based mechanism reduces the computational complexity from quadratic O(n^2) to linear O(nK), where K << n. An autoregressive GPT-based decoder is used for caption generation. The model is evaluated on the Flickr 30K dataset, demonstrating competitive performance and significant improvement over existing work.
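
A minimal sketch of the clustering-based attention idea described in this abstract, assuming isotropic unit-variance Gaussians and taking each patch's output to be a responsibility-weighted mix of cluster means. The function name `gmm_attention` and these modelling choices are illustrative assumptions, not the authors' implementation; the sketch only shows that each EM iteration touches n x K patch-cluster pairs rather than n x n patch pairs.

```python
import math
import random


def gmm_attention(patches, k, iters=5, seed=0):
    """Soft-cluster n patch vectors into k Gaussians with a few EM steps.

    Returns (outputs, responsibilities). Each iteration costs O(n * k * d),
    compared with O(n^2 * d) for pairwise self-attention.
    """
    rng = random.Random(seed)
    n, d = len(patches), len(patches[0])
    means = [list(p) for p in rng.sample(patches, k)]  # initialise from data
    weights = [1.0 / k] * k                            # mixing proportions
    resp = []
    for _ in range(iters):
        # E-step: responsibility of cluster j for patch i (softmax of log-densities).
        resp = []
        for x in patches:
            logits = [
                math.log(weights[j])
                - 0.5 * sum((a - b) ** 2 for a, b in zip(x, means[j]))
                for j in range(k)
            ]
            top = max(logits)
            exps = [math.exp(v - top) for v in logits]
            total = sum(exps)
            resp.append([e / total for e in exps])
        # M-step: update mixing proportions and cluster means.
        for j in range(k):
            mass = sum(r[j] for r in resp)
            weights[j] = max(mass / n, 1e-12)
            if mass > 1e-12:
                means[j] = [
                    sum(r[j] * x[t] for r, x in zip(resp, patches)) / mass
                    for t in range(d)
                ]
    # Each patch attends to k cluster summaries instead of n other patches.
    outputs = [
        [sum(r[j] * means[j][t] for j in range(k)) for t in range(d)]
        for r in resp
    ]
    return outputs, resp
```

Calling `gmm_attention(patches, k=2)` on a list of equal-length float vectors returns one output vector per patch plus the soft assignments, whose rows each sum to 1.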

2606.14758 2026-06-16 cs.CV cs.AI New submission

Disentangling Hallucinations: Orthogonal Semantic Projection for Robust Interpretability

Emirhan Bilgiç, Baptiste Caramiaux, Zhi Yan, Gianni Franchi

Affiliations: U2IS, ENSTA, Institut Polytechnique de Paris; ISIR, Université Sorbonne, Pierre et Marie Curie; AMIAD, Pôle Recherche

AI summary: Addresses semantic hallucination in explanations of vision-language models by proposing the Linear Semantic Attribution (LSA) theoretical framework and introducing Orthogonal Semantic Projection (OSP), which orthogonalizes the query vector to remove interference from shared features and minimize hallucination.

Comments 41 pages in total. 5 figures, and 2 tables in the main paper; 10 figures and 17 tables in the appendix

Abstract

As Vision-Language Models are increasingly deployed in safety-critical applications, the trustworthiness of their explanations becomes crucial. Explainable AI (XAI) methods for Vision-Language Models often suffer from semantic hallucination, where attribution maps highlight prominent image regions even when prompted with incorrect text descriptions (e.g., highlighting a dog when prompted "cat"). Although this problem is widespread, a formal mathematical analysis of XAI methods and CLIP embeddings is largely missing in the literature. We demonstrate that this phenomenon is not specific to a single architecture but is a fundamental consequence of Linear Semantic Leakage in high-dimensional embedding spaces. We propose a unified theoretical framework, Linear Semantic Attribution (LSA), which generalizes across discriminative methods. We introduce OSP, a geometric intervention that utilizes the residual property of OMP to disentangle unique semantic signals from shared concepts. We prove theoretically and demonstrate empirically that OSP minimizes hallucination by orthogonalizing the query vector against distractor concepts, rendering the attribution model blind to shared features while preserving fidelity for correct prompts. Our code is available at: https://github.com/emirhanbilgic/Orthogonal-Semantic-Projection
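
The orthogonalization step mentioned in this abstract can be illustrated with a plain Gram-Schmidt projection. This is a sketch under assumptions: the function names `dot` and `orthogonalize` and the toy 3-dimensional vectors are hypothetical, and the authors' OSP procedure (available at the repository above) may differ in how distractors are chosen and how the residual is used.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(query, distractors, eps=1e-12):
    """Return the part of `query` orthogonal to span(distractors)."""
    # Build an orthonormal basis of the distractor subspace (Gram-Schmidt).
    basis = []
    for vec in distractors:
        r = list(vec)
        for b in basis:
            c = dot(r, b)
            r = [x - c * y for x, y in zip(r, b)]
        norm = dot(r, r) ** 0.5
        if norm > eps:  # skip distractors already spanned by the basis
            basis.append([x / norm for x in r])
    # Subtract the query's projection onto each basis direction.
    residual = list(query)
    for b in basis:
        c = dot(residual, b)
        residual = [x - c * y for x, y in zip(residual, b)]
    return residual


# Toy embeddings: axis 0 is a feature shared by both prompts.
cat = [1.0, 1.0, 0.0]
dog = [1.0, 0.0, 1.0]
cat_unique = orthogonalize(cat, [dog])
```

After the projection, `cat_unique` has zero inner product with `dog`, so any score computed as an inner product with the residual no longer responds to the distractor direction.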

2606.14777 2026-06-16 cs.CV cs.AI New submission

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Dingyu Yao, Junhao Zhou, Chenxu Yang, Chuanyu Qin, Haowen Hou, Zheming Liang, Congcong Wang, Yuhang Cao, Shenglong Ye, Shuai Xie, Shuhuan Gu, Haoyang Huang, Qingyi Si, Nan Duan, Jiaqi Wang

Affiliation: JD.com

AI summary: Proposes a vision-language interaction model that watches continuously and decides on its own whether to respond, and open-sources an 8B-scale model together with a complete deployment system that outperforms existing solutions in six real-world scenarios.

Abstract

Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard. To advance interaction models and their adoption across domains, we make two fully open-sourced contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first VL-interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system.

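2606.14883 2026-06-16 cs.CV cs.LG New submission

Understanding Cross-Modal Contributions in Continual Vision-Language Models: A Theoretical Perspective

Salimeh Sekeh, Mary Wisell

Affiliation: San Diego State University

AI summary: Analyzes cross-modal (vision-language) contributions in continual vision-language models from a theoretical standpoint, proposes a new perspective and validates it experimentally, and shows how task order and task similarity affect contribution robustness, with improved generalization performance.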

Abstract

Continual vision-language models are commonly addressed through sequential fine-tuning; however, although this paradigm enables adaptation to new environments (tasks), it inherently emphasizes the contribution of previously learned environments (tasks) at the expense of the stability required to preserve previously acquired knowledge. While existing approaches have adequately studied continual learning and catastrophic forgetting in vision-language models (VLMs), the theoretical understanding of modality-specific contributions across a sequence of environments remains largely unexplored. In this paper, we present a new theoretical perspective to understand the cross-modal (vision-language) contributions to consecutive environments. We empirically evaluate our theoretical findings on large VLMs and demonstrate their effectiveness in capturing environment-level cross-modal contributions. Our analysis provides deeper insights into continual VLMs, highlighting their contribution robustness to varying task orders and inter-task similarities, and their improved generalization performance.

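2606.15160 2026-06-16 cs.CV cs.LG New submission

DLWM: Diverse Latent World Models for Efficient Multimodal Reasoning

David Huang, Lianlei Shan

Affiliations: University of Toronto; Tsinghua University

AI summary: Proposes the DLWM framework, which combines latent-space reasoning with reinforcement learning and uses diverse latent hypotheses and a resource-aware policy to make multimodal reasoning more efficient, improving accuracy by 2-5 points while cutting memory use by 24%.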

Comments Preprint. 9 pages main text, 15 pages total including appendix, 2 figures

Abstract

Reasoning capabilities of multimodal large language models (MLLMs) have improved considerably in recent years. Existing approaches typically rely on explicit chain-of-thought or continuous latent-space trajectories to enhance multi-step reasoning. However, these methods generally assume that an input admits a single latent interpretation and unfold reasoning along a fixed path or under a uniform computation budget. In real-world multimodal settings, visual observations are often subject to occlusion, blur, viewpoint variation, or semantic ambiguity, giving rise to multiple plausible interpretations. A uniform reasoning strategy not only limits the model's ability to explore multiple hypotheses but also incurs high memory usage and rollout cost. We present DLWM (Diverse Latent World Models), a multimodal reasoning framework that combines latent-space reasoning with reinforcement learning. First, we construct a set of diverse latent world hypotheses in continuous latent space, each capturing a different plausible interpretation of the visual input, and unfold latent reasoning independently on each hypothesis. An orthogonality-based diversity regularizer explicitly prevents hypothesis collapse. Second, we formulate the latent reasoning process as a resource-constrained sequential decision problem and introduce a resource-aware reinforcement learning policy that adaptively allocates computation across hypotheses, dynamically deciding whether to expand, terminate, or merge reasoning paths, thereby substantially reducing memory footprint and improving rollout efficiency. Experiments on multiple multimodal reasoning benchmarks demonstrate that DLWM outperforms existing methods by 2-5 points in accuracy while reducing memory usage by 24%.
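
One common way to write an orthogonality-based diversity regularizer like the one this abstract mentions is to penalize squared cosine similarity between every pair of hypothesis vectors. The sketch below assumes that form; the name `diversity_penalty` and the pairwise-cosine formulation are illustrative choices, not taken from the paper.

```python
import math


def diversity_penalty(hypotheses):
    """Sum of squared cosine similarities over distinct hypothesis pairs.

    The value is 0 when all hypotheses are mutually orthogonal and grows
    as they collapse onto the same direction.
    """
    normed = []
    for h in hypotheses:
        norm = math.sqrt(sum(x * x for x in h))
        normed.append([x / norm for x in h])
    penalty = 0.0
    for i in range(len(normed)):
        for j in range(i + 1, len(normed)):
            cos = sum(a * b for a, b in zip(normed[i], normed[j]))
            penalty += cos * cos
    return penalty
```

Added to a training loss with a small weight, a term of this kind pushes latent hypotheses apart; two collinear hypotheses contribute the maximum pairwise penalty of 1.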

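2606.15651 2026-06-16 cs.CV New submission

Self-Questioning Vision-Language Models: Reinforcement Learning for Compositional Visual Reasoning

Saraswathy Amjith

Affiliation: Massachusetts Institute of Technology

AI summary: Proposes a self-questioning framework that uses GRPO reinforcement learning to train a VLM to decompose questions and answer the sub-questions on its own, improving compositional visual reasoning; validated on CLEVR and A-OKVQA.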

Abstract

Vision-Language Models (VLMs) are AI systems that process both images and text, yet they often struggle with compositional visual reasoning questions that require chaining multiple steps together, such as identifying objects, counting them, and comparing the results. Existing approaches improve this reasoning by training models on human-written step-by-step explanations, but creating these annotations is expensive and difficult to scale. We propose a self-questioning framework that trains a VLM to break visual questions into smaller sub-questions and answer each one before producing a final response, using a reinforcement learning algorithm called Group Relative Policy Optimization (GRPO). The model is never shown examples of how to decompose questions; it discovers this behavior on its own, guided by a reward signal that scores whether the output contains sub-questions and whether the final answer is correct. We apply this framework to a 3-billion-parameter model, training on both synthetic scenes of geometric shapes (CLEVR) and real-world photographs (A-OKVQA). On A-OKVQA, both self-questioning and standard reinforcement learning substantially improve accuracy over the untrained model (52.2% and 51.6%, respectively, vs. 46.8%). We introduce the first self-questioning VLM, which is rewarded not only for the final answer, as in standard RL, but also for generating intermediate sub-questions, enabling it to discover compositional decomposition strategies. These results suggest that teaching AI systems to ask themselves intermediate questions is a promising strategy for complex visual reasoning, particularly when the difficulty of a question warrants explicit step-by-step decomposition.
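
The reward described in this abstract (a format term for sub-questions plus a correctness term) and GRPO's group-relative normalization can be sketched as follows. The output markers `Sub-question` and `Answer:`, the 0.5 format weight, and the function names are hypothetical choices for illustration; the paper's actual reward shaping is not specified in the abstract.

```python
import statistics


def reward(output, gold, format_weight=0.5):
    """Score one sampled output: sub-question format bonus plus correctness."""
    has_subq = "Sub-question" in output  # hypothetical format marker
    if "Answer:" in output:
        final = output.rsplit("Answer:", 1)[-1].strip().lower()
    else:
        final = ""
    correct = final == gold.strip().lower()
    return format_weight * has_subq + 1.0 * correct


def group_relative_advantages(rewards, eps=1e-6):
    """Normalise rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Each prompt is sampled several times, every sample is scored with `reward`, and the normalized advantages weight the policy-gradient update, so outputs that both ask sub-questions and answer correctly are reinforced relative to their siblings in the group.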

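2606.15663 2026-06-16 cs.CV New submission

OneFocus: Enabling Real-World X-ray Security Screening with a Unified Vision-Language Model

Jiali Wen, Hongxia Gao, Litao Li, Yixin Chen, Kaijie Zhang, Qianyun Liu, Xiaoqin Wen

AI summary: To address the difficulty of adapting to new contraband types and the lack of visual understanding in X-ray contraband detection, proposes the MMXray dataset and OneFocus, a unified vision-language model supporting four core tasks (question answering, localization, classification, and image understanding) with state-of-the-art performance.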

Comments 17 pages, 10 figures

Abstract

X-ray contraband detection is critical for security in large-scale logistics and transportation, yet conventional detectors struggle to adapt to emerging contraband types and lack fundamental visual understanding. Vision-language models (VLMs) offer strong generalization but are hindered by the scarcity of high-quality X-ray image-caption data. To bridge this critical gap, we present MMXray, a meticulously curated benchmark of 52,124 image-caption pairs spanning 28 fine-grained classes of X-ray contraband. To enrich MMXray with realistic occlusion patterns, we further introduce CleanDET, a dedicated synthesis dataset containing clean foreground contraband images from 28 categories and background images with diverse density levels, together with AnyContraSyn, a controllable synthesis method designed to operate on CleanDET. We also develop OnePipe, an extensible pipeline for systematic data curation. Built on MMXray, we propose OneFocus, a unified VLM that supports four core tasks: visual question answering, contraband localization, classification, and image understanding. OneFocus achieves state-of-the-art performance in X-ray contraband understanding and demonstrates robust cross-domain generalization, establishing a strong vision-language baseline for security screening.

2606.15765 2026-06-16 cs.CV New submission

Task-Instructed Causal Routing of Vision Foundation Models for Multi-Task Learning

Donghyun Han, Yuseok Bae, Jung Uk Kim, Hyung-Il Kim

Affiliations: Electronics and Telecommunications Research Institute (ETRI); Kyung Hee University; Chonnam National University

AI summary: Proposes the TIGER framework, which uses natural-language task instructions to guide a routing network and, combined with counterfactual causal alignment, coordinates multiple heterogeneous vision foundation models for multi-task dense prediction, surpassing existing methods on NYUD-v2 and Pascal Context.

Comments 17 pages, 6 figures

Abstract

Vision foundation models (VFMs) have demonstrated strong robustness and transferability across a wide range of visual tasks. However, each model typically encodes strong inductive biases shaped by its pre-training objective and data domain, resulting in fragmented yet complementary visual knowledge. As a result, a single model often struggles to capture the diverse visual representations required across multiple dense prediction tasks. To address this limitation, we propose TIGER (Task-Instruction-Guided Expert Routing), a framework that coordinates multiple heterogeneous VFMs for multi-task dense prediction. Instead of naively aggregating expert features, TIGER leverages natural-language task instructions to guide a routing network that assigns token-level expert weights conditioned on task semantics, enabling adaptive integration of complementary expert features. TIGER further introduces a counterfactual loss that aligns routing decisions with each expert's causal contribution by measuring prediction changes when experts are excluded, encouraging more reliable and interpretable routing. We evaluate TIGER on two multi-task dense prediction benchmarks, NYUD-v2 and Pascal Context, where it consistently outperforms recent multi-task learning baselines while keeping all VFMs frozen. These results demonstrate that combining instruction-guided expert routing with counterfactual causal alignment enables effective coordination of heterogeneous vision foundation models.
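
The "prediction change when an expert is excluded" idea in this abstract can be illustrated with a leave-one-out computation on a single token. This is a sketch under assumptions: the fused feature stands in for the prediction, excluded experts are handled by renormalizing the remaining routing weights, and all function names are hypothetical rather than TIGER's actual interface.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(expert_features, logits, keep):
    """Weighted sum of the kept experts' features for one token."""
    w = softmax([logits[i] for i in keep])
    dim = len(expert_features[0])
    return [
        sum(wi * expert_features[i][t] for wi, i in zip(w, keep))
        for t in range(dim)
    ]


def leave_one_out_effects(expert_features, logits):
    """Distance the fused feature moves when each expert is excluded."""
    everyone = list(range(len(expert_features)))
    full = fuse(expert_features, logits, everyone)
    effects = []
    for e in everyone:
        keep = [i for i in everyone if i != e]
        alt = fuse(expert_features, logits, keep)
        effects.append(math.sqrt(sum((a - b) ** 2 for a, b in zip(full, alt))))
    return effects
```

A counterfactual loss in this spirit would compare the routing weights with these effects, rewarding a router that puts more weight on experts whose removal changes the output most. The sketch requires at least two experts.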

2606.15920 2026-06-16 cs.CV New submission

OmniOPSD: Rationale-Privileged On-Policy Self-Distillation for Affective Computing

Zebang Cheng, Shuimu Chen, Boxue Yang, Yuanshen Guan, Jingyi Chen, Zheng Lian, Xiaojiang Peng, Fei Ma, LaiZhong Cui, Qi Tian

Affiliations: Shenzhen University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Tsinghua University; Shanghai Jiao Tong University; University of Science and Technology of China; Shenzhen Technology University; Tongji University; Huawei

AI summary: To address reward sparsity for multimodal large models on complex reasoning tasks, proposes the OmniOPSD framework, which treats frontier-model-generated rationales as privileged teacher-side evidence rather than student imitation targets and provides dense token-level supervision through on-policy self-distillation, achieving the best performance on MER-UniBench with an average score of 84.19.

Abstract

Reinforcement learning for multimodal large language models (MLLMs) is often hindered by severe reward sparsity in complex reasoning tasks. This challenge is particularly pronounced in human-centered scenarios involving states, emotions, intentions, and behaviors, where heterogeneous multimodal signals and subjective human factors make high-quality chain-of-thought (CoT) annotations expensive and difficult to obtain. Although many multimodal datasets provide expert-annotated ground-truth labels, directly using these labels for supervised fine-tuning may encourage shortcut learning in multimodal perception and provides limited transparency for safety-critical human-AI interaction. To address these limitations, we propose OmniOPSD, a Rationale-Privileged On-Policy Self-Distillation framework that uses frontier-generated rationales as teacher-side privileged evidence rather than student imitation targets. OmniOPSD uses frontier-generated evidence-aware rationales only as training-time privileged evidence context for a local teacher. The student samples its own rollout from the original multimodal input, while the rationale-privileged teacher scores the same tokens and provides dense token-level supervision. Thus, the student learns on its own trajectory distribution without directly imitating frontier-model completions, and inference requires no labels, rationales, CoT annotations, or closed-source model access. Experiments on MER-UniBench show that OmniOPSD achieves state-of-the-art performance with an average score of 84.19, and ablations further support the value of rationale-privileged teacher guidance.

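2606.15982 2026-06-16 cs.CV New submission

Mind the Gap: Diagnosing Constraint Discovery Failures in Text-in-Image Editing

Rui Gui

AI summary: Uses a controlled diagnostic on text-in-image editing to study how multimodal models fail to discover unstated, visually dependent constraints, finding that models discover only 46% of constraints on their own versus 94% when the constraints are given explicitly.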

Abstract

A key challenge in multimodal reasoning is determining which visual dependencies become relevant under a specific task, rather than merely recognizing visible content. We study this through edit-induced constraint discovery in text-in-image editing, a controlled diagnostic setting where a local text change can activate secondary consistency constraints: given a valid editing instruction and an image, can a model identify the secondary regions that must also change? Across 461 diagnostic cases, four MLLMs, and 19 constraint subtypes, models recover only 46% case-level macro recall under unguided prompting versus 94% when constraints are explicitly provided, suggesting that a substantial portion of the failure arises when models must decide which unstated dependencies to surface. Oracle-field decomposition shows that case-specific causal explanations are the most effective partial guidance (0.782 recall), above region names (0.610) or type labels (0.646), suggesting that edit-specific causal cues account for much of the oracle gain. A downstream experiment further shows that higher self-discovery recall does not necessarily improve task performance: unverified self-discovery introduces false positives that offset recall gains, motivating precision-aware constraint elicitation.
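
To make the recall-versus-false-positive trade-off in this abstract concrete, the sketch below computes case-level macro recall and macro precision over sets of secondary regions. The data layout (one pair of predicted and gold region sets per case), the function name, and the example region names are assumptions for illustration; the paper's exact scoring protocol is not given in the abstract.

```python
def macro_recall_precision(cases):
    """Average per-case recall and precision over (predicted, gold) set pairs.

    Cases with no gold regions are skipped for recall; cases with no
    predictions are skipped for precision.
    """
    recalls, precisions = [], []
    for predicted, gold in cases:
        predicted, gold = set(predicted), set(gold)
        hits = len(predicted & gold)
        if gold:
            recalls.append(hits / len(gold))
        if predicted:
            precisions.append(hits / len(predicted))
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    return recall, precision


cases = [
    ({"price tag"}, {"price tag", "total"}),   # misses one required region
    ({"date", "weekday"}, {"weekday"}),        # finds it, plus a false positive
]
recall, precision = macro_recall_precision(cases)
```

In this toy example both macro recall and macro precision equal 0.75: the second case reaches full recall only by also flagging a region that should not change, which is the kind of false positive the abstract says can offset recall gains.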

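2606.16067 2026-06-16 cs.CV New submission

Stepwise Token Selection for Efficient Multimodal Large Language Models

Landi He, Shawn Young, Lijian Xu

Affiliation: Shenzhen University of Advanced Technology

AI summary: Proposes a pointer-based stepwise visual token selection method that is trained end to end through a differentiable relaxation and decides dynamically how many tokens to keep, retaining 94.6% of the original accuracy with a 1.88x speed-up while removing 88.9% of visual tokens.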

Abstract

In multimodal large language models (MLLMs), inference cost is largely dominated by the visual token prefix rather than the language backbone, making token reduction a key factor for improving efficiency. Existing approaches typically assign independent importance scores to visual tokens and retain a fixed number of top-ranked tokens, implicitly assuming token independence and a uniform compression ratio across inputs. In this work, we reformulate visual token pruning as a sequential decision-making process. Specifically, we introduce a pointer-style selection mechanism that iteratively chooses informative tokens, conditioning each decision on previously selected ones, and dynamically determines when to stop via a learned termination action. This enables joint optimization of both the selected subset and its size. To enable end-to-end training under standard language modeling objectives, we design a differentiable relaxation based on a variance-preserving noise interpolation scheme, allowing gradients to propagate through the discrete selection process. Extensive experiments on LLaVA-v1.5-7B and Qwen2.5-VL-7B demonstrate that our approach consistently outperforms fixed-ratio baselines across different compression levels. Under aggressive pruning that removes 88.9% of visual tokens, our method preserves 94.6% of the original accuracy while achieving a 1.88x speed-up in prefill latency.
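
The sequential, conditioned selection with a stop decision described in this abstract can be mimicked by a greedy loop. This is only a stand-in under stated assumptions: hand-set relevance scores and a similarity matrix replace the learned pointer network, and a fixed `stop_threshold` replaces the learned termination action. All names are hypothetical.

```python
def select_tokens(relevance, similarity, stop_threshold=0.2, redundancy_weight=1.0):
    """Pick tokens one at a time, conditioning each choice on earlier picks.

    A token's gain is its relevance minus its highest similarity to any
    already-selected token; selection stops when the best gain is too low.
    """
    selected = []
    remaining = set(range(len(relevance)))

    def gain(i):
        redundancy = max((similarity[i][j] for j in selected), default=0.0)
        return relevance[i] - redundancy_weight * redundancy

    while remaining:
        best = max(sorted(remaining), key=gain)
        if gain(best) < stop_threshold:  # stand-in for the learned stop action
            break
        selected.append(best)
        remaining.remove(best)
    return selected


relevance = [0.9, 0.85, 0.3, 0.1]
similarity = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
kept = select_tokens(relevance, similarity)
```

Here token 1 has the second-highest independent score but is nearly a duplicate of token 0, so a top-k rule with k = 2 would keep both, while the conditioned loop keeps tokens 0 and 2 and then stops. The number of kept tokens therefore depends on the input rather than on a fixed ratio.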

2606.16092 2026-06-16 cs.CV cs.AI New submission

VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

Young Rok Jang, Hyesoo Kong, Kyunghwan An, Jae Sub Huh, Gyeonghun Kim, Stanley Jungkyu Choi

Affiliation: LG AI Research

AI summary: Proposes the VinQA dataset and two encoding methods (Page Encoding and Modality Encoding) for generating long-form answers with interleaved visual-element citations; with the M-GroSE evaluation framework and fine-tuned Qwen2.5-VL models, substantially narrows the performance gap to proprietary models.

Comments Accepted to CVPR 2026. Main paper: 5 figures, 4 tables; includes supplementary material

Abstract

Real-world documents combine text with tables, charts, photographs, and diagrams arranged in diverse layouts, yet existing research on multimodal large language models (MLLMs) for document QA predominantly produces text-only responses, underutilizing these visual elements. We introduce VinQA, a dataset for long-form answer generation where cited visual elements are explicitly interleaved with their supporting text and grounded in relevant document pages. To support this task, we study two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms: (1) Page Encoding, which directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units; and (2) Modality Encoding, which parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units. In our experiments, we propose M-GroSE, a multimodal evaluation framework extending GroUSE to assess answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. We additionally report Visual Source F1 to directly measure visual citation accuracy. Although proprietary frontier models still achieve the best overall scores on the VinQA test split, fine-tuning open Qwen2.5-VL models on the training split substantially improves their performance and narrows this gap. Modality Encoding is initially more robust for complex documents with long text, many visual elements, and diverse citation requirements. After training on VinQA, however, Page Encoding reaches a comparable level, competing effectively even without the explicit parsing used in Modality Encoding. Finally, Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.
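
A citation-accuracy metric such as the Visual Source F1 named in this abstract is typically a set-overlap F1 between the visual elements an answer cites and the gold ones. The sketch below assumes that definition; the function name, the element identifiers, and the handling of empty sets are illustrative assumptions, not the paper's specification.

```python
def visual_source_f1(cited, gold):
    """Set-based F1 between cited visual-element IDs and gold IDs."""
    cited, gold = set(cited), set(gold)
    if not cited and not gold:
        return 1.0  # nothing to cite and nothing cited
    hits = len(cited & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(cited)
    recall = hits / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, an answer citing `{"fig1", "table2"}` against gold `{"fig1", "fig3"}` has precision 0.5 and recall 0.5, so its F1 is 0.5.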

2606.16158 2026-06-16 cs.CV cs.CL 新提交

Focus When Necessary: Adaptive Routing and Collaborative Grounding for Training-Free Visual Grounding

必要时聚焦:用于无训练视觉定位的自适应路由与协作定位

Yifan Wang, Peiming Li, Shiyu Li, Zhiyuan Hu, Xiaochen Yang, Wenming Yang, Yang Tang, Zheng Wei

发表机构 * East China University of Science and Technology(华东理工大学) Tsinghua University(清华大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) University of Science and Technology of China(中国科学技术大学)

AI总结 提出LazyMCoT动态框架,通过自适应路由评估不确定性,对简单查询跳过处理,对困难样本利用协作定位模块进行两阶段精炼,在提升推理精度的同时降低平均推理延迟。

详情
AI中文摘要

虽然多模态大语言模型(MLLMs)在跨模态推理方面表现出色,但它们通常难以感知复杂高分辨率图像中的细粒度细节。最近的无训练方法通过图像缩放和局部裁剪来解决这一问题。然而,不加区分地应用这些操作会导致简单查询的计算冗余,并且可能因截断必要的全局上下文或引入无关的背景噪声而降低准确性。为此,我们提出了LazyMCoT,一个动态且无需训练的框架,能够根据样本难度自适应地分配视觉定位工作。该框架具有自适应路由机制,通过单次前向传递的首词统计量来评估预测不确定性。这有效地绕过了置信度高的案例,同时通过保形校准确保困难样本的召回。对于这些具有挑战性的案例,协作定位模块通过两阶段精炼过程,将模型固有的跨模态注意力与外部视觉专家相结合。该精炼过程生成精确的局部显示,以恢复小目标或被遮挡的目标。在多个基准上的大量实验表明,LazyMCoT通过同时提高推理精度和降低平均推理延迟,与基于训练的方法相媲美。我们的代码可在https://github.com/TencentBAC/LazyMCoT获取。

英文摘要

While Multimodal Large Language Models (MLLMs) excel in cross-modal reasoning, they often struggle to perceive fine-grained details in complex high-resolution images. Recent training-free methods address this through image scaling and localized cropping. However, applying these manipulations indiscriminately introduces computational redundancy for simple queries and can degrade accuracy by truncating essential global context or introducing irrelevant background noise. To this end, we propose LazyMCoT, a dynamic and training-free framework that adaptively allocates visual grounding efforts based on sample difficulty. The framework features an Adaptive Routing mechanism that evaluates predictive uncertainty using first-token statistics from a single forward pass. This efficiently bypasses confident cases while ensuring the recall of difficult samples via conformal calibration. For these challenging cases, a Collaborative Grounding module integrates the inherent cross-modal attention of the model with an external visual expert through a two-stage refinement process. This refinement process generates a precise localized display to recover small or occluded targets. Extensive experiments across diverse benchmarks demonstrate that LazyMCoT rivals training-based approaches by simultaneously improving reasoning accuracy and reducing average inference latency. Our code is available at https://github.com/TencentBAC/LazyMCoT.
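The routing idea in this abstract can be illustrated with a minimal sketch. It assumes the first-token statistic is the softmax entropy and that the threshold is chosen by a split-conformal rank rule on calibration samples known to be hard. The function names (`first_token_entropy`, `calibrate_threshold`, `route`) are hypothetical and are not taken from the LazyMCoT repository.

```python
import math

def first_token_entropy(logits):
    """Shannon entropy (in nats) of the softmax over first-token logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0)

def calibrate_threshold(hard_scores, alpha=0.1):
    """Return a threshold t such that a new hard sample has entropy >= t
    with probability at least 1 - alpha, assuming exchangeability with the
    calibration set (assumed rule, not the paper's exact procedure)."""
    s = sorted(hard_scores)
    k = math.floor(alpha * (len(s) + 1))  # finite-sample rank
    if k == 0:
        return float("-inf")  # too few samples: route everything to grounding
    return s[k - 1]

def route(logits, threshold):
    """Send uncertain queries to grounding; answer confident ones directly."""
    return "ground" if first_token_entropy(logits) >= threshold else "answer"
```

Under this assumed rule, lowering `alpha` lowers the threshold, so more queries are routed to the grounding stage in exchange for higher recall of hard samples.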

2606.16193 2026-06-16 cs.CV cs.AI cs.LG 新提交

Cascaded Sparse Autoencoders Learn Multi-Level Visual Concepts in Multimodal LLMs

级联稀疏自编码器在多模态大语言模型中学习多级视觉概念

Yusong Zhao, Hengyi Wang, Tanuja Ganu, Akshay Nambi, Hao Wang

发表机构 * Rutgers University(罗格斯大学) Microsoft Research(微软研究院)

AI总结 提出级联稀疏自编码器(CSAEs),通过在第一级SAE解码器权重上训练第二级SAE来学习层次化视觉概念,避免嵌套或堆叠SAE的缺点,在多个MLLM和数据集上提升了概念层次一致性和干预效果。

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉-语言任务上表现出色,但其内部视觉表示仍难以解释。稀疏自编码器(SAEs)提供了一种可扩展的方式,将密集模型激活分解为稀疏、可解释的特征。然而,现有SAE架构主要恢复扁平特征字典,不太适合显式的多级概念组织。在本文中,我们引入级联稀疏自编码器(CSAEs)用于学习MLLMs中的层次化视觉概念。CSAEs并非嵌套或堆叠SAE稀疏激活码,而是直接在第一个SAE的解码器权重上训练第二个SAE,将学习到的低级特征方向作为高级抽象的输入。这种设计使CSAEs能够学习“概念的概念”,同时避免了嵌套、Matryoshka式层次结构中的共享前缀耦合问题以及简单堆叠SAE的瓶颈。在Qwen3-VL、Gemma-3和LLaVA上的多个视觉数据集上的实验表明,与最先进的SAE基线相比,CSAEs在层次概念一致性方面提高了可解释性。概念引导的结果进一步表明,学习到的概念组支持对MLLM输出进行有效的组级干预。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated strong performance on vision-language tasks, yet their internal visual representations remain difficult to interpret. Sparse Autoencoders (SAEs) provide a scalable way to decompose dense model activations into sparse, interpretable features. However, existing SAE architectures primarily recover flat feature dictionaries and are less suited for explicit multi-level concept organization. In this paper, we introduce cascaded sparse autoencoders (CSAEs) for learning hierarchical visual concepts in MLLMs. Rather than nesting or stacking SAE sparse activation codes, CSAEs train a second-level SAE directly on the decoder weights of the first-level SAE, treating learned low-level feature directions as inputs for higher-level abstraction. This design enables CSAEs to learn "concepts of concepts" while avoiding drawbacks from the shared-prefix coupling of nesting, Matryoshka-style hierarchies and the bottlenecks of naively stacked SAEs. Experiments across Qwen3-VL, Gemma-3, and LLaVA on multiple visual datasets show that CSAEs improve interpretability in terms of hierarchical concept coherence over state-of-the-art SAE baselines. Results on concept steering further demonstrate that the learned concept groups support effective group-level interventions in MLLM outputs.

2606.16198 2026-06-16 cs.CV 新提交

GRACE: Boosting Video MLLMs with Grounded Action-Centric Evidence for Viewer Sentiment Prediction

GRACE: 基于接地动作中心证据增强视频多模态大语言模型用于观众情感预测

Ruoxuan Yang, Tieyuan Chen, Xiaofeng Huang, Haibing Yin, Jun Wang, Xiping Chen, Jun Yin, Xuesong Gao, Weiyao Lin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Hangzhou Dianzi University(杭州电子科技大学) The 52nd Research Institute of China Electronics Technology Group Corporation(中国电子科技集团公司第五十二研究所) Hangzhou Bywin Technology Co., Ltd.(杭州百威科技有限公司) Zhejiang Dahua Technology Co., Ltd.(浙江大华技术股份有限公司) School of Information Science and Engineering, Shandong University(山东大学信息科学与工程学院) Haihe Laboratory of Information Technology Application Innovation(海河信息技术应用创新实验室)

AI总结 提出GRACE框架,通过提取时间有序的主谓宾三元组和视觉实体裁剪,增强视频MLLM对细粒度情感线索的提取与推理,在Pitts数据集上提升Qwen2.5-VL和Qwen3-VL性能。

Comments 13 pages, 5 figures

详情
AI中文摘要

视频广告中的观众情感预测旨在推断观众中引发的潜在情感反应。为了弥合展示内容与感受之间的差距,模型必须从显性的视觉叙事、具体的角色-物体交互和可见的文本线索中推断隐藏的观众情感。然而,标准的多模态大语言模型(MLLMs)通常依赖整体帧表示,这使得这些细粒度的情感相关事件隐式化,并复杂化了精确的情感推理。为了解决这个问题,我们提出了一种基于接地动作中心的证据增强框架,通过引入显式事件结构和局部化视觉证据来增强视频MLLMs的线索提取和理解能力。我们的方法从以动作中心的视频描述中提取时间排序的主语-动词-宾语(SVO)三元组和辅助可见文本线索,将主语和宾语实体作为视觉实体裁剪进行接地,然后使MLLM基于这些提取的结构化线索执行线索增强的情感推理。通过这种方式,动作三元组指定“发生了什么”,而接地的视觉实体裁剪将“谁或什么参与每个事件”锚定到具体的视觉证据上。在Pitts数据集上的实验显示,相对于Qwen2.5-VL和Qwen3-VL基线有持续改进。消融研究、在AdsQA上的跨数据集评估以及在情感聚焦的TVQA子集上的迁移实验进一步支持了我们方法的有效性和泛化能力。

英文摘要

Viewer sentiment prediction in video advertisements aims to infer the latent affective response evoked in the audience. To bridge the gap between what is shown and what is felt, models must deduce hidden viewer emotions from explicit visual narratives, concrete character-object interactions, and visible textual cues. However, standard Multimodal Large Language Models (MLLMs) typically rely on holistic frame representations, which leave these fine-grained, affect-relevant events implicit and complicate precise emotional reasoning. To address this, we propose a grounded action-centric evidence augmentation framework that enhances video MLLMs' clue extraction and comprehension by introducing explicit event structure and localized visual evidence. Our method extracts temporally ordered subject-verb-object (SVO) triplets and auxiliary visible textual cues from action-centric video descriptions, grounds subject and object entities as visual entity crops, and then enables the MLLM to perform clue-enhanced emotional reasoning based on these extracted structured clues. In this way, action triplets specify "what happens", while grounded visual entity crops anchor "who or what participates in each event" to concrete visual evidence. Experiments on the Pitts dataset show consistent improvements over Qwen2.5-VL and Qwen3-VL baselines. Ablation studies, cross-dataset evaluation on AdsQA, and transfer experiments on an emotion-focused TVQA subset further support the effectiveness and generalization of our approach.

2606.16255 2026-06-16 cs.CV 新提交

UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

UniDDT: 使用解耦扩散变换器统一多模态理解与生成

Shuai Wang, Liang Li, Yang Chen, Ruopeng Gao, Yao Teng, Limin Wang

发表机构 * Nanjing University(南京大学) ByteDance Seed(字节跳动Seed) University of Hong Kong(香港大学)

AI总结 提出UniDDT模型,通过噪声ViT编码器统一视觉语义表示,并采用解耦扩散解码器分离扩散与文本解码,平衡多模态理解与生成任务,在多个基准上取得优异性能。

Comments This work was completed in November 2025

详情
AI中文摘要

统一多模态模型(UMMs)已成为通用多模态智能的关键方向,将理解和生成集成到单一框架中。然而,现有的UMMs面临显著挑战:(1)视觉理解与生成任务之间的固有学习冲突,导致两个任务建模次优;(2)不同的理解与生成视觉空间阻碍可扩展性;(3)过度依赖特定任务数据,忽视了文本-图像理解与生成的二元性。为解决这些挑战,我们提出UniDDT,它利用噪声ViT编码器与LLM统一视觉生成和理解任务的语义编码,同时使用独立的扩散解码器将扩散解码与文本解码解耦。借助这种噪声ViT编码器,UniDDT能够利用潜在空间作为统一的视觉表示,实现理解与生成任务之间的无缝兼容。因此,可以平衡生成任务内的可扩展性和理解任务内的语义表达能力。此外,我们从相同的图像-文本对构建双重数据结构,促进生成与理解数据之间的相互依赖,以利用其固有的二元性。大量实验表明,UniDDT实现了多模态理解与生成的有效统一,增强了语义一致性和可扩展性。对于视觉生成任务,我们的UniDDT在GenEval上达到0.87分,DPG总体得分86.9。对于多模态理解任务,我们的UniDDT在MME基准上达到1699.5分,在SEEDbench上总体得分76.5。

英文摘要

Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face prominent challenges: (1) the inherent learning conflicts between visual understanding and generation tasks, leading to suboptimal modeling in both tasks; (2) different understanding and generation visual spaces impeding scalability; (3) over-reliance on task-specific data that neglects the duality of text-image understanding and generation. To address these challenges, we propose UniDDT, which leverages a Noisy ViT encoder along with an LLM to unify semantic encoding for visual generation and understanding tasks, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With this Noisy ViT encoder, UniDDT is able to leverage the latent space as a unified visual representation, enabling seamless compatibility between understanding and generation tasks. Thus, the scalability within the generation tasks and the semantic expressiveness within understanding tasks can be balanced. We also construct dual data structures from the same image-text pairs, fostering interdependence between the generation and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT achieves effective unification of multimodal understanding and generation with enhanced semantic consistency and scalability. For visual generation tasks, UniDDT achieves a GenEval score of 0.87 and a DPG overall score of 86.9. For multimodal understanding tasks, UniDDT achieves a score of 1699.5 on the MME benchmark and an overall score of 76.5 on SEEDbench.

2606.16295 2026-06-16 cs.CV cs.CL 新提交

VisualClaw: A Real-Time, Personalized Agent for the Physical World

VisualClaw:面向物理世界的实时个性化智能体

Haoqin Tu, Jianwen Chen, Zijun Wang, Siwei Han, Juncheng Wu, Hardy Chen, Haonian Ji, Kaiwen Xiong, Jiaqi Liu, Peng Xia, Jieru Mei, Hongliang Fei, Jason Eshraghian, Zeyu Zheng, Yuyin Zhou, Huaxiu Yao, Cihang Xie

发表机构 * UC Santa Cruz(加州大学圣克鲁兹分校) UNC-Chapel Hill(北卡罗来纳大学教堂山分校) Google(谷歌) UC Berkeley(加州大学伯克利分校)

AI总结 提出VisualClaw,一种自进化多模态智能体,通过混合编码和技能进化机制降低部署成本并提升准确性,在多个视频QA基准上实现平均-98%的API成本削减和最高+15.80%的准确率提升。

Comments H. T. and J. C. contribute to this project equally

详情
AI中文摘要

视觉语言模型正作为复杂多模态任务的通用接口。然而,部署仍面临三个差距:VLMs在处理密集视频帧和长提示时通常产生高延迟和成本,智能体框架在部署后保持静态,标准视频QA基准不测试智能体是否能在工具使用工作区内使用视觉证据。我们提出VisualClaw,一个围绕两个原则构建的自进化多模态智能体。首先,混合编码通过级联门过滤信息较少的流式帧,并通过热/冷top-k注入压缩文本技能库,从而降低部署成本。其次,技能进化让智能体从失败中学习:检索的记忆作为直接拼接上下文或引导证据条件化进化器,产生技能库更新以帮助未来问题。在4个视频QA基准上使用2个VLM,VisualClaw相比全帧上传平均降低每问题API成本-98%,相比离线均匀8帧基线降低-25.9%,同时在大多数设置中提升准确率,例如在EgoSchema上使用Gemini 3 Flash平均+3.85%,峰值+15.80%。为解决这一差距,我们整理了VisualClawArena,一个通过严格五阶段流程构建的200场景多模态智能体基准;模型必须使用视频证据、文档、动态更新和工作区内的可执行检查。在VisualClawArena上,相同的框架配合计算机使用智能体后端,相比无进化基线,Codex (GPT-5.5)的宏观准确率提升+2.9%,Claude Code (Sonnet 4.6)提升+3.2%,相比均匀采样基线成本降低-9.5%。这些特性使VisualClaw自然适用于边缘应用,其中级联将1小时流式会话从约3,600次API上传减少到仅5-20次调用,自进化使其成为完美的个性化助手。

英文摘要

Vision language models are serving as general-purpose interfaces for complex multimodal tasks. However, deployment still faces three gaps: VLMs typically incur high latency and cost when processing dense video frames and long prompts, the agent scaffold remains static after deployment, and standard video-QA benchmarks do not test whether agents can use visual evidence inside tool-using workspaces. We present VisualClaw, a self-evolving multimodal agent built around two principles. First, hybrid encoding reduces deployment cost by filtering less informative streaming frames with a cascaded gate and compressing the text skill bank through hot/cold top-k injection. Second, skill evolution lets the agent learn from failures: retrieved memories condition an evolver as direct concatenated context or as guided evidence, producing skill-bank updates that help future questions. Across 4 video-QA benchmarks with 2 VLMs, VisualClaw cuts per-question API cost by an average of 98% versus full-frame upload and by 25.9% versus the offline uniform 8-frame baseline, while boosting accuracy in most settings, e.g., an average +3.85% and a peak +15.80% on EgoSchema with Gemini 3 Flash. To address the benchmark gap, we curate VisualClawArena, a 200-scenario multimodal agentic benchmark built through a strict five-stage pipeline; models must use video evidence, documents, dynamic updates, and executable checks inside a workspace. On VisualClawArena, the same framework with computer-use agent backends improves macro accuracy by +2.9% for Codex (GPT-5.5) and +3.2% for Claude Code (Sonnet 4.6) over no-evolution baselines, with a 9.5% cost reduction compared to the uniform-sampled baseline. These properties make VisualClaw a natural fit for edge applications, where the cascade reduces a 1-hour streaming session from ~3,600 API uploads down to only 5-20 calls and the self-evolution makes it a perfect personalized assistant.

2606.16586 2026-06-16 cs.CV 新提交

LOCUS: Local Visual Cue Search for Enhancing Fine-Grained Perception in Multimodal Large Language Models

LOCUS: 局部视觉线索搜索增强多模态大语言模型的细粒度感知

Zhou Tao, Fang Zhang, Zewen Ding, Shida Wang, Xiaokun Sun, YongXiang Hua, Haoyu Cao, Linli Xu

发表机构 * University of Science and Technology of China(中国科学技术大学) State Key Laboratory of Cognitive Intelligence(认知智能国家重点实验室)

AI总结 提出LOCUS训练框架,通过可验证的局部线索搜索代理任务,使MLLM内化细粒度证据选择,提升定位敏感视觉理解而不改变推理接口。

详情
AI中文摘要

多模态大语言模型(MLLMs)在细粒度视觉感知上仍然不可靠,即使高分辨率输入保留了必要的局部细节。我们将这一限制识别为视觉上下文腐烂:决定性证据可能存在于完整图像中,但在冗余视觉上下文中无法被可靠地选择和利用。我们提出LOCUS(局部视觉线索搜索),一个训练框架,通过可验证的代理任务教会MLLMs内化局部证据搜索。在训练期间,LOCUS提供一个局部裁剪作为视觉线索,并使用基于IoU的奖励优化模型以恢复其在完整图像中的空间支持。视觉线索仅在训练期间使用,保持标准的图像-问题推理接口不变。在细粒度感知、幻觉、一般理解和推理基准上的实验表明,LOCUS改善了定位敏感的视觉理解,同时保留了广泛的能力。注意力分析进一步表明对任务相关证据区域的更强关注,表明训练时的视觉线索搜索为内化的细粒度证据选择提供了有效途径。

英文摘要

Multimodal Large Language Models (MLLMs) remain unreliable on fine-grained visual perception, even when high-resolution inputs preserve the necessary local details. We identify this limitation as visual context rot: decisive evidence may exist in the full image, yet fail to be reliably selected and used amid redundant visual context. We propose LOCUS (LOcal visual CUe Search), a training framework that teaches MLLMs to internalize local evidence search through a verifiable proxy task. During training, LOCUS provides a local crop as a visual cue and optimizes the model to recover its spatial support in the full image using an IoU-based reward. The visual cue is used only during training, leaving the standard image-question inference interface unchanged. Experiments across fine-grained perception, hallucination, general understanding, and reasoning benchmarks show that LOCUS improves localization-sensitive visual understanding while preserving broad capabilities. Attention analyses further indicate stronger focus on task-relevant evidence regions, suggesting that training-time visual cue search provides an effective route to internalized fine-grained evidence selection.
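The IoU-based reward mentioned in this abstract can be made concrete with a short sketch. The abstract states only that the reward is IoU-based, so using the raw IoU between the predicted and the true crop location as the reward is an assumption. Boxes are taken to be `(x1, y1, x2, y2)` tuples, and the function names are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def localization_reward(predicted_box, crop_box):
    """Assumed reward: plain IoU between the prediction and the cue's true location."""
    return iou(predicted_box, crop_box)
```

Because the crop's position in the full image is known at training time, such a reward can be checked automatically, which is what makes the proxy task verifiable.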

2606.16601 2026-06-16 cs.CV 新提交

DifferAD-R1: A Difference-Guided Industrial Anomaly Localization with Multimodal Large Language Models

DifferAD-R1: 基于差异引导的多模态大语言模型工业异常定位

Dingrong Wang, Xian Tao, Zhen Qu, Hengliang Luo, Xinyi Gong, Fei Shen, Zhengtao Zhang, Guiguang Ding

发表机构 * Institute of Automation, Chinese Academy of Sciences (CAS)(中国科学院自动化研究所) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) CASI Vision Technology Co., Ltd.(中科慧远视觉技术有限公司) Shandong Laboratory of Aluminum Advanced Manufacturing in Binzhou (SLAAMB), Binzhou Institute of Technology, Weiqiao-UCAS Science and Technology Park(山东省滨州市铝先进制造实验室(SLAAMB),滨州技术学院,魏桥国科科技园) Space Information Research Institute, Hangzhou Dianzi University(杭州电子科技大学空间信息研究院) School of Software, Tsinghua University(清华大学软件学院)

AI总结 提出DifferAD-R1框架,通过差异引导双图像范式将异常定位转化为一次性差异定位问题,并设计双一致性定位奖励和难度感知策略,在AD-DualDiff数据集上优于现有方法。

Comments Submitted to IEEE Transactions on Circuits and Systems for Video Technology

详情
AI中文摘要

工业异常定位旨在准确识别和定位工业产品中的异常区域,解决实际场景中检测未见缺陷类别的关键挑战。传统的封闭集方法通常跨场景泛化能力差,而现有的基于多模态大语言模型(MLLM)的方法面临两个核心限制:要么采用与定位实际需求不一致的问答式范式,要么依赖标准优化技术如组相对策略优化(GRPO),后者无法为细微缺陷提供有效的学习信号。为解决这些问题,本文提出DifferAD-R1,一种专为工业异常定位设计的MLLM增强强化学习框架。我们设计了一种差异引导的双图像范式,将定位任务重新表述为一次性差异定位问题,以有效探索跨场景异常。针对难以检测的异常,开发了双一致性定位奖励,增强了优化稳定性和鲁棒性。此外,我们整合了难度感知策略,包括自适应重加权和分组重采样,以优先学习困难实例。为促进实际工业环境中的评估,我们构建了AD-DualDiff数据集,包含20个类别的13K对图像。实验结果表明,DifferAD-R1显著优于现有基线,并与大规模模型如Qwen3-VL(235B参数)相比取得了有竞争力的性能。我们的代码公开在:https://github.com/Rong2026/work-1。

英文摘要

Industrial anomaly localization aims to accurately identify and localize abnormal regions in industrial products, addressing the critical challenge of detecting unseen defect categories in real-world scenarios. Traditional closed-set methods often suffer from poor cross-scenario generalization, while existing Multimodal Large Language Model (MLLM)-based approaches face two core limitations: they either adopt QA-style paradigms misaligned with the practical demands of localization, or rely on standard optimization techniques such as Group Relative Policy Optimization (GRPO), which fails to deliver effective learning signals for subtle defects. To tackle these issues, this paper proposes DifferAD-R1, an MLLM-augmented reinforcement learning framework tailored for industrial anomaly localization. We design a Difference-Guided dual-image paradigm, which reformulates the localization task as a one-shot difference grounding problem to effectively explore cross-scenario anomalies. A Dual-Consistency Localization Reward is developed for hard-to-detect anomalies, enhancing optimization stability and robustness. Additionally, we integrate a difficulty-aware strategy with adaptive reweighting and group-wise resampling to prioritize learning on challenging instances. To facilitate evaluations in real-world industrial settings, we construct the AD-DualDiff dataset, comprising 13K paired images across 20 categories. Experimental results demonstrate that DifferAD-R1 significantly outperforms existing baselines and achieves competitive performance compared to large-scale models like Qwen3-VL (235B parameters). Our code is publicly available at: https://github.com/Rong2026/work-1.

2606.16615 2026-06-16 cs.CV 新提交

SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

SUP-MCRL:面向EEG视觉解码的感知主体统一伪特征编码多模态对比表示学习

Shengyu Gong, Weiming Zeng, Yueyang Li, Zijian Kang, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

发表机构 * Lab of Digital Image and Intelligent Computation, Shanghai Maritime University(上海海事大学数字图像与智能计算实验室) Department of Language Science and Technology, The Hong Kong Polytechnic University(香港理工大学语言科学与技术系) Affiliated Lianyungang Hospital of Xuzhou Medical University(徐州医科大学附属连云港医院)

AI总结 提出SUP-MCRL框架,通过语义感知视觉编码器、统一EEG增强器和原型渐进增强器,解决多模态对比学习中语义一致性和主体选择性问题,在THINGS-EEG零样本任务上达到66.0%/91.9%的Top-1/Top-5准确率。

详情
AI中文摘要

非侵入式脑机接口在泛化到自然视觉体验时,神经视觉解码面临严重的保真度退化。传统的多模态对比表示学习仅优化几何距离对齐,忽略了语义一致性和主体选择性,导致虚假的零样本对齐。我们提出SUP-MCRL,一个统一框架,集成了三种协作机制:(1) 语义实体感知视觉编码器(SAVE),学习空间注意力以提取语义内容,无需预训练的显著性模型;(2) 统一EEG增强器(UEE),采用多尺度空洞卷积和频带间注意力实现自适应跨主体鲁棒性;(3) 基于原型的渐进增强器(PPA),维护一个EMA更新的伪特征池以防止表示崩溃。在THINGS-EEG上的零样本实验实现了66.0%/91.9%(Top-1/Top-5)的个体内准确率和24.0%/52.9%的LOSO准确率,超越了现有最先进方法。代码可在https://github.com/NZWANG/SUP-MCRL获取。

英文摘要

Non-invasive brain-computer interfaces suffer severe fidelity degradation in neural visual decoding when generalizing to natural visual experiences. Conventional multimodal contrastive representation learning solely optimizes geometric distance alignment, neglecting semantic consistency and subject selectivity, causing spurious zero-shot alignment. We propose SUP-MCRL, a unified framework integrating three collaborative mechanisms: (1) Semantic-entity Aware Visual Encoder (SAVE), learning spatial attention to extract semantic content without pre-trained saliency models; (2) Unified EEG Enhancer (UEE), employing multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) Prototype-based Progressive Augmenter (PPA), maintaining an EMA-updated pseudo-feature pool to prevent representation collapse. Zero-shot experiments on THINGS-EEG achieve 66.0%/91.9% (Top-1/Top-5) intra-subject and 24.0%/52.9% LOSO accuracy, surpassing state-of-the-art methods. Code is available at https://github.com/NZWANG/SUP-MCRL.
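The "EMA-updated pseudo-feature pool" of the PPA component can be pictured with a small sketch. It assumes one prototype vector per class key and a fixed momentum. The class name `PrototypePool` and its interface are hypothetical and do not come from the SUP-MCRL code.

```python
class PrototypePool:
    """Keeps one exponentially averaged feature vector per key (assumed design)."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.pool = {}

    def update(self, key, feature):
        """Blend a new feature into the stored prototype and return the result."""
        if key not in self.pool:
            # The first observation initializes the prototype.
            self.pool[key] = list(feature)
        else:
            m = self.momentum
            self.pool[key] = [
                m * old + (1.0 - m) * new
                for old, new in zip(self.pool[key], feature)
            ]
        return self.pool[key]
```

A higher momentum makes each prototype change more slowly, so stored pseudo-features stay stable across batches while still tracking the encoder as it trains.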

2606.16667 2026-06-16 cs.CV 新提交

Look Again Before You Abstain: Budgeted Conformal Evidence Acquisition for Reliable Vision-Language Model

在放弃之前再看一眼:预算约束下的共形证据获取用于可靠的视觉-语言模型

Jian Xu, Delu Zeng, John Paisley, Qibin Zhao

发表机构 * South China University of Technology(华南理工大学) RIKEN Center for Advanced Intelligence Project(RIKEN先进智能研究中心) Columbia University(哥伦比亚大学)

AI总结 针对视觉-语言模型幻觉问题,提出预算约束共形证据获取(BCEA)方法,通过三级决策(回答、放弃或获取额外视觉证据)在有限计算预算下控制幻觉率,并恢复有限样本保证。

详情
AI中文摘要

大型视觉-语言模型(LVLMs)会产生幻觉:它们断言图像不支持的视觉细节。一个原则性的解决方案是使用无分布保证的选择性预测——验证每个声明,当声明没有依据时放弃,从而使断言声明中的幻觉率有可证明的界限。然而,我们表明,这个保证是以残酷的代价换来的:为了在平衡的对象存在基准上将幻觉率保持在5%以下,最先进的共形过滤器必须在超过80%的声明上放弃。我们认为,当更多视觉证据可以廉价获取时,放弃是浪费的,并引入了预算约束共形证据获取(BCEA),它将二元回答/放弃决策替换为三向选择:回答、放弃或在有限计算预算下通过重新检查图像(缩放、裁剪或应用特定声明的干预)获取额外视觉证据。我们有两个观察。首先,天真地将获取插入到校准的过滤器中会破坏统计保证——实际风险超过目标多达17个百分点——因为获取步骤破坏了共形校准所依赖的可交换性。其次,将整个获取策略折叠到得分函数中,并在获取后得分上重新校准,恢复了有限样本保证,同时仍然恢复覆盖。BCEA进一步使用结构化的、声明类型特定的干预。在POPE基准和COCO构建的存在性和空间关系声明上,针对四个开源VLM,BCEA将幻觉率控制在目标水平,并持续提高覆盖,优于保证放弃的基线。

英文摘要

Large vision-language models (LVLMs) hallucinate: they assert visual details that the image does not support. A principled remedy is selective prediction with a distribution-free guarantee: verify each claim and abstain when the claim is not grounded, so that the hallucination rate among asserted claims is provably bounded. We show, however, that this guarantee is bought at a brutal price: to keep the hallucination rate below 5% on a balanced object-existence benchmark, a state-of-the-art conformal filter must abstain on more than 80% of claims. We argue that abstention is wasteful when more visual evidence is cheaply available, and introduce Budgeted Conformal Evidence Acquisition (BCEA), which replaces the binary answer/abstain decision with a three-way choice: answer, abstain, or acquire additional visual evidence by re-examining the image (zooming, cropping, or applying a claim-specific intervention) under a bounded compute budget. We make two observations. First, acquisition that is plugged naively into a calibrated filter breaks the statistical guarantee (realized risk overshoots the target by up to 17 points) because the acquisition step destroys the exchangeability that conformal calibration relies on. Second, folding the entire acquisition policy into the score function and re-calibrating on post-acquisition scores restores the finite-sample guarantee while still recovering coverage. BCEA further uses structured, claim-type-specific interventions. Across the POPE benchmark and COCO-constructed existence and spatial-relation claims, on four open VLMs, BCEA controls the hallucination rate at the target level and consistently improves coverage over a guaranteed-abstention baseline.
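The second observation, calibrating on post-acquisition scores, can be sketched as follows. This is an illustration under stated assumptions, not the paper's procedure: the ambiguity band (`low`, `high`), the `acquire` callback, and all function names are hypothetical. The calibration step is a standard split-conformal rule that bounds the chance of accepting an ungrounded claim, which is a simpler target than the hallucination rate among asserted claims studied in the paper.

```python
import math

def post_acquisition_score(initial_score, acquire, low=0.3, high=0.7, budget=1):
    """Fold the acquisition policy into the score: only ambiguous claims are
    re-examined, and at most `budget` times."""
    score, steps = initial_score, 0
    while low <= score <= high and steps < budget:
        score = acquire(score)  # hypothetical re-scoring after a zoom or crop
        steps += 1
    return score

def conformal_threshold(false_claim_scores, alpha=0.05):
    """Threshold from post-acquisition scores of ungrounded calibration claims.
    Accepting only scores above it keeps P(accept | ungrounded) <= alpha,
    assuming test claims are exchangeable with the calibration claims."""
    s = sorted(false_claim_scores)
    k = math.ceil((1 - alpha) * (len(s) + 1))
    if k > len(s):
        return float("inf")  # too few calibration samples: always abstain
    return s[k - 1]

def decide(score, threshold):
    """Final decision after any acquisition has already been applied."""
    return "answer" if score > threshold else "abstain"
```

The key point of the sketch is that calibration and test claims pass through the same `post_acquisition_score` function, so they remain exchangeable. Calibrating on pre-acquisition scores and then acquiring at test time would break that symmetry.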

2606.16783 2026-06-16 cs.CV cs.AI cs.LG 新提交

Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations

Gen-VCoT: 基于扩散的RGB中间表示的生成式视觉思维链推理

Zhiqiang Zhou, Junliang Dai, Xu ling

发表机构 * Hunan Chemical Industry Vocational and Technical College(湖南化工职业技术学院)

AI总结 提出Gen-VCoT框架,利用专家视觉模型生成RGB图像作为推理中间步骤,通过自适应路由器选择推理深度,在空间和深度问题上分别提升25%和50%,但简单事实查询性能下降,表明最优表示依赖于任务。

Comments 12 pages, 5 figures

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉推理方面表现出色,但依赖基于文本的思维链(CoT),缺乏可解释的视觉中间表示。现有方法使用不透明的标记或外部工具,缺失关键属性。我们提出Gen-VCoT,一个使用专家视觉模型生成RGB图像作为推理中间表示的框架。它包含三个阶段:视觉定位(SAM分割)、几何推理(Marigold深度图)和语义推理(Qwen2-VL集成)。一个自适应路由器选择推理深度。评估显示,Gen-VCoT在空间问题(提升25%)和深度问题(提升50%)上表现更好,但可能损害简单事实查询。文本CoT在CLEVR上优于视觉中间表示(91.2% vs 62.5%),表明最优表示依赖于任务。Gen-VCoT为可解释的多模态推理建立了新范式。

英文摘要

Multimodal large language models (MLLMs) excel at visual reasoning but rely on text-based chain-of-thought (CoT), lacking interpretable visual intermediates. Existing methods use opaque tokens or external tools, missing key properties. We propose Gen-VCoT, a framework using expert vision models to generate RGB images as reasoning intermediates. It has three stages: visual grounding (SAM segmentation), geometric reasoning (Marigold depth maps), and semantic reasoning (Qwen2-VL integration). An adaptive router selects reasoning depth. Evaluations show Gen-VCoT improves spatial (25% better) and depth (50% better) questions, but may hurt simple factual queries. Text CoT outperforms visual intermediates on CLEVR (91.2% vs 62.5%), showing task-dependent optimal representations. Gen-VCoT establishes a new paradigm for interpretable multimodal reasoning.

2606.14786 2026-06-16 cs.MM cs.AI cs.CV 交叉投稿

MatchLM2Lite: A Scalable MLLM-to-Lite Framework for Reproduced Content Identification

MatchLM2Lite: 一种可扩展的MLLM-to-Lite框架用于重复内容识别

Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang, Zirui Zhu, Kanchan Sarkar, Kun Xu

发表机构 * TikTok(字节跳动) National University of Singapore School of Computing(新加坡国立大学计算机学院)

AI总结 提出MatchLM2Lite框架,通过将多模态大语言模型蒸馏为轻量模型,实现视频、音频和文本联合建模的实时重复内容识别,在降低35倍计算成本的同时保持高准确率,并成功部署于大规模生产环境。

详情
AI中文摘要

内容审核对于在线视频平台确保内容安全、保护创作者和维持积极的用户体验至关重要。除了过滤有害内容,平台必须大规模保证内容真实性,以便用户接触到多样化、原创的视频,而非低价值的重复内容。我们提出MatchLM2Lite,一个实时、生产级的重复内容识别(RCI)系统,它利用多模态大语言模型(MLLM)的强大理解能力,将其蒸馏为一个小型且推理速度快的模型。我们的系统联合建模视频、音频和文本信号,对视频对进行操作以生成细粒度的重复分数。该系统包含两个模块,MatchLM和MatchLite,以及一个两阶段训练方案。首先,我们高容量的MLLM,MatchLM,作为教师模型定义RCI性能的上限。然后,其能力被蒸馏到一个紧凑的学生模型MatchLite中。这种设计使MatchLite能够在视频对上实现低延迟、高吞吐量的推理,同时保留MatchLM的大部分准确性,使其适合集成到实时推荐系统中。MatchLM相比我们之前的生产模型F1分数提高了+8.57。经过知识蒸馏后,MatchLite保留了+6.55的F1分数提升,同时计算成本降低了35倍。大规模部署后,MatchLM2Lite实现了高效的成对多模态RCI,以高每秒查询数(QPS)稳定服务在线流量,端到端延迟低于30秒。该系统在不降低用户参与度的情况下,将我们平台上的重复视频观看率降低了2.5%,证明了其在大规模生产环境中的有效性。

英文摘要

Content moderation is critical for online video platforms to ensure content safety, protect creators, and sustain positive user experiences. Beyond filtering harmful content, platforms must guarantee content authenticity at scale so that users are exposed to diverse, original videos rather than low-value reproductions. We present MatchLM2Lite, a real-time, production-grade reproduced content identification (RCI) system that leverages the powerful understanding of a multimodal large language model (MLLM) distilled into a small and fast-inference model. Our system jointly models video, audio, and text signals, operating on pairs of videos to produce fine-grained reproduction scores. The system comprises two modules, MatchLM and MatchLite, and a two-stage training recipe. First, our high-capacity MLLM, MatchLM, serves as a teacher model to define the upper bound of RCI performance. Its capabilities are then distilled into a compact student model, MatchLite. This design allows MatchLite to deliver low-latency, high-throughput inference on video pairs while preserving much of MatchLM's accuracy, making it suitable for integration into real-time recommendation systems. MatchLM achieves an F1-score improvement of +8.57 compared to our previous production model. After knowledge distillation, MatchLite retains a +6.55 gain in F1-score while reducing computational cost by 35x. Deployed at scale, MatchLM2Lite enables efficient, pairwise multimodal RCI, stably serving online traffic at high queries per second (QPS) with an end-to-end latency below 30 seconds. This system has reduced the reproduced video view rate on our platform by 2.5% without degrading user engagement, demonstrating its effectiveness in a large-scale production environment.

2606.15427 2026-06-16 cs.LG cs.AI cs.CV 交叉投稿

Post-Launch Capability Expansion of Vision-Language Models via Prompting for On-Orbit Spacecraft Inspection

通过提示实现视觉语言模型发射后能力扩展用于在轨航天器检测

Nicholas A. Welsh, Lennon J. Shikhman, Monty Nehru Attazs, Seemanthini K. Putane, Van Minh Nguyen, Ryan T. White

发表机构 * Florida Institute of Technology(佛罗里达理工学院) University of Florida(佛罗里达大学)

AI总结 研究利用提示驱动的视觉语言模型在轨扩展语义能力,无需修改权重即可通过自然语言提示检测新航天器部件,在129张图像上零样本实例分割达到0.385 mAP@0.5。

Comments 5 pages, 1 figure, 2 tables. Equal contribution by Nicholas A. Welsh and Lennon Shikhman. Published in the CVPR 2026 Workshop on AI4Space

详情
AI中文摘要

星载检测系统通常在发射前部署感知模型,之后更新模型权重或扩展固定标签集在操作上变得不可行。虽然监督模型可以在飞行前集成,但在轨道上添加新的语义能力需要重新训练和重新上传参数。我们研究提示驱动的视觉语言模型是否能够实现发射后语义扩展,允许通过自然语言提示指定新的航天器部件,而无需修改星载权重。我们在一个包含129张先前未见卫星图像的测试集上,采用严格冻结的单次推理协议,评估了航天器部件的零样本实例分割。在固定全局阈值且无后处理的情况下,SAM3达到0.385 mAP@0.5和0.267 mAP@0.5:0.95。性能强烈依赖于尺度:大型结构元素如航天器主体(0.639 AP@0.50)和太阳翼(0.598 AP@0.5)定位可靠,而相对较小的附件如天线(0.221 AP@0.5)和推进器(0.081 AP@0.5)仍然困难。提示形式影响性能,包含空间和几何描述符的结构化提示相比短类别名称提示提升高达82%。该模型在当代嵌入式GPU的内存和计算范围内运行,表明提示驱动的定位可以为主要航天器结构提供发射后语义扩展的实用机制,同时突显了在轨道域偏移下细粒度部件零样本定位的局限性。

英文摘要

Spaceborne inspection systems often deploy perception models prior to launch, after which updating model weights or expanding fixed label sets becomes operationally impractical. While supervised models can be integrated pre-flight, adding new semantic capabilities in orbit requires retraining and re-uploading parameters. We investigate whether prompt-driven vision-language models can enable post-launch semantic expansion, allowing new spacecraft components to be specified via natural-language prompts without modifying onboard weights. We evaluate zero-shot instance segmentation of spacecraft components under a strictly frozen, single-pass inference protocol on a test set of 129 images of previously unseen satellites. Under fixed global thresholds and no post-processing, SAM3 achieves 0.385 mAP@0.5 and 0.267 mAP@0.5:0.95. Performance is strongly scale-dependent: large structural elements like spacecraft bodies (0.639 AP@0.5) and solar arrays (0.598 AP@0.5) localize reliably, while relatively small appendages like antennas (0.221 AP@0.5) and thrusters (0.081 AP@0.5) remain difficult. Prompt formulation influences performance, with structured prompts incorporating spatial and geometric descriptors yielding up to 82% improvement over short category-name prompts. The model operates within the memory and compute envelope of contemporary embedded GPUs, suggesting prompt-driven grounding can provide a practical mechanism for post-launch semantic extension of dominant spacecraft structures while highlighting limitations of zero-shot localization for fine-scale components under orbital domain shift.

2606.15694 2026-06-16 cs.MM cs.AI cs.CV cs.LG 交叉投稿

MAF: Multimodal Adaptive Few-shot Prompting for Sentiment Analysis with MLLMs

MAF: 面向情感分析的多模态自适应少样本提示方法

Hangling Xie

发表机构 * Nanjing University of Posts and Telecommunications(南京邮电大学)

AI总结 提出MAF框架,通过动态检索与查询相关的多模态示例,利用轻量级系数生成网络实时融合多模态相似度,结合多数投票提升MLLM在情感分析中的性能。

详情
AI中文摘要

多模态大语言模型(MLLMs)在理解复杂多模态内容方面展现了卓越的能力。然而,它们在情感分析中的性能对提示设计高度敏感,导致静态、统一应用的提示本质上无法捕捉不同输入中变化的细微多模态线索。为了解决这一局限性,我们提出了一种多模态自适应少样本提示(MAF)框架,该框架动态检索并整合与查询相关的示例,以上下文敏感的方式激发MLLM的情感推理能力。MAF构建了一个示例检索模块,整体编码面部表情、场景上下文和文本语义,并引入唇部运动幅度检测机制以在多人物场景中准确识别说话者。与传统的固定权重融合不同,我们训练了一个轻量级系数生成网络,实时输出查询条件的融合权重,从而实现多模态相似度分数的加权聚合,以检索最具信息量的前K个示例。通过MLLM生成的多个候选输出进行多数投票,进一步增强了预测稳定性。在公开基准数据集上的大量实验表明,MAF相比相应的骨干变体取得了显著且一致的性能提升,并与强大的多模态情感分析基线保持竞争力。

英文摘要

Multimodal large language models (MLLMs) have demonstrated remarkable capabilities in understanding complex multimodal content. However, their performance in sentiment analysis exhibits acute sensitivity to prompt design, rendering static, uniformly applied prompts inherently suboptimal for capturing the nuanced multimodal cues that vary across inputs. To address this limitation, we propose a Multimodal Adaptive Few-Shot Prompting (MAF) framework, which dynamically retrieves and integrates query-relevant demonstrations to elicit the sentiment reasoning capabilities of MLLMs in a context-sensitive manner. MAF constructs a demonstration retrieval module that holistically encodes facial expressions, scene context, and textual semantics, with a lip movement amplitude detection mechanism introduced for accurate speaker identification in multi-person scenarios. Departing from conventional fixed-weight fusion, a lightweight coefficient generation network is trained to output query-conditioned fusion weights in real time, enabling weighted aggregation of multimodal similarity scores to retrieve the top-K most informative demonstrations. Prediction stability is further enhanced through majority voting over multiple candidate outputs generated by the MLLM. Extensive experiments on public benchmark datasets demonstrate that MAF achieves substantial and consistent performance improvements over the corresponding backbone variants and remains competitive with strong multimodal sentiment-analysis baselines.
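The retrieval and voting steps described in this abstract can be sketched briefly. The sketch assumes three per-modality similarity scores (face, scene, text) for each candidate demonstration and treats the output of the coefficient network as raw values that are normalized with a softmax. The function names and data layout are hypothetical.

```python
import math
from collections import Counter

def softmax(values):
    """Normalize raw coefficients into fusion weights that sum to 1."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def retrieve_top_k(similarities, raw_coefficients, k):
    """similarities maps demo id -> [face_sim, scene_sim, text_sim].
    Returns the k demo ids with the highest weighted similarity."""
    weights = softmax(raw_coefficients)
    fused = {
        demo: sum(w * s for w, s in zip(weights, sims))
        for demo, sims in similarities.items()
    }
    return sorted(fused, key=fused.get, reverse=True)[:k]

def majority_vote(predictions):
    """Most frequent label among several candidate MLLM outputs."""
    return Counter(predictions).most_common(1)[0][0]
```

Because the weights depend on the query, the same demonstration pool can rank differently for a query dominated by facial cues than for one dominated by text.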

2606.15782 2026-06-16 cs.AI cs.CV 交叉投稿

Mitigating Visual Hallucinations in Multimodal Systems through Retrieval-Augmented Reliability-Aware Inference

通过检索增强的可靠性感知推理缓解多模态系统中的视觉幻觉

Pratheswaran Hariharan, Haiping Xu, Donghui Yan

发表机构 * University of Massachusetts, Dartmouth(马萨诸塞大学达特茅斯分校)

AI总结 提出一种检索增强的可靠性感知推理框架,利用外部视觉证据库和多个可靠性指标进行决策门控,在不重训练模型的情况下减少视觉幻觉,将接受预测准确率从85.84%提升至88.88%。

Comments 28 pages, 9 figures

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉语言理解和自然语言响应生成方面展现了强大的能力。然而,当视觉证据较弱、模糊或语义不一致时,这些系统仍可能产生过度自信的预测和类似幻觉的输出。现有方法大多侧重于改进多模态表示对齐或检索增强生成,而缺乏量化实例级预测可靠性或识别错误视觉输出的机制。本文提出了一种检索增强的可靠性感知推理框架,用于可信的多模态视觉理解。该框架利用预训练的视觉嵌入和基于归一化特征表示的最近邻检索构建外部视觉证据数据库。检索到的证据用于通过多个可靠性指标估计预测的可信度,包括相似性强度、类别支持一致性、证据边际、基于熵的不确定性以及聚合可靠性分数。基于这些信号,决策门控决定系统是否应接受预测、谨慎回答或在证据不足时放弃/回退。然后,多模态响应生成层根据可靠性决策生成最终面向用户的响应。在ImageNet-100上的实验表明,所提出的可靠性感知框架在89.04%的覆盖率下将接受预测准确率从85.84%提升至88.88%。类似幻觉的接受错误答案率从14.16%降至11.12%。这些结果表明,整合检索证据、可靠性估计和选择性决策门控可以在不重新训练大型多模态模型的情况下改善校准并减少过度自信的视觉错误。

英文摘要

Multimodal large language models (MLLMs) have demonstrated strong capabilities in vision-language understanding and natural-language response generation. However, these systems can still produce overconfident predictions and hallucination-like outputs, particularly when the visual evidence is weak, ambiguous, or semantically inconsistent. Most existing approaches focus on improving multimodal representation alignment or retrieval-augmented generation, while providing limited mechanisms to quantify instance-level prediction reliability or identify incorrect visual outputs. This work proposes a retrieval-augmented reliability-aware inference framework for trustworthy multimodal visual understanding. The proposed framework constructs an external visual evidence database using pretrained visual embeddings and nearest-neighbor retrieval over normalized feature representations. Retrieved evidence is used to estimate prediction trustworthiness through multiple reliability indicators, including similarity strength, class-support agreement, evidence margin, entropy-based uncertainty, and an aggregate reliability score. Based on these signals, a decision gate determines whether the system should accept the prediction, answer with caution, or abstain/fallback when evidence is insufficient. A multimodal response-generation layer then produces a final user-facing response conditioned on the reliability decision. Experiments on ImageNet-100 demonstrate that the proposed reliability-aware framework improves accepted prediction accuracy from 85.84% to 88.88% at 89.04% coverage. The hallucination-like accepted wrong-answer rate is reduced from 14.16% to 11.12%. These results show that integrating retrieval evidence, reliability estimation, and selective decision gating can improve calibration and reduce overconfident visual errors without retraining large multimodal models.
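A rough sketch of how retrieved neighbors could be turned into reliability signals and a three-way gate is shown below. The abstract names the indicators but not their formulas, so the definitions here (mean cosine similarity, vote share, vote margin, normalized entropy, and an unweighted average as the aggregate score) and the two thresholds are assumptions made for illustration only.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity, equivalent to a dot product over normalized features."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def reliability_signals(query, evidence, k):
    """Compute illustrative reliability signals from the k nearest evidence items.

    evidence: list of (embedding, label) pairs forming the external database.
    """
    neighbors = sorted(
        ((cosine(query, emb), label) for emb, label in evidence),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    n = len(neighbors)
    votes = Counter(label for _, label in neighbors).most_common()
    top_label, top_votes = votes[0]
    runner_up = votes[1][1] if len(votes) > 1 else 0
    probs = [count / n for _, count in votes]
    entropy = -sum(p * math.log(p) for p in probs)
    max_entropy = math.log(len(votes)) if len(votes) > 1 else 1.0
    signals = {
        "prediction": top_label,
        "similarity": sum(s for s, _ in neighbors) / n,  # similarity strength
        "agreement": top_votes / n,                      # class-support agreement
        "margin": (top_votes - runner_up) / n,           # evidence margin
        "certainty": 1.0 - entropy / max_entropy,        # 1 - normalized entropy
    }
    # Hypothetical aggregate: plain average of the four signals.
    signals["score"] = (
        signals["similarity"] + signals["agreement"]
        + signals["margin"] + signals["certainty"]
    ) / 4
    return signals


def gate(signals, accept_at=0.7, caution_at=0.4):
    """Three-way decision; both thresholds are made-up illustration values."""
    if signals["score"] >= accept_at:
        return "accept"
    if signals["score"] >= caution_at:
        return "caution"
    return "abstain"
```

Because the gate only filters predictions, the accepted wrong-answer rate is simply the complement of accepted accuracy, which matches the reported pairs (100 - 85.84 = 14.16 and 100 - 88.88 = 11.12).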

2606.16494 2026-06-16 cs.CL cs.AI cs.CV 交叉投稿

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

迷失在末尾:多模态检索增强问答中的首因偏差

Jieyuan Liu, Jianyang Gu, Shijie Chen, Jefferson Chen, Zhen Wang

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) The Ohio State University(俄亥俄州立大学)

AI总结 研究多模态知识型视觉问答中检索上下文的位置依赖,发现不同于纯文本的U形效应,出现首因偏差(开头优于末尾),并通过消融实验定位原因为指令调优阅读器的提示槽0。

Comments 15 pages, 9 figures. Under review at EMNLP 2026

详情
AI中文摘要

基于知识的视觉问答(KB-VQA)通过将阅读器条件化于从维基百科规模知识库检索的段落,使视觉-语言系统能够回答超出其参数知识的问题。在纯文本长上下文LLM中,检索上下文的使用遵循Liu等人(2024)的U形“迷失在中间”效应:上下文开头和结尾的信息被使用,中间部分被忽略。这种效应是否会迁移到部署的多模态KB-VQA中尚不清楚。为填补这一空白,我们设计了首个针对多模态KB-VQA中阅读器侧位置依赖的受控探针:一种黄金位置协议,其中只有黄金段落的提示槽在问题内变化。我们在三个开源7B/8B VLM阅读器和两个KB-VQA基准上运行,k最大为20。形状从U形翻转为首因:在每个阅读器-基准组合上,黄金在开头比黄金在结尾高出16到26个点,我们称这种效应为“迷失在末尾”。三项针对性消融实验缩小了原因:纯文本对照显示多模态设置将已存在的文本模式首因放大了2.2到4.5倍,图像位置和干扰物洗牌消融共同将根源定位到指令调优阅读器的提示槽0。在冻结的阅读器上,三种检索侧修复(MMR、oracle重排序、基于排名的重排序)均未缩小差距(无可区分的改进)。我们的发现表明,recall@k是部署KB-VQA的错误指标,缩小差距需要阅读器侧干预;我们发布该协议作为评估此类干预的受控工具。

英文摘要

Knowledge-based visual question answering (KB-VQA) lets vision-language systems answer questions that exceed their parametric knowledge by conditioning a reader on passages retrieved from a Wikipedia-scale knowledge base. In pure-text long-context LLMs, retrieved-context use follows the U-shaped "lost-in-the-middle" effect of Liu et al. (2024): information at the start and end of context is used, the middle is lost. Whether this transfers to deployed multimodal KB-VQA is open. To close this gap, we design the first controlled probe of reader-side position dependence in multimodal KB-VQA: a gold-position protocol in which only the gold passage's prompt slot varies within question. We run it on three open-source 7B/8B VLM readers and two KB-VQA benchmarks at k up to 20. The shape flips from U to primacy: gold-at-first beats gold-at-last by 16 to 26 points on every reader-by-benchmark cell, an effect we call "Lost at the End". Three targeted ablations narrow the cause: a text-only control shows the multimodal setting amplifies an already-present text-mode primacy 2.2 to 4.5 times, and image-position and distractor-shuffle ablations together pin the locus to prompt slot 0 of the instruction-tuned reader. On a frozen reader, three retrieval-side fixes (MMR, oracle reranking, rank-based reordering) all leave the gap intact (no separable improvement). Our findings indicate that recall@k is the wrong metric for deployed KB-VQA and that closing the gap requires reader-side intervention; we release our protocol as a controlled instrument for evaluating such interventions.
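The gold-position protocol described above can be sketched as a small harness: the distractors stay fixed and only the gold passage's slot changes. The helper names and the toy reader below are hypothetical and are not taken from the released protocol.

```python
def place_gold(gold, distractors, slot):
    """Return the passage list with the gold passage inserted at the given slot."""
    if not 0 <= slot <= len(distractors):
        raise ValueError("slot out of range")
    passages = list(distractors)
    passages.insert(slot, gold)
    return passages


def sweep_positions(gold, distractors, reader, is_correct):
    """Run the same question once per gold slot and record correctness."""
    results = {}
    for slot in range(len(distractors) + 1):
        passages = place_gold(gold, distractors, slot)
        results[slot] = is_correct(reader(passages))
    return results


def first_minus_last(results):
    """Gold-at-first minus gold-at-last correctness for one question."""
    last = max(results)
    return float(results[0]) - float(results[last])


def slot0_reader(passages):
    """Toy reader that only ever uses the first passage (an extreme primacy case)."""
    return passages[0]
```

Averaging `first_minus_last` over many questions would give a gap of the kind the abstract reports in points; the toy reader produces the maximal gap of 1.0 because it ignores every slot except the first.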

2606.17053 2026-06-16 cs.CL cs.CV 交叉投稿

Context-Aware RL for Agentic and Multimodal LLMs

上下文感知强化学习用于智能体与多模态大语言模型

Peiyang Xu, Bangzheng Li, Sijia Liu, Karthik R. Narasimhan, Pramod Viswanath, Prateek Mittal, Xingyu Fu

发表机构 * Princeton University(普林斯顿大学) UC Davis(加州大学戴维斯分校)

AI总结 提出ContextRL方法,通过间接辅助目标(上下文选择奖励)增强大模型在长上下文和多模态任务中的细粒度推理能力,在5个长程基准和12个视觉问答基准上分别提升+2.2%和+1.8%。

Comments 29 pages, 9 figures

详情
AI中文摘要

大语言模型在需要从长或复杂上下文中识别细小但决定性证据(如工具跟踪中的一行或图像中的细微细节)时常常失败。我们提出ContextRL,一种上下文感知的强化学习方法,通过一个间接辅助目标来提升长程推理和多模态性能。ContextRL不是仅监督最终答案,而是向模型提供查询、答案和两个高度相似的上下文,并奖励它选择支持查询-答案对的上下文,从而鼓励细粒度定位。我们在两个领域构建对比上下文数据:对于编码智能体,轨迹作为上下文,通过条件过滤生成1k对;对于多模态推理,图像作为上下文,通过生成式编辑和相似性搜索生成7K对。ContextRL在5个长程基准上比标准GRPO平均提升+2.2%,在12个多样化视觉问答基准上平均提升+1.8%。为了分离所提目标与额外数据的影响,我们与数据增强基线进行比较,这些基线将相同的对比上下文重新用作标准查询-上下文-答案示例。这些基线几乎没有改进,表明收益来自所提出的上下文选择目标,而非仅对比数据。

英文摘要

Large language models (LLMs) often fail when answering requires identifying a small but decisive piece of evidence within a long or complex context, such as a single line in a tool trace or a subtle detail in an image. We propose ContextRL, a context-aware reinforcement learning (RL) method that improves long-horizon reasoning and multimodal performance through an indirect auxiliary objective. Instead of supervising only the final answer, ContextRL presents the model with a query, an answer, and two highly similar contexts, and rewards it for selecting the context that supports the query-answer pair, thereby encouraging fine-grained grounding. We construct contrastive context data in two domains: for coding agents, trajectories serve as contexts, yielding 1k pairs built via condition filtering; for multimodal reasoning, images serve as contexts, yielding 7K pairs built via generative editing and similarity search. ContextRL achieves average gains of +2.2% over standard GRPO on 5 long-horizon benchmarks, and +1.8% across 12 diverse visual question answering benchmarks. To disentangle the effect of the proposed objective from that of additional data, we compare against data-augmentation baselines that repurpose the same contrastive contexts as standard query-context-answer examples. These baselines provide little to no improvement, showing that the gains arise from the proposed context-selection objective rather than from the contrastive data alone.
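The context-selection objective can be illustrated with a small sketch. The abstract does not specify the reward's exact form, so the binary reward, the random shuffling of the two contexts, and every name below are assumptions.

```python
import random


def build_selection_example(query, answer, supporting, distractor, rng):
    """Pair a query-answer with two similar contexts in random order.

    Shuffling keeps the position of the supporting context from leaking the label.
    """
    contexts = [supporting, distractor]
    rng.shuffle(contexts)
    return {
        "query": query,
        "answer": answer,
        "contexts": contexts,
        "target": contexts.index(supporting),
    }


def selection_reward(example, chosen_index):
    """Assumed binary reward: 1.0 if the model picks the supporting context."""
    return 1.0 if chosen_index == example["target"] else 0.0
```

A data-augmentation baseline of the kind described above would instead feed `supporting` as an ordinary context for answering `query`, without ever asking the model to choose between the two.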

2504.17547 2026-06-16 cs.CV cs.IR cs.MM 版本更新

A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task

基于知识的视觉问答系统综述:视觉推理任务中的知识生命周期

Jiaqi Deng, Zonghan Wu, Huan Huo, Guandong Xu

发表机构 * University of Technology Sydney(悉尼科技大学) East China Normal University(华东师范大学) Education University of Hong Kong(香港教育大学)

AI总结 综述基于知识的视觉问答(KB-VQA)方法,将其分为知识表示、检索和推理三个阶段,并探讨大语言模型带来的变革,指出未来研究方向。

Comments Accepted at TKDE, 20 pages, 5 figures, 4 tables

详情
Journal ref
IEEE Transactions on Knowledge and Data Engineering, 2026
AI中文摘要

基于知识的视觉问答(KB-VQA)扩展了通用视觉问答(VQA),不仅需要理解视觉和文本输入,还需要广泛的知识,从而在多种实际应用中取得显著进展。KB-VQA引入了独特的挑战,包括对齐来自不同模态和来源的异构信息、从嘈杂或大规模存储库中检索相关知识,以及执行复杂推理以从组合上下文中推断答案。随着大语言模型(LLMs)的发展,KB-VQA系统也经历了显著变革,LLMs作为强大的知识库、检索增强生成器和强推理器。尽管取得了实质性进展,但目前尚无全面综述系统性地组织和回顾现有的KB-VQA方法。本综述旨在通过建立KB-VQA方法的结构化分类法,并将系统分为主要阶段:知识表示、知识检索和知识推理,来填补这一空白。通过探索各种知识集成技术并识别持续存在的挑战,本文还概述了有前景的未来研究方向,为推进KB-VQA模型及其应用提供了基础。

英文摘要

Knowledge-based Vision Question Answering (KB-VQA) extends general Vision Question Answering (VQA) by requiring not only the understanding of visual and textual inputs but also an extensive range of knowledge, enabling significant advancements across various real-world applications. KB-VQA introduces unique challenges, including the alignment of heterogeneous information from diverse modalities and sources, the retrieval of relevant knowledge from noisy or large-scale repositories, and the execution of complex reasoning to infer answers from the combined context. With the advancement of Large Language Models (LLMs), KB-VQA systems have also undergone a notable transformation, where LLMs serve as powerful knowledge repositories, retrieval-augmented generators, and strong reasoners. Despite substantial progress, no comprehensive survey currently exists that systematically organizes and reviews the existing KB-VQA methods. This survey aims to fill this gap by establishing a structured taxonomy of KB-VQA approaches and categorizing the systems into three main stages: knowledge representation, knowledge retrieval, and knowledge reasoning. By exploring various knowledge integration techniques and identifying persistent challenges, this work also outlines promising future research directions, providing a foundation for advancing KB-VQA models and their applications.

2507.17588 2026-06-16 cs.CV cs.CL 版本更新

Dual-branch Prompting for Multimodal Machine Translation

双分支提示用于多模态机器翻译

Jie Wang, Zhendong Yang, Liansong Zong, Xiaobo Zhang, Dexian Wang, Ji Zhang

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University(西南交通大学计算机与人工智能学院) School of Computer and Software Engineering, Xihua University(西华大学计算机与软件工程学院) School of Intelligent Medicine, Chengdu University of Traditional Chinese Medicine(成都中医药大学智能医学学院)

AI总结 提出基于扩散模型的双分支提示框架D2P-MMT,利用重建图像过滤视觉噪声,通过分布对齐损失提升鲁棒翻译性能。

Comments This manuscript has been fully accepted and published by ACM Transactions on Multimedia Computing, Communications, and Applications (ACM TOMM)

详情
AI中文摘要

多模态机器翻译(MMT)通常通过整合对齐的视觉特征来增强纯文本翻译。尽管取得了显著进展,最先进的MMT方法在推理时通常依赖于配对的图像-文本输入,并且对无关的视觉噪声敏感,这限制了它们的鲁棒性和实际应用性。为了解决这些问题,我们提出了D2P-MMT,一种基于扩散的双分支提示框架,用于鲁棒的视觉引导翻译。具体来说,D2P-MMT仅需要源文本和由预训练扩散模型生成的重建图像,该图像自然地过滤掉分散注意力的视觉细节,同时保留语义线索。在训练期间,模型使用双分支提示策略从真实图像和重建图像中联合学习,鼓励丰富的跨模态交互。为了弥合模态差距并减轻训练-推理差异,我们引入了一种分布对齐损失,强制两个分支的输出分布之间的一致性。在Multi30K数据集上的大量实验表明,与现有最先进方法相比,D2P-MMT实现了更优的翻译性能。我们的代码已在 https://github.com/MentaY/DDP 公开。

英文摘要

Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches. Our code is publicly available at https://github.com/MentaY/DDP.

2601.06212 2026-06-16 cs.CV cs.AI 版本更新

Akasha 2: Hamiltonian State Space Duality and Visual-Language Joint Embedding Predictive Architecture

Akasha 2: 哈密顿状态空间对偶与视觉-语言联合嵌入预测架构

Yani Meziani

发表机构 * Independent AI Researcher(独立AI研究员) Québec (QC), Canada(魁北克(QC),加拿大)

AI总结 提出 Akasha 2 多模态架构,结合哈密顿状态空间对偶与视觉-语言联合嵌入预测,通过稀疏混合哈密顿专家和哈密顿流匹配实现超低延迟视频预测与合成,在保持能量守恒下取得 SOTA 性能。

Comments No supporting claims were validated in this automated agentic R&D research run

详情
AI中文摘要

我们提出了 Akasha 2,一种最先进的多模态架构,它集成了哈密顿状态空间对偶(H-SSD)与视觉-语言联合嵌入预测架构(VL-JEPA)。该系统利用 Mamba-3 选择性状态空间模型(SSM),并通过稀疏混合哈密顿专家(SMoE-HE)增强,后者通过辛积分强制执行潜在物理守恒定律。对于视觉合成,我们引入了哈密顿流匹配(HFM)和持久化 3D 高斯泼溅(3DGS),在移动硬件上实现了超低延迟(<50ms)。这项工作在潜在世界模型中建立了一个新范式,通过全息记忆架构实现了前所未有的时空一致性。我们的方法表明,将物理启发的归纳偏置融入神经架构可带来显著改进:最先进的视频预测(FVD: 287),比扩散模型快 4 倍的视觉合成,以及相比 Transformer 基线 3-18 倍的推理加速,同时在长时间范围内保持能量守恒。

英文摘要

We present Akasha 2, a state-of-the-art multimodal architecture that integrates Hamiltonian State Space Duality (H-SSD) with Visual-Language Joint Embedding Predictive Architecture (VL-JEPA). The system leverages the Mamba-3 Selective State Space Model (SSM) augmented by a Sparse Mixture of Hamiltonian Experts (SMoE-HE) that enforces latent physical conservation laws through symplectic integration. For visual synthesis, we introduce Hamiltonian Flow Matching (HFM) and persistent 3D Gaussian Splatting (3DGS), enabling ultra-low latency (<50ms) on mobile hardware. This work establishes a new paradigm in latent world models, achieving unprecedented spatiotemporal coherence through a holographic memory architecture. Our approach demonstrates that incorporating physics-inspired inductive biases into neural architectures yields significant improvements: state-of-the-art video prediction (FVD: 287), 4x faster visual synthesis than diffusion models, and 3-18x inference speedup over transformer baselines while maintaining energy conservation over extended horizons.

2601.08010 2026-06-16 cs.CV 版本更新

CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation

CASHEW: 通过迭代轨迹聚合稳定多模态推理

Chaoyu Li, Deeparghya Dutta Barua, Fei Tao, Pooyan Fazli

发表机构 * Arizona State University(亚利桑那州立大学) NewsBreak

AI总结 提出CASHEW框架,通过迭代聚合多候选推理轨迹并利用视觉验证过滤幻觉步骤,以及其强化学习变体CASHEW-RL,显著提升多模态推理的稳定性和准确性。

详情
AI中文摘要

视觉语言模型在广泛的多模态理解和推理任务中表现出色,但其多步推理仍然不稳定。对相同输入进行重复采样往往会产生分歧的推理轨迹和不一致的最终预测。为了解决这个问题,我们引入了两种受测试时扩展启发的互补方法:(1) CASHEW,一个推理时框架,通过迭代聚合多个候选轨迹为更高质量的推理轨迹来稳定推理,并利用显式的视觉验证过滤幻觉步骤,将推理锚定在视觉证据上;(2) CASHEW-RL,一个学习变体,将这种聚合行为内化到单个模型中。CASHEW-RL 使用组序列策略优化(GSPO)和复合奖励进行训练,该奖励鼓励基于最小但充分的视觉证据的正确答案,同时根据任务难度自适应地分配推理努力。这个训练目标使得推理时能够进行鲁棒的自我聚合。在13个图像理解、视频理解和视频推理基准上的大量实验显示了显著的性能提升,包括在 ScienceQA 上提升高达 +26.2 个百分点,在 EgoSchema 上提升 +9.1 个百分点。

英文摘要

Vision-language models achieve strong performance across a wide range of multimodal understanding and reasoning tasks, yet their multi-step reasoning remains unstable. Repeated sampling over the same input often produces divergent reasoning trajectories and inconsistent final predictions. To address this, we introduce two complementary approaches inspired by test-time scaling: (1) CASHEW, an inference-time framework that stabilizes reasoning by iteratively aggregating multiple candidate trajectories into higher-quality reasoning traces, with explicit visual verification filtering hallucinated steps and grounding reasoning in visual evidence, and (2) CASHEW-RL, a learned variant that internalizes this aggregation behavior within a single model. CASHEW-RL is trained using Group Sequence Policy Optimization (GSPO) with a composite reward that encourages correct answers grounded in minimal yet sufficient visual evidence, while adaptively allocating reasoning effort based on task difficulty. This training objective enables robust self-aggregation at inference. Extensive experiments on 13 image understanding, video understanding, and video reasoning benchmarks show significant performance improvements, including gains of up to +26.2 percentage points on ScienceQA and +9.1 percentage points on EgoSchema.

2601.16093 2026-06-16 cs.CV 版本更新

SAMTok: Representing Any Mask with Two Words

SAMTok: 用两个词表示任意掩码

Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li

发表机构 * Wuhan University(武汉大学) ByteDance(字节跳动) NUS(新加坡国立大学)

AI总结 提出离散掩码分词器SAMTok,将区域掩码转化为两个特殊标记,通过标准下一标记预测和简单强化学习使基础多模态大模型获得像素级能力,在多项任务上达到最先进水平。

Comments CVPR 2026 Highlight

详情
AI中文摘要

像素级能力对于构建交互式智能系统至关重要。然而,由于复杂的区域级编码器、专门的分割解码器以及不兼容的训练目标,像素级多模态大模型(MLLMs)仍然难以扩展。为了解决这些挑战,我们提出了SAMTok,一种离散掩码分词器,它将任何区域掩码转换为两个特殊标记,并使用这些标记以高保真度重建掩码。通过将掩码视为新的语言标记,SAMTok使基础MLLMs(如QwenVL系列)能够通过标准的下一标记预测和简单的强化学习来学习像素级能力,无需架构修改和专门的损失设计。SAMTok基于SAM2构建,并在2.09亿个多样化掩码上使用掩码编码器和残差向量量化器进行训练,以产生离散、紧凑且信息丰富的标记。利用500万个SAMTok格式的掩码理解和生成数据样本,QwenVL-SAMTok在区域描述、区域VQA、接地对话、指代分割、场景图解析和多轮交互分割上取得了最先进或可比的结果。我们进一步引入了一种文本答案匹配奖励,使得掩码生成的高效强化学习成为可能,在GRES和GCG基准测试上带来了显著改进。我们的结果展示了一种可扩展且直接的范式,用于赋予MLLMs强大的像素级能力。我们的代码和模型已公开。

英文摘要

Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask using these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications and specialized loss design. SAMTok builds on SAM2 and is trained on 209M diverse masks using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask understanding and generation data samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
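To make the idea of a residual vector quantizer concrete, the sketch below encodes a vector with two codebooks, giving one index per stage. It is an assumption here that SAMTok's two special tokens correspond to two quantization stages; the abstract does not spell this out, and the codebooks and function names are hypothetical toy values rather than anything from the released model.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to the vector (squared L2 distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], vector)),
    )


def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage encodes what the previous one missed."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct the vector by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out
```

With two codebooks, `rvq_encode` returns exactly two integers, which is what would let a mask embedding be written as two discrete tokens and predicted with ordinary next-token prediction.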

2602.00344 2026-06-16 cs.CV cs.AI cs.CL 版本更新

When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

当RAG有害:诊断和缓解检索增强LVLMs中的注意力分散

Beidi Zhao, Wenlong Deng, Xinting Liao, Yushu Li, Nazim Shaikh, Yao Nie, Xiaoxiao Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文发现检索增强生成(RAG)在LVLMs中导致注意力分散(AD)问题,即检索文本抑制视觉注意力并偏离问题相关区域,提出MAD-RAG方法通过双问题公式和注意力混合来解耦视觉定位与上下文整合,在三个基准上提升性能并纠正大部分失败案例。

Comments 19 pages, 13 figures

详情
AI中文摘要

虽然检索增强生成(RAG)是增强大型视觉语言模型(LVLMs)在基于知识的VQA任务上的主导范式之一,但最近的工作将RAG失败归因于对检索上下文的注意力不足,并提出减少分配给图像令牌的注意力。在这项工作中,我们识别了先前研究忽略的一个不同失败模式:注意力分散(AD)。当检索上下文足够(高度相关或包含正确答案)时,检索文本全局抑制视觉注意力,并且图像令牌上的注意力从问题相关区域转移。这导致模型在原本无需检索文本就能正确回答的问题上失败。为了缓解这个问题,我们提出了MAD-RAG,一种无需训练的干预方法,通过双问题公式解耦视觉定位与上下文整合,并结合注意力混合以保留图像条件证据。在OK-VQA、E-VQA和InfoSeek上的大量实验表明,MAD-RAG在不同模型家族中始终优于现有基线,相对于原始RAG基线分别取得了高达4.76%、9.20%和6.18%的绝对增益。值得注意的是,MAD-RAG纠正了高达74.68%的失败案例,且计算开销可忽略不计。

英文摘要

While Retrieval-Augmented Generation (RAG) is one of the dominant paradigms for enhancing Large Vision-Language Models (LVLMs) on knowledge-based VQA tasks, recent work attributes RAG failures to insufficient attention towards the retrieved context, proposing to reduce the attention allocated to image tokens. In this work, we identify a distinct failure mode that previous studies overlooked: Attention Distraction (AD). When the retrieved context is sufficient (highly relevant or including the correct answer), the retrieved text suppresses the visual attention globally, and the attention on image tokens shifts away from question-relevant regions. This leads to failures on questions the model could originally answer correctly without the retrieved text. To mitigate this issue, we propose MAD-RAG, a training-free intervention that decouples visual grounding from context integration through a dual-question formulation, combined with attention mixing to preserve image-conditioned evidence. Extensive experiments on OK-VQA, E-VQA, and InfoSeek demonstrate that MAD-RAG consistently outperforms existing baselines across different model families, yielding absolute gains of up to 4.76%, 9.20%, and 6.18% over the vanilla RAG baseline. Notably, MAD-RAG rectifies up to 74.68% of failure cases with negligible computational overhead.

2602.12279 2026-06-16 cs.CV cs.AI cs.LG 版本更新

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

UniT:统一多模态思维链测试时扩展

Leon Liangyu Chen, Haoyu Ma, Zhipeng Fan, Ziqi Huang, Animesh Sinha, Xiaoliang Dai, Jialiang Wang, Zecheng He, Jianwei Yang, Chunyuan Li, Junzhe Sun, Chu Wang, Serena Yeung-Levy, Felix Juefei-Xu

发表机构 * Stanford University(斯坦福大学) Meta Superintelligence Labs(Meta超级智能实验室) Nanyang Technological University(南洋理工大学)

AI总结 提出UniT框架,通过多轮推理、验证和细化实现统一多模态模型的测试时扩展,实验表明短推理轨迹可泛化到长链,顺序思维链比并行采样更高效。

Comments CVPR 2026

详情
AI中文摘要

统一模型可以在单一架构内处理多模态理解和生成,但它们通常以单次通过的方式运行,而不迭代地细化输出。许多多模态任务,尤其是那些涉及复杂空间组合、多个交互对象或不断变化的指令的任务,需要分解指令、验证中间结果并进行迭代修正。虽然测试时扩展(TTS)已证明分配额外的推理计算用于迭代推理能显著提升语言模型性能,但将这一范式扩展到统一多模态模型仍然是一个开放挑战。我们引入了UniT,一个用于多模态思维链测试时扩展的框架,使单个统一模型能够在多轮中推理、验证和细化。UniT结合了智能体数据合成、统一模型训练和灵活的测试时推理,以激发包括验证、子目标分解和内容记忆在内的认知行为。我们的关键发现是:(1)在短推理轨迹上训练的统一模型能在测试时泛化到更长的推理链;(2)顺序思维链推理比并行采样提供更可扩展且计算高效的TTS策略;(3)在生成和编辑轨迹上训练能提升分布外视觉推理能力。这些结果确立了多模态测试时扩展作为推进统一模型中生成和理解的有效的范式。

英文摘要

Unified models can handle both multimodal understanding and generation within a single architecture, yet they typically operate in a single pass without iteratively refining their outputs. Many multimodal tasks, especially those involving complex spatial compositions, multiple interacting objects, or evolving instructions, require decomposing instructions, verifying intermediate results, and making iterative corrections. While test-time scaling (TTS) has demonstrated that allocating additional inference compute for iterative reasoning substantially improves language model performance, extending this paradigm to unified multimodal models remains an open challenge. We introduce UniT, a framework for multimodal chain-of-thought test-time scaling that enables a single unified model to reason, verify, and refine across multiple rounds. UniT combines agentic data synthesis, unified model training, and flexible test-time inference to elicit cognitive behaviors including verification, subgoal decomposition, and content memory. Our key findings are: (1) unified models trained on short reasoning trajectories generalize to longer inference chains at test time; (2) sequential chain-of-thought reasoning provides a more scalable and compute-efficient TTS strategy than parallel sampling; (3) training on generation and editing trajectories improves out-of-distribution visual reasoning. These results establish multimodal test-time scaling as an effective paradigm for advancing both generation and understanding in unified models.

2603.01696 2026-06-16 cs.CV cs.AI 版本更新

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

跨模态身份映射:通过强化学习最小化模态转换中的信息损失

Haonan Jia, Shichao Dong, Xin Dong, Zenghui Sun, Jin Wang, Jinsong Lan, Xiaoyong Zhu, Bo Zheng, Kaifu Zhang

发表机构 * Taobao & Tmall Group of Alibaba(淘宝与天猫集团(阿里巴巴)) The University of Hong Kong(香港大学)

AI总结 提出跨模态身份映射(CIM)框架,利用强化学习优化图像描述,通过检索一致性度量信息损失,无需额外标注,显著提升关系推理能力。

Comments Accepted to CVPR 2026

详情
AI中文摘要

大型视觉语言模型(LVLMs)在生成的图像描述中常常遗漏或歪曲关键的视觉内容。最小化这种信息损失将迫使LVLMs关注图像细节以生成精确的描述。然而,由于视觉内容和文本输出之间的模态差距,衡量模态转换过程中的信息损失本质上是困难的。在本文中,我们认为图像描述的质量与使用该描述通过文本搜索检索到的图像之间的相似性正相关。基于这一见解,我们进一步提出了跨模态身份映射(CIM),一种无需额外标注即可增强图像描述的强化学习框架。具体来说,该方法从两个角度定量评估信息损失:图库表示一致性和查询-图库图像相关性。在这些指标的监督下,LVLM最小化信息损失并旨在实现从图像到描述的恒等映射。实验结果表明,我们的方法在图像描述方面表现出优越的性能,即使与监督微调相比也是如此。特别是在COCO-LN500基准上,CIM在Qwen2.5-VL-7B上的关系推理提升了20%。

英文摘要

Large Vision-Language Models (LVLMs) often omit or misrepresent critical visual content in generated image captions. Minimizing such information loss will force LVLMs to focus on image details to generate precise descriptions. However, measuring information loss during modality conversion is inherently challenging due to the modal gap between visual content and text output. In this paper, we argue that the quality of an image caption is positively correlated with the similarity between images retrieved via text search using that caption. Based on this insight, we further propose Cross-modal Identity Mapping (CIM), a reinforcement learning framework that enhances image captioning without requiring additional annotations. Specifically, the method quantitatively evaluates the information loss from two perspectives: Gallery Representation Consistency and Query-gallery Image Relevance. Supervised under these metrics, LVLM minimizes information loss and aims to achieve identity mapping from images to captions. The experimental results demonstrate the superior performance of our method in image captioning, even when compared with Supervised Fine-Tuning. Particularly, on the COCO-LN500 benchmark, CIM achieves a 20% improvement in relation reasoning on Qwen2.5-VL-7B.

2603.03447 2026-06-16 cs.CV 版本更新

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

Proact-VL:用于实时AI伴侣的主动式视频大语言模型

Weicai Yan, Yuhong Dai, Qi Ran, Haodong Li, Wang Lin, Tao Jin, Xing Xie, Hao Liao, Jianxun Lian

发表机构 * Tsinghua University(清华大学)

AI总结 本文提出Proact-VL框架,通过游戏场景中的评论和引导任务,解决实时AI伴侣在低延迟推理、自主响应决策和内容质量控制方面的挑战,实现了主动式实时交互。

Comments ICML 2026

详情
AI中文摘要

主动和实时的交互体验对于类人AI伴侣至关重要,但面临三个关键挑战:(1)在连续流输入下实现低延迟推理,(2)自主决定何时响应,以及(3)控制生成内容的质量和数量以满足实时约束。在这项工作中,我们通过两个游戏场景(评论员和引导员)实例化AI伴侣,这两个场景因其适合自动评估而被选中。我们引入了Live Gaming Benchmark,这是一个包含三种代表性场景(单人评论、协同评论和用户引导)的大规模数据集,并提出了Proact-VL,一个通用框架,将多模态语言模型塑造为能够进行类人环境感知和交互的主动式实时交互代理。大量实验表明,Proact-VL在保持强大视频理解能力的同时,实现了优越的响应延迟和质量,证明了其在实时交互应用中的实用性。

英文摘要

Proactive and real-time interactive experiences are essential for human-like AI companions, yet face three key challenges: (1) achieving low-latency inference under continuous streaming inputs, (2) autonomously deciding when to respond, and (3) controlling both quality and quantity of generated content to meet real-time constraints. In this work, we instantiate AI companions through two gaming scenarios, commentator and guide, selected for their suitability for automatic evaluation. We introduce the Live Gaming Benchmark, a large-scale dataset with three representative scenarios: solo commentary, co-commentary, and user guidance, and present Proact-VL, a general framework that shapes multimodal language models into proactive, real-time interactive agents capable of human-like environment perception and interaction. Extensive experiments show Proact-VL achieves superior response latency and quality while maintaining strong video understanding capabilities, demonstrating its practicality for real-time interactive applications.

2605.01733 2026-06-16 cs.CV cs.AI 版本更新

GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models

GEASS: 基于证据适应的门控选择性描述信任机制用于视觉-语言模型

Zeshang Li, Shuoyang Zhang

发表机构 * University of International Relations(国际关系大学)

AI总结 本文提出GEASS,一种无需训练的模块,通过门控、加权和证据标准来决定模型在每个查询中消耗多少描述信息,从而提升视觉-语言模型的准确性。

Comments 18 pages, 12 figures

详情
AI中文摘要

视觉-语言模型(VLMs)在基于视觉依据的推理(grounded reasoning)方面表现出色,但仍然容易产生物体幻觉(object hallucination)。最近的研究将自动生成的描述视为一种一律有益的资源,但我们发现盲目地嵌入描述可能会降低而不是提高性能:在 HallusionBench 上,Qwen2.5-VL-3B 的准确率下降了近 10 个点。两个结构性质解释了这一点。首先,描述不仅锚定了模型的最终答案,还锚定了其推理轨迹和词汇选择。其次,描述错误是不对称的:遗漏远多于伪造,但每个伪造对单个实例的影响更大。因此,描述的有用性是查询特定的,而不是语料库特定的。我们提出 GEASS(Gated Evidence-Adaptive Selective Caption Trust),一个无需训练的模块,决定模型在每个查询中采纳多少描述信息:它通过干净路径的置信度来门控描述,通过描述带来的熵减少来加权描述,并在两条路径意见不一致时提高证据标准。在 POPE 和 HallusionBench 上对四个 VLM 的实验表明,GEASS 优于原始(vanilla)推理和对比解码,每个查询仅需两次额外的前向传递。

英文摘要

Vision-Language Models (VLMs) hallucinate objects that are not present, and a growing line of work tries to curb this by feeding the model its own generated caption as auxiliary evidence, assuming that a caption, once available, is something to consume. We show this fails: naively appending a caption can lower accuracy rather than raise it, dropping Qwen2.5-VL-3B on HallusionBench by nearly ten points. To understand why, we build GD-Probe, a diagnostic set that pairs a global and a detail question on the same image, so that any difference in caption effect is attributable to the question alone. Caption utility proves to be a per-query property: the same caption helps global questions and harms detail ones, through a single mechanism (an embedded caption competes with the image for attention and pulls the model's evidence onto its own text) whose sign is set by whether the caption covers the queried content. Crucially, this regime is readable from quantities the decoder already emits, with no attention access or grounding. We turn this into GEASS (Gated Evidence-Adaptive Selective Caption Trust), a training-free, logit-level module that decides per query how much of the caption to trust, gating it by the clean path's confidence, weighting it by the entropy reduction it induces, and raising the evidence bar when the two pathways disagree. Across four VLMs and two benchmarks (POPE and HallusionBench), GEASS improves over both vanilla inference and contrastive decoding under a single fixed setting, adding only two forward passes and no parameters.
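One way the three ingredients named in the abstract (a confidence gate, an entropy-reduction weight, and a stricter bar on disagreement) could fit together is sketched below. The abstract gives no formulas, so the thresholds, the relative entropy-reduction weight, and the linear mixing of the two answer distributions are all assumptions, not the GEASS algorithm itself.

```python
import math


def entropy(probs):
    """Shannon entropy (nats) of an answer distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def caption_trust(p_clean, p_caption, conf_gate=0.9, disagree_bar=0.5):
    """Return a trust weight in [0, 1] for the caption-conditioned path.

    p_clean: answer distribution without the caption (the clean path).
    p_caption: answer distribution with the caption embedded.
    conf_gate and disagree_bar are made-up illustration thresholds.
    """
    if max(p_clean) >= conf_gate:
        return 0.0  # gate: the clean path is already confident
    h_clean = entropy(p_clean)
    if h_clean == 0:
        return 0.0
    # weight: relative entropy reduction induced by the caption
    gain = max(0.0, h_clean - entropy(p_caption)) / h_clean
    agree = p_clean.index(max(p_clean)) == p_caption.index(max(p_caption))
    if not agree and gain < disagree_bar:
        return 0.0  # raised evidence bar when the two paths disagree
    return gain


def fuse(p_clean, p_caption):
    """Mix the two paths according to the trust weight (assumed linear mixing)."""
    w = caption_trust(p_clean, p_caption)
    return [(1 - w) * c + w * q for c, q in zip(p_clean, p_caption)]
```

In this sketch a caption that sharpens an uncertain answer without flipping it gets partial trust, while a caption that flips the answer with only a small entropy reduction is ignored.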

2605.10157 2026-06-16 cs.CV cs.CL 版本更新

MolSight: Molecular Property Prediction with Images

MolSight: 基于图像的分子属性预测

Aaditya Baranwal, Akshaj Gupta, Yogesh S Rawat, Shruti Vyas

发表机构 * University of Central Florida(中佛罗里达大学) Birla Institute of Technology and Science(博拉理工学院)

AI总结 MolSight首次系统研究基于视觉的分子属性预测,通过10种视觉架构和7种预训练策略,在10个下游任务中展示性能,提出化学引导课程提升效果,以更低的FLOPs实现优异结果。

详情
AI中文摘要

每种合成分子均可绘制为2D骨架图,但现代属性预测更关注分子图、3D构象或十亿参数级语言模型。我们提出MolSight,首次系统研究基于视觉的分子属性预测。使用10种视觉架构、7种预训练策略和200万张分子图像,在10个下游任务中评估性能,涵盖物理性质回归、药物发现分类和量子化学预测。为应对预训练分子结构复杂度差异,提出化学引导课程:五种结构复杂度描述符将语料库分为五个难度递增的层级,持续优于非课程基线。证明单个渲染的键线式(bond-line)图像经视觉编码器处理即可实现有竞争力的分子属性预测,即仅凭视觉获得化学洞察。最佳课程训练配置在10个基准中的5个上取得最佳结果,在全部10个上均位列前二,FLOPs比最接近的多模态竞争者低80倍。

英文摘要

Every molecule ever synthesised can be drawn as a 2D skeletal diagram, yet in modern property prediction this universally available representation has received less focus in favour of molecular graphs, 3D conformers, or billion-parameter language models, each imposing its own computational and data-engineering overhead. We present MolSight, the first systematic large-scale study of vision-based Molecular Property Prediction (MPP). Using 10 vision architectures, 7 pre-training strategies, and 2M molecule images, we evaluate performance across 10 downstream tasks spanning physical-property regression, drug-discovery classification, and quantum-chemistry prediction. To account for the wide variation in structural complexity across pre-training molecules, we further propose a chemistry-informed curriculum: five structural complexity descriptors partition the corpus into five tiers of increasing chemical difficulty, consistently outperforming non-curriculum baselines. We show that a single rendered bond-line image, processed by a vision encoder, is sufficient for competitive molecular property prediction, i.e. chemical insight from sight alone. The best curriculum-trained configuration achieves the top result on 5 of 10 benchmarks and top two on all 10, at 80× lower FLOPs than the nearest multi-modal competitor.

2605.18313 2026-06-16 cs.CV cs.AI 版本更新

Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering

Wasserstein均衡解码用于可靠的医疗视觉问答

Luca Hagen, Johanna P. Müller, Weitong Zhang, Mengyun Qiao, Bernhard Kainz

发表机构 * Friedrich-Alexander University Erlangen-Nürnberg(弗里德里希-亚历山大厄林根-纽伦堡大学) Imperial College London(伦敦帝国理工学院) University College London(伦敦大学学院)

AI总结 本文提出了一种基于Wasserstein距离的均衡解码方法,用于改进医疗视觉问答系统,通过语义感知的停止准则提高解码效率和准确性,同时在VQA-RAD和PathVQA数据集上实现了显著的性能提升。

详情
AI中文摘要

小型视觉-语言模型(2-8B)非常适合临床部署,因为隐私限制、有限的连接性和低延迟要求都更有利于设备端或本地部署推理。然而,其有限的容量会加剧“看似合理但实际错误”的输出。我们将此前仅限于纯文本、封闭式NLP任务的博弈论解码方法扩展到面向开放式医疗视觉问答(VQA)的视觉-语言模型。我们引入了一种语义感知的Wasserstein停止准则,以取代词汇顺序匹配,使收敛基于近义候选答案之间的语义共识,避免因临床等效的排名交换导致不必要的迭代。在VQA-RAD和PathVQA上,我们相对于贪心和判别式基线取得了一致且具有统计显著性的改进。在VQA-RAD上,我们将Qwen3-VL-2B提高了3.5个百分点(p < 0.01),超过了贪心解码的4B模型,在更大规模上也呈现相似趋势。在PathVQA上,尽管没有经过领域特定的微调,使用BDG的Gemma-3-4B与贪心解码下的MedGemma-4B表现相当。在与经典BDG准确率持平的情况下,Wasserstein准则将平均收敛迭代次数减少了约20%,在提高推理效率的同时保留了博弈论均衡行为。代码可在 https://github.com/luca-hagen/Wasserstein-BDG-medical-VQA 获得。

英文摘要

Small vision-language models (2-8B) are well-suited for clinical deployment due to privacy constraints, limited connectivity, and low-latency requirements favouring on-device or on-premise inference. However, their limited capacity exacerbates the generation of plausible but incorrect outputs. We extend game-theoretic decoding, previously restricted to text-only, closed-ended NLP tasks, to vision-language models for open-ended Medical VQA. We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, enabling convergence based on semantic consensus among near-synonymous candidate answers and avoiding unnecessary iterations caused by clinically equivalent ranking swaps. On VQA-RAD and PathVQA, we obtain consistent, statistically significant improvements over greedy and discriminative baselines. On VQA-RAD, we improve Qwen3-VL-2B by +3.5 percentage points (p < 0.01), surpassing the greedy 4B model, with similar trends at larger scales. On PathVQA, Gemma-3-4B with BDG matches MedGemma-4B under greedy decoding despite no domain-specific fine-tuning. At accuracy parity with classic BDG, the Wasserstein criterion reduces average convergence iterations by approximately 20%, improving inference efficiency while preserving the game-theoretic equilibrium behaviour. Code is available at https://github.com/luca-hagen/Wasserstein-BDG-medical-VQA.

2. 具身智能、机器人与自动驾驶 45 篇

2606.14752 2026-06-16 cs.CV cs.AI cs.LG cs.RO 新提交

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

X-Tokenizer: 一种用于视觉-语言-动作预训练的多模态动作分词器

Xirui Kang, Yanpei Shi, Lucy Liang, Roy Gan, Dongxiu Liu, Pushi Zhang, Danpeng Chen, Xiaoyi Qin, Yinan Zheng, Jinliang Zheng, Hao Wang, Xianyuan Zhan, Hang Su

发表机构 * Square Robot City University of Hong Kong(香港城市大学) Tsinghua University(清华大学)

AI总结 提出X-Tokenizer,通过语义残差量化(SRQ)和掩码动作建模(MAM)将动作离散化为语义接口,在2.4M轨迹上预训练后提升VLA模型的多模态接地和长程任务性能。

Comments Project page: https://x-square-robot.github.io/X-Tokenizer_projectPage/

详情
AI中文摘要

现代视觉-语言-动作(VLA)模型必须桥接预训练的视觉-语言推理和精确的连续机器人控制。现有的动作分词器主要为了重建而离散化动作,产生的编码保留了运动几何结构,但仅向主干网络提供弱语义监督。因此,我们将动作分词化不仅视为压缩,而是作为多模态推理与可执行控制之间的语义接口学习。为此,我们引入了X-Tokenizer,一种轻量级的编码器-语义残差量化(SRQ)-解码器架构,为多种机械臂形态提供共享的动作接口。其关键组件SRQ在残差向量量化上施加了非对称结构:第一层通过掩码动作建模(MAM)训练,形成捕获粗略运动意图的离散动作语言,而更深层则保持面向重建的残差,保留细粒度细节。为了进一步将动作标记与多模态语义对齐,X-Tokenizer通过与预训练基础模型的表示空间进行对比对齐以及下一帧视觉-语言特征预测进行预训练。在2.4M轨迹(2.0B动作帧)上预训练后,单个冻结的X-Tokenizer作为表示塑造的监督信号插入混合离散-连续VLA中。X-Tokenizer在真实世界聚合指标上达到最佳,并在RoboTwin 2.0模拟中表现强劲。在多模态接地(+13.5%)和长程任务(+8.25)上优于FAST,表明动作分词器作为VLA预训练的语义接口,而不仅仅是动作压缩。

英文摘要

Modern Vision-Language-Action (VLA) models must bridge pretrained vision-language reasoning and precise continuous robot control. Existing action tokenizers discretize actions primarily for reconstruction, producing codes that preserve motion geometry but provide only weak semantic supervision to the backbone. We therefore formulate action tokenization not as mere compression, but as semantic interface learning between multimodal reasoning and executable control. To this end, we introduce X-Tokenizer, a lightweight encoder-Semantic Residual Quantization (SRQ)-decoder architecture that provides a shared action interface across diverse robotic arm embodiments. Its key component, SRQ, imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling (MAM) to form a discrete action language that captures coarse motion intent, while deeper levels remain reconstruction-oriented residuals that preserve fine-grained details. To further align action tokens with multimodal semantics, X-Tokenizer is pretrained with contrastive alignment to the representation space of a pretrained foundation model and with next-frame vision-language feature prediction. Pretrained on 2.4M trajectories (2.0B action frames), a single frozen X-Tokenizer plugs into a mixed discrete-continuous VLA as a representation-shaping supervision signal. X-Tokenizer achieves top real-world aggregate and strong RoboTwin 2.0 simulation results. Outperforming FAST in multimodal grounding (+13.5%) and long-horizon tasks (+8.25), it shows that action tokenizers serve as semantic interfaces for VLA pretraining beyond mere action compression.
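
The abstract builds SRQ on residual vector quantization. As a rough illustration of that underlying mechanism only, the sketch below shows plain residual quantization in which each level encodes what the previous level left over. It does not include the Masked Action Modeling objective or any training, and the codebook values and function names are hypothetical.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vec))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vec, codebooks):
    """Quantize vec level by level; each level encodes the previous residual."""
    residual = list(vec)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes


def rvq_decode(codes, codebooks):
    """Reconstruct by summing the selected codeword from every level."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


# Toy 2-D action with one coarse and one fine level (hypothetical values).
coarse = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
fine = [[0.0, 0.0], [0.25, 0.0], [0.0, 0.25], [-0.25, 0.0]]
codes = rvq_encode([1.2, 0.1], [coarse, fine])
recon = rvq_decode(codes, [coarse, fine])
```

In this toy setup the first code carries the coarse direction of the motion and the second code only refines it, which is the kind of asymmetry between levels that the abstract describes SRQ as exploiting.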

2606.14772 2026-06-16 cs.CV cs.AI 新提交

ScoutVLA: UAV-Centric Active Perception via a Dual-Expert VLA Model for Open-World Embodied Question Answering

ScoutVLA:面向开放世界具身问答的无人机中心主动感知双专家VLA模型

Wenhao Lu, Zhengqiu Zhu, Xiaofeng Wang, Xiaoran Zhang, Yatai Ji, Yong Zhao, Yue Hu, Yingzhen Nie, Jinlong Zhu, Zheng Zhu

发表机构 * National Key Laboratory of Digital Intelligent Modeling and Simulation, National University of Defense Technology(国防科技大学数字智能建模与仿真国家重点实验室) GigaAI

AI总结 针对无人机在室外具身问答中细粒度视角调整不足的问题,提出ScoutVLA模型,采用解耦双专家架构(视觉语言专家推断语义意图,动作专家生成连续视角调整轨迹),并通过知识隔离机制平衡连续控制与语义推理,在仿真和真实实验中显著优于基线方法。

详情
AI中文摘要

空中具身问答(EQA)要求无人机(UAV)主动感知环境并回答自然语言问题。现有的室外EQA系统通常在目标进入无人机视野后停止,导致证据寻求型问题所需的细粒度视角调整基本未得到解决。为解决此问题,我们引入FG-EQA,一个细粒度主动感知EQA基准,包含超过4万条模拟轨迹和1千条真实轨迹。受侦察蜂“摇摆舞”的启发(它们迭代调整飞行路径以验证目标信息),我们提出ScoutVLA,一种用于室外EQA的证据驱动视觉-语言-动作模型。为模拟这种主动探索行为,ScoutVLA采用解耦双专家架构:视觉语言专家推断语义意图以识别缺失证据,而独立动作专家使用高自由度流匹配生成连续视角调整轨迹。为平衡连续控制和语义推理的竞争需求,我们设计了一种解耦训练策略,其中包含知识隔离机制,防止动作梯度抹除模型的多模态推理能力。大量仿真实验和定性真实世界实地研究均验证了ScoutVLA相对于最先进基线的优越性,平均严格成功率高10.48倍,平均QA正确率高7.72倍。

英文摘要

Aerial Embodied Question Answering (EQA) requires Unmanned Aerial Vehicles (UAVs) to actively perceive the environment and answer natural language questions. Existing outdoor EQA systems usually stop once the target enters the UAV's field of view, leaving the fine-grained viewpoint adjustment needed for evidence-seeking questions largely unresolved. To address this issue, we introduce FG-EQA, a fine-grained active perception EQA benchmark with more than 40K simulated trajectories and 1K real-world trajectories. Drawing inspiration from the "waggle dance" of scout bees, which iteratively adjust their flight paths to verify target information, we propose ScoutVLA, an evidence-driven Vision-Language-Action model for outdoor EQA. To emulate this active exploration behavior, ScoutVLA features a decoupled dual-expert architecture: a vision-language expert infers the semantic intent to identify missing evidence, while an independent action expert employs high-DoF flow matching to generate continuous viewpoint-refinement trajectories. To balance the competing demands of continuous control and semantic reasoning, we devise a decoupled training strategy with a knowledge insulation mechanism that prevents the action gradients from erasing the model's multimodal reasoning ability. Extensive simulated experiments and a qualitative real-world field study both verify the superiority of ScoutVLA over the state-of-the-art baselines, demonstrating a 10.48× higher average strict success rate and a 7.72× higher average QA correctness.

2606.14841 2026-06-16 cs.CV 新提交

Multi-HMR 2: Multi-Person Camera-Centric Human Detection, Mesh Recovery and Tracking

Multi-HMR 2:多人相机中心人体检测、网格恢复与跟踪

Guénolé Fiche, Philippe Weinzaepfel, Romain Brégier, Fabien Baradel

发表机构 * NAVER LABS Europe(NAVER LABS欧洲)

AI总结 提出基于DETR的框架Multi-HMR 2,联合预测场景一致相机和人体网格,实现度量3D定位与跟踪,无需真实内参或视频监督,在保持骨盆中心性能的同时显著提升检测与定位精度。

详情
AI中文摘要

人体网格恢复(HMR)的大多数进展集中在骨盆中心恢复,忽视了相机坐标系中的度量3D定位和检测精度,而这两个因素对于人机交互和社交场景理解等实际应用至关重要。当前的评估协议通常忽略这些方面,强调每人的根中心恢复而非相机空间感知。因此,现有方法依赖于固定的相机假设或手工后处理,限制了其鲁棒性和实际部署。我们提出了Multi-HMR 2,一个简单而鲁棒的基于DETR的框架,用于多人相机中心的人体检测、网格恢复和跟踪。Multi-HMR 2预测一个场景一致的相机以及人体网格,无需真实内参即可实现度量3D定位。此外,通过从SAM2中蒸馏基于图像的记忆特征,Multi-HMR 2扩展到跟踪,无需视频监督即可实现一致的身份关联。尽管概念简单(无手工组件、无视频输入、无真实相机),Multi-HMR 2在保持最先进的骨盆中心性能的同时,显著提高了检测精度和度量3D定位。

英文摘要

Most advances in human mesh recovery (HMR) have focused on pelvis-centered recovery, overlooking metric 3D localization and detection accuracy in the camera coordinate system - two key factors for real-world applications such as human-robot interaction and social scene understanding. Current evaluation protocols often ignore these aspects, emphasizing per-person, root-centered recovery rather than camera-space perception. As a result, existing approaches rely on fixed camera assumptions or handcrafted post-processing, limiting their robustness and practical deployment. We introduce Multi-HMR 2, a simple yet robust DETR-based framework for Multi-person Camera-centric Human detection, mesh Recovery, and tracking. Multi-HMR 2 predicts a scene-consistent camera together with human meshes, enabling metric 3D localization without ground-truth intrinsics. Moreover, by distilling image-based memory features from SAM2, Multi-HMR 2 extends to tracking, achieving consistent identity association without video supervision. Despite its conceptual simplicity - no handcrafted components, no video input, and no ground-truth cameras - Multi-HMR 2 achieves state-of-the-art pelvis-centered performance while substantially improving detection accuracy and metric 3D localization.

2606.15099 2026-06-16 cs.CV cs.LG cs.RO 新提交

Think Less, Act Early: Reinforced Latent Reasoning with Early Exit in Vision-Language-Action Models

少思考,早行动:视觉-语言-动作模型中带早退的强化潜在推理

Dianqiao Lei, Lianlei Shan

AI总结 提出AVA-VLA框架,通过强化学习去噪和早退策略优化潜在推理轨迹,在LIBERO上实现6倍推理加速和98.3%平均成功率。

Comments Accepted at ICML 2026

详情
AI中文摘要

现有的视觉-语言-动作(VLA)模型主要依赖显式的思维链(CoT)推理来桥接感知和动作。虽然有效,但这种范式在多步骤任务中面临高计算成本和错误传播的问题。在本文中,我们提出了自适应变量对齐VLA(AVA-VLA),一种新颖的潜在推理VLA框架,将推理建模为一系列不可观测的潜在变量,绕过了显式文本生成的需求。然而,潜在轨迹本质上容易受到噪声干扰和与下游目标不对齐的影响。为了解决这个问题,我们引入了一种基于强化学习的去噪机制,将潜在状态生成视为一个顺序决策过程,通过任务级奖励优化推理轨迹。此外,我们结合了一种早退策略,根据状态置信度自适应地终止推理,实现了深度和效率之间的动态权衡。在具身决策基准上的大量实验表明,AVA-VLA在LIBERO上实现了比显式CoT方法6倍的推理加速,同时达到了98.3%的平均成功率,在效率和长期稳定性上均优于全推理基线。

英文摘要

Existing Vision-Language-Action (VLA) models predominantly rely on explicit Chain-of-Thought (CoT) reasoning to bridge perception and action. While effective, this paradigm suffers from high computational costs and error propagation in multi-step tasks. In this paper, we propose Adaptive Variable Alignment VLA (AVA-VLA), a novel Latent Reasoning VLA framework that models reasoning as a sequence of unobservable latent variables, bypassing the need for explicit text generation. However, latent trajectories are inherently susceptible to noise interference and misalignment with downstream objectives. To address this, we introduce a Reinforcement Learning-based Denoising mechanism that treats latent state generation as a sequential decision process, optimizing reasoning trajectories via task-level rewards. Furthermore, we incorporate an Early-Exit Strategy that adaptively terminates reasoning based on state confidence, enabling a dynamic trade-off between depth and efficiency. Extensive experiments on embodied decision benchmarks demonstrate that AVA-VLA achieves a 6x inference speedup over explicit CoT methods while attaining a 98.3% average success rate on LIBERO, improving both efficiency and long-horizon stability over full-reasoning baselines.
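
The early-exit idea in the abstract can be pictured as a bounded loop that stops as soon as a confidence estimate clears a threshold. The sketch below is only an illustration under assumptions: `step_fn`, `confidence_fn`, `max_steps` and `threshold` are hypothetical stand-ins, and the abstract does not say how AVA-VLA computes state confidence.

```python
def reason_with_early_exit(step_fn, confidence_fn, state, max_steps=8, threshold=0.9):
    """Run latent reasoning steps until confidence reaches the threshold.

    step_fn and confidence_fn are hypothetical placeholders for the model's
    latent update and its confidence estimate. Returns the final state and
    the number of steps actually executed.
    """
    steps = 0
    while steps < max_steps and confidence_fn(state) < threshold:
        state = step_fn(state)
        steps += 1
    return state, steps


# Toy run: the state doubles as its own confidence and grows by 0.25 per step.
final_state, used = reason_with_early_exit(
    step_fn=lambda s: s + 0.25,
    confidence_fn=lambda s: s,
    state=0.5,
)
```

Easy inputs leave the loop after a few steps while hard ones run up to `max_steps`, which is where the depth versus efficiency trade-off mentioned in the abstract would come from.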

2606.15142 2026-06-16 cs.CV cs.RO 新提交

MotionVLA: Vision-Language-Action Model for Humanoid Motion

MotionVLA:面向人形运动的视觉-语言-动作模型

Nonghai Zhang, Siyu Zhai, Yanjun Li, Zeyu Zhang, Zhihan Yin, Yandong Guo, Boxin Shi, Hao Tang

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) AI 2 Robotics

AI总结 针对人形运动生成中低频姿态与高频物理信号量化不匹配的问题,提出双流频率分词器DSFT和基于Qwen3.5的MotionVLA模型,在HumanML3D上缩小与真实数据的多样性差距,并在MBench上提升运动条件一致性。

详情
AI中文摘要

从场景图像和文本生成逼真的人形运动涉及低频姿态语义和高频物理动力学。然而,许多现有方法使用单个共享码本对运动进行分词,将异质运动信号强制映射到相同的量化空间。我们对人体运动数据的频域分析揭示了单码本量化与运动统计之间的明显不匹配:五个DCT系数捕获了93%的关节位置能量,但仅捕获了37%的关节速度能量,这可能导致量化偏向姿态统计,而低估高频速度分量。第二个挑战在于使标准自回归模型有效建模运动序列中的高频物理信号。因此,我们提出了DSFT,一种双流频率分词器,将运动分离为基础流和物理流,并使用DCT截断和BPE独立压缩它们。此外,我们提出了MotionVLA,一个基于Qwen3.5的模型,将基础令牌和物理令牌排列在统一序列中,其中物理令牌在基础令牌之后预测。在HumanML3D和MBench上的实验表明,尽管使用轻量级2B骨干网络,MotionVLA在HumanML3D上将与真实数据的多样性差距减少了50%以上,并在MBench上将运动条件一致性提高了3.8%,表明频率感知的双流解耦是自回归运动生成的一种有效建模方式。代码:https://github.com/AIGeeksGroup/MotionVLA。网站:https://aigeeksgroup.github.io/MotionVLA。

英文摘要

Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
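
The position versus velocity statistic in the abstract is a statement about energy compaction under the DCT. The sketch below shows how such a share could be measured on a synthetic one-dimensional trajectory. The signal, its length and the finite-difference velocity are assumptions made for illustration, so the numbers it yields are not the paper's 93% and 37%.

```python
import math


def dct2(signal):
    """Orthonormal DCT-II of a 1-D sequence."""
    n = len(signal)
    out = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        total = sum(x * math.cos(math.pi * (i + 0.5) * k / n)
                    for i, x in enumerate(signal))
        out.append(scale * total)
    return out


def energy_fraction(signal, keep):
    """Share of total energy held by the first `keep` DCT coefficients."""
    coeffs = dct2(signal)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:keep]) / total if total > 0 else 0.0


# Synthetic joint trajectory: a slow sweep plus a small fast oscillation.
n = 64
position = [math.sin(2 * math.pi * t / n) + 0.05 * math.sin(2 * math.pi * 12 * t / n)
            for t in range(n)]
# Finite-difference velocity; differencing amplifies the fast component.
velocity = [b - a for a, b in zip(position, position[1:])]

pos_share = energy_fraction(position, keep=5)
vel_share = energy_fraction(velocity, keep=5)
```

On this toy signal the five-coefficient share is higher for position than for velocity, because differentiation boosts high-frequency content that a low-order truncation discards. That is the qualitative mismatch the abstract gives as motivation for separate Base and Phys streams.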

2606.15287 2026-06-16 cs.CV 新提交

G2IA: Geometry-Guided Instance-Aware Retrieval and Refinement for Cross-Modal Place Recognition

G2IA: 几何引导的实例感知跨模态地点识别检索与精炼

Xianyun Jiao, Jingyi Xu, Zhongmiao Yan, Xieyuanli Chen, Lin Pei

发表机构 * Shanghai Jiao Tong University(上海交通大学) National University of Defense Technology(国防科技大学)

AI总结 提出G2IA框架,通过几何引导的实例感知检索和跨模态局部形状与空间布局验证,解决图像到点云地点识别中的模态差异和感知混淆问题。

详情
AI中文摘要

跨模态地点识别(CMPR)使仅搭载相机的机器人在自主导航场景中能够根据预先构建的激光雷达地图进行定位。这种图像到点云的设置面临两种耦合的模糊性:透视RGB外观与稀疏度量几何之间的模态差异,以及具有相似道路、立面、交叉口和物体布局的城市地点之间的感知混淆。我们不将CMPR视为单一的全局描述符匹配问题,而是认为可靠的检索需要几何感知表示对齐和细粒度候选验证。本文提出G2IA,一个几何引导的实例感知框架,用于图像到点云的地点识别。在检索阶段,来自VGGT的视觉几何先验和实例特征被整合,以构建与激光雷达地图表示更兼容的地点描述符。在精炼阶段,通过显式验证局部实例形状及其相对空间布局在跨模态下是否一致,对检索到的候选进行重新排序。在公开基准上的实验表明,G2IA在不同定位阈值下一致地改善了图像到点云的地点识别,并表现出强大的跨数据集泛化能力。

英文摘要

Cross-modal place recognition (CMPR) enables camera-only robots to localize against pre-built LiDAR maps in autonomous navigation scenarios. This image-to-point-cloud setting is challenged by two coupled ambiguities: the modality gap between perspective RGB appearance and sparse metric geometry, and perceptual aliasing among urban places with similar roads, facades, intersections, and object arrangements. Instead of treating CMPR as a single global descriptor matching problem, we argue that reliable retrieval requires both geometry-aware representation alignment and fine-grained candidate verification. In this paper, we propose G2IA, a geometry-guided instance-aware framework for image-to-point-cloud place recognition. In the retrieval stage, visual geometry priors from VGGT and instance features are integrated to construct place descriptors that are more compatible with LiDAR-derived map representations. In the refinement stage, the retrieved candidates are re-ranked by explicitly verifying whether local instance shapes and their relative spatial layouts are consistent across modalities. Experiments on public benchmarks demonstrate that G2IA consistently improves image-to-point-cloud place recognition under different localization thresholds, and exhibits strong cross-dataset generalization.
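
The two stages described in the abstract follow a retrieve-then-re-rank pattern, sketched below in a generic form. Cosine similarity over global descriptors and the `verify_fn` callback are assumptions for illustration. In G2IA the verification checks instance shapes and spatial layouts across modalities, which this sketch reduces to a precomputed score.

```python
def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0


def retrieve_then_rerank(query_desc, map_descs, verify_fn, top_k=3):
    """Stage 1: shortlist map entries by global-descriptor similarity.
    Stage 2: reorder the shortlist by a verification score (higher is better).
    """
    ranked = sorted(range(len(map_descs)),
                    key=lambda i: cosine(query_desc, map_descs[i]),
                    reverse=True)
    shortlist = ranked[:top_k]
    return sorted(shortlist, key=verify_fn, reverse=True)


# Hypothetical descriptors and verification scores for four map places.
map_descs = [[1.0, 0.1], [0.9, 0.2], [0.0, 1.0], [1.0, 0.0]]
verify_scores = {0: 0.9, 1: 0.5, 2: 0.1, 3: 0.2}
result = retrieve_then_rerank([1.0, 0.0], map_descs,
                              verify_fn=lambda i: verify_scores[i], top_k=2)
```

Here place 3 wins the global-descriptor stage but place 0 wins after verification, showing how a fine-grained check can overturn a perceptually aliased top match.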

2606.15341 2026-06-16 cs.CV 新提交

CausalDrive: Real-time Causal World Models for Autonomous Driving

CausalDrive: 用于自动驾驶的实时因果世界模型

Tianyi Yan, Huan Zheng, Dubing Chen, Meizhi Qu, Yingying Shen, Lijun Zhou, Mingfei Tu, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun, Cheng-zhong Xu, Jianbing Shen

发表机构 * SKL-IOTSC, CIS, University of Macau(澳门大学智慧城市物联网国家重点实验室,电脑及资讯科学系) Xiaomi EV(小米汽车) CASIA(中国科学院自动化研究所)

AI总结 提出CausalDrive,一种可控、实时的驾驶世界渲染器,通过因果预测和Context-Forced DMD架构实现交互式模拟,支持闭环评估、强化学习后训练和人在环仿真。

详情
AI中文摘要

世界模型已成为扩展自动驾驶数据的有前景范式,但现有的视频生成模型作为交互式模拟器仍有不足。基于布局的渲染器依赖所有背景智能体的“预言”未来轨迹,使其严格非反应式。相反,纯动作条件预测器缺乏对复杂交互的语义控制,并受限于高昂的扩散延迟,阻碍了闭环策略学习。为弥补这一差距,我们提出CausalDrive,一种可控、实时的基础驾驶世界渲染器。CausalDrive仅基于初始前视图、自车轨迹和宏观文本提示运行。通过排除未来NPC布局,我们迫使模型内在预测因果交互,实现对驾驶社会学的文本驱动控制,允许用户动态编排对相同自车动作的不同反事实反应。为克服效率瓶颈并解决自回归生成中的协变量偏移,我们提出新颖的Context-Forced DMD架构。该架构结合连续流匹配与自校正蒸馏目标,实现12 FPS的交互速度。这一突破将被动视频生成器转变为可玩的神经模拟器。我们在三个下游应用中展示了其多功能性:(1)生成式闭环评估,显著减轻碰撞伪影;(2)由Video2Reward模块驱动的大规模强化学习后训练;(3)实时人在环仿真。大量实验验证,在CausalDrive反应式场景中训练的策略在现实世界中表现出更优的交互能力。

英文摘要

World models have emerged as a promising paradigm for scaling autonomous driving (AD) data, yet existing video generative models fall short as interactive simulators. Layout-conditioned renderers rely on "oracle" future trajectories of all background agents, rendering them strictly non-reactive. Conversely, pure action-conditioned predictors lack semantic control over complex interactions and suffer from prohibitive diffusion latencies, hindering closed-loop policy learning. To bridge this gap, we present CausalDrive, a controllable, real-time foundation driving world renderer. CausalDrive operates solely on the initial front-view frame, the ego-vehicle's trajectory, and a macroscopic text prompt. By excluding future NPC layouts, we compel the model to intrinsically predict causal interactions, enabling text-driven control over Driving Sociology, allowing users to dynamically orchestrate diverse counterfactual reactions to identical ego-actions. To overcome the efficiency bottleneck and address the covariate shift in autoregressive generation, we propose a novel Context-Forced DMD architecture. This combines continuous flow-matching with a self-correcting distillation objective, achieving interactive speeds of 12 FPS. This breakthrough transforms the passive video generator into a playable neural simulator. We demonstrate its versatility across three downstream applications: (1) generative closed-loop evaluation with significantly mitigated collision artifacts, (2) large-scale Reinforcement Learning (RL) post-training driven by a Video2Reward module, and (3) real-time human-in-the-loop simulation. Extensive experiments validate that policies trained within CausalDrive's reactive scenarios exhibit superior interaction capabilities in the real world.

2606.15869 2026-06-16 cs.CV 新提交

Metis: A Generalizable and Efficient World-Action Model for Autonomous Driving and Urban Navigation

Metis: 一种用于自动驾驶和城市导航的通用高效世界-动作模型

Jingyu Li, Zhe Liu, Dongnan Hu, Junjie Wu, Zipei Ma, Wenxiao Wu, Chao Han, Zhihui Hao, Zhikang Liu, Kun Zhan, Jiankang Deng, Xiatian Zhu, Li Zhang

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) The University of Hong Kong(香港大学) Tongji University(同济大学) Li Auto Inc.(理想汽车) Huazhong University of Science and Technology(华中科技大学) Imperial College London(伦敦帝国理工学院) University of Surrey(萨里大学)

AI总结 提出Metis框架,通过解耦视频生成与动作预测,采用Mixture-of-Transformers架构和非对称注意力掩码,实现高效推理与泛化,在多个导航基准上取得最优性能。

详情
AI中文摘要

世界-动作模型(WAMs)在自动驾驶和城市导航中展现出巨大潜力。基于视觉-语言-动作模型或视频生成模型的现有方法存在关键限制:(1)测试时因预测未来观测而导致高推理延迟,(2)视频与动作建模紧密耦合导致表示不匹配和泛化能力下降。为解决这两个问题,我们提出Metis,一种端到端WAM框架,将视频生成与动作预测解耦。具体而言,Metis采用混合Transformer(Mixture-of-Transformers)架构,包含分别专用于视频生成和动作预测的专家,保留了每个任务的内在分布特性。为提高效率,我们引入非对称注意力掩码,使得两个专家能够联合训练,同时允许动作模型在推理时绕过显式视频生成。这种设计确保了训练-推理一致性,并在不牺牲规划性能的情况下显著降低计算成本。大量实验表明,Metis在NAVSIM navhard和navtest基准以及CityWalker导航基准上取得了最先进的性能,验证了其在多样化任务中的泛化能力和效率。真实机器人部署进一步证实了我们方法的实际可行性。

英文摘要

World action models (WAMs) have shown great promise for autonomous driving and urban navigation. Built upon Vision-Language-Action models or video generation models, existing approaches suffer key limitations: (1) high inference latency due to future observation prediction at test time, and (2) tightly coupled video and action modeling leading to representational mismatch and degraded generalization. To address both issues, we propose Metis, an end-to-end WAM framework that decouples video generation and action prediction. Specifically, Metis employs a Mixture-of-Transformers architecture with dedicated experts for video generation and action prediction, preserving the intrinsic distributional properties of each task. To enhance efficiency, we introduce an asymmetric attention mask that enables joint training of both experts while allowing the action model to bypass explicit video generation during inference. This design ensures training-inference consistency and significantly reduces computational costs without compromising planning performance. Extensive experiments demonstrate state-of-the-art performance on the NAVSIM navhard and navtest benchmarks and the CityWalker navigation benchmark, validating both the generalizability and efficiency across diverse tasks. Real-robot deployments further confirm the practical feasibility of our approach.
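
One way to picture an asymmetric attention mask of the kind the abstract mentions is sketched below. The token layout (context, then action, then video) and the exact visibility rules are assumptions, since the abstract does not specify them. The point is only that if action queries never attend to video keys, the video tokens can be removed at inference without changing what the action tokens see.

```python
def asymmetric_mask(n_ctx, n_act, n_vid):
    """mask[q][k] is True when query token q may attend to key token k.

    Assumed token order: context, then action, then video. Context and
    action queries never see video keys; video queries see every token.
    """
    n = n_ctx + n_act + n_vid
    vid_start = n_ctx + n_act
    mask = []
    for q in range(n):
        if q < vid_start:
            row = [k < vid_start for k in range(n)]  # no access to video keys
        else:
            row = [True] * n                          # video sees everything
        mask.append(row)
    return mask


# Training uses all three groups; inference drops the video tokens entirely.
train_mask = asymmetric_mask(n_ctx=2, n_act=2, n_vid=3)
infer_mask = asymmetric_mask(n_ctx=2, n_act=2, n_vid=0)
```

Under this layout the action rows of the training mask, restricted to non-video keys, are identical to the inference mask, which is one concrete reading of the training-inference consistency claimed in the abstract.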

2606.16202 2026-06-16 cs.CV cs.AI cs.RO 新提交

EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video

EgoPhys: 从第一人称视频学习可变形物体的通用物理模型

Hyunjin Kim, Ri-Zhao Qiu, Guangqi Jiang, Xiaolong Wang

发表机构 * UC San Diego(加州大学圣地亚哥分校)

AI总结 提出EgoPhys框架,从第一人称RGB视频中通过可泛化先验构建可变形物体的物理数字孪生,无需测试时优化即可预测弹簧刚度场,在重建、未来预测和零样本泛化上优于基线。

Comments Project Page: https://hjhyunjinkim.github.io/EgoPhys

详情
AI中文摘要

人类通过日常互动自然地理解物体物理,但准确预测复杂的可变形动力学(如弹性材料和织物)仍然是计算机视觉和机器人学的主要挑战。我们提出EgoPhys,一个利用可泛化先验从仅RGB的第一人称视频构建可变形物理数字孪生的框架。EgoPhys通过将每个物体的逆物理解蒸馏到紧凑码本中,克服了现有方法的局限性,从而能够为未见物体预测密集的弹簧刚度场,而无需每个弹簧的测试时优化。使用来自多样化第一人称交互的可泛化先验进行训练,EgoPhys在重建、未来预测和零样本泛化方面优于基线。为了支持训练和评估,我们整理了一个涵盖多样化可变形物体、场景和操作风格的第一人称交互数据集。我们将EgoPhys部署在真实的xArm6机器人上,证明从单个第一人称人类自由操作(play)视频初始化的数字孪生可以作为内部世界表示,辅助可变形物体规划,突显第一人称RGB观测作为通往真实到模拟管道的可扩展路径。

英文摘要

Humans naturally understand object physics through everyday interactions, but faithfully predicting complex deformable dynamics, such as elastic materials and fabrics, remains a major challenge for computer vision and robotics. We present EgoPhys, a framework that constructs deformable physical digital twins from egocentric RGB-only video using generalizable priors. EgoPhys overcomes the limitations of existing methods to enable controllable deformable digital twin generation from egocentric videos by distilling per-object inverse-physics solutions into a compact codebook, enabling prediction of dense spring stiffness fields for unseen objects without per-spring test-time optimization. Trained with generalizable priors from diverse egocentric interactions, EgoPhys outperforms baselines in reconstruction, future prediction, and zero-shot generalization. To support training and evaluation, we curate an egocentric interaction dataset covering diverse deformable objects, scenes, and manipulation styles. We deploy EgoPhys on a real xArm6 robot, demonstrating that a digital twin initialized from a single egocentric human play video can serve as an internal world representation to aid in deformable-object planning, highlighting egocentric RGB observations as a scalable path toward real-to-sim pipelines.

2606.16253 2026-06-16 cs.CV cs.AI 新提交

Learned Image Compression for Vision-Language-Action Models

面向视觉-语言-动作模型的学习型图像压缩

Hyeonjun Kim, Jegwang Ryu, Sangbeom Ha, Junhyeok Lee, Jun-Hyuk Kim, Hyemin Ahn, Jaeho Lee

发表机构 * POSTECH(浦项科技大学) Soongsil University(崇实大学) Chung-Ang University(中央大学)

AI总结 提出SPARC框架,通过自适应比特率分配和倾斜率损失,在低带宽下保持VLA机器人控制性能,优于传统编解码器。

详情
AI中文摘要

视觉-语言-动作(VLA)模型越来越依赖高频多摄像头观测,使得视觉通信成为带宽受限或分布式部署场景中实时机器人控制的主要瓶颈。然而,现有的图像和视频编解码器旨在保留通用视觉保真度,而非下游VLA策略的控制性能。在这项工作中,我们引入了SPARC(空间自适应速率控制),一种为VLA驱动机器人量身定制的学习图像压缩框架。我们的关键观察是,视觉信息的重要性在相机视角和图像内的空间区域之间差异很大。基于这一观察,SPARC采用轻量级时间掩码选择器,根据任务相关性自适应地在潜在表示上分配比特率,同时利用时间上下文。我们进一步引入倾斜率损失,通过减少基于熵的目标过度抑制罕见但任务关键的视觉模式的趋势来稳定训练。在包括RoboCasa365、VLABench和LIBERO在内的多样化机器人基准测试上的实验表明,在相同比特率预算下,SPARC始终比传统图像/视频编解码器和最近的学习压缩方法实现更强的控制性能。我们还展示了在远程控制设置中的实际部署优势,我们的方法显著改善了比特率-成功率权衡。

英文摘要

Vision-language-action (VLA) models increasingly rely on high-frequency multi-camera observations, making visual communication a major bottleneck for real-time robotic control in bandwidth-constrained or distributed deployment settings. Existing image and video codecs, however, are designed to preserve generic visual fidelity rather than the control performance of downstream VLA policies. In this work, we introduce SPARC (SPatially Adaptive Rate Control), a learned image compression framework tailored for VLA-driven robots. Our key observation is that the importance of visual information varies substantially across both camera views and spatial regions within an image. Based on this observation, SPARC employs a lightweight temporal mask selector that adaptively allocates bitrate over latent representations according to task relevance while leveraging temporal context. We further introduce a tilted rate loss that stabilizes training by reducing the tendency of entropy-based objectives to over-suppress rare yet task-critical visual patterns. Experiments on diverse robotic benchmarks, including RoboCasa365, VLABench, and LIBERO, show that SPARC consistently achieves stronger control performance than conventional image/video codecs and recent learned compression methods under the same bitrate budget. We additionally demonstrate real-world deployment benefits in remote-control settings, where our method substantially improves the bitrate-success tradeoff.

2606.16274 2026-06-16 cs.CV 新提交

GraphWorld: Long-Horizon Planning with World Models for End-to-End Autonomous Driving

GraphWorld: 基于世界模型的长时域规划实现端到端自动驾驶

Ziying Song, Caiyan Jia, Lin Liu, Lei Yang, Shengkai Zhang, Feiyang Jia, Fengda Zhao, Peiliang Wu, Shaoqing Xu, Chen Lv, Yadan Luo

发表机构 * Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, School of Computer Science and Technology, Beijing Jiaotong University(北京交通大学计算机科学与技术学院,交通数据挖掘与具身智能北京市重点实验室) School of Artificial Intelligence (School of Software), Yanshan University(燕山大学人工智能学院(软件学院)) School of Mechanical and Aerospace Engineering, Nanyang Technological University(南洋理工大学机械与航空航天工程学院) University of Macau(澳门大学) The University of Queensland(昆士兰大学)

AI总结 提出GraphWorld框架,通过潜在世界建模增强长时域规划,利用自车中心交互图建模邻车关系,并基于世界状态条件规划实现安全轨迹生成,显著降低碰撞率。

Comments 16 pages, 5 figures

详情
AI中文摘要

端到端自动驾驶通过将感知、预测和规划统一到单一学习框架中取得了显著进展,在短时域决策中表现出色。然而,大多数现有的E2E-AD方法仍局限于短时域规划,缺乏建模长期时间依赖的能力,这严重限制了它们在复杂且高度交互的驾驶场景中的泛化性和安全性。在这项工作中,我们提出了GraphWorld,一个通过潜在世界建模显式增强长时域规划的E2E-AD框架。我们引入了一个自车中心交互图,该图基于空间邻近性自适应地建模关键邻车,并通过跨节点交叉注意力将关系上下文传播到规划查询。我们提出了一种世界状态条件规划,通过建模自车与周围智能体之间的交互来学习以自车为中心的潜在世界表示。这种潜在世界状态捕获了关键的交互动态和安全相关语义,并作为条件信号来指导长时域、安全感知的轨迹规划。在Bench2Drive、NAVSIMv1/2和nuScenes上的大量实验表明,GraphWorld显著降低了碰撞率并提高了长时域规划性能,验证了其在复杂驾驶环境中的有效性。

英文摘要

End-to-end autonomous driving has made significant progress by unifying perception, prediction, and planning within a single learning framework, achieving strong performance in short-horizon decision making. However, most existing E2E-AD methods remain confined to short-horizon planning and lack the ability to model long-term temporal dependencies, which severely limits their generalization and security in complex and highly interactive driving scenarios. In this work, we propose GraphWorld, an E2E-AD framework that explicitly enhances long-horizon planning through latent world modeling. We introduce an Ego-Centric Interaction Graph, which adaptively models critical neighboring agents based on spatial proximity, and propagates relational context to planning queries via cross-node cross-attention. We present a World-State-Conditioned Planning that learns ego-centric latent world representations by modeling interactions between an ego vehicle and surrounding agents. This latent world state captures key interaction dynamics and safety-relevant semantics, and serves as a conditioning signal to guide long-horizon, safety-aware trajectory planning. Extensive experiments on Bench2Drive, NAVSIMv1/2, and nuScenes demonstrate that GraphWorld significantly reduces collision rates and improves long-horizon planning performance, validating its effectiveness in complex driving environments.
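
The proximity-based neighbor selection behind an ego-centric interaction graph can be sketched as a k-nearest-neighbor query around the ego vehicle. The values of `k` and `max_range` and the use of 2D positions alone are hypothetical choices. The abstract says only that critical neighbors are modeled adaptively based on spatial proximity.

```python
import math


def ego_interaction_edges(ego_xy, agents_xy, k=3, max_range=50.0):
    """Return indices of the k nearest agents within max_range of the ego.

    k and max_range are hypothetical hyperparameters; each returned index
    stands for one ego-to-agent edge in the interaction graph.
    """
    dists = [(math.dist(ego_xy, xy), i) for i, xy in enumerate(agents_xy)]
    in_range = sorted(d for d in dists if d[0] <= max_range)
    return [i for _, i in in_range[:k]]


# Toy scene in ego-centred coordinates (metres).
agents = [(5.0, 0.0), (80.0, 0.0), (0.0, -3.0), (20.0, 20.0), (-10.0, 1.0)]
neighbors = ego_interaction_edges((0.0, 0.0), agents, k=3)
```

The selected indices would then define which agent features the planning queries attend to. The attention step itself is outside the scope of this sketch.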

2606.16278 2026-06-16 cs.CV cs.AI 新提交

RealityBridge: Bridging Editable 3D Gaussian Splatting Driving Simulations and Real-World Videos

RealityBridge: 连接可编辑3D高斯泼溅驾驶模拟与现实世界视频

Zhenhua Wu, Yun Pang, Mingkun Chang, Yuwei Ning, Liangzhi Wang, Yi Xiao, Guanbin Li

发表机构 * Sun Yat-sen University(中山大学) Guangdong Key Laboratory of Information Security Technology(广东省信息安全技术重点实验室) Key Laboratory of Machine Intelligence and Advanced Computing, Ministry of Education(教育部机器智能与先进计算重点实验室)

AI总结 提出RealityBridge框架,利用多模态控制和轻量级GateNet,结合自回归长视频训练与奖励引导后训练,缩小编辑后3DGS驾驶视频的Sim-to-Real差距,提升视觉真实感和时间一致性。

详情
AI中文摘要

长尾危险场景对于安全导向的自动驾驶至关重要,但难以大规模收集和复现。可编辑3D高斯泼溅(3DGS)模拟通过重建真实驾驶场景并支持可控场景编辑,提供了一种有前景的替代方案。然而,编辑后的3DGS渲染视频仍存在显著的Sim-to-Real差距,包括渲染伪影、前景资产退化、光照不一致和时间闪烁。现有的修复和视频生成方法不足以应对此任务,因为它们通常无法联合修复3DGS特定伪影、提升视觉真实感并确保时间一致性。为填补这一空白,我们提出RealityBridge,一种针对编辑后3DGS驾驶视频的结构保持和资产感知的Sim-to-Real框架。RealityBridge使用多模态控制,包括渲染视频、前景掩码、边缘图和语义掩码,并结合轻量级GateNet进行跨骨干层的自适应条件分配。我们进一步构建了针对性的训练数据,并引入自回归长视频训练与奖励引导后训练,以提升修复质量、时间稳定性和幻觉抑制。在内部和公开驾驶数据集上的大量实验表明,RealityBridge在伪影去除、光照协调和长序列时间一致性方面优于现有方法。

英文摘要

Long-tail hazardous scenarios are essential for safety-oriented autonomous driving, yet they are difficult to collect and reproduce at scale. Editable 3D Gaussian Splatting (3DGS) simulation offers a promising alternative by reconstructing real driving scenes and supporting controllable scene editing. However, edited 3DGS-rendered videos still suffer from a significant Sim-to-Real gap, including rendering artifacts, degraded foreground assets, inconsistent illumination, and temporal flickering. Existing restoration and video generation methods are insufficient for this task, as they often fail to jointly repair 3DGS-specific artifacts, improve visual realism, and ensure temporal consistency. To fill this gap, we propose RealityBridge, a structure-preserving and asset-aware Sim-to-Real framework for edited 3DGS driving videos. RealityBridge uses multimodal controls, including rendered videos, foreground masks, edge maps, and semantic masks, together with a lightweight GateNet for adaptive condition allocation across backbone layers. We further construct targeted training data and introduce autoregressive long-video training with reward-guided post-training to improve restoration quality, temporal stability, and hallucination suppression. Extensive experiments on internal and public driving datasets show that RealityBridge outperforms existing methods in artifact removal, illumination harmonization, and long-sequence temporal consistency.

2606.16354 2026-06-16 cs.CV 新提交

GraphBEV++: Multi-Modal Feature Alignment for Autonomous Driving

GraphBEV++: 自动驾驶中的多模态特征对齐

Ziying Song, Caiyan Jia, Lin Liu, Shaoqing Xu, Lei Yang, Yadan Luo

发表机构 * Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, School of Computer Science and Technology, Beijing Jiaotong University(北京交通大学计算机科学与技术学院,交通数据挖掘与具身智能北京市重点实验室) School of Artificial Intelligence (School of Software), Yanshan University(燕山大学人工智能学院(软件学院)) University of Macau(澳门大学) Nanyang Technological University(南洋理工大学) The University of Queensland(昆士兰大学)

AI总结 针对自动驾驶中BEV感知的特征未对齐问题,提出GraphBEV++框架,通过局部对齐(LocalAlign-v2)和全局对齐(GlobalAlign-v2)模块,利用图匹配、可变形偏移和扩散去噪方法,在多种基准上实现最优性能。

Comments 30 pages, 7 figures

详情
AI中文摘要

BEV感知中的特征未对齐是自动驾驶中一个关键但常被忽视的挑战,尤其是在激光雷达和相机传感器之间的标定不确定情况下。为了解决这个问题,我们提出了一个鲁棒的多模态融合框架GraphBEV++,该系统性地缓解了投影引起的未对齐。该框架包含两个关键模块:LocalAlign-v2和GlobalAlign-v2。LocalAlign-v2通过图匹配引入邻域感知深度特征来纠正局部未对齐。它支持基于LSS和基于查询的BEV表示,使其与BEVFusion和BEVFormer架构兼容,实现跨范式的一致对齐。GlobalAlign-v2包含两种变体:可变形和扩散。可变形变体通过显式学习跨模态特征偏移来解决基于LSS的多模态BEV中的全局未对齐。相比之下,扩散变体针对基于查询的BEV中的隐式未对齐,通过注入噪声模拟未对齐,并采用去噪过程恢复对齐特征。实验结果表明,GraphBEV++在nuScenes和Waymo子集上的未对齐噪声下实现了最先进的性能,改进了Argoverse2上的远距离检测,并有效泛化到3D占用预测任务,在干净和有噪声设置下均一致提高了占用估计的准确性和鲁棒性。此外,GraphBEV++有效缓解了端到端自动驾驶中的未对齐问题。与五个基线(UniAD、VAD、FusionAD、MomAD和WoTE)相比,它在感知、预测和规划任务上的开环(nuScenes)和闭环(Bench2Drive和NAVSIM)评估中均表现出更优的性能。

英文摘要

Feature misalignment in BEV perception is a critical yet often overlooked challenge in autonomous driving, especially under calibration uncertainties between LiDAR and camera sensors. To address this issue, we propose a robust multi-modal fusion framework, GraphBEV++, which systematically mitigates projection-induced misalignment. The framework consists of two key modules: LocalAlign-v2 and GlobalAlign-v2. LocalAlign-v2 introduces neighborhood-aware depth features via graph matching to correct local misalignment. It supports both LSS-based and query-based BEV representations, making it compatible with BEVFusion and BEVFormer architectures for consistent cross-paradigm alignment. GlobalAlign-v2 encompasses two variants: Deformable and Diffusion. The Deformable variant addresses global misalignment in LSS-based multi-modal BEV by explicitly learning cross-modal feature offsets. In contrast, the Diffusion variant targets implicit misalignment in query-based BEV by injecting noise to simulate misalignment and employing a denoising process to recover aligned features. Experimental results show that GraphBEV++ achieves state-of-the-art performance under misalignment noise on nuScenes and a Waymo subset, improves long-range detection on Argoverse2, and generalizes effectively to the 3D occupancy prediction task, consistently improving occupancy estimation accuracy and robustness under both clean and noisy settings. Furthermore, GraphBEV++ effectively alleviates misalignment issues in end-to-end autonomous driving. Compared with five baselines (UniAD, VAD, FusionAD, MomAD, and WoTE), it demonstrates superior performance in both open-loop (nuScenes) and closed-loop (Bench2Drive and NAVSIM) evaluations across perception, prediction, and planning tasks.

2606.16414 2026-06-16 cs.CV 新提交

Instance-Aware Knowledge Distillation for Semi-Supervised Learning of an On-Board Multi-Task Dense Prediction Model for Collision Avoidance System

面向碰撞避免系统的半监督学习中的实例感知知识蒸馏用于车载多任务密集预测模型

Gyutae Hwang, Sang Jun Lee

发表机构 * Division of Electronics and Information Engineering, Jeonbuk National University(全北国立大学电子与信息工程学部)

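AI summary Proposes an instance-aware knowledge distillation framework that generates pseudo labels from the teacher model's domain priors and foundation-model instance knowledge, training a lightweight student to run multi-task dense prediction in real time on an edge device; the student surpasses the teacher on instance segmentation while using 22.68× fewer FLOPs.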

Comments 13 pages, 7 figures

详情
AI中文摘要

碰撞避免系统已发展为基于摄像头的深度学习方法用于驾驶场景理解。然而,在乡村俱乐部等边缘环境中的部署受到有限计算资源和不可靠通信基础设施的限制。此外,为目标领域构建大规模数据集涉及大量标注成本。为了解决这些限制,我们提出了一种实例感知知识蒸馏框架用于半监督学习。具体来说,我们通过利用来自教师的领域先验和来自基础模型的实例中心知识生成减轻教师偏差的伪标签。训练后的轻量学生模型被部署在所提出的碰撞避免系统中,并实时执行多个密集预测任务。该系统检测前方障碍物并将其空间信息编码为控制器局域网消息,用于自动导引车操作。为此,我们构建了一个大规模的乡村俱乐部数据集,并对所提出的系统进行了现场验证。实验结果表明,学生在实例分割上优于大型教师,同时减轻了单目深度估计中的性能下降。与教师相比,学生将FLOPs减少了22.68倍,参数减少了14.33倍,在低成本边缘设备上实现了6.46 FPS。

英文摘要

Collision avoidance systems have evolved toward camera-based deep learning approaches for driving scene understanding. However, deployment in edge environments such as country clubs is constrained by limited computational resources and unreliable communication infrastructure. Moreover, constructing large-scale datasets for the target domain involves substantial annotation cost. To address these limitations, we propose an instance-aware knowledge distillation framework for semi-supervised learning. Specifically, we generate pseudo labels that mitigate teacher bias by leveraging domain priors from the teacher and instance-centric knowledge from foundation models. The trained lightweight student is deployed in the proposed collision avoidance system and performs multiple dense prediction tasks in real time. The system detects frontal obstacles and encodes their spatial information into controller area network messages for automated guided vehicle operation. To achieve this, we construct a large-scale country club dataset and perform field validation of the proposed system. Experimental results demonstrate that the student outperforms the large teacher in instance segmentation while mitigating performance degradation in monocular depth estimation. Compared with the teacher, the student reduces FLOPs by 22.68× and parameters by 14.33×, achieving 6.46 FPS on a low-cost edge device.
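The abstract states that detected obstacles are encoded into controller area network (CAN) messages but does not give the message layout. As a rough sketch only, one obstacle could be packed into the 8-byte payload of a classic CAN frame as shown here. The field layout, the centimetre scaling, and the names `encode_obstacle` and `decode_obstacle` are all assumptions made for illustration, not the authors' format.

```python
import struct

# Hypothetical payload layout (little-endian, 8 bytes total):
#   uint16 forward distance in cm, int16 lateral offset in cm,
#   uint8 class id, 3 padding bytes.
_LAYOUT = "<HhB3x"


def encode_obstacle(distance_m: float, lateral_m: float, class_id: int) -> bytes:
    """Pack one obstacle into an 8-byte CAN data field (assumed layout)."""
    dist_cm = max(0, min(0xFFFF, round(distance_m * 100)))      # clamp to uint16
    lat_cm = max(-0x8000, min(0x7FFF, round(lateral_m * 100)))  # clamp to int16
    return struct.pack(_LAYOUT, dist_cm, lat_cm, class_id & 0xFF)


def decode_obstacle(payload: bytes):
    """Inverse of encode_obstacle: returns (distance_m, lateral_m, class_id)."""
    dist_cm, lat_cm, class_id = struct.unpack(_LAYOUT, payload)
    return dist_cm / 100.0, lat_cm / 100.0, class_id
```

A fixed-point layout like this keeps the message within a single classic CAN frame, at the cost of centimetre resolution and a bounded range.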

2606.16470 2026-06-16 cs.CV cs.RO 新提交

Decoupled Object-Centric Video Understanding for Generating Robotic Manipulation Commands

解耦的以对象为中心的视频理解用于生成机器人操作指令

Thanh Nguyen Canh, Thanh-Tuan Tran, Haolan Zhang, Ziyan Gao, Xiem HoangVan, Nak Young Chong

发表机构 * School of Information Science, Japan Advanced Institute of Science and Technology(日本北陆先端科学技术大学院大学信息科学学院) University of Engineering and Technology, Vietnam National University(越南国立大学工程与技术大学) Department of Robotics, Hanyang University(汉阳大学机器人学系)

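AI summary Proposes a framework that decouples action recognition from object selection: TSM classifies the action, an object selection algorithm identifies task-relevant objects, and a VLM helps generate precise commands, substantially improving performance on Something-Something V2.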

详情
AI中文摘要

将视频演示翻译为可执行的机器人命令仍然具有挑战性,因为现有方法通常无法识别演示动作中功能涉及的对象。因此,它们可能生成语言上合理但操作上模糊的命令。我们提出了一种以对象为中心的视频理解框架,将动作识别与对象识别解耦,以生成精确的、无语法的操作命令。我们的方法集成了时间移位模块(TSM)用于高效的时空动作分类,以及一种新颖的对象选择(Object Selection)算法,通过基于轨迹的角色分类、模糊检测和重叠最小化来识别任务相关对象。然后,选定的对象由视觉语言模型(VLM)处理,以实现鲁棒的类别识别和零样本泛化。在修改后的Something-Something V2数据集上评估,我们的方法达到了86.79%的动作分类准确率,在标准对象上BLEU-4得分为0.337,在新颖对象上为0.261。这些结果分别比最强的任务特定基线提高了80.2%和143.9%。在METEOR和CIDEr指标上观察到更大的提升,在新颖对象上分别达到157.9%和171.7%。在所有语义指标上,我们的方法始终优于任务特定方法,并与大型通用VLM保持竞争力或超越它们,同时保留了模块化的、以对象为中心的设计。

英文摘要

Translating video demonstrations into executable robot commands remains challenging because existing methods often fail to identify which objects are functionally involved in the demonstrated action. As a result, they may generate commands that are linguistically plausible but operationally ambiguous. We propose an object-centric video understanding framework that decouples action recognition from object identification to generate precise, grammar-free manipulation commands. Our approach integrates Temporal Shift Modules (TSM) for efficient spatio-temporal action classification with a novel Object Selection algorithm that identifies task-relevant objects through trajectory-based role classification, blur detection, and overlap minimization. The selected objects are then processed by Vision-Language Models (VLMs) for robust category recognition and zero-shot generalization. Evaluated on a modified Something-Something V2 dataset, our method achieves 86.79% action classification accuracy and BLEU-4 scores of 0.337 on standard objects and 0.261 on novel objects. These results improve over the strongest task-specific baseline by 80.2% and 143.9%, respectively. Larger gains are observed in METEOR and CIDEr, reaching 157.9% and 171.7% on novel objects. Across all semantic metrics, our approach consistently outperforms task-specific methods and remains competitive with, or surpasses, large general-purpose VLMs while retaining a modular, object-centric design.
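The abstract lists blur detection as one ingredient of Object Selection without naming a method. A common choice, assumed here purely for illustration, is the variance of the Laplacian: sharp crops produce strong second derivatives, blurred ones do not. The function name and the nested-list grayscale format are hypothetical, and the paper may use a different criterion.

```python
def laplacian_variance(gray):
    """Variance of the 4-neighbour Laplacian over the interior of a grayscale image.

    `gray` is a list of rows of numeric intensities (at least 3x3).
    Higher values suggest a sharper crop; the threshold for "blurry" is
    application-specific and not given in the abstract.
    """
    height, width = len(gray), len(gray[0])
    responses = []
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            lap = (gray[y - 1][x] + gray[y + 1][x]
                   + gray[y][x - 1] + gray[y][x + 1]
                   - 4 * gray[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)
```

In a pipeline of this kind, candidate frames or crops of a tracked object could be ranked by this score so that the sharpest view is the one passed on for recognition.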

2606.16474 2026-06-16 cs.CV cs.RO 新提交

MVOFormer: Flow-Semantic Transformer for Robust Monocular Visual Odometry

MVOFormer:用于鲁棒单目视觉里程计的流-语义Transformer

Jituo Li, Shunwang Sun, Jialu Zhang, Xinqi Liu, Jinyao Hu, Zhicheng Lu, Sajad Saeedi, Guodong Lu

发表机构 * State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University(浙江大学流体动力与机电系统国家重点实验室) Zhejiang Key Laboratory of Industrial Big Data and Robot Intelligent Systems(浙江省工业大数据与机器人智能系统重点实验室) School of Mechanical Engineering, Zhejiang University(浙江大学机械工程学院) Robotics Institute, Zhejiang University(浙江大学机器人研究院) School of Artificial Intelligence and Robotics, Hunan University(湖南大学人工智能与机器人学院) Rural Health Research Institute, Charles Sturt University(查尔斯特大学农村健康研究所) University College London(伦敦大学学院)

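AI summary Proposes MVOFormer, a Transformer framework that pairs a flow-semantic dual-branch encoder with an iterative multimodal decoder, fusing dense geometric motion with semantic priors for coarse-to-fine pose refinement and clearly outperforming existing methods in zero-shot generalization.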

Comments 8 pages, 6 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

详情
AI中文摘要

单目视觉里程计(MVO)是自主导航和机器人定位的基础。然而,现有的基于学习的MVO方法通常缺乏可解释的互补特征或具有过于复杂的多阶段架构,这些局限性固有地限制了它们的鲁棒性和跨域泛化能力。在这项工作中,我们提出了MVOFormer,一种用于鲁棒单目视觉里程计的新型Transformer框架。我们的架构采用流-语义双分支编码器,将密集几何运动线索与以物体为中心的语义先验协同结合,明确区分静态结构与动态干扰物。然后,这些表示通过迭代多模态解码器融合,实现从粗到细的位姿优化,同时动态抑制对不可靠区域的注意力。大量评估表明,无需任何目标域微调,MVOFormer在TartanAir、KITTI、TUM-RGBD和ETH3D-SLAM等多个基准上实现了优越的零样本泛化和鲁棒性,显著优于先前基于学习的帧到帧方法。

英文摘要

Monocular visual odometry (MVO) is foundational to autonomous navigation and robotic localization. However, existing learning-based MVO approaches often struggle with either a lack of interpretable, complementary features or overly complex multi-stage architectures. These limitations inherently restrict their robustness and cross-domain generalization. In this work, we propose MVOFormer, a novel transformer framework for robust monocular visual odometry. Our architecture features a Flow-Semantic Dual Branch Encoder that synergizes dense geometric motion cues with object-centric semantic priors, explicitly distinguishing static structures from dynamic distractors. These representations are then fused by an Iterative Multimodal Decoder, enabling coarse-to-fine pose refinement while dynamically suppressing attention on unreliable regions. Extensive evaluations demonstrate that, without any target-domain fine-tuning, MVOFormer achieves superior zero-shot generalization and robustness, significantly outperforming prior learning-based frame-to-frame methods across diverse benchmarks including TartanAir, KITTI, TUM-RGBD, and ETH3D-SLAM.

2606.16569 2026-06-16 cs.CV cs.RO 新提交

PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models

PROSE: 基于视觉语言模型的无训练自我中心场景配准

Zhiang Chen, Nahyuk Lee, Boyang Sun, Taein Kwon, Marc Pollefeys, Zuria Bauer, Sunghwan Hong

发表机构 * ETH Zurich(苏黎世联邦理工学院) VGG, University of Oxford(牛津大学VGG实验室) ETH AI Center(苏黎世联邦理工学院人工智能中心)

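AI summary Proposes PROSE, which uses pretrained vision-language models to lift RGB sequences into object-level 3D scene graphs and matches instances using an object-height prior and same/different queries, achieving egocentric scene registration without training or a depth sensor and outperforming geometric and scene-graph baselines on the Aria benchmarks.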

Comments Project page: https://rckola.github.io/prose/

详情
AI中文摘要

将同一室内空间在不同时间拍摄的两张图像进行配准,是机器人和AR系统持久空间记忆的基础,但该任务的现实版本是自我中心的,且其最具可扩展性的形式是仅RGB。头戴式摄像头产生模糊、快速移动、部分重叠的视图,难以从中恢复密集几何。经典配准依赖于该场景所缺乏的干净点云,而学习的场景图方法需要预先构建或注释的图以及训练好的匹配器,我们发现后者在自我中心数据下脆弱。我们采取不同路线,使用预训练的视觉语言模型作为场景理解和跨扫描匹配的来源。我们的方法PROSE(Prompted Scene rEgistration)利用现成的几何、分割和语言基础模型将每个RGB序列提升为对象级3D场景图,然后提示同一VLM匹配两个RGB序列中的对象实例。为了使匹配易于处理且可靠,我们利用对象高度作为先验,并通过配对的相同/不同查询验证每个提议的匹配,然后通过为每个匹配对象假设一个候选并选择具有最强几何一致性的候选来求解刚体变换。PROSE不添加任何学习参数,也不需要深度传感器、训练或注释图。在自我中心的Aria Digital Twin和Aria Everyday Activities基准测试中,它在真实和RGB重建的点云上的配准精度均优于几何和学习的场景图基线,并且其生成的场景图可直接用于下游任务。

英文摘要

Registering two captures of the same indoor space taken at different times underpins persistent spatial memory for robots and AR systems, yet the realistic version of this task is egocentric and its most scalable form is RGB-only. Head-mounted cameras yield blurry, fast-moving, partially overlapping views from which dense geometry is hard to recover. Classical registration leans on exactly the clean point clouds this setting lacks, while learned scene-graph methods require a pre-built or annotated graph and a trained matcher that we find brittle under egocentric data. We take a different route, using a pretrained vision-language model as the source of both scene understanding and cross-scan matching. Our method, PROSE (Prompted Scene rEgistration), lifts each RGB sequence into an object-level 3D scene graph using off-the-shelf foundation models for geometry, segmentation, and language, then prompts the same VLM to match object instances across the two RGB sequences. To make this matching tractable and reliable, we leverage object heights as a prior and verify each proposed match with a paired same/different query, then solve for the rigid transform by hypothesizing a candidate per matched object and selecting the one with the strongest geometric consensus. PROSE adds no learned parameters and requires no depth sensor, training, or annotated graph. On the egocentric Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms both geometric and learned scene-graph baselines in registration accuracy, on ground-truth and RGB-reconstructed point clouds alike, and the scene graph it produces transfers directly to downstream tasks.
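The final step described above, one candidate per matched object followed by selection of the candidate with the strongest geometric consensus, can be sketched as follows. This is a deliberately simplified illustration that estimates a translation only and omits rotation; the function name, the inlier radius, and the use of object centres are assumptions, not details from the paper.

```python
import math


def select_by_consensus(matches, inlier_radius=0.25):
    """Pick the candidate translation that most object matches agree with.

    `matches` is a list of (source_xyz, target_xyz) object-centre pairs.
    Each match proposes one candidate translation; a match counts as an
    inlier of a candidate if the translated source lands within
    `inlier_radius` (same units as the coordinates) of its target.
    Returns (best_translation, inlier_count).
    """
    best_t, best_inliers = None, -1
    for src, tgt in matches:
        t = tuple(b - a for a, b in zip(src, tgt))  # hypothesis from this match
        inliers = 0
        for s, g in matches:
            moved = tuple(a + d for a, d in zip(s, t))
            if math.dist(moved, g) <= inlier_radius:
                inliers += 1
        if inliers > best_inliers:
            best_t, best_inliers = t, inliers
    return best_t, best_inliers
```

Because every hypothesis is scored against all matches, a single wrong VLM match only contributes one poorly supported candidate instead of corrupting the final estimate.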

2606.16898 2026-06-16 cs.CV cs.AI 新提交

Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization

Semantic Flip: 用于具身问答和空间定位中鲁棒拒绝的合成OOD生成

Dongbin Na, Chanwoo Kim, Giyun Choi, Dooyoung Hong

发表机构 * RGA Inc.(RGA公司)

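AI summary Proposes the Semantic Flip framework, which trains a lightweight rejection module on synthesized auxiliary OOD samples so that a frozen vision-language model can refuse robustly without external OOD annotations, outperforming strong prompting baselines on embodied question answering and spatial localization benchmarks.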

Comments 18 pages, 3 figures. Code and data: https://github.com/ndb796/SemanticFlip ; project page: https://ndb796.github.io/SemanticFlip

详情
AI中文摘要

检测不可回答的用户查询对于现实世界具身代理的可靠部署仍然至关重要。然而,现代视觉语言模型(VLM)即使当可用视觉记忆无法支持查询时,也常常生成过于自信的答案。这种过度自信会带来各种任务依赖的风险。代理可能在具身问答中向用户提供误导信息,并在空间推理导航中选择任意坐标并物理引导用户前往。尽管风险很高,但只有少数先前研究直接解决具身VLM何时以及如何回答“我不知道”的问题。本文提出Semantic Flip,一个简单而有效的框架,无需外部OOD标注即可合成辅助分布外(OOD)样本用于具身拒绝。关键思想是独立变换查询和视频记忆,以构建缺乏足够视觉基础的辅助OOD对。这些合成对使得能够在冻结的预训练VLM之上训练一个轻量级拒绝模块。该模块可附加到任何现有的基于VLM的流水线中,无需重新训练底层模型。在两个互补的基准测试中,Semantic Flip始终优于强提示基线。本文还引入了SpaceReject,一个新的用于空间定位的拒绝基准,包含故意不可回答的查询和长视频记忆,其中Semantic Flip达到了0.9559的$F_1$分数。源代码和数据集公开于https://github.com/ndb796/SemanticFlip。

英文摘要

Detecting unanswerable user queries remains essential for the reliable deployment of real-world embodied agents. However, modern vision-language models (VLMs) often generate overly confident answers even when the available visual memory cannot support the query. Such overconfidence poses various task-dependent risks. The agent may provide misleading information to the user in Embodied Question Answering, or, in spatial reasoning for navigation, select an arbitrary coordinate and physically guide the user there. Despite these high stakes, only a few prior studies directly address when and how an embodied VLM should respond with "I do not know." This work proposes Semantic Flip, a simple yet effective framework that synthesizes auxiliary out-of-distribution (OOD) samples for embodied refusal without requiring external OOD annotations. The key idea is to independently transform the query and video memory to construct auxiliary OOD pairs that lack sufficient visual grounding. These synthesized pairs enable training a lightweight rejection module on top of a frozen pretrained VLM. The module attaches to any existing VLM-based pipeline without retraining the underlying model. Across two complementary benchmarks, Semantic Flip consistently outperforms strong prompting baselines. This work also introduces SpaceReject, a new refusal benchmark for spatial localization with deliberately unanswerable queries over long video memory, where Semantic Flip achieves an $F_1$ score of 0.9559. The source code and datasets are publicly available at https://github.com/ndb796/SemanticFlip.
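For readers unfamiliar with the reported metric, $F_1$ is the harmonic mean of precision and recall. The sketch below computes it for a binary refuse/answer decision; treating "refuse" as the positive class is an assumption made here, since the abstract does not say how the score is aggregated on SpaceReject.

```python
def f1_score(y_true, y_pred, positive=1):
    """Binary F1 for the `positive` label (assumed here to mean "refuse")."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0  # no correct refusals: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A high score therefore requires both refusing the unanswerable queries (recall) and not refusing the answerable ones (precision).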

2606.16960 2026-06-16 cs.CV 新提交

SurroundNEXO: Ego-Centric Metric Bridging for Spatially Consistent Geometry in Autonomous Driving

SurroundNEXO:面向自动驾驶空间一致几何的自车中心度量桥接

Shuai Yuan, Runxi Tang, Yuzhou Ji, Fudong Ge, Hanshi Wang, Yifei Wang, Xianming Zeng, Jianyun Xu, Xingliang Liu, Yanfeng Wang, Zhipeng Zhang

发表机构 * School of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学人工智能学院) Hello Inc.

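AI summary Proposes the SurroundNEXO framework, which uses ego-centric geometry (Ego-Ray Positional Encoding) and sparse LiDAR metric anchors to address metric depth prediction and spatial consistency under low multi-camera overlap, with significant gains on multiple benchmarks.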

详情
AI中文摘要

现代自动驾驶依赖于精确的度量3D理解进行感知、重建和规划,这反过来需要可靠的多相机深度预测。然而,车载环视相机系统的外向性本质上限制了视图间的视觉重叠,挑战了传统多视图几何所依赖的对应关系假设。为弥合这一差距,我们提出SurroundNEXO(以西班牙语单词nexo命名,意为几何链接),一个低重叠多相机度量深度框架,将跨视图推理建立在自车中心几何而非密集视觉对应上。SurroundNEXO不直接强制早期全局融合,而是首先通过Ego-Ray位置编码为图像令牌分配全局可比较的自车框架视线方向,然后使用稀疏LiDAR测量作为度量锚点传播绝对尺度线索,最后逐步扩展特征交互,从视图局部建模到分解的时空推理和全局集成。这种设计使得在弱重叠相机间实现具有改进空间一致性的度量尺度深度预测。在包括NuScenes、Waymo和DDAD的低重叠自动驾驶基准上,与SOTA方法相比,SurroundNEXO将单视图误差降低33.2%,跨视图一致性提高10.5%,度量重建质量提升25.6%。此外,它在极稀疏深度提示下保持鲁棒,并对未见过的相机布局展现出强大的零样本泛化能力。

英文摘要

Modern autonomous driving depends on accurate metric 3D understanding for perception, reconstruction, and planning, which in turn requires reliable multi-camera depth prediction. However, the outward-facing nature of vehicle-mounted surround-view camera rigs inherently limits visual overlap across views, challenging the correspondence-based assumptions that underpin conventional multi-view geometry. To bridge this gap, we present SurroundNEXO, named after the Spanish word nexo for a geometric link, a low-overlap multi-camera metric depth framework that grounds cross-view reasoning in ego-centric geometry rather than dense visual correspondences. Instead of directly enforcing early global fusion, SurroundNEXO first assigns image tokens globally comparable ego-frame viewing directions through Ego-Ray Positional Encoding, then uses sparse LiDAR measurements as metric anchors to propagate absolute scale cues, and finally expands feature interaction progressively from view-local modeling to decomposed spatio-temporal reasoning and global integration. This design enables metric-scale depth prediction with improved spatial consistency across weakly overlapping cameras. Across low-overlap autonomous driving benchmarks, including NuScenes, Waymo and DDAD, SurroundNEXO reduces single-view error by 33.2%, improves cross-view consistency by 10.5%, and enhances metric reconstruction quality by 25.6% compared with SOTA methods. It further remains robust under extremely sparse depth prompts and exhibits strong zero-shot generalization to unseen camera layouts.
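The geometric quantity behind an ego-frame viewing direction can be computed with standard pinhole geometry: back-project the pixel through the intrinsics, then rotate the ray from the camera frame into the ego frame. The sketch below shows only that step. How SurroundNEXO turns such directions into positional encodings is not described in the abstract, and the function name and argument layout are hypothetical.

```python
import math


def ego_ray(u, v, fx, fy, cx, cy, cam_to_ego):
    """Unit viewing direction of pixel (u, v), expressed in the ego frame.

    fx, fy, cx, cy are pinhole intrinsics; `cam_to_ego` is a 3x3 rotation
    matrix (nested lists) taking camera-frame vectors to the ego frame.
    """
    d_cam = ((u - cx) / fx, (v - cy) / fy, 1.0)  # pinhole back-projection
    d_ego = [sum(row[i] * d_cam[i] for i in range(3)) for row in cam_to_ego]
    norm = math.sqrt(sum(c * c for c in d_ego))
    return tuple(c / norm for c in d_ego)
```

Since every camera's rays are expressed in the same ego frame, tokens from cameras with no visual overlap still receive directly comparable directions.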

2606.14879 2026-06-16 cs.RO cs.CV cs.LG 交叉投稿

VANDERER: Map-Free Exploration using Future-Aware and Visual-Curiosity-Guided Diffusion Policy

VANDERER: 基于未来感知与视觉好奇心引导扩散策略的无地图探索

Venkata Naren Devarakonda, Raktim Gautam Goswami, Prashanth Krishnamurthy, Farshad Khorrami

发表机构 * Control/Robotics Research Laboratory (CRRL), Department of Electrical and Computer Engineering, NYU Tandon School of Engineering(纽约大学坦登工程学院电气与计算机工程系控制/机器人研究实验室(CRRL)) New York University Abu Dhabi (NYUAD) Center for Artificial Intelligence and Robotics (CAIR)(纽约大学阿布扎比分校人工智能与机器人中心(CAIR))

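AI summary Proposes the VANDERER framework, which uses a visual curiosity module to guide pre-trained diffusion policies and achieves efficient map-free exploration from monocular images alone, exploring on average 13.4% more area than NoMaD across diverse simulated environments.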

详情
AI中文摘要

移动智能体需要高效的探索策略来绘制未知环境并自主规划任务。传统方法依赖于生成占据地图并优化未探索区域的访问顺序。然而,在传感器受限的设置中,例如仅使用单目相机,生成准确的占据地图具有挑战性。为了解决这一问题,我们提出了VANDERER,一个探索框架,它利用视觉好奇心模块(VCM)仅使用单目图像数据来引导预训练的扩散策略。该好奇心模块通过导航世界模型预测所提议动作的结果,并通过好奇心成本对其进行评估。然后,该成本引导扩散过程生成最大化探索的动作。在多种模拟环境中进行评估,VANDERER始终优于现有基线,平均探索面积比NoMaD多13.4%。我们的结果揭示了室外环境中视觉好奇心与几何好奇心之间的直接相关性,表明VANDERER能够有效利用这种关系,在传感器受限的智能体上实现高效探索。

英文摘要

Mobile agents require efficient exploration strategies to map unseen environments and autonomously plan tasks. Traditional methods rely on generating occupancy maps and optimizing the sequence in which unexplored regions are visited. However, in sensor-constrained settings, such as those limited to monocular cameras, generating accurate occupancy maps is challenging. To address this, we propose VANDERER, an exploration framework that leverages a Visual Curiosity Module (VCM) to guide pre-trained diffusion policies using only monocular image data. This curiosity module predicts the outcomes of proposed actions via a navigation world model and evaluates them through a curiosity cost. The cost then guides the diffusion process toward generating actions that maximize exploration. Evaluated across diverse simulated environments, VANDERER consistently outperforms established baselines, exploring an average of 13.4% more area than NoMaD. Our results reveal a direct correlation between visual and geometric curiosity in outdoor environments, demonstrating that VANDERER can effectively leverage this relationship for efficient exploration using sensor-constrained agents.

2606.15133 2026-06-16 cs.RO cs.CV 交叉投稿

DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects

DragMesh-2: 与铰接物体的物理合理灵巧手-物体交互

Tianshan Zhang, Yijia Duan, Yanjun Li, Zeyu Zhang, Hao Tang

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院)

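AI summary Proposes the DragMesh-2 framework for contact-driven dexterous hand interaction with articulated objects, combined with PICA, a physically informed contact-aware training mechanism, to improve robustness under varying contact loads without tactile feedback.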

Comments Code: https://github.com/AIGeeksGroup/DragMesh-2. Website: https://aigeeksgroup.github.io/DragMesh-2

详情
AI中文摘要

与铰接物体的灵巧交互对于家庭、辅助和人形操作至关重要,其中多指手可以提供超越平行爪抓取的顺应接触模式。然而,铰接物体操作不同于静态物体操作:目标部件无法直接驱动,其运动必须通过持续的物理手-手柄接触来实现。这使得从以物体为中心的铰接生成到手驱动的灵巧手-物体交互的转变变得非平凡,因为几何轨迹重放或开环执行无法模拟移动铰接部件所需的接触动力学。此外,仅在固定动力学下为任务完成训练的策略可能会过拟合标称接触负载,尤其是在没有触觉或力反馈的情况下,并且当接触负载变化时性能可能会下降。为了应对这些挑战,我们提出了DragMesh-2,一个用于与铰接物体灵巧交互的接触驱动框架,它将铰接交互从以物体为中心的生成扩展到手驱动的灵巧手-物体交互,其中铰接运动必须通过物理接触产生。我们进一步提出了PICA,一种物理信息感知的训练机制,它在没有触觉或力反馈的情况下将物理信号注入策略学习,提高了在变化接触负载下的鲁棒性和任务成功率。最后,我们在多个阻尼条件和铰接物体类别上进行了系统评估,以研究接触负载变化下的鲁棒性,并提供了一个纯几何的灵巧交互资源,以支持未来的移动操作和人形手-物体交互研究。在七个GAPartNet物体上,DragMesh-2在接触负载变化下比对比方法实现了更强的鲁棒性,同时在各种阻尼条件下保持了高任务成功率。

英文摘要

Dexterous interaction with articulated objects is important for household, assistive, and humanoid manipulation, where multi-finger hands can provide compliant contact patterns beyond parallel-jaw grasping. However, articulated-object manipulation differs from static-object manipulation: the target part cannot be directly actuated, and its motion must emerge through sustained physical hand-handle contact. This makes the transition from object-centric articulated generation to hand-driven dexterous hand-object interaction non-trivial, since geometric trajectory replay or open-loop execution does not model the contact dynamics required to move the articulated part. Moreover, policies trained only for task completion under fixed dynamics can overfit nominal contact loads, especially without tactile or force feedback, and may degrade when the contact load changes. To address these challenges, we present DragMesh-2, a contact-driven framework for dexterous interaction with articulated objects that extends articulated interaction from object-centric generation to hand-driven dexterous hand-object interaction, where articulated motion must arise through physical contact. We further propose PICA, a physically informed contact-aware training mechanism that injects physical signals into policy learning without tactile or force feedback, improving robustness and task success under changing contact loads. Finally, we conduct systematic evaluation across multiple damping conditions and articulated-object categories to study robustness under contact-load variation, and provide a pure-geometry dexterous interaction resource to support future loco-manipulation and humanoid hand-object interaction research. Across seven GAPartNet objects, DragMesh-2 achieves stronger robustness under contact-load variation than the compared methods while maintaining high task success across damping conditions.

2606.15594 2026-06-16 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 交叉投稿

Pixels to Proofs: Probabilistically-Safe Latent World Model Control via Parallel Conformal Robust MPC

从像素到证明:通过并行保形鲁棒MPC实现概率安全的潜在世界模型控制

Devesh Nath, Anutam Srinivasan, Haoran Yin, Ruitong Jiang, Jeffrey Fang, Glen Chou

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

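AI summary Proposes the SLS^2 framework, which combines conformal prediction with robust model predictive control to achieve safe vision-based motion planning in learned latent world models, improving goal-reaching performance and safety.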

详情
AI中文摘要

我们提出了SLS^2,一个使用鲁棒模型预测控制(MPC)在学习的潜在世界模型中进行安全反馈运动规划的框架。我们的方法训练了一个动作条件的联合嵌入世界模型,具有紧凑的马尔可夫潜在状态,通过学习的潜在动力学实现高效的基于梯度的轨迹优化。为了在潜在预测不完美的情况下确保真实系统的安全性,我们采用保形预测来通知GPU加速的系统级综合(SLS)鲁棒MPC方案,以获得校准的潜在误差界限和鲁棒的潜在空间约束集。我们还学习并保形化了一个潜在约束检查器,使SLS规划器能够在闭环执行期间施加概率安全约束。我们在基于视觉的控制任务上评估了我们的方法,与潜在世界模型和安全规划基线相比,它提高了目标到达性能和安全性。

英文摘要

We present SLS^2, a framework for safe feedback motion planning from pixels using robust model predictive control (MPC) in learned latent world models. Our approach trains an action-conditioned joint-embedding world model with compact Markovian latent states, enabling efficient gradient-based trajectory optimization through learned latent dynamics. To enforce safety for the true system despite imperfect latent predictions, we inform a GPU-accelerated system level synthesis (SLS) robust MPC scheme with conformal prediction to obtain calibrated latent error bounds and robust latent-space constraint sets. We further learn and conformalize a latent constraint checker, allowing the SLS planner to impose probabilistic safety constraints during closed-loop execution. We evaluate our method on vision-based control tasks, where it improves both goal-reaching performance and safety over latent world-model and safe-planning baselines.
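The calibrated error bounds mentioned above can be illustrated with standard split conformal prediction: given held-out prediction errors, take the ceil((n + 1)(1 - alpha))-th smallest one as the bound. The sketch below shows that generic recipe only. The nonconformity score SLS^2 actually uses and how the bound enters the robust MPC constraints are not specified in the abstract, and the function name is hypothetical.

```python
import math


def conformal_bound(calibration_errors, alpha=0.1):
    """Split-conformal upper bound on a new prediction error.

    If the calibration errors and the new error are exchangeable, the
    returned value covers the new error with probability >= 1 - alpha.
    """
    n = len(calibration_errors)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the calibrated quantile
    if k > n:
        return math.inf  # too few calibration samples for this alpha
    return sorted(calibration_errors)[k - 1]
```

A planner could then tighten its latent-space constraints by this bound, so that a latent trajectory judged safe remains safe up to the calibrated prediction error.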

2606.15647 2026-06-16 cs.AI cs.CV cs.RO 交叉投稿

Towards Next-Generation Healthcare: A Survey of Medical Embodied AI for Perception, Decision-Making, and Action

迈向下一代医疗:医疗具身AI在感知、决策与行动中的综述

Cheng Zhang, Qing Cai, Xingzheng Wu, Xun Yang, Xiaojun Chang, Bingkun Bao, Liqiang Nie, Xinwang Liu, Yi Yang

发表机构 * School of Information Science and Engineering, Ocean University of China(中国海洋大学信息科学与工程学院) Innovation School of Artificial Intelligence, Hefei University of Technology(合肥工业大学人工智能创新学院) School of Information Science and Technology, University of Science and Technology of China(中国科学技术大学信息科学技术学院) School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机与信息工程学院) School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)计算机科学与技术学院) College of Computer Science and Technology, National University of Defense Technology(国防科技大学计算机科学与技术学院) ReLER Laboratory, CCAI, Zhejiang University(浙江大学CCAI ReLER实验室)

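AI summary Systematically surveys the core components of medical embodied AI, emphasizing the coordinated integration of perception, decision-making, and action, and analyzes challenges in clinical practice and future directions.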

Comments 19 pages, 9 figures

详情
AI中文摘要

基础模型在提升医疗效率方面表现出色,广泛应用于各类医疗场景。然而,它们在感知、理解和与物理世界交互方面的能力有限,严重制约了其在真实临床工作流中的有效性,而临床工作流中安全关键的决策和物理执行紧密耦合。近年来,具身人工智能(AI)作为一种有前景的物理交互范式出现,使智能体能够在复杂医疗环境中操作。随着该领域研究的迅速扩展,理解智能体如何在临床环境中作为集成的端到端系统运行变得日益关键。然而,现有关于医疗具身AI的综述大多强调单个方面或功能组件,缺乏统一的系统级组织。为支持和巩固最新进展,我们系统调查了医疗具身AI的核心组件,特别关注感知、决策与行动的协调集成。我们进一步回顾了代表性医疗应用和相关数据集,并分析了真实临床实践中遇到的主要挑战。最后,我们讨论了这一快速发展领域未来研究的关键方向。相关项目见 https://github.com/VMVLab/Medical_Embodied_AI_Paper_List。

英文摘要

Foundation models have demonstrated impressive performance in enhancing healthcare efficiency across a wide range of medical applications. Nevertheless, their limited ability to perceive, understand, and interact with the physical world significantly constrains their effectiveness in real-world clinical workflows, where safety-critical decision-making and physical execution are tightly coupled. Recently, embodied artificial intelligence (AI) has emerged as a promising physical-interactive paradigm for intelligent healthcare, enabling agents to operate in complex medical environments. As research in this area rapidly expands, understanding how intelligent agents function as integrated, end-to-end systems in clinical environments becomes increasingly critical. However, existing surveys on medical embodied AI largely emphasize individual aspects or functional components, lacking a unified system-level organization of the field. To support and consolidate recent advances, we systematically survey the core components of medical embodied AI, with a particular emphasis on the coordinated integration of perception, decision-making, and action. We further review representative medical applications and relevant datasets, and we analyze the major challenges encountered in real-world clinical practice. Finally, we discuss key directions for future research in this rapidly evolving field. The associated project can be found at https://github.com/VMVLab/Medical_Embodied_AI_Paper_List.

2606.15685 2026-06-16 cs.RO cs.CV 交叉投稿

Learning New Tasks via Reusable Skills: Skill-Compositional Experts for Embodied Continual Learning

通过可复用技能学习新任务:面向具身持续学习的技能组合专家

Shuaike Zhang, Shaokun Wang, Haoyu Tang, Jianlong Wu, Liqiang Nie

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Shandong University(山东大学) Shenzhen Loop Area Institute(深圳河套学院)

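AI summary Proposes the Skill-Compositional Experts (SCE) framework, which decomposes demonstrations into reusable skills via Compositional Skill Grounding (CSG) and learns new tasks with Dual Execution-and-Transition Experts (DETE), effectively mitigating catastrophic forgetting in embodied continual learning.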

Comments 13 pages, 5 figures

详情
AI中文摘要

具身持续学习(ECL)旨在使机器人能够在闭环控制下持续获取新的操作任务,同时保留先前学习的行为。与传统的持续学习相比,ECL遭受更严重的灾难性遗忘。在闭环控制下累积的特征漂移通过顺序决策逐步传播,导致先前学习的行为退化。ECL中的一个关键挑战在于如何在不断演变的任务中进行结构化的技能复用,因为现有方法主要关注技能学习,而没有明确组织它们以执行连贯的任务。为了解决这个问题,我们提出了SCE,一个用于ECL的技能组合专家框架。SCE通过组合技能基础(CSG)构建技能库,将任务演示分解为可复用的技能。在此基础上,双执行-转换专家(DETE)通过技能组合实现新任务学习,其中一个分支确保技能执行,另一个支持技能之间的转换以实现连贯行为。在LIBERO基准测试和真实世界操作任务上的实验表明,SCE持续提高了保留率和整体任务性能。进一步的特征漂移分析和消融研究验证了我们方法的有效性。项目网站:https://eqcy.github.io/sce/。

英文摘要

Embodied Continual Learning (ECL) aims to enable robots to continually acquire new manipulation tasks while retaining previously learned behaviors under closed-loop control. Compared with conventional continual learning, ECL suffers from more severe catastrophic forgetting. Feature drift accumulated under closed-loop control progressively propagates through sequential decision-making, leading to degradation of previously learned behaviors. A key challenge in ECL lies in structured skill reuse across continually evolving tasks, since existing methods primarily focus on skill learning without explicitly organizing them for coherent task execution. To address this issue, we propose SCE, a Skill-Compositional Experts framework for ECL. SCE builds a skill base via Compositional Skill Grounding (CSG), which decomposes task demonstrations into reusable skills. Based on this, Dual Execution-and-Transition Experts (DETE) enable new task learning through skill composition, where one branch ensures skill execution and the other supports transitions between skills for coherent behavior. Experiments on LIBERO benchmarks and real-world manipulation tasks demonstrate that SCE consistently improves retention and overall task performance. Further feature drift analyses and ablation studies verify the effectiveness of our method. Project website: https://eqcy.github.io/sce/.

2606.16101 2026-06-16 cs.MM cs.CV 交叉投稿

Effective and Low-cost Lane-based Map Localization for Vehicle-Centric Route Generation

基于车道的地图定位实现以车辆为中心的路线生成:一种有效且低成本的方法

Hong-Shiang Lin, Jung-Hsin Chen, Yu-Luen Tzeng, Wei-Hao Chen, Yi-Chen Lee, Li-Jhe Chen, Peng-Yuan Chen

发表机构 * National Taipei University(台北国立大学)

AI总结 提出OLRA框架,通过匹配导航路线与摄像头检测的车道线,以低成本地图定位生成驾驶员视角路线,提升定位精度和路线一致性,在nuScenes数据集上优于OpenPilot。

Comments 14 pages, 18 figures. Under Review

详情
AI中文摘要

以驾驶员为中心的路线表示在直观驾驶引导系统中起着至关重要的作用。本文提出OLRA,一种低成本的基于地图定位的框架,通过将基于地图的导航路线与摄像头检测的车道线进行匹配,推导出驾驶员视角对齐的路线。这一对齐过程相互增强了车辆定位精度和视觉路线一致性。为了弥合不同范式之间的评估差距,我们引入了实用的路线评估指标,并将OLRA与代表性的直接生成方法OpenPilot进行基准测试。在nuScenes数据集上的实验结果表明,OLRA在复杂路段和超过20米的距离估计中优于OpenPilot,实现了更低的整体欧氏误差。本研究有望促进未来基于低成本地图定位的路线生成方法的研究。

英文摘要

Driver-centric route representation plays a vital role in intuitive driving guidance systems. This paper presents OLRA, a low-cost, map-localization-based framework that derives driver-view-aligned routes by matching map-based navigation routes with camera-detected lane markings. This alignment process mutually enhances vehicle localization accuracy and visual route consistency. To bridge the evaluation gap across different paradigms, we introduce practical route evaluation metrics and benchmark OLRA against OpenPilot, a representative direct-generation approach. Experimental results on the nuScenes dataset demonstrate that OLRA outperforms OpenPilot in complex road segments and in route estimation at distances beyond 20 meters, achieving lower overall Euclidean error. This study is expected to promote future research in low-cost, map-localization-based route generation methods.
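The abstract reports Euclidean error overall and at distances beyond 20 meters without defining the metrics. One plausible reading, offered only as an assumption, is a mean point-wise distance between predicted and reference route points in the vehicle frame, optionally restricted to points at least a given range from the vehicle. The function name and the point-correspondence convention are hypothetical.

```python
import math


def mean_route_error(pred_route, ref_route, min_range=0.0):
    """Mean Euclidean distance between corresponding route points.

    Routes are equal-length lists of (x, y) points in metres, in a frame
    with the vehicle at the origin. Only points whose reference position is
    at least `min_range` metres from the vehicle are included.
    """
    if len(pred_route) != len(ref_route):
        raise ValueError("routes must have the same number of points")
    errors = [
        math.dist(p, r)
        for p, r in zip(pred_route, ref_route)
        if math.hypot(*r) >= min_range
    ]
    if not errors:
        raise ValueError("no route points in the requested range")
    return sum(errors) / len(errors)
```

Calling it once with `min_range=0.0` and once with `min_range=20.0` separates near-field agreement from the far-field behaviour that the abstract highlights.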

2606.16436 2026-06-16 cs.RO cs.CV 交叉投稿

V2P-Manip: Learning Dexterous Manipulation from Monocular Human Videos

V2P-Manip:从单目人类视频学习灵巧操作

Kaihan Chen, Yanming Shao, Haifeng Ji, Xiaokang Yang, Yao Mu

发表机构 * Zhejiang University(浙江大学) Shanghai Jiao Tong University(上海交通大学) Shanghai AI Laboratory(上海人工智能实验室)

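AI summary Proposes the V2P-Manip framework, which extracts visually faithful and physically plausible trajectories from monocular human demonstration videos, uses a two-stage refinement to enforce spatial alignment and physical consistency, and significantly outperforms prior methods on the TACO and OakInk benchmarks.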

详情
AI中文摘要

实现自主机器人灵巧操作需要大规模精确、类人的动作序列。作为昂贵遥操作数据的可扩展补充,从单目视频中提取兼具视觉保真度和物理合理性的轨迹是具身智能的一个有前景的前沿方向。为此,我们引入V2P-Manip,一个高效的框架,旨在直接从人类演示视频中学习灵巧操作策略。我们建立了一个高效、集成的流水线,涵盖3D资产获取、轨迹估计和灵巧策略学习。为了弥合视觉感知与物理约束之间的差距,我们引入了一个两阶段精炼过程,以强制执行空间对齐和物理一致性。在TACO和OakInk基准上的评估表明,我们的方法在姿态精度、对非结构化环境的适应性以及训练效率方面显著优于先前方法。最终,实验结果证实了在多个合成操作任务上平均成功率超过75%,并验证了提取的操作先验在不同灵巧手形态上的适应性。

英文摘要

Achieving autonomous robotic dexterous manipulation requires precise, human-like action sequences at scale. As a scalable supplement to costly teleoperation data, extracting trajectories with both visual fidelity and physical plausibility from monocular videos represents a promising frontier in embodied AI. To this end, we introduce V2P-Manip, an efficient framework designed to learn dexterous manipulation policies directly from human demonstration videos. We establish an efficient, integrated pipeline encompassing 3D asset acquisition, trajectory estimation, and dexterous policy learning. To bridge the gap between visual perception and physical constraints, we introduce a two-stage refinement process to enforce spatial alignment and physical consistency. Evaluations on the TACO and OakInk benchmarks demonstrate that our approach significantly outperforms previous methods in pose accuracy, adaptability to unstructured environments, and training efficiency. Ultimately, experimental results confirm an average success rate of over 75% across multiple synthetic manipulation tasks and validate the adaptability of the extracted manipulation priors across diverse dexterous hand embodiments.

2606.16690 2026-06-16 cs.RO cs.AI cs.CV 交叉投稿

PATCH: Action-Chunk-Conditioned Latent Patch Innovation Monitoring for Robot Manipulation

PATCH: 基于动作块条件潜在补丁创新的机器人操作监控

Yanan Zhou, Ranpeng Qiu, Yincong Chen, Jiajie Cui, Weiming Zhi

发表机构 * School of Computer Science, The University of Sydney(悉尼大学计算机科学学院) Australian Centre For Robotics, The University of Sydney(悉尼大学澳大利亚机器人中心)

AI总结 提出PATCH监控器,通过动作块条件潜在补丁创新检测局部场景动态,实现扰动感知的机器人操作干预与恢复。

详情
AI中文摘要

基于学习的操作策略在真实世界机器人操作中取得了实质性进展,特别是在短视界动作生成方面。然而,在开放工作空间中部署时,面对意外的局部场景动态(如移动物体、短暂遮挡或预期运动附近的干扰)仍然脆弱。现有的运行时监控器通常依赖全局观测异常、策略不确定性或帧级视觉变化,难以区分任务相关的执行风险与良性的视觉变化。我们提出PATCH,一种用于部署时干预的基于动作块条件的潜在补丁创新监控器。给定当前动作块,PATCH定义了一个投影执行走廊,预测其内部的潜在补丁演化,并累积机器人自身运动无法解释的持续残差。这些残差形成局部化的干预信号,使PATCH-Router能够暂停执行、选择可用的恢复源,并在局部创新消退后恢复原始策略。在真实机器人 rollout 数据上的实验表明,PATCH 比竞争性运行时监控器产生更稳定且上下文相关的触发信号。真实机器人部署进一步展示了监控驱动的干预和策略恢复,用于扰动感知的操作。项目页面:https://yananzhou5555.github.io/PATCH/。

英文摘要

Learning-based manipulation policies have made substantial progress in real-world robot manipulation, particularly for short-horizon action generation. However, deployment in open workspaces remains fragile under unexpected local scene dynamics, such as moving objects, transient occlusions, or disturbances near the intended motion. Existing runtime monitors often rely on global observation anomalies, policy uncertainty, or frame-level visual changes, and struggle to distinguish task-relevant execution risk from benign visual variation. We introduce PATCH, an action-chunk-conditioned latent patch innovation monitor for deployment-time intervention. Given the active action chunk, PATCH defines a projected execution corridor, predicts latent patch evolution inside it, and accumulates persistent residuals unexplained by the robot's own motion. These residuals form a localized intervention signal that allows PATCH-Router to pause execution, select an available recovery source, and resume the original policy once localized innovation subsides. Experiments on real robot rollout data show that PATCH produces more stable and context-relevant triggers than competing runtime monitors. Real-robot deployment further demonstrates monitor-driven intervention and policy resumption for disturbance-aware manipulation. Project Page: https://yananzhou5555.github.io/PATCH/.
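The idea of accumulating persistent residuals and pausing only when they do not subside can be illustrated with a small sketch. This is not the PATCH implementation: the abstract does not specify the accumulation rule, so the CUSUM-style update, the hysteresis thresholds, and the names `monitor`, `slack`, `pause_at`, and `resume_at` are all assumptions made for illustration.

```python
from typing import List


def monitor(residuals: List[float],
            slack: float = 0.5,
            pause_at: float = 2.0,
            resume_at: float = 0.5) -> List[bool]:
    """Return, per step, whether execution would be paused.

    Hypothetical rule (not from the paper): a CUSUM-style score grows while
    the residual exceeds `slack` and drains otherwise. Execution pauses when
    the score reaches `pause_at` and resumes once it falls to `resume_at`.
    """
    score = 0.0
    paused = False
    flags = []
    for r in residuals:
        # Accumulate only the part of the residual above the slack level.
        score = max(0.0, score + r - slack)
        if not paused and score >= pause_at:
            paused = True    # persistent unexplained change: intervene
        elif paused and score <= resume_at:
            paused = False   # change has subsided: hand control back
        flags.append(paused)
    return flags


# A single transient spike never accumulates enough to trigger a pause.
spike = monitor([0.1, 0.1, 1.5, 0.1, 0.1])
# A sustained disturbance triggers a pause, which is released after it ends.
sustained = monitor([0.1] + [1.5] * 3 + [0.1] * 7)
```

The gap between `pause_at` and `resume_at` is a design choice that keeps the signal from toggling on every noisy frame, which is one plausible way to obtain the stable triggers the abstract describes.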

2606.17040 2026-06-16 cs.RO cs.CV 交叉投稿

R2RDreamer: 3D-aware Data Augmentation for Spatially-generalized 2D Manipulation Policies

R2RDreamer: 面向空间泛化的2D操作策略的3D感知数据增强

Xiuwei Xu, Haowen Sun, Angyuan Ma, Yiwei Zhang, Zhenyu Wu, Xiaofeng Wang, Bingyao Yu, Zheng Zhu, Jie Zhou, Jiwen Lu

发表机构 * Tsinghua University(清华大学) BUPT(北京邮电大学) GigaAI

AI总结 提出R2RDreamer框架,通过轻量级3D编辑和2D视频补全,从少量真实演示生成几何一致的增强数据,提升2D操作策略的空间泛化能力。

Comments Project page: https://r2rdreamer.github.io/

详情
AI中文摘要

空间泛化对于模仿学习的操作策略至关重要,但通常需要跨不同物体姿态、机器人配置和相机视角的大规模演示。从少量源演示中进行数据增强为昂贵的真实世界数据收集提供了一种实用替代方案。基于仿真的增强可以创建可控变化,但需要复杂的环境和物体设置,并可能引入仿真到现实的差距。最近的实到实方法通过联合编辑真实演示的3D观测和动作轨迹来避免这些问题,但它们仍然依赖于强大的3D场景解析和几何补全,并且通常生成针对3D点云策略而非基于RGB的2D策略的观测。我们提出R2RDreamer,一个实到实演示增强框架,它在保持3D动作-观测编辑的几何一致性的同时,将视觉补全迁移到2D视频空间。具体来说,R2RDreamer首先通过在一个共享的3D框架中编辑不完整的物体点云和末端执行器轨迹来执行轻量级3D增强;然后,它将编辑后的场景投影到具有遮挡感知推理的掩码图像空间控制视频中,并使用密集控制图像到视频模型来补全时间上连贯的RGB观测。在空间偏移操作任务上的实验,包括2D扩散风格策略和视觉-语言-动作策略,表明R2RDreamer从有限的源演示中提高了空间泛化能力,分析验证了3D编辑、遮挡感知投影和视频补全的贡献。

英文摘要

Spatial generalization is critical for imitation-learned manipulation policies, but achieving it typically requires scaling demonstrations across diverse object poses, robot configurations, and camera viewpoints. Data augmentation from a few source demonstrations offers a practical alternative to costly real-world collection. Simulation-based augmentation can create controllable variation, but requires complex environment and object setup and may introduce a sim-to-real gap. Recent real-to-real methods avoid these issues by jointly editing 3D observations and action trajectories from real demonstrations, yet they still rely on strong 3D scene parsing and geometry completion, and often produce observations tailored to 3D pointcloud policies rather than RGB-based 2D policies. We propose R2RDreamer, a real-to-real demonstration augmentation framework that preserves the geometric consistency of 3D action-observation editing while moving visual completion to 2D video space. Specifically, R2RDreamer first performs lightweight 3D augmentation by editing incomplete object pointclouds and end-effector trajectories in a shared 3D frame; it then projects the edited scene into masked image-space control videos with occlusion-aware reasoning and uses a dense-control image-to-video model to complete temporally coherent RGB observations. Experiments on spatially shifted manipulation tasks with both 2D diffusion-style policies and vision-language-action policies show that R2RDreamer improves spatial generalization from limited source demonstrations, with analyses validating the contributions of 3D editing, occlusion-aware projection, and video completion.

2606.17046 2026-06-16 cs.RO cs.CV cs.LG 交叉投稿

Geometric Action Model for Robot Policy Learning

几何动作模型用于机器人策略学习

Jisang Han, Seonghu Jeon, Jaewoo Jung, René Zurbrügg, Honggyu An, Tifanny Portela, Marco Hutter, Marc Pollefeys, Seungryong Kim, Sunghwan Hong

发表机构 * KAIST AI(韩国科学技术院人工智能学院) ETH Zurich(苏黎世联邦理工学院) ETH AI Center(苏黎世联邦理工学院人工智能中心)

AI总结 提出几何动作模型(GAM),通过重用预训练几何基础模型(GFM)作为共享骨干,实现语言条件下的操作策略,在仿真和真实机器人任务中优于现有方法。

Comments Project page: https://cvlab-kaist.github.io/Geometric-Action-Model/

详情
AI中文摘要

通用机器人策略必须遵循用户指令,同时推理物体、相机和机器人动作如何在3D物理世界中交互。最近的视觉-语言-动作模型(VLAs)和视频世界-动作模型(WAMs)从大规模基础模型中继承了强大的语义或时间先验,但它们仍然主要在2D图像帧或2D派生的潜在空间上操作,隐含了接触丰富操作所需的3D几何信息。我们提出了几何动作模型(GAM),一种语言条件操作策略,直接重用预训练的几何基础模型(GFM)作为感知、时间预测和动作解码的共享基础。GAM在中间层分割GFM:浅层作为观察编码器,在分割层插入一个因果未来预测器,根据语言、本体感受和动作历史预测未来的潜在令牌。然后,预测的未来令牌通过剩余的GFM块进行特征传播和解码,使得单个骨干能够同时产生未来几何和动作。这种设计通过最小的架构修改赋予GFM语言条件的时间世界建模能力,同时保留其丰富的几何先验。在广泛的仿真和真实机器人操作基准测试中,GAM比当前基础模型规模的基线更准确、更鲁棒、更快、更轻量。

英文摘要

Generalist robot policies must follow user instructions while reasoning about how objects, cameras, and robot actions interact in the 3D physical world. Recent vision-language-action models (VLAs) and video world-action models (WAMs) inherit strong semantic or temporal priors from large-scale foundation models, but they still operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation. We propose the Geometric Action Model (GAM), a language-conditioned manipulation policy that directly repurposes a pretrained geometric foundation model (GFM) as a shared substrate for perception, temporal prediction, and action decoding. GAM splits the GFM at an intermediate layer: the shallow layers serve as an observation encoder, and a causal future predictor inserted at the split layer forecasts future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining GFM blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions. This design equips the GFM with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors. Across a broad suite of simulation and real-robot manipulation benchmarks, GAM is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines.

2509.23014 2026-06-16 cs.CV 版本更新

Planning with Unified Multimodal Models

统一多模态模型规划

Yihao Sun, Zhilong Zhang, Yang Yu, Pierre-Luc Bacon

发表机构 * Mila - Québec AI Institute(魁北克人工智能研究所) Université de Montréal(蒙特利尔大学) Nanjing University(南京大学)

AI总结 提出Uni-Plan框架,利用统一多模态模型同时作为策略、动态模型和价值函数,通过自判别过滤避免动态预测幻觉,在具身决策任务中优于VLM方法,无需专家演示且数据可扩展。

Comments 29 pages, 11 figures

详情
AI中文摘要

借助大型语言模型(LLMs)和视觉语言模型(VLMs)强大的推理能力,许多近期工作探索了将其用于决策。然而,这些方法大多仅依赖基于语言的推理,限制了其推理和做出明智决策的能力。最近,支持多模态输入和输出的统一多模态模型(UMMs)成为一个有前景的新方向。我们认为这类模型通过生成的视觉内容进行推理,在决策方面具有更大潜力。为此,我们提出了Uni-Plan,一个基于UMMs的规划框架。在该框架内,单个模型同时充当策略、动态模型和价值函数。此外,为避免动态预测中的幻觉,我们提出了一种新颖的方法——自判别过滤,其中生成模型作为自判别器来过滤无效的动态预测。在具身决策任务上的实验表明,与基于VLM的方法相比,Uni-Plan显著提高了成功率,同时展现出强大的数据可扩展性,无需专家演示,在相同训练数据规模下取得更好性能。这项工作为未来使用UMMs进行推理和决策的研究奠定了基础。

英文摘要

With the powerful reasoning capabilities of large language models (LLMs) and vision-language models (VLMs), many recent works have explored using them for decision-making. However, most of these approaches rely solely on language-based reasoning, which limits their ability to reason and make informed decisions. Recently, a promising new direction has emerged with unified multimodal models (UMMs), which support both multimodal inputs and outputs. We believe such models have greater potential for decision-making by enabling reasoning through generated visual content. To this end, we propose Uni-Plan, a planning framework built on UMMs. Within this framework, a single model simultaneously serves as the policy, dynamics model, and value function. In addition, to avoid hallucinations in dynamics predictions, we present a novel approach, self-discriminated filtering, in which the generative model serves as a self-discriminator to filter out invalid dynamics predictions. Experiments on embodied decision-making tasks show that Uni-Plan substantially improves success rates compared to VLM-based methods, while also showing strong data scalability, requiring no expert demonstrations and achieving better performance under the same training-data size. This work lays a foundation for future research in reasoning and decision-making with UMMs.

2510.12560 2026-06-16 cs.CV cs.LG cs.RO 版本更新

CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving

CoIRL-AD:面向自动驾驶的潜在世界模型中的协作-竞争模仿-强化学习

Xiaoji Zheng, Ziyuan Yang, Yanhao Chen, Yuhang Peng, Yuanrong Tang, Gengyuan Liu, Bokui Chen, Jiangtao Gong

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 提出CoIRL-AD框架,通过解耦模仿学习与强化学习、利用潜在世界模型进行长时程奖励估计以及引入竞争机制,在离线训练中提升自动驾驶的鲁棒性,尤其在跨城市泛化和长尾场景中表现优异。

Comments 19 pages, 22 figures, ICML 2026

详情
AI中文摘要

基于模仿学习(IL)训练的端到端自动驾驶模型通常泛化能力较差,尤其是在专家演示稀疏的长尾场景中。强化学习(RL)可以提供互补的任务级监督,但在没有交互模拟器的离线设置中,将RL应用于真实世界的自动驾驶具有挑战性,因为数据集主要由专家动作主导,行为多样性有限。我们提出CoIRL-AD,一个竞争性的双策略框架,在统一的离线训练机制下整合IL和RL。CoIRL-AD将模仿和奖励优化解耦到不同的智能体中,以缓解目标冲突,使用想象的未来轨迹进行长时程奖励估计,并引入竞争机制,选择性地传递有益行为,同时使RL保持与专家驾驶行为一致。在nuScenes基准上的实验表明,CoIRL-AD相较于强IL基线持续提升鲁棒性,尤其在跨城市泛化和长尾场景中取得了显著改进。代码可在以下网址获取:https://github.com/SEU-zxj/CoIRL-AD。

英文摘要

End-to-end autonomous driving models trained with imitation learning (IL) often generalize poorly, particularly in long-tail scenarios where expert demonstrations are sparse. Reinforcement learning (RL) can provide complementary task-level supervision, but applying RL to real-world autonomous driving is challenging in offline settings without interactive simulators, where datasets are dominated by expert actions and provide limited behavioral diversity. We propose CoIRL-AD, a competitive dual-policy framework that integrates IL and RL under a unified offline training regime. CoIRL-AD decouples imitation and reward optimization into separate actors to alleviate objective conflicts, uses imagined future rollouts for long-horizon reward estimation, and introduces a competition mechanism that selectively transfers beneficial behaviors while keeping RL anchored to expert-like driving. Experiments on the nuScenes benchmark show that CoIRL-AD consistently improves robustness over strong IL-based baselines, with especially large gains in cross-city generalization and long-tail scenarios. Code is available at: https://github.com/SEU-zxj/CoIRL-AD.

2511.15645 2026-06-16 cs.CV cs.RO 版本更新

FDIO: Frequency Decomposed Inertial Odometry

FDIO:频率分解惯性里程计

Shanshan Zhang, Liqin Wu, Wenying Cao, Lingxiang Zheng, Yu Yang

发表机构 * Department of Information and Communication Engineering, National and Local Joint Engineering Research Center of Navigation and Location Based Services, Xiamen University(信息与通信工程系、导航与位置服务国家与地方联合工程研究中心、厦门大学) Department of Electronic Science, State Key Laboratory of Physical Chemistry of Solid Surfaces, Xiamen University(电子科学系、固体表面物理化学国家重点实验室、厦门大学)

AI总结 针对双设备采集场景中IMU信号耦合问题,提出频率分解惯性里程计(FDIO),通过拉普拉斯金字塔分解信号、Mamba模块建模低频长程运动和多尺度卷积提取高频局部特征,在五个数据集上平均绝对轨迹误差降低33.3%。

详情
AI中文摘要

行人惯性里程计(PIO)仅利用惯性测量单元(IMU)采集的加速度和角速度测量值估计自主行人运动,使其在消费级定位应用中具有极高价值。然而,在双设备采集设置下,自由携带的移动设备收集的IMU信号本质上是复合信号,其中人体躯干的全局运动与局部肢体运动引起的扰动耦合在一起。这种耦合使得精确的人体运动建模更具挑战性。为解决这一问题,本文提出了频率分解惯性里程计(FDIO)。该方法首先使用拉普拉斯金字塔将输入IMU信号分解为低频和高频分量。然后采用Mamba模块从低频分量中建模长程运动信息,并使用多尺度卷积模块从高频分量中提取细粒度局部动态特征。在五个公开PIO数据集上的实验表明,FDIO的平均绝对轨迹误差为3.221米,平均相对轨迹误差为2.550米,与RoNIN ResNet基线相比,误差分别降低了33.3%和16.7%。这些结果验证了所提出的频率分解策略的有效性。据我们所知,这项工作是将Mamba和频率分解架构引入惯性里程计的早期尝试之一。

英文摘要

Pedestrian inertial odometry (PIO) estimates autonomous pedestrian motion using only acceleration and angular velocity measurements collected by an inertial measurement unit (IMU), making it highly valuable for consumer-level localization applications. However, under a dual-device acquisition setting, IMU signals collected by a freely carried mobile device are inherently composite signals in which the global motion of the human torso is coupled with perturbations induced by local limb motion. This coupling makes accurate human motion modeling more challenging. To address this issue, this paper proposes frequency decomposed inertial odometry (FDIO). The proposed method first decomposes input IMU signals into low-frequency and high-frequency components using a Laplacian pyramid. It then adopts a Mamba module to model long-range motion information from the low-frequency component and uses a multi-scale convolution module to extract fine-grained local dynamic features from the high-frequency component. Experiments on five public PIO datasets show that FDIO achieves an average absolute trajectory error of 3.221 m and an average relative trajectory error of 2.550 m, reducing the errors by 33.3% and 16.7%, respectively, compared with the RoNIN ResNet baseline. These results validate the effectiveness of the proposed frequency decomposition strategy. To the best of our knowledge, this work is among the first efforts to introduce Mamba and a frequency decomposition architecture into inertial odometry.
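The low-frequency and high-frequency split can be sketched on a single IMU channel. The following is a simplified, one-level, Laplacian-pyramid-style decomposition in plain Python, not the authors' code: the 5-tap binomial filter, the nearest-neighbour upsampling, and the function names `smooth`, `upsample`, and `laplacian_level` are assumptions chosen for brevity.

```python
from typing import List, Tuple

# 5-tap binomial low-pass filter (weights sum to 1); an assumed choice.
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]


def smooth(x: List[float]) -> List[float]:
    """Low-pass filter with edge replication, so the length is preserved."""
    n = len(x)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(KERNEL):
            j = min(max(i + k - 2, 0), n - 1)  # clamp index at the borders
            acc += w * x[j]
        out.append(acc)
    return out


def upsample(coarse: List[float], n: int) -> List[float]:
    """Nearest-neighbour expansion of the coarse signal back to length n."""
    return [coarse[min(i // 2, len(coarse) - 1)] for i in range(n)]


def laplacian_level(x: List[float]) -> Tuple[List[float], List[float]]:
    """Split x into a low-frequency part and a high-frequency residual."""
    coarse = smooth(x)[::2]           # blur, then downsample by 2
    low = upsample(coarse, len(x))    # expand back to the input length
    high = [a - b for a, b in zip(x, low)]
    return low, high


# Toy accelerometer channel: slow drift plus an alternating perturbation.
signal = [0.1 * t + (0.5 if t % 2 else -0.5) for t in range(16)]
low, high = laplacian_level(signal)
recon = [a + b for a, b in zip(low, high)]
```

Because the high-frequency part is defined as the residual, adding the two components back together recovers the input, so nothing is lost before the two branches process their respective bands.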

2602.07343 2026-06-16 cs.CV cs.AI cs.LG cs.RO 版本更新

Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation

通过文字看道路:一种语言引导的RGB-T驾驶场景分割框架

Ruturaj Reddy, Hrishav Bakul Barua, Junn Yong Loo, Thanh Thi Nguyen, Ganesh Krishnasamy

发表机构 * National University of Singapore(新加坡国立大学) University of Technology Sydney(悉尼科技大学)

AI总结 提出CLARITY框架,利用视觉语言模型先验动态调整RGB-T融合策略,并引入暗目标语义保留和层次化解码器,在MFNet数据集上达到62.3% mIoU和77.5% mAcc的新SOTA。

详情
AI中文摘要

在恶劣光照、照明和阴影条件下,道路场景的鲁棒语义分割仍然是自动驾驶应用的核心挑战。RGB-热融合是一种标准方法,但现有方法在所有条件下统一应用静态融合策略,导致模态特定噪声在网络中传播。因此,我们提出CLARITY,它根据检测到的场景条件动态调整融合策略。在视觉语言模型(VLM)先验的引导下,网络学习根据光照状态调节每种模态的贡献,同时利用对象嵌入进行分割,而不是应用固定的融合策略。我们进一步引入了两种机制:一种保留有效的暗对象语义,这些语义在先前的噪声抑制方法中被错误丢弃;另一种是层次化解码器,它在不同尺度上强制结构一致性,以锐化薄对象的边界。在MFNet数据集上的实验表明,CLARITY建立了新的最先进水平(SOTA),实现了62.3%的mIoU和77.5%的mAcc。

英文摘要

Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remains a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. Hence, we propose CLARITY, which dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms: one that preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state of the art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.
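For readers unfamiliar with the two reported metrics, the sketch below computes mIoU and mAcc from a confusion matrix using their standard definitions. The abstract does not spell out the evaluation protocol, so the row/column convention, the function name `miou_macc`, and the toy matrix are assumptions.

```python
from typing import List, Tuple


def miou_macc(conf: List[List[int]]) -> Tuple[float, float]:
    """Mean IoU and mean per-class accuracy from a confusion matrix.

    Assumed convention: conf[g][p] counts pixels with ground-truth class g
    predicted as class p. Every class is assumed to appear at least once.
    """
    n = len(conf)
    ious, accs = [], []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp                        # missed pixels of class c
        fp = sum(conf[r][c] for r in range(n)) - tp   # pixels wrongly called c
        ious.append(tp / (tp + fp + fn))
        accs.append(tp / (tp + fn))
    return sum(ious) / n, sum(accs) / n


# Toy two-class example (hypothetical counts, not MFNet results).
miou, macc = miou_macc([[3, 1],
                        [1, 5]])
```

In this toy case the per-class IoUs are 3/5 and 5/7 and the per-class accuracies are 3/4 and 5/6, so mAcc exceeds mIoU, the same ordering as the figures reported in the abstract.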

2603.07920 2026-06-16 cs.CV 版本更新

RLPR: Radar-to-LiDAR Place Recognition via Two-Stage Asymmetric Cross-Modal Alignment for Autonomous Driving

RLPR:面向自动驾驶的两阶段非对称跨模态对齐雷达-激光雷达地点识别

Zhangshuo Qi, Jingyi Xu, Luqi Cheng, Shichen Wen, Guangming Xiong

发表机构 * Beijing Institute of Technology(北京理工大学) Shanghai Jiaotong University(上海交通大学)

AI总结 提出RLPR框架,通过双流网络提取结构特征,并利用两阶段非对称跨模态对齐策略,实现雷达与激光雷达之间的鲁棒地点识别,在四个数据集上达到最优性能。

Comments Accepted by IEEE Robotics and Automation Letters (RA-L) 2026

详情
AI中文摘要

全天候自主性对于自动驾驶至关重要,这需要在不同场景下实现可靠的定位。虽然激光雷达地点识别被广泛部署用于此任务,但其性能在恶劣天气下会下降。相反,基于雷达的方法虽然具有天气鲁棒性,但受限于雷达地图的普遍不可用性。为了弥合这一差距,雷达到激光雷达的地点识别(将雷达扫描定位到现有激光雷达地图中)引起了越来越多的兴趣。然而,提取模态间共享的判别性和可泛化特征仍然具有挑战性,加之缺乏大规模配对训练数据以及不同雷达类型之间的信号异质性。在这项工作中,我们提出了RLPR,一个鲁棒的雷达到激光雷达地点识别框架,兼容单芯片、扫描和4D雷达。我们首先设计了一个双流网络来提取结构特征,这些特征抽象掉了传感器特定的信号属性(例如多普勒或RCS)。随后,基于我们对雷达和激光雷达之间任务特定非对称性的观察,我们引入了一种两阶段非对称跨模态对齐(TACMA)策略,该策略利用预训练的雷达分支作为判别性锚点来指导对齐过程。在四个数据集上的实验表明,RLPR实现了最先进的识别精度,并具有强大的零样本泛化能力。

英文摘要

All-weather autonomy is critical for autonomous driving, which necessitates reliable localization across diverse scenarios. While LiDAR place recognition is widely deployed for this task, its performance degrades in adverse weather. Conversely, radar-based methods, though weather-resilient, are hindered by the general unavailability of radar maps. To bridge this gap, radar-to-LiDAR place recognition, which localizes radar scans within existing LiDAR maps, has garnered increasing interest. However, extracting discriminative and generalizable features shared between modalities remains challenging, compounded by the scarcity of large-scale paired training data and the signal heterogeneity across radar types. In this work, we propose RLPR, a robust radar-to-LiDAR place recognition framework compatible with single-chip, scanning, and 4D radars. We first design a dual-stream network to extract structural features that abstract away from sensor-specific signal properties (e.g., Doppler or RCS). Subsequently, motivated by our task-specific asymmetry observation between radar and LiDAR, we introduce a two-stage asymmetric cross-modal alignment (TACMA) strategy, which leverages the pre-trained radar branch as a discriminative anchor to guide the alignment process. Experiments on four datasets demonstrate that RLPR achieves state-of-the-art recognition accuracy with strong zero-shot generalization capabilities.

2606.08525 2026-06-16 cs.CV 版本更新

DriveReward: A Comprehensive Dataset and Generative Vision-Language Reward Model for Autonomous Driving

DriveReward:面向自动驾驶的综合数据集与生成式视觉语言奖励模型

Qimao Chen, Fang Li, Yuechen Luo, Zehan Zhang, Haiyang Sun, Fangzhen Li, Bing Wang, Guang Chen, Yang Ji, Jiong Deng, Hongwei Xie, Hangjun Ye, Long Chen, Yi Zhang

发表机构 * Tsinghua University(清华大学) Xiaomi EV(小米汽车)

AI总结 提出DriveReward数据集和专用视觉语言奖励模型,通过反事实标注和时序视觉引导,解决自动驾驶中奖励获取的泛化问题,在强化学习和轨迹选择中取得与基于规则方法相当的性能。

详情
AI中文摘要

奖励模型在强化学习和自动驾驶的多模态轨迹选择中起着关键作用。然而,获取此类奖励通常依赖于手工设计的基于规则的目标或感知真值,这阻碍了数据扩展的泛化能力。虽然视觉语言模型在其他领域已被证明可作为奖励模型,但其在驾驶任务中的有效性尚未得到充分探索。在这项工作中,我们通过以下方式弥合这一差距:(1)引入DriveReward,一个通过时间接地视觉引导严格标注的推理轨迹评估数据集,并增加了反事实驾驶行为;(2)以及一个专门的视觉语言奖励模型。为了解决传统数据集中失败案例稀缺的问题,我们提出了一种反事实数据标注方案,构建包含多种驾驶风格和错误行为的案例。在我们提出的基准上的评估显示,即使是领先的开源和专有视觉语言模型也无法在所有任务中表现出色,突显出现有模型仍有很大的改进空间。基于这些发现,我们随后定制了一个专门的1B奖励模型,在特定任务的奖励对齐上优于更大的视觉语言模型。最后,我们通过将奖励模型集成到强化学习微调和多模态轨迹评分中,在多个基线上验证了其有效性,在开环和闭环评估中均达到了与基于规则的奖励计算相当的性能。

英文摘要

Reward models play a pivotal role in reinforcement learning (RL) and multi-modal trajectory selection for autonomous driving. However, acquiring such rewards typically relies on hand-crafted rule-based objectives or perception ground truth, which hinders generalization under data scaling. While Vision-Language Models (VLMs) have demonstrated feasibility as reward models in other domains, their effectiveness in driving tasks remains underexplored. In this work, we bridge this gap by (1) introducing DriveReward, a reasoning trajectory evaluation dataset rigorously labeled via temporally grounded visual guidance and augmented with counterfactual driving behaviors, and (2) developing a specialized Vision-Language Reward Model. To address the scarcity of failure cases in conventional datasets, we propose a counterfactual data annotation scheme to construct cases encompassing diverse driving styles and erroneous behaviors. Evaluations on our proposed benchmark reveal that even leading open-source and proprietary VLMs fail to excel across all tasks, highlighting significant room for improvement in existing models. Building on these findings, we subsequently tailor a specialized 1B reward model that outperforms larger VLMs on task-specific reward alignment. Finally, we validate our reward model's effectiveness by integrating it into RL finetuning and multi-modal trajectory scoring across multiple baselines, achieving performance comparable to rule-based reward calculations in both open-loop and closed-loop evaluation.

2606.10862 2026-06-16 cs.CV cs.AI 版本更新

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

LIBERO-Occ:通过视角想象评估和改进场景诱导遮挡下的视觉-语言-动作模型

Taishan Li, Jiwen Zhang, Siyuan Wang, Xuanjing Huang, Zhongyu Wei

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Chinese University of Hong Kong(香港中文大学)

AI总结 针对VLA模型在场景遮挡下性能下降的问题,提出LIBERO-Occ基准和视角想象方法,通过生成互补视图提升鲁棒性。

Comments 14 pages, 7 figures

详情
AI中文摘要

视觉-语言-动作(VLA)模型在标准操作基准上取得了强劲的性能,但大多数评估假设任务相关物体完全可见。这一假设在现实场景中经常不成立,因为遮挡使得操作部分可观察。本文研究了场景诱导遮挡作为VLA模型的一个基本挑战,并引入了LIBERO-Occ,一个面向遮挡的LIBERO扩展。实验表明,最先进的VLA在遮挡下性能显著下降。为解决这一问题,我们提出了视角想象(VIM),该方法从遮挡的主观测中生成互补视图,并基于观察和想象证据共同进行动作预测。VIM在任务套件、遮挡类型和严重程度上提高了鲁棒性,且无需在部署时增加额外摄像头,表明视角想象是部分可观察操作中感知完成的一种有前景的机制。我们的基准和相应代码可在以下网址获取:https://github.com/litsh/Libero-Occ。

英文摘要

Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable. In this paper, we study scene-induced occlusion as a fundamental challenge for VLA models and introduce LIBERO-Occ, an occlusion-oriented extension of LIBERO. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion. To address this issue, we propose Viewpoint Imagination (VIM), which generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time, suggesting that viewpoint imagination is a promising mechanism for perception completion in partially observable manipulation. Our benchmark and corresponding code are available at: https://github.com/litsh/Libero-Occ.

2606.13674 2026-06-16 cs.CV 版本更新

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

RepWAM:基于表示视觉-动作分词器的世界动作建模

Junke Wang, Qihang Zhang, Shuai Yang, Yiming Luo, Yujun Shen, Zuxuan Wu, Yu-Gang Jiang, Yinghao Xu

发表机构 * Institute of Trustworthy Embodied AI, Fudan University(复旦大学可信具身人工智能研究所) Robbyant, Ant Group(蚂蚁集团 Robbyant) Hongkong University of Science and Technology(香港科技大学)

AI总结 提出RepWAM,一种基于表示视觉-动作分词器的世界动作模型,通过联合建模未来视觉状态和潜在动作,在真实和仿真机器人操作任务中取得优异性能。

详情
AI中文摘要

本文提出RepWAM,一种基于表示视觉-动作分词器的表示中心世界动作模型(WAM)。现有的WAM通常从预训练的视频生成模型中继承面向重建的视频分词器。尽管这些分词器保留了视觉保真度,但仅靠像素重建对学习连接未来预测与机器人控制的指令跟随动态提供的指导有限。为解决此问题,我们探索了一种语义视觉-动作潜在空间用于表示中心的世界动作建模。具体来说,我们训练了一个表示视觉-动作分词器,将视觉输入映射为对齐的视觉和潜在动作标记。然后,我们预训练WAM以在语言指令下联合建模未来视觉状态和连接它们的潜在动作,随后适应真实机器人轨迹以实现闭环操作。在真实世界操作任务和仿真基准上的实验表明,RepWAM在多种操作设置中展现出强劲性能,而消融实验凸显了语义视觉-动作分词相对于面向重建替代方案的价值。这些结果确立了表示视觉-动作分词作为世界动作模型的有前途的基础,并朝着通用机器人策略迈出了一步。代码和权重将在以下网址提供:https://github.com/wdrink/RepWAM。

英文摘要

This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.

2509.18428 2026-06-16 cs.RO cs.CV 版本更新

Latent Action Pretraining Through World Modeling

通过世界建模的潜在动作预训练

Bahey Tharwat, Yara Nasser, Ali Abouzeid, Ian Reid

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(Mohamed bin Zayed人工智能大学) Alexandria University(亚历山大大学)

AI总结 提出LAWM框架,通过世界建模从无标签视频中学习潜在动作表征,实现跨任务、环境和本体的迁移学习,在LIBERO基准和真实场景中优于使用真实动作预训练的方法。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在遵循语言指令的机器人操作任务学习中越来越受欢迎。最先进的VLA模型,如OpenVLA和$\pi_{0}$,是在通过遥操作收集的大规模手动标注动作数据集上训练的。最近的方法,包括LAPA和villa-X,引入了潜在动作表示,通过建模帧间的抽象视觉变化,实现在无标签数据集上的无监督预训练。尽管这些方法展示了强大的结果,但它们的大模型尺寸使得在真实世界环境中部署具有挑战性。在这项工作中,我们提出了LAWM,一个模型无关的框架,通过世界建模从无标签视频数据中学习潜在动作表示,以自监督方式预训练模仿学习模型。这些视频可以来自机器人记录或人类使用日常物品执行动作的视频。我们的框架能够跨任务、环境和本体迁移所学知识。它在LIBERO基准和真实世界设置中优于使用真实机器人动作预训练的模型以及其他类似的预训练方法,同时在真实世界环境中高效且实用。

英文摘要

Vision-Language-Action (VLA) models have gained popularity for learning robotic manipulation tasks that follow language instructions. State-of-the-art VLAs, such as OpenVLA and $\pi_{0}$, were trained on large-scale, manually labeled action datasets collected through teleoperation. More recent approaches, including LAPA and villa-X, introduce latent action representations that enable unsupervised pretraining on unlabeled datasets by modeling abstract visual changes between frames. Although these methods have shown strong results, their large model sizes make deployment in real-world settings challenging. In this work, we propose LAWM, a model-agnostic framework that pretrains imitation learning models in a self-supervised way by learning latent action representations from unlabeled video data through world modeling. These videos can be sourced from robot recordings or videos of humans performing actions with everyday objects. Our framework is able to transfer learned knowledge across tasks, environments, and embodiments. It outperforms models pretrained with ground-truth robot actions and other similar pretraining methods on the LIBERO benchmark and in a real-world setup, while being efficient and practical for real-world settings.

2511.18960 2026-06-16 cs.LG cs.CV cs.RO 版本更新

AVA-VLA: Improving Vision-Language-Action models with Active Visual Attention

AVA-VLA: 通过主动视觉注意力改进视觉-语言-动作模型

Lei Xiao, Jifeng Li, Juntao Gao, Feiyang Ye, Yan Jin, Jingjing Qian, Jing Zhang, Yong Wu, Xiaoyuan Yu

发表机构 * LiAuto Inc.(LiAuto公司) Beijing University of Technology(北京工业大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 针对VLA模型忽视历史信息的问题,提出AVA-VLA框架,利用循环状态近似信念并引入主动视觉注意力动态重加权视觉令牌,在LIBERO和CALVIN等基准上取得最优性能。

Comments Accepted at CVPR 2026 (Highlight)

详情
AI中文摘要

视觉-语言-动作(VLA)模型最近在具身任务中取得了显著进展,但大多数方法在每个时间步独立处理视觉观察。这种历史无关的设计将机器人操作视为马尔可夫决策过程,而现实中的机器人控制本质上是部分可观测的,需要推理过去的交互。为了解决这一不匹配,我们从部分可观测马尔可夫决策过程的角度重新表述VLA策略学习,并提出AVA-VLA,一种将动作生成建立在循环状态上的框架,该状态作为智能体对任务历史信念的神经近似。基于此循环状态,我们引入了主动视觉注意力(AVA),它动态地重新加权当前观测中的视觉令牌,以关注与指令和执行历史最相关的区域。大量实验表明,AVA-VLA在标准机器人基准测试(包括LIBERO和CALVIN)上达到了最先进的性能,并有效迁移到真实世界的双臂操作任务。这些结果证明了时间基础的主动视觉处理在改善机器人序列决策中VLA性能的有效性。项目页面:https://liauto-dsr.github.io/AVA-VLA-Page。

英文摘要

Vision-Language-Action (VLA) models have shown remarkable progress in embodied tasks recently, but most methods process visual observations independently at each timestep. This history-agnostic design treats robot manipulation as a Markov Decision Process, even though real-world robotic control is inherently partially observable and requires reasoning over past interactions. To address this mismatch, we reformulate VLA policy learning from a Partially Observable Markov Decision Process perspective and propose AVA-VLA, a framework that conditions action generation on a recurrent state that serves as a neural approximation to the agent's belief over task history. Built on this recurrent state, we introduce Active Visual Attention (AVA), which dynamically reweights visual tokens in the current observation to focus on regions most relevant given both the instruction and execution history. Extensive experiments show that AVA-VLA achieves state-of-the-art performance on standard robotic benchmarks, including LIBERO and CALVIN, and transfers effectively to real-world dual-arm manipulation tasks. These results demonstrate the effectiveness of temporally grounded active visual processing for improving VLA performance in robotic sequential decision-making. The project page is available at https://liauto-dsr.github.io/AVA-VLA-Page.

2601.04061 2026-06-16 cs.RO cs.CV 版本更新

CLAP: Contrastive Latent Action Pretraining for Learning Vision-Language-Action Models from Human Videos

CLAP: 从人类视频中学习视觉-语言-动作模型的对比潜在动作预训练

Chubin Zhang, Jianan Wang, Zifeng Gao, Yue Su, Tianru Dai, Cai Zhou, Jiwen Lu, Yansong Tang

发表机构 * Tsinghua University(清华大学) Astribot University of Hong Kong(香港大学) Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出CLAP框架,通过对比学习将人类视频与机器人动作词汇对齐,利用伪标签训练VLA模型,实现从人类视频到机器人执行的有效技能迁移。

Comments The code is available at: https://github.com/LinShan-Bin/OpenCLAP

详情
AI中文摘要

通用视觉-语言-动作模型仍然受限于机器人数据的稀缺性,而人类视频演示则相对丰富。现有的潜在动作模型试图利用视频数据,但常常遭受视觉纠缠,编码噪声而非操作技能。为了解决这一限制,我们提出了对比潜在动作预训练(CLAP),该框架首先使用Act-VAE从机器人轨迹中学习可执行的动作标记词汇,然后通过对比学习将人类视觉转换与该词汇对齐。这种对齐将未标记的人类视频映射到物理上可行的潜在动作空间,而不是重建外观。基于对齐的标记,我们使用机器人演示和伪标记的人类视频训练CLAP-NTP作为自回归VLA,保持指令遵循和物体泛化能力。为了部署和目标域适应,我们进一步引入了一种后训练策略,该策略将CLAP-RF(一种用于低延迟连续动作块预测的整流流动作头)与知识匹配正则化相结合,以在微调期间保留预训练的语义知识。大量实验表明,CLAP在竞争基线上取得了强劲的性能,同时实现了从人类视频到机器人执行的有效技能迁移。

英文摘要

Generalist Vision-Language-Action models remain constrained by the scarcity of robotic data relative to the abundance of human video demonstrations. Existing Latent Action Models attempt to use video data but often suffer from visual entanglement, encoding noise rather than manipulation skills. To address this limitation, we propose Contrastive Latent Action Pretraining (CLAP), a framework that first uses Act-VAE to learn an executable action-token vocabulary from robot trajectories and then aligns human visual transitions with this vocabulary through contrastive learning. This alignment maps unlabeled human videos into a physically grounded latent action space rather than reconstructing appearance. Building on the aligned tokens, we train CLAP-NTP as an autoregressive VLA using robot demonstrations and pseudo-labeled human videos, preserving instruction following and object generalization. For deployment and target-domain adaptation, we further introduce a post-training strategy that combines CLAP-RF, a Rectified Flow action head for low-latency continuous action chunk prediction, with Knowledge Matching regularization to preserve pretrained semantic knowledge during fine-tuning. Extensive experiments show that CLAP achieves strong performance against competitive baselines while enabling effective skill transfer from human videos to robotic execution.
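The contrastive alignment step can be illustrated with a generic InfoNCE-style loss, in which each human-video transition embedding should score highest against its paired action-token embedding. This is a sketch under stated assumptions rather than the CLAP objective: the abstract does not give the loss, and the function name `info_nce`, the temperature of 0.1, and the toy 2-D embeddings are hypothetical.

```python
import math
from typing import List


def _dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def _normalize(v: List[float]) -> List[float]:
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def info_nce(human: List[List[float]],
             robot: List[List[float]],
             temperature: float = 0.1) -> float:
    """Average InfoNCE loss where human[i] is paired with robot[i]."""
    h = [_normalize(v) for v in human]
    r = [_normalize(v) for v in robot]
    total = 0.0
    for i, hv in enumerate(h):
        # Cosine similarities to every candidate, scaled by the temperature.
        logits = [_dot(hv, rv) / temperature for rv in r]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # negative log-probability of the pair
    return total / len(h)


# Toy embeddings: correctly paired versus deliberately swapped.
human_emb = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = info_nce(human_emb, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = info_nce(human_emb, [[0.0, 1.0], [1.0, 0.0]])
```

Minimizing such a loss pulls each visual transition toward its matching action token and away from the others, which is the sense in which unlabeled human video is mapped into the action vocabulary instead of being reconstructed pixel by pixel.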

2601.18692 2026-06-16 cs.RO cs.CV 版本更新

A Pragmatic VLA Foundation Model

一个务实的VLA基础模型

Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, Yiyu Ren, Kejia Zhang, Hui Yu, Jingmei Zhao, Shuai Zhou, Zhenqi Qiu, Houlong Xiong, Ziyu Wang, Zechen Wang, Ran Cheng, Yong-Lu Li, Yongtao Huang, Xing Zhu, Yujun Shen, Kecheng Zheng

发表机构 * robbyant.com

AI总结 提出LingBot-VLA,基于约2万小时真实数据和9种双臂机器人配置,在3个平台上完成100个任务,性能优于竞品,并实现高效训练吞吐。

Comments Project Webpage: https://technology.robbyant.com/lingbot-vla/, Code: https://github.com/Robbyant/lingbot-vla/, GM-100: https://huggingface.co/datasets/robbyant/lingbot-GM-100

详情
AI中文摘要

在机器人操作领域,一个有能力的视觉-语言-动作(VLA)基础模型有望在任务和平台上忠实泛化,同时确保成本效率(例如,适应所需的数据和GPU小时数)。为此,我们开发了LingBot-VLA,使用了来自9种流行的双臂机器人配置的约2万小时真实数据。通过对3个机器人平台的系统评估,每个平台完成100个任务,每个任务有130个训练后回合,我们的模型在性能上明显优于竞争对手,展示了其强大的性能和广泛的泛化能力。我们还构建了一个高效的代码库,在8-GPU训练设置下实现了每秒261个样本的吞吐量,相比现有的VLA导向代码库,加速了1.5~2.8倍(取决于所依赖的VLM基础模型)。上述特性确保我们的模型非常适合实际部署。为了推动机器人学习领域的发展,我们开放了代码、基础模型和基准数据,重点关注更具挑战性的任务和促进合理的评估标准。

英文摘要

Offering great potential in robotic manipulation, a capable Vision-Language-Action (VLA) foundation model is expected to faithfully generalize across tasks and platforms while ensuring cost efficiency (e.g., data and GPU hours required for adaptation). To this end, we develop LingBot-VLA with around 20,000 hours of real-world data from 9 popular dual-arm robot configurations. Through a systematic assessment on 3 robotic platforms, each completing 100 tasks with 130 post-training episodes per task, our model achieves clear superiority over competitors, showcasing its strong performance and broad generalizability. We have also built an efficient codebase, which delivers a throughput of 261 samples per second with an 8-GPU training setup, representing a 1.5~2.8× speedup (depending on the underlying VLM base model) over existing VLA-oriented codebases. The above features ensure that our model is well-suited for real-world deployment. To advance the field of robot learning, we provide open access to the code, base model, and benchmark data, with a focus on enabling more challenging tasks and promoting sound evaluation standards.

2602.00222 2026-06-16 cs.RO cs.AI cs.CV 版本更新

MapDream: Task-Driven Map Learning for Vision-Language Navigation

MapDream: 面向视觉-语言导航的任务驱动地图学习

Guoxin Lian, Shuo Wang, Yucheng Wang, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Deying Li, Zhaoxin Fan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出MapDream框架,通过自回归鸟瞰图生成联合学习地图与动作预测,在R2R-CE和RxR-CE上达到单目最优性能。

详情
AI中文摘要

视觉-语言导航(VLN)要求智能体在部分可观测的3D环境中遵循自然语言指令,这促使地图表示能够聚合超出局部感知的空间上下文。然而,现有大多数方法依赖于独立于导航策略构建的手工地图。我们认为,地图应该是由导航目标直接塑造的学习表示,而非详尽的重建。基于这一见解,我们提出MapDream,一种地图在环框架,将地图构建表述为自回归鸟瞰图(BEV)图像合成。该框架联合学习地图生成和动作预测,将环境上下文蒸馏为紧凑的三通道BEV地图,仅保留导航关键的可通行性。监督预训练引导了可靠的地图到控制接口,而自回归设计通过强化微调实现端到端联合优化。在R2R-CE和RxR-CE上的实验取得了最先进的单目性能,验证了任务驱动的生成式地图学习。

英文摘要

Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.

2602.13197 2026-06-16 cs.RO cs.CV cs.LG 版本更新

Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

模仿有效的方法:基于仿真过滤的人类视频模块化策略学习

Albert J. Zhai, Kuo-Hao Zeng, Jiasen Lu, Ali Farhadi, Shenlong Wang, Wei-Chiu Ma

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Allen Institute for AI(Allen人工智能研究所) University of Washington(华盛顿大学) Cornell University(康奈尔大学)

AI总结 提出Perceive-Simulate-Imitate框架,通过仿真过滤人类视频中的抓取-轨迹对,学习任务导向的抓取与后抓取运动策略,无需机器人数据即可实现鲁棒操作。

Comments Transactions on Machine Learning Research (TMLR)

详情
AI中文摘要

通过观看人类视频学习操作技能的能力有潜力为机器人学习解锁新的高度可扩展数据源。本文研究抓取操作,其中任务涉及在抓取物体后执行各种后抓取运动。人类视频为学习后抓取运动提供了强信号,但对于学习先决的抓取行为帮助较小,尤其是对于没有类人手的机器人。一个有前景的方法是采用模块化策略设计,利用专用抓取生成器产生稳定抓取。然而,任意稳定抓取通常与任务不兼容,阻碍机器人执行期望的下游运动。为解决这一挑战,我们提出Perceive-Simulate-Imitate (PSI)框架,该框架使用通过仿真中配对抓取-轨迹过滤处理的人类视频运动数据来训练模块化操作策略。这一仿真步骤用抓取适用性标签扩展轨迹数据,从而允许对任务导向的抓取能力进行监督学习。通过真实世界实验,我们展示了该框架可以在没有任何机器人数据的情况下高效学习精确操作技能,相比直接使用抓取生成器,性能显著更鲁棒。

英文摘要

The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.

2606.12978 2026-06-16 cs.RO cs.CV cs.SY eess.SY 版本更新

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

针对视觉-语言-动作模型的轨迹级重定向攻击

Gokul Puthumanaillam, Vardhan Dongre, Pranay Thangeda, Hooshang Nayyeri, Dilek Hakkani-Tür, Melkior Ornik

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文发现VLA模型存在轨迹级漏洞:看似保留原始指令的对抗性提示,能重定向机器人最终物理结果,并提出了命令保持的轨迹重定向威胁模型和同策略(on-policy)提示搜索方法。

详情
AI中文摘要

视觉-语言-动作(VLA)策略将自然语言引入闭环机器人控制,使机器人能够直接根据文本指令执行操作任务。同一接口也使文本在控制中反复发挥作用:提示在每个重新规划步骤中被重复使用,而每个以提示为条件的动作都会改变策略后续所依据的观测。现有的VLA攻击研究的是能够诱发特定低级动作、或使此类动作在图像变化时持续存在的对抗性提示。我们识别出一种更强的轨迹级故障模式:提示看起来仍在指定预期任务,却重定向了最终的物理结果。我们在数学上将这一设置形式化为命令保持的轨迹重定向,这是一种仅提示的威胁模型:攻击者在回合开始前选定一个提示,所有策略和环境组件保持不变,且提示必须接近良性指令,同时不得包含目标词和纠正性语言。为了找到这样的提示,我们引入了一种同策略(on-policy)提示搜索方法,该方法利用轨迹展开(rollout)来发现扰动,使其闭环行为跟踪目标任务,同时满足命令保持约束。在仿真和硬件上的实验表明,接近良性的提示扰动可以将VLA的轨迹展开重定向到攻击者指定的目标。这些结果暴露了VLA指令落地(instruction grounding)中的轨迹级漏洞:看似保留预期命令的文本仍然可以让对手控制机器人的最终物理结果。项目网站:https://vla-redirection-attack.github.io/

英文摘要

Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still appears to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as command-preserving trajectory redirection, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: https://vla-redirection-attack.github.io/

2606.13769 2026-06-16 cs.RO cs.CV cs.LG 版本更新

$\mu_0$: A Scalable 3D Interaction-Trace World Model

$\mu_0$: 一种可扩展的3D交互轨迹世界模型

Seungjae Lee, Yoonkyo Jung, Jusuk Lee, Jonghun Shin, Amir Hossein Shahidzadeh, Yao-Chih Lee, H. Jin Kim, Jia-Bin Huang, Furong Huang

发表机构 * University of Maryland, College Park(马里兰大学帕克分校) Seoul National University(首尔大学)

AI总结 提出基于3D轨迹的可扩展世界模型$\mu_0$,通过预测交互点轨迹实现跨本体机器人学习,无需动作标签,性能媲美有监督模型。

详情
AI中文摘要

能够捕捉动作如何引起物理变化的世界模型使得可扩展的机器人学习成为可能,而无需依赖特定本体的动作标签。像素空间视频模型提供了广泛的视觉先验,但将模型容量消耗在密集外观重建上,而直接动作模型则需要特定本体的标签,阻碍了可扩展性。我们提出$\mu_0$,一种基于3D轨迹的可扩展世界模型。$\mu_0$不是预测密集像素或直接建模动作,而是预测显著交互点(如物体、工具、手和接触区域)的平滑3D轨迹,从而产生一个紧凑、与本体无关的运动接口。为了能够从多样化的视频源进行训练,我们的TraceExtract系统通过选择关键点、构建全局对齐的轨迹以及将运动片段与层次化语言描述关联,自动提取3D监督。这种TraceExtract监督通过将预训练的视觉-语言骨干网络与模块化轨迹专家相结合来预训练$\mu_0$,其中轨迹专家通过B样条控制点表示每个查询并预测未来轨迹。实验表明,$\mu_0$在2D和3D轨迹预测方面均优于基线方法,包括轨迹预测模型和分词VLM方法。由于$\mu_0$是冻结且可重用的,它可以与动作专家配对用于下游机器人本体。尽管是无动作预训练,由此产生的轨迹条件策略在性能上与使用动作监督预训练的VLA模型(如$\pi_0$)相当。这些结果确立了3D轨迹作为跨本体操作的可扩展和可迁移表示。

英文摘要

World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $\mu_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $\mu_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $\mu_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $\mu_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $\mu_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $\pi_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
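To make the B-spline trace representation concrete, the sketch below expands a handful of 3D control points into a smooth trajectory with a uniform cubic B-spline. The spline degree, the knot layout, and the function names `cubic_bspline_point` and `sample_trace` are assumptions for illustration only; the abstract says only that each query is represented by B-spline control points.

```python
def cubic_bspline_point(p0, p1, p2, p3, u):
    """Evaluate one uniform cubic B-spline segment at u in [0, 1]."""
    b0 = (1 - u) ** 3 / 6.0
    b1 = (3 * u**3 - 6 * u**2 + 4) / 6.0
    b2 = (-3 * u**3 + 3 * u**2 + 3 * u + 1) / 6.0
    b3 = u**3 / 6.0  # the four basis weights always sum to 1
    return tuple(b0 * a + b1 * b + b2 * c + b3 * d
                 for a, b, c, d in zip(p0, p1, p2, p3))

def sample_trace(control_points, samples_per_segment=8):
    """Turn N >= 4 control points (x, y, z) into a dense 3D trace."""
    trace = []
    for i in range(len(control_points) - 3):
        segment = control_points[i:i + 4]
        for k in range(samples_per_segment):
            trace.append(cubic_bspline_point(*segment, k / samples_per_segment))
    # close the curve with the end point of the last segment
    trace.append(cubic_bspline_point(*control_points[-4:], 1.0))
    return trace

controls = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
            (3.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
trace = sample_trace(controls, samples_per_segment=4)
```

A model that predicts a few control points per query point therefore commits to a smooth curve by construction, which is one plausible reason to prefer this parameterisation over per-timestep coordinates.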

3. 图像识别、检索与分类 18 篇

2606.14735 2026-06-16 cs.CV 新提交

UtVAA: Ultra-tiny Vision Transformer with Affix Attention for Mobile Image Classification

UtVAA: 用于移动图像分类的带有Affix Attention的超微型视觉Transformer

Romiyal George, Sathiyamohan Nishankar, Selvarajah Thuseethan, Roshan G. Ragel

发表机构 * University of Peradeniya(佩拉德尼亚大学) Charles Darwin University(查尔斯·达尔文大学)

AI总结 提出超微型ViT架构UtVAA,通过Affix Attention块结合局部与全局特征,在极低参数量和FLOPs下实现高精度图像分类,适用于移动设备。

Comments 13 pages, 7 figures

详情
AI中文摘要

视觉Transformer(ViT)在图像分类中展现了强大的表示能力。然而,其二次自注意力复杂度和大量参数限制了在资源受限的移动和边缘设备上的部署。本文介绍了UtVAA,一种超微型视觉Transformer架构,专为在严格计算预算下进行高效视觉识别而设计。它包含一个新颖的Affix Attention块,该块结合了深度可分离局部特征提取、线性自注意力、用于空间依赖建模的坐标注意力,以及一个轻量级三元融合策略来整合局部和全局表示。此外,Dilated Bottleneck块通过使用扩张深度可分离卷积扩展感受野,同时通过残差连接保持低FLOPs和稳定优化。UtVAA实现了可扩展的Tiny、Medium和Large变体,其中最小的模型包含204.67K参数和53.95M FLOPs。在CIFAR-10、CIFAR-100、PlantVillage-Tomato和SLIF-Tomato数据集上的实验结果表明,UtVAA在百万参数以下的范围内达到了有竞争力的准确率。总体而言,结果表明基于Transformer的视觉模型可以重新设计为超微型架构,而不会显著损失判别性能,使得UtVAA适用于移动和边缘部署。代码可在https://github.com/romiyal/UtVAA获取。

英文摘要

Vision Transformers (ViTs) have demonstrated strong representation capability in image classification. However, their quadratic self-attention complexity and large parameter counts limit deployment on resource-constrained mobile and edge devices. This paper introduces UtVAA, an ultra-tiny Vision Transformer architecture designed for efficient visual recognition under strict computational budgets. It incorporates a novel Affix Attention block that combines depthwise-pointwise local feature extraction, linear self-attention, coordinate attention for spatial dependency modelling, and a lightweight ternary fusion strategy to integrate local and global representations. In addition, Dilated Bottleneck blocks expand the receptive field using dilated depthwise separable convolutions while maintaining low FLOPs and stable optimisation through residual connections. UtVAA is implemented in scalable Tiny, Medium, and Large variants, with the smallest model containing 204.67K parameters and 53.95M FLOPs. Experimental results on CIFAR-10, CIFAR-100, PlantVillage-Tomato and SLIF-Tomato datasets show that UtVAA achieves competitive accuracy within a sub-million-parameter regime. Overall, the results demonstrate that transformer-based vision models can be redesigned into ultra-tiny architectures without significant loss in discriminative performance, making UtVAA suitable for mobile and edge deployment. Code is available at https://github.com/romiyal/UtVAA

2606.14770 2026-06-16 cs.CV cs.AI cs.IR cs.LG 新提交

An Empirical Analysis of Optimization Dynamics and Sparsity Boundaries in Large-Scale Pedestrian Attribute Recognition

大规模行人属性识别中的优化动态与稀疏边界实证分析

Houssam El Mir

发表机构 * College of Computer Science and Technology, Zhejiang University of Technology(浙江工业大学计算机科学与技术学院)

AI总结 针对行人属性识别中极端类别不平衡问题,提出多标签焦点损失校准配置(alpha=0.50, gamma=2.0),在零计算开销下匹配BCE基线并提升难例挖掘,同时识别出0.1%正样本率下的稀疏墙边界。

详情
AI中文摘要

行人属性识别(PAR)对于视频监控至关重要,支持法医搜索和重识别系统。当将PETA和PA-100K合并为一个包含109,000张图像的复合语料库时,极端类别不平衡仍然是一个基本障碍,其中少数属性的正样本比例低于1%。这导致标准BCE优化抑制稀有特征,我们称之为多数负类欺骗陷阱。我们在ResNet-18骨干网络上对多标签焦点损失超参数(alpha和gamma)进行了系统消融。校准配置(alpha=0.50, gamma=2.0)实现了62.32%的宏F1分数,与BCE基线相当,同时保留了优越的难例挖掘和收敛动态。我们的方法使用纯损失函数工程,边缘部署零计算开销。我们识别出稀疏墙,这是一个硬边界,当正样本比例低于0.1%时,全局损失重新加权失效,需要实例级干预。

英文摘要

Pedestrian Attribute Recognition (PAR) is critical for video surveillance, enabling forensic search and re-identification systems. Extreme class imbalance remains a fundamental obstacle when merging PETA and PA-100K into a 109,000-image composite corpus, where minority attributes have positive sample fractions below 1%. This causes standard BCE optimization to suppress rare traits, a phenomenon we term the majority negative class cheating trap. We present a systematic ablation of Multi-Label Focal Loss hyperparameters (alpha and gamma) on a ResNet-18 backbone. A calibrated configuration (alpha=0.50, gamma=2.0) achieves a Macro F1-score of 62.32%, matching the BCE baseline while preserving superior hard-example mining and convergence dynamics. Our approach uses pure loss-function engineering, adding zero computational overhead for edge deployment. We identify the Sparsity Wall, a hard boundary where positive sample fractions below 0.1% make global loss reweighting ineffective, requiring instance-level intervention.
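For reference, a multi-label focal loss with the reported setting (alpha=0.50, gamma=2.0) can be sketched as follows. This uses the standard per-attribute focal loss form, -alpha_t (1 - p_t)^gamma log(p_t); whether the paper uses exactly this variant is an assumption, and the function names are illustrative.

```python
import math

def multilabel_focal_loss(probs, targets, alpha=0.50, gamma=2.0, eps=1e-7):
    """Mean focal loss over independent binary attributes.

    probs   : predicted probabilities in (0, 1), one per attribute
    targets : ground-truth labels (0 or 1), one per attribute
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)         # clamp for numerical safety
        p_t = p if y == 1 else 1.0 - p          # probability of the true label
        a_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
        total += -a_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)

def bce_loss(probs, targets, eps=1e-7):
    """Plain BCE, recovered as the gamma=0, alpha=0.5 case (times 2)."""
    return 2.0 * multilabel_focal_loss(probs, targets, alpha=0.5, gamma=0.0, eps=eps)

probs, targets = [0.9, 0.2, 0.6], [1, 0, 1]
focal = multilabel_focal_loss(probs, targets)
bce = bce_loss(probs, targets)
```

With gamma=2.0 the factor (1 - p_t)^2 shrinks the contribution of well-classified attributes, so the gradient budget shifts toward hard examples without any change to the network or its inference cost.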

2606.14871 2026-06-16 cs.CV cs.AI 新提交

An Ensemble Deep Learning Approach for Reliable and Scalable Lemon Leaf Disease Classification

一种可靠且可扩展的柠檬叶病害分类集成深度学习方法

Shayan Abrar, Sudeepta Mandal, Abdul Awal Yasir, Sonjoy Bhattacharjee, Sadman Haque Bhuiyan, Samanta Ghosh, Rafi Ahamed

发表机构 * Dept. of CSE(计算机科学与工程系) American International University-Bangladesh(美国国际大学-孟加拉国) East West University(东-西大学) North South University(北南大学)

AI总结 提出集成InceptionV3和MobileNetV2的深度学习方法,结合对抗训练和Grad-CAM可视化,在9类柠檬叶病害数据集上达到99.27%准确率,实现可靠分类。

Comments 5 pages, 12 figures, 3 Tables, Presented at 18th IEEE International Conference on Computational Intelligence and Communication Networks (CICN) 2026

详情
AI中文摘要

植物病害的早期检测对植物和农民至关重要。植物病害会降低水果的产量和品质,并且植物在感染后更容易受到其他胁迫的影响。柠檬叶病害数据集包含1354张图像,分为9个类别,其中仅1个类别为健康叶片,其余8个类别为叶片病害。经过全面预处理后,数据集被划分为训练集(70%)、测试集(15%)和验证集(15%)。应用了两个预训练模型(InceptionV3和MobileNetV2),然后使用集成技术将这些模型组合起来以提高鲁棒性。集成模型表现出99.27%的准确率。应用对抗训练以提高模型的能力,并确保在噪声数据下的可靠预测。Grad-CAM可视化突出了叶片图像的重要区域,从而验证了模型预测的置信度。

英文摘要

Early detection of plant diseases is crucial for both plants and farmers. Plant diseases reduce fruit yield and quality, and infected plants are more susceptible to other stresses. The lemon leaf disease dataset contains 1354 images in 9 classes: one class for healthy leaves and 8 classes for leaf diseases. After comprehensive preprocessing, the dataset was split into training (70%), testing (15%), and validation (15%) sets. Two pretrained models (InceptionV3 and MobileNetV2) were applied and then combined using an ensemble technique to boost robustness. The ensemble model achieved 99.27% accuracy. Adversarial training is applied to improve the models' capability and ensure reliable predictions under noisy data. Grad-CAM visualization highlights the important regions of leaf images, helping to validate the model's predictions and their confidence levels.
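The abstract does not say which ensemble technique combines the two networks. One common choice is soft voting, sketched below under that assumption; the function name `ensemble_predict`, the equal weighting, and the toy probabilities are hypothetical.

```python
def ensemble_predict(probs_a, probs_b, weight_a=0.5):
    """Soft voting: weighted average of two models' class probabilities.

    probs_a, probs_b : per-class probabilities from the two models
    Returns (predicted class index, combined probabilities).
    """
    combined = [weight_a * a + (1.0 - weight_a) * b
                for a, b in zip(probs_a, probs_b)]
    return combined.index(max(combined)), combined

# Toy 3-class example in which the two models disagree.
label, combined = ensemble_predict([0.6, 0.3, 0.1], [0.2, 0.7, 0.1])
```

Averaging probabilities rather than hard labels lets a confident model outvote an uncertain one, which is the usual argument for soft voting when only two members are available.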

2606.14886 2026-06-16 cs.CV cs.AI 新提交

Improved Knowledge Distillation for Land-Use Image Classification

改进的知识蒸馏用于土地利用图像分类

Arundhuti Sur, Abhiroop Chatterjee, Susmita Ghosh, Emmett Ientilucci

发表机构 * Jadavpur University(贾达沃大学) Rochester Institute of Technology(罗切斯特理工学院)

AI总结 提出一种改进的知识蒸馏框架,通过VGG16教师网络向轻量MobileNetV2学生网络传递知识,结合硬监督和软监督策略,在三个数据集上达到99.04%准确率,优于基线方法。

Comments Accepted by IGARSS 2026

详情
AI中文摘要

本文提出了一种改进的知识蒸馏(KD)框架,用于高效压缩深度卷积神经网络以完成土地利用图像分类任务。受在降低计算复杂度的同时实现竞争性分类准确率的需要的驱动,采用教师-学生学习范式,其中VGG16网络将知识传递给轻量级MobileNetV2模型。所提出的框架将来自真实标签的硬监督与结合了Kullback-Leibler散度和余弦相似度损失的软监督策略相结合。在三个土地利用数据集上进行的实验表明,所提出的基于KD的方法性能提升,达到了99.04%的准确率,优于基线学生训练和单损失蒸馏方法,同时保持了显著的模型压缩。

英文摘要

This article proposes an improved Knowledge Distillation (KD) framework for efficiently compressing deep convolutional neural networks for land-use image classification. Motivated by the need to achieve competitive classification accuracy while reducing computational complexity, a teacher-student learning paradigm is adopted in which a VGG16 network transfers knowledge to a lightweight MobileNetV2 model. The proposed framework integrates hard supervision from ground-truth labels with a soft supervision strategy that combines Kullback-Leibler divergence and cosine similarity losses. Experiments conducted on three land-use datasets show that the proposed KD-based method yields improved performance and achieves an accuracy of 99.04%, outperforming both baseline student training and single-loss distillation approaches while retaining substantial model compression.
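The combination of hard and soft supervision can be sketched as a three-term loss: cross-entropy on the label, temperature-scaled KL divergence between teacher and student, and a cosine term. The weights, the temperature of 4.0, and the choice to apply the cosine term to the logits are assumptions for illustration; the abstract does not give these details.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kd_loss(student_logits, teacher_logits, label, temperature=4.0,
            w_hard=1.0, w_kl=1.0, w_cos=1.0):
    """Return (total loss, (hard, kl, cos)) for one training example."""
    # Hard supervision: cross-entropy against the ground-truth label.
    hard = -math.log(softmax(student_logits)[label])
    # Soft supervision 1: KL(teacher || student) on softened distributions.
    t = softmax(teacher_logits, temperature)
    s = softmax(student_logits, temperature)
    kl = sum(ti * math.log(ti / si) for ti, si in zip(t, s) if ti > 0)
    # Soft supervision 2: cosine distance between the two logit vectors.
    dot = sum(a * b for a, b in zip(student_logits, teacher_logits))
    norms = (math.sqrt(sum(a * a for a in student_logits))
             * math.sqrt(sum(b * b for b in teacher_logits)))
    cos = 1.0 - dot / norms
    total = w_hard * hard + w_kl * temperature**2 * kl + w_cos * cos
    return total, (hard, kl, cos)

same_total, (same_hard, same_kl, same_cos) = kd_loss(
    [2.0, 0.5, -1.0], [2.0, 0.5, -1.0], label=0)
diff_total, (_, diff_kl, diff_cos) = kd_loss(
    [0.1, 2.0, -1.0], [2.0, 0.5, -1.0], label=0)
```

The `temperature**2` factor is the usual rescaling that keeps the soft-target gradients comparable in magnitude to the hard-label term.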

2606.15134 2026-06-16 cs.CV cs.AI cs.LG 新提交

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

超越标量距离:来自冻结MLLM的语义属性梯度用于视觉嵌入

Shubhang Bhatnagar, Dheeraj Baiju, Narendra Ahuja

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出SAGA框架,利用冻结的多模态大语言模型(MLLM)通过GRPO奖励机制为视觉编码器提供属性级监督,替代传统标量距离,提升零样本图像检索性能。

详情
AI中文摘要

用于检索的视觉编码器通常通过类标签监督进行训练:每个训练对简化为一个标量,均匀地将嵌入推远或拉近,就好像每个视觉属性要么不同要么匹配。一个多模态大语言模型(MLLM),在展示相同的一对图像时,能够阐述这些属性并利用它们预测图像是否共享一个类别。我们提出SAGA,一个框架,将这种基于语言、属性感知的感知转化为编码器本身的训练信号。具体来说,我们使用组相对策略优化(GRPO)来奖励MLLM对视觉编码器令牌的正确预测。由于正确的预测要求这些令牌暴露该对之间不同或匹配的具体属性,梯度推动编码器编码这些属性,用属性解析的监督取代统一的成对标量。一个辅助的注意力蒸馏损失将编码器的嵌入锚定到MLLM关注的令牌上,一个标准的度量学习损失塑造嵌入几何结构以进行最近邻检索。MLLM在整个过程中被冻结,在推理时被丢弃,与度量学习基线的部署成本相匹配。在CUB-200-2011、Cars-196、FGVC-Aircraft和iNaturalist Aves上的零样本图像检索中,SAGA在Recall@1上比最先进的基线提高了3到6个百分点。

英文摘要

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the embedding apart or pulls it together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves on zero-shot image retrieval.
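The core of GRPO, as commonly described, is that each sampled response is scored relative to the other responses in its group rather than against a learned value function. The sketch below shows only that normalisation step. How SAGA defines rewards and routes the gradient into the encoder is not specified here, so the binary reward and the function name are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four sampled predictions for one image pair:
# reward 1.0 for a correct same-class / different-class answer, else 0.0.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When every response in a group earns the same reward, all advantages are zero, so the pair contributes no gradient.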

2606.15151 2026-06-16 cs.CV cs.LG 新提交

HiRo: A Compact Four-Directional Hierarchical Reservoir Token-Mixer for Efficient Image Classification

HiRo:一种用于高效图像分类的紧凑型四方向分层储层令牌混合器

Md Farhadul Islam, Ishan Thakkar, J. Todd Hastings

发表机构 * University of Kentucky(肯塔基大学)

AI总结 提出HiRo模型,通过四方向扫描和两级切片混合储层模块实现局部与跨窗口令牌混合,在MNIST、CIFAR-10/100上以不足1M参数达到高精度。

Comments Accepted at ICONS 2026

详情
AI中文摘要

最近的图像分类模型必须在局部特征建模、跨窗口交互和参数效率之间取得平衡。许多高性能架构依赖于完全可训练的令牌混合器,这改善了表示学习但增加了参数数量、优化复杂性和计算成本。我们提出了一种参数高效的图像分类模型HiRo,它将移位窗口分区与多方向分层储层计算相结合。图像被划分为非重叠块(视为令牌),线性投影、归一化,并添加二维正弦位置编码,然后在局部窗口内处理。在每个窗口内,令牌沿四个方向扫描,并通过两级切片混合储层模块。在第一阶段,方向序列被分割成连续的切片,每个切片由具有可训练闭环读出的固定储层处理。得到的切片输出使用开始、结束和均值表示进行汇总,然后由每个方向的第二阶段固定储层混合。混合后的切片表示被扩展回令牌级别并与第一阶段输出融合,之后四个方向的输出重新对齐并平均。连续块在常规窗口和移位窗口之间交替以实现跨窗口交互,随后是层归一化、残差前馈网络和用于分类的全局池化。该设计将常规和移位窗口分区与分层多方向储层相结合,构建了一个高效的局部到跨窗口令牌混合框架用于图像分类。尽管使用的可训练参数少于1M,且内存和时间显著低于基于Transformer的基线,HiRo在MNIST、CIFAR-10和CIFAR-100上分别达到了99.46%、85.57%和59.10%的准确率。

英文摘要

Recent image classification models must balance local feature modeling, cross-window interaction, and parameter efficiency. Many high-performing architectures rely on fully trainable token-mixers, which improve representation learning but increase parameter count, optimization complexity, and computational cost. We propose a parameter-efficient image classification model called HiRo that integrates shifted-window partitioning with multi-directional hierarchical reservoir computing. Images are divided into non-overlapping patches (treated as tokens), linearly projected, normalized, and enriched with 2D sinusoidal positional encodings, then processed within local windows. Inside each window, tokens are scanned in four directions and passed through a two-stage slice-and-mix reservoir module. In the first stage, directional sequences are split into contiguous slices, each processed by its own fixed reservoir with a trainable closed-loop readout. The resulting slice outputs are summarized using the start, end, and mean representations, and then mixed by a second-stage fixed reservoir for each direction. The mixed slice representations are expanded back to the token level and fused with the first-stage outputs, after which the four directional outputs are realigned and averaged. Consecutive blocks alternate between regular and shifted windows to enable cross-window interaction, followed by layer normalization, a residual feed-forward network, and global pooling for classification. This design combines regular and shifted window partitioning with hierarchical multi-directional reservoirs to form an efficient local-to-cross-window token-mixing framework for image classification. Despite using under 1M trainable parameters and significantly less memory and time than transformer-style baselines, HiRo achieves 99.46%, 85.57%, and 59.10% accuracy on MNIST, CIFAR-10, and CIFAR-100, respectively.
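The scan, mix, realign, and average pattern inside one window can be sketched as follows. The four directions are assumed here to be row-major, reversed row-major, column-major, and reversed column-major, since the abstract does not name them. The `running_mean` mixer is a trivial stand-in for the reservoir module, not the paper's component, and tokens are scalars for brevity.

```python
def scan_orders(height, width):
    """Token index orders for four scan directions over a height x width window."""
    row_major = [r * width + c for r in range(height) for c in range(width)]
    col_major = [r * width + c for c in range(width) for r in range(height)]
    return [row_major, row_major[::-1], col_major, col_major[::-1]]

def running_mean(sequence):
    """Placeholder sequence mixer: causal running mean along the scan."""
    out, total = [], 0.0
    for i, x in enumerate(sequence, start=1):
        total += x
        out.append(total / i)
    return out

def mix_four_directions(tokens, height, width, mixer):
    """Scan in four directions, mix each sequence, realign, and average."""
    fused = [0.0] * len(tokens)
    for order in scan_orders(height, width):
        mixed = mixer([tokens[i] for i in order])
        for position, token_index in enumerate(order):
            fused[token_index] += mixed[position] / 4.0  # back to the grid slot
    return fused

window = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # a 2 x 3 window, row-major
fused = mix_four_directions(window, 2, 3, running_mean)
```

Realigning by `token_index` before averaging is what lets outputs from differently ordered scans be combined per token.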

2606.15282 2026-06-16 cs.CV 新提交

Enhancing Precision Agriculture with a Hybrid Deep Learning Framework for Multi-Class Plant Disease Classification and Interpretability

利用混合深度学习框架增强精准农业:多类植物病害分类与可解释性

Hasibul Islam Sufi, Ridam Roy, Shayla Alam Setu, Mahimul Islam Nadim

发表机构 * Department of Computer Science and Engineering, Daffodil International University(计算机科学与工程系,达福尔国际大学)

AI总结 提出混合ResNet-ViT架构用于多类植物病害分类,在38类叶片图像上达到98.58%准确率,结合Grad-CAM等可解释性技术定位病害区域。

详情
AI中文摘要

本研究提出了一种整体深度学习架构,用于从高分辨率叶片图像中对植物病害进行多类分类,特别关注ResNet-50和混合ResNet + Vision Transformer (ViT)设计的行为。一个专门收集的图像数据库包含15,200张训练图像和3,800张验证图像,涵盖多种作物的38个类别,包括番茄、苹果、葡萄等,经过预处理步骤如调整大小、归一化和数据增强以增强模型鲁棒性。训练了多种架构,包括ResNet-50、MobileNetV2和EfficientNet-B0,并与混合ResNet + ViT模型进行比较。所有模型使用AdamW优化器和交叉熵损失进行微调,并应用早停以防止过拟合并确保泛化。此外,实现了可解释性技术如Grad-CAM和显著性图以指示病害相关区域,同时进行基于分割的分析以识别叶片的受影响部分。在所有考虑的架构中,ResNet-50达到了最高准确率98.74%,而混合ResNet + ViT模型达到了竞争性的98.58%,表明混合架构在捕捉局部和全局信息方面是有效的。实验结果展示了基于Transformer的模型在实现高精度、可解释且计算高效的基于计算机的多类多病害分类系统方面的潜力,为栽培管理实践和精准农业提供了有用的帮助。

英文摘要

This study proposes a holistic deep learning framework for multi-class classification of plant diseases from high-resolution leaf imagery, with a particular focus on the behavior of ResNet-50 and a hybrid ResNet + Vision Transformer (ViT) design. A purpose-built image database with 15,200 training images and 3,800 validation images spanning 38 classes across multiple crops, including tomato, apple, and grape, among others, was subjected to preprocessing steps such as resizing, normalization, and data augmentation to enhance model robustness. Multiple architectures, including ResNet-50, MobileNetV2, and EfficientNet-B0, were trained and compared with the hybrid ResNet + ViT model. All models were fine-tuned using the AdamW optimizer and cross-entropy loss, with early stopping applied to prevent overfitting and ensure generalization. Furthermore, interpretability techniques such as Grad-CAM and saliency maps were implemented to indicate disease-relevant regions, while segmentation-based analysis was performed to identify the affected parts of a leaf. Among the considered architectures, ResNet-50 achieved the highest accuracy of 98.74%, whereas the hybrid ResNet + ViT model achieved a competitive accuracy of 98.58%, showing that the hybrid architecture was effective in capturing both local and global information. The experimental results showcase the promise of transformer-based models for building highly accurate, interpretable, and computationally efficient multi-class, multi-disease classification systems, providing useful support for cultivation management and precision farming.

2606.15355 2026-06-16 cs.CV 新提交

Sustainable Face Recognition on Low-Power Devices with VQ-VAE Embeddings

基于VQ-VAE嵌入的低功耗设备可持续人脸识别

Christos Chronis, Georgios Th. Papadopoulos, Iraklis Varlamis

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出一种基于VQ-VAE的可持续边缘人脸识别框架,通过紧凑潜在表示和知识蒸馏,在低功耗设备上实现与先进模型相当的精度,同时降低内存和计算需求。

详情
AI中文摘要

人脸识别已成为现代AI应用的基石,但传统方法通常依赖部署在云环境中的计算密集型模型,导致网络流量增加、高能耗和大量碳足迹。本文介绍了一种基于向量量化变分自编码器(VQ-VAE)的可持续、可边缘部署的人脸识别框架,该框架生成紧凑且语义丰富的人脸图像潜在表示。通过利用VQ-VAE嵌入在边缘的压缩能力和重建质量,并结合知识蒸馏中预训练人脸嵌入的力量,我们的系统在显著降低边缘内存和计算需求的同时,达到了与最先进人脸嵌入模型相当的精度,使其适用于低功耗边缘设备。VQ-VAE压缩的集成最小化了网络开销,同时通过在潜在空间中仅保留最具信息量的面部特征来保持高匹配精度。因此,重建图像保留了关键身份特征,提高了人脸嵌入的鲁棒性和整体性能。

英文摘要

Face recognition has become a cornerstone of modern AI applications, yet conventional approaches often rely on computationally intensive models deployed in cloud environments, leading to increased network traffic, high energy consumption, and a heavy carbon footprint. This work introduces a sustainable, edge-deployable face recognition framework based on Vector-Quantized Variational Autoencoders (VQ-VAE), which generates compact and semantically rich latent representations of facial images. By leveraging the compression capacity and reconstruction quality of VQ-VAE embeddings on the edge and combining them with the power of pre-trained face embeddings in a knowledge distillation setup, our system achieves comparable accuracy to state-of-the-art face embedding models while significantly reducing memory and computation requirements on the edge, making it suitable for low-power edge devices. The integration of VQ-VAE compression minimizes network overhead while keeping the matching accuracy high by retaining only the most informative facial features in the latent space. As a result, the reconstructed images preserve the key identity characteristics, improving the robustness and overall performance of the face embeddings.
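The compression at the heart of a VQ-VAE is a nearest-neighbour lookup into a learned codebook: each latent vector is replaced by the index of its closest code, and only the indices need to be stored or transmitted. The sketch below shows that lookup on toy data. The codebook values and the function names are illustrative assumptions, not taken from the paper.

```python
def quantize(vectors, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    indices = []
    for v in vectors:
        distances = [sum((a - b) ** 2 for a, b in zip(v, code))
                     for code in codebook]
        indices.append(distances.index(min(distances)))
    return indices

def dequantize(indices, codebook):
    """Recover the quantized latents from their integer codes."""
    return [codebook[i] for i in indices]

codebook = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]  # toy 4-entry codebook
latents = [(0.1, -0.2), (0.9, 0.8), (0.2, 1.1)]
codes = quantize(latents, codebook)
restored = dequantize(codes, codebook)
```

With a 4-entry codebook each latent costs 2 bits instead of two floats, which illustrates how sending codes rather than images or dense embeddings can reduce network overhead.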

2606.15468 2026-06-16 cs.CV cs.LG 新提交

Analyzing Visual Aircraft Representations with Sparse Autoencoders

使用稀疏自编码器分析飞机视觉表示

Deepshik Sharma

发表机构 * Jain University(耆那大学)

AI总结 本文通过稀疏自编码器分解ConvNeXt模型在FGVC-Aircraft数据集上的中间表示,发现可解释的飞机结构特征,并通过消融实验验证其类别相关性。

Comments 18 pages, 4 figures, 7 tables

详情
AI中文摘要

视觉模型可以在分类任务上取得强性能,但支持其预测的内部表示通常难以解释。本文研究稀疏自编码器是否可以将视觉模型的中间表示分解为可解释的特征。我们在FGVC-Aircraft数据集上训练ConvNeXt分类器,从其最终特征阶段提取空间激活,并在这些激活上训练稀疏自编码器。使用最高激活图像块、激活强度和类别选择性分析学习到的稀疏特征。定性视觉检查显示,几个特征对应于可识别的飞机结构和视觉模式。我们使用输入空间和特征空间消融评估选定的特征子集,测量模糊图像块和抑制稀疏特征对类别logits、分类边界和预测置信度的影响。结果表明,稀疏自编码器可以揭示与飞机识别相关的部分可解释、类别相关的视觉特征,同时也暴露出多义性和粗糙空间定位等局限性。

英文摘要

Vision models can achieve strong performance on classification tasks, but the internal representations supporting their predictions are often difficult to interpret. This work investigates whether sparse autoencoders can decompose intermediate representations of a vision model into interpretable features. We train a ConvNeXt classifier on the FGVC-Aircraft dataset, extract spatial activations from its final feature stage, and train a sparse autoencoder on these activations. The learned sparse features are analyzed using top-activating image patches, activation strength, and class selectivity. Qualitative visual inspection reveals that several features correspond to recognizable aircraft structures and visual patterns. We evaluate a subset of selected features using input-space and feature-space ablations, measuring how blurring image patches and suppressing sparse features affect class logits, classification margins, and prediction confidence. The results suggest that sparse autoencoders can reveal partially interpretable, class-relevant visual features associated with aircraft recognition, while also exposing limitations such as polysemanticity and coarse spatial localization.

2606.15547 2026-06-16 cs.CV cs.AI 新提交

EcoBin: A Two-Stage Deep Convolutional Neural Network for Contamination-Aware Waste Classification

EcoBin: 一种用于污染感知废物分类的两阶段深度卷积神经网络

Raghav Senthil Kumar

发表机构 * BASIS Phoenix(BASIS凤凰学校)

AI总结 提出EcoBin两阶段深度CNN,通过合成污染数据集和污染检测模块,显著提升回收废物分类中污染物的识别准确率。

Comments 7 pages, 8 figures

详情
AI中文摘要

废物分类模型在分类废物方面已经变得非常准确,在基准数据集上通常超过95%。然而,这些模型未能考虑可回收废物中的污染。我们提出了EcoBin,一种两阶段深度卷积神经网络,它根据处理途径对家庭废物进行分类,并明确考虑污染。第一阶段是一个基于EfficientNetV2-S骨干网络的基础废物分类器,将数据集中的三十个废物类别分配到四个处理途径之一。第二阶段是一个污染分类器,检查任何被导向回收的物品,并在检测到污染时将其决策覆盖为垃圾。由于不存在公开的污染可回收物数据集,我们通过使用U2-Net模型分割干净可回收物体的图像,并在其表面合成逼真的污染纹理来合成一个数据集。第一阶段达到87.42%的测试准确率和96.13%的途径调整准确率。同时,污染阶段以0.99的ROC-AUC区分干净和污染物品。在污染可回收物的测试集上,完整流水线正确路由了25个物品中的24个,而单独的基础分类器仅正确路由了25个中的1个。McNemar检验证实污染阶段带来的改进具有统计学显著性(p < 0.001)。

英文摘要

Waste classification models have become highly accurate at sorting waste, often exceeding 95% on benchmark datasets. However, these models fail to account for contamination in recyclable waste. We present EcoBin, a two-stage deep convolutional neural network that classifies household waste by its disposal pathway and that explicitly accounts for contamination. The first stage is a base waste classifier built on an EfficientNetV2-S backbone that assigns each of the thirty waste categories in our dataset to one of four disposal pathways. The second stage is a contamination classifier that inspects any item routed toward recycling and overrides the decision to garbage when contamination is detected. Because no public dataset of contaminated recyclables exists, we synthesize one by segmenting images of clean recyclable objects with a U2-Net model and compositing realistic contamination textures onto their surfaces. The first stage achieves 87.42% test accuracy and a 96.13% pathway-adjusted accuracy. Meanwhile, the contamination stage distinguishes clean from contaminated items with a 0.99 ROC-AUC. On a test set of contaminated recyclables, the complete pipeline routes 24 of 25 items correctly, compared with only 1 of 25 for the base classifier alone. A McNemar's test confirms that the improvement contributed by the contamination stage is statistically significant (p < 0.001).
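The two-stage decision rule can be sketched in a few lines: the contamination classifier is consulted only for items the base classifier routes to recycling, and it can override that decision to garbage. The pathway names other than "recycling" and "garbage", the 0.5 threshold, and the function name `route_item` are assumptions for illustration.

```python
def route_item(pathway_probs, contamination_prob, threshold=0.5):
    """Return the final disposal pathway for one item.

    pathway_probs      : dict of pathway name -> stage-one probability
    contamination_prob : stage-two probability that the item is contaminated
    """
    pathway = max(pathway_probs, key=pathway_probs.get)
    # Stage two only inspects items headed for recycling.
    if pathway == "recycling" and contamination_prob >= threshold:
        return "garbage"
    return pathway

# Hypothetical stage-one output for a greasy pizza box.
stage_one = {"recycling": 0.7, "garbage": 0.1, "compost": 0.1, "hazardous": 0.1}
dirty = route_item(stage_one, contamination_prob=0.9)
clean = route_item(stage_one, contamination_prob=0.1)
```

Restricting the second stage to the recycling branch means its errors cannot affect items already routed elsewhere.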

2606.15574 2026-06-16 cs.CV 新提交

Toward the Whole Picture: Accumulative Fingerprint Mapping and Reconstruction for Small-Area Mobile Sensors

迈向全貌:小面积移动传感器的累积指纹映射与重建

Xiongjun Guan, Jianjiang Feng, Jie Zhou

发表机构 * Tsinghua University(清华大学)

AI总结 针对小面积移动指纹传感中采集与识别不匹配的问题,提出累积映射与重建框架,将局部观测序列转化为统一指纹状态,实现单次匹配,提升效率与鲁棒性。

详情
AI中文摘要

移动设备上的小面积指纹传感在采集与识别之间造成了根本性的不匹配:每次触摸仅捕获一个微小且姿态变化的局部补丁,而可靠的生物特征匹配最终需要一个稳定且足够完整的指纹表示。现有流程主要通过将重复触摸视为独立的局部模板来应对这种不匹配,这导致重复注册、重复匹配,且无法保证足够的全局覆盖。在本文中,我们主张一种不同的问题表述,即面向小面积移动传感的累积指纹映射与重建。该视角并非分别匹配每个局部补丁,而是将一系列局部观测转换为一个统一的指纹状态,该状态随着新触摸的到来而逐步细化,并可在整合后仅匹配一次。作为一个具体基线,我们提出了一种经典流程,执行补丁级结构特征提取、特征级配准与融合、指纹图构建以及基于相位的脊线重建。更重要的是,我们将此基线定位在一个更广泛的移动指纹框架内,该框架集成了结构化令牌学习、两阶段姿态推理和基于扩散的生成式重建。这一观点将移动指纹识别从多次捕获多次匹配处理重新构建为累积地图构建、状态细化和一次性匹配,为小面积移动平台提供了一条通向高效、姿态鲁棒且易于部署的生物特征识别的原则性路径。基线实现已在 https://github.com/XiongjunGuan/FpReconstruction 公开发布。

英文摘要

Small-area fingerprint sensing on mobile devices creates a fundamental mismatch between acquisition and recognition: each touch captures only a tiny, pose-varying local patch, while reliable biometric matching ultimately requires a stable and sufficiently complete fingerprint representation. Existing pipelines largely cope with this mismatch by treating repeated touches as independent partial templates, which leads to repeated registration, repeated matching, and no guarantee of adequate global coverage. In this paper, we advocate a different formulation, namely accumulative fingerprint mapping and reconstruction for small-area mobile sensing. Rather than matching every partial patch separately, the proposed perspective converts a sequence of local observations into a unified fingerprint state that is progressively refined as new touches arrive and can be matched only once after consolidation. As a concrete baseline, we present a classical pipeline that performs patch-wise structural feature extraction, feature-level registration and fusion, fingerprint map construction, and phase-based ridge reconstruction. More importantly, we position this baseline within a broader mobile fingerprint framework that integrates structured token learning, two-stage pose reasoning, and diffusion-based generative reconstruction. This viewpoint reframes mobile fingerprint recognition from multi-capture multi-match processing to accumulative map building, state refinement, and one-shot matching, offering a principled route toward efficient, pose-robust, and deployment-friendly biometrics for small-area mobile platforms. The baseline implementation has been publicly released at https://github.com/XiongjunGuan/FpReconstruction.
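The idea of a unified fingerprint state that is refined touch by touch can be illustrated with a toy accumulator. This is only a sketch under strong assumptions: each patch arrives already registered (its `top`/`left` placement is known and there is no rotation), fusion is a plain per-cell average, and the class name `FingerprintMap` is hypothetical. The released baseline performs feature-level registration and phase-based ridge reconstruction, which this sketch does not attempt.

```python
from typing import List, Optional


class FingerprintMap:
    """Running-average canvas that accumulates registered patches."""

    def __init__(self, height: int, width: int) -> None:
        self._sum = [[0.0] * width for _ in range(height)]
        self._count = [[0] * width for _ in range(height)]

    def add_patch(self, patch: List[List[float]], top: int, left: int) -> None:
        """Fuse one patch whose placement (top, left) is already known."""
        for r, row in enumerate(patch):
            for c, value in enumerate(row):
                self._sum[top + r][left + c] += value
                self._count[top + r][left + c] += 1

    def coverage(self) -> float:
        """Fraction of map cells observed at least once."""
        cells = [n for row in self._count for n in row]
        return sum(1 for n in cells if n > 0) / len(cells)

    def render(self) -> List[List[Optional[float]]]:
        """Per-cell mean of all observations; None where nothing was seen."""
        return [
            [s / n if n else None for s, n in zip(srow, nrow)]
            for srow, nrow in zip(self._sum, self._count)
        ]
```

Matching would then run once against `render()` after enough touches have raised `coverage()`, instead of once per partial patch.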

2606.15763 2026-06-16 cs.CV 新提交

The Circumplex Degeneracy Behind the Rare-Class Limit in Affect Recognition

情感识别中稀有类别极限背后的圆周退化

Van Thong Huynh, Hong Hai Nguyen, Soo-Hyung Kim

发表机构 * Faculty of CSE, Ho Chi Minh City University of Technology (HCMUT), VNUHCM(胡志明市理工大学计算机科学与工程学院, 越南国家大学胡志明市分校) Dept. of AI, FPT University(FPT大学人工智能系) Dept. of AI Convergence, Chonnam National University(全南大学人工智能融合系)

AI总结 通过多任务研究揭示稀有表情识别失败源于Russell圆周上的退化性,而非类别不平衡,并提出圆周代价最优传输项,但增益非几何性,稀有类别错误结构受视觉混淆影响。

详情
AI中文摘要

野外表情识别在少数稀有情感上持续失败,标准解释是类别不平衡。通过在两个基准上的受控多任务研究,我们表明失败反而是情感几何的一个属性:稀有类别在Russell圆周上是退化的,这种退化限制了任何损失或代价所能达到的效果。我们的工具是一个圆周代价最优传输项,通过效价-唤醒距离对表情混淆进行定价。该项提高了官方得分和表情宏F1,但大多数研究省略的对照显示,增益并非几何性的:一个均匀代价(相当于通用置信度惩罚)在Aff-Wild2上与它匹配(p=0.625),并在AffectNet上显著超过它(比基线高+0.057,大于圆周项)。几何重塑的是错误的结构,使它们在Aff-Wild2上情感上更接近真相(与均匀对照相比p=0.031),但这种效果在AffectNet上不成立,因为圆周远角的一个视觉混淆压倒了它。相比之下,稀有类别失败在我们检查的两个数据集上都是稳定的:退化对(Aff-Wild2上的愤怒-恐惧,AffectNet上的愤怒-蔑视)抵抗基于频率的干预、传输项以及专门为分离它们而构建的动作单元增强代价。我们得出结论,稀有表情的进展需要区分这些类别的表示,而不是重新定价其混淆的监督,我们提供了区分两者的对照和指标。

英文摘要

In-the-wild expression recognition persistently fails on a few rare emotions, and the standard explanation is class imbalance. Through a controlled multi-task study on two benchmarks, we show the failure is instead a property of affect geometry: the rare classes are degenerate on Russell's circumplex, and that degeneracy bounds what any loss or cost can achieve. Our instrument is a circumplex-cost optimal-transport term that prices expression confusions by their valence-arousal distance. The term improves the official score and expression macro-F1, but a control most studies omit shows the gain is not geometric: a uniform cost, equivalent to a generic confidence penalty, matches it on Aff-Wild2 (p=0.625) and significantly exceeds it on AffectNet (+0.057 over base, larger than the circumplex). What the geometry reshapes is the structure of the errors, making them affectively nearer the truth on Aff-Wild2 (p=0.031 against the uniform control), an effect that does not survive on AffectNet, where a visual confound at the far corner of the circumplex overwhelms it. The rare-class failure, by contrast, is stable across both datasets we examine: the degenerate pairs (anger-fear on Aff-Wild2, anger-contempt on AffectNet) resist frequency-based interventions, the transport term, and an action-unit-augmented cost built specifically to separate them. We conclude that progress on rare expressions requires representations that distinguish the classes, not supervision that reprices their confusions, and we provide the controls and metrics needed to tell the two apart.
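A small sketch may help make the circumplex cost and its uniform control concrete. When the target is one-hot, the only feasible transport plan moves all predicted mass to the target class, so the transport cost reduces to a probability-weighted sum of confusion costs. The valence-arousal coordinates below are illustrative placeholders and not values from the paper, and the function names are assumptions.

```python
import math

# Illustrative (valence, arousal) coordinates; NOT values from the paper.
VA = {
    "happy": (0.8, 0.5),
    "sad": (-0.7, -0.4),
    "anger": (-0.6, 0.7),
    "fear": (-0.6, 0.8),
}
CLASSES = list(VA)


def circumplex_cost(a: str, b: str) -> float:
    """Euclidean distance between two classes in valence-arousal space."""
    return math.dist(VA[a], VA[b])


def uniform_cost(a: str, b: str) -> float:
    """Control: every confusion costs the same."""
    return 0.0 if a == b else 1.0


def transport_penalty(probs, target, cost) -> float:
    """Transport cost from a predicted distribution to a one-hot target.

    With a one-hot target all mass must move to the target class, so the
    cost is sum_j p_j * cost(class_j, target).
    """
    return sum(p * cost(c, target) for c, p in zip(CLASSES, probs))
```

With the uniform cost the penalty equals one minus the probability assigned to the target, which is the generic confidence penalty the abstract mentions. With the circumplex cost, two classes that sit almost on top of each other (anger and fear in this toy layout) are barely penalised when confused, which is one way to picture why repricing confusions cannot separate a degenerate pair.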

2606.16161 2026-06-16 cs.CV 新提交

Multimodal LLM-Empowered Re-Ranking for Generalizable Person Re-Identification

多模态大语言模型赋能的通用行人重识别重排序

Jiachen Li, Xiaojin Gong

发表机构 * College of Information Science and Electronic Engineering, Zhejiang University(浙江大学信息与电子工程学院)

AI总结 提出利用多模态大语言模型(MLLM)的泛化能力,通过微调MLLM并计算μ-距离来改进推理阶段的重排序,从而提升领域泛化行人重识别的性能。

详情
AI中文摘要

领域泛化(DG)行人重识别(Re-ID)因其在未见真实场景中部署的潜力而吸引了越来越多的研究兴趣。现有大多数方法通过训练领域泛化编码器来处理DG Re-ID,但忽略了推理阶段可能的改进。相比之下,本文探索了一种替代方向,即改进推理重排序以增强DG Re-ID。传统的重排序方法通常依赖于基于邻域的距离来优化初始排序列表,这本质上依赖于Re-ID编码器生成的特征。然而,由于编码器缺乏足够的泛化能力来在未见场景中产生可靠的特征距离,这些方法在目标域上性能下降。受近期多模态大语言模型(MLLM)卓越泛化能力的启发,我们提出了一种MLLM赋能的距离度量来改进DG Re-ID中的重排序。具体来说,我们首先通过监督微调将MLLM适应于Re-ID数据,其中包含一个领域无关的提示和一种查询-候选难例挖掘方案。然后,使用适应后的MLLM在推理过程中计算μ-距离,该距离对领域差距具有鲁棒性,并显著提升后续重排序性能。我们的方法是模型无关的,可以无缝集成到之前的重排序框架中。大量实验表明,我们的方法在多个DG Re-ID基准上持续带来显著的性能提升。本工作的代码将很快在https://github.com/RikoLi/MUSE发布。

英文摘要

Domain Generalizable (DG) person re-identification (Re-ID) has attracted growing research interest due to its potential for deployment in unseen real-world scenarios. Most existing approaches address DG Re-ID by focusing on training domain-generalizable encoders but ignore possible refinements at the inference stage. In contrast, this work explores an alternative direction that improves inference re-ranking to enhance DG Re-ID. Conventional re-ranking methods typically rely on neighborhood-based distances to refine the initial ranking list, inherently depending on features produced by the Re-ID encoder. However, they deteriorate on target domains since the encoder lacks sufficient generalizability to produce reliable feature distances in unseen scenarios. Inspired by the remarkable generalization capabilities of recent Multimodal Large Language Models (MLLMs), we propose an MLLM-empowered distance metric to improve re-ranking in DG Re-ID. Specifically, we first adapt an MLLM to Re-ID data through supervised fine-tuning, which incorporates a domain-agnostic prompt and a query-candidate hard mining scheme. Then, the adapted MLLM is employed to compute a μ-distance during inference, which is robust to domain gaps and significantly enhances subsequent re-ranking performance. Our approach is model-agnostic and can be seamlessly integrated into previous re-ranking frameworks. Extensive experiments demonstrate that our approach consistently yields substantial performance improvements across multiple DG Re-ID benchmarks. The code of this work will be released at https://github.com/RikoLi/MUSE soon.
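The abstract does not specify how the μ-distance enters the re-ranking step, so the following is only a generic sketch of re-ranking a shortlist with a second, more expensive distance. The function name `rerank_top_k`, the shortlist size `k`, and the simple "re-sort the head of the list" rule are all assumptions; the paper integrates its distance into existing re-ranking frameworks rather than this simplified rule.

```python
from typing import Callable, List, Sequence


def rerank_top_k(
    query: str,
    gallery: Sequence[str],
    encoder_distance: Callable[[str, str], float],
    pairwise_distance: Callable[[str, str], float],
    k: int = 10,
) -> List[str]:
    """Rank by the cheap encoder distance, then re-sort only the top k."""
    initial = sorted(gallery, key=lambda g: encoder_distance(query, g))
    head, tail = initial[:k], initial[k:]
    # The costly pairwise distance (e.g. one model call per query-candidate
    # pair) is evaluated on the shortlist only.
    head = sorted(head, key=lambda g: pairwise_distance(query, g))
    return head + tail
```

Restricting the second distance to the shortlist keeps the number of expensive pairwise evaluations at `k` per query, independent of gallery size.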

2401.15296 2026-06-16 cs.CV cs.AI 版本更新

A Survey on 3D Skeleton Based Person Re-Identification: Taxonomy, Advances, Challenges, and Interdisciplinary Prospects

基于3D骨架的行人重识别综述:分类、进展、挑战与跨学科前景

Haocong Rao, Chunyan Miao

发表机构 * College of Computing and Data Science, Nanyang Technological University (NTU), Singapore(南洋理工大学计算与数据科学学院,新加坡) Joint NTU-UBC Research Centre of Excellence in Active Living for the Elderly (LILY), NTU, Singapore(老龄化积极生活卓越研究中心(LILY),南洋理工大学,新加坡) Alibaba-NTU Global e-Sustainability CorpLab (ANGEL), NTU, Singapore(阿里巴巴-南洋理工大学全球可持续发展企业实验室(ANGEL),南洋理工大学,新加坡)

AI总结 本文系统综述了基于3D骨架的行人重识别方法,提出了手工、序列和图建模三类分类法,并评估了监督、自监督和无监督学习范式下的最新技术,最后讨论了关键挑战与跨学科应用前景。

Comments Accepted by IJCAI 2026. A curated collection of valuable resources is available at https://github.com/Kali-Hac/3D-SRID-Survey

详情
AI中文摘要

基于3D骨架的行人重识别是一个重要的新兴研究领域,在模式识别领域引起了越来越多的关注。凭借在各种应用场景中的独特优势,近年来提出了许多基于3D骨架的行人重识别(SRID)方法,这些方法采用了不同的骨架建模和学习范式。在本文中,我们提供了对近期SRID进展的全面回顾和分析。首先,我们定义了SRID任务,并概述了其起源和主要进展。其次,我们制定了一个系统性的分类法,将现有方法分为三类:手工建模、序列建模和图建模。然后,我们详细阐述了这三类中的代表性模型,并说明了其基础机制。同时,我们概述了主流的监督、自监督和无监督SRID学习范式及相应的常用方法。进一步地,我们在各种类型的基准和协议上对最先进的SRID方法进行了全面评估,以比较其有效性、效率和关键特性。最后,我们提出了推动未来研究的关键挑战和前景,并通过案例研究强调了SRID的跨学科应用。

英文摘要

Person re-identification via 3D skeletons is an important emerging research area that attracts increasing attention within the pattern recognition community. With distinctive advantages across various application scenarios, numerous 3D skeleton based person re-identification (SRID) methods with diverse skeleton modeling and learning paradigms have been proposed in recent years. In this paper, we provide a comprehensive review and analysis of recent SRID advances. First of all, we define the SRID task and provide an overview of its origin and major advancements. Secondly, we formulate a systematic taxonomy that organizes existing methods into three categories centered on hand-crafted, sequence-based, and graph-based modeling. Then, we elaborate on the representative models along these three types with an illustration of foundational mechanisms. Meanwhile, we provide an overview of mainstream supervised, self-supervised, and unsupervised SRID learning paradigms and corresponding common methods. A thorough evaluation of state-of-the-art SRID methods is further conducted over various types of benchmarks and protocols to compare their effectiveness, efficiency, and key properties. Finally, we present the key challenges and prospects to advance future research, and highlight interdisciplinary applications of SRID with a case study.

2510.05888 2026-06-16 cs.CV 版本更新

BioAutoML-NAS: An End-to-End AutoML Framework for Multimodal Insect Classification via Neural Architecture Search on Large-Scale Biodiversity Data

BioAutoML-NAS:基于大规模生物多样性数据通过神经架构搜索进行多模态昆虫分类的端到端AutoML框架

Arefin Ittesafun Abian, Debopom Sutradhar, Md Rafi Ur Rashid, Reem E. Mohamed, Md Rafiqul Islam, Asif Karim, Kheng Cher Yeo, Sami Azam

发表机构 * Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh(联合国际大学计算机科学与工程系,达卡,孟加拉国) Applied Artificial Intelligence and Intelligent Systems (AAIINS) Laboratory, 1217, Dhaka, Bangladesh(应用人工智能与智能系统实验室(AAIINS),1217,达卡,孟加拉国) Department of Computer Science and Engineering, Penn State University, University Park, PA, USA(宾夕法尼亚州立大学计算机科学与工程系,大学公园,宾夕法尼亚州,美国) Faculty of Science and Information Technology, Charles Darwin University, Sydney, NSW, Australia(查尔斯达尔文大学科学与信息技术学院,悉尼,新南威尔士州,澳大利亚) Faculty of Science and Technology, Charles Darwin University, Casuarina, 0909, NT, Australia(查尔斯达尔文大学科学与技术学院,Casuarina,0909,北领地,澳大利亚)

AI总结 提出首个多模态BioAutoML模型BioAutoML-NAS,利用神经架构搜索自动学习图像操作,结合元数据融合与交替双层优化,在BIOSCAN-5M数据集上以96.81%准确率超越现有方法。

Comments Accepted in IEEE Transactions on Big Data

详情
AI中文摘要

昆虫分类对于农业管理和生态研究至关重要,因为它直接影响作物健康和生产。然而,由于昆虫的复杂特征、类别不平衡和大规模数据集,这项任务仍然具有挑战性。为了解决这些问题,我们提出了BioAutoML-NAS,这是第一个使用多模态数据(包括图像和元数据)的BioAutoML模型,它对图像应用神经架构搜索(NAS)来自动学习每个单元内每个连接的最佳操作。多个单元堆叠形成完整网络,每个单元提取详细的图像特征表示。多模态融合模块将图像嵌入与元数据结合,使模型能够同时使用视觉和分类生物学信息对昆虫进行分类。交替双层优化训练策略联合更新网络权重和架构参数,而零操作移除不太重要的连接,产生稀疏、高效且高性能的架构。在BIOSCAN-5M数据集上的广泛评估表明,BioAutoML-NAS达到了96.81%的准确率、97.46%的精确率、96.81%的召回率和97.05%的F1分数,分别比最先进的迁移学习、Transformer、AutoML和NAS方法高出约16%、10%和8%。在Insects-1M数据集上的进一步验证获得了93.25%的准确率、93.71%的精确率、92.74%的召回率和93.22%的F1分数。这些结果表明,BioAutoML-NAS提供了准确、可信的昆虫分类,支持现代可持续农业。

英文摘要

Insect classification is important for agricultural management and ecological research, as it directly affects crop health and production. However, this task remains challenging due to the complex characteristics of insects, class imbalance, and large-scale datasets. To address these issues, we propose BioAutoML-NAS, the first BioAutoML model using multimodal data (images and metadata), which applies neural architecture search (NAS) to images to automatically learn the best operations for each connection within each cell. Multiple cells are stacked to form the full network, each extracting detailed image feature representations. A multimodal fusion module combines image embeddings with metadata, allowing the model to use both visual and categorical biological information to classify insects. An alternating bi-level optimization training strategy jointly updates network weights and architecture parameters, while zero operations remove less important connections, producing sparse, efficient, and high-performing architectures. Extensive evaluation on the BIOSCAN-5M dataset demonstrates that BioAutoML-NAS achieves 96.81% accuracy, 97.46% precision, 96.81% recall, and a 97.05% F1 score, outperforming state-of-the-art transfer learning, transformer, AutoML, and NAS methods by approximately 16%, 10%, and 8% respectively. Further validation on the Insects-1M dataset obtains 93.25% accuracy, 93.71% precision, 92.74% recall, and a 93.22% F1 score. These results demonstrate that BioAutoML-NAS provides accurate, confident insect classification that supports modern sustainable farming.
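The phrase "learn the best operation for each connection" together with "zero operations remove less important connections" resembles the continuous relaxation used in differentiable architecture search, where each connection outputs a softmax-weighted mix of candidate operations. Whether BioAutoML-NAS uses exactly this relaxation is an assumption; the toy scalar operations and the names `CANDIDATE_OPS`, `mixed_op`, and `discretize` below are purely illustrative.

```python
import math
from typing import Callable, Dict

# Toy scalar "operations"; real cells would use convolutions, pooling, etc.
CANDIDATE_OPS: Dict[str, Callable[[float], float]] = {
    "identity": lambda x: x,
    "double": lambda x: 2.0 * x,
    "zero": lambda x: 0.0,  # selecting this removes the connection
}


def softmax(alphas: Dict[str, float]) -> Dict[str, float]:
    """Numerically stable softmax over the architecture parameters."""
    peak = max(alphas.values())
    exps = {name: math.exp(a - peak) for name, a in alphas.items()}
    total = sum(exps.values())
    return {name: e / total for name, e in exps.items()}


def mixed_op(x: float, alphas: Dict[str, float]) -> float:
    """Output of one connection during search: weighted sum of all ops."""
    weights = softmax(alphas)
    return sum(w * CANDIDATE_OPS[name](x) for name, w in weights.items())


def discretize(alphas: Dict[str, float]) -> str:
    """After search, keep only the operation with the largest parameter."""
    return max(alphas, key=alphas.get)
```

In a bi-level scheme of this kind, network weights and the `alphas` are updated in alternation; a connection whose largest parameter belongs to the zero operation is dropped, which is how sparsity would arise.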

2510.08976 2026-06-16 cs.CV cs.DC cs.IR 版本更新

MIRAGE: Runtime Scheduling for Multi-Vector Image Retrieval with Hierarchical Decomposition

MIRAGE:基于层次分解的多向量图像检索运行时调度

Maoliang Li, Ke Li, Yaoyang Liu, Jiayu Chen, Zihao Zheng, Yinjun Wu, Chenchen Liu, Xiang Chen

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) School of Electronics Engineering and Computer Science, Peking University(北京大学电子工程与计算机科学学院) School of Information, Renmin University of China(中国人民大学信息学院) School of Integrated Circuit Science and Engineering, Beihang University(北京航空航天大学集成电路科学与工程学院)

AI总结 提出MIRAGE框架,通过层次化分解和跨层次相似性一致性减少冗余计算,实现多向量图像检索的精度提升和3.5倍计算加速。

Comments Will appear in DAC'2026, camera ready

详情
AI中文摘要

为了有效利用用户特定数据,多模态大语言模型(MLLM)应用中采用了检索增强生成(RAG)。然而,传统检索方法通常存在检索精度有限的问题。最近多向量检索(MVR)的进展通过分解查询并与分割后的图像匹配来提高精度,但仍存在次优的精度和效率,忽略了查询与不同图像对象之间的对齐以及冗余的细粒度图像片段。在这项工作中,我们提出了一种高效的图像检索调度框架——MIRAGE。首先,我们引入了一种新颖的层次化范式,为不同的图像对象采用多个中间粒度以增强对齐。其次,我们通过利用跨层次相似性一致性和层次稀疏性来最小化检索中的冗余,从而减少不必要的匹配计算。此外,我们自动为每个数据集配置参数,以适应不同场景的实用性。我们的实证研究表明,MIRAGE不仅实现了显著的精度提升,而且与现有MVR系统相比,计算量减少了高达3.5倍。

英文摘要

To effectively leverage user-specific data, retrieval augmented generation (RAG) is employed in multimodal large language model (MLLM) applications. However, conventional retrieval approaches often suffer from limited retrieval accuracy. Recent advances in multi-vector retrieval (MVR) improve accuracy by decomposing queries and matching them against segmented images, but they still suffer from sub-optimal accuracy and efficiency because they overlook both the alignment between the query and varying image objects and the redundancy of fine-grained image segments. In this work, we present MIRAGE, an efficient scheduling framework for image retrieval. First, we introduce a novel hierarchical paradigm, employing multiple intermediate granularities for varying image objects to enhance alignment. Second, we reduce redundancy in retrieval by leveraging cross-hierarchy similarity consistency and hierarchy sparsity to avoid unnecessary matching computation. Furthermore, we configure parameters for each dataset automatically for practicality across diverse scenarios. Our empirical study shows that MIRAGE not only achieves substantial accuracy improvements but also reduces computation by up to 3.5 times over the existing MVR system.
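One way to picture how cross-hierarchy similarity consistency could save computation is a coarse-to-fine search that skips the finer segments of any region whose coarse similarity is already low. The pruning rule, the `Region` type, the dot-product similarity, and the threshold below are assumptions made for illustration; the abstract does not describe MIRAGE's actual scheduler.

```python
from dataclasses import dataclass, field
from typing import List, Sequence, Tuple


@dataclass
class Region:
    """One image region with an embedding and optional finer sub-regions."""
    vec: Sequence[float]
    children: List["Region"] = field(default_factory=list)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def best_match(query: Sequence[float], region: Region, prune_below: float) -> Tuple[float, int]:
    """Return (best similarity found, number of similarity evaluations)."""
    score = dot(query, region.vec)
    evaluations = 1
    if score < prune_below:
        return score, evaluations  # coarse score is low: skip finer segments
    best = score
    for child in region.children:
        child_best, child_evals = best_match(query, child, prune_below)
        best = max(best, child_best)
        evaluations += child_evals
    return best, evaluations
```

The saving comes entirely from the early return: a region that fails the coarse test contributes one evaluation instead of one per descendant.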

2605.07099 2026-06-16 cs.CV 版本更新

InfoGeo: Information-Theoretic Object-Centric Learning for Cross-View Generalizable UAV Geo-Localization

InfoGeo: 面向跨视角泛化无人机地理定位的信息论目标中心学习

Hongyang Zhang, Maonan Wang, Ziyao Wang, Hongrui Yin, Man-On Pun

发表机构 * The University of Hong Kong(香港大学)

AI总结 提出InfoGeo框架,利用信息瓶颈理论通过目标中心结构对齐和跨视图知识约束,增强无人机跨视角地理定位在域偏移下的鲁棒性和泛化能力。

详情
AI中文摘要

跨视角地理定位(CVGL)是GPS拒止环境中精确定位和导航的基础,旨在将地面或无人机图像与卫星视图匹配。现有方法通常依赖全局特征对齐,但受区域纹理和天气条件变化引起的显著域偏移影响。在无人机场景中,由于更广的视角不可避免地引入密集的细粒度目标,造成严重视觉杂乱,这一问题更为突出。为此,我们从目标中心学习(OCL)中汲取灵感,提出InfoGeo,一个旨在增强鲁棒性和泛化能力的信息论框架。InfoGeo将优化重新表述为信息瓶颈过程,包含两个核心目标:(i)通过跨视图对齐目标中心结构关系,最大化视图不变信息;(ii)通过跨视图知识约束,最小化视图特定噪声信号。在多种基准和挑战场景上的广泛评估表明,InfoGeo显著优于现有最先进方法。

英文摘要

Cross-view geo-localization (CVGL) is fundamental for precise localization and navigation in GPS-denied environments, aiming to match ground or UAV imagery with satellite views. Existing approaches often rely on global feature alignment, but they suffer from substantial domain shifts induced by varying regional textures and weather conditions. This issue becomes even more pronounced in UAV-based scenarios, where the broader perspective inevitably introduces dense, fine-grained objects, creating significant visual clutter. To address this, we draw inspiration from Object-Centric Learning (OCL) and propose InfoGeo, an information-theoretic framework designed to enhance robustness and generalization. InfoGeo reformulates the optimization as an information bottleneck process with two core objectives: (i) maximizing view-invariant information by aligning the object-centric structural relations across views, and (ii) minimizing view-specific noisy signals through cross-view knowledge constraints. Extensive evaluations across diverse benchmarks and challenging scenarios demonstrate that InfoGeo significantly outperforms state-of-the-art methods.
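The abstract does not state the loss, so for orientation here is the classical information bottleneck objective for an input X, a learned representation Z, and a target Y. Reading the first term as "keep view-invariant information" and the second as "discard view-specific noise" is an interpretation offered here, not the paper's formulation, and the trade-off weight β is a generic symbol.

```latex
\max_{p(z \mid x)} \; I(Z; Y) \;-\; \beta \, I(Z; X), \qquad \beta > 0
```

A larger β compresses the representation more aggressively, at the risk of discarding information that is useful for matching.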

2606.03654 2026-06-16 cs.CV cs.NA math.NA 版本更新

Graph Regularized Non-negative Reduced Biquaternion Matrix Factorization for Color Image Recognition

图正则化非负简化双四元数矩阵分解用于彩色图像识别

Hailang Wu, Yonghe Liu, Bingxuan Yu, Chaoqian Li

发表机构 * School of Mathematics and Statistics, Yunnan University(云南大学数学与统计学学院)

AI总结 针对非负简化双四元数矩阵分解忽略局部几何结构的问题,提出图正则化模型,通过引入图拉普拉斯正则化项保持局部结构,并设计分量交替投影梯度算法,在彩色图像识别中取得竞争性结果。

详情
AI中文摘要

非负简化双四元数矩阵分解(NRBMF)利用简化双四元数(RB)矩阵的乘积,将彩色图像像素的非负约束纳入分解过程。然而,NRBMF主要关注重构精度,未显式利用图像数据的局部几何结构,这可能限制所得低维系数表示的判别能力。为解决此问题,我们提出了一种图正则化非负简化双四元数矩阵分解(GNRBMF)模型用于彩色图像识别。该模型将图拉普拉斯正则化项引入简化双四元数系数矩阵,鼓励原始空间中的邻近样本具有相似的系数表示。同时,GNRBMF在简化双四元数代数中保留了NRBMF的非负性。为求解优化问题,推导了一种分量交替投影梯度算法,并分析了其收敛性。在三个彩色图像数据集上的实验结果表明,与若干方法相比,所提出的GNRBMF模型在大多数测试设置下取得了具有竞争力或更优的识别性能。

英文摘要

Non-negative reduced biquaternion matrix factorization (NRBMF) uses the product of reduced biquaternion (RB) matrices to incorporate the non-negativity constraints of color image pixels into the factorization process. However, NRBMF mainly focuses on reconstruction accuracy and does not explicitly exploit the local geometric structure of image data, which may limit the discriminative ability of the obtained low-dimensional coefficient representations. To address this issue, we propose a graph regularized non-negative reduced biquaternion matrix factorization (GNRBMF) model for color image recognition. The proposed model incorporates a graph Laplacian regularizer into the reduced biquaternion coefficient matrix, encouraging nearby samples in the original space to have similar coefficient representations. Meanwhile, GNRBMF retains the non-negativity property of NRBMF in the reduced biquaternion algebra. To solve the optimization problem, a component-wise alternating projected gradient algorithm is derived, and its convergence properties are analyzed. Experimental results on three color image datasets show that the proposed GNRBMF model achieves competitive or superior recognition performance compared with several methods in most tested settings.
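For readers unfamiliar with graph regularization, the real-valued graph-regularized NMF objective is a useful reference point. The block below is that standard real-valued form, given as an analogy only: the paper works in the reduced biquaternion algebra and the abstract does not state its objective. Here X is the data matrix, W the basis, H the coefficient matrix with columns h_i, S a symmetric sample-affinity matrix, and λ a regularization weight; all symbols are chosen for this illustration.

```latex
\begin{aligned}
&\min_{W \ge 0,\; H \ge 0} \; \lVert X - W H \rVert_F^2
  + \lambda \, \operatorname{Tr}\!\left(H L H^{\top}\right),
  \qquad L = D - S, \quad D_{ii} = \sum_j S_{ij}, \\
&\operatorname{Tr}\!\left(H L H^{\top}\right)
  = \tfrac{1}{2} \sum_{i,j} S_{ij} \, \lVert h_i - h_j \rVert_2^2 .
\end{aligned}
```

The second line shows why the regularizer preserves local structure: samples with a large affinity S_ij are penalised for having distant coefficient vectors.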

4. 目标检测、分割与定位 26 篇

2606.14716 2026-06-16 cs.CV cs.AI cs.RO 新提交

RAMS: Resource-Adaptive and Detection-Conditioned Model Switching for Embedded Edge Perception

RAMS: 面向嵌入式边缘感知的资源自适应与检测条件模型切换

Kushal Khemani, Evan Leri, George Xu, Amit Hod

发表机构 * NEXEDGE Research Lab(NEXEDGE研究实验室)

AI总结 提出RAMS运行时控制器,通过监控设备压力、校准切换阈值,在YOLOv8三个规模模型间动态切换,引入检测条件策略和VRU加权准确率评分,在多种嵌入式平台上实现延迟与精度的平衡。

详情
AI中文摘要

嵌入式硬件上的边缘目标检测需要在变化的资源压力下平衡推理延迟和检测质量。我们提出RAMS,一种轻量级运行时控制器,它监控设备压力,从空闲行为校准切换阈值,并在三个驻留的YOLOv8层级(NANO/SMALL/MEDIUM,分辨率320/416/640 px)之间动态选择,无需模型重新加载延迟。RAMS定义了五种切换策略,包括两种检测条件变体,可在最近检测到易受伤道路使用者(VRU)后防止激进的降级。我们进一步引入VRU加权准确率评分(SWAS),一种用于离线策略比较的标量指标,无需真实标注,以及一种基于oracle的变体,用于分离检测器循环性与真正的层级保留收益。在Raspberry Pi 5、x86笔记本电脑和Jetson Orin ONNX/TensorRT部署中,相同的控制器方程在37倍的延迟范围内运行。在重负载下的Jetson Orin TensorRT上,safety2策略实现了3.41毫秒的平均延迟,比固定MEDIUM推理快5.6倍,同时通过接近NANO操作并在VRU阳性窗口期间选择性锁定SMALL和MEDIUM,保留了其代理准确率的74%。与重负载下仅基于阈值的策略相比,检测条件切换在oracle评分下将SWAS提高了25.4%,在检测器衍生评分下提高了47.3%。实时KITTI评估报告了每层级VRU召回率分别为24.2%、41.2%和59.0%,表明反应性覆盖从根本上受限于基线检测器的召回率。

英文摘要

Edge object detection on embedded hardware requires balancing inference latency and detection quality under changing resource pressure. We present RAMS, a lightweight runtime controller that monitors device pressure, calibrates switching thresholds from idle behavior, and dynamically selects among three resident YOLOv8 tiers (NANO/SMALL/MEDIUM at 320/416/640 px) without model-reload latency. RAMS defines five switching policies, including two detection-conditioned variants that prevent aggressive downgrades after recent vulnerable-road-user (VRU) detections. We further introduce the VRU-Weighted Accuracy Score (SWAS), a scalar metric for offline policy comparison without ground-truth annotations, together with an oracle-bounded variant that separates detector circularity from genuine tier-retention benefit. Across Raspberry Pi 5, x86 laptops, and Jetson Orin ONNX/TensorRT deployments, the same controller equations operate over a 37x latency range. On Jetson Orin TensorRT under heavy load, the safety2 policy achieves 3.41 ms mean latency, 5.6x faster than fixed-MEDIUM inference, while retaining 74% of its proxy accuracy through near-NANO operation with selective SMALL and MEDIUM locks during VRU-positive windows. Detection-conditioned switching improves SWAS by 25.4% under oracle scoring and 47.3% under detector-derived scoring relative to threshold-only policies under heavy load. Live KITTI evaluation reports per-tier VRU recall of 24.2%, 41.2%, and 59.0%, showing that reactive overrides are fundamentally limited by baseline detector recall.
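The interaction between pressure-based tier selection and a detection-conditioned lock can be sketched as a small state machine. The tier names come from the abstract; everything else (the two thresholds being passed in rather than calibrated, the 30-frame hold window, SMALL as the minimum tier while locked, and the class name `Controller`) is an assumption for illustration and not the RAMS policy itself.

```python
from dataclasses import dataclass, field

TIERS = ["NANO", "SMALL", "MEDIUM"]  # ordered from cheapest to most accurate


@dataclass
class Controller:
    low: float                 # below this pressure, MEDIUM is allowed
    high: float                # at or above this pressure, drop to NANO
    hold_frames: int = 30      # assumed length of the post-VRU lock window
    floor_tier: str = "SMALL"  # assumed minimum tier while the lock is active
    _frames_since_vru: int = field(default=10**9, init=False, repr=False)

    def step(self, pressure: float, vru_detected: bool) -> str:
        """Pick the tier for the next frame."""
        if vru_detected:
            self._frames_since_vru = 0
        else:
            self._frames_since_vru += 1

        # Threshold-only choice driven by resource pressure.
        if pressure >= self.high:
            tier = "NANO"
        elif pressure >= self.low:
            tier = "SMALL"
        else:
            tier = "MEDIUM"

        # Detection-conditioned override: no aggressive downgrade right
        # after a vulnerable-road-user detection.
        locked = self._frames_since_vru < self.hold_frames
        if locked and TIERS.index(tier) < TIERS.index(self.floor_tier):
            tier = self.floor_tier
        return tier
```

The lock is reactive: it only engages after the currently running tier has produced a detection, which is consistent with the abstract's remark that such overrides are bounded by the recall of the baseline detector.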

2606.14720 2026-06-16 cs.CV 新提交

AI for Maritime Security: Comparative Evaluation of CNN and Vision Transformer Architectures for Maritime Object Detection

AI用于海上安全:CNN与Vision Transformer架构在海上目标检测中的比较评估

Ismet Gocer, Zakirul Bhuiayn, Shakeel Ahmad, Raza Hasan

发表机构 * Southampton Solent University School of Technology and Maritime Industries(南安普顿索伦特大学技术与海事产业学院)

AI总结 研究利用CNN和Vision Transformer等六种深度学习模型,在多种天气条件下检测海面船只,ViT达到100%准确率且处理速度最快,展示了AI视觉系统在海上监视中的潜力。

Comments 24 Pages

详情
AI中文摘要

本研究旨在通过使用先进的人工智能(AI)和计算机视觉(CV)技术来增强海上安全。为此,设计并评估了能够在不同实时环境下检测海面船只存在的智能目标检测系统。为实现这一目标,使用了包含6,468张图像的海上图像数据集,涵盖了多云、雾、雨和晴天等不同天气条件。评估了六种深度学习架构,包括基础卷积神经网络(CNN)模型、四种迁移学习模型(Xception、VGG16、MobileNetV2和EfficientNetV2L)以及一种视觉Transformer(ViT)模型。使用多个性能指标对模型进行比较,包括准确率、第一类和第二类错误、模型大小以及视频处理时间。结果表明,模型性能因计算约束和部署条件而异。虽然轻量级架构适用于资源有限的设备,但ViT实现了最佳整体性能,达到100%准确率,错误率最低且视频处理时间最快。研究结果凸显了AI驱动的计算机视觉系统在海上监视、边境保护和自主导航中的潜力。

英文摘要

This study aims to enhance maritime security by using advanced Artificial Intelligence (AI) and Computer Vision (CV) techniques. For this purpose, intelligent object detection systems that can detect the presence of ships on the sea surface under different real-time environments were designed and assessed. To achieve this goal, a maritime image dataset with 6,468 images was used, covering different weather conditions such as cloudy, foggy, rainy, and sunny environments. Six deep learning architectures were evaluated, including a base Convolutional Neural Network (CNN) model, four transfer learning models (Xception, VGG16, MobileNetV2, and EfficientNetV2L), and a Vision Transformer (ViT) model. The models were compared using multiple performance indicators, including accuracy, Type I and Type II errors, model size, and video processing time. The results show that model performance varies depending on computational constraints and deployment conditions. While lightweight architectures are suitable for resource-limited devices, the ViT achieved the best overall performance, reaching 100% accuracy with the lowest error rates and the fastest video processing time. The findings highlight the potential of AI-driven computer vision systems for maritime surveillance, border protection, and autonomous navigation.
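Since the comparison relies on accuracy together with Type I and Type II errors, a short sketch of how these follow from a binary confusion matrix may be useful. It assumes "ship present" is the positive class and reports the errors as rates; the paper may report raw counts instead, and the function name is hypothetical.

```python
from typing import Dict


def error_rates(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Accuracy plus Type I / Type II error rates for a binary detector."""
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        # Type I error: a ship is reported where there is none.
        "type_i": fp / (fp + tn) if (fp + tn) else 0.0,
        # Type II error: a ship that is present is missed.
        "type_ii": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```

For a surveillance setting the two errors have different costs: a Type I error triggers a false alarm, while a Type II error means a vessel goes unnoticed, so accuracy alone does not describe a detector's suitability.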

2606.14754 2026-06-16 cs.CV cs.AI 新提交

Sub-Semantic Image Segmentation

子语义图像分割

Aviad Cohen Zada, Nadav Orenstein, Shai Avidan, Gal Oren

发表机构 * Tel Aviv University(特拉维夫大学) Stanford University(斯坦福大学) Technion(以色列理工学院)

AI总结 提出子语义图像分割,通过耦合视觉-语言模型与SAM,并引入DETECTURE解决语言泄漏、提示竞争和语义失真问题,在自建数据集TextureADE上取得最优性能。

Comments 23 pages. Code: https://github.com/Scientific-Computing-Lab/TextureDetecture

详情
AI中文摘要

图像可以基于视觉线索(即纹理分割)或对象(即语义分割)进行分割。我们提出了一类新的子语义图像分割,模糊了两者之间的界限。在子语义图像分割中,语言不用于命名整个对象。相反,它用于将图像划分为可由语言描述的稳定外观模式。为此,我们将通用视觉-语言模型与SAM 3(一个可提示分割骨干网络,其原生文本路径可以将丰富描述映射到掩码)耦合。简单的耦合由于我们在论文中识别的多种原因而失败,我们通过引入DETECTURE来克服它们,解决了三个具体的失效模式——纹理区域之间的语言泄漏、分割骨干网络内部的提示竞争以及语言到掩码接口处的语义失真。由于没有子语义图像分割的数据集,我们引入了一个名为TextureADE的数据集。新数据集使用我们设计的系统从ADE20K数据集派生而来。我们将DETECTURE与多个基线进行比较,发现它在多个数据集上使用不同指标均取得了最强性能。代码可在https://github.com/Scientific-Computing-Lab/TextureDetecture获取。

英文摘要

Images can be segmented based on visual cues (i.e., texture segmentation) or into objects (i.e., semantic segmentation). We propose a new category of sub-semantic image segmentation that blurs the line between the two. In sub-semantic image segmentation, language is not used to name whole objects. Instead, it is used to partition an image into stable appearance patterns that can be described by language. To do that, we couple a general-purpose vision-language model to SAM 3, a promptable segmentation backbone whose native text pathway can ground rich descriptions into masks. Simple coupling fails for a number of reasons that we identify in the paper, and we overcome them by introducing DETECTURE that resolves three concrete failure modes -- language leakage between texture regions, prompt competition inside the segmentation backbone, and semantic distortion at the language-to-mask interface. Since there is no dataset of sub-semantic image segmentation, we introduce one, termed TextureADE. The new dataset is derived from the ADE20K dataset using a system we designed. We compare DETECTURE to a number of baselines and find that it achieves the strongest performance on several datasets using different metrics. Code is available at https://github.com/Scientific-Computing-Lab/TextureDetecture.

2606.14755 2026-06-16 cs.CV cs.AI 新提交

Where Does Texture Evidence Live in SAM? Features, Proposal Masks, and Texture Segmentation

纹理证据在 SAM 中存在于何处?特征、提议掩码与纹理分割

Nadav Orenstein, Aviad Cohen Zada, Shai Avidan, Gal Oren

发表机构 * Tel Aviv University(特拉维夫大学) Stanford University(斯坦福大学) Technion(以色列理工学院)

AI总结 研究冻结的 Segment Anything Model (SAM) 中纹理相关证据的存在性,通过最小聚类读取和提议库监督读取分析多尺度特征与自动提议掩码,发现 SAM 并非纹理盲,但默认失败源于读取不匹配和承诺失败。

Comments 26 pages, 13 figures, 20 tables. Code available at https://github.com/Scientific-Computing-Lab/ArchiTexture

详情
AI中文摘要

纹理分割对基础分割模型构成挑战,因为有意义区域由材质或重复外观而非物体身份定义。Segment Anything Models (SAMs) 默认情况下在纹理定义的分割上经常失败,但这种失败是模糊的:纹理证据可能缺失、在提议库中缺失,或者存在但被以物体为中心的读取方式错误选择或组装。我们探究在适应之前,冻结的 SAM 中已经保留了哪些纹理相关证据。我们研究两个冻结的证据空间:多尺度特征(通过最小聚类读取探测)和自动提议库(作为监督整合读取的证据)。SAM 全程冻结;我们不微调骨干网络或重新训练提议生成器。在 RWTD、STLD、ADE20K 精选精修裁剪补充集以及 ControlNet 拼接的 PTD 桥梁存档上,冻结的 SAM 默认情况下不是纹理分割器,但其失败并非简单的纹理盲。粗糙的冻结特征保留了纹理组织,提议库通常包含纹理对齐的掩码或片段。自然场景更常需要组装和对片段做出承诺,而更干净的合成案例则通常简化为选择已经连贯的提议。因此,默认掩码失败应分解为表示证据、提议库支持、读取不匹配和承诺失败。

英文摘要

Texture segmentation stresses foundation segmentation because meaningful regions are defined by material or repeated appearance rather than object identity. Segment Anything Models (SAMs) often fail by default on such texture-defined partitions, but this failure is ambiguous: the texture evidence may be absent, missing from the proposal bank, or present but selected or assembled incorrectly by an object-centric readout. We ask what texture-relevant evidence is already preserved in frozen SAM before adaptation. We study two frozen evidence spaces: multiscale features, probed with a minimal clustering readout, and the automatic proposal bank, treated as evidence for a supervised consolidation readout. SAM is frozen throughout; we do not fine-tune the backbone or retrain the proposal generator. Across RWTD, STLD, an ADE20K-selected refined-crop complement, and a ControlNet-stitched PTD bridge archive, frozen SAM is not a texture segmenter by default, but its failures are not simple texture blindness. Coarse frozen features preserve texture organization, and proposal banks often contain texture-aligned masks or fragments. Natural scenes more often require assembly and commitment over fragments, while cleaner synthetic cases more often reduce to selecting an already coherent proposal. Default mask failure should therefore be decomposed into representation evidence, proposal-bank support, readout mismatch, and commitment failure.

2606.14905 2026-06-16 cs.CV 新提交

Deep Learning in Seismic Interpretation: Federated Advances in Salt Dome Segmentation

地震解释中的深度学习:盐丘分割的联邦学习进展

Muhammad Zain Mehdi, Muhammad Zaid, Owais Aleem

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出FedSaltNet联邦学习框架,结合轻量级Small U-Net和前景加权聚合策略,在四个非独立同分布地震数据集上实现盐丘分割,IoU相对提升4.0%,并证明简单架构在数据受限联邦环境中的必要性。

Comments 7 pages, 8 figures

详情
AI中文摘要

盐丘描绘是地下地质解释中一项关键且高影响力的任务,驱动着油气勘探、储层建模和钻井安全决策。虽然卷积编码器-解码器架构在自动盐分割方面取得了显著改进,但其广泛应用受到数据主权问题、数据集偏差和标注地震数据稀缺的严重限制。本文介绍了FedSaltNet,一个专门为鲁棒、可泛化和隐私保护的盐丘分割而设计的联邦学习框架。我们将轻量级Small U-Net骨干网络(因其效率和正则化特性而选择)与一种新颖的前景加权聚合策略相结合,以解决特定领域的类别不平衡问题。通过在四个不同地震数据集(TGS、SEAM、F3、GBS)上模拟非独立同分布条件的广泛比较研究,我们展示了两个关键发现:前景加权算法有效缓解了数据异质性,与最佳传统联邦学习方法相比,交并比相对提高了4.0%;简单的U-Net架构被证明至关重要,其平均IoU比高容量的ResNet-18 U-Net变体高出166%,强调了在数据受限的联邦环境中架构简单性的必要性。FedSaltNet提供了一个经过验证的高性能解决方案,确立了联邦深度学习用于协作式下一代地下解释的可行性。

英文摘要

Salt-dome delineation is a critical, high-impact task in subsurface geological interpretation, driving decisions in hydrocarbon exploration, reservoir modeling, and drilling safety. While convolutional encoder-decoder architectures have delivered significant improvements in automated salt segmentation, their widespread application is severely limited by data sovereignty concerns, dataset bias, and the scarcity of labeled seismic volumes. This paper introduces FedSaltNet, a Federated Learning (FL) framework explicitly engineered for robust, generalizable, and privacy-preserving salt-dome segmentation. We couple a lightweight Small U-Net backbone, chosen for its efficiency and regularization properties, with a novel Foreground-Weighted (FG-WEIGHTED) aggregation strategy designed to tackle domain-specific class imbalance. Through an extensive comparative study emulating non-IID conditions across four diverse seismic datasets (TGS, SEAM, F3, GBS), we demonstrate two critical findings. First, the FG-WEIGHTED algorithm effectively mitigates data heterogeneity, yielding a 4.0% relative improvement in Intersection over Union (IoU) over the best conventional FL method. Second, the simple U-Net architecture proved essential, outperforming the higher-capacity ResNet-18 U-Net variant by 166% in average IoU, underscoring the necessity of architectural simplicity in data-constrained federated environments. FedSaltNet provides a validated, high-performance solution that establishes the viability of federated deep learning for collaborative, next-generation subsurface interpretation.
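The abstract names a foreground-weighted aggregation strategy but does not define it. One plausible reading, sketched below purely as an assumption, is to weight each client's parameters by the number of foreground (salt) pixels in its training masks instead of by its sample count, as plain federated averaging would. The function names and the flat-list parameter representation are hypothetical.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = foreground (salt), 0 = background


def foreground_weights(masks_per_client: List[List[Mask]]) -> List[float]:
    """Assumed rule: a client's weight is its total foreground pixel count."""
    return [
        float(sum(sum(row) for mask in masks for row in mask))
        for masks in masks_per_client
    ]


def aggregate(client_params: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted average of client parameter vectors."""
    total = sum(weights)
    size = len(client_params[0])
    return [
        sum(w * params[i] for w, params in zip(weights, client_params)) / total
        for i in range(size)
    ]
```

Under this reading, a client with few images but a lot of salt pixels influences the global model more than a large client whose masks are mostly background, which is one way such a scheme could counter class imbalance across sites.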

2606.14912 2026-06-16 cs.CV cs.AI 新提交

Mask Proposal Voting Based on Geodesic Framework for Robust Image Segmentation

基于测地线框架的掩膜提议投票用于鲁棒图像分割

Li Liu, Mingzhu Wang, Zhenjiang Li, Da Chen, Laurent D. Cohen

发表机构 * Yuanshen Rehabilitation Institute, Shanghai Jiao Tong University School of Medicine(上海交通大学医学院源申康复研究院) Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine(上海中医药大学附属岳阳中西医结合医院) Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University, Shandong Academy of Medical Sciences(山东第一医科大学附属山东省肿瘤医院放疗科) University Paris Dauphine, PSL Research University, CNRS, UMR 7534, CEREMADE(巴黎多芬纳大学,PSL研究大学,法国国家科学研究中心,UMR 7534,CEREMADE)

AI总结 提出一种掩膜提议投票框架,通过自适应域构造和加权投票机制克服经典最小路径法对初始化的依赖,在复杂场景下实现鲁棒分割。

详情
AI中文摘要

尽管取得了巨大进步,但准确的分割仍然是一项具有挑战性的任务,尤其是在背景杂乱、强度变化复杂和拓扑外观多样的场景中。最小路径模型在解决图像分割任务中展现了强大的能力。然而,基于最小路径的分割方法的性能严重受限于模型初始化,从而限制了其在实际中的应用范围。在这项工作中,我们提出了一种新颖的掩膜提议投票框架,克服了经典方法的主要缺点,即使在复杂场景下也能实现鲁棒分割。首先,我们引入了一种高效的方法来构建自适应域切割,作为初始化基于区域的最小割演化的约束,从而可以生成多样且可靠的掩膜提议候选,大大增加了这些提议准确覆盖目标区域的可能性。其次,我们提出了一种新的掩膜投票方案,构建编码最终分割信息的投票得分图。与经典的路径投票方法相比,我们的模型允许引入先验知识,为每个单独的掩膜分配不同的重要性。因此,所提出的分割模型能够在复杂场景下准确描绘对象边界,并且对初始化不敏感。实验表明,我们的方法在准确性和鲁棒性上始终优于最先进的基于最小路径的方法。

英文摘要

Despite great advances, accurate segmentation remains a challenging task, especially in scenarios with cluttered backgrounds, complex intensity variations, and diverse topological appearance. Minimal path models have shown strong ability in addressing image segmentation tasks. However, the performance of minimal path-based segmentation approaches is heavily influenced by model initialization, which limits their application scope in practice. In this work, we propose a novel mask proposal voting framework that overcomes this major drawback of classical approaches, allowing robust segmentation even in complicated scenarios. First, we introduce an efficient method for constructing adaptive domain cuts as a constraint for initializing the region-based min-cut evolution, by which diverse and reliable mask proposal candidates can be generated, substantially increasing the likelihood that these proposals accurately cover the target region. Second, we propose a new mask voting scheme to build a voting score map encoding the final segmentation information. In contrast to classical path voting methods, our model allows incorporating priors to assign different importance to each individual mask. As a consequence, the proposed segmentation model is capable of accurately delineating object boundaries under complex scenarios, and is insensitive to initialization. Experiments demonstrate that our method consistently outperforms state-of-the-art minimal path-based approaches in both accuracy and robustness.
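The mask voting step described above can be illustrated with a small sketch. It assumes, purely for illustration, that the voting score map is the prior-weighted average of binary mask proposals and that the final segmentation is obtained by thresholding that map; the function name `vote_masks`, the normalisation, and the 0.5 threshold are hypothetical and not taken from the paper.

```python
from typing import List, Sequence, Tuple

Mask = List[List[int]]          # binary mask, rows of 0/1
ScoreMap = List[List[float]]    # per-pixel voting score in [0, 1]


def vote_masks(masks: Sequence[Mask],
               priors: Sequence[float],
               threshold: float = 0.5) -> Tuple[ScoreMap, Mask]:
    """Combine mask proposals into a score map, weighting each proposal
    by its prior importance, then threshold to get the final mask."""
    if not masks or len(masks) != len(priors):
        raise ValueError("need one prior weight per mask proposal")
    total = float(sum(priors))
    if total <= 0:
        raise ValueError("prior weights must sum to a positive value")

    rows, cols = len(masks[0]), len(masks[0][0])
    score = [
        [sum(p * m[r][c] for p, m in zip(priors, masks)) / total
         for c in range(cols)]
        for r in range(rows)
    ]
    final = [[1 if s >= threshold else 0 for s in row] for row in score]
    return score, final


proposals = [
    [[1, 1], [0, 0]],
    [[1, 0], [0, 0]],
    [[1, 1], [1, 0]],
]
score_map, segmentation = vote_masks(proposals, priors=[2.0, 1.0, 1.0])
print(score_map)     # → [[1.0, 0.75], [0.25, 0.0]]
print(segmentation)  # → [[1, 1], [0, 0]]
```

Because the final mask depends on the aggregate of many proposals, a single badly initialized proposal contributes only its weighted share of the vote, which is the intuition behind the claimed insensitivity to initialization.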

2606.15049 2026-06-16 cs.CV 新提交

Gaussian Spatial Priors for Anatomy-Aware Object Detection in Surgical Videos

高斯空间先验用于手术视频中解剖感知的目标检测

Yunfan Li, Artem Shmelev, Himanshu Gupta

发表机构 * Stony Brook University(石溪大学) Stony Brook University Hospital(石溪大学医院)

AI总结 提出高斯空间先验(GSP)模块,通过编码解剖结构间的空间关系作为参数化偏置注入DAB-DETR解码器的自注意力,显著提升腹股沟疝手术视频中依赖类结构(如腹壁血管)的检测性能。

详情
AI中文摘要

检测手术视频中的解剖结构对于术中安全框架至关重要,例如腹股沟疝修复中的肌耻骨孔关键视图(CVMPO)。虽然标准方法能可靠检测出库珀韧带和危险三角等显著结构,但较小的结构(如腹壁血管)由于视觉模糊和间歇性可见性仍然具有挑战性。我们观察到结构之间的空间关系受解剖约束,并提出高斯空间先验(GSP)模块,将该关系编码为紧凑的参数化偏置,注入DAB-DETR解码器的自注意力中。该先验从训练注释中离线计算为一组冻结的高斯参数,并在每个解码器层使用迭代精化的参考点重新计算。在腹股沟疝修复视频数据集上使用5折交叉验证,GSP在依赖类检测上比DAB-DETR提升$+33.5\%$($\text{AP}_{50}$),比YOLOv26提升$+53.9\%$,同时在锚点检测上提升$+6.0\%$。这些增益在所有折上具有统计显著性($p=0.012$,配对$t$检验)。

英文摘要

Detecting anatomical structures in surgical video is essential for intraoperative safety frameworks such as the Critical View of Myopectineal Orifice (CVMPO) in inguinal hernia repair. While prominent structures like Cooper's ligament and the Triangle of Doom are reliably detected by standard methods, smaller structures such as the epigastric vessels remain challenging due to their visual ambiguity and intermittent visibility. We observe that the spatial relationship between structures is anatomically constrained, and propose a Gaussian Spatial Prior (GSP) module that encodes this relationship as a compact, parametric bias injected into the self-attention of a DAB-DETR decoder. The prior is computed offline from training annotations as a small set of frozen Gaussian parameters and recomputed at each decoder layer using the iteratively refined reference points. On a dataset of inguinal hernia repair videos with 5-fold cross-validation, GSP improves dependent class detection by $+33.5\%$ ($\text{AP}_{50}$) over DAB-DETR and $+53.9\%$ over YOLOv26, while also improving anchor detection by $+6.0\%$. These gains are statistically significant across all folds ($p=0.012$, paired $t$-test).
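As a rough illustration of how a frozen Gaussian prior could bias self-attention, the sketch below adds an unnormalised Gaussian log-density of the offset between two queries' reference points to the attention logits before the softmax. The single shared mean offset `mu`, the diagonal spread `sigma`, and the function names are assumptions for illustration; the abstract does not specify the parameterisation, and the real module presumably keeps parameters per pair of anatomical classes.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def gaussian_bias(ref_points: Sequence[Point],
                  mu: Point,
                  sigma: Point) -> List[List[float]]:
    """bias[i][j] is the unnormalised Gaussian log-density of the offset
    (p_j - p_i) under a frozen mean offset `mu` and diagonal std `sigma`."""
    bias = []
    for xi, yi in ref_points:
        row = []
        for xj, yj in ref_points:
            dx = (xj - xi - mu[0]) / sigma[0]
            dy = (yj - yi - mu[1]) / sigma[1]
            row.append(-0.5 * (dx * dx + dy * dy))
        bias.append(row)
    return bias


def biased_softmax(logits: Sequence[Sequence[float]],
                   bias: Sequence[Sequence[float]]) -> List[List[float]]:
    """Row-wise softmax over (logits + bias), computed stably."""
    attention = []
    for logit_row, bias_row in zip(logits, bias):
        z = [l + b for l, b in zip(logit_row, bias_row)]
        peak = max(z)
        exps = [math.exp(v - peak) for v in z]
        norm = sum(exps)
        attention.append([e / norm for e in exps])
    return attention


# Query 1 sits exactly at the expected offset from query 0,
# so query 0 attends to it more strongly than to itself.
points = [(0.5, 0.5), (0.7, 0.4)]
bias = gaussian_bias(points, mu=(0.2, -0.1), sigma=(0.1, 0.1))
attn = biased_softmax([[0.0, 0.0], [0.0, 0.0]], bias)
```

Since the bias depends only on the reference points and the frozen parameters, it can be recomputed cheaply at every decoder layer as the reference points are refined, which matches the recomputation described in the abstract.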

2606.15072 2026-06-16 cs.CV 新提交

Texture-Shape Bias Balancing for Robust Synthetic-to-Real Semantic Segmentation in Automotive NIR Imagery

纹理-形状偏差平衡用于汽车近红外图像中鲁棒的合成到真实语义分割

Felix Stillger, Ben Hamscher, Lukas Hahn, Annika Mütze, Tobias Meisen, Kira Maag

发表机构 * University of Wuppertal(伍珀塔尔大学) Aptiv(Aptiv公司) Heinrich Heine University Düsseldorf(海因里希·海涅大学杜塞尔多夫) Osnabrück University(奥斯纳布吕克大学)

AI总结 提出生成式增强框架,通过目标风格适配和Voronoi风格多样化策略平衡纹理-形状偏差,实现近红外图像合成到真实域适应,将域差距减少高达63.6%。

Comments Accepted at ECML PKDD 2026 (ADS Track)

详情
AI中文摘要

语义分割是现代汽车系统中视觉感知的基本组成部分,实现像素级场景理解。近红外成像在困难光照条件下提供稳定检测,但由于缺乏真实世界场景的高质量标注数据,特定领域的语义分割模型开发仍具挑战。合成数据集提供可扩展的替代方案,但基于合成图像训练的模型在迁移到真实域时性能下降。我们首次系统研究汽车领域近红外图像中合成到真实域适应的语义分割。我们提出生成式增强框架,通过引入的目标风格适配将合成图像转换为逼真的近红外风格变体。目标风格适配通过低秩适配在小型真实近红外图像集上微调潜在扩散模型,并使用结构保持的多信号条件应用于合成训练数据。为减少纹理偏差并提高分割鲁棒性,我们进一步应用基于Voronoi的风格多样化策略,在保持场景几何的同时修改原始纹理。在车辆内部和街道场景的近红外数据上使用多种模型架构的实验表明,训练期间平衡归纳偏差可显著提高语义分割的鲁棒性,并在我们的真实场景中将域差距减少高达63.6%(外部)和28.4%(内部)。代码可在GitHub获取。

英文摘要

Semantic segmentation is a fundamental component of visual perception in modern automotive systems, enabling pixel-level scene understanding. Near-infrared (NIR) imaging offers stable detection under difficult illumination conditions, but the development of domain-specific semantic segmentation models remains challenging due to the lack of high-quality annotated data from real-world scenarios. Synthetic datasets offer a scalable alternative, but models trained on synthetic images often suffer performance degradation when transferred to real domains. We present the first systematic study on synthetic-to-real domain adaptation for semantic segmentation in NIR images in the automotive domain. We propose a generative augmentation framework that transforms synthetic images into realistic NIR-style variants via our introduced target style adaptation (TSA). TSA fine-tunes a latent diffusion model via low-rank adaptation on a small curated set of real NIR images and applies it to synthetic training data using structure-preserving multi-signal conditioning. To reduce texture bias and improve segmentation robustness, we further apply a Voronoi-based style diversification strategy (VSD) that modifies the original textures while preserving scene geometry. Experiments with multiple model architectures on NIR data from vehicle interiors and street scenes show that balancing inductive bias during training leads to noticeably more robust semantic segmentation and effectively reduces the domain gap in our real-world scenarios by up to 63.6% on exterior and 28.4% on interior data. The code is available at GitHub.

2606.15112 2026-06-16 cs.CV 新提交

Learn Temporal Consistency For Robust Satellite Video Detector

学习时间一致性以实现鲁棒的卫星视频检测器

Weilong Guo, Shengyang Li, Yanfeng Gu

发表机构 * Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences(中国科学院空间应用工程与技术中心) Key Laboratory of Space Utilization, Chinese Academy of Sciences(中国科学院空间应用重点实验室) University of Chinese Academy of Sciences(中国科学院大学) School of Electronics and Information Engineering, Harbin Institute of Technology(哈尔滨工业大学电子与信息工程学院)

AI总结 提出基于时间一致性学习(TCL)的卫星视频目标检测框架,通过时间特征聚合、结构编码和时间一致性约束模块,实现定向细粒度目标检测,在SAT-MTB数据集上达到47.7% mAP,较基线提升4.8%。

Comments 11 pages, 8 figures

详情
AI中文摘要

卫星视频目标检测(SVOD)对于定向和细粒度目标在卫星应用中扮演重要角色。现有大多数SVOD方法仅关注一个或几个粗粒度类别的移动目标,并用水平边界框表示目标。它们难以提取整个卫星视频中关于目标的完整、准确和一致的信息。在本文中,我们提出了一种基于时间一致性学习(TCL)的卫星视频目标检测框架。TCL通过利用卫星视频中丰富的时间上下文,灵活地检测定向和细粒度目标。该框架集成了三个关键模块:时间和细粒度特征聚合(TFA)、结构编码(SE)和时间一致性约束(TCC)。TFA和TCC模块促进跨帧的一致表示学习,而SE模块编码外观和结构信息以实现精确的细粒度识别。在SAT-MTB基准数据集上的实验结果表明,TCL具有优越的性能,实现了47.7% mAP的定向和细粒度检测精度,较基线提升4.8%。此外,我们的TCL框架易于适应现有的基于图像的检测器,从而提高了检测精度。

英文摘要

Satellite video object detection (SVOD) for oriented and fine-grained objects plays an important role in satellite applications. Most existing SVOD methods focus on only one or a few coarse-grained categories of moving objects and represent objects with horizontal bounding boxes. They have difficulty extracting complete, accurate, and consistent information about objects across entire satellite videos. In this paper, we propose a satellite video object detection framework based on Temporal Consistency Learning (TCL). TCL adeptly detects oriented and fine-grained objects by leveraging the rich temporal contexts within satellite videos. The framework integrates three key modules: temporal and fine-grained feature aggregation (TFA), structure encoding (SE), and temporal consistency constraint (TCC). The TFA and TCC modules facilitate consistent representation learning across frames, while the SE module encodes both appearance and structural information for precise fine-grained recognition. Experimental results on the SAT-MTB benchmark dataset demonstrate TCL's superior performance, achieving a new state-of-the-art oriented and fine-grained detection accuracy of 47.7% mAP, a 4.8% improvement over the baseline. Furthermore, our TCL framework readily accommodates existing image-based detectors, leading to enhanced detection accuracies.

2606.15118 2026-06-16 cs.CV 新提交

Multi-view feature High-order Fusion for Space Weak Object Detection and Segmentation

多视角特征高阶融合用于空间弱目标检测与分割

Weilong Guo, Yuhan Sun, Shengyang Li

发表机构 * Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences(中国科学院空间应用工程与技术中心) Key Laboratory of Space Utilization, Chinese Academy of Sciences(中国科学院空间应用重点实验室) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 针对空间弱目标检测与分割,提出多视角特征高阶融合方法(MHF),通过高阶特征感知和递归任务贡献门控选择,有效聚合弱目标的准确丰富特征,作为即插即用模块显著提升多种视觉模型性能。

详情
AI中文摘要

弱目标在空间应用的图像和视频中很常见。然而,从它们有限的外观信息中学习合适的表示是困难的。受多视角学习的启发,我们开发了简单的多视角注意力机制,将其输出视为多视角特征。我们还提出了一种多视角特征高阶融合方法(MHF),以聚合更准确和丰富的弱目标特征。我们的MHF将常用的低阶特征融合方法扩展到高阶。它增强了模型捕获弱目标相关和互补信息的能力。这是通过引入高阶多视角特征感知和递归任务贡献门控选择多视角特征来实现的。新操作高度灵活且可定制,与多视角特征表示的各种变体兼容。我们在两个新构建的空间科学数据集和一个开放的大规模卫星视频数据集上进行了大量实验。我们的MHF作为一个即插即用模块,显著改进了各种基于视觉Transformer和卷积的检测与分割模型。我们在三个数据集上的两个任务上都取得了最先进的精度。我们的MHF可以成为视觉建模的新基础模块,有效地从多视角学习角度表示弱目标。代码将在https://github.com/Kingdroper/MHF 提供。

英文摘要

Weak objects are common in images and videos of space applications. However, it is hard to learn proper representations from their limited appearance information. Inspired by multi-view learning, we develop simple multi-view attentions, treating their outputs as multi-view features. We also propose a multi-view feature high-order fusion method (MHF) to aggregate more accurate and richer features of weak objects. Our MHF extends the commonly used low-order feature fusion method to higher orders. It enhances the model's capacity to capture relevant and complementary information about weak objects. This is achieved by introducing high-order multi-view feature perception and a recursive task-contribution gated selection of multi-view features. The new operation is highly flexible and customizable. It is compatible with various variants of multi-view feature representations. We conduct extensive experiments on two newly constructed space science datasets and an open, large-scale satellite video dataset. Our MHF serves as a plug-and-play module and significantly improves various vision transformer and convolution-based detection and segmentation models. We achieve state-of-the-art accuracy on both tasks across all three datasets. Our MHF can be a new basic module for visual modeling that effectively represents weak objects in terms of multi-view learning. The code will be available at https://github.com/Kingdroper/MHF.

2606.15253 2026-06-16 cs.CV 新提交

Focus, Align, and Sustain: Counteracting Gradient Dilution in Incremental Object Detection

聚焦、对齐与维持:对抗增量目标检测中的梯度稀释

Aoting Zhang, Dongbao Yang, Chang Liu, Xiaopeng Hong, Yu Zhou

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出FAS框架,通过注入先验的查询聚焦判别信号、确定性锚点蒸馏对齐分配、流形支持回放维持旧类分布,解决增量目标检测中梯度稀释导致的性能下降问题。

Comments Accepted by ICML2026

详情
AI中文摘要

将检测Transformer适应到增量目标检测(IOD)面临系统性挑战,因为基于集合的优化本质上被顺序学习所不稳定。在这项工作中,我们识别出梯度稀释是性能下降的根本原因,其中保留旧知识所需的优化信号逐渐减弱。这种现象表现为保留梯度在幅度、方向和支撑覆盖上的级联侵蚀,由三个紧密耦合的因素驱动:信号分散,其中前景梯度被背景噪声淹没;分配漂移,其中随机查询-目标匹配导致不一致的梯度轨迹;以及支撑衰减,其中保留样本的梯度不足以覆盖旧类特征空间,在新类干扰下削弱决策边界。为对抗此,我们提出FAS,一个统一的框架,在增量学习中聚焦、对齐和维持梯度流。具体地,我们引入注入先验的查询,通过从源头过滤背景干扰来聚焦判别信号。我们进一步提出确定性锚点蒸馏,以对齐查询-目标分配并在不稳定匹配下跨阶段强制执行语义一致性。最后,我们设计流形支撑回放,以维持旧类的分布支撑,对抗持续更新引起的表示侵蚀。大量实验表明,FAS恢复了鲁棒的优化动态,并优于最先进的方法,在具有挑战性的40+10x4增量设置中实现了超过5.0 AP的提升。

英文摘要

Adapting Detection Transformers to Incremental Object Detection (IOD) poses a systemic challenge, as set-based optimization is inherently destabilized by sequential learning. In this work, we identify Gradient Dilution as the root cause of performance degradation, wherein optimization signals required to preserve old knowledge are progressively weakened. This phenomenon manifests as a cascading erosion of preservation gradients in magnitude, direction, and support coverage, driven by three tightly coupled factors: Signal Dispersion, where foreground gradients are overwhelmed by background noise; Assignment Drift, where stochastic query-target matching induces inconsistent gradient trajectories; and Support Attrition, where gradients from retained samples insufficiently cover the old-class feature space, weakening decision boundaries under interference from new classes. To counteract this, we propose FAS, a unified framework that Focuses, Aligns, and Sustains gradient flow throughout incremental learning. Specifically, we introduce prior-injected queries to focus discriminative signals by filtering background interference at the source. We further propose deterministic anchor distillation to align query-target assignments and enforce semantic consistency across stages under unstable matching. Finally, we devise manifold-support replay to sustain distributional support of old classes, counteracting representational erosion induced by continual updates. Extensive experiments show that FAS restores robust optimization dynamics and outperforms state-of-the-art methods, achieving over 5.0 AP improvement in the challenging 40+10x4 incremental setting.

2606.15286 2026-06-16 cs.CV 新提交

Decoupled Motion Representation Learning for Moving Infrared Small Target Detection

解耦运动表示学习用于移动红外小目标检测

Guoyi Zhang, Peiwen Wu, Han Wang, Xiangpeng Xu, Xiaohu Zhang

发表机构 * School of Aeronautics and Astronautics, Sun Yat-sen University(中山大学航空航天学院)

AI总结 针对动态场景中目标、平台和背景运动高度耦合导致检测困难的问题,提出解耦运动表示学习框架,通过显式运动分支建模全局相干运动、隐式分支捕捉局部异常,并设计相干运动引导的异常推理模块抑制虚警,在复杂动态场景中显著优于现有方法。

详情
AI中文摘要

动态场景中的红外小目标检测仍然具有挑战性,原因是目标、成像平台和动态背景之间的运动高度耦合。现有的多帧方法通常执行隐式时间建模,其中连贯的背景动态主导运动对应学习,导致检测与虚警之间存在固有的权衡。在这项工作中,我们观察到背景运动表现出强烈的全局连贯性,而小目标主要对应稀疏的局部运动异常。此外,许多虚警响应与全局连贯运动模式保持高度一致性,表明它们主要源于连贯的背景动态而非真实目标运动。基于这些观察,我们提出了一种解耦运动表示学习框架用于移动红外小目标检测。具体地,引入显式运动分支,利用预训练的光流先验建模全局连贯运动动态,并采用结构保持的自监督适应策略进行红外运动对应学习。同时,设计了基于可变形特征对齐的隐式运动分支,在连贯运动引导下捕捉目标敏感的局部运动异常。此外,提出了连贯运动引导的局部异常推理模块,在局部运动建模过程中识别并抑制由连贯运动引起的虚假响应。在两个具有挑战性的红外小目标检测基准上的大量实验表明,所提方法在复杂运动的动态场景中持续优于现有最先进方法,同时保持了良好的推理效率。

英文摘要

Infrared small target detection in dynamic scenes remains challenging due to the highly coupled motions among targets, imaging platforms, and dynamic backgrounds. Existing multi-frame methods usually perform implicit temporal modeling, where coherent background dynamics dominate motion correspondence learning, leading to an inherent trade-off between detection and false alarms. In this work, we observe that background motions exhibit strong global coherence, whereas small targets mainly correspond to sparse local motion anomalies. Moreover, many false-alarm responses maintain high consistency with globally coherent motion patterns, indicating that they mainly originate from coherent background dynamics rather than genuine target motions. Based on these observations, we propose a decoupled motion representation learning framework for moving infrared small target detection. Specifically, an explicit motion branch is introduced to model globally coherent motion dynamics using pretrained optical flow priors, together with a structure-preserving self-supervised adaptation strategy for infrared motion correspondence learning. Meanwhile, an implicit motion branch based on deformable feature alignment is designed to capture target-sensitive local motion anomalies under coherent motion guidance. Furthermore, a coherent-motion-guided local anomaly reasoning module is proposed to identify and suppress coherent-motion-induced false responses during localized motion modeling. Extensive experiments on two challenging infrared small target detection benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art approaches, particularly in dynamic scenes with complex motions, while maintaining favorable inference efficiency.

2606.15409 2026-06-16 cs.CV 新提交

Segmentation-based Detection for Efficient Multi-Task Spacecraft Perception

基于分割检测的高效多任务航天器感知

Sivaperuman Muniyasamy, Surendar Devasundaram

发表机构 * University of Arizona(亚利桑那大学)

AI总结 针对太空视觉感知中的多任务需求,提出集成MobileNetV3编码器与U-Net风格解码器的轻量架构,通过分割掩码联合推导检测框,在SPARK 2026挑战赛中获得0.9482综合得分,排名第二。

Comments 8 pages, 2 figures, 6 tables. CVPRW AI4SPACE-SPARK 2026 Challenge Stream-1 First Place Winners. Code is available at https://github.com/sivaastro/segdet-spark

详情
AI中文摘要

基于视觉的感知是空间态势感知以及自主在轨操作(如交会、对接、服务和导航)的基础。然而,该领域的进展受到标注空间图像稀缺以及具有挑战性的视觉域特性(包括剧烈的光照变化、低信噪比和高对比度)的限制。我们针对SPARK 2026挑战赛的Stream 1,该任务要求一个单一模型完成多目标类型的航天器分类、检测和细粒度部件分割。我们提出了一种紧凑架构,集成了MobileNetV3编码器和U-Net风格解码器,结合了计算效率与精确的密集预测。在单航天器场景下,检测通过预测部件掩码的并集解析得到,避免了单独的边界框回归头。我们的方法取得了0.9482的整体排行榜分数,其中分类、检测和分割的任务特定分数分别为1.0000、0.9788和0.8917。所提出的方法在SPARK 2026挑战赛中总体排名第二,表明轻量级编码器-解码器架构能够为实际星载视觉系统提供强大的多任务性能。

英文摘要

Vision-based perception is fundamental to Space Situational Awareness and autonomous on-orbit operations such as rendezvous, docking, servicing, and navigation. However, progress in this area is limited by the scarcity of annotated space imagery and by challenging visual-domain characteristics including severe illumination changes, low signal-to-noise ratio, and high contrast. We address Stream 1 of the SPARK 2026 Challenge, which requires a single model for spacecraft classification, detection, and fine-grained component segmentation across multiple target types. We propose a compact architecture that integrates a MobileNetV3 encoder with a U-Net-style decoder, combining computational efficiency with accurate dense prediction. Detection is derived analytically from the union of predicted component masks, avoiding a separate bounding-box regression head in the single-spacecraft setting. Our method achieved an overall leaderboard score of 0.9482, with task-specific scores of 1.0000 in classification, 0.9788 in detection, and 0.8917 in segmentation. The proposed approach ranked second overall in the SPARK 2026 Challenge, demonstrating that lightweight encoder-decoder architectures can deliver strong multi-task performance for practical onboard space vision systems.
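Deriving a detection box from the union of predicted component masks, as described above for the single-spacecraft setting, amounts to taking the tightest axis-aligned box around every foreground pixel of any component. The sketch below shows that computation on plain nested lists; the function name `bbox_from_masks` and the inclusive `(x_min, y_min, x_max, y_max)` pixel convention are assumptions for illustration, not the authors' code.

```python
from typing import List, Optional, Sequence, Tuple

Mask = List[List[int]]              # binary component mask, rows of 0/1
Box = Tuple[int, int, int, int]     # (x_min, y_min, x_max, y_max), inclusive


def bbox_from_masks(masks: Sequence[Mask]) -> Optional[Box]:
    """Return the tightest box around the union of all component masks,
    or None when no component has any foreground pixel."""
    xs: List[int] = []
    ys: List[int] = []
    for mask in masks:
        for y, row in enumerate(mask):
            for x, value in enumerate(row):
                if value:
                    xs.append(x)
                    ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Two hypothetical component masks (e.g. a body and a solar panel).
body = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
panel = [
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
print(bbox_from_masks([body, panel]))  # → (1, 0, 3, 2)
```

This only yields one box per image, which is why the abstract restricts the approach to scenes with a single spacecraft; multiple targets would need instance separation first.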

2606.15590 2026-06-16 cs.CV 新提交

Unlocking Diffusion Hierarchies: Adaptive Timestep Selection for Zero-Shot Segmentation

解锁扩散层次:自适应时间步选择用于零样本分割

Ramin Nakhli, Mahesh Ramachandran, Luca Ballan

发表机构 * Google(谷歌)

AI总结 提出自适应时间步选择机制,利用扩散模型去噪过程中的层次语义进展,结合上下文相似度图融合高分辨率注意力与U-Net特征,实现零样本分割性能提升。

详情
AI中文摘要

零样本分割最近通过利用大规模文本到图像扩散模型(如Stable Diffusion)中的丰富视觉先验取得了显著改进。然而,当前的基于扩散的方法常常面临空间分辨率和上下文信息之间的权衡,以及依赖单一静态时间步进行特征提取的限制。为了克服这些挑战,我们的工作引入了两项关键进展。首先,我们的上下文相似度图将高分辨率注意力图与丰富的U-Net编码器特征融合,提供了细粒度且鲁棒的逐像素表示。其次,我们识别出不同扩散模型的去噪过程中存在一种涌现的层次语义进展:表示从早期时间步的部分级抽象过渡到后期阶段的物体级抽象。利用这一洞察,我们引入了一种机制来自适应地为每个像素选择最优时间步。大量实验表明,我们的方法持续优于现有的零样本分割基线,验证了将上下文特征与动态层次时间步选择相结合的有效性。

英文摘要

Zero-shot segmentation has recently shown notable improvement by leveraging the rich visual priors in large-scale text-to-image diffusion models, such as Stable Diffusion. However, current diffusion-based methods often face limitations due to the trade-off between spatial resolution and contextual information, as well as their reliance on a single static timestep for feature extraction. To overcome these challenges, our work introduces two key advancements. First, our Contextual Similarity Maps fuse high-resolution attention maps with rich U-Net encoder features, providing both fine-grained and robust per-pixel representations. Second, we identify an emergent hierarchical semantic progression within the denoising process of various diffusion models: representations transition from part-level abstractions at earlier timesteps to object-level abstractions at later stages. Leveraging this insight, we introduce a mechanism to adaptively select the optimal timestep for each pixel. Extensive experiments demonstrate that our method consistently outperforms existing zero-shot segmentation baselines, validating the efficacy of combining contextual features with dynamic, hierarchical timestep selection.

2606.15786 2026-06-16 cs.CV cs.AI physics.geo-ph 新提交

Domain-Guided Prompting of the Segment Anything Model for Seismic Interpretation: The Role of Attributes, Visualization, and Hybrid Prompts

领域引导的Segment Anything模型提示用于地震解释:属性、可视化和混合提示的作用

Aniq Ahmad, Heather Bedle, Ahmad Mustafa

发表机构 * School of Geosciences, University of Oklahoma(俄克拉荷马大学地球科学学院) King Fahd University of Petroleum and Minerals(法赫德国王石油矿产大学)

AI总结 提出零样本适应框架,通过地质目标感知的地震属性与颜色映射选择,结合混合提示策略,提升SAM在地震解释中的分割精度,避免微调。

详情
AI中文摘要

计算机视觉大型预训练基础模型的出现显著提高了视觉数据解释的效率。特别是Segment Anything Model (SAM)通过基于提示的交互提供了强大的零样本分割能力,因此成为地震解释的有前景工具。然而,大多数现有的SAM应用依赖于针对特定地质目标的微调,这需要大量标注数据、计算成本高,且常常损害模型的泛化能力。在本研究中,我们引入了一个原则性框架,用于将基础模型零样本适应到地震数据。该框架基于两个关键组件:(1) 将地震属性和可视化选择(如颜色映射)与感兴趣的地质目标对齐;(2) 采用混合提示策略,结合稀疏的用户定义点提示和从SAM内部特征激活中导出的密集掩码提示。我们系统地在多个地质目标、数据集、提示配置和地震属性表示上评估了该框架。我们的结果表明,地质目标感知的地震属性和颜色映射选择,结合混合提示,相对于仅基于点提示,增强了地质特征的可分离性,并改善了边界描绘和分割精度。我们的发现表明,当这些组件联合应用时,SAM可以在完全零样本设置下实现有竞争力的分割性能,从而消除了为每个地质特征重新训练SAM的需要。这项工作建立了一条实用且可扩展的途径,以在地震解释中利用基础模型,减少对标注数据的依赖,同时保持模型的通用性。

英文摘要

The advent of large pretrained foundation models for computer vision has significantly improved the efficiency of visual data interpretation. The Segment Anything Model (SAM), in particular, offers powerful zero-shot segmentation capabilities through prompt-based interaction, making it a promising tool for seismic interpretation. However, most existing applications of SAM rely on fine-tuning for specific geological targets, which requires extensive labeled data, incurs high computational cost, and often compromises the model's generalization capability. In this study, we introduce a principled framework for zero-shot adaptation of foundation models to seismic data. The framework is built on two key components: (1) aligning seismic attributes and visualization choices (e.g., colormaps) with the geological target of interest, and (2) employing a hybrid prompting strategy that combines sparse user-defined point prompts with dense mask prompts derived from SAM's internal feature activations. We systematically evaluate this framework across multiple geological targets, datasets, prompt configurations, and seismic attribute representations. Our results demonstrate that geologic-target-aware selection of seismic attributes and colormaps, combined with hybrid prompting, enhances the separability of geological features and improves boundary delineation and segmentation accuracy relative to point-based prompting alone. Our findings show that, when these components are jointly applied, SAM can achieve competitive segmentation performance in a fully zero-shot setting, thereby eliminating the need to retrain SAM for each geologic feature. This work establishes a practical and scalable pathway to leverage foundation models in seismic interpretation, reducing reliance on labeled data while preserving model generality.

2606.16119 2026-06-16 cs.CV 新提交

EdgeZSAD: Practical Zero-Shot Anomaly Detection on Edge Devices

EdgeZSAD:边缘设备上的实用零样本异常检测

Taewan Cho, Andrew Jaeyong Choi

发表机构 * Gachon University(嘉泉大学) Plaid Labs Inc.(Plaid实验室)

AI总结 针对边缘部署约束,提出基于TinyViT-21M-512骨干、非对称全局-局部读出(EdgeGLR)和可复现源训练方案(Real-IAD-DR)的紧凑零样本异常检测系统,在多个工业基准上达到高精度且可直接部署。

详情
AI中文摘要

工业检测需要零样本异常检测(ZSAD),该检测在边缘部署约束下仍然有效。最近的方法通常依赖ViT-L基础骨干(约3亿参数),这超出了典型嵌入式硬件的内存和算子预算。我们通过EdgeZSAD研究这一场景,这是一个紧凑的参考系统,围绕TinyViT-21M-512骨干、非对称全局-局部读出(EdgeGLR)和可复现的源端训练方案(Real-IAD-DR)构建。我们在源训练、目标未见协议下训练单个检查点,并在六个工业基准上评估。在三次独立运行中,所得模型在MVTec-AD上平均图像AUROC达到91.6,在VisA上达到88.2,同时可直接部署在Jetson Orin Nano Super(TensorRT FP16)和RB5 Gen2(QNN GPU FP16)上。在六个设备重新评分的基准中,图像AUROC漂移保持在0.2点以下,表明导出的图在评估的部署设置中保留了主机端的排序行为。

英文摘要

Industrial inspection needs zero-shot anomaly detection (ZSAD) that remains useful under edge deployment constraints. Recent methods often rely on ViT-L foundation backbones (~300M parameters), which exceed the memory and operator budget of typical embedded hardware. We study this regime through EdgeZSAD, a compact reference system built around a TinyViT-21M-512 backbone, an asymmetric global-local readout (EdgeGLR), and a reproducible source-side training recipe (Real-IAD-DR). We train a single checkpoint in a source-trained, target-unseen protocol and evaluate it across six industrial benchmarks. Across three independent runs, the resulting model reaches an average image AUROC of 91.6 on MVTec-AD and 88.2 on VisA, while remaining directly deployable on Jetson Orin Nano Super (TensorRT FP16) and RB5 Gen2 (QNN GPU FP16). Across the six device-rescored benchmarks, image-AUROC drift stays below 0.2 points, indicating that the exported graph preserves host-side ranking behavior in the evaluated deployment setting.

2606.16124 2026-06-16 cs.CV 新提交

Training-Free Open-Vocabulary Visual Grounding for Remote Sensing Images and Videos

面向遥感图像与视频的无训练开放词汇视觉定位

Ke Li, Di Wang, Yongshan Zhu, Ting Wang, Weiping Ni, Tao Lei, Quan Wang, Xinbo Gao

发表机构 * School of Computer Science and Technology, Xidian University(西安电子科技大学计算机科学与技术学院) Interdisciplinary Institute of Artificial Intelligence, Xidian University(西安电子科技大学跨学科人工智能研究院) School of Artificial Intelligence, Xidian University(西安电子科技大学人工智能学院) Northwest Institute of Nuclear Technology(西北核技术研究所) School of Physics and Information Engineering, Fuzhou University(福州大学物理与信息工程学院)

AI总结 提出无训练框架RSVG-ZeroOV,利用冻结的通用基础模型通过概览-聚焦-演化范式实现零样本开放词汇遥感视觉定位,并扩展至视频时空定位,在多个基准上超越现有零样本方法。

详情
AI中文摘要

遥感视觉定位(RSVG)旨在根据自然语言表达在遥感图像或视频中定位所指目标。现有的RSVG方法通常依赖于任务特定的手动标注,这些标注收集成本高昂,且在覆盖真实世界地理空间场景的多样性方面不可避免地存在局限。因此,它们往往难以泛化到涉及新物体、细粒度属性、复杂空间关系和功能语义的开放词汇查询。本文提出RSVG-ZeroOV,一个无训练框架,利用冻结的通用基础模型进行零样本开放词汇RSVG。RSVG-ZeroOV遵循概览-聚焦-演化范式,利用视觉语言模型(VLM)和扩散模型(DM)独特且互补的注意力模式逐步生成精确的定位结果。具体而言:(i) 概览利用VLM提取交叉注意力图,捕获指代表达与视觉区域之间的语义相关性;(ii) 聚焦利用DM的细粒度建模先验,补偿VLM注意力常忽略的物体结构和形状信息;(iii) 演化引入一个简单而有效的注意力演化模块,抑制无关激活,产生纯净的物体掩码。为处理视频输入,我们进一步提出Video RSVG-ZeroOV,通过查询相关关键帧选择器和时序传播器将图像级定位扩展到时空定位,无需视频标注或微调即可实现高效且时序一致的视频定位。在六个图像和视频定位基准上的大量实验表明,RSVG-ZeroOV持续优于现有零样本基线,并与弱监督和全监督方法相比达到有竞争力或更优的性能。

英文摘要

Remote sensing visual grounding (RSVG) aims to localize a referred target in a remote sensing image or video according to a natural language expression. Existing RSVG methods usually rely on task-specific manual annotations, which are costly to collect and inevitably limited in covering the diversity of real-world geospatial scenarios. As a result, they often struggle to generalize to open-vocabulary queries involving novel objects, fine-grained attributes, complex spatial relationships, and functional semantics. In this paper, we propose RSVG-ZeroOV, a training-free framework that leverages frozen generic foundation models for zero-shot open-vocabulary RSVG. RSVG-ZeroOV follows an Overview-Focus-Evolve paradigm, which exploits the distinct yet complementary attention patterns of vision-language models (VLMs) and diffusion models (DMs) to progressively generate precise grounding results. Specifically, (i) Overview utilizes a VLM to extract cross-attention maps that capture semantic correlations between the referring expression and visual regions; (ii) Focus leverages the fine-grained modeling priors of a DM to compensate for object structure and shape information often overlooked by VLM attention; and (iii) Evolve introduces a simple yet effective attention evolution module to suppress irrelevant activations, yielding purified object masks. To handle video inputs, we further present Video RSVG-ZeroOV, which extends image-level grounding to spatio-temporal grounding through a query-relevant key-frame selector and a temporal propagator, enabling efficient and temporally coherent video grounding without video annotations or fine-tuning. Extensive experiments on six image and video grounding benchmarks show that RSVG-ZeroOV consistently outperforms existing zero-shot baselines and achieves competitive or superior performance compared with weakly- and fully-supervised methods.

2606.16302 2026-06-16 cs.CV 新提交

Explainable Flood Segmentation on Sentinel-1 SAR Imagery: A Comparative Study of CNN and Transformer Architectures

可解释的Sentinel-1 SAR影像洪水分割:CNN与Transformer架构的比较研究

Arundhuti Banerjee, David Daou

发表机构 * United Nations University's Institute for Environment and Human Security (UNU-EHS)(联合国大学环境与人类安全研究所(UNU-EHS))

AI总结 比较CNN和视觉Transformer在Sentinel-1 SAR影像多类洪水分割中的性能,SegFormer-b2在ETCI数据集上显著优于U-Net,但在Sen1Floods11上优势缩小,并利用可解释性技术分析模型决策。

详情
AI中文摘要

快速准确的洪水预测对于灾害响应和减灾规划至关重要。卫星上的合成孔径雷达(SAR)传感器非常适合这一目的,因为它们独立于天气和日光条件运行。尽管基于SAR的数据能够实现全天候洪水监测,但区分被淹没的土地和永久水体仍然是一个重大挑战,特别是当洪水严格定义为被淹没的土地时。本研究提供了卷积神经网络(CNN)和视觉Transformer架构在多类洪水分割中的全面比较,使用Sentinel-1 SAR影像,专门训练以区分被淹没的土地、永久水体和陆地。三个基于CNN的最先进模型U-Net、U-Net++和带ResNet-34骨干的DeepLabV3,以及三个SegFormer变体(b0, b1, b2)在两个基准数据集ETCI NASA和Sen1Floods11上进行了评估,采用基于场景的数据划分以确保空间泛化的现实评估。结果表明,SegFormer-b2在ETCI数据集上显著优于U-Net基线(在Wilcoxon符号秩检验中,所有7个测试场景的洪水IoU更高),而在Sen1Floods11上微调后,优势缩小到场景变异范围内,并集中在空间碎片化的洪水事件中。研究包括定性和定量的可解释性技术,以直观理解模型决策并系统评估预测可靠性。定性分析显示,SegFormer-b2产生更空间连贯的Grad-CAM激活,聚焦于洪水相关特征,而U-Net在洪水边界处产生更具信息量的不确定性估计。

英文摘要

Rapid and accurate flood prediction is essential for disaster response and mitigation planning. Synthetic Aperture Radar (SAR) sensors on satellites are well-suited for this purpose because they operate independently of weather and daylight conditions. Although SAR-based data enable all-weather flood monitoring, distinguishing flooded land from permanent water remains a significant challenge, particularly when flooding is defined strictly as inundated land. This study provides a comprehensive comparison of convolutional neural network (CNN) and vision transformer architectures for multi-class flood segmentation using Sentinel-1 SAR imagery, specifically trained to separate flooded land from permanent water bodies and land. Three state-of-the-art (SOTA) CNN-based models (U-Net, U-Net++, and DeepLabV3 with a ResNet-34 backbone) and three SegFormer variants (b0, b1, b2) were evaluated on two benchmark datasets, the ETCI NASA dataset and Sen1Floods11, using scene-based data splits to ensure a realistic assessment of spatial generalization. The results demonstrate that SegFormer-b2 significantly outperforms the U-Net baseline on the ETCI dataset (higher flood IoU across all 7 test scenes in the Wilcoxon signed-rank test), while after fine-tuning on Sen1Floods11, the advantage narrows to within the range of scene variability and is concentrated in spatially fragmented flood events. The study includes both qualitative and quantitative explainability techniques to visually comprehend model decisions and systematically assess prediction reliability. Qualitative analysis reveals that SegFormer-b2 produces more spatially coherent Grad-CAM activations focused on flood-relevant features, while U-Net generates more informative uncertainty estimates along flood boundaries.
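The Wilcoxon signed-rank result quoted above (higher flood IoU on all 7 test scenes) can be put in context with a small worked example. With 7 paired scenes, no zero differences and no tied magnitudes, the most extreme outcome is that every difference is positive, which gives an exact one-sided p-value of 1/2^7 = 0.0078125 (0.015625 two-sided). The sketch below computes the exact one-sided p-value by enumerating sign assignments; the IoU differences are made-up illustrative numbers, not values from the paper, and the function name is hypothetical.

```python
from itertools import product
from typing import Sequence, Tuple


def wilcoxon_exact_one_sided(diffs: Sequence[float]) -> Tuple[int, float]:
    """Exact one-sided Wilcoxon signed-rank test (H1: median difference > 0).

    Assumes no tied absolute differences; zero differences are dropped.
    Returns (W_plus, p_value)."""
    nonzero = [d for d in diffs if d != 0]
    n = len(nonzero)

    # Rank the absolute differences from 1 (smallest) to n (largest).
    order = sorted(range(n), key=lambda i: abs(nonzero[i]))
    ranks = [0] * n
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank

    # Observed statistic: sum of ranks belonging to positive differences.
    w_plus = sum(r for r, d in zip(ranks, nonzero) if d > 0)

    # Under H0 each sign is equally likely; count assignments at least as extreme.
    extreme = sum(
        1
        for signs in product((0, 1), repeat=n)
        if sum(r for r, s in zip(ranks, signs) if s) >= w_plus
    )
    return w_plus, extreme / 2 ** n


# Hypothetical per-scene IoU gains (SegFormer-b2 minus U-Net), all positive.
iou_gains = [0.031, 0.012, 0.054, 0.008, 0.027, 0.019, 0.043]
print(wilcoxon_exact_one_sided(iou_gains))  # → (28, 0.0078125)
```

With only 7 scenes this is the smallest p-value the test can produce, so a consistent win on every scene is the strongest evidence such a small test set can give.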

2606.16448 2026-06-16 cs.CV 新提交

Hierarchical Fine-Grained Aerial Object Detection

层次化细粒度航空目标检测

Yan Zhang, Fang Xu, Wen Yang, Gui-Song Xia

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) School of Artificial Intelligence, Wuhan University(武汉大学人工智能学院) School of Electronic Information, Wuhan University(武汉大学电子信息学院)

AI总结 提出ExpertDet,通过视觉感知掩码属性建模和层次化视觉实例提升,利用结构化先验知识增强细粒度航空目标检测,并在新基准PSP上超越现有方法。

Comments 15 pages

详情
AI中文摘要

细粒度航空目标检测,由现实世界目标类别的内在粒度驱动,对于遥感中的高级场景理解至关重要。现有方法很大程度上继承了粗粒度目标检测的范式,仅依赖单标签监督,因此难以区分具有细微结构差异的模型级类别。然而,对于每个特定模型(例如波音787),属性和层次等结构化先验知识提供了跨多个粒度的判别性语义。受此启发,我们提出了ExpertDet,一种融合专家知识线索以增强细粒度航空目标检测的方案。具体来说,我们设计了视觉感知掩码属性建模(VMAM),通过从视觉线索重建随机掩码的属性,将属性语义与视觉结构对齐,使检测器能够捕捉细微的结构差异。我们进一步提出了层次化视觉实例提升(HierVIP),该方法基于层次关系构建视觉原型树,并施加分类学感知约束,以在增强类别判别性的同时保持跨层次语义连续性。此外,我们为航空图像中模型特定的舰船和飞机的精确识别整理了一个新的细粒度目标检测基准PSP,分别涵盖106个舰船类别和30个飞机模型,是现有航空目标检测数据集中模型特定类别最广泛的集合。我们在PSP基准上对最先进的目标检测算法进行了基准测试。大量评估表明,ExpertDet在各个层次上始终优于其他细粒度竞争对手。数据集、基准和代码可在https://nnnnerd.github.io/PSP-Benchmark/获取。

英文摘要

Fine-grained aerial object detection, driven by the intrinsic granularity of real-world object categories, is crucial for advanced scene understanding in remote sensing. Existing methods largely inherit the paradigm of coarse-grained object detection, relying solely on single-label supervision and thus struggling to distinguish model-level categories with subtle structural differences. However, for each specific model (e.g., Boeing 787), structured prior knowledge such as attributes and hierarchies offers discriminative semantics across multiple granularities. Motivated by this, we present ExpertDet, a scheme that incorporates expert-informed cues to enhance fine-grained aerial object detection. Specifically, we design Vision-aware Masked Attribute Modeling (VMAM), which aligns attribute semantics with visual structures by reconstructing randomly masked attributes from visual cues, enabling the detector to capture subtle structural distinctions. We further propose Hierarchical Visual Instance Promotion (HierVIP), which builds a visual prototype tree based on hierarchical relations and imposes taxonomy-aware constraints to preserve cross-level semantic continuity while enhancing category discrimination. Moreover, we curate PSP, a new fine-grained object detection benchmark for Precise recognition of model-specific Ships and Planes from aerial imagery. PSP covers 106 ship classes and 30 airplane models, the most extensive collection of model-specific categories among existing aerial object detection datasets to date. We benchmark state-of-the-art object detection algorithms on the PSP benchmark. Extensive evaluation demonstrates that ExpertDet consistently outperforms other fine-grained competitors across hierarchy levels. The dataset, benchmark, and code are available at https://nnnnerd.github.io/PSP-Benchmark/.

2606.16996 2026-06-16 cs.CV cs.AI cs.LG 新提交

ActiveSAM: Image-Conditional Class Pruning for Fast and Accurate Open-Vocabulary Segmentation

ActiveSAM: 图像条件类别剪枝实现快速准确的开放词汇分割

Tran Dinh Tien, Zhiqiang Shen

发表机构 * VILA Lab, Mohamed bin Zayed University of Artificial Intelligence(VILA实验室,穆罕默德·本·扎耶德人工智能大学)

AI总结 提出ActiveSAM,一种无需训练、零样本的推理框架,通过图像条件类别剪枝和低分辨率预览,将SAM 3转化为主动词汇分割器,在8个基准上平均提升1.4 mIoU,速度提升最高5.5倍。

Comments Preprint. Code is available at https://github.com/VILA-Lab/ActiveSAM

详情
AI中文摘要

Segment Anything Model 3 (SAM 3) 为概念提示分割提供了强大的冻结骨干网络,但直接应用于开放词汇语义分割 (OVSS) 效率低下:全分辨率解码通常在整个数据集词汇表上运行,而每张图像只包含一小部分活跃类别。我们引入ActiveSAM,一种无需训练、零样本的推理框架,将SAM 3转化为主动词汇分割器。ActiveSAM首先规范化并扩展类别提示,然后从低分辨率存在性预览中估计图像条件的活跃集。只有保留的类别才会以全分辨率解码,解码时使用冻结的SAM 3解码器并采用分桶提示复用。预览阶段仅使用类别存在证据,跳过不必要的分割头计算,而最终阶段应用间隔感知的背景校准以抑制低置信度像素。ActiveSAM不需要目标数据集训练、权重更新或oracle类别存在标签。在八个OVSS基准上,ActiveSAM改善了无需训练的开放词汇语义分割的速度-准确率权衡,平均比当前最先进的SegEarth-OV3高出约+1.4 mIoU,同时在大型词汇数据集上运行速度最高提升5.5倍。ActiveSAM在模拟真实世界分布偏移的图像损坏下也表现出最强的鲁棒性,使其非常适合部署在噪声输入领域,如自动驾驶和具身AI。代码可在https://github.com/VILA-Lab/ActiveSAM获取。

英文摘要

Segment Anything Model 3 (SAM 3) provides a strong frozen backbone for concept-prompted segmentation, but applying it directly to open-vocabulary semantic segmentation (OVSS) is inefficient: full-resolution decoding is typically run over the entire dataset vocabulary, whereas each image contains only a small active subset of classes. We introduce ActiveSAM, a training-free, zero-shot inference framework that turns SAM 3 into an active-vocabulary segmenter. ActiveSAM first canonicalizes and expands class prompts, then estimates an image-conditioned active set from a low-resolution presence preview. Only the retained classes are decoded at full resolution, using bucketed prompt multiplexing with the frozen SAM 3 decoder. The preview stage uses only class-presence evidence and skips unnecessary segmentation-head computation, while the final stage applies margin-aware background calibration to suppress low-confidence pixels. ActiveSAM requires no target-dataset training, no weight updates, and no oracle class-presence labels. Across eight OVSS benchmarks, ActiveSAM improves the speed-accuracy tradeoff of training-free open-vocabulary semantic segmentation, outperforming the current state-of-the-art SegEarth-OV3 by approximately +1.4 mIoU on average while running up to 5.5x faster on large-vocabulary datasets. ActiveSAM also demonstrates the strongest robustness under image corruption that simulates real-world distribution shift, making it well-suited for deployment in noisy-input domains such as autonomous driving and embodied AI. Code is available at https://github.com/VILA-Lab/ActiveSAM.
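The two-stage idea in this abstract (estimate an active class set from a cheap preview, then decode only those classes in buckets) can be sketched roughly as below. This is an illustrative sketch only: `active_classes`, `buckets`, the threshold, and the bucket size are hypothetical names and values, and the abstract does not say how presence scores are computed or thresholded.

```python
def active_classes(presence, threshold=0.3, min_keep=1):
    """Keep classes whose preview presence score clears the threshold.

    presence: dict mapping class name -> score in [0, 1] (assumed to come
    from a low-resolution preview pass). Falls back to the top-scoring
    classes so the active set is never empty.
    """
    ranked = sorted(presence, key=presence.get, reverse=True)
    kept = [c for c in ranked if presence[c] >= threshold]
    return kept if len(kept) >= min_keep else ranked[:min_keep]


def buckets(classes, size):
    """Split the retained classes into fixed-size prompt buckets."""
    return [classes[i:i + size] for i in range(0, len(classes), size)]


# Hypothetical preview scores for a 5-class vocabulary.
preview = {"road": 0.9, "car": 0.6, "boat": 0.05, "sky": 0.8, "cow": 0.1}
kept = active_classes(preview)
print(kept)              # ['road', 'sky', 'car']
print(buckets(kept, 2))  # [['road', 'sky'], ['car']]
```

In this toy case full-resolution decoding would run on 3 of 5 classes; the saving grows with vocabulary size, which is consistent with the abstract reporting the largest speedups on large-vocabulary datasets.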

2509.10005 2026-06-16 cs.CV 版本更新

TUNI: Unifying Pre-training and Fine-tuning with Modality-Aware Mutual Learning and Rectification for RGB-T Semantic Segmentation

TUNI:基于模态感知互学习和矫正的RGB-T语义分割统一预训练与微调框架

Xiaodong Guo, Xianda Guo, Tong Liu, Zhihong Deng, Yanlun Peng, Xiang Li, Wujie Zhou

发表机构 * School of Automation, Beijing Institute of Technology(北京理工大学自动化学院) School of Computer Science, Wuhan University(武汉大学计算机学院) Great Wall Motor(长城汽车) School of Information and Electronic Engineering, Zhejiang University of Science and Technology(浙江科技大学信息与电子工程学院)

AI总结 提出TUNI框架,通过模态感知互学习与矫正统一预训练和微调,解决RGB-T语义分割中多模态特征提取融合、模态依赖不平衡及热信息利用不足问题,在五个数据集上优于15种SOTA模型。

Comments This paper is an extended version of the authors' work previously presented at the ICRA conference. To appear in IEEE Transactions on Circuits and Systems for Video Technology. DOI: 10.1109/TCSVT.2026.3701706

详情
AI中文摘要

RGB-热(RGB-T)语义分割提高了自主平台在挑战性环境中的环境感知能力。现有的RGB-T分割框架存在多模态特征提取和融合欠佳、模态依赖不平衡以及热信息利用不足的问题。为了解决这些挑战,我们提出了TUNI,一个用于高效实时RGB-T语义分割的统一预训练和微调框架。它预训练了一个RGB-T编码器,该编码器包含一个RGB-T局部模块,可选择性地强调跨模态的显著一致和不同局部特征,从而以统一的方式整合跨模态特征提取和融合。为了缓解RGB-T预训练过程中的模态偏差问题,引入了模态反转对比互学习,使以RGB为主的编码器和以热为主的编码器之间进行知识交换。在微调阶段,模态矫正学习通过关注两个特定模态解码器之间正确但不同的预测区域,充分利用残余热信息。我们进一步开发了三种TUNI变体,覆盖轻量级、平衡和高性能需求。在五个RGB-T语义分割数据集上的大量实验表明,与15种最先进模型相比,TUNI在准确性、泛化能力和紧凑性方面均表现优异。代码可在https://github.com/xiaodonguo/TUNI-v2获取。

英文摘要

RGB-thermal (RGB-T) semantic segmentation improves the environmental perception of autonomous platforms in challenging conditions. Prevailing RGB-T segmentation frameworks suffer from suboptimal multi-modal feature extraction and fusion, unbalanced modality dependency, and inadequate utilization of thermal information. To address these challenges, we propose TUNI, a unified pre-training and fine-tuning framework for efficient and real-time RGB-T semantic segmentation. It pre-trains an RGB-T encoder that incorporates an RGB-T local module that selectively emphasizes salient consistent and distinct local features across modalities, thereby integrating cross-modal feature extraction and fusion in a unified manner. To alleviate the modality bias issue during RGB-T pre-training, modality-inverted contrastive mutual learning is introduced to enable knowledge exchange between two RGB-dominated and thermal-dominated encoders. In the fine-tuning phase, modality rectification learning fully exploits residual thermal information by focusing on correct yet divergent prediction regions between two modality-specific decoders. We further develop three TUNI variants, covering lightweight, balanced, and high-performance requirements. Extensive experiments on five RGB-T semantic segmentation datasets demonstrate that TUNI achieves superior accuracy, generalization, and compactness compared with 15 state-of-the-art models. The code is available at https://github.com/xiaodonguo/TUNI-v2.

2512.24838 2026-06-16 cs.CV cs.RO 版本更新

CropTrack: A Tracking with Re-Identification Framework for Precision Agriculture

CropTrack: 面向精准农业的跟踪与重识别框架

Md Ahmed Al Muzaddid, Jordan A. James, William J. Beksi

发表机构 * Department of Computer Science and Engineering, The University of Texas at Arlington(计算机科学与工程系,德克萨斯大学阿灵顿分校)

AI总结 针对农业场景中物体外观相似、频繁遮挡导致跟踪困难的问题,提出结合外观与运动信息的MOT框架CropTrack,通过重排序增强外观关联、一对多关联冲突解决和指数移动平均原型特征库,显著提升身份保持和关联精度。

Comments 8 pages, 5 figures, and 4 tables

详情
AI中文摘要

农业环境中的多目标跟踪(MOT)由于重复模式、相似物体外观、突然光照变化和频繁遮挡而面临重大挑战。该领域的当代跟踪器依赖物体运动而非外观进行关联。然而,当目标经历频繁且强烈的遮挡时,它们难以维持物体身份。物体外观的高度相似性使得在农业场景中集成基于外观的关联变得非平凡。为解决此问题,我们提出CropTrack,一种基于外观和运动信息结合的新型MOT框架。CropTrack集成了重排序增强的外观关联、基于外观冲突解决策略的一对多关联以及指数移动平均原型特征库,以改进基于外观的关联。在公开可用的农业MOT数据集上评估,CropTrack展示了一致的身份保持,优于传统的基于运动的跟踪方法。与现有技术相比,CropTrack在关联准确性和识别精度得分上取得了显著提升,同时身份切换次数更低。

英文摘要

Multiple-object tracking (MOT) in agricultural environments presents major challenges due to repetitive patterns, similar object appearances, sudden illumination changes, and frequent occlusions. Contemporary trackers in this domain rely on the motion of objects rather than appearance for association. Nevertheless, they struggle to maintain object identities when targets undergo frequent and strong occlusions. The high similarity of object appearances makes integrating appearance-based association nontrivial for agricultural scenarios. To solve this problem, we propose CropTrack, a novel MOT framework based on the combination of appearance and motion information. CropTrack integrates a reranking-enhanced appearance association, a one-to-many association with an appearance-based conflict resolution strategy, and an exponential moving average prototype feature bank to improve appearance-based association. Evaluated on publicly available agricultural MOT datasets, CropTrack demonstrates consistent identity preservation, outperforming traditional motion-based tracking methods. Compared to the state of the art, CropTrack achieves significant gains in association accuracy and identification precision scores with a lower number of identity switches.
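An exponential moving average (EMA) prototype feature bank, one of the three components named in this abstract, might look roughly like the sketch below. The class name, the momentum value, and the cosine-similarity lookup are assumptions for illustration; the abstract does not specify CropTrack's actual update rule or matching score.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


class PrototypeBank:
    """Per-track appearance prototype maintained as an EMA of embeddings."""

    def __init__(self, momentum=0.9):
        self.momentum = momentum  # hypothetical value, not from the paper
        self.protos = {}

    def update(self, track_id, feat):
        feat = _normalize(feat)
        old = self.protos.get(track_id)
        if old is None:
            self.protos[track_id] = feat
            return
        m = self.momentum
        blended = [m * o + (1 - m) * f for o, f in zip(old, feat)]
        self.protos[track_id] = _normalize(blended)

    def best_match(self, feat):
        """Return the track id whose prototype is most cosine-similar."""
        feat = _normalize(feat)
        return max(
            self.protos,
            key=lambda t: sum(a * b for a, b in zip(self.protos[t], feat)),
        )


bank = PrototypeBank()
bank.update(1, [1.0, 0.0])
bank.update(2, [0.0, 1.0])
bank.update(1, [0.9, 0.1])            # a slightly different view of track 1
print(bank.best_match([0.8, 0.2]))    # 1
```

The EMA smooths out single noisy detections, so a prototype changes slowly even when one frame is partially occluded; the cost is slower adaptation when an object's appearance changes for good.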

2602.06335 2026-06-16 cs.CV 版本更新

SPDA-SAM: A Self-prompted Depth-Aware Segment Anything Model for Instance Segmentation

SPDA-SAM: 一种用于实例分割的自提示深度感知分割一切模型

Yihan Shang, Wei Wang, Chao Huang, Xinghui Dong

发表机构 * State Key Laboratory of Physical Oceanography and the Faculty of Information Science and Engineering, Ocean University of China(中国海洋大学物理海洋学国家重点实验室、信息科学与工程学部) School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University(中山大学深圳校区网络空间安全学院)

AI总结 提出SPDA-SAM,通过自提示模块和粗到细RGB-D融合,解决SAM依赖手动提示和缺乏深度信息的问题,在12个数据集上超越现有方法。

详情
AI中文摘要

最近,分割一切模型(SAM)在各种实例分割任务中展现出强大的泛化能力。然而,其性能严重依赖于手动提示的质量。此外,实例分割方法通常使用的RGB图像本质上缺乏深度信息。因此,这些方法感知空间结构和描绘物体边界的能力受到阻碍。为了解决这些挑战,我们提出了一种用于实例分割的自提示深度感知SAM(SPDA-SAM)。具体来说,我们设计了一个语义-空间自提示模块(SSSPM),该模块分别从SAM的图像编码器和掩码解码器中提取语义和空间提示。此外,我们引入了一个粗到细的RGB-D融合模块(C2FFM),其中从单目RGB图像中提取的特征与从中估计的深度图进行融合。特别地,深度图中的结构信息用于为特征融合提供粗粒度指导,而深度的局部变化被编码以融合细粒度特征表示。据我们所知,SAM尚未以这种自提示和深度感知的方式进行探索。实验结果表明,我们的SPDA-SAM在十二个不同的数据集上优于最先进的对应方法。这些令人鼓舞的结果应归因于自提示的引导以及粗到细RGB-D融合操作对空间信息损失的补偿。

英文摘要

Recently, the Segment Anything Model (SAM) has demonstrated strong generalizability in various instance segmentation tasks. However, its performance depends heavily on the quality of manual prompts. In addition, the RGB images that instance segmentation methods normally use inherently lack depth information. As a result, the ability of these methods to perceive spatial structures and delineate object boundaries is hindered. To address these challenges, we propose a Self-prompted Depth-Aware SAM (SPDA-SAM) for instance segmentation. Specifically, we design a Semantic-Spatial Self-prompt Module (SSSPM) which extracts the semantic and spatial prompts from the image encoder and the mask decoder of SAM, respectively. Furthermore, we introduce a Coarse-to-Fine RGB-D Fusion Module (C2FFM), in which the features extracted from a monocular RGB image and the depth map estimated from it are fused. In particular, the structural information in the depth map is used to provide coarse-grained guidance to feature fusion, while local variations in depth are encoded in order to fuse fine-grained feature representations. To our knowledge, SAM has not been explored in such a self-prompted and depth-aware manner. Experimental results demonstrate that our SPDA-SAM outperforms its state-of-the-art counterparts across twelve different data sets. We attribute these promising results to the guidance of the self-prompts and to the coarse-to-fine RGB-D fusion operation, which compensates for the loss of spatial information.

2604.18866 2026-06-16 cs.CV 版本更新

HMR-Net: Hierarchical Modular Routing for Cross-Domain Object Detection in Aerial Images

HMR-Net: 用于航拍图像跨域目标检测的层次化模块化路由

Pourya Shamsolmoali, Masoumeh Zareapoor, Michael Felsberg, Nick Pears, Huiyu Zhou, Yue Lu

发表机构 * Department of Computer Science, University of York(约克大学计算机科学系) SEIEE, Shanghai Jiao Tong University(上海交通大学SEIEE) Computer Vision Laboratory, Linköping University(林雪平大学计算机视觉实验室) School of Computing and Mathematical Sciences, University of Leicester(莱斯特大学计算与数学科学学院) SCEE, East China Normal University(华东师范大学SCEE)

AI总结 提出层次化模块化路由框架,通过领域路由和场景路由实现跨数据集和复杂场景下的结构化专业化,并利用条件专家模块支持零样本新类别检测。

详情
AI中文摘要

尽管目标检测取得了进展,航拍图像仍然是一个具有挑战性的领域,因为模型通常难以在空间分辨率、场景组成和语义标签覆盖的变化中泛化。不同数据集之间的地理背景、传感器特性和目标分布的差异限制了传统模型学习一致且可迁移表示的能力。在这些数据上训练的共享方法倾向于在不同领域上施加统一表示,导致在区域特定内容上性能较差,并且在处理新目标类别时缺乏灵活性。为了解决这个问题,我们提出了一种新颖的模块化学习框架,能够在航拍检测中实现结构化专业化。我们的方法引入了一种具有两个层次模块化的层次化路由机制:一个领域路由层,利用潜在地理嵌入将输入分配给领域专门的专家模块;以及一个场景路由机制,将图像子区域分配给场景特定的专家模块。这使得我们的方法能够在数据集之间和复杂场景内进行专业化。此外,该框架包含一个条件专家模块,利用外部语义信息(例如,类别名称或文本描述)在推理时检测新目标类别,无需重新训练或微调。通过超越单一表示,我们的方法为遥感目标检测提供了一个自适应框架。在四个数据集上的全面评估突出了在多数据集泛化、区域级别专业化和开放类别检测方面的改进。

英文摘要

Despite advances in object detection, aerial imagery remains a challenging domain, as models often fail to generalize across variations in spatial resolution, scene composition, and semantic label coverage. Differences in geographic context, sensor characteristics, and object distributions across datasets limit the capacity of conventional models to learn consistent and transferable representations. Shared methods trained on such data tend to impose a unified representation across fundamentally different domains, resulting in poor performance on region-specific content and less flexibility when dealing with novel object categories. To address this, we propose a novel modular learning framework that enables structured specialization in aerial detection. Our method introduces a hierarchical routing mechanism with two levels of modularity: a domain routing layer that uses latent geographic embeddings to assign inputs to domain-specialized expert modules, and a scene routing mechanism that allocates image subregions to scene-specific expert modules. This allows our method to specialize across datasets and within complex scenes. Additionally, the framework contains a conditional expert module that uses external semantic information (e.g., category names or textual descriptions) to enable detection of novel object categories during inference, without the need for retraining or fine-tuning. By moving beyond monolithic representations, our method provides an adaptive framework for remote sensing object detection. Comprehensive evaluations on four datasets highlight improvements in multi-dataset generalization, region-level specialization, and open-category detection.

2605.25803 2026-06-16 cs.CV 版本更新

ATV-Net: Adaptive Triple-View Network with Dynamic Feature Fusion

ATV-Net: 自适应三视角网络与动态特征融合

Sheng-Wei Chan, Hsin-Jui Pan, Chun-Po Shen, Chia-Min Lin, Yung-Che Wang, Jen-Shiun Chiang

发表机构 * Department of Electrical and Computer Engineering, Tamkang University(淡江大学电机工程学系)

AI总结 提出ATV-Net,通过自适应门控融合三种感受野视角(微观、局部、侦察)改进ResNet-101分割头,在Cityscapes上达到80.31% mIoU,证明经典CNN分割仍有竞争力。

Comments Code will be released soon

详情
AI中文摘要

最近的语义分割研究越来越倾向于更强的上下文建模、密集注意力和基于Transformer的架构。尽管这些模型取得了令人印象深刻的性能,但经典的基于CNN的分割流水线因其简单、高效和易于实现而仍然具有吸引力。本文重新审视了一个实际问题:仅通过修改分割头,基于ResNet的分割模型能改进多少?我们提出了ATV-Net,一种自适应三视角网络,通过三个简单但互补的感受野视角来增强ResNet-101骨干网络。微观视角捕获逐点的语义响应,局部视角建模邻域结构和对象边界,侦察视角提供扩大的上下文线索。ATV-Net不是用固定权重融合这些视角,而是引入自适应决策门,根据输入场景特征动态选择感受野响应。进一步应用紧凑的全局协调层以提高空间和语义一致性。在Cityscapes验证集上的实验表明,ATV-Net达到了80.31%的mIoU。这一结果表明,经典的基于CNN的分割远未过时:通过简单的感受野视角和自适应融合,基于ResNet的流水线可以在不依赖Transformer风格的全局注意力或过于复杂的上下文模块的情况下达到有竞争力的精度水平。

英文摘要

Recent advances in semantic segmentation rely heavily on attention-based and transformer-style architectures that, while accurate, introduce considerable architectural complexity and computational cost. This paper asks whether a compact CNN-based segmentation head can remain competitive by adaptively selecting useful receptive-field evidence. We propose ATV-Net, an Adaptive Triple-View Network that attaches a lightweight head to a conventional backbone. The head organizes three complementary views -- point-wise, neighborhood-level, and enlarged context -- and fuses them through an Adaptive Decision Gate that generates image-dependent weights from global feature statistics. This allows the model to emphasize different receptive-field responses according to scene content, without dense attention or multi-scale aggregation. Experiments on Cityscapes and Pascal VOC 2012 show that ATV-Net achieves 80.31% mIoU on Cityscapes with ResNet-101 and 80.90% with ConvNeXt-Tiny, and 86.7% and 88.5% mIoU on Pascal VOC 2012, respectively, while requiring fewer GFLOPs than representative context-aggregation and attention-based heads. The results indicate that adaptive receptive-field selection remains a practical and effective design choice for CNN-based semantic segmentation.
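The gating idea in this abstract (image-dependent fusion weights generated from global feature statistics) can be illustrated with a small sketch. Everything here is an assumption for illustration: the abstract does not give the gate's architecture, so the sketch uses per-view global average pooling followed by a single linear map and a softmax, with hypothetical names `gated_fusion` and `gate_w`.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def gated_fusion(views, gate_w):
    """Fuse three views with image-dependent weights.

    views:  three flattened feature maps of equal length
            (point-wise, neighborhood-level, enlarged context).
    gate_w: 3x3 matrix (hypothetical gate parameters) mapping the
            per-view global means to one logit per view.
    """
    stats = [sum(v) / len(v) for v in views]  # global average pooling
    logits = [sum(w * s for w, s in zip(row, stats)) for row in gate_w]
    weights = softmax(logits)
    fused = [
        sum(w * v[i] for w, v in zip(weights, views))
        for i in range(len(views[0]))
    ]
    return fused, weights


views = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
fused, weights = gated_fusion(views, identity)
print([round(w, 3) for w in weights])  # [0.09, 0.245, 0.665]
```

Because the weights depend on the input statistics, two images can emphasize different receptive fields with the same parameters, which is the contrast with fixed-weight fusion that the abstract draws.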

2606.13127 2026-06-16 cs.CV 版本更新

Fully Distributed Multi-View 3D Tracking in Real-Time

全分布式多视角3D实时跟踪

Byron Hernandez, Fangyu Li, Aotian Wu, Paul J. Shin, Kaustubh Purandare, Henry Medeiros

发表机构 * University of Florida(佛罗里达大学) NVIDIA Corporation(英伟达公司)

AI总结 提出MV3DT全分布式框架,通过点对点协作实现实时多视角3D跟踪,无需中央聚合,在WILDTRACK上达到96.5% IDF1和93.1% MOTA,支持100摄像头30 FPS运行。

Comments 18 pages, 4 figures, 2 algorithms, 4 tables

详情
AI中文摘要

具有重叠视野的多摄像头跟踪通常依赖于集中式融合,这造成了计算瓶颈,阻碍了大规模部署。我们提出了MV3DT,一个用于实时多视角3D跟踪的全分布式框架,通过点对点协调实现精确的身份传播和遮挡恢复,消除了中央聚合的需要。每个摄像头节点执行一个轻量级模块化流水线,包括单目3D感知、分布式多视角关联以及通过轻量级消息传递的协作融合。MV3DT在WILDTRACK上达到了96.5%的IDF1、93.1%的MOTA和94.6%的MOTP,与最先进的集中式方法相当,并在SCOUT上取得了41.7%的IDF1和50.9%的MOTA,同时展示了卓越的可扩展性:在100个摄像头上以30 FPS运行,摄像头间延迟小于10毫秒,通信开销仅为2.2%。在给定相机标定的情况下,MV3DT以零样本方式运行,无需特定场景学习,可直接部署在新环境中。这些结果确立了MV3DT作为大规模重叠摄像头网络中实时多视角跟踪的实用解决方案。

英文摘要

Multi-camera tracking with overlapping fields of view typically relies on centralized fusion, which creates computational bottlenecks that prevent deployment at scale. We present MV3DT, a fully distributed framework for real-time multi-view 3D tracking that achieves accurate identity propagation and occlusion recovery through peer-to-peer coordination, eliminating the need for central aggregation. Each camera node executes a lightweight modular pipeline comprising monocular 3D perception, distributed multi-view association, and collaborative fusion via lightweight messaging. MV3DT achieves 96.5% IDF1, 93.1% MOTA, and 94.6% MOTP on WILDTRACK, competitive with state-of-the-art centralized methods, and unprecedented 41.7% IDF1 and 50.9% MOTA on SCOUT while demonstrating superior scalability: sustaining 30 FPS on 100 cameras with <10ms inter-camera latency and only 2.2% communication overhead. MV3DT operates in a zero-shot regime given camera calibrations, requiring no scene-specific learning and making it directly deployable in new environments. These results establish MV3DT as a practical solution for real-time multi-view tracking in large-scale overlapping camera networks.

5. 视频理解与时序视觉 19 篇

2606.14723 2026-06-16 cs.CV 新提交

Disagreement-Based Cross-Model Routing for Implicit Video Question Answering

基于分歧的跨模型路由用于隐式视频问答

Durga Sandeep Saluru

发表机构 * Independent Researcher(独立研究员)

AI总结 针对隐式视频问答中单模型精度瓶颈和自一致性策略失效问题,提出无标签无训练的分歧驱动跨模型路由方法,将分歧样本路由至第二模型,在ImplicitQA基准验证集上将AvgAcc提升1.43个点。

详情
AI中文摘要

我们研究ImplicitQA基准上的多项选择视频问答,其中正确答案从未明确显示,必须从屏幕外事件、视线线索、因果结构和跨镜头空间布局中推断。在该基准上,单个前沿视频LLM已接近其精度上限,我们观察到传统的自一致性策略(对同一模型的重复样本进行多数投票)可能有害而非有益,因为模型在难题上的错误是相关的。我们提出基于分歧的跨模型路由,一种纯推理时过程,无需标签和训练。我们对原生视频模型(Gemini 3.1 Pro Preview)在温度为零时进行三次采样,利用其视频处理流水线的真实样本间方差来识别三个样本存在分歧的大约20%的问题子集,并仅将该子集路由到来自不同家族的第二个模型(Claude Opus 4.8),该模型以均匀采样的帧为输入并启用自适应思考。在具有公开真实标签的1001个问题的验证集上(即我们的主要评估),该方法相对于主模型的最佳单样本将AvgAcc提高了1.43,各类别的提升集中在运动与轨迹(+5.49)、推断计数(+3.45)和垂直空间推理(+1.82),这些类别最依赖于跨镜头参考解析。相同的流水线应用于保留的172个问题的CVPR 2026 ImplicitQA挑战测试集,实现了82.03 AvgAcc / 79.71 MacroAvgAcc(相对于主模型最佳单样本提升1.81),在独立分割上确认了验证结果。

英文摘要

We study multiple-choice video question answering on the ImplicitQA benchmark, where the correct answer is never explicitly shown but must be inferred from off-screen events, line-of-sight cues, causal structure, and cross-shot spatial layout. On this benchmark a single frontier video LLM already operates near its accuracy ceiling, and we observe that conventional self-consistency strategies -- majority voting across repeated samples of the same model -- can hurt rather than help, because the model's errors on hard questions are correlated. We propose disagreement-based cross-model routing, a pure inference-time procedure that requires no labels and no training. We triple-sample a native-video model (Gemini 3.1 Pro Preview) at temperature zero, exploit the genuine sample-to-sample variance of its video-processing pipeline to identify the roughly 20% subset of questions where the three samples disagree, and route only that subset to a second model from a different family (Claude Opus 4.8) that consumes uniformly sampled frames with adaptive thinking. On the 1001-question validation set with public ground truth -- our main evaluation -- the method improves AvgAcc by +1.43 over the best single sample of the primary model, with per-category gains concentrated on Motion & Trajectory (+5.49), Inferred Counting (+3.45), and Vertical Spatial Reasoning (+1.82) -- the categories most dependent on cross-shot reference resolution. The same pipeline applied to the held-out 172-question CVPR 2026 ImplicitQA challenge test set achieves 82.03 AvgAcc / 79.71 MacroAvgAcc (+1.81 over the best single sample of the primary model), confirming the validation result on an independent split.
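The routing rule described in this abstract is simple enough to sketch directly. The sketch below uses stub callables in place of the two models; `route`, `majority_vote`, and the stubs are hypothetical names, and no real model API is called. What to return when the primary samples agree (here, the shared answer) follows from the abstract; everything else is an assumption.

```python
from collections import Counter


def majority_vote(answers):
    """Self-consistency baseline: most common answer among the samples."""
    return Counter(answers).most_common(1)[0][0]


def route(question, primary, secondary, n_samples=3):
    """Disagreement-based routing.

    Sample the primary model n_samples times. If all samples agree,
    keep that answer; otherwise hand the question to the secondary
    model from a different family.
    """
    answers = [primary(question) for _ in range(n_samples)]
    if len(set(answers)) == 1:
        return answers[0], "primary"
    return secondary(question), "secondary"


# Stub models standing in for the two real systems.
unstable = iter(["A", "A", "B"])
print(route("q1", lambda q: next(unstable), lambda q: "C"))  # ('C', 'secondary')
print(route("q2", lambda q: "A", lambda q: "C"))             # ('A', 'primary')
```

Note the contrast with `majority_vote`: on the samples `["A", "A", "B"]` the baseline would answer "A", while routing treats the disagreement itself as a signal that the primary model is unreliable on this question.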

2606.14724 2026-06-16 cs.CV cs.AI 新提交

VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference

VigilFormer: 用于视频异常检测的可变形注意力与因果风险推理

Xinze Zhang

发表机构 * University of Southern California(南加州大学)

AI总结 提出VigilFormer框架,结合可变形时空注意力与因果时序建模,通过稀疏注意力、对比多实例学习和自适应帧跳过,在保持高精度的同时实现实时异常检测。

详情
AI中文摘要

监控场景中的视频异常检测必须在检测准确性与实时吞吐量之间取得平衡,现有方法要么通过更强的特征提取器,要么通过更高效的架构来解决这一矛盾,但很少能兼顾两者。我们提出VigilFormer,一个统一框架,结合可变形时空注意力与因果时序建模,用于检测未修剪监控视频中的异常。所提出的可变形时空编码器(DSTE)关注跨帧的稀疏信息位置,避免了密集注意力的二次复杂度,同时保留了捕捉不规则运动模式的能力。因果异常分类器(CAC)对片段级特征应用扩张因果卷积,并优化对比多实例学习目标,无需帧级标签即可分离异常和正常表示。为满足部署约束,自适应置信度调度器(ACS)在推理时动态跳过低信息帧,减少静态场景中的冗余计算。在UCF-Crime、ShanghaiTech和CUHK Avenue上评估,VigilFormer在单GPU上以41.5 FPS分别达到87.83%、97.21%和89.74%的AUC分数,在准确性和速度上均优于最近的弱监督方法。

英文摘要

Video anomaly detection in surveillance settings must balance detection accuracy against real-time throughput, a tension that existing methods address either through stronger feature extractors or more efficient architectures, but rarely both. We present VigilFormer, a unified framework that combines deformable spatio-temporal attention with causal temporal modeling to detect anomalies in untrimmed surveillance video. The proposed Deformable Spatio-Temporal Encoder (DSTE) attends to a sparse set of informative locations across frames, avoiding the quadratic cost of dense attention while retaining the ability to capture irregular motion patterns. A Causal Anomaly Classifier (CAC) applies dilated causal convolutions over snippet-level features and optimizes a contrastive multiple-instance learning objective that separates anomalous and normal representations without frame-level labels. To meet deployment constraints, an Adaptive Confidence Scheduler (ACS) dynamically skips low-information frames at inference time, reducing redundant computation in static scenes. Evaluated on UCF-Crime, ShanghaiTech, and CUHK Avenue, VigilFormer achieves AUC scores of 87.83%, 97.21%, and 89.74% respectively, at 41.5 FPS on a single GPU, outperforming recent weakly-supervised methods in both accuracy and speed.
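The dilated causal convolution that the Causal Anomaly Classifier applies over snippet-level features can be illustrated in one dimension. This is a generic textbook form of the operation, written in plain Python as a sketch; it is not the paper's layer, and the kernel values are placeholders.

```python
def causal_dilated_conv(x, kernel, dilation=1):
    """1-D causal convolution with dilation.

    y[t] = sum_k kernel[k] * x[t - k * dilation], treating samples
    before t = 0 as zero, so y[t] never depends on future inputs.
    """
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


scores = [1.0, 2.0, 3.0, 4.0, 5.0]
print(causal_dilated_conv(scores, [1.0, 1.0], dilation=2))
# [1.0, 2.0, 4.0, 6.0, 8.0]
```

Stacking such layers with growing dilation widens the temporal context without looking ahead, which is what allows a score for the current snippet to be produced online.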

2606.14730 2026-06-16 cs.CV 新提交

Hierarchical GRU with Input-Conditioned Slot Queries for Ball Action Anticipation

基于输入条件化槽查询的分层GRU用于足球动作预测

Parthsarthi Rawat

发表机构 * GameChanger by Dick’s Sporting Goods(迪克体育用品的GameChanger)

AI总结 提出分层模型,利用局部Transformer、GRU和输入条件化事件槽解码器,结合频率重加权匈牙利匹配和高斯软标签,在SoccerNet基准上实现17.91% mAP。

Comments CVPR 2026 SoccerNet Ball Action Anticipation Challenge, Validated Rank 4

详情
AI中文摘要

我们提出了一种用于足球比赛视频中球动作预测的分层模型。给定30秒的观察窗口,系统预测接下来5秒窗口内发生的10类动作。一个共享的局部Transformer编码每个5秒子窗口内的片段级特征;然后一个GRU聚合所有子窗口的时间上下文;最后,一个带有K个输入条件化事件槽的Transformer解码器通过三个解耦头(目标性、类别、时间偏移)解码预测目标。我们引入了频率重加权匈牙利匹配,系统性地偏向稀有动作类别,以及用于时间箱监督的高斯软目标。在SoccerNet球动作预测基准上,我们的方法在测试服务器上达到了17.91%的mAP。

英文摘要

We present a hierarchical model for ball action anticipation in football broadcast video. Given a 30-second observation window, the system predicts actions occurring in the subsequent 5-second window across 10 classes. A shared local Transformer encodes clip-level features within each 5-second sub-window; a GRU then aggregates temporal context across all sub-windows; finally, a Transformer decoder with K input-conditioned event slots decodes the anticipation target via three decoupled heads (objectness, class, temporal offset). We introduce frequency-reweighted Hungarian matching that systematically favours rare action classes, and Gaussian soft targets for temporal bin supervision. On the SoccerNet Ball Action Anticipation benchmark, our method achieves 17.91% mAP on the test server.
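Gaussian soft targets for temporal bin supervision can be sketched as follows. The abstract does not state the bin count, the standard deviation, or whether targets are normalized; the version below assumes a discretized Gaussian centered on the ground-truth bin and normalized to sum to 1, with a hypothetical function name.

```python
import math


def gaussian_soft_target(center, n_bins, sigma=1.0):
    """Soft label over temporal bins, peaked at the ground-truth bin.

    center: index of the bin containing the true action time.
    sigma:  spread in bins (hypothetical default).
    """
    weights = [
        math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(n_bins)
    ]
    total = sum(weights)
    return [w / total for w in weights]


target = gaussian_soft_target(center=2, n_bins=5, sigma=1.0)
print([round(p, 3) for p in target])  # [0.054, 0.244, 0.403, 0.244, 0.054]
```

Compared with a one-hot target, a prediction that lands one bin away from the true time is penalized less than one that lands four bins away, which suits anticipation tasks where timing is inherently uncertain.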

2606.14762 2026-06-16 cs.CV cs.AI 新提交

Scribby: A Multi-Level LLM Framework for Semantic Video Analysis

Scribby: 一种用于语义视频分析的多级LLM框架

Julian Abelarde, Hugo Garrido-Lestache Belinchon

发表机构 * Department of Computer Science and Software Engineering, Milwaukee School of Engineering(密尔沃基工程学院计算机科学与软件工程系)

AI总结 提出一种基于LLM的视频摘要框架,通过微观索引(分析完整转录、句子及语义分组)平衡宏观理解与微观语义分析,并利用相关性热图实现语义分块和匹配的可视化。

详情
AI中文摘要

随着视频内容在教育平台、录播讲座和直播娱乐中的持续扩展,对长视频进行高效且结构化分析的需求日益增长。尽管许多现有AI程序基于AI生成的转录提供高级视频摘要,但这些方法通常局限于粗略概述,缺乏对视频结构、主题进展和语义关系的详细分析,而这些正是全面视频分析所必需的。本文提出一种基于LLM的视频摘要框架,平衡宏观理解与微观语义分析。该过程的第一阶段在微观层面对视频进行索引,包括:(1) 分析完整转录,(2) 分析单个转录句子,(3) 使用LLM作为评判依据语义相似性对这些句子进行分组。在句子级处理中,通过将全局转录分析和相邻句子信息纳入每个评估提示,保留上下文连续性。该框架为通过相关性热图可视化语义分块和语义匹配的视频分析工具奠定了基础。还讨论了框架的局限性和未来扩展。

英文摘要

As video content continues to expand across educational platforms, recorded lectures, and live-streamed entertainment, the need for efficient and structured analysis of long-form footage has increased. Although many existing AI programs provide high-level video summaries based on AI-generated transcripts, these approaches are often limited to coarse overviews and lack detailed analysis of a video's structure, thematic progression, and semantic relationships, all of which are required for comprehensive video analysis. This paper proposes an LLM-based video summarization framework that balances macro-level comprehension with micro-level semantic analysis. The first stage of the process indexes the video at a micro level by (1) analyzing the full transcript, (2) analyzing individual transcript sentences, and (3) grouping these sentences by semantic similarity using an LLM as a judge. Contextual continuity is retained during sentence-level processing by incorporating both the global transcript analysis and adjacent sentence information into each evaluation prompt. This framework establishes a foundation for video analysis tools that visualize semantic chunking and semantic matching through relevance-based heatmaps. Limitations and future expansions of the framework are also discussed.

2606.14765 2026-06-16 cs.CV cs.AI cs.LG cs.MM 新提交

Momentum-Guided Semantic Forecasting (MoFore) for Self-Supervised Video Representation Learning

动量引导的语义预测(MoFore)用于自监督视频表示学习

Qinwu Xu

发表机构 * Qinwu Xu, PhD(Qinwu Xu 博士)

AI总结 提出MoFore框架,通过预测未来潜在嵌入进行自监督视频表示学习,结合对比正则化防止表示崩溃,在UCF101上验证了时间一致性和语义结构。

Comments 13 pages, 5 Figures, and 2 Tables

详情
AI中文摘要

自监督视频表示学习最近通过对比学习、掩码重建和预测表示学习取得了进展。基于重建的方法如MAE和VideoMAE通过恢复掩码视觉内容来学习表示,而对比方法如CLIP通过表示对齐学习语义有意义的嵌入空间。在这项工作中,我们提出了一种动量引导的语义预测框架(MoFore)用于自监督视频表示学习。该方法不是优化像素级重建或任务特定的语义对齐,而是通过从时间上遥远的上下文片段预测未来的潜在嵌入来学习时间预测性视频表示。为了提高跨时间尺度的鲁棒性,我们进一步引入了训练期间的随机时间间隔预测。该框架将预测性潜在预测与对比正则化相结合,以鼓励时间一致性同时防止表示崩溃。在UCF101数据集上的实验表明,所提出的框架在训练期间不使用动作标签的情况下学习了时间一致且语义有意义的视频表示。定量分析显示学习到的嵌入空间具有强时间稳定性和涌现的类别级结构,而定性检索实验揭示了跨相关活动的运动感知组织。总体而言,结果表明长程潜在预测为自监督视频表示学习提供了一种有效且计算高效的方法,而不依赖于基于重建的目标。

英文摘要

Self-supervised video representation learning has recently advanced through contrastive learning, masked reconstruction, and predictive representation learning. Reconstruction-based approaches such as MAE and VideoMAE learn representations by recovering masked visual content, while contrastive methods such as CLIP learn semantically meaningful embedding spaces through representation alignment. In this work, we introduce a Momentum-Guided Semantic Forecasting framework (MoFore) for self-supervised video representation learning. Instead of optimizing for pixel-level reconstruction or task-specific semantic alignment, the proposed method learns temporally predictive video representations by forecasting future latent embeddings from temporally distant context clips. To improve robustness across temporal scales, we further introduce randomized temporal-gap forecasting during training. The framework combines predictive latent forecasting with contrastive regularization to encourage temporal consistency while preventing representation collapse. Experiments on the UCF101 dataset demonstrate that the proposed framework learns temporally consistent and semantically meaningful video representations without using action labels during training. Quantitative analysis shows strong temporal stability and emergent category-level structure in the learned embedding space, while qualitative retrieval experiments reveal motion-aware organization across related activities. Overall, the results suggest that long-range latent forecasting provides an effective and computationally efficient approach for self-supervised video representation learning without relying on reconstruction-based objectives.
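One common way to combine latent forecasting with a contrastive term is an InfoNCE-style loss in which the predicted future embedding must pick out the true future clip among negatives. The sketch below shows that loss together with randomized gap sampling. It is an assumption-laden illustration: the abstract does not name the loss MoFore uses, and `info_nce`, `sample_gap`, the temperature, and the gap range are all hypothetical.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)


def info_nce(pred, candidates, pos_index, temperature=0.1):
    """InfoNCE loss: -log softmax similarity of the positive candidate.

    pred:       forecast embedding of the future clip.
    candidates: embeddings of the true future clip plus negatives.
    """
    sims = [cosine(pred, c) / temperature for c in candidates]
    m = max(sims)
    log_denom = m + math.log(sum(math.exp(s - m) for s in sims))
    return log_denom - sims[pos_index]


def sample_gap(rng, min_gap, max_gap):
    """Randomized temporal gap (in clips) between context and target."""
    return rng.randint(min_gap, max_gap)


rng = random.Random(0)
gap = sample_gap(rng, 4, 16)          # some integer in [4, 16]
pred = [1.0, 0.0]
candidates = [[1.0, 0.0], [0.0, 1.0]]
print(info_nce(pred, candidates, pos_index=0) < 1e-3)  # True
print(info_nce(pred, candidates, pos_index=1) > 9.0)   # True
```

The negatives are what guard against collapse in this formulation: if every clip mapped to the same embedding, all similarities would be equal and the loss could not fall below log(number of candidates).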

2606.14778 2026-06-16 cs.CV cs.AI 新提交

FactCheck: Feasibility-aware Long-term Action Anticipation with Multi-agent Collaboration

FactCheck: 基于多智能体协作的可行性感知长期动作预测

Rui Cao, Jiannong Cao, Bo Yuan, Zhiyuan Wen, Mingjin Zhang

发表机构 * The Hong Kong Polytechnic University(香港理工大学) China Mobile(中国移动)

AI总结 提出FactCheck多智能体框架,通过闭环“观察-规划-验证”机制,结合历史动作图验证可行性,在EPIC-Kitchens-55和EGTEA Gaze+上超越现有方法。

详情
AI中文摘要

长期动作预测(LTA)旨在从部分观察的视频中预测未来动词-名词动作的有序序列。虽然该任务是具身智能的基础,但预测物理上可行的长期动作仍然是一个关键挑战。现有方法以开环方式运行,常常幻觉出不存在物体、违反物体可供性或不考虑物体状态,因为它们缺乏明确的机制来验证动作相对于物理环境的可行性。为解决此问题,我们提出FactCheck,一种新颖的多智能体协作框架,通过闭环“观察-规划-验证”机制提高可行性。FactCheck将复杂的LTA任务分解为专门角色:观察者从视频观察中识别历史动作并构建双形式结构化记忆,包括捕捉高层人类意图和环境状态的历史动作摘要,以及编码物体状态和时间依赖性的历史动作图;规划者基于低层历史动作和高层历史动作摘要生成未来动作草案;验证者严格根据历史动作图验证草案并修正不可行动作。在EPIC-Kitchens-55和EGTEA Gaze+基准上的大量实验表明,FactCheck始终优于最先进方法。我们的工作为可行性感知的长期动作预测建立了新范式,有效闭环了动作识别、动作预测和动作验证。

英文摘要

Long-term action anticipation (LTA) aims to predict an ordered sequence of future verb-noun actions from a partially observed video. While this task serves as the foundation for embodied intelligence, anticipating physically feasible long-term actions remains a critical challenge. Existing methods, which operate in an open-loop manner, often hallucinate non-existent objects, violate object affordances, or disregard object states, as they lack explicit mechanisms to verify action feasibility against the physical environment. To address this, we propose FactCheck, a novel multi-agent collaboration framework that improves feasibility through a closed-loop "Observe-Plan-Verify" mechanism. FactCheck decomposes the complex LTA task into specialized roles: an Observer that recognizes historical actions from video observations and constructs a dual-form structured memory, comprising a History Action Abstract that captures high-level human intentions and environmental status, and a History Action Graph that encodes object states and temporal dependencies; a Planner that generates draft future actions conditioned on both low-level historical actions and high-level History Action Abstract; and a Verifier that rigorously validates the draft against the History Action Graph and refines infeasible actions. Extensive experiments on the EPIC-Kitchens-55 and EGTEA Gaze+ benchmarks demonstrate that FactCheck consistently outperforms state-of-the-art methods. Our work establishes a new paradigm for feasibility-aware long-term action anticipation, effectively closing the loop of action recognition, action prediction and action verification.
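The Verify step of the "Observe-Plan-Verify" loop can be illustrated with a toy feasibility check. In this sketch the History Action Graph is reduced to a dictionary of object states, and `rules` is a hypothetical table of (required state, resulting state) per verb; the paper's verifier is an agent operating on a richer graph, so this only shows the kind of check that rejects hallucinated objects and state violations.

```python
def verify_plan(plan, object_states, rules):
    """Check a draft action sequence against tracked object states.

    plan:          list of (verb, noun) pairs proposed by the planner.
    object_states: noun -> current state, built from observed history.
    rules:         verb -> (required_state, resulting_state); verbs
                   without an entry have no state precondition.
    Returns (feasible_actions, rejected_actions_with_reason).
    """
    states = dict(object_states)  # do not mutate the caller's memory
    feasible, rejected = [], []
    for verb, noun in plan:
        required, result = rules.get(verb, (None, None))
        if noun not in states:
            rejected.append((verb, noun, "object never observed"))
        elif required is not None and states[noun] != required:
            rejected.append((verb, noun, "state is " + states[noun]))
        else:
            feasible.append((verb, noun))
            if result is not None:
                states[noun] = result  # later steps see the new state
    return feasible, rejected


rules = {"open": ("closed", "open"), "close": ("open", "closed")}
draft = [("open", "fridge"), ("open", "fridge"),
         ("close", "fridge"), ("take", "cheese")]
ok, bad = verify_plan(draft, {"fridge": "closed"}, rules)
print(ok)   # [('open', 'fridge'), ('close', 'fridge')]
print(bad)  # [('open', 'fridge', 'state is open'), ('take', 'cheese', 'object never observed')]
```

In a closed-loop system the rejected actions and their reasons would be fed back so the infeasible steps can be refined, rather than simply dropped as they are here.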

2606.15200 2026-06-16 cs.CV 新提交

Keep It in Mind: User Centric Continual Spatial Intelligence Reasoning in Egocentric Video Streams

铭记于心:自我中心视频流中以用户为中心的持续空间智能推理

Yun Wang, Junbin Xiao, Han Lyu, Yifan Wang, Jing Zuo, Zhanjie Zhang, Hong Huang, Dapeng Wu, Angela Yao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出UCS-Bench数据集和DirectMe框架,通过增量构建结构化空间记忆,实现自我中心视频流中动态空间推理、长期记忆与用户实时位置对齐,显著提升多模态大模型的空间推理能力。

Comments 45 pages. https://icml.cc/virtual/2026/poster/63682

详情
Journal ref
ICML 2026
AI中文摘要

我们介绍了UCS-Bench,一个涵盖170多小时自我中心视觉观察的数据集,包含8.1K+带时间戳的问题,用于诊断自我中心视频流中以用户为中心的持续空间智能。UCS-Bench针对一个新问题,强调动态空间推理、长期记忆及其与用户实时位置的对齐。我们提出了DirectMe,一个从流式自我中心观察中增量构建和维护结构化空间记忆的框架。DirectMe能够稳健地跟踪和回忆物体位置,且所有位置均以用户随时间的移动为参照。通过将视觉感知与记忆更新和空间推理紧密耦合,我们的方法支持需要回忆交互、解决视角引起的歧义以及适应动态场景的长时查询。实验表明,DirectMe显著提升了领先多模态大语言模型的空间推理能力;它还超越了许多具备空间感知能力的模型和长视频流式模型。我们希望我们的基准和解决方案能够推进自我中心AI助手的空间智能研究。数据和代码可在 https://github.com/cocowy1/UCS-Bench 获取。

英文摘要

We introduce UCS-Bench, a dataset spanning 170+ hours of egocentric visual observations with 8.1K+ timestamped questions for diagnosing User-Centric Continual Spatial intelligence in egocentric video streams. UCS-Bench targets a new problem that emphasizes dynamic spatial reasoning, long-term memory, and their alignment with users' real-time locations. We propose DirectMe, a framework that incrementally constructs and maintains a structured spatial memory from streaming egocentric observations. DirectMe enables robust tracking and recall of object locations, all relative to the user's movement over time. By tightly coupling visual perception with memory updates and spatial reasoning, our approach supports long-horizon queries that require recalling interactions, resolving viewpoint-induced ambiguities, and adapting to dynamic scenes. Our experiments show that DirectMe significantly improves the spatial reasoning of leading multimodal LLMs; it also surpasses many spatially aware and long-form streaming video models. We hope our benchmark and solution will advance spatial intelligence research for egocentric AI assistants. Data and code are available at https://github.com/cocowy1/UCS-Bench.

2606.15275 2026-06-16 cs.CV 新提交

MamBOA: State-Space Architecture for Video Recognition

MamBOA:用于视频识别的状态空间架构

Mustafa Bora Çelik

发表机构 * Ankara Medipol University(安卡拉梅迪波尔大学)

AI总结 提出MamBOA框架,通过交错扫描结构将选择性状态空间递归(S6)作为运动合成器,从骨干网络提取的连续特征中编码运动,实现细粒度动作识别的高效时序建模。

Comments 15 pages, 7 figures. Code available at https://github.com/BOA-clk/MamBOA

详情
AI中文摘要

细粒度动作识别需要时序推理,通用架构通过不同的成本-精度权衡来解决:3D密集算子将计算与输入体积耦合,而基于差分的方法通过刚性的、手工设计的无上下文特征减法来近似运动;每种方法都反映了深思熟虑的设计选择,并在表达能力或灵活性上存在相应限制。我们提出MamBOA,一个骨干无关的时序框架,基于新颖的交错扫描结构,将选择性状态空间递归(S6)重新定义为原生运动合成器。通过将从预训练骨干中提取的连续特征表示交错成单个交替序列,所提出的扫描结构驱动递归在共享隐藏状态中编码每个位置的两次时序观测,二者仅相隔一个衰减步骤,使得帧间过渡成为状态动力学的内在组成部分,而非外部计算的量。然后,一系列专用的对齐和解码操作将此联合编码提炼为显式运动表示,双路径池化机制通过平衡注意力驱动的选择与均匀时序覆盖来自适应地聚合该表示。该框架可与CNN、Transformer和Mamba骨干家族无缝对接,每对特征仅增加约2.1 GFLOPs。在Diving48上,MamBOA使用图像预训练骨干达到85.02%的Top-1准确率,使用视频预训练骨干在单次前向传播中处理整个视频达到86.24%,表明由结构诱导的状态空间动力学构成了运动建模的有原则且通用的基础。

英文摘要

Fine-grained action recognition demands temporal reasoning that general-purpose architectures address through different cost-accuracy tradeoffs: 3D dense operators couple computation to the input volume, while difference-based methods approximate motion through rigid, hand-crafted subtraction of uncontextualized features - each reflecting a deliberate design choice with corresponding limitations in expressiveness or flexibility. We present MamBOA, a backbone-agnostic temporal framework built upon a novel interleaved scan structure that recasts the selective state-space recurrence (S6) as a native motion synthesizer. By interleaving consecutive feature representations extracted from a pretrained backbone into a single alternating sequence, the proposed scan structurally drives the recurrence to encode both temporal observations of each position within a shared hidden state, separated by only a single decay step - rendering the inter-frame transition an intrinsic component of the state dynamics rather than an externally computed quantity. A cascade of dedicated alignment and decoding operations then distills this joint encoding into an explicit motion representation, which a dual-path pooling mechanism adaptively aggregates by balancing attention-driven selection with uniform temporal coverage. The framework interfaces seamlessly with CNN, Transformer, and Mamba backbone families, adding only ~2.1 GFLOPs per feature pair. On Diving48, MamBOA achieves 85.02% Top-1 accuracy with an image-pretrained backbone and 86.24% with a video-pretrained backbone processing the entire video in a single forward pass - demonstrating that structurally induced state-space dynamics constitute a principled and general foundation for motion modeling.
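The interleaved scan described above can be illustrated with a small sketch. It assumes that features of two consecutive frames are merged position by position, so that both observations of the same position sit next to each other in the scanned sequence. The function name `interleave_pair` and the exact ordering are illustrative assumptions; the abstract does not specify the implementation.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def interleave_pair(feat_t: Sequence[T], feat_next: Sequence[T]) -> List[T]:
    """Merge per-position features of two consecutive frames into one
    alternating sequence: [t[0], next[0], t[1], next[1], ...].

    Hypothetical helper: the ordering is an assumption based on the abstract.
    """
    if len(feat_t) != len(feat_next):
        raise ValueError("both frames must have the same number of positions")
    sequence: List[T] = []
    for current, following in zip(feat_t, feat_next):
        # Adjacent placement means a recurrent scan sees both observations
        # of one position with a single state update between them.
        sequence.append(current)
        sequence.append(following)
    return sequence


if __name__ == "__main__":
    print(interleave_pair(["a0", "a1"], ["b0", "b1"]))
```

In this arrangement a recurrent scan would reach the second observation of a position one step after the first, which is one way to read the "single decay step" mentioned in the abstract.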

2606.15320 2026-06-16 cs.CV 新提交

Conditional Multi-Event Temporal Grounding in Long-Form Video

长视频中的条件多事件时间定位

Yuanhao Zou, Arthad Kulkarni, Lucas Tonanez, Lincoln Spencer, Guangyu Sun, Tianxingjian Ding, Andong Deng, Yi Li, Shuangjun Liu, Yuan Li, Dashan Gao, Ning Bi, Taotao Jing, Shuai Zhang, Chen Chen

发表机构 * University of Central Florida(中佛罗里达大学) Qualcomm AI Research(高通人工智能研究院)

AI总结 提出CoMET-Bench基准和CoMET-Agent框架,解决长视频中基于组合时空条件定位所有事件的任务,F1@0.5提升6.1%。

详情
AI中文摘要

多模态大语言模型在视频时间定位方面取得了快速进展,但实际应用通常需要定位满足组合时间和空间条件的每个事件。现有基准存在不足:它们对每个查询仅定位单个时刻,在没有时间条件的情况下进行计数,或者将定位和计数视为互不相关的任务。我们引入了CoMET-Bench,用于长视频中的条件多事件时间定位,包含600个视频上的2789个查询,视频平均时长33.8分钟,涵盖五个真实世界领域;查询由4种时间条件、3种空间条件组合而成,并另设专门的负查询子集。我们进一步提出了一个统一的评估协议,联合衡量计数、定位和负查询识别,其中包括新的Rejection-F1指标,以防止惰性的“始终输出为空”模型轻易钻空子。对大量MLLM、基于智能体的方法和定位专用方法的基准测试表明,现有方法远未解决此任务。基于这些发现,我们提出了CoMET-Agent,一个无需训练的智能体框架,将任务重新表述为结构化的搜索与聚合,仅凭结构化推理就在F1@0.5上比GPT-5提高6.1%。失败分析进一步揭示了三个开放方向:细粒度实体跟踪、位置均匀检索和因果事件配对。

英文摘要

Multimodal large language models have made rapid progress in video temporal grounding, yet real-world applications routinely require localizing every event that satisfies compositional temporal and spatial conditions. Existing benchmarks fall short: they localize only a single moment per query, count without temporal conditions, or treat grounding and counting as disjoint tasks. We introduce CoMET-Bench for Conditional Multi-Event Temporal Grounding in long-form video, comprising 2789 queries over 600 videos averaging 33.8 minutes across five real-world domains, with each query composed from 4 temporal conditions, 3 spatial conditions, and a dedicated negative-query subset. We further propose a unified evaluation protocol jointly measuring counting, grounding, and negative-query recognition, including a new Rejection-F1 metric that prevents trivial gaming by lazy "always-empty" models. Benchmarking a broad suite of MLLMs, agent-based, and grounding-specialized methods reveals that existing approaches remain far from solving this task. Building on these findings, we propose CoMET-Agent, a training-free agentic framework that reformulates the task as structured search-and-aggregate, improving F1@0.5 by 6.1% over GPT-5 purely through structural reasoning. Failure analysis further surfaces three open directions: fine-grained entity tracking, position-uniform retrieval, and causal event pairing.
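For readers unfamiliar with F1@0.5 in multi-event grounding, the sketch below shows one common way to compute it: predicted and ground-truth segments are matched one-to-one, and a prediction counts as correct when its temporal IoU with an unmatched ground-truth segment is at least 0.5. The greedy matching and the handling of empty sets are assumptions made for illustration; the benchmark's exact protocol, including its Rejection-F1 metric, is not described in the abstract.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start_seconds, end_seconds)


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two time intervals."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def f1_at_iou(pred: List[Segment], gt: List[Segment], thr: float = 0.5) -> float:
    """F1 with greedy one-to-one matching at a temporal IoU threshold."""
    if not pred and not gt:
        return 1.0  # assumption: a correctly empty answer scores 1
    matched = set()
    true_pos = 0
    for p in pred:
        best_j, best_iou = -1, 0.0
        for j, g in enumerate(gt):
            if j in matched:
                continue
            iou = temporal_iou(p, g)
            if iou > best_iou:
                best_j, best_iou = j, iou
        if best_j >= 0 and best_iou >= thr:
            matched.add(best_j)
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    predictions = [(0.0, 10.0), (20.0, 30.0), (50.0, 60.0)]
    ground_truth = [(1.0, 10.0), (21.0, 31.0), (80.0, 90.0)]
    print(round(f1_at_iou(predictions, ground_truth), 3))
```

Because every event must be found, both spurious predictions (lower precision) and missed events (lower recall) reduce the score.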

2606.15417 2026-06-16 cs.CV 新提交

From Frames to Temporal Graphs: In-Context Egocentric Action Recognition with Vision-Language Models

从帧到时间图:基于视觉语言模型的上下文第一人称动作识别

Bessie Dominguez-Dager, Francisco Gomez-Donoso, Miguel Cazorla, Marc Pollefeys, Daniel Barath, Zuria Bauer

发表机构 * University of Alicante(阿利坎特大学) ETH Zürich(苏黎世联邦理工学院) Microsoft(微软)

AI总结 提出将视频转换为时间动作图,通过多阶段提示生成自然语言叙述并结构化,实现上下文学习,在EGTEA和Epic-Kitchens-100上显著提升零样本和少样本动作识别性能。

详情
AI中文摘要

第一人称视频中的动作推理需要捕捉手-物交互的细粒度过渡,而通用视觉语言模型(VLM)在直接处理原始像素时往往难以胜任。我们提出通过将视频转换为时间动作图,将视觉感知与符号推理解耦。在多阶段提示流程中,我们首先在短时间窗口上生成密集的自然语言叙述作为语义瓶颈,然后将其形式化为结构化的开放词汇图表示。在EGTEA和Epic-Kitchens-100数据集上,符号表示实现了高效的上下文学习:少样本图演示相比零样本帧和图推理均带来显著的准确率提升。即使在零样本设置下,尽管潜在的预训练污染可能有利于基于像素的推理,但基于图的推理仍能与像素推理保持竞争力。在来自6个模型家族、参数范围从2B到235B的11个开源VLM上,我们的发现表明,当前VLM作为符号推理器比作为直接视觉观察者更有效。通过将视频投影到语言领域,我们为端到端方法提供了一种可扩展、无需微调的替代方案,更好地利用了这些模型的潜在推理优势。代码将公开。

英文摘要

Action reasoning in egocentric video requires capturing fine-grained transitions of hand-object interactions, a task where general-purpose Vision-Language Models (VLMs) often struggle when operating directly on raw pixels. We propose to decouple visual perception from symbolic reasoning by converting videos into Temporal Action Graphs. In a multi-stage prompting pipeline, we first generate dense natural language narratives over short temporal windows as a semantic bottleneck, then formalize them into structured, open-vocabulary graph representations. On the EGTEA and Epic-Kitchens-100 datasets, the symbolic representation unlocks efficient in-context learning: few-shot graph demonstrations yield substantial accuracy gains over zero-shot frame and graph-based inference alike. Even in the zero-shot setting, graph-based reasoning remains competitive with pixel-based inference despite potential pretraining contamination favoring the latter. Across 11 open-weight VLMs from 6 model families ranging from 2B to 235B parameters, our findings indicate that current VLMs are more effective as symbolic reasoners than as direct visual observers. By projecting video into the language domain, we provide a scalable, fine-tuning-free alternative to end-to-end approaches that better leverages these models' latent reasoning strengths. The code will be made public.

2606.15486 2026-06-16 cs.CV 新提交

ST-DiffEye: Diffusion-based Continuous Gaze Generation via Joint Scanpath-Trajectory Modeling

ST-DiffEye:通过联合扫描路径-轨迹建模实现基于扩散的连续注视生成

Brian Nlong Zhao, Ozgur Kara, Junho Kim, James M. Rehg

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出ST-DiffEye,一种联合轨迹-扫描路径扩散框架,通过将两者拼接为额外输入通道进行联合建模,并引入基于连续排序概率得分(CRPS)的评估方法,在视觉搜索和自由观看任务上达到最先进性能。

详情
AI中文摘要

我们研究人类注视建模问题,旨在生成观察者在观看视觉刺激时产生的注视模式。注视主要通过两种模态捕获:连续眼动轨迹(描述细粒度运动动态)和离散扫描路径(描述高级注视结构)。由于注视在不同观察者和试验间差异显著,我们将这种变异性视为定义属性而非噪声,并将注视建模为随机生成过程。现有的生成式注视模型仅对这两种表示之一进行单独监督。我们假设轨迹和扫描路径以互补尺度描述注视,并在训练过程中联合提供信息,通过ST-DiffEye(一种联合轨迹-扫描路径扩散框架)验证该假设,该框架通过将两者拼接为额外的原始输入通道来耦合两种模态,除了输入和输出通道扩展外无需额外架构开销。我们进一步引入基于连续排序概率得分(CRPS)的原则性评估框架,该框架将任何现有序列相似性度量推广为适当的评分规则,以联合评估生成注视的准确性和多样性。在任务驱动的视觉搜索(涵盖目标存在和目标缺失场景)以及自由观看基准上的实验证明了最先进的性能。这些结果以及详细的消融实验证实了联合建模的优势以及分布感知评估在捕捉人类注视内在变异性方面的价值。项目网页:https://st-diffeye.github.io/

英文摘要

We study the problem of human gaze modeling, which aims to generate the gaze patterns a viewer produces while observing a visual stimulus. Gaze is primarily captured through two modalities: continuous eye-tracking trajectories, which describe fine-grained motion dynamics, and discrete scanpaths, which describe high-level fixation structure. Because gaze varies substantially across viewers and trials, we treat this variability as a defining property rather than noise and model gaze as a stochastic generative process. Existing generative gaze models supervise on only one of these two representations in isolation. We hypothesize that trajectories and scanpaths describe gaze at complementary scales and are jointly informative during training, and test this hypothesis through ST-DiffEye, a joint trajectory-scanpath diffusion framework that couples both modalities by concatenating them as an additional raw input channel, requiring no architectural overhead beyond an input and output channel expansion. We further introduce a principled evaluation framework based on the Continuous Ranked Probability Score (CRPS), which generalizes any existing sequence similarity metric into a proper scoring rule that jointly assesses the accuracy and diversity of generated gaze. Experiments on task-driven visual search, covering both target-present and target-absent scenarios, and on free-viewing benchmarks demonstrate state-of-the-art performance. These results, along with detailed ablations, confirm the benefit of joint modeling and the value of distribution-aware evaluation in capturing the intrinsic variability of human gaze. Project webpage: https://st-diffeye.github.io/
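To make the CRPS-based evaluation more concrete, the sketch below uses the standard sample-based form of the score, mean distance from generated samples to the observation minus half the mean pairwise distance among the samples, with the distance left as a pluggable function. Treating this as the way a sequence similarity metric is plugged in is an assumption; the abstract does not give the paper's exact construction, and `sample_crps` is a hypothetical name.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def sample_crps(samples: Sequence[T], target: T,
                dist: Callable[[T, T], float]) -> float:
    """Sample estimate of E[d(X, y)] - 0.5 * E[d(X, X')].

    With dist(a, b) = |a - b| on scalars this is the usual ensemble CRPS.
    Lower is better: the first term rewards accuracy, the second term
    gives credit for sample diversity.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to estimate spread")
    accuracy = sum(dist(s, target) for s in samples) / n
    spread = sum(dist(a, b) for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread


def mean_point_distance(a, b) -> float:
    """Illustrative distance for equal-length 2D gaze sequences (assumption)."""
    return sum(((ax - bx) ** 2 + (ay - by) ** 2) ** 0.5
               for (ax, ay), (bx, by) in zip(a, b)) / len(a)


if __name__ == "__main__":
    print(sample_crps([1.0, 3.0], 2.0, lambda a, b: abs(a - b)))
    generated = [[(0.0, 0.0), (1.0, 1.0)], [(0.0, 1.0), (1.0, 2.0)]]
    observed = [(0.0, 0.5), (1.0, 1.5)]
    print(sample_crps(generated, observed, mean_point_distance))
```

A model that always outputs the same sequence has zero spread and is scored on accuracy alone, whereas a model whose samples cover the observed variability receives credit through the second term.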

2606.15527 2026-06-16 cs.CV cs.AI 新提交

Selective Synergistic Learning for Video Object-Centric Learning

选择性协同学习用于视频对象中心学习

WonJun Moon, Jae-Pil Heo

发表机构 * KAIST(韩国科学技术院) Sungkyunkwan University(成均馆大学)

AI总结 提出选择性协同学习(SSync),通过线性复杂度的伪标签选择性蒸馏最可靠的线索,避免错误传播,提升视频对象分解质量,并可作为即插即用模块。

详情
AI中文摘要

典型的视频对象中心学习(VOCL)方法采用基于槽的框架,依赖重建驱动的编码器-解码器架构,学习通过两个空间图进行:编码器的注意力图和解码器的对象图。由于这两个不同的图表现出不同的属性,最近的密集对齐策略试图通过对比学习强制所有时空补丁之间的一致性来调和这种差异。然而,这种无差别的对齐无意中传播了每个模块固有的弱点,例如编码器的噪声预测和解码器的模糊边界。此外,计算所有对之间的密集相似性会带来与时空补丁总数成二次方关系的计算成本,严重限制了可扩展性。受此启发,我们提出了选择性协同学习(SSync)。SSync不进行穷举的补丁到补丁对齐,而是仅选择性地蒸馏最可靠的线索来防止错误传播:严格利用编码器进行边界细化,利用解码器进行内部去噪。这通过线性复杂度的伪标签实现,消除了二次空间比较的需要。此外,为了防止强化架构偏差(如槽冗余),我们引入了传递性伪标签合并,基于时空激活一致性合并重叠的槽。大量研究表明,SSync提高了分解质量,可作为通用的即插即用模块,同时对槽配置表现出卓越的鲁棒性。代码可在 github.com/wjun0830/SSync 获取。

英文摘要

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: leveraging the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via pseudo-labeling with linear complexity, eliminating the need for quadratic spatial comparisons. In addition, to prevent the reinforcement of architectural biases such as slot redundancy, we introduce transitive pseudo-label merging, which consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.

2606.15992 2026-06-16 cs.CV 新提交

Multi-Task Tennis Stroke Biomechanics Analysis Using MediaPipe Pose

基于MediaPipe Pose的多任务网球击球生物力学分析

Jigyashman Hazarika

发表机构 * Kaggle

AI总结 提出多任务流水线,从RGB视频自动识别击球类型、预测击球方向并评估姿势质量,结合规则反馈提供教练建议,在跨球员测试中击球类型准确率仅下降0.8%。

Comments 14 pages, 9 figures

详情
AI中文摘要

我们构建了一个从普通RGB视频进行网球击球生物力学分析的多任务流水线。在基于姿态的击球识别基础上,它增加了两个新任务:预测击球方向和评估姿势质量,外加一个基于规则的反馈层,用于提供教练建议。使用加权关节速度得分s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder自动检测击球,无需手动标注。姿态来自MediaPipe Pose Landmarker(33个关键点,公制世界坐标),每个击球被转换为30帧×39特征的序列,输入TennisTransformerGPU——一个紧凑的564,103参数Transformer(4层,4头,d=128),带有三个并行输出头。在来自7名职业球员和1名业余球员的11个视频中的1,281个标注击球上训练,在随机80/20划分下,击球类型准确率为83.7%,方向准确率为61.9%,姿势准确率为62.6%。有趣的测试是跨球员:在职业球员上训练,在业余球员上评估。击球类型几乎不变,为82.9%,下降0.8%。方向预测无法迁移,直接退化为多数类。消融实验表明世界坐标的重要性:切换到图像空间关键点导致跨球员击球类型准确率从83%降至47%,方向准确率从68%降至21%。所有内容在Kaggle的免费T4 GPU上运行,完全可复现。

英文摘要

We built a multi-task pipeline for tennis stroke biomechanics from plain RGB video. On top of pose-based stroke recognition, it adds two new tasks, predicting shot direction and grading posture quality, plus a rule-based feedback layer that suggests coaching tips. Strokes are found automatically using a weighted joint velocity score, s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder, removing the need for manual annotation. Pose comes from MediaPipe Pose Landmarker (33 landmarks, metric world coordinates), with each stroke turned into a 30-frame by 39-feature sequence for TennisTransformerGPU, a compact 564,103-parameter transformer (4 layers, 4 heads, d=128) with three parallel output heads. Trained on 1,281 labeled strokes from 7 pros and 1 amateur across 11 videos, it hits 83.7% stroke-type accuracy, 61.9% on direction, and 62.6% on posture under a random 80/20 split. The interesting test is cross-player: train on pros, evaluate on the amateur. Stroke type barely budges, 82.9%, a 0.8% drop. Direction prediction does not transfer; it just falls back to the majority class. An ablation shows why world coordinates matter so much here: switching to image-space landmarks tanks cross-player stroke-type accuracy from 83% to 47% and direction from 68% to 21%. Everything runs on Kaggle's free T4 GPU tier and is fully reproducible.
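The stroke-detection score quoted in the abstract can be sketched as follows. The weights 0.5, 0.3 and 0.2 come from the abstract; everything else is an assumption for illustration, including the interpretation of the three inputs as per-frame motion magnitudes, the peak-picking rule, and the `threshold` and `min_gap` parameters.

```python
from typing import List, Sequence


def stroke_score(v_wrist: Sequence[float], m_elbow: Sequence[float],
                 m_shoulder: Sequence[float]) -> List[float]:
    """s(t) = 0.5 * v_wrist + 0.3 * m_elbow + 0.2 * m_shoulder per frame."""
    return [0.5 * w + 0.3 * e + 0.2 * s
            for w, e, s in zip(v_wrist, m_elbow, m_shoulder)]


def detect_strokes(score: Sequence[float], threshold: float,
                   min_gap: int) -> List[int]:
    """Return frame indices of local maxima above `threshold`.

    Hypothetical rule: peaks closer than `min_gap` frames to the previously
    accepted peak are skipped so one swing is not counted twice.
    """
    peaks: List[int] = []
    for t in range(1, len(score) - 1):
        is_peak = (score[t] >= threshold
                   and score[t] > score[t - 1]
                   and score[t] >= score[t + 1])
        if is_peak and (not peaks or t - peaks[-1] >= min_gap):
            peaks.append(t)
    return peaks


if __name__ == "__main__":
    wrist = [0.0, 1.0, 4.0, 1.0, 0.0, 0.0, 3.0, 0.0]
    elbow = [0.0, 1.0, 2.0, 1.0, 0.0, 0.0, 2.0, 0.0]
    shoulder = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
    s = stroke_score(wrist, elbow, shoulder)
    print(detect_strokes(s, threshold=2.0, min_gap=3))
```

Weighting the wrist most heavily fits the intuition that the racket hand moves fastest during a swing, while the elbow and shoulder terms help distinguish a full stroke from small wrist movements.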

2606.16342 2026-06-16 cs.CV 新提交

When the Past Matters: FlashBack Memory for Precipitation Nowcasting

当过去重要时:用于降水临近预报的FlashBack记忆

Yuhao Du, Boxiao Huang, Chengrong Wu, Jiankai Zhang

发表机构 * College of Atmospheric Sciences, Lanzhou University(兰州大学大气科学学院) Fuqua School of Business, Duke University(杜克大学福库商学院) Department of Computer Science, University of Manchester(曼彻斯特大学计算机科学系) Supercomputing Center of Lanzhou University(兰州大学超级计算中心)

AI总结 提出FlashBack Memory模块,通过动态检索关键历史状态并自适应融合,增强循环模型时空表征能力,显著提升高分辨率降水预测的准确性和时序一致性。

详情
AI中文摘要

准确的降水临近预报对于减灾和社会经济规划至关重要,然而现有方法在高时空分辨率下常面临虚警、漏报和长程依赖建模困难。为解决这些问题,我们提出FlashBack Memory(FB)模块,该模块动态检索关键历史状态并通过自适应融合门进行整合,增强循环模型的时空表征能力。我们将FB集成到PredRNN、PredRNNpp、MIM、MotionRNN和PredRNN-V2中,并在CIKM2017、Shanghai2020和SEVIR数据集上评估。实验结果表明,FB显著改善了MSE、MAE、SSIM和CSI指标,特别是对于高强度降雨和长序列预测,同时减少了虚警和漏报,增强了时间一致性和空间定位。所提方法提供了一种通用且高效的记忆增强机制,提升了基于循环的降水临近预报模型的整体性能。

英文摘要

Accurate precipitation nowcasting is crucial for disaster mitigation and socio-economic planning, yet existing methods often struggle with false alarms, missed events, and long range dependency modeling at high spatiotemporal resolution. To address these challenges, we propose FlashBack Memory (FB), a module that dynamically retrieves key historical states and integrates them via an adaptive fusion gate, enhancing the spatiotemporal representation capability of recurrent-based models. We incorporate FB into PredRNN, PredRNNpp, MIM, MotionRNN, and PredRNN-V2, and evaluate on CIKM2017, Shanghai2020, and SEVIR datasets. Experimental results demonstrate that FB significantly improves MSE, MAE, SSIM, and CSI metrics, particularly for high-intensity rainfall and long-sequence predictions, while reducing false alarms and missed events and enhancing temporal consistency and spatial localization. The proposed method provides a general and efficient memory enhancement mechanism, improving the overall performance of recurrent-based precipitation nowcasting models.

2606.16353 2026-06-16 cs.CV cs.AI 新提交

What Should a Streaming Video Model Remember?

流式视频模型应该记住什么?

Haonan Ge, Yiwei Wang, Hang Wu, Yujun Cai

发表机构 * University of California, Santa Barbara(加州大学圣塔芭芭拉分校) University of California, Merced(加州大学默塞德分校) The University of Queensland(昆士兰大学)

AI总结 针对流式视频理解中固定记忆预算下的长程历史利用问题,提出选择性潜在记忆框架SelectStream,通过惊喜驱动自适应窗口、优先级保持合并和查询条件图推理三个机制,实现高效在线推理,在多个基准上取得领先性能。

详情
AI中文摘要

流式视频理解模型必须在持续流中的任意时刻回答查询,仅使用到目前为止观察到的内容,并在固定的记忆和计算预算下工作。现有方法通过添加记忆库、检索模块或视觉令牌压缩来保存长程历史。然而,强近期窗口基线表明,不加区分地注入历史可能会稀释当前场景感知,这表明关键挑战不在于是否使用记忆,而在于如何选择性分配记忆。我们将此形式化为预算受限的在线潜在证据分配,并提出SelectStream,一个选择性潜在记忆框架,该框架保持当前观察对冻结VLM直接可见,同时仅通过紧凑的、查询条件的证据预算暴露历史信息。三个协调机制控制何时写入、保留什么以及如何检索:惊喜驱动的自适应窗口、优先级保持合并以及固定容量潜在记忆图上的查询条件图推理。检索到的证据被校准并作为潜在令牌注入以生成答案,无需重放帧,上下文也不会随流长度增长。实验结果表明,SelectStream实现了强大的在线流式性能,并保持了通用视频理解能力,在StreamingBench上达到82.67%,在OVO-Bench上达到67.03%,在离线视频基准上平均准确率达到74.4%,同时优于强近期窗口基线和先前的流式记忆方法。

英文摘要

Streaming video understanding models must answer queries at any moment during an ongoing stream, using only what they have observed so far and under fixed memory and computation budgets. Existing methods address this by adding memory banks, retrieval modules, or visual token compression to preserve long-range history. However, strong recent-window baselines show that indiscriminate history injection can dilute current-scene perception, suggesting that the key challenge is not whether to use memory, but how to allocate it selectively. We formulate this as budgeted online latent evidence allocation and propose SelectStream, a selective latent-memory framework that keeps the current observation directly visible to a frozen VLM while exposing historical information only through a compact, query-conditioned evidence budget. Three coordinated mechanisms govern when to write, what to preserve, and how to retrieve: surprise-driven adaptive windowing, priority-preserving consolidation, and query-conditioned graph reasoning over a fixed-capacity latent memory graph. Retrieved evidence is calibrated and injected as latent tokens for answer generation, without replaying frames or growing the context with stream length. Experimental results show that SelectStream achieves strong online streaming performance and preserves general video understanding, reaching 82.67% on StreamingBench, 67.03% on OVO-Bench, and 74.4% average accuracy on offline video benchmarks, while outperforming strong recent-window baselines and prior streaming memory methods.

2503.06637 2026-06-16 cs.CV 版本更新

CLAD: Constrained Latent Action Diffusion for Vision-Language Procedure Planning

CLAD: 面向视觉-语言程序规划的约束潜在动作扩散模型

Lei Shi, Andreas Bulling

发表机构 * Machine Perception and Interaction group, Örebro University(机器感知与交互组,奥雷布罗大学) Collaborative Artificial Intelligence group, University of Stuttgart(协作人工智能组,斯图加特大学)

AI总结 提出CLAD模型,利用变分自编码器学习动作和观测的潜在表示作为约束,引导扩散模型生成动作,在三个数据集上大幅超越现有方法。

Comments Accepted at RO-MAN 2026

详情
AI中文摘要

我们提出CLAD,一种用于教学视频中视觉-语言程序规划的约束潜在动作扩散模型。程序规划是一项具有挑战性的任务,即在给定起始和目标状态的视觉观察时预测中间动作。然而,未来的交互式AI系统还必须能够使用多模态输入(例如,视觉观察与语言描述相结合)来规划程序。为了解决这一视觉-语言程序规划任务,我们的方法使用变分自编码器(VAE)学习动作和观测的潜在表示作为约束,并将其集成到扩散过程中。该方法利用了扩散模型的潜在空间已具有可用语义这一特点。我们使用潜在约束来引导扩散模型更好地生成动作。我们在流行的CrossTask、Coin和NIV数据集上进行了大量实验,结果表明我们的方法大幅优于最先进的方法。通过评估我们方法的消融版本,我们进一步表明,将VAE潜在空间中学习到的动作和观测表示集成进来,是这些性能改进的关键。

英文摘要

We propose CLAD, a Constrained Latent Action Diffusion model for vision-language procedure planning in instructional videos. Procedure planning is the challenging task of predicting intermediate actions given a visual observation of a start and a goal state. However, future interactive AI systems must also be able to plan procedures using multi-modal input, e.g., where visual observations are augmented with language descriptions. To tackle this vision-language procedure planning task, our method uses a Variational Autoencoder (VAE) to learn the latent representation of actions and observations as constraints and integrates them into the diffusion process. This approach exploits the fact that the latent space of diffusion models already has usable semantics. We use the latent constraints to steer the diffusion model to better generate actions. We report extensive experiments on the popular CrossTask, Coin, and NIV datasets and show that our method outperforms state-of-the-art methods by a large margin. By evaluating ablated versions of our method, we further show that the proposed integration of the action and observation representations learnt in the VAE latent space is key to these performance improvements.

2510.03584 2026-06-16 cs.CV 版本更新

FrameOracle: Learning What to See and How Much to See in Videos

FrameOracle: 学习在视频中看什么以及看多少

Chaoyu Li, Tianzhi Li, Fei Tao, Zhenyu Zhao, Ziqian Wu, Maozheng Zhao, Juntong Song, Cheng Niu, Pooyan Fazli

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出FrameOracle轻量模块,预测相关帧及其数量,通过课程学习从弱代理信号到强监督,在多个VLM和基准上以更少帧数保持或提升精度。

Comments ICML 2026. Project page: https://people-robots.github.io/frameoracle/

详情
AI中文摘要

视觉语言模型(VLM)推进了视频理解,但在严格的计算预算下运行,使得性能依赖于选择少量高质量帧子集。现有的帧采样策略,如均匀或固定预算选择,无法适应内容密度或任务复杂性的变化。为了解决这个问题,我们提出了FrameOracle,一个轻量级、即插即用的模块,它同时预测(1)哪些帧与给定查询最相关,以及(2)需要多少帧。FrameOracle通过一个课程进行训练,该课程从弱代理信号(如跨模态相似性)逐步过渡到更强的监督,使用FrameOracle-41K——第一个具有经过验证的关键帧注释(指定每个问题的最小足够帧数)的大规模视频问答数据集。在五个VLM和六个基准上的大量实验表明,FrameOracle将16帧输入减少到平均10.4帧,且无精度损失。当从64帧候选开始时,它将输入平均减少到13.9帧,同时将精度提高1.5%,实现了可扩展视频理解的最优效率-精度权衡。

英文摘要

Vision-language models (VLMs) advance video understanding but operate under tight computational budgets, making performance dependent on selecting a small, high-quality subset of frames. Existing frame sampling strategies, such as uniform or fixed-budget selection, fail to adapt to variations in content density or task complexity. To address this, we present FrameOracle, a lightweight, plug-and-play module that predicts both (1) which frames are most relevant to a given query and (2) how many frames are needed. FrameOracle is trained via a curriculum that progresses from weak proxy signals, such as cross-modal similarity, to stronger supervision with FrameOracle-41K, the first large-scale VideoQA dataset with validated keyframe annotations specifying minimal sufficient frames per question. Extensive experiments across five VLMs and six benchmarks show that FrameOracle reduces 16-frame inputs to an average of 10.4 frames without accuracy loss. When starting from 64-frame candidates, it reduces inputs to 13.9 frames on average while improving accuracy by 1.5%, achieving state-of-the-art efficiency-accuracy trade-offs for scalable video understanding.

2603.16970 2026-06-16 cs.CV cs.AI 版本更新

MAND: Modality-Aware Novelty Detection for Open-World Egocentric Activity Recognition

MAND: 面向开放世界自我中心活动识别的模态感知新颖性检测

Hyejeong Im, Wonseon Lim, Dae-Won Kim

发表机构 * Department of Computer Science and Engineering, Chung-Ang University(中央大学计算机科学与工程系)

AI总结 提出MAND框架,通过模态感知自适应评分和表示稳定训练,利用视觉和惯性模态互补信息,提升开放世界自我中心活动识别中的新颖性检测和已知类准确率。

详情
AI中文摘要

多模态自我中心活动识别整合视觉和惯性线索以实现鲁棒的第一人称行为理解。然而,在开放世界环境中部署此类系统需要检测新颖活动,同时从非平稳数据流中持续学习。现有方法依赖主融合logits进行新颖性评分,未充分利用各模态可用的互补证据。由于这些logits常被RGB主导,其他模态(尤其是IMU)的线索未被充分利用,且这种不平衡随着灾难性遗忘的累积而加剧。为解决此问题,我们提出MAND,一种用于多模态自我中心开放世界持续学习的模态感知框架。在推理时,模态感知自适应评分(MoAS)利用样本级可靠性自适应调整模态贡献,并通过偏差和分歧惩罚细化新颖性评分。在训练时,模态感知表示稳定训练(MoRST)通过模态特定头和模态级logits蒸馏保留每个模态在任务间的判别能力。在公开多模态自我中心基准上的实验表明,MAND一致地提升了新颖活动检测和已知类准确率,同时大幅降低FPR95,表明更可靠的开放世界识别。源代码见 https://github.com/HyeJeongIm/MAND。

英文摘要

Multimodal egocentric activity recognition integrates visual and inertial cues for robust first-person behavior understanding. However, deploying such systems in open-world environments requires detecting novel activities while continuously learning from non-stationary data streams. Existing methods rely on the main fused logits for novelty scoring, without fully exploiting the complementary evidence available from individual modalities. Because these logits are often dominated by RGB, cues from other modalities, particularly IMU, remain underutilized, and this imbalance worsens as catastrophic forgetting accumulates. To address this, we propose MAND, a modality-aware framework for multimodal egocentric open-world continual learning. At inference, Modality-aware Adaptive Scoring (MoAS) adaptively adjusts modality contributions using sample-wise reliability and refines novelty scoring with deviation and disagreement penalties. During training, Modality-aware Representation Stabilization Training (MoRST) preserves the discriminative capacity of each modality across tasks through modality-specific heads and modality-wise logit distillation. Experiments on a public multimodal egocentric benchmark show that MAND consistently improves novel activity detection and known-class accuracy while substantially reducing FPR95, indicating more reliable open-world recognition. The source code is available at https://github.com/HyeJeongIm/MAND.
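FPR95, mentioned in the abstract above, is usually read as the false positive rate at 95% true positive rate. The sketch below follows one common convention, which is an assumption here because conventions differ between papers: known-class samples are the positives, the threshold on the novelty score is set so that 95% of known samples are accepted, and FPR95 is the fraction of novel samples that are also accepted at that threshold.

```python
from typing import Sequence


def fpr_at_95_tpr(known_scores: Sequence[float],
                  novel_scores: Sequence[float]) -> float:
    """FPR at 95% TPR for novelty scores (higher = more novel).

    Assumed convention: a sample is accepted as 'known' when its novelty
    score is at or below the threshold.
    """
    if not known_scores or not novel_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(known_scores)
    # Smallest index count k such that k / n >= 0.95, in integer arithmetic.
    k = (95 * len(ranked) + 99) // 100
    threshold = ranked[k - 1]
    wrongly_accepted = sum(1 for s in novel_scores if s <= threshold)
    return wrongly_accepted / len(novel_scores)


if __name__ == "__main__":
    known = [float(i) for i in range(1, 101)]   # 1.0 ... 100.0
    novel = [90.0, 96.0, 120.0, 200.0]
    print(fpr_at_95_tpr(known, novel))
```

A lower value means fewer novel activities slip through as known classes while the detector still accepts nearly all genuinely known samples.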

2606.02506 2026-06-16 cs.CV 版本更新

Question-Aware Evidence Ledgers for Video Relational Reasoning

问题感知的证据账本用于视频关系推理

Yilin Ou, Mengshi Qi, Huadong Ma

发表机构 * State Key Laboratory of Networking and Switching Technology(网络与交换技术国家重点实验室)

AI总结 提出基于GPT-5.5视频QA求解器和问题感知证据账本的测试时推理流水线,通过显式化计数、空间、端点、视角和对话推理所需的目标、计数单位、参考帧及时间或空间范围,并利用外部工具作为证据源,最终在VRR-QA挑战上达到92.95%的整体准确率。

Comments Technical report for the VRR Challenge at the VideoLLMs Workshop, CVPR 2026

详情
AI中文摘要

VRR-QA挑战评估视频中的视觉关系推理,答案通常依赖于隐含的空间关系、事件边界、目标身份和对话上下文,而非单个显著帧。我们提出一个基于强GPT-5.5视频QA求解器和一组问题感知证据账本的测试时推理流水线。初始求解器从统一的视频表示回答每个问题,而路由账本被提示使所需目标、计数单位、参考帧以及时间或空间范围显式化,用于计数、空间、端点、视角和对话推理。外部工具如开放词汇检测、深度线索、成对裁剪、ASR和场景图账本仅用作证据源。保守门控保持当前答案,除非独立证据唯一支持不同选项。最终证据门控流水线在挑战测试集上达到92.95%的整体准确率和93.79%的宏平均准确率。

英文摘要

The VRR-QA challenge evaluates visual relational reasoning in videos, where answers often depend on implicit spatial relations, event boundaries, target identity, and dialogue context rather than a single salient frame. We present a test-time reasoning pipeline built around a strong GPT-5.5 video QA solver and a set of question-aware evidence ledgers. The initial solver answers each question from a uniform video representation, while routed ledgers are prompted to make the required targets, count units, reference frames, and temporal or spatial scope explicit for counting, spatial, endpoint, viewpoint, and dialogue reasoning. External tools such as open-vocabulary detection, depth cues, pair crops, ASR, and scene-graph ledgers are used only as evidence sources. A conservative gate keeps the current answer unless independent evidence uniquely supports a different option. The final evidence-gated pipeline achieves 92.95% overall accuracy and 93.79% macro accuracy on the challenge test split.
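The conservative gate in the last step can be sketched as a small decision rule. Here "uniquely supports" is interpreted as all evidence sources together pointing to exactly one option; this interpretation, the function name `conservative_gate`, and the data layout are assumptions for illustration and are not taken from the report.

```python
from typing import Dict, Set


def conservative_gate(current: str, evidence: Dict[str, Set[str]]) -> str:
    """Keep `current` unless the evidence supports exactly one other option.

    `evidence` maps a source name (e.g. "detector", "asr") to the set of
    answer options that source supports. Hypothetical structure.
    """
    supported: Set[str] = set()
    for options in evidence.values():
        supported |= options
    if len(supported) == 1:
        (candidate,) = supported
        if candidate != current:
            return candidate  # evidence is unambiguous and disagrees
    return current  # no evidence, conflicting evidence, or agreement


if __name__ == "__main__":
    print(conservative_gate("A", {"detector": {"B"}, "asr": {"B"}}))
    print(conservative_gate("A", {"detector": {"B"}, "asr": {"C"}}))
```

The asymmetry is deliberate: an answer from the strong base solver is overturned only when the tools agree, so noisy or conflicting tool output leaves the original answer in place.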

6. 生成式视觉与世界模型 50 篇

2606.14732 2026-06-16 cs.CV cs.AI cs.LG cs.MM 新提交

Steady-Forcing: Balancing Spatial Persistence and Motion Continuity in Long-Horizon Nature Video Diffusion

Steady-Forcing: 长时程自然视频扩散中空间持久性与运动连续性的平衡

Matiur Rahman Minar, Seunghun Oh, GangHyeon Jeong, Unsang Park

发表机构 * Department of Computer Science and Engineering, Sogang University(西江大学计算机科学与工程系) Department of Artificial Intelligence, Sogang University(西江大学人工智能系)

AI总结 提出Steady-Forcing框架,通过视觉锚点、运动记忆和蒸馏等技术,在长时程固定相机自然视频生成中平衡背景稳定与运动连续性,优于现有方法。

Comments Project page: https://minar09.github.io/steadyforcing/

详情
AI中文摘要

自回归视频扩散模型支持流式生成,但在长时程生成中常退化:静态场景布局漂移,而改善空间稳定性的机制往往抑制运动,导致水流、火焰或烟雾等自然流动停滞。我们研究了固定相机长时程自然视频生成中的这种稳定性-运动权衡,其中两种失败模式比移动相机设置更易区分。我们提出Steady-Forcing,一种结合持久视觉锚点(V-Sink)、指数移动平均运动记忆(EMA-Sink)、块相对时间编码、周期性缓存净化以及从Wan2.1-14B教师模型蒸馏(在任务聚焦配置下使用运动奖励先验)的记忆与训练框架。这些组件共同设计用于在数分钟的自回归生成中保持背景一致性,同时维持视觉上合理的流体动力学。在七个基线上的评估表明,Steady-Forcing改善了长时程背景一致性和成像质量,而盲用户研究显示更强的感知稳定性和运动连续性。基准评估进一步表明,通用的VBench聚合分数对固定相机伪影惩罚不足,同时将漂移引起的光流奖励为动态程度,而不直接惩罚纹理硬化或流动停滞——这激励了未来针对静态相机自然流动评估的任务特定基准。项目页面:https://minar09.github.io/steadyforcing/

英文摘要

Autoregressive video diffusion models enable streaming generation but often degrade over long rollouts: static scene layouts drift, while mechanisms that improve spatial stability tend to suppress motion, causing natural flows such as water, fire, or smoke to stagnate. We study this stability-motion trade-off in fixed-camera long-horizon nature video generation, where the two failure modes can be more clearly separated than in moving-camera settings. We propose Steady-Forcing, a memory and training framework combining a persistent visual anchor (V-Sink), an exponential moving-average motion memory (EMA-Sink), block-relative temporal encoding, periodic cache purification, and distillation from a Wan2.1-14B teacher with motion-rewarded priors under task-focused configurations. Together, these components are designed to preserve background identity while sustaining visually plausible fluid dynamics over multi-minute autoregressive rollouts. Evaluations across seven baselines show that Steady-Forcing improves long-horizon background consistency and imaging quality, while a blind user study indicates stronger perceived stability and motion continuity. The benchmark evaluation further suggests that generic VBench aggregate scores under-penalize fixed-camera artifacts and reward drift-induced optical flow as Dynamic Degree, while not directly penalizing texture hardening or flow stagnation, motivating future task-specific benchmarks for static-camera nature-flow evaluation. Project page: https://minar09.github.io/steadyforcing/
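The exponential moving-average memory named in the abstract can be illustrated with the generic EMA update rule. This shows only the textbook rule on a plain feature vector; how EMA-Sink is wired into the model's attention cache is not described in the abstract, and the `decay` value and function name are hypothetical.

```python
from typing import List, Sequence


def ema_update(memory: Sequence[float], new_feature: Sequence[float],
               decay: float = 0.9) -> List[float]:
    """Generic EMA rule: memory <- decay * memory + (1 - decay) * new_feature."""
    if len(memory) != len(new_feature):
        raise ValueError("memory and feature must have the same length")
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must lie in [0, 1]")
    return [decay * m + (1.0 - decay) * x
            for m, x in zip(memory, new_feature)]


if __name__ == "__main__":
    memory = [0.0, 0.0]
    for _ in range(2):
        # Each new chunk nudges the memory toward the latest motion feature.
        memory = ema_update(memory, [1.0, 2.0], decay=0.5)
    print(memory)
```

Unlike a fixed anchor, which never changes, an EMA memory keeps following recent content while older content fades geometrically; this is one plausible reason to pair the two when both a stable background and ongoing motion are wanted.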

2606.14746 2026-06-16 cs.CV 新提交

Style-CCL: Content-Preserving Style Transfer via Curriculum Continual Learning

Style-CCL:通过课程持续学习实现内容保持的风格迁移

Shiwen Zhang, Haoyuan Wang, Xianghao Zang, Haibin Huang, Chi Zhang, Xuelong Li

发表机构 * Institute of Artificial Intelligence (TeleAI), China Telecom(中国电信人工智能研究院)

AI总结 针对扩散变换器在风格迁移中内容与风格特征纠缠的问题,提出多阶段课程持续学习框架Style-CCL,通过从语义到纹理风格、从干净到合成数据的分阶段训练,并采用随机记忆排练防止灾难性遗忘,在风格相似性、内容一致性和美学质量上达到最优。

Comments code and models of QwenStyle are released at https://github.com/witcherofresearch/Qwen-Image-Style-Transfer/ and https://github.com/Tele-AI/TeleStyle/

详情
AI中文摘要

给定内容和风格参考,内容保持的风格迁移对于扩散变换器(DiT)仍然具有挑战性,因为内容和风格特征纠缠在一起。通过反向三元组合成流程构建百万级训练集,以及双分支风格-内容DiT(SC-DiT)——通过分离的ROPE嵌入和因果掩码解耦风格和内容,我们观察到这种在混合风格类别上的单阶段训练范式会导致语义风格占主导,阻碍纹理风格学习,并损害内容保持。为了解决这些问题,我们提出了Style-CCL,一个多阶段课程持续学习框架,从语义(简单)到纹理(困难)风格,从干净到合成数据训练SC-DiT,并在各阶段之间使用随机记忆排练以避免灾难性遗忘。大量实验表明,我们的Style-CCL在三个核心指标:风格相似性、内容一致性和美学质量上达到了最先进的性能。

英文摘要

Content-preserving style transfer, given content and style references, remains challenging for Diffusion Transformers (DiTs) due to entangled content and style features. Using a reverse triplet synthesis pipeline that builds a million-scale training set, and a dual-branch Style-Content DiT (SC-DiT) that decouples style and content via separate RoPE embeddings and causal masking, we observe that a one-stage training paradigm on mixed style categories causes semantic styles to dominate, hindering texture-style learning and harming content preservation. To address these issues, we propose Style-CCL, a Multi-Stage Curriculum Continual Learning framework that trains SC-DiT from semantic (easy) to texture (hard) styles, and from clean to synthetic data, with Random Memory Rehearsal across stages to avoid catastrophic forgetting. Extensive experiments demonstrate that Style-CCL achieves state-of-the-art performance on three core metrics: style similarity, content consistency, and aesthetic quality.

2606.14756 2026-06-16 cs.CV cs.AI cs.LG 新提交

Divide-and-Denoise: A Game-Theoretic Method for Fairly Composing Diffusion Models

分而除噪:一种公平组合扩散模型的博弈论方法

Abhi Gupta, Polina Barabanshchikova, Vikas Garg, Samuel Kaski, Tommi Jaakkola

发表机构 * Massachusetts Institute of Technology(麻省理工学院) University of Washington(华盛顿大学) University of Cambridge(剑桥大学)

AI总结 提出Divide-and-Denoise方法,通过公平分配博弈协调多个预训练扩散模型,在采样时划分区域并引导各模型去噪,解决模型主导或冲突问题,在条件图像生成中优于基线。

Comments Accepted as spotlight at ICML 2026

详情
AI中文摘要

大量预训练扩散模型为组合提供了机会。然而,组合多个模型存在一个模型主导或模型间相互冲突的风险。在此,我们提出Divide-and-Denoise,一种在采样过程中协调多个预训练扩散模型的方法。类似于管理专业劳动力,我们的方法在模型间创建了公平且高效的劳动分工。我们方法的核心是分配的概念,它定义了每个模型对含噪样本每个区域的责任。在每个时间步,我们通过以下步骤去噪:(i) 通过求解公平分配博弈更新分配,其中我们在公平约束下将样本划分为最大化总效用的区域,以及(ii) 使模型与这种分配对齐,引导每个模型在其分配区域内去噪。这导致了一个新的复合去噪过程,该过程与划分过程同步演化。我们在条件图像生成上评估了Divide-and-Denoise。在包括GenEval基准在内的多个质量指标上,我们的方法优于基线,并解决了常见失败情况,包括缺失对象和属性不匹配。实验表明,Divide-and-Denoise利用了每个模型的专业知识,同时不忽视任何其他模型。

英文摘要

The abundance of pre-trained diffusion models provides an opportunity for composition. Combining several models, however, runs the risk of one model dominating or models disagreeing with each other. Here, we propose Divide-and-Denoise, a method for coordinating multiple pre-trained diffusion models during sampling. Much like managing a specialized workforce, our method creates a fair but efficient division of labor across models. Central to our method is the notion of an allocation which defines the responsibility of each model to every region of the noisy sample. At every timestep, we then denoise by (i) updating the allocation by solving a fair division game, where we divide the sample into regions that maximize total utility under fairness constraints, and (ii) aligning the models with this allocation, where we guide each model to denoise within its assigned region. This leads to a new composite denoising process that evolves in tandem with a division process. We evaluate Divide-and-Denoise on conditional image generation. Across several quality metrics, including the GenEval benchmark, our method outperforms baselines and resolves common failures including missing objects and mismatched attributes. Experiments show that Divide-and-Denoise utilizes each model's expertise without neglecting any other model.

2606.14787 2026-06-16 cs.CV cs.CR 新提交

Vision-Encoder Behavioral Fingerprints of Image-to-Image Generative Models: A Training-Paradigm-Driven Taxonomy of Six Commercial APIs

图像到图像生成模型的视觉编码器行为指纹:基于训练范式的六个商业API分类

Hunter Hill

发表机构 * H. Hill

AI总结 通过内容自适应亚JND对抗扰动管道,对六个商业图像到图像AI系统进行测试,基于DINOv2 ViT-B/14令牌距离,发现编辑训练模型与采样时适配的T2I基模型在2D平面上形成两个不同的行为带。

详情
AI中文摘要

我们研究了六个生产级图像到图像AI系统(gpt-image-1、Gemini 2.5 Flash Image、Flux Kontext、SDXL img2img、SD3 img2img和Qwen Image Edit),采用内容自适应亚JND对抗扰动管道,通过冻结的DINOv2 ViT-B/14令牌距离与干净参考进行比较,对所有输出进行评分。在涵盖COCO照片、CelebA-HQ肖像和AI生成输入的3,588次调用语料库中,六个系统在2D(patch_mean, ssim_clean)平面上分为两个图像不变行为带:编辑训练模型(Flux Kontext、Qwen Edit、Gemini)聚集在一个紧密带中,而采样时适配的T2I基模型(SDXL、SD3、gpt-image-1)聚集在一个漂移带中。

英文摘要

We study six production image-to-image AI systems (gpt-image-1, Gemini 2.5 Flash Image, Flux Kontext, SDXL img2img, SD3 img2img, and Qwen Image Edit) under a content-adaptive sub-JND adversarial perturbation pipeline, scoring all outputs by frozen DINOv2 ViT-B/14 token distances against clean references. Across a 3,588-call corpus spanning COCO photographs, CelebA-HQ portraits, and AI-generated inputs, the six systems partition into two image-invariant behavioral bands on a 2D (patch_mean, ssim_clean) plane: edit-trained models (Flux Kontext, Qwen Edit, Gemini) cluster in a tight band, while T2I-base models adapted at sampling time (SDXL, SD3, gpt-image-1) cluster in a drift band.

2606.14792 2026-06-16 cs.CV cs.AI 新提交

Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

基于离散扩散模型的视觉-文本思维高效强化学习

Yoonjeon Kim, Yuhta Takida, Chieh-Hsin Lai, Eunho Yang, Yuki Mitsufuji

发表机构 * KAIST(韩国科学技术院) Sony AI(索尼AI) AITRICS Sony Group Corporation(索尼集团公司)

AI总结 提出用离散扩散模型替代自回归模型进行多模态强化学习,通过局部视觉编辑减少计算量,并设计分解奖励分配策略解决跨模态干扰问题。

详情
AI中文摘要

基于强化学习的后训练已被广泛采用,以在能够同时进行文本和图像生成的统一多模态模型中实现交错视觉和文本推理。然而,大多数现有方法建立在自回归统一模型上,在视觉推理过程中需要完整的图像再生。在这项工作中,我们证明多模态离散扩散模型是自回归模型在交错推理中进行强化学习的有效替代方案,因为它们能够通过局部视觉编辑而非完整的图像令牌再生来执行高效的视觉展开。与自回归基线相比,这使GRPO期间的展开计算减少了26.9%,且性能下降极小。尽管效率提高,我们发现联合奖励分配(在模态间使用共享奖励信号)在RL更新期间会在不相关的图像和文本令牌序列之间引入跨模态干扰。为解决此问题,我们提出分解奖励分配策略,该策略独立地为文本和视觉片段分配奖励。采用分解奖励分配后,我们的RL方法相比联合奖励分配提高了11.2%,相比基础模型提高了38.04%。

英文摘要

RL-based post-training has been widely adopted to enable interleaved visual and textual reasoning in unified multimodal models capable of both text and image generation. However, most existing approaches are built upon autoregressive (AR) unified models, which require full image regeneration during visual reasoning. In this work, we demonstrate that multimodal discrete diffusion models are effective alternatives to AR models for reinforcement learning in interleaved reasoning, owing to their ability to perform efficient visual rollouts via localized visual editing rather than full image-token regeneration. This reduces rollout computation during GRPO by 26.9% compared to AR baselines, with minimal performance drop. Despite the improved efficiency, we find that joint reward assignment, which employs a shared reward signal across modalities, introduces cross-modal interference between unrelated image and text token sequences during RL updates. To address this issue, we propose factorized reward assignment, a strategy that assigns rewards independently to text and vision segments. With factorized reward assignment, our RL approach achieves an 11.2% improvement over joint reward assignment and a 38.04% improvement over the base model.
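To make the contrast between joint and factorized reward assignment concrete, here is a minimal sketch in the spirit of GRPO-style group normalization. It assumes each rollout carries a separate text reward and image reward, and that advantages are normalized per modality within the group. The abstract does not specify these details, so the data layout, the function names, and the normalization choice are all hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def factorized_token_advantages(rollouts):
    """Give every token the advantage of its own modality.

    Each rollout is a dict with (hypothetical) keys:
      'modalities'   : list of 'text' / 'image', one entry per token
      'text_reward'  : scalar reward for the text segments
      'image_reward' : scalar reward for the image segments
    """
    text_adv = group_advantages([r["text_reward"] for r in rollouts])
    image_adv = group_advantages([r["image_reward"] for r in rollouts])
    per_token = []
    for rollout, t_adv, i_adv in zip(rollouts, text_adv, image_adv):
        per_token.append([t_adv if m == "text" else i_adv
                          for m in rollout["modalities"]])
    return per_token


group = [
    {"modalities": ["text", "image", "text"], "text_reward": 1.0, "image_reward": 0.0},
    {"modalities": ["text", "image", "text"], "text_reward": 0.0, "image_reward": 1.0},
]
advantages = factorized_token_advantages(group)
```

In the first rollout the text tokens receive a positive advantage and the image token a negative one. Under a single shared reward both would move in the same direction, which is the cross-modal interference the abstract describes.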

2606.14972 2026-06-16 cs.CV 新提交

ReGenHuman: Re-Generating Human Appearances for Realistic Full-Body Video Anonymization

ReGenHuman: 重新生成人体外观以实现逼真的全身视频匿名化

Adam Sun, Eshaan Barkataki, Arnold Milstein, Gordon Wetzstein, Ehsan Adeli

发表机构 * Stanford University(斯坦福大学)

AI总结 提出ReGenHuman,首个同时实现逼真、时间一致且天生匿名的全身视频匿名化流水线,采用“重新生成而非编辑”范式,利用结构条件微调视频扩散模型合成人体区域。

详情
AI中文摘要

匿名化以人为中心的视频数据是一个研究不足的问题。先前的匿名化技术要么以牺牲真实性和下游实用性为代价进行模糊或遮盖像素,要么以牺牲时间一致性为代价逐帧生成。我们引入了ReGenHuman,这是第一个同时实现逼真、时间一致且天生匿名的全身视频匿名化流水线。与过去直接遮盖或编辑输入的方法相反,我们提出了一种“重新生成,而非编辑”的范式。我们的方法将2D姿态、分割和单目深度组合成两个互补的条件流——StructAll和StructHuman,用于在野外人体视频上微调视频到视频的扩散骨干网络,完全从无身份的结构线索合成人体区域。我们在隐私、质量和实用性方面评估了我们的模型,并表明我们的ReGenHuman在所有三个轴上与当前基线相比实现了最佳权衡。我们进一步表明,我们的匿名化视频对于下游任务(包括视频问答)仍然有效。

英文摘要

Anonymizing human-centric video data is an understudied problem. Prior anonymization techniques either blur or redact pixels at the cost of realism and downstream utility, or generate frame-by-frame at the cost of temporal coherence. We introduce ReGenHuman, the first full-body video anonymization pipeline that is simultaneously realistic, temporally consistent, and anonymous by construction. Unlike past approaches that redact or edit the inputs directly, we propose a "regenerate, don't edit" paradigm. Our approach composites 2D pose, segmentation, and monocular depth into two complementary conditioning streams, StructAll and StructHuman, which are used to fine-tune a video-to-video diffusion backbone on in-the-wild human videos, synthesizing the human regions entirely from identity-free structural cues. We evaluate our model on privacy, quality, and utility, and show that ReGenHuman achieves the best tradeoff across all three axes against current baselines. We further show that our anonymized videos remain effective for downstream tasks, including video question answering.

2606.15015 2026-06-16 cs.CV cs.AI 新提交

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

NEXUS: 用于物理一致的高接触3D物体动力学的神经能量场

Qizhen Ying, Guangming Wang, Yangchen Pan, Victor Adrian Prisacariu, Yixiong Jing

发表机构 * University of Oxford(牛津大学) University of Cambridge(剑桥大学)

AI总结 提出神经能量场框架NEXUS,通过标量能量和耗散项建模保守与非保守动力学,提升高接触3D场景下的长时程轨迹精度并指导视频生成。

Comments 18 pages, 4 figures, 6 tables. Preprint

详情
AI中文摘要

基于物理的视频生成需要可控的3D物体动力学,这些动力学在接触、变形和外力作用下保持物理一致性。现有的基于轨迹的方法通常建模孤立的物理效应,难以在高接触3D场景中组合保守和非保守动力学。我们提出NEXUS,一个用于高接触3D物体动力学的神经能量场框架。NEXUS将每个物体表示为结构图,并构建动态的物体-物体和物体-环境接触图。受哈密顿神经网络启发,NEXUS通过标量能量和耗散项而非直接预测状态或加速度来公式化运动。保守效应(包括重力和弹性变形)被组合为加性能量项,而非保守效应(如阻尼和冲击引起的能量损失)则通过学习的瑞利型耗散建模。力通过对能量和耗散函数求导得到,并通过多子步半隐式积分器进行演化。在受控轨迹基准测试中,NEXUS在不同力学属性和物理效应组合下,相较于代表性的学习和物理结构化动力学基线,提高了长时程精度。我们进一步展示NEXUS轨迹为高接触视频生成提供了有效指导,在保持竞争性视觉质量的同时提高了物理合理性。

英文摘要

Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forcing. Existing trajectory-based methods often model isolated physical effects, making it difficult to compose conservative and non-conservative dynamics in contact-rich 3D scenes. We present NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and constructs dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are derived by differentiating the energy and dissipation functions and rolled out with a multi-substep semi-implicit integrator. Across controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines under varying mechanical properties and physical-effect compositions. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.
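The energy-based formulation can be illustrated on a toy system. The sketch below is not NEXUS: it uses a hand-written potential (gravity plus a linear spring) and a fixed Rayleigh dissipation function for a single 1D particle, whereas the paper learns these terms on object and contact graphs. It only shows the general recipe the abstract describes: derive the force by differentiating the energy and dissipation functions, then roll out with a multi-substep semi-implicit integrator. All constants and function names are assumptions for this example.

```python
MASS = 1.0        # kg
GRAVITY = 9.81    # m/s^2
STIFFNESS = 50.0  # N/m, spring anchored at x = 0
DAMPING = 0.5     # Rayleigh coefficient


def potential_energy(x):
    """Conservative part: gravity and elastic energy added together."""
    return MASS * GRAVITY * x + 0.5 * STIFFNESS * x * x


def rayleigh_dissipation(v):
    """Non-conservative part: Rayleigh function R(v) = c * v^2 / 2."""
    return 0.5 * DAMPING * v * v


def derivative(f, z, h=1e-5):
    """Central finite difference, standing in for automatic differentiation."""
    return (f(z + h) - f(z - h)) / (2.0 * h)


def step(x, v, dt, substeps=4):
    """Semi-implicit (symplectic) Euler: update velocity first, then position."""
    h = dt / substeps
    for _ in range(substeps):
        force = -derivative(potential_energy, x) - derivative(rayleigh_dissipation, v)
        v = v + h * force / MASS
        x = x + h * v  # uses the already-updated velocity
    return x, v


def total_energy(x, v):
    return 0.5 * MASS * v * v + potential_energy(x)


x, v = 0.2, 0.0
for _ in range(2000):  # 20 simulated seconds at dt = 0.01
    x, v = step(x, v, 0.01)
```

Because the dissipation term only removes energy, the particle settles near the static equilibrium `-MASS * GRAVITY / STIFFNESS`, where the gravity and spring forces cancel.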

2606.15158 2026-06-16 cs.CV 新提交

RefGC-SR$^2$: Reference-guided Generated Content Super-Resolution and Refinement

RefGC-SR$^2$: 参考引导的生成内容超分辨率与精炼

Jeahun Sung, Dahyeon Kye, Soo Ye Kim, Jihyong Oh

发表机构 * CMLab, Chung-Ang University(Chung-Ang 大学 CMLab) Adobe Research(Adobe 研究院)

AI总结 针对生成管道中参考图像下采样导致细节丢失和伪影问题,提出RefGC-SR$^2$任务,利用原始高分辨率参考图像同时恢复细节、精炼伪影并提升分辨率,构建首个真实三元组数据生成管道并设计频率感知扩散Transformer模型。

Comments The first two authors contributed equally to this work. The last two authors are co-corresponding authors. Please visit our project page at https://cmlab-korea.github.io/RefGC-SR2/

详情
AI中文摘要

参考引导生成(例如,对象合成、定制)已快速发展,但当前管道存在一个根本性限制:用户提供的以对象为中心的高分辨率参考图像(HRRI)在输入模型前被下采样到固定的低分辨率(LR),因此细粒度细节在输出生成之前就被丢弃。此外,生成步骤在此基础上引入其自身的伪影(例如,身份扭曲)。现有的参考引导生成内容精炼(RefGCR)方法可以纠正部分伪影,但仍操作在LR域;参考引导超分辨率(RefSR)方法恢复分辨率但假设自然图像退化,并忽略生成管道的伪影分布。为在一个统一公式中解决这两个空白,我们引入一个新任务:参考引导的生成内容超分辨率-精炼(RefGC-SR$^2$),其中原始HRRI在后处理阶段被重用,以同时恢复丢失的细节、精炼生成伪影并放大输出。我们为此RefGC-SR$^2$任务构建了首个真实世界三元组数据生成管道,训练一个双联条件生成器来合成公开预训练模型无法提供的配对低质量锚点。我们进一步提出一个用于RefGC-SR$^2$的频率感知扩散Transformer模型,该模型从HRRI中选择性地注入精细细节,同时去除生成伪影。大量实验表明,我们的RefGC-SR$^2$模型成功(i)相对于参考图像忠实地精炼对象身份,以及(ii)恢复高分辨率细节,使得最终结果相比现有RefGCR和RefSR基线在质量上显著更高,且实际使用性更强。

英文摘要

Reference-guided generation (e.g., object compositing, customization) has progressed rapidly, yet current pipelines share a fundamental limitation: the object-centric high-resolution reference image (HRRI) provided by users is downsampled to a fixed low resolution (LR) before being fed into the model, so the fine-grained details are discarded before the output is even produced. In addition, the generation step then introduces its own artifacts (e.g., identity distortion) on top of this loss. Existing reference-guided generated content refinement (RefGCR) methods can correct some of these artifacts but still operate in the LR domain; reference-guided super-resolution (RefSR) methods recover resolution but assume natural-image degradations and ignore the artifact distribution of generative pipelines. To address both gaps in a single formulation, we introduce a new task: reference-guided generated content super-resolution-refinement (RefGC-SR$^2$), where the original HRRI is reused at the post-processing stage to recover lost details, refine generative artifacts, and upscale the output simultaneously. We construct the first real-world triplet data generation pipeline for this RefGC-SR$^2$ task, training a diptych-conditioned generator to synthesize paired low-quality anchors that public pretrained models cannot provide. We further present a frequency-aware diffusion transformer model for RefGC-SR$^2$ that selectively injects fine details from the HRRI while removing generative artifacts. Extensive experiments demonstrate that our RefGC-SR$^2$ model successfully (i) refines the object identity faithfully with respect to the reference, and (ii) recovers high-resolution details, so that the final result is of significantly higher quality and more practically usable than existing RefGCR and RefSR baselines.

2606.15162 2026-06-16 cs.CV 新提交

GeoStream: Toward Precise Camera Controlled Streaming Video Generation

GeoStream:迈向精确相机控制的流式视频生成

Yizhou Zhao, Yifan Wang, Xiaoyuan Wang, Yushu Wu, Hao Zhang, Moayed Haji-Ali, Rameen Abdal, Ashkan Mirzaei, Yanyu Li, Willi Menapace, Laszlo Jeni, Sergey Tulyakov, Peter Wonka, Chaoyang Wang

发表机构 * CMU(卡内基梅隆大学) Northeastern University(东北大学) UIUC(伊利诺伊大学厄巴纳-香槟分校) Rice University(莱斯大学) Snap Inc.(Snap公司) KAUST(阿卜杜拉国王科技大学)

AI总结 提出GeoStream框架,通过自刷新3D缓存和在线策略蒸馏,实现自回归流式视频生成中的精确度量级相机控制,解决了现有方法在视角移动时控制失效和分布偏移问题。

详情
AI中文摘要

精确的交互式相机控制对于基于视频的世界模型至关重要,但大多数现有方法隐式学习相机运动,导致在分布外轨迹下控制不准确。显式几何条件化提高了可控性,但现有方法是非自回归的,依赖于从初始帧构建的静态3D缓存,一旦视点超出原始视锥体,该缓存就会失效。我们提出GeoStream,一个在自回归流式视频生成中实现精确度量级相机控制的框架。我们的方法维护一个自刷新3D缓存,该缓存从模型自身的输出中定期在线更新:我们从最新生成的帧估计深度,反投影到3D,再投影到目标视图,生成点重投影作为后续合成的几何条件。基于相同原理,训练期间看到的条件也从学生自身生成的帧中渲染,产生完全在策略的蒸馏,自然对齐训练和推理条件分布。与先前使用离策略条件噪声的工作不同,我们的方法针对模型在推理时遇到的确切误差分布进行训练,既缓解了标准自回归漂移,也缓解了当缓存本身来自生成输出时出现的二阶几何反馈循环。定量和定性结果表明,我们的方法显著提高了相机可控性。

英文摘要

Accurate interactive camera control is essential for video-based world models, but most existing approaches learn camera motion implicitly, leading to inaccurate control under out-of-distribution trajectories. Explicit geometric conditioning improves controllability, but existing methods are non-autoregressive and rely on a static 3D cache built from an initial frame, which becomes ineffective once the viewpoint moves beyond the original frustum. We propose GeoStream, a framework that enables precise metric-scale camera control in autoregressive streaming video generation. Our method maintains a self-refreshing 3D cache that is periodically updated online from the model's own outputs: we estimate depth from the most recently generated frame, unproject to 3D, and reproject into the target view to produce point reprojections as geometric conditioning for subsequent synthesis. By the same principle, the conditioning seen during training is also rendered from the student's own generated frames, yielding a fully on-policy distillation that naturally aligns the train and inference conditioning distributions. Unlike prior work that uses off-policy condition noising, our approach trains the model against the exact error distribution it encounters at inference, mitigating both standard autoregressive drift and the second-order geometric feedback loop that arises when the cache itself is derived from generated outputs. Quantitative and qualitative results show that our approach substantially improves camera controllability.
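The geometric core of the cache refresh (estimate depth, unproject to 3D, reproject into the target view) can be sketched for a single pixel with a pinhole camera model. This is a generic illustration under stated assumptions, not GeoStream's implementation: the intrinsics tuple `(fx, fy, cx, cy)`, the convention that `R` and `t` map source-camera coordinates to target-camera coordinates, and all function names are chosen for this example.

```python
def unproject(u, v, depth, intrinsics):
    """Lift pixel (u, v) with metric depth to a 3D point in the source camera frame."""
    fx, fy, cx, cy = intrinsics
    return ((u - cx) / fx * depth, (v - cy) / fy * depth, depth)


def transform(point, R, t):
    """Apply a rigid transform: R (3x3 nested lists) followed by translation t."""
    return tuple(sum(R[i][j] * point[j] for j in range(3)) + t[i]
                 for i in range(3))


def project(point, intrinsics):
    """Project a 3D point in the target camera frame to pixel coordinates."""
    fx, fy, cx, cy = intrinsics
    x, y, z = point
    if z <= 0:
        return None  # behind the camera, so not visible
    return (fx * x / z + cx, fy * y / z + cy)


def reproject_pixel(u, v, depth, intrinsics, R, t):
    """Source pixel plus depth -> pixel location in the target view."""
    return project(transform(unproject(u, v, depth, intrinsics), R, t), intrinsics)


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = (500.0, 500.0, 320.0, 240.0)  # fx, fy, cx, cy (made-up values)

# The target camera sits 0.1 m to the right of the source camera, so points
# shift by -0.1 m along x when expressed in the target frame.
target_pixel = reproject_pixel(320.0, 240.0, 2.0, K, IDENTITY, (-0.1, 0.0, 0.0))
```

For a point at depth 2 m and a focal length of 500 px, the 0.1 m sideways move shifts the pixel by 500 * 0.1 / 2 = 25 px to the left. This dependence on metric depth is why an accurate depth estimate matters for metric-scale control.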

2606.15188 2026-06-16 cs.CV 新提交

Adaptive Inference-Time Scaling via Early-Step Latent Verification for Image Editing

自适应推理时间缩放:基于早期步骤潜在验证的图像编辑

Yue Yu, Yang Jiao, Jiayu Wang, Qi Dai, Jingjing Chen

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University(复旦大学计算机科学与人工智能学院) Microsoft Research Asia(微软亚洲研究院) Institute of Trustworthy Embodied AI, Fudan University(复旦大学可信具身人工智能研究所)

AI总结 提出VeriLatent框架,通过早期步骤潜在空间编辑激活图验证初始噪声,实现自适应推理时间缩放,提升图像编辑质量和效率。

详情
AI中文摘要

基于指令的图像编辑随着生成模型的最新进展取得了显著进步。然而,编辑结果的质量仍受随机采样的初始噪声影响,特别是在复杂编辑场景中。不合适的初始噪声可能导致不满意的编辑结果。最近的推理时间缩放方法通过采样多个初始噪声并选择更好的候选者来解决这一问题。然而,大多数方法遵循解码-验证方案,引入了效率与准确性的权衡。当在有限的推理步骤后进行解码时,解码后的图像通常噪声过大,无法进行可靠评估,而充分去噪的图像则需要更高的计算成本。为了解决这个问题,我们提出了VeriLatent,一种即插即用的自适应推理时间缩放框架,用于图像编辑的早期步骤潜在验证。具体来说,我们提出了一种新颖的验证器,通过在早期阶段通过潜在空间编辑激活图对每个初始噪声进行评分。它通过评估候选者是否能在正确区域引发有效编辑来识别有希望的候选者。这使得无需将潜在变量解码为图像即可进行高效的早期剪枝。在此基础上,我们进一步开发了一种用于推理时间缩放的自适应搜索策略。它根据编辑难度分配推理预算,从而减少函数评估次数(NFE)。在多个基准测试和不同基础模型上的大量实验表明,VeriLatent持续提高了编辑性能和推理时间缩放效率。

英文摘要

Instruction-based image editing has made notable progress with recent advances in generative models. However, the quality of the edited result is still influenced by the randomly sampled initial noise, particularly in complex editing scenarios. An unsuitable initial noise may lead to unsatisfactory editing results. Recent inference-time scaling methods address this issue by sampling multiple initial noises and selecting better candidates. Nevertheless, most of them follow a decode-then-verify scheme which introduces an efficiency-accuracy trade-off. When decoding is performed after limited inference steps, the decoded images often remain too noisy for reliable assessment, whereas sufficiently denoised images require much higher computational cost. To address this issue, we propose VeriLatent, a plug-and-play adaptive inference-time scaling framework with early-step latent verification for image editing. Specifically, we propose a novel verifier that scores each initial noise through a latent-space editing activation map at an early stage. It identifies promising candidates by assessing whether they can induce an effective edit in the correct region. This enables efficient early pruning without decoding latents into images. Building on this, we further develop an adaptive search strategy for inference-time scaling. It allocates inference budgets according to editing difficulty, thereby reducing the number of function evaluations (NFE). Extensive experiments on multiple benchmarks and different base models demonstrate that VeriLatent consistently improves both editing performance and inference-time scaling efficiency.

2606.15389 2026-06-16 cs.CV 新提交

Timestep Rescheduling in Diffusion Inversion

扩散反演中的时间步重调度

Shangquan Sun, Ting Gong, Zhirui Liu, Jiamin Wu, Runkai Zhao, Mianxin Liu, Wenqi Ren, Xiaochun Cao

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 针对扩散反演中时间步选择影响反演精度的问题,提出一种基于全局重缩放和局部动态规划的非均匀时间步调度器,有效降低反演误差,提升图像重建与编辑性能。

Comments Accepted by ICML 2026. 23 pages, including appendices

详情
AI中文摘要

扩散反演将图像映射回扩散模型的高斯潜在空间,是图像重建和编辑的关键任务。虽然DDIM实现了快速确定性反演,但它固有地引入了累积为明显反演误差的偏差。现有方法通常通过求解不动点问题来解决这一问题,但很大程度上忽略了噪声调度器中扩散时间步的选择如何影响反演保真度。在这项工作中,我们揭示了扩散反演中的偏差尺度强烈依赖于时间步大小,并呈现出抛物线趋势,较大的误差集中在较小和较大的时间步。基于这一发现,我们提出了一种简单而有效的非均匀时间步调度器,该调度器集成了全局重缩放和基于局部动态规划的重调度,实现了计算资源的战略分配,从而最小化整体反演误差并保持更高的反演精度。我们的方法可作为现有反演技术的即插即用增强,无需额外参数或计算开销。通过大量实验,我们验证了集成我们的调度器能够持续提升现有反演方法的性能,在图像重建和编辑中取得更优结果。

英文摘要

Diffusion inversion, which maps images back to the Gaussian latent space of a diffusion model, is a critical task for image reconstruction and editing. While DDIM enables fast deterministic inversion, it inherently introduces deviations that accumulate into noticeable inversion errors. Existing methods often address this by solving a fixed-point problem but largely overlook how the selection of the diffusion timestep in the noise scheduler influences inversion fidelity. In this work, we reveal that the deviation scale in diffusion inversion is strongly dependent on the timestep size, and exhibits a parabolic trend, with larger errors concentrated at both small and large timesteps. Based on this finding, we propose a simple yet effective nonuniform timestep scheduler that integrates a global rescaling with a local dynamic programming based rescheduling, enabling a strategic allocation of computational effort that minimizes the overall inversion error and preserves higher inversion accuracy. Our method serves as an off-the-shelf enhancement for existing inversion techniques and requires no extra parameters or computational overhead. Through extensive experiments, we verify that integrating our scheduler consistently boosts the performance of existing inversion methods, achieving superior results in image reconstruction and editing.
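The abstract mentions a dynamic-programming-based rescheduling step without giving its objective. As a generic illustration of how a nonuniform schedule can be chosen by dynamic programming, the sketch below picks a fixed number of jumps from timestep 0 to `num_timesteps` that minimizes a sum of per-jump costs. The cost functions here are stand-ins invented for the example: `parabolic_cost` simply weights jumps more heavily near both ends of the range, loosely mirroring the parabolic error trend described above. Neither the objective nor the function names come from the paper.

```python
def schedule_timesteps(num_timesteps, num_steps, cost):
    """Choose num_steps jumps from 0 to num_timesteps minimizing the total cost.

    cost(i, j) is the (hypothetical) error of one jump from timestep i to j > i.
    Returns (path, total_cost), where path lists the visited timesteps.
    """
    if not 1 <= num_steps <= num_timesteps:
        raise ValueError("need 1 <= num_steps <= num_timesteps")
    inf = float("inf")
    T = num_timesteps
    # best[k][j]: minimal cost of reaching timestep j with exactly k jumps.
    best = [[inf] * (T + 1) for _ in range(num_steps + 1)]
    prev = [[-1] * (T + 1) for _ in range(num_steps + 1)]
    best[0][0] = 0.0
    for k in range(1, num_steps + 1):
        for j in range(1, T + 1):
            for i in range(j):
                if best[k - 1][i] == inf:
                    continue
                candidate = best[k - 1][i] + cost(i, j)
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    prev[k][j] = i
    # Walk the predecessor table backwards to recover the chosen timesteps.
    path = [T]
    j = T
    for k in range(num_steps, 0, -1):
        j = prev[k][j]
        path.append(j)
    path.reverse()
    return path, best[num_steps][T]


def parabolic_cost(i, j, num_timesteps=12):
    """Stand-in cost: squared jump length, weighted more heavily near both ends."""
    mid = (i + j) / (2.0 * num_timesteps)
    weight = 1.0 + 8.0 * (mid - 0.5) ** 2
    return weight * (j - i) ** 2


uniform_path, uniform_cost = schedule_timesteps(12, 4, lambda i, j: (j - i) ** 2)
weighted_path, weighted_cost = schedule_timesteps(12, 4, parabolic_cost)
```

With a plain squared-length cost the optimum is the uniform schedule `[0, 3, 6, 9, 12]`. A position-dependent cost moves the intermediate timesteps away from the uniform grid, which is the kind of reallocation a nonuniform scheduler aims for.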

2606.15534 2026-06-16 cs.CV 新提交

Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks

Track2View: 通过配对3D点轨迹实现4D一致的相机控制视频生成

Feng Qiao, Zhaochong An, Zhexiao Xiong, Serge Belongie, Nathan Jacobs

发表机构 * Washington University in St. Louis(圣路易斯华盛顿大学) University of Copenhagen(哥本哈根大学)

AI总结 提出Track2View,利用配对3D点轨迹为视频扩散Transformer提供显式时空对应,实现新视角视频渲染,在视觉质量、视角同步和相机精度上达到最先进水平。

详情
AI中文摘要

从新相机视角重新渲染现有视频需要输出遵循规定的相机轨迹,同时保持原始场景每一帧的外观和动态。现有方法依赖于每帧姿态嵌入、噪声点云渲染或隐式学习对应关系,这些方法都没有提供源像素和目标像素之间的显式、时间连续链接。我们提出Track2View,它以配对3D点轨迹作为视频扩散Transformer的条件:即同时投影到源相机视图和目标相机视图中的场景点稀疏轨迹。这些轨迹提供了显式的时空对应关系,在构造上是时间连续的,编码了内容应在何时何地出现。Track2View的核心是一个双视图轨迹调节器,通过无参数几何操作和学习的时间聚合将视觉上下文从源视图转移到目标视图,确保对任意相机轨迹的泛化能力,而无需记忆特定运动。我们进一步引入了一个数据整理流程,通过在时间上连接的多相机视图对上运行3D点跟踪器来提取一对一的轨迹对应关系。在一个包含静态和动态场景的400视频基准测试中,Track2View在视觉质量、视角同步和相机精度方面取得了最先进的结果,相对于领先基线,旋转误差减少了30-65%,平移误差减少了61-72%。项目页面可访问:https://qjizhi.github.io/track2view

英文摘要

Re-rendering an existing video from a novel camera viewpoint requires the output to follow the prescribed camera trajectory while preserving the appearance and dynamics of the original scene across every frame. Existing methods rely on per-frame pose embeddings, noisy point-cloud renderings, or implicit learned correspondences, none of which provides an explicit, temporally continuous link between source and target pixels. We propose Track2View, which conditions a video diffusion transformer on paired 3D point tracks: sparse trajectories of scene points projected into both the source and target camera views. These tracks provide explicit spatiotemporal correspondences that are temporally continuous by construction, encoding what content should appear where and when. At the core of Track2View is a dual-view track conditioner that transfers visual context from source to target view through parameter-free geometric operations and learned temporal aggregation, ensuring generalization to arbitrary camera trajectories without memorizing specific motions. We further introduce a data curation pipeline that extracts one-to-one track correspondences by running a 3D point tracker on temporally concatenated multi-camera view pairs. On a 400-video benchmark spanning static and dynamic scenes, Track2View achieves state-of-the-art results across visual quality, view synchronization, and camera accuracy, reducing rotation error by 30-65% and translation error by 61-72% relative to leading baselines. Project page: https://qjizhi.github.io/track2view

2606.15592 2026-06-16 cs.CV 新提交

DenseControl: Instance-Level Controllable Synthesis of Dense Crowd Image

DenseControl: 密集人群图像的实例级可控合成

Juncheng Wang, Lei Shang, Wang Lu, Baigui Sun, Shujun Wang

发表机构 * the Hong Kong Polytechnic University(香港理工大学) Tongyi lab, Alibaba Group(阿里巴巴集团通义实验室) Tsinghua University(清华大学)

AI总结 提出DenseControl管道,通过隔离对象嵌入图和隐式尺度嵌入策略,实现密集人群图像中实例位置、大小、背景、风格和属性的精确控制,在合成质量和下游应用中达到最优。

Comments Accepted to IEEE TMM

详情
AI中文摘要

在本文中,我们介绍了DenseControl,一种用于生成密集人群图像的新型管道。具体来说,DenseControl精心定位和缩放每个生成的实例,以精确对齐预定义的坐标和尺度。在此基础上,我们进一步允许控制背景、风格和实例属性。DenseControl的动机源于对合成人群图像中两个主要挑战的观察:控制信号嵌入和在传递实例尺度指导时保持拓扑完整性。为了解决这些问题,我们首先引入了隔离对象嵌入(IOE)图,这是一种新颖的表示,有助于空间位置控制,同时减轻模型学习投影的困难。其次,我们提出了一种隐式尺度嵌入(ISE)策略,该策略与IOE图无缝集成,以编码精确的尺度信息。为了进一步增强ISE与IOE图结合的效果,我们引入了一种位置快捷机制,增强交叉注意力以缓解投影挑战。我们通过两个角度评估DenseControl:合成质量和在潜在应用中的适用性。不同控制条件下的实验表明,DenseControl在密集人群图像合成中达到了最先进的结果。此外,我们展示了在数据稀缺下增强人群分析、迁移学习和天气泛化场景中的应用,以突出DenseControl的实际效用。代码库将发布。

英文摘要

In this paper, we introduce DenseControl, a novel pipeline for generating dense crowd images. Specifically, DenseControl positions and sizes each generated instance to align precisely with predefined coordinates and scales. On top of this, we further allow control over the background, style, and attributes of instances. DenseControl is motivated by two main challenges in synthesizing crowd images: control-signal embedding, and maintaining topological integrity when imparting instance-scale guidance. To address these, we first introduce the Isolated Object Embedding (IOE) map, a novel representation that facilitates spatial location control while mitigating the difficulty the model has in learning projections. Second, we propose an Implicit Scale Embedding (ISE) strategy that integrates seamlessly with the IOE map to encode precise scale information. To further improve the combination of ISE with the IOE map, we incorporate a Position Shortcut mechanism that enhances cross-attention to alleviate projection challenges. We evaluate DenseControl through two lenses: synthesis quality and applicability in potential applications. Experiments across different control conditions demonstrate that DenseControl achieves state-of-the-art results in dense crowd image synthesis. Furthermore, we showcase applications in augmenting crowd analysis under data scarcity, transfer learning, and weather generalization scenarios to highlight the practical utility of DenseControl. The codebase will be released.

2606.15796 2026-06-16 cs.CV cs.AI 新提交

DifFRACT: Diffusion Feature Reconstruction and Attribution for Circuit Tracing

DifFRACT:用于电路追踪的扩散特征重构与归因

Artyom Mazur, Nina Konovalova, Aibek Alanov

发表机构 * HSE University(高等经济学院) FusionBrain Lab(FusionBrain实验室)

AI总结 本文扩展了基于转码器的电路追踪方法到多模态扩散Transformer,通过训练时间步条件转码器近似MLP子层,实现精确的特征级归因并恢复可解释电路,揭示了属性绑定和跨流语义传播机制。

详情
AI中文摘要

机械可解释性旨在通过将模型计算分解为可解释特征和电路来解释神经网络行为。虽然基于转码器的电路追踪最近已实现对大型语言模型的详细因果分析,但用于图像生成的多模态扩散Transformer仍然相对不透明。我们仍然缺乏理解语义信息如何在去噪步骤间传播以及文本和图像表示如何在双流MM-DiT架构中交互的工具。现有方法仅提供部分洞察:注意力图揭示了token交互的有限视图,而稀疏自编码器可以发现可解释特征,但并未直接揭示这些特征如何通过非线性MLP层进行变换和组合。在这项工作中,我们将基于转码器的电路追踪扩展到多模态扩散Transformer。我们训练了时间步条件转码器,它们忠实地近似FLUX.1[schnell]中MLP子层的输入输出行为。通过用转码器替换MLP并线性化剩余计算,我们获得了精确的特征到特征归因,并恢复了紧凑、可解释的电路。实验上,我们的转码器在稀疏性-忠实度权衡上与稀疏自编码器相当或略优。得到的电路揭示了属性绑定和跨流语义传播背后的机制,并为系统性生成错误提供了因果解释。此外,基于电路的干预比标准的基于SAE的引导更加精确和有效。我们的结果表明,基于转码器的电路分析对于最先进的扩散Transformer是可行的,并为理解和控制多模态生成模型提供了强大的框架。代码可在https://github.com/Artalmaz31/DifFRACT获取。

英文摘要

Mechanistic interpretability seeks to explain neural network behavior by decomposing model computations into interpretable features and circuits. While transcoder-based circuit tracing has recently enabled detailed causal analyses of large language models, multimodal diffusion transformers for image generation remain comparatively opaque. We still lack tools for understanding how semantic information propagates across denoising steps and how text and image representations interact within double-stream MM-DiT architectures. Existing methods provide only partial insight: attention maps expose a limited view of token interactions, while sparse autoencoders can discover interpretable features but do not directly reveal how these features are transformed and composed through nonlinear MLP layers. In this work, we extend transcoder-based circuit tracing to multimodal diffusion transformers. We train timestep-conditioned transcoders that faithfully approximate the input-output behavior of MLP sublayers in FLUX.1[schnell]. By replacing MLPs with transcoders and linearizing the remaining computation, we obtain exact feature-to-feature attribution and recover compact, interpretable circuits. Empirically, our transcoders match or slightly outperform sparse autoencoders on the sparsity-faithfulness tradeoff. The resulting circuits reveal mechanisms underlying attribute binding and cross-stream semantic propagation, and provide causal explanations for systematic generation errors. Moreover, circuit-guided interventions are substantially more precise and effective than standard SAE-based steering. Our results demonstrate that transcoder-based circuit analysis is feasible for state-of-the-art diffusion transformers and provides a powerful framework for understanding and controlling multimodal generative models. The code is available at https://github.com/Artalmaz31/DifFRACT

2606.15819 2026-06-16 cs.CV cs.AI 新提交

SACE: Concept Erasure at the Semantic Singularity in Visual Autoregressive Models

SACE: 视觉自回归模型中的语义奇点概念擦除

Siya Yang, Nanxiang Jiang, Zhaoxin Fan, Yunfeng Diao

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University(中山大学计算机科学与工程学院) School of Computer Science and Technology, Beijing Institute of Technology(北京理工大学计算机科学与技术学院)

AI总结 针对视觉自回归模型应用现有擦除技术导致语义崩溃和视觉伪影的问题,提出语义奇点公理并通过增量语义显著性分析验证,进而引入首个尺度感知的概念擦除框架SACE,在首尺度耦合熵正则化擦除目标与恢复性保存损失,实现精确概念擦除。

详情
AI中文摘要

视觉自回归(VAR)模型的快速进步为高保真文本到图像合成开辟了变革性前沿,同时也加剧了对生成内容安全对齐的担忧。将现有擦除技术简单应用于VAR模型会导致灾难性的语义崩溃和视觉伪影,因为这些技术主要针对扩散模型的同质去噪步骤设计。为应对这一基础性挑战,我们首先提出语义奇点公理,该公理认为提示中嵌入的任何目标语义概念在Scale-0处被明确锁定。然后通过我们提出的增量语义显著性分析(ISSA)严格验证该公理,该分析还使社区能够透明地检查从粗到细的语义注入过程。在此洞察指导下,我们引入了首个针对VAR模型的尺度感知概念擦除框架(SACE)。通过将干预严格限制在首尺度,我们的方法耦合了熵正则化擦除目标以防止高熵采样退化,以及恢复性保存损失以安全锚定纠缠良性先验的完整性。大量实验表明,我们的方法在最小训练开销下实现了跨多个领域的手术式概念擦除性能,及时而优雅地解决了新兴VAR架构中固有的关键安全漏洞。代码可在 https://github.com/limerenceysy/SACE 获取。

英文摘要

The rapid progress of visual autoregressive (VAR) models has unlocked a transformative frontier for high-fidelity text-to-image synthesis, while heightening concerns over the safety alignment of generated content. Naive application of existing erasure techniques to VAR models causes catastrophic semantic collapse and visual artifacts, since they are predominantly designed for the homogeneous denoising steps of diffusion models. To address this foundational challenge, we first propose the Semantic Singularity Axiom, which posits that any target semantic concept embedded within a prompt is definitively locked at Scale-0. We then rigorously validate this axiom through our proposed Incremental Semantic Saliency Analysis (ISSA), which also enables the community to transparently inspect the coarse-to-fine semantic injection process. Guided by this insight, we introduce the first scale-aware concept erasure framework (SACE) for VAR models. By strictly confining interventions to the first scale, our approach couples an Entropy-Regularized Erasure Objective, which prevents high-entropy sampling degeneration, with a restorative preservation loss that safely anchors the integrity of entangled benign priors. Extensive experiments demonstrate that our method achieves surgical concept erasure performance across various domains with minimal training overhead, resolving in a timely and elegant manner the critical safety vulnerabilities inherent in emerging VAR architectures. Code is available at: https://github.com/limerenceysy/SACE

2606.15848 2026-06-16 cs.CV 新提交

EmoZone-Talker: Regional Semantic Control of Audio-Driven 3DGS Talking Heads via Facial Action Units

EmoZone-Talker: 基于面部动作单元的音频驱动3DGS说话人头部的区域语义控制

Tingting Chen, Shaojun Wang, Huaye Zhang, Diqiong Jiang, Chenglizhao Chen

发表机构 * China University of Petroleum (East China)(中国石油大学(华东))

AI总结 提出EmoZone-Talker框架,通过区域解耦和时序建模解决音频与表情信号的冲突,实现精细、可解释的面部表情控制。

详情
AI中文摘要

3D高斯泼溅(3DGS)在高保真说话头部合成方面显示出巨大潜力。然而,由于语音驱动的面部动态与显式表情信号之间的内在冲突,实现细粒度、可解释且可编辑的面部表情控制仍然具有根本性挑战。现有方法依赖隐式多模态融合,导致空间纠缠和时间不稳定性。我们提出EmoZone-Talker,一种新颖的框架,将音频驱动的面部动画重新表述为跨模态冲突下的结构化时空协调问题。我们的方法引入了面部运动的显式空间解缠和时序动态建模。具体来说,我们提出了具有优先注意力偏好的协同区域(SZ-PAB),通过解剖先验引导的区域约束显式解耦模态贡献,以及通道独立的时间AU编码器(CIT-AE)来建模时间连贯的AU动态。通过将这些表示集成到3D高斯变形中,EmoZone-Talker实现了对面部表情的精确和可解释控制。大量实验表明,我们的方法提高了表情可控性和真实感,在上脸准确性和时间连贯性方面取得了显著提升,同时保持了高渲染质量和准确的唇形同步。代码将公开发布以促进可重复性和进一步研究。

英文摘要

3D Gaussian Splatting (3DGS) has shown strong potential for high-fidelity talking head synthesis. However, enabling fine-grained, interpretable, and editable facial expression control remains fundamentally challenging due to intrinsic conflicts between speech-driven facial dynamics and explicit expression signals. Existing methods rely on implicit multimodal fusion, leading to spatial entanglement and temporal instability. We present EmoZone-Talker, a novel framework that reformulates audio-driven facial animation as a structured spatial-temporal coordination problem under cross-modal conflicts. Our approach introduces an explicit spatial disentanglement and temporal dynamics modeling of facial motion. Specifically, we propose Synergy Zones with Prioritized Attention Bias (SZ-PAB) to explicitly decouple modality contributions via region-wise constraints guided by anatomical priors, and a Channel-Independent Temporal AU Encoder (CIT-AE) to model temporally coherent AU dynamics. By integrating these representations into 3D Gaussian deformation, EmoZone-Talker enables precise and interpretable control over facial expressions. Extensive experiments demonstrate that our method improves expression controllability and realism, with notable gains in upper-face accuracy and temporal coherence, while preserving high rendering quality and accurate lip synchronization. Code will be publicly released to facilitate reproducibility and further research.

2606.15889 2026-06-16 cs.CV 新提交

SiGnature: Explicit Motion Diffusion for Stylized Semantic Gesture

SiGnature: 显式运动扩散用于风格化语义手势

Adi Rosenthal, Tomer Koren, Nadav Shaked, Doron Friedman, Ariel Shamir

发表机构 * Reichman University(赖希曼大学)

AI总结 提出SiGnature框架,通过显式关节旋转空间和免训练推理机制JMI,实现语义手势的精准控制与说话人风格的高保真保持,优于现有方法。

详情
AI中文摘要

虽然共语手势生成的最新进展已实现令人印象深刻的节奏同步,但生成既具有语义意义又忠实于说话人独特非语言风格的手势仍然是一个开放挑战。语义手势(如象形形状或指示性指向)在统计上稀疏,使其难以在标准生成模型中有效学习。我们提出SiGnature,一个用于风格化和语义手势生成的框架,它协调了精确的语义控制与高保真风格保持。与依赖纠缠潜在表示的流行方法不同,SiGnature在显式关节旋转空间中操作。这种设计实现了我们的核心贡献——联合运动集成(JMI),一种免训练推理机制,能够直接将任何外部运动序列(特别是野外语义手势)注入扩散过程。JMI自动识别传达语义动作的特定“活动关节”并将其注入生成,同时依赖扩散主干根据目标说话人预学习的风格合成剩余的身体动态(包括姿态和流畅度)。这使得无需重新训练或引入剪切粘贴方法典型的“弗兰肯斯坦”伪影,即可即插即用地集成任意运动(包括复杂语义手势)。大量实验和感知研究表明,SiGnature在保持流畅自然的共语手势生成和保留说话人独特特征的同时,提供了优越的语义运动控制,从而优于最先进的基线方法。

英文摘要

While recent advances in co-speech gesture generation have achieved impressive rhythmic synchronization, synthesizing gestures that are both semantically meaningful and faithful to a speaker's unique non-verbal style remains an open challenge. Semantic gestures, such as iconic shapes or deictic pointing, are statistically sparse, making them difficult to learn effectively within standard generative models. We present SiGnature, a framework for Stylized and Semantic Gesture generation that reconciles precise semantic control with high-fidelity style preservation. Unlike prevalent methods that rely on entangled latent representations, SiGnature operates in an explicit joint-rotation space. This design enables our core contribution, Joint Motion Integration (JMI), a training-free inference mechanism capable of injecting any external motion sequence, particularly in-the-wild semantic gestures, directly into the diffusion process. JMI automatically identifies the specific "active joints" conveying a semantic action and injects them into the generation, while relying on the diffusion backbone to synthesize the remaining body dynamics, including posture and flow, in accordance with the pre-learned style of the target speaker. This allows for the plug-and-play integration of arbitrary motions, including complex semantic gestures, without retraining or introducing the "Frankenstein" artifacts typical of cut-and-paste methods. Extensive experiments and perceptual studies demonstrate that SiGnature offers superior semantic motion control while maintaining smooth and natural co-speech gesture generation and preserving the distinct characteristics of the speaker, thereby outperforming state-of-the-art baselines.

2606.16103 2026-06-16 cs.CV 新提交

SceneCraft: Interactive System for Image Editing via Scene Graph

SceneCraft: 基于场景图的交互式图像编辑系统

Duc-Manh Phan, Ngoc-Dai Tran, Duy-Khang Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science, Ho Chi Minh, Vietnam(胡志明市理科大学) Vietnam National University, Ho Chi Minh, Vietnam(越南国家大学胡志明市) University of Dayton, Dayton, Ohio, USA(代顿大学)

AI总结 提出SceneCraft框架,通过场景图表示图像,用户直接操作图结构进行复杂编辑,自动生成精确提示,降低语言歧义,提升编辑质量和用户控制。

详情
AI中文摘要

生成式AI的最新进展使得自然语言驱动的图像编辑成为可能,但现有系统在处理包含多个交互对象的复杂场景时常常失败,因为它们严重依赖用户精心制作精确的文本提示。为了解决缺乏结构化控制的问题,我们提出了SceneCraft,一种新颖的交互式框架,通过将图像表示为可编辑的场景图来桥接用户意图和模型执行。用户无需通过试错来猜测文本提示,而是直接与可视化图交互以执行复杂的空间和关系操作。这些图修改会自动转换为精确的、上下文感知的编辑提示,有效消除语言歧义。为了确保鲁棒和多样化的结果,结构化提示被分派到多个最先进的生成模型。跨多种编辑场景的评估表明,SceneCraft提供了更直观的控制机制,显著减少了手动提示工程的认知负担,同时生成的输出在质量和保真度上获得用户一致更高的评价。

英文摘要

Recent advances in generative AI have enabled natural language-driven image editing, yet existing systems often fail in complex scenes with multiple interacting objects because they rely heavily on users crafting precise text prompts. To address the absence of structured control, we propose SceneCraft, a novel interactive framework that bridges user intent and model execution by representing images as editable scene graphs. Instead of guessing text prompts through trial and error, users interact directly with a visual graph to perform complex spatial and relational operations. These graph modifications are automatically translated into precise, context-aware editing prompts, effectively eliminating linguistic ambiguity. To ensure robust and diverse results, structured prompts are dispatched to multiple state-of-the-art generative models. Evaluations across diverse editing scenarios show that SceneCraft provides a more intuitive control mechanism, significantly reducing the cognitive burden of manual prompt engineering while generating outputs that users consistently rate as higher in quality and fidelity.

2606.16131 2026-06-16 cs.CV cs.LG 新提交

Shift-and-Sum Quantization for Visual Autoregressive Models

Shift-and-Sum 量化用于视觉自回归模型

Jaehyeon Moon, Bumsub Ham

发表机构 * Yonsei University(延世大学) Articron

AI总结 提出针对视觉自回归模型的训练后量化框架,通过移位求和量化减少注意力值乘积误差,并采用重采样策略校准数据,在图像生成等任务上达到新最优。

Comments ICLR 2026

详情
AI中文摘要

训练后量化(PTQ)能够使用少量数据实现深度网络的高效部署。然而,其在视觉自回归模型(VAR)上的应用仍相对未被探索。我们识别出将PTQ应用于VAR的两个关键挑战:(i)注意力值乘积中的大重建误差,尤其是在高注意力分数更频繁出现的粗尺度上;(ii)由于有限的校准数据,码本条目的采样频率与其预测概率之间存在差异。为了解决这些挑战,我们提出了一种针对VAR的PTQ框架。首先,我们引入了一种移位求和量化方法,通过聚合值令牌的对称移位副本的量化结果来减少重建误差。其次,我们提出了一种校准数据的重采样策略,使码本条目的采样频率与其预测概率对齐。在类别条件图像生成、修复、外推和类别条件编辑上的实验表明,该方法在VAR架构上取得了一致的改进,为VAR的PTQ建立了新的最先进水平。

英文摘要

Post-training quantization (PTQ) enables efficient deployment of deep networks using a small set of data. Its application to visual autoregressive models (VAR), however, remains relatively unexplored. We identify two key challenges for applying PTQ to VAR: (i) large reconstruction errors in attention-value products, especially at coarse scales where high attention scores occur more frequently; and (ii) a discrepancy between the sampling frequencies of codebook entries and their predicted probabilities due to limited calibration data. To address these challenges, we propose a PTQ framework tailored for VAR. First, we introduce a shift-and-sum quantization method that reduces reconstruction errors by aggregating quantized results from symmetrically shifted duplicates of value tokens. Second, we present a resampling strategy for calibration data that aligns sampling frequencies of codebook entries with their predicted probabilities. Experiments on class-conditional image generation, inpainting, outpainting, and class-conditional editing show consistent improvements across VAR architectures, establishing a new state of the art in PTQ for VAR.
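The idea of aggregating quantized results from symmetrically shifted duplicates can be illustrated on scalars. The following is a minimal sketch under stated assumptions, not the paper's method: the uniform round-to-nearest quantizer, the shift of a quarter step, and the helper names `quantize` and `shift_and_sum` are all hypothetical choices made for illustration. It only shows why averaging two shifted copies can lower reconstruction error compared with quantizing once.

```python
import random


def quantize(x, step):
    """Uniform round-to-nearest quantizer with the given step size."""
    return round(x / step) * step


def shift_and_sum(x, step):
    """Average the quantized results of two symmetrically shifted copies.

    Assumption for illustration: the shift is +/- step/4, so the average
    lands on a grid twice as fine as the base quantizer.
    """
    shift = step / 4
    return 0.5 * (quantize(x + shift, step) + quantize(x - shift, step))


random.seed(0)
step = 0.5
values = [random.uniform(-2.0, 2.0) for _ in range(10_000)]

# Mean absolute reconstruction error of each scheme on the same inputs.
plain_err = sum(abs(v - quantize(v, step)) for v in values) / len(values)
fused_err = sum(abs(v - shift_and_sum(v, step)) for v in values) / len(values)
worst_fused = max(abs(v - shift_and_sum(v, step)) for v in values)
print(plain_err, fused_err, worst_fused)
```

In this toy setting the averaged result never deviates from the input by more than a quarter step, at the cost of quantizing each value twice. How the actual framework chooses shifts and applies them to attention-value products is described in the paper itself.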

2606.16184 2026-06-16 cs.CV cs.MM 新提交

Closed-Loop Triplet Synergistic Generation for Long-Form Video

闭环三元组协同生成用于长视频

Xinlei Yin, Xiulian Peng, Xiao Li, Zhiwei Xiong, Yan Lu

发表机构 * University of Science and Technology of China(中国科学技术大学) Microsoft Research Asia(微软亚洲研究院)

AI总结 提出CoTriSyGen框架,通过闭环视觉-文本-记忆协同过程,结合分析器进行镜头内和镜头间修正,解决长视频生成中的身份漂移和不一致问题。

详情
AI中文摘要

多镜头长视频生成由于身份漂移和镜头间的复合不一致性仍然具有挑战性。虽然基于故事板的流程提高了可控性,但它们通常以前馈方式执行,缺乏将生成的视觉证据反馈到后续条件中的机制。我们提出CoTriSyGen,一个智能体框架,将多镜头长视频生成形式化为闭环视觉-文本-记忆协同过程,其中计划意图、持久记忆和生成的视觉被联合用于迭代校正和长程一致性。基于视觉语言模型的分析器对该三元组进行推理,并沿两条路径生成对提示和记忆的更新:(i) 镜头内修正,当检测到语义或构成违规时触发目标重新生成,并细化图像到视频的提示以实现连贯运动;(ii) 镜头间修正,重写后续镜头提示以传播新出现的实体或属性,并根据生成的证据提高提示质量(例如,构成基础和电影流畅性)。该循环基于以实体为中心的记忆,该记忆被建模为可变的视觉状态,随着故事进展而演变,由生成器和分析器通过添加新的和演变的实体来持续更新,以反映外观变化、累积的多视图证据和多实体构成。在我们策划的StoryBench基准上的实验表明,与代表性方法相比,在跨镜头一致性、提示遵循和电影连续性方面有显著改进。

英文摘要

Multi-shot long-form video generation remains challenging due to identity drift and compounding inconsistencies across shots. While storyboard-driven pipelines improve controllability, they are often executed in a feed-forward manner, with limited mechanisms to incorporate generated visual evidence back into subsequent conditioning. We propose CoTriSyGen, an agentic framework that formulates multi-shot long video generation as a closed-loop visual-text-memory synergy process, where planned intent, persistent memory, and generated visuals are jointly leveraged for iterative correction and long-range coherence. A vision-language-model-based analyzer reasons over this triplet and produces updates to both prompts and memory along two pathways: (i) intra-shot refinement, which triggers targeted regeneration when semantic or compositional violations are detected and refines image-to-video prompt for coherent motions; and (ii) inter-shot refinement, which rewrites subsequent-shot prompts to propagate newly manifested entities or attributes and improve prompt quality (e.g., compositional grounding and cinematic fluency) based on generated evidence. The loop is grounded in an entity-centric memory modeled as a mutable visual state that evolves as the story progresses, which is continuously updated by both the generator and the analyzer by adding new and evolved entities to reflect appearance changes, accumulated multi-view evidence, and multi-entity compositions. Experiments on our curated StoryBench benchmark demonstrate substantial improvements in cross-shot consistency, prompt adherence, and cinematic continuity over representative methods.

2606.16241 2026-06-16 cs.CV 新提交

Structure-Semantic Co-optimized Latent Diffusion Model for Fast Visual Anagram Synthesis

结构-语义协同优化的潜扩散模型用于快速视觉字谜合成

Xiang Gao, Yunpeng Jia

发表机构 * School of Digital Media and Design Arts, Beijing University of Posts and Telecommunications(北京邮电大学数字媒体与设计艺术学院)

AI总结 提出结构-语义协同优化框架S2CO-Anagram,通过空文本结构对齐、语义增强和注意力引导噪声融合,在极低计算成本下生成高分辨率、高视觉和谐度与语义保真度的视觉字谜图像。

详情
AI中文摘要

视觉字谜是一种有趣的艺术创作形式,其中单个图像在翻转或旋转等变换下呈现不同的概念解释。最近的工作通过利用预训练的文本到图像(T2I)扩散模型实现了视觉字谜合成,但仍存在几个关键限制,包括计算效率低、美学质量次优以及语义保真度和表现力弱。本文专注于以最小的计算成本生成视觉质量显著提升的视觉字谜,从而推进幻觉数字艺术的智能创作。为了提高图像分辨率同时减少时间开销,我们将基于像素的T2I模型中的先进并行去噪算法适配到对抗性蒸馏的潜模型上,并相应地提出了一种结构-语义协同优化(S2CO)框架来抵消随之而来的视觉退化。作为我们方法的核心,S2CO框架包含三个关键创新:(I)空文本结构对齐优化;(II)语义增强优化;(III)注意力引导噪声融合。基于这些组件,我们的方法称为S2CO-Anagram,能够生成比相关SOTA方法具有显著更优视觉和谐性和语义保真度的高分辨率字谜图像,同时实现更快的推理速度。代码将公开。

英文摘要

A visual anagram is an intriguing form of art creation wherein a single image presents different conceptual interpretations under transformations such as flipping or rotation. Recent work has achieved visual anagram synthesis by leveraging pretrained text-to-image (T2I) diffusion models, yet still suffers from several key limitations, including computational inefficiency, suboptimal aesthetic quality, and weak semantic fidelity and expressiveness. This work focuses on generating visual anagrams with substantially improved visual quality at minimal computational cost, thereby advancing intelligent creation of illusionary digital art. To increase image resolution while reducing time overhead, we adapt the cutting-edge parallel denoising algorithm from pixel-based T2I models to adversarially distilled latent-based ones, and accordingly propose a structure-semantic co-optimization (S2CO) framework to counteract the consequent visual degradation. As the core of our approach, the S2CO framework comprises three key innovations: (i) null-text structure alignment optimization; (ii) semantic enhancement optimization; (iii) attention-guided noise fusion. Building upon these components, our method, dubbed S2CO-Anagram, is able to generate higher-resolution anagram images with noticeably superior visual harmony and semantic faithfulness than related SOTA approaches, all while achieving substantially faster inference speed. Code will be publicly available.

2606.16317 2026-06-16 cs.CV 新提交

Training-free sparse attention based on cumulative energy filtering

基于累积能量过滤的无训练稀疏注意力

Chunlu Li, Yixuan Pan, Bai Du, Zhenyuan Chen, Yanzhao Li, Hui Dong, Hui Wang, Zhiqiang Zou

发表机构 * Huawei Technologies(华为技术有限公司)

AI总结 提出动态阈值策略,在保持固定召回率的同时提高稀疏性,并与Flash Attention深度集成,无需额外掩码计算,在Wan 2.2上稀疏度从61.42%提升至82%,VBench指标下降小于5%。

详情
AI中文摘要

稀疏注意力通过仅计算重要令牌而跳过其余令牌,加速用于视频生成的扩散变换器(DiTs)。令牌选择策略是平衡稀疏性和准确性的关键。我们将令牌过滤过程形式化为一个双目标优化问题:最大化稀疏性和最小化准确性下降。现有算法无法同时实现这两个目标。例如,Top-p仅考虑准确性约束,而Top-k维持固定的计算预算但放松了准确性约束。本文证明,维持固定的召回率足以保证准确性,而固定阈值对于降低计算成本是次优的。因此,我们提出一种动态阈值方案,在保持相同准确性水平的同时提高稀疏性。此外,我们的算法与Flash Attention(FA)深度集成,无需任何额外的掩码计算开销。在Wan 2.2上的实验结果表明,与同样集成FA的BLASST算法相比,我们的动态阈值策略将稀疏性从61.42%提升至82%,而VBench指标下降小于5%。这导致注意力计算减少约15%,计算效率提升1.61倍,比BLASST高1.18倍。

英文摘要

Sparse attention accelerates Diffusion Transformers (DiTs) for video generation by computing only the important tokens while skipping the rest. The token selection strategy is key to balancing sparsity and accuracy. We formulate the token filtering process as a dual-goal optimization problem: maximizing sparsity and minimizing accuracy degradation. Existing algorithms cannot fulfill both objectives simultaneously. For example, Top-p only considers the accuracy constraint, while Top-k maintains a fixed computational budget but loosens the accuracy constraint. This paper demonstrates that maintaining a fixed recall rate is sufficient for ensuring accuracy, whereas a fixed threshold is suboptimal for reducing computational cost. Therefore, we propose a dynamic thresholding scheme to improve sparsity while maintaining the same level of accuracy. Furthermore, our algorithm is deeply integrated with Flash Attention (FA), eliminating the need for any additional masking computation overhead. Experimental results on Wan 2.2 validate that, compared to the BLASST algorithm, which is also integrated with FA, our dynamic thresholding strategy enhances sparsity from 61.42% to 82% with a VBench metric drop of less than 5%. This results in an approximately 15% reduction in attention computation and a 1.61x increase in computational efficiency, which is 1.18x higher than that of BLASST.
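The contrast the abstract draws between Top-k (fixed budget, variable recall) and Top-p (fixed recall, variable budget) can be made concrete on a single attention row. This is a minimal sketch of those two generic selection rules only; it does not reproduce the paper's dynamic thresholding or its Flash Attention integration, and the function names and example scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_keep(probs, k):
    """Keep the k largest weights: fixed compute budget, recall varies."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(order[:k])


def top_p_keep(probs, p):
    """Keep the smallest prefix whose cumulative weight reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return sorted(kept)


def recall(probs, kept):
    """Fraction of the attention mass retained by the kept tokens."""
    return sum(probs[i] for i in kept)


# Hypothetical rows: one sharply peaked, one nearly flat.
peaked = softmax([8.0, 1.0, 0.5, 0.0, -1.0, -2.0])
flat = softmax([1.0, 0.9, 0.8, 0.7, 0.6, 0.5])

for name, row in (("peaked", peaked), ("flat", flat)):
    k_kept = top_k_keep(row, 2)
    p_kept = top_p_keep(row, 0.9)
    print(name, len(k_kept), round(recall(row, k_kept), 3), len(p_kept))
```

On the peaked row, Top-p keeps a single token while Top-k still spends its full budget of two; on the flat row, Top-k with the same budget retains well under 90% of the mass while Top-p must keep every token. This is the tension between sparsity and accuracy that motivates a selection rule adapting to each row.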

2606.16401 2026-06-16 cs.CV 新提交

RGFVR: Reference-Guided Face Video Restoration with Flow Matching

RGFVR: 基于参考引导的流匹配人脸视频修复

Cem Eteke, Batuhan Tosun, Eckehard Steinbach

发表机构 * Chair of Media Technology, Munich Institute of Robotics and Machine Intelligence, School of Computation, Information, and Technology, Technical University of Munich(慕尼黑工业大学计算、信息与技术学院慕尼黑机器人与机器智能研究所媒体技术教席)

AI总结 提出一种主体无关的参考引导框架,通过双模态感知-描述身份条件注入预训练流匹配文本到视频生成器,结合两阶段训练策略,在降采样、模糊、噪声和压缩伪影等退化下提升人脸视频修复的保真度、时间一致性和身份保持。

详情
AI中文摘要

从退化观测中恢复人脸视频具有挑战性,因为它需要同时恢复视觉保真度、时间一致性和主体身份。现有方法要么是无参考的,当个体特定面部细节丢失时可能导致身份丢失,要么是主体特定的,限制了对未见身份的泛化。我们提出了一种主体无关的参考引导框架,用于身份保持的人脸视频修复。我们的方法将双模态感知-描述身份条件引入预训练的基于流的文本到视频生成器,并采用两阶段训练策略来增强修复过程中的身份引导。实验表明,我们的方法提高了修复保真度、时间一致性和身份保持,在包括降采样、模糊、噪声和压缩伪影在内的挑战性视频退化下实现了优越性能。代码可在 https://github.com/batuhanntosun/RG-FVR 获取。

英文摘要

Face video restoration from degraded observations is challenging, as it requires simultaneously recovering visual fidelity, temporal consistency, and subject identity. Existing approaches are often either reference-free, which can lead to identity loss when person-specific facial details are lost, or subject-specific, which limits generalization to unseen identities. We propose a subject-agnostic, reference-guided framework for identity-preserving face video restoration. Our method introduces bimodal perceptual-descriptive identity conditioning into a pretrained flow-based text-to-video generator and employs a two-stage training strategy to strengthen identity guidance during restoration. Experiments show that our approach improves restoration fidelity, temporal consistency, and identity preservation, achieving superior performance under challenging video degradations, including downsampling, blur, noise, and compression artifacts. The code is available under: https://github.com/batuhanntosun/RG-FVR.

2606.16457 2026-06-16 cs.CV cs.GR 新提交

ResEdit: Residual embeddings for precise generative image editing

ResEdit:用于精确生成式图像编辑的残差嵌入

Ahmet Canberk Baykal, Valentin Deschaintre, Yannick Hold-Geoffroy, Michael Fischer, Anna Frühstück, Cengiz Öztireli, Iliyan Georgiev

发表机构 * Adobe Research(Adobe研究院) University of Cambridge(剑桥大学)

AI总结 提出残差图像编码作为额外条件,结合梯度反转优化策略,在保持图像身份和全局一致性的同时实现高保真精确编辑。

Comments Accepted to the EGSR 2026 journal track

详情
AI中文摘要

条件扩散图像生成器可以通过反演重新用于编辑,无需大规模配对微调数据。然而,在保持图像身份和全局一致性的同时产生高质量、有针对性的编辑仍然具有挑战性,因为弱条件反演通常会将冲突的图像特征嵌入到噪声中。我们证明,将残差图像编码作为额外条件,既能改善身份保留,又能提高可编辑性。我们优化这种残差编码,为重建提供强大的条件信号,从而减少对反演的依赖及其易受上述缺陷的影响。为了确保该残差不干扰期望的编辑,我们采用了一种基于梯度反转的优化策略,将残差与编辑条件解耦。我们展示了该方法在基于内在属性的精确编辑和重光照中产生高保真结果的能力,并给出了概念验证性的文本引导操作。

英文摘要

Conditional diffusion image generators can be repurposed for editing through inversion, without the need for large-scale paired fine-tuning data. However, producing high-quality, targeted edits while maintaining image identity and global consistency remains challenging, as weakly conditioned inversion often embeds conflicting image features into the noise. We demonstrate that incorporating a residual image encoding as additional conditioning enables both improved identity preservation and better editability. We optimize this residual encoding to provide a strong conditioning signal for reconstruction, thereby reducing the reliance on inversion and susceptibility to its aforementioned pitfalls. To ensure this residual does not interfere with desired edits, we incorporate a gradient reversal-based optimization strategy that disentangles the residual from the edited condition. We illustrate our method's ability to produce high-fidelity results across precise intrinsic-based editing and relighting, and show proof-of-concept text-guided manipulation.

2606.16502 2026-06-16 cs.CV 新提交

Active Reference Acquisition in Few-Shot Font Generation

少样本字体生成中的主动参考获取

Shinnosuke Matsuo

发表机构 * NTT, Inc., Japan(日本电报电话公司) Kyushu University(九州大学)

AI总结 针对少样本字体生成中参考不足导致风格不匹配的问题,提出主动参考获取框架,通过基于局部结构部分覆盖的获取函数,顺序选择最需补充的字符,提升生成质量并减少查询次数。

Comments Accepted at ICDAR2026

详情
AI中文摘要

少样本字体生成旨在给定一个或几个参考字形的情况下,合成字体的其余字形,同时保持风格一致性,从而支持字体设计师高效完成字体设计。现有方法主要关注在固定参考集下提高生成质量。然而,当当前参考字形不足以代表目标风格时,少样本字体生成可能无法产生令人满意的结果。在实际场景中,必要时可以从设计师处获取额外的参考字形。因此,我们提出一个新的框架——少样本字体生成中的主动参考获取,其中模型顺序决定下一个获取哪个字符作为额外参考。此外,我们提出一种基于参考部分覆盖的获取函数来高效地查询设计师。受字体风格由局部结构部分良好表征的观察启发,我们使用局部特征直方图表示每个字形,并选择最大化参考集预期部分覆盖的查询字符。通过优先选择包含当前参考未覆盖部分的字符,所提方法逐步扩展参考集中视觉部分的多样性。结果,生成质量得到提高,且查询次数更少。在Google Fonts数据集上的实验表明,所提方法实现了比随机查询和与参考无关的基线更高的生成质量。代码可在https://github.com/matsuo-shinnosuke/ActiveRef-FontGen获取。

英文摘要

Few-shot font generation aims to synthesize the remaining glyphs of a font given one or a few reference glyphs while preserving stylistic consistency, thereby supporting font designers in efficiently completing a typeface. Existing methods primarily focus on improving generation quality given a fixed reference set. However, when the current reference glyphs are insufficient to represent the target style, few-shot font generation may fail to produce satisfactory results. In practical scenarios, additional reference glyphs can often be obtained from the designer when necessary. Accordingly, we propose a new framework, Active Reference Acquisition in Few-Shot Font Generation, in which the model sequentially decides which character to acquire next as an additional reference. Furthermore, we propose a reference part-coverage-based acquisition function to efficiently query the designer. Motivated by the observation that font styles are well characterized by local structural parts, we represent each glyph using a histogram of local features and select query characters that maximize the expected part coverage of the reference set. By prioritizing characters that contain parts not yet covered by the current references, the proposed method progressively expands the diversity of visual parts in the reference set. As a result, generation quality is improved with fewer queries. Experiments on the Google Fonts dataset demonstrate that the proposed method achieves higher generation quality than random querying and reference-agnostic baselines. The code is available at https://github.com/matsuo-shinnosuke/ActiveRef-FontGen.
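The acquisition rule described above, picking the character whose parts are least covered by the current references, resembles greedy set coverage. The sketch below is a simplified illustration under stated assumptions: each glyph is reduced to a set of named parts rather than a histogram of local features, the part vocabulary is invented, and `acquire` is a hypothetical helper, not code from the linked repository.

```python
# Hypothetical part inventory per character (invented for illustration).
parts = {
    "I": {"stem"},
    "L": {"stem", "bar"},
    "O": {"bowl"},
    "B": {"stem", "bowl"},
    "A": {"diagonal", "bar"},
    "V": {"diagonal"},
}


def acquire(parts, references, budget):
    """Greedily add up to `budget` characters that cover the most new parts."""
    refs = list(references)
    covered = set()
    for ch in refs:
        covered |= parts[ch]
    for _ in range(budget):
        remaining = sorted(ch for ch in parts if ch not in refs)
        if not remaining:
            break
        # Gain = number of parts in the candidate not yet covered.
        best = max(remaining, key=lambda ch: len(parts[ch] - covered))
        if not parts[best] - covered:
            break  # nothing new to learn from any remaining character
        refs.append(best)
        covered |= parts[best]
    return refs, covered


refs, covered = acquire(parts, ["I"], budget=2)
print(refs, sorted(covered))
```

Starting from the single reference "I", the sketch first requests "A" (two new parts) and then "B" (the remaining bowl), after which every part in this toy vocabulary is covered. Ties are broken alphabetically here, which is an arbitrary choice.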

2606.16673 2026-06-16 cs.CV 新提交

MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

MMDiff: 扩展扩散变换器用于多模态生成

Yagmur Akarken, Orest Kupyn, Christian Rupprecht

发表机构 * University of Oxford, Visual Geometry Group(牛津大学视觉几何组)

AI总结 提出MMDiff框架,利用冻结的扩散变换器通过轻量解码器联合生成图像及多种密集感知模态,发现多时间步特征融合与空间变化聚合权重是关键,在语义分割等任务上取得优异性能。

详情
AI中文摘要

扩散变换器已展现出卓越的生成能力,然而在其去噪轨迹中计算出的丰富感知表示在内容渲染后被丢弃。我们提出了MMDiff,一个将冻结的扩散变换器转化为多模态生成系统的框架,该系统使用轻量级解码器头联合生成图像以及任意组合的密集感知模态。我们的核心发现是,感知信息在去噪轨迹上呈时间分布,并且具有空间变化聚合权重的多时间步特征融合至关重要,相比单时间步提取,语义分割结果提高了高达28.7% mIoU。我们进一步采用概念驱动的注意力提取以实现可解释的空间引导,并表明冻结的扩散特征与最先进的编码器(如DINOv3)具有竞争力和互补性。通过在冻结的骨干网络上仅训练轻量级解码器头,我们在语义分割、显著目标检测和深度估计中取得了强劲性能,并证明了该框架能够有效生成大规模合成数据。

英文摘要

Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
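Multi-timestep fusion with spatially varying aggregation weights can be read as a per-pixel softmax over timesteps followed by a weighted sum of features. The sketch below assumes exactly that reading; the single-channel feature maps, the `logits` input (which would be learned in practice), and the function name `fuse` are hypothetical and are not taken from the paper.

```python
import math


def fuse(features, logits):
    """Fuse features[t][y][x] across timesteps t with per-pixel weights.

    The weights at each pixel are a softmax over timesteps of logits[t][y][x],
    so different locations can rely on different denoising steps.
    """
    steps = len(features)
    height = len(features[0])
    width = len(features[0][0])
    fused = [[0.0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            m = max(logits[t][y][x] for t in range(steps))
            w = [math.exp(logits[t][y][x] - m) for t in range(steps)]
            z = sum(w)
            fused[y][x] = sum(w[t] / z * features[t][y][x] for t in range(steps))
    return fused


# Two timesteps over a 1x2 map: the left pixel weighs both steps equally,
# the right pixel relies almost entirely on the first step.
features = [[[1.0, 1.0]], [[3.0, 3.0]]]
logits = [[[0.0, 10.0]], [[0.0, -10.0]]]
fused = fuse(features, logits)
print(fused)
```

With a single shared weight per timestep, both pixels would receive the same mixture; letting the weights vary over space is what allows each location to draw on the step where its information is clearest.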

2606.16767 2026-06-16 cs.CV 新提交

Text-Vision Co-Instructed Image Editing

文本-视觉协同指导的图像编辑

Chenxi Xie, Yuhui Wu, Qiaosi Yi, Lei Zhang

发表机构 * The Hong Kong Polytechnic University(香港理工大学) OPPO Research Institute(OPPO研究院)

AI总结 提出TV-Edit框架,联合文本指令的语义表达与稀疏视觉指令的空间引导,实现精确且忠实于意图的图像编辑,显著优于现有方法。

详情
AI中文摘要

现有的图像编辑方法通常可分为基于文本指令和基于视觉提示两类。文本指令语义表达丰富,但受限于编辑结果空间控制的粗粒度。相比之下,拖拽和点等视觉提示能提供精确的空间引导,但存在语义意图固有的模糊性。为统一文本和视觉提示的优势,我们提出文本-视觉协同指导的图像编辑,将文本指令作为语义意图、稀疏视觉指令作为空间引导联合建模,旨在实现精确且忠实于意图的图像操作。为此,我们首先构建了一个包含超过23K个样本的文本-视觉指令配对数据集,这些样本源自动态视频,为跨模态指令提供对齐监督。然后,我们提出TV-Edit,一个文本-视觉指令统一编辑框架,将基于拖拽或点的视觉指令与图像-文本语义上下文化,并将其提升为语义感知的控制表示,用于预训练的编辑骨干网络。通过整合语义意图和空间约束,TV-Edit相比纯文本或纯拖拽方法实现了更精确的空间控制、更少的指令歧义和更强的结构一致性。最后,我们建立了TV-Edit-Bench,一个精心设计的基准,用于评估语义忠实度、空间对齐和视觉一致性,通过地面真实参考和受控的文本-视觉变化进行可靠评估。我们在多个编辑骨干网络上的实验表明,TV-Edit始终产生更精确且忠实于意图的编辑,显著优于最先进的基于指令和基于拖拽的基线方法。

英文摘要

Existing image editing methods can be generally categorized into textual instruction-based and visual prompt-based ones. Textual instructions are semantically expressive, but are limited by the coarse granularity of spatial control of the editing results. In contrast, visual prompts such as drag and point can provide precise spatial guidance, but are limited by the inherent ambiguity in semantic intent. To unify the strength of textual and visual prompts, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming to achieve precise and intent-faithful image manipulation. To this end, we first construct a textual-visual instruction paired dataset with more than 23K samples derived from dynamic videos, enabling aligned supervision for cross-modal instruction. We then propose TV-Edit, a Textual-Visual instruction unified Editing framework to contextualize drag or point-based visual instructions with image-text semantics and lift them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent and spatial constraints, TV-Edit leads to more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-based alternatives. Finally, we establish TV-Edit-Bench, a deliberately designed benchmark to evaluate semantic faithfulness, spatial alignment, and visual consistency with ground-truth references and controlled textual-visual variations for reliable assessment. Our experiments across multiple editing backbones demonstrate that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.

2606.16799 2026-06-16 cs.CV cs.AI 新提交

Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment

解耦语义与失真:面向AI生成图像质量评估的多尺度双流视觉-语言对齐

Zijie Meng

AI总结 提出MST-CLIPIQA多尺度双流框架,通过显式表示解耦实现层次化视觉-语言对齐,在五个基准上取得质量SRCC平均提升1.11%、图文对应SRCC提升2.35%的新SOTA结果。

Comments 11 pages, 2 figures Accepted by ICME2026(spotlight)

详情
AI中文摘要

现有的基于视觉-语言模型(VLM)的AI生成图像质量评估(AIGIQA)方法存在根本性的语义-失真维度冲突:为语义区分优化的单一表示在本质上将组成性理解与低层感知敏感性纠缠在一起,使其对细粒度质量退化视而不见。我们提出MST-CLIPIQA,一种多尺度双流框架,通过显式表示解耦实现层次化视觉-语言对齐。我们的架构利用具有互补补丁粒度的双CLIP编码器:粗粒度流捕获全局语义连贯性,而细粒度流保留纹理特征和伪影模式。一种受信息瓶颈启发的门控融合机制执行自适应跨尺度蒸馏,当生成提示可用时,可选交叉注意力实现基于提示的对应评估。在五个基准上的广泛实验建立了新的最先进结果,在质量预测上实现平均SRCC提升1.11%,在文本-图像对应预测上提升2.35%,同时仅需0.8M可训练参数即可保持效率。我们的项目可在https://github.com/YMlinfeng/MST-CLIPIQA获取。

英文摘要

Existing vision-language model (VLM)-based AI-generated image quality assessment (AIGIQA) methods suffer from a fundamental semantic-distortion dimensional conflict: monolithic representations optimized for semantic discrimination inherently entangle compositional understanding with low-level perceptual sensitivity, rendering them blind to fine-grained quality degradations. We introduce MST-CLIPIQA, a multi-scale two-stream framework that achieves hierarchical vision-language alignment through explicit representational decoupling. Our architecture leverages dual CLIP encoders with complementary patch granularities: coarse-grained streams capture global semantic coherence while fine-grained streams preserve textural signatures and artifact patterns. An information bottleneck-inspired gated fusion mechanism performs adaptive cross-scale distillation, with optional cross-attention enabling prompt-anchored correspondence evaluation when generation prompts are available. Extensive experiments across five benchmarks establish new state-of-the-art results, achieving average improvements of 1.11 percent SRCC on quality and 2.35 percent SRCC on text-image correspondence prediction, while maintaining efficiency with only 0.8M trainable parameters. Our project is available at https://github.com/YMlinfeng/MST-CLIPIQA.

2606.16866 2026-06-16 cs.CV 新提交

Redirecting the Flow: Image Customization through Attention Distribution Shift

重定向流:通过注意力分布偏移实现图像定制

Jie Li, Suorong Yang, Jian Zhao, Furao Shen

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室) School of Artificial Intelligence, Nanjing University(南京大学人工智能学院) School of Computer Science, Nanjing University(南京大学计算机科学与技术学院) School of Electronic Science and Engineering, Nanjing University(南京大学电子科学与工程学院)

AI总结 提出基于最大熵理论的Conditional Attention Distribution Shift方法,通过双分支架构CustomShift实现高效主题驱动图像生成,在DreamBooth和Custom101基准上优于现有方法。

详情
AI中文摘要

主题驱动的图像定制旨在生成不仅遵循文本指令而且保留给定参考主题身份的图像。现有方法,包括测试时微调、基于编码器的方法以及共享注意力空间中的令牌竞争,存在效率有限、提取的参考特征与生成过程不对齐以及无关信息干扰等问题。为了解决这些限制,我们将定制任务表述为通过将参考图像融入文本到图像生成所引发的分布偏移,并基于最大熵理论推导出条件注意力分布偏移公式。基于这一公式,我们提出了CustomShift,一种基于Stable Diffusion 3的双分支架构。参考对齐分支利用参考图像和主题名称之间的自注意力实现与潜在表示的逐层对齐,而交叉引导分支整合文本和参考线索以指导生成。在DreamBooth和Custom101基准上的实验表明,我们的方法始终优于最先进的方法,在语义保真度和主题一致性之间取得了更好的平衡。

英文摘要

Subject-driven image customization aims to generate images that not only follow textual instructions but also preserve the identity of a given reference subject. Existing approaches, including test-time fine-tuning, encoder-based methods, and token competition in shared attention spaces, suffer from limited efficiency, misalignment between extracted reference features and the generative process, and interference from irrelevant information. To address these limitations, we formulate the customization task as a distribution shift induced by incorporating reference images into text-to-image generation, and derive a Conditional Attention Distribution Shift formulation grounded in maximum entropy theory. Building on this formulation, we propose CustomShift, a dual-branch architecture based on Stable Diffusion 3. The Reference-Alignment Branch leverages self-attention between reference images and subject names to achieve layer-wise alignment with latent representations, while the Cross-Guidance Branch integrates textual and reference cues to guide generation. Experiments on the DreamBooth and Custom101 benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches, achieving a better balance between semantic fidelity and subject consistency.

2606.16993 2026-06-16 cs.CV 新提交

DreamX-World 1.0: A General-Purpose Interactive World Model

DreamX-World 1.0:通用交互式世界模型

DreamX Team, Yancheng Bai, Rui Chen, Xiangxiang Chu, Rujing Dang, Hao Dou, Bingjie Gao, Qiwen Gu, Siyu Hong, Jiachen Lei, Geng Li, Jifan Li, Ruimin Lin, Qingfeng Shi, Bingze Song, Lei Sun, Jing Tang, Ruitian Tian, Jun Wang, Jiahong Wu, Pengfei Zhang, Shen Zhang, Jiashu Zhu

AI总结 提出通用交互式文图生视频世界模型DreamX-World 1.0,通过E-PRoPE相机控制、因果强制自回归生成、记忆条件场景持久化和事件指令微调,实现可控长时程生成,在多项指标上超越现有方法。

Comments Project page: https://amap-ml.github.io/DreamX_World, Code: https://github.com/AMAP-ML/DreamX-World

详情
AI中文摘要

DreamX-World 1.0 是一个通用的交互式文本/图像到视频的世界模型,用于可控的长时程生成。它支持相机导航、重新访问先前观察过的区域,以及在逼真、游戏风格和风格化领域中的可提示事件。我们的数据引擎结合了相机精确的虚幻引擎渲染、动作丰富的游戏录制以及带有恢复相机几何的真实世界视频。对于相机控制,我们引入了 E-PRoPE,一种轻量级的投影位置编码变体,它保留了 PRoPE 的投影相机几何,同时对空间缩减的令牌应用相机感知注意力。我们使用因果强制、DMD 风格蒸馏和长滚动训练,将双向视频生成器转换为几步自回归世界模型。在自生成的长时程上下文上进行训练,使模型暴露于其自身的生成历史,并减少跨自回归块累积的风格和颜色漂移。记忆条件场景持久性通过基于相机几何的检索来检索早期视图,而残差循环使得条件路径对不完美的记忆潜变量不那么敏感。事件指令微调增加了可组合的事件控制,而强化学习对齐在蒸馏后恢复了相机控制和视觉质量。通过混合精度 DiT 执行、残差重用、75% 剪枝的 VAE 解码和异步流水线并行,DreamX-World 1.0 在八块 RTX 5090 GPU 上达到高达 16 FPS。在我们的 5 秒基本评估中,DreamX-World 1.0 获得了 73.75 的相机控制分数和 84.76 的总分,在总分上优于 HY-WorldPlay 1.5 和 LingBot-World,后两者分别达到 80.79 和 80.45。

英文摘要

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 and LingBot-World in overall score, which achieve 80.79 and 80.45, respectively.

2606.17049 2026-06-16 cs.CV 新提交

BRDFusion: Physics Meets Generation for Urban Scene Inverse Rendering

BRDFusion:物理与生成结合的城市场景逆渲染

Yi-Ruei Liu, Jie-Ying Lee, Zheng-Hui Huang, Yu-Lun Liu, Chih-Hao Lin

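Affiliations: National Yang Ming Chiao Tung University; University of Illinois Urbana-Champaign; National Taiwan University

AI summary: Proposes BRDFusion, a framework that combines physical modeling with generative priors for inverse rendering of urban scenes; it fixes artifacts while preserving physical consistency and supports novel-view relighting, night simulation, and dynamic object editing.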

Comments Project page: https://shigon255.github.io/brdfusion-page/

详情
AI中文摘要

从捕获视频中对城市场景进行逆渲染可实现众多应用,包括内容创建和自动驾驶仿真。基于物理的渲染方法遵循并控制光照物理,但存在重建和渲染伪影。而生成模型能产生逼真视频,但一致性和可控性有限。我们提出BRDFusion,一个统一框架,结合两种互补模型用于逆渲染和前向渲染。具体而言,BRDFusion通过物理建模恢复显式、一致的场景属性,并利用生成先验缓解优化歧义。在前向渲染中,物理模型提供基于场景配置的可控渲染,生成模型则去噪并修复伪影。因此,我们的方法在允许精确控制的同时生成高质量视频,在真实和合成场景中均优于基线。此外,BRDFusion支持新视角重光照、夜间模拟以及动态物体插入/编辑。项目页面:https://shigon255.github.io/brdfusion-page/

英文摘要

Inverse rendering of urban scenes from captured videos enables numerous applications, including content creation and autonomous driving simulation. Physically-based rendering methods follow and control lighting physics, but suffer from reconstruction and rendering artifacts. While generative models produce realistic videos, they offer limited consistency and controllability. We present BRDFusion, a unified framework that combines two complementary models for inverse and forward rendering. Specifically, BRDFusion recovers explicit, consistent scene properties with physical modeling and alleviates optimization ambiguity with generative priors. During forward rendering, the physical model provides controllable rendering from the scene configuration, and the generative model denoises and fixes artifacts. Therefore, our method produces high-quality videos while allowing precise control, outperforming baselines in real and synthetic scenes. Moreover, BRDFusion supports novel-view relighting, night simulation, and dynamic object insertion/editing. Project page: https://shigon255.github.io/brdfusion-page/

2606.14721 2026-06-16 cs.GR cs.CV cs.RO 交叉投稿

DC-Motion: Decoupling Semantics and Details via Discrete-Continuous Tokens for Human Motion Generation

DC-Motion: 通过离散-连续令牌解耦语义与细节以生成人体运动

Hequan Wang, Jiaxu Zhang, Zhengbo Zhang, Zhigang Tu

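Affiliations: Wuhan University

AI summary: Proposes DC-Motion, a framework whose discrete-continuous VAE decomposes motion into discrete semantic tokens and continuous detail residuals; combined with a masked autoregressive model and a residual diffusion model, it generates high-quality motion from complex text instructions.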

详情
AI中文摘要

文本到运动生成需要合成物理上真实的动态,这些动态严格遵循复杂且长程的文本指令。现有方法依赖于同质表示空间,可能无法捕捉人体运动的层次结构,扩散模型在组合语义推理上表现不佳,而自回归模型由于量化牺牲了细粒度的物理细节。为了解决这个问题,我们引入了DC-Motion,一个分解式生成框架,旨在通过离散-连续令牌显式解耦语义和细节。首先,离散-连续VAE(DC-VAE)将运动分解为用于语义的离散令牌和用于细粒度动态的连续残差。然后,一个掩码自回归模型从文本预测离散结构,一个轻量级残差扩散模型恢复连续的物理细节。大量实验表明,DC-Motion有效提高了遵循复杂指令的能力。通过有效平衡语义可控性和物理真实性,我们的方法为人体运动生成提供了一种高度可适应的建模范式。在HumanML3D和KIT-ML数据集上,DC-Motion实现了最先进的性能,在运动真实感方面获得了最佳的FID,在文本对齐方面获得了最佳的R-precision。

英文摘要

Text-to-motion generation requires synthesizing physically realistic dynamics that strictly follow complex and long-horizon textual instructions. Existing approaches rely on homogeneous representation spaces that may fail to capture the hierarchical nature of human motion, with diffusion models struggling at compositional semantic reasoning and AR models sacrificing fine-grained physical details due to quantization. To address this, we introduce DC-Motion, a factorized generative framework designed to explicitly decouple semantics and details via discrete-continuous tokens. A Discrete-Continuous VAE (DC-VAE) first decomposes motion into discrete tokens for semantics and continuous residuals for fine-grained dynamics. Then, a masked AR model predicts the discrete structure from text, and a lightweight residual diffusion model recovers the continuous physical details. Extensive experiments demonstrate that DC-Motion effectively improves the capability to follow complex instructions. By effectively balancing semantic controllability and physical realism, our approach offers a highly adaptable modeling paradigm for human motion generation. On both HumanML3D and KIT-ML datasets, DC-Motion achieves state-of-the-art performance, delivering the best FID for motion realism and R-precision for text alignment.
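The split into a discrete token plus a continuous residual can be illustrated with a small sketch. This is not the DC-VAE from the paper: it assumes a plain nearest-neighbour codebook lookup, and the names `decompose` and `reconstruct` and the toy codebook are hypothetical.

```python
from typing import List, Tuple


def decompose(z: List[float], codebook: List[List[float]]) -> Tuple[int, List[float]]:
    """Split a latent vector into a discrete code index and a continuous residual.

    Assumption: the discrete token is the nearest codebook entry (VQ-style).
    """
    def dist2(a: List[float], b: List[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    idx = min(range(len(codebook)), key=lambda i: dist2(z, codebook[i]))
    residual = [x - c for x, c in zip(z, codebook[idx])]
    return idx, residual


def reconstruct(idx: int, residual: List[float], codebook: List[List[float]]) -> List[float]:
    """Recombine the coarse code (semantics) with the residual (fine detail)."""
    return [c + r for c, r in zip(codebook[idx], residual)]
```

In this picture, a text-conditioned model would predict `idx` and a second model would fill in `residual`; quantization alone would discard the residual, which is the loss of detail the abstract attributes to AR models.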

2502.10389 2026-06-16 cs.CV cs.AI 版本更新

Region-Adaptive Sampling for Diffusion Transformers

扩散Transformer的区域自适应采样

Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang

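Affiliations: National University of Singapore; Microsoft Research

AI summary: Proposes RAS, a training-free adaptive sampling strategy that dynamically assigns different sampling ratios to image regions, speeding up Diffusion Transformers by 2.36x to 2.51x with minimal quality loss.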

Comments CVPR'26 Poster

详情
AI中文摘要

扩散模型(DMs)已成为跨不同领域生成任务的主要选择。然而,它们依赖多次顺序前向传递,严重限制了实时性能。先前的加速方法主要集中于减少采样步骤数或重用中间结果,由于卷积U-Net结构的限制,未能利用图像中空间区域的变化。通过利用扩散Transformer(DiTs)在处理可变数量令牌方面的灵活性,我们引入了RAS,一种新颖的、无需训练的采样策略,该策略根据DiT模型的关注点动态地为图像中的区域分配不同的采样比例。我们的关键观察是,在每个采样步骤中,模型集中在语义上有意义的区域,并且这些关注区域在连续步骤中表现出强烈的连续性。利用这一见解,RAS仅更新当前关注的区域,而其他区域则使用来自前一步的缓存噪声进行更新。模型的关注点基于前一步的输出确定,利用了我们观察到的时间一致性。我们在Stable Diffusion 3和Lumina-Next-T2I上评估了RAS,分别实现了高达2.36倍和2.51倍的加速,且生成质量下降最小。此外,一项用户研究表明,RAS在人类评估下提供相当的质量,同时实现1.6倍加速。我们的方法朝着更高效的扩散Transformer迈出了重要一步,增强了它们在实时应用中的潜力。

英文摘要

Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling a variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model's focus is determined based on the output from the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable quality under human evaluation while achieving a 1.6x speedup. Our approach makes a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications.
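The caching idea described above can be sketched in a few lines. This is a minimal illustration, not the RAS implementation: the focus metric (magnitude of the previous step's noise), the Euler-style update, and the names `select_focus`, `ras_step`, and `predict_noise` are assumptions introduced here.

```python
from typing import Callable, List, Tuple


def select_focus(prev_noise: List[float], ratio: float) -> List[int]:
    """Pick the top `ratio` fraction of tokens, ranked by |noise| from the previous step.

    Assumption: noise magnitude stands in for the model's "focus".
    """
    k = max(1, round(len(prev_noise) * ratio))
    order = sorted(range(len(prev_noise)), key=lambda i: abs(prev_noise[i]), reverse=True)
    return sorted(order[:k])


def ras_step(
    latents: List[float],
    cached_noise: List[float],
    predict_noise: Callable[[List[float]], List[float]],
    ratio: float,
    step_size: float,
) -> Tuple[List[float], List[float], List[int]]:
    """One sampling step: run the model on focused tokens only, reuse the cache elsewhere."""
    focus = select_focus(cached_noise, ratio)
    fresh = predict_noise([latents[i] for i in focus])  # model sees fewer tokens
    noise = list(cached_noise)                          # start from the cached noise
    for i, value in zip(focus, fresh):
        noise[i] = value                                # refresh focused tokens
    new_latents = [x - step_size * n for x, n in zip(latents, noise)]
    return new_latents, noise, focus
```

The saving comes from `predict_noise` receiving only the focused subset; a convolutional U-Net could not skip tokens this way, which is why the abstract ties the method to DiTs.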

2505.04486 2026-06-16 cs.CV cs.AI cs.LG 版本更新

Efficient Flow Matching using Latent Variables

使用潜在变量的高效流匹配

Anirban Samaddar, Yixuan Sun, Viktor Nilsson, Sandeep Madireddy

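Affiliations: Argonne National Laboratory; KTH Royal Institute of Technology

AI summary: Proposes Latent-CFM, which conditions on data features extracted by a pretrained deep latent variable model to improve the training efficiency and generation quality of flow matching models, outperforming existing methods on image and physical-field generation tasks.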

详情
AI中文摘要

流匹配模型在概率生成模型的图像生成任务中显示出巨大潜力。然而,文献中的大多数流匹配模型在从简单源分布(如标准高斯)学习流时,并未显式利用目标数据中的潜在聚类结构。这导致学习效率低下,尤其是对于许多通常位于低维流形中的高维真实世界数据集。为此,我们提出了 Latent-CFM,它通过使用预训练的深度潜在变量模型从数据中提取的特征作为条件,提供了高效的训练策略。通过对来自多模态分布的合成数据和广泛使用的图像基准数据集的实验,我们表明,Latent-CFM 通过采用预训练的轻量级潜在变量模型,在显著减少训练和计算量的情况下,展现出比最先进的流匹配模型更好的生成质量。除了自然图像,我们还考虑了源自物理过程的空间场的生成建模。使用二维达西流数据集,我们证明了我们的方法比竞争方法生成更物理准确的样本。此外,通过潜在空间分析,我们证明了我们的方法可用于以潜在特征为条件的条件图像生成,这增加了生成过程的可解释性。

英文摘要

Flow matching models have shown great potential in image generation tasks among probabilistic generative models. However, most flow matching models in the literature do not explicitly utilize the underlying clustering structure in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often reside in a low-dimensional manifold. To this end, we present Latent-CFM, which provides efficient training strategies by conditioning on the features extracted from data using pretrained deep latent variable models. Through experiments on synthetic data from multi-modal distributions and widely used image benchmark datasets, we show that Latent-CFM exhibits improved generation quality with significantly less training and computation than state-of-the-art flow matching models by adopting pretrained lightweight latent variable models. Beyond natural images, we consider generative modeling of spatial fields stemming from physical processes. Using a 2D Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competing approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation conditioned on latent features, which adds interpretability to the generation process.
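For orientation, a latent-conditioned flow matching objective can be written as below. This is a sketch under stated assumptions, not the paper's exact loss: it uses the common straight-line interpolation path, and the symbols $f$ (the pretrained latent variable model's feature extractor) and $z$ are notation introduced here. The actual method may use stochastic latents or extra regularization terms.

```latex
% Straight-line path between a Gaussian source sample and a data sample
x_t = (1 - t)\,x_0 + t\,x_1,
\qquad x_0 \sim \mathcal{N}(0, I),\quad x_1 \sim p_{\mathrm{data}},\quad t \sim \mathcal{U}[0, 1]

% Regress the velocity field onto the path's velocity, conditioned on z = f(x_1)
\mathcal{L}(\theta) =
  \mathbb{E}_{t,\,x_0,\,x_1}
  \bigl\| v_\theta(x_t, t, z) - (x_1 - x_0) \bigr\|^2,
\qquad z = f(x_1)
```

Dropping $z$ gives the unconditional objective; the conditioning is what lets the velocity field use cluster structure that the pretrained model has already captured.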

2602.04789 2026-06-16 cs.CV 版本更新

Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

轻量强制:通过稀疏注意力加速自回归视频扩散

Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong, Shen Ren, Wenya Wang

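Affiliations: Nanjing University of Posts and Telecommunications

AI summary: To address the quadratic complexity of attention in autoregressive video generation models, proposes Light Forcing, the first sparse attention solution for such models, which improves quality and efficiency through a Chunk-Aware Growth mechanism and Hierarchical Sparse Attention.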

Comments ICML 2026

详情
AI中文摘要

先进的自回归(AR)视频生成模型提高了视觉保真度和交互性,但注意力的二次复杂度仍然是高效部署的主要瓶颈。虽然现有的稀疏注意力解决方案在双向模型上显示出潜力,但我们发现将这些解决方案应用于AR模型会导致显著的性能下降,原因有二:孤立地考虑块生成以及对过去信息上下文的利用不足。基于这些观察,我们提出Light Forcing,这是首个专为AR视频生成模型设计的稀疏注意力解决方案。它引入了块感知增长机制来定量估计每个块的贡献,从而决定其稀疏性分配。这种渐进式稀疏性增加策略使得当前块在生成过程中能够继承早期块中的先验知识。此外,我们引入了分层稀疏注意力,以从粗到细的方式捕捉信息丰富的历史和局部上下文。这种两级掩码选择策略(即帧级和块级)能够自适应地处理不同的注意力模式。大量实验表明,我们的方法在质量(例如,VBench上84.5)和效率(例如,端到端加速1.2至1.3倍)上均优于现有的稀疏注意力方法。结合FP8量化和LightVAE,Light Forcing在RTX 5090 GPU上进一步实现了2.3倍加速和19.7 FPS。代码将在 https://github.com/chengtao-lv/LightForcing 发布。

英文摘要

Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose Light Forcing, the first sparse attention solution tailored for AR video generation models. It incorporates a Chunk-Aware Growth mechanism to quantitatively estimate the contribution of each chunk, which determines its sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge in earlier chunks during generation. Additionally, we introduce Hierarchical Sparse Attention to capture informative historical and local context in a coarse-to-fine manner. This two-level mask selection strategy (i.e., frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention methods in quality (e.g., 84.5 on VBench) and efficiency (e.g., 1.2-1.3× end-to-end speedup). Combined with other efficient solutions, Light Forcing further achieves a 2.0-3.0× end-to-end speedup across diverse GPUs (e.g., 27.4 FPS on RTX 5090 and 33.9 FPS on H100). Code is released at https://github.com/chengtao-lv/LightForcing
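A coarse-to-fine, two-level selection of the kind described (frame level, then block level) might look like the sketch below. It is illustrative only: the abstract does not give the scoring rule, so scoring frames by their mean block score and using fixed top-k budgets are assumptions, and `hierarchical_select` is a hypothetical name.

```python
from typing import List, Tuple


def hierarchical_select(
    scores: List[List[float]], top_frames: int, top_blocks: int
) -> List[Tuple[int, int]]:
    """Return the (frame, block) pairs that stay visible to attention.

    scores[f][b] is an importance estimate for block b of past frame f.
    Stage 1 (coarse): keep the `top_frames` frames with the highest mean score.
    Stage 2 (fine): inside each kept frame, keep the `top_blocks` best blocks.
    """
    frame_score = [sum(row) / len(row) for row in scores]
    frames = sorted(range(len(scores)), key=lambda f: frame_score[f], reverse=True)[:top_frames]

    keep: List[Tuple[int, int]] = []
    for f in sorted(frames):
        blocks = sorted(range(len(scores[f])), key=lambda b: scores[f][b], reverse=True)[:top_blocks]
        keep.extend((f, b) for b in sorted(blocks))
    return keep
```

Everything outside the returned pairs would be masked out, so the attention cost scales with `top_frames * top_blocks` rather than with the full history.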

2602.13344 2026-06-16 cs.CV eess.IV 版本更新

FireRed-Image-Edit-1.0 Technical Report

FireRed-Image-Edit-1.0 技术报告

Super Intelligence Team, Changhao Qiao, Chao Hui, Chen Li, Cunzheng Wang, Dejia Song, Jiale Zhang, Jing Li, Qiang Xiang, Runqi Wang, Shuang Sun, Wei Zhu, Xu Tang, Yao Hu, Yibo Chen, Yuhao Huang, Yuxuan Duan, Zhiyi Chen, Ziyuan Guo

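Affiliations: Super Intelligence Team; Xiaohongshu Inc.

AI summary: Proposes FireRed-Image-Edit, a diffusion transformer that reaches state-of-the-art performance in instruction-based image editing through optimized data curation, multi-stage training, and evaluation, and open-sources the code, models, and benchmark.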

详情
AI中文摘要

我们提出FireRed-Image-Edit,一种基于指令的图像编辑扩散变换器,通过系统优化数据整理、训练方法和评估设计,实现了最先进的性能。我们构建了一个16亿样本的训练语料库,包含来自不同来源的9亿文本到图像和7亿图像编辑对。经过严格清洗、分层、自动标注和两阶段过滤,我们保留了超过1亿高质量样本,在生成和编辑之间取得平衡,确保强大的语义覆盖和指令对齐。我们的多阶段训练流程通过预训练、监督微调和强化学习逐步构建编辑能力。为了提高数据效率,我们引入了多条件感知桶采样器用于可变分辨率批处理,以及带有动态提示重索引的随机指令对齐。为了稳定优化并增强可控性,我们提出了用于DPO的非对称梯度优化、用于文本编辑的具有布局感知OCR奖励的DiffusionNFT,以及用于身份保持的可微一致性损失。我们进一步建立了REDEdit-Bench,一个涵盖15个编辑类别的综合基准,包括新引入的美化和低级增强任务。在REDEdit-Bench和公共基准(ImgEdit和GEdit)上的大量实验表明,与开源和专有系统相比,我们的性能具有竞争力或更优。为了支持未来研究,我们的代码、模型和基准套件在 https://github.com/FireRedTeam/FireRed-Image-Edit/ 公开提供。

英文摘要

We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. To support future research, our code, models, and benchmark suite are publicly available at https://github.com/FireRedTeam/FireRed-Image-Edit/
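Variable-resolution batching is commonly done by grouping samples into aspect-ratio buckets so that every batch shares one shape. The sketch below shows that generic idea only; it is not the paper's Multi-Condition Aware Bucket Sampler, whose details the abstract does not give, and `make_batches` and the bucket list are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Size = Tuple[int, int]  # (width, height)


def make_batches(
    sizes: List[Size], buckets: List[Size], batch_size: int
) -> List[Tuple[Size, List[int]]]:
    """Assign each sample to the bucket with the closest aspect ratio, then batch per bucket."""
    groups: Dict[Size, List[int]] = defaultdict(list)
    for idx, (w, h) in enumerate(sizes):
        ratio = w / h
        bucket = min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))
        groups[bucket].append(idx)

    batches: List[Tuple[Size, List[int]]] = []
    for bucket in buckets:
        ids = groups.get(bucket, [])
        # Every batch holds sample indices from a single bucket only.
        for start in range(0, len(ids), batch_size):
            batches.append((bucket, ids[start:start + batch_size]))
    return batches
```

A production sampler would also shuffle within buckets and resize each sample to its bucket resolution; both steps are omitted here to keep the grouping logic visible.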

2603.01371 2026-06-16 cs.CV 版本更新

TIMI: Training-Free Image-to-3D Multi-Instance Generation with Spatial Fidelity

TIMI: 无需训练的图像到3D多实例生成与空间保真度

Xiao Cai, Pengpeng Zeng, Ji Zhang, Heng Tao Shen, Jingkuan Song, Lianli Gao

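AI summary: Proposes TIMI, a training-free framework that achieves image-to-3D multi-instance generation with high spatial fidelity through an Instance-aware Separation Guidance module and a Spatial-stabilized Geometry-adaptive Update module.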

Comments Project page: https://cdawn628.github.io/TIMI-Page/

详情
AI中文摘要

图像到3D多实例生成中的精确空间保真度对于下游实际应用至关重要。最近的工作尝试通过在多实例数据集上微调预训练的图像到3D(I23D)模型来解决这一问题,但这带来了大量的训练开销,并且难以保证空间保真度。事实上,我们观察到预训练的I23D模型已经具有有意义的空间先验,但由于实例纠缠问题,这些先验仍未得到充分利用。受此启发,我们提出了TIMI,一种新颖的无需训练的图像到3D多实例生成框架,实现了高空间保真度。具体来说,我们首先引入了实例感知分离引导(ISG)模块,该模块在早期去噪阶段促进实例解缠。接下来,为了稳定ISG引入的引导,我们设计了一个空间稳定的几何自适应更新(SGU)模块,该模块在保持实例相对关系的同时促进实例几何特征的保留。大量实验表明,与现有的多实例方法相比,我们的方法在全局布局和不同的局部实例方面都取得了更好的性能,且无需额外训练,推理速度更快。

英文摘要

Precise spatial fidelity in Image-to-3D multi-instance generation is critical for downstream real-world applications. Recent work attempts to address this by fine-tuning pre-trained Image-to-3D (I23D) models on multi-instance datasets, which incurs substantial training overhead and struggles to guarantee spatial fidelity. In fact, we observe that pre-trained I23D models already possess meaningful spatial priors, which remain underutilized as evidenced by instance entanglement issues. Motivated by this, we propose TIMI, a novel Training-free framework for Image-to-3D Multi-Instance generation that achieves high spatial fidelity. Specifically, we first introduce an Instance-aware Separation Guidance (ISG) module, which facilitates instance disentanglement during the early denoising stage. Next, to stabilize the guidance introduced by ISG, we devise a Spatial-stabilized Geometry-adaptive Update (SGU) module that promotes the preservation of the geometric characteristics of instances while maintaining their relative relationships. Extensive experiments demonstrate that our method yields better performance in terms of both global layout and distinct local instances compared to existing multi-instance methods, without requiring additional training and with faster inference speed.

2603.04239 2026-06-16 cs.CV 版本更新

DiverseDiT: Towards Diverse Representation Learning in Diffusion Transformers

DiverseDiT:扩散Transformer中的多样化表示学习

Mengping Yang, Zhiyu Tan, Binglei Li, Xiaomeng Yang, Hesen Chen, Hao Li

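Affiliations: Fudan University; Shanghai Academy of AI for Science; Shanghai Innovation Institute

AI summary: By analyzing the representation dynamics of Diffusion Transformers, finds that representation diversity across blocks is key, and proposes DiverseDiT, a framework that uses long residual connections and a diversity loss to promote diverse feature learning, improving performance and accelerating convergence on ImageNet.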

Comments Accepted in CVPR 2026, GitHub Code: https://github.com/kobeshegu/DiverseDiT, Project Page: https://forevermamba.work/projects/DiverseDiT/

详情
AI中文摘要

扩散Transformer(DiTs)的最新突破因其卓越的可扩展性而彻底改变了视觉合成领域。为了增强DiTs捕获有意义内部表示的能力,最近的工作如REPA引入了外部预训练编码器进行表示对齐。然而,DiTs内部表示学习的潜在机制尚不明确。为此,我们首先系统地研究了DiTs的表示动力学。通过分析不同设置下内部表示的演变和影响,我们揭示了块间的表示多样性是有效学习的关键因素。基于这一关键洞察,我们提出了DiverseDiT,一种明确促进表示多样性的新颖框架。DiverseDiT结合了长残差连接以多样化块间的输入表示,以及表示多样性损失以鼓励块学习不同的特征。在ImageNet 256x256和512x512上的大量实验表明,我们的DiverseDiT在应用于不同大小的不同骨干网络时,即使在具有挑战性的单步生成设置下,也能产生一致的性能提升和收敛加速。此外,我们展示了DiverseDiT与现有表示学习技术互补,从而带来进一步的性能提升。我们的工作为DiTs的表示学习动力学提供了宝贵的见解,并提供了增强其性能的实用方法。

英文摘要

Recent breakthroughs in Diffusion Transformers (DiTs) have revolutionized the field of visual synthesis due to their superior scalability. To facilitate DiTs' capability of capturing meaningful internal representations, recent works such as REPA incorporate external pretrained encoders for representation alignment. However, the underlying mechanisms governing representation learning within DiTs are not well understood. To this end, we first systematically investigate the representation dynamics of DiTs. Through analyzing the evolution and influence of internal representations under various settings, we reveal that representation diversity across blocks is a crucial factor for effective learning. Based on this key insight, we propose DiverseDiT, a novel framework that explicitly promotes representation diversity. DiverseDiT incorporates long residual connections to diversify input representations across blocks and a representation diversity loss to encourage blocks to learn distinct features. Extensive experiments on ImageNet 256x256 and 512x512 demonstrate that our DiverseDiT yields consistent performance gains and convergence acceleration when applied to different backbones with various sizes, even when tested on the challenging one-step generation setting. Furthermore, we show that DiverseDiT is complementary to existing representation learning techniques, leading to further performance gains. Our work provides valuable insights into the representation learning dynamics of DiTs and offers a practical approach for enhancing their performance.
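One simple way to express "encourage blocks to learn distinct features" is to penalize similarity between the representations of different blocks. The sketch below assumes a mean squared cosine similarity over all block pairs; the paper's actual diversity loss is not specified in the abstract, and `diversity_loss` is a hypothetical name.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def diversity_loss(block_feats: List[List[float]]) -> float:
    """Mean squared cosine similarity over all pairs of block features.

    0.0 means every pair is orthogonal (maximally distinct);
    1.0 means all blocks point in the same (or opposite) direction.
    """
    n = len(block_feats)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return sum(cosine(block_feats[i], block_feats[j]) ** 2 for i, j in pairs) / len(pairs)
```

Such a term would be added to the diffusion training loss with a small weight, so that it shapes the representations without dominating the denoising objective.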

2604.28185 2026-06-16 cs.CV 版本更新

Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling

新时代的视觉生成:从原子映射到智能世界建模的演进

Keming Wu, Zuhao Yang, Kaichen Zhang, Shizun Wang, Haowei Zhu, Sicong Leng, Zhongyu Yang, Qijie Wang, Sudong Wang, Ziting Wang, Zili Wang, Hui Zhang, Haonan Wang, Hang Zhou, Yifan Pu, Xingxuan Li, Fangneng Zhan, Bo Li, Lidong Bing, Yuxin Song, Ziwei Liu, Wenhu Chen, Jingdong Wang, Xinchao Wang, Xiaojuan Qi, Shijian Lu, Bin Wang

Affiliations: Tsinghua University; Nanyang Technological University; University of Hong Kong; National University of Singapore; University of Waterloo; StepFun; MiroMind; Baidu; Fudan University; Hong Kong University of Science and Technology; Hong Kong University of Science and Technology (Guangzhou); LMMs-Lab

AI summary: Proposes a five-level taxonomy (Atomic, Conditional, In-Context, Agentic, and World-Modeling Generation), analyzes key technologies such as flow matching and unified understanding-and-generation models, and points out that current evaluations overemphasize perceptual quality while overlooking structural, temporal, and causal failures.

Comments Project Page: https://github.com/EvolvingLMMs-Lab/Evolving-Visual-Generation

详情
AI中文摘要

最近的视觉生成模型在照片级真实感、排版、指令遵循和交互式编辑方面取得了重大进展,但在空间推理、持久状态、长期一致性和因果理解方面仍然存在困难。我们认为,该领域应超越外观合成,转向智能视觉生成:基于结构、动态、领域知识和因果关系的合理视觉内容。为构建这一转变,我们引入了一个五级分类法:原子生成、条件生成、上下文生成、智能生成和世界建模生成,从被动渲染器逐步演进到交互式、智能、世界感知的生成器。我们分析了关键技术驱动因素,包括流匹配、统一理解与生成模型、改进的视觉表示、后训练、奖励建模、数据整理、合成数据蒸馏和采样加速。我们进一步表明,当前的评估往往通过强调感知质量而高估了进展,却忽略了结构、时序和因果缺陷。通过结合基准测试回顾、野外压力测试和专家约束案例研究,本路线图提供了一个以能力为中心的视角,用于理解、评估和推进下一代智能视觉生成系统。

英文摘要

Recent visual generation models have made major progress in photorealism, typography, instruction following, and interactive editing, yet they still struggle with spatial reasoning, persistent state, long-horizon consistency, and causal understanding. We argue that the field should move beyond appearance synthesis toward intelligent visual generation: plausible visuals grounded in structure, dynamics, domain knowledge, and causal relations. To frame this shift, we introduce a five-level taxonomy: Atomic Generation, Conditional Generation, In-Context Generation, Agentic Generation, and World-Modeling Generation, progressing from passive renderers to interactive, agentic, world-aware generators. We analyze key technical drivers, including flow matching, unified understanding-and-generation models, improved visual representations, post-training, reward modeling, data curation, synthetic data distillation, and sampling acceleration. We further show that current evaluations often overestimate progress by emphasizing perceptual quality while missing structural, temporal, and causal failures. By combining benchmark review, in-the-wild stress tests, and expert-constrained case studies, this roadmap offers a capability-centered lens for understanding, evaluating, and advancing the next generation of intelligent visual generation systems.

2605.18324 2026-06-16 cs.CV cs.AI cs.GR cs.LG stat.ML 版本更新

Improved Baselines with Representation Autoencoders

改进的基于表示自动编码器的基线

Jaskirat Singh, Boyang Zheng, Zongze Wu, Richard Zhang, Eli Shechtman, Saining Xie

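Affiliations: Adobe Research; ANU (Australian National University); New York University

AI summary: This paper studies design choices for Representation Autoencoders (RAE) and finds three insights that simplify and improve RAE. First, it studies a generalized formulation that defines the representation as the sum of the last k encoder layers rather than the final layer alone. Second, it examines the assumed relationship between RAE and representation alignment (REPA) and finds that the two have complementary working mechanisms. Finally, it improves RAE under classifier-free guidance (CFG): re-parameterizing the DiT output provides guidance without training a second model. RAEv2 reaches a gFID of 1.06 on ImageNet-256 with markedly higher training efficiency.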

详情

英文摘要

Representation Autoencoders (RAE) replace the traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights that simplify and improve RAE. First, we study a generalized formulation where the representation is defined as the sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for "free". Overall, RAEv2 leads to more than 10x faster convergence over the original RAE, achieving a state-of-the-art gFID of 1.06 in just 80 epochs on ImageNet-256. On FDr6, RAEv2 achieves a state-of-the-art 2.17 at just 80 epochs compared to the previous best 3.26 (800 epochs) without any post-training. This motivates EPFID@k (epochs to reach unguided gFID < k) as a measure of training efficiency. RAEv2 attains an EPFID@2 of 35 epochs, versus 177 for the original RAE. We also validate our approach across diverse settings for text-to-image generation and navigation world models, showing consistent improvements. The code is available at https://raev2.github.io.
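The EPFID@k metric defined above (epochs to reach unguided gFID < k) is easy to compute from a logged training curve. The sketch below follows that definition directly; the function name `epfid_at_k` and the curve values in the usage note are made up for illustration and are not results from the paper.

```python
from typing import Optional, Sequence, Tuple


def epfid_at_k(curve: Sequence[Tuple[int, float]], k: float) -> Optional[int]:
    """Return the first evaluated epoch whose unguided gFID is strictly below k.

    `curve` holds (epoch, gFID) pairs from periodic evaluation.
    Returns None if the threshold is never reached.
    """
    for epoch, gfid in sorted(curve):
        if gfid < k:
            return epoch
    return None
```

For a hypothetical curve `[(10, 5.0), (20, 2.8), (40, 1.9), (80, 1.5)]`, `epfid_at_k(curve, 2.0)` picks epoch 40. The result is only as fine-grained as the evaluation schedule, so two runs should be compared at matching evaluation intervals.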

2605.19876 2026-06-16 cs.CV 版本更新

Structural Energy Guidance for View-Consistent Text-to-3D Generation

基于结构能量的视图一致文本到3D生成

Qing Zhang, Jinguang Tong, Jing Zhang, Jie Hong, Xuesong Li

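Affiliations: Australian National University; CSIRO; The University of Hong Kong

AI summary: To address view inconsistency in diffusion-based text-to-3D generation, proposes SEGS, a training-free framework that builds a structural energy in the PCA subspace of U-Net features and injects it into the denoising process to improve multi-view consistency; experiments show it lowers the Janus Rate and raises view-consistency scores.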

详情
AI中文摘要

基于扩散模型的文本到3D生成常常面临Janus问题,导致不同视角下的几何不一致。本文识别出2D扩散先验中的视角偏差是主要原因,并提出结构能量引导采样(SEGS),一种无需训练且易于集成的框架,用于提升多视角一致性。SEGS在U-Net特征的PCA子空间中构建结构能量,并将其梯度注入去噪过程。该方法可轻松集成到SDS/VSD流程中而无需重新训练。实验表明,SEGS在平均情况下将Janus率降低约10%,并在多个基准上提升视图一致性评分,包括DreamFusion、Magic3D和LucidDreamer。该方法有效缓解了视角伪影,同时保持外观保真度,为高质量文本到3D内容生成提供了灵活的解决方案。

英文摘要

Text-to-3D generation based on diffusion models often suffers from the Janus problem, leading to inconsistent geometry across viewpoints. This work identifies viewpoint bias in 2D diffusion priors as the main cause and proposes Structural Energy-Guided Sampling (SEGS), a training-free and plug-and-play framework to improve multi-view consistency. SEGS constructs a structural energy in the PCA subspace of U-Net features and injects its gradient into the denoising process. It can be easily integrated into SDS/VSD pipelines without retraining. Experiments show that SEGS reduces the Janus Rate by about 10% on average and improves View-CS scores across multiple baselines, including DreamFusion, Magic3D, and LucidDreamer. This method effectively alleviates viewpoint artifacts while preserving appearance fidelity, providing a flexible solution for high-quality text-to-3D content generation.

2605.25449 2026-06-16 cs.CV 版本更新

Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion

Pantheon360: 通过3D感知的360°视频扩散驯服数字孪生生成

Ting-Hsuan Chen, Ying-Huan Chen, Tao Tu, Jie-Ying Lee, Cho-Ying Wu, Fangzhou Lin, Hengyuan Zhang, David Paz, Xinyu Huang, Yuliang Guo, Yu-Lun Liu, Yue Wang, Liu Ren

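Affiliations: University of Southern California; National Yang Ming Chiao Tung University; Cornell University; Bosch Research

AI summary: Proposes Pantheon360, a framework that uses an explicit 3D cache to generate high-fidelity videos from sparse 360° inputs with global geometric consistency and controllable camera paths, addressing the cross-view inconsistency and temporal drift caused by the limited field of view of conventional perspective video generators.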

Comments Accepted to CVPR 2026. Project page: https://koi953215.github.io/pantheon360_page/

详情
AI中文摘要

从视频生成完整的数字孪生需要精确的相机控制、全局场景覆盖以及严格的空间-时间一致性约束,由于透视视频生成器的视野(FoV)有限,这些要求仍然具有挑战性。其狭窄的FoV迫使使用长轨迹或多视图轨迹,从而加剧了跨视图不一致和时间漂移。我们认为360°视频生成提供了一种自然的解决方案:全景覆盖简化了轨迹设计,并为保持一致性提供了强大的全局上下文。我们提出Pantheon360:通过3D感知的360°视频扩散驯服数字孪生生成,这是一个可控的360°视频生成框架,能够从稀疏的360°输入合成高保真视频。关键思想是一个显式的3D缓存,从输入中重建,作为任何用户定义相机路径的几何骨架。这使得扩散模型可以专注于逼真的纹理细化,同时3D缓存强制执行全局几何一致性。实验表明,Pantheon360实现了卓越的视觉质量和无与伦比的几何一致性,为下游仿真和数字孪生应用提供了可靠且灵活的360°场景生成。

英文摘要

Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial-temporal consistency, requirements that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift. We argue that 360° video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides a strong global context for maintaining coherence. We introduce Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion, a controllable 360° video generation framework that synthesizes high-fidelity videos from sparse 360° inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360° scene generation for downstream simulation and digital-twin applications.

2605.29509 2026-06-16 cs.CV 版本更新

KGEdit: Ambiguity-Aware Knowledge Graphs for Training-Free Precise Video Generation and Editing

KGEdit: 面向无训练精确视频生成与编辑的歧义感知知识图谱

Mingshu Cai, Miao Zhang, Chenghe Yang, Yixuan Li, Osamu Yoshie, Yuya Ieiri

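Affiliations: Waseda University, Japan; College of Computer Engineering, Jimei University; Department of Language Science and Technology, The Hong Kong Polytechnic University; College of Computing and Data Science, Nanyang Technological University

AI summary: Proposes KGEdit, a framework that builds an ambiguity-aware knowledge graph and a structured semantic injection module to resolve semantic ambiguity, incorrect concept binding, and cross-frame inconsistency in text-to-video diffusion models, enabling precise video generation and editing without additional training.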

详情
AI中文摘要

近年来,无训练视频生成取得了显著进展。然而,在处理复杂文本指令时,现有方法仍存在语义歧义、概念绑定错误和跨帧不一致的问题。为解决这些问题,我们提出了KGEdit,一种用于文本到视频(T2V)扩散模型的结构化语义控制框架。具体而言,我们首先构建一个歧义感知知识图谱(AAKG)来解耦和消歧输入提示,将其转换为四种类型的结构化语义:身份、关系、属性和负约束。然后,我们设计了一个结构化语义注入模块(SSIM),将这些语义信号注入扩散Transformer的关键层,实现细粒度的语义控制。此外,我们引入了一个时间感知语义控制(TASC)模块,根据去噪过程的阶段特性动态调度语义目标,进一步提高了语义对齐和时间一致性。实验表明,KGEdit在编辑精度和时间稳定性方面优于现有方法,同时在文本驱动的交互场景中提供了更高的效率和可控性。

英文摘要

In recent years, training-free video generation has progressed remarkably. However, when handling complex textual instructions, existing methods still suffer from semantic ambiguity, incorrect concept binding, and cross-frame inconsistency. To address these issues, we propose KGEdit, a structured semantic control framework for text-to-video (T2V) diffusion models. Specifically, we first construct an ambiguity-aware knowledge graph (AAKG) to disentangle and disambiguate the input prompt, converting it into four types of structured semantics: identity, relation, attribute, and negative constraints. We then design a structured semantic injection module (SSIM) to inject these semantic signals into key layers of the diffusion Transformer, enabling fine-grained semantic control. In addition, we introduce a temporal-aware semantic control (TASC) module that dynamically schedules semantic objectives according to the stage-wise characteristics of the denoising process, further improving semantic alignment and temporal consistency. Experiments show that KGEdit outperforms existing methods in editing precision and temporal stability, while offering higher efficiency and controllability in text-driven interaction scenarios.
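The four types of structured semantics named above (identity, relation, attribute, negative constraints) can be pictured as a small record. The container below is purely illustrative: the class `StructuredSemantics`, its field layout, the `render` helper, and the example prompt are hypothetical and do not reflect KGEdit's actual graph format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class StructuredSemantics:
    """Hypothetical holder for the four semantic types extracted from a prompt."""
    identities: List[str] = field(default_factory=list)                 # who or what appears
    relations: List[Tuple[str, str, str]] = field(default_factory=list)  # (subject, predicate, object)
    attributes: List[Tuple[str, str]] = field(default_factory=list)      # (entity, attribute)
    negatives: List[str] = field(default_factory=list)                   # things that must not appear


def render(s: StructuredSemantics) -> Dict[str, List[str]]:
    """Flatten each semantic type into short text clauses, one list per type."""
    return {
        "identity": list(s.identities),
        "relation": [f"{subj} {pred} {obj}" for subj, pred, obj in s.relations],
        "attribute": [f"{attr} {entity}" for entity, attr in s.attributes],
        "negative": list(s.negatives),
    }
```

Keeping attributes tied to a named entity, rather than to the prompt as a whole, is what guards against a binding error such as "a black cat chases a dog" yielding a black dog.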

2606.01900 2026-06-16 cs.CV 版本更新

Auteur: Language-Driven Cinematographic Framing for Human-Centric Video Generation

Auteur: 以语言驱动的电影化取景实现以人为中心的视频生成

Muhammed Burak Kizil, Enes Sanli, Niloy J. Mitra, Xuelin Chen, Erkut Erdem, Aykut Erdem, Duygu Ceylan

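Affiliations: Koç University; University College London; Adobe; Hacettepe University

AI summary: Proposes Auteur, which parameterizes camera motion as human-centric framing (shot size, angle, and composition) and uses a Domain-Specific Language (DSL) with a fine-tuned multimodal large language model to achieve language-driven cinematographic framing, outperforming existing methods in human-centric video generation.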

Comments Project Page: https://cyberiada.github.io/Auteur/

详情
AI中文摘要

生成式视频模型在视觉保真度和时间连贯性方面取得了显著进展,但有意地控制相机仍然难以实现。现有框架将相机运动视为像素合成的副产品,产生的轨迹具有随机性、空间不一致性,并且对驱动场景的人类主体漠不关心。在这项工作中,我们提出了Auteur,一种用于生成式视频中语言驱动的、以人为中心的相机取景方法。我们的核心见解是,专业电影制作人构思镜头时并非将其视为世界空间中的轨迹,而是定义为相对于演员的取景,将镜头尺寸、角度和构图编码为人体姿态和运动的函数。我们将这一直觉形式化为一种以人为中心的相机参数化,并引入一种可转换为标准6自由度相机参数的领域特定语言(DSL)。然后,一个微调的多模态大语言模型充当虚拟导演,将自然语言描述和粗略的人体运动映射为稀疏的DSL关键帧,这些关键帧通过确定性插值生成连续的相机轨迹,并作为输入提供给视频生成器。我们在一个新数据集上训练和评估Auteur,该数据集包含34K个对齐的文本、人体运动和DSL标注的相机轨迹,这些轨迹来自程序化合成和CondensedMovies数据集中的真实电影片段。Auteur实现了以人为中心的场景的电影化取景,这一能力在先前的生成模型中基本缺失。为了评估这一行为,我们提出了新的以取景为中心的指标,实验表明Auteur持续优于现有方法。

英文摘要

Generative video models have achieved remarkable visual fidelity and temporal coherence, yet intentional camera control remains elusive. Existing frameworks treat camera motion as a byproduct of pixel synthesis, producing trajectories that are stochastic, spatially inconsistent, and indifferent to the human subject driving the scene. In this work, we present Auteur, a method for language-driven, human-centric camera framing in generative video. Our core insight is that professional filmmakers conceive shots not as world-space trajectories but as framings defined relative to the actor, encoding shot size, angle, and composition as functions of human pose and motion. We formalize this intuition as a human-centric camera parameterization and introduce a Domain-Specific Language (DSL) that is convertible to standard 6-DoF camera parameters. A fine-tuned multimodal large language model then acts as a virtual director, mapping natural language descriptions and coarse human motion to sparse DSL keyframes that are deterministically interpolated into continuous camera trajectories, which are then provided as input to video generators. We train and evaluate Auteur on a new dataset of 34K aligned text, human motion, and DSL-annotated camera trajectories drawn from procedural synthesis and real-world movie footage from the CondensedMovies dataset. Auteur enables cinematographic framing of human-centered scenes, a capability largely absent in prior generative models. To assess this behavior, we propose new framing-focused metrics, and our experiments show that Auteur consistently outperforms existing methods. Project page is https://cyberiada.github.io/Auteur/

2606.04621 2026-06-16 cs.CV cs.GR 版本更新

MeshFlow: Efficient Artistic Mesh Generation via MeshVAE and Flow-based Diffusion Transformer

MeshFlow: 通过MeshVAE和基于流的扩散Transformer实现高效艺术网格生成

Weiyu Li, Antoine Toisoul, Tom Monnier, Roman Shapovalov, Rakesh Ranjan, Ping Tan, Andrea Vedaldi

发表机构 * Meta AI Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出MeshFlow方法,利用变分自编码器将网格拓扑和顶点坐标映射到连续潜空间,并结合修正流Transformer并行生成网格,相比自回归方法速度提升18倍且精度优异。

Comments CVPR2026 Highlight, Homepage: https://mesh-flow.github.io/, Code: https://github.com/facebookresearch/meshflow

详情
AI中文摘要

我们提出MeshFlow,一种生成类艺术家3D网格的新方法。当前的网格生成器通常采用自回归(AR)下一个标记预测,鉴于网格拓扑的离散性,这是一个自然的选择。然而,AR方法扩展性差,因为推理成本随网格大小呈二次增长。它们还需要离散化顶点坐标,这引入了量化误差。为了解决这些挑战,我们引入了一个变分自编码器(VAE),通过对比损失监督,将连续的顶点位置和离散的连接性表示在连续潜空间中。这个潜空间比先前基于标记的网格表示紧凑得多。然后,我们基于修正流Transformer构建了一个3D生成器,并行生成所有网格顶点和边。我们的模型生成网格的速度比最快的AR生成器快18倍,同时在标准网格生成指标上实现了出色的精度。主页:https://mesh-flow.github.io/,代码:https://github.com/facebookresearch/meshflow

英文摘要

We present MeshFlow, a new method for generating artist-like 3D meshes. Current mesh generators often adopt Auto-Regressive (AR) next-token prediction, a natural choice given the discrete nature of mesh topology. However, AR methods scale poorly because the inference cost is quadratic in mesh size. They also require discretizing the vertex coordinates, which introduces quantization errors. To address these challenges, we introduce a Variational Autoencoder (VAE) that, supervised with a contrastive loss, represents both continuous vertex positions and discrete connectivity in a continuous latent space. This latent space is significantly more compact than prior token-based mesh representations. We then build a 3D generator based on a Rectified Flow transformer, generating all mesh vertices and edges in parallel. Our model generates meshes 18x faster than the fastest AR generator while also achieving excellent accuracy across standard mesh-generation metrics. Homepage: https://mesh-flow.github.io/, Code: https://github.com/facebookresearch/meshflow

2606.09076 2026-06-16 cs.CV 版本更新

Beyond Scalar Rewards by Internalizing Reasoning into Score Distributions

超越标量奖励:将推理内化到分数分布中

Xin Jin, Huanqia Cai, Zhen Li, Zechao Zhan, Dengyang Jiang, Aiming Hao, Yuming Jiang, Chunle Guo, Peng Gao, Ming-Ming Cheng, Steven C. H. Hoi

发表机构 * Alibaba Group(阿里巴巴集团) Nankai University(南开大学)

AI总结 提出Z-Reward框架,通过教师-学生模型将推理型奖励内化为紧凑VLM的分数分布,实现高效且准确的文本到图像优化。

Comments Z-Image Team Technical Report

详情
AI中文摘要

奖励模型对于文本到图像的后训练至关重要,但视觉偏好是主观的,更适合表示为评分标准分数上的分布而非确定性标量。现有的标量、评分令牌和成对奖励模型过度压缩了不确定性和细粒度评分差异,而基于推理的生成式奖励提供了更强的判断,但部署成本高且难以用作直接优化信号。我们提出Z-Reward,一种教师-学生奖励建模框架,将推理密集型判断与高效奖励部署解耦。教师是一个大型VLM,使用推理推断符合评分标准的分数分布,并通过分组直接分数优化(GDSO)进行训练,该优化结合了来自分布期望的策略梯度奖励以及关于分数分布和分数差距的直接点式和成对监督。学生通过推理内化分数蒸馏(RISD)进行训练,将教师的推理条件分数分布转移到紧凑VLM中,而无需在推理时使用显式推理链。在我们内部标注的评估集上,27B GDSO教师达到了89.6%的人类偏好准确率,优于SFT、RewardDance和GRPO,而9B RISD学生达到了88.6%,优于OPD基线并接近更大的教师。我们进一步表明,Z-Reward可以作为文本到图像优化的可微奖励信号,相对于SFT基线产生了41.3%的净人类偏好改进。

英文摘要

Reward models are central to text-to-image post-training, but visual preference is subjective and better represented as a distribution over rubric scores than as a deterministic scalar. Existing scalar, score-token, and pairwise reward models over-compress uncertainty and fine-grained score differences, while reasoning-based generative rewards provide stronger judgments but are costly to deploy and difficult to use as direct optimization signals. We propose Z-Reward, a teacher-student reward modeling framework that decouples reasoning-heavy judgment from efficient reward deployment. The teacher is a large VLM that uses reasoning to infer rubric-aligned score distributions, and is trained with Group-wise Direct Score Optimization (GDSO), which combines policy-gradient rewards from distribution expectations with direct pointwise and pairwise supervision on score distributions and score gaps. The student is trained with Reasoning-Internalized Score Distillation (RISD), which transfers the teacher's reasoning-conditioned score distribution into a compact VLM without requiring explicit reasoning chains at inference time. On our internally annotated evaluation set, the 27B GDSO teacher reaches 89.6% human preference accuracy, outperforming SFT, RewardDance, and GRPO, while the 9B RISD student reaches 88.6%, outperforming the OPD baseline and closely matching the larger teacher. We further show that Z-Reward can serve as a differentiable reward signal for text-to-image optimization, yielding a 41.3% net human-preference improvement over the SFT baseline.
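The abstract's point that a distribution over rubric scores keeps information a scalar discards can be illustrated with a minimal Python sketch. The five-point rubric, the two example distributions, and the helper names `expected_score` and `score_variance` are assumptions made for illustration; this is not Z-Reward's implementation.

```python
RUBRIC = [1, 2, 3, 4, 5]  # hypothetical 5-point rubric (assumption)


def expected_score(dist):
    """Collapse a score distribution to its expectation (the scalar view)."""
    assert abs(sum(dist) - 1.0) < 1e-9, "probabilities must sum to 1"
    return sum(score * prob for score, prob in zip(RUBRIC, dist))


def score_variance(dist):
    """Spread of the distribution around its expectation (lost by a scalar)."""
    mean = expected_score(dist)
    return sum(prob * (score - mean) ** 2 for score, prob in zip(RUBRIC, dist))


# Two illustrative images: raters agree on the first and split on the second.
consensus = [0.0, 0.0, 1.0, 0.0, 0.0]
polarized = [0.5, 0.0, 0.0, 0.0, 0.5]

for name, dist in [("consensus", consensus), ("polarized", polarized)]:
    print(name, expected_score(dist), score_variance(dist))
```

Both distributions have the same expectation, so a scalar reward cannot tell them apart, while the variance separates agreement from disagreement.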

2606.09150 2026-06-16 cs.CV 版本更新

Ultra Flash: Scaling Real-Time Streaming Video Generation to High Resolutions

Ultra Flash: 将实时流式视频生成扩展到高分辨率

Luxury, Jie Huang, Zihao Fan, Xiaoxiao Ma, Jun-hao Zhuang, Yuming Li, Zeyue Xue, Siming Fu, Haoran Li, Mingchen Zhong, Guohui Zhang, Shichen Ma, Yijun Liu, Jiaqi Shi, Yanwen Ma, Yaofeng Su, Haoyu Wang, Yaowei Li, Songchun Zhang, Weiyang Jin, Yuxuan Bian, Shiyi Zhang, Haojun Xu, Shuai Lu, Xin Han, Wei Tang, Haoyang Huang, Nan Duan

发表机构 * JD Explore Academy(京东探索研究院) USTC(中国科学技术大学) PKU(北京大学) THU(清华大学) BUAA(北京航空航天大学) FDU(复旦大学) HKUST(香港科技大学) HKU(香港大学) CUHK(香港中文大学)

AI总结 提出Ultra Flash级联框架,通过架构保持的超分辨率训练、因果流式潜在上采样器和高分辨率解码器、以及级联优化方案,在单GPU上实现1K分辨率约30 FPS和2K分辨率约18 FPS的实时高分辨率流式视频生成。

详情
AI中文摘要

尽管最近的自回归视频扩散模型在流式质量上取得了显著成果,但它们仍局限于低分辨率(如480P),使得高效、可扩展的实时高分辨率视频生成成为一个根本性的开放挑战。为弥补这一差距,我们提出了Ultra Flash,一个能够实时生成高分辨率视频的级联流式框架。Ultra Flash在单GPU上实现约30 FPS(1K分辨率)和约18 FPS(2K分辨率),通过三个关键贡献:(1)一种保持架构的T2V到TV2V超分辨率训练范式,结合面向AIGC的数据降级流水线,有效保留基础模型的生成能力,从而在级联到主流低分辨率生成模型后增强高分辨率细节;(2)一个因果流式潜在上采样器与高分辨率解码器配对,增强时空连贯性,同时实现高效的潜在空间缩放和精确的高分辨率解码,且计算开销可忽略;(3)一种级联高分辨率流式视频生成优化方案,首先对超分辨率模型进行混合奖励增强的稀疏因果化和单步蒸馏,然后引入带有动态缓存管理的级联流式自强迫偏好优化,共同增强整体连贯性、提高质量,并实现实时高分辨率流式视频生成。大量实验表明,Ultra Flash能够可靠地生成超高分辨率流式视频,同时保持最先进的视觉质量和卓越效率。

英文摘要

While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge. To bridge this gap, we present Ultra Flash, a cascaded streaming framework capable of real-time high-resolution video generation. Ultra Flash achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single GPU through three key contributions: (1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm coupled with an AIGC-oriented data degradation pipeline that effectively preserves the generative capability of the base model, enabling enhanced high-resolution detail when cascaded after mainstream low-resolution generative models; (2) a causal streaming latent upsampler paired with a high-resolution decoder, which enhances spatiotemporal coherence while enabling efficient latent spatial scaling and precise high-resolution decoding with negligible computational overhead; and (3) a cascade high-resolution streaming video generation optimization scheme that first performs hybrid-reward-enhanced sparse causalization and single-step distillation of the super-resolution model, then introduces cascaded streaming self-forcing preference optimization with dynamic cache management, jointly enhancing overall coherence, improving quality, and enabling real-time high-resolution streaming video generation. Extensive experiments demonstrate that Ultra Flash reliably produces ultra-high-resolution streaming video while maintaining state-of-the-art visual quality and superior efficiency. Project Page: https://xin1u.github.io/UltraFlash/

2606.11751 2026-06-16 cs.CV cs.AI 版本更新

AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory

AnchorEdit: 通过因果记忆在多轮图像编辑中保持时间一致性

Hang Xu, Xiaoxiao Ma, Guohui Zhang, Yu Hu, Siming Fu, Jie Huang, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学) JD Explore Academy(京东探索研究院)

AI总结 提出首个自回归扩散框架AnchorEdit,通过因果记忆机制和自展开策略解决多轮编辑中的身份漂移和误差累积问题,在10轮以上交互中保持高保真度。

Comments Code: https://github.com/xuhang07/AnchorEdit

详情
AI中文摘要

多轮图像编辑对于迭代设计至关重要,但当前模型在连续步骤中常面临身份漂移和误差累积。现有研究利用视频先验保持一致性,但其依赖的双向注意力与交互式编辑的因果、顺序性质根本不符。本文提出AnchorEdit,首个专为高分辨率、长期多轮编辑设计的自回归(AR)扩散框架。AnchorEdit通过三阶段训练课程弥合视频先验与因果推理之间的差距:保持身份的单轮预训练、使用新颖的自展开策略进行因果AR强制微调以缓解暴露偏差,以及用于高效4步生成的一致性蒸馏。在推理过程中,我们引入记忆机制来锚定初始主体身份,并确保在扩展编辑轨迹上的稳定外推。为评估性能,我们提供了一个新的高分辨率多轮编辑基准,旨在压力测试长期稳定性。大量实验表明,AnchorEdit达到了最先进的结果,即使在10轮以上的交互中也能保持卓越的主体保真度和指令遵循能力。

英文摘要

Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term multi-turn editing. AnchorEdit bridges the gap between video priors and causal inference through a three-stage training curriculum: identity-preserving single-turn pretraining, causal AR forcing fine-tuning with a novel self-rollout strategy to mitigate exposure bias, and consistency distillation for efficient 4-step generation. During inference, we introduce a memory mechanism to anchor the initial subject identity and ensure stable extrapolation across extended editing trajectories. To evaluate performance, we provide a new high-resolution multi-turn editing benchmark designed to stress-test long-horizon stability. Extensive experiments demonstrate that AnchorEdit achieves state-of-the-art results, maintaining exceptional subject fidelity and instruction following even over 10+ interaction rounds.

2606.13655 2026-06-16 cs.CV cs.GR 版本更新

Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

Flex4DHuman:面向4D人体重建的灵活多视角视频扩散模型

Jen-Hao Cheng, Yipeng Wang, Hao Zhang, Gengshan Yang, Jenq-Neng Hwang

发表机构 * University of Washington(华盛顿大学) World Labs

AI总结 提出Flex4DHuman,一种基于相对相机位姿条件化的多视角视频扩散模型,无需显式几何先验即可将单目或稀疏多视角视频转换为密集多视角视频,并用于4D高斯溅射重建。

Comments Project Page: https://andy-cheng.github.io/Flex4DHuman/

详情
AI中文摘要

我们提出Flex4DHuman,一种多视角视频扩散模型,它通过仅使用相对相机位姿条件化,将动态主体的单目或稀疏多视角视频转换为同步的密集多视角视频。与先前依赖骨架、深度图、法线或渲染目标视角几何的人体中心方法不同,Flex4DHuman不需要显式几何先验,而是通过相对相机位姿位置编码来条件化生成。生成的视频可直接被下游重建流程用于创建动态4D高斯溅射。基于Wan 2.1 1.3B文本到视频模型,Flex4DHuman保留了骨干架构,并通过五轴位置编码编码相机和视角信息,该编码将时空RoPE扩展了视角索引和连续SE(3)相对相机几何。三阶段课程逐步训练模型以进行位姿跟随、灵活的参考到目标视角生成以及时间展开。为支持时间展开,我们使用干净的历史目标视角令牌进行训练。我们还添加了多视角字幕以实现测试时文本控制。结合现成的4D高斯溅射阶段,我们的框架将单目静态相机视频提升为动态4D高斯溅射。在DNA-Rendering和ActorsHQ上的实验表明,Flex4DHuman超越了先前最先进的方法,而相同的公式在混合人体-动物训练后泛化到动物类别。这些能力使Flex4DHuman成为从随意单目视频进行可扩展4D内容创建的实际一步,适用于仿真、游戏、AR/VR和视频重拍。

英文摘要

We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.

2509.24223 2026-06-16 cs.LG cs.CV stat.ML 版本更新

Semantic Editing with Coupled Stochastic Differential Equations

耦合随机微分方程的语义编辑

Jianxin Zhang, Clayton Scott

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出耦合随机微分方程(coupled SDEs)引导预训练生成模型的采样过程,无需重新训练即可实现高提示保真度和近像素级一致性的语义编辑。

详情
AI中文摘要

使用预训练的文本到图像模型编辑图像内容仍然具有挑战性。现有方法常常扭曲细节或引入意外伪影。我们提出使用耦合随机微分方程(coupled SDEs)来引导任何可以通过求解SDE进行采样的预训练生成模型的采样过程,包括扩散模型和整流流模型。通过用相同的相关噪声驱动源图像和编辑图像,我们的方法将新样本引导至所需语义,同时保持与源图像的视觉相似性。该方法开箱即用,无需重新训练或辅助网络,并实现了高提示保真度和近像素级一致性。这些结果使耦合SDE成为受控生成式AI的简单而强大的工具。项目页面:https://z-jianxin.github.io/syncSDE-release/,代码:https://github.com/Z-Jianxin/syncSDE-release

英文摘要

Editing the content of an image with a pretrained text-to-image model remains challenging. Existing methods often distort fine details or introduce unintended artifacts. We propose using coupled stochastic differential equations (coupled SDEs) to guide the sampling process of any pretrained generative model that can be sampled by solving an SDE, including diffusion and rectified flow models. By driving both the source image and the edited image with the same correlated noise, our approach steers new samples toward the desired semantics while preserving visual similarity to the source. The method works out-of-the-box, without retraining or auxiliary networks, and achieves high prompt fidelity along with near-pixel-level consistency. These results position coupled SDEs as a simple yet powerful tool for controlled generative AI. Project page: https://z-jianxin.github.io/syncSDE-release/. Code: https://github.com/Z-Jianxin/syncSDE-release.
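The effect of driving two SDEs with shared noise can be seen in a one-dimensional toy sketch. Everything here is an assumption for illustration: the Ornstein-Uhlenbeck drift stands in for a pretrained model's learned drift, identical Brownian increments are used as the simplest form of coupling, and `simulate_pair` is a hypothetical helper rather than the authors' code.

```python
import math
import random


def simulate_pair(mu_src, mu_edit, shared_noise, steps=500, dt=0.01,
                  theta=1.0, sigma=0.5, seed=0):
    """Euler-Maruyama for two toy SDEs dx = theta * (mu - x) dt + sigma dW.

    mu_src / mu_edit play the role of the source and edit targets.
    """
    rng = random.Random(seed)
    x_src = x_edit = 0.0
    for _ in range(steps):
        dw_src = rng.gauss(0.0, math.sqrt(dt))
        # Coupled case: reuse the same increment; otherwise draw a fresh one.
        dw_edit = dw_src if shared_noise else rng.gauss(0.0, math.sqrt(dt))
        x_src += theta * (mu_src - x_src) * dt + sigma * dw_src
        x_edit += theta * (mu_edit - x_edit) * dt + sigma * dw_edit
    return x_src, x_edit


coupled = simulate_pair(0.0, 2.0, shared_noise=True)
independent = simulate_pair(0.0, 2.0, shared_noise=False)
print("coupled gap:", coupled[1] - coupled[0])
print("independent gap:", independent[1] - independent[0])
```

In this toy setting the shared noise cancels in the difference of the two trajectories, so the coupled gap depends only on the difference in drift targets, whereas the independent pair also differs by accumulated random noise.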

7. 3D视觉、点云与空间智能 23 篇

2606.14811 2026-06-16 cs.CV 新提交

S23DR 2026: End-to-End 3D Wireframe Prediction via DETR-Style Set Prediction with Contrastive Denoising

S23DR 2026:基于对比去噪的DETR风格集合预测实现端到端3D线框预测

Nitiz Khanal

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出WireframeDETR方法,直接对3D点云进行DETR风格集合预测,无需中间顶点检测,通过对比去噪训练、多尺度编码器和渐进辅助损失权重实现端到端3D线框预测,在S23DR 2026挑战赛上取得0.575 HSS。

Comments Technical report; S23DR 2026 Challenge submission

详情
AI中文摘要

我们提出了WireframeDETR,这是我们对结构化语义3D重建(S23DR)2026挑战赛的提交,该挑战赛要求从多视图COLMAP点云预测3D建筑线框。我们的方法直接将DETR风格的集合预测应用于3D点云,生成作为边坐标对集合的线框,无需任何中间顶点检测阶段。我们引入了三项技术贡献:(1)对比去噪训练,稳定早期epoch中嘈杂的匈牙利匹配;(2)多尺度编码器,通过学习的标量权重聚合最后一个编码器层的输出;(3)渐进辅助损失权重,将梯度信号集中在最受益的解码器层上。我们的模型在公共测试集上达到0.575 HSS(F1 = 0.664,IoU = 0.516),在清理后的验证集上达到最佳验证HSS 0.534。

英文摘要

We present WireframeDETR, our submission to the Structured Semantic 3D Reconstruction (S23DR) 2026 Challenge, which requires predicting a 3D building wireframe from multi-view COLMAP point clouds. Our method applies DETR-style set prediction directly to 3D point clouds, producing wireframes as sets of edge coordinate pairs without any intermediate vertex detection stage. We introduce three technical contributions: (1) contrastive denoising training that stabilises noisy Hungarian matching in early epochs; (2) a multi-scale encoder that aggregates the last encoder layer outputs via learned scalar weights; and (3) progressive auxiliary loss weighting that concentrates gradient signal on the decoder layers that most benefit from it. Our model achieves a public test HSS of 0.575 (F1 = 0.664, IoU = 0.516) and a best validation HSS of 0.534 on the cleaned val split.
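DETR-style set prediction relies on a one-to-one matching between predicted and ground-truth elements. The sketch below shows that matching step for 3D edges in plain Python. The orientation-invariant endpoint cost and the helper names `edge_cost` and `match_edges` are assumptions for illustration; the abstract does not specify the matching cost, and the brute-force search is only viable for a handful of edges (larger problems would typically use a Hungarian solver such as `scipy.optimize.linear_sum_assignment`).

```python
import math
from itertools import permutations


def edge_cost(pred, gt):
    """Sum of endpoint distances between two 3D edges, ignoring endpoint order."""
    (p0, p1), (g0, g1) = pred, gt
    same = math.dist(p0, g0) + math.dist(p1, g1)
    flipped = math.dist(p0, g1) + math.dist(p1, g0)
    return min(same, flipped)


def match_edges(preds, gts):
    """Return (total_cost, assignment); assignment[i] is the prediction matched to gts[i].

    Exhaustive search over one-to-one assignments (illustration only).
    """
    assert len(preds) >= len(gts), "need at least one prediction per target"
    best_cost, best_perm = math.inf, ()
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(edge_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return best_cost, best_perm


gts = [((0, 0, 0), (1, 0, 0)), ((0, 0, 0), (0, 1, 0))]
preds = [((0, 1, 0), (0, 0, 0)),   # second target with endpoints swapped
         ((5, 5, 5), (6, 6, 6)),   # spurious edge, left unmatched
         ((0, 0, 0), (1, 0, 0))]   # first target
print(match_edges(preds, gts))
```

Unmatched predictions (index 1 here) would be supervised as "no object" in a DETR-style loss.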

2606.15328 2026-06-16 cs.CV 新提交

SGFormer++: Semantic Graph Transformer for Incremental 3D Scene Graph Generation

SGFormer++:用于增量式3D场景图生成的语义图Transformer

Mengshi Qi, Changsheng Lv, Zijian Fu, Xianlin Zhang, Huadong Ma

发表机构 * State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications(北京邮电大学网络与交换技术国家重点实验室)

AI总结 提出SGFormer++,通过图嵌入层和语义注入层实现全局消息传递,并引入空间引导特征适配器和级联二值预测头解决增量场景图生成中的灾难性遗忘问题,在3DSSG基准上达到最优性能。

详情
AI中文摘要

本文提出SGFormer++,一种用于3D场景图生成(SGG)的新型语义图Transformer,旨在将点云场景解析为语义结构图,其中节点表示检测到的对象实例,边编码它们的成对关系,核心挑战在于建模复杂的全局场景结构。现有基于图卷积网络(GCN)的方法存在过平滑和感受野有限的问题,而SGFormer++利用Transformer层作为骨干网络实现全局消息传递。具体地,我们引入了两个专为3D SGG定制的关键组件:(1)图嵌入层++,以线性计算复杂度高效集成边感知的全局上下文;(2)语义注入层++,利用来自大语言模型(LLM)和视觉-语言模型(VLM)的语言先验丰富视觉特征,在不引入额外可训练参数的情况下增强语义表示。为进一步解决增量式SGG(I-SGG)的实际挑战(其中新的关系类别顺序到达),我们为SGFormer++配备了新颖的空间引导特征适配器,利用主语-宾语空间几何校准谓词特征以应对尺度变化,以及级联二值预测头,通过任务增量分类器扩展和logit蒸馏缓解灾难性遗忘。在3DSSG基准上的大量实验表明,SGFormer++在标准和增量设置下均达到最先进性能:在增量设置下,谓词A@1绝对提升4.49%。代码和数据可在 https://github.com/Andy20178/SGFormer 获取。

英文摘要

In this paper, we propose SGFormer++, a novel Semantic Graph Transformer for 3D scene graph generation (SGG), which aims to parse point cloud scenes into semantic structural graphs, where nodes denote detected object instances and edges encode their pairwise relationships, with the core challenge lying in modeling complex global scene structure. While existing graph convolutional network (GCN)-based methods suffer from over-smoothing and limited receptive fields, SGFormer++ leverages Transformer layers as its backbone to enable global message passing. Specifically, we introduce two key components tailored for 3D SGG: (1) a Graph Embedding Layer++ that efficiently integrates edge-aware global context with linear computational complexity, and (2) a Semantic Injection Layer++ that enriches visual features with linguistic priors from large language models (LLMs) and vision-language models (VLMs), boosting semantic representation without introducing extra trainable parameters. To further address the practical challenge of incremental SGG (I-SGG), where new relationship categories arrive sequentially, we equip SGFormer++ with a novel Spatial-guided Feature Adapter, which calibrates predicate features using subject-object spatial geometry to counter scale variation, and a Cascaded Binary Prediction Head that mitigates catastrophic forgetting via task-incremental classifier expansion and logit distillation. Extensive experiments on the 3DSSG benchmark demonstrate that SGFormer++ achieves state-of-the-art performance in both standard and incremental settings: it yields a significant 4.49% absolute improvement in Predicate A@1 under the incremental setting. Code and data are available at: https://github.com/Andy20178/SGFormer.

2606.15659 2026-06-16 cs.CV 新提交

SpatialAvatar-0: High-Quality 4D Head Avatar with Multi-Stage Reconstruction

SpatialAvatar-0: 高质量4D头部虚拟形象的多阶段重建

Yiran Wang, Zeyu Zhang, Yuanming Li, Ziming Wang, Yang Zhao

发表机构 * USYD(悉尼大学) SpatialReal ZJU(浙江大学) La Trobe(拉筹伯大学)

AI总结 提出基于FLAME网格绑定高斯表示的多阶段框架,通过前馈生成器和10K迭代布局保持微调,实现跨域零样本和单目基准领先性能。

详情
AI中文摘要

高质量4D头部虚拟形象(来自一张或少量源肖像)是远程呈现、AR/VR和数字人交互的核心。3D高斯泼溅(3DGS)已成为主导表示,两个互补范式(可泛化的前馈预测器和逐主体精炼器)并行成熟。然而,现有前馈预测器在单一数据集族上训练,具有硬编码的源数量,继承了相应的领域偏差。逐主体精炼器需要30万至60万次迭代,并依赖自适应致密化,这会破坏上游高斯布局,导致两个范式无法端到端共享表示。为桥接两个范式,我们提出SpatialAvatar-0,基于共享的FLAME网格绑定高斯表示:一个前馈生成器,具有无参数的K源均值池化,以及一个从单目时序到多视角空间的两阶段调度,防止身份先验在小多视角集上坍缩。我们进一步引入一个10K迭代的布局保持逐主体精炼循环,冻结FLAME绑定和高斯数量,并用三分量抗尖峰正则化替代致密化。在VFHQ/HDTF跨域零样本上,我们超越域内领先者GAGAvatar +1.5 dB PSNR,尽管从未在任一测试域上训练;在SplattingAvatar单目基准上,我们领先所有报告指标,超越30万次迭代的GeoAvatar +1.3 dB PSNR,且逐主体调度比常见SOTA基线短至60倍。网站:https://spatialwalk.github.io/SpatialAvatar-0。

英文摘要

High-quality 4D head avatars from one or a few source portraits are central to telepresence, AR/VR, and digital-human interaction. 3D Gaussian Splatting (3DGS) has emerged as the dominant representation, with two complementary regimes (generalizable feed-forward predictors and per-subject refiners) maturing in parallel. However, existing feed-forward predictors are trained on a single dataset family with a hard-coded source count, inheriting the corresponding domain bias. Per-subject refiners require 300K-600K iterations and rely on adaptive densification that destroys upstream Gaussian layouts, preventing the two regimes from sharing a representation end-to-end. To bridge both regimes, we propose SpatialAvatar-0 on a shared FLAME-mesh-bound Gaussian representation: a feed-forward generator with a parameter-free K-source mean-pool and a monocular-temporal to multi-view-spatial two-phase schedule that anchors against identity-prior collapse onto the smaller multi-view set. We further introduce a 10K-iter layout-preserving per-subject refinement loop that freezes the FLAME-binding and Gaussian count and replaces densification with a three-component anti-spike regularization. On VFHQ/HDTF cross-domain zero-shot, we surpass the in-domain leader GAGAvatar by +1.5 dB PSNR despite never training on either test domain, and on the SplattingAvatar monocular benchmark we lead every reported metric, surpassing the 300K-iter GeoAvatar by +1.3 dB PSNR with a per-subject schedule up to 60x shorter than common SOTA baselines. Website: https://spatialwalk.github.io/SpatialAvatar-0.

2606.15681 2026-06-16 cs.CV 新提交

3D Consistency Optimization for Self-Supervised Monocular Video Depth Estimation

自监督单目视频深度估计的3D一致性优化

Yuanye Liu, Ke Zhang, Junzhe Jiang, Li Zhang, Vishal Patel, Xiahai Zhuang

发表机构 * Fudan University(复旦大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出一种将视频深度估计转化为多视图3D重建的框架,通过光度渲染、世界坐标对齐和多尺度时间梯度一致性约束,实现全局3D结构一致性,在自监督和零样本临床场景中达到最先进的空间精度。

详情
AI中文摘要

可靠的单目视频深度估计对于内窥镜导航中的下游3D推理和具身AI至关重要。然而,现有的自监督方法通常独立处理视频帧或依赖弱时间正则化。这些方法缺乏对底层3D场景的整体感知,不可避免地遭受几何不一致的预测和严重的跨帧漂移。为了解决这些限制,我们引入了一种新范式,将顺序视频深度估计重新表述为无约束的多视图3D重建问题,从而能够充分利用嵌入在最近3D基础模型中的强大几何先验。我们方法的核心是一个由三个约束驱动的3D一致性优化框架:图像级光度渲染、显式世界坐标几何对齐和多尺度时间梯度一致性。这种统一优化优雅地将孤立帧锚定到全局一致的3D结构上。我们的方法在自监督训练场景和具有挑战性的零样本临床环境中都得到了验证。结果表明,所提出的方法实现了最先进的空间精度,优于基于帧、基于视频的深度估计器和多视图3D重建基线。

英文摘要

Reliable monocular video depth estimation is crucial for downstream 3D reasoning and embodied AI in endoscopic navigation. However, existing self-supervised approaches typically treat video frames independently or rely on weak temporal regularization. These methods, lacking a holistic perception of the underlying 3D scene, inevitably suffer from geometrically inconsistent predictions and severe cross-frame drift. To address these limitations, we introduce a new paradigm that recasts sequential video depth estimation as an unconstrained multi-view 3D reconstruction problem, enabling full exploitation of the powerful geometric priors embedded in recent 3D foundation models. The core of our approach is a 3D consistency optimization framework driven by three constraints: image-level photometric rendering, explicit world-coordinate geometric alignment, and multi-scale temporal gradient consistency. Such unified optimization elegantly anchors isolated frames to a globally coherent 3D structure. Our method has been validated in both the self-supervised training scenarios and challenging zero-shot clinical environments. Results show that the proposed approach achieves state-of-the-art spatial accuracy, outperforming the frame-based, video-based depth estimators and the multi-view 3D reconstruction baselines.

2606.15908 2026-06-16 cs.CV 新提交

High-Fidelity 4D Hand-Object Capture via Multi-View Spatiotemporal Tracking and Physics-Aware Gaussians

高保真4D手-物体捕捉:基于多视角时空追踪和物理感知高斯模型

Bo Peng, Xu Chen, Yi Gu, Hidenobu Matsuki, Mingsong Dou, Jingjing Shen, Deying Kong, Juyong Zhang, Zhengyang Shen

发表机构 * Google XR(谷歌XR) University of Science and Technology of China (USTC)(中国科学技术大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出无需模板和标记的多视角系统,通过跨视角几何与时间线索的Transformer初始化,结合物理感知高斯优化,实现鲁棒且无伪影的4D手-物体交互重建。

Comments Project page: https://zyshen021.github.io/HOSTPG/

详情
AI中文摘要

具身AI和空间计算中对高保真4D手-物体交互(HOI)数据的需求日益增长,但目前受限于对预扫描物体模板和物理标记的依赖。尽管近期方法在从视频重建4D手-物体交互方面取得了有希望的结果,但它们对手和物体姿态的初始估计高度敏感。然而,从图像中估计这些姿态具有挑战性,尤其是在手-物体交互场景中固有的严重遮挡下。我们提出了一种新颖系统,用于从同步且校准的多视角视频中鲁棒且精确地重建手和物体,无需任何模板或标记。我们的系统包含两个主要创新组件:(1)一个多视角前馈Transformer模型,聚合跨视角几何和时间线索,为姿态和密集物体几何提供可靠的、度量一致的初始化;(2)一个手-物体物理感知高斯优化框架,用于细化初始估计,集成四面体约束、碰撞细化和外观分解,以产生物理上合理且视觉上精确的重建。在公共基准和广泛内部数据集上的验证表明,我们的流程实现了高度鲁棒、无伪影的重建,为自动化4D资产生成提供了高效基础。我们的项目页面位于https://zyshen021.github.io/HOSTPG/。

英文摘要

The growing demand for high-fidelity 4D hand-object interaction (HOI) data in embodied AI and spatial computing is currently bottlenecked by the reliance on pre-scanned object templates and physical markers. While recent methods have demonstrated promising results in reconstructing 4D hand-object interaction from videos, they are highly sensitive to initial estimates of hand and object poses. Yet estimating these poses from images is challenging, in particular under the severe occlusion that is inherent in hand-object interaction scenarios. We propose a novel system for the robust and accurate reconstruction of hands and objects from synchronized and calibrated multi-view videos without requiring any templates or markers. Our system consists of two main components with key innovations: (1) a multi-view feed-forward transformer model that aggregates cross-view geometry and temporal cues to provide a reliable, metric-consistent initialization for both poses and dense object geometry, and (2) a hand-object physics-aware Gaussian-based optimization framework to refine the initial estimates, integrating tetrahedral constraints, collision refinement, and appearance decomposition to produce physically plausible and visually accurate reconstruction. Validated on public benchmarks and an extensive internal dataset, our pipeline achieves highly robust, artifact-free reconstruction, providing an efficient foundation for automated 4D asset generation. Our project page is available at https://zyshen021.github.io/HOSTPG/.

2606.15924 2026-06-16 cs.CV cs.GR 新提交

TurboGS: Accelerating 3D Gaussian Splatting via Error-Guided Sparse Pixel Sampling and Optimization

TurboGS: 通过误差引导的稀疏像素采样与优化加速3D高斯泼溅

Zheng Dong, Daifei Qiu, Pinxuan Dai, Ke Xu, Jiamin Xu, Lili He, Rynson W. H. Lau, Weiwei Xu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出TurboGS框架,通过误差引导的稀疏像素采样、结构感知损失、动态密度控制和混合优化器,在保持高保真渲染质量的同时实现高达10倍的训练加速。

Comments Accepted by ICML2026. Project page: https://zhengdong.site/projects/TurboGS/

详情
AI中文摘要

消费级应用需要快速优化3D高斯泼溅(3DGS)以实现高保真新视角渲染。然而,现有的3DGS加速方法在牺牲细节的同时,仍会在冗余像素上产生大量计算。本文提出TurboGS,一种误差引导的训练框架,通过将优化集中在感知信息丰富的像素上来加速3DGS。TurboGS基于四个核心组件:(1)瓦片级稀疏像素采样,由训练期间的多视图重建误差驱动,优先处理困难区域并跳过重建良好的区域以避免冗余梯度计算;(2)带有稀疏归一化互相关的瓦片级结构感知损失,提供稀疏但有效的监督以保留细节并稳定训练;(3)误差驱动的高斯密度控制策略,动态分配模型容量并移除冗余基元;(4)定制的混合优化器,将Hessian信息更新与Adam动量阻尼相结合,以稳定和改善稀疏监督下的收敛。标准基准实验表明,TurboGS在单个RTX 5090 GPU上可在100秒内提供与原始3DGS相当或更优的渲染质量(训练速度提升高达10倍)。

英文摘要

Consumer-level applications require fast optimization of 3D Gaussian Splatting (3DGS) with high-fidelity novel view rendering. However, existing 3DGS acceleration approaches still incur substantial computation on redundant pixels while sacrificing fine details. In this paper, we present TurboGS, an error-guided training framework that accelerates 3DGS by concentrating optimization on perceptually informative pixels. TurboGS is built upon four core components: (1) a tile-wise sparse pixel sampling, which, driven by multi-view reconstruction errors during training, prioritizes challenging regions and skips well-reconstructed ones to avoid redundant gradient computation; (2) a tile-wise structure-aware loss with sparse Normalized Cross-Correlation, which provides sparse yet effective supervision to preserve fine details and stabilize training; (3) an error-driven Gaussian density control strategy, which dynamically allocates model capacity and removes redundant primitives; and (4) a tailored hybrid optimizer that couples Hessian-informed updates with Adam moment damping to stabilize and improve convergence under sparse supervision. Experiments on standard benchmarks demonstrate that TurboGS can deliver on par or superior rendering quality within 100 seconds on a single RTX 5090 GPU card (up to 10x training speedup over vanilla 3DGS).
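Two of the ideas in this abstract, error-guided selection of pixels and a Normalized Cross-Correlation (NCC) computed only on the selected pixels, can be sketched in a few lines of Python. This is a simplification under stated assumptions: it ranks individual pixels of a flattened image rather than sampling tile-wise as the paper describes, and `pick_high_error_pixels` and `sparse_ncc` are hypothetical helper names, not TurboGS functions.

```python
import math


def pick_high_error_pixels(errors, keep):
    """Indices of the `keep` pixels with the largest reconstruction error."""
    order = sorted(range(len(errors)), key=lambda i: errors[i], reverse=True)
    return order[:keep]


def sparse_ncc(rendered, target, indices):
    """Normalized cross-correlation restricted to the sampled pixel indices."""
    a = [rendered[i] for i in indices]
    b = [target[i] for i in indices]
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    denom = math.sqrt(var_a * var_b)
    # Flat patches have no structure to correlate; report 0 by convention.
    return cov / denom if denom > 0 else 0.0


rendered = [0.1, 0.4, 0.2, 0.8, 0.5]
target = [0.1, 0.6, 0.2, 0.3, 0.5]
errors = [abs(r - t) for r, t in zip(rendered, target)]
sampled = pick_high_error_pixels(errors, keep=3)
print(sampled, sparse_ncc(rendered, target, sampled))
```

Because NCC is invariant to affine intensity changes, a structure-aware loss of the form `1 - sparse_ncc(...)` penalizes mismatched structure rather than brightness offsets, which is one plausible reason to pair it with sparse supervision.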

2606.15966 2026-06-16 cs.CV cs.GR 新提交

VEPHand: View-Efficient Photometric Hand Performance Capture at Scale

VEPHand: 大规模视图高效光度手部性能捕捉

Zhengyang Shen, Kai-Hung Chang, Erroll Wood, Deying Kong, Bo Peng, Timo Bolkart, Jinlong Yang, Bowen Zhao, Danhang Tang, Sasa Petrovic, Emre Aksan, Jérémy Riviere, Vassilis Choutas, Delio Vicini, Jay Busch, Shichen Liu, Zhe Cao, Hugh Liu, JingJing Shen, Jonathan Taylor, Mingsong Dou

发表机构 * Google XR

AI总结 提出面向有限视角(约20个)的端到端手部动态捕捉与配准管线,通过无掩膜神经方法和物理启发框架解决几何歧义与自接触变形难题,在12000+序列上验证了高保真重建与配准。

详情
AI中文摘要

鲁棒、高保真的3D手部捕捉是数字人创建的基础,但在实际多视角系统中仍具挑战性,这些系统需要在丰富光度信息与有限视角密度导致的重建几何歧义之间取得平衡。本文提出一种端到端的动态手部性能捕捉与配准管线,专为视图高效设置(约20个视角)设计。我们通过两项主要创新应对关键挑战。首先,为克服重建困难(如视角重叠有限和背景杂乱),我们的无掩膜神经方法通过场景参数化和场景特定密度正则化,从无掩膜图像中鲁棒地提取精细的手部几何和外观。其次,针对配准挑战(如准确捕捉非线性皮肤变形和确保严重自接触时的合理结果),我们提出一个物理启发框架。它通过优化个性化手部模型规范四面体网格内的固有体积偏移以及姿态参数,将重建与个性化手部模型对齐。该方法在鲁棒损失和优化支持下,捕捉精细表面变形,确保在严重关节运动和自接触下的合理结果,并对输入噪声表现出强容忍性。我们在超过12000个序列的大规模数据集上展示了自动化管线的可扩展性和鲁棒性,并从中导出一个大规模、高质量合成2D/3D手部数据集用于训练下游任务。这展示了该方法在单手、复杂双手交互和自然手物操作中的有效性。我们的方法在视图高效、无掩膜场景下实现了最先进的重建保真度和高精度配准。项目页面:https://zyshen021.github.io/VEPHand/。

英文摘要

Robust, high-fidelity 3D hand capture, while fundamental to digital human creation, remains challenging with practical multi-view systems that balance rich photometry with the geometric ambiguities of reconstruction arising from limited viewpoint density. This paper presents an end-to-end pipeline for dynamic hand performance capture and registration, specifically designed for view-efficient setups (~20 views). We address key challenges with two primary innovations. First, to overcome reconstruction difficulties like limited view overlap and background clutter, our mask-free neural method robustly extracts detailed hand geometry and appearance from unmasked images using scene parameterization and scenario-specific density regularization. Second, addressing registration challenges such as accurately capturing non-linear skin deformations and ensuring plausible results during severe self-contact, we propose a physics-inspired framework. It aligns reconstructions to a personalized hand model by optimizing intrinsic volumetric offsets within its canonical tetrahedral mesh, alongside pose parameters. This approach, supported by robust losses and optimization, captures fine surface deformations, ensures plausible results under severe articulation and self-contact, and demonstrates strong tolerance to input noise. We demonstrate the scalability and robustness of our automated pipeline on an extensive dataset of over 12,000 sequences, from which we also derive a large-scale, high-quality synthetic 2D/3D hand dataset for training downstream tasks. This showcases its effectiveness for single hands, intricate two-hand interactions, and natural hand-object manipulations. Our method achieves state-of-the-art reconstruction fidelity in view-efficient, unmasked scenarios and highly accurate registration. Our project page is available at https://zyshen021.github.io/VEPHand/.

2606.16048 2026-06-16 cs.CV 新提交

PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain

PointDiffusion: 点云领域的基于扩散的场景补全

Chidera Agbasiere, Mikhail Sannikov, Faith Ogunwoye, Erik Shaikhiev, Alex Kozinov, Ilya Mikhalchuk, Iana Zhura, Dzmitry Tsetserukou

发表机构 * Intelligent Space Robotics Laboratory, Skolkovo Institute of Science and Technology(斯科尔科沃科学技术学院智能空间机器人实验室)

AI总结 提出多令牌高斯VAE和锚点ICP地面真值精化,实现单步扩散场景补全,在SemanticKITTI上平方倒角距离降低16倍,推理延迟降低25-143倍。

详情
AI中文摘要

从稀疏LiDAR点云重建密集3D场景是自动驾驶中的基本挑战,其中潜在扩散模型提供了一种有前景的解决方案。然而,现有方法依赖于对象级自编码器,这些自编码器在室外尺度下会崩溃为不稳定的全局表示,并且受到由里程计漂移破坏的地面真值数据的影响,这系统地降低了监督质量。此外,多步扩散推理会带来难以承受的延迟,无法实时部署。我们提出了一种新颖的多令牌高斯VAE,具有交叉注意力池化,用于稳定的场景级LiDAR压缩,并结合基于锚点的ICP地面真值精化流水线,消除了训练监督中的漂移引入噪声。这些组件共同实现了一个无支架的单步扩散补全模型,在SemanticKITTI序列08上将平方倒角距离减少了约16倍(从0.396 m^2降至0.024 m^2),分别比LiDiff和ScoreLiDAR高出17-19%和10-11%,并且推理延迟降低了25-143倍。我们的结果表明,在此设置下,数据质量主导模型设计,多令牌潜在空间为基于潜在扩散的场景补全提供了稳定的第一阶段。

英文摘要

Reconstructing dense 3D scenes from sparse LiDAR point clouds is a fundamental challenge in autonomous driving, where latent diffusion models offer a promising solution. However, existing approaches rely on object-level autoencoders that collapse into unstable global representations at outdoor scale and suffer from ground truth data corrupted by odometry drift that systematically degrades supervision quality. Furthermore, multi-step diffusion inference incurs prohibitive latency for real-time deployment. We propose a novel multi-token Gaussian VAE with cross-attention pooling for stable scene-scale LiDAR compression, combined with an anchor-based ICP ground truth refinement pipeline that eliminates drift-induced noise from training supervision. Together, these components enable a scaffold-free single-step diffusion completion model that achieves an approximately 16x reduction in squared Chamfer distance on SemanticKITTI seq. 08 (0.396 m^2 to 0.024 m^2), surpasses LiDiff and ScoreLiDAR by 17-19% and 10-11%, respectively, and operates at 25-143x lower inference latency. Our results demonstrate that data quality dominates model design in this regime and that multi-token latent spaces provide a stable first stage for latent diffusion-based scene completion.
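
The headline metric above is the squared Chamfer distance between a completed point cloud and the reference. The following is a minimal sketch of one common definition (mean nearest-neighbor squared distance, summed over both directions); the abstract does not state which normalization the authors use, so that choice, the function name `squared_chamfer`, and the toy points are assumptions for illustration only. The last line simply re-derives the "approximately 16x" figure from the two values quoted in the abstract.

```python
def squared_chamfer(a, b):
    """Symmetric squared Chamfer distance between two lists of 3D points.

    Assumed convention: mean over each cloud of the squared distance to the
    nearest point in the other cloud, summed over both directions.
    """
    def one_way(src, dst):
        total = 0.0
        for p in src:
            # Brute-force nearest neighbor; fine for a toy example only.
            total += min(sum((pi - qi) ** 2 for pi, qi in zip(p, q)) for q in dst)
        return total / len(src)

    return one_way(a, b) + one_way(b, a)


# Toy clouds (hypothetical data): one point is displaced by 0.1 m along z.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.1), (1.0, 0.0, 0.0)]
cd = squared_chamfer(pred, gt)

# Ratio of the two values quoted in the abstract (0.396 m^2 and 0.024 m^2).
reduction = 0.396 / 0.024
```

Because the distance is squared, its unit is m^2, which matches the units reported in the abstract; a brute-force search like this one is quadratic in the number of points and would be replaced by a spatial index at LiDAR scale.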

2606.16323 2026-06-16 cs.CV cs.GR 新提交

HAFMat: Hybrid Priors Guided Adaptive Fusion for Single-Image Human Material Estimation

HAFMat: 混合先验引导的自适应融合用于单张图像人体材质估计

Yu Jiang, Jiahao Xia, Jiongming Qin, Jianchi Sun, Chunxia Xiao

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) Faculty of Engineering and IT, University of Technology Sydney(悉尼科技大学工程与信息技术学院)

AI总结 提出HAFMat框架,通过混合先验(外观、几何、结构及预训练模型预测)引导自适应特征融合,解决单张图像人体PBR材质估计的病态问题,在合成和真实数据上达到最优性能。

详情
AI中文摘要

基于物理的渲染(PBR)材质估计是一项基础的外观分解任务,在虚拟内容创建、重光照和数字人体渲染中具有广泛应用。然而,从单张人体图像估计PBR材质仍然高度病态,因为光照、几何和反射率在观察到的外观中严重纠缠。为缓解这种歧义,我们提出HAFMat,一种混合先验引导的单图像人体材质估计框架。我们的方法引入编码互补线索的引导图,包括外观、身体几何、结构以及来自预训练模型的先验材质预测。一个关键观察是这些引导线索是异质的:一些线索主要提供纹理级约束,而其他线索传达更高层的语义信息。为利用这一特性,我们设计了一种多层自适应特征融合机制,在不同阶段自适应地将引导特征与解码器特征融合。该设计使纹理主导和语义主导的线索能够在适当层次引导材质解码,从而实现更准确且物理合理的材质估计。在合成和真实数据上的大量实验表明,我们的方法在材质估计和下游重光照任务中达到了最先进的性能。

英文摘要

Physically based rendering (PBR) material estimation is a fundamental appearance decomposition task with broad applications in virtual content creation, relighting, and digital human rendering. However, estimating PBR materials from a single human image remains highly ill-posed, since illumination, geometry, and reflectance are heavily entangled in the observed appearance. To mitigate this ambiguity, we propose HAFMat, a hybrid-prior-guided framework for single-image human material estimation. Our method introduces guidance maps that encode complementary cues, including appearance, body geometry, structure, and prior material predictions from pre-trained models. A key observation is that these guidance cues are heterogeneous: some cues mainly provide texture-level constraints, while others convey higher-level semantic information. To exploit this property, we design a Multi-layer Adaptive Feature Fusion Mechanism, which adaptively fuses guidance features with decoder features at different stages. This design enables texture-dominant and semantic-dominant cues to guide material decoding at appropriate levels, leading to more accurate and physically plausible material estimation. Extensive experiments on both synthetic and real data demonstrate that our method achieves state-of-the-art performance in material estimation and downstream relighting.

2606.16333 2026-06-16 cs.CV cs.GR cs.LG 新提交

Differentiable Packing of Irregular 3D Objects with Adaptive Container Estimation

不规则3D物体的可微分装箱与自适应容器估计

Palak Gupta, Shanmuganathan Raman

发表机构 * Indian Institute of Technology Gandhinagar(印度理工学院甘地讷格尔分校)

AI总结 提出一种可微分装箱框架,通过梯度优化联合调整物体姿态和容器尺寸,利用自适应挤压机制和基于张量广播的快速计算,在单个GPU上数分钟内实现比基线方法小11-32%的容器。

Comments 20 pages, 8 figures, 5 tables. Under review at Computers & Graphics (Elsevier)

详情
AI中文摘要

大多数现有方法要么预先固定容器,要么通过外部搜索循环仅优化单个容器维度,其余维度则作为手动调整问题。我们提出了一种可微分装箱框架,在单个基于梯度的循环内联合优化所有6N个物体姿态参数和所有三个容器边长。该公式结合了六个基于物理的、可微分的损失项,这些损失项通过轴对齐包围盒代理直接在三角形网格上计算。自适应挤压机制在重叠损失低于按对数量缩放的阈值时周期性收紧容器,导致容器体积先大幅下降,然后进行小幅细化。所有成对计算均以张量广播形式编写,与基于循环的参考实现相比,速度提升了3.4到54倍。该流程使用Python和PyTorch实现,无需物理引擎、FFT库或凸分解。在多个物体类别上,该方法在N=100时产生的容器比时间匹配的DBLF和模拟退火基线小11%至32%,同时在单个消费级GPU上每个实例的运行时间不到4分钟。

英文摘要

Most existing approaches either fix the container in advance or optimize only a single container dimension through an outer search loop, leaving the remaining dimensions as a manual tuning problem. We present a differentiable packing framework that jointly optimizes all 6N object pose parameters and all three container side lengths inside a single gradient-based loop. The formulation combines six physics-inspired, differentiable loss terms computed directly on triangle meshes through axis-aligned bounding-box proxies. An adaptive squeezing mechanism periodically tightens the container whenever the overlap loss falls below a pair-count-scaled threshold, producing a large initial drop in container volume, followed by small refinements. All pairwise computations are written in tensor-broadcasting form, giving a 3.4 to 54 times speedup over a reference loop-based implementation. The pipeline is implemented in Python and PyTorch, with no physics engine, FFT library, or convex decomposition. On multiple object categories, the method produces containers that are 11 to 32 percent smaller than time-matched DBLF and simulated-annealing baselines at N = 100, while running in under 4 minutes per instance on a single consumer GPU.
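
The abstract mentions that pairwise terms are computed on axis-aligned bounding-box proxies in tensor-broadcasting form. A minimal sketch of that idea follows, written with NumPy rather than PyTorch to keep it dependency-light. The function names, the use of overlap volume as the pairwise quantity, and the toy boxes are assumptions; the paper's actual loss terms are not given in the abstract.

```python
import numpy as np


def pairwise_overlap_volume(lo, hi):
    """Overlap volume for every pair of axis-aligned boxes.

    lo, hi: arrays of shape (N, 3) holding the min and max corners.
    Returns an (N, N) matrix with zeros on the diagonal.
    """
    # Broadcast (N, 1, 3) against (1, N, 3) to form all pairs at once.
    inter_lo = np.maximum(lo[:, None, :], lo[None, :, :])
    inter_hi = np.minimum(hi[:, None, :], hi[None, :, :])
    # A negative extent means no overlap on that axis, so clamp at zero.
    extent = np.clip(inter_hi - inter_lo, 0.0, None)
    vol = extent.prod(axis=-1)
    np.fill_diagonal(vol, 0.0)  # a box does not collide with itself
    return vol


def pairwise_overlap_volume_loop(lo, hi):
    """Reference double loop, used only to check the broadcast version."""
    n = len(lo)
    vol = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                ext = np.minimum(hi[i], hi[j]) - np.maximum(lo[i], lo[j])
                vol[i, j] = np.prod(np.clip(ext, 0.0, None))
    return vol


# Three unit cubes (hypothetical data): the first two overlap, the third is far away.
lo = np.array([[0.0, 0.0, 0.0], [0.5, 0.5, 0.5], [2.0, 2.0, 2.0]])
hi = lo + 1.0
vol = pairwise_overlap_volume(lo, hi)
```

Both versions compute the same matrix; the broadcast form replaces N^2 Python-level iterations with a few array operations, which is the kind of restructuring the reported speedup refers to. The clamp-and-product form is also piecewise differentiable, so the same expression can be used as a penalty in a gradient-based loop.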

2606.16479 2026-06-16 cs.CV cs.AI 新提交

Uncertainty Quality of VGGT: An Analysis on the DTU Benchmark Dataset

VGGT的不确定性质量:基于DTU基准数据集的分析

Markus Hillemann, Robert Langendörfer, Steven Landgraf, Markus Ulrich

发表机构 * Institute of Photogrammetry and Remote Sensing, Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院摄影测量与遥感研究所)

AI总结 本文分析VGGT模型在DTU数据集上的不确定性预测质量,确定有效置信度阈值,并证明提升不确定性质量可显著改善3D重建精度。

Comments Accepted for publication in the ISPRS Annals of the Photogrammetry, Remote Sensing and Spatial Information Sciences

详情
AI中文摘要

Visual Geometry Grounded Transformer(VGGT)在短时间内引起了广泛关注,这在很大程度上得益于其在CVPR 2025上获得的最佳论文奖。与DUSt3R和MASt3R类似,VGGT旨在用一个简单、统一的前馈神经网络取代光束法平差和特征匹配等既定方法,从而实现范式转变,该网络可在几秒内直接从场景的多张图像中预测相机位姿、深度图和密集3D结构。其关键能力是在单次前向传播中一致地处理任意数量的视图,无需任何后处理或迭代优化。对于摄影测量学,这为实时、可扩展且易于使用的3D重建开辟了新的可能性。在此背景下,不仅高重建精度至关重要,高质量的不确定性估计也至关重要,因为它们能增强信任并实现稳健的质量保证。因此,本文研究了VGGT不确定性预测的质量。分析确定了用于过滤VGGT原始输出的有效置信度阈值,并表明提升不确定性质量在提高其3D重建精度方面具有巨大潜力。

英文摘要

Visual Geometry Grounded Transformer (VGGT) has already attracted a great deal of attention in a short period of time, not least due to the Best Paper Award at CVPR-2025. Similar to DUSt3R and MASt3R, VGGT aims to bring about a paradigm shift by replacing established methods like bundle adjustment and feature matching with a simple, unified, feed-forward neural network that predicts camera poses, depth maps, and dense 3D structure directly from multiple images of a scene in a few seconds. A key aspect is its ability to process an arbitrary number of views consistently in a single forward pass without any post-processing or iterative optimization. For photogrammetry, this opens new possibilities for real-time, scalable, and accessible 3D reconstruction. In this context, not only high reconstruction accuracy but also high-quality uncertainty estimates are crucial, as they foster trust and enable robust quality assurance. This paper therefore investigates the quality of VGGT's uncertainty predictions. The analysis identifies an effective confidence threshold for filtering VGGT's raw output and demonstrates that enhancing uncertainty quality holds strong potential for improving the accuracy of its 3D reconstructions.
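
The analysis described above filters raw per-point output by a confidence threshold. A minimal sketch of that filtering step is shown here, assuming each predicted point carries a scalar confidence where higher means more trustworthy. The function name `filter_by_confidence`, the threshold value, and all numbers are made up for illustration; they are not VGGT outputs or results from the paper.

```python
def filter_by_confidence(errors, confidences, threshold):
    """Keep points whose predicted confidence is at least `threshold`.

    Returns (mean error of the kept points, fraction of points kept).
    """
    kept = [e for e, c in zip(errors, confidences) if c >= threshold]
    mean_err = sum(kept) / len(kept) if kept else float("nan")
    return mean_err, len(kept) / len(errors)


# Hypothetical per-point errors and confidences: the two low-confidence
# points are also the two with the largest error.
errors = [0.2, 0.5, 3.0, 0.3, 4.0]
confidences = [5.0, 4.0, 1.2, 6.0, 1.0]

raw_mean, _ = filter_by_confidence(errors, confidences, threshold=0.0)
filt_mean, kept_frac = filter_by_confidence(errors, confidences, threshold=2.0)
```

The trade-off is visible even in this toy case: the threshold lowers the mean error only because confidence and error happen to be anti-correlated, and it does so at the cost of completeness (the kept fraction). When confidence is poorly calibrated, the same threshold removes good points and keeps bad ones, which is why the quality of the uncertainty estimate matters for reconstruction accuracy.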

2606.16566 2026-06-16 cs.CV 新提交

Local-GS: Accelerating 3D Gaussian Splatting via Tile-Local Warp Coherence

Local-GS:通过Tile局部Warp一致性加速3D高斯泼溅

Yang Luo, Yan Gong, Yongsheng Gao, Jie Zhao, Xinyu Zhang, Huaping Liu

发表机构 * State Key Laboratory of Robotics and Systems, Harbin Institute of Technology(哈尔滨工业大学机器人技术与系统国家重点实验室) State Key Laboratory of Intelligent Green Vehicle and Mobility, School of Vehicle and Mobility, Tsinghua University(清华大学车辆与运载学院智能绿色车辆与交通国家重点实验室) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

AI总结 提出Local-GS,通过基于SIMT执行边界组织高斯原语,设计提升、剔除和混合三阶段warp一致渲染范式,在不降低质量的前提下实现最高7.76倍加速。

详情
AI中文摘要

3D高斯泼溅(3DGS)通过将场景表示为各向异性3D高斯原语的密集集合,显著推进了实时新视角合成。然而,高斯的不规则空间分布通常导致GPU利用率低下,因为warp发散和冗余计算降低了渲染性能。为了解决这个问题,我们提出了Local-GS,一种warp一致的渲染范式,它根据SIMT(单指令多线程)执行边界而非场景几何来组织高斯原语。具体来说,我们提出了三个warp一致阶段:提升阶段,在tile级别预计算共享参数;剔除阶段,丢弃没有贡献的warp;混合阶段,用统一的指令流替换逐像素分支。在多个数据集上的广泛基准测试中,Local-GS在不牺牲质量的情况下提高了效率。作为一种即插即用的优化,它为所有测试的基线提供了额外的性能提升,在Deep Blending场景上实现了7.76倍的加速。

英文摘要

3D Gaussian Splatting (3DGS) has significantly advanced real-time novel view synthesis by representing scenes as dense collections of anisotropic 3D Gaussian primitives. However, the irregular spatial distribution of Gaussians often leads to poor GPU utilization, as warp divergence and redundant computation degrade rendering performance. To address this, we present Local-GS, a warp-coherent rendering paradigm that organizes Gaussian primitives with respect to SIMT (Single Instruction, Multiple Threads) execution boundaries rather than scene geometry. Specifically, we propose three warp-coherent stages: a hoisting stage that precomputes shared parameters at tile level, a culling stage that discards warps with no contribution, and a blending stage that replaces per-pixel branching with a uniform instruction stream. Across extensive benchmarks on multiple datasets, Local-GS improves efficiency without compromising quality. As a plug-and-play optimization, it provides additional performance gains to all tested baselines, culminating in a 7.76x speedup on Deep Blending scenes.

2606.16593 2026-06-16 cs.CV 新提交

Rotational Symmetry based Object Pose Estimation from Point Clouds in the Absence of Known 3D Models

基于旋转对称性的无已知3D模型点云物体姿态估计

Weichen Dai, Ruixun Yu, Yangjie Tang, Yifan Du, Yiyang Zhang, Donglei Sun, Hua Zhang

发表机构 * Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province, School of Computer Science, Hangzhou Dianzi University(浙江省脑机协同智能重点实验室,杭州电子科技大学计算机学院) Advanced Intelligent Manufacturing Research Group, the University of Nottingham Ningbo China(先进智能制造研究组,宁波诺丁汉大学)

AI总结 提出利用工业物体的旋转对称性,通过迭代优化联合估计姿态与点云,无需已知3D模型,在合成和真实数据集上达到与有模型方法相当的性能。

详情
AI中文摘要

物体姿态估计对许多工业应用至关重要,例如使用机器人进行自动喷漆。然而,保密性问题常常限制了对高质量3D模型的访问,给基于点云的姿态估计带来了重大挑战。在这种情况下,旋转对称性——许多工业物体易于获取的特征——可以提供有价值的先验信息以促进姿态估计。在本文中,我们提出了一种方法,利用工业物体中常见的旋转对称性来解决缺乏3D模型带来的挑战。通过迭代优化过程,物体姿态与点云细化联合估计。该优化依赖于旋转对称性约束损失。为了构建这一损失,每个3D点根据当前估计的姿态旋转,并利用旋转对称性通过最近邻搜索识别多个对应点。然后使用这些对应点计算旋转对称性约束损失,迭代地细化姿态和点云。通过将旋转对称性显式地纳入优化过程,所提出的方法实现了鲁棒的姿态估计,并在不同物体类型上具有良好的泛化能力。该方法在一个专门为无已知3D模型的点云创建的数据集上进行了评估,该数据集包含四类合成物体和一个从生产线收集的真实轮毂。实验结果表明,所提出的方法实现了与依赖已知3D模型的方法相当的性能。

英文摘要

Object pose estimation is crucial to many industrial applications, with one example being automated spray painting using a robot. However, confidentiality concerns often limit access to high-quality 3D models, posing a significant challenge for point-cloud-based pose estimation. In such scenarios, rotational symmetry, a readily accessible characteristic of many industrial objects, can provide valuable prior information to facilitate pose estimation. In this paper, we propose a method that leverages the rotational symmetry commonly found in industrial objects to address the challenge caused by the absence of 3D models. The object pose is jointly estimated with point cloud refinement through an iterative optimization process. This optimization relies on a rotational symmetry constraint loss. To construct this loss, each 3D point is rotated according to the currently estimated pose, and multiple correspondences are identified using nearest-neighbor search by exploiting the rotational symmetry property. These correspondences are then used to compute the rotational symmetry constraint loss, which iteratively refines both the pose and the point cloud. By explicitly incorporating rotational symmetry into the optimization process, the proposed method achieves robust pose estimation and generalizes well across diverse object types. The proposed method is evaluated on a dataset specifically created for point clouds without known 3D models, consisting of four categories of synthetic objects and one real wheel hub collected from a production line. Experimental results demonstrate that the proposed method achieves performance comparable to methods that rely on known 3D models.
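
The loss construction described above (rotate each point under the current pose estimate, then look up nearest neighbors among the symmetric copies) can be sketched in a simplified setting. This sketch assumes the symmetry axis is parallel to z and that the only unknown is the axis position `center` in the xy-plane; the full method estimates a 6D pose and also refines the point cloud, neither of which is reproduced here. All names and the toy data are hypothetical.

```python
import math


def rotate_about_z(p, center, angle):
    """Rotate 3D point p by `angle` about a z-parallel axis through `center` (x, y)."""
    x, y = p[0] - center[0], p[1] - center[1]
    c, s = math.cos(angle), math.sin(angle)
    return (center[0] + c * x - s * y, center[1] + s * x + c * y, p[2])


def symmetry_loss(points, center, order):
    """Mean squared nearest-neighbor distance between the cloud and its rotated copies.

    For an object with `order`-fold symmetry and a correct axis estimate,
    every rotated point should land on (or near) an existing point.
    """
    total = 0.0
    for p in points:
        for k in range(1, order):
            q = rotate_about_z(p, center, 2.0 * math.pi * k / order)
            # Brute-force nearest-neighbor search over the original cloud.
            total += min(sum((a - b) ** 2 for a, b in zip(q, r)) for r in points)
    return total / (len(points) * (order - 1))


# A four-fold symmetric set of points on the unit circle (toy data).
ring = [(math.cos(math.pi / 2 * k), math.sin(math.pi / 2 * k), 0.0) for k in range(4)]

good = symmetry_loss(ring, center=(0.0, 0.0), order=4)  # axis estimate is correct
bad = symmetry_loss(ring, center=(0.3, 0.0), order=4)   # axis estimate is offset
```

The loss is essentially zero for the correct axis and grows as the estimate drifts, so minimizing it over the pose parameters pulls the estimated axis toward the true one without any reference 3D model. Each point contributes `order - 1` correspondences, which is one reading of the "multiple correspondences" mentioned in the abstract.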

2606.16672 2026-06-16 cs.CV 新提交

Sinkhorn-CPD: Robust point cloud registration via unbalanced entropic optimal transport

Sinkhorn-CPD:通过非平衡熵最优传输实现鲁棒点云配准

Jin Zhang, Mingyang Zhao, Bing Liu, Xin Jiang

发表机构 * LMIB & School of Mathematical Sciences, Beihang University(北京航空航天大学数学科学学院与LMIB) State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences(中国科学院数学与系统科学研究院数学科学国家重点实验室) Beijing Key Laboratory of Artificial Intelligence Innovation and Application in the Machine Tool Industry, School of Artificial Intelligence, Beihang University(北京航空航天大学人工智能学院北京市机床行业人工智能创新与应用重点实验室) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出Sinkhorn-CPD,用双KL散度惩罚替代CPD的目标边际约束,通过非平衡熵最优传输和广义Sinkhorn迭代实现鲁棒点云配准,方差自动退火无需手动调参。

Comments 14 pages, 10 figures; journal version published in Computer-Aided Design

详情
Journal ref
Computer-Aided Design 199 (2026) 104104
AI中文摘要

相干点漂移(CPD)因其软对应和闭式参数更新而被广泛用于刚性点云配准。然而,CPD的目标边际约束迫使每个观测值(包括离群点)恰好接收单位概率质量。在严重离群点和部分重叠情况下,这一假设会降低配准精度。最优传输(OT)方法可以通过非平衡公式处理缺失质量,但需要手动调整退火调度。本文提出Sinkhorn-CPD,用双Kullback-Leibler惩罚替代CPD的目标边际约束,使算法能够丢弃两侧的离群点。由此得到的公式是一个完全非平衡的熵最优传输问题,可通过广义Sinkhorn迭代高效求解。此外,Sinkhorn-CPD保留了CPD的闭式Procrustes和方差更新。在我们的方法中,方差sigma^2扮演熵正则化参数的角色,从而自动产生从扩散到尖锐对应的退火调度,无需手动调节温度。在合成、跨类别和扫描到CAD基准上的实验表明,Sinkhorn-CPD达到了最先进的精度,对离群点和部分重叠具有强鲁棒性。

英文摘要

Coherent Point Drift (CPD) is widely used for rigid point cloud registration because of its soft correspondences and closed-form parameter updates. However, CPD's target-side marginal constraint forces every observation, including outliers, to receive exactly unit probability mass. This assumption degrades registration accuracy under heavy outliers and partial overlap. Optimal transport (OT) methods can handle missing mass through unbalanced formulations, but require hand-tuned annealing schedules. In this paper, we propose Sinkhorn-CPD, which replaces CPD's target-side marginal constraint with dual Kullback-Leibler penalties, allowing the algorithm to discard outliers on both sides. The resulting formulation is a fully unbalanced entropic optimal transport problem, which can be efficiently solved by generalized Sinkhorn iterations. Moreover, Sinkhorn-CPD preserves the closed-form Procrustes and variance updates of CPD. In our method, the variance sigma^2 plays the role of the entropic regularization parameter, which induces an automatic annealing schedule from diffuse to sharp correspondences without manual temperature tuning. Experiments on synthetic, cross-category, and scan-to-CAD benchmarks show that Sinkhorn-CPD achieves state-of-the-art accuracy, with strong robustness to outliers and partial overlap.
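
The generalized Sinkhorn iterations mentioned above can be sketched with the standard scaling updates for entropic optimal transport with KL marginal penalties. This is a minimal illustration, not the authors' implementation: it assumes uniform marginals, a cost of half the squared distance with entropic weight `sigma2` (so the Gibbs kernel is the Gaussian kernel familiar from CPD), a single penalty weight `rho` for both sides, and it omits the Procrustes and variance updates entirely.

```python
import numpy as np


def unbalanced_sinkhorn(x, y, sigma2, rho, n_iter=200):
    """Transport plan between point sets x (N, d) and y (M, d) with soft marginals."""
    # Gibbs kernel exp(-C / eps) with C = ||x - y||^2 / 2 and eps = sigma2.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(axis=-1)
    K = np.exp(-d2 / (2.0 * sigma2))
    a = np.full(len(x), 1.0 / len(x))  # source marginal (assumed uniform)
    b = np.full(len(y), 1.0 / len(y))  # target marginal (assumed uniform)
    u, v = np.ones_like(a), np.ones_like(b)
    # KL penalties turn the usual hard projections into damped ones;
    # the exponent tends to 1 (balanced Sinkhorn) as rho grows.
    power = rho / (rho + sigma2)
    for _ in range(n_iter):
        u = (a / (K @ v)) ** power
        v = (b / (K.T @ u)) ** power
    return u[:, None] * K * v[None, :]


# Toy data: three source points, their slightly shifted copies, and one far outlier.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
tgt = np.vstack([src + 0.05, [[3.0, 3.0]]])
P = unbalanced_sinkhorn(src, tgt, sigma2=0.1, rho=1.0)
col_mass = P.sum(axis=0)  # mass received by each target point
```

In this toy run the three inlier targets each receive a substantial share of mass while the distant fourth point receives almost none, whereas a hard target-side marginal constraint would force a full share onto it. Shrinking `sigma2` over the course of registration sharpens the kernel, which is the sense in which the variance acts as an annealing parameter.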

2606.17027 2026-06-16 cs.CV 新提交

MeshLoom: Feed-Forward Non-Rigid Registration of Mesh Sequences

MeshLoom: 网格序列的前馈式非刚性配准

Jianqi Chen, Jiraphon Yenphraphai, Xiangjun Tang, Sergey Tulyakov, Chaoyang Wang, Peter Wonka, Rameen Abdal

发表机构 * KAUST Saudi Arabia(阿卜杜拉国王科技大学,沙特阿拉伯) Snap Inc. United States of America(Snap Inc.,美国) Purdue University United States of America(普渡大学,美国)

AI总结 提出MeshLoom,一种前馈式配准网络,通过拓扑感知编码器-解码器直接重建网格序列的顶点变形,实现秒级多网格配准,并在非刚性配准任务上达到最先进水平,同时支持运动插值和网格变形。

Comments Project page: https://meshloom.github.io/

详情
AI中文摘要

我们提出MeshLoom,一种前馈式配准网络,可直接重建网格序列中的顶点变形。我们的方法将非刚性配准推进到超越现有模型,这些模型通常受限于昂贵的逐实例优化、狭窄的物体类别、仅成对输入或仅仅是中间输出。该网络简单高效,可在数秒内配准多个网格。其核心在于拓扑感知的编码器-解码器设计。具体来说,我们首先引入一种拓扑感知的点表示,将锚点(参考)网格的拓扑编码到其逐顶点特征中。这种表示增强了网络对锚点网格几何结构的理解,并区分了欧几里得接近但测地距离远的点。然后,我们提出一种多模态编码器,将这种锚点网格表示与每帧的互补线索(如形状潜变量和图像特征)融合。这些多源信号被压缩成一个紧凑的全局运动嵌入,捕捉密集的帧间对应关系。一个轻量级解码器随后用锚点网格点表示查询该全局嵌入,检索目标时间戳处的逐顶点变形。通过在多种运动和物体类别上的大量实验,我们表明MeshLoom在非刚性配准上达到了最先进的结果。此外,我们发现我们的全局嵌入-然后-查询范式自然地使网络能够生成中间时间戳的变形,这扩展了MeshLoom到运动插值和网格变形。项目页面:https://meshloom.github.io/。

英文摘要

We present MeshLoom, a feed-forward registration network that directly reconstructs vertex deformations across mesh sequences. Our approach advances non-rigid registration beyond existing models, which are typically constrained by costly per-instance optimization, narrow object categories, pairwise-only inputs, or merely intermediate outputs. The network is simple and efficient, registering multiple meshes within seconds. At its core lies a topology-aware encoder-decoder design. Specifically, we first introduce a topology-aware point representation that encodes the anchor (reference) mesh's topology into its per-vertex features. This representation strengthens the network's understanding of the anchor-mesh geometry and disambiguates points that are Euclidean-close yet geodesically distant. We then propose a multi-modal encoder that fuses this anchor-mesh representation with complementary cues from each frame, such as shape latents and image features. These multi-source signals are compressed into a compact global motion embedding that captures dense inter-frame correspondence. A lightweight decoder then queries this global embedding with the anchor-mesh point representation, retrieving per-vertex deformations at target timestamps. Through extensive experiments across diverse motions and object categories, we show that MeshLoom achieves state-of-the-art results on non-rigid registration. In addition, we find that our global embedding-then-query paradigm naturally enables the network to generate deformations at intermediate timestamps, which extends MeshLoom to motion interpolation and mesh morphing. Project page: https://meshloom.github.io/.

2606.15238 2026-06-16 cs.GR cs.CV 交叉投稿

HairLRM: Strand-based Hair Modeling via Large Reconstruction Models

HairLRM:基于大型重建模型的发丝建模

Yuefan Shen, Yican Dong, Xiufeng Huang, Zhongtian Zheng, Youyi Zheng, Kui Wu

发表机构 * LIGHTSPEED Shenzhen China(LIGHTSPEED,中国深圳) State Key Lab of CAD and CG, Zhejiang University Hangzhou China(浙江大学计算机辅助设计与图形学国家重点实验室,中国杭州) Hong Kong Baptist University Hong Kong China(香港浸会大学,中国香港) LIGHTSPEED Los Angeles CA USA (2026)(LIGHTSPEED,美国加利福尼亚州洛杉矶)

AI总结 针对传统发丝建模从2D图像推断3D结构的不适定性问题,提出结合大型重建模型的几何先验,利用双方向自编码器将粗几何提升为高保真发丝,通过潜在空间优化和表面引导细化解决矢量场奇点,实现鲁棒且精确的发丝重建。

Comments ACM SIGGRAPH 2026 Conference Paper

详情
AI中文摘要

传统基于发丝建模的根本限制不仅仅是数据稀缺,而是在没有结构约束的情况下从2D图像推断复杂3D场的不适定性。这种无约束回归会导致在解决全局遮挡(例如马尾辫)和局部方向性(例如卷发)时出现灾难性失败,产生过度平滑、看似合理但不正确的几何形状。为了解决这个问题,我们将大型重建模型(LRM)的强几何先验集成到发丝生成流程中。使用LRM网格作为结构锚点,我们采用一种新颖的双方向自编码器将粗几何提升为高保真发丝。通过潜在空间优化和表面引导细化解决矢量场奇点,我们的方法有效解缠复杂的拓扑结构,为头发重建的鲁棒性和准确性设立了新的基准。

英文摘要

The fundamental limitation of traditional strand-based modeling is not simply data scarcity, but the ill-posedness of inferring complex 3D fields from 2D imagery without structural constraints. This unconstrained regression leads to catastrophic failures in resolving both global occlusion (e.g., in ponytails) and local directionality (e.g., in curls), resulting in over-smoothed, plausible-but-incorrect geometries. To resolve this, we integrate the strong geometric priors of Large Reconstruction Models (LRMs) into the strand generation pipeline. Using the LRM mesh as a structural anchor, we employ a novel Dual Orientation AutoEncoder to lift coarse geometry into high-fidelity strands. By resolving vector field singularities through latent-space optimization and surface-guided refinement, our method effectively disentangles complex topological structures, setting a new benchmark for robustness and accuracy in hair reconstruction.

2508.09977 2026-06-16 cs.CV 版本更新

A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing, and Generation

3D高斯泼溅应用综述:分割、编辑与生成

Shuting He, Peilin Ji, Yitong Yang, Changshuo Wang, Jiayi Ji, Yinglin Wang, Henghui Ding

发表机构 * Shanghai University of Finance and Economics(上海财经大学) University College London(伦敦大学学院) Xiamen University(厦门大学) Fudan University(复旦大学)

AI总结 综述3D高斯泼溅在分割、编辑和生成三大任务中的应用,总结代表性方法、监督策略和学习范式,并分析公共基准上的比较结果。

Comments IEEE TPAMI, GitHub Repo: https://github.com/heshuting555/Awesome-3DGS-Applications

详情
Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026
AI中文摘要

在新视角合成背景下,3D高斯泼溅(3DGS)最近作为神经辐射场(NeRF)的高效且具有竞争力的对应物出现,能够实时实现高保真度的逼真渲染。除了新视角合成,3DGS的显式和紧凑特性使其能够应用于需要几何和语义理解的广泛下游任务。本综述全面概述了3DGS应用的最新进展。首先回顾了3DGS的重建基础,接着介绍了问题公式化、2D基础模型以及相关的基于NeRF的研究领域,这些为下游3DGS应用提供了信息。然后,我们将3DGS应用分为三个基础任务:分割、编辑和生成,以及建立在这些基础能力之上或与之紧密耦合的其他功能应用。对于每个任务,我们总结了代表性方法、监督策略和学习范式,突出了共享的设计原则和新兴趋势。还总结了常用数据集和评估协议,以及最近方法在公共基准上的比较分析。为了支持持续的研究和开发,我们在 https://github.com/heshuting555/Awesome-3DGS-Applications 上维护了一个持续更新的论文、代码和资源仓库。

英文摘要

In the context of novel view synthesis, 3D Gaussian Splatting (3DGS) has recently emerged as an efficient and competitive counterpart to Neural Radiance Field (NeRF), enabling high-fidelity photorealistic rendering in real time. Beyond novel view synthesis, the explicit and compact nature of 3DGS enables a wide range of downstream applications that require geometric and semantic understanding. This survey provides a comprehensive overview of recent progress in 3DGS applications. It first reviews the reconstruction preliminaries of 3DGS, followed by the problem formulation, 2D foundation models, and related NeRF-based research areas that inform downstream 3DGS applications. We then categorize 3DGS applications into three foundational tasks: segmentation, editing, and generation, alongside additional functional applications built upon or tightly coupled with these foundational capabilities. For each, we summarize representative methods, supervision strategies, and learning paradigms, highlighting shared design principles and emerging trends. Commonly used datasets and evaluation protocols are also summarized, along with comparative analyses of recent methods across public benchmarks. To support ongoing research and development, a continually updated repository of papers, code, and resources is maintained at https://github.com/heshuting555/Awesome-3DGS-Applications.

2510.09088 2026-06-16 cs.CV 版本更新

MambaH-Fit: Rethinking Hyper-surface Fitting-based Point Cloud Normal Estimation via State Space Modelling

MambaH-Fit: 基于状态空间建模的超曲面拟合点云法线估计再思考

Weijia Wang, Yuanzhi Su, Pei-Gen Ye, Yuan-Gen Wang

发表机构 * Guangzhou University(广州大学) Hong Kong Polytechnic University(香港理工大学) Beijing Institute of Technology(北京理工大学)

AI总结 提出MambaH-Fit框架,通过注意力驱动层次特征融合和逐块状态空间模型,增强局部几何细节建模,提升点云法线估计的精度和鲁棒性。

Comments 11 pages, 12 figures

详情
AI中文摘要

我们提出了MambaH-Fit,一个专为基于超曲面拟合的点云法线估计设计的状态空间建模框架。现有的法线估计方法在建模细粒度几何结构方面往往不足,从而限制了预测法线的准确性。最近,状态空间模型(SSMs),特别是Mamba,通过以线性复杂度捕捉长程依赖关系展示了强大的建模能力,并激发了对点云处理的适应性。然而,现有的基于Mamba的方法主要关注理解全局形状结构,而对局部细粒度几何细节的建模仍很大程度上未被探索。为了解决上述问题,我们首先引入了一种注意力驱动的层次特征融合(AHFF)方案,以自适应地融合多尺度点云块特征,显著增强了局部点云邻域中的几何上下文学习。在此基础上,我们进一步提出了逐块状态空间模型(PSSM),该模型通过状态动力学将点云块建模为隐式超曲面,从而实现对法线预测的有效细粒度几何理解。在基准数据集上的大量实验表明,我们的方法在准确性、鲁棒性和灵活性方面优于现有方法。消融研究进一步验证了所提出组件的贡献。

英文摘要

We present MambaH-Fit, a state space modelling framework tailored for hyper-surface fitting-based point cloud normal estimation. Existing normal estimation methods often fall short in modelling fine-grained geometric structures, thereby limiting the accuracy of the predicted normals. Recently, state space models (SSMs), particularly Mamba, have demonstrated strong modelling capability by capturing long-range dependencies with linear complexity and inspired adaptations to point cloud processing. However, existing Mamba-based approaches primarily focus on understanding global shape structures, leaving the modelling of local, fine-grained geometric details largely under-explored. To address the issues above, we first introduce an Attention-driven Hierarchical Feature Fusion (AHFF) scheme to adaptively fuse multi-scale point cloud patch features, significantly enhancing geometric context learning in local point cloud neighbourhoods. Building upon this, we further propose Patch-wise State Space Model (PSSM) that models point cloud patches as implicit hyper-surfaces via state dynamics, enabling effective fine-grained geometric understanding for normal prediction. Extensive experiments on benchmark datasets show that our method outperforms existing ones in terms of accuracy, robustness, and flexibility. Ablation studies further validate the contribution of the proposed components.

2512.10840 2026-06-16 cs.CV 版本更新

PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

PoseGAM: 通过几何感知多视图推理实现鲁棒的未见物体姿态估计

Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka

发表机构 * KAUST(阿卜杜拉国王科技大学)

AI总结 提出PoseGAM,一种基于多视图基础模型的几何感知框架,直接预测未见物体的6D姿态,无需显式匹配,通过点云几何和特征网络整合几何信息,在多个基准上平均AR提升5.1%。

Comments Accepted by CVPR 2026 (Oral). Project page: https://windvchen.github.io/PoseGAM/

详情
AI中文摘要

6D物体姿态估计,即预测物体相对于相机的变换,对于未见物体仍然具有挑战性。现有方法通常依赖于在查询图像与物体模型或模板图像之间显式构建特征对应关系。在这项工作中,我们提出了PoseGAM,一种几何感知的多视图框架,直接从查询图像和多个模板图像预测物体姿态,消除了显式匹配的需要。该方法基于最近的多视图基础模型架构,通过两种互补机制整合物体几何信息:显式的基于点的几何和来自几何表示网络的学习特征。此外,我们构建了一个包含超过19万个物体的大规模合成数据集,涵盖多种环境条件,以增强鲁棒性和泛化能力。在多个基准上的广泛评估表明,我们的方法达到了最先进的性能,与先前方法相比平均AR提高了5.1%,在单个数据集上最高提升了17.6%,显示出对未见物体的强泛化能力。项目页面:https://windvchen.github.io/PoseGAM/。

英文摘要

6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/.

2601.13565 2026-06-16 cs.CV cs.RO eess.IV 版本更新

Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation

学习细粒度对应与跨视角感知用于开放词汇6D物体姿态估计

Yu Qin, Shimeng Fan, Fan Yang, Zixuan Xue, Zijie Mai, Wenrui Chen, Kailun Yang, Zhiyong Li

发表机构 * School of Artificial Intelligence and Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University(人工智能与机器人学院和机器人视觉感知与控制技术国家工程研究中心,湖南大学) State Key Laboratory of Autonomous Intelligent Unmanned Systems, Tongji University(自主智能无人系统国家重点实验室,同济大学) School of Computer Science and Engineering, Hunan University of Science and Technology(计算机科学与工程学院,湖南科技大学)

AI总结 提出FiCoP框架,通过物体中心解耦、跨视角全局感知模块和补丁相关预测器,实现空间约束的细粒度对应,显著提升开放世界6D姿态估计的鲁棒性。

Comments Accepted to IEEE Robotics and Automation Letters (RA-L). The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP

详情
AI中文摘要

开放词汇6D物体姿态估计使机器人能够仅凭自然语言指令操控任意未见过的物体。然而,现有方法的一个关键限制是它们依赖于无约束的全局匹配策略。在开放世界场景中,尝试将锚点特征与整个查询图像空间进行匹配会引入过多的歧义,因为目标特征容易与背景干扰物混淆。为解决这一问题,我们提出了细粒度对应姿态估计(FiCoP),这是一个从易受噪声影响的全局匹配过渡到空间约束的补丁级对应的框架。为了系统地消除背景干扰,FiCoP首先采用以物体为中心的解耦步骤,将目标从宏观环境噪声中隔离出来。基于这个局部区域,我们的核心方法创新有两个方面。首先,提出了跨视角全局感知(CPGP)模块,通过显式上下文推理和文本引导的语义注入融合双视图特征,建立结构一致性。其次,我们设计了一个补丁相关预测器(PCP),利用补丁到补丁的相关矩阵作为结构先验。这生成一个精确的块状关联图,作为空间滤波器,强制执行细粒度、抗噪声的匹配。在REAL275和Toyota-Light数据集上的实验表明,与最先进方法相比,FiCoP的平均召回率分别提高了8.0%和6.1%,突显了其在复杂、无约束的开放世界环境中为机器人代理提供鲁棒和泛化感知的能力。源代码将在 https://github.com/zjjqinyu/FiCoP 公开。

英文摘要

Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, trying to match anchor features against the entire query image space introduces excessive ambiguity, as target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially-constrained patch-level correspondence. To systematically eliminate background interference, FiCoP first employs an object-centric disentanglement step to isolate the target from macro-level environmental noise. Building upon this localized region, our core methodological innovations are twofold. Firstly, a Cross-Perspective Global Perception (CPGP) module is proposed to fuse dual-view features, establishing structural consensus through explicit context reasoning and text-guided semantic injection. Secondly, we design a Patch Correlation Predictor (PCP) that leverages a patch-to-patch correlation matrix as a structural prior. This generates a precise block-wise association map, acting as a spatial filter to enforce fine-grained, noise-resilient matching. Experiments on the REAL275 and Toyota-Light datasets demonstrate that FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method, highlighting its capability to deliver robust and generalized perception for robotic agents operating in complex, unconstrained open-world environments. The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP.

2605.15796 2026-06-16 cs.CV 版本更新

Cross-Modal Registration Between 3D and 2D Fingerprints via Pose-Aware Unwrapping and Point-Cloud Fusion

通过姿态感知解缠和点云融合实现3D与2D指纹的跨模态注册

Xiongjun Guan, Jianjiang Feng, Jie Zhou

发表机构 * Department of Automation, Tsinghua University(自动化系,清华大学)

AI总结 本文提出统一框架,实现3D指纹预处理及其与接触式和非接触式2D指纹的配准,结合非参数可视化展开、点云融合、姿态归一化和姿态感知配准策略,提升3D与2D指纹兼容性。

详情
AI中文摘要

三维(3D)指纹保留了手指的整体几何形状和局部脊线结构,同时避免了接触引起的形变,但仍难以与传统的二维(2D)指纹系统集成。本文针对3D采集与跨模态匹配之间的中间阶段,提出一个统一框架,用于3D指纹预处理以及与非接触式和接触式2D模态之间的配准。该框架包含四个组件:1)一种非参数可视化与展开方法,将3D指纹点云转换为等效于滚动按捺的2D表示,无需依赖全局手指形状模型;2)一条点云融合流程,将多次局部3D采集配准并拼接为更完整的指纹模型;3)一种基于椭圆的姿态归一化方法,用于手指的规范对齐;4)一种姿态感知的跨模态配准策略,提高3D指纹与非接触式和接触式2D指纹的兼容性。在自采集的多模态指纹数据库(包含150根手指)上的实验表明,所提框架实现了脊线级的3D配准精度、鲁棒的姿态估计以及一致的2D兼容性提升。具体而言,3D融合误差集中在0.09 mm左右,非接触式2D-3D配准达到脊线尺度的投影精度,姿态感知展开相对于通用3D展开提高了真匹配分数。这些结果支持将3D指纹用作连接异构指纹模态的有效几何桥梁。基线实现已在 https://github.com/XiongjunGuan/3DFpVisual 公开发布。

英文摘要

Three-dimensional (3D) fingerprints preserve global finger geometry and local ridge structure while avoiding contact-induced deformation, but they remain difficult to integrate with legacy two-dimensional (2D) fingerprint systems. This paper addresses the intermediate stage between 3D acquisition and cross-modal matching, and presents a unified framework for 3D fingerprint preprocessing and registration across contactless and contact-based 2D modalities. The framework combines four components: 1) a nonparametric visualization and unwrapping method that converts a 3D fingerprint point cloud into a rolled-equivalent 2D representation without relying on a global finger-shape model; 2) a point-cloud fusion pipeline that registers and mosaics multiple partial 3D captures into a more complete fingerprint model; 3) an ellipse-based pose normalization method for canonical finger alignment; and 4) a pose-aware cross-modal registration strategy that improves compatibility between 3D fingerprints and both contactless and contact-based 2D fingerprints. Experiments on a self-collected multimodal fingerprint database containing 150 fingers show that the proposed framework achieves ridge-level 3D registration accuracy, robust pose estimation, and consistent gains in 2D compatibility. In particular, the 3D fusion error is concentrated around 0.09 mm, contactless 2D-3D registration reaches ridge-scale projection accuracy, and pose-aware unwrapping improves genuine matching scores relative to generic 3D unwrapping. These results support the use of 3D fingerprints as an effective geometric bridge across heterogeneous fingerprint modalities. The baseline implementation has been publicly released at https://github.com/XiongjunGuan/3DFpVisual.

2606.10550 2026-06-16 cs.CV cs.GR 版本更新

LentiAvatar: Pseudo-Multiview Reconstruction and Subpixel Prism Rendering for Real-Time Stereoscopic Communication

LentiAvatar:用于实时立体通信的伪多视图重建与亚像素棱镜渲染

Chufeng Fang, Dongdong Teng, Lilin Liu

发表机构 * Sun Yat-sen University(中山大学)

AI总结 提出LentiAvatar系统,通过单目视频重建可控头部化身,并利用亚像素编码的柱镜光栅显示实现实时裸眼立体通信,采用伪多视图监督和轮廓感知损失提升侧视质量。

Comments 10 pages, 5 figures, 3 tables

详情
AI中文摘要

实时立体视频通信一直是沉浸式远程呈现的目标,但实际系统仍需要专门的采集设备,或者只能将远程用户呈现为单个肖像视图。我们提出LentiAvatar,一种高斯头部化身系统,将单目化身采集与亚像素编码的裸眼柱镜光栅显示连接起来,用于实时自动立体通信。LentiAvatar从单目肖像视频中重建可控头部化身,并针对显示器形成的横向观看区域进行优化。该方法利用自然的头部转动作为伪多视图(PMV)监督,以约束在单目训练中观测不足的区域,包括头发、耳朵、下颌轮廓和颈部边界。可靠的侧面帧按偏航角分箱,对齐到虚拟相机,并在严格的头部和头发区域内进行监督;轮廓感知损失和分阶段正则化进一步抑制鬼影、alpha泄漏和深度不稳定,同时保留侧面细节。在运行时,LentiAvatar渲染32个虚拟视图,并使用校准的亚像素路由掩码将其编码为4K柱镜光栅图像。基于实时跟踪器的原型可维持10.65 FPS,而针对特定对象蒸馏得到的驱动器将同一显示管线提升至38.49 FPS。

英文摘要

Real-time stereoscopic video communication has long been a goal of immersive telepresence, yet practical systems still require specialized capture rigs or reduce remote users to a single portrait view. We present LentiAvatar, a Gaussian head-avatar system that connects monocular avatar capture with subpixel-encoded glasses-free lenticular display for real-time autostereoscopic communication. From a monocular portrait video, LentiAvatar reconstructs a controllable head avatar and optimizes it for the lateral viewing zones induced by the display. The method uses natural head turns as pseudo-multiview (PMV) supervision to constrain regions that are otherwise weakly observed in monocular training, including hair, ears, jaw contours, and neck boundaries. Reliable side frames are yaw-binned, aligned to virtual cameras, and supervised within a strict head-and-hair domain; contour-aware losses and staged regularization further suppress ghosting, alpha leakage, and depth instability while preserving lateral detail. At runtime, LentiAvatar renders 32 virtual views and encodes them into a 4K lenticular raster with calibrated subpixel-routing masks. The live-tracker prototype sustains 10.65 FPS, and a subject-specific distilled driver raises the same display pipeline to 38.49 FPS.
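
The yaw-binning step mentioned in the abstract can be pictured with a small sketch. This is only an illustration under stated assumptions: the function name `yaw_bin_frames`, the 10-degree bin width, the reliability threshold, and the rule that near-frontal frames are skipped are all hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def yaw_bin_frames(frames, bin_width_deg=10.0, min_score=0.5, min_abs_yaw=5.0):
    """Group reliable side-view frames by head yaw (hypothetical sketch).

    frames: iterable of (frame_id, yaw_deg, score) tuples.
    Returns {bin_index: [frame_id, ...]}, where bin k covers the yaw range
    [k * bin_width_deg, (k + 1) * bin_width_deg).
    """
    bins = defaultdict(list)
    for frame_id, yaw_deg, score in frames:
        if score < min_score:            # drop frames with unreliable tracking
            continue
        if abs(yaw_deg) < min_abs_yaw:   # near-frontal frames are not side views
            continue
        bins[math.floor(yaw_deg / bin_width_deg)].append(frame_id)
    return dict(bins)


frames = [
    ("a", 12.0, 0.90),
    ("b", 17.5, 0.80),
    ("c", -23.0, 0.95),
    ("d", 2.0, 0.99),    # too frontal, skipped
    ("e", 31.0, 0.20),   # unreliable, skipped
]
binned = yaw_bin_frames(frames)
```

Each bin could then be associated with one virtual camera, so that supervision from natural head turns is spread over the lateral viewing zones rather than concentrated on a few poses.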

2510.18189 2026-06-16 cs.GR cs.CV 版本更新

A Generalizable Light Transport 3D Embedding for Global Illumination

一种可泛化的全局光照光传输3D嵌入

Bing Xu, Mukund Varma T, Cheng Wang, Tzu-Mao Li, Lifan Wu, Bartlomiej Wronski, Ravi Ramamoorthi, Marco Salvi

发表机构 * UC San Diego and NVIDIA USA(加州大学圣迭戈分校和美国NVIDIA公司) UC San Diego USA(加州大学圣迭戈分校(美国)) NVIDIA USA(美国NVIDIA公司)

AI总结 提出一种可泛化的3D光传输嵌入方法,通过点云和Transformer直接预测全局光照,无需光栅化或路径追踪线索,适用于多种室内场景。

Comments SIGGRAPH 2026

详情
AI中文摘要

全局光照(GI)对于真实感渲染至关重要,但由于模拟间接光传输的复杂性,计算成本仍然很高。最近的神经方法主要依赖于逐场景优化,有时扩展到处理相机或几何体的变化。跨场景泛化的努力大多停留在2D屏幕空间,例如神经去噪或基于G-buffer的GI预测,这些方法常常遭受视角不一致和空间理解有限的问题。我们提出了一种可泛化的3D光传输嵌入,直接从3D场景配置近似全局光照,而不使用光栅化或路径追踪线索。每个场景被表示为具有几何和材质特征的点云。一个可扩展的Transformer建模全局点对点交互,将这些特征编码为神经基元。在渲染时,每个查询点通过最近邻搜索检索附近的基元,并通过交叉注意力聚合它们的潜在特征,以预测所需的渲染量。我们展示了在具有不同布局、几何体和材质的多样化室内场景中,漫反射全局光照预测的结果。为辐照度估计训练的嵌入可以通过有限的微调快速适应新的渲染任务。我们还展示了用于光泽材质空间方向辐射场估计的初步结果,并展示了归一化场如何加速无偏路径引导。该方法突显了一条将学习先验集成到渲染管线中的路径,而无需显式的光线追踪光照线索。

英文摘要

Global illumination (GI) is essential for realistic rendering but remains computationally expensive due to the complexity of simulating indirect light transport. Recent neural methods have mainly relied on per-scene optimization, sometimes extended to handle changes in camera or geometry. Efforts toward cross-scene generalization have largely stayed in 2D screen space, such as neural denoising or G-buffer based GI prediction, which often suffer from view inconsistency and limited spatial understanding. We propose a generalizable 3D light transport embedding that approximates global illumination directly from 3D scene configurations, without using rasterized or path-traced cues. Each scene is represented as a point cloud with geometric and material features. A scalable transformer models global point-to-point interactions to encode these features into neural primitives. At render time, each query point retrieves nearby primitives via nearest-neighbor search and aggregates their latent features through cross-attention to predict the desired rendering quantity. We demonstrate results on diffuse global illumination prediction across diverse indoor scenes with varying layouts, geometry, and materials. The embedding trained for irradiance estimation can be quickly adapted to new rendering tasks with limited fine-tuning. We also present preliminary results for spatial-directional radiance field estimation for glossy materials and show how the normalized field can accelerate unbiased path guiding. This approach highlights a path toward integrating learned priors into rendering pipelines without explicit ray-traced illumination cues.
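
The render-time query described above (nearest-neighbor retrieval followed by cross-attention over the retrieved primitives) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it uses brute-force search, a single attention head, and no learned projections, and all names (`knn`, `aggregate`, `prim_keys`, `prim_values`) are assumptions made for the example.

```python
import math


def knn(query, points, k):
    """Indices of the k points closest to `query` (brute force, Euclidean)."""
    dists = [(sum((q - p) ** 2 for q, p in zip(query, pt)), i)
             for i, pt in enumerate(points)]
    dists.sort()
    return [i for _, i in dists[:k]]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(query_pos, query_feat, prim_pos, prim_keys, prim_values, k=3):
    """Scaled dot-product attention over the k primitives nearest to the query."""
    idx = knn(query_pos, prim_pos, k)
    scale = math.sqrt(len(query_feat))
    scores = [sum(q * key for q, key in zip(query_feat, prim_keys[i])) / scale
              for i in idx]
    weights = softmax(scores)
    dim = len(prim_values[0])
    return [sum(w * prim_values[i][c] for w, i in zip(weights, idx))
            for c in range(dim)]


prim_pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
prim_keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
prim_values = [[1.0], [3.0], [100.0]]
out = aggregate((0.2, 0.0, 0.0), [1.0, 0.0], prim_pos, prim_keys, prim_values, k=2)
```

In this toy setup the distant third primitive is never retrieved, and the two nearby primitives receive equal attention weight because their keys are identical, so the output is the mean of their values.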

8. 医学影像与生物视觉 56 篇

2606.14727 2026-06-16 cs.CV 新提交

FairGen: Preference-Aligned Diffusion for Demographically Equitable Medical Image Synthesis

FairGen: 用于人口统计公平医学图像生成的偏好对齐扩散模型

Zhimin Li, Ruichen Zhang, Zhen Tan, Howard J Aizenstein, Jingtong Hu, Tianlong Chen

发表机构 * University of Pittsburgh, Swanson School of Engineering(匹兹堡大学斯旺森工程学院) The University of North Carolina at Chapel Hill, Department of Computer Science(北卡罗来纳大学教堂山分校计算机科学系) Arizona State University, School of Computing and Augmented Intelligence(亚利桑那州立大学计算与增强智能学院) University of Pittsburgh, Department of Psychiatry(匹兹堡大学精神病学系)

AI总结 提出FairGen框架,通过将医生偏好嵌入扩散模型生成过程,合成人口统计平衡的医学图像,在皮肤、胸片和脑MRI任务上分别实现95.9%、80.0%和35.2%的公平性提升,同时保持诊断准确性。

Comments Accepted for publication in npj Digital Medicine. 20 pages, 6 figures

详情
AI中文摘要

医学影像学是现代诊断的核心,人工智能系统越来越多地用于支持基于图像的分析,以提高效率、准确性和医疗可及性。然而,医疗保健获取的不平等和疾病患病率的差异导致临床图像数据中存在严重的人口统计不平衡。由于疾病在不同人口群体中可能表现出不同的特征,使得某些表型表现自然罕见,这种不平衡进一步加剧。在这种不平衡数据上训练的AI模型有可能延续诊断偏见并扩大医疗差距。本文介绍了FairGen,一个公平感知的扩散框架,它在合成人口统计平衡的医学图像的同时保留与病理相关的视觉特征。通过将医生对齐的偏好嵌入生成过程,FairGen在合成和下游分类过程中改善了子组覆盖。应用于皮肤病学、放射学和神经影像学基准任务,FairGen在皮肤图像上实现了95.9%的公平性提升,在胸部X光片上实现了80.0%,在脑MRI上实现了35.2%,同时相对于在原始临床数据上训练的模型保持了有竞争力的诊断准确性。面向临床医生的专家评审和在独立队列上的外部验证进一步支持这些增益超越了标准保真度指标,并且不局限于原始分布内数据集。

英文摘要

Medical imaging is central to modern diagnostics, and artificial intelligence (AI) systems are increasingly used to support image-based analysis by improving efficiency, accuracy, and access to care. However, inequities in healthcare access and differential disease prevalence create severe demographic imbalances in clinical image data. Such imbalances are compounded by the fact that diseases can manifest with distinct features across demographic groups, rendering certain phenotypic presentations naturally rare. AI models trained on such imbalanced data risk perpetuating diagnostic bias and widening healthcare disparities. Here we introduce FairGen, a fairness-aware diffusion framework that synthesizes demographically balanced medical images while preserving pathology-relevant visual features. By embedding physician-aligned preferences into the generation process, FairGen improves subgroup coverage during synthesis and downstream classification. Applied to dermatology, radiology, and neuroimaging benchmark tasks, FairGen achieves fairness improvements of 95.9% for skin images, 80.0% for chest radiography, and 35.2% for brain MRI, while maintaining competitive diagnostic accuracy relative to models trained on original clinical data. Clinician-facing expert review and external validation on independent cohorts further support that these gains extend beyond standard fidelity metrics and are not confined to the original in-distribution datasets.

2606.14731 2026-06-16 cs.CV 新提交

BBR-Net: Boundary-Balanced Replay for Continual Medical Image Segmentation

BBR-Net:用于连续医学图像分割的边界平衡重放

Zahid Ullah, Sieun Choi, Jihie Kim

发表机构 * Department of Computer Science and Artificial Intelligence, Dongguk University(东国大学计算机科学与人工智能系)

AI总结 提出边界平衡重放网络(BBR-Net),通过边界感知优先级和类别平衡选择重放样本,在连续心脏超声分割中减少灾难性遗忘并保持目标域适应能力。

详情
AI中文摘要

在域漂移下,基于重放的方法通常保留外观信息而没有显式建模解剖结构,因此连续学习在医学图像分割中仍然具有挑战性。本研究探究结构一致性是否控制连续心脏超声分割中的知识保留。我们提出边界平衡重放网络(BBR-Net),它使用边界感知优先级和类别平衡来选择重放样本,以保留解剖信息丰富的区域。该方法在CAMUS和CardiacNet上进行了前向(CAMUS到CardiacNet)和反向(CardiacNet到CAMUS)任务顺序的评估。在前向设置中,BBR-Net将源任务性能保持在接近离线联合训练参考的水平,同时显著减少灾难性遗忘并保持竞争性的目标任务适应。消融结果表明,边界感知优先级有助于保留,并且当与类别感知采样结合时,改善了源任务保留与目标任务适应之间的平衡。相反,反向设置揭示,当初始表示从噪声大且结构不一致的数据中学习时,结构感知重放会失败。为了隔离这种效应,我们进行了受控的结构扰动分析,逐步破坏源任务边界,同时保持数据集、架构和训练协议固定。随着结构可靠性降低,遗忘持续增加,表明重放有效性受存储结构信息质量的强烈影响,而不仅仅是记忆容量。这些发现表明,在域漂移下保留解剖结构是连续医学图像分割的核心因素,重放机制应考虑结构可靠性以支持稳健的知识保留。

英文摘要

Continual learning for medical image segmentation remains challenging under domain shift because replay-based methods often preserve appearance information without explicitly modeling anatomical structure. This study investigates whether structural consistency governs knowledge retention in continual cardiac ultrasound segmentation. We propose the Boundary-Balanced Replay Network (BBR-Net), which selects replay samples using boundary-aware priority and class balance to preserve anatomically informative regions. The method is evaluated on CAMUS and CardiacNet under forward (CAMUS to CardiacNet) and reverse (CardiacNet to CAMUS) task orders. In the forward setting, BBR-Net retains source-task performance close to an offline joint-training reference, while markedly reducing catastrophic forgetting and preserving competitive target-task adaptation. Ablation results show that boundary-aware prioritization contributes to retention and improves the balance between source-task preservation and target-task adaptation when combined with class-aware sampling. In contrast, the reverse setting reveals that structure-aware replay fails when initial representations are learned from noisy and structurally inconsistent data. To isolate this effect, we conduct a controlled structural perturbation analysis by progressively corrupting source-task boundaries while keeping the dataset, architecture, and training protocol fixed. Forgetting increases consistently as structural reliability decreases, suggesting that replay effectiveness is strongly influenced by the quality of stored structural information, rather than by memory capacity alone. These findings indicate that preserving anatomical structure under domain shift is a central factor in continual medical image segmentation, and that replay mechanisms should account for structural reliability to support robust knowledge retention.
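
One way to picture replay selection driven by boundary-aware priority and class balance is sketched below. The abstract does not give the scoring rule, so everything here is an assumption for illustration: the boundary score is the fraction of pixels with a differently labeled 4-neighbor, each sample carries a single hypothetical class tag, and the buffer is filled round-robin across classes.

```python
def boundary_fraction(mask):
    """Fraction of pixels that have a 4-neighbour with a different label."""
    h, w = len(mask), len(mask[0])
    count = 0
    for r in range(h):
        for c in range(w):
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w and mask[rr][cc] != mask[r][c]:
                    count += 1
                    break
    return count / (h * w)


def select_replay(samples, budget):
    """Pick `budget` sample ids, cycling over classes, highest boundary score first.

    samples: list of (sample_id, class_tag, mask) tuples.
    """
    by_class = {}
    for sample_id, class_tag, mask in samples:
        by_class.setdefault(class_tag, []).append((boundary_fraction(mask), sample_id))
    for queue in by_class.values():
        queue.sort(reverse=True)          # most boundary-rich samples first
    chosen = []
    while len(chosen) < budget and any(by_class.values()):
        for class_tag in sorted(by_class):
            if by_class[class_tag] and len(chosen) < budget:
                chosen.append(by_class[class_tag].pop(0)[1])
    return chosen


flat = [[0, 0], [0, 0]]
edge = [[0, 1], [0, 1]]
corner = [[0, 0], [0, 1]]
samples = [("a", 1, flat), ("b", 1, edge), ("c", 2, corner), ("d", 2, flat)]
picked = select_replay(samples, budget=2)
```

A sketch like this also makes the abstract's negative result easy to see: if the stored masks have corrupted boundaries, the priority score ranks samples by noise rather than by anatomy.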

2606.14749 2026-06-16 cs.CV cs.AI 新提交

Automated 3D Kinematic Monitoring for Circadian Activity and Anomaly Detection in Juvenile Fish

幼鱼昼夜活动与异常检测的自动化三维运动监测

Chih-Wei Huang, Chang-Wen Huang, Chung-Ping Chiang, Tsung-Wei Pan

发表机构 * AI Research Center, National Taiwan Ocean Univ.(台湾海洋大学人工智能研究中心) Dept. of Aquaculture, National Taiwan Ocean Univ.(台湾海洋大学水产养殖系) Center of Excellence for the Oceans, National Taiwan Ocean University(台湾海洋大学海洋卓越研究中心)

AI总结 提出结合深度学习目标检测与双目立体视觉的高通量3D行为表型框架,实现高密度环境下幼鱼实时监测、体长估计和3D轨迹重建,首次量化自由游动幼鱼的真实物理速度,建立昼夜运动基线用于生理应激预警。

详情
AI中文摘要

精准水产养殖在追踪高分辨率行为特征方面面临“表型瓶颈”,因为传统方法无法量化瞬时三维(3D)身体活动。为解决这一问题,我们提出了一种高通量3D行为表型框架,将深度学习目标检测与双目立体视觉相结合,用于高密度环境下幼年罗非鱼的实时监测。该系统自动进行非接触式体长估计,并从绝对空间坐标重建3D游泳轨迹。通过消除2D透视畸变,该方法精确量化了3D速度和加速度,首次实现了对自由游动幼鱼真实物理游泳速度的估计。结果表明,该框架成功建立了昼夜运动基线,可作为生理应激的早期预警系统,并为鱼类活力提供客观指标。

英文摘要

Precision aquaculture faces a "phenotyping bottleneck" in tracking high-resolution behavioral traits, as conventional methods cannot quantify instantaneous three-dimensional (3D) physical exertion. To address this, we present a high-throughput 3D behavioral phenotyping framework integrating deep learning object detection with binocular stereo vision for real-time monitoring of juvenile tilapia in high-density environments. The system automates non-contact body length estimation and reconstructs 3D swimming trajectories from absolute spatial coordinates. By eliminating 2D perspective distortions, this approach precisely quantifies 3D velocity and acceleration, marking the first estimation of true physical swimming speeds in free-roaming juveniles. Results show the framework successfully establishes circadian locomotor baselines, serving as an early warning system for physiological stress and providing an objective metric for fish vitality.
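
The step from stereo detections to physical swimming speed can be sketched with standard rectified-stereo geometry: depth is Z = f * B / d for focal length f (pixels), baseline B (metres), and disparity d (pixels). The sketch below is an assumption-laden illustration rather than the paper's pipeline: it assumes a rectified pinhole camera pair, ignores refraction at the water surface or tank wall, and uses made-up calibration values.

```python
import math


def triangulate(u_left, v_left, u_right, f_px, baseline_m, cx, cy):
    """3D point (metres, left-camera frame) from a rectified stereo match."""
    disparity = u_left - u_right
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    z = f_px * baseline_m / disparity
    x = (u_left - cx) * z / f_px
    y = (v_left - cy) * z / f_px
    return (x, y, z)


def speeds(track, fps):
    """Speed in m/s between consecutive 3D positions sampled at `fps`."""
    return [math.dist(a, b) * fps for a, b in zip(track, track[1:])]


# Hypothetical calibration: 1000 px focal length, 10 cm baseline.
point = triangulate(700, 400, 600, f_px=1000.0, baseline_m=0.1, cx=640.0, cy=360.0)
track = [(0.0, 0.0, 1.0), (0.03, 0.04, 1.0)]
v = speeds(track, fps=30)
```

Because the speed is computed from metric 3D coordinates, a fish swimming toward or away from the camera is no longer underestimated the way it would be with 2D pixel displacements.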

2606.14759 2026-06-16 cs.CV cs.AI 新提交

Temporally Consistent and Controllable Video Generation of 2D Cine CMR via Latent Space Motion Modeling

基于潜在空间运动建模的二维电影心脏磁共振时序一致且可控视频生成

Yiheng Cao, Gustavo Andrade-Miranda, Jiatian Zhang, Guillaume Sallé, Xin Gao

发表机构 * Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences(苏州生物医学工程与技术研究所,中国科学院) SyCoIA, IMT Mines Ales(SyCoIA,IMT Mines Ales)

AI总结 提出一种文本到视频生成方法,通过解耦心脏空间结构与时间运动,利用微调扩散模型合成初始帧,再以心脏相位嵌入条件化潜在流模型生成完整运动,实现高时序一致性和解剖可控性。

详情
Journal ref ISBI 2026 - IEEE International Symposium on Biomedical Imaging, Apr 2026, London, United Kingdom. pp.1-4
AI中文摘要

电影心脏磁共振是评估心脏功能的金标准,但公共数据集的稀缺限制了先进数据驱动模型的发展。为解决这一限制,我们提出一种生成方法,用于合成时间上连贯且解剖上一致的心脏序列。我们的文本到视频框架将心脏空间结构与时间运动解耦。首先,一个微调的扩散模型根据临床文本提示合成初始帧,控制解剖特征。然后,一个以心脏相位嵌入为条件的潜在流模型生成完整的心脏运动,确保空间一致性和时间控制。我们的模型生成解剖和病理多样化的序列,具有高时间连贯性和对输入提示的强保真度,图像真实感的FID为31.68,文本-图像对齐的CLIP得分为31.04。这些实验结果突显了其产生高保真、按需医疗数据的潜力,为数据稀缺提供了可扩展的解决方案。

英文摘要

Cine cardiac magnetic resonance is the gold standard for assessing cardiac function, but the scarcity of public datasets limits the development of advanced data-driven models. To address this limitation, we propose a generative method for synthesizing temporally coherent and anatomically consistent cardiac sequences. Our text-to-video framework decouples cardiac spatial structure from temporal motion. First, a fine-tuned diffusion model synthesizes an initial frame from a clinical text prompt, controlling anatomical features. Then, a latent flow model conditioned on a cardiac phase embedding generates the complete cardiac motion, ensuring spatial consistency and temporal control. Our model generates anatomically and pathologically diverse sequences with high temporal coherence and strong fidelity to input prompts, achieving a FID of 31.68 for image realism and a CLIP score of 31.04 for text-image alignment. These experimental results highlight its potential to produce high-fidelity, on-demand medical data, offering a scalable solution to data scarcity.

2606.14766 2026-06-16 cs.CV cs.AI cs.MA 新提交

XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems

XMedFusion:面向自主医疗系统的知识引导多模态感知与推理框架

Hamza Riaz, Arham Haroon, Maha Baig, Muhammad Dawood Rizwan, Muhammad Naseer Bajwa, Muhammad Moazam Fraz

发表机构 * National University of Sciences and Technology (NUST)(巴基斯坦国立科技大学) University of Oxford(牛津大学)

AI总结 提出XMedFusion模块化AI框架,通过视觉感知、知识图谱构建和检索引导生成等智能体协同,增强放射学报告生成的视觉基础与临床发现捕捉能力,在公共数据集上显著优于基线模型。

Comments Accepted at the 2026 International Conference on Robotics and Automation in Industry (ICRAI)

详情
AI中文摘要

自主医疗和机器人系统日益依赖智能感知与推理能力来解释视觉数据并支持临床决策。放射学报告生成是此类自动化诊断工作流的关键组成部分,然而现有的端到端多模态模型常因视觉基础薄弱而导致不可靠的解释和细微临床发现的遗漏。本文提出XMedFusion,一个模块化AI框架,设计为自主医疗系统的智能感知与推理模块。该框架将视觉信息分解为协调的功能组件,模拟专家驱动的分析,包括提取图像基础证据的视觉感知智能体、构建临床相关发现结构的知识图谱构建智能体,以及确保报告结构一致的检索引导起草过程。合成智能体通过推理驱动的验证迭代整合视觉和结构化证据,生成可靠且可解释的诊断输出。在公共胸部X光片数据集上的实验评估表明,与基线视觉-语言模型相比,在BLEU-1上提升0.0493至0.3359,ROUGE-L上提升0.0863至0.2440,METEOR上提升0.0829至0.1708,同时在语义评估指标如一致性(2.38至7.80)和准确性(2.34至6.93)上也有显著提升。结果突出了结构化多智能体感知与推理在增强智能医学成像系统的鲁棒性、透明度和自动化方面的有效性,使其能够集成到自主医疗和机器人诊断工作流中。

英文摘要

Autonomous medical and robotic systems increasingly rely on intelligent perception and reasoning capabilities to interpret visual data and support clinical decision making. Radiology report generation represents a critical component of such automated diagnostic workflows, yet existing end-to-end multimodal models often suffer from weak visual grounding, resulting in unreliable interpretations and omission of subtle clinical findings. This paper presents XMedFusion, a modular AI framework designed as an intelligent perception and reasoning module for autonomous medical systems. The proposed framework decomposes visual information into coordinated functional components that emulate expert-driven analysis, including a visual perception agent that extracts image-grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval-guided drafting process that ensures a consistent reporting structure. A synthesis agent iteratively integrates visual and structured evidence through reasoning-driven verification to produce reliable and interpretable diagnostic outputs. Experimental evaluation on a public chest radiograph dataset demonstrates significant improvements over baseline vision-language models, achieving gains from 0.0493 to 0.3359 in BLEU-1, 0.0863 to 0.2440 in ROUGE-L, and 0.0829 to 0.1708 in METEOR, along with substantial improvements in semantic evaluation metrics such as Consistency (2.38 to 7.80) and Accuracy (2.34 to 6.93). The results highlight the effectiveness of structured multi-agent perception and reasoning for enhancing robustness, transparency, and automation in intelligent medical imaging systems, enabling integration into autonomous healthcare and robotic diagnostic workflows.

2606.14803 2026-06-16 cs.CV 新提交

HSQ-VLM: A Novel Spatially-Constrained Quadrant Segmentation VLM Model for Explainability in Diabetic Retinopathy

HSQ-VLM: 一种用于糖尿病视网膜病变可解释性的新型空间约束象限分割VLM模型

Shivum Telang

发表机构 * Pittsburgh, Pennsylvania(宾夕法尼亚州匹兹堡)

AI总结 提出HSQ-VLM,利用地标锚定笛卡尔交叉注意力机制和四象限拓扑潜在划分,实现眼底图像中病变的解剖精确量化与自然语言报告生成,在出血和微动脉瘤检测上分别达到99.6%和96.4%的灵敏度。

详情
AI中文摘要

糖尿病视网膜病变(DR)是一种侵袭性视网膜疾病,也是全球失明的主要原因,但其临床管理目前受到诊断AI黑箱性质的阻碍。虽然深度学习模型实现了高分类准确率,但严重缺乏能够详细描述导致DR临床决策的确切解剖标志和病变分布的可解释性方法。因此,我们提出了HSQ-VLM,一种新颖的眼底图像象限分割流水线,利用地标锚定笛卡尔交叉注意力机制将视觉特征提取与结构化临床推理统一起来。与依赖任意图像划分的传统方法不同,我们的流水线实现了四象限拓扑潜在划分(TLP),以动态地将视网膜特征与以中央凹为中心的坐标系对齐。这使得视觉语言模型能够生成以解剖精度量化病理的自然语言报告。在包含3,500张高分辨率眼底图像的数据集上,该方法实现了出血检测灵敏度99.6%和微动脉瘤检测灵敏度96.4%,同时与标准分割基线相比,边界模糊误差显著减少。

英文摘要

Diabetic Retinopathy (DR) is an aggressive retinal disease and a leading cause of global blindness, yet its clinical management is currently hindered by the black-box nature of diagnostic AI. While deep learning models achieve high classification accuracy, there is a critical lack of explainability methods capable of detailing the exact anatomical landmarks and lesion distributions that lead to a clinical decision for DR. Therefore, we propose HSQ-VLM, a novel quadrant segmentation pipeline on fundus images that utilizes a Landmark-Anchored Cartesian Cross-Attention mechanism to unify visual feature extraction with structured clinical reasoning. Unlike traditional methods that rely on arbitrary image partitioning, our pipeline implements 4-quadrant Topological Latent Partitioning (TLP) to dynamically align retinal features with a fovea-centered coordinate system. This allows the Vision-Language Model to generate natural language reports that quantify pathology with anatomical precision. On a dataset of 3,500 high-resolution fundus images, this innovative methodology achieved a lesion detection sensitivity of 99.6% for hemorrhages and 96.4% for microaneurysms, while demonstrating a significant reduction in boundary-ambiguity errors compared to standard segmentation baselines.

2606.14957 2026-06-16 cs.CV 新提交

Learning Sparse Latent Predictive Foundation Model for Multimodal Neuroimaging

学习用于多模态神经影像的稀疏潜在预测基础模型

Haoxu Huang, Long Chen, Jingyun Chen, Jinu Hyun, James Ryan Loftus, Kara Melmed, Daniel Orringer, Jennifer Frontera, Seena Dehkharghani, Arjun Masurkar, Narges Razavian

发表机构 * New York University, Center for Data Science(纽约大学数据科学中心) NYU Grossman School of Medicine, Department of Radiology(纽约大学格罗斯曼医学院放射学系) State University of New York at Binghamton, School of Computing(纽约州立大学宾汉姆顿分校计算机学院) NYU Grossman School of Medicine, Department of Neurology(纽约大学格罗斯曼医学院神经病学系) NYU Grossman School of Medicine, Department of Neurosurgery(纽约大学格罗斯曼医学院神经外科学系) NYU Grossman School of Medicine, Department of Pathology(纽约大学格罗斯曼医学院病理学系) School of Medicine, Department of Radiology, Stanford(斯坦福大学医学院放射学系) NYU Grossman School of Medicine, Department of Neuroscience(纽约大学格罗斯曼医学院神经科学系) NYU Grossman School of Medicine, Neuroscience Institute(纽约大学格罗斯曼医学院神经科学研究所)

AI总结 提出Neuro-JEPA模型,结合潜在预测目标和专家混合架构,学习T1w、T2w和FLAIR三种MRI序列的统一表示,在25项临床任务和22项公开数据集任务上优于现有基础模型和CNN基线。

Comments Under Review Preprint

详情
AI中文摘要

脑部MRI通常作为多个互补序列采集,具有独特的对比度加权,包括T1加权成像(T1w)解剖对比和液体敏感T2加权(T2w)对比。然而,在医疗系统规模上,跨多种MRI对比机制学习统一表示的方法尚缺乏。在本研究中,我们引入了Neuro-JEPA,一种稀疏多模态神经影像基础模型,它结合了潜在预测目标和专家混合架构,以编码跨核心T1w、T2w和液体抑制FLAIR成像(FLAIR)的脑部MRI。我们进一步对架构、掩码、目标和稀疏性设计选择进行了系统的方法论研究,这些选择有利于稳健的神经影像多模态表示学习。Neuro-JEPA在428,647项研究的1,551,862次扫描上进行了预训练,这些扫描经过了模态特定的预处理和跨三种核心结构脑部MRI序列的数据整理。我们在临床和研究环境中评估了学习到的表示,包括来自三个医疗系统(NYU Langone、NYU Long Island和Massachusetts General Hospital)的25项任务,以及来自12个公开数据集的22项任务,涵盖了单模态、多模态和跨域评估配置。在这些基准测试中,现有的神经影像基础模型相对于简单的卷积神经网络(CNN)基线显示出不一致的提升,而Neuro-JEPA在所有评估设置中实现了更强且更一致的性能。这些结果建立了一个可扩展的多模态神经影像表示学习方法论框架,并强调了基础模型评估协议需要包括简单基线、临床异质性队列和受控的多模态比较。

英文摘要

Brain MRIs are routinely acquired as multiple complementary sequences with unique contrast weighting, including T1-weighted imaging (T1w) anatomic and fluid-sensitive T2-weighted (T2w) contrasts. However, methods for learning unified representations across the multitude of MRI contrast mechanisms at health-system scale are lacking. In this study, we introduce Neuro-JEPA, a sparse multimodal neuroimaging foundation model that combines a latent predictive objective with a Mixture-of-Experts architecture to encode brain MRI across core T1w, T2w, and fluid-suppressed FLAIR imaging (FLAIR). We further provide a systematic methodological study of architectural, masking, objective, and sparsity design choices beneficial for robust neuroimaging multimodal representation learning. Neuro-JEPA was pretrained on 1,551,862 scans from 428,647 studies after modality-specific preprocessing with data curation across three core structural brain MRI sequences. We evaluated the learned representations across clinical and research settings, including 25 tasks from three health systems: NYU Langone, NYU Long Island, and Massachusetts General Hospital, and 22 tasks from 12 public datasets, covering unimodal, multimodal and cross-domain evaluation configurations. Across these benchmarks, existing neuroimaging foundation models showed inconsistent gains over a simple convolutional neural network (CNN) baseline, whereas Neuro-JEPA achieved stronger and more consistent performance across all evaluated settings. These results establish a scalable methodological framework for multimodal neuroimaging representation learning and highlight the need for foundation model evaluation protocols that include simple baselines, clinically heterogeneous cohorts and controlled multimodal comparisons.

2606.15019 2026-06-16 cs.CV 新提交

Towards Global AI-Driven Cervical Cancer Screening

迈向全球人工智能驱动的宫颈癌筛查

Thuy Nuong Tran, Ömer Sümer, Evangelia Christodoulou, Lennart Nauschütte, Simon Kalteis, Martin Paulikat, Esmira Pashayeva, Klara Steinheuer, Isabella Borges, Piotr Kalinowski, Hermann Bussmann, Sieng Sokmney, Poeung Kuong, Sathiarany Vong, Achim Schneider, Magnus von Knebel-Doeberitz, Patrick Godau, Lena Maier-Hein

发表机构 * German Cancer Research Center (DKFZ)(德国癌症研究中心) Heidelberg University(海德堡大学) National Center for Tumor Diseases (NCT) Heidelberg(海德堡国家肿瘤疾病中心) Helmholtz Association(亥姆霍兹联合会) University of Heidelberg(海德堡大学) Medical Faculty Heidelberg(海德堡医学院) University Hospital Heidelberg(海德堡大学医院) German Consortium for Translational Cancer Research (DKTK)(德国转化癌症研究联盟) National Center for Tumor Diseases (NCT) Dresden(德累斯顿国家肿瘤疾病中心) University Hospital Carl Gustav Carus Dresden(德累斯顿卡尔·古斯塔夫·卡鲁斯大学医院) Technische Universität Dresden(德累斯顿工业大学) Helmholtz-Zentrum Dresden-Rossendorf (HZDR)(亥姆霍兹德累斯顿罗森多夫研究中心) University of Bonn(波恩大学) University Hospital Bonn(波恩大学医院) University of Cologne(科隆大学) University Hospital Cologne(科隆大学医院) University of Duisburg-Essen(杜伊斯堡-埃森大学) University Hospital Essen(埃森大学医院) University of Freiburg(弗莱堡大学) University Hospital Freiburg(弗莱堡大学医院) University of Göttingen(哥廷根大学) University Hospital Göttingen(哥廷根大学医院) University of Hamburg(汉堡大学) University Hospital Hamburg-Eppendorf(汉堡-埃彭多夫大学医院) University of Jena(耶拿大学) University Hospital Jena(耶拿大学医院) University of Kiel(基尔大学) University Hospital Schleswig-Holstein(石勒苏益格-荷尔斯泰因大学医院) University of Leipzig(莱比锡大学) University Hospital Leipzig(莱比锡大学医院) University of Lübeck(吕贝克大学) University Hospital Lübeck(吕贝克大学医院) University of Magdeburg(马格德堡大学) University Hospital Magdeburg(马格德堡大学医院) University of Mainz(美因茨大学) University Hospital Mainz(美因茨大学医院) University of Marburg(马尔堡大学) University Hospital Marburg(马尔堡大学医院) University of Munich (LMU)(慕尼黑大学) University Hospital Munich (LMU)(慕尼黑大学医院) Technical University of Munich (TUM)(慕尼黑工业大学) University Hospital rechts der Isar (TUM)(慕尼黑工业大学伊萨尔河右岸医院) University of Münster(明斯特大学) University Hospital Münster(明斯特大学医院) University of Regensburg(雷根斯堡大学) University Hospital Regensburg(雷根斯堡大学医院) University of Rostock(罗斯托克大学) University Hospital Rostock(罗斯托克大学医院) University of Tübingen(蒂宾根大学) University Hospital Tübingen(蒂宾根大学医院) University of Ulm(乌尔姆大学) University Hospital Ulm(乌尔姆大学医院) University of Würzburg(维尔茨堡大学) University Hospital Würzburg(维尔茨堡大学医院)

AI总结 提出首个基于深度学习、在多国数据上验证的宫颈癌筛查方法,通过多任务学习同时进行图像分类和病变分割,在内部验证中优于医生,但外部验证显示性能因国家而异。

Comments 20 pages, 9 figures

详情
AI中文摘要

全球消除宫颈癌是世界卫生组织设定的关键公共卫生目标,筛查项目可将死亡率降低高达80%。然而,中低收入国家在专家和活检服务方面资源有限。基于深度学习的算法为筛查提供了有前景的支持,但现有方法大多在单一国家的私有数据集上开发和验证。我们提出了首个基于深度学习的宫颈癌筛查方法,并在多国数据上进行了验证。技术上,我们将阴道镜图像中病变的检测和分类问题表述为多任务学习问题,同时进行图像级分类和病变分割。我们的模型在带有手动病变分割掩膜和相应组织病理结果的醋酸染色阴道镜图像私有数据集上训练,采用大量数据增强以应对图像变异性。在分布内验证中,以病理结果作为金标准,我们的算法在CIN1-(宫颈上皮内瘤变1级或更低)与CIN2+(2级或更高)分类中优于医学专家(平衡准确率:0.68 vs 0.64)。在来自四个国家的四个阴道镜数据集上进行外部验证,这些数据集在患病率和患者特征上存在显著差异,我们的方法相比基线方法表现出更优性能。不同国家间的性能差异较大,AUC值范围从0.54到0.80。总体而言,算法性能随年龄、转化区(最易发生病变的宫颈区域)、合并症和特征性体征的存在而变化,其中合并症的负面影响最大。未来工作应侧重于提高模型的鲁棒性和泛化能力。

英文摘要

The global elimination of cervical cancer is a key public health goal set by the World Health Organization (WHO), with screening programs reducing mortality by up to 80%. However, access to experts and biopsy services is limited in low- to middle-income countries (LMICs). Deep learning (DL)-based algorithms offer promising support for screening, but most existing approaches have been developed and validated on private datasets from single countries. We present the first DL-based approach to cervical cancer screening validated on data from multiple countries. Technically, we phrase the problem of detecting and classifying lesions in colposcopy images as a multi-task learning problem, in which we simultaneously perform image-level classification and lesion segmentation. Our model was trained on a private data set of acid stain colposcopy images with manually generated lesion segmentation masks and corresponding histopathological results, employing extensive data augmentation to address image variability. In an in-distribution validation with pathology results serving as ground truth, our algorithm outperformed medical experts (Balanced Accuracy: 0.68 vs 0.64) in CIN1- (Cervical intraepithelial neoplasia grade 1 or lower) versus CIN2+ (grade 2 or higher) classification. External validation on four colposcopy data sets from four countries featuring radical differences in prevalence and patient characteristics yielded superior performance of our method compared to baseline methods. Performance variability across countries was high, with AUC values ranging from 0.54 to 0.80. Overall, algorithm performance varied with age, transformation zone (cervical area most prone to lesion development), presence of comorbidities and pathognomonic signs, with comorbidities having by far the largest negative effect. Future work should focus on improving model robustness and generalizability.
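
Since the abstract reports balanced accuracy on cohorts whose prevalence differs widely, a small sketch of the metric may help. Balanced accuracy is the mean of sensitivity and specificity; the labels below are invented for illustration (1 standing for CIN2+, 0 for CIN1-) and are not data from the paper.

```python
def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in y_true")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy cohort with 20% prevalence: 2 positives, 8 negatives.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
ba = balanced_accuracy(y_true, y_pred)
plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy cohort plain accuracy is 0.8 while balanced accuracy is 0.6875, which shows why a prevalence-insensitive metric is the more informative choice when comparing screening sites with different disease rates.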

2606.15110 2026-06-16 cs.CV 新提交

Physics-Driven Zero-Shot MRI Reconstruction with Non-local Image Priors

基于非局部图像先验的物理驱动零样本MRI重建

Lingtong Zhang, Wenlei Li, Mu He, Li Xiao, Yang Ji

发表机构 * School of Information Science and Technology, University of Science and Technology of China(中国科学技术大学信息科学技术学院)

AI总结 提出一种物理驱动的零样本自监督学习框架,通过线圈灵敏度图引导的动态存储库、基于SPIRiT的正则化和非局部自相似性像素库,解决欠采样MRI重建中的监督不足和过拟合问题,在FastMRI数据集上达到最先进的性能。

详情
AI中文摘要

零样本自监督学习(ZS-SSL)已成为加速磁共振成像(MRI)重建的一种有前景的范式,消除了对全采样外部数据集的依赖。然而,仅从单个欠采样扫描中学习存在监督稀缺和优化不稳定的问题,常常导致过拟合或伪影。为了解决这些挑战,我们提出了一种鲁棒的物理驱动ZS-SSL框架,将物理一致性与图像域非局部先验协同结合。我们的方法引入了三项核心创新:(1)线圈灵敏度图(CSM)引导的动态存储库,通过基于线圈灵敏度约束过滤物理不一致的伪影来稳定训练轨迹;(2)基于SPIRiT的正则化,通过学习的相关核和随机掩蔽强制执行k空间自一致性;(3)非局部自相似性(NSS)像素库,利用前两个模块建立的高保真参考显式挖掘非局部解剖相似性,从而增强图像域的监督。在FastMRI数据集上的大量实验表明,我们的方法实现了最先进的性能,特别是在高加速因子下,有效弥合了零样本学习与监督方法之间的差距。代码可在https://github.com/Zolento/NS-SSL获取。

英文摘要

Zero-Shot Self-Supervised Learning (ZS-SSL) has emerged as a promising paradigm for accelerated Magnetic Resonance Imaging (MRI) reconstruction, eliminating the reliance on fully-sampled external datasets. However, learning solely from a single under-sampled scan suffers from supervision scarcity and optimization instability, often leading to overfitting or artifacts. To address these challenges, we propose a robust physics-driven ZS-SSL framework that synergizes physical consistency with image-domain non-local priors. Our method introduces three core innovations: (1) a Coil Sensitivity Map (CSM)-Guided Dynamic Repository, which stabilizes the training trajectory by filtering physically inconsistent artifacts based on coil sensitivity constraints; (2) a SPIRiT-based regularization, which enforces k-space self-consistency via a learned correlation kernel and stochastic masking; (3) a Non-Local Self-Similarity (NSS) Pixel Bank, which leverages the high-fidelity reference established by the former modules to explicitly mine non-local anatomical similarities, thereby augmenting supervision in the image domain. Extensive experiments on the FastMRI dataset demonstrate that our approach achieves state-of-the-art performance, particularly under high acceleration factors, effectively bridging the gap between zero-shot learning and supervised methods. The code is available at https://github.com/Zolento/NS-SSL.

2606.15129 2026-06-16 cs.CV cs.AI 新提交

EyeMVP: OCT-Informed Fundus Representation Learning via Paired CFP-OCT Pretraining

EyeMVP: 通过配对CFP-OCT预训练实现OCT启发的眼底表征学习

Zhuo Deng, Ruiheng Zhang, Ziheng Zhang, Weihao Gao, Yitong Li, Qian Wang, Lei Shao, Jiaoyue Dong, Zhixi Zeng, Lijian Fang, Haibo Wang, Xiaobin Lin, Tao Liu, Zhicheng Du, Zhengwei Zhang, Lin Yang, Zheng Gong, Xinyu Zhao, Zhenquan Wu, Fang Li, Zhiguang Zhou, Guoming Zhang, Sun Jing, Han Lv, Wenbin We, Lan Ma

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Beijing Tongren Eye Center, Beijing Tongren Hospital, Capital Medical University(首都医科大学附属北京同仁医院北京同仁眼科中心) Liangxiang Hospital of Beijing Fangshan District, Capital Medical University(首都医科大学北京市房山区良乡医院) The Third People's Hospital of Dalian(大连市第三人民医院) National Clinical Research Center for Endocrine and Metabolic Diseases, The Second Xiangya Hospital of Central South University(中南大学湘雅二医院国家内分泌代谢病临床医学研究中心) The Central Hospital of Baoji City(宝鸡市中心医院) Wuxi No.2 People's Hospital, Affiliated Wuxi Clinical College of Nantong University(南通大学附属无锡临床学院无锡市第二人民医院) Shenzhen Eye Hospital, Southern Medical University(南方医科大学深圳眼科医院) Beijing Friendship Hospital, Capital Medical University(首都医科大学附属北京友谊医院)

AI总结 提出跨模态视网膜基础模型EyeMVP,利用配对CFP-OCT预训练,通过跨模态掩码重建将OCT结构信息注入CFP表征,在16项下游任务中优于现有模型,尤其对黄斑疾病诊断有显著提升。

详情
AI中文摘要

彩色眼底摄影(CFP)是大规模视网膜筛查的主要手段,但其诊断能力受限于缺乏深度分辨的结构信息。光学相干断层扫描(OCT)提供横截面视网膜解剖结构,但在人群筛查中可及性较低。本文提出EyeMVP,一种跨模态视网膜基础模型,通过配对CFP-OCT预训练学习OCT启发的CFP表征。EyeMVP在来自中国八家医院112,642名患者的674,893个严格同眼同天配对CFP-OCT图像三元组上预训练。该模型使用跨模态掩码重建,以OCT相关监督丰富CFP表征,同时在推理时仅需CFP图像。为适应正面CFP与横截面OCT之间的非对齐成像几何,EyeMVP将源约束交叉注意力与CFP导出的结构掩码相结合。在16项下游任务中,包括分类、分割、少样本适应和跨模态检索,EyeMVP优于代表性视网膜基础模型,并在涉及黄斑和视神经结构的任务上表现出一致提升。对于CFP具有挑战性的黄斑疾病,EyeMVP在黄斑水肿上达到0.948的AUROC(对比EyeCLIP的0.852),在近视性黄斑劈裂上达到0.825。在一项探索性读者研究中,EyeMVP在黄斑水肿上超过初级和中级眼科医生组,但未达到高级眼科医生水平,而在近视性黄斑劈裂上,其平衡准确性数值上高于所有读者组。这些结果表明,像素级跨模态重建可以用OCT相关监督丰富CFP表征,为筛查环境中基于CFP的更强视网膜分析提供了一条实用途径。

英文摘要

Color fundus photography (CFP) is the mainstay for large-scale retinal screening, yet its diagnostic capacity is constrained by the lack of depth-resolved structural information. Optical coherence tomography (OCT) provides cross-sectional retinal anatomy, but is less accessible in population-level screening. Here, we present EyeMVP, a cross-modal retinal foundation model that uses paired CFP-OCT pretraining to learn OCT-informed CFP representations. EyeMVP is pretrained on 674,893 strict same-eye same-day paired CFP-OCT image triples from 112,642 patients across eight hospitals in China. The model uses cross-modal masked reconstruction to enrich CFP representations with OCT-associated supervision, while requiring only CFP images at inference. To accommodate the non-aligned imaging geometry between en-face CFP and cross-sectional OCT, EyeMVP combines source-constrained cross-attention with CFP-derived structural masks. Across 16 downstream tasks, including classification, segmentation, few-shot adaptation, and cross-modal retrieval, EyeMVP outperforms representative retinal foundation models and shows consistent gains on tasks involving macular and optic nerve structure. For CFP-challenging macular diseases, EyeMVP achieves an AUROC of 0.948 for macular edema (vs. 0.852 for EyeCLIP) and 0.825 for myopic macular schisis. In an exploratory reader study, EyeMVP exceeds junior and intermediate ophthalmologist groups but does not reach senior ophthalmologist performance on macular edema, while showing numerically higher balanced accuracy than all reader groups on myopic macular schisis. These results suggest that pixel-level cross-modal reconstruction can enrich CFP representations with OCT-associated supervision, providing a practical route toward stronger CFP-based retinal analysis in screening settings.

2606.15167 2026-06-16 cs.CV 新提交

Variational Network with Wavelet-based UNET in Accelerated MRI Reconstruction from Under Sampled K-space Data

基于小波UNET的变分网络在欠采样k空间数据加速MRI重建中的应用

Yasir Arafat Prodhan, Shaikh Anowarul Fattah

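Institutions * Bangladesh University of Engineering and Technology

AI summary Proposes a variational network with a wavelet-based U-Net that combines learnable multi-scale frequency representations with physics-guided iterative reconstruction. It suppresses artifacts and preserves high-frequency detail in both single-coil and multi-coil settings, and reports state-of-the-art performance on the fastMRI and M4Raw datasets.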

Comments 14 pages, 9 figures

详情
AI中文摘要

全采样MRI需要密集的k空间采集,导致扫描时间长、临床吞吐量降低以及对患者运动敏感性增加。加速MRI通过获取欠采样k空间数据并计算重建缺失信息来解决这一问题。然而,从欠采样测量中重建是高度病态的,可能引入混叠伪影、噪声放大和解剖细节丢失。尽管传统的并行成像和压缩感知方法缓解了这些问题,深度学习方法进一步提高了重建质量,但在激进欠采样下保留高频结构仍然具有挑战性。在这项工作中,我们提出了一种基于小波U-Net(W-UNet)的变分网络用于加速MRI重建。该框架将物理引导的迭代重建与可学习多尺度频率表示相结合。标准池化操作被离散小波变换和逆小波变换模块取代,实现了无损下采样,同时保留了低频结构和高频边缘细节。集成到细化和灵敏度图估计阶段后,所提出的设计在单线圈和多线圈设置下改善了伪影抑制、特征保留和重建保真度。在fastMRI膝盖和M4Raw脑部数据集上的实验显示了最先进的性能。消融研究进一步证实了基于小波的特征分解对加速MRI重建的有效性。

英文摘要

Fully sampled MRI requires dense k-space acquisition, leading to long scan times, reduced clinical throughput, and increased sensitivity to patient motion. Accelerated MRI addresses this by acquiring undersampled k-space data and reconstructing the missing information computationally. However, reconstruction from undersampled measurements is highly ill-posed and can introduce aliasing artifacts, noise amplification, and loss of anatomical detail. Although conventional parallel imaging and compressed sensing methods mitigate these issues, and deep learning methods have further improved reconstruction quality, preserving high-frequency structures under aggressive undersampling remains challenging. In this work, we propose a Variational Network with a Wavelet-based U-Net (W-UNet) for accelerated MRI reconstruction. The framework combines physics-guided iterative reconstruction with learnable multi-scale frequency representations. Standard pooling operations are replaced with Discrete Wavelet Transform and Inverse Wavelet Transform modules, enabling lossless downsampling while preserving low-frequency structure and high-frequency edge details. Integrated into the refinement and sensitivity map estimation stages, the proposed design improves artifact suppression, feature preservation, and reconstruction fidelity in both single-coil and multi-coil settings. Experiments on fastMRI knee and M4Raw brain datasets show state-of-the-art performance. Ablation studies further confirm the effectiveness of wavelet-based feature decomposition for accelerated MRI reconstruction.
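
The claim that wavelet modules give lossless downsampling can be illustrated with a one-level 2D Haar transform. This is a minimal sketch, not the authors' W-UNet module: the abstract does not name the wavelet, so the Haar choice and the function names `haar_dwt2` and `haar_idwt2` are assumptions made for illustration.

```python
def haar_dwt2(x):
    """One-level 2D Haar transform of a 2D list with even height and width.

    Returns four half-resolution sub-bands: one low-frequency band (ll)
    and three detail bands. Together they hold as many values as the input.
    """
    h, w = len(x), len(x[0])
    ll = [[0.0] * (w // 2) for _ in range(h // 2)]
    d1 = [[0.0] * (w // 2) for _ in range(h // 2)]
    d2 = [[0.0] * (w // 2) for _ in range(h // 2)]
    d3 = [[0.0] * (w // 2) for _ in range(h // 2)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            ll[i // 2][j // 2] = (a + b + c + d) / 2.0
            d1[i // 2][j // 2] = (a - b + c - d) / 2.0
            d2[i // 2][j // 2] = (a + b - c - d) / 2.0
            d3[i // 2][j // 2] = (a - b - c + d) / 2.0
    return ll, d1, d2, d3


def haar_idwt2(ll, d1, d2, d3):
    """Invert haar_dwt2 and recover the full-resolution array."""
    h, w = len(ll), len(ll[0])
    x = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, p, q, r = ll[i][j], d1[i][j], d2[i][j], d3[i][j]
            x[2 * i][2 * j] = (s + p + q + r) / 2.0
            x[2 * i][2 * j + 1] = (s - p + q - r) / 2.0
            x[2 * i + 1][2 * j] = (s + p - q - r) / 2.0
            x[2 * i + 1][2 * j + 1] = (s - p - q + r) / 2.0
    return x
```

Unlike max or average pooling, which keep one value per 2x2 block, the four sub-bands keep all four degrees of freedom, so the inverse transform can restore the input.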

2606.15176 2026-06-16 cs.CV cs.AI 新提交

Enabling Real-Time Point-of-Care Ultrasound Segmentation: A GPU-Free Deployment in Resource-Limited Settings

实现实时床旁超声分割:资源受限环境中的无GPU部署

Weihao Gao

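Institutions * School of Computer Science and Artificial Intelligence, Guangdong University of Education

AI summary Adapts the ultra-lightweight UltraSeg architecture for real-time ultrasound image segmentation on CPUs and mobile devices, with performance comparable to much larger models, removing the GPU dependency and lowering the cost of AI.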

Comments 15 pages, 4 figures

详情
AI中文摘要

超声成像因其低成本和高便携性成为全球最广泛使用的医学模态,然而人工智能(AI)的部署仍受限于对GPU加速模型的依赖,造成结构性矛盾:"智能"的成本超过了成像设备本身。在此,我们展示了UltraSeg的系统性适配和广泛评估,UltraSeg最初为结肠镜息肉分割设计的超轻量级架构,现被改造用于床旁超声(POCUS),涵盖跨越六个解剖部位(乳腺、甲状腺、肾脏、颈动脉、胎儿和小动物肿瘤)的十个公共数据集。我们在超声领域系统验证了两种变体:UltraSeg-130K(0.13M参数)在单核CPU上达到89.7 FPS,在翻新移动设备上达到34.8 FPS;而UltraSeg-500K(0.5M参数)在CPU上达到44.6 FPS,在移动设备上达到16.1 FPS。UltraSeg-500K在平均性能上匹配或超过31M参数的UNet,并接近105M参数的TransUNet,在外部验证集(UDIAT、DDTI)上具有优越的零样本跨数据集泛化能力。通过实现无需GPU依赖的临床级分割,本工作使AI成本与超声可及性相匹配,使先进诊断在资源受限环境中成为可能。

英文摘要

Ultrasound imaging is the most widely adopted medical modality globally due to its low cost and portability, yet artificial intelligence (AI) deployment remains constrained by reliance on GPU-accelerated models, creating a structural paradox where the cost of "intelligence" exceeds that of the imaging device itself. Here, we present the systematic adaptation and extensive evaluation of UltraSeg, an ultra-lightweight architecture originally developed for colonoscopic polyp segmentation, now engineered for point-of-care ultrasound (POCUS) across ten public datasets spanning six anatomical sites (breast, thyroid, kidney, carotid, fetal, and small-animal tumor). We systematically validate both variants in ultrasound domains: UltraSeg-130K (0.13M parameters) achieves 89.7 FPS on single-core CPUs and 34.8 FPS on a refurbished mobile device, while UltraSeg-500K (0.5M parameters) delivers 44.6 FPS on CPU and 16.1 FPS on mobile device. UltraSeg-500K matches or exceeds the Dice performance of the 31M-parameter UNet and approaches 105M-parameter TransUNet in average performance, with superior zero-shot cross-dataset generalization on external validation sets (UDIAT, DDTI). By enabling clinical-grade segmentation without GPU dependency, this work brings AI costs in line with ultrasound accessibility, making advanced diagnostics available in resource-limited settings.

2606.15250 2026-06-16 cs.CV cs.AI 新提交

Landmark-free Assessment of Lower-limb Alignment with Implicit Neural Shape Functions from Knee Radiographs

基于膝关节X光片的隐式神经形状函数的无地标下肢对齐评估

Zhisen Hu, Antti Kemppainen, David Johnson, Egor Panfilov, Huy Hoang Nguyen, Timothy Cootes, Claudia Lindner, Aleksei Tiulpin

Institutions * Division of Informatics, Imaging and Data Sciences, The University of Manchester; Research Unit of Health Sciences and Technology, University of Oulu; Medical Research Center Oulu, University of Oulu and Oulu University Hospital; Department of Trauma and Orthopaedics, Stockport NHS Foundation Trust, Stepping Hill Hospital; School of Health and Society, University of Salford; School of Biological Sciences, The University of Manchester; Weill Cornell Medicine, Cornell University

AI summary Proposes an Implicit Neural Shape Function (INSF) method that needs no explicit landmarks: anatomical shape is encoded into a latent space and clinical alignment measurements are regressed directly from it, giving automated lower-limb alignment assessment with performance comparable to existing methods and easy extension to new tasks.

Comments Accepted to MICCAI 2026

详情
AI中文摘要

下肢对齐(LLA)的放射学评估对于预测全膝关节置换术中的关节健康和手术结果至关重要。传统测量方法手动且耗时,而最近的机器学习方法通常依赖于定位一组固定的解剖标志。这种依赖性限制了灵活性,并且当临床定义发生变化时可能需要重新标注。为了解决这个问题,我们提出了一种使用隐式神经形状函数(INSF)的自动化工作流程。我们不依赖显式地标坐标,而是将解剖结构编码到紧凑的潜在空间中,并直接从这些潜在代码回归临床对齐测量。这种架构允许快速扩展到新任务,而无需改变骨干表示。我们在一个包含566张膝关节X光片的内部数据集上训练了我们的方法,每张图像都标注了股骨和胫骨的轮廓。我们在一个包含50名患者的内部测试数据集和一个来自MRKR数据集的402个术前病例的外部独立数据集上进行了评估。这些数据提供了手动临床测量,并且MRKR测量将公开可用。性能与最先进的基于地标的方法和手动一致性相当,同时提供了一种可扩展到其他测量任务的灵活形状表示。

英文摘要

Radiographic assessment of lower-limb alignment (LLA) is important for predicting joint health and surgical outcomes in total knee arthroplasty. Traditional measurement methods are manual and time-consuming, while recent machine learning approaches typically rely on locating a fixed set of anatomical landmarks. This dependence limits flexibility and may require re-annotation when clinical definitions change. To address this, we propose an automated workflow using Implicit Neural Shape Functions (INSF). Rather than relying on explicit landmark coordinates, we encode the anatomy into a compact latent space and regress clinical alignment measurements directly from these latent codes. This architecture allows for rapid extendability to new tasks without altering the backbone representation. We trained our method on an internal dataset of 566 knee radiographs, each annotated with the outline of the femur and tibia. We evaluated it on both an internal test dataset of 50 patients and a separate external set of 402 preoperative cases from the MRKR dataset. Manual clinical measurements are available for these data, and the MRKR measurements will be made publicly accessible. Performance was comparable to state-of-the-art landmark-based methods and manual agreement, while offering a flexible shape representation that can be extended to additional measurement tasks.

2606.15265 2026-06-16 cs.CV 新提交

Trusted Multi-View Deep Learning Classification of Fetal Congenital Heart Disease with Feature-level and Decision-level Fusion

基于特征级与决策级融合的胎儿先天性心脏病可信多视图深度学习分类

Tan Zhou, Shifa Yao, Suncheng Xiang, Dahong Qian, Baoying Ye

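AI summary Proposes a multi-view deep learning framework that fuses multiple echocardiographic views through feature extraction, attention mechanisms, and uncertainty-based decision fusion for high-accuracy binary classification of fetal congenital heart disease.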

详情
AI中文摘要

先天性心脏病(CHD)是指胚胎发育期间心脏和大血管发育异常导致的解剖结构异常。传统诊断方法往往难以达到高准确率和效率,尤其是在心脏解剖结构复杂的情况下。本研究提出了一种专门的多视图深度学习框架,利用超声心动图图像进行CHD二分类。使用包含五个视图的大规模CHD数据集训练模型,使其能够整合多角度图像数据。该框架利用先进的特征提取和注意力机制提高诊断精度和可靠性。还集成了基于不确定性的决策组件以处理低质量图像,从而增强诊断效果。实验结果表明,该方法在我们的数据集上达到了顶级性能,并为早期CHD检测提供了稳健的工具,凸显了其临床应用的潜力。数据集和源代码将在论文被接收后发布。

英文摘要

Congenital heart disease (CHD) refers to the abnormal anatomical structure caused by the abnormal development of the heart and great vessels during embryonic development. Traditional diagnostics often fail to achieve high accuracy and efficiency, especially given the complexity of cardiac anatomy. This study presents a specialized multi-view deep learning framework for CHD binary classification using echocardiographic images. A large-scale CHD dataset, including five views, was used to train the model, enabling it to integrate multi-angle image data. The framework utilizes advanced feature extraction and attention mechanisms to improve diagnostic precision and reliability. An uncertainty-based decision-making component is also integrated to handle low-quality images, enhancing diagnostic outcomes. Experimental results show that this method achieves top-tier performance on our dataset and provides a robust tool for early CHD detection, underscoring its potential for clinical use. The dataset and source code will be released upon paper acceptance.

2606.15304 2026-06-16 cs.CV 新提交

HemExp: Clinically-Guided Latent Diffusion for Modeling Hematoma Expansion

HemExp: 临床引导的潜在扩散模型用于血肿扩展示建模

Orhun Utku Aydin, Satoru Tanioka, Tzu I Chuang, Alexander Koch, Dimitrios Rallios, Marie Gultom, Begum Tahhan, Fujimaro Ishida, Dietmar Frey, Adam Hilbert

Institutions * CLAIM – Charité Lab for AI in Medicine, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin; Department of Neurosurgery, Mie Chuo Medical Center

AI summary Proposes HemExp, a clinically guided latent diffusion model that generates patient-specific follow-up CT images and hematoma segmentations, yielding spatial probability predictions of hematoma expansion and supporting uncertainty modeling under perturbed clinical variables.

详情
AI中文摘要

自发性脑出血(ICH)后的血肿扩展示(HE)是神经外科护理中急性分诊和治疗决策的主要决定因素。然而,现有方法大多提供二元扩展示风险或单一随访体积,限制了不确定性感知决策。我们提出HemExp,一种临床引导的潜在扩散模型,可生成患者特异性的随访非对比CT图像,以及脑实质和脑室内出血的分割。生成过程以基线成像、临床变量和明确的扩展示指标为条件,实现对真实临床场景的可控模拟。HemExp使用血肿感知多头变分自编码器,并通过条件扩散模型将进展建模为基线和随访潜在表示之间的差异。该模型在来自多个中心的450名患者的配对扫描上训练,并在来自一个保留机构的107名患者上评估。HemExp通过为每位患者生成多个合成随访图像来估计可能的随访血肿体积分布,从而生成空间HE概率图。扰动临床输入(如症状发作至成像时间或抗凝状态)会改变预测的随访血肿体积分布。HemExp扩展了二元预测器,并在成像空间中展示了稳健的临床相关结果估计,如血肿体积、脑室内受累和占位效应。总体而言,我们的结果支持可控潜在扩散作为早期ICH进展的不确定性感知建模的一个有前景的方向。

英文摘要

Hematoma expansion (HE) after spontaneous intracerebral hemorrhage (ICH) is a major determinant of acute triage and treatment decisions in neurosurgical care. However, most existing methods provide either a binary expansion risk or a single follow-up volume, limiting uncertainty-aware decisions. We introduce HemExp, a clinically-guided latent diffusion model that generates patient-specific follow-up non-contrast CT images, along with segmentations of intraparenchymal and intraventricular hemorrhage. Generation is conditioned on baseline imaging, clinical variables, and an explicit expansion indicator, enabling controllable simulation of realistic clinical scenarios. HemExp uses a hemorrhage-aware multi-head variational autoencoder and models progression as the difference between baseline and follow-up latent representations with a conditional diffusion model. The model is trained on paired scans from 450 patients across multiple centers and evaluated on 107 patients from a held-out institution. HemExp produces spatial HE probability maps by generating multiple synthetic follow-up images per patient to estimate distributions of plausible follow-up hematoma volumes. Perturbing clinical inputs such as symptom-onset-to-imaging time or anticoagulant status shifts the predicted follow-up volume distribution. HemExp extends binary predictors and demonstrates robust estimation of clinically relevant outcomes in the imaging space, such as hematoma volume, intraventricular involvement, and mass effects. Overall, our results support controllable latent diffusion as a promising direction for uncertainty-aware modeling of early ICH progression.
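
The step of turning several generated follow-ups into a probability map and a volume distribution can be sketched as below. This is an illustrative assumption, not the HemExp implementation: it treats each sample as a binary hemorrhage mask (2D here for brevity), takes the voxel-wise mean as the probability map, and the names `probability_map` and `sampled_volumes_ml` are hypothetical.

```python
def probability_map(sampled_masks):
    """Voxel-wise fraction of sampled masks that mark hemorrhage.

    sampled_masks: list of equally sized 2D lists with 0/1 entries,
    one per synthetic follow-up sample for the same patient.
    """
    n = len(sampled_masks)
    rows, cols = len(sampled_masks[0]), len(sampled_masks[0][0])
    return [
        [sum(mask[i][j] for mask in sampled_masks) / n for j in range(cols)]
        for i in range(rows)
    ]


def sampled_volumes_ml(sampled_masks, voxel_volume_ml):
    """Hemorrhage volume of each sample: voxel count times voxel volume."""
    return [
        sum(sum(row) for row in mask) * voxel_volume_ml
        for mask in sampled_masks
    ]
```

The list returned by `sampled_volumes_ml` is an empirical distribution of plausible follow-up volumes; quantiles of that list are one simple way to summarize the uncertainty.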

2606.15305 2026-06-16 cs.CV 新提交

CoMNeT: A MedNeXt-CorrDiff Framework for Volumetric Brain Tumor Segmentation

CoMNeT: 一种用于体积脑肿瘤分割的MedNeXt-CorrDiff框架

Michael L. Evans, MD Fayaz Bin Hossen, MD Shibly Sadique, Walia Farzana, Khan M. Iftekharuddin

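Institutions * Old Dominion University

AI summary Proposes the CoMNeT framework, which combines the 3D convolutional segmentation model MedNeXt with CorrDiff corrective-diffusion post-processing and ensembling to improve glioma segmentation accuracy, achieving higher Dice scores than the baseline models on the UTSW-Glioma dataset.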

Comments 10 pages, 4 figures, 2 tables

详情
AI中文摘要

从多参数磁共振成像(MRI)中准确分割脑肿瘤对于治疗规划、反应评估和定量神经肿瘤学研究至关重要。然而,由于肿瘤外观和MRI协议在患者扫描之间的差异,自动分割在计算机视觉中仍然是一项困难的任务。此外,临床重要区域如增强肿瘤(ET)和肿瘤核心(TC)通常相对于全脑体积较小,进一步增加了实现高体素级精度的难度。在本文中,我们展示了将现代3D卷积分割模型与基于校正扩散的细化和集成相结合,可以改善UTSW-Glioma数据集上的体积胶质瘤分割。我们提出了CoMNeT,一个MedNeXt-CorrDiff框架,该框架使用四种MRI模态作为输入,并预测ET、TC和全肿瘤(WT)区域,用于自动脑肿瘤分割。MedNeXt作为主要分割模型,使用全局响应归一化进行特征学习,而CorrDiff被训练为后处理残差细化方法,在最终阈值化之前纠正概率图中的错误。使用五折交叉验证,CoMNeT在大多数肿瘤区域取得了最高的Dice分数,ET、TC、WT和平均Dice分数分别为0.7543 +/- 0.0261、0.6806 +/- 0.0166、0.9049 +/- 0.0128和0.7798 +/- 0.0184。CoMNeT优于两个选定的基线模型:SegResNet(平均Dice 0.7555 +/- 0.0190)和独立MedNeXt(平均Dice 0.7697 +/- 0.0154)。我们的研究结果支持将校正扩散和折叠级概率集成作为现有最先进3D卷积模型用于自动胶质瘤分割的实用补充。

英文摘要

Accurate brain tumor segmentation from multiparametric magnetic resonance imaging (MRI) is critical for treatment planning, response assessment, and quantitative neuro-oncology research. However, automated segmentation remains a difficult task in computer vision because of variation in tumor appearance and MRI protocols across patient scans. Moreover, clinically important regions such as enhancing tumor (ET) and tumor core (TC) are often small relative to the full brain volume, further increasing the difficulty of achieving high voxel-level precision. In this paper, we show that combining a modern 3D convolutional segmentation model with corrective diffusion-based refinement and ensembling improves volumetric glioma segmentation on the UTSW-Glioma dataset. We propose CoMNeT, a MedNeXt-CorrDiff framework that uses four MRI modalities as input and predicts ET, TC, and whole tumor (WT) regions for automated brain tumor segmentation. MedNeXt is used as the primary segmentation model with Global Response Normalization for feature learning, while CorrDiff is trained as a postprocessing residual refinement method to correct errors in the probability maps before final thresholding. Using five-fold cross-validation, CoMNeT achieved the highest Dice score for most tumor regions, with ET, TC, WT, and average Dice scores of 0.7543 +/- 0.0261, 0.6806 +/- 0.0166, 0.9049 +/- 0.0128, and 0.7798 +/- 0.0184, respectively. CoMNeT outperformed two selected baseline models: SegResNet (0.7555 +/- 0.0190 average Dice) and standalone MedNeXt (0.7697 +/- 0.0154 average Dice). Our findings support the use of corrective diffusion and fold-level probability ensembling as practical additions to existing state-of-the-art 3D convolutional models for automated glioma segmentation.
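
Fold-level probability ensembling followed by thresholding, and the Dice score used for evaluation, can be sketched as follows. This is a hedged illustration under stated assumptions: plain averaging across folds, a 0.5 threshold, and flattened voxel lists are choices made here, and the function names are hypothetical rather than taken from the paper.

```python
def ensemble_probabilities(fold_probs):
    """Average per-voxel probabilities over folds.

    fold_probs: list of flattened probability maps, one per fold model.
    """
    n_folds = len(fold_probs)
    return [sum(values) / n_folds for values in zip(*fold_probs)]


def threshold(probs, cutoff=0.5):
    """Binarize a probability map (the 0.5 cutoff is an assumption)."""
    return [1 if p >= cutoff else 0 for p in probs]


def dice(pred, target):
    """Dice score 2|A and B| / (|A| + |B|) for two binary masks."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement
    return 2.0 * intersection / total
```

Averaging before thresholding lets a confident fold outvote an uncertain one, which a majority vote over already-binarized masks would not.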

2606.15323 2026-06-16 cs.CV 新提交

PPDM: Pixel Puzzling Diffusion Model for Speed and Memory Efficient Volumetric Medical Image Translation

PPDM: 像素拼图扩散模型用于速度和内存高效的体积医学图像翻译

Tianqi Chen, Jun Hou, Yinchi Zhou, James S. Duncan, Chi Liu, Bo Zhou

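Institutions * Department of Radiology, Northwestern University; Department of Biomedical Engineering, Yale University; Department of Radiology and Biomedical Imaging, Yale School of Medicine

AI summary Proposes the Pixel Puzzling Diffusion Model (PPDM) for 3D medical image translation, which uses a reversible pixel puzzle operation and a direct bridge diffusion formulation to reduce memory use and speed up inference while maintaining global consistency.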

Comments 12 pages, 5 figures, 5 tables

详情
AI中文摘要

扩散模型在医学图像到图像翻译中展现出优越的保真度,但其扩展到高分辨率3D体积受到高昂计算成本和GPU内存需求的严重限制。现有的内存高效策略常常牺牲全局体积一致性或精细解剖细节。在这项工作中,我们提出了像素拼图扩散模型(PPDM),一个简单而有效的框架,用于内存和速度高效的3D医学图像翻译。PPDM引入了一个可逆的像素拼图-解拼图操作,将空间分辨率转换为通道维度,显著减少激活内存同时保持全局上下文。为了进一步提高效率和稳定性,我们采用直接桥接扩散公式,从条件输入而非纯噪声开始,使模型能够专注于任务相关的残差。此外,引入拼图梯度损失以强制空间一致性并抑制空间重排引入的网格状伪影。我们在多个具有挑战性的3D医学图像翻译任务上评估PPDM,包括低计数PET去噪、联合PET去噪和衰减校正以及跨模态MRI翻译。在所有任务中,PPDM始终匹配或超越全3D扩散模型,同时将训练GPU内存使用减少高达一个数量级并显著加速推理,并且优于基于潜在压缩或频率分解的现有内存高效扩散方法。这些结果表明,PPDM在有限计算资源下为高保真3D扩散医学图像翻译提供了实用且可扩展的解决方案。

英文摘要

Diffusion models have demonstrated superior fidelity for medical image-to-image translation, but their extension to high-resolution 3D volumes is severely constrained by prohibitive computational cost and GPU memory requirements. Existing memory-efficient strategies often compromise global volumetric consistency or fine anatomical detail. In this work, we propose the Pixel Puzzling Diffusion Model (PPDM), a simple and effective framework for memory- and speed-efficient 3D medical image translation. PPDM introduces a reversible pixel puzzle-unpuzzle operator that trades spatial resolution for channel dimensionality, substantially reducing activation memory while preserving global context. To further improve efficiency and stability, we adopt a direct bridge diffusion formulation that starts from the conditional input rather than pure noise, enabling the model to focus on task-relevant residuals. In addition, a puzzle-gradient loss is incorporated to enforce spatial coherence and suppress grid-like artifacts introduced by spatial rearrangement. We evaluate PPDM on multiple challenging 3D medical image translation tasks, including low-count PET denoising, joint PET denoising and attenuation correction, and cross-modal MRI translation. Across all tasks, PPDM consistently matches or outperforms full 3D diffusion models while reducing training GPU memory usage by up to an order of magnitude and significantly accelerating inference, and it outperforms existing memory-efficient diffusion approaches based on latent compression or frequency decomposition. These results demonstrate that PPDM provides a practical and scalable solution for high-fidelity 3D diffusion-based medical image translation under limited computational resources.
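
A reversible operator that trades spatial resolution for channels can be sketched as a 3D space-to-depth rearrangement. The abstract does not define the puzzle operator exactly, so this is an assumed stand-in: `puzzle`, `unpuzzle`, and the factor `r` are hypothetical names, and the interleaved sampling pattern is one possible choice.

```python
def puzzle(vol, r=2):
    """Rearrange a D x H x W volume into r**3 channels of size
    (D/r) x (H/r) x (W/r). Each channel takes every r-th voxel starting
    at a different offset, so every channel still spans the whole volume.
    """
    depth, height, width = len(vol), len(vol[0]), len(vol[0][0])
    channels = []
    for dz in range(r):
        for dy in range(r):
            for dx in range(r):
                channels.append([
                    [[vol[z][y][x] for x in range(dx, width, r)]
                     for y in range(dy, height, r)]
                    for z in range(dz, depth, r)
                ])
    return channels


def unpuzzle(channels, r=2):
    """Exact inverse of puzzle: scatter each channel back to its offset."""
    d, h, w = len(channels[0]), len(channels[0][0]), len(channels[0][0][0])
    vol = [[[None] * (w * r) for _ in range(h * r)] for _ in range(d * r)]
    c = 0
    for dz in range(r):
        for dy in range(r):
            for dx in range(r):
                for z in range(d):
                    for y in range(h):
                        for x in range(w):
                            vol[z * r + dz][y * r + dy][x * r + dx] = channels[c][z][y][x]
                c += 1
    return vol
```

With `r=2` the spatial grid shrinks by a factor of 8 while the channel count grows by 8, so no voxel is discarded; the memory saving comes from the network's intermediate activations being computed on the smaller grid.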

2606.15370 2026-06-16 cs.CV cs.LG 新提交

MNet++: Extended 2D/3D Networks for Anisotropic Medical Image Segmentation

MNet++: 用于各向异性医学图像分割的扩展2D/3D网络

Kirsten Odendaal, Rade Bajic

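Institutions * School of Computing, Georgia Institute of Technology

AI summary Reproduces and extends MNet, a hybrid 2D/3D convolutional network, by adding adaptive fusion gating and a VMamba state-space module, improving segmentation performance while preserving robustness to anisotropy.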

详情
AI中文摘要

本工作展示了MNet的完整复现与扩展,MNet是一种专为各向异性医学图像分割设计的混合2D/3D卷积网络。在nnU-Net框架内重新实现了原始架构,以验证其报告的性能和对可变体素间距(即各向异性)的鲁棒性。在匹配的预处理和计算约束下,在PROMISE前列腺MRI和LiTS肝脏CT的受控子集上进行了实验。复现的MNet在PROMISE上达到了89.0 +/- 0.9%的Dice相似系数(DSC),与已发表结果相差0.8%,在LiTS上肝脏和肿瘤分割分别达到94.3 +/- 1.9%和54.6 +/- 3.1%。进一步引入了两种轻量级扩展:(1) 一种学习的融合门控机制,实现自适应2D-3D特征融合;(2) 一个VMamba状态空间模块,用于高效的长程深度建模。空间门控变体以不到3%的推理开销将DSC提高了+0.8%,而VMamba提高了性能一致性,将PROMISE Dice变异降低至+/- 0.7%,并在LiTS肝脏上达到最强性能,Dice为95.8%。两种扩展均保持了MNet对各向异性的鲁棒性,在1-4 mm体素间距下Dice变化为1.5%。总体而言,该研究证实了MNet的可复现性,并表明自适应融合和状态空间建模有潜力进一步增强各向异性条件下的分割可靠性。然而,需要进一步测试才能得出明确结论。

英文摘要

This work demonstrates a full reproduction and extension of MNet, a hybrid 2D/3D convolutional network designed for anisotropic medical image segmentation. The original architecture was re-implemented within the nnU-Net framework to verify its reported performance and robustness to variable voxel spacing, known as anisotropy. Experiments were conducted on PROMISE prostate MRI and a controlled subset of LiTS liver CT under matched preprocessing and compute constraints. The reproduced MNet achieved a Dice similarity coefficient (DSC) of 89.0 +/- 0.9% on PROMISE, within 0.8% of the published result, and 94.3 +/- 1.9% / 54.6 +/- 3.1% for liver and tumor segmentation on LiTS, respectively. Two lightweight extensions were further introduced: (1) a learned Fusion Gating mechanism enabling adaptive 2D-3D feature blending, and (2) a VMamba state-space module for efficient long-range depth modelling. The Spatial Gating variant improved DSC by +0.8% with less than 3% inference overhead, while VMamba improved performance consistency, reducing PROMISE Dice variation to +/- 0.7% and achieving the strongest LiTS liver performance at 95.8% Dice. Both extensions preserved MNet robustness to anisotropy, with delta Dice = 1.5% across 1-4 mm voxel spacing. Overall, the study confirms MNet reproducibility and demonstrates that adaptive fusion and state-space modelling have the potential to further strengthen segmentation reliability under anisotropic conditions. However, further tests are required to provide definitive conclusions.
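
Adaptive 2D-3D feature blending through a learned gate can be sketched as a sigmoid-weighted convex combination. This is only an assumed form of such a gate, not the MNet++ implementation: the scalar parameters `w2d`, `w3d`, and `bias` stand in for learned weights, and the function name is hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_fusion(f2d, f3d, w2d, w3d, bias):
    """Blend paired 2D and 3D feature values element by element.

    The gate g lies in (0, 1) and depends on both inputs, so each output
    value is a convex combination of the two branches:
    out = g * f2d + (1 - g) * f3d.
    """
    fused = []
    for a, b in zip(f2d, f3d):
        g = sigmoid(w2d * a + w3d * b + bias)
        fused.append(g * a + (1.0 - g) * b)
    return fused
```

With all gate parameters at zero the gate is 0.5 everywhere and the result is a plain average; training would move the gate toward whichever branch is more useful at each location.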

2606.15457 2026-06-16 cs.CV cs.LG 新提交

Lesion-DDPM: Lesion-Enhanced 3D Diffusion for MS MRI Synthesis

Lesion-DDPM:用于MS MRI合成的病灶增强3D扩散模型

Weidong Zhang, Yongchan Jung, Shafayat Mowla Anik, Furen Xiao, Vasudevan Janarthanan, Enkhzaya Chuluunbaatar, Byeong Kil Lee, Jeeho Ryoo

Institutions * University of Texas at Arlington; University of Texas at San Antonio; University of Texas at Dallas; National Taiwan University Hospital; National University of Mongolia; University of Texas at Austin

AI summary Proposes Lesion-DDPM, a 3D conditional diffusion framework for lesion-aware FLAIR synthesis that uses multi-level anatomical mask injection and a lesion-weighted reconstruction loss, improving Dice scores on the downstream MS lesion segmentation task.

详情
AI中文摘要

3D FLAIR MRI被广泛推荐为多发性硬化(MS)脑部成像的标准MRI序列之一,但公开可用的MS数据集仍然相对较小,且在不同扫描仪、采集协议和病灶模式上存在差异。这种稀缺性和异质性阻碍了稳健的神经影像机器学习模型的发展,尤其对于旨在合成图像同时保留小而稀疏病灶的生成模型而言,这是一个挑战。我们提出了Lesion-DDPM,一种用于病灶感知FLAIR合成的3D条件扩散框架,该框架结合了多级解剖掩膜注入以及病灶加权重建损失,以在保持整体大脑结构的同时强调病灶体素。使用MSLesSeg数据集的精选子集,我们将Lesion-DDPM与代表性的最先进GAN和扩散模型进行比较,评估图像生成指标和下游3D U-Net分割性能。在我们的实验中,Lesion-DDPM在所有方法中实现了最低的病灶区域重建误差。在下游3D U-Net病灶分割任务中,仅使用Lesion-DDPM生成的扫描训练并在真实MRI上评估的模型达到了0.616的Dice分数,而最佳竞争合成数据集为0.569。当将Lesion-DDPM图像添加到真实训练集中时,Dice分数进一步增加到0.685。

英文摘要

3D FLAIR MRI is widely recommended as one of the standard MRI sequences for brain imaging in multiple sclerosis (MS), but publicly available MS datasets remain relatively small and vary across scanners, acquisition protocols, and lesion patterns. This scarcity and variability hinder the development of robust neuroimaging machine learning models and are particularly challenging for generative models that aim to synthesize images while preserving small, sparse lesions. We propose Lesion-DDPM, a 3D conditional diffusion framework for lesion-aware FLAIR synthesis that incorporates multi-level anatomical mask injection together with a lesion-weighted reconstruction loss to emphasize lesion voxels while maintaining global brain structure. Using a curated subset of the MSLesSeg dataset, we compare Lesion-DDPM with representative state-of-the-art GAN- and diffusion-based models, assessing both image-generation metrics and downstream 3D U-Net segmentation. In our experiments, Lesion-DDPM achieved the lowest lesion-region reconstruction error among all methods. In a downstream 3D U-Net lesion segmentation task, a model trained only on Lesion-DDPM-generated scans and evaluated on real MRIs reached a Dice score of 0.616 compared with 0.569 for the best competing synthetic dataset. When Lesion-DDPM images were added to the real training set, the Dice score further increased to 0.685.
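
A lesion-weighted reconstruction loss can be sketched as a weighted mean squared error in which lesion voxels count more than background. The exact weighting in Lesion-DDPM is not given in the abstract, so the constant `lesion_weight`, the normalization by total weight, and the function name are assumptions for illustration.

```python
def lesion_weighted_mse(pred, target, lesion_mask, lesion_weight=5.0):
    """Weighted MSE over flattened voxels.

    Voxels inside the lesion mask get weight `lesion_weight`; all other
    voxels get weight 1. Dividing by the total weight keeps the loss on
    the same scale as an ordinary MSE.
    """
    weighted_error = 0.0
    total_weight = 0.0
    for p, t, m in zip(pred, target, lesion_mask):
        w = lesion_weight if m else 1.0
        weighted_error += w * (p - t) ** 2
        total_weight += w
    return weighted_error / total_weight
```

Because lesions occupy a small fraction of the brain volume, an unweighted loss is dominated by healthy tissue; the extra weight makes the same error cost more when it falls on a lesion voxel.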

2606.15611 2026-06-16 cs.CV cs.AI 新提交

Mutual Distillation of Dual-Foundation Models for Semi-Supervised PET/CT Segmentation

双基础模型的相互蒸馏用于半监督PET/CT分割

Fuyou Mao, Beining Wu, Yanfeng Jiang, Bohan Xu, Lixin Lin, Naye Ji, Hao Zhang, Yan Tang

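Institutions * Central South University; Hangzhou Dianzi University; Communication University of Zhejiang; Northeastern University

AI summary Proposes the MuDuo framework, which distills knowledge from SAM-Med3D (CT) and SegAnyPET (PET) into a lightweight student network for semi-supervised organ segmentation, reaching state-of-the-art performance on the AutoPET dataset with only 5 labeled cases.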

Comments MICCAI 2026

详情
AI中文摘要

PET/CT的器官分割对于肿瘤学中的定量分析和放疗计划至关重要。为了降低PET/CT分割的高标注成本,半监督学习(SSL)为使用有限标注数据开发深度模型提供了一种实用且有效的解决方案。视觉基础模型的最新发展展示了显著的适应性和更高的效率。在这项工作中,我们提出了一个相互蒸馏框架,该框架无缝地利用了结构性和功能性基础模型,这些模型作为模态特定的通才,从结构性CT和代谢性PET成像中蒸馏知识。通过弥合学生模型的任务特定精度与通才基础模型的分割先验之间的差距,我们提出了MuDuo,一个相互蒸馏框架,协同利用SAM-Med3D用于CT和SegAnyPET用于PET,将它们的知识蒸馏到一个轻量级学生网络中。我们的方法消除了手动提示的需要,同时最大化未标注数据在自动分割中的效用,在AutoPET数据集上仅使用5个标注案例就达到了最先进的性能。我们的源代码可在https://github.com/Wu-beining/MuDuo获取。

英文摘要

Organ segmentation from PET/CT is critical for quantitative analysis and radiotherapy planning in oncology. To ease the high annotation cost of PET/CT segmentation, semi-supervised learning (SSL) provides a practical and effective solution for developing deep models with limited labeled data. Recent developments in visual foundation models have demonstrated remarkable adaptability with improved efficiency. In this work, we propose a mutual distillation framework that seamlessly exploits both structural and functional foundation models, which act as modality-specific generalists for distilling knowledge from structural CT and metabolic PET imaging. By bridging the gap between the task-specific precision of student models and the segmentation priors of generalist foundation models, we propose MuDuo, a mutual distillation framework that synergistically leverages SAM-Med3D for CT and SegAnyPET for PET to distill their knowledge into a lightweight student network. Our approach eliminates the need for manual prompts while maximizing the utility of unlabeled data for automatic segmentation, achieving state-of-the-art performance on the AutoPET dataset with only 5 labeled cases. Our source code is available at https://github.com/Wu-beining/MuDuo.

2606.15667 2026-06-16 cs.CV 新提交

CEVAR: Centerline Embedding Extraction for Endovascular Aneurysm Repair

CEVAR:用于血管内动脉瘤修复的中心线嵌入提取

Roman Naeem, Timo Niiniskorpi, Charlotte Sandström, Naman Desai, Anders Jeppsson, Ida Häggström, Fredrik Kahl, Håkan Roos, Jennifer Alvén

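Institutions * Chalmers University of Technology; The University of Gothenburg; Sahlgrenska University Hospital

AI summary To address post-EVAR rupture caused by sealing-zone failure, proposes a transformer framework that combines 3D centerline tracking with embedding-based geometric prediction for automated sealing-zone assessment, outperforming the semi-automatic workflow on the full test set and on a no-contrast CT subset.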

Comments Submitted Version. Accepted at MICCAI 2026

详情
AI中文摘要

由于支架移植物密封区密封失效导致EVAR术后破裂,血管内动脉瘤修复(EVAR)后的长期死亡率仍然很高。使用中心线测量的结构化CT审查可改善检测,但当前工作流程需要手动中心线编辑和专家操作。我们提出了一种用于自动化、协议驱动的密封区评估的Transformer框架,该框架将3D中心线追踪与基于嵌入的几何预测相结合。评估了两种最先进的图像到图模型,用于从随访CT中提取主动脉-髂动脉中心线,并根据EVAR4C协议测量支架位置、血管直径和密封长度。在整个测试集和具有挑战性的无对比剂子集上,所提出的全自动方法优于商业半自动工作流程。

英文摘要

Long-term mortality rates after endovascular aneurysm repair (EVAR) remain elevated due to post-EVAR rupture caused by loss of seal in stent graft sealing zones. Structured CT review using centerline measurements improves detection, but current workflows require manual centerline editing and expert operators. We propose a transformer framework for automated, protocol-driven sealing zone assessment that combines 3D centerline tracking with embedding-based geometric prediction. Two state-of-the-art image-to-graph models are evaluated for aorto-iliac centerline extraction from follow-up CT and for measurement of stent position, vessel diameters, and seal lengths according to the EVAR4C protocol. Across the full test set and a challenging no-contrast subset, the proposed fully automatic method outperforms the commercial semi-automatic workflow.
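
Once a centerline has been extracted as an ordered list of 3D points, a length along the vessel can be computed as the arc length of the polyline. This is a generic sketch, not the EVAR4C definition of seal length, which the abstract does not spell out; the function name and the millimetre unit are assumptions.

```python
import math


def centerline_length_mm(points):
    """Arc length of a polyline through ordered (x, y, z) points in mm."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))
```

Measuring along the centerline rather than in a straight line matters in tortuous aorto-iliac anatomy, where the two distances can differ considerably.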

2606.15772 2026-06-16 cs.CV 新提交

Ellipse Meets Bit-Planes: A Novel Approach to RNFL based Glaucoma Detection Using Advanced Image Processing and Deep Learning

椭圆遇上位平面:基于先进图像处理和深度学习的RNFL青光眼检测新方法

Snigdha Paul, Sambit Mallick, Anindya Sen

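Institutions * Heritage Institute of Technology

AI summary Proposes an adaptive ellipse-based polar transformation to enhance RNFL analysis, then builds two glaucoma detection frameworks on it: deep-learning feature fusion (99.3% detection rate) and bit-plane-slicing image processing (92.31% accuracy).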

详情
AI中文摘要

本工作提出了一种从易获取的彩色眼底图像中自动检测青光眼的集成流程,基于自适应椭圆极坐标变换算法,增强对视网膜神经纤维层(RNFL)作为观察青光眼变化的主要生物标志物的分析,不受视盘和黄斑位置影响。利用该变换,我们引入了两种针对不同操作需求定制的框架。第一种框架采用深度学习启发的特征融合方法,检测率达99.3%,适用于需要高精度的场景,尽管计算需求较高。第二种框架采用基于位平面切片的新型图像处理算法,准确率为92.31%,针对需要快速推理且资源消耗最小的环境进行了优化。两种框架都为青光眼早期检测提供了可扩展且经济高效的解决方案。本研究强调了基于RNFL的诊断工具在应对全球青光眼挑战中的潜力,特别是在医疗资源匮乏的地区。

英文摘要

This work proposes an integrated pipeline for automatic glaucoma detection from readily available colour fundus images. It is based on an adaptive ellipse-based polar transformation algorithm that enhances the analysis of the Retinal Nerve Fiber Layer (RNFL), the primary biomarker for observing glaucomatous changes, regardless of optic disc and macula position. Utilizing this transformation, we introduce two distinct frameworks tailored to different operational needs. The first framework, a deep learning-inspired feature fusion approach, achieves a 99.3% detection rate, ideal for settings where high precision is essential, despite higher computational demands. The second framework employs a novel image-processing algorithm based on bit-plane slicing, offering 92.31% accuracy and optimized for environments requiring rapid inference with minimal resource consumption. Both frameworks provide scalable and cost-effective solutions for early glaucoma detection. This study highlights the potential of RNFL-based diagnostic tools in addressing the global challenge of glaucoma, particularly in underserved regions.
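
Bit-plane slicing itself is a standard operation and can be sketched as below for an 8-bit grayscale image. How the second framework uses the planes for classification is not described in the abstract, so this shows only the decomposition; the function names are hypothetical.

```python
def bit_planes(image, bits=8):
    """Split an integer image into binary planes.

    Returns a list where planes[k][i][j] is bit k of image[i][j],
    with k = 0 the least significant bit.
    """
    return [
        [[(pixel >> k) & 1 for pixel in row] for row in image]
        for k in range(bits)
    ]


def from_bit_planes(planes):
    """Recombine binary planes into the original integer image."""
    rows, cols = len(planes[0]), len(planes[0][0])
    return [
        [sum(planes[k][i][j] << k for k in range(len(planes))) for j in range(cols)]
        for i in range(rows)
    ]
```

The high-order planes carry the coarse intensity structure and the low-order planes mostly fine variation and noise, and each plane is computed with a shift and a mask, which is why the technique suits low-resource inference.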

2606.15802 2026-06-16 cs.CV 新提交

CPS4: Class Prompt driven Semi-Supervised Spine Segmentation with Class-specific Consistency Constraint

CPS4: 基于类别提示的半监督脊柱分割与类别特定一致性约束

Qingtao Pan, Hongzan Sun, Bing Ji, Shuo Li

Institutions * School of Control Science and Engineering, Shandong University; Department of Nuclear Medicine, Shengjing Hospital of China Medical University; Department of Computer and Data Science, Case Western Reserve University; Department of Biomedical Engineering, Case Western Reserve University

AI summary Proposes CPS4, the first semi-supervised spine segmentation network to use textual class prompts to improve pseudo-label quality, trained in two stages (VLM pretraining and semi-supervised segmentation) and reaching 80.44% Dice with only 5% labeled data.

详情
AI中文摘要

视觉语言模型(VLM)有潜力通过利用文本类别提示生成分割图来增强半监督脊柱分割中伪标签的质量,但尚未有人研究。尽管有前景,但缺乏明确的约束来确保脊柱类别提示与脊柱单元区域之间的一致性,导致多类别分割图生成性能不佳。本文提出CPS4,首个使用类别提示增强脊柱伪标签质量的文本引导半监督脊柱分割网络。具体地,CPS4通过两个训练阶段实现。(i) 类别特定一致性约束的VLM预训练阶段:我们提出token级和像素级注意力损失,以优化类别提示与脊柱单元之间的一致性,迫使文本类别提示在语义空间中与目标脊柱单元紧密耦合。(ii) 类别提示驱动的半监督脊柱分割阶段:使用预训练的视觉-文本编码器,我们为未标记的脊柱图像推导每个类别特定的二值分割图,并将它们整合为统一的多类别分割图,提高半监督脊柱分割网络生成的脊柱伪标签的质量。实验结果表明,我们的CPS4在公共脊柱分割数据集上仅使用5%的标注数据即实现了80.44%的Dice,超越了流行的半监督学习和VLM方法。我们的代码将公开。

英文摘要

Vision Language Models (VLMs) have great potential to enhance the quality of pseudo labels in semi-supervised spine segmentation by leveraging textual class prompts to generate segmentation maps, but this has not yet been studied. Although promising, such an approach lacks explicit constraints to ensure consistency between spine class prompts and spine unit regions, resulting in unsatisfactory performance in multi-class segmentation map generation. In this paper, we propose CPS4, the first text-guided semi-supervised spine segmentation network using class prompts to enhance the quality of spine pseudo labels. Specifically, CPS4 is implemented through two training stages. (i) Class-specific consistency constrained VLM pretraining stage: we propose token- and pixel-level attention losses to optimize the consistency between class prompts and spine units, forcing the textual class prompt to be closely coupled with the target spine unit in the semantic space. (ii) Class prompt driven semi-supervised spine segmentation stage: using the pretrained vision-text encoder, we derive each class-specific binary segmentation map for the unlabeled spine image and integrate them into a unified multi-class segmentation map, improving the quality of the spine pseudo label generated by the semi-supervised spine segmentation network. Experimental results show that CPS4 achieves a Dice of 80.44% using only 5% labeled data on the public spine segmentation dataset, surpassing popular semi-supervised learning and VLM methods. Our code will be available.
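
Integrating class-specific maps into a single multi-class map, as in stage (ii), can be sketched as follows. The abstract does not say how overlaps between classes are resolved, so this assumes each class provides a per-pixel probability, picks the most confident class, and falls back to background below a threshold; the function name and the threshold value are hypothetical.

```python
def merge_class_maps(class_probs, threshold=0.5):
    """Merge per-class probability maps into one label map.

    class_probs: list of flattened maps, one per foreground class
    (class k is at index k - 1). Label 0 is background.
    A pixel takes the class with the highest probability, provided that
    probability exceeds `threshold`; otherwise it stays background.
    """
    n_pixels = len(class_probs[0])
    labels = []
    for i in range(n_pixels):
        best_label, best_prob = 0, threshold
        for k, probs in enumerate(class_probs, start=1):
            if probs[i] > best_prob:
                best_label, best_prob = k, probs[i]
        labels.append(best_label)
    return labels
```

Resolving overlaps by confidence guarantees that every pixel receives exactly one label, which independently thresholded binary maps do not.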

2606.15837 2026-06-16 cs.CV cs.LG stat.ME stat.ML 新提交

Learning a Sampling-Free Variational DNN Plugin from Tiny Training Sets to Refine OOD Segmentation With Uncertainty Estimation

学习一种无采样的变分DNN插件,从微小训练集精炼OOD分割并估计不确定性

Jimut B. Pal, Suyash P. Awate

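Institutions * Centre for Machine Intelligence and Data Science (C-MInDS), Indian Institute of Technology (IIT) Bombay; Computer Science and Engineering (CSE) Department, Indian Institute of Technology (IIT) Bombay

AI summary Proposes VarDeepPCA, a lightweight variational DNN framework that learns the distribution of valid anatomical geometries from small in-distribution datasets, without target-domain data or pre-training. A reinterpretation of the softmax mapping enables sampling-free inference with uncertainty estimates, and the method significantly improves the anatomical plausibility and accuracy of OOD segmentations across 4 clinical applications.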

Comments Accepted at the Journal of Machine Learning for Biomedical Imaging

详情
AI中文摘要

深度神经网络(DNN)由于扫描仪和采集协议的变化,经常无法泛化到分布外(OOD)的医学图像。由于获取和标注新医学数据集的成本高昂,重新训练DNN模型以应对这些分布偏移通常不切实际。为了解决这个问题,我们引入了VarDeepPCA,一种新颖的轻量级变分DNN框架,旨在通过利用内在几何先验来恢复/精炼退化的分割图。与需要目标域数据或大量预训练的现有方法不同,我们的VarDeepPCA仅使用小的分布内(ID)数据集显式学习有效解剖几何的分布。理论上,我们的新颖变分学习框架利用对softmax映射的重新解释来隐式执行精确分布建模,从而实现计算高效、无采样的学习和推理。这也使VarDeepPCA能够为其恢复的分割图提供不确定性估计。我们在4种不同的临床应用上,使用14个公开可用的数据集,涉及心肌、神经视网膜边缘、前列腺和胎儿头部分割,对我们的框架进行了实证验证。与15种现有方法的比较表明,VarDeepPCA一致地恢复了现有方法在OOD数据上产生的分割图,以(i)显著提高几何的解剖合理性和分割的临床实用性,以及(ii)显著减少误差,而不需要比现有方法更多的训练数据。

英文摘要

Deep neural networks (DNNs) frequently fail to generalize to out-of-distribution (OOD) medical images because of variations in scanners and acquisition protocols. Retraining DNN models to address these distribution shifts is often impractical due to the high cost of acquiring and annotating new medical datasets. To address this, we introduce VarDeepPCA, a novel lightweight variational DNN framework designed to restore/refine degraded segmentation maps by leveraging intrinsic geometric priors. Unlike existing approaches that require target-domain data or extensive pre-training, our VarDeepPCA explicitly learns a distribution of valid anatomical geometries using only small in-distribution (ID) datasets. Theoretically, our novel variational learning framework leverages a reinterpretation of the softmax mapping to implicitly perform exact distribution modeling, thereby enabling computationally efficient, sampling-free learning and inference. This also enables VarDeepPCA to provide uncertainty estimates associated with its restored segmentation maps. We empirically validate our framework across 4 distinct clinical applications, using 14 publicly available datasets, involving segmentation of the myocardium, neuroretinal rim, prostate, and fetal head. Comparisons against 15 existing methods demonstrate that VarDeepPCA consistently restores segmentation maps produced by the existing methods on OOD data to (i) significantly improve anatomical plausibility of geometries and clinical utility of the segmentations, and (ii) significantly reduce errors, without needing any more training data than that used by existing methods.

2606.15861 2026-06-16 cs.CV 新提交

Object Tokens as a Bridge Between Segmentation and Visual Question Answering in Robotic Surgery

对象标记作为机器人手术中分割与视觉问答的桥梁

Yiping Li, Ronald de Jong, Romy van Jaarsveld, Franco Badaloni, Gino Kuiper, Jelle Ruurda, Josien Pluim, Marcel Breeuwer

发表机构 * Department of Biomedical Engineering, Eindhoven University of Technology(埃因霍温理工大学生物医学工程系) Department of Electrical Engineering, Eindhoven University of Technology(埃因霍温理工大学电气工程系) Department of Surgery, University Medical Center Utrecht(乌得勒支大学医学中心外科)

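AI summary: Proposes a unified framework for joint pixel-level segmentation and visual question answering, in which VLM-generated object tokens guide both answer prediction and segmentation masks; it outperforms baseline methods on the RAMIE and EndoVis18 datasets.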

AI中文摘要

机器人手术中的视觉问答(VQA),称为手术VQA,需要对复杂手术场景进行高级理解,并将视觉感知与语言推理相结合,具有支持手术培训和术中决策的潜力。最近的视觉-语言模型(VLM)通过参数高效微调显示出有希望的性能;然而,大多数现有方法依赖于粗粒度的视觉定位,通常仅限于边界框,这未能捕捉手术对象的细粒度空间结构。在这项工作中,我们提出了一个统一框架,在单个框架内联合执行像素级分割和视觉问答。我们的方法将VLM与基于Segment Anything Model(SAM)的解码器集成,并将场景元素表示为VLM生成的对象标记。这些对象标记指导答案预测,并进一步投影到基于SAM的解码器以产生分割掩码。通过分割和问答目标优化对象标记嵌入,模型学习空间基础表示,增强视觉推理,同时提供显式的像素级基础。我们在私有RAMIE(机器人辅助微创食管切除术)数据集和公共EndoVis18数据集上评估了所提出的方法,在手术VQA中始终优于基线方法。这些结果表明,将上下文感知的对象标记纳入视觉-语言模型可改善细粒度手术场景理解。

英文摘要

Visual Question Answering (VQA) in robotic surgery, referred to as surgical VQA, requires high-level understanding of complex surgical scenes and the integration of visual perception with language reasoning, with the potential to support surgical training and intraoperative decision-making. Recent Vision-Language Models (VLMs) have shown promising performance through parameter-efficient fine-tuning; however, most existing approaches rely on coarse visual grounding, typically limited to bounding boxes, which fails to capture the fine-grained spatial structure of surgical objects. In this work, we propose a unified framework that jointly performs pixel-level segmentation and visual question answering. Our approach integrates a VLM with a Segment Anything Model (SAM)-based decoder and represents scene elements as object tokens generated by the VLM. These object tokens guide answer prediction and are further projected to the SAM-based decoder to produce segmentation masks. By optimizing the object token embeddings through both segmentation and question answering objectives, the model learns spatially grounded representations that enhance visual reasoning while providing explicit pixel-level grounding. We evaluate the proposed method on the private RAMIE (Robot-Assisted Minimally Invasive Esophagectomy) dataset and the public EndoVis18 dataset, where it consistently outperforms baseline methods for surgical VQA. These results demonstrate that incorporating context-aware object tokens into vision-language models improves fine-grained surgical scene understanding.

2606.15938 2026-06-16 cs.CV cs.MM 新提交

Learning Directional Semantic Transitions for Longitudinal Chest X-ray Analysis

学习纵向胸部X光分析的方向性语义转变

Zhangfeng Hu, Zefan Yang, Ge Wang, Tanveer Syeda-Mahmood, Anushree Burade, Mannudeep Kalra, Pingkun Yan

发表机构 * Rensselaer Polytechnic Institute(伦斯勒理工学院) Stanford University(斯坦福大学) Massachusetts General Hospital, Harvard Medical School(麻省总医院,哈佛医学院)

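AI summary: Proposes ProTrans, a framework that models disease progression as a directional semantic transition between paired CXR studies. It uses radiology reports and a learnable progression feature map to explicitly encode semantic shifts, achieves direction awareness through reversed temporal modeling and bidirectional reconstruction consistency, and outperforms existing methods on longitudinal downstream tasks.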

Comments MICCAI 2026

AI中文摘要

胸部X光(CXR)解读通常需要纵向比较以评估疾病进展。现有方法通常依赖于时间特征融合或研究间差异建模,但在捕捉细微进展语义方面仍有限,且忽视了疾病轨迹固有的方向性。本文提出ProTrans,一种新颖的视觉-语言预训练框架,将疾病进展建模为配对CXR研究间的方向性语义转变。ProTrans利用放射学报告将单个CXR表示锚定到可解释的疾病状态,并引入可学习的进展特征图以显式编码状态间的语义转变,与报告导出的进展描述对齐。为强制方向感知,ProTrans结合了反向时间建模过程,并在状态和转变间施加双向重建一致性,从而解耦方向语义并促进连贯的轨迹建模。在纵向下游任务(包括疾病进展分类和进展描述)上的大量实验表明,ProTrans始终优于现有方法,为纵向CXR理解建立了统一的预训练框架。https://github.com/RPIDIAL/ProTrans

英文摘要

Chest X-ray (CXR) interpretation often requires longitudinal comparison to assess disease progression. Existing approaches typically rely on temporal feature fusion or inter-study discrepancy modeling, yet remain limited in capturing subtle progression semantics and overlook the inherently directional nature of disease trajectories. In this paper, we propose ProTrans, a novel vision-language pretraining framework that formulates disease progression as a directional semantic transition between paired CXR studies. ProTrans leverages radiology reports to anchor individual CXR representations within interpretable disease states, and introduces a learnable progression feature map to explicitly encode semantic shifts between states, aligned with report-derived progression descriptions. To enforce direction-aware perception, ProTrans incorporates a reversed temporal modeling process and imposes bidirectional reconstruction consistency across states and transitions, thereby disentangling directional semantics and promoting coherent trajectory modeling. Extensive experiments on longitudinal downstream tasks, including disease progression classification and progression captioning, demonstrate that ProTrans consistently outperforms existing methods, establishing a unified pretraining framework for longitudinal CXR understanding. https://github.com/RPIDIAL/ProTrans

2606.15967 2026-06-16 cs.CV 新提交

CRIS: Cross-Plane Self-Supervised Isotropic Restoration for Anisotropic Volumetric Imaging Across Modalities

CRIS:跨模态各向异性体积成像的跨平面自监督各向同性恢复

Adi Ahituv, Anat Ilivitzki, Moti Freiman

发表机构 * Faculty of Data and Decision Sciences, Technion -- Israel Institute of Technology(以色列理工学院数据与决策科学学院) Faculty of Biomedical Engineering, Technion -- Israel Institute of Technology(以色列理工学院生物医学工程学院) The May-Blum-Dahl MRI Research Center, Technion -- Israel Institute of Technology(以色列理工学院梅-布卢姆-达尔MRI研究中心)

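AI summary: Proposes CRIS, a cross-plane self-supervised framework that needs no paired isotropic ground truth and performs 3D isotropic restoration as 2D stripe completion on orthogonal reformats; it outperforms interpolation and several other methods on MRI and volume electron microscopy.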

Comments 22 pages, 8 figures, supplementary material included. Submitted to Medical Image Analysis

AI中文摘要

各向异性体积采集在临床MRI和体积电子显微镜(vEM)中很常见,其中稀疏的跨平面采样产生厚切片或截面,降低了正交重切和下游分析的质量。我们提出CRIS,一种跨平面自监督框架,无需配对各向同性真值即可实现各向同性恢复。CRIS将3D恢复视为各向同性网格正交重切上的2D条带补全:训练时,高分辨率面内切片被合成退化并周期性掩蔽;推理时,空白切片定义各向同性网格,恢复两个正交重切,并通过多视图平均融合预测。我们在两个MRI队列和两个显微镜基准上评估CRIS,各向异性高达8倍。在脑MRI上,CRIS达到32.921±0.436 dB PSNR和0.9631±0.0027 SSIM,优于插值、SMORE4、SIMPLE、SA-INR和ATME,并给出最佳分割一致性(Dice 0.940±0.004,ASSD 0.245±0.014 mm,HD99 1.275±0.061 mm)。在无参考腹部MRI上,CRIS将FID/KID降至48.714/0.023。在vEM上,CRIS优于插值、NIIV和vEMINR,在4倍时达到29.133 dB/0.834 3D PSNR/SSIM,在EPFL 8倍时达到27.123 dB/0.734,在噪声hemibrain数据上达到21.915 dB/0.699。在鲁棒性实验中,一个可变间隙CRIS模型在间隙因子3-7以及冠状、轴向和矢状退化下评估,保持比插值更高的PSNR/SSIM(36.36-31.14 dB和0.977-0.932对比33.07-27.85 dB和0.951-0.853)。这些结果支持CRIS作为一种模态灵活的途径,无需配对各向同性目标或特定配置的重新训练即可实现各向同性恢复。代码可在https://github.com/adi-hatav/CRIS获取。

英文摘要

Anisotropic volumetric acquisitions are common in clinical MRI and volume electron microscopy (vEM), where sparse through-plane sampling creates thick slices or sections that degrade orthogonal reformats and downstream analysis. We present CRIS, a cross-plane self-supervised framework for isotropic restoration without paired isotropic ground truth. CRIS casts 3D restoration as 2D stripe completion on orthogonal reformats of an isotropic grid: high-resolution in-plane slices are synthetically degraded and periodically masked for training, while at inference blank slices define the isotropic grid, two orthogonal reformats are restored, and predictions are fused by multi-view averaging. We evaluate CRIS on two MRI cohorts and two microscopy benchmarks up to 8× anisotropy. On brain MRI, CRIS achieves 32.921 ± 0.436 dB PSNR and 0.9631 ± 0.0027 SSIM, outperforming interpolation, SMORE4, SIMPLE, SA-INR, and ATME, and gives the best segmentation consistency (Dice 0.940 ± 0.004, ASSD 0.245 ± 0.014 mm, HD99 1.275 ± 0.061 mm). On reference-free abdominal MRI, CRIS reduces FID/KID to 48.714/0.023. On vEM, CRIS outperforms interpolation, NIIV, and vEMINR, reaching 29.133 dB/0.834 3D PSNR/SSIM at 4×, 27.123 dB/0.734 on EPFL at 8×, and 21.915 dB/0.699 on noisy hemibrain data. In a robustness experiment, one variable-gap CRIS model evaluated across gap factors 3-7 and coronal, axial, and sagittal degradations maintained higher PSNR/SSIM than interpolation (36.36-31.14 dB and 0.977-0.932 vs. 33.07-27.85 dB and 0.951-0.853). These results support CRIS as a modality-flexible route to isotropic restoration without paired isotropic targets or configuration-specific retraining. Code is available at https://github.com/adi-hatav/CRIS.
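
The periodic masking and multi-view fusion described in the abstract above can be pictured with a small sketch. It is only an illustration under stated assumptions: the names `stripe_mask`, `mask_rows`, and `fuse_views` are hypothetical, the mask keeps every `gap`-th row and blanks the rest with zeros, and fusion is a plain element-wise mean of two restored views. The actual degradation model and restoration network are not reproduced here.

```python
def stripe_mask(n_rows, gap):
    """True for rows that are kept; every `gap`-th row survives (assumed pattern)."""
    if gap < 1:
        raise ValueError("gap must be >= 1")
    return [r % gap == 0 for r in range(n_rows)]


def mask_rows(image, gap, fill=0.0):
    """Blank the rows that a thick-slice acquisition would not sample."""
    keep = stripe_mask(len(image), gap)
    return [list(row) if k else [fill] * len(row) for row, k in zip(image, keep)]


def fuse_views(view_a, view_b):
    """Multi-view averaging: element-wise mean of two restored 2D arrays."""
    return [[(a + b) / 2.0 for a, b in zip(ra, rb)] for ra, rb in zip(view_a, view_b)]


# Toy 7x3 "reformat": the value encodes row and column for easy inspection.
image = [[float(r * 10 + c) for c in range(3)] for r in range(7)]
masked = mask_rows(image, gap=3)   # rows 0, 3, 6 stay; rows 1, 2, 4, 5 are blanked
fused = fuse_views(image, masked)  # stand-in for averaging two restored views
```

In training, a network would see `masked` and be asked to recover `image`; at inference the blank rows are the unknown slices of the isotropic grid.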

2606.15976 2026-06-16 cs.CV 新提交

HadBalance: A Plug-and-Play Unified Global Geometric Prior Framework for Generalizable Biomedical Segmentation

HadBalance: 一种即插即用的统一全局几何先验框架,用于可泛化的生物医学分割

Zhuangzhi Gao, Feixiang Zhou, He Zhao, Wenhan Chen, Ruiyu Luo, Xin Wang, Hongyi Qin, Zhongli Wu, Yanda Meng, Yitian Zhao, Alena Shantsila, Gregory Y. H. Lip, Eduard Shantsila, Yalin Zheng

发表机构 * Department of Eye and Vision Science, University of Liverpool(利物浦大学眼与视觉科学系) Department of Primary Care and Mental Health, University of Liverpool(利物浦大学初级保健与精神健康系) School of Psychological and Cognitive Sciences, Peking University(北京大学心理与认知科学学院) Computer Vision Research Group, University of Amsterdam(阿姆斯特丹大学计算机视觉研究组) Institute of Life Course & Medical Sciences, University of Liverpool(利物浦大学生命历程与医学科学研究所) Bioengineering Program, Biomedical Sciences Division (BioMed), King Abdullah University of Science and Technology (KAUST)(阿卜杜拉国王科技大学生物医学科学部生物工程项目) Ningbo Cixi Institute of Biomedical Engineering, Chinese Academy of Sciences(中国科学院宁波慈溪生物医学工程研究所) Liverpool Centre for Cardiovascular Science, University of Liverpool(利物浦大学利物浦心血管科学中心)

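AI summary: Addresses the lack of unified, generalizable geometric priors in biomedical image segmentation by proposing a global near-convex shape prior derived from Hadwiger's theorem, combined with Conflict-Aware Objective Balancing, for plug-and-play segmentation across organs and modalities.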

Comments Provisionally accepted by the 29th International Conference on Medical Image Computing and Computer Assisted Intervention (MICCAI 2026). 11 pages, 3 figures, 2 tables

AI中文摘要

精确的生物医学图像分割对于临床诊断至关重要。几何线索(例如边界、形状和拓扑)可以改善结构一致性,但大多数是任务特定的,缺乏跨器官和模态泛化的统一几何基础。我们观察到,多个医学分割目标可以近似为全局近凸形状。凸区域是指其中任意两个内点都可以由完全包含在该区域内的线段连接的区域。在实践中,医学目标可能表现出小的局部凹陷或边界不规则性;我们将这种全局凸状形状称为近凸。受此启发,我们从Hadwiger定理推导出Hadwiger形状先验,作为可解释的全局正则化器,使用三个二维度量:面积A、周长P和欧拉示性数χ,从而实现跨器官和模态的迁移。然而,由于医学数据集是形状异质的,统一施加近凸先验可能会过度正则化具有显著凹陷的非凸解剖结构,从而冲淡凹陷和细节,降低分割精度。为解决这一挑战,我们提出冲突感知目标平衡(CAOB),它以梯度感知的方式将形状先验与分割相结合。对于每个先验,CAOB仅移除与分割冲突的梯度分量,同时保留剩余的对齐分量,并自适应地调节目标影响以防止先验主导。这使得能够在形状异质数据上稳定使用形状先验,而不会抹去真实的凹陷或精细结构细节。我们将这个即插即用框架称为HadBalance。

英文摘要

Precise biomedical image segmentation is crucial for clinical diagnosis. Geometric cues (e.g., boundary, shape, and topology) can improve structural consistency, yet most are task-specific and lack a unified geometric foundation that generalizes across organs and modalities. We are motivated by the observation that several medical segmentation targets can be approximated as globally near-convex shapes. A convex region is one in which any two interior points can be connected by a line segment entirely contained within the region. In practice, medical targets may exhibit small local concavities or boundary irregularities; we refer to such globally convex-like shapes as near-convex. Motivated by this, we derive Hadwiger Shape Priors from Hadwiger's theorem as an interpretable global regularizer using three 2D measures: area A, perimeter P, and Euler characteristic χ, enabling transfer across organs and modalities. However, because medical datasets are shape-heterogeneous, enforcing near-convex priors uniformly can over-regularize non-convex anatomy with significant concavities, washing out concavities and fine details and degrading segmentation accuracy. To address this challenge, we propose Conflict-Aware Objective Balancing (CAOB), which integrates shape priors with segmentation in a gradient-aware manner. For each prior, CAOB removes only the gradient component that conflicts with segmentation while preserving the remaining aligned component, and adaptively regulates objective influences to prevent prior dominance. This enables stable use of shape priors on shape-heterogeneous data without erasing genuine concavities or fine structural details. We call this plug-and-play framework HadBalance.
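
The idea of removing only the conflicting gradient component can be sketched as follows. The abstract does not give CAOB's formula, so this is an assumption: the projection below resembles the rule used in gradient-surgery methods such as PCGrad, and `remove_conflict` is a hypothetical name. The adaptive regulation of objective influence is not modeled.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def remove_conflict(g_prior, g_seg):
    """Return g_prior with the part that opposes g_seg projected out.

    If the two gradients already agree (non-negative inner product),
    the prior gradient is returned unchanged.
    """
    d = dot(g_prior, g_seg)
    if d >= 0:
        return list(g_prior)
    # d < 0 implies g_seg is non-zero, so the division is safe.
    scale = d / dot(g_seg, g_seg)
    return [p - scale * s for p, s in zip(g_prior, g_seg)]


g_seg = [1.0, 0.0]       # segmentation gradient (toy values)
g_prior = [-1.0, 1.0]    # shape-prior gradient that partly opposes it
kept = remove_conflict(g_prior, g_seg)   # [0.0, 1.0]
```

After the projection, `kept` is orthogonal to `g_seg`, so a step along it no longer works against the segmentation objective to first order, while the aligned part of the prior is preserved.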

2606.16036 2026-06-16 cs.CV 新提交

Trusting Right Predictions for Wrong Reasons: A LIME Based Analysis of Deep Learning Interpretability in Lung Cancer Diagnosis

信任错误理由的正确预测:基于LIME的肺癌诊断深度学习可解释性分析

Samarpan Poudel, Vladislav D Veksler

发表机构 * Caldwell University School of Business and Computer Science(考德威尔大学商业与计算机科学学院)

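AI summary: Uses LIME to analyze the decision consistency of three deep learning models (CNN, ResNet50, ViT) for lung cancer CT classification, finding that predictions agree closely while explanation regions differ substantially, so prediction agreement cannot stand in for reasoning consistency.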

AI中文摘要

肺癌是癌症相关死亡的主要原因,每年约有250万新发病例和180万死亡病例,使得可靠诊断成为临床优先事项。尽管深度学习模型在肺癌分类中取得了强劲性能,但评估主要集中于预测准确性,其决策过程尚未得到充分检验。本研究比较了三种架构不同的模型:卷积神经网络(CNN)、预训练ResNet50和视觉Transformer(ViT),均在IQ-OTH/NCCD肺癌CT数据集上训练。应用局部可解释模型无关解释(LIME)来研究模型推理。除了标准性能指标外,还引入了一个双相关框架来测量模型对之间的预测一致性和解释一致性。所有三个模型均取得了强劲的分类性能,ResNet50达到98.61%的准确率,CNN为97.91%,ViT为93.75%,同时所有模型的ROC-AUC得分均为0.99。所有模型对的预测相关性超过0.99,表明输出高度一致。然而,LIME解释相关性仍低于0.26,揭示了用于得出这些预测的图像区域存在实质性差异。对误分类样本的分析进一步识别出一致的空间模式:错误预测与肺实质外的注意力相关,而正确预测主要集中于肺区域内部。这些发现表明,预测一致性是推理一致性的一个糟糕代理,并且可解释性评估必须被视为临床AI系统中与预测性能并列的独立验证标准。

英文摘要

Lung cancer is the leading cause of cancer-related mortality, with approximately 2.5 million new cases and 1.8 million deaths annually, making reliable diagnosis a clinical priority. Although deep learning models have achieved strong performance in lung cancer classification, evaluation has largely focused on predictive accuracy, leaving their decision-making processes insufficiently examined. This study compares three architecturally distinct models: a Convolutional Neural Network (CNN), a pretrained ResNet50, and a Vision Transformer (ViT), trained on the IQ-OTH/NCCD lung cancer CT dataset. Local Interpretable Model-Agnostic Explanations (LIME) were applied to investigate model reasoning. In addition to standard performance metrics, a dual-correlation framework was introduced to measure both prediction agreement and explanation agreement across model pairs. All three models achieved strong classification performance, with ResNet50 attaining 98.61% accuracy, CNN 97.91%, and ViT 93.75%, while all achieved ROC-AUC scores of 0.99. Prediction correlations exceeded 0.99 across all model pairs, indicating highly consistent outputs. However, LIME explanation correlations remained below 0.26, revealing substantial differences in the image regions used to reach those predictions. Analysis of misclassified samples further identified a consistent spatial pattern: incorrect predictions were associated with attention outside the lung parenchyma, whereas correct predictions focused primarily within lung regions. These findings demonstrate that prediction agreement is a poor proxy for reasoning consistency, and that interpretability evaluation must be treated as an independent validation criterion alongside predictive performance in clinical AI systems.
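
A dual-correlation check of the kind described above can be sketched in a few lines. The abstract does not state which correlation coefficient is used, so Pearson correlation is an assumption here, and the toy vectors stand in for per-sample predicted probabilities and flattened LIME weight maps from two models.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(vx * vy)


# Hypothetical outputs of two models on the same four scans.
preds_a = [0.9, 0.1, 0.8, 0.2]
preds_b = [0.95, 0.05, 0.85, 0.15]
# Hypothetical LIME weights over the same four image regions.
expl_a = [1.0, 0.0, 0.0, 1.0]
expl_b = [0.0, 1.0, 0.0, 1.0]

prediction_agreement = pearson(preds_a, preds_b)
explanation_agreement = pearson(expl_a, expl_b)
```

The toy numbers are chosen so that the two models agree on what they predict but not on which regions drive the prediction, which is the pattern the study reports.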

2606.16153 2026-06-16 cs.CV cs.AI 新提交

A Comprehensive Survey of Medical Image Segmentation: Challenges, Benchmarks, and Beyond

医学图像分割综述:挑战、基准与未来展望

Pengyu Zhu, Xiaojing Zhang, Kunbo Zhang, Chunyan Zhang, Zhenyu Wang

发表机构 * School of Control and Computer Engineering, North China Electric Power University(华北电力大学控制与计算机工程学院) SPIC Digital Technology Co., Ltd(国家电投数字科技有限公司) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Department 6 of Health Care, Second Medical Center, People’s Liberation Army General Hospital(中国人民解放军总医院第二医学中心健康医学科六病区)

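AI summary: Systematically reviews medical image segmentation methods built on the U-Net, Transformer, and SAM architectures and analyzes the main challenges, aiming to guide future research and support clinical translation.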

Comments 12 pages, 3 figures, 1 table. All related resources are available at https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main

AI中文摘要

医学图像分割在临床诊断、治疗规划、疾病监测和神经系统疾病识别中发挥着关键作用。本文对其系统发展进行了全面综述,涵盖了广泛使用的公开数据集、基于U-Net、Transformer和SAM架构的代表性方法及其关键评估指标与差异,随后从多个角度分析了主要挑战。与专注于单一模型家族或特定临床应用的综述不同,本综述将基于U-Net、Transformer和SAM的方法组织在一个统一的分析框架内,特别关注它们在提高分割精度和效率方面的有效性。本工作旨在指导医学图像分割的未来研究并支持临床转化,所有相关资源均可在我们的GitHub仓库中公开获取:https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main。

英文摘要

Medical image segmentation plays a critical role in clinical diagnostics, treatment planning, disease monitoring, and neurological disorder identification. This article presents a comprehensive review of its systematic development, covering widely used public datasets, representative methods built on the U-Net, Transformer, and SAM architectures, and key evaluation metrics with their differences, followed by an analysis of major challenges from multiple perspectives. Unlike surveys that focus on a single model family or a specific clinical application, this review organizes U-Net-, Transformer-, and SAM-based methods within a unified analytical framework, with a particular focus on their effectiveness in improving segmentation accuracy and efficiency. This work aims to guide future research and support clinical translation of medical image segmentation, with all related resources publicly available in our GitHub repository: https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main.

2606.16180 2026-06-16 cs.CV cs.LG 新提交

To forget is to preserve: Machine Unlearning for 3D medical image segmentation

遗忘即保留:面向3D医学图像分割的机器遗忘

Nitesh Kumar Singh, Akhilesh Singh, Arjun Arora

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

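AI summary: Motivated by data privacy regulations, studies approximate unlearning strategies based on four mechanisms for 3D medical image segmentation; evaluation with the Dice coefficient and MAE shows that the Noisy Label strategy strikes the best balance between the forget and retain sets.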

Comments 9 pages, 5 figures

AI中文摘要

随着新的数据隐私法规(如GDPR [1])允许个人要求从训练好的机器学习模型中删除其任何个人信息,人们开始推动研究从模型中遗忘数据以遵守这些法律。在这方面,基于四种机制,我们考虑了几种应用于MRBrainS18数据集 [2] 的近似遗忘策略。我们使用3D ResNet-50 [3] 作为分割的骨干架构,该架构已通过Med3D框架 [4] 进行预训练。以预训练模型为基线,我们评估了在两类主体(即保留和遗忘)上的相应保留准确率。我们通过Dice相似系数和平均绝对误差(MAE)值评估这些方法,使用两个独立的训练周期(20和50个epoch)。结果表明,噪声标签策略具有最佳的整体权衡,在50个epoch后,遗忘集准确率下降93%,同时保留集准确率保持84%。所有其他策略在更高的epoch数下表现出极端的遗忘水平,同时其保留集性能也出现灾难性退化。本研究结果为在主体特定水平上的遗忘提供了严格的性能指标基线,并为从业者选择适当策略提供了明确标准。

英文摘要

With new data privacy laws such as the General Data Protection Regulation (GDPR) [1], which allow individuals to request that their personal information be erased from trained machine learning models, there has been a push to investigate unlearning data from models as a way to comply. In this regard, we consider several approximate unlearning strategies, based on four mechanisms, applied to the MRBrainS18 dataset [2]. We use a 3D ResNet-50 [3] pre-trained with the Med3D framework [4] as the backbone architecture for segmentation. Taking the pre-trained model as a baseline, we evaluate accuracy on two groups of subjects: retain and forget. We assess these approaches by their Dice similarity coefficient and mean absolute error (MAE) over two training horizons, 20 and 50 epochs. The results show that the Noisy Label strategy offers the best overall trade-off, with a 93% decrease in forget-set accuracy while maintaining 84% accuracy on the retain set after 50 epochs. All other strategies showed extreme forgetting at higher epoch counts together with catastrophic degradation of retain-set performance. These results provide a rigorous baseline of performance metrics for subject-level unlearning and give practitioners clear criteria for selecting an appropriate strategy.
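
For readers unfamiliar with the two metrics named above, a minimal sketch follows. It assumes binary masks flattened to lists of 0/1 and a voxel-wise MAE; how the paper aggregates over classes and subjects is not stated in the abstract, and the function names are illustrative.

```python
def dice(pred, target):
    """Dice similarity coefficient of two binary masks given as 0/1 lists."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    if total == 0:
        return 1.0  # both masks empty: treated as perfect agreement (a convention)
    return 2.0 * inter / total


def mae(pred, target):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


pred = [1, 1, 0, 0]
target = [1, 0, 0, 0]
d = dice(pred, target)   # 2 * 1 / (2 + 1)
e = mae(pred, target)    # one of four voxels differs
```

In an unlearning study, the same functions would be evaluated separately on forget subjects, where a low Dice is desired, and on retain subjects, where a high Dice is desired.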

2606.16212 2026-06-16 cs.CV cs.AI 新提交

LUCID: Learned Undersampling-Adaptive Consistency-Guided Inference with Deterministic Flow Matching for Sparse-View CT Reconstruction

LUCID:基于确定性流匹配的学习型欠采样自适应一致性引导稀疏视角CT重建

Jigang Duan, Jiayi Wang, Heran Wang, Ping Yang, Genwei Ma, Xing Zhao

发表机构 * School of Mathematical Sciences, Capital Normal University(首都师范大学数学科学学院) National Center for Applied Mathematics Beijing, Capital Normal University(首都师范大学北京国家应用数学中心) Academy for Multidisciplinary Studies, Capital Normal University(首都师范大学交叉科学研究院)

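AI summary: Proposes LUCID, a framework that combines a Flow Matching generative prior with a sparsity-adaptive strategy; a degradation-matched initial state and projection-domain consistency correction yield stable sparse-view CT reconstruction across sampling densities, with fewer artifacts and hallucinated structures.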

AI中文摘要

稀疏视角CT通过获取更少的投影视图来减少辐射剂量和扫描时间,但角度欠采样使得重建严重病态,导致条纹伪影、结构模糊和细节丢失。现有的监督方法通常受限于特定的采样设置,而生成方法在严重欠采样下可能引入解剖上不一致的幻觉样结构。我们提出Lucid,一种基于流匹配生成先验的稀疏自适应、一致性引导重建框架,用于稀疏视角CT。Lucid仅在高品质CT图像上训练,学习高斯分布与高品质CT图像分布之间的连续传输,与视角采样无关。在推理过程中,显式纳入采样稀疏度水平,以调整单个预训练模型的生成轨迹。具体地,Lucid通过稀疏度加权融合稀疏视角FBP图像和高斯噪声构建退化匹配的初始状态,执行稀疏度调制的流匹配更新,并在每次先验更新后应用投影域数据一致性校正。在多种稀疏视角设置下的实验表明,Lucid在不同采样密度下实现稳定的重建性能,提高图像质量和结构保真度,并降低生成式稀疏视角CT重建中幻觉样结构的风险。

英文摘要

Sparse-view CT reduces radiation dose and scanning time by acquiring fewer projection views, but angular undersampling makes reconstruction severely ill-posed, causing streak artifacts, structural blurring, and loss of fine details. Existing supervised methods are often tied to specific sampling settings, whereas generative methods may introduce anatomically inconsistent hallucination-like structures under severe undersampling. We propose Lucid, a sparsity-adaptive, consistency-guided reconstruction framework based on a Flow Matching generative prior for sparse-view CT. Lucid is trained only on high-quality CT images to learn a continuous transport between a Gaussian distribution and the high-quality CT image distribution, independent of view sampling. During inference, the sampling sparsity level is explicitly incorporated to adapt the generative trajectory of a single pretrained model. Specifically, Lucid constructs a degradation-matched initial state by sparsity-weighted fusion of the sparse-view FBP image and Gaussian noise, performs sparsity-modulated Flow Matching updates, and applies projection-domain data-consistency correction after each prior update. Experiments under multiple sparse-view settings show that Lucid achieves stable reconstruction performance across different sampling densities, improves image quality and structural fidelity, and reduces the risk of hallucination-like structures in generative sparse-view CT reconstruction.
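
The three inference steps listed in the abstract above (fused initial state, flow updates, consistency correction) can be arranged into a toy loop. Everything specific here is an assumption made for illustration: the linear fusion weight, the start time `t0 = 1 - sparsity`, plain Euler integration, and the hypothetical callables `velocity` and `consistency`, which stand in for the trained Flow Matching network and the projection-domain correction.

```python
import random


def fused_init(fbp, sparsity, rng):
    """Blend the FBP image with Gaussian noise; more sparsity means more noise (assumed)."""
    return [(1.0 - sparsity) * f + sparsity * rng.gauss(0.0, 1.0) for f in fbp]


def reconstruct(fbp, sparsity, velocity, consistency, steps=10, seed=0):
    """Euler integration from an intermediate time t0 to t = 1."""
    if not 0.0 < sparsity <= 1.0:
        raise ValueError("sparsity must be in (0, 1]")
    rng = random.Random(seed)
    x = fused_init(fbp, sparsity, rng)
    t0 = 1.0 - sparsity            # assumed: denser sampling starts closer to the data end
    dt = (1.0 - t0) / steps
    for k in range(steps):
        t = t0 + k * dt
        v = velocity(x, t)                            # prior update direction
        x = [xi + dt * vi for xi, vi in zip(x, v)]
        x = consistency(x)                            # data-consistency correction
    return x


# Toy stand-ins: a straight-line velocity field toward a known image, no correction.
target = [1.0, 2.0, 3.0]
toy_velocity = lambda x, t: [(g - xi) / (1.0 - t) for g, xi in zip(target, x)]
identity = lambda x: x
recon = reconstruct([0.8, 2.3, 2.9], sparsity=0.5,
                    velocity=toy_velocity, consistency=identity)
```

With the toy velocity field the loop lands on `target`, which only checks the plumbing; it says nothing about reconstruction quality with a real learned prior.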

2606.16234 2026-06-16 cs.CV cs.AI 新提交

Propagating Structural Guidance: Synthesizing Fluorescein Angiography from Fundus Images and Sparse OCT Scans

传播结构引导:从眼底图像和稀疏OCT扫描合成荧光素血管造影

Tengfei Ma, Ruiqi Wu, Chenran Zhang, Ye Geng, Na Su, Xiangyuan Duanmu, Tao Zhou, Yi Zhou, Wen Fan

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education(教育部新一代人工智能技术及其跨学科应用重点实验室) Tianyuan Honors School, Nanjing Medical University(南京医科大学天元荣誉学院) Nanjing University of Science and Technology(南京理工大学) Department of Ophthalmology, The First Affiliated Hospital of Nanjing Medical University(南京医科大学第一附属医院眼科)

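AI summary: Proposes a framework that synthesizes fluorescein angiography (FFA) from color fundus photographs (CFP) and sparse OCT scans, using spatially aligned cross-modal fusion and token-level contrastive learning to enable non-invasive FFA synthesis and improve downstream diagnostic performance.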

Comments Accepted to MICCAI 2026 (Early Accept)

AI中文摘要

眼底荧光素血管造影(FFA)对于评估视网膜血管异常至关重要,但其获取具有侵入性且并非总是可行。相比之下,彩色眼底摄影(CFP)无创且广泛可用,这推动了CFP到FFA合成的研究。然而,先前的工作仅依赖CFP表面纹理,从根本上限制了重建功能性血管信息和细微病理变化的能力。为了解决这个问题,我们提出了一种新颖的框架,该框架利用光学相干断层扫描(OCT)提供的结构引导,从CFP合成FFA。我们构建了一个包含来自3,676只患者眼睛的配对CFP、FFA和OCT的多模态视网膜成像数据集——这是视网膜成像中首个三模态对齐数据集。为了弥合OCT和眼底模态之间的空间差距,我们提出了空间对齐跨模态融合(SACMF)模块,该模块将深度分辨的OCT特征投影到眼底平面,并通过自适应层归一化将其注入CFP编码器。除了特征融合,我们还引入了令牌级跨模态对齐(TCMA),这是一种令牌级对比学习策略,在对应空间位置显式对齐CFP和FFA表示。我们的方法相比最先进的方法实现了更优的合成性能。此外,大量实验表明,我们方法合成的FFA图像在提升下游疾病诊断性能方面比现有方法带来更大的改进,突显了我们的方法作为常规工作流程中无创决策支持工具的临床潜力。代码可在https://github.com/while-plus/OCT-guide-FFA-Syn获取。

英文摘要

Fundus fluorescein angiography (FFA) is critical for assessing retinal vascular abnormalities, but its acquisition is invasive and not always feasible. In contrast, color fundus photography (CFP) is non-invasive and widely accessible, which has motivated studies on CFP-to-FFA synthesis. However, prior works rely solely on CFP surface texture, fundamentally limiting the ability to reconstruct functional vascular information and subtle pathological changes. To address this, we propose a novel framework that synthesizes FFA from CFP with structural guidance provided by optical coherence tomography (OCT). We construct a multi-modal retinal imaging dataset with paired CFP, FFA, and OCT from 3,676 patient eyes, the first tri-modally aligned dataset in retinal imaging. To bridge the spatial gap between OCT and fundus modalities, we propose a Spatially Aligned Cross-Modal Fusion (SACMF) module that projects depth-resolved OCT features onto the fundus plane and injects them into the CFP encoder via adaptive layer normalization. Beyond feature fusion, we further introduce Token-wise Cross-Modality Alignment (TCMA), a token-level contrastive learning strategy that explicitly aligns CFP and FFA representations at corresponding spatial positions. Our method achieves superior synthesis performance compared to state-of-the-art methods. Moreover, extensive experiments demonstrate that the FFA images synthesized by our approach bring greater improvements in downstream disease diagnosis performance than existing methods, highlighting the clinical potential of our approach as a non-invasive decision-support tool in routine workflows. The code is available at https://github.com/while-plus/OCT-guide-FFA-Syn.

2606.16294 2026-06-16 cs.CV q-bio.NC 新提交

Sex-based Network-Specific Differences in Connectomes: A Krakencoder-Based Analysis

基于性别的连接组网络特异性差异:基于Krakencoder的分析

Vibhashree S H, Debanjali Bhattacharya, Vamshi Krishna Kancharla, Neelam Sinha

发表机构 * Centre for Brain Research, Indian Institute of Science(印度科学研究所大脑研究中心) Dept. of Artificial Intelligence, Amrita School of Artificial Intelligence, Amrita Vishwa Vidyapeetham(阿姆里塔大学阿姆里塔人工智能学院人工智能系)

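AI summary: Uses the Krakencoder framework to simulate how deficiencies propagate between brain connectome modalities, analyzing structural and functional connectomes from 702 healthy participants; the Default Mode Network produces the largest perturbations and the Somatomotor network the smallest, and full predicted connectomes retain more sex-discriminative information.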

AI中文摘要

本研究使用Krakencoder作为模拟框架,探讨一个脑连接组模态的缺陷如何传播到另一个模态。分析了人类连接组项目中702名健康被试的结构和功能连接组,并分别评估了每个Yeo-7功能网络的影响。考虑了七种场景,每种场景涉及移除单个网络,同时保留其余网络。使用三种互补指标量化跨模态预测中的扰动:特征值谱上的KL散度、Frobenius范数和Wasserstein距离。此外,评估了预测连接组中性别特异性信息的持久性。在所有指标和两个预测方向上,默认模式网络产生的扰动最大,而感觉运动网络产生的扰动最小。网络级扰动特征的性别差异细微,最佳结果是在网络移除条件下预测的连接组达到66.09%的准确率。相比之下,从完整输入预测的连接组实现了更高的性别分类准确率,最高达84.76%。这些发现证实,完整的预测连接组比仅基于扰动的特征保留了显著更多的性别判别信息。

英文摘要

This study examines how deficiencies in one brain connectome modality propagate to the other, using the Krakencoder as a simulation framework. Structural and functional connectomes from 702 healthy participants in the Human Connectome Project were analyzed, with the impact of each of the Yeo-7 functional networks assessed separately. Seven scenarios were considered, each involving the removal of a single network while the remaining networks were preserved. The resulting perturbations in cross-modal predictions were quantified using three complementary metrics: KL divergence on eigenvalue spectra, Frobenius norm, and Wasserstein distance. In addition, the persistence of sex-specific information within the predicted connectomes was evaluated. Across all metrics and both prediction directions, the Default Mode Network produced the largest perturbations, whereas the Somatomotor network yielded the smallest. Sex differences in network-level perturbation signatures were subtle, with the best result being an accuracy of 66.09% from connectomes predicted under network-removal conditions. In contrast, connectomes predicted from intact inputs achieved substantially higher sex classification accuracy, reaching up to 84.76%. These findings confirm that full predicted connectomes retain considerably more sex-discriminative information than perturbation-derived signatures alone.
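
The three perturbation metrics named in the abstract above have simple textbook forms, sketched below. The details are assumptions: spectra are passed in as precomputed, strictly positive eigenvalue lists and normalized to sum to one before the KL divergence, and the Wasserstein distance is the one-dimensional first-order distance between two equal-size samples. The paper's exact preprocessing is not given in the abstract.

```python
import math


def frobenius(a, b):
    """Frobenius norm of the difference between two matrices (lists of rows)."""
    return math.sqrt(sum((x - y) ** 2
                         for ra, rb in zip(a, b)
                         for x, y in zip(ra, rb)))


def kl_divergence(spec_p, spec_q):
    """KL(p || q) between two eigenvalue spectra normalized to sum to one."""
    if min(spec_p) <= 0 or min(spec_q) <= 0:
        raise ValueError("spectra must be strictly positive")
    sp, sq = sum(spec_p), sum(spec_q)
    return sum((p / sp) * math.log((p / sp) / (q / sq))
               for p, q in zip(spec_p, spec_q))


def wasserstein_1d(xs, ys):
    """First-order Wasserstein distance between two equal-size 1D samples."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    return sum(abs(x - y) for x, y in zip(sorted(xs), sorted(ys))) / len(xs)


intact = [[1.0, 2.0], [3.0, 4.0]]      # toy "predicted connectome"
perturbed = [[1.0, 2.0], [3.0, 2.0]]   # same prediction after removing a network
delta = frobenius(intact, perturbed)
```

Each metric compares a prediction made from intact input with one made after a network is removed; larger values mean that the removed network mattered more to the cross-modal prediction.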

2606.16325 2026-06-16 cs.CV 新提交

Attention-Based Prototype Calibration for Multi-Rater Few-Shot Medical Image Segmentation

基于注意力机制的原型校准用于多评估者少样本医学图像分割

Truong Vu, Minh Khoi Ho, Yutong Xie

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

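AI summary: Proposes an attention-based prototype calibration framework that models rater-specific deviations to produce personalized segmentations without modifying the backbone, improving multi-rater few-shot segmentation performance.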

Comments MICCAI 2026 main track

AI中文摘要

少样本医学图像分割方法通常假设单一真实标注,忽略了临床数据集中常见的不同专家评估者之间的系统性差异。我们提出了一种基于注意力机制的原型校准框架,用于少样本多评估者分割,该框架在原型空间中建模评估者相对于共识表示的特定偏差。一个轻量级且原理性的注意力算子直接优化评估者原型,而不修改骨干特征提取器,使得该方法与现有的基于原型的少样本分割方法完全兼容。这种设计在保持语义一致性的同时,以最小的计算开销实现个性化分割输出。在多评估者医学影像数据集上的实验表明,与基线原型方法相比,该方法持续改进,突出了结构化原型校准在建模标注变异性方面的有效性。我们的代码可在 https://github.com/truong2710-cyber/JAPC 获取。

英文摘要

Few-shot medical image segmentation methods typically assume a single ground-truth annotation, overlooking systematic variability across expert raters commonly observed in clinical datasets. We propose an attention-based prototype calibration framework for few-shot multi-rater segmentation that models rater-specific deviations from a consensus representation in prototype space. A lightweight yet principled attention operator directly refines rater prototypes without modifying the backbone feature extractor, making the approach fully compatible with existing prototype-based few-shot segmentation methods. This design preserves semantic consistency while enabling personalized segmentation outputs with minimal computational overhead. Experiments on multi-rater medical imaging datasets demonstrate consistent improvements over baseline prototype approaches, highlighting the effectiveness of structured prototype calibration for modeling annotation variability. Our code is available at https://github.com/truong2710-cyber/JAPC.

2606.16421 2026-06-16 cs.CV 新提交

Beer-Lambert Guided Representation Learning for Unsupervised Anomaly Detection in Sub-THz Food Inspection Images

比尔-朗伯引导的表示学习用于亚毫米波食品检测图像中的无监督异常检测

Gyutae Hwang, Sang Jun Lee

发表机构 * Division of Electronics and Information Engineering, Jeonbuk National University(全北国立大学电子与信息工程学部)

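AI summary: Proposes a Beer-Lambert guided representation learning framework in which an attenuation decomposition module constrains student representations for unsupervised anomaly detection in Sub-THz food inspection images, and introduces a Leave-One-Food-Out protocol to evaluate generalization.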

Comments 6 pages, 3 figures

AI中文摘要

食品制造需要可靠的检测系统来检测异物污染并维护产品安全。亚毫米波透射成像提供了依赖于材料的衰减特性,有助于检测食品中的低密度污染物。然而,现有的无监督异常检测方法主要依赖于RGB预训练的视觉表示,这可能无法充分捕捉亚毫米波图像的透射行为。本文提出了一种比尔-朗伯引导的表示学习框架,用于亚毫米波食品检测图像中的无监督异常检测。该方法引入了一个衰减分解模块作为辅助正则化模块,在训练过程中通过衰减重建来约束学生表示。除了传统的单类设置外,我们还引入了一种留一食品协议,以评估在未见食品类别下的泛化能力。在Inline-Food-Inspection-THz数据集上的实验结果表明,所提出的方法在整体异常检测性能上优于基线方法。

英文摘要

Food manufacturing requires reliable inspection systems to detect foreign material contamination and maintain product safety. Sub-THz transmission imaging provides material-dependent attenuation characteristics that are useful for detecting low-density contaminants in food products. However, existing unsupervised anomaly detection methods mainly rely on RGB-pretrained visual representations, which may not adequately capture the transmission behavior of Sub-THz images. This paper proposes a Beer-Lambert guided representation learning framework for unsupervised anomaly detection in Sub-THz food inspection images. The proposed method introduces an attenuation decomposition module as an auxiliary regularization module that constrains student representations through attenuation reconstruction during training. In addition to the conventional one-class setting, we introduce a Leave-One-Food-Out protocol to evaluate generalization capability under unseen food categories. Experimental results on the Inline-Food-Inspection-THz dataset show that the proposed method improves overall anomaly detection performance over the baseline method.
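
The physical law behind the method's name is compact enough to show directly. The sketch below covers only the Beer-Lambert relation I = I0 * exp(-mu * d) and its inversion. The paper's attenuation decomposition module is not described in enough detail in the abstract to reproduce, and the function names and units (mu in 1/mm, d in mm) are illustrative choices.

```python
import math


def transmitted(i0, mu, d):
    """Intensity after passing through thickness d with attenuation coefficient mu."""
    return i0 * math.exp(-mu * d)


def transmitted_layers(i0, layers):
    """Stacked materials: the exponents add, layer by layer. layers = [(mu, d), ...]."""
    return i0 * math.exp(-sum(mu * d for mu, d in layers))


def attenuation(i0, i, d):
    """Invert the law: recover mu from incident and measured intensity."""
    return -math.log(i / i0) / d


measured = transmitted(1.0, mu=0.7, d=2.0)
recovered_mu = attenuation(1.0, measured, d=2.0)            # about 0.7
stacked = transmitted_layers(1.0, [(0.5, 1.0), (0.25, 2.0)])  # exp(-1.0)
```

A contaminant with a different attenuation coefficient changes the exponent along its ray path, which is why transmission images carry material-dependent contrast.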

2606.16477 2026-06-16 cs.CV 新提交

AURA: Active-Response Attribution under Treatment Ambiguity in Bacterial Cytological Profiling

AURA: 细菌细胞学分析中治疗模糊性下的主动响应归因

Kartik Jhawar, Mrunmayee Deshpande, Wilfried Moreira, Guillermo C. Bazan, Lipo Wang

发表机构 * Nanyang Technological University(南洋理工大学) Institute of High Performance Computing, A*STAR(新加坡科技研究局高性能计算研究所) University of California, Santa Barbara(加州大学圣塔芭芭拉分校)

AI总结 针对抗生素组合中仅部分药物实际作用的问题,提出基于能量的约束逆归因方法AURA,通过分解残余形态并选择重构能量最低的子集,在跨重复实验中达到95.47%的精确匹配准确率。

详情
AI中文摘要

当细菌样本暴露于多种抗生素时,并非每种施加的药物都必然起作用:如果细菌对其中一种药物耐药,则该药物不会留下形态学痕迹。因此,临床上有意义的量不是施加了哪些抗生素,而是哪些抗生素是活跃的。我们表明,在真实的大肠杆菌显微图像中,这两者严重脱钩(简单地假设施加的组合等于活跃组合,正确率仅约37%),然而现有的计算工具不适合恢复活跃集。前向扰动模型如scGen、CPA和IMPA旨在从处理预测外观,而非反向,并且反转它们会严重退化;判别式图像分类器倾向于记忆菌株和批次特定的纹理,并且无法跨实验重复迁移。我们引入AURA,它将任务重新定义为基于能量的约束逆归因。其核心归纳偏置是活跃集必须是施加集的子集;这压缩了候选空间,并让AURA通过将残余形态分解为抗生素响应原子并选择重构能量最低的子集来推断施加抗生素中的活跃子集,测试时不使用菌株标签。AURA-E添加了证据感知的弃权,当候选解释仍然近乎同等合理时暂不给出预测。在大肠杆菌细胞学分析数据集的跨重复迁移中,AURA以95.47%的精确匹配准确率恢复活跃抗生素组合。

英文摘要

When a bacterial sample is exposed to several antibiotics, not every applied drug necessarily acts: if the organism is resistant to one of them, that drug leaves no morphological trace. The clinically meaningful quantity is therefore not which antibiotics were applied, but which ones were active. We show that these two are sharply decoupled in real E. coli microscopy - naively assuming the applied combination equals the active one is correct only about 37% of the time - yet existing computational tools are ill-suited to recovering the active set. Forward perturbation models such as scGen, CPA, and IMPA are designed to predict appearance from treatment, not the reverse, and inverting them degrades sharply; discriminative image classifiers tend to memorise strain- and batch-specific texture and fail to transfer across experimental replicates. We introduce AURA, which reframes the task as constrained, energy-based inverse attribution. Its central inductive bias is that the active set must be a subset of the applied set; this collapses the candidate space and lets AURA infer the active subset of applied antibiotics by decomposing residual morphology into antibiotic response atoms and selecting the subset with the lowest reconstruction energy, using no strain label at test time. AURA-E adds evidence-aware abstention, withholding a prediction when candidate explanations remain near-equally plausible. On cross-replicate transfer in an E. coli cytological profiling dataset, AURA recovers the active antibiotic combination with 95.47% exact-match accuracy.
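The subset constraint described above can be sketched as an exhaustive search over subsets of the applied set. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes response atoms combine additively, uses squared error as the "energy", and models abstention as a simple margin between the two best candidates. The names `atoms`, `abstain_margin`, and the toy vectors are hypothetical.

```python
from itertools import combinations


def reconstruction_energy(residual, atoms, subset):
    """Squared error between the residual and the sum of the chosen atoms."""
    recon = [sum(atoms[d][k] for d in subset) for k in range(len(residual))]
    return sum((r - x) ** 2 for r, x in zip(residual, recon))


def infer_active_subset(residual, atoms, applied, abstain_margin=0.0):
    """Score every subset of `applied`; return the best one, or None to abstain."""
    candidates = [
        c
        for size in range(len(applied) + 1)
        for c in combinations(sorted(applied), size)
    ]
    scored = sorted(
        ((reconstruction_energy(residual, atoms, c), c) for c in candidates),
        key=lambda pair: pair[0],
    )
    best_energy, best = scored[0]
    # Abstain when the runner-up explains the data almost as well (assumed rule).
    if len(scored) > 1 and scored[1][0] - best_energy < abstain_margin:
        return None
    return set(best)


# Toy example: drug "C" has an atom but was not applied, so it is never considered.
atoms = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
residual = [0.9, 0.1]
active = infer_active_subset(residual, atoms, applied={"A", "B"})
```

Restricting candidates to subsets of the applied set keeps the search at 2^n options for n applied drugs, which is what makes exhaustive scoring practical for small combinations.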

2606.16484 2026-06-16 cs.CV cs.AI cs.MM 新提交

Unified Multimodal Model for Brain MRI Imputation and Understanding

统一多模态模型用于脑MRI补全与理解

Zhiyun Song, Che Liu, Tian Xia, Avinash Kori, Wenjia Bai

发表机构 * Department of Computing, Imperial College London(伦敦帝国理工学院计算机系) Department of Brain Sciences, Imperial College London(伦敦帝国理工学院脑科学系)

AI总结 提出UniBrain模型,通过统一训练策略联合处理脑MRI模态补全与图像理解,采用自对齐和动态隐藏状态机制,在多疾病数据集上实现高性能。

Comments Early accepted to MICCAI 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在医学领域具有巨大潜力,因为它们继承了LLM的知识,并允许以自然语言集成、分析和解释多种数据模态。然而,医学MLLMs面临重大挑战,特别是高质量训练数据的稀缺以及现实临床环境中数据缺失的频繁发生。在此,我们提出了一种新颖的统一多模态模型UniBrain,用于脑磁共振图像(MRI)分析。为了解决潜在的脑MRI模态缺失问题,我们采用统一训练策略进行联合成像模态补全和脑图像理解。在训练过程中,构建了交错且描述丰富的数据流,以自回归方式训练模型,从而实现基于生成的多模态数据的医学推理。引入自对齐策略,利用密集图像嵌入学习细粒度解剖特征,无需详细的图像描述。此外,我们提出了一种动态隐藏状态机制,以缓解长上下文多模态推理中的暴露偏差。在多疾病脑MRI数据集上的大量实验表明,UniBrain在模态不完全的各种情况下,在脑图像补全、理解和疾病诊断方面均取得了高性能。

英文摘要

Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLMs and allow multiple data modalities to be integrated, analysed and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in real-world clinical settings. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance image (MRI) analysis. To address potential missing brain MRI modalities, we employ a unified training strategy to perform joint imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate exposure bias during long-context multimodal inference. Extensive experiments on a multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under various extents of modality incompleteness.

2606.16573 2026-06-16 cs.CV 新提交

Transformation-driven generation of comparable projection images from multimodal anatomical scenes

从多模态解剖场景生成可比较投影图像的变换驱动方法

Dariusz Pojda, Krzysztof Domino, Michał Tarnawski, Agnieszka Anna Tomaka

发表机构 * Institute of Theoretical and Applied Informatics, Polish Academy of Sciences(波兰科学院理论与应用信息学研究所)

AI总结 提出变换驱动框架,从多模态解剖数据生成可重复的投影空间观测,通过下颌运动场景验证,实现不同解剖配置下直接可比的虚拟X光投影生成。

Comments 36 pages, 11 figures

详情
AI中文摘要

本工作解决了从异质解剖场景生成可重复投影空间观测的计算问题,其中组件可能经历独立的空间变换。我们提出了一种变换驱动框架,用于从多模态解剖数据生成合成投影图像,并在下颌运动场景中进行了演示。与主要为配准、投影真实感或渲染效率设计的传统数字重建放射影像(DRR)方法不同,所提出的公式将投影成像视为对显式表示的解剖场景进行观测的过程。独立可变换的基于体积和表面的解剖对象嵌入到共享场景表示中,并通过显式变换直接传播到投影空间。投影几何、采集建模、材料解释和图像呈现保持显式分离,从而能够在保持可重复性和生成投影之间直接可比性的同时,对方法假设进行可控探索。特别强调了与颅面分析相关的变换驱动解剖场景,包括下颌运动和治疗性复位。使用由CT/CBCT体积、分割结构、表面模型以及辅助解剖或治疗对象组成的共享解剖参考场景,该框架能够在保持相同成像假设的同时,从多种解剖配置生成直接可比的VirtualRTG投影。该方法并非旨在实现完全物理逼真的放射模拟,而是为研究解剖-投影关系、运动可观测性和变换感知成像工作流提供可控且可重复的方法学环境。

英文摘要

This work addresses the computational problem of generating reproducible projection-space observations from heterogeneous anatomical scenes whose components may undergo independent spatial transformations. We propose a transformation-driven framework for synthetic projection imaging from multimodal anatomical data and demonstrate it on mandibular-motion scenarios. In contrast to conventional Digitally Reconstructed Radiograph (DRR) approaches primarily designed for registration, projection realism, or rendering efficiency, the proposed formulation treats projection imaging as an observation process operating on an explicitly represented anatomical scene. Independently transformable volumetric and surface-based anatomical objects are embedded within a shared scene representation and propagated directly into projection space through explicit transformations. Projection geometry, acquisition modelling, material interpretation, and image presentation remain explicitly separated, enabling controlled exploration of methodological assumptions while preserving reproducibility and direct comparability between generated projections. Particular emphasis is placed on transformation-driven anatomical scenarios relevant to craniofacial analysis, including mandibular motion and therapeutic repositioning. Using a shared anatomical reference scene composed of CT/CBCT volumes, segmented structures, surface models, and auxiliary anatomical or therapeutic objects, the framework enables generation of directly comparable VirtualRTG projections from multiple anatomical configurations while preserving identical imaging assumptions. Rather than aiming at fully physically faithful radiographic simulation, the proposed approach provides a controllable and reproducible methodological environment for studying anatomy-projection relationships, motion observability, and transformation-aware imaging workflows.

2606.16658 2026-06-16 cs.CV 新提交

Vision-Language Models as Zero-Annotation Oracles in Histopathology

视觉-语言模型作为组织病理学中的零标注预言机

Vishal Jain, Giorgio Buzzanca, Sarah Cechnicka, Maarten Naesens, Priyanka Koshy, Tri Nguyen, Jesper Kers, Candice Roufosse, Bernhard Kainz

发表机构 * Imperial College London(帝国理工学院) Leiden University Medical Center(莱顿大学医学中心) KU Leuven(鲁汶大学) University Hospitals Leuven(鲁汶大学医院) University Medical Center Utrecht(乌得勒支大学医学中心) Friedrich-Alexander University Erlangen-Nürnberg(埃尔朗根-纽伦堡大学)

AI总结 提出一种粗到细方法,利用通用视觉-语言模型作为零标注预言机进行前景分割,在特殊染色上优于监督基线,并通过伪标签蒸馏轻量学生模型。

Comments 11 pages, 1 figure, 6 tables. Code available at https://github.com/VishalJ99/vlm-wsi-auto-context

详情
AI中文摘要

前景分割是每个计算病理学流程的关键第一步,但现有方法依赖于手工调整的启发式规则或监督模型,这些模型过度拟合狭窄的染色和扫描仪分布,在特殊染色(如Jones银染或Elastica van Gieson)上无声失败。我们提出一种粗到细方法,将前景分割重新定义为视觉感知任务,并利用通用视觉-语言模型(VLM)作为零标注预言机。我们的关键洞察是,组织与背景的区分是一个自然图像识别问题,而非组织病理学问题,因此在互联网规模语料上训练的VLM能够泛化到领域特定模型无法处理的场景。我们引入了Leica-75基准,包含跨越三种染色家族的75张肾移植全切片图像。在Leica-75上,我们的方法在分布外染色上实现了最高分割质量(Jones Dice 0.858 +/- 0.027,EVG Dice 0.853 +/- 0.041),交叉染色方差比最佳监督基线低7倍,同时在分布内H&E上保持竞争力。使用自动筛选示例(Auto-context)的少样本提示挽救了Stress-32(n=32)上的困难案例,Stress-32是一个精心设计的压力测试子集(2B模型Dice从0.470提升至0.819)。基于VLM的标注审查与人类专家共识一致(模糊检测kappa=0.989;分割掩码审查的平均精确率/召回率分级准确率0.708 vs. 人类0.646)。生成的伪标签用于蒸馏轻量学生模型,其性能与教师模型相当,而运行成本仅为教师模型的一小部分。我们的框架为数字病理学中持续存在的基础设施瓶颈提供了原则性、可扩展的解决方案。

英文摘要

Foreground segmentation is the critical first step of every computational pathology pipeline, yet existing methods rely on hand-tuned heuristics or supervised models that overfit to narrow stain and scanner distributions, failing silently on specialised stains such as Jones silver or Elastica van Gieson. We propose a coarse-to-fine approach that recasts foreground segmentation as a visual perception task and leverages general-purpose vision-language models (VLMs) as zero-annotation oracles. Our key insight is that tissue-versus-background discrimination is a natural-image recognition problem, not a histopathological one, so VLMs trained on internet-scale corpora generalise where domain-specific models cannot. We introduce Leica-75, a benchmark of 75 renal transplant whole-slide images spanning three stain families. On Leica-75, our method achieves the highest segmentation quality on out-of-distribution stains (Dice 0.858 +/- 0.027 on Jones, 0.853 +/- 0.041 on EVG) with 7x lower cross-stain variance than the best supervised baseline, while remaining competitive on in-distribution H&E. Few-shot prompting with automatically curated exemplars (Auto-context) rescues hard cases on Stress-32 (n=32), a curated stress-test subset (Dice 0.470 to 0.819 for the 2B model). VLM-based annotation review matches human expert consensus (kappa=0.989 for blur detection; mean precision/recall grading accuracy 0.708 vs. human 0.646 for segmentation mask review). The resulting pseudo-labels are used to distil lightweight student models that are as performant as the teacher model while running for a fraction of the cost. Our framework provides a principled, scalable solution to a persistent infrastructure bottleneck in digital pathology.
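The Dice scores quoted above follow the standard overlap definition, Dice = 2|P ∩ T| / (|P| + |T|). A minimal sketch for flat binary masks is given below; treating two empty masks as a perfect match is a convention assumed here, and the abstract does not say how the authors handle that case.

```python
def dice(pred, target):
    """Dice coefficient for two equally sized flat binary masks (0/1 values)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same length")
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


# Toy masks: one overlapping pixel, two predicted, one in the reference.
score = dice([1, 1, 0, 0], [1, 0, 0, 0])
```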

2606.16749 2026-06-16 cs.CV 新提交

Structure-aware Knowledge-guided Heterogeneous Mamba for Zygomaticomaxillary Suture Assessment

结构感知知识引导的异构Mamba用于颧上颌缝评估

Xiaoqi Guo, Birui Chen, Xinquan Yang, Chaoyun Zhang, Xuefen Liu, Mianjie Zheng, Kun Tang, Xuguang Li, Wen Ma, Yanhua Xu, Linlin Shen

发表机构 * College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机与软件学院) School of Artificial Intelligence, Shenzhen University(深圳大学人工智能学院) Affiliated Stomatology Hospital of Kunming Medical University(昆明医科大学附属口腔医院) Shenzhen University General Hospital(深圳大学总医院)

AI总结 提出首个ZMS公开数据集(3790张图像,覆盖4-24岁),并设计SKMamba框架,通过解耦双路径架构、隐式边缘提取器和跨模态语义对齐模块,实现自动化ZMS成熟度评估,性能优于现有方法。

详情
AI中文摘要

颧上颌缝(ZMS)是连接颧骨和上颌骨的关键上颌周围结构,是上颌前移过程中的主要阻力部位,其成熟状态直接影响矫形干预的时机和效果。然而,由于缝线中微妙的高频过渡以及相邻阶段之间的全局语义模糊性,ZMS成熟的准确分期仍然具有挑战性。为解决这一问题,我们提出了首个公开ZMS数据集,包含3790张覆盖4至24岁全年龄范围的ZMS图像。基于该数据集,我们提出了SKMamba,一种结构感知和知识引导的基于Mamba的多模态框架,用于自动化ZMS成熟度评估。SKMamba采用解耦的双路径架构,模拟经验丰富的正畸医生使用的分层诊断过程。我们首先引入隐式边缘提取器(IEE),利用结构预训练减少小梁噪声并突出缝线边界。作为补充,设计了跨模态语义对齐(CSA)模块,用于整合来自大语言模型(LLM)的解剖描述。该模块有助于将局部形态线索与全局语义描述对齐,同时确保客观形态证据仍是决策的主要依据。在我们的ZMS数据集上的大量实验表明,SKMamba相比现有方法实现了最先进的性能。代码可在https://github.com/galaxygxq1116/SKMamba获取。

英文摘要

The zygomaticomaxillary suture (ZMS) is a key circummaxillary structure connecting the zygomatic bone and the maxilla. It serves as a primary site of resistance during maxillary advancement, and its maturation status directly influences the timing and efficacy of orthopedic interventions. However, accurate staging of ZMS maturation remains challenging due to subtle high-frequency transitions in suture lines and the global semantic ambiguity between adjacent stages. To address this, we present the first public ZMS dataset, comprising 3,790 ZMS images covering the entire age range from 4 to 24 years. Based on this dataset, we propose SKMamba, a Structure-aware and Knowledge-guided Mamba-based multi-modal framework for automated ZMS maturation assessment. SKMamba adopts a decoupled dual-path architecture that mimics the hierarchical diagnostic process used by experienced orthodontists. We first introduce an Implicit Edge Extractor (IEE), which leverages structural pre-training to reduce trabecular noise and accentuate sutural boundaries. Complementarily, a Cross-Modal Semantic Alignment (CSA) module is designed to incorporate anatomical descriptions from a large language model (LLM). This module helps align local morphological cues with global semantic descriptions while ensuring that objective morphological evidence remains the primary basis for decisions. Extensive experiments on our ZMS dataset demonstrate that SKMamba achieves state-of-the-art performance compared to existing methods. Code is available at https://github.com/galaxygxq1116/SKMamba.

2606.16756 2026-06-16 cs.CV 新提交

3D Classification of Paramagnetic Rim Lesions in Multiple Sclerosis via Asymmetric QSM-FLAIR Modeling

多发性硬化症中顺磁性边缘病变的3D分类:基于非对称QSM-FLAIR建模

Veronica Pignedoli, Giacomo Boffa, Nicoletta Noceti, Matilde Inglese, Francesca Odone, Matteo Moro

发表机构 * MaLGa, DIBRIS, University of Genova(热那亚大学) DINOGMI, University of Genova(热那亚大学) IRCCS Azienda Ospedaliera Metropolitana(IRCCS大都会医院)

AI总结 提出一种3D多模态深度学习框架,利用非对称QSM-FLAIR建模对多发性硬化症中的顺磁性边缘病变进行自动分类,通过自监督预训练和对比正则化提升有限数据下的鲁棒性,在88名患者队列中验证了有效性。

Comments 10 pages, 3 figures, accepted at MICCAI 2026. Github link: https://github.com/veronicapignedoli/FRODO

详情
AI中文摘要

在磁敏感加权MRI上识别的顺磁性边缘病变(Rim+)最近已成为多发性硬化症(MS)慢性活动性炎症的特异性生物标志物,并与长期残疾进展相关。然而,磁敏感成像和专家判读仍局限于专业中心,视觉评估耗时且可变,且Rim+病变的低患病率给自动分析带来了严重的类别不平衡挑战。我们提出了一种3D多模态深度学习框架,用于从定量磁化率图(QSM)和FLAIR MRI中进行病变级别的Rim+/Rim-分类。该架构通过将QSM作为主要磁敏感驱动信号并用FLAIR衍生的结构上下文进行条件化,显式建模了模态非对称性。为了提高在有限数据下的鲁棒性,我们采用了自监督多模态预训练,随后进行带有对比正则化的监督微调。该方法在临床采集的88名MS患者队列中进行了评估,以专家病变标注作为参考标准。结果显示了相比先前架构的性能提升,支持了非对称多模态建模在自动识别慢性活动性病变中的有效性。

英文摘要

Paramagnetic rim lesions (Rim+) identified on susceptibility-sensitive MRI have recently emerged as a specific biomarker of chronic active inflammation in Multiple Sclerosis (MS) and are associated with long-term disability progression. However, susceptibility imaging and expert interpretation remain limited to specialized centers, visual assessment is time-consuming and variable, and the low prevalence of Rim+ lesions poses severe class imbalance challenges for automated analysis. We propose a 3D multimodal deep learning framework for lesion-level Rim+/Rim- classification from Quantitative Susceptibility Mapping (QSM) and FLAIR MRI. The architecture explicitly models modality asymmetry by treating QSM as the primary susceptibility-driven signal and conditioning it with FLAIR-derived structural context. To improve robustness under limited data, we employ self-supervised multimodal pretraining followed by supervised fine-tuning with contrastive regularization. The method was evaluated on a clinically acquired cohort of 88 people with MS with expert lesion annotations as reference standard. Results highlight improved performance compared to prior architectures, supporting the effectiveness of asymmetric multimodal modeling for automated chronic active lesion identification.

2606.16794 2026-06-16 cs.CV 新提交

LLM-Based Visual Explanation Evaluation Framework for Assessing the Explainability of Facial Skin Disease Classification Models

基于LLM的视觉解释评估框架:用于评估面部皮肤病分类模型的可解释性

Gyuyeon Na

发表机构 * AI and Business Analytics, Ewha Womans University(梨花女子大学人工智能与商业分析)

AI总结 提出基于LLM的视觉解释评估框架,通过渐进式提示工程评估Grad-CAM在面部皮肤病诊断模型中的解释质量,聚焦病变定位和可信度。

详情
AI中文摘要

本研究提出了一个特定领域的基于LLM的视觉解释评估框架,用于评估面部皮肤病诊断模型中Grad-CAM解释的质量。以往研究主要关注通过数据增强技术提升分类性能,而较少系统性地检验模型解释是否基于临床相关的病变区域。在本研究中,对基于EfficientNet-B0、MobileNetV3和ResNet18的面部皮肤病分类模型应用了几何增强、颜色增强和混合增强策略。采用Grad-CAM生成代表模型决策过程的视觉解释。此外,利用GPT-5.5、Gemini 3.5 Flash和Claude Sonnet 4.6设计了LLM-as-a-Judge评估框架,从病变定位和解释可信度两个角度评估Grad-CAM解释。为提高评估一致性和临床基础,引入了渐进式提示工程策略,包含评估准则、临床知识、惩罚规则和结构化输出格式。

英文摘要

This study proposes a domain-specific LLM-based Visual Explanation Evaluation Framework for assessing Grad-CAM explanations in facial skin disease diagnosis models. While previous studies have primarily focused on improving classification performance through data augmentation techniques, relatively few studies have systematically examined whether model explanations are grounded in clinically relevant lesion regions. In this study, geometric augmentation, color-based augmentation, and mixed augmentation strategies were applied to facial skin disease classification models based on EfficientNet-B0, MobileNetV3, and ResNet18. Grad-CAM was employed to generate visual explanations representing the models' decision-making processes. Furthermore, an LLM-as-a-Judge evaluation framework was designed using GPT-5.5, Gemini 3.5 Flash, and Claude Sonnet 4.6 to assess Grad-CAM explanations from the perspectives of lesion localization and explanation trustworthiness. To improve evaluation consistency and clinical grounding, a progressive prompt engineering strategy was introduced, incorporating evaluation rubrics, clinical knowledge, penalty rules, and structured output formats.

2606.16991 2026-06-16 cs.CV cs.LG 新提交

A Multi-Center Benchmark for Abdominal Disease Diagnosis and Report Generation from Non-Contrast CT

基于非增强CT的腹部疾病诊断与报告生成的多中心基准

Mariam Elbakry, Aliaa Sayed Sheha, Salma Hassan Tantawy, Aya Yassin, Concetto Spampinato, Karim Lekadir, Xiaomeng Li, Marawan Elbatel

发表机构 * Ain Shams University(艾因夏姆斯大学) The Hong Kong University of Science and Technology(香港科技大学) University of Catania(卡塔尼亚大学) Universitat de Barcelona(巴塞罗那大学)

AI总结 提出一个多中心基准,利用非增强CT合成增强CT发现,用于多器官腹部疾病诊断和自动报告生成,实验表明非增强CT保留诊断信号,平均AUC达69.1%(内部)和63.1%(外部)。

Comments Early Accept (top ~9%), MICCAI 2026

详情
AI中文摘要

多期增强CT(CECT)广泛用于腹部病变表征,但存在造影剂肾病风险、增加采集负担并加重放射科医生工作量。为解决这些问题,我们引入了一个新的多中心基准,用于多器官腹部疾病诊断和自动放射报告生成,该基准学习从单期非增强CT(NCCT)合成增强CT发现。为此,我们从两个中心收集了配对NCCT-CECT研究及其对应的增强放射报告的大规模数据集,分为内部集和外部验证队列。在统一评估协议下,我们对五种当代深度学习架构进行了基准测试,涵盖胸部专用、腹部专用和通用多模态领域。大量实验表明,NCCT保留了诊断信号,在内部队列和外部队列上分别实现了平均多器官AUC 69.1%和63.1%。通过公开发布该数据集和标准化基准,本研究旨在促进未来对更安全、资源高效且全球可及的免造影腹部成像工作流程的研究。代码地址:https://github.com/xmed-lab/TriALS-Report。

英文摘要

Multiphasic contrast-enhanced CT (CECT) is widely used for abdominal lesion characterization, yet it carries inherent risks of contrast-induced nephropathy, escalates acquisition burden, and heavily contributes to radiologist workload. To address these challenges, we introduce a novel multi-center benchmark for multi-organ abdominal disease diagnosis and automated radiology report generation, which learns to synthesize contrast-enhanced findings from single-phase non-contrast CT (NCCT). To support this, we curated a large-scale dataset of paired NCCT-CECT studies and their corresponding contrast-enhanced radiology reports from two centers, partitioned into internal sets and an external validation cohort. Under a unified evaluation protocol, we benchmarked five contemporary deep learning architectures encompassing chest-specific, abdomen-specific, and general-purpose multimodal domains. Extensive experiments demonstrate that NCCT retains diagnostic signals, achieving an average multi-organ AUC of 69.1% on the internal cohort and 63.1% on the external cohort. By releasing this dataset and standardized benchmark publicly, this study aims to catalyze future research into safer, resource-efficient, and globally accessible contrast-free abdominal imaging workflows. Code is available at: https://github.com/xmed-lab/TriALS-Report.

2606.14828 2026-06-16 eess.IV cs.AI cs.CV 交叉投稿

Leptomeningeal Collateral Detection on DSA via Vessel-Graph Neural Networks

基于血管图神经网络的DSA软脑膜侧支检测

Junyong Cao, Hakim Baazaoui, Chinmay Prabhakar, Suprosanna Shit, Lukas Bastian Otto, Susanne Wegener, Bjoern Menze, Ezequiel de la Rosa

发表机构 * University of Zurich(苏黎世大学) University Hospital Zurich(苏黎世大学医院)

AI总结 提出一种混合图-像素架构,在DSA血管图上对单个血管段分类,首次实现DSA中软脑膜侧支的个体化检测,PR-AUC达0.434,优于纯图或纯像素方法。

详情
AI中文摘要

软脑膜侧支(LMCs)是急性缺血性卒中的重要预后因素。现有自动化方法依赖CT血管造影(CTA),但单个LMCs通常太小而无法在CTA上分辨,限制了这些方法只能进行粗略的侧支评分。数字减影血管造影(DSA)以更高的分辨率可视化单个侧支,但当前评估仍依赖主观的手动分级量表,存在评分者间一致性差的问题。我们提出一个框架,将侧支检测形式化为对从DSA导出的图上的单个血管段进行分类。一种混合图-像素架构将拓扑感知的图分支与密集像素分支相结合,在共享的节点概率空间中融合。在五折交叉验证中,融合模型的PR-AUC达到0.434,优于纯图(0.403)和纯像素(0.362)基线。据我们所知,这是首个能够在DSA中实现LMCs个体化的方法,允许对每个血管进行精确的定量评估。这种整合将DSA评估转向客观评价,支持未来对单个LMCs的生物标志物和模式发现。

英文摘要

Leptomeningeal collaterals (LMCs) are an important prognostic factor in acute ischemic stroke. Existing automated methods rely on CT angiography (CTA), but individual LMCs are often too small to be resolved on CTA, limiting these methods to coarse collateral scoring. Digital subtraction angiography (DSA) visualizes individual collaterals at superior resolution, yet current assessment remains subjective, relying on manual grading scales that suffer from poor inter-rater agreement. We present a framework that formulates collateral detection as the classification of individual vessel segments on a graph derived from DSA. A hybrid graph-pixel architecture combines a topology-aware graph branch with a dense pixel branch, fused in a shared node-probability space. In a five-fold cross-validation setting, the fused model achieves a PR-AUC of 0.434, outperforming the graph-only (0.403) and pixel-only (0.362) baselines. To our knowledge, this is the first method to enable the individualization of LMCs in DSA, allowing for precise per-vessel quantitative assessment. This integration shifts DSA assessment toward objective evaluation, supporting future biomarker and pattern discovery for individual LMCs.

2606.15000 2026-06-16 eess.IV cs.CV 交叉投稿

Polyp-D2ATL: Deep Domain-Adaptive Transfer Learning for Colorectal Polyp Classification under Label Distribution Shift

Polyp-D2ATL:标签分布偏移下用于结直肠息肉分类的深度域自适应迁移学习

Sajad Jabarzadeh Ghandilu, Maryam Sadat Hosseini Azad, Shahriar Baradaran Shokouhi, Emad Fatemizadeh

发表机构 * School of Electrical Engineering, Sharif University of Technology(谢里夫理工大学电气工程学院) School of Electrical Engineering, Iran University of Science and Technology(伊朗科学技术大学电气工程学院)

AI总结 提出Polyp-D2ATL框架,通过特定训练策略解决不平衡数据、标签分布偏移和跨模态泛化问题,在PICCOLO数据集上显著优于现有模型。

Comments 15 pages, 5 figures, 7 tables

详情
AI中文摘要

早期且高准确率地预测结直肠息肉,作为最危险癌症类型之一的重要标志,将有助于挽救更多生命。尽管结直肠息肉分类取得了进展,但在获得能够诊断真实场景中伴有不同特征的难以预测息肉的自动化息肉预测系统方面仍存在许多挑战,其中模型需要成功处理不平衡数据、标签分布偏移和跨模态泛化。在本研究中,我们提出了Polyp-D2ATL,一种新颖的框架,并辅以特定的训练策略,缓解了这些限制,并有效预测了属于NICE分类的不同类别息肉。我们在PICCOLO验证集和测试集上的大量实验表明,所提出的Polyp-D2ATL在各种可靠指标上显著优于现有最先进模型,在验证集上达到了82.38%的准确率、77.49%的宏F1分数和87.47%的特异性,同时在保留的测试集上取得了一致的改进,证明了所提出方法的泛化能力和临床适用性。

英文摘要

Early and highly accurate prediction of colorectal polyps, an important sign of one of the most dangerous types of cancer, can save more lives. Despite advances in colorectal polyp classification, many challenges remain in building an automated polyp prediction system that can diagnose difficult-to-predict polyps with varied features in real scenarios, where the model must handle imbalanced data, label distribution shift, and cross-modality generalization. In this study, we propose Polyp-D2ATL, a novel framework with a dedicated training strategy that mitigates these limitations and effectively predicts the polyp classes of the NICE classification. Our extensive experiments on the PICCOLO validation and test sets demonstrate that Polyp-D2ATL significantly outperforms existing state-of-the-art models across various reliable metrics, achieving an accuracy of 82.38%, a Macro-F1 of 77.49%, and a specificity of 87.47% on the validation set, alongside consistent improvements on the held-out test set, which demonstrates the generalization capacity and clinical applicability of the proposed approach.

2606.15037 2026-06-16 cs.CL cs.CV 交叉投稿

ReportQA: QA-Based Radiology Report Evaluation

ReportQA: 基于问答的放射学报告评估

Yiming Shi, Shaoshuai Yang, Xi Chen, Haolin Li, Hengyu Zhang, Che Jiang, Kaiwen Wang, Xun Zhu, Dong Xie, Fei Wang, Dejing Dou, Miao Li, Ji Wu

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) College of AI, Tsinghua University(清华大学人工智能学院) Beijing National Research Center for Information Science and Technology(北京信息科学与技术国家研究中心) Beijing Electronic Digital & Intelligence(北京电子数字与智能)

AI总结 提出ReportQA框架,利用知识树和LLM从报告中提取结构化信息生成QA对,以问答准确率作为评估指标,比现有指标更符合放射科医生判断。

详情
AI中文摘要

放射学报告评估对于推进自动报告生成至关重要。自然语言生成指标具有有限的临床相关性。临床效能(CE)指标评估重要的医学发现,但主要关注存在性且仅覆盖有限的实体集。由于严重依赖人工标注,CE指标难以扩展临床实体或属性。在临床实践中,放射学报告作为信息传递的媒介。临床医生使用它们执行下游诊断任务,而无需直接检查图像。基于这一见解,我们提出了ReportQA,一个临床相关且灵活的放射学报告评估框架,支持对放射学报告生成系统进行详细的定量分析。我们首先收集涵盖多种成像模态和解剖区域的数据集。然后,在放射科医生的指导下构建临床实体和属性的知识树,并使用大型语言模型(LLM)从原始报告中提取结构化信息。接下来,我们从预定义模板生成QA对,并通过自过滤和基于报告的过滤进行质量控制。在评估期间,将报告视为上下文,LLM作为评判模型来回答QA对。基于得到的QA准确率,我们引入了QAScore指标。与现有指标相比,QAScore显示出与放射科医生判断更好的对齐。在多个最先进的视觉-语言模型上的实验表明,当前基于报告的推理范式难以学习细粒度的临床表示,并表现出强烈的负先验偏差。相比之下,问题驱动的推理提供了一种更有效的替代方案。为了可重复性和可扩展性,我们发布了知识树、结构化报告和QA对,以及用于QA构建和评估的流水线代码。

英文摘要

Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus mainly on presence and cover only a limited set of entities. Because they rely heavily on manual annotations, CE metrics are difficult to extend to new clinical entities or attributes. In clinical practice, radiology reports serve as a medium for information transfer. Clinicians use them to perform downstream diagnostic tasks without directly inspecting images. Based on this insight, we propose ReportQA, a clinically relevant and flexible radiology report evaluation framework that supports detailed quantitative analysis of radiology report generation systems. We first collect datasets covering multiple imaging modalities and anatomical regions. We then construct knowledge trees of clinical entities and attributes with radiologist guidance, and use large language models (LLMs) to extract structured information from raw reports. Next, we generate QA pairs from predefined templates and apply quality control through self-filtering and report-based filtering. During evaluation, the report is treated as context, and an LLM acts as a judge model to answer the QA pairs. Based on the resulting QA accuracy, we introduce the QAScore metric. Compared with existing metrics, QAScore shows better alignment with radiologist judgments. Experiments on multiple state-of-the-art vision-language models reveal that current report-based inference paradigms struggle to learn fine-grained clinical representations and exhibit strong negative prior biases. In contrast, question-driven inference provides a more effective alternative. For reproducibility and extensibility, we release the knowledge trees, structured reports, and QA pairs, along with the pipeline code for QA construction and evaluation.
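Since the abstract defines QAScore in terms of QA accuracy, the core computation can be sketched as the fraction of QA pairs the judge answers correctly. This is only an illustrative sketch: the case-insensitive exact-match rule, the function name `qa_score`, and the toy answers are assumptions, and the released pipeline may normalize or weight answers differently.

```python
def qa_score(judge_answers, reference_answers):
    """Fraction of QA pairs where the judge's answer matches the reference."""
    if not reference_answers:
        raise ValueError("at least one QA pair is required")
    if len(judge_answers) != len(reference_answers):
        raise ValueError("answer lists must have the same length")
    correct = sum(
        a.strip().lower() == r.strip().lower()  # assumed matching rule
        for a, r in zip(judge_answers, reference_answers)
    )
    return correct / len(reference_answers)


# Hypothetical answers read off a generated report vs. the reference report.
score = qa_score(["Yes", "left lung", "no"], ["yes", "right lung", "no"])
```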

2509.25594 2026-06-16 cs.CV cs.AI 版本更新

K-Prism: A Knowledge-Guided and Prompt Integrated Universal Medical Image Segmentation Model

K-Prism: 一种知识引导与提示集成的通用医学图像分割模型

Bangwei Guo, Yunhe Gao, Meng Ye, Difei Gu, Yang Zhou, Leon Axel, Dimitris Metaxas

发表机构 * Rutgers University(罗格斯大学) Stanford University(斯坦福大学) The University of Texas at Arlington(德克萨斯大学阿灵顿分校) New York University(纽约大学)

AI总结 提出K-Prism统一分割框架,通过双提示表示和混合专家解码器整合语义先验、上下文知识和交互反馈三种知识范式,在18个数据集上实现语义、上下文和交互分割的最优性能。

详情
Journal ref
International Conference on Learning Representations (ICLR), 2026
AI中文摘要

医学图像分割是临床决策的基础,但现有模型仍然碎片化。它们通常基于单一知识源训练,并针对特定任务、模态或器官。这种碎片化与临床实践形成鲜明对比,在临床实践中,专家无缝整合多种知识:来自训练集的解剖先验、来自参考病例的基于示例的推理,以及通过实时交互的迭代细化。我们提出了K-Prism,一个统一的分割框架,通过系统整合三种知识范式来反映这种临床灵活性:(i) 从标注数据集中学习的语义先验,(ii) 来自少样本参考示例的上下文知识,以及(iii) 来自用户输入(如点击或涂鸦)的交互反馈。我们的关键见解是,这些异构知识源可以编码为双提示表示:定义“分割什么”的1-D稀疏提示和指示“关注哪里”的2-D密集提示,然后通过混合专家(MoE)解码器动态路由。这种设计使得范式之间灵活切换,并能够在不同任务上进行联合训练,而无需修改架构。在涵盖多种模态(CT、MRI、X射线、病理、超声等)的18个公共数据集上的全面实验表明,K-Prism在语义、上下文和交互分割设置中均达到了最先进的性能。

英文摘要

Medical image segmentation is fundamental to clinical decision-making, yet existing models remain fragmented. They are usually trained on single knowledge sources and specific to individual tasks, modalities, or organs. This fragmentation contrasts sharply with clinical practice, where experts seamlessly integrate diverse knowledge: anatomical priors from training, exemplar-based reasoning from reference cases, and iterative refinement through real-time interaction. We present K-Prism, a unified segmentation framework that mirrors this clinical flexibility by systematically integrating three knowledge paradigms: (i) semantic priors learned from annotated datasets, (ii) in-context knowledge from few-shot reference examples, and (iii) interactive feedback from user inputs like clicks or scribbles. Our key insight is that these heterogeneous knowledge sources can be encoded into a dual-prompt representation: 1-D sparse prompts defining what to segment and 2-D dense prompts indicating where to attend, which are then dynamically routed through a Mixture-of-Experts (MoE) decoder. This design enables flexible switching between paradigms and joint training across diverse tasks without architectural modifications. Comprehensive experiments on 18 public datasets spanning diverse modalities (CT, MRI, X-ray, pathology, ultrasound, etc.) demonstrate that K-Prism achieves state-of-the-art performance across semantic, in-context, and interactive segmentation settings.

2602.02186 2026-06-16 cs.CV 版本更新

Learning Topology-Aware Implicit Field for Unified Pulmonary Tree Modeling with Incomplete Topological Supervision

学习拓扑感知隐式场用于不完整拓扑监督下的统一肺树建模

Ziqiao Weng, Jiancheng Yang, Kangxian Xie, Bo Zhou, Weidong Cai

发表机构 * School of Computer Science, The University of Sydney(悉尼大学计算机科学学院) ELLIS Institute Finland(芬兰ELLIS研究所) Aalto University(阿尔托大学) Department of Computer Science and Engineering, University of Buffalo(布法罗大学计算机科学与工程系) Department of Radiology, Northwestern University(西北大学放射学系)

AI总结 提出TopoField框架,利用稀疏点云学习连续隐式场,在无完整标注下修复肺树拓扑不完整,并联合实现解剖标记与肺段重建,效率高且鲁棒。

Comments 20 pages

详情
AI中文摘要

从CT图像中提取的肺树经常表现出拓扑不完整性,例如缺失或断开的分支,这严重降低了下游解剖分析的质量,并限制了现有肺树建模流程的适用性。当前方法通常依赖密集体积处理、显式图推理或通用点云补全先验,导致效率有限、结构感知弱以及在现实结构损坏下的鲁棒性降低。我们提出TopoField,一个拓扑感知隐式建模框架,将拓扑修复视为首要建模问题,并实现肺树分析的统一多任务推理。TopoField使用稀疏表面和骨架点云表示肺部解剖结构,并通过在已经不完整的树上合成引入的结构破坏进行训练,学习一个支持拓扑修复的连续隐式场,无需依赖完整或显式的断开标注。基于修复后的隐式表示,通过任务特定的隐式函数在单次前向传播中联合推断解剖标记和肺段重建。在Lung3D+数据集上的大量实验表明,TopoField在具有挑战性的不完整场景下持续改善拓扑完整性,并实现准确的解剖标记和肺段重建。我们进一步在外部分割模型的真实不完整输出上验证TopoField,展示了其对现实分割流程的适用性。由于其隐式公式,TopoField实现了高计算效率,每个案例完成所有任务仅需一秒多,突显了其在大规模和时间敏感的临床应用中的实用性。

英文摘要

Pulmonary trees extracted from CT images frequently exhibit topological incompleteness, such as missing or disconnected branches, which substantially degrades downstream anatomical analysis and limits the applicability of existing pulmonary tree modeling pipelines. Current approaches typically rely on dense volumetric processing, explicit graph reasoning, or generic point cloud completion priors, leading to limited efficiency, weak structural awareness, and reduced robustness under realistic structural corruption. We propose TopoField, a topology-aware implicit modeling framework that treats topology repair as a first-class modeling problem and enables unified multi-task inference for pulmonary tree analysis. TopoField represents pulmonary anatomy using sparse surface and skeleton point clouds and learns a continuous implicit field that supports topology repair without relying on complete or explicit disconnection annotations, by training on synthetically introduced structural disruptions over already incomplete trees. Building upon the repaired implicit representation, anatomical labeling and lung segment reconstruction are jointly inferred through task-specific implicit functions within a single forward pass. Extensive experiments on the Lung3D+ dataset demonstrate that TopoField consistently improves topological completeness and achieves accurate anatomical labeling and lung segment reconstruction under challenging incomplete scenarios. We further validate TopoField on real incomplete outputs from an external segmentation model, demonstrating its applicability to realistic segmentation pipelines. Owing to its implicit formulation, TopoField attains high computational efficiency, completing all tasks in just over one second per case, highlighting its practicality for large-scale and time-sensitive clinical applications.

2603.12514 2026-06-16 cs.CV cs.LG 版本更新

CT-VDETR: Semi-supervised 3D Trauma Detection in Computed Tomography (CT) scans using Dense Vertex Relative Position Encoding

CT-VDETR:使用密集顶点相对位置编码的CT扫描半监督3D创伤检测

Shivam Chaudhary, Sheethal Bhat, Andreas Maier

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 提出CT-VDETR框架,结合自监督预训练和半监督transformer检测,在仅78个标注体数据上实现31.33% mAP@0.50,为纯监督方法的1.53倍。

Comments v2: Updated results with corrected dataset split. Revised Table 1 (mAP@0.50: 31.33% SSL vs 20.45% baseline, 1.53x improvement; mAP@0.75: 30.95% vs 10.45%, 2.96x improvement). Updated validation curves showing stable convergence. No methodology changes. 7 pages, 4 figures, 2 tables. Code: https://github.com/shivasmic/3d-trauma-detection-ssl

详情
AI中文摘要

在腹部CT中准确检测和定位创伤性损伤仍然具有挑战性,因为体素级标注有限且获取成本高。我们提出了一种标签高效的3D腹部创伤检测框架,该框架将自监督预训练与半监督基于transformer的检测相结合。首先,我们在1098个CT体数据上使用掩码图像建模(MIM)预训练3D U-Net编码器,用于解剖表示学习。接着,我们通过特征适配器将V-DETR适应到密集体积CT,该适配器将编码器特征网格转换为紧凑的token序列,用于transformer解码。然后,将预训练编码器与V-DETR和3D顶点相对位置编码(3D V-RPE)集成,以改善不规则形状损伤的定位。最后,在半监督教师-学生一致性正则化中,利用额外的2000个未标注体数据进行检测器训练。据我们所知,这是3D DETR风格检测器首次应用于RSNA腹部创伤检测任务。在该基准上,所提方法仅使用78个标注训练体数据就达到了31.33%的测试mAP@0.50,相当于纯监督训练的1.53倍提升。这些结果表明,将医学领域预训练与半监督学习相结合是标签稀缺的3D医学检测的有效策略。

英文摘要

Accurate detection and localization of traumatic injuries in abdominal CT remain challenging because voxel-level annotations are limited and expensive to obtain. We present a label-efficient framework for 3D abdominal trauma detection that combines self-supervised pretraining with semi-supervised transformer-based detection. First, we use Masked Image Modeling (MIM) on 1098 CT volumes to pretrain a 3D U-Net encoder for anatomical representation learning. Next, we adapt V-DETR to dense volumetric CT through a feature adapter that converts the encoder feature grid into a compact token sequence for transformer decoding. The pretrained encoder is then integrated with V-DETR and 3D Vertex Relative Position Encoding (3D V-RPE) to improve the localization of irregularly shaped injuries. Finally, semi-supervised teacher-student consistency regularization leverages 2,000 additional unlabeled volumes during detector training. To the best of our knowledge, this is the first application of a 3D DETR-style detector to the RSNA abdominal trauma detection task. On this benchmark, the proposed method achieves 31.33% test mAP@0.50 using only 78 labeled training volumes, corresponding to a 1.53x improvement over supervised-only training. These results show that combining medical-domain pretraining with semi-supervised learning is an effective strategy for label-scarce 3D medical detection.
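The improvement factors quoted for this entry follow directly from the mAP values listed in its Comments field. A quick arithmetic check, with illustrative variable names that are not taken from the paper or its code:

```python
# mAP values (percent) as quoted in the v2 Comments field of this entry.
results = {
    "mAP@0.50": {"ssl": 31.33, "baseline": 20.45},
    "mAP@0.75": {"ssl": 30.95, "baseline": 10.45},
}


def improvement_factor(ssl: float, baseline: float) -> float:
    """Ratio of the semi-supervised score to the supervised-only score."""
    return ssl / baseline


# Round to two decimals, the precision used in the listing.
factors = {
    name: round(improvement_factor(**vals), 2) for name, vals in results.items()
}

for name, factor in factors.items():
    print(f"{name}: {factor:.2f}x")
```

Both ratios reproduce the quoted 1.53x and 2.96x, so "1.53x improvement" here means the semi-supervised score is 1.53 times the baseline, not 1.53 times higher on top of it.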

2603.15525 2026-06-16 cs.CV cs.HC 版本更新

Clinically Aware Synthetic Image Generation for Concept Coverage in Chest X-ray Models

临床感知的合成图像生成用于胸部X光模型的概念覆盖

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

发表机构 * University of Edinburgh(爱丁堡大学) NHS Lothian(洛锡安国家健康服务)

AI总结 提出CARPA框架,通过解剖约束的概念扰动生成合成胸部X光图像,扩展临床概念覆盖,提升模型性能与可靠性。

Comments Accepted for presentation at the IJCAI-ECAI 2026 RobustifAI workshop

详情
AI中文摘要

用于胸部X光诊断的深度学习模型受到公开训练数据集中临床有意义概念组合覆盖有限的限制。虽然合成图像生成已被探索以增加数据多样性,但现有方法很少强制执行临床或解剖约束,限制了其在提高模型可靠性方面的效用。我们提出了CARPA,一个临床感知和解剖基础的合成胸部X光生成框架,该框架在保持解剖结构的同时对临床概念向量进行有针对性的扰动。通过生成具有受控概念插入和删除的解剖忠实合成图像,CARPA扩展了临床相关的概念覆盖。我们通过七种骨干架构评估CARPA,在合成子集上微调模型,并在一个保留的MIMIC-CXR基准上进行测试。与先前的概念扰动方法相比,在CARPA生成的图像上微调一致地提高了精确率-召回率性能,降低了预测不确定性,并改善了模型校准。结构和语义分析表明高解剖保真度、强概念对齐和低语义不确定性。两位专家放射科医生的评估进一步确认了真实感和临床一致性。这些结果共同表明,解剖基础的概念扰动能够更有效地利用合成数据,提高胸部X光分类模型的性能和可靠性,并支持更安全的临床部署。

英文摘要

Deep learning models for chest X-ray diagnosis are constrained by limited coverage of clinically meaningful concept combinations in publicly available training datasets. While synthetic image generation has been explored to increase data diversity, existing methods rarely enforce clinical or anatomical constraints, limiting utility for improving model reliability. We propose CARPA, a clinically aware and anatomically grounded framework for synthetic chest X-ray generation that applies targeted perturbations to clinical concept vectors while preserving anatomical structure. By producing anatomically faithful synthetic images with controlled concept insertions and deletions, CARPA expands clinically relevant concept coverage. We evaluate CARPA across seven backbone architectures by fine-tuning models on synthetic subsets and testing on a held-out MIMIC-CXR benchmark. Compared to prior concept perturbation approaches, fine-tuning on CARPA-generated images consistently improves precision-recall performance, reduces predictive uncertainty, and improves model calibration. Structural and semantic analyses demonstrate high anatomical fidelity, strong concept alignment, and low semantic uncertainty. Evaluation by two expert radiologists further confirms realism and clinical agreement. Together, these results show that anatomically grounded concept perturbations enable more effective use of synthetic data, improving both performance and reliability of chest X-ray classification models and supporting safer clinical deployment.

2605.05761 2026-06-16 cs.CV 版本更新

iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models

iTRIALSPACE:用于肺CT模型受控评估的可编程虚拟病灶试验

Fakrul Islam Tushar, Umme Hafsa Momy, Joseph Y. Lo, Geoffrey D. Rubin

发表机构 * Department of Radiology and Imaging Sciences, University of Arizona(亚利桑那大学放射科和影像科学系) Department of Biomedical Engineering, Florida International University(佛罗里达国际大学生物医学工程系) Center for Virtual Imaging Trials, Department of Radiology, Duke University Medical Center(杜克大学医学中心放射科虚拟成像试验中心)

AI总结 提出可编程评估框架iTRIALSPACE,通过四阶段流水线(结节分析、试验规范、掩膜插入、CT合成)构建受控虚拟病灶试验,揭示固定基准无法发现的模型缺陷。

Comments 11 pages, 13 figures, 13 tables

详情
AI中文摘要

我们引入iTRIALSPACE,一个用于肺CT模型受控评估的可编程评估框架。标准基准是静态回顾性集合,混杂了病灶大小、肺叶分布、解剖结构和采集背景,使得难以确定什么因素在结构上驱动模型准确性。iTRIALSPACE通过四阶段流水线(多数据集结节分析、显式试验规范、解剖感知掩膜插入和ControlNet条件CT合成)将真实临床CT和病灶轮廓组合成受控虚拟病灶试验来解决这一限制。该框架基于一个统一的54属性结节分析数据集,涵盖来自七个公共CT源的13,140个标注结节,并实例化为13种试验模式。我们在一个涵盖三种医学VLM、四种空间引导条件和三种临床任务的55,469样本虚拟病灶研究中评估iTRIALSPACE。在所有13种模式下,合成基底保持在真实到真实FID基线内,且合成性能排名强烈转移到真实临床数据(ρ = 0.93,p < 10^{-15})。受控试验模式揭示了固定分布基准无法获得的发现,包括在肺叶均衡采样下的捷径驱动尺寸预测崩溃,以及双交叉分析中宿主与供体方差比达到8.9倍和3.3倍。这些结果将iTRIALSPACE定位为一种可审计的评估基础设施,用于超越静态回顾性基准的受控、可证伪测试。

英文摘要

We introduce iTRIALSPACE, a programmable evaluation framework for controlled assessment of lung CT models. Standard benchmarks are static retrospective collections that entangle lesion size, lobe prevalence, anatomy, and acquisition context, making it difficult to determine what structurally drives model accuracy. iTRIALSPACE addresses this limitation by composing real clinical CTs and lesion profiles into controlled virtual lesion trials through a four-stage pipeline: multi-dataset nodule profiling, explicit trial specification, anatomy-aware mask insertion, and ControlNet-conditioned CT synthesis. The framework is built on a unified 54-attribute nodule-profile dataset spanning 13,140 annotated nodules from seven public CT sources and instantiated as 13 trial modes. We evaluate iTRIALSPACE in a 55,469-sample Virtual Lesion Study spanning three medical VLMs, four spatial-guidance conditions, and three clinical tasks. Across all 13 modes, the synthetic substrate remains within the real-to-real FID baseline, and synthetic performance rankings transfer strongly to real clinical data ($\rho = 0.93$, $p < 10^{-15}$). Controlled trial modes expose findings unavailable to fixed-distribution benchmarks, including shortcut-driven size prediction collapse under lobe-equalized sampling and host-to-donor variance ratios of 8.9x and 3.3x in twin-cross analysis. These results position iTRIALSPACE as an auditable evaluation infrastructure for controlled, falsifiable testing beyond static retrospective benchmarks.
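The abstract reports that performance rankings on synthetic data transfer to real data with a correlation of 0.93. A rank correlation of this kind can be sketched as below, assuming it is Spearman's coefficient, which the abstract does not state. The scores are hypothetical placeholders, not values from the paper, and the closed-form formula assumes no tied scores.

```python
def ranks(values):
    """Return 1-based ranks of the values (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman_rho(x, y):
    """Spearman rank correlation via the no-ties closed form."""
    n = len(x)
    rank_x, rank_y = ranks(x), ranks(y)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical accuracy of five model configurations on synthetic vs. real CTs.
synthetic_scores = [0.62, 0.48, 0.71, 0.55, 0.80]
real_scores = [0.58, 0.50, 0.69, 0.45, 0.77]

rho = spearman_rho(synthetic_scores, real_scores)
print(f"Spearman rho = {rho:.2f}")
```

A value near 1 means the configurations that score best on synthetic trials also score best on real scans, which is the property the transfer claim relies on.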

2606.02877 2026-06-16 cs.CV 版本更新

Pathway-Structured Privileged Distillation for Deployable Computational Pathology

面向可部署计算病理学的通路结构特权蒸馏

Yongxin Guo, Hao Lu, Onur Koyun, Muhammet Demir, Metin Gurcan

发表机构 * School of Medicine, Wake Forest University(威克森林大学医学院)

AI总结 提出MoPE框架,通过通路索引病理专家和记忆使用对齐,将多模态学习转化为仅组织学推理的特权蒸馏,提升全切片图像推理性能。

详情
AI中文摘要

整合转录组学和组织病理学可以改善癌症风险建模,但在常规环境中RNA分析的有限可用性限制了其实用性。本文引入了通路专家混合(MoPE),这是一个知识蒸馏框架,将多模态学习重新定义为仅组织学推理的特权蒸馏。MoPE的动机来自RNA谱和全切片图像之间的部分可观测性:组织学可以捕获某些分子程序相关的形态学后果,但不能期望重建完整的转录组状态。MoPE编码RNA衍生的通路,并通过记忆使用对齐将分子监督转移到通路索引的病理专家。在各种公共基准测试和两个独立的乳腺癌队列中,与基线方法相比,MoPE持续改善了仅WSI推理性能。通路使用分析和人工审核的视觉检查提供了模型行为和候选形态学相关读数的有限检查。这些结果支持通路结构特权蒸馏作为在训练期间利用分子信息同时保持无RNA推理的有前途的途径。

英文摘要

Integrating transcriptomics and histopathology can improve cancer risk modelling, yet practical use is constrained by the limited availability of RNA profiling in routine settings. Here we introduce Mixture of Pathway Experts (MoPE), a knowledge-distillation framework that reframes multimodal learning as privileged distillation for histology-only inference. MoPE is motivated by the partial observability between RNA profiles and whole-slide images: histology can capture morphology-linked consequences of certain molecular programmes, but cannot be expected to reconstruct the full transcriptomic state. MoPE encodes RNA-derived pathways and transfers the molecular supervision to pathway-indexed pathology experts through memory-usage alignment. Across diverse public benchmarks and two independent breast cancer cohorts, MoPE consistently improved WSI-only inference performance relative to baseline methods. Pathway-usage analyses and human-audited visual inspection provide bounded inspection of model behaviour and candidate morphology-linked readouts. These results support pathway-structured privileged distillation as a promising route to using molecular information during training while preserving RNA-free inference.

2411.05824 2026-06-16 eess.IV cs.CV cs.LG 版本更新

Navigating Distribution Shifts in Medical Image Analysis: A Survey

医学图像分析中的分布偏移导航:综述

Zixian Su, Jingwei Guo, Xi Yang, Qiufeng Wang, Frans Coenen, Amir Hussain, Kaizhu Huang

发表机构 * Life Simulation Research Center, Beijing Academy of Artificial Intelligence(北京智源人工智能研究院生命模拟研究中心) Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology(阿卜杜拉国王科技大学电气与数学科学与工程学部) Department of Intelligent Science, School of Advanced Technology, Xi’an Jiaotong-Liverpool University(西交利物浦大学先进技术学院智能科学系) Computer Science, School of Computer Science and Informatics, University of Liverpool(利物浦大学计算机科学与信息学学院) SDAIA-KFUPM Joint Research Centre for Artificial Intelligence, King Fahd University of Petroleum and Minerals(法赫德国王石油与矿业大学SDAIA-KFUPM人工智能联合研究中心) Nuffield Department of Primary Care Health Sciences, University of Oxford(牛津大学纳菲尔德初级保健健康科学系)

AI总结 本文系统综述了应对医学图像分析中分布偏移的深度学习方法,按临床约束分类为联合训练、联邦学习、微调和域泛化,并揭示方法从显式对齐向不确定性建模的转变。

详情
AI中文摘要

医学图像分析(MedIA)已成为现代医疗保健中不可或缺的一部分,增强了临床诊断和个性化治疗。尽管深度学习(DL)技术取得了显著进展,但其实际部署面临分布偏移带来的挑战,即基于特定数据集训练的模型在不同医院或患者群体的数据上表现不佳。为解决这一问题,研究人员积极开发策略以提高DL模型的适应性,使其能够在陌生环境中有效使用。本文系统综述了将DL技术应用于受分布偏移影响的MedIA系统的方法。我们并非按技术特征组织现有方法,而是明确将现实临床约束(如有限的数据可访问性、严格的隐私要求和异构协作协议)与能够解决这些约束的技术范式联系起来。通过建立操作约束与方法论演变之间的这种联系,我们将现有工作分类为联合训练、联邦学习、微调和域泛化,每种方法对应特定的医疗场景。除了这种分类,我们的实证分析表明,随着这些范式中域信息逐渐变得不可访问,性能改进变得越来越受限,并进一步揭示了方法论焦点从显式分布对齐向不确定性感知建模的逐渐转变,最终指向在实际MedIA中需要更多可部署性感知的设计。

英文摘要

Medical Image Analysis (MedIA) has become indispensable in modern healthcare, enhancing clinical diagnostics and personalized treatment. Despite the remarkable advancements supported by deep learning (DL) technologies, their practical deployment faces challenges posed by distribution shifts, where models trained on specific datasets underperform on data from other hospitals or patient populations. To address this issue, researchers have been actively developing strategies to increase the adaptability of DL models, enabling their effective use in unfamiliar environments. This paper systematically reviews approaches that apply DL techniques to MedIA systems affected by distribution shifts. Rather than organizing existing methods by technical characteristics, we explicitly bridge real-world clinical constraints -- such as limited data accessibility, strict privacy requirements, and heterogeneous collaboration protocols -- with the technical paradigms able to address them. By establishing this connection between operational constraints and methodological evolution, we categorize existing works into Joint Training, Federated Learning, Fine-tuning, and Domain Generalization, each aligned with specific healthcare scenarios. Beyond this taxonomy, our empirical analysis suggests that, as domain information becomes progressively less accessible across these paradigms, performance improvements become increasingly constrained, and further uncovers a gradual shift in methodological focus from explicit distribution alignment toward uncertainty-aware modeling, ultimately pointing to the need for more deployability-aware design in real-world MedIA.

2505.05647 2026-06-16 eess.SP cs.CV 版本更新

A New k-Space Model for Non-Cartesian Fourier Imaging

一种用于非笛卡尔傅里叶成像的新k空间模型

Chin-Cheng Chan, Justin P. Haldar

发表机构 * USC Center for Advanced Research Computing(USC高级研究计算中心) Signal and Image Processing Institute(信号与图像处理研究所)

AI总结 针对传统基于体素的傅里叶成像模型计算成本高、收敛慢且易产生伪影的问题,提出一种基于傅里叶域基展开的新模型,在非笛卡尔MRI重建中实现更优图像质量和更低计算复杂度。

详情
AI中文摘要

在过去的几十年中,使用基于模型的方法重建傅里叶成像数据一直很流行,这些方法可以轻松地融入物理约束和先进的正则化/机器学习先验。最常见的建模方法是将连续图像表示为平移的“体素”基函数的线性组合。尽管这种基于体素的模型已被广泛研究和部署,但它存在长期以来的局限性,包括高计算成本、慢收敛和易产生伪影。在这项工作中,我们从新的角度重新审视该模型,识别出可能之前被忽视的新问题(包括不良近似、环绕和零空间特性)。我们的见解促使我们提出一种新模型,该模型对先前方法的局限性(旧的和新的)更具鲁棒性。具体来说,新模型基于傅里叶域基展开,而不是标准的图像域体素方法。在非笛卡尔MRI重建背景下呈现的示例结果表明,新模型能够改善图像质量(减少伪影)和/或降低计算复杂度(更快的计算和更好的收敛)。

英文摘要

For the past several decades, it has been popular to reconstruct Fourier imaging data using model-based approaches that can easily incorporate physical constraints and advanced regularization/machine learning priors. The most common modeling approach is to represent the continuous image as a linear combination of shifted "voxel" basis functions. Although well-studied and widely-deployed, this voxel-based model is associated with longstanding limitations, including high computational costs, slow convergence, and a propensity for artifacts. In this work, we reexamine this model from a fresh perspective, identifying new issues that may have been previously overlooked (including undesirable approximation, wrap-around, and nullspace characteristics). Our insights motivate us to propose a new model that is more resilient to the limitations (old and new) of the previous approach. Specifically, the new model is based on a Fourier-domain basis expansion rather than the standard image-domain voxel-based approach. Illustrative results, which are presented in the context of non-Cartesian MRI reconstruction, demonstrate that the new model enables improved image quality (reduced artifacts) and/or reduced computational complexity (faster computations and improved convergence).

2604.25371 2026-06-16 q-bio.QM cs.CV 版本更新

PhyloSDF: Phylogenetically-Conditioned Neural Generation of 3D Skull Morphology via Residual Flow Matching

PhyloSDF: 基于系统发育条件的残差流匹配神经生成3D颅骨形态

Kaikwan Lau, Gary P. T. Choi

发表机构 * Department of Mathematics(数学系)

AI总结 提出PhyloSDF模型,结合系统发育一致性损失和残差条件流匹配,从少量样本生成符合系统发育关系的3D颅骨形态,在达尔文雀数据集上优于扩散模型和标准流匹配。

详情
AI中文摘要

生成新颖、生物学上可信的三维形态结构是计算进化生物学中的一个基本挑战,其难点在于极端的数据稀缺性以及对生成形状必须尊重物种间系统发育关系的要求。在这项工作中,我们提出了PhyloSDF,一个基于系统发育条件的神经生成模型,用于3D生物形态,它整合了两项创新:(1) 一个由新型系统发育一致性损失正则化的DeepSDF自动解码器,该损失使潜在空间结构与进化距离相关(Pearson r=0.993);(2) 一个残差条件流匹配(Residual CFM)架构,将生成分解为解析的物种质心查找和学习到的残差预测,从而能够从每个物种仅约4个标本进行生成。我们在达尔文雀及其近缘物种的24个物种的100个微CT扫描颅骨上评估了PhyloSDF。该模型生成的网格在代码水平上实现了真实种内变异的88-129%,所有180个生成网格均被验证为非记忆。残差CFM在保真度(Chamfer距离0.00181 vs. 0.00190)和形态测量Fréchet距离(10,641 vs. 13,322)上均超越了去噪扩散(在此尺度下完全失败)、标准流匹配(模式坍缩至3-6%变异)以及高斯混合基线。跨18个物种的留一物种实验展示了系统发育外推能力,平滑的潜在插值产生了生物学上可信的祖先颅骨重建。

英文摘要

Generating novel, biologically plausible three-dimensional morphological structures is a fundamental challenge in computational evolutionary biology, hampered by extreme data scarcity and the requirement that generated shapes respect phylogenetic relationships among species. In this work, we present PhyloSDF, a phylogenetically-conditioned neural generative model for 3D biological morphology that integrates two innovations: (1) a DeepSDF auto-decoder regularized by a novel Phylogenetic Consistency Loss that structures the latent space to correlate with evolutionary distances (Pearson r=0.993); (2) a Residual Conditional Flow Matching (Residual CFM) architecture that factorizes generation into analytic species-centroid lookup and learned residual prediction, enabling generation from as few as ~4 specimens per species. We evaluate PhyloSDF on 100 micro-CT-scanned skulls of Darwin's Finches and their relatives across 24 species. The model generates novel meshes achieving 88-129% of real intra-species variation at the code level, with all 180 generated meshes verified as non-memorized. Residual CFM surpasses denoising diffusion (which fails entirely at this scale), standard flow matching (which mode-collapses to 3-6% variation), and a Gaussian mixture baseline in both fidelity (Chamfer Distance 0.00181 vs. 0.00190) and morphometric Fréchet distance (10,641 vs. 13,322). Leave-one-species-out experiments across 18 species demonstrate phylogenetic extrapolation capability, and smooth latent interpolations produce biologically plausible ancestral skull reconstructions.
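Fidelity in this abstract is reported as Chamfer Distance between point sets. The sketch below uses one common definition, the sum of the two directed mean squared nearest-neighbour distances. Whether the paper uses squared distances or this exact normalisation is an assumption, since the abstract does not specify the variant.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two non-empty 3D point sets.

    Assumed variant: mean squared nearest-neighbour distance from A to B,
    plus the same quantity from B to A. Brute force, O(len(A) * len(B)).
    """

    def squared_distance(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    def directed(source, target):
        # For each source point, find its closest target point.
        nearest = (min(squared_distance(p, q) for q in target) for p in source)
        return sum(nearest) / len(source)

    return directed(points_a, points_b) + directed(points_b, points_a)


# Toy example: two small point sets, the second shifted by 0.1 along x.
shape_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shape_b = [(0.1, 0.0, 0.0), (1.1, 0.0, 0.0), (0.1, 1.0, 0.0)]

print(chamfer_distance(shape_a, shape_b))
```

Because the value depends on point-cloud scale and sampling density, numbers such as 0.00181 vs. 0.00190 are only comparable within a single evaluation setup.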

9. 文档图像、OCR与图表理解 5 篇

2606.15886 2026-06-16 cs.CV 新提交

Text region detection in historical astronomical diagrams

历史天文图中的文本区域检测

Zeynep Sonat Baltacı, Raphaël Baena, Fei Meng, Somkéo Norindr, Florence Somer, Matthieu Husson, Mathieu Aubry

发表机构 * LIGM, ENPC, IP Paris, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France(LIGM, 国立桥路学校, 巴黎理工学院, 古斯塔夫·埃菲尔大学, 法国国家科学研究中心, 马恩拉瓦莱, 法国) LTE, CNRS, PSL-Observatoire de Paris, SU, EIDA Project(LTE, 法国国家科学研究中心, 巴黎文理研究大学-巴黎天文台, 索邦大学, EIDA项目)

AI总结 提出包含948张历史天文图的大规模数据集,跨越十个世纪、涵盖七种语言传统,并设计Poly-DETR模型实现文本区域检测。

详情
AI中文摘要

文本检测是历史文献分析中的关键任务。尽管手稿和地图的文本检测已有数据集和基准,但数学图表中的文本研究鲜受关注。为此,我们引入一个大规模、多样化、开放获取的数据集,包含948张历史天文图,共计10,940个定向多边形文本区域。数据集跨越十个世纪(8至18世纪)和七种主要语言传统:阿拉伯语和波斯语(115张)、中文(332张)、拜占庭语(233张)、拉丁语(185张)、希伯来语(48张)和梵语(35张)。它涵盖了从符号到多行段落的广泛图表风格和文本内容。每个文本实例都标注了有序多边形,精确描绘文本区域并编码阅读方向。此外,我们为拉丁图表中的2,293个区域标注了20个类别标签。我们在数据集上评估了多个强基线,包括TESTR、DeepSolo++以及Poly-DETR(我们设计的DINO-DETR的简单扩展,用于预测有序多边形顶点)。Poly-DETR在MTHv2和cBAD2019基准上达到最先进性能,并在我们的数据集上提供了坚实、简单的基线。代码和数据集在线提供。

英文摘要

Text detection is a crucial task in the analysis of historical documents. While datasets and benchmarks exist for text detection in manuscripts and maps, the study of text in mathematical diagrams has received little attention. To address this, we introduce a large-scale, diverse, open-access dataset of 948 historical astronomical diagrams containing 10,940 oriented polygonal text regions. Our dataset spans ten centuries (8th to 18th) and seven main linguistic traditions: Arabic and Persian (115), Chinese (332), Byzantine (233), Latin (185), Hebrew (48), and Sanskrit (35). It captures a wide range of diagram styles and textual content, from symbols to multi-line paragraphs. Each text instance is annotated with ordered polygons that precisely delineate text regions and encode the reading direction. In addition, we annotated the 2,293 regions in Latin diagrams with 20 class labels. We evaluated several strong baselines on our dataset, including TESTR, DeepSolo++, and Poly-DETR, a simple extension of DINO-DETR that we design to predict ordered polygon vertices. Poly-DETR achieves state-of-the-art performance on the MTHv2 and cBAD2019 benchmarks and provides a solid, simple baseline on our dataset. Code and dataset available online.
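The per-tradition diagram counts in the abstract can be checked against the stated total of 948. The short sketch below does that and derives the average number of text regions per diagram. The dictionary keys simply mirror the groupings used in the abstract, where Arabic and Persian share one count.

```python
# Diagram counts per linguistic tradition, as listed in the abstract.
diagram_counts = {
    "Arabic and Persian": 115,
    "Chinese": 332,
    "Byzantine": 233,
    "Latin": 185,
    "Hebrew": 48,
    "Sanskrit": 35,
}

TOTAL_TEXT_REGIONS = 10_940  # oriented polygonal text regions in the dataset

total_diagrams = sum(diagram_counts.values())
regions_per_diagram = TOTAL_TEXT_REGIONS / total_diagrams

print(f"total diagrams: {total_diagrams}")
print(f"mean text regions per diagram: {regions_per_diagram:.1f}")
```

The six listed groups sum to 948. They account for seven traditions only because Arabic and Persian are reported together.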

2606.15987 2026-06-16 cs.CV cs.DL 新提交

A Text Recognition Dataset from Sahidic Coptic Ancient Manuscripts

来自萨希迪克科普特古代手稿的文本识别数据集

Fabio Quattrini, Carmine Zaccagnino, Costanza Bianchi, Silvia Cascianelli, Rita Cucchiara

发表机构 * University of Modena and Reggio Emilia(摩德纳和雷焦艾米利亚大学)

AI总结 针对低资源手写文本识别,构建了萨希迪克科普特古代手稿数据集SCAM,并评估了多种先进方法的性能,揭示了当前方法在低资源历史文本上的局限性。

Comments Accepted at ICDAR 2026

详情
AI中文摘要

在这项工作中,我们针对低资源场景下的手写文本识别(HTR),这些场景源于代表性不足的语言、稀有文字以及历史文献典型的退化视觉条件。我们引入了SCAM(萨希迪克科普特古代手稿),这是一个从已灭绝的萨希迪克科普特方言书写的数字化古代手稿构建的新行级数据集。该数据集反映了现实且具有挑战性的环境,因为它结合了跨图书馆的异构采集条件以及典型的文献退化,如墨水褪色、透印和材料劣化。除了视觉复杂性外,由于萨希迪克科普特的资源稀缺、其不常见的字母表以及方言特有的变音符号,SCAM还带来了显著的语言挑战。为了支持低资源HTR的研究,我们基于不同范式对几种最先进的方法进行了基准测试,突出了它们在此环境中的局限性和优势。我们的结果强调了当前在资源丰富的现代文字上的HTR性能与基于历史的低资源场景之间的差距,从而为未来的发展提供了参考点。

英文摘要

In this work, we target Handwritten Text Recognition (HTR) in low-resource scenarios, which arise from underrepresented languages, rare scripts, and degraded visual conditions typical of historical documents. We introduce SCAM (Sahidic Coptic Ancient Manuscripts), a new line-level dataset built from digitized ancient manuscripts written in the extinct Sahidic Coptic dialect. The dataset reflects a realistic and challenging setting, as it combines heterogeneous acquisition conditions across libraries with typical manuscript degradations such as ink fading, bleed-through, and material deterioration. In addition to visual complexity, SCAM poses significant linguistic challenges due to the scarcity of resources for Sahidic Coptic, its uncommon alphabet, and dialect-specific diacritics. To support research in low-resource HTR, we benchmark several state-of-the-art approaches based on different paradigms, highlighting their limitations and strengths in this setting. Our results underline the gap between current HTR performance on well-resourced modern scripts and historically grounded, low-resource scenarios, thus providing a reference point for future developments.

2406.17148 2026-06-16 cs.CV 版本更新

MixTeX: Data-Efficient LaTeX OCR via Synthetic Pretraining and Limited Fine-Tuning

MixTeX: 通过合成预训练和有限微调实现数据高效的LaTeX OCR

Yuhan Xu, Yijun Zhao, Renqing Luo, Gary M. Weiss

发表机构 * arXiv

AI总结 提出MixTeX,通过合成预训练(无需真实LaTeX源)和少量真实样本微调,实现数据高效的LaTeX OCR,在英中印刷和手写基准上优于依赖大数据集的方法。

详情
AI中文摘要

LaTeX OCR将科学文档图像转换为可编辑的LaTeX代码。现有系统依赖大型配对数据集,这些数据集收集成本高,且对于低资源语言有限。本文提出了MIXTEX,一种数据高效的系统,使用合成预训练而无需真实的LaTeX源。与依赖arXiv数据集的Nougat不同,我们通过随机配对语法正确的维基百科文本与LaTeX公式来生成训练数据,仅需语法正确性。这消除了对真实文档集合的依赖,实现了可扩展的数据生成(1.2亿个token),并支持低资源语言。在合成预训练之后,适应仅需400个真实样本。在包含印刷和手写英文及中文的977样本基准上的评估表明,这种两阶段策略在需要更少人力和计算的情况下,优于在大型真实数据集上训练的方法。数据、代码和模型公开可用。

英文摘要

LaTeX OCR converts scientific document images into editable LaTeX code. Existing systems rely on large paired datasets, which are costly to collect and limited for low-resource languages. This paper presents MIXTEX, a data-efficient system using synthetic pretraining without real LaTeX sources. Unlike Nougat, which depends on arXiv datasets, we generate training data by randomly pairing grammatical Wikipedia text with LaTeX formulas, requiring only syntactic correctness. This eliminates dependency on real document collections, enables scalable data generation (120M tokens), and supports low-resource languages. Following synthetic pretraining, adaptation requires only 400 real samples. Evaluation on a 977-sample benchmark with printed and handwritten English and Chinese shows that this two-stage strategy outperforms methods trained on large real datasets while requiring less human effort and computation. Data, code, and models are publicly available.

2603.01016 2026-06-16 cs.CV eess.IV eess.SP 版本更新

Implementation of Licensed Plate Detection and Noise Removal in Image Processing

图像处理中车牌检测与噪声去除的实现

Yiquan Gao

发表机构 * Asia Pacific University, Malaysia(亚太大学,马来西亚)

AI总结 本文实现了一种车牌检测与噪声去除方法,通过图像处理技术提高车牌识别系统的准确性和鲁棒性。

Comments 13 pages. This is the author's version, accepted manuscript Published version available at https://www.ijarse.com/ADMIN/admin/postimages/images/fullpdf/1519302304_SVCET2087ijarse.pdf

详情
Journal ref
International Journal of Advance Research in Science and Engineering, Vol. 07, Special Issue No. 02, pp. 678-690, ISSN: 2319-8354, Feb. 2018
AI中文摘要

汽车车牌识别系统是一种图像处理技术,用于通过捕获汽车车牌来识别车辆。汽车车牌识别技术也称为自动车牌识别、自动车辆识别、汽车车牌识别或汽车光学字符识别。在马来西亚,随着如今车辆数量的迅速增加,道路上相当多的车辆带来了对汽车车牌识别系统的巨大需求。汽车车牌识别系统可以应用于电子停车支付系统、高速公路收费系统、交通监控系统以及作为警察执法工具。此外,汽车车牌识别系统技术还有潜力与生物学、航空航天等其他不同领域的各种技术相结合,以实现解决某些专门问题的目标。

英文摘要

A car license plate recognition system is an image processing technology that identifies vehicles by capturing their license plates. The technology is also known as automatic number-plate recognition, automatic vehicle identification, or optical character recognition for cars. In Malaysia, the rapidly growing number of vehicles on the road has created considerable demand for car license plate recognition systems. Such systems can be deployed in electronic parking payment, highway toll collection, and traffic surveillance, and as police enforcement tools. The technology also has the potential to be combined with techniques from other fields, such as biology and aerospace, to solve specialized problems.

2606.08781 2026-06-16 cs.CV 版本更新

DeepMine-Mamba: Mitigating Information Dilution in Mamba-Based State Space Models for Document Image Binarization

DeepMine-Mamba:缓解基于Mamba的状态空间模型在文档图像二值化中的信息稀释问题

Sheng-Wei Chan, Yung-Che Wang, Hsin-Jui Pan, Chia-Min Lin, Jen-Shiun Chiang

发表机构 * Department of Electrical and Computer Engineering, Tamkang University(淡江大学电机与计算机工程系)

AI总结 提出DeepMine-Mamba框架,通过抗稀释门控机制选择性恢复笔画敏感局部响应,抑制无关背景增强,解决Mamba状态空间模型在文档二值化中弱前景线索被稀释的问题。

Comments code will be released on https://github.com/henrychan0719/Deep-Mine-Mamba

详情
AI中文摘要

文档图像二值化旨在从退化的背景中分离前景文本,同时保留细、断裂和低对比度的笔画。尽管深度学习方法提高了二值化性能,但大多数现有方法依赖于卷积、基于Transformer或生成架构,而基于Mamba的状态空间模型在此任务中尚未被充分探索。在这项工作中,我们研究了基于Mamba的特征传播,并观察到直接的状态空间传播可能会在长程建模过程中稀释弱前景线索,特别是淡墨迹、碎片化字符和边界敏感的笔画细节。为了解决这个问题,我们提出了DeepMine-Mamba,一个基于Mamba的二值化框架,配备了一种新颖的抗稀释门控机制,该机制估计传播引起的特征变化,并选择性地恢复笔画敏感的局部响应,同时抑制不必要的背景增强。在严格的留一年验证协议下,对DIBCO/H-DIBCO基准的实验表明,DeepMine-Mamba取得了具有竞争力的整体性能,在基准年份中具有强大的平均FM和Fps。消融结果进一步表明,抗稀释门控机制改善了笔画保留,并减少了感知上显著的二值化误差。

英文摘要

Document image binarization aims to separate foreground text from degraded backgrounds while preserving thin, broken, and low-contrast strokes. Although deep learning methods have improved binarization performance, most existing approaches rely on convolutional, transformer-based, or generative architectures, while Mamba-based state space models remain largely unexplored for this task. In this work, we investigate Mamba-based feature propagation and observe that direct state-space propagation may dilute weak foreground cues during long-range modeling, especially faint ink traces, fragmented characters, and boundary-sensitive stroke details. To address this problem, we propose DeepMine-Mamba, a Mamba-based binarization framework equipped with a novel Anti-Dilution Gate that estimates propagation-induced feature changes and selectively restores stroke-sensitive local responses while suppressing unnecessary background enhancement. Experiments on DIBCO/H-DIBCO benchmarks under a strict leave-one-year-out protocol show that DeepMine-Mamba achieves competitive overall performance, with strong average FM and Fps across benchmark years. Ablation results further show that the Anti-Dilution Gate is the key component for mitigating propagation-induced foreground dilution and improving stroke preservation.

10. 低层视觉、计算成像与图像增强 25 篇

2606.14773 2026-06-16 cs.CV cs.AI 新提交

Double-Helix Vision (DH-V2): A Geometry-Based Visual Sampler for Bandwidth-Constrained Perception

双螺旋视觉 (DH-V2):一种基于几何的带宽受限感知视觉采样器

Jinwen Wen

发表机构 * Independent Researcher(独立研究者)

AI总结 提出双螺旋视觉(DH),一种基于黄金比例螺旋轨迹的几何采样器,将2D图像压缩为1D信号,实现1433倍压缩比,在CPU上0.52ms完成感知,CIFAR-10上准确率提升6.03%。

Comments 5 pages, 3 figures, 5 tables. Code and benchmarks: https://github.com/JackJ-C/double-helix-vision-tool

详情
AI中文摘要

我们提出双螺旋视觉(DH),一种基于几何的视觉采样器,利用成对的黄金比例启发螺旋轨迹将2D图像压缩为紧凑的1D信号。DH不是均匀处理每个像素,而是采用两个相位偏移的螺旋(Alpha和Beta,偏移180度)以生物启发的中央凹方式采样图像:中心高密度,外围稀疏覆盖。在4K分辨率下,DH实现了1433倍压缩比(减少99.93%),同时保留场景的几何结构。完整的感知流水线——包括空间映射、时间碰撞检测和帧内结构视差估计——在仅CPU硬件上以1080p分辨率运行仅需0.52毫秒,无需神经网络依赖。在CIFAR-10上,在极端采样预算下(每个螺旋K=128个点),DH比均匀随机采样获得了+6.03%的准确率提升。提供了一个可序列化为JSON的机器人API,以2.7 KB的数据包提供亚毫秒级空间感知报告。代码和基准测试在MIT许可下提供。

英文摘要

We present Double-Helix Vision (DH), a geometry-based visual sampler that compresses 2D images into compact 1D signals using paired golden-ratio-inspired spiral trajectories. Rather than processing every pixel uniformly, DH employs two phase-shifted helices (Alpha and Beta, offset by 180 degrees) to sample the image with biologically-inspired foveation: high density at the center, sparse coverage at the periphery. At 4K resolution, DH achieves a 1,433x compression ratio (99.93% reduction) while preserving the geometric structure of the scene. The full perception pipeline -- including spatial mapping, temporal collision detection, and intra-frame structural disparity estimation -- runs in 0.52 ms at 1080p on CPU-only hardware, with no neural network dependencies. On CIFAR-10 at extreme sampling budgets (K=128 points per helix), DH achieves a +6.03% accuracy gain over uniform random sampling. A JSON-serializable Robotics API is provided, delivering sub-millisecond spatial perception reports in 2.7 KB packets. Code and benchmarks are available under the MIT License.
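The sampling idea in this abstract can be sketched as two golden-angle spirals offset by 180 degrees, with denser samples near the image centre. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, the linear radius schedule, and the use of the golden angle are all hypothetical choices made here. The last function only checks that a 1,433x compression ratio corresponds to the quoted 99.93% reduction.

```python
import math

# Golden angle in radians (about 2.39996). Assumed here as the angular step;
# the abstract only says the trajectories are "golden-ratio-inspired".
GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))


def helix_points(width, height, k_points, phase=0.0):
    """Sample k_points (x, y) positions along a spiral centred on the image.

    The radius grows linearly with the sample index, so the number of samples
    per unit area falls off roughly as 1/r: dense at the centre, sparse at
    the periphery. This schedule is an assumption, not taken from the paper.
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    r_max = min(cx, cy)
    points = []
    for k in range(k_points):
        radius = r_max * (k + 1) / k_points
        theta = k * GOLDEN_ANGLE + phase
        points.append((cx + radius * math.cos(theta),
                       cy + radius * math.sin(theta)))
    return points


def reduction_percent(compression_ratio):
    """Percentage of data removed for a given compression ratio."""
    return (1.0 - 1.0 / compression_ratio) * 100.0


# Two phase-shifted helices on a CIFAR-10-sized image, K = 128 points each.
alpha = helix_points(32, 32, 128)
beta = helix_points(32, 32, 128, phase=math.pi)  # offset by 180 degrees

print(round(reduction_percent(1433), 2))  # 99.93
```

With a 180-degree phase offset, each Beta sample is the point reflection of the matching Alpha sample through the image centre, so the pair covers opposite sides of the frame at every radius.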

2606.14781 2026-06-16 cs.CV 新提交

Variational Deep Unfolding with Mamba-Based Nonlocal Modeling for Underwater Image Enhancement

基于Mamba非局部建模的变分深度展开水下图像增强

Daniel Torres, Julia Navarro, Catalina Sbert, Joan Duran

发表机构 * Institute of Applied Computing and Community Code (IAC3)(应用计算与社区代码研究所 (IAC3)) Dept. of Mathematics and Computer Science, Universitat de les Illes Balears(巴利阿里群岛大学数学与计算机科学系)

AI总结 提出一种融合变分建模与可学习架构的深度展开网络,利用Mamba层捕获自相似性,通过近端轨迹损失约束展开阶段,实现水下图像增强。

详情
AI中文摘要

水下成像在海洋工程中至关重要,但捕获的数据通常存在能见度低和颜色失真问题。针对这些挑战,我们提出了一种基于模型的深度展开网络用于水下图像增强,该网络将变分建模集成到可学习架构中。该框架基于去雾分解的变分公式,包含一个乘法残差分量以吸收剩余伪影,以及一个非局部梯度型约束以保留结构细节并增强边缘锐度。我们提供了理论分析,建立了相关最小化问题解的存在性。所提出的展开方法结合了Mamba层,以有效捕获场景中的自相似性。此外,我们引入了一种近端轨迹损失,强制展开阶段与理想恢复正则化器的迭代之间的一致性。实验结果表明,与最近的最先进方法相比,所提出的展开方法实现了更好的视觉质量和有竞争力的定量性能。源代码将在https://github.com/MIA-UIB/Variational-Unfolding-Mamba-Underwater-Enhancement 提供。

英文摘要

Underwater imaging plays a crucial role in ocean engineering, although captured data often suffer from poor visibility and color distortion. To address these challenges, we propose a model-based deep unfolding network for underwater image enhancement that integrates variational modeling into a learnable architecture. The framework is guided by a variational formulation based on a dehazing decomposition, incorporating a multiplicative residual component to absorb remaining artifacts and a nonlocal gradient-type constraint to preserve structural details and enhance edge sharpness. We provide a theoretical analysis establishing the existence of solution for the associated minimization problem. The proposed unfolding method incorporates Mamba layers to efficiently capture self-similarities in the scene. In addition, we introduce a proximal trajectory loss that enforces consistency between the unfolding stages and the iterations of an ideal restoration regularizer. Experimental results demonstrate that the proposed unfolding approach achieves improved visual quality and competitive quantitative performance compared with recent state-of-the-art methods. The source code will be available at https://github.com/MIA-UIB/Variational-Unfolding-Mamba-Underwater-Enhancement .

2606.15104 2026-06-16 cs.CV 新提交

Text-Driven Fusion for Infrared and Visible Images: Achieving Image Scene Adaptation on Hyperbolic Space

红外与可见光图像的文本驱动融合:在双曲空间实现图像场景自适应

Huan Kang, Hui Li, Tianyang Xu, Tao Zhou, Xiao-Jun Wu, Josef Kittler

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种文本驱动的红外与可见光图像融合框架,利用双曲流形学习嵌入层次语义,通过BLIP文本提示引导视觉-属性对齐,实现无文本输入的自适应融合,性能优于现有方法。

Comments 14 pages, 8 figures

详情
AI中文摘要

红外与可见光图像融合旨在整合互补模态,而现有的欧几里得方法施加了刚性的距离度量,扭曲了多模态交互和父子语义层次。为了克服这些限制,我们引入了一种由双曲流形学习驱动的文本驱动融合框架。在训练过程中,BLIP提取的文本提示作为双曲空间中的拓扑锚点,通过自然适应不同语义粒度的双曲嵌入引导视觉-属性对齐。通过利用庞加莱球负曲率决定的指数体积增长,该方法无缝嵌入层次树以编码从粗到细的语义而不会出现度量饱和,同时广阔的外围空间防止了跨模态融合期间的纹理失真。在推理时,融合过程利用学习到的文本属性先验自动适应输入内容,完全消除了对文本输入的需求。实验结果表明,我们的方法在基准数据集上优于最先进的方法,代码可在 https://github.com/Shaoyun2023/TEDFusion 获取。

英文摘要

Infrared and visible image fusion aims to integrate complementary modalities, while existing Euclidean methods impose rigid distance metrics that distort multi-modal interactions and parent-to-child semantic hierarchies. To overcome these limitations, we introduce a text-driven fusion framework empowered by hyperbolic manifold learning. During training, BLIP-extracted text prompts serve as topological anchors within the hyperbolic space, guiding vision-attribute alignment through hyperbolic embeddings that naturally accommodate varying semantic granularities. By exploiting the exponential volume growth dictated by the Poincaré ball's negative curvature, this approach seamlessly embeds hierarchical trees to encode coarse-to-fine semantics without metric saturation, while the vast peripheral space prevents texture distortion during cross-modal fusion. At inference, the fusion process autonomously adapts to input content using the learned text-attribute priors, completely eliminating the need for textual input. Experimental results show our method outperforms state-of-the-art approaches on benchmark datasets, with code available at https://github.com/Shaoyun2023/TEDFusion.
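As a rough illustration of why the Poincaré ball suits hierarchies, the sketch below computes the standard geodesic distance on the unit ball. It is not taken from the TEDFusion code; the function name and the sample points are hypothetical, and only the textbook distance formula is assumed.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit Poincare ball."""
    sq_norm_u = sum(x * x for x in u)
    sq_norm_v = sum(x * x for x in v)
    if sq_norm_u >= 1.0 or sq_norm_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v))
    return math.acosh(arg)

# Hypothetical points: two pairs with the same Euclidean gap of 0.1.
origin = [0.0, 0.0]
near_center = [0.1, 0.0]
near_edge_a = [0.95, 0.0]
near_edge_b = [0.85, 0.0]

print(poincare_distance(origin, near_center))
print(poincare_distance(near_edge_b, near_edge_a))
```

The second pair sits near the boundary, so its hyperbolic distance is several times larger than that of the first pair. This extra room near the boundary is what lets fine-grained leaves of a tree stay well separated.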

2606.15243 2026-06-16 cs.CV 新提交

SPARK: Spatial Policy-driven Adaptive Reinforcement learning for Knowledge distillation

SPARK: 空间策略驱动的自适应强化学习知识蒸馏

Mohamed Jismy Aashik Rasool, Shabir Ahmad, Gisong Oh, Teag Kuen Whangbo

发表机构 * Gachon University(嘉泉大学)

AI总结 提出SPARK框架,利用轻量强化学习策略网络自适应分配蒸馏努力,通过空间权重图调制量化感知训练中的知识蒸馏损失,提升低比特量化图像恢复网络的性能。

Comments 13 pages, 3 figures, 5 tables, BMVC submission

AI中文摘要

低比特量化使得图像恢复(IR)网络能够在资源受限设备上部署,但引入了舍入噪声,不成比例地降低了边缘和精细纹理等高频率区域的质量。现有的知识蒸馏(KD)方法在所有空间位置上均匀应用蒸馏信号,忽略了不同图像区域的重建难度差异。为了解决这一问题,我们提出了SPARK(空间策略驱动的自适应强化学习知识蒸馏),一个使用轻量级强化学习(RL)策略网络自适应分配蒸馏努力的框架。在每个训练步骤中,一个难度特征提取器计算四个信号,即拉普拉斯方差、像素方差、学生重建误差和师生知识差距,这些信号被输入到一个紧凑的策略CNN中,该网络生成一个随机空间权重图,以在量化感知训练(QAT)期间调制KD损失。SPARK与IR任务无关,不增加推理成本,并且无需架构更改即可集成到任何现有的QAT流程中。在基准数据集上的实验表明,SPARK在多种学生架构上始终优于PTQ、QAT和最先进的(SOTA)KD方法,在显著的计算约束下实现了最接近全精度教师的重建质量。

英文摘要

Low-bit quantization enables deployment of image restoration (IR) networks on resource-constrained devices, but introduces rounding noise that disproportionately degrades high-frequency regions such as edges and fine textures. Existing knowledge distillation (KD) methods apply distillation signals uniformly across all spatial locations, overlooking the varying reconstruction difficulty across image regions. To address this, we propose SPARK (Spatial Policy-driven Adaptive Reinforcement Learning for Knowledge Distillation), a framework that adaptively allocates distillation effort using a lightweight reinforcement learning (RL) policy network. At each training step, a difficulty feature extractor computes four signals, namely Laplacian variance, pixel variance, student reconstruction error, and teacher-student knowledge gap, which are fed into a compact policy CNN that produces a stochastic spatial weight map to modulate the KD loss during quantization-aware training (QAT). SPARK is IR task-agnostic, adds no inference cost, and integrates into any existing QAT pipeline without architectural changes. Experiments on benchmark datasets demonstrate that SPARK consistently outperforms PTQ, QAT, and state-of-the-art (SOTA) KD approaches across multiple student architectures, achieving reconstruction quality closest to the full-precision teacher under significant computational constraints.
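The idea of modulating the KD loss with a spatial weight map can be sketched as follows. This is a minimal illustration, not the SPARK implementation: the learned policy CNN is replaced by a hypothetical rule (`gap_based_weights`) that uses only one of the four signals, the teacher-student gap, and all names and toy values are assumptions.

```python
def weighted_kd_loss(student, teacher, weights):
    """Mean of per-pixel |student - teacher| scaled by a spatial weight map."""
    total, count = 0.0, 0
    for s_row, t_row, w_row in zip(student, teacher, weights):
        for s, t, w in zip(s_row, t_row, w_row):
            total += w * abs(s - t)
            count += 1
    return total / count

def gap_based_weights(student, teacher, eps=1e-8):
    """Hypothetical stand-in for the policy network: weights proportional to
    the teacher-student gap, normalised so they average to about 1."""
    gaps = [[abs(s - t) for s, t in zip(s_row, t_row)]
            for s_row, t_row in zip(student, teacher)]
    mean_gap = sum(sum(row) for row in gaps) / sum(len(row) for row in gaps)
    return [[g / (mean_gap + eps) for g in row] for row in gaps]

# Toy 2x2 outputs: the student is worst at the bottom-right "edge" pixel.
teacher = [[0.0, 0.0], [0.0, 1.0]]
student = [[0.0, 0.1], [0.1, 0.4]]
uniform = [[1.0, 1.0], [1.0, 1.0]]
adaptive = gap_based_weights(student, teacher)

uniform_loss = weighted_kd_loss(student, teacher, uniform)
adaptive_loss = weighted_kd_loss(student, teacher, adaptive)
print(uniform_loss, adaptive_loss)
```

With the same average weight, the adaptive map concentrates the loss on the pixel where the student lags the teacher most. Because the weighting only changes the training loss, the student architecture and its inference cost stay the same.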

2606.15597 2026-06-16 cs.CV 新提交

Fusion-E2Pulse: A Multimodal Event-RGB Fusion Network for Non-contact Pulse Wave Reconstruction

Fusion-E2Pulse:一种用于非接触式脉搏波重建的多模态事件-RGB融合网络

Qian Feng, Hao Guo, Yan Niu, Zhenhuan Xu, Yidi Li

发表机构 * College of Computer Science and Technology(计算机科学与技术学院) Taiyuan University of Technology(太原理工大学)

AI总结 提出Fusion-E2Pulse多模态融合网络,利用RGB信号结构先验抑制运动伪影,结合事件流高灵敏度恢复细粒度形态细节,在脉搏波重建中实现噪声抑制与形态保真度的最佳平衡。

Comments Accepted by MICCAI 2026. The final version will appear in the official MICCAI proceedings published by Springer

AI中文摘要

非接触式脉搏波重建依赖于波形形态的精确恢复,包括重搏切迹。传统的基于RGB的方法从录制的面部视频中提取生理信号,但受限于标准相机的积分成像机制,曝光过程会产生平滑效应,削弱微弱的血管搏动细节。相反,神经形态事件相机虽然对强度波动具有极高的灵敏度,但本质上容易受到微小运动引起的噪声和伪影的影响。为了利用基于帧的积分和基于事件的差分感知之间的协同作用,我们提出了一种名为Fusion-E2Pulse的新型多模态网络。该框架利用滤波后的RGB信号作为结构先验来抑制运动伪影,同时利用事件流的高灵敏度恢复细粒度的形态细节。实验结果表明,Fusion-E2Pulse达到了最先进的性能,有效平衡了噪声抑制和形态保真度,心率估计的平均绝对误差为0.78 bpm,波形相关性为0.89,收缩期相位持续时间误差为16.74 ms,验证了其在重建细粒度病理特征方面的有效性。

英文摘要

Non-contact pulse wave reconstruction hinges on the precise recovery of waveform morphology, including the dicrotic notch. Conventional Red-Green-Blue (RGB)-based methods, which extract physiological signals from recorded facial videos, are constrained by the integral imaging mechanism of standard cameras, where the exposure process induces a smoothing effect that attenuates subtle vascular pulsation details. Conversely, neuromorphic event cameras, while offering exceptional sensitivity to intensity fluctuations, are inherently susceptible to noise and artifacts induced by minor motion. To exploit the synergy between frame-based integration and event-based differential sensing, we propose a novel multimodal network named Fusion-E2Pulse. This framework utilizes filtered RGB signals as structural priors to suppress motion artifacts, while leveraging the high sensitivity of event streams to recover fine-grained morphological details. Experimental results demonstrate that Fusion-E2Pulse achieves state-of-the-art performance, effectively balancing noise suppression and morphological fidelity. It attains a mean absolute error of 0.78 bpm for heart rate estimation, a waveform correlation of 0.89, and a systolic phase duration error of 16.74 ms, validating its efficacy in reconstructing fine-grained pathological features.

2606.15648 2026-06-16 cs.CV 新提交

Fusing Transferred Priors and Physics-based Decomposition for Underwater Image Enhancement

融合迁移先验与物理分解的水下图像增强

Haochen Hu, Yanrui Bin, Zhengyan Zhang, Minchen Wei, Chih-yung Wen, Bing Wang

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出一种无需配对标签的迁移学习方法,将水下图像增强分解为全局颜色校正、去雾和背景噪声抑制,利用跨域先验监督各步骤,实现物理一致的增强。

AI中文摘要

水下图像在不同水质条件下拍摄,导致复杂的退化,包括颜色偏差、低对比度和模糊效应。最近,基于学习的方法已显示出在水下图像增强(UIE)方面的潜力。然而,以往的大多数工作侧重于训练策略或网络设计,使增强结果与数据集中的标签良好对齐,忽略了标签是从先前UIE方法的增强结果中选取的,这些伪标签存在噪声。因此,它们的模型性能在一定程度上并不令人满意。然而,收集水下图像的真实标签具有挑战性。在这项工作中,我们提出了一种基于迁移学习的UIE方法,该方法不需要水下图像具有成对的噪声或真实标签来学习。相反,首先根据水下物理将UIE任务分解为全局颜色校正、去雾和背景噪声抑制。然后,利用来自其他视觉任务的多种先验作为每个步骤的跨域监督。通过这种方式,通过迁移学习实现了一种新颖的UIE,并且物理对齐的UIE分解提供了理论上的合理性。定性和定量实验表明,我们基于物理和先验融合的方法在UIE任务中达到了SOTA性能,并有效提升了下游视觉任务,显著优于基准方法。项目仓库:https://github.com/Haru2022/P2-UIE。

英文摘要

Underwater images are captured under diverse water-medium conditions, leading to complex degradation, including color bias, low contrast, and blur. Recently, learning-based methods have demonstrated their potential for underwater image enhancement (UIE). However, most previous work focuses on the training strategy or network design to align the enhanced result with the labels in datasets, ignoring that these labels are selected from the enhanced results of previous UIE methods and are therefore noisy pseudo-labels. Consequently, the performance of such models is not fully satisfactory, and collecting true labels for underwater images is challenging. In this work, we propose a transfer learning-based UIE method that does not require underwater images to have paired noisy or true labels for learning. Instead, the UIE task is first divided into global color correction, haze removal, and background noise suppression following the underwater physics. Then multiple types of priors from other vision tasks are leveraged as cross-domain supervision in each step. In this way, a novel UIE method is obtained via transfer learning, and the physics-aligned UIE decomposition provides theoretical soundness. Qualitative and quantitative experiments demonstrate that our proposal, based on the fusion of physics and priors, achieves SOTA performance in the UIE task and effectively boosts downstream vision tasks, significantly outperforming benchmark methods. Project repo: https://github.com/Haru2022/P2-UIE.

2606.15857 2026-06-16 cs.CV 新提交

A Dual-Branch Collaborative Framework for Joint Optimization of Underwater Image Enhancement and Object Detection

用于水下图像增强与目标检测联合优化的双分支协作框架

Liyuan Cao, Zheng Liu, Guanghao Liao, Yonghui Yang, Qi Li

发表机构 * School of Electronic and Information Engineering, University of Science and Technology Liaoning(辽宁科技大学电子与信息工程学院)

AI总结 提出一种双分支水下图像增强框架,通过细节增强和颜色恢复分支分别提升纹理细节和校正色偏,在提升视觉质量的同时兼顾检测性能与效率,在URPC数据集上使YOLOv8的mAP50提升2.1%。

AI中文摘要

由于波长依赖的光吸收和散射,水下图像通常存在颜色失真和细节模糊,这限制了水下目标检测的性能。现有的水下图像增强方法主要关注视觉质量提升,但仍难以平衡增强质量、处理效率和下游检测性能。因此,本文提出一种高效的双分支水下图像增强框架用于目标检测。细节增强分支通过提升亮度和局部对比度来恢复暗区域的纹理细节。颜色恢复分支使用自适应补偿来减少颜色失真并改善色彩层次。通过结合两个分支的互补输出,所提框架为目标检测提供更清晰、信息更丰富的图像。在UIEB和EUVP数据集上,所提方法分别达到2.249和2.576的UIQM分数。当应用于URPC数据集上的YOLOv8检测任务时,与基线相比,所提方法将mAP50提升了2.1%。大量实验表明,我们的方法在复杂水下场景中改善了目标检测,同时平衡了增强质量和处理效率。

英文摘要

Due to wavelength-dependent light absorption and scattering, underwater images usually suffer from color distortion and blurred details, which limits underwater object detection performance. Existing underwater image enhancement methods mainly focus on improving visual quality and still struggle to balance enhancement quality, processing efficiency, and downstream detection performance. Therefore, this paper proposes an efficient dual-branch underwater image enhancement framework for object detection. The detail enhancement branch improves brightness and local contrast to recover texture details in dark regions. The color restoration branch uses adaptive compensation to reduce color distortion and improve color gradation. By combining the complementary outputs of the two branches, the proposed framework provides clearer and more informative images for object detection. On the UIEB and EUVP datasets, the proposed method achieves UIQM scores of 2.249 and 2.576, respectively. When applied to the YOLOv8 detection task on the URPC dataset, the proposed method improves mAP50 by 2.1% compared with the baseline. Extensive experiments show that our method improves object detection in complex underwater scenes, while balancing enhancement quality and processing efficiency.

2606.16031 2026-06-16 cs.CV 新提交

The Third Challenge on Image Denoising at NTIRE 2026: Methods and Results

NTIRE 2026图像去噪挑战赛第三轮:方法与结果

Lei Sun, Hang Guo, Bin Ren, Shaolin Su, Xian Wang, Danda Pani Paudel, Luc Van Gool, Radu Timofte, Yawei Li

发表机构 * ETH Zurich(苏黎世联邦理工学院) University of Würzburg(维尔茨堡大学) Beijing University of Posts and Telecommunications(北京邮电大学) Tianjin University(天津大学) Nanjing University of Science and Technology(南京理工大学) University of Beira Interior(贝拉内大学) Siddaganga Institute of Technology(西达甘加理工学院) National Institute of Technology Karnataka(卡纳塔克邦国立理工学院) Sardar Vallabhbhai National Institute of Technology(萨达尔·瓦拉巴伊·帕特尔国立理工学院) University of Luxembourg(卢森堡大学) University of Twente(特温特大学) University of Kragujevac(克拉古耶瓦茨大学) Prince Sultan University(苏丹王子大学) University of Tunis El Manar(突尼斯埃尔马纳尔大学) University of Electronic Science and Technology of China(电子科技大学) Wuhan University(武汉大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Peng Cheng Laboratory(鹏城实验室)

AI总结 报告NTIRE 2026高噪声图像去噪挑战赛,参赛团队采用先进神经网络架构,以PSNR为指标,在无约束条件下实现最先进性能。

Comments Accepted by CVPRW 2026

AI中文摘要

本文报告了NTIRE 2026图像去噪挑战赛,特别关注高噪声场景(σ=50)。该竞赛研究了旨在从加性高斯白噪声(AWGN)污染的图像中恢复高保真细节的先进神经架构。与受约束的基准不同,本赛道强调峰值定量性能,以峰值信噪比(PSNR)衡量,且不限制参数数量或计算开销。通过综合116名注册者中20个入围团队的贡献,本报告对最新的技术创新进行了基准测试,并提供了无约束图像恢复领域当前最先进技术的全面快照。

英文摘要

This paper reports on the NTIRE 2026 Challenge on Image Denoising, specifically focusing on the high-noise regime ($\sigma = 50$). The competition investigates advanced neural architectures designed to restore high-fidelity details from images corrupted by additive white Gaussian noise (AWGN). Unlike constrained benchmarks, this track emphasizes peak quantitative performance, measured by Peak Signal-to-Noise Ratio (PSNR), without limitations on parameter count or computational overhead. By synthesizing contributions from 20 finalist teams out of 116 registrants, this report benchmarks the latest technical innovations and provides a comprehensive snapshot of the current state of the art in unconstrained image restoration.
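To make the setting concrete, the sketch below corrupts a flat gray signal with AWGN at $\sigma = 50$ on a 0 to 255 scale and measures the PSNR of the noisy input. It assumes no clipping or quantisation of the noisy values, which may differ from the challenge's data preparation; function names and the toy signal are illustrative.

```python
import math
import random

def add_awgn(pixels, sigma=50.0, seed=0):
    """Add zero-mean Gaussian noise with standard deviation sigma (0-255 scale)."""
    rng = random.Random(seed)
    return [p + rng.gauss(0.0, sigma) for p in pixels]

def psnr(reference, estimate, peak=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized pixel lists."""
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(peak ** 2 / mse)

clean = [128.0] * 10000          # toy "image": a flat mid-gray patch
noisy = add_awgn(clean)
print(round(psnr(clean, noisy), 2))
```

Without clipping, the expected PSNR of the noisy input is $10\log_{10}(255^2/50^2) \approx 14.15$ dB, and this is the baseline that a denoiser's output PSNR is measured against.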

2606.16082 2026-06-16 cs.CV cs.AI 新提交

Tool-IQA: Augmenting Image Quality Assessment with Simple Tools

Tool-IQA: 利用简单工具增强图像质量评估

Guanyi Qin, Junjie Zhang, Chunming He, Yibing Fu, Jie Liang, Tianhe Wu, Lei Zhang

发表机构 * National University of Singapore(新加坡国立大学) OPPO Research Institute(OPPO研究院) Nanyang Technological University(南洋理工大学) Duke University(杜克大学) City University of Hong Kong(香港城市大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出Tool-IQA,通过为视觉语言模型配备放大镜和伽马校正器等简单工具,将被动评分转变为工具增强的工作流程,显著提升图像质量评估性能。

AI中文摘要

视觉语言模型(VLM)越来越多地被用于图像质量评估(IQA)。然而,当前方法通常采用静态的一次性评分范式,而人类通过动态视觉检查(例如,选择性调整视图以验证细节和细微伪影)来评估图像质量。具体来说,仅依赖单次观察存在两个主要限制:首先,仅在全局尺度上感知图像限制了对更精细局部细节的评估;其次,图像的原始强度分布可能压倒可见性,导致对图像质量的检查不足。为了解决这些问题,我们提出了Tool-IQA,将评估机制从被动评分转变为工具增强的工作流程。特别地,我们为VLM配备了简单而有效的视图工具:用于检查局部细节的放大镜,以及用于揭示可见性和隐藏伪影的伽马校正器。评估遵循一个结构化的流程,包括带有评分标准的初始观察、工具增强的深入检查以及最终校准质量分数的量化。此外,为了确保高效且有目的地调用工具,我们引入了一种批量感知的训练策略,以奖励能够产生积极贡献的工具交互,而不仅仅是鼓励使用。在各种IQA基准上的实验表明,通过有效的工具调用和校准评估,我们提出的Tool-IQA显著优于现有最先进的模型,例如,在具有挑战性的CLIVE数据集上实现了0.854的PLCC。

英文摘要

Vision-Language Models (VLMs) have been increasingly adopted for Image Quality Assessment (IQA). However, current methods typically employ a static one-shot scoring paradigm, despite the fact that humans assess image quality through dynamic visual inspection, e.g., selectively adjusting views to verify details and subtle artifacts. Specifically, relying solely on a single-pass observation introduces two primary limitations: first, perceiving the image only at a global scale restricts the assessment of finer local details; second, the original intensity distribution of the image may limit visibility, leading to insufficient inspection of image quality. To address these issues, we propose Tool-IQA, shifting the assessment mechanism from passive scoring to a tool-augmented workflow. In particular, we equip VLMs with simple yet effective view tools: a Magnifier to inspect local details, and a Gamma Corrector to improve visibility and reveal hidden artifacts. The assessment follows a structured pipeline that consists of an initial observation with rubric notes, a tool-augmented in-depth inspection, and a final quantification for a calibrated quality score. Furthermore, to ensure efficient and purposeful tool calls, we introduce a batch-aware training strategy that rewards tool interactions that yield positive contributions rather than simply encouraging usage. Experiments on a variety of IQA benchmarks demonstrate that, with effective tool calling and calibrated assessment, our proposed Tool-IQA significantly outperforms existing state-of-the-art models, e.g., it achieves a PLCC of 0.854 on the challenging CLIVE dataset.
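The two view tools can be pictured as plain image operations. The sketch below is an assumption about what such tools might look like, not the Tool-IQA interface: `gamma_correct` uses the convention out = in ** gamma on intensities in [0, 1], and `magnify` crops a window and upsamples it by nearest-neighbour repetition.

```python
def gamma_correct(image, gamma):
    """Apply out = in ** gamma to intensities in [0, 1]; gamma < 1 brightens."""
    return [[px ** gamma for px in row] for row in image]

def magnify(image, top, left, height, width, scale):
    """Crop a window and upsample it by an integer factor (nearest neighbour)."""
    crop = [row[left:left + width] for row in image[top:top + height]]
    zoomed = []
    for row in crop:
        wide = [px for px in row for _ in range(scale)]
        zoomed.extend([list(wide) for _ in range(scale)])
    return zoomed

# Toy 2x2 dark image with hypothetical intensities.
image = [[0.04, 0.09], [0.16, 0.25]]
brightened = gamma_correct(image, 0.5)   # square root lifts dark values
patch = magnify(image, 0, 1, 1, 1, 3)    # 3x zoom on the top-right pixel
print(brightened)
print(patch)
```

In a tool-augmented workflow, the model would choose the window and the gamma value itself and then re-inspect the transformed view before scoring.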

2606.16159 2026-06-16 cs.CV 新提交

Continuous Splatting meets Retinex: Continuous Gaussian Splatting and Implicit Reflectance Modeling for Low-Light Image Enhancement

连续Splatting遇见Retinex:用于低光图像增强的连续高斯Splatting与隐式反射建模

Yuhan Chen, Yicui Shi, Guofa Li, Wenxuan Yu, Ying Fang, Guangrui Bai, Wenbo Chu, Keqiang Li

发表机构 * College of Mechanical and Vehicle Engineering, Chongqing University(重庆大学机械与运载工程学院) School of Engineering Science, University of Science and Technology of China(中国科学技术大学工程科学学院) National Innovation Center of Intelligent and Connected Vehicles(国家智能网联汽车创新中心) School of Vehicle and Mobility, Tsinghua University(清华大学车辆与运载学院)

AI总结 提出CGS-Retinex框架,结合连续高斯Splatting与Retinex理论,通过连续参数场估计全局光照,并利用隐式神经表示独立建模反射率,实现低光图像增强,有效抑制噪声和过曝,恢复高频结构和色彩。

AI中文摘要

低光图像增强旨在从低照度观测中恢复清晰图像,对高级下游视觉任务至关重要。然而,现有方法在平衡全局平滑光照调整和局部高频细节恢复时经常遇到颜色失真和结构伪影。为了解决这些问题,我们提出了CGS-Retinex,这是第一个基于显式-隐式联合建模的低光图像增强框架。我们的框架深度融合了连续高斯Splatting与Retinex理论。具体来说,我们将图像网格表示为连续参数场,并提出连续高斯渲染器来估计空间连续的全局光照分布。这种方法从根本上消除了离散高斯采样引起的网格伪影。此外,我们引入隐式神经表示来独立建模反射率。我们利用浅层高频特征引导网络准确重建退化的纹理细节。在Retinex框架内,我们加入了物理启发的亮度一致性约束和光照平滑正则化,使显式光照和隐式反射率能够保持适当曝光,并实现高频结构和颜色的高保真恢复。大量实验表明,CGS-Retinex通过精确解耦光照和纹理,显著抑制了暗区噪声和过曝,同时实现了卓越的高频结构保真度和色彩恢复。这项工作为低光图像增强建立了一种新的连续物理表示范式。

英文摘要

Low-light image enhancement aims to recover clear images from low-illumination observations and is crucial for high-level downstream vision tasks. However, existing methods frequently encounter color distortion and structural artifacts when balancing global smooth illumination adjustment and local high-frequency detail recovery. To address these issues, we propose CGS-Retinex as the first low-light image enhancement framework based on explicit-implicit joint modeling. Our framework deeply integrates continuous Gaussian splatting with Retinex theory. Specifically, we represent the image grid as a continuous parameter field and propose a continuous Gaussian renderer to estimate the spatially continuous global illumination distribution. This approach fundamentally eliminates grid artifacts caused by discrete Gaussian sampling. Furthermore, we introduce an implicit neural representation to model reflectance independently. We leverage shallow high-frequency features to guide the network in accurately reconstructing degraded texture details. Within the Retinex framework, we incorporate physics-inspired brightness consistency constraints and illumination smoothness regularization to enable explicit illumination and implicit reflectance to maintain proper exposure and achieve high-fidelity recovery of high-frequency structures and colors. Extensive experiments demonstrate that CGS-Retinex significantly suppresses dark-region noise and overexposure while achieving exceptional high-frequency structural fidelity and color restoration by precisely decoupling illumination and texture. This work establishes a novel continuous physical representation paradigm for low-light image enhancement.
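For readers unfamiliar with Retinex, the classical decomposition treats an image as the product of reflectance and illumination, $I = R \odot L$. The sketch below shows that decomposition with a simple gamma curve applied to the illumination. The gamma adjustment is a common textbook choice used here as an assumption; it is not the CGS-Retinex renderer or its implicit reflectance network, and the toy values are illustrative.

```python
def retinex_enhance(image, illumination, gamma=0.5, eps=1e-6):
    """Divide out the illumination, brighten it with a gamma curve, recombine."""
    enhanced = []
    for i_row, l_row in zip(image, illumination):
        row = []
        for i, l in zip(i_row, l_row):
            reflectance = i / (l + eps)                        # R = I / L
            row.append(min(1.0, reflectance * (l ** gamma)))   # I' = R * L^gamma
        enhanced.append(row)
    return enhanced

# Toy example: a dim scene with uniform illumination of 0.04.
image = [[0.02, 0.04]]
illumination = [[0.04, 0.04]]
result = retinex_enhance(image, illumination)
print(result)
```

The relative contrast between the two pixels is preserved because only the illumination term is rescaled. This is why a smooth, artifact-free illumination estimate matters for the final result.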

2606.16163 2026-06-16 cs.CV 新提交

Dehaze-GaussianImage: Zero-Shot Dehazing via Efficient 2D Gaussian Splatting Representation

Dehaze-GaussianImage:基于高效2D高斯泼溅表示的零样本去雾

Yuhan Chen, Wenxuan Yu, Guofa Li, Kunyang Huang, Ying Fang, Yicui Shi, Wenbo Chu, Keqiang Li

发表机构 * College of Mechanical and Vehicle Engineering, Chongqing University(重庆大学机械与运载工程学院) Department of Electrical and Computer Engineering, Carnegie Mellon University(卡内基梅隆大学电气与计算机工程系) National Innovation Center of Intelligent and Connected Vehicles(国家智能网联汽车创新中心) School of Vehicle and Mobility, Tsinghua University(清华大学车辆与运载学院)

AI总结 提出首个将2D高斯泼溅引入图像去雾的零样本框架,通过重建-解耦学习策略嵌入大气散射模型,实现几何级解耦和清晰纹理重建,以最少参数达到无监督SOTA性能。

AI中文摘要

现有的单图像去雾方法通常受限于像素级优化的计算冗余和隐式神经网络的缺乏物理可解释性。这些限制阻碍了表示效率与重建保真度之间的平衡。为了解决这些问题,我们提出了Dehaze-GaussianImage,这是第一个将2D高斯泼溅(2DGS)引入图像去雾领域的零样本框架,打破了传统的像素网格处理范式。与静态卷积神经网络(CNN)或Transformer不同,我们的方法将有雾图像建模为连续且动态演变的各向异性高斯场。具体来说,我们提出了一种新颖的重建-解耦零样本学习策略,将大气散射模型嵌入高斯参数空间。该策略驱动高斯基元在优化过程中自适应地分裂、克隆和修剪,实现传输介质和清晰纹理的几何级解耦。此外,引入了显式的结构保持约束,以抑制传统物理先验常引起的伪影。实验结果表明,所提出的方法以最少的参数在全无监督方式下实现了最先进的性能,突显了显式高斯表示在低级视觉任务中的潜力。

英文摘要

Existing single image dehazing methods are often constrained by computational redundancy in pixel-level optimization and the lack of physical interpretability in implicit neural networks. These limitations hinder the balance between representation efficiency and reconstruction fidelity. To address these issues, we propose Dehaze-GaussianImage, the first zero-shot framework that introduces 2D Gaussian Splatting (2DGS) into the image dehazing domain to break the traditional pixel-grid processing paradigm. Distinct from static convolutional neural networks (CNNs) or Transformers, our approach models hazy images as continuous and dynamically evolvable anisotropic Gaussian fields. Specifically, we propose a novel reconstruction-decoupling zero-shot learning strategy that embeds the atmospheric scattering model into the Gaussian parameter space. This strategy drives Gaussian primitives to adaptively split, clone, and prune during optimization, achieving geometric-level decoupling of the transmission medium and clear textures. Furthermore, explicit structure-preserving constraints are introduced to suppress artifacts commonly caused by traditional physical priors. Experimental results demonstrate that the proposed method achieves state-of-the-art (SOTA) performance in a fully unsupervised manner with minimal parameters, highlighting the potential of explicit Gaussian representation for low-level vision tasks.
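The atmospheric scattering model mentioned above is usually written as $I = J\,t + A\,(1 - t)$, with hazy image $I$, clear scene $J$, transmission $t$, and atmospheric light $A$. The sketch below applies and inverts this model on a few scalar pixels. It illustrates only the physical model, not the Gaussian-splatting optimisation, and the lower bound `t_min` is a common safeguard assumed here.

```python
def add_haze(clear, transmission, airlight):
    """Forward model: I = J * t + A * (1 - t), per pixel."""
    return [j * t + airlight * (1.0 - t) for j, t in zip(clear, transmission)]

def remove_haze(hazy, transmission, airlight, t_min=0.1):
    """Inverse model: J = (I - A) / max(t, t_min) + A, per pixel."""
    return [(i - airlight) / max(t, t_min) + airlight
            for i, t in zip(hazy, transmission)]

# Toy pixels with hypothetical transmission values (lower = denser haze).
clear = [0.2, 0.5, 0.8]
transmission = [0.9, 0.5, 0.2]
airlight = 0.9

hazy = add_haze(clear, transmission, airlight)
recovered = remove_haze(hazy, transmission, airlight)
print(hazy)
print(recovered)
```

When the transmission and atmospheric light are known, the inversion is exact. In practice the difficulty lies entirely in estimating them from a single hazy image.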

2606.16168 2026-06-16 cs.CV 新提交

Fi-Gaussian: Frequency-Aware Implicit Gaussian Splatting for Single Image Dehazing

Fi-Gaussian:面向单图像去雾的频率感知隐式高斯泼溅

Yuhan Chen, Ying Fang, Guofa Li, Wenxuan Yu, Yicui Shi, Kunyang Huang, Wenbo Chu, Keqiang Li

发表机构 * College of Mechanical and Vehicle Engineering, Chongqing University(重庆大学机械与运载工程学院) Department of Electrical and Computer Engineering, Carnegie Mellon University(卡内基梅隆大学电气与计算机工程系) National Innovation Center of Intelligent and Connected Vehicles(国家智能网联汽车创新中心) School of Vehicle and Mobility, Tsinghua University(清华大学车辆与运载学院)

AI总结 提出Fi-Gaussian网络,利用频率感知隐式高斯泼溅模块解耦高低频信息并自适应聚合,结合物理散射重归一化机制,实现单图像去雾的SOTA性能。

AI中文摘要

单图像去雾仍然受到高频细节丢失和精确物理散射建模困难的阻碍。为了解决这些问题,我们提出了Fi-Gaussian,一种用于单图像去雾的频率感知隐式高斯泼溅网络。与依赖3D点云的显式渲染方法不同,我们的方法采用隐式高斯泼溅,将清晰图像的潜在分布自适应地建模为2D特征空间中的连续表示。网络的核心是频率感知隐式高斯泼溅模块,它在频域中解耦低频结构信息和高频纹理信息,然后使用复值权重进行自适应高斯聚合以恢复精细细节。此外,引入了一种物理驱动的散射重归一化机制,在隐式高斯先验的指导下估计透射图和大气光。在多个基准数据集上的大量实验表明,Fi-Gaussian实现了最先进的定量性能,并产生视觉上更优的去雾结果,验证了隐式高斯泼溅在低级视觉任务中的有效性。

英文摘要

Single image dehazing continues to be hindered by the loss of high-frequency details and the difficulty of accurate physical scattering modeling. To address these issues, we propose Fi-Gaussian, a frequency-aware implicit Gaussian splatting network for single image dehazing. Unlike explicit rendering methods that rely on 3D point clouds, our method employs implicit Gaussian splatting to adaptively model the underlying distribution of clear images as a continuous representation in 2D feature space. The core of the network is a frequency-aware implicit Gaussian splatting module, which decouples low-frequency structural information and high-frequency texture information in the frequency domain and then performs adaptive Gaussian aggregation with complex-valued weights to recover fine details. In addition, a physics-driven scattering renormalization mechanism is introduced to estimate the transmission map and atmospheric light under the guidance of implicit Gaussian priors. Extensive experiments on multiple benchmark datasets demonstrate that Fi-Gaussian achieves state-of-the-art quantitative performance and produces visually superior dehazed results, validating the effectiveness of implicit Gaussian splatting for low-level vision tasks.

2606.16188 2026-06-16 cs.CV 新提交

TEASR: Training-Efficient Any-Step Diffusion Transformer for Real-World Image Super-Resolution

TEASR: 面向真实世界图像超分辨率的训练高效任意步扩散Transformer

Xiang Gao, Chenxin Zhu, Yushun Fang, Qiang Hu, Xiaoyun Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 提出TEASR框架,通过自对抗蒸馏和时步感知矫正策略,在单一扩散模型中实现任意步采样,无需辅助教师模型,显著提升训练效率并超越现有方法。

AI中文摘要

扩散模型因其强大的生成先验在真实世界图像超分辨率(Real-ISR)中表现出色,但存在迭代采样速度慢的问题。尽管现有的单步蒸馏方法加速了推理,但它们通常需要辅助教师模型,这会增加训练内存并限制其在大规模架构上的可扩展性。此外,这些固定步模型缺乏在速度和质量之间进行权衡的灵活性。在本文中,我们提出了TEASR,一种用于Real-ISR的训练高效任意步扩散框架,能够在统一模型内实现单步和多步恢复。我们的关键思想是在单个扩散模型内执行自对抗蒸馏,从而消除对辅助教师或判别器的需求。具体来说,我们提出了一种时步感知矫正策略,该策略稳定了跨噪声水平的单步生成。这两个设计进一步使得在单个GPU上蒸馏20B参数的扩散模型成为可能,显著提高了训练效率。此外,我们引入了一种具有解耦时步条件的双分支扩散Transformer,以分离当前噪声状态和去噪目标,从而提升采样质量。大量实验表明,TEASR支持无缝的任意步采样,并在多个数据集上持续优于最先进的方法。

英文摘要

Diffusion models excel in Real-World Image Super-Resolution (Real-ISR) due to their powerful generative priors but suffer from slow iterative sampling. Although existing one-step distillation methods accelerate inference, they typically require auxiliary teacher models that inflate training memory and restrict scalability to large-scale architectures. Furthermore, these fixed-step models lack the flexibility to trade off speed for quality. In this paper, we propose TEASR, a training-efficient any-step diffusion framework for Real-ISR that enables both one-step and multi-step restoration within a unified model. Our key idea is to perform self-adversarial distillation within a single diffusion model, eliminating the need for auxiliary teachers or discriminators. Specifically, we propose a timestep-aware rectification strategy that stabilizes one-step generation across noise levels. These two designs further enable the distillation of 20B-parameter diffusion models on a single GPU, significantly improving training efficiency. Moreover, we introduce a dual-branch diffusion transformer with decoupled timestep conditioning that separates the current noise state from the denoising target to enhance sampling quality. Extensive experiments demonstrate that TEASR supports seamless any-step sampling and consistently outperforms state-of-the-art methods across multiple datasets.

2606.16298 2026-06-16 cs.CV 新提交

DDTNet: Degradation Disentanglement and Transfer Network for Test-Time All-in-One De-weathering Adaptation

DDTNet:面向测试时全能去天气适应的退化解缠与迁移网络

Kuan-Hung Lin, Fu-Jen Tsai, Yan-Tsung Peng, Min-Hung Chen, Chia-Wen Lin, Yen-Yu Lin

发表机构 * National Yang Ming Chiao Tung University(国立阳明交通大学) National Tsing Hua University(国立清华大学) National Chengchi University(国立政治大学) NVIDIA(英伟达)

AI总结 提出DDTNet,通过解缠目标域退化模式并迁移至源域干净图像生成域自适应训练数据,微调恢复模型以提升跨天气和域的适应能力,核心是退化解缠模块(DDM)中的退化耦合注意力(DCA)。

AI中文摘要

全能型恶劣天气图像恢复旨在使用单一统一模型去除多种退化,如雨、雾和雪。尽管具有广泛适用性,现有方法通常以牺牲性能为代价,对单个退化类型提供平衡但次优的结果。当训练和测试数据之间存在域差距时,这一问题变得更加突出。受退化模式建模比恢复干净内容更可行的观察启发,我们提出了退化解缠与迁移网络(DDTNet),该网络专门关注退化迁移。通过从目标域退化图像中解缠退化模式并将其迁移到源域干净图像,DDTNet生成域自适应的配对训练数据。这些配对数据随后用于微调恢复模型,显著增强其在各种天气条件和域上的适应性。DDTNet的核心是退化解缠模块(DDM),该模块包含退化耦合注意力(DCA),用于捕获通用和特定天气特征,从而实现退化模式的有效解缠和迁移。实验结果表明,DDTNet在真实世界的去雨、去雪和去雾数据集上显著且一致地改进了现有的全能型模型。

英文摘要

All-in-one adverse weather image restoration aims to remove multiple degradations, such as rain, haze, and snow, using a single unified model. Despite their broad applicability, existing methods typically compromise performance, delivering balanced but suboptimal results for individual degradation types. This issue becomes more pronounced when a domain gap exists between training and testing data. Motivated by the observation that modeling degradation patterns is more feasible than recovering clean content, we propose the Degradation Disentanglement and Transfer Network (DDTNet), which focuses specifically on degradation transfer. By disentangling degradation patterns from target-domain degraded images and transferring them to source domain clean images, DDTNet generates domain-adaptive paired training data. These pairs are then used to fine-tune restoration models, significantly enhancing their adaptability across diverse weather conditions and domains. The core of DDTNet is the Degradation Disentanglement Module (DDM), which comprises Degradation Coupled Attention (DCA) to capture both general and weather-specific features, thereby enabling effective disentanglement and transfer of degradation patterns. Experimental results demonstrate that DDTNet significantly and consistently improves existing all-in-one models across real-world deraining, desnowing, and dehazing datasets.

2606.16392 2026-06-16 cs.CV 新提交

Towards UAV Image Dehazing: A UAV Atmospheric Scattering Model, Benchmark, and Geometry-Aware Deep Unfolding Network

面向无人机图像去雾:无人机大气散射模型、基准与几何感知深度展开网络

Wenxuan Fang, Jiangwei Weng, Yu Zheng, Junkai Fan, Guangfa Wang, Xiang Chen, Jian Yang, Jun Li

发表机构 * Nanjing University of Science and Technology(南京理工大学)

AI总结 提出无人机大气散射模型(UASM)描述非均匀雾分布,并设计几何感知深度展开网络(GP-DUN),通过隐式几何估计、几何感知梯度下降和池化专家近端映射模块实现高效去雾,在合成和真实数据集上超越现有方法。

AI中文摘要

在无人机应用中,雾霾会显著模糊远处细节并削弱结构信息,阻碍细节恢复。当前无人机场景仍面临两个关键挑战:(i) 真实世界的成对雾/干净图像难以获取,而经典大气散射模型不足以建模无人机图像中空间非均匀的雾霾;(ii) 现有去雾方法难以去除无人机图像上部区域积累的浓雾。为解决这些问题,我们首先提出无人机大气散射模型(UASM),该模型显式地结合飞行高度、俯仰角和消光系数来表征无人机成像中的非均匀雾霾分布。基于UASM,我们开发了一个物理驱动的去雾框架,称为几何感知近端深度展开网络(GP-DUN)。具体来说,GP-DUN由三个关键模块组成:隐式几何估计器(LGE),用于推断与无人机成像几何一致的透射率;几何感知梯度下降模块(GeoGDM),将UASM嵌入数据保真项并执行物理一致的闭式更新;以及池化专家近端映射模块(PE-PMM),学习隐式先验以恢复超出显式物理建模能力的纹理和结构。此外,我们进一步构建了UASM-HazeSet,它提供可控的成对合成数据以及2,285张真实无人机雾霾图像用于测试。大量实验表明,GP-DUN在UASM-HazeSet和真实无人机雾霾基准上均持续优于现有方法。

英文摘要

In UAV applications, haze significantly obscures distant details and weakens structural information, hindering the recovery of details. Current UAV scenarios still face two key challenges: (i) paired hazy/clean images from the real world are unobtainable, while the classical atmospheric scattering model is inadequate for modeling the spatially non-uniform haze in UAV imagery; (ii) existing dehazing methods struggle to remove the heavy haze accumulated in the upper regions of UAV images. To address these issues, we first propose a UAV Atmospheric Scattering Model (UASM), which explicitly incorporates flight altitude, viewing pitch, and extinction to characterize the non-uniform haze distribution in UAV imaging. Based on UASM, we develop a physics-driven dehazing framework, termed Geometry-aware Proximal Deep Unfolding Network (GP-DUN). Specifically, GP-DUN consists of three key modules: a Latent Geometry Estimator (LGE) that infers transmittance consistent with UAV imaging geometry, a Geometry-aware Gradient Descent Module (GeoGDM) that embeds UASM into the data-fidelity term and performs physics-consistent closed-form updates, and a Pooling-Expert Proximal Mapping Module (PE-PMM) that learns an implicit prior to restore textures and structures beyond the capability of explicit physical modeling. In addition, we construct UASM-HazeSet, which provides controllable paired synthetic data together with 2,285 real UAV haze images for testing. Extensive experiments show that GP-DUN consistently outperforms existing methods on both UASM-HazeSet and real UAV haze benchmarks.

2606.16396 2026-06-16 cs.CV eess.IV 新提交

SP$^3$: Spherical Priors for Plug-and-Play Restoration

SP$^3$:用于即插即用恢复的球面先验

Sean Man, Ron Raphaeli, Matan Kleiner, Or Ronai

发表机构 * Technion – Israel Institute of Technology(以色列理工学院) Independent Researcher(独立研究员)

AI总结 提出SP$^3$算法,用球面编码器替代去噪器作为生成先验,通过半二次分裂实现快速图像恢复,速度比零样本扩散方法快3-630倍。

AI中文摘要

在本文中,我们介绍了SP$^3$,一种新颖的即插即用算法,通过用球面编码器(SE)作为生成先验替代去噪器,加速最大后验图像恢复。SP$^3$利用SE紧密结构的潜在空间作为自然图像流形上的鲁棒投影,来近似难处理的近端先验步骤。通过半二次分裂,将该投影与闭式数据一致性步骤交替进行,实现了无需推理期间梯度计算的稳定收敛。这种独特的公式解锁了“任意时刻”恢复能力,从第一次迭代起就能产生清晰、合理的图像。在各种图像恢复任务上的评估表明,SP$^3$实现了与最先进的零样本扩散和流方法相当的感知质量,同时速度提升3-630倍。

英文摘要

In this paper, we introduce SP$^3$, a novel Plug-and-Play algorithm that accelerates maximum a posteriori image restoration by replacing denoisers with Spherical Encoders (SE) as generative priors. SP$^3$ approximates the intractable proximal prior step by utilizing the SE's tightly structured latent space as a robust projection onto the natural image manifold. Alternating this projection with a closed-form data-consistency step, via Half-Quadratic Splitting, achieves stable convergence without requiring gradient computation during inference. This unique formulation unlocks "anytime" restoration capabilities, producing sharp, plausible images from the first iteration. Evaluations across a variety of image restoration tasks demonstrate that SP$^3$ achieves perceptual quality comparable to state-of-the-art zero-shot diffusion and flow methods while being $3$-$630\times$ faster.
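The alternation described above follows the generic Half-Quadratic Splitting pattern, sketched below for the simplest case of an identity degradation (pure denoising). The Spherical Encoder projection is replaced by a hypothetical stand-in that clamps values to [0, 1]; the penalty weight `mu`, the iteration count, and all names are assumptions rather than details of SP$^3$.

```python
def hqs_restore(y, project, mu=1.0, iterations=5):
    """Alternate a prior projection with a closed-form data-consistency step."""
    x = list(y)
    for _ in range(iterations):
        z = project(x)  # prior step: project the current estimate
        # data step: argmin_x 0.5*(x - y)^2 + 0.5*mu*(x - z)^2, solved per element
        x = [(yi + mu * zi) / (1.0 + mu) for yi, zi in zip(y, z)]
    return x

def clamp_to_unit_range(x):
    """Hypothetical stand-in for a learned projection onto the image manifold."""
    return [min(1.0, max(0.0, v)) for v in x]

# Toy observation with two out-of-range values.
y = [-0.4, 0.5, 1.6]
restored = hqs_restore(y, clamp_to_unit_range)
print(restored)
```

Neither step needs a gradient through the prior: the projection is a single forward pass and the data step has a closed form. This is consistent with the gradient-free inference the abstract describes.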

2606.16795 2026-06-16 cs.CV 新提交

WaveDINO: Learning-Based Atmospheric Correction of Unwrapped InSAR Interferograms Validated by GNSS: Results at Laguna del Maule and Campi Flegrei Volcanoes

WaveDINO: 基于学习的解缠InSAR干涉图大气校正方法——通过GNSS验证:在Laguna del Maule和Campi Flegrei火山的结果

Robert Popescu, Juliet Biggs, Tianyuan Zhu, Nantheera Anantrasirichai

发表机构 * University of Bristol(布里斯托大学) NERC Centre for the Observation and Modelling of Earthquakes, Volcanoes, and Tectonics (COMET)(NERC地震、火山与构造观测与建模中心(COMET)) British Geological Survey(英国地质调查局)

AI总结 提出WaveDINO,一种基于小波的多尺度去噪框架,结合DINOv3基础模型特征和地形信息,通过混合训练策略(物理合成形变+真实大气噪声)校正InSAR干涉图大气相位延迟,在智利和意大利火山数据上优于现有方法,GNSS验证显示平均拟合误差在两个站点分别降低约3%和19%。

Comments 11 pages, 6 figures

AI中文摘要

干涉合成孔径雷达(InSAR)能够有效监测火山形变;然而,观测信号常受到大气相位延迟、季节性地表变化和去相关效应的干扰。现有的大气校正方法,如基于数值天气模型的方法,可以减少这些影响,但无法始终消除大气伪影,并可能引入残余偏差。为解决这些局限性,我们提出了一种新颖的基于学习的解缠InSAR干涉图去噪方法,采用结合物理驱动合成形变与真实大气噪声的混合训练策略。具体而言,我们引入了WaveDINO,一种基于小波的多尺度去噪框架,其条件依赖于冻结的DINOv3基础模型特征和地形信息。训练使用叠加在短周期干涉图上的合成岩浆源形变,使网络暴露于真实大气统计特征的同时保留已知真值。性能在受控合成数据和来自智利Laguna del Maule及意大利Campi Flegrei的长期真实干涉图上进行评估,并使用独立的GNSS测量进行验证。WaveDINO持续优于竞争模型,提高了与GNSS测量的一致性,在两个站点分别将平均GNSS拟合误差降低了约3%和19%,同时超越了基于天气模型的校正方法。

英文摘要

Interferometric Synthetic Aperture Radar (InSAR) enables effective monitoring of volcanic deformation; however, the observed signals are often corrupted by atmospheric phase delays, seasonal surface changes, and decorrelation effects. Existing atmospheric correction methods, such as numerical weather model-based methods, can reduce these effects but do not consistently remove atmospheric artefacts and may introduce residual biases. To address these limitations, we propose a novel learning-based method for denoising unwrapped InSAR interferograms, using a hybrid training strategy that combines physically motivated synthetic deformation with real atmospheric noise. Specifically, we introduce WaveDINO, a wavelet-based multi-scale denoising framework conditioned on frozen DINOv3 foundation-model features and terrain information. Training uses synthetic magma-source deformation superimposed on short-term interferograms to expose the network to realistic atmospheric statistics while retaining known ground truth. Performance is evaluated on both controlled synthetic data and long-term real interferograms from Laguna del Maule (Chile) and Campi Flegrei (Italy), with independent GNSS measurements used for validation. WaveDINO consistently outperforms competing models, improving agreement with GNSS measurements, and reducing mean GNSS misfit by approximately 3% and 19% at two sites, respectively, while surpassing weather-model-based corrections.

2606.15352 2026-06-16 eess.IV cs.CV cs.GR 交叉投稿

Chroma-gated, differentiable OKLCH interpolation: Continuous Oklab fallback for color-cast reduction

色度门控、可微分的OKLCH插值:用于减少色偏的连续Oklab回退

Naoyuki Uchida

发表机构 * Independent Researcher(独立研究者)

AI总结 针对OKLCH插值在中性轴附近的两种色偏问题,提出一种可微分的色度门控函数,连续混合OKLCH和线性Oklab路径,在不依赖端点测试的情况下统一处理两种色偏,并验证了其有效性。

Comments 14 pages, 5 figures. Ancillary files: reproducibility scripts (symbolic verification, evaluation, and figure generation)

详情
AI中文摘要

OKLCH——Ottosson的Oklab颜色空间的圆柱形式(亮度、色度、色调)——是CSS Color 4推荐的用于渐变和color-mix()的插值空间,现已广泛部署。然而,其极坐标参数化在中性轴附近以两种方式产生色偏:(1)两个彩色端点之间的色调间绕行,经过非预期的色调(蓝色到黄色明显经过绿色);(2)当一个端点为消色差时,产生离线弯曲。现有补救措施统一为二值化——仅在消色差端点触发的阈值开关——因此它们仅处理(2);对于彩色对,它们都退化为原始OKLCH,未处理(1)色调间色偏。我们引入连续Oklab回退(COFb),一个单参数、可微分的色度门控$w(C)=C^n/(C^n+σ^n)$,随着色度下降,将OKLCH路径连续混合到线性Oklab路径。单个门控减少了二值化家族未处理的(1)色偏,并统一处理(1)和(2),无需任何端点测试。我们刻画了一个色偏-色调权衡边界,采用默认值($n=1$,有理Michaelis-Menten形式;对于典型sRGB调色板,$σ\approx0.19$,基于归一化无关的半色偏准则),并符号验证了门控的性质。在默认值下,COFb将色调间路径绕行减半(平均横向偏差-49.5%,色度加权色调偏移-35.5%)。我们还说明了该方法的局限性:仅针对(2),二值化开关仍然更好,并且像任何笛卡尔混合一样,COFb不保持色度。在部署中,COFb完全在普通Oklab (a,b)到sRGB中运行,因此它作为一种回退,在无法使用现代CSS颜色插值(color-mix(in oklch)等)的场合——旧引擎、图像和视频管线或GPU着色器——提供相同的减少色偏的渐变。

英文摘要

OKLCH -- the cylindrical (lightness, chroma, hue) form of Ottosson's Oklab color space -- is the interpolation space recommended by CSS Color 4 for gradients and color-mix(), and it is now broadly deployed. Its polar parameterization, however, casts color near the neutral axis in two ways: (1) an inter-hue detour between two chromatic endpoints that sweeps through an unintended hue (blue to yellow visibly passing through green), and (2) an off-line bow when one endpoint is achromatic. Existing remedies are uniformly two-valued -- a threshold switch that fires only at an achromatic endpoint -- so they address only (2); on chromatic pairs every one of them reduces to raw OKLCH, leaving the (1) inter-hue cast untreated. We introduce Continuous Oklab fallback (COFb), a one-parameter, differentiable chroma gate $w(C)=C^n/(C^n+σ^n)$ that continuously blends the OKLCH path toward the linear Oklab path as chroma falls. A single gate reduces the (1) cast that the two-valued family leaves untreated and unifies the handling of (1) and (2) without any endpoint test. We characterize a cast-hue trade-off frontier, adopt a default ($n=1$, the rational Michaelis-Menten form; $σ\approx0.19$ for a typical sRGB palette, from a normalization-independent cast-half criterion), and verify the gate's properties symbolically. At the default, COFb halves the inter-hue path detour (mean lateral deviation -49.5%, chroma-weighted hue excursion -35.5%). We also state the method's limits: on (2) alone the two-valued switch remains better, and like any Cartesian blend COFb does not preserve chroma. In deployment, COFb runs entirely in plain Oklab (a,b) to sRGB, so it serves as a fallback that delivers the same cast-reduced gradients where modern CSS color interpolation (color-mix(in oklch) and the like) is unavailable -- older engines, image and video pipelines, or GPU shaders.
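
The gate formula above can be sketched in a few lines. This is a minimal illustration, not the author's reference implementation: the gate $w(C)=C^n/(C^n+σ^n)$ and the defaults `n = 1`, `sigma = 0.19` come from the abstract, while the choice to evaluate the gate at the chroma of the straight Oklab point, and all function names, are assumptions made here for the example. Inputs are Oklab `(L, a, b)` triples.

```python
import math


def chroma_gate(c, sigma=0.19, n=1.0):
    """w(C) = C^n / (C^n + sigma^n): 0 on the neutral axis, tends to 1 at high chroma."""
    cn = c ** n
    return cn / (cn + sigma ** n)


def lerp(a, b, t):
    return a + (b - a) * t


def oklab_path(p, q, t):
    """Straight line between two Oklab (L, a, b) points."""
    return tuple(lerp(pi, qi, t) for pi, qi in zip(p, q))


def oklch_path(p, q, t):
    """Interpolate lightness, chroma and hue (shortest arc); return Oklab (L, a, b)."""
    cp, cq = math.hypot(p[1], p[2]), math.hypot(q[1], q[2])
    hp, hq = math.atan2(p[2], p[1]), math.atan2(q[2], q[1])
    dh = (hq - hp + math.pi) % (2 * math.pi) - math.pi  # wrap to [-pi, pi)
    c, h = lerp(cp, cq, t), hp + dh * t
    return (lerp(p[0], q[0], t), c * math.cos(h), c * math.sin(h))


def gated_path(p, q, t, sigma=0.19, n=1.0):
    """Assumed blend: weight the OKLCH point by w and the Oklab point by 1 - w."""
    lin, pol = oklab_path(p, q, t), oklch_path(p, q, t)
    w = chroma_gate(math.hypot(lin[1], lin[2]), sigma, n)
    return tuple(w * a + (1.0 - w) * b for a, b in zip(pol, lin))


# Two opposite hues: the straight Oklab midpoint is neutral, so the gate is 0 there.
mid = gated_path((0.5, 0.2, 0.0), (0.5, -0.2, 0.0), 0.5)
```

With these toy endpoints the blended midpoint falls back to the neutral Oklab point instead of swinging through an intermediate hue, which is the behaviour the abstract describes for low-chroma regions.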

2606.16107 2026-06-16 eess.IV cs.CV cs.MM 交叉投稿

Variable-Rate Deep Image Compression based on Low-Rank Adaptation by Progressive Learning

基于渐进学习的低秩自适应变速率深度图像压缩

Xing-Yu Xu, Chen-Hsiu Huang, Ja-Ling Wu

发表机构 * National Taiwan University(台湾大学)

AI总结 提出一种基于低秩自适应(LoRA)的渐进学习方法,通过引入LoRA速率自适应模块(LoRAM)实现变速率深度图像压缩,在推理时不增加计算复杂度,参数存储节省99%,数据集节省90%,训练步骤节省97%。

详情
AI中文摘要

在数字时代,图像压缩对于众多应用至关重要,包括网络媒体、流媒体服务、高分辨率医学成像和车联网,能够实现高效的数据存储和传输。随着对高质量图像通信的需求日益增长,对先进压缩技术的需求也变得越来越关键。近年来,许多深度图像压缩(DIC)技术被提出,与传统标准相比表现出令人印象深刻的性能。然而,变速率图像压缩仍然是一个未解决的问题。特定的DIC方法部署多个网络以实现不同的压缩率,而其他方法使用单一模型,这通常会导致更高的计算复杂性和性能下降。本文提出了一种基于参数高效微调方法——低秩自适应(LoRA)的渐进学习变速率图像压缩方法。我们在DIC方法中引入了一个额外的LoRA速率自适应模块(LoRAM)。由于LoRA的重参数化合并,我们提出的方法在推理期间不会引入额外的计算复杂性。与使用多个模型的方法相比,综合实验表明,我们的方法实现了具有竞争力的性能,在参数存储上节省了99%,数据集节省了90%,训练步骤节省了97%。

英文摘要

In the digital age, image compression is crucial for numerous applications, including web media, streaming services, high-resolution medical imaging, and connected vehicle networks, enabling efficient data storage and transmission. With the increasing demand for high-quality image communication, the need for advanced compression techniques becomes increasingly critical. Numerous Deep Image Compression (DIC) techniques have recently been introduced, showing impressive performance compared to traditional standards. However, variable-rate image compression remains an unresolved issue. Specific DIC methods deploy multiple networks to attain different compression rates, whereas others use a single model, which often results in higher computational complexity and reduced performance. This work proposes a progressive learning approach for variable-rate image compression based on the parameter-efficient fine-tuning method Low-Rank Adaptation (LoRA). We introduce an additional LoRA Rate-Adaptive Module (LoRAM) in DIC methods. Due to the re-parameterized merging of LoRA, our proposed method does not introduce additional computational complexity during inference. Compared to methods utilizing multiple models, comprehensive experiments demonstrate that our approach achieves competitive performance, saving 99% in parameter storage, 90% in datasets, and 97% in training steps.
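
The claim that LoRA adds no inference cost rests on re-parameterized merging: the low-rank update `B @ A` can be folded into the frozen weight once, after which a single matrix product remains. The sketch below shows this on toy 2x2 values in plain Python. It illustrates generic LoRA merging only; the internals of LoRAM are not described in the abstract, the usual LoRA scaling factor is omitted for brevity, and all values are made up.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    """Element-wise sum of two equally shaped matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


W = [[1.0, 2.0], [3.0, 4.0]]   # frozen base weight (2x2), toy values
B = [[0.5], [1.0]]             # low-rank factor (2x1)
A = [[2.0, 1.0]]               # low-rank factor (1x2)
x = [[1.0], [2.0]]             # column input

# Unmerged: base path plus a separate low-rank path (two extra matmuls).
y_adapter = add(matmul(W, x), matmul(B, matmul(A, x)))

# Merged: fold B @ A into W once, then run a single matmul at inference.
W_merged = add(W, matmul(B, A))
y_merged = matmul(W_merged, x)
```

Both paths give the same output, so the adapter only costs storage for the small factors `B` and `A`, which is consistent with the parameter-storage saving the abstract reports.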

2606.16261 2026-06-16 physics.optics cs.CV cs.NE physics.app-ph 交叉投稿

Wavelength-Multiplexed 2D Beam Steering via a Passive Diffractive Network

通过无源衍射网络实现波长复用的二维光束偏转

Che-Yung Shen, Yuhang Li, Cagatay Isil, Tianyi Gan, Mona Jarrahi, Aydogan Ozcan

发表机构 * Electrical and Computer Engineering Department, University of California, Los Angeles(加州大学洛杉矶分校电气与计算机工程系) Bioengineering Department, University of California, Los Angeles(加州大学洛杉矶分校生物医学工程系) California NanoSystems Institute (CNSI), University of California, Los Angeles(加州大学洛杉矶分校加州纳米系统研究所)

AI总结 提出一种无源衍射光学网络,利用波长作为控制参数实现任意二维光束偏转,通过深度学习优化级联衍射层,数值和实验验证了625个波长通道的25x25偏转阵列,具有亚波长精度和高信道保真度。

Comments 20 Pages, 4 Figures

详情
AI中文摘要

我们引入了一种波长可寻址的衍射光学网络,将照明波长转化为高维控制参数,用于任意可编程的二维光束偏转。所提出的无源架构包括级联的空间优化衍射层,通过深度学习联合设计,以快速将不同波长映射到预定义/期望的输出角度。与受限于一维线性映射的传统单层色散光学元件不同,该框架利用复杂的波前变换,将照明波长作为内在寻址键,实现任意二维光束偏转,无需机械扫描或电子相位控制。我们在数值上演示了覆盖400-750 nm的625个波长通道的波长控制光束偏转,实现了25x25独立可寻址光束位置阵列,具有亚波长定位精度和高信道保真度。与将波长路由限制在线性轨迹的传统光栅不同,所提出的衍射网络执行非局域波前变换,能够在二维视场内实现任意波长到角度的映射。我们进一步在太赫兹和可见光谱范围内实验验证了所提出的框架,展示了使用3D打印的无源衍射层在太赫兹频率和可见光谱中的相位型空间光调制器实现的波长复用光束偏转。这种波长可寻址的衍射架构为高速可编程光束偏转建立了一种紧凑且可扩展的范式,在光通信、路由、成像、传感以及新兴光子信息处理系统中具有潜在应用。

英文摘要

We introduce a wavelength-addressable diffractive optical network that transforms illumination wavelength into a high-dimensional control parameter for arbitrarily programmable 2D beam steering. The proposed passive architecture comprises cascaded spatially optimized diffractive layers, jointly designed using deep learning, to rapidly map distinct wavelengths to predefined/desired output angles. Unlike conventional single-layer dispersive optical elements, which are physically restricted to 1D linear mapping, this framework harnesses complex wavefront transformations to utilize the illumination wavelength as an intrinsic addressing key for arbitrary 2D beam steering, eliminating the need for mechanical scanning or electronic phase control. We numerically demonstrate wavelength-controlled beam steering across 625 wavelength channels spanning 400-750 nm, realizing a 25 x 25 array of independently addressable beam positions with subwavelength positioning accuracy and high channel fidelity. Unlike conventional gratings, which constrain wavelength routing to a linear trajectory, the proposed diffractive network performs nonlocal wavefront transformations, enabling arbitrary wavelength-to-angle mappings across a 2D field of view. We further validate the proposed framework experimentally in both the terahertz and visible spectral regimes, demonstrating wavelength-multiplexed beam steering using 3D fabricated passive diffractive layers at terahertz frequencies and phase-only spatial light modulators in the visible spectrum. This wavelength-addressable diffractive architecture establishes a compact and scalable paradigm for high-speed programmable beam steering, with potential applications in optical communications, routing, imaging, sensing, and emerging photonic information-processing systems.

2606.17048 2026-06-16 cs.LG cs.CV stat.ML 交叉投稿

Exact Posterior Score Estimation for Solving Linear Inverse Problems

精确后验分数估计用于求解线性逆问题

Abbas Mammadov, Ozgur Kara, Kaan Oktay, Iskander Azangulov, Adil Kaan Akan, Hyungjin Chung, James Matthew Rehg, Yee Whye Teh

发表机构 * University of Oxford(牛津大学) UIUC(伊利诺伊大学厄巴纳-香槟分校) EverEx

AI总结 提出精确后验分数(EPS)方法,通过闭式后验分数将线性逆问题转化为去噪问题,无需梯度或投影,在FFHQ和ImageNet上优于现有方法。

详情
AI中文摘要

扩散和基于流的模型通过训练去噪器来逆转高斯损坏,从而学习强大的数据先验。为了利用这一先验解决线性逆问题,需要从后验中采样,但先验提供的分数是无条件分数,而非后验分数。现有方法要么使用近似测量匹配校正来引导固定的预训练去噪器,要么训练一个放弃先验去噪结构的条件恢复模型。我们在一般高斯插值下推导了线性高斯逆问题的精确后验分数闭式,并表明后验采样可归结为在算子依赖的偏移枢轴和各向异性噪声协方差下的去噪问题。我们将这一恒等式转化为精确后验分数(EPS),这是一种去噪训练目标,保留了标准预训练的输入/输出结构,因此可以从头训练或从预训练去噪器微调。在推理时,EPS使用与底层骨干相同的采样器,无需似然梯度或投影。我们在FFHQ和ImageNet上的五个线性逆问题上评估了EPS,在保真度、感知和分布指标上优于无训练和基于训练的基线,同时使用的去噪器评估次数比基于梯度的后验采样器少大约一个数量级。

英文摘要

Diffusion and flow-based models learn powerful data priors by training a denoiser to reverse Gaussian corruption. To use this prior to solve a linear inverse problem, one needs to sample from the posterior, but the score that the prior provides is the unconditional score, not the posterior score. Existing methods either steer a fixed pretrained denoiser with approximate measurement-matching corrections, or train a conditional restoration model that abandons the denoising structure of the prior. We derive the exact posterior score in closed form for linear Gaussian inverse problems under general Gaussian interpolants, and show that posterior sampling reduces to a denoising problem at an operator-dependent shifted pivot under an anisotropic noise covariance. We turn this identity into Exact Posterior Score (EPS), a denoising training objective that preserves the input/output structure of standard pretraining and can therefore be trained from scratch or fine-tuned from a pretrained denoiser. At inference, EPS uses the same sampler as the underlying backbone, with no likelihood gradients or projections. We evaluate EPS on five linear inverse problems across FFHQ and ImageNet, where it outperforms training-free and training-based baselines on fidelity, perceptual, and distributional metrics, while using roughly an order of magnitude fewer denoiser evaluations than gradient-based posterior samplers.
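
The gap between the unconditional score and the posterior score can be made explicit with Bayes' rule. The identity below is a standard fact, not the paper's closed-form result; the symbols $x_t$ (noisy state), $x_0$ (clean image), $y$ (measurement), $A$ (linear operator) and $\sigma_y$ (measurement noise level) are notation assumed here for illustration.

```latex
\nabla_{x_t} \log p(x_t \mid y)
  = \underbrace{\nabla_{x_t} \log p(x_t)}_{\text{unconditional score (from the prior)}}
  + \underbrace{\nabla_{x_t} \log p(y \mid x_t)}_{\text{measurement term}},
\qquad
y = A x_0 + n, \quad n \sim \mathcal{N}(0, \sigma_y^2 I).
```

The measurement term depends on the noisy state $x_t$ rather than on $x_0$, so it requires marginalizing over $x_0$ and is generally approximated by training-free samplers. Per the abstract, EPS instead works with a closed form of the full posterior score for the linear Gaussian case.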

2511.12024 2026-06-16 cs.CV 版本更新

Null-Space Diffusion Distillation Unlocks Speed, Fidelity and Realism in Lensless Imaging

零空间扩散蒸馏解锁无透镜成像的速度、保真度和真实感

Jose Reinaldo Cunha Santos A V Silva Neto, Hodaka Kawachi, Yasushi Yagi, Tomoya Nakamura

发表机构 * D3 Center, The University of Osaka(大阪大学D3中心) The University of Osaka(大阪大学) SANKEN, The University of Osaka(大阪大学产业科学研究所(SANKEN)) Grad. Sch. of Eng. Sci., The University of Osaka(大阪大学大学院基础工学研究科)

AI总结 提出零空间扩散蒸馏(NSDD),通过将结构化扩散先验蒸馏为前馈网络,实现单次高质量重建,兼顾测量一致性、感知质量和推理速度。

Comments 10 pages without references, 5 figures, 5 tables

详情
AI中文摘要

无透镜成像从高度复用的测量中重建场景,导致严重不适定的逆问题。在这项工作中,我们识别出无透镜重建范式在测量一致性、感知质量和推理速度之间的基本权衡。传统方法倾向于一致性但产生感知退化的结果,监督方法实现高质量重建和快速推理但可能违反物理约束,而扩散先验方法(特别是使用范围-零分解等结构化约束时)实现了高感知质量和一致性,但由于迭代采样仍然缓慢。基于这一观察,我们提出零空间扩散蒸馏(NSDD),一种单次重建模型,将结构化扩散先验推理蒸馏为高效的前馈网络。NSDD学习生成高质量重建,保持测量一致性,同时避免昂贵的迭代采样。实验结果表明,NSDD实现了与扩散先验方法相竞争的感知质量和一致性,同时提供显著更快的推理速度,并在所有三个目标之间实现了有利的平衡。此外,消融实验表明,蒸馏范围-零分解比非结构化全重建蒸馏提高了重建质量和鲁棒性,包括在未见过的真实场景上。这些结果凸显了结构感知蒸馏在高效无透镜成像中的潜力。代码见 github.com/JRCSAVSN/NullSpaceDiffusionDistillation。

英文摘要

Lensless imaging reconstructs scenes from highly multiplexed measurements, resulting in a severely ill-posed inverse problem. In this work, we identify a fundamental trade-off between measurement consistency, perceptual quality, and inference speed across lensless reconstruction paradigms. Traditional methods favor consistency but produce perceptually degraded results, supervised approaches achieve high-quality reconstructions with fast inference but may violate physical constraints, and diffusion-prior methods achieve high perceptual quality and consistency--particularly when structured constraints such as range-null decomposition are used--but remain slow due to iterative sampling. Motivated by this observation, we propose Null-Space Diffusion Distillation (NSDD), a single-pass reconstruction model that distills structured diffusion-prior inference into an efficient feed-forward network. NSDD learns to produce high-quality reconstructions that preserve measurement consistency while avoiding costly iterative sampling. Experimental results demonstrate that NSDD achieves perceptual quality and consistency competitive with diffusion-prior methods, while providing significantly faster inference and offering a favorable balance across all three objectives. Furthermore, ablation experiments show that distilling the range--null decomposition improves reconstruction quality and robustness over unstructured full-reconstruction distillation, including on unseen real scenes. These results highlight the potential of structure-aware distillation for efficient lensless imaging. Code is available at github.com/JRCSAVSN/NullSpaceDiffusionDistillation.
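
The range-null decomposition mentioned above can be illustrated on a toy problem. For a linear measurement `y = A x`, any estimate `x_hat` can be made measurement-consistent by composing `A^+ y + (I - A^+ A) x_hat`, where `A^+` is the pseudo-inverse. The sketch below assumes a single-row operator so that `A^+` reduces to `a / (a . a)`; a real lensless system would use a large convolution-like operator, and the values here are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def range_null_compose(a, y, x_hat):
    """Return A^+ y + (I - A^+ A) x_hat for a single-row operator A = a^T.

    For one row, the pseudo-inverse is A^+ = a / (a . a).
    """
    scale = 1.0 / dot(a, a)
    range_part = [ai * y * scale for ai in a]                 # A^+ y
    proj = dot(a, x_hat) * scale
    null_part = [xi - ai * proj for xi, ai in zip(x_hat, a)]  # (I - A^+ A) x_hat
    return [r + n for r, n in zip(range_part, null_part)]


a = [1.0, 1.0, 2.0]        # toy measurement row (hypothetical)
y = 6.0                    # observed measurement
x_hat = [0.5, 2.0, 1.0]    # any estimate, e.g. from a generative prior
x = range_null_compose(a, y, x_hat)
```

The range part is fixed by the measurement, so only the null-space part depends on the prior. This is presumably why distilling that structure, rather than the full reconstruction, helps preserve consistency, though the abstract does not spell out the mechanism.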

2511.12572 2026-06-16 cs.CV 版本更新

Through-Foliage Surface-Temperature Reconstruction for Early Wildfire Detection

面向早期野火检测的穿透植被地表温度重建

Mohamed Youssef, Lukas Brunner, Klaus Rundhammer, Gerald Czech, Oliver Bimber

发表机构 * Department of Computer Science, Johannes Kepler University(计算机科学系,约翰尼斯·开普勒大学) Fire Brigade St. Agatha(圣阿加塔消防队) Upper Austria Fire Brigade Headquarter(上奥地利消防队总部)

AI总结 结合信号处理与机器学习,通过合成孔径传感和视觉状态空间模型从模糊数据中恢复热信号,实现无人机自动野火监测,在模拟和实地实验中显著降低温度重建误差。

详情
AI中文摘要

我们提出了一种通过结合信号处理和机器学习来重建森林植被下地表温度的方法,实现无人机全自动空中野火监测以进行早期火灾检测。合成孔径(SA)传感减少了树冠遮挡,但引入了热模糊。为了克服这一点,我们训练了一个视觉状态空间模型,从模糊数据中恢复部分遮挡的土壤和火灾热点的微弱热信号。为了解决真实训练数据有限的问题,我们使用潜在扩散模型、温度增强和程序化热森林建模生成逼真的地表温度模拟。在模拟数据集上,与传统热成像和未校正的SA成像相比,我们的方法将RMSE降低了2-2.5倍;在实地热点实验中,RMSE分别改善了12.8倍和2.6倍。我们的方法还能泛化到其他热信号,包括人体特征,捕捉其形态和范围——这在简单阈值法失效时至关重要——而传统成像难以处理部分遮挡情况。

英文摘要

We present a method to reconstruct surface temperatures through forest vegetation by combining signal processing and machine learning, enabling fully automated aerial wildfire monitoring with drones for early fire detection. Synthetic aperture (SA) sensing reduces canopy occlusion but introduces thermal blur. To overcome this, we train a visual state space model to recover subtle thermal signals of partially occluded soil and fire hotspots from blurred data. To address limited real-world training data, we generate realistic surface temperature simulations using a latent diffusion model, temperature augmentation, and procedural thermal forest modeling. On simulated datasets, our method reduces RMSE by a factor of 2-2.5 versus conventional thermal and uncorrected SA imaging; in field experiments on hotspots, RMSE improved by 12.8-fold and 2.6-fold, respectively. Our approach also generalizes to other thermal signals, including human signatures, capturing morphology and extent -- critical where simple thresholding fails -- while conventional imaging struggles with partial occlusion.

2601.19506 2026-06-16 cs.CV 版本更新

Bridging Information Asymmetry: A Hierarchical Framework for Blind Face Restoration with Reduced Uncertainty

弥合信息不对称:一种降低不确定性的分层盲脸修复框架

Zhengjian Yao, Jiakui Hu, Kaiwen Li, Hangzhou He, Xinliang Zhang, Shuang Zeng, Lei Zhu, Yanye Lu

发表机构 * Biomedical Engineering Department, College of Future Technology, Peking University(北京大学未来技术学院生物医学工程系) Institute of Medical Technology, Peking University Health Science Center(北京大学医学部医学技术研究所) National Biomedical Imaging Center, Peking University(北京大学国家生物医学成像中心)

AI总结 提出Pref-Restore分层框架,通过语义信息增强、纹理保真对齐和保真约束偏好优化,降低盲脸修复中的不确定性,实现身份敏感的高保真重建。

Comments Accepted by TPAMI

详情
AI中文摘要

盲脸修复仍然是一个持续的挑战,因为从严重受限的观测中重建整体结构本质上是病态的。当前的生成范式虽然能够合成逼真的面部细节,但仍然受到盲修复欠约束性质的限制,其中严重退化的输入可能映射到合理但身份不一致的输出。为了解决这个问题,我们提出了Pref-Restore,一种降低修复不确定性的分层BFR框架。我们的设计围绕三个互补原则组织:(1)语义信息增强,其中自回归语义分支将图像和文本线索转换为结构化标记,提供稳定的高层锚点;(2)纹理级保真对齐,其中扩散生成器在此锚点下训练以恢复身份相关细节;(3)保真约束偏好优化,其中面部感知奖励在控制质量-保真权衡的同时优化扩散轨迹。在合成和真实世界基准上的大量实验表明,Pref-Restore实现了最先进的性能,具有更强的身份敏感保真度和更低的重复采样不确定性。系统的消融实验进一步将这些增益归因于所提出的分层设计,展示了分阶段训练的必要性、文本路径的鲁棒性和质量依赖性,以及保真约束偏好优化的好处。

英文摘要

Blind face restoration remains a persistent challenge due to the inherent ill-posedness of reconstructing holistic structures from severely constrained observations. Current generative paradigms, while capable of synthesizing realistic facial details, remain limited by the under-constrained nature of blind restoration, where severely degraded inputs can be mapped to plausible yet identity-inconsistent outputs. To address this issue, we present Pref-Restore, a hierarchical framework for BFR with reduced restoration uncertainty. Our design is organized around three complementary principles: (1) Semantic Information Augmentation, where an auto-regressive semantic branch converts image and text cues into structured tokens that provide a stable high-level anchor; (2) Texture-level Fidelity Alignment, where the diffusion generator is trained under this anchor to recover identity-relevant details; and (3) Fidelity-constrained Preference Optimization, where a face-aware reward refines the diffusion trajectory while controlling the quality--fidelity trade-off. Extensive experiments on synthetic and real-world benchmarks show that Pref-Restore achieves state-of-the-art performance, with stronger identity-sensitive fidelity and lower restoration uncertainty across repeated sampling. Systematic ablations further attribute these gains to the proposed hierarchical design, showing the necessity of staged training, the robustness and quality dependence of the text pathway, and the benefit of fidelity-constrained preference optimization.

2606.06176 2026-06-16 cs.CV 版本更新

RQUL-UIE: Revitalizing Quality-Unstable Labels for Underwater Image Enhancement via In-Dataset Self-Supervision

RQUL-UIE: 通过数据集内自监督重振质量不稳定标签用于水下图像增强

Haochen Hu, Yanrui Bin, Chih-yung Wen, Bing Wang

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出一种基于扩散模型的数据集内自监督学习策略,通过评估标签质量并量化噪声级别进行分步去噪监督,结合傅里叶细化网络,有效利用不稳定标签提升水下图像增强质量。

详情
AI中文摘要

水下图像增强对于减轻水介质引起的退化至关重要。尽管基于学习的方法取得了显著进展,但大多数依赖于具有不稳定标签质量的配对数据集,这限制了模型性能。本文提出了一种基于扩散的数据集内自监督学习策略,旨在利用训练标签的质量分布。具体地,我们通过预训练扩散模型的语义感知嵌入以无需训练的方式评估标签质量。这些质量分数随后被量化为噪声级别索引,指导多步去噪过程以进行级别监督。该机制防止低质量标签降低模型性能,同时最大化其在训练中的效用。此外,引入基于傅里叶的细化网络以显式重建高频分量。大量评估表明,我们的方法在恢复质量上始终优于最先进的方法。代码和预训练模型将在接收后提供链接。

英文摘要

Underwater Image Enhancement (UIE) is essential for mitigating degradations caused by the water medium. Although learning-based methods have advanced significantly, most rely on paired datasets with unstable label quality, which bottlenecks model performance. This paper proposes a diffusion-based, in-dataset self-supervised learning strategy designed to exploit the quality distribution of training labels. Specifically, we evaluate label quality via semantic perception embeddings from a pre-trained diffusion model in a training-free manner. These quality scores are subsequently quantized into noise-level indices, guiding a multi-step denoising process for level-wise supervision. This mechanism prevents low-quality labels from degrading the model while maximizing their utility during training. Furthermore, a Fourier-based refinement network is incorporated to explicitly reconstruct high-frequency components. Extensive evaluations demonstrate that our method consistently outperforms SOTA approaches in restoration quality. The code and pre-trained model will be made available upon acceptance.

11. 鲁棒性、安全、隐私与可信视觉 23 篇

2606.14748 2026-06-16 cs.CV cs.AI 新提交

Is My Vision-Language Data in Your AI? Membership Inference Test (MINT) Demo 2

我的视觉-语言数据在你的AI中吗?成员推断测试(MINT)演示2

Daniel DeAlcala, Gonzalo Mancera, Julian Fierrez, Aythami Morales, Ruben Tolosana, Ruben Vera-Rodriguez

发表机构 * Universidad Autonoma de Madrid(马德里自治大学)

AI总结 提出成员推断测试(MINT)框架,通过多种架构检测训练数据,在人脸识别和LLM上准确率达90%,并构建了多模态审计平台。

Comments IEEE Conf. on Computers, Software, and Applications (COMPSAC), 2026

详情
AI中文摘要

我们展示了成员推断测试(MINT)演示2,这是一个旨在提高机器学习训练过程透明度的框架。MINT是一种实验性技术,用于确定特定数据是否在机器学习模型训练期间被使用。我们建立了理论框架,并根据被审计模型已知信息的多少,提出了多种MINT架构。使用一个流行的人脸识别模型、4个最先进的LLM以及多个多样化的大规模公共图像和文本数据库进行的实验,在训练数据检测中达到了高达90%的准确率。基于这些结果,我们引入了一个综合性的网络平台,将这些能力扩展到图像和文本模态。该平台集成了多种技术栈,包括MINT、aMINT和gMINT,允许用户审计广泛的模型。该演示旨在促进AI透明度,并提供一种实用工具以促进对新兴AI法规的合规性。

英文摘要

We present the Membership Inference Test (MINT) Demo 2, a framework designed to improve transparency in machine learning training processes. MINT is a technique for experimentally determining whether specific data were used during machine learning model training. We establish the theoretical framework and propose multiple architectures for MINT depending on the amount of information known about the models that are being audited. Experimental results using a popular face recognition model, 4 state-of-the-art LLMs, and multiple, diverse, and large-scale public image and text databases achieve promising accuracy levels in the detection of training data of up to 90%. Building on these results, we introduce a comprehensive web platform that expands these capabilities to image and text modalities. The platform integrates a diverse technological stack, including MINT, aMINT, and gMINT, allowing users to audit a wide range of models. This demonstrator aims to promote AI transparency and provides a practical tool to foster compliance with emerging AI regulations.

2606.14783 2026-06-16 cs.CV cs.CR 新提交

The Vision Encoder as a Privacy Boundary: Visual-Token Side Channels in Encoder-Free Vision-Language Models

视觉编码器作为隐私边界:无编码器视觉-语言模型中的视觉令牌侧信道

Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou

发表机构 * School of Engineering, Institute of Science Tokyo(东京科学大学工学院) College of Control Science and Engineering, Zhejiang University(浙江大学控制科学与工程学院) Department of Electrical and Computer Engineering, National University of Singapore(新加坡国立大学电气与计算机工程系)

AI总结 研究无编码器视觉-语言模型中视觉令牌侧信道导致的隐私泄露问题,通过解码器攻击从中间视觉令牌恢复图像和文本,发现空间采样保真度是关键因素,并指出KV缓存也存在泄露风险。

详情
AI中文摘要

视觉编码器将图像像素压缩为语义嵌入,通过保留语义内容同时衰减精确文本恢复所需的像素局部细节,隐式地充当隐私边界。无编码器视觉-语言模型(VLM)通过将图像块直接路由到语言模型令牌流中移除了这一边界,从而暴露了一个架构上的隐私攻击面:中间视觉令牌成为输出前的侧信道。在令牌访问攻击者下,解码器从两个无编码器VLM(Gemma4和Fuyu)中反转视觉令牌流,恢复可识别的图像结构和可读的保留访问码,而匹配的基于编码器的控制模型仅能定位目标区域但无法恢复精确字符串。模型内消融实验表明,操作因素是视觉令牌网格的空间采样保真度,尤其是字符方向采样密度,而非令牌或值的数量。泄露不仅限于导出的令牌:Gemma4第0层键值缓存张量可直接反转,将侧信道置于生产服务栈通常为解码效率而持久化的KV缓存中。该攻击在杂乱场景、真实文档退化以及零样本迁移到公共文档图像中依然有效,并抵抗加性噪声和量化等值级防御。因此,有效的缓解措施必须降低空间采样,使得移除视觉编码器成为VLM部署中的一级隐私决策。

英文摘要

A vision encoder compresses image pixels into semantic embeddings, implicitly acting as a privacy boundary by preserving semantic content while attenuating pixel-local detail required for exact text recovery. Encoder-free vision-language models (VLMs) remove this boundary by routing image patches directly into the language-model token stream, thereby exposing an architectural privacy attack surface: intermediate visual tokens become a pre-output side channel. Under a token-access adversary, decoders invert visual-token streams from two encoder-free VLMs, Gemma4 and Fuyu, recovering recognizable image structure and readable held-out access codes, whereas matched encoder-based controls localize target regions but recover no exact strings. Within-model ablations show that the operative factor is spatial sampling fidelity of the visual-token grid, especially character-direction sampling density, rather than token or value count. The leakage is not limited to exported tokens: Gemma4 layer-0 key-value cache tensors are directly invertible, placing the side channel within KV caches commonly persisted by production serving stacks for decoding efficiency. The attack survives clutter, realistic document degradation, and zero-shot transfer to public document images, and it resists value-level defenses such as additive noise and quantization. Effective mitigation must therefore reduce spatial sampling, making removal of the vision encoder a first-class privacy decision in VLM deployment.

2606.15169 2026-06-16 cs.CV 新提交

Label Shift Aware Adaptation for Online Zero-shot Learning with Contrastive Language-Image Pre-Training (CLIP)

基于对比语言-图像预训练(CLIP)的在线零样本学习中的标签偏移感知自适应

Pengxiao Han, Changkun Ye, Yanshuo Wang, Jinguang Tong, Miaohua Zhang, Xuesong Li, Jie Hong, Lars Petersson

发表机构 * Australian National University(澳大利亚国立大学) China North Vehicle Research Institute(中国北方车辆研究所) The Hong Kong Polytechnic University(香港理工大学) Griffith University(格里菲斯大学) CSIRO(澳大利亚联邦科学与工业研究组织) The University of Hong Kong(香港大学)

AI总结 针对在线零样本学习中测试数据与CLIP训练数据分布不匹配的问题,提出标签偏移感知(LSA)方法,通过域自适应和标签偏移校正提升分类性能。

详情
AI中文摘要

像对比语言-图像预训练(CLIP)这样的视觉-语言模型已在数据稀缺场景中得到广泛研究。该领域中一个特别具有挑战性和现实性的任务是使用CLIP进行在线零样本学习,其中未知测试样本由CLIP以随机顺序顺序预测,同时在顺序推理阶段保持特征提取和模型参数固定。在这种设置下,大多数现有方法通过使用传入测试样本在线调整表示来解决问题,而忽略了CLIP最初训练的数据分布。当测试数据中的标签分布与训练域不同时,这种不匹配可能导致性能下降。为了解决这一差距,我们提出了标签偏移感知(LSA),它将在线零样本分类任务形式化为域适应问题。具体来说,LSA适应CLIP(在未知源分布上训练)计算的预测到目标分布,仅使用未标记的测试数据,并应用标签偏移校正来减轻源域和目标域之间的不匹配。跨多个数据集的广泛实验表明,所提出的LSA始终优于基于CLIP的最先进的在线零样本学习方法。

英文摘要

Vision-language models like Contrastive Language-Image Pre-Training (CLIP) have been extensively studied in data-scarce scenarios. A particularly challenging and realistic task in this area is online zero-shot learning with CLIP, where unknown test samples are predicted sequentially in random order by CLIP while keeping the feature extraction and model parameters fixed during the sequential inference phase. Most existing approaches in this setting address the problem by adapting representations online using incoming test samples, while neglecting the distribution of the data on which CLIP was initially trained. This mismatch can lead to degraded performance when the label distribution in the test data differs from that of the training domain. To address this gap, we propose Label Shift Aware (LSA), which formulates the online zero-shot classification task as a domain adaptation problem. Specifically, LSA adapts the predictions computed by CLIP, which was trained on an unknown source distribution, to a target distribution using only unlabeled test data, and applies label shift correction to mitigate the mismatch between the source and target domains. The extensive experiments across multiple datasets demonstrate that the proposed LSA consistently outperforms state-of-the-art online zero-shot learning methods based on CLIP.
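
To make the label-shift idea concrete, the sketch below applies the textbook prior-ratio correction: predictions made under a source label prior are reweighted by `target_prior / source_prior` and renormalised. This is a generic illustration and not necessarily how LSA is implemented; the function names, the uniform source prior, and the running-mean prior estimate are all assumptions made for the example.

```python
def correct_label_shift(probs, source_prior, target_prior):
    """Reweight p_s(y|x) by target_prior / source_prior and renormalise."""
    weighted = [p * t / s for p, s, t in zip(probs, source_prior, target_prior)]
    total = sum(weighted)
    return [w / total for w in weighted]


def estimate_target_prior(stream_probs):
    """Crude online estimate: mean of the predicted probabilities seen so far."""
    n = len(stream_probs)
    k = len(stream_probs[0])
    return [sum(p[c] for p in stream_probs) / n for c in range(k)]


zero_shot = [0.5, 0.3, 0.2]           # softmax over class prompts (toy values)
source_prior = [1 / 3, 1 / 3, 1 / 3]  # assumed; the true CLIP training prior is unknown
target_prior = [0.1, 0.1, 0.8]        # assumed test-time label distribution
adjusted = correct_label_shift(zero_shot, source_prior, target_prior)
```

In this toy case the predicted class moves from the first to the third label once the skewed target prior is taken into account, which is the kind of mismatch the abstract attributes to ignoring the source distribution.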

2606.15202 2026-06-16 cs.CV 新提交

Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments

安全相关环境中的人类注视与视觉语言模型注意力的比较

Marta Vallejo, Siwen Wang

发表机构 * Heriot-Watt University(赫瑞-瓦特大学)

AI总结 本研究通过眼动追踪实验和GPT-4o等视觉语言模型,比较了人类与模型在安全相关场景中的注意力分布,发现模型无需眼动训练数据即可近似人类注视模式。

Comments 30 pages, 33 figures. Submitted as a preprint. Code and data available upon reasonable request

详情
AI中文摘要

人类视觉注意力在人们感知和响应包含潜在风险的环境时起着重要作用。本研究探讨大型视觉语言模型是否能识别安全相关环境中吸引人类注意力的相同场景区域。使用Pupil Invisible可穿戴眼镜收集了十名参与者观看33张代表不同潜在风险水平的环境场景图像的眼动数据。将注视坐标映射到刺激图像上,生成群体平均的人类注视热图。同时,通过OpenAI视觉应用程序接口(API)提示GPT-4o生成视觉注意力的空间预测,并将其转换为显著性图,以便与人类注视模式进行比较。使用四种互补指标评估人类注视热图与模型生成的显著性图之间的空间对齐:皮尔逊相关系数(r = 0.515 ± 0.117)、归一化扫描路径显著性(NSS = 0.988 ± 0.323)、Kullback-Leibler散度(KL = 1.766 ± 0.844)以及使用Judd公式的接收者操作特征曲线下面积(AUC-Judd = 0.806 ± 0.076)。与Gemini Pro、Gemini Flash和Claude的跨模型比较显示,所有模型均超过AUC-Judd的随机基线0.5,并获得了正的NSS分数。根据四项指标中的三项,Gemini Pro表现出最强的空间定位能力,而GPT-4o在KL散度上产生了与人类注意力最接近的分布匹配。这些发现表明,大型视觉语言模型能够识别与人类在安全相关场景中视觉注意力大致对应的区域,而无需眼动训练数据。结果凸显了视觉语言模型作为近似人类注意力模式的可扩展工具的潜力。

英文摘要

Human visual attention plays an important role in how people perceive and respond to environments containing potential risks. This study investigates whether large vision-language models can identify the same regions of a scene that attract human attention in safety-relevant environments. Eye-tracking data were collected from ten participants viewing 33 scene images representing environments with varying levels of potential risk using Pupil Invisible wearable glasses. Gaze coordinates were mapped onto stimulus images to generate population-averaged human gaze heatmaps. In parallel, GPT-4o was prompted through the OpenAI Vision Application Programming Interface (API) to generate spatial predictions of visual attention, which were converted into saliency maps for comparison with human gaze patterns. Spatial alignment between human gaze heatmaps and model-generated saliency maps was evaluated using four complementary metrics: Pearson correlation (r = 0.515 ± 0.117), Normalised Scanpath Saliency (NSS = 0.988 ± 0.323), Kullback-Leibler divergence (KL = 1.766 ± 0.844), and Area Under the Receiver Operating Characteristic Curve using the Judd formulation (AUC-Judd = 0.806 ± 0.076). A cross-model comparison with Gemini Pro, Gemini Flash, and Claude showed that all models exceeded the AUC-Judd chance baseline of 0.5 and achieved positive NSS scores. Gemini Pro demonstrated the strongest spatial localisation according to three of the four metrics, whereas GPT-4o produced the closest distributional match to human attention as measured by KL divergence. These findings suggest that large vision-language models can identify regions that broadly correspond to where humans direct visual attention in safety-relevant scenes without requiring eye-tracking training data. The results highlight the potential of vision-language models as a scalable tool for approximating human attentional patterns.
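
Three of the four alignment metrics named above are short enough to sketch directly. The version below is a minimal illustration under stated assumptions: maps are flattened to 1-D lists of non-negative values, fixations are given as pixel indices, and the small `eps` in the KL term is a common numerical safeguard rather than a value from the study. AUC-Judd is omitted because it needs a threshold sweep. The toy maps are made up.

```python
import math
from statistics import mean, pstdev


def pearson(p, q):
    """Pearson correlation between two flattened maps."""
    mp, mq = mean(p), mean(q)
    cov = sum((a - mp) * (b - mq) for a, b in zip(p, q))
    var_p = sum((a - mp) ** 2 for a in p)
    var_q = sum((b - mq) ** 2 for b in q)
    return cov / math.sqrt(var_p * var_q)


def nss(saliency, fixated):
    """Mean z-scored saliency at fixated pixels (fixated is a list of indices)."""
    mu, sd = mean(saliency), pstdev(saliency)
    return mean([(saliency[i] - mu) / sd for i in fixated])


def kl_divergence(human, model, eps=1e-12):
    """KL(human || model) after normalising both maps to sum to 1."""
    sh, sm = sum(human), sum(model)
    return sum((h / sh) * math.log((h / sh + eps) / (m / sm + eps))
               for h, m in zip(human, model))


human_map = [0.0, 1.0, 3.0, 0.0]   # toy gaze heatmap (flattened)
model_map = [0.1, 0.9, 2.0, 0.2]   # toy model saliency map (flattened)
r = pearson(human_map, model_map)
score = nss(model_map, [2])        # one fixation on the hottest pixel
kl = kl_divergence(human_map, model_map)
```

Higher Pearson and NSS values indicate better agreement, while KL divergence is a dissimilarity, so lower is better. This is why the abstract can report Gemini Pro ahead on three metrics and GPT-4o ahead on KL.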

2606.15608 2026-06-16 cs.CV 新提交

On the Adversarial Robustness of Multimodal LLM Judges

多模态大语言模型评判器的对抗鲁棒性

Zihan Wang, Guansong Pang, Zelin Liu, Wenjun Miao, Jin Zheng, Xiao Bai

发表机构 * School of Computer Science and Engineering, Beihang University(北京航空航天大学计算机科学与工程学院) State Key Laboratory of Virtual Reality Technology and System, Beihang University(北京航空航天大学虚拟现实技术与系统国家重点实验室) State Key Laboratory of Software Development Environment, Jiangxi Research Institute, Beihang University(北京航空航天大学江西研究院软件开发环境国家重点实验室) School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算机与信息系统学院)

AI总结 提出RobustMLLMJudge框架评估多模态大语言模型作为评判器时的对抗鲁棒性,并设计MGSIA攻击方法,通过语义诱导和高分流形对齐生成可迁移的分数膨胀扰动,揭示其脆弱性。

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地被用作自动评判器,例如用于图像质量和安全评估。然而,它们的对抗鲁棒性在很大程度上尚未被探索,威胁到自动评判的公平性和可靠性。为弥补这一差距,我们引入了RobustMLLMJudge,这是第一个用于评估通用MLLM在充当评判器时对抗鲁棒性的通用框架。它涵盖了针对质量与安全评估场景中主流评判方法的各种攻击。利用RobustMLLMJudge,我们发现:i) 不同的MLLM评判器极易受到分数膨胀的对抗攻击;ii) 尽管这些攻击方法有效,但由于MLLM评判器评估协议中的独特约束,它们面临关键挑战。我们进一步提出了MGSIA,即流形引导语义诱导攻击,这是一种绕过这些约束的新方法,能够对MLLM评判器实施更有效且可迁移的攻击。MGSIA的核心思想是将肯定性语义诱导与高分流形对齐相结合:它最大化评判器对二元语义查询产生肯定性响应(例如“是”)的概率,同时将对抗性表示正则化到从代理协议估计的高分中心附近。这些目标共同产生可迁移的分数膨胀扰动。大量实验证明了MGSIA在不同评估场景下欺骗先进MLLM评判器的优越性和泛化能力,凸显了对鲁棒MLLM评判器的需求。代码和数据将在https://github.com/mala-lab/RobustMLLMJudge提供。

英文摘要

Multimodal Large Language Models (MLLMs) are increasingly used as automated judges, e.g., for image quality and safety assessment. However, their adversarial robustness remains largely unexplored, threatening the fairness and reliability of automated judging. To bridge this gap, we introduce RobustMLLMJudge, the first general framework for evaluating the adversarial robustness of general-purpose MLLMs when functioning as judges. It covers diverse attacks against popular judge approaches across quality and safety evaluation scenarios. Using RobustMLLMJudge, we reveal that i) different MLLM judges are highly vulnerable to score-inflating adversarial attacks; and ii) although effective, these attack methods face a critical challenge due to unique constraints in the evaluation protocols of MLLM judges. We further propose MGSIA, namely Manifold-Guided Semantic Induction Attack, a novel method that bypasses these constraints to enable more effective and transferable attacks on MLLM judges. The core idea of MGSIA is to combine affirmative semantic induction with high-score manifold alignment: it maximizes the probability that judges yield affirmative responses (e.g., "Yes") to binary semantic queries, while regularizing adversarial representations toward high-score centers estimated from proxy protocols. Together, these objectives yield transferable score-inflating perturbations. Extensive experiments demonstrate the superiority and generalizability of MGSIA in deceiving advanced MLLM judges under different evaluation scenarios, highlighting the need for robust MLLM judges. Code and data will be made available at https://github.com/mala-lab/RobustMLLMJudge.

2606.15779 2026-06-16 cs.CV cs.LG 新提交

Faithful Action-unit Causal Reasoning for Counterfactually Faithful Emotion Explanations

面向反事实忠实情感解释的忠实动作单元因果推理

Van Thong Huynh, Hong Hai Nguyen, Thuy Pham, Trong Nghia Nguyen, Soo-Hyung Kim

Affiliations * Faculty of CSE, Ho Chi Minh City University of Technology (HCMUT), VNUHCM; Dept. of AI, FPT University; Faculty of DSAI, College of Technology, National Economic University; Dept. of AI Convergence, Chonnam National University

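AI summary Proposes FACR, which uses a counterfactual-consistency objective and a polarity-aware causal graph to train models toward measurable causal faithfulness between action units and emotions, raising faithfulness on the UNBC-PAIN dataset from 0.08 to 0.57.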

详情
AI中文摘要

多模态模型可以命名面部情感背后的动作单元(AU),但其AU->情感的解释通常是合理的而非忠实的:没有任何机制强制模型调用的AU是实际驱动其预测的AU。我们将AU->情感推理视为解释、标签和结构化AU->情感因果图G之间的反事实一致性问题,并提出FACR,该方法将推理器建立在独立诱导的、极性感知的G上,并训练一个反事实忠实性目标:对G标记为某类因果的AU进行do干预必须改变预测,而对标记为无关的AU进行do干预必须保持预测不变。因此,忠实性既可通过匹配的干预指标进行训练和测量,我们针对已知因果结构PSPI疼痛-AU组成评估该指标,因为现有情感推理基准不支持。我们明确指出,该指标测试的是对给定结构的忠实性而非重新发现:它询问训练后的推理器是否调用结构标记为因果的AU,在留出受试者和第二个数据集上进行评估。在UNBC-PAIN上的受试者独立评估中,该目标将调用AU与PSPI组成的一致性从无目标的基线0.08提高到0.57,检测成本略有增加;一个不忠实控制实验将增益归因于该目标。在跨数据集情感迁移中,该目标同样提高了七类任务上对G的忠实性(0.50到0.84)。最后,我们附加语言verbalizer并将审计扩展到生成的文本:通过潜在激活偏置每个动作单元的发射,使解释在结构上忠实,因此消融一个AU会将其从解释中移除,该属性可迁移到第二个语言模型骨干,而自由生成的解释则不忠实。

英文摘要

Multimodal models can name the action units (AUs) behind a facial emotion, but their AU->emotion rationales are typically plausible rather than faithful: nothing forces the AUs a model invokes to be the AUs that actually drive its prediction. We cast AU->emotion reasoning as a counterfactual-consistency problem between the rationale, the label, and a structural AU->emotion causal graph G, and propose FACR, which grounds the reasoner in an independently induced, polarity-aware G and trains a counterfactual-faithfulness objective: a do-intervention on an AU that G marks causal for a class must move the prediction, while one it marks irrelevant must leave it unchanged. Faithfulness is thereby both trainable and measurable through a matching interventional metric, which we evaluate against a known causal structure, the PSPI pain-AU composition, as no existing affective-reasoning benchmark allows. We are explicit that this metric tests fidelity to the supplied structure rather than its rediscovery: it asks whether the trained reasoner invokes the AUs the structure marks causal, on held-out subjects and a second dataset. Under subject-independent evaluation on UNBC-PAIN, the objective raises the agreement between the invoked AUs and the PSPI composition from a no-objective baseline of 0.08 to 0.57, at a small detection cost; an unfaithfulness control attributes the gain to the objective. On a cross-dataset emotion transfer, the objective likewise raises fidelity to G on a seven-class task (0.50 to 0.84). Finally, we attach a language verbalizer and extend the audit to the generated text: biasing each action unit's emission by its latent activation makes the rationale faithful by construction, so that ablating an AU removes it from the explanation, a property that transfers to a second language-model backbone, whereas a freely generated rationale is unfaithful.
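The interventional check described in this abstract can be illustrated on the PSPI composition, which is commonly written as PSPI = AU4 + max(AU6, AU7) + max(AU9, AU10) + AU43. The sketch below is a toy under stated assumptions: the "model" is the PSPI formula itself rather than a trained reasoner, and the names `MAX_LEVEL`, `moves_prediction`, and `interventional_agreement`, the choice of do-values (zero and the maximum level), and the agreement score are all hypothetical illustrations, not the paper's metric.

```python
# Toy interventional-faithfulness check against the PSPI composition.
# Assumption: the predictor is the PSPI formula; a real reasoner would be a trained network.

# Maximum coded level per action unit (AU43, eye closure, is treated as binary).
MAX_LEVEL = {"AU4": 5, "AU6": 5, "AU7": 5, "AU9": 5, "AU10": 5, "AU43": 1,
             "AU12": 5, "AU25": 5}
PSPI_AUS = {"AU4", "AU6", "AU7", "AU9", "AU10", "AU43"}


def pspi(aus):
    """Prkachin-Solomon Pain Intensity from a dict of AU levels."""
    return (aus["AU4"] + max(aus["AU6"], aus["AU7"])
            + max(aus["AU9"], aus["AU10"]) + aus["AU43"])


def moves_prediction(aus, name, predict=pspi):
    """True if do(name = 0) or do(name = max level) changes the prediction."""
    base = predict(aus)
    for value in (0, MAX_LEVEL[name]):
        intervened = dict(aus, **{name: value})
        if predict(intervened) != base:
            return True
    return False


def interventional_agreement(aus, invoked, predict=pspi):
    """Fraction of AUs whose 'invoked' status matches their causal effect."""
    hits = sum((name in invoked) == moves_prediction(aus, name, predict)
               for name in aus)
    return hits / len(aus)


sample = {"AU4": 2, "AU6": 2, "AU7": 1, "AU9": 0, "AU10": 3, "AU43": 0,
          "AU12": 4, "AU25": 1}
faithful = interventional_agreement(sample, PSPI_AUS)            # rationale cites the PSPI AUs
unfaithful = interventional_agreement(sample, {"AU4", "AU12"})   # rationale cites an irrelevant AU
print(pspi(sample), faithful, unfaithful)
```

Two do-values are used because a single ablation can be masked by the `max` terms: zeroing AU7 changes nothing when AU6 is larger, whereas raising it to the maximum level does.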

2606.15880 2026-06-16 cs.CV cs.AI 新提交

Deep Residual Injection for Full-Spectrum Forensic Signal Perception in Multimodal Large Language Models

深度残差注入:多模态大语言模型的全频谱取证信号感知

Kaiqing Lin, Zhiyuan Yan, Ruoxin Chen, Ke-Yue Zhang, Yue Zhou, Caiyong Piao, Bin Li, Taiping Yao, Bo Wang, Youchang Xiao, Shouhong Ding

Affiliations * National University of Singapore; Tsinghua University; University of Science and Technology of China; University of Electronic Science and Technology of China; University of California, Berkeley

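AI summary Addresses the difficulty multimodal large language models have in forensics of retaining semantic knowledge while capturing low-level generator artifacts; proposes Deep-VRM, which injects artifact-specific visual signals as a residual path into an intermediate layer to achieve full-spectrum signal perception and robust detection.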

Comments Accepted at ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)因其强大的语义理解能力,越来越多地被应用于取证领域。随着AI生成图像变得逼真,仅凭语义层面的不一致往往不足以进行可靠检测。这引发了一个关键问题:MLLMs能否实现全频谱取证信号感知,即在不牺牲预训练语义知识的情况下捕获低级生成器伪影。我们进一步对MLLMs中的取证信号感知进行了逐层分析,表明语义信息主要在早期到中间层形成,而直接微调学习伪影会破坏这些语义表示。基于这一发现,我们提出了深度视觉残差MLLM(Deep-VRM),以保留早期语义处理,同时将伪影特定的视觉信号作为残差路径注入中间层,在此与语义标记表示融合,并通过后续可训练层传播。这使得后续层能够联合建模语义推理和信号级取证线索,令人惊讶的是,模型学会了根据输入自适应地利用不同级别的取证信号,实现了鲁棒且可泛化的检测性能。大量实验表明,我们的方法在大多数基准测试中达到了最先进水平。代码和数据可在https://github.com/KQL11/Deep-VRM获取。

英文摘要

Multimodal large language models (MLLMs) have been increasingly adopted in forensics for their robust semantic understanding. As AI-generated images become realistic, semantic-level inconsistencies alone are often insufficient for reliable detection. This motivates a critical question: whether MLLMs can achieve full-spectrum forensic signal perception, i.e., capturing low-level generator artifacts without sacrificing pre-trained semantic knowledge. We further perform a layer-wise analysis of forensic signal perception in MLLMs, showing that semantic information is primarily formed in the early-to-middle layers, whereas direct fine-tuning for artifact learning disrupts these semantic representations. Based on this insight, we propose Deep Visual Residual MLLM (Deep-VRM) to preserve early semantic processing while injecting artifact-specific visual signals as a residual path into an intermediate layer, where they are fused with semantic token representations and propagated through subsequent trainable layers. This enables later layers to jointly model semantic reasoning and signal-level forensic cues, and surprisingly, the model learns to adaptively leverage different levels of forensic signals depending on the input, achieving robust and generalizable detection performance. Extensive experiments show that our method achieves state-of-the-art across most benchmarks. The code and data are available at https://github.com/KQL11/Deep-VRM.

2606.16519 2026-06-16 cs.CV 新提交

BadWorld: Adversarial Attacks on World Models

BadWorld:对世界模型的对抗攻击

Linghui Shen, Mingyue Cui, Xingyi Yang

Affiliations * The Hong Kong Polytechnic University

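AI summary Proposes BadWorld, a label-free adversarial framework for autoregressive visual world models that combines a self-supervised velocity attack with trajectory-adaptive bi-level optimization, exposing their structural fragility.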

Comments Project Page: https://linghuiishen.github.io/BadWorld/

详情
AI中文摘要

视觉世界模型(VWMs)从单张上下文图像合成交互式、条件于动作的展开。然而,这些模型对对抗扰动的鲁棒性仍是一个未解问题。标准对抗攻击无法评估这种脆弱性,因为攻击者缺乏真实未来视频且无法预测后续用户控制。我们提出了BadWorld,一个专为自回归VWMs设计的无标签对抗框架,系统性地克服了这两个限制。首先,为绕过对未来监督的需求,我们提出了一种自监督速度攻击,直接破坏模型的早期去噪动态。其次,为确保攻击能泛化到不可预测的用户动作,我们制定了一种轨迹自适应双层优化,主动挖掘困难的控制序列以锻造控制无关的扰动。在具有连续和离散控制的代表性VWMs上评估,BadWorld暴露了严重的结构脆弱性。视觉上难以区分的对抗图像可靠地触发未来展开中的灾难性退化,导致去噪不完整、结构崩溃和控制不一致。这些发现揭示了在安全关键系统中部署VWMs的关键风险,同时突显了一种隐私保护的实用机制。

英文摘要

Visual world models (VWMs) synthesize interactive, action-conditioned rollouts from a single context image. However, it remains an open question how robust these models are to adversarial perturbations. Standard adversarial attacks fail to assess this vulnerability because attackers lack ground-truth future videos and cannot predict subsequent user controls. We introduce BadWorld, a label-free adversarial framework tailored for autoregressive VWMs that systematically overcomes both constraints. First, to bypass the need for future supervision, we propose a self-supervised velocity attack that directly disrupts the early denoising dynamics of the model. Second, to ensure the attack generalizes across unpredictable user actions, we formulate a trajectory-adaptive bi-level optimization that actively mines hard control sequences to forge control-agnostic perturbations. Evaluated on representative VWMs with continuous and discrete controls, BadWorld exposes severe structural fragility. Visually indistinguishable adversarial images reliably trigger catastrophic degradation in future rollouts, leading to incomplete denoising, structural collapse, and control inconsistency. These findings reveal critical risks for deploying VWMs in safety-critical systems while highlighting a practical mechanism for privacy protection.

2606.16742 2026-06-16 cs.CV cs.AI 新提交

Revealing Artifacts via Noise Amplification: A Novel Perspective for AI-Generated Video Detection

通过噪声放大揭示伪影:AI生成视频检测的新视角

Renxi Cheng, Jie Gui, Hongsong Wang

Affiliations * School of Cyber Science and Engineering, Southeast University; Purple Mountain Laboratories; Engineering Research Center of Blockchain Application, Supervision And Management (Southeast University), Ministry of Education; School of Computer Science and Engineering, Southeast University; Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications (Southeast University), Ministry of Education

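AI summary Tackles AI-generated video detection with a bit-plane-based noise amplification method that combines pixel-level intensity enhancement, region-level spatial amplification, and frame-level temporal aggregation, outperforming existing methods on the GenVidBench and HardGVD benchmarks.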

Comments 13 pages, 5 figures

详情
AI中文摘要

随着视频生成模型的快速发展,区分AI生成视频与真实视频已成为一项具有挑战性的任务。现有研究大多集中于开发用于识别生成对抗网络生成样本的检测器。然而,AI生成视频的检测,尤其是文本到视频模型生成的视频,仍是一个未探索的领域。尽管最先进的文本到视频模型可以生成类似于真实视频的逼真视觉内容,但它们无法生成图像的细节以及视频中细节的变化。受此启发,我们从位平面的新视角处理AI生成视频检测,位平面可以有效描述图像或视频中的细节或噪声。为此,我们提出了一种简单而有效的方法,称为噪声放大。该方法首先基于位平面提取噪声信号,然后放大这些噪声信号,最后将其输入判别器网络进行视频伪造分类。噪声放大通过三个方面综合构建:像素级强度增强、区域级空间放大和帧级时间聚合。为了在具有挑战性的场景中评估AI生成视频检测方法,我们还引入了一个名为HardGVD的基准。在大型数据集GenVidBench和HardGVD上的大量实验表明,我们简单的方法显著优于最先进的方法。

英文摘要

With the rapid advancement of video generation models, distinguishing AI-generated videos from authentic ones has become a challenging task. Most existing research concentrates on detectors for samples produced by generative adversarial networks. The detection of AI-generated videos, particularly those produced by text-to-video models, remains largely unexplored. Although state-of-the-art text-to-video models can generate realistic visual content that resembles real videos, they fall short in reproducing fine image details and the way those details change within a video. Inspired by this, we address AI-generated video detection from the novel perspective of bit-planes, which can effectively describe the details or noise in images and videos. To this end, we propose a simple yet effective approach called Noise Amplification. This approach first extracts noise signals based on bit-planes, then amplifies these noise signals, and finally feeds them into discriminator networks to classify videos as real or fake. Noise Amplification combines three components: pixel-level intensity enhancement, region-level spatial amplification, and frame-level temporal aggregation. To evaluate AI-generated video detection methods in challenging scenarios, we also introduce a benchmark named HardGVD. Extensive experiments on both the large-scale dataset GenVidBench and HardGVD show that our simple approach significantly outperforms state-of-the-art methods.
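A minimal sketch of bit-plane noise extraction may make the pipeline above more concrete. It assumes 8-bit grayscale frames stored as nested lists, keeps the two lowest bit-planes as the "noise" signal, stretches them to the full intensity range, and averages over frames. The function names and the choice of two bit-planes are assumptions for illustration; region-level spatial amplification and the discriminator network are not covered.

```python
# Bit-plane noise extraction and amplification on 8-bit grayscale frames.
# Assumption: frames are nested lists of ints in [0, 255]; a real pipeline would
# use arrays and pass the result to a learned discriminator.

def bit_plane(frame, k):
    """Return the k-th bit-plane (k = 0 is the least significant bit)."""
    return [[(p >> k) & 1 for p in row] for row in frame]


def low_bit_residual(frame, n_bits=2):
    """Keep only the n_bits lowest bit-planes, which carry fine detail and noise."""
    mask = (1 << n_bits) - 1
    return [[p & mask for p in row] for row in frame]


def amplify(residual, n_bits=2):
    """Pixel-level amplification: stretch the residual to the 0..255 range."""
    scale = 255 // ((1 << n_bits) - 1)
    return [[v * scale for v in row] for row in residual]


def temporal_mean(frames):
    """Frame-level aggregation: average the amplified residuals over time."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


video = [[[200, 13], [7, 64]],
         [[201, 12], [6, 67]]]
noise = temporal_mean([amplify(low_bit_residual(f)) for f in video])
print(noise)
```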

2606.17037 2026-06-16 cs.CV cs.AI cs.LG 新提交

The Importance of Phase in Neural Representations: An Internal Oppenheim-Lim Test of Image Classifiers

相位在神经表示中的重要性:图像分类器的内部Oppenheim-Lim测试

Alper Yıldırım

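AI summary Using internal phase-magnitude transplant experiments, finds that the predictions of image classifiers such as PRISM2D, GFNet, and ViT-B/16 depend mainly on phase/sign information, while image-specific magnitude contributes little to the readout; ResNet-50 carries a latent sign code before its ReLUs, which offers a mechanistic account of the texture-shape gap between CNNs and attention models.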

详情
AI中文摘要

Oppenheim和Lim(1981)表明,自然图像仅从傅里叶相位重建时仍可识别,而幅度几乎不携带其身份信息。我们探究训练后的图像分类器是否在其隐藏层内再现这种不对称性,并进行因果测试:给定两幅图像,我们在选定层将一幅图像的相位移植到另一幅图像的幅度上,并记录预测跟随哪幅图像。在PRISM2D、GFNet和ViT-B/16中,预测跟随相位或符号捐赠者,删除所有图像特定幅度几乎不影响准确率,因此身份信息依赖于相位,而图像特定幅度对读出而言在很大程度上是可舍弃的。ResNet-50起初似乎打破了这一模式,因为在ReLU之后移植符号无效;在ReLU之前的公平干预揭示了后期块中存在强烈的潜在符号编码,而仅DC对照表明读出消耗了通道空间平均值。对照排除了幅度简单地不依赖于图像的平凡情况。因此,这些架构共享一个相位/符号身份编码,但以不同基(由整流和读出几何决定)暴露出来,这为CNN与注意力模型之间的纹理-形状差异提供了机制性解释。

英文摘要

Oppenheim and Lim (1981) showed that natural images stay recognizable when reconstructed from their Fourier phase alone, while the magnitude carries little of their identity. We ask whether trained image classifiers reproduce this asymmetry inside their hidden layers, and we test it causally: given two images, we transplant the phase of one onto the magnitude of the other at a chosen layer and record which image the prediction follows. In PRISM2D, GFNet, and ViT-B/16 the prediction follows the phase or sign donor, and deleting all image-specific magnitude barely moves accuracy, so identity rides on phase while image-specific magnitude is largely dispensable to the readout. ResNet-50 at first seems to break the pattern, because transplanting sign after its ReLUs does nothing; a fair intervention before the ReLU reveals a strong latent sign code in the late blocks, and a DC-only control shows the readout consumes a channel-wise spatial average. Controls rule out the trivial case in which magnitude simply stops depending on the image. The architectures therefore share a phase/sign identity code but expose it in different bases, set by rectification and readout geometry, which gives a mechanistic account of the texture--shape gap between CNNs and attention models.
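The transplant operation at the heart of this test can be sketched in a few lines. The example below is a simplification under stated assumptions: it works on raw 1D signals with a naive DFT, whereas the paper intervenes on hidden-layer activations of trained networks. The function names are hypothetical.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Inverse of dft."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def transplant_phase(phase_donor, magnitude_donor):
    """Combine the Fourier phase of one signal with the magnitude of another."""
    fa = dft(phase_donor)
    fb = dft(magnitude_donor)
    hybrid = [abs(b) * cmath.exp(1j * cmath.phase(a)) for a, b in zip(fa, fb)]
    # For real inputs the hybrid spectrum stays conjugate-symmetric,
    # so the imaginary part of the inverse transform is numerical noise.
    return [v.real for v in idft(hybrid)]


a = [1.0, 3.0, -2.0, 0.5, 4.0, -1.0, 2.0, 0.0]
b = [0.0, 1.0, 0.0, -1.0, 2.0, 2.0, -3.0, 1.0]
hybrid = transplant_phase(a, b)
```

In the experiment described above, the analogous question is which donor a classifier's prediction follows once such a hybrid replaces the activations at a chosen layer.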

2606.15117 2026-06-16 cs.MM cs.AI cs.CV cs.LG cs.SD 交叉投稿

Teacher-Student Structure for Domain Adaptation in Ensemble Audio-Visual Video Deepfake Detection

用于集成视听视频深度伪造检测中领域适应的师生结构

Elham Abolhasani, Maryam Ramezani, Hamid R. Rabiee

Affiliations * Department of Computer Engineering, Sharif University of Technology

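AI summary Proposes EAV-DFD, an ensemble audio-visual deepfake detector with a teacher-student domain adaptation mechanism that improves generalization to unseen domains, raising AUC by 4.09%, 17.94%, and 0.5% on three datasets.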

详情
AI中文摘要

生成式AI模型的快速发展导致了更逼真的深度伪造媒体,包括对音频、视频或两者的操纵。这引发了严重的隐私和社会问题。该领域的许多研究已经取得了有前景的域内结果;然而,这些模型在面对来自不同领域的数据时,其有效性常常下降。因此,最近的深度伪造检测方法侧重于通过多种技术增强泛化能力,这些技术融合了所有输入模态,包括音频、图像及其交互。为此,我们提出了EAV-DFD方法,一种广义的深度集成视听模型(EAV-DFD),结合了利用师生框架的领域适应机制,以增强模型在未见领域上的表现和泛化能力。为了评估模型性能,我们使用FakeAVCeleb数据集作为主领域,DFDC、Deepfake_TIMIT和PolyGlotFake数据集作为未见领域。我们的实验结果表明,所提出的框架在领域适应方面是有效的,仅使用一小部分未见数据集训练学生模型,就在三个未见数据集上分别将模型的AUC性能提升了4.09%、17.94%和0.5%。这产生了一种新颖的深度伪造检测模型,能够适应新领域并解释哪个模态被操纵,突显了我们的方法在现实世界应用中的潜力。

英文摘要

The rapid advancement of generative AI models is leading to more realistic deepfake media, encompassing the manipulation of audio, video, or both. This raises severe privacy and societal concerns. Numerous studies in this area have yielded promising intra-domain results; however, these models frequently lose efficacy when faced with data from dissimilar domains. Consequently, recent deepfake detection approaches focus on enhancing generalization through multiple techniques that incorporate all input modalities, including audio, images, and their interactions. In this regard, we propose EAV-DFD, a generalized deep ensemble audio-visual model combined with a domain adaptation mechanism that uses a teacher-student framework to enhance the model's ability to perform and generalize across unseen domains. To evaluate the model's performance, we used the FakeAVCeleb dataset as the primary domain and the DFDC, Deepfake_TIMIT, and PolyGlotFake datasets as unseen domains. Our experimental results demonstrate that the proposed framework is effective for domain adaptation, improving the model's AUC by 4.09%, 17.94%, and 0.5% on the three unseen datasets while using only a small portion of each to train the student model. This leads to a novel deepfake detection model capable of adapting to new domains and interpreting which modality has been manipulated, highlighting the potential of our approach for real-world applications.
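The abstract does not spell out how the teacher supervises the student. One common teacher-student formulation is temperature-scaled knowledge distillation, sketched below as an assumption rather than as the EAV-DFD training loss; `kd_loss` and the temperature value are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature."""
    m = max(logits)
    exps = [math.exp((z - m) / temperature) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return temperature ** 2 * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical real/fake logits for one clip from an unseen domain.
teacher = [2.0, -1.0]
student = [0.5, 0.3]
print(kd_loss(student, teacher))
```

In a domain adaptation setting, a loss of this kind would be evaluated on the small target-domain subset used to train the student.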

2606.15993 2026-06-16 cs.CY cs.CV 交叉投稿

Classifying by Proxy: Explainable and Reproducible Ensemble of Proxy Tasks for Child Sexual Abuse Imagery Classification

通过代理任务分类:用于儿童性虐待图像分类的可解释且可复现的代理任务集成

Clara Ernesto, Carlos Caetano, Sandra Avila, João Macedo, Camila Laranjeira, Leo S. F. Ribeiro

Affiliations * Instituto de Ciências Matemáticas e de Computação (ICMC), Universidade Estadual de São Paulo (USP); Instituto de Computação (IC), Universidade Estadual de Campinas (UNICAMP); Departamento de Ciência da Computação, Universidade Federal de Minas Gerais (UFMG); Instituto Federal de Educação, Ciência e Tecnologia de Minas Gerais (IFMG)

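AI summary Proposes an ensemble of proxy tasks for child sexual abuse imagery classification that improves reproducibility, explainability, and security for distribution while reaching 91.9% balanced accuracy on the RCPD dataset.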

Comments 12 pages, 7 figures, 7 tables. Accepted at ACM FAccT 2026

详情
AI中文摘要

儿童性虐待图像(CSAI)分类系统是减轻执法人员评估这些材料时常承受的心理影响以及从网络上高效移除这些材料的必要解决方案。然而,由于任务的性质,研究和开发此类系统并非易事。图像高度敏感,相关数据集受到严格的访问限制,这意味着该领域的大多数研究无法复现或分发,因此难以比较和验证。更令人担忧的是,目前用于此任务的大多数模型缺乏执法人员经常期望的一个方面:可解释性。在本文中,我们应用了代理任务集成——与CSAI分类相关的任务——在可复现性、可解释性和分发安全性方面取得了改进。这一概念首次应用于真实的CSAI,通过从CSAI文献中选择相关代理任务并对原始框架进行训练调整。我们的最终模型取得了有竞争力的结果,在RCPD数据集上使用最佳代理任务组合实现了91.9%的平衡准确率。此外,我们将这些结果与同类最佳表示学习模型DINO进行了对比,表明我们的集成提高了准确性,并为其分类结果提供了解释,这是单个深度学习模型很少能提供的特性。

英文摘要

Child Sexual Abuse Imagery (CSAI) classification systems are needed solutions for lessening the psychological impacts often felt by law enforcement agents responsible for evaluating these materials and for efficient removal of these materials from the web. However, due to the nature of the task, researching and developing such systems is not a trivial endeavor. The images are highly sensitive, and the related datasets are under restrictive access regimes, which means most studies in the area are not reproducible or distributable and are therefore hard to compare and validate. More concerning still, most models for this task today lack an aspect often desired by law enforcement agents: explainability. In this paper, we apply an ensemble of Proxy Tasks -- tasks that correlate to CSAI classification -- yielding improvements in reproducibility, explainability, and security for distribution. This concept is applied for the first time to real CSAI, with a novel selection of relevant Proxy Tasks (selected from the CSAI literature) and training adaptations to the original framework. Our final model achieves competitive results, yielding 91.9% balanced accuracy on the RCPD dataset with the best Proxy Task combination. We furthermore contrast these results with the best-in-class representation learning model, DINO, and show that our ensemble improves accuracy and provides explanations for its classification results, a feature that a single deep learning model can seldom provide.

2606.16196 2026-06-16 cs.LG cs.CV 交叉投稿

When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations

当置信度缺乏概念:通过表示扰动实现可解释的OOD检测

Anju Chhetri, Pratik Shrestha, Ramesh Rana, Prashnna Gyawali, Binod Bhattarai

Affiliations * NepAl Applied Mathematics and Informatics Institute for research; West Virginia University; Kathmandu University; University College London; University of Aberdeen

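AI summary Proposes an interpretable OOD detection framework based on class-conditioned semantic perturbations and sparse autoencoders, which detects OOD inputs and explains the underlying internal mechanisms by analyzing representation stability.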

详情
AI中文摘要

深度神经网络在医学影像任务中取得了显著性能,但其在分布偏移下过度泛化的倾向对安全临床部署构成了主要障碍。分布外(OOD)检测方法旨在缓解这一风险,但现有方法大多依赖语义含义理解不足的不透明内部信号,限制了在安全关键场景中的信任。本文提出一种可解释的OOD检测框架,该框架通过类条件语义扰动探测模型预测的稳定性。利用稀疏自编码器(SAE),我们从分布内数据中学习类特定概念向量,将密集的中间表示解耦为稀疏、语义有意义的组件。在推理时,我们使用与模型预测类别相关的概念向量扰动深层表示,并测量类别logits的稳定性。我们假设分布内样本对此类扰动表现出低敏感性,因为其表示与类特定语义方向对齐,而OOD样本由于表示错位而显示出放大的偏差。通过将OOD检测框架为概念条件稳定性分析,我们的方法既提供了判别性OOD信号,又提供了驱动模型不确定性的内部机制的可解释视角,使其特别适用于高风险医学应用。

英文摘要

Deep neural networks have achieved remarkable performance across medical imaging tasks, yet their tendency to overgeneralize under distributional shifts poses a major obstacle to safe clinical deployment. Out-of-Distribution (OOD) detection methods aim to mitigate this risk, but most existing approaches rely on opaque internal signals with poorly understood semantic meaning, limiting trust in safety-critical settings. In this work, we propose an interpretable OOD detection framework that probes the stability of model predictions under class-conditioned semantic perturbations. Leveraging sparse autoencoders (SAEs), we learn class-specific concept vectors from in-distribution data that disentangle dense intermediate representations into sparse, semantically meaningful components. At inference, we perturb deeper-layer representations using the concept vectors associated with the model's predicted class and measure the stability of the class logits. We hypothesize that in-distribution samples exhibit low sensitivity to such perturbations, as their representations align with class-specific semantic directions, whereas OOD samples show amplified deviations due to representational misalignment. By framing OOD detection as a concept-conditioned stability analysis, our approach provides both a discriminative OOD signal and an interpretable lens into the internal mechanisms driving model uncertainty, making it particularly suitable for high-stakes medical applications.
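The inference-time procedure (perturb along the predicted class's concept vector, then measure how far the logits move) can be sketched as follows. Everything concrete here is an assumption for illustration: the two-layer ReLU head, the hand-written weights and concept vectors, the additive perturbation, and the L2 instability score. In the paper the concept vectors come from a sparse autoencoder, which is not reproduced here.

```python
import math


def matvec(weights, x):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def head(h, w1, w2):
    """Tiny ReLU classifier head standing in for the layers after the probe point."""
    hidden = [max(0.0, v) for v in matvec(w1, h)]
    return matvec(w2, hidden)


def instability_score(h, concepts, w1, w2, alpha=1.0):
    """L2 change in logits after pushing h along its predicted class's concept vector."""
    logits = head(h, w1, w2)
    pred = max(range(len(logits)), key=logits.__getitem__)
    perturbed = [hi + alpha * ci for hi, ci in zip(h, concepts[pred])]
    moved = head(perturbed, w1, w2)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(logits, moved)))


# Hypothetical weights and one concept vector per class.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W2 = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
CONCEPTS = [[1.0, 0.0], [0.0, 1.0]]

score = instability_score([2.0, 0.1], CONCEPTS, W1, W2)
print(score)
```

A nonlinear head is used on purpose: with a purely linear head the logit shift would be the same for every input and could not separate in-distribution from OOD samples. A detector would threshold this score, with the threshold chosen on held-out in-distribution data.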

2606.16535 2026-06-16 cs.LG cs.CV cs.SC 交叉投稿

Assessing Reliability of Symbol Detection in Concept Bottleneck Models

评估概念瓶颈模型中符号检测的可靠性

Javier Fumanal-Idocin, Javier Andreu-Perez

Affiliations * University of Essex

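AI summary Studies the reliability of symbol detection in concept bottleneck models (CBMs): swaps independently trained concept detectors and classification heads to identify concepts prone to spurious firing, and proposes a reliability-aware training strategy validated on CUB-200-2011 and a synthetic task.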

详情
AI中文摘要

概念瓶颈模型(CBM)是可解释人工智能的相关工具,因为它们通过人类可解释的符号进行预测。然而,高任务准确率并不能保证这些符号被忠实地检测到:联合训练的CBM可能在瓶颈中编码任务特定的捷径,使其解释不可靠。在本文中,我们通过交换共享相同符号词汇的独立训练的概念检测器和分类头来研究概念检测的可靠性。我们利用由此产生的性能下降、概念级指标和符号级不确定性估计来识别特别容易发生虚假激活的概念。最后,我们提出了一种可靠性感知训练策略,其中共享的概念检测器通过多个分类头进行优化,并因依赖全局或实例级不可靠符号而受到惩罚。在具有完整概念监督的CUB-200-2011上,检测器和头几乎可以自由互换(交换下降低于一个准确率点,相对保留率高于99%,且没有概念检测低于随机水平),而在受控的合成任务上,我们表明,随着概念监督权重的减少,模型保持近乎完美的任务准确率,而交换准确率和与真实概念的一致性下降到随机水平。我们的可靠性感知训练显著缓解了这种泄漏,在泄漏情况下大致使交换准确率翻倍。

英文摘要

Concept Bottleneck Models (CBMs) are a relevant tool for explainable Artificial Intelligence because they make their predictions through human-interpretable symbols. However, high task accuracy does not guarantee that these symbols are detected faithfully: jointly trained CBMs may encode task-specific shortcuts in the bottleneck, making their explanations unreliable. In this paper, we study concept-detection reliability by swapping independently trained concept detectors and classification heads that share the same symbolic vocabulary. We use the resulting performance degradation, concept-level metrics, and symbol-wise uncertainty estimates to identify concepts that are especially prone to spurious firing. Finally, we propose a reliability-aware training strategy in which a shared concept detector is optimized with multiple classification heads and penalized for relying on globally or instance-wise unreliable symbols. On CUB-200-2011 with full concept supervision, detectors and heads are almost freely interchangeable (swap drop below one accuracy point, relative retention above 99%, and no concept detected below chance), whereas on a controlled synthetic task we show that, as the concept-supervision weight is reduced, models keep near-perfect task accuracy while swapped accuracy and agreement with the ground-truth concepts collapse to chance. Our reliability-aware training substantially mitigates this leakage, roughly doubling swap accuracy in the leaky regime.
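The swap test can be illustrated with a toy in which a "leaky" detector hides the label in a concept slot. The detectors, heads, and dataset below are hand-written stand-ins (assumptions for illustration), not trained models, and "relative retention" is taken here to mean swapped accuracy divided by native accuracy.

```python
# Toy swap test for concept bottleneck models (CBMs).
# Each input carries two ground-truth binary concepts; the label is their AND.
DATA = [((c1, c2), c1 & c2) for c1 in (0, 1) for c2 in (0, 1)]


def faithful_detector(x):
    """Reports the true concepts."""
    return x


def faithful_head(concepts):
    """Predicts from the concepts as they are meant to be read."""
    return concepts[0] & concepts[1]


def leaky_detector(x):
    """Smuggles the label into concept slot 0 instead of detecting concepts."""
    return (x[0] & x[1], 0)


def leaky_head(concepts):
    """Reads the smuggled label back out of slot 0."""
    return concepts[0]


def accuracy(detector, head):
    return sum(head(detector(x)) == y for x, y in DATA) / len(DATA)


native = accuracy(leaky_detector, leaky_head)
swapped = (accuracy(leaky_detector, faithful_head)
           + accuracy(faithful_detector, leaky_head)) / 2
retention = swapped / native
print(native, swapped, retention)
```

The jointly "trained" leaky pair is perfect on the task, yet it loses accuracy as soon as either half is paired with a component that reads the concepts at face value, which is the signature the swap test looks for.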

2501.01908 2026-06-16 cs.CV cs.LG eess.IV physics.med-ph 版本更新

Training-Free Adversarial Robustness in Computational MRI

计算MRI中无需训练的对抗鲁棒性

Mahdi Saberi, Chi Zhang, Mehmet Akçakaya

发表机构 * arXiv

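AI summary Proposes a retraining-free method for mitigating adversarial attacks on MRI reconstruction models: building on cyclic measurement consistency, it minimizes a mitigation objective within a small neighborhood of the attacked input, substantially reducing the impact of adversarial perturbations.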

Comments International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

深度学习方法已成为重建欠采样磁共振成像数据的最先进技术。然而,研究表明这些方法易受小的对抗输入扰动影响,导致输出图像出现严重失真。已有多种策略被提出以减少这些攻击的影响,但它们需要重新训练。在这项工作中,我们提出了一种新颖的方法,无需任何重训练即可缓解MRI重建模型上的对抗攻击。基于循环测量一致性的思想,我们设计了一个新颖的缓解目标,在攻击输入周围的小球内最小化该目标。结果表明,我们的方法在不同数据集、攻击类型/强度以及PD-DL网络上显著降低了对抗扰动的影响,并在定性和定量上优于传统的缓解方法。我们还引入了一个实际相关的小对抗扰动场景,该场景模拟原始数据中的脉冲噪声(与人字形伪影相关),并展示了我们的方法在此设置中的适用性。最后,我们展示了我们的缓解方法在两种现实扩展场景中仍然有效:盲设置(用户不知道攻击强度或算法)和自适应攻击设置(攻击者完全了解防御策略)。

英文摘要

Deep learning (DL) methods have become the state-of-the-art for reconstructing sub-sampled magnetic resonance imaging (MRI) data. However, studies have shown that these methods are susceptible to small adversarial input perturbations, resulting in major distortions in the output images. Various strategies have been proposed to reduce the effects of these attacks, but they require retraining. In this work, we propose a novel approach for mitigating adversarial attacks on MRI reconstruction models without any retraining. Based on the idea of cyclic measurement consistency, we devise a novel mitigation objective that is minimized in a small ball around the attack input. Results show that our method substantially reduces the impact of adversarial perturbations across different datasets, attack types/strengths and PD-DL networks, and qualitatively and quantitatively outperforms conventional mitigation methods. We also introduce a practically relevant scenario for small adversarial perturbations that models impulse noise in raw data, which relates to herringbone artifacts, and show the applicability of our approach in this setting. Finally, we show our mitigation approach remains effective in two realistic extension scenarios: a blind setup, where the attack strength or algorithm is not known to the user; and an adaptive attack setup, where the attacker has full knowledge of the defense strategy.

2507.02288 2026-06-16 cs.CV cs.LG 版本更新

Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization

基于语言引导与表示对齐的提示解缠用于域泛化

De Cheng, Zhipeng Xu, Xinyang Jiang, Dongsheng Li, Nannan Wang, Xinbo Gao

Affiliations * School of Telecommunications Engineering, the State Key Laboratory of Integrated Services Networks (ISN), Xidian University, Xi’an, China; Microsoft Research Asia, Shanghai, China

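AI summary Proposes using a large language model to automatically disentangle text prompts and introduces Worst Explicit Representation Alignment, which adds abstract prompts to increase source-domain diversity, to learn domain-invariant visual representations that outperform existing methods on multiple benchmarks.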

详情
Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 6, pp. 6799-6816, June 2026
AI中文摘要

域泛化(DG)旨在开发一个能够在未见过的目标域上有效执行的通用模型。值得注意的是,预训练视觉基础模型(VFM)如CLIP的最新进展,已显示出增强深度学习模型泛化能力的巨大潜力。尽管基于VFM的域提示调整在DG中受到越来越多的关注,但设计能够解缠跨域不变特征的提示仍然是一个关键挑战。在本文中,我们提出通过利用VFM的可控且灵活的语言提示来解决这一挑战。注意到VFM的文本模态自然更容易解缠,我们引入了一个新颖的文本特征引导的视觉提示调整框架。该框架首先使用大语言模型(LLM)自动解缠文本提示,然后学习由解缠文本特征引导的域不变视觉表示。然而,仅依赖语言来引导视觉特征解缠存在局限性,因为视觉特征有时可能过于复杂或微妙,难以被描述性文本完全捕捉。为解决这一问题,我们引入了最差显式表示对齐(WERA),它通过添加一组额外的抽象提示来扩展文本引导的视觉提示。这些提示通过风格化图像增强来增强源域多样性,而对齐约束确保视觉表示在原始分布和增强分布上保持一致。在包括PACS、VLCS、OfficeHome、DomainNet和TerraInc在内的主要DG数据集上进行的实验表明,我们提出的方法优于最先进的DG方法。

英文摘要

Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Notably, recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. Despite the increasing attention toward VFM-based domain prompt tuning within DG, the effective design of prompts capable of disentangling invariant features across diverse domains remains a critical challenge. In this paper, we propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM. Noting that the text modality of VFMs is naturally easier to disentangle, we introduce a novel framework for text feature-guided visual prompt tuning. This framework first automatically disentangles the text prompt using a large language model (LLM) and then learns domain-invariant visual representation guided by the disentangled text feature. However, relying solely on language to guide visual feature disentanglement has limitations, as visual features can sometimes be too complex or nuanced to be fully captured by descriptive text. To address this, we introduce Worst Explicit Representation Alignment (WERA), which extends text-guided visual prompts by incorporating an additional set of abstract prompts. These prompts enhance source domain diversity through stylized image augmentations, while alignment constraints ensure that visual representations remain consistent across both the original and augmented distributions. Experiments conducted on major DG datasets, including PACS, VLCS, OfficeHome, DomainNet, and TerraInc, demonstrate that our proposed method outperforms state-of-the-art DG methods.

2511.20710 2026-06-16 cs.CV cs.AI cs.CR 版本更新

Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?

受神经启发的多模态视觉-语言模型对成员推断隐私泄露是否具有弹性?

David Amebley, Sayanton Dibbo

Affiliations * The University of Alabama; Alabama Center for the Advancement of AI; Trustworthy AI Lab; Department of Computer Science, The University of Alabama

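AI summary Studies the resilience of neuro-inspired multi-modal vision-language models (VLMs) to image-text-based membership inference attacks and proposes a topological regularization framework; experiments show that neuro VLMs markedly lower attack success while preserving model utility.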

Comments Accepted at USENIX WOOT '26

详情
AI中文摘要

在智能体AI时代,多模态模型(MMs)的日益部署引入了新的攻击向量,可能导致MMs中敏感训练数据泄露,造成隐私泄露。本文研究了一种黑盒隐私攻击,即对多模态视觉-语言模型(VLMs)的成员推断攻击(MIA)。最先进的研究主要分析单模态AI-ML系统的隐私攻击,而最近的研究表明MMs也可能易受隐私攻击。尽管研究人员已证明生物启发的神经网络表示可以提高单模态模型对对抗攻击的弹性,但受神经启发的MMs是否对隐私攻击具有弹性仍未被探索。在这项工作中,我们引入了一个系统的神经科学启发的拓扑正则化(τ)框架,以分析MM VLMs对基于图像-文本的推断隐私攻击的弹性。我们使用三个VLM:BLIP、PaliGemma 2和ViT-GPT2,在三个基准数据集:COCO、CC3M和NoCaps上检验了这一现象。我们的实验比较了基线VLM和神经VLM(带有拓扑正则化)的弹性,其中τ>0配置定义了VLM的NEURO变体。我们在COCO数据集上使用BLIP模型的结果表明,NEURO VLM中MIA攻击成功率平均下降24%的ROC-AUC,同时在MPNet和ROUGE-2指标上实现了相似的模型效用(生成字幕与参考字幕之间的相似性)。这表明神经VLM相对更具隐私攻击弹性,同时不会显著牺牲模型效用。我们使用PaliGemma 2和ViT-GPT2模型在另外两个数据集CC3M和NoCaps上的广泛评估进一步验证了发现的一致性。这项工作有助于加深对MMs中隐私风险的理解,并为神经VLM的隐私威胁弹性提供了证据。

英文摘要

In the age of agentic AI, the growing deployment of multi-modal models (MMs) has introduced new attack vectors that can leak sensitive training data, causing privacy leakage. This paper investigates a black-box privacy attack, namely the membership inference attack (MIA), on multi-modal vision-language models (VLMs). State-of-the-art research analyzes privacy attacks primarily on unimodal AI-ML systems, while recent studies indicate that MMs can also be vulnerable to privacy attacks. Although researchers have demonstrated that biologically inspired neural network representations can improve unimodal model resilience against adversarial attacks, it remains unexplored whether neuro-inspired MMs are resilient against privacy attacks. In this work, we introduce a systematic neuroscience-inspired topological regularization (tau) framework to analyze the resilience of multi-modal VLMs against image-text-based inference privacy attacks. We examine this phenomenon using three VLMs (BLIP, PaliGemma 2, and ViT-GPT2) across three benchmark datasets (COCO, CC3M, and NoCaps). Our experiments compare the resilience of baseline and neuro VLMs (with topological regularization), where the tau > 0 configuration defines the NEURO variant of a VLM. Our results on the BLIP model using the COCO dataset show that MIA attack success in NEURO VLMs drops by 24% mean ROC-AUC, while achieving similar model utility (similarity between generated and reference captions) in terms of MPNet and ROUGE-2 metrics. This shows that neuro VLMs are comparatively more resilient against privacy attacks without significantly compromising model utility. Our extensive evaluation with the PaliGemma 2 and ViT-GPT2 models on two additional datasets, CC3M and NoCaps, further validates the consistency of the findings. This work contributes to the growing understanding of privacy risks in MMs and provides evidence of the resilience of neuro VLMs to privacy threats.

2603.17531 2026-06-16 cs.CV cs.AI cs.CR 版本更新

Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing

Rel-Zero:利用补丁对不变性实现鲁棒的零水印以抵御AI编辑

Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Xiaojun Chen, Wu Liu, Weiping Wang

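AI summary Addresses the threat AI editing poses to image authenticity with Rel-Zero, a zero-watermarking framework that exploits the invariance of relational distances between patch pairs under editing to derive a robust watermark without modifying the original image; experiments show it outperforms existing methods.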

Comments accepted to CVPR 2026

详情
AI中文摘要

近期基于扩散的图像编辑技术的进步对数字视觉内容的真实性构成了重大威胁。传统的基于嵌入的水印方法通常引入可察觉的扰动以保持鲁棒性,不可避免地损害视觉保真度。同时,现有的零水印方法通常依赖全局图像特征,难以抵御复杂的操作。在这项工作中,我们揭示了一个关键观察:尽管在基于AI的编辑过程中单个图像补丁发生显著变化,但补丁对之间的关系距离保持相对不变。利用这一特性,我们提出了关系零水印(Rel-Zero),一种新颖的框架,无需对原始图像进行任何修改,而是从这些编辑不变的补丁关系中推导出唯一的零水印。通过将水印基于内在的结构一致性而非绝对外观,Rel-Zero为内容认证提供了一种非侵入性且具有弹性的机制。大量实验表明,与先前的零水印方法相比,Rel-Zero在多种编辑模型和操作下实现了显著提升的鲁棒性。

英文摘要

Recent advancements in diffusion-based image editing pose a significant threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to maintain robustness, inevitably compromising visual fidelity. Meanwhile, existing zero-watermarking approaches, typically relying on global image features, struggle to withstand sophisticated manipulations. In this work, we uncover a key observation: while individual image patches undergo substantial alterations during AI-based editing, the relational distance between patch pairs remains relatively invariant. Leveraging this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that requires no modification to the original image but derives a unique zero-watermark from these editing-invariant patch relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive yet resilient mechanism for content authentication. Extensive experiments demonstrate that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches.
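The idea of deriving a watermark from patch-pair relations, rather than embedding one, can be sketched with a toy. The assumptions are loud here: patches are described by their mean intensity instead of learned features, every patch pair is used, bits come from thresholding pairwise distances at their median, and the "edit" is a uniform brightness shift. None of these choices is taken from the paper.

```python
from itertools import combinations
from statistics import median


def patch_means(img, size):
    """Mean intensity of each non-overlapping size x size patch (row-major order)."""
    means = []
    for r in range(0, len(img), size):
        for c in range(0, len(img[0]), size):
            block = [img[i][j] for i in range(r, r + size) for j in range(c, c + size)]
            means.append(sum(block) / len(block))
    return means


def zero_watermark(img, size=2):
    """Bit string from pairwise patch distances; the image itself is never modified."""
    feats = patch_means(img, size)
    dists = [abs(a - b) for a, b in combinations(feats, 2)]
    threshold = median(dists)
    return [1 if d > threshold else 0 for d in dists]


def bit_agreement(w1, w2):
    """Fraction of matching bits between a registered and a re-derived watermark."""
    return sum(a == b for a, b in zip(w1, w2)) / len(w1)


image = [[10, 12, 200, 202],
         [11, 13, 201, 203],
         [50, 52, 90, 94],
         [51, 53, 92, 96]]
edited = [[p + 10 for p in row] for row in image]  # every patch changes, relations do not

registered = zero_watermark(image)  # stored externally at registration time
score = bit_agreement(registered, zero_watermark(edited))
print(registered, score)
```

Verification then amounts to re-deriving the bits from a suspect image and comparing them with the registered copy; a high agreement suggests the same source image.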

2603.24058 2026-06-16 cs.CV cs.AI 版本更新

Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification

通过注意力不平衡修正减轻LVLM中的对象幻觉

Han Sun, Qin Li, Peixin Wang, Min Zhang

发表机构 * Shanghai Key Laboratory of Trustworthy Computing, East China Normal University(上海市高可信计算重点实验室,华东师范大学)

AI总结 发现多模态和token间注意力不平衡是对象幻觉的因果因素,提出轻量级解码干预方法AIR,通过重新分配注意力权重修正不平衡,在多个基准上减少幻觉达35.1%,并提升通用能力。

Comments CVPR 2026 Findings Track, code is available at https://github.com/Ice-wave/AIR

详情
AI中文摘要

大型视觉-语言模型(LVLMs)中的对象幻觉严重损害了其在现实应用中的可靠性,对它们在自动驾驶和医学图像分析等高风险场景中的部署构成了关键障碍。通过系统的实证研究,我们发现跨模态(即视觉和语言)和模态内(单个token之间)的不平衡注意力分配与对象幻觉的发生存在强因果相关性。利用这一洞察,我们引入了一个新概念——注意力不平衡,它不仅量化了注意力差异的程度,还直观地描绘了驱动对象幻觉的潜在模式(例如,对无关语言token的过度关注或对判别性视觉特征的关注不足)。为了减轻对象幻觉,我们进一步提出了注意力不平衡修正(AIR),这是一种轻量级的解码时干预方法,通过重新分配注意力权重和调整注意力分布来修正模态级和token级的不平衡。在四个主流LVLM和三个基准(CHAIR、POPE和MM-Vet)上,与七个基线进行的大量评估表明,AIR持续降低对象幻觉率,与基线相比最高减少35.1%,同时在多种视觉-语言任务中提升LVLMs的通用能力高达15.9%。

英文摘要

Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications, posing a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we find that imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal correlation with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) with seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving LVLMs' general capability by up to 15.9% across diverse vision-language tasks.

2605.00591 2026-06-16 cs.CV 版本更新

Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models

视觉语言模型中标签噪声提示调优的内在梯度抑制

Jiayu Li, Jiaxin Qi, Sheng Zhou, Jiaqiang Huang, Xiansheng Hua

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 提出双Softmax提示调优(DSPT),通过内在梯度抑制机制自适应压制高错误噪声样本的梯度,实现标签噪声下的鲁棒提示调优。

详情
AI中文摘要

对比视觉语言模型如CLIP展现出显著的零样本泛化能力。然而,提示调优对标签噪声高度敏感,因为错误标记的样本会产生不成比例的大梯度,可能压倒预训练先验。我们认为,由于CLIP已经提供了接近最优的初始化,适应过程应本质上是保守的,特别是在噪声设置中常见的极端梯度更新情况下。为此,我们提出了双Softmax提示调优(DSPT),一种无需超参数的内在梯度抑制方法。通过应用顺序概率归一化,DSPT诱导出一个自适应饱和区,该区域抑制来自高错误噪声样本的梯度,同时保持信息性更新。我们还提供了关于该机制如何实现自适应抑制的理论分析和实证证据。这种设计将传统上作为训练瓶颈的“梯度消失”转化为标签噪声提示调优的原则性噪声过滤盾牌。大量实验证实,这种简单、即插即用的设计在各种噪声基准上实现了最先进的鲁棒性,优于具有复杂架构和手工调整超参数的方法。

英文摘要

Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings. To this end, we propose Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method for intrinsic gradient suppression. By applying a sequential probabilistic normalization, DSPT induces a self-adaptive saturation zone that suppresses gradients from high-error noisy samples while maintaining informative updates. We also provide both theoretical analysis and empirical evidence about how this mechanism achieves adaptive suppression. This design transforms "gradient vanishing", traditionally a training bottleneck, into a principled noise-filtering shield for label-noise prompt tuning. Extensive experiments confirm that this simple, drop-in design achieves state-of-the-art robustness across various noisy benchmarks, outperforming methods with complex architectures and handcrafted hyperparameters.
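The saturation effect can be checked numerically with a small sketch. It assumes that "sequential probabilistic normalization" means applying softmax twice before the cross-entropy loss; the abstract does not give the exact formulation, and the helper names below are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def grad_single(logits, label):
    """Gradient of -log softmax(z)[label] w.r.t. z, i.e. p - onehot."""
    p = softmax(logits)
    return [pi - (1.0 if i == label else 0.0) for i, pi in enumerate(p)]

def grad_double(logits, label):
    """Gradient of -log softmax(softmax(z))[label] w.r.t. z (assumed DSPT-like loss).

    Chain rule: dL/dp = q - onehot, and the softmax Jacobian-vector product
    is p * (g - <p, g>), which shrinks as p approaches a one-hot vector.
    """
    p = softmax(logits)
    q = softmax(p)
    g = [qi - (1.0 if i == label else 0.0) for i, qi in enumerate(q)]
    dot = sum(pi * gi for pi, gi in zip(p, g))
    return [pi * (gi - dot) for pi, gi in zip(p, g)]

def norm(v):
    return math.sqrt(sum(x * x for x in v))
```

For a confidently wrong sample such as logits `[10, 0, 0]` with label 1, the single-softmax gradient norm is close to its maximum, while the double-softmax gradient nearly vanishes. For a mildly wrong sample such as `[1, 0, 0]`, the double-softmax gradient remains clearly non-zero, so informative updates are kept.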

2606.00435 2026-06-16 cs.CV cs.AI 版本更新

Detect Before You Leap: Mirage Detection in Vision-Language Models

在跳跃前检测:视觉语言模型中的幻象检测

Sayeed Shafayet Chowdhury, Md. Shaown Miah, S. M. Taiabul Haque, Syed Ishtiaque Ahmed

发表机构 * Indiana University Indianapolis(印第安纳大学印第安纳波利斯分校) Bangladesh University of Engineering and Technology(孟加拉工程与技术大学)

AI总结 针对视觉语言模型在缺乏视觉证据时产生自信但无根据回答的幻象问题,提出文本条件层内对齐方法,通过分析视觉编码器各层补丁令牌与问题嵌入的对齐轨迹,结合像素统计、零样本域路由和结构化自评估,实现高精度预响应幻象检测。

详情
AI中文摘要

视觉语言模型(VLM)即使在所需视觉证据缺失、空白或与问题无关时,也能产生自信的视觉答案。这种失败模式被称为幻象(Asadi et al. 2026),在医学和文档视觉问答中尤其令人担忧,因为看似合理但缺乏视觉依据的响应可能被误认为是基于图像的证据。我们研究预发布幻象检测:给定图像-问题对,目标是在VLM生成响应之前确定其应回答还是弃权。我们提出文本条件层内对齐(TC-LIA),一种模型无关的方法,探测CLIP ViT-H/14视觉编码器各层的补丁令牌表示。TC-LIA将逐层图像补丁令牌投影到最终CLIP嵌入空间,并测量它们与问题嵌入的相似度,从而跟踪问题相关视觉证据是否在视觉层中出现。得到的对齐轨迹通过最终图像-文本余弦相似度、后期层top-k补丁-文本对齐、早期到后期增益和逐层斜率进行总结。这些特征与像素统计空白/噪声检测、零样本域路由和结构化VLM自评估相结合,形成一个集成系统。在五个VQA领域、三种输入条件和十二个VLM骨干网络上,最佳系统实现了约94.6-94.7%的三类检测准确率,幻象率低于3%,而基线幻象率范围为21.7%至66.6%。

英文摘要

Vision-language models (VLMs) can produce confident visual answers even when the required visual evidence is missing, blank, or unrelated to the question. This failure mode, recently described as mirage (Asadi et al., 2026), is especially concerning in medical and document VQA, where a plausible but visually ungrounded answer may be mistaken for image-based evidence. We study the complementary problem of pre-release mirage detection: given an image-question pair, determine whether the VLM should answer or abstain before generation. To that end, we propose a novel model-agnostic Text-Conditioned Layer-wise Internal Alignment (TC-LIA) method that probes patch-token representations across the layers of a CLIP ViT-H/14 vision encoder. The key idea is to project layer-wise image patch tokens into the final CLIP embedding space and measure their similarity with the question embedding, thereby tracking whether question-relevant visual evidence emerges across vision layers. TC-LIA summarizes this alignment trajectory using final image-text cosine similarity, late-layer top-k patch-text alignment, early-to-late gain, and layer-wise slope. These features are combined with pixel-statistic-based blank/noise detection, zero-shot domain routing, and structured VLM self-assessment in an ensemble. Across five VQA domains with related, unrelated-real, and blank/noise inputs, and across twelve VLM backbones, Qwen2.5-VL-32B achieves the highest three-class detection accuracy of 94.7% with a 3.0% mirage rate, while Qwen2.5-VL-72B achieves 94.6% accuracy with a lower 2.8% mirage rate. Baseline mirage rates span 21.7-66.6%.
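The four trajectory summaries named in the abstract can be sketched on precomputed embeddings. This is a rough illustration under stated assumptions: patch tokens are taken as already projected into the joint embedding space, the top-k mean per layer, the "early" and "late" layer windows, and the least-squares slope are all guesses at reasonable definitions, and `alignment_features` is a hypothetical name.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def alignment_features(layer_patches, image_emb, text_emb, k=5, late=4):
    """Summarize a layer-wise patch-text alignment trajectory.

    layer_patches: (L, P, D) patch tokens, assumed already projected.
    image_emb, text_emb: (D,) final image and question embeddings.
    """
    t = text_emb / (np.linalg.norm(text_emb) + 1e-12)
    norms = np.linalg.norm(layer_patches, axis=-1, keepdims=True) + 1e-12
    sims = (layer_patches / norms) @ t                  # (L, P) cosine per patch
    topk = np.sort(sims, axis=1)[:, -k:].mean(axis=1)   # (L,) top-k mean per layer
    layers = np.arange(len(topk))
    return {
        "final_cos": cosine(image_emb, text_emb),
        "late_topk": float(topk[-late:].mean()),
        "gain": float(topk[-late:].mean() - topk[:late].mean()),
        "slope": float(np.polyfit(layers, topk, 1)[0]),
    }
```

A downstream classifier would then consume these features together with the other ensemble signals; a flat or falling trajectory would suggest that question-relevant evidence never emerges in the image.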

2409.01062 2026-06-16 cs.LG cs.CR cs.CV 版本更新

Random Erasing vs. Model Inversion: A Promising Defense or a False Hope?

随机擦除 vs. 模型反演:有希望的防御还是虚假的希望?

Viet-Hung Tran, Ngoc-Bao Nguyen, Son T. Mai, Hans Vandierendonck, Ira Assent, Alex Kot, Ngai-Man Cheung

发表机构 * Temasek Laboratories, Singapore University of Technology and Design(淡马锡实验室,新加坡科技设计大学) The Queen’s University Belfast(贝尔法斯特女王大学) Aarhus University(奥胡斯大学) Nanyang Technological University (NTU)(南洋理工大学) VinUniversity, Hanoi, Vietnam(越南河内VinUniversity)

AI总结 本文探索随机擦除(RE)作为防御模型反演攻击的方法,通过特征空间分析揭示其有效性,并在37种设置下实现隐私-效用权衡的最优性能。

Comments Accepted in Transactions on Machine Learning Research (TMLR). First two authors contributed equally

详情
AI中文摘要

模型反演(MI)攻击通过从机器学习模型重建私有训练数据,构成重大的隐私威胁。虽然现有防御主要集中于以模型为中心的方法,但数据对MI鲁棒性的影响在很大程度上仍未被探索。在这项工作中,我们探索了随机擦除(RE,一种传统上用于提高遮挡下模型泛化能力的技术),并揭示了其作为MI攻击防御手段的惊人有效性。具体来说,我们新颖的特征空间分析表明,使用RE图像训练的模型在MI重建图像的特征与私有数据的特征之间引入了显著差异。同时,私有图像的特征与其他类别保持可区分,并与不同分类区域良好分离。这些效应共同降低了MI重建质量和攻击准确率,同时保持了合理的自然准确率。此外,我们探索了RE的两个关键属性,包括部分擦除和随机位置。部分擦除防止模型在训练期间观察完整对象。我们发现这对旨在重建完整对象的MI有显著影响。擦除的随机位置在实现强隐私-效用权衡中起着关键作用。我们的发现凸显了RE作为一种简单而有效的防御机制,可以轻松与现有隐私保护技术集成。在37种设置上的广泛实验表明,我们的方法在隐私-效用权衡中达到了最先进的性能。结果一致证明了我们的防御在不同MI攻击、网络架构和攻击配置下优于现有方法。我们首次在某些配置下实现了攻击准确率的显著下降而不降低效用。

英文摘要

Model Inversion (MI) attacks pose a significant privacy threat by reconstructing private training data from machine learning models. While existing defenses primarily concentrate on model-centric approaches, the impact of data on MI robustness remains largely unexplored. In this work, we explore Random Erasing (RE), a technique traditionally used for improving model generalization under occlusion, and uncover its surprising effectiveness as a defense against MI attacks. Specifically, our novel feature space analysis shows that models trained with RE-images introduce a significant discrepancy between the features of MI-reconstructed images and those of the private data. At the same time, features of private images remain distinct from other classes and well-separated from different classification regions. These effects collectively degrade MI reconstruction quality and attack accuracy while maintaining reasonable natural accuracy. Furthermore, we explore two critical properties of RE including Partial Erasure and Random Location. Partial Erasure prevents the model from observing entire objects during training. We find this has a significant impact on MI, which aims to reconstruct the entire objects. Random Location of erasure plays a crucial role in achieving a strong privacy-utility trade-off. Our findings highlight RE as a simple yet effective defense mechanism that can be easily integrated with existing privacy-preserving techniques. Extensive experiments across 37 setups demonstrate that our method achieves state-of-the-art (SOTA) performance in the privacy-utility trade-off. The results consistently demonstrate the superiority of our defense over existing methods across different MI attacks, network architectures, and attack configurations. For the first time, we achieve a significant degradation in attack accuracy without a decrease in utility for some configurations.
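The two properties highlighted above, Partial Erasure and Random Location, are visible in a minimal sketch of the Random Erasing augmentation. The area and aspect-ratio ranges, the noise fill, and the function name `random_erase` are illustrative assumptions and not the configuration used in the paper.

```python
import numpy as np

def random_erase(img, rng, area_range=(0.02, 0.33), aspect_range=(0.3, 3.3), max_tries=10):
    """Return a copy of img (H, W[, C]) with one random rectangle overwritten by noise."""
    h, w = img.shape[:2]
    out = img.copy()
    for _ in range(max_tries):
        area = rng.uniform(*area_range) * h * w
        aspect = np.exp(rng.uniform(np.log(aspect_range[0]), np.log(aspect_range[1])))
        eh = int(round(np.sqrt(area * aspect)))
        ew = int(round(np.sqrt(area / aspect)))
        if 0 < eh < h and 0 < ew < w:
            top = rng.integers(0, h - eh + 1)      # Random Location
            left = rng.integers(0, w - ew + 1)
            noise = rng.uniform(0, 1, size=(eh, ew) + img.shape[2:])
            out[top:top + eh, left:left + ew] = noise   # Partial Erasure
            return out
    return out                                      # no valid rectangle found
```

Applied independently to every training image, the rectangle hides a different part of the object each time, so the model never sees the complete object at a fixed location.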

2501.19337 2026-06-16 cs.CL cs.CV 版本更新

Token-Level Entropy Reveals Demographic Disparities in Language Models

Token级熵揭示语言模型中的群体统计差异

Messi H. J. Lee

发表机构 * Independent Researcher(独立研究者)

AI总结 通过测量零温度下全词汇香农熵,发现黑裔名字比白裔名字产生更高的首token熵,且女性名字比男性名字产生更低的熵和更同质的输出,种族和性别效应可加,指令微调未减弱种族差距,显式群体标签探测无显著种族效应。

Comments 9 pages

详情
AI中文摘要

我们探究仅由姓名标识的人口统计身份是否会系统性地重塑语言模型的生成分布。在六个开源基础模型和5760个隐式句子补全提示(例如“Tanisha在一个周一早晨走进办公室,然后”)上,测量零温度下的全词汇香农熵,我们发现,在所有六个架构中,与白裔名字相比,黑裔名字产生更高的首token熵——这与在显式人口统计提示下记录的输出层面同质性偏差(Lee等人,2024)相反——并且黑裔名字总是比白裔名字在身份中性基线之上产生更大的熵(所有六个模型中ΔΔ>0)。与男性名字相比,女性名字伴随更低的首token熵(DL合并β̂=-0.041,p=0.019)和更同质的输出(α̂=+0.024,p<0.001)——这一模式与同质性偏差收敛;种族和性别效应是可加的。指令微调并未减弱种族差距(匹配格式DL合并β̂=+0.153)。使用显式群体标签而非姓名运行相同模板,在隐式探测显著的12个模型中有10个产生无效的种族效应——这表明探测方法是恢复哪种分布结构的主要决定因素。

英文摘要

We ask whether demographic identity, signaled by a name alone, systematically reshapes the generative distribution of a language model. Measuring full-vocabulary Shannon entropy at temperature zero across six open-weight base models and 5,760 implicit sentence-completion prompts (e.g., "Tanisha walked into the office on a Monday morning and"), we find that Black-associated names produce higher first-token entropy than White-associated names across all six architectures - opposite to the output-level homogeneity bias documented under explicit demographic prompting (Lee et al., 2024) - and Black-associated names always produce greater entropy above identity-neutral baselines than White-associated names ($\Delta\Delta > 0$ in all six models). Women-associated names co-occur with lower first-token entropy (DL-pooled $\hat{\beta} = -0.041$, $p = .019$) and more homogeneous outputs ($\hat{\alpha} = +0.024$, $p < .001$) than men-associated names - a pattern convergent with homogeneity bias; race and gender effects are additive. Instruction tuning does not attenuate the race gap (matched-format DL-pooled $\hat{\beta} = +0.153$). Running the same templates with explicit group labels instead of names yields null race effects in 10 of 12 models where implicit probing is significant - establishing that probing methodology is a primary determinant of which distributional structure is recovered.
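The measured quantity is the Shannon entropy of the next-token distribution over the whole vocabulary. A minimal sketch follows, assuming the entropy is computed from the unscaled logits of the first generated position and reported in nats; the abstract specifies neither the unit nor the extraction details, and `shannon_entropy` is a hypothetical helper.

```python
import math

def shannon_entropy(logits):
    """Full-vocabulary Shannon entropy (in nats) of softmax(logits)."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)
```

A uniform distribution over V tokens gives the maximum value log(V), and a distribution concentrated on one token gives a value near zero, so higher first-token entropy means the model spreads probability over more continuations.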

12. 数据集、基准、评测与训练方法 50 篇

2606.14725 2026-06-16 cs.CV 新提交

Interpolation between Convolution and Attention via K-Nearest Neighbors

通过K近邻实现卷积与注意力之间的插值

Mingi Kang

发表机构 * Bowdoin College(鲍登学院)

AI总结 提出ConvNN统一框架,将卷积和自注意力视为K近邻聚合的特例,通过可配置的相似度函数和邻居选择策略实现局部与全局聚合的连续插值。

Comments Undergraduate Thesis in Computer Science at Bowdoin College

详情
AI中文摘要

从卷积神经网络到Transformer的转变重塑了计算机视觉,然而这两个架构家族通常被视为根本不同。卷积神经网络由空间局部卷积操作定义,而Transformer依赖于全局自注意力。我们认为,尽管卷积和自注意力存在明显差异,但它们可以在一个统一的k近邻聚合框架内统一。关键洞察在于,这两种操作都是邻居选择和加权聚合的特例:卷积通过空间邻近性选择邻居,而自注意力通过特征相似性选择邻居,这表明它们位于一个连续谱上,而不是代表截然不同的计算。我们引入了卷积近邻(ConvNN),这是一个统一框架,形式化了这种联系。ConvNN通过将邻居选择限制在归一化空间坐标上精确恢复标准和深度卷积,并通过用缩放点积相似性替换空间邻近性精确恢复自注意力及其稀疏变体(包括KVT注意力)。除了这些特例,ConvNN可作为卷积和注意力层的即插即用替代,通过可配置的相似度函数、邻居选择策略、位置编码和聚合核,系统探索局部与全局聚合之间的中间谱。

英文摘要

The shift from Convolutional Neural Networks to Transformers has reshaped computer vision, yet these two architectural families are typically viewed as fundamentally distinct. Convolutional Neural Networks are defined by spatially local convolution operations, while Transformers rely on global self-attention. We argue that convolution and self-attention, despite their apparent differences, can be unified within a single k-nearest neighbor aggregation framework. The critical insight is that both operations are special cases of neighbor selection and weighted aggregation. Convolution selects neighbors by spatial proximity while self-attention selects by feature similarity, revealing that they lie on a continuous spectrum rather than representing categorically different computations. We introduce Convolutional Nearest Neighbors (ConvNN), a unified framework that formalizes this connection. ConvNN exactly recovers standard and depthwise convolution by restricting neighbor selection to normalized spatial coordinates, and exactly recovers self-attention and its sparse variants, including KVT-attention, by replacing spatial proximity with scaled dot-product similarity. Beyond these special cases, ConvNN serves as a drop-in replacement for both convolution and attention layers, enabling systematic exploration of the intermediate spectrum between local and global aggregation through configurable similarity functions, neighbor selection strategies, positional encodings, and aggregation kernels.
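The unifying view can be sketched as one aggregation routine with a switchable neighbor-selection rule. This is a simplified illustration and not the ConvNN layer itself: there are no learned projections or kernels, weights are a softmax over the selection score in both modes (a convolution would instead use learned weights per neighbor position), and `knn_aggregate` is a hypothetical name.

```python
import numpy as np

def knn_aggregate(x, coords, k, mode):
    """Aggregate each token's k nearest neighbors.

    x: (N, D) token features; coords: (N, 2) spatial positions.
    mode="spatial" selects neighbors by position (convolution-like);
    mode="feature" selects by scaled dot-product similarity (attention-like).
    """
    n, d = x.shape
    if mode == "spatial":
        score = -np.linalg.norm(coords[:, None] - coords[None], axis=-1)
    elif mode == "feature":
        score = x @ x.T / np.sqrt(d)
    else:
        raise ValueError(mode)
    idx = np.argsort(-score, axis=1)[:, :k]             # top-k neighbors per token
    top = np.take_along_axis(score, idx, axis=1)
    w = np.exp(top - top.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)                   # softmax over selected neighbors
    return np.einsum("nk,nkd->nd", w, x[idx])
```

With `mode="feature"` and `k` equal to the number of tokens, the routine reduces to dense softmax self-attention with identity projections; smaller `k` gives a sparse variant, and `mode="spatial"` restricts aggregation to a local window.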

2606.14740 2026-06-16 cs.CV 新提交

GridVQA-X: A Framework for Evaluating Multimodal Explainability Methods

GridVQA-X: 评估多模态可解释性方法的框架

Sujay Belsare, Sudarshan Nikhil, Sushant Kumar, Ponnurangam Kumaraguru, Chirag Agarwal

发表机构 * IIIT Hyderabad(印度海得拉巴国际信息技术学院) University of Virginia(弗吉尼亚大学)

AI总结 提出GridVQA-X诊断框架,通过合成数据生成数学保证的解释,并训练纯推理与捷径依赖的配对模型,揭示现有可解释性方法无法区分真实跨模态推理与浅层捷径。

Comments 23 pages, 15 Figures, Accepted for poster presentation at CVPR 2026 TRUE-V Workshop

详情
AI中文摘要

随着视觉-语言模型的不断发展,其预测结果对相关利益方具有可解释性变得至关重要。然而,可解释性领域并未跟上多模态发展的步伐。尽管最近的多模态可解释人工智能(MxAI)方法生成解释以归因不同模态之间的交互,但当前的评估协议缺乏区分真正跨模态推理(例如,空间组合)与浅层跨模态捷径(例如,词袋属性匹配)所需的真值(ground truth)。目前尚不清楚MxAI方法是否忠实地捕捉了协同交互,或者仅仅是在充当简单特征检测器的模型上幻想出推理过程。在本文中,我们介绍了GridVQA-X,这是第一个专门设计用于评估跨模态可解释性的诊断框架。与自然数据集不同,GridVQA-X利用封闭世界合成逻辑生成唯一的、数学上有保证的解释。我们利用这个受控环境,在相同的架构上训练配对的真值模型:$M_{\text{pure}}$,学习稳健的空间关系推理,以及$M_{\text{spur}}$,在结构上被迫依赖跨模态捷径。这种行为差异创建了一个严格的测试平台:一个忠实的解释器必须为每个模型报告不同的推理路径。我们的发现表明,广泛使用的方法无法区分依赖真正空间关系推理的模型和利用跨模态捷径的模型,突显了在捕捉真正跨模态协同方面的关键差距,并错误地呈现了多模态模型实际如何做出决策。

英文摘要

With the increasing development of Vision-Language Models, it becomes imperative that their predictions are readily explainable to relevant stakeholders. However, the field of explainability has not kept pace with the multimodal surge. While recent Multimodal Explainable AI (MxAI) methods generate explanations to attribute the interaction between different modalities, current evaluation protocols lack the ground truth required to distinguish between true cross-modal reasoning (e.g., spatial composition) and shallow cross-modal shortcuts (e.g., Bag-of-Words attribute matching). It remains unknown whether MxAI methods faithfully capture synergistic interactions or merely hallucinate reasoning on models acting as simple feature detectors. In this paper, we introduce GridVQA-X, the first diagnostic framework specifically designed to evaluate cross-modal explainability. Unlike natural datasets, GridVQA-X leverages a closed-world synthesis logic to generate unique, mathematically guaranteed explanations. We utilize this controlled environment to train paired ground-truth models on identical architectures: $M_{\text{pure}}$, which learns robust spatial-relational reasoning and $M_{\text{spur}}$, which is structurally forced to rely on cross-modal shortcuts. This behavioral divergence creates a rigorous testbed: a faithful explainer must report distinct reasoning pathways for each model. Our findings reveal that widely used methods fail to distinguish between models relying on genuine spatial-relational reasoning and those exploiting cross-modal shortcuts, highlighting a critical gap in capturing true cross-modal synergy and misrepresenting how multimodal models actually make decisions.

2606.14747 2026-06-16 cs.CV cs.AI 新提交

MMLongEmbed: Benchmarking Multimodal Embedding Models in Long-Context Scenarios

MMLongEmbed: 长上下文场景下的多模态嵌入模型基准测试

Haitian Wang, Ruoxi Sun, Quantong Qiu, Juntao Li, Junhui Li, Hua Chen, Jinxiong Chang, Min Zhang

发表机构 * Soochow University(苏州大学) Ant Group(蚂蚁集团)

AI总结 针对多模态嵌入模型在长上下文场景中缺乏系统评估的问题,提出首个综合基准MMLongEmbed,涵盖文本、文档和视频模态的检索任务,揭示模型依赖浅层特征匹配、难以捕捉深层语义依赖等瓶颈。

详情
AI中文摘要

最近的进展显著扩展了多模态嵌入模型(MEMs)的理论上下文窗口。然而,更大的上下文窗口并不一定能转化为对长上下文多模态输入的有效理解和表示,这仍然是实际部署的关键瓶颈。为了解决这一设置中缺乏系统评估的问题,我们引入了MMLongEmbed,这是首个用于评估长上下文场景中MEMs的综合基准。MMLongEmbed包含四个检索任务,涵盖多个上下文长度范围,覆盖文本、文档和视频模态。通过对最先进模型的广泛评估,我们发现当前架构严重依赖浅层特征匹配,难以捕捉深层语义和结构依赖。我们进一步观察到,性能下降随上下文长度和关键信息位置系统性地变化。此外,模型对不同模态中的冗余上下文信息表现出显著不同的鲁棒性。为了可重复性,基准和代码已公开。

英文摘要

Recent advancements have significantly expanded the theoretical context windows of Multimodal Embedding Models (MEMs). However, larger context windows do not necessarily translate into effective comprehension and representation of long-context multimodal inputs, which remains a critical bottleneck for real-world deployment. To address the lack of systematic evaluation in this setting, we introduce MMLongEmbed, the first comprehensive benchmark for evaluating MEMs in long-context scenarios. MMLongEmbed comprises four retrieval tasks spanning multiple context-length ranges, covering text, document, and video modalities. Through extensive evaluation of state-of-the-art models, we find that current architectures rely heavily on superficial feature matching and struggle to capture deep semantic and structural dependencies. We further observe that performance degradation varies systematically with context length and key information placement. Moreover, models exhibit substantially different robustness to redundant contextual information across modalities. For reproducibility, the benchmark and code are publicly available.

2606.14757 2026-06-16 cs.CV cs.LG 新提交

Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers

基于空间填充曲线的小型与有限数据视觉Transformer的空间先验

Leyla Naz Candogan, Arshia Afzal, Pol Puigdemont, Volkan Cevher

发表机构 * ETH Zürich(苏黎世联邦理工学院)

AI总结 提出VIOLIN,一种轻量级掩码注意力机制,通过空间填充曲线编码空间结构,以极小的参数和计算开销为视觉Transformer注入空间归纳偏置,在小模型和有限数据场景下显著提升性能。

Comments ICML 2026

详情
AI中文摘要

尽管视觉Transformer(ViT)已成为许多计算机视觉任务中的主导骨干网络,但由于置换等变性,其注意力机制缺乏显式的空间归纳偏置。这在模型容量小或训练数据有限的情况下尤为重要。受线性Transformer中的注意力掩码策略和视觉状态空间模型(SSM)的扫描模式的启发,我们引入了VIOLIN,一种轻量级掩码注意力机制,通过空间填充曲线(SFC)在注意力中编码空间结构,仅增加不到0.0015%的额外参数和可忽略的计算开销。VIOLIN使用多条SFC扫描图像,构建曲线特定的衰减掩码,然后将其组合并与注意力矩阵相乘。在广泛的评估中,VIOLIN持续提升性能。在有限数据场景下,例如在VTAB-1K上进行微调时,它提升了所有任务组的准确率,在空间信息至关重要的任务上提升高达8.7%。它可以与参数高效微调方法(如LoRA)结合,进一步提高性能。除了微调,VIOLIN在ImageNet-1K上预训练期间改进了各种小型ViT架构(如DeiT、DINO)。此外,在高度依赖位置信息的像素级CIFAR-100训练中,VIOLIN将准确率提升了高达7.2%。总体而言,VIOLIN提供了一种计算高效且有效的方式,将空间归纳偏置注入ViT,特别有利于小模型和有限数据场景。

英文摘要

Though Vision Transformers (ViTs) have become the dominant backbone in many computer vision tasks, due to permutation equivariance, their attention mechanism lacks explicit spatial inductive biases. This becomes particularly important in two settings: when model capacity is small or training data is limited. Inspired by the attention masking strategies in Linear Transformers and the scanning patterns of Vision SSMs, we introduce VIOLIN, a lightweight masked attention mechanism that encodes spatial structure within attention via Space Filling Curves (SFCs) with less than 0.0015% extra parameters and negligible computational overhead. VIOLIN scans the image using multiple SFCs to construct curve-specific decay masks, which are then combined and multiplied with the attention matrix. Across a wide range of evaluations, VIOLIN consistently improves performance. In limited-data regimes such as fine-tuning on VTAB-1K, it boosts accuracy across all task groups and by up to 8.7% on the tasks where spatial information is essential. It can be combined with parameter-efficient fine-tuning methods such as LoRA to further improve performance. Beyond fine-tuning, VIOLIN improves various small-scale ViT architectures (e.g., DeiT, DINO) during pretraining on ImageNet-1K. Additionally, on pixel-level CIFAR-100 training, a task that is highly dependent on location information, VIOLIN increases accuracy by up to 7.2%. Overall, VIOLIN provides a computationally efficient yet effective way to inject spatial inductive bias into ViTs, especially benefiting small models and limited-data settings.
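The mask construction described above can be sketched as follows. The specifics are assumptions for illustration, since the abstract does not give them: two snake-like scans stand in for the space-filling curves, the decay is a fixed `gamma` raised to the distance along the curve (the paper's small parameter count suggests the decay is learned), masks are combined by averaging, and the names `snake_order`, `decay_mask`, and `masked_attention` are hypothetical.

```python
import numpy as np

def snake_order(h, w, transpose=False):
    """Position of each patch (row-major index) along a snake-like curve."""
    if transpose:                                   # vertical snake
        return snake_order(w, h).reshape(w, h).T.ravel()
    grid = np.arange(h * w).reshape(h, w)
    grid[1::2] = grid[1::2, ::-1].copy()            # reverse every other row
    return grid.ravel()

def decay_mask(order, gamma):
    """M[i, j] = gamma ** |pos_i - pos_j| along one curve."""
    diff = np.abs(order[:, None] - order[None, :])
    return gamma ** diff

def masked_attention(q, k, v, h, w, gamma=0.9):
    """Softmax attention multiplied by the average of curve-specific decay masks."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)
    mask = 0.5 * (decay_mask(snake_order(h, w), gamma)
                  + decay_mask(snake_order(h, w, transpose=True), gamma))
    return (attn * mask) @ v
```

Using more than one curve matters because a single scan places some spatially adjacent patches far apart along the curve; averaging masks from different scans keeps attention between such neighbors from being suppressed. With `gamma=1.0` the mask is all ones and the routine reduces to plain attention.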

2606.14760 2026-06-16 cs.CV cs.AI 新提交

GeoRoPE: Ground-Aware Rotary Adaptation for Remote Sensing Foundation Models

GeoRoPE: 面向遥感基础模型的地面感知旋转适配

Yu Luo, Kun Hu, Mengwei He, Xiaogang Zhu, Shan Zeng, Allen Benter, Wei Xiang, Patrick Filippi, Thomas Francis Bishop, Zhiyong Wang

发表机构 * The University of Sydney(悉尼大学) Edith Cowan University(埃迪斯科文大学) Adelaide University(阿德莱德大学) Wuhan Polytechnic University(武汉轻工大学) Climate, Orange Agricultural Institute(气候研究所,奥兰治农业研究所) La Trobe University(拉筹伯大学)

AI总结 提出GeoRoPE方法,通过地理坐标校准和频率校准解决遥感基础模型中的尺度失配问题,提升跨分辨率鲁棒性和尺度敏感表征学习。

详情
AI中文摘要

遥感基础模型(RSFMs)受益于在多传感器和地面采样距离(GSD)图像上的预训练,但仅凭这种暴露并不能解决下游适配过程中的尺度失配问题。固定的token网格偏移在不同传感器下可能对应不同的地面距离,使得基于网格的位置先验在物理上不一致。同时,异质空间粒度意味着紧凑的城市区域和均质景观即使在相同GSD下也可能需要不同的位置敏感性。因此,我们提出GeoRoPE,一种面向RSFMs的地面感知、RoPE兼容且参数高效的空间适配方法。GeoRoPE从两个互补方面重新校准token级位置交互。首先,地理坐标校准(GCC)根据一个token网格步长代表的地面距离重新缩放原始token网格偏移,产生跨GSD的地理校准相对坐标。其次,地理频率校准(GFC)使用关系特定因子调整原生RoPE频率,实现对场景依赖空间粒度的位置敏感适配。GeoRoPE通过轻量适配器注入预训练RSFM,在保持冻结空间先验的同时添加地理感知位置校正。在多个RSFM、传感器、分辨率和下游任务上的实验表明,GeoRoPE提升了跨分辨率鲁棒性和尺度敏感表征学习。

英文摘要

Remote-sensing foundation models (RSFMs) benefit from pretraining on imagery from multiple sensors and ground sampling distances (GSDs), but such exposure alone does not resolve scale mismatch during downstream adaptation. A fixed token-grid offset can correspond to different ground distances across sensors, making grid-based positional priors physically inconsistent. Meanwhile, heterogeneous spatial granularity means that compact urban regions and homogeneous landscapes may require different positional sensitivities even under the same GSD. Therefore, we propose GeoRoPE, a ground-aware, RoPE-compatible, and parameter-efficient spatial adaptation method for RSFMs. GeoRoPE recalibrates token-level positional interactions from two complementary aspects. First, Geo-Coordinate Calibration (GCC) rescales raw token-grid offsets according to the ground distance represented by one token-grid step, producing geo-calibrated relative coordinates across GSDs. Second, Geo-Frequency Calibration (GFC) adjusts the native RoPE frequency with a relation-specific factor, enabling position-sensitive adaptation to scene-dependent spatial granularity. GeoRoPE is injected into pretrained RSFMs through a lightweight adapter, preserving the frozen spatial prior while adding geo-aware positional corrections. Experiments across multiple RSFMs, sensors, resolutions, and downstream tasks demonstrate that GeoRoPE improves cross-resolution robustness and scale-sensitive representation learning.
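The coordinate calibration idea can be illustrated with a short sketch. It assumes that the ground distance of one token-grid step equals GSD times patch size and that offsets are normalized by a reference step; `ref_step_m`, `freq_scale`, and both function names are hypothetical, and the scalar `freq_scale` only stands in for the relation-specific factor of GFC, whose form the abstract does not specify.

```python
import math

def geo_offset(raw_offset, gsd_m, patch_px, ref_step_m=160.0):
    """Rescale a token-grid offset by the ground distance of one grid step.

    gsd_m: ground sampling distance in metres per pixel.
    patch_px: patch size in pixels, so one grid step covers gsd_m * patch_px metres.
    """
    return raw_offset * (gsd_m * patch_px) / ref_step_m

def rope_angles(offset, dim, base=10000.0, freq_scale=1.0):
    """Standard RoPE rotation angles for a relative offset, one per channel pair."""
    return [offset * freq_scale * base ** (-2.0 * i / dim) for i in range(dim // 2)]
```

For example, an offset of 4 tokens at 10 m GSD and an offset of 2 tokens at 20 m GSD (both with 16-pixel patches) cover the same 640 m on the ground, and after calibration they receive the same relative coordinate and therefore the same rotation angles.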

2606.14780 2026-06-16 cs.CV cs.LG 新提交

YTClickbait21K: Human-Annotated Multimodal Dataset for YouTube Clickbait Detection Across Diverse Channels and Content Categories

YTClickbait21K:面向YouTube点击诱饵检测的多模态人工标注数据集,覆盖多样频道与内容类别

Md. Minhazul Islam, Md. Tanbeer Jubaer, Amith Khandakar, Shovon Sarker, Sumaiya Rahman, Md. Masum Mia, Mohamed Arselene Ayari, Hamed Noori

发表机构 * Department of Computer Science and Engineering, Rajshahi University of Engineering & Technology(拉贾沙希工程与技术大学计算机科学与工程系) Department of Electrical Engineering, Qatar University(卡塔尔大学电气工程系) Department of Civil and Environmental Engineering, Qatar University(卡塔尔大学土木与环境工程系) SenseNet Inc.(SenseNet公司)

AI总结 为应对视频平台点击诱饵检测缺乏大规模高质量多模态数据的问题,构建了包含21,238个视频、来自29国40频道、覆盖新闻/娱乐/教育/游戏等类别的人工标注数据集YTClickbait21K,通过三人独立标注与多数投票确保质量,为多模态语义理解和自动内容审核提供基准。

详情
AI中文摘要

视频分享平台上的点击诱饵内容对信息可靠性构成重大挑战,然而自动检测的进展一直受限于缺乏大规模、高质量的多模态数据集。我们提出了YTClickbait21K,一个人工标注的YouTube点击诱饵数据集,包含来自29个国家40个频道的21,238个视频,覆盖新闻、娱乐、教育和游戏等多种内容类别。每个样本包括结构化元数据(标题、描述、互动统计)以及相关的缩略图图像,支持全面的多模态分析。为确保标注质量,每个视频由三名标注员使用标准化的决策框架独立标注,该框架融合了文本、视觉和跨模态一致性线索,最终标签通过多数投票确定。该数据集展现出较高的标注者间一致性(κ=0.65),表明尽管点击诱饵检测具有固有的主观性,标注仍然可靠。通过结合规模、标注严谨性和多模态丰富性,该数据集为开发和评估机器学习模型提供了稳健的基准,促进了跨模态语义理解的研究,并推动了自动内容审核系统的发展。

英文摘要

Clickbait content on video-sharing platforms poses a significant challenge to information reliability, yet progress in automated detection has been constrained by the lack of large-scale, high-quality multimodal datasets. We present YTClickbait21K, a human-annotated YouTube clickbait dataset comprising 21,238 videos collected from 40 channels across 29 countries, covering diverse content categories such as news, entertainment, education, and gaming. Each sample includes structured metadata (title, description, engagement statistics) along with associated thumbnail images, enabling comprehensive multimodal analysis. To ensure annotation quality, every video was independently labeled by three annotators using a standardized decision framework that incorporates textual, visual, and cross-modal consistency cues, with final labels determined through majority voting. The dataset exhibits substantial inter-annotator agreement (κ = 0.65), confirming reliable labeling despite the inherent subjectivity of clickbait detection. By combining scale, annotation rigor, and multimodal richness, this dataset provides a robust benchmark for developing and evaluating machine learning models, facilitating research in cross-modal semantic understanding, and advancing automated content moderation systems.
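The labeling protocol (three independent annotators, majority vote, agreement statistic) can be sketched in a few lines. The abstract does not say which kappa variant was reported; Fleiss' kappa is assumed here because it handles more than two raters, and the function names are hypothetical.

```python
from collections import Counter

def majority_vote(labels):
    """Most common label among an item's annotators (three raters cannot tie on binary labels)."""
    return Counter(labels).most_common(1)[0][0]

def fleiss_kappa(ratings):
    """Fleiss' kappa for items that are each rated by the same number of annotators."""
    n_items, n_raters = len(ratings), len(ratings[0])
    categories = sorted({label for item in ratings for label in item})
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Overall proportion of assignments falling into each category.
    p_cat = [sum(row[j] for row in counts) / (n_items * n_raters)
             for j in range(len(categories))]
    # Per-item agreement: fraction of rater pairs that agree.
    p_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
              for row in counts]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance; under the commonly used Landis and Koch interpretation, values between 0.61 and 0.80 are described as substantial.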

2606.14795 2026-06-16 cs.CV 新提交

Position: The Systemic Lack of Agency in Visual Reasoning

立场:视觉推理中系统性的能动性缺失

Yizhao Huang, Haoyang Chen, Shiqin Wang, Pohsun Huang, Jiayuan Li, Haoyuan Du, Yandong Shi, Zheng Wang, Zhixiang Wang

AI总结 本文指出当前视觉语言模型因缺乏自主探索能力而无法进行隐式推理,并提出V-IRD基准来评估这一能力,实验表明强语义识别不等同于主动视觉探索。

Comments Accepted by ICML 2026

详情
AI中文摘要

本文论证了系统性的能动性缺失限制了当前视觉语言模型(VLM)的隐式推理能力。隐式推理是指自主发现并利用隐藏的视觉证据来弥合信息鸿沟的能力,而非仅仅依赖明确指定的目标。这种能力是人类视觉理解和日常推理的基础。我们认为,这种限制源于将视觉推理主要视为被动的语义检索,而非依赖于自主视觉探索的主动情境推理。因此,现有大多数基准主要评估被动能力,而忽略了这一推理维度。为弥补这一空白,我们引入了视觉隐式推理诊断基准(V-IRD),该基准通过要求模型严格通过自主视觉分析推导答案来针对这一缺失象限。我们的结果表明,尽管具有强大的检索能力,但主流VLM在利用参考对象和关注需要自主探究的视觉证据方面存在困难。简而言之,强语义识别并不等同于主动视觉探索,揭示了当前VLM的关键差距。更多信息请访问 https://haoychen.github.io/Implicit-Reasoning/

英文摘要

This paper argues that a systemic lack of Agency constrains the implicit reasoning capabilities of current Vision-Language Models (VLMs). Implicit reasoning refers to the ability to autonomously discover and utilize hidden visual evidence to bridge information gaps, rather than merely relying on explicitly specified targets. This capacity underlies human visual understanding and everyday reasoning. We argue that this limitation arises from a tendency to approach visual reasoning primarily as passive semantic retrieval, rather than as active, situated reasoning that depends on autonomous visual exploration. As a result, most existing benchmarks primarily assess Passive Capacity, leaving this aspect of reasoning largely unmeasured. To address this gap, we introduce the Visual Implicit Reasoning Diagnosing Benchmark (V-IRD), which targets this missing quadrant by requiring models to derive answers strictly through autonomous visual analysis. Our results show that, despite strong retrieval abilities, prominent VLMs struggle to utilize reference objects and to attend to visual evidence that requires self-directed inquiry. Simply put, strong semantic recognition does not equate to active visual exploration, revealing a critical gap in current VLMs. More information can be found at https://haoychen.github.io/Implicit-Reasoning/

2606.14926 2026-06-16 cs.CV 新提交

FlexPooling with Simple Auxiliary Classifiers in Deep Networks

深度网络中带有简单辅助分类器的FlexPooling

Muhammad Ali, Omar Alsuwaidi, Salman Khan

发表机构 * Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE(阿联酋阿布扎比穆罕默德·本·扎耶德人工智能大学计算机视觉系)

AI总结 提出FlexPooling自适应池化方法,通过学习加权平均替代标准池化,并附加简单辅助分类器,在多个图像分类数据集上提升准确率1-3%。

详情
Journal ref
VISAPP 4 (18th), 497-505 2023
AI中文摘要

在计算机视觉中,大多数卷积神经网络的基本流程包括多个特征提取层,其中输入信号在后续每一层中被下采样到更低分辨率。这种下采样过程通常称为池化,是CNN中的基本操作。池化提高了对变换的鲁棒性,减少了可训练参数数量,增加了感受野,并降低了计算时间。由于池化是有损过程,但对于从低级表示中提取高级信息仍然重要,因此保留先前激活中最突出的信息以提高网络判别能力至关重要。标准池化通常使用密集池化方法,如最大池化或平均池化,或通过步长卷积核进行。在本文中,我们提出一种简单而有效的自适应池化方法,称为FlexPooling,它通过学习与网络其余部分联合的激活加权平均来推广平均池化。我们进一步表明,将简单辅助分类器(SAC)附加到CNN上可以提高性能,并证明了所提出方法与标准池化方法相比的有效性。在多个流行图像分类数据集上的实验表明,FlexPooling始终优于基线网络,准确率提升约1%至3%。

英文摘要

In computer vision, the basic pipeline of most convolutional neural networks consists of multiple feature extraction layers, where the input signal is downsampled to a lower resolution in each subsequent layer. This downsampling process is commonly referred to as pooling, which is an essential operation in CNNs. Pooling improves robustness against transformations, reduces the number of trainable parameters, increases the receptive field, and lowers computation time. Since pooling is a lossy process but remains important for extracting high-level information from low-level representations, it is important to preserve the most prominent information from previous activations to improve network discriminability. Standard pooling is usually performed using dense pooling methods, such as max pooling or average pooling, or through strided convolutional kernels. In this paper, we propose a simple yet effective adaptive pooling method, called FlexPooling, which generalizes average pooling by learning a weighted average over activations jointly with the rest of the network. We further show that attaching Simple Auxiliary Classifiers (SAC) to the CNN improves performance and demonstrates the effectiveness of the proposed method compared with standard pooling methods. Experiments on multiple popular image classification datasets show that FlexPooling consistently outperforms baseline networks, achieving approximately 1 to 3 percent improvement in accuracy.
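The idea of generalizing average pooling with a learned weighted average can be sketched as below. This is an illustration under assumptions: one weight per position in the pooling window, shared across the feature map and softmax-normalized; the paper's actual parameterization may differ, and `flex_pool` is a hypothetical name. In training, `weights` would be a learnable parameter updated jointly with the network.

```python
import numpy as np

def flex_pool(x, weights):
    """Weighted-average pooling over non-overlapping windows.

    x: (H, W) feature map; weights: (kh, kw) window weights.
    Weights are softmax-normalized so each output is a convex combination.
    """
    kh, kw = weights.shape
    h, w = x.shape[0] // kh, x.shape[1] // kw
    wn = np.exp(weights - weights.max())
    wn /= wn.sum()
    windows = x[:h * kh, :w * kw].reshape(h, kh, w, kw)
    return np.einsum("hiwj,ij->hw", windows, wn)
```

With all-zero weights the normalized weights are uniform and the result equals average pooling, which makes that a natural initialization; as one weight grows, the output approaches plain subsampling at that window position.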

2606.14958 2026-06-16 cs.CV cs.IR cs.LG 新提交

MVEB: Massive Video Embedding Benchmark

MVEB:大规模视频嵌入基准

Adnan El Assadi, Roman Solomatin, Isaac Chung, Chenghao Xiao, Deep Shah, Manan Dey, Shriya Sudhakar, Zacharie Bugaud, Wissam Siblini, Ayush Sunil Munot, Yashwanth Devavarapu, Rakshitha Ireddi, Michelle Yang, Márton Kardos, Niklas Muennighoff, Kenneth Enevoldsen

AI总结 提出MVEB基准,包含23个任务评估33种视频嵌入模型,发现无单一模型占优,音频贡献取决于标注来源,并集成到MTEB生态。

详情
AI中文摘要

我们介绍了大规模视频嵌入基准(MVEB),这是一个包含23个任务的视频嵌入基准,涵盖分类、零样本分类、聚类、配对分类、检索和以视频为中心的问答。我们评估了33个模型,发现没有单一模型占优:基于MLLM的嵌入在分类、聚类、配对分类和问答上领先;多模态绑定在检索和零样本分类上领先;没有对比适应训练的生成式MLLM在跨模态任务上崩溃。成对的仅视频与音频+视频评估表明,音频的贡献取决于数据集标注来源:当标签来自两种模态时音频有帮助,当仅来自视觉时则有害,这一差距在模型族中一致为6个百分点。MVEB源自MVEB+(一个包含184个任务的任务池),旨在保持任务多样性的同时降低评估成本。它集成到MTEB生态系统中,以实现跨文本、图像、音频和视频的统一评估。我们在https://github.com/embeddings-benchmark/mteb上发布MVEB和所有184个任务,以及代码和排行榜。

英文摘要

We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering. We evaluate 33 models and find that no single model dominates: MLLM-based embeddings lead on classification, clustering, pair classification, and QA; multimodal binding leads on retrieval and zero-shot classification; generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only vs. audio+video evaluations show that audio's contribution depends on dataset annotation provenance: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a six-point gap consistent across model families. MVEB is derived from MVEB+, a 184-task pool, and is designed to maintain task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.

2606.15055 2026-06-16 cs.CV cs.AI 新提交

Bridging Geographic Bias in Urban Streetscape Inference via Lifelong Learning with Visual-Semantic Pivoting

通过视觉-语义枢轴终身学习弥合城市街景推理中的地理偏差

Xinze Zhang

发表机构 * University of Southern California(南加州大学)

AI总结 提出HVSP-LL终身学习框架,通过分层视觉-语义枢轴模块和公平感知重放机制,在跨城市街景推理中减少地理偏差,实现城市间感知差距缩小38%。

详情
AI中文摘要

城市街景的视觉感知支撑着景观规划、公共卫生和场所营造中的循证决策。然而,在少数拍摄良好的大都市上训练的模型会系统性地误判代表性不足的地区,将地理偏差传播到下游政策中。我们通过HVSP-LL(一种终身学习框架)解决了这一差距,该框架将分层视觉-语义枢轴模块与公平感知重放机制相结合。枢轴模块沿三层本体(宏观结构、中观组成、微观元素)组织景观概念,并将图像特征与每层可学习的语义锚点对齐,提供抵抗分布漂移的可迁移表示。终身适应组件顺序吸收新的城市区域,同时通过最差区域样本重新加权目标和结构感知示例缓冲区约束区域间感知差距。我们在一个由四大洲十二个城市和七个感知维度组成的全景街景基准上评估了HVSP-LL。该框架在保留城市序列上达到0.834的斯皮尔曼相关系数,比最强的持续基线绝对提高了6.1个百分点,并将城市间感知差距缩小到0.094——相对于最强的持续基线(0.151)减少了38%,相对于代表性的正则化基线(0.218)减少了57%。消融实验证实,枢轴层次结构的每一层都有单调贡献,公平感知重放将平均反向迁移从-0.038(无保留)转换为+0.013,消除了保留序列上的灾难性遗忘。我们的结果表明,分层锚定是实现城市尺度地理公平街景推理的实用途径。

英文摘要

Visual perception of urban streetscapes underpins evidence-based decisions in landscape planning, public health, and place-making. Yet models trained on a few well-photographed metropolises systematically misjudge underrepresented districts, propagating geographic bias into downstream policy. We address this gap with HVSP-LL, a lifelong learning framework that couples a stratified visual-semantic pivoting module with an equity-aware rehearsal mechanism. The pivoting module organises landscape concepts along a three-tier ontology (macro structure, meso composition, micro element) and aligns image features to learnable semantic anchors at each tier, providing transferable representations that resist distributional drift. The lifelong adaptation component sequentially absorbs new urban regions while constraining inter-region perception gaps through a worst-region sample-reweighting objective and a structurally-aware exemplar buffer. We evaluate HVSP-LL on a panoramic streetscape benchmark assembled from twelve cities across four continents and seven perceptual dimensions. The framework attains 0.834 Spearman correlation on the held-out city sequence, an absolute 6.1 point improvement over the strongest continual baseline, and shrinks the inter-city perception gap to 0.094 -- a 38% reduction relative to the strongest continual baseline (0.151) and a 57% reduction relative to a representative regularisation baseline (0.218). Ablations confirm that each tier of the pivoting hierarchy contributes monotonically, and the equity-aware rehearsal converts mean backward transfer from -0.038 (without retention) to +0.013, eliminating catastrophic forgetting on the held-out sequence. Our results indicate that hierarchical anchoring is a practical pathway toward geographically equitable streetscape inference at city scale.
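The relative reductions quoted in the abstract can be checked directly from the three gap values it reports (0.094, 0.151, and 0.218). The helper name `relative_reduction` is introduced here purely for illustration.

```python
def relative_reduction(baseline, ours):
    """Fraction by which `ours` is smaller than `baseline`."""
    return (baseline - ours) / baseline

gap_ours = 0.094
# Versus the strongest continual baseline (0.151).
reduction_vs_continual = relative_reduction(0.151, gap_ours)
# Versus the representative regularisation baseline (0.218).
reduction_vs_regularisation = relative_reduction(0.218, gap_ours)

print(round(100 * reduction_vs_continual))       # 38
print(round(100 * reduction_vs_regularisation))  # 57
```

Both figures agree with the 38% and 57% stated in the abstract.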

2606.15346 2026-06-16 cs.CV cs.LG cs.MM 新提交

DYNA-PRUNER: Input-Adaptive Data-Model Co-Pruning for Efficient and Scalable Spatio-Temporal Media Prediction

DYNA-PRUNER: 面向高效可扩展时空媒体预测的输入自适应数据-模型协同剪枝

Fuyan Zhang, Yuqi Li, Yingli Tian, Edmond S. L. Ho

发表机构 * The City College of New York(纽约市立学院) The Graduate Center, CUNY(纽约市立大学研究生中心) New York University(纽约大学) University of Glasgow(格拉斯哥大学)

AI总结 提出Dyna-Pruner框架,通过共享重要性同步机制实现输入自适应的数据与模型结构协同剪枝,在CNN、RNN和Transformer骨干上减少70% FLOPs并实现2.5倍加速,精度损失小于1%。

Comments ICME 2026 Spotlight Paper

详情
AI中文摘要

时空预测支持雷达/卫星临近预报和城市级交通监测,但现代模型通常因实时部署成本过高而受限。这源于密集计算与强输入依赖冗余(如平静海面或晴朗天空)之间的不匹配。为了在可扩展媒体分析中实现自动化的资源感知架构优化,我们提出Dyna-Pruner,一个用于输入依赖的数据和模型结构协同剪枝的端到端框架。一种共享重要性同步机制生成耦合掩码,剪枝冗余区域及其对应的计算单元(如卷积滤波器),从而在推理时产生每个样本的稀疏子网络。在WeatherBench、SEVIR和TaxiBJ上的实验表明,该框架与CNN、RNN和Transformer骨干无缝集成,将FLOPs减少高达70%,并在NVIDIA Jetson AGX Orin上实现2.5倍加速,精度损失可忽略不计(<1%)。

英文摘要

Spatio-temporal prediction supports radar/satellite nowcasting and city-scale traffic monitoring, but modern models are often too expensive for real-time deployment. This stems from a mismatch between dense computation and strong input-dependent redundancy (e.g., calm seas or clear skies). To enable automated, resource-aware architecture optimization in scalable media analysis, we propose Dyna-Pruner, an end-to-end framework for input-dependent co-pruning of data and model structure. A shared-importance synchronization mechanism generates coupled masks that prune redundant regions and their corresponding computational units (e.g., convolutional filters), yielding per-sample sparse sub-networks at inference time. Experiments on WeatherBench, SEVIR, and TaxiBJ show seamless integration with CNN, RNN, and Transformer backbones, reducing FLOPs by up to 70% and achieving a 2.5× speedup on NVIDIA Jetson AGX Orin with negligible accuracy loss (<1%).

2606.15570 2026-06-16 cs.CV 新提交

An Extensive Benchmark for Single-round and Multi-round Instruction-based Image Editing

单轮与多轮指令式图像编辑的广泛基准

Yiwei Ma, Ke Ye, Weihuang Lin, Jiayi Ji, Xiaoshuai Sun, Tat-Seng Chua, Rongrong Ji

发表机构 * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University(厦门大学多媒体可信感知与高效计算教育部重点实验室) National University of Singapore(新加坡国立大学)

AI总结 提出I2EBench2.0基准,通过16个单轮和7个多轮维度评估指令式图像编辑模型,结合用户研究确保与人类判断一致,并基于八种模型分析提供研究指导。

Comments Accepted by International Journal of Computer Vision (IJCV), 2026

详情
AI中文摘要

近年来,基于指令的图像编辑(IIE)领域取得了显著进展,该领域专注于使用模型自动修改输入图像。然而,由于指令的复杂性和编辑的多样性,评估这些编辑模型的有效性是一项重大挑战。为解决此问题,该领域的一项紧迫任务是开发一个稳健的评估框架,能够精确衡量编辑结果的质量,并提供有价值的基准以指导未来改进。为应对这一挑战,我们提出了一个名为I2EBench2.0的综合评估基准,专为IIE模型的单轮和多轮评估而设计。I2EBench2.0具有四个关键特性:1)跨单轮和多轮评估:I2EBench2.0同时评估单轮和多轮基于指令的编辑,评估编辑的精确性和一致性。2)广泛的评估标准:I2EBench2.0涵盖广泛的标准,评估每个IIE模型的高层和低层方面。具体而言,它包含16个单轮评估维度和7个多轮评估维度。3)与人类判断对齐:为确保我们的基准与人类评估一致,我们对每个标准进行了全面的用户研究。4)研究驱动的见解:通过分析当前IIE模型在所有16个单轮和7个多轮维度上的优缺点,我们提供了旨在指导该领域未来研究的关键见解。我们使用I2EBench2.0测试了八个最近开发的IIE模型,并通过细致的比较和分析得出了学术见解。相关代码、数据集以及所有IIE模型生成的图像可在GitHub上获取:https://github.com/cocoshe/I2EBench。

英文摘要

In recent years, there have been notable advancements in the area of instruction-based image editing (IIE), which focuses on the automatic alteration of input images using a model. Nevertheless, assessing the effectiveness of these editing models poses a considerable challenge due to the intricate nature of instructions and the wide variety of edits. To tackle this problem, one urgent task in this domain is the development of a robust evaluation framework that can precisely gauge the quality of editing outcomes and offer valuable benchmarks to guide future improvements. To address this challenge, we present a comprehensive evaluation benchmark named I2EBench2.0, designed for single-round and multi-round assessment of IIE models. I2EBench2.0 has four key features: 1) Evaluation Across Single and Multi-rounds: I2EBench2.0 simultaneously evaluates both single-round and multi-round instruction-based edits, assessing the precision and consistency of the edits. 2) Extensive Evaluation Criteria: I2EBench2.0 encompasses a broad range of criteria, evaluating both high-level and low-level aspects of each IIE model. Specifically, it incorporates 16 dimensions for single-round evaluations and 7 for multi-round evaluations. 3) Alignment with Human Judgment: To ensure our benchmark aligns with human evaluation, we conducted a comprehensive user study for each criterion. 4) Research-driven Insights: By analyzing the strengths and weaknesses of current IIE models across all 16 single-round and 7 multi-round dimensions, we provide critical insights aimed at directing future research in this area. We tested eight recently developed IIE models using I2EBench2.0 and derived academic insights through meticulous comparison and analysis. The related code, dataset, and images generated by all IIE models are available on GitHub: https://github.com/cocoshe/I2EBench.

2606.15629 2026-06-16 cs.CV 新提交

XPASS-Vis: A Dataset for Cross-Domain Personalized Image Aesthetic Assessment

XPASS-Vis: 跨领域个性化图像美学评估数据集

Takato Hayashi, Hiroaki Takahara, Candy Olivia Mawalim, Hiromi Narimatsu, Akisato Kimura, Shiro Kumano, Shogo Okada

发表机构 * Japan Advanced Institute of Science and Technology(日本先端科学技术大学) Communication Science Laboratories, NTT, Inc.(日本电信电话株式会社通信科学实验室)

AI总结 提出首个跨领域个性化图像美学评估数据集XPASS-Vis,涵盖艺术、时尚、风景三个领域,通过129名标注者评估6526个刺激,建立跨领域个性化美学偏好迁移的基准模型,发现无监督域适应方法可恢复约60%的监督上限性能。

详情
AI中文摘要

个性化图像美学评估(PIAA)旨在个体层面上对艺术品和照片的美学判断的主观性进行建模。已知美学偏好既高度个性化又在视觉领域间部分一致。然而,现有的PIAA数据集和方法大多局限于单一领域,或每个领域内每位标注者的样本太少,无法实现跨领域个性化。因此,个性化美学偏好的跨领域泛化在很大程度上仍未得到探索。为了解决这一空白,我们引入了XPASS-Vis,这是第一个专门为跨领域PIAA设计的数据集。XPASS-Vis包含来自三个视觉领域(艺术、时尚、风景)的6,526个刺激,由129名标注者评分,产生87,836次用户-刺激交互,每次交互都标注了总体美学得分和九项美学情感评分。值得注意的是,每位标注者在每个领域评分的刺激超过200个,提供了足够的领域内覆盖以支持领域内和跨领域的个性化。此外,我们在无监督域适应(UDA)下建立了跨领域PIAA的基线模型,其中在标记源领域上训练的模型被迁移到未标记的目标领域。对代表性UDA方法的系统评估表明,在完全无监督的设置下,性能最佳的方法恢复了约60%(Spearman's ρ = .28)的监督上限。这提供了令人鼓舞的证据,表明个性化美学偏好在一定程度上可以在视觉领域间迁移。同时,仍然存在显著差距,凸显了需要针对PIAA的适应策略。XPASS-Vis及附带的基线为跨领域PIAA的未来研究奠定了基础。所有数据集和代码将在论文被接收后公开。

英文摘要

Personalized image aesthetic assessment (PIAA) seeks to model, at the individual level, the subjective nature of aesthetic judgments toward artworks and photographs. Aesthetic preference is known to be both deeply personal and partially consistent across visual domains. Yet existing PIAA datasets and methods are largely confined to a single domain, or provide too few samples per annotator within each domain to enable personalization across domains. Consequently, the cross-domain generalization of personalized aesthetic preferences remains largely unexplored. To address this gap, we introduce XPASS-Vis, the first dataset explicitly designed for cross-domain PIAA. XPASS-Vis comprises 6,526 stimuli from three visual domains -- art, fashion, and landscape -- rated by 129 annotators, yielding 87,836 user-stimulus interactions, each annotated with an overall aesthetic score and nine aesthetic-emotion ratings. Notably, each annotator rated more than 200 stimuli per domain, providing sufficient per-domain coverage to support personalization both within and across domains. Moreover, we establish baseline models for cross-domain PIAA under unsupervised domain adaptation (UDA), where a model trained on a labeled source domain is transferred to an unlabeled target domain. A systematic evaluation of representative UDA approaches shows that the best-performing method recovers approximately 60% (Spearman's ρ = 0.28) of the supervised upper bound under a fully unsupervised setting. This provides encouraging evidence that personalized aesthetic preferences are, to a meaningful extent, transferable across visual domains. At the same time, a substantial gap remains, highlighting the need for PIAA-specific adaptation strategies. XPASS-Vis and the accompanying baselines provide a foundation for future research on cross-domain PIAA. All datasets and code will be made publicly available upon acceptance.

2606.15749 2026-06-16 cs.CV cs.AI cs.SY eess.SY 新提交

OmniTraffic: A Controllable Generation Pipeline and Benchmark for Spatio-Temporal Traffic Reasoning

OmniTraffic:面向时空交通推理的可控生成流水线与基准

Maonan Wang, Zhengyan Huang, Kemou Jiang, Yuhang Fu, Jiayue Zhu, Yuxin Cai, Xingchen Zou, Qiaosheng Zhang, Yi Yu, Ding Wang, Xi Chen, Ben M. Chen, Yuxuan Liang, Zhiyong Cui, Man On Pun, Yirong Chen

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Shanghai AI Lab(上海人工智能实验室) Beihang University(北京航空航天大学) Nanyang Technological University(南洋理工大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出OmniTraffic,一个基于12个真实路口3D重建的可控生成流水线与基准,通过8M VQA样本和3K人工验证测试集评估11个前沿MLLM,揭示拓扑与时空推理中的显著人机差距,并证明仿真数据微调可提升真实场景性能。

Comments 34 pages, 28 figures

详情
AI中文摘要

交通场景理解要求模型超越物体识别进行推理,包括车道拓扑、多视角几何、时间演变和信号相位语义。然而,现有的面向交通的多模态基准大多强调被动视觉识别或孤立的视频理解,在受控条件下评估结构感知的交通推理方面支持有限。我们介绍了OmniTraffic,一个用于时空交通推理的可控生成流水线和基准。它基于12个真实世界交叉口重建为可编辑的3D交通环境,并辅以来自两个国家的监控录像,支持受控和自然条件评估。它定义了一个三级任务层次,涵盖场景感知、多视角和时间推理以及决策支持。利用结构化交通元数据,OmniTraffic生成同步的多视角VQA样本,涵盖车辆状态、车道功能、视图-BEV对应、时间动态和信号相位分析,产生800万个VQA样本和一个3000个人工验证的测试集。对11个前沿MLLM的评估揭示了巨大的人机差距,在拓扑基础和时空推理任务中失败最为明显。在模拟的OmniTraffic数据上微调轻量级MLLM进一步提高了在真实交通场景上的性能,证明了仿真生成的监督对特定交通多模态推理的价值。除了固定数据集,OmniTraffic还提供了一个可扩展的流水线,具有可配置的交叉口、相机视角、交通需求、信号相位、视觉条件和罕见事件。

英文摘要

Traffic scene understanding requires models to reason beyond object recognition, including lane topology, multi-view geometry, temporal evolution, and signal-phase semantics. However, existing traffic-oriented multimodal benchmarks largely emphasize passive visual recognition or isolated video understanding, offering limited support for evaluating structure-aware traffic reasoning under controlled conditions. We introduce OmniTraffic, a controllable generation pipeline and benchmark for spatio-temporal traffic reasoning. Built around 12 real-world intersections reconstructed into editable 3D traffic environments and complemented by surveillance footage from two countries, OmniTraffic supports both controlled and natural-condition evaluation. It defines a three-level task hierarchy spanning scene perception, multi-view and temporal reasoning, and decision support. Using structured traffic metadata, OmniTraffic generates synchronized multi-view VQA samples covering vehicle states, lane functions, view--BEV correspondence, temporal dynamics, and signal-phase analysis, resulting in 8M VQA samples and a 3K human-verified test set. Evaluation of eleven frontier MLLMs reveals a large human--model gap, with the most pronounced failures in topology-grounded and spatio-temporal reasoning tasks. Fine-tuning a lightweight MLLM on simulated OmniTraffic data further improves performance on real-world traffic scenes, demonstrating the value of simulation-generated supervision for traffic-specific multimodal reasoning. Beyond a fixed dataset, OmniTraffic provides an extensible pipeline with configurable intersections, camera views, traffic demands, signal phases, visual conditions, and rare events.

2606.15867 2026-06-16 cs.CV 新提交

CogCanvas: A Benchmark for Evaluating Multi-Subject Reference-Based Image Generation

CogCanvas: 用于评估多主体参考图像生成的基准

Long-Bao Nguyen, Quang-Khai Tran, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

发表机构 * University of Science, Ho Chi Minh City, Vietnam(胡志明市理科大学) University of Dayton, Ohio, United States(代顿大学) Vietnam National University, Ho Chi Minh City, Vietnam(越南国家大学胡志明市分校)

AI总结 提出CogCanvas基准,包含1952张参考图像和1361个组合提示,评估多身份、对象绑定和背景场景的生成,引入BG-Sim和Attr-VQA指标,发现现有模型在超过3个主体时性能严重下降。

详情
AI中文摘要

多主体参考图像生成需要同时保留多个人的身份、绑定每个人的对象和时尚物品,并尊重指定的背景场景,当前扩散模型在此方面仍然脆弱。现有基准一次只评估一个方面,没有一个能联合捕捉多身份组合、人-物交互、背景基础和空间合理性。我们引入了CogCanvas,一个包含1952张精选参考图像的基准,涵盖100个名人身份、115个独特对象和时尚物品,以及29个真实世界背景场景(包括地标),从中我们构建了1361个组合提示,覆盖2-5人的群体规模。筛选流程结合了基于DINOv2的去重、两阶段美学过滤以及结构化交互和位置图的自动推导,作为真实监督。CogCanvas在统一的六轴评估协议下支持三个任务:基于参考的多人物-对象生成(主要)、文本到图像的组合生成和参考检索。我们引入了两个针对多参考设置量身定制的指标:BG-Sim,通过DINOv3特征相似性在SAM 3掩码区域上评分背景保真度;Attr-VQA,使用多模态大语言模型根据结构化图验证每个主体的属性绑定和人际交互。对五种最先进方法的基准测试表明,随着群体规模从2人增加到5人,每个模型都显著退化,在超过三个主体时对象/时尚物品绑定几乎完全失败。

英文摘要

Multi-subject reference-based image generation requires jointly preserving multiple human identities, binding per-person objects and fashion items, and respecting a specified background scene, a regime where current diffusion models remain brittle. Existing benchmarks evaluate only one axis at a time and none jointly captures multi-identity composition with human-object interaction, background grounding, and spatial plausibility. We introduce CogCanvas, a benchmark of 1,952 curated reference images spanning 100 celebrity identities, 115 distinctive objects and fashion items, and 29 real-world background scenes including landmarks, from which we construct 1,361 compositional prompts covering 2-5 person group sizes. The curation pipeline combines DINOv2-based deduplication, two-stage aesthetic filtering, and automated derivation of structured interaction and position graphs that serve as ground-truth supervision. CogCanvas supports three tasks, reference-based multi-human-object generation (primary), text-to-image compositional generation, and reference retrieval, under a unified six-axis evaluation protocol. We introduce two metrics tailored to the multi-reference setting: BG-Sim, which scores background fidelity on SAM 3-masked regions via DINOv3 feature similarity, and Attr-VQA, which uses a multimodal LLM to verify per-subject attribute binding and inter-person interactions against the structured graphs. Benchmarking five SOTA methods reveals that every model degrades substantially as group size grows from 2 to 5, with near-complete failure on object/fashion binding beyond three subjects.

2606.15956 2026-06-16 cs.CV cs.AI cs.LG 新提交

You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences

你不需要强假设:通过时间差异进行视觉表示学习

Ninad Daithankar, Alexi Gladstone, Yann LeCun, Heng Ji

发表机构 * UIUC(伊利诺伊大学厄巴纳-香槟分校) New York University(纽约大学)

AI总结 提出TDV方法,基于因果假设(过去导致未来)从视频中自监督学习,避免强归纳偏置,在密集空间任务上达到SOTA。

详情
AI中文摘要

AI的进步很大程度上是由假设更少的方法驱动的。随着计算和数据量的增加,弱归纳偏置的方法通常优于强假设的方法。这在视觉表示学习领域尤为典型,方法从监督学习主导,到弱监督学习,再到如今无需人工标签的自监督学习的广泛成功。然而,即使是现代自监督学习方法仍然依赖于强归纳偏置,如数据增强、掩码或裁剪。如果这一趋势持续,这些剩余的偏置在大规模下将成为瓶颈——我们的实验证实了这一点:随着数据增长,归纳偏置的最优强度降低。这促使我们寻找依赖更少假设的方法。为此,我们提出了视觉时间差异(TDV),一种从视频中进行自监督学习的新范式,它避免了现有的归纳偏置,而是依赖于一个因果假设:过去导致未来。TDV通过联合训练图像编码器和运动编码器,使得当前帧的表示加上编码的运动等于下一帧的表示。尽管没有利用任何强归纳偏置,TDV在密集空间任务上达到了最先进的水平,为无需强假设的表示学习奠定了基础。

英文摘要

Progress in AI has largely been driven by methods that assume less. As compute and data increase, approaches with weaker inductive biases generally outperform those with stronger assumptions. This is particularly characteristic of the field of Visual Representation Learning, where approaches have gone from being dominated by Supervised Learning, to Weakly Supervised Learning, to the now widespread success of Self-Supervised Learning without human labels. Yet, even modern Self-Supervised Learning approaches still depend on strong inductive biases such as augmentations, masking, or cropping. If this trend holds, even these remaining biases should become bottlenecks at scale -- and our experiments confirm this: the optimal strength of inductive biases decreases as data grows. This motivates the search for approaches that rely on fewer assumptions. To this end, we introduce Temporal Difference in Vision (TDV), a new paradigm for self-supervised learning from video that avoids existing inductive biases, relying instead on a causal assumption that the past causes the future. TDV functions by jointly training an image encoder and a motion encoder so that the current frame's representation plus the encoded motion equals the next frame's representation. Despite not leveraging any strong inductive biases, TDV matches state-of-the-art recipes on dense spatial tasks, laying the foundation for representation learning without strong assumptions.
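The training constraint described in the abstract (current-frame representation plus encoded motion should equal the next-frame representation) can be written as a simple residual loss. The sketch below is an assumption-laden illustration, not the paper's code: both encoders are plain linear maps, the motion encoder sees the concatenated frame pair, the squared-error form of the loss is a guess, and the names `tdv_loss`, `w_img`, and `w_motion` are hypothetical. It also does nothing to prevent the trivial all-zero solution, which the abstract does not discuss.

```python
import numpy as np

def tdv_loss(frame_t, frame_next, w_img, w_motion):
    """Mean squared residual of  f(x_t) + g(x_t, x_{t+1}) - f(x_{t+1}).

    frame_t, frame_next: (batch, D) flattened frames
    w_img:    (d, D)   linear image encoder f (stand-in for a real network)
    w_motion: (d, 2D)  linear motion encoder g over the concatenated pair
    """
    z_t = frame_t @ w_img.T        # representation of the current frame
    z_next = frame_next @ w_img.T  # representation of the next frame
    pair = np.concatenate([frame_t, frame_next], axis=-1)
    motion = pair @ w_motion.T     # encoded motion between the two frames
    residual = z_t + motion - z_next
    return float(np.mean(np.sum(residual ** 2, axis=-1)))

rng = np.random.default_rng(0)
frames_t = rng.normal(size=(4, 6))
frames_next = rng.normal(size=(4, 6))
w_img = rng.normal(size=(3, 6))
# A motion encoder that outputs exactly f(x_{t+1}) - f(x_t) drives the loss to zero.
w_motion_exact = np.concatenate([-w_img, w_img], axis=1)
loss_exact = tdv_loss(frames_t, frames_next, w_img, w_motion_exact)
```

In practice both encoders would be deep networks trained jointly by gradient descent on this kind of objective.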

2606.16015 2026-06-16 cs.CV 新提交

Stringalign: Moving beyond summary statistics with a transparent Unicode-aware tool for evaluating automatic transcription models

Stringalign: 超越摘要统计的透明Unicode感知工具,用于评估自动转录模型

Yngve Mardal Moe, Marie Roald

发表机构 * Independent researcher(独立研究员) The National Library of Norway(挪威国家图书馆)

AI总结 提出Stringalign库,通过透明预处理和错误分析,解决字符/词错误率定义模糊问题,支持HTR、OCR和ASR模型的可重复评估。

详情
AI中文摘要

在评估和理解文档识别、音频转录等文本处理任务的性能时,比较文本字符串至关重要。随着基于AI的手写文本识别(HTR)、光学字符识别(OCR)和自动语音识别(ASR)模型日益复杂,需要能够以灵活且可重复的方式促进评估的工具。本文介绍了Stringalign,一个旨在简化自动转录项目评估过程并促进透明评估的Python库。Stringalign的工具可以检查和可视化模型产生的错误率和错误类型,从而洞察可能的改进,并帮助为特定任务选择模型。广泛使用的字符串比较指标,如字符错误率(CER)和词错误率(WER),虽然有用,但由于字符和词的定义不同而可能产生歧义。Stringalign通过确保所有预处理(即归一化和分词)透明且易于复制,并提供工具以超越摘要统计并分析常见模型错误,解决了这一挑战。此外,Stringalign遵循研究软件的FAIR(可发现、可访问、可互操作、可重用)原则,同时保持轻量级且易于融入研究人员现有工作流程。在本文中,我们讨论了字符级和词级字符串比较的挑战,并通过示例表明,现有工具可能产生不透明且有时令人困惑的结果,而Stringalign提供了一种易于使用且无歧义的替代方案。

英文摘要

Comparing text strings is crucial when evaluating and understanding the performance of various text processing tasks such as document recognition and audio transcription. With an increasingly complex landscape of AI-based handwritten text recognition (HTR), optical character recognition (OCR) and automatic speech recognition (ASR) models, there is a need for tools that facilitate evaluation in a flexible and reproducible way. This paper presents Stringalign, a Python library designed to simplify the evaluation process for automatic transcription projects and facilitate transparent evaluation. Stringalign's tools for examining and visualising both the rate and the types of errors a model makes give insight into possible improvements and help inform model selection for a particular task. Widely used string comparison metrics, such as the character and word error rates (CER and WER), although useful, can be ambiguous due to varying definitions of what constitutes a character and a word. Stringalign addresses this challenge by ensuring all preprocessing (i.e. normalisation and tokenisation) is transparent and easily replicable, and by providing tools to move beyond summary statistics and analyse common model errors. Moreover, Stringalign adheres to FAIR (Findable, Accessible, Interoperable, and Reusable) principles for research software while staying lightweight and easy to integrate into researchers' existing workflows. In this paper, we discuss challenges with character- and word-level string comparisons and show through examples that, where existing tools can yield opaque and sometimes confusing results, Stringalign provides an easy-to-use and unambiguous alternative.
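The ambiguity the abstract mentions, that CER depends on what counts as a character, is easy to reproduce with the Python standard library alone. The sketch below does not use Stringalign's API (which is not described in the abstract); `levenshtein` and `cer` are small helper functions written for this illustration, and code points are used as the unit of comparison.

```python
import unicodedata

def levenshtein(a, b):
    """Edit distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, item_a in enumerate(a, start=1):
        curr = [i]
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

def cer(reference, prediction, form):
    """Character error rate over code points after Unicode normalisation."""
    ref = unicodedata.normalize(form, reference)
    pred = unicodedata.normalize(form, prediction)
    return levenshtein(ref, pred) / len(ref)

reference = "caf\u00e9"  # "café" with a precomposed e-acute
prediction = "cafe"

cer_nfc = cer(reference, prediction, "NFC")  # 1 substitution over 4 code points
cer_nfd = cer(reference, prediction, "NFD")  # 1 deletion over 5 code points
```

The same transcription error scores 0.25 under NFC and 0.2 under NFD, so a reported CER is only interpretable when the normalisation and tokenisation steps are stated alongside it.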

2606.16185 2026-06-16 cs.CV 新提交

Learned JPEG Compression for DNN Vision

面向DNN视觉的JPEG压缩学习

Kaixiang Zheng, Ahmed H. Salamah, Siyu Chen, En-Hui Yang

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出J4D框架,通过可微分JPEG编解码器和信息论速率估计,优化JPEG编码参数以在低压缩率下提升DNN推理性能,实验显示在相同精度下压缩率降低高达80.05%。

详情
AI中文摘要

JPEG是一种为人类观看而设计的损失性图像压缩技术,几十年来一直占据主导地位。然而,在人工智能(AI)时代,大量通常由JPEG压缩的图像数据正在并将继续由深度神经网络(DNN)而非人类消费,因此需要优化JPEG以提升DNN推理性能。为此,我们提出面向DNN视觉的JPEG压缩学习(J4D),这是一种新颖的训练框架,用于确定JPEG编码参数,以在最小化压缩率的同时最大化DNN推理性能。解决这一优化问题的主要挑战在于以封闭形式表示JPEG编解码器和压缩率。通过引入基于概率量化方案的可微分软量化器,我们不仅获得了JPEG编解码器的可微分代理,还能够解析计算编码源的熵,这是实际压缩率的近似估计。有了可微分JPEG编解码器和信息论速率估计器,我们就能通过反向传播解决上述优化问题。训练后,学习到的编码参数将基于概率量化用于实际的JPEG编码。跨多个数据集和DNN架构的大量实验结果表明,J4D始终显著优于默认JPEG和其他为DNN优化的竞争性JPEG编解码器。值得注意的是,与默认JPEG相比,J4D在相同码率下准确率提升高达11.60%,或在相同准确率下压缩率降低高达80.05%。此外,借助J4D,我们首次展示了为不同DNN架构设计通用JPEG编码参数的潜力。

英文摘要

JPEG, a lossy image compression technique designed for human viewers, has maintained its dominance for decades. However, in the era of artificial intelligence (AI), a substantial portion of image data, often compressed by JPEG, is and will continue to be consumed by deep neural networks (DNNs) instead of humans, thus creating a need to optimize JPEG for DNN inference performance. To this end, we propose learned JPEG compression for DNN vision (J4D), a novel training framework for determining JPEG encoding parameters to minimize compression rate while maximizing DNN inference performance. The major challenge of solving this optimization problem lies in representing the JPEG codec and compression rate in closed form. By incorporating a differentiable soft quantizer based on a probabilistic quantization scheme, we not only obtain a differentiable proxy for the JPEG codec, but are also able to compute the entropy of the coded source analytically, which is a close estimate of the actual compression rate. Equipped with both the differentiable JPEG codec and the information-theoretic rate estimator, we are then able to solve the aforementioned optimization problem with backpropagation. After training, the learned encoding parameters will be subsequently used in actual JPEG encoding based on probabilistic quantization. Extensive experimental results across multiple datasets and DNN architectures demonstrate that J4D consistently and significantly outperforms the default JPEG and other competitive JPEG codecs optimized for DNNs. Notably, compared to the default JPEG, J4D achieves an increase in accuracy by as much as 11.60% at the same rate, or a reduction of compression rate up to 80.05% at the same accuracy. Additionally, with the help of J4D, we show the potential to design universal JPEG encoding parameters for various DNN architectures for the first time.

2606.16256 2026-06-16 cs.CV cs.LG 新提交

KeepLoRA++: Continual Learning with Layer-Scaled Residual Gradient Adaptation

KeepLoRA++: 基于层级缩放残差梯度适应的持续学习

Mao-Lin Luo, Yi-Lin Zhang, Zi-Hao Zhou, Yankun Hong, Xialiang Tong, Mingxuan Yuan, Tong Wei, Min-Ling Zhang

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Key Laboratory of Computer Network and Information Integration, Southeast University, Ministry of Education(东南大学计算机网络和信息集成教育部重点实验室) Huawei Noah’s Ark Lab(华为诺亚方舟实验室)

AI总结 针对预训练视觉语言模型持续学习中保留预训练知识、旧任务知识和学习新知识的冲突,提出KeepLoRA++,通过层级缩放残差梯度适应方法,限制LoRA参数更新到残差子空间并采用浅到深层缩放,平衡三者,在图像分类、视觉问答和视频理解任务上优于基线。

详情
AI中文摘要

预训练视觉语言模型的持续学习需要平衡三个相互竞争的目标:保留预训练知识、保留一系列已学习任务的知识以及保持获取新知识的可塑性。本文提出KeepLoRA++,通过统一的二维知识保留机制来平衡这些目标。我们从层间和层内两个角度分析Transformer架构的知识分布。层间视角考察知识保留如何跨层分布,而层内视角关注每层内的参数空间。我们的分析揭示了一个结构特性:通用可迁移知识主要编码在浅层和参数的主子空间中,而任务特定适应则定位于深层和残差子空间。受此启发,KeepLoRA++引入了一种层级缩放残差梯度适应方法。新任务的学习通过将LoRA参数更新限制在残差子空间,并结合从浅到深的层级缩放来实现,以防止干扰先前获得的能力。具体而言,新任务的梯度被投影到与预训练模型主子空间以及先前任务特征主导方向正交的子空间上,同时为浅层分配较小的更新幅度,为深层分配较大的更新幅度。我们的理论分析和实证评估证实,KeepLoRA++成功平衡了这三个相互竞争的目标,在图像分类、视觉问答和视频理解任务上持续优于代表性基线。

英文摘要

Continual learning for pre-trained vision-language models requires balancing three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents KeepLoRA++, which balances these objectives through a unified dual-dimensional knowledge retention mechanism. We analyze the knowledge distribution of the Transformer architecture from both inter-layer and intra-layer perspectives. The inter-layer perspective examines how retention is distributed across layers, while the intra-layer perspective focuses on the parameter space within each layer. Our analysis reveals a structural property: general transferable knowledge is mainly encoded in the shallow layers and the principal subspace of the parameters, while task-specific adaptations are localized in the deep layers and the residual subspace. Motivated by this insight, KeepLoRA++ introduces a layer-scaled residual gradient adaptation method. New tasks are learned by restricting LoRA parameter updates to the residual subspace, combined with a shallow-to-deep layer scaling, to prevent interference with previously acquired capabilities. Specifically, the gradient of a new task is projected onto a subspace orthogonal to both the principal subspace of the pre-trained model and the dominant directions of previous task features, while smaller update magnitudes are assigned to shallow layers and larger ones to deeper layers. Our theoretical analysis and empirical evaluations confirm that KeepLoRA++ successfully balances these three competing objectives, consistently outperforming representative baselines across image classification, visual question answering, and video understanding tasks.
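The two ingredients named in the abstract, projecting the gradient away from a protected subspace and scaling updates by depth, can be sketched as follows. This is a hedged illustration rather than the authors' algorithm: `project_and_scale`, the linear shallow-to-deep schedule, and the `min_scale` parameter are assumptions, and the protected basis is taken to be a single orthonormal matrix that already combines the pre-trained principal directions with the dominant directions of previous tasks.

```python
import numpy as np

def project_and_scale(grad, protected_basis, layer_index, num_layers, min_scale=0.1):
    """Remove protected directions from a gradient, then scale it by depth.

    grad:            (d, m) gradient for one layer
    protected_basis: (d, r) orthonormal columns spanning the protected subspace
    layer_index:     0 for the shallowest layer, num_layers - 1 for the deepest
    """
    # Orthogonal projection onto the residual subspace: G - U U^T G.
    residual = grad - protected_basis @ (protected_basis.T @ grad)
    # Linear shallow-to-deep schedule (assumption): min_scale at layer 0, 1.0 at the top.
    depth = layer_index / max(num_layers - 1, 1)
    scale = min_scale + (1.0 - min_scale) * depth
    return scale * residual

rng = np.random.default_rng(0)
# Orthonormalise some stand-in protected directions with a QR decomposition.
basis, _ = np.linalg.qr(rng.normal(size=(8, 3)))
grad = rng.normal(size=(8, 5))
shallow_update = project_and_scale(grad, basis, layer_index=0, num_layers=12)
deep_update = project_and_scale(grad, basis, layer_index=11, num_layers=12)
```

After projection the update has no component along the protected directions, and the shallow-layer update is a scaled-down copy of what a deep layer would receive for the same gradient.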

2606.16334 2026-06-16 cs.CV 新提交

Chronological Blindness: Benchmarking Temporal Reasoning in Vision-Language Models with CHRONOSIGHT

时间盲:使用CHRONOSIGHT基准测试视觉语言模型的时间推理能力

Parthaw Goswami, Jaynto Goswami Deep

发表机构 * Department of Computer Science, University of Missouri(密苏里大学计算机科学系) SAP

AI总结 提出CHRONOSIGHT基准,从五个维度评估视觉语言模型的时间推理能力,发现模型与人类存在巨大差距(人类平均0.89,最佳模型0.40),并通过微调显著提升性能。

详情
AI中文摘要

人类对视觉场景的感知本质上是时间性的。我们本能地识别水果是在成熟还是腐烂,建筑是在进展还是被拆除,以及两张同一主体的照片之间大致相隔多少时间。大型视觉语言模型(VLM)是否具备这种能力仍然是一个开放且具有实际重要性的问题。我们引入了CHRONOSIGHT,一个严格控制的基准,评估视觉时间推理的五个维度:CHRONORANK(图像序列的时间顺序排序)、CHRONOLOCATE(从单张图像定位阶段顺序)、CHRONODELTA(估计两张图像之间经过的时间,采用对数尺度)、CHRONOREVERSE(检测时间反转序列)以及CHRONOODD(识别集合中的时间异常值)。该基准包含来自八个过程系列(生物生长、食物转化、物理风化、建筑、环境变化、人类衰老、天文现象和城市动态)的1000个项目,时间跨度从分钟到千年。我们在两种提示模式下评估了八个开源VLM(参数从5亿到190亿),并收集了人类表现基线。人类在所有任务上的平均表现为0.89;最佳开源模型(Qwen2.5-VL-7B)在直接提示下达到0.40,我们将这一差距称为时间盲。在151个样本上进行轻量级LoRA微调,将CHRONODELTA的准确率从接近零提升到0.43,并零样本迁移到相关任务(CHRONOODD:0.37;CHRONOREVERSE:0.64),这表明瓶颈部分在于指令遵循而非视觉感知。基准、代码和预测将在接收后发布。

英文摘要

Human perception of visual scenes is inherently temporal. We instinctively recognise whether a fruit is ripening or rotting, whether construction is progressing or being demolished, and approximately how much time separates two photographs of the same subject. Whether large vision-language models (VLMs) share this competence remains an open and practically important question. We introduce CHRONOSIGHT, a rigorously controlled benchmark evaluating five dimensions of visual temporal reasoning: CHRONORANK (chronological ordering of image sequences), CHRONOLOCATE (ordinal stage localisation from a single image), CHRONODELTA (estimation of time elapsed between two images on a logarithmic scale), CHRONOREVERSE (detection of temporally reversed sequences), and CHRONOODD (identification of a temporal outlier within a set). The benchmark comprises 1,000 items across eight process families (biological growth, food transformation, physical weathering, construction, environmental change, human ageing, astronomical phenomena, and urban dynamics) spanning timescales from minutes to millennia. We evaluate eight open-source VLMs (500M to 19B parameters) under two prompting regimes and collect human performance baselines. Human performance averages 0.89 across tasks; the best open model (Qwen2.5-VL-7B) reaches 0.40 under direct prompting, a gap we term chronological blindness. Lightweight LoRA fine-tuning on 151 examples raises CHRONODELTA accuracy from near zero to 0.43, transferring zero-shot to related tasks (CHRONOODD: 0.37; CHRONOREVERSE: 0.64), suggesting the bottleneck is partly instruction following rather than visual perception. Benchmark, code, and predictions will be released upon acceptance.

2606.16633 2026-06-16 cs.CV cs.AI 新提交

DCP-Prune: Ultra-Low Token Pruning with Distribution Consistency Preservation

DCP-Prune:基于分布一致性保持的超低令牌剪枝

Xifeng Xue, Xiaokang Wang, Zirui Li, Ming-Ming Cheng, Guolei Sun

发表机构 * College of Computer Science, Nankai University(南开大学计算机学院) Nanjing University of Posts and Telecommunications(南京邮电大学)

AI总结 提出DCP-Prune框架,通过锚点-上下文图恢复和文本感知令牌聚类选择,在超低令牌预算下保持分布一致性,实现稳定高性能。

Comments The code will be released at: https://github.com/EMVision-NK/DCP-Prune

详情
AI中文摘要

最近的视觉令牌剪枝方法在中等令牌预算下能有效保持模型性能,但在超低令牌预算下变得不稳定。我们的分析表明,随着剪枝预算减少,精度下降通常伴随着更大的特征分布偏移。关键的是,这种分布偏移的程度与性能下降强相关。为了更好地表征这一现象,我们引入了一种轻量级的分布一致性度量来估计保留令牌与完整令牌之间的分布偏移。受这些观察启发,我们提出了一个两阶段剪枝框架,包括锚点-上下文图恢复(ACGR)和文本感知令牌聚类选择(TATCS)。具体地,ACGR在令牌移除前转移上下文信息,而TATCS在检测到严重分布偏移时动态重新选择代表性令牌。大量实验表明,我们的方法在超低令牌预算下实现了更优且更稳定的性能。值得注意的是,在仅使用16个视觉令牌的情况下,它在LLaVA-1.5-7B上保留了92.1%的上限平均性能。

英文摘要

Recent vision token pruning methods effectively preserve model performance under moderate token budgets but become unstable under ultra-low token budgets. Our analysis shows that as the pruning budget decreases, accuracy degradation is often accompanied by larger feature distribution shifts. Critically, the degree of this distribution shift strongly correlates with performance degradation. To better characterize this phenomenon, we introduce a lightweight distribution consistency metric to estimate the distribution shift between retained and full tokens. Motivated by these observations, we propose a two-stage pruning framework consisting of Anchor-Context Graph Recovery (ACGR) and Text-Aware Token Cluster Selection (TATCS). Specifically, ACGR transfers contextual information before token removal, while TATCS dynamically re-selects representative tokens when severe distribution shift is detected. Extensive experiments demonstrate that our method achieves superior and more stable performance under ultra-low token budgets. Notably, it retains 92.1% of the upper-bound average performance on LLaVA-1.5-7B with only 16 visual tokens.

2606.16638 2026-06-16 cs.CV 新提交

MVM-IOD: An Industrial Object-Centric Benchmark Dataset for the Evaluation of 3D Reconstruction Methods

MVM-IOD:用于评估3D重建方法的工业对象中心基准数据集

Robert Langendörfer, Markus Hillemann, Markus Ulrich

发表机构 * Machine Vision Metrology, Institute of Photogrammetry and Remote Sensing, Karlsruhe Institute of Technology, Germany(德国卡尔斯鲁厄理工学院摄影测量与遥感研究所机器视觉计量学)

AI总结 针对工业场景3D重建评估数据集匮乏的问题,提出MVM-IOD数据集,包含工业对象图像、参考相机位姿和点云,并评估了多种SOTA方法,发现前馈方法对非分布图像敏感。

详情
AI中文摘要

工业应用中的3D对象重建和相机位姿估计是挑战性任务,因为错误代价高昂且计算时间通常有限。典型工业对象的复杂性进一步增加了这些任务的难度。现有的大多数数据集并未描绘真实的工业场景。因此,我们引入了机器视觉计量工业对象数据集(MVM-IOD)。通过将安装在工业机器人臂末端执行器上的相机在对象周围的半球上移动,系统性地捕获典型工业对象的图像。MVM-IOD包含参考相机位姿和参考3D点云,以及9个对象和2种背景选择所获得的RGB图像,共产生18个场景,这允许评估所有基于图像的方法,这些方法计算3D重建、相机位姿或场景的新视图。基于MVM-IOD,我们广泛评估了当前的SOTA 3D重建和相机位姿估计方法,例如运动恢复结构、多视图立体、近期的前馈方法(Visual Geometry Grounded Transformer, π3)和2D高斯泼溅,并报告我们的发现作为未来研究的基线。实验表明,像我们这样的捕获设置为前馈方法生成分布外图像,导致次优的点云和相机位姿。然而,通过应用简单的预处理步骤,这些分布外图像可以更接近训练分布。因此,在某些工业应用中,应谨慎使用前馈方法。

英文摘要

3D object reconstruction and camera pose estimation in industrial applications are challenging tasks, as errors are costly while the computation time is often limited. The complexity of typical industrial objects further complicates these tasks. Most of the existing datasets in this context do not depict realistic industrial scenarios. Therefore, we introduce the Machine Vision Metrology Industrial Object Dataset (MVM-IOD). Images of typical industrial objects are captured systematically by moving a camera, mounted at the end effector of an industrial robot arm, on a hemisphere around the objects. MVM-IOD contains reference camera poses and reference 3D point clouds, together with the acquired RGB images of 9 objects and 2 background choices resulting in 18 scenes, which allows evaluation of all image-based methods that compute a 3D reconstruction, camera poses, or novel views of a scene. Based on MVM-IOD, we extensively evaluate current SOTA 3D reconstruction and camera pose estimation methods, such as Structure from Motion, Multi-View Stereo, recent feed-forward methods (Visual Geometry Grounded Transformer, π3), and 2D Gaussian Splatting, and report our findings as a baseline for future research. The experiments show that capture setups like ours generate out-of-distribution images for feed-forward methods, leading to suboptimal point clouds and camera poses. However, these out-of-distribution images can be shifted closer to the training distribution by applying simple preprocessing steps. Consequently, in certain industrial applications, feed-forward methods should be used with caution.

2606.16861 2026-06-16 cs.CV 新提交

An Open-Source Monitoring Framework for Data Exploration and Progress Tracking in Multi-Center Radiology Studies

一个用于多中心放射学研究中数据探索与进度跟踪的开源监控框架

Markus Bujotzek, Jonas Scherer, Stefan Denner, Peter Neher, Benjamin Hamm, Lorenz Feineis, Uenal Akuenal, Andreas Bucher, Tobias Penzkofer, Klaus Maier-Hein

发表机构 * German Cancer Research Center(德国癌症研究中心) University of Heidelberg(海德堡大学) University Hospital Frankfurt(法兰克福大学医院) Charité Universitätsmedizin Berlin(柏林夏里特医学院) Berlin Institute of Health(柏林健康研究所)

AI总结 提出基于Grafana-Prometheus的轻量级开源监控架构,通过聚合分布式站点指标并可视化,实现隐私保护的数据探索和进度监控,已在德国RACOON联盟38家大学医院部署验证。

详情
AI中文摘要

多中心研究对于推进医学和放射学研究至关重要。数据探索、协作发现和研究进度监控对于最大化其潜力至关重要。然而,在实践中,这些过程通常依赖于手动通信和共享表格,这些表格很快就会过时,并阻碍大型分布式研究中的高效协调。这凸显了对专用监控解决方案的需求,以提供对研究进度的透明和最新洞察。我们提出了一种轻量级、开源的多中心研究监控架构,基于广泛使用的Grafana-Prometheus栈。该框架从分布式研究站点收集聚合的监控指标,并通过可配置的仪表板进行可视化。作为一个真实世界的部署示例,该框架被集成到医学影像平台Kaapana中,并在一个大型多中心研究网络中进行评估。通过在德国范围内的RACOON联盟中部署我们的解决方案,我们展示了其在所有38家德国大学医院中实现隐私保护的数据探索和研究进度监控的能力。该监控框架支持分布式研究活动的透明协调,并可促进大规模多中心研究的更高效管理。源代码和Kaapana集成可在https://github.com/MIC-DKFZ/study-monitoring-kaapana公开获取。

英文摘要

Multi-center studies are crucial for advancing medical and radiological research. Data exploration, collaboration discovery, and study progress monitoring are essential for maximizing their potential. However, in practice these processes often rely on manual communication and shared tables, which quickly become outdated and hinder efficient coordination in large distributed studies. This highlights the need for dedicated monitoring solutions that provide transparent and up-to-date insights into study progress. We propose a lightweight, open-source monitoring architecture for multi-center studies based on the widely used Grafana-Prometheus stack. The framework collects aggregated monitoring metrics from distributed study sites and visualizes them through configurable dashboards. As a real-world deployment example, the framework is integrated into the medical imaging platform Kaapana and evaluated within a large multi-center research network. By deploying our solution within the Germany-wide RACOON consortium, we demonstrate its ability to enable privacy-preserving data exploration and study progress monitoring across all 38 German university clinics. The monitoring framework supports transparent coordination of distributed research activities and can facilitate more efficient management of large-scale multi-center studies. The source code and Kaapana integration are publicly available at https://github.com/MIC-DKFZ/study-monitoring-kaapana.

2606.16868 2026-06-16 cs.CV cs.AI cs.DC 新提交

Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection

真实世界标签噪声下的联邦医学图像分割:面向噪声标签学习方法选择的基准套件

Markus Bujotzek, Dimitrios Bounias, Stefan Denner, Ralf Floca, Maximilian Fischer, Peter Neher, Klaus Maier-Hein

发表机构 * Division of Medical Image Computing, German Cancer Research Center(德国癌症研究中心医学图像计算部) Medical Faculty, University of Heidelberg(海德堡大学医学院) Heidelberg Institute of Radiation Oncology (HIRO), National Center for Radiation Research in Oncology (NCRO)(海德堡放射肿瘤学研究所(HIRO),国家放射肿瘤学研究中心(NCRO)) Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital(海德堡大学医院放射肿瘤科模式分析与学习组) Faculty of Mathematics and Computer Science, University of Heidelberg(海德堡大学数学与计算机科学学院) National Center for Tumor Diseases (NCT), NCT Heidelberg, a partnership between DKFZ and the university medical center Heidelberg(国家肿瘤疾病中心(NCT),NCT海德堡,DKFZ与海德堡大学医学中心的合作机构)

AI总结 针对联邦学习中真实世界标签噪声(如轮廓不一致、结构缺失或混淆)问题,提出一个包含多样化真实噪声数据集、客户端噪声场景和针对性评估的基准套件,支持系统评估和噪声标签学习方法选择。

详情
AI中文摘要

虽然联邦学习(FL)能够在不集中敏感数据的情况下实现协作式医学图像分割,但实际部署常因跨站点的标签缺陷(如轮廓不一致、结构缺失或多余、标签混淆)而复杂化。联邦噪声标签学习(FNLL)旨在减轻这些影响,但在实践中仍未被充分利用,因为现有证据主要基于合成噪声、简化设置和有限的实际噪声评估。我们通过引入一个基准套件来弥补这一差距,该套件结合了多样化的真实世界噪声数据集、与部署相关的客户端噪声场景以及针对标签噪声的评估,以支持系统的FNLL评估和知情的方法选择。该套件将来自不同来源的精心策划的真实世界噪声医学图像分割数据集与一个全面的联邦分割框架相结合,包括各种客户端噪声场景和针对噪声的评估。所提出的套件为医学图像分割中的FNLL评估提供了现实且具有区分性的基础,并为公平基准测试、数据集特定的标签噪声表征以及未来在现实联邦设置下的方法开发建立了可重复使用的基础。代码可在 https://github.com/MIC-DKFZ/FedSegNoiseBench 获取。

英文摘要

While federated learning (FL) enables collaborative medical image segmentation without centralizing sensitive data, real-world deployment is frequently complicated by cross-site label imperfections such as contour disagreement, missing or additional structures, and confused labels. Federated noisy label learning (FNLL) aims to mitigate these effects, yet remains underused in practice as existing evidence is largely based on synthetic noise, simplified settings, and limited real-world noisy evaluation. We address this gap by introducing a benchmark suite that combines diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation to support systematic FNLL assessment and informed method selection. The suite combines curated real-world noisy medical image segmentation datasets from diverse sources with a comprehensive federated segmentation framework including various client-noise scenarios and noise-targeted evaluation. The presented suite provides a realistic and discriminative basis for FNLL evaluation in medical image segmentation and establishes a reusable foundation for fair benchmarking, dataset-specific label-noise characterization, and future method development under realistic federated settings. Code is available at https://github.com/MIC-DKFZ/FedSegNoiseBench.

2606.17020 2026-06-16 cs.CV cs.AI 新提交

FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

FusionRS: 用于双模态视觉-语言基础模型的大规模RGB-红外遥感数据集

Jiaju Han, Ben Zhang, Xuemeng Sun, Qike Zhang, Yuxian Dong, Chengyin Hu, Fengyu Zhang, Yiwei Wei, Jiujiang Guo

发表机构 * China University of Petroleum-Beijing at Karamay(中国石油大学(北京)克拉玛依校区) University of Electronic Science and Technology of China(电子科技大学) Tianjin University(天津大学)

AI总结 针对遥感视觉-语言模型缺乏红外数据的问题,提出首个大规模RGB-红外-文本数据集FusionRS,通过翻译RGB图像为红外风格并配以红外感知描述,训练双模态基础模型,提升RGB-红外对齐和双模态字幕生成性能。

详情
AI中文摘要

遥感视觉-语言模型推动了地球观测理解的发展,但现有工作大多集中于RGB图像,红外数据中的互补信息尚未得到充分探索。红外图像提供了独特的线索,包括热强度结构、物体边界和光照不变场景特征,这些可以丰富超越传统RGB观测的视觉-语言学习。然而,用于遥感视觉-语言建模的大规模RGB-红外-文本数据集仍然缺失。为填补这一空白,我们引入了FusionRS,这是首个专为遥感双模态视觉-语言学习设计的大规模RGB-红外-文本数据集。FusionRS通过将多样的公开RGB遥感图像翻译为红外风格对应物,形成对齐的RGB-IR图像对。每对图像都配有常规场景描述和红外感知描述,后者在保留语义内容的同时明确描述红外特有的视觉属性。基于FusionRS,我们训练了用于RGB-IR联合理解的双模态视觉-语言基础模型。我们首先训练CLIP风格的模型进行RGB-IR-文本对齐,然后微调生成式VLM用于双模态RGB-IR字幕生成。实验表明,与仅RGB和非红外感知训练设置相比,FusionRS改进了RGB-IR对齐、红外到文本检索和双模态字幕生成。消融研究进一步验证了红外感知描述对于加强红外-语言对齐至关重要,突显了模态特定文本监督对于更可扩展的RGB-红外遥感视觉-语言表示学习的重要性。

英文摘要

Remote sensing vision-language models have advanced Earth observation understanding, but most existing work remains centered on RGB imagery, leaving the complementary information in infrared data underexplored. Infrared images provide distinctive cues, including thermal intensity structures, object boundaries, and illumination-invariant scene features, which can enrich visual-language learning beyond conventional RGB observations. However, a large-scale RGB-infrared-text dataset for remote sensing vision-language modeling is still absent. To address this gap, we introduce FusionRS, the first large-scale RGB-infrared-text dataset designed for dual-modal vision-language learning in remote sensing. FusionRS is constructed by translating diverse public RGB remote sensing images into infrared-style counterparts, forming aligned RGB-IR image pairs. Each pair is associated with conventional scene captions and IR-aware captions that explicitly describe infrared-specific visual properties while preserving semantic content. Based on FusionRS, we train dual-modal vision-language foundation models for RGB-IR joint understanding. We first train CLIP-style models for RGB-IR-text alignment, and then fine-tune generative VLMs for dual-modal RGB-IR captioning. Experiments show that FusionRS improves RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only and non-IR-aware training settings. Ablation studies further verify that IR-aware captions are crucial for strengthening infrared-language alignment, highlighting the importance of modality-specific textual supervision for more scalable RGB-infrared remote sensing vision-language representation learning.

2606.15048 2026-06-16 cs.LG cs.CV 交叉投稿

Temporal Difference Learning for Diffusion Models

扩散模型的时间差分学习

Qizhen Ying, Yangchen Pan, Victor Adrian Prisacariu, Junfeng Wen

AI总结 提出时间差分(TD)目标函数,通过将扩散过程视为马尔可夫奖励过程并利用强化学习中的策略评估,强制去噪轨迹上的跨时间一致性,显著提升少步采样下的生成质量。

Comments 15 pages, 4 figures. Accepted at ICML 2026

详情
AI中文摘要

扩散模型通常使用专注于单个时间步(或相邻对)的局部去噪目标的损失函数进行训练,这并不强制去噪轨迹上预测之间的一致性。这种跨时间一致性的缺乏会降低性能,尤其是对于少步采样器。我们引入了一个时间差分(TD)目标,惩罚模型沿去噪路径的多步进展的不一致性。通过将扩散过程重新表述为马尔可夫奖励过程,并将去噪视为强化学习中的策略评估问题,我们推导出一个统一的TD方法,适用于离散和连续时间扩散公式。我们进一步提出了一种基于样本的加权方法,稳定训练。实验表明,使用我们的TD训练可以显著提高由FID衡量的样本质量,当采样步数较少时优势更强,突显了其在低计算预算场景下的实用价值。我们进行了消融研究以证明我们的设计选择,包括成对损失加权、正则化权重和单步跨度。总体而言,我们的TD方法可以作为一种通用的即插即用模块,强制跨时间一致性并提高不同扩散生成模型的生成质量。

英文摘要

Diffusion models are typically trained with objectives that focus on local denoising targets at individual time steps (or adjacent pairs), which do not enforce consistency between predictions along the denoising trajectory. This lack of cross-time consistency can degrade performance, especially for few-step samplers. We introduce a temporal difference (TD) objective that penalizes inconsistency of the model's multi-step progress along the denoising path. By reformulating the diffusion process as a Markov reward process and casting denoising as a policy evaluation problem in reinforcement learning, we derive a unified TD approach that applies to both discrete- and continuous-time diffusion formulations. We further propose a principled sample-based reweighting method that stabilizes training. Empirically, we show that using our TD training can significantly improve sample quality measured by FID, with stronger advantages when the number of sampling steps is small, highlighting its practical utility under low-computation-budget scenarios. We provide ablation studies to justify our design choices, including pairwise loss reweighting, regularization weight, and one-step stride. Overall, our TD approach can be a general drop-in that enforces cross-time consistency and improves generation quality across different diffusion generative models.
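For orientation, the textbook TD(0) policy-evaluation error for a Markov reward process is shown below. It is the standard reinforcement-learning form only: the abstract does not give the paper's states, rewards, discount, or reweighting, so the symbols $s_t$, $r_{t+1}$, $\gamma$, and $V_\theta$ are generic placeholders, not the authors' objective.

```latex
\delta_t = r_{t+1} + \gamma\, V_\theta(s_{t+1}) - V_\theta(s_t),
\qquad
\mathcal{L}_{\mathrm{TD}}(\theta) = \mathbb{E}\left[\delta_t^{2}\right]
```

Here the bootstrapped target $r_{t+1} + \gamma V_\theta(s_{t+1})$ is usually held fixed when differentiating, so minimizing the loss pulls the estimate at one step toward the estimate at the next step along the trajectory.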

2606.15615 2026-06-16 cs.LG cs.CV 交叉投稿

MoECa: Aligning Feature Reuse with Expert Decomposition in Diffusion Transformers

MoECa: 在扩散变换器中对齐特征复用与专家分解

Maoliang Li, Haojing Chen, Jiayu Chen, Zihao Zheng, Xinhao Sun, Hailong Zou, Xiang Chen

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) School of Software Engineering, University of Electronic Science and Technology of China(电子科技大学软件工程学院)

AI总结 针对DiT-MoE中跨时间步的冗余计算,提出基于专家分支级别的细粒度缓存框架MoECa,实现分支级特征复用,并引入专家感知自适应控制和同步缓存更新,在多个模型上取得高达2.83倍加速且质量损失极小。

Comments under review

详情
AI中文摘要

基于混合专家模型的扩散变换器(DiT-MoE)通过稀疏激活提升了模型容量,但扩散推理仍然受限于跨时间步的冗余计算。现有的缓存方法主要在token级别操作,这在DiT-MoE中变得次优,因为每个token更新内部被分解为多个路由专家分支。我们的分析表明,DiT-MoE中的跨时间步冗余在专家分支级别比在整个token级别更易于表征。基于这一观察,我们提出MoECa,一种细粒度的缓存框架,跨时间步执行分支级特征复用。MoECa进一步引入了专家感知的自适应控制和MoE与注意力路径之间的同步缓存更新,以维持稳定的中间状态。在多个DiT-MoE模型上的实验表明,MoECa在速度-质量权衡上始终优于先前的缓存方法,实现了高达2.83倍的推理加速且质量退化极小。

英文摘要

Diffusion Transformers with Mixture-of-Experts (DiT-MoE) improve model capacity under sparse activation, but diffusion inference is still bottlenecked by redundant computation across timesteps. Existing caching methods mainly operate at the token level, which becomes suboptimal in DiT-MoE because each token update is internally decomposed into multiple routed expert branches. Our analysis shows that cross-timestep redundancy in DiT-MoE is better characterized at the expert-branch level than at the whole-token level. Based on this observation, we propose MoECa, a fine-grained caching framework that performs branch-level feature reuse across timesteps. MoECa further introduces expert-aware adaptive control and synchronized cache updates across MoE and attention paths to maintain stable intermediate states. Experiments on multiple DiT-MoE models show that MoECa consistently achieves a better speed-quality trade-off than prior caching methods, with up to 2.83× inference speedup and minimal quality degradation.
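The idea of reusing features per expert branch rather than per token can be sketched as a small cache keyed by expert. This is a toy illustration under stated assumptions: the class `BranchCache`, the max-absolute-drift test, and the tolerance `tol` are hypothetical and do not reproduce MoECa's adaptive control or its synchronized MoE/attention updates.

```python
class BranchCache:
    """Hypothetical per-expert cache: reuse a branch output while its input
    stays within `tol` of the input that produced the cached value."""

    def __init__(self, tol):
        self.tol = tol
        self.store = {}  # expert_id -> (input, output)
        self.hits = 0

    def run(self, expert_id, x, expert_fn):
        cached = self.store.get(expert_id)
        if cached is not None:
            prev_x, prev_y = cached
            drift = max(abs(a - b) for a, b in zip(x, prev_x))
            if drift <= self.tol:
                self.hits += 1
                return prev_y  # reuse the branch output across timesteps
        y = expert_fn(x)  # recompute and refresh the cache entry
        self.store[expert_id] = (list(x), y)
        return y


def double(x):
    """Stand-in for an expert feed-forward branch."""
    return [2.0 * v for v in x]


cache = BranchCache(tol=0.05)
out_t0 = cache.run(0, [1.0, 2.0], double)   # first call: computed
out_t1 = cache.run(0, [1.01, 2.0], double)  # small drift: reused
out_t2 = cache.run(0, [1.5, 2.0], double)   # large drift: recomputed
```

Keeping one entry per expert means a branch whose input barely changes between timesteps can be skipped even when other branches of the same token must be recomputed.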

2606.16075 2026-06-16 cs.LG cs.CV 交叉投稿

AME: A Multi-Type Contributor Attribution Framework in Generative AI Markets

AME:生成式AI市场中的多类型贡献者归属框架

Yang Shi, Songwen Pei, Yang Gao, Bingxue Zhang

发表机构 * University of Shanghai for Science and Technology(上海理工大学) Fudan University(复旦大学)

AI总结 针对生成式AI中多阶段协作的价值分配问题,提出AME框架,整合异构数据贡献评估、数据权利映射和可信执行,实现与人类判断一致的低成本价值分配。

详情
AI中文摘要

生成式AI通过异构贡献者(包括训练数据、基础模型、微调行为和提示)之间的多阶段协作实现价值创造。然而,如何公平分配数据价值仍未得到充分探索。本文将多阶段生成式AI价值分配定义为一个新的研究问题,并识别出三个核心挑战:异构数据贡献评估、数据权利映射和可信执行。我们提出AME(归属-映射-执行)框架,这是一个统一框架,将数据贡献评估、数据权利映射和可信执行整合到单个工作流中。实验结果表明,AME框架实现了与人类参考判断更一致的数据价值分配结果,同时保持低成本的可信执行。我们的工作为生成式AI数据市场中的价值评估和收益分配提供了初步基础。

英文摘要

Generative AI enables value creation through multi-stage collaboration among heterogeneous contributors, including training data, base models, fine-tuning behaviors, and prompts. However, how to fairly allocate the data value remains largely unexplored. This paper formulates multi-stage generative AI value allocation as a new research problem and identifies three core challenges: heterogeneous data contribution valuation, data rights mapping, and trustworthy execution. We propose the AME (Attribution-Mapping-Execution) framework, a unified framework that integrates data contribution valuation, data rights mapping, and trustworthy execution into a single workflow. Experimental results demonstrate that the AME framework achieves data value allocation outcomes more consistent with human reference judgments while maintaining low-cost trustworthy execution. Our work provides an initial foundation for value assessment and revenue allocation in generative AI data markets.

2505.10496 2026-06-16 cs.CV 版本更新

CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs

CheXGenBench:合成胸片保真度、隐私和实用性的统一基准

Raman Dutt, Pedro Sanchez, Yongchen Yao, Steven McDonagh, Sotirios A. Tsaftaris, Timothy Hospedales

发表机构 * University of Edinburgh(爱丁堡大学) Samsung AI Center, Cambridge(剑桥三星AI中心)

AI总结 提出CheXGenBench,首个统一评估框架,同时衡量合成胸片生成模型的保真度、隐私风险和下游实用性,涵盖11种前沿T2I模型,揭示当前模型在长尾分布、隐私风险和下游多模态任务中的局限。

Comments Published in Transactions on Machine Learning Research (06/2026)

详情
Journal ref
Transactions on Machine Learning Research (2026)
AI中文摘要

结构化基准测试推动了真实世界图像的文字条件生成,但合成放射图像生成尚无此类基准。尽管这是一个高度活跃的研究领域,现有研究仍采用不一致的评估协议,缺乏对三个最关键标准(生成保真度、隐私风险和下游实用性)的统一评估。为解决这些局限,我们引入CheXGenBench,这是首个用于合成胸片生成的统一评估框架,可同时评估前沿文本到图像(T2I)生成模型的保真度、隐私风险和下游实用性。我们的评估协议包含20多项定量指标,覆盖11种领先的T2I架构,并支持即插即用地集成新模型。通过严格且公平的评估协议,我们建立了所有维度的全面基线最新技术水平(SoTA)性能,以指导未来研究。此外,我们的结果揭示了当前生成模型的若干局限:首先,即使是SoTA模型也难以处理长尾医学分布;其次,无论保真度质量如何,模型都存在高隐私风险;第三,尽管合成数据已有利于下游分类,但对下游多模态任务的实用性有限。基于这些结果,我们提出了具体的研究方向以推动该领域发展。代码见 https://github.com/Raman1121/CheXGenBench 。

英文摘要

Structured benchmarks have advanced text-conditional image generation for real-world imagery; however, no such benchmark exists for synthetic radiograph generation. Although this is a highly active area of research, existing studies continue to adopt inconsistent evaluation protocols and lack a unified assessment of the three most critical criteria: generative fidelity, privacy risk, and downstream utility. To address these limitations, we introduce CheXGenBench, the first unified evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and downstream utility across frontier text-to-image (T2I) generative models. Our evaluation protocol, comprising over 20 quantitative metrics, covers 11 leading T2I architectures with plug-and-play integration for newer models. Through a rigorous and fair evaluation protocol, we establish comprehensive baseline state-of-the-art (SoTA) performances across all dimensions to guide future research. Furthermore, our results uncover several limitations of current generative models: first, even SoTA models struggle with long-tailed medical distributions; second, models pose high privacy risks regardless of fidelity quality; and third, while synthetic data already benefits downstream classification, it is of limited utility for downstream multimodal tasks. Drawing from these results, we propose concrete research directions to advance the field. The code is available at https://github.com/Raman1121/CheXGenBench.

2508.07797 2026-06-16 cs.CV 版本更新

Power Battery Detection

动力电池检测

Xiaoqi Zhao, Peiqian Cao, Chenyang Yu, Zonglei Feng, Lihe Zhang, Hanqi Liu, Jiaming Zuo, Youwei Pang, Jinsong Ouyang, Weisi Lin, Georges El Fakhri, Huchuan Lu, Xiaofeng Liu

发表机构 * Yale University, USA(耶鲁大学,美国) Dalian University of Technology, China(大连理工大学,中国) Volkswagen Automotive Co., Ltd(大众汽车有限公司) X3000 Inspection Co., Ltd(X3000检测有限公司) Nanyang Technological University, Singapore(南洋理工大学,新加坡)

AI总结 针对动力电池X射线图像中极板端点定位任务,提出首个大规模基准PBD5K和点级分割模型MDCNeXt,通过多维度结构线索与状态空间模块提升检测精度。

Comments Accepted by International Journal of Computer Vision (IJCV). Code: https://github.com/NTU-AI4X/X-ray-PBD

详情
AI中文摘要

动力电池是电动汽车的关键部件,其内部结构缺陷可能带来严重安全风险。我们对一项新任务——动力电池检测(PBD)进行了全面研究,该任务旨在从工业X射线图像中定位阴极和阳极板的密集端点,用于质量检测。人工检测效率低且易出错,而传统视觉算法难以处理密集排列的极板、低对比度、尺度变化和成像伪影。为解决这一问题并推动对该有意义任务的关注,我们提出了PBD5K,这是该任务的第一个大规模基准,包含来自九种电池类型的5000张X射线图像,具有细粒度标注和八种真实世界视觉干扰。为支持可扩展且一致的标注,我们开发了一种智能标注流程,结合了图像过滤、模型辅助预标注、交叉验证和分层质量评估。我们将PBD表述为一个点级分割问题,并提出了MDCNeXt,该模型旨在提取和整合来自极板本身的多维结构线索,包括点、线和计数信息。为改善极板间的区分并抑制视觉干扰,MDCNeXt引入了两个状态空间模块。第一个是提示过滤模块,学习由任务特定提示引导的对比关系。第二个是密度感知重排序模块,在高极板密度区域细化分割。此外,我们提出了一种距离自适应掩码生成策略,在阳极和阴极位置的不同空间分布下提供鲁棒的监督。源代码和数据集将在 https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD (PBD5K)公开。

英文摘要

Power batteries are essential components in electric vehicles, where internal structural defects can pose serious safety risks. We conduct a comprehensive study on a new task, power battery detection (PBD), which aims to localize the dense endpoints of cathode and anode plates from industrial X-ray images for quality inspection. Manual inspection is inefficient and error-prone, while traditional vision algorithms struggle with densely packed plates, low contrast, scale variation, and imaging artifacts. To address this issue and draw more attention to this meaningful task, we present PBD5K, the first large-scale benchmark for this task, consisting of 5,000 X-ray images from nine battery types with fine-grained annotations and eight types of real-world visual interference. To support scalable and consistent labeling, we develop an intelligent annotation pipeline that combines image filtering, model-assisted pre-labeling, cross-verification, and layered quality evaluation. We formulate PBD as a point-level segmentation problem and propose MDCNeXt, a model designed to extract and integrate multi-dimensional structure clues including point, line, and count information from the plate itself. To improve discrimination between plates and suppress visual interference, MDCNeXt incorporates two state space modules. The first is a prompt-filtered module that learns contrastive relationships guided by task-specific prompts. The second is a density-aware reordering module that refines segmentation in regions with high plate density. In addition, we propose a distance-adaptive mask generation strategy to provide robust supervision under varying spatial distributions of anode and cathode positions. The source code and datasets will be publicly available at https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD (PBD5K).

2512.00885 2026-06-16 cs.CV 版本更新

HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

HanDyVQA:面向细粒度手-物交互动态的视频问答基准

Masatoshi Tateno, Gido Kato, Hirokatsu Kataoka, Yoichi Sato, Takuma Yagi

发表机构 * Institute of Industrial Science, The University of Tokyo(东京大学工业科学研究所) National Institute of Advanced Industrial Science and Technology (AIST)(国家先进工业科学与技术研究院) Waseda University(早稻田大学) Visual Geometry Group, University of Oxford(牛津大学视觉几何组)

AI总结 提出HanDyVQA基准,通过六类问题(11.1K QA对)和10.3K分割掩码,全面评估视频模型对手-物交互中操作与效果的细粒度时空推理能力,发现最佳模型Gemini-2.5-Pro仅73%准确率(人类97%)。

Comments CVPR 2026, Project page: https://masatate.github.io/HanDyVQA-project-page/

详情
AI中文摘要

手-物交互(HOI)本质上涉及动态过程,其中人的操作会在物体上产生不同的时空效果。然而,现有的语义HOI基准要么关注操作,要么关注效果,但都停留在粗粒度层面,缺乏捕捉HOI中潜在动态的细粒度时空推理。我们引入了HanDyVQA,一个细粒度视频问答基准,全面覆盖HOI的操作和效果两个方面。HanDyVQA包含六种互补的问题类型(动作、过程、物体、位置、状态变化和物体部件),总计11.1K个多项选择问答对。收集的问答对识别操作风格、手/物体运动和部件级状态变化。HanDyVQA还包括10.3K个用于物体和物体部件问题的分割掩码,从而能够评估视频物体分割中的物体/部件级推理。我们在基准上评估了最新的视频基础模型,发现即使表现最好的模型Gemini-2.5-Pro也仅达到73%的平均准确率,远低于人类表现(97%)。进一步分析揭示了在空间关系、运动和部件级几何理解方面仍存在的挑战。我们还发现,将显式的HOI相关线索整合到视觉特征中可以提高性能,为开发具有更深层HOI动态理解的未来模型提供了见解。

英文摘要

Hand-object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects. However, existing semantic HOI benchmarks focus either on manipulation or on the resulting effects at a coarse level, lacking fine-grained spatio-temporal reasoning to capture the underlying dynamics in HOI. We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. The collected QA pairs involve recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy, which is far from human performance (97%). Further analysis shows the remaining challenges in spatial relationship, motion, and part-level geometric understanding. We also found that integrating explicit HOI-related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.

2512.01095 2026-06-16 cs.CV cs.AI cs.LG 版本更新

CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions

CycliST:用于循环状态转换推理的视频语言模型基准

Simon Kohaut, Daniel Ochs, Shun Zhang, Benedict Flade, Julian Eggert, Kristian Kersting, Devendra Singh Dhami

发表机构 * Artificial Intelligence and Machine Learning Lab, TU Darmstadt(达姆施塔特工业大学人工智能与机器学习实验室) Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA)(Konrad Zuse学习与智能系统卓越学院(ELIZA)) Honda Research Institute Europe GmbH, Offenbach, Germany(本田欧洲研究院,奥芬巴赫,德国) Uncertainty in Artificial Intelligence Group, TU Eindhoven(埃因霍温理工大学人工智能不确定性小组) Hessian Center for AI (hessian.AI)(黑森人工智能中心(hessian.AI)) Center for Cognitive Science(认知科学中心) German Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI))

AI总结 提出CycliST基准,通过合成视频评估视频语言模型对循环状态转换的文本推理能力,揭示现有模型在检测循环模式、时间理解和定量分析方面的局限。

Comments Published in the Journal of Data-centric Machine Learning Research (DMLR); https://openreview.net/forum?id=l03g53HUL2

详情
Journal ref
Journal of Data-centric Machine Learning Research, 2026
AI中文摘要

我们提出了CycliST,这是一个新颖的基准数据集,旨在评估视频语言模型(VLM)在循环状态转换上的文本推理能力。CycliST通过生成合成的、结构丰富的视频序列来捕捉现实世界过程的基本方面,这些视频序列具有物体运动和视觉属性的周期性模式。CycliST采用分层评估系统,通过改变循环物体的数量、场景杂乱程度和光照条件逐步增加难度,挑战最先进模型的时空认知能力。我们使用当前最先进的VLM(包括开源和专有模型)进行了大量实验,揭示了它们在泛化到循环动力学(如线性和轨道运动)以及视觉属性(如颜色和尺度)随时间变化方面的局限性。我们的结果表明,当前的VLM难以可靠地检测和利用循环模式,缺乏时间理解的概念,并且无法从场景中提取定量信息(如运动物体的数量),突显了需要解决的重要技术差距。更具体地说,我们发现没有单一模型在性能上始终领先:大小和架构与结果的相关性不强,且没有模型在所有任务上同样成功。通过提供有针对性的挑战和全面的评估框架,CycliST为超越当前最先进水平的视觉推理模型在理解周期性模式方面铺平了道路。

英文摘要

We present CycliST, a novel benchmark dataset designed to evaluate Video Language Models (VLMs) on their ability to reason textually over cyclical state transitions. CycliST captures fundamental aspects of real-world processes by generating synthetic, richly structured video sequences featuring periodic patterns in object motion and visual attributes. CycliST employs a tiered evaluation system that progressively increases difficulty through variations in the number of cyclic objects, scene clutter, and lighting conditions, challenging state-of-the-art models on their spatio-temporal cognition. We conduct extensive experiments with current state-of-the-art VLMs, both open-source and proprietary, and reveal their limitations in generalizing to cyclical dynamics such as linear and orbital motion, as well as time-dependent changes in visual attributes like color and scale. Our results demonstrate that present-day VLMs struggle to reliably detect and exploit cyclic patterns, lack a notion of temporal understanding, and are unable to extract quantitative insights from scenes, such as the number of objects in motion, highlighting a significant technical gap that needs to be addressed. More specifically, we find no single model consistently leads in performance: neither size nor architecture correlates strongly with outcomes, and no model succeeds equally well across all tasks. By providing a targeted challenge and a comprehensive evaluation framework, CycliST paves the way for visual reasoning models that surpass the state-of-the-art in understanding periodic patterns.

2512.07925 2026-06-16 cs.CV cs.AI 版本更新

Near-Real-Time Conflict-Related Fire Detection in Sudan Using Unsupervised Deep Learning

苏丹冲突相关火灾的近实时检测:基于无监督深度学习

Kuldip Singh Atwal, Dieter Pfoser, Daniel Rothbart

发表机构 * George Mason University(乔治·梅森大学)

AI总结 提出轻量级VAE模型结合Planet Labs 4波段影像,在24-30小时内无监督检测苏丹冲突火灾区域,优于余弦距离、CVA和IR-MAD方法。

详情
Journal ref
Science of Remote Sensing, Volume 13, 2026, 100446, ISSN 2666-0172
AI中文摘要

苏丹持续的武装冲突凸显了快速监测冲突相关火灾影响区域的必要性。深度学习和高频卫星影像的最新进展使得能够近实时评估战区活跃火灾和烧伤疤痕。本研究提出了一种近实时监测方法,使用轻量级变分自编码器(VAE)模型,结合空间分辨率3米的4波段Planet Labs影像。我们证明,在有利观测条件下,利用可获取的商业卫星数据,这些受影响区域可在约24至30小时内被检测到。为此,我们改编了一个最初为10波段影像设计的VAE模型,使其有效处理高分辨率4波段输入。模型以无监督方式训练,学习名义地表状态的紧凑潜在表示,并通过量化时间配对潜在嵌入之间的变化来识别燃烧特征。性能在苏丹的五个案例研究中评估,并与余弦距离、CVA和IR-MAD在精确率、召回率、F1分数以及时间配对影像块之间的精确率-召回率曲线下面积(AUPRC)上进行比较。结果表明,所提方法始终优于其他方法,在高度不平衡的火灾检测场景中实现了更高的召回率和F1分数,同时保持了可行的精确率。使用8波段影像和时间序列影像的实验相比单一4波段输入仅带来边际性能提升,突显了所提轻量级方法在可扩展的近实时冲突监测中的有效性。

英文摘要

Ongoing armed conflict in Sudan highlights the need for rapid monitoring of conflict-related fire-affected areas. Recent advances in deep learning and high-frequency satellite imagery enable near-real-time assessment of active fires and burn scars in war zones. This study presents a near-real-time monitoring approach using a lightweight Variational Auto-Encoder (VAE)-based model integrated with 4-band Planet Labs imagery at 3 m spatial resolution. We demonstrate that these impacted regions can be detected within approximately 24 to 30 hours under favorable observational conditions using accessible, commercially available satellite data. To achieve this, we adapt a VAE-based model, originally designed for 10-band imagery, to operate effectively on high-resolution 4-band inputs. The model is trained in an unsupervised manner to learn compact latent representations of nominal land-surface conditions and identify burn signatures by quantifying changes between temporally paired latent embeddings. Performance is evaluated across five case studies in Sudan and compared against cosine distance, CVA, and IR-MAD using precision, recall, F1-score, and the area under the precision-recall curve (AUPRC) computed between temporally paired image tiles. Results show that the proposed approach consistently outperforms the other methods, achieving higher recall and F1-scores while maintaining viable precision in highly imbalanced fire-detection scenarios. Experiments with 8-band imagery and temporal image sequences yield only marginal performance gains over single 4-band inputs, underscoring the effectiveness of the proposed lightweight approach for scalable, near-real-time conflict monitoring.
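Scoring change between temporally paired embeddings can be illustrated with cosine distance, which the abstract names as one of the comparison methods. The sketch below is a generic illustration: the function names, the toy 2-D embeddings, and the threshold of 0.2 are assumptions, and it does not implement the paper's VAE or its change score.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity: near 0 for parallel vectors, 1 for orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def flag_changed_tiles(before, after, threshold):
    """Return indices of tiles whose embedding moved more than `threshold`."""
    return [
        i
        for i, (u, v) in enumerate(zip(before, after))
        if cosine_distance(u, v) > threshold
    ]


# Toy per-tile embeddings at two dates (illustrative values only).
before = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
after = [[1.0, 0.1], [1.0, 0.0], [2.0, 2.0]]
changed = flag_changed_tiles(before, after, threshold=0.2)
```

Only the second tile is flagged: the first moved slightly and the third changed in magnitude but not direction, which cosine distance ignores.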

2601.16713 2026-06-16 cs.CV 版本更新

A Human-in-the-Loop Label Error Detection Framework Applied to Arabic-Script HTR Datasets

一种人在回路中的标签错误检测框架应用于阿拉伯文字手写文本识别数据集

Sana Al-azzawi, Elisa Barney, Marcus Liwicki

发表机构 * University of St. Thomas(圣托马斯大学)

AI总结 提出CER-HV两阶段框架,结合字符错误率检测和人工验证,在阿拉伯文字HTR数据集中高效识别标签错误,提升识别性能。

详情
AI中文摘要

尽管近期取得了进展,阿拉伯文字的手写文本识别(HTR)仍落后于拉丁文字HTR。部分问题在于数据集质量。为帮助缩小这一差距,我们提出了一个用于检测标签错误的两阶段框架(CER-HV)。第一阶段(CER)是基于字符错误率的噪声检测器,构建在卷积循环神经网络(CRNN)架构上。第二阶段(HV)是人在回路中(HITL)验证第一阶段检测到的噪声样本。将CER-HV框架应用于多个阿拉伯文字数据集,可以识别出带有标签错误的样本,包括转录、分割、方向和非文本内容错误,这些错误会显著影响HTR性能。这些错误被框架的第一阶段以高达90%(前50)的精度识别。我们还表明,我们的CRNN在六个评估数据集中的五个上达到了最先进的性能,在KHATT(阿拉伯语)上达到8.46%的字符错误率(CER),在PHTI(普什图语)上达到8.22%,在Ajami上达到10.59%,在Muharaf(阿拉伯语)上达到10.11%,所有这些均未进行任何数据清洗。我们在PHTD(波斯语)数据集上建立了11.3% CER的新基线。应用CER-HV在数据集清洗和重新训练后,评估CER最多提高了1.8个百分点。尽管我们的实验专注于阿拉伯文字语言的文档,但该框架是通用的,可以应用于其他文本识别数据集。

英文摘要

Despite recent advances, Handwritten Text Recognition (HTR) for Arabic-script languages still lags behind Latin-script HTR. Part of the problem is dataset quality. To help close this gap, we propose a two-stage framework (CER-HV) for detecting label errors. Stage 1 (CER) is a Character-Error-Rate-based noise detector built on a Convolutional Recurrent Neural Network (CRNN) architecture. Stage 2 (HV) is the Human-In-The-Loop (HITL) Verification of noisy samples detected by the first stage. Applying the CER-HV framework to multiple Arabic-script datasets can identify samples with label errors, including transcription, segmentation, orientation, and non-text content errors, that can markedly affect HTR performance. These errors were identified by the first stage of the framework with up to 90% (top-50) precision. We also show that our CRNN achieves state-of-the-art performance across five of the six evaluated datasets, reaching 8.46% Character Error Rate (CER) on KHATT (Arabic), 8.22% on PHTI (Pashto), 10.59% on Ajami, and 10.11% on Muharaf (Arabic), all without any data cleaning. We establish a new baseline of 11.3% CER on the PHTD (Persian) dataset. Applying CER-HV improves evaluation CER by up to 1.8 percentage points after dataset cleaning and retraining. Although our experiments focus on documents written in an Arabic-script language, the framework is general and can be applied to other text recognition datasets.
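The first stage ranks samples by how far a recognizer's output is from the stored label. A minimal sketch of that ranking, assuming CER is the Levenshtein distance divided by the label length and that the highest-scoring samples are sent for human review; the helper names and the transliterated toy samples are illustrative, not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def top_k_suspects(samples, k):
    """Rank (label, prediction) pairs by CER, highest first, and return
    the indices of the k most suspicious samples for human verification."""
    scored = [(cer(label, pred), idx) for idx, (label, pred) in enumerate(samples)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scored[:k]]


# (stored label, model prediction) pairs; transliterated toy strings.
samples = [("kitab", "kitab"), ("salam", "xyz"), ("salam", "salm")]
suspects = top_k_suspects(samples, k=2)
```

A high CER does not prove the label is wrong, since the recognizer may simply have failed, which is why the second stage keeps a human in the loop.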

2602.04525 2026-06-16 cs.CV cs.AI 版本更新

SLUM-i: Semi-supervised Learning for Urban Mapping of Informal Settlements and Data Quality Benchmarking

SLUM-i: 非正规住区城市制图的半监督学习与数据质量基准测试

Muhammad Taha Mukhtar, Syed Musa Ali Kazmi, Khola Naseem, Muhammad Ali Chattha, Andreas Dengel, Sheraz Ahmed, Muhammad Naseer Bajwa, Muhammad Imran Malik

发表机构 * School of Electrical Engineering and Computer Science, National University of Sciences and Technology (NUST)(电气工程与计算机科学学院,国立科学与技术大学(NUST)) Smart Data & Knowledge Services, German Research Center for Artificial Intelligence (DFKI)(智能数据与知识服务,德国人工智能研究中心(DFKI))

AI总结 针对非正规住区制图中标注稀缺和数据质量挑战,提出半监督分割框架,集成类别自适应阈值和DINOv2过滤机制,在跨三大洲七城市实验中mIoU提升最高5.9个百分点。

Comments 10 pages, 8 figures, 5 tables

详情
AI中文摘要

快速的城市扩张推动了低收入和中等收入国家主要城市非正规住区的增长,巴基斯坦的拉合尔和卡拉奇以及印度的孟买就是突出的例子。然而,这些住区的大规模制图不仅受到标注稀缺的严重限制,还受到固有数据质量挑战的制约,特别是正式与非正式结构之间高度的光谱模糊性和显著的标注噪声。我们通过引入一个从头构建的拉合尔基准数据集,以及从经过验证的行政边界导出的卡拉奇和孟买配套数据集来解决这一问题,这些数据集总计约900平方公里的城市区域。该集合还补充了来自撒哈拉以南非洲和拉丁美洲先前文献中的四个城市,并为每个城市提供了全面的数据质量评估。我们还提出了一个半监督分割框架,旨在缓解标准半监督学习流程中固有的类别不平衡和分布不匹配问题。我们的方法集成了类别自适应阈值机制,该机制动态调整置信度阈值以防止少数类抑制,以及基于DINOv2的未标记池过滤器,该过滤器在训练前移除分布外的图块以减少协变量偏移。跨越三大洲七个城市、重复五个随机种子的广泛实验表明,与最先进的半监督基线相比,mIoU最高提升5.9个百分点,且两个组件均与架构无关,不增加推理开销。

英文摘要

Rapid urban expansion has fueled the growth of informal settlements in major cities of low- and middle-income countries, with Lahore and Karachi in Pakistan and Mumbai in India serving as prominent examples. However, large-scale mapping of these settlements is severely constrained not only by the scarcity of annotations but also by inherent data quality challenges, specifically high spectral ambiguity between formal and informal structures and significant annotation noise. We address this by introducing a benchmark dataset for Lahore, constructed from scratch, along with companion datasets for Karachi and Mumbai, which were derived from verified administrative boundaries, totaling approximately 900 $\text{km}^2$ of urban area. This collection is supplemented by four cities from prior literature across Sub-Saharan Africa and Latin America, with comprehensive data quality assessments provided for each city. We also propose a semi-supervised segmentation framework designed to mitigate the class imbalance and distribution mismatch inherent in standard semi-supervised learning pipelines. Our method integrates a Class-Aware Adaptive Thresholding mechanism that dynamically adjusts confidence thresholds to prevent minority class suppression, and a DINOv2-based unlabeled pool filter that removes out-of-distribution tiles prior to training to reduce covariate shift. Extensive experiments across seven cities spanning three continents, repeated over five random seeds, demonstrate gains of up to +5.9 pp mIoU over state-of-the-art semi-supervised baselines, with both components being architecture-agnostic and adding no inference overhead.
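
The abstract does not spell out how the class-aware thresholds are computed. One common way to realize the idea is to scale a base confidence threshold by how confident the model currently is on each class, so that a minority class with lower confidence is not filtered out entirely. The sketch below assumes that scheme; the scaling rule, `base`, `floor`, and both function names are hypothetical and not taken from the paper.

```python
def class_thresholds(mean_conf, base=0.95, floor=0.5):
    """Scale a base threshold per class by relative mean confidence.

    `mean_conf` maps class name -> running mean of predicted confidence.
    Classes the model is less sure about receive a lower threshold, so
    their pseudo-labels are not all discarded (assumed rule, not the paper's).
    """
    top = max(mean_conf.values())
    return {c: max(floor, base * m / top) for c, m in mean_conf.items()}


def select_pseudo_labels(predictions, thresholds):
    """Keep (index, class) pairs whose confidence meets the class threshold.

    `predictions` is a list of (predicted_class, confidence) tuples.
    """
    return [(i, c) for i, (c, conf) in enumerate(predictions)
            if conf >= thresholds[c]]
```

With a single global threshold of 0.95, an "informal" prediction at 0.7 confidence would be dropped; under the per-class rule above it can survive while a "formal" prediction at 0.9 is still rejected.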

2602.09764 2026-06-16 cs.CV cs.IR cs.LG 版本更新

Self-Supervised Learning as Discrete Communication

自监督学习作为离散通信

Kawtar Zaher, Ilyass Moummad, Olivier Buisson, Alexis Joly

AI总结 将视觉自监督学习视为教师与学生网络间的离散通信过程,通过固定容量二进制信道传输语义信息,使用逐元素二元交叉熵目标强制离散一致性,并引入编码率正则化促进结构化表示,在图像分类、检索和密集预测任务上优于连续对齐基线。

详情
AI中文摘要

大多数自监督学习(SSL)方法通过对齐同一输入的不同视图来学习连续视觉表示,对信息如何在表示维度间进行结构化提供的控制有限。在这项工作中,我们将视觉自监督学习视为教师网络与学生网络之间的离散通信过程,其中语义信息通过固定容量的二进制信道传输。学生网络不是对齐连续特征,而是预测教师网络产生的多标签二进制消息。通过逐元素二元交叉熵目标强制离散一致性,同时编码率正则化项鼓励有效利用受限信道,促进结构化表示。我们进一步表明,周期性地重新初始化投影头通过鼓励嵌入在多个离散编码中保持可预测性来增强这种效果。大量实验表明,在图像分类、检索和密集视觉预测任务中,以及通过自监督适应在领域转移下,该方法持续优于连续对齐基线。除了骨干表示,我们分析了学习到的二进制编码,并表明它们形成了一种紧凑且信息丰富的离散语言,捕获了可跨类别复用的语义因子。

英文摘要

Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.
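
The discrete-agreement objective can be illustrated with a few lines of Python: the teacher emits a binary message and the student is penalized with element-wise binary cross-entropy for each bit it mispredicts. This is a minimal sketch under stated assumptions: the teacher is binarized by thresholding logits at zero, and the coding-rate regularizer and projection-head reinitialization are omitted. All function names are illustrative.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binarize(teacher_logits):
    """Teacher message: bit is 1 where the logit is positive (assumed rule)."""
    return [1 if z > 0 else 0 for z in teacher_logits]


def bce_agreement(student_logits, message, eps=1e-12):
    """Mean element-wise binary cross-entropy between student and message."""
    total = 0.0
    for z, bit in zip(student_logits, message):
        p = min(max(sigmoid(z), eps), 1.0 - eps)
        total -= bit * math.log(p) + (1 - bit) * math.log(1.0 - p)
    return total / len(message)
```

Because every bit is treated as an independent binary label, the channel capacity is fixed by the message length rather than by the dimensionality of a continuous embedding.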

2602.15720 2026-06-16 cs.CV 版本更新

ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT

ToaSt: 面向高效ViT的令牌通道选择与结构化剪枝

Hyunchan Moon, Cheonjun Park, Steven L. Waslander

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出ToaSt框架,对多头自注意力模块进行耦合头结构化剪枝,对前馈网络采用训练无关的令牌通道选择方法,在多种ViT模型上实现精度与效率的优越平衡。

Comments Accepted at ICML 2026

详情
AI中文摘要

视觉Transformer(ViT)在各种视觉任务中取得了显著成功,但其部署常因高昂的计算成本而受阻。尽管结构化权重剪枝和令牌压缩已成为有前景的解决方案,但前者再训练时间长,后者存在使优化复杂化的层间依赖。我们提出ToaSt,一个解耦框架,将专门策略应用于不同的ViT组件。我们对多头自注意力模块应用耦合头结构化剪枝,利用注意力操作特性增强鲁棒性。对于前馈网络(占FLOPs的60%以上),我们引入令牌通道选择(TCS),一种在推理时过滤冗余噪声通道的免训练方法。在包括DeiT、ViT-MAE和Swin Transformer在内的九个不同模型上的广泛评估表明,ToaSt在精度和效率之间实现了优越的权衡,持续优于现有基线。在ViT-MAE-Huge上,ToaSt以39.4%的FLOPs缩减达到88.52%的准确率(+1.64个百分点)。ToaSt还能有效迁移到多种下游任务(COCO检测、ADE20K分割、CIFAR-100分类),在COCO上达到52.2 vs 51.9 mAP。代码:github.com/SHANNonLab-HUFS/ToaSt

英文摘要

Vision Transformers (ViTs) have achieved remarkable success across various vision tasks, yet their deployment is often hindered by prohibitive computational costs. While structured weight pruning and token compression have emerged as promising solutions, the former suffers from prolonged retraining and the latter from inter-layer dependencies that complicate optimization. We propose ToaSt, a decoupled framework applying specialized strategies to distinct ViT components. We apply coupled head-wise structured pruning to Multi-Head Self-Attention modules, leveraging attention operation characteristics to enhance robustness. For Feed-Forward Networks (over 60% of FLOPs), we introduce Token Channel Selection (TCS), a training-free method that filters redundant noise channels at inference time. Extensive evaluations across nine diverse models, including DeiT, ViT-MAE, and Swin Transformer, demonstrate that ToaSt achieves superior trade-offs between accuracy and efficiency, consistently outperforming existing baselines. On ViT-MAE-Huge, ToaSt achieves 88.52% accuracy (+1.64 percentage points) with a 39.4% FLOPs reduction. ToaSt also transfers effectively to diverse downstream tasks (COCO detection, ADE20K segmentation, CIFAR-100 classification), achieving 52.2 versus 51.9 mAP on COCO. Code: github.com/SHANNonLab-HUFS/ToaSt

2603.05876 2026-06-16 cs.CV cs.RO 版本更新

Systematic Evaluation of Novel View Synthesis for Video Place Recognition

面向视频地点识别的合成新视角系统性评估

Muhammad Zawad Mahmud, Samiha Islam, Damian Lyons

AI总结 系统评估合成新视角对视频地点识别的影响,发现少量合成视角可提升识别性能,且视角变化幅度不如添加数量和图像类型重要。

Comments Submitted to IEEE IROS 2026

详情
AI中文摘要

合成新视角的生成在多个方面对机器人导航具有积极影响。在基于图像的导航中,由地面机器人拍摄的场景生成的俯视新视角可用于引导空中机器人到达该位置。在视频地点识别中,可以添加地面位置的空中新视角,使无人机能够识别地面机器人所见的地点,同样,俯视视角也可用于生成地面新视角。本文使用五个公开视频地点识别图像数据库和七种典型图像相似度方法,对视频地点识别中的合成新视角进行了系统性评估。我们表明,对于少量合成添加,新视角能提升视频地点识别的统计指标。我们发现,对于较大添加量,视角变化幅度不如添加视角数量和数据集中的图像类型重要。

英文摘要

The generation of synthetic novel views has the potential to positively impact robot navigation in several ways. In image-based navigation, a novel overhead view generated from a scene taken by a ground robot could be used to guide an aerial robot to that location. In Video Place Recognition (VPR), novel views of ground locations from the air can be added that enable a UAV to identify places seen by the ground robot, and similarly, overhead views can be used to generate novel ground views. This paper presents a systematic evaluation of synthetic novel views in VPR using five public VPR image databases and seven typical image similarity methods. We show that for small synthetic additions, novel views improve VPR recognition statistics. We find that for larger additions, the magnitude of viewpoint change is less important than the number of views added and the type of imagery in the dataset.

2604.12813 2026-06-16 cs.CV cs.MM 版本更新

DPC-VQA: Decoupling Quality Perception and Residual Calibration for Video Quality Assessment

DPC-VQA: 解耦质量感知与残差校准用于视频质量评估

Xinyue Li, Shubo Xu, Zhichao Zhang, Zhaolin Cai, Yitong Chen, Guangtao Zhai

发表机构 * Shanghai Jiao Tong University(上海交通大学) Baidu Inc.(百度公司) Xinjiang University(新疆大学)

AI总结 提出DPC-VQA框架,通过冻结多模态大模型提供感知先验,轻量校准分支预测残差修正,实现低训练成本下视频质量评估,在UGC和AIGC基准上取得竞争性能。

详情
AI中文摘要

近年来,多模态大语言模型(MLLMs)在视频质量评估(VQA)任务上展现出有前景的性能。然而,由于需要大规模重新训练和昂贵的平均意见分(MOS)标注,将其适应新场景仍然成本高昂。本文认为,预训练的MLLM已经为VQA提供了有用的感知先验,主要挑战在于如何高效地将该先验校准到目标MOS空间。基于这一见解,我们提出DPC-VQA,一种用于视频质量评估的解耦感知与校准框架。具体而言,DPC-VQA使用冻结的MLLM提供基础质量估计和感知先验,并采用轻量校准分支预测残差修正以实现目标场景适应。该设计避免了昂贵的端到端重新训练,同时以更低的训练和数据成本保持可靠性能。在用户生成内容(UGC)和AI生成内容(AIGC)基准上的大量实验表明,DPC-VQA在相比代表性基线方法取得竞争性能的同时,使用的可训练参数不到传统基于MLLM的VQA方法的2%,且仅需20%的MOS标签即可保持有效性。代码将在发表后公开。

英文摘要

Recent multimodal large language models (MLLMs) have shown promising performance on video quality assessment (VQA) tasks. However, adapting them to new scenarios remains expensive due to large-scale retraining and costly mean opinion score (MOS) annotations. In this paper, we argue that a pretrained MLLM already provides a useful perceptual prior for VQA, and that the main challenge is to efficiently calibrate this prior to the target MOS space. Based on this insight, we propose DPC-VQA, a decoupled perception-and-calibration framework for video quality assessment. Specifically, DPC-VQA uses a frozen MLLM to provide a base quality estimate and perceptual prior, and employs a lightweight calibration branch to predict a residual correction for target-scenario adaptation. This design avoids costly end-to-end retraining while maintaining reliable performance with lower training and data costs. Extensive experiments on both user-generated content (UGC) and AI-generated content (AIGC) benchmarks show that DPC-VQA achieves competitive performance against representative baselines, while using less than 2% of the trainable parameters of conventional MLLM-based VQA methods and remaining effective with only 20% of MOS labels. The code will be released upon publication.
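
The decoupling can be summarized as `final = base + residual`, where `base` comes from the frozen MLLM and only the residual predictor is trained. The toy sketch below stands in a one-feature least-squares fit for the calibration branch, purely to show the data flow; the real branch is a learned network, and `fit_residual`, `calibrated_score`, and the single scalar feature are assumptions for illustration.

```python
def fit_residual(features, base_scores, mos):
    """Least-squares fit of residual = a * feature + b, residual = MOS - base.

    The base scores are treated as fixed outputs of a frozen model, so
    only the two parameters (a, b) are estimated.
    """
    residuals = [m - s for m, s in zip(mos, base_scores)]
    n = len(features)
    mean_x = sum(features) / n
    mean_r = sum(residuals) / n
    var_x = sum((x - mean_x) ** 2 for x in features)
    if var_x == 0:
        return 0.0, mean_r  # constant feature: fall back to a bias term
    cov = sum((x - mean_x) * (r - mean_r) for x, r in zip(features, residuals))
    a = cov / var_x
    return a, mean_r - a * mean_x


def calibrated_score(base, feature, params):
    """Final prediction: frozen base estimate plus the predicted residual."""
    a, b = params
    return base + a * feature + b
```

Since the trainable part only has to explain the gap between the base estimate and the target MOS, it can be far smaller than the backbone, which is the intuition behind the reported parameter and label savings.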

2604.20623 2026-06-16 cs.CV cs.AI 版本更新

RSRCC: A Remote Sensing Regional Change Comprehension Benchmark Constructed via Retrieval-Augmented Best-of-N Ranking

RSRCC:通过检索增强的最佳N排序构建的遥感区域变化理解基准

Roie Kazoom, Yotam Gigi, George Leifman, Tomer Shekel, Genady Beryozkin

发表机构 * Google Research(谷歌研究)

AI总结 提出RSRCC基准,包含12.6万个细粒度遥感变化问答对,采用层次化半监督流程结合最佳N排序解决歧义,实现局部语义变化推理。

详情
AI中文摘要

传统变化检测识别变化发生的位置,但不以自然语言解释发生了什么变化。现有的遥感变化描述数据集通常描述整体图像级别的差异,而细粒度的局部语义推理尚未充分探索。为弥补这一差距,我们提出RSRCC,一个新的遥感变化问答基准,包含12.6万个问题,分为8.7万训练、1.71万验证和2.2万测试实例。与以往数据集不同,RSRCC围绕局部、变化特定的问题构建,需要推理特定的语义变化。据我们所知,这是第一个明确设计用于此类细粒度推理监督的遥感变化问答基准。为构建RSRCC,我们引入了一个层次化半监督策展流程,将最佳N排序作为关键的最后歧义解决阶段。首先,从语义分割掩码中提取候选变化区域,然后使用图像-文本嵌入模型进行初步筛选,最后通过检索增强的视觉语言策展和最佳N排序进行验证。该过程能够在保留语义有意义变化的同时,对噪声和模糊候选进行可扩展过滤。数据集可在 https://huggingface.co/datasets/google/RSRCC 获取。

英文摘要

Traditional change detection identifies where changes occur, but does not explain what changed in natural language. Existing remote sensing change captioning datasets typically describe overall image-level differences, leaving fine-grained localized semantic reasoning largely unexplored. To close this gap, we present RSRCC, a new benchmark for remote sensing change question-answering containing 126k questions, split into 87k training, 17.1k validation, and 22k test instances. Unlike prior datasets, RSRCC is built around localized, change-specific questions that require reasoning about a particular semantic change. To the best of our knowledge, this is the first remote sensing change question-answering benchmark designed explicitly for such fine-grained reasoning-based supervision. To construct RSRCC, we introduce a hierarchical semi-supervised curation pipeline that uses Best-of-N ranking as a critical final ambiguity-resolution stage. First, candidate change regions are extracted from semantic segmentation masks, then initially screened using an image-text embedding model, and finally validated through retrieval-augmented vision-language curation with Best-of-N ranking. This process enables scalable filtering of noisy and ambiguous candidates while preserving semantically meaningful changes. The dataset is available at https://huggingface.co/datasets/google/RSRCC.

2604.27128 2026-06-16 cs.CV cs.AI 版本更新

Lightweight Distillation of SAM 3 and DINOv3 for Edge-Deployable Individual-Level Livestock Monitoring and Longitudinal Visual Analytics

SAM 3 和 DINOv3 的轻量级蒸馏用于边缘可部署的个体级牲畜监测与纵向视觉分析

Haiyu Yang, Miel Hostens

发表机构 * College of Agriculture and Life Sciences, Cornell University(农业与生命科学学院,康奈尔大学)

AI总结 通过蒸馏SAM 3的感知编码器至TinyViT学生网络,并采用DINOv3的ViT-S嵌入器,实现边缘可部署的个体级牲畜监测,在Edinburgh猪数据集上以7.77倍参数减少和3.01倍显存降低达到接近教师模型的性能,支持长期视觉分析。

详情
AI中文摘要

用于个体级牲畜监测的基础模型流水线(结合开放词汇检测、可提示视频分割和自监督视觉嵌入)提高了精准畜牧业(PLF)的准确率上限,但其GPU内存预算超出了商用边缘加速器的范围。为弥补这一差距,SAM 3的4.46亿参数感知编码器(PE-ViT-L+)骨干通过三种机制被蒸馏为一个4066万参数的多尺度学生网络:基于TinyViT-21M-512的特征金字塔网络学生编码器、四项方向-尺度蒸馏损失,以及带滑动窗口会话剪枝的骨干替换推理,以限制流式GPU内存增长。DINOv3系列包括一个预蒸馏的ViT-S/16变体(2160万参数),与67.16亿参数的ViT-7B教师模型一同发布;采用ViT-S(2100万参数)变体作为每个个体的嵌入器。在Edinburgh猪数据集上,压缩流水线相对于SAM 3教师模型达到92.29% MOTA和96.15% IDF1(分别下降1.68和0.84个百分点),系统级参数减少7.77倍,峰值显存减少3.01倍(19.52GB -> 6.49GB),并在九类猪行为分类中达到97.34% top-1准确率和91.67% macro-F1。该流水线适配NVIDIA Jetson Orin NX 16GB环境,具有4.9GB余量,支持一种提议但尚未经验证的设备端嵌入池重识别机制,其每个个体每年约94MB的足迹产生纵向视觉记录,便于与疾病、跛行、繁殖和生长结果标签进行回顾性关联。

英文摘要

Foundation-model pipelines for individual-level livestock monitoring -- combining open-vocabulary detection, promptable video segmentation, and self-supervised visual embeddings -- have raised the accuracy ceiling of precision livestock farming (PLF), but their GPU memory budgets exceed the envelope of commodity edge accelerators. To close this gap, the 446M-parameter Perception Encoder (PE-ViT-L+) backbone of SAM 3 is distilled into a 40.66M-parameter multi-scale student through three mechanisms: a Feature Pyramid Network student encoder built on TinyViT-21M-512, a four-term direction-then-scale distillation loss, and backbone-substitution inference with sliding-window session pruning that bounds streaming GPU memory growth. The DINOv3 family includes a pre-distilled ViT-S/16 variant (21.6M parameters) released alongside a 6716M-parameter ViT-7B teacher; the ViT-S (21M) variant is adopted as the per-individual embedder. On the Edinburgh Pig dataset, the compressed pipeline reaches 92.29% MOTA and 96.15% IDF1 against the SAM 3 teacher (1.68- and 0.84-percentage-point losses), achieves a 7.77-fold reduction in system-level parameters and a 3.01-fold reduction in peak VRAM (19.52GB -> 6.49GB), and reaches 97.34% top-1 accuracy with 91.67% macro-F1 on nine-class pig behaviour classification. The pipeline fits inside an NVIDIA Jetson Orin NX 16GB envelope with 4.9GB of headroom, supporting a proposed -- but not yet empirically validated -- on-device embedding-pool re-identification mechanism whose per-individual footprint of approximately 94MB per animal per year produces a longitudinal visual record amenable to retrospective association with disease, lameness, reproductive, and growth outcome labels.

2605.09697 2026-06-16 cs.CV cs.LG 版本更新

Discriminative Span as a Predictor of Synthetic Data Utility via Classifier Reconstruction

判别跨度作为通过分类器重构预测合成数据效用的指标

Radhika Amar Desai, Modigari Narendra

发表机构 * School of Computer Science(计算机科学学院) Vellore Institute of Technology(韦洛尔理工学院)

AI总结 本文提出一种几何驱动的指标,通过预训练模型的嵌入空间评估合成数据效用,无需模型训练,通过测量线性分类器权重向量在变化子空间中的投影误差,判断合成数据对下游分类性能的影响。

详情
AI中文摘要

在许多现实世界计算机视觉应用中,如医学影像和工业检测,二分类任务常面临正样本严重缺乏的问题。广泛采用的解决方案是对负样本施加图像到图像转换来生成合成正样本。然而,一个根本性挑战是:如何可靠地评估此类合成数据是否能提升下游模型性能?本文提出一种几何驱动的指标,该指标可预测合成数据的效用,而无需模型训练。我们的方法在预训练基础模型的嵌入空间中操作,并通过样本之间的差异向量表示数据集。我们通过测量相对投影误差,评估线性分类器的权重向量能否在这些变化向量张成的子空间内表示。直观上,如果合成数据诱导的变化捕捉了任务相关方向,其张成的子空间即可近似分类器,投影误差较低。反之,质量差的合成数据无法张成这些方向,导致误差较高。在多个数据集和架构上,我们证明该指标与在真实负样本和合成正样本的混合数据上训练的CNN的下游分类性能有强相关性。这些发现表明,所提指标是在数据稀缺场景下评估合成数据质量的实用且信息丰富的工具。

英文摘要

In many real-world computer vision applications, including medical imaging and industrial inspection, binary classification tasks are characterized by a severe scarcity of positive samples. A widely adopted solution is to generate synthetic positive data using image-to-image transformations applied to negative samples. However, a fundamental challenge remains: how can we reliably assess whether such synthetic data will improve downstream model performance? In this work, we propose a geometry-driven metric that predicts the utility of synthetic data without requiring model training. Our approach operates in the embedding space of a pre-trained foundation model and represents the dataset through difference vectors between samples. We evaluate whether the weight vector of a linear classifier can be expressed within the subspace spanned by these variations by measuring the relative projection error. Intuitively, if the variations induced by synthetic data capture task-relevant directions, their span can approximate the classifier, resulting in low projection error. Conversely, poor synthetic data fails to span these directions, leading to higher error. Across multiple datasets and architectures, we show that this metric exhibits strong correlation with downstream classification performance of CNNs trained on mixtures of real negative and synthetic positive data. These findings suggest that the proposed metric serves as a practical and informative tool for evaluating synthetic data quality in data-scarce settings.
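
The relative projection error described above can be written as $\lVert w - P w \rVert / \lVert w \rVert$, where $w$ is the linear classifier's weight vector and $P$ projects onto the span of the difference vectors. The following pure-Python sketch computes it with Gram-Schmidt orthogonalization. How the difference vectors are paired (for example, synthetic positive minus source negative) is not stated in the abstract and is left to the caller; the function names are illustrative.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of `vectors`."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:  # skip vectors already inside the current span
            basis.append([wi / norm for wi in w])
    return basis


def relative_projection_error(weight, difference_vectors):
    """||w - P w|| / ||w||, with P the projector onto span(difference_vectors)."""
    residual = list(weight)
    for q in orthonormal_basis(difference_vectors):
        coeff = dot(residual, q)
        residual = [r - coeff * qi for r, qi in zip(residual, q)]
    return math.sqrt(dot(residual, residual)) / math.sqrt(dot(weight, weight))
```

A value near 0 means the classifier direction lies almost entirely inside the span of the variations; a value near 1 means the variations are nearly orthogonal to it.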

2605.12678 2026-06-16 cs.CV cs.CY 版本更新

No One Knows the State of the Art in Geospatial Foundation Models

没有人知道地理空间基础模型的最先进水平

Isaac Corley, Nils Lehmann, Caleb Robinson, Gabriel Tseng, Anthony Fuller, Hamed Alemohammad, Evan Shelhamer, Jennifer Marcus, Hannah Kerner

发表机构 * Taylor Geospatial(泰勒地理空间公司) Technical University of Munich(慕尼黑工业大学) Microsoft AI for Good Research Lab(微软AI for Good研究实验室) Allen Institute for AI(艾伦人工智能研究所) Vector Institute(向量研究所) Carleton University(卡尔顿大学) Clark University(克拉克大学) University of British Columbia(不列颠哥伦比亚大学) Arizona State University(亚利桑那州立大学)

AI总结 本文指出地理空间基础模型缺乏标准化评估和训练协议,提出六项具体期望以促进社区共识。

详情
AI中文摘要

地理空间基础模型(GFMs)已被提出作为灾害响应、土地覆盖制图、粮食安全监测等高风险地球观测任务的通用骨干网络。然而,已发表的相关工作并未为评审者或用户提供足够的信息来确定哪个模型适合给定任务。我们认为,没有人知道地理空间基础模型当前的最先进水平是什么。这些方法可能有用,但GFM文献在评估、训练和测试协议、发布的权重以及预训练控制方面的标准化程度不足,任何人都难以对它们进行比较或排名。在对152篇论文的审计中,我们发现同一模型、基准和协议存在46处至少10分的跨论文分歧;在126篇可提取预训练数据的论文中,有94篇使用了其他论文均未使用的配置;39%的GFM论文未发布模型权重。这种缺乏社区标准的状况是可以解决的。我们提出六项具体期望:以明确许可证发布权重、共享核心评估、标注基线是照搬还是重跑、报告方差、使用一个共享的评估工具包,以及对数据、架构和算法分别进行对照。这些差距是协调失败,而非任何单个实验室的过错;本文作者与GFM社区的许多其他成员一样,也对此负有责任。我们并非只是批评该社区,而是旨在提供具体步骤,以就如何推进GFM创新形成共识。

英文摘要

Geospatial foundation models (GFMs) have been proposed as generalizable backbones for disaster response, land-cover mapping, food-security monitoring, and other high-stakes Earth-observation tasks. Yet the published work about these models does not give reviewers or users enough information to tell which model fits a given task. We argue that nobody knows what the current state of the art is in geospatial foundation models. The methods may be useful, but the GFM literature does not standardize evaluations, training and testing protocols, released weights, or pretraining controls well enough for anyone to compare or rank them. In a 152-paper audit, we find 46 cross-paper disagreements of at least 10 points for the same model, benchmark, and protocol; 94/126 papers with extractable pretraining data use a configuration no other paper uses; and 39% of GFM papers release no model weights. This lack of community standards can be solved. We propose six concrete expectations: named-license weight release, shared core evaluations, copied-versus-rerun baseline annotations, variance reporting, one shared evaluation harness, and data-vs-architecture-vs-algorithm controls. These gaps are a coordination failure, not a fault of any individual lab; the authors of this paper, like many others in the GFM community, have contributed to them. Rather than just critiquing the community, we aim to provide concrete steps toward a shared understanding of how to innovate GFMs.

2606.03788 2026-06-16 cs.CV 版本更新

SLU-2K: A Question-Based Benchmark for Semantic Evaluation of Sign Language Translation

SLU-2K:基于问题的手语翻译语义评估基准

Zeno Testa, Antonino Furnari, Lorenzo Baraldi, Natalia Díaz-Rodríguez

发表机构 * University of Modena and Reggio Emilia(摩德纳和雷吉奥艾米利亚大学) University of Catania(卡塔尼亚大学) University of Granada(格拉纳达大学) CITIC & DaSCI Institute(CITIC与DaSCI研究所)

AI总结 提出SLU-2K基准,通过2350个视频问答对评估手语翻译的语义理解,揭示当前系统在语义正确性上的不足。

Comments Accepted at the GenSign Workshop, CVPR 2026

详情
AI中文摘要

手语翻译(SLT)通常使用表面形式指标(如BLEU和ROUGE)进行评估,这些指标奖励词汇重叠,但不直接衡量翻译是否保留了源手语序列的含义。这与将SLT集成到辅助技术中的最终目标相悖。在这项工作中,我们将重点从手语翻译(SLT)转向手语理解(SLU),特别强调语义理解。具体来说,我们根据系统从输入视频中正确恢复原始句子关键语义方面的能力来评估系统,例如发生的动作以及关于人和物体的事实。为了系统地实现这种评估,我们提出了SLU-2K,这是一个基于流行的PHOENIX-2014T和CSL-Daily数据集的2350个封闭式视频问答对的数据集。为了获得SLU-2K,我们提出并广泛评估了一个自动数据生成流水线,该流水线生成7个类别的问题,即动作、位置、数字、物体、人物、时间和天气条件。我们通过评估流行的多模态大语言模型(MLLM)和两个代表性的最先进系统MMSTL和SpaMo,展示了SLU-2K的潜力。我们的结果表明,MLLM达到了接近随机的性能,突显了当前AI系统中需要更系统地集成SLU。此外,在领域内数据上精心微调的最先进翻译系统仍然存在显著的语义差距,结果范围从56.7%到75.2%。这些发现表明,当前的SLT评估协议高估了真正的理解,未来的进展不仅应通过流畅性和n-gram重叠来衡量,还应通过语义正确性来衡量。代码、提示和基准文件可在 https://github.com/ZenoTsT/SLU-2K 获取。

英文摘要

Sign Language Translation (SLT) is typically evaluated with surface-form metrics such as BLEU and ROUGE, which reward lexical overlap but do not directly measure whether a translation preserves the meaning of the source sign sequence. This is in contrast with the final objective of integrating SLT in assistive technology. In this work, we shift the focus from Sign Language Translation (SLT) to Sign Language Understanding (SLU), with particular emphasis on semantic understanding. Specifically, we evaluate systems based on their ability to correctly recover, from the input video, key semantic aspects of the original sentence, such as actions taking place and facts about people and objects. To enable this evaluation systematically, we propose SLU-2K, a dataset of 2,350 closed-ended video question-answer pairs based on the popular PHOENIX-2014T and CSL-Daily datasets. To obtain SLU-2K, we propose and extensively evaluate an automated data generation pipeline which produces questions across 7 categories, namely actions, locations, numbers, objects, people, time, and weather conditions. We show the potential of SLU-2K by evaluating popular Multimodal Large Language Models (MLLMs) and two representative state-of-the-art systems, MMSTL and SpaMo. Our results show that MLLMs reach near-random performance, highlighting the need for a more systematic integration of SLU in current AI systems. Furthermore, state-of-the-art translation systems carefully fine-tuned on in-domain data still exhibit a substantial semantic gap, with results ranging from 56.7% to 75.2%. These findings suggest that current SLT evaluation protocols overestimate true understanding and that future progress should be measured not only by fluency and n-gram overlap, but also by semantic correctness. Code, prompts, and benchmark files are available at https://github.com/ZenoTsT/SLU-2K

2606.04184 2026-06-16 cs.CV 版本更新

GroupToM-Bench: Benchmarking Group Theory of Mind and Nonlinear Social Emergence in MLLMs

GroupToM-Bench: 多模态大语言模型中群体心智理论和非线性社会涌现的基准测试

Weidong Tang, Jierui Li, Yueling Hou, Zihan Mei, Can Zhang, Xinyan Wan, Zhiyuan Liang, Pengfei Zhou, Yang You, Wangbo Zhao

发表机构 * Xidian University(西安电子科技大学) National University of Singapore(新加坡国立大学) University of Electronic Science and Technology of China(电子科技大学) University of Science and Technology of China(中国科学技术大学)

AI总结 针对多模态大语言模型在群体心智理论推理上的不足,提出GroupToM-Bench基准,通过七级认知审计框架评估模型从微观BDI状态到宏观结果预测的因果链,揭示模型在处理社会结构和非线性集体动态上的缺陷。

Comments ACL 2026 (Main Conference)

详情
AI中文摘要

真正的通用智能不仅需要物理世界模型,还需要社会世界模型:即推断个体心理状态如何相互作用并结晶为群体层面结果的能力。尽管在个体层面的心智理论推理方面取得了显著进展,现有的多模态大语言模型在这一更广泛的任务上仍然失败。集体行为从社会张力、从众动态和结构约束中非线性地涌现,这意味着它不能通过简单地对个体意图求和来恢复。我们提出了GroupToM-Bench,第一个针对群体层面心智理论的多模态基准,围绕一个跨越微观层面BDI状态(信念、欲望、意图)、中观层面群体张力和结构约束以及宏观层面结果预测和机制归因的因果链构建。为了探测这一完整弧线,我们开发了一个七级认知审计框架。实验揭示了当前模型与人类基线之间的差距,突出了模型在处理社会结构和非线性集体动态方面的失败。

英文摘要

True general intelligence requires not only a model of the physical world but also a social world model: the capacity to infer how individual mental states interact and crystallize into group-level outcomes. Despite notable progress in individual-level Theory of Mind (ToM) reasoning, existing multimodal large language models fail at this broader task. Collective behavior emerges non-linearly from social tensions, conformity dynamics, and structural constraints, meaning it cannot be recovered by merely summing individual intentions. We present GroupToM-Bench, the first multimodal benchmark for group-level ToM, built around a causal chain spanning micro-level BDI states (belief, desire, intention), meso-level group tension and structural constraints, and macro-level outcome prediction and mechanistic attribution. To probe this full arc, we develop a seven-level cognitive audit framework. Experiments reveal a gap between current models and human baselines, highlighting a failure to process social structures and non-linear collective dynamics.

2606.07086 2026-06-16 cs.CV cs.LG 版本更新

An Adaptive Data cleaning Framework for Noisy Label Detection

自适应数据清洗框架用于噪声标签检测

Chen-Hsuan Fang, Wei-Hsinag Chen, Pin-Hsuan Yu, Jung-Hua Wang, Tsung-Wei Pan

发表机构 * Department of Electrical Eng(电子工程系) AI Research Center(人工智能研究中心)

AI总结 提出一种无需手动阈值的自适应数据清洗框架,融合局部、全局和学习动态等多重度量,通过特征空间的多度量聚类实现噪声标签检测,在CIFAR-10、MNIST和ImageNet-100上显著提升召回率和模型精度。

详情
AI中文摘要

深度神经网络(DNN)在给定大型标注数据集的计算机视觉任务中表现出色。然而,在实际应用中,标签常常因歧义、人为错误或动态环境而受到污染。过参数化的DNN在训练过程中容易记忆这些噪声标签,从而降低模型的准确性和泛化能力。现有的数据清洗和样本选择策略通常依赖于手动指定的阈值、噪声比率的先验知识或单一度量(学习动态或几何结构),这使得它们在复杂数据场景下不稳定。本文提出了一种自适应数据清洗框架,该框架整合了局部、全局和学习动态线索,用于鲁棒的噪声标签检测。通过模块化特征拼接范式,样本被映射到统一的低维特征空间。我们提供了两种实例化:一种二维度量,结合了基于类自适应KNN的局部不一致性和基于k-means的全局质心距离;另一种三维多度量,额外引入了z归一化分数。与传统的将一维高斯混合模型应用于单一标量度量的方法不同,我们的框架在特征空间上执行多度量聚类,以自适应地将样本划分为干净主导和噪声主导成分,无需手动阈值或噪声先验。在CIFAR-10、MNIST和ImageNet-100上,针对5%至40%的对称标签噪声进行的实验表明,该框架在所有设置下均实现了高召回率,包括在ImageNet-100上40%噪声时接近完美的召回率(≥98%)。后续训练在所有评估设置下均获得了精度提升,尤其是在ImageNet-100的严重污染情况下。这些发现表明,多度量整合为噪声标签检测提供了一种无阈值、实用且低调整的策略。

英文摘要

Deep neural networks (DNNs) excel in computer vision tasks given large annotated datasets. In real-world applications, however, labels are often corrupted by ambiguity, human error, or dynamic environments. Over-parameterized DNNs easily memorize these noisy labels during training, degrading model accuracy and generalization. Existing data-cleaning and sample-selection strategies often rely on manually specified thresholds, prior knowledge of the noise ratio, or a single metric (either learning dynamics or geometric structure), making them unstable in complex data regimes. This paper proposes a self-adaptive data-cleaning framework that integrates local, global, and learning dynamics cues for robust noisy-label detection. Samples are mapped into a unified low-dimensional feature space through a modular feature concatenation paradigm. We provide two instantiations: a 2D metric integrating class-adaptive KNN-based local disagreement with k-means-based global centroid distance, and a 3D multi-metric that additionally incorporates a z-normalized score. Unlike conventional 1D Gaussian Mixture Models applied to a single scalar metric, our framework performs multi-metric clustering on the feature space to adaptively partition samples into clean-dominant and noise-dominant components without requiring manual thresholds or noise priors. Experiments on CIFAR-10, MNIST, and ImageNet-100 with 5% to 40% symmetric label noise show high recall across settings, including near-perfect recall (>=98%) on ImageNet-100 at 40% noise. Subsequent training yields accuracy gains across evaluated settings, especially under severe corruption on ImageNet-100. These findings suggest that multi-metric integration provides a threshold-free, practical, and low-tuning strategy for noisy label detection.
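
Of the cues combined by the framework, the local one is the easiest to illustrate: a sample whose nearest neighbours in feature space mostly carry a different label is a candidate for label noise. The sketch below computes that disagreement score with a fixed `k`; the class-adaptive choice of `k`, the k-means centroid distance, the learning-dynamics score, and the final multi-metric clustering are not reproduced here. `knn_disagreement` is an illustrative name.

```python
def knn_disagreement(features, labels, k=3):
    """Fraction of each sample's k nearest neighbours with a different label.

    Uses squared Euclidean distance and brute-force search, which is
    fine for a demonstration but quadratic in the number of samples.
    """
    scores = []
    for i, fi in enumerate(features):
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(fi, fj)), j)
            for j, fj in enumerate(features) if j != i
        )
        neighbours = [j for _, j in dists[:k]]
        differing = sum(labels[j] != labels[i] for j in neighbours)
        scores.append(differing / len(neighbours))
    return scores
```

In a two-dimensional variant of the framework, this score would be paired with a global distance-to-centroid value, and the resulting points clustered into clean-dominant and noise-dominant groups instead of being cut at a hand-set threshold.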

2606.11381 2026-06-16 cs.CV 版本更新

From Simulation to the Real-World: An In-Field 6D Pose Dataset and Baseline for Robotic Strawberry Harvesting

从仿真到现实:面向机器人草莓采摘的实地6D位姿数据集与基线

Woojung Son, Won Suk Lee, Zijing Huang, Daeun Choi, Catia Silva, Yu She, Yan Gu

发表机构 * Department of Agricultural and Biological Engineering, University of Florida(佛罗里达大学农业与生物工程系) Department of Electrical and Computer Engineering, University of Florida(佛罗里达大学电气与计算机工程系) Edwardson School of Industrial Engineering, Purdue University(普渡大学爱德华森工业工程学院) School of Mechanical Engineering, Purdue University(普渡大学机械工程学院)

AI总结 针对机器人草莓采摘中6D位姿估计的仿真到现实差距问题,首次构建了实地草莓6D位姿真值数据集(12,040张图像),并基于NVIDIA Isaac Sim生成具有场景级真实感的合成数据集,通过基线实验量化了差距。

Comments 7 pages, 6 figures, 1 table

详情
AI中文摘要

机器人草莓采摘需要精确的6D位姿估计;然而,在实际农业田间收集6D位姿真值本身具有挑战性。现有的草莓6D位姿估计研究因此主要依赖合成数据,且往往缺乏足够的场景级真实感,其在真实农业田间条件下的性能尚未量化。在这项工作中,我们提出了据我们所知的第一个在实际农业田间收集的草莓6D位姿真值数据集(12,040张图像)。我们还引入了一个在NVIDIA Isaac Sim中渲染的合成数据集,具有场景级真实感和域随机化。尽管仿真设置有所改进,我们的实验表明,显著的仿真到现实差距仍然存在,强调了可靠评估需要真实农业田间数据。我们进一步通过跨骨干编码器的基线6D位姿估计结果量化了仿真到现实差距,作为未来工作的参考。真实世界数据集将在接收后公开。

英文摘要

Robotic strawberry harvesting requires precise 6D pose estimation; however, collecting 6D pose ground truth in real agricultural fields is inherently challenging. Existing strawberry 6D pose estimation studies have therefore relied mainly on synthetic data, often without sufficient scene-level realism, leaving their performance under real agricultural field conditions unquantified. In this work, we present, to the best of our knowledge, the first real-world 6D pose ground truth dataset of strawberries collected in actual agricultural fields (12,040 images). We also introduce a synthetic dataset rendered in NVIDIA Isaac Sim, featuring scene-level realism and domain randomization. Despite this improved simulation setup, our experiments reveal that a substantial sim-to-real gap persists, underscoring the necessity of real agricultural field data for reliable evaluation. We further quantify the sim-to-real gap through baseline 6D pose estimation results across backbone encoders, serving as a reference for future work.

2510.04127 2026-06-16 cs.IR cs.AI cs.CV cs.LG 版本更新

Projection and Quantisation: A Unifying View of Learning to Hash, from Random Projections to the RAG Era

投影与量化:学习哈希的统一视角,从随机投影到RAG时代

Sean Moran

发表机构 * Independent Researcher(独立研究者) London, United Kingdom(英国伦敦)

AI总结 提出投影-量化-组织(PQO)框架,统一理解从局部敏感哈希到深度哈希、乘积量化、图索引及向量数据库二进制嵌入的方法,并通过可复现实验揭示量化轴上的内存-质量权衡。

Comments 80 pages, 19 figures, 22 tables. Survey. Accompanying open benchmark (BitBudget): https://github.com/sjmoran/bitbudget ; live leaderboard: https://sjmoran.github.io/bitbudget/

详情
AI中文摘要

近似最近邻(ANN)搜索支撑着大规模检索,尤其是在增强大型语言模型的检索增强生成管道中,但解决该问题的方法已在不同社区中激增,以至于很少被视为一个统一领域。我们认为它们构成一个具有三个设计选择的领域,并开发了投影-量化-组织(PQO)视角,在该视角下,局部敏感哈希、学习二进制哈希、深度端到端哈希、乘积量化、基于图的索引以及现代向量数据库的二进制嵌入都是三个耦合问题的设置:投影放置在哪里,量化阈值放置在哪里,以及如何组织生成的编码。投影然后量化的解读是已有的;我们的贡献是第三个同等重要的组织阶段,证明这三个阶段从该领域的起源到深度、乘积量化、图和检索增强时代一脉相承,以及一个可复现的测量,将视角从分类方法转向预测方法。该测量得出三个发现。首先,内存节省在量化轴上:一位编码的大小是浮点数的三十二分之一,而在短候选列表上单次全精度重排序即可完全恢复未压缩的质量。其次,视角预期的权衡顺序在嵌入增长时保持不变。第三,在有监督的情况下,八字节编码的质量比其替换的两千字节浮点数提高一倍以上。我们将这些测量结果发布为BitBudget,一个带有实时排行榜的可扩展基准,将生成式检索的“语义标识符”重新解释为量化编码,并指出随着紧凑编码重回大规模检索中心,随之而来的开放问题。

英文摘要

Approximate nearest-neighbour search underpins large-scale retrieval and retrieval-augmented generation, yet its methods are studied in communities that seldom read one another. We argue that they form one field with three design choices. We develop the projection-quantisation-organisation lens: every method places its projections, places its quantisation thresholds, and organises the resulting codes for search. We test the lens with a reproducible measurement, released as the open BitBudget benchmark, and report three findings. First, the quantisation axis delivers the largest memory savings: a one-bit code with full-precision re-ranking matches uncompressed quality for six of seven embedders, with the scanned code one thirty-second of the float's size. Second, the orderings the lens anticipates, including a learned-embedding regime where binary codes overtake an inverted-file product quantiser at a matched byte budget, recur as the embedding is enlarged. Third, given class labels, an eight-byte supervised code more than doubles the retrieval quality of the two-kilobyte task-agnostic float it replaces. We also recast the semantic identifiers of generative retrieval as quantisation codes. The main contribution is a single, tested account of compact-code search, from random projections to the retrieval-augmented era.
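
The three choices in the lens map onto a few lines of code in the simplest case, sign random projections. Projection is a set of random hyperplanes, quantisation thresholds each projected value at zero to give one bit, and organisation is a plain linear Hamming scan followed by a full-precision re-rank of a shortlist. This is a minimal sketch of that classic baseline, not the BitBudget benchmark code; all names and defaults are illustrative.

```python
import random


def make_projections(dim, n_bits, seed=0):
    """Projection: n_bits random Gaussian hyperplanes in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]


def encode(vector, projections):
    """Quantisation: one bit per hyperplane, thresholded at zero."""
    code = 0
    for row in projections:
        bit = 1 if sum(r * x for r, x in zip(row, vector)) >= 0 else 0
        code = (code << 1) | bit
    return code


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


def search(query, database, projections, shortlist=10, top_k=1):
    """Organisation: Hamming scan for a shortlist, then float re-ranking."""
    q_code = encode(query, projections)
    codes = [encode(v, projections) for v in database]
    candidates = sorted(range(len(database)),
                        key=lambda i: hamming(q_code, codes[i]))[:shortlist]

    def sq_dist(i):
        return sum((a - b) ** 2 for a, b in zip(query, database[i]))

    return sorted(candidates, key=sq_dist)[:top_k]
```

Storing one bit where a 32-bit float used to be is where the one thirty-second figure comes from; the re-ranking step touches full-precision vectors only for the shortlist.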

2603.06861 2026-06-16 cs.LG cs.CV 版本更新

IGLU: The Integrated Gaussian Linear Unit Activation Function

IGLU:集成高斯线性单元激活函数

Mingi Kang, Zai Yang, Jeova Farias Sales Rocha Neto

发表机构 * Bowdoin College(鲍登学院)

AI总结 提出IGLU激活函数,基于半正态混合分布推导出闭式表达,其门控为柯西CDF,通过单一锐度参数在恒等与ReLU行为间插值,重尾特性保证非零梯度,并给出仅含ReLU操作的有理近似,在视觉和语言任务上达到或超越ReLU/GELU性能。

详情
AI中文摘要

激活函数对深度神经网络至关重要,控制着梯度流、优化稳定性和表示能力。在历史深度架构中,ReLU一直是激活函数的主要选择,而现代基于Transformer的模型越来越多地采用更平滑的替代方案,如GELU和其他自门控替代方案。尽管它们在经验上取得了成功,但这些函数之间的数学关系及其有效性背后的原理仍仅被部分理解。我们引入了IGLU,一个参数化激活函数,作为在半正态混合分布下的GELU门控的尺度混合推导得出。该推导产生了一个闭式表达式,其门控分量恰好是柯西CDF,提供了一个原则性的单参数族,通过单一锐度参数$\sigma$在类恒等和类ReLU行为之间连续插值。与GELU的高斯门控不同,IGLU的重尾柯西门控在负尾处以多项式衰减,保证所有有限输入的非零梯度,并对梯度消失具有更强的鲁棒性。我们进一步引入了IGLU-Approx,一种计算高效的IGLU有理近似,完全用ReLU操作表示,消除了超越函数求值。通过在CIFAR-10、CIFAR-100和WikiText-103上使用ResNet-20、ViT-Tiny和GPT-2 Small进行的评估,IGLU在视觉和语言数据集上相对于ReLU和GELU基线实现了具有竞争力或更优的性能,而IGLU-Approx以大幅降低的计算成本恢复了这一性能。特别地,我们表明在高度不平衡的分类数据集中,使用重尾门控带来了显著的性能提升。

英文摘要

Activation functions are fundamental to deep neural networks, governing gradient flow, optimization stability, and representational capacity. ReLU has been the dominant activation function in historic deep architectures, whereas modern transformer-based models increasingly adopt smoother alternatives such as GELU and other self-gated functions. Despite their empirical success, the mathematical relationships among these functions and the principles underlying their effectiveness remain only partially understood. We introduce IGLU, a parametric activation function derived as a scale mixture of GELU gates under a half-normal mixing distribution. This derivation yields a closed-form expression whose gating component is exactly the Cauchy CDF, providing a principled one-parameter family that continuously interpolates between identity-like and ReLU-like behavior via a single sharpness parameter $\sigma$. Unlike GELU's Gaussian gate, IGLU's heavy-tailed Cauchy gate decays polynomially in the negative tail, guaranteeing non-zero gradients for all finite inputs and offering greater robustness to vanishing gradients. We further introduce IGLU-Approx, a computationally efficient rational approximation of IGLU expressed entirely in terms of ReLU operations that eliminates transcendental function evaluation. Through evaluations on CIFAR-10, CIFAR-100, and WikiText-103 across ResNet-20, ViT-Tiny, and GPT-2 Small, IGLU achieves competitive or superior performance on both vision and language datasets against ReLU and GELU baselines, with IGLU-Approx recovering this performance at substantially reduced computational cost. In particular, we show that employing a heavy-tailed gate leads to considerable performance gains in heavily imbalanced classification datasets.
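
The gating idea can be sketched in a few lines. The abstract states only that the gate is a Cauchy CDF with a sharpness parameter; the sketch below assumes the standard form gate(x) = 1/2 + arctan(x/sigma)/pi with sigma as the Cauchy scale, which may differ from the paper's exact parameterisation. The IGLU-Approx variant is not reproduced here.

```python
import math

def cauchy_gate(x, sigma=1.0):
    """Cauchy CDF with location 0 and scale sigma (assumed gate form)."""
    return 0.5 + math.atan(x / sigma) / math.pi

def iglu(x, sigma=1.0):
    """Self-gated activation: input times the Cauchy gate."""
    return x * cauchy_gate(x, sigma)

def iglu_grad(x, sigma=1.0):
    """Derivative of x * gate(x): gate(x) + x * gate'(x)."""
    gate_prime = sigma / (math.pi * (sigma ** 2 + x ** 2))
    return cauchy_gate(x, sigma) + x * gate_prime

def gelu(x):
    """Exact GELU, x * Phi(x), for comparison."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_grad(x):
    """Derivative of x * Phi(x): Phi(x) + x * phi(x)."""
    cdf = 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
    pdf = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return cdf + x * pdf
```

Under this assumed form, a very small sigma turns the gate into a step and the activation approaches ReLU, while a very large sigma flattens the gate to 1/2 and the activation becomes the linear map x/2. Far in the negative tail the gradient shrinks only polynomially, whereas the GELU gradient shrinks like a Gaussian tail.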

2605.00873 2026-06-16 cs.MM cs.AI cs.CV 版本更新

BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios

BRITE:面向不可信场景的可靠可解释文本到视频评估基准

Advait Tilak, Jiwon Choi, Nazifa Mouli, Wei Le

AI总结 提出BRITE基准,通过人工参与协议统一不可信提示、细粒度音视频一致性评估和可解释QA评估,揭示现有模型在对象-动作绑定和音视频同步上的显著缺陷。

详情
AI中文摘要

逼真文本到视频(T2V)生成的快速发展带来了对最新评估方法的迫切需求。现有基准大多忽略了不可信场景,并且不衡量音视频对齐。我们引入BRITE,这是第一个将(1)不可信提示、(2)音视频一致性的细粒度评估以及(3)基于QA的可解释评估统一为全面T2V基准的框架。与完全自动化的基于多模态LLM的流水线(容易产生幻觉和提示歧义)不同,BRITE通过严格的人工参与协议保证基准创建的可靠性。评估五个最先进模型(Sora 2、Veo 3.1、Runway Gen4.5、Pixverse V5.5和Qwen3Max),我们揭示了一个关键性能差距:虽然模型在静态对象组合方面表现出色,但在对象-动作绑定和音视频同步方面表现出显著退化。我们的框架为社区提供了一个可靠、可解释的基准和评估框架,能够检测和定位下一代T2V模型的局限性,特别是对于流形外提示。

英文摘要

The rapid advancement of photorealistic Text-to-Video (T2V) generation creates an urgent need for up-to-date evaluation methods. Existing benchmarks largely overlook implausible scenarios and do not measure audio-visual alignment. We introduce BRITE, the first framework that unifies (1) implausible prompting, (2) fine-grained assessment of audio-visual consistency, and (3) QA-based interpretable evaluation into a comprehensive T2V benchmark. Unlike fully automated Multimodal LLM-based pipelines, which are prone to hallucination and prompt ambiguity, BRITE guarantees reliability through a rigorous human-in-the-loop protocol for benchmark creation. Evaluating five state-of-the-art models (Sora 2, Veo 3.1, Runway Gen4.5, Pixverse V5.5, and Qwen3Max), we reveal a critical performance gap: while models excel at static object composition, they exhibit significant degradation in object-action binding and audio-visual synchronization. Our framework offers the community a reliable, interpretable benchmark and evaluation framework that can detect and locate limitations in the next generation of T2V models, especially for off-manifold prompts.

13. 其他/综合视觉 34 篇

2606.14764 2026-06-16 cs.CV cs.DM 新提交

Avoiding Exponential Blow-Up in Distributive Lattice Submodular Minimization

避免分配格次模最小化中的指数爆炸

Ishant Shanu

发表机构 * Ishant Shanu

AI总结 针对分配格上次模函数最小化中因布尔格变换导致的空间指数膨胀问题,提出仅在分配格内工作的通用框架,显著提升运行效率。

详情
AI中文摘要

近年来,次模函数最小化引起了广泛关注。它在计算机视觉和机器学习领域具有高度适用性。通常,这些应用需要处理定义在分配格上的次模函数。当前处理该问题的最佳方法是使用一种变换,将次模函数外推到相应的布尔格上。由于工作空间的扩大,这使得优化系统效率过低。定量地,扩展后的空间具有额外的指数级(关于集合大小)元素。我们提出了一个仅在分配格内工作的通用框架来处理分配格。我们的框架允许使用已建立的用于布尔格的次模函数最小化算法。在我们的实验中,我们展示了在处理分配格时,相比于传统方法在运行时间上的巨大改进。

英文摘要

Submodular function minimization has gained a lot of interest in recent years. It is highly applicable in computer vision and machine learning. Such applications often require working with submodular functions defined on a distributive lattice. The current best approach is a transformation that extrapolates the submodular function to the corresponding Boolean lattice. This makes the optimization system too inefficient because the working space is enlarged: quantitatively, the expanded space has an additional exponential (in the set size) number of elements. We propose a generic framework for handling distributive lattices that works only within the distributive lattice. Our framework allows one to use already established submodular function minimization algorithms for Boolean lattices. In our experiments, we show a large improvement in running time over traditional methods for handling distributive lattices.
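
The size gap between a distributive lattice and its enclosing Boolean lattice can be made concrete. By Birkhoff's representation theorem, a finite distributive lattice is isomorphic to the family of down-closed subsets of a poset, while the Boolean lattice contains every subset. The sketch below counts both for a chain and for an antichain; the brute-force enumeration and the helper name are illustrative only and are unrelated to the paper's framework.

```python
def lower_sets(n, relations):
    """Return all down-closed subsets of {0..n-1} as bitmasks.

    relations is a list of pairs (a, b) meaning a <= b in the poset:
    whenever b is in the set, a must be in it too.
    Brute force over all 2**n subsets, for illustration only.
    """
    def has(mask, element):
        return (mask >> element) & 1 == 1

    result = []
    for mask in range(1 << n):
        if all(has(mask, a) or not has(mask, b) for a, b in relations):
            result.append(mask)
    return result

n = 10
chain = [(i, i + 1) for i in range(n - 1)]  # 0 <= 1 <= ... <= 9
antichain = []                              # no order relations at all

lattice_size_chain = len(lower_sets(n, chain))
lattice_size_antichain = len(lower_sets(n, antichain))
boolean_size = 2 ** n
```

For the chain the distributive lattice has n + 1 elements against 2^n in the Boolean lattice, which is the kind of blow-up the abstract describes; for the antichain the two coincide.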

2606.14963 2026-06-16 cs.CV cs.AI 新提交

Multi-Modal Attention for Automated Disaster Damage Assessment Using Remote Sensing Imagery and Deep Learning

基于遥感影像和深度学习的多模态注意力自动灾害损伤评估

Tewodros Syum Gebre, Jagrati Talreja, Leila Hashemi-Beni

发表机构 * Built Environment Department, College of Science and Technology, North Carolina A&T State University(北卡罗来纳农工州立大学科技学院建筑环境系) United Nations University Institute for Water, Environment and Health(联合国大学水、环境与健康研究所)

AI总结 提出一种多模态注意力机制融合双时相遥感影像的深度学习框架,实现建筑物损伤四分类(无/轻微/严重/毁坏),准确率达94.90%。

Comments This paper has been accepted for publication in ISPRS Congress 2026 and the 47th Canadian Symposium on Remote Sensing (CSRS 2026) Annals

详情
AI中文摘要

及时准确的灾害损伤评估对于有效的应急响应、资源分配和恢复至关重要。传统方法通常依赖人工检查或稀疏数据,往往速度慢且易出错。本文介绍了一种利用遥感影像和深度学习自动化建筑损伤分类的新框架。使用灾前和灾后卫星影像,我们的模型将建筑物分为四个损伤等级:无损伤、轻微损伤、严重损伤和毁坏。核心创新是一种多模态注意力机制,融合双时相特征以显式检测和评估结构变化。我们采用轻量级ConvNeXT-Tiny骨干网络,确保高效处理而不牺牲性能。主要贡献包括:(1)用于多模态数据融合的交叉注意力模块,(2)针对大规模数据集的优化预处理流程,以及(3)鲁棒的数据增强技术。在大规模灾害数据集上的实验表明,总体分类准确率达到94.90%。该模型能有效区分损伤类别,并对不完整数据保持鲁棒性。本系统显著提高了评估速度和准确性,有助于应急响应人员优先安排干预措施。本研究通过将多时相影像与深度学习相结合,推进了自动化灾害损伤检测,为实时响应提供了可扩展的解决方案。

英文摘要

Timely and accurate disaster damage assessment is crucial for effective emergency response, resource allocation, and recovery. Traditional methods, which often rely on manual inspections or sparse data, are typically slow and error-prone. This paper introduces a novel framework leveraging remote sensing imagery and deep learning to automate building damage classification. Using pre- and post-disaster satellite imagery, our model categorizes buildings into four damage levels: no damage, minor damage, major damage, and destroyed. The core innovation is a multi-modal attention mechanism that fuses bi-temporal features to explicitly detect and assess structural changes. We employ a lightweight ConvNeXT-Tiny backbone to ensure efficient processing without compromising performance. Key contributions include: (1) a cross-attention module for multi-modal data fusion, (2) an optimized preprocessing pipeline for large-scale datasets, and (3) robust data augmentation techniques. Experiments on a large-scale disaster dataset demonstrate an overall classification accuracy of 94.90%. The model effectively discriminates between damage categories and remains resilient to incomplete data. This system significantly improves assessment speed and accuracy, aiding emergency responders in prioritizing interventions. This work advances automated disaster damage detection by integrating multi-temporal imagery with deep learning, offering a scalable solution for real-time response.

2606.15198 2026-06-16 cs.CV cs.HC 新提交

City landscape in sight: A crowdsourced framework for unlocking urban-scale window view perceptions from real estate imagery

城市景观在望:一种从房地产图像解锁城市尺度窗景感知的众包框架

Chucai Peng, Sijie Yang, Ang Liu, Yang Xiang, Zhixiang Zhou, Filip Biljecki

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出一种利用房地产平台真实窗景图像(WVI)进行大规模感知映射的方法,通过混合神经网络模型预测六维感知并分析空间分布,发现楼层高度和窗景组成(如天空、树木比例)对感知有非线性影响。

详情
AI中文摘要

通过住宅窗户看到的城市景观影响生活质量,然而城市尺度上实际窗景的感知仍研究不足。本研究提出一种大规模感知映射方法,使用从中国武汉房地产平台收集的12,334张真实住宅窗景图像(WVI),这是一种罕见探索的城市景观图像形式,相比以往研究中常见的渲染或模拟窗景具有优势。通过非沉浸式虚拟现实平台,我们基于499张WVI从304名参与者收集了27,477对六维感知(如生动性)的比较。训练了一个混合神经网络模型来预测所有众包WVI的人类感知并绘制其空间分布。结果显示,整个城市存在显著的空间自相关,具有明显的热点和冷点。楼层高度强烈影响人类感知:较高楼层提供更受欢迎和更广阔的窗景,而较低楼层为居民提供安静和生动的视野。推理模型进一步表明,窗景组成至关重要:高比例的天空、树木和低层建筑增强人们的偏好和生动性感知,而高层建筑的高比例增加单调和压抑感。重要的是,这些影响是非线性的:某些元素的过度存在会改变其对人类感知的影响。这项工作推进了城市尺度上居民视觉体验的理解,并为以人为本的城市规划和房地产优化窗户视觉景观提供了基于证据的指导。

英文摘要

City landscapes viewed through home windows influence quality of life, yet perceptions of actual window views at the urban scale remain understudied. This study presents an approach for large-scale mapping of perceptions using 12,334 window view images (WVIs) collected from actual residential properties listed on real estate platforms in Wuhan, China, representing a rarely explored form of urban view imagery that offers advantages over the rendered or simulated window views commonly examined in previous studies. Through a non-immersive virtual reality platform, we collected 27,477 pairwise comparisons across six perceptual dimensions (e.g., Vivid) from 304 participants based on 499 WVIs. A hybrid neural network model was trained to predict human perceptions of all crowdsourced WVIs and map their spatial distribution. Results reveal significant spatial autocorrelation with distinct hot and cold spots across the whole city. Floor level strongly influences human perceptions: while higher floors offer more preferred and extensive window views, lower-floor windows provide residents with quiet and vivid views. An inference model further shows that window view composition matters considerably: high ratios of sky, trees, and low-rise buildings enhance people's preferences and perceptions of vividness, whereas high ratios of high-rise buildings increase perceptions of monotony and oppression. Importantly, these effects are non-linear: the excessive presence of certain elements can alter their impact on human perception. This work advances urban-scale understanding of residents' visual experiences and provides evidence-based guidance for human-centric urban planning and real estate to optimise visual landscapes from windows.

2606.15351 2026-06-16 cs.CV 新提交

Facial Affect Analysis for Service-Oriented Systems: Advances, Challenges, and Future Visions

面向服务系统的面部情感分析:进展、挑战与未来愿景

Spyridon Georgiou, Aggelos Psiris, Thomas Lagkas, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

发表机构 * International Hellenic University(国际希腊大学) Democritus University of Thrace(德谟克利特大学) Kingston University London(伦敦金斯顿大学) University of Western Macedonia(西马其顿大学)

AI总结 本文从系统工程角度综述面部情感分析在服务导向软件生态系统中的进展,强调可组合性和可靠性需求,并指出基准性能提升不足以满足服务化部署,需兼顾鲁棒性、公平性、隐私性等运行时保障。

详情
AI中文摘要

面部情感分析(FAA)正从独立的识别任务演变为服务导向软件生态系统(SoSE)中可复用的感知能力。本文保留了FAA的方法论核心,同时通过可组合和可靠服务的系统工程需求重新诠释近期进展。我们回顾了静态和动态表情分析、动作单元和微表情建模以及现代CNN、Transformer、图神经网络和混合架构的代表性进展,然后根据这些进展在边缘、云和混合服务管道中的操作适配性进行解读。综合强调决定可部署性的SoSE关注点:面向不确定性输出的服务契约、延迟和可用性包络、生命周期监控与重新校准、治理感知集成以及跨独立演化组件的互操作性。我们的分析表明,仅凭基准性能提升不足以满足SoSE就绪性;分布偏移下的鲁棒性、干预稳定性、公平性、隐私姿态和运行时保证同样关键。最后,我们提出了将FAA视为具有显式接口、可测量质量属性和可问责生命周期管理的操作服务组件的路线图。

英文摘要

Facial Affect Analysis (FAA) is evolving from a stand-alone recognition task into a reusable perception capability for Service-Oriented Software Ecosystems (SoSE). This paper preserves the FAA methodological core while reframing recent advances through systems-engineering requirements for composable and dependable services. We review representative progress in static and dynamic expression analysis, action-unit and micro-expression modeling, and modern CNN, Transformer, graph, and hybrid architectures, then interpret these advances by their operational fit in edge, cloud, and hybrid service pipelines. The synthesis emphasizes SoSE concerns that determine deployability: service contracts for uncertainty-aware outputs, latency and availability envelopes, lifecycle monitoring and recalibration, governance-aware integration, and interoperability across independently evolving components. Our analysis shows that benchmark gains alone are insufficient for SoSE readiness; robustness under shift, intervention stability, fairness, privacy posture, and runtime guarantees are equally critical. We conclude with a roadmap for treating FAA as an operational service component with explicit interfaces, measurable quality attributes, and accountable lifecycle management.

2606.16271 2026-06-16 cs.CV cs.LG 新提交

Contrastive Learning for Seismic Horizon Tracking with Domain-Specific Priors

基于领域先验的对比学习用于地震层位追踪

Alexandre Thouvenot, Lionel Boillot, Vincent Gripon

发表机构 * IMT Atlantique, LAB-STICC, UMR CNRS 6285(IMT Atlantique, LAB-STICC, CNRS 6285联合实验室) TotalEnergies, OneTech(道达尔能源公司, OneTech)

AI总结 提出自监督融合信号与纹理的方法,利用信号导出的局部层位对应作为领域先验训练纹理深度学习模型,通过对比学习保持层位身份,实现跨不连续面的层位追踪。

Comments 5 pages, 5 figures. Submitted to the IEEE GRSL for possible publication

详情
AI中文摘要

无监督3D地震层位追踪面临一个关键限制:基于信号的传播器提供精确的迹级对齐,但在断层附近常失败,而纹理驱动的深度模型对不连续性更鲁棒,但通常以标记数据需求和降低迹级精度为代价。我们提出了一种自监督融合两种范式的方法,其中信号导出的局部层位对应作为领域先验来训练基于纹理的深度学习模型。具体来说,我们从反射体斜率估计可靠的迹间流,并将其用于形成对比目标中的正对,同时将训练限制在高置信度邻域,可选地使用断层掩码增强。目标不是推断不连续性附近的模糊对应,而是跨不连续性保持层位身份。结果,网络学习到体素级嵌入,保持局部信号连续性,同时通过相似性搜索实现跨不连续性的层位传播。在公共F3数据集和含断层合成数据集上的实验实现了比无监督基线更低的平均绝对误差(MAE),并且与使用单个标记切片的半监督方法性能相当。

英文摘要

Unsupervised 3D seismic horizon tracking faces a key limitation: signal-based propagators provide accurate trace-level alignment but often fail near faults, whereas texture-driven deep models are more robust to discontinuities, typically at the cost of labeled data requirements and reduced trace-level precision. We propose a self-supervised fusion of both paradigms in which signal-derived local horizon correspondences act as domain-specific priors to train a texture-based deep learning model. Specifically, we estimate reliable trace-to-trace flows from reflector slopes and use them to form positive pairs in a contrastive objective, while restricting training to high-confidence neighborhoods, optionally augmented with a fault mask. The objective is not to infer ambiguous correspondences close to discontinuities, but to preserve horizon identity across them. As a result, the network learns voxel-wise embeddings that preserve local signal continuity while enabling horizon propagation beyond discontinuities through similarity search. Experiments on the public F3 dataset and a faulted synthetic dataset achieve lower mean absolute error (MAE) than unsupervised baselines and competitive performance against a semi-supervised method using a single labeled slice.
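
The pairing step can be sketched as follows. The abstract mentions a contrastive objective without naming it, so this sketch substitutes the common InfoNCE loss with cosine similarity; the slope-following helper, the temperature, and the toy embeddings are all hypothetical.

```python
import math

def cosine(u, v):
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den

def positive_depth(depth, slope):
    """Follow the local reflector slope (samples per trace) to the next trace."""
    return depth + round(slope)

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss: negative log-softmax of the positive among all candidates."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, neg) / temperature for neg in negatives]
    peak = max(logits)  # subtract the max for numerical stability
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]

# Toy embeddings: the anchor matches the sample its slope points to.
anchor = [1.0, 0.0]
same_horizon = [1.0, 0.0]
other_horizons = [[0.0, 1.0], [-1.0, 0.0]]

loss_aligned = info_nce(anchor, same_horizon, other_horizons)
loss_misaligned = info_nce(anchor, [0.0, 1.0], [[1.0, 0.0], [-1.0, 0.0]])
```

Minimising such a loss pulls embeddings of samples linked by the slope-derived flow together and pushes other samples away, so a later similarity search can follow a horizon where the signal-based flow itself is unreliable.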

2606.16837 2026-06-16 cs.CV cs.AI cs.SD 新提交

Robust Spoofed Speech Detection via Temporal Pyramid Modeling

基于时间金字塔建模的鲁棒语音伪造检测

Mahtab Masoudi Nezhad, Nima Karimian

发表机构 * Lane Department of Computer Science and Electrical Engineering, West Virginia University(西弗吉尼亚大学莱恩计算机科学与电气工程系) Bellini College of Artificial Intelligence, Cybersecurity and Computing, University of South Florida(南佛罗里达大学贝利尼人工智能、网络安全与计算学院)

AI总结 提出时间金字塔适配器,通过多尺度时间卷积捕获局部伪影和全局韵律异常,结合自监督XLS-R表示,在多个数据集上显著优于基线模型。

详情
AI中文摘要

伪造语音检测日益受到逼真合成、语音转换和重放攻击的挑战,跨数据集泛化仍然是主要限制。本文提出时间金字塔适配器,利用具有不同感受野的并行时间卷积来捕获多尺度伪造线索,从局部伪影到全局韵律异常。我们还集成了自监督XLS-R表示,并结合前端适配器,包括Mel、Sinc和用于多尺度时间建模的时间金字塔设计。所提出的模型在多个基准上进行了评估,包括ASVspoof 2017、ASVspoof 2021 (DF/LA)、PartialSpoof、DiffSSD和多语言HQ-MPSD数据集。实验结果表明,时间金字塔模型在PartialSpoof数据库上获得了99.24%的AUC和3.87%的EER,显著优于基础模型和多个SOTA基线,如LCNN-BLSTM(9.87% EER)和TRACE(8.08% EER)。此外,多语言评估证实伪造伪影与语言无关。尽管自监督表示提高了鲁棒性,但在领域和语言偏移下性能仍会下降,凸显了需要更好的适应和校准策略。

英文摘要

Spoofed speech detection is increasingly challenged by realistic synthesis, voice conversion, and replay attacks, with cross-dataset generalization remaining a major limitation. In this work, we propose a Temporal Pyramid Adapter that utilizes parallel temporal convolutions with varying receptive fields to capture multi-scale spoofing cues, ranging from local artifacts to global prosodic irregularities. We also integrate self-supervised XLS-R representations combined with front-end adapters, including Mel, Sinc, and a Temporal Pyramid design for multi-scale temporal modeling. The proposed model is evaluated across multiple benchmarks, including ASVspoof 2017, ASVspoof 2021 (DF/LA), PartialSpoof, DiffSSD, and multilingual HQ-MPSD datasets. Experimental results demonstrate that the Temporal Pyramid model obtains an AUC of 99.24% and an EER of 3.87% on the PartialSpoof database, significantly outperforming the base model and several SOTA baselines such as LCNN-BLSTM (9.87% EER) and TRACE (8.08% EER). Additionally, multilingual evaluations confirm that spoofing artifacts are independent of language. While self-supervised representations improve robustness, performance degrades under domain and language shifts, highlighting the need for better adaptation and calibration strategies.
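
A minimal sketch of the multi-scale idea follows: several temporal convolutions with different kernel widths run in parallel over the same sequence and their outputs are stacked per time step. The widths, the uniform placeholder kernels (standing in for learned weights), and the zero padding are assumptions; the paper's adapter is not specified at this level in the abstract.

```python
def conv1d_same(signal, kernel):
    """1D convolution with zero padding so the output length equals the input length."""
    half = len(kernel) // 2
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            idx = t + j - half
            if 0 <= idx < len(signal):
                acc += weight * signal[idx]
        out.append(acc)
    return out

def temporal_pyramid(signal, widths=(3, 7, 15)):
    """Run one branch per kernel width and stack the branch outputs per frame."""
    branches = [conv1d_same(signal, [1.0 / w] * w) for w in widths]
    return [list(frame) for frame in zip(*branches)]

# A single spike: narrow branches respond sharply, wide branches spread it out.
spike = [0.0] * 32
spike[16] = 1.0
features = temporal_pyramid(spike)
constant_features = temporal_pyramid([1.0] * 32)
```

The narrow branch reacts to short, local events while the wide branch aggregates longer context, which is the intuition behind covering both local artifacts and slower prosodic irregularities.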

2606.16870 2026-06-16 cs.CV cs.GR 新提交

Latent Space Reinforcement Learning for Inverse Material Estimation in Food Fracture Simulation

潜空间强化学习用于食品断裂模拟中的逆材料估计

Adrian Ramlal, Yuhao Chen, John S. Zelek

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 针对食品断裂模拟中材料参数难以直接测量的问题,提出基于潜空间强化学习的目标条件策略,实现从断裂行为描述到材料参数的单次前向估计,精度提升23%。

Comments Accepted in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 MetaFood Workshop

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 9573-9581
AI中文摘要

食品操作的真实视觉模拟需要精确的材料参数,但这些参数难以直接测量,且在单个食品的异质区域间变化。我们解决了从不可微的连续损伤力学模拟器中断裂行为的目标描述中估计材料参数的逆问题。以剥橙子为测试案例,我们在2000次正向模拟上训练神经代理,并比较协方差矩阵自适应进化策略(CMA-ES,一种无梯度进化优化器)与近端策略优化(PPO,一种强化学习算法)在原始9维参数空间和两个学习的4维潜表示上的表现。由于不同橙子具有不同的材料属性,实用的逆系统必须能够处理任意目标而无需重新训练。我们训练了一个目标条件PPO策略,该策略学习通用的逆映射:给定任意剥皮行为的目标描述,该策略在单次前向传递(8次代理评估,约10毫秒)中产生材料参数估计。在归一化流潜空间中使用共享代理评估器,目标条件策略通过模拟器验证时实现了0.642的实际恢复率,比原始参数空间高出23%。从策略输出初始化CMA-ES细化的热启动扩展进一步将恢复率提升至0.828,使用540次评估。这些发现为食品逆物理提供了实用框架,并为从食品操作的视频观测中通过视觉驱动识别材料奠定了基础。

英文摘要

Realistic visual simulation of food manipulation requires accurate material parameters, yet these are difficult to measure directly and vary across the heterogeneous regions of a single food item. We address the inverse problem of estimating material parameters from a target description of fracture behavior in a non-differentiable continuum damage mechanics simulator. Using orange peeling as a test case, we train a neural surrogate on 2,000 forward simulations and compare Covariance Matrix Adaptation Evolution Strategy (CMA-ES, a gradient-free evolutionary optimizer) with Proximal Policy Optimization (PPO, a reinforcement learning algorithm) across the original 9-dimensional parameter space and two learned 4-dimensional latent representations. Since different oranges have different material properties, a practical inverse system must handle arbitrary targets without retraining. We train a goal-conditioned PPO policy that learns a general inverse mapping: given any target description of peeling behavior, the policy produces a material parameter estimate in a single forward pass (8 surrogate evaluations, approximately 10ms). Operating in a normalizing flow latent space with a shared surrogate evaluator, the goal-conditioned policy achieves 0.642 actual recovery when validated through the simulator, outperforming the original parameter space by 23%. A warm-start extension that initializes CMA-ES refinement from the policy's output further improves recovery to 0.828 with 540 evaluations. These findings provide a practical framework for inverse food physics and lay groundwork for vision-driven material identification from video observations of food manipulation.

2606.16951 2026-06-16 cs.CV eess.IV 新提交

Simulation-Based Multi-Fillet Evaluation of Woody Breast Poultry Fillets

基于仿真的多鸡胸肉木质化评估

Chirantan Sen Mukherjee, Seung-Chul Yoon, William J. Beksi

发表机构 * Department of Computer Science and Engineering, The University of Texas at Arlington(德克萨斯大学阿灵顿分校计算机科学与工程系) Quality and Safety Assessment Research Unit, U.S. National Poultry Research Center, USDA Agricultural Research Service(美国农业部农业研究服务局国家家禽研究中心质量与安全评估研究单元)

AI总结 针对单鸡胸肉检测的吞吐量瓶颈,提出一种俯视多鸡胸肉检测架构,通过物理仿真生成数据集并提取二维形状变形分数,实现多鸡胸肉同时评估。

Comments To be published in the 2026 International Conference on Automation Science and Engineering (CASE)

详情
AI中文摘要

木质化鸡胸肉是现代肉鸡的一种肌病,导致胸肌异常僵硬和纤维化,降低肉质并造成重大经济损失。最先进的自动WB检测依赖于侧视成像系统,分析单个鸡胸肉从传送带落下时的弯曲行为。虽然高度准确,但该方法受限于单鸡胸肉视野,在商业加工线上造成吞吐量瓶颈。本文通过一种利用俯视相机配置的新型多鸡胸肉检测架构来解决这一限制。为验证我们的方法,首先开发了工业传送系统的高保真数字孪生。然后,合成多样化的3D鸡胸肉网格数据集,并使用基于物理的仿真引擎模拟其粘弹性弯曲动力学。最后,从俯视视角提取连续的二维形状变形分数,模拟鸡胸肉经过滚轮边缘的过程。实验结果表明,俯视形状分数有效捕捉鸡胸肉弯曲时的轮廓变化,为同时多鸡胸肉WB评估提供了鲁棒且可扩展的侧视成像系统替代方案。

英文摘要

Woody breast (WB) is a myopathy in modern broiler chickens that causes the breast muscle to become unusually stiff and fibrous, leading to decreased meat quality and significant economic losses. State-of-the-art automated WB detection relies on a side-view imaging system to analyze the bending behavior of a single fillet as it falls off a conveyor belt. While highly accurate, this approach is constrained by its single-fillet field of view, creating throughput bottlenecks on commercial processing lines. In this paper, we address this limitation via a novel multi-fillet detection architecture utilizing a top-down camera configuration. To validate our approach, we first develop a high-fidelity digital twin of an industrial conveyor system. Next, we synthesize a diverse dataset of 3D fillet meshes and model their viscoelastic bending dynamics using a physics-based simulation engine. Lastly, a continuous 2D shape deformation score is extracted from the top-down perspective as the simulated fillets traverse the roller precipice. Experimental results demonstrate that the top-down shape score effectively captures the contour changes of the fillets as they bend, providing a robust and scalable alternative to a side-view imaging system for simultaneous multi-fillet WB evaluation.

2603.04592 2026-06-16 cs.CL cs.CV 交叉投稿

From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models

从静态推理到动态交互:流式大型语言模型综述

Junlong Tong, Zilong Wang, YuJie Ren, Peiran Yin, Hao Wu, Wei Zhang, Xiaoyu Shen

发表机构 * Shanghai Jiao Tong University(上海交通大学) Institute of Digital Twin, Eastern Institute of Technology(数字孪生研究院,东部技术研究院)

AI总结 本文统一了流式LLM的定义,提出系统分类法,综述其方法、应用与未来方向。

Comments Accepted by ACL 2026 Findings

详情
AI中文摘要

标准大型语言模型(LLM)主要设计用于预定义输入的静态推理,这限制了它们在动态实时场景中的适用性。为解决这一差距,流式LLM范式应运而生。然而,现有流式LLM的定义仍然零散,混淆了流式生成、流式输入和交互式流式架构,且缺乏系统分类法。本文对流式LLM进行了全面概述和分析。首先,我们基于数据流和动态交互建立了流式LLM的统一定义,以澄清现有歧义。基于这一定义,我们提出了当前流式LLM的系统分类法,并对其底层方法进行了深入讨论。此外,我们探讨了流式LLM在现实场景中的应用,并概述了有前景的研究方向,以支持流式智能的持续进展。我们在以下网址维护一个持续更新的相关论文仓库:https://github.com/EIT-NLP/Awesome-Streaming-LLMs。

英文摘要

Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT-NLP/Awesome-Streaming-LLMs.

2606.14750 2026-06-16 eess.AS cs.AI cs.CV cs.SD 交叉投稿

Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech

Pixel-TTS: 基于图像的文字渲染实现鲁棒文本转语音

Adarsh Arigala, Arjun Gangwar, S Umesh, Yova Kementchedjhieva

发表机构 * SPRING Lab, Indian Institute of Technology, Madras, India(SPRING实验室,印度理工学院,马德拉斯,印度) MBZUAI, UAE(MBZUAI,阿联酋)

AI总结 提出Pixel-TTS框架,将文本渲染为图像并通过2D卷积生成嵌入,消除嵌入矩阵扩展,提升对未见字符和拼写变体的鲁棒性,实现零样本泛化。

Comments 5 pages, 4 figures, 4 tables

详情
AI中文摘要

近期基于像素的文本建模进展表明,将文本表示为图像能使模型利用视觉线索进行语言理解。将文本锚定在其视觉形式上,允许具有不同Unicode编码的结构相似字符产生相似的嵌入,从而有益于跨语言和零样本场景。传统的基于文本的方法独立处理每个字符,限制了向未见字符的泛化,并在跨语言适应时需要嵌入扩展。我们提出Pixel-TTS,首个视觉接地语音合成框架。它将文本渲染为图像,并通过2D卷积层投影以生成嵌入。这种设计在微调过程中消除了嵌入矩阵扩展,同时提高了对未见字符和拼写变体的鲁棒性。大量实验表明,Pixel-TTS在强基线上实现了有竞争力的性能、更快的收敛和鲁棒的零样本泛化。

英文摘要

Recent advances in pixel-based text modeling show that representing text as images enables models to exploit visual cues for language understanding. Grounding text in its visual form allows structurally similar characters with different Unicode encodings to produce similar embeddings, benefiting cross-lingual and zero-shot scenarios. Conventional text-based approaches treat each character independently, limiting generalization to unseen characters and requiring embedding expansion during cross-lingual adaptation. We propose Pixel-TTS, the first framework for visually grounded speech synthesis. It renders text as images and projects them through a 2D convolutional layer to generate embeddings. This design eliminates embedding matrix expansion during fine-tuning while improving robustness to unseen characters and orthographic variations. Extensive experiments show that Pixel-TTS achieves performance competitive with strong baselines, along with faster convergence and robust zero-shot generalization.

2606.14808 2026-06-16 eess.IV cs.CV cs.IT math.IT 交叉投稿

Explainable Task-Oriented Token Communication for AI-Native 6G Networks

面向AI原生6G网络的可解释任务导向Token通信

Feibo Jiang, Lei Mao, Li Dong, Kezhi Wang, Cunhua Pan, Jiangzhou Wang

发表机构 * IEEE

AI总结 提出ET-TokenCom框架,通过跨模态注意力融合视觉Token与任务Token,实现可解释的任务导向图像通信,解决Token表示不足、协作差和可解释性低的问题。

详情
AI中文摘要

基础模型(FMs)与无线通信的集成正在推动图像通信从比特精确传输向任务导向传输的演进。然而,现有的任务导向图像通信方法仍面临三大挑战:任务导向Token表示不足、视觉Token与任务Token之间的协作不充分,以及任务决策的可解释性有限。为了解决这些挑战,我们提出了一种可解释的任务导向Token通信(ET-TokenCom)框架。通过将Token作为信息表示和传输的统一单元,所提框架构建了一个跨越视觉感知、无线传输和任务推理的端到端通信链路。在发射端,ET-TokenCom框架从图像中提取视觉Token以保留低层视觉信息。同时,引入由FM生成的任务Token来表示当前任务所需的目标信息和决策意图。进一步设计了跨模态注意力(CMA)融合机制,使任务Token能够显式地指导视觉Token的选择、加权和传输。在接收端,该框架将Token解码与可解释输出机制相结合,生成注意力热图以突出不同任务目标下的关键感知区域,并揭示任务Token对输出的影响。最后,仿真结果验证了所提ET-TokenCom框架的有效性和鲁棒性。

英文摘要

The integration of Foundation Models (FMs) and wireless communications is driving the evolution of image communication from bit-accurate transmission toward task-oriented transmission. However, existing task-oriented image communication methods still face three major challenges: insufficient task-oriented Token representation, inadequate collaboration between Visual Tokens and Task Tokens, and limited interpretability of task decisions. To address these challenges, we propose an Explainable Task-Oriented Token Communication (ET-TokenCom) framework. By treating Tokens as unified units for information representation and transmission, the proposed framework constructs an end-to-end communication link that spans visual perception, wireless transmission, and task reasoning. At the transmitter, the ET-TokenCom framework extracts Visual Tokens from images to preserve low-level visual information. Meanwhile, Task Tokens generated by the FM are introduced to represent the target information and decision intent required by the current task. A Cross-Modal Attention (CMA) fusion mechanism is further designed, enabling Task Tokens to explicitly guide the selection, weighting, and transmission of Visual Tokens. At the receiver, the framework integrates Token decoding with an explainable output mechanism, where attention heatmaps are generated to highlight critical perceptual regions under different task objectives and reveal the influence of Task Tokens on the outputs. Finally, simulation results validate the effectiveness and robustness of the proposed ET-TokenCom framework.
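
The guidance step, in which a Task Token decides which Visual Tokens are worth transmitting, can be sketched with plain scaled dot-product attention followed by a top-k selection. This is an assumed simplification; the paper's Cross-Modal Attention design, the token dimensions, and the selection rule are not given in the abstract, and all names below are hypothetical.

```python
import math

def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def select_visual_tokens(task_token, visual_tokens, k):
    """Score each visual token against the task token and keep the k best.

    Returns the kept indices (in original order) and the attention weights,
    which double as a simple relevance map over the image tokens.
    """
    scale = math.sqrt(len(task_token))
    scores = [sum(q * v for q, v in zip(task_token, tok)) / scale
              for tok in visual_tokens]
    weights = softmax(scores)
    ranked = sorted(range(len(visual_tokens)), key=lambda i: weights[i], reverse=True)
    return sorted(ranked[:k]), weights

task = [1.0, 0.0]
visual = [[0.0, 1.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
kept, attention = select_visual_tokens(task, visual, k=2)
```

The same weights that drive selection can be rendered as a heatmap over token positions, which is one simple way to expose why a given region was prioritised for a task.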

2606.16580 2026-06-16 cs.LG cs.CV 交叉投稿

Multi-Modal Spatio-Temporal Graph Neural Network with Mixture of Experts for Soil Organic Carbon Prediction

基于专家混合的多模态时空图神经网络用于土壤有机碳预测

Daniele Mos, Felipe Drummond, Anton Bossenbroek, Soufiane el Khinifri

发表机构 * Spatialise B.V.

AI总结 提出SpTGNN,一种多模态时空图神经网络,通过异构图注意力、微调基础模型特征提取和稀疏专家混合融合,结合异方差回归与深度集成的不确定性量化,在三个区域数据集上优于XGBoost基线。

Comments Paper is 27 pages, 14 figures, 12 tables

详情
AI中文摘要

表层土壤有机碳(SOC)预测是农业可持续性、土地利用政策和施肥规划的基础。现有方法面临两个限制:它们将手工制作的协变量与经典机器学习或单模态深度模型配对,忽略了丰富的光谱和时间信息,而基于网格的架构忽略了田间测量的不规则空间结构。我们提出了SpTGNN,一种多模态时空图神经网络来解决这两个问题。SpTGNN将土壤测量表示为具有三种边类型(空间邻近性、光谱相似性、高程)的异构图中的节点,并应用关系图注意力来学习每种关系的独立模式。一个微调的TerraMind编码器从Sentinel-2、Sentinel-1和DEM信号中提取节点特征,并结合每个样本的环境协变量以及学习到的位置和时间嵌入。一个稀疏专家混合模块通过top-$k$路由融合四个流。通过配对异方差回归(偶然不确定性)和深度集成(认知不确定性)来捕获不确定性,并使用Moran's $I$惩罚项正则化空间自相关。我们在一个全球SOC语料库上进行评估,该语料库分为三个区域实例(全球约49k样本,非洲约26k,欧洲约14k)。我们的5成员深度集成在非洲测试集上报告$R^2=0.762$,RMSE $=3.51\pm0.48$ g/kg和MAPE $=22.9\%$,优于表格XGBoost基线;最佳单个检查点达到验证$R^2=0.864$。消融实验证实异构图、MoE融合和微调主干各自贡献显著,集成不确定性量化栈实现后校准ECE为$0.031$(混合)和$0.026$($\beta$-NLL)。据我们所知,这是第一个统一基础模型特征提取、异构图注意力和分解不确定性量化的SOC估计框架。

英文摘要

Top-soil organic carbon (SOC) prediction is fundamental to agricultural sustainability, land use policy and fertilization planning. Existing approaches face two limitations: they pair hand-crafted covariates with classical ML or single-modal deep models that miss rich spectral and temporal information, and grid-based architectures ignore the irregular spatial structure of field measurements. We introduce SpTGNN, a multi-modal spatio-temporal graph neural network addressing both. SpTGNN represents soil measurements as nodes in a heterogeneous graph with three edge types (spatial proximity, spectral similarity, elevation), and applies relational graph attention to learn separate patterns per relation. A fine-tuned TerraMind encoder extracts node features from Sentinel-2, Sentinel-1 and DEM signals, combined with per-sample environmental covariates and learned positional and temporal embeddings. A sparse Mixture-of-Experts module fuses the four streams via top-$k$ routing. Uncertainty is captured by pairing heteroscedastic regression (aleatoric) with deep ensembles (epistemic), and a Moran's $I$ penalty regularizes spatial autocorrelation. We evaluate on a global SOC corpus split into three regional instances ($\sim$49k samples globally, Africa $\sim$26k, Europe $\sim$14k). Our 5-member deep ensemble reports $R^2=0.762$, RMSE $=3.51\pm0.48$ g/kg and MAPE $=22.9\%$ on the Africa test split, improving over a tabular XGBoost baseline; the best single checkpoint reaches validation $R^2=0.864$. Ablations confirm the heterogeneous graph, MoE fusion and fine-tuned backbone each contribute substantively, and the ensemble UQ stack achieves post-calibration ECE of $0.031$ (hybrid) and $0.026$ ($\beta$-NLL). To our knowledge, this is the first framework to unify foundation-model feature extraction, heterogeneous graph attention and decomposed uncertainty quantification for SOC estimation.
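
The pairing of heteroscedastic regression with a deep ensemble is commonly turned into a decomposed uncertainty by the law of total variance: the average of the members' predicted variances is read as aleatoric, and the spread of their predicted means as epistemic. The sketch below assumes each member outputs a Gaussian mean and variance for one sample; the numbers are made up for illustration and are not results from the paper.

```python
def decompose_uncertainty(means, variances):
    """Combine per-member Gaussian predictions for a single sample.

    means[i], variances[i]: mean and variance predicted by ensemble member i.
    Returns (ensemble mean, aleatoric variance, epistemic variance, total variance).
    """
    m = len(means)
    ensemble_mean = sum(means) / m
    aleatoric = sum(variances) / m                                # mean of variances
    epistemic = sum((mu - ensemble_mean) ** 2 for mu in means) / m  # variance of means
    return ensemble_mean, aleatoric, epistemic, aleatoric + epistemic

# Hypothetical 5-member ensemble output for one soil sample (g/kg).
member_means = [10.0, 12.0, 11.0, 13.0, 9.0]
member_variances = [4.0, 4.0, 4.0, 4.0, 4.0]

mean, aleatoric, epistemic, total = decompose_uncertainty(member_means, member_variances)
```

With more training data the members tend to agree and the epistemic term shrinks, while the aleatoric term reflects noise that the members themselves predict and does not vanish with more data.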

2505.04397 2026-06-16 cs.CV cs.AI cs.LG eess.IV 版本更新

PURe: A Plug-and-Play Product-Unit Residual Module for Vision Networks

PURe: 一种用于视觉网络的即插即用乘积单元残差模块

Ziyuan Li, Uwe Jaekel, Babette Dellen

Affiliations * Department of Mathematics, Informatics and Technology, University of Applied Sciences Koblenz; Technical University of Munich

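AI summary Proposes the PURe module, which uses a log-domain formulation of a 2D product unit to make local multiplicative interactions stable. It can replace the standard units in residual networks and improves the accuracy-parameter trade-off on image classification and CT segmentation.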

Comments Revised version

详情
AI中文摘要

现代视觉网络主要由加性局部变换主导,而显式的乘法局部交互仍未得到充分探索。乘积单元提供了一种直接建模此类交互的方法,但其在深度架构中的使用受到优化不稳定性的限制。在这项工作中,我们提出了PURe,一种用于深度视觉网络的乘积单元残差模块。PURe围绕一个具有实值对数域公式的二维乘积单元构建,使得乘法局部聚合在深度残差层次结构中变得实用。由此产生的模块可作为原生残差单元的即插即用替代品。我们将PURe实例化到用于图像分类的残差CNN和用于体积CT数据切片分割的二维残差编码器-解码器网络中。在Galaxy10 DECaLS、ImageNet和CIFAR-10上,PURe一致地改进了残差CNN,并产生了更有利的精度-参数权衡,使得中等深度模型能够以更小的参数预算匹配或超越显著更深的ResNet基线。在AMOS基准测试中,PURe还在3D病例级评估下改进了切片CT分割。这些结果表明,显式的乘法局部交互是深度残差视觉网络的一种实用且有效的设计原语。

英文摘要

Modern vision networks are dominated by additive local transformations, whereas explicit multiplicative local interactions remain underexplored. Product units offer a direct approach to modeling such interactions, but their use in deep architectures has been limited by optimization instability. In this work, we propose PURe, a Product-Unit Residual Module for deep vision networks. PURe is built around a 2D Product Unit with a real-valued log-domain formulation that makes multiplicative local aggregation practical within deep residual hierarchies. The resulting module serves as a drop-in replacement for native residual units. We instantiate PURe in residual CNNs for image classification and in 2D residual encoder-decoder networks for slice-based segmentation on volumetric CT data. Across Galaxy10 DECaLS, ImageNet, and CIFAR-10, PURe consistently improves residual CNNs and yields a more favorable accuracy-parameter trade-off, allowing moderately deep models to match or surpass substantially deeper ResNet baselines with much smaller parameter budgets. On the AMOS benchmark, PURe also improves slice-based CT segmentation under 3D case-level evaluation. These results show that explicit multiplicative local interaction is a practical and effective design primitive for deep residual vision networks.
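
A product unit computes a weighted product of its inputs, and evaluating it in the log domain turns the product into a sum. The sketch below illustrates that identity only. It is not the PURe module: the function name is hypothetical, and clamping inputs to a small positive `eps` is an assumption made here so that the real logarithm is defined.

```python
import math


def product_unit(inputs, exponents, eps=1e-6):
    """Weighted product prod_i x_i ** w_i, computed as exp(sum_i w_i * log x_i)."""
    # Assumption: clamp inputs to be positive so the real-valued log exists.
    log_sum = sum(w * math.log(max(x, eps)) for x, w in zip(inputs, exponents))
    return math.exp(log_sum)


# 2.0 ** 1.0 * 3.0 ** 2.0, evaluated through the log-domain sum.
value = product_unit([2.0, 3.0], [1.0, 2.0])
print(value)
```

Working with sums of logarithms avoids multiplying many small or large numbers directly, which is one common reason for preferring a log-domain evaluation of products.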

2508.10523 2026-06-16 cs.CV 版本更新

Reasoning in Computer Vision: Taxonomy, Models, Tasks, and Methodologies

计算机视觉中的推理:分类、模型、任务与方法论

Ayushman Sarkar, Zhenyu Yu, Mohd Yamani Idna Idris

Affiliations * Department of Computer Science and Engineering, Birbhum Institute of Engineering and Technology; College of Computer Science and Artificial Intelligence, Fudan University; Faculty of Computer Science and Information Technology, Universiti Malaya

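AI summary A systematic taxonomy of reasoning in computer vision covering relational, symbolic, temporal, causal, and commonsense reasoning. The survey reviews implementations from graph-based models to multimodal large language models together with evaluation protocols, identifies open challenges, and sets out future research directions.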

详情
AI中文摘要

视觉推理对于许多超越表面级物体检测和分类的计算机视觉任务至关重要。尽管在关系、符号、时间、因果和常识推理方面取得了进展,但现有综述通常只涵盖问题的一部分,例如视觉问答、场景图生成、神经符号AI或多模态思维链,很少同时分析推理类型、方法论和评估协议。本综述填补了这一空白。通过结构化文献回顾,我们将视觉推理分为五大类型(关系、符号、时间、因果和常识),并考察每种类型如何在从基于图的模型、记忆网络、注意力机制、神经符号系统到视觉语言模型(VLM)和多模态大语言模型(MLLM)的方法中实现,包括视觉思维链、视觉编程、工具增强和测试时推理。然后,我们回顾了功能正确性、结构一致性和因果有效性的评估协议,并分析了它们在泛化性、可重复性、忠实性和解释性方面的局限性。我们还识别了开放挑战:扩展到复杂场景、更深入地整合符号和神经范式、缺乏全面基准、基础模型中的语言先验捷径和幻觉,以及弱监督下的推理。最后,我们为视觉系统设定了一个研究议程,并认为连接感知和推理对于透明、可信和跨领域模型是必要的,特别是在自动驾驶和医疗诊断等高风险场景中。

英文摘要

Visual reasoning matters for many computer vision tasks that go beyond surface-level object detection and classification. Despite progress in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys typically cover only one part of the problem, such as visual question answering, scene-graph generation, neuro-symbolic AI, or multimodal chain-of-thought, and rarely analyze reasoning types, methodologies, and evaluation protocols together. This survey addresses that gap. Following a structured literature review, we group visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and examine how each is implemented across methods that range from graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems to reasoning with vision-language models (VLMs) and multimodal large language models (MLLMs), including visual chain-of-thought, visual programming, and tool-augmented and test-time reasoning. We then review evaluation protocols for functional correctness, structural consistency, and causal validity, and we analyze their limits in generalizability, reproducibility, faithfulness, and explanatory power. We also identify open challenges: scaling to complex scenes, integrating symbolic and neural paradigms more deeply, the shortage of comprehensive benchmarks, language-prior shortcuts and hallucination in foundation models, and reasoning under weak supervision. Finally, we set out a research agenda for vision systems and argue that connecting perception and reasoning is necessary for transparent, trustworthy, and cross-domain models, especially in high-stakes settings such as autonomous driving and medical diagnostics.

2508.17254 2026-06-16 cs.CV cs.AI 版本更新

A biological vision inspired framework for machine perception of abutting grating illusory contours

一种受生物视觉启发的机器感知对接光栅错觉轮廓框架

Xiao Zhang, Kai-Fu Yang, Xian-Shi Zhang, Hong-Zhi You, Hong-Mei Yan, Yong-Jie Li

发表机构 * Sichuan Cancer Hospital & Institute, School of Life Science and Technology, University of Electronic Science and Technology of China(四川肿瘤医院及研究院、电子科技大学生命科学与技术学院)

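AI summary Proposes ICPNet, a network inspired by the visual cortex that combines multi-scale feature projection, feature interaction attention, and edge fusion modules, and markedly improves perception of abutting grating illusory contours.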

详情
AI中文摘要

更高层次的机器智能需要与人类感知和认知对齐。深度神经网络(DNN)主导的机器智能在各种现实任务中表现出色。然而,最近证据表明,DNN无法感知如对接光栅这样的错觉轮廓,这与人类感知模式不一致。与以往工作不同,我们提出了一种受视觉皮层电路启发的新型深度网络,称为错觉轮廓感知网络(ICPNet)。在ICPNet中,设计了多尺度特征投影(MFP)模块以提取多尺度表示。为了增强前馈和反馈特征之间的交互,引入了特征交互注意力模块(FIAM)。此外,受人类感知中形状偏见的启发,通过边缘融合模块(EFM)进行的边缘检测任务注入了形状约束,引导网络关注前景。我们在现有的AG-MNIST测试集和本文构建的AG-Fashion-MNIST测试集上评估了我们的方法。综合实验结果表明,ICPNet对对接光栅错觉轮廓的敏感度显著高于最先进模型,在各个子集上的top-1准确率均有显著提升。这项工作有望使基于DNN的模型向人类级智能迈进一步。

英文摘要

Higher levels of machine intelligence demand alignment with human perception and cognition. Machine intelligence dominated by deep neural networks (DNNs) has demonstrated exceptional performance across various real-world tasks. Nevertheless, recent evidence suggests that DNNs fail to perceive illusory contours such as the abutting grating, which is inconsistent with human perception. Departing from previous works, we propose a novel deep network called illusory contour perception network (ICPNet) inspired by the circuits of the visual cortex. In ICPNet, a multi-scale feature projection (MFP) module is designed to extract multi-scale representations. To boost the interaction between feedforward and feedback features, a feature interaction attention module (FIAM) is introduced. Moreover, drawing inspiration from the shape bias observed in human perception, an edge detection task conducted via the edge fusion module (EFM) injects shape constraints that guide the network to concentrate on the foreground. We assess our method on the existing AG-MNIST test set and the AG-Fashion-MNIST test sets constructed in this work. Comprehensive experimental results reveal that ICPNet is significantly more sensitive to abutting grating illusory contours than state-of-the-art models, with notable improvements in top-1 accuracy across various subsets. This work is expected to take a step towards human-level intelligence for DNN-based models.

2602.08029 2026-06-16 gr-qc astro-ph.IM cs.CV 版本更新

Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields

基于物理信息神经场的动态黑洞发射断层成像

Berthy T. Feng, Andrew A. Chael, David Bromley, Aviad Levis, William T. Freeman, Katherine L. Bouman

发表机构 * Caltech(加州理工学院) MIT(麻省理工学院) NSF IAIFI(国家科学基金会IAIFI) Princeton University(普林斯顿大学) Niels Bohr International Academy(尼尔斯·玻尔国际学院) University of Toronto(多伦多大学)

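AI summary Proposes PI-DEF, which uses differentiable neural rendering to jointly reconstruct a 4D emissivity field and a 3D velocity field from EHT measurements, with the physics introduced as a soft constraint. On simulated data it reconstructs more accurately than BH-NeRF and a physics-agnostic approach.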

Comments CVPR 2026

详情
AI中文摘要

随着静态黑洞成像的成功,下一个前沿是黑洞的动态和三维成像。恢复黑洞附近的动态三维气体将揭示宇宙中以前未见的部分,并为新的物理模型提供信息。然而,只有从单一视角进行的稀疏射电测量是可能的,这使得动态三维重建问题严重不适定。此前,BH-NeRF通过假设气体的开普勒动力学来解决不适定问题,但这种假设在黑洞附近失效,因为黑洞的强大引力吸引和增强的电磁活动使流体动力学复杂化。为了克服BH-NeRF的限制性假设,我们提出了PI-DEF,一种基于物理信息的方法,使用可微神经渲染根据EHT测量拟合4D(时间+3D)发射率场。我们的方法联合重建3D速度场与4D发射率场,并将速度作为发射率动力学的软约束。在模拟数据上的实验中,我们发现与BH-NeRF和物理无关方法相比,重建精度显著提高。我们展示了我们的方法如何用于估计黑洞的其他物理参数,例如其自旋。

英文摘要

With the success of static black-hole imaging, the next frontier is the dynamic and 3D imaging of black holes. Recovering the dynamic 3D gas near a black hole would reveal previously-unseen parts of the universe and inform new physics models. However, only sparse radio measurements from a single viewpoint are possible, making the dynamic 3D reconstruction problem significantly ill-posed. Previously, BH-NeRF addressed the ill-posed problem by assuming Keplerian dynamics of the gas, but this assumption breaks down near the black hole, where the strong gravitational pull of the black hole and increased electromagnetic activity complicate fluid dynamics. To overcome the restrictive assumptions of BH-NeRF, we propose PI-DEF, a physics-informed approach that uses differentiable neural rendering to fit a 4D (time + 3D) emissivity field given EHT measurements. Our approach jointly reconstructs the 3D velocity field with the 4D emissivity field and enforces the velocity as a soft constraint on the dynamics of the emissivity. In experiments on simulated data, we find significantly improved reconstruction accuracy over both BH-NeRF and a physics-agnostic approach. We demonstrate how our method may be used to estimate other physics parameters of the black hole, such as its spin.

2604.16592 2026-06-16 cs.RO cs.AI cs.CV cs.ET 版本更新

Human Cognition in Machines: A Unified Perspective of World Models

机器中的人类认知:世界模型的统一视角

Timothy Rupprecht, Pu Zhao, Amir Taherin, Arash Akbari, Arman Akbari, Yumei He, Tooba Imtiaz, Sean Duffy, Juyi Lin, Yixiao Chen, Rahul Chowdhury, Enfu Nan, Yixin Shen, Yifan Cao, Haochen Zeng, Weiwei Chen, Geng Yuan, Jennifer Dy, Sarah Ostadabbas, Xuan Zhang, David Kaeli, Edmund Yeh, Yanzhi Wang

Affiliations * Northeastern University; EmbodyX Inc.; Tulane University; Cornell University; University of Georgia

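AI summary Proposes a unified framework that integrates cognitive functions such as memory and perception, finds that motivation and metacognition are under-researched, and introduces epistemic world models as a new category.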

详情
AI中文摘要

本报告通过区分先前工作在认知功能上的创新来审视世界模型。许多工作声称其世界模型具有近乎人类般的认知能力。评估这些主张需要基于人类和机器认知理论的第一原理。在迈向类人世界模型的过程中,我们提出了一个概念性的统一框架,该框架完全整合了所有认知功能(即记忆、感知、语言、推理、想象、动机和元认知),并指出现有研究的空白,以指导未来技术的发展。特别是,我们发现动机(尤其是内在动机)和元认知仍然严重研究不足,并提出了基于主动推理和全局工作空间理论的具体方向来解决这些空白。我们还引入了认知世界模型,这是一个新的类别,涵盖在结构化知识上运行的科学发现代理框架。我们的分类法应用于视频、具身和认知世界模型,提出了先前分类法未涉及的研究方向。

英文摘要

This report on world models distinguishes prior works by the cognitive functions they innovate on. Many works claim an almost human-like cognitive capability in their world models. Evaluating these claims requires proper grounding in first principles from human and machine cognition theory. Moving towards human-like world models, we present a unified conceptual framework for world models that fully incorporates all the cognitive functions (i.e., memory, perception, language, reasoning, imagining, motivation, and metacognition) and identify gaps in existing research as a guide for the future state of the art. In particular, we find that motivation (especially intrinsic motivation) and metacognition remain drastically under-researched, and we propose concrete directions to address these gaps informed by active inference and global workspace theory. We also introduce epistemic world models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied to video, embodied, and epistemic world models, suggests research directions that prior taxonomies have not.

2605.05372 2026-06-16 cs.CV cs.AI 版本更新

Two Steps Are All You Need: Efficient 3D Point Cloud Anomaly Detection with Consistency Models

两步即可:基于一致性模型的高效3D点云异常检测

Pranav A, Shashank B, Pranav Siddappa, Dominik Seuss, Minal Moharir, Subramanya KN

发表机构 * R.V. College of Engineering(R.V. 工程学院) Technical University of Applied Sciences Würzburg-Schweinfurt(Würzburg-Schweinfurt 应用科学大学)

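AI summary Reformulates reconstruction-based anomaly detection with consistency learning, cutting inference to one or two network evaluations and enabling low-latency 3D point cloud anomaly detection on resource-constrained devices.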

Comments Accepted to CVPR 2026, at the 9th Workshop on Efficient Deep Learning for Computer Vision (ECV). To be published in the IEEE/CVF CVPR 2026 Workshop Proceedings

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 3479-3487
AI中文摘要

扩散模型正在重新定义3D点云数据中的异常检测。随着3D传感成为现代制造的关键,可靠的异常检测对于高吞吐量的质量保证和过程控制至关重要。然而,在资源受限且延迟敏感的系统中,实际部署仍然有限。现有方法往往在复杂未遮挡区域计算上不可行或不可靠,而扩散管道本质上受限于迭代去噪。在本文中,我们通过一致性学习重构基于重建的异常检测,使能够在一次或两次网络评估中直接预测无异常几何。我们进一步引入了一种新的混合损失公式,明确强制重建至干净数据。这种设计显著降低了推理成本,达到比当前最先进方法快80倍的运行时间,无需GPU加速,同时保持强大的检测性能。它在Anomaly-ShapeNet上以76.20%的I-AUROC优于R3D-AD,在Real3DAD上以72.80%的I-AUROC保持竞争力,使在资源受限平台上实现高效、低延迟的异常检测成为可能,包括无人机、智能工业相机和其他边缘设备。

英文摘要

Diffusion models are rapidly redefining 3D anomaly detection in point cloud data. As 3D sensing becomes integral to modern manufacturing, reliable anomaly detection is essential for high-throughput quality assurance and process control. Yet practical deployment on resource-constrained, latency-critical systems remains limited. Existing methods are often computationally prohibitive or unreliable in complex, unmasked regions, and diffusion pipelines are inherently bottlenecked by iterative denoising. In this work, we address this bottleneck by reformulating reconstruction-based anomaly detection through consistency learning, enabling direct prediction of anomaly-free geometry in one or two network evaluations. We further introduce a novel hybrid loss formulation that explicitly enforces reconstruction toward clean data. This design substantially reduces inference cost, achieving up to 80x faster runtime than the current state-of-the-art method, without GPU acceleration, while preserving strong detection performance. It outperforms R3D-AD on Anomaly-ShapeNet with 76.20% I-AUROC and remains competitive on Real3DAD with 72.80% I-AUROC, enabling efficient, low-latency anomaly detection on resource-constrained platforms, including drones, smart industrial cameras, and other edge devices.

2603.24724 2026-06-16 cs.CV cs.AI 版本更新

Is Geometry Enough? An Evaluation of Landmark-Based Gaze Estimation

几何足够吗?基于标记的注视估计评估

Daniele Agostinelli, Thomas Agostinelli, Andrea Generosi, Maura Mengoni

Affiliations * Department of Industrial Engineering and Mathematical Sciences, Università Politecnica delle Marche; Department of Science and Information Technology, Università Pegaso

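AI summary Evaluates gaze estimation from facial landmarks: a standardized pipeline extracts and normalizes landmarks from three large-scale datasets and trains lightweight regression models. In cross-domain evaluation the MLP models are comparable to ResNet18 baselines, suggesting that sparse geometric features can support robust gaze estimation.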

详情
AI中文摘要

基于外观的注视估计通常依赖深度卷积神经网络(CNNs)。这些模型准确但计算成本高且作为“黑箱”,可解释性差。基于面部标记的几何方法是轻量级替代方案,但其性能限制和泛化能力在现代基准中仍待探索。本文全面评估了基于标记的注视估计,引入标准化流程提取和归一化三个大型数据集(Gaze360、ETH-XGaze、GazeGene)的标记,并训练轻量级回归模型,具体为极端梯度提升树和两种神经架构:整体多层感知机(MLP)和设计捕捉双眼几何的孪生MLP。发现基于标记的模型在领域内评估表现较低,可能由于数据集中的标记检测噪声引入。然而,在跨域评估中,所提出的MLP架构的泛化能力与ResNet18基线相当。这些发现表明稀疏几何特征编码了足够的信息以支持鲁棒的注视估计,为高效、可解释且隐私友好的边缘应用铺平了道路。源代码和生成的基于标记的数据集可在https://github.com/daniele-agostinelli/LandmarkGaze.git获取。

英文摘要

Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as "black boxes", offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi-Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at https://github.com/daniele-agostinelli/LandmarkGaze.git.

2106.14490 2026-06-16 cs.CV 版本更新

Making Images Real Again: A Comprehensive Survey on Deep Image Composition

让图像重现真实:深度图像合成的全面综述

Li Niu, Wenyan Cong, Liu Liu, Yan Hong, Bo Zhang, Jing Liang, Liqing Zhang

Affiliations * MOE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University

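AI summary Surveys the sub-tasks and the combined task of deep image composition, summarizes existing methods, datasets, and evaluation metrics, and contributes libcom, the first image composition toolbox.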

详情
AI中文摘要

作为常见的图像编辑操作,图像合成(也称为对象/主体插入/添加/合成)旨在将一幅图像的前景与另一幅背景图像结合,生成合成图像。然而,许多问题可能导致合成图像不真实。这些问题可以总结为前景与背景之间的不一致,包括外观不一致、几何不一致和语义不一致。图像合成任务可以分解为多个子任务,每个子任务针对一个或多个问题。具体而言,对象放置旨在为前景找到合理的尺度、位置和形状。图像融合旨在解决前景与背景之间的不自然边界。图像调和旨在调整前景的光照统计。阴影(或反射)生成旨在为前景生成合理的阴影(或反射)。这些子任务可以按顺序或并行执行以获得逼真的合成图像。据我们所知,目前没有关于图像合成的先前综述。在本文中,我们对图像合成的子任务和综合任务进行了全面综述。对于每个任务,我们总结了现有方法、可用数据集和常见评估指标。图像合成的数据集和代码汇总在https://github.com/bcmi/Awesome-Object-Insertion。我们还贡献了首个图像合成工具箱:libcom https://github.com/bcmi/libcom,它集成了10多个与图像合成相关的功能。该工具箱的最终目标是通过简单的`import libcom`解决所有图像合成问题。基于libcom工具箱,我们还开发了一个在线图像合成工作台https://libcom.ustcnewly.com。

英文摘要

As a common image editing operation, image composition/compositing, which is also called object/subject insertion/addition/compositing, aims to combine the foreground from one image and another background image to produce a composite image. However, there are many issues that could make the composite images unrealistic. These issues can be summarized as the inconsistency between foreground and background, which includes appearance inconsistency, geometry inconsistency, and semantic inconsistency. The image composition task could be decomposed into multiple sub-tasks, in which each sub-task targets one or more issues. Specifically, object placement aims to find a reasonable scale, location, and shape for the foreground. Image blending aims to address the unnatural boundary between foreground and background. Image harmonization aims to adjust the illumination statistics of the foreground. Shadow (resp., reflection) generation aims to generate plausible shadow (resp., reflection) for the foreground. These sub-tasks can be executed sequentially or in parallel to acquire realistic composite images. To the best of our knowledge, there is no previous survey on image composition. In this paper, we conduct a comprehensive survey over the sub-tasks and combined task of image composition. For each one, we summarize the existing methods, available datasets, and common evaluation metrics. Datasets and codes for image composition are summarized at https://github.com/bcmi/Awesome-Object-Insertion. We have also contributed the first image composition toolbox: libcom https://github.com/bcmi/libcom, which assembles 10+ image-composition-related functions. The ultimate goal of this toolbox is to solve all image composition problems with a simple `import libcom`. Based on the libcom toolbox, we also develop an online image composition workbench https://libcom.ustcnewly.com.

2511.00352 2026-06-16 cs.CV cs.AI 版本更新

Detecting AI-Generated Images via Diffusion Snap-Back Reconstruction: A Forensic Approach

通过扩散快回重建检测AI生成图像:一种取证方法

Mohd Ruhul Ameen, Akif Islam

Affiliations * Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology

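AI summary Detects AI-generated images from how they respond when reconstructed by a diffusion model, using LPIPS, SSIM, and PSNR to measure how closely an image matches the model's denoising behavior. The method reaches an AUROC of 0.993 under stratified five-fold cross-validation.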

Comments Accepted at the 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence and Networking (QPAIN 2026)

详情
Journal ref
2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN)
AI中文摘要

生成图像模型的快速发展使数字媒体发生了变革,使得人类观察者或许多传统检测方法难以可靠地区分AI生成图像和真实照片。现代文本到图像系统如Stable Diffusion和DALL E能够生成极其逼真的图像,使其看起来完全自然,留下很少或没有传统深度伪造检测器可以依赖的可见伪影。这一挑战对虚假信息控制、机构身份验证和政治和法律领域中的数字信任有实际影响。我们不搜索隐藏的像素级痕迹,而是观察图像在被轻微扰动和由扩散模型重建时的反应。我们称之为扩散快回。通过跟踪不同重建强度下感知相似性度量(LPIPS、SSIM和PSNR)的变化,我们捕捉到紧凑且可解释的信号,揭示图像与扩散模型学习的去噪行为的接近程度。在包含4000张人类和AI生成图像的平衡数据集上评估,所提出的方法在分层五折交叉验证中达到AUROC 0.993,在使用仅逻辑回归的测试集上达到0.990。初步的鲁棒性测试显示,该方法在常见的现实世界失真如图像压缩和添加噪声下仍保持稳定。虽然我们的实验使用单一扩散主干进行,但结果表明,重建行为可以作为合成媒体检测的可靠且可扩展的基础,随着生成模型变得越来越逼真。

英文摘要

The rapid advancement of generative image models has transformed digital media to the point where AI-generated images can no longer be reliably distinguished from authentic photographs by human observers or many conventional detection methods. Modern text-to-image systems such as Stable Diffusion and DALL-E can now generate images so realistic that they often appear completely natural, leaving little to no visible artifacts for traditional deepfake detectors to rely on. This challenge has practical consequences for misinformation control, institutional identity verification, and digital trust in political and legal contexts. Instead of searching for hidden pixel-level traces, we take a different approach: we observe how an image responds when it is gently disturbed and reconstructed by a diffusion model. We call this behavior diffusion snap-back. By tracking how perceptual similarity measures (LPIPS, SSIM, and PSNR) change across different reconstruction strengths, we capture compact and interpretable signals that reveal how closely an image aligns with the diffusion model's learned denoising behavior. Evaluated on a balanced dataset of 4,000 human and AI-generated images, the proposed method achieves an AUROC of 0.993 under stratified five-fold cross-validation and 0.990 on a holdout split using only logistic regression. Initial robustness tests show that the method remains stable under common real-world distortions such as image compression and added noise. Although our experiments were conducted using a single diffusion backbone, the results indicate that reconstruction behavior can serve as a reliable and scalable foundation for synthetic media detection as generative models continue to grow more realistic.
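
The idea of tracking a similarity measure across reconstruction strengths can be sketched as a small feature extractor. This is an illustration under stated assumptions, not the authors' pipeline: only PSNR is computed (LPIPS and SSIM are omitted), images are flat lists of pixel values in [0, 1], and `toy_reconstruct` is a hypothetical stand-in for a diffusion image-to-image call.

```python
import math


def psnr(a, b, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


def snap_back_features(image, reconstruct, strengths=(0.2, 0.4, 0.6)):
    """One PSNR value per reconstruction strength, usable as classifier features."""
    return [psnr(image, reconstruct(image, s)) for s in strengths]


def toy_reconstruct(image, strength):
    # Hypothetical placeholder: pulls every pixel toward mid-grey by `strength`.
    return [(1 - strength) * p + strength * 0.5 for p in image]


features = snap_back_features([0.0, 1.0, 0.25, 0.75], toy_reconstruct)
print(features)
```

In the setting described in the abstract, a feature vector of this kind would be passed to a simple classifier such as logistic regression.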

2510.23785 2026-06-16 cs.CV cs.AI 版本更新

CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting

CountFormer:一种用于学习类无关物体计数中视觉重复和结构的Transformer框架

Md Tanvir Hossain, Akif Islam, Mohd Ruhul Ameen

发表机构 * University of California, Berkeley(加州大学伯克利分校)

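AI summary CountFormer pairs a DINOv2 encoder with positional embeddings to improve structural consistency in exemplar-free object counting, achieving competitive performance on FSC-147 (MAE 19.06, RMSE 118.45).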

Comments Accepted at the 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence and Networking (QPAIN 2026)

详情
Journal ref
2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN)
AI中文摘要

人类通常通过观察视觉重复和组成来计数 unfamiliar objects,而非仅依赖物体类别。然而,许多无示例计数模型在这种情况下的表现不佳,尤其是在物体包含对称组件、重复子结构或部分遮挡时可能过计数。我们引入了CountFormer,这是一种受CounTR启发的密度回归框架的受控适应,其中图像编码器被自监督视觉基础模型DINOv2取代。所得的Transformer特征与显式的二维位置嵌入结合,并通过轻量级卷积网络解码,以生成密度图,其积分给出最终计数。我们的目标不是提出新的计数架构,而是研究在严格无示例设置下,基于基础的表示是否能提高结构一致性。在FSC-147上,CountFormer在官方基准上实现了竞争性表现(MAE 19.06,RMSE 118.45)。定性分析表明,对于某些结构复杂的物体,部分层面的过计数错误更少,而总体误差与先前方法大致一致。敏感性分析显示,评估指标强烈受少量极端高密度场景的影响。总体而言,结果突显了表示质量在无示例物体计数中的作用。

英文摘要

Humans can often count unfamiliar objects by observing visual repetition and composition, rather than relying only on object categories. However, many exemplar-free counting models struggle in such situations and may overcount when objects contain symmetric components, repeated substructures, or partial occlusion. We introduce CountFormer, a controlled adaptation of a density-regression framework inspired by CounTR, where the image encoder is replaced with the self-supervised vision foundation model DINOv2. The resulting transformer features are combined with explicit two-dimensional positional embeddings and decoded by a lightweight convolutional network to produce a density map whose integral gives the final count. Our goal is not to propose a new counting architecture, but to study whether foundation-based representations improve structural consistency under a strictly exemplar-free setting. On FSC-147, CountFormer achieves competitive performance under the official benchmark (MAE 19.06, RMSE 118.45). Qualitative analysis suggests fewer part-level overcounting errors for some structurally complex objects, while overall error remains broadly consistent with prior approaches. Sensitivity analysis shows that evaluation metrics are strongly affected by a small number of extreme high-density scenes. Overall, the results highlight the role of representation quality in exemplar-free object counting.
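
Two points in the abstract lend themselves to a short illustration: the count is the integral (here, the sum) of the density map, and aggregate metrics can be dominated by a few extreme scenes. The sketch below uses made-up numbers and hypothetical function names. It is not the CountFormer evaluation code.

```python
import math


def count_from_density(density_map):
    """Predicted count is the sum over all cells of the density map."""
    return sum(sum(row) for row in density_map)


def mae(pred, true):
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def rmse(pred, true):
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(true))


density = [[0.0, 0.5, 0.5], [0.25, 0.25, 0.5]]
print(count_from_density(density))

# Made-up counts: three ordinary scenes and one extremely dense scene.
true_counts = [10, 20, 30, 3000]
pred_counts = [12, 18, 33, 2000]
print(mae(pred_counts, true_counts), rmse(pred_counts, true_counts))
```

Because RMSE squares each error, the single dense scene contributes almost all of it, which is consistent with the abstract's remark that a small number of extreme high-density scenes strongly affects the metrics.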

2601.18045 2026-06-16 cs.CV cs.AI 版本更新

Leveraging Persistence Image to Enhance Robustness and Performance in Curvilinear Structure Segmentation

利用持续图像增强曲率结构分割的鲁棒性和性能

Zhuangzhi Gao, Feixiang Zhou, He Zhao, Xiuju Chen, Xiaoxin Li, Qinkai Yu, Yitian Zhao, Alena Shantsila, Gregory Y. H. Lip, Eduard Shantsila, Yalin Zheng

发表机构 * University of Science and Technology of China(中国科学技术大学)

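AI summary Proposes PIs-Regressor and Topology SegNet, which learn persistence images directly from data to make curvilinear structure segmentation more robust and accurate. Experiments indicate that topological features improve segmentation of medical images.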

Comments Accepted by IEEE International Symposium on Biomedical Imaging (ISBI) 2026. 5 pages, 3 figures

详情
Journal ref
2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), London, United Kingdom, 2026
AI中文摘要

在医学图像中分割曲率结构对于分析临床应用中的形态学模式至关重要。整合拓扑属性如连通性可提高分割的准确性和一致性。然而,从持续图(PD)中提取和嵌入这些属性具有挑战性,因为它们非可微且计算成本高。现有方法大多通过手工设计的损失函数编码拓扑,泛化能力差。本文提出PIs-Regressor,一个简单有效的模块,直接从数据中学习持续图像(PI)——拓扑特征的有限、可微表示。与Topology SegNet结合,该框架将拓扑整合到网络架构本身而非辅助损失中。与依赖手工损失函数的方法不同,我们的方法直接将拓扑信息整合到网络结构中,从而实现更稳健的分割。我们的设计灵活,可无缝结合其他拓扑方法以进一步提升分割性能。实验结果表明,整合拓扑特征增强了模型鲁棒性,有效处理医学图像中的过曝和模糊挑战。在三个曲率基准上,我们的方法在像素级准确性和拓扑保真度上均达到最先进的性能。

英文摘要

Segmenting curvilinear structures in medical images is essential for analyzing morphological patterns in clinical applications. Integrating topological properties, such as connectivity, improves segmentation accuracy and consistency. However, extracting and embedding such properties, especially from Persistence Diagrams (PDs), is challenging due to their non-differentiability and computational cost. Existing approaches mostly encode topology through handcrafted loss functions, which generalize poorly across tasks. In this paper, we propose PIs-Regressor, a simple yet effective module that learns persistence images (PIs), which are finite, differentiable representations of topological features, directly from data. Together with Topology SegNet, which fuses these features in both downsampling and upsampling stages, our framework integrates topology into the network architecture itself rather than auxiliary losses. Unlike existing methods that depend heavily on handcrafted loss functions, our approach directly incorporates topological information into the network structure, leading to more robust segmentation. Our design is flexible and can be seamlessly combined with other topology-based methods to further enhance segmentation performance. Experimental results show that integrating topological features enhances model robustness, effectively handling challenges like overexposure and blurring in medical imaging. On three curvilinear benchmarks, our approach demonstrates state-of-the-art performance in both pixel-level accuracy and topological fidelity.
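
For readers unfamiliar with persistence images, the sketch below rasterises a persistence diagram in the usual way: each (birth, death) pair is mapped to (birth, persistence), weighted by its persistence, and smoothed with a Gaussian on a fixed grid. It illustrates the kind of target a regressor could learn and is not the PIs-Regressor itself. The grid size, `sigma`, the linear weighting, and evaluating the Gaussian at pixel centres instead of integrating over each pixel are all assumptions made here.

```python
import math


def persistence_image(diagram, resolution=4, sigma=0.1, extent=1.0):
    """Rasterise [(birth, death), ...] into a resolution x resolution grid."""
    step = extent / resolution
    norm = 2 * math.pi * sigma ** 2
    image = [[0.0] * resolution for _ in range(resolution)]
    for birth, death in diagram:
        pers = death - birth   # map (birth, death) to (birth, persistence)
        weight = pers          # assumed linear weighting: long-lived features count more
        for row in range(resolution):
            for col in range(resolution):
                # Gaussian evaluated at the pixel centre (approximates the pixel integral).
                cx, cy = (col + 0.5) * step, (row + 0.5) * step
                d2 = (cx - birth) ** 2 + (cy - pers) ** 2
                image[row][col] += weight * math.exp(-d2 / (2 * sigma ** 2)) / norm
    return image


# One long-lived feature and one short-lived feature (made-up values).
img = persistence_image([(0.1, 0.9), (0.3, 0.4)])
for row in img:
    print([round(v, 2) for v in row])
```

Rows index persistence and columns index birth, so the long-lived feature dominates the cell near birth 0.1 and persistence 0.8. The result is a fixed-size array, which is what makes this representation convenient as a network input or regression target.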

2411.13602 2026-06-16 eess.IV cs.AI cs.CV 版本更新

Translating Electrocardiograms to Cardiac Magnetic Resonance Imaging Useful for Cardiac Assessment and Disease Screening: A Multi-Center Study

将心电图转换为心脏磁共振成像对心脏评估和疾病筛查有用:一项多中心研究

Zhengyao Ding, Ziyu Li, Yujian Hu, Youyao Xu, Chengchen Zhao, Yiheng Mao, Haitao Li, Zhikang Li, Qian Li, Jing Wang, Yue Chen, Mengjia Chen, Longbo Wang, Xuesen Chu, Weichao Pan, Ziyi Liu, Fei Wu, Hongkun Zhang, Ting Chen, Zhengxing Huang

发表机构 * College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) Department of Vascular Surgery, The First Affiliated Hospital of Zhejiang University School of Medicine(浙江大学医学院附属第一医院血管外科) Department of Cardiology, The First Affiliated Hospital, Zhejiang University School of Medicine(浙江大学医学院附属第一医院心内科) Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine(浙江大学医学院附属第一医院放射科) Department of Vascular Surgery, Quzhou People’s Hospital(衢州人民医院血管外科) Department of Cardiology, The Second Affiliated Hospital of Zhejiang University School of Medicine(浙江大学医学院附属第二医院心内科) China Ship Scientific Research Center(中国船舶科学研究院) Guangdong Transtek Medical Electronics Co., Ltd.(广东 Transtek 医疗电子有限公司)

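AI summary Proposes CardioNets, a deep learning framework that translates 12-lead ECG signals into CMR-level functional parameters and synthetic images, aiming to make large-scale cardiovascular disease screening more efficient and accessible.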

Comments 29 pages, 7 figures

详情
Journal ref
NEJM AI 2026;3(4)
AI中文摘要

心血管疾病(CVDs)是全球死亡的主要原因,需要可访问且准确的诊断工具。尽管心脏磁共振成像(CMR)提供心脏结构和功能的金标准见解,但其临床效用受到高成本和复杂性的限制。相比之下,心电图(ECG)成本低且广泛可用,但缺乏CMR的粒度。我们提出CardioNets,一种深度学习框架,将12导联ECG信号转换为CMR级别的功能参数和合成图像,从而实现可扩展的心脏评估。CardioNets整合了跨模态对比学习和生成预训练,对齐ECG与CMR衍生的心脏表型,并通过掩码自回归模型合成高分辨率CMR图像。在159,819个样本上训练,包括英国生物库(n=42,483)和MIMIC-IV-ECG(n=164,550),并在独立临床数据集(n=3,767)上进行外部验证,CardioNets在疾病筛查和表型估计任务中表现出色。在英国生物库中,它将心脏表型回归R2提高了24.8%,并使心肌病AUC提高了高达39.3%。在MIMIC中,它将肺动脉高压检测的AUC提高了5.6%。生成的CMR图像在SSIM和PSNR方面分别比先前方法高36.6%和8.7%。在一项读者研究中,仅使用ECG的CardioNets在准确率上比同时使用ECG和真实CMR的人类医生高13.9%。这些结果表明,CardioNets为大规模CVD筛查提供了一个有前景的低成本替代方案,特别是在资源有限的环境中。未来的工作将专注于临床部署和ECG基于合成成像的监管验证。

英文摘要

Cardiovascular diseases (CVDs) are the leading cause of global mortality, necessitating accessible and accurate diagnostic tools. While cardiac magnetic resonance imaging (CMR) provides gold-standard insights into cardiac structure and function, its clinical utility is limited by high cost and complexity. In contrast, electrocardiography (ECG) is inexpensive and widely available but lacks the granularity of CMR. We propose CardioNets, a deep learning framework that translates 12-lead ECG signals into CMR-level functional parameters and synthetic images, enabling scalable cardiac assessment. CardioNets integrates cross-modal contrastive learning and generative pretraining, aligning ECG with CMR-derived cardiac phenotypes and synthesizing high-resolution CMR images via a masked autoregressive model. Trained on 159,819 samples from five cohorts, including the UK Biobank (n=42,483) and MIMIC-IV-ECG (n=164,550), and externally validated on independent clinical datasets (n=3,767), CardioNets achieved strong performance across disease screening and phenotype estimation tasks. In the UK Biobank, it improved cardiac phenotype regression R2 by 24.8% and cardiomyopathy AUC by up to 39.3% over baseline models. In MIMIC, it increased AUC for pulmonary hypertension detection by 5.6%. Generated CMR images showed 36.6% higher SSIM and 8.7% higher PSNR than prior approaches. In a reader study, ECG-only CardioNets achieved 13.9% higher accuracy than human physicians using both ECG and real CMR. These results suggest that CardioNets offers a promising, low-cost alternative to CMR for large-scale CVD screening, particularly in resource-limited settings. Future efforts will focus on clinical deployment and regulatory validation of ECG-based synthetic imaging.

2512.00572 2026-06-16 cs.CV cs.AI 版本更新

Integrating Skeleton Based Representations for Robust Yoga Pose Classification Using Deep Learning Models

基于骨架表示的瑜伽姿势分类深度学习模型整合

Mohammed Mohiuddin, Syed Mohammod Minhaz Hossain, Sumaiya Khanam, Prionkar Barua, Aparup Barua, MD Tamim Hossain

发表机构 * Department of Computer Science and Engineering, Premier University(计算机科学与工程系,普里梅尔大学)

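AI summary Introduces the Yoga-16 dataset and systematically evaluates three deep learning models, showing that skeleton-based representations outperform raw images for yoga pose classification. VGG16 with MediaPipe Pose skeleton input reaches 96.09% accuracy.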

详情
AI中文摘要

瑜伽因其精神和身体健康益处而全球流行,但错误姿势可能导致受伤。自动化瑜伽姿势分类因此变得重要,以减少对专家的依赖。尽管人类姿态关键点提取模型在动作识别中表现出潜力,但系统化的瑜伽姿势识别基准评估仍有限,因为先前工作通常仅关注原始图像或单一姿态提取模型。本文引入了'Yoga-16'数据集,以解决现有数据集的限制,并系统评估了三种深度学习架构(VGG16、ResNet50和Xception),使用三种输入模式(直接图像、MediaPipe Pose骨架图像和YOLOv8 Pose骨架图像)。我们的实验表明,基于骨架的表示优于原始图像输入,VGG16与MediaPipe Pose骨架输入的最高准确率为96.09%。此外,我们通过Grad-CAM进行可解释性分析,提供瑜伽姿势分类的模型决策洞察,通过交叉验证分析。

英文摘要

Yoga is a popular form of exercise worldwide due to its spiritual and physical health benefits, but incorrect postures can lead to injuries. Automated yoga pose classification has therefore gained importance to reduce reliance on expert practitioners. While human pose keypoint extraction models have shown high potential in action recognition, systematic benchmarking for yoga pose recognition remains limited, as prior works often focus solely on raw images or a single pose extraction model. In this study, we introduce a curated dataset, 'Yoga-16', which addresses limitations of existing datasets, and systematically evaluate three deep learning architectures (VGG16, ResNet50, and Xception), using three input modalities (direct images, MediaPipe Pose skeleton images, and YOLOv8 Pose skeleton images). Our experiments demonstrate that skeleton-based representations outperform raw image inputs, with the highest accuracy of 96.09% achieved by VGG16 with MediaPipe Pose skeleton input. Additionally, we provide interpretability analysis using Grad-CAM, offering insights into model decision-making for yoga pose classification with cross-validation analysis.

2501.05436 2026-06-16 cs.CV 版本更新

Scale-invariant brain morphometry: application to sulcal depth

标度不变脑形态学:应用于沟回深度

Maxime Dieudonné, Guillaume Auzias, Julien Lefèvre

发表机构 * Institut National de la Santé et de la Recherche Médicale (INSERM), U954, Université de Nantes(法国国家卫生与医学研究院(INSERM)U954,南特大学)

AI总结 本文研究了脑大小对沟回深度形态学特征的影响,提出了一种标度不变的沟回深度估计方法,并通过大规模样本验证了其生物学意义。

Comments GA and JL contributed equally to this work

详情
Journal ref
Computers in Biology and Medicine, Volume 212, 2026, 111754, ISSN 0010-4825
AI中文摘要

人类皮层的几何结构复杂且高度变异,文献中已明确记录了脑大小、皮层折叠和年龄之间的相互作用。然而,很少有研究探讨全局脑大小如何影响从解剖MRI中获得的皮层表面形态测量特征。在本工作中,我们关注沟回深度,这一成像表型在基础研究和临床应用中都受到关注。我们的主要贡献包括:1)首次定量分析脑大小对沟回深度测量的影响;2)基于对该问题的一种全新形式化,引入一种新颖的标度不变沟回深度估计方法;3)提出验证框架,并向社区共享代码和基准数据;4)利用涵盖从受孕后26周到成年期的1,987名受试者的大样本,展示新沟回深度测量的生物学相关性。

英文摘要

The geometry of the human cortex is complex and highly variable, with interactions between brain size, cortical folding, and age well-documented in the literature. However, few studies have explored how global brain size influences morphometry features of the cortical surface derived from anatomical MRI. In this work, we focus on sulcal depth, an imaging phenotype that has gained attention in both basic research and clinical applications. We make key contributions to the field by: 1) providing the first quantitative analysis of the influence of brain size on sulcal depth measurements; 2) introducing a novel, scale-invariant method for sulcal depth estimation based on an original formalization of the problem; 3) presenting a validation framework and sharing our code and benchmark data with the community; and 4) demonstrating the biological relevance of our new sulcal depth measure using a large sample of 1,987 subjects spanning the developmental period from 26 weeks post-conception to adulthood.

2410.20202 2026-06-16 cs.CV 版本更新

An Efficient Watermarking Method for Latent Diffusion Models via Low-Rank Adaptation and Dynamic Loss Weighting

通过低秩适应与动态损失加权实现潜在扩散模型的高效水印方法

Dongdong Lin, Yue Li, Benedetta Tondi, Kaiqing Lin, Bin Li, Mauro Barni

发表机构 * Xiamen Key Laboratory of Data Security and Blockchain Technology, Huaqiao University, Xiamen 361021, China(厦门数据安全与区块链技术重点实验室,华侨大学,厦门361021,中国) Department of Information Engineering and Mathematics of the University of Siena, Italy(意大利锡耶纳大学信息工程与数学系) Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, China(广东省智能信息处理重点实验室,深圳大学,深圳518060,中国) Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen 518060, China(深圳媒体安全重点实验室,深圳大学,深圳518060,中国) SZU-AFS Joint Innovation Center for AI Technology, Shenzhen University, Shenzhen 518060, China(深圳大学人工智能技术联合创新中心)

AI总结 本文提出基于低秩适应的潜在扩散模型高效水印方法,通过动态损失加权平衡生成质量与水印保真度,实现快速准确的水印嵌入且不影响生成图像质量。

详情
Journal ref
Expert Systems with Applications. 331 (2026) 133172
AI中文摘要

深度神经网络的快速普及推动了模型水印技术的发展,因为训练模型本身是有价值的知识产权。现有水印方法主要修改模型参数或改变采样行为。然而,随着模型规模的增大,提高水印嵌入效率以管理日益增长的计算需求变得至关重要。本文提出了一种基于低秩适应的潜在扩散模型(LDM)高效水印方法。核心思想是将可训练的低秩参数引入冻结的LDM中以嵌入水印,从而保持原始模型权重的完整性。此外,设计了一个动态损失权重调度器,以适应性地平衡生成质量和水印保真度,使模型能够以最小影响生成图像质量的方式实现有效的水印嵌入。实验结果表明,所提出的方法确保了快速且准确的水印嵌入,并保持了高质量的生成图像,同时在某些情况下与最先进的方法相比具有同等或更高的鲁棒性。此外,该方法在不同数据集和基础LDM上具有良好的泛化能力。代码可在:https://github.com/MrDongdongLin/EW-LoRA 上获取。

英文摘要

The rapid proliferation of Deep Neural Networks (DNNs) is driving a surge in model watermarking technologies, as the trained models themselves constitute valuable intellectual property. Existing watermarking approaches primarily focus on modifying model parameters or altering sampling behaviors. However, with the emergence of increasingly large models, improving the efficiency of watermark embedding becomes essential to manage increasing computational demands. Prioritizing efficiency not only optimizes resource utilization, making the watermarking process more applicable to large models, but also mitigates potential degradation of model performance. In this paper, we propose an efficient watermarking method for Latent Diffusion Models (LDMs) based on Low-Rank Adaptation (LoRA). The core idea is to introduce trainable low-rank parameters into the frozen LDM to embed the watermark, thereby preserving the integrity of the original model weights. Furthermore, a dynamic loss weight scheduler is designed to adaptively balance the objectives of generative quality and watermark fidelity, enabling the model to achieve effective watermark embedding with minimal impact on the quality of the generated images. Experimental results show that the proposed method ensures fast and accurate watermark embedding and high-quality generated images, while maintaining a level of robustness on par with, and in some cases superior to, state-of-the-art approaches. Moreover, the method generalizes well across different datasets and base LDMs. Code is available at: https://github.com/MrDongdongLin/EW-LoRA.
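The low-rank adaptation idea mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation (see their repository for that); it only shows the generic LoRA update, in which a frozen weight W is combined with a trainable product B·A of rank r. The layer sizes, the `scale` parameter, and all function names are assumptions chosen for illustration.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w_frozen, lora_a, lora_b, scale=1.0):
    """Return W + scale * (B @ A) without modifying the frozen weight W."""
    delta = matmul(lora_b, lora_a)  # (out x r) @ (r x in) -> (out x in)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(w_frozen, delta)]

# Hypothetical sizes: a 4x6 frozen layer with rank-2 adapters.
out_dim, in_dim, rank = 4, 6, 2
rng = random.Random(0)
w_frozen = [[rng.gauss(0, 1) for _ in range(in_dim)] for _ in range(out_dim)]
lora_a = [[rng.gauss(0, 0.01) for _ in range(in_dim)] for _ in range(rank)]
lora_b = [[0.0] * rank for _ in range(out_dim)]  # zero init: adapter starts as a no-op

w_eff = lora_effective_weight(w_frozen, lora_a, lora_b)
trainable = rank * (in_dim + out_dim)  # parameters in A and B
full = in_dim * out_dim                # parameters in W
```

Because B starts at zero, the adapted layer initially behaves exactly like the frozen one, and only A and B would be updated during watermark embedding. The parameter saving grows with layer size, since the adapter count scales with r·(in + out) rather than in·out.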

2410.13439 2026-06-16 cs.LG cs.CL cs.CV 版本更新

Similarity-Dissimilarity Loss for Multi-label Supervised Contrastive Learning

多标签监督对比学习中的相似性-差异性损失

Guangming Huang, Yunfei Long, Cunjin Luo

发表机构 * University of Essex(埃塞克斯大学) Queen Mary University of London(伦敦大学玛丽女王学院)

AI总结 本文提出相似性-差异性损失,通过动态加权样本解决多标签场景下正样本确定问题,提供理论证明并统一单标签与多标签对比学习框架,实验表明方法在图像、文本和医疗领域均优于基线。

Comments Accepted by Transactions on Machine Learning Research (TMLR)

详情
AI中文摘要

监督对比学习通过利用标签信息取得了显著成功;然而,在多标签场景中确定正样本仍是一个关键挑战。在多标签监督对比学习(MSCL)中,多标签关系尚未完全定义,导致正样本识别和对比损失函数构建存在歧义。为解决这些挑战,我们:(i)系统地制定了MSCL中的多标签关系;(ii)提出了一种新颖的相似性-差异性损失,根据相似性和差异性因素动态重新加权样本;(iii)通过严谨的数学分析提供了理论支持,支持我们的方法制定和有效性;(iv)为单标签和多标签监督对比损失提供统一形式和范式。我们在图像和文本模态上进行了实验,并进一步将其扩展到医疗领域。结果表明,我们的方法在全面评估中始终优于基线,证明了其有效性和鲁棒性。

英文摘要

Supervised contrastive learning has achieved remarkable success by leveraging label information; however, determining positive samples in multi-label scenarios remains a critical challenge. In multi-label supervised contrastive learning (MSCL), multi-label relations are not yet fully defined, leading to ambiguity in identifying positive samples and formulating contrastive loss functions to construct the representation space. To address these challenges, we: (i) systematically formulate multi-label relations in MSCL, (ii) propose a novel Similarity-Dissimilarity Loss, which dynamically re-weights samples based on similarity and dissimilarity factors, (iii) further provide theoretically grounded proofs for our method through rigorous mathematical analysis that supports the formulation and effectiveness, and (iv) offer a unified form and paradigm for both single-label and multi-label supervised contrastive loss. We conduct experiments on both image and text modalities and further extend the evaluation to the medical domain. The results show that our method consistently outperforms baselines in comprehensive evaluations, demonstrating its effectiveness and robustness.

2509.00176 2026-06-16 cs.CV cs.AI 版本更新

Waste-Bench: A Comprehensive Benchmark for Evaluating VLLMs in Cluttered Environments

Waste-Bench: 一个用于评估在杂乱环境中视觉大型语言模型性能的综合基准

Muhammad Ali, Salman Khan

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文提出Waste-Bench基准,用于评估VLLMs在复杂环境中的鲁棒性和准确性,揭示了提升VLLM在复杂环境性能的必要性。

详情
Journal ref
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), pp. 31019-31032, 2025
AI中文摘要

近年来,大型语言模型(LLMs)的进步为能够执行广泛视觉理解任务的视觉大型语言模型(VLLMs)铺平了道路。尽管LLMs在标准自然图像上表现出色,但其在杂乱数据集中的能力尚未得到充分探索,其中包含复杂环境和变形形状的对象。在本工作中,我们引入了一个专门设计用于现实场景中垃圾分类的新型数据集,其特点是有复杂的环境和变形形状的对象。此外,我们还提出了一种深入的评估方法,以严格评估VLLMs的鲁棒性和准确性。所引入的数据集和全面分析为VLLMs在挑战性条件下性能提供了有价值的见解。我们的发现强调了进一步提升VLLM鲁棒性以在复杂环境中表现更好的重要性。数据集和实验代码将公开发布。

英文摘要

Recent advancements in Large Language Models (LLMs) have paved the way for Vision Large Language Models (VLLMs) capable of performing a wide range of visual understanding tasks. While LLMs have demonstrated impressive performance on standard natural images, their capabilities have not been thoroughly explored on cluttered datasets that feature complex environments and objects with deformed shapes. In this work, we introduce a novel dataset specifically designed for waste classification in real-world scenarios, characterized by complex environments and deformed objects. Along with this dataset, we present an in-depth evaluation approach to rigorously assess the robustness and accuracy of VLLMs. The introduced dataset and comprehensive analysis provide valuable insights into the performance of VLLMs under challenging conditions. Our findings highlight the critical need for further advancements in VLLM robustness so that these models perform better in complex environments. The dataset and code for our experiments will be made publicly available.

2505.15408 2026-06-16 cs.CV 版本更新

Mouse Lockbox Dataset: Behavior Recognition for Mice Solving Lockboxes

小鼠锁盒数据集:小鼠解决锁盒的行为识别

Patrik Reiske, Marcus N. Boon, Niek Andresen, Sole Traverso, Katharina Hohlbaum, Lars Lewejohann, Christa Thöne-Reineke, Olaf Hellwich, Henning Sprekeler

发表机构 * Max Planck Institute for Biological Cybernetics, Berlin, Germany(德国柏林马克斯·普朗克生物控制论研究所)

AI总结 本文提出一个包含小鼠解决复杂机械谜题的视频数据集,用于评估帧级动作分类方法,提供人工标注标签以研究细粒度行为自动标注的挑战。

Comments Accepted and published (poster) at the CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling workshop, in conjunction with Computer Vision and Pattern Recognition (CVPR) 2025

详情
AI中文摘要

机器学习和计算机视觉方法能够(半)自动地分析海量视频数据,因而对自然动物行为的研究产生了重大影响。小鼠是大多数研究领域中的标准哺乳动物模型系统,但目前可用于改进这类方法的数据集只关注简单行为或社交行为。本文提出一个视频数据集,记录单只小鼠解决复杂机械谜题(即所谓的锁盒)的过程。视频总时长超过110小时,从三个不同视角记录了小鼠的行为。作为帧级动作分类方法的基准,我们为两只不同小鼠的全部视频提供了人工标注标签,占数据集的13%。我们基于关键点(姿态)跟踪的动作分类框架展示了自动标注细粒度行为(如物体操作)所面临的挑战。我们希望这项工作能加速计算神经科学领域自动动作与行为分类的发展。数据集可公开访问:https://doi.org/10.14279/depositonce-23850

英文摘要

Machine learning and computer vision methods have a major impact on the study of natural animal behavior, as they enable the (semi-)automatic analysis of vast amounts of video data. Mice are the standard mammalian model system in most research fields, but the datasets available today to refine such methods focus either on simple or social behaviors. In this work, we present a video dataset of individual mice solving complex mechanical puzzles, so-called lockboxes. The more than 110 hours of total playtime show their behavior recorded from three different perspectives. As a benchmark for frame-level action classification methods, we provide human-annotated labels for all videos of two different mice, that equal 13% of our dataset. Our keypoint (pose) tracking-based action classification framework illustrates the challenges of automated labeling of fine-grained behaviors, such as the manipulation of objects. We hope that our work will help accelerate the advancement of automated action and behavior classification in the computational neuroscience community. Our dataset is publicly available at https://doi.org/10.14279/depositonce-23850

2411.07742 2026-06-16 cs.CV 版本更新

Efficient 3D Perception on Multi-Sweep Point Cloud with Gumbel Spatial Pruning

基于Gumbel空间剪枝的多帧扫描点云高效3D感知

Tianyu Sun, Jianhao Li, Xueqian Zhang, Zhongdao Wang, Bailan Feng, Hengshuang Zhao

发表机构 * Department of Electronic Engineering, Tsinghua University(清华大学电子工程系) Department of Computer Science and Engineering, Beihang University(北航计算机科学与工程学院) Noah’s Ark Lab(诺亚实验室) Department of Computer Science, University of Hong Kong(香港大学计算机科学系)

AI总结 本文研究了户外环境中点云感知问题,通过累积多个连续点云扫描以提高感知精度,引入Gumbel空间修剪层有效减少冗余点,提升3D感知性能。

详情
AI中文摘要

本文研究了户外环境中点云感知问题。现有方法在远距离或遮挡物体识别上受限,因户外点云稀疏。本文通过累积多个连续点云扫描显著缓解该问题,但计算成本增加阻碍了大量点云扫描的使用。我们发现累积点云中大部分点冗余,剔除这些点对感知精度影响小。引入简单有效的Gumbel空间修剪(GSP)层,基于端到端采样动态修剪点。GSP层与其他网络组件解耦,可无缝集成到现有点云网络中。无需额外计算开销,将点云扫描数从10增加到40,显著提升感知性能。例如,在nuScenes 3D目标检测和BEV地图分割任务中,我们的修剪策略改进了多种3D感知基线方法。

英文摘要

This paper studies point cloud perception within outdoor environments. Existing methods face limitations in recognizing objects located at a distance or occluded, due to the sparse nature of outdoor point clouds. In this work, we observe a significant mitigation of this problem by accumulating multiple temporally consecutive point cloud sweeps, resulting in a remarkable improvement in perception accuracy. However, the computation cost also increases, hindering previous approaches from utilizing a large number of point cloud sweeps. To tackle this challenge, we find that a considerable portion of points in the accumulated point cloud is redundant, and discarding these points has minimal impact on perception accuracy. We introduce a simple yet effective Gumbel Spatial Pruning (GSP) layer that dynamically prunes points based on a learned end-to-end sampling. The GSP layer is decoupled from other network components and thus can be seamlessly integrated into existing point cloud network architectures. Without incurring additional computational overhead, we increase the number of point cloud sweeps from 10, a common practice, to as many as 40. Consequently, there is a significant enhancement in perception performance. For instance, in nuScenes 3D object detection and BEV map segmentation tasks, our pruning strategy improves several 3D perception baseline methods.
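The abstract above describes pruning points through a learned, end-to-end sampling step. A minimal sketch of one common way to realize such a step, the Gumbel-max trick with a relaxed (soft) probability, is given below. It is an assumption-laden illustration, not the GSP layer itself: the per-point `keep_logits`, the fixed drop logit of 0, the temperature `tau`, and all function names are hypothetical.

```python
import math
import random

def gumbel_noise(rng):
    """Draw one sample from the standard Gumbel(0, 1) distribution."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def gumbel_keep_mask(keep_logits, rng, tau=1.0):
    """Sample a hard keep/drop decision per point via the Gumbel-max trick.

    Each point has a 'keep' logit; the 'drop' logit is fixed at 0 (an assumption).
    Returns (hard_mask, soft_probs). In an autodiff framework the soft
    probabilities would carry the gradient (straight-through estimator).
    """
    hard, soft = [], []
    for logit in keep_logits:
        z_keep = (logit + gumbel_noise(rng)) / tau
        z_drop = gumbel_noise(rng) / tau
        soft.append(sigmoid(z_keep - z_drop))  # two-way softmax as a sigmoid
        hard.append(z_keep > z_drop)
    return hard, soft

# Hypothetical accumulated points with scores from some upstream predictor.
points = [(0.0, 0.0, 0.0), (1.0, 2.0, 0.5), (40.0, 3.0, 1.0), (41.0, 3.1, 1.0)]
keep_logits = [100.0, -100.0, 0.3, -0.2]
mask, probs = gumbel_keep_mask(keep_logits, random.Random(0))
kept_points = [p for p, keep in zip(points, mask) if keep]
```

Points with strongly positive logits are almost always kept and strongly negative ones almost always dropped, while borderline points are kept stochastically, which lets training explore different pruning patterns.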

2504.18179 2026-06-16 cs.CV cs.LG 版本更新

Label-independent hyperparameter-free self-supervised single-view deep subspace clustering

与标签无关、无需超参数的自监督单视图深度子空间聚类

Lovro Sindicic, Ivica Kopriva

发表机构 * Division of Computing and Data Science, Ruđer Bošković Institute(计算与数据科学系,鲁德·博克维奇研究所)

AI总结 本文提出一种无需超参数调节的单视图深度子空间聚类方法,通过层间自表达损失、子空间结构范数优化、多阶段学习框架和相对误差终止机制提升聚类性能。

Comments 35 pages; 1 figure; 10 Tables

详情
AI中文摘要

深度子空间聚类(DSC)算法面临多个挑战,限制了其在各种应用领域中的广泛应用。首先,聚类质量通常仅通过编码器的输出层评估,忽略了中间层中的有价值信息。其次,大多数DSC方法将表示学习和子空间聚类视为独立任务,限制了其有效性。第三,它们假设可以使用一个留出的数据集进行超参数调节,这在实际场景中往往不现实。第四,学习终止通常基于聚类误差监控,需要外部标签。最后,其性能通常依赖于需要标注数据的后处理技术。为了解决这些限制,我们引入了一种新的单视图DSC方法:(i) 使用联合表示矩阵最小化逐层自表达损失;(ii) 优化子空间结构范数以提高聚类质量;(iii) 采用由预训练和微调组成的多阶段顺序学习框架,从而无需超参数调节即可使用多个正则化项;(iv) 引入基于相对误差的自停止机制,在不使用标签的情况下终止训练;(v) 根据先验知识在学习到的表示矩阵中保留固定数量的领先系数。我们在六个涵盖人脸、数字和物体的数据集上评估了所提出的方法。结果表明,我们的方法优于大多数经过精心调参的线性SC算法,同时与表现最佳的线性方法相比仍具竞争力。

英文摘要

Deep subspace clustering (DSC) algorithms face several challenges that hinder their widespread adoption across various application domains. First, clustering quality is typically assessed using only the encoder's output layer, disregarding valuable information present in the intermediate layers. Second, most DSC approaches treat representation learning and subspace clustering as independent tasks, limiting their effectiveness. Third, they assume the availability of a held-out dataset for hyperparameter tuning, which is often impractical in real-world scenarios. Fourth, learning termination is commonly based on clustering error monitoring, requiring external labels. Finally, their performance often depends on post-processing techniques that rely on labeled data. To address these limitations, we introduce a novel single-view DSC approach that: (i) minimizes a layer-wise self-expression loss using a joint representation matrix; (ii) optimizes a subspace-structured norm to enhance clustering quality; (iii) employs a multi-stage sequential learning framework, consisting of pre-training and fine-tuning, enabling the use of multiple regularization terms without hyperparameter tuning; (iv) incorporates a relative error-based self-stopping mechanism to terminate training without labels; and (v) retains a fixed number of leading coefficients in the learned representation matrix based on prior knowledge. We evaluate the proposed method on six datasets representing faces, digits, and objects. The results show that our method outperforms most linear SC algorithms with carefully tuned hyperparameters while maintaining competitive performance with the best-performing linear approaches.
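Item (iv) above, stopping training from the loss alone rather than from labels, can be sketched as follows. The abstract does not give the exact criterion, so the rule below (relative change in loss below a tolerance for several consecutive epochs) and the `tol` and `patience` parameters are assumptions made for illustration.

```python
def should_stop(losses, tol=1e-3, patience=3):
    """Return True once the relative change in loss stays below `tol`
    for `patience` consecutive epochs. No labels are involved."""
    if len(losses) < patience + 1:
        return False
    recent = losses[-(patience + 1):]
    for prev, curr in zip(recent, recent[1:]):
        denom = max(abs(prev), 1e-12)  # guard against division by zero
        if abs(prev - curr) / denom >= tol:
            return False
    return True

# Hypothetical loss history: still improving, then flat.
still_improving = should_stop([1.0, 0.5, 0.25])
just_flattened = should_stop([1.0, 0.5, 0.5, 0.5])
converged = should_stop([1.0, 0.5, 0.5, 0.5, 0.5])
```

Normalizing by the previous loss makes the tolerance independent of the loss scale, which matters when several regularization terms with different magnitudes are summed.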

2502.05214 2026-06-16 eess.IV cs.AI cs.CV 版本更新

CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models

CoRPA: 基于概念向量扰动和生成模型的胸部X光图像对抗生成

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

发表机构 * School of Informatics, University of Edinburgh(信息学院,爱丁堡大学) NHS Lothian(NHS洛锡安)

AI总结 本文提出CoRPA,一种针对医学影像领域的临床聚焦对抗攻击框架,通过概念向量扰动生成对抗性影像报告和图像,揭示医疗AI在真实临床场景下的脆弱性。

详情
AI中文摘要

深度学习模型在医学图像分类任务中的应用日益广泛,旨在提高诊断准确性、减轻医务人员负担并改善患者预后。然而,其对对抗攻击的脆弱性对患者安全构成重大风险。当前攻击方法使用通用技术如模型查询或像素值扰动生成对抗样本以欺骗模型。这些方法可能无法充分解决源于临床错误的特征遗漏或误识别问题。我们提出基于概念的报告扰动攻击(CoRPA),一种专注于临床的黑盒对抗攻击框架,专门针对医学影像领域。CoRPA利用临床概念生成对抗性放射学报告和图像,以接近现实的临床误诊场景。我们使用MIMIC-CXR-JPG数据集中的胸部X光影像和放射学报告验证了CoRPA的实用性。评估显示,对传统对抗攻击具有强大鲁棒性的深度学习模型在面对CoRPA的临床聚焦扰动时显著更脆弱。这突显了在医疗AI系统中解决领域特定脆弱性的重要性。通过引入专门的对抗攻击框架,本研究为开发在真实世界中可靠、安全的AI模型提供了基础,确保其在高风险临床环境中的安全可靠部署。

英文摘要

Deep learning models for medical image classification tasks are becoming widely implemented in AI-assisted diagnostic tools, aiming to enhance diagnostic accuracy, reduce clinician workloads, and improve patient outcomes. However, their vulnerability to adversarial attacks poses significant risks to patient safety. Current attack methodologies use general techniques such as model querying or pixel value perturbations to generate adversarial examples designed to fool a model. These approaches may not adequately address the unique characteristics of clinical errors stemming from missed or incorrectly identified clinical features. We propose the Concept-based Report Perturbation Attack (CoRPA), a clinically-focused black-box adversarial attack framework tailored to the medical imaging domain. CoRPA leverages clinical concepts to generate adversarial radiological reports and images that closely mirror realistic clinical misdiagnosis scenarios. We demonstrate the utility of CoRPA using the MIMIC-CXR-JPG dataset of chest X-rays and radiological reports. Our evaluation reveals that deep learning models exhibiting strong resilience to conventional adversarial attacks are significantly less robust when subjected to CoRPA's clinically-focused perturbations. This underscores the importance of addressing domain-specific vulnerabilities in medical AI systems. By introducing a specialized adversarial attack framework, this study provides a foundation for developing robust, real-world-ready AI models in healthcare, ensuring their safe and reliable deployment in high-stakes clinical environments.

2403.19444 2026-06-16 cs.LG cs.CV 版本更新

Leveraging Expert Input for Robust and Explainable AI-Assisted Lung Cancer Detection in Chest X-rays

利用专家输入实现稳健且可解释的AI辅助胸部X光肺癌检测

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

发表机构 * School of Informatics, University of Edinburgh(信息学院,爱丁堡大学) NHS Lothian(洛锡安国家健康服务)

AI总结 本文研究了基于InceptionV3的肺癌检测模型的可解释性和鲁棒性,提出ClinicXAI方法,通过专家驱动的思路生成临床相关解释,并在对抗攻击下表现出更强的鲁棒性。

详情
AI中文摘要

深度学习模型在推动AI辅助医学诊断方面显示出巨大潜力,特别是在通过胸部X光等医学图像模态检测肺癌方面。然而,这些模型的黑盒性质对可解释性和可信度构成挑战,限制了其在临床中的应用。本研究评估了基于InceptionV3的高性能肺癌检测模型的可解释性和鲁棒性,利用公开的胸部X光和放射学报告数据集。我们评估了多种可解释AI(XAI)技术的临床效用,包括事后(post-hoc)和事前(ante-hoc)方法,并发现现有方法常无法提供临床相关解释,存在不一致性和与放射科专家评估的偏离。为解决这些限制,我们与放射科医生合作定义诊断特定的临床概念,并开发了ClinicXAI,一种专家驱动的方法,利用概念瓶颈方法。ClinicXAI生成具有临床意义的解释,与临床医生的实践需求紧密相关,同时保持高诊断准确性。我们还通过一系列广泛使用的对抗攻击测试ClinicXAI与原始InceptionV3模型的鲁棒性。我们的分析表明,ClinicXAI在对抗扰动下表现出显著更强的鲁棒性。这些发现强调了在医学诊断中将领域专业知识纳入可解释和鲁棒AI系统设计的重要性,为医疗领域更可信和有效的AI解决方案铺平道路。

英文摘要

Deep learning models show significant potential for advancing AI-assisted medical diagnostics, particularly in detecting lung cancer through medical image modalities such as chest X-rays. However, the black-box nature of these models poses challenges to their interpretability and trustworthiness, limiting their adoption in clinical practice. This study examines both the interpretability and robustness of a high-performing lung cancer detection model based on InceptionV3, utilizing a public dataset of chest X-rays and radiological reports. We evaluate the clinical utility of multiple explainable AI (XAI) techniques, including both post-hoc and ante-hoc approaches, and find that existing methods often fail to provide clinically relevant explanations, displaying inconsistencies and divergence from expert radiologist assessments. To address these limitations, we collaborated with a radiologist to define diagnosis-specific clinical concepts and developed ClinicXAI, an expert-driven approach leveraging the concept bottleneck methodology. ClinicXAI generated clinically meaningful explanations which closely aligned with the practical requirements of clinicians while maintaining high diagnostic accuracy. We also assess the robustness of ClinicXAI in comparison to the original InceptionV3 model by subjecting both to a series of widely utilized adversarial attacks. Our analysis demonstrates that ClinicXAI exhibits significantly greater resilience to adversarial perturbations. These findings underscore the importance of incorporating domain expertise into the design of interpretable and robust AI systems for medical diagnostics, paving the way for more trustworthy and effective AI solutions in healthcare.