arXivDaily arXiv Daily Academic Digest, updated Monday to Friday
1. Multimodal and Vision-Language Models (12 papers)

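2606.17296 2026-06-17 cs.CV New submission

Pareto LoRA: Mitigating Modality Imbalance in Unified Multimodal Models via Pareto-Optimal Gradient Integration

Xiwen Wei, Mark Nutter, Madhusudhanan Srinivasan, Radu Marculescu

Affiliations The University of Texas at Austin; Advanced Micro Devices, Inc.

AI summary To address the drop in image generation quality caused by language gradients dominating optimization during LoRA fine-tuning of unified multimodal models, the paper proposes Pareto LoRA, a Pareto-optimal gradient integration strategy that balances text and image objectives by modulating gradient direction and strength, improving perceptual image quality on the CoMM benchmark by up to 44.9%.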

Abstract

Unified multimodal models (UMMs) have recently emerged as a promising paradigm for integrating multimodal understanding and generation within a single autoregressive transformer. However, during multimodal instruction tuning, these models often exhibit pronounced modality imbalance: language gradients dominate optimization, thus leading to lower image generation quality, especially under parameter-efficient fine-tuning such as LoRA. In this work, we systematically analyze modality imbalance in LoRA-based fine-tuning of UMMs for interleaved text-image generation. We show that vision modality performance degrades substantially more than text modality performance when compared to unimodal counterparts, and that modality-specific gradients can differ by orders of magnitude across various tasks and layers. Motivated by this observation, we reformulate the multimodal instruction tuning as a bi-objective optimization problem and propose Pareto LoRA, a Pareto-optimal gradient integration strategy that balances the text and image objectives by modulating the gradient direction and strength. Experiments on the CoMM benchmark with Emu2 demonstrate that Pareto LoRA consistently improves multimodal generation balance, achieving up to 44.9% gains in perceptual image quality over vanilla LoRA while maintaining comparable text performance.
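
As a rough illustration of bi-objective gradient balancing, the sketch below computes the minimum-norm convex combination of a text gradient and an image gradient, which is the closed-form two-objective case of the multiple-gradient descent algorithm. This is a generic technique swapped in for illustration, not the Pareto LoRA update itself, which the abstract does not specify; the function names and the toy gradients are assumptions.

```python
def min_norm_weight(g_text, g_image):
    """Weight alpha in [0, 1] minimizing ||alpha*g_text + (1-alpha)*g_image||^2."""
    diff = [a - b for a, b in zip(g_text, g_image)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:  # identical gradients: every weight gives the same vector
        return 0.5
    numer = sum((b - a) * b for a, b in zip(g_text, g_image))
    return min(1.0, max(0.0, numer / denom))


def combine(g_text, g_image):
    """Shared direction built from the min-norm weighting (hypothetical helper)."""
    alpha = min_norm_weight(g_text, g_image)
    return [alpha * a + (1.0 - alpha) * b for a, b in zip(g_text, g_image)]


# Toy case (assumed values): the text gradient is 100x larger than the image gradient.
g_text = [100.0, 0.0]
g_image = [0.0, 1.0]
direction = combine(g_text, g_image)
```

In this toy case the combined direction has the same positive inner product with both gradients, so a small step against it decreases both losses to first order, even though the text gradient is far larger.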

2606.17389 2026-06-17 cs.CV cs.AI cs.CL cs.LG New submission

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

Logan Mann, Yi Xia, Ajit Saravanan, Ishan Dave, Saadullah Ismail, Shikhar Shiromani, Emily Huang, Ruizhe Li, Kevin Zhu

Affiliations University of California, Santa Barbara; Algoverse AI Research; University of California, Berkeley

AI summary Introduces the VLM Reliability Probe (VRP). Using structural-attention metrics and an analysis of generation dynamics, it finds that spatial attention is nearly uncorrelated with accuracy (R ≈ 0.001), whereas self-consistency is the dominant predictor of reliability (R = 0.429), revealing a "Symbolic Detachment" between visual features and final generation.

Comments 16 pages. Accepted to the ICLR 2026 Workshop on Multimodal Intelligence. Code: https://github.com/itsloganmann/VLM-Reliability-Probe

Abstract

Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. A common intuition, which we call the Attention-Confidence Assumption, holds that reliability follows from "structural" visual perception: tight attention on relevant regions should signal a trustworthy answer, while scattered attention signals confusion. We challenge this through the VLM Reliability Probe (VRP), a systematic cross-family study of reliability signals in contemporary Vision-Language Models (VLMs). We introduce structural-attention metrics, cluster counts (C_k) and spatial entropy (H_s), to quantify the visual encoder's gaze, and track its evolution (ΔH_s) across layers. This reveals a "Symbolic Detachment": models often "Early Lock" visual features only to diffuse attention later, severing early perception from final generation. Contrary to the grounding hypothesis, we find a "Cluster Failure": spatial attention has near-zero correlation (R ≈ 0.001) with accuracy. Instead, reliability is a phenomenon of generation dynamics and internal-state distributions. Self-Consistency, the agreement rate across sampled reasoning paths, is the dominant predictor of truth (R = 0.429). Scaling causal interventions exposes a sharp architectural divergence: LLaVA locks its prediction in a fragile late-stage bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability globally, staying resilient even when ~50% or more of their most predictive layer is destroyed. For current VLMs, reliability signals are detached from visual grounding maps and are best inferred from generation-time dynamics and hidden-state probes.
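
The two kinds of signal contrasted above can be sketched in a few lines. The code assumes that spatial entropy is the Shannon entropy of an attention map normalized to sum to 1, and that self-consistency is the share of sampled answers matching the most common one; the paper's exact definitions of H_s and its agreement rate may differ, and the function names and toy inputs are illustrative.

```python
import math
from collections import Counter


def spatial_entropy(attn):
    """Shannon entropy (in nats) of a 2D attention map normalized to sum to 1."""
    flat = [v for row in attn for v in row]
    total = sum(flat)
    probs = [v / total for v in flat if v > 0]
    return -sum(p * math.log(p) for p in probs)


def self_consistency(answers):
    """Fraction of sampled answers that agree with the most common answer."""
    counts = Counter(answers)
    return counts.most_common(1)[0][1] / len(answers)


# Assumed toy maps: all attention on one patch vs. spread evenly over four patches.
focused = [[0.0, 0.0], [0.0, 1.0]]
diffuse = [[0.25, 0.25], [0.25, 0.25]]
```

Under these assumptions a fully focused map has zero entropy and a uniform 2x2 map has entropy log 4, while self-consistency depends only on the sampled answers and never inspects the attention map.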

2606.17410 2026-06-17 cs.CV New submission

Attention Alignment Between Humans and Vision-Language Models

Isaac R. Christian, Udith Haputhanthrige, Hanna Hornfeld, Declan Campbell, Samuel Nastase, Taylor Webb, Michael Graziano

Affiliations Princeton Neuroscience Institute, Princeton University; Department of Psychology, Princeton University; Department of Computer Science, Princeton University; Department of Psychology and Center for Computational Language Sciences, University of Southern California; Department of Psychology, Université de Montréal

AI summary Compares spatial attention maps from six vision-language models with human fixation heatmaps and finds that decoder architecture (LSTM vs. Transformer) dominates alignment: LSTM decoders align more closely but are spatially diffuse and weakly task-differentiated, whereas Transformer decoders show more concentrated attention and stronger task differentiation.

Abstract

Visual perception depends on top-down goals and bottom-up sensory mechanisms. Vision-language models implement both, allowing us to treat each component as a separable hypothesis about what drives where we look. We compared spatial attention maps from six vision-language models against human fixation heatmaps recorded on 200 images during two tasks (general description and social captioning). The six models spanned a 2×2 factorial of CNN vs. ViT encoders crossed with LSTM vs. Transformer decoders, plus Molmo 7B-D and Qwen3.5 9B. We found that both decoder and encoder architecture shaped alignment, but decoder choice dominated. LSTM vs. Transformer decoders increased alignment by 40-50 percentage points (80-87% vs. 40-59% of the human noise ceiling). In contrast, CNN vs. ViT encoders contributed a secondary 5-20 point advantage depending on decoder family, with CNN-LSTM the most aligned model overall (85-87%). Despite their alignment advantage, LSTM-decoder attention maps were spatially diffuse and minimally task-differentiated; ViT-Transformer, the weakest in alignment, showed the sharpest spatial concentration and strongest task differentiation. A hemispatial-neglect simulation confirmed that ablating attention impacted LSTM decoders more than Transformer decoders. In an exploratory extension using TRIBE-simulated synthetic neural responses, fixation alignment and neural relevance dissociate: CNN-Transformer attention maps better predicted synthetic brain activity despite lower fixation alignment, with attention maps best predicting early visual cortex. Together, top-down and bottom-up components trade off what they predict in behavioral and synthetic neural data.

2606.17430 2026-06-17 cs.CV New submission

CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation

Trinh Thi Thu Hien, Trung-Nghia Le

Affiliations University of Science, Ho Chi Minh City; Vietnam National University, Ho Chi Minh City

AI summary Proposes CIAN, a multi-stage framework that retrieves relevant articles, generates narratives with a LoRA-fine-tuned Qwen model, and applies N-gram refinement, raising CIDEr from 0.030 to 0.094 on OpenEvents-V1 for event-enriched image captioning.

Comments SOICT 2025

Abstract

Event-enriched image captioning describes not only visible content but also the broader context of events, including timing, location, and participants, capabilities missing in most pixel-bound models. We propose the Contextual Image-Article Narrator (CIAN), a multi-stage framework that enriches captions with external narratives. CIAN retrieves relevant articles using SigLIP, summarizes them to guide a Narrative Generation stage with a LoRA-fine-tuned Qwen model, and applies N-Gram-based Refinement for fluency and coherence. On the OpenEvents-V1 benchmark, CIAN achieves high retrieval performance (mAP 0.979) and improves caption quality, increasing CIDEr from 0.030 to 0.094. These results highlight the effectiveness of retrieval-augmented reasoning combined with linguistic refinement for generating context-aware, human-like captions.
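
For readers unfamiliar with the retrieval metric quoted above, here is a minimal sketch of mean average precision (mAP) over ranked retrieval results. It assumes binary relevance labels and that every relevant article appears somewhere in the ranked list; the benchmark's official scoring script may differ, and the function names are illustrative.

```python
def average_precision(ranked_relevance):
    """AP for one query: mean of precision@k over the ranks k that hold a relevant item."""
    hits, total = 0, 0.0
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def mean_average_precision(queries):
    """mAP: the average of per-query AP values."""
    return sum(average_precision(q) for q in queries) / len(queries)
```

A query whose relevant articles sit at ranks 1 and 3 scores (1/1 + 2/3) / 2, so an mAP near 1 means the relevant article is almost always ranked first.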

2606.17539 2026-06-17 cs.CV cs.AI New submission

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

Yatai Ji, An-Chieh Cheng, Yang Fu, Yukang Chen, Han Zhang, Zhaojing Yang, Wei Huang, Ka Chun Cheung, Song Han, Vidya Nariyambut Murali, Pavlo Molchanov, Jan Kautz, Simon See, Hongxu Yin, Ping Luo, Sifei Liu

Affiliations The University of Hong Kong; NVIDIA; University of California, San Diego

AI summary Proposes SR-REAL, a framework that uses reinforcement learning to combine two reasoning paths, language-only reasoning and detect-then-reason with 3D detection, significantly improving spatial VLM performance on complex geometric reasoning tasks.

Abstract

Spatial VLMs have made substantial progress in geometric perception, yet complex spatial reasoning requiring multi-step inference over depth, distance, and scene relations remains challenging. Moreover, different spatial queries call for fundamentally different strategies: some are best addressed through purely linguistic, step-by-step deduction, while others require explicit 3D grounding before quantitative inference. We present Dual-Path Spatial Reasoning via Reinforcement Learning for Spatial VLMs (SR-REAL), a unified framework that equips a spatial VLM with two complementary reasoning paths: Language-Only Reasoning (LOR), which performs step-by-step linguistic deduction, and Detect-Then-Reason (DTR), which detects 3D geometric cues (e.g., centers or bounding boxes) via region tokens before explicit geometric inference. SR-REAL begins with a cold-start supervised fine-tuning stage that constructs LOR and DTR chain-of-thought supervision and exposes a region-to-3D interface, followed by RL that optimizes the policy model with accuracy and format rewards; for DTR, a discrete center-based detection reward further refines geometric alignment. Across diverse spatial benchmarks, SR-REAL significantly outperforms spatial VLM baselines: (i) a single RL-trained model supports both reasoning paths, with DTR excelling in region-aware tasks through precise 3D localization and LOR enhancing general spatial reasoning; (ii) jointly training both paths fosters mutual reinforcement; (iii) high-quality, blended cold-start data is crucial for stable RL optimization; and (iv) the model generalizes across datasets and domains without per-task tuning, demonstrating positive transfer between LOR and DTR.

2606.17678 2026-06-17 cs.CV cs.AI New submission

See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL

Yilian Liu, Sicong Leng, Guoshun Nan, Junyi Zhu, Jiayu Huang, Minghao Sun, Xuancheng Zhu, Yisong Chen, Zexian Wei, Xiaofeng Tao

Affiliations Beijing University of Posts and Telecommunications; Nanyang Technological University; China Telecom

AI summary Proposes Visual Evidence Pre-Alignment (VEPA), which inserts sufficiency-driven GRPO optimization between pretraining and post-training to strengthen the use of fine-grained visual evidence in multimodal large language models, significantly improving performance on visually demanding tasks.

Abstract

Multimodal large language models (MLLMs) integrate strong text reasoning with visual inputs, yet their responses can be inconsistent with the underlying images, indicating ineffective utilization of visual evidence during inference. The prevailing training paradigm relies on large-scale caption-based pretraining for general alignment, followed by supervised fine-tuning and reinforcement learning to enable instruction following and complex reasoning. However, such pretraining provides only weak visual grounding: short, coarse captions bias models toward salient objects while neglecting fine-grained visual evidence. In this paper, we introduce Visual Evidence Pre-Alignment (VEPA), an intermediate stage between pretraining and post-training that explores a novel sufficiency-driven objective with Group Relative Policy Optimization (GRPO) to optimize question-conditioned visual evidence descriptions. Extensive experiments across diverse benchmarks show that our VEPA consistently enhances performance on visually demanding evaluations and complements standard supervised post-training. Further analyses show that the gain stems from strengthened, transferable visual grounding, rather than from additional task-specific training.
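
GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows only that group-relative normalization step, assuming scalar rewards and a population standard deviation; the sufficiency-driven reward itself is not described in enough detail in the abstract to reproduce, so the reward values here are placeholders.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]


# Placeholder rewards for four evidence descriptions sampled for one question.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses scoring above the group mean receive positive advantages and the rest negative ones, so no separate learned value function is needed.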

2606.17950 2026-06-17 cs.CV cs.AI New submission

Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model

Jinghan Wu, Jing Li, Ivor W. Tsang, Xuetao Zhang

Affiliations State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi'an Jiaotong University; Centre for Frontier AI Research and Institute of High-Performance Computing, Agency for Science, Technology and Research (A*STAR)

AI summary Proposes a plug-and-adapt method that repurposes a pretrained fine-grained alignment model and fuses visual and categorical cues with evidence theory, requiring neither training on the target dataset nor a large VLLM. On the CIN benchmark it improves CoNLL F1 by 5.31% and 2.12% over dedicated methods and popular VLLMs, respectively.

Abstract

Visual information helps resolve ambiguity in coreference resolution, leading to notable performance gains. However, existing Multi-modal Coreference Resolution (MCR) methods require training with (partially) annotated data from the target dataset before they can be applied, preventing their direct usability and raising concerns about generalization. While Vision-Language Large Models (VLLMs) with billions of parameters offer promising zero-shot capabilities, they remain largely inaccessible. Their massive size limits deployability, and many are only accessible through paid APIs. In this paper, we propose a plug-and-adapt method that strategically adapts a carefully pre-trained alignment model for immediate use in MCR tasks, designed to eliminate the need for training on scarce benchmark datasets or relying on resource-intensive VLLMs. Specifically, we first pre-train a fine-grained alignment model between textual and visual contextual information using vision-language alignment datasets. We then repurpose the alignment model to MCR through similarity aggregation by fusing visual and categorical cues with evidence theory, thereby enhancing effectiveness. Experiments on the Coreference Image Narratives (CIN) benchmark dataset demonstrate the effectiveness of our method, achieving a 5.31% and 2.12% improvement in CoNLL F1 over SOTA dedicated methods and popular VLLMs, respectively. We further evaluate our method on a masked CIN dataset for robustness testing and on a specially constructed VCR-MCR dataset for generalization assessment, with results confirming both capabilities.
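
Evidence theory usually refers to the Dempster-Shafer framework, whose standard fusion step is Dempster's rule of combination. The sketch below applies that rule to two mass functions over a two-hypothesis frame. Whether the paper uses this exact rule is an assumption, and the hypothesis labels and mass values are made up for illustration.

```python
def dempster_combine(m1, m2):
    """Fuse two mass functions (dicts mapping frozenset -> mass) with Dempster's rule."""
    combined, conflict = {}, 0.0
    for a, mass_a in m1.items():
        for b, mass_b in m2.items():
            overlap = a & b
            if overlap:
                combined[overlap] = combined.get(overlap, 0.0) + mass_a * mass_b
            else:
                conflict += mass_a * mass_b  # mass assigned to contradictory hypotheses
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict")
    return {h: m / (1.0 - conflict) for h, m in combined.items()}


# Hypothetical frame: two mentions either corefer or do not.
YES, NO = frozenset({"coref"}), frozenset({"not_coref"})
EITHER = YES | NO  # mass left on the whole frame expresses ignorance
visual_cue = {YES: 0.6, EITHER: 0.4}
category_cue = {YES: 0.5, NO: 0.2, EITHER: 0.3}
fused = dempster_combine(visual_cue, category_cue)
```

Mass placed on the whole frame lets a cue abstain, so a weak visual signal does not override a confident categorical one.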

2606.18249 2026-06-17 cs.CV New submission

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai

Affiliations Institute of Trustworthy Embodied AI, Fudan University; Shanghai Innovation Institute; Qwen Team, Alibaba Inc.

AI summary Proposes UniAR, a framework in which a single discrete visual tokenizer bridges visual understanding and generation. With parallel bitwise prediction and diffusion-based decoding, it achieves state-of-the-art image generation and editing while remaining competitive on multimodal understanding.

Comments Accepted by ICML2026. Project page https://sharelab-sii.github.io/uniar-web

Abstract

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
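
A lookup-free bitwise quantizer can be sketched as follows: each feature channel becomes one bit according to its sign, so d channels address 2^d codes without storing a codebook. This is the generic sign-based scheme, offered as an assumption about what "lookup-free bitwise quantization" means here; UniAR's multi-level fusion and its exact quantizer are not reproduced.

```python
def bitwise_quantize(features):
    """Quantize each channel to one bit by sign and return (bits, integer code)."""
    bits = [1 if x > 0 else 0 for x in features]
    code = 0
    for bit in bits:
        code = (code << 1) | bit  # the bit pattern doubles as the token index
    return bits, code


def dequantize(bits):
    """Map bits back to the +1/-1 vector they stand for."""
    return [1.0 if bit else -1.0 for bit in bits]


bits, code = bitwise_quantize([0.7, -1.2, 0.1, -0.3])
```

Because the code is just the bit pattern, adding one channel doubles the effective vocabulary at no storage cost, and a model can predict the bits in parallel instead of choosing among 2^d classes.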

2606.17449 2026-06-17 cs.CL cs.AI cs.CV cs.LG cs.MM Cross-list

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

Zehang Wei, Jiaxin Dai, Jiamin Yan, Xiang Xiang

Affiliations School of Computer Science & Tech, Huazhong University of Science and Technology; School of AI and Automation, Huazhong University of Science and Technology

AI summary Proposes MODE-RAG, a multi-agent system that uses Variational Free Energy and internal attention states to dynamically gate interventions, combining Monte Carlo Tree Search and logit perturbations to reduce hallucinations and logical fabrications in multimodal retrieval-augmented generation.

Comments To be presented at ACL 2026

Abstract

While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi-modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications. To quantify and mitigate these hallucinations, we propose a Multi-Agent system, MODE-RAG, driven by Variational Free Energy (VFE) and internal attention states to dynamically gate interventions. High-risk queries are routed to five stage-specific agents, integrating Monte Carlo Tree Search (MCTS) for rigorous causal derivation and logit perturbations to penalize sycophancy. Dedicated Correction and Overseer agents ensure formatting stability and perform post-hoc factual verification. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M-RAG systems.

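2601.00215 2026-06-17 cs.CV cs.CL Updated version

Disentangling Perception and Reasoning in Multimodal LLMs via Reward Design

Omar Sharif, Eftekhar Hossain, Nikhil Singh, Patrick Ng

AI summary Studies the perception and reasoning bottlenecks in multimodal LLMs, finds that perception is the binding constraint, and uses reward design to improve visually grounded reasoning, yielding a 5.56-point gain over the base model.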

Comments 24 pages, 15 Figures, 10 Tables

Abstract

Reinforcement learning with verifiable rewards has driven major gains in LLM reasoning, and it is intuitive to assume this recipe will transfer well to multimodal models. However, multimodal models do two things: first, perceive what is in an image, then reason about what it implies. Because these stages are graded jointly, it is hard to tell how much room reasoning alone has to grow. We study this on algorithmic visual puzzles, where both components are necessary and show that perception, not reasoning, is the binding constraint. Replacing images with simple textual descriptions raises performance by over 20 points on average for Claude models. We then evaluate six reward designs aimed at inducing visual grounding during reasoning without chain-of-thought supervision. Training Qwen-2.5-VL-7B with GRPO, reward design induces long, structured reasoning with self-reflection and visual references, yielding a 5.56-point gain over the base model. These gains are, however, uneven; no single reward improves all categories, and rewards with verifiable accuracy signals trade out-of-domain transfer for in-domain accuracy. These results point to perception-aware reward design as a path forward, so that signals correct perception at its source rather than the reasoning that inherits its errors.

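2602.18746 2026-06-17 cs.CV Updated version

Bridging Modality Disconnect in Self-Reflection via Closed-Loop Visually Grounded Verification

Haoyu Zhang, Yuwei Wu, Pengxiang Li, Xintong Zhang, Zhi Gao, Rui Gao, Mingyang Gao, Che Sun, Yunde Jia

AI summary Proposes the MIRROR framework, which reduces VLM hallucinations through closed-loop visual reflection (draft, critique, region-based verification, revision), and constructs the ReflectV dataset to train visually grounded multi-turn reflection.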

Abstract

In the era of Vision-Language Models (VLMs), enhancing multimodal reasoning capabilities remains a critical challenge, particularly in handling ambiguous or complex visual inputs, where initial inferences often lead to hallucinations or logic errors. Existing VLMs often produce plausible yet ungrounded answers, and even when prompted to "reflect", their corrections may remain detached from the image evidence. To address this, we propose the MIRROR framework for Multimodal Iterative Reasoning via Reflection On visual Regions. By embedding visual reflection as a core mechanism, MIRROR is formulated as a closed-loop process comprising draft, critique, region-based verification, and revision, which are repeated until the output is visually grounded. To facilitate training of this model, we construct **ReflectV**, a visual reflective dataset for multi-turn supervision that explicitly contains reflection triggers, region-based verification actions, and answer revision grounded in visual evidence. Experiments on both general vision-language benchmarks and representative vision-language reasoning benchmarks show that MIRROR improves correctness and reduces visual hallucinations, demonstrating the value of training reflection as an evidence-seeking, region-aware verification process rather than a purely textual revision step.

2606.14782 2026-06-17 cs.CV cs.CL Updated version

Last But Not Least: Boundary Attention CalibratiON for Multimodal KV Cache Compression

Tianhao Chen, Yuheng Wu, Kelu Yao, Xiaogang Xu, Xiaobin Hu, Dongman Lee

Affiliations KAIST; Zhejiang Laboratory; The Chinese University of Hong Kong; National University of Singapore

AI summary To address the loss of answer-critical evidence when compressing the KV cache for long visual contexts in multimodal LLMs, proposes BACON, which calibrates observation-window attention with last-query attention and suppresses noise via intra-layer coherence and inter-layer persistence, improving performance by 7.5% on average under the most aggressive budget.

Abstract

Multimodal Large Language Models (MLLMs) achieve strong vision-language reasoning, but long visual contexts enlarge the KV cache and increase decoding latency. Existing compression methods rely on observation window attention for stable token-importance estimation, yet this aggregation can dilute sparse visual evidence and discard answer-critical tokens under aggressive compression. Therefore, we identify last-query attention as a complementary source for recovering such evidence, but its answer-irrelevant signals can mislead retention. We propose BACON, a plug-and-play method that calibrates observation window attention with last-query evidence and suppresses isolated noise via intra-layer coherence and inter-layer persistence. Across diverse benchmarks, models, budgets, and compression methods, BACON improves multimodal KV compression by 7.5% on average under the most aggressive budget, with gains up to 30.9%. Our project page is available at https://ryu1ion.github.io/official_BACON/
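
To make the idea of score-based KV retention concrete, the sketch below keeps the top-scoring cached tokens under a fixed budget, where each score blends observation-window attention with last-query attention. The linear blend and the `lam` weight are placeholders invented for illustration; BACON's actual calibration and its coherence and persistence filters are not reproduced here.

```python
def select_tokens(window_attn, last_query_attn, budget, lam=0.5):
    """Return the sorted indices of the `budget` tokens with the highest blended score."""
    scores = [(1.0 - lam) * w + lam * q for w, q in zip(window_attn, last_query_attn)]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Assumed toy scores: token 2 matters only to the last query token.
window = [0.40, 0.30, 0.05, 0.25]
last_query = [0.05, 0.05, 0.80, 0.10]
kept_window_only = select_tokens(window, last_query, budget=2, lam=0.0)
kept_blended = select_tokens(window, last_query, budget=2)
```

With window-averaged scores alone, token 2 is evicted; once last-query evidence is blended in, it survives the same budget, which is the failure mode the abstract describes.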

2. Embodied AI, Robotics, and Autonomous Driving (26 papers)

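2606.17241 2026-06-17 cs.CV cs.RO cs.SY eess.SY New submission

Beyond Benchmarks: Continuous Edge Inference for Fine-Grained Roadside Perception

Aditya Mishra, Haroon Lone

Affiliations Indian Institute of Science Education and Research Bhopal

AI summary To address the performance degradation of edge inference under continuous operation, proposes Edge-TSR, a system integrating detection, tracking, and a lightweight temporal stabilization mechanism that achieves real-time roadside perception on the NVIDIA Jetson Orin Nano and recovers up to 10.16% classification accuracy.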

Abstract

Continuous AI inference on resource-constrained edge hardware introduces deployment effects that are largely invisible to conventional benchmark evaluation, including temporal instability in streaming video, thermal throttling under sustained load, and workload-dependent performance variability. We present Edge-TSR, a deployment-oriented continuous edge inference system for sustained roadside perception on the NVIDIA Jetson Orin Nano. Edge-TSR integrates detection, tracking, fine-grained classification, and a lightweight track-aware temporal stabilization mechanism that improves streaming inference consistency with negligible computational overhead. Our central finding is that benchmark-centric evaluation systematically overstates deployed edge inference performance. Across three state-of-the-art baselines, we observe consistent 20-30% relative degradation when transitioning from static-image evaluation to real-world streaming deployment. Edge-TSR addresses this gap through temporal inference stabilization, recovering up to 10.16% classification accuracy over per-frame inference baselines while maintaining sustained real-time performance under continuous operation. We evaluate the complete system under diverse real-world deployment conditions, jointly characterizing inference quality, latency, throughput, and thermal behavior during long-duration operation. A 55-minute vehicular deployment over a 26 km route demonstrates sustained operation at 16.18 FPS within safe thermal limits on a single embedded device without cloud offload. Our findings show that deployment-aware evaluation and temporal inference stabilization are necessary components of continuously operating edge AI systems intended for real-world sensing deployments. We release a sample annotated streaming video evaluation dataset and full system implementation to support reproducible deployment-centric evaluation.
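
One lightweight way to stabilize per-frame labels along a track is a sliding-window majority vote, sketched below. This is an assumed stand-in rather than the Edge-TSR mechanism, which the abstract does not detail; the class name, window size, and labels are hypothetical.

```python
from collections import Counter, defaultdict, deque


class TrackVoter:
    """Smooth per-frame class labels with a majority vote over each track's recent frames."""

    def __init__(self, window=5):
        # One bounded history per track ID; old votes fall out automatically.
        self.history = defaultdict(lambda: deque(maxlen=window))

    def update(self, track_id, label):
        votes = self.history[track_id]
        votes.append(label)
        return Counter(votes).most_common(1)[0][0]


voter = TrackVoter(window=3)
smoothed = [voter.update(7, label) for label in ["stop", "stop", "yield", "stop"]]
```

The single misclassified frame is outvoted, and the per-frame cost is one append plus a count over a handful of labels, consistent with the negligible overhead the abstract emphasizes.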

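2606.17362 2026-06-17 cs.CV cs.AI cs.LG cs.RO New submission

DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

Xinglong Sun, Kevin Xie, Jenny Schmalfuss, Despoina Paschalidou, Xiuming Zhang, Sanja Fidler, Kashyap Chitta, Jose M. Alvarez

Affiliations NVIDIA

AI summary Proposes DriveJudge, which combines rule-based evaluation with VLM reasoning and selectively invokes physically grounded rule functions for interpretable, context-aware driving evaluation, outperforming existing methods on driving quality classification and trajectory preference selection.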

Comments Under Review

Abstract

Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.

2606.17386 2026-06-17 cs.CV cs.AI cs.RO 新提交

TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

TerraTransfer: 无需专家示范的端到端驾驶策略学习

Zikang Xiong, Weixin Li, Zhouchonghao Wu, Akshay Rangesh, Saarth Bonde, Grantland Hall, Chen Tang, Yihan Hu, Wei Zhan

发表机构 * Applied Intuition UCLA(加州大学洛杉矶分校) UC Berkeley(加州大学伯克利分校)

AI总结 提出一种无需专家示范的端到端驾驶方法,通过向量化模拟器中的自博弈预训练策略,再与预训练视觉骨干对齐,降低了数据成本并达到或超越现有方法。

详情
AI中文摘要

端到端自动驾驶在基准测试和实际部署中取得了最先进的性能。然而,其标准训练流程在所有阶段都成本高昂:收集和标注数百万驾驶帧代价昂贵,而在图像上进行闭环强化学习受限于每步的光真实感渲染和大视觉骨干的前向传播成本。在向量化模拟器中进行自博弈改变了经济性:每秒数百万次 rollout 步骤,状态分布自然包含碰撞、近碰撞和恢复等驾驶日志中不包含的情况。我们的方法通过解耦学习驾驶和学习视觉来利用这种不对称性。我们通过自博弈预训练单个策略,然后通过动作 KL 散度和批量关系低秩结构损失将其潜在空间与预训练视觉骨干对齐。动作目标来自自博弈策略,因此对齐从未对记录的轨迹进行监督:只需要一个(图像、场景状态)帧的配对数据集,无需模仿预训练所依赖的精心策划的专家示范。在光真实感 3D 高斯泼溅闭环场景中,得到的端到端策略匹配或超越了先前的端到端方法。

英文摘要

End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.
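
The action-KL alignment term mentioned above can be illustrated with a minimal sketch. It assumes a discrete action set, treats the self-play policy as the teacher, and uses KL(teacher || student); none of these choices is stated in the abstract, and the names `action_kl` and `alignment_loss` are hypothetical. The batch-relational low-rank structural loss is not covered here.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def action_kl(teacher_logits, student_logits):
    """KL(teacher || student) over a discrete action set (assumed setup).

    teacher_logits: action logits from the self-play policy (scene-state input).
    student_logits: action logits from the vision-based policy (image input).
    """
    log_p = log_softmax(teacher_logits)
    log_q = log_softmax(student_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def alignment_loss(teacher_batch, student_batch):
    """Mean action KL over a batch of paired (image, scene-state) frames."""
    total = sum(action_kl(t, s) for t, s in zip(teacher_batch, student_batch))
    return total / len(teacher_batch)
```

Because the target distribution comes from the teacher policy rather than from a logged trajectory, a loss of this shape needs only paired frames, which matches the data requirement described in the abstract.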

2606.17463 2026-06-17 cs.CV cs.RO 新提交

WeaveLA: Event Driven Cross-Subtask Latent Memory Weaving for Repetitive Robot Manipulation

WeaveLA: 面向重复机器人操作的基于事件驱动的跨子任务潜在记忆编织

Shoujing Zhu, Zhenyang Liu, Fungmiu Wang, Jiafeng Wang, Bo Yue, Guiliang Liu, Simo Wu, Xiangyang Xue, Taiping Zeng

发表机构 * Fudan University(复旦大学) School of Data Science, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)数据科学学院) Shanghai Innovation Institute(上海创新研究院) Shenzhen Loop Area Institute(深圳环域研究院)

AI总结 针对短窗口VLA策略缺乏跨子任务信息传递的问题,提出WeaveLA,通过事件触发将完成子任务压缩为潜在令牌并注入下一子任务的动作生成路径,在保持基础策略短窗口接口的同时实现轻量级跨子任务通道,在困难重复任务上成功率从0%提升至47.8%。

详情
AI中文摘要

视觉-语言-动作(VLA)策略已实现显著的单步操作,但在每个阶段依赖于刚刚完成的任务时仍然脆弱。核心问题是结构性的:短窗口VLA缺乏明确的跨子任务信息路由通道,而现有的记忆增强变体要么在每一帧写入,要么从演示阶段检索,要么在子目标事件触发时未执行显式的子任务到子任务交接给动作专家。我们将子目标完成事件识别为跨子任务记忆交接的自然时间单元,并提出WeaveLA(为视觉-语言-动作策略编织潜在记忆),这是一种跨子任务记忆接口,在冻结的VLA骨干之上,通过查询驱动的注意力池化将每个完成的段压缩为潜在令牌,并直接路由到下一子任务的动作生成路径。这种事件触发、动作侧的设计保留了基础策略的短窗口接口,同时添加了轻量级跨子任务通道。通过在RoboMME上使用$\pi_{0.5}$骨干进行分层评估,WeaveLA的增益恰好出现在需要该通道的地方:在最难的重复切片(SwingXtimes,$N{=}3$)上,成功率从$0\%$提升至$47.8\%$,而单次执行片段保持不变。每集配对分析证实增益仅限于因果结构需要跨子任务信息的任务。

英文摘要

Vision-Language-Action (VLA) policies have achieved remarkable single-step manipulation, yet they remain brittle precisely where each stage depends on what was just completed. The core issue is structural: short-window VLAs lack an explicit channel for routing information across sub-task boundaries, and existing memory-augmented variants either write at every frame, retrieve from demonstration-time stages, or fire at sub-goal events without performing an explicit sub-task-to-sub-task hand-off into the action expert. We identify the sub-goal completion event as the natural temporal unit for cross-subtask memory hand-off, and present WeaveLA (Weave Latent memory for Vision-Language-Action policies), a cross-subtask memory interface that, on top of a frozen VLA backbone, compresses each completed segment into latent tokens via query-driven attention pooling and routes them directly into the action-generation path of the next sub-task. This event-triggered, action-side design preserves the base policy's short-window interface while adding a lightweight cross-subtask channel. Through stratified evaluation on RoboMME with a $\pi_{0.5}$ backbone, WeaveLA's gains land exactly where the channel is needed: on the hardest repetition slice (SwingXtimes, $N{=}3$), success rises from $0\%$ to $47.8\%$, while single-execution episodes remain unchanged. Per-episode paired analysis confirms the gains are confined to tasks whose causal structure requires cross-subtask information.
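
As a rough illustration of query-driven attention pooling, the sketch below compresses the per-frame features of one completed segment into a fixed number of latent tokens. It is a generic single-head formulation with scaled dot-product scores; the function name `attention_pool`, the absence of learned key/value projections, and the feature layout are assumptions, not details taken from WeaveLA.

```python
import math


def attention_pool(queries, features):
    """Compress a segment into len(queries) latent tokens.

    queries:  K query vectors of dimension d (learned in a real model).
    features: T per-frame feature vectors of dimension d from one segment.
    Returns K tokens, each a softmax-weighted average of the frame features.
    """
    d = len(features[0])
    scale = 1.0 / math.sqrt(d)
    tokens = []
    for q in queries:
        # Scaled dot-product score between this query and every frame.
        scores = [scale * sum(qi * xi for qi, xi in zip(q, x)) for x in features]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Weighted average of frame features gives one latent token.
        tokens.append(
            [sum(w * x[j] for w, x in zip(weights, features)) for j in range(d)]
        )
    return tokens
```

The output size depends only on the number of queries, not on the segment length, which is what makes this kind of pooling suitable for a fixed-size hand-off between sub-tasks.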

2606.17480 2026-06-17 cs.CV cs.RO 新提交

GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

GeneralVLA-2: 几何感知重建与受控记忆用于机器人规划

Haoyu Wang, Guoqing Ma, Zeyu Zhang, Yandong Guo, Boxin Shi, Hao Tang

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) CASIA(中国科学院自动化研究所) AI² Robotics

AI总结 针对机器人规划中3D物体重建幻觉和记忆质量不可控的问题,提出GeoFuse-MV3D几何先验引导重建分支和受控长期记忆系统,在GSO-30和Terminal-Bench等基准上显著提升性能。

详情
AI中文摘要

通用视觉-语言-动作系统需要以物体为中心的3D证据和可复用的操作经验来规划可靠的机器人轨迹。GeneralVLA提供了一个层次化接口,用于将语言和RGB-D观测转换为3D末端执行器路径,但仍存在两个瓶颈。首先,单目SAM3D风格的物体重建可能产生姿态和未见几何的幻觉,而操作受益于在标定多视图观测可用时的稳定物体形状。其次,原始的KnowledgeBank主要检索语义相似的片段并附加新知识,这使得难以控制记忆质量、冲突、置信度和几何相关性。为了解决第一个挑战,我们引入了GeoFuse-MV3D,一个几何先验引导的MV-SAM3D重建分支,它用输入视图掩码验证外部几何线索,应用软视觉外壳支持,执行轴方向细化,并仅融合几何同时保留外观。为了解决第二个挑战,我们将KnowledgeBank升级为一个受控的长期记忆系统,具有明确的质量、置信度、生命周期、验证器和冲突元数据,以及面向精度的检索。最后,我们在GSO-30上评估重建分支,在Terminal-Bench 2.0和SWE-Bench Verified上评估记忆模块;GeoFuse-MV3D相比MV-SAM3D基线,CD和LPIPS分别降低2.20%和2.02%,PSNR和SSIM分别提高2.36%和1.03%;KnowledgeBank相比ReasoningBank,在Terminal-Bench SR上提高4.53%,在SWE-Bench解决率上提高3.73%,同时AS分别降低4.95%和5.65%。代码:https://github.com/AIGeeksGroup/GeneralVLA-2。网站:https://aigeeksgroup.github.io/GeneralVLA-2。

英文摘要

Generalist vision-language-action systems need object-centric 3D evidence and reusable manipulation experience to plan reliable robot trajectories. GeneralVLA provides a hierarchical interface for converting language and RGB-D observations into 3D end-effector paths, but two bottlenecks remain. First, monocular SAM3D-style object reconstruction can hallucinate pose and unseen geometry, while manipulation benefits from stable object shape when calibrated multi-view observations are available. Second, the original KnowledgeBank mainly retrieves semantically similar snippets and appends new knowledge, which makes it difficult to control memory quality, conflicts, confidence, and geometric relevance. To address the first challenge, we introduce GeoFuse-MV3D, a geometry-prior-guided MV-SAM3D reconstruction branch that verifies external geometry cues with input-view masks, applies soft visual-hull support, performs axis-wise refinement, and fuses only geometry while preserving appearance. To address the second challenge, we upgrade KnowledgeBank into a governed long-term memory system with explicit quality, confidence, lifecycle, verifier, and conflict metadata, together with precision-oriented retrieval. Finally, we evaluate the reconstruction branch on GSO-30 and the memory module on Terminal-Bench 2.0 and SWE-Bench Verified; GeoFuse-MV3D improves over the MV-SAM3D baseline by reducing CD and LPIPS by 2.20% and 2.02% while increasing PSNR and SSIM by 2.36% and 1.03%, and KnowledgeBank improves over ReasoningBank by 4.53% on Terminal-Bench SR and 3.73% on SWE-Bench resolve rate, while reducing AS by 4.95% and 5.65%, respectively. Code: https://github.com/AIGeeksGroup/GeneralVLA-2. Website: https://aigeeksgroup.github.io/GeneralVLA-2.

2606.17482 2026-06-17 cs.CV 新提交

SPHINX: First Explain, Then Explore

SPHINX: 先解释,再探索

Nguyen Do, Tue M. Cao, Tien Van Do, András Hajdu, Tamás Bérczes, My T. Thai

发表机构 * University of Florida(佛罗里达大学) University of Debrecen(德布勒恩大学)

AI总结 提出SPHINX闭环框架,通过可解释AI分析驾驶策略的失败模式,并利用视觉语言模型生成针对性对抗场景,提升自动驾驶策略鲁棒性。

Comments 13 pages

详情
AI中文摘要

生成对抗性驾驶场景对于在仿真中评估和改进自动驾驶决策系统至关重要。最近的方法,如ChatScene和LLM-Attacker,主要依赖大型语言模型和视觉语言模型的先验知识来程序化生成驾驶场景。我们认为,对抗性场景应基于驾驶策略的失败诊断(例如,犹豫不决、多帧不一致)来生成,以专门针对策略的弱点,而不是依赖先验假设。在本文中,我们提出SPHINX,一个闭环框架,用于对抗性场景合成,遵循一个简单原则:先解释,再探索。除了盲目探索场景空间外,SPHINX利用可解释人工智能方法分析策略,识别关键视觉概念及其对策略输出的影响,以及决策的不确定性。基于从策略自身决策过程中提取的可解释证据,我们使用视觉语言模型对当前策略的失败模式进行推理和批评。然后,这些批评被用于生成针对性的对抗性场景,以进行策略再训练和改进。我们证明,SPHINX能够突出策略失败的可解释说明,而其他对抗性场景生成方法则不能。在评估的基准和测试套件中,SPHINX可应用于多种最先进的自动驾驶架构,并在现有场景生成方法上带来一致的鲁棒性改进。

英文摘要

Generating adversarial driving scenarios is critical for evaluating and improving autonomous vehicle decision-making systems in simulation. Recent approaches, such as ChatScene and LLM-Attacker, rely primarily on the prior knowledge of Large Language Models and Vision-Language Models to generate driving scenarios procedurally. We argue that adversarial scenes should be generated based on the failure diagnosis (e.g., indecisiveness, multi-frame inconsistency) of the driving policy to specifically address the policy's weaknesses instead of relying on prior assumptions. In this paper, we propose SPHINX, a closed-loop framework for adversarial scenario synthesis guided by a simple principle: first explain, then explore. Beyond blindly exploring the scenario space, SPHINX leverages explainable artificial intelligence methods to analyze the policy, identifying key visual concepts and their influence on policy outputs, and the uncertainty of the decisions. Given the interpretable evidence extracted from the policy's own decision process, we use a vision language model to rationalize and criticize failure modes of the current policy. These critiques are then used to generate targeted adversarial scenarios for policy retraining and improvement. We demonstrate that SPHINX can highlight an interpretable account of policy failures while other adversarial scene generation methods cannot. Across the evaluated benchmarks and test suites, SPHINX can be applied to diverse state-of-the-art autonomous vehicle architectures and yields consistent robustness improvements over existing scenario-generation methods.

2606.17536 2026-06-17 cs.CV cs.AI 新提交

OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

OmniDrive: 一种由LLM编排的多智能体世界模型,用于多视角驾驶视频生成的统一潜在协同压缩

Zijie Meng, Yufei Liu, Chengqian Ma, Zhiyu Li, Jiyuan Liu, Wenhua Nie, Bingcai Wei, Shuqin Chen, Weichen Xu, Jiquan Yuan, Miao Zhang

发表机构 * Peking University(北京大学) Xiamen University(厦门大学) Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院) National Taiwan University(国立台湾大学) Wuhan University(武汉大学) Wuhan University of Technology(武汉理工大学) Tsinghua University(清华大学) Jimei University(集美大学)

AI总结 提出DRIVE-CHOREO,一种由LLM编排的多智能体世界模型,通过三个Qwen2.5-VL智能体协同生成位置感知的潜在序列,并利用视图-时间置换与3D VAE协同压缩,实现可控多视角视频生成,在nuScenes上达到SOTA多视角一致性和BEV mAP 21.6。

Comments 24 pages, 10 figures

详情
AI中文摘要

自动驾驶的生成式世界模型面临两个未解决的对立:异构控制注入(自由形式语言、高清地图、轨迹和相机位姿存在于不兼容的表示空间)和事后跨视图融合(每个相机的潜在编码未能编码全局3D几何)。我们将两者追溯到同一个根本原因:在潜在标记级别上缺乏对齐语言、几何和像素的共享符号中间语言。我们提出DRIVE-CHOREO,一种由LLM编排的多智能体世界模型,将可控多视图视频生成重新定义为潜在编排。三个Qwen2.5-VL智能体——一个解析用户意图为结构化WorldScript的导演,一个将其接地为空间锚定布局标记的制图师,以及一个将跨视图批评反馈为辅助监督的审计员——共同创作一个单一的位置感知标记序列。该序列通过视图-时间置换与多视图视频协同压缩,在3D VAE的卷积感受野内强制实现相机间几何。在nuScenes上,DRIVE-CHOREO以具有竞争力的FVD(45.7)实现了新的最先进的多视图一致性和BEV mAP(21.6);仅在我们的合成数据上训练的检测器在真实验证集上获得了+2.4 NDS,验证了下游实用性。

英文摘要

Generative world models for autonomous driving face two unresolved tensions: heterogeneous control injection, where free-form language, HD-maps, trajectories, and camera poses reside in incompatible representational spaces, and post-hoc cross-view fusion, where per-camera latents fail to encode global 3-D geometry. We trace both to a single root cause: the absence of a shared symbolic interlingua aligning language, geometry, and pixels at the latent-token level. We present DRIVE-CHOREO, an LLM-choreographed multi-agent world model that recasts controllable multi-view video generation as latent choreography. Three Qwen2.5-VL agents - a Director parsing user intent into a structured WorldScript, a Cartographer grounding it into spatially-anchored layout tokens, and an Auditor feeding cross-view critiques back as auxiliary supervision - jointly author a single position-aware token sequence. This sequence is co-compressed with the multi-view video via a view-time permutation that enforces inter-camera geometry within the convolutional receptive field of a 3-D VAE. On nuScenes, DRIVE-CHOREO sets new state-of-the-art multi-view consistency and BEV mAP (21.6) with competitive FVD (45.7); a detector trained purely on our synthetic data gains +2.4 NDS on the real validation split, validating downstream utility.

2606.17730 2026-06-17 cs.CV 新提交

ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

ActWorld: 从可探索到可交互的世界模型——基于动作感知记忆

Zhexiao Xiong, Yizhi Song, Hao Kang, Qing Yan, Liming Jiang, Jenson Yang, Zhoujie Fu, Stathi Fotiadis, Angtian Wang, Zichuan Liu, Bo Liu, Yiding Yang, Xin Lu, Nathan Jacobs

发表机构 * Washington University in St. Louis(圣路易斯华盛顿大学) Intelligent Creation, ByteDance(字节跳动智能创作)

AI总结 提出ActWorld,通过构建10万交互视频数据集和分层动作感知记忆机制,在块自回归框架中扩展导航模型以支持对象交互,解决数据稀缺和记忆瓶颈问题。

Comments Project page: https://interactwm.github.io/ActWorld

详情
AI中文摘要

交互式世界模型旨在模拟实时用户动作下的环境动态。然而,它们的动作词汇很大程度上局限于导航:大多数动作对应运动(例如,行走、转身、环顾四周),而与场景中对象的交互(例如,拿起盘子、开门或触发物理响应)要么不存在,要么局限于游戏领域,要么降级为提示到全视频场景。由此产生的世界在视觉上可探索,但并非真正可操作。在这项工作中,我们提出了ActWorld,一个交互式世界模型,它将先前的以导航为中心的生成器扩展到块自回归框架内支持中期滚动中的对象交互。我们认为导航-交互差距源于两个瓶颈。首先,数据瓶颈:缺乏具有准确、密集标签的人-对象交互数据。其次,记忆瓶颈:现有世界模型中基于近因的历史压缩丢弃了因果决定后续对象状态的事件转换帧,导致动作遗忘病理。在数据方面,我们构建了一个10万交互视频数据集,每个视频通过思维链推理标注了每块字幕。在模型方面,我们引入了一种分层动作感知记忆设计,通过交互重要性路由历史压缩,并辅以一个持久记忆库,在长滚动中维护事件更新和对象身份令牌。实验表明,ActWorld在单个模型中同时支持灵活的导航和丰富的对象交互,在不牺牲视点控制的情况下,显著提高了交互保真度,优于仅导航的基线。项目页面:https://interactwm.github.io/ActWorld。

英文摘要

Interactive world models aim to simulate environment dynamics under real-time user actions. However, their action vocabulary is largely confined to navigation: most actions correspond to motion (e.g., walk, turn, look around), while interaction with objects in the scene (e.g., pick up plates, open doors, or trigger physical responses) is either absent, restricted to game domains, or relegated to prompt-to-full-video scenarios. The resulting worlds are visually explorable but not truly actionable. In this work, we present ActWorld, an interactive world model that extends prior navigation-centric generators to support mid-rollout object interaction within a chunk-autoregressive framework. We argue that the navigation-interaction gap stems from two bottlenecks. First, a data bottleneck: the lack of human-object interaction data with accurate, dense labels. Second, a memory bottleneck: recency-biased history compression in existing world models discards the event-transition frames that causally determine subsequent object states, leading to an action-forgetting pathology. On the data side, we construct a 100K interaction video dataset, each annotated with per-chunk captions via chain-of-thought reasoning. On the model side, we introduce a hierarchical action-aware memory design that routes history compression by interaction importance, complemented by a persistent memory bank that maintains event-update and object-identity tokens across long rollouts. Experiments show that ActWorld supports both flexible navigation and rich object interaction within a single model, substantially improving interaction fidelity over navigation-only baselines without sacrificing viewpoint control. Project page is available at https://interactwm.github.io/ActWorld.

2606.18242 2026-06-17 cs.CV 新提交

EventDrive: Event Cameras for Vision-Language Driving Intelligence

EventDrive: 用于视觉-语言驾驶智能的事件相机

Dongyue Lu, Rong Li, Ao Liang, Lingdong Kong, Wei Yin, Lai Xing Ng, Benoit R. Cottereau, Camille Simon Chane, Wei Tsang Ooi

发表机构 * NUS(新加坡国立大学) HKUST(GZ)(香港科技大学(广州)) Horizon Robotics(地平线机器人) A*STAR, I2R(新加坡科技研究局,资讯通信研究院) IPAL, CNRS IRL 2955, Singapore(IPAL,法国国家科学研究中心国际联合实验室2955,新加坡) University Toulouse, CNRS, CerCo, Toulouse, France(图卢兹大学,法国国家科学研究中心,CerCo,法国图卢兹) ETIS UMR 8051, CY Cergy Paris University, ENSEA, CNRS, France(ETIS UMR 8051,CY塞尔吉-巴黎大学,ENSEA,法国国家科学研究中心,法国)

AI总结 提出EventDrive基准和模型套件,通过多时域事件金字塔和时域混合专家模块融合事件流与RGB帧,在感知、理解、预测和规划四维度提升驾驶推理性能。

Comments CVPR2026, 34 pages, 15 figures, 15 tables, project page: https://dylanorange.github.io/projects/eventdrive

详情
AI中文摘要

事件相机通过异步亮度变化感知世界,具有微秒级延迟和高动态范围,其运动保真度远超基于帧的传感器,并能捕捉传统曝光常遗漏的时间结构。这些特性使事件成为自动驾驶中RGB的有力补充,尤其在帧感知可能不可靠的模糊、眩光和快速运动场景下。然而,现有的事件感知视觉-语言模型仍局限于通用感知,未能揭示事件传感如何促进整个驾驶循环中的推理和决策。我们提出EventDrive,一个大规模基准和模型套件,统一了事件流、RGB帧和语言监督,涵盖四个核心维度:感知、理解、预测和规划,包括描述、结构化问答、定位、运动状态识别、轨迹预测和规划任务。在此基础上,EventDrive-VLM引入了多时域事件金字塔和时域混合专家模块,自适应地编码和融合异步与基于帧的信息,用于下游推理。在多样化任务上的全面评估表明,事件流在时间精度、运动感知和鲁棒性方面提供了显著提升,将事件传感置于驾驶智能的核心。

英文摘要

Event cameras sense the world through asynchronous brightness changes with microsecond latency and high dynamic range, offering motion fidelity far beyond frame-based sensors and capturing temporal structure that conventional exposures often miss. These properties make events a powerful complement to RGB in autonomous driving, especially under blur, glare, and rapid motion, where frame-based perception can become unreliable. However, existing event-aware vision-language models remain limited to generic perception and do not reveal how event sensing contributes to reasoning and decision-making across the full driving loop. We present EventDrive, a large-scale benchmark and model suite that unifies event streams, RGB frames, and language supervision across four core dimensions: Perception, Understanding, Prediction, and Planning, covering captions, structured QA, grounding, motion-state recognition, trajectory forecasting, and planning tasks. Building on this foundation, EventDrive-VLM introduces a multi-horizon event pyramid and a temporal-horizon mixture-of-experts module to adaptively encode and fuse asynchronous and frame-based information for downstream reasoning. Comprehensive evaluation across diverse tasks shows that event streams provide substantial gains in temporal precision, motion awareness, and robustness, bringing event sensing into the center of driving intelligence.

2606.18243 2026-06-17 cs.CV cs.GR cs.RO 新提交

MOCHI: Motion Enhancement of Collaborative Human-object Interactions

MOCHI: 协作人-物交互的运动增强

Jiye Lee, Yonghun Choi, Jungdam Won

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Seoul National University(首尔国立大学)

AI总结 针对多人-物交互数据中手物接触错位、运动抖动和手指细节缺失等问题,提出两阶段框架MOCHI,先通过优化生成物理合理的手部抓取,再基于扩散模型优化全身运动,有效增强噪声数据。

Comments SIGGRAPH 2026 Journal (ACM TOG); Project page: https://jiyewise.github.io/projects/MOCHI/

详情
AI中文摘要

协作人-物交互展示了动态且复杂的运动,需要参与者与共享对象之间的相互预期和持续调整。对此类协作多人-物交互(MHOI)场景进行建模需要高质量的数据采集作为基础步骤;然而,由于MHOI中人与人、人与物交互同时发生的内在复杂性,这一步骤具有挑战性。这种复杂性导致MHOI捕获数据存在噪声,表现为多种伪影:手与物体之间的接触错位、捕获序列中的运动抖动和时间不一致性,以及缺失或不完整的手指级关节细节。为了解决这些挑战,我们提出了MOCHI(协作人-物交互的运动增强),一个用于增强噪声MHOI数据的两阶段框架。我们的方法首先通过从噪声身体输入进行优化生成物理合理的手部抓取,产生既物理合理又与身体姿态语义一致的抓取,然后将这些优化后的抓取扩展为完整的手-物交互序列。随后,所有参与者的全身运动通过一个基于扩散的噪声优化框架进行细化,该框架使用单人运动先验。在优化过程中,我们引入优化目标以在这些单人先验中编码人-物和人与人交互信息。实验结果表明,我们的流程在多种MHOI数据(无论是通过现有捕获方法获取还是由生成模型合成)上均有效。我们进一步展示了系统在不同参与者数量和交互类型下的鲁棒性,并演示了包括基于关键帧的MHOI创建和通过改变物体几何形状进行数据增强在内的多种应用。

英文摘要

Collaborative human-object interaction shows dynamic and complex movements that require mutual anticipation and continuous adjustment between participants and the shared object. Modeling such collaborative multi-human object interaction (MHOI) scenarios requires high-quality data acquisition as a foundational step; however, this is challenging due to the inherent complexity of MHOI where human-human and human-object interactions occur simultaneously. Such complexity leads to noisy MHOI captures characterized by several artifacts: contact misalignment between hands and objects, motion jitter and temporal inconsistencies in the captured sequences, and missing or incomplete finger-level articulation details. To address these challenges, we present MOCHI (MOtion Enhancement of Collaborative Human-object Interactions), a two-stage framework for enhancing noisy MHOI data. Our approach first generates physically plausible hand grasps through optimization from noisy body input, producing grasps that are both physically plausible and semantically consistent with the body pose; these optimized grasps are then extended into complete hand-object interaction sequences. Subsequently, the full-body motion of all participants is refined through a diffusion-based noise optimization framework that uses single-person motion priors. During the optimization process, we introduce optimization objectives to encode human-object and human-human interaction information within these single-person priors. Experimental results demonstrate the effectiveness of our pipeline across diverse MHOI data, either acquired by existing capture methods or synthesized by generative models. We further show robustness of our system across varying numbers of participants and types of interactions, and demonstrate various applications including keyframe-based MHOI creation and data augmentation through varying object geometries.

2606.18250 2026-06-17 cs.CV 新提交

Future Dynamic 3D Reconstruction: A 3D World Model with Disentangled Ego-Motion

未来动态3D重建:一种具有解耦自运动的3D世界模型

Nils Morbitzer, Jonathan Evers, Artem Savkin, Thomas Stauner, Nassir Navab, Federico Tombari, Stefano Gasperini

发表机构 * Technical University of Munich (TUM)(慕尼黑技术大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心)

AI总结 提出FR3D世界模型,通过解耦场景3D演化与智能体轨迹,利用教师-学生蒸馏策略实现从单目观测到未来动态3D重建的几何一致性和零样本泛化。

Comments ICML 2026. Project page: https://fr3d-wm.github.io

详情
AI中文摘要

预测动态环境的演化对于自主智能体至关重要。尽管生成式世界模型最近通过在图像平面内混合自运动和环境动态,在2D视频合成中实现了高逼真度,但它们表现出物理不一致性,例如物体变形或消失,尤其是在长时间范围内。在本文中,我们提出FR3D,一种预测未来动态3D重建的持久3D潜在表示的世界模型。与将世界视为基于图像的特征序列的先前工作不同,FR3D明确地将场景的3D演化与智能体的轨迹解耦,将推断的自运动视为动作的潜在代理。这种解耦解决了自运动和世界运动之间的歧义,确保了几何一致性到未来。此外,我们引入了一种教师-学生蒸馏策略,利用现成基础模型的空间“常识”,从而实现鲁棒的零样本泛化。大量实验表明,FR3D在多个数据集上从单目观测进行未来动态3D重建(甚至到未来2秒)的强大性能。项目页面:https://fr3d-wm.github.io。

英文摘要

Forecasting the evolution of dynamic environments is crucial for autonomous agents. While generative world models have recently achieved high photorealism in 2D video synthesis by mixing ego-motion and environmental dynamics within the image plane, they exhibit physical inconsistencies, such as morphing or vanishing objects, especially over long time horizons. In this paper, we propose FR3D, a world model that predicts a persistent 3D latent representation for future dynamic 3D reconstruction. Unlike prior works that treat the world as a sequence of image-based features, FR3D explicitly decouples the 3D evolution of the scene from the agent's trajectory, treating the inferred ego-motion as a latent proxy for action. This disentanglement resolves the ambiguities between self-motion and world-motion, ensuring geometric consistency into the future. Furthermore, we introduce a teacher-student distillation strategy that leverages the spatial "common sense" of off-the-shelf foundation models, leading to robust zero-shot generalization. Extensive experiments demonstrate FR3D's strong performance for future dynamic 3D reconstruction from monocular observations across multiple datasets, even 2 seconds into the future. Project page: https://fr3d-wm.github.io.

2606.17256 2026-06-17 cs.RO cs.CV 交叉投稿

Contrastive Action-Image Pre-training for Visuomotor Control

对比动作-图像预训练用于视觉运动控制

Yuvan Sharma, Dantong Niu, Anirudh Pai, Zekai Wang, Zhuoyang Liu, Baifeng Shi, Stefano Saravalle, Boning Shao, Ruijie Zheng, Jing Wang, Konstantinos Kallidromitis, Yusuke Kato, Fabio Galasso, Yuke Zhu, Danfei Xu, Linxi "Jim" Fan, Jitendra Malik, Trevor Darrell, Roei Herzig

发表机构 * UC Berkeley(加州大学伯克利分校) NVIDIA(英伟达) Sapienza University of Rome(罗马大学) Panasonic(松下) ItalAI

AI总结 提出CAIP方法,利用大规模第一人称视频中3D手部关键点作为代理动作信号,通过对比学习统一动作-图像表示,在少量机器人数据下显著提升灵巧操作性能。

详情
AI中文摘要

现有的机器人视觉编码器面临一个根本瓶颈:机器人数据集缺乏大规模预训练所需的规模。先前的工作通过转向互联网规模的图像和语言数据或自我中心的人类视频来规避数据稀缺问题。虽然这些模型显示出潜力,但两种范式都没有从配对的视觉和动作数据中学习,而下游视觉运动控制策略需要这些数据。然而,机器人轨迹作为这种配对信号最直接的来源,在预训练规模上不可用,这促使我们从丰富的人类视频中提取动作信号。为此,我们引入了CAIP(对比动作-图像预训练),一种视觉编码器,将大规模自我中心视频中的人类手部姿态视为末端执行器动作的代理。通过提取3D手部关键点(一种与下游机器人动作空间自然对齐的表示),CAIP通过对比目标学习统一的动作-图像表示。利用32,041小时的自我中心人类视频和仅88小时的机器人操作数据,CAIP优于最先进的视觉编码器,包括DINOv2、SigLIP、MVP和R3M。在使用Dexmate Vega和Sharpa Wave手的具有挑战性的真实世界灵巧操作设置上评估,CAIP在涉及折叠、倾倒和精细操作的任务上取得了超过30%的性能提升。我们的结果表明,我们的对比动作中心预训练方法为获得更适合物理交互的鲁棒视觉表示提供了一条可扩展的路径。

英文摘要

Existing vision encoders for robotics face a fundamental bottleneck: robotic datasets lack the scale necessary for large-scale pre-training. Prior work circumvents this data scarcity by turning to internet-scale image and language data or egocentric human video. While these models show promise, neither paradigm learns from paired vision and action data, which downstream visuomotor control policies require. However, robot trajectories, the most direct source of this paired signal, are not available at pre-training scale, motivating us to extract action signals from abundant human video instead. To this end, we introduce CAIP (Contrastive Action-Image Pre-training), a vision encoder that treats human hand poses from large-scale egocentric video as a proxy for end-effector actions. By extracting 3D hand keypoints, a representation that aligns naturally with downstream robot action spaces, CAIP learns a unified action-image representation through a contrastive objective. Leveraging 32,041 hours of egocentric human video and only 88 hours of robotic manipulation data, CAIP outperforms state-of-the-art vision encoders including DINOv2, SigLIP, MVP, and R3M. Evaluated on a challenging real-world dexterous manipulation setup using Dexmate Vega and Sharpa Wave hands, CAIP yields performance gains of more than 30% on tasks involving folding, pouring, and fine-grained manipulation. Our results show that our method of contrastive action-centric pre-training yields a scalable path to achieving robust visual representations better suited for physical interaction.
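
The contrastive objective between image and action representations can be sketched as a symmetric InfoNCE loss of the kind used in CLIP-style training. The abstract only says "a contrastive objective", so the symmetric form, the cosine similarity, the temperature value, and the function name `symmetric_info_nce` are all assumptions for illustration; in this sketch the action embeddings would be encodings of 3D hand keypoints.

```python
import math


def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]


def _diag_cross_entropy(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    loss = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        loss += lse - row[i]
    return loss / len(logits)


def symmetric_info_nce(image_emb, action_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (image, action) embeddings.

    Pair i is the positive; every other pairing in the batch is a negative.
    """
    img = [_normalize(v) for v in image_emb]
    act = [_normalize(v) for v in action_emb]
    logits = [
        [sum(a * b for a, b in zip(i, j)) / temperature for j in act] for i in img
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the image-to-action and action-to-image directions.
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))
```

The loss is low when each image embedding is closest to its own action embedding and high when pairs are mismatched, which is the property that pulls the two modalities into a shared representation.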

2606.17376 2026-06-17 cs.RO cs.CV 交叉投稿

Contactless Respiratory Monitoring on Heterogeneous Mobile Robots: A Multimodal Edge-Computing Framework

异构移动机器人上的非接触式呼吸监测:一种多模态边缘计算框架

Milind Rampure, Shadman Sakib, Haley Patel, Zahid Hasan, Nirmalya Roy

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校)

AI总结 提出一种适用于异构移动机器人的多模态非接触式呼吸率监测框架,通过自适应传感器选择、关键点引导的ROI提取和信号质量过滤,在多种平台和光照条件下实现鲁棒监测,无需平台特定调参。

Comments 8 pages, 6 figures. To appear in Proceedings of the 8th International Workshop on IoT Applications and Industry 5.0 (IoTI5 2026), co-located with IEEE DCOSS-IoT 2026, Reykjavik, Iceland, June 2026

详情
AI中文摘要

呼吸率监测是紧急响应、灾难恢复和传染病场景中远程分诊和受害者评估的关键组成部分,在这些场景中,最小化物理接触可以降低救援人员风险并提高操作安全性。然而,由于光照变化、姿势变化、平台异构性以及危险环境中可穿戴传感器的不实用性,非接触式呼吸率监测的现场部署仍然具有挑战性。在本文中,我们提出了一种适用于具有机载边缘计算的异构移动机器人的模态自适应非接触式呼吸率监测框架。所提出的系统结合了跨RGB、热成像、近红外和低光相机的亮度自适应传感器选择、用于姿势鲁棒监测的关键点引导胸部ROI提取,以及基于信号质量指数的滤波机制以实现可靠的呼吸估计。我们在三个机器人平台上实现并评估了该框架,涵盖四足和轮式运动以及多种边缘计算架构。在不同光照条件、受试者姿势和机器人到受试者距离下进行的实验表明,该框架无需针对每个平台进行算法重新调整即可跨平台泛化,同时揭示了模态特定的操作边界。RGB提供最广的覆盖范围,可达8米;近红外在6米内有效;热成像仅在短距离内可靠;低光传感支持在完全黑暗环境中监测,距离可达8米。总体而言,结果证明了在移动机器人上进行多模态非接触式呼吸率监测的可行性,并支持其作为危险搜救场景中自主分诊和受害者评估的基础。

英文摘要

Respiratory-rate (RR) monitoring is a critical component of remote triage and victim assessment in emergency response, disaster recovery, and infectious-disease scenarios, where minimizing physical contact can reduce responder risk and improve operational safety. However, field deployment of contactless RR monitoring remains challenging due to variable illumination, posture changes, platform heterogeneity, and the impracticality of wearable sensors in hazardous environments. In this paper, we present a modality-adaptive contactless RR monitoring framework for heterogeneous mobile robots with onboard edge computing. The proposed system combines brightness-adaptive sensor selection across RGB, thermal, near-infrared (NIR), and low-light cameras, keypoint-guided chest ROI extraction for posture-robust monitoring, and a signal-quality-index (SQI)-based filtering mechanism for reliable respiratory estimation. We implement and evaluate the framework on three robotic platforms spanning quadruped and wheeled locomotion and multiple edge-computing architectures. Experiments conducted across diverse lighting conditions, subject poses, and robot-to-subject distances demonstrate that the framework generalizes across platforms without per-platform algorithmic retuning, while revealing modality-specific operational boundaries. RGB provides the broadest coverage up to 8 m, NIR remains effective up to 6 m, thermal is reliable only at short range, and low-light sensing supports monitoring in complete darkness up to 8 m. Overall, the results demonstrate the feasibility of multimodal contactless RR monitoring on mobile robots and support its use as a foundation for autonomous triage and victim assessment in hazardous search-and-rescue settings.
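
To make the estimation and quality-gating step concrete, here is a minimal sketch that takes a one-dimensional chest-ROI signal (for example, mean ROI intensity per frame), finds the dominant frequency in a plausible breathing band, and reports a simple quality score. The band limits, the definition of the score as peak power over in-band power, and the name `respiratory_rate_bpm` are assumptions made for illustration; the abstract does not specify how the SQI is computed.

```python
import math


def respiratory_rate_bpm(signal, fs, lo_hz=0.1, hi_hz=0.7):
    """Estimate respiratory rate from a chest-ROI signal.

    signal: samples of the ROI signal; fs: sampling rate in Hz.
    Returns (rate in breaths per minute, quality score in [0, 1]).
    The quality score is the fraction of in-band power at the peak bin
    (a hypothetical stand-in for a signal-quality index).
    """
    n = len(signal)
    mean = sum(signal) / n
    x = [s - mean for s in signal]  # remove the DC offset

    best_f, best_p, band_power = 0.0, 0.0, 0.0
    for k in range(1, n // 2 + 1):
        f = k * fs / n
        if f < lo_hz or f > hi_hz:
            continue
        # Direct DFT at bin k; adequate for short windows.
        re = sum(v * math.cos(2 * math.pi * k * i / n) for i, v in enumerate(x))
        im = -sum(v * math.sin(2 * math.pi * k * i / n) for i, v in enumerate(x))
        p = re * re + im * im
        band_power += p
        if p > best_p:
            best_f, best_p = f, p

    quality = best_p / band_power if band_power > 0 else 0.0
    return best_f * 60.0, quality
```

A caller could discard windows whose quality score falls below a chosen threshold, which is one simple way to realize quality-based filtering before reporting a rate.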

2606.17408 2026-06-17 cs.RO cs.CV cs.LG 交叉投稿

Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies

动作生成应从何处开始?面向生成式机器人策略的可学习源先验

Meipo Dai, Qiyuan Zhuang, He-Yang Xu, Ying-Jie Shuai, Yijun Wang, Qi Dou, Xiu-Shen Wei

发表机构 * Southeast University(东南大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出LeaP,用轻量MLP预测基于本体感知的对角高斯分布作为动作生成源先验,替代标准高斯分布,在15个RoboTwin任务中平均成功率81.6%,优于基线方法6.5-25.5个百分点。

详情
AI中文摘要

生成式机器人策略通常从与观测无关的标准高斯分布开始动作生成,源分布的选择尚未被充分探索。本文提出一个简单问题:动作生成应从何处开始?我们提出LeaP,一种可学习源先验,用基于本体感知的对角高斯分布(作用于动作块)替代标准高斯分布。通过轻量MLP参数化,LeaP联合预测源分布的均值和状态自适应方差,同时保持下游生成器架构和推理求解器不变。这种设计提供了观测信息驱动的随机初始化,使生成器能够专注于精确的动作细化,而非从无信息的噪声源传输样本。在15个RoboTwin操作任务中,LeaP实现了81.6%的平均成功率,优于四个代表性基线——包括确定性源方法、无先验对应方法和扩散桥策略——6.5至25.5个百分点。相同的先验一致地改进了流匹配和扩散桥生成器,同时使用更少的参数且收敛更快。该优势延续到实际部署中,LeaP取得了最佳性能。这些结果表明,源分布是生成式机器人策略的一个独立且可重用的设计轴,与生成动力学的选择互补。

英文摘要

Generative robot policies typically begin action generation from an observation-independent standard Gaussian distribution, leaving the choice of source distribution underexplored. This work asks a simple question: where should action generation begin? We propose LeaP, a Learnable source Prior that replaces the standard Gaussian with a proprioception-conditioned diagonal Gaussian over action chunks. Parameterized by a lightweight MLP, LeaP jointly predicts the mean and state-adaptive variance of the source distribution, while keeping the downstream generator architecture and inference solver unchanged. This design provides an observation-informed yet stochastic initialization, allowing the generator to focus on precise action refinement rather than transporting samples from an uninformed noise source. On 15 RoboTwin manipulation tasks, LeaP achieves an average success rate of 81.6%, outperforming four representative baselines -- including deterministic-source methods, a no-prior counterpart, and a diffusion-bridge policy -- by 6.5 to 25.5 percentage points. The same prior consistently improves both flow-matching and diffusion-bridge generators, while using fewer parameters and converging faster. The advantage carries over to real-world deployment, where LeaP attains the best performance. These results suggest that the source distribution is an independent and reusable design axis for generative robot policies, complementary to the choice of generative dynamics.
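
The idea of a proprioception-conditioned diagonal Gaussian source can be sketched as follows: a small MLP maps the proprioceptive state to a mean and a log-variance per action dimension, and the starting sample for the generator is drawn with the reparameterization trick. The one-hidden-layer architecture, the tanh activation, the log-variance parameterization, and the names `mlp_forward` and `sample_source` are illustrative assumptions rather than details from LeaP.

```python
import math
import random


def mlp_forward(x, w1, b1, w2, b2):
    """One-hidden-layer MLP with tanh activation (weights as nested lists)."""
    hidden = [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(w1, b1)
    ]
    return [sum(w * h for w, h in zip(row, hidden)) + b for row, b in zip(w2, b2)]


def sample_source(proprio, params, action_dim, rng):
    """Draw the initial action-chunk sample from a state-conditioned Gaussian.

    proprio:    proprioceptive state vector.
    params:     (w1, b1, w2, b2); the output layer has 2 * action_dim units.
    action_dim: flattened size of the action chunk.
    Returns (sample, mean, std).
    """
    out = mlp_forward(proprio, *params)
    mean, log_var = out[:action_dim], out[action_dim:]
    std = [math.exp(0.5 * lv) for lv in log_var]
    # Reparameterization: sample = mean + std * standard normal noise.
    sample = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]
    return sample, mean, std
```

The sample returned here would replace the standard-Gaussian draw that normally seeds a flow-matching or diffusion sampler, leaving the generator and solver untouched, in line with the design described in the abstract.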

2606.17446 2026-06-17 cs.RO cs.CV 交叉投稿

AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation

AnnotateAnything:面向机器人操作的3D资产自动标注

Haoran Lu, Mutian Shen, Shuyang Yu, Yu Xiao, Songling Liu, Jianshu Zhang, Shang Wu, Yue Chen, Guo Ye, Jiayi Wang, Zhaoran Wang, Han Liu

发表机构 * Northwestern University(西北大学) Peking University(北京大学)

AI总结 提出AnnotateAnything框架,通过视觉-语言标注和物理标注双流水线,自动为3D资产生成可执行操作标签,提升仿真数据收集效率和任务成功率。

详情
AI中文摘要

仿真使得可扩展的机器人数据收集成为可能,但原始3D资产仅提供几何信息,缺乏指定机器人应在何处以及如何操作的语义、交互和物理知识。在这项工作中,我们提出了AnnotateAnything,一个通用的自动标注框架,将被动3D资产转换为具有结构化、多样化和可执行操作标签的、可用于操作的资产。AnnotateAnything围绕两个互补的流水线构建。首先,一个统一的视觉-语言标注流水线,利用视觉-语言推理来推断对象语义、交互约束和3D接地线索,为识别有意义的交互区域提供人类先验指导。其次,一个全自动且大规模并行的物理标注流水线,通过候选生成、几何优化和轨迹生成,将这些先验知识嵌入每个资产的几何和物理约束中。该流水线生成多样且可执行的动作标注,包括抓取姿态、灵巧接触、关节运动路径点、插入方向、悬挂可供性和导航目标。利用生成的标注,我们进一步构建了一个跨不同对象、任务和机器人形态的异步并行仿真数据收集系统。实验表明,与现有的标注和数据生成流水线相比,AnnotateAnything在标注效率、数据收集效率和任务成功率方面均表现优越,同时支持下游任务如可供性检测、机器人VQA和视觉指令微调。我们在项目页面上提供项目材料,并计划发布完整代码、标注和基准以促进未来研究。视频、代码、演示资产和标注在补充材料中提供。项目页面:https://tourmaline-caramel-169490.netlify.app。

英文摘要

Simulation enables scalable robot data collection, but raw 3D assets provide only geometry, lacking the semantic, interactive, and physical knowledge needed to specify where and how robots should act. In this work, we present AnnotateAnything, a general automatic annotation framework that converts passive 3D assets into manipulation-ready assets with structured, diverse, and executable manipulation labels. AnnotateAnything is built around two complementary pipelines. First, a unified vision-language annotation pipeline uses vision-language reasoning to infer object semantics, interaction constraints, and 3D-grounded cues, providing human-prior guidance for identifying meaningful interaction regions. Second, a fully automatic and massively parallel physics annotation pipeline grounds these priors in each asset's geometry and physical constraints through candidate generation, geometry optimization, and trajectory generation. This pipeline produces diverse and executable action annotations, including grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, and navigation targets. Using the generated annotations, we further build an asynchronous parallel simulation data-collection system across diverse objects, tasks, and robot embodiments. Experiments demonstrate that AnnotateAnything achieves superior annotation efficiency, data-collection efficiency, and task success rates over existing annotation and data-generation pipelines, while also supporting downstream tasks such as affordance detection, robotic VQA, and visual instruction finetuning. We provide project materials on the project page and plan to release the full code, annotations, and benchmark to facilitate future research. Videos, code, demo assets, and annotations are provided in supplementary materials. Project page: https://tourmaline-caramel-169490.netlify.app.

2606.17511 2026-06-17 cs.RO cs.AI cs.CV 交叉投稿

MagicSim: A Unified Infrastructure for Executable Embodied Interaction

MagicSim: 可执行具身交互的统一基础设施

Haoran Lu, Songling Liu, Yue Chen, Guo Ye, Mutian Shen, Shuyang Yu, Yu Xiao, Jihai Zhao, Shang Wu, Jianshu Zhang, Xiangtian Gui, Chuye Hong, Yuran Wang, Maojiang Su, Jiayi Wang, Ruihai Wu, Zhaoran Wang, Han Liu

发表机构 * Northwestern University(西北大学) Peking University(北京大学) University of California, Berkeley(加州大学伯克利分校) ShanghaiTech University(上海科技大学)

AI总结 提出MagicSim,一个基于确定性批处理运行时和共享MDP的具身交互基础设施,通过YAML规范解耦内容、放置、行为和智能体暴露,统一世界构建、执行、评估和自动生成轨迹。

详情
AI中文摘要

机器人学习和具身智能体现在需要模拟作为连接控制、技能和规划的共享执行基底,而不仅仅是渲染器、控制器测试平台或固定任务环境。现有的流水线通过“魔法”动作、脱节的训练环境或仅前向渲染来分割这些层,无法重现、评估和标注同一情节。我们提出MagicSim,一个围绕确定性批处理运行时和共享马尔可夫决策过程(MDP)构建的具身交互基础设施。通过YAML优先的规范解耦内容、放置、行为和智能体暴露,MagicSim在单一重置-步进循环中构建多样化的可执行世界,涵盖任务族、交互模式、物理、布局、传感器、化身和机器人具身。一个通用的执行接口通过控制器、原子技能、规划器原语和异步规划将高级命令具体化,将其实现为机器人动作而非模拟器端的状态编辑。一个任务定义支持三种能力:基准测试和强化学习评估、自动收集接口(自动将命令转化为具体轨迹)以及面向智能体/VLM的交互。对于自动执行,命令流经Command->Skill->Planner->Robot->Record流水线,而每个环境的命令、技能、规划、重试、标注和情节状态在共享物理滴答之上独立推进。成功的展开被保存为结构化的多模态轨迹,将语言监督、动作表示、视觉/几何表示和任务级别状态与执行的情节对齐。因此,MagicSim在一个规划器在环运行时中统一了多样化的世界构建、具身执行、任务评估、自动展开生成和交互式智能体接口。

英文摘要

Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command->Skill->Planner->Robot->Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.

2606.17520 2026-06-17 cs.RO cs.CV 交叉投稿

GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments

GASE:基于高斯溅射的自动化系统用于重建具身仿真环境

Jiawei Zhang, Yiming Yan, Chao Liang, Nuo Xu, Seson Sun, Qichen Zhang, Yuhao Xu, Yantai Yang, Yingqiao Wang, Qin Jin, Zhipeng Zhang

发表机构 * AutoLab, SAI, Shanghai Jiao Tong University(上海交通大学SAI学院AutoLab实验室) AIM3 Lab, School of Information, Renmin University of China(中国人民大学信息学院AIM3实验室) Research Lab, Anyverse Dynamics(Anyverse Dynamics研究实验室)

AI总结 提出GASE系统,利用全景相机阵列和多视图视频流,通过相机位姿策略提取前景物体并修复场景,独立重建后导入物理仿真器,实现高效高保真仿真环境构建,分割精度提升超10%,真实机器人部署性能差距小于10%。

详情
AI中文摘要

在现实世界中训练具身代理需要熟练的操作人员和昂贵的硬件。仿真环境通过实现大规模、低成本的数据增强提供了一种引人注目的替代方案。因此,快速构建具有最小仿真到现实差距的高保真仿真场景已成为机器人学习的关键目标。尽管基于重建的方法提供了优越的视觉质量,但当前的工作流程受到低效的数据采集和次优的前景物体提取的阻碍。因此,我们提出了GASE,一个高度自动化的仿真场景构建系统。GASE利用全景相机阵列的多视角视频流实现快速环境扫描。为确保高质量的资产生成,我们的流程引入了一种基于相机位姿的策略,在2D域中跨帧鲁棒地提取物体,随后进行高保真场景修复。前景物体和静态背景随后被独立重建,并无缝导入物理仿真器用于策略训练。大量实验表明,GASE在分割精度上比现有的基于3D高斯的方法提高了超过10%,同时实现了最先进的修复质量。此外,在操作和导航任务中的真实机器人部署保持了与纯真实世界数据训练策略相比低于10%的性能差距。这些结果证实GASE为弥合仿真到现实差距提供了高效且高度有效的解决方案。代码将发布。

英文摘要

Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We thus propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and seamlessly imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, real-robot deployments across manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained purely on real-world data. These results confirm that GASE provides an efficient and highly effective solution for bridging the sim-to-real gap. Code will be released.

2606.17598 2026-06-17 cs.RO cs.CV 交叉投稿

MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation

MuseVLA: 一种用于机器人操作的自适应多模态感知视觉-语言-动作模型

Xingyuming Liu, Ruichun Ma, Heyu Guo, Qixiu Li, Qingwen Yang, Lin Luo, Shiqi Jiang, Chenren Xu, Jiaolong Yang, Baining Guo

发表机构 * School of Computer Science, Peking University(北京大学计算机科学学院) Microsoft Research Asia(微软亚洲研究院) Princeton University(普林斯顿大学) Tsinghua University(清华大学)

AI总结 提出MuseVLA模型,通过将传感器作为按需工具集成,实现自适应多模态感知;设计传感器图像统一表示,并引入数据合成流水线,在灵巧手操作任务中平均成功率80.6%,显著优于RGB-only和多模态基线。

详情
AI中文摘要

人类自然地利用多种感知模态与物理世界交互,而大多数用于机器人的视觉-语言-动作(VLA)模型仅依赖RGB观测。这限制了它们感知难以或无法从RGB相机推断的物理属性(如温度、声音或雷达响应)的能力。我们提出MuseVLA,一种自适应多模态感知VLA模型,将新型传感器作为按需工具集成到机器人操作中。给定任务指令和视觉上下文,MuseVLA首先生成一个传感器令牌和目标描述,选择要调用的感知模态和关注对象,类似于带参数的工具调用。然后,它将选定的传感器测量值转换为接地传感器图像,这是一种统一的中间表示,编码异构读数以进行多模态融合和动作生成。这种设计将传感器特定处理与VLA主干解耦,实现了多种模态的高效集成。为了减少对昂贵的多传感器机器人数据集的需求,我们进一步引入了一种数据合成流水线,用接地传感器图像增强现有的RGB视频数据集,从而实现对未见过的传感器引导任务的泛化。我们在真实机器人上评估了MuseVLA,涉及需要多模态感知输入的挑战性灵巧手操作任务,包括温度引导的拾取与放置、音频驱动的物体搜索和雷达辅助的隐藏物体检索。MuseVLA平均成功率达到80.6%,显著优于仅RGB和多模态VLA基线,并在未见任务上表现出强大的零样本能力。

英文摘要

Humans naturally leverage diverse sensing modalities to interact with the physical world, while most Vision-Language-Action (VLA) models for robotics rely solely on RGB observations. This limits their ability to perceive physical properties that are difficult or impossible to infer from RGB cameras, such as temperature, sound, or radar response. We present MuseVLA, an adaptive multimodal sensing VLA model that integrates novel sensors as on-demand tools for robotic manipulation. Given a task instruction and visual context, MuseVLA first generates a sensor token and target description that select the sensing modality to invoke and what to attend to, analogous to a tool call with arguments. It then converts the selected sensor measurement into a grounded sensor image, a unified intermediate representation that encodes heterogeneous readings for multimodal fusion and action generation. This design decouples sensor-specific processing from the VLA backbone, enabling efficient integration of diverse modalities. To reduce the need for expensive multisensory robot datasets, we further introduce a data synthesis pipeline that augments existing RGB video datasets with grounded sensor images, enabling generalization to unseen sensor-guided tasks. We evaluate MuseVLA on a real-world robot across challenging dexterous hand manipulation tasks that require multimodal sensing inputs, including temperature-guided pick-and-place, audio-driven object search, and radar-assisted hidden object retrieval. MuseVLA achieves 80.6% success rate on average, outperforming RGB-only and multisensory VLA baselines significantly, and exhibits strong zero-shot capabilities on unseen tasks.

2606.17739 2026-06-17 cs.RO cs.AI cs.CV cs.MA 交叉投稿

ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents

ED3R: 能量感知的分布式灾难检测——基于协作机器人智能体

Lina Magoula, Nikolaos Koursioumpas, Nancy Alonistioti, Ramin Khalili

发表机构 * Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens(雅典大学信息学与电信系) Huawei Heisenberg Research Center (Munich)(华为海森堡研究中心(慕尼黑))

AI总结 提出ED3R框架,通过机器人-远程控制器分层协作与分布式神经回归预测,在不确定性下以最低能耗实现野火检测,成功率达97.18%,能耗降低36.4%,检测速度提升41%。

Comments 14 pages, 9 figures

详情
AI中文摘要

机器人技术有望支持环境监测和自然灾害管理,在这些场景中,决策必须在不确定性、资源限制和严格操作约束下做出。在关键任务(如野火)中,机器人智能体不仅需要以足够置信度识别危险事件,还需管理能量成本和检测时间。本文介绍ED3R,一种用于不确定性下野火检测的能量感知分布式框架。ED3R实现了机器人与远程控制器之间的分层协作决策:远程控制器决定机器人的运动,而机器人感知环境并决定在何处(机载或远程)以及如何执行野火检测。共同目标是以所需置信度检测野火,同时最小化任何机器人操作消耗的能量。ED3R进一步集成了避免附近障碍物、防止冗余探索、实现自适应早期任务完成以及通过自定义惩罚函数确保可行性的机制。ED3R还引入了前瞻能力,通过分布式神经回归模型使智能体能够在执行前评估候选策略以预测未来。该框架通过逼真的机器人仿真、消融研究和基线比较进行评估。总体而言,ED3R的任务成功率高达97.18%。尤其是在最具挑战性的任务中,它比基线减少高达36.4%的能量消耗,并提前高达41%检测到野火。

英文摘要

Robotics is expected to support environmental monitoring and natural disaster management, where decisions must be made under uncertainty, resource limitations, and strict operational constraints. In critical missions, such as wildfires, robotic agents must not only identify hazardous events with sufficient confidence, but also manage the energy cost and time until detection. This paper introduces ED3R, an energy-aware distributed framework for wildfire detection under uncertainty. ED3R enables hierarchical cooperative decision-making between a robot and a remote controller. The remote controller decides upon the robot's motion, while the robot senses the environment and decides where to execute the wildfire detection (onboard or remotely) and how. The common goal is to detect wildfires with a required confidence while minimizing the energy consumed by any robot operation. ED3R further integrates mechanisms to avoid nearby obstacles, prevent redundant exploration, enable adaptive early mission completion, and ensure feasibility through a custom penalty function. ED3R also introduces a forward-looking capability, enabled through distributed neural regression models that allow the agents to anticipate the future by evaluating candidate strategies before execution. The framework is evaluated through realistic robotics simulations, ablation studies, and baseline comparisons. Overall, ED3R achieves a mission success rate of up to 97.18%. Especially in the most demanding missions, it reduces energy consumption by up to 36.4% and detects wildfires up to 41% faster than baselines.

2606.18112 2026-06-17 cs.RO cs.CV 交叉投稿

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

Qwen-RobotNav 技术报告:为智能体导航系统设计的可扩展导航模型

Jiazhao Zhang, Gengze Zhou, Hale Yin, Yiyang Huang, Zixing Lei, Qihang Peng, Haoqi Yuan, Jie Zhang, Xudong Guo, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Zhuoyuan Yu, Jingyang Fan, Zhixuan Liang, Pei Lin, Ye Wang, Anzhe Chen, Kun Yan, Xiao Xu, Jiahao Li, Lulu Hu, Minying Zhang, Shurui Li, Wenhu Xiao, Shuai Bai, Xuancheng Ren, Chenxu Lv, Chenfei Wu, Xiong-Hui Chen

发表机构 * Qwen Team(通义实验室)

AI总结 提出 Qwen-RobotNav 可扩展导航模型,通过参数化接口支持多种任务模式和可调观测参数,在15.6M样本上训练,联合视觉语言数据防止行为坍缩,在多个导航基准上取得新最优结果,并展示零样本泛化能力。

详情
AI中文摘要

智能体导航系统需要一个基础导航模型,其观测策略可以在推理时从外部重新配置,因为指令跟随、目标搜索、目标跟踪和自动驾驶共享相同的感知规划主干,但对视觉流的消费方式有根本不同的要求。我们提出 Qwen-RobotNav,一个建立在 Qwen-RobotNav 上的可扩展导航模型,通过一个具有两个互补维度的参数化接口来解决这个问题:多个任务模式选择导航行为,以及可控的观测参数(例如,token 预算、每个摄像头的权重)控制视觉历史的编码方式。通过训练时对所有参数进行随机化,Qwen-RobotNav 对任何推理时配置都具有鲁棒性,无需对 Qwen-RobotNav 主干进行任何架构修改。我们在15.6M样本上训练 Qwen-RobotNav;与视觉语言数据联合训练防止了在仅轨迹训练中观察到的反应性动作序列映射器的坍缩。参数化接口也使 Qwen-RobotNav 成为智能体系统的自然构建块:对于长时域场景,上层规划器将目标分解为子任务,并在情节中动态切换 Qwen-RobotNav 的任务模式和上下文策略,通过重复调用同一模型组合出复杂行为。大量实验表明,Qwen-RobotNav 在主要导航基准上取得了新的最优结果。该模型从2B到8B参数展现出良好的扩展性,联合多任务训练发展出一个跨任务族迁移的共享空间规划基板,并在多样环境中对真实世界机器人展现出强大的零样本泛化能力。

英文摘要

Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses it through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration requiring zero architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.

2603.28251 2026-06-17 cs.CV cs.AI 版本更新

DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning

DiffAttn: 基于扩散的驾驶员视觉注意力预测与LLM增强语义推理

Weimin Liu, Qingkun Li, Jiyuan Qiu, Wenjun Wang, Joshua H. Meng

AI总结 提出DiffAttn框架,将驾驶员视觉注意力预测建模为条件扩散去噪过程,结合Swin Transformer、特征融合金字塔和LLM增强语义推理,在四个数据集上达到最先进性能。

详情
AI中文摘要

驾驶员的视觉注意力为预测潜在危险提供关键线索,并直接影响决策和控制操作,其缺失可能危及交通安全。为模拟驾驶员的感知模式并推进智能车辆的视觉注意力预测,我们提出DiffAttn,一种基于扩散的框架,将该任务建模为条件扩散-去噪过程,从而更准确地建模驾驶员注意力。为捕捉局部和全局场景特征,我们采用Swin Transformer作为编码器,并设计了一个解码器,该解码器结合了特征融合金字塔用于跨层交互,以及密集的多尺度条件扩散,以共同增强去噪学习并建模细粒度的局部和全局场景上下文。此外,引入大语言模型(LLM)层以增强自上而下的语义推理,并提高对安全关键线索的敏感性。在四个公共数据集上的大量实验表明,DiffAttn实现了最先进的性能,超越了大多数基于视频、自上而下特征驱动和LLM增强的基线。我们的框架进一步支持可解释的以驾驶员为中心的场景理解,并具有改善智能车辆中座舱人机交互、风险感知和驾驶员状态测量的潜力。

英文摘要

Drivers' visual attention provides critical cues for anticipating latent hazards and directly shapes decision-making and control maneuvers, where its absence can compromise traffic safety. To emulate drivers' perception patterns and advance visual attention prediction for intelligent vehicles, we propose DiffAttn, a diffusion-based framework that formulates this task as a conditional diffusion-denoising process, enabling more accurate modeling of drivers' attention. To capture both local and global scene features, we adopt Swin Transformer as encoder and design a decoder that combines a Feature Fusion Pyramid for cross-layer interaction with dense, multi-scale conditional diffusion to jointly enhance denoising learning and model fine-grained local and global scene contexts. Additionally, a large language model (LLM) layer is incorporated to enhance top-down semantic reasoning and improve sensitivity to safety-critical cues. Extensive experiments on four public datasets demonstrate that DiffAttn achieves state-of-the-art (SoTA) performance, surpassing most video-based, top-down-feature-driven, and LLM-enhanced baselines. Our framework further supports interpretable driver-centric scene understanding and has the potential to improve in-cabin human-machine interaction, risk perception, and drivers' state measurement in intelligent vehicles.

2604.03120 2026-06-17 cs.CV cs.RO 版本更新

SCC-Loc: A Unified Semantic Cascade Consensus Framework for UAV Thermal Geo-Localization

SCC-Loc: 无人机热红外地理定位的统一语义级联共识框架

Xiaoran Zhang, Yu Liu, Jinyu Liang, Kangqiushi Li, Zhiwei Huang, Huaxin Xiao

AI总结 提出SCC-Loc框架,通过共享DINOv2骨干网络、语义引导视口对齐、级联空间自适应纹理结构滤波和共识驱动可靠性感知位置选择,解决热红外-可见光模态差异导致的特征模糊问题,实现零样本高精度绝对位置估计,平均定位误差9.37米。

Comments 17 pages, 5 figures. Submitted to IEEE J-STARS

详情
AI中文摘要

跨模态热红外地理定位(TG)为无人机在GNSS拒止环境中提供了鲁棒的全天候解决方案。然而,深刻的热红外-可见光模态差异引入了严重的特征模糊性,系统性地破坏了传统的由粗到精配准。为打破这一瓶颈,我们提出SCC-Loc,一个统一的语义-级联-共识定位框架。通过在全局检索和MINIMA$_{\text{RoMa}}$匹配中共享单个DINOv2骨干网络,它最小化内存占用并实现零样本、高精度的绝对位置估计。具体而言,我们通过引入三个协同组件来解决模态模糊性。首先,我们设计语义引导视口对齐(SGVA)模块,自适应优化卫星裁剪区域,有效校正初始空间偏差。其次,我们开发级联空间自适应纹理结构滤波(C-SATSF)机制,显式强制几何一致性,从而消除密集的跨模态离群点。最后,我们提出共识驱动可靠性感知位置选择(CD-RAPS)策略,通过物理约束位姿优化的协同作用推导出最优解。为解决数据稀缺问题,我们构建了Thermal-UAV数据集,提供11,890个多样化的热红外查询,并参考大规模卫星正射影像和相应的空间对齐数字表面模型(DSM)。大量实验表明,SCC-Loc建立了新的最先进水平,将平均定位误差抑制到9.37米,并在严格的5米阈值内比最强基线提供了7.6倍的精度提升。代码和数据集可在 https://github.com/FloralHercules/SCC-Loc 获取。

英文摘要

Cross-modal Thermal Geo-localization (TG) provides a robust, all-weather solution for Unmanned Aerial Vehicles (UAVs) in Global Navigation Satellite System (GNSS)-denied environments. However, profound thermal-visible modality gaps introduce severe feature ambiguity, systematically corrupting conventional coarse-to-fine registration. To dismantle this bottleneck, we propose SCC-Loc, a unified Semantic-Cascade-Consensus localization framework. By sharing a single DINOv2 backbone across global retrieval and MINIMA$_{\text{RoMa}}$ matching, it minimizes memory footprint and achieves zero-shot, highly accurate absolute position estimation. Specifically, we tackle modality ambiguity by introducing three cohesive components. First, we design the Semantic-Guided Viewport Alignment (SGVA) module to adaptively optimize satellite crop regions, effectively correcting initial spatial deviations. Second, we develop the Cascaded Spatial-Adaptive Texture-Structure Filtering (C-SATSF) mechanism to explicitly enforce geometric consistency, thereby eradicating dense cross-modal outliers. Finally, we propose the Consensus-Driven Reliability-Aware Position Selection (CD-RAPS) strategy to derive the optimal solution through a synergy of physically constrained pose optimization. To address data scarcity, we construct Thermal-UAV, a comprehensive dataset providing 11,890 diverse thermal queries referenced against a large-scale satellite ortho-photo and corresponding spatially aligned Digital Surface Model (DSM). Extensive experiments demonstrate that SCC-Loc establishes a new state-of-the-art, suppressing the mean localization error to 9.37 m and providing a 7.6-fold accuracy improvement within a strict 5-m threshold over the strongest baseline. Code and dataset are available at https://github.com/FloralHercules/SCC-Loc.

2509.00064 2026-06-17 cs.RO cs.CV 版本更新

OpenTie: Open-vocabulary Sequential Rebar Tying System

OpenTie: 开放词汇的连续钢筋绑扎系统

Sai Fan, Mingze Liu, Haozhen Li, Haobo Liang, Yixing Yuan, Yanke Wang

AI总结 提出OpenTie,一种无需训练的3D钢筋绑扎框架,通过RGB到点云生成和开放词汇检测实现高精度连续绑扎,优于基于YOLO的方法。

Comments This article is accepted by The 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026)

详情
AI中文摘要

建筑工地的机器人实践因其应对复杂挑战的能力而备受关注,尤其是在涉及钢筋的场景中。现有产品和研究主要集中于需要模型训练的大量数据收集。为填补这一空白,我们提出OpenTie,一种利用RGB到点云生成和开放词汇钢筋检测的3D无训练钢筋绑扎框架,并在真实世界测试中实现。我们通过带有双目摄像头的机械臂实现OpenTie,并通过将基于提示的目标检测方法应用于经我们提出的后处理流程过滤的图像(用于图像到点云生成框架),保证了高精度。我们的流程无需训练,且在真实连续钢筋绑扎测试中优于基于训练的目标检测(即基于YOLO的方法)。该系统灵活适用于水平和垂直钢筋绑扎任务,并具有在真实建筑工地应用和商业化的潜力。

英文摘要

Robotic practices on construction sites are attracting growing attention owing to their capability of tackling complex challenges, especially in rebar-involved scenarios. Most existing products and research depend on collecting large amounts of data for model training. To fill this gap, we propose OpenTie, a 3D training-free rebar tying framework that combines RGB-to-point-cloud generation with open-vocabulary rebar detection and is validated in real-world tests. We implement OpenTie on a robotic arm with a binocular camera and achieve high accuracy by applying a prompt-based object detection method to images filtered by our proposed post-processing procedure for the image-to-point-cloud generation framework. Our pipeline requires no training and outperforms training-based object detection, i.e., a YOLO-based method, as verified in a real-world sequential rebar tying test. The system is flexible across horizontal and vertical rebar tying tasks and has potential for application on real construction sites and for commercialization.

2601.01762 2026-06-17 cs.RO cs.CV 版本更新

AlignDrive: Aligned Lateral-Longitudinal Planning for End-to-End Autonomous Driving

AlignDrive: 用于端到端自动驾驶的对齐横向-纵向规划

Yanhao Wu, Haoyang Zhang, Fei He, Rui Wu, Yanhu Shan, Congpei Qiu, Liang Gao, Wei Ke, Tong Zhang

AI总结 本文提出一种级联框架,通过将纵向规划转化为路径条件推理过程,提升自动驾驶的协调性和安全性。方法引入锚点回归设计和规划导向的数据增强策略,在 Bench2Drive 上达到 SOTA 性能。

Comments under review

详情
AI中文摘要

实用的自动驾驶需要能够通过时空可能性推理来排除不安全结果的模型。尽管最先进的方法使用并行规划架构,但它们未能明确将速度决策与路径上的代理行为联系起来,导致协调不优。为此,我们提出了一种级联框架,将纵向规划从独立预测任务转化为路径条件推理过程。在模型方面,我们引入基于锚点的回归设计,将纵向预测条件于横向驾驶路径,并将纵向规划重新表述为路径上的 1D 位移预测。这减少了几何不确定性,并使模型更专注于由交互驱动的动力学。在数据方面,我们引入了规划导向的数据增强策略,通过程序性插入代理和重标记纵向目标来模拟罕见的安全关键事件。在具有挑战性的 Bench2Drive 基准上评估,我们的方法在驾驶分数为 89.07 和成功率为 73.18% 的情况下实现了 SOTA 性能,证明了显著改进的协调性和安全性。进一步在 Fail2Drive 上的评估证实了在平行公式通常失败的罕见边缘情况下具有强大的泛化能力。项目页面:https://yanhaowu.github.io/AlignDrive/.

英文摘要

Practical autonomous driving requires models that generalize by reasoning through spatial-temporal possibilities to exclude unsafe outcomes. While state-of-the-art (SOTA) methods use parallel planning architectures, they fail to explicitly couple speed decisions with agent behavior along the driving path, leading to suboptimal coordination. To address this, we propose a cascaded framework that transforms longitudinal planning from an independent prediction task into a path-conditioned reasoning process. On the model side, we introduce an anchor-based regression design that conditions longitudinal prediction on the lateral drive path, and reformulate longitudinal planning as 1D displacement prediction along the path. This reduces geometric uncertainty and sharpens the model's focus on interaction-driven dynamics. On the data side, we introduce a planning-oriented data augmentation strategy that simulates rare safety-critical events by programmatically inserting agents and relabeling longitudinal targets to enforce collision avoidance. Evaluated on the challenging Bench2Drive benchmark, our method achieves SOTA performance with a driving score of 89.07 and a success rate of 73.18%, demonstrating significantly improved coordination and safety. Further evaluation on Fail2Drive confirms strong generalization to rare edge cases where parallel formulations typically fail. Project page: https://yanhaowu.github.io/AlignDrive/.
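The reformulation of longitudinal planning as 1D displacement along the lateral path can be illustrated with a small geometric sketch. It assumes the lateral path is a 2D polyline and that the longitudinal output is a list of arc-length displacements, one per future time step; the function names and the clamping behaviour at the path ends are hypothetical choices, not details taken from the paper.

```python
import math

def point_at_arclength(path, s):
    """Return the (x, y) point at arc length s along a polyline, clamped to its ends."""
    if s <= 0.0:
        return path[0]
    travelled = 0.0
    for (x0, y0), (x1, y1) in zip(path, path[1:]):
        seg = math.hypot(x1 - x0, y1 - y0)
        if seg > 0.0 and travelled + seg >= s:
            t = (s - travelled) / seg  # fraction of this segment to traverse
            return (x0 + t * (x1 - x0), y0 + t * (y1 - y0))
        travelled += seg
    return path[-1]

def trajectory_from_displacements(path, displacements):
    """Map per-step 1D displacements along the path to 2D waypoints."""
    return [point_at_arclength(path, s) for s in displacements]
```

Because every waypoint lies on the lateral path by construction, the speed profile is the only remaining degree of freedom, which is one way to read the abstract's statement that the formulation reduces geometric uncertainty.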

2605.29563 2026-06-17 cs.AI cs.CV cs.RO 版本更新

Planning with the Views

通过场景自我探索进行视图规划

Kangrui Wang, Linjie Li, Zhengyuan Yang, Shiqi Chen, Zihan Wang, Li Fei-Fei, Jiajun Wu, Leonidas Guibas, Lijuan Wang, Manling Li

发表机构 * Northwestern University(西北大学) University of Washington(华盛顿大学) Microsoft(微软) University of Oxford(牛津大学) Stanford University(斯坦福大学)

AI总结 提出ViewSuite基准测试揭示VLM在多步视图规划中的不足,并设计迭代框架通过自我探索和视图图蒸馏将Qwen2.5-VL-7B的交互式视图规划准确率从2.5%提升至47.8%。

详情
AI中文摘要

VLM能否预测每个相机移动如何改变视图,并提前规划许多这样的移动?我们称这种能力为视图规划,需要(1)理解单个动作如何变换视图,以及(2)在多步规划中组合许多这样的变换以识别目标视图。我们在提出的ViewSuite中探测了这两种能力,ViewSuite是一个基于真实ScanNet场景的3D点云环境。在13个前沿VLM中,出现了一个关键的规划差距:它们具备基本的视图-动作知识,但无法在多步规划中组合这些知识,并且随着视点距离的增加,差距扩大。为了缩小这一差距,我们提出了一个迭代框架,交替进行自我探索和视图图蒸馏。关键洞察是,所有探索轨迹,无论其结果如何,共同形成一个视图图,紧凑地捕捉了场景中视点如何连接。将这个图蒸馏到多样化的监督任务中,重塑了策略分布,并克服了使纯RL停滞的稀疏奖励。这将Qwen2.5-VL-7B在交互式视图规划上的准确率从2.5%提升到47.8%,超过了GPT-5.4 Pro(18.5%)和Gemini 3.1 Pro(21.4%)。自我探索成为VLM在3D空间中主动推理和规划的一条有前景的路径。

英文摘要

Can VLMs predict how each camera move changes the view, and plan many such moves ahead? We call this capability view planning, requiring (1) understanding how a single action transforms the view, and (2) composing many such transformations across multi-turn plans to identify a target view. We probe both abilities in our proposed ViewSuite, a 3D point-cloud environment on real ScanNet scenes. Across 13 frontier VLMs, a critical planning gap emerges: they possess basic view-action knowledge but fail to compose it across multi-turn plans, with the gap widening as viewpoint distance grows. To close this gap, we propose an iterative framework that alternates self-exploration with view graph distillation. The key insight is that all exploration trajectories, regardless of their outcome, collectively form a view graph that compactly captures how viewpoints connect across a scene. Distilling this graph into diverse supervised tasks reshapes the policy distribution and overcomes the sparse rewards that stall pure RL. This improves Qwen2.5-VL-7B from 2.5% to 47.8% on interactive view planning, surpassing GPT-5.4 Pro (18.5%) and Gemini 3.1 Pro (21.4%). Self-exploration emerges as a promising path toward VLMs that can actively reason and plan in 3D space. Code and Data are at https://viewsuite.github.io.
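The view-graph idea can be sketched as follows. The example assumes that each exploration step is recorded as a `(view, action, next_view)` triple with hashable view identifiers, and that one supervised target derived from the graph is a shortest action plan between two views; these representations and the function names are hypothetical and are not taken from the ViewSuite code.

```python
from collections import defaultdict, deque

def build_view_graph(trajectories):
    """Merge steps from all rollouts, successful or not, into one view graph."""
    graph = defaultdict(dict)
    for trajectory in trajectories:
        for view, action, next_view in trajectory:
            graph[view][action] = next_view
    return graph

def shortest_plan(graph, start, goal):
    """Breadth-first search for the shortest action sequence from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        view, plan = queue.popleft()
        if view == goal:
            return plan
        for action, nxt in graph.get(view, {}).items():
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, plan + [action]))
    return None  # goal not reachable from the explored part of the scene
```

A plan recovered this way can stitch together steps from different rollouts, so a trajectory that never reached its own target still contributes edges to supervision for other start and goal pairs.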

2606.16533 2026-06-17 cs.AI cs.CV 版本更新

Kairos: A Native World Model Stack for Physical AI

Kairos: 面向物理AI的原生世界模型栈

Kairos Team, Fei Wang, Shan You, Qiming Zhang, Tao Huang, Zuoyi Fu, Zhisheng Zheng, Yunlong Xi, Feng Lv, Xiaoming Wu, Zeyu Liu, Cong Wan, Pu Li, Ruiqing Yang, Xiaoou Li, Wei Wang, Kangkang Zhu, Yuwei Zhang, Shi Fu, Zheng Zhang, Xiaoning Wu, Xuzeng Fan, Dacheng Tao, Xiaogang Wang

发表机构 * Kairos Team(Kairos团队)

AI总结 提出Kairos原生世界模型栈,通过跨具身数据课程、混合线性时间注意力架构和部署感知系统协同设计,实现世界知识获取、长时程状态保持与高效执行,在具身世界模型等基准上达到顶级性能。

详情
AI中文摘要

世界模型正从被动视觉生成器转变为物理AI的基础性、可操作基础设施:它们必须从异构经验中原生获取世界知识,在长时间跨度内维持持久状态,并在实际部署约束下高效执行。我们引入Kairos,一个围绕这些需求设计的原生世界模型栈。(1) Kairos通过开创由跨具身数据课程指导的原生预训练范式来学习世界,该课程将开放世界视频、人类行为数据和机器人交互组织成渐进式发展路径。(2) Kairos通过配备混合线性时间注意力的原生统一架构来维持世界,该架构中滑动窗口注意力捕捉局部动态,扩张滑动窗口捕捉中程依赖,门控线性注意力维持持久全局记忆。我们建立了形式化理论界限,证明这种时间分解严格限制了误差累积,从数学上保证了跨扩展时间范围的状态传播。(3) Kairos通过整合部署感知系统协同设计来运行世界,支持在服务器和消费级硬件上为真实世界的观察-行动-反馈循环生成低延迟展开。在具身世界模型、长时程和动作策略基准上的实验表明,Kairos在实现顶级性能的同时提供了强大的效率-能力权衡。这些结果共同将Kairos定位为未来自进化物理智能的凝聚性操作基础。

英文摘要

World models are transitioning from passive visual generators to foundational, operational infrastructure for Physical AI: they must natively acquire world knowledge from heterogeneous experience, maintain persistent states over long horizons, and execute efficiently within real deployment constraints. We introduce Kairos, a native world model stack designed around these requirements. (1) Kairos learns the world by pioneering a Native Pre-training Paradigm governed by a Cross-Embodiment Data Curriculum, which organizes open-world videos, human behavioral data, and robot interactions into a progressive developmental pathway. (2) Kairos maintains the world by unifying world understanding, generation, and prediction within a Native Unified Architecture equipped with Hybrid Linear Temporal Attention, where sliding-window attention captures local dynamics, dilated sliding windows capture mid-range dependencies, and gated linear attention maintains persistent global memory. We establish formal theoretical bounds demonstrating that this temporal factorization strictly limits error accumulation, mathematically guaranteeing state propagation across extended horizons. (3) Kairos runs the world by incorporating a Deployment-Aware System Co-Design to support low-latency rollout generation on server and consumer-grade hardware for real-world observation-action-feedback loops. Experiments on embodied world-model, long-horizon, and action-policy benchmarks show that Kairos achieves top-level performance while offering a strong efficiency-capability trade-off. Together, these results position Kairos as a cohesive operational foundation for future self-evolving physical intelligence.
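The difference between the sliding-window and dilated sliding-window components can be made concrete with attention masks. This is a minimal sketch assuming causal attention over a 1D token sequence; the window size and dilation values are arbitrary illustrations, and the gated linear attention branch is left out entirely.

```python
def sliding_window_mask(seq_len, window, dilation=1):
    """Build a causal mask where mask[i][j] is True if query i may attend to key j.

    With dilation=1 each query sees its `window` most recent positions;
    with dilation>1 it sees `window` positions spaced `dilation` steps apart.
    """
    mask = [[False] * seq_len for _ in range(seq_len)]
    for i in range(seq_len):
        for k in range(window):
            j = i - k * dilation
            if j >= 0:
                mask[i][j] = True
    return mask
```

With the same number of attended positions per query, the dilated mask reaches further back in time, which is how such a layer can cover mid-range dependencies at the same per-token cost as the local one.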

3. 图像识别、检索与分类 5 篇

2606.17222 2026-06-17 cs.CV 新提交

Quantum Enhanced Multi-Scale CNN with Bi-directional Mamba for Crop Field Analysis

量子增强多尺度CNN与双向Mamba用于农田分析

Mohammad Salman Khan, Ehsan Atoofian, Saad B. Ahmed

发表机构 * Lakehead University(湖首大学)

AI总结 提出BiSpectral Mamba框架,结合多尺度CNN、光谱注意力、双向状态空间建模和量子启发学习,解决高光谱图像分类中的高维性、类不平衡等问题,在UAVHSI-Crop数据集上达到84.83%准确率。

详情
AI中文摘要

高光谱图像(HSI)作物分析对于精准农业至关重要,因为它捕获了丰富的光谱和空间信息,用于准确的作物监测和评估。然而,由于高光谱维度、空间复杂性、类别不平衡以及有限的标记样本,HSI分类仍然具有挑战性。为了解决这些问题,本文提出了一种基于BiSpectral Mamba的框架,该框架结合了多尺度卷积特征提取、光谱注意力、双向状态空间建模和量子启发学习。多尺度CNN骨干首先通过跨多个分辨率的特征融合提取层次化的空间-光谱表示。然后,光谱注意力机制强调信息丰富的波段,同时抑制冗余和噪声通道。精炼后的特征由BiSpectral Mamba模块处理,该模块通过将高光谱特征图建模为序列标记,在正向和反向方向上捕获长距离依赖关系。此外,还引入了类加权优化和特征融合策略,以提高训练稳定性并缓解类别不平衡。在UAVHSI-Crop数据集上的实验评估证明了所提框架的有效性,总体准确率达到84.83%。结果表明,集成卷积、注意力机制和状态空间建模组件能够实现稳健的空间-光谱特征学习,用于作物分类。所提框架还展示了在更广泛的农业和遥感应用中的潜力,包括作物病害检测、产量预测和土壤湿度估计,同时突出了结构化状态空间和量子启发架构在高光谱图像分析中的有效性。

英文摘要

Hyperspectral image (HSI) crop analysis is essential for precision agriculture because it captures rich spectral and spatial information for accurate crop monitoring and assessment. However, HSI classification remains challenging due to high spectral dimensionality, spatial complexity, class imbalance, and limited labeled samples. To address these challenges, this paper proposes a BiSpectral Mamba-based framework that combines multi-scale convolutional feature extraction, spectral attention, bidirectional state-space modeling, and quantum-inspired learning. A multi-scale CNN backbone first extracts hierarchical spatial-spectral representations through feature fusion across multiple resolutions. A spectral attention mechanism then emphasizes informative bands while suppressing redundant and noisy channels. The refined features are processed by a BiSpectral Mamba module that captures long-range dependencies in both forward and backward directions by modeling hyperspectral feature maps as sequential tokens. In addition, class-weighted optimization and feature fusion strategies are incorporated to improve training stability and mitigate class imbalance. Experimental evaluation on the UAVHSI-Crop dataset demonstrates the effectiveness of the proposed framework, achieving an overall accuracy of 84.83%. The results show that integrating convolutional, attention-based, and state-space modeling components enables robust spatial-spectral feature learning for crop classification. The proposed framework also shows potential for broader agricultural and remote sensing applications, including crop disease detection, yield prediction, and soil moisture estimation, while highlighting the effectiveness of structured state-space and quantum-inspired architectures for hyperspectral image analysis.
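The abstract does not say how the class weights are chosen. One common choice, assumed here purely for illustration, is inverse-frequency ("balanced") weighting, where class c with n_c of N samples among K classes gets weight N / (K * n_c). The function names and the crop labels in the usage note are hypothetical.

```python
import math
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: rarer classes receive larger weights."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean negative log-likelihood; probs is a list of {class: probability} dicts."""
    total = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)
```

For a label list with 6 "wheat", 3 "corn" and 1 "weed" pixels, the weights are 10/18, 10/9 and 10/3, so an error on the rare class costs six times as much as an error on the most frequent one.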

2606.17242 2026-06-17 cs.CV 新提交

Landsat-Sentinel-2 Algal Bloom Mapping Using Vision Transformers: Model Description, Implementation, and Examples

基于视觉Transformer的Landsat-Sentinel-2藻华制图:模型描述、实现与示例

Thainara Lima, Vitor Martins

发表机构 * Department of Agricultural & Biological Engineering, Mississippi State University(密西西比州立大学农业与生物工程系)

AI总结 提出首个基于视觉Transformer的沿海藻华制图方法,利用Landsat-Sentinel-2 30米分辨率影像,通过全局分布数据集和多种架构对比,证明Swin Transformer在云/耀斑条件下优于传统方法,实现高精度碎片化藻华检测。

详情
AI中文摘要

沿海藻华监测需要频繁、空间详细且全球一致的观测,这由Landsat-8/9和Sentinel-2 A/B/C提供。这些任务共同提供了超过十年的中等分辨率多光谱影像,每2-3天覆盖近全球,能够检测粗分辨率海洋水色传感器无法分辨的碎片化藻华结构。然而,由于光谱覆盖有限且缺乏统一的反射率产品,它们在水生环境中的应用仍然具有挑战性。作为传统生物光学方法的替代,基于深度学习的图像分类提供了一种数据驱动的方法,可以克服许多这些限制。本研究首次成功实现了基于视觉Transformer的沿海藻华制图,使用30米Landsat-Sentinel-2影像。在全球范围内易发生藻华的沿海热点区域生成了一个全球分布的藻华斑块数据集。将四种基于Transformer的架构与标准卷积基线进行比较,用于精细尺度藻华检测,并在不同光学水类型、大气和表面条件下进行评估。所有深度学习模型在检测漂浮藻华区域方面表现出强大能力,遗漏和误报误差为8-65%。在时间序列中的云和耀斑压力下,Swin Transformer优于传统的光谱指数方法(后者产生广泛的误报),有效避免了受云和耀斑影响的像素。与MODIS产品的进一步比较突出了更高空间分辨率在检测碎片化和不规则影响藻华方面的优势。我们的研究结果支持深度学习作为动态沿海环境中漂浮藻华中等分辨率一致监测的可靠工具。

英文摘要

Coastal algal bloom monitoring requires frequent, spatially detailed, and globally consistent observations, provided by Landsat-8/9 and Sentinel-2 A/B/C. Together, these missions offer over a decade of medium-resolution multispectral imagery with near-global coverage every 2-3 days, enabling the detection of fragmented bloom structures not resolvable by coarse ocean-color sensors. However, their use in aquatic environments remains challenging due to limited spectral coverage and a lack of harmonized reflectance products. As an alternative to traditional bio-optical methods, deep learning-based image classification offers a data-driven approach that can overcome many of these limitations. This study presents the first successful implementation of vision transformer-based coastal algal bloom mapping using 30-m Landsat-Sentinel-2 images. A globally distributed bloom patch dataset was generated across bloom-prone coastal hotspots worldwide. Four transformer-based architectures were compared against a standard convolutional baseline for fine-scale bloom detection, and assessed under different optical water types and atmospheric and surface conditions. All deep learning models showed strong capabilities in detecting floating bloom areas, with omission and commission errors of 8-65%. Under cloud and glint stress in a time series, the Swin Transformer outperformed traditional spectral-index approaches, which produced widespread false positives, effectively avoiding cloud- and glint-affected pixels. Comparisons with MODIS-derived products further highlighted the benefits of higher spatial resolution in detecting fragmented and irregularly affected blooms. Our findings support deep learning as a reliable tool for medium-resolution, consistent monitoring of floating algal blooms in dynamic coastal environments.

2606.17406 2026-06-17 cs.CV cs.AI 新提交

Graph Neural Networks for Semi-Supervised Image Classification with Multi-Feature Aggregation

基于多特征聚合的图神经网络用于半监督图像分类

Marina Chagas Bulach Gapski, Vinicius Atsushi Sato Kawai, Gustavo Rosseto Leticio, Lucas Pascotti Valem, Daniel Carlos Guimarães Pedronette, Mohand Said Allili

发表机构 * Department of Statistics, Applied Mathematics, and Computing (DEMAC), São Paulo State University (UNESP)(圣保罗州立大学统计、应用数学与计算系) Institute of Mathematics and Computer Science (ICMC), University of São Paulo (USP)(圣保罗大学数学与计算机科学研究所) Department of Computer Science and Engineering, University of Quebec in Outaouais (UQO)(魁北克大学乌塔韦校区计算机科学与工程系)

AI总结 提出一种结合多种特征提取器和图表示进行半监督图像分类的GNN方法,通过流形学习和排名聚合提升分类精度。

详情
AI中文摘要

特征提取涉及识别和提取显著特征或模式,包括边缘、纹理、形状和颜色属性。当代特征提取器主要利用深度学习架构,如卷积神经网络(CNN)和视觉变换器(VIT)。文献中各种特征提取器的可用性提供了广泛的特征表示。从图像中提取的特征取决于具体应用、所选提取器及其配置。因此,通过组合不同的提取器来整合互补信息,为提高性能提供了一种有前景的方式。图神经网络(GNN),特别是图卷积网络(GCN),已成为半监督图像分类的强大且广泛采用的方法,因为它们有效利用标记和未标记数据,同时利用捕捉样本间关系的底层图结构。本研究提出了一种新颖的GNN方法,适用于标记数据稀缺的场景,通过整合来自不同提取器的多样化特征和图表示集进行分类。进行了实验研究,包括不同特征和图提取器的组合,以及排名聚合策略。实验发现强调了本研究的主要贡献,表明特征和图表示的策略性组合,结合流形学习用于图处理,在大多数实验条件下显著提高了分类精度。此外,利用排名聚合技术整合来自不同提取器的特征,被证明能增强分类精度。

英文摘要

Feature extraction involves the identification and extraction of salient characteristics or patterns, including edges, textures, shapes, and color attributes. Contemporary feature extractors predominantly leverage deep learning architectures, such as Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). The availability of diverse feature extractors in the literature provides a wide range of feature representations. Features extracted from an image depend on the specific application, the chosen extractor, and its configuration. Therefore, integrating complementary information by combining distinct extractors offers a promising way to enhance performance. Graph Neural Networks (GNNs), particularly Graph Convolutional Networks (GCNs), have emerged as powerful and widely adopted approaches for semi-supervised image classification, as they effectively leverage both labeled and unlabeled data while exploiting the underlying graph structures that capture relationships among samples. This study proposes a novel approach for GNNs in scenarios where labeled data is scarce, by integrating diverse sets of feature and graph representations derived from various extractors in classification scenarios. Experimental investigations were conducted, encompassing combinations of distinct feature and graph extractors, as well as rank aggregation strategies. The primary contributions of this work are underscored by the experimental findings, which demonstrate that the strategic combination of feature and graph representations, coupled with the application of manifold learning for graph processing, leads to significant improvements in classification accuracy across the majority of experimental conditions. Furthermore, the utilization of rank aggregation techniques to integrate features from different extractors was shown to enhance classification accuracy.

2606.17436 2026-06-17 cs.CV 新提交

UoU: A Universal Fingerprint Foundation Model Based on Large-Scale Unsupervised Learning

UoU:基于大规模无监督学习的通用指纹基础模型

Xiongjun Guan, Jianjiang Feng, Jie Zhou

发表机构 * Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 提出UoU指纹基础模型,通过多级表示层次和结合监督、弱监督与无监督的训练策略,实现跨传感器、质量和应用的通用特征提取。

详情
AI中文摘要

指纹识别仍然由特定任务流水线主导,其中增强、结构解析、对齐和匹配被独立优化。尽管在狭窄场景中有效,但这种设计限制了表示在传感器、质量和下游应用中的重用。因此,我们提出UoU,即“基于大规模无监督学习的通用指纹基础模型”,它将指纹特征提取重新定义为领域特定的基础模型问题。UoU围绕一个多级表示层次组织,涵盖图像恢复、结构场、语义标记、点级生物特征实体和紧凑的全局描述符。其训练方案结合了在精确标注上的监督冷启动、大规模弱监督细化以及大规模无监督巩固,后两个阶段在大规模训练中迭代,使得弱监督拓宽语义覆盖,而无监督学习稳定对应关系、不变性和表示几何。UoU不将指纹图像视为通用纹理,而是利用领域特定的对称性和中间结构,包括方向流、周期性脊模式、稀疏生物特征实体和空间等变性。该框架有意与架构无关:虽然本研究包含一个基于transformer的结构化预测初始实例,但更广泛的设计支持多任务学习、可扩展模型配置以及针对匹配、对齐、增强、配准和相关指纹应用的下游专业化。本文介绍了UoU的技术动机、系统设计和验证协议,部分基线实现已公开于 https://github.com/XiongjunGuan/UoU。

英文摘要

Fingerprint recognition is still dominated by task-specific pipelines, where enhancement, structural parsing, alignment, and matching are optimized in isolation. Although effective in narrow settings, this design limits representation reuse across sensors, qualities, and downstream applications. We therefore present UoU, short for "a Universal fingerprint foundation model based on large-scale Unsupervised learning," which reframes fingerprint feature extraction as a domain-specific foundation-model problem. UoU is organized around a multi-level representation hierarchy spanning image restoration, structural fields, semantic tokens, point-level biometric entities, and compact global descriptors. Its training recipe combines a supervised cold start on precise annotations, large-scale weakly supervised refinement, and large-scale unsupervised consolidation, with the latter two stages iterated during large-scale training so that weak supervision broadens semantic coverage while unsupervised learning stabilizes correspondences, invariances, and representation geometry. Rather than treating fingerprint imagery as generic texture, UoU exploits domain-specific symmetries and intermediate structure, including orientation flow, periodic ridge patterns, sparse biometric entities, and spatial equivariance. The framework is intentionally architecture-agnostic: while the present study includes an initial transformer-based structured-prediction instantiation, the broader design supports multi-task learning, scalable model configurations, and downstream specialization for matching, alignment, enhancement, registration, and related fingerprint applications. This paper presents the technical motivation, system design, and validation protocol of UoU, and part of the baseline implementation is publicly available at https://github.com/XiongjunGuan/UoU.

2602.04170 2026-06-17 cs.CV 版本更新

Partial Ring Scan: Revisiting Scan Order in Vision State Space Models

部分环形扫描:重新审视视觉状态空间模型中的扫描顺序

Yi-Kuan Hsieh, Kuan-Chuan Peng, Xin li, Ming-Ching Chang, Yu-Chee Tseng, Jun-Wei Hsieh

AI总结 提出PRISMamba,通过环形扫描和部分通道滤波提升视觉状态空间模型的旋转鲁棒性和效率,在ImageNet-1K上达到84.5% Top-1精度。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

状态空间模型(SSM)已成为视觉任务中注意力机制的高效替代方案,提供线性时间序列处理并具有竞争性精度。然而,视觉SSM需要将2D图像沿预定义扫描顺序序列化为1D token序列,这一因素常被忽视。我们证明扫描顺序通过改变空间邻接性、破坏对象连续性以及加剧旋转等几何变换下的性能退化,对性能产生关键影响。我们提出部分环形扫描Mamba(PRISMamba),一种旋转鲁棒的遍历方法,将图像划分为同心环,在每个环内进行顺序无关的聚合,并通过一组短径向SSM跨环传播上下文。通过部分通道滤波进一步提高效率,仅将信息最丰富的通道路由到循环环路径,其余通道保留在轻量级残差分支上。在ImageNet-1K上,PRISMamba以3.9G FLOPs和A100上3054 img/s的速度达到84.5% Top-1精度,在准确率和吞吐量上均优于VMamba,且所需FLOPs更少。在旋转下,PRISMamba保持性能,而固定路径扫描下降1~2%。这些结果突显了扫描顺序设计以及通道滤波,作为视觉SSM中准确性、效率和旋转鲁棒性的关键且未被充分探索的因素。代码将在接收后发布。

英文摘要

State Space Models (SSMs) have emerged as efficient alternatives to attention for vision tasks, offering linear-time sequence processing with competitive accuracy. Vision SSMs, however, require serializing 2D images into 1D token sequences along a predefined scan order, a factor often overlooked. We show that scan order critically affects performance by altering spatial adjacency, fracturing object continuity, and amplifying degradation under geometric transformations such as rotation. We present Partial RIng Scan Mamba (PRISMamba), a rotation-robust traversal that partitions an image into concentric rings, performs order-agnostic aggregation within each ring, and propagates context across rings through a set of short radial SSMs. Efficiency is further improved via partial channel filtering, which routes only the most informative channels through the recurrent ring pathway while keeping the rest on a lightweight residual branch. On ImageNet-1K, PRISMamba achieves 84.5% Top-1 with 3.9G FLOPs and 3,054 img/s on A100, outperforming VMamba in both accuracy and throughput while requiring fewer FLOPs. It also maintains performance under rotation, whereas fixed-path scans drop by 1-2%. These results highlight scan-order design, together with channel filtering, as a crucial, underexplored factor for accuracy, efficiency, and rotation robustness in Vision SSMs. Code will be released upon acceptance.
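
The ring partition and order-agnostic aggregation described in this abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: square rings defined by distance to the nearest border, mean pooling as the aggregator, and the function names `ring_index` and `ring_pool` are all assumptions made for illustration, and the radial SSMs and channel filtering are omitted.

```python
import numpy as np


def ring_index(height: int, width: int) -> np.ndarray:
    """Assign each pixel to a concentric square ring (0 = outermost)."""
    rows = np.arange(height)[:, None]
    cols = np.arange(width)[None, :]
    # The distance to the nearest image border defines the ring (assumed layout).
    return np.minimum(np.minimum(rows, height - 1 - rows),
                      np.minimum(cols, width - 1 - cols))


def ring_pool(feature_map: np.ndarray) -> np.ndarray:
    """Mean-pool an (H, W, C) feature map into one C-dim token per ring."""
    h, w, c = feature_map.shape
    rings = ring_index(h, w)
    n_rings = int(rings.max()) + 1
    tokens = np.zeros((n_rings, c))
    for r in range(n_rings):
        # The mean ignores the order of pixels inside a ring.
        tokens[r] = feature_map[rings == r].mean(axis=0)
    return tokens


rng = np.random.default_rng(0)
x = rng.normal(size=(8, 8, 3))
tokens = ring_pool(x)
# On a square input, a 90-degree rotation maps every ring onto itself.
rotated_tokens = ring_pool(np.rot90(x))
```

Because each ring is pooled without regard to pixel order, the ring tokens of a square image are unchanged by 90-degree rotations, which is the property a fixed raster scan lacks.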

4. 目标检测、分割与定位 11 篇

2606.17334 2026-06-17 cs.CV 新提交

FATE: Pillar Encoding and Frequency-Aware Training for Event-Based Object Detection

FATE: 基于柱状编码和频率感知训练的事件目标检测

Md Tawheedul Islam Bhuian, Kyoung-Don Kang

发表机构 * School of Computing, State University of New York at Binghamton(纽约州立大学宾汉姆顿分校计算学院)

AI总结 提出FATE框架,通过柱状编码保留事件流时间结构,并利用频率感知训练生成密集伪标签,实现高达200Hz的高时间分辨率目标检测,性能优于现有方法。

详情
AI中文摘要

事件相机是生物启发式传感器,异步捕获对数强度变化,在高速和高动态范围场景中具有固有优势。然而,事件流的稀疏和异步特性对现代深度学习架构构成了根本性挑战。为了与标准模型兼容,大多数现有方法将累积窗口划分为固定的时间子区间。虽然这种方法对空间处理有效,但这种内部离散化丢弃了细粒度的时间结构,并将推理限制在训练监督所施加的低时间频率下。为了解决这一限制,我们提出了FATE,一个基于新型柱状编码(PE)的统一框架。在目标频率决定的离散宏观累积窗口上操作时,PE避免了内部时间子区间划分。它将事件组织成空间柱,并通过投影到连续时间正交多项式基上来近似其窗口内演化。这种公式产生了一个L2最优表示,在密集伪图像中保留了丰富的时间动态,减轻了稀疏事件条件下的信息损失。为了充分利用这种表示,我们引入了频率感知训练(FAT),一种软均值教师课程,生成时间密集的伪标签,有效弥合了低频监督和高频推理之间的不匹配。大量实验表明,FATE能够跨架构范式泛化,并持续优于强基线。它能够在高达200Hz的高时间分辨率下实现鲁棒的目标检测,同时参数数量和推理延迟的开销最小。

英文摘要

Event cameras are bio-inspired sensors that asynchronously capture logarithmic intensity changes, offering inherent advantages in high-speed and high-dynamic-range scenarios. However, the sparse and asynchronous nature of event streams poses a fundamental challenge for modern deep learning architectures. To enable compatibility with standard models, most existing approaches partition the accumulation window into fixed temporal sub-bins. While effective for spatial processing, this internal discretization discards fine-grained temporal structure and constrains inference to the low temporal frequencies imposed by training supervision. To address this limitation, we propose FATE, a unified framework built upon a novel Pillar Encoding (PE). While operating over discrete macro-accumulation windows dictated by the target frequency, PE avoids internal temporal sub-binning. It organizes events into spatial pillars and approximates their intra-window evolution via projection onto a continuous-time orthogonal polynomial basis. This formulation yields an L2-optimal representation that retains rich temporal dynamics in a dense pseudo-image, mitigating information loss under sparse event conditions. To fully leverage this representation, we introduce Frequency-Aware Training (FAT), a soft mean-teacher curriculum that generates temporally dense pseudo-labels, effectively bridging the mismatch between low-frequency supervision and high-frequency inference. Extensive experiments demonstrate that FATE generalizes across architectural paradigms and consistently outperforms strong baselines. It enables robust object detection at high temporal resolutions up to 200 Hz, while incurring minimal overhead in parameter count and inference latency.
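
The idea of projecting each pillar's events onto an orthogonal polynomial basis can be sketched as follows. The abstract does not name the basis, so Legendre polynomials are an assumption here, as are the function name `pillar_encode`, the `(x, y, t, polarity)` event layout, and the default degree. The sketch shows only the projection step, not the FATE detector or its training.

```python
import numpy as np
from numpy.polynomial import legendre


def pillar_encode(events, height, width, t_start, t_end, degree=3):
    """Project per-pixel event trains onto Legendre polynomials.

    events: array-like of shape (N, 4) with columns (x, y, t, polarity).
    Returns an (H, W, degree + 1) pseudo-image of projection coefficients.
    """
    events = np.asarray(events, dtype=float)
    x = events[:, 0].astype(int)
    y = events[:, 1].astype(int)
    # Map timestamps from [t_start, t_end] to the Legendre domain [-1, 1].
    tau = 2.0 * (events[:, 2] - t_start) / (t_end - t_start) - 1.0
    basis = legendre.legvander(tau, degree)        # shape (N, degree + 1)
    # The integral of P_k^2 over [-1, 1] is 2 / (2k + 1), so this factor
    # turns inner products into least-squares (L2) projection coefficients.
    scale = (2.0 * np.arange(degree + 1) + 1.0) / 2.0
    contrib = events[:, 3:4] * basis * scale       # shape (N, degree + 1)
    image = np.zeros((height, width, degree + 1))
    np.add.at(image, (y, x), contrib)              # scatter-add into pillars
    return image


# One positive event at pixel (x=1, y=2), halfway through the window.
img = pillar_encode([[1, 2, 0.5, 1.0]], 4, 4, 0.0, 1.0, degree=2)
```

Each pixel ends up with a fixed-length coefficient vector regardless of how many events it received, so the result is a dense pseudo-image that a standard backbone can consume without temporal sub-binning.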

2606.17384 2026-06-17 cs.CV 新提交

Improving and Evaluating Hand-Object Interaction Detection

改进和评估手-物体交互检测

Ahmad Darkhalil, Dima Damen, David Fouhey

发表机构 * School of Computer Science, University of Bristol, Bristol, UK(布里斯托大学计算机科学学院) Computer Science and Electrical and Computer Engineering, New York University, NY, US(纽约大学计算机科学与电气与计算机工程系)

AI总结 提出HOI-DETR框架,将手-物体和物体-物体交互引入Co-DETR架构,在四个数据集上显著提升检测性能,mAP提升超过20个百分点。

Comments Project page: https://ahmaddarkhalil.github.io/HOI-DETR/

详情
AI中文摘要

理解手及其直接或通过工具交互的物体,是从动作感知到3D重建和机器人等任务的关键步骤。本文为手-物体交互(HOI)理解文献做出了多项贡献:(1)HOI-DETR,一种新框架,将手-物体和物体-物体交互引入Co-DETR架构,产生最先进的方法;(2)一个包含4个不同数据集的综合HOI评估套件,包括源自HD-EPIC数据集的视频基准和改善Hands23基准的新标注;(3)一个训练好的检查点,显著改进了Hands23、HOIST、FineBio和HD-EPIC上的最先进水平,包括在Hands23和FineBio上mAP提升超过20个百分点。我们的消融实验证实了每个模型组件的贡献。

英文摘要

Understanding hands and the objects they interact with, both directly and through tools, is a key step for tasks ranging from action perception to 3D reconstruction and robotics. Our paper provides several contributions to the Hand-Object Interaction (HOI) understanding literature: (1) HOI-DETR, a new framework that introduces hand-object and object-object interactions to the Co-DETR architecture to produce a state-of-the-art method; (2) a comprehensive HOI evaluation suite of 4 diverse datasets, including a video benchmark derived from the HD-EPIC dataset and fresh annotations that improve the Hands23 benchmark; and (3) a trained checkpoint that significantly improves the state of the art across Hands23, HOIST, FineBio, and HD-EPIC, including mAP gains of over 20 percentage points on Hands23 and FineBio. Our ablations confirm the contributions of each model component.

2606.17475 2026-06-17 cs.CV 新提交

StereoFactory: A Unified Merging Framework for Robust Stereo Matching

StereoFactory: 一种用于鲁棒立体匹配的统一合并框架

Xianda Guo, Pinhan Fu, Ruilin Wang, Wenke Huang, Mang Ye, Qin Zou

发表机构 * School of Computer Science, Wuhan University(武汉大学计算机学院) D-Star Robotics Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)

AI总结 提出StereoFactory,一种粗到细的进化框架,通过遗传算法选择模型子集和CMA-ES优化模块级路由,实现自适应模型合并,在多个基准上降低误差并显著减少训练时间。

详情
AI中文摘要

立体匹配通过在大规模数据集上训练的基础模型取得了进展,但这种范式存在可扩展性瓶颈:引入新数据需要昂贵的联合重新训练。模型合并提供了一种可扩展的事后替代方案,在源检查点可用后整合来自专门模型的知识。然而,现有的合并方法通常保留所有可用模型或依赖贪婪包含,这可能会保留有害的任务向量干扰。我们提出StereoFactory,一种用于自适应模型合并的粗到细进化框架。第一阶段采用遗传算法搜索模型子集的组合空间,确定哪些模型应该参与。第二阶段通过CMA-ES优化对所选任务向量进行架构自适应路由,并可选地进行模块级缩放,解决模块级知识专门化问题(不同功能模块对知识源表现出不同偏好)。在两个架构和四个基准上的实验表明,在相同检查点池下,StereoFactory始终达到最佳的四基准平均值,相对于最强的受控基线,在NMRF上将平均误差从3.80降至3.30,在FoundationStereo上从2.88降至2.19。事后搜索仅需要相应联合重新训练挂钟时间的2.7–3.7%。分析表明,知识贡献本质上是模块特定的,所选子集可以在架构间转移且性能下降最小。代码将在接收后公开发布于:https://github.com/XiandaGuo/StereoFactory。

英文摘要

Stereo matching has advanced through foundation models trained on large-scale datasets, yet this paradigm suffers from a scalability bottleneck: incorporating new data requires costly joint retraining. Model merging offers a scalable post-hoc alternative by integrating knowledge from specialized models after source checkpoints are available. However, existing merging methods typically retain all available models or rely on greedy inclusion, which can preserve harmful task-vector interference. We propose StereoFactory, a coarse-to-fine evolutionary framework for adaptive model merging. Stage 1 employs a genetic algorithm to search the combinatorial space of model subsets, determining which models should participate. Stage 2 addresses module-level knowledge specialization (different functional modules exhibit distinct preferences for knowledge sources) through CMA-ES optimization of architecture-adaptive routing over the selected task vectors, with optional module-level scaling. Experiments across two architectures and four benchmarks demonstrate that StereoFactory consistently achieves the best four-benchmark average under the same checkpoint pool, reducing the average error from 3.80 to 3.30 on NMRF and from 2.88 to 2.19 on FoundationStereo relative to the strongest controlled baseline. The post-hoc search requires only 2.7-3.7% of the corresponding joint-retraining wall-clock time. Analysis reveals that knowledge contributions are inherently module-specific, and selected subsets can transfer across architectures with minimal degradation. Code will be publicly released upon acceptance at: https://github.com/XiandaGuo/StereoFactory.
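
To make the merging step concrete, the sketch below applies a subset mask and per-module weights to task vectors (fine-tuned weights minus base weights). It covers only the final merge, under stated assumptions: the genetic-algorithm subset search and the CMA-ES optimization are not implemented, and `merge_task_vectors`, `selected`, and `scales` are hypothetical names standing in for whatever those searches would output.

```python
import numpy as np


def merge_task_vectors(base, experts, selected, scales):
    """Merge fine-tuned checkpoints into a base model via task vectors.

    base:     dict mapping module name -> np.ndarray of base weights
    experts:  list of dicts with the same keys as `base`
    selected: list of bools, True if that expert takes part in the merge
    scales:   dict mapping module name -> list of per-expert weights
    """
    merged = {}
    for name, base_weight in base.items():
        update = np.zeros_like(base_weight, dtype=float)
        for i, expert in enumerate(experts):
            if not selected[i]:
                continue  # excluded experts contribute no task vector
            task_vector = expert[name] - base_weight
            update += scales[name][i] * task_vector
        merged[name] = base_weight + update
    return merged


base = {"enc": np.array([1.0, 1.0]), "dec": np.array([0.0])}
experts = [
    {"enc": np.array([3.0, 1.0]), "dec": np.array([2.0])},
    {"enc": np.array([1.0, 5.0]), "dec": np.array([-4.0])},
    {"enc": np.array([9.0, 9.0]), "dec": np.array([9.0])},
]
selected = [True, True, False]                      # e.g. output of a subset search
scales = {"enc": [0.5, 0.25, 1.0], "dec": [1.0, 0.0, 1.0]}
merged = merge_task_vectors(base, experts, selected, scales)
```

Giving each module its own weight list is one simple way to express the abstract's point that different modules prefer different knowledge sources: here the decoder takes everything from the first expert and nothing from the second.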

2606.17561 2026-06-17 cs.CV 新提交

RT-Counter: Real-Time Text-Guided Open-Vocabulary Object Counting

RT-Counter:实时文本引导的开放词汇目标计数

Hao-Yuan Ma, Li Zhang, Zhiwei Zhu, Jie Gao

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院)

AI总结 提出实时文本引导开放词汇计数框架RT-Counter,通过视觉原型文本化模块和编织Transformer层,在保持高精度的同时实现实时推理,在FSC147上MAE为13.30,速度达112.48 FPS。

详情
AI中文摘要

文本引导的开放词汇目标计数(TOOC)旨在对自然语言描述指定的类别中的对象进行计数。尽管视觉-语言预训练模型已成功应用于TOOC任务,但在计数场景中仍面临细粒度空间理解和实时推理需求的挑战。为解决这些限制,本文提出一种实时TOOC框架,称为实时计数器(RT-Counter),它不仅实现了良好的计数精度,而且具有高计算效率。RT-Counter设计了一种新颖的视觉原型文本化(VPT)模块,该模块可以将学习到的视觉特征投影到文本特征空间,然后生成包含视觉原型难以捕获的抽象信息和文本难以描述的详细原型信息的特征,增强了对象级视觉-语言模型的计数能力。此外,RT-Counter集成了我们的编织Transformer(Weaformer)层,以极低的计算成本保持高描述能力。Weaformer层采用了一种新颖的混合注意力机制,可以高效地编织局部和全局视觉特征。在三个公共数据集上的大量实验表明,RT-Counter成功打破了TOOC中精度与速度的权衡。在FSC147上实现具有竞争力的MAE 13.30的同时,RT-Counter以112.48 FPS运行,比现有TOOC领先方法快7.4倍,参数效率高4倍以上。我们的工作旨在平衡TOOC中的高精度和实时性能。代码可在以下网址获取:https://github.com/Jason-Mar1/RT-Counter。

英文摘要

Text-guided open-vocabulary object counting (TOOC) aims to count objects belonging to the categories specified by natural language descriptions. Although vision-language pre-trained models have been successfully applied to TOOC tasks, they still struggle with fine-grained spatial understanding and real-time inference requirements in counting scenarios. To address these limitations, this paper proposes a real-time TOOC framework, called the Real-Time Counter (RT-Counter), that achieves not only good counting accuracy but also high computational efficiency. RT-Counter designs a novel Visual Prototype Textualization (VPT) module that can project learned visual features into a text feature space and then generate features containing the abstract information that is hard to capture with visual prototypes and the detailed prototype information that is difficult to describe in text, enhancing the object-level visual-language model's counting capabilities. Additionally, RT-Counter incorporates our Weaving Transformer (Weaformer) layers, maintaining high descriptive power at a fraction of the computational cost. The Weaformer layer adopts a novel hybrid attention mechanism that can efficiently weave together local and global visual features. Extensive experiments on three public datasets show that RT-Counter successfully breaks the accuracy-speed trade-off in TOOC. While achieving a competitive MAE of 13.30 on FSC147, RT-Counter operates at 112.48 FPS, making it 7.4x faster and over 4x more parameter-efficient than the existing leading methods in TOOC. Our work aims at balancing high accuracy and real-time performance in TOOC. Code is available at: https://github.com/Jason-Mar1/RT-Counter.

2606.17650 2026-06-17 cs.CV cs.CL 新提交

MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block

MambaCount: 基于空间稀疏状态空间对偶块的高效文本引导开放词汇目标计数

Hao-Yuan Ma, Li Zhang, Minjie Qiang, Jie Gao

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院)

AI总结 提出MambaCount框架,通过空间稀疏状态空间对偶块解决Mamba在非因果视觉任务中的双向依赖限制和空间token高熵问题,实现线性复杂度的开放词汇目标计数,在FSC-147上取得12.23的测试MAE。

详情
AI中文摘要

文本引导的开放词汇目标计数(TOOC)旨在估计由文本提示描述的目标数量,在具有大规模变化的密集场景中尤其具有挑战性。现有的TOOC方法主要依赖Transformer,其相对于图像分辨率的二次复杂度限制了可扩展性。Mamba因其线性复杂度提供了一种有前景的替代方案。然而,先前基于Mamba的方法存在两个主要限制。一方面,Mamba固有的因果公式限制了非因果视觉任务所需的双向空间依赖建模。另一方面,现有的基于Mamba的视觉模型往往忽略了空间token响应中无约束的高熵,这可能削弱局部细节和高频线索。为了解决这些限制,我们提出了MambaCount,一种基于空间稀疏状态空间对偶(S^4D)块的高效框架。具体来说,我们分析并重构了Mamba中隐藏状态的衰减动态,以缓解因果建模引入的依赖约束。此外,我们引入了空间token选择(STS)子块,以减少Mamba中空间token响应的无约束高熵。另外,我们设计了多粒度原型(MGP),以在不同语义级别识别类似目标的区域,改善跨模态对齐和可解释性。在FSC-147上的大量实验表明,MambaCount在无需二次查询的方法中达到了最先进的性能,测试MAE为12.23,同时保持了线性复杂度。

英文摘要

Text-guided Open-vocabulary Object Counting (TOOC) aims to estimate the number of objects described by text prompts, which is particularly challenging in dense scenes with large scale variations. Existing TOOC approaches predominantly rely on Transformers, whose quadratic complexity with respect to image resolution limits their scalability. Mamba offers a promising alternative due to its linear complexity. However, previous Mamba-based methods have two main limitations. On the one hand, the inherent causal formulation of Mamba constrains the bidirectional spatial dependency modeling required by non-causal vision tasks. On the other hand, existing Mamba-based vision models often overlook the unconstrained high entropy in the spatial token responses, which can weaken local details and high-frequency cues. To address these limitations, we propose MambaCount, an efficient framework built on the Spatial Sparse State Space Duality (S^4D) block. Specifically, we analyze and reconstruct the decay dynamics of hidden states in Mamba to alleviate the dependency constraints introduced by causal modeling. Moreover, we introduce a Spatial Token Selection (STS) sub-block to reduce the unconstrained high entropy in spatial token responses within Mamba. In addition, we design Multi-Granularity Prototypes (MGP) to identify object-like regions at different semantic levels, improving cross-modal alignment and interpretability. Extensive experiments on FSC-147 demonstrate that MambaCount achieves state-of-the-art performance among methods without secondary querying, obtaining a test MAE of 12.23, while retaining linear complexity.

2606.17966 2026-06-17 cs.CV 新提交

Reload-Mamba: Hierarchical Anti-Dilution State-Space Modeling for Multi-Class Semantic Segmentation

Reload-Mamba:用于多类语义分割的分层抗稀释状态空间建模

Sheng-Wei Chan, Hsin-Jui Pan, Jen-Shiun Chiang

发表机构 * Department of Electrical and Computer Engineering, Tamkang University(淡江大学电机与计算机工程系)

AI总结 提出Reload-Mamba框架,通过边界监督的局部细节先验、类不确定性感知的Reload门控和分层多级Reload机制,解决Mamba状态空间传播导致的响应稀释问题,在ADE20K、Cityscapes和PASCAL VOC 2012上取得优异性能。

Comments 23 pages, 4 figures, 17 tables. Code will be released soon

详情
AI中文摘要

基于Mamba的状态空间模型为高分辨率密集预测提供了线性时间的长程建模能力,但顺序状态空间传播会削弱多类语义分割中关键的边界敏感和细节敏感响应。我们提出Reload-Mamba,一种语义分割框架,通过三个分割特定设计解决这种传播导致的响应稀释问题:(i) 边界监督的局部细节先验,使用真实边界掩码显式训练,以识别需要响应恢复的区域;(ii) 类不确定性感知的Reload门控,将来自预重载辅助头的逐像素类熵作为额外的门控信号,该公式仅在多类密集预测下提供信息;(iii) 分层多级Reload机制,在三个解码器级别应用抗稀释细化,并自上而下融合恢复的表示。基于ConvNeXt-Tiny编码器、多尺度解码器和具有像素级方向注意力的四方向Mamba扫描,Reload-Mamba在ADE20K上达到47.9%单尺度(48.9%多尺度)mIoU,在Cityscapes上达到83.2%单尺度mIoU。在标准DeepLab风格协议下使用ResNet-101 + COCO预训练,Reload-Mamba在PASCAL VOC 2012 val上达到87.8% mIoU。控制消融实验表明,三个分割特定设计各自贡献了超出直接移植先前为二值化提出的抗稀释架构的性能,在ADE20K上相比直接移植基线累积提升了+2.2 mIoU。

英文摘要

Mamba-based state space models offer linear-time long-range modeling for high-resolution dense prediction, but sequential state-space propagation can attenuate boundary-sensitive and detail-sensitive responses that are critical in multi-class semantic segmentation. We propose Reload-Mamba, a semantic segmentation framework that addresses this propagation-induced response dilution through three segmentation-specific designs: (i) a boundary-supervised local detail prior that is explicitly trained with ground-truth boundary masks to identify regions requiring response restoration; (ii) a class-uncertainty-aware Reload Gate that incorporates per-pixel class entropy from a pre-reload auxiliary head as an additional gating signal, a formulation that is informative only under multi-class dense prediction; and (iii) a hierarchical multi-level Reload mechanism that applies anti-dilution refinement at three decoder levels and fuses the restored representations top-down. Built upon a ConvNeXt-Tiny encoder with a multi-scale decoder and four-directional Mamba scanning with pixel-wise directional attention, Reload-Mamba achieves 47.9% single-scale (48.9% multi-scale) mIoU on ADE20K and 83.2% single-scale mIoU on Cityscapes. With ResNet-101 + COCO pre-training under the standard DeepLab-style protocol, Reload-Mamba reaches 87.8% mIoU on PASCAL VOC 2012 val. Controlled ablations show that each of the three segmentation-specific designs contributes beyond a direct port of the prior anti-dilution architecture proposed for binarization, cumulatively improving over the direct-port baseline by +2.2 mIoU on ADE20K.
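
The per-pixel class entropy that the abstract feeds into the Reload Gate can be computed from auxiliary-head logits as sketched below. How the entropy enters the gate is not specified in the abstract, so only the signal itself is shown. The function name `pixel_entropy`, the `(C, H, W)` layout, and the normalization by `log C` are assumptions for illustration.

```python
import numpy as np


def pixel_entropy(logits: np.ndarray) -> np.ndarray:
    """Normalised per-pixel class entropy from (C, H, W) logits, in [0, 1]."""
    # Subtract the per-pixel maximum for a numerically stable log-softmax.
    shifted = logits - logits.max(axis=0, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=0, keepdims=True))
    probs = np.exp(log_probs)
    entropy = -(probs * log_probs).sum(axis=0)
    # Dividing by log(C) maps a uniform prediction to 1 (assumed convention).
    return entropy / np.log(logits.shape[0])


logits = np.zeros((4, 2, 2))   # four classes, uniform everywhere
logits[0, 0, 0] = 50.0         # one confidently classified pixel
uncertainty = pixel_entropy(logits)
```

With many classes the map takes a wide range of values and tends to peak where several classes compete, typically near object boundaries. With a single foreground/background decision it carries far less information, which may be what the abstract means by calling the formulation informative only under multi-class dense prediction.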

2510.23798 2026-06-17 cs.CV cs.AI 版本更新

A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras

一种基于几何和深度学习的可复现流水线,用于利用原位摄像头监测城市河流中的漂浮人为碎片

Gauthier Grimmer, Romain Wenger, Clément Flint, Germain Forestier, Gilles Rixhon, Valentin Chardon

AI总结 提出结合几何模型与深度学习的框架,利用固定摄像头连续量化监测城市河流漂浮碎片,并评估不同模型在复杂环境下的精度与速度,通过投影几何实现碎片尺寸估计。

详情
AI中文摘要

河流中漂浮人为碎片的扩散已成为一个紧迫的环境问题,对生物多样性、水质以及人类活动(如航行和娱乐)产生不利影响。本研究提出了一种新颖的方法框架,利用固定的原位摄像头监测上述废弃物。本研究提供了两个关键贡献:(i)利用深度学习对漂浮碎片进行连续量化和监测;(ii)在复杂环境条件下,识别出在精度和推理速度方面最合适的深度学习模型。这些模型在多种环境条件和学习配置下进行测试,包括与数据泄漏相关的偏差实验。此外,实现了一个几何模型,用于从二维图像估计检测对象的实际尺寸。该模型利用了相机的内参和外参特性。本研究结果强调了数据集构建协议的重要性,特别是在负样本图像的整合和时间泄漏的考虑方面。最后,证明了使用投影几何结合回归校正进行公制物体估计的可行性。该方法为开发稳健、低成本、自动化的城市水生环境监测系统铺平了道路。

英文摘要

The proliferation of floating anthropogenic debris in rivers has emerged as a pressing environmental concern, exerting a detrimental influence on biodiversity, water quality, and human activities such as navigation and recreation. The present study proposes a novel methodological framework for monitoring such debris using fixed in-situ cameras. This study provides two key contributions: (i) the continuous quantification and monitoring of floating debris using deep learning and (ii) the identification of the most suitable deep learning model in terms of accuracy and inference speed under complex environmental conditions. These models are tested in a range of environmental conditions and learning configurations, including experiments on biases related to data leakage. Furthermore, a geometric model is implemented to estimate the actual size of detected objects from a 2D image. This model takes advantage of both intrinsic and extrinsic characteristics of the camera. The findings of this study underscore the significance of the dataset constitution protocol, particularly with respect to the integration of negative images and the consideration of temporal leakage. In conclusion, the feasibility of metric object estimation using projective geometry coupled with regression corrections is demonstrated. This approach paves the way for the development of robust, low-cost, automated monitoring systems for urban aquatic environments.
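
As a rough illustration of metric size estimation from a 2D detection, the sketch below uses the basic pinhole relation. It is a simplification, not the paper's geometric model: it assumes the object is roughly parallel to the image plane at a known distance, it ignores the camera extrinsics and the regression correction mentioned in the abstract, and the function name and parameters are hypothetical.

```python
def metric_size(pixel_extent: float, distance_m: float,
                focal_length_px: float) -> float:
    """Pinhole estimate of an object's real extent in metres.

    pixel_extent:    bounding-box width or height in pixels
    distance_m:      camera-to-object distance in metres (assumed known)
    focal_length_px: focal length in pixels, from the camera intrinsics
    """
    if focal_length_px <= 0:
        raise ValueError("focal length must be positive")
    # Similar triangles: real size / distance = pixel size / focal length.
    return pixel_extent * distance_m / focal_length_px


# A 120-pixel-wide detection seen from 10 m with a 1500-pixel focal length.
width_m = metric_size(120.0, 10.0, 1500.0)
```

For a tilted camera looking down at a river surface, the distance varies across the image, which is one reason a model of this kind needs the extrinsic parameters as well.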

2601.18252 2026-06-17 cs.CV cs.AI cs.LG stat.ML 版本更新

Co-PLNet: A Collaborative Point-Line Network for Prompt-Guided Wireframe Parsing

Co-PLNet: 一种用于提示引导的线框解析的协作点线网络

Chao Wang, Xuanying Li, Cheng Dai, Jinglei Feng, Yuxiang Luo, Hao Qin, Yuqi Ouyang

AI总结 提出点线协作框架Co-PLNet,通过点线提示编码器交换空间线索,并利用交叉引导线解码器增强点线一致性,在Wireframe和YorkUrban数据集上提升线框解析的准确性和鲁棒性。

详情
AI中文摘要

线框解析旨在恢复线段及其连接点,以形成结构化的几何表示,用于同时定位与地图构建(SLAM)等下游任务。现有方法分别预测线和点,并在事后进行调和,导致不匹配和鲁棒性降低。我们提出Co-PLNet,一个点线协作框架,在两个任务之间交换空间线索,其中早期检测通过点线提示编码器(PLP-Encoder)转换为空间提示,该编码器将几何属性编码为紧凑且空间对齐的图。交叉引导线解码器(CGL-Decoder)随后通过基于互补提示的稀疏注意力细化预测,强制点线一致性和效率。在Wireframe和YorkUrban上的实验显示,准确性和鲁棒性持续改进,同时具有有利的实时效率,证明了我们在结构化几何感知中的有效性。我们的代码可在 https://github.com/GalacticHogrider/Co-PLNet 获取。

英文摘要

Wireframe parsing aims to recover line segments and their junctions to form a structured geometric representation useful for downstream tasks such as Simultaneous Localization and Mapping (SLAM). Existing methods predict lines and junctions separately and reconcile them post-hoc, causing mismatches and reduced robustness. We present Co-PLNet, a point-line collaborative framework that exchanges spatial cues between the two tasks, where early detections are converted into spatial prompts via a Point-Line Prompt Encoder (PLP-Encoder), which encodes geometric attributes into compact and spatially aligned maps. A Cross-Guidance Line Decoder (CGL-Decoder) then refines predictions with sparse attention conditioned on complementary prompts, enforcing point-line consistency and efficiency. Experiments on Wireframe and YorkUrban show consistent improvements in accuracy and robustness, together with favorable real-time efficiency, demonstrating our effectiveness for structured geometry perception. Our code is available at https://github.com/GalacticHogrider/Co-PLNet.

2602.14771 2026-06-17 cs.CV cs.AI cs.LG cs.MM cs.NE 版本更新

GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

GOT-JEPA:基于联合嵌入预测架构的通用目标跟踪与模型自适应及遮挡处理

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

AI总结 提出GOT-JEPA框架,通过预测跟踪模型而非图像特征来提升泛化能力,并设计OccuSolver增强遮挡感知,在七个基准上验证了有效性。

Comments Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). This research focuses on learning model adaptation for adverse and dynamic environments, as well as fine-grained occlusion perception for tracking

详情
Journal ref
IEEE Transactions on Circuits and Systems for Video Technology 2026
AI中文摘要

人类视觉系统通过整合当前观测与先前观测信息、适应目标和场景变化、以及精细推理遮挡来跟踪物体。相比之下,最近的通用目标跟踪器通常针对训练目标进行优化,这限制了在未见场景中的鲁棒性和泛化能力,并且它们的遮挡推理仍然粗糙,缺乏对遮挡模式的详细建模。为了解决这些在泛化和遮挡感知方面的局限性,我们提出了GOT-JEPA,一个模型预测预训练框架,将JEPA从预测图像特征扩展到预测跟踪模型。给定相同的历史信息,教师预测器从干净的当前帧生成伪跟踪模型,学生预测器学习从当前帧的损坏版本预测相同的伪跟踪模型。这种设计提供了稳定的伪监督,并明确训练预测器在遮挡、干扰和其他不利观测下产生可靠的跟踪模型,从而提高了对动态环境的泛化能力。基于GOT-JEPA,我们进一步提出了OccuSolver来增强目标跟踪的遮挡感知。OccuSolver调整了一个以点为中心的点跟踪器,用于目标感知的可见性估计和详细的遮挡模式捕获。在跟踪器迭代生成的目标先验条件下,OccuSolver逐步细化可见性状态,增强遮挡处理,并产生更高质量的参考标签,逐步改进后续模型预测。在七个基准上的广泛评估表明,我们的方法有效增强了跟踪器的泛化能力和鲁棒性。

英文摘要

The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.

2606.14081 2026-06-17 cs.CV cs.AI cs.LG eess.IV updated version

Clay-CNN Hybrids: Leveraging Geospatial Foundation Models as Auxiliary Context for Landslide Detection

Huong Binh Vu

Affiliations * Harvard University

AI summary To address extreme class imbalance in landslide detection, the authors inject the geospatial foundation model Clay v1.5 as auxiliary context at the U-Net bottleneck; the hybrid reaches 64.5% F1 on the Landslide4Sense benchmark, outperforming the Clay-only and U-Net baselines.

English abstract

Rapid post-event landslide mapping is essential for disaster response but remains difficult to automate due to extreme class imbalance. This study evaluates whether Clay v1.5, a Geospatial Foundation Model (GFM), can improve pixel-level landslide segmentation on the Landslide4Sense (L4S) benchmark, which contains 3,799 training chips with 14 Sentinel-2 and terrain bands and approximately 2% positive pixels. We compare three strategies: Clay as the primary encoder with multi-scale residual terrain fusion, a U-Net backbone augmented with Clay semantic context at the bottleneck, and a standard U-Net baseline. The hybrid U-Net + Clay model with two-stage Low-Rank Adaptation (LoRA) achieved the best test F1 of 64.5 ± 1.8% over three seeds, surpassing the Clay-only backbone (55.2 ± 3.6%) and the U-Net baseline (59.9%). Clay as a standalone encoder underperformed the U-Net due to the absence of multi-scale skip connections, but its pretrained representations consistently improved performance when injected as auxiliary context. These findings suggest that GFMs are most effective for landslide detection when they complement spatially detailed convolutional architectures rather than replace them.
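The abstract reports F1 rather than accuracy, and the roughly 2% positive-pixel rate is a likely reason. The sketch below works through that arithmetic on a toy chip; the function name `pixel_f1` and all counts are illustrative assumptions, not values from the paper.

```python
def pixel_f1(tp, fp, fn):
    """F1 from pixel counts; returns 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Toy chip (hypothetical): 1,000 pixels, 20 of them landslide (about 2% positives).
total, positives = 1000, 20

# A model that predicts "no landslide" everywhere looks accurate but finds nothing.
all_negative_accuracy = (total - positives) / total
all_negative_f1 = pixel_f1(tp=0, fp=0, fn=positives)

# A model that finds 15 of the 20 landslide pixels with 10 false alarms.
detector_f1 = pixel_f1(tp=15, fp=10, fn=5)
```

In this toy case the trivial all-negative predictor scores high accuracy while its F1 is zero, which is why F1 is the more informative metric under this kind of imbalance.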

2606.15937 2026-06-17 cs.CV updated version

GOOSE-M2F: Adapting Mask2Former for High-Fidelity, Long-Tailed Fine-Grained Semantic Segmentation in Unstructured Outdoor Terrain

Jyothiraditya Lingam, Nikhileswara Rao Sulake, Sai Manikanta Eswar Machara

Affiliations * Rajiv Gandhi University of Knowledge Technologies, Nuzvid, India

AI summary To tackle long-tailed, fine-grained semantic segmentation in unstructured outdoor terrain, GOOSE-M2F combines 200 object queries, a feature refinement module, and an auxiliary supervision head with a multi-stage training strategy, reaching 70.08% composite mIoU on the GOOSE benchmark.

Comments This solution placed 3rd in the GOOSE 2D Fine-Grained Semantic Segmentation (FGSS) Challenge at ICRA 2026

English abstract

We present GOOSE-M2F, a task-specific adaptation of Mask2Former for the GOOSE 2D Fine-Grained Semantic Segmentation (FGSS) Challenge at ICRA 2026. The GOOSE benchmark spans 64 fine-grained classes across unstructured outdoor terrain with a severely long-tailed distribution, where rare classes occupy fewer than 50 pixels per image. We extend the Swin-Large Mask2Former baseline with three targeted contributions: (1) 200 object queries to eliminate representational saturation; (2) a Feature Refinement Module (FRM) combining ASPP-lite and CBAM dual-attention; and (3) an Auxiliary Supervision Head that delivers direct per-pixel gradients for rare classes. A multi-stage training strategy pairs Distribution-Balanced loss, Rare-Class Copy-Paste augmentation, dynamic IoU-aware re-weighting, and EMA. At inference, a dense sliding-window engine with 2D Gaussian kernel blending and 4-scale TTA adds +10.57%. GOOSE-M2F achieves 70.08% Official Composite mIoU (63.55% fine, 76.61% coarse), placing 3rd on the GOOSE 2D FGSS leaderboard. Code and trained models are publicly available at GitHub: https://github.com/Aditya-Lingam-9000/GOOSE-M2F and Hugging Face: https://huggingface.co/XYZ9843/GOOSE-M2F.
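The sliding-window inference with Gaussian kernel blending mentioned above can be sketched in one dimension. This is a minimal illustration under stated assumptions, not the authors' engine: the names `gaussian_weights` and `blend_windows`, the `sigma_frac` value, and the `predict` callback are all hypothetical, and the 2D case would apply the same weighting along both axes.

```python
import math

def gaussian_weights(size, sigma_frac=0.25):
    """1D Gaussian window peaking at the window centre (sigma choice is an assumption)."""
    centre = (size - 1) / 2.0
    sigma = size * sigma_frac
    return [math.exp(-0.5 * ((i - centre) / sigma) ** 2) for i in range(size)]

def blend_windows(length, window, stride, predict):
    """Average overlapping window predictions, weighting window centres more heavily."""
    if length < window:
        raise ValueError("signal must be at least one window long")
    weights = gaussian_weights(window)
    acc = [0.0] * length
    norm = [0.0] * length
    starts = list(range(0, length - window + 1, stride))
    if starts[-1] + window < length:
        starts.append(length - window)  # extra window so the tail is covered
    for start in starts:
        scores = predict(start, window)  # per-position scores for this window
        for offset, (score, weight) in enumerate(zip(scores, weights)):
            acc[start + offset] += weight * score
            norm[start + offset] += weight
    return [a / n for a, n in zip(acc, norm)]

# Sanity check with a constant predictor: blending must return the same constant.
blended = blend_windows(11, 4, 3, lambda start, size: [1.0] * size)
```

Down-weighting window borders is the usual motivation for this kind of blending, since predictions near a crop boundary see less context.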

5. Video Understanding and Temporal Vision (6 papers)

2606.17279 2026-06-17 cs.CV new submission

Training LLMs with Reinforcement Learning over Digital Twin Representations for Reasoning-Intensive Surgical VideoQA

Yiqing Shen, Han Zhang, Mathias Unberath

Affiliations * Johns Hopkins University

AI summary Proposes a reinforcement learning framework that builds digital twin representations from surgical foundation models to decouple visual perception from reasoning, and adds hierarchical representations and a new reward; it achieves the best performance on three benchmarks.

English abstract

Surgical video question answering requires multi-step reasoning across semantic, spatial, and temporal dimensions. Existing methods architecturally compress videos into discrete token representations and couple visual perception with reasoning. This approach fragments continuous spatial-temporal relationships and has been shown to restrict multi-step reasoning capabilities. We introduce a reinforcement learning (RL) framework that trains large language models (LLMs) to decouple perception from reasoning by operating over digital twin representations constructed from surgical foundation models. Additionally, we introduce hierarchical representations across frame, temporal window, and procedure levels with probabilistic uncertainty estimates. Finally, we propose a novel reward that combines format validation with accuracy assessment through clinical plausibility evaluation and uncertainty-aware calibration for training. To demonstrate the capabilities of this approach, we introduce REAL-Colon-Reason, a colonoscopic benchmark with 2000 question-answer pairs across three complexity levels. We achieve state-of-the-art performance on REAL-Colon-Reason and two existing surgical VideoQA benchmarks REAL-Colon-VQA and EndoVis18-VQA.

2606.17298 2026-06-17 cs.CV new submission

Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins

Yiqing Shen, Hao Ding, Mathias Unberath

Affiliations * Johns Hopkins University

AI summary Proposes OR3, which converts video clips into structured action-driven digital twins (ActDTs), has an LLM generate hypothetical ActDTs for retrieval, and applies evidence-grounded refinement to reason over implicit queries; it outperforms the baselines on operating-room video retrieval.

English abstract

Text-to-video retrieval in operating rooms (OR) is an enabling technology for OR safety, as it allows stakeholders to retrieve and inspect recordings of specific events. Because the most safety-critical events may not follow the common structure, text-to-video retrieval can reach its full potential only if it handles implicit queries that require reasoning to identify the right video (e.g., the step right before clipping). Existing methods, however, rely on global embeddings that cannot reason over such queries. We propose OR3, a text-to-video retrieval method that converts clips into action-driven digital twins (ActDTs), grouping concurrent subject-action-object triplets under non-overlapping temporal intervals. Moreover, rather than cross-modal matching through paired encoders, OR3 performs imagination-based retrieval where an LLM generates hypothetical ActDTs from queries. This enables intra-modal matching via a single encoder trained with ActDT-tailored hard negatives. Finally, evidence-grounded refinement revises imagined ActDTs based on discrepancies with top candidates to capture procedure-specific patterns. We construct a benchmark from MM-OR with 276 implicit queries across four reasoning categories over 386 clips from robotic knee procedures. OR3 achieves 57.6 R@1 and 77.3 R@5, outperforming the strongest baseline. These results demonstrate that OR3 enables fine-grained discrimination between visually similar OR video clips through temporal action reasoning.
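One way to picture an action-driven digital twin is as a list of non-overlapping intervals, each holding concurrent subject-action-object triplets. The sketch below is an assumption about a plausible data layout, not the paper's schema: the class names, the `step_before` helper, and the toy clip content are all hypothetical. It shows how an implicit query such as "the step right before clipping" becomes a simple lookup once the clip is structured.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Triplet:
    subject: str
    action: str
    obj: str

@dataclass
class Interval:
    start: float   # seconds
    end: float     # seconds
    triplets: list  # Triplet items that occur concurrently in this interval

def step_before(intervals, action):
    """Return the interval right before the first one that contains `action`."""
    ordered = sorted(intervals, key=lambda interval: interval.start)
    for index, interval in enumerate(ordered):
        if any(t.action == action for t in interval.triplets):
            return ordered[index - 1] if index > 0 else None
    return None

# Hypothetical clip content, for illustration only.
clip = [
    Interval(0.0, 4.0, [Triplet("surgeon", "holds", "retractor")]),
    Interval(4.0, 9.0, [Triplet("surgeon", "positions", "clip applier"),
                        Triplet("assistant", "holds", "retractor")]),
    Interval(9.0, 12.0, [Triplet("surgeon", "clips", "vessel")]),
]
previous = step_before(clip, "clips")
```

A global clip embedding has no explicit notion of "before", whereas the interval ordering here makes the temporal relation directly queryable.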

2606.17615 2026-06-17 cs.CV cs.AI new submission

SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation

Edoardo Bianchi, Antonio Liotta

Affiliations * Free University of Bozen-Bolzano

AI summary Proposes the SkillMoV framework, which uses a Mixture-of-View Projector (MoVP) for proficiency estimation from multi-scenario, multi-view video; it reaches 50.17% accuracy on EgoExo4D in the Exos setting, surpassing the compared methods.

English abstract

Estimating human proficiency from video is a key challenge for automated skill assessment, with applications in sports coaching, music pedagogy, surgical training, and workplace learning. Existing approaches often focus on individual scenarios or rely on shared multi-view aggregation, limiting their ability to adapt to heterogeneous camera viewpoints and activity domains. We introduce SkillMoV, a unified, parameter-efficient framework for multi-scenario proficiency estimation from synchronized multi-view video. At its core, SkillMoV introduces a Mixture-of-View Projector (MoVP), which adapts the mixture-of-experts paradigm to camera-specific view features. MoVP is composed of four stages: (i) a Mixture-of-View soft router with twelve expert MLPs that learns view-dependent expert preferences without camera-identity supervision; (ii) cross-view attention to align synchronized cameras; (iii) learnable prototype anchoring to condition the representation on class-level reference vectors; and (iv) a prototype-conditioned gated projection that produces the final skill embedding. We evaluate SkillMoV on EgoExo4D across six skill domains and three separately trained view configurations: Ego, Exos, and Ego+Exos. SkillMoV reaches 50.17% overall accuracy in the Exos setting with a single model trained jointly across all scenarios, surpassing the strongest reported Exos result among the compared methods by 3.57 percentage points. In Ego+Exos, SkillMoV remains close to the best reported result in that setting (47.63% versus 48.20%). Ablations on the selected Exos configuration validate each component: MoV routing contributes +6.61 pp over attentive aggregation, cross-view attention +4.92 pp, prototype anchoring +4.07 pp, and stochastic view dropout +3.90 pp. Through LoRA adaptation, SkillMoV trains only 23.32% of its parameters and adds limited measured overhead relative to a LoRA-only baseline.
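The soft routing in stage (i) follows the general mixture-of-experts pattern: a router turns a view feature into softmax gates, and the output is the gate-weighted sum of all expert outputs. The sketch below shows that generic pattern only; the names `soft_route` and `router_weights`, the linear router, and the two toy experts are assumptions rather than details from the paper (which uses twelve expert MLPs).

```python
import math

def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def soft_route(view_feature, router_weights, experts):
    """Mix all expert outputs using softmax gates computed from the view feature."""
    logits = [sum(w * x for w, x in zip(row, view_feature)) for row in router_weights]
    gates = softmax(logits)
    outputs = [expert(view_feature) for expert in experts]
    dim = len(outputs[0])
    mixed = [sum(g * out[d] for g, out in zip(gates, outputs)) for d in range(dim)]
    return mixed, gates

# Two toy experts (hypothetical): identity and doubling.
experts = [lambda x: list(x), lambda x: [2.0 * v for v in x]]
# An all-zero router gives equal gates, so the output is the plain average.
mixed, gates = soft_route([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]], experts)
```

Because the gates depend only on the feature, such a router can learn view-dependent preferences without being told which camera produced the input.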

2606.17627 2026-06-17 cs.CV cs.AI new submission

Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition

Alessandro Sottovia, Alessandro Torcinovich, Oswald Lanz

Affiliations * Faculty of Engineering, Free University of Bozen-Bolzano

AI summary Proposes a zero-shot multi-agent framework that improves fine-grained egocentric action recognition through video segmentation, deliberation among heterogeneous VLM specialists, and Borda-count aggregation.

English abstract

Fine-grained action recognition in egocentric video is challenging for Vision-Language Models (VLMs): actions often differ only in small visual cues, and a single model tends to be biased toward a subset of these cues. We propose Divide, Deliberate, Decide, a fully local, zero-shot multi-agent framework in which (i) a VLM orchestrator chunks the video and proposes a top-k candidate label list per segment, (ii) an ensemble of heterogeneous VLM specialists, drawn from different open model families, engages in a structured deliberation that includes a peer-consultation round of questions, and (iii) agent rankings are aggregated with a Borda count and the orchestrator re-ranks its own prediction in light of the specialists' evidence. The entire pipeline runs locally with no fine-tuning. Experiments show that our method improves zero-shot action recognition performance over the baseline and highlight the influence of the heterogeneous deliberation step, indicating that the gain stems from decorrelated model priors rather than from additional compute.
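The Borda-count aggregation in step (iii) can be sketched as follows. This is the textbook Borda rule applied to per-agent rankings; the function name, the alphabetical tie-break, and the example labels are assumptions, and the paper may weight agents or handle ties differently.

```python
from collections import defaultdict

def borda_count(rankings):
    """Aggregate ranked label lists: a label at position p in a list of k gets k-1-p points."""
    scores = defaultdict(int)
    for ranking in rankings:
        k = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += k - 1 - position
    # Highest score first; ties broken alphabetically (an arbitrary choice).
    return sorted(scores, key=lambda label: (-scores[label], label))

# Three hypothetical specialists rank the same candidate labels for one segment.
agent_rankings = [
    ["cut", "slice", "peel"],
    ["slice", "cut", "peel"],
    ["slice", "peel", "cut"],
]
consensus = borda_count(agent_rankings)
```

Here "slice" collects 5 points, "cut" 3, and "peel" 1, so the consensus order differs from the first agent's own ranking.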

2606.17798 2026-06-17 cs.CV cs.AI new submission

LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams

Zhenyu Yang, Kairui Zhang, Bing Wang, Shengsheng Qian, Changsheng Xu

Affiliations * IEEE

AI summary Proposes LiveStarPro, which combines Streaming Verification Decoding, Streaming Causal Attention Masks, and Tree-Structured Hierarchical Memory for proactive understanding of long-horizon streaming video, improving semantic correctness by 28.9% and reducing timing error by 18.2%.

English abstract

Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These obstacles undermine real-time responsiveness and cause severe forgetting throughout prolonged interactions. In this work, we introduce LiveStarPro, a live streaming assistant that is designed for proactive video understanding over long-horizon streams. The design of LiveStarPro rests on three complementary components. The first component is Streaming Verification Decoding (SVeD), an inference framework that identifies the appropriate response timing through single-pass perplexity verification, thereby eliminating the dependency on explicit silence tokens. The second component is Streaming Causal Attention Masks (SCAM), a training strategy that enforces incremental video-language alignment over variable-length streams. The third component is Tree-Structured Hierarchical Memory (TSHM), a recursive memory architecture that organizes evicted historical information into event chains and consequently enables efficient retrieval from effectively unbounded video streams. To facilitate a comprehensive evaluation under realistic online conditions, we further present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and that extends to hour-scale streams for the assessment of long-term recall. Extensive experiments demonstrate that LiveStarPro consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error, while its streaming key-value cache further yields a 1.58x inference speedup over the same model without caching. The model and the code are publicly available at https://github.com/sotayang/LiveStarPro.
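As a rough illustration of perplexity-based response timing, the sketch below computes perplexity from per-token log-probabilities and compares it with a threshold. The decision rule (respond when a candidate response has low perplexity), the threshold value, and the function names are assumptions for illustration; the abstract does not specify how SVeD performs its verification.

```python
import math

def perplexity(token_logprobs):
    """exp of the negative mean log-probability over the candidate tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def should_respond(token_logprobs, threshold):
    """Assumed rule: speak only when the candidate response is low-perplexity."""
    return perplexity(token_logprobs) < threshold

# Hypothetical log-probabilities for a candidate response at two moments in a stream.
confident = [math.log(0.5)] * 4    # perplexity of about 2
uncertain = [math.log(0.01)] * 4   # perplexity of about 100
respond_now = should_respond(confident, threshold=10.0)
respond_later = should_respond(uncertain, threshold=10.0)
```

A rule of this shape needs no dedicated silence token, since staying silent is simply the outcome of failing the check.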

2603.22281 2026-06-17 cs.CV cs.AI cs.CL cs.LG cs.RO updated version

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu

AI summary Proposes the ThinkJEPA framework, which combines a dense JEPA branch with a sparse VLM thinker branch through a hierarchical pyramid representation extraction module, enabling fine-grained motion modeling with long-horizon semantic guidance; it outperforms baselines on hand-manipulation trajectory prediction.

Comments 10 pages, 5 figures

English abstract

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

6. Generative Vision and World Models (23 papers)

2606.17257 2026-06-17 cs.CV cs.AI new submission

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

Rohit Kundu, Arindam Dutta, Sarosij Bose, Athula Balachandran, Amit K. Roy-Chowdhury

Affiliations * University of California, Riverside; YouTube (Google)

AI summary Proposes REINS, which steers the internal representations of video diffusion models along a linear direction at inference time, achieving training-free safety alignment that avoids harmful generations without degrading general capability.

English abstract

Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation; to our knowledge, this is the broadest safety evaluation suite in the video generation literature.
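The inference-time step described above, adding a direction to a hidden state, amounts to a vector addition. The sketch below illustrates only that step; the unit normalization, the strength parameter `alpha`, and the function names are assumptions, and finding the direction itself (Supervised PCA in the paper) is not shown.

```python
import math

def projection(hidden, direction):
    """Scalar coordinate of `hidden` along the unit-normalized direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(h * d for h, d in zip(hidden, direction)) / norm

def steer(hidden, direction, alpha):
    """Shift a hidden state by `alpha` along the unit-normalized direction."""
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    return [h + alpha * d / norm for h, d in zip(hidden, direction)]

# Toy 2D hidden state and a hypothetical "safe" direction along the second axis.
hidden_state = [1.0, 0.0]
safe_direction = [0.0, 2.0]
steered = steer(hidden_state, safe_direction, alpha=3.0)
```

The steered state moves by exactly `alpha` along the direction while its other coordinates stay unchanged, which is consistent with the claim that no weights need to be updated.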

2606.17310 2026-06-17 cs.CV new submission

SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues

Suttisak Wizadwongsa, Hyelin Nam, Supasorn Suwajanakorn, Jeong Joon Park

Affiliations * University of Michigan, Ann Arbor; VISTEC, Thailand

AI summary Proposes SierpinskiCam, which strengthens geometric guidance with Sierpinski dome texture cues and adds a reference-video conditioning mechanism, addressing the sparse regions that appear in monocular video retaking when the camera departs far from the source trajectory and improving camera controllability, geometric consistency, and video quality.

Comments 20 pages, 13 figures

English abstract

Generating novel renderings of a scene along user-defined camera trajectories from a single monocular video, dubbed video retaking, is a compelling but difficult problem in content creation and visual effects. Existing geometry-guided approaches reconstruct a 4D representation from the source video and render it along the target trajectory to condition video diffusion models. However, this guidance degrades as the target camera departs from the source trajectory, leaving newly revealed regions sparse or entirely missing. We propose SierpinskiCam, which addresses this limitation by augmenting geometry-based guidance with Sierpinski dome texture cues that contain rich trackable features even under large viewpoint changes. We further introduce a reference video conditioning mechanism that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios. Project page: https://hyelinnam.github.io/SierpinskiCam/.

2606.17342 2026-06-17 cs.CV new submission

Learning a Maximum Entropy Model for Visual Textures using Diffusion

Xinyuan Zhao, Eero P. Simoncelli

Affiliations * New York University; Flatiron Institute

AI summary Proposes the first texture modeling method that uses diffusion-model techniques for unsupervised learning of the statistics of a maximum entropy model; with only 512 statistics it generates texture images as good as or better than the current state-of-the-art model (~177k statistics) and supports smooth interpolation.

English abstract

Visual textures -- spatially homogeneous image regions containing repeated elements (e.g. a field of grass, the bark of a tree) -- are ubiquitous in visual scenes and provide important cues for recognizing and analyzing materials and objects. A number of existing texture models extract essential statistics from a single texture image, and can then generate high-quality samples that are visually similar to the original by matching these statistics. However, their statistics are either hand-designed or based on a network pretrained for another purpose (e.g., object recognition). Here, we develop the first principled method for unsupervised learning of a set of statistics that are used to constrain a maximum entropy probability model. We leverage methods developed for generative diffusion models to derive training and sampling procedures, and compare these to the traditional method of sampling via matching the statistics. Despite the compactness of our trained model (512 statistics), it generates texture images whose quality is as good as or better than the current state-of-the-art model (~177k statistics). A more direct comparison of the two models, obtained by synthesizing images that are indistinguishable for one model but maximally different for the other, reveals their relative strengths and weaknesses. Finally, we show that unlike previous statistical texture models, a straight trajectory in the representation space of our model generates homogeneous texture samples that interpolate smoothly between the features of the two end points.
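For readers unfamiliar with the term, the standard form of a maximum entropy model constrained by a set of statistics can be written as below. The notation is an assumption for illustration ($\phi$ for the vector of statistics, $\theta$ for the Lagrange multipliers, $\mu$ for the target statistic values); the abstract does not give the paper's own symbols or sign convention.

```latex
% Maximum-entropy density under moment constraints (standard textbook result).
% \phi(x): vector of K statistics (K = 512 in the abstract); \theta: Lagrange multipliers.
\max_{p}\; -\int p(x)\,\log p(x)\,dx
\quad \text{s.t.} \quad \mathbb{E}_{p}\big[\phi(x)\big] = \mu
\;\;\Longrightarrow\;\;
p_{\theta}(x) = \frac{1}{Z(\theta)} \exp\!\big(-\theta^{\top}\phi(x)\big),
\qquad
Z(\theta) = \int \exp\!\big(-\theta^{\top}\phi(x)\big)\,dx .
```

In this form the statistics $\phi$ fully determine the model family, which is why learning them, rather than hand-designing them, is the central question the abstract raises.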

2606.17431 2026-06-17 cs.CV new submission

Visual Retrieval-Augmented Generation for Silhouette-Guided Animal Art

Quoc-Duy Tran, Anh-Tuan Vo, Trung-Nghia Le

Affiliations * University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh

AI summary Proposes the Visual Retrieval-Augmented Generation (Visual-RAG) framework, which retrieves animal shapes structurally similar to natural silhouettes and uses ControlNet and IP-Adapter to guide a diffusion model in generating animal art, realizing computational pareidolia.

Comments SOICT 2025

English abstract

Generative AI has advanced the ability to render photorealistic or artistic images, yet it remains limited in a key aspect of human creativity: interpreting ambiguous shapes. This phenomenon, rooted in pareidolia, allows humans to perceive meaningful forms in random patterns such as clouds, stones, or leaves. To computationally replicate this imaginative process, we introduce Visual Retrieval-Augmented Generation (Visual-RAG), a framework that generates animal art directly from natural silhouettes. Our method retrieves structurally similar animal shapes from a curated corpus of 28,586 high-quality silhouettes and uses them as reference exemplars to guide diffusion-based generation with ControlNet and IP-Adapter. Ablation studies confirm that Shape Context with RANSAC provides the most accurate alignment, while removing shape standardization reduces the inlier ratio to just 13.4%, underscoring the importance of structural fidelity in Visual-RAG. A user study with 12 participants evaluated the outputs in terms of aesthetics, silhouette fidelity, and overall impression. Results reveal that while Visual-RAG provides plausible interpretations, challenges remain in achieving high perceptual impact. This work lays the foundation for computational pareidolia, showing how machines can contribute to the early stages of imaginative discovery.

2606.17584 2026-06-17 cs.CV cs.LG new submission

Root-Selecting Fixed-Point Inversion for Rectified Flows via Trajectory Straightness

Semin Kim, Jihwan Yoon, Seunghoon Hong

Affiliations * KAIST

AI summary Proposes SelFix, which selects fixed-point solutions that make the inverse trajectory straighter, achieving exact inversion in rectified flows and improving image reconstruction and editing quality.

English abstract

Finding the initial noise that generates a given data sample, known as inversion, is a key component for downstream applications such as training-free image editing. Existing fixed-point inversion methods improve inversion accuracy by formulating each inversion step as a fixed-point problem, but they lack a principled mechanism for selecting among multiple fixed-point solutions that can arise in practice. We observe that different selections induce different inversion trajectories, leading to substantial variation in reconstruction and editing quality. For rectified flows, we further find that this variation is closely associated with trajectory straightness, motivating straightness as a principled selection criterion. We propose SelFix, a fixed-point inversion method that selects fixed-point solutions inducing straighter inverse trajectories while retaining convergence to an exact inverse root under standard local assumptions. Experiments on FLUX.1-dev and PIE-Bench show that SelFix improves fixed-point inversion, achieving stronger real-image reconstruction and better source-preserving prompt-based editing than prior inversion baselines. The code is available at https://github.com/seminkim/selfix.
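To make "formulating each inversion step as a fixed-point problem" concrete, the sketch below inverts a single forward Euler step by iterating `x = x_next - dt * velocity(x, t)`. It is a generic illustration under stated assumptions: the toy linear velocity field, step size, and function names are hypothetical, and the toy problem has a single root, so the root-selection rule that SelFix contributes is not shown.

```python
def invert_step(x_next, t, dt, velocity, iterations=50):
    """Solve x_prev = x_next - dt * velocity(x_prev, t) by fixed-point iteration."""
    x = x_next  # initial guess: the known later state
    for _ in range(iterations):
        x = x_next - dt * velocity(x, t)
    return x

def toy_velocity(x, t):
    # Hypothetical linear field, chosen so that the iteration is a contraction.
    return 0.5 * x

x_prev_true = 1.0
dt = 0.1
# Forward Euler step from the true earlier state.
x_next = x_prev_true + dt * toy_velocity(x_prev_true, 0.0)
# Recover the earlier state from the later one.
x_prev_recovered = invert_step(x_next, 0.0, dt, toy_velocity)
```

With a nonlinear velocity field the same equation can admit several solutions, which is the situation the abstract says existing methods handle without a principled selection criterion.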

2606.17590 2026-06-17 cs.CV new submission

TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization

Weiliang Chen, Yuanhui Huang, Xuebo Wang, Yueqi Duan

Affiliations * Department of Electronic Engineering, Tsinghua University; Department of Automation, Tsinghua University; Kuaishou Technology

AI summary Proposes TivTok, a reuse-aware video tokenizer that factorizes a video into Time-Invariant (TIV) and Time-Variant (TV) tokens for efficient compression and long-video modeling, reaching an rFVD of 12.65 on the standard benchmark and a 2.91× gain in compression efficiency.

English abstract

Video tokenization is fundamental to scalable video generation, as the number of tokens directly determines the computational cost and the length of videos that can be modeled. Existing tokenizers mainly improve scalability by compressing videos into fewer tokens, but they often continue to represent persistent content, such as static backgrounds and consistent object appearances, repeatedly across frames and chunks. In this paper, we propose TivTok (Time-Invariant Tokenizer), a reuse-aware video tokenizer that makes persistent information reusable across time. TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. To obtain this factorization, we introduce Scope-Induced Factorization (SIF), which assigns different attention scopes to the two token groups: TIV tokens attend to the full clip, whereas each TV token only accesses its corresponding frame together with the TIV tokens. In the decoder, Invariant Broadcasting (IB) reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization. Experiments show that TivTok achieves an rFVD of 12.65 on the standard 16×256×256 benchmark and improves compression efficiency by 2.91× for 128-frame videos compared with the evaluated baselines, while using only 1.1% of the tokens required by downsample-based tokenizers in our evaluation.
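The two attention scopes described for SIF can be expressed as a boolean mask. The sketch below is one plausible reading, not the paper's implementation: the token ordering, the function name `scope_mask`, and the choice to model only the latent queries (TIV and TV tokens) attending over frame patches and TIV tokens are all assumptions.

```python
def scope_mask(num_tiv, num_frames, tv_per_frame, patches_per_frame):
    """Boolean mask[q][k]: may latent query q attend to key k?"""
    # Keys: every frame patch tagged with its frame index, then the TIV tokens (tag None).
    key_frames = [f for f in range(num_frames) for _ in range(patches_per_frame)]
    key_frames += [None] * num_tiv
    # Queries: TIV tokens first (tag None), then TV tokens frame by frame.
    query_frames = [None] * num_tiv
    query_frames += [f for f in range(num_frames) for _ in range(tv_per_frame)]
    mask = []
    for q_frame in query_frames:
        if q_frame is None:
            # TIV query: sees the full clip.
            row = [True] * len(key_frames)
        else:
            # TV query: sees only its own frame's patches plus the TIV tokens.
            row = [k_frame is None or k_frame == q_frame for k_frame in key_frames]
        mask.append(row)
    return mask

# 1 TIV token, 2 frames, 1 TV token per frame, 2 patches per frame.
mask = scope_mask(num_tiv=1, num_frames=2, tv_per_frame=1, patches_per_frame=2)
```

Restricting each TV token to a single frame leaves the TIV tokens as the only path for information shared across frames, which is the intuition behind the factorization.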

2606.17619 2026-06-17 cs.CV 新提交

RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation

RAVA: 检索增强的视角对齐用于主题驱动图像生成

Qiwei Yan, Zhiqiang Yuan, Chongyang Li, Jiapei Zhang, Ying Deng, Jinchao Zhang, Jie Zhou

发表机构 * WeChat AI, Tencent Inc.(腾讯微信人工智能实验室)

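AI summary Proposes RAVA, a retrieval-augmented framework that supplies geometric evidence to address viewpoint drift and structural mismatch in cross-subject viewpoint alignment, achieving reliable viewpoint control while preserving identity.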

详情
AI中文摘要

参考驱动图像生成在身份保持方面取得了快速进展,但跨不同主体的可靠视角控制仍然难以理解。难点不仅在于生成目标主体的新图像:模型必须推断一个主体的隐含视角,并仅使用图像级证据将其转移到另一个主体,无需相机姿态、深度或基于射线的条件。在这种设置下,现有基于多个图像参考的生成器通常依赖虚假的语义相关性,导致视角漂移、部分级结构不匹配以及缺失或不支持的目标特定内容。我们将这一挑战形式化为跨主体视角对齐,并提出RAVA,一个检索增强框架,在生成前提供显式几何证据。RAVA首先学习一个跨实例视角嵌入,检索与锚点视角对齐的目标主体图像,然后应用基于LogDet的子集选择策略,保留一个既视角一致又结构互补的紧凑参考集。最后,选定的参考被微调的多参考图像生成器使用。实验表明,通用语义嵌入在此任务上几乎是随机的,而所提出的检索器显著提高了视角检索质量。在跨主体生成上,RAVA在相同生成骨干下始终优于零样本基线和更强的检索替代方案。这些结果表明,跨主体视角对齐受益于检索增强的几何基础,而非仅依赖端到端生成。

英文摘要

Reference-driven image generation has made rapid progress on identity preservation, but reliable viewpoint control across different subjects remains poorly understood. The difficulty is not merely generating a new image of the target subject: the model must infer the implicit viewpoint of one subject and transfer it to another subject using only image-level evidence, without camera poses, depth, or ray-based conditions. In this setting, existing generators conditioned on multiple image references often rely on spurious semantic correlations, which lead to viewpoint drift, part-level structural mismatches, and missing or unsupported target-specific content. We formulate this challenge as cross-subject viewpoint alignment and propose RAVA, a retrieval-augmented framework that supplies explicit geometric evidence before generation. RAVA first learns a cross-instance viewpoint embedding that retrieves target-subject images aligned with the anchor viewpoint, then applies a LogDet-based subset selection strategy to retain a compact reference set that is both view-consistent and structurally complementary. The selected references are finally consumed by a fine-tuned multi-reference image generator. Experiments show that generic semantic embeddings are nearly random for this task, while the proposed retriever substantially improves viewpoint retrieval quality. On cross-subject generation, RAVA consistently outperforms zero-shot baselines and stronger retrieval alternatives under the same generation backbone. These results indicate that cross-subject viewpoint alignment benefits from retrieval-augmented geometric grounding rather than relying on end-to-end generation alone.
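A LogDet-based subset selection step can be sketched as greedy log-determinant maximization over a similarity kernel. The abstract does not specify the kernel or the optimizer, so this is an illustration under assumptions: `kernel` is a hypothetical symmetric positive-definite matrix whose diagonal encodes viewpoint relevance and whose off-diagonals encode redundancy between candidates, and the greedy strategy is a common stand-in rather than RAVA's confirmed procedure.

```python
import math


def log_det(matrix):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return float("-inf")  # not positive definite
                lower[i][j] = math.sqrt(s)
                total += 2.0 * math.log(lower[i][j])
            else:
                lower[i][j] = s / lower[j][j]
    return total


def greedy_logdet_select(kernel, k):
    """Greedily pick up to k indices that maximize log det of the sub-kernel."""
    selected = []
    remaining = list(range(len(kernel)))
    for _ in range(min(k, len(kernel))):
        best, best_score = None, float("-inf")
        for cand in remaining:
            idx = selected + [cand]
            sub = [[kernel[a][b] for b in idx] for a in idx]
            score = log_det(sub)
            if score > best_score:
                best, best_score = cand, score
        if best is None:
            break  # every extension is degenerate
        selected.append(best)
        remaining.remove(best)
    return selected
```

Because the determinant shrinks when two selected items are nearly identical, the greedy step tends to skip near-duplicate references in favour of complementary ones, which matches the "view-consistent and structurally complementary" goal stated in the abstract.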

2606.17800 2026-06-17 cs.CV 新提交

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

MaineCoon: 追求实时音视频社交世界模型

Lichen Bai, Tianhao Zhang, Shitong Shao, Dingwei Tan, Qiyu Zhong, Zhengpeng Xie, Haopeng Li, Qinghao Huang, Dandan Shen, Tengjiao Ji, Wei Wang, Peicheng Wu, Yuxuan Zhao, Xiangyu Zhu, Welly Luo, Shurui Yang, Zeke Xie

发表机构 * Catnip AI Team(Catnip AI团队)

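AI summary Presents MaineCoon, described as the first real-time audio-visual autoregressive model with 22B parameters, supporting streaming generation at up to 47.5 FPS and sub-second interaction on a single GPU; it is optimized for social-interactive applications and introduces self-resampling, cross-modal alignment, domain-aware preference optimization, and reinforced online-policy distillation.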

Comments 32 pages, 13 figures, 3 tables

详情
AI中文摘要

随着全球视频内容越来越多地在社交平台上用于互动社交目的,为社交世界构建的视频生成模型至关重要,但以往研究在很大程度上忽视了这一点。在这项工作中,我们定义了社交世界模型的位置,并构建了一个原型模型作为实现这一目标的第一步。虽然以往的世界模型成功模拟了物理环境或游戏世界探索,但它们从根本上与以人为中心的社交动态脱节。为了弥合这一差距,作为社交世界模型的第一步,我们提出了MaineCoon,这是首个实时音视频自回归模型,拥有22B参数,能够在单个GPU上实现高达47.5 FPS的创纪录帧率的实时流式生成和亚秒级交互。据我们所知,MaineCoon也是首个专门针对社交互动应用优化的实时音视频生成模型。为了实现高效稳定的训练,我们在MaineCoon中引入了多种新技术,包括自重采样、跨模态表示对齐、领域感知偏好优化和强化在线策略蒸馏(ROPD)。我们还设计了首个智能体流式推理框架,支持千秒级甚至更长的生成,同时通过智能体缓存管理和提示规划减轻漂移。这些创新显著加速了训练,同时优化了实时推理性能。我们相信,这项工作不仅为高质量、低延迟和长时域的音视频自回归模型设定了新的最先进(SOTA)性能基准,而且指出了下一代AI原生社交平台所需的范式转变。

英文摘要

As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points out the paradigm shift desired for next-generation AI-native social platforms.

2606.18156 2026-06-17 cs.CV cs.AI 新提交

ReAge3D: Re-Aging 3D Faces with View Consistency

ReAge3D:具有视角一致性的3D人脸回龄

Libing Zeng, Li Ma, Mingming He, Ning Yu, Paul Debevec, Nima Khademi Kalantari

发表机构 * Texas A&M University(德克萨斯农工大学) Netflix Eyeline Studios

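AI summary Proposes the ReAge3D framework, which uses the 2D diffusion model DiffReaging and a center-out editing propagation strategy to achieve multi-view-consistent 3D face re-aging that preserves identity and detail and outperforms existing methods.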

详情
AI中文摘要

我们提出了一种新颖的框架,用于实现逼真且可控的3D人脸回龄,生成高度详细、保留身份的结果。现有的3D编辑方法虽然对粗粒度的语义变化有效,但不适合回龄,因为即使回龄2D视图之间的微小不一致也会导致对微妙但感知上重要的年龄相关细节的过度平滑。为了解决这一挑战,我们首先引入了一个基于2D扩散的回龄模型DiffReaging,该模型在合成生成的图像对上训练。我们进一步提出了一种中心向外编辑传播策略,利用该回龄模型重建多视图一致的回龄图像。具体来说,从回龄的正面枢轴视图开始,我们通过扭曲和我们提出的Masked-DiffReaging过程重建其余视图。通过在扩散过程的每一步注入现有内容,Masked-DiffReaging确保重建区域与现有像素保持连贯。由此产生的一致回龄视图集监督回龄3D表示的优化。我们的方法在视觉上和定量上都优于现有的3D编辑技术,能够对3D人脸模型中的年龄变换进行平滑、细粒度的控制。

英文摘要

We present a novel framework for realistic and controllable 3D face re-aging which produces highly detailed, identity-preserving results. Existing 3D editing methods, while effective for coarse semantic changes, are not well suited for re-aging, as even small inconsistencies across re-aged 2D views can lead to over-smoothing of subtle but perceptually important age-related details. To address this challenge, we first introduce a 2D diffusion-based re-aging model, DiffReaging, trained on synthetically generated image pairs. We further propose a center-out editing propagation strategy that leverages this re-aging model to reconstruct multi-view-consistent re-aged images. Specifically, starting from a re-aged frontal pivot view, we reconstruct the remaining views through warping and our proposed Masked-DiffReaging process. By injecting existing content at every step of the diffusion process, Masked-DiffReaging ensures that the reconstructed regions remain coherent with existing pixels. The resulting consistent set of re-aged views supervises the optimization of the re-aged 3D representation. Our method outperforms existing 3D editing techniques both visually and quantitatively, enabling smooth, fine-grained control over age transformations in 3D face models.
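The idea of injecting existing content at every step can be illustrated with a toy loop that overwrites known pixels after each update. This is only a sketch under assumptions: `smooth_step` is a hypothetical stand-in for a real diffusion denoiser, the signal is a 1D list rather than an image, and the actual Masked-DiffReaging process may inject content differently (for example at a matched noise level).

```python
def smooth_step(x, step):
    """Toy stand-in for a denoiser: average each value with its neighbours."""
    n = len(x)
    return [(x[max(i - 1, 0)] + x[i] + x[min(i + 1, n - 1)]) / 3.0
            for i in range(n)]


def masked_denoise(known, mask, init, denoise_step, num_steps):
    """Run denoise_step repeatedly, re-injecting known content each step.

    mask[i] is truthy where known[i] already exists (e.g. warped from an
    edited view) and falsy where content must be reconstructed.
    """
    x = list(init)
    for step in range(num_steps):
        x = denoise_step(x, step)
        # Injection: known positions are overwritten so that the freshly
        # generated values are always conditioned on the existing content.
        x = [k if m else v for k, m, v in zip(known, mask, x)]
    return x
```

In this toy setting the unknown region is gradually pulled toward values that agree with the fixed pixels on either side, which is the coherence property the abstract attributes to per-step injection.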

2606.17432 2026-06-17 cs.GR cs.CV 交叉投稿

Edit3DGS: Unified Framework for Dynamic Head Editing via 2D Instruction-Guided Diffusion and 3D Gaussian Splatting

Edit3DGS:通过2D指令引导扩散与3D高斯泼溅的动态头部编辑统一框架

Duy-Dat Tran, Trung-Nghia Le

发表机构 * University of Science, VNU-HCM, Ho Chi Minh, Vietnam(越南胡志明市国家大学) Vietnam National University, Ho Chi Minh, Vietnam(越南国家大学)

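AI summary Proposes Edit3DGS, a unified framework combining 2D instruction-guided diffusion with 3D Gaussian splatting for controllable editing of dynamic 3D heads, supporting expression transformation, attribute modification, and similar operations while keeping identity and motion dynamics consistent.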

Comments SOICT 2025

详情
AI中文摘要

我们提出Edit3DGS,一个用于动态3D头部编辑的统一框架,它将2D指令引导扩散与3D高斯泼溅相结合。与先前分别处理基于帧的编辑或静态3D重建的方法不同,我们的方法将图像域中的语义可控性与逼真、时间一致的3D表示结合起来。给定输入视频,可编辑的面部区域被掩码并使用文本条件扩散模型进行修改,以支持细粒度操作,如表情变换、属性修改和外观细化。然后,编辑后的帧通过3D高斯泼溅聚合,生成一个连贯、高保真的化身,同时保留身份和运动动态。为了强制一致性,Edit3DGS采用了多视图批量编辑和轻量级修复策略,以恢复跨时间步丢失的表情。实验结果表明,我们的框架能够实现可控、无伪影的头部编辑,并具有平滑的时间过渡,在虚拟化身、沉浸式通信、电影制作和交互媒体中具有实际应用。

英文摘要

We present Edit3DGS, a unified framework for dynamic 3D head editing that integrates 2D instruction-guided diffusion with 3D Gaussian splatting. Unlike prior approaches that separately address frame-based edits or static 3D reconstruction, our method couples semantic controllability in the image domain with photorealistic, temporally consistent 3D representations. Given an input video, editable facial regions are masked and modified using a text-conditioned diffusion model to support fine-grained operations such as expression transformation, attribute modification, and appearance refinement. The edited frames are then aggregated through 3D Gaussian splatting to produce a coherent, high-fidelity avatar that preserves both identity and motion dynamics. To enforce consistency, Edit3DGS incorporates multi-view batch editing and lightweight inpainting strategies that recover lost expressions across timesteps. Experimental results demonstrate that our framework enables controllable, artifact-free head editing with smooth temporal transitions, offering practical applications in virtual avatars, immersive communication, film production, and interactive media.

2606.18208 2026-06-17 cs.LG cs.AI cs.CL cs.CV 交叉投稿

Looped World Models

循环世界模型

Hongyuan Adam Lu, Z. L. Victor Wei, Qun Zhang, Jinrui Zeng, Bowen Cao, Lingwei Meng, Mocheng Li, Zezhong Wang, Haonan Yin, Naifu Xue, Minyu Chen, Cenyuan Zhang, Zefan Zhang, Hao Wei, Jiawei Zhou, Haoran Xu, Hao Yang, Ronglai Zuo, Tongda Xu, Yonghao Li, Jian Chen, Hebin Wang, Zeyu Gao, Yang Li, Wei Zhao, Qimin Zhong, Siqi Liu, Yumeng Zhang, Leyan Cui, Zhangyu Wang, Wai Lam

发表机构 * FaceMind Research Asia

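AI summary Proposes Looped World Models (LoopWM), which iteratively refine latent environment states through a parameter-shared Transformer block, achieving up to 100× parameter efficiency and establishing iterative latent depth as a new scaling axis for world simulation.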

Comments Technical Report

详情
AI中文摘要

当前的世界模型面临一个基本矛盾:忠实的长期模拟需要深度计算,但更深的模型部署成本高且容易产生累积误差。我们通过引入循环世界模型(LoopWM)来解决这一问题,这是首个用于世界建模的循环架构。我们的方法通过一个参数共享的Transformer块迭代地细化潜在环境状态。这带来了高达100倍于传统方法的参数效率,并具有自适应计算能力,可自动调整深度以匹配每个预测步骤的复杂性。与缩放模型大小和训练数据正交,LoopWM建立了迭代潜在深度作为世界模拟的新缩放轴,这可能显著推动社区发展。

英文摘要

Current world models face a fundamental tension: faithful long-horizon simulation demands deep computation, but deeper models are expensive to deploy and prone to compounding errors. We resolve this by introducing Looped World Models (LoopWM), which are the first looped architectures for world modelling. Our method iteratively refines latent environment states through a parameter-shared transformer block. This yields up to 100x parameter efficiency over conventional approaches with adaptive computation that automatically scales depth to match the complexity of each prediction step. Orthogonal to scaling model size and training data, LoopWM establishes iterative latent depth as a new scaling axis for world simulation, which might significantly push the community forward.
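Iterative refinement with a single shared block and adaptive depth can be sketched as a loop that reapplies the same function until the latent state stops changing. The abstract does not say how LoopWM decides when to stop, so the convergence-based halting rule, the names `looped_refine` and `toy_block`, and the tolerance are assumptions for illustration only.

```python
def toy_block(state):
    """Hypothetical shared block: a contraction with fixed point 2.0."""
    return [0.5 * v + 1.0 for v in state]


def looped_refine(state, block, max_loops=16, tol=1e-4):
    """Apply the same block repeatedly; stop early once the update is small.

    Returns the refined state and the number of loop iterations used, which
    plays the role of the effective depth for this prediction step.
    """
    for loops in range(1, max_loops + 1):
        new_state = block(state)
        change = max(abs(a - b) for a, b in zip(new_state, state))
        state = new_state
        if change < tol:
            return state, loops
    return state, max_loops
```

With this rule an input that is already close to the fixed point exits after a few iterations while a distant one uses more, so compute scales with the difficulty of the step while the parameter count stays that of one block.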

2506.10981 2026-06-17 cs.CV 版本更新

SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis

SceneCompleter:面向生成式新视角合成的密集3D场景补全

Weiliang Chen, Jiayi Bi, Yuanhui Huang, Wenzhao Zheng, Yueqi Duan

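AI summary Proposes SceneCompleter, which performs dense 3D scene completion in an RGBD latent space with a geometry-appearance dual-stream diffusion model and introduces a Scene Embedder to integrate global information, enabling cross-view-consistent generative novel view synthesis.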

详情
AI中文摘要

生成模型通过利用强大的图像生成先验,在新视角合成(NVS)方面显示出巨大潜力。然而,现有方法通常遵循2D修复范式,首先补全缺失图像区域,然后进行3D重建。这种策略常常导致几何失真和外观漂移,因为2D修复模型无法可靠推断跨视角一致生成所需的底层3D结构。在本文中,我们提出\textbf{SceneCompleter},一个几何感知框架,将生成式NVS重新表述为密集3D场景补全。SceneCompleter不是单独生成2D视图,而是通过几何-外观双流扩散模型在空间对齐的RGBD潜空间中联合补全几何和外观。为了提供整体场景上下文,我们进一步引入场景嵌入器,根据参考图像的全局语义和风格信息调节生成。然后,补全的RGBD预测被对齐并集成到可扩展的3D场景表示中,实现迭代且连贯的场景补全。在域内和域外数据集上的大量实验表明,SceneCompleter在各种场景下都能生成视觉上合理且几何一致的新视图。项目页面:https://chen-wl20.github.io/SceneCompleter

英文摘要

Generative models have shown great promise for novel view synthesis (NVS) by leveraging strong image generation priors. However, existing approaches typically follow a 2D inpainting paradigm, first completing missing image regions and then performing 3D reconstruction. This strategy often causes geometry distortion and appearance drift, as 2D inpainting models cannot reliably infer the underlying 3D structure required for cross-view consistent generation. In this paper, we propose \textbf{SceneCompleter}, a geometry-aware framework that reformulates generative NVS as dense 3D scene completion. Instead of hallucinating isolated 2D views, SceneCompleter jointly completes geometry and appearance through a geometry-appearance dual-stream diffusion model in a spatially aligned RGBD latent space. To provide holistic scene context, we further introduce a Scene Embedder that conditions generation on global semantic and stylistic information from reference images. The completed RGBD predictions are then aligned and integrated into an expandable 3D scene representation, enabling iterative and coherent scene completion. Extensive experiments on in-domain and out-of-distribution datasets demonstrate that SceneCompleter produces visually plausible and geometrically consistent novel views across diverse scenarios. Project Page: https://chen-wl20.github.io/SceneCompleter

2506.24121 2026-06-17 cs.CV 版本更新

TextMesh4D: Zero-shot Text-to-4D Mesh Generation

TextMesh4D: 零样本文本到4D网格生成

Sisi Dai, Xinxin Su, Kai Xu

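AI summary Proposes the TextMesh4D framework, which uses a Jacobian Deformation Field and a Local-Global Semantic Regularizer for zero-shot text-to-dynamic-mesh generation, resolving the conflict between diffusion guidance and mesh topology constraints and achieving high temporal consistency and geometric fidelity.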

详情
AI中文摘要

大规模、高质量动态3D(4D)资产对于学习物理基础表示至关重要,但大规模捕获和标注成本高昂。这限制了监督式4D学习的可行性,并激发了利用预训练扩散先验的零样本文本到4D生成。为了建模复杂动态,先前方法通常采用隐式3D表示(如NeRF或3DGS)以利用其变形能力。然而,其隐式性质对表面拓扑的控制有限,阻碍了高保真几何,并使时间一致表面重建具有挑战性。为解决这些限制,我们探索零样本文本到4D网格生成。然而,将基于扩散的引导与拓扑约束网格结合时会出现结构不匹配:引导是噪声且空间不一致的,而网格施加严格的拓扑约束,使得直接顶点级变形不稳定。在本文中,我们介绍TextMesh4D,这是首个零样本文本到4D框架,通过在两个互补层面解决上述挑战,直接生成动态网格。几何上,我们通过雅可比变形场(JDF)将变形建模从顶点转移到面,通过可积性强制积分公式实现拓扑感知表面重建。语义上,我们提出局部-全局语义正则化器(LGSR),通过联合约束局部变形合理性和全局形状一致性来随时间保持身份。大量实验表明,在单个24GB GPU上高效运行的同时,达到了最先进的时间一致性、结构保真度和视觉质量。

英文摘要

Large-scale, high-quality dynamic 3D (4D) assets are essential for learning physically grounded representations, but remain costly to capture and annotate at scale. This limits the viability of supervised 4D learning and motivates zero-shot text-to-4D generation leveraging pretrained diffusion priors. To model complex dynamics, prior methods typically adopt implicit 3D representations (e.g., NeRFs or 3DGS) for their deformation capacity. However, their implicit nature provides limited control over surface topology, which hinders high-fidelity geometry and makes temporally coherent surface reconstruction challenging. To address these limitations, we explore zero-shot text-to-4D mesh generation. However, a structural mismatch arises when combining diffusion-based guidance with topology-constrained meshes: the guidance is noisy and spatially inconsistent, while meshes impose severe topological constraints, making direct vertex-level deformation unstable. In this paper, we introduce TextMesh4D, the first zero-shot framework for text-to-4D that directly generates dynamic meshes by addressing the above challenge at two complementary levels. Geometrically, we shift deformation modeling from vertices to faces via a Jacobian Deformation Field (JDF), enabling topology-aware surface reconstruction through an integrability-enforcing integration formulation. Semantically, we propose a Local-Global Semantic Regularizer (LGSR) that preserves identity over time by jointly constraining local deformation plausibility and global shape consistency. Extensive experiments demonstrate state-of-the-art temporal consistency, structural fidelity, and visual quality, while remaining efficient on a single 24GB GPU.

2507.17853 2026-06-17 cs.CV cs.AI 版本更新

Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models

Detail++: 文本到图像扩散模型的免训练细节增强器

Lifeng Chen, Jiner Wang, Zihao Pan, Beier Zhu, Xiaofeng Yang, Chi Zhang

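AI summary Proposes Detail++, a training-free framework that decomposes complex prompts with a Progressive Detail Injection strategy and uses self-attention layout control together with a cross-attention Centroid Alignment Loss to improve generation quality for complex multi-subject prompts.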

详情
AI中文摘要

文本到图像(T2I)生成的最新进展已带来令人印象深刻的视觉结果。然而,这些模型在处理复杂提示词时仍面临重大挑战,尤其是涉及具有不同属性的多个主体时。受人类绘画过程(先勾勒构图,再逐步添加细节)的启发,我们提出Detail++,一个免训练框架,引入新颖的渐进式细节注入(PDI)策略来解决这一局限。具体来说,我们将复杂提示词分解为一系列简化的子提示词,分阶段引导生成过程。这种分阶段生成利用自注意力的固有布局控制能力,首先确保全局构图,然后进行精确细化。为了实现属性与对应主体的准确绑定,我们利用交叉注意力机制,并进一步在测试时引入质心对齐损失,以减少绑定噪声并增强属性一致性。在T2I-CompBench和新构建的风格组合基准上的大量实验表明,Detail++显著优于现有方法,特别是在涉及多个对象和复杂风格条件的场景中。

英文摘要

Recent advances in text-to-image (T2I) generation have led to impressive visual results. However, these models still face significant challenges when handling complex prompts, particularly those involving multiple subjects with distinct attributes. Inspired by the human drawing process, which first outlines the composition and then incrementally adds details, we propose Detail++, a training-free framework that introduces a novel Progressive Detail Injection (PDI) strategy to address this limitation. Specifically, we decompose a complex prompt into a sequence of simplified sub-prompts, guiding the generation process in stages. This staged generation leverages the inherent layout-controlling capacity of self-attention to first ensure global composition, followed by precise refinement. To achieve accurate binding between attributes and corresponding subjects, we exploit cross-attention mechanisms and further introduce a Centroid Alignment Loss at test time to reduce binding noise and enhance attribute consistency. Extensive experiments on T2I-CompBench and a newly constructed style composition benchmark demonstrate that Detail++ significantly outperforms existing methods, particularly in scenarios involving multiple objects and complex stylistic conditions.

2510.21583 2026-06-17 cs.CV cs.AI 版本更新

Principled RL for Flow Matching Emerges from the Chunk-level Policy Optimization

基于流匹配的原理化强化学习从片段级策略优化中涌现

Yifu Luo, Haoyuan Sun, Xinhao Hu, Penghui Du, Keyu Fan, Bo Li, Sinan Du, Xu Wan, Zhiyu Chen, Bo Xia, Yongzhe Chang, Changqian Yu, Kun Gai, Tiantian Zhang, Xueqian Wang

发表机构 * GitHub

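AI summary Proposes GCPO, a reinforcement learning method for flow matching based on chunk-level policy optimization; by aggregating consecutive steps into coherent chunks and moving policy optimization to the chunk level, it mitigates inaccurate advantage attribution, and experiments show it outperforms existing methods on text-to-image generation.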

Comments ICML 2026

详情
AI中文摘要

近期在文本到图像(T2I)生成中的后训练流匹配中,群相对策略优化(GRPO)展示了强大的潜力。然而,其受到关键限制:优势归因不准确。在本文中,我们主张将连续步骤聚合为一个连贯的`chunk'并将策略优化范式从GRPO的步骤级别转移到片段级别,可以有效减轻这一问题的负面影响。基于这一见解,我们提出了群片段策略优化(GCPO),这是首个用于后训练流匹配的片段级强化学习方法。广泛的实验表明,GCPO在标准T2I基准和偏好对齐方面均取得了优越的性能,相对于GRPO最高相对提升达43%,凸显了片段级策略优化的前景。代码可在https://github.com/xingzhejun/GCPO上获得。

英文摘要

Recent progress in post-training flow matching for text-to-image (T2I) generation with Group Relative Policy Optimization (GRPO) has demonstrated strong potential. However, it is hindered by a critical limitation: inaccurate advantage attribution. In this work, we argue that aggregating consecutive steps into a coherent 'chunk' and shifting the policy optimization paradigm from GRPO's step level to the chunk level can effectively mitigate the negative impact of this issue. Building on this insight, we propose Group Chunking Policy Optimization (GCPO), the first chunk-level reinforcement learning approach for post-training flow matching. Extensive experiments demonstrate that GCPO achieves superior performance on both standard T2I benchmarks and preference alignment, with up to 43% relative gains over GRPO, highlighting the promise of chunk-level policy optimization. The code is available on https://github.com/xingzhejun/GCPO.
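The shift from step-level to chunk-level optimization can be sketched with two small helpers: one that normalizes rewards within a group (the group-relative advantage used by GRPO-style methods) and one that sums per-step log-probabilities over consecutive steps. This is a sketch under assumptions and is not taken from the GCPO repository: the fixed `chunk_size` and all function names are hypothetical, and the paper may form chunks differently.

```python
import math
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def chunk_log_probs(step_log_probs, chunk_size):
    """Sum per-step log-probabilities over consecutive, fixed-size chunks."""
    return [sum(step_log_probs[i:i + chunk_size])
            for i in range(0, len(step_log_probs), chunk_size)]


def chunk_importance_ratios(new_log_probs, old_log_probs, chunk_size):
    """One importance ratio per chunk instead of one per denoising step."""
    new_chunks = chunk_log_probs(new_log_probs, chunk_size)
    old_chunks = chunk_log_probs(old_log_probs, chunk_size)
    return [math.exp(n - o) for n, o in zip(new_chunks, old_chunks)]
```

In a clipped policy-gradient objective, each chunk ratio would be multiplied by the sample's advantage, so credit is assigned to a run of consecutive steps rather than to each step in isolation.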

2602.06806 2026-06-17 cs.CV cs.LG 版本更新

RAIGen: Rare Attribute Identification in Text-to-Image Generative Models

RAIGen: 文本到图像生成模型中的罕见属性识别

Silpa Vadakkeeveetil Sreelatha, Dan Wang, Serge Belongie, Muhammad Awais, Anjan Dutta

发表机构 * University of California, Berkeley(加州大学伯克利分校) UC Berkeley(加州大学伯克利分校)

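AI summary Proposes the RAIGen framework, which uses Matryoshka Sparse Autoencoders and a novel minority metric to discover rare attributes in diffusion models without labels and supports attribute amplification.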

Comments Accepted at ICML 2026. Webpage and code available at https://github.com/VSSILPA/RAIGen

详情
AI中文摘要

文本到图像扩散模型实现了令人印象深刻的生成质量,但继承并放大了训练数据中的偏差,扭曲了语义属性的覆盖。先前的工作以两种方式解决这一问题。封闭集方法在预定义的公平性类别(如性别、种族)中减轻偏差,假设社会显著的少数属性是先验已知的。开放集方法将任务框架化为偏差识别,突出主导输出的多数属性。两者都忽略了一个互补的任务:揭示在数据分布中代表性不足(社会、文化或风格)但仍编码在模型表示中的罕见或少数特征。我们介绍了RAIGen,据我们所知,这是第一个用于扩散模型中无标签罕见属性发现的框架,不需要预定义的少数类别。RAIGen利用Matryoshka稀疏自编码器和一种新颖的少数度量,结合神经元激活频率与语义独特性,识别出那些其最高激活图像揭示代表性不足属性的可解释神经元。实验表明,RAIGen在Stable Diffusion中发现了超出固定公平性类别的属性,可扩展到更大的模型如SDXL,支持跨架构的系统审计,并在生成过程中实现罕见属性的定向放大。项目页面可在 https://vssilpa.github.io/RAIGen_webpage/ 获取。

英文摘要

Text-to-image diffusion models achieve impressive generation quality but inherit and amplify training-data biases, skewing coverage of semantic attributes. Prior work addresses this in two ways. Closed-set approaches mitigate biases in predefined fairness categories (e.g., gender, race), assuming socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting majority attributes that dominate outputs. Both overlook a complementary task: uncovering rare or minority features underrepresented in the data distribution (social, cultural, or stylistic) yet still encoded in model representations. We introduce RAIGen, the first framework, to our knowledge, for label-free rare-attribute discovery in diffusion models, requiring no predefined minority categories. RAIGen leverages Matryoshka Sparse Autoencoders and a novel minority metric combining neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes. Experiments show RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to larger models such as SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation. The project page is available at https://vssilpa.github.io/RAIGen_webpage/ .

2603.03485 2026-06-17 cs.CV cs.AI cs.RO 版本更新

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Phys4D: 从视频扩散模型实现细粒度物理一致的4D建模

Haoran Lu, Shang Wu, Songling Liu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

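AI summary Proposes the Phys4D pipeline, which learns physics-consistent 4D world representations from video diffusion models through three-stage training (pseudo-supervised pretraining, physics-grounded supervised fine-tuning, and reinforcement learning correction), substantially improving fine-grained spatiotemporal and physical consistency.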

详情
AI中文摘要

最近的视频扩散模型作为大规模生成式世界模型已经取得了令人印象深刻的能力。然而,这些模型通常难以保持细粒度的物理一致性,随时间表现出物理上不合理的动态。在这项工作中,我们提出了 \textbf{Phys4D},一个从视频扩散模型中学习物理一致的4D世界表示的流水线。Phys4D 采用 \textbf{三阶段训练范式},逐步将外观驱动的视频扩散模型提升为物理一致的4D世界表示。我们首先通过大规模伪监督预训练引导出稳健的几何和运动表示,为4D场景建模奠定基础。然后,我们使用模拟生成的数据进行基于物理的监督微调,强制执行时间一致的4D动态。最后,我们应用基于模拟的强化学习来纠正难以通过显式监督捕获的残留物理违规。为了评估超越外观指标的细粒度物理一致性,我们引入了一套 \textbf{4D世界一致性评估},探测几何一致性、运动稳定性和长期物理合理性。实验结果表明,与外观驱动的基线相比,Phys4D 显著改善了细粒度时空和物理一致性,同时保持了强大的生成性能。我们的项目页面见 https://sensational-brioche-7657e7.netlify.app/

英文摘要

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present \textbf{Phys4D}, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts \textbf{a three-stage training paradigm} that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of \textbf{4D world consistency evaluations} that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

2605.09313 2026-06-17 cs.CV 版本更新

Attention Sinks in Diffusion Transformers: A Causal Analysis

扩散变换器中的注意力 sinks:一种因果分析

Fangzheng Wu, Brian Summa

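AI summary Studies the role of attention sinks in diffusion transformers by dynamically identifying and suppressing dominant attention recipients; suppression has limited effect on text-image alignment and preference proxies, although a metric-dependent boundary appears under strong interventions.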

详情
AI中文摘要

Attention sinks -- tokens that receive disproportionate attention mass -- are assumed to be functionally important in autoregressive language models, but their role in diffusion transformers remains unclear. We present a causal analysis in text-to-image diffusion, dynamically identifying dominant attention recipients per timestep and suppressing them via paired, training-free interventions on the score and value paths. Across 553 GenEval prompts on Stable Diffusion~3 (with SDXL corroboration), removing these sinks does not degrade text-image alignment (CLIP-T) or preference proxies (ImageReward, HPS-v2) at $k{=}1$; only under stronger interventions ($k\!\geq\!10$) does HPS-v2 exhibit a metric-dependent boundary, while CLIP-T remains robust throughout. The perceptual shifts induced by suppression are nonetheless \emph{sink-specific} -- $\sim\!6\times$ larger than equal-budget random masking -- revealing an empirical dissociation between trajectory-level perturbation and \emph{semantic alignment} in diffusion transformers. \footnote{Code available at https://github.com/wfz666/ICML26-attention-sink.}

英文摘要

Attention sinks -- tokens that receive disproportionate attention mass -- are assumed to be functionally important in autoregressive language models, but their role in diffusion transformers remains unclear. We present a causal analysis in text-to-image diffusion, dynamically identifying dominant attention recipients per timestep and suppressing them via paired, training-free interventions on the score and value paths. Across 553 GenEval prompts on Stable Diffusion~3 (with SDXL corroboration), removing these sinks does not degrade text-image alignment (CLIP-T) or preference proxies (ImageReward, HPS-v2) at $k{=}1$; only under stronger interventions ($k\!\geq\!10$) does HPS-v2 exhibit a metric-dependent boundary, while CLIP-T remains robust throughout. The perceptual shifts induced by suppression are nonetheless \emph{sink-specific} -- $\sim\!6\times$ larger than equal-budget random masking -- revealing an empirical dissociation between trajectory-level perturbation and \emph{semantic alignment} in diffusion transformers. \footnote{Code available at https://github.com/wfz666/ICML26-attention-sink.}
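Identifying dominant attention recipients and suppressing them can be sketched on a plain attention matrix: rank keys by the total attention they receive, zero out the top $k$, and renormalize each row. This is a simplified illustration, not the code from the linked repository: it covers only a score-path style intervention (equivalent to masking those keys before the softmax), and the function names are hypothetical.

```python
def top_k_sinks(attn, k):
    """Return the k key indices that receive the most attention mass.

    attn[q][j] is the attention weight from query q to key j (rows sum to 1).
    """
    num_keys = len(attn[0])
    mass = [sum(row[j] for row in attn) for j in range(num_keys)]
    return sorted(range(num_keys), key=lambda j: mass[j], reverse=True)[:k]


def suppress_sinks(attn, sinks):
    """Zero the sink columns and renormalize every row to sum to 1."""
    sink_set = set(sinks)
    out = []
    for row in attn:
        kept = [0.0 if j in sink_set else w for j, w in enumerate(row)]
        total = sum(kept)
        out.append([w / total for w in kept] if total > 0 else kept)
    return out
```

An equal-budget random baseline, as mentioned in the abstract, would call `suppress_sinks` with $k$ randomly chosen key indices instead of the top-ranked ones.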

2605.15980 2026-06-17 cs.CV 版本更新

Flash-GRPO: Efficient Alignment for Video Diffusion via One-Step Policy Optimization

Flash-GRPO:通过单步策略优化实现视频扩散的高效对齐

Xiaoxuan He, Siming Fu, Zeyue Xue, Weijie Wang, Ruizhe He, Yuming Li, Dacheng Yin, Shuai Dong, Haoyang Huang, Hongfa Wang, Nan Duan, Bohan Zhuang

发表机构 * Zhejiang University(浙江大学) Joy Future Academy Independent Researcher(独立研究员) Tsinghua University(清华大学)

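AI summary Proposes Flash-GRPO, a single-step training framework that addresses the computational bottleneck with iso-temporal grouping and temporal gradient rectification, achieving better alignment quality than full-trajectory training under low compute budgets along with higher training efficiency.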

详情
AI中文摘要

群体相对策略优化已成为将视频扩散模型与人类偏好对齐的关键,但面临一个关键的计算瓶颈:训练一个14B参数的模型通常每个实验需要数百个GPU天。现有的效率方法通过滑动窗口子采样训练时间步来降低成本,但从根本上损害了优化,表现出严重的不稳定性,并且无法达到完整的轨迹性能。我们提出了Flash-GRPO,一个单步训练框架,在低计算预算下在对齐质量上优于全轨迹训练,同时大幅提高了训练效率。Flash-GRPO解决了两个关键挑战:等时分组通过强制提示级别的时间一致性消除了时间步混淆的方差,将策略性能与时间步难度解耦;时间梯度校正中和了导致不同时间步梯度幅度极不一致的时间依赖缩放因子。在1.3B到14B参数模型上的实验验证了Flash-GRPO的有效性,展示了显著的训练加速,同时保持了一致的稳定性和最先进的对齐质量。

英文摘要

Group Relative Policy Optimization has emerged as essential for aligning video diffusion models with human preferences, but faces a critical computational bottleneck: training a 14B-parameter model typically demands hundreds of GPU days per experiment. Existing efficiency methods reduce costs by subsampling training timesteps with a sliding window, but fundamentally compromise optimization, exhibiting severe instability and failing to reach full trajectory performance. We present Flash-GRPO, a single-step training framework that outperforms full trajectory training in alignment quality under low computational budgets while substantially improving training efficiency. Flash-GRPO addresses two critical challenges: iso-temporal grouping eliminates timestep-confounded variance by enforcing prompt-wise temporal consistency, decoupling policy performance from timestep difficulty; temporal gradient rectification neutralizes the time-dependent scaling factor that causes vastly inconsistent gradient magnitudes across timesteps. Experiments on 1.3B to 14B parameter models validate Flash-GRPO's effectiveness, demonstrating substantial training acceleration with consistent stability and state-of-the-art alignment quality.
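Prompt-wise temporal consistency can be illustrated by drawing one training timestep per prompt and sharing it across every sample in that prompt's group, so that rewards within a group are compared at the same timestep. This is a minimal sketch under assumptions: the uniform sampling, the tuple layout, and the name `iso_temporal_batch` are hypothetical and are not taken from the paper.

```python
import random


def iso_temporal_batch(prompts, group_size, num_timesteps, seed=0):
    """Assign a single shared timestep to all samples of each prompt.

    Returns a list of (prompt, timestep, sample_index) tuples.
    """
    rng = random.Random(seed)
    batch = []
    for prompt in prompts:
        # One timestep per prompt: every member of the group trains on it,
        # so within-group reward differences are not confounded by timestep.
        timestep = rng.randrange(num_timesteps)
        batch.extend((prompt, timestep, index) for index in range(group_size))
    return batch
```

Group-relative advantages would then be computed separately for each prompt's slice of the batch.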

2605.20708 2026-06-17 cs.CV cs.AI 版本更新

Rethinking Cross-Layer Information Routing in Diffusion Transformers

重新思考扩散变换器中的跨层信息路由

Chao Xu, Maohua Li, Qirui Li, Yixuan Xu, Yanke Zhou, Yunhe Li, Cuifeng Shen, Hanlin Tang, Kan Liu, Tao Lan, Lin Qu, Shao-Qun Zhang

发表机构 * Nanjing University(南京大学) Alibaba Group(阿里巴巴集团) Zhejiang University(浙江大学) City University of Hong Kong(香港城市大学)

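AI summary Studies cross-layer information flow in diffusion transformers; through systematic empirical analysis it identifies three concrete symptoms of traditional residual addition and proposes Diffusion-Adaptive Routing (DAR), a learnable, timestep-adaptive, non-incremental aggregation of sublayer outputs that improves model performance.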

详情
AI中文摘要

扩散变换器(DiTs)已成为现代视觉生成的事实性骨干,其设计的几乎所有主要轴线——分词、注意力、条件、目标和潜在自编码器——都已被广泛重新审视。然而,决定信息如何在层之间积累的残差流却直接继承自原始Transformer。在本文中,我们对DiTs中的跨层信息流进行了系统性的实证分析,同时考虑深度和去噪时间步,并识别出传统残差加法的三个具体症状,即单调的前向幅度膨胀、急剧的反向梯度衰减和显著的块状冗余。受此诊断的启发,我们提出了扩散适应性路由(DAR),一种可直接替换残差的机制,能够对子层输出的历史进行可学习、时间步适应和非递增的聚合。此外,所提出的DAR与许多现代Transformer增强方法,如REPA,具有兼容性。在ImageNet 256×256上,DAR将SiT-XL/2的FID值提升了2.11(7.56 vs. 9.67),并且在8.75倍更少的训练迭代中达到了基线的收敛质量。在REPA之上堆叠时,它在早期阶段实现了2倍的训练加速,表明跨层信息路由是扩散建模中一个未被充分探索的设计轴,该轴与现有表示对齐目标相互独立。除了预训练外,DAR还可以在大规模T2I模型的微调阶段应用,并在分布匹配蒸馏中保留高频细节。

英文摘要

Diffusion Transformers (DiTs) have become a de facto backbone of modern visual generation, and nearly every major axis of their design -- tokenization, attention, conditioning, objectives, and latent autoencoders -- has been extensively revisited. The residual stream that governs how information accumulates across layers, however, has been directly inherited from the original Transformer. In this paper, we present a systematic empirical analysis of cross-layer information flow in DiTs, jointly along depth and denoising timestep, and identify three concrete symptoms of traditional residual addition, namely monotonic forward magnitude inflation, sharp backward gradient decay, and pronounced block-wise redundancy. Motivated by this diagnosis, we propose Diffusion-Adaptive Routing (\textsc{DAR}), a drop-in residual replacement that performs \emph{learnable, timestep-adaptive, and non-incremental} aggregation over the history of sublayer outputs. Moreover, the proposed \textsc{DAR} is compatible with many modern Transformer enhancement methods, such as REPA. On ImageNet $256\times256$, \textsc{DAR} improves SiT-XL/2 by $2.11$ FID ($7.56$ vs.\ $9.67$) and matches the baseline's converged quality with $8.75\times$ fewer training iterations. Stacked on top of REPA, it yields a $2\times$ training acceleration in the early stage, suggesting cross-layer information routing as an underexplored design axis in diffusion modeling, one that operates orthogonally to existing representation-alignment objectives. Beyond pretraining, \textsc{DAR} can also be applied during the fine-tuning stage of large-scale T2I models and preserves high-frequency details during Distribution Matching Distillation.

2606.15236 2026-06-17 cs.CV 版本更新

Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

展示信号,隐藏噪声:像素空间扩散的频谱强制

Weichen Fan, Haiwen Diao, Penghao Wu, Ziwei Liu

发表机构 * S-Lab, Nanyang Technological University(南洋理工大学S-Lab)

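AI summary Proposes Spectral Forcing, which applies a time-varying low-pass filter to the noisy input of pixel-space diffusion models to steer the model toward the signal band, improving training efficiency and generation quality.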

Comments Code link: https://github.com/WeichenFan/Spectral_Forcing

详情
AI中文摘要

像素空间扩散模型在全带宽噪声图像上训练,但去噪器可用的有用信号强烈依赖于频率。在整流流扩散和自然图像幂律谱下,每个时间$t$的频带数据-噪声等高线$k^{*}(t) = (1-t)^{-2/\alpha}$将信号承载的低频区域与噪声主导的高频区域分开。我们表明,这种隐式的由粗到细结构不仅仅是描述性的:它引发了一个容量分配问题。标准的像素空间去噪器必须内部发现移动的带宽边界,并可能在最优预测退化为确定性基线而非数据分布建模的频率-时间区域上花费计算。为了显式化这个边界,我们引入了频谱强制,一个无参数、时间条件的2D-DCT低通算子,在补丁嵌入器之前应用于噪声输入。其截止频率随扩散时间单调增加,并在数据端点处变为恒等映射。通过受控的合成实验,我们确定了该算子有益的机制:粗补丁分词和其高频内容主要是噪声而非必要信号的数据。在ImageNet-256上使用JiT-700M/32,频谱强制在不同训练周期中一致地改进了FID和Inception Score,展示了训练过程中的稳健增益;在更细的分词下,频谱强制仍然具有竞争力。我们进一步将未修改的算子插入SenseNova-U1,一个统一的文本到图像模型,它改进了DPG-Bench和GenEval,表明输入侧频谱先验可以超越类条件生成。这些结果表明了一条通往容量高效的像素空间扩散的途径:展示信号并隐藏噪声。

英文摘要

Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser is strongly frequency dependent. Under rectified-flow diffusion and natural-image power-law spectra, the per-band data-to-noise contour $k^{*}(t) = (1-t)^{-2/α}$ separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time $t$. We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally and can spend computation on frequency-time regions where the optimal prediction collapses to deterministic baselines rather than data-distribution modeling. To make this boundary explicit, we introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator applied to the noisy input before the patch embedder. Its cutoff expands monotonically with the diffusion time and becomes the identity at the data endpoint. Through controlled synthetic experiments, we identify the regime in which the operator is beneficial: coarse patch tokenization and data whose high-frequency content is predominantly noise rather than essential signal. On ImageNet-256 with JiT-700M/32, Spectral Forcing consistently improves both FID and Inception Score across different training epochs, demonstrating robust gains throughout training; at finer tokenization, the spectral forcing is still competitive. We further insert the unchanged operator into SenseNova-U1, a unified text-to-image model, where it improves DPG-Bench and GenEval, showing that the input-side spectral prior transfers beyond class-conditional generation. These results suggest a route to capacity-efficient pixel-space diffusion by showing the signal and hiding the noise.
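
The time-conditional low-pass operator described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the square mask shape, the default `alpha = 2.0`, the function names, and reading $k^{*}(t)$ directly as a DCT band index are all assumptions made here for demonstration.

```python
import numpy as np

def dct_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix; rows index frequency."""
    freq = np.arange(n)[:, None]
    pos = np.arange(n)[None, :]
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * pos + 1) * freq / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)
    return mat

def cutoff_index(t: float, alpha: float) -> float:
    """Contour k*(t) = (1 - t)^(-2/alpha); unbounded at the data endpoint t = 1."""
    if t >= 1.0:
        return float("inf")
    return (1.0 - t) ** (-2.0 / alpha)

def spectral_forcing(x: np.ndarray, t: float, alpha: float = 2.0) -> np.ndarray:
    """Low-pass a square image x of shape (n, n) in the 2D-DCT domain."""
    n = x.shape[0]
    basis = dct_matrix(n)
    coeffs = basis @ x @ basis.T          # forward 2D DCT
    u = np.arange(n)[:, None]
    v = np.arange(n)[None, :]
    # Assumption: keep bands whose larger index is at most k*(t) (square mask).
    mask = np.maximum(u, v) <= cutoff_index(t, alpha)
    return basis.T @ (coeffs * mask) @ basis  # inverse 2D DCT
```

Because the mask only grows with `t` and covers every band at `t = 1`, the sketch reproduces the two properties stated in the abstract: a cutoff that expands monotonically with diffusion time and an identity map at the data endpoint. The operator has no trainable parameters.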

2606.15614 2026-06-17 cs.CV 版本更新

Variational Test-time Optimization for Diffusion Synchronization

扩散同步的变分测试时优化

Hyunsoo Lee, Farrin Marouf Sofian, Kushagra Pandey, Stephan Mandt

发表机构 * Seoul National University(首尔大学) University of California, Irvine(加利福尼亚大学尔湾分校)

AI总结 提出基于最优控制的变分测试时优化框架,通过优化控制变量引导多轨迹协同生成,无需额外训练即可提升扩散同步性能。

Comments Preprint. Project website: https://hleephilip.github.io/SyncVC/

详情
AI中文摘要

协同生成通过协调多个扩散轨迹来扩展预训练先验的能力,已成为扩展扩散模型适用性的强大范式。在现有方法中,扩散同步通过引入通用引导机制提供了场景无关的解决方案。然而,当前的同步方法严重依赖启发式方法,并且仍然需要针对特定任务进行调整,这限制了它们的泛化能力和性能。在这项工作中,我们基于最优控制数学推导了一个同步框架,为扩散同步提供了原理性解释。在采样过程中,我们优化控制变量以引导多个轨迹朝向一致解,同时保持接近底层扩散先验。我们的方法完全在测试时运行,无需额外训练,因此当与强大的预训练先验结合时,能够在多样化的生成场景中广泛应用。我们在三个代表性的协同生成任务上展示了相对于基线的持续改进,涵盖了广泛的模态和应用。除了性能提升,我们的工作为协同生成建立了新的基础,为将预训练生成模型扩展到新的协同生成设置开辟了一条原理性路径。

英文摘要

Collaborative generation, which coordinates multiple diffusion trajectories to extend the capabilities of pretrained priors, has emerged as a powerful paradigm for extending the applicability of diffusion models. Among existing approaches, diffusion synchronization provides a scenario-agnostic solution by introducing general guidance mechanisms. However, current synchronization approaches rely heavily on heuristics and still require task-specific tailoring, which limits their generalizability and performance. In this work, we mathematically derive a synchronization framework based on optimal control, providing a principled explanation of diffusion synchronization. During sampling, we optimize control variables to guide multiple trajectories toward coherent solutions while remaining close to the underlying diffusion prior. Our method operates entirely at test-time without additional training, thereby enabling broad applicability across diverse generation scenarios when combined with strong pretrained priors. We demonstrate consistent improvements over baselines on three representative collaborative generation tasks, covering a wide range of modalities and applications. Beyond performance gains, our work establishes a novel foundation for collaborative generation, opening a principled path toward extending pretrained generative models to new collaborative generation settings.

2606.16449 2026-06-17 cs.CV 版本更新

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

PermaVid: 通过解耦上下文记忆实现编辑下的一致视频生成

Shuai Yang, Bingjie Gao, Ziwei Liu, Jiaqi Wang, Dahua Lin, Tong Wu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Stanford University(斯坦福大学) S-Lab, Nanyang Technological University(南洋理工大学S-Lab) The Chinese University of Hong Kong(香港中文大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出PermaVid框架,利用解耦为语义外观和几何结构的上下文记忆,结合编辑感知更新策略,实现编辑操作后视频的长期一致生成。

Comments Project page: https://ys-imtech.github.io/projects/PermaVid/

详情
AI中文摘要

在编辑操作下的一致视频生成需要持久性:当编辑修改场景外观或布局时,后续生成应在时间和视角上保持连贯。然而,现有的记忆设计在修改后难以维持长期一致性,因为存储的上下文可能变得过时或无效。为了解决这个问题,我们提出了PermaVid,一种新颖的框架,基于多模态上下文记忆,将空间上下文解耦为语义外观和几何结构,并采用编辑感知的记忆更新和检索策略,使记忆演化与后续观察保持一致。具体来说,我们开发了两个互补的记忆库:一个RGB上下文记忆,捕获外观感知的观察同时隐式编码几何;一个深度上下文记忆,保留与语义解耦的纯几何结构。基于此设计,我们引入了一个记忆引导的视频生成模型,在从混合模态记忆上下文中提取的参考条件下执行多模态特征融合。实验表明,我们的方法在编辑后保持了强大的长期语义和结构一致性,显著优于现有方法。

英文摘要

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.

7. 3D视觉、点云与空间智能 11 篇

2606.17438 2026-06-17 cs.CV 新提交

Contact-Based Fringe Projection Profilometry for High-Resolution 3-D Surface Measurement of Reflective and Transparent Objects

基于接触式条纹投影轮廓术的高分辨率反射与透明物体三维表面测量

Ingu Yeo, Hyung-Gun Chi, Jae-Sang Hyun

发表机构 * Department of Mechanical Engineering, Yonsei University(延世大学机械工程系) Yonsei Institute for Embodied Intelligence, Yonsei University(延世大学具身智能研究所)

AI总结 针对GelSight传感器在反射/透明物体上深度精度不足和校准困难的问题,提出基于数字条纹投影的接触式三维测量方法,通过三角测量实现高精度全视场三维重建。

详情
AI中文摘要

本文提出一种基于数字条纹投影(DFP)系统的接触式三维表面测量方法,属于以商业成功的GelSight传感器为代表的视觉触觉传感家族。此类传感器已被证明对机器人指尖操作和接触传感有效。然而,由于GelSight采用RGB LED光度立体视觉,它不直接测量绝对深度,而是通过积分估计的表面梯度来推断深度,这可能累积重建误差;此外,随着传感区域增大,校准变得越来越困难,并且在高反射或透明物体上深度精度受到挑战。为克服这些缺点,我们提出一种基于条纹投影的接触测量技术,在涂覆硅胶的接触表面上执行基于三角测量的三维重建,提供接触区域上密集的逐像素表面几何和全场三维形状测量。通过将高精度数字条纹投影集成到传感器中,我们的方法简化了大面积校准,并提高了复杂表面的深度精度。实验结果,包括与GelSight Mini传感器的直接比较、球面拟合精度评估和不确定性分析,证实了所提方法显著提高了基于结构光的三维测量的精度和稳定性,允许可靠重建具有不同光学特性的物体。

英文摘要

This paper presents a contact-based 3-D surface measurement method based on a Digital Fringe Projection (DFP) system, belonging to the vision-based tactile sensing family pioneered by the commercially successful GelSight sensor. Such sensors have proven effective for robotic fingertip manipulation and contact sensing. However, because GelSight employs photometric stereo with RGB LEDs, it does not measure absolute depth directly but instead infers it by integrating estimated surface gradients, which can accumulate reconstruction errors; in addition, it becomes increasingly difficult to calibrate as the sensing area grows, and its depth accuracy is challenged on highly reflective or transparent objects. To overcome these drawbacks, we propose a fringe-projection-based contact measurement technique that performs triangulation-based 3-D reconstruction on a coated silicone contact surface, providing dense per-pixel surface geometry and full-field 3-D shape measurement over the contact region. By integrating high-accuracy digital fringe projection into the sensor, our approach simplifies calibration over larger areas and enhances depth precision for complex surfaces. Experimental results, including a direct comparison with a GelSight Mini sensor, a sphere-fitting accuracy evaluation, and an uncertainty analysis, confirm that the proposed method significantly improves the accuracy and stability of structured-light-based 3-D measurements, allowing reliable reconstruction of objects with diverse optical properties.

2606.17824 2026-06-17 cs.CV cs.AI 新提交

Human-in-the-Loop Atlas-Based 3D Asset Segmentation for Interactive Content Workflows

人在回路中基于图集的3D资产分割用于交互式内容工作流

Paul Julius Kühn, Saptarshi Neil Sinha, Jakob Hansen, Robin Horst

发表机构 * Fraunhofer IGD(弗劳恩霍夫计算机图形学研究所) Hochschule RheinMain(莱茵美因应用科学大学)

AI总结 提出一种人在回路中流水线,通过贪心视图选择、SAM~2交互分割和UV反投影生成分割图集,支持材质分配、风格迁移等下游任务,在8个文化遗产物体上验证了有效性。

详情
AI中文摘要

将3D资产分割成有意义的区域仍然具有挑战性,尤其是当分割标准依赖于应用且需要用户控制时。我们提出了一种人在回路中的流水线,用于从3D模型生成分割的2D参数化图集,适用于交互式媒体、游戏和XR内容工作流。我们的方法首先使用基于采样表面点的贪心集合覆盖策略选择一组紧凑的渲染视图,然后支持使用SAM 2和Label Studio对这些视图进行交互式分割。生成的掩码被反投影到模型的UV参数化上,以产生统一的图集分割,支持下游生产任务,如逐段材质分配、风格迁移和语义标注。我们通过对八个文化遗产物体的基于演示的技术评估来评估该流水线。结果表明,该方法可以在不同几何形状上生成可用的分割图集,同时揭示了需要手动校正的常见问题,特别是精细结构、空腔和弱外观边界。

英文摘要

Segmenting 3D assets into meaningful regions remains challenging, especially when segmentation criteria are application-dependent and require user control. We present a human-in-the-loop pipeline for generating a segmented 2D parameterized atlas from a 3D model for interactive media, game, and XR content workflows. Our method first selects a compact set of rendered views using a greedy set cover strategy over sampled surface points, and then supports interactive segmentation of these views with SAM 2 and Label Studio. The resulting masks are back-projected onto the model's UV parameterization to produce a unified segmented atlas that supports downstream production tasks such as segment-wise material assignment, style transfer, and semantic labeling. We assess the pipeline through a demonstration-based technical evaluation on eight cultural heritage objects. The results show that the approach can generate usable segmented atlases across diverse geometries while revealing recurring sources of manual correction, particularly fine structures, cavities, and weak appearance boundaries.
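
The greedy set cover step for view selection can be illustrated as follows. This is a generic sketch of the textbook greedy heuristic, not the paper's code: the `visible` mapping (view name to the set of sampled surface point indices seen from that view) and the function name are hypothetical, and how visibility is computed is left out.

```python
from typing import Dict, List, Set

def greedy_view_selection(visible: Dict[str, Set[int]], num_points: int) -> List[str]:
    """Repeatedly pick the view that covers the most still-uncovered points."""
    uncovered = set(range(num_points))
    chosen: List[str] = []
    while uncovered and visible:
        best = max(visible, key=lambda view: len(visible[view] & uncovered))
        gain = visible[best] & uncovered
        if not gain:
            break  # the remaining points are not visible from any candidate view
        chosen.append(best)
        uncovered -= gain
    return chosen
```

For example, with `visible = {"front": {0, 1, 2}, "back": {3, 4}, "top": {2, 3}}` and five sampled points, the heuristic picks `front` first (three new points) and then `back` (two new points), so `top` is never rendered. Greedy selection does not guarantee the smallest possible view set, but it keeps the set compact at low cost.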

2606.17935 2026-06-17 cs.CV 新提交

MoonSplat: Monocular Online Gaussian Splatting with Sim(3) Global Optimization

MoonSplat: 基于Sim(3)全局优化的单目在线高斯泼溅

Guo Pu, Yixuan Han, Haofeng Li, Yao Zhang, Hui Zhou, Zhouhui Lian

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机技术研究所) Beijing Hydrogen Intelligent Tech. Co., Ltd.(北京氢元智能科技有限公司)

AI总结 提出一种结合Sim(3)全局优化的在线体素化3DGS框架,通过颜色残差学习策略加速收敛,实现鲁棒的相机跟踪和全局闭环,在室内外数据集上达到SOTA性能。

Comments SIGGRAPH 2026

详情
AI中文摘要

从单目图像序列进行在线3D重建是一个具有挑战性且持续的研究课题。3D高斯泼溅(3DGS)凭借其高质量的实时渲染能力,使得在线3D重建能够以更强的表达能力表示密集场景,因此在机器人、AR/VR等广泛应用中具有巨大潜力。然而,现有的在线3DGS方法仍面临一些关键挑战:由于缺乏全局优化导致的脆弱相机位姿估计,以及在大规模或长序列场景中优化效率低下。为了解决这些问题,我们提出了一种鲁棒且高效的在线体素化3DGS重建框架,该框架集成了全局$\text{Sim}(3)$优化,能够实现可靠的相机跟踪以及针对相机位姿和体素化3DGS的高效全局闭环。为了加速体素化3DGS的收敛,我们进一步引入了一种颜色残差学习策略,这不仅提高了优化速度,还增强了渲染质量。在多种室内外数据集上的大量实验表明,我们的方法在相机位姿估计精度和渲染质量方面均达到了最先进的性能,同时保持了实时效率。此外,我们基于所提出的方法开发并部署了一个真实的基于无人机的主动重建系统,验证了其在实际在线3D重建任务中的鲁棒性和泛化能力。我们的代码和数据可在 https://github.com/TrickyGo/MoonSplat 获取。

英文摘要

Online 3D reconstruction from monocular image sequences is a challenging and ongoing research topic. 3D Gaussian Splatting (3DGS), leveraging its high-quality real-time rendering capability, empowers online 3D reconstruction to represent dense scenes with enhanced expressiveness, and thus holds great promise for a wide range of applications such as robotics and AR/VR. However, existing online 3DGS methods still suffer from some key challenges: fragile camera pose estimation due to the lack of global optimization, and low optimization efficiency in large-scale or long-sequence scenarios. To address these issues, we propose a robust and efficient online voxelized 3DGS reconstruction framework integrated with global $\text{Sim}(3)$ optimization, which enables reliable camera tracking and efficient global loop closure for both camera poses and voxelized 3DGS. To accelerate the convergence of the voxelized 3DGS, we further introduce a color residual learning strategy, which not only boosts optimization speed but also enhances rendering quality. Extensive experiments on diverse indoor and outdoor datasets demonstrate that our method achieves state-of-the-art performance in both camera pose estimation accuracy and rendering quality, while retaining real-time efficiency. Additionally, we develop and deploy a real-world UAV-based active reconstruction system grounded on our proposed method, validating its robustness and generalizability for practical online 3D reconstruction tasks. Our code and data are available at https://github.com/TrickyGo/MoonSplat.

2606.18153 2026-06-17 cs.CV 新提交

Neural Tree Reconstruction for the Open Forest Observatory

开放森林观测站的神经树重建

Marissa Ramirez de Chanlatte, Arjun Rewari, Trevor Darrell, Derek J. N. Young

发表机构 * Berkeley AI Research, University of California, Berkeley(加州大学伯克利分校伯克利人工智能研究) Department of Plant Sciences, University of California, Davis(加州大学戴维斯分校植物科学系)

AI总结 针对开放森林观测站中经典运动恢复结构方法重建质量差的问题,提出引入神经辐射场提升3D树重建的细节与鲁棒性,并展望未来工作。

Comments Published as a workshop paper at "Tackling Climate Change with Machine Learning", ICLR 2024

详情
AI中文摘要

开放森林观测站(OFO)是一项跨大学及其他合作伙伴的合作项目,旨在让生态学家、土地管理者和公众能够低成本地进行森林测绘。OFO正在构建一个地理空间森林数据库,以及通过无人机进行森林测绘的开源方法和工具。这些数据对多种气候应用非常有用,包括优先安排重新造林工作、减少野火风险以及监测碳封存。在OFO森林地图数据库的当前版本中,3D树图是使用经典的运动恢复结构技术创建的。这种方法容易出现伪影,缺乏细节,并且在森林地面(输入数据即俯拍图像的可视性有限)上尤其困难。这些重建错误可能会传播到下游的科学任务中(例如野火模拟)。3D重建的进展,包括神经辐射场(NeRF)等方法,产生了更高质量的结果,对稀疏视图更具鲁棒性,并支持数据驱动的先验。我们探索了将NeRF纳入OFO数据集的方法,概述了支持更先进的3D视觉模型的未来工作,并描述了高质量3D重建对林业应用的重要性。

英文摘要

The Open Forest Observatory (OFO) is a collaboration across universities and other partners to make low-cost forest mapping accessible to ecologists, land managers, and the general public. The OFO is building both a database of geospatial forest data and open-source methods and tools for forest mapping by uncrewed aerial vehicle. Such data are useful for a variety of climate applications, including prioritizing reforestation efforts, informing wildfire hazard reduction, and monitoring carbon sequestration. In the current iteration of the OFO's forest map database, 3D tree maps are created using classical structure-from-motion techniques. This approach is prone to artifacts, lacks detail, and has particular difficulty on the forest floor, where the input data (overhead imagery) has limited visibility. These reconstruction errors can propagate to downstream scientific tasks (e.g., a wildfire simulation). Advances in 3D reconstruction, including methods like Neural Radiance Fields (NeRF), produce higher-quality results that are more robust to sparse views and support data-driven priors. We explore ways to incorporate NeRFs into the OFO dataset, outline future work to support even more state-of-the-art 3D vision models, and describe the importance of high-quality 3D reconstructions for forestry applications.

2606.18231 2026-06-17 cs.CV cs.LG cs.RO 新提交

Adaptive Volumetric Mechanical Property Fields Invariant to Resolution

自适应体积力学属性场:分辨率无关

Rishit Dagli, Donglai Xiang, Vismay Modi, Xuning Yang, Gavriel State, David I. W. Levin, Maria Shugrina

发表机构 * NVIDIA(英伟达)

AI总结 提出AdaVoMP方法,利用稀疏自适应体素结构和自回归Transformer编解码器,为3D物体预测高分辨率空间变化的杨氏模量、泊松比和密度,相比现有技术分辨率提升16^3倍且更准确。

Comments Project Page and hi-res paper: https://research.nvidia.com/labs/sil/projects/adavomp/. ICML 2026

详情
AI中文摘要

精确的力学属性(或材料)杨氏模量($E$)、泊松比($\nu$)和密度($\rho$)对于数字世界的可靠物理模拟至关重要,但大多数3D资产缺乏这些信息。我们提出AdaVoMP,一种预测输入3D物体跨表示形式的精确密集空间变化($E$,$\nu$,$\rho$)的方法,在分辨率、准确性和内存效率上优于现有技术。我们技术的基础是一种稀疏自适应体素结构SAV,它能高效地表示输入3D形状和材料场输出。我们将最准确的先前方法VoMP的固定体素模型替换为一种新颖的稀疏Transformer编码器-解码器模型,该模型学习为每个输入形状自回归地生成唯一的SAV来表示其材料,实现比先前技术高$16^3$倍的分辨率。实验表明,即使测试时计算量少于所有先前技术,AdaVoMP也能估计出更准确的体积属性。这使得我们能够将高分辨率复杂3D物体转换为可模拟的资产,从而实现逼真的可变形模拟。

英文摘要

Accurate mechanical properties (or materials), namely Young's modulus ($E$), Poisson's ratio ($ν$), and density ($ρ$), are essential for reliable physics simulation of digital worlds, but most 3D assets lack this information. We propose AdaVoMP, a method for predicting accurate, dense, spatially varying ($E$, $ν$, $ρ$) for input 3D objects across representations, improving on the state of the art in resolution, accuracy, and memory efficiency. The foundation of our technique is a sparse and adaptive voxel structure, SAV, that efficiently represents both the input 3D shape and the material field output. We replace the fixed-voxel model of the most accurate prior method, VoMP, with a novel sparse transformer encoder-decoder model that learns to generate a unique SAV autoregressively for every input shape to represent its materials, achieving a resolution $16^3\times$ higher than prior art. Experiments show that AdaVoMP estimates more accurate volumetric properties, even with less test-time compute than all prior art. This allows us to convert high-resolution complex 3D objects into simulation-ready assets, resulting in realistic deformable simulations.

2606.18069 2026-06-17 cs.GR cs.CG cs.CV 交叉投稿

Blended Chart Surfaces: A Seamless Explicit Representation for Smooth Surface Fitting

混合图表曲面:一种用于光滑曲面拟合的无缝显式表示

Romy Williamson, Niloy Mitra

发表机构 * Adobe

AI总结 提出混合图表曲面,一种无网络、显式且光滑的曲面表示,通过代理网格和多项式映射联合优化,融合局部图表实现全局光滑,支持微分量和能量计算。

Comments 17 pages, 16 figures

详情
AI中文摘要

适用于几何处理的曲面表示应紧凑且显式,提供全局光滑性保证,支持广泛的曲面拓扑,并能可靠访问法线和曲面能量等微分量,同时与现代可微优化兼容。现有的神经表示通常会牺牲这些属性中的一项或多项:隐式场通常需要等值面提取以供下游使用,而显式神经映射受限于规范域参数化或在局部图表之间出现接缝伪影。我们引入了混合图表曲面,这是一种紧凑、无网络、显式的表示,通过构造实现光滑,并锚定到用户提供的拓扑。给定一个编码预期曲面拓扑和近似几何的粗略代理网格,混合图表曲面使用现成优化器在每个代理顶点联合优化多项式映射,以拟合隐式目标形状,从而避免输入参数化。相邻映射使用光滑的“一环坐标”混合方案融合,将拓扑和粗略几何(由代理承载)与几何细节(由局部补丁承载)解耦。曲面全局光滑、完全可微,并支持导数的稳定评估,使得微分量和曲面能量可直接访问。此外,我们的构造对代理网格的刚体运动和缩放是等变的。我们在各种拓扑和几何复杂度上评估了混合图表曲面,并与显式替代方案(包括插值函数基线和网格位移MLP)进行比较。在这些比较中,混合图表曲面在紧凑性、简单性、微分量访问性和表达力之间实现了有利的权衡,同时在补丁边界保持光滑。

英文摘要

A surface representation suitable for geometry processing should be compact and explicit, provide global smoothness guarantees, support a wide range of surface topologies, and offer reliable access to differential quantities such as normals and surface energies, while remaining compatible with modern differentiable optimization. Existing neural representations typically sacrifice one or more of these properties: implicit fields typically require iso-surfacing for downstream use, while explicit neural maps are constrained by canonical-domain parametrizations or exhibit seam artifacts between local charts. We introduce Blended Chart Surfaces, a compact, network-free, explicit representation that is smooth by construction and anchored to user-provided topology. Given a coarse proxy mesh encoding the intended surface topology and approximate geometry, Blended Chart Surfaces jointly optimize for a polynomial map at each proxy vertex using an off-the-shelf optimizer to fit to an implicit target shape, avoiding the need for an input parametrization. Neighboring maps are fused using a smooth 'one-ring coordinate' blending scheme, decoupling topology and coarse geometry (carried by the proxy) from geometric details (carried by the local patches). The surface is globally smooth, fully differentiable, and enables stable evaluation of derivatives, making differential quantities and surface energies directly accessible. Additionally, our construction is equivariant to rigid motions and scaling of the proxy mesh. We evaluate Blended Chart Surfaces on various topologies and geometric complexity, and compare against explicit alternatives including interpolating-function baselines and mesh-displacement MLPs. Across these, Blended Chart Surfaces achieve a favorable trade-off among compactness, simplicity, access to differential quantities, and expressivity while remaining smooth across patch boundaries.

2507.05163 2026-06-17 cs.CV 版本更新

4DSloMo: 4D Reconstruction for High Speed Scene with Asynchronous Capture

4DSloMo: 基于异步捕获的高速场景4D重建

Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu, Jinwei Gu, Tianfan Xue

AI总结 提出一种仅使用低帧率相机的高速4D捕获系统,通过异步捕获方案将等效帧率提升至100-200 FPS,并利用视频扩散模型修复稀疏视图伪影,实现高质量高速4D重建。

Comments Webpage: https://openimaginglab.github.io/4DSloMo/

详情
AI中文摘要

从多视角视频重建快速动态场景对于高速运动分析和逼真的4D重建至关重要。然而,大多数4D捕获系统的帧率限制在30 FPS以下,直接从低帧率输入进行高速运动的4D重建可能导致不理想的结果。在这项工作中,我们提出了一种仅使用低帧率相机的高速4D捕获系统,通过新颖的捕获和处理模块实现。在捕获方面,我们提出了一种异步捕获方案,通过错开相机的开始时间来提高有效帧率。通过分组相机并利用25 FPS的基础帧率,我们的方法实现了100-200 FPS的等效帧率,无需专门的高速相机。在处理方面,我们还提出了一种新颖的生成模型来修复由4D稀疏视图重建引起的伪影,因为异步减少了每个时间戳的视角数量。具体来说,我们提出训练一个基于视频扩散的伪影修复模型用于稀疏4D重建,该模型细化缺失细节、保持时间一致性并提高整体重建质量。实验结果表明,与同步捕获相比,我们的方法显著增强了高速4D重建。

英文摘要

Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low-FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capture system that uses only low-FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On the processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.
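
The arithmetic behind the staggered capture scheme can be made concrete with a short sketch. It assumes evenly spaced start offsets and equally sized camera groups, which the abstract does not state explicitly; the function names are illustrative only.

```python
from typing import List

def staggered_offsets(base_fps: float, num_groups: int) -> List[float]:
    """Start-time offset (seconds) per camera group, evenly spaced in one frame period."""
    period = 1.0 / base_fps
    return [group * period / num_groups for group in range(num_groups)]

def effective_fps(base_fps: float, num_groups: int) -> float:
    """Equivalent frame rate when the groups interleave without overlap."""
    return base_fps * num_groups

def views_per_timestamp(num_cameras: int, num_groups: int) -> int:
    """Cameras firing at each timestamp, assuming equally sized groups."""
    return num_cameras // num_groups
```

With a 25 FPS base rate, 4 groups give an equivalent 100 FPS and 8 groups give 200 FPS, which matches the 100-200 FPS range quoted in the abstract. The cost is visible in `views_per_timestamp`: more groups means fewer simultaneous viewpoints, which is the sparse-view problem the artifact-fix model is meant to address.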

2509.12742 2026-06-17 cs.CV 版本更新

Effective Gaussian Management for High-fidelity Object Reconstruction

高保真物体重建的有效高斯管理

Jiateng Liu, Hao Gao, Jiu-Cheng Xie, Chi-Man Pun, Jian Xiong, Haolun Li, Junxin Chen, Feng Xu

AI总结 提出一种高斯管理框架,通过选择性激活属性、自适应表示和任务解耦剪枝,结合正则化表面重建模块,在减少参数的同时实现高保真外观与几何重建。

详情
AI中文摘要

本文提出了一种有效的高斯管理框架,用于外观和几何的高保真场景重建。与最近将所有基元在优化过程中统一处理的高斯泼溅(GS)管线不同,我们的框架显式管理高斯的属性激活、表示和剪枝。具体来说,我们的框架首先引入GauSep,一种新的致密化策略,选择性地激活高斯颜色或法线属性,以缓解由双重监督产生的破坏性梯度冲突。我们进一步提出GauRep,一种自适应高斯表示,动态调整球谐函数(SHs)阶数并执行任务解耦剪枝,以在个体和全局层面减少冗余。为了为上述管理过程提供可靠的几何监督,我们还引入了CoRe,一个正则化表面重建模块,通过置信度机制从SDF分支蒸馏鲁棒的法线场到高斯表示。值得注意的是,所提出的高斯管理与各种重建架构兼容,可以无缝集成以提高性能同时减小模型大小。大量实验表明,与最先进方法相比,我们的方法在外观和几何重建上实现了优越或可比的性能,同时使用了显著更少的参数。

英文摘要

This paper proposes an effective Gaussian management framework for high-fidelity scene reconstruction of both appearance and geometry. Unlike recent Gaussian Splatting (GS) pipelines that treat all primitives uniformly during optimization, our framework explicitly manages the attribute activation, representation, and pruning of Gaussians. Specifically, our framework first introduces GauSep, a novel densification strategy that selectively activates Gaussian color or normal attributes to alleviate destructive gradient conflicts arising from dual supervision. We further propose GauRep, an adaptive Gaussian representation that dynamically adjusts spherical harmonics (SH) orders and performs task-decoupled pruning to reduce redundancy at both the individual and global levels. To provide reliable geometric supervision for the above management process, we additionally introduce CoRe, a regularized surface reconstruction module that distills robust normal fields from an SDF branch to the Gaussian representation through a confidence mechanism. Notably, the proposed Gaussian management is compatible with various reconstruction architectures and can be seamlessly integrated to improve performance while reducing model size. Extensive experiments demonstrate that our approach achieves superior or comparable performance in appearance and geometry reconstruction compared with state-of-the-art methods, while using significantly fewer parameters.

2512.09373 2026-06-17 cs.CV 版本更新

FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement

FUSER: 前馈多视图3D配准Transformer与SE(3)^N扩散精化

Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng

发表机构 * Nanyang Technological University(南洋理工大学) Alibaba Group(阿里巴巴集团) Nanjing University(南京大学)

AI总结 提出FUSER,首个前馈多视图配准Transformer,在统一潜在空间中直接预测全局位姿,避免成对匹配;并引入SE(3)^N扩散精化框架FUSER-DF以校正估计。

Comments Accepted to CVPR 2026 (Oral)

详情
AI中文摘要

多视图点云的配准传统上依赖于广泛的成对匹配来构建用于全局同步的位姿图,这在计算上昂贵且在没有整体几何约束的情况下本质上是不适定的。本文提出了FUSER,第一个前馈多视图配准Transformer,它在统一、紧凑的潜在空间中联合处理所有扫描,直接预测全局位姿,无需任何成对估计。为了保持可处理性,FUSER通过稀疏3D CNN将每个扫描编码为低分辨率超点特征,该网络保留绝对平移线索,并通过几何交替注意力模块执行高效的扫描内和扫描间推理。特别地,我们从现成的基础模型中转移2D注意力先验,以增强3D特征交互和几何一致性。基于FUSER,我们进一步引入了FUSER-DF,一个SE(3)^N扩散精化框架,通过在联合SE(3)^N空间中进行去噪来校正FUSER的估计。FUSER作为代理多视图配准模型来构建去噪器,并推导了先验条件SE(3)^N变分下界用于去噪监督。在3DMatch、ScanNet和ArkitScenes上的大量实验表明,我们的方法实现了优越的配准精度和出色的计算效率。

英文摘要

Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. In particular, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework to correct FUSER's estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet and ArkitScenes demonstrate that our approach achieves superior registration accuracy and outstanding computational efficiency.

2605.07971 2026-06-17 cs.CV cs.LG 版本更新

DVD: Discrete Voxel Diffusion for 3D Generation and Editing

DVD: 用于3D生成和编辑的离散体素扩散

Zhengrui Xiang, Jiaqi Wu, Fupeng Sun, Heliang Zheng, Yingzhen Li

发表机构 * Imperial College London(伦敦帝国学院) Math Magic Hitem3D

AI总结 提出离散体素扩散框架(DVD),通过将体素占用视为离散变量,实现稀疏体素的生成、不确定性估计和编辑,避免连续到离散的阈值处理,并提供可解释的生成动态。

详情
AI中文摘要

我们引入了离散体素扩散(DVD),这是一个离散扩散框架,用于生成、评估和编辑基于SLat(结构化潜在)的3D生成管道中的稀疏体素。尽管离散扩散通常没有在类似图像的生成中取代连续扩散,但我们表明它可以作为稀疏体素支架的有效第一阶段先验。通过将体素占用视为原生离散变量,DVD避免了连续到离散的阈值处理,并为体素生成、不确定性估计和编辑提供了一个简单的框架。除了质量提升外,DVD通过显式类别建模提供了更可解释的生成动态。此外,我们利用预测熵作为稳健的不确定性度量,以识别模糊的体素区域和复杂样本,促进数据过滤和质量评估等任务。最后,我们提出了一种使用块结构扰动模式的轻量级微调策略。这种方法使模型能够在单次采样轮次内修复和编辑体素,所需的辅助计算量可忽略不计,且无需额外的模型评估。

英文摘要

We introduce Discrete Voxel Diffusion (DVD), a discrete diffusion framework to generate, assess, and edit sparse voxels for SLat (Structured LATent) based 3D generative pipelines. Although discrete diffusion has not generally displaced continuous diffusion in image-like generation, we show that it can be an effective first-stage prior for sparse voxel scaffolds. By treating voxel occupancy as a native discrete variable, DVD avoids continuous-to-discrete thresholding and provides a simple framework for voxel generation, uncertainty estimation, and editing. Beyond quality gains, DVD provides more interpretable generation dynamics through explicit categorical modeling. Furthermore, we leverage the predictive entropy as a robust uncertainty metric to identify ambiguous voxel regions and complicated samples, facilitating tasks such as data filtering and quality assessment. Finally, we propose a lightweight fine-tuning strategy using block-structured perturbation patterns. This approach empowers the model to inpaint and edit voxels within a single sampling round, requiring negligible auxiliary computation and no additional model evaluations. Code is available at https://github.com/TeCai/DVD.
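
Using predictive entropy as an uncertainty signal can be sketched for the simplest case of binary occupancy. This is an illustrative assumption, not the released DVD code: the abstract only says occupancy is modeled categorically, and the two-class setting, the function names, and the 0.5-nat threshold are hypothetical choices made here.

```python
import numpy as np

def occupancy_entropy(prob_occupied, eps: float = 1e-12) -> np.ndarray:
    """Per-voxel binary entropy (in nats) of predicted occupancy probabilities."""
    p = np.clip(np.asarray(prob_occupied, dtype=float), eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

def ambiguous_voxels(prob_occupied, threshold: float = 0.5) -> np.ndarray:
    """Boolean mask of voxels whose entropy exceeds a hypothetical threshold."""
    return occupancy_entropy(prob_occupied) > threshold

def sample_uncertainty(prob_occupied) -> float:
    """Mean entropy over a sample, usable as a crude score for data filtering."""
    return float(np.mean(occupancy_entropy(prob_occupied)))
```

Entropy peaks at ln 2 (about 0.693 nats) when the predicted occupancy is 0.5 and falls toward zero as the prediction becomes confident, so a per-voxel mask highlights ambiguous regions while the per-sample mean gives one number for ranking or filtering samples.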

2606.08402 2026-06-17 cs.CV cs.AI cs.MA 版本更新

SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration

SceneConductor: 基于多智能体编排的单图像3D场景生成

Jeonghwan Kim, Yushi Lan, Yongwei Chen, Hieu Trung Nguyen, Chuanyu Pan, Xingang Pan

发表机构 * Nanyang Technological University(南洋理工大学) University of Oxford(牛津大学) Meshy AI

AI总结 提出多智能体编排框架,将单图像3D场景生成分解为场景初始化、环境构建和多智能体细化三个阶段,并引入几何感知布局预测器,在几何精度、空间一致性和感知真实性上超越现有方法。

详情
AI中文摘要

从单张图像生成完整3D场景需要从本质上模糊的视觉证据中推断全局一致的几何、物体关系和环境上下文。尽管联合布局和网格生成近期取得进展,现有方法通常依赖整体或弱分解的流水线,将许多因素纠缠在一起,需要大量场景级监督,限制了其对复杂真实环境的泛化。我们提出一个多智能体编排框架,将单图像3D场景生成分解为三个结构化阶段:场景初始化、环境构建和多智能体细化。初始化阶段提取图像派生的物体掩码,构建物体级3D表示,并预测初始空间布局以形成粗略3D场景。环境构建阶段随后利用该初始化以及点图几何,构建支撑表面、房间边界、材质和光照的环境支架。最后,在细化阶段,规划器智能体识别结构和视觉不一致性,直接应用简单修正,并派遣专家智能体进行复杂的局部修订,再整合回全局场景。为提供可靠的结构初始化同时减少对场景级标注的依赖,我们进一步引入一个几何感知布局预测器,由点图派生的稀疏几何先验监督。与全监督布局生成器不同,该预测器可从分割级数据训练,并稳健泛化到多样真实场景。在基准数据集上的大量实验表明,我们的方法在几何精度、空间一致性和感知真实性上持续优于先前方法。

英文摘要

Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene-level supervision, limiting their generalization to complex real-world environments. We propose a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage then leverages this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene-level annotations, we further introduce a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.

8. 医学影像与生物视觉 26 篇

2606.17340 2026-06-17 cs.CV cs.AI 新提交

Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation

几何一致的内窥镜表示用于图像引导导航:基于结构化基础模型适配

Hongchao Shu, Roger D. Soberanis-Mukul, Hao Ding, Morgan Ringel, Mali Shen, Saif Iftekar Sayed, Hedyeh Rafii-Tari, Mathias Unberath

发表机构 * Department of Computer Science, Johns Hopkins University(约翰霍普金斯大学计算机科学系) Semaphor Surgical Johnson & Johnson MedTech(强生医疗科技)

AI总结 提出统一框架,结合合成数据管道与层级感知几何语义适配,学习几何一致且领域鲁棒的图像表示,提升单目内窥镜中的位姿估计与深度预测性能。

详情
AI中文摘要

由于深度线索有限、组织纹理弱、非刚性变形以及跨域外观变化大,单目内窥镜中基于视觉的精确导航十分困难,这些问题使得位姿估计、深度预测和图像-解剖对齐复杂化。尽管最近的视觉基础模型显示出潜力,但它们学到的表示往往几何一致性不足,阻碍了稳定的特征对应,限制了其在后续导航任务中的可靠性。我们提出了一个统一框架,用于学习单目内窥镜中几何一致且领域鲁棒的图像表示。该框架结合了提供精确几何监督的合成数据管道与层级感知几何语义适配,后者是标准LoRA的结构化替代方案,在Transformer层级间选择性插入低秩适配器,并配合逐层训练目标,以鼓励中间特征的几何对应和深层特征的语义一致性。在公开和专有数据集上的实验表明,几何和语义表示质量得到提升,从而在包括位姿估计和单目深度估计在内的下游导航任务上取得更好性能。学到的表示在临床支气管镜中显示出良好的合成到真实迁移能力,并为在有限监督下适配鼻窦镜和结肠镜提供了有用的初始化。该框架还显示出随模型大小和训练数据的良好扩展性。这些结果支持层级感知、几何引导的适配作为内窥镜表示学习的实用方法。

English abstract

Accurate vision-based navigation in monocular endoscopy is difficult due to limited depth cues, weak tissue texture, non-rigid deformation, and substantial appearance variation across domains, all of which complicate pose estimation, depth prediction, and image-to-anatomy alignment. Although recent vision foundation models have shown promise, their learned representations often remain insufficiently geometry-consistent, hindering stable feature correspondence and limiting their reliability for downstream navigation tasks. We propose a unified framework for learning geometry-consistent and domain-robust image representations for monocular endoscopy. The framework combines a synthetic data pipeline that provides accurate geometric supervision with Hierarchy-Aware Geometry-Semantic Adaptation, a structured alternative to standard LoRA that inserts low-rank adapters selectively across the transformer hierarchy and couples them with layer-wise training objectives to encourage geometric correspondence in intermediate features and semantic consistency in deeper features. Experiments on public and proprietary datasets show improved geometric and semantic representation quality, leading to better performance on downstream navigation tasks including pose estimation and monocular depth estimation. The learned representations show favorable synthetic-to-real transfer on clinical bronchoscopy and provide a useful initialization for adaptation to sinus endoscopy and colonoscopy under limited supervision. The framework also shows favorable scaling with model size and training data. These results support hierarchy-aware, geometry-guided adaptation as a practical approach for endoscopic representation learning.
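
For readers unfamiliar with the baseline that Hierarchy-Aware Geometry-Semantic Adaptation is described as restructuring, the following is a minimal sketch of a standard LoRA update applied only to a chosen subset of layers. It is not the authors' implementation: the layer indices in `ADAPTED_LAYERS`, the `alpha / rank` scaling, and all function names are illustrative assumptions, and the layer-wise geometric and semantic training objectives are left out.

```python
# Illustrative sketch only: standard LoRA, h = W x + (alpha / r) * B (A x),
# applied to a hypothetical subset of layers. Not the paper's method.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha=1.0):
    """Frozen weight W (d_out x d_in) plus low-rank update B (d_out x r) times A (r x d_in)."""
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / rank) * u for b, u in zip(base, update)]

# Hypothetical choice of which transformer blocks receive adapters.
ADAPTED_LAYERS = {4, 8, 11}

def layer_forward(layer_index, W, A, B, x):
    """Use the adapter only on selected layers; other layers stay frozen."""
    if layer_index in ADAPTED_LAYERS:
        return lora_forward(W, A, B, x)
    return matvec(W, x)

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen 2x2 weight
A = [[1.0, 1.0]]               # rank-1 down-projection (1x2)
B = [[0.5], [0.0]]             # up-projection (2x1)
x = [2.0, 3.0]
print(layer_forward(0, W, A, B, x))  # layer 0 is not adapted
print(layer_forward(4, W, A, B, x))  # layer 4 is adapted
```

Only `A` and `B` would be trained in such a setup, which is what keeps the number of updated parameters small relative to `W`.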

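2606.17343 2026-06-17 cs.CV stat.AP new submission

Bayesian Magnetic Resonance Joint Image Reconstruction and Uncertainty Quantification using Sparsity Prior Models and Markov Chain Monte Carlo Sampling

贝叶斯磁共振联合图像重建与不确定性量化:基于稀疏先验模型和马尔可夫链蒙特卡洛采样

Ahmed Karam Eldaly, Matteo Figini, Daniel C. Alexander

Affiliations: Department of Computer Science, University of Exeter(埃克塞特大学计算机科学系); UCL Hawkes Institute, Department of Computer Science, University College London(伦敦大学学院计算机科学系霍克斯研究所)

AI summary: Proposes an uncertainty quantification framework for compressed sensing MR image reconstruction, formulated as a Bayesian linear inverse problem with sparsity priors (total variation or wavelet transform) and MCMC sampling via a split-and-augmented Gibbs sampler; validated on single-coil and multi-coil datasets, with image reconstruction and uncertainty quantification reported to exceed optimisation-based and deep learning-based methods.

AI Chinese abstract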

我们提出了一种新的框架,用于使用压缩感知磁共振图像重建进行不确定性量化。该问题在贝叶斯框架内被表述为线性逆问题,并为未知模型参数分配先验分布。具体而言,待重建的图像在给定基下被假设为稀疏的。我们开发了一个适用于任何基的通用框架,并作为示例,测试了图像在(1)空间梯度(使用总变分先验模型)和(2)小波变换中的稀疏性。然后,采用基于分裂增广吉布斯采样的马尔可夫链蒙特卡洛(MCMC)方法从未知参数的后验分布中采样。使用近端MCMC方法有效采样不可微的条件分布。所提出的算法在单线圈和多线圈数据集上使用各种k空间子采样模式和比率进行了验证。结果表明,与对应的基于优化的方法相比,每种提出的方法在图像重建方面具有优越性能。此外,与现有的基于深度学习的方法相比,我们的框架有效地量化了不确定性,显示估计的不确定性图与使用真实值和重建图像计算的误差图之间存在显著相关性。

English abstract

We propose a novel framework for uncertainty quantification using compressed sensing magnetic resonance image reconstruction. The problem is formulated within a Bayesian framework as a linear inverse problem, with prior distributions assigned to the unknown model parameters. Specifically, the image to be reconstructed is assumed to be sparse in a given basis. We develop a general framework applicable to any basis and as examples, we test the sparsity of the image in its (1) spatial gradients using a total variation prior model, and in its (2) wavelet transform. A Markov chain Monte Carlo (MCMC) method, based on a split-and-augmented Gibbs sampler, is then employed to sample from the posterior distribution of the unknown parameters. The non-differentiable conditional distributions are efficiently sampled using a proximal MCMC method. The proposed algorithms are validated on both single-coil and multi-coil datasets using various k-space sub-sampling patterns and ratios. The results demonstrate the superior performance of each proposed approach in reconstructing images compared to its counterpart optimisation-based method. Moreover, our framework effectively quantifies uncertainty, showing a notable correlation between estimated uncertainty maps and error maps computed using ground truth and reconstructed images, compared with existing deep learning-based methods.
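
Proximal MCMC methods handle non-differentiable priors through proximity operators. As a minimal sketch, assuming an L1-type sparsity prior on the coefficients (for example wavelet coefficients), the relevant operator is element-wise soft-thresholding. The abstract does not state which operators, step sizes, or thresholds the authors use, so `soft_threshold` and the threshold value below are illustrative only.

```python
def soft_threshold(values, threshold):
    """Proximity operator of threshold * ||v||_1, applied element-wise."""
    result = []
    for v in values:
        magnitude = abs(v) - threshold
        if magnitude <= 0.0:
            result.append(0.0)          # small coefficients are set exactly to zero
        elif v > 0:
            result.append(magnitude)    # shrink positive values toward zero
        else:
            result.append(-magnitude)   # shrink negative values toward zero
    return result

coeffs = [3.0, -0.5, -2.0, 0.25]       # toy coefficients, not data from the paper
print(soft_threshold(coeffs, 1.0))
```

Coefficients whose magnitude is below the threshold are zeroed, which is how an L1 prior promotes sparsity in the chosen basis.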

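2606.17379 2026-06-17 cs.CV cs.AI eess.IV new submission

MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation

MeiBRD:元学习术中生物力学残余变形

Casey Meisenzahl, Jon Heiselman, Michael Holtz, Yubo Ye, Michael Miga, Linwei Wang

Affiliations: Rochester Institute of Technology(罗切斯特理工学院); Vanderbilt University(范德堡大学)

AI summary: Proposes a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences, learns the residual deformation with a graph neural diffusion function, and uses meta-learning to adapt quickly from intraoperative samples; it outperforms existing methods on a liver phantom.

AI Chinese abstract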

由于软组织大幅变形且术中测量稀疏,精确的术中肝脏配准具有挑战性。生物力学模型通过先验知识正则化这一不适定问题,但由于简化假设而表现出持续的预测偏差,而数据驱动学习方法在数据效率、泛化能力和物理合理性方面存在困难。我们提出一个混合配准框架,利用稀疏术中对应点自适应生物力学先验。我们不是学习完整的变形场,而是学习一个校正线性生物力学预测的残余变形函数,该函数建模为图神经扩散函数,在3D肝脏网格上具有几何感知注意力。为了实现稀疏观测的长距离信息传递,我们从一个新颖的角度将稀疏术中测量视为上下文样本,其中残余变形函数的输入-输出对完全观测,将问题转化为从术中上下文样本中学习该残余函数,使用前馈元学习器。在可变形肝脏体模数据集上的实验表明,与刚性、生物力学和数据驱动基线相比,配准精度和泛化能力得到提升,特别是在分布外几何和变形情况下。

English abstract

Accurate intraoperative liver registration is challenging due to substantial soft-tissue deformation yet sparse intraoperative measurements. Biomechanical models regularize this ill-posedness with prior knowledge but exhibit persistent prediction bias due to simplifying assumptions, while data-driven learning solutions struggle with data efficiency, generalization, and physical plausibility. We propose a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences. Rather than learning a full deformation field, we learn a residual deformation function that corrects linear biomechanical predictions, modeled as a graph neural diffusion function with geometry-aware attention over the 3D liver mesh. To enable long-range information transfer of sparse observations, we take a novel perspective of sparse intraoperative measurements as context samples where input-output pairs of the residual deformation function are fully observed, casting the problem into learning-to-learn this residual function from intraoperative context samples with feedforward meta-learners. Experiments on a deformable liver phantom dataset demonstrate improved registration accuracy and generalization compared to rigid, biomechanical, and data-driven baselines, particularly for out-of-distribution geometries and deformations.

2606.17437 2026-06-17 cs.CV cs.AI new submission

Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

超声心动图视频标准视图分类的时空融合模型

Bo Gou, Jicheng Zhang, Jianlong Xiong, Tao He, Bentian Liu, Hai Wu, Yijiao Wang, Yu Zhang, Yujia Yang, Yun Dai, Jian Liu, Jie Wang

Affiliations: Department of Ultrasound, The First Affiliated Hospital of Chengdu Medical College, School of Clinical Medicine, Chengdu Medical College(成都医学院第一附属医院超声科,临床医学院); College of Computer Science, Sichuan University(四川大学计算机学院); Department of Medical Ultrasound, West China Hospital of Sichuan University(四川大学华西医院超声科); Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College(中国医学科学院北京协和医学院肿瘤医院)

AI summary: To address data scarcity and the difficulty of fusing spatio-temporal features in echocardiographic view classification, proposes an uncertainty-aware dual-stream CNN-LSTM fusion model that achieves competitive performance on EV9V, which the authors describe as the largest publicly available echocardiography video dataset.

AI Chinese abstract

超声心动图标准视图的自动分类对于高效的临床工作流程至关重要,但面临三个主要挑战。首先,公开可用的数据集稀缺,且规模和视图覆盖范围有限。其次,一些现代视频级架构在超声心动图视图分类中的性能尚未得到充分探索。第三,某些视图类别在空间外观上高度相似,使得单帧特征不足以区分,而异质的帧质量使得鲁棒的时序信息融合变得复杂。为了解决这些挑战,我们发布了九视图超声心动图视频(EV9V)数据集,包含5,138个视频、910,579帧和9个标准视图,据我们所知,这是最大的公开超声心动图视频数据集。利用EV9V,我们系统地基准测试了代表性的视频分类架构,包括卷积神经网络(CNN)、循环神经网络(RNN)和Transformer。此外,我们提出了一种时空融合模型(STFM),一种高效的双流CNN-LSTM(长短期记忆)框架,联合捕获空间解剖结构和时间心脏动力学。所提出的框架利用不确定性感知学习在训练期间优先采样代表性视频片段,并在推理期间进行基于证据的融合,提高了对超声心动图视频中帧质量变化的鲁棒性。大量实验表明,我们的方法在各种视频分类模型中取得了竞争性能,验证了不确定性感知时空学习在超声心动图视图分类中的有效性。代码可在以下网址获取:https://github.com/bgx666/stfm。

English abstract

Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterogeneous frame quality complicates robust temporal information fusion. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos, 910,579 frames, and 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose a Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The proposed framework leverages uncertainty-aware learning to preferentially sample representative video segments during training and evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance across diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at https://github.com/bgx666/stfm.

2606.17675 2026-06-17 cs.CV new submission

Do We Really Need Diffusion? A Fast U-Net for Paired Medical Image Translation

我们真的需要扩散吗?用于配对医学图像翻译的快速U-Net

Alicia Pirwass, Birte Glimm, Michael Munz, Hans-Joachim Wilke

Affiliations: Institute of Artificial Intelligence, Ulm University(乌尔姆大学人工智能研究所); Institute of Orthopaedic Research and Biomechanics, Centre for Trauma Research, University Hospital Ulm(乌尔姆大学医院创伤研究中心骨科研究与生物力学研究所); AI for Sensor Data Analytics Research Group, Ulm University of Applied Sciences(乌尔姆应用科学大学传感器数据分析人工智能研究组)

AI summary: Compares a lightweight 4-level U-Net with a Denoising Diffusion Probabilistic Model (DDPM) for estimating fat fraction from T2-weighted MRI and finds that the U-Net is both more accurate and faster.

AI Chinese abstract

磁共振成像-信号脂肪分数(MRI-SFF)量化组织脂肪,是代谢和肌肉骨骼疾病的既定生物标志物。然而,采集需要专门的MRI序列,这些序列并非常规可用。我们研究是否可以通过图像到图像翻译(I2I)从广泛可用的T2加权(T2w)MRI估计SFF。我们进一步使用来自德国国家队列(NAKO)的230048对2D图像(183517训练,23621验证,22910测试)数据集,将轻量级4级U-Net与最先进的去噪扩散概率模型(DDPM)进行比较。两种模型均明显优于恒等基线(Pearson相关系数r=0.769,平均绝对误差MAE=0.070±0.054),证实模型学习了非平凡的跨模态映射。有趣的是,轻量级U-Net在相关性(r=0.975 vs. 0.962)和误差(MAE=0.014±0.015 vs. 0.019±0.019)方面均优于DDPM,同时推理时间减少了208倍(每张图像25.2 ms vs. 5 227.2 ms,使用50步去噪扩散隐式模型(DDIM))。在显著降低计算成本的同时实现强大的临床性能,使得实时临床使用成为可能。

English abstract

Magnetic resonance imaging-signal fat fraction (MRI-SFF) quantifies tissue fat and serves as an established biomarker for metabolic and musculoskeletal disorders. The acquisition requires, however, specialized MRI sequences, which are not available routinely. We investigate whether SFF can be estimated from widely available T2-weighted (T2w) MRI via image-to-image translation (I2I). We further compare a lightweight 4-level U-Net to a state-of-the-art Denoising Diffusion Probabilistic Model (DDPM) using a dataset of 230 048 paired 2D images (183 517 train, 23 621 val, 22 910 test) from the German National Cohort (NAKO). Both models clearly outperform the identity baseline (Pearson correlation r = 0.769, mean absolute error MAE = 0.070 +/- 0.054), which confirms that the models learn a non-trivial cross-modal mapping. Interestingly, the lightweight U-Net outperforms the DDPM in both correlation (r = 0.975 vs. 0.962) and error (MAE = 0.014 +/- 0.015 vs. 0.019 +/- 0.019), while reducing inference time by a factor of 208 (25.2 ms vs. 5 227.2 ms per image using 50 Denoising Diffusion Implicit Model (DDIM) steps). The strong clinical performance at substantially reduced computational cost enables real-time clinical use.
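
The two reported metrics, Pearson correlation r and mean absolute error (MAE), can be sketched as follows. This is a minimal illustration on toy values, not the authors' evaluation code; whether the paper aggregates per pixel, per image, or per subject is not stated in the abstract, and the names `pearson_r` and `mean_absolute_error` are chosen here for illustration.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mean_absolute_error(x, y):
    """Average absolute difference between paired values."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Toy fat-fraction values in [0, 1]; not data from the paper.
reference = [0.10, 0.20, 0.30, 0.40]
predicted = [0.12, 0.18, 0.33, 0.41]
print(pearson_r(reference, predicted))
print(mean_absolute_error(reference, predicted))
```

The identity baseline mentioned in the abstract corresponds to passing the T2w input itself as `predicted`, which is why a correlation well above that baseline indicates a learned mapping rather than a copy of the input.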

2606.17702 2026-06-17 cs.CV cs.AI new submission

SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology

SegTME-UNI2: 一种基于基础模型的可泛化多类细胞分割框架及LLM驱动的肿瘤微环境表征在组织病理学中的应用

Wan Siti Halimatul Munirah Wan Ahmad, Faris Syahmi Samidi, Mohammad Badal Ahmmed, Vimal Angela Thiviyanathan, Selvam James Thavaraj, Anwar P. P. Abdul Majeed

Affiliations: Department of Data Science and Artificial Intelligence, School of Computing and Artificial Intelligence, Faculty of Engineering and Technology, Sunway University(双威大学工程与技术学院计算与人工智能学院数据科学与人工智能系); Faculty of Dentistry, Universiti Malaya(马来亚大学牙科学院)

AI summary: Proposes the SegTME-UNI2 framework, which pairs the UNI2-H pathology foundation model with dual UperNet decoder heads for six-class semantic segmentation and nuclear instance segmentation, addresses annotation scarcity with a three-stage pseudo-label curriculum, and uses an LLM to generate clinically interpretable TME reports.

AI Chinese abstract

从常规H&E染色组织学图像中表征肿瘤微环境(TME)需要同时进行细胞分割、特征提取和可解释的临床报告。我们提出了SEGTME-UNI2,一个统一框架来满足这些需求。其核心是UNI2-UPERHOVER,一个双头分割模型,将UNI2-H病理基础模型(ViT-Giant,在来自100K张切片的>100M张图块上预训练)与两个并行的UperNet解码器配对:一个用于六类语义分割,另一个用于水平-垂直梯度回归,从而实现基于分水岭的核实例分离。为了解决大型真实世界数据集中缺乏像素级标注的问题,UNI2-UPERHOVER经历了一个三阶段渐进式伪标签课程。每个阶段训练一个全新模型(无权重迁移),完全通过提高伪标签质量来驱动改进:阶段1:使用人工标注的PanNuke(7,901张图像,189,744个细胞核,0.25 um/像素)。阶段2:使用阶段1模型在271,711个TCGA-UT尺度0图块(0.5 um/像素)上生成的熵过滤伪标签。阶段3:使用阶段2模型在所有1,608,060个TCGA-UT图块(覆盖六个分辨率尺度,0.5-1.0 um/像素)上生成的伪标签。分割输出输入到一个结构化的TME特征提取流水线,计算每个图块的20多个组成、形态、空间熵和细胞间距离指标。这些指标编码为JSON,并传递给微调的NVIDIA BioNeMo GPT模型,以生成临床可解释的TME叙述。在保留的PanNuke和TCGA-UT分区上的初步验证证明了框架的可行性和内部一致性。公开释放了伪标注的TCGA-UT数据集和UNI2-UPERHOVER检查点,以支持大规模TME分析和空间生物学研究。

English abstract

Characterising the tumour microenvironment (TME) from routine H&E-stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SEGTME-UNI2, a unified framework addressing these requirements. Its core is UNI2-UPERHOVER, a dual-head segmentation model pairing the UNI2-H pathology foundation model (ViT-Giant, pretrained on >100M tiles from 100K slides) with two parallel UperNet decoders: one for six-class semantic segmentation and one for horizontal-vertical gradient regression enabling watershed-based nuclear instance separation. To address the lack of pixel-level annotations in large real-world repositories, UNI2-UPERHOVER undergoes a three-stage progressive pseudo-label curriculum. Each stage trains a fresh model without weight transfer, driving improvement entirely via increased pseudo-label quality: Stage 1: Uses human-annotated PanNuke (7,901 images, 189,744 nuclei, 0.25 um/pixel). Stage 2: Uses entropy-filtered pseudo-labels from the Stage 1 model on 271,711 TCGA-UT scale-0 patches (0.5 um/pixel). Stage 3: Uses pseudo-labels from the Stage 2 model on all 1,608,060 TCGA-UT patches across six resolution scales (0.5-1.0 um/pixel). Segmentation outputs feed a structured TME feature extraction pipeline computing 20+ per-patch compositional, morphological, spatial entropy, and intercellular distance metrics. These are encoded as JSON and passed to a fine-tuned NVIDIA BioNeMo GPT model to generate clinically interpretable TME narratives. Preliminary validation on held-out PanNuke and TCGA-UT partitions demonstrates framework feasibility and internal consistency. The pseudo-labelled TCGA-UT dataset and UNI2-UPERHOVER checkpoint are publicly released to support large-scale TME profiling and spatial biology research.
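
Stage 2 of the curriculum relies on entropy-filtered pseudo-labels. A minimal sketch of that idea is shown below, assuming the filter keeps predictions whose Shannon entropy falls under a cutoff. The abstract does not say whether filtering is done per pixel, per nucleus, or per patch, nor what cutoff is used, so `MAX_ENTROPY` and both function names are hypothetical.

```python
import math

def shannon_entropy(probs):
    """Entropy in nats of one class-probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def filter_pseudo_labels(prob_maps, max_entropy):
    """Keep (index, argmax label) for predictions whose entropy is at or below the cutoff."""
    kept = []
    for index, probs in enumerate(prob_maps):
        if shannon_entropy(probs) <= max_entropy:
            kept.append((index, probs.index(max(probs))))
    return kept

# Toy six-class predictions; not outputs of the paper's model.
predictions = [
    [0.95, 0.01, 0.01, 0.01, 0.01, 0.01],  # confident
    [1 / 6] * 6,                            # maximally uncertain
    [0.05, 0.80, 0.05, 0.05, 0.03, 0.02],  # fairly confident
]
MAX_ENTROPY = 1.0  # hypothetical cutoff in nats
print(filter_pseudo_labels(predictions, MAX_ENTROPY))
```

A uniform prediction over six classes has entropy ln 6 (about 1.79 nats), so it is discarded under this cutoff while the two confident predictions are kept as pseudo-labels.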

2606.17710 2026-06-17 cs.CV cs.AI cs.CL cs.LG new submission

Vision-language models for chest radiography do not always need the image

胸部X光片的视觉-语言模型并不总是需要图像

Mahshad Lotfinia, Sebastian Ziegelmayer, Lisa Adams, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh

Affiliations: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg(弗里德里希-亚历山大-埃尔朗根-纽伦堡大学模式识别实验室); Department of Diagnostic and Interventional Radiology, TUM University Clinic, School of Medicine and Health, Klinikum rechts der Isar, Technical University of Munich(慕尼黑工业大学医学院与健康学院伊萨尔河右岸医院诊断与介入放射学系); Lab for AI in Medicine, RWTH Aachen University(亚琛工业大学医学人工智能实验室); Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen(亚琛工业大学医院诊断与介入放射学系)

AI summary: Using a causal audit, finds that many medical vision-language models rely on text priors rather than the image in chest radiograph tasks and that a text-only model performs close to multimodal ones, and proposes an evaluation framework based on image dependence.

AI Chinese abstract

医学视觉-语言模型报告了强大的胸部X光片准确性,这越来越多地被解读为它们使用了图像的证据。这种推断是不安全的:一个利用发现名称先验的模型得分与读取扫描的模型相同,且没有标准基准能区分它们。我们引入了一种因果审计方法,通过遮挡相关区域、遮挡无关区域以及替换为另一患者的相同标签扫描来干预图像,并结合三种行为指标测试正确答案是否依赖于图像。在九个系统中,一个没有图像访问权限的纯文本模型达到了最佳多模态模型5.7个准确度点以内的水平,而一个1190亿参数的多模态模型在统计上与70亿参数的纯文本基线无法区分。审计将队列分为三个忽略图像的模型、一个不稳定的模型和五个选择性使用图像的模型(针对部分发现);这些分类在第二个数据集、分辨率和提示措辞上保持一致。与委员会认证的放射科医生相比,纯文本模型在准确率上与放射科医生无统计差异,但基础归因于零,而使用图像的模型的基础归因率与放射科医生相当。报告的置信度仅在模型使用图像时标记无根据的答案。基础归因审计(而非准确性)应成为临床部署的门槛。

English abstract

Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a causal audit that intervenes on the image, occluding the relevant region, occluding an irrelevant one, and swapping in another patient's same-label scan, and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access reaches within 5.7 accuracy points of the best multimodal one, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion text-only baseline. The audit splits the cohort into three models that ignore the image, one that is unstable, and five that use it selectively, for a subset of findings; the categories hold across a second dataset, resolution, and prompt phrasing. Against board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist's accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.

2606.17742 2026-06-17 cs.CV q-bio.NC new submission

BrainWorld: A Structural-Prior-Conditioned Generative Model for Whole-Brain 4D fMRI Dynamics

BrainWorld:一种用于全脑4D fMRI动力学的结构先验条件生成模型

Junfeng Xia, Wenhao Ye, Junxiang Zhang, Xuanye Pan, Mo Wang, Quanying Liu

Affiliations: Department of Biomedical Engineering, Southern University of Science and Technology(南方科技大学生物医学工程系); School of Biomedical Engineering, Shenzhen University(深圳大学生物医学工程学院)

AI summary: Proposes BrainWorld, which conditions on structural MRI as an anatomical prior and generates whole-brain 4D fMRI dynamics through a denoising process; across 22 datasets it generates stable trajectories of up to 400 frames and improves downstream task performance through generated-example augmentation.

AI Chinese abstract

全脑4D fMRI生成对于建模功能性脑动力学具有重要价值,然而现有的fMRI基础模型主要针对表示学习和下游预测,而非条件预测生成。我们提出BrainWorld,一种用于全脑4D fMRI动力学的结构先验条件生成模型。BrainWorld使用sMRI作为受试者级别的解剖上下文来指导未来的fMRI生成,将结构信息整合到去噪过程中,而非将其视为并行模态。在涵盖不同队列和脑状态的22个数据集上评估,BrainWorld能够生成长达400帧的稳定4D fMRI轨迹,通过生成样本增强提升下游性能,并学习到可迁移的多模态表示,优于基线方法。这些结果共同确立了BrainWorld作为长时程脑动力学建模和多模态表示学习的条件感知生成框架。

English abstract

Whole-brain 4D fMRI generation is valuable for modeling functional brain dynamics, yet existing fMRI foundation models mainly target representation learning and downstream prediction rather than conditional predictive generation. We introduce BrainWorld, a structural-prior-conditioned generative model for whole-brain 4D fMRI dynamics. BrainWorld uses sMRI as subject-level anatomical context to guide future fMRI generation, integrating structural information into the denoising process rather than treating it as a parallel modality. Evaluated on 22 datasets spanning diverse cohorts and brain states, BrainWorld generates stable 4D fMRI trajectories up to 400 frames, improves downstream performance through generated-example augmentation, and learns transferable multimodal representations that outperform baselines. Together, these results establish BrainWorld as a condition-aware generative framework for long-horizon brain dynamics modeling and multimodal representation learning.

2606.17836 2026-06-17 cs.CV cs.AI cs.CG cs.GR new submission

High-Fidelity 3D Geometric Reconstruction of Pelvic Organs from MRI: A Hybrid Deep Learning and Iterative Optimization Approach

高保真盆腔器官MRI三维几何重建:一种混合深度学习与迭代优化方法

Hui Wang, Xiaowei Li, Chenxin Zhang, Yifan Feng, Jianwei Zuo, Yumeng Tang, Xiuli Sun, Jianliu Wang, Bing Xie, Jiajia Luo

Affiliations: Institute of Medical Technology, Peking University Health Science Center, Peking University(北京大学医学部医学技术研究院,北京大学); Biomedical Engineering Department, Institute of Advanced Clinical Medicine, Peking University(北京大学先进临床医学研究院生物医学工程系); Department of Obstetrics and Gynecology, Peking University People’s Hospital(北京大学人民医院妇产科部)

AI summary: Proposes a hybrid deformable shape modeling framework that combines deep learning prediction with iterative optimization for high-fidelity 3D geometric reconstruction of the bladder, uterus, and rectum, outperforming existing methods in geometric fidelity and mesh quality.

AI Chinese abstract

从MRI中患者特定的盆腔器官几何三维重建对于盆底建模和下游患者特定分析至关重要。然而,以往研究主要关注图像分割或三维模型的下游使用,高保真、高质量几何的重建仍然劳动密集且缺乏标准化。本研究引入了一种混合可变形形状建模框架,将深度学习预测与迭代优化相结合,用于膀胱、子宫和直肠的重建。该框架包含三个核心组件:一种保持盆腔器官拓扑一致性的几何感知多级深度学习架构;一种平衡全局形状捕获和局部表面细化的两阶段摊销优化训练策略;以及一种整体协同机制——在训练阶段,迭代优化为深度学习提供监督,而在推理阶段,深度学习快速预测全局器官形态,随后通过迭代优化细化局部表面和网格质量。该框架在几何保真度上显著优于当前主流的基于深度学习的器官重建模型。对于各个解剖结构,重建的膀胱、直肠和子宫三维几何实现了显著更低的Chamfer距离值和更高的Dice相似系数分数。此外,在保持高计算效率的同时,所提出的架构产生了优越的整体体积网格质量。在患者层面,该框架在minSICN和minSIGE的10个最差元素上均获得了比传统几何后处理算法更高的平均值。

English abstract

Patient-specific 3D reconstruction of pelvic organ geometry from MRI is important for pelvic floor modeling and downstream patient-specific analysis. However, while previous studies have focused primarily on either image segmentation or downstream use of 3D models, the reconstruction of high-fidelity, high-quality geometries remains labor-intensive and poorly standardized. This study introduces a hybrid deformable shape modeling framework that integrates deep learning prediction with iterative optimization for the reconstruction of the bladder, uterus, and rectum. The framework consists of three core components: a geometry-aware multi-level deep learning architecture that preserves topological consistency of pelvic organs; a two-stage amortized optimization training strategy that balances global shape capture and local surface refinement; and a holistic synergy mechanism in which iterative optimization provides supervision for deep learning during training, while during inference deep learning rapidly predicts the global organ morphology and iterative optimization then refines local surfaces and mesh quality. The framework demonstrated marked superiority in geometric fidelity over current mainstream deep learning-based organ reconstruction models. For individual anatomical structures, the reconstructed 3D geometries for the bladder, rectum, and uterus achieved significantly lower Chamfer Distance values and higher Dice Similarity Coefficient scores. In addition, while maintaining high computational efficiency, the proposed architecture yielded superior overall volumetric mesh quality. At the patient level, the framework achieved higher mean values for the 10 worst elements for both minSICN and minSIGE compared to traditional geometric post-processing algorithms.
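
The two geometric fidelity metrics named in the abstract can be sketched as follows. Chamfer distance has several variants in the literature (squared or unsquared distances, sum or mean); the version below uses mean unsquared nearest-neighbour distance in both directions, which is an assumption and may differ from the authors' definition. The toy points and voxel sets are illustrative only.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance in both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return one_way(points_a, points_b) + one_way(points_b, points_a)

def dice_coefficient(voxels_a, voxels_b):
    """Dice = 2 |A intersect B| / (|A| + |B|) for two sets of occupied voxel indices."""
    a, b = set(voxels_a), set(voxels_b)
    return 2 * len(a & b) / (len(a) + len(b))

# Toy surface samples and voxel sets; not data from the paper.
surface_pred = [(0, 0, 0), (1, 0, 0)]
surface_true = [(0, 0, 0), (1, 0, 1)]
volume_pred = [(0, 0, 0), (0, 0, 1), (0, 1, 0)]
volume_true = [(0, 0, 0), (0, 0, 1), (1, 1, 1)]
print(chamfer_distance(surface_pred, surface_true))
print(dice_coefficient(volume_pred, volume_true))
```

Lower Chamfer distance means the reconstructed surface lies closer to the reference surface, while higher Dice means greater volumetric overlap, so the two metrics complement each other.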

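2606.17867 2026-06-17 cs.CV cs.AI new submission

A Quantitative Analysis of Multimodal Biomarkers in Alzheimer's Disease

阿尔茨海默病多模态生物标志物的定量分析

Antonio Scardace, Daniele Ravì

Affiliations: Department of Mathematics and Computer Science(数学与计算机科学系), University of Catania(卡塔尼亚大学); Department MIFT(MIFT部门), University of Messina(梅西纳大学)

AI summary: Integrates tau-PET, structural MRI, cognitive scores, and APOE4 data to quantify redundancy and predictive dependencies among multimodal biomarkers, examines the association between tau topology and atrophy, and decomposes the tau-cognition association, improving the interpretability of AD biomarker selection.

Comments: Accepted to ICTS4eHealth 2026

AI Chinese abstract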

尽管阿尔茨海默病(AD)研究中越来越多地采用多模态方法——旨在整合分子、结构、临床和遗传生物标志物以增强疾病表征——但这些模态之间的关系仍知之甚少。对其动态相互作用进行系统分析对于改进疾病建模、识别冗余评估以及减少患者负担和获取成本至关重要。在本文中,我们通过整合来自ADNI数据集的789名受试者的tau-PET、结构MRI、认知评分(MMSE和CDR)以及APOE4数据,对多模态AD生物标志物进行了定量分析。在我们的分析中,我们(A)量化跨模态互信息和解释方差以评估冗余和预测依赖性;(B)检查tau拓扑与跨脑区结构萎缩之间的关联以选择信息性ROI;(C)对tau-认知关联进行统计分解,分为萎缩相关和萎缩无关成分;(D)识别与认知衰退一致的主要神经退行性轨迹。本研究提供了跨模态关系的系统表征,提高了AD生物标志物的可解释性和选择。代码公开于:https://github.com/antonioscardace/Multimodal-AD。

English abstract

Despite increasing adoption of multimodal approaches in Alzheimer's Disease (AD) research -- aimed at integrating molecular, structural, clinical, and genetic biomarkers to enhance disease characterization -- the relationships among these modalities remain poorly understood. A systematic analysis of their dynamic interaction is essential for improving disease modeling, identifying redundant assessments, and reducing patient burden and acquisition costs. In this paper, we present a quantitative analysis of multimodal AD biomarkers by integrating tau-PET, structural MRI, cognitive scores (MMSE and CDR), and APOE4 data from 789 subjects drawn from the ADNI dataset. In our analyses, we (A) quantify cross-modal mutual information and explained variance to assess redundancy and predictive dependencies; (B) examine associations between tau topologies and structural atrophy across brain regions to select informative ROIs; (C) perform a statistical decomposition of the tau-cognition association into atrophy-related and atrophy-independent components; and (D) identify a dominant neurodegenerative trajectory that aligns with cognitive decline. This study provides a systematic characterization of cross-modal relationships, improving the interpretability and selection of biomarkers in AD. Code is publicly available at: https://github.com/antonioscardace/Multimodal-AD.
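
Analysis (A) quantifies cross-modal mutual information. As a minimal sketch, assuming both biomarkers have first been discretized (for example into low/high bins, which is a choice made here and not taken from the paper), a plug-in estimate can be computed from joint counts. The estimator actually used by the authors is not stated in the abstract, and `mutual_information` is an illustrative name.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in mutual information (nats) between two discrete sequences of equal length."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), count in joint.items():
        p_xy = count / n
        # p_xy / (p_x * p_y) rewritten with raw counts
        mi += p_xy * math.log(count * n / (px[x] * py[y]))
    return mi

# Toy binned biomarkers (0 = low, 1 = high); not ADNI data.
tau_bin = [0, 0, 1, 1]
atrophy_bin = [0, 0, 1, 1]      # fully redundant with tau_bin
unrelated_bin = [0, 1, 0, 1]    # independent of tau_bin
print(mutual_information(tau_bin, atrophy_bin))
print(mutual_information(tau_bin, unrelated_bin))
```

High mutual information between two modalities suggests that one carries little information beyond the other, which is the sense of redundancy the abstract refers to.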

2606.17958 2026-06-17 cs.CV cs.LG new submission

Beyond Visual Cues: CoT-Enhanced Reasoning for Semi-supervised Medical Image Segmentation

超越视觉线索:CoT增强推理用于半监督医学图像分割

Yuming Chen, Yuxin Xie, Tao Zhou, Yi Zhou

Affiliations: School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院); Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education(教育部新一代人工智能技术及其跨学科应用重点实验室); Nanjing University of Science and Technology(南京理工大学)

AI summary: Proposes the CERS framework, which integrates Chain-of-Thought reasoning and a semantic-aware reference selection strategy to address the visual-semantic mismatch in semi-supervised medical image segmentation, outperforming existing methods in cases with boundary ambiguity and semantic inconsistency.

Comments: Accepted to MICCAI 2026

AI Chinese abstract

半监督医学图像分割已成为医学图像分析中的主导研究问题,通过对未标记数据利用一致性正则化来缓解标注稀缺。然而,现有方法主要通过视觉模式匹配操作,严重依赖像素级相似性。这种以视觉为中心的依赖在临床场景中常常失效,因为视觉上相似的病变可能需要不同的诊断结论,从而无法捕捉专家使用的潜在诊断逻辑。为了解决这个问题,我们超越视觉线索,提出了CERS(CoT增强推理分割),一个集成链式思维(CoT)推理以区分病理上不同案例的框架。具体来说,我们构建了一个知识池,其中包含由大型语言模型(LLMs)生成的丰富语言推理描述。引入了一种语义感知的参考选择策略来识别历史证据,首先通过形态学过滤候选,然后通过CoT一致性进行细化以消除硬负样本。此外,设计了多尺度坐标注意力模块(MCAM)以有效地将这种推理衍生的上下文融合到解码过程中。大量实验证明了CERS相对于最先进方法的优越性,特别是在解决边界模糊和语义不一致方面。代码可在 https://github.com/cymasuna/CERS 获取。

English abstract

Semi-supervised medical image segmentation has emerged as a dominant research problem in medical image analysis, mitigating annotation scarcity by leveraging consistency regularization on unlabeled data. However, existing approaches operate predominantly via visual pattern matching, relying heavily on pixel-level similarities. This visual-centric dependency often falters in clinical scenarios characterized by the visual-semantic mismatch, where visually similar lesions warrant distinct diagnostic conclusions, thus failing to capture the underlying diagnostic logic used by experts. To address this, we move beyond visual cues and propose CERS (CoT-Enhanced Reasoning Segmentation), a framework that integrates Chain-of-Thought (CoT) reasoning to distinguish pathologically distinct cases. Specifically, we construct a knowledge pool enriched with linguistic reasoning descriptions generated by large language models (LLMs). A semantic-aware reference selection strategy is introduced to identify historical evidence, filtering candidates first by morphology, and then refining them via CoT consistency to eliminate hard negatives. Furthermore, a multi-scale coordinate attention module (MCAM) is designed to effectively fuse this reasoning-derived context into the decoding process. Extensive experiments demonstrate the superiority of CERS against state-of-the-art approaches, particularly in resolving boundary ambiguities and semantic inconsistencies. The code is available at https://github.com/cymasuna/CERS.

2606.17972 2026-06-17 cs.CV cs.AI new submission

SegDINO: Introducing Multi-Scale Structure into DINO for Efficient Medical Image Segmentation

SegDINO: 将多尺度结构引入DINO以实现高效医学图像分割

Sicheng Yang, Hongqiu Wang, Zhaohu Xing, Sixiang Chen, Qiuxia Yang, Yize Mao, Guang Yang, Lei Zhu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Sun Yat-sen University Cancer Center(中山大学肿瘤防治中心); Imperial College London(帝国理工学院)

AI summary: Proposes SegDINO, which introduces multi-scale structure into DINO through Token Pyramid Adaptation and Scale-Aware Decoding, achieving state-of-the-art medical image segmentation results while remaining efficient.

Comments: Code: https://github.com/script-Yang/segdino_v2

AI Chinese abstract

自监督DINO模型提供了强大的可迁移视觉表示,但直接应用于图像分割仍具挑战。现有方法通常依赖带有复杂上采样的重型解码器,引入大量参数和计算开销。我们观察到,向DINO特征引入尺度远比增加解码器容量更为关键。本文提出SegDINO,一种高效分割框架,将DINOv3骨干网络与轻量级尺度建模相结合。SegDINO引入令牌金字塔适应(TPA)将中间DINO特征重组为伪多尺度层次,以及尺度感知解码(SAD)实现高效的尺度内细化和自顶向下的多尺度传播。我们进一步整理了PanCT,一个包含284名患者专家标注胰腺肿瘤的新CT数据集,以评估SegDINO处理困难小病灶的能力。在PanCT和三个公共基准上的大量实验表明,SegDINO以高效率实现了最先进的结果。代码见 https://github.com/script-Yang/segdino_v2。

English abstract

Self-supervised DINO models provide strong transferable visual representations, yet applying them directly to image segmentation remains challenging. Existing approaches commonly rely on heavy decoders with complex upsampling, introducing substantial parameter and computational overhead. We observe that introducing scale into DINO features is far more critical than increasing decoder capacity. In this work, we present SegDINO, an efficient segmentation framework that integrates a DINOv3 backbone with lightweight scale modeling. SegDINO introduces Token Pyramid Adaptation (TPA) to reorganize intermediate DINO features into a pseudo multi-scale hierarchy, and Scale-Aware Decoding (SAD) for efficient intra-scale refinement and top-down multi-scale propagation. We further curate PanCT, a new CT dataset containing 284 patients with expert-annotated pancreatic tumors, to assess SegDINO's ability to handle difficult small-lesion cases. Extensive experiments on PanCT and three public benchmarks demonstrate that SegDINO achieves state-of-the-art results with high efficiency. The code is available at https://github.com/script-Yang/segdino_v2.

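2606.17989 2026-06-17 cs.CV cs.AI new submission

Recover Semantics First, Generate Better: Improved Latent Modeling for 3D MRI Reconstruction and Cross-Contrast Synthesis

先恢复语义,再生成更好:改进的潜在建模用于3D MRI重建和跨对比合成

Yonghao Chen, Sicheng Yang, Rui Tang, Lei Zhu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)); Xi’an Jiaotong University(西安交通大学)

AI summary: Proposes a semantics-first latent modeling framework that uses a Latent Harmonization Encoder, a Semantic Recovery Block, and an Anatomy-aware Frequency Loss to address weak long-range anatomical coherence, semantic degradation, and over-smoothed reconstructions in 3D MRI compression, improving reconstruction and cross-contrast synthesis quality.

Comments: Code: https://github.com/script-Yang/RSF

AI Chinese abstract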

多对比磁共振成像(MRI)为临床诊断提供互补信息。然而,获取所有MRI序列通常耗时且成本高昂。最近的生成模型通过从可用对比推断缺失对比来进行跨对比合成以解决此问题。尽管如此,合成3D MRI面临重大挑战。由于体积巨大,直接在像素空间操作在计算上不可行;因此,常见方法是先将3D体积压缩到潜在空间,然后在该空间中训练生成模型。我们观察到现有压缩架构存在几个关键问题:它们未能保持长程解剖一致性,丢弃了临床有意义的语义,并依赖于导致过度平滑重建的优化目标。最终,这些缺陷损害了后续生成模型的性能。在这项工作中,我们提出了一种语义优先的潜在建模框架,用于3D MRI重建和跨对比合成。具体来说,我们引入了潜在协调编码器(LHE)来捕获全局解剖依赖关系,确保体积表示的一致性。为了减轻潜在压缩过程中的语义退化,我们进一步设计了语义恢复块(SRB),该块从自监督语义教师注入高级先验,增强潜在空间中对比感知的可分离性。此外,我们提出了解剖感知频率损失(AFL),以自适应地保留诊断相关的高频结构。在两个公共多对比MRI数据集上的大量实验表明,重建保真度和跨对比合成质量持续提升。我们的代码可在 https://github.com/script-Yang/RSF 获取。

English abstract

Multi-contrast magnetic resonance imaging (MRI) provides complementary information for clinical diagnosis. However, acquiring all MRI sequences is often time-consuming and costly. Recent generative models perform cross-contrast synthesis to address this issue by inferring absent contrasts from the available ones. Nevertheless, synthesizing 3D MRI presents significant challenges. Due to the massive volume sizes, operating directly in the pixel space is computationally prohibitive; therefore, a common approach is to first compress the 3D volumes into a latent space and subsequently train generative models in that space. We observe that existing compression architectures face several critical issues: they under-preserve long-range anatomical coherence, discard clinically meaningful semantics, and rely on optimization objectives that lead to over-smoothed reconstructions. Ultimately, these shortcomings compromise the performance of subsequent generative models. In this work, we propose a semantics-first latent modeling framework for 3D MRI reconstruction and cross-contrast synthesis. Specifically, we introduce a Latent Harmonization Encoder (LHE) to capture global anatomical dependencies, ensuring coherent volumetric representations. To mitigate semantic degradation during latent compression, we further design a Semantic Recovery Block (SRB) that injects high-level priors from a self-supervised semantic teacher, enhancing contrast-aware separability in the latent space. Additionally, we propose an Anatomy-aware Frequency Loss (AFL) to adaptively preserve diagnostically relevant high-frequency structures. Extensive experiments on two public multi-contrast MRI datasets demonstrate consistent improvements in reconstruction fidelity and cross-contrast synthesis quality. Our code is available at https://github.com/script-Yang/RSF.

2606.18063 2026-06-17 cs.CV cs.AI cs.LG 新提交

When LLMs Analyze Scars: From Images to Clinically-Meaningful Features

当LLM分析疤痕:从图像到临床有意义的特征

Ruman Wang, Hangting Ye

发表机构 * Liaoning University of Traditional Chinese Medicine(辽宁中医药大学) School of Artificial Intelligence, Jilin University(吉林大学人工智能学院)

AI总结 提出ScaFE框架,利用LLM作为知识驱动的特征工程师,将高维图像转化为低维临床可解释特征,在数据稀缺的疤痕分类中优于端到端深度学习方法。

AI中文摘要

医学图像分类面临一个基本困境:虽然深度学习模型在大规模数据上表现卓越,但现实临床场景中由于标注成本、隐私约束和疾病罕见性,常常遭受严重的数据稀缺。这一挑战在病理性疤痕分类中尤为突出,区分瘢痕疙瘩和增生性疤痕需要微妙的专家知识,且标注图像极其有限。我们提出一种新范式,将大型语言模型(LLM)重新定位为知识驱动的特征工程师,而非端到端分类器。我们将此框架称为ScaFE(疤痕特征工程)。我们的关键洞察是,LLM编码了丰富的医学知识,可以外部化为可执行的特征提取代码,从而将高维图像转化为低维、临床可解释的表示。具体来说,我们使用既定的疤痕评估标准提示LLM,生成确定性的Python代码,提取与临床评分系统(如温哥华疤痕量表)对齐的特征。我们的方法提供三个关键优势:(1)数据效率,通过将知识获取与统计学习解耦,在有限训练样本下实现稳健性能;(2)隐私保护,原始图像在本地处理,不暴露给外部LLM;(3)可解释性,通过基于临床推理的显式特征。在疤痕分类上的大量实验表明,在数据有限条件下,我们的方法始终优于端到端深度学习基线或使用LLM作为黑盒分类器,为将LLM集成到数据高效且临床透明的医学AI系统中开辟了有前景的方向。

英文摘要

Medical image classification faces a fundamental dilemma: while deep learning models achieve remarkable performance at scale, real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. This challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a novel paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. We call this framework ScaFE (Scar Feature Engineering). Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, enabling the transformation of high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. Our approach offers three key advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally without exposure to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that, under limited data conditions, our method consistently outperforms both end-to-end deep learning baselines and LLMs used as black-box classifiers, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.
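As a rough illustration of what "deterministic feature extraction code" could look like, the sketch below computes a single colour feature loosely inspired by the vascularity item of the Vancouver Scar Scale. Everything here is an assumption for illustration: the function names `mean_redness` and `relative_redness`, the redness formula, and the toy pixel values are hypothetical and are not taken from ScaFE.

```python
def mean_redness(pixels):
    """Mean of R / (R + G + B) over a list of (r, g, b) tuples.

    A neutral grey pixel scores 1/3; redder pixels score higher.
    """
    if not pixels:
        raise ValueError("at least one pixel is required")
    ratios = []
    for r, g, b in pixels:
        total = r + g + b
        # Pure black carries no colour information; treat it as neutral.
        ratios.append(r / total if total else 1.0 / 3.0)
    return sum(ratios) / len(ratios)


def relative_redness(scar_pixels, skin_pixels):
    """Redness of the scar region relative to surrounding skin (hypothetical feature)."""
    return mean_redness(scar_pixels) - mean_redness(skin_pixels)


# Toy regions: two reddish scar pixels and one grey reference skin pixel.
scar = [(200, 100, 100), (180, 90, 90)]
skin = [(150, 150, 150)]
feature = relative_redness(scar, skin)
```

Because the function is deterministic and runs locally, the same image always yields the same value and no pixel data has to leave the machine, which matches the privacy argument in the abstract.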

2606.18123 2026-06-17 cs.CV 新提交

Predicting Immune Biomarkers with MultiModal Mixture-of-Expert Pathology Foundation Models Empowers Precision Oncology

使用多模态混合专家病理基础模型预测免疫生物标志物,赋能精准肿瘤学

Tianyu Liu, Ziqing Wang, Zhaokang Liang, Tong Ding, Peter Humphrey, Lorraine Colón-Cartagena, Emily Ling-Lin Pai, Kenneth Tou En Chang, Mohamed Kahila, Jonathan Chong Kai Liew, Tinglin Huang, Rex Ying, Kaize Ding, Faisal Mahmood, Wengong Jin

发表机构 * Program of Computational Biology and Bioinformatics, Yale University(耶鲁大学计算生物学与生物信息学项目) Broad Institute of MIT and Harvard(麻省理工学院与哈佛大学博德研究所) Department of Statistics and Data Science, Northwestern University(西北大学统计与数据科学系) Department of Computer Science, Northeastern University(东北大学计算机科学系) Department of Computer Science, Harvard University(哈佛大学计算机科学系) Department of Pathology, Yale University(耶鲁大学病理学系) Department of Anatomic Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania(宾夕法尼亚大学医院解剖病理学与检验医学系) Department of Pathology and Laboratory Medicine, University of California, San Francisco(加州大学旧金山分校病理学与检验医学系) Department of Pathology and Laboratory Medicine, KK Women’s and Children’s Hospital(竹脚妇幼医院病理学与检验医学系) Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania(宾夕法尼亚大学佩雷尔曼医学院生物统计学、流行病学与信息学系)

AI总结 提出MixTIME多模态基础模型,采用混合专家架构整合不同模态的病理基础模型,从HE全切片图像预测多重免疫荧光蛋白表达,在17个蛋白标记物上达到最优性能,并增强空间域识别、生存预测等下游任务。

Comments 5 figures

AI中文摘要

预测与肿瘤免疫微环境(TIME)相关的免疫生物标志物对于推进精准肿瘤学至关重要,但现有方法主要局限于单一图像模态,且存在分辨率不足以及未能充分利用互补的临床和生物学信息的问题。本文介绍MixTIME,一种多模态基础模型,利用混合专家(MoE)架构整合在不同模态上训练的病理基础模型:纯图像(UNIv2)、图像文本(CONCHv1.5)和图像转录组(STPath)表示,用于从苏木精-伊红(HE)全切片图像进行像素级和切片级的多重免疫荧光(mIF)蛋白表达预测。MixTIME采用可学习路由器动态加权专家贡献,并使用分布和趋势感知的损失函数进行训练。在两个不同规模的数据集上进行基准测试,MixTIME在17个蛋白标记物上通过相关性指标衡量达到了最先进的性能。预测的mIF图谱显著增强了下游任务,包括空间域识别、生存预测以及由全球多个机构的病理专家验证的AI辅助病理报告生成。此外,MixTIME能够跨临床时间点纵向追踪蛋白表达动态,并揭示与肿瘤微环境中耐药性和免疫抑制相关的蛋白-基因相互作用模式。总之,MixTIME为计算病理学中的多模态生物标志物发现和临床转化提供了一个可扩展的框架。

英文摘要

Predicting immune biomarkers associated with the tumor immune microenvironment (TIME) is critical for advancing precision oncology, yet existing approaches are largely limited to single image modalities and suffer from insufficient resolution and incomplete utilization of complementary clinical and biological information. Here we introduce MixTIME, a multimodal foundation model that leverages a mixture-of-experts (MoE) architecture to integrate pathology foundation models trained across distinct modalities: image-only (UNIv2), image-text (CONCHv1.5), and image-transcriptomic (STPath) representations for pixel-level and slide-level prediction of multiplex immunofluorescence (mIF) protein expression from hematoxylin and eosin (HE) whole-slide images. MixTIME employs a learnable router to dynamically weight expert contributions and is trained with a distribution- and tendency-aware loss function. Benchmarked on two datasets of different scales, MixTIME achieves state-of-the-art performance across 17 protein markers as measured by correlation metrics. The predicted mIF profiles substantially enhance downstream tasks, including spatial domain identification, survival prediction, and AI-assisted pathology report generation validated by expert pathologists from multiple institutes across the world. Furthermore, MixTIME enables longitudinal tracking of protein expression dynamics across clinical time points and reveals protein-gene interaction patterns linked to drug resistance and immune suppression in tumor microenvironments. Collectively, MixTIME provides a scalable framework for multimodal biomarker discovery and clinical translation in computational pathology.
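The "learnable router" that weights expert contributions can be pictured as a softmax gate over per-expert embeddings. The following is a minimal sketch under stated assumptions, not MixTIME's implementation: the names `softmax` and `route_experts`, the two-dimensional toy embeddings, and the fixed gate logits are all hypothetical (in a trained model the logits would come from a learned layer).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_experts(expert_embeddings, gate_logits):
    """Fuse expert embeddings as a softmax-weighted sum.

    expert_embeddings: equal-length vectors, one per expert.
    gate_logits: one router score per expert (hypothetical values here).
    """
    weights = softmax(gate_logits)
    dim = len(expert_embeddings[0])
    fused = [0.0] * dim
    for w, emb in zip(weights, expert_embeddings):
        for i in range(dim):
            fused[i] += w * emb[i]
    return weights, fused


# Toy stand-ins for the image-only, image-text and image-transcriptomic experts.
experts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, fused = route_experts(experts, gate_logits=[0.0, 0.0, 0.0])
```

With equal logits every expert receives weight 1/3; raising one logit shifts the fused embedding toward that expert, which is the sense in which the weighting is dynamic.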

2606.17213 2026-06-17 cs.CL cs.CV 交叉投稿

Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

重新审视用于3D CT报告生成的LLM适应:缩放与诊断先验研究

Vanshali Sharma, Andrea M. Bejar, Halil Ertugrul Aktas, Quoc-Huy Trinh, Debesh Jha, Gorkem Durak, Ulas Bagci

发表机构 * Northwestern University(西北大学) University of South Dakota(南达科他大学) Aalto University(阿尔托大学)

AI总结 提出RAD3D-Prefix轻量级诊断先验框架,通过冻结大语言模型并融合图像嵌入与多标签诊断分类logits,以少量可训练参数实现3D CT报告生成,优于可比的参数高效基线,可训练参数远少于全微调方案,并展现强域外泛化性。

AI中文摘要

多模态学习的最新进展,包括大型语言模型(LLM)和视觉-语言模型(VLM),已展现出对自然图像的强大适应性。然而,将其扩展到医学领域,特别是体积(3D)图像,由于高计算复杂度、体积依赖性和视觉特征与临床术语之间的语义差距而具有挑战性。在有限的医学数据上对LLM进行朴素微调常常导致过拟合和临床幻觉,其中语言流畅性优先于临床事实性。在本研究中,我们研究了用于体积CT报告生成的参数高效适应策略,并引入了RAD3D-Prefix,一种轻量级的诊断先验条件框架,最大限度地减少了对大量参数训练的需求。该模块将图像嵌入与多标签诊断分类logits相结合,保留了关键的临床细节,同时弥合了语义差距。通过保持LLM冻结,我们的方法需要最少的可训练参数,并减轻了在小规模、特定领域数据集上过拟合的风险。通过对从96.1M到1.6B参数的LLM进行系统研究,我们发现微调对较小的LLM最有益,而冻结较大的(约1B+)LLM并仅训练轻量级投影层在性能、泛化性和计算效率之间提供了优越的权衡。在多个自动指标和一项临床读者研究中,RAD3D-Prefix优于可比较的参数高效基线,并在使用比全微调替代方案少得多的可训练参数的情况下,展现出强大的域外泛化能力。

英文摘要

Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain, particularly for volumetric (3D) images, is challenging due to high computational complexity, volumetric dependencies, and the semantic gap between visual features and clinical terminology. Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, where linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that minimizes the need for extensive parameter training. This module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. By keeping the LLM frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger (~1B+) LLMs and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives.

2606.17295 2026-06-17 eess.IV cs.CV 交叉投稿

Phenotyping TPF via Self-Supervised Learning: A Label-Agnostic Framework with Expert Validation

通过自监督学习进行胫骨平台骨折表型分析:一种具有专家验证的标签无关框架

Miral Elnakib, Muhammad Saad, Ahmad Al-Kabbany

发表机构 * Faculty of Sciences(科学学院) Alexandria University(亚历山大大学) Multimedia Interaction and Communication Lab(多媒体交互与通信实验室) Wearables, Biosensing, and Biosignal Processing Research Lab(可穿戴设备、生物传感与生物信号处理研究实验室) Arab Academy for Science and Technology(阿拉伯科学与技术学院)

AI总结 提出一种标签无关的自监督学习框架,利用SimCLR和聚类从X光片中直接学习骨折表征,发现四种影像衍生表型,经盲法专家验证具有稳定性和临床可解释性,与常规分类正交。

AI中文摘要

人工智能在胫骨平台骨折特征描述中的全部潜力尚未实现,受限于对标注数据集的根本依赖,而标注数据集的一致性无法保证:传统的分类方案如Schatzker和AO/OTA存在观察者间变异性,导致监督模型学习的是人类分歧而非稳定的骨折形态。我们设计、实现并验证了一个标签无关的框架,通过直接从影像数据中学习骨折表征来消除这一约束,无需观察者分配的标签。使用RadImageNet预训练的ResNet-50编码器,在154张经过清洗的膝关节X光片上以SimCLR对比目标进行微调;微调前执行数据清洗协议,微调后进行UMAP降维和k-means聚类,以发现四种影像衍生表型。表型有效性通过盲法专家审查协议评估,由两名独立临床医生完成。四种表型表现出稳健的稳定性(bootstrap ARI = 0.319 +/- 0.041)、强内部凝聚力(轮廓系数 = 0.511),以及两名评审者在盲法条件下给出3-5/5的一致性评分;一种表型被一致认为表现出粉碎性——一种在没有监督信号的情况下分离出的高复杂性特征。与Schatzker标签的跨分区比较得出ARI = 0.013,证实了与传统分类边界的正交性。值得注意的是,锚定于既定分类词汇的专家评审者在Schatzker对齐度最低的地方认为影像衍生组是异质的,这表明Schatzker训练的感知和标签无关的嵌入几何测量的是正交维度。这些发现确立了标签无关的SSL表型分析作为传统分类的可重复且临床可解释的补充。

英文摘要

The full potential of artificial intelligence in tibial plateau fracture characterisation remains unrealised, constrained by a fundamental dependency on labelled datasets whose consistency cannot be guaranteed: conventional classification schemes such as Schatzker and AO/OTA suffer from inter-observer variability, causing supervised models to learn human disagreement rather than stable fracture morphology. We design, implement, and validate a label-agnostic framework that eliminates this constraint by learning fracture representations directly from imaging data without observer-assigned labels. A RadImageNet-pretrained ResNet-50 encoder is fine-tuned on 154 cleaned knee radiographs using the SimCLR contrastive objective, preceded by a data cleaning protocol and followed by UMAP dimensionality reduction and k-means clustering to discover four imaging-derived phenotypes. Phenotype validity is assessed through a blinded expert review protocol administered to two independent clinicians. The four phenotypes demonstrate robust stability (bootstrap ARI = 0.319 +/- 0.041), strong internal cohesion (silhouette = 0.511), and coherence ratings of 3-5/5 from both reviewers under blinded conditions; one phenotype was unanimously identified as exhibiting comminution -- a high-complexity feature isolated without any supervisory signal. Inter-partition comparison against Schatzker labels yields ARI = 0.013, confirming orthogonality to conventional classification boundaries. Notably, expert reviewers anchored to established classification vocabularies perceived imaging-derived groups as heterogeneous precisely where Schatzker alignment was lowest, suggesting that Schatzker-trained perception and label-agnostic embedding geometry measure orthogonal dimensions. These findings establish label-agnostic SSL phenotyping as a reproducible and clinically interpretable complement to conventional classification.
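The adjusted Rand index (ARI) quoted above compares two partitions of the same items and corrects for chance agreement, so values near 0 (such as the reported 0.013 against Schatzker labels) mean the partitions agree about as much as random labelings would. Below is a small stdlib-only sketch of the standard pair-counting formula; the function name and the toy label lists are illustrative and do not come from the paper.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_a) != len(labels_b):
        raise ValueError("labelings must have the same length")
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on

    # Pairs grouped together in both, in A only, and in B only.
    sum_joint = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())

    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both labelings
    return (sum_joint - expected) / (max_index - expected)


# Same grouping with renamed cluster ids, and a maximally crossed grouping.
same = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index ignores cluster names, which is why it suits comparing k-means cluster ids with clinician-assigned classes.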

2606.17504 2026-06-17 eess.IV cs.CV 交叉投稿

Two-Stage Fine-Tuning of ResNet50 for High-Sensitivity Melanoma Detection on Dermoscopic Images

ResNet50的两阶段微调用于皮肤镜图像中高灵敏度黑色素瘤检测

Aryan Bhagat

AI总结 提出ResNet50的两阶段微调方法:仅对训练集做随机过采样以缓解类别不平衡,先冻结骨干训练分类头,再以低学习率微调全部层以改善迁移学习,在3826张测试图像上实现AUC-ROC 0.9559,灵敏度87.56%,优于单阶段微调。

Comments 13 pages, 4 figures, 4 tables. Code available at https://github.com/Aryanbhagat23/melanoma-detection

AI中文摘要

黑色素瘤是最危险的皮肤癌,早期检测五年生存率超过99%,但一旦扩散则急剧下降。本文提出并评估了一种两阶段微调方法,用于皮肤镜图像上的二分类黑色素瘤检测,基于ResNet50。解决的核心挑战是类别不平衡和单阶段微调导致的迁移学习次优。在分层训练/验证/测试分割后,仅对训练集应用随机过采样以实现1:1类别平衡。第一阶段冻结ResNet50骨干网络,仅训练分类头;第二阶段以1e-5的低学习率联合微调所有层,以防止对已学习视觉特征的灾难性遗忘。在包含3826张图像的独立测试集上,模型实现了AUC-ROC为0.9559,准确率88.34%,灵敏度87.56%,特异度89.13%,F1分数88.29%。消融研究证实两阶段协议显著优于单阶段微调,灵敏度提升超过4%。Grad-CAM可视化展示了正确的病变定位。提供了完全可部署的Streamlit检测应用程序及所有训练代码。

英文摘要

Melanoma is the most dangerous form of skin cancer with five-year survival rates exceeding 99% when detected early but falling sharply once the disease spreads. This paper proposes and evaluates a two-stage fine-tuning approach for ResNet50 applied to binary melanoma classification on dermoscopic images. The core challenges addressed are class imbalance and suboptimal transfer learning from single-stage fine-tuning. After stratified train/validation/test splitting, random oversampling was applied exclusively to the training set to achieve a 1:1 class balance. Stage 1 trained only the classification head with the ResNet50 base frozen, while Stage 2 fine-tuned all layers jointly at a low learning rate of 1e-5 to prevent catastrophic forgetting of learned visual features. On an independent test set of 3,826 images, the model achieved an AUC-ROC of 0.9559, accuracy of 88.34%, sensitivity of 87.56%, specificity of 89.13%, and F1-score of 88.29%. An ablation study confirms the two-stage protocol significantly outperforms single-stage fine-tuning, with sensitivity gains of over 4%. Grad-CAM visualizations demonstrate correct lesion localization. A fully deployable Streamlit detection application is provided alongside all training code.
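The abstract stresses that random oversampling is applied to the training split only, so that validation and test data keep their natural class ratio. A minimal stdlib sketch of that step follows; the function name `oversample_minority`, the seed, and the toy file names are assumptions made for illustration, and the authors' code may differ.

```python
import random
from collections import Counter


def oversample_minority(samples, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until classes are 1:1."""
    counts = Counter(labels)
    if len(counts) != 2:
        raise ValueError("binary labels expected")
    (_, n_major), (minor, n_minor) = counts.most_common(2)

    rng = random.Random(seed)  # fixed seed keeps the resampling reproducible
    minority_idx = [i for i, y in enumerate(labels) if y == minor]
    extra = [rng.choice(minority_idx) for _ in range(n_major - n_minor)]

    new_samples = list(samples) + [samples[i] for i in extra]
    new_labels = list(labels) + [minor] * len(extra)
    return new_samples, new_labels


# Call this AFTER the stratified split, on the training portion only.
train_x = ["img%d.jpg" % i for i in range(10)]
train_y = [0] * 8 + [1] * 2          # 8 benign, 2 melanoma (toy numbers)
bal_x, bal_y = oversample_minority(train_x, train_y)
```

Oversampling before the split would leak duplicated images into the test set and inflate the reported metrics, which is presumably why the abstract calls out the ordering.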

2606.17791 2026-06-17 cs.CL cs.CV 交叉投稿

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

Slop悖论:合成标准化如何侵蚀AI重写放射学报告中的临床不确定性和跨模态对齐

Samar Ansari

发表机构 * School of Computing and Engineering Sciences, University of Chester(切斯特大学计算与工程科学学院)

AI总结 本研究通过控制实验测量AI重写放射学报告导致的信息退化,发现电子健康记录摘要虽破坏内容但保留图像-文本对齐,而标准化重写和教学病例准备则相反,造成更大对齐损失,称为slop悖论。

详情
AI中文摘要

AI辅助临床文档工具越来越多地使用大型语言模型(LLMs)对放射学报告进行摘要、标准化和重新格式化。我们提出了对由此产生的信息退化的受控测量。使用印第安纳大学数据集的450份胸部X光报告,我们通过三种真实的LLM重写任务生成合成版本:电子健康记录摘要、标准化重写和教学病例准备。我们测量了实体侵蚀(通过医学命名实体识别)、对冲崩溃(临床不确定性语言的丧失)和跨模态对齐退化(通过BiomedCLIP图像-文本相似度)。我们的核心发现是信息损失与跨模态保真度之间的分离。电子健康记录摘要在内容层面最具破坏性,侵蚀了51.4%的临床实体和43.7%的对冲语言,但它几乎完全保留了图像-文本对齐(下降2.5%)。旨在生成更干净训练数据的两个任务,即标准化重写和教学病例准备,则相反:它们保留了更多实体(分别侵蚀26.8%和29.3%),但导致14.9-16.5%的对齐下降,是电子健康记录摘要的六到七倍。我们称之为slop悖论:使临床文本看起来更干净以用于多模态训练的重写恰恰使其偏离图像。与我们预先指定的假设相反,罕见病理并未优先退化:在九次罕见与常见比较中,没有差异在多重比较校正后幸存,且名义差异方向相反(常见>罕见),因此污染对特定条件监测不可见。退化的主要决定因素是AI重写任务的类型,而非临床内容。这些发现对多模态医学AI数据集构建和AI辅助临床文档的治理具有重要意义。

英文摘要

AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using 450 chest X-ray reports from the Indiana University dataset, we generate synthetic versions via three realistic LLM rewriting tasks: EHR summarization, standardized rewriting, and teaching case preparation. We measure entity erosion (via medical NER), hedging collapse (loss of clinical uncertainty language), and cross-modal alignment degradation (via BiomedCLIP image-text similarity). Our central finding is a dissociation between information loss and cross-modal fidelity. EHR summarization is the most destructive at the content level, eroding 51.4% of clinical entities and 43.7% of hedging language, yet it preserves image-text alignment almost entirely (a 2.5% drop). The two tasks meant to produce cleaner training data, standardized rewriting and teaching case preparation, do the reverse: they preserve more entities (26.8% and 29.3% eroded) but cause 14.9-16.5% alignment drops, six to seven times those of EHR summarization. We term this the slop paradox: rewriting that makes clinical text look cleaner for multimodal training is precisely what pulls it away from the image. Contrary to our pre-specified hypothesis, rare pathologies were not preferentially degraded: across nine rare-versus-common comparisons, no difference survived multiple-comparison correction, and nominal differences ran in the opposite direction (common > rare), so contamination is invisible to condition-specific monitoring. The dominant determinant of degradation is the type of AI rewriting task, not the clinical content. These findings bear on multimodal medical AI dataset construction and the governance of AI-assisted clinical documentation.
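Erosion-style metrics such as "hedging collapse" can be read as the fraction of items present in the original report that are missing from the rewrite. The sketch below shows one way such a rate might be computed; the hedge lexicon `HEDGE_TERMS`, the function names, and the two example reports are hypothetical, and the paper's actual lexicon and matching rules are not given in the abstract.

```python
import re

# Hypothetical, deliberately tiny lexicon of uncertainty phrases.
HEDGE_TERMS = ("possible", "possibly", "probable", "likely",
               "may", "cannot exclude", "suggests", "suspicious for")


def count_hedges(text):
    """Count whole-word occurrences of hedge terms, case-insensitively."""
    lowered = text.lower()
    return sum(
        len(re.findall(r"\b" + re.escape(term) + r"\b", lowered))
        for term in HEDGE_TERMS
    )


def erosion_rate(original_count, rewritten_count):
    """Fraction of original items lost after rewriting, floored at 0."""
    if original_count == 0:
        return 0.0
    return max(0.0, (original_count - rewritten_count) / original_count)


original = ("Possible small left effusion. Findings may represent "
            "atelectasis; cannot exclude pneumonia.")
rewritten = "Small left effusion. Atelectasis."
hedging_collapse = erosion_rate(count_hedges(original), count_hedges(rewritten))
```

The same `erosion_rate` helper applies to entity erosion if the counts come from a medical NER model instead of a lexicon.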

2604.24696 2026-06-17 cs.CV 版本更新

NeuroClaw Technical Report

NeuroClaw 技术报告

Cheng Wang, Zhibin He, Zhihao Peng, Shengyuan Liu, Yufan Hu, Carl Yang, Lifang He, Lichao Sun, Xiang Li, Yixuan Yuan

AI总结 针对神经影像学中多模态数据、长流程和可重复性挑战,提出NeuroClaw多智能体研究助手,通过数据驱动决策、环境管理和三层技能架构实现可执行可复现的神经影像分析,并在NeuroBench基准上显著优于直接调用智能体。

AI中文摘要

代理型人工智能系统有望加速科学工作流程,但神经影像学面临独特挑战:异质模态(sMRI、fMRI、dMRI、EEG)、长多阶段流水线以及持续的可重复性风险。为解决这一差距,我们提出了NeuroClaw,一个面向可执行和可复现神经影像研究的领域专用多智能体研究助手。NeuroClaw直接操作跨格式和模态的原始神经影像数据,将决策基于数据集语义和BIDS元数据,因此用户无需准备精选输入或定制模型代码。该平台结合了工具工程与端到端环境管理,包括固定Python环境、Docker支持、常见神经影像工具的自动安装程序以及GPU配置。在实践中,这一层强调检查点、执行后验证、结构化审计追踪和受控运行时设置,使工具链更加透明,同时提高可重复性和可审计性。三层技能/智能体层次结构将用户交互、高层编排和底层工具技能分离,将复杂工作流分解为安全、可重用的单元。除了NeuroClaw框架,我们还引入了NeuroBench,一个系统级基准测试,用于评估可执行性、工件有效性和可重复性准备情况。在多个多模态LLM上,与直接调用智能体相比,启用NeuroClaw的运行产生了一致且显著的分数提升。项目主页:https://cuhk-aim-group.github.io/NeuroClaw/index.html

英文摘要

Agentic artificial intelligence systems promise to accelerate scientific workflows, but neuroimaging poses unique challenges: heterogeneous modalities (sMRI, fMRI, dMRI, EEG), long multi-stage pipelines, and persistent reproducibility risks. To address this gap, we present NeuroClaw, a domain-specialized multi-agent research assistant for executable and reproducible neuroimaging research. NeuroClaw operates directly on raw neuroimaging data across formats and modalities, grounding decisions in dataset semantics and BIDS metadata so users need not prepare curated inputs or bespoke model code. The platform combines harness engineering with end-to-end environment management, including pinned Python environments, Docker support, automated installers for common neuroimaging tools, and GPU configuration. In practice, this layer emphasizes checkpointing, post-execution verification, structured audit traces, and controlled runtime setup, making toolchains more transparent while improving reproducibility and auditability. A three-tier skill/agent hierarchy separates user-facing interaction, high-level orchestration, and low-level tool skills to decompose complex workflows into safe, reusable units. Alongside the NeuroClaw framework, we introduce NeuroBench, a system-level benchmark for executability, artifact validity, and reproducibility readiness. Across multiple multimodal LLMs, NeuroClaw-enabled runs yield consistent and substantial score improvements compared with direct agent invocation. Project homepage: https://cuhk-aim-group.github.io/NeuroClaw/index.html

2606.00588 2026-06-17 cs.CV 版本更新

Response-Aware Multimodal Learning for Post-Treatment Visual Acuity Forecasting

响应感知的多模态学习用于治疗后视力预测

Phuoc-Nguyen Bui, Van-Vi Vo, Duc-Tai Le, Junghyun Bum, Van-Nguyen Pham, Ki-Young Kim, Seung-Young Yu, Hyunseung Choo

发表机构 * Research Convergence Institute(研究融合研究所) Sungkyunkwan University(成均馆大学) Dept. of AI Systems Engineering(人工智能系统工程系) Dept. of Ophthalmology(眼科系) Kyung Hee University Medical Center(庆熙大学医疗中心) Dept. of Electrical and Computer Engineering(电气与计算机工程系)

AI总结 提出ReVA框架,利用基线与第1个月OCT影像及表格数据,通过多模态融合预测糖尿病性黄斑水肿患者抗VEGF治疗后3-24个月的视力轨迹。

Comments Accepted to MICCAI 2026

AI中文摘要

抗VEGF治疗后长期视力(VA)结果对于糖尿病性黄斑水肿(DME)患者的咨询、期望设定和随访计划至关重要。然而,在临床实践中,医生通常仅根据早期治疗后发现来估计长期视力轨迹,使得可靠的预后判断变得困难。尽管先前基于OCT的学习方法主要关注短期反应或单终点预测,但利用早期纵向观测数据建模多个未来时间点的VA轨迹仍未被充分探索。在本研究中,我们收集了一个由188名接受抗VEGF治疗的DME患者组成的真实世界队列,配有配对基线和第1个月OCT扫描,以及表格化的OCT衍生生物标志物和非影像临床变量。仅使用这些早期数据,我们构建了一个多时间点VA预测问题,旨在预测3、6、12、18和24个月的视力结果,反映临床上有意义的随访间隔。我们提出了ReVA,一个响应感知的多模态框架,该框架整合了基线和第1个月OCT的结构特征与表格变量,以捕捉基线疾病状态和早期治疗反应。ReVA使用空间注意力保留局部预后成像特征,并使用依赖感知的表格编码器建模临床变量之间的交互。这些多模态表示被融合以预测患者特定的长期视力轨迹。所提出的框架在24个月VA预测中实现了MAE=0.1246,RMSE=0.1621,R^2=0.6064,并在所有预测时间点上表现一致。我们的研究结果表明,纳入早期治疗反应信号能够实现临床上有意义的长期视力预测,为常规抗VEGF管理中的数据驱动决策支持提供了依据。

英文摘要

Long-term visual acuity (VA) forecasting after anti-VEGF therapy is important for counseling and follow-up planning in diabetic macular edema (DME), yet remains challenging when only early post-treatment findings are available. While prior OCT-based methods mainly focus on short-term response or single-endpoint prediction, multi-horizon VA forecasting from early longitudinal data remains under-explored. In this study, we assembled a real-world cohort of 188 anti-VEGF-treated DME patients with paired baseline and month-1 OCT scans, along with tabular OCT-derived biomarkers and non-imaging clinical variables. Using only these early data, we formulate a multi-horizon VA forecasting problem aimed at predicting visual outcomes at 3, 6, 12, 18, and 24 months, reflecting clinically meaningful follow-up intervals. We propose ReVA, a response-aware multimodal framework that combines baseline and month-1 OCT features with tabular variables to capture disease status and early treatment response. ReVA integrates spatial OCT attention, dependency-aware tabular encoding, and cross-modal fusion to predict patient-specific long-term VA trajectories. The proposed framework achieves MAE=0.1246, RMSE=0.1621, and R^2=0.6064 for 24-month VA prediction, with consistent performance across all forecast horizons. Our findings show that incorporating early treatment-response signals enables clinically meaningful long-term visual acuity forecasting, supporting data-driven decision support for routine anti-VEGF management. Code and pretrained models will be released on https://github.com/nguyenpbui/ReVA.
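For readers less familiar with the three reported metrics, the sketch below computes MAE, RMSE, and R^2 from paired true and predicted values using their textbook definitions. The function name and the toy numbers are illustrative only; the abstract does not state the VA unit, so none is assumed here.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (MAE, RMSE, R^2) for two equal-length sequences of floats."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mae = sum(abs(e) for e in errors) / n
    ss_res = sum(e * e for e in errors)          # residual sum of squares
    rmse = math.sqrt(ss_res / n)

    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all true values are identical")
    r2 = 1.0 - ss_res / ss_tot
    return mae, rmse, r2


# Predicting the mean for every patient gives R^2 = 0 by construction.
mae, rmse, r2 = regression_metrics([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])
```

An R^2 of 0.6064 therefore means the model explains roughly 61% of the variance that a constant mean prediction leaves unexplained.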

2606.15617 2026-06-17 cs.CV 版本更新

NeRD: Neuro-Symbolic Rule Distillation for Efficient Ontology-Grounded Chain-of-Thought in Medical Image Diagnosis

NeRD:面向医学图像诊断的高效本体接地思维链的神经符号规则蒸馏

Hongxi Yang, Yiwen Jiang, Siyuan Yan, Jamie Chow, Eunis Li, Charlotte Poon, Stephanie Fong, Xiangyu Zhao, Deval Mehta, Yasmeen George, Zongyuan Ge

发表机构 * Department of Data Science & AI, Faculty of Information Technology, Monash University(莫纳什大学信息技术学院数据科学与人工智能系) AIM for Health Lab, Faculty of Information Technology, Monash University(莫纳什大学信息技术学院AIM健康实验室) Faculty of Engineering, Monash University(莫纳什大学工程学院) Faculty of Medicine, The Chinese University of Hong Kong(香港中文大学医学院) School of Computing Technologies, RMIT University(皇家墨尔本理工大学计算技术学院)

AI总结 提出NeRD框架,通过神经符号规则蒸馏生成高效、本体接地且非冗余的推理链,避免人工规则,在皮肤数据集上实现强诊断性能和可解释性,并首次实现专家介入的多模态思维链诊断。

Comments Accepted at MICCAI 2026

AI中文摘要

可解释性对于可信的医学图像诊断至关重要。然而,现有的概念驱动可解释方法存在关键局限性:概念瓶颈模型(CBM)需要在推理时对所有预定义概念进行评分并用于人工干预,给临床医生带来沉重负担;而基于理由的生成方法通常通过类别可区分性选择概念,这可能偏离诊断本体。为了解决这些问题,我们提出了神经符号规则蒸馏(NeRD),这是一个生成高效、本体接地且充分而非冗余的推理链的框架,无需手动构建诊断规则。在两个皮肤数据集上的实验证明了其强大的诊断性能和可解释性,盲法专家评估确认了NeRD理由的临床合理性。我们的方法进一步实现了首次专家介入的多模态思维链诊断研究,实现了高效且有效的概念级干预。

英文摘要

Interpretability is essential for trustworthy medical image diagnosis. However, existing concept-driven interpretable methods have key limitations: Concept Bottleneck Models (CBMs) require scoring all predefined concepts at inference time and for manual intervention, imposing a substantial burden on clinicians, while rationale-based generative approaches often select concepts by class discriminability, which can drift from diagnostic ontologies. To address these issues, we propose Neuro-Symbolic Rule Distillation (NeRD), a framework that produces efficient, ontology-grounded reasoning chains that are sufficient yet non-redundant, without manually crafting diagnostic rules. Experiments on two skin datasets demonstrate strong diagnostic performance and interpretability, and blinded expert evaluation confirms the clinical plausibility of NeRD rationales. Our method further enables a first expert-in-the-loop study for Multimodal Chain-of-Thought-based diagnosis, achieving efficient and effective concept-level intervention.

2606.16203 2026-06-17 cs.CV 版本更新

DynFS-MoE: Dynamic Functional-Structural Mixture-of-Experts for Post-Traumatic Epilepsy Diagnosis

DynFS-MoE: 用于创伤后癫痫诊断的动态功能-结构混合专家模型

Jun-En Ding, Spencer Chen, Henry Noren, Daniel Valdivia, Christine Yohn, Suhina Patel, Taylor Zink, Hai Sun, Feng Liu

发表机构 * Department of Systems Engineering, Stevens Institute of Technology(史蒂文斯理工学院系统工程系) Department of Neurosurgery, Robert Wood Johnson Medical School, Rutgers University(罗格斯大学罗伯特·伍德·约翰逊医学院神经外科)

AI总结 提出动态多模态混合专家框架,通过时间感知功能-结构编码和类别条件专家路由,融合功能与结构MRI,在三个二分类任务中优于静态融合基线,并揭示有意义的ROI交互。

AI中文摘要

创伤后癫痫(PTE)是创伤性脑损伤(TBI)的严重并发症,但由于其在大脑中诱导的复杂结构和功能改变,早期识别仍然具有挑战性。为了解决这个问题,我们提出了一个动态多模态混合专家(MoE)框架,通过时间感知功能-结构编码和类别条件专家路由,整合功能性和结构性MRI。在该框架内,模态特定和跨模态专家学习互补表示,而模态-类别MoE(MCoE)模块根据每个分类目标动态分配专家权重。跨三个二分类任务的实验结果表明,该框架始终优于静态融合基线,高可解释性分析进一步揭示了有意义的感兴趣区域(ROI)交互。这种动态多模态专家框架有效捕获了类别依赖的脑交互模式,并为PTE诊断和风险分层提供了一种可解释的方法。

英文摘要

Post-traumatic epilepsy (PTE) is a severe complication of traumatic brain injury (TBI), yet early identification remains challenging due to the complex structural and functional alterations it induces in the brain. To address this, we propose a dynamic multimodal Mixture-of-Experts (MoE) framework that integrates functional and structural MRI through time-aware functional-structural encoding and class-conditioned expert routing. Within this framework, modality-specific and cross-modal experts learn complementary representations, while a Modality-Class MoE (MCoE) module dynamically dispatches expert weights according to each classification objective. Experimental results across three binary classification tasks demonstrate that the framework consistently outperforms static fusion baselines, and high-interpretability analyses further reveal meaningful region-of-interest (ROI) interactions. This dynamic multimodal expert framework effectively captures class-dependent brain interaction patterns and provides an interpretable approach for PTE diagnosis and risk stratification.

2507.04704 2026-06-17 q-bio.QM cs.AI cs.CV 版本更新

SPATIA: Multimodal Generation and Prediction of Spatial Cell Phenotypes

SPATIA: 空间细胞表型的多模态生成与预测

Zhenglun Kong, Mufan Qiu, John Boesen, Xiang Lin, Sukwon Yun, Tianlong Chen, Manolis Kellis, Marinka Zitnik

AI总结 提出SPATIA模型,融合细胞形态、基因表达和空间上下文,通过置信感知流匹配和形态-谱对齐实现多尺度生成与预测,在12项任务中优于18个基线模型。

Comments ICML 2026

AI中文摘要

理解细胞形态、基因表达和空间上下文如何共同塑造组织功能是生物学中的一个核心挑战。基于图像的空间转录组学技术现在能够提供细胞图像和基因表达谱的高分辨率测量,但现有方法通常孤立地分析这些模态或以有限的分辨率进行分析。我们通过引入SPATIA来解决这个问题,这是一个多层次的生成和预测模型,通过融合从细胞到组织水平的形态、基因表达和空间上下文,学习统一的、空间感知的表征。SPATIA还结合了一个空间条件生成框架,该框架具有置信感知的OT重加权和形态-谱对齐,用于建模目标状态形态分布。具体来说,我们提出了一个置信感知的流匹配目标,该目标基于不确定性对弱最优传输对进行重加权。我们进一步应用形态-谱对齐来鼓励有生物学意义的图像生成,从而能够建模微环境依赖的表型转变。我们组装了一个多尺度数据集,包含17个组织中的2590万个细胞-基因对。我们在12项任务上对SPATIA与18个模型进行了基准测试,涵盖表型生成、注释、聚类、基因插补和跨模态预测等类别。SPATIA相比最先进模型取得了改进,生成保真度提高了8%,预测准确率最高提高了3%。

英文摘要

Understanding how cellular morphology, gene expression, and spatial context jointly shape tissue function is a central challenge in biology. Image-based spatial transcriptomics technologies now provide high-resolution measurements of cell images and gene expression profiles, but existing methods typically analyze these modalities in isolation or at limited resolution. We address the problem by introducing SPATIA, a multi-level generative and predictive model that learns unified, spatially aware representations by fusing morphology, gene expression, and spatial context from the cell to the tissue level. SPATIA also incorporates a spatially conditioned generative framework with confidence-aware OT reweighting and morphology-profile alignment for modeling target-state morphology distributions. Specifically, we propose a confidence-aware flow matching objective that reweights weak optimal-transport pairs based on uncertainty. We further apply morphology-profile alignment to encourage biologically meaningful image generation, enabling the modeling of microenvironment-dependent phenotypic transitions. We assembled a multi-scale dataset consisting of 25.9 million cell-gene pairs across 17 tissues. We benchmark SPATIA against 18 models across 12 tasks, spanning categories such as phenotype generation, annotation, clustering, gene imputation, and cross-modal prediction. SPATIA achieves improved performance over state-of-the-art models, improving generative fidelity by 8% and predictive accuracy by up to 3%.
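To make the idea of "reweighting weak optimal-transport pairs" concrete, the sketch below uses the common straight-line form of flow matching, where a point on the path is x_t = (1 - t) x0 + t x1 and the regression target is the constant velocity x1 - x0, and scales each pair's squared error by a confidence weight. This is an assumed formulation for illustration: the abstract does not give SPATIA's exact objective, and `interpolate`, `weighted_flow_matching_loss`, the toy pairs, and the confidence values are hypothetical.

```python
def interpolate(x0, x1, t):
    """Point on the straight path from x0 to x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def weighted_flow_matching_loss(pairs, confidences, velocity_fn, t):
    """Confidence-weighted mean squared error against the velocity x1 - x0.

    pairs: list of (x0, x1) source/target vectors, e.g. matched by OT.
    confidences: one non-negative weight per pair; low values mute weak pairs.
    velocity_fn: callable (x_t, t) -> predicted velocity vector.
    """
    total, weight_sum = 0.0, 0.0
    for (x0, x1), w in zip(pairs, confidences):
        x_t = interpolate(x0, x1, t)
        target = [b - a for a, b in zip(x0, x1)]
        pred = velocity_fn(x_t, t)
        total += w * sum((p - u) ** 2 for p, u in zip(pred, target))
        weight_sum += w
    return total / weight_sum


# A fixed toy "model" that always predicts velocity (1, 1).
constant_model = lambda x_t, t: [1.0, 1.0]
pairs = [([0.0, 0.0], [1.0, 1.0]),   # well-matched pair, zero error
         ([0.0, 0.0], [4.0, 0.0])]   # poorly matched pair, error 10
uniform = weighted_flow_matching_loss(pairs, [1.0, 1.0], constant_model, t=0.5)
reweighted = weighted_flow_matching_loss(pairs, [1.0, 0.25], constant_model, t=0.5)
```

Lowering the second pair's confidence from 1.0 to 0.25 reduces its pull on the loss, which is the intended effect of down-weighting uncertain couplings.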

2606.13919 2026-06-17 eess.IV cs.AI cs.CV 版本更新

GMN4AD: Graph Matching Network for Alzheimer's Disease Diagnosis with Test-Time Domain Adaptation using Multi-centered Structure Magnetic Resonance Imaging

GMN4AD:基于图匹配网络的阿尔茨海默病诊断与测试时域适应方法在多中心结构磁共振成像中的应用

Chen Zhao, Huan Huang, Yixin Xie, Jiajing Huang, Weihua Zhou

发表机构 * Department of Computer Science, Kennesaw State University(肯尼索州立大学计算机科学系) Department of Information Technology, Kennesaw State University(肯尼索州立大学信息技术系) School of Data Science and Analytics, Kennesaw State University(肯尼索州立大学数据科学与分析学院) Department of Applied Computing, Michigan Technological University(密歇根理工大学应用计算系)

AI总结 提出GMN4AD,利用图匹配网络建模异质脑图间关系,结合测试时域适应策略,在三个公共数据集上优于现有方法,实现鲁棒的AD诊断。

AI中文摘要

阿尔茨海默病(AD)是一种进行性神经退行性疾病,影响数百万老年人,预计未来几年患病率将显著上升。早期诊断,特别是在轻度认知障碍(MCI)阶段,对于及时干预至关重要。结构磁共振成像(sMRI)已成为检测AD相关脑变化的关键模态,但传统的基于图的方法通常难以处理模态和站点间异质性,限制了诊断性能。在本文中,我们提出了用于阿尔茨海默病诊断的图匹配网络(GMN4AD),旨在建模来自神经影像数据的异质脑图之间的交互。与将每个脑图独立处理的传统方法不同,GMN4AD利用图匹配来捕获跨图关系,提高诊断精度。此外,我们引入了一种测试时域适应策略,结合对比学习来减轻推理过程中的域偏移。在三个公共AD数据集上的大量实验表明,GMN4AD相比最先进方法实现了优越的性能,为AD诊断提供了鲁棒且可泛化的解决方案。

英文摘要

Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that affects millions of older adults, with prevalence expected to rise significantly in the coming years. Early diagnosis, particularly during the mild cognitive impairment (MCI) stage, is critical for timely intervention. Structural Magnetic Resonance Imaging (sMRI) has emerged as a key modality for detecting AD-related brain changes, but traditional graph-based approaches often struggle with modality and inter-site heterogeneity, limiting diagnostic performance. In this paper, we propose Graph Matching Network for Alzheimer's Disease Diagnosis (GMN4AD), designed to model interactions between heterogeneous brain graphs derived from neuroimaging data. Unlike conventional methods that treat each brain graph independently, GMN4AD leverages graph matching to capture cross-graph relationships, enhancing diagnostic precision. Furthermore, we introduce a test-time domain adaptation strategy that combines contrastive learning to mitigate domain shifts during inference. Extensive experiments on three public AD datasets demonstrate that GMN4AD achieves superior performance compared to state-of-the-art methods, offering a robust and generalizable solution for AD diagnosis.

2601.05212 2026-06-17 cs.CV 版本更新

FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching

FlowLet: 基于小波流匹配的条件性3D脑MRI合成

Danilo Danese, Angela Lombardi, Matteo Attimonelli, Giuseppe Fasano, Tommaso Di Noia

发表机构 * Politecnico di Bari(巴里理工学院) Sapienza University of Rome(罗马萨皮恩扎大学)

AI总结 提出FlowLet框架,利用可逆3D小波域中的流匹配生成年龄条件化的3D脑MRI,避免重建伪影并降低计算需求,实验证明其生成高保真体积且提升脑年龄预测模型对低代表性年龄组的性能。

Comments Accepted at Medical Image Analysis (Elsevier)

详情
AI中文摘要

脑磁共振成像(MRI)在研究神经发育、衰老和疾病中起着核心作用。一个关键应用是脑年龄预测(BAP),它从MRI数据中估计个体的生物脑年龄。有效的BAP模型需要大规模、多样化和年龄平衡的数据集,而现有的3D MRI数据集在人口统计学上存在偏差,限制了公平性和泛化能力。获取新数据成本高昂且受到伦理约束,这促使了生成性数据增强。当前的生成方法通常基于潜在扩散模型,这些模型在学习的低维潜在空间中操作,以应对体积MRI数据的内存需求。然而,这些方法在推理时通常较慢,可能因潜在压缩而引入伪影,并且很少以年龄为条件,从而影响BAP性能。在这项工作中,我们提出了FlowLet,一个条件生成框架,通过在可逆3D小波域中利用流匹配来合成年龄条件化的3D MRI,有助于避免重建伪影并降低计算需求。实验表明,FlowLet以少量采样步骤生成高保真体积。使用FlowLet生成的数据训练BAP模型可改善低代表性年龄组的性能,基于区域的分析确认了解剖结构的保留。

英文摘要

Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.
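The key property the abstract relies on is that the wavelet transform is invertible, so a generative model working on wavelet coefficients needs no learned decoder that could introduce reconstruction artifacts. The following is a minimal sketch of that property only, assuming a one-level orthonormal Haar wavelet and NumPy; the wavelet family FlowLet actually uses is not stated in the abstract, and all function names here are hypothetical.

```python
import numpy as np

def haar_1d(x, axis):
    """One-level orthonormal Haar split along `axis` (even length assumed)."""
    x = np.moveaxis(x, axis, 0)
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)    # local averages
    high = (even - odd) / np.sqrt(2.0)   # local differences
    return np.moveaxis(np.concatenate([low, high], axis=0), 0, axis)

def ihaar_1d(y, axis):
    """Exact inverse of haar_1d along `axis`."""
    y = np.moveaxis(y, axis, 0)
    half = y.shape[0] // 2
    low, high = y[:half], y[half:]
    x = np.empty_like(y)
    x[0::2] = (low + high) / np.sqrt(2.0)
    x[1::2] = (low - high) / np.sqrt(2.0)
    return np.moveaxis(x, 0, axis)

def haar_3d(volume):
    """Apply the 1D split along each of the three spatial axes."""
    for axis in range(3):
        volume = haar_1d(volume, axis)
    return volume

def ihaar_3d(coeffs):
    """Undo haar_3d by inverting the axes in reverse order."""
    for axis in reversed(range(3)):
        coeffs = ihaar_1d(coeffs, axis)
    return coeffs

# Toy stand-in for an MRI volume (random data, not real imaging data).
rng = np.random.default_rng(0)
volume = rng.normal(size=(8, 8, 8))
coeffs = haar_3d(volume)
reconstructed = ihaar_3d(coeffs)
max_error = float(np.abs(volume - reconstructed).max())
```

Because the transform is orthonormal, the coefficients carry the same energy as the volume, and the round trip is exact up to floating-point error.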

9. 文档图像、OCR与图表理解 3 篇

2606.17355 2026-06-17 cs.CV 新提交

Complex Layout Classification in the Wild: A Low-Resource Approach with Layout-Preserving Augmentations

野外复杂版面分类:一种低资源方法及版面保持增强

Sharva Gogawale, Iddo Hakim, Gal Grudka, Mohammad Suliman, Omer Ventura, Daria Vasyutinsky-Shapira, Berat Kurar-Barakat, Nachum Dershowitz

发表机构 * School of Computer Science and AI, Tel Aviv University(特拉维夫大学计算机科学与人工智能学院)

AI总结 针对低资源复杂版面分类问题,提出基于CNN的分类器,采用窄各向异性高斯掩码和反射诱导标签变换等版面保持增强方法,在标注稀缺下显著提升分类性能。

AI中文摘要

许多数字化语料库面临低资源问题,因为标注可能稀缺、页面扫描噪声大且分辨率低,或者版面结构复杂,对自动转录质量产生负面影响。低资源语言的鲁棒分类模型开发受到缺乏大规模标注数据和页面版面频繁语义复杂性的制约。为此,我们整理了一个复杂版面数据集,根据分隔区域手动分为八种版面类型。为克服数据稀缺,我们提出了一种基于CNN的分类器的新型训练策略,采用强领域感知增强来改善泛化。我们利用窄各向异性高斯掩码抑制偶然文本细节,同时保留基本分隔,迫使模型学习全局几何排列。此外,我们实施反射诱导标签变换以丰富训练分布,同时保持不对称类别间的标签一致性。结果表明,版面特定增强可以在严重标注稀缺下显著改善页面级版面分类。

英文摘要

Many digitized corpora suffer from low resources because annotations may be scarce, page scans are noisy and of poor resolution, or layouts are structurally complex in ways that negatively affect the quality of automatic transcription. Developing robust classification models for low-resource languages is inhibited by the lack of large-scale annotated data and by the frequent semantic complexity of page layouts. To this end, we have curated a complex-layout dataset, manually classified into eight distinct layout types based on their separator regions. To overcome data scarcity, we propose a novel training strategy in the form of a CNN-based classifier that employs strong, domain-aware augmentations to improve generalization. We utilize narrow anisotropic Gaussian masking to suppress incidental textual details while preserving essential separations, compelling the model to learn global geometric arrangements. Additionally, we implement reflection-induced label transformations to enrich the training distribution while maintaining label consistency across asymmetric categories. The results demonstrate that layout-specific augmentations can substantially improve page-level layout classification under severe annotation scarcity.
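The idea of a reflection-induced label transformation can be illustrated with a small sketch: mirroring a page changes its layout class when the class is asymmetric, so the label has to be remapped together with the image. The class names below are hypothetical placeholders, since the abstract does not list the eight layout types.

```python
# Hypothetical layout classes; the paper's actual eight types are not given in the abstract.
HFLIP_LABEL_MAP = {
    "left_margin": "right_margin",    # asymmetric classes swap under mirroring
    "right_margin": "left_margin",
    "single_column": "single_column", # symmetric classes map to themselves
    "two_column": "two_column",
}

def hflip(page):
    """Mirror a page, given as a list of pixel rows, left to right."""
    return [list(reversed(row)) for row in page]

def reflect_sample(page, label):
    """Return the mirrored page together with its remapped label."""
    return hflip(page), HFLIP_LABEL_MAP[label]

# Toy page: 1 marks a separator running down the left edge.
page = [[1, 0, 0, 0],
        [1, 0, 0, 0]]
flipped_page, flipped_label = reflect_sample(page, "left_margin")
```

Applying the same reflection twice returns the original page and label, which is a quick sanity check that the label map is consistent.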

2606.17874 2026-06-17 cs.CV cs.LG 新提交

Revisiting Structural Dependency in Autoregressive Multi-Task Table Recognition via Order-Independent Cell-Level Representations

重新审视自回归多任务表格识别中的结构依赖性:基于顺序无关的单元格级表示

Takaya Kawakatsu

发表机构 * Preferred Networks, Inc.(Preferred Networks公司)

AI总结 针对自回归多任务表格识别中单元格表示顺序依赖导致全局一致性下降的问题,提出通过非因果注意力生成顺序无关的单元格特征,实现并行推理,在两大数据集上提升定位与识别性能,推理时间减少约3倍。

Comments ICDAR 2026

AI中文摘要

多任务表格识别在统一框架中联合处理表格结构预测、单元格定位和单元格内容识别。现有方法通常依赖自回归解码器生成表格结构,并重用其隐藏状态进行单元格定位和内容识别。这种自回归生成过程可能使单元格表示产生顺序依赖,降低跨单元格的全局一致性。本文提出一个结构细化模块,通过非因果注意力产生顺序无关的单元格特征。该设计使得单元格内容能够并行推理,同时每个单元格以细化特征中编码的全局上下文为条件。在两个大型数据集上的实验表明,该方法在单元格定位和端到端识别上持续提升,同时将整体推理时间减少约三倍。

英文摘要

Multi-task table recognition jointly addresses table structure prediction, cell localization, and cell content recognition within a unified framework. Existing approaches often rely on autoregressive decoders to generate table structures and reuse their hidden states for cell localization and content recognition. This autoregressive generation process can make cell representations order-dependent, degrading global consistency across cells. This paper proposes a structural refinement module that produces order-independent cell features through non-causal attention. This design enables parallel inference of cell contents while conditioning each cell on global context encoded in the refined features. Experiments on two large datasets demonstrate consistent gains in cell localization and end-to-end recognition, while reducing overall inference time by around threefold.
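The contrast between causal and non-causal attention can be checked numerically. The sketch below is not the paper's refinement module; it is a single-head self-attention with identity projections and no positional encoding (both assumptions made for brevity), showing that without a causal mask each cell's feature is unaffected by the order in which cells are listed.

```python
import numpy as np

def self_attention(x, causal):
    """Single-head self-attention with identity projections (illustrative only)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    if causal:
        # Hide "future" positions, as an autoregressive decoder would.
        mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

rng = np.random.default_rng(0)
cells = rng.normal(size=(5, 4))      # 5 toy cell embeddings of width 4
perm = np.array([3, 0, 4, 1, 2])     # an arbitrary reordering of the cells

# Non-causal: reordering the cells only reorders the outputs.
out = self_attention(cells, causal=False)
out_perm = self_attention(cells[perm], causal=False)
noncausal_order_independent = np.allclose(out[perm], out_perm)

# Causal: each cell sees only its predecessors, so the order changes the features.
out_c = self_attention(cells, causal=True)
out_c_perm = self_attention(cells[perm], causal=True)
causal_order_independent = np.allclose(out_c[perm], out_c_perm)
```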

2603.02803 2026-06-17 cs.CV 版本更新

Structure-Aware Text Recognition for Ancient Greek Critical Editions

面向古希腊校勘本的结构感知文本识别

Nicolas Angleraud, Antonia Karamolegkou, Benoît Sagot, Thibault Clérice

发表机构 * Inria(法国国家信息与自动化技术研究所)

AI总结 本文通过构建大规模合成语料库和真实扫描基准,评估了视觉语言模型在结构感知文本识别上的性能,发现Qwen3VL-8B模型在真实扫描上达到1.0%的中位字符错误率。

AI中文摘要

视觉语言模型(VLM)的最新进展已经改变了端到端的文档理解。然而,它们解释历史学术文本复杂布局语义的能力仍然有限。本文研究了面向古希腊校勘本的结构感知文本识别,这些校勘本具有密集的参考层次和广泛的边缘注释。我们引入了两个新资源:(i)从TEI/XML源生成的185,000页图像的大规模合成语料库,具有受控的排版和布局变化,以及(ii)跨越一个多世纪编辑和排版实践的真实扫描校勘本的精选基准。使用这些数据集,我们在零样本和微调设置下评估了三种最先进的VLM。我们的实验揭示了当前VLM架构在面对高度结构化的历史文档时的显著局限性。在零样本设置中,大多数模型的性能明显低于现有的现成软件。尽管如此,Qwen3VL-8B模型达到了最先进的性能,在真实扫描上实现了1.0%的中位字符错误率。这些结果既突显了当前VLM在结构感知识别复杂学术文档方面的不足,也展示了其未来潜力。

英文摘要

Recent advances in visual language models (VLMs) have transformed end-to-end document understanding. However, their ability to interpret the complex layout semantics of historical scholarly texts remains limited. This paper investigates structure-aware text recognition for Ancient Greek critical editions, which have dense reference hierarchies and extensive marginal annotations. We introduce two novel resources: (i) a large-scale synthetic corpus of 185,000 page images generated from TEI/XML sources with controlled typographic and layout variation, and (ii) a curated benchmark of real scanned editions spanning more than a century of editorial and typographic practices. Using these datasets, we evaluate three state-of-the-art VLMs under both zero-shot and fine-tuning regimes. Our experiments reveal substantial limitations in current VLM architectures when confronted with highly structured historical documents. In zero-shot settings, most models significantly underperform compared to established off-the-shelf software. Nevertheless, the Qwen3VL-8B model achieves state-of-the-art performance, reaching a median Character Error Rate of 1.0% on real scans. These results highlight both the current shortcomings and the future potential of VLMs for structure-aware recognition of complex scholarly documents.
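For readers unfamiliar with the metric, the median Character Error Rate can be sketched as follows: the CER of a page is the character-level edit distance between the recognized text and the reference, divided by the reference length, and the median is then taken over pages. The strings below are made-up toy data, and the paper's exact text normalization is not described in the abstract.

```python
from statistics import median

def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]

def cer(ref, hyp):
    """Character Error Rate of one page; the reference must be non-empty."""
    return edit_distance(ref, hyp) / len(ref)

# Toy (reference, recognized) pairs standing in for pages.
pages = [("abcd", "abcd"), ("abcd", "abxd"), ("abcd", "axxd")]
page_cers = [cer(ref, hyp) for ref, hyp in pages]
median_cer = median(page_cers)
```

Reporting the median rather than the mean makes the figure less sensitive to a few badly recognized pages.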

10. 低层视觉、计算成像与图像增强 8 篇

2606.17403 2026-06-17 cs.CV cs.AI 新提交

Bridging Spatial And Frequency Views For Disaster Assessment: Benefits And Limitations

桥接空间与频率视角进行灾害评估:优势与局限

Shikha V. Chandel, Yadav Raj Ghimire, Timothy Agboada, Leila Hashemi-Beni

发表机构 * College of Science and Technology(科学与技术学院) Computational Data Science and Engineering(计算数据科学与工程)

AI总结 本研究对比了空间域、频率域及双域深度学习方法在建筑损伤分类中的表现,发现双域模型优于单域模型,但所有模型对轻微损伤检测仍存在困难。

Comments Copyright 2026 IEEE. Published in the 2026 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2026)

AI中文摘要

从卫星图像快速评估建筑损伤对于有效的灾害响应和恢复至关重要。虽然大多数深度学习方法依赖于空间域特征,但频率域表示可以捕捉互补的结构线索,如碎片模式和坍塌引起的纹理。本研究使用来自xView2(xBD)数据集灾后图像,对空间域、频率域和双域深度学习方法进行了受控比较,用于多类建筑损伤分类。为确保公平,所有模型均基于EfficientNet-B0骨干网络,并在相同设置下训练,仅输入表示和融合策略不同。使用准确率、宏F1分数、每类指标和混淆矩阵评估性能。结果表明,双域模型比单域方法提供了可衡量的改进。双空间配置实现了最高的测试准确率(0.4688)和最低的损失,而仅空间模型获得了最佳的宏F1分数(0.4254),表明类别性能更平衡。相比之下,仅频率模型表现最差并出现过拟合,表明泛化能力有限。尽管有这些改进,所有模型仍难以检测细微损伤级别,特别是Minor类别,这是由于类别不平衡和细粒度视觉模糊性。虽然双域方法改进了严重损伤的检测,但挑战依然存在。这些发现突出了混合表示的优势和局限,并推动了未来在数据平衡、高级融合和正则化方面的工作。

英文摘要

Rapid assessment of building damage from satellite imagery is essential for effective disaster response and recovery. While most deep learning methods rely on spatial-domain features, frequency-domain representations can capture complementary structural cues such as debris patterns and collapse-induced textures. This study presents a controlled comparison of spatial-domain, frequency-domain, and dual-domain deep learning approaches for multi-class building damage classification using post-disaster imagery from the xView2 (xBD) dataset. To ensure fairness, all models are built on an EfficientNet-B0 backbone and trained under identical settings, differing only in their input representations and fusion strategies. Performance is evaluated using accuracy, macro F1-score, per-class metrics, and confusion matrices. Results show that dual-domain models provide measurable improvements over single-domain approaches. The dual spatial configuration achieves the highest test accuracy (0.4688) and lowest loss, while the spatial-only model attains the best macro F1-score (0.4254), indicating more balanced class performance. In contrast, frequency-only models perform worst and exhibit overfitting, suggesting limited generalization. Despite these gains, all models struggle to detect subtle damage levels, particularly the Minor class, due to class imbalance and fine-grained visual ambiguity. While dual-domain approaches improve detection of severe damage, challenges remain. These findings highlight the benefits and limitations of hybrid representations and motivate future work on data balancing, advanced fusion, and regularization.
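The abstract reports that the best accuracy and the best macro F1-score belong to different models. A small sketch with made-up labels shows why the two metrics can disagree under class imbalance: a classifier that ignores the rare class keeps a high accuracy but a low macro F1.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy imbalanced data: 8 undamaged buildings, 2 with minor damage.
y_true = ["none"] * 8 + ["minor"] * 2
always_none = ["none"] * 10          # a classifier that never predicts "minor"

acc = accuracy(y_true, always_none)
f1 = macro_f1(y_true, always_none, ["none", "minor"])
```

Here the accuracy is 0.8, while the macro F1 is pulled down by the zero F1 on the "minor" class.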

2606.17557 2026-06-17 cs.CV 新提交

Universal Image Restoration via Internalized Chain-of-Thought Reasoning

通过内化思维链推理的通用图像恢复

Yu Guo, Zhengru Fang, Shengfeng He, Senkang Hu, Yihang Tao, Phone Lin, Yuguang Fang

发表机构 * Hong Kong JC Lab of Smart City and Department of Computer Science, City University of Hong Kong(香港城市大学智慧城市香港联合实验室及计算机科学系) School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算与信息系统学院) Computer Science and Information Engineering, National Taiwan University(国立台湾大学计算机科学与信息工程学系)

AI总结 提出CoTIR框架,将思维链推理内化到单个预训练编辑模型中,通过可微拉格朗日优化实现混合退化下的通用图像恢复,在5.2M样本基准上优于现有方法。

AI中文摘要

图像恢复旨在从退化输入中恢复高质量图像,但在复杂混合退化下高度病态。虽然统一的全能模型很常见,但其性能随退化复杂性增加而下降。近期工作采用思维链推理,通过专用模块进行多轮恢复。然而,这种方法面临两个关键限制:(i) 多步处理导致计算成本增加,(ii) 逐步推理过程中退化间交互建模薄弱。我们提出CoTIR,一种将思维链推理内化到单个模型中的通用图像恢复框架。具体而言,我们将图像恢复视为图像编辑的一个专门子任务,这意味着大规模预训练编辑模型提供了更有利的优化起点。在此基础上,我们对模型进行恢复微调,并通过受拉格朗日优化启发的可微公式将结构化思维链式推理编码到学习目标中,从而实现无需链接专用恢复器的整体恢复。为促进训练和评估,我们进一步提出CoTIR-Bench,一个包含520万样本及思维链式推理轨迹的大规模基准。在CoTIR-Bench和广泛真实复合退化场景上的大量实验表明,CoTIR在感知质量和保真度上均优于全能模型和多轮恢复方法。源代码见 https://github.com/gy65896/CoTIR。

英文摘要

Image restoration seeks to recover high-quality images from degraded inputs but becomes highly ill-posed under complex, mixed degradations. While unified all-in-one models are common, their performance declines as degradation complexity increases. Recent works adopt Chain-of-Thought (CoT) reasoning for multi-round restoration using specialized modules. However, this approach faces two key limitations: (i) increased computational cost due to multi-step processing, and (ii) weak modeling of interactions between degradations during stepwise inference. We introduce CoTIR, a universal image restoration framework that internalizes CoT reasoning within a single model. Concretely, we view image restoration as a specialized subtask of image editing, which implies that a large-scale pre-trained editing model provides a more favorable optimization starting point. Building on this, we fine-tune the model for restoration and further encode structured CoT-style reasoning into the learning objective via a differentiable formulation inspired by Lagrangian optimization, enabling holistic restoration without chaining specialized restorers. To facilitate training and evaluation, we further present CoTIR-Bench, a large-scale benchmark comprising 5.2 million samples with CoT-style reasoning traces. Extensive experiments on CoTIR-Bench and broad real composite degradation scenes show that CoTIR achieves stronger perceptual quality and more competitive fidelity than both all-in-one models and multi-round restoration methods. The source code is available at https://github.com/gy65896/CoTIR.

2606.17722 2026-06-17 cs.CV 新提交

GSPan: A Continuous Gaussian Primitive Representation for Arbitrary-Scale Pansharpening

GSPan:一种用于任意尺度全色锐化的连续高斯基元表示

Fangyi Li, Xiaoyuan Yang, Yixiao Li, Zongyang Sui, Kangqing Shen, Gemine Vivone

发表机构 * Beihang University(北京航空航天大学) Tsinghua University(清华大学) National Research Council - Institute of Methodologies for Environmental Analysis, CNR-IMAA(意大利国家研究委员会 - 环境分析方法研究所)

AI总结 提出GSPan框架,将2D高斯溅射引入全色锐化,通过连续可学习的2D高斯基元表示残差细节,实现任意尺度融合,无需重新训练。

AI中文摘要

全色锐化旨在通过融合低分辨率多光谱(LRMS)和全色(PAN)观测生成高分辨率多光谱(HRMS)图像。现有深度学习方法大多将全色锐化视为固定网格预测,限制了尺度适应性。为此,我们提出GSPan框架,将2D高斯溅射(GS)引入全色锐化。GSPan不直接预测像素,而是将逐波段残差细节表示为连续且可学习的2D高斯基元。我们设计了具有空间-光谱交互注意力(SSIA)模块的双流层次交互(DSHI)架构,从互补的PAN和MS观测中估计这些基元。预测的基元被渲染为残差细节场,并注入到上采样的MS图像中。这种连续表示使得GSPan能够在任意目标采样网格上渲染融合图像,无需针对特定尺度重新训练。它进一步支持尺度解耦非对称推理(SDAI)策略,该策略在降低的分辨率下估计基元,并在目标分辨率下渲染融合图像,用于高效的大场景全色锐化。在QuickBird、GaoFen-2、WorldView-3和WorldView-3-4K数据集上的实验表明,GSPan实现了最先进的融合性能。此外,SDAI显著加速了推理,在计算效率和融合质量之间取得了良好的平衡。我们的结果证明了连续高斯残差表示作为固定网格预测的灵活且尺度解耦替代方案的潜力。

英文摘要

Pansharpening aims to generate high-resolution multispectral (HRMS) images by fusing low-resolution multispectral (LRMS) and panchromatic (PAN) observations. Most existing deep learning methods treat pansharpening as fixed-grid prediction, which limits scale adaptation. To address this, we propose GSPan, a framework that introduces 2D Gaussian Splatting (GS) into pansharpening. Instead of directly predicting pixels, GSPan represents band-wise residual details as continuous and learnable 2D Gaussian primitives. We design a Dual-Stream Hierarchical Interaction (DSHI) architecture with a Spatial-Spectral Interactive Attention (SSIA) module to estimate these primitives from complementary PAN and MS observations. The predicted primitives are rendered as a residual detail field and injected into the upsampled MS image. This continuous representation allows GSPan to render fused images on arbitrary target sampling grids without scale-specific retraining. It further enables a Scale-Decoupled Asymmetric Inference (SDAI) strategy, which estimates primitives at a reduced resolution and renders the fused image at the target resolution for efficient large-scene pansharpening. Experiments on QuickBird, GaoFen-2, WorldView-3, and WorldView-3-4K datasets show that GSPan delivers state-of-the-art fusion performance. Moreover, SDAI markedly accelerates inference, achieving a favorable trade-off between computational efficiency and fusion quality. Our results demonstrate the potential of continuous Gaussian residual representations as a flexible and scale-decoupled alternative to fixed-grid prediction.
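The scale flexibility described above comes from the representation being a continuous function rather than a pixel grid. The sketch below is a simplified stand-in, not GSPan's renderer: it assumes axis-aligned Gaussians with hand-picked parameters on the unit square and evaluates the same primitives on two different sampling grids.

```python
import numpy as np

def render_gaussians(centers, sigmas, amplitudes, height, width):
    """Evaluate a sum of axis-aligned 2D Gaussians at the pixel centers of a grid on [0, 1]^2."""
    ys = (np.arange(height) + 0.5) / height
    xs = (np.arange(width) + 0.5) / width
    yy, xx = np.meshgrid(ys, xs, indexing="ij")
    field = np.zeros((height, width))
    for (cy, cx), (sy, sx), a in zip(centers, sigmas, amplitudes):
        field += a * np.exp(-0.5 * (((yy - cy) / sy) ** 2 + ((xx - cx) / sx) ** 2))
    return field

# Hypothetical primitives: (row, col) centers, per-axis widths, signed amplitudes.
centers = [(0.25, 0.25), (0.70, 0.60)]
sigmas = [(0.05, 0.10), (0.15, 0.05)]
amplitudes = [1.0, -0.5]

# The same primitives rendered on a coarse grid and on a 3x finer grid.
coarse = render_gaussians(centers, sigmas, amplitudes, 32, 32)
fine = render_gaussians(centers, sigmas, amplitudes, 96, 96)
```

Every coarse pixel center coincides with a fine pixel center (index `3 * i + 1`), so `fine[1::3, 1::3]` reproduces `coarse`, confirming that both grids sample one underlying field.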

2606.17985 2026-06-17 cs.CV 新提交

Gaussian Light Field Splatting: A Physical Prior-Driven Vision Transformer for Unsupervised Low-Light Image Enhancement

高斯光场溅射:一种物理先验驱动的视觉Transformer用于无监督低光图像增强

Yuhan Chen, Wenxuan Yu, Guofa Li, Fuchen Li, Kunyang Huang, Yicui Shi, Ying Fang, Wenbo Chu, Keqiang Li

发表机构 * College of Mechanical and Vehicle Engineering, Chongqing University(重庆大学机械与车辆工程学院) Herbert Wertheim College of Engineering, University of Florida(佛罗里达大学赫伯特·韦特海姆工程学院) Department of Electrical and Computer Engineering, Carnegie Mellon University(卡内基梅隆大学电气与计算机工程系) National Innovation Center of Intelligent and Connected Vehicles(国家智能网联汽车创新中心) School of Vehicle and Mobility, Tsinghua University(清华大学车辆与运载学院)

AI总结 提出GLFS模型,将高斯溅射的连续物理光照建模引入Transformer,通过各向异性高斯函数表示场景光照并引入物理引导偏置到自注意力中,配合颜色向量角损失和亮度边缘损失,实现非均匀光照下的曝光均衡和色彩校正,达到最先进性能。

AI中文摘要

现有的无监督低光图像增强方法在复杂的非均匀光照下常常遇到局部曝光不平衡和颜色失真。此外,大多数Vision Transformers缺乏对光照退化的物理先验进行建模的显式机制。为了解决这些限制,我们提出了GLFS,一种基于高斯光场溅射的Vision Transformer,它将高斯溅射的连续物理光照建模集成到Transformer架构中。在GLFS中,场景光照由各向异性高斯基函数的叠加表示。将物理引导的偏置引入自注意力,以自适应地推断空间增益场,从而在复杂光照下实现准确且均匀的恢复。为了减少增强过程中的颜色偏差和结构退化,进一步开发了颜色向量角损失和亮度边缘损失。这些损失强制色调一致性并提高局部细节的结构保真度。广泛的消融研究和定量评估表明,GLFS在光照校正和细节保留方面具有明显优势。它实现了最先进的性能,并为低光图像增强提供了一种新的表示范式。

英文摘要

Existing unsupervised low-light image enhancement methods often encounter local exposure imbalance and color distortion under complex non-uniform illumination. In addition, most Vision Transformers lack an explicit mechanism for modeling the physical priors of illumination degradation. To address these limitations, we propose GLFS, a Gaussian light field splatting-based Vision Transformer that integrates continuous physical illumination modeling from Gaussian splatting into the Transformer architecture. In GLFS, scene illumination is represented by a superposition of anisotropic Gaussian basis functions. Physics-guided biases are introduced into self-attention to adaptively infer a spatial gain field, enabling accurate and uniform restoration under complex illumination. To reduce color bias and structural degradation during enhancement, a color-vector angular loss and a luminance-edge loss are further developed. These losses enforce hue consistency and improve the structural fidelity of local details. Extensive ablation studies and quantitative evaluations show that GLFS provides clear advantages in illumination correction and detail preservation. It achieves state-of-the-art performance and offers a new representation paradigm for low-light image enhancement.
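A color-vector angular loss can be read as penalizing the angle between RGB vectors, which changes with hue but not with brightness. The following is a hedged sketch of that general idea; the exact formulation in GLFS, and which image serves as the reference in the unsupervised setting, are not given in the abstract.

```python
import math

def angular_loss(pred_rgb, ref_rgb, eps=1e-8):
    """Mean angle in radians between corresponding RGB vectors."""
    total = 0.0
    for p, r in zip(pred_rgb, ref_rgb):
        dot = sum(a * b for a, b in zip(p, r))
        norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in r))
        cos = max(-1.0, min(1.0, dot / (norm + eps)))  # clamp for numerical safety
        total += math.acos(cos)
    return total / len(pred_rgb)

# A pixel brightened 4x keeps its direction in RGB space, so the angle is near zero.
brightened = angular_loss([(0.8, 0.4, 0.2)], [(0.2, 0.1, 0.05)])

# A pixel shifted from pure red to pure green is orthogonal to the original.
hue_shifted = angular_loss([(0.0, 1.0, 0.0)], [(1.0, 0.0, 0.0)])
```

This is why such a loss can sit alongside a large brightness gain: exposure correction scales the vector, while only a color cast rotates it.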

2606.17998 2026-06-17 cs.CV 新提交

AIGS-Net: Compact Illumination Field Modeling via 2D Gaussian Splatting for Fast Low-Light Image Enhancement

AIGS-Net: 基于2D高斯泼溅的紧凑光照场建模用于快速低光图像增强

Yuhan Chen, Kunyang Huang, Fuchen Li, Zhuohan Qin, Guofa Li, Wenbo Chu, Keqiang Li

发表机构 * College of Mechanical and Vehicle Engineering, Chongqing University(重庆大学机械与车辆工程学院) Department of Electrical and Computer Engineering, Carnegie Mellon University(卡内基梅隆大学电气与计算机工程系) Herbert Wertheim College of Engineering, University of Florida(佛罗里达大学赫伯特·韦特海姆工程学院) School of Mathematics and Statistics, Qingdao University(青岛大学数学与统计学院) National Innovation Center of Intelligent and Connected Vehicles(国家智能网联汽车创新中心) School of Vehicle and Mobility, Tsinghua University(清华大学车辆与运载学院)

AI总结 提出AIGS-Net,通过输入自适应的2D高斯泼溅光照场和零参数多尺度上下文编码,以约40个可学习参数实现低光图像增强,在LOL和LSRW基准上平衡了增强质量与推理效率。

AI中文摘要

现有的低光图像增强方法通常在光照场建模的表征能力与计算复杂度之间存在瓶颈。为解决此问题,本文提出自适应光照高斯泼溅网络(AIGS-Net),一种用于快速低光增强的超轻量级架构。与传统的静态先验不同,AIGS-Net构建了一个输入自适应的2D高斯泼溅光照场。高斯基函数的不透明度由输入图像的相对亮度统计动态调制,并通过有序alpha合成渲染空间变化的光照补偿。为了高效指导自适应光照补偿,引入了一个零参数非线性多尺度上下文编码模块,无需额外卷积权重即可提取低频结构和局部对比度线索。为抑制噪声放大和传感器引起的颜色偏差,AIGS-Net集成了噪声掩膜估计、锁定单通道伽马映射、跨通道一致性正则化和目标颜色对齐约束。在LOL和LSRW基准上的实验表明,AIGS-Net在仅需约40个可学习参数的情况下,改善了细节恢复和颜色保真度,实现了增强质量与极端推理效率之间的有效权衡。

英文摘要

Existing low-light image enhancement methods often face a bottleneck between the representation capacity of illumination-field modeling and computational complexity. To address this issue, this paper proposes an Adaptive Illumination Gaussian Splatting Network (AIGS-Net), an ultra-lightweight architecture for fast low-light enhancement. Unlike conventional static priors, AIGS-Net constructs an input-adaptive 2D Gaussian Splatting illumination field. The opacity of Gaussian basis functions is dynamically modulated by relative luminance statistics of the input image, and spatially varying illumination compensation is rendered through ordered alpha compositing. To guide adaptive illumination compensation efficiently, a zero-parameter nonlinear multiscale contextual encoding module is introduced to extract low-frequency structures and local contrast cues without additional convolutional weights. To suppress noise amplification and sensor-induced color bias, AIGS-Net integrates noise-mask estimation, locked single-channel Gamma mapping, cross-channel consistency regularization, and target color-alignment constraints. Experiments on LOL and LSRW benchmarks show that AIGS-Net improves detail recovery and color fidelity while requiring only approximately 40 learnable parameters, achieving an effective trade-off between enhancement quality and extreme inference efficiency.

2406.07435 2026-06-17 cs.CV cs.LG eess.IV 版本更新

Beware of Aliases -- Signal Preservation is Crucial for Robust Image Restoration

警惕混叠——信号保留对鲁棒图像复原至关重要

Shashank Agnihotri, Julia Grabinski, Janis Keuper, Margret Keuper

AI总结 针对图像复原网络因混叠导致鲁棒性差的问题,提出BOA-Restormer,通过在频域执行部分下采样和上采样操作,确保无混叠路径,在低成本下提升模型鲁棒性。

Comments Tags: Adversarial attack, image restoration, image deblurring, frequency sampling

AI中文摘要

图像复原网络通常由编码器和解码器组成,分别负责从噪声、失真数据中聚合图像内容并恢复干净、无失真的图像。数据聚合以及高分辨率图像生成通常都伴随着混叠的风险,即标准架构为了在验证数据上达到高PSNR值而牺牲了重建模型输入的能力。代价是模型鲁棒性低。在这项工作中,我们表明,在先进的复原变换器中简单地提供无混叠路径,可以在低复原性能成本下支持改进的模型鲁棒性。为此,我们提出了BOA-Restormer,一种基于变换器的图像复原模型,它在频域中部分执行下采样和上采样操作,以确保整个模型的无混叠路径,同时可能保留所有相关的高频信息。

英文摘要

Image restoration networks are usually composed of an encoder and a decoder, responsible for aggregating image content from noisy, distorted data and for restoring clean, undistorted images, respectively. Data aggregation and high-resolution image generation both usually come at the risk of introducing aliases, i.e., standard architectures put their ability to reconstruct the model input in jeopardy to reach high PSNR values on validation data. The price to be paid is low model robustness. In this work, we show that simply providing alias-free paths in state-of-the-art reconstruction transformers supports improved model robustness at a low cost in restoration performance. We do so by proposing BOA-Restormer, a transformer-based image restoration model that executes downsampling and upsampling operations partly in the frequency domain to ensure alias-free paths along the entire model while potentially preserving all relevant high-frequency information.
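The aliasing problem and the frequency-domain remedy can be seen on a one-dimensional toy signal. This sketch is not BOA-Restormer's operator; it assumes a plain spectrum truncation with NumPy's real FFT and compares it with strided subsampling.

```python
import numpy as np

def fft_downsample(x, factor):
    """Downsample a 1D real signal by keeping only its low-frequency rFFT bins."""
    n_out = len(x) // factor               # assumes len(x) is divisible by factor
    spectrum = np.fft.rfft(x)
    kept = spectrum[: n_out // 2 + 1].copy()
    kept[-1] = 0.0                         # drop the new Nyquist bin
    # irfft normalizes by n_out, so rescale to preserve the signal amplitude.
    return np.fft.irfft(kept, n=n_out) / factor

n = 64
t = np.arange(n) / n
low = np.sin(2 * np.pi * 4 * t)            # representable after 2x downsampling
high = np.sin(2 * np.pi * 24 * t)          # above the new Nyquist limit of 16
signal = low + high

strided = signal[::2]                      # naive subsampling: the high tone aliases
alias_free = fft_downsample(signal, 2)     # the high tone is removed instead
target = low[::2]

strided_error = float(np.abs(strided - target).max())
alias_free_error = float(np.abs(alias_free - target).max())
```

With strided subsampling the 24-cycle tone folds back to a lower frequency and corrupts the output, whereas the truncated-spectrum path returns the low-frequency content unchanged.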

2602.22159 2026-06-17 cs.CV 版本更新

CASR: A Robust Cyclic Framework for Arbitrary Large-Scale Super-Resolution with Distribution Alignment and Self-Similarity Awareness

CASR:一种鲁棒的循环框架,用于任意大尺度超分辨率,具有分布对齐和自相似性意识

Wenhao Guo, Zhaoran Zhao, Peng Lu, Sheng Li, Qian Qiao, DeRui Li

AI总结 CASR通过分布对齐和自相似性意识,解决大尺度超分辨率中的分布漂移和扩散不一致问题,实现稳定推理和高效单模型处理。

AI中文摘要

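任意尺度图像超分辨率(ASISR)从根本上受限于跨尺度分布偏移:一旦推理尺度超出训练范围,噪声、模糊和伪影会急剧累积。我们从跨尺度分布转移的视角重新审视这一挑战,提出CASR,一种简单而高效的循环超分辨率框架,将超大倍率放大重新表述为一系列分布内的尺度转移。该设计仅需单个模型即可确保任意尺度下的稳定推理。CASR解决了两个主要瓶颈:迭代间的分布漂移和逐块扩散不一致。所提出的SSAM模块通过超像素聚合对齐结构分布,防止误差累积;SARM模块则通过相关性引导的一致性约束恢复高频纹理,并通过相关性对齐保留自相似结构。尽管仅使用单个模型,我们的方法显著减少了分布漂移,保持了长程纹理一致性,并在极端放大倍率下仍实现了优越的泛化能力。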

英文摘要

Arbitrary-Scale Image Super-Resolution (ASISR) remains fundamentally limited by cross-scale distribution shift: once the inference scale leaves the training range, noise, blur, and artifacts accumulate sharply. We revisit this challenge from a cross-scale distribution transition perspective and propose CASR, a simple yet highly efficient cyclic SR framework that reformulates ultra-magnification as a sequence of in-distribution scale transitions. This design ensures stable inference at arbitrary scales while requiring only a single model. CASR tackles two major bottlenecks: distribution drift across iterations and patch-wise diffusion inconsistencies. The proposed SSAM module aligns structural distributions via superpixel aggregation, preventing error accumulation, while the SARM module restores high-frequency textures by enforcing correlation-guided consistency and preserving self-similarity structure through correlation alignment. Despite using only a single model, our approach significantly reduces distribution drift, preserves long-range texture consistency, and achieves superior generalization even at extreme magnification.
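Reformulating a large magnification as a sequence of in-distribution transitions amounts to factoring the target scale into per-pass factors that each stay inside the range the model was trained on. The sketch below uses equal factors per pass, which is an assumption made for illustration; the abstract does not say how CASR chooses the scale of each cycle, and the function name is hypothetical.

```python
import math

def cyclic_scale_schedule(target_scale, max_step_scale):
    """Split target_scale into equal per-pass factors, each at most max_step_scale."""
    if target_scale <= 1 or max_step_scale <= 1:
        raise ValueError("scales must be greater than 1")
    # Smallest number of passes whose per-pass factor stays within the limit;
    # the small offset guards against floating-point noise in the log ratio.
    n_steps = math.ceil(math.log(target_scale) / math.log(max_step_scale) - 1e-12)
    n_steps = max(n_steps, 1)
    step = target_scale ** (1.0 / n_steps)
    return [step] * n_steps

# A 32x magnification with a model trained for factors up to 4x needs three passes.
schedule = cyclic_scale_schedule(32, 4)
total_scale = math.prod(schedule)
```

Each pass would feed its output back into the same model, which is where the error accumulation addressed by the abstract's alignment modules would otherwise arise.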

2605.24003 2026-06-17 cs.CV cs.AI stat.AP 版本更新

Remote sensing data imputation using deep learning for multispectral imagery

基于深度学习的多光谱遥感数据插补

Shuang Liu, Fiona Johnson, Rohitash Chandra

发表机构 * Water Research Centre, University of New South Wales(新南威尔士大学水研究中心) ARC ITTC Data Analytics for Resources and Environments, University of New South Wales(新南威尔士大学资源与环境数据分析师联盟) Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics, University of New South Wales(新南威尔士大学数学与统计学过渡人工智能研究组)

AI总结 针对云覆盖导致的光学卫星数据缺失问题,本研究比较了线性插值与多种深度学习模型(CNN、Inception Resnet、Autoencoder及其与LSTM的组合)在四个有藻华历史记录的湖泊中重建缺失光谱波段的效果,发现深度学习模型显著优于基线方法,其中CNN表现最佳,且基于插补图像的藻华指数与观测数据吻合良好。

AI中文摘要

近年来,遥感技术在水体应用中得到越来越多的利用。使用光学卫星数据的一个常见挑战是由于云覆盖导致的观测缺失。这些数据缺口可能导致错过对水资源管理部门高度关注的湖泊中关键事件(如藻华)的检测。因此,提高光学卫星数据集的完整性对于改善藻华的监测和预测至关重要。在本研究中,我们比较了传统数据插补方法(即线性插值)与深度学习模型在四个有藻华历史记录的湖泊中重建缺失光谱波段的效果。采用的深度学习模型包括基于CNN的架构(即CNN、Inception Resnet和Autoencoder)以及基于CNN-LSTM的架构(即CNN-LSTM、Resnet-LSTM和Autoencoder-LSTM)。我们的结果表明,在人工掩膜区域内插补光谱波段值时,深度学习模型显著优于基线线性插值方法。在这些模型中,CNN在大多数湖泊中表现最佳。此外,我们通过将插补图像与观测数据进行比较,评估了基于插补图像的藻华指数(即Green/Red和NDCI)的性能。我们的结果表明,深度学习模型对于插补PlanetScope SuperDove影像中的缺失数据是有效的,从而能够实现更可靠的水体监测应用。

英文摘要

Remote sensing techniques have been increasingly utilised in aquatic applications in recent years. A common challenge in using optical satellite data is the presence of missing observations due to cloud cover. These data gaps can lead to missed detection of critical events, such as algal blooms, in lakes of high interest to water authorities. As a result, enhancing the completeness of optical satellite datasets is crucial for improving the monitoring and prediction of algal blooms. In this study, we compared a traditional data imputation method (i.e., linear interpolation) with deep learning models for reconstructing missing spectral bands across four lakes with historical records of algal blooms. The deep learning models adopted include CNN-based architectures (i.e., CNN, Inception Resnet, and Autoencoder) and CNN-LSTM-based architectures (i.e., CNN-LSTM, Resnet-LSTM, and Autoencoder-LSTM). Our results demonstrated that deep learning models substantially outperformed the baseline linear interpolation method in imputing spectral band values within artificially masked regions. Among these models, CNN delivered the best performance across most lakes. Furthermore, we evaluated the performance of algal bloom indices (i.e., Green/Red and NDCI) derived from the imputed imagery by comparing them with the observed data. Our results demonstrate that deep learning models are effective for imputing missing data in PlanetScope SuperDove imagery, enabling more reliable applications in water monitoring.
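The baseline and one of the indices named above are simple enough to sketch. The code below assumes a per-pixel time series with cloud-masked values stored as `None`, and it computes NDCI in its usual form, (red edge - red) / (red edge + red); the reflectance values are made up, and the exact bands and preprocessing used in the study are not stated in the abstract.

```python
def interpolate_gaps(times, values):
    """Fill None entries by linear interpolation between the nearest observations in time.

    `times` must be sorted in increasing order. Gaps at the edges of the series
    are left as None, because interpolation does not extrapolate.
    """
    known = [(t, v) for t, v in zip(times, values) if v is not None]
    filled = []
    for t, v in zip(times, values):
        if v is not None:
            filled.append(v)
            continue
        before = [(tk, vk) for tk, vk in known if tk < t]
        after = [(tk, vk) for tk, vk in known if tk > t]
        if not before or not after:
            filled.append(None)
            continue
        (t0, v0), (t1, v1) = before[-1], after[0]
        filled.append(v0 + (v1 - v0) * (t - t0) / (t1 - t0))
    return filled

def ndci(red_edge, red):
    """Normalized Difference Chlorophyll Index from red-edge and red reflectance."""
    return (red_edge - red) / (red_edge + red)

# Toy red-band series for one pixel; days 1 and 4 are cloud-covered.
days = [0, 1, 2, 4]
red_series = [0.10, None, 0.30, None]
filled_red = interpolate_gaps(days, red_series)

index_value = ndci(red_edge=0.06, red=0.04)
```

Linear interpolation uses only the temporal neighbours of a single pixel, which is one reason a model that also exploits spatial context can outperform it inside masked regions.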

11. 鲁棒性、安全、隐私与可信视觉 14 篇

2606.17477 2026-06-17 cs.CV cs.LG 新提交

Theoretical Grounding of Out-Of-Distribution Detection With Reinforcement Learning Optimizer

基于强化学习优化器的分布外检测的理论基础

Salimeh Sekeh, Xin Zhang

发表机构 * San Diego State University(圣地亚哥州立大学)

AI总结 本文提出一种强化学习引导的优化器,通过修正梯度下降更新来降低语义分布外误报率,理论分析了模型变化和环境变化对泛化误差的影响。

AI中文摘要

动态开放世界环境中的分布外(OOD)检测要求模型持续适应不断变化的数据分布,同时泛化到协变量偏移输入并拒绝语义偏移的OOD样本。大多数现有的OOD检测方法仅优化当前步目标,并未明确考虑部署后环境变化如何影响未来的OOD行为。在本文中,我们使用强化学习(RL)引导的优化器为动态OOD检测建立了理论基础,该优化器明确偏好随时间降低语义OOD假阳性率的更新。我们开发了一种新颖的增强优化器,在标准梯度下降(GD)之上使用RL引导的修正项,并展示了其在未来域泛化和语义OOD拒绝方面的改进。我们从模型变化和环境变化泛化误差的角度分析了时间误差分解,并开发了一个新的理论框架来比较GD和RL引导优化器下的泛化误差。

英文摘要

Out-of-distribution (OOD) detection in dynamic open-world environments requires a model to continually adapt to evolving data distributions while generalizing to covariate-shifted inputs and rejecting semantic-shifted OOD examples. Most existing OOD detection methods optimize only the current-step objective and do not explicitly account for how post-deployment environment changes affect future OOD behavior. In this paper, we establish a theoretical grounding for dynamic OOD detection using a reinforcement learning (RL)-guided optimizer that explicitly favors updates that reduce the semantic OOD false positive rate over time. We develop a novel augmented optimizer that uses an RL-guided correction term on top of standard gradient descent (GD) and show that it improves both future-domain generalization and semantic-OOD rejection. We analyze temporal error decomposition in terms of model-change and environment-change generalization errors and develop a new theoretical framework for comparing the generalization errors under both GD and RL-guided optimizers.

2606.17540 2026-06-17 cs.CV 新提交

TaFD: Threat-Aware Frequency Decoupling for Adversarial Robustness against Heterogeneous Attacks

TaFD:针对异构攻击的对抗鲁棒性的威胁感知频率解耦

Mengda Xie, Yiling He, Meie Fang

发表机构 * School of Computer Science and Cyber Engineering, Guangzhou University(广州大学计算机科学与网络工程学院) Information Security Research Group, University College London(伦敦大学学院信息安全研究组)

AI总结 针对异构攻击下联合对抗训练中的梯度不兼容问题,提出威胁感知频率解耦(TaFD)框架,通过频域分治策略实现结构参数分离,在多个基准上平均鲁棒准确率提升约11%。

AI中文摘要

多威胁鲁棒性仍然是深度学习中的一个基本挑战。尽管联合对抗训练(JAT)被广泛采用,但在异构威胁下,特别是$\ell_p$有界攻击和语义攻击之间,它遭受负迁移。通过一阶梯度分析,我们将此形式化为梯度不兼容,并从理论上证明了分离优化的必要性。我们进一步揭示这些冲突的威胁在频域中表现出可分离的频谱特性。受此观察启发,我们提出了威胁感知频率解耦(TaFD),一个两阶段防御框架,将JAT重新表述为频域分治范式。TaFD首先通过攻击频谱原型的无监督聚类发现潜在威胁域,并训练一个轻量级分类器用于推理时的威胁域识别。基于预测,TaFD采用频率条件卷积,学习威胁域特定的频谱掩码,并将每个样本路由到相应的专家,强制结构参数分离并缓解优化冲突。我们在三个代表性图像分类基准(CIFAR-10、CIFAR-100和Tiny-ImageNet)和两个代表性架构(卷积ResNet和混合Transformer MobileViT)上验证了TaFD。大量结果表明,与现有的JAT和频域基线相比,TaFD在异构攻击下实现了更均衡的鲁棒性,在保持领先的干净准确率的同时,平均鲁棒准确率比最强基线提高了约11%。

英文摘要

Multi-threat robustness remains a fundamental challenge in deep learning. Although joint adversarial training (JAT) is widely adopted, it suffers from negative transfer under heterogeneous threats, particularly between $\ell_p$-bounded and semantic attacks. Through first-order gradient analysis, we formalize this as gradient incompatibility and theoretically establish the necessity of decoupled optimization. We further reveal that these conflicting threats exhibit separable spectral characteristics in the frequency domain. Motivated by this observation, we propose Threat-aware Frequency Decoupling (TaFD), a two-stage defense framework that reformulates JAT as a frequency-domain divide-and-conquer paradigm. TaFD first discovers latent threat domains via unsupervised clustering of attack spectral prototypes and trains a lightweight classifier for inference-time threat domain identification. Conditioned on the prediction, TaFD employs a Frequency-Conditional Convolution that learns threat-domain-specific spectral masks and routes each sample to the corresponding expert, enforcing structural parameter separation and alleviating optimization conflicts. We validate TaFD on three representative image-classification benchmarks (CIFAR-10, CIFAR-100, and Tiny-ImageNet) and on two representative architectures (the convolutional ResNet and the hybrid-transformer MobileViT). Extensive results demonstrate that TaFD achieves more balanced robustness against heterogeneous attacks than existing JAT and frequency-domain baselines, improving average robust accuracy by approximately 11% over the strongest baseline while maintaining leading clean accuracy.
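Gradient incompatibility can be illustrated with a first-order argument on toy vectors: when the gradients of two threat-specific losses point in conflicting directions, a step along their average can raise one of the losses. The numbers below are made up, and the sketch shows only the general phenomenon, not the paper's formal analysis.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def first_order_change(grad, update_dir, lr):
    """Approximate loss change after the step `params -= lr * update_dir`."""
    return -lr * dot(grad, update_dir)

# Hypothetical gradients of the two threat-specific losses on shared parameters.
grad_lp = [1.0, 0.5, -0.2]
grad_semantic = [-0.8, 0.1, 0.3]

conflict = cosine(grad_lp, grad_semantic)                      # negative: conflict
joint = [(a + b) / 2 for a, b in zip(grad_lp, grad_semantic)]  # joint-training step

change_lp = first_order_change(grad_lp, joint, lr=0.1)              # decreases
change_semantic = first_order_change(grad_semantic, joint, lr=0.1)  # increases
```

Routing each threat domain to its own parameters, as the abstract describes, removes the shared update that produces this trade-off.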

2606.17601 2026-06-17 cs.CV 新提交

Test-Time Training for Robust Text-Guided Open-Vocabulary Object Counting

测试时训练用于鲁棒文本引导的开放词汇目标计数

Hao-Yuan Ma, Yuda Zou, Li Zhang, Yongchao Xu

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院) Wuhan University(武汉大学)

AI总结 提出Dual-TTT框架,通过测试时训练轻量去噪模块,提升文本引导开放词汇目标计数在恶劣条件下的鲁棒性,无需修改原模型架构。

AI中文摘要

文本引导的开放词汇目标计数(TOOC)能够对文本提示指定的任意目标类别进行计数,相比传统的封闭集计数提供了更大的灵活性。然而,现有的TOOC方法主要在理想图像上开发和评估,而现实场景常遭受雨、雾、黑暗和传感器噪声等不利条件,这些条件严重降低视觉质量并损害视觉-语言对齐。为弥补这一差距,我们引入了Robust-TOOC,这是首个在多种损坏条件下评估TOOC的基准,涵盖六种代表性退化类型:雨、雾、黑暗、高斯噪声、椒盐噪声和混合损坏。为提高鲁棒性同时保留原始计数架构,我们提出了Dual-TTT,一种用于TOOC的双架构测试时训练框架。具体来说,在测试时训练期间,Dual-TTT仅更新文本引导轻量去噪模块(TL-Denoiser),同时冻结原始计数网络。受扩散模型启发,TL-Denoiser被优化以从退化条件下的图像表示中去除与损坏相关的噪声。由于仅在测试时训练TL-Denoiser,Dual-TTT无需标注,并且可以无缝集成到现有TOOC模型中而无需修改其原始架构。在多个近期TOOC基线上的大量实验证明了我们方法的有效性。

英文摘要

Text-guided Open-vocabulary Object Counting (TOOC) enables counting arbitrary object categories specified by text prompts, offering substantially greater flexibility than conventional closed-set counting. However, existing TOOC methods are developed and evaluated primarily on ideal images, while real-world scenes often suffer from adverse conditions such as rain, fog, darkness, and sensor noise, which severely degrade visual quality and impair vision-language alignment. To bridge this gap, we introduce Robust-TOOC, the first benchmark for evaluating TOOC under diverse corruption conditions, which covers six representative degradation types: rain, fog, darkness, Gaussian noise, salt-and-pepper noise, and mixed corruption. To improve robustness while preserving the original counting architecture, we propose Dual-TTT, a dual-architecture test-time training framework for TOOC. Specifically, during test-time training, Dual-TTT updates only the Text-guided Lightweight Denoising module (TL-Denoiser), while keeping the original counting network frozen. Inspired by diffusion models, the TL-Denoiser is optimized to remove corruption-aware noise from image representations under degraded conditions. Since only the TL-Denoiser is trained at test time, Dual-TTT is annotation-free and can be seamlessly integrated into existing TOOC models without modifying their original architecture. Extensive experiments on multiple recent TOOC baselines demonstrate the effectiveness of our method.

2606.17606 2026-06-17 cs.CV 新提交

Flux-Guard: Facial Identity Protection using diffusion models

Flux-Guard:使用扩散模型的面部身份保护

Jie Wang, Tao Wang, Ru Zhang, Jianyi Liu

发表机构 * School of Cyberspace Security, Beijing University of Posts and Telecommunications(北京邮电大学网络空间安全学院) Nanjing University of Aeronautics and Astronautics(南京航空航天大学)

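AI summary: Proposes Flux-Guard, a framework that combines flow trajectory control with latent-space adversarial optimization to perform face editing and privacy protection in one generative process, raising attack success rates against cross-domain face recognition models.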

详情
AI中文摘要

人脸识别系统的广泛部署使得社交媒体和公共平台上共享的个人图像面临身份关联和隐私风险。现有的对抗性隐私保护方法可以降低未经授权的人脸识别性能,但与生成式面部编辑不兼容。人工智能驱动的面部编辑工具越来越受欢迎,这显著增加了用户对个性化肖像生成和社交分享的需求。然而,当前的编辑方法通常保留身份特征,使得编辑后的图像仍然容易被恶意人脸识别系统追踪。因此,本文提出了Flux-Guard,一种基于对抗攻击的隐私保护面部编辑框架,它在统一的生成过程中集成了面部编辑和隐私保护。具体地,我们设计了一种流轨迹控制方法,将语义操作与生成过程对齐,并引入了潜在空间对抗优化,采用自适应感知损失驱动的加权策略,动态调整对抗强度以在保持视觉质量的同时最大化攻击效果。大量实验表明,Flux-Guard支持面部编辑,同时在CelebA-HQ和LADN数据集上显著提高了对跨域人脸识别模型的攻击成功率。此外,对商业API的评估结果证实了其在现实世界应用中的有效性。代码发布在 https://github.com/JLMWang/Flux-Guard 。

英文摘要

The widespread deployment of face recognition (FR) systems exposes personal images shared on social media and public platforms to identity linkage and privacy risks. Existing adversarial privacy protection methods can degrade unauthorized FR performance but are not compatible with generative face editing. Artificial intelligence-driven face editing tools are gaining popularity, which has significantly increased user demand for personalized portrait generation and social sharing. However, current editing methods often preserve identity features, making the edited images still susceptible to tracking by malicious FR systems. Thus, this paper proposes Flux-Guard, a privacy-preserving face editing framework based on adversarial attacks, which integrates face editing and privacy protection within a unified generative process. Specifically, we design a flow trajectory control method to align semantic manipulations with the generative process and introduce latent-space adversarial optimization with an adaptive perceptual-loss-driven weighting strategy, dynamically adjusting adversarial strength to maximize attack effectiveness while preserving visual quality. Extensive experiments demonstrate that Flux-Guard supports face editing while significantly improving attack success rates against cross-domain face recognition models on the CelebA-HQ and LADN datasets. Furthermore, evaluation results for commercial APIs have confirmed its effectiveness in real-world applications. The code is released at https://github.com/JLMWang/Flux-Guard.

2606.17711 2026-06-17 cs.CV cs.AI 新提交

Structured Adversarial Camouflage via Voronoi Diagrams

基于Voronoi图的结构化对抗伪装

Jens Bayer, Stefan Becker, David Münch, Michael Arens, Jürgen Beyerer

发表机构 * Fraunhofer IOSB and Fraunhofer Center for Machine Learning(弗劳恩霍夫光学、系统技术及图像处理研究所和弗劳恩霍夫机器学习中心) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

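AI summary: Generates structured camouflage patterns by optimizing only seed-point locations through a soft assignment under a fixed palette; garment-level application significantly lowers person-detection AP, and the attack transfers to out-of-domain backgrounds and across detector families.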

详情
AI中文摘要

像素级对抗补丁计算量大且视觉上可检测,限制了在安全关键系统中的实用性。我们提出对抗性Voronoi伪装,通过软分配在固定可打印调色板下仅优化种子点位置,无需额外正则化即可生成类似结构化碎片伪装图案。在COCO风格AP@[.5:.95]上评估行人检测,朴素放置(Inria -> COCO)表现相当差,而通过分割掩码(3DPeople)进行服装级应用导致AP显著下降。该攻击可迁移到域外背景和跨检测器家族(YOLOv9/10/11/12),表明在黑盒设置中的鲁棒性。使用不同调色板重新绘制在很大程度上抵消了效果,单色调整显示有限容忍度(<=0.17),突出了结构-调色板耦合。参数高效、调色板受限的设计在降低实时检测器性能的同时提高了视觉合理性。物理验证和颜色校准留待未来工作。代码:https://github.com/JensBayer/Voronoi 。本文最初发表于由信息与通信技术系统技术委员会IST-224-RSY组织的国际军事通信与信息系统会议(ICMCIS),于2026年5月12-13日在英国巴斯举行。

英文摘要

Pixel-wise adversarial patches are computationally heavy and often visually detectable, limiting utility in security-critical systems. We present adversarial Voronoi camouflage that optimizes only seed-point locations under fixed, printable palettes using a soft assignment, producing structured, splinter camouflage-like patterns without additional regularization. Evaluated on person detection with COCO-style AP@[.5:.95], naive placement (Inria -> COCO) performs comparably poorly, while garment-level application via segmentation mask (3DPeople) results in a significant AP drop. The attack transfers to out-of-domain backgrounds and across detector families (YOLOv9/10/11/12), indicating robustness in black-box settings. Repainting with different palettes largely nullifies the effect, and single-color tweaks show limited tolerance (<=0.17), highlighting a structure-palette coupling. The parameter-efficient, palette-constrained design improves visual plausibility while degrading real-time detector performance. Physical validation and color calibration are left for future work. Code: https://github.com/JensBayer/Voronoi. This paper was originally presented at the International Conference on Military Communication and Information Systems (ICMCIS), organized by the Information Systems Technology (IST) Scientific and Technical Committee, IST-224-RSY - the ICMCIS, held in Bath, United Kingdom, 12-13 May 2026.
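
The soft assignment described in the abstract above can be illustrated with a small sketch. This is not the authors' implementation: the function name `soft_voronoi`, the softmax over negative squared distances, and the `temperature` parameter are assumptions made for illustration. The point it shows is that the palette stays fixed and only the seed coordinates would be optimized, while each pixel remains a smooth (and therefore differentiable) blend of palette colors.

```python
import math

def soft_voronoi(seeds, seed_colors, palette, width, height, temperature=0.05):
    """Render an image whose pixels are softmax-weighted blends of palette colors.

    seeds: list of (x, y) in [0, 1]^2; the only quantities one would optimize.
    seed_colors: palette index assigned to each seed (kept fixed).
    palette: list of RGB tuples (kept fixed; hypothetical printable colors).
    """
    image = []
    for row in range(height):
        y = (row + 0.5) / height
        out_row = []
        for col in range(width):
            x = (col + 0.5) / width
            # Negative squared distance to each seed acts as a softmax logit.
            logits = [-((x - sx) ** 2 + (y - sy) ** 2) / temperature
                      for sx, sy in seeds]
            top = max(logits)  # subtract the max for numerical stability
            weights = [math.exp(value - top) for value in logits]
            total = sum(weights)
            pixel = [0.0, 0.0, 0.0]
            for weight, color_index in zip(weights, seed_colors):
                for channel in range(3):
                    pixel[channel] += (weight / total) * palette[color_index][channel]
            out_row.append(tuple(pixel))
        image.append(out_row)
    return image

palette = [(0.2, 0.3, 0.1), (0.6, 0.5, 0.3), (0.1, 0.1, 0.1)]
seeds = [(0.2, 0.2), (0.8, 0.3), (0.5, 0.8)]
image = soft_voronoi(seeds, [0, 1, 2], palette, width=8, height=8, temperature=0.01)
```

As `temperature` shrinks, the blend approaches a hard Voronoi partition in which every pixel takes the color of its nearest seed.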

2606.17953 2026-06-17 cs.CV 新提交

MLLMs Get It Right, Then Get It Wrong: Tracing and Correcting Late-Layer Textual Bias

MLLMs 先正确后错误:追踪并纠正后层文本偏见

Xingming Li, Ao Cheng, Qiyao Sun, Xixiang He, Xuanyu Ji, Runke Huang, Qingyong Hu

发表机构 * National University of Defense Technology(国防科技大学) Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Intelligent Game and Decision Lab(智能博弈与决策实验室)

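AI summary: Finds that multimodal large language models form correct vision-based predictions in intermediate layers that are then overridden by text in the final output. Building on the direction in which predictions shift (85% of failures shift toward text, 89% of successes toward vision), it proposes the training-free method CALRD, which improves conflict benchmarks by up to 9.4% absolute.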

Comments Accepted at IJCAI 2026. 16 pages, 10 figures

详情
AI中文摘要

当视觉与文本矛盾时,多模态大语言模型(MLLMs)始终偏向文本,即使图像提供了明确的相反证据。这种偏见对需要视觉基础的应用构成风险,但其原因尚不清楚。本文中,我们揭示了一个令人惊讶的发现:模型最初往往是正确的,在中间层形成基于视觉的正确预测,然后在最终输出中改变主意,偏向文本。我们称之为“后层文本覆盖”。视觉信息已被编码,只是未能保留到输出。更有趣的是,我们发现预测的变化方式揭示了其正确性:85%的失败转向文本,而89%的成功转向视觉。这种方向性特征使得一种简单而有效的干预成为可能:当我们检测到自信的视觉预测被抑制时,我们将其恢复。我们提出了CALRD(冲突感知层参考解码),一种无需训练的方法,在推理时恢复被覆盖的预测。在五种不同架构的MLLM上的实验表明,在冲突基准上绝对提升高达9.4%,同时基本保持标准性能,无需训练或外部知识。它恢复了模型已知但未能保留的信息。

英文摘要

When vision contradicts text, multimodal large language models (MLLMs) consistently favor text, even when images provide clear evidence otherwise. This bias poses risks for applications requiring visual grounding, yet its cause remains unclear. In this paper, we uncover a surprising finding: models often get it right initially, forming correct vision-based predictions in their intermediate layers, before changing their minds and favoring text in the final output. We call this "late-layer textual override". The visual information is encoded, it simply does not survive to the output. More intriguingly, we find that how predictions change reveals whether they're correct: 85% of failures shift toward text, while 89% of successes shift toward vision. This directional signature enables a simple but powerful intervention: when we detect a confident visual prediction being suppressed, we restore it. We propose CALRD (Conflict-Aware Layer Reference Decoding), a training-free method that recovers overridden predictions at inference time. Experiments across five MLLMs of varying architectures demonstrate up to 9.4% absolute improvements on conflict benchmarks while largely preserving standard performance, without training or external knowledge. It recovers what the model already knew but failed to preserve.
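
The intervention summarized above ("when we detect a confident visual prediction being suppressed, we restore it") can be sketched as a toy decision rule. The abstract does not give CALRD's actual criterion, so everything here is an assumption: the function name `restore_overridden`, the per-layer token probabilities (as one might obtain by decoding intermediate hidden states), and the fixed `confidence` threshold. In particular, this toy does not check whether the intermediate prediction is vision-grounded.

```python
def restore_overridden(layer_probs, confidence=0.8):
    """Pick an output token from per-layer token probabilities.

    layer_probs: list with one dict per layer mapping token -> probability;
                 the last dict is the model's final output distribution.
    Returns (token, layer_index). If some intermediate layer confidently
    predicted a token that the final layer dropped, that token is restored;
    otherwise the final prediction is kept. Hypothetical rule, for illustration.
    """
    final_index = len(layer_probs) - 1
    final_token = max(layer_probs[final_index], key=layer_probs[final_index].get)
    best = None  # (token, layer_index, probability)
    for index in range(final_index):
        probs = layer_probs[index]
        token = max(probs, key=probs.get)
        if token != final_token and probs[token] >= confidence:
            if best is None or probs[token] > best[2]:
                best = (token, index, probs[token])
    if best is None:
        return final_token, final_index
    return best[0], best[1]

# The image shows a red object while the accompanying text claims "blue".
conflict = [
    {"red": 0.5, "blue": 0.5},
    {"red": 0.9, "blue": 0.1},   # confident intermediate prediction
    {"red": 0.3, "blue": 0.7},   # final output sides with the text
]
no_conflict = [
    {"cat": 0.6, "dog": 0.4},
    {"cat": 0.95, "dog": 0.05},
]
restored = restore_overridden(conflict)
unchanged = restore_overridden(no_conflict)
```

In the first case the rule returns the intermediate-layer token; in the second the final prediction is left alone because no earlier layer disagrees with it.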

2606.17961 2026-06-17 cs.CV cs.AI 新提交

Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation

基于相似性的位置编码在旋转下的鲁棒性:理论分析与实验验证

Andrea Santomauro, Luigi Portinale, Giorgio Leonardi

发表机构 * Computer Science Institute, DiSIT, University of Piemonte Orientale, Alessandria, Italy(皮埃蒙特东方大学计算机科学研究所,DiSIT,亚历山德里亚,意大利)

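AI summary: Analyzes theoretically and validates experimentally the stability of similarity-based positional encoding (simPE) under rotational perturbations, proving bounded perturbation in Frobenius norm and showing that simPE outperforms standard learned positional encoding on four datasets.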

详情
AI中文摘要

位置编码是Transformer架构的基本组成部分,因为它注入了关于输入空间或序列排列的信息。在标准绝对位置编码和正弦编码的最新替代方案中,基于相似性的位置编码(simPE)已成为一种通过成对关系表示位置结构的灵活框架。simPE最初是为医学成像应用设计的,其中几何鲁棒性尤为重要:在图像采集过程中,由于成像仪器、患者定位或轻微的采集偏差,自然会产生小旋转。尽管具有经验上的前景,但simPE在几何扰动下的理论行为尚未完全表征。在本文中,我们研究了simPE对旋转的鲁棒性,结合了形式化的理论分析和实验验证。我们首先证明simPE通常不是旋转不变的。然后,我们证明,在基本分量的温和Lipschitz假设下,simPE在旋转扰动下是稳定的,并推导了Frobenius范数下的显式扰动界限。我们在四个受控数据集上实验验证了这些发现——一个合成Arrow数据集、一个合成Shapes数据集(四个几何形状类别)、一个合成Digits数据集和一个基准图像分类数据集(FashionMNIST)——其中训练和验证图像保持固定的规范方向,而测试图像则经受逐渐增大的旋转角度。在所有数据集中,simPE在旋转下的准确率、F1分数、精确率和召回率方面始终优于标准学习位置编码,特别是在小到中等角度范围内,这证实了理论稳定性保证。

英文摘要

Positional encoding is a fundamental component of Transformer architectures, as it injects information about the spatial or sequential arrangement of inputs. Among recent alternatives to standard absolute and sinusoidal encodings, similarity-based positional encoding (simPE) has emerged as a flexible framework for representing positional structure through pairwise relations. simPE was originally designed for medical imaging applications, where geometric robustness is especially relevant: small rotations naturally arise during image acquisition, induced by imaging instruments, patient positioning, or slight acquisition misalignments. Despite its empirical promise, the theoretical behavior of simPE under geometric perturbations has not been fully characterized. In this paper, we study the robustness of simPE with respect to rotations, combining formal theoretical analysis with experimental validation. We first show that simPE is generally not rotation-invariant. We then prove that, under mild Lipschitz assumptions on the elementary components, simPE is stable under rotational perturbations and derive explicit perturbation bounds in Frobenius norm. We validate these findings experimentally on four controlled datasets--a synthetic Arrow dataset, a synthetic Shapes dataset (four geometric shape categories), a synthetic Digits dataset, and a benchmark image classification dataset (FashionMNIST)--in which training and validation images are kept in a fixed canonical orientation while test images are subjected to increasing rotation angles. Across all datasets, simPE consistently outperforms standard learned positional encoding in terms of accuracy, F1 score, precision, and recall under rotation, particularly in the small-to-moderate angle regime, corroborating the theoretical stability guarantees.

2606.18008 2026-06-17 cs.CV 新提交

PhaseWin: An Efficient Search Algorithm for Faithful Visual Attribution

PhaseWin:一种用于忠实视觉归因的高效搜索算法

Zihan Gu, Ruoyu Chen, Junchi Zhang, Li Liu, Xiaochun Cao, Hua Zhang

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络空间安全学院) Shanghai Center for Mathematical Sciences, Fudan University(复旦大学上海数学中心) College of Electronic Science and Technology, National University of Defense Technology(国防科技大学电子科学学院) School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University(中山大学深圳校区网络空间安全学院)

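AI summary: Proposes PhaseWin, a phased window-search algorithm that reduces the number of model evaluations needed for visual attribution from O(n²) to O(n) while keeping faithfulness close to that of greedy search.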

Comments 26 pages, 29 figures

详情
AI中文摘要

视觉归因是解释现代视觉和视觉-语言模型的基本工具,尤其在需要检查、诊断或审计模型决策时。其目标是通过对候选图像区域分配重要性排序,解释模型决策如何依赖于视觉输入的局部区域。给定一个划分为n个区域的图像,忠实归因可以转化为有序子集搜索问题,其中逐步插入所选区域应尽可能早地恢复目标模型响应。对区域子集的穷举搜索会产生指数级成本,而广泛使用的贪心搜索仍需要二次数量的模型评估,因为每个选择步骤都会重新评分所有剩余候选。我们提出PhaseWin,一种用于忠实视觉归因的高效子集搜索算法。PhaseWin将贪心区域选择重组为分阶段窗口搜索过程:不是每一步都重新评估整个候选集,而是在全局候选筛选、自适应剪枝和局部窗口细化之间交替,同时保留贪心搜索的基本区域排序行为。我们在单调证据积累条件下分析PhaseWin,并表明在特征级结构假设下,它实现了可控的线性评估复杂度以及接近贪心的忠实度保证。在图像分类、目标检测、视觉定位和图像描述上的大量实验表明,在所有比较的归因方法中,PhaseWin以最少的前向传播达到高忠实度,经验上实现了从O(n²)到O(n)的预测降低。代码可在 https://github.com/Qihuai27/phasewin-va 获取。

英文摘要

Visual attribution is a fundamental tool for interpreting modern vision and vision-language models, particularly when their decisions must be inspected, diagnosed, or audited. Its goal is to explain how a model's decision depends on local regions of the visual input, typically by assigning an importance ordering over candidate image regions. Given an image partitioned into $n$ regions, faithful attribution can be cast as an ordered subset-search problem, in which progressively inserting the selected regions should recover the target model response as early as possible. Exhaustive search over region subsets incurs exponential cost, while the widely used greedy search still requires a quadratic number of model evaluations, because every selection step rescores all remaining candidates. We propose PhaseWin, an efficient subset-search algorithm for faithful visual attribution. PhaseWin reorganizes greedy region selection into a phased window-search procedure: rather than re-evaluating the full candidate set at every step, it alternates between global candidate screening, adaptive pruning, and localized window refinement, while preserving the essential region-ranking behavior of greedy search. We analyze PhaseWin under monotone evidence-accumulation conditions and show that, under feature-level structural assumptions, it attains controllable linear evaluation complexity together with near-greedy faithfulness guarantees. Extensive experiments on image classification, object detection, visual grounding, and image captioning show that, among all compared attribution methods, PhaseWin reaches high faithfulness with the fewest forward passes, empirically realizing the predicted reduction from $O(n^2)$ to $O(n)$. The code is available at https://github.com/Qihuai27/phasewin-va.
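
To make the quadratic cost of the greedy baseline concrete, the following sketch implements plain greedy insertion search and counts model evaluations. It is the baseline the abstract describes, not PhaseWin itself; the function name `greedy_insertion_order` and the additive `toy_score` standing in for a real model response are assumptions for illustration.

```python
def greedy_insertion_order(n_regions, score):
    """Greedy ordered subset search.

    score: callable taking a frozenset of region indices and returning the
           model response for an image with only those regions inserted.
    Returns (order, evaluations), where evaluations counts calls to score.
    """
    selected = []
    remaining = set(range(n_regions))
    evaluations = 0
    while remaining:
        best_region, best_score = None, float("-inf")
        # Every step rescores all remaining candidates: this is the O(n^2) part.
        for region in sorted(remaining):
            value = score(frozenset(selected + [region]))
            evaluations += 1
            if value > best_score:
                best_region, best_score = region, value
        selected.append(best_region)
        remaining.remove(best_region)
    return selected, evaluations

# Toy stand-in for a model: each region contributes a fixed amount of evidence.
weights = [0.1, 0.7, 0.05, 0.15]

def toy_score(subset):
    return sum(weights[i] for i in subset)

order, evaluations = greedy_insertion_order(len(weights), toy_score)
```

With n regions the loop makes n + (n - 1) + ... + 1 = n(n + 1)/2 calls to the model, which is the quadratic term that a linear-evaluation method aims to avoid.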

2606.17352 2026-06-17 cs.LG cs.CV 交叉投稿

MM++: Unsupervised Scale-Invariant Multilayer OOD Detection via Top-K Gated Feature Fusion

MM++: 无监督尺度不变多层OOD检测通过Top-K门控特征融合

Rahim Hossain, Md Tawheedul Islam Bhuian, Md Farhan Shadiq, Kyoung-Don Kang

发表机构 * School of Computing, State University of New York at Binghamton(纽约州立大学宾汉姆顿分校计算机学院)

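AI summary: Proposes MM++, which identifies discriminative intermediate layers through entropy density drops and stabilizes the fused feature space with a Ledoit-Wolf regularized tied covariance matrix, giving unsupervised, post-hoc, scale-invariant multilayer OOD detection that is robust in both near- and far-OOD settings.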

详情
AI中文摘要

我们提出了MM++(多层马氏距离++),一个完全无监督、严格事后处理且尺度不变的分布外(OOD)检测框架。为了解决尺度不变性与层次表达性之间的权衡,MM++构建了一个原则性的联合特征空间。它首先通过测量熵密度下降来识别判别性中间层,这些下降标志着尖锐语义压缩的边界。通过将这些选定层与终端表示融合,该框架捕获潜在的跨层相关性,同时减轻早期层噪声。关键地,一个Ledoit-Wolf正则化的绑定协方差矩阵稳定了这个统一空间,使得距离估计可靠。无需辅助OOD数据、分类器微调或架构修改,MM++在近和远OOD检测的不同架构上均提供了鲁棒性能。

英文摘要

We introduce MM++ (Multilayer Mahalanobis++), a fully unsupervised, strictly post-hoc, and scale-invariant framework for Out-of-Distribution (OOD) detection. To address the trade-off between scale invariance and hierarchical expressivity, MM++ constructs a principled joint feature space. It first identifies discriminative intermediate layers by measuring entropy density drops, which mark the boundaries of sharp semantic compression. By fusing these selected layers with the terminal representation, the framework captures latent cross-layer correlations while mitigating early-layer noise. Crucially, a Ledoit-Wolf regularized tied covariance matrix stabilizes this unified space, enabling reliable distance estimation. Requiring no auxiliary OOD data, classifier fine-tuning, or architectural modifications, MM++ delivers robust performance across distinct architectures for both near- and far-OOD detection.
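
The distance-estimation step mentioned above (a tied covariance matrix, regularized by shrinkage, used for Mahalanobis scoring) can be sketched on 2-D toy features. This is a simplified illustration, not the MM++ pipeline: layer selection, feature fusion, and any feature normalization are omitted, and the shrinkage weight `alpha` is fixed by hand here, whereas the Ledoit-Wolf estimator derives it from the data. All function names are hypothetical.

```python
def mean(points):
    n = len(points)
    return [sum(p[d] for p in points) / n for d in range(2)]

def tied_covariance(classes):
    """Pooled 2x2 covariance of class-centered features (shared by all classes)."""
    total = sum(len(points) for points in classes)
    cov = [[0.0, 0.0], [0.0, 0.0]]
    for points in classes:
        mu = mean(points)
        for p in points:
            d = [p[0] - mu[0], p[1] - mu[1]]
            for i in range(2):
                for j in range(2):
                    cov[i][j] += d[i] * d[j] / total
    return cov

def shrink(cov, alpha):
    """Blend the covariance with a scaled identity (same target as Ledoit-Wolf)."""
    scale = (cov[0][0] + cov[1][1]) / 2  # trace / dimension
    return [[(1 - alpha) * cov[i][j] + (alpha * scale if i == j else 0.0)
             for j in range(2)] for i in range(2)]

def inverse_2x2(m):
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def ood_score(x, class_means, precision):
    """Smallest squared Mahalanobis distance to any class mean (higher = more OOD)."""
    def distance(mu):
        d = [x[0] - mu[0], x[1] - mu[1]]
        return sum(d[i] * precision[i][j] * d[j] for i in range(2) for j in range(2))
    return min(distance(mu) for mu in class_means)

class_a = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
class_b = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0), (5.0, 5.0)]
means = [mean(class_a), mean(class_b)]
precision = inverse_2x2(shrink(tied_covariance([class_a, class_b]), alpha=0.1))

score_in = ood_score((0.5, 0.6), means, precision)    # close to class A
score_out = ood_score((2.5, -3.0), means, precision)  # far from both classes
```

Shrinkage matters most when the feature dimension is large relative to the number of training samples, where the raw pooled covariance can be ill-conditioned and its inverse unstable.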

2606.18198 2026-06-17 cs.CR cs.CV 交叉投稿

Seeing Is Not Screening: Multimodal Hidden Instruction Attacks on Agent Skill Scanners

看见不等于筛查:针对智能体技能扫描仪的多模态隐藏指令攻击

Xiaojun Jia, Jie Liao, Simeng Qin, Ke Ma, Wenbo Guo, Yebo Feng, Aishan Liu, Yang Liu

发表机构 * Nanyang Technological University, Singapore(南洋理工大学,新加坡) Chongqing University, China(重庆大学) Northeastern University, China(东北大学) University of Chinese Academy of Sciences, China(中国科学院大学) Beihang University, China(北京航空航天大学)

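AI summary: Targets a blind spot in existing skill scanners, which overlook malicious instructions in images. Proposes SkillCamo, a multimodal attack that hides instructions in images, and ExecScan, an execution-grounded scanning module that detects hidden instructions through intent extraction, behavior reconstruction, abuse assessment, and execution simulation; experiments show that ExecScan improves scanning performance.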

详情
AI中文摘要

智能体技能正在成为基于LLM的系统中的重要攻击面。通过对现有技能扫描仪的实证研究,我们发现当前的防御主要依赖文本描述、清单和源代码作为安全分析的主要信号,这可能导致视觉传达的恶意意图未得到充分检查。这造成了一个实际盲点:隐藏在图像中的有害操作指令可能绕过扫描,但在部署期间仍可被多模态智能体恢复。为系统研究这一威胁,我们提出SkillCamo,一种文档介导的多模态指令攻击,将恶意指令隐藏在技能附带的图像中,同时重写周围文档以自然地将这些图像引用为正常工作流程的一部分。因此,攻击不依赖于图像本身,而是依赖于执行时对文本指导和视觉负载的联合解释。为防御此类攻击,我们进一步提出ExecScan,一种基于执行的多模态扫描模块,对技能工件进行意图提取、行为重建、滥用评估和审慎执行模拟。ExecScan联合分析文档、代码、引用的资源和视觉内容,以恢复隐藏指令,重建可执行行为链,并识别下游风险,如泄露、破坏、持久化、欺骗和权限提升。大量实验表明,图像隐藏的恶意指令挑战了现有技能扫描仪,而ExecScan可以提高技能扫描性能。

英文摘要

Agent skills are emerging as an important attack surface in LLM-based systems. Through an empirical study of existing skill scanners, we find that current defenses primarily rely on textual descriptions, manifests, and source code as the main signals for security analysis, which can leave visually conveyed malicious intent insufficiently examined. This creates a practical blind spot: harmful operational instructions hidden in images may bypass scanning while still being recoverable by multimodal agents during deployment. To systematically investigate this threat, we propose SkillCamo, a document-mediated multimodal instruction attack that conceals malicious instructions within images bundled with a skill while rewriting the surrounding documentation to naturally reference those images as part of the normal workflow. Thus, the attack does not rely on the image alone, but on the joint interpretation of textual guidance and visual payload at execution time. To defend against such attacks, we further propose ExecScan, an execution-grounded multimodal scanning module that performs intent extraction, behavior reconstruction, abuse assessment, and deliberative execution simulation over skill artifacts. ExecScan jointly analyzes documentation, code, referenced resources, and visual content to recover hidden instructions, reconstruct executable behavior chains, and identify downstream risks such as exfiltration, destruction, persistence, deception, and privilege escalation. Extensive experiments show that image-hidden malicious instructions challenge existing skill scanners, while ExecScan can improve the skill scanning performance.

2408.12099 2026-06-17 cs.CV cs.CR 版本更新

Query-Efficient Video Adversarial Attack with Stylized Logo on Service Computing

服务计算中基于风格化标志的查询高效视频对抗攻击

Duoxun Tang, Yuxin Cao, Xi Xiao, Derui Wang, Sheng Wen, Tianqing Zhu

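AI summary: Proposes SLA, a black-box video attack framework that uses stylized logos and reinforcement learning to generate low-budget, realistic adversarial examples, and outperforms existing methods in targeted attacks.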

Comments Accepted to IEEE Transactions on Dependable and Secure Computing (TDSC)

详情
AI中文摘要

在服务计算中,视频分类已成为许多智能应用的基础。尽管深度神经网络(DNN)在识别视频内容方面表现出色,但最近的研究表明,DNN极易受到对抗样本的影响。因此,理解对抗攻击可以更好地应对紧急情况。为了提高攻击性能,许多基于风格迁移的攻击和基于补丁的攻击被提出。然而,前者的全局扰动会带来不自然的全局色彩,而后者由于扰动空间有限,在目标攻击中难以成功。此外,与大量针对图像分类器的方法相比,视频对抗攻击仍然相对未被充分探索。因此,为了在低预算下生成对抗样本并使其具有更高的逼真度,我们提出了一种新颖的黑盒视频攻击框架,称为风格化标志攻击(SLA)。SLA通过三个阶段进行。第一阶段涉及构建标志的风格参考集,这不仅可以使生成的样本更自然,还可以在目标攻击中携带更多目标类别特征。然后,采用强化学习来确定标志在视频中的风格参考和位置参数,确保风格化标志以最优属性放置在视频中。最后,逐步优化扰动以提高欺骗率。实验结果表明,SLA可以实现比最先进方法更好的性能,并且在面对各种防御方法时仍保持良好的欺骗效果。我们相信SLA可以提高安全社区对视频分类系统可靠性和安全性的认识,并作为可能攻击方法的备忘录。

英文摘要

In service computing, video classification has become fundamental to many intelligent applications. While Deep Neural Networks (DNNs) have demonstrated excellent performance in recognizing video content, recent studies have shown that DNNs are highly vulnerable to adversarial examples. Thus, understanding adversarial attacks can better respond to emergency situations. In order to improve attack performance, many style-transfer-based attacks and patch-based attacks have been proposed. However, the global perturbation of the former will bring unnatural global colors, while the latter is difficult to achieve success in targeted attacks due to the limited perturbation space. Moreover, compared to a plethora of methods targeting image classifiers, video adversarial attacks remain relatively underexplored. Therefore, to generate adversarial examples with a low budget and to provide them with a higher verisimilitude, we propose a novel black-box video attack framework, called Stylized Logo Attack (SLA). SLA is conducted through three stages. The first stage involves building a style reference set for logos, which can not only make the generated examples more natural, but also carry more target class features in targeted attacks. Then, Reinforcement Learning is employed to determine the style reference and position parameters of the logo within the video, which ensures that the stylized logo is placed in the video with optimal attributes. Finally, perturbations are optimized in a step-by-step manner so as to improve the fooling rate. Experimental results indicate that SLA can achieve better performance than state-of-the-art methods and still maintain good deception effects when facing various defense methods. We believe SLA can raise awareness among the security community about the reliability and security of video classification systems and serve as a memorandum of possible attack methods.

2507.14632 2026-06-17 cs.CV 版本更新

BusterX++: Towards Unified Cross-Modal AI-Generated Content Detection and Explanation with MLLM

BusterX++: 迈向基于MLLM的统一跨模态AI生成内容检测与解释

Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, Guangliang Cheng

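AI summary: Proposes BusterX++, a unified multimodal large language model whose single-stage pure reinforcement learning strategy enables cross-modal capability transfer between image and video forgery detection, outperforming existing methods.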

详情
AI中文摘要

生成式AI的快速发展显著提升了图像和视频合成质量,加剧了多模态视觉错误信息的风险。最近的多模态大模型通过推理和解释在透明化AI生成内容检测方面展现出潜力,但现有方法大多将图像和视频取证视为孤立任务,跨模态协同作用尚未充分探索。为解决这一问题,我们提出了BusterX++,一个统一的多模态大模型,用于联合图像和视频检测并具备可解释推理能力。我们还引入了GenBuster-Bench++,一个精心策划、难度对齐的基准测试,包含平衡的图像和视频样本,覆盖最新的生成模型和多样化的真实场景。利用这一受控设置,我们重新审视了广泛采用的SFT → RL后训练范式。值得注意的是,我们的发现表明,仅由稀疏结果奖励驱动的单阶段纯RL策略在统一和单模态设置中始终匹配或超越强SFT+RL基线。我们的关键洞察是,SFT降低了策略熵,限制了策略搜索空间并抑制了探索自由度。相比之下,单阶段纯RL在整个训练过程中保持较高的策略熵,有效解锁了图像和视频取证之间跨模态能力迁移的自发涌现。大量实验表明,BusterX++达到了最先进的性能,突显了RL在统一跨模态视觉推理中的强大潜力。

英文摘要

The rapid advancement of generative AI has substantially improved image and video synthesis, amplifying the risk of multimodal visual misinformation. Recent MLLMs have shown promise for transparent AI-generated content detection through reasoning and explanation, yet existing approaches largely treat image and video forensics as isolated tasks, leaving cross-modal synergies underexplored. To address this, we present BusterX++, a unified MLLM for joint image and video detection with interpretable reasoning. We also introduce GenBuster-Bench++, a meticulously curated, difficulty-aligned benchmark containing balanced image and video samples spanning recent generation models and diverse real-world scenarios. Using this controlled setting, we revisit the widely adopted SFT → RL post-training paradigm. Notably, our findings demonstrate that a single-stage, pure RL strategy driven strictly by sparse outcome rewards consistently matches or surpasses a strong SFT+RL baseline across both unified and single-modality settings. Our key insight reveals that SFT imposes lower policy entropy, which restricts the policy search space and dampens exploratory freedom. In contrast, single-stage pure RL maintains higher policy entropy throughout training, effectively unlocking the spontaneous emergence of cross-modal capability transfer between image and video forensics. Extensive experiments demonstrate that BusterX++ achieves state-of-the-art performance, highlighting the powerful potential of RL for unified cross-modal visual reasoning.

2604.00605 2026-06-17 cs.CV 版本更新

Fluently Lying: Adversarial Robustness Can Be Substrate-Dependent

流利地撒谎:对抗鲁棒性可能依赖于底层架构

Daye Kang, Hyeongboo Baek

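AI summary: Reports a failure mode under adversarial attack, Quality Corruption (QC), in which the detection count is preserved while accuracy collapses. QC appears in only one of the four SNN detectors tested (EMS-YOLO), suggesting that adversarial failure modes can be substrate-dependent.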

Comments Withdrawn by the authors due to an implementation bug discovered in the main experimental pipeline. The bug affects the main results, and therefore the empirical claims and conclusions of the paper are no longer supported

详情
AI中文摘要

用于监控和防御对抗攻击下目标检测器的主要工具假设,当精度下降时,检测数量也会同步下降。这种耦合是假设的,并未经过测量。我们报告了在单个模型上观察到的反例:在标准PGD攻击下,EMS-YOLO(一种脉冲神经网络(SNN)目标检测器)保留了超过70%的检测结果,而mAP从0.528骤降至0.042。我们将这种保持检测数量但精度崩溃的现象称为质量损坏(Quality Corruption,QC),以区别于在非目标评估中占主导地位的抑制现象。在四种SNN架构和两种威胁模型(l-infinity和l-2)下,QC仅出现在测试的四种检测器之一(EMS-YOLO)中。在该模型上,所有五种标准防御组件均未能检测或缓解QC,这表明防御生态系统可能依赖于一种基于单一底层架构校准的共享假设。据我们所知,这些结果首次证明对抗失败模式可能依赖于底层架构。

英文摘要

The primary tools used to monitor and defend object detectors under adversarial attack assume that when accuracy degrades, detection count drops in tandem. This coupling was assumed, not measured. We report a counterexample observed on a single model: under standard PGD, EMS-YOLO, a spiking neural network (SNN) object detector, retains more than 70% of its detections while mAP collapses from 0.528 to 0.042. We term this count-preserving accuracy collapse Quality Corruption (QC), to distinguish it from the suppression that dominates untargeted evaluation. Across four SNN architectures and two threat models (l-infinity and l-2), QC appears only in one of the four detectors tested (EMS-YOLO). On this model, all five standard defense components fail to detect or mitigate QC, suggesting the defense ecosystem may rely on a shared assumption calibrated on a single substrate. These results provide, to our knowledge, the first evidence that adversarial failure modes can be substrate-dependent.

2510.11709 2026-06-17 cs.LG cs.AI cs.CV 版本更新

Adversarial Attacks Leverage Interference Between Features in Superposition

对抗攻击利用特征叠加中的干扰

Edward Stevinson, Lucas Prieto, Melih Barsbey, Tolga Birdal

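AI summary: Shows that interference caused by feature superposition in neural networks can be a source of adversarial vulnerability, and verifies through theory and experiments that interference patterns determine attack success and transferability.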

Comments Forty-third International Conference on Machine Learning

详情
AI中文摘要

为什么对抗样本存在,并且为什么它们能在模型间迁移?现有的解释诉诸于高维几何、输入中的非鲁棒模式以及决策边界结构,但没有一个提供表示层面的机制来解释为什么特定的扰动会成功以及为什么攻击能在模型间迁移。在本文中,我们表明对抗脆弱性可能源于神经网络中高效的信息编码。具体来说,脆弱性可能源于叠加——网络表示的概念数量超过其维度,迫使非正交表示从而产生干扰。这种干扰导致针对一个表示的扰动会影响其他表示,从而产生由干扰模式决定的脆弱性。在精确控制叠加的合成环境中,我们证实叠加足以产生对抗脆弱性。由此产生的攻击是可预测的:PGD发现的扰动与从干扰几何导出的理论最优扰动一致。在相似数据上训练的模型会发展出相似的干扰模式,这解释了攻击的可迁移性。然后我们表明,对图像分类器的成功攻击表现出我们提出的机制所预测的结构。这些发现揭示了对抗脆弱性可能是网络表示压缩的副产品,补充了基于数据属性或架构因素的现有解释。

英文摘要

Why do adversarial examples exist, and why do they transfer between models? Existing explanations appeal to high-dimensional geometry, non-robust patterns in the input, and decision boundary structure, but none provides a representation-level mechanism that explains why specific perturbations succeed and why attacks transfer between models. In this paper, we show that adversarial vulnerability can stem from efficient information encoding in neural networks. Specifically, vulnerability can arise from superposition - the phenomenon where networks represent more concepts than they have dimensions, forcing non-orthogonal representation and thus interference. This interference causes perturbations targeting one representation to affect others, creating vulnerabilities determined by interference patterns. In synthetic settings with precisely controlled superposition, we establish that superposition suffices to create adversarial vulnerability. The resulting attacks are predictable: PGD-discovered perturbations align with theoretically optimal perturbations derived from the interference geometry. Models trained on similar data develop similar interference patterns, explaining attack transferability. We then show that successful attacks on image classifiers exhibit the structure predicted by our proposed mechanism. These findings reveal that adversarial vulnerability can be a byproduct of networks' representational compression, complementing existing explanations based on data properties or architectural factors.
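
The interference mechanism described above can be illustrated with a minimal linear toy: three features packed into two dimensions, so their directions cannot be orthogonal. This is a sketch under assumptions, not the paper's experimental setup; the evenly spaced feature directions, the purely linear readout, and the helper names are chosen only to isolate the effect.

```python
import math

# Three unit-norm feature directions packed into two dimensions (superposition).
angles = [0.0, 2 * math.pi / 3, 4 * math.pi / 3]
W = [(math.cos(a), math.sin(a)) for a in angles]

def encode(features):
    """Map 3 feature activations to a 2-D hidden vector."""
    return tuple(sum(f * w[d] for f, w in zip(features, W)) for d in range(2))

def readout(hidden):
    """Linear readout of each feature: dot product with its direction."""
    return [hidden[0] * w[0] + hidden[1] * w[1] for w in W]

def perturb_along(hidden, feature_index, eps):
    """Add a perturbation of size eps along one feature's direction."""
    w = W[feature_index]
    return (hidden[0] + eps * w[0], hidden[1] + eps * w[1])

clean = encode([1.0, 0.0, 0.0])             # only feature 0 is active
attacked = perturb_along(clean, 1, eps=0.4)  # push along feature 1
delta = [a - c for a, c in zip(readout(attacked), readout(clean))]
```

Because the directions are 120 degrees apart, a push of size 0.4 along feature 1 also shifts the readouts of features 0 and 2 by 0.4 · cos(120°) = -0.2 each. With orthogonal directions those cross terms would be zero, which is the sense in which the interference pattern determines which perturbations have side effects.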

12. 数据集、基准、评测与训练方法 23 篇

2606.17246 2026-06-17 cs.CV cs.MA 新提交

GeoDisaster: Benchmarking Orchestrated Agents for Operational Disaster Geo-Intelligence

GeoDisaster: 用于操作化灾害地理智能的编排智能体基准测试

Maram Hasan, Aman Verma, Savitra Roy, Hariseetharam Gunduboina, Daksh Jain, Muhammad Haris Khan, Subhasis Chaudhuri, Biplab Banerjee

发表机构 * Indian Institute of Technology Bombay(印度理工学院孟买分校) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

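AI summary: Proposes the GeoDisaster benchmark, with 2,921 instances and 43 question types, for evaluating tool-grounded spatial reasoning and structured decision-making in remote-sensing vision-language models, together with a multi-agent framework aligned via RCEA that improves tool use and evidence grounding.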

Comments 28 pages, 11 Figures

详情
AI中文摘要

遥感视觉语言模型(RS-VLMs)推动了地球观测分析向视觉解释和指令遵循发展,但在操作化地理智能方面仍显不足,后者需要基于工具的空间推理和结构化、有证据支持的决策。我们提出了GeoDisaster,一个操作化地理空间灾害推理基准,包含2921个经过验证的实例,涵盖43种问题类型和五个任务族:森林砍伐监测、多灾害分析、建筑损坏评估、洪水安全路线规划以及Sentinel-1 SAR洪水监测。实例集成了异构的EO/GIS证据——光学和SAR影像、栅格掩膜、矢量几何、道路网络和暴露图层——涵盖灾害检测、损坏评估、暴露估计和诊断报告生成。真实答案基于可执行的地理空间工作流和确定性一致性检查,无需语言模型标注。我们进一步提出了一个编排的多智能体框架,包含18个面向灾害的工具,其中角色专业化的智能体通过明确的执行契约进行协调,并通过角色契约期望对齐(RCEA)进行对齐:结合故障感知的监督微调和基于契约的强化学习,利用密集的步骤级信号。实验表明,GeoDisaster对现有的RS-VLMs和智能体系统构成了挑战,而RCEA改进了工具使用、证据基础、状态一致性和决策生成。

英文摘要

Remote-sensing vision-language models (RS-VLMs) have advanced Earth-observation analysis toward visual interpretation and instruction-following, yet fall short of operational geo-intelligence, which demands tool-grounded spatial reasoning and structured, evidence-backed decisions. We introduce GeoDisaster, an operational geospatial disaster reasoning benchmark with 2,921 verified instances across 43 question types and five task families: deforestation monitoring, multi-hazard analysis, building-damage assessment, flood-safe routing, and Sentinel-1 SAR flood monitoring. Instances integrate heterogeneous EO/GIS evidence (optical and SAR imagery, raster masks, vector geometries, road networks, and exposure layers), spanning hazard detection, damage assessment, exposure estimation, and diagnostic report generation. Ground-truth answers are grounded in executable geospatial workflows and deterministic consistency checks, removing the need for language-model annotation. We further propose an orchestrated multi-agent framework with 18 disaster-oriented tools, where role-specialized agents coordinate through explicit execution contracts, aligned via Role-Contract Expectation Alignment (RCEA): failure-aware supervised fine-tuning combined with contract-grounded reinforcement learning over dense step-level signals. Experiments show that GeoDisaster challenges existing RS-VLMs and agentic systems, while RCEA improves tool use, evidence grounding, state consistency, and decision generation.

2606.17433 2026-06-17 cs.CV 新提交

LADBench: A Benchmark for Logical Fault Detection in Images

LADBench: 图像中逻辑故障检测的基准

Sahasra Kondapalli, Lara Radovanovic, Aadi Palnitkar, Mingyang Mao, Xiaomin Lin

发表机构 * University of South Florida(南佛罗里达大学)

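AI summary: Proposes the LAD-Bench benchmark, more than 1,000 synthetic images with logical anomalies across four domains, and evaluates models with a tiered prompting protocol, revealing weaknesses of current VLMs in implicit logical fault detection.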

Comments Accepted to the IEEE International Conference on Development and Learning (ICDL 2026)

详情
AI中文摘要

大型视觉语言模型在视觉问答和语义定位方面表现出色,但其自主逻辑推理能力仍未被充分探索。现有的异常基准强调视觉错误或直接提示,而非开放世界部署所需的物理和社会常识。为此,我们引入了LAD-bench,一个包含1000多张精心策划的合成图像基准,涵盖四个领域的逻辑异常:住宅、城市、协作和自然。我们进一步提出了一种基于渐进式揭示的分层提示协议,该协议衡量模型在定位和推理逻辑故障时需要多少显式帮助。评估领先的基础模型揭示了重大弱点:即使最好的模型也仅达到70.11%的整体准确率,表明隐式逻辑故障检测仍未解决。关键的是,即使在更深层次收到显式提示后,模型也常常无法识别异常。通过揭示这些序列多模态推理中的局限性,LAD-Bench为推进自主视觉系统的安全性、可靠性和认知对齐提供了一个严格的框架。数据集和代码:https://huggingface.co/datasets/SahasraK/LADBench

英文摘要

Large Vision Language Models (VLMs) excel at visual question answering and semantic grounding, but their capacity for autonomous logical reasoning remains underexplored. Existing anomaly benchmarks emphasize visual errors or direct prompting rather than the physical and social common sense needed for open-world deployment. To address this, we introduce LAD-bench, a benchmark of more than 1,000 curated synthetic images with logical anomalies across four domains: Residential, Urban, Collaborative, and Nature. We further propose a Tiered Prompting Protocol based on progressive disclosure, which measures how much explicit assistance a model needs to localize and reason about a logical fault. Evaluating leading foundation models reveals substantial weaknesses: even the best achieves only 70.11% overall accuracy, showing that implicit logical fault detection remains unsolved. Crucially, models often fail to identify anomalies even after receiving explicit hints in deeper tiers. By surfacing these limitations in sequential multimodal reasoning, LAD-Bench offers a rigorous framework for advancing the safety, reliability, and cognitive alignment of autonomous visual systems. Dataset and Code: https://huggingface.co/datasets/SahasraK/LADBench

2606.17564 2026-06-17 cs.CV cs.AI 新提交

Geometric Consistency Protocol for Foundation Model Features in Multi-View Satellite Imagery

多视图卫星图像中基础模型特征的几何一致性协议

Qiyan Luo, Jie Yang, Yingdong Pi, Lekang Wen, Mi Wang

发表机构 * Hubei Province Key Research and Development Program(湖北省重点研发计划) LIESMARS Special Research Funding(测绘遥感信息工程国家重点实验室专项研究基金) National Science Fund for Distinguished Young Scholars(国家杰出青年科学基金)

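AI summary: Because conventional unconstrained 2D global matching can mislead in satellite multi-view reconstruction, proposes a geometry-faithful evaluation protocol based on the Rational Function Model (RFM). Using an RPC-projected 3D consistency metric and a geometry-constrained dense matching proxy, it reveals a decoupling between semantic agreement and geometric localization and shows that 2D backbones remain competitive under RPC-consistent evaluation.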

Comments The manuscript is accepted as Oral Presentation in IEEE International Geoscience and Remote Sensing Symposium(IGARSS 2026)

详情
AI中文摘要

标准化的评估协议对于遥感领域的稳健基准测试至关重要,特别是当基础特征越来越多地跨不同传感器和复杂成像几何进行迁移时。在卫星多视图重建中,依赖无约束2D全局匹配的传统评估常常具有误导性。有理函数模型(RFM)及其有理多项式系数(RPC)决定了弯曲的、高度依赖的极线几何,这使得平坦的2D搜索空间在物理上不一致。我们提出了一种针对RPC框架的几何忠实且可复现的协议。我们的方法将RPC投影的3D一致性度量与几何约束的密集匹配代理相结合,专门评估在物理上合理的搜索流形下相似性响应是否保持局部化和唯一性。我们联合报告策略的一个关键发现是语义一致性与几何定位的解耦:在投影3D点处的高跨视图相似性并不能保证实际推理中的可靠匹配性。我们的基准测试表明,将几何约束纳入问题定义对于卫星图像是基础性的。此外,我们展示了最先进的2D骨干网络在经受这种RPC一致评估时,仍然与专门的3D感知模型保持显著竞争力。

英文摘要

Standardized evaluation protocols are indispensable for robust benchmarking in remote sensing, particularly as foundation features are increasingly transferred across diverse sensors and complex imaging geometries. In satellite multi-view reconstruction, conventional evaluations relying on unconstrained 2D global matching are often misleading. The Rational Function Model (RFM) and its Rational Polynomial Coefficients (RPC) dictate a curved, height-dependent epipolar geometry that renders flat 2D search spaces physically inconsistent. We propose a geometry-faithful and reproducible protocol tailored for the RPC framework. Our approach integrates an RPC-projected 3D consistency metric with a geometry-constrained dense matching proxy, specifically evaluating whether similarity responses remain localized and unique under physically plausible search manifolds. A pivotal finding of our joint reporting strategy is the decoupling of semantic agreement and geometric localization: high cross-view similarity at a projected 3D point does not guarantee reliable matchability in practical inference. Our benchmark demonstrates that incorporating geometric constraints is fundamental to the problem definition in satellite imagery. Furthermore, we show that state-of-the-art 2D backbones remain remarkably competitive against specialized 3D-aware models when subjected to this RPC-consistent evaluation.

2606.17644 2026-06-17 cs.CV cs.AI 新提交

Bounding Box Label Propagation for Re-Annotation of Document Layout Analysis Datasets

边界框标签传播用于文档布局分析数据集的重新标注

Nick Jochum, Tobias Alt-Veit, Christian Schön, Alexander Lück, René Schuster, Didier Stricker

发表机构 * Insiders Technologies GmbH(Insiders Technologies 有限公司) DFKI – German Research Center for Artificial Intelligence(德国人工智能研究中心) RPTU – University Kaiserslautern-Landau(凯泽斯劳滕-兰道大学)

AI总结 提出BBLP伪标签框架,通过对象编码器融合视觉、文本和位置嵌入,利用标签传播实现仅用10%标注数据达到全监督性能的81.6%。

Comments 17 pages, 3 figures, to appear in proceedings of ICDAR 2026, Vienna, Austria

详情
AI中文摘要

实际文档处理场景中的数据集通常随时间增长,其类别标注不断细化,这导致大量耗时且昂贵的重新标注工作。一个有前景的解决方案是仅手动重新标注一小部分可用文档,并应用半监督学习技术利用有标签和无标签数据。尽管针对分类问题已有多种方法,但对于目标检测实例的重新分类(例如文档布局分析)尚无适配方法。为此,我们提出了边界框标签传播(BBLP),一种用于目标检测的伪标签框架。对象编码器整合来自目标检测样本的视觉、文本和位置嵌入,生成联合嵌入,可用于部分标注数据集上的标签传播,即插即用。评估结果表明,所提方法能产生高质量的边界框类别标注。在D4LA布局分析数据集中,仅使用10%标注数据,其mAP达到54.0%,相当于全监督性能的81.6%。我们的工作展示了标签传播在目标检测中的潜力,并为减少实际文档处理应用中的手动标注工作量奠定了基础。

英文摘要

Datasets in practical document processing scenarios typically grow over time, and their class annotations undergo continuous refinement. This creates significant re-annotation efforts, which are time-consuming and costly. A promising remedy is to re-annotate only a small subset of available documents manually and apply semi-supervised learning techniques that leverage both labelled and unlabelled data. Although there are numerous approaches to tackle this problem for classification, there exists no adaptation for the problem of re-classifying object detection instances, e.g. for document layout analysis. To this end, we propose Bounding Box Label Propagation (BBLP), a pseudo-labelling framework for object detection. An object encoder integrates visual, textual, and positional embeddings from object detection samples to come up with a joint embedding that can be used for Label Propagation on partially annotated datasets in a plug-and-play fashion. Evaluation results indicate that the proposed approach produces high-quality class annotations of bounding boxes. In the D4LA layout analysis dataset, it achieves a mAP of 54.0%, corresponding to 81.6% of fully supervised performance, while using only 10% labelled data. Our work demonstrates the potential of Label Propagation for object detection and lays the groundwork for reducing manual annotation efforts in real-world document processing applications.
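A minimal sketch of the label propagation step mentioned in the abstract above, assuming the joint embeddings have already been computed by some object encoder. The function name, the Gaussian affinity, and the parameters `sigma` and `iters` are illustrative assumptions, not the BBLP implementation.

```python
import math

def propagate_labels(embeddings, labels, num_classes, sigma=1.0, iters=50):
    """Spread class labels from a few annotated boxes to unlabelled ones.

    embeddings: list of float vectors, one per bounding box (hypothetical input).
    labels: dict mapping the index of a labelled box to its class id.
    Returns one predicted class id per box.
    """
    n = len(embeddings)

    def affinity(a, b):
        # Gaussian kernel on squared Euclidean distance (an assumed choice).
        d2 = sum((x - y) ** 2 for x, y in zip(a, b))
        return math.exp(-d2 / (2 * sigma ** 2))

    weights = [[affinity(embeddings[i], embeddings[j]) if i != j else 0.0
                for j in range(n)] for i in range(n)]

    # Start with one-hot distributions for labelled boxes, zeros elsewhere.
    dist = [[0.0] * num_classes for _ in range(n)]
    for i, c in labels.items():
        dist[i][c] = 1.0

    for _ in range(iters):
        updated = []
        for i in range(n):
            if i in labels:
                updated.append(dist[i])  # clamp the manually annotated boxes
                continue
            row = [sum(weights[i][j] * dist[j][c] for j in range(n))
                   for c in range(num_classes)]
            total = sum(row)
            updated.append([r / total for r in row] if total > 0 else row)
        dist = updated

    return [max(range(num_classes), key=lambda c: dist[i][c]) for i in range(n)]
```

As a side note on the reported numbers, 54.0% mAP at 81.6% of fully supervised performance implies a fully supervised mAP of roughly 54.0 / 0.816 ≈ 66.2%.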

2606.17809 2026-06-17 cs.CV 新提交

Million-scale multimodal pollen microscopy with expert-guided foundation models

百万级多模态花粉显微镜图像与专家引导的基础模型

András Biricz, Björn Gedda, Donát Magyar, Antonio Spanu, János Fillinger, Péter Pollner, István Csabai

发表机构 * Department of Physics of Complex Systems, ELTE Eötvös Loránd University(ELTE罗兰大学复杂系统物理系) The Palynological Laboratory at the Swedish Museum of Natural History(瑞典自然历史博物馆孢粉学实验室) National Centre for Public Health and Pharmacy(国家公共卫生与药品中心) INRAE, UR 546 BioSP, Site Agroparc(法国国家农业、食品与环境研究院,UR 546 BioSP,阿格罗帕克园区) National Korányi Institute for Pulmonology(国家科拉尼肺病研究所) Health Data Science and AI Knowledge Centre, Health Services Management Training Centre, Faculty of Health and Public Administration, Semmelweis University(塞梅维什大学健康与公共管理学院卫生服务管理培训中心健康数据科学与人工智能知识中心) Department of Biological Physics, ELTE Eötvös Loránd University(ELTE罗兰大学生物物理系)

AI总结 提出百万级多模态花粉显微镜数据集Pollen AI Atlas,结合专家引导的视觉-语言模型生成形态描述,实现跨区域、跨设置的高精度花粉识别与检索。

Comments 31 pages, 5 main figures, supplementary information included. Submitted to Scientific Reports

详情
AI中文摘要

从显微镜图像自动识别花粉仍然是空气生物学、古生态学和生物多样性监测中的一个瓶颈,因为可扩展系统必须泛化到样本制备、扫描仪设置和地理来源,同时保持孢粉学可解释性。为解决这一问题,我们提出了一个百万级多模态花粉显微镜资源——Pollen AI Atlas,该资源由来自四个地理来源、四种扫描仪设置和31个植物科46个分类标签的纯种全切片明场图像组装而成。通过每个源切片的一个手动选择示例,令牌级挖掘和过滤产生了1,511,390个已发布的花粉颗粒检测结果,在专家筛选的测试区域中提案精度达到99.6%。每个检测结果与来自五个开放权重视觉-语言模型的机器生成颗粒级形态描述配对,这些描述由专家验证的孢粉学锚点引导,提供了关于萌发孔系统、壁纹饰、形状和大小的结构化描述。在评估的模型中,Gemma4提供了最可控的主描述集,结合了严格长度控制、无泄漏和最强的文本检索性能。使用冻结视觉特征的基线基准达到了88.16%的top-1准确率,而跨区域检索表明,当图像相似度下降时,描述派生的文本嵌入仍然保持鲁棒(mAP@20 0.811对比0.262)。发布的数据、注释、描述、划分、代码和权重为花粉识别、跨区域领域适应和特定领域多模态显微镜学习提供了基准。

英文摘要

Automated pollen identification from microscopy remains a bottleneck in aerobiology, palaeoecology and biodiversity monitoring, because scalable systems must generalise across specimen preparation, scanner settings and geographic origins while retaining palynological interpretability. To address this gap, we present a million-scale multimodal pollen microscopy resource, Pollen AI Atlas, assembled from pure-species whole-slide bright-field images spanning four geographic origins, four scanner settings and 46 taxon labels across 31 botanical families. Seeded by one manually selected exemplar per source slide, token-level mining and filtering produced 1,511,390 released grain detections with 99.6% proposal precision in expert-curated test regions. Each detection was paired with machine-generated grain-level morphological captions from five open-weight vision-language models, guided by expert-verified palynological anchors, yielding structured descriptions of aperture systems, wall ornamentation, shape and size. Among the evaluated models, Gemma4 provided the most controlled primary caption set, combining tight length control, no leakage and the strongest text-retrieval performance. Baseline benchmarks with frozen visual features reached 88.16% top-1 accuracy, while cross-regional retrieval showed that caption-derived text embeddings remained robust when image similarity degraded (mAP@20 0.811 versus 0.262). Released data, annotations, captions, splits, code, and weights provide a benchmark for pollen recognition, cross-regional domain adaptation and domain-specific multimodal microscopy learning.

2606.18115 2026-06-17 cs.CV 新提交

HLS-GPT: A Generative Pretrained Transformer (GPT) for Continental-Scale NASA Harmonized Landsat and Sentinel-2 (HLS) Reflectance Reconstruction Across All Bands on Arbitrary Dates

HLS-GPT:一种用于大陆尺度NASA协调Landsat和Sentinel-2(HLS)所有波段任意日期反射率重建的生成式预训练Transformer

Junjie Li, Hankui K. Zhang, David P. Roy

发表机构 * Geospatial Sciences Center of Excellence, Department of Geography and Geospatial Sciences, South Dakota State University(地理空间科学卓越中心,地理与地理空间科学系,南达科他州立大学) Department of Geography, Environment, and Spatial Sciences, & Center for Global Change and Earth Observations, Michigan State University(地理、环境与空间科学系,以及全球变化与地球观测中心,密歇根州立大学)

AI总结 提出HLS-GPT,一种大规模生成式预训练Transformer模型,用于重建所有波段、任意日期和任意像素位置的NASA HLS 30米地表反射率,采用层次化Transformer架构处理不同光谱波段配置,在超过25万训练像素上训练,在多样地表条件下实现稳健重建。

详情
AI中文摘要

最近用于Landsat和Sentinel-2反射率时间序列重建的深度学习方法仍然受到光谱覆盖范围有限、地理可扩展性受限或基于补丁的短时间上下文设计的限制。我们提出了HLS-GPT,一种大规模生成式预训练Transformer模型,用于重建所有波段、任意日期和任意像素位置的NASA协调Landsat Sentinel-2 30米地表反射率。HLS-GPT使用层次化Transformer架构处理Landsat和Sentinel-2的不同光谱波段配置,并基于单像素12个月时间序列进行操作。为了捕捉地理和季节变化,该模型使用来自美国本土超过25万训练像素的九年HLS时间序列进行训练。一种随机裁剪和掩码策略提取不同起始日期的12个月周期,掩码50%的有效观测,并训练模型从剩余观测中重建被掩码的反射率值。使用超过62000个独立测试像素的评估显示,在多样化的地表条件下,包括复杂的作物物候和稀疏、不规则的观测,重建效果稳健。留一观测评估在所有HLS光谱波段上实现了低于0.026的重建RMSE,可见光波段相对RMSE低于35%,其他波段低于13%。尽管Landsat上缺少红边波段,红边波段误差与红波段和近红外波段误差相当。敏感性分析随机掩码10%至90%的测试观测,显示当掩码10%至50%的观测时,性能仅适度下降,所有波段RMSE低于0.028。在九个独立的109×109公里美国本土HLS图块上的图像重建进一步表明,HLS-GPT优于两种传统方法和NASA-IBM Prithvi模型。

英文摘要

Recent deep learning methods for Landsat and Sentinel-2 reflectance time series reconstruction remain limited by restricted spectral coverage, limited geographic scalability, or patch-based designs with short temporal contexts. We present HLS-GPT, a large-scale generative pretrained Transformer model for reconstructing NASA Harmonized Landsat Sentinel-2 30 m surface reflectance for all bands, any date, and any pixel location. HLS-GPT uses a hierarchical Transformer architecture to handle the different spectral band configurations of Landsat and Sentinel-2 and operates on single-pixel 12-month time series. To capture geographic and seasonal variability, the model was trained with nine years of HLS time series from more than 0.25 million training pixels across the conterminous United States. A random cropping and masking strategy extracts 12-month periods with varying start dates across epochs, masks 50% of valid observations, and trains the model to reconstruct the masked reflectance values from the remaining observations. Evaluation using more than 62,000 independent test pixels shows robust reconstruction under diverse land surface conditions, including complex crop phenology and sparse, irregular observations. Leave-one-observation-out evaluation achieved reconstruction RMSE below 0.026 for all HLS spectral bands, with relative RMSE below 35% for visible bands and below 13% for other bands. Red-edge band errors were comparable to red and near-infrared errors despite the absence of red-edge bands on Landsat. Sensitivity analyses that randomly masked 10% to 90% of test observations showed only modest degradation when 10% to 50% of observations were masked, with all-band RMSE below 0.028. Image reconstruction over nine independent 109 by 109 km CONUS HLS tiles further demonstrates that HLS-GPT outperforms two conventional methods and the NASA-IBM Prithvi model.
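The random cropping and masking strategy described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' training code: the observation format (a sorted, non-empty list of `(day, value)` pairs with `None` marking invalid observations), the 365-day window, and the function name are all hypothetical.

```python
import random

def crop_and_mask(obs, window_days=365, mask_ratio=0.5, rng=None):
    """Crop a random 12-month window and hide a share of its valid observations.

    obs: non-empty list of (day, reflectance) pairs sorted by day;
         reflectance is None for invalid (e.g. cloudy) observations.
    Returns (inputs, targets): inputs keep unmasked values and carry None
    where a value was hidden or invalid; targets hold the hidden values.
    """
    rng = rng or random.Random()
    first_day, last_day = obs[0][0], obs[-1][0]
    # A different start date can be drawn on every call (i.e. every epoch).
    start = rng.randint(first_day, max(first_day, last_day - window_days + 1))
    window = [(d, v) for d, v in obs if start <= d < start + window_days]

    valid = [i for i, (_, v) in enumerate(window) if v is not None]
    masked = set(rng.sample(valid, int(len(valid) * mask_ratio)))

    inputs = [(d, None if i in masked else v) for i, (d, v) in enumerate(window)]
    targets = [(d, v) for i, (d, v) in enumerate(window) if i in masked]
    return inputs, targets
```

A model trained this way would be asked to predict the `targets` values from `inputs`; the sensitivity analysis in the abstract corresponds to varying `mask_ratio` between 0.1 and 0.9 at test time.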

2606.18180 2026-06-17 cs.CV 新提交

EgoCS-400K: An Egocentric Gameplay Dataset for World Models

EgoCS-400K:面向世界模型的自我中心游戏数据集

Rongjin Guo, Dong Liang, Yuhao Liu, Fang Liu, Tianyu Huang, Gerhard P. Hancke, Rynson W. H. Lau

发表机构 * City University of Hong Kong(香港城市大学)

AI总结 为支持世界模型研究,构建大规模自我中心游戏数据集EgoCS-400K,包含40万第一人称视频和1万小时游戏轨迹,支持动作条件未来预测、状态事件场景展开等交互式视觉建模任务。

详情
AI中文摘要

从视频生成到交互式世界建模的转变对数据提出了新要求:除了带字幕的视频外,世界模型还需要基于驱动未来场景变化的动作、相机运动、状态和事件的时间对齐的视频-动作-语言轨迹。然而,大规模获取此类数据十分困难。网络视频数据集提供广泛的视觉覆盖,但缺乏可执行动作和可靠状态;机器人数据集提供动作和状态监督,但成本高昂且场景多样性有限;现有模拟器通常缺乏大规模人类驱动的交互轨迹。在本文中,我们介绍了EgoCS-400K,一个大规模基于回放的自我中心反恐精英世界模型数据集,该数据集基于公开的职业CS和CS2比赛演示构建,保留了人类游戏轨迹,并支持解析、回放、渲染和时间对齐。我们提取玩家状态、视角方向、移动、键盘/按钮输入、视角变化、武器使用、游戏事件和回合级上下文,并从相同轨迹渲染干净的第一人称视频。EgoCS-400K包含超过40万第一人称视频和1万小时游戏时间,来自1000多场比赛和4万回合,涵盖13张地图和每回合10个玩家视角。它支持一系列交互式视觉建模任务,包括动作条件未来预测、状态和事件感知场景展开、基于回放的描述以及智能体自我中心动作理解。通过大规模连接视觉观察与人类动作、相机运动、游戏状态和事件,EgoCS-400K在被动网络视频、可控游戏模拟和昂贵的真实世界具身数据之间架起了一座实用桥梁。

英文摘要

The shift from video generation to interactive world modeling places new demands on data: beyond captioned videos, world models require temporally aligned video-action-language trajectories grounded in the actions, camera motion, states, and events that drive future scene changes. However, such data is difficult to obtain at scale. Web video datasets offer broad visual coverage but lack executable actions and reliable states; robotic datasets provide action and state supervision but are costly and limited in scene diversity; and existing simulators often lack large-scale human-driven interaction trajectories. In this paper, we introduce EgoCS-400K, a large-scale replay-grounded egocentric Counter-Strike dataset for world models, built from public professional CS and CS2 match demos that preserve human gameplay trajectories and enable parsing, replaying, rendering, and temporal alignment. We extract player states, view directions, movements, keyboard/button inputs, view-angle changes, weapon usage, game events, and round-level context, and render clean first-person videos from the same trajectories. EgoCS-400K contains over 400,000 first-person videos and 10,000 hours of gameplay from more than 1,000 matches and 40,000 rounds, covering 13 maps and 10 player viewpoints per round. It supports a range of interactive visual modeling tasks, including action-conditioned future prediction, state- and event-aware scene rollout, replay-grounded captioning, and agent egocentric action understanding. By connecting visual observations with human actions, camera motion, game states, and events at scale, EgoCS-400K serves as a practical bridge between passive web videos, controllable game simulation, and costly real-world embodied data.

2606.17080 2026-06-17 cs.RO cs.AI cs.CV 交叉投稿

HRDX: A Large-Scale Vector HD-Map Dataset

HRDX:大规模矢量高清地图数据集

Sahith Reddy Chada, Isht Dwivedi, Nirav Savaliya

发表机构 * Honda Research Institute US(本田美国研究院)

AI总结 提出HRDX大规模矢量高清地图数据集,覆盖1400公里驾驶数据,含10类地图元素和20多种属性,并引入复合评分评估几何与属性准确性。

Comments https://usa.honda-ri.com/hrdx

详情
AI中文摘要

可靠的自动驾驶需要矢量化的高清地图,这些地图应具有几何精确性、语义丰富性,并能够扩展到长距离驾驶。然而,现有的公开高清地图数据集规模有限,提供的语义属性稀疏,并且缺乏诸如航拍图像等能够开启新研究方向的模态。我们提出了HRDX,一个用于矢量高清地图构建的大规模数据集,涵盖约40小时(1400公里)的最小重叠驾驶,比之前的公开高清地图数据集大数倍。数据使用六个同步环视摄像头、一个128线激光雷达和厘米级RTK GNSS/IMU捕获,并辅以精确对齐的航拍正射影像。标注涵盖10个矢量地图类别,并补充了20多个语义和拓扑属性。为了评估这一更丰富的本体,我们引入了复合评分(CS)来联合评估几何保真度和属性正确性。基准实验表明,HRDX的规模改善了在线矢量地图构建,并且对齐的航拍图像提供了有用的结构先验:在训练和/或推理中使用航拍图像可提高几何地图质量,而航拍增强的教师可以将部分优势转移给仅使用摄像头的学生,而无需增加推理时的传感器需求。HRDX旨在支持大规模高清地图学习、多模态BEV融合以及训练时特权信息的可重复研究。HRDX数据集和基准可在以下网址获取:https://github.com/honda-research-institute/HRDX

英文摘要

Reliable autonomous driving requires vectorized HD maps that are geometrically accurate, semantically rich, and scalable to long-horizon driving. However, existing public HD map datasets are limited in scale, provide sparse semantic attributes, and lack modalities such as aerial imagery that could enable new research directions. We present HRDX, a large-scale dataset for vector HD-map construction, spanning about 40 hours (1,400 km) of minimally overlapping drives, which is several times larger than prior public HD map datasets. Data is captured using six synchronized surround cameras, a 128-beam LiDAR, and centimeter-level RTK GNSS/IMU, and is further complemented by precisely aligned aerial orthoimagery. Annotations cover 10 vector map classes, complemented with over 20 semantic and topological attributes. To evaluate this richer ontology, we introduce the Composite Score (CS) to jointly assess geometric fidelity and attribute correctness. Benchmark experiments show that HRDX's scale improves online vector-map construction, and that aligned aerial imagery provides a useful structural prior: using aerial imagery at training and/or inference improves geometric map quality, while aerial-augmented teachers can transfer part of this benefit to camera-only students without increasing inference-time sensor requirements. HRDX is intended to support reproducible research on large-scale HD-map learning, multimodal BEV fusion, and training-time privileged information. HRDX dataset and benchmarks are available at https://github.com/honda-research-institute/HRDX

2606.17321 2026-06-17 cs.LG cs.CV 交叉投稿

ProCUA-SFT Technical Report

ProCUA-SFT 技术报告

Jaehun Jung, Ximing Lu, Brandon Cui, Muhammad Khalifa, Shaokun Zhang, Hao Zhang, Jin Xu, Amala Sanjay Deshmukh, Karan Sapra, Andrew Tao, Yejin Choi, Jan Kautz, Mingjie Liu, Yi Dong

发表机构 * NVIDIA(英伟达) University of Washington(华盛顿大学) Allen Institute for AI(艾伦人工智能研究所)

AI总结 提出 ProCUA-SFT 数据集,通过自动化管道从 2484 个应用组合的合成轨迹中蒸馏出 310 万步级 SFT 样本,微调 UI-TARS 7B 在 OSWorld 上达到 45.0% 的成功率,比基线提升 18.7 个百分点。

Comments 15 pages, 5 figures

详情
AI中文摘要

训练计算机使用智能体(CUA)——通过截图和键盘/鼠标操作与图形桌面交互的模型——需要在全桌面环境中收集的大规模、多样化的轨迹数据。最大的公共资源 AgentNet(22.5K 条人类轨迹)在用于监督微调(SFT)时会导致负迁移:在 AgentNet 上继续训练 UI-TARS 7B 导致 OSWorld 成功率从 26.3% 下降到 8-10%。我们提出了 ProCUA-SFT,一个包含 310 万步级 SFT 样本的数据集,这些样本从 2484 个应用组合中的 93K 条合成轨迹中蒸馏得到。该数据集由一个全自动管道生成,该管道(i)在带有真实世界内容的实况桌面上合成有基础的任务——912 个来自 SpreadsheetBench 的电子表格、约 10K 个来自 Zenodo10K 的宽松许可演示文稿以及多应用 OSWorld 配置——以及(ii)在展开前通过二元前置条件检查验证每个任务的可行性。单个 VLM(Kimi-K2.5)作为目标生成器、前置条件判断器和轨迹执行器,消除了规划器-执行器的能力差距。每条轨迹被扩展为步前缀样本,精确复现推理时看到的上下文布局。在 ProCUA-SFT 上微调 UI-TARS 7B 一个 epoch 后,在 OSWorld 上达到 45.0%——比基础模型提升 18.7 个百分点,比 AgentNet 训练的模型高出 35 个百分点以上。ProCUA 的一个子集被纳入 Nemotron 3 Nano Omni 模型的训练数据中,为其计算机使用能力做出了贡献。

英文摘要

Training computer-use agents (CUAs) -- models that interact with graphical desktops through screenshots and keyboard/mouse actions -- requires large-scale, diverse trajectory data collected in full desktop environments. The largest public resource, AgentNet (22.5K human trajectories), leads to negative transfer when used for supervised fine-tuning (SFT): continuing to train UI-TARS 7B on AgentNet causes the OSWorld success rate to fall from 26.3% to 8-10%. We present ProCUA-SFT, a dataset of 3.1M step-level SFT samples distilled from 93K synthetic trajectories across 2,484 application combinations. The dataset is produced by a fully automated pipeline that (i) synthesizes grounded tasks on live desktops seeded with real-world content -- 912 spreadsheets from SpreadsheetBench, approximately 10K permissively-licensed presentations from Zenodo10K, and multi-application OSWorld configs -- and (ii) verifies each task's feasibility through binary precondition checking before rollout. A single VLM (Kimi-K2.5) serves as goal generator, precondition judge, and trajectory executor, eliminating planner-actor capability gaps. Each trajectory is expanded into step-prefix samples that exactly reproduce the context layout seen at inference time. Fine-tuning UI-TARS 7B on ProCUA-SFT for one epoch yields 45.0% on OSWorld -- an 18.7 percentage-point improvement over the base model (26.3%) and more than 35 percentage points above AgentNet-trained counterparts (8-10%). A subset of ProCUA was incorporated into the training data for the Nemotron 3 Nano Omni model, contributing to its computer-use capabilities.
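One plausible reading of the step-prefix expansion mentioned above is sketched below: a trajectory of T steps becomes T training samples, each pairing the history up to step t with the action taken at step t. The record fields (`task`, `history`, `observation`, `target`) and the function name are hypothetical; the abstract does not specify the actual context layout.

```python
def expand_step_prefixes(task, steps):
    """Turn one trajectory into one SFT sample per step.

    task: natural-language goal of the trajectory.
    steps: list of (observation, action) pairs in execution order.
    Each sample exposes only the actions taken before the current step,
    mirroring what an agent would see at inference time (assumed layout).
    """
    samples = []
    for t, (observation, action) in enumerate(steps):
        samples.append({
            "task": task,
            "history": [a for _, a in steps[:t]],  # earlier actions only
            "observation": observation,            # current screenshot
            "target": action,                      # action to be predicted
        })
    return samples
```

Under this reading, the reported totals (3.1M samples from 93K trajectories) correspond to roughly 33 steps per trajectory on average.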

2212.07700 2026-06-17 cs.CV 版本更新

Colab NAS: Obtaining lightweight task-specific convolutional neural networks following Occam's razor

Colab NAS:遵循奥卡姆剃刀原则获取轻量级任务特定卷积神经网络

Andrea Mattia Garavagno, Daniele Leonardis, Antonio Frisoli

发表机构 * Institute of Mechanical Intelligence, Scuola Superiore Sant’Anna of Pisa(机械智能研究所,比萨圣安娜高等学院)

AI总结 提出ColabNAS,一种低成本的硬件感知神经架构搜索方法,通过奥卡姆剃刀启发的无导数搜索策略,在免费GPU服务上3.1小时内获得轻量级CNN,在Visual Wake Word数据集上达到最先进结果。

详情
Journal ref
Future Generation Computer Systems, vol. 152, pp. 152-159, 2024
AI中文摘要

当前从在大数据集上训练的卷积神经网络(CNN)进行迁移学习的趋势,在目标应用是一个自定义且有限的问题,且有足够数据从头训练网络时,可能是一种过度杀伤。另一方面,从头训练自定义且更轻量的CNN需要专业知识,以及在硬件感知神经架构搜索(HW NAS)情况下需要高端资源,这限制了非习惯性神经网络开发者对该技术的访问。因此,我们提出了ColabNAS,一种用于生成轻量级任务特定CNN的经济实惠的HW NAS技术。其新颖的无导数搜索策略受奥卡姆剃刀原则启发,使得在Visual Wake Word数据集(一个标准的TinyML基准)上,仅需使用Google Colaboratory和Kaggle Kernel等免费在线GPU服务,在3.1 GPU小时内即可获得最先进的结果。

英文摘要

The current trend of applying transfer learning from convolutional neural networks (CNNs) trained on large datasets can be overkill when the target application is a custom, well-delimited problem with enough data to train a network from scratch. On the other hand, training custom, lighter CNNs requires expertise in the from-scratch case and/or high-end resources in the case of hardware-aware neural architecture search (HW NAS), which limits access to the technology for non-habitual NN developers. For this reason, we present ColabNAS, an affordable HW NAS technique for producing lightweight task-specific CNNs. Its novel derivative-free search strategy, inspired by Occam's razor, makes it possible to obtain state-of-the-art results on the Visual Wake Word dataset, a standard TinyML benchmark, in just 3.1 GPU hours using free online GPU services such as Google Colaboratory and Kaggle Kernel.

2404.09790 2026-06-17 cs.CV 版本更新

NTIRE 2024 Challenge on Image Super-Resolution (x4): Methods and Results

NTIRE 2024图像超分辨率挑战赛(x4):方法与结果

Zheng Chen, Zongwei Wu, Eduard Zamfir, Kai Zhang, Yulun Zhang, Radu Timofte, Xiaokang Yang, Hongyuan Yu, Cheng Wan, Yuxin Hong, Zhijuan Huang, Yajun Zou, Yuan Huang, Jiamin Lin, Bingnan Han, Xianyu Guan, Yongsheng Yu, Daoan Zhang, Xuanwu Yin, Kunlong Zuo, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou, Hongyu An, Xinfeng Zhang, Zhiyuan Song, Ziyue Dong, Qing Zhao, Xiaogang Xu, Pengxu Wei, Zhi-chao Dou, Gui-ling Wang, Chih-Chung Hsu, Chia-Ming Lee, Yi-Shiuan Chou, Cansu Korkmaz, A. Murat Tekalp, Yubin Wei, Xiaole Yan, Binren Li, Haonan Chen, Siqi Zhang, Sihan Chen, Amogh Joshi, Nikhil Akalwadi, Sampada Malagi, Palani Yashaswini, Chaitra Desai, Ramesh Ashok Tabib, Ujwala Patil, Uma Mudenagudi, Anjali Sarvaiya, Pooja Choksy, Jagrit Joshi, Shubh Kawa, Kishor Upla, Sushrut Patwardhan, Raghavendra Ramachandra, Sadat Hossain, Geongi Park, S. M. Nadim Uddin, Hao Xu, Yanhui Guo, Aman Urumbekov, Xingzhuo Yan, Wei Hao, Minghan Fu, Isaac Orais, Samuel Smith, Ying Liu, Wangwang Jia, Qisheng Xu, Kele Xu, Weijun Yuan, Zhan Li, Wenqin Kuang, Ruijin Guan, Ruting Deng, Zhao Zhang, Bo Wang, Suiyi Zhao, Yan Luo, Yanyan Wei, Asif Hussain Khan, Christian Micheloni, Niki Martinel

发表机构 * CVLAI

AI总结 本文回顾NTIRE 2024图像超分辨率挑战赛(x4),总结参赛方案和成果,推动单图像超分辨率性能边界并概述当前趋势。

Comments NTIRE 2024 webpage: https://cvlai.net/ntire/2024. Code: https://github.com/zhengchen1999/NTIRE2024_ImageSR_x4

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 6108-6132
AI中文摘要

本文回顾了NTIRE 2024图像超分辨率($\times$4)挑战赛,重点介绍了提出的解决方案和获得的结果。该挑战涉及利用先验信息从低分辨率(LR)输入生成对应的高分辨率(HR)图像,放大倍数为四倍。LR图像来源于双三次下采样退化。挑战的目标是获得具有最先进SR性能的设计/解决方案,对计算资源(如模型大小和FLOPs)或训练数据没有限制。该赛道在DIV2K测试数据集上使用PSNR指标评估性能。比赛吸引了199名注册者,其中20支队伍提交了有效参赛作品。这一集体努力不仅推动了单图像SR的性能边界,还提供了对该领域当前趋势的全面概述。

英文摘要

This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained. The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs using prior information. The LR images originate from bicubic downsampling degradation. The aim of the challenge is to obtain designs/solutions with the most advanced SR performance, with no constraints on computational resources (e.g., model size and FLOPs) or training data. The track of this challenge assesses performance with the PSNR metric on the DIV2K testing dataset. The competition attracted 199 registrants, with 20 teams submitting valid entries. This collective endeavour not only pushes the boundaries of performance in single-image SR but also offers a comprehensive overview of current trends in this field.
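For readers unfamiliar with the ranking metric, PSNR can be computed as in the sketch below. The function name and the flat pixel-list input are illustrative assumptions; the challenge's exact protocol (colour channel used, border cropping) is not stated in this abstract and is not reproduced here.

```python
import math

def psnr(reference, restored, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images.

    reference, restored: flat sequences of pixel intensities.
    max_value: largest possible intensity (255 for 8-bit images).
    """
    if len(reference) != len(restored) or not reference:
        raise ValueError("images must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)
```

For example, a mean squared error of 100 on 8-bit images gives 10·log10(255² / 100) ≈ 28.13 dB.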

2504.14582 2026-06-17 cs.CV 版本更新

NTIRE 2025 Challenge on Image Super-Resolution (x4): Methods and Results

NTIRE 2025 图像超分辨率(×4)挑战赛:方法与结果

Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Xiangyu Kong, Xiaoxuan Yu, Hyunhee Park, Suejin Han, Hakjae Jeon, Dafeng Zhang, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Lu Zhao, Yuyi Zhang, Pengyu Yan, Jiawei Hu, Pengwei Liu, Fengjun Guo, Hongyuan Yu, Pufan Xu, Zhijuan Huang, Shuyuan Cui, Peng Guo, Jiahui Liu, Dongkai Zhang, Heng Zhang, Huiyuan Fu, Huadong Ma, Yanhui Guo, Sisi Tian, Xin Liu, Jinwen Liang, Jie Liu, Jie Tang, Gangshan Wu, Zeyu Xiao, Zhuoyuan Li, Yinxiang Zhang, Wenxuan Cai, Vijayalaxmi Ashok Aralikatti, Nikhil Akalwadi, G Gyaneshwar Rao, Chaitra Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Marcos V. Conde, Alejandro Merino, Bruno Longarela, Javier Abad, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Aagam Jain, Milan Kumar Singh, Ankit Kumar, Shubh Kawa, Divyavardhan Singh, Anjali Sarvaiya, Kishor Upla, Raghavendra Ramachandra, Chia-Ming Lee, Yu-Fan Lin, Chih-Chung Hsu, Risheek V Hiremath, Yashaswini Palani, Yuxuan Jiang, Qiang Zhu, Siyue Teng, Fan Zhang, Shuyuan Zhu, Bing Zeng, David Bull, Jingwei Liao, Yuqing Yang, Wenda Shao, Junyi Zhao, Qisheng Xu, Kele Xu, Sunder Ali Khowaja, Ik Hyun Lee, Snehal Singh Tomar, Rajarshi Ray, Klaus Mueller, Sachin Chaudhary, Surya Vashisth, Akshay Dudhane, Praful Hambarde, Satya Naryan Tazi, Prashant Patil, Santosh Kumar Vipparthi, Subrahmanyam Murala, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Zahra Moammeri, Ahmad Mahmoudi-Aznaveh, Ali Karbasi, Hossein Motamednia, Liangyan Li, Guanhua Zhao, Kevin Le, Yimo Ning, Haoxuan Huang, Jun Chen

发表机构 * CVPR 2025

AI总结 本文介绍NTIRE 2025图像超分辨率(×4)挑战赛,包括恢复和感知两个子赛道,总结比赛设计、数据集、评估协议及25个团队的提交方法。

Comments NTIRE 2025 webpage: https://www.cvlai.net/ntire/2025. Code: https://github.com/zhengchen1999/NTIRE2025_ImageSR_x4

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 1525-1535
AI中文摘要

本文介绍了NTIRE 2025图像超分辨率(×4)挑战赛,这是CVPR 2025第10届NTIRE Workshop的关联竞赛之一。该挑战旨在从通过双三次下采样生成的×4比例低分辨率图像中恢复高分辨率图像,目标是开发有效的网络设计或解决方案以实现最先进的超分辨率性能。为反映图像超分辨率研究的双重目标,挑战包含两个子赛道:(1)恢复赛道,强调像素级精度,根据PSNR对提交结果进行排名;(2)感知赛道,关注视觉真实感,根据感知分数对结果进行排名。共有286名参与者注册了比赛,25个团队提交了有效作品。本报告总结了挑战设计、数据集、评估协议、主要结果以及每个团队的方法。该挑战作为基准,旨在推动图像超分辨率领域的最先进技术并促进其进步。

英文摘要

This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, which emphasizes pixel-wise accuracy and ranks submissions by PSNR; and (2) a perceptual track, which focuses on visual realism and ranks submissions by a perceptual score. A total of 286 participants registered for the competition, with 25 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, main results, and each team's methods. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.

2512.16978 2026-06-17 cs.CV 版本更新

A Benchmark for Omni-Modal Reasoning in Long Videos

长视频全模态推理基准

Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal

AI总结 提出LongShOTBench基准,用于评估长视频中视觉、语音和环境音频的全模态推理,并引入无训练的全模态证据搜索代理LongShOTAgent,在105个模型上取得最优性能。

详情
AI中文摘要

长形式全模态视频理解需要整合视觉、语音和环境音频,并进行连贯的长上下文推理。现有的视频基准通常在时间尺度、模态覆盖、开放式交互和可解释评分之间进行权衡。为了解决这一差距,我们引入了LongShOTBench,一个围绕三个耦合目标设计的长视频理解基准:整体全模态集成、意图驱动的开放式交互和规则级诊断。它从真实观看场景构建单轮和多轮问题,通过系统任务探究视觉、语音、环境音频、时间和跨模态推理。每个项目包括一个参考答案和一个加权标准级规则,让评估识别哪些感知事实、时间链接、模态接地要求和推理步骤得到满足或遗漏。所有样本都经过手动验证,以提高接地性、清晰度和规则可靠性。我们还引入了LongShOTAgent,一个无训练的全模态证据搜索代理,将全视频预处理与目标检索、查询自适应片段细化以及基于视觉、语音和非语音音频证据的显式声明验证相结合。其迭代搜索-细化-验证循环暴露中间证据,并让模态特定专家在回答之前重新分析相关时刻。我们评估了105个视频能力模型,涵盖开源全模态模型、视觉语言系统、音频LLM、代理管道和闭源API。当前的MLLM远未饱和LongShOTBench,而我们的LongShOTAgent是最强的无训练系统,达到66.64%的整体性能。通过发布基准、排行榜和方法,我们为推进长形式全模态视频推理提供了一个共享、可解释的测试平台。代码、数据和排行榜可在以下网址获取:https://longshot.cvmbzuai.com/

英文摘要

Long-form omni-modal video understanding requires integrating vision, speech, and ambient audio with coherent long-context reasoning. Existing video benchmarks often trade off temporal scale, modality coverage, open-ended interaction, and interpretable scoring. To address this gap, we introduce LongShOTBench, a long video understanding benchmark designed around three coupled goals: holistic omni-modal integration, intent-driven open-ended interaction, and rubric-level diagnosis. It builds single- and multi-turn questions from real viewing scenarios, with systematic tasks probing visual, speech, ambient-audio, temporal, and cross-modal reasoning. Each item includes a reference answer and a weighted criterion-level rubric, letting evaluation identify which perceptual facts, temporal links, modality-grounding requirements, and reasoning steps are satisfied or missed. All samples are manually verified to improve grounding, clarity, and rubric reliability. We also introduce LongShOTAgent, a training-free omni-modal evidence-seeking agent coupling full-video preprocessing with targeted retrieval, query-adaptive segment refinement, and explicit claim verification over visual, speech, and non-speech audio evidence. Its iterative search-refine-verify loop exposes intermediate evidence and lets modality-specific specialists re-analyze relevant moments before answering. We evaluate 105 video-capable models spanning open-source omni-modal models, vision-language systems, audio LLMs, agentic pipelines and closed-source APIs. Current MLLMs remain far from saturating LongShOTBench, while our LongShOTAgent is the strongest training-free system, reaching 66.64% overall. By releasing the benchmark, leaderboard, and method, we provide a shared, interpretable testbed for advancing long-form omni-modal video reasoning. Code, data, and the leaderboard are available at https://longshot.cvmbzuai.com/.

2601.19099 2026-06-17 cs.CV cs.AI 版本更新

m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

m2sv: 地图到街景空间推理的可扩展基准

Yosub Shin, Michael Buriek, Igor Molybog

AI总结 提出m2sv基准,通过匹配朝北俯视图与街景图像推断相机方向,评估VLM空间推理能力;最佳模型准确率65.2%,低于人类72.0%,揭示几何对齐与推理一致性的差距。

详情
AI中文摘要

视觉-语言模型(VLM)在许多多模态基准上表现强劲,但在需要将抽象俯视图表示与自我中心视图对齐的空间推理任务上仍然脆弱。我们引入m2sv,一个用于地图到街景空间推理的可扩展基准,要求模型通过将朝北俯视图与在同一真实世界交叉口拍摄的街景图像对齐来推断相机视角方向。我们发布了m2sv-20k,一个具有受控歧义的地理多样化基准,以及m2sv-sft-11k,一个用于监督微调的精选结构化推理轨迹集。尽管在现有多模态基准上表现强劲,但最佳评估的VLM在m2sv上仅达到65.2%的准确率,低于人类标注者的平均72.0%(专家可达95%),且标注者间一致性高($\kappa$高达0.76)。虽然监督微调和强化学习带来持续改进,但跨基准评估显示迁移有限。除了总体准确率,我们使用结构信号和人工努力系统分析了地图到街景推理的难度,并对适应的开放模型进行了广泛的失败分析。我们的发现凸显了几何对齐、证据聚合和推理一致性方面的持续差距,为跨视角的接地空间推理的未来工作提供了动力。

英文摘要

Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, below human annotators who reach 72.0% on average (and 95% for an expert) with strong inter-annotator agreement ($\kappa$ up to 0.76). While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.
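The inter-annotator agreement quoted above is a kappa statistic. The abstract does not say which variant was used, so the sketch below assumes Cohen's kappa for two annotators; the function name and the compass-direction labels are illustrative only.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both annotators must label the same non-empty item set")
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each annotator labelled at random
    # according to their own label frequencies.
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)
```

With four direction classes, two annotators who agree on 7 of 8 items can reach a kappa of about 0.83, which illustrates why a raw agreement rate and kappa should not be compared directly.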

2603.26551 2026-06-17 cs.CV cs.AI 版本更新

Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

超越MACs:面向视觉骨干网络的硬件高效架构设计

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

发表机构 * Machine Learning and Perception Lab, University of Udine(乌迪内大学机器学习与感知实验室) Centre for Vision Research, York University(约克大学视觉研究中心)

AI总结 针对MACs指标在边缘设备上的不足,提出基于硬件效率洞察的LowFormer骨干网络,通过轻量级Lowtention模块实现显著加速。

Comments Accepted at International Journal of Computer Vision (IJCV)

详情
Journal ref
Int J Comput Vis 134, 295 (2026)
AI中文摘要

视觉骨干网络在现代计算机视觉中扮演核心角色。提升其效率直接惠及广泛下游应用。为衡量效率,许多出版物依赖MACs(乘累加操作)作为执行时间的预测指标。本文通过实验证明该指标的缺陷,尤其在边缘设备场景下。通过对比常见架构设计元素的MAC计数和执行时间,我们识别出高效执行的关键因素,并提供优化骨干设计的见解。基于这些见解,我们提出LowFormer,一种新型视觉骨干家族。LowFormer采用流线型的宏观和微观设计,包括Lowtention——多头自注意力的轻量级替代方案。Lowtention不仅更高效,还在ImageNet上取得了更优结果。此外,我们提出LowFormer的边缘GPU版本,可进一步提升其在边缘GPU和桌面GPU上的基线速度。通过在更小图像分类数据集上的评估以及将其适配到多个下游任务(如目标检测、语义分割、图像检索和视觉目标跟踪),我们展示了LowFormer的广泛适用性。与近期最先进的骨干网络相比,LowFormer模型在各种硬件平台上均实现了显著加速。代码和模型见:https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md

英文摘要

Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer, that can further improve upon its baseline's speed on edge GPU and desktop GPU. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.
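To make the MAC metric concrete, here is a minimal sketch that counts MACs for a standard 3x3 convolution and for a depthwise-separable replacement. The layer sizes and the helper `conv2d_macs` are illustrative assumptions, not values from the paper. The count covers arithmetic only, so it says nothing about memory traffic or parallelism, which is the gap between MACs and measured latency that the abstract highlights.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """MACs of a 2D convolution: one MAC per kernel tap per output value."""
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel


# Hypothetical layer: 56x56 output map, 64 -> 128 channels.
h = w = 56
c_in, c_out = 64, 128

standard = conv2d_macs(h, w, c_in, c_out, kernel=3)
# Depthwise 3x3 (one filter per input channel) followed by pointwise 1x1.
depthwise = conv2d_macs(h, w, c_in, c_in, kernel=3, groups=c_in)
pointwise = conv2d_macs(h, w, c_in, c_out, kernel=1)
separable = depthwise + pointwise

print(f"standard 3x3:        {standard:,} MACs")
print(f"depthwise separable: {separable:,} MACs")
print(f"MAC ratio:           {standard / separable:.1f}x")
```

A large MAC ratio like this one does not guarantee a matching speed-up on a given device. Checking that requires timing both layers on the target hardware, which this sketch deliberately leaves out.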

2605.23176 2026-06-17 cs.CV 版本更新

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

DRIVESPATIAL:自动驾驶中视觉语言模型时空智能的基准

Hao Vo, Khoa Vo, Phu Loc Nguyen, Sieu Tran, Duc Minh Nguyen, Ngo Xuan Cuong, Gladys Gawugah, Sreevenkata Anjani Tishita Godavarthi, Chase Rainwater, Nghi D. Q. Bui, Anh Nguyen, Duy Minh Ho Nguyen, Ngan Le

发表机构 * University of Arkansas, USA(美国阿肯色大学) Google Research, Google(谷歌研究院) University of Liverpool, UK(英国利物浦大学) Max Planck Research School for Intelligent Systems(马克斯·普朗克智能系统研究学校)

AI总结 提出DriveSpatial基准,通过多视角、时空推理任务评估视觉语言模型在自动驾驶中的场景构建、关系理解、时序推理和泛化能力,发现人类与模型间存在显著差距。

详情
AI中文摘要

自动驾驶中的时空智能要求智能体将多视角观测整合为连贯的场景表示,跨视角和时间保持物体连续性,并推理空间关系、交互和未来动态。然而,现有的自动驾驶视觉语言基准主要关注单视角、静态、自我中心或单源问答,尚不清楚当前视觉语言模型(VLM)能否真正构建和推理动态驾驶场景。我们引入了DriveSpatial,一个包含来自五个大规模自动驾驶数据集的20个任务、15.6K人工验证问答对的基准。DriveSpatial评估四种能力:认知场景构建、多视角关系理解、时序推理和泛化。与之前的基准不同,DriveSpatial是从一个动态多关系场景图生成的,该图编码了物体状态、空间关系、交互、相机可见性和时间对应关系,从而产生强制进行真正的跨视角和时空推理的问答对。评估15个代表性VLM揭示了显著的人机差距:最强模型落后人类28.4分,其中认知场景构建成为关键瓶颈。进一步诊断表明,仅语言提示不足,而显式BEV基础一致地提升性能。这些结果表明,当前VLM缺乏可靠的时空驾驶智能所需的场景构建能力。DriveSpatial及其构建流程将发布以支持未来研究。

英文摘要

Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.

2606.14096 2026-06-17 cs.CV 版本更新

A New Multi-Domain Benchmark for Micro-Action Recognition and Detection

微动作识别与检测的新多领域基准

Yanbin Hao, Pengyu Liu, Xing Wei, Xun Yang, Dan Guo, Meng Wang

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机科学与信息工程学院) School of Information Science and Technology, University of Science and Technology of China(中国科学技术大学信息科学技术学院)

AI总结 提出MMA-82,一个大规模多领域微动作基准,扩展至82个类别、4个领域,涵盖识别与多标签检测任务,实验表明现有方法在域迁移、长尾分布等场景下仍面临挑战。

Comments 10 pages, 9 figures

详情
AI中文摘要

微动作是全身层面持续时间短、幅度低的细微身体运动,能够揭示潜在意图、非自愿反应和细粒度情感变化。我们之前的MA-52基准为微动作识别提供了重要基础,但在规模、场景多样性、任务覆盖和评估协议方面仍有限。为了将微动作分析推向更真实和全面的场景,我们引入了MMA-82,这是MA-52的大规模多领域扩展。MMA-82将标签空间从52个细粒度微动作类别扩展到82个,并涵盖四个不同领域,包括实验室访谈、街头访谈、精神病患者访谈和情感丰富的电视视频,最终从454名受试者中获得了77,856个标注实例。基于MMA-82,我们建立了两个核心任务:微动作识别和多标签微动作检测。对于识别,我们进一步定义了域内和跨域协议,包括少样本和零样本设置,以评估模型的鲁棒性、可迁移性和泛化能力。大量实验表明,当前方法在真实微动作理解中仍面临困难,尤其是在域迁移、长尾类别分布和复杂时间定位下。除了基准测试,我们还研究了微动作与情感之间的关系,表明微动作与情感状态密切相关,并为面部微表情提供补充线索,以改进情感识别。这些结果表明,MMA-82是真实微动作分析的全面且具有挑战性的基准,也是以人为中心的AI的宝贵资源。MMA-82可在以下网址获取:https://lpynow.github.io/MMA-82-AIM/

英文摘要

Micro-actions are short-duration, low-amplitude subtle body movements at the whole-body level that can reveal latent intentions, involuntary reactions, and fine-grained affective changes. Our previous MA-52 benchmark has provided an important foundation for micro-action recognition, but it remains limited in scale, scene diversity, task coverage, and evaluation protocols. To advance micro-action analysis toward more realistic and comprehensive settings, we introduce MMA-82, a large-scale multi-domain extension of MA-52. MMA-82 expands the label space from 52 to 82 fine-grained micro-action categories and covers four distinct domains, including laboratory interviews, street interviews, psychiatric patient interviews, and emotion-rich television videos, resulting in 77,856 annotated instances from 454 subjects. Built upon MMA-82, we establish two core tasks: Micro-Action Recognition and Multi-label Micro-Action Detection. For recognition, we further define in-domain and cross-domain protocols, including few-shot and zero-shot settings, to evaluate model robustness, transferability, and generalization. Extensive experiments show that current methods still struggle with realistic micro-action understanding, especially under domain shift, long-tailed category distributions, and complex temporal localization. Beyond benchmarking, we investigate the relationship between micro-actions and emotion, showing that micro-actions are strongly associated with emotional states and provide complementary cues to facial micro-expressions for improved emotion recognition. These results demonstrate that MMA-82 serves as a comprehensive and challenging benchmark for realistic micro-action analysis and a valuable resource for human-centered AI. MMA-82 is available at https://lpynow.github.io/MMA-82-AIM/.

2606.14383 2026-06-17 cs.CV 版本更新

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

IndustryBench-MIPU:面向工业产品的多图像属性值提取基准

Haonan Qi, Jin Cao, Yongqi Zhang, Xintong Wang, Weidong Tang, Bin Chen, Chengfu Huo, Haojun Pan, Hengyu You, Jing Li, Yingde Wang, Liang Ding

发表机构 * Multimodal and Industrial AI Team(多模态与工业AI团队) Taobao&Tmall, Alibaba Group(淘宝&天猫,阿里巴巴集团)

AI总结 提出首个多图像工业产品理解基准IndustryBench-MIPU,通过结构化属性提取任务评估多模态大模型在规格表、铭牌、技术图纸上的文本识别、视觉推理、领域知识和跨图像证据整合能力,发现多图像完整性是核心瓶颈。

详情
AI中文摘要

工业产品(如阀门和断路器)由密集的技术规格定义,这些规格支配着供应链中的采购、兼容性和安全性。这些规格分散在多个异构的产品图像中,包括规格表、铭牌和技术图纸,然而多模态大语言模型(MLLMs)能否可靠地恢复它们仍未被充分探索。为填补这一空白,我们引入了IndustryBench-MIPU,这是首个用于多图像工业产品理解的大规模基准,围绕结构化属性提取构建——从产品图像中恢复属性-值对。该任务共同探究了规格表和铭牌上的文本识别、技术图纸上的视觉推理、解码工业术语的领域知识,以及跨图像证据整合以组装分散的规格。具体而言,该基准包含来自27,652张图像的4,559个产品,具有跨越18个工业类别的103,703个标注,通过多模型共识和三层质量保证构建。在单图像和产品级多图像设置下评估九个MLLMs,揭示了一个显著的完整性差距:模型实现了高精度(86-94%),但最佳模型仅恢复了49.9%的产品级属性;从单图像到多图像提取,召回率下降了15-34个百分点。多图像完整性,而非单图像准确性,是核心瓶颈。数据集和代码已公开。

英文摘要

Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction: recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86-94%) but the best recovers only 49.9% of product-level attributes; moving from single-image to multi-image extraction costs 15-34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.
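The completeness gap described above is the familiar pattern of high precision with low recall. A minimal sketch, assuming exact-match scoring over sets of (attribute, value) pairs, shows how a model can be right about everything it extracts and still miss half the specification. The benchmark's actual matching rules are not given in the abstract, and the attribute names below are hypothetical.

```python
def precision_recall(predicted, gold):
    """Precision and recall over sets of (attribute, value) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical circuit breaker: six gold attributes spread over several images.
gold = {
    ("rated_voltage", "230 V"),
    ("rated_current", "16 A"),
    ("poles", "2"),
    ("breaking_capacity", "6 kA"),
    ("ip_rating", "IP20"),
    ("standard", "IEC 60898-1"),
}
# The model only reports what it read from the nameplate image.
predicted = {
    ("rated_voltage", "230 V"),
    ("rated_current", "16 A"),
    ("poles", "2"),
}

precision, recall = precision_recall(predicted, gold)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```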

2411.06842 2026-06-17 eess.IV cs.CV 版本更新

Evaluating Synthetic Data Generation for Domain Generalization in Fetal Brain MRI Segmentation

评估胎儿脑MRI分割中域泛化的合成数据生成

Vladyslav Zalevskyi, Thomas Sanchez, Margaux Roulet, Busra Bulut, Hélène Lajous, Jordina Aviles Verdera, Sara Neves Silva, Georg Langs, Gregor Kasprian, Roxane Licandro, Jana Hutter, Hamza Kebiri, Meritxell Bach Cuadra

发表机构 * Department of Radiology, Lausanne University Hospital and University of Lausanne (UNIL)(洛桑大学医院及洛桑大学(UNIL)放射科) CIBM Center for Biomedical Imaging(CIBM生物医学成像中心) Institute for Information Processing, Leibniz University Hannover(汉诺威莱布尼茨大学信息处理研究所) Department of Biomedical Engineering, School of Biomedical Engineering & Imaging Sciences, King’s College London(伦敦国王学院生物医学工程与成像科学学院生物医学工程系) Department of Biomedical Imaging and Image-Guided Therapy, Division of Neuroradiology and Musculoskeletal Radiology, Medical University of Vienna(维也纳医学大学生物医学成像与影像引导治疗系,神经放射学与肌肉骨骼放射学部) Department of Biomedical Imaging and Image-guided Therapy, Computational Imaging Research Lab (CIR), Medical University of Vienna(维也纳医学大学生物医学成像与影像引导治疗系,计算成像研究实验室(CIR)) Christian Doppler Laboratory for Mathematical Modelling and Simulation of Next-Generation Medical Ultrasound Devices, Medical University of Vienna(维也纳医学大学下一代医学超声设备数学建模与仿真克里斯蒂安·多普勒实验室) Comprehensive Center for Artificial Intelligence in Medicine, Medical University of Vienna(维也纳医学大学医学人工智能综合中心) Division of Neuroradiology and Musculoskeletal Radiology, Department of Biomedical Imaging and Image–guided Therapy, Medical University of Vienna(维也纳医学大学生物医学成像与影像引导治疗系,神经放射学与肌肉骨骼放射学部)

AI总结 针对胎儿脑MRI分割中数据异质性和标注不足问题,研究基于域随机化的合成数据生成策略,提出FetalSynthSeg框架,通过高斯混合强度建模和强度聚类提升跨域鲁棒性,在多个数据集上达到最优性能。

详情
AI中文摘要

从磁共振成像(MRI)中进行胎儿脑组织分割对于研究神经发育至关重要,但由于数据异质性和有限标注而仍然具有挑战性。域随机化(DR)最近作为一种有前景的单源域泛化策略出现,通过合成具有随机伪影、对比度和分辨率的训练图像。在这项工作中,我们研究了如何最大化基于DR的方法的域外(OOD)泛化能力。我们评估了几种用于DR的合成数据生成策略,特别关注我们最近提出的框架FetalSynthSeg。我们表明,简单的高斯混合强度建模优于更复杂的基于物理的模拟,并且强度聚类(根据强度细分组织类别)提高了OOD鲁棒性。在来自四个站点的348个胎儿受试者(涵盖0.55-3T以及T1w和T2w对比)上评估,FetalSynthSeg在多个FeTA 2024测试数据集上达到了最先进的性能(80-85 Dice分数),并且首次在T2w以外的模态上为胎儿脑分割提供了鲁棒的分割(在dHCP-T1w数据集上达到80 Dice)。与最先进的方法(如BOUNTI、nnU-Net集成和FeTA 2024获胜者)相比,FetalSynthSeg在保持跨域偏移的强鲁棒性的同时,提供了相当或更优的准确性。我们的代码、模型权重和便于推理的Docker镜像可在以下网址获取:https://hub.docker.com/r/vzalevskyi/fetalsynthseg

英文摘要

Fetal brain tissue segmentation from magnetic resonance imaging (MRI) is crucial for studying neurodevelopment, but remains challenging due to data heterogeneity and limited annotations. Domain randomization (DR) has recently emerged as a promising strategy for single-source domain generalization by synthesizing training images with randomized artifacts, contrast, and resolution. In this work, we investigate how to maximize the out-of-domain (OOD) generalization of DR-based methods. We evaluate several synthetic data generation strategies for DR, with a particular focus on our recently proposed framework, FetalSynthSeg. We show that simple Gaussian mixture-based intensity modeling outperforms more complex physics-based simulations, and that intensity clustering (subdividing tissue classes based on intensity) improves OOD robustness. Evaluated on 348 fetal subjects from four sites spanning 0.55-3T and both T1w and T2w contrasts, FetalSynthSeg reaches state-of-the-art performance on several FeTA 2024 testing datasets (80-85 Dice score) and, for the first time, offers robust segmentation on modalities other than T2w for fetal brain segmentation (80 Dice on dHCP-T1w dataset). Compared with state-of-the-art methods such as BOUNTI, nnU-Net ensemble, and the FeTA 2024 winner, FetalSynthSeg delivers comparable or superior accuracy while maintaining strong robustness across domain shifts. Our code, model weights, and Docker image ready for easy inference are available at https://hub.docker.com/r/vzalevskyi/fetalsynthseg.
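The idea of Gaussian mixture-based intensity modeling with sub-clusters can be illustrated with a small sketch. This is not the FetalSynthSeg implementation. The function `synthesize_intensities`, the parameter ranges, and the random per-pixel sub-cluster assignment are all assumptions chosen to keep the example short; a real pipeline would derive sub-clusters from intensity clustering of actual scans and add artifacts and resolution changes.

```python
import random


def synthesize_intensities(label_map, n_subclusters=2, seed=0):
    """Turn an integer label map into a synthetic image with values in [0, 1].

    Every (label, sub-cluster) pair draws its own random Gaussian mean and
    standard deviation, so tissue contrast differs for every seed.
    """
    rng = random.Random(seed)
    params = {}  # (label, sub-cluster) -> (mean, std)
    image = []
    for row in label_map:
        out_row = []
        for label in row:
            # Assumption: sub-clusters are assigned at random per pixel.
            key = (label, rng.randrange(n_subclusters))
            if key not in params:
                params[key] = (rng.uniform(0.0, 1.0), rng.uniform(0.01, 0.1))
            mean, std = params[key]
            value = rng.gauss(mean, std)
            out_row.append(min(1.0, max(0.0, value)))  # clip to [0, 1]
        image.append(out_row)
    return image


# Hypothetical 3x4 label map with three tissue classes.
labels = [
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [0, 2, 2, 1],
]
image = synthesize_intensities(labels, seed=42)
for row in image:
    print(" ".join(f"{v:.2f}" for v in row))
```

Because the labels stay fixed while the intensities are resampled, a segmentation network trained on such pairs sees the same anatomy under many contrasts, which is the intuition behind domain randomization.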

2502.00241 2026-06-17 cs.LG cs.AI cs.CL cs.CV 版本更新

Mordal: Automated Pretrained Model Selection for Vision Language Models

Mordal:面向视觉语言模型的自动化预训练模型选择

Shiqi He, Insu Jang, Mosharaf Chowdhury

AI总结 提出Mordal框架,通过减少候选模型数量和评估时间,自动化搜索用户定义任务的最佳视觉语言模型,相比网格搜索降低GPU耗时8.9-11.6倍,加权Kendall's τ平均提升69%。

详情
AI中文摘要

将多种模态融入大型语言模型(LLMs)是增强其对非文本数据理解、使其能够执行多模态任务的有效方式。视觉语言模型(VLMs)因其在医疗、机器人和无障碍等领域的众多实际应用,成为增长最快的多模态模型类别。然而,尽管文献中不同的VLM在不同基准测试中展现出令人印象深刻的视觉能力,它们都是由人类专家手工设计的;目前尚无自动化框架来创建特定任务的多模态模型。我们引入Mordal,一种自动化多模态模型搜索框架,能够高效地为用户定义的任务找到最佳VLM,无需人工干预。Mordal通过减少搜索过程中需考虑的候选模型数量以及最小化评估每个剩余候选模型所需的时间来实现这一目标。我们的评估表明,Mordal能够找到给定问题的最佳VLM,其GPU耗时比网格搜索低8.9倍至11.6倍。我们还发现,Mordal在不同任务上平均比最先进的模型选择方法实现约69%更高的加权Kendall's τ。

英文摘要

Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities in different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using $8.9\times$ to $11.6\times$ fewer GPU hours than grid search. We have also discovered that Mordal achieves about 69% higher weighted Kendall's $\tau$ on average than the state-of-the-art model selection method across diverse tasks.
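Weighted Kendall's τ measures how well a cheap proxy ranking of candidate models matches their true ranking, with mistakes near the top penalized more. The sketch below assumes an additive hyperbolic weighting, in which each pair is weighted by the inverse ranks of its two items. The abstract does not state the exact weighting Mordal uses, and the scores are invented for illustration. SciPy provides `scipy.stats.weightedtau` as a library implementation; the pure-Python version here only shows the mechanics and does not handle ties specially.

```python
def weighted_kendall_tau(true_scores, predicted_scores):
    """Weighted Kendall's tau between two score lists.

    Pair (i, j) gets weight 1/(r_i + 1) + 1/(r_j + 1), where r is the 0-based
    rank under true_scores, so disagreements among top items cost more.
    """
    n = len(true_scores)
    if n != len(predicted_scores) or n < 2:
        raise ValueError("need two equally long lists with at least two items")
    order = sorted(range(n), key=lambda i: -true_scores[i])
    rank = {item: r for r, item in enumerate(order)}
    num = den = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            weight = 1.0 / (rank[i] + 1) + 1.0 / (rank[j] + 1)
            dx = true_scores[i] - true_scores[j]
            dy = predicted_scores[i] - predicted_scores[j]
            # +1 if both lists order the pair the same way, -1 if opposite.
            sign = ((dx > 0) - (dx < 0)) * ((dy > 0) - (dy < 0))
            num += weight * sign
            den += weight
    return num / den


# Hypothetical true accuracies and proxy scores for four candidate VLMs.
true_acc = [0.71, 0.65, 0.80, 0.58]
proxy = [0.40, 0.35, 0.55, 0.38]
tau = weighted_kendall_tau(true_acc, proxy)
print(f"weighted tau = {tau:.3f}")
```

In this example the proxy only swaps the two weakest candidates, so the score stays high. Swapping the two strongest would lower it more, which is the behavior a model selection metric needs.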

2505.00986 2026-06-17 cs.LG cs.CV 版本更新

EmbodiTTA: Resource-Efficient Test-Time Adaptation for Embodied Visual Systems

EmbodiTTA:面向具身视觉系统的资源高效测试时自适应

Xiao Ma, Young D. Kwon, Dong Ma

AI总结 提出按需测试时自适应范式OD-TTA,通过轻量域移检测、源域选择和分离批归一化更新,在边缘设备上实现高效准确的自适应,显著降低计算和能耗开销。

详情
AI中文摘要

连续测试时自适应(CTTA)持续对每个到达的数据批次调整部署模型。虽然达到了最优精度,但现有的CTTA方法由于巨大的内存开销和能耗,在资源受限的边缘设备上实际应用性差。本文首先引入一种新范式——按需TTA,仅在检测到显著域移时触发自适应。然后,我们提出OD-TTA,一种用于边缘设备上准确高效自适应的按需TTA框架。OD-TTA包含三项创新技术:1)轻量级域移检测机制,仅在需要时激活TTA,大幅降低总体计算开销;2)源域选择模块,选择合适的源模型进行自适应,确保高且鲁棒的精度;3)解耦的批归一化(BN)更新方案,实现小批量下的内存高效自适应。大量实验表明,OD-TTA在显著降低能量和计算开销的同时,实现了可比甚至更好的性能,使TTA成为实际可行的技术。

英文摘要

Continual test-time adaptation (CTTA) continuously adapts the deployed model on every incoming batch of data. While achieving optimal accuracy, existing CTTA approaches present poor real-world applicability on resource-constrained edge devices, due to the substantial memory overhead and energy consumption. In this work, we first introduce a novel paradigm, on-demand TTA, which triggers adaptation only when a significant domain shift is detected. Then, we present OD-TTA, an on-demand TTA framework for accurate and efficient adaptation on edge devices. OD-TTA comprises three innovative techniques: 1) a lightweight domain shift detection mechanism to activate TTA only when it is needed, drastically reducing the overall computation overhead, 2) a source domain selection module that chooses an appropriate source model for adaptation, ensuring high and robust accuracy, 3) a decoupled Batch Normalization (BN) update scheme to enable memory-efficient adaptation with small batch sizes. Extensive experiments show that OD-TTA achieves comparable and even better performance while reducing the energy and computation overhead remarkably, making TTA a practical reality.
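The on-demand idea, adapting only when a shift is detected, can be sketched with a simple trigger. The abstract does not describe the actual detection mechanism in OD-TTA, so everything below is an assumption: the class `ShiftDetector`, the choice of a scalar feature statistic, the exponential moving average, and the threshold are hypothetical stand-ins.

```python
import math


class ShiftDetector:
    """Flags a domain shift when a batch statistic drifts from a reference."""

    def __init__(self, threshold=3.0, momentum=0.9):
        self.threshold = threshold  # allowed drift, in reference std units
        self.momentum = momentum    # smoothing of the running batch mean
        self.ref_mean = None
        self.ref_std = None
        self.running = None

    def calibrate(self, features):
        """Store the mean and std of a feature sample from the known domain."""
        n = len(features)
        self.ref_mean = sum(features) / n
        var = sum((f - self.ref_mean) ** 2 for f in features) / n
        self.ref_std = math.sqrt(var) or 1e-8  # avoid division by zero
        self.running = self.ref_mean

    def update(self, batch):
        """Return True if adaptation should be triggered for this batch."""
        batch_mean = sum(batch) / len(batch)
        self.running = (self.momentum * self.running
                        + (1.0 - self.momentum) * batch_mean)
        drift = abs(self.running - self.ref_mean) / self.ref_std
        return drift > self.threshold


detector = ShiftDetector(threshold=3.0, momentum=0.5)
detector.calibrate([0.9, 1.0, 1.1, 1.0, 0.95, 1.05])
in_domain = detector.update([1.0, 1.02, 0.98])  # statistics match the reference
shifted = detector.update([3.0, 3.1, 2.9])      # statistics moved far away
print(in_domain, shifted)
```

The detector costs a few arithmetic operations per batch, whereas adaptation needs backpropagation, so skipping adaptation on in-domain batches is where the energy saving would come from.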

2512.21315 2026-06-17 cs.LG cs.CV stat.ML 版本更新

Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks

数据处理不等式是否反映实践?论低级任务的有用性

Roy Turgeman, Tom Tirer

AI总结 本文研究低级处理(如去噪、编码)如何提升分类性能,证明在有限样本下存在预处理可提高准确率,并通过实验验证理论趋势。

Comments ICLR 2026 (camera-ready). Code is available at: https://github.com/serveroy/process-before-you-classify

详情
Journal ref
The Fourteenth International Conference on Learning Representations (ICLR 2026)
AI中文摘要

数据处理不等式是一个信息论原理,指出信号的信息内容不能通过处理观测数据而增加。特别地,它表明在解决分类问题之前,增强信号或对其进行编码没有益处。对于最优贝叶斯分类器,这一断言可以被证明是正确的。然而,在实践中,尽管现代深度神经网络具有强大的能力,但在高级下游任务之前执行“低级”任务仍然很常见。在本文中,我们旨在理解低级处理何时以及为何对分类有益。我们提出了一个二元分类设置的综合理论研究,其中我们考虑一个与最优贝叶斯分类器紧密相连的分类器,并随着训练样本数量的增加而收敛到它。我们证明,对于任何有限数量的训练样本,存在一种预分类处理可以提高分类准确率。我们还探讨了类分离、训练集大小和类平衡对该过程相对增益的影响。我们通过理论设置的经验研究来支持我们的理论。最后,我们进行了一项实证研究,调查去噪和编码对基准数据集上实际深度分类器性能的影响。具体来说,我们改变了训练集的大小和类别分布以及噪声水平,并展示了与我们的理论结果一致的趋势。

英文摘要

The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the observations. In particular, it suggests that there is no benefit in enhancing the signal or encoding it before addressing a classification problem. This assertion can be proven to be true for the case of the optimal Bayes classifier. However, in practice, it is common to perform "low-level" tasks before "high-level" downstream tasks despite the overwhelming capabilities of modern deep neural networks. In this paper, we aim to understand when and why low-level processing can be beneficial for classification. We present a comprehensive theoretical study of a binary classification setup, where we consider a classifier that is tightly connected to the optimal Bayes classifier and converges to it as the number of training samples increases. We prove that for any finite number of training samples, there exists a pre-classification processing that improves the classification accuracy. We also explore the effect of class separation, training set size, and class balance on the relative gain from this procedure. We support our theory with an empirical investigation of the theoretical setup. Finally, we conduct an empirical study where we investigate the effect of denoising and encoding on the performance of practical deep classifiers on benchmark datasets. Specifically, we vary the size and class distribution of the training set, and the noise level, and demonstrate trends that are consistent with our theoretical results.
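For reference, the principle the abstract starts from can be written compactly. The notation is an assumption introduced here, not taken from the paper: $Y$ is the class label, $X$ the observation, and $g$ any processing step (deterministic or random) applied to $X$ alone.

```latex
% Data processing inequality for the Markov chain Y -> X -> g(X):
% processing the observation cannot add information about the label.
\[
  Y \to X \to g(X)
  \quad\Longrightarrow\quad
  I\bigl(Y;\, g(X)\bigr) \;\le\; I(Y;\, X).
\]
```

The inequality bounds what an ideal classifier can extract. It does not bound what a classifier trained on finitely many samples achieves, which is the regime the paper studies.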

2602.03300 2026-06-17 cs.LG cs.AI cs.CL cs.CV 版本更新

R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

R1-SyntheticVL:生成模型的合成数据是否已为多模态大语言模型做好准备?

Jingyi Zhang, Tianyi Lin, Huanjin Yao, Xiang Lan, Shunyu Liu, Jiaxing Huang

AI总结 提出集体对抗数据合成(CADS)方法,通过集体智能和对抗学习自动生成高质量、多样且具有挑战性的多模态数据,用于增强多模态大语言模型(MLLM)在复杂现实任务中的性能。

Comments ICML 2026 Camera Ready

详情
AI中文摘要

在这项工作中,我们旨在开发有效的数据合成技术,自主合成多模态训练数据,以增强MLLM解决复杂现实任务的能力。为此,我们提出了集体对抗数据合成(CADS),这是一种新颖且通用的方法,用于合成高质量、多样且具有挑战性的多模态数据。CADS的核心思想是利用集体智能确保高质量和多样化的生成,同时探索对抗学习以合成具有挑战性的样本,从而有效驱动模型改进。具体来说,CADS包含两个循环阶段:集体对抗数据生成(CAD-Generate)和集体对抗数据判断(CAD-Judge)。CAD-Generate利用集体知识共同生成新的多样化多模态数据,而CAD-Judge则协作评估合成数据的质量。此外,CADS引入了一种对抗上下文优化机制,以优化生成上下文,鼓励生成具有挑战性和高价值的数据。通过CADS,我们构建了MMSynthetic-20K并训练了我们的模型R1-SyntheticVL,该模型在多个基准测试中表现出优越的性能。

英文摘要

In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.

13. 其他/综合视觉 17 篇

2606.17427 2026-06-17 cs.CV cs.HC 新提交

Impact of Hand Impairment and Occlusions on Hand Pose Estimation Accuracy in Augmented Reality Applications

手部损伤和遮挡对增强现实应用中手部姿态估计精度的影响

Damian M. Manzone, Mathew Szymanowski, Olga Taran, Shuo Cai, Melissa Marquez-Chin, Tammy Zeng, Hardeep Singh, Cesar Marquez-Chin, José Zariffa

发表机构 * KITE Research Institute, Toronto Rehabilitation Institute, University Health Network(大学健康网络多伦多康复研究所KITE研究所) Institute of Biomedical Engineering, University of Toronto(多伦多大学生物医学工程研究所) Department of Health Sciences and Technology, ETH Zürich(苏黎世联邦理工学院健康科学与技术系) Department of Occupational Science & Occupational Therapy and the Rehabilitation Sciences Institute, University of Toronto(多伦多大学职业科学与职业治疗系及康复科学研究所)

AI总结 研究评估了HoloLens 2和多种姿态估计算法在手部损伤和物体遮挡条件下的精度,发现算法可泛化至手部损伤人群,透明物体略有优势。

详情
AI中文摘要

混合现实应用可设计用于手部康复。增强现实(AR)头戴式显示器(HMD)特别允许生态有效的任务,因为个体可以看到真实环境并与真实物体交互,同时在HMD上接收额外提示。虽然这些应用依赖于准确的手部姿态估计,但在调查手部损伤或真实物体交互遮挡对姿态估计精度的影响方面存在空白。此外,AR HMD预测与最先进姿态估计方法之间的比较尚未建立。本研究评估了HoloLens 2 HMD和最先进姿态估计算法(WiLoR、HaMeR、WildHands和MediaPipe)在颈脊髓损伤(cSCI;n=13,神经损伤水平:C3-C6;美国脊柱损伤协会损伤量表:A-D)患者和15名未受伤对照者与透明和不透明物体交互时的姿态估计精度。通过多摄像头设置三角测量生成3D关节位置的真实值。姿态估计精度在cSCI和未受伤对照组之间没有差异,表明HoloLens 2和姿态估计算法的3D关节预测可以泛化到手部损伤人群。此外,透明物体比不透明物体提供了微小的精度优势(0.1毫米),WiLoR和HaMeR的预测比HoloLens 2略精确(2毫米)。总体而言,这些结果表明HoloLens 2可能适用于手部康复应用,生成的数据集可用于改进手部损伤人群的姿态估计方法。

英文摘要

Mixed reality applications can be designed for hand rehabilitation. Augmented reality (AR) head mounted displays (HMDs) specifically allow for ecologically valid tasks because individuals can see their real environment and interact with real objects while receiving additional cues on the HMD. While these applications rely on accurate hand pose estimation, there is a gap in investigating the influence of hand impairment or occlusion from real-object interactions on pose estimation accuracy. Further, comparisons between AR HMD predictions and state-of-the-art pose estimation methods have not been established. The current study assessed pose estimation accuracy of the HoloLens 2 HMD and state-of-the-art pose estimation algorithms (WiLoR, HaMeR, WildHands, and MediaPipe) while individuals with cervical spinal cord injury (cSCI; n = 13, Neurological Level of Injury: C3-C6; American Spinal Injury Association Impairment Scale: A-D) and 15 uninjured controls interacted with clear and opaque objects. Ground truth estimates of 3D joint positions were generated via triangulation from a multi-camera setup. Pose estimation accuracy did not differ between the cSCI and uninjured control groups suggesting that 3D joint predictions from the HoloLens 2 and pose estimation algorithms can generalize to populations with hand impairment. Further, clear objects provided a small accuracy advantage over opaque objects (0.1 mm) and predictions from both WiLoR and HaMeR were slightly more accurate than the HoloLens 2 (2 mm). Overall, these results suggest that the HoloLens 2 may be viable for hand rehabilitation applications and the dataset generated can be used to refine pose estimation methods for hand-impaired populations.

2606.17713 2026-06-17 cs.CV 新提交

Heterogeneous SAR-optical fusion for near-real-time land use and land cover mapping under cloud contamination: A novel framework and global benchmark dataset

异质SAR-光学融合用于云污染下近实时的土地利用和土地覆盖制图:新框架与全球基准数据集

Jiangong Xu, Weibao Xue, Xiaoyu Yu, Jun Pan, Xinlian Liang, Mi Wang

发表机构 * State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing(测绘遥感信息工程国家重点实验室) School of Computer Science and Information Engineering(计算机科学与信息工程学院) Hangzhou International Innovation Institute(杭州国际创新研究院) Oriental Space Port Research Institute(东方航天港研究院) Hubei Luojia Laboratory(湖北珞珈实验室)

AI总结 针对云污染导致光学遥感不可靠的问题,提出端到端异质SAR-光学融合框架CloudLULC-Net,通过光学可靠性调制、异质信息自适应聚合和统一语义映射变换器,实现近实时LULC制图,并构建含40,223个三元组的全球基准数据集CloudLULC-Set,在多个指标上优于现有方法。

详情
AI中文摘要

光学遥感影像经常受到云和云阴影污染的干扰,这限制了其在近实时土地利用和土地覆盖(LULC)制图中的可靠性。尽管合成孔径雷达(SAR)可以提供穿透云层的结构信息,但现有的SAR-光学融合方法通常假设光学观测可靠,未能充分解决云污染引入的语义不确定性。为了解决这个问题,我们提出了CloudLULC-Net,一个端到端的异质SAR-光学融合框架,直接从受云污染的Sentinel-2影像和时间相邻的Sentinel-1 SAR观测中预测LULC图。所提出的网络包含光学可靠性调制以抑制不可靠的光学响应、异质信息自适应聚合以建模光学和SAR表示之间的高阶空间-通道交互,以及一个统一的语义映射变换器,在面向LULC的潜在空间中组织融合特征。进一步引入语义锚点引导优化策略以提高中间语义表示的一致性。为支持该任务,我们构建了CloudLULC-Set,一个大规模基准数据集,包含40,223个精心挑选的SAR-光学-标签三元组,具有跨不同地理区域和云条件的像素级LULC标注。实验结果表明,CloudLULC-Net实现了86.60%的总体精度、83.29%的F1分数和73.51%的平均交并比,优于代表性的异质重建优先和端到端SAR-光学映射方法。与现有全球LULC产品的比较以及不同云覆盖水平下的分析进一步证明了CloudLULC-Net在易云区域目标日期LULC制图中的鲁棒性和实用价值。该项目公开于:https://github.com/RSIIPAC/CloudLULC

英文摘要

Optical remote sensing imagery is frequently degraded by cloud and cloud-shadow contamination, which limits its reliability for near-real-time land use and land cover (LULC) mapping. Although synthetic aperture radar (SAR) can provide cloud-penetrating structural information, existing SAR-optical fusion methods often assume reliable optical observations and insufficiently address the semantic uncertainty introduced by cloud contamination. To address this issue, we propose CloudLULC-Net, an end-to-end heterogeneous SAR-optical fusion framework that directly predicts LULC maps from cloud-contaminated Sentinel-2 imagery and temporally adjacent Sentinel-1 SAR observations. The proposed network incorporates optical reliability modulation to suppress unreliable optical responses, heterogeneous information adaptive aggregation to model high-order spatial-channel interactions between optical and SAR representations, and a unified semantic mapping transformer to organize fused features in a LULC-oriented latent space. A semantic anchor-guided optimization strategy is further introduced to improve the consistency of intermediate semantic representations. To support this task, we construct CloudLULC-Set, a large-scale benchmark dataset containing 40,223 curated SAR-optical-label triplets with pixel-level LULC annotations across diverse geographic regions and cloud conditions. Experimental results show that CloudLULC-Net achieves an OA of 86.60%, an F1-score of 83.29%, and an mIoU of 73.51%, outperforming representative heterogeneous reconstruction-first and end-to-end SAR-optical mapping methods. Comparisons with existing global LULC products and analyses under different cloud-cover levels further demonstrate the robustness and practical value of CloudLULC-Net for target-date LULC mapping in cloud-prone regions. The project is publicly available at: https://github.com/RSIIPAC/CloudLULC
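The three reported metrics (OA, F1-score, mIoU) can all be derived from one pixel-level confusion matrix. The sketch below assumes macro-averaging over classes for F1 and IoU, which the abstract does not specify. The two-class matrix is a made-up toy example, not data from CloudLULC-Set.

```python
def segmentation_metrics(confusion):
    """Overall accuracy, macro F1 and mean IoU from a confusion matrix.

    confusion[i][j] counts pixels of true class i predicted as class j.
    """
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    f1_scores, ious = [], []
    for c in range(k):
        tp = confusion[c][c]
        fn = sum(confusion[c]) - tp                         # missed pixels
        fp = sum(confusion[r][c] for r in range(k)) - tp    # false alarms
        seen = tp + fp + fn
        f1_scores.append(2 * tp / (2 * tp + fp + fn) if seen else 0.0)
        ious.append(tp / seen if seen else 0.0)
    oa = sum(confusion[c][c] for c in range(k)) / total
    return oa, sum(f1_scores) / k, sum(ious) / k


# Hypothetical 2-class confusion matrix (rows: truth, columns: prediction).
confusion = [
    [8, 2],
    [1, 9],
]
oa, macro_f1, miou = segmentation_metrics(confusion)
print(f"OA = {oa:.4f}, F1 = {macro_f1:.4f}, mIoU = {miou:.4f}")
```

mIoU is the strictest of the three because every error counts against a class's intersection-over-union in the denominator, which is consistent with the reported mIoU being the lowest figure.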

2510.19255 2026-06-17 cs.CV 版本更新

Advances in 4D Representation: Geometry, Motion, and Interaction

4D表示进展:几何、运动与交互

Mingrui Zhao, Sauradip Nag, Kai Wang, Aditya Vora, Guangda Ji, Peter Chun, Ali Mahdavi-Amiri, Hao Zhang

AI总结 本文综述了4D生成与重建领域,从几何、运动和交互三个核心支柱出发,分析不同4D表示方法的特性、挑战及适用场景,并探讨了大语言模型和视频基础模型在其中的作用。

Comments CGF'26, 21 pages. Project Page: https://mingrui-zhao.github.io/4DRep-GMI/

详情
AI中文摘要

我们呈现了一篇关于4D生成与重建的综述,这是一个快速发展的计算机图形学子领域,其进展得益于神经场、几何与运动深度学习以及3D生成式人工智能(GenAI)的最新突破。尽管我们的综述并非首篇,但我们从独特且鲜明的4D表示视角构建领域覆盖,以建模随时间演变的3D几何,同时展现运动和交互。具体而言,我们并未穷举众多工作,而是采取更具选择性的方法,聚焦代表性工作,以突出每种表示在不同计算、应用和数据场景下的理想特性及随之而来的挑战。我们旨在向读者传达的主要信息是:如何为其任务选择并定制合适的4D表示。在组织上,我们基于三个关键支柱:几何、运动与交互,对4D表示进行划分。我们的讨论不仅涵盖当今最流行的表示,如神经辐射场(NeRFs)和3D高斯泼溅(3DGS),还关注在4D背景下相对未被充分探索的表示,如结构化模型和长程运动。在整个综述中,我们将重新审视大语言模型(LLMs)和视频基础模型(VFMs)在各种4D应用中的作用,同时引导讨论指向它们当前的局限性以及如何解决。我们还专门介绍了目前可用的4D数据集以及推动该子领域前进所缺乏的数据。项目页面:https://mingrui-zhao.github.io/4DRep-GMI/

英文摘要

We present a survey on 4D generation and reconstruction, a fast-evolving subfield of computer graphics whose developments have been propelled by recent advances in neural fields, geometric and motion deep learning, as well as 3D generative artificial intelligence (GenAI). While our survey is not the first of its kind, we build our coverage of the domain from a unique and distinctive perspective of 4D representations, to model 3D geometry evolving over time while exhibiting motion and interaction. Specifically, instead of offering an exhaustive enumeration of many works, we take a more selective approach by focusing on representative works to highlight both the desirable properties and ensuing challenges of each representation under different computation, application, and data scenarios. The main take-away message we aim to convey to the readers is on how to select and then customize the appropriate 4D representations for their tasks. Organizationally, we separate the 4D representations based on three key pillars: geometry, motion, and interaction. Our discourse will not only encompass the most popular representations of today, such as neural radiance fields (NeRFs) and 3D Gaussian Splatting (3DGS), but also bring attention to relatively under-explored representations in the 4D context, such as structured models and long-range motions. Throughout our survey, we will reprise the role of large language models (LLMs) and video foundational models (VFMs) in a variety of 4D applications, while steering our discussion towards their current limitations and how they can be addressed. We also provide a dedicated coverage on what 4D datasets are currently available, as well as what is lacking, in driving the subfield forward. Project page: https://mingrui-zhao.github.io/4DRep-GMI/

2605.26921 2026-06-17 cs.CV q-bio.NC 版本更新

Similarity-based representation factorization for revealing interpretable dimensions in representational data

揭示大脑、行为和AI中表征的核心维度

Florian P. Mahner, Ka Chun Lam, Francisco Pereira, Martin N. Hebart

发表机构 * Max Planck Institute for Human Cognitive and Brain Sciences(马克斯·普朗克人类认知与脑科学研究所) National Institute of Mental Health(美国国家心理健康研究所) Justus Liebig University Giessen(吉森尤斯图斯·李比希大学) Center for Mind, Brain and Behavior(心智、脑与行为中心)

AI总结 提出基于相似性的表征分解(SRF)方法,从相似性矩阵中恢复低维、非负、可解释的嵌入,以揭示神经、行为和计算数据中表征的潜在维度。

详情
AI中文摘要

表征研究广泛存在于神经科学、心理学和人工智能等领域。虽然通常通过刺激之间的相似性来研究和比较表征,但现有方法仅能有限地访问塑造这些表征的维度,且可解释性有限。为克服这些挑战,本文引入基于相似性的表征分解(SRF),一种通用的计算方法,用于从测量数据导出的相似性矩阵中恢复低维、非负、可解释的嵌入。在模拟以及多种神经、行为和计算数据集中,SRF能从各种形式的表征数据中恢复可解释的维度,即使对于非常稀疏采样、不完整的数据也是如此。从这些数据集中导出的维度与任务特定模型获得的维度相匹配,预测独立的行为属性,改进探索性分析,并且与比较相似性矩阵相比,为验证性假设检验提供更高的统计功效。这些结果共同确立了SRF作为一种通用方法,在揭示、理解和利用表征背后的维度方面具有广泛的应用前景。

英文摘要

The study of representations is widespread across fields, including neuroscience, psychology, and artificial intelligence. While representations are often studied and compared through similarities between stimuli, current methods provide only limited access to the dimensions that shape these representations and are often limited in interpretability. To overcome these challenges, here we introduce Similarity-Based Representation Factorization (SRF), a general computational method for recovering low-dimensional, non-negative, interpretable embeddings from similarity matrices derived from measured data. Across simulations and many neural, behavioral, and computational datasets, SRF recovers interpretable dimensions from diverse forms of representational data, even for very sparsely sampled, incomplete data. The dimensions derived from these datasets match those obtained by task-specific models, predict independent behavioral properties, improve exploratory analysis, and offer higher power for confirmatory hypothesis testing than comparing similarity matrices. Together, these results establish SRF as a general-purpose method with broad applications for uncovering, understanding, and using the dimensions underlying representations.
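Recovering a low-dimensional, non-negative embedding from a similarity matrix can be illustrated with generic symmetric non-negative matrix factorization, S ≈ X Xᵀ with X ≥ 0. The abstract does not describe SRF's objective or optimizer, so the sketch below is not the authors' method. It uses a standard damped multiplicative update, and the helper names and toy block-structured matrix are assumptions.

```python
import random


def matmul(a, b):
    """Plain-Python product of two matrices given as lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def recon_error(s, x):
    """Squared Frobenius error between S and X X^T."""
    approx = matmul(x, transpose(x))
    n = len(s)
    return sum((s[i][j] - approx[i][j]) ** 2
               for i in range(n) for j in range(n))


def symmetric_nmf(s, rank, iters=300, seed=0, eps=1e-9):
    """Factor a non-negative similarity matrix as S ~ X X^T with X >= 0."""
    rng = random.Random(seed)
    n = len(s)
    x = [[rng.uniform(0.1, 1.0) for _ in range(rank)] for _ in range(n)]
    for _ in range(iters):
        sx = matmul(s, x)                          # numerator:   S X
        xxtx = matmul(x, matmul(transpose(x), x))  # denominator: X X^T X
        # Damped multiplicative update; every factor is positive, so X stays
        # positive throughout.
        x = [[x[i][k] * (0.5 + 0.5 * sx[i][k] / (xxtx[i][k] + eps))
              for k in range(rank)] for i in range(n)]
    return x


# Hypothetical similarity matrix with two groups of two stimuli each.
similarity = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
embedding = symmetric_nmf(similarity, rank=2)
for row in embedding:
    print(" ".join(f"{v:.2f}" for v in row))
```

Non-negativity is what makes the columns readable as additive dimensions: each stimulus loads on a dimension or it does not, without positive and negative contributions cancelling each other.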

2507.05169 2026-06-17 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Critique of World Model: A Generative Latent Prediction Architecture for World Modeling

世界模型批判:一种用于世界建模的生成式潜在预测架构

Eric Xing, Mingkai Deng, Jinyu Hou

AI总结 本文从心理学“假设性思维”出发,提出世界模型的核心目标是模拟真实世界的所有可行动可能性,并设计了一种基于状态化、分层、多级、混合连续/离散表示的生成式潜在预测(GLP)架构。

详情
AI中文摘要

世界模型,即生物智能体所经历并对其采取行动的真实世界环境的算法模拟器,近年来因开发具有人工(通用)智能的虚拟智能体的需求日益增长而成为一个新兴课题。关于世界模型究竟是什么、如何构建、如何使用以及如何评估,已有许多讨论。本文从著名科幻经典《沙丘》中的想象出发,并借鉴心理学文献中“假设性思维”的概念,论证世界模型的主要目标是模拟真实世界中所有可行动的可能性,以进行有目的的推理和行动。我们审视了世界建模的关键设计维度:数据、表示、架构、学习目标和使用,调查了现有方法并分析了它们的权衡。在此基础上,我们提出了一种新的通用世界模型生成式潜在预测(GLP)架构,基于有状态的、分层的、多层次的、混合连续/离散表示,以及生成式和自监督学习框架,并展望了由这种模型支持的物理、智能体和嵌套(PAN)AGI系统。

英文摘要

A world model, the algorithmic simulator of the real-world environment that biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue that the primary goal of a world model is simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

2601.06862 2026-06-17 cs.CR cs.CV cs.LG cs.MM eess.IV 版本更新

Learning QoE from Packet-Level Measurements in Encrypted Video Conferencing Traffic

从加密视频会议流量的数据包级别测量中学习QoE

Michael Sidorov, Ofer Hadar

AI总结 针对ISP无法访问加密内容评估QoE的挑战,提出基于CNN的框架仅利用数据包大小预测BRISQUE和MOS,在WhatsApp和Zoom数据集上优于先前模型。

详情
AI中文摘要

用户体验质量已成为当今世界最重要的方面之一,因为它直接影响个人继续使用或放弃产品或服务的意愿。在此背景下,视频会议应用(VCAs)在COVID-19大流行后得到广泛采用,必须在日益拥挤的市场中提供卓越性能以保持竞争力。尽管内容提供商(CPs)如Zoom、WhatsApp、Telegram和Google Meet可以通过比较发送和接收的数据来评估通话质量,但VCAs中广泛使用的端到端加密使得互联网服务提供商(ISPs)评估体验质量(QoE)变得更加困难。由于ISPs无法访问加密内容,他们必须依赖对数据路径上未加密流量特征的被动测量。在这项工作中,我们提出了一个简单而有效的QoE预测框架,基于几乎原生的卷积神经网络(CNN)架构,仅使用从视频会议(VC)通话中两个参与者之间的通信中提取的数据包大小来预测两个QoE指标:BRISQUE和MOS。所提出的框架简单、易于实现,且不需要高端计算资源,但提供了优越的预测性能,正如我们在从WhatsApp和Zoom收集的两个自定义数据集上的实验所示,这些实验在QoE预测任务上比先前模型取得了显著改进。

英文摘要

The quality of the user experience has become one of the most important aspects in today's world, as it directly influences individuals' willingness to continue using or abandon a product or service. In this context, video conferencing applications (VCAs), which experienced widespread adoption following the COVID-19 pandemic, must deliver excellent performance to remain competitive in an increasingly crowded market. Although content providers (CPs) such as Zoom, WhatsApp, Telegram, and Google Meet can assess conversation quality by comparing transmitted and received data, the widespread use of end-to-end encryption in VCAs makes quality-of-experience (QoE) evaluation by internet service providers (ISPs) far more challenging. Since ISPs do not have access to the encrypted content, they must rely on passive measurements of unencrypted traffic characteristics on the data path. In this work, we present a simple yet effective QoE prediction framework based on an almost stock convolutional neural network (CNN) architecture that uses only the packet sizes extracted from the communication between two participants in a video conferencing (VC) call to predict two QoE metrics: BRISQUE and MOS. The proposed framework is simple, easy to implement, and does not require high-end computational resources, yet it provides superior prediction performance, as shown in our experiments on two custom datasets collected from WhatsApp and Zoom, which achieve substantial improvements over previous models for the QoE prediction task.

2507.15777 2026-06-17 cs.CV 版本更新

Label tree semantic losses for rich multi-class medical image segmentation

用于丰富多类医学图像分割的标签树语义损失

Junwen Wang, Oscar MacCormac, William Rochford, Aaron Kujawa, Jonathan Shapey, Tom Vercauteren

发表机构 * School of Biomedical Engineering & Imaging Sciences(生物医学工程与成像科学学院) Department of Neurosurgery(神经外科部门)

AI总结 提出两种基于标签层次结构的树状语义损失函数,在脑MRI全监督分割和神经外科高光谱成像稀疏标注场景理解中取得一致改进。

详情
AI中文摘要

丰富且准确的医学图像分割有望通过描绘术前规划的关键解剖结构、指导实时术中导航和支持精确术后评估,为下一代AI定义的临床实践奠定基础。然而,医学和外科成像分割任务中常用的学习方法对所有错误一视同仁,未能利用标签空间中的任何类间语义。随着标签基数和丰富度的增加以包含细微不同的类别,这一问题变得尤为突出。在这项工作中,我们提出了两种基于树的语义损失函数,利用标签的层次组织。我们进一步将我们的损失纳入最近提出的用于稀疏、无背景标注的训练方法中,以扩展所提出损失的适用性。在两个医学和外科成像分割任务上进行了大量实验,即全监督的头部MRI全脑分割和稀疏标注的神经外科高光谱成像场景理解。结果表明,在评估的任务特定基线上取得了一致的改进,其中基于Wasserstein的复合损失在全脑分割中支持最强,而层次加权顶层监督在稀疏HSI设置中表现最佳。

英文摘要

Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, commonly used learning methods for medical and surgical imaging segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. This becomes particularly problematic as the cardinality and richness of labels increase to include subtly different classes. In this work, we propose two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations to extend the applicability of our proposed losses. Extensive experiments are reported on two medical and surgical imaging segmentation tasks, namely head MRI for whole brain parcellation with full supervision and neurosurgical hyperspectral imaging for scene understanding with sparse annotations. Results demonstrate consistent improvements over the evaluated task-specific baselines, with the strongest support for the Wasserstein-based compound loss in whole-brain parcellation and for hierarchy-weighted top-level supervision in the sparse HSI setting.
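
To make the idea of "not penalising all errors equivalently" concrete, here is a minimal sketch of a tree-aware penalty. When the ground truth is a single label, the Wasserstein distance between the predicted class distribution and that label reduces to the expected tree distance computed below. The label hierarchy and function names are hypothetical, and this is not the compound loss proposed in the paper.

```python
# Hypothetical label tree: child -> parent (the root maps to None).
PARENT = {
    "brain": None,
    "cortex": "brain",
    "ventricles": "brain",
    "frontal": "cortex",
    "temporal": "cortex",
    "lateral": "ventricles",
}


def ancestors(label):
    """Return [label, parent, ..., root]."""
    chain = []
    while label is not None:
        chain.append(label)
        label = PARENT[label]
    return chain


def tree_distance(a, b):
    """Number of edges on the path between two labels in the tree."""
    depth_in_b = {node: d for d, node in enumerate(ancestors(b))}
    for d_a, node in enumerate(ancestors(a)):
        if node in depth_in_b:  # first common ancestor
            return d_a + depth_in_b[node]
    raise ValueError("labels are not in the same tree")


def expected_tree_distance(probs, target):
    """Penalty for one pixel: predicted mass weighted by distance to the target."""
    return sum(p * tree_distance(label, target) for label, p in probs.items())


# Same amount of misplaced probability, different semantic severity.
near_miss = expected_tree_distance({"frontal": 0.2, "temporal": 0.8}, "frontal")
far_miss = expected_tree_distance({"frontal": 0.2, "lateral": 0.8}, "frontal")
```

A plain cross-entropy loss would score both predictions identically (both put 0.2 on the true class), whereas the tree-aware penalty charges the confusion with a sibling class less than the confusion with a class in another branch.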

2605.12220 2026-06-17 cs.CV cs.AI cs.LG cs.RO 版本更新

TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion

TriBand-BEV:基于高度感知的鸟瞰图与高分辨率特征融合的实时仅LiDAR三维行人检测

Mohammad Khoshkdahan, Alexey Vinel

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 本文提出TriBand-BEV方法,通过高度感知的鸟瞰图与高分辨率特征融合实现实时LiDAR-only三维行人检测,采用轻量级鸟瞰图张量映射,单网络一次前向传播即可检测车辆、行人和骑行者,提升检测精度与速度。

Comments Accepted for publication in the Proceedings of the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

详情
Journal ref
Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)
AI中文摘要

安全的自动驾驶智能体和移动机器人需要快速的实时三维感知,尤其是针对行人等弱势道路使用者(VRU)。我们提出一种新的鸟瞰图(BEV)编码方法,将完整的三维LiDAR点云映射为包含三个高度带的轻量级二维BEV张量。我们明确地将三维检测重新表述为二维检测问题,然后从BEV输出中重建三维框。单个网络一次前向传播即可检测车辆、行人和骑行者。骨干网络在深层阶段使用区域注意力,覆盖P1到P4的层次化双向颈部网络融合上下文与细节,检测头通过针对边偏移的分布焦点学习和旋转IoU损失预测有向框。训练时采用小幅垂直重新分箱和通道空间中轻微的反射率抖动以抑制记忆化。在三维重建过程中,我们使用四分位距(IQR)滤波器去除噪声和离群的LiDAR点。在KITTI数据集上,TriBand-BEV在单个消费级GPU上以49 FPS运行,行人BEV AP(%)在简单、中等和困难难度下分别达到58.7/52.6/47.2,优于Complex-YOLO,分别提升+12.6%、+7.5%和+3.1%。定性场景显示在遮挡下检测稳定。该流程紧凑,可直接用于实时机器人部署。我们的源代码已在GitHub上公开。

英文摘要

Safe autonomous agents and mobile robots need fast real-time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's eye view (BEV) encoding, which maps the full 3D LiDAR point cloud into a lightweight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re-binning and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On the KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP(%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real-time robotic deployment. Our source code is publicly available on GitHub.
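
The IQR filter mentioned in the abstract can be sketched as follows. The abstract does not say which coordinate is filtered or which multiplier is used, so applying the filter to the heights of the points inside one detected box with the conventional factor of 1.5 is an assumption, as are the function name and the sample values.

```python
from statistics import quantiles


def iqr_filter(values, k=1.5):
    """Keep values inside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if low <= v <= high]


# Hypothetical z-coordinates (metres) of LiDAR points inside one 2D box.
heights = [0.1, 0.9, 1.0, 1.1, 1.2, 1.3, 1.7, 9.5]
filtered = iqr_filter(heights)  # drops the stray 0.1 and 9.5 readings
```

Filtering before taking the minimum and maximum height keeps a single stray return from stretching the reconstructed 3D box.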

2604.13662 2026-06-17 cond-mat.mes-hall cs.CV cs.LG 版本更新

Automatic Charge State Tuning of 300 mm FDSOI Quantum Dots Using Neural Network Segmentation of Charge Stability Diagram

300毫米FDSOI量子点自动电荷状态调节:基于神经网络的电荷稳定性图分割

Peter Samaha, Amine Torki, Ysaline Renaud, Sam Fiette, Emmanuel Chanrion, Pierre-Andre Mortemousque, Yann Beilliard

发表机构 * CEA-Leti, Univ. Grenoble Alpes(法国原子能和替代能源委员会电子与信息技术研究所,格勒诺布尔-阿尔卑斯大学)

AI总结 本文提出基于深度学习的语义分割流程,通过识别电荷稳定性图中的过渡线实现量子点自动电荷调节,提升硅量子点量子比特的高通量电荷调节效率。

Comments 10 pages, 6 figures, supplementary materials available

详情
AI中文摘要

对栅极定义的半导体量子点(QD)进行调节是扩展自旋量子比特技术的主要瓶颈。我们提出一种由深度学习(DL)驱动的语义分割流程,通过在完整的电荷稳定性图(CSD)中定位跃迁线来实现电荷自动调节,并返回单电荷区间的栅极电压目标值。我们收集并手工标注了一个大型异构数据集,包含1015张在硅量子点器件上实验测得的CSD,涵盖九种设计几何结构、多片晶圆和多个制造批次。采用MobileNetV2编码器的U-Net风格卷积神经网络(CNN)通过五折分组交叉验证进行训练和验证。我们的模型在定位单电荷区间方面实现了80.0%的总体离线调节成功率,部分设计的峰值性能超过88%。我们分析了主要的失败模式并提出了针对性的缓解措施。最后,宽范围电荷稳定性图分割还自然地支持可扩展的基于物理的特征提取,其结果可反馈到制造和设计流程中;我们还给出了在低温晶圆探针台中实现实时集成的路线图。总体而言,我们的结果表明,基于神经网络(NN)的宽范围图分割是迈向硅量子点量子比特自动化、高通量电荷调节的切实可行的一步。

英文摘要

Tuning of gate-defined semiconductor quantum dots (QDs) is a major bottleneck for scaling spin qubit technologies. We present a deep learning (DL) driven, semantic-segmentation pipeline that performs charge auto-tuning by locating transition lines in full charge stability diagrams (CSDs) and returns gate voltage targets for the single-charge regime. We assemble and manually annotate a large, heterogeneous dataset of 1015 experimental CSDs measured from silicon QD devices, spanning nine design geometries, multiple wafers, and fabrication runs. A U-Net style convolutional neural network (CNN) with a MobileNetV2 encoder is trained and validated through five-fold group cross-validation. Our model achieves an overall offline tuning success of 80.0% in locating the single-charge regime, with peak performance exceeding 88% for some designs. We analyze dominant failure modes and propose targeted mitigations. Finally, wide-range diagram segmentation also naturally enables scalable physics-based feature extraction that can feed back to fabrication and design workflows, and we outline a roadmap for real-time integration in a cryogenic wafer prober. Overall, our results show that neural network (NN) based wide-diagram segmentation is a practical step toward automated, high-throughput charge tuning for silicon QD qubits.
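
Group cross-validation, as used above, keeps every sample from the same group on one side of each split, so a model is never validated on diagrams from a group it saw during training. The sketch below is a minimal stdlib version; grouping by wafer, the greedy balancing rule, and the sample labels are assumptions made for illustration, since the abstract does not state what defines a group.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Yield (train_idx, val_idx) pairs in which no group spans both sides."""
    members = defaultdict(list)
    for idx, g in enumerate(groups):
        members[g].append(idx)
    # Greedy balancing: place the largest groups first into the smallest fold.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(members, key=lambda name: -len(members[name])):
        min(folds, key=len).extend(members[g])
    for k in range(n_splits):
        val = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, val


# Hypothetical wafer label for each of ten diagrams.
wafers = ["w1", "w1", "w1", "w2", "w2", "w3", "w4", "w4", "w5", "w6"]
splits = list(group_k_fold(wafers, n_splits=3))
```

With plain k-fold, diagrams from one wafer could land in both the training and validation sets and inflate the reported success rate; the grouped variant measures generalization to unseen groups instead.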

2506.07917 2026-06-17 cs.GR cs.CV 版本更新

SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping

SpeeDe3DGS:通过时间剪枝和运动分组实现快速可变形3D高斯泼溅

Allen Tu, Haiyang Ying, Alex Hanson, Yonghan Lee, Tom Goldstein, Matthias Zwicker

发表机构 * University of Maryland, College Park(马里兰大学帕克分校)

AI总结 本文提出SpeeDe3DGS,通过时间敏感性修剪、时间敏感性采样和GroupFlow模块,在保持高质量重建的同时,显著提升3DGS的渲染和训练效率。

Comments Project Page: https://speede3dgs.github.io/

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 26083-26093
AI中文摘要

3D高斯泼溅(3DGS)的动态扩展通过神经运动场实现高质量重建,但逐高斯的神经网络推理使这些模型计算成本高昂。在DeformableGS的基础上,我们提出快速可变形3D高斯泼溅(SpeeDe3DGS),通过三个互补模块弥合这一效率与保真度之间的差距:时间敏感性剪枝(TSP)通过按时间聚合的敏感性分析移除低影响高斯;时间敏感性采样(TSS)对时间戳加以扰动,以抑制漂浮物并提高时间一致性;GroupFlow将学习到的变形场蒸馏为共享的SE(3)变换,以实现高效的分组运动。在MonoDyGauBench的50个动态场景上,将TSP和TSS集成到DeformableGS中,平均渲染速度提升6.78倍,同时保持神经场的保真度,且使用的图元减少10倍。加入GroupFlow后,最终实现13.71倍的渲染加速和2.53倍的训练时间缩短,在速度上超越所有基线,同时保持更优的图像质量。

英文摘要

Dynamic extensions of 3D Gaussian Splatting (3DGS) achieve high-quality reconstructions through neural motion fields, but per-Gaussian neural inference makes these models computationally expensive. Building on DeformableGS, we introduce Speedy Deformable 3D Gaussian Splatting (SpeeDe3DGS), which bridges this efficiency-fidelity gap through three complementary modules: Temporal Sensitivity Pruning (TSP) removes low-impact Gaussians via temporally aggregated sensitivity analysis, Temporal Sensitivity Sampling (TSS) perturbs timestamps to suppress floaters and improve temporal coherence, and GroupFlow distills the learned deformation field into shared SE(3) transformations for efficient groupwise motion. On the 50 dynamic scenes in MonoDyGauBench, integrating TSP and TSS into DeformableGS accelerates rendering by 6.78× on average while maintaining neural-field fidelity and using 10× fewer primitives. Adding GroupFlow culminates in 13.71× faster rendering and 2.53× shorter training, surpassing all baselines in speed while preserving superior image quality.

2603.19801 2026-06-17 eess.IV cs.AI cs.CV 版本更新

Offshore oil and gas platform dynamics in the North Sea, Gulf of Mexico, and Persian Gulf: Exploiting the Sentinel-1 archive

北海、墨西哥湾和波斯湾的海上石油和天然气平台动态:利用Sentinel-1档案

Robin Spanier, Thorsten Hoeser, John Truckenbrodt, Felix Bachofer, Claudia Kuenzer

发表机构 * German Remote Sensing Data Center, Earth Observation Center, EOC of the German Aerospace Center, DLR(德国遥感数据中心,地球观测中心,德国航空航天中心(DLR)地球观测中心) Institute for Geography and Geology, Department of Remote Sensing, University of Würzburg(维尔茨堡大学地理与地质研究所遥感系)

AI总结 本文利用Sentinel-1数据和深度学习技术,研究了北海、墨西哥湾和波斯湾的海上平台动态,揭示了平台数量变化及结构转型,为海洋基础设施监测提供了数据支持。

Comments 16 pages, 10 figures, 1 table

详情
Journal ref
Big Earth Data, 2026, 1-27
AI中文摘要

随着海上基础设施的增加,对持续、可扩展的监测需求日益增长。本文提出了一种基于免费地球观测数据的自动化方法,利用Sentinel-1档案数据和深度学习目标检测技术,构建了2017-2025年间北海、墨西哥湾和波斯湾的季度平台位置时间序列。此外,还推导了平台大小、水深、海岸距离、国家归属及安装和退役日期等信息。2025年识别出3728个海上平台,其中北海有356个,墨西哥湾有1641个,波斯湾有1731个。尽管波斯湾平台数量在2024年前持续增长,但墨西哥湾和北海的平台数量在2018-2020年间有所下降。同时,超过2700个平台被安装或迁移到新地点,同时有相当数量被退役或迁移。此外,平台寿命缩短的趋势表明,海上行业正经历结构性变化,与移动海上单位如钻探平台的重要性增长有关。研究结果展示了免费地球观测数据和深度学习在持续、长期监测海洋基础设施中的潜力。所推导的数据集是公开的,为海上监测、海洋规划及海上能源行业转型分析提供了基础。

英文摘要

The increasing use of marine spaces by offshore infrastructure, including oil and gas platforms, underscores the need for consistent, scalable monitoring. Offshore development has economic, environmental, and regulatory implications, yet maritime areas remain difficult to monitor systematically due to their inaccessibility and spatial extent. This study presents an automated approach to the spatiotemporal detection of offshore oil and gas platforms based on freely available Earth observation data. Leveraging Sentinel-1 archive data and deep learning-based object detection, a consistent quarterly time series of platform locations was created for three major production regions (the North Sea, the Gulf of Mexico, and the Persian Gulf) for the period 2017-2025. In addition, platform size, water depth, distance to the coast, national affiliation, and installation and decommissioning dates were derived. In 2025, 3,728 offshore platforms were identified: 356 in the North Sea, 1,641 in the Gulf of Mexico, and 1,731 in the Persian Gulf. While expansion was observed in the Persian Gulf until 2024, the Gulf of Mexico and the North Sea saw a decline in platform numbers from 2018 to 2020. At the same time, a pronounced dynamic was apparent: more than 2,700 platforms were installed or relocated to new sites, while a comparable number were decommissioned or relocated. Furthermore, the increasing number of platforms with short lifespans points to a structural change in the offshore sector associated with the growing importance of mobile offshore units such as jack-ups or drillships. The results highlight the potential of freely available Earth observation data and deep learning for consistent, long-term monitoring of marine infrastructure. The derived dataset is public and provides a basis for offshore monitoring, maritime planning, and analyses of the transformation of the offshore energy sector.

2505.12620 2026-06-17 cs.CV 版本更新

BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation

BusterX:基于MLLM的AI生成视频伪造检测与解释

Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan Yu, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, Guangliang Cheng

发表机构 * University of Liverpool, UK(利物浦大学) Nanyang Technological University, SG(南洋理工大学) The Chinese University of Hong Kong, Shenzhen, Guangdong, China(香港中文大学(深圳)) Wuhan University(武汉大学) Hangzhou Dianzi University(杭州电子科技大学)

AI总结 本文提出BusterX,一种基于多模态大语言模型的视频伪造检测系统,通过改进数据集和评估基准,提升检测准确性和解释质量。

详情
AI中文摘要

随着生成视频模型日益逼真,检测AI生成视频需要兼具准确性和可解释性的系统。然而,将多模态大语言模型(MLLMs)应用于视频取证目前受限于过时的数据集、简化的评估协议和对黑盒分类的依赖。为解决这些问题,我们引入了一个全面的数据集、基准和基线模型用于视频伪造检测。首先,我们提出了GenBuster-200K,一个包含超过200,000个高质量视频的公平数据集,这些视频来自最先进的生成器,涵盖多样化的现实场景。其次,我们提出了GenBuster-Bench,一个覆盖三个渐进赛道(领域内、领域外和野外)的诊断基准,用于评估模型在领域转移和代际转移中的表现。它还引入了MLLM-as-a-Judge协议来评估生成的取证解释质量。最后,我们开发了BusterX,一种具有RL训练的MLLM基线模型。不同于直接二元分类,BusterX将检测视为视觉推理任务,其中生成的推理链本身作为检测器。实验结果表明,BusterX在检测准确性和推理质量上均优于几种领先的MLLMs(例如Qwen3.5、Claude-Sonnet-4.6)

英文摘要

As generative video models become increasingly realistic, detecting AI-generated videos requires systems that offer both accuracy and interpretability. However, applying Multimodal Large Language Models (MLLMs) to video forensics is currently limited by outdated datasets, simplistic evaluation protocols, and a reliance on black-box classification. To address these issues, we introduce a comprehensive dataset, benchmark, and baseline model for video forgery detection. First, we present GenBuster-200K, a fair dataset of over 200,000 high-quality videos sourced from state-of-the-art generators, featuring diverse real-world scenarios. Second, we propose GenBuster-Bench, a diagnostic benchmark spanning three progressive tracks (In-Domain, Out-of-Domain, and In-the-Wild) to evaluate models across domain shifts and generational shifts. It also introduces an MLLM-as-a-Judge protocol to assess the quality of the generated forensic explanations. Finally, we develop BusterX, an MLLM baseline with RL training. Instead of direct binary classification, BusterX formulates detection as a visual reasoning task, where the generated reasoning chain itself serves as the detector. Experimental results demonstrate that BusterX outperforms several leading MLLMs (e.g., Qwen3.5, Claude-Sonnet-4.6) in both detection accuracy and rationale quality.

2508.04492 2026-06-17 cs.CV cs.AI 版本更新

Learning Robust Intervention Representations with Delta Embeddings

通过delta嵌入学习鲁棒的干预表示

Panagiotis Alimisis, Christos Diou

发表机构 * Department of Informatics and Telematics(信息与电信学系)

AI总结 本文提出通过潜在空间中的可操作反事实表示提升模型鲁棒性,提出因果delta嵌入方法,在无需额外监督的情况下学习因果表示,实验显示其在合成和现实基准中表现优异。

Comments ICLR 2026, Poster

详情
Journal ref
International Conference on Learning Representations (ICLR), 2026
AI中文摘要

因果表示学习近年来引起了广泛关注,作为提高模型泛化性和鲁棒性的手段。因果干预图像对(也称为“可操作反事实”)的表示具有特性:在起始状态和结束状态之间,只有受干预/动作影响的场景变量发生变化。尽管大多数工作集中在识别和表示因果模型下的场景变量,但较少关注干预本身的表示。本文表明,通过关注潜在空间中的可操作反事实表示,可以有效提升离分布鲁棒性。具体而言,我们提出干预可通过因果delta嵌入表示,该嵌入对视觉场景不变且在影响的因果变量上稀疏。基于此见解,我们提出一种无需额外监督的学习因果表示的方法。在因果三元组挑战中的实验表明,因果delta嵌入在离分布设置中表现突出,显著超越基线性能,在合成和现实基准中均取得优异结果。

英文摘要

Causal representation learning has attracted significant research interest during the past few years as a means of improving model generalization and robustness. Causal representations of interventional image pairs (also called "actionable counterfactuals" in the literature) have the property that only variables corresponding to scene elements affected by the intervention or action change between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out-of-distribution (OOD) robustness is to focus on the representation of actionable counterfactuals in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a method for learning causal representations from image pairs, without any additional supervision. Experiments in the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.

2602.13318 2026-06-17 cs.AI cs.CV cs.LG 版本更新

DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing

DECKBench:用于学术幻灯片生成和编辑的多智能体框架基准测试

Daesik Jang, Morgan Lindsay Heisler, Linzi Xing, Yifei Li, Edward Wang, Ying Xiong, Yong Zhang, Zhenan Fan

发表机构 * Huawei Technologies Canada(华为加拿大技术有限公司) University of British Columbia(不列颠哥伦比亚大学)

AI总结 本文提出DECKBench,一个用于评估多智能体生成和编辑学术幻灯片的框架,通过定制数据集和模拟编辑指令,系统评估幻灯片和整个演示文稿的忠实度、连贯性、布局质量和多轮指令遵循能力。

详情
AI中文摘要

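自动生成并迭代编辑学术幻灯片不只是文档摘要,还需要忠实的内容选择、连贯的幻灯片组织、感知布局的渲染以及稳健的多轮指令遵循。然而,现有基准和评估协议无法充分衡量这些挑战。为弥补这一空白,我们提出Deck Edits and Compliance Kit Benchmark(DECKBench),一个面向多智能体幻灯片生成与编辑的评估框架。DECKBench建立在精心整理的论文-幻灯片配对数据集之上,并辅以逼真的模拟编辑指令。我们的评估协议系统地评估幻灯片级和整套演示文稿级的忠实度、连贯性、布局质量以及多轮指令遵循能力。我们还实现了一个模块化的多智能体基线系统,将幻灯片生成与编辑任务分解为论文解析与摘要、幻灯片规划、HTML生成和迭代编辑。实验结果表明,该基准能够凸显优势、暴露失败模式,并为改进多智能体幻灯片生成与编辑系统提供可操作的见解。总体而言,这项工作为学术演示文稿生成与编辑的可复现、可比较评估奠定了标准化基础。代码和数据公开于 https://github.com/morgan-heisler/DeckBench 。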

英文摘要

Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout-aware rendering, and robust multi-turn instruction following. However, existing benchmarks and evaluation protocols do not adequately measure these challenges. To address this gap, we introduce the Deck Edits and Compliance Kit Benchmark (DECKBench), an evaluation framework for multi-agent slide generation and editing. DECKBench is built on a curated dataset of paper-to-slide pairs augmented with realistic, simulated editing instructions. Our evaluation protocol systematically assesses slide-level and deck-level fidelity, coherence, layout quality, and multi-turn instruction following. We further implement a modular multi-agent baseline system that decomposes the slide generation and editing task into paper parsing and summarization, slide planning, HTML creation, and iterative editing. Experimental results demonstrate that the proposed benchmark highlights strengths, exposes failure modes, and provides actionable insights for improving multi-agent slide generation and editing systems. Overall, this work establishes a standardized foundation for reproducible and comparable evaluation of academic presentation generation and editing. Code and data are publicly available at https://github.com/morgan-heisler/DeckBench .

2601.17053 2026-06-17 cs.CV 版本更新

Synthetic Data Guided Feature Selection for Robust Activity Recognition in Older Adults

合成数据引导的特征选择用于老年人稳健活动识别

Shuhao Que, Dieuwke van Dartel, Ilse Heeringa, Han Hegeman, Miriam Vollenbroek-Hutten, Ying Wang

发表机构 * University of Twente(特文特大学) Ziekenhuis Groep Twente(特文特医院集团) Medisch Spectrum Twente(特文特医疗中心)

AI总结 本研究开发了稳健的人体活动识别系统,利用合成数据提高老年人髋部骨折康复期间持续活动识别的可靠性,尤其在识别高临床相关性的体位转移任务上表现突出。

Comments This paper has been submitted to Nordic Conference on Digital Health and Wireless Solutions 2026, currently under review

详情
AI中文摘要

髋部骨折康复期间的体力活动对于减轻老年人群长期功能下降至关重要,但在临床实践中很少被量化。现有连续监测系统通常针对中年人开发,因此在老年人步态缓慢且变化大的情况下表现不可靠。本研究旨在开发一个稳健的人体活动识别(HAR)系统,以提高髋部骨折康复期间的持续体力活动识别。24名超过80岁的健康老年人在模拟自由生活条件下,佩戴两个加速度计(分别置于下背部和前上大腿)进行了75分钟的日常活动(行走、站立、坐、躺和体位转换)。通过留一被试法交叉验证评估模型的鲁棒性。合成数据展示了在不同参与者间泛化的能力。所得到的特征干预模型(FIM)通过合成数据指导实现了可靠的活动识别,其平均F1分数分别为行走0.896、站立0.927、坐0.997、躺0.937、体位转换0.816。与无合成数据的对照模型相比,FIM显著提高了体位转换检测,即在现有HAR文献中常被忽视的高临床相关性活动类别。结论:这些初步结果展示了在老年人群中稳健活动识别的可行性。需要进一步在髋部骨折患者群体中验证以评估所提出监测系统的临床实用性。

英文摘要

Physical activity during hip fracture rehabilitation is essential for mitigating long-term functional decline in geriatric patients. However, it is rarely quantified in clinical practice. Existing continuous monitoring systems with commercially available wearable activity trackers are typically developed in middle-aged adults and therefore perform unreliably in older adults with slower and more variable gait patterns. This study aimed to develop a robust human activity recognition (HAR) system to improve continuous physical activity recognition in the context of hip fracture rehabilitation. 24 healthy older adults aged over 80 years were included to perform activities of daily living (walking, standing, sitting, lying down, and postural transfers) under simulated free-living conditions for 75 minutes while wearing two accelerometers positioned on the lower back and anterior upper thigh. Model robustness was evaluated using leave-one-subject-out cross-validation. The synthetic data demonstrated potential to improve generalization across participants. The resulting feature intervention model (FIM), aided by synthetic data guidance, achieved reliable activity recognition with mean F1-scores of 0.896 for walking, 0.927 for standing, 0.997 for sitting, 0.937 for lying down, and 0.816 for postural transfers. Compared with a control condition model without synthetic data, the FIM significantly improved the postural transfer detection, i.e., an activity class of high clinical relevance that is often overlooked in existing HAR literature. In conclusion, these preliminary results demonstrate the feasibility of robust activity recognition in older adults. Further validation in hip fracture patient populations is required to assess the clinical utility of the proposed monitoring system.

2511.03876 2026-06-17 eess.IV cs.CV cs.LG physics.med-ph 版本更新

Computed Tomography (CT)-derived Cardiovascular Flow Estimation Using Physics-Informed Neural Networks Improves with Sinogram-based Training: A Simulation Study

基于CT的心血管血流估计利用物理信息神经网络,通过sinogram训练提升:一项模拟研究

Jinyuxuan Guo, Gurnoor Singh Khurana, Alejandro Gonzalo Grande, Juan C. del Alamo, Francisco Contijoch

发表机构 * Dept. of Bioengineering, University of California San Diego(加州大学圣地亚哥分校生物工程系) Dept. of Computer Science Engineering, University of California San Diego(加州大学圣地亚哥分校计算机科学与工程系) Dept. of Mechanical Engineering, Univ of Washington(华盛顿大学机械工程系) Depts of Mechanical Engineering and Cardiology, Univ. of Washington(华盛顿大学机械工程与心内科系) Depts. of Bioengineering, Radiology, University of California San Diego(加州大学圣地亚哥分校生物工程与放射学系)

AI总结 本研究评估了CT影像对基于物理信息神经网络(PINN)的血流估计的影响,提出了一种改进框架SinoFlow,直接利用sinogram数据估计血流,结果显示SinoFlow在避免滤波反投影引入的误差方面表现更优。

详情
AI中文摘要

背景:基于成像的非侵入性血流评估在评估心脏功能和结构中起关键作用。计算机断层扫描(CT)是一种广泛使用的成像模态,能够稳健地评估心血管解剖和功能,但目前尚无直接从对比剂演变影像序列中估计血流速度的方法。目的:本研究评估CT成像对基于物理信息神经网络(PINN)的血流估计的影响,并提出一种改进框架SinoFlow,直接利用sinogram数据估计血流。方法:我们利用计算流体力学在理想化的二维血管分叉中生成脉动流场,并模拟了不同机架旋转速度、管电流和脉冲模式成像设置下的CT扫描。我们比较了基于重建图像的PINN血流估计(ImageFlow)与SinoFlow的性能。结果:SinoFlow避免了滤波反投影引入的误差的传播,显著提高了血流估计性能。SinoFlow在所有测试的机架旋转速度下均表现稳健,并且均方误差和速度误差始终低于ImageFlow。此外,SinoFlow与脉冲模式成像兼容,并在较短脉冲宽度下保持更高的准确性。结论:本研究展示了SinoFlow在基于CT的血流估计中的潜力,为非侵入性血流评估提供了一种更有前景的方法。研究结果旨在为PINN在CT图像中的未来应用提供参考,并为基于图像的估计提供了一种解决方案:采用合理的采集参数即可得到准确的血流估计。

英文摘要

Background: Non-invasive imaging-based assessment of blood flow plays a critical role in evaluating heart function and structure. Computed Tomography (CT) is a widely-used imaging modality that can robustly evaluate cardiovascular anatomy and function, but direct methods to estimate blood flow velocity from movies of contrast evolution have not been developed. Purpose: This study evaluates the impact of CT imaging on Physics-Informed Neural Networks (PINN)-based flow estimation and proposes an improved framework, SinoFlow, which uses sinogram data directly to estimate blood flow. Methods: We generated pulsatile flow fields in an idealized 2D vessel bifurcation using computational fluid dynamics and simulated CT scans with varying gantry rotation speeds, tube currents, and pulse mode imaging settings. We compared the performance of PINN-based flow estimation using reconstructed images (ImageFlow) to SinoFlow. Results: SinoFlow significantly improved flow estimation performance by avoiding propagating errors introduced by filtered backprojection. SinoFlow was robust across all tested gantry rotation speeds and consistently produced lower mean squared error and velocity errors than ImageFlow. Additionally, SinoFlow was compatible with pulsed-mode imaging and maintained higher accuracy with shorter pulse widths. Conclusions: This study demonstrates the potential of SinoFlow for CT-based flow estimation, providing a more promising approach for non-invasive blood flow assessment. The findings aim to inform future applications of PINNs to CT images and provide a solution for image-based estimation, with reasonable acquisition parameters yielding accurate flow estimates.

2201.06574 2026-06-17 eess.IV cs.CV 版本更新

Neural Computed Tomography

神经计算断层扫描

Kunal Gupta, Brendan Colvert, Francisco Contijoch

发表机构 * University of California San Diego(加州大学圣地亚哥分校)

AI总结 本文提出NeuralCT框架,通过神经隐式方法生成无运动伪影的时间分辨图像,适用于心脏等复杂运动场景。

Comments https://kunalmgupta.github.io/projects/NeuralCT.html

详情
AI中文摘要

尽管单个视图的采集速度很快,但在采集一组投影的过程中发生的运动仍可能导致计算机断层扫描重建中出现明显的运动伪影。在心脏成像等情况下,运动可能不可避免,而对运动的评估本身可能具有临床意义。减少运动伪影的图像重建通常依靠开发机架旋转更快的系统,或使用测量和/或估计位移的算法来实现。然而,由于物理限制,以及估计/测量非刚性、随时间变化且因患者而异的运动十分困难,这些方法效果有限。我们提出一种新的重建框架NeuralCT,用于生成无运动伪影的时间分辨图像。我们的方法采用神经隐式方法,不需要对底层运动进行估计或建模,而是使用符号距离度量和神经隐式框架来表示边界。我们利用“分析-合成”(analysis-by-synthesis)方法来寻找既与所采集的sinogram一致、又满足空间和时间一致性约束的解。我们在三个复杂度递增的场景中展示了NeuralCT的实用性:小圆的平移、椭圆直径的类心跳变化以及复杂的拓扑形变。在不进行超参数调优、不改变架构的情况下,以均方误差和Dice指标衡量,与滤波反投影相比,NeuralCT在所有三种运动下都提供了高质量的图像重建。

英文摘要

Motion during acquisition of a set of projections can lead to significant motion artifacts in computed tomography reconstructions despite fast acquisition of individual views. In cases such as cardiac imaging, motion may be unavoidable and evaluating motion may be of clinical interest. Reconstructing images with reduced motion artifacts has typically been achieved by developing systems with faster gantry rotation or using algorithms which measure and/or estimate the displacements. However, these approaches have had limited success due to both physical constraints and the challenge of estimating/measuring non-rigid, temporally varying, and patient-specific motions. We propose a novel reconstruction framework, NeuralCT, to generate time-resolved images free from motion artifacts. Our approach utilizes a neural implicit representation and does not require estimation or modeling of the underlying motion. Instead, boundaries are represented using a signed distance metric and neural implicit framework. We utilize "analysis-by-synthesis" to identify a solution consistent with the acquired sinogram as well as spatial and temporal consistency constraints. We illustrate the utility of NeuralCT in three progressively more complex scenarios: translation of a small circle, heartbeat-like change in an ellipse's diameter, and complex topological deformation. Without hyperparameter tuning or change to the architecture, NeuralCT provides high quality image reconstruction for all three motions, as compared to filtered backprojection, using mean-square-error and Dice metrics.
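
For readers unfamiliar with the two evaluation metrics named above, the sketch below computes them on flattened toy images. The masks and function names are made up for illustration, and the paper's exact evaluation protocol (thresholding, region of interest, averaging over time frames) is not described in the abstract.

```python
def dice(mask_a, mask_b):
    """Dice overlap of two binary masks given as flat lists of 0/1."""
    intersection = sum(a and b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    return 1.0 if total == 0 else 2 * intersection / total


def mean_squared_error(img_a, img_b):
    """Mean of squared pixel differences between two equally sized images."""
    return sum((a - b) ** 2 for a, b in zip(img_a, img_b)) / len(img_a)


# Toy 1D "images": an object that is shifted by one pixel in the reconstruction.
truth = [0, 1, 1, 1, 0, 0]
recon = [0, 0, 1, 1, 1, 0]

overlap = dice(truth, recon)              # 2 * 2 / (3 + 3)
error = mean_squared_error(truth, recon)  # 2 differing pixels out of 6
```

Dice rewards overlap of the segmented object and is insensitive to the large empty background, while MSE compares intensities pixel by pixel; reporting both covers shape errors and intensity errors.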