arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.31603 2026-06-01 cs.CV cs.AI (version update)

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Jiazheng Xing, Hangjie Yuan, Lingling Cai, Xinyu Liu, Yujie Wei, Fei Du, Hai Ci, Tao Feng, Jiasheng Tang, Weihua Chen, Fan Wang, Yong Liu

Institutions: Zhejiang University; DAMO Academy, Alibaba Group; Hupan Lab; National University of Singapore; Hong Kong University of Science and Technology; Fudan University; Tsinghua University

AI summary: Proposes the Lumos-Nexus framework, which uses a two-stage design and progressive frequency bridging to markedly improve video generation fidelity while preserving reasoning ability.

Comments Project page (https://jiazheng-xing.github.io/nexus-lumos-home/) and Code (https://github.com/alibaba-damo-academy/Lumos-Custom/) are available

Abstract

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.

2605.31598 2026-06-01 cs.CV (version update)

Linear Scaling Video VLMs for Long Video Understanding

Cristobal Eyzaguirre, Jiajun Wu, Juan Carlos Niebles

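Institutions: Stanford University

AI summary: Proposes StateKV, which achieves linear-time video prefill through a fixed-capacity, importance-driven recurrent state, substantially reducing compute while staying close to full self-attention performance.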

Abstract

Video vision-language models (VLMs) are increasingly used in long-horizon and streaming settings, yet most video encoders still rely on spatiotemporal self-attention, causing compute and latency to grow quadratically with the number of frames. Existing efficiency methods improve scalability but often lose accuracy relative to full self-attention, for example through aggressive frame/token dropping or coarse attention approximations. We introduce StateKV, an inference-time method that adapts pretrained long-video VLMs to linear-time video prefill by carrying cross-frame context in a fixed-capacity, importance-based recurrent state, paired with a second full per-frame cache used for decoding. Across three long-video benchmarks and seven models spanning three families and multiple scales, StateKV remains close to full self-attention and consistently outperforms dominant sliding-window / recency-based streaming approximations, without fine-tuning or architectural changes. StateKV also reduces video-prefill cost measured in FLOPs, enabling stronger accuracy at a fixed compute budget by running larger models. These results suggest a practical step toward scalable long-video understanding.
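The idea of a fixed-capacity, importance-based recurrent state can be pictured with a small sketch. This is not the authors' implementation: the scalar importance scores, the `Token` tuple, and the top-k eviction rule are illustrative assumptions, and the second per-frame cache used for decoding is left out.

```python
from typing import List, Tuple

# Hypothetical token record: (importance score, payload).
Token = Tuple[float, str]


def update_state(state: List[Token], frame_tokens: List[Token], capacity: int) -> List[Token]:
    """Merge the carried state with a new frame and keep the `capacity` most important tokens."""
    merged = state + frame_tokens
    merged.sort(key=lambda t: t[0], reverse=True)  # highest importance first
    return merged[:capacity]


def prefill(frames: List[List[Token]], capacity: int) -> List[Token]:
    """Process frames one by one while carrying a bounded state across frames."""
    state: List[Token] = []
    for frame_tokens in frames:
        # Each step sees at most capacity + len(frame_tokens) tokens,
        # so total work grows linearly with the number of frames.
        state = update_state(state, frame_tokens, capacity)
    return state
```

Because the carried state never exceeds `capacity`, the per-frame cost is bounded regardless of how many frames came before, which is the property that gives linear rather than quadratic growth.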

2605.31596 2026-06-01 cs.CV cs.LG (version update)

KLIP: localized distribution shift detection via KL-divergence with diffusion priors in Inverse Problems

Alireza Kheirandish, Jihoon Hong, Sara Fridovich-Keil

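Institutions: Georgia Institute of Technology

AI summary: Proposes a KL-divergence-based OOD detection metric that needs no calibration data or knowledge of the shifted distribution and can both detect and localize distribution shifts within an image.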

Comments CVPR 2026

Abstract

Diffusion models have shown promising performance as data-driven priors for computational imaging, as well as some capacity to detect out-of-distribution (OOD) images. However, existing approaches to OOD detection often require some knowledge of the shifted distribution, fail to detect subtle or localized distribution shifts, and operate on full images, rather than the indirect measurements available in inverse problems. We propose an OOD detection metric based on the Kullback-Leibler divergence between the diffusion prior and the posterior distribution, that (i) does not require any calibration data or knowledge of the shifted distribution, and (ii) can detect whole images as OOD as well as localize OOD patches within an image. Experimentally, we show that this metric can detect subtle yet semantically meaningful distribution shifts, such as the shift from healthy liver CT scans to those with tumors, and generalizes across different types of diffusion models, datasets, and inverse problems. Our code can be found at https://github.com/voilalab/KLIP.
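As a rough illustration of a patch-wise KL score, the sketch below summarizes each patch's prior and posterior as a univariate Gaussian and computes the closed-form divergence. The Gaussian summaries, the direction KL(posterior || prior), and the function names are assumptions made for illustration; the abstract does not specify how the authors estimate the divergence.

```python
import math
from typing import List, Tuple

# Hypothetical per-patch summary: (mean, standard deviation).
Gaussian = Tuple[float, float]


def kl_gaussian(p: Gaussian, q: Gaussian) -> float:
    """Closed-form KL(p || q) for two univariate Gaussians."""
    mean_p, std_p = p
    mean_q, std_q = q
    return (
        math.log(std_q / std_p)
        + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2 * std_q ** 2)
        - 0.5
    )


def patch_scores(posterior: List[Gaussian], prior: List[Gaussian]) -> List[float]:
    """One divergence per patch; larger values mean the posterior departs more from the prior."""
    return [kl_gaussian(p, q) for p, q in zip(posterior, prior)]
```

Ranking patches by this score gives a localization map, while aggregating the scores gives a whole-image indicator.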

2605.31595 2026-06-01 cs.CV (version update)

Learning Global Motion with Compact Gaussians for Feed-Forward 4D Reconstruction

Mungyeom Kim, Minkyeong Jeon, Honggyu An, Jaewoo Jung, Hyuna Ko, Jisang Han, Hyeonseo Yu, Donghwan Shin, Sunghwan Hong, Takuya Narihira, Kazumi Fukuda, Yuki Mitsufuji, Seungryong Kim

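Institutions: KAIST AI; ETH Zürich, ETH AI Center; Sony AI; Sony Group Corporation

AI summary: Proposes the C4G framework, which uses a compact set of learnable Gaussian query tokens and video-diffusion-based rendering enhancement to perform pose-free, feed-forward 4D reconstruction of dynamic scenes, using significantly fewer Gaussians and modeling motion more robustly.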

Comments Project Page: see https://cvlab-kaist.github.io/C4G

Abstract

Dynamic scene reconstruction from monocular video remains a fundamental challenge in computer vision. Existing feed-forward methods predict 3D Gaussians pixel-wise for each frame, suffering from duplicated Gaussians and view-dependent biases that hinder effective learning of scene motion. We present C4G, a feed-forward 4D reconstruction framework built upon a compact set of timestamp-conditioned learnable Gaussian query tokens. Each token aggregates corresponding features across the full temporal context and decodes a 3D Gaussian whose position is modulated by the target timestamp, enabling globally coherent motion modeling without per-scene optimization. To capture fine-grained details, we further introduce a video diffusion model-based rendering enhancement module. Since our framework effectively aggregates features into Gaussians, we extend this capability to feature lifting, producing a 4D feature field that supports point tracking and dynamic scene understanding. C4G achieves strong novel-view synthesis performance using significantly fewer Gaussians and without requiring camera poses, while exhibiting stronger motion modeling and robustness to large temporal gaps.

2605.31591 2026-06-01 cs.CV (version update)

CoFiDA-M: Concept-Aware Feature Modulation for Cross-Domain Adaptation with Image-Only Inference

Nurjahan Sultana, Moi Hoon Yap, Xinqi Fan, Wenqi Lu

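Institutions: Department of Computing and Mathematics, Manchester Metropolitan University

AI summary: Proposes the CoFiDA-M framework, which uses learning with privileged information to let clinical concepts (such as MONET probabilities) guide feature modulation and trains an image-only student model, significantly improving melanoma recall in cross-domain skin cancer screening.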

Comments Accepted by CVPR 2026

Abstract

Models for AI-based skin cancer screening suffer a severe performance drop when shifting from expert dermoscopic (source) images to consumer-grade clinical (target) images, hindering real-world deployment. Existing domain adaptation methods often ignore crucial semantic invariants, such as clinical concepts. While new foundation models like MONET can provide this semantic information as dense, probabilistic scores, this metadata is unavailable at test time, creating a deployment paradox for practical image-only screening tools. We address this gap by proposing CoFiDA-M, a privileged information framework that learns from concepts at training time but deploys as an image-only model. Our method trains a teacher network that uses MONET concept probabilities to guide a FiLM modulator, transforming visual features into a semantically "edited" feature space. A lightweight, image-only student is then trained to reproduce this edited representation, not just the teacher's final predictions. This distillation "bakes" the clinical reasoning into the student's weights. On a challenging multi-dataset benchmark, our image-only student significantly outperforms state-of-the-art approaches, especially in melanoma recall. Our work provides a practical and generalizable framework for leveraging noisy, probabilistic metadata as privileged information, demonstrating strong cross-dataset robustness and potential for real-world deployment beyond dermatology. Implementation code is available at: https://github.com/mmu-dermatology-research/CoFiDA.git
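FiLM modulation applies a per-channel scale and shift computed from a conditioning input, and the student is trained to match the modulated features. A minimal sketch follows, assuming a linear map from concept probabilities to the scale and shift and a mean-squared feature-matching loss. The weight matrices, function names, and loss choice are hypothetical and are not taken from the CoFiDA repository.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def film_params(concepts: List[float], w_gamma: Matrix, w_beta: Matrix) -> Tuple[List[float], List[float]]:
    """Map concept probabilities to per-channel scale and shift (the linear map is an assumption)."""
    gamma = [1.0 + sum(w * c for w, c in zip(row, concepts)) for row in w_gamma]
    beta = [sum(w * c for w, c in zip(row, concepts)) for row in w_beta]
    return gamma, beta


def film(features: List[float], gamma: List[float], beta: List[float]) -> List[float]:
    """FiLM: scale and shift each feature channel."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def feature_mse(student: List[float], teacher: List[float]) -> float:
    """Feature-matching loss between student features and the teacher's modulated features."""
    return sum((s - t) ** 2 for s, t in zip(student, teacher)) / len(teacher)
```

At test time only the student runs, so the concept probabilities are needed during training alone.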

2605.31590 2026-06-01 cs.CV cs.AI (version update)

TunerDiT: Training-free Progressive Steering of Diffusion Transformer for Multi-Event Video Generation

Ruotong Liao, Guowen Huang, Qing Cheng, Guangyao Zhai, Lei Zhang, Xun Xiao, Thomas Seidl, Daniel Cremers, Volker Tresp

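Institutions: Ludwig Maximilian University of Munich; Technical University of Munich; MCML; University of Hamburg; Huawei European Research Institute

AI summary: Targeting multi-event generation in long videos, proposes the training-free TunerDiT method, which steers generation progressively through event-partitioned masking and cross-event prompt fusion and achieves state-of-the-art results across 8 metrics among training-free methods.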

Comments 17 pages, 13 figures

Abstract

Text-to-video (T2V) generation faces challenging questions when generating videos with long horizons containing multiple events. Inspired by the intrinsics of the diffusion process, we probe video diffusion transformers (DiTs) and uncover intrinsic turning points in the DiT denoising trajectory where conditioning text affects generation from global layout to fine-grained details. Building on this finding, we present TunerDiT, a simple yet effective progressive steering method that requires no additional training for multi-event generation. TunerDiT comprises two steering handles: (1) Event-Partitioned Masking that enforces event boundaries while allowing cross-event transition bands; (2) Cross-Event Prompt Fusion that injects neighboring event semantics for late-stage refinement. We contribute a self-curated prompt suite for benchmarking multi-event generation, i.e., Meve. TunerDiT achieves state-of-the-art performance across 8 metrics and offers a tunable trade-off between video consistency and event separation, compared with other training-free methods. The improvement in text alignment increases with the event count, indicating a scaling possibility with increasing event count.
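One way to picture event-partitioned masking with transition bands is a boolean frame-by-event table. The sketch below assumes frames are split evenly across events and that a band of `band` frames on each side of a boundary may attend to both neighboring prompts. The even split and the band rule are illustrative assumptions, not the paper's definition.

```python
from typing import List


def event_mask(num_frames: int, num_events: int, band: int) -> List[List[bool]]:
    """mask[f][e] is True when frame f may attend to the prompt of event e."""
    mask: List[List[bool]] = []
    for f in range(num_frames):
        row = []
        for e in range(num_events):
            start = e * num_frames // num_events      # first frame of event e
            end = (e + 1) * num_frames // num_events  # one past the last frame
            # Widen the segment by `band` frames on both sides to form transition bands.
            row.append(start - band <= f < end + band)
        mask.append(row)
    return mask
```

With `band=0` each frame sees exactly one event prompt; a larger band trades event separation for smoother transitions.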

2605.31589 2026-06-01 cs.CV (version update)

Recognizing Co-Speech Gestures in-the-Wild

Sindhu B Hegde, K R Prajwal, Andrew Zisserman

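Institutions: Visual Geometry Group, Dept. of Engineering Science, University of Oxford

AI summary: To address current multimodal models' difficulty capturing semantic co-speech gestures, builds GRW, the first large-scale benchmark dataset, for training video models on semantic gesture classification, corresponding-word recognition, and temporal localization.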

Abstract

While humans naturally gesture during speech, only a sparse subset of these movements are visually depictive and semantically linked to specific spoken words. Current multimodal models struggle to capture these semantic co-speech gestures, heavily bottlenecked by a lack of precisely annotated training data. To address this, we introduce the Gesture Recognition in the Wild (GRW) dataset, the first large-scale benchmark designed to map unconstrained human gestures to specific words with frame-accurate temporal boundaries. Comprising 156,688 manually annotated video clips, GRW spans a highly diverse 150-word taxonomy of physical actions, spatial descriptors, and abstract concepts. We leverage GRW to train video models to (a) classify gestures as semantic or not, (b) recognize the word corresponding to a co-speech gesture, and (c) temporally localize the gesture. We also use GRW to establish benchmarks for these three tasks.

2605.31577 2026-06-01 cs.CV (version update)

SurGe: Improved Surface Geometry in Point Maps

Karim Knaebel, Gonzalo Martin Garcia, Christian Schmidt, Ilya Fradlin, Lucas Nunes, Daan de Geus, Bastian Leibe

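Institutions: RWTH Aachen University; Eindhoven University of Technology

AI summary: To address inaccurate local surface geometry in feed-forward 3D reconstruction, proposes a point map normal metric, a point gradient matching loss, and a Neighborhood Attention Decoder (NAD) to improve local surface orientation prediction, achieving the best average rank across multiple zero-shot monocular geometry benchmarks.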

Comments Project page at https://vision.rwth-aachen.de/surge

Abstract

Recent feedforward 3D reconstruction methods predict point maps and estimate global 3D geometry remarkably well. However, their predictions still exhibit inaccurate local surface geometry, which is clearly visible qualitatively but only weakly reflected in common metrics. To make these errors more explicit in evaluation, we introduce a point map normal metric that evaluates the local surface orientation induced by neighboring 3D predictions. To reduce these errors, we propose two complementary components: a point gradient matching loss that supervises depth-normalized 3D finite differences, and a Neighborhood Attention Decoder (NAD) that progressively upsamples features and uses Neighborhood Attention for local feature mixing. Across eight zero-shot monocular geometry benchmarks, our model, SurGe, achieves the best average rank for global point map AbsRel and consistently improves local point map and point map normal evaluations.
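A surface normal induced by neighboring 3D predictions can be obtained from finite differences of the point map, and two normals can be compared by their angle. The sketch below uses forward differences and a cross product. It is a generic construction offered under that assumption, not the exact metric defined in the paper, and the helper names are hypothetical.

```python
import math
from typing import List, Tuple

Vec3 = Tuple[float, float, float]


def sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a: Vec3, b: Vec3) -> Vec3:
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def normalize(v: Vec3) -> Vec3:
    norm = math.sqrt(sum(c * c for c in v))
    return (v[0] / norm, v[1] / norm, v[2] / norm)


def point_map_normal(pmap: List[List[Vec3]], row: int, col: int) -> Vec3:
    """Normal at (row, col) from forward differences; the sign depends on the axis convention."""
    dx = sub(pmap[row][col + 1], pmap[row][col])
    dy = sub(pmap[row + 1][col], pmap[row][col])
    return normalize(cross(dx, dy))


def angular_error_deg(n1: Vec3, n2: Vec3) -> float:
    """Angle in degrees between two unit normals."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(n1, n2))))
    return math.degrees(math.acos(dot))
```

A prediction can have low global depth error and still produce noisy normals, which is why an orientation-based check exposes errors that distance-based metrics reflect only weakly.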

2605.31576 2026-06-01 cs.CV (version update)

Joint Multi-Camera LiDAR Extrinsic Calibration via Learned Pairwise Initialization and Geometric Refinement

Aziz Al-Najjar, Marzieh Amini, James R. Green, Felix Kwamena

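Institutions: Department of System and Computer Engineering, Carleton University

AI summary: Proposes a two-stage framework that first uses CMRNext to estimate each camera's extrinsics and 2D-3D correspondences independently, then refines them through joint bundle adjustment for a globally consistent multi-camera calibration, improving accuracy and consistency.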

Comments Accepted at the CVPR 2026 Workshop URVI: Unified Robotic Vision with Cross-Modal Sensing and Alignment

Abstract

Most learning-based camera-LiDAR calibration methods treat each camera-LiDAR pair independently, ignoring the rigid geometric coupling in multi-camera platforms. As a result, per-camera estimates may be individually accurate yet inconsistent at the system level. We present a two-stage framework for joint multi-camera LiDAR extrinsic calibration that combines learned pairwise matching with geometric refinement. First, CMRNext is applied independently to each camera to produce initial extrinsic estimates and dense 2D-3D correspondences. These predictions are then jointly refined through a multi-frame bundle adjustment with reprojection, per-camera prior, and relative-pose prior terms. This approach converts pairwise predictions into a globally consistent multi-camera calibration. Experiments on KITTI (in-domain for CMRNext) and Walkley (out-of-domain) datasets show improved per-camera accuracy and inter-camera consistency. On KITTI, the method achieves 0.89 cm translation error and 0.038° rotation error. On Walkley, it reduces translation error from 108.6 cm to 3.1 cm, highlighting the benefit of explicit multi-camera coupling when single-camera predictions are less reliable.
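The reprojection term in a bundle adjustment measures how far a LiDAR point lands from its matched pixel once it is transformed by the extrinsics and projected through a pinhole camera. A minimal sketch, assuming an undistorted pinhole model with hypothetical parameter names; the prior terms and the optimizer are not shown.

```python
import math
from typing import List, Sequence, Tuple


def project(point: Sequence[float], rotation: List[List[float]], translation: Sequence[float],
            fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Transform a LiDAR point into the camera frame and project it with a pinhole model."""
    cam = [sum(rotation[i][j] * point[j] for j in range(3)) + translation[i] for i in range(3)]
    x, y, z = cam
    return (fx * x / z + cx, fy * y / z + cy)


def reprojection_error(point: Sequence[float], pixel: Sequence[float],
                       rotation: List[List[float]], translation: Sequence[float],
                       fx: float, fy: float, cx: float, cy: float) -> float:
    """Pixel distance between the projected LiDAR point and its matched image location."""
    u, v = project(point, rotation, translation, fx, fy, cx, cy)
    return math.hypot(u - pixel[0], v - pixel[1])
```

Summing this residual over all cameras and frames, together with terms that tie the cameras to one another, is what lets a joint refinement correct a single camera whose pairwise estimate is unreliable.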

2605.31572 2026-06-01 cs.CV (version update)

nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving

Zhiyu Huang, Johnson Liu, Rui Song, Zewei Zhou, Ruining Yang, Yun Zhang, Tianhui Cai, Hanyin Zhang, Mingxuan Gao, Valeria Xu, Jiali Chen, Yishan Shen, Yiluan Guo, Tony, Qi, Jiaqi Ma

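Institutions: University of California, Los Angeles; Motional

AI summary: Introduces the nuReasoning dataset, with reasoning annotations for 20,000 twenty-second long-tail driving clips, supporting evaluation of spatial, decision, and counterfactual reasoning, and shows that reasoning supervision improves the driving performance of VLMs and VLAs.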

Abstract

Reasoning is essential for autonomous driving (AD) in long-tail scenarios, where vehicles must apply commonsense knowledge, understand spatial relations, infer agent interactions, and make safe decisions. However, existing AD datasets and benchmarks mainly target perception, prediction, or planning, and provide limited supervision for reasoning over realistic long-tail driving scenes. We introduce nuReasoning, a large-scale real-world dataset and benchmark for reasoning-centric AD. Following the lineage of nuScenes and nuPlan, nuReasoning advances real-world AD datasets and benchmarks toward reasoning in long-tail driving scenarios. The dataset contains 20,000 clips, each 20 seconds long, collected across multiple cities, with synchronized multi-camera images, LiDAR data, HD maps, object annotations, and human-verified reasoning annotations spanning Spatial Reasoning, Decision Reasoning, and Counterfactual Reasoning. Unlike prior datasets that focus primarily on visual question answering, nuReasoning supports both reasoning evaluation and planning evaluation, enabling a direct study of how reasoning supervision affects driving performance. Experiments show that fine-tuning VLMs on nuReasoning substantially improves driving-specific question answering, while incorporating reasoning supervision into VLA training improves planning performance even when textual reasoning outputs are disabled at inference time. These results establish nuReasoning as a foundation for evaluating and improving robust, interpretable, reasoning-driven AD systems in realistic long-tail settings.

2605.31556 2026-06-01 cs.CV cs.AI cs.CL cs.CY cs.HC (version update)

Vision-Language Models Suppress Female Representations Under Ambiguous Input

Arnau Marin-Llobet, Simon Henniger, Mahzarin R. Banaji

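Institutions: School of Engineering and Applied Sciences; Department of Psychology

AI summary: Using the zero-shot metric LALS, this study finds that internal encodings and outputs of vision-language models are systematically decoupled under ambiguous input, with the female signal suppressed before generation, revealing how models process gender bias internally.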

Comments 16 pages, 12 figures, 1 table

Abstract

Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in full gear, a figure seen from behind), cases common in practice yet rarely studied. We find that minimal prompting pressure exposes occupation-gender defaults when models are prompted with ambiguous input images, with models collapsing to male even for strongly female-stereotyped occupations. But do these outputs reflect what models actually encode internally? We introduce LALS (Latent Association Leaning Score), a zero-shot metric that projects visual-token activations into the model's text-embedding space to measure concept associations per token and layer. Across 15 occupations, over 800 gender-ambiguous images, and four VLMs, internal representations and outputs are systematically decoupled: models often encode a female association internally yet output male. Layer-wise analysis reveals an asymmetric filter (male signal amplifies end-to-end while female signal peaks mid-network and is suppressed before generation), and a color ablation shows that culturally loaded visual cues such as clothing color further modulate these internal associations.

2605.31551 2026-06-01 cs.CV (version update)

SMART: SMPLest-X Mesh Adaptation and RAFT Tracking for Soccer Pose Estimation

Parthsarthi Rawat

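Institutions: GameChanger by Dick’s Sporting Goods

AI summary: Proposes SMART, which combines a finetuned SMPLest-X model with a RAFT optical-flow camera tracker, foot-plane anchoring, and related strategies to substantially reduce 3D pose estimation error in the FIFA Skeletal Tracking Challenge.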

Comments CVPR 2026 SoccerNet FIFA Skeleton Tracking Light Challenge, Rank 6

Abstract

We present our approach to the FIFA Skeletal Tracking Challenge 2026, which requires estimating 3D world-space poses of soccer players from broadcast video. Our method finetunes SMPLest-X (ViT-H, 687M parameters) via a stratified clip split, multi-task depth supervision, and broadcast augmentation, paired with a RAFT dense optical flow camera tracker, foot-plane anchoring, and two-pass temporal smoothing. Against the FIFA baseline score of 1.053 on the validation set, SMART achieves 0.647, a 38.6% improvement; on the held-out test set, SMART scores 0.593 (Global MPJPE: 0.324 m, Local MPJPE: 0.054 m).
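The two quantities in this abstract are simple to reproduce. The sketch below computes mean per-joint position error (MPJPE) as the average Euclidean distance between predicted and reference joints, and checks the quoted relative improvement from the two validation scores. It assumes a lower score is better; how the challenge combines global and local MPJPE into its score is not stated here, so that step is not modeled.

```python
import math
from typing import List, Tuple

Joint = Tuple[float, float, float]


def mpjpe(pred: List[Joint], ref: List[Joint]) -> float:
    """Mean per-joint position error: average Euclidean distance over joints."""
    return sum(math.dist(p, r) for p, r in zip(pred, ref)) / len(ref)


def relative_improvement(baseline: float, score: float) -> float:
    """Fractional reduction relative to the baseline (assumes lower is better)."""
    return (baseline - score) / baseline


# Validation scores quoted in the abstract: (1.053 - 0.647) / 1.053 is about 0.386.
improvement_percent = round(100 * relative_improvement(1.053, 0.647), 1)
```

Here `improvement_percent` matches the 38.6% figure reported above.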

2605.31539 2026-06-01 cs.CV cs.LG q-bio.QM (version update)

Automated Prediction of Postoperative Pancreatic Fistula Using Preoperative Computed Tomography

Ashok Choudhary, Chris Varghese, Leo Y. Li-Han, Frank G. Lee, Ellen L. Larson, Elizabeth B. Habermann, Cornelius A. Thiels, Hojjat Salehinejad

Institutions: Department of Surgery, Mayo Clinic, Rochester, MN, USA; Department of Surgery, University of Auckland, Auckland, NZ; Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA; Department of Artificial Intelligence

AI summary: Proposes an end-to-end deep learning pipeline, from pancreatic segmentation to classification, that uses preoperative CT scans to predict postoperative pancreatic fistula risk automatically, offering a tool for clinical decision-making and a methodological benchmark.

Abstract

Postoperative pancreatic fistula (POPF) is a serious complication after pancreatic resection, increasing morbidity, hospital stay, and healthcare costs. We present an automatic, end-to-end deep learning pipeline, from pancreatic segmentation to classification, for preoperative POPF risk estimation and stratification using preoperative CT scans. A dataset with auto-segmented pancreas volumes and surgical outcomes was used to evaluate multiple architectures, including a custom lightweight 3D CNN baseline (CNN3D), R(2+1)D ResNet-18, and ResNet-MC3-18 models. Evaluation across multiple 3D architectures demonstrated promising predictive performance. This approach offers a clinically valuable tool and a methodological benchmark for pancreas-specific CT classification, supporting improved preoperative decision-making in pancreatic surgery.

2605.31535 2026-06-01 cs.CV cs.AI cs.LG (version update)

RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Ulrich Prestel, Stefan Andreas Baumann, Nick Stracke, Björn Ommer

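Institutions: Munich Center for Machine Learning (MCML)

AI summary: Proposes RayDer, a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, giving self-supervised novel view synthesis clean power-law scaling and zero-shot open-set performance competitive with supervised methods.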

Comments Project Page: https://compvis.github.io/rayder

Abstract

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder
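Power-law scaling of the form y = a * x^b shows up as a straight line in log-log space, so the exponent can be estimated with ordinary least squares on the logarithms. The sketch below shows that standard fit. The function name is hypothetical and no numbers from the paper are used.

```python
import math
from typing import List, Tuple


def fit_power_law(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Least-squares fit of y = a * x**b via a straight line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the log-log line is the power-law exponent b.
    b = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y)) / sum(
        (u - mean_x) ** 2 for u in log_x
    )
    a = math.exp(mean_y - b * mean_x)
    return a, b
```

For example, synthetic points generated from y = 2 * x^(-0.5), such as (1, 2), (4, 1), and (16, 0.5), recover a close to 2 and b close to -0.5.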

2605.31534 2026-06-01 cs.CV cs.AI (version update)

Feature-Optimized Vision for Adaptive 3D Scene Reconstruction

Eric Liang

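Institutions: Oracle

AI summary: Proposes an adaptive feature-optimized vision front end that scores texture, repeatability, distinctiveness, expected triangulation angle, and spatial coverage to allocate a per-view feature budget, maximizing useful tracks and lowering reconstruction RMSE.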

Abstract

Three-dimensional scene reconstruction depends on local image evidence that is both visually discriminative and geometrically useful. Fixed feature thresholds and uniform feature budgets are easy to deploy, but they can waste computation on repeated texture, low-parallax regions, or unstable points. This paper proposes an adaptive feature-optimized vision front end for 3D reconstruction. The method scores candidate features by texture, repeatability, distinctiveness, expected triangulation angle, and spatial coverage, then allocates a per-view feature budget to maximize useful tracks under a fixed reconstruction pipeline. A small synthetic multi-view prototype evaluates four selection policies across corridor, facade, object-table, and cluttered scenes. Compared with random, texture-only, and uniform-grid baselines, the adaptive policy obtains the best quality-aware completeness and the lowest aggregate reconstruction RMSE while preserving broad image coverage. The result is not a replacement for modern learned matching or neural reconstruction systems; it is a modular front-end policy that can make classical and learned 3D pipelines more deliberate about which visual evidence they spend compute on.
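The scoring-then-budgeting step can be sketched as a weighted sum over the five cues followed by a top-k selection per view. The equal weights, the `Feature` fields normalized to [0, 1], and the greedy top-k rule are assumptions for illustration; the abstract does not give the actual scoring function or allocation rule.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Feature:
    name: str
    texture: float
    repeatability: float
    distinctiveness: float
    triangulation: float  # expected triangulation angle, normalized to [0, 1]
    coverage: float       # contribution to spatial coverage, normalized to [0, 1]


# Hypothetical equal weights over the five cues named in the abstract.
WEIGHTS: Dict[str, float] = {
    "texture": 0.2,
    "repeatability": 0.2,
    "distinctiveness": 0.2,
    "triangulation": 0.2,
    "coverage": 0.2,
}


def score(feature: Feature) -> float:
    """Weighted sum of the five cues."""
    return sum(weight * getattr(feature, cue) for cue, weight in WEIGHTS.items())


def select(features: List[Feature], budget: int) -> List[Feature]:
    """Keep the `budget` highest-scoring candidates for one view."""
    return sorted(features, key=score, reverse=True)[:budget]
```

A texture-only baseline corresponds to putting all the weight on `texture`, which is how the comparison against simpler policies can be framed.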

2605.31529 2026-06-01 cs.CV (version update)

SVI-Bench: A Dynamic Microworld for Strategic Video Intelligence

Yulu Pan, Han Yi, Seongsu Ha, Md Mohaiminul Islam, Benjamin Zhang, Lorenzo Torresani, Gedas Bertasius

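Institutions: UNC Chapel Hill; Northeastern University

AI summary: Presents SVI-Bench, a large-scale benchmark built on team sports as a dynamic microworld, whose 9 tasks across four pillars (Dynamic Scene Understanding, Causal Reasoning, Strategic Simulation, Agentic Synthesis) evaluate video intelligence from perception to strategic planning; models do well on perceptual tasks but degrade sharply at higher cognitive levels.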

Abstract

True video intelligence demands more than recognizing what is visible: it requires reasoning about why events unfold, predicting what would change under different conditions, and deciding what to do next. We refer to this progression, from perception through causal reasoning and simulation to strategic planning, as Strategic Video Intelligence (SVI). No existing benchmark evaluates this capability stack: in-the-wild videos lack verifiable ground truth for causal and strategic questions, while synthetic environments sacrifice the complexity of real multi-agent systems. To bridge this gap, we introduce SVI-Bench, a large-scale benchmark that leverages team sports as a dynamic microworld, combining the complexity of real-world multi-agent interaction (10-22 agents making coordinated decisions under adversarial pressure) with the verifiability of explicit rules and definitive outcomes. SVI-Bench comprises approximately 35K hours of broadcast video, 15M annotated actions, 15K hours of expert commentary, 23K game reports, and 103K structured statistical records across basketball, soccer, and hockey, all constructed via a data engine that transforms raw game data into a dense, cross-referenced corpus. We organize evaluation into 9 tasks spanning a progressive four-pillar hierarchy: Dynamic Scene Understanding, Causal Reasoning, Strategic Simulation, and Agentic Synthesis. Evaluating strong multimodal and agentic baselines, we find a capability cliff: models perform competently on perceptual tasks, achieving approximately 73% on fine-grained action QA, but degrade sharply at each successive cognitive level. Agentic tasks prove hardest: the strongest model achieves only 5% accuracy when required to autonomously gather and integrate evidence across a corpus of 1.8M clips.

2605.31513 2026-06-01 cs.CV 版本更新

Personalize Your Large Vision-language Models With In-context Prompt Tuning

用上下文提示调优个性化你的大型视觉语言模型

Yanshu Li, Jiaqian Li, Kuai Yu, Xi Xiao, Dongfang Liu, Tianyang Wang, Ruixiang Tang

发表机构 * Brown University(布朗大学) Columbia University(哥伦比亚大学) University of Alabama at Birmingham(阿拉巴马大学伯明翰分校) Purdue University(普渡大学) Rutgers University(罗格斯大学)

AI总结 提出上下文提示调优(ICPT)方法,通过轻量投影模块从多参考图像中提取细粒度视觉语义并转化为连续提示,结合几何正则化解决环境偏差和跨概念干扰,实现高效个性化。

Comments 27 pages, 10 figures, 5 tables

AI中文摘要

大型视觉语言模型(LVLMs)展示了强大的通用多模态能力,并越来越多地部署在下游系统中。这一趋势推动了对LVLM个性化的日益增长的兴趣,其目标是使模型能够快速有效地学习分布外的多模态概念,以满足用户特定需求。然而,许多现有方法依赖于推理时训练,降低了效率。它们也难以在复杂的多图像、多概念设置中保持准确性。这些限制制约了基于LVLM的系统的更广泛部署。因此,本文提出了上下文提示调优(ICPT)。具体来说,ICPT采用了一个轻量级投影模块,能够在复杂场景中操作,从多个参考图像中提取细粒度视觉语义,并将这些特征与身份标签映射无缝地转化为连续提示。为了最大化计算效率,该模块根据每个概念的内在视觉复杂性自适应地确定提示长度。关键的是,为了克服实际应用中普遍存在的环境偏差和跨概念干扰,我们引入了两种新颖的几何正则化。这些约束通过将关键身份与瞬态环境状态解耦,并分离概念以避免语义混淆,来优化提示表示。大量实验表明,ICPT在多种任务和LVLM骨干网络上实现了最先进的个性化准确性。

英文摘要

Large vision-language models (LVLMs) have demonstrated strong general multimodal capability and are increasingly deployed in downstream systems. This trend has driven growing interest in LVLM personalization, which aims to enable models to quickly and effectively learn out-of-distribution multimodal concepts to meet user-specific needs. However, many existing methods rely on inference-time training, which reduces efficiency. They also struggle to maintain accuracy in complex multi-image, multi-concept settings. These limitations restrict the broader deployment of LVLM-based systems. Therefore, this paper proposes in-context prompt tuning (ICPT). Specifically, ICPT employs a lightweight projection module capable of operating in complex scenarios to extract fine-grained visual semantics from multiple reference images, seamlessly transforming these features alongside identity-label mappings into continuous prompts. To maximize computational efficiency, this module adaptively determines the prompt length based on the intrinsic visual complexity of each concept. Crucially, to overcome the environmental biases and cross-concept interference prevalent in real-world applications, we introduce two novel geometric regularizations. These constraints refine prompt representations by decoupling key identities from transient environmental states and separating concepts to avoid semantic confusion. Extensive experiments show that ICPT achieves state-of-the-art personalization accuracy across diverse tasks and LVLM backbones.

2605.31508 2026-06-01 cs.CV 版本更新

Internalizing Temporal Consistency in Video Object-Centric Learning without Explicit Regularization

在没有显式正则化的情况下内化视频目标中心学习中的时间一致性

Rongzhen Zhao, Zhiyuan Li, Juho Kannala, Joni Pajarinen

发表机构 * Department of Electrical Engineering and Automation, Aalto University, Finland(阿尔托大学电气工程与自动化系) Department of Computer Science, Aalto University, Finland(阿尔托大学计算机科学系) Center for Machine Vision and Signal Analysis, University of Oulu, Finland(奥卢大学机器视觉与信号分析中心)

AI总结 提出一种无需显式时间一致性损失(SSC)的视频目标中心学习方法,通过时序通道分解(CCD)和跨时间重建(CTR)机制隐式学习时间一致性,提升训练效率和性能。

Comments 14 pages

AI中文摘要

视频目标中心学习(OCL)旨在将目标表示为slot向量并保持其在帧间的一致性。Slot-Slot对比(SSC)损失已成为最先进(SOTA)视频OCL方法的基石。虽然非常有效,但SSC依赖于帧间的一对一目标对应并引入额外损失。遵循奥卡姆剃刀原则,我们提出范式转变:时间一致性应作为隐式模型设计而非显式损失来加强。为了优雅地排除SSC(xSSC),我们引入了两种准零开销的协同机制:(i)时序通道分解(CCD)在结构上将slot表示沿通道维度分解为静态和动态子空间,作为经验统一的信息瓶颈;(ii)跨时间重建(CTR)通过融合当前slot的静态通道和目标slot的动态通道,随机重建当前或前一时间步的目标特征,使用单个标准OCL解码器并进行少量训练调整。因此,slot集合通过仅最小化标准重建误差而内在地学习时间一致性。大量实验表明,将xSSC集成到领先基线中不仅提高了训练效率,还在视频目标发现和识别任务上建立了新的SOTA。此外,我们的PCA和梯度分析证实了目标的时间不变语义和时间变化运动学被编码到所提出的子空间中。我们的源代码、模型检查点和训练日志可在https://github.com/Genera1Z/xSSC上获取。

英文摘要

Video Object-Centric Learning (OCL) aims to represent objects as slot vectors and maintain their consistency across frames. Slot-Slot Contrastive (SSC) loss has become the cornerstone of state-of-the-art (SOTA) video OCL methods. While highly effective, SSC relies on one-to-one object correspondence across frames and introduces an extra loss. Following Occam's Razor, we propose a paradigm shift: temporal consistency is better enforced as an implicit model design rather than an explicit loss. To elegantly exclude SSC (xSSC), we introduce two quasi-zero-overhead synergistic mechanisms: (i) Chrono-Channel Decomposition (CCD) structurally disentangles slot representations along the channel dimension into static and dynamic sub-spaces, serving as an empirically unified information bottleneck; (ii) Cross-Temporal Reconstruction (CTR) stochastically reconstructs target features of either the current or previous time step by fusing current slots' static channels and target slots' dynamic channels, using a single standard OCL decoder with minor training adaptation. Thereby, the slot sets inherently learn temporal consistency by minimizing the standard reconstruction error alone. Extensive experiments show that integrating xSSC into leading baselines not only improves training efficiency but also establishes new SOTAs on video object discovery and recognition tasks. Furthermore, our PCA and gradient analyses confirm that objects' time-invariant semantics and time-variant kinematics are encoded into the proposed sub-spaces. Our source code, model checkpoints and training logs are available at https://github.com/Genera1Z/xSSC.
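
The channel fusion described for CTR can be pictured with a small sketch. It is only an illustration of the idea stated in the abstract, not the authors' code: the split point `n_static`, the 50/50 choice between time steps, and the function names are all assumptions, and slots are assumed to be index-aligned across frames.

```python
import numpy as np

def cross_temporal_fuse(slots_cur, slots_tgt, n_static):
    """Keep the static channels of the current slots and take the
    dynamic channels from the target slots (hypothetical layout:
    the first n_static channels are static, the rest dynamic)."""
    static = slots_cur[:, :n_static]
    dynamic = slots_tgt[:, n_static:]
    return np.concatenate([static, dynamic], axis=1)

def pick_target(slots_cur, slots_prev, rng, p_prev=0.5):
    """Stochastically choose whether to reconstruct the previous or
    the current time step (p_prev is an assumed probability)."""
    use_prev = bool(rng.random() < p_prev)
    return (slots_prev if use_prev else slots_cur), use_prev

rng = np.random.default_rng(0)
slots_prev = rng.normal(size=(4, 8))   # 4 slots, 8 channels, frame t-1
slots_cur = rng.normal(size=(4, 8))    # same slots at frame t
target, use_prev = pick_target(slots_cur, slots_prev, rng)
fused = cross_temporal_fuse(slots_cur, target, n_static=5)
# `fused` would then be passed to the usual OCL decoder to reconstruct
# the features of the chosen target frame.
```

Because the static channels always come from the current frame, a reconstruction loss on the previous frame can only be met if those channels carry information that holds across both frames, which is the intuition behind learning consistency without an explicit loss.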

2605.31503 2026-06-01 cs.CV cs.LG 版本更新

How can embedding models bind concepts?

嵌入模型如何绑定概念?

Arnas Uselis, Darina Koishigarina, Seong Joon Oh

AI总结 本文研究视觉-语言嵌入模型(如CLIP)在概念绑定上的局限性,发现场景嵌入可加性分解为对象表示,但CLIP的高复杂度绑定函数阻碍了泛化,而通过充分数据训练的Transformer模型能学习低复杂度乘法交互绑定函数实现系统泛化。

Comments ICML 2026

AI中文摘要

人类在多物体场景中能轻松判断哪种颜色属于哪种形状,这种能力称为概念绑定。视觉-语言嵌入模型(如CLIP)在绑定时存在困难:它们能识别单个概念,但无法表示哪些概念构成哪些对象。尽管CLIP在跨模态检索中表现为概念袋(bag-of-concepts)模型,但对象信息可以从其图像和文本嵌入中分别恢复。我们通过绑定函数(将概念映射到场景嵌入)研究这种张力。我们发现场景嵌入可加性分解为对象表示,这解释了为何单模态探针能恢复对象信息。然而,CLIP的绑定函数具有高复杂度,这可能阻止图像和文本编码器学习共享的绑定机制,从而无法泛化到未见过的概念组合。然后我们探究这种局限性是否是根本性的。我们证明并非如此。在从零开始训练的受控Transformer模型中,只要数据覆盖足够充分,绑定泛化就会出现。这些模型学习到低复杂度的绑定函数,其特点是概念之间的乘法交互,从而实现系统泛化。代码公开于https://github.com/oshapio/binding-concepts-complexity。

英文摘要

Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models such as CLIP struggle with binding: they recognize individual concepts but fail to represent which concepts form which objects. Although CLIP behaves like a bag-of-concepts model in cross-modal retrieval, object information is recoverable from its image and text embeddings separately. We study this tension through the binding function, which maps concepts to scene embeddings. We find that scene embeddings decompose additively into object representations, explaining why uni-modal probes can recover object information. However, CLIP's binding function is high-complexity, which likely prevents the image and text encoders from learning a shared binding mechanism that generalizes to unseen concept combinations. We then ask whether this limitation is fundamental. We show that it is not. In controlled transformer models trained from scratch, binding generalization emerges with sufficient data coverage. These models learn low-complexity binding functions characterized by multiplicative interactions between concepts, enabling systematic generalization. Code is publicly available at https://github.com/oshapio/binding-concepts-complexity.

2605.31466 2026-06-01 cs.CV 版本更新

VolFill: Single-View Amodal 3D Scene Reconstruction with Volumetric Flow Matching

VolFill: 基于体素流匹配的单视图非模态3D场景重建

Tuan Duc Ngo, Chuang Gan, Evangelos Kalogerakis

发表机构 * UMass Amherst(马萨诸塞大学阿姆赫斯特分校) TU Crete(克里特理工大学)

AI总结 提出VolFill框架,利用混合3D VAE和潜在扩散Transformer从单张RGB图像生成完整3D场景结构,在SCRREAM和NRGB-D数据集上显著优于现有方法。

AI中文摘要

从单张RGB图像重建场景的完整几何形状仍然具有挑战性——尤其是在推断视觉证据不完整的隐藏结构时。我们提出了VolFill,一个生成框架,它预测完整场景的3D结构,而不是依赖传统的像素对齐回归。我们的方法利用混合3D VAE将稀疏截断无符号距离函数网格压缩为紧凑的潜在空间,并结合潜在扩散Transformer对该表示进行去噪以恢复完整场景。我们以几何基础模型为条件生成,利用丰富的空间先验进行稳健推理。与受限于逐射线约束或非结构化点云查询的现有方法不同,VolFill提供了一种结构化表示,支持直接表面提取和大规模占用查询。在SCRREAM和NRGB-D数据集上的大量实验表明,我们的方法显著优于当前基线,为整体空间理解提供了稳健的基础。

英文摘要

Reconstructing the complete geometry of a scene from a single RGB image remains challenging - especially when inferring hidden structures where visual evidence is incomplete. We introduce VolFill, a generative framework that predicts the 3D structure of the complete scene rather than relying on traditional pixel-aligned regression. Our method utilizes a hybrid 3D VAE to compress sparse truncated unsigned distance function grids into a compact latent space, paired with a latent Diffusion Transformer that denoises this representation to recover the complete scene. We condition the generation on geometry foundation models, leveraging rich spatial priors for robust reasoning. Unlike existing methods limited by per-ray constraints or unstructured point-cloud queries, VolFill provides a structured representation that supports direct surface extraction and occupancy queries at scale. Extensive experiments on the SCRREAM and NRGB-D datasets demonstrate that our approach significantly outperforms current baselines, providing a robust foundation for holistic spatial understanding.

2605.31457 2026-06-01 cs.CV 版本更新

VisionPulse: Dynamic Visual Sparsity for Efficient Multimodal Reasoning

VisionPulse: 用于高效多模态推理的动态视觉稀疏性

Hengbo Xu, Shengjie Jin, Yanbiao Ma, Zhiwu Lu

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China(中国人民大学高瓴人工智能学院,中国北京)

AI总结 提出VisionPulse框架,通过步骤级视觉令牌剪枝,利用视觉注意力质量估计保留预算,仅保留关键令牌,在几乎不损失准确率的情况下减少推理开销和推理轨迹长度。

Comments Accepted at ICML 2026

AI中文摘要

随着大型多模态模型(LMMs)的快速发展,推理时间开销已成为实际部署的关键瓶颈。现有方法通常在预填充阶段剪枝视觉令牌,假设推理过程中所需的视觉证据保持静态。然而,我们经验性地表明,视觉证据具有强烈的步骤依赖性:每个解码步骤只有稀疏的视觉令牌子集是关键,且关键集在推理过程中演变。此外,我们识别出一个耦合瓶颈,其中冗余的视觉上下文可能将模型引向与查询无关的区域,从而延长推理轨迹。受这些洞察的指导,我们提出VisionPulse,一种推理过程中的步骤级视觉令牌剪枝框架。VisionPulse计算轻量级的视觉注意力质量,通过利用其与LMMs有效视觉令牌使用的强正相关性来估计步骤级保留预算,并在此预算下仅保留最关键的令牌。通过在推理过程中强制视觉稀疏性,VisionPulse过滤冗余的视觉上下文,同时保留相关的视觉证据,自然缩短推理轨迹。大量实验表明,VisionPulse每步仅保留5%的视觉令牌,推理轨迹缩短11.2%,同时保持准确率几乎不变。

英文摘要

With the rapid advancement of large multimodal models (LMMs), inference-time overhead has become a key bottleneck for real-world deployment. Existing methods typically prune visual tokens at prefill, assuming the required visual evidence remains static during reasoning. However, we empirically show that visual evidence is strongly step-dependent: only a sparse subset of visual tokens is critical at each decoding step, and the critical set evolves across reasoning. Furthermore, we identify a coupled bottleneck where redundant visual context can steer the model toward query-irrelevant regions, lengthening the reasoning trace. Guided by these insights, we propose VisionPulse, a framework for step-wise visual token pruning during reasoning. VisionPulse computes a lightweight visual attention mass, uses its strong positive correlation with LMMs' effective visual token usage to estimate a step-wise retention budget, and retains only the most critical tokens under this budget. By enforcing visual sparsity during reasoning, VisionPulse filters redundant visual context while preserving relevant visual evidence, naturally shortening reasoning traces. Extensive experiments show that VisionPulse retains only 5% of visual tokens per step and shortens reasoning traces by 11.2%, while keeping accuracy almost unchanged.
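
A rough sketch of what step-wise pruning driven by attention mass could look like follows. The abstract does not say how attention mass is turned into a budget, so the linear mapping with `scale`, the function name, and the toy attention vector are assumptions made purely for illustration.

```python
import math
import numpy as np

def prune_visual_tokens(attn, visual_idx, scale=0.5):
    """attn: attention weights of the current decoding step over all
    context tokens (sums to 1). visual_idx: positions of visual tokens.
    Returns the visual token positions kept for this step and the mass."""
    visual_attn = attn[visual_idx]
    mass = float(visual_attn.sum())          # visual attention mass in [0, 1]
    n_vis = len(visual_idx)
    # Hypothetical budget rule: more attention on vision -> keep more tokens.
    budget = min(n_vis, max(1, math.ceil(scale * mass * n_vis)))
    top = np.argsort(visual_attn)[::-1][:budget]   # highest-attention tokens
    return np.sort(visual_idx[top]), mass

# Toy step: tokens 0..7 are visual, tokens 8..9 are text.
attn = np.array([0.02, 0.30, 0.01, 0.05, 0.02, 0.20, 0.00, 0.00, 0.25, 0.15])
visual_idx = np.arange(8)
kept, mass = prune_visual_tokens(attn, visual_idx)
```

The selection is recomputed at every decoding step, so a token dropped at one step can be kept at a later one; this is what distinguishes the step-wise setting from pruning once at prefill.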

2605.31429 2026-06-01 cs.CV 版本更新

YARD: Y-Architecture Register Decoding for Efficient Hallucination Mitigation in Large Vision-Language Models

YARD: Y型架构寄存器解码用于大型视觉语言模型中的高效幻觉缓解

Ting Chen, Geng Li, Guohao Chen, Yu Hu, Guan Huang, Mai Chen, Langsheng Lei, Jun Du

发表机构 * Guangdong University of Technology(广东工业大学) Nanyang Technological University(南洋理工大学) Shenzhen TENCLASS Technology Co., Ltd.(深圳TENCLASS科技有限公司)

AI总结 提出YARD框架,通过Y型架构共享浅层计算并在中间层分支,用寄存器令牌替换视觉令牌构建退化分支,实现无需训练的对比解码,有效缓解幻觉并降低推理延迟。

Comments 21 pages, 11 figures

AI中文摘要

对比解码(CD)旨在通过对比标准模型和视觉退化模型的输出分布来缓解大型视觉语言模型(LVLM)中的幻觉。然而,现有的免训练CD方法存在次优的退化分支:完全丢弃视觉令牌过于极端并导致语言幻觉,而破坏输入图像对视觉证据的控制粗糙且由于需要两次完整前向传播而导致高推理延迟。为了解决这些问题,我们提出了YARD,一种免训练的Y型架构寄存器解码框架。受可靠文本到视觉定位主要出现在中间解码器层的观察启发,YARD通过共享浅层计算并恰好在此关键阶段分支,在内部构建退化分支。对于退化分支,YARD用寄存器令牌替换补丁级视觉令牌,这些令牌保留了全局图像语义但缺乏细粒度局部证据。这种图像感知但局部欠基础的设计提供了忠实的对比信号,没有极端模态不匹配,同时Y型架构严格避免了昂贵的第二次前向传播。在生成性和判别性幻觉基准上的大量实验表明,YARD在多个LVLM上一致实现了最先进的幻觉缓解,同时显著降低了推理延迟。

英文摘要

Contrastive decoding (CD) seeks to mitigate hallucinations in Large Vision-Language Models (LVLMs) by contrasting the output distributions of a standard model and a visually degraded model. However, existing training-free CD methods suffer from sub-optimal degraded branches: completely dropping visual tokens is too extreme and induces language hallucinations, while corrupting input images offers coarse control over visual evidence and suffers from high inference latency due to requiring two full forward passes. To address these dilemmas, we propose YARD, a training-free Y-Architecture Register Decoding framework. Motivated by the observation that reliable text-to-vision grounding predominantly emerges in the middle decoder layers, YARD constructs the degraded branch internally by sharing shallow-layer computations and branching exactly at this critical stage. For the degraded branch, YARD replaces patch-level visual tokens with register tokens, which preserve global image semantics but lack fine-grained local evidence. This image-aware yet locally under-grounded design provides a faithful contrastive signal without extreme modality mismatch, while the Y-architecture strictly avoids a costly second forward pass. Extensive experiments on generative and discriminative hallucination benchmarks demonstrate that YARD consistently achieves state-of-the-art hallucination mitigation across multiple LVLMs, alongside a significant reduction in inference latency.
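
For readers unfamiliar with contrastive decoding, the final scoring step is often written as a weighted difference of the two branches' log-probabilities with a plausibility cutoff. The sketch below shows that generic form only; how YARD builds its degraded branch (register tokens, mid-layer branching) is not reproduced here, and `alpha`, `beta`, and the toy logits are assumed values.

```python
import numpy as np

def log_softmax(z):
    z = z - z.max()
    return z - np.log(np.exp(z).sum())

def contrastive_decode(logits_std, logits_deg, alpha=1.0, beta=0.1):
    """Pick the next token by contrasting a standard branch with a
    visually degraded branch (generic CD scoring, assumed form)."""
    lp_std = log_softmax(logits_std)
    lp_deg = log_softmax(logits_deg)
    scores = (1.0 + alpha) * lp_std - alpha * lp_deg
    # Plausibility constraint: only consider tokens whose standard-branch
    # probability is at least beta times that of the most likely token.
    keep = lp_std >= np.log(beta) + lp_std.max()
    scores = np.where(keep, scores, -np.inf)
    return int(np.argmax(scores))

# Token 0 is liked by both branches (language prior); token 1 is liked
# only when full visual evidence is present; token 2 is implausible.
logits_std = np.array([2.0, 1.9, -3.0])
logits_deg = np.array([2.0, 0.5, 3.0])
choice = contrastive_decode(logits_std, logits_deg)
```

In this toy case greedy decoding on the standard branch would pick token 0, while the contrastive score favors token 1, the one whose support disappears when local visual evidence is removed.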

2605.31426 2026-06-01 eess.IV cs.CV math.OC 版本更新

Self-Tuning Regularization for Image Scanning Microscopy

图像扫描显微镜的自调谐正则化

Sofia Agostoni, Lisa Cuneo, Christian Daniele, Giacomo Garré, Laurent Le, Alessandro Zunino, Giuseppe Vicidomini, Luca Calatroni

发表机构 * MaLGa Center, DIBRIS, University of Genoa(MaLGa中心,DIBRIS,热那亚大学) MMS, Istituto Italiano di Tecnologia (IIT)(MMS,意大利技术研究院(IIT)) CSML, Istituto Italiano di Tecnologia (IIT)(CSML,意大利技术研究院(IIT))

AI总结 针对图像扫描显微镜(ISM)的多图像反卷积(MID)和超分辨率切片ISM(s²ISM)重建,提出一种自调谐显式正则化框架,通过贝叶斯最大后验公式结合多帧泊松数据保真项与ℓ1或平滑全变分惩罚,并基于残差白化原则自适应选择正则化参数,无需经验停止准则,在低光子条件下实现稳定超分辨和光学切片。

AI中文摘要

图像扫描显微镜(ISM)是一种荧光成像技术,它结合探测器阵列采集和计算重建,实现理想共聚焦显微镜(即使用无穷小针孔)的理论分辨率,同时保持高信噪比。在获得超分辨图像的重建方法中,多图像反卷积(MID)及其旨在保持共聚焦显微镜光学切片能力的扩展(称为超分辨率切片ISM,s²ISM)是最广泛使用的方法之一。这两种方法都依赖于Richardson-Lucy型迭代方案,其半收敛行为需要提前停止,并且常常导致噪声放大和重建伪影。在这项工作中,我们为MID和s²ISM重建引入了一个自调谐显式正则化框架。在贝叶斯最大后验公式中,我们将多帧泊松数据保真项与显式正则化相结合,考虑ℓ1和平滑全变差惩罚作为代表性例子。我们进一步通过将残差白化原则适应于多帧泊松设置,并引入针对s²ISM定制的频谱高通扩展,开发了一种自动且无需真实值的正则化参数选择策略。由此产生的框架无需经验停止规则即可实现稳定重建。为了演示所提出的框架,我们考虑了基于近端梯度和镜像下降方法的一阶优化方案,并采用自适应回溯策略。在模拟和真实荧光ISM数据集上的实验表明,与无正则化方法相比,重建稳定性和图像质量得到改善,同时在低光子条件下实现了鲁棒的超分辨率和光学切片。

英文摘要

Image Scanning Microscopy (ISM) is a fluorescence imaging technique that combines detector-array acquisition and computational reconstruction to achieve the theoretical resolution of an ideal confocal microscope, i.e., one operating with an infinitesimally small pinhole, while maintaining high signal-to-noise ratio. Among the reconstruction methods for obtaining the super-resolved image, multi-image deconvolution (MID) and its extension aimed at preserving the optical sectioning capability of confocal microscopy, known as super-resolution sectioning ISM (s²ISM), are among the most widely used approaches. Both methods rely on Richardson-Lucy-type iterative schemes, whose semi-convergent behavior requires early stopping and often leads to noise amplification and reconstruction artifacts. In this work, we introduce a self-tuning explicit regularization framework for both MID and s²ISM reconstruction. Within a Bayesian maximum a posteriori formulation, we combine a multi-frame Poisson data fidelity term with explicit regularization, considering ℓ1 and smoothed total variation penalties as representative examples. We further develop an automatic and ground-truth-free strategy for regularization parameter selection by adapting the residual whiteness principle to the multi-frame Poisson setting and introducing a spectral high-pass extension tailored to s²ISM. The resulting framework enables stable reconstructions without empirical stopping rules. To demonstrate the proposed framework, we consider first-order optimization schemes based on proximal gradient and mirror descent methods with adaptive backtracking strategies. Experiments on simulated and real fluorescence ISM datasets demonstrate improved reconstruction stability and image quality with respect to unregularized approaches, while enabling robust super-resolution and optical sectioning in low-photon conditions.
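
As background, the unregularized baseline the abstract refers to can be sketched as a multi-frame Richardson-Lucy iteration on a 1D toy problem. This shows only the classical scheme that the paper sets out to stabilize, not the proposed regularized method; the circular convolution model, the Gaussian PSFs, and all names are assumptions for illustration.

```python
import numpy as np

def conv_circ(x, h):
    """Circular convolution of x with PSF h via the FFT."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.fft.fft(h)))

def corr_circ(x, h):
    """Adjoint of conv_circ: circular correlation with h."""
    return np.real(np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(h))))

def gaussian_psf(n, sigma, shift=0):
    k = np.arange(n)
    d = np.minimum(k, n - k)                 # circular distance to index 0
    h = np.exp(-0.5 * (d / sigma) ** 2)
    return np.roll(h / h.sum(), shift)       # one shifted PSF per detector

def poisson_kl(x, frames, psfs, eps=1e-12):
    """Multi-frame Poisson data fidelity (generalized KL divergence)."""
    total = 0.0
    for y, h in zip(frames, psfs):
        m = conv_circ(x, h) + eps
        total += np.sum(m - y + y * np.log((y + eps) / m))
    return float(total)

def multi_frame_rl(frames, psfs, n_iter=100, eps=1e-12):
    x = np.full(frames[0].shape, np.mean(frames))       # flat start
    norm = sum(corr_circ(np.ones_like(x), h) for h in psfs)
    for _ in range(n_iter):
        ratio = sum(corr_circ(y / (conv_circ(x, h) + eps), h)
                    for y, h in zip(frames, psfs))
        x = np.maximum(x * ratio / norm, 0.0)            # multiplicative update
    return x

n = 64
x_true = np.ones(n)                    # background
x_true[20] += 50.0
x_true[26] += 30.0
psfs = [gaussian_psf(n, 2.0, s) for s in (-1, 0, 1)]
frames = [np.maximum(conv_circ(x_true, h), 0.0) for h in psfs]
x0 = np.full(n, np.mean(frames))
x_hat = multi_frame_rl(frames, psfs)
```

With noisy data, running this loop for too long amplifies noise, which is why the number of iterations acts as an implicit regularizer in practice; an explicit penalty with an automatically chosen weight removes the need to pick a stopping point by hand.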

2605.31400 2026-06-01 cs.CV 版本更新

FSM-Net: An Efficient Frequency-Spatial Network for Real-World Deblurring

FSM-Net:一种用于真实世界去模糊的高效频率-空间网络

Vinh-Thuan Ly

发表机构 * University of Science, VNU-HCM(胡志明市国家大学自然科学大学) Vietnam National University, Ho Chi Minh City(胡志明市国家大学)

AI总结 提出FSM-Net,通过频率注意力模块和交叉门控视觉E-Branchformer实现高效双域去模糊,在NTIRE 2026挑战赛中获得第二名。

Comments Accepted to NTIRE Workshop at CVPR 2026. Project page: https://efficient-deblurring-fsmnet.vercel.app

AI中文摘要

真实世界图像去模糊要求高保真恢复和计算效率,现有方法往往难以平衡。本文提出FSM-Net(频率-空间多分支网络),一种高效解决方案,在NTIRE 2026高效真实世界去模糊挑战赛中获得第二名。FSM-Net开创了双域方法:新颖的频率注意力模块通过FFT显式恢复高频结构细节,而瓶颈处的交叉门控视觉E-Branchformer以线性复杂度捕获全局依赖。为确保鲁棒收敛,我们采用由复合损失函数(多尺度Charbonnier、结构边缘和频率)引导的渐进课程训练策略。在RSBlur基准上评估,FSM-Net仅用4.94M参数和159.35 GMACs(1920x1200分辨率)即达到33.144 dB PSNR的出色性能。通过有效推动效率与质量的帕累托前沿,FSM-Net为资源受限的图像恢复建立了强基线。

英文摘要

Real-world image deblurring demands both high-fidelity restoration and computational efficiency, a balance existing methods often struggle to achieve. In this paper, we propose FSM-Net (Frequency-Spatial Multi-branch Network), a highly efficient solution that secured 2nd place in the NTIRE 2026 Challenge on Efficient Real-World Deblurring. FSM-Net pioneers a dual-domain approach: a novel Frequency Attention module explicitly recovers high-frequency structural details via FFT, while a Cross-Gated Vision E-Branchformer at the bottleneck captures global dependencies with linear complexity. To ensure robust convergence, we employ a progressive curriculum training strategy guided by a composite loss function (Multi-Scale Charbonnier, Structural Edge, and Frequency). Evaluated on the RSBlur benchmark, FSM-Net achieves an outstanding 33.144 dB PSNR with only 4.94M parameters and 159.35 GMACs (at 1920x1200 resolution). By effectively pushing the Pareto frontier of efficiency and quality, FSM-Net establishes a strong baseline for resource-constrained image restoration.
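
Two of the loss terms named in the abstract have widely used textbook forms, sketched below. The abstract gives no weights or exact definitions, so the 2x2 average-pooling pyramid, `eps`, the loss weights, and the omission of the structural edge term are assumptions of this illustration.

```python
import numpy as np

def charbonnier(pred, target, eps=1e-3):
    """Smooth L1-like penalty: sqrt(diff^2 + eps^2), averaged."""
    return float(np.mean(np.sqrt((pred - target) ** 2 + eps ** 2)))

def downsample2(img):
    """2x2 average pooling (assumed pyramid construction)."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def multi_scale_charbonnier(pred, target, n_scales=3):
    total = 0.0
    for _ in range(n_scales):
        total += charbonnier(pred, target)
        pred, target = downsample2(pred), downsample2(target)
    return total / n_scales

def frequency_l1(pred, target):
    """L1 distance between the 2D Fourier spectra of the two images."""
    return float(np.mean(np.abs(np.fft.rfft2(pred) - np.fft.rfft2(target))))

def composite_loss(pred, target, w_freq=0.1):
    # w_freq is a hypothetical weight; an edge term would be added similarly.
    return multi_scale_charbonnier(pred, target) + w_freq * frequency_l1(pred, target)

rng = np.random.default_rng(0)
sharp = rng.random((16, 16))
blurry = 0.5 * (sharp + np.roll(sharp, 1, axis=1))   # crude horizontal blur
loss_same = composite_loss(sharp, sharp)
loss_blur = composite_loss(blurry, sharp)
```

The frequency term penalizes missing high-frequency content directly, which a pixel-space loss alone tends to under-weight.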

2605.31376 2026-06-01 cs.RO cs.CV cs.GR 版本更新

LiftNav: Path Planning via Semantic Lifting in TSDF-Guided Gaussian Splatting

LiftNav: TSDF引导的高斯泼溅中的语义提升路径规划

Hannah Schieber, Dominik Frischmann, Victor Schaack, Angela P. Schoellig, Daniel Roth

发表机构 * Technical University of Munich(慕尼黑工业大学) Human-Centered Computing and Extended Reality Lab(以人为本计算与扩展现实实验室) TUM University Hospital(慕尼黑工业大学附属医院) Clinic for Orthopedics and Sports Orthopedics(骨科与运动骨科诊所) Munich Institute of Robotics and Machine Intelligence (MIRMI)(慕尼黑机器人与机器智能研究所) Learning Systems and Robotics Lab(学习系统与机器人实验室)

AI总结 提出LiftNav混合导航框架,结合TSDF+GS双地图、YOLO检测、TSDF三维提升和B样条轨迹优化,实现无需密集三维嵌入的灵活语义导航,并通过铰链损失碰撞惩罚提升轨迹平滑性和安全性,在Replica数据集仿真中实现100%可行性和更短轨迹。

AI中文摘要

未知室内环境中的自主机器人需要可靠的碰撞避免和对象级理解。经典表示如TSDF支持安全规划但缺乏语义,而像高斯泼溅(GS)这样的逼真方法提供丰富外观但存在软几何问题,限制了精确的障碍物避免。我们提出LiftNav,一个基于GSFusion的TSDF+GS双地图构建的混合导航框架,并增强了基于YOLO的检测、基于TSDF的三维提升和B样条轨迹优化的实时流水线。该设计实现了无需密集三维嵌入的灵活语义导航。我们进一步引入了一种基于铰链损失的碰撞惩罚,提高了轨迹平滑性和安全性。我们在使用Replica数据集的仿真中评估了我们的方法。与最先进的辐射场基线相比,我们展示了100%的可行性和更短的轨迹。

英文摘要

Autonomous robots in unknown indoor environments require both reliable collision avoidance and object-level understanding. Classical representations such as TSDF support safe planning but lack semantics, while photorealistic methods like Gaussian Splatting (GS) provide rich appearance yet suffer from soft geometry, limiting precise obstacle avoidance. We present LiftNav, a hybrid navigation framework built on GSFusion's TSDF+GS dual map, augmented with a real-time pipeline of YOLO-based detection, TSDF-based 3D lifting, and B-spline trajectory optimization. This design enables flexible semantic navigation without dense 3D embeddings. We further introduce a hinge-loss-based collision penalty that improves trajectory smoothness and safety. We evaluate our approach in simulation using the Replica dataset. Compared against a state-of-the-art radiance field baseline, it achieves a 100% feasibility rate and shorter trajectories.
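
A hinge-style collision penalty is commonly written as a cost that is zero beyond a safety margin and grows linearly inside it. The sketch below shows that generic form on distances sampled along a trajectory; the margin `d_safe`, the plain (non-squared) hinge, and the function name are assumptions rather than details taken from the paper.

```python
import numpy as np

def hinge_collision_penalty(distances, d_safe=0.3):
    """distances: signed distances (metres) to the nearest surface, e.g.
    looked up in a TSDF at points sampled along the trajectory.
    Only points closer than d_safe contribute to the penalty."""
    violation = np.maximum(0.0, d_safe - np.asarray(distances, dtype=float))
    return float(violation.sum())

# Two samples are safe, one is inside the margin, one is inside an obstacle.
penalty = hinge_collision_penalty([1.0, 0.5, 0.2, -0.1])
clear = hinge_collision_penalty([1.0, 0.5, 0.4])
```

Because the penalty vanishes once every sample clears the margin, the optimizer is free to spend the remaining effort on smoothness and path length instead of pushing the trajectory ever farther from obstacles.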

2605.31369 2026-06-01 cs.LG cs.CV 版本更新

A Unifying View of Variational Generative Wasserstein Flows

变分生成式Wasserstein流的统一视角

Paul Caucheteux, Clément Bonet, Anna Korba

发表机构 * CMAP, CNRS, École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France(CMAP,法国国家科学研究中心,巴黎综合理工学院,巴黎理工学院,法国帕莱索)

AI总结 本文提出生成式Wasserstein流(GWF)的统一理论框架,将多种现有生成模型视为f-散度目标的参数化JKO方案实例,并扩展至积分概率度量与最大均值差异,推导新算法并阐明与GAN的联系。

Comments Accepted as a spotlight at ICML2026

AI中文摘要

许多现代生成模型可视为最小化概率分布之间的散度,但它们依赖于不同的算法和几何原理。Wasserstein梯度流为优化分布提供了连续时间形式,可通过Jordan-Kinderlehrer-Otto(JKO)方案的隐式离散化来近似。在这项工作中,我们提出了一个基于Wasserstein梯度流的生成建模统一理论框架,称为生成式Wasserstein流(GWF)。我们表明,一大类现有方法可以推导为f-散度目标的参数化JKO方案实例,并建立了几个最近提出的算法之间的等价性。我们将此框架扩展到f-散度之外,涵盖积分概率度量和平方最大均值差异,推导了新的基于JKO的生成算法,并阐明了它们与GAN的联系。我们通过实验研究了JKO正则化对广泛目标的影响。最后,我们分析了参数化Wasserstein流,其中动力学限制在由参数化映射诱导的分布上。

英文摘要

Many modern generative models can be viewed as minimizing divergences between probability distributions, yet they rely on different algorithmic and geometric principles. Wasserstein gradient flows provide a continuous-time formulation for optimizing over distributions, and can be approximated through their implicit discretization via the Jordan-Kinderlehrer-Otto (JKO) scheme. In this work, we present a unified theoretical framework for generative modeling based on Wasserstein gradient flows, which we refer to as Generative Wasserstein Flows (GWF). We show that a broad class of existing methods can be derived as instances of parametric JKO schemes for f-divergence objectives, and we establish equivalences between several recently proposed algorithms. We extend this framework beyond f-divergences to Integral Probability Metrics and squared Maximum Mean Discrepancy, deriving new JKO-based generative algorithms, and clarifying their connections with GANs. We study empirically the impact of the JKO regularization for a wide set of objectives. Finally, we analyze parametric Wasserstein flows, where the dynamics are restricted to distributions induced by parametrized maps.
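
For reference, the JKO scheme mentioned above is usually stated as the following proximal step in Wasserstein space. The notation (step size $\tau$, objective $\mathcal{F}$, target $\pi$, latent distribution $\rho$, map $T_\theta$) is chosen here for illustration and may differ from the paper's.

```latex
% One JKO step: implicit (proximal) discretization of the Wasserstein
% gradient flow of an objective F, e.g. F(mu) = D_f(mu || pi).
\mu_{k+1} \in \operatorname*{arg\,min}_{\mu \in \mathcal{P}_2(\mathbb{R}^d)}
  \; \mathcal{F}(\mu) + \frac{1}{2\tau}\, W_2^2(\mu, \mu_k)

% A parametric variant restricts the minimization to pushforwards of a
% fixed latent distribution rho by a parametrized map T_theta:
\theta_{k+1} \in \operatorname*{arg\,min}_{\theta}
  \; \mathcal{F}\big((T_\theta)_{\#}\rho\big)
  + \frac{1}{2\tau}\, W_2^2\big((T_\theta)_{\#}\rho,\, (T_{\theta_k})_{\#}\rho\big)
```

The quadratic Wasserstein term keeps each new distribution close to the previous one, and letting $\tau \to 0$ recovers the continuous-time gradient flow.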

2605.31351 2026-06-01 cs.CL cs.CV 版本更新

A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation

面向VLM-as-a-Judge评估的视障辅助基准

Yi Zhao, Siqi Wang, Zhe Hu, Yushi Li, Jing Li

发表机构 * Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算机系)

AI总结 针对视障辅助任务中VLM-as-a-Judge评估的可靠性问题,提出VIABLE基准(含30万+样本、有效性-公正性-稳定性框架及12种失败模式分类),发现现有模型不可靠,并开发VIA-Judge-Agent方法提升诊断准确性和用户偏好。

AI中文摘要

基于AI的视障辅助(VIA)仍然具有挑战性,主要原因是人工评估成本高昂。VLM-as-a-Judge范式可能提供一种有前景的替代方案,尽管该范式主要在通用领域得到研究。因此,我们质疑此类评判者是否可以在VIA任务中值得信赖。为探究这一问题,我们引入了VIABLE(面向VLM-as-a-Judge评估的视障辅助基准),这是首个用于VIA中VLM-as-a-Judge评估的基准。VIABLE包含超过30万个判断样本,涵盖三种场景,并引入了一个包含12种失败模式分类的有效性-公正性-稳定性框架。基于VIABLE,我们对七个不同模型规模的评判者进行了系统研究,结果表明现有模型在所有评估轴上基本不可靠。最强的评判者GPT-5.4仅达到52.6%的单故障诊断准确率,却表现出最高的自我偏好率(94.2%);而开源评判者存在严重偏差且对抗性脆弱。为解决这些问题,我们提出了VIA-Judge-Agent,一种与模型无关的推理时增强方法,通过视觉证据提取和基于分类的工作流来增强评判者。该方法在诊断准确性和下游VIA响应(更受BLV用户青睐)方面实现了积极改进。数据和代码可在 https://github.com/YiyiyiZhao/VIABLE 获取。

英文摘要

AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general domains. We therefore ask whether such judges can be trusted for VIA tasks. To investigate this question, we introduce VIABLE (Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation), the first benchmark for VLM-as-a-Judge evaluation in VIA. VIABLE contains over 300K judgment samples across three scenarios and introduces an Effectiveness-Impartiality-Stability framework with a 12-mode failure taxonomy. Based on VIABLE, our systematic study of seven judges across different model scales shows that existing models are largely unreliable across all evaluation axes. The strongest judge, GPT-5.4, achieves only 52.6% single-failure diagnostic accuracy, yet exhibits the highest self-preference rate at 94.2%, while open-source judges are strongly biased and adversarially fragile. To address these issues, we propose VIA-Judge-Agent, a model-agnostic inference-time harness that augments judges with visual evidence extraction and a taxonomy-guided workflow. It improves diagnostic accuracy and yields downstream VIA responses that BLV users prefer. Data and code are available at: https://github.com/YiyiyiZhao/VIABLE

2605.31349 2026-06-01 cs.CL cs.AI cs.CV cs.MM 版本更新

FBHM: Functional Benchmarking and Steering of VLMs for Hateful Meme Detection

FBHM:用于仇恨模因检测的功能性基准测试与视觉语言模型引导

Paramananda Bhaskar, Naquee Rizwan, Daksh Jogchand, Saurabh Kumar Pandey, Animesh Mukherjee

发表机构 * Indian Institute of Technology (IIT), Kharagpur(印度理工学院(IIT)卡拉格浦尔) Microsoft(微软)

AI总结 针对现有基准无法因果评估视觉语言模型漏洞的问题,提出基于25种修辞功能和10个目标社区构建的FBHM基准,并采用可学习引导向量(LSV)在极低数据量下提升模型性能约30个Macro-F1点。

AI中文摘要

仇恨模因检测对于视觉语言模型仍是一个严峻挑战,因为现有基准在结构上是观察性的——混淆了修辞仇恨机制与目标社区特征,并阻碍了对模型漏洞的因果评估。为解决这一问题,我们引入了FBHM,一个系统策划的基于功能的仇恨模因基准,沿两个正交轴构建:25种不同的修辞功能和10个目标社区(总共5,000个模因)。对最先进的视觉语言模型进行基准测试揭示了一个严重的泛化差距:在标准数据集上高度准确的模型在FBHM上灾难性地下降到接近随机性能,证明它们利用了数据集特定的启发式方法而非稳健的多模态推理。为了高效缩小这一差距,我们提出了LSV(可学习引导向量),一种超低数据量策略,在仅500个引导样本(50个独特基础模因)上应用因果干预目标,将FBHM性能提升约30个Macro-F1点,同时优于上下文学习和PEFT,且不降低源域性能。

英文摘要

Hateful meme detection remains a formidable challenge for vision-language models, as existing benchmarks are structurally observational - confounding rhetorical hate mechanisms with target community features and preventing causal evaluation of model vulnerabilities. To address this, we introduce FBHM, a systematically curated benchmark of Functionality Based Hateful Memes constructed along two orthogonal axes: 25 distinct rhetorical functionalities and 10 target communities (5,000 memes total). Benchmarking state-of-the-art VLMs reveals a severe generalization gap: models highly accurate on standard datasets catastrophically drop to near-random performance on FBHM, proving they exploit dataset-specific heuristics rather than robust multimodal reasoning. To efficiently close this gap, we propose LSV (learnable steering vectors), an ultra-low data regime strategy that applies a causal intervention objective on as few as 500 steering samples (50 unique base memes), boosting FBHM performance by ~30 Macro-F1 points while outperforming in-context learning and PEFT without degrading source-domain performance.

2605.31336 2026-06-01 cs.CV 版本更新

DecMem: Towards Minute-Long Consistent World Generation with Decoupled Memory

DecMem:基于解耦记忆的分钟级一致世界生成

Zhenhao Yang, Xiaoshi Wu, Zhengyao Lv, Xiaoyu Shi, Xintao Wang, Pengfei Wan, Kun Gai, Kwan-Yee K. Wong

发表机构 * The University of Hong Kong(香港大学) Kling Team, Kuaishou Technology(快手科技 Kling 团队)

AI总结 提出解耦记忆架构DecMem,通过稀疏全局记忆和锚定局部记忆解决长程视频生成中的时空一致性问题,实现分钟级可控长视频生成。

Comments Project page is available at https://jeffreyyzh.github.io/DecMem-Page

AI中文摘要

近期视频生成模型的进展推动了可控世界模型的快速发展。然而,在长程推理下保持细粒度时空一致性仍是一个关键挑战。在这项工作中,我们超越了显式3D记忆和粗粒度的帧级隐式建模,提出了一种细粒度、可学习且可扩展的记忆用于一致世界生成。我们首先识别了朴素可学习记忆架构在长程外推中的两个基本限制,即计算效率低下和注意力分散。通过对注意力分散的系统分析,我们提出了DecMem,一种解耦记忆架构,采用稀疏全局记忆实现对全局历史的高效细粒度访问,以及锚定局部记忆实现稳定高质量的外推。大量实验表明,DecMem显著优于当前最先进的方法。通过确保精确高效的长时记忆并实现卓越的外推能力,DecMem实现了分钟级可控长视频生成,具有高保真度和一致性。

英文摘要

Recent advances in video generative models have promoted rapid progress in controllable world models. However, maintaining fine-grained spatio-temporal consistency under long-horizon reasoning remains a key challenge. In this work, we move beyond explicit 3D memory and coarse frame-level implicit modeling, and propose a fine-grained, learnable, and scalable memory for consistent world generation. We first identify two fundamental limitations of naïve learnable memory architectures in long-horizon extrapolation, namely computational inefficiency and attention dispersion. Through a systematic analysis of attention dispersion, we propose DecMem, a decoupled memory architecture that employs Sparse Global Memory for efficient fine-grained access to global history and Anchored Local Memory for stable and high-quality extrapolation. Extensive experiments demonstrate that DecMem significantly outperforms current state-of-the-art methods. By ensuring precise and efficient long-term memory and achieving superior extrapolation capabilities, DecMem enables minute-level controllable long video generation with high fidelity and consistency.

2605.31312 2026-06-01 cs.CV cs.CL 版本更新

Learning from Fine-Grained Visual Discrepancies: Mitigating Multimodal Hallucinations via In-Context Visual Contrastive Optimization

从细粒度视觉差异中学习:通过上下文视觉对比优化缓解多模态幻觉

Haolin Deng, Xin Zou, Zhiwei Jin, Chen Chen, Haonan Lu, Xuming Hu

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) OPPO AI Center(OPPO AI中心)

AI总结 提出上下文视觉对比优化(IC-VCO)方法,通过共享多图像上下文中的对比图像确保数学严谨的目标,并引入视觉对比蒸馏(VCDist)和对比样本编辑策略,有效缓解多模态幻觉。

Comments ICML 2026

AI中文摘要

多模态幻觉仍然是视觉语言模型(VLM)面临的持续挑战。标准的文本直接偏好优化(DPO)由于缺乏显式的视觉监督,往往无法缓解这一问题。虽然现有工作通过将原始图像与负样本对比引入了视觉偏好DPO,但由于配分函数不匹配导致目标在理论上不一致,并且依赖可能引发捷径学习的粗粒度负样本。在这项工作中,我们提出了上下文视觉对比优化(IC-VCO)。通过将对比图像置于共享的多图像上下文中,IC-VCO确保了数学上严谨的目标。我们进一步引入了视觉对比蒸馏(VCDist),一种辅助的可靠性门控正则化器,鼓励多图像对比训练与单图像推理之间的一致性。最后,我们提出了一种对比样本编辑策略,通过精确的语义扰动生成困难负样本。在五个基准上的实验表明,IC-VCO取得了最佳的整体性能,并且我们的样本编辑策略有效。代码和数据可在 https://github.com/OPPO-Mente-Lab/IC-VCO 获取。

英文摘要

Multimodal hallucination remains a persistent challenge for Vision-Language Models (VLMs). Standard textual Direct Preference Optimization (DPO) often fails to mitigate it due to a lack of explicit visual supervision. While existing works introduce visual preference DPO by contrasting original images against negative ones, they suffer from a theoretically inconsistent objective caused by partition function mismatches and rely on coarse-grained negatives that could enable shortcut learning. In this work, we propose In-Context Visual Contrastive Optimization (IC-VCO). By placing contrastive images within a shared multi-image context, IC-VCO ensures a mathematically rigorous objective. We further introduce Visual Contrast Distillation (VCDist), an auxiliary reliability-gated regularizer that encourages consistency between multi-image contrastive training and single-image inference. Finally, we propose a contrastive sample editing strategy that generates hard negatives via precise semantic perturbations. Experiments on five benchmarks demonstrate IC-VCO's best overall performance and the effectiveness of our sample editing strategy. Code and data are available at https://github.com/OPPO-Mente-Lab/IC-VCO.
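
As context for the objective being modified, the standard textual DPO loss can be computed from sequence log-probabilities as sketched below. This is the well-known baseline formulation, not IC-VCO itself; the function name, `beta`, and the example numbers are assumptions.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss for a batch of (chosen, rejected) pairs.
    logp_*: log-probabilities under the policy; ref_logp_*: under the
    frozen reference model. Both responses share the same prompt."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(margin)) written stably as log(1 + exp(-margin)).
    return float(np.mean(np.logaddexp(0.0, -margin)))

ref_w = np.array([-12.0, -8.5])
ref_l = np.array([-11.0, -9.0])
loss_init = dpo_loss(ref_w, ref_l, ref_w, ref_l)               # policy == reference
loss_better = dpo_loss(ref_w + 2.0, ref_l - 2.0, ref_w, ref_l)  # prefers chosen more
```

In the standard derivation the prompt-dependent partition function cancels because the chosen and rejected responses are conditioned on the same input. When the pair instead differs in the image, the two conditioning contexts differ and that cancellation is no longer automatic, which appears to be the mismatch the abstract refers to.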

2605.31304 2026-06-01 cs.LG cs.CV 版本更新

Interpretability Without Tradeoffs: Disentangling Polysemanticity At Equal Predictive Performance

无权衡的可解释性:在同等预测性能下解开多义性

Doğukan Bağcı, Bernt Schiele, Simone Schaub-Meyer, Jonas Fischer, Robin Hesse

发表机构 * Max Planck Institute for Informatics(马克斯·普朗克信息研究所) Department of Computer Science, TU Darmstadt(达姆施塔特工业大学计算机科学系)

AI总结 提出ELUDe方法,通过无损重组层间信息流,在不改变模型输出的前提下将多义神经元分解为单义特征,提升深度神经网络的可解释性。

Comments Preprint

AI中文摘要

深度神经网络(DNN)被广泛使用,但解释它们实际学到什么仍然困难。一个主要障碍是单个神经元通常编码多个不相关的概念,模糊了网络的决策过程。虽然先前的工作,如稀疏自编码器,可以将这些混合信号分离成更有意义的“单义”特征,但这通常需要以可能降低下游性能的方式改变模型。为了克服这一点,我们引入了ELUDe(显式、无损、无监督解缠),一种在保持功能等价性的同时提高DNN可解释性的方法。ELUDe将潜在表示分解为清晰、可检查的子单元,这些子单元表现得像可解释的特征,同时保证模型的输出保持完全相同。它不需要显式训练,不需要标签,并且可以应用于预训练模型。ELUDe通过重组层间信息流的方式工作,重新路由特定概念的贡献,同时通过构造保留原始计算。在多个视觉模型上,包括DINOv2和有监督的ViT-B/16,ELUDe提高了可解释性,保持下游准确性不变,运行高效,并支持实际用途,如引导模型表示。简而言之,ELUDe提供了(几乎)没有权衡的可解释性:更清晰、可扩展且可操作的模型洞察,且性能无损失。

英文摘要

Deep neural networks (DNNs) are widely used, but interpreting what they actually learn remains difficult. A major obstacle is that individual neurons often encode multiple unrelated concepts, obscuring the decision process of the network. While prior work, such as sparse autoencoders, can separate these mixed signals into more meaningful, "monosemantic" features, this typically requires altering the model in ways that can degrade downstream performance. To overcome this, we introduce ELUDe (explicit, lossless, unsupervised disentanglement), a method for improving the interpretability of DNNs while preserving their functional equivalence. ELUDe breaks latent representations into clear, inspectable sub-units that behave like interpretable features, while guaranteeing that the model's outputs remain exactly the same. It requires no explicit training, no labels, and can be applied to pretrained models. ELUDe works by reorganizing how information flows between layers, re-routing concept-specific contributions while preserving the original computation by construction. Across several vision models, including DINOv2 and supervised ViT-B/16, ELUDe improves interpretability, keeps downstream accuracy unchanged, runs efficiently, and supports practical uses such as steering model representations. In short, ELUDe offers interpretability (almost) without a tradeoff: clearer, scalable, and actionable model insights with no loss in performance.

2605.31302 2026-06-01 eess.IV cs.CV eess.SP 版本更新

MoE-dqINR: A Unified Mixture-of-Experts Implicit Neural Representation Framework for Scan-Specific Dynamic and Quantitative MRI Reconstruction

MoE-dqINR:用于特定扫描动态和定量MRI重建的统一混合专家隐式神经表示框架

Yinzhe Wu, Fanwen Wang, Zhenxuan Zhang, Zi Wang, Chengyan Wang, Guang Yang

Affiliations: Department of Bioengineering and I-X, Imperial College London; Cardiovascular Research Centre, Royal Brompton Hospital; National Heart and Lung Institute, Imperial College London; School of Biomedical Engineering & Imaging Sciences, King’s College London; Shanghai Pudong Hospital and Human Phenome Institute, Fudan University; International Human Phenome Institute (Shanghai), Shanghai, China

AI summary: Proposes MoE-dqINR, a framework that uses shared spatial experts and a state-conditioned routing pathway for efficient, unified scan-specific multicoil dynamic and quantitative MRI reconstruction, with an optimization time of about 30 s per scan.

详情
AI中文摘要

欠采样磁共振成像(MRI)重建旨在从不完整的多线圈k空间数据中恢复时间或对比度变化的图像序列,同时为动态和定量MRI(qMRI)保留状态相关的保真度。现有的特定扫描隐式神经表示(INR)通常使用单一的时空坐标场、显式子空间、运动或变形模型、校准变量或序列特定的定量信号模型。这些设计选择在跨采集状态适应图像合成的同时,限制了共享空间信息的灵活性。此外,许多基于INR的基线方法计算量大,通常需要每个扫描数百到数千秒的优化时间。我们提出MoE-dqINR,一种特定扫描的多线圈MRI重建框架,将图像域表示分解为共享空间专家和状态条件路由路径。空间专家编码可重用的坐标相关图像内容,而路由权重(以有序采集状态为条件)从公共专家库合成每个动态帧或对比状态。该表示与多线圈MRI前向模型耦合,使用归一化状态索引驱动动态和定量MRI中的路由。通过将共享空间表示与状态相关合成分离,该框架为动态和定量MRI提供了一种以图像为先的架构,同时在我们的实验中将特定扫描INR优化减少到每扫描约30秒。所提出的公式建立了状态条件混合专家INR作为特定扫描多线圈MRI重建先验,统一了共享空间表示、动态和qMRI特定合成以及实际每扫描效率。

英文摘要

Undersampled magnetic resonance imaging (MRI) reconstruction seeks to recover temporally or contrast-varying image series from incomplete multicoil k-space data while preserving state-dependent fidelity for dynamic and quantitative MRI (qMRI). Existing scan-specific implicit neural representations (INRs) often use monolithic spatiotemporal coordinate fields, explicit subspaces, motion or deformation models, calibration variables, or sequence-specific quantitative signal models. These design choices can limit flexibility in sharing spatial information while adapting image synthesis across acquisition states. Moreover, many INR-based baselines remain computationally demanding, typically requiring per-scan optimization times on the order of hundreds to thousands of seconds. We propose MoE-dqINR, a scan-specific multicoil MRI reconstruction framework that factorizes the image-domain representation into shared spatial experts and a state-conditioned routing pathway. Spatial experts encode reusable coordinate-dependent image content, whereas routing weights, conditioned on ordered acquisition states, synthesize each dynamic frame or contrast state from a common expert bank. The representation is coupled to a multicoil MRI forward model and uses the normalized state index to drive routing in both dynamic and quantitative MRI. By separating shared spatial representation from state-dependent synthesis, the framework provides an image-first architecture for dynamic and quantitative MRI while reducing scan-specific INR optimization to approximately 30 s per scan in our experiments. The proposed formulation establishes state-conditioned mixture-of-experts INR as a scan-specific multicoil MRI reconstruction prior that unifies shared spatial representation, dynamic- and qMRI-specific synthesis, and practical per-scan efficiency.
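The factorization described in this abstract (shared spatial experts mixed by state-conditioned routing weights) can be illustrated with a toy sketch. Everything below is an assumption made for illustration: the Gaussian-blob experts, the linear router, the softmax, and the names `EXPERTS`, `routing_weights`, and `render_pixel` are hypothetical and are not taken from the paper, which uses neural networks and a multicoil forward model that are not modeled here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical "spatial experts": each maps a coordinate (x, y) to an
# intensity and has no access to the acquisition state.
EXPERTS = [
    lambda x, y: math.exp(-((x - 0.3) ** 2 + (y - 0.5) ** 2) / 0.02),
    lambda x, y: math.exp(-((x - 0.7) ** 2 + (y - 0.5) ** 2) / 0.02),
    lambda x, y: 0.1,  # constant background
]


def routing_weights(state, num_states):
    """Map an ordered state index to one mixing weight per expert."""
    # Normalized state index in [0, 1], as the abstract describes.
    s = state / (num_states - 1) if num_states > 1 else 0.0
    # Toy linear router (an assumption): early states favor expert 0,
    # late states favor expert 1.
    return softmax([4.0 * (1.0 - s), 4.0 * s, 0.0])


def render_pixel(x, y, state, num_states):
    """Synthesize one pixel of one frame from the shared expert bank."""
    weights = routing_weights(state, num_states)
    return sum(w * expert(x, y) for w, expert in zip(weights, EXPERTS))
```

In this sketch only `routing_weights` depends on the state, so every frame reuses the same spatial content and differs only in how the experts are mixed.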

2605.31294 2026-06-01 cs.CV 版本更新

TokTalk: Expressive Real-time Facial Animation from Audio-LLM Tokens

TokTalk: 基于音频-大语言模型令牌的富有表现力的实时面部动画

Qingcheng Zhao, Yifang Pan, Karan Singh

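Affiliations: University of Toronto

AI summary: Proposes TokTalk, a system that generates expressive 3D facial animation in real time directly from the audio tokens produced by Audio-LLMs, achieving low latency and high quality through a chunk-based conditional flow matching model and a lightweight adaptation strategy.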

详情
AI中文摘要

近期GPT-4o等音频-大语言模型的进展开启了与语言模型对话交互的新时代。然而,对话式虚拟角色在面部表情和对话流程上仍显机械,部分原因在于其顺序执行语音识别、文本生成、轮次文本响应、语音合成和音频驱动面部动画等多个阶段。基于当前音频-大语言模型产生的音频令牌包含足够信息以重建合理面部表现这一洞察,我们提出TokTalk,一个直接从流式音频令牌实时输出富有表现力面部动画的系统。我们构建了一个新颖的音频令牌到3D面部运动数据集,并使用基于分块的条件流匹配模型训练TokTalk。一种轻量级适配策略使我们的训练模型能够以极小的计算开销无缝连接到任何基于令牌的音频-大语言模型。我们的分块处理进一步实现了延迟与面部质量之间的参数化权衡,并通过消融研究进行了验证。我们还表明,TokTalk的实时性能在延迟上与现有技术解决方案相当,而在3D面部表现的质量、表现力和可控性方面(通过感知研究)显著更优。我们通过聊天机器人虚拟角色、语音驱动的用户虚拟角色和动画导演界面展示了TokTalk在多种音视频面部应用中的灵活性。

英文摘要

Recent advances in Audio-LLMs like GPT-4o have ushered in an era of conversational interaction with language models. Conversational avatars, however, still seem robotic in facial expression and conversational flow, in part due to sequential stages of speech recognition, text generation, turn-based text response, speech synthesis, and audio-driven facial animation. Based on our insight that audio-tokens produced by current Audio-LLMs carry sufficient information to reconstruct a plausible facial performance, we present TokTalk, a system that directly outputs expressive facial animation in real time from streaming audio-tokens. We construct a novel audio-token to 3D facial motion dataset, on which TokTalk is trained using a Chunk-based Conditional Flow Matching model. A lightweight adaptation strategy allows our trained model to seamlessly connect to any token-based Audio-LLM at minimal computational overhead. Our chunk-based processing further enables a parametric trade-off between latency and facial quality, shown through ablation studies. We further show that the real-time performance of TokTalk is comparable in latency to prior art solutions, and significantly more favorable (according to a perceptual study) in terms of quality, expressivity and control of the 3D facial performance. We showcase TokTalk's flexibility using a chatbot Avatar, a voice-driven user Avatar, and an animation Director's interface, as diverse audio-visual face applications.

2605.31292 2026-06-01 cs.CV 版本更新

Authentication of Copy Detection Patterns via Cross-Camera Dual-Synthetic Referencing

复制检测模式的跨相机双合成参考认证

Ivan Oleksiyuk, Roman Chaban, Slava Voloshynovskiy

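AI summary: Proposes an enrolment-based cross-camera dual-synthetic referencing framework in which a deep-learning translator jointly uses the digital template and the enrolled capture to generate a high-quality reference image, countering printer stochasticity and camera distortions and improving the authentication performance of copy detection patterns.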

Comments To appear in Proc. ICIP2026, September 13-17, 2026, Tampere, Finland

详情
AI中文摘要

复制检测模式(CDP)是打印在物理对象上的结构,用于实现经济高效的认证。验证通过将捕获图像与打印CDP的数字模板进行比较来完成。在实践中,打印机的随机性和相机失真阻碍了这种比较,限制了对抗伪造的鲁棒性。先前的工作通过在验证相机域中合成参考图像来解决相机效应,但忽略了打印变异性。我们引入了一种基于注册的跨相机双合成参考框架。每个打印的CDP首先由受控的注册相机捕获,然后一个基于深度学习的翻译器联合利用数字模板和注册捕获,为验证图像生成高质量的参考。我们提供了信息论上的证明,表明双参考比基于模板的参考包含更多信息。在异构移动相机上的实验表明,认证性能得到提升,对基于机器学习的复制攻击具有鲁棒性,并且能够从小CDP区域和低端设备上进行可靠验证。

英文摘要

Copy Detection Patterns (CDPs) are structures printed on physical objects to enable cost-effective authentication. Verification is achieved by comparing a captured image with the digital template from which the CDP was printed. In practice, printer stochasticity and camera distortions hinder this comparison, limiting robustness against counterfeiting. Prior work addressed camera effects by synthesising reference images in the verification camera domain, but it ignored printing variability. We introduce an enrolment-based cross-camera dual-synthetic referencing framework. Each printed CDP is first captured by a controlled enrolment camera, and a deep-learning-based translator jointly exploits the digital template and the enrolled capture to generate a high-quality reference for the verification image. We provide an information-theoretic justification showing that the dual reference is more informative than template-based references. Experiments on heterogeneous mobile cameras demonstrate improved authentication performance, robustness to machine-learning-based copy attacks, and reliable verification from small CDP regions and on low-end devices.
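One standard way to see why a second reference cannot reduce the available information is the chain rule for mutual information. This is a generic sketch and not necessarily the argument used in the paper. The notation is assumed here: X is the verification image of an authentic print, T is the digital template, and E is the enrolled capture.

```latex
I(X; T, E) \;=\; I(X; T) + I(X; E \mid T) \;\ge\; I(X; T)
```

Equality holds only if X and E are conditionally independent given T. Because the enrolled capture records the print-specific randomness that the template alone cannot describe, the term I(X; E | T) is plausibly strictly positive for authentic prints.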

2605.31284 2026-06-01 cs.CV cs.AI 版本更新

SAM for Robust Mitochondria Instance Segmentation in Fluorescence Microscopy

SAM 用于荧光显微镜中鲁棒的线粒体实例分割

Suyog Jadhav, Dilip K. Prasad, Krishna Agarwal

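Affiliations: UiT The Arctic University of Norway

AI summary: Addresses the scarcity of real annotated data by fine-tuning SAM exclusively on synthetic fluorescence microscopy data, improving precision and average Dice score for mitochondria instance segmentation.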

Comments Accepted at PHAROS-AIF-MIH workshop @ CVPR 2026

详情
AI中文摘要

荧光显微镜(FM)中线粒体的形态分析对于理解细胞健康、能量产生和代谢调节至关重要。虽然像 Segment Anything Model (SAM) 这样的基础模型已经革新了自然图像分割,但由于衍射受限分辨率、低对比度和复杂的重叠细胞器网络,它们直接应用于 FM 受到显著领域偏移的阻碍。此外,鲁棒模型的开发因严重缺乏高质量、手动标注的线粒体实例分割数据集而受阻。在本文中,我们提出了一种可扩展的解决方案,通过仅在合成生成的 FM 数据上微调 SAM 来解决数据稀缺问题。我们模拟真实的线粒体数据并模拟荧光显微镜的光学特性,以创建大规模标注数据集。我们在一个精心策划的真实手动标注 FM 图像数据集上评估了我们的微调模型。定性和定量分析表明,我们的合成微调模型在精度和平均 Dice 分数上优于强基线。这项工作确立了模拟辅助训练在 FM 实例分割中的潜力。

英文摘要

The morphological analysis of mitochondria in fluorescence microscopy (FM) is crucial for understanding cellular health, energy production, and metabolic regulation. While foundation models like the Segment Anything Model (SAM) have revolutionized natural image segmentation, their direct application to FM is hindered by a significant domain shift characterized by diffraction-limited resolution, low contrast, and complex overlapping organelle networks. Furthermore, the development of robust models is bottlenecked by a severe lack of high-quality, manually annotated instance segmentation datasets for mitochondria. In this paper, we propose a scalable solution to this data scarcity by fine-tuning SAM exclusively on synthetically generated FM data. We simulate realistic mitochondria data and emulate the optical properties of fluorescence microscopes to create a large-scale annotated dataset. We evaluate our fine-tuned model on a curated dataset of real, manually annotated FM images. Qualitative and quantitative analyses demonstrate that our synthetically fine-tuned model improves precision and average Dice score over strong baselines. This work establishes the potential of simulation-assisted training for FM instance segmentation.
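For readers unfamiliar with the two reported metrics, the sketch below computes a pixel-level Dice score and precision for binary masks. It assumes masks are flat sequences of 0/1 values; the abstract does not state the exact evaluation protocol (for example, how predicted instances are matched to annotated ones), so the function names and the pixel-level setting are illustrative assumptions.

```python
def dice_score(pred, target):
    """Dice = 2 * |A ∩ B| / (|A| + |B|) for two binary masks (flat 0/1 sequences)."""
    if len(pred) != len(target):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    size_sum = sum(1 for p in pred if p) + sum(1 for t in target if t)
    if size_sum == 0:
        return 1.0  # convention chosen here: two empty masks agree perfectly
    return 2.0 * intersection / size_sum


def precision(pred, target):
    """Fraction of predicted foreground pixels that are truly foreground."""
    true_positives = sum(1 for p, t in zip(pred, target) if p and t)
    predicted_positives = sum(1 for p in pred if p)
    return true_positives / predicted_positives if predicted_positives else 0.0
```

For `pred = [1, 1, 0, 0]` and `target = [1, 0, 1, 0]`, one pixel overlaps and each mask has two foreground pixels, so both functions return 0.5.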

2605.31283 2026-06-01 cs.CV 版本更新

Topologically Consistent Multi-view 3D Head Reconstruction via Coarse-Guided Layered Surface Sampling

基于粗引导分层表面采样的拓扑一致多视图三维头部重建

Timo Bolkart, Daoye Wang, Prashanth Chandran

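Affiliations: Google

AI summary: Proposes SHELLS, a framework that decouples feature extraction from mesh resolution through a hierarchical sampling strategy, enabling efficient, topologically consistent multi-view 3D head reconstruction that generalizes to real-world captures despite being trained only on synthetic data.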

Comments SIGGRAPH Conference Papers 2026

详情
AI中文摘要

我们提出SHELLS(分层局部采样的语义头部估计),一种高效的前馈框架,用于从多视图图像中重建具有密集语义对应的三维头部。现有方法通常通过局部特征体素独立细化顶点,这种方法将内存密集的特征采样与网格分辨率耦合,限制了密集拓扑(>1万顶点)的可扩展性并引入表面噪声。相比之下,SHELLS通过分层采样策略将特征提取与网格分辨率解耦。我们使用带有LoRA适配的DINOv2骨干网络提取多视图特征,投影采样稀疏全局特征云,并预测中间粗网格。该粗先验指导构建分层、表面感知的采样壳,作为最终重建的离散搜索空间。SHELLS保持表面一致性,同时推理GPU内存比体积基线减少88%(2.4GB vs. 20GB)。对于1.8万顶点的网格,它将中位配准误差降低21%至29%,推理速度提升3.5倍(0.08s vs. 0.29s)。值得注意的是,我们的模型仅在合成数据上训练,却能有效泛化到真实世界捕获,消除了先前工作中常见的昂贵预注册多视图数据集的需求。

英文摘要

We present SHELLS (Semantic Head Estimation via Layered Local Sampling), an efficient feed-forward framework for 3D head reconstruction in dense semantic correspondence from multi-view images. Existing methods typically refine vertices independently via localized feature volumes. This approach couples memory-intensive feature sampling to mesh resolution, which limits scalability for dense topologies (> 10k vertices) and introduces surface noise. In contrast, SHELLS decouples feature extraction from mesh resolution via a hierarchical sampling strategy. We extract multi-view features using a DINOv2 backbone with LoRA adaptation, projectively sample a sparse global feature cloud, and predict an intermediate coarse mesh. This coarse prior guides the construction of layered, surface-aware sampling shells that serve as a discrete search space for the final reconstruction. SHELLS maintains surface consistency while using 88% less inference GPU memory (2.4GB vs. 20GB) than volumetric baselines. It reduces median registration error by 21% to 29% with a 3.5x inference speedup (0.08s vs. 0.29s) for 18k-vertex meshes. Notably, our model is trained exclusively on synthetic data yet generalizes effectively to real-world captures, eliminating the need for the costly, pre-registered multi-view datasets common in prior work.

2605.31271 2026-06-01 cs.CV 版本更新

DriveMA: Driving Vision-Language-Action Models with verifiable Meta-Actions

DriveMA:基于可验证元动作的驾驶视觉-语言-动作模型

Weicheng Zheng, Yixin Huang, Qiao Sun, Derun Li, Hang Zhao

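Affiliations: Shanghai Qi Zhi Institute; Tsinghua University; Tongji University

AI summary: Proposes DriveMA, a framework that bridges the language-action gap with verifiable meta-actions and combines action-centric supervised training with reinforcement learning for end-to-end driving planning, achieving state-of-the-art performance on the Waymo Open Dataset.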

Comments arXiv admin note: text overlap with arXiv:2605.21273

详情
AI中文摘要

驾驶视觉-语言-动作模型(Driving VLAs)旨在利用语言改进端到端规划,但语言-动作差距限制了这一前景。我们提出DriveMA,一个基于可验证元动作的Driving VLA框架,该元动作将未来自我运动总结为紧凑的语言域意图,并可通过轨迹接地标注流水线从专家轨迹构建,以及通过基于规则的投影对生成轨迹进行验证。DriveMA利用这种可验证性,采用以动作为中心的监督训练和数据高效的回合级信用分配强化学习框架,通过密集奖励和精确信用分配明确地将高层决策与低层轨迹规划对齐。DriveMA在Waymo Open Dataset基于视觉的端到端驾驶上设立了新的最先进水平,2B模型获得8.060的评分者反馈分数,4B模型进一步提升至8.079;同时在NAVSIM上获得了具有竞争力的闭环规划性能。这些结果表明,即使是一个简单的元动作接口,在可验证并针对语言-动作对齐优化后,也能实现最先进的规划。代码、数据和模型将公开发布以促进未来研究。

英文摘要

Driving Vision-Language-Action Models (Driving VLAs) aim to use language to improve end-to-end planning, but the language-action gap limits this promise. We propose DriveMA, a Driving VLA framework built on verifiable meta-actions. Meta-actions summarize future ego motion into compact language-domain intentions; they can be constructed from expert trajectories with a trajectory-grounded annotation pipeline and verified against generated trajectories through rule-based projection. DriveMA exploits this verifiability with action-centric supervised training and a data-efficient turn-level credit assignment reinforcement learning framework, explicitly aligning high-level decisions with low-level trajectory planning through dense rewards and precise credit assignment. DriveMA sets a new state of the art on the Waymo Open Dataset Vision-based E2E Driving, achieving a Rater Feedback Score of 8.060 with a 2B model and further improving it to 8.079 with a 4B model; it also obtains competitive closed-loop planning performance on NAVSIM. These results show that even a simple meta-action interface can achieve state-of-the-art planning when made verifiable and optimized for language-action alignment. Code, data, and models will be released to facilitate future research.

2605.31266 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Envisioning Beyond the Few: Disentangled Semantics and Primitives for Few-Shot Atypical Layout-to-Image Generation

超越少数:用于少样本非典型布局到图像生成的解耦语义与基元

Nan Bao, Yifan Zhao, Wenzhuang Wang, Jia Li

Affiliations: State Key Laboratory of Virtual Reality Technology and Systems; School of Computer Science and Engineering; Qingdao Research Institute, Beihang University, China

AI summary: Targets representation fragmentation in few-shot atypical layout-to-image generation by disentangling semantics from visual details through Semantic Anchoring and Primitive Imbuing, enabling robust few-shot adaptation.

Comments Accepted to ICML 2026; code available at https://github.com/iCVTEAM/DSP

详情
AI中文摘要

布局到图像(L2I)任务通过对象类别和空间布局实现对图像生成的细粒度控制。然而,现有的L2I方法在少样本非典型设置下会产生碎片化和扭曲的生成结果。我们将这种失败称为表示碎片化,源于将语义身份与视觉细节纠缠在一起的粒度不匹配。为了解决这个问题,我们提出了一种表示驱动的框架,将语义与基元解耦,以实现鲁棒的少样本适应。具体来说,语义锚定将类别语义聚合到锚点中以实现稳定的身份,而基元注入则建模可重新组合的基元以实现鲁棒的局部细节建模。概念引导进一步通过显著性感知目标调节优化,以保持前景语义一致性。大量实验表明,在5样本设置下,我们的方法在视觉保真度和跨不同非典型领域的对齐方面,均优于最先进的L2I方法。源代码公开于 https://github.com/iCVTEAM/DSP。

英文摘要

The layout-to-image (L2I) task enables fine-grained control over image generation via object categories and spatial layouts. However, existing L2I methods yield fragmented and distorted generations under few-shot atypical settings. We term this failure representation fragmentation; it arises from a granularity mismatch that entangles semantic identity with visual details. To address this issue, we propose a representation-driven framework that disentangles semantics from primitives for robust few-shot adaptation. Specifically, Semantic Anchoring aggregates categorical semantics into anchors for stable identity, while Primitive Imbuing models recomposable primitives for robust local detail modeling. Conceptual Steering further regulates optimization with a saliency-aware objective to preserve foreground semantic consistency. Extensive experiments demonstrate consistent improvements in the 5-shot regime over state-of-the-art L2I methods in both visual fidelity and alignment across diverse atypical domains. The source code is publicly available at https://github.com/iCVTEAM/DSP.

2605.31251 2026-06-01 cs.CV cs.AI 版本更新

ERGeoBench: A Comprehensive Benchmark for Embodied Reasoning and Geo-localization in Multimodal Large Language Models

ERGeoBench:多模态大语言模型中具身推理与地理定位的综合基准

Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou, Kaoyan Lu, Yu Feng, Yifan Zhu, Haoran Luo

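Affiliations: Beijing University of Posts and Telecommunications; State Key Laboratory of Networking and Switching Technology; School of Materials Science and Engineering; China Mobile Research Institute; College of Computing and Data Science

AI summary: Proposes the ERGeoBench benchmark, which evaluates multimodal large language models on vision-driven embodied geo-localization under three progressive settings (single-view, panorama-view, and embodied-view), and finds that current models handle high-level geographic semantic reasoning well but still fall short on fine-grained perception, metric localization, and spatial consistency across views.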

详情
AI中文摘要

多模态大语言模型(MLLMs)作为具身代理展现出强大潜力,然而由于缺乏细粒度评估,具身地理定位仍未被充分探索。我们引入ERGeoBench,一个用于视觉驱动的具身地理定位的诊断基准。ERGeoBench在三种渐进设置下评估模型——单视图、全景视图和具身视图——其中代理可以通过偏航、俯仰和缩放的顺序变化主动获取观察。该基准包含2,207个全球分布的街景全景图,并衡量四种互补能力:基础感知、空间意识、常识推理和地理定位推理。对领先的专有和开源MLLMs的评估表明,当前模型能够推断高层次的地理语义,但在细粒度感知操作、度量定位和跨视图空间一致性方面仍然困难。我们进一步观察到,地理定位与其他能力维度强相关,表明准确定位依赖于集成的感知、空间推理和常识推理,而非孤立的视觉识别。总体而言,ERGeoBench为诊断和推进类人具身地理定位提供了一个统一框架。项目页面:https://kaixuewen.github.io/ERGeoBench/

英文摘要

Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through sequential changes in yaw, pitch, and zoom. The benchmark contains 2,207 globally distributed street-view panoramas and measures four complementary capabilities: foundational perception, spatial awareness, common sense reasoning, and geo-localization reasoning. Evaluations of leading proprietary and open-source MLLMs show that current models can infer high-level geographic semantics, but still struggle with fine-grained perceptual operations, metric localization, and spatial consistency across views. We further observe that geo-localization is strongly correlated with the other capability dimensions, suggesting that accurate localization depends on integrated perception, spatial reasoning, and commonsense inference rather than isolated visual recognition. Overall, ERGeoBench provides a unified framework for diagnosing and advancing human-like embodied geo-localization. Project Page: https://kaixuewen.github.io/ERGeoBench/

2605.31246 2026-06-01 cs.CR cs.CV 版本更新

BadBone: Backdoor Attacks Against Backbone Models in Visual Prompt Learning

BadBone:视觉提示学习中针对骨干模型的后门攻击

Ziqing Yang, Rui Wen, Xinlei He, Yun Shen, Michael Backes, Yang Zhang

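Affiliations: CISPA Helmholtz Center for Information Security; Institute of Science Tokyo; Wuhan University; Flexera

AI summary: Proposes BadBone, a stealthy and adaptive backdoor attack based on bi-level optimization that compromises the backbone model so that downstream prompt learning tasks inherit the backdoor vulnerability; experiments show that existing defenses are largely ineffective.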

Comments Accepted by IEEE Transactions on Information Forensics & Security

详情
AI中文摘要

提示学习是一种新的机器学习范式,因其简单性和有效性而受到广泛关注。尽管其应用日益增多,但该范式的安全漏洞仍未被充分探索。在这项工作中,我们率先提出BadBone,一种利用双层优化的隐蔽自适应后门攻击,针对提示学习。我们的目标不是对提示学习过程植入后门,而是破坏骨干模型,使得只有采用提示学习的目标下游任务继承后门漏洞。在三个不同模型和来自不同领域的三个数据集上的大量实验表明,我们的定向/非定向后门模型在保持预训练和下游任务实用性的同时,实现了高攻击性能。此外,我们针对六种最先进的模型级防御(包括Neural Cleanse、ABS、MNTD、NAD、CLP和D-BR)评估了我们的方法。结果表明,这些防御对我们的后门模型基本无效,因此有效的防御仍是未来工作的重要方向。

英文摘要

Prompt learning is a new machine learning paradigm that has attracted ample attention due to its simplicity and proven efficacy. Despite its growing adoption, the security vulnerabilities associated with this paradigm remain underexplored. In this work, we take a first step by proposing BadBone, a stealthy and adaptive backdoor attack against prompt learning using bi-level optimization. Instead of backdooring the prompt learning process, we aim to compromise a backbone model such that only target downstream tasks employing prompt learning inherit the backdoor vulnerability. Extensive experiments on three different models and three datasets from various domains show that our targeted/untargeted backdoored models achieve high attack performance while maintaining utility on both pre-training and downstream tasks. Moreover, we evaluate our approach against six state-of-the-art model-level defenses, including Neural Cleanse, ABS, MNTD, NAD, CLP, and D-BR. The results demonstrate that these defenses are largely ineffective against our backdoored models, leaving effective defenses as an important direction for future work.

2605.31229 2026-06-01 cs.CV cs.AI 版本更新

Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

超越分类:面向持续多模态检索的动态适配器路由

Alicja Dobrzeniecka, Filip Szatkowski, Sebastian Cygert, Szymon Lukasik, Bartlomiej Twardowski

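Affiliations: NASK National Research Institute; IDEAS Research Institute; Warsaw University of Technology; Universitat Autonoma de Barcelona

AI summary: For the continual multimodal retrieval (CMR) task, proposes Dynamic Adapter Routing (DAR), which selects adapters through prototype-based routing and combines them via model merging, achieving better performance than existing baselines in cross-domain evaluation.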

详情
AI中文摘要

虽然检索是视觉-语言模型的核心功能,但持续更新这些模型用于检索任务仍未被充分探索。现有工作通常通过类增量学习(CIL)的视角处理持续检索,在可能无法完全捕捉检索特定动态的设置中评估标准CIL方法和面向检索的适应方法。为了解决这一问题,我们引入了一个新的、原则性的持续多模态检索(CMR)评估框架,涵盖多样化的视觉领域,并在此设置中系统评估常见方法。我们的实证分析表明,标准CIL方法在我们更具挑战性的场景中未能产生有意义的增益。因此,我们提出了动态适配器路由(DAR),一种基于通过原型路由选择适配器并通过模型合并组合的新方法。DAR在先前基线上取得了优越性能,并在分布外评估中展现出强大的泛化能力。我们的结果凸显了CMR的独特挑战,并鼓励在该方向进行进一步研究。

英文摘要

While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically underexplored. Existing work often approaches continual retrieval through the lens of class-incremental learning (CIL), evaluating both standard CIL methods and retrieval-oriented adaptations in settings that may not fully capture the retrieval-specific dynamics. To address this, we introduce a new, principled evaluation framework for continual multimodal retrieval (CMR) spanning diverse visual domains, and systematically evaluate common approaches within this setting. Our empirical analysis shows that standard CIL methods fail to yield meaningful gains in our more challenging scenario. Therefore, we propose Dynamic Adapter Routing (DAR), a novel approach based on adapters selected through prototype-based routing and combined via model merging. DAR achieves superior performance over the previous baselines and demonstrates strong generalization under out-of-distribution evaluation. Our results highlight the unique challenges of CMR and encourage further research in this direction.
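The abstract states only that adapters are selected through prototype-based routing and combined via model merging. A minimal sketch of one way such a scheme could work is shown below. The cosine similarity, the temperature-scaled softmax, the weighted parameter average, and the names `route` and `merge_adapters` are all assumptions made for illustration, not the paper's implementation.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def route(query, prototypes, temperature=0.1):
    """Turn similarities to per-domain prototypes into adapter mixing weights."""
    sims = [cosine(query, proto) for proto in prototypes]
    peak = max(sims)
    exps = [math.exp((s - peak) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def merge_adapters(adapters, weights):
    """Merge adapters by a weighted average of their flattened parameters."""
    dim = len(adapters[0])
    return [sum(w * adapter[i] for w, adapter in zip(weights, adapters))
            for i in range(dim)]


# Hypothetical usage: two domains, each with a prototype and a tiny adapter.
prototypes = [[1.0, 0.0], [0.0, 1.0]]
adapters = [[0.2, -0.1, 0.4], [-0.3, 0.5, 0.0]]
weights = route([1.0, 0.1], prototypes)
merged = merge_adapters(adapters, weights)
```

A query embedding close to the first prototype receives nearly all of its weight from the first adapter, so the merged parameters stay close to that adapter while still blending in the other one.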

2605.31227 2026-06-01 cs.CV 版本更新

HiERO-StepG @ Ego4D Step Grounding Challenge: hierarchical activity understanding enables zero-shot step grounding

HiERO-StepG @ Ego4D Step Grounding Challenge: 层次化活动理解实现零样本步骤定位

Andrea Zenotto, Simone Alberto Peirone, Francesca Pistilli, Giuseppe Averta

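Affiliations: Politecnico di Torino

AI summary: Proposes HiERO-StepG, which uses weakly supervised hierarchical representation learning and clustering to perform zero-shot step grounding without task-specific fine-tuning, reaching 56.27% R@1 (IoU = 0.3) in the Ego4D challenge.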

Comments Technical report for the Ego4D Goal Step - Step Grounding challenge at CVPR 2026, derived from arXiv:2505.12911

详情
AI中文摘要

程序性活动遵循明确的结构:无论是考虑烹饪食谱还是机械师修理汽车,这些活动自然分解为步骤和子步骤的层次结构。传统的步骤定位方法需要大量标注且扩展性差。相反,我们认为这种层次结构可以通过共同发生的动作和活动的重复模式,从非策划的人类活动视频中自然涌现。我们的方法基于HiERO,一种弱监督表示学习方法,它利用细粒度的动作级叙述,将功能相关的动作在特征空间中映射得接近。在这个特征空间中,程序步骤可以通过简单的聚类检测到,无需额外的任务特定微调。对于Ego4D步骤定位挑战,我们通过确保步骤分配的细粒度和粗粒度一致性、强制定位步骤的严格时间单调性以及后处理检测步骤以减少噪声预测的影响来增强这种方法。我们将这种方法称为HiERO-StepG,在提交时,它在全局排行榜上以完全零样本且不需要程序特定注释的情况下,在R@1 (IoU = 0.3)指标上达到56.27%,排名第二。项目页面:https://github.com/andreazenotto/HiERO-StepG。

英文摘要

Procedural activities follow well-defined structures: whether we consider a cooking recipe or a mechanic repairing a car, these activities naturally decompose into a hierarchy of steps and sub-steps. Traditional approaches for step grounding require extensive annotations and scale poorly. Instead, we argue that such hierarchical structure can emerge naturally from uncurated videos of human activities through recurring patterns of co-occurring actions and activities. Our approach builds on HiERO, a weakly-supervised representation learning approach that maps functionally related actions close together in the feature space, leveraging only fine-grained action-level narrations. In this feature space, procedure steps can be detected by simple clustering, with no additional task-specific fine-tuning. For the Ego4D Step Grounding challenge, we augment this approach by ensuring fine and coarse level agreement in step assignments, enforcing strict temporal monotonicity of the grounded steps, and post-processing the detected steps to reduce the impact of noisy predictions. We call this approach HiERO-StepG; it achieves 56.27% on the R@1 (IoU = 0.3) metric on the global leaderboard at submission time, ranking second while being completely zero-shot and not requiring procedure-specific annotations. Project page: https://github.com/andreazenotto/HiERO-StepG.
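The R@1 (IoU = 0.3) metric quoted above counts a step as correctly grounded when the top-ranked predicted time interval overlaps the annotated interval with a temporal IoU of at least 0.3. The sketch below shows that computation under common conventions; the function names and the inclusive threshold are assumptions, and the official challenge evaluation code may differ in details.

```python
def temporal_iou(pred, gt):
    """Intersection over union of two time intervals given as (start, end) in seconds."""
    intersection = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - intersection
    return intersection / union if union > 0 else 0.0


def recall_at_1(top1_predictions, ground_truths, threshold=0.3):
    """Fraction of queries whose top-1 interval reaches the IoU threshold."""
    hits = sum(
        1
        for pred, gt in zip(top1_predictions, ground_truths)
        if temporal_iou(pred, gt) >= threshold
    )
    return hits / len(ground_truths)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) overlaps for 5 s out of a 15 s union, giving an IoU of one third, which counts as a hit at the 0.3 threshold.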

2605.31217 2026-06-01 cs.CV 版本更新

TALON: Token-Aligned Lightweight Adapters for 6-DoF Spacecraft Pose Estimation

TALON: 用于六自由度航天器姿态估计的令牌对齐轻量适配器

Abid Ali, Arunkumar Rathinam, Djamila Aouada

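AI summary: Proposes TALON, which injects spatiotemporal 3D adapters before the attention layers of a frozen ViT and combines them with a token alignment loss for lightweight 6-DoF spacecraft pose estimation, reducing pose error on SPADES and improving ADD-0.1d accuracy on SwissCube.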

Comments 13 pages paper with 3 figures in total

详情
AI中文摘要

单目六自由度航天器姿态估计方法主要处理单帧图像,忽略了航天器机动过程中获取的图像序列中的时间信息。少数时间方法需要完全骨干微调或辅助光流网络,分别存在灾难性遗忘或增加计算成本的风险。我们提出TALON(轨道导航的令牌对齐轻量适配器):在冻结的ViT视觉变换器的自注意力层之前注入时空3D适配器,结合补丁-令牌对齐损失,通过原型条件KL散度目标将适配特征几何地锚定到关键点结构。注意力前放置允许冻结注意力对时间增强的令牌进行推理,每个块使用单个适配器即可获得比注意力后替代方案更强的性能。对齐损失塑造中间表示,使得每个关键点在令牌场中引发空间精确的激活,而该框架向冻结骨干添加的参数少于5%。在SPADES数据集上,TALON将姿态误差比先前最先进方法降低50%;在SwissCube数据集上,其在ADD-0.1d准确率上超越先前最佳方法21.8%。在SPARK真实数据上的从仿真到真实的零样本跨域评估将姿态误差降低4.7倍,消融实验表征了适配器深度在域内和跨域设置中的作用。

英文摘要

Monocular 6-DoF spacecraft pose estimation methods predominantly process individual frames, discarding the temporal information present in an image sequence acquired during spacecraft manoeuvres. The few temporal approaches require full backbone fine-tuning or auxiliary optical flow networks, risking catastrophic forgetting or increasing computational cost, respectively. We propose TALON (Token-Aligned Lightweight adapters for Orbital Navigation): spatiotemporal 3D adapters injected before the self-attention layers of a frozen ViT vision transformer, combined with a patch-token alignment loss that geometrically grounds the adapted features to keypoint structure through a prototype-conditioned KL-divergence objective. Pre-attention placement allows the frozen attention to reason over temporally enriched tokens, achieving stronger performance with a single adapter per block than post-attention alternatives. The alignment loss shapes the intermediate representations so that each keypoint induces a spatially precise activation in the token field, while the framework adds less than 5% extra parameters to the frozen backbone. On the SPADES dataset, TALON reduces the pose error by 50% over the prior state-of-the-art, and on the SwissCube dataset it surpasses the prior best by 21.8% in ADD-0.1d accuracy. Zero-shot cross-domain evaluation from sim-to-real on SPARK real data reduces pose error by 4.7x, and ablations characterise the role of adapter depth across in-domain and cross-domain settings.

2605.31215 2026-06-01 cs.LG cs.CV 版本更新

Fixed-Point Masked Generative Modeling

不动点掩码生成建模

Andrea Miele, Yiming Qin, Alba Carballo-Castro, Justin Deschenaux, Pascal Frossard

发表机构 * LTS4, EPFL(EPFL LTS4实验室)

AI总结 提出不动点掩码生成模型(FP-MGM),通过共享注意力层的不动点求解器实现自适应深度,并引入跨步一致性损失和三态重用(3SR)策略,在降低参数和训练成本的同时提升低预算掩码生成质量。

详情
AI中文摘要

掩码生成模型(MGM)支持并行解码并在多种模态上取得强性能,但每一步都需要全序列双向变换器,导致训练成本高且在低采样预算下质量下降。现有工作通过更好的采样器或更便宜的固定深度去噪器提升效率,但仍为每个精炼步骤分配固定量的去噪器计算。我们提出不动点掩码生成模型(FP-MGM),用共享注意力层上的不动点求解器替换部分去噪器,实现自适应深度且参数更少。为使其更有效地用于掩码生成,我们首先引入跨步一致性损失,对齐相邻去噪步骤的隐藏表示;其次,三态重用(3SR)通过分别处理未改变、仍掩码和新揭示的令牌,利用先前解热启动求解器。这些组件共同定义了我们的不动点掩码生成的完整训练到推理框架CoFRe。我们还表明,预训练的MGM可以通过短微调转换为FP-MGM,避免完全重新训练。跨模态,CoFRe改善了质量与成本的权衡。在OpenWebText上,与MDLM相比,CoFRe参数减少38.8%,训练时间减少11.5%,VRAM减少16.9%,同时在96个变换器块前向传播的预算下,生成困惑度从830.8提升到101.8。在ImageNette上,CoFRe训练时间减少48.6%,VRAM减少50.7%,并在所有测试的样本预算下改善FID。总体而言,CoFRe为更便宜的训练和更强的低预算掩码生成提供了一个实用框架。

英文摘要

Masked Generative Models (MGMs) enable parallel decoding and achieve strong performance across modalities, but require full-sequence bidirectional transformers at every step, making training costly and degrading quality under low sampling budgets. Existing work improves efficiency via better samplers or cheaper fixed-depth denoisers, but still allocates a fixed amount of denoiser computation to each refinement step. We introduce Fixed-Point Masked Generative Models (FP-MGMs), which replace part of the denoiser with a fixed-point solver over shared attention layers to enable adaptive depth with fewer parameters. To make this more effective for masked generation, we introduce, first, a cross-step consistency loss, which aligns hidden representations at neighboring denoising steps, and, second, three-state reuse (3SR), which warm-starts the solver from the previous solution by treating unchanged, still-masked, and newly revealed tokens differently. Together, these components define our complete training-to-inference framework for fixed-point masked generation, CoFRe. We also show that pre-trained MGMs can be converted into FP-MGMs with short fine-tuning, avoiding full retraining. Across modalities, CoFRe improves the quality-cost trade-off. On OpenWebText, CoFRe reduces parameters by 38.8%, training time by 11.5%, and VRAM by 16.9%, while improving generative perplexity from 830.8 to 101.8 at a budget of 96 transformer-block forward passes, compared to MDLM. On ImageNette, CoFRe reduces training time by 48.6% and VRAM by 50.7%, while improving FID at all sampling budgets tested. Overall, CoFRe offers a practical framework for cheaper training and stronger low-budget masked generation.
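The benefit of warm-starting a fixed-point solver from the previous solution can be seen in a scalar toy problem. The sketch below is only an analogy under stated assumptions: the contraction `z -> 0.5 * z + x` stands in for a shared layer, `x` stands in for the conditioning input at a denoising step, and the names `fixed_point` and `make_layer` are hypothetical. The per-token handling that 3SR applies to unchanged, still-masked, and newly revealed tokens is not modeled.

```python
def fixed_point(f, z0, tol=1e-8, max_iter=1000):
    """Iterate z <- f(z) until two successive iterates differ by less than tol."""
    z = z0
    for iteration in range(1, max_iter + 1):
        z_next = f(z)
        if abs(z_next - z) < tol:
            return z_next, iteration
        z = z_next
    return z, max_iter


def make_layer(x):
    """Toy contraction standing in for a shared layer; its fixed point is 2 * x."""
    return lambda z: 0.5 * z + x


# Solution at the previous step, where the input was 1.00.
prev_solution, _ = fixed_point(make_layer(1.00), z0=0.0)

# The input changes slightly at the next step. Compare a cold start from
# zero with a warm start from the previous solution.
cold_solution, cold_iters = fixed_point(make_layer(1.01), z0=0.0)
warm_solution, warm_iters = fixed_point(make_layer(1.01), z0=prev_solution)
```

Both runs converge to the same fixed point, but the warm start begins much closer to it and therefore needs fewer iterations, which is the intuition behind reusing the previous step's solution.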

2605.31212 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Benchmarking and Enhancing Text-to-Image Models for Generating Visual Representations in Early Arithmetic Education

基准测试与增强文本到图像模型以生成早期算术教育中的视觉表示

Junling Wang, Boqi Chen, Heejin Do, Mubashara Akhtar, April Yi Wang, Mrinmaya Sachan

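Affiliations: Department of Computer Science, ETH Zurich; ETH AI Center

AI summary: For equation-to-visual generation in early arithmetic education, builds the E2V-Bench benchmark and evaluates existing T2I models, finds serious errors in object counts and relational structure, and explores benchmark-guided enhancement strategies.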

详情
AI中文摘要

AI系统越来越多地用于支持教育内容创作,但尚不清楚它们能否生成忠实代表其旨在教授的教学概念的输出。因此,我们引入了方程到视觉生成任务,与传统的图像生成不同,该任务要求从算术方程中生成具有教学意义的视觉内容,同时精确保留其数值和关系结构。根据对教师的访谈和教育材料的分析,我们构建了E2V-Bench基准,涵盖四种基于教学法的视觉类型,以及用于评估视觉正确性的自动指标。我们的评估显示,最近的文本到图像(T2I)模型在此任务上频繁失败,错误主要表现为对象计数不正确和关系结构破坏。在此基础上,我们探索了基准引导的增强策略。这些策略改进了代表性模型,但剩余的差距要求未来的T2I模型具备更强的数值和关系基础。

英文摘要

AI systems are increasingly used to support educational content creation, yet it remains unclear whether they can generate outputs that faithfully represent the pedagogical concepts they are intended to teach. Thus, we introduce equation-to-visual generation, a task that, in contrast to conventional image generation, requires producing pedagogically meaningful visuals from arithmetic equations while precisely preserving their numerical and relational structure. Informed by interviews with teachers and an analysis of educational materials, we construct E2V-Bench, a benchmark spanning four pedagogically grounded visual types, along with automatic metrics for evaluating visual correctness. Our evaluation reveals that recent text-to-image (T2I) models frequently fail on this task, with errors dominated by incorrect object counts and broken relational structure. Building on this, we explore benchmark-guided enhancement strategies. These strategies improve representative models, while the remaining gap calls for stronger numerical and relational grounding in future T2I models.

2605.31204 2026-06-01 cs.CV 版本更新

Probabilistic Precipitation Nowcasting with Rectified Flow Transformers

基于整流流变压器的概率降水临近预报

Johannes Schusterbauer, Jannik Wiese, Nick Stracke, Timy Phan, Björn Ommer

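Affiliations: CompVis LMU Munich; Munich Center for Machine Learning (MCML)

AI summary: Proposes FREUD, which combines a frame-wise encoder and a unified decoder with rectified flow transformers to compress spatio-temporal data efficiently while preserving uncertainty, achieving state-of-the-art precipitation nowcasting performance on the SEVIR benchmark.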

Comments CVPR 2026, Project Page: https://compvis.github.io/weather-rf/

详情
AI中文摘要

准确的天气预报在各个领域都至关重要,在极端天气条件下更是安全关键。与基于模拟的预报相比,数据驱动方法显示出更高的效率,能够实现短期、高分辨率的临近预报。特别是,扩散模型因其强大的概率基础在天气临近预报中被证明有效。然而,现有方法依赖于确定性压缩来降低高维天气数据的复杂性,限制了它们在解码过程中捕捉不确定性的能力。在这项工作中,我们引入了$\textbf{FREUD}$,一个基于整流流变压器的$\textbf{Fr}$ame-wise $\textbf{E}$ncoder和$\textbf{U}$nited $\textbf{D}$ecoder模型,用于高效压缩时空天气数据。帧级编码支持连续预报更新,而统一视频解码器确保时间一致性。我们保留不确定性的第一阶段允许通过集成捕捉偶然不确定性,这对于解码变异性高的极端天气事件特别有利。我们在SEVIR基准上使用紧凑的潜在空间整流流变压器实现了降水临近预报的最新性能,并通过模型和测试时缩放进一步展示了性能提升。代码见:https://github.com/CompVis/weather-rf

英文摘要

Accurate weather forecasts are essential across various domains and are safety-critical in extreme weather conditions. Compared to simulation-based forecasting, data-driven approaches show greater efficiency, enabling short-term, high-resolution nowcasting. In particular, diffusion models proved effective in weather nowcasting due to their strong probabilistic foundation. However, existing methods rely on deterministic compression to reduce the complexity of high-dimensional weather data, limiting their ability to capture uncertainty in the decoding process. In this work, we introduce $\textbf{FREUD}$, a $\textbf{Fr}$ame-wise $\textbf{E}$ncoder and $\textbf{U}$nited $\textbf{D}$ecoder model based on rectified flow transformers for efficient compression of spatio-temporal weather data. Frame-wise encoding enables continuous forecast updates, while the unified video decoder ensures temporal consistency. Our uncertainty-preserving first stage allows us to capture aleatoric uncertainty via ensembling, which is particularly beneficial for extreme weather events with high decoding variability. We achieve state-of-the-art performance in precipitation nowcasting with a compact latent-space rectified flow transformer on the SEVIR benchmark and show further performance gains by model and test-time scaling. Code available here: https://github.com/CompVis/weather-rf
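
As a rough illustration of the rectified flow idea this abstract builds on, the sketch below shows the straight-line interpolation between noise and data, the constant velocity target, and fixed-step Euler sampling. It is a toy in plain Python under stated assumptions: the function names and the oracle velocity field are hypothetical and are not taken from the FREUD code base.

```python
import random

def interpolate(x0, x1, t):
    """Point on the straight path between noise x0 and data x1 at time t in [0, 1]."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def target_velocity(x0, x1):
    """Regression target for the velocity field: constant along the straight path."""
    return [b - a for a, b in zip(x0, x1)]

def euler_sample(velocity_fn, x0, num_steps=8):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy check with a hypothetical oracle velocity for one (noise, data) pair.
# A trained network would replace `oracle`; this only exercises the sampler.
noise = [random.gauss(0.0, 1.0) for _ in range(4)]
data = [0.5, -1.0, 2.0, 0.0]
oracle = lambda x, t: target_velocity(noise, data)
sample = euler_sample(oracle, noise)
midpoint = interpolate([0.0, 0.0], [2.0, 4.0], 0.5)
```

With a learned velocity field, drawing several noise vectors and sampling each one would give an ensemble of outputs, which is one plausible reading of how variability is captured via ensembling.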

2605.31196 2026-06-01 cs.CV cs.AI cs.CL cs.RO 版本更新

Probing Collision Grounding in Vision-Language Models for Safe Human-Robot Collaboration

探索视觉-语言模型中的碰撞接地以实现安全的人机协作

Jun Wang, Xiaohao Xu, Xiaonan Huang

发表机构 * University of Michigan, Ann Arbor(密歇根大学,安娜堡)

AI总结 针对安全人机协作,提出碰撞接地概念及物理基准TouchSafeBench,评估视觉-语言模型在分类当前安全状态和预警即将碰撞任务中的表现,发现现有模型不可靠,视觉流畅性不等于物理责任性。

Comments 31 pages, 9 figures

详情
AI中文摘要

安全的人机协作需要的不仅仅是视觉描述:监控器必须确定机器人身体是否安全分离、已经与场景或人发生碰撞,或即将碰撞。我们将这种能力称为碰撞接地:将视觉观察与机器人身体几何、相机视角、场景布局、人体接近度和时间运动相结合,以推断当前和即将发生的接触。我们引入了TouchSafeBench,一个基于物理的基准,用于评估视觉-语言模型(VLM)中的碰撞接地能力。TouchSafeBench基于Habitat 3.0构建,包含2,940个模拟室内共现场景,涵盖社交导航和社交重排,具有同步的多视角RGB-D观测、自上而下的轨迹地图、校准的相机元数据和模拟器导出的接触标签。我们研究了两个面向部署的任务:分类当前安全状态和在接触前预警即将发生的碰撞。在三个前沿或面向机器人的VLM和九种视觉表示中,当前模型远未达到可靠:最佳平均Macro-F1仍低于50%,显式深度不会自动转化为机器人身体碰撞证据,且机器人与场景的接触始终比与人接触的风险更难判断。TouchSafeBench揭示了具身VLM的一个核心限制:视觉流畅性并不意味着物理责任性。可靠的机器人安全监控器需要能够显式绑定视角、机器人形态、度量几何和未来碰撞的表示。我们将在论文被接收后发布该基准。

英文摘要

Safe human-robot collaboration requires more than visual description: a monitor must determine whether the robot body is safely separated, already colliding with the scene or a person, or about to collide. We call this capability collision grounding: binding visual observations to robot body geometry, camera viewpoint, scene layout, human proximity, and temporal motion in order to infer present and imminent contact. We introduce TouchSafeBench, a physics-grounded benchmark for evaluating collision grounding in vision-language models (VLMs). Built in Habitat 3.0, TouchSafeBench contains 2,940 simulated indoor co-presence episodes across social navigation and social rearrangement, with synchronized multi-view RGB-D observations, top-down trajectory maps, calibrated camera metadata, and simulator-derived contact labels. We study two deployment-facing tasks: classifying the current safety state and warning about imminent collision before contact. Across three frontier or robotics-oriented VLMs and nine visual representations, current models remain far from reliable: the best average Macro-F1 stays below 50%, explicit depth is not automatically transformed into robot-body collision evidence, and robot-scene contact is consistently harder than human-contact risk. TouchSafeBench reveals a central limitation of embodied VLMs: visual fluency does not imply physical accountability. Reliable robot safety monitors will need representations that explicitly bind viewpoint, robot morphology, metric geometry, and future collision. We will release the benchmark upon acceptance.
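
The headline number in this abstract is Macro-F1, so a small worked example may help show why it is a stricter measure than accuracy when safety states are imbalanced. The sketch below is plain Python; the three label names and the toy predictions are hypothetical and are not TouchSafeBench data.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores (a class with no hits scores 0)."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical label names for the three safety states described above.
LABELS = ["safe", "colliding", "imminent"]

# A monitor that always answers "safe" on an imbalanced toy set.
y_true = ["safe", "safe", "safe", "safe", "colliding", "imminent"]
y_pred = ["safe", "safe", "safe", "safe", "safe", "safe"]

score = macro_f1(y_true, y_pred, LABELS)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In this toy case the always-"safe" monitor is right on four of six samples, yet its Macro-F1 is far lower because the two contact classes each contribute an F1 of zero.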

2605.31192 2026-06-01 cs.CV 版本更新

The Regularizing Power of Language-Training Deepfake Detectors

语言训练深度伪造检测器的正则化能力

Benedikt Hopf, Zongwei Wu, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS, University of Würzburg(计算机视觉实验室,CAIDAS,维尔茨堡大学)

AI总结 提出利用多模态大语言模型的双编码器架构和两阶段训练,通过语言正则化缓解过拟合,提升深度伪造检测的泛化性和可解释性。

详情
AI中文摘要

最近,得益于多模态大语言模型的出现,深度伪造检测器不仅追求泛化性,还追求可解释性。我们提出这两个挑战可以有效地联合解决,因为可描述的伪影通常泛化性更好,从而开辟了使用语言作为正则化机制的可能性。由于深度伪造检测通常过拟合于低层次的领域特定伪影,我们的直觉是,经过语言预训练的LLM会更偏好于可更好描述的高层次伪影。这样,我们可以在可能的情况下使用高层次特征,同时训练模型在必要时使用低层次特征。我们利用双编码器架构,将冻结的专家检测器与LoRA调优的MLLM编码器配对,并采用两阶段训练课程:首先,二元对齐阶段表明,MLLM的内在能力可以有效地组合特征,以减轻对数据集特定伪影的过拟合。为了进一步增强泛化性并实现可解释性,我们采用强化学习阶段,鼓励模型在分类前生成描述性推理,仅使用二元标签。通过奖励这种“先解释后分类”的行为,我们明确激励模型优先考虑高层次、鲁棒的特征。关键在于,这一过程既产生了可解释的描述,又进一步提升了跨数据集性能,即使在推理时省略推理链也是如此。在基准数据集上的大量实验验证了我们的方法,以较大优势超越了最先进的方法。

英文摘要

Recently, thanks to the advent of Multimodal-LLMs, deepfake detectors are striving not only to be generalizable but also interpretable. We propose that these two challenges can effectively be tackled jointly, since describable artifacts typically generalize better, opening the possibility to use language as a regularization mechanism. Since deepfake detection generally suffers from overfitting to low-level domain-specific artifacts, our intuition is that an LLM that has been pretrained on language would prefer high-level artifacts that can be described better. This way, we can use high-level features where possible, while training the model to use low-level features where necessary. We utilize a dual-encoder architecture, pairing a frozen specialist detector with a LoRA-tuned MLLM encoder, and a two-stage training curriculum: first, a binary alignment phase demonstrates that the intrinsic capability of MLLMs can effectively combine features to mitigate overfitting to dataset-specific artifacts. To further bolster generalization and achieve interpretability, we employ a reinforcement learning stage that encourages the model to generate descriptive reasoning before classifying, using only binary labels. By rewarding this "explain-then-classify" behavior, we explicitly incentivize the model to prioritize high-level, robust features. Crucially, this process yields both interpretable descriptions and a further boost in cross-dataset performance, even when reasoning chains are omitted at inference. Extensive experiments on benchmark datasets validate our approach, outperforming state-of-the-art methods by a large margin.

2605.31191 2026-06-01 cs.LG cs.CV 版本更新

Student Capacity Moderates Knowledge Distillation Effectiveness: A Systematic Study Across ResNet Teacher-Student Pairs on CIFAR-10

学生容量调节知识蒸馏有效性:基于CIFAR-10上ResNet教师-学生对的系统研究

Umut Onur Yasar

发表机构 * GitHub

AI总结 通过ResNet教师-学生对在CIFAR-10上的图像分类实验,系统研究学生容量如何调节知识蒸馏(KD)的有效性,发现学生容量是蒸馏增益的关键调节因素,并指出实现正确性和输入分辨率感知架构的重要性。

Comments 9 pages, 2 figures, 5 tables. Code available at https://github.com/umutonuryasar/kd-capacity-gap

详情
AI中文摘要

我们研究了教师-学生容量关系如何调节基于ResNet的CIFAR-10图像分类中知识蒸馏(KD)的有效性。在三个教师-学生对(R50->R18、R34->R18和R50->R34)中,我们在受控、可重复的条件下(3个种子,全程报告均值±标准差)比较了Logit-KD和Feature-KD。我们报告三个主要发现。首先,学生容量是蒸馏增益的关键调节因素:即使教师-学生准确率差距相当,R34学生从KD中获得的收益也远大于R18学生,R50->R34 Feature-KD的最大增益为+0.30个百分点,而R34->R18 Feature-KD为+0.18个百分点,R34->R18 Logit-KD为+0.00个百分点。其次,实现的正确性对Feature-KD至关重要:一个排除了投影层的梯度裁剪错误抑制了Feature-KD的性能,并产生了与Logit-KD的误导性比较。修正后,Feature-KD在三个对中的两个上匹配或优于Logit-KD,在R50->R34上达到95.55%,基线为95.25%。第三,输入分辨率感知架构是有效蒸馏的先决条件:将ResNet主干修正为32x32输入使教师准确率提高超过5个百分点——比任何KD增益高出一个数量级。所有代码和结果可在github.com/umutonuryasar/kd-capacity-gap获取。

英文摘要

We investigate how teacher-student capacity relationships modulate knowledge distillation (KD) effectiveness in ResNet-based image classification on CIFAR-10. Across three teacher-student pairs -- R50->R18, R34->R18, and R50->R34 -- we compare Logit-KD and Feature-KD under controlled, reproducible conditions (3 seeds, mean+/-std reported throughout). We report three main findings. First, student capacity is a key moderating factor in distillation gain: R34 students benefit substantially more from KD than R18 students even when teacher-student accuracy gaps are comparable, with the strongest gain of +0.30pp observed for R50->R34 Feature-KD versus +0.18pp for R34->R18 Feature-KD and +0.00pp for R34->R18 Logit-KD. Second, implementation correctness critically affects Feature-KD: a gradient clipping bug that excluded projection layers suppressed Feature-KD performance and produced misleading comparisons with Logit-KD. After correction, Feature-KD matches or outperforms Logit-KD in two of three pairs, reaching 95.55% on R50->R34 against a baseline of 95.25%. Third, input-resolution-aware architecture is a prerequisite for effective distillation: correcting the ResNet stem for 32x32 inputs raises teacher accuracy by over 5pp -- an order of magnitude larger than any KD gain. All code and results are available at github.com/umutonuryasar/kd-capacity-gap.
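
For readers unfamiliar with Logit-KD, the sketch below shows the classic temperature-scaled distillation loss: a KL term between softened teacher and student distributions, mixed with the usual cross-entropy on the hard label. This is an assumption about the general technique, not the paper's exact formulation; the `temperature` and `alpha` defaults are illustrative only.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.5):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy.

    Assumes moderate logits so that no softened probability underflows to zero.
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0.0)
    ce = -math.log(softmax(student_logits)[label])
    return alpha * (temperature ** 2) * kl + (1.0 - alpha) * ce

student = [2.0, 0.5, -1.0]
loss_same_teacher = logit_kd_loss(student, [2.0, 0.5, -1.0], label=0)
loss_other_teacher = logit_kd_loss(student, [0.0, 3.0, -1.0], label=0)
ce_only = 0.5 * -math.log(softmax(student)[0])
```

When the teacher and student logits coincide, the KL term vanishes and only the weighted cross-entropy remains, which is a convenient sanity check for any implementation.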

2605.31187 2026-06-01 cs.CV cs.LG 版本更新

From Local Geometry to Global Pseudo Labeling for Robust Positive Unlabeled Learning under Covariate Shift

从局部几何到全局伪标注:协变量偏移下鲁棒的正无标记学习

Firas Gabetni, Alexandre Rocchi Henry, Nacim Belkhir, Ziyi Liu, Gianni Franchi

发表机构 * U2IS, ENSTA(U2IS,ENSTA) Institut Polytechnique de Paris(巴黎综合理工学院) AMIAD, Pôle Recherche, Palaiseau(AMIAD,研究部,帕莱索)

AI总结 提出SPUNA框架,利用局部流形结构逐步发现偏移数据,在协变量偏移下实现正无标记学习,性能达到全监督方法水平。

详情
AI中文摘要

检测协变量偏移对于构建可靠的视觉系统至关重要。虽然大多数先前工作专注于提高对偏移的鲁棒性,但显式检测协变量偏移仍未被充分探索。现有方法通常依赖于全监督训练,需要来自原始分布和偏移分布的有标签样本,这往往不切实际。在本文中,我们表明协变量偏移检测可以通过使用正无标记(PU)学习的弱监督有效解决。然而,在协变量偏移下,分布内数据和偏移数据显著重叠,使得经典PU方法不稳定且对噪声敏感。为克服这一挑战,我们引入了谱PU邻域标注(SPUNA),这是一种几何感知框架,通过利用视觉特征的局部流形结构逐步发现偏移数据。大量实验表明,SPUNA在PU设置中实现了最先进的性能,并且显著匹配了全监督方法的性能。此外,我们的方法在不同类型的偏移之间鲁棒地迁移,展示了强大的泛化能力。

英文摘要

Detecting covariate shift is critical for building reliable vision systems. While most prior work focuses on improving robustness to shift, explicitly detecting covariate shift remains underexplored. Existing approaches typically rely on fully supervised training, requiring labeled examples from both original and shifted distributions, which is often impractical. In this paper, we show that covariate shift detection can be effectively addressed with weaker supervision using Positive Unlabeled (PU) learning. However, under covariate shift, in-distribution and shifted data overlap significantly, making classical PU methods unstable and sensitive to noise. To overcome this challenge, we introduce Spectral PU Neighborhood Annotation (SPUNA), a geometry-aware framework that progressively discovers shifted data by leveraging the local manifold structure of visual features. Extensive experiments show that SPUNA achieves state-of-the-art performance in PU settings and remarkably matches the performance of fully supervised methods. Moreover, our approach transfers robustly across different types of shifts, demonstrating strong generalization capabilities.

2605.31177 2026-06-01 cs.CV 版本更新

Vanilla ViT for Automotive Point Cloud Semantic Segmentation

用于汽车点云语义分割的普通ViT

Gilles Puy, Nermin Samet, Alexandre Boulch, Spyros Gidaris, Tuan-Hung VU, Renaud Marlet

发表机构 * LIGM, CNRS, Univ Gustave Eiffel, ENPC, IP Paris, France(LIGM、CNRS、Université Gustave Eiffel、ENPC、IP Paris、法国)

AI总结 本文提出VaViT,通过精心设计的标记器、轻量级解码器头和定制数据增强,使普通非分层ViT在大规模激光雷达点云语义分割中达到或超越现有最先进方法。

详情
AI中文摘要

普通Transformer已成为处理文本、音频、图像和视频的事实标准架构,为多模态学习提供了统一的主干。然而,点云语义分割的最先进架构仍然由U-Net架构主导,其中卷积与局部或窗口注意力交错。在这项工作中,我们展示了如何有效利用普通、非分层的ViT进行大规模汽车激光雷达场景的分割。通过精心设计的标记器、轻量级解码器分割头和定制数据增强,我们弥合了性能差距。我们的方法VaViT(Vanilla ViT)在保持ViT架构简单性的同时,匹配或超过了最先进方法的性能。我们在nuScenes、SemanticKITTI和Waymo Open Dataset上进行了广泛评估,以验证我们方法的有效性。代码和模型可在https://github.com/valeoai/VaViT获取。

英文摘要

Plain Transformers have become the de-facto architecture for processing text, audio, image, and video, offering a unified backbone for multimodal learning. However, state-of-the-art architectures for point cloud semantic segmentation remain dominated by U-Net architectures where convolutions are interleaved with local or windowed attention. In this work, we show how to effectively leverage vanilla, non-hierarchical ViTs for segmentation of large-scale automotive lidar scenes. We bridge the performance gap thanks to a carefully designed tokenizer, a lightweight decoder segmentation head, and tailored data augmentations. Our approach, VaViT for Vanilla ViT, matches or exceeds the performance of state-of-the-art methods while maintaining the simplicity of the ViT architecture. We provide extensive evaluations on nuScenes, SemanticKITTI, and Waymo Open Dataset to validate the efficiency of our method. Code and models are available at https://github.com/valeoai/VaViT.

2605.31174 2026-06-01 cs.CV cs.LG 版本更新

Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning

任意场景检测:一种具有经验感知推理的目标检测智能体框架

Wenlun Zhang, Jun Yin, Kentaro Yoshioka

发表机构 * Keio University(庆应义塾大学) Tsinghua University(清华大学)

AI总结 提出DetAS/DetAS-X智能体框架,利用多模态大语言模型自适应组合恢复模块和专用检测器,通过自进化经验积累实现经验感知推理,在六个基准上平均F1提升28.36%。

详情
AI中文摘要

现实场景中的目标检测由于图像退化多样和物体分布异质而仍然具有挑战性,这显著阻碍了现有检测器的泛化。传统方法,包括场景特定表示学习和端到端流水线设计,本质上受限于对预定义条件的依赖,缺乏对动态环境的适应性。本文提出DetAS,一种将目标检测表述为动态决策过程的智能体检测框架。DetAS不依赖静态流水线,而是利用多模态大语言模型(MLLM)作为中央智能体,通过从恢复模块和专用检测器的工具箱中选择来自适应地组合检测工作流。具体来说,DetAS包含两个关键组件:自适应图像恢复,动态决定是否以及如何增强图像以进行下游检测;以及多专家检测,集成多个领域专用检测器并通过实例级推理解决它们的预测。为了在细粒度条件下进一步提高决策质量,我们引入了自进化经验积累,并将框架扩展到DetAS-X,该框架从少量标注数据中积累节点级决策经验,并在推理过程中实现经验感知推理。这种机制使系统能够逐步优化其决策策略,并适应各种现实场景。在六个具有挑战性的基准上的大量实验表明,DetAS-X显著优于现有的基于MLLM的检测器,在F1分数上平均提高28.36%,在DarkFace上增益高达37.01%。这些结果展示了智能体检测的前景,并为其在复杂动态环境中的应用奠定了坚实基础。

英文摘要

Object detection in real-world scenarios remains challenging due to diverse image degradations and heterogeneous object distributions, which significantly hinder the generalization of existing detectors. Conventional approaches, including scene-specific representation learning and end-to-end pipeline design, are inherently limited by their reliance on predefined conditions and lack adaptability to dynamic environments. In this paper, we propose DetAS, an agentic detection framework that formulates object detection as a dynamic decision process. Instead of relying on static pipelines, DetAS leverages a Multimodal Large Language Model (MLLM) as a central agent to adaptively compose detection workflows by selecting from a toolbox of restoration modules and specialized detectors. Specifically, DetAS consists of two key components: Self-Adaptive Image Restoration, which dynamically determines whether and how to enhance images for downstream detection, and Multi-Expertise Detection, which integrates multiple domain-specialized detectors and resolves their predictions through instance-level reasoning. To further improve decision quality under fine-grained conditions, we introduce Self-Evolving Experience Harvesting and extend the framework to DetAS-X, which accumulates node-level decision experience from a small set of annotated data and enables experience-aware reasoning during inference. This mechanism allows the system to progressively refine its decision policy and adapt to diverse real-world scenarios. Extensive experiments on six challenging benchmarks demonstrate that DetAS-X significantly outperforms existing MLLM-based detectors, achieving an average improvement of 28.36% in F1 score, with up to 37.01% gain on DarkFace. These results demonstrate the promise of agentic detection and establish a solid foundation for its application in complex and dynamic environments.

2605.31153 2026-06-01 cs.CV 版本更新

BIAS-ID: A Framework for Analyzing Transformation Biases in AI-Generated Image Detectors

BIAS-ID: 分析AI生成图像检测器中变换偏差的框架

Jonas Ricker, Asja Fischer, Erwin Quiring

发表机构 * Ruhr University Bochum(波鸿鲁尔大学) _fbeta Berlin, Germany(_fbeta,德国柏林)

AI总结 本文提出BIAS-ID框架,用于分析和量化AI生成图像检测器中的变换偏差,并通过实验揭示多种先进检测方法受偏差影响严重。

详情
AI中文摘要

鉴于网络上有害AI生成图像的激增,可靠地区分真实图像与生成图像已成为一个紧迫的研究课题。虽然许多提出的检测方法在受控设置下表现良好,但在真实世界数据上测试时常常失效。一个潜在的根本原因是检测器训练数据中的细微偏差。因此,检测器可能依赖虚假相关性而非学习真正的取证痕迹。虽然最近的工作已经识别出这个问题,但尚未建立评估检测器实际偏差程度的既定协议。因此,在本文中,我们退一步:首先,我们讨论检测器存在偏差意味着什么,以及这与缺乏鲁棒性有何不同。其次,我们提出BIAS-ID,一个用于分析和量化AI生成图像检测器中变换偏差的透明框架。我们通过对两个数据集上的六个检测器进行评估来验证我们的框架,揭示了几种最先进的检测方法受到偏差的强烈影响。我们的结果强调了偏差感知评估对于开发可靠的AI生成图像检测器的重要性。

英文摘要

Given the surge of harmful AI-generated imagery online, reliably distinguishing authentic images from generated ones has become an urgent research topic. While many proposed detection methods perform well under controlled settings, they often collapse when tested on real-world data. A potential root cause is subtle biases in the detectors' training data. As a result, detectors may rely on spurious correlations instead of learning true forensic artifacts. While a recent line of work has identified the problem, there is not yet an established protocol to evaluate how biased a detector actually is. In this work, we therefore take a step back: First, we discuss what it means for a detector to be biased, and how this differs from a lack of robustness. Second, we propose BIAS-ID, a transparent framework for analyzing and quantifying the presence of transformation biases in AI-generated image detectors. We validate our framework by performing an evaluation of six detectors across two datasets, revealing that several state-of-the-art detection methods are strongly affected by biases. Our results highlight the importance of bias-aware evaluation for developing reliable AI-generated image detectors.

2605.31148 2026-06-01 cs.CV cs.AI cs.CL 版本更新

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

SpatialAct:探测VLM智能体在3D场景中的空间推理到行动能力

Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang, Yiming Guo, Yanxin Xi, Hangyu Fan, Yong Li, Pan Hui

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Zhongguancun Academy(中关村学院) Tsinghua University(清华大学) Helsinki University(赫尔辛基大学)

AI总结 本文提出SpatialAct基准,通过多轮交互细化、单步错误检测与修复等任务,揭示当前视觉语言模型在3D场景中从空间推理到行动存在显著差距。

详情
AI中文摘要

人类能够在日常3D环境中轻松感知空间布局、形成认知表征、推理空间关系,并将这种推理转化为行动。尽管最近的视觉语言模型(VLM)在基于观测的空间感知和推理任务上表现出色,但它们是否能够构建连贯的空间理解、据此行动并通过多轮反馈优化行动仍不清楚。为研究这一问题,我们引入了SpatialAct,一个基于模拟器的基准,用于探测3D场景中的行动条件空间推理。从最具挑战性的设置——多轮交互细化开始,我们进一步设计了其分解版本——单步错误检测与修复,以及五个基础空间能力任务,以诊断模型失败的潜在原因。实验揭示了明显的推理到行动差距:当前VLM在孤立的空间推理任务上表现良好,但在多轮反馈中难以维持连贯的空间信念并产生可靠行动,显著不如人类。这些结果表明,即使抽象掉了低级控制,当前VLM智能体在行动引起的环境变化下仍缺乏稳健的空间状态跟踪能力。

英文摘要

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce SpatialAct, a simulator-grounded benchmark for probing action-conditioned spatial reasoning in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.

2605.31145 2026-06-01 cs.CV cs.AI cs.LG 版本更新

FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization

FOCUS: 通过视觉支持约束和策略优化强制上下文目标定位

Mohammed Asad Karim, Vinay Kumar Verma

发表机构 * Amazon, Seattle, USA(亚马逊(美国西雅图))

AI总结 提出一种两阶段训练框架,通过优化支持框与查询图像间的上下文注意力并结合GRPO强化学习,实现无类别监督的类别无关上下文目标定位,7B模型性能超越72B模型。

Comments Accepted at ICML 2026. * Equal Contributions

详情
AI中文摘要

上下文定位(ICL)旨在通过查询图像中的少量支持示例定位目标对象,无需训练或参数更新即可即时操作。尽管视觉语言模型(VLM)快速发展,实现类别无关且基于视觉的ICL仍然是一个未解决的问题,尽管它对图像编辑、个性化视觉搜索和检索等应用至关重要。现有方法脆弱且依赖显式类别监督,这不仅限制了在具有未命名或实例特定对象的现实场景中的适用性,还引入了类别偏差,使预测偏向语义先验而非视觉证据。我们提出一个两阶段训练框架,在无类别监督的情况下显式优化支持边界框与查询图像之间的上下文注意力。我们进一步通过使用组相对策略优化(GRPO)的强化学习来细化定位,直接最小化定位误差。这种公式强制视觉对应优于语义先验,产生鲁棒的实例级定位。实验表明,使用我们的目标训练的7B参数模型优于高达72B参数的模型,证明了上下文感知定位目标可以超越单纯扩展规模。全面的消融实验验证了每个组件的贡献。

英文摘要

In-context localization (ICL) seeks to localize a target object specified by a small set of support examples in a query image, operating on the fly without training or parameter updates. Despite rapid advances in vision-language models (VLMs), achieving category-agnostic and visually grounded ICL remains an open problem, even though it is essential for applications such as image editing, personalized visual search, and retrieval. Existing methods are fragile and rely on explicit category supervision, which not only limits applicability in realistic settings with unnamed or instance-specific objects but also introduces category bias that steers predictions toward semantic priors rather than visual evidence. We introduce a two-stage training framework that explicitly optimizes in-context attention between support bounding boxes and query images without category supervision. We further refine localization via reinforcement learning using Group Relative Policy Optimization (GRPO) to directly minimize localization error. This formulation enforces visual correspondence over semantic priors, yielding robust instance-level localization. Empirically, a 7B-parameter model trained with our objectives outperforms models up to 72B parameters, demonstrating that context-aware localization objectives can surpass scaling alone. Comprehensive ablations validate the contribution of each component.
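
The reinforcement learning step mentioned above uses Group Relative Policy Optimization (GRPO), whose core idea is to score each sampled output relative to the other samples drawn for the same prompt. The sketch below illustrates that normalization in plain Python. Using box IoU as the reward is an assumption made for illustration; the abstract only says the objective directly minimizes localization error.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples for the same query."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of four boxes sampled for one query image.
target = (10.0, 10.0, 50.0, 50.0)
sampled_boxes = [
    (10.0, 10.0, 50.0, 50.0),      # exact match
    (20.0, 20.0, 60.0, 60.0),      # shifted
    (100.0, 100.0, 120.0, 120.0),  # complete miss
    (10.0, 10.0, 30.0, 50.0),      # left half of the target
]
rewards = [iou(box, target) for box in sampled_boxes]
advantages = group_relative_advantages(rewards)
```

Samples with above-average reward receive positive advantages and are reinforced, while below-average samples are pushed down, so no separate value network is needed to form the baseline.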

2605.31137 2026-06-01 cs.CV 版本更新

PolSAR Image Classification using a Hybrid Complex-Valued Network (HybridCVNet)

使用混合复数网络(HybridCVNet)进行PolSAR图像分类

Mohammed Q. Alkhatib

发表机构 * IEEE

AI总结 提出一种混合复数网络HybridCVNet,结合CV-CNN和CV-ViT,通过提取互补信息并利用数据内部依赖关系,提升PolSAR图像分类性能,在Flevoland和San Francisco数据集上分别达到97.39%总体精度和0.972 Kappa值。

Comments Accepted and Published in IEEE Geoscience and Remote Sensing Letters (GRSL)

详情
AI中文摘要

近年来,卷积神经网络(CNN)因其在计算机视觉任务中的有效性而成为图像分类的热门方法。现在,研究人员正在探索视觉Transformer(ViT)在遥感和地球观测中的潜力。然而,传统的实值网络常常忽略复数(CV)数据(如极化合成孔径雷达(PolSAR)数据)中重要的相位信息。为了解决这个问题,出现了新的CV深度架构。HybridCVNet是一种新颖的混合网络,融合了CV-CNN和CV视觉Transformer(CV-ViT)技术。它有效地结合了CV 3D和2D CNN作为特征提取器,通过提取互补信息并有效利用数据内部的相互依赖关系,增强了PolSAR图像分类。来自广泛使用的PolSAR数据集的实验结果表明,HybridCVNet优于其他方法,在Flevoland数据集上实现了97.39%的总体精度,并且在仅1%采样率下也显示出潜力,在旧金山数据集上Kappa值为0.972。源代码可通过https://github.com/mqalkhatib/HybridCVNet获取。

英文摘要

Recently, convolutional neural networks (CNNs) have become popular for image classification due to their effectiveness in computer vision tasks. Now, researchers are exploring the potential of vision transformers (ViTs) in remote sensing and Earth observation. However, traditional Real-Valued networks often overlook important phase information in Complex-Valued (CV) data like polarimetric synthetic aperture radar (PolSAR) data. To address this, new CV deep architectures have emerged. HybridCVNet, a novel hybrid network, blends CV-CNN and CV vision transformer (CV-ViT) techniques. It efficiently combines CV 3D and 2D CNNs as feature extractors, enhancing PolSAR image classification by extracting complementary information and effectively leveraging interdependencies within the data. Experimental results from widely-used PolSAR datasets show HybridCVNet outperforms other methods, achieving an overall accuracy of 97.39% on the Flevoland dataset and showing promise even with just a 1% sampling ratio, with a Kappa value of 0.972 on the San Francisco dataset. Source code is accessible through https://github.com/mqalkhatib/HybridCVNet
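
The two metrics quoted in this abstract, overall accuracy and the Kappa value, can both be read off a confusion matrix. The sketch below shows the standard definitions in plain Python; the 2x2 matrix is a made-up toy example, not a result from the Flevoland or San Francisco datasets.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total

def cohen_kappa(cm):
    """Cohen's kappa: agreement corrected for chance agreement.

    Assumes chance agreement is below 1, i.e. more than one class is present.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_observed = sum(cm[i][i] for i in range(n)) / total
    row_sums = [sum(cm[i]) for i in range(n)]
    col_sums = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    p_expected = sum(r * c for r, c in zip(row_sums, col_sums)) / (total * total)
    return (p_observed - p_expected) / (1.0 - p_expected)

# Toy confusion matrix: rows are true classes, columns are predictions.
cm = [[45, 5],
      [10, 40]]
oa = overall_accuracy(cm)
kappa = cohen_kappa(cm)
```

Here the overall accuracy is 0.85 while kappa is 0.7, which shows how kappa discounts the agreement that would be expected by chance given the class frequencies.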

2605.31124 2026-06-01 cs.CV 版本更新

QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer

QVGGT: 训练后量化的视觉几何基础Transformer

Zhizhen Pan, Hesong Wang, Huan Wang

发表机构 * Westlake University(西湖大学) Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 针对VGGT模型参数量大、部署受限的问题,提出QVGGT量化框架,通过选择性混合精度、令牌滤波与任务感知尺度搜索,实现近无损W4A16量化,显著降低内存和加速推理。

Comments Accepted by CVPR 2026. Project page: https://ddsacu.github.io/QVGGT/

详情
AI中文摘要

直接从图像估计3D属性的技术随着视觉几何基础Transformer(VGGT)的提出而迅速发展,该模型能够在前向传播中一次性预测相机参数、深度图和点云。然而,其12亿参数规模严重限制了在无人机和移动AR设备等资源受限平台上的部署。为解决这一限制,我们引入了QVGGT,一个专门为压缩VGGT而设计的量化框架。我们的方法基于以下观察:VGGT内的Transformer块对量化表现出异质性敏感度。因此,我们分析了逐块量化敏感度,并提出了一种选择性混合精度策略,为最脆弱的Transformer块分配更高精度。为了解决由高方差相机和注册令牌引起的量化误差放大问题,我们进一步引入了带相机信息补偿的令牌过滤,从激活校准中移除这些异常值,并使用PCA导出的全局补偿令牌恢复其几何线索。最后,我们开发了一种任务感知尺度搜索机制,不仅通过层重建,还通过多头监督以及相机姿态、深度图和点图之间的跨头几何一致性来评估候选量化尺度。在多个几何感知基准上的大量实验表明,QVGGT实现了近乎无损的W4A16量化,在保持所有3D预测头精度的同时,相比FP32实现了3~4.9倍的内存减少和高达2.8倍的硬件实际加速。我们的方法使得在边缘设备上实现高保真3D感知成为可能,从而在现实世界的受限环境中实现前馈3D重建模型的实际部署。

英文摘要

Estimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camera information compensation, which removes these outliers from activation calibration and restores their geometric cues using a PCA-derived global compensation token. Finally, we develop a task-aware scale search mechanism that evaluates candidate quantization scales not only through layer reconstruction but also through multi-head supervision and cross-head geometric consistency among camera poses, depth maps, and point maps. Extensive experiments on multiple geometry perception benchmarks demonstrate that QVGGT achieves near-lossless W4A16 quantization, preserving the accuracy of all 3D prediction heads while delivering 3$\sim$4.9$\times$ memory reduction and up to 2.8$\times$ real hardware speedup over FP32. Our approach makes high-fidelity 3D perception feasible on edge devices, enabling practical deployment of feed-forward 3D reconstruction models in real-world constrained environments.
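
W4A16 denotes 4-bit weights with 16-bit activations. As a minimal sketch of the weight side only, the code below applies symmetric round-to-nearest 4-bit quantization to one weight row with a single scale. This baseline scheme is an assumption chosen for illustration; it does not reproduce QVGGT's mixed-precision allocation, token filtering, or task-aware scale search.

```python
def quantize_row_int4(weights):
    """Symmetric 4-bit quantization of one weight row; returns (codes, scale)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Signed 4-bit integers cover [-8, 7]; clamp after rounding.
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale

def dequantize_row(codes, scale):
    """Map integer codes back to floating-point weights."""
    return [c * scale for c in codes]

row = [0.70, -0.35, 0.10, 0.02, -0.69]  # hypothetical weights
codes, scale = quantize_row_int4(row)
restored = dequantize_row(codes, scale)
max_error = max(abs(a - b) for a, b in zip(row, restored))
```

A scale search in the spirit of the abstract would try several candidate values of `scale` instead of the fixed `max_abs / 7.0` and keep the one that scores best under the chosen criterion, which in the simplest case is reconstruction error.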

2605.31116 2026-06-01 cs.CV cs.RO 版本更新

NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving

NTR:端到端驾驶中场景令牌瓶颈的神经令牌重建

Jiahui Li, Jiawei Sun, Zixiang Ren, Ming Liu, Jiamin Shi, Ruiteng Zhao, Zhiyang Liu, Liying Liu, Zuoguan Wang, Kaidi Yang

发表机构 * National University of Singapore(新加坡国立大学) Black Sesame Technologies(黑芝麻智能)

AI总结 针对端到端驾驶中场景令牌瓶颈缺乏视觉监督的问题,提出神经令牌重建(NTR)框架,通过自蒸馏掩码潜在重建约束场景令牌保留更丰富的视觉表示,实现最先进的驾驶性能。

详情
AI中文摘要

最近的无感知端到端自动驾驶方法通过将密集的图像块令牌压缩为紧凑的场景令牌,用于下游轨迹生成和评分,从而绕过了显式的感知输出。虽然这些场景令牌为规划器形成了紧凑的视觉瓶颈,但它们仅从规划目标接收监督,对编码的视觉信息提供了有限的约束。为了解决这一限制,我们引入了神经令牌重建(NTR),一种表示学习框架,直接约束无感知驾驶中的紧凑场景令牌瓶颈。NTR引入了一种自蒸馏掩码潜在重建目标,该目标仅使用紧凑的场景令牌作为重建记忆来重建被掩码的块级潜在特征。这迫使重建梯度仅通过场景令牌瓶颈传递,鼓励场景令牌为规划保留更丰富且更少冗余的视觉表示。我们进一步引入了来自基础模型注释的语义先验,作为弱语义接口,将重建目标偏向于驾驶相关结构,而不引入显式的感知头。所有辅助重建组件在推理时被移除,部署的规划器保持不变。NTR在三个公共自动驾驶基准测试中实现了最先进的性能,包括Waymo E2E上的8.0461 RFS以及NavSim1&2上的94.1 PDMS / 90.9 EPDMS。学习到的场景令牌表现出更低的成对冗余和更高的有效秩,表明有效的瓶颈监督同时改善了紧凑视觉表示学习和规划性能。

英文摘要

Recent perception-free end-to-end (E2E) autonomous driving methods bypass explicit perception outputs by compressing dense image patch tokens into compact scene tokens for downstream trajectory generation and scoring. While these scene tokens form a compact visual bottleneck for the planner, they receive supervision solely from the planning objective, providing limited constraints on the encoded visual information. To address this limitation, we introduce Neural Token Reconstruction (NTR), a representation learning framework to directly constrain the compact scene-token bottleneck in perception-free driving. NTR introduces a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. We further introduce semantic priors derived from foundation-model annotations as a weak semantic interface biasing reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time, leaving the deployed planner unchanged. NTR achieves state-of-the-art performance on three public autonomous driving benchmarks, including 8.0461 RFS on Waymo E2E and 94.1 PDMS / 90.9 EPDMS on NavSim1&2. The learned scene tokens exhibit lower pairwise redundancy and higher effective rank, indicating that effective bottleneck supervision improves both compact visual representation learning and planning performance.

2605.31115 2026-06-01 cs.CV 版本更新

Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning

Polyphony: 基于扩散的双手动作分割,采用交替视觉Transformer和语义条件

Hao Zheng, Hu Wang, Tiantian Zheng, Prajjwal Bhattarai, Tuka Alhanai

发表机构 * New York University Abu Dhabi(纽约大学阿布扎比分校) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出Polyphony三阶段方法,通过交替训练双手视觉Transformer、语义特征条件化和扩散分割,解决双手动作分割中的手间依赖、视觉不对称和语义模糊问题,在多个数据集上达到最优性能。

Comments CVPR 2026

详情
AI中文摘要

双手动作分割是从未修剪视频中密集预测双手动作,对于理解复杂的双手活动至关重要。然而,它带来了几个独特的挑战:复杂的手间依赖、双手之间的视觉不对称、主导手垄断梯度的表示冲突以及细粒度动作中的语义模糊性。我们提出了Polyphony,一种三阶段方法,通过以下方式应对这些挑战:(1) 交替双手视觉Transformer,在左右手小批量之间交替训练,以确保双手的梯度贡献平衡,同时共享时空编码器;(2) 语义特征条件化,将视觉特征与结构化的、组合式的动作描述对齐,以增强语义相似动作的区分度;(3) 基于扩散的分割,结合跨手特征融合以实现手间协调,以及自适应损失加权以平衡性能。Polyphony在双手数据集(HA-ViD、ATTACH)上达到了最先进水平,改进高达16.8个百分点,并在单流Breakfast数据集(82.5%)上超越了之前使用12倍大骨干网络的最佳方法。值得注意的是,我们的统一模型使用单个共享骨干网络,超越了需要单独每手模型的基线方法。代码位于https://github.com/x-labs-xyz/Polyphony-Dual-hand-Action-Segmentation。

英文摘要

Dual-hand action segmentation, densely predicting actions for both hands from untrimmed videos, is essential for understanding complex bimanual activities. However, it poses several unique challenges: complex inter-hand dependencies, visual asymmetry between hands, representation conflicts where the dominant hand monopolizes gradients, and semantic ambiguity in fine-grained actions. We propose Polyphony, a three-stage method to address these challenges through: (1) an Alternating Dual-Hand Vision Transformer that alternates training between left- and right-hand mini-batches to ensure balanced gradient contributions from both hands while sharing a spatio-temporal encoder; (2) Semantic Feature Conditioning that aligns visual features with structured, compositional action descriptions to enhance discrimination of semantically similar actions; and (3) Diffusion-Based Segmentation with cross-hand feature fusion for inter-hand coordination and adaptive loss weighting for balancing performance. Polyphony achieves state-of-the-art on both dual-hand datasets (HA-ViD, ATTACH) with improvements up to 16.8 points, and on the single-stream Breakfast dataset (82.5%), outperforming the prior best method that uses a 12x larger backbone. Notably, our unified model with a single shared backbone surpasses baselines requiring separate per-hand models. Code is at https://github.com/x-labs-xyz/Polyphony-Dual-hand-Action-Segmentation.

2605.31108 2026-06-01 cs.CV cs.LG 版本更新

Remembering by Reconstructing: Domain Incremental Learning With Test-Time Training on Video Streams

通过重建来记忆:视频流上的域增量学习与测试时训练

Jonathan Swinnen, Tinne Tuytelaars

发表机构 * ESAT, KU Leuven(ESAT,比利时鲁汶大学)

AI总结 提出一种结合主任务头和自监督掩码自编码器头的域增量学习方法,通过测试时训练识别最佳LoRA适配器以重新记忆域,适用于视频流数据。

详情
AI中文摘要

在这项工作中,我们提出了一种新颖的域增量学习方法,使模型能够随时间适应不断演变的非平稳数据。与其他工作不同,我们不试图避免灾难性遗忘,而是允许并利用它。我们的模型结合了一个主任务头和一个自监督掩码自编码器(MAE)头。然后在增量训练期间学习特定于域的LoRA适配器。每个适配器专攻其域,自然地在两个头上诱导对其他域的遗忘。在推理时,我们在自监督MAE头上进行在线测试时训练,以识别哪些LoRA最匹配当前输入,从而使模型能够再次“记住”该域。我们的方案特别适用于现实世界的流数据,例如视频,其中连续样本高度相关且域变化是渐进的。我们在域增量动作识别和语义分割任务上展示了我们的方法。

英文摘要

In this work, we introduce a novel approach to domain-incremental learning that adapts models over time to evolving, non-stationary data. In contrast to other works, we do not attempt to avoid catastrophic forgetting, but rather allow it and exploit it. Our model combines a main task head with a self-supervised masked autoencoder (MAE) head. We then learn domain-specific LoRA adapters during incremental training. Each adapter specializes to its domain, naturally inducing forgetting on other domains in both heads. At inference, we perform online test-time training on the self-supervised MAE head to identify which LoRA adapters best match the current input, so the model can "remember" the domain again. Our scheme is especially well-suited to real-world streaming data, such as video, where consecutive samples are highly correlated and domain shifts are gradual. We demonstrate our method on domain-incremental action recognition and semantic segmentation tasks.
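
The adapter-identification step described above can be illustrated with a minimal sketch. It assumes that each LoRA adapter is scored by the reconstruction loss of the MAE head on incoming frames and that the adapter with the lowest smoothed loss is selected; the abstract does not give the actual procedure, and the test-time gradient updates are omitted here. The names `select_adapter`, `toy_recon_loss`, and `DOMAIN_CENTERS` are hypothetical, and frames are reduced to single floats for illustration.

```python
from typing import Callable, Dict, Sequence


def select_adapter(
    frames: Sequence[float],
    adapter_names: Sequence[str],
    recon_loss: Callable[[str, float], float],
    momentum: float = 0.9,
) -> str:
    """Return the adapter whose smoothed reconstruction loss is lowest."""
    if not frames or not adapter_names:
        raise ValueError("need at least one frame and one adapter")
    smoothed: Dict[str, float] = {}
    for frame in frames:
        for name in adapter_names:
            loss = recon_loss(name, frame)
            if name in smoothed:
                # Exponential moving average: consecutive video frames are
                # correlated, so smoothing damps per-frame noise.
                smoothed[name] = momentum * smoothed[name] + (1.0 - momentum) * loss
            else:
                smoothed[name] = loss
    return min(adapter_names, key=lambda name: smoothed[name])


# Hypothetical stand-in for the MAE head: each adapter reconstructs frames
# near its own domain centre well and other frames poorly.
DOMAIN_CENTERS = {"daylight": 0.0, "night": 5.0}


def toy_recon_loss(name: str, frame: float) -> float:
    return abs(frame - DOMAIN_CENTERS[name])


stream = [4.8, 5.1, 5.0, 4.9]
print(select_adapter(stream, list(DOMAIN_CENTERS), toy_recon_loss))
```

In a real system `recon_loss` would run the shared backbone with the named adapter attached and return the masked-reconstruction error; smoothing over frames is one plausible way to exploit the gradual domain shifts the abstract mentions.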

2605.31096 2026-06-01 cs.CV 版本更新

iVGR: Internalizing Visually Grounded Reasoning for MLLMs with Reinforcement Learning

iVGR: 通过强化学习将视觉基础推理内化到多模态大语言模型中

Chang-Bin Zhang, Yujie Zhong, Qiang Zhang, Kai Han

发表机构 * Visual AI Lab, The University of Hong Kong(香港大学视觉人工智能实验室) Independent Researcher(独立研究者) University of Science and Technology of China(中国科学技术大学)

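AI summary: Proposes iVGR, a framework that uses reinforcement learning and a dual-stream training strategy to internalize visual localization into textual reasoning, avoiding the interference that explicit visual grounding causes at inference and improving fine-grained perception.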

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管视觉基础链式思维(CoT)已成为增强多模态大语言模型(MLLM)细粒度感知的有前途范式,但其在推理阶段的有效性仍未得到充分探索。在这项工作中,我们经验性地发现,与没有显式视觉基础的标准文本CoT相比,在推理时强制要求视觉基础CoT中的显式对象框通常会降低性能。我们假设视觉定位能力可以内化到文本CoT中,而强制性的显式基础会对模型的主要目标(答案预测)引入不必要的干扰。为了解决这个问题,我们提出了内化视觉基础推理(iVGR),一种新颖的强化学习框架,将定位能力转移到文本推理过程中。我们采用双流训练策略,通过提出的一致性奖励将文本流与高质量的视觉基础流对齐,使模型在推理时无需显式基础即可准确定位。大量实验表明,我们的方法在细粒度基准上显著优于现有基线,同时保持支持工具辅助推理工作流的灵活性。

英文摘要

While visually grounded Chain-of-Thought (CoT) has emerged as a promising paradigm to enhance fine-grained perception in multimodal large language models (MLLMs), its efficacy during the inference phase remains underexplored. In this work, we empirically find that mandating explicit object boxes in visually grounded CoT during inference often degrades performance compared to standard textual CoT, which reasons without explicit visual grounding. We hypothesize that the visual localization capability can be internalized into the textual CoT and that the mandatory explicit grounding introduces unnecessary interference with the model's primary objective of answer prediction. To address this problem, we propose Internalizing Visually Grounded Reasoning (iVGR), a novel reinforcement learning framework that transfers localization capabilities into the textual reasoning process. We employ a dual-stream training strategy, where a textual stream is aligned with a high-quality visually grounded stream via a proposed consistency reward, enabling the model to localize accurately without explicit grounding during inference. Extensive experiments demonstrate that our method significantly outperforms existing baselines on fine-grained benchmarks, while maintaining the flexibility to support tool-assisted inference workflows.

2605.31094 2026-06-01 cs.CV cs.AI 版本更新

Redefining Instance Matching: A Unified Framework for Part-Aware Matching in Panoptic Segmentation Evaluation

重新定义实例匹配:全景分割评估中部件感知匹配的统一框架

Erik Großkopf, Soumya Snigdha Kundu, Hendrik Möller, Nicolas Münster, Mehdi Astaraki, Paula Tamara Buzduga, Kerstin Ritter, Benedikt Wiestler, Jan Kirschke, Jonathan Shapey, Tom Vercauteren, Florian Kofler

Affiliations * Hertie Institute for AI in Brain Health, University of Tübingen, Germany; King's College London, UK; Technical University of Munich, Germany; Stockholm University, Sweden

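AI summary: Recasts segment matching in panoptic segmentation as a constrained bipartite assignment problem, defines four matching strategies, extends the framework to part-aware evaluation, and releases a unified open-source package built on Panoptica.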

Comments 9 pages, 4 figures

详情
AI中文摘要

全景质量(PQ)度量是联合评估实例分割和语义分割的标准。然而,其原始定义依赖于预测片段和真实片段之间的一对一匹配,只有当IoU阈值超过0.5时才是直接的。低于0.5时,在一个探索不足的问题空间中会出现多种匹配策略。我们通过将片段匹配重新表述为约束二分分配问题,系统地阐明了这个空间。独立地约束预测端和真实端的度数,产生了四种匹配策略:一对一、多对一、一对多和多对多。我们表明,前三种在PQ框架内是良好定义的,而多对多则超出其范围。当实例被碎片化、相邻物体难以划分或标注有噪声时,这些策略变得相关。我们框架的核心是基于顶点的TP、FN和FP计数,锚定于真实片段和预测片段,而不是匹配边。我们进一步表明,该框架自然地扩展到部件感知全景分割,并在生物医学数据上探索了部件感知评估。在可配置的案例研究中,我们报告了不同阈值和匹配策略组合在实际中的表现。我们发布了一个基于Panoptica的统一开源包,它暴露了基于Voronoi的区域分析、部件感知评估和阈值下曲线面积作为可配置选项。

英文摘要

The Panoptic Quality (PQ) metric is the standard for jointly evaluating instance and semantic segmentation. However, its original definition relies on a One-to-One matching between predicted and ground truth segments, which is only straightforward when the IoU threshold exceeds 0.5. Below 0.5, multiple matching strategies emerge in a poorly explored problem space. We systematically elucidate this space by recasting segment matching as a constrained bipartite assignment problem. Independently bounding the prediction- and ground-truth-side degrees yields four matching strategies: One-to-One, Many-to-One, One-to-Many, and Many-to-Many. We show that the first three are well-defined within the PQ framework, while Many-to-Many falls outside it. These strategies become relevant when instances are fragmented, adjacent objects are difficult to delineate, or annotations are noisy. Central to our framework is a vertex-based accounting of TP, FN, and FP, anchored to ground truth and predicted segments rather than to matching edges. We further show that the framework extends naturally to part-aware panoptic segmentation, and we explore part-aware evaluation on biomedical data. Across configurable case studies we report how different combinations of thresholds and matching strategies behave in practice. We release a unified open-source package built on Panoptica. It exposes Voronoi-based region-wise analysis, part-aware evaluation, and Area Under Threshold Curve computations as configurable options.
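
As background for the matching discussion above, the standard single-class PQ computation with One-to-One matching can be sketched as follows. With non-overlapping segments and an IoU threshold of at least 0.5, each ground-truth segment can match at most one prediction, so a simple scan suffices; below 0.5 this uniqueness is lost, which is the regime the paper studies. This is a sketch of the well-known formula PQ = (sum of matched IoUs) / (TP + FP/2 + FN/2), not of the paper's package; the `Segment` representation and function names are illustrative assumptions.

```python
from typing import List, Set, Tuple

Segment = Set[Tuple[int, int]]  # a segment as a set of (row, col) pixels


def iou(a: Segment, b: Segment) -> float:
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def panoptic_quality(preds: List[Segment], gts: List[Segment], thr: float = 0.5) -> float:
    """Single-class PQ with One-to-One matching (unique only for thr >= 0.5)."""
    if thr < 0.5:
        raise ValueError("matching is not guaranteed unique below 0.5")
    used = set()          # indices of predictions already matched
    iou_sum, tp = 0.0, 0
    for gt in gts:
        for j, pred in enumerate(preds):
            if j in used:
                continue
            value = iou(gt, pred)
            if value > thr:
                used.add(j)
                iou_sum += value
                tp += 1
                break
    fp = len(preds) - tp  # unmatched predictions
    fn = len(gts) - tp    # unmatched ground-truth segments
    denom = tp + 0.5 * fp + 0.5 * fn
    return iou_sum / denom if denom else 0.0


gts = [{(0, 0), (0, 1), (0, 2), (0, 3)}, {(5, 5)}]
preds = [{(0, 0), (0, 1), (0, 2)}, {(9, 9)}]
print(panoptic_quality(preds, gts))  # one match at IoU 0.75, one FP, one FN
```

Many-to-One or One-to-Many strategies would relax the "at most one match" constraint on one side of this loop, which is why the TP, FP, and FN accounting needs to be restated for them.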

2605.31093 2026-06-01 cs.CV 版本更新

Cross-Modal Clinical Knowledge Integration for Mammography Report Generation

跨模态临床知识整合用于乳腺X线报告生成

Jiayi Zhu, Fuxiang Huang, Yu Xie, Xi Wang, Zhixuan Chen, Yuan Guo, Qingcong Kong, Zhenhui Li, Qiong Luo, Hao Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Lingnan University(岭南大学) The Third Affiliated Hospital of Kunming Medical University, Yunnan Cancer Hospital, Peking University Cancer Hospital Yunnan(昆明医科大学第三附属医院、云南癌症医院、北京大学肿瘤医院云南分院) Guangzhou First People's Hospital, South China University of Technology(广州第一人民医院、华南理工大学) The Third Affiliated Hospital, Sun Yat-Sen University(中山大学第三附属医院) HKUST Shenzhen-Hong Kong Collaborative Innovation Research Institute(香港科技大学深圳-香港协同创新研究院)

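AI summary: Proposes MammoRG, a framework that simulates the clinical reporting workflow through two-stage training and incorporates the BI-RADS guideline and prior clinical knowledge to improve the clinical consistency of generated reports.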

Comments 16 pages, 5 figures

详情
AI中文摘要

乳腺癌是一个主要的全球健康问题,乳腺X线筛查在早期检测中起着核心作用。大量的筛查检查给放射科医生带来了沉重的工作负担,使得准确且一致的报告生成成为一个关键的临床挑战。现有的自动乳腺X线报告生成方法主要关注直接的视觉到文本映射,而忽略了放射科医生在实际工作中遵循的结构化临床推理过程。为了解决这一局限性,我们提出了MammoRG,一个乳腺X线报告生成框架,它通过遵循BI-RADS指南并整合先验临床知识来明确模拟临床报告工作流程,从而生成诊断报告。具体来说,MammoRG采用两阶段训练框架。在第一阶段,模型通过基于分类的监督学习从患者的四视图乳腺X线图像中整合临床相关的先验知识。在第二阶段,引入术语感知的监督微调策略,将乳腺X线特异性临床术语建模为原子语义单元,从而生成具有更高临床一致性的高质量报告。为了促进生成报告的临床效能评估,我们进一步开发了MammoRGTool,一个专用的乳腺X线报告解析工具,它从自由文本报告中提取结构化临床信息。大量实验表明,MammoRG在多个临床效能指标上持续优于现有方法,特别是在与诊断相关的BI-RADS F1上,它在内部、外部1、外部2和VinDr-Mammo数据集上分别超过第二名模型2.73%、2.04%、1.90%和3.27%。

英文摘要

Breast cancer is a major global health concern, and mammography screening plays a central role in early detection. The large volume of screening examinations creates a substantial workload for radiologists, making accurate and consistent report generation a critical clinical challenge. Existing automated mammography report generation methods primarily focus on direct visual-to-text mapping, while overlooking the structured clinical reasoning process followed by radiologists in real-world practice. To address this limitation, we propose MammoRG, a mammography report generation framework that explicitly simulates the clinical reporting workflow by following the BI-RADS guideline and incorporating prior clinical knowledge to produce diagnostic reports. Specifically, MammoRG adopts a two-stage training framework. In the first stage, the model learns to integrate clinically relevant prior knowledge from a patient's four-view mammograms through classification-based supervision. In the second stage, a terminology-aware supervised fine-tuning strategy is introduced to model mammography-specific clinical terms as atomic semantic units, enabling the generation of high-quality reports with improved clinical consistency. To facilitate clinical efficacy evaluation of generated reports, we further develop MammoRGTool, a dedicated mammography report parsing tool that extracts structured clinical information from free-text reports. Extensive experiments demonstrate that MammoRG consistently outperforms existing methods across multiple clinical efficacy metrics, particularly in diagnosis-related BI-RADS F1, where it surpasses the second-best model by 2.73%, 2.04%, 1.90%, and 3.27% on the internal, external 1, external 2, and VinDr-Mammo datasets, respectively.

2605.31090 2026-06-01 cs.CV cs.AI 版本更新

On Revisiting Entropy for Identifying Mislabeled Images

重新审视熵在识别错误标注图像中的应用

Chunlei Li, Zixuan Zheng, Yilei Shi, Guanglu Dong, Pengfei Li, Jingliang Hu, Xiao Xiang Zhu, Lichao Mou

发表机构 * MedAI Technology (Wuxi) Co. Ltd., Wuxi, China(MedAI技术(无锡)有限公司,无锡,中国) Sichuan University, Chengdu, China(四川大学,成都,中国) University of Basel, Allschwil, Switzerland(巴塞尔大学,阿勒西维尔,瑞士) Technical University of Munich, Munich, Germany(慕尼黑技术大学,慕尼黑,德国)

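AI summary: Proposes the signed entropy integral (SEI), a training-dynamics statistic that captures both the magnitude and temporal trend of prediction entropy to identify mislabeled training samples, achieving state-of-the-art performance on medical imaging datasets.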

Comments ICML 2026

详情
AI中文摘要

训练数据集中的错误标注样本会严重降低深度网络的性能,因为过参数化模型倾向于记忆错误标签。我们通过提出一种利用训练动态的错误标注数据检测新方法来应对这一挑战。我们的方法基于一个关键观察:正确标注的样本在训练过程中熵持续下降,而错误标注的样本在整个训练过程中保持相对较高的熵。基于这一见解,我们引入了一个有符号熵积分(SEI)统计量,它捕捉了训练周期中预测熵的幅度和时间趋势。SEI广泛适用于分类网络,并且在与对比语言-图像预训练(CLIP)架构集成时表现出特别的有效性。通过在四个医学影像数据集(由于诊断复杂性,该领域特别容易受到标注错误的影响)上进行涵盖不同模态和病理的广泛实验,我们证明SEI在错误标注数据识别中达到了最先进的性能,在保持计算效率和实现简单性的同时优于现有方法。我们的代码可在 https://github.com/MedAITech/SEI 获取。

英文摘要

Mislabeled samples in training datasets severely degrade the performance of deep networks, as overparameterized models tend to memorize erroneous labels. We address this challenge by proposing a novel approach for mislabeled data detection that leverages training dynamics. Our method is grounded in the key observation that correctly labeled samples exhibit consistent entropy decrease during training, while mislabeled samples maintain relatively high entropy throughout the training process. Building on this insight, we introduce a signed entropy integral (SEI) statistic that captures both the magnitude and temporal trend of prediction entropy across training epochs. SEI is broadly applicable to classification networks and demonstrates particular effectiveness when integrated with contrastive language-image pretraining (CLIP) architectures. Through extensive experiments on four medical imaging datasets spanning diverse modalities and pathologies (a domain particularly susceptible to labeling errors due to diagnostic complexity), we demonstrate that SEI achieves state-of-the-art performance in mislabeled data identification, outperforming existing methods while maintaining computational efficiency and implementation simplicity. Our code is available at https://github.com/MedAITech/SEI.
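
The observation above (entropy falls for clean samples and stays high for mislabeled ones) can be made concrete with a toy statistic. The abstract does not define SEI, so the formula below is a hypothetical stand-in rather than the authors' definition: it integrates per-epoch prediction entropy with the trapezoid rule and gives each interval a negative sign when entropy decreased over it. The function names are illustrative.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one predicted class distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def toy_signed_entropy_integral(prob_history: List[Sequence[float]]) -> float:
    """Hypothetical statistic: trapezoid area under the entropy curve,
    counted negatively over epochs where entropy went down."""
    values = [entropy(p) for p in prob_history]
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        area = 0.5 * (prev + cur)             # magnitude of entropy on this interval
        sign = -1.0 if cur < prev else 1.0    # temporal trend on this interval
        total += sign * area
    return total


# Per-epoch softmax outputs for two samples in a binary task.
clean = [[0.5, 0.5], [0.8, 0.2], [0.99, 0.01]]   # entropy keeps falling
noisy = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]     # entropy stays high
print(toy_signed_entropy_integral(clean), toy_signed_entropy_integral(noisy))
```

Under this stand-in, samples would be ranked by the statistic and the highest-scoring ones flagged for relabeling; the released code should be consulted for the actual definition.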

2605.31080 2026-06-01 cs.MM cs.AI cs.CL cs.CV cs.HC 版本更新

A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models

策展人引导的多语言艺术描述对盲人和低视力观众的小型视觉语言模型试点研究

Iosif Tsangko, Andreas Triantafyllopoulos, George Margetis, Ioana Crihana, Björn W. Schuller

Affiliations * Technical University of Munich; Foundation for Research and Technology - Hellas; National University of Science and Technology Politehnica Bucharest

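AI summary: Uses the small vision-language model Qwen2.5-VL-3B-Instruct to generate curator-guided art descriptions in German, Romanian, and Serbian for blind and low-vision audiences; in this pilot setup, language-specific adapters show more stable controllability and visually grounded description quality for Romanian and Serbian, while the multilingual adapter remains competitive for German.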

Comments 7 pages, 2 figures, 3 tables. Preprint

详情
AI中文摘要

盲人和低视力(BLV)观众在视觉艺术描述方面仍然服务不足,尤其是在跨语言和博物馆环境中,隐私和知识产权限制可能倾向于使用小型本地视觉语言模型(VLM)。本试点研究使用Qwen2.5-VL-3B-Instruct,针对德语、罗马尼亚语和塞尔维亚语,调查了策展人引导的多语言艺术描述。我们从艺术品图像和元数据构建了一个平行的BLV导向字幕语料库,并在固定骨干网络和训练预算下,比较了语言特定的LoRA适配器与单个多语言适配器。评估结合了自动词汇和基于嵌入的指标,以及针对小型罗马尼亚BLV试点研究校准的LLM作为评判协议。在我们的试点设置下,语言特定适配器在罗马尼亚语和塞尔维亚语上表现出更稳定的可控性和视觉基础描述质量,而多语言适配器在德语上仍具有竞争力。我们将这些发现视为小型本地VLM的部署导向证据,并强调在得出关于多语言可访问性的总体结论之前,需要进行更大规模的BLV用户研究和更广泛的语言覆盖。

英文摘要

Blind and low-vision (BLV) audiences remain underserved by visual art descriptions, particularly across languages and in museum settings where privacy and intellectual-property constraints may favour small on-premise vision-language models (VLMs). This pilot study investigates curator-guided multilingual art description with Qwen2.5-VL-3B-Instruct for German, Romanian, and Serbian. We construct a parallel BLV-oriented caption corpus from artwork images and metadata, and compare language-specific LoRA adapters with a single multilingual adapter under a fixed backbone and training budget. Evaluation combines automatic lexical and embedding-based metrics with an LLM-as-Judge protocol calibrated against a small Romanian BLV pilot study. Under our pilot setup, language-specific adapters show more stable controllability and visually grounded description quality for Romanian and Serbian, while multilingual adaptation remains competitive in German. We frame these findings as deployment-oriented evidence for small on-premise VLMs, and highlight the need for larger BLV user studies and broader language coverage before drawing general conclusions about multilingual accessibility.

2605.31075 2026-06-01 cs.CV 版本更新

Task-Focused Memorization for Multimodal Agents

面向多模态智能体的任务聚焦记忆

Tao Zou, Yichen He, Tian Qiu, Yuan Lin, Hang Li

发表机构 * Fudan University(复旦大学)

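AI summary: Proposes TaskMem, a reinforcement-learning-based framework for learning task-focused memorization policies; two-phase training lets multimodal agents dynamically select task-relevant memories from streaming observations, improving VQA accuracy by 5.3% to 7.0% on three streaming benchmarks.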

详情
AI中文摘要

长期记忆对于多模态智能体构建连贯经验、积累世界知识和实现持续学习至关重要。然而,构建有效记忆不仅涉及记忆模块设计和准确性、保真度等基本要求,关键挑战在于决定记忆什么。多模态智能体(如具身智能体)在真实或虚拟环境中持续感知、推理和行动,接收无界的多模态观测流。面对这种信息组合爆炸,智能体必须选择性地保留与其环境角色相关且对未来任务有价值的内容。为弥合这一差距,我们将记忆生成建模为可学习的记忆策略,并引入TaskMem(任务聚焦记忆策略学习),一种基于强化学习的框架,使策略能够动态调整其关注点以适应环境中遇到的实际任务需求。TaskMem采用两阶段训练范式:第一阶段在基本保真度要求下优化记忆质量,学习如何记忆;第二阶段在部署后进行,智能体通过在其基础MLLM上调整适配器来学习记忆什么,利用近期环境任务定义奖励模型,引导记忆策略聚焦于任务相关的内容。为评估我们的方法,我们将VideoMME、EgoLife和EgoTempo重新构建为流式基准,模拟智能体处理流式观测并处理在线到达任务的真实场景。为隔离记忆评估,问题必须仅使用智能体的记忆回答,而不访问原始视频。基于Qwen3-VL-30B-A3B,TaskMem在这些基准上分别将VQA准确率提高了6.3%、7.0%和5.3%。

英文摘要

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To bridge this gap, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm: Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements; Phase Two occurs after deployment, where the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks arriving online. To isolate memory assessment, the questions must be answered using only the agent's memory, without access to raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.

2605.31069 2026-06-01 cs.CV cs.CL 版本更新

Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining

面向有效长视频事件预测的多级事件语义挖掘

Bo Peng, YuanJie Lyu, PengGang Qin, Tong Xu

发表机构 * University of Science and Technology of China(中国科学技术大学)

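AI summary: Proposes VISTA, a framework that predicts events in long videos by mining event semantics at multiple levels (detail, event, and future), addressing the inability of existing models to extract event details precisely and to analyze event development at a fine grain.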

详情
AI中文摘要

准确预测未来事件是内容理解和决策制定的基础,涉及多个领域。先前研究主要关注文本或短视频场景,而长视频事件预测具有多模态上下文丰富和叙事复杂的特点,尚未得到充分探索。同时,基于大语言模型和视觉语言模型构建的近期长视频语言模型在长视频问答和摘要方面表现出潜力,但难以泛化到事件预测,因为它们既不能精确提取事件相关细节,也无法对事件发展进行细粒度分析。为弥补这一差距,我们提出VISTA,一个用于长视频事件预测的多级事件语义挖掘框架。首先,VISTA应用以角色为中心的视觉提示精确提取事件相关视觉细节,增强细节级语义;其次,采用知识增强的迭代检索策略,引导大语言模型逐步构建逻辑连贯的事件链,从而改善事件级叙事;最后,VISTA采用类人的先提议后检索策略生成多样化的面向未来的提议并整合多级线索,产生稳健准确的预测。在真实数据集上的大量实验验证了VISTA在长视频事件预测中的有效性。

英文摘要

Accurately predicting future events is fundamental to content understanding and decision-making across various domains. While prior research has primarily focused on text or short-video scenarios, long-video event prediction, characterized by vast multimodal context and more complex narratives, remains underexplored. Meanwhile, although recent Long-Video Language Models (LVLMs), built on Large Language Models (LLMs) and Vision-Language Models (VLMs), have shown promise in long-video question answering and summarization, they struggle to generalize to event prediction, as they can neither precisely extract event-related details nor perform fine-grained analysis of event development. To address this gap, we propose VISTA, a multi-level event semantics mining framework for long-video event prediction. Initially, VISTA applies a character-centric visual prompt to precisely extract event-related visual details, enhancing detail-level semantics; subsequently, it employs a knowledge-enhanced iterative retrieval strategy, guiding the LLM to progressively construct logically coherent event chains, thereby improving event-level narratives; ultimately, VISTA adopts a human-like propose-then-retrieve strategy to generate diverse future-oriented proposals and integrate multi-level clues, producing robust and accurate predictions. Extensive experiments on real-world datasets validate the effectiveness of VISTA for long-video event prediction.

2605.31068 2026-06-01 cs.CV 版本更新

HQ-JEPA: Hybrid Quantum Joint-Embedding Predictive Architecture for Cross-Modal Remote Sensing Representation Learning

HQ-JEPA: 用于跨模态遥感表示学习的混合量子联合嵌入预测架构

Md Aminur Hossain, Ayush V. Patel, Sanjay K. Singh, Biplab Banerjee

发表机构 * Space Applications Centre, Indian Space Research Organisation(印度空间研究组织空间应用中心) Centre of Studies in Resources Engineering, Indian Institute of Technology Bombay(印度理工学院孟买资源工程研究中心)

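AI summary: Proposes HQ-JEPA, a hybrid quantum-classical architecture that learns semantic representations from Sentinel-1/2 imagery using joint-embedding prediction, cross-modal alignment, SIGReg Gaussian regularization, and a quantum fidelity loss, with competitive and often superior results against strong baselines on GeoBench classification and segmentation tasks.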

Comments 19 pages

详情
AI中文摘要

我们提出了HQ-JEPA,一种用于跨模态遥感表示学习的混合量子-经典联合嵌入预测架构。该框架将JEPA风格的掩码潜在预测扩展到配对的Sentinel-1和Sentinel-2图像,通过从可见上下文区域预测掩码目标表示,同时在共享嵌入空间中对齐异构模态特征。为了提高表示质量,HQ-JEPA结合了四个互补目标:潜在令牌预测、跨模态令牌对齐、融合潜在空间中基于SIGReg的高斯正则化,以及基于可微SWAP测试的保真度量子相似性(FQS)损失。与像素重建方法不同,HQ-JEPA直接在潜在空间中学习语义表示,并使用基于量子态重叠的相似性作为额外的正则化信号。我们在线性探测和微调设置下,在GeoBench分类和分割任务上评估了预训练编码器。结果表明,HQ-JEPA在强自监督和遥感基础模型基线上取得了具有竞争力且通常更优的性能,证明了将预测性自监督、跨模态几何正则化和基于量子保真度的表示学习相结合对遥感应用的好处。

英文摘要

We introduce HQ-JEPA, a hybrid quantum-classical joint-embedding predictive architecture for cross-modal remote sensing representation learning. The proposed framework extends JEPA-style masked latent prediction to paired Sentinel-1 and Sentinel-2 imagery by predicting masked target representations from visible context regions while aligning heterogeneous modality features in a shared embedding space. To improve representation quality, HQ-JEPA combines four complementary objectives: latent token prediction, cross-modal token alignment, SIGReg-based Gaussian regularization in the fused latent space, and a differentiable SWAP-test-based Fidelity Quantum Similarity (FQS) loss. Unlike pixel reconstruction methods, HQ-JEPA learns semantic representations directly in latent space and uses quantum state-overlap-based similarity as an additional regularization signal. We evaluate the pretrained encoder on GeoBench classification and segmentation tasks under linear probing and fine-tuning settings. Results show that HQ-JEPA achieves competitive and often superior performance over strong self-supervised and remote sensing foundation-model baselines, demonstrating the benefit of integrating predictive self-supervision, cross-modal geometric regularization, and quantum fidelity-based representation learning for remote sensing applications.
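
The SWAP-test-based similarity mentioned above rests on a standard identity: for two pure states, the SWAP test's ancilla measures 0 with probability 1/2 + |<a|b>|^2 / 2, so the state overlap (fidelity) can be recovered as 2 * P(0) - 1. The sketch below simulates that relation classically on small amplitude vectors. The loss form `1 - fidelity` and all function names are assumptions for illustration; the abstract does not specify how the FQS loss is built or how embeddings are encoded into quantum states.

```python
import math
from typing import List, Sequence


def normalize(vec: Sequence[complex]) -> List[complex]:
    norm = math.sqrt(sum(abs(x) ** 2 for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / norm for x in vec]


def overlap_fidelity(a: Sequence[complex], b: Sequence[complex]) -> float:
    """|<a|b>|^2 for two pure states given as amplitude vectors."""
    a_n, b_n = normalize(a), normalize(b)
    inner = sum(x.conjugate() * y for x, y in zip(a_n, b_n))
    return abs(inner) ** 2


def swap_test_p0(a: Sequence[complex], b: Sequence[complex]) -> float:
    """Probability that the SWAP-test ancilla is measured as 0."""
    return 0.5 + 0.5 * overlap_fidelity(a, b)


def toy_fidelity_loss(a: Sequence[complex], b: Sequence[complex]) -> float:
    """Assumed loss form: 1 - fidelity, with fidelity read off as 2*P(0) - 1."""
    return 1.0 - (2.0 * swap_test_p0(a, b) - 1.0)


print(swap_test_p0([1, 0], [1, 0]), swap_test_p0([1, 0], [0, 1]))
```

Because P(0) is a smooth function of the amplitudes, a loss built on it can be differentiated, which is consistent with the abstract's description of the SWAP test as differentiable.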

2605.31057 2026-06-01 cs.CV cs.LG 版本更新

LVSA: Training-Free Sparse Attention for Long Video Diffusion

LVSA:长视频扩散的无训练稀疏注意力

Gael Glorian, Ioannis Lamprou, Zhen Zhang, Yujie Yuan, Hongsheng Liu

发表机构 * Distributed Parallel Technology Laboratory, Paris Research Center, Huawei Technologies France(华为法国巴黎研究中心分布式并行技术实验室) AI Framework and Data Technology Lab, Huawei Technologies Co., Ltd.(华为技术有限公司人工智能框架与数据技术实验室)

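AI summary: Proposes LVSA, a training-free, model-agnostic block-sparse attention method that combines a structured window pattern with rotating global anchors, reducing the compute cost of long-video diffusion inference while removing fixed-grid bias and supporting generation beyond the training horizon.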

Comments 10 pages, 5 figures, 4 tables. Code: https://github.com/JiusiServe/LongVideoSparseAttention

详情
AI中文摘要

密集自注意力是长视频扩散推理的计算和质量的瓶颈:成本随序列长度二次增长,且超出训练时域时模型收敛到近乎静态的输出,即“冻结”的重复视频。最先进的方法要么成本过高(例如需要重新训练),要么无法以可扩展的方式同时满足性能和质量目标。为此,我们提出长视频稀疏注意力(LVSA),一种无需训练、模型无关的块稀疏注意力方法,用于视频扩散Transformer,它结合了结构化窗口模式与旋转全局锚点,从而消除了导致长时域伪影的固定网格偏差。LVSA结合FlashInfer内核,与密集注意力相比,在Wan 2.1 1.3B上以6倍时域减少计算量达3.17倍,在Wan 2.1 14B上以6倍时域减少2.98倍,在HunyuanVideo 1.5上以1.5倍时域减少3.33倍。除了减少计算量,LVSA还使得HunyuanVideo 1.5能够在2倍时域下生成,否则在单个GPU上会内存不足。此外,与RIFLEx相比,LVSA在Wan 2.1 1.3B上提供高达2.41倍的加速,与UltraViCo相比提供3.27倍的加速。为了展示跨不同平台的适用性,我们将LVSA应用于NPU,与密集注意力相比,在Wan 2.2 A14B上实现高达2.71倍的加速,在Wan 2.1 1.3B上实现3.24倍的加速。为了公平地评估质量,我们引入了VQeval,一个正确评分循环视频失败的工具,而VBench-Long等最先进评估器则会奖励这类失败。LVSA在训练时域长度下生成时质量中性,在扩展长度下质量积极。

英文摘要

Dense self-attention is the compute and quality bottleneck of long-video diffusion inference: cost grows quadratically with sequence length, and beyond the training horizon the model converges to near-static output, that is, "frozen" repetitive video. State-of-the-art approaches are either too costly (for example, they require retraining) or fail to satisfy both performance and quality objectives in a scalable manner. To this end, we introduce Long Video Sparse Attention (LVSA), a training-free, model-agnostic block-sparse attention method for video diffusion transformers that combines a structured window pattern with rotating global anchors, thus removing the fixed-grid bias that causes long-range temporal artifacts. LVSA, combined with a FlashInfer kernel, reduces compute by up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon, compared to dense attention. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which otherwise runs out of memory on a single GPU. Moreover, LVSA provides speedups of up to 2.41x compared to RIFLEx and 3.27x compared to UltraViCo on Wan 2.1 1.3B. To demonstrate applicability across diverse platforms, we apply LVSA on NPUs and achieve speedups of up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B compared to dense attention. To evaluate quality fairly, we introduce VQeval, a tool that properly scores looping-video failures, which state-of-the-art evaluators such as VBench-Long instead reward. LVSA is quality-neutral for generation at the training horizon length and quality-positive at extended lengths.
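
To illustrate the idea of a structured window combined with rotating global anchors, here is a minimal sketch of a block-level attention mask. The abstract does not specify LVSA's actual pattern, so the layout below (a symmetric local window plus evenly spaced anchor columns whose offset shifts with a `step` counter) is an assumption, and `block_mask` and `density` are hypothetical names. The density figure shows how far the pattern is from the quadratic cost of dense attention.

```python
from typing import List


def block_mask(num_blocks: int, window: int, num_anchors: int, step: int) -> List[List[bool]]:
    """mask[q][k] is True when query block q may attend to key block k."""
    anchors = set()
    if num_anchors > 0:
        stride = max(1, num_blocks // num_anchors)
        offset = step % stride  # rotating the offset avoids a fixed anchor grid
        anchors = {(offset + i * stride) % num_blocks for i in range(num_anchors)}
    return [
        [abs(q - k) <= window or k in anchors for k in range(num_blocks)]
        for q in range(num_blocks)
    ]


def density(mask: List[List[bool]]) -> float:
    """Fraction of block pairs that are computed (1.0 means dense attention)."""
    kept = sum(sum(row) for row in mask)
    return kept / (len(mask) * len(mask[0]))


mask_a = block_mask(num_blocks=8, window=1, num_anchors=2, step=0)
mask_b = block_mask(num_blocks=8, window=1, num_anchors=2, step=1)
print(density(mask_a), mask_a != mask_b)
```

In practice such a mask would be handed to a block-sparse attention kernel, and `step` could be tied to the layer index or denoising step; both choices are assumptions here.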

2605.31048 2026-06-01 cs.CV 版本更新

Rethinking Efficient Crack Segmentation with Task-Aligned Structural-Directional Modeling

重新思考基于任务对齐的结构-方向性建模的高效裂缝分割

Shipeng Liu, Liang Zhao, Dengfeng Chen, Weihua Zhang

Affiliations * XAUAT (Xi'an University of Architecture and Technology)

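AI summary: Treats crack segmentation as sparse structural recovery and proposes the RIFT model family, which preserves local evidence, aggregates directional continuity, and restores crack structures through lightweight multi-scale fusion, achieving the best or tied-best results on 16 main metrics.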

详情
AI中文摘要

最近的裂缝分割方法通常遵循通用的语义分割设计,使用更强的骨干网络、混合CNN-Transformer-Mamba编码器和辅助增强分支。虽然有效,但这引发了疑问:更强的通用特征混合是否是裂缝分割最合适的方向。相反,我们将裂缝分割表述为稀疏结构恢复。裂缝具有有限的类别级语义,但具有很强的形态规律性,即细、稀疏、各向异性、局部碎片化,且容易与纹理或阴影混淆。因此,关键瓶颈在于保留弱结构证据、恢复方向连续性以及抑制背景耦合。我们提出RIFT,一个紧凑的形态对齐裂缝分割模型家族。RIFT设计简单,而不是压缩复杂的通用架构,它保留局部证据,聚合协作方向连续性,并通过轻量多尺度融合恢复裂缝结构。在四个公共基准上的实验表明,RIFT在16个主要指标上对再现的代表性基线取得了最佳或并列最佳结果。RIFT-B提供了最强的整体精度,而RIFT-T提供了最佳的部署效率,仅0.47M参数和高推理速度。拓扑感知评估、消融实验、迁移实验和可视化进一步验证了,当其归纳偏置与裂缝形态匹配时,任务对齐的简单性可以匹配或超越复杂的混合架构。代码:https://github.com/xauat-liushipeng/RIFT

英文摘要

Recent crack segmentation methods often follow generic semantic segmentation designs, using stronger backbones, hybrid CNN-Transformer-Mamba encoders, and auxiliary enhancement branches. Although effective, this raises the question of whether stronger generic feature mixing is the most suitable direction for crack segmentation. We instead formulate crack segmentation as sparse structural recovery. Cracks have limited category-level semantics but strong morphological regularities: they are thin, sparse, anisotropic, locally fragmented, and easily confused with textures or shadows. Thus, the key bottleneck lies in preserving weak structural evidence, recovering directional continuity, and suppressing background coupling. We propose RIFT, a compact family of morphology-aligned crack segmentation models. Rather than compressing a complex generic architecture, RIFT is simple by design, preserving local evidence, aggregating cooperative directional continuity, and restoring crack structures through lightweight multi-scale fusion. Experiments on four public benchmarks show that RIFT achieves the best or tied-best results across the 16 main metrics against reproduced representative baselines. RIFT-B gives the strongest overall accuracy, while RIFT-T provides the best deployment efficiency with only 0.47M parameters and high inference speed. Topology-aware evaluation, ablations, transfer experiments, and visualizations further verify that task-aligned simplicity can match or surpass complex hybrid architectures when its inductive bias fits crack morphology. Code: https://github.com/xauat-liushipeng/RIFT

2605.31041 2026-06-01 cs.CV cs.AI 版本更新

Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?

视觉信息在视觉-语言-动作模型驾驶行为中是否起决定性作用?

Jingtao He, Hongliang Lu, Xiaoyun Qiu, Yixuan Wang, Xinhu Zheng

Affiliations * Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou)

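AI summary: Proposes a structured multi-level visual perturbation framework to systematically analyze how much VLA driving models depend on visual information, revealing dependency patterns that vary with the evaluation protocol and are uneven across abstraction levels.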

详情
AI中文摘要

视觉-语言-动作(VLA)模型在自动驾驶中展现出令人期待的能力,凸显了统一多模态架构联合建模感知与规划的潜力。然而,当前基于VLA的驾驶行为如何植根于视觉信息仍知之甚少。现有评估协议主要关注聚合性能指标,缺乏结构化和实用的诊断方法来量化视觉-行为依赖性。在这项工作中,我们引入了一个结构化的多级视觉扰动框架,以系统分析基于VLA的驾驶模型中的视觉-行为依赖性。该框架沿着三个互补维度组织受控视觉扰动:通道级退化、信息级破坏和结构级修改。我们将其应用于基于VLA的驾驶系统,并在开环轨迹预测和交互式闭环安全评估下评估行为响应。实验揭示了依赖于评估的依赖模式以及跨抽象层次的不均匀视觉基础。这些发现呼吁对VLA驾驶模型进行更结构化的分析和原则性设计,以更好地理解视觉信息如何塑造行为,并开发更安全、更鲁棒的系统。

英文摘要

Vision-Language-Action (VLA) models have demonstrated promising capability in autonomous driving, highlighting the potential of unified multimodal architectures for jointly modeling perception and planning. However, how current VLA-based driving behavior is grounded in visual information remains poorly understood. Existing evaluation protocols mainly focus on aggregate performance metrics, lacking structured and practical diagnostics to quantify visual-behavior dependency. In this work, we introduce a structured multi-level visual perturbation framework to systematically analyze visual-behavior dependency in VLA-based driving models. The framework organizes controlled visual perturbations along three complementary dimensions: channel-level degradation, information-level disruption, and structure-level modification. We apply it to VLA-based driving systems and evaluate behavioral responses under both open-loop trajectory prediction and interactive closed-loop safety evaluation. Experimental results reveal evaluation-dependent dependency patterns and uneven visual grounding across abstraction levels. These findings call for more structured analyses and principled design of VLA driving models to better understand how visual information shapes behavior and develop safer, more robust systems.

2605.31033 2026-06-01 cs.CV 版本更新

SlotMemory: Object-Centric KV Memory for Streaming Long-Video Generation

SlotMemory: 面向流式长视频生成的以对象为中心的KV记忆

Weijia Dou, Hui Li, Jiahao Cui, Lei Zhou, Jingdong Wang, Siyu Zhu

发表机构 * Fudan University(复旦大学) Meta Superintelligence Labs(Meta超智能实验室) Baidu(百度)

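AI summary: Proposes SlotMemory, an object-centric key-value memory mechanism that decomposes the transformer's key-value manifold into discrete semantic slots for entity-level persistence and prompt-aware retrieval, yielding a 22.8% relative improvement in dynamic consistency on 60-second interactive narratives.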

详情
AI中文摘要

流式视频生成模型通常依赖于以时间为中心的记忆,将历史上下文组织为原始帧、片段或未聚类的令牌。这种组织方式常导致实体离开画面或交互式提示转换时出现身份漂移和语义不一致。为解决这些限制,我们提出SlotMemory,一种用于流式视频扩散的以对象为中心的键值记忆机制。我们的方法通过将变换器的键值流形分解为离散、可重用的语义槽,将记忆抽象从事件发生的“何时”转移到所表示的“什么”。通过利用这些槽作为路由地址来索引和存储高保真键值令牌,我们实现了跨长时域的实体级持久性和提示感知检索。在使用Wan2.1-T2V-1.3B骨干网络对60秒交互叙事进行评估时,SlotMemory达到了81.61的最先进质量分数,并在动态一致性上比现有最强流式基线相对提升22.8%。我们的结果表明,结构化的语义表示,而非原始时间容量,是持久长视频合成的关键原语。我们的代码和检查点可在https://tj12323.github.io/SlotMemory/获取。

英文摘要

Streaming video generation models typically rely on temporal-centric memory, which organizes historical context as raw frames, chunk segments, or unclustered tokens. This organization frequently leads to identity drift and semantic inconsistency when entities exit the frame or during interactive prompt transitions. To address these limitations, we propose SlotMemory, an object-centric key-value memory mechanism for streaming video diffusion. Our approach shifts the memory abstraction from "when" an event occurred to "what" is being represented by decomposing the transformer's key-value manifold into discrete, reusable semantic slots. By utilizing these slots as routing addresses to index and store high-fidelity key-value tokens, we enable entity-level persistence and prompt-aware retrieval across long horizons. Evaluated on 60-second interactive narratives using the Wan2.1-T2V-1.3B backbone, SlotMemory achieves a state-of-the-art quality score of 81.61 and a 22.8% relative improvement in dynamic consistency over the strongest existing streaming baseline. Our results demonstrate that structured semantic representation, rather than raw temporal capacity, is the essential primitive for persistent long-form video synthesis. Our code and checkpoints are available at https://tj12323.github.io/SlotMemory/.

2605.31029 2026-06-01 cs.CV 版本更新

PEEK: Picking Essential frames via Efficient Knowledge distillation

PEEK: 通过高效知识蒸馏提取关键帧

Killian Steunou, Anas Filali Razzouki, Khalil Guetari, Mounîm A. El-Yacoubi, Yannis Tevissen

Affiliations * Télécom SudParis, SAMOVAR; Institut Polytechnique de Paris; Moments Lab

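AI summary: Proposes PEEK, which uses knowledge distillation to transfer a teacher model's frame relevance rankings to a lightweight temporal model for efficient dynamic frame sampling, improving video captioning performance at low frame budgets.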

Comments Supplementary material at https://www.killian-steunou.com/peek/static/pdfs/peek_supplementary.pdf

详情
AI中文摘要

视频语言模型只能处理有限数量的帧,使得帧选择成为高效视频字幕生成的关键瓶颈。大多数字幕生成流程仍依赖均匀采样,该方法计算成本低但忽略视觉内容。自适应帧采样最近成为从视频中选择最具信息量帧的有前景方法,但现有方法计算成本仍然高昂。我们提出PEEK,一种高效的动态帧采样方法,它将字幕条件帧相关性排名从更强的教师模型蒸馏到仅基于视觉内容运行的轻量级时序模型中。我们发现,总体而言,在ActivityNet Captions和MSR-VTT上,我们的方法在所有评估的下游视觉语言模型中优于最先进方法,特别是当仅选择一或两帧进行字幕生成时,在大多数帧预算下获得最佳CIDEr分数。在ActivityNet Captions上,PEEK尤其强大,在16个配置中赢得14个。在MSR-VTT上的零样本评估表明,我们的模型在低帧预算下迁移效果最佳,而在四帧和八帧时结果更为混合,因为时间覆盖和视觉多样性变得更具竞争力。与最近的自适应基线相比,PEEK在低预算场景下更准确且更高效:它仅增加5.2%的字幕生成时间,而CSTA增加65.4%,MaxInfo增加211.9%。我们在https://github.com/momentslab/peek发布代码和预训练检查点。

英文摘要

Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising approach for selecting the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. We find that, overall, on ActivityNet Captions and MSR-VTT, our method outperforms state-of-the-art methods across all evaluated downstream vision-language models, especially when only one or two frames are selected for captioning, obtaining the best CIDEr for most frame budgets. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.
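
The contrast between uniform sampling and score-based frame selection described above can be sketched in a few lines. The per-frame relevance scores are assumed to come from some lightweight scoring model (in PEEK's case, the distilled student); the scoring itself is not shown, and `select_frames` and `uniform_frames` are hypothetical helper names rather than functions from the released code.

```python
from typing import List, Sequence


def uniform_frames(num_frames: int, budget: int) -> List[int]:
    """Content-agnostic baseline: centres of `budget` equal-length segments."""
    return [int((i + 0.5) * num_frames / budget) for i in range(budget)]


def select_frames(scores: Sequence[float], budget: int) -> List[int]:
    """Keep the `budget` highest-scoring frames, returned in temporal order."""
    if budget <= 0:
        return []
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical relevance scores for a five-frame clip.
scores = [0.9, 0.1, 0.3, 0.2, 0.8]
print(uniform_frames(len(scores), 2), select_frames(scores, 2))
```

Returning the selected indices in temporal order keeps the frames in their original sequence for the downstream captioning model, which matters once more than one frame is selected.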

2605.31001 2026-06-01 cs.CV 版本更新

Iterative Framework For Data Augmentation Of Segmented Fingerprints

分割指纹数据增强的迭代框架

João Leonardo H. D. Agnol, Wesley Augusto de Bona, Erick Oliveira Rodrigues, Luiz Fernando Puttow Southier, Jefferson Oliva, Marcelo Filipak, Dalcimar Casanova

发表机构 * Federal University of Technology (UTFPR), Pato Branco, Paraná, Brazil(联邦技术大学(UTFPR),帕托布兰科,巴西巴拉那州)

AI总结 针对婴儿指纹数据稀缺问题,提出一种迭代数据增强方法,通过在训练用于提取指纹脊线和谷线的卷积神经网络中引入错误,生成多样化的分割指纹变体,实验证明该方法能有效扩展指纹变异性且保持视觉相似性。

详情
Journal ref
Anais do XV Workshop de Sistemas de Informação 2024
AI中文摘要

婴儿生物识别由于婴儿与成人之间的生理差异而面临独特挑战,加上可用于研究的数据稀缺,限制了稳健匹配系统的发展。本文提出一种新颖的数据增强方法,使用迭代技术通过在训练用于提取指纹脊线和谷线的卷积神经网络中引入错误,生成分割指纹的多样化变体。在真实婴儿指纹上的实验证明了该方法在扩展指纹变异性方面的有效性,增强后的指纹在细节计数上表现出显著波动,同时仍保持与原始指纹的视觉相似性。研究还强调了该方法在应用不同程度变化到指纹分割方面的可定制性。未来研究包括使用所提框架增强的数据集训练分割和匹配神经网络。

英文摘要

Infant biometrics presents unique challenges due to the physiological differences between infants and adults, compounded by the scarcity of available data for research that limits the development of robust matching systems. This paper proposes a novel data augmentation method that uses iterative techniques to generate diverse variants of segmented fingerprints by inducing errors in a convolutional neural network trained to extract fingerprint ridges and valleys. Experiments on real infant fingerprints demonstrate the method's effectiveness in expanding fingerprint variability, with augmentations exhibiting significant fluctuations in minutiae counts while still retaining visual similarity to the originals. The study also highlights the method's customizable nature for applying varying levels of changes to fingerprint segmentations. Future research includes training segmentation and matching neural networks using datasets augmented by the proposed framework.

2605.30991 2026-06-01 cs.LG cs.CV 版本更新

Parallel Tempering Initial Sampling in Inference-Time Reward Alignment

推理时奖励对齐中的并行回火初始采样

Myeongjun Oh, Gwangho Kim, Sungyoon Lee

发表机构 * Department of Artificial Intelligence(人工智能系) Department of Computer Science(计算机科学系)

AI总结 针对推理时奖励对齐中标准SMC方法因初始采样陷入局部模式的问题,提出基于并行回火的PATHS方法,通过耦合多条回火链实现高效探索,提升对齐质量。

Comments 31 pages, 11 figures

详情
AI中文摘要

推理时奖励对齐无需重新训练即可引导预训练的扩散和基于流的生成模型满足用户指定的奖励。最近,序贯蒙特卡洛(SMC)通过迭代过滤和传播多个粒子成为该任务的有力框架。然而,我们表明基于SMC的标准方法通常性能不佳,因为它们从标准先验初始化粒子,而复杂奖励景观中的高奖励区域极为罕见。此外,我们表明即使最近的奖励感知初始采样方法仍然容易陷入局部模式,因为复杂奖励景观通常是多模态的。为克服这些限制,我们提出PATHS(用于高复杂度奖励采样的并行回火),一种通过并行回火耦合多个采样链的新型初始化方法。PATHS维护一个奖励回火链的阶梯,并定期执行Metropolis交换,从而在平坦化的奖励景观中实现高效探索,缓解模式陷阱问题。我们的分析表明,该机制显著增强了有限预算下对通常难以采样的罕见高奖励区域的探索。在布局到图像和数量感知生成上的实验表明,PATHS在对齐质量上取得了一致的提升,尤其是在复杂提示上。

英文摘要

Inference-time reward alignment steers pretrained diffusion and flow-based generative models to satisfy user-specified rewards without retraining. Recently, Sequential Monte Carlo (SMC) has emerged as a powerful framework for this task by iteratively filtering and propagating multiple particles. However, we show that standard SMC-based methods often suffer from poor performance because they initialize particles from a standard prior, whereas high-reward regions in complex reward landscapes are extremely rare. Further, we show that even recent reward-aware initial sampling approaches remain vulnerable to getting trapped in local modes, as complex reward landscapes are often multi-modal. To overcome these limitations, we propose PATHS (PArallel Tempering for High-complexity reward Sampling), a novel initialization method that couples multiple sampling chains through parallel tempering. PATHS maintains a ladder of reward-tempered chains and periodically performs Metropolis swaps, enabling efficient exploration across flattened reward landscapes, thereby mitigating the mode-trapping issues. Our analysis reveals that this mechanism substantially enhances the finite-budget exploration of rare, high-reward regions that are typically challenging to sample. Experiments on layout-to-image and quantity-aware generation show that PATHS achieves consistent gains in alignment quality, particularly on complex prompts.
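
The Metropolis swap between reward-tempered chains mentioned above can be sketched with the textbook parallel tempering rule. This is a generic illustration, not the PATHS algorithm itself: it assumes chain k targets a density proportional to prior(x) * exp(beta_k * reward(x)), and the names `swap_acceptance` and `swap_pass` are hypothetical.

```python
import math
import random


def swap_acceptance(beta_i: float, beta_j: float, r_i: float, r_j: float) -> float:
    """Metropolis probability of swapping the states of two tempered chains.

    Chain k is assumed to target prior(x) * exp(beta_k * reward(x)); r_i and
    r_j are the rewards of the states currently held by chains i and j.
    """
    log_ratio = (beta_i - beta_j) * (r_j - r_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def swap_pass(betas, states, reward, rng):
    """One sweep of swap proposals between neighbouring rungs of the ladder."""
    for k in range(len(betas) - 1):
        p = swap_acceptance(betas[k], betas[k + 1],
                            reward(states[k]), reward(states[k + 1]))
        if rng.random() < p:
            states[k], states[k + 1] = states[k + 1], states[k]
    return states
```

Under this rule a high-reward state found by a flat (low-beta) chain is always accepted into the sharper neighbouring chain, while the reverse move is accepted only with probability below one, which is how exploration in flattened landscapes can feed the target chain.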

2605.30987 2026-06-01 cs.CV 版本更新

Benchmarking Single-Step Inpainting Methods for Multi-Object 3D Gaussian Splatting Scenes

多对象3D高斯泼溅场景的单步修复方法基准测试

Finn Dröge, Cecilia Curreli, Abhishek Saroha, Daniel Cremers

发表机构 * Technical University of Munich(慕尼黑技术大学) Munich Center for Machine Learning(慕尼黑机器学习中心)

AI总结 针对3D高斯泼溅场景中的对象移除与修复任务,比较了2D修复器在3D一致性上的表现,发现基于重建的修复器优于生成扩散模型,且从头初始化场景比微调现有场景效果更好,同时引入了一个带真实数据的新多对象场景。

Comments Accepted as an extended abstract to the CVEU Workshop at CVPR 2026

详情
AI中文摘要

对象移除和修复3D高斯泼溅(3DGS)场景面临跨相机视图的3D一致性等挑战。在比较2D修复器及其对3D领域的适用性时,我们发现基于重建的修复器在3D一致性上优于生成扩散模型。将这些2D修复器集成到创建和微调3DGS场景的不同单步方法中,我们的结果表明,从头初始化场景比微调现有场景产生更高质量的结果。使用最先进的生成式2D修复器,我们创建了一个简单的基线,以强调在3D设置中先移除对象再进行修复的重要性。由于360°数据集很少包含真实世界的地面真值,且具有挑战性的遮挡场景同样稀少,我们引入了一个新的多对象场景,其中包含记录的地面真值数据和多个存在对象遮挡的视图。

英文摘要

The tasks of object removal and inpainting 3D Gaussian Splatting (3DGS) scenes face challenges such as 3D consistency across camera views. In comparing 2D inpainters and their suitability for the 3D domain, we find that reconstruction-based inpainters outperform generative diffusion models in 3D consistency. Integrating these 2D inpainters into different single-step methods for creating and finetuning 3DGS scenes, our results indicate that initializing the scene from scratch produces higher quality results than finetuning the existing scene. Using a state-of-the-art generative 2D inpainter, we create a straightforward baseline to underline the importance of object removal before inpainting in the 3D setting. Since 360° datasets rarely include real-world ground truths, and challenging occlusion scenarios are equally sparse, we introduce a novel multi-object scene with recorded ground truth data and many views with object occlusions.

2605.30984 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation

生成报告还是重复模板?测量和缓解三维CT报告生成中的模板崩溃

Tom Maye-Lasserre, Yitong Li, Bailiang Jian, Morteza Ghahremani, Benedikt Wiestler, Christian Wachinger

发表机构 * Technical University of Munich (TUM)(慕尼黑技术大学) TUM Hospital(TUM医院) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心)

AI总结 针对三维CT报告生成中模型输出多样性低、病理检测能力差的模板崩溃问题,提出解耦框架CLarGen,通过分离临床检测与语言合成,显著提升临床准确性并保持报告流畅性。

详情
AI中文摘要

现代三维医学视觉语言模型(VLM)能够生成流畅的放射学风格文本,但表现出极低的病理检测率和输出多样性,崩溃为低估罕见但关键发现的通用模板。我们将这种失败模式识别为模板崩溃。这种失败源于三维医学成像的独特限制,例如数据有限、标签严重不平衡以及体积编码器的弱信号。在这些限制下,文本生成目标鼓励捷径学习和流畅但基础薄弱的报告。我们通过临床保真度、输出多样性、正常模板偏差和罕见发现存活率系统性地诊断模板崩溃。为了缓解它,我们提出CLarGen,一个解耦框架,将说什么(临床检测)与怎么说(语言合成)分开。CLarGen使用(i)用于多标签病理检测的潜在查询变换器,(ii)用于临床匹配示例的病理引导检索,以及(iii)用于从检测到的发现和检索到的上下文中合成最终报告的医学语言模型。在最新的三维CT报告生成基线中,CLarGen缓解了模板崩溃,并在保持流畅报告的同时显著提高了临床准确性(macro-F1 0.487 vs. 0.189;CRG 0.472 vs. 0.368)。我们的结果表明,明确、可测量的临床基础对于抗模板崩溃的三维CT报告生成至关重要。代码将在接收后发布。

英文摘要

Modern 3D medical vision-language models (VLMs) can generate fluent radiology-style text while exhibiting critically low pathology detection and output diversity, collapsing to generic templates that under-report rare yet critical findings. We identify this failure mode as Template Collapse. This failure stems from the unique constraints of 3D medical imaging, e.g., limited data, severe label imbalance, and weak signals from volumetric encoders. Under these constraints, text-generation objectives encourage shortcut learning and fluent but weakly grounded reports. We systematically diagnose Template Collapse through clinical fidelity, output diversity, normal-template bias, and rare-finding survival. To mitigate it, we propose CLarGen, a decoupled framework that separates what to say (clinical detection) from how to say it (language synthesis). CLarGen uses (i) a Latent Query Transformer for multi-label pathology detection, (ii) pathology-guided retrieval for clinically matched exemplars, and (iii) a medical language model to synthesize the final report from detected findings and retrieved context. Across state-of-the-art 3D CT report generation baselines, CLarGen mitigates Template Collapse and substantially improves clinical accuracy (macro-F1 0.487 vs. 0.189; CRG 0.472 vs. 0.368) while maintaining fluent reporting. Our results suggest that explicit, measurable clinical grounding is essential for template-collapse-resistant 3D CT report generation. Code will be released upon acceptance.

2605.30983 2026-06-01 cs.CV 版本更新

Can BEV Perception Gracefully Degrade under Sensor Failures?

BEV感知能否在传感器故障下优雅降级?

Haifa Zhang, Yijing Wang, Haoyu Wang, Zheng Li, Zhiqiang Zuo

发表机构 * Tianjin Key Laboratory of Intelligent Unmanned Swarm Technology and System(天津智能无人群技术与系统重点实验室) School of Electrical and Information Engineering, Tianjin University(天津大学电气与信息工程学院) Key Laboratory of System Control and Information Processing, Ministry of Education of China(系统控制与信息处理重点实验室,中华人民共和国教育部)

AI总结 针对多模态BEV感知在传感器损坏时性能骤降的问题,提出Grace-BEV框架,通过主动可靠性评估和动态特征重校准实现优雅降级,在极端LiDAR故障下将mAP从0.0%恢复至34.7%。

详情
AI中文摘要

尽管多模态鸟瞰图(BEV)感知在自动驾驶中取得了显著成功,但现有系统存在一个关键脆弱性:现有融合机制对传感器损坏高度敏感,常导致灾难性性能下降。这种脆弱性主要源于标准融合框架通常以静态方式集成多模态表示,导致在缺失或损坏模态下性能急剧崩溃。相比之下,我们表明通过主动模态可靠性评估可以实现优雅降级。为此,我们提出Grace-BEV,一个轻量级即插即用框架,在多模态融合过程中强制引入主动可靠性感知。Grace-BEV不依赖计算昂贵的跨模态交互,而是利用对齐的BEV空间通过TrustGate路由器显式评估模态可信度,并使用FailSafe融合块动态重新校准特征集成。此外,我们设计了带模态丢弃的三阶段训练策略,以防止模态主导并鼓励在不可靠输入下进行平衡的跨模态学习。在nuScenes-R和nuScenes-C上的大量实验表明,Grace-BEV在各种损坏设置下保持稳健性能。值得注意的是,在标准基线崩溃至0.0%平均精度(mAP)的灾难性LiDAR故障下,Grace-BEV将性能恢复至高达34.7% mAP。此外,它将干净准确率提升高达1.4%,实现了鲁棒性与效率之间的强权衡。

英文摘要

Despite the remarkable success of multi-modal bird's-eye view (BEV) perception in autonomous driving, current systems exhibit a critical vulnerability: existing fusion mechanisms are highly brittle to sensor corruptions, often causing catastrophic performance degradation. This vulnerability largely stems from the fact that standard fusion frameworks typically integrate multi-modal representations in a static manner, leading to a precipitous performance collapse under missing or corrupted modalities. In contrast, we show that graceful degradation is achievable through active modality reliability assessment. To this end, we present Grace-BEV, a lightweight and plug-and-play framework that enforces active reliability awareness during multi-modal fusion. Instead of relying on computationally expensive cross-modal interactions, Grace-BEV leverages the aligned BEV space to explicitly assess modality trustworthiness via a TrustGate Router and dynamically recalibrate feature integration using the FailSafe Fusion Block. Furthermore, we devise a Three-Phase Training strategy with Modality Dropout to prevent modality dominance and encourage balanced cross-modal learning under unreliable inputs. Extensive experiments on nuScenes-R and nuScenes-C show that Grace-BEV maintains robust performance across diverse corruption settings. Notably, under catastrophic LiDAR failures where standard baselines collapse to 0.0% mean Average Precision (mAP), Grace-BEV restores performance to as high as 34.7% mAP. Moreover, it improves clean accuracy by up to 1.4%, achieving a strong trade-off between robustness and efficiency.
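
The idea of weighting modalities by assessed reliability, rather than fusing them statically, can be sketched as a softmax-gated weighted sum. This is only an illustration under stated assumptions: the abstract does not specify how the TrustGate Router or FailSafe Fusion Block work, so the trust scores, the softmax gate, and the function names below are all hypothetical.

```python
import math


def reliability_weights(trust_scores: dict[str, float]) -> dict[str, float]:
    """Softmax over per-modality trust scores (higher score = more trusted)."""
    top = max(trust_scores.values())
    exps = {m: math.exp(s - top) for m, s in trust_scores.items()}
    total = sum(exps.values())
    return {m: e / total for m, e in exps.items()}


def fuse(features: dict[str, list[float]], trust_scores: dict[str, float]) -> list[float]:
    """Weighted sum of aligned per-modality feature vectors."""
    weights = reliability_weights(trust_scores)
    dim = len(next(iter(features.values())))
    return [sum(weights[m] * features[m][d] for m in features) for d in range(dim)]
```

With equal trust the result is a plain average; when one modality's trust score drops sharply, its contribution vanishes and the fused vector falls back to the remaining modality instead of being corrupted.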

2605.30972 2026-06-01 cs.CV 版本更新

BiSegMamba: Efficient Bidirectional Tri-Oriented Mamba for 3D Medical Image Segmentation

BiSegMamba: 用于3D医学图像分割的高效双向三向Mamba

Bakht Zada, Chao Tong, Qile Su, Shuai Zhang

发表机构 * School of Computer Science and Engineering, Beihang University(北航计算机科学与工程学院) State Key Laboratory of Virtual Reality Technology and Systems, Beihang University(北航虚拟现实技术与系统国家重点实验室)

AI总结 提出BiSegMamba,一种基于双向三向Mamba的高效3D医学图像分割网络,通过渐进压缩主干、多尺度空间混合器、双向正交Mamba块和自适应方向融合,在降低计算成本的同时提升分割精度。

Comments 10 pages, 7 figures, 5 tables. Code is available at: https://github.com/bakhtzadaabshare/BiSegMamba

详情
AI中文摘要

精确的3D医学图像分割需要长程体积上下文和精细边界保持。基于CNN的方法全局依赖建模有限,而基于Transformer的模型对于密集3D输入通常计算成本高昂。最近的基于Mamba的方法提供了一种高效替代方案,但现有的体积设计仍依赖于重复的高分辨率扫描、仅前向的顺序建模和固定的方向求和,导致高成本、扫描顺序偏差和次优的方向聚合。我们提出BiSegMamba,一种用于3D医学图像分割的高效双向三向Mamba网络。BiSegMamba遵循紧凑到细节的设计,其中渐进压缩主干(PCS)能够进行高效的潜在空间推理,同时保留浅层高分辨率特征用于重建。多尺度空间混合器(MSSM)在早期阶段捕获局部解剖模式,而提出的双向三向正交Mamba(Bi-ToOM)块使用联合处理的前向和后向扫描序列,从多个正交视图建模长程依赖。自适应方向融合(ADF)学习跨扫描方向的输入相关通道权重,用方向感知融合替代固定求和。在收集的颈动脉CTA数据集和三个公共基准BraTS2023、ACDC和AMOS-CT上的实验表明,BiSegMamba在血管、心脏、脑肿瘤和腹部多器官分割任务中具有良好的泛化能力。与SegMamba-V2相比,BiSegMamba在BraTS2023上性能略有提升,在ACDC和颈动脉数据集上显著改进,同时计算成本降低高达77.9% FLOPs,展示了在通用3D医学图像分割中强大的精度-效率平衡。

英文摘要

Accurate 3D medical image segmentation requires both long-range volumetric context and fine boundary preservation. CNN-based methods have limited global dependency modeling, while Transformer-based models are often computationally expensive for dense 3D inputs. Recent Mamba-based methods provide an efficient alternative, but existing volumetric designs still depend on repeated high-resolution scanning, forward-only sequential modeling, and fixed directional summation, causing high cost, scan-order bias, and suboptimal directional aggregation. We propose BiSegMamba, an efficient bidirectional tri-oriented Mamba network for 3D medical image segmentation. BiSegMamba follows a compact-to-detail design, where a progressive compacting stem (PCS) enables efficient latent-space reasoning while retaining shallow high-resolution features for reconstruction. A multi-scale spatial mixer (MSSM) captures local anatomical patterns in early stages, and the proposed bidirectional tri-oriented Ortho Mamba (Bi-ToOM) block models long-range dependencies from multiple orthogonal views using jointly processed forward and backward scan sequences. Adaptive directional fusion (ADF) learns input-dependent channel-wise weights across scan orientations, replacing fixed summation with orientation-aware fusion. Experiments on a collected carotid CTA dataset and three public benchmarks, BraTS2023, ACDC, and AMOS-CT, show that BiSegMamba generalizes well across vascular, cardiac, brain tumor, and abdominal multi-organ segmentation tasks. Compared with SegMamba-V2, BiSegMamba achieves slightly better performance on BraTS2023 and clear improvements on ACDC and the carotid dataset, while reducing computational cost by up to 77.9% FLOPs, demonstrating a strong accuracy-efficiency balance for general 3D medical image segmentation.

2605.30969 2026-06-01 cs.CV 版本更新

Omni-Supervised Motion Editing: Balancing Change and Invariance through Positive-Negative Learning

全监督运动编辑:通过正负学习平衡变化与不变性

Zhenwu Shi, Jingyu Gong, Peiwei Wang, Xingzan Wang, Tianwen Qian, Wenxi Li, Yuan Fang, Jiao Xie, Lizhuang Ma, Shaohui Lin

发表机构 * Shanghai Institute of Artificial Intelligence for Education, East China Normal University, China(上海人工智能教育研究院,华东师范大学,中国) School of Computer Science and Technology, East China Normal University, China(华东师范大学计算机科学与技术学院,中国) School of Statistics, East China Normal University, China(华东师范大学统计学院,中国) The 27th Research Institute of CETC, Zhengzhou, China(中国电子科技集团第27研究所,郑州,中国) Key Laboratory of Advanced Theory and Application in Statistics and Data Science, MOE, China(教育部统计与数据科学先进理论与应用重点实验室,中国) Shanghai Key Laboratory of Computer Software Evaluating and Testing, China(上海计算机软件评测测试重点实验室,中国) School of Computer Science, Shanghai Jiao Tong University, China(上海交通大学计算机科学学院,中国)

AI总结 提出OmniME框架,通过正负学习结合回顾特征监督、运动保持机制和三元组语义对齐,平衡运动编辑中的变化与不变性,在MotionFix和STANCE Adjustment数据集上达到最优性能。

详情
AI中文摘要

基于文本的人体运动编辑旨在根据自然语言指令修改现有运动序列,同时保持原始运动的一致性。现有的基于扩散的方法通常依赖启发式相似性线索或粗糙的全局条件,导致运动失真和次优的语义对齐。关键挑战在于平衡变化(即精确编辑目标区域)和不变性(即保留未编辑部分)。为应对这一挑战,我们提出了一个全监督正负学习框架,名为OmniME。我们的方法集成了三个互补组件:(1)回顾特征监督,在Transformer层之间强制执行从粗到细的一致性;(2)运动保持机制,根据源-目标相似性关注细微变化;(3)基于三元组的语义对齐,增强文本-运动对应关系。这些组件共同形成了一个统一的监督范式,平衡变化与不变性。在MotionFix和STANCE Adjustment数据集上的大量实验表明,OmniME在编辑对齐方面达到了最先进的性能,验证了我们统一学习框架的有效性。我们的源代码和模型已发布在:https://github.com/rocket-ycyer/OmniME.git

英文摘要

Text-based human motion editing aims to modify existing motion sequences according to natural language instructions while maintaining the consistency of the original motion. Existing diffusion-based approaches often rely on heuristic similarity cues or coarse global conditioning, leading to motion distortion and suboptimal semantic alignment. The key challenge lies in balancing change (i.e., precisely editing target regions) and invariance (i.e., preserving unedited parts). To address this challenge, we propose an Omni-Supervised Positive-Negative Learning framework, named OmniME. Our method integrates three complementary components: (1) retrospective feature supervision that enforces coarse-to-fine consistency across transformer layers, (2) a motion preservation mechanism that focuses on subtle variations according to the source-target similarity, and (3) triplet-based semantic alignment that strengthens text-motion correspondence. Together, these components form a unified supervision paradigm that balances change and invariance. Extensive experiments on the MotionFix and STANCE Adjustment datasets demonstrate that OmniME achieves state-of-the-art performance in editing alignment, validating the effectiveness of our unified learning framework. Our source code and models have been released at: https://github.com/rocket-ycyer/OmniME.git
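
Triplet-based alignment is commonly implemented with the standard margin triplet loss, sketched below. This is the generic formulation, not necessarily the exact loss used in OmniME: treating a text embedding as the anchor and matching and non-matching motion embeddings as positive and negative is an assumption, as are the Euclidean distance and the margin value.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """max(0, d(anchor, positive) - d(anchor, negative) + margin)."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

The loss is zero once the positive is closer to the anchor than the negative by at least `margin`, so only triplets that still violate the margin contribute gradient.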

2605.30968 2026-06-01 cs.CV cs.AI 版本更新

Variational Adapter for Cross-modal Similarity Representation

变分适配器用于跨模态相似性表示

WenZhang Wei, Zhipeng Gui, Dehua Peng, Tiandi Ye, Huayi Wu

发表机构 * School of Remote Sensing and Information Engineering(遥感与信息工程学院) Wuhan University(武汉大学) School of Data Science and Engineering(数据科学与工程学院) East China Normal University(华东师范大学) State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing(测绘遥感信息工程国家重点实验室)

AI总结 针对跨模态匹配中细粒度标注稀缺导致二元分类边界压缩和假负样本问题,提出变分适配器VACSR,将匹配任务重构为变分推断问题,通过构建潜在相似性空间和正则化缓解过拟合,在图像-文本检索、域泛化和基类到新类泛化任务上验证了有效性。

Comments Accepted by the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

视觉-语言模型的核心在于在统一表示空间中度量跨模态相似性。然而,大多数图像-文本匹配或多类图像分类数据集缺乏细粒度的跨模态匹配标注,迫使连续的相似性空间压缩为二元分类边界。这种压缩引入了假负样本,并严重损害了跨模态任务的泛化性能。尽管先前的研究试图通过建模模态内模糊性来缓解这一问题,但往往忽略了固有的标注缺陷,导致不确定性分配次优。为了解决这些挑战,我们提出了一种变分适配器用于跨模态相似性表示(VACSR)。该方法将具有细粒度语义稀缺性的图像-文本匹配重新表述为变分推断问题。它构建了一个跨模态相似性的潜在空间,并使用正则化技术来减轻对二元标注的过拟合。在图像-文本检索、域泛化和基类到新类泛化上的实验证明了所提出方法的有效性和鲁棒的泛化能力。

英文摘要

The core of vision-language models lies in measuring cross-modal similarity within a unified representation space. However, most image-text matching or multi-class image classification datasets lack fine-grained cross-modal matching annotations, forcing the continuous similarity space into binary classification boundaries. This compression induces false negative samples and significantly impairs the generalization performance of cross-modal tasks. While prior research has attempted to mitigate this by modeling intra-modal ambiguity, it often overlooks inherent annotation flaws, leading to suboptimal uncertainty allocation. To address these challenges, we propose a Variational Adapter for Cross-modal Similarity Representation (VACSR). This approach reformulates image-text matching with fine-grained semantic scarcity as a variational inference problem. It constructs a latent space for cross-modal similarity and uses regularization techniques to mitigate overfitting to binary annotations. Experiments on image-text retrieval, domain generalization, and base-to-novel generalization demonstrate the proposed method's effectiveness and robust generalization ability.

2605.30942 2026-06-01 cs.CV 版本更新

PRISM: Progressive Reasoning through Iterative Slot Memory for Vision

PRISM: 通过迭代槽记忆进行渐进推理的视觉架构

Ziyu Wang, Shuangpeng Han, Mengmi Zhang

发表机构 * Deep NeuroCognition Lab, Nanyang Technological University, Singapore(深度神经认知实验室,南洋理工大学,新加坡)

AI总结 提出PRISM架构,通过迭代槽记忆进行渐进推理,在图像分类、目标检测和语义分割等任务上取得竞争性能,并在遮挡等不完整观测下展现出更强的鲁棒性。

详情
AI中文摘要

现代视觉模型通过单次前馈传递处理图像,这限制了它们在观测不完整时恢复缺失证据或细化不确定表示的能力。受人类感知迭代性质的启发,我们引入了PRISM(通过迭代槽记忆进行渐进推理),这是一种通过迭代细化对图像进行推理的金字塔视觉架构。在高层次上,PRISM将视觉特征分组为以对象为中心的表示,从学习到的记忆中检索相关模式,并迭代细化表示以解决歧义和恢复缺失信息。这种组织-回忆-细化过程在多个尺度上循环运行,实现了视觉表示的渐进改进。在包括图像分类、目标检测和语义分割在内的标准视觉任务中,PRISM取得了竞争性能,同时在遮挡等不完整观测下展现出更强的鲁棒性。这些结果表明,使用结构化表示和记忆进行迭代推理是构建更具弹性和适应性的视觉模型的一个有前景的方向。源代码和模型将发布。

英文摘要

Modern vision models process images in a single feed-forward pass, which limits their ability to recover missing evidence or refine uncertain representations under incomplete observations. Inspired by the iterative nature of human perception, we introduce PRISM (Progressive Reasoning through Iterative Slot Memory), a pyramid vision architecture that reasons over images through iterative refinement. At a high level, PRISM groups visual features into object-centric representations, retrieves relevant patterns from a learned memory, and iteratively refines the representation to resolve ambiguity and recover missing information. This organize-recall-refine process operates recurrently across multiple scales, enabling progressive improvement of visual representations. Across standard vision tasks, including image classification, object detection, and semantic segmentation, PRISM achieves competitive performance while demonstrating improved robustness under incomplete observations such as occlusion. These results suggest that iterative reasoning with structured representations and memory is a promising direction for building more resilient and adaptive vision models. Source code and models will be released.

2605.30939 2026-06-01 cs.CV 版本更新

IAF-Net: Illumination-Adaptive Fusion for Low-Light Urban Road Segmentation

IAF-Net:用于低光照城市道路分割的照明自适应融合网络

Bingtao Wang, Daojie Peng, Fulong Ma, Jun Ma, Liang Zhang

发表机构 * Shandong University(山东大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出IAF-Net,通过照明自适应融合模块动态调整RGB与几何特征的融合权重,并利用亮度调制注意力解码器增强低光照特征选择,实现不同光照条件下鲁棒的道路分割。

详情
AI中文摘要

语义道路分割对于自动驾驶至关重要,但现有方法在低光照条件下性能严重下降。许多现有的多模态融合方法没有显式适应模态可靠性的光照依赖性变化,这可能在夜间将退化的RGB特征传播到融合表示中。我们提出IAF-Net(照明自适应融合网络),一种端到端框架,具有照明自适应融合功能,可在不同光照条件下实现鲁棒的道路分割。它通过核心的照明自适应融合(IAF)模块动态调整RGB和几何特征的融合权重,并使用亮度调制注意力解码器增强低光照特征选择。我们还构建了两个专用数据集:nuScenes夜间道路分割(nuScenes-NRS)和CARLA多天气道路分割(CARLA-MWRS)。在nuScenes-NRS上的实验显示,在比较方法中整体性能达到最先进水平,而CARLA-MWRS进一步验证了在恶劣天气条件下的鲁棒性。在40%训练子集上的消融研究进一步强调了IAF模块的重要性,该模块在MaxF中提供了最大的个体增益0.70%。

英文摘要

Semantic road segmentation is important for autonomous driving, but existing methods suffer severe performance degradation under low-light conditions. Many existing multi-modal fusion methods do not explicitly adapt to illumination-dependent changes in modality reliability, which can propagate degraded RGB features into the fused representation at night. We propose IAF-Net (Illumination-Adaptive Fusion Network), an end-to-end framework with illumination-adaptive fusion for robust road segmentation across different lighting conditions. It dynamically adjusts fusion weights of RGB and geometric features via the core Illumination-Adaptive Fusion (IAF) module, and enhances low-light feature selection with a brightness-modulated attention decoder. We also construct two dedicated datasets: nuScenes Nighttime Road Segmentation (nuScenes-NRS) and CARLA Multi-Weather Road Segmentation (CARLA-MWRS). Experiments on nuScenes-NRS show state-of-the-art overall performance among the compared methods, while CARLA-MWRS further validates robustness across adverse weather conditions. Ablation studies on a 40% training subset further highlight the importance of the IAF module, which provides the largest individual gain of 0.70% in MaxF.

2605.30925 2026-06-01 cs.CV cs.GR 版本更新

MultiAct: Text-to-Motion Generation from Composite Text via Tailored Attention Guidance

MultiAct: 通过定制注意力引导从复合文本生成动作

Nathan Sala, Ofir Abramovich, Ariel Shamir, Daniel Cohen-Or, Andreas Aristidou, Sigal Raab

发表机构 * Tel Aviv University(特拉维夫大学) Reichman University(雷赫曼大学) CYENS Centre of Excellence(CYENS卓越中心) University of Cyprus(塞浦路斯大学)

AI总结 提出MultiAct,一种无需重新训练或修改架构的推理时框架,通过自适应增强未充分表示提示组件的交叉注意力分数,解决复合文本到动作生成中语义覆盖不全的问题。

Comments Accepted to SIGGRAPH 2026 conference. Project page: https://natsala13.github.io/multiact.github.io

详情
AI中文摘要

近年来,文本到动作生成发展迅速,为动画和人机交互提供了富有表现力的界面。然而,当前模型在处理描述同时发生的多个动作的提示时仍然脆弱。模型常常优先考虑单个主导动作而忽略其余部分,导致动作不完整或模糊,而不是实现复合描述的所有组成部分。我们提出MultiAct,一种无需配对、推理时的组合文本到动作合成框架,可直接作用于预训练的动作生成器,无需重新训练或架构修改。我们的方法通过自适应增强与未充分表示提示组件相关的交叉注意力分数来对抗语义崩溃。我们注意到有效调制取决于提示特定的选择,例如要定位的令牌和层,并引入一个轻量级辅助决策方案,以确定最有效的注意力增强参数化。广泛的定量和定性评估表明,MultiAct在复合提示上持续优于现有基线,在保持动作真实感的同时实现了改进的语义覆盖。项目页面:https://natsala13.github.io/multiact.github.io。

英文摘要

Text-to-motion generation has progressed rapidly in recent years, offering an expressive interface for animation and human-computer interaction. However, current models remain brittle when handling prompts that describe multiple actions occurring at the same time. Rather than realizing all components of a composite description, models frequently prioritize a single dominant action and neglect the rest, leading to incomplete or ambiguous motion. We present MultiAct, an unpaired, inference-time framework for compositional text-to-motion synthesis that operates directly on pretrained motion generators without retraining or architectural modification. Our method counteracts semantic collapse by adaptively amplifying cross-attention scores associated with underrepresented prompt components. We note that effective modulation depends on prompt-specific choices, such as which tokens and layers to target, and introduce a lightweight auxiliary decision scheme that determines the most effective attention-strengthening parametrization. Extensive quantitative and qualitative evaluations demonstrate that MultiAct consistently outperforms existing baselines on composite prompts, achieving improved semantic coverage while preserving motion realism. Project page: https://natsala13.github.io/multiact.github.io.

2605.30917 2026-06-01 cs.IR cs.CV 版本更新

Inference-Free Multimodal Learned Sparse Retrieval for Production-Scale Visual Document Search

无推理多模态学习稀疏检索用于生产级视觉文档搜索

Gyu-Hwung Cho, Youngjune Lee, Kiyoon Jeong, Siyoung Lee, Sanggyu Han, Hervé Dejean, Stéphane Clinchant, Seung-won Hwang

发表机构 * NAVER Corp.(NAVER公司) Seoul National University(首尔国立大学) Naver Labs Europe(Naver欧洲实验室)

AI总结 提出V-SPLADE,一种无需推理的稀疏检索器,通过标题门控令牌监督解决视觉稀疏表示中的词汇基础问题,在视觉文档检索中达到稠密级效果。

Comments 12 pages, 5 figures, 12 tables, preprint

详情
AI中文摘要

随着arXiv论文和企业PDF等大规模视觉文档语料库的持续增长,视觉文档检索受到越来越多的关注;然而,目前仍缺乏一个可部署的系统,能够对视觉文档进行词汇索引,而无需在大规模下进行神经编码。现有方法要么使用基于VLM的稠密或多向量模型实现强大的检索质量,但需要在服务时进行神经查询编码;要么使用基于OCR或标题的BM25避免查询编码,但代价是耗时的文本提取或生成。为了填补这一缺失的服务机制,我们提出了V-SPLADE,一种用于视觉文档检索的无推理稀疏检索器。然而,这种无推理的多模态学习稀疏检索系统仍未得到充分探索,并且在高稀疏性下尚未显示出稠密级别的有效性。我们将这一限制归因于词汇基础问题:视觉稀疏表示通常无法捕捉文档图像中嵌入的词汇内容。为了解决这个问题,我们引入了标题门控令牌监督,这是一种仅在训练时使用的信号,利用VLM生成的标题作为词汇线索来激活检索相关的词汇维度。通过这种监督,V-SPLADE在六个视觉文档检索基准上的平均NDCG@5比同规模稠密基线提高了13.8个百分点,比基于OCR或标题的BM25基线提高了最多6.3个百分点。在1870万文档的语料库上,其R@5比同规模稠密基线提高了一倍以上,并通过分数融合进一步将竞争检索器的R@5提高了最多2.4个百分点。代码即将在https://github.com/naver/v-splade发布。

英文摘要

As large-scale visual-document corpora such as arXiv papers and enterprise PDFs continue to grow, visual-document retrieval has gained increasing attention; yet it still lacks a deployable system that lexically indexes visual documents to serve queries without neural encoding at scale. Existing methods either achieve strong retrieval quality with VLM-based dense or multi-vector models but require neural query encoding at serving time, or avoid query encoding with OCR- or caption-based BM25 at the cost of time-consuming text extraction or generation. To fill this missing serving regime, we present V-SPLADE, an inference-free sparse retriever for visual-document retrieval. However, such inference-free multimodal learned sparse retrieval systems remain underexplored and have not yet shown dense-level effectiveness under high sparsity. We attribute this limitation to a lexical grounding problem: visual sparse representations often fail to capture the lexical content embedded in document images. To address this problem, we introduce caption-gated token supervision, a training-only signal that uses VLM-generated captions as lexical cues to activate retrieval-relevant vocabulary dimensions. With this supervision, V-SPLADE improves average NDCG@5 across six visual-document retrieval benchmarks by +13.8pp over the same-scale dense baseline and by up to +6.3pp over OCR- or caption-based BM25 baselines. On an 18.7M-document corpus, it more than doubles R@5 over the same-scale dense baseline and further improves competing retrievers through score fusion by up to +2.4pp R@5. Code will be released soon at https://github.com/naver/v-splade.
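
The serving regime described above, where queries are matched lexically without a neural query encoder, can be sketched with a small inverted index. This is a minimal sketch under assumptions: document term weights are taken as given (in a learned sparse retriever they would come from a document-side model at indexing time), query terms get unit weight after whitespace tokenisation, and the names `build_index` and `search` are hypothetical rather than part of V-SPLADE.

```python
from collections import defaultdict


def build_index(doc_term_weights: dict[str, dict[str, float]]) -> dict[str, list[tuple[str, float]]]:
    """Inverted index: term -> list of (doc_id, weight) postings."""
    index = defaultdict(list)
    for doc_id, weights in doc_term_weights.items():
        for term, weight in weights.items():
            if weight > 0.0:
                index[term].append((doc_id, weight))
    return dict(index)


def search(index, query: str, k: int = 5) -> list[tuple[str, float]]:
    """Score documents by summing the stored weights of the query's terms."""
    scores = defaultdict(float)
    for term in set(query.lower().split()):  # tokenisation only, no neural encoder
        for doc_id, weight in index.get(term, []):
            scores[doc_id] += weight
    # Highest score first; ties broken by doc_id for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
```

All neural computation happens once per document at indexing time, so query latency is that of a posting-list lookup, which is the property that makes this regime attractive at production scale.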

2605.30912 2026-06-01 cs.CV cs.CL 版本更新

Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

关注证据:面向多模态RLVR的证据锚定空间注意力监督

Ruina Hu, Chen Wang, Lai Wei, Jionghao Bai, Bin Yu, Weiran Huang, Kai Wang, Yue Wang

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Zhongguancun Academy(中关村学院) Zhongguancun Institute of Artificial Intelligence(中关村人工智能研究院) Nankai University(南开大学) Shanghai Jiaotong University(上海交通大学) Zhejiang University(浙江大学)

AI总结 提出EASE方法,通过将标注证据区域转化为平滑视觉标记目标,在多模态强化学习训练中引导响应到图像的注意力,从而提升视觉语言模型在感知、幻觉、视觉数学和多模态推理基准上的性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)通过优化从最终答案中导出的结果奖励来改进视觉语言模型(VLM)。然而,这种仅基于结果的奖励并不能告诉模型哪些图像区域证明了答案的正确性。对于需要视觉定位的问题,这些奖励无法区分由相关视觉证据支持的响应与由语言先验捷径或幸运猜测产生的响应。我们引入了EASE(证据锚定空间注意力),它通过视觉证据过程监督增强了多模态RLVR。EASE将标注的证据区域转换为平滑的视觉标记目标,并在RL训练期间使用它来引导响应到图像的注意力,但仅限于高奖励轨迹。标注仅用作特权训练标签,而推理仅需要原始图像和问题。在Qwen2.5-VL-7B、Qwen3-VL-4B和Qwen3-VL-8B上,EASE在感知、幻觉、视觉数学和多模态推理基准上的平均得分比DAPO高出2.5到3.1分。诊断和消融实验表明,EASE更好地将视觉注意力与标注的证据区域对齐。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) improves vision-language models (VLMs) by optimizing outcome rewards derived from final answers. However, such outcome-only rewards do not tell the model which image regions justify an answer. For questions that require visual grounding, these rewards cannot distinguish responses supported by relevant visual evidence from those produced by language-prior shortcuts or lucky guesses. We introduce EASE (Evidence-Anchored Spatial Attention), which augments multimodal RLVR with visual-evidence process supervision. EASE converts annotated evidence regions into a smoothed visual-token target and uses it to guide response-to-image attention during RL training, but only on high-reward trajectories. The annotations are used solely as privileged training labels, while inference requires only the original image and question. Across Qwen2.5-VL-7B, Qwen3-VL-4B, and Qwen3-VL-8B, EASE raises average scores over DAPO by 2.5 to 3.1 points on perception, hallucination, visual math, and multimodal reasoning benchmarks. Diagnostics and ablations show that EASE better aligns visual attention with annotated evidence regions.
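
Converting an annotated evidence region into a smoothed target over visual tokens can be sketched as below. The abstract does not give the exact construction, so this is one plausible reading under assumptions: the image is split into a square grid of patches, patches overlapping the box share most of the probability mass, and a uniform smoothing term covers the rest. The function name and the `smoothing` parameter are hypothetical.

```python
def evidence_target(box, image_size, grid, smoothing=0.1):
    """Turn a pixel box (x0, y0, x1, y1) into a distribution over grid*grid patches.

    Patches overlapping the box share (1 - smoothing) of the mass; the remaining
    `smoothing` is spread uniformly over all patches.
    """
    x0, y0, x1, y1 = box
    width, height = image_size
    pw, ph = width / grid, height / grid
    inside = []
    for row in range(grid):
        for col in range(grid):
            overlaps = (col * pw < x1 and (col + 1) * pw > x0
                        and row * ph < y1 and (row + 1) * ph > y0)
            inside.append(1.0 if overlaps else 0.0)
    n_in, n_all = sum(inside), grid * grid
    if n_in == 0:
        # No patch overlaps the box: fall back to a uniform target.
        return [1.0 / n_all] * n_all
    return [(1.0 - smoothing) * v / n_in + smoothing / n_all for v in inside]
```

A target of this form could be compared with a model's response-to-image attention (for example through a divergence term) during training only, which is consistent with the abstract's statement that inference needs just the image and the question.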

2605.30911 2026-06-01 cs.CV cs.AI 版本更新

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

什么使LVLMs更少产生幻觉?揭示影响幻觉鲁棒性的架构因素

Yusheng He, Jizhe Zhou, Xia Du, Zheng Lin, Jun Luo, Jiancheng Lv

发表机构 * School of Computer Science, Engineering Research Center of Machine Learning and Industry Intelligence, Sichuan University(计算机科学学院,机器学习与产业智能工程研究中心,四川大学) School of Computer and Information Engineering, Xiamen University of Technology(计算机与信息工程学院,厦门理工大学) Department of Electrical and Computer Engineering, University of Hong Kong(电气与计算机工程系,香港大学) College of Computing and Data Science, Nanyang Technological University(计算与数据科学学院,南洋理工大学)

AI总结 本文通过将架构设计分解为语言基础、视觉表示和语义对齐三个维度,并引入CoSimUE基准,系统探索了架构因素对LVLMs幻觉鲁棒性的影响,发现模型参数扩展效果有限,而增强视觉编码器、语言基础和语义对齐能分别减少不同类型的幻觉。

详情
AI中文摘要

幻觉仍然是削弱大型视觉-语言模型(LVLMs)可靠性的关键挑战之一。但什么使LVLM更少产生幻觉?许多现有工作专注于改进模型的内部组件。我们认为幻觉从根本上源于模型架构的设计方式。为了研究这一点,我们将架构设计分解为三个维度:语言基础(LF)、视觉表示(VR)和语义对齐(SA),并将幻觉分为共现型、相似型和先前被忽视的不确定型。基于这一框架,我们提出了CoSimUE基准,通过受控文本扰动和随机扰动创建细粒度的幻觉场景,从而建立设计选择与幻觉行为之间的映射。在7个设计方面的实验表明:1)广泛强调的参数规模扩展对减少所有三类幻觉的影响有限;2)更大且训练更好的语言基础可以减少共现型幻觉;3)更强的视觉编码器和更高的分辨率减轻相似型错误;4)有效的对齐策略缓解不确定型幻觉。5)此外,跨维度分析显示,联合增强视觉保真度和对齐质量能带来最全面的改进。本研究首次系统性地将架构级设计与幻觉鲁棒性联系起来,为开发可靠且高效的LVLMs提供了实用指导。

英文摘要

Hallucination remains one of the key challenges undermining the reliability of Large Vision-Language Models (LVLMs). But what makes an LVLM hallucinate less? Many existing efforts focus on improving internal components of the model. We argue that hallucination fundamentally stems from how the model architecture is designed. To investigate this, we factor the architecture design into three dimensions: Linguistic Foundation (LF), Visual Representation (VR), and Semantic Alignment (SA), and categorize hallucinations into Co-occurrence, Similarity, and the previously overlooked Uncertainty types. Building on this formulation, we propose CoSimUE, a benchmark that creates fine-grained hallucination scenarios through controlled textual perturbations and random perturbations, enabling mapping between design choices and hallucination behaviors. Experiments across 7 design aspects show that: 1) the widely emphasized scaling of model parameters has only limited impact on reducing all three types of hallucinations; 2) larger and better-trained language foundations can reduce co-occurrence hallucinations; 3) stronger visual encoders and higher resolutions mitigate similarity errors; 4) effective alignment strategies alleviate uncertainty hallucinations; and 5) cross-dimensional analysis reveals that jointly enhancing visual fidelity and alignment quality yields the most comprehensive improvements. This study provides the first systematic exploration linking architecture-level design to hallucination robustness, offering practical guidance for developing reliable and efficient LVLMs.

2605.30904 2026-06-01 cs.CV 版本更新

MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging

MergeTok: 通过令牌合并实现统一连续和离散视觉令牌化

Luyuan Zhang, Siyuan Li, Zedong Wang, Qingsong Xie, Cheng Tan, Anna Wang, Yanhao Zhang, Chen Chen, Haonan Lu, Haoqian Wang

发表机构 * Tsinghua University(清华大学) Westlake University(西湖大学) Zhejiang University(浙江大学) Hong Kong University of Science and Technology(香港科技大学) OPPO Shanghai AI Lab(OPPO上海人工智能实验室)

AI总结 提出MergeTok统一令牌化器,通过令牌合并技术联合优化连续VAE和离散VQ令牌化器,实现高保真重建与语义可控离散表示的兼顾。

Comments 11 pages (main text), 7 figures. Preprint. Under review at NeurIPS 2026

详情
AI中文摘要

大多数用于图像生成的视觉令牌化器分为两类,各有互补的局限性:连续VAE提供高保真重建,但遭受密集、纠缠的潜在变量,不适合语义控制;而基于离散VQ的模型能够实现自回归生成,但面临梯度稀疏、训练不稳定和码本崩溃的问题。在这项工作中,我们引入了MergeTok,一个统一的令牌化器,在编码器-解码器架构中联合优化连续(VAE)和离散(VQ)令牌化器,利用令牌合并技术作为语义桥梁。通过在编码过程中聚类相似令牌,MergeTok建立了一个结构先验,提供双重监督信号:(i)在VAE分支中施加合并令牌的语义对齐,将其潜在空间正则化为解缠、语义感知的表示;(ii)推导出组级约束,促进组内多样性和组间排他性,从而稳定VQ训练。MergeTok在ImageNet-256上展示了具有竞争力的重建和生成性能,在匹配令牌预算下,其rFID远低于强VAE和VQ模型,同时产生语义组织的令牌表示,兼容自回归和扩散生成器。这表明单一架构可以赋予视觉令牌化器鲁棒的语义组织和生成器友好的离散性。

英文摘要

Most visual tokenizers for image generation are bifurcated into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer that jointly optimizes continuous (VAE) and discrete (VQ) tokenizers within an encoder-decoder architecture, leveraging token merging techniques as a semantic bridge. By clustering similar tokens during encoding, MergeTok establishes a structural prior that provides dual supervision signals: (i) it imposes merged-token semantic alignment in the VAE branch, regularizing its latent space toward disentangled, semantic-aware representations; (ii) it derives group-wise constraints, promoting intra-group diversity and inter-group exclusivity that stabilize VQ training. MergeTok shows competitive reconstruction and generation performance on ImageNet-256, with substantially lower rFID than strong VAE and VQ models under matched token budgets, while producing semantically organized token representations compatible with both autoregressive and diffusion generators. This shows that a single architecture can endow visual tokenizers with robust semantic organization and generator-friendly discreteness.

2605.30894 2026-06-01 cs.CV 版本更新

SteerFace: Debiasing Synthetic Face Generation via Adaptive Residue Perturbation

SteerFace: 通过自适应残差扰动消除合成人脸生成中的偏差

Yuxi Mi, Qiuyang Yuan, Jianqing Xu, Yichun Zhou, Xuan Zhao, Jun Wang, Rizen Guo, Shuigeng Zhou

发表机构 * Fudan University(复旦大学) Youtu Lab, Tencent(腾讯优图实验室) WeChat Pay Lab33, Tencent(腾讯微信支付实验室33)

AI总结 针对合成人脸数据与真实数据分布存在视觉倾向差异的问题,提出SteerFace框架,通过将身份嵌入向随机正交方向扰动作为正则化项,抑制生成器对非身份视觉线索的依赖,从而缩小合成-真实差距。

详情
AI中文摘要

人脸识别训练中合法合规数据的短缺引发了人们对使用合成数据作为替代方案的日益关注。虽然最近的扩散方法能够生成具有强身份一致性和数据多样性的逼真人脸图像,但其下游识别性能仍然存在显著的合成-真实差距。本文识别出视觉倾向(visual tendency)作为一个此前未被充分探索的限制因素,即合成数据表现出不切实际的视觉属性普遍性,从而偏离真实数据分布。视觉倾向可归因于生成器对身份嵌入的条件化,通过这种条件化,共现的残留视觉线索被无意中吸收到学习到的身份语义中。为了阻止生成器利用此类视觉线索,本文提出SteerFace,一个简单高效的训练框架,通过将身份嵌入向嵌入超球面上的随机正交方向引导来扰动身份嵌入。该扰动作为一种身份保持正则化项,惩罚生成器对非身份成分的依赖,理论分析支持了这一点。本文进一步引入一种自适应策略,学习具有样本级偏好和有利总体统计的扰动强度。大量实验表明,SteerFace有效缓解了视觉倾向,在下游人脸识别中优于先前方法,并且在不同训练数据集和生成流程中具有良好的泛化能力。

英文摘要

The shortage of legally compliant data for face recognition training has sparked growing interest in using synthetic data as an alternative. While recent diffusion-based methods enable the generation of photorealistic face images with strong identity adherence and data diversity, their downstream recognition performance still exhibits a significant synthetic-real gap. This paper identifies visual tendency as a previously underexplored limitation, whereby synthetic data exhibit an unrealistic prevalence of visual attributes and thus deviate from the real-data distribution. Visual tendency can be attributed to the generator's conditioning on identity embeddings, through which co-occurring residual visual cues are unintentionally absorbed into learned identity semantics. To discourage the generator from exploiting such visual cues, this paper proposes SteerFace, a simple and efficient training framework that perturbs identity embeddings by steering them toward random orthogonal directions on the embedding hypersphere. The perturbation serves as an identity-preserving regularizer that penalizes the generator's reliance on non-identity components, as supported by theoretical analysis. This paper further introduces an adaptive strategy that learns perturbation strengths with both sample-wise preference and favorable overall statistics. Extensive experiments show that SteerFace effectively mitigates visual tendency, outperforms prior methods in downstream face recognition, and generalizes well across different training datasets and generation pipelines.
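The perturbation described in the abstract, steering an identity embedding toward a random orthogonal direction on the hypersphere, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `steer_embedding`, the 512-dimensional embedding, and the use of a fixed rotation angle `strength` (the paper learns perturbation strengths adaptively) are all assumptions made for illustration.

```python
import numpy as np

def steer_embedding(identity, strength, rng):
    """Rotate a unit-norm embedding by `strength` radians toward a random orthogonal direction.

    Hypothetical helper for illustration; a fixed angle stands in for the
    learned, sample-wise perturbation strength described in the abstract.
    """
    identity = identity / np.linalg.norm(identity)
    noise = rng.standard_normal(identity.shape)
    # Remove the component along the identity so the direction is orthogonal to it.
    orthogonal = noise - np.dot(noise, identity) * identity
    orthogonal /= np.linalg.norm(orthogonal)
    # Moving along the great circle keeps the result on the unit hypersphere.
    return np.cos(strength) * identity + np.sin(strength) * orthogonal

rng = np.random.default_rng(0)
embedding = rng.standard_normal(512)  # assumed embedding size
embedding /= np.linalg.norm(embedding)
steered = steer_embedding(embedding, strength=0.3, rng=rng)
cosine = float(np.dot(embedding, steered))  # equals cos(strength) by construction
```

Because the steering direction is orthogonal to the original embedding, the cosine similarity to the original identity depends only on the angle, which is one plausible way to keep the perturbation identity-preserving while randomizing the remaining components.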

2605.30893 2026-06-01 cs.CV 版本更新

Foundation VAEs for 3D CT Reconstruction, Augmentation, and Generation

用于3D CT重建、增强和生成的基础VAE

Qi Chen, Shuhan Ding, Yu Gu, Nan Liu, Jiang Bian, Alan Yuille, Zongwei Zhou, Jingjing Fu

发表机构 * Department of Computer Science, Johns Hopkins University(约翰霍普金斯大学计算机科学系) Duke-NUS Medical School(杜克-国大医学院) Microsoft Research(微软研究院)

AI总结 本文发现,在自然图像上预训练的基础VAE可直接用于CT重建、增强和生成,无需训练或微调,通过冻结编解码器实现解剖结构保留和噪声抑制,并在分割和生成任务上取得显著提升。

Comments ICML 2026 Accepted

详情
AI中文摘要

变分自编码器(VAE)将高分辨率CT体积压缩为紧凑的潜在表示,同时保留临床相关结构。然而,从头训练或大量微调CT专用VAE会带来巨大的计算和工程成本,并且在异构扫描仪、协议和疾病下性能常会下降。本文通过一个关键观察向免训练的医学VAE迈出了渐进的一步:一个在自然图像和视频上大规模预训练的基础VAE可以作为CT重建、增强和生成的统一接口。在编码器和解码器均冻结的情况下,基础VAE重建CT体积时保留了解剖结构,同时抑制了采集噪声;在这些重建上训练分割模型,对于胰腺肿瘤和肺肿瘤,表面准确度平均提高了3.9% NSD。在相同的基础VAE潜在空间中,条件潜在扩散模型实现了平均FVD降低3.9%,CT CLIP分数提高36.2%,并在18种疾病的多疾病生成忠实度上提高了2.76% AUC。这些结果表明基础VAE可作为可扩展的CT表示重用和忠实CT生成的实用接口。我们的代码和演示可在 https://github.com/qic999/Foundation-VAE 获取。

英文摘要

Variational autoencoders (VAEs) compress high resolution CT volumes into compact latents while preserving clinically relevant structure. However, training CT-specific VAEs from scratch or heavily fine-tuning them incurs substantial computational and engineering cost, and often degrades under heterogeneous scanners, protocols, and diseases. This paper makes a progressive stride toward training-free medical VAEs by leveraging a critical observation: a single Foundation VAE, pretrained at scale on natural images and videos, can serve as a unified interface for CT Reconstruction, Augmentation, and Generation. With both encoder and decoder frozen, the Foundation VAE reconstructs CT volumes with preserved anatomy while suppressing acquisition noise; training segmentation models on these reconstructions improves surface accuracy by 3.9% NSD on average for pancreatic tumor and lung tumor. Within the same Foundation VAE latent space, a conditional latent diffusion model achieves 3.9% lower average FVD with 36.2% higher CT CLIP score, and improves multi-disease generation faithfulness across 18 types by 2.76% AUC. These results demonstrate Foundation VAEs as a practical interface for scalable CT representation reuse and faithful CT generation. Our code and demo are available at https://github.com/qic999/Foundation-VAE.

2605.30884 2026-06-01 cs.CV 版本更新

GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning

GUI-C$^2$:基于难度感知强化学习的由粗到细GUI定位

Junlong Li, Chao Hao, Lap-Pui Chau, Yi Wang

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出GUI-C$^2$框架,通过难度感知数据筛选和由粗到细的强化学习机制,解决GUI定位中训练样本难度不均和视觉区域裁剪权衡问题,实现最先进性能。

详情
AI中文摘要

现有的用于GUI定位的智能体强化学习方法在数据层面和策略层面存在局限性。在数据层面,当前方法通常平等对待所有训练样本,尽管它们对基线模型的训练价值随难度而变化。忽视这一点会大大降低训练效率甚至导致崩溃。在策略层面,现有框架难以平衡裁剪较大区域以获取足够上下文和较小区域以减少冗余之间的权衡,这是工具增强定位代理固有的张力。此外,过于复杂的决策对于小参数模型来说难以处理,并显著增加推理时间。为了解决这些问题,在数据层面,我们提出了GUI-D,一个数据挖掘和难度评分流程,通过适当的测试识别值得训练的样本,并分配难度分数以指导后续训练权重。在策略层面,我们提出了GUI-C$^2$,它采用区域门控的由粗到细细化机制,通过模型内部不确定性信号逐步缩小视野,自适应地为大目标保留上下文,同时增强对小目标的精度,并通过改进感知的阶段奖励进行强化,确保每次细化真正提升定位。同时,我们简化了决策过程,大大减少了额外的推理时间。最后,大量实验表明,我们的方法达到了最先进的性能。代码和数据将公开。

英文摘要

Existing agentic reinforcement learning methods for GUI grounding have limitations at two levels. At the data level, current approaches typically treat all training samples equally, although their training value to the baseline model varies with difficulty. Overlooking this can greatly reduce training efficiency or even cause collapse. At the strategy level, existing frameworks struggle to balance the trade-off between cropping larger regions for sufficient context and smaller ones for reduced redundancy, a tension inherent to tool-augmented grounding agents. In addition, overly complex decision-making is difficult for small-parameter models and significantly increases inference time. To address these issues, at the data level, we propose GUI-D, a data mining and difficulty scoring pipeline that identifies the training-worthy samples by proper testing and assigns difficulty scores to guide subsequent training weights. At the strategy level, we propose GUI-C$^2$, which employs an area-gated coarse-to-fine refinement mechanism that progressively narrows the visual field via model-internal uncertainty signals, adaptively reserving context for large targets while amplifying precision for small ones, reinforced by improvement-aware stage rewards that ensure each refinement genuinely advances grounding. Meanwhile, we simplify the decision-making process to greatly reduce additional inference time. Finally, extensive experiments show that our method achieves state-of-the-art performance. The code and data will be publicly available.

2605.30863 2026-06-01 cs.CV cs.GR 版本更新

DSD-GS: Dynamic-Static Decomposition of Gaussian Splatting for Efficient and High-Fidelity Dynamic Scene Reconstruction

DSD-GS: 面向高效高保真动态场景重建的高斯泼溅动态-静态分解

Youngtae Han, Sung-hwan Han, Youngmin Yi

发表机构 * Department of Artificial Intelligence Engineering, Sogang University(西江大学人工智能工程系)

AI总结 提出基于前馈高斯泼溅编码器和光流模型的动态-静态分解框架,通过消除静态区域冗余计算,在渲染质量、训练/渲染速度和存储效率上达到最优。

Comments 23 pages, 9 figures, 7 tables

详情
AI中文摘要

动态场景重建和新视角合成是虚拟现实、机器人、数字孪生等下一代视觉智能应用的基础。然而,从任意视角对复杂时变场景进行高保真重建仍是一个重大挑战。现有的动态3DGS方法由于将所有高斯体建模为动态组件,存在计算效率低下的问题。虽然近期基于分解的方法试图解决这一问题,但仍面临重建质量下降和训练时间延长的问题。为缓解这些局限,我们提出一种新颖的动态重建框架,基于高效的静态-动态分解策略,使用前馈高斯泼溅编码器和光流模型。通过消除静态区域的冗余计算,我们的方法实现了最先进的性能,在渲染质量、训练和渲染速度以及存储效率上均优于现有基线。值得注意的是,在Neural 3D数据集上,我们的框架仅需10分钟训练,并在单张NVIDIA RTX 5090 GPU上以1352x1014分辨率实现了超过700 FPS的渲染速度。此外,我们的分解策略消除了COLMAP预处理的需求,并实现了确定性初始化,从而提高了效率和可重复性。

英文摘要

Dynamic scene reconstruction and novel view synthesis are fundamental to next-generation visual intelligence applications such as virtual reality, robotics, and digital twins. However, high-fidelity reconstruction of complex, time-varying scenes from arbitrary viewpoints remains a significant challenge. Existing dynamic 3DGS methods suffer from computational inefficiency, since they model all Gaussians as dynamic components. While recent decomposition-based approaches address this issue, they still struggle with degraded reconstruction quality and prolonged training time. To mitigate these limitations, we propose a novel dynamic reconstruction framework built upon an efficient static-dynamic decomposition strategy using a Feed-Forward Gaussian Splatting encoder and an optical flow model. By eliminating redundant computations on static regions, our method achieves state-of-the-art performance, outperforming existing baselines across rendering quality, training and rendering speed, and storage efficiency. Notably, on the Neural 3D dataset, our framework requires only 10 minutes for training and achieves a rendering speed of over 700 FPS on a single NVIDIA RTX 5090 GPU at resolution of 1352x1014. Furthermore, our decomposition strategy eliminates the need for COLMAP preprocessing and enables deterministic initialization, thereby enhancing both efficiency and reproducibility.

2605.30846 2026-06-01 cs.CV 版本更新

Count Anything

Count Anything

Mengqi Lei, Shuokun Cheng, Wei Bao, Shaoyi Du, Jun-Hai Yong, Siqi Li, Yue Gao

发表机构 * Tsinghua University(清华大学) China University of Geosciences, Wuhan(中国地质大学(武汉)) State Key Laboratory of Human-Machine Hybrid Augmented Intelligence(人机混合增强智能国家重点实验室) National Engineering Research Center for Visual Information and Applications(视觉信息与应用国家工程研究中心) Institute of Artificial Intelligence and Robotics(人工智能与机器人研究院)

AI总结 提出跨域文本引导的目标计数模型Count Anything,通过双粒度实例枚举和互补计数融合,在统一基准CLOC上实现多域泛化。

详情
AI中文摘要

尽管通用视觉模型取得了快速进展,目标计数仍然分散在特定领域的数据集和任务公式中。现有的计数模型通常针对人群、车辆、细胞、农作物或遥感目标等场景定制,因此难以跨类别、视觉域、目标尺度和密度分布进行泛化。在本文中,我们研究了跨域的文本引导目标计数,其中模型以图像和自然语言查询为输入,并返回一组基于实例的目标点,其基数给出计数。这种公式将类别条件计数与可解释的空间定位统一起来。为了支持这一设置,我们构建了CLOC,一个跨域大规模目标计数数据集,将多样化的公共数据源重组为统一的基准。CLOC涵盖六个视觉域:通用场景、遥感、组织病理学、细胞显微镜、农业和微生物学,包含约22万张图像、619个类别和1500万个目标实例。基于CLOC,我们提出了Count Anything,一个用于文本引导目标计数的通用模型。与主导计数模型的密度图方法不同,Count Anything采用离散实例点并执行双粒度实例枚举。区域级稀疏计数器为大而稀疏的目标提供目标级锚点,而像素级密集计数器通过密集点预测处理小、拥挤和弱边界目标。点中心监督策略能够从异构标注中学习,互补计数融合以无参数方式结合两个计数器。大量实验表明,Count Anything实现了强准确性和多域泛化,优于现有的开放世界计数方法。代码可在:https://github.com/Mengqi-Lei/count-anything 获取。

英文摘要

Object counting remains fragmented across domain-specific datasets and task formulations, despite rapid progress in generalist vision models. Existing counting models are often tailored to scenarios such as crowds, vehicles, cells, crops, or remote-sensing objects, and thus struggle to generalize across categories, visual domains, object scales, and density distributions. In this paper, we study text-guided object counting across domains, where a model takes an image and a natural-language query as input and returns an instance-grounded set of target points whose cardinality gives the count. This formulation unifies category-conditioned counting with interpretable spatial localization. To support this setting, we construct CLOC, a Cross-domain Large-scale Object Counting dataset that reorganizes diverse public data sources into a unified benchmark. CLOC covers six visual domains: General Scene, Remote Sensing, Histopathology, Cellular Microscopy, Agriculture, and Microbiology, with about 220K images, 619 categories, and 15M object instances. Based on CLOC, we propose Count Anything, a generalist model for text-guided object counting. Unlike density-map-based methods, which dominate counting models, Count Anything adopts discrete instance points and performs dual-granularity instance enumeration. A Region-level Sparse Counter provides object-level anchors for large and sparse targets, while a Pixel-level Dense Counter handles small, crowded, and weakly bounded targets via dense point prediction. A point-centric supervision strategy enables learning from heterogeneous annotations, and Complementary Count Fusion combines both counters in a parameter-free manner. Extensive experiments show that Count Anything achieves strong accuracy and multi-domain generalization, outperforming existing open-world counting methods. Code is available at: https://github.com/Mengqi-Lei/count-anything.

2605.30829 2026-06-01 cs.CV 版本更新

LegSegNet: A Public Deep Learning System for Lower Extremity CT Tissue Segmentation and Quantification

LegSegNet:用于下肢CT组织分割与量化的公共深度学习系统

Yuwen Chen, Yaqian Chen, Roy Colglazier, Haoyu Dong, Hanxue Gu, Maciej A. Mazurowski, Kevin W. Southerland

发表机构 * Department of Electrical and Computer Engineering, Duke University(杜克大学电气与计算机工程系) Department of Biostatistics & Bioinformatics, Duke University(杜克大学生物统计与生物信息学系) Department of Radiology, Duke University(杜克大学放射学系) Department of Computer Science, Duke University(杜克大学计算机科学系) Department of Surgery, Duke University(杜克大学外科系)

AI总结 提出LegSegNet深度学习系统,实现下肢CT中骨骼、肌肉、皮下脂肪和肌间/肌内脂肪的自动分割与量化,在测试集上平均Dice达89.31,是首个公开的端到端系统。

Comments 9 pages

详情
AI中文摘要

下肢计算机断层扫描(CT)包含用于身体成分分析、肌少症评估和肌肉骨骼疾病监测的临床相关信息,但大规模提取这些测量需要精确的组织分割和自动化量化工作流程。现有的公共分割工具并非为全面的下肢CT分析而设计,特别是对于临床重要的肌间/肌内脂肪组织,而且大多数公共方法仅提供掩膜预测而非端到端量化系统。为解决这一问题,我们提出了LegSegNet,一个用于下肢CT组织分割和身体成分量化的深度学习系统。给定输入CT扫描,LegSegNet分割骨骼、骨骼肌、皮下脂肪组织和肌间/肌内脂肪组织。然后计算定量的组织测量用于下游分析。我们使用1,302张手动标注的CT切片开发了分割模型,并在900张保留测试切片上进行了评估,所有标注均由放射科医生审核。我们将LegSegNet与广泛的2D分割方法进行基准测试,包括基于CNN的模型、基于Transformer的模型和微调的基础模型,并进一步在外部公共CT数据集上评估其泛化能力。LegSegNet实现了最佳的整体分割性能,在保留测试集上的平均Dice得分为89.31。据我们所知,LegSegNet是首个公开可用的用于下肢CT组织分割和量化的端到端系统,为未来医学图像分析中的计算机视觉研究提供了实用的评估工具。代码和模型权重可在https://github.com/mazurowski-lab/LegSegNet获取。

英文摘要

Lower extremity computed tomography (CT) contains clinically relevant information for body composition analysis, sarcopenia assessment, and musculoskeletal disease monitoring, but extracting these measurements at scale requires accurate tissue segmentation and an automated quantification workflow. Existing public segmentation tools are not designed for comprehensive lower extremity CT analysis, particularly for clinically important inter/intramuscular adipose tissue, and most public methods only provide mask prediction rather than an end-to-end quantification system. To address this problem, we present LegSegNet, a deep learning system for lower extremity CT tissue segmentation and body composition quantification. Given an input CT scan, LegSegNet segments bone, skeletal muscle, subcutaneous adipose tissue, and inter/intramuscular adipose tissue. It then computes quantitative tissue measurements for downstream analysis. We developed the segmentation model using 1,302 manually annotated CT slices and evaluated it on 900 held-out test slices, with all annotations reviewed by radiologists. We benchmark LegSegNet against a broad set of 2D segmentation methods, including CNN-based models, transformer-based models, and finetuned foundation models, and further evaluate its generalization on an external public CT dataset. LegSegNet achieves the best overall segmentation performance, with an average Dice score of 89.31 on the held-out test set. To our knowledge, LegSegNet is the first publicly available end-to-end system for lower extremity CT tissue segmentation and quantification, providing a practical evaluation tool for future computer vision research in medical image analysis. The code and model weights are available at: https://github.com/mazurowski-lab/LegSegNet
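The Dice score used as the headline metric above can be computed per tissue label as in the sketch below. The label encoding (1 for muscle, 2 for fat) and the flat-list input are assumptions for illustration; the reported 89.31 is presumably an average on a 0 to 100 scale, whereas this sketch returns values between 0 and 1.

```python
def dice_score(predicted, reference, label):
    """Dice = 2|P ∩ R| / (|P| + |R|) for one tissue label over flat label lists."""
    pred_hits = [p == label for p in predicted]
    ref_hits = [r == label for r in reference]
    overlap = sum(p and r for p, r in zip(pred_hits, ref_hits))
    total = sum(pred_hits) + sum(ref_hits)
    # Convention (assumption): a label absent from both masks counts as a perfect match.
    return 1.0 if total == 0 else 2.0 * overlap / total

# Toy flattened masks; the label ids are hypothetical.
predicted = [0, 1, 1, 2, 2, 2]
reference = [0, 1, 2, 2, 2, 2]
per_label = {label: dice_score(predicted, reference, label) for label in (1, 2)}
mean_dice = sum(per_label.values()) / len(per_label)
```

Averaging per-label scores, rather than pooling all foreground pixels, keeps small structures such as inter/intramuscular fat from being dominated by larger tissues.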

2605.30794 2026-06-01 cs.CV cs.AI 版本更新

MechVQA: Benchmarking and Enhancing Multimodal LLMs on Comprehensive Mechanical Drawing Understanding

MechVQA:在综合机械图纸理解上基准测试与增强多模态大语言模型

Qian Kou, Xiaofeng Shi, Yulin Li, Xiaosong Qiu, Xinyang Wang, Hua Zhou, Cao Dongxing

发表机构 * Beijing Academy of Artificial Intelligence (BAAI), China(北京智源人工智能研究院) Institute of Information Engineering, Chinese Academy of Sciences, China(中国科学院信息工程研究所) Beijing University of Technology, China(北京工业大学)

AI总结 针对多模态大语言模型在机械工程图纸理解上的不足,提出首个综合机械图纸理解数据集MechVQA,并开发MechVL模型,通过多阶段训练显著提升性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在通用视觉问答(VQA)任务中取得了显著成就。然而,它们在机械工程图纸上仍然脆弱,因为高标注密度和弱领域知识,加上严格投影规则和几何约束下不可靠的空间关系推理,使得决定性线索容易被忽略,并经常导致错误答案。为弥补这一差距,我们引入了第一个综合机械图纸理解数据集MechVQA,通过半自动构建和质量控制流程创建。MechVQA包含3.3k张高密度图片和21K个问答对,涵盖三个能力级别(识别、推理和判断)的10个不同细粒度任务,为评估和改进MLLM在真实机械图纸上的理解提供了测试平台。在MechVQA基础上,我们通过多阶段训练范式开发了MechVL模型,构建了一个强大的领域专用基线。大量实验结果表明,MechVL在MechVQA总分上比最强的闭源基线高出7.57个百分点,显著增强了机械图纸理解能力,并为在机械设计和检测场景中部署MLLM提供了可复用的基础。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated significant achievements in general visual question answering (VQA) tasks. However, they remain brittle on mechanical engineering drawings, where high annotation density and weak domain knowledge, compounded by unreliable spatial relation reasoning under strict projection rules and geometric constraints, make decisive cues easy to miss and frequently lead to wrong answers. To bridge this gap, we introduce the first comprehensive mechanical drawing understanding dataset, MechVQA, created through a semi-automated construction and quality-control pipeline. MechVQA contains 3.3k high-density pictures with 21K question-answer pairs, spanning 10 different fine-grained tasks across three capability levels: Recognition, Reasoning, and Judging, providing a testbed to evaluate and improve MLLM understanding on real-world mechanical drawings. On top of MechVQA, we then develop the MechVL model through a multi-stage training paradigm, building a strong domain-specialized baseline. Extensive experimental results demonstrate that MechVL outperforms the strongest closed-source baseline by 7.57 percentage points on the MechVQA total score, significantly enhancing mechanical drawing understanding ability and providing a reusable foundation for deploying MLLMs in mechanical design and inspection scenarios.

2605.30784 2026-06-01 cs.CV 版本更新

Text-guided Feature Disentanglement for Cross-modal Gait Recognition

文本引导的跨模态步态识别特征解耦

Zhiyang Lu, Ming Cheng

发表机构 * Fujian Key Laboratory of Urban Intelligent Sensing and Computing, Xiamen University(福建城市智能感知与计算重点实验室,厦门大学) Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University(多媒体可信感知与高效计算重点实验室,中华人民共和国教育部,厦门大学)

AI总结 针对LiDAR与RGB相机之间的模态差异,提出TCFDNet网络,利用文本先验引导解耦模态共享特征,通过CLIP对齐、特征解耦和稳定性增强实现跨模态步态识别,在SUSTech1K和FreeGait数据集上达到最优性能。

Comments Accepted by CVPR 2026

详情
AI中文摘要

步态识别是一种基于行走模式识别个体的生物特征技术,在远距离、非侵入场景中具有优势。然而,现实场景通常涉及异构传感模态,如LiDAR和RGB相机,由于2D视频和3D点云序列之间存在显著的模态差距,LiDAR-相机跨模态步态识别(LCCGR)成为一项关键但具有挑战性的任务。为应对这一挑战,我们提出了TCFDNet,一种文本引导的跨模态特征解耦网络,该网络利用模态感知的文本先验作为语义锚点,指导学习解耦的模态共享表示。具体而言,我们使用大型语言模型构建步态模态文本字典(GMTD),以生成跨模态和视角的丰富步态语义描述。然后,基于CLIP的多粒度特征编码器将视觉和文本特征对齐到统一的视觉-语言空间中。此外,文本引导的特征解耦(TFD)模块选择topk匹配的文本描述来重建模态特定表示,并通过残差分解和正交性约束推导出模态共享特征。为缓解解耦共享特征的脆弱性,我们提出特征稳定性增强(FSE)模块,该模块建模空间和通道相关性以提高特征鲁棒性。此外,引入跨模态补丁交换策略以进一步提升泛化能力。在SUSTech1K和FreeGait数据集上的大量实验表明,TCFDNet取得了新的最优结果,并验证了所提模块的有效性。

英文摘要

Gait recognition is a biometric technique that identifies individuals based on their walking patterns, offering advantages in long-range, non-intrusive scenarios. However, real-world scenarios often involve heterogeneous sensing modalities such as LiDAR and RGB cameras, making LiDAR-Camera Cross-modal Gait Recognition (LCCGR) a critical yet challenging task due to the substantial modality gap between 2D videos and 3D point cloud sequences. To address this challenge, we propose TCFDNet, a Text-guided Cross-modal Feature Disentanglement Network, which leverages modality-aware textual priors as semantic anchors to guide the learning of disentangled modality-shared representations. Specifically, we construct a Gait Modality Text Dictionary (GMTD) using large language models to generate rich semantic descriptions of gait across modalities and viewpoints. A CLIP-based Multi-grained Feature Encoder then aligns visual and textual features within a unified vision-language space. Furthermore, the Text-guided Feature Disentanglement (TFD) module selects the top-k matched textual descriptions to reconstruct modality-specific representations and derive modality-shared features via residual decomposition and orthogonality constraints. To mitigate the fragility of the disentangled shared features, we propose a Feature Stability Enhancement (FSE) module, which models spatial and channel-wise correlations to improve feature robustness. In addition, a cross-modal patch exchange strategy is introduced to further improve generalization. Extensive experiments on the SUSTech1K and FreeGait datasets demonstrate that TCFDNet achieves new state-of-the-art results and validate the effectiveness of the proposed modules.

2605.30774 2026-06-01 cs.CV 版本更新

CameraNoise: Enabling Faithful Camera Control in Video Diffusion through Geometry-Flow-Guided Noise Warping

CameraNoise: 通过几何流引导的噪声扭曲实现视频扩散中的忠实相机控制

Haoyu Zhao, Jiaxi Gu, Haoran Chen, Qingping Zheng, Yeying Jin, Hongyi Yang, Junqi Cheng, Yuang Zhang, Zenghui Lu, Huan Yu, Jie Jiang, Peng Shu, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Tencent(腾讯) Xiamen University(厦门大学)

AI总结 提出CameraNoise方法,通过几何流引导的噪声扭曲将相机运动编码为时间一致的随机表示,实现视频扩散中忠实且几何一致的相机控制。

Comments 28 pages, 16 figures

详情
Journal ref
Proceedings of the Forty-third International Conference on Machine Learning (ICML), 2026
AI中文摘要

精确的相机姿态控制对于视频扩散至关重要,但保持几何一致性仍然是一个挑战。现有方法直接将数值相机参数注入扩散骨干网络,往往无法弥合抽象坐标与视觉内容之间的差距,导致结构失真。为解决这一问题,我们提出CameraNoise,一种流到噪声的扭曲方法,将相机运动编码为时间一致的随机表示。与传统的条件控制不同,CameraNoise将相机姿态直接嵌入噪声空间。这将在忠实保留轨迹动态的同时,将运动与场景外观解耦。具体来说,我们引入了一种新颖的几何引导重投影流和噪声扭曲算法,共同保持扩散的高斯先验,并确保在相机变换下噪声传播的一致性。通过将CameraNoise集成到扩散过程中,我们的框架能够生成稳定、高保真的视频。大量实验表明,我们的方法在视觉质量和轨迹忠实度方面均显著优于先前方法。项目页面和代码可在 https://gulucaptain.github.io/CameraNoise/ 获取。

英文摘要

Precise camera pose control is critical for video diffusion, yet maintaining geometric consistency remains a challenge. Existing methods that directly inject numerical camera parameters into the diffusion backbone often fail to bridge the gap between abstract coordinates and visual content, leading to structural distortions. To address this issue, we propose CameraNoise, a flow-to-noise warping method that encodes camera motion into a temporally coherent stochastic representation. Unlike conventional conditioning, CameraNoise embeds camera poses directly into the noise space. This decouples motion from scene appearance while faithfully preserving trajectory dynamics. Specifically, we introduce a novel Geometry-guided Reprojection Flow and a noise warping algorithm, which jointly preserve the Gaussian prior of diffusion and ensure consistent noise propagation under camera transformations. By integrating CameraNoise into the diffusion process, our framework delivers stable, high-fidelity videos. Extensive experiments demonstrate that our approach significantly outperforms prior methods in both visual quality and trajectory faithfulness. The project page and code are available at: https://gulucaptain.github.io/CameraNoise/.

2605.30769 2026-06-01 cs.CV cs.RO 版本更新

DisPlace: Discriminative Place Projections for Multi-Reference Visual Place Recognition

DisPlace: 面向多参考视觉地点识别的判别性地点投影

Dhyey Manish Rajani, Michael Milford, Tobias Fischer

发表机构 * QUT Centre for Robotics, School of Electrical Engineering and Robotics at the Queensland University of Technology(昆士兰理工大学机器人中心,电气工程与机器人学学院)

AI总结 提出DisPlace框架,通过广义特征值问题融合多参考描述符,最大化地点间可分性并抑制地点内变化,提升视觉地点识别在多变条件下的鲁棒性。

Comments Under review

详情
AI中文摘要

视觉地点识别(VPR)的一个关键挑战是在不同环境条件和视角下,将查询图像与参考地图进行匹配。虽然多次参考遍历提高了鲁棒性,但现有的融合策略要么统一聚合参考,要么依赖启发式选择,无法区分保持稳定地点身份的描述符变化与由变化条件或视角引起的变化。在本文中,我们提出DisPlace,一种多参考VPR框架,将多个参考描述符融合为单个紧凑且具有判别性的地点表示。DisPlace将描述符融合表述为一个广义特征值问题,该问题最大化地点间可分性,同时抑制跨参考的地点内变化,而不是保留整体描述符方差。与现有的多参考融合方法不同,DisPlace利用跨参考遍历的变化来识别哪些描述符维度的线性组合保留了地点身份,哪些捕捉了条件或视角特定的变化。我们在Oxford RobotCar、Nordland、Pittsburgh30k和Google Landmarks v2上,使用六种最先进的VPR描述符评估了DisPlace。在54种外观变化条件下,DisPlace在49种中优于七种多参考基线,在视角和非结构化设置下持续改进描述符级融合性能,并且在推理期间比所有比较的融合方法需要更少的存储空间。

英文摘要

A key challenge in Visual Place Recognition (VPR) is matching query images against reference maps captured under diverse environmental conditions and viewpoints. While multiple reference traversals improve robustness, existing fusion strategies either aggregate references uniformly or rely on heuristic selection, without distinguishing descriptor variations that preserve stable place identity from those caused by changing conditions or viewpoints. In this paper, we propose DisPlace, a multi-reference VPR framework that fuses multiple reference descriptors into a single compact and discriminative place representation. DisPlace formulates descriptor fusion as a generalized eigenvalue problem that maximizes between-place separability while suppressing within-place variation across references, rather than preserving overall descriptor variance. Unlike existing multi-reference fusion methods, DisPlace exploits variation across reference traversals to identify which linear combinations of descriptor dimensions preserve place identity and which capture condition- or viewpoint-specific variation. We evaluate DisPlace on Oxford RobotCar, Nordland, Pittsburgh30k, and Google Landmarks v2 across six state-of-the-art VPR descriptors. DisPlace outperforms seven multi-reference baselines in 49 out of 54 appearance-varying conditions, consistently improves descriptor-level fusion performance under viewpoint and unstructured settings, and requires less storage during inference than all compared fusion methods.
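The generalized eigenvalue formulation mentioned in the abstract resembles classical discriminant analysis, and a minimal sketch under that assumption is shown here. It is not the DisPlace implementation: the scatter-matrix definitions, the `ridge` regularizer, the function name `discriminative_projection`, and the toy data are all illustrative choices.

```python
import numpy as np

def discriminative_projection(descriptors, place_ids, n_components, ridge=1e-6):
    """Return a (D, n_components) projection from descriptors (N, D) and place ids (N,).

    Solves between @ w = lam * within @ w, keeping the directions with the
    largest between-place to within-place ratio (an LDA-style assumption).
    """
    descriptors = np.asarray(descriptors, dtype=float)
    place_ids = np.asarray(place_ids)
    dim = descriptors.shape[1]
    global_mean = descriptors.mean(axis=0)
    within = np.zeros((dim, dim))
    between = np.zeros((dim, dim))
    for place in np.unique(place_ids):
        group = descriptors[place_ids == place]
        place_mean = group.mean(axis=0)
        centered = group - place_mean
        within += centered.T @ centered  # variation across references of one place
        offset = (place_mean - global_mean)[:, None]
        between += len(group) * (offset @ offset.T)  # separation between places
    within += ridge * np.eye(dim)  # keeps the within-place scatter invertible
    # Whiten with within^{-1/2}, then solve an ordinary symmetric eigenproblem.
    evals, evecs = np.linalg.eigh(within)
    whitener = evecs @ np.diag(evals ** -0.5) @ evecs.T
    lam, vecs = np.linalg.eigh(whitener @ between @ whitener)
    order = np.argsort(lam)[::-1][:n_components]
    return whitener @ vecs[:, order]

# Toy data: place identity varies along axis 0, condition noise along axis 1.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
descriptors = np.concatenate(
    [c + rng.normal(0.0, [0.05, 5.0], size=(4, 2)) for c in centers]
)
place_ids = np.repeat(np.arange(3), 4)
projection = discriminative_projection(descriptors, place_ids, n_components=1)
```

In this toy setup a variance-preserving projection such as PCA would favor the noisy second axis, whereas the discriminative projection weights the first axis, which is the contrast the abstract draws with "preserving overall descriptor variance".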

2605.30750 2026-06-01 cs.CV 版本更新

SLAP: The Semantic Least Action Principle for Variational Video-Language Modeling

SLAP: 用于变分视频-语言建模的语义最小作用原理

Xiang Fang, Wanlong Fang

发表机构 * School of Software Engineering, Huazhong University of Science and Technology(华中科技大学软件学院) Nanyang Technological University, Singapore(新加坡南洋理工大学)

AI总结 提出语义最小作用原理(SLAP),将视频插值建模为黎曼流形上的边界值问题,通过离散欧拉-拉格朗日方程保持对象持久性,解决大视频语言模型中的时间间隙问题。

Comments Accepted by ICML 2026

详情
AI中文摘要

在大视频语言模型(LVLMs)时代,稀疏帧采样的计算需求造成了根本性的“时间间隙”,使模型对关键的因果转换视而不见。现有的依赖于生成幻觉(如潜在扩散)或自回归外推的解决方案往往难以在长时间跨度内保持语义一致性,遭受对象消失和能量不稳定的问题。我们提出从概率生成到变分力学的范式转变,即语义最小作用原理(SLAP)。通过在经典力学和语义动力学之间建立严格的同构关系,我们将潜在视频轨迹建模为由语义拉格朗日量控制的黎曼流形上的路径。通过将插值任务表述为通过离散欧拉-拉格朗日方程求解的边界值问题(BVP),SLAP自然地强制对象持久性,而无需像素级渲染。大量实验证明了我们提出的SLAP的有效性。

英文摘要

In the era of Large Video-Language Models (LVLMs), the computational necessity of sparse frame sampling creates a fundamental "temporal gap", rendering models blind to critical causal transitions. Existing solutions relying on generative hallucination (e.g., latent diffusion) or autoregressive extrapolation often fail to maintain semantic consistency over long horizons, suffering from object vanishing and energetic instability. We propose a paradigm shift from probabilistic generation to variational mechanics with the Semantic Least Action Principle (SLAP). Drawing a rigorous isomorphism between classical mechanics and semantic dynamics, we model the latent video trajectory as a path on a Riemannian manifold governed by a Semantic Lagrangian. By formulating the interpolation task as a Boundary Value Problem (BVP) solved via the discrete Euler-Lagrange equations, SLAP naturally enforces object persistence without pixel-level rendering. Extensive experiments show the effectiveness of our proposed SLAP.
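For readers unfamiliar with the discrete Euler-Lagrange equations named in the abstract, the standard textbook form is sketched below. The symbols (latent states $z_k$, step size $h$, discrete Lagrangian $L_d$, potential $V$) are assumptions for illustration, and the flat-space example does not reproduce the paper's Semantic Lagrangian or its Riemannian metric.

```latex
% Discrete action over latent states z_0, ..., z_N, with the endpoints
% z_0 and z_N fixed (for example, two sampled frames).
\[
S_d = \sum_{k=0}^{N-1} L_d(z_k, z_{k+1})
\]

% Stationarity with respect to each interior state z_k, k = 1, ..., N-1,
% where D_1 and D_2 denote derivatives in the first and second argument.
\[
D_2 L_d(z_{k-1}, z_k) + D_1 L_d(z_k, z_{k+1}) = 0
\]

% Illustrative flat-space choice (an assumption, not the paper's Lagrangian):
\[
L_d(z_k, z_{k+1}) = \frac{1}{2h} \lVert z_{k+1} - z_k \rVert^2 - h\, V(z_k)
\]

% For this choice the stationarity condition reduces to a discrete Newton law:
\[
\frac{z_{k+1} - 2 z_k + z_{k-1}}{h^2} = -\nabla V(z_k)
\]
```

With both endpoints fixed, these $N-1$ coupled equations form the boundary value problem: the intermediate states are solved for jointly rather than rolled forward one step at a time, which is the contrast with autoregressive extrapolation drawn in the abstract.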

2605.30745 2026-06-01 cs.CV 版本更新

Immuno-VLM: Immunizing Large Vision-Language Models via Generative Semantic Antibodies for Open-World Trustworthiness

Immuno-VLM:通过生成式语义抗体实现大型视觉-语言模型的开放世界可信赖性

Xiang Fang, Wanlong Fang, Wei Ji

发表机构 * School of Software Engineering, Huazhong University of Science and Technology(华中科技大学软件学院) Nanyang Technological University, Singapore(新加坡南洋理工大学) Nanjing University(南京大学)

AI总结 针对大型视觉-语言模型在开放世界部署中因缺乏负面知识而将未知异常高置信度误分类为已知类别的“语义傲慢”问题,提出受生物免疫负选择启发的Immuno-VLM框架,利用大语言模型的生成推理主动产生“语义抗体”(近分布异常文本描述)来约束已知类决策空间,在ImageNet-1K和四个OOD基准上达到新最优。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型视觉-语言模型通过将视觉特征与广泛语义概念对齐,在零样本识别中取得了前所未有的成功。然而,这种语义抽象在开放世界部署中造成了一个关键漏洞:“语义傲慢”,即由于缺乏显式的负面知识,模型会将未知异常高置信度地强行拟合到已知类别中。为了解决这个“开放世界可信赖性悖论”,我们提出了Immuno-VLM,一个受生物启发的框架,它将免疫负选择的生物学原理适应到高维潜在空间。与依赖被动密度估计或低效像素空间异常生成的传统开放集识别方法不同,Immuno-VLM利用大语言模型的生成推理能力主动“幻想”出“语义抗体”,即近分布异常(例如,相似物、上下文异常)的文本描述,这些描述有效地约束了已知类别的决策空间。在ImageNet-1K和四个具有挑战性的OOD基准上的大量实验表明,Immuno-VLM达到了新的最优水平。

英文摘要

Large Vision-Language Models have achieved unprecedented success in zero-shot recognition by aligning visual features with broad semantic concepts. However, this semantic abstraction creates a critical vulnerability in open-world deployment: the "Hubris of Semantics", where models force-fit unknown anomalies into known categories with high confidence due to the lack of explicit negative knowledge. To address this Open-World Trustworthiness Paradox, we propose Immuno-VLM, a bio-inspired framework that adapts the biological principle of Immunological Negative Selection to high-dimensional latent spaces. Departing from traditional Open-Set Recognition methods that rely on passive density estimation or inefficient pixel-space outlier generation, Immuno-VLM leverages the generative reasoning of Large Language Models to actively hallucinate "Semantic Antibodies", textual descriptions of near-distribution outliers (e.g., look-alikes, contextual anomalies) that effectively bound the decision space of known classes. Extensive experiments on ImageNet-1K and four challenging OOD benchmarks reveal that Immuno-VLM establishes a new state-of-the-art.

2605.30742 2026-06-01 cs.CV 版本更新

Annotations Are Not All You Need: A Cross-modal Knowledge Transfer Network for Unsupervised Temporal Sentence Grounding

标注并非全部所需:面向无监督时间语句定位的跨模态知识迁移网络

Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Yu Cheng, Keke Tang, Kai Zou

发表机构 * Hubei Key Laboratory of Distributed System Security(湖北分布式系统安全重点实验室) Hubei Engineering Research Center on Big Data Security(湖北大数据安全工程研究中心) School of Cyber Science and Engineering(网络安全学院) Huazhong University of Science and Technology(华中科技大学) Peking University(北京大学) Henan University(河南大学) The Chinese University of Hong Kong(香港中文大学) Guangzhou University(广州大学) Protagolabs Inc.(Protagolabs公司)

AI总结 提出跨模态知识迁移网络,通过从图像-名词和视频-动词任务中迁移实体感知和事件感知知识,实现无监督时间语句定位,无需配对视频-查询标注。

Comments Published in Findings of EMNLP 2023

详情
AI中文摘要

本文研究时间语句定位(TSG)任务。尽管许多优秀工作在该重要课题上取得了显著成就,但它们严重依赖于大量昂贵的视频-查询配对标注,这在现实应用中需要大量人力收集。为此,本文针对更实际但更具挑战性的TSG设置:无监督时间语句定位,其中网络训练期间既没有配对视频-查询标注,也没有片段边界标注。考虑到其他跨模态任务提供了许多易于获取且廉价的标签,我们倾向于收集并将其简单的跨模态对齐知识迁移到我们的复杂场景中:1)首先从配对的图像-名词任务中探索实体感知的对象引导外观知识,并将其适应到每个独立视频帧;2)然后从配对的视频-动词任务中提取事件感知的动作表示,并通过新提出的复制-粘贴方法进一步将动作表示精炼为更实际但复杂的现实案例;3)通过将外观和动作知识调制并迁移到我们具有挑战性的无监督任务中,我们的模型可以直接利用这些通用知识来关联视频和查询,并在无需训练的情况下准确检索相关片段。在两个具有挑战性的数据集(ActivityNet Captions和Charades-STA)上的大量实验证明了我们的有效性,优于现有无监督方法,甚至与有监督方法竞争。

英文摘要

This paper addresses the task of temporal sentence grounding (TSG). Although many respectable works have made decent achievements on this important topic, they rely heavily on massive, expensive video-query paired annotations, which require a tremendous amount of human effort to collect in real-world applications. To this end, we target a more practical but challenging TSG setting: unsupervised temporal sentence grounding, where both paired video-query and segment boundary annotations are unavailable during network training. Considering that some other cross-modal tasks provide many easily available and cheap labels, we collect and transfer their simple cross-modal alignment knowledge into our complex scenarios: 1) We first explore the entity-aware object-guided appearance knowledge from the paired Image-Noun task, and adapt it to each independent video frame; 2) Then, we extract the event-aware action representation from the paired Video-Verb task, and further refine the action representation for more practical but complicated real-world cases with a newly proposed copy-paste approach; 3) By modulating and transferring both appearance and action knowledge into our challenging unsupervised task, our model can directly utilize this general knowledge to correlate videos and queries, and accurately retrieve the relevant segment without training. Extensive experiments on two challenging datasets (ActivityNet Captions and Charades-STA) show the effectiveness of our method, which outperforms existing unsupervised methods and is even competitive with supervised ones.

2605.30734 2026-06-01 cs.LG cs.CV 版本更新

Beyond Accuracy: Evaluating Efficiency, Robustness and Explainability in Deep Learning for Malaria Diagnosis

超越准确率:评估深度学习在疟疾诊断中的效率、鲁棒性和可解释性

Olivier Kanamugire, Kerol Djoumessi

发表机构 * African Institute for Mathematical Sciences(非洲数学科学研究所) Hertie Institute for AI in Brain Health(脑健康人工智能研究所)

AI总结 本研究在NLM-Malaria数据集上基准测试四种深度学习模型,联合评估预测性能、鲁棒性和事后可解释性,发现轻量级模型在性能上与重型模型相当,但可解释性在图像损坏下脆弱。

Comments Under review

详情
AI中文摘要

疟疾仍然是撒哈拉以南非洲地区的主要死亡原因,该地区诊断基础设施匮乏,使得及时准确的诊断尤其具有挑战性。虽然深度学习为自动化疟疾筛查提供了一条有前景的途径,但临床采用受到计算成本和决策不透明性的阻碍。本研究在NLM-Malaria数据集上基准测试了四种涵盖广泛设计架构和模型容量的深度学习模型,联合评估了预测性能、鲁棒性和事后可解释性。我们发现,轻量级、高效设计的模型在预测性能上与更重的模型相当,Friedman检验确认无统计显著差异。基于CAM的XAI方法一致地定位诊断相关区域,而细粒度归因方法产生的解释针对性较弱,尤其是在使用更重的骨干网络时。在三种图像损坏下的鲁棒性评估进一步揭示,模型置信度下降速度快于准确率,为人工审核提供了实用信号。然而,没有一种XAI方法对损坏具有鲁棒性,即使在预测仍然准确的情况下,解释可靠性也会在临床实践中可能出现的噪声水平下降。这些发现支持在资源受限环境中部署轻量级架构用于疟疾诊断,同时强调事后解释的脆弱性,这是负责任临床部署的重要考虑因素。

英文摘要

Malaria remains a leading cause of mortality in sub-Saharan Africa, where scarce diagnostic infrastructure makes timely, accurate diagnosis particularly challenging. While deep learning offers a compelling path toward automated malaria screening, clinical adoption is hindered by computational cost and opacity in decision-making. This work benchmarks four deep learning models spanning a wide range of design architectures and model capacities on the NLM-Malaria dataset, jointly evaluating predictive performance, robustness, and post-hoc explainability. We find that lightweight, efficient-by-design models match their heavier counterparts in predictive performance, and the Friedman test confirms no statistically significant performance differences. CAM-based XAI methods consistently localize diagnostically relevant regions, while fine-grained attribution methods produce less targeted explanations, particularly with heavier backbones. Robustness evaluation under three types of image corruption further reveals that model confidence degrades faster than accuracy, providing a practical signal for human review. However, no XAI method is robust to corruption, with explanation reliability degrading at noise levels plausible in clinical practice, even when predictions remain accurate. These findings support the deployment of lightweight architectures for malaria diagnosis in resource-constrained settings, while highlighting the vulnerability of post-hoc explanations as an important consideration for responsible clinical deployment.

2605.30716 2026-06-01 cs.CV cs.AI 版本更新

Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation

用于病例级病理学概要报告生成的简单令牌高效视觉语言模型

Zhiyuan Yang, Jiahao Cheng, Vincent Quoc-Huy Trinh, Mahdi S. Hosseini

发表机构 * Department of Computer Science and Software Engineering (CSSE), Concordia University, Montreal, Canada(计算机科学与软件工程系(CSSE),康科迪亚大学,蒙特利尔,加拿大) Axe Cancer, Centre de recherche du CHUM, Université de Montréal, Montreal, Canada(Axe癌症,CHUM研究中心,蒙特利尔大学,蒙特利尔,加拿大) Institut de recherche en immunologie et cancérologie (IRIC), Université de Montréal(免疫学与癌症研究所(IRIC),蒙特利尔大学) Mila - Quebec AI Institute, Montreal, Canada(魁北克AI研究所(Mila),蒙特利尔,加拿大)

AI总结 提出一种简单令牌高效的视觉语言模型,通过5倍放大率的512×512补丁和两阶段监督训练,在有限GPU内存下实现病例级多WSI病理报告生成,显著降低序列长度并提升效率。

Comments Accepted by the DeLTA 2026 conference

详情
AI中文摘要

从全切片图像(WSI)生成临床有用的病理报告具有挑战性,原因在于十亿像素分辨率、长视觉令牌序列以及病例级推理的复杂性(单个病例可能包含多个具有异质性组织和模糊发现的WSI)。我们提出了一种简单的令牌高效视觉语言模型,用于病例级概要报告生成,在受限GPU内存下保持实用性。我们的架构遵循最小的三组件设计:冻结的病理补丁编码器、轻量级两层MLP视觉语言对齐器和大语言模型解码器,并带有显式的WSI标记令牌以分隔病例内的切片。训练分两个监督阶段进行:(1)仅对齐器的WSI字幕生成,使用异质WSI-文本对;(2)病例级监督微调,基于病例-报告对进行结构化报告生成。为了减少序列长度,我们使用5倍放大率下的$512 \times 512$补丁表示每个切片,与常用的20倍补丁相比,平均序列长度减少高达64倍。结合高效训练技术,我们仅用半块NVIDIA H100 GPU即可实现实际训练。在两个训练阶段中,我们的方法在ROUGE-L/METEOR/BLEU-4上取得了高分,同时在内存和运行时间上显著更高效。在基于AI的评估中,我们的模型始终优于强基线。大量消融实验表征了性能-效率权衡,并确定了在多WSI设置中提高鲁棒性的简单选择。总体而言,这项工作为高效病理报告生成提供了一个强大且可复现的基线,降低了在有限计算资源下进行多WSI VLM研究的门槛。

英文摘要

Generating clinically useful pathology reports from whole-slide images (WSIs) is challenging due to gigapixel resolution, long visual-token sequences, and the complexity of case-level reasoning, where a single case may contain multiple WSIs with heterogeneous tissues and ambiguous findings. We present a simple token-efficient vision-language model for case-level synoptic report generation that remains practical under constrained GPU memory. Our architecture follows a minimal three-component design: a frozen pathology patch encoder, a lightweight two-layer MLP vision-language aligner, and a large language model decoder, with an explicit WSI marker token to separate slides within a case. Training proceeds in two supervised stages: (1) aligner-only WSI captioning using heterogeneous WSI-text pairs, and (2) case-level supervised fine-tuning on case-report pairs for structured report generation. To reduce sequence length, we represent each slide using $512 \times 512$ patches at $5\times$ magnification, which reduces the average sequence length by up to $64\times$ compared to the commonly used $20\times$ patches. Combined with efficient training techniques, we enable practical training with only half an NVIDIA H100 GPU. Across both training stages, our approach achieves high ROUGE-L/METEOR/BLEU-4 scores while being substantially more efficient in memory and runtime. In AI-based evaluations, our model is consistently preferred over strong baselines. Extensive ablations characterize performance-efficiency trade-offs and identify simple choices that improve robustness in multi-WSI settings. Overall, this work provides a strong, reproducible baseline for efficient pathology report generation, lowering the barrier to multi-WSI VLM research under limited compute.
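
The factor of up to 64 quoted in this abstract can be checked with simple arithmetic. The abstract does not state the patch size of the 20x baseline; the sketch below assumes $256 \times 256$ patches at 20x and one visual token per patch, both of which are illustrative assumptions. The pixel area covering a fixed tissue region scales with the square of the magnification, and the patch count scales inversely with patch area:

$$\frac{N_{20\times}}{N_{5\times}} = \left(\frac{20}{5}\right)^{2} \cdot \left(\frac{512}{256}\right)^{2} = 16 \cdot 4 = 64$$

Under these assumptions the magnification change alone accounts for a 16x reduction, and the larger patch accounts for the remaining 4x.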

2605.30714 2026-06-01 cs.CV 版本更新

Vision-Based Localization in Dense Urban Environments: A Case Study of an Urban Village in China

基于视觉的密集城市环境定位:中国城中村案例研究

Menglin Wu, Rui Cao

发表机构 * Thrust of Urban Governance and Design, Society Hub, The Hong Kong University of Science and Technology (Guangzhou)(城市治理与设计学域,社会枢纽,香港科技大学(广州))

AI总结 针对密集城市环境中GPS信号不可靠和地图数据不完整的问题,提出一种基于双摄像头系统的低成本视觉地理定位方法,并在广州石牌村数据集上评估现有模型性能。

详情
AI中文摘要

城中村是快速城市化过程中出现的广泛非正规住区,现已成为中国大城市中农民工的主要居住中心。这些区域建筑密集,常导致GPS信号不可靠,而不完整的地图数据进一步影响精确路线规划和导航。这些问题不仅阻碍日常出行,还对应急响应构成重大挑战,因为混乱的道路布局和GPS不准确可能使疏散工作复杂化。为应对这些挑战,我们提出了一种针对密集城市环境的实用视觉地理定位解决方案。我们的方法采用低成本的数据采集流程,利用双摄像头系统(包括全景相机和智能手机相机)捕获同步的360度全景图和查询图像。以广州著名的密集城中村石牌村为案例,我们开发了专门的图像地理定位数据集。然后,我们评估并比较了现有模型在不同场景类型下的性能,以识别其优缺点。研究结果展示了基于视觉的定位在密集城中村环境中的潜力和局限性。我们的框架旨在改善GPS覆盖较差区域的步行导航、最后一公里配送和应急管理,最终支持这些非正规住区中的弱势群体。

英文摘要

Urban villages, the widespread informal settlements that have emerged as a result of rapid urbanization, are now major residential hubs for migrant workers in large cities in China. The dense arrangement of buildings in these areas often leads to unreliable GPS signals, while incomplete mapping data further impairs accurate route planning and navigation. These issues not only hinder everyday mobility but also pose significant challenges for emergency response, as confusing road layouts and GPS inaccuracies can complicate evacuation efforts. To address these challenges, we propose a practical vision-based geo-localization solution tailored for dense urban environments. Our approach features a low-cost data collection pipeline utilizing a dual-camera system, comprising a panoramic camera and a smartphone camera, to capture synchronized 360-degree panoramas and query images. Using Shipai Village, a well-known densely populated urban village in Guangzhou, as a case study, we develop a specialized image geo-localization dataset. We then assess and compare the performance of existing models across various scene types to identify their strengths and weaknesses. The findings demonstrate both the potential and limitations of vision-based localization in dense urban-village environments. Our framework aims to enhance pedestrian navigation, last-mile delivery, and emergency management in areas with poor GPS coverage, ultimately supporting the vulnerable populations living within these informal settlements.

2605.30713 2026-06-01 cs.LG cs.CV cs.MM 版本更新

Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models

多样性至关重要:重新审视视觉-语言模型中的测试时计算

Yijie Tong, Yifan Hou, Shaobo Cui, Antoine Bosselut, Mrinmaya Sachan

发表机构 * ETH Zürich(苏黎世联邦理工学院) Shanghai Jiao Tong University(上海交通大学) EPFL(洛桑联邦理工学院)

AI总结 针对视觉-语言模型(VLM)中测试时计算(TTC)策略应用不足的问题,提出基于预测熵的ETTC方法,通过利用模型间的置信度差异提升集成性能,理论证明并实验验证其优于多数投票和最佳单模型。

Comments ICML 2026

详情
AI中文摘要

测试时计算(TTC)策略已成为提升大型语言模型(LLM)推理能力的一种轻量级方法。然而,它们在视觉-语言模型(VLM)中的应用和益处尚未得到充分探索。我们对七个VLM和六个基准进行了TTC的系统研究,特别分析了基于特征的评分和多数投票方法。我们发现特征启发式方法失败,而投票在单模型设置中仅带来微小提升。我们从理论上证明,这种局限性源于缺乏预测多样性:当输出高度相关时,投票收益甚微。相比之下,多模型集成提供了更丰富的多样性,但标准的多数投票未能考虑不同模型的能力差异。为解决这一问题,我们提出了基于熵的TTC(ETTC),它根据预测熵选择最自信的预测。在单模型情况下,我们的方法退化为多数投票,但在模型集成中,它利用置信度差异优先考虑更强的模型。我们证明,在温和假设下ETTC优于多数投票,并通过实验表明它始终优于投票和最佳个体模型。关键在于,我们的结果表明,较小的模型可以协同增强较大的模型,释放出标准策略无法实现的集成增益。

英文摘要

Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models (LLMs). However, their application and benefits for vision-language models (VLMs) remain underexplored. We present a systematic study of TTC across seven VLMs and six benchmarks, specifically analyzing feature-based scoring and majority voting methods. We find that feature heuristics fail and voting yields only modest gains in single-model settings. We theoretically show that this limitation stems from a lack of prediction diversity: when outputs are highly correlated, voting provides little benefit. In contrast, multi-model ensembles offer richer diversity, yet standard majority voting fails to account for varying model capabilities. To address this, we propose Entropy-based TTC (ETTC), which selects the most confident prediction based on predictive entropy. Our method reduces to majority voting in the single-model case, but in model ensembles, it leverages confidence disparities to prioritize stronger models. We prove that ETTC outperforms majority voting under mild assumptions and empirically demonstrate that it consistently surpasses both voting and the best individual model. Crucially, our results show that smaller models can synergistically enhance larger ones, unlocking ensembling gains not achievable with standard strategies.
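
The selection rule described in this abstract can be illustrated with a minimal Python sketch. It assumes that each model's confidence is measured by the Shannon entropy of its own sampled answers and that the lowest-entropy model's majority answer is returned; the abstract does not give the exact formulation, so the function names, the model names, and the sample data are all hypothetical.

```python
import math
from collections import Counter


def answer_entropy(samples):
    """Shannon entropy (in nats) of the empirical distribution of sampled answers."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def majority_vote(samples):
    """Most frequent answer; ties go to the answer encountered first."""
    return Counter(samples).most_common(1)[0][0]


def entropy_select(samples_per_model):
    """Return the majority answer of the model whose samples have the lowest entropy.

    Assumption: this is one plausible reading of entropy-based selection,
    not the paper's exact procedure.
    """
    most_confident = min(
        samples_per_model,
        key=lambda name: answer_entropy(samples_per_model[name]),
    )
    return majority_vote(samples_per_model[most_confident])


# Hypothetical sampled answers from three models for one question.
samples = {
    "small_vlm": ["B", "B", "B", "B", "B"],
    "large_vlm": ["A", "A", "C", "A", "C"],
    "other_vlm": ["A", "A", "D", "C", "A"],
}
pooled = [answer for model_samples in samples.values() for answer in model_samples]

print("pooled majority vote:", majority_vote(pooled))
print("entropy-based choice:", entropy_select(samples))
```

With a single model the rule collapses to that model's majority vote, which matches the single-model behaviour stated in the abstract. With several models, a small but consistent model can override a pooled majority formed by less certain ones.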

2605.30700 2026-06-01 cs.CV cs.LG 版本更新

Mathematical Morphology in Machine Learning

机器学习中的数学形态学

Erick Oliveira Rodrigues, Aura Conci

发表机构 * Universidade Federal Fluminense(弗鲁米嫩塞联邦大学)

AI总结 将数学形态学引入机器学习,提出基于形态学重建的快速聚类算法和一种结合闵可夫斯基与切比雪夫距离的新型距离度量,并设计新型形态学分类器以建模形状、密度和分形信息。

详情
Journal ref SIBGRAPI 2018
AI中文摘要

本工作将数学形态学——一种成熟的视觉计算理论——引入机器学习,以利用标准技术常忽视的形状和密度方面。我们提出了一种基于形态学重建的快速聚类算法,该算法能精确保留聚类形状和密度。该方案具有独特特性:内在的最大聚类感知、无成本的噪声去除以及由结构元素控制的多样化增长模式。此外,我们提出了一种结合闵可夫斯基距离和切比雪夫距离的新型距离度量,对于形态学膨胀非常高效。在 $Z^2$ 离散邻域迭代中,它比曼哈顿距离快约1.3倍,比欧几里得距离快约329.5倍。当使用k近邻(k-NN)分类器在33个UCI数据集上与其他14种距离度量进行评估时,我们的度量在大多数情况下(33例中的26例)达到了高于平均的准确率,并在9个案例中取得了最佳整体准确率。最后,我们引入了新型形态学分类器。与现有文献不同,本方案独特地对数据集中的形状、密度和分形信息进行建模。

英文摘要

This work introduces mathematical morphology, an established visual computing theory, into machine learning to exploit shape and density aspects often overlooked by standard techniques. We propose a fast clustering algorithm based on morphological reconstruction that accurately preserves cluster shapes and density. This scheme offers unique features: an intrinsic sense of maximal clusters, cost-free noise removal, and diverse growth patterns controlled by structuring elements. Additionally, we propose a novel distance metric combining Minkowski and Chebyshev distances, highly efficient for morphological dilations. In $Z^2$ discrete neighbourhood iterations, it is roughly 1.3 times faster than Manhattan and 329.5 times faster than Euclidean distances. When evaluated using a k-Nearest Neighbours (k-NN) classifier across 33 UCI datasets against 14 other distances, our metric achieved above-average accuracies most frequently (26 of 33 cases) and the best overall accuracy in 9 cases. Finally, we introduce novel morphological classifiers. Unlike current literature, this proposal uniquely models shape, density, and fractal information in datasets.

2605.30699 2026-06-01 cs.LG cs.CV 版本更新

A Context-Aware Middleware for Medical Image Based Reports: An approach based on image feature extraction and association rules

基于医学图像报告的情境感知中间件:一种基于图像特征提取和关联规则的方法

Erick O. Rodrigues, Jose Viterbo, Aura Conci, Trueman Mac Henry

发表机构 * Department of Computer Science(计算机科学系) Department of Mathematics & Statistics(数学与统计学系) Universidade Federal Fluminense(弗鲁米嫩塞联邦大学) York University(约克大学)

AI总结 提出一种情境感知中间件,通过图像特征提取和关联规则,自动将医学图像分派给最合适的医疗人员,以提高医疗工作流程效率。

详情
Journal ref 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA)
AI中文摘要

本工作提出了一种用于医疗工作流程组织和效率提升的情境感知中间件。在医院、实验室和远程放射学公司中,每位医生或技术人员都专注于特定类型的诊断或分析。因此,某些类型的医学图像通常会被转发给特定的医生或特定群体。这种转发非常耗时。也就是说,反复决定谁是最合适的医生,以及他在特定情境下是否可用,既繁琐又可能非常低效。因此,所提出的中间件能够处理并收集每位医疗人员所分析图像的数据。基于收集的数据和当前临床情境,中间件能够推断出谁是最适合接收特定传入医学图像的人员。

英文摘要

This work proposes a context-aware middleware for medical workflow organization and efficiency improvement. In hospitals, laboratories and teleradiology companies, each physician or technician is specialized in a specific kind of diagnosis or analysis. Therefore, certain types of medical images are often forwarded to a certain physician or a certain group. This forwarding is time-consuming. That is, repeatedly deciding who would be the best physician, and whether they are available at a certain moment in a certain context, is exhausting and may be very inefficient. Thus, the proposed middleware has the ability to process and collect data from the images analyzed by each staff member. Based on the collected data and the current clinical context, the middleware is able to infer which staff member is best suited to receive a given incoming medical image.

2605.30698 2026-06-01 cs.CV cs.AI cs.MA 版本更新

Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

先见后议:用视觉证据对齐多智能体共识

Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma, Yuanzi Li, Zhengren Wang, Yinglong Yang, Yufei Chen, Yikang Wang, Shaoxu Sun, Wentao Zhang

发表机构 * Peking University(北京大学) Shanghai Jiao Tong University(上海交通大学) Nanyang Technological University(南洋理工大学) Renmin University of China(中国人民大学) Shandong University(山东大学)

AI总结 提出EAGLE框架,通过显式暴露各智能体的视觉证据区域并相互验证,实现无需训练的多智能体视觉问答协作,提升共识可靠性。

详情
AI中文摘要

视觉语言模型(VLM)在视觉问答(VQA)上取得了强劲性能。为了减轻个体幻觉和盲点,通过多智能体协作聚合不同视角已成为一种有前景的范式。虽然这种方法在文本问答中取得了巨大成功,但其在多模态领域的潜力仍未充分探索。现有的多智能体VQA方法主要采用以文本为中心的协议,专注于文本讨论而忽略视觉信息的对齐。在这项工作中,我们揭示了一个关键见解:答案级别的共识对于可靠的多智能体VQA是不够的;对齐的视觉证据(即智能体所依赖的图像区域的共享支持)对于可信的共识至关重要。为了利用这一见解,我们提出了EAGLE(Evidence-Aligned Grounded muLti-agent rEasoning),一个无需训练的以证据为中心的框架,用于协调多个VLM智能体。EAGLE显式暴露每个智能体的定位区域作为视觉证据,允许对证据进行相互验证,并使用证据一致性指导最终决策。在六个VQA基准上的实验表明,EAGLE在跨领域实现了最佳平均性能,同时保持轻量、可解释且易于部署。

英文摘要

Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; aligned visual evidence (shared support from the image regions agents rely on) is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (Evidence-Aligned Grounded muLti-agent rEasoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making. Experiments on six VQA benchmarks show that EAGLE achieves the best average performance across domains while remaining lightweight, interpretable, and practical for deployment.
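
The difference between answer-level agreement and evidence-level agreement can be made concrete with a small Python sketch. The abstract does not specify how EAGLE measures evidence consistency, so the scoring rule here (sum of pairwise box IoU among agents that give the same answer) and every name and number in it are hypothetical and for illustration only.

```python
from collections import Counter, defaultdict
from itertools import combinations


def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def evidence_consistent_answer(proposals):
    """Pick the answer whose supporters point at overlapping image regions.

    proposals: list of (answer, grounding_box) pairs, one per agent.
    Hypothetical rule: score each answer by the summed pairwise IoU of the
    boxes of agents that agree on it; fall back to a plain majority vote
    when no two agents agree.
    """
    score = defaultdict(float)
    for (ans_a, box_a), (ans_b, box_b) in combinations(proposals, 2):
        if ans_a == ans_b:
            score[ans_a] += iou(box_a, box_b)
    if not score:
        return Counter(ans for ans, _ in proposals).most_common(1)[0][0]
    return max(score, key=score.get)


# Three agents answer "red" but look at unrelated regions;
# two agents answer "blue" and ground it in nearly the same region.
proposals = [
    ("red", (0, 0, 10, 10)),
    ("red", (50, 50, 60, 60)),
    ("red", (100, 0, 110, 10)),
    ("blue", (20, 20, 40, 40)),
    ("blue", (22, 22, 42, 42)),
]
plain_majority = Counter(ans for ans, _ in proposals).most_common(1)[0][0]

print("plain majority:", plain_majority)
print("evidence-consistent:", evidence_consistent_answer(proposals))
```

In this toy case the plain vote follows the three agents whose evidence does not overlap, while the evidence-weighted rule follows the two agents whose grounding regions coincide.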

2605.30689 2026-06-01 cs.CV cs.AI 版本更新

ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization

ConTrans:学习文本增强的局部-全局时间表示用于零样本时间动作定位

Kanchan Keisham, Thenukan Pathmanathan, Thangarajah Akilan

发表机构 * Vellore Institute of Technology, India(韦洛尔理工学院,印度) Lakehead University, Canada(湖首大学,加拿大)

AI总结 针对零样本时间动作定位中忽略局部相关性和特征表示能力不足的问题,提出融合卷积归纳偏置与Transformer自注意力的多尺度编码器ConTrans,联合捕获细粒度局部依赖和长程全局上下文,在ActivityNet-1.3和THUMOS14上显著超越现有方法。

Comments 4 figures, 8 tables

详情
AI中文摘要

零样本时间动作定位(ZS-TAL)旨在检测和定位未修剪视频中未见过的动作。然而,现有方法主要关注建模长程上下文信息,常常忽略了视频帧之间基于相对偏移的关键局部相关性。此外,由于网络架构的浅层性,其特征表示能力受限,阻碍了性能提升。在本文中,我们通过引入一种新颖的局部-全局多尺度特征表示模块来解决这些局限性。我们提出了一种新颖的多尺度编码器架构,称为ConTrans,它将卷积(Conv)归纳偏置与Transformer自注意力相结合,以共同捕获细粒度的局部依赖和长程全局上下文,从而比现有方法获得更全面的特征表示。在ActivityNet-1.3和THUMOS14数据集上的实验评估表明,ConTrans显著优于现有方法,为ZS-TAL建立了新的基准。

英文摘要

Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing approaches primarily focus on modeling long-range contextual information, often neglecting the critical relative-offset-based local correlations between video frames. Furthermore, their performance is hindered by limited feature representation capabilities due to the shallow nature of their network architectures. In this paper, we address these limitations by introducing a local-global multi-scale feature representation module. We propose a novel multi-scale encoder architecture, termed ConTrans, that integrates convolutional (Conv) inductive biases with Transformer self-attention to jointly capture fine-grained local dependencies and long-range global context, leading to more comprehensive feature representations than existing methods. Experimental evaluations on the ActivityNet-1.3 and THUMOS14 datasets demonstrate that ConTrans significantly outperforms existing methods, establishing a new benchmark for ZS-TAL.

2605.30671 2026-06-01 cs.CV cs.RO 版本更新

WristCompass: Kinematic Coupling as a Learnable Visual Concept for Ego-Camera Orientation

WristCompass: 运动耦合作为可学习的视觉概念用于自我相机朝向估计

Varun Nair, Vidyut Baradwaj, Jiahang He, Anya Singh, Jai Relan, Cabrel Happi

AI总结 提出WristCompass,利用手腕与相机朝向之间的运动耦合作为视觉概念,通过紧凑的4D特征和GRU时序建模,从操作视频中恢复自我相机朝向,零样本迁移至厨房视频并达到与1B参数场景模型相近的性能。

详情
AI中文摘要

从操作视频中恢复自我相机朝向是从自我中心演示中分离手部运动与相机运动的前提,这是模仿学习的关键步骤。从场景几何推断朝向的常规方法在手部遮挡画面时失效:VGGT,一个1B参数的场景重建模型,在TACO基准测试上的表现甚至不如常数预测器。我们识别出一个替代的视觉概念,它恰好出现在场景几何缺失时:运动耦合动力学,即由手臂-肩-头链施加的手腕运动与相机朝向之间的结构化物理关系。我们发现这个概念是紧凑的(4D手腕间特征优于126D全手关键点)、时序的(需要短窗口上的GRU而非逐帧检索)和物理基础的(由于根植于解剖学而非场景外观,因此可零样本跨数据集迁移)。仅在桌面操作上训练的WristCompass,零样本迁移至Epic Kitchens烹饪视频,实现了14.3°的中位测地误差,并以200K GRU参数接近1B参数场景模型的性能。

英文摘要

Recovering ego-camera orientation from manipulation video is a prerequisite for disentangling hand motion from camera motion, a key step in imitation learning from egocentric demonstrations. The obvious approach, inferring orientation from scene geometry, fails when hands occlude the frame: VGGT, a 1B-parameter scene reconstruction model, scores worse than a constant predictor on the TACO benchmark. We identify an alternative visual concept that is present precisely when scene geometry is absent: kinematic coupling dynamics, the structured physical relationship between wrist motion and camera orientation imposed by the arm-shoulder-head chain. We find that this concept is compact (4D inter-wrist features outperform 126D full hand keypoints), temporal (requiring a GRU over short windows rather than per-frame retrieval), and physically grounded (transferring zero-shot across datasets because it is rooted in anatomy rather than scene appearance). Trained only on tabletop manipulation, WristCompass transfers zero-shot to Epic Kitchens cooking video, achieving 14.3$^\circ$ median geodesic error and approaching the performance of a 1B-parameter scene model at 200K GRU parameters.
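
The median geodesic error reported in this abstract is a standard rotation metric: the angle of the relative rotation between the predicted and the true orientation. A minimal Python sketch of that metric follows. Representing rotations as 3x3 nested lists and the helper `rot_z` are choices made here for illustration; the paper's evaluation code may differ.

```python
import math
from statistics import median


def geodesic_deg(rot_a, rot_b):
    """Angle in degrees of the relative rotation rot_a^T rot_b (3x3 nested lists)."""
    # trace(rot_a^T rot_b) equals the sum of element-wise products.
    trace = sum(rot_a[i][j] * rot_b[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def rot_z(angle_deg):
    """Rotation about the z axis, used only to build toy inputs."""
    t = math.radians(angle_deg)
    return [
        [math.cos(t), -math.sin(t), 0.0],
        [math.sin(t), math.cos(t), 0.0],
        [0.0, 0.0, 1.0],
    ]


# Toy predictions that are off by 5, 30 and 10 degrees about the z axis.
truth = rot_z(0.0)
errors = [geodesic_deg(truth, rot_z(offset)) for offset in (5.0, 30.0, 10.0)]
median_error = median(errors)
print("median geodesic error (deg):", median_error)
```

The median is less sensitive than the mean to a few badly wrong frames, which is one reason it is commonly reported for orientation estimates.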

2605.30639 2026-06-01 cs.CV cs.AI cs.RO 版本更新

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

PInVerify:面向主动实例验证的离线具身基准

Yuhang Jiang

发表机构 * University of Trento(特伦托大学)

AI总结 提出主动实例验证任务,构建离线具身基准PInVerify,通过多视角导航和细粒度属性匹配评估具身智能体,并基于多模态大语言模型建立基线。

Comments Accepted as a poster at the Foundation Models Meet Embodied Agents (FMEA) Workshop, CVPR 2026. 44 pages including appendix. Code: https://github.com/Avalon-S/PInVerify

详情
AI中文摘要

具身智能体在导航到目标物体方面取得了显著进展,但到达目标附近并不能保证智能体找到了正确的实例:微妙的属性差异(例如“白色花卉”与“白色条纹”)通常需要近距离、多视角检查。我们通过主动实例验证(AIV)来解决这一差距,该任务要求智能体主动围绕候选对象选择视角,以判断其是否匹配细粒度的自然语言描述。我们将AIV形式化为一个有限视野决策过程,并引入PInVerify,一个用于AIV的离线具身基准:包含18个物体类别的3000个评估场景,以多视角捕获形式提供,并采用6扇区导航拓扑,暴露陷阱视角(可导航但无信息)和不可达扇区。作为参考基线,我们构建了一个无需训练的流水线和一个基于开源多模态大语言模型(MLLMs)的LoRA微调端到端智能体(参数规模≤8B),包括属性分解、可见性加权多视角跟踪器和三种下一最佳视角(NBV)选择策略。在Qwen3-VL(4B/8B)、SenseNova-SI-1.2-InternVL3-8B、CLIP和SigLIP2上的评估中,最佳MLLM基线超过最佳嵌入基线4.9个百分点;GT框消融实验显示检测差距为+3.1个百分点;在测试的NBV策略中,我们未观察到主动视角选择带来的可靠增益。LoRA微调智能体(SFT+GSPO)达到85.6%。PInVerify旨在支持具身AI中主动、细粒度语义验证的进一步研究。代码:https://github.com/Avalon-S/PInVerify。

英文摘要

Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., "white floral" vs. "white striped") often require close-range, multi-view inspection. We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine-grained natural-language description. We formalize AIV as a finite-horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV: 3,000 evaluation episodes across 18 object categories, delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors. As reference baselines we build a training-free pipeline and a LoRA-fine-tuned end-to-end agent around open-source multimodal large language models (MLLMs) at on-device scale ($\leq$8B parameters), with attribute decomposition, a visibility-weighted multi-view tracker, and three next-best-view (NBV) strategies. In our evaluation across Qwen3-VL (4B/8B), SenseNova-SI-1.2-InternVL3-8B, CLIP, and SigLIP2, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 pp; GT-box ablations show a +3.1 pp detection gap; and we do not observe reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine-grained semantic verification in embodied AI. Code: https://github.com/Avalon-S/PInVerify.

2605.30631 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Controllable Lung Nodule Synthesis via Histogram-Regularized Latent Diffusion Models

基于直方图正则化潜扩散模型的可控肺结节合成

Arunkumar Kannan, Yanbo Zhang, Han Liu, Michael Baumgartner, Jianing Wang, Alexander Hertel, Bogdan Georgescu, Sasa Grbic

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Department of Radiology and Nuclear Medicine, University Medical Center Mannheim, Heidelberg University(放射学与核医学科,曼海姆大学医学中心,海德堡大学)

AI总结 提出一种直方图正则化潜扩散模型,通过结合亚型、空间掩码和HU直方图条件以及可微特征空间直方图正则化项,在3D CT体积中合成肺结节,以准确建模结节特异性强度分布,提高视觉真实感和亚型一致性。

详情
AI中文摘要

尽管自动诊断系统在基于CT的肺癌筛查中取得了显著成功,但其发展仍受限于多样化、带标注的肺结节数据集的稀缺性。基于扩散的生成模型为数据合成提供了一种有前景的策略;然而,许多现有的条件方法主要优化空间重建损失,这鼓励体素级相似性,但可能不足以约束病灶级强度分布。因此,这些方法可能产生过度平滑的纹理轮廓,并低估不同结节亚型(包括实性、部分实性和磨玻璃结节)的独特衰减特性。为解决这一挑战,我们提出了一种可控潜扩散模型,该模型在全3D CT体积内合成肺结节,同时准确建模结节特异性强度分布。具体而言,我们不只依赖空间损失,还引入了一个基于直方图的正则化项,在生成过程中约束体素强度分布。该模型结合了亚型、空间掩码和Hounsfield单位(HU)直方图条件以及可微特征空间直方图正则化项,以更好地对齐病灶级强度分布,提高合成结节的视觉真实感和亚型一致性。在肺部CT数据上的大量实验表明,我们的框架实现了强烈的视觉真实感,通过定量指标和视觉图灵测试验证。此外,当用于数据增强时,生成的结节提高了下游临床任务的性能,特别是对于代表性不足的结节亚型,并显示出对亚型知情恶性分类的潜在益处。

英文摘要

While automated diagnosis systems have achieved remarkable success in computed tomography (CT)-based lung cancer screening, their development remains limited by the scarcity of diverse, annotated pulmonary nodule datasets. Diffusion-based generative models offer a promising strategy for data synthesis; however, many existing conditional approaches primarily optimize spatial reconstruction losses, which encourage voxel-wise similarity but may inadequately constrain lesion-level intensity distributions. As a result, these methods may produce over-smoothed texture profiles and underrepresent the distinct attenuation characteristics of different nodule subtypes, including solid, part-solid, and ground-glass nodules. To address this challenge, we propose a controllable latent diffusion model that synthesizes pulmonary nodules within full 3D CT volumes while accurately modeling nodule-specific intensity distributions. Specifically, rather than relying solely on spatial losses, we introduce a histogram-based regularization term that constrains voxel intensity distributions during the generative process. The model combines subtype, spatial mask, and Hounsfield unit (HU) histogram conditioning with the differentiable feature-space histogram regularization term to better align lesion-level intensity distributions, improving the visual plausibility and subtype consistency of synthesized nodules. Extensive experiments on lung CT data demonstrate that our framework achieves strong visual realism, validated through both quantitative metrics and a visual Turing test. Furthermore, when used for data augmentation, the generated nodules improve performance in downstream clinical tasks, particularly for underrepresented nodule subtypes, and show a potential benefit for subtype-informed malignancy classification.
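
The idea of penalising a mismatch between intensity distributions, rather than only voxel-wise differences, can be sketched in plain Python. The abstract describes a differentiable feature-space regulariser without giving its form; the sketch below works directly on HU values, uses Gaussian soft binning (a common way to make a histogram smooth in its inputs), and compares histograms with an L1 distance. All three choices, and every name and number, are assumptions for illustration.

```python
import math


def soft_histogram(values, centers, bandwidth):
    """Normalised histogram where each value is spread over bins by a Gaussian kernel.

    Assumes every value lies within a few bandwidths of at least one bin
    centre, so the per-value weights never all underflow to zero.
    """
    hist = [0.0] * len(centers)
    for v in values:
        weights = [math.exp(-0.5 * ((v - c) / bandwidth) ** 2) for c in centers]
        total = sum(weights)
        for k, w in enumerate(weights):
            hist[k] += w / total
    return [h / len(values) for h in hist]


def histogram_penalty(generated_hu, target_hist, centers, bandwidth):
    """L1 distance between the soft histogram of generated voxels and a target."""
    generated_hist = soft_histogram(generated_hu, centers, bandwidth)
    return sum(abs(g - t) for g, t in zip(generated_hist, target_hist))


# Hypothetical bin centres from -800 HU to 200 HU in steps of 100 HU.
centers = list(range(-800, 201, 100))
bandwidth = 50.0

# Toy voxel samples: a ground-glass-like and a solid-like intensity profile.
ground_glass_hu = [-650.0, -600.0, -700.0, -620.0]
solid_hu = [20.0, 40.0, -10.0, 60.0]

target = soft_histogram(ground_glass_hu, centers, bandwidth)
penalty_match = histogram_penalty(ground_glass_hu, target, centers, bandwidth)
penalty_mismatch = histogram_penalty(solid_hu, target, centers, bandwidth)
print("matching profile penalty:", penalty_match)
print("mismatching profile penalty:", penalty_mismatch)
```

A nodule generated with solid-like intensities under a ground-glass target receives a large penalty even if its shape and location are correct, which is the failure mode a purely spatial loss may not constrain. In a training loop the same computation would be written with tensor operations so that gradients can flow through it.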

2605.30611 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

Crafter: 面向多样化输入的可编辑科学图表生成的多智能体框架

Haozhe Zhao, Shuzheng Si, Zhenhailong Wang, Zheng Wang, Liang Chen, Xiaotong Li, Zhixiang Liang, Maosong Sun, Minjia Zhang

发表机构 * University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Tsinghua University(清华大学) Peking University(北京大学)

AI总结 提出Crafter多智能体框架,通过结构化组合离散语义组件,实现跨图表类型和输入条件的可编辑科学图表生成,并引入CraftEditor将栅格输出转换为可编辑SVG,在CraftBench基准上显著优于现有方法。

Comments 24 pages, 11 figures

详情
AI中文摘要

科学图表是传达复杂研究思想最有效的手段之一,但生成出版质量的插图仍然是论文准备中最劳动密集的部分。现有的自动化系统各自针对单一图表类型,且仅接受文本输入,未能解决研究人员实际使用的多样类型和条件;此外,它们的栅格输出无法进行局部修改。由于科学图表是离散语义组件的结构化组合,生成器在这些布局上产生的局部错误需要的不是更强的骨干网络,而是一个框架。我们将这个框架实例化为两个互补系统:Crafter,一个用于图表生成的多智能体框架,无需架构更改即可泛化到多种图表类型和输入条件;以及CraftEditor,它应用相同的模式将栅格输出转换为可编辑的SVG。此外,我们引入了CraftBench,一个涵盖三种图表类型和四种输入条件的基准,并带有手工质量标注。实验表明,Crafter在PaperBanana-Bench和CraftBench上显著优于独立的生成器和智能体基线,消融实验确认了每个组件的独立贡献;CraftEditor忠实地将输出转换为可编辑的SVG,超越了所有基线。我们的代码和基准可在https://github.com/HaozheZhao/Crafter获取。

英文摘要

Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi-agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana-Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at https://github.com/HaozheZhao/Crafter.

2605.30587 2026-06-01 cs.CV 版本更新

ReGuLaR: Relation-Grounded Latent Reasoning for Large Vision-Language Models

ReGuLaR:面向大型视觉语言模型的基于关系的潜在推理

Zihu Wang, Karthik Somayaji N. S, Peng Li

发表机构 * University of California, Santa Barbara(加州大学圣巴巴拉分校)

AI总结 提出ReGuLaR框架,通过训练时的ReGFormer将潜在推理显式地锚定在视觉证据中的对象和关系上,在多种基准上取得最优性能。

详情
AI中文摘要

链式思维推理通过用自然语言表述中间推理步骤,显著提升了大视觉语言模型的推理能力。然而,这种离散的文本理由通常不足以编码连续的视觉证据。最近的工作通过将推理转移到连续潜在空间来解决这一限制。尽管取得了有希望的进展,现有方法仍使潜在推理与视觉证据的组合结构和关系结构联系不足。为填补这一空白,我们引入了ReGuLaR,一种基于关系的潜在推理框架,将潜在状态显式地锚定在这些关键但被忽视的视觉证据上。ReGuLaR在训练时使用ReGFormer使潜在推理聚焦于与问题相关的对象及对象间关系,而在推理时模型无需调用ReGFormer即可推理并生成答案。为支持ReGuLaR的训练,我们构建了RGROUNDING-351K,一个标注了关键对象边界框和对象间关系的真实世界视觉语言数据集。在多种基准上的广泛实验表明,ReGuLaR持续优于现有方法,并取得了最先进的性能。我们在投稿中包含了代码,并将在接收后公开发布代码和训练数据。

英文摘要

Chain-of-thought (CoT) reasoning has significantly improved the reasoning ability of large vision-language models (LVLMs) by verbalizing intermediate reasoning steps in natural language. However, such discrete textual rationales are often insufficient for encoding continuous visual evidence. Recent work addresses this limitation by moving reasoning into continuous latent space. Despite promising progress, existing methods leave latent reasoning insufficiently connected to the compositional and relational structure of visual evidence. To address this gap, we introduce ReGuLaR, a relation-grounded latent reasoning framework that explicitly grounds latent states in this critical yet overlooked visual evidence. ReGuLaR uses a training-time ReGFormer to focus latent reasoning on question-relevant objects and inter-object relations, while at inference time the model reasons and generates answers without invoking the ReGFormer. To support training ReGuLaR, we construct RGROUNDING-351K, a real-world vision-language dataset annotated with key object bounding boxes and inter-object relations. Extensive experiments across diverse benchmarks show that ReGuLaR consistently outperforms existing approaches and achieves state-of-the-art performance. We include our code in the submission and will release the code and training data publicly upon acceptance.

2605.30578 2026-06-01 cs.CR cs.CV 版本更新

AdvScene: Rethinking Adversarial Patch Evaluation Through Scene Robustness

AdvScene: 通过场景鲁棒性重新思考对抗补丁评估

Xiaoyong Yuan, Lan Zhang

发表机构 * Clemson University(克莱姆森大学)

AI总结 提出AdvScene框架,通过重建真实环境并引入对抗补丁到场景嵌入(APSE)方法,评估对抗补丁在视角、距离和场景条件变化下的场景鲁棒性,揭示现有评估未捕获的场景依赖性变化。

详情
AI中文摘要

对抗补丁是附着在真实物体上以误导AI视觉系统的物理图案。它们的现实世界风险并非由单次成功预测决定,而是取决于在部署后变化的视角、距离和场景条件下是否仍然有效。我们将这一特性称为场景鲁棒性,即部署的补丁在真实环境各种条件下的有效性。然而,现有评估并未很好地衡量场景鲁棒性:真实图像基准虽真实但固定,而模拟器虽可控但未基于特定真实场景。我们提出AdvScene,一个基于场景的框架,用于在重建的真实环境中测量对抗补丁的场景鲁棒性。AdvScene将评估重新定义为操作性测量:给定一个固定的部署补丁,它刻画补丁的操作包络——攻击成功的位置和条件——作为视角、距离和场景上下文的函数。一个关键挑战是攻击通常仅在单个锚定视图中定义,而评估需要一种在视角变化下保持保真度的表示。我们将此形式化为一个约束提升问题,并引入对抗补丁到场景嵌入(APSE),它在保留攻击关键外观并强制局部性、目标表面附着和跨视图一致性的同时,解决跨视图歧义。我们使用真实世界物理数据验证AdvScene,并对现有对抗补丁进行全面评估。结果表明,AdvScene揭示了攻击有效性的显著场景依赖性变化,而现有基于图像或模拟器的评估未能捕获这些变化。

英文摘要

Adversarial patches are physical patterns attached to real objects to mislead AI vision systems. Their real-world risk is not determined by a single successful prediction, but by whether they remain effective after deployment under changing viewpoints, distances, and scene conditions. We refer to this property as scene robustness, the effectiveness of a deployed patch across conditions in a real environment. Yet existing evaluations do not measure scene robustness well: real image benchmarks are realistic but fixed, while simulators are controllable but not grounded in a specific real scene. We present AdvScene, a scene-grounded framework for measuring the scene robustness of adversarial patches in reconstructed real environments. AdvScene reframes evaluation as operational measurement: given a fixed deployed patch, it characterizes the patch's operational envelope - where and when the attack succeeds - as a function of viewpoint, distance, and scene context. A key challenge is that the attack is typically defined only in a single anchor view, while evaluation requires a representation that remains faithful under viewpoint changes. We formalize this as a constrained lifting problem and introduce Adversarial Patch-to-Scene Embedding (APSE), which resolves cross-view ambiguity while preserving attack-critical appearance and enforcing locality, target-surface attachment, and cross-view consistency. We validate AdvScene using real-world physical data and conduct a comprehensive evaluation of existing adversarial patches. Our results show that AdvScene reveals substantial scene-dependent variation in attack effectiveness that is not captured by existing image-centric or simulator-based evaluations.
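The idea of an operational envelope can be illustrated with a minimal sketch: given per-trial outcomes of a fixed patch, bin them by viewpoint angle and distance and report the attack success rate per cell. The function name, bin sizes, and trial data below are hypothetical and are not taken from AdvScene; the sketch also omits scene context, which the abstract lists as a third axis.

```python
from collections import defaultdict

def operational_envelope(trials, angle_bin=30, dist_bin=2.0):
    """Aggregate trials into attack success rates per (angle, distance) cell.

    trials: iterable of (angle_deg, distance_m, success) tuples (hypothetical format).
    Returns a dict mapping (angle_bin_start, distance_bin_start) -> success rate.
    """
    counts = defaultdict(lambda: [0, 0])  # cell -> [successes, total]
    for angle_deg, distance_m, success in trials:
        cell = (int(angle_deg // angle_bin) * angle_bin,
                int(distance_m // dist_bin) * dist_bin)
        counts[cell][0] += int(success)
        counts[cell][1] += 1
    return {cell: s / n for cell, (s, n) in counts.items()}

# Illustrative trial data, not measurements from the paper.
trials = [
    (5, 1.0, True), (10, 1.5, True), (40, 1.0, True),
    (45, 1.2, False), (10, 3.0, False), (20, 3.5, False),
]
envelope = operational_envelope(trials)
```

Reading the result cell by cell shows where the patch keeps working, which a single image-level success rate would hide.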

2605.30561 2026-06-01 cs.CV cs.AI 版本更新

VLM3: Vision Language Models Are Native 3D Learners

VLM3:视觉语言模型是原生3D学习者

Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu, Vikas Chandra, Yangyang Shi

发表机构 * Meta(Meta公司) Princeton University(普林斯顿大学)

AI总结 本文提出VLM3,通过焦距统一、文本像素参考和数据混合缩放,使标准视觉语言模型无需复杂架构或损失函数即可高效掌握多种3D任务。

详情
AI中文摘要

视觉语言模型(VLM)通过提示使统一模型能够解决各种视觉任务,在语义理解方面表现出色。然而,3D理解仍然很大程度上依赖于具有复杂任务特定设计的专家视觉模型。本文要提出的关键论点是,VLM是原生的3D学习者。我们深入的大规模研究表明:1)焦距统一,2)基于文本的像素参考,以及3)数据混合和缩放,是有效3D学习所需的一切。模型架构变化、大模型、大量数据增强以及包括回归公式在内的复杂损失(其中许多构成了专家视觉模型的基础)实际上并不是必要条件。因此,我们提出了VLM3,一种具有最简单设计的可扩展方法,使标准VLM能够掌握多样的3D任务。VLM3不仅大幅提升了VLM深度估计的准确性(0.84 -> 0.9),还实现了多样的3D任务,如像素对应、相机姿态估计和物体级3D理解,在保持标准架构和基于文本的训练的同时,匹配了专家视觉模型的准确性。我们相信VLM3为简单且可扩展的3D学习开辟了新的范式。

英文摘要

Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. The key argument of this work is that VLMs are native 3D learners. Our in-depth, large-scale study shows that 1) focal length unification, 2) text-based pixel reference, and 3) data mixture and scaling are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are actually not necessary conditions. As a result, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only advances the VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation and object-level 3D understanding, matching expert vision model accuracy while maintaining standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.
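The abstract does not say how focal length unification is carried out. One common way to achieve it, shown here purely as an assumption, is to resize each image so that its focal length in pixels matches a canonical value, since resizing an image by a factor s also scales its pixel focal length by s. The function name and the canonical value of 1000 px are hypothetical.

```python
def unify_focal_length(width, height, focal_px, canonical_focal_px=1000.0):
    """Return the resized image dimensions and the scale factor that bring
    the pixel focal length to a canonical value (hypothetical default)."""
    scale = canonical_focal_px / focal_px
    return round(width * scale), round(height * scale), scale

# A 1920x1080 image with a 2000 px focal length is halved in size.
new_w, new_h, scale = unify_focal_length(1920, 1080, 2000.0)
```

After such a resize, an object at the same metric depth spans the same number of pixels regardless of the original camera, which removes one source of ambiguity when depth is predicted as text.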

2605.30557 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

看见不等于知道:视觉语言模型是否知道何时不回答空间问题(以及为什么)?

Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal

发表机构 * UNC Chapel Hill(北卡罗来纳大学教堂山分校) Google Research(谷歌研究)

AI总结 针对视觉语言模型在空间推理中过度自信回答的问题,提出SpatialUncertain框架,通过遮挡和视角歧义两种挑战,评估模型是否知道何时应弃权以及如何寻找可靠证据。

Comments Website: https://zhangyuejoslin.github.io/spatialuncertain/

详情
AI中文摘要

空间推理是部署在真实环境中的视觉语言模型(VLM)的基本能力。然而,视觉观察本质上是对3D世界的有限表示:遮挡可能使物体不可见,视角可能使几何属性产生误导。尽管如此,现有的空间推理基准通常假设观察是充分且可靠的,侧重于模型是否产生正确答案,而不是它们是否认识到问题无法回答以及需要哪些额外观察。在这项工作中,我们通过构建一个受控评估框架SpatialUncertain来挑战这一假设,并引入两种观察挑战:(1)遮挡,隐藏目标信息;(2)视角歧义,产生误导性视觉线索。对于每种配置,我们设计在清晰观察下可回答但在引入挑战下需要弃权的空间问题。我们进一步评估模型是否能识别哪些额外视角可以解决视角歧义。我们在多种前沿开源和闭源VLM上的结果揭示了两个一致的失败模式。首先,模型倾向于过度自信地回答,即使在视觉证据不完整或具有误导性时也试图解决空间推理任务,在遮挡下平均准确率约为30%,在视角歧义下低于10%。其次,即使有额外视角可用,一些模型在识别哪些视角能提供可靠证据方面表现接近随机。总之,我们的发现呼吁超越答案正确性,转向评估模型是否知道何时弃权以及如何寻找可靠证据。

英文摘要

Spatial reasoning is a fundamental capability for vision-language models (VLMs) deployed in real-world environments. However, visual observations are inherently limited representations of a 3D world: occlusion can render objects invisible, and perspective can make geometric properties misleading. Despite this, existing spatial reasoning benchmarks typically assume that observations are sufficient and reliable, focusing on whether models produce correct answers rather than whether they recognize when a question cannot be answered and what additional observations would be needed. In this work, we challenge this assumption by constructing a controlled evaluation framework, SpatialUncertain, and introducing two types of observation challenges: (1) occlusion, which hides target information, and (2) perspective ambiguity, which produces misleading visual cues. For each configuration, we design spatial questions that are answerable under clean observations but require abstention under the introduced challenges. We further evaluate whether models can identify which additional viewpoints would resolve perspective ambiguity. Our results across a diverse set of frontier open- and closed-source VLMs reveal two consistent failure modes. First, models are prone to overconfident answering, attempting to solve spatial reasoning tasks even when visual evidence is incomplete or misleading, with average accuracy around 30% under occlusion and below 10% under perspective ambiguity. Second, even when additional views are available, some models perform near random chance in identifying which would provide reliable evidence. Together, our findings call for moving beyond answer correctness toward evaluating whether models know when to abstain and how to seek reliable evidence.
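The evaluation logic described above (answer under clean observations, abstain under occlusion or perspective ambiguity) can be scored with a small routine. This is only a sketch under stated assumptions: the record format and metric names are hypothetical, and the benchmark's actual scoring may differ.

```python
def abstention_metrics(records):
    """Score a model on answerable and unanswerable spatial questions.

    records: list of (answerable, model_abstained, answer_correct) booleans
    (hypothetical format). For unanswerable questions the correct behavior
    is to abstain; for answerable ones it is to answer correctly.
    """
    unanswerable = [r for r in records if not r[0]]
    answerable = [r for r in records if r[0]]
    abstain_acc = sum(r[1] for r in unanswerable) / len(unanswerable)
    clean_acc = sum((not r[1]) and r[2] for r in answerable) / len(answerable)
    return {
        "abstain_acc": abstain_acc,               # correct abstentions
        "overconfident_rate": 1.0 - abstain_acc,  # answered when it should not
        "clean_acc": clean_acc,                   # correct answers on clean views
    }

# Illustrative records, not results from the paper.
records = [
    (False, True, False), (False, False, False),
    (False, False, False), (False, False, True),
    (True, False, True), (True, False, False),
]
metrics = abstention_metrics(records)
```

Reporting the overconfidence rate next to clean accuracy separates "cannot solve the task" from "does not know when the task is unsolvable".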

2605.30544 2026-06-01 cs.CV cs.CR 版本更新

On-Device Generative AI for GDPR-Compliant Visual Monitoring: Natural Language Alerts from Local Object Detection

面向GDPR合规的视觉监控的端侧生成式AI:来自本地目标检测的自然语言警报

Gudrun Schappacher-Tilp, Nicoletta Kaehling, Jan Kornberger, Egon Teiniker

发表机构 * Hailo-8L AI accelerator(Hailo-8L人工智能加速器) Raspberry Pi 5(树莓派5) Phi-3 Mini

AI总结 提出一种隐私设计管道,通过将推理完全限制在边缘设备上,结合YOLOv5n-seg目标检测和Phi-3 Mini语言模型,生成自然语言警报,实现GDPR合规的视觉监控。

Comments 6 pages, 4 figures, 3 tables, 1 listing

详情
AI中文摘要

依赖云端AI推理的视觉监控系统会将原始图像数据暴露给外部服务,这与《通用数据保护条例》(GDPR)的数据最小化原则产生根本冲突。本文提出了一种隐私设计的概念验证管道,通过将所有推理完全限制在边缘设备上来解决这一冲突。为Hailo-8L AI加速器编译的YOLOv5n-seg模型在Raspberry Pi 5上实现实时目标检测,推理后立即丢弃原始像素缓冲区。一个状态触发引擎将最小的JSON事件负载转发到本地托管的Phi-3 Mini(3.8B参数,Q4_0量化)实例,该实例为操作员合成一到两句的自然语言警报。任何图像数据都不会跨越网络边界;仅传输生成的文本警报。我们描述了完整的系统架构和实现,报告了目标硬件上的测量推理延迟和资源利用率,并展示了代表性的生成警报。结果表明,在单板计算机上结合专用神经网络加速器和端侧大型语言模型不仅是可行的,而且能产生实际可部署、人类可读的监控输出,同时通过设计符合GDPR第5(1)(c)条。

英文摘要

Visual monitoring systems that rely on cloud-based AI inference expose raw image data to external services, creating fundamental tensions with the data-minimisation principle of the General Data Protection Regulation (GDPR). This paper presents a proof-of-concept privacy-by-design pipeline that resolves this tension by confining all inference entirely to the edge device. A YOLOv5n-seg model compiled for a Hailo-8L AI accelerator delivers real-time object detection on a Raspberry Pi 5, from which raw pixel buffers are immediately discarded after inference. A stateful trigger engine forwards minimal JSON event payloads to a locally hosted instance of Phi-3 Mini (3.8B parameters, Q4_0 quantisation), which synthesises one-to-two sentence natural-language alerts for a human operator. No image data crosses the network boundary at any point; only the generated text alert is transmitted. We describe the full system architecture and implementation, report measured inference latency and resource utilisation on the target hardware, and present representative generated alerts. The results demonstrate that combining a dedicated neural-network accelerator with an on-device large language model on a single-board computer is not only feasible but produces practically deployable, human-readable monitoring output while aligning with GDPR Art. 5(1)(c) by design.
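A stateful trigger of the kind described can be sketched as follows: it keeps the last set of per-class object counts and emits a small JSON payload only when that state changes, so no pixel data ever enters the payload. The class name, confidence threshold, and payload fields are assumptions for illustration, not the paper's implementation.

```python
import json

class TriggerEngine:
    """Emit a minimal JSON event only when the detected object counts change."""

    def __init__(self, min_confidence=0.5):
        self.min_confidence = min_confidence  # hypothetical threshold
        self.last_state = {}

    def update(self, detections, timestamp):
        """detections: list of (label, confidence) pairs from the detector.
        Returns a JSON string for the language model, or None if nothing changed."""
        state = {}
        for label, conf in detections:
            if conf >= self.min_confidence:
                state[label] = state.get(label, 0) + 1
        if state == self.last_state:
            return None
        self.last_state = state
        # Only labels, counts, and a timestamp leave this function.
        return json.dumps({"timestamp": timestamp, "objects": state}, sort_keys=True)

engine = TriggerEngine()
first = engine.update([("person", 0.9)], 100)
repeat = engine.update([("person", 0.8), ("dog", 0.3)], 101)  # dog is below threshold
changed = engine.update([("person", 0.9), ("person", 0.7)], 102)
```

The returned string would then be placed into the prompt of the locally hosted language model, which turns it into a short alert.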

2605.30519 2026-06-01 cs.CV 版本更新

OmniMem: Scalable and Adaptive Memory Retrieval for Long Video Generation

OmniMem: 用于长视频生成的可扩展自适应记忆检索

Lin Zhao, Yushu Wu, Yifan Gong, Yanzhi Wang, Pu Zhao

发表机构 * Northeastern University(东北大学) Adobe Research

AI总结 提出OmniMem框架,通过自适应窗口排除和查询共享KV选择等机制,在自回归视频生成中实现显式全范围稀疏KV检索,显著提升长视频动态程度并保持一致性。

Comments 22 pages, 14 figures; project page: https://wuyushuwys.github.io/OmniMem/

详情
AI中文摘要

自回归(AR)视频生成通过顺序生成潜在块来扩展视频,但扩展到长视频需要重复访问不断增长的历史KV缓存。现有方法通过截断KV缓存或将其压缩为隐式记忆来降低这一成本,但两者都失去了对查询相关历史细节的显式访问。我们提出OmniMem,一个显式全范围记忆检索框架,对历史缓存执行稀疏KV检索。为了使其在基于块的自回归视频生成中实用,OmniMem解决了两个问题:(i)稀疏KV选择中的局部偏差和(ii)记忆访问中的联合爆炸。自适应窗口排除在存在足够长距离历史时从选择候选者中移除局部窗口块,为信息丰富的长距离检索保留稀疏预算。查询共享KV选择减少了跨查询的多样性,而每头分散KV访问避免了将特定于头的选择扩展为大的选定KV缓冲区。这使得每个注意力头可以根据自己的选择模式检索非连续的KV块。长视频生成实验表明,OmniMem在强基线上将动态程度提高了52.3%,并保持了强一致性,同时保持了可比较的内存使用量。

英文摘要

Autoregressive (AR) video generation extends videos by producing latent chunks sequentially, but scaling to long videos requires repeated access to a growing historical KV cache. Existing methods reduce this cost by truncating the KV cache or compressing it into implicit memory, but both lose explicit access to query-relevant historical details. We propose OmniMem, an explicit full-range memory retrieval framework that performs sparse KV retrieval over the historical cache. To make this practical for chunk-based AR video generation, OmniMem addresses two issues: (i) local bias in sparse KV selection and (ii) Union Explosion in memory access. Adaptive Window Exclusion removes local-window blocks from the selection candidates when sufficient long-range history is available, preserving the sparse budget for informative long-range retrieval. Query-Shared KV Selection reduces cross-query diversity, while Per-Head Scattered KV Access avoids expanding head-specific selections into a large selected KV buffer. This allows each attention head to retrieve non-contiguous KV blocks according to its own selection pattern. Experiments on long-video generation show that OmniMem improves Dynamic Degree by 52.3% over strong baselines while preserving strong consistency and maintaining comparable memory usage.
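Adaptive Window Exclusion can be illustrated with a top-k block selector that drops the most recent blocks from the candidate set once enough older history exists. This is a sketch under assumptions: the scoring input, the threshold for "sufficient" history, and the function name are hypothetical, and the sketch assumes the local window is attended to separately.

```python
def select_blocks(scores, local_window, budget, min_long_range=None):
    """Pick `budget` KV block indices by relevance score.

    scores: one relevance score per historical KV block, oldest first.
    local_window: number of most recent blocks treated as the local window.
    min_long_range: how many long-range blocks count as "sufficient"
    (hypothetical rule; defaults to the budget).
    """
    n = len(scores)
    local = set(range(max(0, n - local_window), n))
    long_range = [i for i in range(n) if i not in local]
    if min_long_range is None:
        min_long_range = budget
    # Exclude the local window only when enough long-range history exists.
    candidates = long_range if len(long_range) >= min_long_range else list(range(n))
    top = sorted(candidates, key=lambda i: scores[i], reverse=True)[:budget]
    return sorted(top)

# With enough history, the high-scoring recent blocks 4 and 5 are excluded.
long_history = select_blocks([0.1, 0.9, 0.3, 0.8, 0.95, 0.99], local_window=2, budget=2)
# With little history, selection falls back to all blocks.
short_history = select_blocks([0.2, 0.5, 0.9], local_window=2, budget=2)
```

Without the exclusion, the first call would spend its whole budget on blocks 4 and 5, which is the local bias the abstract refers to.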

2605.30512 2026-06-01 cs.AI cs.CV 版本更新

PhyDrawGen: Physically Grounded Diagram Generation from Natural Language

PhyDrawGen: 基于自然语言的物理约束图表生成

Nafiul Haque, Syed Nazmus Sakib, Shifat E Arman

发表机构 * Department of Robotics and Mechatronics Engineering, University of Dhaka(机器人与机电工程系,达卡大学)

AI总结 提出PhyDrawGen神经符号管道,通过场景图提取、确定性求解器和视觉验证循环,从自然语言生成符合物理定律的图表,在力学、光学和电磁学基准上显著优于现有模型。

Comments 9 figures, 7 tables. Under review at EMNLP 2026

详情
AI中文摘要

从文本生成物理图表需要严格遵守物理定律。虽然当前生成模型能产生视觉上合理的输出,但它们会系统性地产生力向量幻觉、忽略守恒定律并违反几何约束。我们提出PhyDrawGen,一种神经符号管道,将语义场景理解与物理约束满足解耦。首先,大语言模型从问题文本中提取类型化场景图。然后,确定性求解器将该图转换为平面直线图(PSLG),将力平衡、光路和场拓扑编码为精确几何基元。最后,微调的Qwen-VL模型实现视觉基础的提议-验证循环,以迭代纠正任何约束违反。在涵盖力学、光学和电磁学的1,449个问题基准上评估,PhyDrawGen显著优于GPT-5-image、Gemini 2.5 Flash和Gemini 3 Pro,即使在非常见物体问题上也展现出鲁棒的物理准确性。

英文摘要

Generating physics diagrams from text requires strict adherence to physical laws. While current generative models produce visually plausible outputs, they systematically hallucinate force vectors, ignore conservation laws, and violate geometric constraints. We present PhyDrawGen, a neuro-symbolic pipeline that decouples semantic scene understanding from physical constraint satisfaction. First, a large language model extracts a typed scene graph from the problem text. A deterministic solver then converts this graph into a Planar Straight-Line Graph (PSLG), encoding force balance, optical paths, and field topologies as exact geometric primitives. Finally, a fine-tuned Qwen-VL model implements a visually grounded propose-verify loop to iteratively correct any constraint violations. Evaluated on a benchmark of 1,449 problems spanning mechanics, optics, and electromagnetism, PhyDrawGen significantly outperforms GPT-5-image, Gemini 2.5 Flash, and Gemini 3 Pro, demonstrating robust physical accuracy even on unusual-object problems.
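One constraint a deterministic checker might enforce is static force balance: the vector sum of all forces on a body at rest should vanish. The sketch below shows such a check; the function names, the (magnitude, angle) encoding, and the tolerance are hypothetical and are not the paper's solver.

```python
import math

def net_force(forces):
    """forces: list of (magnitude, angle_deg) pairs; returns the (fx, fy) sum."""
    fx = sum(m * math.cos(math.radians(a)) for m, a in forces)
    fy = sum(m * math.sin(math.radians(a)) for m, a in forces)
    return fx, fy

def violates_equilibrium(forces, tol=1e-6):
    """True if the forces on a body claimed to be at rest do not balance."""
    fx, fy = net_force(forces)
    return math.hypot(fx, fy) > tol

# A 10 N weight balanced by a 10 N normal force passes the check.
balanced = violates_equilibrium([(10.0, 270.0), (10.0, 90.0)])
# A diagram that draws an 8 N normal force instead is flagged.
unbalanced = violates_equilibrium([(10.0, 270.0), (8.0, 90.0)])
```

In a propose-verify loop, a flagged diagram would be sent back for correction until the check passes or an iteration limit is reached.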

2605.30510 2026-06-01 cs.CV cs.AI 版本更新

A Novel Global Context-aware Deep Neural Network for Enhanced Brain Tumor Segmentation using Magnetic Resonance Images

一种新颖的全局上下文感知深度神经网络用于基于磁共振图像的增强脑肿瘤分割

Sourjya Mukherjee, Ananya Bhattacharjee, R. Murugan

发表机构 * National Institute of Technology Silchar(印度国立理工学院锡尔杰尔分校)

AI总结 提出全局上下文感知的挤压激励残差UNet(GCSER-UNet),融合空间和通道注意力,在TCGA LGG和BraTS 2020数据集上取得优于现有技术的Dice分数。

Comments 11 pages, 9 figures, 6 tables. Submitted to arXiv cs.CV

详情
AI中文摘要

脑癌的严重性需要精确的脑肿瘤分割,这对于有效的脑肿瘤诊断至关重要。手动识别成本高、劳动强度大且易出错,凸显了自动化方法的必要性。在本研究中,我们引入了全局上下文感知的挤压激励残差UNet(GCSER-UNet),它促进了空间和通道注意力的融合,从而增强了模型捕捉复杂空间依赖和上下文信息的能力。GCSER-UNet从多模态MRI切片中高效提取肿瘤区域,表现出卓越的性能。在基准数据库上的评估显示了其优越性,在TCGA LGG数据集上达到了94%的Dice分数,超过了当前最先进的91.8%。在BraTS 2020数据集上,所提出的GCSER-UNet集成方法在肿瘤区域——全肿瘤(W)、肿瘤核心(T)和增强肿瘤(E)上分别获得了95%、92%和90%的Dice分数,而当前最先进的Dice分数分别为94%、93%和88%。这些令人信服的结果突显了GCSER-UNet在精确脑肿瘤分割中的有效性,因此可以帮助神经科医生进行有效的脑癌管理和治疗规划。

英文摘要

Brain cancer's severity necessitates precise brain tumor segmentation, which is crucial for effective brain tumor diagnosis. Manual identification is costly, labor-intensive, and error-prone, which highlights the need for automated methods. In this study, we introduce the Global Context-aware Squeeze and Excite Residual UNet (GCSER-UNet), which fuses spatial and channel-wise attention and thus enhances the model's capacity to capture intricate spatial dependencies and contextual information. GCSER-UNet efficiently extracts tumor segments from multimodal MRI slices. Evaluations on benchmark databases show a 94 percent Dice score on the TCGA LGG dataset, surpassing the state-of-the-art Dice score of 91.8 percent. On the BraTS 2020 dataset, the proposed GCSER-UNet ensemble approach yielded Dice scores of 95 percent, 92 percent, and 90 percent for the tumor regions Whole Tumor (W), Tumor Core (T), and Enhancing Tumor (E), respectively. The current state-of-the-art Dice scores were 94 percent, 93 percent, and 88 percent. These outcomes highlight the efficacy of GCSER-UNet in precise brain tumor segmentation, which can aid neurologists in effective brain cancer management and treatment planning.
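For reference, the Dice score used above measures overlap between a predicted mask A and a ground-truth mask B as 2|A ∩ B| / (|A| + |B|). A minimal sketch over flattened binary masks follows; the function name and the smoothing term `eps` are conventional choices, not details from the paper.

```python
def dice_score(pred, target, eps=1e-7):
    """Dice coefficient for two flattened binary masks (lists of 0/1)."""
    intersection = sum(p and t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # eps avoids division by zero when both masks are empty.
    return (2 * intersection + eps) / (total + eps)

pred = [1, 1, 0, 0, 1]
target = [1, 0, 0, 1, 1]
score = dice_score(pred, target)  # 2 * 2 / (3 + 3), about 0.667
```

A score of 1 means the masks coincide and 0 means they do not overlap, so the reported 94 percent corresponds to a Dice coefficient of 0.94.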

2605.30506 2026-06-01 cs.RO cs.CV 版本更新

VLM-GLoc: Vision-Language Model Enhanced Monte Carlo Localization for Robust Semantic Global Localization in Cluttered Quasi-Static Environments

VLM-GLoc:视觉语言模型增强的蒙特卡洛定位,用于杂乱准静态环境中的鲁棒语义全局定位

Shivendra Agrawal, Bradley Hayes

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 提出VLM-GLoc方法,利用开放词汇视觉语言模型作为统一语义观测前端,通过逆语义提议机制和文本到地图检索,在几何模糊和语义歧义的准静态环境中实现鲁棒全局定位。

详情
AI中文摘要

在几何模糊的准静态环境(如杂货店、办公室、学校和医院)中,全局定位对移动机器人构成重大挑战。具有平行过道和长尾产品分布的杂货店,以及具有重复家具(如椅子、桌子、显示器和门)的办公室和实验室,是常见的室内环境,存在几何甚至语义歧义。传统方法要么依赖独特的几何特征,要么依赖特定领域的视觉管道,这些方法难以处理长尾语义分布和瞬态视觉杂乱。我们提出VLM-GLoc,一种分层语义蒙特卡洛定位(MCL)方法,利用开放词汇视觉语言模型(VLM)作为统一语义观测前端。我们假设VLM具有三重优势:(1)提取高度判别性的丰富文本特征,(2)对模糊或动态对象进行隐式质量过滤,(3)针对数据增强的持久性推理。我们引入一种逆语义提议机制,通过文本到地图检索播种粒子。在两个具有不同特征的真实世界环境和两个不同平台上进行评估:一个3500平方英尺的杂货店(使用手机)和一个3700平方英尺的实验室空间(使用四足机器人),VLM-GLoc分别实现了70%和74%的全局定位成功率,显著优于传统的纯几何和特定领域基线方法。

英文摘要

Global localization in geometrically aliased, quasi-static environments such as grocery stores, offices, schools, and hospitals poses a significant challenge for mobile robots. Grocery stores with parallel aisles and a long-tailed distribution of products, as well as offices and labs with repetitive furniture such as chairs, desks, monitors, and doors, exemplify common indoor environments that present geometric and even semantic ambiguity. Traditional approaches rely either on distinct geometric features or on domain-specific vision pipelines that struggle with long-tail semantic distributions and transient visual clutter. We present VLM-GLoc, a method for hierarchical semantic Monte Carlo Localization (MCL) that leverages open-vocabulary Vision-Language Models (VLMs) as a unified semantic observation front-end. We hypothesize a three-fold benefit from VLMs: (1) extracting highly discriminative rich text features, (2) implicit quality filtering of blurry or dynamic objects, and (3) permanence reasoning for targeted data augmentation. We introduce an inverse semantic proposal mechanism that seeds particles via text-to-map retrieval. Evaluated across two real-world environments with different characteristics and two different platforms (a 3,500 sq. ft. grocery store with a cellphone and a 3,700 sq. ft. lab space with a quadruped), VLM-GLoc achieves 70% and 74% global localization success respectively, substantially outperforming traditional geometry-only and domain-specific baselines.
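Seeding particles via text-to-map retrieval can be sketched as follows: compare the text observed by the VLM against text attached to map landmarks, then draw initial particles near the best matches. Everything here is an assumption for illustration, including the word-overlap similarity (a real system would more likely use text embeddings), the landmark format, and the noise scale.

```python
import math
import random

def jaccard(a, b):
    """Word-overlap similarity between two strings (stand-in for an embedding score)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def seed_particles(observed_text, landmarks, n_particles, sigma=0.5, rng=None):
    """landmarks: list of (text, (x, y)) map entries (hypothetical format).
    Returns (x, y, heading) particles drawn around text-matching landmarks."""
    rng = rng or random.Random(0)
    weights = [jaccard(observed_text, text) for text, _ in landmarks]
    if sum(weights) == 0:
        return []  # no match: the caller would fall back to uniform sampling
    particles = []
    for _ in range(n_particles):
        _, (x, y) = rng.choices(landmarks, weights=weights, k=1)[0]
        particles.append((rng.gauss(x, sigma), rng.gauss(y, sigma),
                          rng.uniform(-math.pi, math.pi)))
    return particles

landmarks = [("organic oat milk", (2.0, 10.0)), ("tomato soup cans", (20.0, 3.0))]
particles = seed_particles("oat milk carton", landmarks, n_particles=50)
```

Because only the first landmark shares words with the observation, all particles start near (2, 10) rather than being spread over the whole map, which is what makes the proposal "inverse": observation first, pose hypotheses second.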

2605.03337 2026-06-01 cs.CV cs.AI 版本更新

FreeTimeGS++: Secrets of Dynamic Gaussian Splatting and Their Principles

FreeTimeGS++:动态高斯泼溅的秘密及其原理

Lucas Yunkyu Lee, Soonho Kim, Youngwook Kim, Sangmin Kim, Jaesik Park

发表机构 * Seoul National University(首尔国立大学) POSTECH

AI总结 本文通过建立控制基线FreeTimeGS_ours,系统分析4D高斯泼溅框架中的隐藏因素,揭示高斯持续时间驱动的时态分区和光度保真度与时空一致性之间的差异等关键秘密,并提出FreeTimeGS++方法,采用门控边缘化和神经速度场实现更稳定的动态表示。

Comments Project page: https://yklcs.com/ftgspp

详情
AI中文摘要

近期4D高斯泼溅(4DGS)的兴起在动态场景重建方面取得了令人瞩目的成果。尽管这些方法表现出卓越的性能,但其背后的具体驱动因素仍未被充分探索,使得对基本原理的系统理解具有挑战性。本文对这些隐藏因素进行了全面分析,以提供对4DGS框架更清晰的视角。我们首先通过形式化和复现最先进的FreeTimeGS的启发式方法,建立了一个受控基线FreeTimeGS_ours。利用该框架,我们沿着其基本轴剖析4DGS,并揭示了关键秘密,包括由高斯持续时间驱动的涌现时态分区以及光度保真度与时空一致性之间的差异。基于这些见解,我们提出了FreeTimeGS++,这是一种采用门控边缘化和神经速度场的原理性方法,以实现卓越的稳定性和鲁棒的动态表示。我们的方法产生了可重复的结果,并降低了运行间方差。我们将发布我们的实现,为未来的4DGS研究提供可靠的基础。

英文摘要

The recent surge in 4D Gaussian Splatting (4DGS) has achieved impressive dynamic scene reconstruction. While these methods demonstrate remarkable performance, the specific drivers behind such gains remain less explored, making a systematic understanding of the underlying principles challenging. In this paper, we perform a comprehensive analysis of these hidden factors to provide a clearer perspective on the 4DGS framework. We first establish a controlled baseline, FreeTimeGS_ours, by formalizing and reproducing the heuristics of the state-of-the-art FreeTimeGS. Using this framework, we dissect 4DGS along its fundamental axes and uncover key secrets, including the emergent temporal partitioning driven by Gaussian durations and the discrepancy between photometric fidelity and spatiotemporal consistency. Based on these insights, we propose FreeTimeGS++, a principled method that employs gated marginalization and neural velocity fields to achieve superior stability and robust dynamic representations. Our approach yields reproducible results with reduced run-to-run variance. We will release our implementation to provide a reliable foundation for future 4DGS research.

2605.30469 2026-06-01 cs.SD cs.CV 版本更新

3DAE: Binaural Quality Assessment for Audio Novel View Synthesis with Spatial Maps and Benchmark

3DAE: 基于空间图谱和基准的音频新视角合成双耳质量评估

Jialu Xu, Yifan Zhou

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出一个全参考诊断框架3DAE Map,通过时频音频误差图(幅度、ILD、IPD、时间对齐、响度和高频故障)进行视觉检查,并构建模型无关基准3DAE Bench,用于评估音频新视角合成模型的双耳预测质量。

详情
AI中文摘要

3D音频和新视角声学合成模型通常使用全局指标进行评估。然而,全局指标往往隐藏了双耳预测失败的位置和原因。我们提出一个全参考诊断框架,该框架使用时频音频误差图,包括幅度、ILD、IPD、时间对齐、响度和高频故障,形成3D音频误差图(3DAE Map)用于视觉检查。我们将这些诊断方法整合到一个模型无关的基准——空间音频误差基准(3DAE Bench)中,该基准接受任意真实和预测的双耳对,并报告音频新视角合成模型的预测质量。在Replay-NVAS和SoundSpaces上对ViGAS输出的实验显示了不同的主要故障模式:Replay-NVAS上的时间错位和SoundSpaces上的ILD不匹配。总体而言,该框架为音频新视角合成模型开发优化提供了可解释的故障模式总结和直观的视觉图谱。

英文摘要

3D audio and novel-view acoustic synthesis models are usually evaluated with global metrics. However, global metrics often hide where and why binaural prediction fails. We propose a full-reference diagnostic framework that uses time-frequency audio error maps for magnitude, ILD, IPD, temporal alignment, loudness, and high-frequency failures, forming a 3D Audio Error Map (3DAE Map) for visual inspection. We frame these diagnostics into a model-agnostic benchmark, Spatial Audio Error Bench (3DAE Bench), which takes arbitrary ground-truth and predicted binaural pairs and reports the prediction quality of audio novel-view synthesis models. Experiments on ViGAS outputs over Replay-NVAS and SoundSpaces show different dominant failure modes: temporal misalignment on Replay-NVAS and ILD mismatch on SoundSpaces. Overall, the framework provides interpretable failure-mode summaries and intuitive visual maps for developing and optimizing audio novel-view synthesis models.
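As a simplified illustration of one of these diagnostics, the interaural level difference (ILD) is the left-to-right energy ratio in decibels, and an ILD error is the gap between the predicted and reference ILD. The sketch below works per time frame on broadband signals only; the paper's maps are time-frequency, so this is a reduced version under that assumption, and all function names are hypothetical.

```python
import math

def frame_ild_db(left, right, frame_len, eps=1e-12):
    """Broadband ILD in dB for consecutive non-overlapping frames."""
    ilds = []
    n = min(len(left), len(right))
    for start in range(0, n - frame_len + 1, frame_len):
        e_left = sum(x * x for x in left[start:start + frame_len]) / frame_len
        e_right = sum(x * x for x in right[start:start + frame_len]) / frame_len
        ilds.append(10 * math.log10((e_left + eps) / (e_right + eps)))
    return ilds

def ild_error(ground_truth, predicted, frame_len):
    """Absolute per-frame ILD difference between two (left, right) pairs."""
    gt = frame_ild_db(ground_truth[0], ground_truth[1], frame_len)
    pr = frame_ild_db(predicted[0], predicted[1], frame_len)
    return [abs(a - b) for a, b in zip(gt, pr)]

# Reference: right channel at half amplitude. Prediction: equal channels.
reference = ([1.0] * 8, [0.5] * 8)
prediction = ([1.0] * 8, [1.0] * 8)
errors = ild_error(reference, prediction, frame_len=4)  # about 6.02 dB per frame
```

Keeping the error per frame, rather than averaging it, is what lets a map show when the spatial cue breaks down instead of only how much on average.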

2605.30444 2026-06-01 cs.CV 版本更新

Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation

Dex2HOI: 灵巧双手双物体交互生成

Chrysa Pratikaki, Pablo Ruiz-Ponce, Jiankang Deng, Stefanos Zafeiriou, Rolandos Alexandros Potamias

发表机构 * Imperial College London, UK(伦敦帝国理工学院) University of Alicante, Spain(阿利坎特大学)

AI总结 提出Dex2HOI统一扩散模型,通过双流扩散和运动融合网络,实现从文本生成单/双物体灵巧双手交互,速度提升达540倍。

详情
AI中文摘要

近期4D人-物体交互(HOI)生成的进展使得运动合成越来越逼真,特别是对于单物体操作。然而,当前研究忽视了人类行为的一个固有特性:人们自然地协调双手并同时操作多个物体。为填补这一空白,我们提出了Dex2HOI,一个用于从文本合成单物体和双物体HOI的统一扩散模型。其核心采用双流扩散方法,每个物体在专用交互流中处理,并通过双向交叉注意力进行协调。为了合成最终运动,我们引入了一个运动融合网络,该网络集成了新颖的相对于手的物体表示和应用于整个序列的接触感知条件。通过在带前缀条件的窗口上自回归采样扩散过程,Dex2HOI以实时速度生成任意长的序列,省略了冗余的测试时优化,相比先前最先进方法实现了高达540倍的推理加速。在单物体和双物体基准上的广泛评估展示了最先进的定量结果,标志着超越传统单物体HOI生成、向表达性多物体操作迈出的一步。代码和模型将在接收后发布。

英文摘要

Recent advances in 4D Human-Object Interaction (HOI) generation have enabled increasingly realistic motion synthesis, particularly for single-object manipulation. Yet current research overlooks an inherent property of human behavior: people naturally coordinate both hands and manipulate multiple objects simultaneously. To address this gap, we present Dex2HOI, a unified diffusion model for single- and two-object HOI synthesis from text. At its core, Dex2HOI employs a Dual-Stream Diffusion approach, where each object is processed in a dedicated interaction stream and coordinated through bidirectional cross-attention. To synthesize the final motion, we introduce a Motion Fusion Network integrated with novel hand-relative object representations and contact-aware conditioning applied across the whole sequence. By sampling the diffusion process autoregressively over prefix-conditioned windows, Dex2HOI generates arbitrarily long sequences at real-time speed, omitting redundant test-time optimization and achieving up to a 540x inference speed-up over prior state-of-the-art methods. Extensive evaluation on both single- and two-object benchmarks demonstrates state-of-the-art quantitative results, marking a step beyond conventional single-object HOI generation and toward expressive multi-object manipulation. Code and models will be released upon acceptance.
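Autoregressive sampling over prefix-conditioned windows can be sketched with a generic loop: each new window is conditioned on the last few frames already generated, and only its new frames are appended. The sampler interface, window sizes, and the dummy sampler below are hypothetical stand-ins for the diffusion model.

```python
def generate_long_sequence(sample_window, total_len, window_len, prefix_len):
    """Build a sequence of `total_len` frames from fixed-size windows.

    sample_window(prefix, length) must return `length` frames whose first
    len(prefix) frames reproduce the prefix (assumed interface).
    Requires prefix_len < window_len.
    """
    seq = sample_window(prefix=[], length=window_len)
    while len(seq) < total_len:
        prefix = seq[-prefix_len:]
        window = sample_window(prefix=prefix, length=window_len)
        seq.extend(window[prefix_len:])  # keep only the newly generated frames
    return seq[:total_len]

def dummy_sampler(prefix, length):
    """Stand-in for the diffusion sampler: continues an integer count."""
    start = prefix[-1] + 1 if prefix else 0
    return list(prefix) + list(range(start, start + length - len(prefix)))

frames = generate_long_sequence(dummy_sampler, total_len=20, window_len=8, prefix_len=2)
```

Because each step only needs the previous prefix, the sequence length is not bounded by the window size, which is the property the abstract describes as generating arbitrarily long sequences.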

2605.30431 2026-06-01 cs.CV 版本更新

DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution

DTG-Restore: 无需训练的视频超分辨率扩散精炼

Hidir Yesiltepe, Koutilya PNVR, Gaurav Pathak, Navaneeth Bodla, Bharat Singh, Pinar Yanardag, Jinrong Xie

发表机构 * Virginia Tech(弗吉尼亚理工大学) Adobe(Adobe公司)

AI总结 提出解耦时间引导(DTG)方法,通过时间解耦条件与无条件分支,无需训练即可增强扭曲低分辨率视频,提升结构保真度和时间稳定性。

详情
AI中文摘要

近期视频扩散模型的进展实现了显著的生成保真度,但利用这些先验进行修复仍受限于标准无分类器引导中条件分支与无条件分支的强耦合。我们提出一种无需训练的框架,通过时间解耦这些信号来增强扭曲和低分辨率视频。我们提出的解耦时间引导(DTG)在更干净的扩散时间步评估无条件分支,提供一个前瞻先验,在抑制扭曲内容复制的同时保持几何结构。这种时间偏置在采样过程中逐渐减弱,使模型能够从结构校正过渡到细节精炼,无需重新训练。结合任何现成的修复模块以即插即用的方式,我们的方法在AI生成和真实世界视频中均能改善感知一致性并恢复合理的结构。为便于评估,我们整理了GenWarp480基准,包含从多种文本到视频模型合成的4400个扭曲480p视频。GenWarp480专注于特征性生成退化,如扭曲面部、身体错位和空间伪影,为评估对生成错误的鲁棒性提供了专门构建的测试平台。大量实验表明,我们的方法在无需任何模型训练的情况下,在结构保真度和时间稳定性方面取得了显著改进。

英文摘要

Recent progress in video diffusion models has enabled remarkable generative fidelity, yet leveraging these priors for restoration remains limited by the strong coupling between conditional and unconditional branches in standard classifier-free guidance. We introduce a training-free framework that enhances distorted and low-resolution videos by decoupling these signals in time. Our proposed Decoupled Time Guidance (DTG) evaluates the unconditional branch at a cleaner diffusion timestep, providing a lookahead prior that preserves geometry while suppressing replication of warped content. This temporal bias is annealed throughout sampling, allowing the model to transition from structure correction to detail refinement without retraining. Combined with any off-the-shelf restoration module in a plug-and-play manner, our approach improves perceptual coherence and restores plausible structure in AI-generated and real-world videos alike. To facilitate evaluation, we curate GenWarp480, a benchmark of 4,400 distorted 480p videos synthesized from diverse text-to-video models. GenWarp480 focuses on characteristic generative degradations such as warped faces, body misalignments, and spatial artifacts, providing a purpose-built testbed for assessing robustness to generative errors. Extensive experiments demonstrate that our method achieves significant improvements in structural fidelity and temporal stability without any model training.
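The decoupling can be written down as a small change to the standard classifier-free guidance formula. The notation below is ours and is only one plausible reading of the abstract: $\epsilon_\theta$ is the denoiser, $c$ the condition, $\varnothing$ the null condition, $w$ the guidance scale, smaller $t$ is taken to mean a cleaner timestep, and $\delta_t \ge 0$ is a hypothetical offset that is annealed toward zero during sampling. The paper's exact formulation may differ.

```latex
% Standard classifier-free guidance: both branches use the same timestep t.
\hat{\epsilon}_t = \epsilon_\theta(x_t, t, \varnothing)
  + w \left[ \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing) \right]

% Sketch of time-decoupled guidance: the unconditional branch is evaluated
% at a cleaner timestep t - \delta_t, with \delta_t \to 0 as sampling proceeds.
\hat{\epsilon}_t^{\mathrm{DTG}} = \epsilon_\theta(x_t, t - \delta_t, \varnothing)
  + w \left[ \epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t - \delta_t, \varnothing) \right]
```

When $\delta_t = 0$ the second expression reduces to the first, which matches the description of a bias that fades as sampling moves from structure correction to detail refinement.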

2605.30409 2026-06-01 cs.CV cs.AI 版本更新

SANA-Streaming: Real-time Streaming Video Editing with Hybrid Diffusion Transformer

SANA-Streaming: 基于混合扩散Transformer的实时流式视频编辑

Yuyang Zhao, Yicheng Pan, Qiyuan He, Jincheng Yu, Junsong Chen, Tian Ye, Haozhe Liu, Enze Xie, Song Han

发表机构 * NVIDIA MIT(麻省理工学院) THU(清华大学) NUS(新加坡国立大学) HKU(香港大学)

AI总结 提出系统-算法协同设计的SANA-Streaming框架,通过混合扩散Transformer架构、循环反向正则化训练策略和高效系统协同设计,在消费级GPU上实现高分辨率实时流式视频编辑,达到1280×704分辨率24 FPS的端到端性能。

详情
AI中文摘要

实时流式视频到视频编辑(V2V)对于直播和游戏等交互式应用至关重要,但由于对时间一致性和推理吞吐量的严格要求,它仍然是一个严峻的挑战。在本文中,我们提出了SANA-Streaming,一个系统-算法协同设计的框架,用于在消费级GPU上进行高分辨率、实时流式视频编辑,具有以下三个核心设计:(1)混合扩散Transformer架构在部分块中引入softmax注意力以提高局部建模能力,同时保持线性层的效率。(2)循环反向正则化是一种新颖的训练策略,通过流匹配从生成内容预测源帧来强制语义一致性,无需成对的长编辑视频即可提高时间一致性。(3)高效系统协同设计结合了融合GDN内核和针对NVIDIA Blackwell(RTX 5090)架构优化的混合精度量化(MPQ)。通过分析实际吞吐量,我们的MPQ在保持生成质量的同时最大化Tensor Core利用率。最终系统在单个RTX 5090 GPU上以24 FPS的端到端帧率实现实时1280×704分辨率编辑,其中DiT核心运行在58 FPS。实验结果表明,我们的协同设计方法在时间一致性和系统吞吐量方面均显著优于现有最先进方法。

英文摘要

Real-time streaming video-to-video editing (V2V) is critical for interactive applications such as live broadcasting and gaming, yet it remains a formidable challenge due to the stringent requirements for temporal consistency and inference throughput. In this paper, we present SANA-Streaming, a system-algorithm co-designed framework for high-resolution, real-time streaming video editing on consumer GPUs, with the following three core designs: (1) Hybrid Diffusion Transformer architecture introduces softmax attention in part of the blocks to improve local modeling capabilities while preserving the efficiency of linear layers. (2) Cycle-Reverse Regularization is a novel training strategy that enforces semantic consistency by predicting source frames from generated content via flow matching, improving temporal consistency without requiring paired long edited videos. (3) Efficient System Co-design combines fused GDN kernels and Mixed-Precision Quantization (MPQ) optimized for the NVIDIA Blackwell (RTX 5090) architecture. By profiling real-world throughput, our MPQ maximizes Tensor Core utilization while maintaining generation quality. The resulting system achieves real-time 1280 x 704 resolution editing at 24 end-to-end FPS on a single RTX 5090 GPU, with the DiT core running at 58 FPS. Experimental results demonstrate that our co-design approach significantly outperforms existing SOTA methods in both temporal coherence and system throughput.

2605.30387 2026-06-01 cs.LG cs.AI cs.CV eess.SP 版本更新

Functional MRI Time Series Generation via Wavelet-Based Image Transform and Spectral Flow Matching for Brain Disorder Identification

基于小波图像变换和频谱流匹配的功能磁共振时间序列生成用于脑疾病识别

Hwa Hui Tew, Junn Yong Loo, Fang Yu Leong, Julia K. Lau, Ding Fan, Hernando Ombao, Raphaël C. -W. Phan, Chee Pin Tan, Chee-Ming Ting

发表机构 * School of Information Technology, Monash University Malaysia(莫纳什大学马来西亚分校信息技术学院) School of Engineering, Monash University Malaysia(莫纳什大学马来西亚分校工程学院) Statistics Program, King Abdullah University of Science and Technology(阿卜杜拉国王科技大学统计学项目)

AI总结 提出双频谱流匹配(DSFM)框架,通过离散小波变换和离散余弦变换对BOLD信号进行双频表示,结合频谱流匹配生成类条件余弦频率表示,再经逆变换重建生理上合理的时域BOLD信号,以改善下游脑网络分类。

Comments Accepted at the Fourteenth International Conference on Learning Representations (ICLR 2026)

详情
AI中文摘要

功能磁共振成像(fMRI)通过测量随时间变化的血氧水平依赖(BOLD)信号,提供对动态脑活动的非侵入性访问。然而,fMRI采集的资源密集型特性限制了数据驱动脑分析模型所需的高保真样本的可用性。虽然现代生成模型可以合成fMRI数据,但它们在复制原始BOLD信号固有的非平稳性、复杂的时空动态和生理变化方面仍然面临挑战。为了解决这些挑战,我们提出了双频谱流匹配(DSFM),一种新颖的fMRI生成框架,它将BOLD信号的双频表示与频谱流匹配级联起来。具体来说,我们的框架首先通过离散小波变换(DWT)将BOLD信号转换为小波分解图,以捕获全局瞬态和多尺度变化,并将其投影到跨脑区和时间的离散余弦变换(DCT)空间中,以利用低频主导BOLD系数的局部能量压缩。随后,训练一个频谱流匹配模型来生成类条件余弦频率表示。通过逆DCT和逆DWT操作重建生成的样本,以恢复生理上合理的时域BOLD信号。这种双变换方法施加了结构化的频率先验,并保留了关键的生理脑动力学。最终,我们通过改进的下游基于fMRI的脑网络分类证明了我们方法的有效性。代码可在 https://github.com/htew0001/DSFM.git 获取。

英文摘要

Functional Magnetic Resonance Imaging (fMRI) provides non-invasive access to dynamic brain activity by measuring blood oxygen level-dependent (BOLD) signals over time. However, the resource-intensive nature of fMRI acquisition limits the availability of high-fidelity samples required for data-driven brain analysis models. While modern generative models can synthesize fMRI data, they still struggle to replicate the inherent non-stationarity, intricate spatiotemporal dynamics, and physiological variations of raw BOLD signals. To address these challenges, we propose Dual-Spectral Flow Matching (DSFM), a novel fMRI generative framework that cascades a dual frequency representation of BOLD signals with spectral flow matching. Specifically, our framework first converts BOLD signals into a wavelet decomposition map via a discrete wavelet transform (DWT) to capture globalized transient and multi-scale variations, and then projects this map into the discrete cosine transform (DCT) space across brain regions and time to exploit localized energy compaction of low-frequency dominant BOLD coefficients. Subsequently, a spectral flow matching model is trained to generate class-conditioned cosine-frequency representations. The generated samples are reconstructed through inverse DCT and inverse DWT operations to recover physiologically plausible time-domain BOLD signals. This dual-transform approach imposes structured frequency priors and preserves key physiological brain dynamics. Ultimately, we demonstrate the efficacy of our approach through improved downstream fMRI-based brain network classification. The code is available at https://github.com/htew0001/DSFM.git .
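The forward and inverse transform chain (DWT, then DCT, then inverse DCT, then inverse DWT) can be checked on a toy signal. The sketch below is a simplification under stated assumptions: it uses a single-level Haar wavelet and a 1D orthonormal DCT-II on one time series, whereas the paper applies the DCT across brain regions and time and does not name its wavelet in the abstract. The generative model that would sit between the forward and inverse steps is omitted.

```python
import math

def haar_dwt(x):
    """Single-level Haar DWT of an even-length signal: (approximation, detail)."""
    s = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    detail = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return approx, detail

def haar_idwt(approx, detail):
    """Inverse of haar_dwt."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x += [(a + d) / s, (a - d) / s]
    return x

def _dct_scale(k, n):
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(x):
    """Orthonormal DCT-II."""
    n = len(x)
    return [_dct_scale(k, n) * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n)
                                   for i in range(n)) for k in range(n)]

def idct(coeffs):
    """Inverse of the orthonormal DCT-II (its transpose)."""
    n = len(coeffs)
    return [sum(_dct_scale(k, n) * coeffs[k] * math.cos(math.pi * (i + 0.5) * k / n)
                for k in range(n)) for i in range(n)]

signal = [0.2, 0.5, 0.9, 0.4, -0.1, -0.6, -0.3, 0.1]  # toy stand-in for a BOLD series
approx, detail = haar_dwt(signal)
spectral = dct(approx + detail)          # representation a generator would model
restored_map = idct(spectral)
half = len(restored_map) // 2
restored = haar_idwt(restored_map[:half], restored_map[half:])
```

Since both transforms are orthonormal, the round trip is lossless up to floating-point error, so any distortion in generated signals comes from the generative model rather than from the representation.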

2605.30362 2026-06-01 cs.NE cs.AI cs.CV 版本更新

XOResNet: Exclusive-OR Meta-Residuals Facilitate Deep Spiking Neural Networks Learning

XOResNet: 异或元残差促进深度脉冲神经网络学习

Jianfang Wu, Junsong Wang

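Affiliations: School of Artificial Intelligence, Shenzhen Technology University; Faculty of Data Science, City University of Macau

AI summary: To address spike redundancy, information loss, and redundant learning in the residual structures of deep spiking neural networks, the paper proposes an OR-ADD shortcut connection and an XOR meta-residual mechanism and builds XOResNet, which outperforms existing methods on multiple datasets.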

Comments 33 pages, 12 figures, 7 Tables

详情
AI中文摘要

脉冲神经网络(SNN)在深度模型中展现出优越的学习和表示能力。鉴于ResNet在深度学习中的巨大成功,自然希望用残差学习训练深度SNN。然而,现有的用于构建深度SNN的残差结构仍然面临脉冲冗余或信息损失以及冗余学习的挑战。在本研究中,我们首先旨在解决恒等映射中的相对脉冲冗余和非恒等映射中的信息损失问题。为此,我们提出了一种OR-ADD(OA)捷径连接,用于合并残差结构中两个分支的输出脉冲/电流。此外,为了减轻残差结构主干分支中的冗余学习,我们引入了XOR元残差的概念,即使用异或(XOR)操作为主干分支选择预学习残差。最后,通过整合OA捷径和XOR元残差,我们设计了XOR残差块,并基于该块进一步构建了不同深度的XOResNet。在Fashion-MNIST、CIFAR-10、CIFAR-100和miniImageNet四个数据集上的大量实验表明,所提出的XOResNet优于现有的通过梯度下降优化的最先进深度SNN。这些结果验证了我们的OA捷径和XOR元残差组件在克服SNN中残差学习基本局限性方面的有效性,为构建高性能神经形态系统提供了新的架构见解。

英文摘要

Spiking neural networks (SNNs) hold promise for demonstrating superior learning and representation capabilities in deep models. Given the tremendous success of ResNet in deep learning, it would naturally follow to train deep SNNs with residual learning. However, existing residual structures for constructing deep SNNs still present challenges of spike redundancy or information loss, as well as redundant learning. In the present study, we first aim to address issues of relative spike redundancy in identity mapping and information loss in non-identity mapping. To this end, we propose an OR-ADD (OA) shortcut connection to merge output spikes/currents from two branches in the residual structure. Furthermore, to mitigate redundant learning in the backbone branch of the residual structure, we introduce the concept of XOR meta-residuals, i.e., selecting pre-learning residuals using the Exclusive-OR (XOR) operation for the backbone branch. Finally, by integrating the OA shortcut and XOR meta-residuals, we devise the XOR residual block and further construct XOResNet with varying depths based on this block. Extensive experiments on four datasets, Fashion-MNIST, CIFAR-10, CIFAR-100, and miniImageNet, show that the proposed XOResNet outperforms existing state-of-the-art deep SNNs optimized via gradient descent. These results validate the effectiveness of our OA shortcut and XOR meta-residual components in overcoming fundamental limitations of residual learning in SNNs, providing new architectural insights for building high-performance neuromorphic systems.

2605.28442 2026-06-01 cs.RO cs.CV 版本更新

Self-Supervised Online Robot-Agnostic Traversability Estimation for Open-World Environments

面向开放世界的自监督在线机器人无关可通行性估计

Julia Hindel, Simon Bultmann, Houman Masnavi, Daniele Cattaneo, Abhinav Valada

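Affiliations: Department of Computer Science, University of Freiburg

AI summary: Proposes the COTRATE framework, which estimates traversability from multimodal, unlabeled robot experience through self-supervised online learning. It uses a robot-agnostic terrain assessment module and a diversity-aware feature selection strategy to enable cross-platform knowledge transfer and reduce forgetting.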

Comments 14 pages, 16 Figures

详情
AI中文摘要

自监督在线可通行性估计使机器人能够从未标记的开放世界经验中持续学习,并调整其导航行为以实现安全高效的轨迹。现有方法要么依赖手工设计的本体感受可通行性分数,限制了机器人无关性,要么对先验数据进行聚类,阻碍了在线学习。此外,许多持续学习方法会带来大量的内存和计算成本,阻碍了机载部署。我们提出了COTRATE,一个用于从多模态、未标记的机器人经验中持续估计可通行性的在线学习框架。我们的方法首先使用一个基于学习的机器人无关在线地形评估模块,该模块处理本体感受和惯性信号,推断出鲁棒的可通行性分数。然后,这些分数通过一种新颖的对齐损失来监督视觉可通行性网络,该损失将视觉嵌入与在线地形评估相关联。为了在持续学习过程中以最小开销减轻遗忘,我们提出了一种多样性感知的特征选择策略,该策略使用紧凑的回放记忆来保持性能。我们进一步表明,学习到的可通行性表示支持具有不同运动学特性的不同机器人平台之间的知识迁移。我们在一个包含约50,000张图像的数据集上评估了COTRATE,该数据集由两个机器人平台在11种户外地形上收集,并在三个代表性户外环境中的导航任务上进行了基准测试。我们将数据集、代码和训练模型公开。

英文摘要

Self-supervised online traversability estimation enables robots to continuously learn from unlabeled open-world experiences and adapt their navigation behavior toward safe and efficient trajectories. Existing approaches either rely on handcrafted proprioceptive traversability scores, limiting robot-agnosticism, or cluster prior data, preventing online learning. Moreover, many continual learning methods incur substantial memory and computational costs, hindering onboard deployment. We introduce COTRATE, an online learning framework for continuous traversability estimation from multimodal, unlabeled robot experience. Our method first infers robust traversability scores using a robot-agnostic, learning-based online terrain assessment module operating on proprioceptive and inertial signals. These scores then supervise a visual traversability network through a novel alignment loss that associates visual embeddings with online terrain assessments. To mitigate forgetting during continual learning with minimal overhead, we propose a diversity-aware feature selection strategy that preserves performance using a compact replay memory. We further show that the learned traversability representation supports knowledge transfer across different robot platforms with different locomotion kinematics. We evaluate COTRATE on a dataset of $\approx$ 50,000 images collected with two robotic platforms across 11 outdoor terrains, and benchmark it on navigation tasks in three representative outdoor environments. We make the dataset, code, and trained models publicly available.
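
One simple way to keep a compact replay memory diverse is greedy farthest-point selection over feature vectors. The sketch below uses that technique purely as a stand-in: the abstract does not describe how COTRATE's diversity-aware selection actually works, and `select_diverse`, the feature values, and the budget are hypothetical.

```python
def sqdist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def select_diverse(features, budget):
    """Greedy farthest-point selection: returns indices of kept features."""
    chosen = [0]  # start from the first feature (arbitrary choice)
    # Distance from every feature to its nearest already-chosen feature.
    nearest = [sqdist(f, features[0]) for f in features]
    while len(chosen) < min(budget, len(features)):
        nxt = max(range(len(features)), key=lambda i: nearest[i])
        chosen.append(nxt)
        nearest = [min(d, sqdist(f, features[nxt]))
                   for d, f in zip(nearest, features)]
    return chosen

# Two tight clusters plus one outlier (made-up 2D embeddings).
feats = [(0.0, 0.0), (0.1, 0.0), (5.0, 0.0), (5.1, 0.0), (0.0, 4.0)]
memory = select_diverse(feats, budget=3)
```

With a budget of three, the selection keeps one sample from each cluster and the outlier instead of storing near-duplicates, which is the property a small replay buffer needs.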

2605.27367 2026-06-01 cs.CV 版本更新

SpatialBench: Is Your Spatial Foundation Model an All-Round Player?

SpatialBench: 你的空间基础模型是全能选手吗?

Haosong Peng, Hao Li, Jiaqi Chen, Yuhao Pan, Runmao Yao, Yalun Dai, Fushuo Huo, Fangzhou Hong, Zhaoxi Chen, Haozhao Wang, Dingwen Zhang, Ziwei Liu, Wenchao Xu

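Affiliations: Hong Kong University of Science and Technology; Nanyang Technological University; Northwestern Polytechnical University; Southeast University; Huazhong University of Science and Technology

AI summary: Proposes the SpatialBench benchmark, whose cross-paradigm, multi-domain evaluation with deterministic sampling reveals that current spatial foundation models generalize poorly across diverse downstream tasks, and introduces the DA-Next-5M dataset and the DA-Next model to advance spatial representation learning.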

Comments Project Page: https://ropedia.github.io/SpatialBench/

详情
AI中文摘要

尽管空间基础模型在标准数据集上展示了令人印象深刻的性能,但一个关键问题仍然存在:它们是否真正是能够稳健泛化到多样化下游任务、任意视角、变化的场景域、不同输入密度和特定硬件约束的全能选手?回答这个总体问题需要整体评估,然而当前模型主要在其专门设计或训练的特定领域上进行评估。这种评估本质上受到狭窄范式覆盖、有限场景域和任意帧采样的限制,使得从根本上难以评估其真正的泛化能力。为弥补这一差距,我们提出了SpatialBench,一个用于空间基础模型的跨范式、域多样化的基准,采用确定性采样。SpatialBench具有前所未有的规模和严格的确定性设计,包含19个数据集和546个场景,覆盖5个不同的空间域。它在4种不同输入密度设置下,全面评估了6个范式的41个模型在5个任务套件上的表现。我们的广泛评估揭示当前模型尚未成为全能选手,并为未来进展揭示了关键见解。具体来说,我们证明全上下文注意力最大化准确性,而有界记忆策略解锁长序列可扩展性。此外,我们在具有挑战性的具身和自我中心任务中的实证评估表明,严格的域对齐和高数据质量对性能的影响远大于简单的数据集扩展。最后,为解决我们分析中发现的最大数据差距,我们超越评估,引入大规模数据集DA-Next-5M和强基线模型DA-Next,推动空间表示学习的边界。

英文摘要

While spatial foundation models have demonstrated impressive performance on standard datasets, a critical question remains: are they truly all-round players capable of generalizing robustly across diverse downstream tasks, arbitrary viewpoints, shifting scene domains, varying input densities, and specific hardware constraints? Answering this overarching question requires a holistic assessment, yet current models are mainly evaluated on specific domains for which they were specifically designed or trained. Such evaluations are intrinsically limited by narrow paradigm coverage, limited scene domains, and arbitrary frame sampling, making it fundamentally difficult to assess their true generalization capabilities. To address this gap, we present SpatialBench, a cross-paradigm, domain-diverse benchmark for spatial foundation models with deterministic sampling. SpatialBench features unprecedented scale and rigorous deterministic design, comprising 19 datasets and 546 scenes across 5 diverse spatial domains. It comprehensively evaluates 41 models across 6 paradigms on 5 task suites under 4 different input density settings. Our extensive evaluation reveals that current models are not yet all-round players, and uncovers crucial insights for future advancement. Specifically, we demonstrate that full-context attention maximizes accuracy while bounded-memory strategies unlock long-sequence scalability. Moreover, our empirical evaluations in challenging embodied and egocentric tasks demonstrate that strict domain alignment and high data quality are far more critical to performance than simple dataset scaling. Furthermore, to address the largest data gap identified in our analysis, we go beyond evaluation by introducing a large-scale dataset, DA-Next-5M, and a strong baseline model, DA-Next, pushing the boundaries of spatial representation learning.

2605.26519 2026-06-01 cs.CV 版本更新

$R^3$: 3D Reconstruction via Relative Regression

$R^3$: 通过相对回归进行3D重建

Congrong Xu, Huachen Gao, Xingyu Chen, Yuliang Xiu, Jun Gao, Anpei Chen

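Affiliations: University of Michigan; Westlake University; NVIDIA Research

AI summary: Proposes $R^3$, a 3D reconstruction method based on relative regression that uses a lightweight MLP to predict confidence-weighted relative constraints, supporting both full-context offline reconstruction and causal, bounded-memory streaming reconstruction.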

详情
AI中文摘要

最近的馈送式几何基础模型通过单次前向传播恢复深度和姿态,展现出了令人印象深刻的泛化能力。然而,这些模型通常受限于全局坐标框架假设。这种依赖性成为长上下文和流式重建的一个显著瓶颈,因为它迫使网络维护一个任意的时序原点,并处理随时间无界增长的平移幅度。我们的解决方案,称为$R^3$,采用了相对回归。我们使用一个轻量级MLP来预测置信度加权的相对约束。这些置信度作为一个统一的锚点:在训练期间加权损失,在推理期间指导姿态聚合。$R^3$支持全上下文离线重建和因果、有界内存的流式重建。我们在离线与流式设置下的评估验证了我们的相对机制的有效性。项目页面:https://kevinxu02.github.io/r3-site

英文摘要

Recent feed-forward geometry foundation models have demonstrated impressive generalization by recovering depth and poses in a single forward pass. However, these models are typically constrained by a global coordinate frame assumption. This dependency becomes a significant bottleneck for long-context and streaming reconstruction, as it forces the network to maintain an arbitrary temporal origin and handle translation magnitudes that grow unbounded over time. Our solution, which we call $R^3$, employs relative regression. We employ a lightweight MLP to predict confidence-weighted relative constraints. These confidences serve as a unified anchor: weighting losses during training and guiding pose aggregation during inference. $R^3$ supports both full-context offline reconstruction and causal, bounded-memory streaming. Our evaluation in both offline and streaming settings validates the effectiveness of our relative mechanism. Project page: https://kevinxu02.github.io/r3-site
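
The idea of confidences guiding pose aggregation at inference can be sketched for the translation part only. The example assumes each already-placed frame predicts a relative offset to a new frame together with a scalar confidence, and that the new position is the confidence-weighted mean of the implied positions. This is an illustrative assumption: the abstract does not give the aggregation rule, rotations are ignored, and `aggregate_position` is a hypothetical name.

```python
def aggregate_position(anchors, rel_preds):
    """Confidence-weighted estimate of a new frame's position.

    anchors:   known positions (x, y, z) of previously placed frames
    rel_preds: one (offset, confidence) pair per anchor, where offset is
               the predicted displacement from that anchor to the new frame
    """
    total = sum(conf for _, conf in rel_preds)
    estimate = [0.0, 0.0, 0.0]
    for anchor, (offset, conf) in zip(anchors, rel_preds):
        for k in range(3):
            estimate[k] += conf * (anchor[k] + offset[k]) / total
    return estimate

anchors = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
rel_preds = [((2.0, 0.0, 0.0), 3.0),   # implies x = 2.0, high confidence
             ((1.4, 0.0, 0.0), 1.0)]   # implies x = 2.4, low confidence
new_pos = aggregate_position(anchors, rel_preds)
```

Only offsets between frames enter the computation, so no prediction has to grow with distance from a global origin; the weighted mean here lands at x = 2.1, closer to the high-confidence constraint.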

2605.22050 2026-06-01 cs.CV 版本更新

Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations

破碎的记忆:通过退化生成检测和缓解扩散模型中的记忆化

Yuanmin Huang, Mi Zhang, Chen Chen, Feifei Li, Geng Hong, Xiaoyu You, Min Yang

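Affiliations: Fudan University; East China University of Science and Technology

AI summary: This paper is the first to find that memorization in diffusion models induces internal numerical instability that manifests as visually "broken" artifacts. Building on this, it proposes empirical stability regions based on latent update norms to quantify stable behavior, and designs an on-the-fly framework for step-wise detection and adaptive mitigation that suppresses memorization without altering prompts or guidance, achieving a detection AUC > 0.999 and a 0.0% memorization rate on Stable Diffusion 1.4.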

Comments KDD 2026, extended version

详情
AI中文摘要

虽然扩散模型在生成高质量图像方面表现出色,但它们记忆训练数据的倾向带来了显著的隐私和版权风险。在这项工作中,我们首次发现记忆化会导致内部数值不稳定性,通常表现为视觉上的“破碎”伪影。受数值方法中稳定性分析的启发,我们引入了基于潜变量更新范数的经验稳定区域,以定量表征生成过程中的稳定行为。利用这一点,我们提出了一个原则性的、即时的框架,用于逐步骤检测和自适应缓解。我们的方法在不改变提示或引导的情况下抑制记忆化,从而保持语义保真度和图像质量。在Stable Diffusion 1.4上的大量实验表明,我们的方法在缓解后实现了AUC>0.999的检测性能和0.0%的记忆化率,且开销可忽略不计(每张图像约0.01秒)。

英文摘要

While diffusion models excel at generating high-quality images, their tendency to memorize training data poses significant privacy and copyright risks. In this work, we identify for the first time that memorization induces internal numerical instability, often manifesting as visually "broken" artifacts. Inspired by stability analysis in numerical methods, we introduce empirical stability regions based on latent update norms to quantitatively characterize stable behavior during generation. Leveraging this, we propose a principled, on-the-fly framework for step-wise detection and adaptive mitigation. Our approach suppresses memorization without altering prompts or guidance, thereby preserving semantic fidelity and image quality. Extensive experiments on Stable Diffusion 1.4 demonstrate that our method achieves a detection AUC $>0.999$ and a $0.0\%$ memorization rate after mitigation with negligible overhead ($\approx 0.01$ s per image).
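
Step-wise detection from latent update norms can be sketched as follows. The example assumes the stability region is a per-step band of update norms estimated from reference runs believed to be non-memorized, padded by a margin; the band construction, the `margin` parameter, and all function names are assumptions, since the abstract gives neither the region's definition nor the mitigation rule.

```python
import math

def update_norms(latents):
    """L2 norm of each consecutive latent update along one trajectory."""
    return [math.sqrt(sum((b - a) ** 2 for a, b in zip(x0, x1)))
            for x0, x1 in zip(latents, latents[1:])]

def stability_region(reference_trajs, margin=0.1):
    """Per-step (lo, hi) band of update norms from reference runs."""
    per_traj = [update_norms(t) for t in reference_trajs]
    region = []
    for step_vals in zip(*per_traj):
        lo, hi = min(step_vals), max(step_vals)
        pad = margin * (hi - lo)
        region.append((lo - pad, hi + pad))
    return region

def first_unstable_step(latents, region):
    """Index of the first update whose norm leaves the band, else None."""
    for i, (norm, (lo, hi)) in enumerate(zip(update_norms(latents), region)):
        if not lo <= norm <= hi:
            return i
    return None

# Toy 2D "latents" over three time points (made-up values).
refs = [[[0.0, 0.0], [1.0, 0.0], [1.5, 0.0]],
        [[0.0, 0.0], [1.2, 0.0], [1.8, 0.0]]]
region = stability_region(refs)
suspect = [[0.0, 0.0], [1.1, 0.0], [4.0, 0.0]]   # abnormally large 2nd update
normal = [[0.0, 0.0], [1.1, 0.0], [1.65, 0.0]]
```

Here `first_unstable_step(suspect, region)` flags the second update, while the normal trajectory stays inside the band. A check of this kind only needs the norm of each update, which is consistent with the small per-image overhead the abstract reports.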

2605.30248 2026-06-01 cs.CV 版本更新

GenClaw: Code-Driven Agentic Image Generation

GenClaw: 代码驱动的智能体图像生成

Junyan Ye, Jun He, Zilong Huang, Dongzhi Jiang, Xuan Yang, Rui Chen, Weijia Li

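AI summary: Proposes GenClaw, a code-driven agentic image generation paradigm that turns black-box image generation into a controllable, interpretable staged process through three stages: conceptualization, code-based sketching, and texture supplementation.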

Comments 21 pages, 7 figures

详情
AI中文摘要

图像生成模型已从基于文本的像素合成演变为具备视觉理解和工具调用能力的多模态智能体。然而,现有智能体仍受制于底层黑盒图像模型。其工作流程陷入重复的提示重写循环以改进生成,缺乏直接操控画布的机制。本质上,LLMs作为精确视觉构建的“画笔”的潜力尚未被充分挖掘。本文提出GenClaw,一种代码驱动的智能体图像生成范式,使智能体像人类艺术家一样创作:先构思,再素描,最后上色。具体而言,智能体首先通过搜索和推理构建概念知识和上下文。然后利用代码(如SVG、HTML、ThreeJS)渲染可执行的视觉草图。最后,使用图像生成模型补充纹理、材质和逼真度。在此工作流中,代码作为连接语言推理和像素合成的可控中间画布,无缝集成程序逻辑与生成模型的视觉表现力。通过将图像生成从黑盒范式转变为类似真实人类创作的分阶段过程,GenClaw朝着高度可控和可解释的视觉生成系统迈出了一步。

英文摘要

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of prompt rewriting for generation refinement, leaving them with no mechanism to directly manipulate the canvas. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, ThreeJS) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to authentic human creation, GenClaw offers a step toward highly controllable and interpretable visual generation systems.

2605.30215 2026-06-01 cs.CV 版本更新

Déjà View: Looping Transformers for Multi-View 3D Reconstruction

Déjà View: 用于多视图3D重建的循环Transformer

Alessandro Burzio, Tobias Fischer, Sven Elflein, Qunjie Zhou, Riccardo de Lutio, Jiawei Ren, Jiahui Huang, Shengyu Huang, Marc Pollefeys, Laura Leal-Taixé, Zan Gojcic, Haithem Turki

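Affiliations: NVIDIA; University of Modena and Reggio Emilia, AImageLab; University of Toronto, Vector Institute; ETH Zürich

AI summary: Proposes the DéjàView model, which applies a single transformer block recurrently for iterative refinement and matches or outperforms large-scale feed-forward models on multiple 3D reconstruction benchmarks with a fraction of the parameters and comparable or lower compute.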

Comments Project Page: https://research.nvidia.com/labs/dvl/projects/dvlt

详情
AI中文摘要

近期的前馈式3D重建Transformer已扩展到超过十亿参数,遵循计算机视觉中模型容量增加的趋势。然而,新出现的证据表明,连续的Transformer层通常表现为类似操作的重复应用,而多视图重建Transformer在解码器深度上逐步优化其预测。我们认为模型深度部分地购买了迭代,但以独特的参数低效地支付,因此我们将迭代显式地融入架构中。我们的模型DéjàView对每个视图的特征循环应用单个循环Transformer块,进行K步细化。训练一次后,它将K暴露为推理时的计算旋钮,在涵盖室内、室外、物体中心和驾驶场景的五个重建基准上,匹配或优于显著更大的前馈基线,同时使用其一小部分参数和相当或更低的计算量。重要的是,在匹配的训练数据和计算量下,相同的循环块公式优于具有独立每步参数的相同变体,这表明显式迭代不仅是计算高效的容量替代方案,而且是多视图3D重建更强的归纳偏置。

英文摘要

Recent feed-forward 3D reconstruction transformers have scaled to over a billion parameters, following the broader trend of increasing model capacity in computer vision. Yet emerging evidence suggests that contiguous transformer layers often behave like repeated applications of similar operations, and multi-view reconstruction transformers refine their predictions progressively across decoder depth. We posit that model depth partially buys iteration, paid for inefficiently in unique parameters, and instead make that iteration explicit in architecture. Our model, DéjàView, applies a single looped transformer block recurrently to per-view features for K refinement steps. Trained once, it exposes K as an inference-time compute knob, matching or outperforming substantially larger feed-forward baselines across five reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes, while using a fraction of their parameters and comparable or lower compute. Importantly, the same looped block formulation outperforms an otherwise identical variant with independent per-step parameters under matched training data and compute, suggesting that explicit iteration is not merely a compute-efficient substitute for capacity but a stronger inductive bias for multi-view 3D reconstruction.
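
The difference between stacking distinct layers and looping one block can be seen in a toy sketch. It assumes a single residual update with one shared weight matrix standing in for the transformer block; `LoopedBlock` and its contents are hypothetical, and no attention is modeled.

```python
import math

class LoopedBlock:
    """One shared residual block applied repeatedly (toy stand-in)."""

    def __init__(self, weight):
        self.weight = weight  # single shared square matrix, list of rows

    def step(self, x):
        # Residual update: x + tanh(W x), reusing the same W at every step.
        return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
                for xi, row in zip(x, self.weight)]

    def refine(self, x, k):
        """Apply the shared block for k refinement steps."""
        for _ in range(k):
            x = self.step(x)
        return x

    def num_params(self):
        return sum(len(row) for row in self.weight)

block = LoopedBlock([[0.1, -0.2], [0.3, 0.0]])
features = [1.0, 2.0]
coarse = block.refine(features, k=1)   # cheap inference
fine = block.refine(features, k=8)     # more compute, same parameters
```

The parameter count is independent of `k`, so the number of refinement steps can be chosen at inference time, which mirrors the compute knob described in the abstract.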

2605.30060 2026-06-01 cs.CV 版本更新

Towards Consistent Video Geometry Estimation

Towards Consistent Video Geometry Estimation

Zhu Yu, Jingnan Gao, Runmin Zhang, Lingteng Qiu, Zhengyi Zhao, Rui Peng, Yichao Yan, Kejie Qiu, Siyu Zhu, Zilong Dong, Si-Yuan Cao, Hui-Liang Shen

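Affiliations: Zhejiang University; Tongyi Lab, Alibaba Group; Shanghai Jiao Tong University; Fudan University

AI summary: Proposes ViGeo, a feed-forward foundation model built on a plain transformer architecture. With a dynamic chunking attention mechanism and a completion-based data refinement framework, it estimates spatially dense and temporally consistent geometry (depth, normals, point maps) from video sequences and achieves state-of-the-art performance on online, offline, and long-video tasks.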

Comments Project webpage: https://pkqbajng.github.io/ViGeo/

详情
AI中文摘要

本文提出了ViGeo,一种前馈基础模型,用于从视频序列中恢复空间密集且时间一致的几何信息。ViGeo基于纯Transformer架构,没有针对特定任务的架构修改,支持在统一模型中进行流式、全序列和长视频推理。关键设计是动态分块注意力,该机制在训练期间使模型同时暴露于双向和因果时间上下文,并允许其在测试时无需重新训练即可调整注意力模式。为了提高监督质量,我们进一步引入了一种基于补全的数据精炼框架。该框架训练了一个视频深度补全教师模型,该模型以稀疏且有噪声的标注为条件,利用视频/多视图上下文生成密集、时间一致且几何可靠的训练目标。除了深度和点图,ViGeo还在同一框架内预测表面法线。仅使用公共数据集训练,ViGeo在在线、离线和长视频深度估计、表面法线估计以及视频点图估计中均达到了最先进性能。

英文摘要

This work presents ViGeo, a feed-forward foundation model for recovering spatially dense and temporally consistent geometry from video sequences. Built upon a plain transformer architecture without task-specific architectural modifications, ViGeo supports streaming, full-sequence, and long-video inference within a unified model. The key design is dynamic chunking attention, which exposes the model to both bidirectional and causal temporal contexts during training and allows it to adapt its attention pattern at test time without retraining. To improve supervision quality, we further introduce a completion-based data refinement framework. This framework trains a video depth completion teacher that conditions on sparse and noisy annotations and exploits video/multi-view context to produce dense, temporally coherent, and geometrically reliable training targets. Beyond depth and point maps, ViGeo also predicts surface normals within the same framework. Trained solely on public datasets, ViGeo achieves state-of-the-art performance across online, offline, and long-video depth estimation, surface normal estimation, and video point map estimation.
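
A chunk-based attention mask gives one way to interpolate between causal and bidirectional temporal context. The sketch assumes frames attend bidirectionally inside their own chunk and causally to earlier chunks, with the chunk size varied during training; this is an interpretation of "dynamic chunking attention", not the paper's stated definition, and `chunked_attention_mask` is a hypothetical helper.

```python
def chunked_attention_mask(num_frames, chunk_size):
    """mask[i][j] is True when frame i may attend to frame j.

    Frames see every frame in their own chunk and in all earlier chunks.
    """
    return [[(j // chunk_size) <= (i // chunk_size)
             for j in range(num_frames)]
            for i in range(num_frames)]

causal = chunked_attention_mask(3, chunk_size=1)         # streaming
bidirectional = chunked_attention_mask(3, chunk_size=3)  # full sequence
mixed = chunked_attention_mask(4, chunk_size=2)
```

A chunk size of 1 reduces to a standard causal mask and a chunk size equal to the sequence length reduces to full bidirectional attention, so one model trained across chunk sizes could switch patterns at test time without retraining.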

2605.29879 2026-06-01 cs.CV cs.RO 版本更新

DGSG-Mind: Dynamic 3D Gaussian Scene Graphs for Long-Term Scene Understanding and Grounding

DGSG-Mind:用于长期场景理解与定位的动态3D高斯场景图

Luzhou Ge, Xiangyu Zhu, Jinyan Liu, Xuesong Li

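Affiliations: School of Computer Science, Beijing Institute of Technology, China

AI summary: Proposes DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system. It couples a probabilistic voxel grid with explicit 3D Gaussians for robust cross-modal instance fusion and incremental semantic mapping, and builds a hierarchical scene graph and a 3D Gaussian Mind for multimodal reasoning, achieving strong performance in zero-shot 3D visual grounding, open-vocabulary semantic segmentation, and scene reconstruction.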

Comments 9 pages, 6 figures

详情
AI中文摘要

将开放词汇语义信息集成到动态3D场景表示中对于长期具身场景理解至关重要。然而,现有方法常因跨视角线索不完整而导致脆弱的实例关联,同时处理对象级拓扑变化的能力有限,限制了长期机器人任务执行。此外,当前的3D场景理解方法要么依赖简单的特征匹配而缺乏显式空间推理,要么假设离线真实3D几何。为应对这些挑战,我们提出DGSG-Mind,一种混合实例感知的3D高斯动态场景图系统,配备具身推理智能体。我们的系统将概率体素网格与显式3D高斯耦合,实现鲁棒的跨模态实例融合和增量语义映射。它通过基于高斯的视觉重定位和由几何-语义一致性引导的局部掩码细化来处理动态变化。基于实例高斯图,DGSG-Mind进一步构建层次化场景图,并开发3D高斯思维,集成结构关系、空间-语义信息和视觉标注的RoI高斯渲染以进行多模态推理。大量实验表明,DGSG-Mind在基于自重建地图的方法中实现了最佳的零样本3D视觉定位性能,同时在3D开放词汇语义分割和场景重建中也表现出强劲性能。我们进一步将DGSG-Mind部署到真实世界机器人上,展示其目标导向推理和动态更新能力。DGSG-Mind的项目页面位于https://icr-lab.github.io/DGSG-Mind。

英文摘要

Integrating open-vocabulary semantic information into dynamic 3D scene representations is essential for long-term embodied scene understanding. However, existing methods often suffer from fragile instance association due to incomplete cross-view cues, while their limited ability to handle object-level topological changes restricts long-term robotic task execution. Moreover, current 3D scene understanding methods either rely on simple feature matching without explicit spatial reasoning or assume offline ground-truth 3D geometry. To address these challenges, we present DGSG-Mind, a hybrid instance-aware 3D Gaussian dynamic scene graph system with an embodied reasoning agent. Our system couples a probabilistic voxel grid with explicit 3D Gaussians to enable robust cross-modal instance fusion and incremental semantic mapping. It handles dynamic changes through Gaussian-based visual relocalization and localized masked refinement guided by geometric-semantic consistency. Built on the instance Gaussian map, DGSG-Mind further constructs a hierarchical scene graph and develops the 3D Gaussian Mind, which integrates structural relations, spatial-semantic information, and visually annotated RoI Gaussian renderings for multimodal reasoning. Extensive experiments show that DGSG-Mind achieves the best zero-shot 3DVG performance among methods operating on self-reconstructed maps, while also delivering strong performance in 3D open-vocabulary semantic segmentation and scene reconstruction. We further deploy DGSG-Mind on real-world robots to demonstrate its target-oriented reasoning and dynamic update capabilities. The project page of DGSG-Mind is available at https://icr-lab.github.io/DGSG-Mind

2605.29852 2026-06-01 cs.CV cs.LG cs.MM 版本更新

Parameter-Efficient Subspace Decoupling ViT for Mitigating Multi-Task Negative Transfer in Histological Scoring

参数高效子空间解耦ViT用于缓解组织学评分中的多任务负迁移

Youhan Huang, Jiajun Li, Yilin Fang, Shuai Wang, Chuheng Li

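Affiliations: Beijing University of Posts and Telecommunications; Beijing University of Chemical Technology; Capital Medical University

AI summary: Proposes a subspace-decoupled multi-task Vision Transformer that uses lightweight task-specific adapters and orthogonality constraints to build independent feature subspaces, reducing task interference while retaining shared representations and effectively mitigating multi-task negative transfer.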

Comments 6 pages, 5 figures, 2 tables. IEEE ICME 2026 (Oral). Camera-ready version

详情
AI中文摘要

组织学评分对于诊断非酒精性脂肪性肝病(NAFLD)至关重要,但由于高标注成本以及多任务学习中强相关的NAFLD活动评分(NAS)指标之间的负迁移,其自动化仍然具有挑战性。为了解决这个问题,我们提出了一种子空间解耦的多任务Vision Transformer(ViT),它集成了轻量级的任务特定适配器与基于正交性的约束。该设计为脂肪变性、气球样变和炎症构建了独立的特征子空间,有效减少了任务干扰,同时保留了共享表示。我们进一步构建了一个精心策划的多任务小鼠NAFLD组织学数据集,其中包含所有NAS组件的专家标注。实验结果表明,与训练单独的单个任务模型相比,所提出的方法以显著降低的计算成本提高了多任务稳定性和泛化能力。代码和策划的数据集已准备就绪,将在接收后公开以支持可重复性。

英文摘要

Histological scoring is essential for diagnosing Non-Alcoholic Fatty Liver Disease (NAFLD), yet its automation remains challenging due to the high annotation cost and negative transfer among the strongly correlated NAFLD Activity Score (NAS) indicators in multi-task learning. To address this issue, we propose a subspace-decoupled multi-task Vision Transformer (ViT) that integrates lightweight task-specific Adapters with orthogonality-based constraints. This design constructs independent feature subspaces for steatosis, ballooning, and inflammation, effectively reducing task interference while retaining shared representations. We further construct a curated multi-task mouse NAFLD histology dataset with expert annotations for all NAS components. Experimental results demonstrate that the proposed method improves multi-task stability and generalization with substantially reduced computational cost compared to training separate single-task models. The code and the curated dataset have been prepared and will be made publicly available upon acceptance to support reproducibility.
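
An orthogonality constraint between task-specific adapters is often written as a penalty on the cross-Gram matrix of their projection weights. The sketch below assumes that form, the squared Frobenius norm of $A_p^\top A_q$ summed over task pairs; the abstract does not specify the exact constraint, and the adapter matrices and function names are hypothetical.

```python
def gram_penalty(a, b):
    """Squared Frobenius norm of A^T B for matrices given as lists of rows."""
    cols_a = list(zip(*a))
    cols_b = list(zip(*b))
    return sum(sum(x * y for x, y in zip(ca, cb)) ** 2
               for ca in cols_a for cb in cols_b)

def orthogonality_loss(adapters):
    """Sum of pairwise penalties over all distinct task pairs."""
    names = sorted(adapters)
    return sum(gram_penalty(adapters[p], adapters[q])
               for i, p in enumerate(names) for q in names[i + 1:])

# Toy 3x1 down-projections, one per NAS component (made-up values).
decoupled = {
    "steatosis":    [[1.0], [0.0], [0.0]],
    "ballooning":   [[0.0], [1.0], [0.0]],
    "inflammation": [[0.0], [0.0], [1.0]],
}
entangled = dict(decoupled, inflammation=[[1.0], [1.0], [0.0]])
```

The penalty is zero when the adapters span mutually orthogonal directions and grows when one task's subspace overlaps another's, so adding it to the training loss pushes the three scoring tasks toward separate subspaces.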

2605.29655 2026-06-01 cs.CV cs.GR 版本更新

SuperVoxelGPT: Adaptive and Ordered 3D Tokenization for Autoregressive Shape Generation

SuperVoxelGPT: 自适应有序3D令牌化用于自回归形状生成

Yuan Li, Congyi Zhang, Xifeng Gao, Xiaohu Guo

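Affiliations: University of Texas at Dallas; Tencent America

AI summary: Proposes the SuperVoxelGPT framework, which resolves the tension between sequence length and spatial ordering in autoregressive 3D generation through adaptive, ordered supervoxel tokenization, enabling high-quality and efficient shape generation.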

详情
AI中文摘要

自回归多模态大语言模型(MLLMs)能够进行3D生成,但由于3D令牌化不足,难以扩展到高分辨率形状。基于集合的紧凑表示丢弃了确定性的空间排序,导致序列预测模糊,而均匀或基于八叉树的体素网格保留了排序,但代价是严重的冗余和过长的序列。这种结构上的权衡限制了稳定高效的自回归3D生成。我们提出了SuperVoxelGPT,一个以表示优先的框架,通过自适应且确定性的超体素令牌化解决了这一矛盾。给定提示,我们首先预测粗略的几何显著性分布,并使用显著性引导的质心Voronoi细分构建形状自适应的超体素划分,将细粒度单元分配给复杂区域,将较大单元分配给平滑区域。基于文本和有序的超体素布局,我们引入了SuperVoxelVAE,并微调预训练的MLLM以自回归生成超体素令牌。在Trellis-500K上的实验表明,SuperVoxelGPT将令牌序列长度减少到均匀体素令牌化的12.8%,同时实现了最先进的生成质量,并且相比先前方法平均加速10倍。

英文摘要

Autoregressive multimodal large language models (MLLMs) enable 3D generation but struggle to scale to high-resolution shapes due to inadequate 3D tokenizations. Compact set-based representations discard deterministic spatial ordering, leading to ambiguous sequence prediction, while uniform or octree-based voxel grids preserve ordering at the cost of severe redundancy and excessively long sequences. This structural trade-off limits stable and efficient autoregressive 3D generation. We present SuperVoxelGPT, a representation-first framework that resolves this tension through adaptive and deterministically ordered supervoxel tokenization. Given a prompt, we first predict a coarse geometric saliency distribution and construct a shape-adaptive supervoxel partition using saliency-guided centroidal Voronoi tessellation, allocating fine-grained cells to complex regions and larger cells to smooth regions. Conditioned on the text and ordered supervoxel layout, we introduce a SuperVoxelVAE and fine-tune a pretrained MLLM to autoregressively generate supervoxel tokens. Experiments on Trellis-500K show that SuperVoxelGPT reduces token sequence length to 12.8% of uniform voxel tokenization while achieving state-of-the-art generation quality and an average 10$\times$ speedup over prior methods.

2605.24700 2026-06-01 cs.CV cs.GR 版本更新

SRUG: Shadow-Guided Relightable Urban Scene with Generation Model

SRUG: 基于阴影引导的可重光照城市场景生成模型

Yonghao Zhao, Zexin Yin, Jian Yang, Beibei Wang, Jin Xie

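Affiliations: College of Computer Science, Nankai University; Nankai University; Nanjing University; School of Intelligence Science and Technology, Nanjing University

AI summary: Proposes the SRUG framework, which uses shadows to guide a 3D completion model in recovering the geometry of invisible regions and combines iterative material decomposition with a physically based lighting model to build relightable urban scenes from sparse input views.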

详情
AI中文摘要

从图像或视频创建可重光照的城市场景具有广泛用途,但高度不适定。城市环境通常是无界的,且延伸到可见区域之外。因此,场景的许多部分未被观察到,但这些不可见区域会向可见区域投射阴影。合理建模这些不可见区域投射的阴影具有挑战性,并成为创建可重光照城市场景的主要障碍。同时,稀疏的输入视图和复杂的照明条件进一步使重光照复杂化,因为它们引入了材质分解中的严重歧义。在本文中,我们提出了SRUG(Shadow-guided Relightable Urban Scene with Generation model),一种新颖的框架,旨在解决城市场景中的重光照挑战。SRUG利用阴影引导3D补全模型恢复不可见区域的几何,促进物理合理阴影的合成。此外,SRUG采用迭代材质分解方案,应用大材质模型(LMM)提供材质监督,并迭代分解场景的材质属性,实现鲁棒的材质分解。基于这些组件,我们引入了一个基于物理的光照模型,该模型捕捉城市场景的复杂照明并支持可靠的重光照。大量的定量评估和视觉比较表明,我们的方法在新视图合成和重光照任务中均优于现有方法。

英文摘要

Creating relightable urban scenes from images or videos is widely useful but highly ill-posed. Urban environments are typically unbounded and extend beyond the visible regions. As a result, many portions of the scene remain unobserved, yet these invisible regions can cast shadows onto visible areas. Reasonably modeling shadows cast by such invisible regions is challenging and poses a significant obstacle to creating relightable urban scenes. At the same time, sparse input views and complex illumination conditions further complicate relighting, as they introduce severe ambiguities in material decomposition. In this paper, we propose Shadow-guided Relightable Urban Scene with Generation model (SRUG), a novel framework designed to address relighting challenges in urban scenes. SRUG leverages shadows to guide a 3D completion model for recovering the geometry of invisible regions, promoting the synthesis of physically reasonable shadows. In addition, SRUG employs an iterative material decomposition scheme that applies the large material model (LMM) to provide material supervision and iteratively decompose the scene's material properties, enabling robust material decomposition. Building upon these components, we introduce a physically-based lighting model that captures the complex illumination of urban scenes and supports reliable relighting. Extensive quantitative evaluations and visual comparisons demonstrate that our method outperforms existing approaches in both novel view synthesis and relighting tasks.

2605.29417 2026-06-01 cs.CV 版本更新

ParCo-SDF: Learning Prior-Free Partial-to-Complete Signed Distance Fields of Deformable Objects

ParCo-SDF: 学习可变形物体的无先验部分到完整有符号距离场

Deokmin Hwang, Minseok Song, Daehyung Park

发表机构 * School of Computing, Korea Advanced Institute of Science and Technology, Korea(韩国科学技术院计算机学院)

AI总结 提出 ParCo-SDF 两阶段框架,通过时序几何编码和 FiLM 条件 SDF 预测,实现无需物体特定先验的可变形物体部分到完整几何重建。

Comments Accepted at the 23rd International Conference on Ubiquitous Robots (UR 2026), 6 pages

详情
AI中文摘要

本研究针对从点云观测到可变形物体(DOs)的部分到完整几何重建,以实现精确的 DO 操作。最近的 DO 重建方法通常采用隐式神经表示(INRs)来建模连续表面并捕捉结构变异性。然而,这些方法通常依赖于物体特定的形状先验,这虽然提高了训练稳定性,但限制了泛化能力。为了解决这个问题,我们引入了 ParCo-SDF,一个两阶段的部分到完整有符号距离场(SDF)重建框架,包括时序几何编码和随后的 FiLM 条件 SDF 预测。时序编码器捕捉 DO 序列中的结构相似性,实现无先验的稳定训练。基于 FiLM 的条件化在降低网络复杂度的同时保持了重建的表达能力。我们在橡皮筋操作数据集上评估了所提方法与最先进的 DO 表面重建基线,证明了在严重遮挡下的鲁棒和高保真重建。

英文摘要

This study addresses the partial-to-complete geometry reconstruction of deformable objects (DOs) from point-cloud observations toward precise DO manipulation. Recent DO reconstruction approaches often adopt implicit neural representations (INRs) to model continuous surfaces and capture structural variability. However, these methods typically rely on object-specific shape priors, which improve training stability but limit generalization. To address this, we introduce ParCo-SDF, a two-stage partial-to-complete signed distance field (SDF) reconstruction framework consisting of temporal geometry encoding followed by FiLM-conditioned SDF prediction. The temporal encoder captures structural similarity across DO sequences, enabling prior-free stable training. FiLM-based conditioning preserves reconstruction expressivity while reducing network complexity. We evaluate the proposed method against a state-of-the-art DO surface reconstruction baseline on a rubber band manipulation dataset, demonstrating robust and high-fidelity reconstruction under severe occlusions.
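
FiLM conditioning applies a per-feature scale and shift to hidden activations. The sketch below assumes a tiny one-hidden-layer SDF head whose `gamma` and `beta` would come from the temporal encoder's latent code; the network sizes, weights, and the name `sdf_head` are hypothetical and far smaller than any realistic model.

```python
def film(hidden, gamma, beta):
    """Feature-wise linear modulation: gamma * h + beta per feature."""
    return [g * h + b for h, g, b in zip(hidden, gamma, beta)]

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]

def sdf_head(point, w_in, w_out, gamma, beta):
    """Signed distance at a 3D query point, modulated by (gamma, beta)."""
    hidden = relu(matvec(w_in, point))
    hidden = film(hidden, gamma, beta)
    return sum(w * h for w, h in zip(w_out, hidden))

w_in = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]   # toy 2x3 input layer
w_out = [1.0, 1.0]
query = [1.0, 0.0, 0.0]
unconditioned = sdf_head(query, w_in, w_out, gamma=[1.0, 1.0], beta=[0.0, 0.0])
conditioned = sdf_head(query, w_in, w_out, gamma=[2.0, 1.0], beta=[0.5, 0.5])
```

The shape-specific information enters only through `gamma` and `beta`, so the SDF network's weights are shared across all observations, which is one way conditioning can keep the network small.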

2605.29299 2026-06-01 cs.CV cs.AI 版本更新

Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models

口袋牙医:通过高效多模态大语言模型实现设备端牙科图像理解

Kai Bian, Xucheng Guo, Bin Chen, Lingyan Ruan, Yiran Shen, Ting Dang, Hong Jia

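Affiliations: The University of Auckland, New Zealand; Shandong University, China; The University of Melbourne, Australia

AI summary: Proposes the Pocket-Dentist benchmark. An evaluation of 14 vision-language models finds that compact models (2B parameters) achieve higher accuracy in dental image understanding at lower computational cost, and a compact model is deployed on an iPhone 17 Pro with low latency.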

详情
AI中文摘要

牙科视觉语言模型的评估在数据集、任务定义和指标上仍然分散,并且常常忽略其计算成本。这限制了它们在专科中心之外的广泛部署用于牙科筛查,而及时推理、有限的硬件以及对患者图像的本地处理对于实用、保护隐私的临床预筛查至关重要。本文提出了Pocket-Dentist,一个面向牙科多模态问答的效率感知基准,它汇集了三个数据集,涵盖约1159名患者、五种任务类型和七种指标。在典型的14种VLM上,我们的结果揭示了一个有趣的观察:紧凑型VLM(例如2B参数模型)在牙科图像理解中精度更高,同时所需计算成本大幅降低。在iPhone 17 Pro上本地部署时,我们微调的紧凑型VLM Pocket-Dentist-2B处理每个样本耗时4.31秒,与7B基线相比延迟降低4.9倍,内存使用减少2.3倍。

英文摘要

Evaluations of dental vision-language models remain fragmented across datasets, task definitions and metrics, and often ignore their computational cost. This limits their widespread deployment for dental screening outside specialist centres, where timely inference, limited hardware, and local handling of patient images are vital for practical, privacy-preserving clinical prescreening. Here we present Pocket-Dentist, an efficiency-aware benchmark for dental multimodal question answering that brings together three datasets spanning approximately 1,159 patients, five task types and seven metrics. Across 14 representative VLMs, our results reveal an interesting observation: in dental image understanding, compact VLMs (e.g., 2B-parameter models) outperform larger VLMs in accuracy while requiring substantially lower computational cost. Deployed locally on an iPhone 17 Pro, our finetuned compact VLM Pocket-Dentist-2B processed each sample in 4.31 s, reducing latency by 4.9-fold and memory use by 2.3-fold compared with a 7B baseline.

2605.29198 2026-06-01 cs.CV 版本更新

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

引导对比令牌信用分配用于离散策略优化

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yuta Kyuragi, Aditya Grover

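Affiliations: UCLA; Panasonic AI Research; NVIDIA

AI summary: To address the lack of token-level credit assignment in group-advantage reinforcement learning methods, the paper proposes Guidance Contrastive Policy Optimization (GCPO), which assigns token-level advantages from contrastive predictions under positive and negative prompts and outperforms GRPO and DAPO on text-to-image generation and chain-of-thought reasoning tasks.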

Comments 21 pages, 11 figures

详情
AI中文摘要

基于组优势的强化学习方法,如GRPO和DAPO,在包括数学推理和文本到图像生成在内的多个领域展示了强大的性能。然而,它们对样本级奖励的依赖引入了一个关键限制,即所有令牌的均匀信用分配无法捕捉细粒度的令牌级贡献。为了解决这个问题,我们提出了引导对比策略优化(GCPO),一种新颖的算法,通过对比正负提示下的模型预测来实现每个令牌的信用分配。GCPO不是均匀地广播样本级优势,而是分配与这些对比预测差异成比例的令牌级优势,从而提供更精确和信息丰富的学习信号。实验上,我们发现GCPO强调语义相关区域,例如文本到图像生成中与文本提示对齐的视觉区域,以及思维链任务中推理轨迹内的关键关键词。通过大量实验,GCPO在文本到图像生成和思维链推理基准测试上始终优于GRPO和DAPO基线,证明了其作为离散策略学习的通用且可扩展优化策略的有效性。

英文摘要

Group-advantage-based reinforcement learning methods, such as GRPO and DAPO, have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, their reliance on sample-level rewards introduces a key limitation as uniform credit assignment across all tokens fails to capture fine-grained, token-level contributions. To address this issue, we propose Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Rather than uniformly broadcasting sample-level advantages, GCPO assigns token-level advantages proportional to the difference between these contrastive predictions, allowing more precise and informative learning signals. Empirically, we find that GCPO emphasizes semantically relevant regions such as visual areas aligned with textual prompts in text-to-image generation, and critical keywords within reasoning traces for chain-of-thought tasks. Through extensive experiments, GCPO consistently outperforms GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks, demonstrating its effectiveness as a general and scalable optimization strategy for discrete policy learning.
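
The contrast between broadcasting one advantage to all tokens and weighting it per token can be sketched as follows. The example assumes each token's weight is the absolute gap between its log-probabilities under the positive and the negative prompt, normalized so the mean token advantage equals the sample-level advantage. That normalization and the name `token_advantages` are assumptions; the abstract only states that token-level advantages are proportional to the contrastive difference.

```python
def token_advantages(sample_advantage, logp_pos, logp_neg):
    """Distribute one sample-level advantage over tokens by contrast gap."""
    gaps = [abs(p - n) for p, n in zip(logp_pos, logp_neg)]
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap == 0.0:
        # No contrast signal: fall back to uniform broadcasting.
        return [sample_advantage] * len(gaps)
    return [sample_advantage * g / mean_gap for g in gaps]

# Per-token log-probs under positive and negative prompts (made-up values).
logp_pos = [-1.0, -1.0, -1.0]
logp_neg = [-1.0, -3.0, -2.0]
adv = token_advantages(0.5, logp_pos, logp_neg)
```

Tokens whose prediction does not depend on the prompt receive no credit, while tokens that change most between the two prompts receive the largest share of the update.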

2604.22409 2026-06-01 cs.CV 版本更新

SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

SpaMEM:具身环境中通过感知-记忆集成进行动态空间推理的基准测试

Chih-Ting Liao, Xi Xiao, Chunlei Meng, Zhangquan Chen, Yitong Qiao, Weilin Zhou, Tianyang Wang, Xu Zheng, Xin Cao

发表机构 * The University of New South Wales(新南威尔士大学) The University of Alabama at Birmingham(阿拉巴马大学伯明翰分校) Fudan University(复旦大学) Tsinghua University(清华大学) Zhejiang University(浙江大学) Xinjiang University(新疆大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出SpaMEM基准,通过动作条件场景变换和多模态数据,分层评估多模态大模型在具身环境中的空间信念演化能力,揭示坐标一致性和视觉记忆瓶颈。

详情
AI中文摘要

多模态大语言模型(MLLMs)在静态视觉-空间推理方面取得了进展,但在具身环境中,当信念必须根据环境变化下的自我中心观察不断修正时,它们往往无法保持长期的空间连贯性。我们引入了SpaMEM(动作序列的空间记忆),这是一个大规模诊断基准,通过长交互时间内的动作条件场景变换(生成、放置、移除)来隔离空间信念演化的机制。SpaMEM基于一个物理基础数据集构建,包含来自1000个程序生成房屋中25000多个交互序列的10,601,392张高保真图像,涵盖四种模态(RGB、深度、实例、语义分割)。我们将具身空间推理形式化为一个三级层次结构,包含15个诊断任务:第1级测量单次观察的原子空间感知;第2级利用神谕文本状态历史探测时间推理,以排除感知噪声;第3级要求在同一任务维度下从原始视觉流进行端到端的信念维护。我们还评估了短期(逐步)更新和长期(情节)重建。对代表性开源VLM系列的基准测试揭示了一个一致的堆叠瓶颈:坐标一致的定位仍然是一个硬上限,从第2级到第3级的急剧下降暴露了显著的符号脚手架依赖性,即模型在基于文本的记账中成功,但难以维持稳健的视觉记忆。SpaMEM提供了一个细粒度的诊断标准,并激发了状态表示、信念修正和长期情节集成的显式机制。SpaMEM的一个子集可在https://huggingface.co/datasets/mill-ct-liao/SpaMEM公开获取。

英文摘要

Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration. A subset of SpaMEM is publicly available at https://huggingface.co/datasets/mill-ct-liao/SpaMEM.

2603.09632 2026-06-01 cs.CV cs.CL 版本更新

X-GS: An Extensible Framework for Perceiving and Thinking via 3D Gaussian Splatting

X-GS:基于3D高斯溅射的感知与思考可扩展框架

Yueen Ma, Zenglin Xu, Irwin King

发表机构 * The Chinese University of Hong Kong(香港中文大学) Fudan University(复旦大学) Shanghai Academy of AI for Science(上海人工智能科学研究院)

AI总结 提出X-GS框架,包含感知器和思考器,统一多种3DGS技术实现实时在线SLAM与语义蒸馏,并支持多模态模型完成下游任务。

详情
AI中文摘要

3D高斯溅射(3DGS)已成为新颖视图合成的强大技术,随后扩展到众多空间AI应用。然而,大多数现有3DGS方法孤立运行,专注于特定领域。本文介绍X-GS,一个包含两个主要组件的可扩展框架。X-GS-感知器统一了广泛的3DGS技术,以实现具有语义蒸馏的实时在线SLAM。X-GS-思考器容纳多模态模型,使其能够与感知器无缝交互以完成下游任务。在我们的X-GS实现中,感知器利用最新的视觉基础模型提高在线SLAM性能,并采用三种关键机制加速语义蒸馏。思考器可以基于对比和生成视觉语言模型构建,并利用感知器的语义高斯溅射解锁3D视觉定位和场景描述等功能。在多个基准上的实验结果表明了X-GS框架的高效性和新解锁的多模态能力。

英文摘要

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods operate in isolation, focusing on specific domains. In this paper, we introduce X-GS, an extensible framework consisting of two major components. The X-GS-Perceiver unifies a broad range of 3DGS techniques to enable real-time online SLAM with semantic distillation. The X-GS-Thinker accommodates multimodal models, enabling them to seamlessly interface with the Perceiver to complete downstream tasks. In our implementation of X-GS, the Perceiver leverages the latest vision foundation models to improve online SLAM performance and employs three key mechanisms to accelerate semantic distillation. The Thinker can be built upon both contrastive and generative vision-language models and utilizes the Perceiver's semantic Gaussian splats to unlock capabilities such as 3D visual grounding and scene captioning. Experimental results on diverse benchmarks demonstrate the efficiency and newly unlocked multimodal capabilities of the X-GS framework.

2602.01173 2026-06-01 cs.CV 版本更新

EEmo-Logic: A Unified Dataset and Multi-Stage Framework for Comprehensive Image-Evoked Emotion Assessment

EEmo-Logic:面向全面图像诱发情感评估的统一数据集与多阶段框架

Lancheng Gao, Ziheng Jia, Zixuan Xing, Wei Sun, Huiyu Duan, Guangtao Zhai, Xiongkuo Min

AI总结 提出最大图像诱发情感理解数据集EEmoDB和统一多模态大语言模型EEmo-Logic,通过指令微调和任务定制GRPO实现细粒度情感问答与评估。

详情
AI中文摘要

理解图像诱发情感的多维属性和强度细微差别对于提升机器共情能力和赋能多样化人机交互应用至关重要。然而,现有模型仍局限于粗粒度情感感知或推理能力不足。为弥补这一差距,我们引入了EEmoDB,这是迄今为止最大的图像诱发情感理解数据集。它包含跨越5个不同任务类别的5个分析维度,促进全面解读。具体而言,我们通过自动生成从125K张图像中整理了1.2M问答对(EEmoDB-QA),以及从25K张图像中策划了36K数据集(EEmoDB-Assess)用于细粒度评估。此外,我们提出了EEmo-Logic,一个通过指令微调和具有新颖奖励设计的任务定制组相对偏好优化(GRPO)开发的一体化多模态大语言模型(MLLM)。大量实验表明,EEmo-Logic在域内和跨域数据集上实现了稳健性能,在情感问答和细粒度评估方面表现出色。数据集和代码可在https://github.com/workerred/EEmo-Logic获取。

英文摘要

Understanding the multi-dimensional attributes and intensity nuances of image-evoked emotions is pivotal for advancing machine empathy and empowering diverse human-computer interaction applications. However, existing models are still limited to coarse-grained emotion perception or deficient reasoning capabilities. To bridge this gap, we introduce EEmoDB, the largest image-evoked emotion understanding dataset to date. It features 5 analysis dimensions spanning 5 distinct task categories, facilitating comprehensive interpretation. Specifically, we compile 1.2M question-answering (QA) pairs (EEmoDB-QA) from 125K images via automated generation, alongside a 36K dataset (EEmoDB-Assess) curated from 25K images for fine-grained assessment. Furthermore, we propose EEmo-Logic, an all-in-one multimodal large language model (MLLM) developed via instruction fine-tuning and task-customized group relative preference optimization (GRPO) with novel reward design. Extensive experiments demonstrate that EEmo-Logic achieves robust performance in in-domain and cross-domain datasets, excelling in emotion QA and fine-grained assessment. The dataset and code are available at https://github.com/workerred/EEmo-Logic.

2605.25193 2026-06-01 cs.CV 版本更新

SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

SpongeBob:同步感知的和谐视听生成式编辑

Sen Liang, Cong Wang, Fengbin Guan, Zhentao Yu, Yiting Lu, Yuanzhi Wang, Yuan Zhou, Xin Li, Zhibo Chen

发表机构 * University of Science and Technology of China(中国科学技术大学) Tencent Hunyuan(腾讯混元)

AI总结 提出首个端到端视听联合编辑框架SpongeBob,通过双向跨模态交互的同步感知机制和上下文感知模块,解决视频编辑中的音画不同步和语义冲突问题。

详情
AI中文摘要

物理世界中的视觉和声学事件本质上是耦合的,然而现有的视频编辑方法通常采用解耦的流水线,缺乏双向模态交互。这导致两个关键限制:(i) 视听不同步和(ii) 生成的音频与保留内容之间的上下文冲突。为了解决这些问题,我们提出了SpongeBob,这是第一个具有双向跨模态交互的端到端视听联合编辑框架。对于同步,同步感知机制通过双向注意力、时间对齐和空间约束将视觉编辑与声音事件对齐。对于上下文一致性,上下文感知模块利用声学和视觉上下文注意力来防止语义冲突。此外,我们引入了同步保持训练和指导(SPTG),以在不降低质量的情况下增强对齐。由于配对数据的稀缺,我们构建了一个可扩展的数据流水线和一个大规模的主题级数据集。我们还提出了SpongeBob-Bench用于系统评估。实验表明,SpongeBob显著优于现有基线,将Sync-C提高了30%,Ctx-F1提高了12.5%。我们的项目页面位于:https://hy-spongebob.github.io/。

英文摘要

Visual and acoustic events in the physical world are inherently coupled, yet existing video editing methods typically adopt decoupled pipelines, lacking bidirectional modality interaction. This results in two key limitations: (i) audio-visual desynchronization and (ii) contextual conflicts between generated audio and preserved content. To address these, we propose SpongeBob, the first end-to-end audio-visual joint editing framework featuring bidirectional cross-modal interaction. For synchronization, a Sync-Aware Mechanism aligns visual edits with sound events via bidirectional attention, temporal alignment, and spatial constraints. For contextual consistency, a Context-Aware Module leverages acoustic and visual context attention to prevent semantic clashes. Additionally, we introduce Sync-Preserving Training and Guidance (SPTG) to enhance alignment without degrading quality. Due to the scarcity of paired data, we construct a scalable data pipeline and a large-scale subject-level dataset. We also propose SpongeBob-Bench for systematic evaluation. Experiments show SpongeBob significantly outperforms existing baselines, improving Sync-C by 30% and Ctx-F1 by 12.5%. Our project page is available at: https://hy-spongebob.github.io/.

2605.22478 2026-06-01 cs.CV 版本更新

DeliCIR: Deliberative Test-Time Evolutionary Hierarchical Multi-Agents for Composed Image Retrieval

DeliCIR: 用于组合图像检索的深思型测试时进化分层多智能体

Xingtian Pei, Yukun Song, Changwei Wang, Shunpeng Chen, Rongtao Xu, Shengpeng Xu, Shibiao Xu

发表机构 * Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences)(算力网络与信息安全教育部重点实验室,山东省计算中心(国家超级计算济南中心),齐鲁工业大学(山东省科学院)) School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院) School of Artificial Intelligence, Mohamed bin Zayed University(Mohamed bin Zayed大学人工智能学院)

AI总结 提出一种分层感知-深思框架PDF,通过分层多智能体架构、意图路由管理器、决策管理器及锦标赛式测试时缩放策略,实现经验自进化与测试时缩放定律在组合图像检索中的首次应用,在三个基准数据集上达到最优性能。

Comments 10 pages, 5 figures, 4 tables

详情
AI中文摘要

组合图像检索(CIR)要求同时保留参考图像的视觉连续性并忠实执行修改文本中指定的语义变量,这构成了该任务的核心挑战。现有方法常常在单一空间中遭受感知近视,或由于底层检索器的感知上限而在迭代协作中陷入逻辑漂移。为解决这一问题,我们提出了一种一站式分层感知-深思框架(PDF),据我们所知,这是首次将经验自进化和测试时缩放定律(TTS)引入CIR。依托分层多智能体架构,PDF首先利用意图路由管理器根据修改意图动态调度多视角工作器感知信号,构建高召回候选池。随后,决策管理器结合无需训练的推理策略蒸馏机制与锦标赛式TTS(T-TTS)策略,实现自进化的细粒度推理,得出最终检索结果。实验结果表明,PDF在三个基准数据集CIRR、CIRCO和FashionIQ上均达到了最优性能。本研究表明,经验驱动的自进化和TTS是实现零样本细粒度多媒体检索的一条极具前景且可扩展的路径。代码将在论文被接收后公开。

英文摘要

Composed Image Retrieval (CIR) requires both preserving the visual continuity of the reference image and faithfully executing the semantic variables specified in the modification text, which constitute the core challenge of the task. Existing methods often suffer from Perception Myopia in a single space, or fall into Logic Drift in iterative collaboration due to the perception ceiling of the underlying retriever. To address this issue, we propose a one-stop hierarchical Perception-to-Deliberation Framework (PDF), which, to the best of our knowledge, is the first to introduce experience self-evolution and Test-Time Scaling Laws (TTS) into CIR. Relying on a hierarchical multi-agent architecture, PDF first utilizes an Intent Routing Manager to dynamically dispatch multi-view Worker perception signals based on modification intents to construct a high-recall candidate pool. Subsequently, the Decision Manager combines a Training-free Reasoning Policy Distillation mechanism with a Tournament-style TTS (T-TTS) strategy to achieve self-evolving fine-grained reasoning, yielding the final retrieval results. Experimental results demonstrate that PDF achieves SOTA performance on three benchmark datasets: CIRR, CIRCO, and FashionIQ. This study indicates that experience-driven self-evolution and TTS represent a highly promising and scalable path for achieving zero-shot fine-grained multimedia retrieval. The code will be made publicly available upon acceptance.

2605.20992 2026-06-01 cs.CV 版本更新

CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction

CHOIR: 接触感知的4D手物交互重建

Hao Xu, Yilin Liu, Yinqiao Wang, Chi-Wing Fu, Niloy J. Mitra

发表机构 * The Chinese University of Hong Kong(香港中文大学) University College London(伦敦大学学院) University College London, Adobe Research(伦敦大学学院,Adobe研究)

AI总结 提出CHOIR框架,利用接触作为显式耦合信号,从单目视频中重建手物交互的4D序列,包括手部运动、物体形状与6D姿态以及接触信息,显著提升了物体重建、物理合理性和时间一致性。

详情
AI中文摘要

我们探究是否可以将日常开放世界单目视频转化为可复用的4D交互基元:包括关节手部运动、随时间变化的物体形状与6D姿态,以及接触的时空信息。这种能力将支持真实交互的可扩展挖掘,并在重建之外,支持场景感知的合成与规划。然而,从具有挑战性的单目视频中重建手物交互(HOI)仍然困难:现有方法通常假设已知物体或精心设计的场景,且单独估计的手和物体在杂乱、遮挡和未见物体几何下容易错位。针对这一场景,我们提出CHOIR,一种面向单目相机的接触感知HOI重建框架,利用接触作为手和物体之间的显式耦合信号。CHOIR首先从开放世界视觉先验中初始化一个粗糙的、接触无关的4D HOI序列。然后引入一个生成式HOI空间修正模块,预测射线深度修正并纠正手物相对位置,随后在修正后的几何上推导出初始的逐帧接触对应关系。最后,采用带有动态更新接触约束的接触感知联合优化,强制执行几何、时间和接触一致性。在受控和具有挑战性的视频上的实验表明,CHOIR在物体重建、物理合理性和时间一致性上优于现有最先进方法。

英文摘要

We ask whether everyday open-world monocular videos can be turned into reusable 4D interaction primitives: articulated hand motion, object shape with 6D pose over time, and the when/where of contact. Such a capability would enable scalable mining of real interactions and, beyond reconstruction, support scene-aware synthesis and planning. However, reconstructing hand-object interaction (HOI) from challenging monocular videos remains difficult: methods often assume known objects or curated scenes, and separately estimated hands and objects easily become misaligned under clutter, occlusion, and unseen object geometries. Targeting this setting, we present CHOIR, a Contact-aware HOI Reconstruction framework for a monocular camera, using contact as an explicit coupling signal between hands and objects. CHOIR first initializes a coarse, contact-agnostic 4D HOI sequence from open-world visual priors. It then introduces a generative HOI spatial rectification module that predicts ray-depth corrections and rectifies hand-object relative placement, and derives initial per-frame contact correspondences on the rectified geometry. Finally, a contact-aware joint optimization with dynamically updated contact constraints enforces geometric, temporal, and contact consistency. Experiments on controlled and challenging videos show that CHOIR improves object reconstruction, physical plausibility, and temporal consistency over state-of-the-art methods.

2605.21007 2026-06-01 cs.CV cs.RO 版本更新

LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation

LiteViLNet: 轻量级视觉-激光雷达融合网络用于高效道路分割

Daojie Peng, Bingtao Wang, Fulong Ma, Liang Zhang, Jun Ma

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Shandong University(山东大学)

AI总结 提出轻量级多模态网络LiteViLNet,通过双流编码器、深度可分离卷积和多尺度特征融合模块,在KITTI数据集上以14.04M参数达到96.36% MaxF,实现精度与效率的平衡。

详情
AI中文摘要

道路分割是自动驾驶和智能机器人系统的基本感知任务,需要高精度和实时推理,特别是在资源受限的边缘设备上部署时。现有的多模态道路分割方法通常依赖重型基于Transformer的编码器以达到最先进的性能,但其巨大的计算成本阻碍了在嵌入式平台上的实时部署。为解决这一困境,我们提出了LiteViLNet,一种轻量级多模态网络,融合RGB纹理信息和LiDAR几何信息用于高效道路分割。具体来说,我们设计了双流轻量级编码器和深度可分离卷积,以最小的参数从两种模态中提取层次特征。我们进一步提出了多尺度特征融合模块(MSFM)以促进不同层次的跨模态交互,以及一个大核桥模块以线性复杂度捕获长距离依赖。在KITTI道路数据集和实际应用上的大量实验表明,LiteViLNet在准确性和效率之间取得了有希望的平衡。值得注意的是,仅用14.04M参数,我们的模型达到了96.36%的MaxF分数,在所有基于CNN的方法中排名最佳,并与更大的基于Transformer的模型相当,在RTX 4060 Ti上模型推理速度为163.79 FPS(在Jetson Orin NX上为22.18 FPS)。它在推理速度上优于许多重型方法,同时保持高度竞争的准确性,充分验证了LiteViLNet在自动驾驶和智能机器人中实时嵌入式部署的潜力。

英文摘要

Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their enormous computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose LiteViLNet, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder and depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36% MaxF score, ranking the best among all CNN-based methods and being comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on RTX 4060 Ti (22.18 FPS on Jetson Orin NX). It outperforms numerous heavy-weight methods in inference speed while maintaining highly competitive accuracy, fully validating the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.
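
The parameter saving behind depth-wise separable convolutions can be checked with a small calculation. The sketch below counts weights (biases omitted) for a standard convolution and for a depth-wise convolution followed by a 1x1 point-wise convolution. The layer size used (3x3 kernel, 64 input channels, 128 output channels) is an illustrative assumption and is not taken from LiteViLNet.

```python
def conv_params(kernel, c_in, c_out):
    """Weights in a standard k x k convolution (bias omitted)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depth-wise k x k convolution followed by a 1 x 1 point-wise convolution."""
    depthwise = kernel * kernel * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out            # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


# Hypothetical layer: 3x3 kernel, 64 -> 128 channels.
standard = conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
reduction = standard / separable
```

For this hypothetical layer the standard convolution needs 73,728 weights and the separable version 8,768, a reduction of roughly 8.4x.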

2605.18023 2026-06-01 cs.CV 版本更新

DSAA: Dual-Stage Attribute Activation for Fine-grained Open Vocabulary Detection

DSAA: 面向细粒度开放词汇检测的双阶段属性激活

Donghong Jiang, Endian Lin, Hanqing Liu, Mingjie Liu, Luoping Cui, Zhao Yang, Chuang Zhu

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Beijing E-Hualu Information Technology Co., Ltd.(北京易华录信息技术有限公司) State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China(通用人工智能国家重点实验室,BIGAI,北京,中国)

AI总结 提出DSAA框架,通过文本嵌入阶段的属性前缀适配器和BERT编码阶段的键/值调制器增强属性语义,并引入属性感知对比损失,提升细粒度开放词汇检测性能。

详情
AI中文摘要

开放词汇目标检测(OVD)模型打破了封闭集检测的限制,能够通过自然语言提示识别未见类别。然而,在涉及颜色、材质和纹理等属性的细粒度检测任务中,它们表现出明显的局限性。我们将OVD模型中的这一性能瓶颈归因于一个核心问题:当类别信号占主导时,OVD模型在推理过程中倾向于边缘化属性信息,导致属性与目标对象之间的错误绑定。为了解决这个问题,我们提出了双阶段属性激活(DSAA)框架,通过在两个关键阶段增强属性语义来提升细粒度检测能力。在文本嵌入阶段,我们采用属性前缀适配器(APA)模块生成属性前缀,注入显式的属性先验。为了进一步放大这些属性的影响,我们的键/值(K/V)调制器模块在BERT编码阶段进行干预,选择性地增强对应属性令牌的键和值向量。此外,我们引入了属性感知对比损失,以在训练过程中提高具有不同属性的同类别实例之间的区分度。在FG-OVD基准上的实验结果表明,我们的方法在各种主流开放词汇模型中均有效。

英文摘要

Open-Vocabulary Object Detection (OVD) models break the limitations of closed-set detection, enabling the identification of unseen categories through natural language prompts. However, they exhibit notable limitations in fine-grained detection tasks involving attributes like color, material, and texture. We attribute this performance bottleneck in OVD models to a core issue: when category signals dominate, OVD models tend to marginalize attribute information during inference. This leads to incorrect binding between attributes and target objects. To address this, we propose the Dual-Stage Attribute Activation (DSAA) framework, which enhances fine-grained detection capabilities by strengthening attribute semantics at two critical stages. In the text embedding stage, we employ Attribute Prefix Adapter (APA) module to generate attribute prefixes that inject explicit attribute priors. To further amplify the influence of these attributes, our Key/Value (K/V) Modulator module then intervenes during the BERT encoding phase, selectively enhancing the Key and Value vectors of the corresponding attribute tokens. In addition, we introduce an attribute-aware contrastive loss to improve discrimination among same-category instances with different attributes during training. Experimental results on the FG-OVD benchmark demonstrate the effectiveness of our method across various mainstream open-vocabulary models.
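
The idea of strengthening attribute tokens inside attention can be sketched with a toy single-head example. It assumes the modulation is a plain scalar gain applied to the Key and Value vectors at attribute positions. The function `attend`, the `gain` parameter, and the two-token prompt are hypothetical; the abstract does not specify how the K/V Modulator computes its enhancement.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values, attr_positions=(), gain=1.0):
    """Single-head attention where K/V at attr_positions are scaled by `gain` (assumed form)."""
    boosted = set(attr_positions)
    ks = [[gain * x for x in k] if i in boosted else list(k) for i, k in enumerate(keys)]
    vs = [[gain * x for x in v] if i in boosted else list(v) for i, v in enumerate(values)]
    scale = math.sqrt(len(query))
    scores = [sum(q * x for q, x in zip(query, k)) / scale for k in ks]
    weights = softmax(scores)
    output = [sum(w * v[d] for w, v in zip(weights, vs)) for d in range(len(vs[0]))]
    return weights, output


# Toy prompt "red car": position 0 is the attribute token, position 1 the category token.
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
query = [1.0, 1.0]
base_w, base_out = attend(query, keys, values)
mod_w, mod_out = attend(query, keys, values, attr_positions=[0], gain=2.0)
```

With no modulation both tokens receive equal attention. Scaling the attribute token's Key raises its attention weight, and scaling its Value increases its contribution to the output. Note that the Key scaling only helps when the query-key dot product is positive, which is one reason a learned modulator would be preferable to a fixed gain.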

2402.17672 2026-06-01 cs.CV eess.IV 版本更新

SDF2Net: Shallow to Deep Feature Fusion Network for PolSAR Image Classification

SDF2Net: 用于PolSAR图像分类的浅层到深层特征融合网络

Mohammed Q. Alkhatib, M. Sami Zitouni, Mina Al-Saad, Nour Aburaed, Hussain Al-Ahmad

AI总结 提出一种新颖的三分支复值CNN融合网络SDF2Net,通过浅层到深层特征融合提升PolSAR图像分类精度,在三个数据集上取得优于现有方法的性能。

详情
AI中文摘要

极化合成孔径雷达(PolSAR)图像包含有价值的信息,有助于广泛的土地覆盖解释并生成多样化的输出产品。从PolSAR数据中提取有意义的特征面临与光学图像不同的挑战。深度学习方法为克服PolSAR特征提取中的这些挑战提供了有效解决方案。卷积神经网络(CNN)通过利用内核能力考虑局部信息和PolSAR数据的复值性质,在捕获PolSAR图像特征中发挥关键作用。本研究提出了一种新颖的三分支复值CNN融合网络,称为浅层到深层特征融合网络(SDF2Net),用于PolSAR图像分类。为了验证所提方法的性能,使用Flevoland和San Francisco的机载合成孔径雷达(AIRSAR)数据集以及ESAR Oberpfaffenhofen数据集,将分类结果与多种最先进方法进行比较。结果表明,所提方法在总体精度上有所提升,AIRSAR数据集提升1.3%和0.8%,ESAR数据集提升0.5%。对Flevoland数据的分析强调了SDF2Net模型的有效性,即使在仅1%采样率下,总体精度也达到了96.01%。

英文摘要

Polarimetric synthetic aperture radar (PolSAR) images encompass valuable information that can facilitate extensive land cover interpretation and generate diverse output products. Extracting meaningful features from PolSAR data poses challenges distinct from those encountered in optical imagery. Deep learning (DL) methods offer effective solutions for overcoming these challenges in PolSAR feature extraction. Convolutional neural networks (CNNs) play a crucial role in capturing PolSAR image characteristics by leveraging kernel capabilities to consider local information and the complex-valued nature of PolSAR data. In this study, a novel three-branch fusion of complex-valued CNN, named the Shallow to Deep Feature Fusion Network (SDF2Net), is proposed for PolSAR image classification. To validate the performance of the proposed method, classification results are compared against multiple state-of-the-art approaches using the airborne synthetic aperture radar (AIRSAR) datasets of Flevoland and San Francisco, as well as the ESAR Oberpfaffenhofen dataset. The results indicate that the proposed approach demonstrates improvements in overall accuracy, with a 1.3% and 0.8% enhancement for the AIRSAR datasets and a 0.5% improvement for the ESAR dataset. Analyses conducted on the Flevoland data underscore the effectiveness of the SDF2Net model, revealing a promising overall accuracy of 96.01% even with only a 1% sampling ratio.

2601.15197 2026-06-01 cs.AI cs.CL cs.CV cs.RO 版本更新

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

LangForce: 通过潜在动作查询对视觉语言动作模型进行贝叶斯分解

Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen

发表机构 * Huazhong University of Science and Technology(华中科技大学) Beijing Zhongguancun Academy(北京中关村学院) Zhongguancun Institute of Artificial Intelligence(中关村人工智能研究院) Harbin Institute of Technology(哈尔滨工业大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Zhengzhou University(郑州大学) Beihang University(北京航空航天大学) East China Normal University(华东师范大学) DeepCybot Co., Ltd.(DeepCybot有限公司)

AI总结 针对VLA模型在训练中因数据偏差导致语言信息被忽略的问题,提出LangForce框架,通过贝叶斯分解和潜在动作查询构建双分支架构,最大化动作与指令的点互信息,无需新数据即可显著提升泛化能力。

Comments ICML 2026

详情
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中显示出潜力,但往往难以泛化到新指令或复杂的多任务场景。我们识别出当前训练范式中的一个关键病理:目标驱动的数据收集造成了数据集偏差。在此类数据集中,仅凭视觉观察就能高度预测语言指令,导致指令与动作之间的条件互信息消失,我们将此现象称为信息崩溃。因此,模型退化为忽略语言约束的纯视觉策略,并在分布外(OOD)设置中失败。为解决此问题,我们提出LangForce,一种通过贝叶斯分解强制执行指令跟随的新框架。通过引入可学习的潜在动作查询,我们构建了一个双分支架构,用于估计纯视觉先验 $p(a \mid v)$ 和语言条件后验 $\pi(a \mid v, \ell)$。然后我们优化策略以最大化动作与指令之间的条件点互信息(PMI)。该目标有效惩罚了视觉捷径,并奖励明确解释语言命令的动作。无需新数据,LangForce显著提升了泛化能力。在SimplerEnv和RoboCasa上的大量实验证明了显著改进,包括在具有挑战性的OOD SimplerEnv基准上提升11.3%,验证了我们的方法在动作中稳健地锚定语言的能力。

英文摘要

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
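
The conditional PMI objective named in this abstract follows the standard definition: for an action a, observation v, and instruction l, it is the log-ratio of the language-conditioned posterior to the vision-only prior, log pi(a | v, l) - log p(a | v). The sketch below evaluates that quantity for a single action. The function name and the probability values are illustrative assumptions, and how LangForce estimates or optimizes the two terms is not shown here.

```python
import math


def conditional_pmi(posterior_prob, prior_prob):
    """Return log pi(a | v, l) - log p(a | v) for one action a."""
    if posterior_prob <= 0.0 or prior_prob <= 0.0:
        raise ValueError("probabilities must be positive")
    return math.log(posterior_prob) - math.log(prior_prob)


# Vision shortcut: the instruction does not change the action probability.
shortcut = conditional_pmi(0.5, 0.5)
# Grounded action: the instruction makes the action three times more likely.
grounded = conditional_pmi(0.6, 0.2)
```

A value of zero means the instruction carried no extra information about the action, which is the shortcut case the objective is meant to penalize. Positive values indicate actions that are better explained once the instruction is taken into account.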

2511.16084 2026-06-01 cs.CV cs.AI 版本更新

SpectralTrain: A Universal Framework for Hyperspectral Image Classification

SpectralTrain:一种通用的高光谱图像分类框架

Meihua Zhou, Liping Yu, Xinyu Tong, Wai Kin Fung, Ruiguo Hu, Jiarui Zhao, Nan Wan

发表机构 * School of Medical Information, Wannan Medical University(皖南医学院信息学院) University of Chinese Academy of Sciences(中国科学院大学) The Chinese University of Hong Kong(香港中文大学) Northeastern University(东北大学)

AI总结 提出SpectralTrain通用训练框架,通过课程学习与基于PCA的光谱下采样提升高光谱图像分类效率,在多个数据集上实现2-7倍训练加速且精度损失小。

详情
AI中文摘要

高光谱图像(HSI)分类通常涉及大规模数据和计算密集的训练,这限制了深度学习模型在实际遥感任务中的部署。本研究引入SpectralTrain,一个通用的、与架构无关的训练框架,通过将课程学习(CL)与基于主成分分析(PCA)的光谱下采样相结合,提高学习效率。通过逐步引入光谱复杂性同时保留关键信息,SpectralTrain能够在显著降低计算成本的情况下高效学习光谱-空间模式。该框架独立于特定架构、优化器或损失函数,并与经典和最先进(SOTA)模型兼容。在三个基准数据集——Indian Pines、Salinas-A和新引入的CloudPatch-7上的大量实验表明,该框架在空间尺度、光谱特性和应用领域上具有很强的泛化能力。结果显示,训练时间一致减少2-7倍,精度变化取决于骨干网络。在云分类上的应用进一步揭示了其在气候相关遥感中的潜力,强调训练策略优化作为HSI模型中架构设计的有效补充。代码可在https://github.com/mh-zhou/SpectralTrain获取。

英文摘要

Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral-spatial patterns at significantly reduced computational costs. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets (Indian Pines, Salinas-A, and the newly introduced CloudPatch-7) demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent 2-7x training speedups, with small-to-moderate accuracy deltas depending on the backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at https://github.com/mh-zhou/SpectralTrain.
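
One way to picture "gradually introducing spectral complexity" is a schedule that grows the number of retained PCA components as training proceeds. The sketch below shows only such a schedule, not the PCA step itself. The linear growth rule, the function name `components_for_epoch`, and the component counts are assumptions made for illustration; the abstract does not describe the schedule SpectralTrain actually uses.

```python
def components_for_epoch(epoch, total_epochs, min_components, max_components):
    """Number of PCA components to keep at a given epoch under a linear curriculum (assumed)."""
    if total_epochs < 1 or not 0 <= epoch < total_epochs:
        raise ValueError("epoch must lie in [0, total_epochs)")
    if min_components > max_components:
        raise ValueError("min_components must not exceed max_components")
    if total_epochs == 1:
        return max_components
    # Interpolate linearly from the coarsest to the full spectral representation.
    extra = (max_components - min_components) * epoch / (total_epochs - 1)
    return min_components + round(extra)


# Hypothetical run: 4 epochs, growing from 8 to 32 spectral components.
schedule = [components_for_epoch(e, 4, 8, 32) for e in range(4)]
```

Early epochs then train on cheaper, low-dimensional inputs, and the full spectral detail is used only near the end of training.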

2605.11367 2026-06-01 cs.CV 版本更新

3D-Belief: Embodied Belief Inference via Generative 3D World Modeling

3D-Belief:通过生成式3D世界建模实现具身信念推断

Yifan Yin, Zehao Wen, Suyu Ye, Jieneng Chen, Zehan Zheng, Nanru Dai, Haojun Shi, Aydan Huang, Zheyuan Zhang, Alan Yuille, Jianwen Xie, Ayush Tewari, Tianmin Shu

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Lambda University of Cambridge(剑桥大学)

AI总结 提出3D-Belief,一种生成式3D世界模型,通过在线更新显式3D信念,使具身智能体能够在部分可观测环境中想象场景补全并推理,在2D/3D想象质量和下游物体导航任务上优于现有方法。

详情
AI中文摘要

近期视觉生成模型的进展凸显了学习生成式世界模型的前景。然而,现有大多数方法将世界建模视为新视角合成或未来帧预测,强调视觉真实感,而非部分可观测环境下具身智能体所需的结构化不确定性。在这项工作中,我们提出了一种不同的视角:世界建模作为3D空间中的具身信念推断。从这个角度看,世界模型不应仅仅渲染可能看到的景象,而应在获取新观测时维护并更新智能体关于未观测3D世界的信念。我们识别了此类模型的几个关键能力,包括空间一致的场景记忆、多假设信念采样、顺序信念更新以及基于语义的未观测区域预测。我们将这些思想实例化为3D-Belief,一种生成式3D世界模型,它从部分观测中推断出显式、可操作的3D信念,并随时间在线更新。与先前的视觉预测模型不同,3D-Belief直接在3D中表示不确定性,使具身智能体能够想象合理的场景补全并在部分可观测环境中进行推理。我们在场景记忆和未观测场景想象的2D视觉质量、使用我们提出的3D-CORE基准的物体和场景级3D想象,以及模拟和真实世界中的挑战性物体导航任务上评估了3D-Belief。实验表明,与最先进方法相比,3D-Belief提高了2D和3D想象质量以及下游具身任务性能。

英文摘要

Recent advances in visual generative models have highlighted the promise of learning generative world models. However, most existing approaches frame world modeling as novel-view synthesis or future-frame prediction, emphasizing visual realism rather than the structured uncertainty required by embodied agents acting under partial observability. In this work, we propose a different perspective: world modeling as embodied belief inference in 3D space. From this view, a world model should not merely render what may be seen, but maintain and update an agent's belief about the unobserved 3D world as new observations are acquired. We identify several key capabilities for such models, including spatially consistent scene memory, multi-hypothesis belief sampling, sequential belief updating, and semantically informed prediction of unseen regions. We instantiate these ideas in 3D-Belief, a generative 3D world model that infers explicit, actionable 3D beliefs from partial observations and updates them online over time. Unlike prior visual prediction models, 3D-Belief represents uncertainty directly in 3D, enabling embodied agents to imagine plausible scene completions and reason over partially observed environments. We evaluate 3D-Belief on 2D visual quality for scene memory and unobserved-scene imagination, object- and scene-level 3D imagination using our proposed 3D-CORE benchmark, and challenging object navigation tasks in both simulation and the real world. Experiments show that 3D-Belief improves 2D and 3D imagination quality and downstream embodied task performance compared to state-of-the-art methods.

2605.06280 2026-06-01 cs.CV 版本更新

Eulerian Motion Guidance: Robust Image Animation via Bidirectional Geometric Consistency

欧拉运动引导:基于双向几何一致性的鲁棒图像动画

Thong Nguyen, Khoi M. Le, Cong-Duy Nguyen, Luu Anh Tuan, See-Kiong Ng, Chunyan Miao

发表机构 * National University of Singapore(新加坡国立大学) Centre for AI Research, VinUniversity(Vin大学人工智能研究中心) Nanyang Technological University(南洋理工大学)

AI总结 提出使用相邻帧欧拉运动场引导生成,并通过双向几何一致性机制解决遮挡问题,实现加速训练、保持时间连贯性和减少动态伪影。

Comments Work in progress. Code is available at https://github.com/nguyentthong/eulerian_motion_guidance

详情
AI中文摘要

近期图像动画的进展利用扩散模型为静态图像注入活力。然而,现有的可控框架通常依赖于拉格朗日运动引导,其中光流是相对于初始帧估计的。本文通过更局部的监督设计重新审视相同的光流基元:我们使用相邻帧欧拉运动场来引导生成,其中运动信号始终描述一个短时间跳跃。这种转变使得并行训练成为可能,并在整个生成过程中提供有界误差监督。为了减轻相邻帧生成中常见的漂移伪影,我们引入了一种双向几何一致性机制,该机制计算前向-后向循环检查以数学识别并掩蔽遮挡区域,防止模型学习错误的扭曲目标。大量实验表明,与基于参考的基线相比,我们的方法加速了训练,保持了时间连贯性,并减少了动态伪影。

英文摘要

Recent advancements in image animation have utilized diffusion models to breathe life into static images. However, existing controllable frameworks typically rely on Lagrangian motion guidance, where optical flow is estimated relative to the initial frame. This paper revisits the same optical-flow primitive through a more local supervision design: we use adjacent-frame Eulerian motion fields to guide generation, where the motion signal always describes a short temporal hop. This shift enables parallelized training and provides bounded-error supervision throughout the generation process. To mitigate the drift artifacts common in adjacent frame generation, we introduce a Bidirectional Geometric Consistency mechanism, which computes a forward-backward cycle check to mathematically identify and mask occluded regions, preventing the model from learning incorrect warping objectives. Extensive experiments demonstrate that our approach accelerates training, preserves temporal coherence, and reduces dynamic artifacts compared to reference-based baselines.
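
A forward-backward cycle check of the kind mentioned in this abstract can be sketched in one dimension. A pixel is followed along the forward flow, the backward flow is read at the landing position, and the pixel is flagged as occluded when the round trip does not return close to the start. The 1D setting, nearest-neighbor sampling, and constant threshold are simplifying assumptions, and `occlusion_mask` is a hypothetical helper rather than the authors' implementation.

```python
def occlusion_mask(forward_flow, backward_flow, threshold=0.5):
    """Return True where a 1D pixel fails the forward-backward cycle check."""
    if len(forward_flow) != len(backward_flow):
        raise ValueError("flows must have equal length")
    width = len(forward_flow)
    mask = []
    for x, f in enumerate(forward_flow):
        target = int(round(x + f))  # nearest pixel the forward flow lands on
        if not 0 <= target < width:
            mask.append(True)       # left the frame: no valid correspondence
            continue
        # A perfect round trip gives f + b == 0 at the landing position.
        cycle_error = abs(f + backward_flow[target])
        mask.append(cycle_error > threshold)
    return mask


# Toy flows: pixels 0-2 move right by one; pixel 3 is covered in the next frame.
forward = [1.0, 1.0, 1.0, 0.0]
backward = [0.0, -1.0, -1.0, -1.0]
mask = occlusion_mask(forward, backward)
```

Masked positions would then be excluded from the warping loss, so that the model is not trained to match pixels that have no true correspondence in the adjacent frame.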

2605.08145 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Self-Captioning Multimodal Interaction Tuning: Amplifying Exploitable Redundancies for Robust Vision Language Models

自描述多模态交互调优:放大可利用冗余以实现鲁棒的视觉语言模型

Yuriel Ryan, Hei Man Ip, Adriel Kuek, Paul Pu Liang, Roy Ka-Wei Lee

发表机构 * Singapore University of Technology and Design(新加坡科技设计大学) DSO National Laboratories(国防部国家实验室) Massachusetts Institute of Technology(麻省理工学院)

AI总结 针对视觉语言模型中的幻觉和鲁棒性问题,提出自描述多模态交互调优方法,通过放大模态间冗余信息来补偿受损模态,并设计多模态交互门机制将独特交互转化为冗余交互,实验表明该方法可减少38.3%的视觉诱导错误并提升16.8%的一致性。

Comments Accepted to ICML 2026. Code: https://github.com/yurielryan/Multimodal-Interaction-Tuning

详情
AI中文摘要

当前的视觉语言模型在面对模糊或受损模态时存在幻觉和鲁棒性问题。我们假设这些问题可以通过利用模态间的共享信息来补偿受损模态得到解决。为此,我们分析了多模态交互——模态提供的冗余(共享)、独特(排他)和协同(涌现)任务相关信息——以确定它们对模型可靠性的影响。具体来说,放大冗余交互将增加这种可利用的共享信息以解决这些问题;然而,现代指令数据集通常消除冗余以优先考虑视觉定位。我们通过一个自描述工作流弥合这一差距,该工作流包含一个多模态交互门:一种将独特交互转化为冗余交互的机制。我们的发现表明,增加冗余可以减少38.3%的视觉诱导错误,并提高16.8%的一致性。

英文摘要

Current vision language models face hallucination and robustness issues when modalities are ambiguous or corrupted. We hypothesize that these issues can be addressed by exploiting the information shared between modalities to compensate for the impaired one. To this end, we analyze multimodal interactions, namely the redundant (shared), unique (exclusive), and synergistic (emergent) task-relevant information provided by the modalities, to determine their impact on model reliability. Specifically, amplifying redundant interactions would increase this exploitable shared information to resolve these issues; yet modern instruction datasets often eliminate redundancies to prioritize visual grounding. We bridge this gap through a self-captioning workflow featuring a Multimodal Interaction Gate: a mechanism to convert unique interactions into redundant interactions. Our findings suggest that increasing redundancy can reduce visually induced errors by 38.3% and improve consistency by 16.8%.

2605.06137 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Autoregressive Visual Generation Needs a Prologue

自回归视觉生成需要一个序幕

Bowen Zheng, Weijian Luo, Guang Yang, Colin Zhang, Tianyang Hu

Affiliations: The Chinese University of Hong Kong, Shenzhen; hi-Lab, Xiaohongshu Inc

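AI summary: Proposes Prologue, which generates a small set of prologue tokens prepended to the visual token sequence to bridge the reconstruction-generation gap in autoregressive image generation, substantially improving generation without degrading reconstruction quality.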

Comments Code: https://github.com/Zyriix/prologue Demo: https://huggingface.co/spaces/Zyriix/prologue-demo

详情
AI中文摘要

在这项工作中,我们提出了Prologue,一种弥合自回归(AR)图像生成中重建-生成差距的方法。Prologue不修改视觉令牌以同时满足重建和生成,而是生成一小部分序幕令牌,并将其前置到视觉令牌序列之前。这些序幕令牌仅使用AR交叉熵(CE)损失进行训练,而视觉令牌则专用于重建。这种解耦设计使我们能够通过AR模型的真实分布优化生成,而不影响重建质量,我们进一步从ELBO角度形式化了这一点。在ImageNet 256x256上,Prologue-Base在没有无分类器引导的情况下将gFID从21.01降至10.75,同时几乎保持重建不变;Prologue-Large使用标准AR模型,无需辅助语义监督,达到了具有竞争力的rFID 0.99和gFID 1.46。有趣的是,仅由AR梯度驱动,序幕令牌展现出涌现的语义结构:对16个序幕令牌进行线性探测达到35.88%的Top-1准确率,远高于标准分词器前16个令牌的23.71%;使用固定序幕令牌进行重采样保留了相似的高层语义布局。我们的结果暗示了一个新方向:通过引入单独学习的生成表示,同时保持原始表示不变,可以提升生成质量。

英文摘要

In this work, we propose Prologue, an approach to bridging the reconstruction-generation gap in autoregressive (AR) image generation. Instead of modifying visual tokens to satisfy both reconstruction and generation, Prologue generates a small set of prologue tokens prepended to the visual token sequence. These prologue tokens are trained exclusively with the AR cross-entropy (CE) loss, while visual tokens remain dedicated to reconstruction. This decoupled design lets us optimize generation through the AR model's true distribution without affecting reconstruction quality, which we further formalize from an ELBO perspective. On ImageNet 256x256, Prologue-Base reduces gFID from 21.01 to 10.75 without classifier-free guidance while keeping reconstruction almost unchanged; Prologue-Large reaches a competitive rFID of 0.99 and gFID of 1.46 using a standard AR model without auxiliary semantic supervision. Interestingly, driven only by AR gradients, prologue tokens exhibit emergent semantic structure: linear probing on 16 prologue tokens reaches 35.88% Top-1, far above the 23.71% of the first 16 tokens from a standard tokenizer; resampling with fixed prologue tokens preserves a similar high-level semantic layout. Our results suggest a new direction: generation quality can be improved by introducing a separate learned generative representation while leaving the original representation intact.

2604.26262 2026-06-01 cs.CV 版本更新

Semantic Foam: Unifying Spatial and Semantic Scene Decomposition

Semantic Foam:统一空间与语义场景分解

Amr Sharafeldin, Shrisudhan Govindarajan, Thomas Walker, Aryan Mikaeili, Daniel Rebain, Kwang Moo Yi, Andrea Tagliasacchi

Affiliations: Simon Fraser University; University of Toronto; Wayve Technologies; University of British Columbia; University of Edinburgh

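AI summary: Proposes Semantic Foam, which extends the Radiant Foam representation by combining the spatial decomposition of its Voronoi mesh with an explicit semantic feature field to achieve high-quality, consistent semantic segmentation.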

Comments 15 pages, 10 figures, Accepted to CVPR 2026 (Highlight), Project page: http://semanticfoam.github.io/

详情
AI中文摘要

现代场景重建方法,如3D高斯泼溅,能够以实时速度实现照片级真实感的新视角合成,但它们在交互式图形应用中的采用受到限制。一个主要瓶颈是与传统人工创作的3D资产相比,与这些表示进行交互的难度。尽管先前的研究尝试对这些模型施加语义分解,但在分割质量和一致性方面仍然存在重大挑战。为了解决这个问题,我们引入了Semantic Foam,将最近提出的Radiant Foam表示扩展到语义分解任务。我们的方法将Radiant Foam的Voronoi网格的自然空间体积分解与在单元级别参数化的显式语义特征场相结合。这种显式结构能够直接进行空间正则化,从而防止由遮挡或跨视图不一致监督引起的伪影——这是其他基于点的表示的常见问题。实验结果表明,与Gaussian Grouping和SAGA等最先进方法相比,我们的方法在对象级分割性能上达到或超越了它们。

英文摘要

Modern scene reconstruction methods, such as 3D Gaussian Splatting, deliver photo-realistic novel view synthesis at real-time speeds, yet their adoption in interactive graphics applications has been limited. A major bottleneck is the difficulty of interacting with these representations compared to traditional, human-authored 3D assets. While previous research has attempted to impose semantic decomposition on these models, significant challenges remain regarding segmentation quality and consistency. To address this, we introduce Semantic Foam, extending the recently proposed Radiant Foam representations to semantic decomposition tasks. Our approach integrates the natural spatial volumetric decomposition of Radiant Foam's Voronoi mesh with an explicit semantic feature field parameterized at the cell level. This explicit structure enables direct spatial regularization, which prevents artifacts caused by occlusion or inconsistent supervision across views - common pitfalls for other point-based representations. Experimental results show that our method achieves comparable or superior object-level segmentation performance compared to state-of-the-art methods like Gaussian Grouping and SAGA.

2604.27617 2026-06-01 cs.CV cs.AI 版本更新

Robust Lightweight Crack Classification for Real-Time UAV Bridge Inspection

用于实时无人机桥梁检测的鲁棒轻量级裂缝分类

Wei Li, Haisheng Li, Weijie Li, Jiandong Wang, Kaichen Ma, Luming Yang

Affiliations: Bay Area Super Bridge Maintenance Technology Center, Guangdong Provincial Highway Construction Co., Ltd., Guangdong, China; Guangdong AIHISUN Technology Co., Ltd., Guangdong, China

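AI summary: Proposes a unified lightweight CNN framework built from a lightweight backbone, a CBAM attention module, a directed robust augmentation strategy based on scene priors, and Focal Loss. On the SDNET2018 dataset it reaches an inference speed of 825 FPS with 11.21M parameters and 1.82G FLOPs, improving F1-score by 2.51% and recall by 3.95% over the baseline.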

详情
AI中文摘要

随着无人机在桥梁结构健康监测中的广泛应用,基于深度学习的自动裂缝检测已成为主要研究热点。然而,实际无人机检测仍面临四个关键挑战:弱裂缝特征、退化成像条件、严重类别不平衡以及实际无人机检测工作流程中有限的计算资源。为了解决这些问题,本文提出了一个统一的轻量级卷积神经网络框架,由四个协同组件组成:轻量级骨干网络、用于通道和空间增强的卷积块注意力模块(CBAM)、基于检测场景先验的定向鲁棒增强策略,以及用于类别不平衡下难样本学习的Focal Loss。在SDNET2018桥面数据集上的实验表明,所提方法仅以11.21M参数和1.82G FLOPs实现了825 FPS的推理速度。与基线模型相比,完整框架的F1分数提高了2.51%,召回率提高了3.95%。此外,Grad-CAM可视化表明,引入的注意力模块将模型关注点从分散区域转移到沿裂缝轨迹的精确跟踪。总体而言,本研究在准确性、速度和鲁棒性之间取得了强平衡,为无人机桥梁检测中地面站辅助的实时部署提供了实用解决方案。源代码可在 https://github.com/skylynf/AttXNet 获取。

英文摘要

With the widespread application of Unmanned Aerial Vehicles (UAVs) in bridge structural health monitoring, deep learning-based automatic crack detection has become a major research focus. However, practical UAV inspections still face four key challenges: weak crack features, degraded imaging conditions, severe class imbalance, and limited computational resources for practical UAV inspection workflows. To address these issues, this paper proposes a unified lightweight convolutional neural network framework composed of four synergistic components: a lightweight backbone network, a Convolutional Block Attention Module (CBAM) for channel and spatial enhancement, a directed robust augmentation strategy based on inspection-scene priors, and Focal Loss for hard-sample learning under class imbalance. Experiments on the SDNET2018 bridge deck dataset show that the proposed method achieves an inference speed of 825 FPS with only 11.21M parameters and 1.82G FLOPs. Compared with the baseline model, the complete framework improves the F1-score by 2.51% and recall by 3.95%. In addition, Grad-CAM visualizations indicate that the introduced attention module shifts the model's focus from scattered regions to precise tracking along crack trajectories. Overall, this study achieves a strong balance among accuracy, speed, and robustness, providing a practical solution for ground-station assisted real-time deployment in UAV bridge inspections. The source code is available at: https://github.com/skylynf/AttXNet .
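
To make the role of Focal Loss under class imbalance concrete, here is a minimal NumPy sketch of the binary form, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The function name and the defaults `gamma=2.0` and `alpha=0.25` are assumptions for illustration (they are common defaults, not values reported in this abstract), and the paper's own training code may differ.

```python
import numpy as np

def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss.

    probs: predicted probability of the positive (crack) class.
    labels: 1 for crack, 0 for no crack.
    gamma, alpha: assumed defaults, not taken from the paper.
    """
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    labels = np.asarray(labels)
    # Probability assigned to the true class of each sample.
    p_t = np.where(labels == 1, probs, 1.0 - probs)
    alpha_t = np.where(labels == 1, alpha, 1.0 - alpha)
    # (1 - p_t)^gamma shrinks the contribution of well-classified samples.
    return float(np.mean(-alpha_t * (1.0 - p_t) ** gamma * np.log(p_t)))
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is a convenient sanity check.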

2604.20395 2026-06-01 cs.CV cs.RO 版本更新

SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

SpaCeFormer: 快速无提议开放词汇3D实例分割

Chris Choy, Junha Lee, Chunghyun Park, Minsu Cho, Jan Kautz

Affiliations: NVIDIA

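AI summary: Proposes SpaCeFormer, a proposal-free space-curve transformer that segments a scene in 0.12 to 0.30 seconds, 2 to 3 orders of magnitude faster than multi-stage 2D+3D pipelines, and builds SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset. It reaches 11.1 zero-shot mAP on ScanNet200, a 2.8× improvement over the prior best proposal-free method.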

Comments Project page: https://nvlabs.github.io/SpaCeFormer/

详情
AI中文摘要

开放词汇3D实例分割是机器人和AR/VR的核心能力,但先前方法存在瓶颈:多阶段2D+3D流水线聚合基础模型输出需数百秒每场景,而伪标签端到端方法依赖碎片化掩码和外部区域提议。我们提出SpaCeFormer,一种无提议的空间曲线变换器,在标准基准上每场景运行0.12-0.30秒,比多阶段2D+3D流水线快2-3个数量级。我们将其与SpaCeFormer-3M配对,这是最大的开放词汇3D实例分割数据集(通过多视图掩码聚类和多视图VLM标注构建,包含来自7.4K场景的604K实例的3.0M多视图一致描述);其掩码召回率比先前单视图流水线高21倍(IoU>0.5时54.3% vs 2.5%)。SpaCeFormer结合空间窗口注意力与Morton曲线序列化以获得空间连贯特征,并使用RoPE增强解码器直接从学习到的查询预测实例掩码,无需外部提议。在ScanNet200上,我们实现11.1零样本mAP,比先前最佳无提议方法提升2.8倍;在ScanNet++和Replica上,我们达到22.9和24.1 mAP,超越包括使用多视图2D输入在内的所有先前方法。

英文摘要

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs in 0.12 to 0.30 seconds per scene across standard benchmarks, 2 to 3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21× higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU > 0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8× improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
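
Morton-curve serialization, which the abstract names as the way points are ordered, can be sketched in plain Python. The helper names `morton3d` and `serialize`, the voxel size, and the 10-bit quantization are assumptions for illustration only; the model's actual serialization code is not described here. A Morton (Z-order) code interleaves the bits of the quantized x, y, and z coordinates, so sorting by the code tends to keep spatially nearby points close together in the sequence.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of x, y, z into one Z-order code."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code

def serialize(points, voxel=0.05, bits=10):
    """Return point indices ordered along the Z-order curve.

    points: sequence of (x, y, z) floats; voxel and bits are assumed values.
    """
    mins = [min(p[d] for p in points) for d in range(3)]
    limit = (1 << bits) - 1
    keys = []
    for p in points:
        # Quantize each coordinate to a non-negative voxel index.
        q = [min(int((p[d] - mins[d]) / voxel), limit) for d in range(3)]
        keys.append(morton3d(q[0], q[1], q[2], bits))
    return sorted(range(len(points)), key=keys.__getitem__)
```

Window attention can then operate on contiguous chunks of the sorted sequence instead of searching for spatial neighbours explicitly.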

2604.09429 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

射线即像素:学习视频与相机轨迹的联合分布

Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, Tao Xiang

Affiliations: Meta AI

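AI summary: Proposes Rays as Pixels, a video diffusion model that represents cameras as dense ray pixels (raxels) sharing a latent space with video frames and denoises the two jointly, enabling both camera trajectory prediction and camera-controlled video generation.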

Comments Accepted to ICML 2026. 9-page main paper plus supplementary material. Project page: https://wbjang.github.io/raysaspixels/

详情
AI中文摘要

从图像恢复相机参数和从新视角渲染场景在计算机视觉和图形学中被视为独立任务。当图像覆盖稀疏或姿态模糊时,这种分离会失效,因为每个任务依赖于另一个任务的输出。我们提出Rays as Pixels,一种视频扩散模型(VDM),学习视频和相机轨迹的联合分布。据我们所知,这是首个在单一框架内预测相机姿态并进行相机控制视频生成的模型。我们将每个相机表示为密集射线像素(raxels),这是一种与视频帧位于同一潜在空间的像素对齐编码,并通过解耦自交叉注意力机制联合去噪两者。一个训练好的模型处理三个任务:从视频预测相机轨迹、沿预定义轨迹从输入图像生成视频、以及从输入图像联合合成视频和轨迹。我们在姿态估计和相机控制视频生成上进行评估,并引入闭环自一致性测试,显示模型预测的姿态及其基于这些姿态的渲染结果一致。与Plücker嵌入的消融实验证实,将相机与视频共享潜在空间显著更有效。

英文摘要

Recovering camera parameters from images and rendering scenes from novel viewpoints have been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task depends on what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. To our knowledge, this is the first model to predict camera poses and perform camera-controlled video generation within a single framework. We represent each camera as dense ray pixels (raxels), a pixel-aligned encoding that lives in the same latent space as video frames, and denoise the two jointly through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, generating video from input images along a pre-defined trajectory, and jointly synthesizing video and trajectory from input images. We evaluate on pose estimation and camera-controlled video generation, and introduce a closed-loop self-consistency test showing that the model's predicted poses and its renderings conditioned on those poses agree. Ablations against Plücker embeddings confirm that representing cameras in a shared latent space with video is substantially more effective.

2604.20650 2026-06-01 cs.CV 版本更新

MAPRPose: Mask-Aware Proposal and Amodal Refinement for Multi-Object 6D Pose Estimation

MAPRPose: 面向多目标6D姿态估计的掩膜感知提议与模态补全精化

Yang Luo, Yan Gong, Yongsheng Gao, Xiaoying Sun, Jie Zhao

Affiliations: State Key Laboratory of Robotics and Systems, Harbin Institute of Technology; School of Civil Engineering, Harbin Institute of Technology; Shenzhen Infinite Meta Robot Co., Ltd

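AI summary: Proposes MAPRPose, a two-stage framework that generates pose proposals from mask-aware correspondences and refines them robustly with amodal-driven ROI prediction. It reaches 76.5% Average Recall on the BOP benchmark, 3.1% higher than FoundationPose, with a 43x speedup in multi-object inference.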

详情
AI中文摘要

在杂乱场景中,6D物体姿态估计由于严重遮挡和传感器噪声仍然具有挑战性。我们提出MAPRPose,一个两阶段框架,利用掩膜感知对应关系进行姿态提议,并利用模态补全驱动的感兴趣区域(ROI)预测进行鲁棒精化。在掩膜感知姿态提议(MAPP)阶段,我们将2D对应关系提升到3D空间,建立可靠的关键点匹配,并基于对应关系评分生成几何一致的姿态假设,从中选择前K个候选。在精化阶段,我们引入了一个张量化渲染-比较流水线,集成了模态补全掩膜预测和ROI重新对齐(AMPR)模块。通过重建完整的物体几何并动态调整ROI,AMPR减轻了严重遮挡下的定位误差和空间错位。此外,我们的GPU加速RGB-XYZ重投影使得所有N×B个姿态假设能够在单次前向传播中同时精化。在BOP基准上评估,MAPRPose实现了76.5%的最先进平均召回率(AR),比FoundationPose高出3.1% AR,同时在多目标推理中实现了43倍加速。

英文摘要

6D object pose estimation in cluttered scenes remains challenging due to severe occlusion and sensor noise. We propose MAPRPose, a two-stage framework that leverages mask-aware correspondences for pose proposal and amodal-driven Region-of-Interest (ROI) prediction for robust refinement. In the Mask-Aware Pose Proposal (MAPP) stage, we lift 2D correspondences into 3D space to establish reliable keypoint matches and generate geometrically consistent pose hypotheses based on correspondence-level scoring, from which the top-$K$ candidates are selected. In the refinement stage, we introduce a tensorized render-and-compare pipeline integrated with an Amodal Mask Prediction and ROI Re-Alignment (AMPR) module. By reconstructing complete object geometry and dynamically adjusting the ROI, AMPR mitigates localization errors and spatial misalignment under heavy occlusion. Furthermore, our GPU-accelerated RGB-XYZ reprojection enables simultaneous refinement of all $N \times B$ pose hypotheses in a single forward pass. Evaluated on the BOP benchmark, MAPRPose achieves a state-of-the-art Average Recall (AR) of 76.5%, outperforming FoundationPose by 3.1% AR while delivering a 43x speedup in multi-object inference.

2604.10805 2026-06-01 cs.CV 版本更新

Analytical Modeling and Correction of Distance Error in Homography-Based Ground-Plane Mapping

基于单应性的地面映射中距离误差的解析建模与校正

Mateusz Szulc, Marcin Iwanowski

Affiliations: Institute of Control and Industrial Electronics, Faculty of Electrical Engineering, Warsaw University of Technology, ul. Koszykowa 75, 00-662 Warsaw, Poland

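AI summary: Derives an analytical relationship between homography perturbations and distance error, proposes two correction strategies based on regression and gradient descent, and validates their effectiveness in a large-scale simulation.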

Comments 7 pages, 4 figures

详情
AI中文摘要

从单目相机准确估计距离对于智能监控系统至关重要。在许多部署中,通过手动选择对应区域初始化的平面单应性将图像坐标映射到地面位置。这种初始化中的微小不准确性会传播为系统性的距离失真。本文推导了单应性扰动与由此产生的距离误差之间的显式关系,表明误差大致随距相机的真实距离呈二次增长。基于该模型,评估了两种简单的校正策略:基于回归的二次误差函数估计和通过基于坐标的梯度下降直接优化单应性。一项包含超过1900万个测试样本的大规模仿真研究表明,当模型可靠拟合时,回归可实现更高的峰值精度,而梯度下降在初始校准较差时具有更强的鲁棒性。这表明,在许多实际系统中,改进几何校准可能比增加模型复杂度带来更大的性能提升。

英文摘要

Accurate distance estimation from monocular cameras is essential for intelligent monitoring systems. In many deployments, image coordinates are mapped to ground positions using planar homographies initialized by manual selection of corresponding regions. Small inaccuracies in this initialization propagate into systematic distance distortions. This paper derives an explicit relationship between homography perturbations and the resulting distance error, showing that the error grows approximately quadratically with the true distance from the camera. Based on this model, two simple correction strategies are evaluated: regression-based estimation of the quadratic error function and direct optimization of the homography via coordinate-based gradient descent. A large-scale simulation study with more than 19 million test samples demonstrates that regression achieves higher peak accuracy when the model is reliably fitted, whereas gradient descent provides greater robustness against poor initial calibration. This suggests that improving geometric calibration may yield greater performance gains than increasing model complexity in many practical systems.
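
The two ingredients described above, mapping pixels to the ground plane with a homography and fitting a quadratic error model, can be sketched as follows. This is an illustrative NumPy sketch, not the paper's code: the function names are hypothetical, and fitting the error as a function of the measured distance (rather than the true distance, which is unavailable at run time) is an assumption made here.

```python
import numpy as np

def image_to_ground(H, uv):
    """Map pixel coordinates (N, 2) to ground-plane coordinates via a 3x3 homography."""
    uv = np.asarray(uv, dtype=float)
    homog = np.hstack([uv, np.ones((len(uv), 1))])
    mapped = homog @ np.asarray(H, dtype=float).T
    # Divide by the homogeneous coordinate to return to 2-D.
    return mapped[:, :2] / mapped[:, 2:3]

def fit_quadratic_error(measured, true):
    """Fit error = a*d^2 + b*d + c, where d is the measured distance."""
    measured = np.asarray(measured, dtype=float)
    true = np.asarray(true, dtype=float)
    return np.polyfit(measured, measured - true, deg=2)

def correct_distance(measured, coeffs):
    """Subtract the predicted systematic error from the measured distance."""
    measured = np.asarray(measured, dtype=float)
    return measured - np.polyval(coeffs, measured)
```

In use, a handful of reference points with known ground distances would supply the `(measured, true)` pairs for the fit, after which `correct_distance` is applied to new measurements.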

2604.10273 2026-06-01 cs.CV 版本更新

Dual-Exposure Imaging with Events

基于事件的双曝光成像

Mingyuan Lin, Hongyi Liu, Chu He, Wen Yang, Gui-Song Xia, Lei Yu

Affiliations: School of Electronic Information, Wuhan University; School of Artificial Intelligence, Wuhan University

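AI summary: Proposes E-DEI, an event-assisted dual-exposure imaging algorithm that uses the high temporal resolution of event cameras to align and fuse features from dual-exposure images, removing motion artifacts and exposure discrepancies and improving low-light image quality.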

详情
AI中文摘要

通过结合短曝光和长曝光图像的互补优势,双曝光成像(DEI)在低光场景下增强了图像质量。然而,现有的DEI方法由于场景运动导致的空间位移和不同曝光时间引起的图像特征差异,不可避免地会产生伪影。为了解决这个问题,我们提出了一种新颖的基于事件的双曝光成像(E-DEI)算法,该算法从双曝光图像对和事件中重建高质量图像,利用事件相机的高时间分辨率提供准确的帧间/帧内动态信息。具体来说,我们将这个复杂任务分解为两个子任务的集成,即基于事件的运动去模糊和低光图像增强任务,这指导我们将E-DEI网络设计为双路径并行特征传播架构。我们提出了一个双路径特征对齐与融合(DFAF)模块,以在事件的辅助下有效地对齐和融合从双曝光图像中提取的特征。此外,我们构建了一个包含配对低/正常光图像和事件的真实世界数据集(PIED)。在多个数据集上的实验表明了我们方法的优越性。代码和数据集可在GitHub上获取。

英文摘要

By combining the complementary benefits of short- and long-exposure images, Dual-Exposure Imaging (DEI) enhances image quality in low-light scenarios. However, existing DEI approaches inevitably produce artifacts due to spatial displacement from scene motion and image feature discrepancies from different exposure times. To tackle this problem, we propose a novel Event-based DEI (E-DEI) algorithm, which reconstructs high-quality images from dual-exposure image pairs and events, leveraging the high temporal resolution of event cameras to provide accurate inter-/intra-frame dynamic information. Specifically, we decompose this complex task into two integrated sub-tasks, event-based motion deblurring and low-light image enhancement, which guides us to design the E-DEI network as a dual-path parallel feature propagation architecture. We propose a Dual-path Feature Alignment and Fusion (DFAF) module to effectively align and fuse features extracted from dual-exposure images with the assistance of events. Furthermore, we build a real-world Dataset containing Paired low-/normal-light Images and Events (PIED). Experiments on multiple datasets show the superiority of our method. The code and dataset are available on GitHub.

2603.26885 2026-06-01 cs.CV 版本更新

TTE-CAM: Self-Explainable Class Activation Maps for Pretrained Black-Box CNNs

TTE-CAM:用于预训练黑盒CNN的自解释类激活图

Kerol Djoumessi, Philipp Berens

Affiliations: Hertie Institute for AI in Brain Health, University of Tübingen, Germany

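AI summary: Proposes TTE-CAM, a framework that turns pretrained black-box CNNs into self-explainable models by replacing the classification head with a convolution, preserving predictive performance while providing faithful explanations.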

Comments Accepted at MIDL 2026 in the short paper track

详情
AI中文摘要

卷积神经网络在医学图像分析中取得了最先进的性能,但仍然不透明,限制了在高风险临床环境中的采用。现有方法面临一个基本权衡:事后方法提供不忠实的近似解释,而固有可解释架构是忠实的,但往往牺牲预测性能。我们引入TTE-CAM,一个测试时框架,通过基于原始权重初始化的卷积替换其分类头,将预训练的黑盒CNN转换为自解释模型,从而弥合这一差距。所得模型保留了黑盒预测性能,同时提供了与事后方法在定性和定量上都具有竞争力的内置忠实解释。代码可在 https://github.com/kdjoumessi/Test-Time-Explainability 获取。

英文摘要

Convolutional neural networks (CNNs) achieve state-of-the-art performance in medical image analysis yet remain opaque, limiting adoption in high-stakes clinical settings. Existing approaches face a fundamental trade-off: post-hoc methods provide unfaithful approximate explanations, while inherently interpretable architectures are faithful but often sacrifice predictive performance. We introduce TTE-CAM, a test-time framework that bridges this gap by converting pretrained black-box CNNs into self-explainable models via a convolution-based replacement of their classification head, initialized from the original weights. The resulting model preserves black-box predictive performance while delivering built-in faithful explanations competitive with post-hoc methods, both qualitatively and quantitatively. The code is available at https://github.com/kdjoumessi/Test-Time-Explainability
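
Why a classification head can be swapped for a convolution without changing predictions can be seen from the classical class activation map identity: averaging features and then applying a linear layer gives the same logits as applying the same weights as a 1x1 convolution and then averaging. The NumPy sketch below demonstrates only that identity. It assumes a head made of global average pooling plus one linear layer, and it is not claimed to reproduce the TTE-CAM construction itself; the function names are hypothetical.

```python
import numpy as np

def gap_linear_head(features, weight, bias):
    """Standard head: global average pooling followed by a linear layer.

    features: (K, H, W) feature maps; weight: (C, K); bias: (C,).
    """
    return weight @ features.mean(axis=(1, 2)) + bias

def conv_head(features, weight, bias):
    """Same weights applied as a 1x1 convolution, then spatial averaging.

    Returns the logits and the per-class spatial maps of shape (C, H, W).
    """
    class_maps = np.einsum("ck,khw->chw", weight, features)
    return class_maps.mean(axis=(1, 2)) + bias, class_maps
```

Because both operations are linear, the logits match up to floating-point error, while the convolutional form additionally exposes a spatial map per class that can be read as an explanation.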

2511.11440 2026-06-01 cs.CV cs.CL 版本更新

Synthetic Stimuli, Real Gains: Rethinking VLM Fine-Tuning Through Fully Controlled Data Generation

合成刺激,真实收益:通过完全受控的数据生成重新思考VLM微调

Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi

Affiliations: Signals and Interactive Systems Lab, University of Trento

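AI summary: Proposes a fully controlled data generation and annotation pipeline for fine-tuning vision language models (VLMs), removing bias through balanced distributions and clean annotations. On a spatial reasoning task, as few as 130 samples yield uniform performance, and performance on real-world data improves by 13%.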

详情
AI中文摘要

通过微调获得的视觉语言模型(VLM)的性能提升通常基于对真实世界场景的临时数据收集和标注。尽管有所改进,但这一过程往往容易受到偏差、错误和分布不平衡的影响,导致过拟合和性能不平衡。虽然少数研究探索了合成数据生成,但它们通常缺乏对数据分布和标注质量的控制。在这项工作中,我们通过探索完全受控的数据生成和标注流程,重新评估了模型微调的潜力,获得了具有平衡分布和干净标注的无偏差数据。以识别物体绝对位置的空间推理任务作为用例,我们微调了最先进的VLM,并在合成和真实世界基准上进行了详尽的评估,包括对真实世界场景的可迁移性。我们的实验揭示了两个关键发现:1)在平衡数据上微调可以在视觉场景中产生均匀的性能,并且仅用130个样本就能缓解常见偏差;2)在合成刺激上微调使真实世界数据(COCO)的性能提升了13%,优于在完整COCO训练集上微调的模型。

英文摘要

Performance gains of Vision Language Models (VLMs) obtained by fine-tuning are generally based on ad hoc data collection and annotation of real-world scenes. Despite the improvements, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have explored synthetic data generation, they typically lack control over data distribution and annotation quality. In this work, we re-evaluate the potential of model fine-tuning by exploring a fully controlled data generation and annotation pipeline, obtaining bias-free data with balanced distribution and clean annotations. Using the spatial reasoning task of identifying the absolute position of an object as a use case, we fine-tune state-of-the-art VLMs and conduct exhaustive evaluations on both synthetic and real-world benchmarks, including transferability to real-world scenes. Our experiments reveal two key findings: 1) fine-tuning on balanced data yields uniform performance across the visual scene and mitigates common biases with as few as 130 samples; and 2) fine-tuning on synthetic stimuli improves performance by 13% on real-world data (COCO), outperforming models fine-tuned on the full COCO train set.

2603.19862 2026-06-01 cs.CV cs.LG 版本更新

IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment

IsoCLIP: 分解CLIP投影器以实现高效的模态内对齐

Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov

Affiliations: Media Integration and Communication Center (MICC), University of Florence, Italy; Department of Computer Science, Universitat Autònoma de Barcelona, Spain; Computer Vision Center, Barcelona, Spain; IDEAS Research Institute, Warsaw, Poland

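AI summary: Analyzes the spectral properties of CLIP projectors, identifying an inter-modally aligned subspace and anisotropic directions, and proposes IsoCLIP, a training-free method that removes the anisotropic directions to improve intra-modal alignment. It lowers latency and outperforms existing methods on intra-modal retrieval and classification tasks.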

Comments Accepted at CVPR 2026

详情
AI中文摘要

视觉-语言模型如CLIP被广泛用于涉及视觉和文本模态的跨模态任务。然而,当个体模态编码器应用于固有的模态内任务(如图像到图像检索)时,其性能因模态内错位而受损。本文研究CLIP中的模态内错位,重点关注将投影前图像和文本嵌入映射到共享嵌入空间的投影器的作用。通过分析应用于投影特征的余弦相似度形式及其与对比CLIP损失的交互,我们发现在训练期间存在一个负责对齐两种模态的跨模态算子,以及第二个仅强制执行模态内归一化但不促进模态内对齐的模态内算子。通过对跨模态算子的谱分析,我们识别出一个近似各向同性的子空间,其中两种模态良好对齐,以及每个模态特有的各向异性方向。我们证明该对齐子空间可以直接从投影器权重中获得,并且去除各向异性方向可改善模态内对齐。我们在模态内检索和分类基准上的实验表明,我们的无训练方法减少了模态内错位,大大降低了延迟,并在多个预训练的类CLIP模型上优于现有方法。代码公开于:https://github.com/simomagi/IsoCLIP。

英文摘要

Vision-Language Models like CLIP are extensively used for inter-modal tasks which involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from the intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.

2509.25269 2026-06-01 eess.IV cs.CV cs.LG cs.NA math.NA physics.optics 版本更新

Position-Blind Ptychography: Viability of image reconstruction via data-driven variational inference

位置盲叠层成像:通过数据驱动变分推断进行图像重建的可行性

Simon Welker, Lorenz Kuger, Tim Roith, Berthy Feng, Martin Burger, Timo Gerkmann, Henry Chapman

Affiliations: Department of Informatics, University of Hamburg; Center for Free-Electron Laser Science CFEL, Deutsches Elektronen-Synchrotron DESY; Department of Mathematics, Bundesstr. 55, University of Hamburg; CIT School, Technical University of Munich; Munich Center for Machine Learning, München; Massachusetts Institute of Technology (MIT); The NSF AI Institute for Artificial Intelligence and Fundamental Interactions; Helmholtz Imaging, Deutsches Elektronen-Synchrotron DESY

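AI summary: Addresses position-blind ptychography, a new blind inverse problem, by using score-based diffusion models as a data-driven prior and variational inference to jointly recover scan positions and the image, demonstrating the viability of image reconstruction in a simulated, simplified 2-D variant.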

详情
AI中文摘要

在这项工作中,我们提出并研究了位置盲叠层成像这一新颖的盲逆问题,即在没有任何扫描位置信息的情况下进行叠层相位恢复,必须与图像联合恢复扫描位置。该问题的动机来自单粒子衍射X射线成像,其中随机取向的粒子被照射并收集一组衍射图案。如果使用高度聚焦的X射线束,测量结果也会对每个粒子的光束位置敏感,从而成为叠层成像,但这些位置也是未知的。我们通过使用基于分数的扩散模型作为现代数据驱动图像先验,采用变分推断,在模拟的简化二维变体中研究了这个困难问题的图像重建可行性。我们发现,在适当的照明结构和强先验条件下,即使在测量噪声下,除了最困难的成像场景外,所有情况下都能实现可靠且成功的图像重建。

英文摘要

In this work, we present and investigate the novel blind inverse problem of position-blind ptychography, i.e., ptychographic phase retrieval without any knowledge of scan positions, which then must be recovered jointly with the image. The motivation for this problem comes from single-particle diffractive X-ray imaging, where particles in random orientations are illuminated and a set of diffraction patterns is collected. If one uses a highly focused X-ray beam, the measurements would also become sensitive to the beam positions relative to each particle and therefore ptychographic, but these positions are also unknown. We investigate the viability of image reconstruction in a simulated, simplified 2-D variant of this difficult problem, using variational inference with modern data-driven image priors in the form of score-based diffusion models. We find that, with the right illumination structure and a strong prior, one can achieve reliable and successful image reconstructions even under measurement noise, in all except the most difficult evaluated imaging scenario.

2603.10422 2026-06-01 cs.CV 版本更新

World2Act: Latent Action Post-Training from World Model Dynamics

World2Act:基于世界模型动力学的潜在动作后训练

An Dinh Vuong, Tuan Van Vo, Abdullah Sohail, Haoran Ding, Liang Ma, Xiaodan Liang, Anqing Duan, Ivan Laptev, Ian Reid

Affiliations: MBZUAI

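AI summary: Proposes World2Act, a framework that aligns world-model dynamics with action embeddings in latent space, avoiding pixel-level supervision, and improves the success rate of VLA policies in simulation and on a real robot.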

Comments Updated version. Project page: https://wm2act.github.io/

详情
AI中文摘要

世界模型(WMs)通过提供动力学先验,为后训练视觉-语言-动作(VLA)策略提供了一种有前景的机制,可改善任务和场景变化下的泛化能力。然而,大多数基于WM的后训练方法依赖像素空间监督,使得策略对不完美的WM rollout引入的视觉伪影敏感。我们提出World2Act,一种潜在空间后训练框架,无需像素空间监督即可将WM动力学迁移到VLA策略。World2Act分两个阶段运行:1)通过对比对齐WM动力学潜在变量与动作嵌入,诱导共享的视频-动作潜在空间;2)通过引导策略动作表示朝向WM想象的动力学而非解码像素,对VLA进行后训练。基于GR00T-N1.6,World2Act在仿真基准(RoboCasa、LIBERO、Bridge-SIMPLER)上实现了高达+2.5%的绝对成功率提升,在真实机器人上比微调VLA基线提升了+6.7%。值得注意的是,它比像素空间WM监督高出高达+6.0%,包括在LIBERO上像素监督导致基线退化的情况下,这表明潜在WM动力学为像素空间迁移提供了一种更稳定的基于WM的后训练替代方案。

英文摘要

World Models (WMs) offer a promising mechanism for post-training Vision-Language-Action (VLA) policies by providing dynamics priors that improve generalization under task and scene variation. However, most WM-based post-training methods rely on pixel-space supervision, making policies sensitive to visual artifacts introduced by imperfect WM rollouts. We present World2Act, a latent-space post-training framework that transfers WM dynamics to the VLA policy without pixel-space supervision. World2Act operates in two stages: 1) it induces a shared video-action latent space by contrastively aligning WM-dynamics latents with action embeddings, and 2) it post-trains the VLA by guiding policy action representations toward WM-imagined dynamics rather than decoded pixels. Built on GR00T-N1.6, World2Act delivers absolute success-rate gains of up to +2.5% on simulation benchmarks (RoboCasa, LIBERO, Bridge-SIMPLER) and +6.7% on a real robot over finetuned VLA baselines. Notably, it outperforms pixel-space WM supervision by up to +6.0%, including on LIBERO where pixel supervision degrades the baseline, suggesting that latent WM dynamics offer a more stable WM-based post-training alternative to pixel-space transfer.

2603.09787 2026-06-01 cs.CV cs.LG 版本更新

What is Missing? Explaining Neurons Activated by Absent Concepts

缺失的是什么?解释被缺失概念激活的神经元

Robin Hesse, Simone Schaub-Meyer, Janina Hesse, Bernt Schiele, Stefan Roth

Affiliations: Max Planck Institute for Informatics, SIC; Department of Computer Science, Technical University of Darmstadt; Leibniz Institute for Resilience Research; Institute for Quantitative and Computational Biosciences, Johannes Gutenberg University Mainz; University Medical Center Mainz

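AI summary: Addresses encoded absences, an overlooked causal relationship in deep neural networks in which the absence of a concept activates a neuron, and proposes two extensions to attribution and feature visualization methods that reveal and explain them. Experiments show that ImageNet models exploit such absences and that accounting for them improves debiasing.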

Comments ICML 2025 | Code: https://github.com/visinf/what-is-missing

详情
AI中文摘要

可解释人工智能(XAI)旨在通过估计模型的简化因果结构,提供对深度神经网络(DNN)行为的人类可解释洞察。在现有工作中,这种因果结构通常包括概念的存在与神经元强激活之间的关系。例如,归因方法主要识别对预测贡献最大的输入像素,而特征可视化方法揭示导致目标神经元高激活的输入——前者隐含假设相关信息存在于输入中,后者假设神经元编码概念的存在。然而,一种很大程度上被忽视的因果关系是编码缺失,即概念的缺失会增加神经元的激活。在这项工作中,我们展示了这种缺失但相关的概念是常见的,并且主流XAI方法在标准形式下难以揭示它们。为了解决这个问题,我们提出了两种简单的扩展,分别应用于归因和特征可视化技术,以揭示编码缺失。通过实验,我们展示了如何使用主流XAI方法揭示和解释编码缺失,ImageNet模型如何利用它们,以及考虑它们时如何改进去偏。

英文摘要

Explainable artificial intelligence (XAI) aims to provide human-interpretable insights into the behavior of deep neural networks (DNNs), typically by estimating a simplified causal structure of the model. In existing work, this causal structure often includes relationships where the presence of a concept is associated with a strong activation of a neuron. For example, attribution methods primarily identify input pixels that contribute most to a prediction, and feature visualization methods reveal inputs that cause high activation of a target neuron - the former implicitly assuming that the relevant information resides in the input, and the latter that neurons encode the presence of concepts. However, a largely overlooked type of causal relationship is that of encoded absences, where the absence of a concept increases neural activation. In this work, we show that such missing but relevant concepts are common and that mainstream XAI methods struggle to reveal them when applied in their standard form. To address this, we propose two simple extensions to attribution and feature visualization techniques that uncover encoded absences. Across experiments, we show how mainstream XAI methods can be used to reveal and explain encoded absences, how ImageNet models exploit them, and that debiasing can be improved when considering them.

2603.08385 2026-06-01 eess.IV cs.CV 版本更新

Rectified flow-based prediction of post-treatment brain MRI from pre-radiotherapy priors for patients with glioma

基于整流流、利用放疗前先验预测胶质瘤患者治疗后脑MRI

Selena Huisman, Nordin Belkacemi, Vera C. Keil, Joost Verhoeff, Szabolcs David

发表机构 * Amsterdam UMC, Department of Radiation Oncology(阿姆斯特丹大学医学中心放射肿瘤科) Cancer Center Amsterdam, Imaging and Biomarkers(阿姆斯特丹癌症中心影像与生物标志物) Department of Radiology and Nuclear Medicine(放射科与核医学科) Amsterdam UMC, Vrije Universiteit Amsterdam, Amsterdam Neuroscience, Brain Imaging(阿姆斯特丹大学医学中心、阿姆斯特丹自由大学、阿姆斯特丹神经科学、脑成像)

AI总结 提出一种基于整流流的条件图像生成模型,利用放疗前MRI和剂量图预测治疗后任意时间点的脑MRI,实现快速推理并保持语义和视觉保真度。

Comments 10 pages, 6 figures, 1 supplementary table, added GitHub url, corrected figure captions

详情
AI中文摘要

脑肿瘤平均导致20年的寿命损失。标准疗法会引起大脑复杂的结构变化,这些变化通过MRI监测。人工智能的最新进展使得从临床数据中进行条件多模态图像生成成为可能。在本研究中,我们通过条件图像生成探索了颅内肿瘤患者随访MRI的AI驱动生成。该方法能够对放疗后变化进行真实建模,从而优化治疗。使用公开的SAILOR数据集(25名患者)创建了一个二维整流流模型,该模型以治疗前MRI轴向切片和RT剂量图为条件。采用交叉注意力条件化来整合时间和化疗数据。通过结构相似性指数(SSIM)、峰值信噪比(PSNR)、Dice分数和雅可比行列式对生成的图像进行验证。所生成的模型能够生成任意时间点的真实随访MRI,同时整合治疗信息。比较真实图像与预测图像,SSIM为0.88,PSNR为22.82。真实与预测MRI的组织分割平均Dice-Sørensen系数(DSC)为0.91。整流流(RF)模型的推理速度比去噪扩散概率模型(DDPM)快250倍。所提出的模型能够实时生成真实的随访MRI,并通过图像质量指标和组织分割确认其保持了语义和视觉保真度。条件生成允许通过改变治疗参数进行反事实模拟,产生预测的形态学变化。该能力有望支持颅内肿瘤患者的适应性治疗剂量规划和个性化预后预测。代码将在同行评审发表后提供:https://github.com/SelenaIHuisman/RF-GlioPREDICT

英文摘要

Brain tumors result in 20 years of lost life on average. Standard therapies induce complex structural changes in the brain that are monitored through MRI. Recent developments in artificial intelligence (AI) enable conditional multimodal image generation from clinical data. In this study, we investigate AI-driven generation of follow-up MRI in patients with intracranial tumors through conditional image generation. This approach enables realistic modeling of post-radiotherapy changes, allowing for treatment optimization. The public SAILOR dataset of 25 patients was used to create a 2D rectified flow model conditioned on axial slices of pre-treatment MRI and RT dose maps. Cross-attention conditioning was used to incorporate temporal and chemotherapy data. The resulting images were validated with structural similarity index measure (SSIM), peak signal-to-noise ratio (PSNR), Dice scores and Jacobian determinants. The resulting model generates realistic follow-up MRI for any time point, while integrating treatment information. Comparing real versus predicted images, SSIM is 0.88, and PSNR is 22.82. Tissue segmentations from real versus predicted MRI result in a mean Dice-Sørensen coefficient (DSC) of 0.91. The rectified flow (RF) model enables up to 250x faster inference than Denoising Diffusion Probabilistic Models (DDPM). The proposed model generates realistic follow-up MRI in real-time, preserving both semantic and visual fidelity as confirmed by image quality metrics and tissue segmentations. Conditional generation allows counterfactual simulations by varying treatment parameters, producing predicted morphological changes. This capability has potential to support adaptive treatment dose planning and personalized outcome prediction for patients with intracranial tumors. Code will be available upon peer-reviewed publication at: https://github.com/SelenaIHuisman/RF-GlioPREDICT
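The abstract reports PSNR and the Dice-Sørensen coefficient among its validation metrics. As a rough illustration of how these two scores are commonly defined, here is a minimal sketch in plain Python. It assumes images are flattened lists of intensities scaled to [0, 1] and segmentations are flattened binary masks; the function names `psnr` and `dice` are hypothetical and are not taken from the authors' repository.

```python
import math


def psnr(reference, predicted, max_value=1.0):
    """Peak signal-to-noise ratio in dB for two equal-length intensity lists."""
    if len(reference) != len(predicted) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the real and the predicted image.
    mse = sum((r - p) ** 2 for r, p in zip(reference, predicted)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def dice(mask_a, mask_b):
    """Dice-Sorensen coefficient 2|A & B| / (|A| + |B|) for two binary masks."""
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must have equal length")
    a = [bool(v) for v in mask_a]
    b = [bool(v) for v in mask_b]
    overlap = sum(1 for x, y in zip(a, b) if x and y)
    total = sum(a) + sum(b)
    # Two empty masks agree perfectly by convention (an assumption of this sketch).
    return 1.0 if total == 0 else 2.0 * overlap / total


# Toy data only: four "pixels" and four mask entries.
real = [0.0, 0.5, 1.0, 0.5]
pred = [0.1, 0.5, 0.9, 0.5]
print(round(psnr(real, pred), 2))
print(dice([1, 1, 0, 0], [1, 0, 0, 0]))
```

A higher PSNR means a smaller pixel-wise error, while Dice measures overlap of segmented regions, which is presumably why the abstract reports both image-quality and tissue-segmentation agreement.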

2603.07751 2026-06-01 cs.CV cs.CL 版本更新

3ViewSense: Spatial and Mental Perspective Reasoning from Orthographic Views in Vision-Language Models

3ViewSense: 视觉-语言模型中基于正交视图的空间与心理视角推理

Shaoxiong Zhan, Yanlin Lai, Zheng Liu, Hai Lin, Shen Li, Xiaodong Cai, Zijian Lin, Wen Huang, Hai-Tao Zheng

发表机构 * Shenzhen International Graduate School, Tsinghua University, Shenzhen, China(清华大学深圳国际研究生院) School of Software Engineering, Chongqing University, Chongqing, China(重庆大学软件学院)

AI总结 提出3ViewSense框架,通过正交视图的“模拟-推理”机制解决视觉-语言模型在空间推理中的视角一致性问题,显著提升遮挡计数和空间推理性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

当前大型语言模型已达到奥林匹克级别的逻辑推理能力,然而视觉-语言模型在诸如积木计数等基础空间任务上却出人意料地表现不佳。这种能力不匹配揭示了一个关键的“空间智能鸿沟”,即模型无法从2D观测中构建连贯的3D心理表征。我们通过诊断分析发现,这一瓶颈在于缺乏视角一致的空间接口,而非视觉特征不足或推理能力薄弱。为弥合这一鸿沟,我们引入了3ViewSense框架,该框架将空间推理建立在正交视图之上。借鉴工程认知,我们提出了一种“模拟-推理”机制,将复杂场景分解为规范的正交投影以解决几何歧义。通过将自我中心感知与这些异中心参考对齐,我们的方法促进了显式的心理旋转与重建。在空间推理基准上的实验结果表明,我们的方法显著优于现有基线,在遮挡密集计数和视角一致空间推理上取得了一致的提升。该框架还提高了空间描述的稳定性和一致性,为多模态系统中更强的空间智能提供了一条可扩展的路径。(https://github.com/Jasaxion/3ViewSense)

英文摘要

Current Large Language Models have achieved Olympiad-level logic, yet Vision-Language Models paradoxically falter on elementary spatial tasks like block counting. This capability mismatch reveals a critical "spatial intelligence gap," where models fail to construct coherent 3D mental representations from 2D observations. We uncover this gap via diagnostic analyses showing the bottleneck is a missing view-consistent spatial interface rather than insufficient visual features or weak reasoning. To bridge this, we introduce 3ViewSense, a framework that grounds spatial reasoning in Orthographic Views. Drawing on engineering cognition, we propose a "Simulate-and-Reason" mechanism that decomposes complex scenes into canonical orthographic projections to resolve geometric ambiguities. By aligning egocentric perceptions with these allocentric references, our method facilitates explicit mental rotation and reconstruction. Empirical results on spatial reasoning benchmarks demonstrate that our method significantly outperforms existing baselines, with consistent gains on occlusion-heavy counting and view-consistent spatial reasoning. The framework also improves the stability and consistency of spatial descriptions, offering a scalable path toward stronger spatial intelligence in multimodal systems. (https://github.com/Jasaxion/3ViewSense)

2602.22968 2026-06-01 cs.AI cs.CV cs.CY 版本更新

Certified Circuits: Stability Guarantees for Mechanistic Circuits

认证电路:机械论电路的稳定性保证

Alaa Anani, Tobias Lorenz, Bernt Schiele, Mario Fritz, Jonas Fischer

发表机构 * Max Planck Institute for Informatics(马克斯·普朗克信息学研究所) CISPA Helmholtz Center for Information Security(CISPA亥姆霍兹信息安全中心)

AI总结 提出Certified Circuits框架,通过随机数据子采样认证电路组件(神经元或边)对概念数据集编辑距离扰动的稳定性,生成更紧凑、更准确的电路。

Comments Accepted at ICML 2026

详情
AI中文摘要

理解神经网络如何得出其预测对于调试、审计和部署至关重要。机械论可解释性通过识别电路——负责特定行为的最小子网络——来追求这一目标。然而,现有的电路发现方法脆弱:电路强烈依赖于所选的概念数据集,并且常常无法迁移到分布外,引发对其是否捕捉概念或仅仅是数据集特定伪影的怀疑。我们引入了Certified Circuits,它为电路发现提供了可证明的稳定性保证。我们的框架用随机数据子采样包装任何黑盒发现算法,以认证电路组件——根据基础算法,模型图的神经元或边——的包含决策对概念数据集的有界编辑距离扰动是不变的。不稳定的组件被弃用,从而产生更紧凑、更准确的电路。我们在三个架构(ResNet、ViT、GPT-2)上,针对视觉(ImageNet和四个OOD数据集)和语言(IOI、IOI-Hard、Greater-Than)任务进行了验证。认证电路实现了高达56%的更高准确率和高达80%的更少组件,并且在基线退化时保持可靠。Certified Circuits通过产生可证明稳定且与目标概念更好对齐的机械论解释,将电路发现置于形式化的基础上。代码:https://github.com/AlaaAnani/certified-circuits。

英文摘要

Understanding how neural networks arrive at their predictions is essential for debugging, auditing, and deployment. Mechanistic interpretability pursues this goal by identifying circuits, i.e., minimal subnetworks responsible for specific behaviors. However, existing circuit discovery methods are brittle: circuits depend strongly on the chosen concept dataset and often fail to transfer out-of-distribution, raising doubts about whether they capture the concept or merely dataset-specific artifacts. We introduce Certified Circuits, which provide provable stability guarantees for circuit discovery. Our framework wraps any black-box discovery algorithm with randomized data subsampling to certify that inclusion decisions over circuit components (neurons or edges of the model graph, depending on the base algorithm) are invariant to bounded edit-distance perturbations of the concept dataset. The framework abstains on unstable components, yielding circuits that are more compact and more accurate. We validate the approach across three architectures (ResNet, ViT, GPT-2) on vision (ImageNet and four OOD datasets) and language (IOI, IOI-Hard, Greater-Than) tasks. Certified circuits achieve up to 56% higher accuracy and up to 80% fewer components, and remain reliable where baselines degrade. Certified Circuits puts circuit discovery on formal ground by producing mechanistic explanations that are provably stable and better aligned with the target concept. Code: https://github.com/AlaaAnani/certified-circuits.
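The wrapping idea described above (run a black-box discovery algorithm on random subsamples of the concept dataset and abstain on components whose inclusion is unstable) can be pictured with a small voting sketch. This is only an illustration under stated assumptions: it computes empirical inclusion frequencies and applies a fixed threshold, and it does not reproduce the paper's certification procedure or its edit-distance guarantee. The names `vote_on_components` and `toy_discover`, and all parameter values, are hypothetical.

```python
import random


def vote_on_components(discover, dataset, rounds=200, subsample_size=5,
                       threshold=0.9, seed=0):
    """Run `discover` on random subsamples and split components by stability.

    Returns (included, abstained): components selected in at least `threshold`
    of the rounds, and components whose frequency is too close to 50/50.
    Components selected in at most 1 - threshold of the rounds are excluded.
    """
    rng = random.Random(seed)
    counts = {}
    for _ in range(rounds):
        subsample = rng.sample(dataset, subsample_size)  # without replacement
        for component in discover(subsample):
            counts[component] = counts.get(component, 0) + 1
    included, abstained = set(), set()
    for component, count in counts.items():
        frequency = count / rounds
        if frequency >= threshold:
            included.add(component)
        elif frequency > 1.0 - threshold:
            abstained.add(component)
    return included, abstained


def toy_discover(samples):
    """Hypothetical base algorithm: keep components active in most samples."""
    names = set().union(*samples)
    return {n for n in names if sum(n in s for s in samples) > len(samples) / 2}


# Toy concept dataset: each sample lists the components it activates.
dataset = [{"a", "b"}] * 5 + [{"a"}] * 4 + [{"a", "c"}]
included, abstained = vote_on_components(toy_discover, dataset)
print(sorted(included), sorted(abstained))
```

In this toy setup the component that every sample activates is kept, the rarely active one is dropped, and the one whose selection flips with the subsample is abstained on, which mirrors the abstract's claim that abstention yields more compact circuits.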

2602.16682 2026-06-01 cs.CV 版本更新

SAW-Bench: Learning Situated Awareness in the Real World

SAW-Bench:在现实世界中学习情境感知

Chuhan Li, Rilyn Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, Xin Eric Wang

发表机构 * University of California, Santa Barbara(加州大学圣芭芭拉分校) Yale University(耶鲁大学) Stanford University(斯坦福大学) University of Maryland, College Park(马里兰大学学院市分校) Amazon(亚马逊) University of California, Merced(加州大学默塞德分校)

AI总结 提出SAW-Bench基准,通过自录视频和问答对评估多模态基础模型的以观察者为中心的情境感知能力,揭示人类与模型间的显著性能差距。

详情
AI中文摘要

人类感知的一个核心方面是情境感知,即我们能够将自己与周围的物理环境联系起来,并根据上下文推理可能的行动。然而,现有的大多数多模态基础模型(MFM)基准强调以环境为中心的空间关系(场景中物体之间的关系),而很大程度上忽略了需要相对于智能体的视角、姿态和运动进行推理的以观察者为中心的关系。为了填补这一空白,我们引入了SAW-Bench(现实世界中的情境感知),这是一个利用真实世界视频评估以自我为中心的情境感知的新基准。SAW-Bench包含使用Ray-Ban Meta(Gen 2)智能眼镜自录的786个视频,涵盖多样的室内外环境,以及超过2,071个人工标注的问答对。它通过六种不同的感知任务来探测模型对观察者中心的理解。我们的综合评估显示,即使使用性能最佳的MFM Gemini 3 Flash,人类与模型之间的性能差距也达到了37.66%。除了这一差距,我们的深入分析还揭示了一些显著发现;例如,虽然模型可以利用以自我为中心的视频中的部分几何线索,但它们常常无法推断出连贯的相机几何结构,从而导致系统性的空间推理错误。我们将SAW-Bench定位为情境空间智能的基准,超越被动观察,转向理解基于物理的、以观察者为中心的动态。

英文摘要

A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to an agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.

2602.15018 2026-06-01 cs.RO cs.CV 版本更新

Neurosim: A Fast Simulator for Neuromorphic Robot Perception

Neurosim: 一种用于神经形态机器人感知的快速模拟器

Richeek Das, Pratik Chaudhari

发表机构 * GRASP Laboratory, University of Pennsylvania(宾夕法尼亚大学GRASP实验室)

AI总结 提出Neurosim和Cortex库,通过高速传感器模拟和低延迟通信,支持神经形态感知与控制算法的训练和闭环测试。

Comments 11 pages, 6 figures

详情
AI中文摘要

Neurosim是一个快速、实时、高性能的库,用于模拟动态视觉传感器、RGB相机、深度传感器和惯性传感器等传感器。它还可以模拟复杂动态环境中多旋翼飞行器的敏捷动力学。Neurosim在桌面GPU上可实现高达约2700 FPS的帧率。Neurosim与一个基于ZeroMQ的通信库Cortex集成,以促进与机器学习和机器人工作流的无缝集成。Cortex为Python和C++应用程序提供了一个高吞吐量、低延迟的消息传递系统,原生支持NumPy数组和PyTorch张量。本文讨论了Neurosim和Cortex的设计理念。它展示了如何利用它们来(i)训练神经形态感知和控制算法,例如,在时间同步的多模态数据上使用自监督学习,以及(ii)在闭环中测试这些算法的实时实现。Neurosim和Cortex可在https://github.com/grasp-lyrl/neurosim获取。

英文摘要

Neurosim is a fast, real-time, high-performance library for simulating sensors such as dynamic vision sensors, RGB cameras, depth sensors, and inertial sensors. It can also simulate agile dynamics of multi-rotor vehicles in complex and dynamic environments. Neurosim can achieve frame rates as high as ~2700 FPS on a desktop GPU. Neurosim integrates with a ZeroMQ-based communication library called Cortex to facilitate seamless integration with machine learning and robotics workflows. Cortex provides a high-throughput, low-latency message-passing system for Python and C++ applications, with native support for NumPy arrays and PyTorch tensors. This paper discusses the design philosophy behind Neurosim and Cortex. It demonstrates how they can be used to (i) train neuromorphic perception and control algorithms, e.g., using self-supervised learning on time-synchronized multi-modal data, and (ii) test real-time implementations of these algorithms in closed-loop. Neurosim and Cortex are available at https://github.com/grasp-lyrl/neurosim .

2602.14441 2026-06-01 cs.CV 版本更新

D-SECURE: Dual-Source Evidence Combination for Unified Reasoning in Misinformation Detection

D-SECURE:用于虚假信息检测中统一推理的双源证据融合

Samudi Amarasinghe, Gagandeep Singh, Priyanka Singh

发表机构 * School of Information Technology and Electrical Engineering, The University of Queensland, Brisbane, Australia(昆士兰大学信息技术与电气工程学院,澳大利亚布里斯班)

AI总结 提出D-SECURE框架,通过融合内部篡改检测(HAMMER)和外部证据检索(DEFAME)实现多模态虚假新闻的统一推理与可解释报告。

详情
AI中文摘要

多模态虚假信息越来越多地将逼真的图像编辑与流畅但误导性的文本混合在一起,产生难以验证的有说服力的帖子。现有系统通常依赖单一证据源。基于内容的检测器识别图像及其标题内的局部不一致性,但无法确定全局事实真相。基于检索的事实核查器在外部证据上进行推理,但将输入视为粗略声明,常常错过微妙的视觉或文本操纵。这种分离导致内部一致的伪造绕过操纵检测器,而事实核查器验证包含像素级或令牌级损坏的声明。我们提出了D-SECURE,一个结合内部操纵检测与基于外部证据的推理的框架,用于新闻类帖子。D-SECURE将HAMMER操纵检测器与DEFAME检索流水线集成。DEFAME执行广泛验证,HAMMER分析可能包含细粒度编辑的残差或不确定案例。在DGM4和ClaimReview样本上的实验突出了两个系统的互补优势,并推动了它们的融合。我们提供了一个统一的、可解释的报告,融合了操纵线索和外部证据。

英文摘要

Multimodal misinformation increasingly mixes realistic image edits with fluent but misleading text, producing persuasive posts that are difficult to verify. Existing systems usually rely on a single evidence source. Content-based detectors identify local inconsistencies within an image and its caption but cannot determine global factual truth. Retrieval-based fact-checkers reason over external evidence but treat inputs as coarse claims and often miss subtle visual or textual manipulations. This separation creates failure cases where internally consistent fabrications bypass manipulation detectors and fact-checkers verify claims that contain pixel-level or token-level corruption. We present D-SECURE, a framework that combines internal manipulation detection with external evidence-based reasoning for news-style posts. D-SECURE integrates the HAMMER manipulation detector with the DEFAME retrieval pipeline. DEFAME performs broad verification, and HAMMER analyses residual or uncertain cases that may contain fine-grained edits. Experiments on DGM4 and ClaimReview samples highlight the complementary strengths of both systems and motivate their fusion. We provide a unified, explainable report that incorporates manipulation cues and external evidence.

2602.10809 2026-06-01 cs.CV cs.IR 版本更新

DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories

DeepImageSearch: 多模态智能体在视觉历史中上下文感知图像检索的基准测试

Chenlong Deng, Mengjie Deng, Junjie Wu, Dun Zeng, Teng Wang, Qingsong Xie, Jiadeng Huang, Shengjie Ma, Changwang Zhang, Zhaoxiang Wang, Jun Wang, Yutao Zhu, Zhicheng Dou

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China(中国人民大学高瓴人工智能学院,北京) OPPO Research Institute(OPPO研究院)

AI总结 提出DeepImageSearch范式,将图像检索重构为自主探索任务,通过构建DISBench基准和模块化智能体框架,验证了在视觉历史中基于隐式上下文线索进行多步推理的必要性。

Comments 18 pages, 6 figures

详情
AI中文摘要

现有的多模态检索系统在语义匹配方面表现出色,但隐含地假设查询-图像相关性可以孤立地衡量。这种范式忽略了真实视觉流中固有的丰富依赖关系,其中信息分布在时间序列中,而不是局限于单个快照。为了弥合这一差距,我们引入了DeepImageSearch,一种新颖的智能体范式,将图像检索重构为自主探索任务。模型必须规划并在原始视觉历史上执行多步推理,以基于隐式上下文线索定位目标。我们构建了DISBench,一个基于互联视觉数据的具有挑战性的基准。为了解决创建上下文相关查询的可扩展性挑战,我们提出了一种人机协作流程,利用视觉语言模型挖掘潜在的时空关联,在人工验证之前有效地卸载密集的上下文发现。此外,我们使用一个配备细粒度工具和用于长程导航的双记忆系统的模块化智能体框架构建了一个稳健的基线。大量实验表明,DISBench对最先进的模型提出了重大挑战,突出了将智能体推理纳入下一代检索系统的必要性。

英文摘要

Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.

2602.07864 2026-06-01 cs.CV 版本更新

Thinking in Structures: Evaluating Spatial Intelligence in Constraint-Governed Spaces

在结构中思考:评估约束空间中的空间智能

Chen Yang, Guanxin Lin, Youquan He, Peiyao Chen, Guanghe Liu, Yufan Mo, Zhouyuan Xu, Linhao Wang, Guohui Zhang, Zihang Zhang, Shenxiang Zeng, Chen Wang, Jiansheng Fan

发表机构 * Tsinghua University(清华大学)

AI总结 提出SSI-Bench基准,通过结构约束下的空间推理任务评估视觉语言模型的空间智能,发现模型与人类存在巨大差距。

Comments ICML 2026, Project Page: https://ssi-bench.github.io

详情
AI中文摘要

空间智能对视觉语言模型(VLM)至关重要,然而许多以场景为中心的基准评估的是无约束环境,其中单个图像可能允许多种合理的3D解释。我们引入了SSI-Bench,一个用于约束空间中结构中心空间推理(SCSR)的VQA基准。它基于复杂的真实世界3D结构,利用几何、拓扑和物理可行性方面的结构约束,使组件关系从视觉证据中更加确定。该基准包含1000个涵盖几何和拓扑推理的排序问题,其中正确的排序需要解决所有候选对象的3D关系,对空间理解提出了更强的要求。它通过完全以人为中心的流程创建,包括超过400研究员小时的图像整理、组件标注和问题设计。评估31个VLM揭示了与人类的巨大差距:最好的开源模型达到22.2%的准确率,最强的闭源模型达到33.6%,而人类得分为91.6%。进一步的结果表明,思维链推理仅带来微小的提升,错误分析揭示了当前模型在约束空间中空间理解的根本局限性。项目页面:https://ssi-bench.github.io。

英文摘要

Spatial intelligence is crucial for vision-language models (VLMs), yet many scene-centric benchmarks evaluate unconstrained environments where a single image may admit multiple plausible 3D interpretations. We introduce SSI-Bench, a VQA benchmark for Structure-Centric Spatial Reasoning (SCSR) in constraint-governed spaces. Built from complex real-world 3D structures, it uses structural constraints from geometry, topology, and physical feasibility to make component relations more determinate from visual evidence. The benchmark contains 1,000 ranking questions spanning geometric and topological reasoning, where correct ordering requires resolving all candidate-wise 3D relations, imposing stronger demands on spatial understanding. It is created through a fully human-centered pipeline with over 400 researcher-hours of image curation, component annotation, and question design. Evaluating 31 VLMs reveals a large gap to humans: the best open-source model achieves 22.2% accuracy and the strongest closed-source model reaches 33.6%, while humans score 91.6%. Further results show that chain-of-thought reasoning brings only marginal gains, and error analysis reveals fundamental limitations in current models' spatial understanding within constraint-governed spaces. Project page: https://ssi-bench.github.io.

2511.16940 2026-06-01 cs.CV cs.CR 版本更新

MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models

MultiPriv: 视觉语言模型中个体级隐私推理的基准测试

Xiongtao Sun, Hui Li, Jiaming Zhang, Yujie Yang, Kaili Liu, Ruxin Feng, Wen Jun Tan, Wei Yang Bryan Lim

发表机构 * Xidian University(西安电子科技大学) Nanyang Technological University(南洋理工大学)

AI总结 针对视觉语言模型通过层次化链式推理关联多模态数据识别个体的隐私风险,提出首个系统评估个体级隐私推理的基准MultiPriv,包含隐私感知与推理框架、双语多模态数据集和九项挑战任务,对50多个开源和商业模型评估发现60%的模型能以高达80%的准确率进行个体级隐私推理。

详情
AI中文摘要

现代视觉语言模型(VLM)通过层次化链式推理将碎片化的多模态数据与可识别的个体关联起来,构成了显著的个体级隐私风险。然而,现有的隐私基准在结构上不足以应对这一威胁,因为它们主要评估隐私感知,而未能解决更关键的隐私推理风险:VLM推断和关联分布式信息以构建个体档案的能力。为填补这一空白,我们提出了MultiPriv,这是第一个旨在系统评估VLM中个体级隐私推理的基准。我们引入了隐私感知与推理(PPR)框架,并构建了一个包含合成个体档案的双语多模态数据集,其中标识符(如人脸和姓名)与敏感属性相关联。该设计支持九项具有挑战性的任务,涵盖属性检测、跨图像重新识别和链式推理。我们对超过50个开源和商业VLM进行了大规模评估。在我们的受控基准中,60%的广泛使用的VLM能够以高达80%的准确率进行个体级隐私推理,这表明对个人隐私存在重大潜在威胁。该基准可在https://github.com/CyberChangAn/MultiPriv-PII获取。

英文摘要

Modern Vision-Language Models (VLMs) pose significant individual-level privacy risks by linking fragmented multimodal data to identifiable individuals through hierarchical chain-of-thought reasoning. However, existing privacy benchmarks remain structurally insufficient for this threat, as they primarily evaluate privacy perception while failing to address the more critical risk of privacy reasoning: a VLM's ability to infer and link distributed information to construct individual profiles. To address this gap, we propose MultiPriv, the first benchmark designed to systematically evaluate individual-level privacy reasoning in VLMs. We introduce the Privacy Perception and Reasoning (PPR) framework and construct a bilingual multimodal dataset with synthetic individual profiles, where identifiers, such as faces and names, are linked to sensitive attributes. This design enables nine challenging tasks spanning attribute detection, cross-image re-identification, and chained inference. We conduct a large-scale evaluation of over 50 open-source and commercial VLMs. In our controlled benchmark, 60% of widely used VLMs can perform individual-level privacy reasoning with up to 80% accuracy, suggesting a significant potential threat to personal privacy. The benchmark is available at https://github.com/CyberChangAn/MultiPriv-PII.

2602.02220 2026-06-01 cs.CV cs.RO 版本更新

LangMap: A Human-Verified Benchmark for Hierarchical Open-Vocabulary Goal Navigation

LangMap:一个用于分层开放词汇目标导航的人工验证基准

Bo Miao, Weijia Liu, Jun Luo, Lachlan Shinnick, Jian Liu, Thomas Hamilton-Smith, Yuhe Yang, Zijie Wu, Vanja Videnovic, Feras Dayoub, Anton van den Hengel

发表机构 * AIML, Adelaide University(AIML,阿德莱德大学) East China Normal University(华东师范大学) NERC-RVC, Hunan University(NERC-RVC,湖南大学) The University of Western Australia(西澳大学) Breaker Industries

AI总结 针对现有基准在分层语义目标导航中的不足,提出LangMap基准,通过人工验证的语义标注和对比注释协议,支持场景、房间、区域和实例四个层级的目标导航任务,并引入PlaNaVid基线方法。

详情
AI中文摘要

语言条件目标导航(LGN)要求智能体在没有逐步指导的情况下定位用户指定的目标。然而,现有基准主要关注类别级目标或依赖视觉语言模型(VLM)生成的实例描述,这些描述通常包含歧义和语义错误,限制了系统性和可靠的评估。我们提出了HieraNav,一个开放词汇的LGN任务,目标在四个分层语义层级上指定:场景、房间、区域和实例。为此,我们提出了Language as a Map(LangMap),据我们所知,这是第一个具有人工验证语义标注的真实世界3D室内导航基准,支持所有四个目标层级的任务。LangMap提供了区域标签以及覆盖414个对象类别的区分性区域和实例描述,通过比较同一场景区域和实例的严格对比注释协议生成,包含超过18K个任务。每个目标都配有简洁和详细的描述,支持跨指令风格的评估。定量和定性分析验证了我们的注释质量;值得注意的是,我们的实例描述在文本到视图匹配上比GOAT-Bench注释高出23个百分点。我们进一步引入了PlaNaVid,一个强大的仅RGB基线,它将有界多样记忆(BDM)与高级规划相结合,以激发用于多目标导航的反应策略。PlaNaVid在没有深度、3D场景表示或对象掩码的情况下实现了顶级成功率。进一步分析表明,记忆和更丰富的上下文提升了性能,而长尾类别、小物体、远距离目标和多目标完成仍然是开放的挑战。该基准可在https://bo-miao.github.io/LangMap获取。

英文摘要

Language-conditioned goal navigation (LGN) requires agents to locate user-specified targets without step-by-step guidance. However, existing benchmarks largely focus on category-level goals or rely on instance descriptions generated by vision-language models (VLMs), which often contain ambiguities and semantic errors, limiting systematic and reliable evaluation. We introduce HieraNav, an open-vocabulary LGN task with goals specified at four hierarchical semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), to our knowledge the first real-world 3D indoor navigation benchmark with human-verified semantic annotations to support tasks across all four goal levels. LangMap provides region labels and discriminative region and instance descriptions covering 414 object categories, produced through a rigorous contrastive annotation protocol comparing same-scene regions and instances, and contains over 18K tasks. Each target is paired with concise and detailed descriptions, enabling evaluation across instruction styles. Quantitative and qualitative analyses validate our annotation quality; notably, our instance descriptions outperform GOAT-Bench annotations by 23 percentage points in text-to-view matching. We further introduce PlaNaVid, a strong RGB-only baseline that combines Bounded Diverse Memory (BDM) with high-level planning to prime a reactive policy for multi-goal navigation. PlaNaVid achieves top-tier success rates without depth, 3D scene representations, or object masks. Further analysis shows that memory and richer context boost performance, while long-tailed categories, small objects, distant targets, and multi-goal completion remain open challenges. The benchmark is available at https://bo-miao.github.io/LangMap

2511.11406 2026-06-01 cs.CV 版本更新

Robust Low-Rank Sparse Framework for Video-Based Affective Computing

基于视频的情感计算的鲁棒低秩稀疏框架

Feng-Qi Cui, Jinyang Huang, Sirui Zhao, Xinyu Li, Xin Yan, Ziyu Jia, Xiaokang Zhou

发表机构 * Cylingo Group(Cylingo集团) RIKEN Center for Advanced Intelligence Project, RIKEN(RIKEN高级智能项目中心,RIKEN)

AI总结 提出低秩稀疏情感理解框架(LSEF),通过层次化低秩稀疏分解将情感动态分解为情感基和瞬态波动,并采用秩感知优化策略提升鲁棒性和动态判别能力。

详情
AI中文摘要

基于视频的情感计算(VAC)对于情感分析和人机交互至关重要,但由于复杂的情感动态,存在模型不稳定和表示退化的问题。由于不同情感波动的含义在不同情感背景下可能不同,核心限制在于缺乏一种层次结构机制来分离不同的情感成分,即情感基(长期情感基调)和瞬态波动(短期情感波动)。为解决这一问题,我们提出了低秩稀疏情感理解框架(LSEF),这是一个基于低秩稀疏原理的统一模型,从理论上将情感动态重新定义为层次化的低秩稀疏组合过程。LSEF采用三个即插即用模块:稳定性编码模块(SEM)捕获低秩情感基;动态解耦模块(DDM)分离稀疏瞬态信号;一致性整合模块(CIM)重构多尺度稳定性和反应性一致性。该框架通过秩感知优化(RAO)策略进行优化,该策略自适应地平衡梯度平滑性和敏感性。跨多个数据集的大量实验证实,LSEF显著增强了鲁棒性和动态判别能力,进一步验证了层次化低秩稀疏建模对于理解情感动态的有效性和通用性。

英文摘要

Video-based Affective Computing (VAC), vital for emotion analysis and human-computer interaction, suffers from model instability and representational degradation due to complex emotional dynamics. Since the meaning of different emotional fluctuations may differ under different emotional contexts, the core limitation is the lack of a hierarchical structural mechanism to disentangle distinct affective components, i.e., emotional bases (the long-term emotional tone), and transient fluctuations (the short-term emotional fluctuations). To address this, we propose the Low-Rank Sparse Emotion Understanding Framework (LSEF), a unified model grounded in the Low-Rank Sparse Principle, which theoretically reframes affective dynamics as a hierarchical low-rank sparse compositional process. LSEF employs three plug-and-play modules, i.e., the Stability Encoding Module (SEM) captures low-rank emotional bases; the Dynamic Decoupling Module (DDM) isolates sparse transient signals; and the Consistency Integration Module (CIM) reconstructs multi-scale stability and reactivity coherence. This framework is optimized by a Rank Aware Optimization (RAO) strategy that adaptively balances gradient smoothness and sensitivity. Extensive experiments across multiple datasets confirm that LSEF significantly enhances robustness and dynamic discrimination, which further validates the effectiveness and generality of hierarchical low-rank sparse modeling for understanding affective dynamics.

2601.22412 2026-06-01 cs.CV 版本更新

Calibrated Uncertainty for Trustworthy Clinical Gait Analysis Using Probabilistic Multiview Markerless Motion Capture

用于可信临床步态分析的校准不确定性:基于概率多视角无标记运动捕捉

Seth Donahue, Irina Djuraskovic, Kunal Shah, Fabian Sinz, Ross Chafetz, R. James Cotton

发表机构 * Shriners Children’s Lexington(施莱纳儿童医院莱克星顿分院) Dept. of Physical Therapy, University of Kentucky(肯塔基大学物理治疗系) Shirley Ryan AbilityLab(雪莉·瑞安能力实验室) Northwestern University(西北大学) University of Göttingen(哥廷根大学) Lower Saxony Center for AI & Causal Methods in Medicine(下萨克森医学人工智能与因果方法中心) Shriners Children’s Philadelphia(施莱纳儿童医院费城分院)

AI总结 提出一种概率多视角无标记运动捕捉方法,通过变分推断估计关节角度后验分布,并利用期望校准误差评估置信区间校准性,实现无需真实标记即可识别不可靠输出的可靠步态分析。

Comments 9 pages, 5 figures, EMBS Special Issue

详情
AI中文摘要

基于视频的人体运动分析在临床实践和研究中具有潜力。然而,多视角无标记运动捕捉(MMMC)的临床实施和信任要求,除了准确性外,这些系统还能为任何个体产生可靠的置信区间以指示其准确程度。基于我们先前利用变分推断估计关节角度后验分布的工作,本研究评估了一种概率MMMC方法的校准性和可靠性。我们分析了来自两个机构的68名参与者的数据,使用仪器化步道和标准标记运动捕捉验证模型。我们通过期望校准误差(ECE)测量置信区间的校准性。模型展示了可靠的校准性,步长和跨步长的ECE值通常<0.1,偏差校正的步态运动学也类似。我们观察到步长和跨步长中位误差分别约为16毫米和12毫米,下肢关节的中位偏差校正运动学误差范围为1.5至3.8度。与校准的ECE一致,模型预测的不确定性大小与观察到的误差测量值强相关。这些发现表明,按照设计,概率模型重建量化了认知不确定性,使其能够在无需同时使用真实标记仪器的情况下识别不可靠的输出。

英文摘要

Video-based human movement analysis holds potential for movement assessment in clinical practice and research. However, the clinical implementation and trust of multi-view markerless motion capture (MMMC) require that, in addition to being accurate, these systems produce reliable confidence intervals to indicate how accurate they are for any individual. Building on our prior work utilizing variational inference to estimate joint angle posterior distributions, this study evaluates the calibration and reliability of a probabilistic MMMC method. We analyzed data from 68 participants across two institutions, validating the model against an instrumented walkway and standard marker-based motion capture. We measured the calibration of the confidence intervals using the Expected Calibration Error (ECE). The model demonstrated reliable calibration, yielding ECE values generally < 0.1 for both step and stride length and bias-corrected gait kinematics. We observed a median step and stride length error of ~16 mm and ~12 mm respectively, with median bias-corrected kinematic errors ranging from 1.5 to 3.8 degrees across lower extremity joints. Consistent with the calibrated ECE, the magnitude of the model's predicted uncertainty correlated strongly with observed error measures. These findings indicate that, as designed, the probabilistic model reconstruction quantifies epistemic uncertainty, allowing it to identify unreliable outputs without the need for concurrent ground-truth instrumentation.
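To make the calibration check concrete, the sketch below computes one common form of expected calibration error for predictive intervals: for each nominal confidence level, it compares the fraction of ground-truth values that fall inside the predicted interval against that level, then averages the absolute gaps. This is an illustration under assumptions, not the study's exact metric: it assumes Gaussian predictive distributions summarized by a mean and standard deviation, and the function names and the choice of levels are hypothetical.

```python
from statistics import NormalDist

_STANDARD_NORMAL = NormalDist()
LEVELS = (0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)


def interval_coverage(means, stds, targets, level):
    """Fraction of targets inside the central `level` interval of N(mean, std)."""
    # Half-width of the central interval in units of the standard deviation.
    z = _STANDARD_NORMAL.inv_cdf(0.5 + level / 2.0)
    inside = sum(1 for m, s, t in zip(means, stds, targets) if abs(t - m) <= z * s)
    return inside / len(targets)


def expected_calibration_error(means, stds, targets, levels=LEVELS):
    """Mean absolute gap between empirical coverage and nominal level."""
    gaps = [abs(interval_coverage(means, stds, targets, lv) - lv) for lv in levels]
    return sum(gaps) / len(gaps)


# Synthetic targets placed at evenly spaced quantiles of a standard normal.
n = 1000
targets = [_STANDARD_NORMAL.inv_cdf((i + 0.5) / n) for i in range(n)]
means = [0.0] * n
calibrated = expected_calibration_error(means, [1.0] * n, targets)
overconfident = expected_calibration_error(means, [0.5] * n, targets)
print(calibrated, overconfident)
```

A model that reports the true spread scores close to zero, whereas one that reports half the true spread is overconfident and scores much higher, which is the kind of behavior a threshold such as ECE < 0.1 is meant to separate.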

2601.22202 2026-06-01 eess.IV cs.CV 版本更新

A Survey on Semantic Communication for Vision: Categories, Frameworks, Enabling Techniques, and Applications

面向视觉的语义通信综述:分类、框架、使能技术与应用

Runze Cheng, Yao Sun, Ahmad Taha, Xuesong Liu, David Flynn, Muhammad Ali Imran

发表机构 * James Watt School of Engineering, University of Glasgow(格拉斯哥大学詹姆斯·瓦特工程学院)

AI总结 本文系统综述了面向视觉数据语义通信(SemCom-Vision)的方法,基于语义量化方案将现有方法分为语义保持、语义扩展和语义精炼三类,并总结了基于机器学习的编解码模型、训练算法及知识利用策略。

详情
Journal ref IEEE Transactions on Network Science and Engineering, vol. 13, pp. 8080-8103, 2026
AI中文摘要

语义通信(SemCom)作为一种变革性范式,用于流量密集的视觉数据传输,将关注点从原始数据传输转向有意义的内容传输,从而缓解通信资源的日益紧张。然而,要实现SemCom,面临着视觉数据精确语义量化、不同任务和目标下的鲁棒语义提取与重建、利用有效知识的收发端协调以及适应不可预测的无线通信环境等挑战。本文对面向视觉数据语义通信(SemCom-Vision)进行了系统综述,其中进行了计算机视觉(CV)与通信工程的跨学科分析,为机器学习(ML)赋能的SemCom-Vision设计提供全面指导。具体而言,本综述首先阐述了SemCom的基础知识和关键概念。然后,我们引入了一种新的分类视角,根据通过语义量化方案解释的通信目标,将现有的SemCom-Vision方法分为语义保持通信(SPC)、语义扩展通信(SEC)和语义精炼通信(SRC)。此外,本综述阐述了每个SemCom-Vision类别中基于ML的编码器-解码器模型和训练算法,随后介绍了知识结构和利用策略。最后,我们讨论了潜在的SemCom-Vision应用。

英文摘要

Semantic communication (SemCom) emerges as a transformative paradigm for traffic-intensive visual data transmission, shifting focus from raw data to meaningful content transmission and relieving the increasing pressure on communication resources. However, to achieve SemCom, challenges are faced in accurate semantic quantization for visual data, robust semantic extraction and reconstruction under diverse tasks and goals, transceiver coordination with effective knowledge utilization, and adaptation to unpredictable wireless communication environments. In this paper, we present a systematic review of SemCom for visual data transmission (SemCom-Vision), wherein an interdisciplinary analysis integrating computer vision (CV) and communication engineering is conducted to provide comprehensive guidelines for the machine learning (ML)-empowered SemCom-Vision design. Specifically, this survey first elucidates the basics and key concepts of SemCom. Then, we introduce a novel classification perspective to categorize existing SemCom-Vision approaches as semantic preservation communication (SPC), semantic expansion communication (SEC), and semantic refinement communication (SRC) based on communication goals interpreted through semantic quantization schemes. Moreover, this survey articulates the ML-based encoder-decoder models and training algorithms for each SemCom-Vision category, followed by knowledge structure and utilization strategies. Finally, we discuss potential SemCom-Vision applications.

2502.12119 2026-06-01 cs.CV cs.AI cs.CL 版本更新

PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

PRISM:免训练多模态数据选择的自剪枝内在选择方法

Jinhe Bi, Aniri, Zengjie Jin, Yifan Wang, Danqi Yan, Wenke Huang, Xiaowen Ma, Sikuan Yan, Artur Hecker, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, Yunpu Ma

发表机构 * LMU Munich(慕尼黑大学) Munich Research Center, Huawei Technologies(慕尼黑研究中心,华为技术) METEOR School of Computer Science, Wuhan University(武汉大学计算机学院) Munich Center for Machine Learning(慕尼黑机器学习中心)

AI总结 针对多模态大语言模型视觉指令数据冗余问题,提出一种免训练框架PRISM,通过隐式重中心化消除视觉特征各向异性导致的全局语义漂移,实现高效数据选择,在降低计算成本的同时提升模型性能。

Comments Accepted to ACL 2026 and selected for the Best Paper list; later desk-rejected due to an inadvertent manual bibliography-editing error. Previous versions are withdrawn due to an inadvertent manual bibliography-editing error; please refer to the latest corrected version

详情
AI中文摘要

视觉指令微调使预训练的多模态大语言模型(MLLMs)能够遵循人类指令以应用于现实场景。然而,这些数据集的快速增长引入了显著的冗余,导致计算成本增加。现有的指令数据选择方法旨在修剪这种冗余,但主要依赖于计算密集型技术,如基于代理的推理或基于训练的指标。因此,这些选择过程产生的巨大计算成本往往加剧了它们本应解决的效率瓶颈,对MLLMs的可扩展和有效微调构成了重大挑战。为了解决这一挑战,我们首先发现了一个关键但先前被忽视的因素:视觉特征分布中固有的各向异性。我们发现这种各向异性引发了\textit{全局语义漂移},而忽视这一现象是限制当前数据选择方法效率的关键因素。受此启发,我们设计了\textbf{PRISM},这是第一个用于高效视觉指令选择的免训练框架。PRISM通过隐式重中心化建模内在视觉语义,精确移除全局背景特征的干扰影响。实验表明,PRISM将数据选择和模型微调的端到端时间减少到传统流程的30%。更值得注意的是,它在实现这一效率的同时提升了性能,在八个多模态和三个语言理解基准上超越了在全数据集上微调的模型,最终相对于基线实现了101.7%的相对改进。代码可通过\href{https://github.com/bibisbar/PRISM}{此仓库}获取。

英文摘要

Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applications. However, the rapid growth of these datasets introduces significant redundancy, leading to increased computational costs. Existing methods for selecting instruction data aim to prune this redundancy, but predominantly rely on computationally demanding techniques such as proxy-based inference or training-based metrics. Consequently, the substantial computational costs incurred by these selection processes often exacerbate the very efficiency bottlenecks they are intended to resolve, posing a significant challenge to the scalable and effective tuning of MLLMs. To address this challenge, we first identify a critical, yet previously overlooked, factor: the anisotropy inherent in visual feature distributions. We find that this anisotropy induces a \textit{Global Semantic Drift}, and overlooking this phenomenon is a key factor limiting the efficiency of current data selection methods. Motivated by this insight, we devise \textbf{PRISM}, the first training-free framework for efficient visual instruction selection. PRISM surgically removes the corrupting influence of global background features by modeling the intrinsic visual semantics via implicit re-centering. Empirically, PRISM reduces the end-to-end time for data selection and model tuning to just 30\% of conventional pipelines. More remarkably, it achieves this efficiency while simultaneously enhancing performance, surpassing models fine-tuned on the full dataset across eight multimodal and three language understanding benchmarks, culminating in a 101.7\% relative improvement over the baseline. The code is available for access via \href{https://github.com/bibisbar/PRISM}{this repository}.
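
The re-centering idea in the abstract can be illustrated with a minimal sketch. It is not PRISM's actual implementation: it only assumes that a shared "global" component can be approximated by the dataset mean and subtracted before comparing samples. The toy features and all function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Per-dimension mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def recenter(vectors):
    """Subtract the dataset mean (assumed stand-in for the shared global component)."""
    mu = mean_vector(vectors)
    return [[x - m for x, m in zip(v, mu)] for v in vectors]


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy anisotropic features: a large shared offset plus small sample-specific directions.
features = [
    [11.0, 10.0],
    [10.0, 11.0],
    [9.0, 10.0],
    [10.0, 9.0],
]

raw_sim = cosine(features[0], features[2])        # dominated by the shared offset
centered = recenter(features)
centered_sim = cosine(centered[0], centered[2])   # reflects the sample-specific directions
```

In this toy case the raw similarity of samples 0 and 2 is close to 1 even though their sample-specific directions are opposite. After centering, the similarity reflects that opposition, which is the kind of signal a redundancy-pruning step would need.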

2601.01456 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Rethinking Multimodal Few-Shot 3D Point Cloud Segmentation: From Fused Refinement to Decoupled Arbitration

重新思考多模态少样本3D点云分割:从融合精炼到解耦仲裁

Wentao Bian, Fenglei Xu

发表机构 * Suzhou University of Science and Technology(苏州科技大学)

AI总结 针对多模态少样本3D点云分割中“融合-精炼”范式的“可塑性-稳定性困境”和CLIP的语义盲区,提出解耦专家仲裁少样本分割网络(DA-FSS),通过解耦语义与几何路径并相互正则化梯度,实现更好的泛化性能。

Comments Accepted to IJCAI-ECAI 2026 (Main Track). 9 pages, 3 figures, 3 tables

详情
AI中文摘要

本文重新审视多模态少样本3D点云语义分割(FS-PCS),识别出“融合-精炼”范式中的一个冲突:“可塑性-稳定性困境”。此外,CLIP的类间混淆可能导致语义盲区。为解决这些问题,我们提出解耦专家仲裁少样本分割网络(DA-FSS),该模型有效区分语义和几何路径,并相互正则化它们的梯度以实现更好的泛化。DA-FSS采用与MM-FSS相同的主干网络和预训练文本编码器生成文本嵌入,从而提高自由模态的利用率并更好地利用每个模态的信息空间。为此,我们提出并行专家精炼模块以生成每个模态相关性。我们还提出堆叠仲裁模块(SAM)执行卷积融合并为每个模态路径仲裁相关性。并行专家解耦两条路径:几何专家保持可塑性,语义专家确保稳定性。它们通过解耦对齐模块(DAM)协调,该模块在不传播混淆的情况下传递知识。在流行数据集(S3DIS、ScanNet)上的实验表明DA-FSS优于MM-FSS。同时,几何边界、完整性和纹理区分均优于基线。代码可在https://github.com/MoWenQAQ/DA-FSS/获取。

英文摘要

In this paper, we revisit multimodal few-shot 3D point cloud semantic segmentation (FS-PCS), identifying a conflict in "Fuse-then-Refine" paradigms: the "Plasticity-Stability Dilemma." In addition, CLIP's inter-class confusion can result in semantic blindness. To address these issues, we present the Decoupled-experts Arbitration Few-Shot SegNet (DA-FSS), a model that effectively distinguishes between semantic and geometric paths and mutually regularizes their gradients to achieve better generalization. DA-FSS employs the same backbone and pre-trained text encoder as MM-FSS to generate text embeddings, which can increase free modalities' utilization rate and better leverage each modality's information space. To achieve this, we propose a Parallel Expert Refinement module to generate each modal correlation. We also propose a Stacked Arbitration Module (SAM) to perform convolutional fusion and arbitrate correlations for each modality pathway. The Parallel Experts decouple two paths: a Geometric Expert maintains plasticity, and a Semantic Expert ensures stability. They are coordinated via a Decoupled Alignment Module (DAM) that transfers knowledge without propagating confusion. Experiments on popular datasets (S3DIS, ScanNet) demonstrate the superiority of DA-FSS over MM-FSS. Meanwhile, geometric boundaries, completeness, and texture differentiation are all superior to the baseline. The code is available at: https://github.com/MoWenQAQ/DA-FSS/.

2601.01075 2026-06-01 cs.LG cs.AI cs.CV 版本更新

Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments

流等变世界模型:部分观测动态环境的记忆

Hansen Jin Lillemark, Benhao Huang, Fangneng Zhan, Yilun Du, Thomas Anderson Keller

发表机构 * Kempner Institute, Harvard University(哈佛大学 Kempner 研究所) ML, Carnegie Mellon University(卡内基梅隆大学 ML 研究所) SEAS, Harvard University(哈佛大学工程与应用科学学院)

AI总结 提出流等变世界建模框架,利用时间参数化对称性在潜在记忆中实现长时程稳定准确的动力学预测,解决部分观测问题。

Comments Accepted at ICML 2026

详情
AI中文摘要

具身系统将世界体验为“流之交响”:多种连续感官输入流与自身运动耦合,并与外部物体的动力学交织。这些感官流和世界的基本动力学遵循平滑的时间参数化对称性,而现有的世界模型忽略了这一点。如果没有尊重这种结构的记忆,部分可观测性对现有方法构成主要障碍:每次观测仅揭示世界的一部分,而未观测区域继续演化。在这项工作中,我们引入了流等变世界建模,这是一个利用潜在记忆中的时间参数化对称性来实现长时程稳定准确动力学预测的框架。潜在记忆随自身运动和推断的外部物体运动等变地移动和变换,使关于视野外区域的信息随时间保持对齐。我们在2D和3D部分观测视频世界建模基准上展示了该框架相对于最先进的扩散、记忆增强和循环世界模型架构的优势。更广泛地说,我们的结果表明,当预测表示按照它们所建模的世界的时间和动力学结构组织时,它们会变得更加强大。项目页面:https://flowequivariantworldmodels.github.io/

英文摘要

Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the dynamics of external objects. These sensory streams and the underlying dynamics of the world obey smooth, time-parameterized symmetries which existing world models ignore. Without a memory that respects this structure, partial observability presents a major obstacle to existing methods: each observation reveals only a fraction of the world, while unobserved regions continue to evolve. In this work, we introduce Flow Equivariant World Modeling, a framework that leverages time-parameterized symmetries within a latent memory for stable and accurate dynamics prediction over long horizons. The latent memory shifts and transforms equivariantly with self-motion and inferred external object motion, keeping information about out-of-view regions aligned as time progresses. We demonstrate the advantage of this framework over state-of-the-art diffusion, memory-augmented, and recurrent world model architectures on 2D and 3D partially observed video world modeling benchmarks. More broadly, our results suggest that predictive representations become more powerful when they are organized in line with the temporal and dynamical structure of the world they model. Project page: https://flowequivariantworldmodels.github.io/

2512.02743 2026-06-01 cs.CV cs.AI 版本更新

Reasoning-Aware Multimodal Fusion for Hateful Video Detection

面向仇恨视频检测的推理感知多模态融合

Shuonan Yang, Tailin Chen, Jiangbei Yue, Guangliang Cheng, Jianbo Jiao, Zeyu Fu

发表机构 * Multimodal Intelligence Lab(多模态智能实验室) Department of Computer Science(计算机科学系) University of Exeter(埃克塞特大学) School of Computer Science(计算机科学学院) University of Leeds(利兹大学) School of Computer Science and Informatics(计算机科学与信息学学院) University of Liverpool(利物浦大学) University of Birmingham(伯明翰大学) Machine Intelligence + x Group(机器智能+X小组)

AI总结 提出推理感知多模态融合框架,通过局部-全局上下文融合和语义交叉注意力实现多模态交互,并引入对抗推理生成互补语义视角,在仇恨视频检测中将Macro-F1和仇恨类别召回率分别提升3%和7%。

Comments Accepted at Transactions on Machine Learning Research (TMLR)

详情
AI中文摘要

在线视频中的仇恨言论对数字平台构成日益严重的威胁,尤其是当视频内容变得日益多模态和上下文依赖时。现有方法通常难以有效融合模态间的复杂语义关系,且缺乏理解细微仇恨内容的能力。为解决这些问题,我们提出了一种创新的推理感知多模态融合(RAMF)框架。针对第一个挑战,我们设计了局部-全局上下文融合(LGCF)以捕捉局部显著线索和全局时间结构,并提出语义交叉注意力(SCA)以实现细粒度多模态语义交互。针对第二个挑战,我们引入了对抗推理——一个结构化的三阶段过程,其中视觉语言模型生成(i)客观描述、(ii)仇恨假设推理和(iii)非仇恨假设推理——提供互补的语义视角,丰富模型对细微仇恨意图的上下文理解。在两个真实仇恨视频数据集上的评估表明,我们的方法实现了稳健的泛化性能,在Macro-F1和仇恨类别召回率上分别比现有最先进方法提高了3%和7%。重现我们结果所需的源代码和数据可在https://github.com/Multimodal-Intelligence-Lab-MIL/RAMF获取。

英文摘要

Hate speech in online videos poses an increasingly serious threat to digital platforms, especially as video content becomes more multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process in which a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences. These provide complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. The source code and data required to reproduce our results are available at https://github.com/Multimodal-Intelligence-Lab-MIL/RAMF.

2509.21379 2026-06-01 cs.CV cs.AI 版本更新

SAEmnesia: Erasing Concepts in Diffusion Models with Supervised Sparse Autoencoders

SAEmnesia:基于监督稀疏自编码器的扩散模型概念擦除

Enrico Cassano, Riccardo Renzulli, Marco Nurisso, Mirko Zaffaroni, Alan Perotti, Marco Grangetto

发表机构 * University of Turin, Italy(意大利都灵大学) Intesa Sanpaolo AI Research, Italy(意大利Intesa Sanpaolo人工智能研究院)

AI总结 提出监督稀疏自编码器框架SAEmnesia,通过强制一对一概念-神经元映射实现特征集中化,从而高效、精准地擦除扩散模型中的概念。

Comments Accepted at ICML 2026

详情
AI中文摘要

扩散模型中的概念遗忘受到特征分裂的阻碍,即概念分布在许多潜在特征上,使得移除它们具有挑战性且计算成本高。我们引入了SAEmnesia,一种监督稀疏自编码器框架,通过强制一对一的概念-神经元映射来克服这一问题。通过在训练过程中系统地标记概念,我们的方法实现了特征集中化,将每个概念绑定到一个可解释的神经元上。这使得概念擦除高度精准且高效。与最先进的基于稀疏自编码器的遗忘方法相比,SAEmnesia将超参数搜索减少了96.67%,并在UnlearnCanvas对象基准上实现了9.22%的提升。我们的方法在顺序遗忘中也表现出卓越的可扩展性,在移除九个对象时准确率提高了28.4%,为精确可控的概念擦除迈出了一步。此外,SAEmnesia在I2P基准上有效抑制了裸体内容,并对对抗攻击保持鲁棒性。源代码可在https://github.com/EIDOSLAB/SAEmnesia获取。

英文摘要

Concept unlearning in diffusion models is hampered by feature splitting, where concepts are distributed across many latent features, making their removal challenging and computationally expensive. We introduce SAEmnesia, a supervised sparse autoencoder framework that overcomes this by enforcing one-to-one concept-neuron mappings. By systematically labeling concepts during training, our method achieves feature centralization, binding each concept to a single, interpretable neuron. This enables highly targeted and efficient concept erasure. Compared to the state-of-the-art sparse autoencoder-based unlearning approach, SAEmnesia reduces hyperparameter search by 96.67% and achieves a 9.22% improvement on the UnlearnCanvas benchmark for objects. Our method also shows superior scalability in sequential unlearning, improving accuracy by 28.4% when removing nine objects, establishing a step forward for precise and controllable concept erasure. Moreover, SAEmnesia effectively suppresses nudity on the I2P benchmark and remains robust to adversarial attacks. Source code available at https://github.com/EIDOSLAB/SAEmnesia.
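
The erasure step described above can be sketched as follows, assuming the one-to-one concept-neuron mapping has already been learned. This is an illustration only, not the released SAEmnesia code: the identity encoder/decoder, the concept names, and the `concept_to_neuron` table are hypothetical stand-ins for a trained sparse autoencoder.

```python
def encode(x, encoder):
    """Sparse code z = ReLU(W_enc x) for a single activation vector."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in encoder]


def decode(z, decoder):
    """Reconstruction x_hat = sum_j z_j * d_j, where d_j is the j-th dictionary row."""
    dim = len(decoder[0])
    return [sum(z[j] * decoder[j][k] for j in range(len(z))) for k in range(dim)]


def erase_concept(x, encoder, decoder, neuron_index):
    """Zero the single latent bound to a concept, then decode the edited code."""
    z = encode(x, encoder)
    z[neuron_index] = 0.0
    return decode(z, decoder)


# Toy autoencoder whose dictionary is the standard basis (assumption for readability).
encoder = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
decoder = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
concept_to_neuron = {"cat": 0, "dog": 1, "car": 2}  # hypothetical supervised mapping

activation = [0.7, 0.2, 0.5]
edited = erase_concept(activation, encoder, decoder, concept_to_neuron["dog"])
```

Because each concept is tied to one neuron, erasure reduces to a single index lookup and one zeroed latent. That is presumably why the search over candidate features shrinks so sharply compared with unsupervised sparse autoencoders.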

2511.19923 2026-06-01 cs.CV cs.CL 版本更新

Distilling Counterfactual Reasoning from Language to Vision: Causal Graph Guided Post-Training for Video Understanding

从语言到视觉的反事实推理蒸馏:因果图引导的视频理解后训练

Yuefei Chen, Jiang Liu, Xiaodong Lin, Ruixiang Tang

发表机构 * Rutgers University(新泽西罗格斯大学)

AI总结 针对视觉语言模型在反事实推理上的不足,提出CounterVQA基准和CFGPT后训练方法,通过从语言模态蒸馏反事实推理能力提升视频理解。

详情
AI中文摘要

视觉语言模型(VLM)最近在视频理解方面取得了显著进展,特别是在特征对齐、事件推理和指令遵循任务中。然而,它们在反事实推理(即在假设条件下推断替代结果)方面的能力仍未得到充分探索。这种能力对于鲁棒的视频理解至关重要,因为它需要识别潜在的因果结构并推理未观察到的可能性,而不仅仅是识别观察到的模式。为了系统评估这一能力,我们引入了CounterVQA,一个基于视频的基准测试,具有三个渐进难度级别,评估反事实推理的不同方面。通过对最先进的开源和闭源模型的全面评估,我们发现了一个显著的性能差距:虽然这些模型在简单的反事实问题上达到了合理的准确性,但在复杂的多跳因果链上性能显著下降。为了解决这些限制,我们开发了一种后训练方法CFGPT,通过从语言模态蒸馏其反事实推理能力来增强模型的视觉反事实推理能力,在CounterVQA的所有难度级别上均取得了一致的改进。数据集和代码将后续发布。

英文摘要

Vision Language Models (VLMs) have recently shown significant advancements in video understanding, especially in feature alignment, event reasoning, and instruction-following tasks. However, their capability for counterfactual reasoning, inferring alternative outcomes under hypothetical conditions, remains underexplored. This capability is essential for robust video understanding, as it requires identifying underlying causal structures and reasoning about unobserved possibilities, rather than merely recognizing observed patterns. To systematically evaluate this capability, we introduce CounterVQA, a video-based benchmark featuring three progressive difficulty levels that assess different aspects of counterfactual reasoning. Through comprehensive evaluation of both state-of-the-art open-source and closed-source models, we uncover a substantial performance gap: while these models achieve reasonable accuracy on simple counterfactual questions, performance degrades significantly on complex multi-hop causal chains. To address these limitations, we develop a post-training method, CFGPT, that enhances a model's visual counterfactual reasoning ability by distilling its counterfactual reasoning capability from the language modality, yielding consistent improvements across all CounterVQA difficulty levels. Dataset and code will be further released.

2511.19433 2026-06-01 cs.RO cs.AI cs.CV 版本更新

Mixture of Horizons in Action Chunking

动作分块中的视野混合

Dong Jing, Gang Wang, Jiaqi Liu, Weiliang Tang, Zelong Sun, Yunchao Yao, Zhenyu Wei, Yunhui Liu, Zhiwu Lu, Mingyu Ding

发表机构 * Renmin University of China(中国人民大学) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) The Chinese University of Hong Kong(香港中文大学)

AI总结 针对视觉-语言-动作模型中动作分块长度(视野)的权衡问题,提出混合视野策略,通过并行处理不同视野的动作片段并融合输出,同时提升长期预见与短期精度,实现性能与泛化性的改进。

Comments Accepted at ICML 2026

详情
AI中文摘要

视觉-语言-动作(VLA)模型在机器人操作中展现出显著能力,但其性能对训练中使用的$\textbf{动作分块长度}$(称为$\textbf{视野}$)敏感。我们的实证研究揭示了一个内在权衡:较长的视野提供更强的全局预见但降低细粒度精度,而较短的视野增强局部控制但在长期任务上表现不佳,这意味着固定选择单一视野是次优的。为缓解这一权衡,我们提出$\textbf{混合视野(MoH)}$策略。MoH将动作分块重新排列为多个不同视野的片段,通过共享动作变换器并行处理,并使用轻量线性门控融合输出。它具有三个吸引人的优点:1) MoH在单个模型中联合利用长期预见和短期精度,提高了复杂任务的性能和泛化能力。2) MoH对全注意力动作模块即插即用,训练或推理开销极小。3) MoH支持自适应视野的动态推理,通过跨视野共识选择稳定动作,实现比基线高2.5倍的吞吐量,同时保持优越性能。在基于流的策略$π_0$、$π_{0.5}$和单步回归策略$π_{\text{reg}}$上的大量实验表明,MoH在仿真和真实世界任务上均取得一致且显著的提升。值得注意的是,在混合任务设置下,带有MoH的$π_{0.5}$在LIBERO上仅经过$30k$次训练迭代即达到99$\%$的平均成功率,创下新纪录。项目页面:https://timsty1.github.io/moh/

英文摘要

Vision-language-action (VLA) models have shown remarkable capabilities in robotic manipulation, but their performance is sensitive to the $\textbf{action chunk length}$ used during training, termed $\textbf{horizon}$. Our empirical study reveals an inherent trade-off: longer horizons provide stronger global foresight but degrade fine-grained accuracy, while shorter ones sharpen local control yet struggle on long-term tasks, implying that any fixed choice of a single horizon is suboptimal. To mitigate the trade-off, we propose a $\textbf{mixture of horizons (MoH)}$ strategy. MoH rearranges the action chunk into several segments with different horizons, processes them in parallel with a shared action transformer, and fuses outputs with a light linear gate. It has three appealing benefits. 1) MoH exploits long-term foresight and short-term precision jointly within a single model, improving both performance and generalizability to complex tasks. 2) MoH is plug-and-play for full-attention action modules with minimal training or inference overhead. 3) MoH enables dynamic inference with adaptive horizons, which selects stable actions through cross-horizon consensus, achieving 2.5$\times$ higher throughput than baselines while preserving superior performance. Extensive experiments on flow-based policies $π_0$, $π_{0.5}$, and one-step regression policy $π_{\text{reg}}$ demonstrate that MoH yields consistent and significant gains on both simulated and real-world tasks. Notably, under the mixed-task setting, $π_{0.5}$ with MoH reaches a new state-of-the-art with 99$\%$ average success rate on LIBERO after only $30k$ training iterations. Project page: https://timsty1.github.io/moh/
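
The fusion step described above (per-horizon predictions combined through a light gate) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: actions are scalars, the gate is a fixed per-horizon logit rather than a learned linear layer, and `fuse_horizons` is a hypothetical name.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_horizons(predictions, gate_logits):
    """Fuse per-horizon action predictions step by step.

    predictions: dict mapping horizon h -> list of h scalar actions.
    gate_logits: dict mapping horizon h -> gate logit (assumed fixed here).
    At each step t, only horizons that still cover t (h > t) take part in the gate.
    """
    max_horizon = max(predictions)
    fused = []
    for t in range(max_horizon):
        active = [h for h in sorted(predictions) if h > t]
        weights = softmax([gate_logits[h] for h in active])
        fused.append(sum(w * predictions[h][t] for w, h in zip(weights, active)))
    return fused


# Two horizons: a short one (2 steps) and a long one (4 steps).
predictions = {2: [1.0, 1.0], 4: [3.0, 3.0, 3.0, 3.0]}
gate_logits = {2: 0.0, 4: 0.0}  # equal logits, so overlapping steps are averaged
fused = fuse_horizons(predictions, gate_logits)
```

With equal gate logits the first two steps average both horizons, and the remaining steps fall back to the long horizon alone. A learned gate would shift that balance per step.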

2511.19394 2026-06-01 cs.CV 版本更新

BackSplit: The Importance of Sub-dividing the Background in Biomedical Lesion Segmentation

BackSplit:在生物医学病灶分割中细分背景的重要性

Rachit Saluja, Asli Cihangir, Ruining Deng, Johannes C. Paetzold, Fengbei Liu, Mert R. Sabuncu

发表机构 * Cornell University(康奈尔大学) Cornell Tech(康奈尔科技校区) Weill Cornell Medicine(威尔康奈尔医学院)

AI总结 提出BackSplit方法,通过将背景细分为多个子类(如组织、器官)进行训练,在不增加推理成本的情况下显著提升小病灶分割性能,并从信息论角度证明其有效性。

Comments Accepted to CVPR 2026

详情
AI中文摘要

在医学图像中分割小病灶仍然非常困难。大多数先前的工作通过设计更好的架构、损失函数或数据增强方案,以及收集更多标注数据来应对这一挑战。我们采取不同的观点,认为部分问题在于背景的建模方式。常见的病灶分割将所有非病灶像素合并为单一的“背景”类,忽略了病灶出现的丰富解剖背景。实际上,背景是高度异质的——由组织、器官和其他结构组成,这些结构现在可以手动标注或使用现有分割模型自动推断。在本文中,我们认为使用细分背景类的细粒度标签进行训练(我们称之为BackSplit)是一种简单而强大的范式,可以在不增加推理成本的情况下提供显著的性能提升。从信息论的角度,我们证明BackSplit相对于传统的二值训练增加了期望的Fisher信息,从而得到更紧的渐近界和更稳定的优化。通过在多个数据集和架构上进行大量实验,我们经验性地表明,即使辅助标签是使用预训练分割模型自动生成的,BackSplit也能持续提升小病灶分割性能。此外,我们证明从交互式分割框架中导出的辅助标签也表现出相同的有利效果,展示了其鲁棒性、简单性和广泛的适用性。

英文摘要

Segmenting small lesions in medical images remains notoriously difficult. Most prior work tackles this challenge by designing better architectures, loss functions, or data augmentation schemes, or by collecting more labeled data. We take a different view, arguing that part of the problem lies in how the background is modeled. Common lesion segmentation collapses all non-lesion pixels into a single "background" class, ignoring the rich anatomical context in which lesions appear. In reality, the background is highly heterogeneous, composed of tissues, organs, and other structures that can now be labeled manually or inferred automatically using existing segmentation models. In this paper, we argue that training with fine-grained labels that sub-divide the background class, which we call BackSplit, is a simple yet powerful paradigm that can offer a significant performance boost without increasing inference costs. From an information-theoretic standpoint, we prove that BackSplit increases the expected Fisher Information relative to conventional binary training, leading to tighter asymptotic bounds and more stable optimization. With extensive experiments across multiple datasets and architectures, we empirically show that BackSplit consistently boosts small-lesion segmentation performance, even when auxiliary labels are generated automatically using pretrained segmentation models. Additionally, we demonstrate that auxiliary labels derived from interactive segmentation frameworks exhibit the same beneficial effect, demonstrating BackSplit's robustness, simplicity, and broad applicability.
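
The label construction implied by the abstract can be sketched as below. This is an illustrative guess at the bookkeeping, not the paper's code: the class-id layout (0 = residual background, 1 = lesion, 2 and above = anatomical structures) and both function names are assumptions.

```python
LESION = 1  # class id reserved for the lesion in this sketch


def build_backsplit_target(lesion_mask, anatomy_labels):
    """Merge a binary lesion mask with auxiliary anatomy labels into one target.

    lesion_mask: list of 0/1 per pixel.
    anatomy_labels: list of ints per pixel, 0 meaning "no structure labeled".
    Lesion pixels always win; other pixels get their anatomy id shifted by one
    so that it does not collide with the lesion class.
    """
    target = []
    for is_lesion, anatomy in zip(lesion_mask, anatomy_labels):
        if is_lesion:
            target.append(LESION)
        elif anatomy == 0:
            target.append(0)
        else:
            target.append(anatomy + 1)
    return target


def collapse_to_binary(predicted_classes):
    """At inference, every non-lesion class is reported as background."""
    return [1 if c == LESION else 0 for c in predicted_classes]


lesion_mask = [0, 1, 0, 0]
anatomy_labels = [0, 2, 1, 2]  # e.g. from a pretrained organ segmentation model
target = build_backsplit_target(lesion_mask, anatomy_labels)
binary = collapse_to_binary(target)
```

Only the training target changes. The network and the inference pass are the same as in binary training apart from a wider output layer, which is consistent with the abstract's claim of no added inference cost.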

2511.17380 2026-06-01 cs.CV cs.LG 版本更新

Non-Parametric Probabilistic Robustness: A Conservative Risk Estimator under Unknown Perturbation Distributions

非参数概率鲁棒性:未知扰动分布下的保守风险估计

Zheng Wang, Yi Zhang, Siddartha Khastgir, Carsten Maple, Xingyu Zhao

发表机构 * WMG, University of Warwick, Coventry, United Kingdom(沃里克大学WMG,考文垂,英国) Wuhan University, Wuhan, China(武汉大学,武汉,中国)

AI总结 提出非参数概率鲁棒性(NPPR)度量,通过从数据中学习扰动分布,在分布不确定性下实现保守的概率鲁棒性估计,并基于高斯混合模型开发估计器。

详情
AI中文摘要

深度学习模型尽管取得了显著成功,但仍然容易受到微小输入扰动的影响,导致错误输出,这促使最近提出概率鲁棒性(PR)作为对抗鲁棒性(AR)的补充替代方案。然而,现有的PR公式假设扰动分布固定且已知,这在实践中是不现实的期望。为了解决这一限制,我们提出了非参数概率鲁棒性(NPPR),一种更实用的PR度量,不依赖于任何预定义的扰动分布。遵循统计建模中的非参数范式,NPPR直接从数据中学习优化的扰动分布,从而在分布不确定性下实现保守的PR评估。我们进一步开发了基于高斯混合模型(GMM)的NPPR估计器,涵盖了各种输入相关和输入无关的扰动场景。理论分析建立了AR、PR和NPPR之间的关系。在CIFAR-10、CIFAR-100和Tiny ImageNet上使用ResNet18/50、WideResNet50和VGG16的大量实验验证了NPPR作为更实用的鲁棒性度量,与假设最先进技术中使用的常见扰动分布相比,显示出保守(较低)的PR估计。

英文摘要

Deep learning (DL) models, despite their remarkable success, remain vulnerable to small input perturbations that can cause erroneous outputs, motivating the recent proposal of probabilistic robustness (PR) as a complementary alternative to adversarial robustness (AR). However, existing PR formulations assume a fixed and known perturbation distribution, an unrealistic expectation in practice. To address this limitation, we propose non-parametric probabilistic robustness (NPPR), a more practical PR metric that does not rely on any predefined perturbation distribution. Following the non-parametric paradigm in statistical modeling, NPPR learns an optimized perturbation distribution directly from data, enabling conservative PR evaluation under distributional uncertainty. We further develop an NPPR estimator based on a Gaussian Mixture Model (GMM), covering various input-dependent and input-independent perturbation scenarios. Theoretical analyses establish the relationships among AR, PR, and NPPR. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet across ResNet18/50, WideResNet50 and VGG16 validate NPPR as a more practical robustness metric, showing conservative (lower) PR estimates compared with those obtained by assuming the common perturbation distributions used in state-of-the-art methods.
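
The difference between a fixed-distribution PR estimate and a conservative one can be illustrated with a small Monte Carlo sketch. It is a loose stand-in for the paper's estimator: the classifier is a 1D threshold, and the "learned" distribution is replaced by a minimum over a hand-picked set of Gaussian mixtures. Both simplifications are assumptions, as are all names and parameters.

```python
import random


def sample_gmm(rng, components):
    """Draw one perturbation from a 1D Gaussian mixture of (weight, mean, std) tuples."""
    weights = [weight for weight, _, _ in components]
    _, mean, std = rng.choices(components, weights=weights, k=1)[0]
    return rng.gauss(mean, std)


def classify(x):
    """Toy classifier: a threshold at zero."""
    return 1 if x > 0.0 else 0


def estimate_pr(x, components, n_samples=2000, seed=0):
    """Fraction of sampled perturbations that leave the prediction unchanged."""
    rng = random.Random(seed)
    label = classify(x)
    kept = sum(
        1 for _ in range(n_samples)
        if classify(x + sample_gmm(rng, components)) == label
    )
    return kept / n_samples


x = 1.0
fixed = [(1.0, 0.0, 0.3)]                        # the usual "known" zero-mean Gaussian
shifted = [(0.5, -0.8, 0.3), (0.5, 0.8, 0.3)]    # mixture with mass pushed toward the boundary
candidates = {"fixed": fixed, "shifted": shifted}

pr_by_distribution = {name: estimate_pr(x, comps) for name, comps in candidates.items()}
conservative_pr = min(pr_by_distribution.values())
```

The conservative value is by construction no larger than the estimate under the fixed Gaussian, mirroring the lower PR estimates the abstract reports. The paper optimizes the distribution rather than enumerating candidates.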

2511.17185 2026-06-01 cs.CV 版本更新

PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention

PostCam: 基于查询共享交叉注意力的相机可控新视角视频生成

Yipeng Chen, Zhichao Ye, Zhenzhou Fang, Xinyu Chen, Xiaoyu Zhang, Jialing Liu, Nan Wang, Guofeng Zhang, Haomin Liu

发表机构 * State Key Lab of CAD&CG, Zhejiang University(浙江大学CAD与CG国家重点实验室) Shanghai InSpatio Intelligent Technology Co., Ltd.(上海InSpatio智能技术有限公司)

AI总结 提出PostCam框架,通过查询共享交叉注意力机制对齐6自由度姿态和渲染特征,实现动态场景中高细节保持和精确相机轨迹编辑的新视角视频生成。

详情
AI中文摘要

我们提出了PostCam,一个用于新视角视频生成的简化框架,在动态场景中实现了优越的细节保留和精确的相机轨迹编辑。当前方法常常在基于姿态的控制(缺乏视觉细节)和基于渲染的引导(对几何精度过于敏感)之间权衡。尽管最近有混合尝试,但由于缺乏有效的跨模态对齐,实现精确的运动和视觉一致性仍然具有挑战性。我们认为,稳健的控制源于多模态信号的深度对齐,而不是增加输入复杂性。我们的核心贡献是查询共享交叉注意力机制,它将6自由度姿态和渲染特征投影到统一的潜在空间中。这使得模型在去噪过程中能够自发地实现运动线索和像素级引导之间的内在一致性。实验表明,PostCam在保持高保真视觉细节的同时,在轨迹精度上比最先进的方法提高了20%,在复杂动态场景中表现出卓越的鲁棒性。我们的项目网页公开在:https://cccqaq.github.io/PostCam.github.io/

英文摘要

We propose PostCam, a streamlined framework for novel-view video generation that achieves superior detail preservation and precise camera trajectory editing in dynamic scenes. Current methods often struggle with a trade-off between pose-based control, which lacks visual detail, and rendering-based guidance, which is overly sensitive to geometric accuracy. Despite recent hybrid attempts, achieving precise motion and visual consistency remains challenging due to the lack of effective cross-modal alignment. We argue that robust control stems from the deep alignment of multimodal signals rather than increased input complexity. Our core contribution is the Query-Shared Cross-Attention mechanism, which projects 6-DoF poses and rendered features into a unified latent space. This allows the model to spontaneously achieve intrinsic consistency between motion cues and pixel-level guidance during denoising. Experiments demonstrate that PostCam maintains high-fidelity visual details while outperforming state-of-the-art methods by 20% in trajectory precision, exhibiting superior robustness in complex dynamic scenes. Our project webpage is publicly available at: https://cccqaq.github.io/PostCam.github.io/

2511.15692 2026-06-01 cs.CV 版本更新

Hyperspectral Image Classification using Spectral-Spatial Mixer Network

高光谱图像分类的光谱-空间混合器网络

Mohammed Q. Alkhatib

发表机构 * College of Engineering and IT(工程与信息技术学院)

AI总结 提出SS-MixNet轻量级深度学习模型,通过3D卷积和并行MLP混合器模块提取局部与长距离光谱-空间特征,在1%标注数据下实现高精度高光谱图像分类。

Comments Accepted and published in IEEE WHISPERS2025

详情
AI中文摘要

本文介绍了SS-MixNet,一种用于高光谱图像(HSI)分类的轻量级且有效的深度学习模型。该架构将用于局部光谱-空间特征提取的3D卷积层与两个并行的MLP风格混合器模块相结合,以捕获光谱和空间维度上的长距离依赖关系。采用基于深度可分离卷积的注意力机制,以最小的计算开销增强判别能力。该模型在QUH-Tangdaowan和QUH-Qingyun数据集上进行了评估,仅使用1%的标注数据进行训练和验证。SS-MixNet在比较的方法中取得了最高性能,包括2D-CNN、3D-CNN、IP-SWIN、SimPoolFormer和HybridKAN,在Tangdaowan和Qingyun数据集上分别达到了95.68%和93.86%的总体准确率。由定量指标和分类图支持的结果证实了该模型在有限监督下提供准确且鲁棒预测的有效性。代码将在以下网址公开:https://github.com/mqalkhatib/SS-MixNet

英文摘要

This paper introduces SS-MixNet, a lightweight and effective deep learning model for hyperspectral image (HSI) classification. The architecture integrates 3D convolutional layers for local spectral-spatial feature extraction with two parallel MLP-style mixer blocks that capture long-range dependencies in spectral and spatial dimensions. A depthwise convolution-based attention mechanism is employed to enhance discriminative capability with minimal computational overhead. The model is evaluated on the QUH-Tangdaowan and QUH-Qingyun datasets using only 1% of labeled data for training and validation. SS-MixNet achieves the highest performance among compared methods, including 2D-CNN, 3D-CNN, IP-SWIN, SimPoolFormer, and HybridKAN, reaching 95.68% and 93.86% overall accuracy on the Tangdaowan and Qingyun datasets, respectively. The results, supported by quantitative metrics and classification maps, confirm the model's effectiveness in delivering accurate and robust predictions with limited supervision. The code will be made publicly available at: https://github.com/mqalkhatib/SS-MixNet

2510.22067 2026-06-01 cs.CV 版本更新

Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation

捕捉注视转移以引导:跨模态融合增强用于VLM幻觉缓解

Zheng Qi, Chao Shang, Evangelia Spiliopoulou, Nikolaos Pappas

发表机构 * AWS AI Labs(AWS人工智能实验室)

AI总结 提出GIFT方法,通过预计算视觉显著性图并跟踪注视转移,在解码时增强对显著视觉信息和用户查询的注意力,以缓解视觉语言模型中的幻觉问题。

Comments ICML 2026

详情
AI中文摘要

视觉语言模型(VLM)经常产生幻觉,即无法由文本或视觉输入证实的内容。先前的工作主要将其归因于过度依赖语言先验知识而非视觉输入。一些方法尝试通过按注意力分数比例放大视觉令牌注意力来缓解幻觉。然而,这些方法忽视了视觉注意力沉没问题,即注意力经常被错误分配到与任务无关的视觉区域,并且忽略了跨模态融合平衡,仅增强视觉注意力而不调整对用户查询的注意力。这可能导致放大错误区域,同时无法正确解释用户查询。为解决这些挑战,我们提出了一种简单而有效的方法,称为注视转移引导的跨模态融合增强(GIFT)。GIFT通过在用户查询理解过程中跟踪视觉注意力的正向变化(即“注视转移”),预计算整体视觉显著性图,并利用该图在每个解码步骤放大对显著视觉信息和用户查询的注意力。这减少了视觉注意力沉没的影响,因为无关令牌的转移最小,同时确保平衡的跨模态融合以获得良好整合的表示。大量实验表明,GIFT在生成和分类任务中均有效缓解了VLM的幻觉,与贪婪解码相比实现了高达20.7%的改进,同时以低计算开销保持了通用的视觉语言性能。

英文摘要

Vision language models (VLMs) often generate hallucination, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of visual attention sink, as irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.
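
The "gaze shift" idea can be sketched as follows: accumulate only the positive changes in attention over visual tokens while the query is read, then use the normalized result to reweight attention at decoding time. The exact formulation in GIFT may differ. The multiplicative boost, the `alpha` parameter, and the function names are assumptions, and the sketch covers only the visual side, not the accompanying boost to query-token attention.

```python
def gaze_shift_saliency(attention_steps):
    """Sum positive step-to-step increases in attention per visual token, then normalize.

    attention_steps: list of attention distributions over visual tokens,
    one per query token processed.
    """
    num_tokens = len(attention_steps[0])
    shifts = [0.0] * num_tokens
    for previous, current in zip(attention_steps, attention_steps[1:]):
        for i in range(num_tokens):
            shifts[i] += max(0.0, current[i] - previous[i])
    total = sum(shifts)
    if total == 0.0:
        return [1.0 / num_tokens] * num_tokens  # no shifts observed: fall back to uniform
    return [s / total for s in shifts]


def amplify(attention, saliency, alpha=1.0):
    """Boost attention in proportion to saliency and renormalize to sum to one."""
    boosted = [a * (1.0 + alpha * s) for a, s in zip(attention, saliency)]
    norm = sum(boosted)
    return [b / norm for b in boosted]


# Token 0 behaves like an attention sink (high but never increasing),
# token 1 gains attention as the query is read, token 2 stays flat.
attention_steps = [
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.5, 0.4, 0.1],
]
saliency = gaze_shift_saliency(attention_steps)
decoding_attention = amplify([0.6, 0.3, 0.1], saliency, alpha=1.0)
```

Scaling by raw attention scores would favor token 0, the sink. Tracking shifts instead assigns it zero saliency, so the boost goes to the token whose attention grew during query comprehension.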

2511.05875 2026-06-01 cs.HC cs.AI cs.CV 版本更新

Towards a Humanized Social-Media Ecosystem: AI-Augmented HCI Design Patterns for Safety, Agency & Well-Being

迈向人性化的社交媒体生态系统:面向安全、自主与福祉的AI增强人机交互设计模式

Mohd Ruhul Ameen, Akif Islam

发表机构 * College of Engineering and Computer Sciences, Marshall University, Huntington, WV, USA(马歇尔大学工程与计算机科学学院,美国西弗吉尼亚州亨廷顿) Department of Computer Science and Engineering, University of Rajshahi, Rajshahi 6205, Bangladesh(拉杰沙希大学计算机科学与工程系,孟加拉国拉杰沙希 6205)

AI总结 提出Human-Layer AI(HL-AI)框架,通过浏览器端用户拥有的可解释中介,在不依赖平台合作的情况下赋予用户实时控制权,实现内容重写、完整性检测、信息流定制、行为中断和恢复模式等五种设计模式,以提升社交媒体安全性与用户福祉。

Comments 6 pages, 5 tables, 7 figures, and 2 algorithm tables. Accepted at International Conference on Signal Processing, Information, Communication and Systems (SPICSCON 2025)

详情
Journal ref
2025 IEEE International Conference on Signal Processing, Information, Communication and Systems (SPICSCON)
AI中文摘要

社交平台连接了数十亿人,但其以参与度优先的算法往往对用户施加影响而非与用户协作,加剧了压力、虚假信息和失控感。我们提出Human-Layer AI(HL-AI)——用户拥有的、可解释的中介,位于浏览器中平台逻辑与界面之间。HL-AI赋予人们实用的、即时的控制权,无需平台合作。我们贡献了一个可用的Chrome/Edge原型,实现了五种代表性模式框架——上下文感知帖子重写器、帖子完整性检测器、精细信息流策展器、微退出代理和恢复模式——以及一个统一的数学公式,平衡用户效用、自主成本和风险阈值。评估涵盖技术准确性、可用性和行为结果。结果是一套人性化的控制手段,帮助用户在伤害发生前重写内容、通过完整性提示阅读、有意图地调整信息流、暂停强迫性循环以及在骚扰期间寻求庇护,同时通过解释和覆盖选项保留自主权。该原型为改造当今的信息流以融入安全性、自主性和福祉提供了实用路径,并邀请进行严格的跨文化用户评估。

英文摘要

Social platforms connect billions of people, yet their engagement-first algorithms often work on users rather than with them, amplifying stress, misinformation, and a loss of control. We propose Human-Layer AI (HL-AI)--user-owned, explainable intermediaries that sit in the browser between platform logic and the interface. HL-AI gives people practical, moment-to-moment control without requiring platform cooperation. We contribute a working Chrome/Edge prototype implementing five representative pattern frameworks--Context-Aware Post Rewriter, Post Integrity Meter, Granular Feed Curator, Micro-Withdrawal Agent, and Recovery Mode--alongside a unifying mathematical formulation balancing user utility, autonomy costs, and risk thresholds. Evaluation spans technical accuracy, usability, and behavioral outcomes. The result is a suite of humane controls that help users rewrite before harm, read with integrity cues, tune feeds with intention, pause compulsive loops, and seek shelter during harassment, all while preserving agency through explanations and override options. This prototype offers a practical path to retrofit today's feeds with safety, agency, and well-being, inviting rigorous cross-cultural user evaluation.

2510.15710 2026-06-01 cs.CV 版本更新

UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis

UniMedVL: 通过观察-知识-分析统一医学多模态理解与生成

Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Zhongying Deng, Lihao Liu, Ming Hu, Junjun He

发表机构 * Shanghai Artificial Intelligence Laboratory Shanghai Innovation Institute Shanghai Jiao Tong University Shanghai Institute of Optics Fudan University University of Cambridge Monash University DAMO Academy, Alibaba Group Imperial College London The University of Hong Kong The Hong Kong University of Science Hupan Lab The Chinese University of Hong Kong

AI总结 提出首个统一医学模型UniMedVL,通过渐进式训练流水线融合多模态理解与生成能力,并在8种影像模态的5.6M实例数据集上验证其性能。

Comments This submission has been converted to the ICML template

详情
AI中文摘要

医学工作流程通常结合阅读图像与生成视觉和文本输出,使得图像理解和生成成为医学AI的核心。然而,大多数现有系统在孤立模型中处理这些能力,失去了统一架构可以利用的共享知识。为弥合这一差距,我们提出了UniMedVL,这是第一个在单个模型中无缝集成多模态理解和生成能力而无需切换权重的统一医学模型。我们通过定制的渐进式训练流水线实现这一点,其中理解和生成相互增强。为有效训练UniMedVL,我们整理了UniMedVL-5M,这是第一个大规模医学数据集,包含跨越8种医学影像模态的超过560万个实例,专为统一医学理解和生成中的多模态输入输出任务设计。实验结果表明,UniMedVL在五个医学理解基准上取得了有竞争力的性能。关键的是,UniMedVL原生支持多种交错生成任务,例如虚拟染色、超分辨率、跨模态合成,这些对于复杂的医学工作流程至关重要。我们的代码和数据集已公开。

英文摘要

Medical workflows routinely combine reading images with producing visual and textual outputs, making both image understanding and generation central to medical AI. Most existing systems, however, address these abilities in isolated models, losing the shared knowledge that a unified architecture could exploit. To bridge this gap, we present UniMedVL, the first unified medical model that seamlessly integrates multimodal understanding and generation capabilities within a single model without switching weights. We achieve this via a tailored progressive training pipeline where understanding and generation mutually reinforce each other. To effectively train UniMedVL, we curate UniMedVL-5M, the first large-scale medical dataset comprising over 5.6M instances across 8 medical imaging modalities, tailored for multimodal input-output tasks in unified medical understanding and generation. Experimental results demonstrate that UniMedVL achieves competitive performance on five medical understanding benchmarks. Crucially, UniMedVL natively supports diverse interleaved generation tasks, e.g., virtual staining, super-resolution, cross-modal synthesis, essential for complex medical workflows. Our code and dataset are publicly available.

2506.22304 2026-06-01 cs.LG cs.CV 版本更新

Unfolding Generative Flows with Koopman Operators: Trajectory-Preserving Linearization

利用Koopman算子展开生成流:轨迹保持的线性化

Erkan Turan, Aristotelis Siozopoulos, Louis Martinez, Julien Gaubil, Emery Pierson, Maks Ovsjanikov

发表机构 * University of Athens, Greece(雅典大学)

AI总结 提出基于Koopman理论的全局线性化方法,将预训练的条件流匹配模型提升到高维Koopman空间,实现轨迹保持的线性化,从而支持一步并行采样和生成轨迹的谱分析。

详情
AI中文摘要

连续归一化流(CNFs)实现了优雅的生成建模,但受限于其迭代性质,需要昂贵的采样且缺乏中间状态的可解释性。最近的方法通过拉直轨迹或蒸馏端点来加速采样,但将原始生成过程视为黑箱,丢弃了教师模型的中间动态。我们提出了一种根本不同的视角:通过Koopman理论全局线性化流动态,以实现轨迹保持的线性化。通过将预训练的条件流匹配(CFM)模型提升到高维Koopman空间,我们用单个线性算子表示其演化。关键的是,与仅边界蒸馏不同,我们的方法沿整个生成路径强制与教师向量场保持无穷小一致性。我们推导了一个实用的、无模拟的训练目标,确保这种全局对齐,并带来两个关键优势。首先,采样变为一步且可并行化。其次,由于线性化忠实于动态,Koopman算子提供了对生成的独特见解。我们证明,这种结构能够实现先前方法无法实现的新应用,包括发现语义一致的编辑方向、使用与教师对齐的线性算子进行反演以及类条件谱特征。实验上,我们的方法在实现竞争性样本质量的同时,能够对生成流的整个轨迹进行谱分析和控制。

英文摘要

Continuous Normalizing Flows (CNFs) enable elegant generative modeling but remain bottlenecked by their iterative nature requiring costly sampling and lacking interpretability of the intermediate states. Recent approaches accelerate sampling by straightening trajectories or distilling endpoints, yet they treat the original generative process as a black box, discarding the teacher's intermediate dynamics. We propose a fundamentally different perspective: globally linearizing flow dynamics via Koopman theory to achieve trajectory-preserving linearization. By lifting a pre-trained Conditional Flow Matching (CFM) model into a higher-dimensional Koopman space, we represent its evolution with a single linear operator. Crucially, unlike boundary-only distillation, our method enforces infinitesimal consistency with the teacher's vector field along the full generative path. We derive a practical, simulation-free training objective that ensures this global alignment and yields two key benefits. First, sampling becomes one-step and parallelizable. Second, because the linearization is faithful to the dynamics, the Koopman operator provides unique insights on the generation. We demonstrate that this structure enables novel applications unavailable in prior approaches, including discovery of semantically coherent editing directions, inversion with a teacher-aligned linear operator and class-conditional spectral signatures. Empirically, our approach achieves competitive sample quality, while enabling spectral analysis and control of the entire trajectories of generative flows.
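The one-step sampling claim follows from linearity: if the lifted state z evolves as dz/dt = K z, then z(1) = exp(K) z(0), so a single matrix product replaces the iterative solver. The sketch below illustrates only this identity. The 2x2 operator `K` is a hypothetical toy (a rotation generator), and the learned lifting and decoding maps from the paper are not reproduced.

```python
import math

def mat_mul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def mat_exp(k, terms=30):
    """Matrix exponential via a truncated Taylor series: sum of K^m / m!."""
    n = len(k)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for m in range(1, terms):
        term = mat_mul(term, k)
        term = [[v / m for v in row] for row in term]  # now equals K^m / m!
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def one_step(k, z0):
    """Jump from t=0 to t=1 in a single linear map: z(1) = exp(K) z(0)."""
    e = mat_exp(k)
    return [sum(e[i][j] * z0[j] for j in range(len(z0))) for i in range(len(e))]

def euler(k, z0, steps):
    """Iterative baseline: integrate dz/dt = K z with explicit Euler steps."""
    z, dt = list(z0), 1.0 / steps
    for _ in range(steps):
        dz = [sum(k[i][j] * z[j] for j in range(len(z))) for i in range(len(k))]
        z = [zi + dt * dzi for zi, dzi in zip(z, dz)]
    return z

theta = math.pi / 2
K = [[0.0, -theta], [theta, 0.0]]      # hypothetical toy operator (assumption)
z1 = one_step(K, [1.0, 0.0])           # a quarter-turn rotation of (1, 0)
z1_iter = euler(K, [1.0, 0.0], 10000)  # many small steps reach nearly the same point
```

Because `exp(K)` does not depend on the initial state, it can be computed once and applied to a whole batch of initial states in parallel.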

2510.17700 2026-06-01 cs.CV 版本更新

Elastic ViTs from Pretrained Models without Retraining

无需重新训练的预训练模型弹性ViTs

Walter Simoncini, Michael Dorkenwald, Tijmen Blankevoort, Cees G. M. Snoek, Yuki M. Asano

发表机构 * University of Technology Nuremberg(纽伦堡工业大学) University of Amsterdam(阿姆斯特丹大学) NVIDIA(英伟达)

AI总结 提出SnapViT方法,通过结合梯度信息与进化算法近似跨网络结构相关性,实现无需重训练的结构化剪枝,支持连续计算预算下的弹性推理。

Comments Accepted at NeurIPS 2025

详情
AI中文摘要

视觉基础模型取得了显著性能,但仅以有限的预定尺寸可用,导致在现实约束下部署选择次优。我们引入SnapViT:用于剪枝视觉Transformer的单次网络近似,一种新的后预训练结构化剪枝方法,可在连续计算预算范围内实现弹性推理。我们的方法高效地将梯度信息与跨网络结构相关性相结合,通过进化算法近似,无需标注数据,可推广到无分类头的模型,且无需重训练。在DINO、SigLIPv2、DeIT和AugReg模型上的实验表明,在各种稀疏度下,该方法优于最先进方法,在单个A100 GPU上不到五分钟即可生成可调整到任何计算预算的弹性模型。我们的主要贡献包括:一种针对预训练ViT的高效剪枝策略,一种新颖的Hessian非对角结构的进化近似,以及一种无需重训练或标签即可保持强大性能的自监督重要性评分机制。代码和剪枝模型可在https://elastic.ashita.nl/获取。

英文摘要

Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm, does not require labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeIT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: https://elastic.ashita.nl/
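As a rough illustration of elastic inference from a single ranking, the sketch below keeps the top-scoring prunable units for any requested budget. It assumes the importance scores are already available. The scores, the function name `elastic_mask`, and the `keep_ratio` parameter are hypothetical, and SnapViT's actual scoring (gradient information combined with an evolutionary approximation of Hessian off-diagonal structure) is not reproduced here.

```python
def elastic_mask(importance, keep_ratio):
    """Return a keep/drop mask that retains the highest-scoring units."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    n_keep = max(1, round(len(importance) * keep_ratio))
    # Rank unit indices once, from most to least important.
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    kept = set(ranked[:n_keep])
    return [i in kept for i in range(len(importance))]

scores = [0.9, 0.1, 0.5, 0.7, 0.3, 0.2]   # made-up per-unit importance scores
mask_50 = elastic_mask(scores, 0.5)       # keep half of the units
mask_33 = elastic_mask(scores, 1 / 3)     # keep a third of the units
```

Since every budget reuses the same ranking, the units kept at a smaller budget are always a subset of those kept at a larger one, which is what lets one scored model serve a range of compute budgets.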

2510.09364 2026-06-01 cs.CV 版本更新

VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes

VAD-GS:动态城市场景中3D高斯泼溅的可见性感知致密化

Yikang Zhang, Rui Fan

发表机构 * Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University(同济大学智能自主系统研究所) College of Electronic and Information Engineering, Tongji University(同济大学电子与信息工程学院) National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University(西安交通大学人机混合增强智能国家实验室)

AI总结 提出VAD-GS框架,通过体素可见性推理、多样性感知视图选择和多视图立体重建,在动态城市场景中恢复缺失几何结构,提升3D高斯泼溅的重建质量。

详情
AI中文摘要

3D高斯泼溅(3DGS)在合成高保真新视角方面表现出色。然而,其有效性关键取决于初始化点云的质量。具体而言,要实现对底层场景结构的均匀且完整的点覆盖,需要重叠的观察视锥,这一假设在无边界、动态的城市环境中经常被违反。使用部分初始化的点云训练高斯模型通常会导致失真和伪影,因为相机射线可能无法与有效表面相交,导致梯度错误传播到与遮挡或不可见几何体关联的高斯基元。此外,现有的致密化策略只是从现有基元中克隆和分割高斯基元,无法从缺失结构中重建几何体。为解决这些限制,我们提出了VAD-GS,一个专为具有挑战性的城市场景中几何恢复设计的3DGS框架。我们的方法通过基于体素的可见性推理识别不可靠的几何结构,通过多样性感知视图选择选择信息丰富的支持视图,并通过多视图立体重建恢复缺失结构。这种设计使得即使在缺乏初始点的区域,也能在可靠几何先验的指导下生成新的高斯基元。在Waymo和nuScenes数据集上的大量实验表明,VAD-GS优于最先进的3DGS方法,并显著提高了静态和动态物体的重建几何质量。我们的项目网页位于mias.group/VAD-GS。

英文摘要

3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split existing Gaussian primitives, so they cannot reconstruct geometry in regions where structure is missing. To address these limitations, we propose VAD-GS, a 3DGS framework tailored for geometry recovery in challenging urban scenes. Our method identifies unreliable geometric structures via voxel-based visibility reasoning, selects informative supporting views through diversity-aware view selection, and recovers missing structures via multi-view stereo reconstruction. This design enables the generation of new Gaussian primitives guided by reliable geometric priors, even in regions lacking initial points. Extensive experiments on the Waymo and nuScenes datasets demonstrate that VAD-GS outperforms state-of-the-art 3DGS approaches and significantly improves the quality of reconstructed geometry for both static and dynamic objects. Our project webpage is at mias.group/VAD-GS.

2510.07135 2026-06-01 cs.CV 版本更新

Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models

遥感视觉语言模型的少样本适应基准

Karim El Khoury, Maxime Zanella, Christophe De Vleeschouwer, Benoit Macq

发表机构 * UCLouvain(法语天主教鲁汶大学) UMons(蒙斯大学) Fonds de la Recherche Scientifique(科学研究基金会)

AI总结 提出首个遥感视觉语言模型少样本适应基准,通过十个数据集和五种策略评估三个模型,发现零样本性能相似的模型在少样本适应下表现差异显著,需开发更鲁棒的方法。

详情
AI中文摘要

遥感视觉语言模型(RSVLMs)得益于大规模预训练,在各种任务上展现出强大的零样本性能。然而,它们在低数据场景(如少样本学习)中的泛化能力尚未得到充分探索。在这项工作中,我们提出了第一个用于评估RSVLMs少样本适应方法的结构化基准。我们在十个遥感场景分类数据集上进行了全面实验,将五种广泛使用的少样本适应策略应用于三个具有不同骨干网络的最先进RSVLMs。我们的发现表明,零样本性能相似的模型在少样本适应下可能表现出显著不同的行为,一些RSVLMs天生比其他模型更适合这种适应。性能的变异性以及现有方法中缺乏明确的优胜者,凸显了为遥感定制更鲁棒的少样本适应方法的必要性。为了促进未来研究,我们提供了一个可复现的基准框架和开源代码,以系统评估RSVLMs在少样本条件下的表现。源代码已在Github上公开:https://github.com/elkhouryk/fewshot_RSVLMs

英文摘要

Remote Sensing Vision-Language Models (RSVLMs) have shown remarkable potential thanks to large-scale pretraining, achieving strong zero-shot performance on various tasks. However, their ability to generalize in low-data regimes, such as few-shot learning, remains insufficiently explored. In this work, we present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets, applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation, with some RSVLMs being inherently more amenable to such adaptation than others. The variability of performance and the absence of a clear winner among existing methods highlight the need for the development of more robust methods for few-shot adaptation tailored to RS. To facilitate future research, we provide a reproducible benchmarking framework and open-source code to systematically evaluate RSVLMs under few-shot conditions. The source code is publicly available on Github: https://github.com/elkhouryk/fewshot_RSVLMs

2510.03876 2026-06-01 cs.CV 版本更新

Skin Lesion Classification Based on ResNet-50 Enhanced With Adaptive Spatial Feature Fusion

基于自适应空间特征融合增强的ResNet-50皮肤病变分类

Runhao Liu, Fengyi Zha, Fei Ding, Guangzhen Yao, Peng Zhang

发表机构 * Polytechnic Institute, Zhejiang University, Hangzhou, China(浙江大学工程师学院) Chu Kochen Honors College, Zhejiang University, Hangzhou, China(浙江大学竺可桢学院) Alibaba Group, Chaoyang District, Beijing, China(北京朝阳区阿里巴巴集团) School of Information Science and Technology, Northeast Normal University, Changchun, China(东北师范大学信息科学与技术学院) School of Mathematical Sciences, Zhejiang University, Hangzhou, China(浙江大学数学科学学院)

AI总结 提出一种结合自适应空间特征融合(ASFF)的改进ResNet-50模型,通过双分支结构融合多尺度语义和细节特征,在ISIC 2020子集上达到93.182%准确率,并有效泛化至ISIC 2019外部验证集。

详情
AI中文摘要

皮肤癌分类因皮肤镜图像中类间相似度高、类内变异大以及伪影的存在而具有挑战性。为解决这些问题,我们提出了一种改进的ResNet-50模型,结合自适应空间特征融合(ASFF),该机制自适应地整合多尺度语义和表面特征以细化表示并减少过拟合。ResNet-50模型通过自适应特征融合机制增强,以实现更有效的多尺度特征提取并提升整体性能。具体而言,双分支设计融合了高层语义特征和中间层细节特征,利用全局平均池化和全连接层生成空间权重,并强调病变相关区域。在ISIC 2020平衡子集(从原始数据集中随机选取的3,297张图像)上评估,基于ASFF的ResNet-50优于多个CNN基线,达到93.182%的准确率,并具有优越的精确率、召回率、特异性和F1分数。其AUC(P-R)达到0.9670,AUC(ROC)达到0.9717。Grad-CAM可视化显示对病变区域的聚焦更加准确。所提模型在ISIC 2019外部验证集上也表现出良好的泛化能力,优于ResNet-50基线。这些发现表明,所提方法为计算机辅助皮肤癌诊断提供了更有效且高效的解决方案。生成代码、权重和混淆矩阵已在https://github.com/Grapesea/ASFF-ResNet50-enhanced开源。

英文摘要

Skin cancer classification is challenging due to high inter-class similarity, intra-class variability, and artifacts in dermoscopic images. To address these issues, we propose an improved ResNet-50 with Adaptive Spatial Feature Fusion (ASFF), which adaptively integrates multi-scale semantic and surface features to refine representations and reduce overfitting. The ResNet-50 model is enhanced with an adaptive feature fusion mechanism to achieve more effective multi-scale feature extraction and improve overall performance. Specifically, a dual-branch design fuses high-level semantic and mid-level detail features, using global average pooling and fully connected layers to produce spatial weights that emphasize lesion-relevant regions. Evaluated on a balanced subset of ISIC 2020 (3,297 images, randomly selected from the original dataset), the ASFF-based ResNet-50 outperforms multiple CNN baselines, achieving 93.182% accuracy with superior precision, recall, specificity, and F1. It also reaches 0.9670 AUC (P-R) and 0.9717 AUC (ROC). Grad-CAM visualizations show more accurate focus on lesion areas. The proposed model also generalizes well to ISIC 2019 external validation, outperforming the ResNet-50 baseline. These findings demonstrate that the proposed approach provides a more effective and efficient solution for computer-aided skin cancer diagnosis. The code, weights, and confusion matrices are open-sourced at https://github.com/Grapesea/ASFF-ResNet50-enhanced.
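For readers unfamiliar with the reported metrics, the sketch below shows how accuracy, precision, recall, specificity, and F1 follow from a binary confusion matrix. The counts are made up for illustration and are not taken from the paper; the function name `binary_metrics` is likewise an assumption.

```python
def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion counts."""
    precision = tp / (tp + fp)        # of predicted positives, how many are correct
    recall = tp / (tp + fn)           # of actual positives, how many are found
    specificity = tn / (tn + fp)      # of actual negatives, how many are rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
    }

# Hypothetical counts, chosen only to make the arithmetic easy to follow.
metrics = binary_metrics(tp=45, fp=5, tn=40, fn=10)
```

This simple version assumes every denominator is nonzero; a production implementation would need to handle empty classes.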

2509.19452 2026-06-01 cs.RO cs.CV cs.LG 版本更新

HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames

HUNT:通过瞬时相对坐标系在非结构化环境中进行高速无人机导航与跟踪

Alessandro Saviolo, Jeffrey Mao, Giuseppe Loianno

发表机构 * New York University(纽约大学) University of California Berkeley(加州大学伯克利分校)

AI总结 提出HUNT框架,基于瞬时相对坐标系将搜索与跟踪统一起来,实现高速飞行和鲁棒自主性。

详情
AI中文摘要

搜索与救援任务要求无人机既能高速穿越未知的非结构化环境,又能在检测到目标后跟踪目标。在感知退化且无全局定位的情况下实现这两种能力仍是一个开放挑战。最近的相对导航工作通过将规划和控制锚定到可见的检测目标上展示了鲁棒跟踪,但在视野中没有目标时无法进行导航。我们提出了HUNT(高速无人机导航与跟踪),一个实时框架,在单一相对公式中统一了穿越、获取和跟踪。HUNT直接从机载瞬时观测量(如姿态、高度和速度)定义导航目标,从而在搜索过程中实现反应式高速飞行。一旦检测到目标,相同的感知-控制管道无缝过渡到跟踪。在茂密森林、集装箱场地以及使用车辆和人体模型的搜索与救援任务中的户外实验表明,在全局方法失败的情况下,该框架实现了鲁棒自主性。

英文摘要

Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT (High-speed UAV Navigation and Tracking), a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail.

2509.21561 2026-06-01 cs.CV 版本更新

Unsupervised Defect Detection for Surgical Instruments

手术器械的无监督缺陷检测

Joseph Huang, Yichi Zhang, Jingxi Yu, Wei Chen, Seunghyun Hwang, Qiang Qiu, Amy R. Reibman, Edward J. Delp, Fengqing Zhu

发表机构 * Purdue University School of Electrical

AI总结 针对手术器械缺陷检测中纹理背景导致误检、小缺陷灵敏度低及领域迁移问题,提出结合背景掩蔽、补丁分析和高效域适应的无监督方法。

详情
AI中文摘要

确保手术器械的安全性需要可靠地检测视觉缺陷。然而,人工检查容易出错,现有的自动缺陷检测方法通常在自然/工业图像上训练,无法有效迁移到手术领域。我们证明,简单地应用或微调这些方法会导致问题:纹理背景引起的误检、对微小缺陷的灵敏度低,以及由于域偏移导致的器械特定特征捕获不足。为了解决这些挑战,我们提出了一种通用方法,专门针对手术器械调整无监督缺陷检测方法。通过集成背景掩蔽、基于补丁的分析策略和高效的域适应,我们的方法克服了这些限制,能够可靠地检测手术器械图像中的细微缺陷。

英文摘要

Ensuring the safety of surgical instruments requires reliable detection of visual defects. However, manual inspection is prone to error, and existing automated defect detection methods, typically trained on natural/industrial images, fail to transfer effectively to the surgical domain. We demonstrate that simply applying or fine-tuning these approaches leads to issues: false positive detections arising from textured backgrounds, poor sensitivity to small, subtle defects, and inadequate capture of instrument-specific features due to domain shift. To address these challenges, we propose a versatile method that adapts unsupervised defect detection methods specifically for surgical instruments. By integrating background masking, a patch-based analysis strategy, and efficient domain adaptation, our method overcomes these limitations, enabling the reliable detection of fine-grained defects in surgical instrument imagery.

2509.20941 2026-06-01 cs.CV 版本更新

Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery

解码手术场景:手术中场景图的范围综述

Angelo Henriques, Korab Hoxha, Daniel Zapp, Peter C. Issa, Nassir Navab, M. Ali Nasseri

发表机构 * School of Computation, Information and Technology, Technical University of Munich(慕尼黑工业大学计算、信息与技术学院) Klinik und Poliklinik für Augenheilkunde, TUM University Hospital(慕尼黑工业大学附属医院眼科) Computer Aided Medical Procedures, Technical University of Munich(慕尼黑工业大学计算机辅助医疗程序研究组) Department of Biomedical Engineering, University of Alberta(阿尔伯塔大学生物医学工程系)

AI总结 本文通过PRISMA-ScR指导的范围综述,系统梳理了手术中场景图(SG)的研究现状,分析了52项研究,揭示了从图神经网络向基础模型和生成式AI的方法论转变,并提出了“验证三位一体”评估框架以弥合临床转化差距。

Comments Submitted and accepted to Medical Image Analysis (DOI: 10.1016/j.media.2026.104083). An interactive version of the summary tables is available at: osf.io/fruq8

详情
Journal ref
Medical Image Analysis (2026)
AI中文摘要

随着手术人工智能从像素级检测向复杂推理过渡,场景图(SG)提供了解码动态手术环境所需的结构化关系表示。本项遵循PRISMA-ScR指南的范围综述系统性地绘制了手术中SG研究的发展格局,分析了52项主要研究,以描绘应用和方法论转变。我们的分析揭示了快速增长,但也发现了一个关键的“数据鸿沟”:内部视角研究(例如,从内窥镜视频中识别三元组)占研究的81%,且几乎完全使用真实世界的2D视频,而外部视角的手术室建模则严重依赖模拟数据。在方法论上,我们识别出从基础图神经网络向专门基础模型和生成式AI的决定性转变,这些模型在2025年合计约占研究的50%。至关重要的是,我们的综合表明,场景图正从简单的描述符演变为必要的“神经符号护栏”,提供结构化、可验证的中间表示,以防止日益自主的手术基础模型产生幻觉。尽管前景广阔,但仍存在一个主要的转化差距:所审查的研究均未进入前瞻性临床验证。我们得出结论,弥合这一差距需要超越标准的计算机视觉指标;因此,我们提出“验证三位一体”——优先考虑语义查询成功率、延迟感知准确率和安全关键召回率——作为将基于图的手术人工智能引入临床实践的必要评估框架。

英文摘要

As surgical AI transitions from pixel-level detection to complex reasoning, Scene Graphs (SGs) offer the structured, relational representations necessary to decode dynamic surgical environments. This PRISMA-ScR-guided scoping review systematically maps the evolving landscape of SG research in surgery, analyzing 52 primary studies to chart applications and methodological shifts. Our analysis reveals rapid growth, yet uncovers a critical 'data divide': internal-view research (e.g., triplet recognition from endoscopic video) accounts for 81% of studies and almost exclusively uses real-world 2D video, while external-view operating room modeling relies heavily on simulated data. Methodologically, we identify a decisive shift from foundational graph neural networks to specialized foundation models and generative AI, which together now account for approximately 50% of research in 2025. Crucially, our synthesis suggests that Scene Graphs are evolving from simple descriptors into essential 'neuro-symbolic guardrails', providing the structured, verifiable intermediate representation needed to prevent hallucinations in increasingly autonomous Surgical Foundation Models. Despite this promise, a major translational gap remains: none of the reviewed studies have proceeded to prospective clinical validation. We conclude that bridging this gap requires moving beyond standard computer vision metrics; we therefore propose the 'Validation Trinity' -- prioritizing Semantic Query Success, Latency-Aware Accuracy, and Safety-Critical Recall -- as the necessary evaluation framework to bring graph-based surgical AI into clinical practice.

2509.18898 2026-06-01 cs.CV 版本更新

DeblurSplat: SfM-free 3D Gaussian Splatting with Event Camera for Robust Deblurring

DeblurSplat:基于事件相机的无SfM三维高斯泼溅鲁棒去模糊方法

Pengteng Li, Yunfan Lu, Pinhao Song, Weiyu Guo, Huizai Yao, F. Richard Yu, Hui Xiong

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) KU Leuven(鲁汶大学) Carleton University(卡尔顿大学)

AI总结 提出首个无需运动恢复结构的去模糊三维高斯泼溅方法,利用密集立体模块和事件流实现高质量新视图合成与高效渲染。

Comments Accepted by TMM 2026

详情
AI中文摘要

本文提出首个无需运动恢复结构(SfM)的基于事件相机的去模糊三维高斯泼溅方法,称为DeblurSplat。我们从两个方面解决运动去模糊问题。首先,利用密集立体模块(DUSt3R)的预训练能力,直接从模糊图像中获取准确的初始点云。无需计算相机位姿作为中间结果,避免了不准确相机位姿到初始点云位置的累积误差传递。其次,将事件流引入去模糊流水线,利用其对动态变化的高敏感性。通过从事件流和模糊图像中解码潜在清晰图像,我们可以为场景重建优化提供细粒度监督信号。在多种场景上的大量实验表明,与去模糊3D-GS的最新方法相比,DeblurSplat不仅在新视图生成中表现出高保真度,而且实现了显著的渲染效率。

英文摘要

In this paper, we propose the first Structure-from-Motion (SfM)-free deblurring 3D Gaussian Splatting method using an event camera, dubbed DeblurSplat. We address the motion-deblurring problem in two ways. First, we leverage the pretrained capability of the dense stereo module (DUSt3R) to directly obtain accurate initial point clouds from blurred images. Because camera poses are not computed as an intermediate result, we avoid propagating cumulative errors from inaccurate camera poses to the positions of the initial point cloud. Second, we introduce the event stream into the deblurring pipeline for its high sensitivity to dynamic changes. By decoding the latent sharp images from the event stream and blurred images, we can provide a fine-grained supervision signal for scene reconstruction optimization. Extensive experiments across a range of scenes demonstrate that DeblurSplat not only excels in generating high-fidelity novel views but also achieves significant rendering efficiency compared to state-of-the-art deblurring 3D-GS methods.

2506.11653 2026-06-01 cs.CV cs.AI cs.LG 版本更新

DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation

DISCO: 使用条件距离相关性减轻深度学习中的偏差

Emre Kavak, Tom Nuno Wolf, Christian Wachinger

发表机构 * Technical University of Munich, Germany(慕尼黑工业大学) Konrad Zuse School of Excellence in Reliable AI, Germany(Konrad Zuse可靠人工智能卓越学院) Munich Center for Machine Learning (MCML), Germany(慕尼黑机器学习中心(MCML))

AI总结 提出基于反因果模型的条件独立性准则,并设计条件距离相关性的高效估计器DISCO$_m$和sDISCO,通过正则化实现梯度模型中的偏差缓解,在多个数据集上优于或媲美现有方法。

Comments Accepted to ICML 2026 (oral)

详情
AI中文摘要

数据集偏差常常导致深度学习模型利用虚假相关性而非任务相关信号。我们引入了标准反因果模型(SAM),这是一个统一的因果框架,用于刻画偏差机制并得出因果稳定性的条件独立性准则。基于这一理论,我们提出了DISCO$_m$和sDISCO,它们是条件距离相关性的高效且可扩展的估计器,能够在基于梯度的模型中实现独立性正则化。在六个不同数据集上,我们的方法在现有观察偏差缓解方法中持续表现更优或具有竞争力,同时需要更少的超参数并能够无缝扩展到多偏差场景。这项工作桥接了因果理论与实际深度学习,为稳健预测提供了原则性基础和有效工具。源代码:https://github.com/yakamoz5/DISCO。

英文摘要

Dataset bias often leads deep learning models to exploit spurious correlations instead of task-relevant signals. We introduce the Standard Anti-Causal Model (SAM), a unifying causal framework that characterizes bias mechanisms and yields a conditional independence criterion for causal stability. Building on this theory, we propose DISCO$_m$ and sDISCO, efficient and scalable estimators of conditional distance correlation that enable independence regularization in gradient-based models. Across six diverse datasets, our methods consistently outperform or are competitive with existing observed-bias mitigation approaches, while requiring fewer hyperparameters and scaling seamlessly to multi-bias scenarios. This work bridges causal theory and practical deep learning, providing both a principled foundation and effective tools for robust prediction. Source code: https://github.com/yakamoz5/DISCO.

2509.10114 2026-06-01 cs.CV 版本更新

A Lightweight Ensemble-Based Face Image Quality Assessment Method with Correlation-Aware Loss

一种基于集成学习的轻量级人脸图像质量评估方法及关联感知损失

MohammadAli Hamidi, Hadi Amirpour, Luigi Atzori, Christian Timmerer

发表机构 * DIEE, University of Cagliari, CNIT, University of Cagliari(卡利亚里大学DIEE部门,CNIT,卡利亚里大学) Department of Information Technology (ITEC), Alpen-Adria Universität Klagenfurt(克拉根福大学信息技术系(ITEC))

AI总结 提出一种轻量级集成方法,结合MobileNetV3-Small和ShuffleNetV2,使用MSECorrLoss损失函数,在VQualA基准上达到高精度与低计算成本。

Comments This paper has been published in the Proceedings of ICCV 2025. The final published version is available via IEEE Xplore

详情
AI中文摘要

人脸图像质量评估(FIQA)在人脸识别和验证系统中起着关键作用,尤其是在非受控的真实世界环境中。尽管已有多种方法被提出,但通用的无参考图像质量评估技术往往无法捕捉人脸特定的退化。同时,最先进的FIQA模型通常计算量大,限制了其实际应用。我们提出了一种轻量级且高效的FIQA方法,专为野外人脸图像的感知评估而设计。我们的方法集成了两个紧凑的卷积神经网络MobileNetV3-Small和ShuffleNetV2,并通过简单平均进行预测级融合。为了增强与人类感知判断的一致性,我们采用了一种关联感知损失(MSECorrLoss),将均方误差(MSE)与皮尔逊相关正则化器相结合。我们的方法在准确性和计算成本之间取得了良好的平衡,使其适用于实际部署。在VQualA FIQA基准上的实验表明,我们的模型达到了0.9829的斯皮尔曼秩相关系数(SRCC)和0.9894的皮尔逊线性相关系数(PLCC),同时保持在竞赛效率约束内。

英文摘要

Face image quality assessment (FIQA) plays a critical role in face recognition and verification systems, especially in uncontrolled, real-world environments. Although several methods have been proposed, general-purpose no-reference image quality assessment techniques often fail to capture face-specific degradations. Meanwhile, state-of-the-art FIQA models tend to be computationally intensive, limiting their practical applicability. We propose a lightweight and efficient method for FIQA, designed for the perceptual evaluation of face images in the wild. Our approach integrates an ensemble of two compact convolutional neural networks, MobileNetV3-Small and ShuffleNetV2, with prediction-level fusion via simple averaging. To enhance alignment with human perceptual judgments, we employ a correlation-aware loss (MSECorrLoss), combining mean squared error (MSE) with a Pearson correlation regularizer. Our method achieves a strong balance between accuracy and computational cost, making it suitable for real-world deployment. Experiments on the VQualA FIQA benchmark demonstrate that our model achieves a Spearman rank correlation coefficient (SRCC) of 0.9829 and a Pearson linear correlation coefficient (PLCC) of 0.9894, remaining within competition efficiency constraints.
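The two ingredients named in the abstract, prediction-level averaging and a loss that mixes MSE with a Pearson correlation term, can be sketched as follows. This is a minimal illustration under stated assumptions: the exact form `MSE + lam * (1 - r)` and the weight `lam` are guesses at how such a regularizer is typically written, not the authors' confirmed definition of MSECorrLoss.

```python
import math

def pearson(pred, target):
    """Pearson linear correlation between two equal-length score lists."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in target))
    if std_p == 0.0 or std_t == 0.0:
        return 0.0  # correlation is undefined for constant inputs
    return cov / (std_p * std_t)

def mse_corr_loss(pred, target, lam=0.5):
    """MSE plus a penalty that shrinks as predictions correlate with targets.

    The weight lam is a hypothetical hyperparameter (assumption).
    """
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return mse + lam * (1.0 - pearson(pred, target))

def ensemble_average(scores_a, scores_b):
    """Prediction-level fusion: average the two models' quality scores."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]

fused = ensemble_average([0.8, 0.4], [0.6, 0.2])        # two hypothetical models
perfect = mse_corr_loss([0.2, 0.5, 0.9], [0.2, 0.5, 0.9])
reversed_rank = mse_corr_loss([1.0, 2.0, 3.0], [3.0, 2.0, 1.0], lam=1.0)
```

The correlation term penalizes predictions that get the ordering of image quality wrong even when the squared error is moderate, which is the property that SRCC and PLCC reward at evaluation time.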

2508.20478 2026-06-01 cs.CV 版本更新

Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding

Video-MTR: 用于长视频理解的多轮强化推理

Yuan Xie, Tianshui Chen, Zheng Ge, Lionel Ni

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Guangdong University of Technology(广东工业大学) X-Era AI Lab(X-Era人工智能实验室)

AI总结 提出Video-MTR框架,通过强化多轮推理迭代选择关键视频片段并理解问题,结合门控双层奖励系统实现端到端训练,在长视频理解基准上提升准确率和效率。

Comments Accepted by ICML 2026. Camera-ready version

详情
AI中文摘要

长视频理解因其长期时间依赖性和多事件特性仍然是一个挑战。现有方法通常依赖静态推理或外部视觉语言模型(VLM),但存在复杂性和缺乏端到端训练导致的次优性能等问题。本文提出Video-MTR,一个强化多轮推理框架,旨在实现迭代的关键视频片段选择和问题理解。与传统的单轮预测视频推理流程不同,Video-MTR进行多轮推理,基于对先前处理片段和当前问题的逐步理解,逐步选择视频片段。这种迭代过程允许对视频进行更精细和上下文感知的分析。为确保中间推理过程,我们引入了一种新颖的门控双层奖励系统,结合基于答案正确性的轨迹级奖励和强调帧-查询相关性的轮次级奖励。该系统优化了视频片段选择和问题理解,无需外部VLM,并允许端到端训练。在VideoMME、MLVU和EgoSchema等基准上的大量实验表明,Video-MTR在准确性和效率上均优于现有方法,推动了长视频理解的最新进展。

英文摘要

Long-form video understanding, characterized by long-range temporal dependencies and multiple events, remains a challenge. Existing methods often rely on static reasoning or external visual-language models (VLMs), which face issues like complexity and sub-optimal performance due to the lack of end-to-end training. In this paper, we propose Video-MTR, a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR performs reasoning in multiple turns, selecting video segments progressively based on the evolving understanding of previously processed segments and the current question. This iterative process allows for a more refined and contextually aware analysis of the video. To guide the intermediate reasoning process, we introduce a novel gated bi-level reward system, combining trajectory-level rewards based on answer correctness and turn-level rewards emphasizing frame-query relevance. This system optimizes both video segment selection and question comprehension, eliminating the need for external VLMs and allowing end-to-end training. Extensive experiments on benchmarks like VideoMME, MLVU, and EgoSchema demonstrate that Video-MTR outperforms existing methods in both accuracy and efficiency, advancing the state-of-the-art in long video understanding.
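One plausible reading of a gated bi-level reward is sketched below: each turn receives the trajectory-level reward for answer correctness, and the turn-level relevance bonus is added only when the final answer is correct. The gating rule, the `turn_weight` parameter, and the relevance values are all assumptions for illustration; the abstract does not specify how the gate is defined.

```python
def gated_bilevel_rewards(answer_correct, turn_relevance, turn_weight=0.5):
    """Assign one reward per reasoning turn.

    answer_correct: whether the final answer of the trajectory is right.
    turn_relevance: per-turn frame-query relevance scores in [0, 1].
    turn_weight: hypothetical scale for the turn-level bonus (assumption).
    """
    trajectory_reward = 1.0 if answer_correct else 0.0
    # Assumed gate: turn-level bonuses count only for correct trajectories,
    # so relevant-looking frame picks cannot be rewarded on a wrong answer.
    gate = 1.0 if answer_correct else 0.0
    return [trajectory_reward + gate * turn_weight * r for r in turn_relevance]

rewards_ok = gated_bilevel_rewards(True, [0.2, 0.8])
rewards_bad = gated_bilevel_rewards(False, [0.2, 0.8])
```

Under this assumed gate, intermediate turns receive a denser learning signal than a single end-of-trajectory reward, while the final answer remains the condition for any credit.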

2508.19830 2026-06-01 cs.CV cs.AI 版本更新

Target-Agnostic Calibration under Distribution Shift with Frequency-Aware Gradient Rectification

分布偏移下基于频率感知梯度修正的目标无关校准

Yilin Zhang, Cai Xu, You Wu, Ziyu Guan, Wei Zhao

发表机构 * School of Computer Science and Technology, Xidian University, Xi'an, China(西安电子科技大学计算机科学与技术学院)

AI总结 提出频率感知梯度修正(FGR)框架,通过对训练图像进行低通滤波减少虚假高频线索并学习域不变特征,同时利用几何投影确保分布内校准不退化,从而在无需目标域信息的情况下提升模型在分布偏移下的校准性能。

Comments 25 pages, Accepted at ICML 2026

详情
AI中文摘要

现实世界中的模型部署不可避免地会遇到分布偏移,使得深度神经网络的置信度估计高度不可靠,在安全关键应用中带来严重风险。现有方法通过训练时正则化或事后调整来改善校准,但通常依赖于对目标域的访问(或模拟),限制了实用性。我们提出频率感知梯度修正(FGR),一种用于鲁棒校准的目标无关训练框架。从频率角度出发,FGR 对部分训练图像应用低通滤波,以减少虚假的高频线索并鼓励学习域不变特征。然而,相关的信息损失可能会降低分布内(ID)校准。为了解决这一权衡,FGR 将 ID 校准视为硬约束,并通过几何投影修正冲突的参数更新。这确保了 ID 校准目标的一阶非增,而无需引入额外的损失平衡系数。在合成、真实世界和语义偏移数据集上的大量实验表明,FGR 在保持 ID 性能的同时显著改善了各种偏移下的校准,并且与事后校准方法兼容。我们的代码可在 https://github.com/YilinZhang107/FGR-Calib 获取。

英文摘要

Real-world model deployments inevitably encounter distribution shifts, rendering the confidence estimates of deep neural networks highly unreliable, posing severe risks in safety-critical applications. Existing methods improve calibration via training-time regularization or post-hoc adjustment, but often rely on access to (or simulation of) target domains, limiting practicality. We propose Frequency-aware Gradient Rectification (FGR), a target-agnostic training framework for robust calibration. From a frequency perspective, FGR applies low-pass filtering to a subset of training images to diminish spurious high-frequency cues and encourage the learning of domain-invariant features. However, the associated information loss can degrade In-Distribution (ID) calibration. To resolve this trade-off, FGR treats ID calibration as a hard constraint and rectifies conflicting parameter updates via geometric projection. This ensures a first-order non-increase in the ID calibration objective without introducing an additional loss-balancing coefficient. Extensive experiments on synthetic, real-world, and semantic shift datasets demonstrate that FGR significantly improves calibration under diverse shifts while preserving ID performance, and it remains compatible with post-hoc calibration methods. Our code is available at https://github.com/YilinZhang107/FGR-Calib.
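The hard-constraint idea can be illustrated with a standard gradient projection. When the update direction conflicts with the gradient of the in-distribution (ID) calibration objective, the conflicting component is removed, so a small descent step does not increase that objective to first order. This is a generic sketch of the geometric idea; the function name and the toy vectors are hypothetical, and FGR's exact projection rule may differ.

```python
def rectify_gradient(g_train, g_id):
    """Project g_train so that it no longer conflicts with g_id.

    g_train: gradient of the training loss (e.g. on low-pass filtered images).
    g_id: gradient of the ID calibration objective.
    A descent step along -g changes the ID objective by about -lr * <g, g_id>,
    so a non-negative inner product means no first-order increase.
    """
    dot = sum(a * b for a, b in zip(g_train, g_id))
    if dot >= 0.0:
        return list(g_train)          # no conflict, keep the update as is
    norm_sq = sum(b * b for b in g_id)
    if norm_sq == 0.0:
        return list(g_train)          # no ID signal to protect
    scale = dot / norm_sq
    # Remove the component of g_train that points against g_id.
    return [a - scale * b for a, b in zip(g_train, g_id)]

conflicting = rectify_gradient([1.0, -2.0], [0.0, 1.0])   # toy vectors
compatible = rectify_gradient([1.0, 1.0], [0.0, 1.0])
```

Because the constraint is enforced by projection rather than by adding a second loss term, no extra loss-balancing coefficient is required, which matches the property stated in the abstract.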

2507.16362 2026-06-01 cs.CV 版本更新

LPTR-AFLNet: Lightweight Integrated Chinese License Plate Rectification and Recognition Network

LPTR-AFLNet:轻量级集成式中国车牌校正与识别网络

Guangzhu Xu, Pengcheng Zuo, Zhi Ke, Bangjun Lei

发表机构 * Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, China Three Gorges University(水电工程智能视觉监测湖北省重点实验室,三峡大学) College of Computer and Information Technology, China Three Gorges University(计算机与信息学院,三峡大学) Hubei Key Laboratory of Digital Finance Innovation(湖北省数字金融创新重点实验室) School of Information Engineering, Hubei University of Economics(信息工程学院,湖北经济学院)

AI总结 提出一种轻量级统一网络LPTR-AFLNet,结合透视变换校正模块和优化后的AFLNet识别网络,利用识别输出作为弱监督信号引导校正,并改进注意力模块和采用Focal Loss,实现高效准确的车牌校正与识别。

Comments 28 pages, 33 figures

详情
AI中文摘要

中国车牌识别(CLPR)在无约束和复杂环境中面临诸多挑战,特别是由于不同拍摄角度导致的透视畸变以及单行和双行车牌的校正问题。考虑到边缘设备有限的计算资源,开发低复杂度、端到端的集成校正与识别网络对于实现实时高效部署至关重要。本文提出了一种名为LPTR-AFLNet的轻量级统一网络,用于校正和识别中国车牌,该网络将透视变换校正模块(PTR)与优化的车牌识别网络AFLNet相结合。该网络利用识别输出作为弱监督信号,有效引导校正过程,确保准确的透视畸变校正。为提高识别精度,我们对LPRNet进行了多项改进,包括引入改进的注意力模块以减少相似字符间的混淆,以及使用Focal Loss解决训练中的类别不平衡问题。实验结果表明,LPTR-AFLNet在校正透视畸变和识别双行车牌图像方面表现出色,在各种具有挑战性的场景下均保持高识别精度。此外,在中低端GPU平台上,该方法运行时间小于10毫秒,显示出其实用效率和广泛适用性。

英文摘要

Chinese License Plate Recognition (CLPR) faces numerous challenges in unconstrained and complex environments, particularly due to perspective distortions caused by various shooting angles and the correction of single-line and double-line license plates. Considering the limited computational resources of edge devices, developing a low-complexity, end-to-end integrated network for both correction and recognition is essential for achieving real-time and efficient deployment. In this work, we propose a lightweight, unified network named LPTR-AFLNet for correcting and recognizing Chinese license plates, which combines a perspective transformation correction module (PTR) with an optimized license plate recognition network, AFLNet. The network leverages the recognition output as a weak supervisory signal to effectively guide the correction process, ensuring accurate perspective distortion correction. To enhance recognition accuracy, we introduce several improvements to LPRNet, including an improved attention module to reduce confusion among similar characters and the use of Focal Loss to address class imbalance during training. Experimental results demonstrate the exceptional performance of LPTR-AFLNet in rectifying perspective distortion and recognizing double-line license plate images, maintaining high recognition accuracy across various challenging scenarios. Moreover, on low-to-mid-range GPU platforms, the method runs in less than 10 milliseconds, indicating its practical efficiency and broad applicability.
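Focal Loss addresses class imbalance by down-weighting examples the model already classifies confidently. The sketch below implements the standard form FL = -alpha * (1 - p_t)^gamma * log(p_t) for a single example. The defaults `gamma=2.0` and `alpha=1.0` are common choices assumed here for illustration, not values reported for LPTR-AFLNet.

```python
import math

def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one example.

    probs: predicted class probabilities (should sum to 1).
    target: index of the true class.
    gamma: focusing parameter; gamma=0 recovers plain cross-entropy.
    alpha: class-balancing weight (assumed constant in this sketch).
    """
    p_t = min(max(probs[target], 1e-12), 1.0)   # clamp to avoid log(0)
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss([0.1, 0.9], target=1)   # confident and correct
hard = focal_loss([0.6, 0.4], target=1)   # uncertain and wrong-leaning
ce = focal_loss([0.25, 0.75], target=1, gamma=0.0)
```

With `gamma > 0`, the easy example contributes far less than the hard one, so frequent, well-learned characters do not dominate training over rare or easily confused ones.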

2507.17335 2026-06-01 cs.CV cs.CL 版本更新

TransLPRNet: Lite Vision-Language Network for Single/Dual-line Chinese License Plate Recognition

TransLPRNet:用于单/双行中文车牌识别的轻量级视觉-语言网络

Guangzhu Xu, Zhi Ke, Pengcheng Zuo, Bangjun Lei

发表机构 * Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, China Three Gorges University(水电工程智能视觉监测湖北省重点实验室,中国三峡大学) College of Computer and Information Technology, China Three Gorges University(计算机与信息学院,中国三峡大学) Hubei Key Laboratory of Digital Finance Innovation(数字金融创新湖北省重点实验室) School of Information Engineering, Hubei University of Economics(信息工程学院,湖北经济学院)

AI总结 针对开放环境中车牌类型多样和成像条件复杂的问题,提出一种集成轻量视觉编码器和文本解码器的统一解决方案,通过预训练框架和透视校正网络实现单/双行中文车牌的高精度识别。

详情
AI中文摘要

开放环境中的车牌识别在各个领域广泛应用,但车牌类型和成像条件的多样性带来了显著挑战。为了解决基于CNN和CRNN的方法在车牌识别中遇到的局限性,本文提出了一种统一解决方案,该方案在针对单行和双行中文车牌的预训练框架内,集成了轻量级视觉编码器和文本解码器。为缓解双行车牌数据集的稀缺性,我们通过合成图像、将纹理映射到真实场景并与真实车牌图像混合,构建了单/双行车牌数据集。此外,为提高系统的识别精度,我们引入了一个透视校正网络(PTN),该网络将车牌角点坐标回归作为隐变量,并通过车牌视角分类信息进行监督。该网络具有更好的稳定性、可解释性和较低的标注成本。所提出的算法在粗定位扰动下的校正CCPD测试集上实现了99.34%的平均识别准确率。在细定位扰动下评估时,准确率进一步提高到99.58%。在双行车牌测试集上,平均识别准确率达到98.70%,处理速度高达每秒167帧,显示出较强的实际应用性。

英文摘要

License plate recognition in open environments is widely applicable across various domains; however, the diversity of license plate types and imaging conditions presents significant challenges. To address the limitations encountered by CNN- and CRNN-based approaches in license plate recognition, this paper proposes a unified solution that integrates a lightweight visual encoder with a text decoder, within a pre-training framework tailored for single- and double-line Chinese license plates. To mitigate the scarcity of double-line license plate datasets, we constructed a single/double-line license plate dataset by synthesizing images, applying texture mapping onto real scenes, and blending them with authentic license plate images. Furthermore, to enhance the system's recognition accuracy, we introduce a perspective correction network (PTN) that treats license plate corner coordinate regression as a latent variable, supervised by license plate view classification information. This network offers improved stability, interpretability, and low annotation costs. The proposed algorithm achieves an average recognition accuracy of 99.34% on the corrected CCPD test set under coarse localization disturbance. When evaluated under fine localization disturbance, the accuracy further improves to 99.58%. On the double-line license plate test set, it achieves an average recognition accuracy of 98.70%, with processing speeds reaching up to 167 frames per second, indicating strong practical applicability.
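The abstract describes perspective correction driven by regressed corner coordinates. The sketch below shows only the classical geometric step that such corners make possible: solving for the 3x3 homography that maps four plate corners onto an upright rectangle. It is not the paper's PTN. The corner values, the 94x24 target size, and the helper names `solve_linear`, `homography_from_corners` and `warp_point` are all hypothetical.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def homography_from_corners(src, dst):
    """Return H (with h33 fixed to 1) mapping four src corners onto dst corners."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h1 x + h2 y + h3) / (h7 x + h8 y + 1), and likewise for v.
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]

def warp_point(h, x, y):
    """Apply the homography to one point, including the projective division."""
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    u = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
    v = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
    return u, v

# Hypothetical skewed plate corners (TL, TR, BR, BL) and an upright 94x24 target.
src = [(12, 20), (118, 8), (125, 52), (8, 60)]
dst = [(0, 0), (94, 0), (94, 24), (0, 24)]
H = homography_from_corners(src, dst)
warped = [warp_point(H, x, y) for x, y in src]
print(warped)
```

Fixing the last entry of H to 1 leaves eight unknowns, which the four corner correspondences determine exactly as long as no three corners are collinear.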

2507.11075 2026-06-01 cs.CV cs.AI 版本更新

Joint angle based learning to refine kinematic human pose estimation

基于关节角度学习的运动学人体姿态估计精化

Chang Peng, Yifei Zhou, Haoqiang Ren, Shiqing Huang, Chuangye Chen, Jianming Yang, Bao Yang, Huifeng Xi, Zhenyu Jiang

发表机构 * Department of Engineering Mechanics, School of Civil Engineering and Transportation, South China University of Technology(工程力学系,土木与交通学院,华南理工大学) School of Mechanics and Construction Engineering, Jinan University(力学与建筑工程学院,暨南大学) Guangdong Provincial Key Laboratory of Speed Capability, School of Physical Education, Jinan University(广东省速度能力重点实验室,暨南大学体育学院)

AI总结 提出一种基于关节角度的双向循环网络后处理模块,利用高阶傅里叶级数近似生成可靠真值,以精化单图像人体姿态估计,纠正错误关键点并平滑轨迹。

详情
AI中文摘要

无标记人体姿态估计(HPE)在各个领域中的应用日益增多。当前的HPE在分析运动学人体姿态时,偶尔会出现关键点识别错误和关键点轨迹随机波动的问题。现有基于深度学习的HPE精化模型的性能受到训练数据集(关键点手动标注)不准确的显著限制。本文提出了一种新方法克服这一困难,关键技术包括:(i) 基于关节角度的运动学人体姿态鲁棒描述;(ii) 使用高阶傅里叶级数近似关节角度的时间变化以获得可靠的“真值”;(iii) 设计双向循环网络作为后处理模块,以精化基于单图像的HPE模型的估计。使用我们方法构建的高质量数据集训练后,该网络在纠正错误识别关节和平滑其时空轨迹方面表现出卓越性能。测试表明,在花样滑冰和霹雳舞等挑战性案例中,基于关节角度的精化(JAR)优于最先进的HPE精化网络。JAR还展示了纠正现有数据集的巨大潜力。

英文摘要

Marker-free human pose estimation (HPE) has found increasing applications in various fields. Current HPE suffers from occasional errors in keypoint recognition and random fluctuation in keypoint trajectories when analyzing kinematic human poses. The performance of existing deep learning-based models for HPE refinement is considerably limited by inaccurate training datasets in which the keypoints are manually annotated. This paper proposes a novel method to overcome this difficulty, whose key techniques include: (i) a robust joint angle-based description of kinematic human poses; (ii) approximating the temporal variation of joint angles with a high-order Fourier series to obtain reliable "ground truth"; (iii) a bidirectional recurrent network designed as a post-processing module to refine the estimates of single image-based HPE models. Trained on the high-quality dataset constructed with our method, the network demonstrates outstanding performance in correcting wrongly recognized joints and smoothing their spatiotemporal trajectories. Tests show that joint angle-based refinement (JAR) outperforms the state-of-the-art HPE refinement network in challenging cases such as figure skating and breaking. JAR also demonstrates great potential to rectify existing datasets.
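Step (ii) of the abstract, approximating a joint-angle trajectory with a Fourier series, can be sketched as follows. This is a simplified illustration, not the authors' procedure: it assumes the trajectory is uniformly sampled over exactly one period, so that the least-squares coefficients reduce to discrete inner products. The signal, the noise pattern, and `order=3` are made up for the demonstration.

```python
import math

def fourier_fit(samples, order):
    """Least-squares truncated Fourier fit for uniform samples over one period.

    Requires order < len(samples) / 2 so the basis functions stay orthogonal.
    """
    n = len(samples)
    mean = sum(samples) / n
    coeffs = []
    for k in range(1, order + 1):
        a_k = 2.0 / n * sum(y * math.cos(2 * math.pi * k * i / n)
                            for i, y in enumerate(samples))
        b_k = 2.0 / n * sum(y * math.sin(2 * math.pi * k * i / n)
                            for i, y in enumerate(samples))
        coeffs.append((a_k, b_k))
    return mean, coeffs

def fourier_eval(mean, coeffs, i, n):
    """Evaluate the fitted series at sample index i."""
    value = mean
    for k, (a_k, b_k) in enumerate(coeffs, start=1):
        phase = 2 * math.pi * k * i / n
        value += a_k * math.cos(phase) + b_k * math.sin(phase)
    return value

# Hypothetical joint angle (degrees) over one motion cycle.
n = 64
clean = [30 + 20 * math.sin(2 * math.pi * i / n)
         + 5 * math.cos(2 * math.pi * 3 * i / n) for i in range(n)]
# Frame-to-frame jitter plus one grossly misdetected frame.
noisy = [c + 2.0 * (-1) ** i for i, c in enumerate(clean)]
noisy[10] += 40.0

mean, coeffs = fourier_fit(noisy, order=3)
smooth = [fourier_eval(mean, coeffs, i, n) for i in range(n)]
max_err_noisy = max(abs(a - b) for a, b in zip(noisy, clean))
max_err_smooth = max(abs(a - b) for a, b in zip(smooth, clean))
print(max_err_noisy, max_err_smooth)
```

Because the fit keeps only low-frequency terms, high-frequency jitter is removed and an isolated outlier is spread thinly over the whole cycle rather than left as a spike. Real trajectories that are not periodic would need a general least-squares solve or windowing instead.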

2507.06161 2026-06-01 cs.CV 版本更新

Sinkhorn Normalization of Diffusion Kernels

扩散核的Sinkhorn归一化

Nathan Kessler, Robin Magnet, Jean Feydy

发表机构 * ENS Paris-Saclay(巴黎-萨克雷高等师范学院) Inria, Université Paris Cité, Inserm, HeKA(法国国家信息与自动化研究所、巴黎西岱大学、法国国家健康与医学研究院、HeKA)

AI总结 提出一种基于Sinkhorn算法的对称变体,将通用相似性矩阵归一化为类似扩散算子,继承拉普拉斯算子的理想性质,用于不规则数据(如点云、稀疏体素网格、高斯混合)的平滑处理,并保留谱信息用于形状分析与匹配。

Comments 33 pages, 25 figures

详情
AI中文摘要

基于局部邻域对信号进行平滑是机器学习和几何处理中的核心操作。在向量空间和流形等结构良好的域上,由微分几何导出的拉普拉斯算子通过热扩散提供了一种有理论保证的平滑方法。然而,构造这样的拉普拉斯算子需要精确定义的域结构,这并不总是可行的。因此,大多数从业者依赖于简单的卷积核和消息传递层,这些方法对域边界存在偏差。我们通过引入一类广泛的平滑算子(由一般相似性或邻接矩阵导出)来弥合这一差距,并证明它们可以被归一化为类似扩散的算子,继承拉普拉斯算子的理想性质。我们的方法依赖于Sinkhorn算法的对称变体,该算法重新缩放正平滑算子以匹配热扩散的结构行为。这种构造使得能够对不规则数据(如点云、稀疏体素网格或高斯混合)进行类似拉普拉斯的平滑和处理。我们表明,得到的算子不仅近似热扩散,而且保留了拉普拉斯算子本身的谱信息,可应用于形状分析和匹配。

英文摘要

Smoothing a signal based on local neighborhoods is a core operation in machine learning and geometry processing. On well-structured domains such as vector spaces and manifolds, the Laplace operator derived from differential geometry offers a principled approach to smoothing via heat diffusion, with strong theoretical guarantees. However, constructing such Laplacians requires a carefully defined domain structure, which is not always available. Most practitioners thus rely on simple convolution kernels and message-passing layers, which are biased against the boundaries of the domain. We bridge this gap by introducing a broad class of smoothing operators, derived from general similarity or adjacency matrices, and demonstrate that they can be normalized into diffusion-like operators that inherit desirable properties from Laplacians. Our approach relies on a symmetric variant of the Sinkhorn algorithm, which rescales positive smoothing operators to match the structural behavior of heat diffusion. This construction enables Laplacian-like smoothing and processing of irregular data such as point clouds, sparse voxel grids, or mixtures of Gaussians. We show that the resulting operators not only approximate heat diffusion but also retain spectral information from the Laplacian itself, with applications to shape analysis and matching.
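To make the "symmetric variant of the Sinkhorn algorithm" concrete, here is a minimal sketch of one common symmetric scheme. It finds a positive scaling vector d such that diag(d) K diag(d) has unit row sums, using the damped update d <- sqrt(d / (K d)). This is an assumption about the general idea only: the paper's exact iteration, any mass weighting, and its convergence criteria are not given in the abstract. The 1D point set and `sigma=0.3` are illustrative.

```python
import math

def gaussian_kernel(points, sigma):
    """Symmetric positive similarity matrix between 1D points."""
    return [[math.exp(-((p - q) ** 2) / (2 * sigma ** 2)) for q in points]
            for p in points]

def symmetric_sinkhorn(kernel, iterations=100):
    """Rescale a symmetric positive kernel to a symmetric operator with unit row sums.

    Uses the damped fixed-point update d <- sqrt(d / (K d)); at convergence
    d_i * (K d)_i = 1 for every i.
    """
    n = len(kernel)
    d = [1.0] * n
    for _ in range(iterations):
        kd = [sum(kernel[i][j] * d[j] for j in range(n)) for i in range(n)]
        d = [math.sqrt(d[i] / kd[i]) for i in range(n)]
    return [[d[i] * kernel[i][j] * d[j] for j in range(n)] for i in range(n)]

# Irregularly spaced samples, denser on the left than on the right.
points = [0.0, 0.1, 0.15, 0.4, 0.7, 0.75, 1.0]
op = symmetric_sinkhorn(gaussian_kernel(points, sigma=0.3))

row_sums = [sum(row) for row in op]
constant = [3.0] * len(points)
smoothed = [sum(op[i][j] * constant[j] for j in range(len(points)))
            for i in range(len(points))]
print(row_sums)
print(smoothed)
```

The normalized operator is symmetric with unit row and column sums, so it leaves constant signals unchanged and conserves total mass, two properties shared with heat diffusion that plain row normalization does not provide together.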

2506.14842 2026-06-01 cs.CV cs.AI 版本更新

PictSure: Pretraining Embeddings Matters for In-Context Learning Image Classifiers

PictSure:预训练嵌入对上下文学习图像分类器至关重要

Lukas Schiesser, Cornelius Wolff, Sophie Haas, Simon Pukrop

发表机构 * German Research Center for AI (DFKI)(德国人工智能研究中心(DFKI)) Centrum Wiskunde & Informatica (CWI)(数学与信息学研究中心(CWI))

AI总结 本文提出PictSure视觉上下文学习模型,发现预训练嵌入质量是下游性能的关键瓶颈,而融合层训练数据的多样性影响有限。

Comments 10 pages, 2 figures

详情
AI中文摘要

在数据稀缺领域,构建图像分类模型仍然繁琐,因为收集大规模标注数据集不切实际。上下文学习(ICL)是少样本图像分类(FSIC)的一种有前景的范式,但先前工作未充分探索编码器预训练与融合层训练数据的相对重要性。我们提出了PictSure,一个纯视觉的ICL模型家族,展示了易于使用的融合Transformer架构的潜力,以及需要在更广泛的图像域中获得更好的嵌入表示。在域内和域外评估中,我们发现预训练引起的表示质量与下游ICL性能强相关。关键在于,将融合Transformer的训练数据集从仅ImageNet更改为多样化的多域混合,在评估设置下仅提供有限的额外性能提升,表明一旦嵌入充分结构化,融合层似乎能够有效适应。这些结果表明,视觉ICL的瓶颈是表示质量,而非融合模块的训练多样性。为了促进采用和可重复性,我们以开源形式发布所有模型权重,并提供一个MCP服务器,将PictSure作为可调用工具暴露给基于LLM的智能系统,使少样本图像分类能够在AI流水线中直接调用,无需集成开销。代码可在https://github.com/PictSure获取,模型可在https://huggingface.co/pictsure获取。

英文摘要

Building image classification models remains cumbersome in data-scarce domains, where collecting large labeled datasets is impractical. In-context learning (ICL) is a promising paradigm for few-shot image classification (FSIC), but prior work has underexplored the relative importance of encoder pretraining versus fusion-layer training data. We present PictSure, a family of vision-only ICL models that demonstrates the potential of easy-to-use fusion transformer architectures, as well as the need for better embedding representations across a wider range of image domains. In both in-domain and out-of-domain evaluations, we find that representation quality induced by pretraining strongly correlates with downstream ICL performance. Crucially, varying the training dataset for the fusion transformer, from ImageNet alone to diverse multi-domain mixtures, provides limited additional performance gains under the evaluated settings, suggesting that the fusion layer can adapt effectively once embeddings are sufficiently structured. These results indicate that the bottleneck in visual ICL is representation quality, not fusion-module training diversity. To facilitate adoption and reproducibility, we release all model weights as open-source artifacts and provide an MCP server that exposes PictSure as a callable tool for LLM-based agentic systems, enabling few-shot image classification to be invoked directly within AI pipelines without integration overhead. Code can be found at https://github.com/PictSure and models at https://huggingface.co/pictsure.

2501.01926 2026-06-01 cs.CV cs.AI 版本更新

Cross-Modal Attention Calibration for LVLM Hallucination Mitigation

跨模态注意力校准用于LVLM幻觉缓解

Jiaming Li, Jiacheng Zhang, Zequn Jie, Lin Ma, Guanbin Li

发表机构 * Sun Yat-sen University(中山大学) The University of Hong Kong(香港大学) Meituan(美团) Inspur Database Technology(浪潮数据库技术) Guilin University of Electronic Technology(桂林电子科技大学) Shenzhen Loop Area Institute(深圳河套学院) Guangdong Key Laboratory of Big Data Analysis and Processing(广东省大数据分析与处理重点实验室)

AI总结 提出一种无需训练的跨模态注意力校准方法,通过设计模态间解码和位置校准模块,缓解大型视觉语言模型中的幻觉问题。

Comments CVPR2026

详情
AI中文摘要

大型视觉语言模型(LVLM)在视觉-语言理解方面表现出显著能力。尽管取得了成功,LVLM在复杂生成任务中仍然会产生幻觉,导致视觉输入与生成内容不一致。为了解决这个问题,一些方法引入了推理时干预,如对比解码,以减少对语言先验的过度依赖。然而,这些方法忽略了由位置偏差和虚假跨模态相关性引起的幻觉。在本文中,我们提出了一种跨模态注意力校准(CMAC)方法,以无需训练的方式缓解LVLM中的幻觉。在该方法中,我们设计了一个模态间解码(IMD)模块,通过一种新颖的对比解码机制来减轻幻觉。IMD将具有显著跨模态注意力权重的值向量掩蔽为失真,从而同时解决了单模态过度依赖和误导性跨模态相关性问题。此外,跨模态位置校准(CMPC)模块缩小了图像标记的位置差距,缓解了跨模态注意力中的位置偏差。在多种幻觉基准上的实验结果验证了我们的方法在减少LVLM幻觉方面优于现有最先进技术。我们的代码将在https://github.com/lijm48/IMCCD上提供。

英文摘要

Large vision-language models (LVLMs) have shown remarkable capabilities in vision-language understanding. Despite their success, LVLMs still hallucinate in complex generation tasks, leading to inconsistencies between visual inputs and generated content. To address this issue, some approaches have introduced inference-time interventions, such as contrastive decoding, to reduce overreliance on language priors. However, these approaches overlook hallucinations stemming from position bias and spurious inter-modality correlations. In this paper, we propose a Cross-Modal Attention Calibration (CMAC) method to mitigate hallucinations in LVLMs in a training-free manner. In this method, we design an Inter-Modality Decoding (IMD) module to alleviate hallucination through a novel contrastive decoding mechanism. IMD masks the value vectors associated with significant cross-modal attention weights to create a distorted branch, which addresses both uni-modality overreliance and misleading inter-modality correlations. Additionally, a Cross-Modal Position Calibration (CMPC) module shrinks the position gap of image tokens, alleviating the position bias in cross-modal attention. Experimental results on diverse hallucination benchmarks validate the superiority of our method over existing state-of-the-art techniques in reducing hallucinations in LVLMs. Our code will be available at https://github.com/lijm48/IMCCD.
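The abstract builds on contrastive decoding, in which next-token logits from the normal forward pass are contrasted against logits from a deliberately distorted pass. The sketch below shows that generic combination step only, using the commonly used form (1 + alpha) * original - alpha * distorted with a plausibility cutoff. How IMD builds its distorted branch (masking value vectors) is not reproduced here, and the vocabulary, logits, `alpha` and `beta` are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_scores(original, distorted, alpha=1.0, beta=0.2):
    """Contrast original logits against distorted ones.

    Tokens whose original probability falls below beta * max probability are
    excluded so the contrast cannot promote implausible tokens.
    """
    probs = softmax(original)
    cutoff = beta * max(probs)
    scores = []
    for lo, ld, p in zip(original, distorted, probs):
        if p < cutoff:
            scores.append(float("-inf"))
        else:
            scores.append((1 + alpha) * lo - alpha * ld)
    return scores

# Hypothetical next-token logits; the image actually shows a cat.
vocab = ["cat", "dog", "frisbee"]
original = [2.0, 2.2, 0.1]   # full model: slightly prefers the prior-driven "dog"
distorted = [0.5, 2.5, 0.0]  # distorted branch: relies even more on the prior

scores = contrastive_scores(original, distorted)
greedy_before = vocab[original.index(max(original))]
greedy_after = vocab[scores.index(max(scores))]
print(greedy_before, "->", greedy_after)
```

Tokens that the distorted branch favors are penalized, so a prediction that survives mainly on language priors loses ground to one that depends on the visual evidence.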

2501.12020 2026-06-01 cs.CV 版本更新

On the Illusion of Gender Bias in Face Recognition: Explaining the Fairness Issue Through Non-demographic Attributes

论人脸识别中的性别偏见错觉:通过非人口统计属性解释公平性问题

Paul Jonas Kurz, Haiyu Wu, Rouqaiah Al-Refai, Kevin W. Bowyer, Philipp Terhörst

发表机构 * Paderborn University(帕德博恩大学) Technical University of Darmstadt(达姆施塔特工业大学) University of Notre Dame(圣母大学) Johannes Gutenberg University Mainz(美因茨约翰内斯·古腾堡大学)

AI总结 本文通过去相关组合40种非人口统计面部特征,提出无监督联合调查框架,发现当男性和女性图像共享特定属性时性别差距消失,表明性能差异源于社会外貌定义而非生物学因素。

Comments Accepted at IEEE TBIOM

详情
AI中文摘要

人脸识别系统(FRS)根据用户性别表现出显著的准确性差异。由于这种性别差距降低了FRS的可信度,最近的努力试图找到原因。然而,这些研究使用手动选择、相关且小规模的面部特征集来支持其主张。在这项工作中,我们通过成功地将搜索域扩展到40种非人口统计面部特征的去相关组合来分析人脸识别中的性别偏见。首先,我们引入了一个工具链,以有效去相关和聚合面部属性,从而在大规模数据上实现较少偏见的性别分析。其次,我们定制了两个专门指标来量化面部属性对绝对和相对公平性的影响。基于这些基础,我们第三提出了一种新颖的无监督联合调查框架,能够识别当用作平衡测试数据集的过滤谓词时导致偏见消失的属性组合。实验表明,当男性和女性受试者的图像共享特定属性时,性别差距消失,这清楚地表明性能差异不是生物学问题,而是外貌的社会定义问题。这些发现可能重塑我们对人脸生物识别中公平性的理解,并为FRS提供见解,有助于解决性别偏见问题。

英文摘要

Face recognition systems (FRS) exhibit significant accuracy differences based on the user's gender. Since such a gender gap reduces the trustworthiness of FRS, recent efforts have tried to find its causes. However, these studies rely on manually selected, correlated, and small sets of facial features to support their claims. In this work, we analyze gender bias in face recognition by extending the search domain to decorrelated combinations of 40 non-demographic facial characteristics. First, we introduce a toolchain to effectively decorrelate and aggregate facial attributes to enable a less-biased gender analysis on large-scale data. Second, we tailor two specialized metrics to quantify the effect of facial attributes on absolute and relative fairness. Third, building on these foundations, we present a novel unsupervised joint investigation framework capable of identifying attribute combinations that lead to vanishing bias when used as filter predicates for balanced testing datasets. Experiments show that the gender gap vanishes when images of male and female subjects share specific attributes, clearly indicating that the disparate performance is not a question of biology but of the social definition of appearance. These findings could reshape our understanding of fairness in face biometrics and provide insights into FRS, helping to address gender bias issues.

2412.03876 2026-06-01 cs.CV 版本更新

Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization

通过推理时提示-噪声优化保障文本到图像生成

Jiangweizhi Peng, Zhiwei Tang, Gaowen Liu, Charles Fleming, Mingyi Hong

发表机构 * University of Minnesota(明尼苏达大学) Cisco Research(思科研究) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出一种无需训练的推理时优化方法(PNO),通过联合优化连续提示嵌入和注入噪声轨迹,抑制不安全图像生成,达到最先进性能并抵抗对抗攻击。

详情
AI中文摘要

文本到图像(T2I)扩散模型因其基于文本提示生成高质量、多样化图像的能力而被广泛认可。然而,尽管近期取得了进展,这些模型仍然容易生成包含敏感或不适当内容的不安全图像,这可能对用户造成伤害。当前防止扩散模型生成不当图像的努力容易被绕过且易受对抗攻击。如何确保T2I模型符合特定安全目标仍然是一个重大挑战。在这项工作中,我们提出了一种新颖的、无需训练的方法,称为提示-噪声优化(PNO),以减轻不安全图像生成。我们的方法引入了一个新颖的优化框架,利用采样过程中的连续提示嵌入和注入噪声轨迹来生成安全图像。大量的数值结果表明,我们的框架在抑制有毒图像生成方面达到了最先进的性能,并且对对抗攻击表现出鲁棒性,无需调整模型参数。此外,与现有方法相比,PNO在保持相当生成时间的同时,在安全生成和提示-图像对齐这两个冲突目标之间提供了最佳权衡。

英文摘要

Text-to-Image (T2I) diffusion models are widely recognized for their ability to generate high-quality and diverse images based on text prompts. However, despite recent advances, these models are still prone to generating unsafe images containing sensitive or inappropriate content, which can be harmful to users. Current efforts to prevent inappropriate image generation for diffusion models are easy to bypass and vulnerable to adversarial attacks. How to ensure that T2I models align with specific safety goals remains a significant challenge. In this work, we propose a novel, training-free approach, called Prompt-Noise Optimization (PNO), to mitigate unsafe image generation. Our method introduces a novel optimization framework that leverages both the continuous prompt embedding and the injected noise trajectory in the sampling process to generate safe images. Extensive numerical results demonstrate that our framework achieves state-of-the-art performance in suppressing toxic image generations and demonstrates robustness to adversarial attacks, without needing to tune the model parameters. Furthermore, compared with existing methods, PNO uses comparable generation time while offering the best tradeoff between the conflicting goals of safe generation and prompt-image alignment.

2410.15475 2026-06-01 cs.CV 版本更新

Multimodal Fusion via Self-Consistent Task-Gradient Fields

通过自洽任务梯度场的多模态融合

Jiayu Xiong, Jing Wang, Jun Xue, Wanlong Wang, Jianlong Kwan, Xiaosen Lyu, Zhouqiang Jiang

发表机构 * Xiamen Key Laboratory of Computer Vision and Pattern Recognition, Huaqiao University, Xiamen, Fujian, China(厦门计算机视觉与模式识别重点实验室,华侨大学,厦门,福建,中国) Huaqiao University, Xiamen, Fujian, China(华侨大学,厦门,福建,中国) Wuhan University, Wuhan, Hubei, China(武汉大学,武汉,湖北,中国) Nakashima Lab, SANKEN, The University of Osaka, Osaka, Japan(Nakashima实验室,SANKEN,大阪大学,大阪,日本)

AI总结 提出自洽场自编码器(SCFAE),利用自洽场原理平衡任务学习与特征组织,通过任务损失和重构损失在互补子空间中分离特征,从而鲁棒处理缺失数据和不均匀输入。

Comments ICML 2026 accepted paper

详情
AI中文摘要

多模态学习旨在从不同输入中保留尽可能多的任务相关信息。然而,当前的融合设计常常扭曲对特征提取器的反馈循环。激进地合并模态会纠缠它们的表示,使得特征提取器对不完整输入变得脆弱。同时,试图通过辅助损失分离特征常常引入优化冲突,分散对主要任务的注意力。我们提出自洽场自编码器(SCFAE)为任务梯度提供更好的路径。我们的方法遵循自洽场原理来平衡任务学习与特征组织,从而最小化互信息。我们为每个模态使用小型自编码器以保持信息完整。任务损失作为驱动力选择预测性特征。重构损失作为约束将这些特征分离到独立子空间中。这两个目标通过互补的特征子空间运作,从而减轻优化干扰。我们在音频-视觉-文本、音频-视觉和图像-视频基准上评估SCFAE。结果表明,SCFAE通过简单结构更鲁棒地处理缺失数据和不均匀输入尺寸。梯度分析确认SCFAE避免了冲突并保持了稳定的训练动态。

英文摘要

Multimodal learning aims to preserve as much task-related information as possible from different inputs. However, current fusion designs often distort the feedback loop to feature extractors. Aggressively merging modalities entangles their representations, making the feature extractors fragile to incomplete inputs. Meanwhile, attempting to separate features via auxiliary losses frequently introduces optimization conflicts that distract from the primary task. We propose the Self-Consistent Field Autoencoder (SCFAE) to provide a better path for task gradients. Our method follows the self-consistent field principle to balance task learning with feature organization, thereby minimizing mutual information. We use small autoencoders for each modality to keep information intact. The task loss acts as a driving force to select predictive features. The reconstruction loss acts as a constraint to separate these features into independent subspaces. These dual objectives operate through complementary feature subspaces, thereby mitigating optimization interference. We evaluate SCFAE on audio-visual-text, audio-visual, and image-video benchmarks. Results show that SCFAE handles missing data and unequal input sizes more robustly via a simple structure. Gradient analysis confirms that SCFAE avoids conflicts and maintains stable training dynamics.