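arXivDaily: a daily academic digest synchronized with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.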

2605.31603 2026-06-01 cs.CV cs.AI Version update

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Jiazheng Xing, Hangjie Yuan, Lingling Cai, Xinyu Liu, Yujie Wei, Fei Du, Hai Ci, Tao Feng, Jiasheng Tang, Weihua Chen, Fan Wang, Yong Liu

Affiliations: Zhejiang University; DAMO Academy, Alibaba Group; Hupan Lab; National University of Singapore; Hong Kong University of Science and Technology; Fudan University; Tsinghua University

AI summary: Proposes Lumos-Nexus, a framework with a two-stage design (lightweight-generator alignment at training time, progressive frequency bridging at inference time) that substantially improves video generation fidelity while preserving reasoning capability.

Comments Project page (https://jiazheng-xing.github.io/nexus-lumos-home/) and Code (https://github.com/alibaba-damo-academy/Lumos-Custom/) are available

Abstract

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.

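2605.31598 2026-06-01 cs.CV Version update

Recognizing Co-Speech Gestures in the Wild

Sindhu B Hegde, K R Prajwal, Andrew Zisserman

Affiliations: Visual Geometry Group, Dept. of Engineering Science, University of Oxford

AI summary: To address the difficulty current multimodal models have in capturing semantic co-speech gestures, the authors build GRW, the first large-scale benchmark dataset, and use it to train video models for semantic gesture classification, recognition of the corresponding word, and temporal localization.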

Abstract

While humans naturally gesture during speech, only a sparse subset of these movements are visually depictive and semantically linked to specific spoken words. Current multimodal models struggle to capture these semantic co-speech gestures, heavily bottlenecked by a lack of precisely annotated training data. To address this, we introduce the Gesture Recognition in the Wild (GRW) dataset, the first large-scale benchmark designed to map unconstrained human gestures to specific words with frame-accurate temporal boundaries. Comprising 156,688 manually annotated video clips, GRW spans a highly diverse 150-word taxonomy of physical actions, spatial descriptors, and abstract concepts. We leverage GRW to train video models to (a) classify gestures as semantic or not, (b) recognize the word corresponding to a co-speech gesture, and (c) temporally localize the gesture. We also use GRW to establish benchmarks for these three tasks.

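2605.31577 2026-06-01 cs.CV Version update

SurGe: Improved Surface Geometry in Point Maps

Karim Knaebel, Gonzalo Martin Garcia, Christian Schmidt, Ilya Fradlin, Lucas Nunes, Daan de Geus, Bastian Leibe

Affiliations: RWTH Aachen University; Eindhoven University of Technology

AI summary: To address inaccurate local surface geometry in feedforward 3D reconstruction, proposes a point map normal metric, a point gradient matching loss, and a Neighborhood Attention Decoder (NAD) to improve local surface orientation, achieving the best average rank across several zero-shot monocular geometry benchmarks.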

Comments Project page at https://vision.rwth-aachen.de/surge

Abstract

Recent feedforward 3D reconstruction methods predict point maps and estimate global 3D geometry remarkably well. However, their predictions still exhibit inaccurate local surface geometry, which is clearly visible qualitatively but only weakly reflected in common metrics. To make these errors more explicit in evaluation, we introduce a point map normal metric that evaluates the local surface orientation induced by neighboring 3D predictions. To reduce these errors, we propose two complementary components: a point gradient matching loss that supervises depth-normalized 3D finite differences, and a Neighborhood Attention Decoder (NAD) that progressively upsamples features and uses Neighborhood Attention for local feature mixing. Across eight zero-shot monocular geometry benchmarks, our model, SurGe, achieves the best average rank for global point map AbsRel and consistently improves local point map and point map normal evaluations.
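The point map normal idea in the abstract can be illustrated with a minimal Python sketch. It assumes that normals are taken as the normalized cross product of horizontal and vertical finite differences of an (H, W, 3) point map and are compared by mean angular error. The function names and this exact formulation are illustrative assumptions, not the metric defined in the paper.

```python
import numpy as np

def point_map_normals(points):
    """Estimate unit normals from an (H, W, 3) point map.

    Assumption: the normal at a pixel is the cross product of the
    forward differences to its right and lower neighbors.
    Returns an (H-1, W-1, 3) array.
    """
    dx = points[:-1, 1:] - points[:-1, :-1]   # step along image columns
    dy = points[1:, :-1] - points[:-1, :-1]   # step along image rows
    n = np.cross(dx, dy)
    length = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(length, 1e-12, None)   # guard against zero-length normals

def mean_normal_angle_deg(pred_points, gt_points):
    """Mean angle (degrees) between normals of two point maps."""
    n_pred = point_map_normals(pred_points)
    n_gt = point_map_normals(gt_points)
    cos = np.clip(np.sum(n_pred * n_gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```

A flat plane compared against a plane tilted by 45 degrees should give a mean angle of about 45, which makes local orientation errors visible even when per-point distances are small.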

2605.31576 2026-06-01 cs.CV Version update

Joint Multi-Camera LiDAR Extrinsic Calibration via Learned Pairwise Initialization and Geometric Refinement

Aziz Al-Najjar, Marzieh Amini, James R. Green, Felix Kwamena

Affiliations: Department of Systems and Computer Engineering, Carleton University

AI summary: Proposes a two-stage framework: CMRNext first estimates each camera's extrinsics and 2D-3D correspondences independently, then a joint bundle adjustment produces a globally consistent multi-camera calibration with improved accuracy and consistency.

Comments Paper is accepted in CVPR 2026 Workshop URVI: Unified Robotic Vision with Cross-Modal Sensing and Alignment

Abstract

Most learning-based camera-LiDAR calibration methods treat each camera-LiDAR pair independently, ignoring the rigid geometric coupling in multi-camera platforms. As a result, per-camera estimates may be individually accurate yet inconsistent at the system level. We present a two-stage framework for joint multi-camera LiDAR extrinsic calibration that combines learned pairwise matching with geometric refinement. First, CMRNext is applied independently to each camera to produce initial extrinsic estimates and dense 2D-3D correspondences. These predictions are then jointly refined through a multi-frame bundle adjustment with reprojection, per-camera prior, and relative-pose prior terms. This approach converts pairwise predictions into a globally consistent multi-camera calibration. Experiments on KITTI (in-domain for CMRNext) and Walkley (out-of-domain) datasets show improved per-camera accuracy and inter-camera consistency. On KITTI, the method achieves 0.89 cm translation error and 0.038° rotation error. On Walkley, it reduces translation error from 108.6 cm to 3.1 cm, highlighting the benefit of explicit multi-camera coupling when single-camera predictions are less reliable.
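The reprojection term mentioned in the abstract can be sketched in Python as follows. This is a minimal illustration under stated assumptions: a pinhole camera with intrinsics `K`, a LiDAR-to-camera extrinsic given as rotation `R` and translation `t`, and hypothetical helper names. The per-camera prior and relative-pose prior terms of the paper's bundle adjustment are not modeled here.

```python
import numpy as np

def project(points_lidar, R, t, K):
    """Project (N, 3) LiDAR points into pixel coordinates.

    R (3x3) and t (3,) map LiDAR coordinates to the camera frame;
    K (3x3) is the pinhole intrinsic matrix. Points are assumed to
    lie in front of the camera (positive depth).
    """
    cam = points_lidar @ R.T + t          # LiDAR frame -> camera frame
    uvw = cam @ K.T                       # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3]       # perspective division

def reprojection_rmse(points_lidar, pixels, R, t, K):
    """Root-mean-square pixel distance between projections and matches."""
    residual = project(points_lidar, R, t, K) - pixels
    return float(np.sqrt(np.mean(np.sum(residual ** 2, axis=1))))
```

In a joint refinement, a cost of this kind would be summed over all cameras and frames and minimized over the extrinsics together with whatever prior terms couple the cameras.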

2605.31572 2026-06-01 cs.CV Version update

nuReasoning: A Reasoning-Centric Dataset and Benchmark for Long-Tail Autonomous Driving

Zhiyu Huang, Johnson Liu, Rui Song, Zewei Zhou, Ruining Yang, Yun Zhang, Tianhui Cai, Hanyin Zhang, Mingxuan Gao, Valeria Xu, Jiali Chen, Yishan Shen, Yiluan Guo, Tony, Qi, Jiaqi Ma

Affiliations: University of California, Los Angeles; Motional

AI summary: Introduces the nuReasoning dataset, with reasoning annotations for 20,000 20-second long-tail driving clips, supporting evaluation of spatial, decision, and counterfactual reasoning, and shows that reasoning supervision improves the driving performance of VLMs and VLAs.

Abstract

Reasoning is essential for autonomous driving (AD) in long-tail scenarios, where vehicles must apply commonsense knowledge, understand spatial relations, infer agent interactions, and make safe decisions. However, existing AD datasets and benchmarks mainly target perception, prediction, or planning, and provide limited supervision for reasoning over realistic long-tail driving scenes. We introduce nuReasoning, a large-scale real-world dataset and benchmark for reasoning-centric AD. Following the lineage of nuScenes and nuPlan, nuReasoning advances real-world AD datasets and benchmarks toward reasoning in long-tail driving scenarios. The dataset contains 20,000 clips, each 20 seconds long, collected across multiple cities, with synchronized multi-camera images, LiDAR data, HD maps, object annotations, and human-verified reasoning annotations spanning Spatial Reasoning, Decision Reasoning, and Counterfactual Reasoning. Unlike prior datasets that focus primarily on visual question answering, nuReasoning supports both reasoning evaluation and planning evaluation, enabling a direct study of how reasoning supervision affects driving performance. Experiments show that fine-tuning VLMs on nuReasoning substantially improves driving-specific question answering, while incorporating reasoning supervision into VLA training improves planning performance even when textual reasoning outputs are disabled at inference time. These results establish nuReasoning as a foundation for evaluating and improving robust, interpretable, reasoning-driven AD systems in realistic long-tail settings.

2605.31556 2026-06-01 cs.CV cs.AI cs.CL cs.CY cs.HC Version update

Vision-Language Models Suppress Female Representations Under Ambiguous Input

Arnau Marin-Llobet, Simon Henniger, Mahzarin R. Banaji

Affiliations: School of Engineering and Applied Sciences; Department of Psychology

AI summary: Introduces LALS, a zero-shot metric, and finds that under ambiguous input the internal encodings and outputs of vision-language models are systematically decoupled: female signal is suppressed before generation, revealing how the models process gender bias internally.

Comments 16 pages, 12 figures, 1 table

Abstract

Alignment teaches vision-language models (VLMs) to avoid expressing demographic biases, and when gender is clearly visible they largely succeed. Far less is known about ambiguous inputs (a worker in full gear, a figure seen from behind), cases common in practice yet rarely studied. We find that minimal prompting pressure exposes occupation-gender defaults on ambiguous input images, with models collapsing to male even for strongly female-stereotyped occupations. But do these outputs reflect what models actually encode internally? We introduce LALS (Latent Association Leaning Score), a zero-shot metric that projects visual-token activations into the model's text-embedding space to measure concept associations per token and layer. Across 15 occupations, over 800 gender-ambiguous images, and four VLMs, internal representations and outputs are systematically decoupled: models often encode a female association internally yet output male. Layer-wise analysis reveals an asymmetric filter: male signal amplifies end-to-end while female signal peaks mid-network and is suppressed before generation. A color ablation shows that culturally loaded visual cues such as clothing color further modulate these internal associations.

2605.31551 2026-06-01 cs.CV Version update

SMART: SMPLest-X Mesh Adaptation and RAFT Tracking for Soccer Pose Estimation

Parthsarthi Rawat

Affiliations: GameChanger by Dick’s Sporting Goods

AI summary: Proposes SMART, which fine-tunes the SMPLest-X model and combines RAFT optical-flow camera tracking with foot-plane anchoring, among other strategies, to markedly reduce 3D pose estimation error in the FIFA Skeleton Tracking Challenge.

Comments CVPR 2026 SoccerNet FIFA Skeleton Tracking Light Challenge, Rank 6

2605.31539 2026-06-01 cs.CV cs.LG q-bio.QM Version update

Automated Prediction of Postoperative Pancreatic Fistula Using Preoperative Computed Tomography

Ashok Choudhary, Chris Varghese, Leo Y. Li-Han, Frank G. Lee, Ellen L. Larson, Elizabeth B. Habermann, Cornelius A. Thiels, Hojjat Salehinejad

Affiliations: Department of Surgery, Mayo Clinic, Rochester, MN, USA; Department of Surgery, University of Auckland, Auckland, NZ; Kern Center for the Science of Health Care Delivery, Mayo Clinic, Rochester, MN, USA; Department of Artificial Intelligence

AI summary: Proposes an end-to-end deep learning pipeline, from pancreas segmentation to classification, that predicts postoperative pancreatic fistula risk automatically from preoperative CT scans, offering a tool for clinical decision-making and a methodological baseline.

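2605.31535 2026-06-01 cs.CV cs.AI cs.LG Version update

RayDer: Scalable Self-Supervised Novel View Synthesis from Real-World Video

Ulrich Prestel, Stefan Andreas Baumann, Nick Stracke, Björn Ommer

Affiliations: Munich Center for Machine Learning (MCML)

AI summary: Proposes RayDer, a unified feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, yielding power-law scaling for self-supervised novel view synthesis and zero-shot open-set performance competitive with supervised methods.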

Comments Project Page: https://compvis.github.io/rayder

Abstract

Self-supervised novel view synthesis (NVS) remains challenging to scale, despite the abundance of video data, largely due to the brittleness of training on realistic videos and the hard-to-predict scaling behavior of multi-network system designs. We introduce RayDer, a unified, feed-forward transformer that consolidates camera estimation, scene reconstruction, and rendering into a single backbone, turning self-supervised NVS into a well-posed single-model scaling problem. A minimal dynamic state, treated as a nuisance factor, absorbs time-varying content and enables stable training on unconstrained real-world video. Importantly, RayDer keeps static-scene NVS as its target task: dynamic content is leveraged purely as scalable supervision, not reconstructed as in dynamic-scene (4D) NVS. Across multiple model sizes and orders of magnitude in data, RayDer exhibits clean power-law scaling with data and compute, and outperforms static-scene data mixtures. On a large number of benchmarks, RayDer achieves strong zero-shot open-set performance competitive with state-of-the-art supervised approaches. Project Page: https://compvis.github.io/rayder

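2605.31534 2026-06-01 cs.CV cs.AI Version update

Feature-Optimized Vision for Adaptive 3D Scene Reconstruction

Eric Liang

Affiliations: Oracle

AI summary: Proposes an adaptive feature-optimized vision front end that scores texture, repeatability, distinctiveness, expected triangulation angle, and spatial coverage to allocate a per-view feature budget, maximizing useful tracks and lowering reconstruction RMSE.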

Abstract

Three-dimensional scene reconstruction depends on local image evidence that is both visually discriminative and geometrically useful. Fixed feature thresholds and uniform feature budgets are easy to deploy, but they can waste computation on repeated texture, low-parallax regions, or unstable points. This paper proposes an adaptive feature-optimized vision front end for 3D reconstruction. The method scores candidate features by texture, repeatability, distinctiveness, expected triangulation angle, and spatial coverage, then allocates a per-view feature budget to maximize useful tracks under a fixed reconstruction pipeline. A small synthetic multi-view prototype evaluates four selection policies across corridor, facade, object-table, and cluttered scenes. Compared with random, texture-only, and uniform-grid baselines, the adaptive policy obtains the best quality-aware completeness and the lowest aggregate reconstruction RMSE while preserving broad image coverage. The result is not a replacement for modern learned matching or neural reconstruction systems; it is a modular front-end policy that can make classical and learned 3D pipelines more deliberate about which visual evidence they spend compute on.
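The scoring-then-budgeting step described in the abstract might look like the following Python sketch. It assumes each candidate feature carries five cues already normalized to [0, 1] and that they are combined by a weighted sum. The weights, the linear combination, and the function names are hypothetical choices for illustration; the abstract does not specify how the cues are combined.

```python
import numpy as np

# Assumed cue order per feature: texture, repeatability, distinctiveness,
# expected triangulation angle, spatial coverage (each scaled to [0, 1]).

def score_features(cues, weights):
    """Combine (N, 5) cue values into one score per feature (weighted sum)."""
    return np.asarray(cues, dtype=float) @ np.asarray(weights, dtype=float)

def select_within_budget(scores, budget):
    """Return indices of the `budget` highest-scoring features in one view."""
    order = np.argsort(-np.asarray(scores, dtype=float), kind="stable")
    return order[:budget]
```

A per-view budget then simply caps how many of the ranked candidates are passed on to matching and triangulation.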

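2605.31529 2026-06-01 cs.CV Version update

How Do Embedding Models Bind Concepts?

Arnas Uselis, Darina Koishigarina, Seong Joon Oh

AI summary: Studies the limits of concept binding in vision-language embedding models such as CLIP. Scene embeddings decompose additively into object representations, but CLIP's high-complexity binding function hinders generalization, whereas transformers trained with sufficient data coverage learn low-complexity binding functions with multiplicative interactions and generalize systematically.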

Comments ICML 2026

Abstract

Humans easily determine which color belongs to which shape in multi-object scenes, an ability known as concept binding. Vision-language embedding models such as CLIP struggle with binding: they recognize individual concepts but fail to represent which concepts form which objects. Although CLIP behaves like a bag-of-concepts model in cross-modal retrieval, object information is recoverable from its image and text embeddings separately. We study this tension through the binding function, which maps concepts to scene embeddings. We find that scene embeddings decompose additively into object representations, explaining why uni-modal probes can recover object information. However, CLIP's binding function is high-complexity, which likely prevents the image and text encoders from learning a shared binding mechanism that generalizes to unseen concept combinations. We then ask whether this limitation is fundamental. We show that it is not. In controlled transformer models trained from scratch, binding generalization emerges with sufficient data coverage. These models learn low-complexity binding functions characterized by multiplicative interactions between concepts, enabling systematic generalization. Code is publicly available at https://github.com/oshapio/binding-concepts-complexity.

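2605.31466 2026-06-01 cs.CV Version update

Self-Tuning Regularization for Image Scanning Microscopy

Sofia Agostoni, Lisa Cuneo, Christian Daniele, Giacomo Garré, Laurent Le, Alessandro Zunino, Giuseppe Vicidomini, Luca Calatroni

Affiliations: MaLGa Center, DIBRIS, University of Genoa; MMS, Istituto Italiano di Tecnologia (IIT); CSML, Istituto Italiano di Tecnologia (IIT)

AI summary: For multi-image deconvolution (MID) and super-resolution sectioning ISM (s²ISM) reconstruction in image scanning microscopy (ISM), proposes a self-tuning explicit regularization framework. A Bayesian maximum a posteriori formulation combines a multi-frame Poisson data fidelity term with ℓ1 or smoothed total variation penalties, and the regularization parameter is selected adaptively via the residual whiteness principle, removing the need for empirical stopping rules and giving stable super-resolution and optical sectioning in low-photon conditions.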

Abstract

Image Scanning Microscopy (ISM) is a fluorescence imaging technique that combines detector-array acquisition and computational reconstruction to achieve the theoretical resolution of an ideal confocal microscope, i.e., one operating with an infinitesimally small pinhole, while maintaining high signal-to-noise ratio. Among the reconstruction methods for obtaining the super-resolved image, multi-image deconvolution (MID) and its extension aimed at preserving the optical sectioning capability of confocal microscopy, known as super-resolution sectioning ISM (s$^2$ISM), are among the most widely used approaches. Both methods rely on Richardson--Lucy-type iterative schemes, whose semi-convergent behavior requires early stopping and often leads to noise amplification and reconstruction artifacts. In this work, we introduce a self-tuning explicit regularization framework for both MID and s$^2$ISM reconstruction. Within a Bayesian maximum a posteriori formulation, we combine a multi-frame Poisson data fidelity term with explicit regularization, considering $\ell_1$ and smoothed total variation penalties as representative examples. We further develop an automatic and ground-truth-free strategy for regularization parameter selection by adapting the residual whiteness principle to the multi-frame Poisson setting and introducing a spectral high-pass extension tailored to s$^2$ISM. The resulting framework enables stable reconstructions without empirical stopping rules. To demonstrate the proposed framework, we consider first-order optimization schemes based on proximal gradient and mirror descent methods with adaptive backtracking strategies. Experiments on simulated and real fluorescence ISM datasets demonstrate improved reconstruction stability and image quality with respect to unregularized approaches, while enabling robust super-resolution and optical sectioning in low-photon conditions.
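For orientation, a regularized multi-frame Poisson MAP problem of the kind the abstract describes can be written generically as below. The notation is an assumption made for illustration and is not taken from the paper: $y_i$ is the image recorded by detector element $i$, $H_i$ the corresponding blur operator, $R$ the penalty (for example $\ell_1$ or smoothed total variation), and $\lambda > 0$ the regularization parameter that the self-tuning strategy would select.

```latex
\hat{x} \in \operatorname*{arg\,min}_{x \ge 0}
  \sum_{i=1}^{N} \sum_{j}
    \Big( (H_i x)_j - y_{i,j} \log (H_i x)_j \Big)
  + \lambda \, R(x)
```

The double sum is the Poisson negative log-likelihood up to terms that do not depend on $x$. Setting $\lambda = 0$ recovers the unregularized problem that Richardson-Lucy-type iterations address, which is where early stopping would otherwise act as implicit regularization.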

2605.31400 2026-06-01 cs.CV Version update

FSM-Net: An Efficient Frequency-Spatial Network for Real-World Deblurring

Vinh-Thuan Ly

Affiliations: University of Science, VNU-HCM; Vietnam National University, Ho Chi Minh City

AI summary: Proposes FSM-Net, which performs efficient dual-domain deblurring with a Frequency Attention module and a Cross-Gated Vision E-Branchformer, and placed 2nd in the NTIRE 2026 Challenge on Efficient Real-World Deblurring.

Comments Accepted to NTIRE Workshop at CVPR 2026. Project page: https://efficient-deblurring-fsmnet.vercel.app

Abstract

Real-world image deblurring demands both high-fidelity restoration and computational efficiency, a balance existing methods often struggle to achieve. In this paper, we propose FSM-Net (Frequency-Spatial Multi-branch Network), a highly efficient solution that secured 2nd place in the NTIRE 2026 Challenge on Efficient Real-World Deblurring. FSM-Net pioneers a dual-domain approach: a novel Frequency Attention module explicitly recovers high-frequency structural details via FFT, while a Cross-Gated Vision E-Branchformer at the bottleneck captures global dependencies with linear complexity. To ensure robust convergence, we employ a progressive curriculum training strategy guided by a composite loss function (Multi-Scale Charbonnier, Structural Edge, and Frequency). Evaluated on the RSBlur benchmark, FSM-Net achieves an outstanding 33.144 dB PSNR with only 4.94M parameters and 159.35 GMACs (at 1920x1200 resolution). By effectively pushing the Pareto frontier of efficiency and quality, FSM-Net establishes a strong baseline for resource-constrained image restoration.

2605.31376 2026-06-01 cs.RO cs.CV cs.GR Version update

LiftNav: Path Planning via Semantic Lifting in TSDF-Guided Gaussian Splatting

Hannah Schieber, Dominik Frischmann, Victor Schaack, Angela P. Schoellig, Daniel Roth

Affiliations: Technical University of Munich; Human-Centered Computing and Extended Reality Lab; TUM University Hospital; Clinic for Orthopedics and Sports Orthopedics; Munich Institute of Robotics and Machine Intelligence (MIRMI); Learning Systems and Robotics Lab

AI summary: Proposes LiftNav, a hybrid navigation framework that combines a dual TSDF+GS map, YOLO detection, TSDF-based 3D lifting, and B-spline trajectory optimization to enable flexible semantic navigation without dense 3D embeddings. A hinge-loss collision penalty improves trajectory smoothness and safety, and simulations on the Replica dataset reach 100% feasibility with shorter trajectories.

2605.31369 2026-06-01 cs.LG cs.CV Version update

A Unifying View of Variational Generative Wasserstein Flows

Paul Caucheteux, Clément Bonet, Anna Korba

Affiliations: CMAP, CNRS, École Polytechnique, Institut Polytechnique de Paris, Palaiseau, France

AI summary: Presents a unified theoretical framework, Generative Wasserstein Flows (GWF), that casts many existing generative models as instances of parametric JKO schemes for f-divergence objectives, extends it to Integral Probability Metrics and Maximum Mean Discrepancy, derives new algorithms, and clarifies the connection to GANs.

Comments Accepted as a spotlight at ICML2026

Abstract

Many modern generative models can be viewed as minimizing divergences between probability distributions, yet they rely on different algorithmic and geometric principles. Wasserstein gradient flows provide a continuous-time formulation for optimizing over distributions, and can be approximated through their implicit discretization via the Jordan-Kinderlehrer-Otto (JKO) scheme. In this work, we present a unified theoretical framework for generative modeling based on Wasserstein gradient flows, which we refer to as Generative Wasserstein Flows (GWF). We show that a broad class of existing methods can be derived as instances of parametric JKO schemes for $f$-divergence objectives, and we establish equivalences between several recently proposed algorithms. We extend this framework beyond $f$-divergences to Integral Probability Metrics and squared Maximum Mean Discrepancy, deriving new JKO-based generative algorithms, and clarifying their connections with GANs. We study empirically the impact of the JKO regularization for a wide set of objectives. Finally, we analyze parametric Wasserstein flows, where the dynamics are restricted to distributions induced by parametrized maps.
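For readers unfamiliar with it, the standard JKO step that the abstract builds on can be stated as follows. Here $\mathcal{F}$ is the functional being minimized (for instance an $f$-divergence to the target distribution), $\tau > 0$ is the step size, and $W_2$ is the 2-Wasserstein distance. This is the textbook form of the scheme, not the paper's parametric variant.

```latex
\mu_{k+1} \in \operatorname*{arg\,min}_{\mu \in \mathcal{P}_2(\mathbb{R}^d)}
  \; \mathcal{F}(\mu) + \frac{1}{2\tau} \, W_2^2(\mu, \mu_k)
```

As $\tau \to 0$ the iterates $\mu_k$ approximate the Wasserstein gradient flow of $\mathcal{F}$. A parametric scheme in the sense of the abstract would presumably restrict $\mu$ to distributions induced by a parametrized map.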

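2605.31351 2026-06-01 cs.CL cs.CV Version update

A Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation

Yi Zhao, Siqi Wang, Zhe Hu, Yushi Li, Jing Li

Affiliations: Department of Computing, The Hong Kong Polytechnic University

AI summary: To examine the reliability of VLM-as-a-Judge evaluation for visually impaired assistance, proposes the VIABLE benchmark (over 300K samples, an Effectiveness-Impartiality-Stability framework, and a 12-mode failure taxonomy), finds existing judges unreliable, and develops VIA-Judge-Agent to improve diagnostic accuracy and user preference.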

Abstract

AI-based Visually Impaired Assistance (VIA) remains challenging, largely due to the high cost of human evaluation. The VLM-as-a-Judge paradigm may offer a promising alternative, although it has mostly been studied in general domains. We therefore ask whether such judges can be trusted for VIA tasks. To investigate this question, we introduce VIABLE (Visually Impaired Assistance Benchmark for VLM-as-a-Judge Evaluation), the first benchmark for VLM-as-a-Judge evaluation in VIA. VIABLE contains over 300K judgment samples across three scenarios and introduces an Effectiveness--Impartiality--Stability framework with a 12-mode failure taxonomy. Based on VIABLE, our systematic study of seven judges across different model scales shows that existing models are largely unreliable across all evaluation axes. The strongest judge, GPT-5.4, achieves only 52.6% single-failure diagnostic accuracy, yet exhibits the highest self-preference rate at 94.2%; while open-source judges are strongly biased and adversarially fragile. To address these issues, we propose VIA-Judge-Agent, a model-agnostic inference-time harness that augments judges with visual evidence extraction and a taxonomy-guided workflow. It enables positive improvements in diagnostic accuracy and downstream VIA responses more preferred by BLV users. Data and code are available at: https://github.com/YiyiyiZhao/VIABLE

2605.31349 2026-06-01 cs.CL cs.AI cs.CV cs.MM Version update

MoE-dqINR: A Unified Mixture-of-Experts Implicit Neural Representation Framework for Scan-Specific Dynamic and Quantitative MRI Reconstruction

Yinzhe Wu, Fanwen Wang, Zhenxuan Zhang, Zi Wang, Chengyan Wang, Guang Yang

Affiliations: Department of Bioengineering and I-X, Imperial College London; Cardiovascular Research Centre, Royal Brompton Hospital; National Heart and Lung Institute, Imperial College London; School of Biomedical Engineering & Imaging Sciences, King’s College London; Shanghai Pudong Hospital and Human Phenome Institute, Fudan University; International Human Phenome Institute (Shanghai), Shanghai, China

AI summary: Proposes MoE-dqINR, a framework that uses shared spatial experts and a state-conditioned routing pathway for efficient, unified scan-specific multicoil dynamic and quantitative MRI reconstruction, with optimization taking about 30 seconds per scan.

Abstract

Undersampled magnetic resonance imaging (MRI) reconstruction seeks to recover temporally or contrast-varying image series from incomplete multicoil k-space data while preserving state-dependent fidelity for dynamic and quantitative MRI (qMRI). Existing scan-specific implicit neural representations (INRs) often use monolithic spatiotemporal coordinate fields, explicit subspaces, motion or deformation models, calibration variables, or sequence-specific quantitative signal models. These design choices can limit flexibility in sharing spatial information while adapting image synthesis across acquisition states. Moreover, many INR-based baselines remain computationally demanding, typically requiring per-scan optimization times on the order of hundreds to thousands of seconds. We propose MoE-dqINR, a scan-specific multicoil MRI reconstruction framework that factorizes the image-domain representation into shared spatial experts and a state-conditioned routing pathway. Spatial experts encode reusable coordinate-dependent image content, whereas routing weights, conditioned on ordered acquisition states, synthesize each dynamic frame or contrast state from a common expert bank. The representation is coupled to a multicoil MRI forward model and uses the normalized state index to drive routing in both dynamic and quantitative MRI. By separating shared spatial representation from state-dependent synthesis, the framework provides an image-first architecture for dynamic and quantitative MRI while reducing scan-specific INR optimization to approximately 30 s per scan in our experiments. The proposed formulation establishes state-conditioned mixture-of-experts INR as a scan-specific multicoil MRI reconstruction prior that unifies shared spatial representation, dynamic- and qMRI-specific synthesis, and practical per-scan efficiency.
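The factorization into shared experts and state-conditioned routing can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the experts are fixed image arrays rather than coordinate networks, the router is a linear function of the normalized state index followed by a softmax, and the function names are hypothetical. The paper's actual architecture and its coupling to the multicoil forward model are not reproduced.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    e = np.exp(shifted)
    return e / e.sum()

def synthesize_frame(expert_images, router_weights, router_bias, state):
    """Mix K shared expert images into one frame for a given state.

    expert_images: (K, H, W) array of shared spatial content.
    router_weights, router_bias: (K,) parameters of an assumed linear router.
    state: normalized acquisition-state index in [0, 1].
    Returns the (H, W) frame and the (K,) gate values.
    """
    gates = softmax(router_weights * state + router_bias)
    frame = np.tensordot(gates, expert_images, axes=1)  # weighted sum over experts
    return frame, gates
```

Because the experts are shared across all states, only the small routing vector changes from frame to frame, which is the intuition behind reusing spatial content across a dynamic or multi-contrast series.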

2605.31294 2026-06-01 cs.CV 版本更新

TokTalk: Expressive Real-time Facial Animation from Audio-LLM Tokens

TokTalk: 基于音频-大语言模型令牌的富有表现力的实时面部动画

Qingcheng Zhao, Yifang Pan, Karan Singh

发表机构 * University of Toronto（多伦多大学）

AI总结提出TokTalk系统，利用音频-大语言模型产生的音频令牌直接实时生成富有表现力的3D面部动画，通过分块条件流匹配模型和轻量级适配策略实现低延迟和高品质。

详情

AI中文摘要

近期GPT-4o等音频-大语言模型的进展开启了与语言模型对话交互的新时代。然而，对话式虚拟角色在面部表情和对话流程上仍显机械，部分原因在于其顺序执行语音识别、文本生成、轮次文本响应、语音合成和音频驱动面部动画等多个阶段。基于当前音频-大语言模型产生的音频令牌包含足够信息以重建合理面部表现这一洞察，我们提出TokTalk，一个直接从流式音频令牌实时输出富有表现力面部动画的系统。我们构建了一个新颖的音频令牌到3D面部运动数据集，并使用基于分块的条件流匹配模型训练TokTalk。一种轻量级适配策略使我们的训练模型能够以极小的计算开销无缝连接到任何基于令牌的音频-大语言模型。我们的分块处理进一步实现了延迟与面部质量之间的参数化权衡，并通过消融研究进行了验证。我们还表明，TokTalk的实时性能在延迟上与现有技术解决方案相当，而在3D面部表现的质量、表现力和可控性方面（通过感知研究）显著更优。我们通过聊天机器人虚拟角色、语音驱动的用户虚拟角色和动画导演界面展示了TokTalk在多种音视频面部应用中的灵活性。

英文摘要

Recent advances in Audio-LLMs like GPT-4o have ushered in an era of conversational interaction with language models. Conversational avatars, however, still seem robotic in facial expression and conversational flow, in part due to sequential stages of speech recognition, text generation, turn-based text response, speech synthesis, and audio-driven facial animation. Based on our insight that audio-tokens produced by current Audio-LLMs carry sufficient information to reconstruct a plausible facial performance, we present TokTalk, a system that directly outputs expressive facial animation in real-time from streaming audio-tokens. We construct a novel audio-token to 3D facial motion dataset, on which TokTalk is trained using a Chunk-based Conditional Flow Matching model. A lightweight adaptation strategy allows our trained model to seamlessly connect to any token-based Audio-LLM at minimal computational overhead. Our chunk-based processing further enables a parametric trade-off between latency and facial quality, shown through ablation studies. We further show that the real-time performance of TokTalk is comparable in latency to prior-art solutions, and significantly more favorable (via a perceptual study) in terms of quality, expressivity and control of the 3D facial performance. We showcase TokTalk's flexibility using a chatbot Avatar, a voice-driven user Avatar, and an animation Director's interface, as diverse audio-visual face applications.

2605.31292 2026-06-01 cs.CV 版本更新

ERGeoBench：多模态大语言模型中具身推理与地理定位的综合基准

Kaiwen Xue, Tao Wei, Guoxin Zhang, Zhonghong Ou, Kaoyan Lu, Yu Feng, Yifan Zhu, Haoran Luo

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； State Key Laboratory of Networking and Switching Technology（网络与交换技术国家重点实验室）； School of Materials Science and Engineering（材料科学与工程学院）； China Mobile Research Institute（中国移动研究院）； College of Computing and Data Science（计算与数据科学学院）

AI总结提出ERGeoBench基准，通过单视图、全景视图和具身视图三种渐进设置评估多模态大语言模型在视觉驱动的具身地理定位中的能力，发现当前模型在高层次地理语义推理上表现良好，但在细粒度感知、度量定位和视图间空间一致性上仍有不足。

详情

AI中文摘要

多模态大语言模型（MLLMs）作为具身代理展现出强大潜力，然而由于缺乏细粒度评估，具身地理定位仍未被充分探索。我们引入ERGeoBench，一个用于视觉驱动的具身地理定位的诊断基准。ERGeoBench在三种渐进设置下评估模型——单视图、全景视图和具身视图——其中代理可以通过偏航、俯仰和缩放的顺序变化主动获取观察。该基准包含2,207个全球分布的街景全景图，并衡量四种互补能力：基础感知、空间意识、常识推理和地理定位推理。对领先的专有和开源MLLMs的评估表明，当前模型能够推断高层次的地理语义，但在细粒度感知操作、度量定位和跨视图空间一致性方面仍然困难。我们进一步观察到，地理定位与其他能力维度强相关，表明准确定位依赖于集成的感知、空间推理和常识推理，而非孤立的视觉识别。总体而言，ERGeoBench为诊断和推进类人具身地理定位提供了一个统一框架。项目页面：https://kaixuewen.github.io/ERGeoBench/

英文摘要

Multimodal large language models (MLLMs) have shown strong potential as embodied agents, yet embodied geo-localization remains underexplored due to the lack of fine-grained evaluation. We introduce ERGeoBench, a diagnostic benchmark for vision-driven embodied geo-localization. ERGeoBench evaluates models under three progressive settings -- single-view, panorama-view, and embodied-view -- where agents may actively acquire observations through sequential changes in yaw, pitch, and zoom. The benchmark contains 2,207 globally distributed street-view panoramas and measures four complementary capabilities: foundational perception, spatial awareness, common sense reasoning, and geo-localization reasoning. Evaluations of leading proprietary and open-source MLLMs show that current models can infer high-level geographic semantics, but still struggle with fine-grained perceptual operations, metric localization, and spatial consistency across views. We further observe that geo-localization is strongly correlated with the other capability dimensions, suggesting that accurate localization depends on integrated perception, spatial reasoning, and commonsense inference rather than isolated visual recognition. Overall, ERGeoBench provides a unified framework for diagnosing and advancing human-like embodied geo-localization. Project Page: https://kaixuewen.github.io/ERGeoBench/

2605.31246 2026-06-01 cs.CR cs.CV 版本更新

BadBone: Backdoor Attacks Against Backbone Models in Visual Prompt Learning

BadBone：视觉提示学习中针对骨干模型的后门攻击

Ziqing Yang, Rui Wen, Xinlei He, Yun Shen, Michael Backes, Yang Zhang

发表机构 * CISPA Helmholtz Center for Information Security（CISPA亥姆霍兹信息安全中心）； Institute of Science Tokyo（东京科学大学）； Wuhan University（武汉大学）； Flexera（Flexera 公司）

AI总结提出BadBone，一种利用双层优化的隐蔽自适应后门攻击方法，通过破坏骨干模型使下游提示学习任务继承后门漏洞，实验表明现有防御措施基本无效。

Comments Accepted by IEEE Transactions on Information Forensics & Security

详情

DOI: 10.1109/TIFS.2026.3698596

AI中文摘要

提示学习是一种新的机器学习范式，因其简单性和有效性而受到广泛关注。尽管其应用日益增多，但该范式的安全漏洞仍未被充分探索。在这项工作中，我们率先提出BadBone，一种利用双层优化的隐蔽自适应后门攻击，针对提示学习。我们的目标不是对提示学习过程植入后门，而是破坏骨干模型，使得只有采用提示学习的目标下游任务继承后门漏洞。在三个不同模型和来自不同领域的三个数据集上的大量实验表明，我们的定向/非定向后门模型在保持预训练和下游任务实用性的同时，实现了高攻击性能。此外，我们针对六种最先进的模型级防御（包括Neural Cleanse、ABS、MNTD、NAD、CLP和D-BR）评估了我们的方法。结果表明，这些防御对我们的后门模型基本无效，因此有效的防御仍是未来工作的重要方向。

英文摘要

Prompt learning is a new machine learning paradigm that has attracted ample attention due to its simplicity and proven efficacy. Despite its growing adoption, the security vulnerabilities associated with this paradigm remain underexplored. In this work, we take a first step by proposing BadBone, a stealthy and adaptive backdoor attack against prompt learning using bi-level optimization. Instead of backdooring the prompt learning process, we aim to compromise a backbone model such that only target downstream tasks employing prompt learning inherit the backdoor vulnerability. Extensive experiments on three different models and three datasets from various domains show that our targeted/untargeted backdoored models achieve high attack performance while maintaining utility on both pre-training and downstream tasks. Moreover, we evaluate our approach against six state-of-the-art model-level defenses, including Neural Cleanse, ABS, MNTD, NAD, CLP, and D-BR. The results demonstrate that these defenses are largely ineffective against our backdoored models, leaving effective defense as an important direction for future work.

2605.31229 2026-06-01 cs.CV cs.AI 版本更新

Beyond Classification: Dynamic Adapter Routing for Continual Multimodal Retrieval

超越分类：面向持续多模态检索的动态适配器路由

Alicja Dobrzeniecka, Filip Szatkowski, Sebastian Cygert, Szymon Lukasik, Bartlomiej Twardowski

发表机构 * NASK National Research Institute（NASK国家研究院）； IDEAS Research Institute（IDEAS研究所）； Warsaw University of Technology（华沙理工大学）； Universitat Autonoma de Barcelona（巴塞罗那自治大学）

AI总结针对持续多模态检索（CMR）任务，提出基于原型路由和模型合并的动态适配器路由（DAR）方法，在跨域评估中取得优于现有基线的性能。

详情

AI中文摘要

虽然检索是视觉-语言模型的核心功能，但持续更新这些模型用于检索任务仍未被充分探索。现有工作通常通过类增量学习（CIL）的视角处理持续检索，在可能无法完全捕捉检索特定动态的设置中评估标准CIL方法和面向检索的适应方法。为了解决这一问题，我们引入了一个新的、原则性的持续多模态检索（CMR）评估框架，涵盖多样化的视觉领域，并在此设置中系统评估常见方法。我们的实证分析表明，标准CIL方法在我们更具挑战性的场景中未能产生有意义的增益。因此，我们提出了动态适配器路由（DAR），一种基于通过原型路由选择适配器并通过模型合并组合的新方法。DAR在先前基线上取得了优越性能，并在分布外评估中展现出强大的泛化能力。我们的结果凸显了CMR的独特挑战，并鼓励在该方向进行进一步研究。

英文摘要

While retrieval is a core function of vision-language models, continually updating these models for retrieval tasks remains critically underexplored. Existing work often approaches continual retrieval through the lens of class-incremental learning (CIL), evaluating both standard CIL methods and retrieval-oriented adaptations in settings that may not fully capture retrieval-specific dynamics. To address this, we introduce a new, principled evaluation framework for continual multimodal retrieval (CMR) spanning diverse visual domains, and systematically evaluate common approaches within this setting. Our empirical analysis shows that standard CIL methods fail to yield meaningful gains in our more challenging scenario. Therefore, we propose Dynamic Adapter Routing (DAR), a novel approach based on adapters selected through prototype-based routing and combined via model merging. DAR achieves superior performance over the previous baselines and demonstrates strong generalization under out-of-distribution evaluation. Our results highlight the unique challenges of CMR and encourage further research in this direction.
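
A rough sketch of what prototype-based routing followed by model merging could look like is given below. The abstract does not specify the similarity measure, the number of adapters selected, or the merging rule, so cosine similarity, top-2 selection, softmax weighting, and weighted parameter averaging are all assumptions; the domain names and toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def route(query, prototypes, top_k=2):
    """Return (adapter_name, weight) pairs for the top_k most similar prototypes."""
    sims = sorted(
        ((cosine(query, proto), name) for name, proto in prototypes.items()),
        reverse=True,
    )[:top_k]
    best = sims[0][0]
    exps = [(math.exp(s - best), name) for s, name in sims]
    total = sum(e for e, _ in exps)
    return [(name, e / total) for e, name in exps]

def merge_adapters(adapters, routing):
    """Weighted average of adapter parameter vectors (one simple form of merging)."""
    dim = len(next(iter(adapters.values())))
    merged = [0.0] * dim
    for name, weight in routing:
        for i, value in enumerate(adapters[name]):
            merged[i] += weight * value
    return merged

# Hypothetical per-domain prototypes (mean embeddings) and adapter parameters.
prototypes = {"birds": [1.0, 0.0], "cars": [0.0, 1.0], "food": [-1.0, 0.0]}
adapters = {"birds": [0.2, 0.4], "cars": [1.0, -1.0], "food": [5.0, 5.0]}

routing = route([0.9, 0.1], prototypes, top_k=2)
merged = merge_adapters(adapters, routing)
```

Here the query embedding lies closest to the "birds" prototype, so that adapter dominates the merged parameters, and the dissimilar "food" adapter is excluded entirely.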

2605.31227 2026-06-01 cs.CV 版本更新

HiERO-StepG @ Ego4D Step Grounding Challenge: hierarchical activity understanding enables zero-shot step grounding

HiERO-StepG @ Ego4D Step Grounding Challenge: 层次化活动理解实现零样本步骤定位

Andrea Zenotto, Simone Alberto Peirone, Francesca Pistilli, Giuseppe Averta

发表机构 * Politecnico di Torino（都灵理工大学）

AI总结提出HiERO-StepG方法，利用弱监督层次化表示学习和聚类，无需任务特定微调即可实现零样本步骤定位，在Ego4D挑战中达到56.27% R@1 (IoU=0.3)。

Comments Technical report for the Ego4D Goal Step - Step Grounding challenge at CVPR 2026, derived from arXiv:2505.12911

详情

AI中文摘要

程序性活动遵循明确的结构：无论是考虑烹饪食谱还是机械师修理汽车，这些活动自然分解为步骤和子步骤的层次结构。传统的步骤定位方法需要大量标注且扩展性差。相反，我们认为这种层次结构可以通过共同发生的动作和活动的重复模式，从非策划的人类活动视频中自然涌现。我们的方法基于HiERO，一种弱监督表示学习方法，它利用细粒度的动作级叙述，将功能相关的动作在特征空间中映射得接近。在这个特征空间中，程序步骤可以通过简单的聚类检测到，无需额外的任务特定微调。对于Ego4D步骤定位挑战，我们通过确保步骤分配的细粒度和粗粒度一致性、强制定位步骤的严格时间单调性以及后处理检测步骤以减少噪声预测的影响来增强这种方法。我们将这种方法称为HiERO-StepG，在提交时，它在全局排行榜上以完全零样本且不需要程序特定注释的情况下，在R@1 (IoU = 0.3)指标上达到56.27%，排名第二。项目页面：https://github.com/andreazenotto/HiERO-StepG。

英文摘要

Procedural activities follow well-defined structures: whether we consider a cooking recipe or a mechanic repairing a car, these activities naturally decompose into a hierarchy of steps and sub-steps. Traditional approaches for step grounding require extensive annotations and scale poorly. Instead, we argue that such hierarchical structure can emerge naturally from uncurated videos of human activities through recurring patterns of co-occurring actions and activities. Our approach builds on HiERO, a weakly-supervised representation learning approach that maps functionally related actions close to each other in the feature space, leveraging only fine-grained action-level narrations. In this feature space, procedure steps can be detected by simple clustering, with no additional task-specific fine-tuning. For the Ego4D Step Grounding challenge, we augment this approach by ensuring fine- and coarse-level agreement in step assignments, enforcing strict temporal monotonicity of the grounded steps, and post-processing the detected steps to reduce the impact of noisy predictions. We call this approach HiERO-StepG; it achieves 56.27% on the R@1 (IoU = 0.3) metric on the global leaderboard at submission time, ranking second while being completely zero-shot and not requiring procedure-specific annotations. Project page: https://github.com/andreazenotto/HiERO-StepG.
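
For readers unfamiliar with the reported metric, the sketch below shows the usual reading of R@1 (IoU = 0.3): a query counts as a hit when its top-ranked predicted segment overlaps the ground-truth segment with a temporal IoU of at least 0.3. This is the conventional definition, stated here as an assumption; the official challenge evaluation may differ in details such as whether the threshold is inclusive. The function names are illustrative.

```python
def temporal_iou(pred, gt):
    """IoU of two time intervals given as (start, end) in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(top1_preds, ground_truths, threshold=0.3):
    """Fraction of queries whose top-1 segment reaches the IoU threshold."""
    hits = sum(
        1
        for pred, gt in zip(top1_preds, ground_truths)
        if temporal_iou(pred, gt) >= threshold
    )
    return hits / len(ground_truths)
```

For example, a prediction of (0, 10) against a ground truth of (5, 15) shares 5 s out of a 15 s union, giving an IoU of 1/3, which passes the 0.3 threshold.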

2605.31217 2026-06-01 cs.CV 版本更新

TALON: Token-Aligned Lightweight Adapters for 6-DoF Spacecraft Pose Estimation

TALON: 用于六自由度航天器姿态估计的令牌对齐轻量适配器

Abid Ali, Arunkumar Rathinam, Djamila Aouada

AI总结提出TALON方法，通过在冻结的ViT注意力层前注入时空3D适配器并结合令牌对齐损失，实现轻量级六自由度航天器姿态估计，在SPADES和SwissCube数据集上显著降低姿态误差。

Comments 13 pages paper with 3 figures in total

详情

AI中文摘要

单目六自由度航天器姿态估计方法主要处理单帧图像，忽略了航天器机动过程中获取的图像序列中的时间信息。少数时间方法需要完全骨干微调或辅助光流网络，分别存在灾难性遗忘或增加计算成本的风险。我们提出TALON（轨道导航的令牌对齐轻量适配器）：在冻结的ViT视觉变换器的自注意力层之前注入时空3D适配器，结合补丁-令牌对齐损失，通过原型条件KL散度目标将适配特征几何地锚定到关键点结构。注意力前放置允许冻结注意力对时间增强的令牌进行推理，每个块使用单个适配器即可获得比注意力后替代方案更强的性能。对齐损失塑造中间表示，使得每个关键点在令牌场中引发空间精确的激活，而该框架向冻结骨干添加的参数少于5%。在SPADES数据集上，TALON将姿态误差比先前最先进方法降低50%；在SwissCube数据集上，其在ADD-0.1d准确率上超越先前最佳方法21.8%。在SPARK真实数据上的从仿真到真实的零样本跨域评估将姿态误差降低4.7倍，消融实验表征了适配器深度在域内和跨域设置中的作用。

英文摘要

Monocular 6-DoF spacecraft pose estimation methods predominantly process individual frames, discarding the temporal information present in an image sequence acquired during spacecraft manoeuvres. The few temporal approaches require full backbone fine-tuning or auxiliary optical flow networks, risking catastrophic forgetting or increasing computational cost, respectively. We propose TALON (Token-Aligned Lightweight adapters for Orbital Navigation): spatiotemporal 3D adapters injected before the self-attention layers of a frozen Vision Transformer (ViT), combined with a patch-token alignment loss that geometrically grounds the adapted features to keypoint structure through a prototype-conditioned KL-divergence objective. Pre-attention placement allows the frozen attention to reason over temporally enriched tokens, achieving stronger performance with a single adapter per block than post-attention alternatives. The alignment loss shapes the intermediate representations so that each keypoint induces a spatially precise activation in the token field, while the framework adds less than 5% additional parameters to the frozen backbone. On the SPADES dataset, TALON reduces the pose error by 50% over the prior state-of-the-art, and on the SwissCube dataset it surpasses the prior best by 21.8% in ADD-0.1d accuracy. Zero-shot cross-domain evaluation from sim-to-real on SPARK real data reduces pose error by 4.7x, and ablations characterise the role of adapter depth across in-domain and cross-domain settings.

2605.31215 2026-06-01 cs.LG cs.CV 版本更新

Fixed-Point Masked Generative Modeling

不动点掩码生成建模

Andrea Miele, Yiming Qin, Alba Carballo-Castro, Justin Deschenaux, Pascal Frossard

发表机构 * LTS4, EPFL（EPFL LTS4实验室）

AI总结提出不动点掩码生成模型(FP-MGM)，通过共享注意力层的不动点求解器实现自适应深度，并引入跨步一致性损失和三态重用(3SR)策略，在降低参数和训练成本的同时提升低预算掩码生成质量。

详情

AI中文摘要

掩码生成模型(MGM)支持并行解码并在多种模态上取得强性能，但每一步都需要全序列双向变换器，导致训练成本高且在低采样预算下质量下降。现有工作通过更好的采样器或更便宜的固定深度去噪器提升效率，但仍为每个精炼步骤分配固定量的去噪器计算。我们提出不动点掩码生成模型(FP-MGM)，用共享注意力层上的不动点求解器替换部分去噪器，实现自适应深度且参数更少。为使其更有效地用于掩码生成，我们首先引入跨步一致性损失，对齐相邻去噪步骤的隐藏表示；其次，三态重用(3SR)通过分别处理未改变、仍掩码和新揭示的令牌，利用先前解热启动求解器。这些组件共同定义了我们的不动点掩码生成的完整训练到推理框架CoFRe。我们还表明，预训练的MGM可以通过短微调转换为FP-MGM，避免完全重新训练。跨模态，CoFRe改善了质量与成本的权衡。在OpenWebText上，与MDLM相比，CoFRe参数减少38.8%，训练时间减少11.5%，VRAM减少16.9%，同时在96个变换器块前向传播的预算下，生成困惑度从830.8提升到101.8。在ImageNette上，CoFRe训练时间减少48.6%，VRAM减少50.7%，并在所有测试的样本预算下改善FID。总体而言，CoFRe为更便宜的训练和更强的低预算掩码生成提供了一个实用框架。

英文摘要

Masked Generative Models (MGMs) enable parallel decoding and achieve strong performance across modalities, but require full-sequence bidirectional transformers at every step, making training costly and degrading quality under low sampling budgets. Existing work improves efficiency via better samplers or cheaper fixed-depth denoisers, but still allocates a fixed amount of denoiser computation to each refinement step. We introduce Fixed-Point Masked Generative Models (FP-MGMs), which replace part of the denoiser with a fixed-point solver over shared attention layers to enable adaptive depth with fewer parameters. To make it more effective for masked generation, we first introduce a cross-step consistency loss, which aligns hidden representations at neighboring denoising steps, and second, three-state reuse (3SR), which warm-starts the solver from the previous solution by treating unchanged, still-masked, and newly revealed tokens differently. Together, these components define our complete training-to-inference framework for fixed-point masked generation, CoFRe. We also show that pre-trained MGMs can be converted into FP-MGMs with short fine-tuning, avoiding full retraining. Across modalities, CoFRe improves the quality-cost trade-off. On OpenWebText, CoFRe reduces parameters by 38.8%, training time by 11.5%, and VRAM by 16.9%, while improving generative perplexity from 830.8 to 101.8 at a budget of 96 transformer-block forward passes, compared to MDLM. On ImageNette, CoFRe reduces training time by 48.6% and VRAM by 50.7%, while improving FID at all sampling budgets tested. Overall, CoFRe offers a practical framework for cheaper training and stronger low-budget masked generation.
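
The intuition behind warm-starting a fixed-point solver can be shown with a toy example. The sketch below is not the paper's solver and does not model the per-token handling of 3SR; it uses a hypothetical contraction map in place of the shared attention layers and plain repeated application as the solver, purely to show that starting from the previous step's solution needs fewer iterations when the input changes only slightly.

```python
def solve_fixed_point(f, z0, tol=1e-6, max_iter=1000):
    """Iterate z <- f(z) until successive iterates differ by less than tol.

    Returns (solution, number_of_iterations_used).
    """
    z = z0
    for step in range(1, max_iter + 1):
        z_next = f(z)
        if max(abs(a - b) for a, b in zip(z_next, z)) < tol:
            return z_next, step
        z = z_next
    return z, max_iter

def make_layer(inputs):
    """Toy stand-in for a shared layer: the contraction z -> 0.5 * z + input."""
    return lambda z: [0.5 * zi + xi for zi, xi in zip(z, inputs)]

# Step t: solve from a cold start (all zeros).
layer_t = make_layer([1.0, 2.0, 3.0])
z_t, cold_iters = solve_fixed_point(layer_t, [0.0, 0.0, 0.0])

# Step t+1: the input changes slightly; compare warm and cold starts.
layer_next = make_layer([1.0, 2.0, 3.1])
_, warm_iters = solve_fixed_point(layer_next, z_t)
_, cold_iters_next = solve_fixed_point(layer_next, [0.0, 0.0, 0.0])
```

Because the iteration count depends on how far the initial guess is from the solution, the solver also spends less compute on easy steps and more on hard ones, which is the adaptive-depth behaviour the abstract refers to.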

2605.31212 2026-06-01 cs.CV cs.AI cs.CL 版本更新

语言训练深度伪造检测器的正则化能力

Benedikt Hopf, Zongwei Wu, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS, University of Würzburg（维尔茨堡大学CAIDAS计算机视觉实验室）

AI总结提出利用多模态大语言模型的双编码器架构和两阶段训练，通过语言正则化缓解过拟合，提升深度伪造检测的泛化性和可解释性。

详情

AI中文摘要

最近，得益于多模态大语言模型的出现，深度伪造检测器不仅追求泛化性，还追求可解释性。我们提出这两个挑战可以有效地联合解决，因为可描述的伪影通常泛化性更好，从而开辟了使用语言作为正则化机制的可能性。由于深度伪造检测通常过拟合于低层次的领域特定伪影，我们的直觉是，经过语言预训练的LLM会更偏好于可更好描述的高层次伪影。这样，我们可以在可能的情况下使用高层次特征，同时训练模型在必要时使用低层次特征。我们利用双编码器架构，将冻结的专家检测器与LoRA调优的MLLM编码器配对，并采用两阶段训练课程：首先，二元对齐阶段表明，MLLM的内在能力可以有效地组合特征，以减轻对数据集特定伪影的过拟合。为了进一步增强泛化性并实现可解释性，我们采用强化学习阶段，鼓励模型在分类前生成描述性推理，仅使用二元标签。通过奖励这种“先解释后分类”的行为，我们明确激励模型优先考虑高层次、鲁棒的特征。关键在于，这一过程既产生了可解释的描述，又进一步提升了跨数据集性能，即使在推理时省略推理链也是如此。在基准数据集上的大量实验验证了我们的方法，以较大优势超越了最先进的方法。

英文摘要

Recently, thanks to the advent of Multimodal-LLMs, deepfake detectors are striving not only to be generalizable but also interpretable. We propose that these two challenges can effectively be tackled jointly, since describable artifacts typically generalize better, opening the possibility to use language as a regularization mechanism. Since deepfake detection generally suffers from overfitting to low-level domain-specific artifacts, our intuition is that an LLM that has been pretrained on language would prefer high-level artifacts that can be described better. This way, we can use high-level features where possible, while training the model to use low-level features where necessary. We utilize a dual-encoder architecture, pairing a frozen specialist detector with a LoRA-tuned MLLM encoder, and a two-stage training curriculum: first, a binary alignment phase demonstrates that the intrinsic capability of MLLMs can effectively combine features to mitigate overfitting to dataset-specific artifacts. To further bolster generalization and achieve interpretability, we employ a reinforcement learning stage that encourages the model to generate descriptive reasoning before classifying, using only binary labels. By rewarding this "explain-then-classify" behavior, we explicitly incentivize the model to prioritize high-level, robust features. Crucially, this process yields both interpretable descriptions and a further boost in cross-dataset performance, even when reasoning chains are omitted at inference. Extensive experiments on benchmark datasets validate our approach, outperforming state-of-the-art methods by a large margin.

2605.31191 2026-06-01 cs.LG cs.CV 版本更新

Student Capacity Moderates Knowledge Distillation Effectiveness: A Systematic Study Across ResNet Teacher-Student Pairs on CIFAR-10

学生容量调节知识蒸馏有效性：基于CIFAR-10上ResNet教师-学生对的系统研究

Umut Onur Yasar

发表机构 * GitHub

AI总结通过ResNet教师-学生对在CIFAR-10上的图像分类实验，系统研究学生容量如何调节知识蒸馏（KD）的有效性，发现学生容量是蒸馏增益的关键调节因素，并指出实现正确性和输入分辨率感知架构的重要性。

Comments 9 pages, 2 figures, 5 tables. Code available at https://github.com/umutonuryasar/kd-capacity-gap

详情

AI中文摘要

我们研究了教师-学生容量关系如何调节基于ResNet的CIFAR-10图像分类中知识蒸馏（KD）的有效性。在三个教师-学生对（R50->R18、R34->R18和R50->R34）中，我们在受控、可重复的条件下（3个种子，全程报告均值±标准差）比较了Logit-KD和Feature-KD。我们报告三个主要发现。首先，学生容量是蒸馏增益的关键调节因素：即使教师-学生准确率差距相当，R34学生从KD中获得的收益也远大于R18学生，R50->R34 Feature-KD的最大增益为+0.30个百分点，而R34->R18 Feature-KD为+0.18个百分点，R34->R18 Logit-KD为+0.00个百分点。其次，实现的正确性对Feature-KD至关重要：一个排除了投影层的梯度裁剪错误抑制了Feature-KD的性能，并产生了与Logit-KD的误导性比较。修正后，Feature-KD在三个对中的两个上匹配或优于Logit-KD，在R50->R34上达到95.55%，基线为95.25%。第三，输入分辨率感知架构是有效蒸馏的先决条件：将ResNet主干修正为32x32输入使教师准确率提高超过5个百分点——比任何KD增益高出一个数量级。所有代码和结果可在github.com/umutonuryasar/kd-capacity-gap获取。

英文摘要

We investigate how teacher-student capacity relationships modulate knowledge distillation (KD) effectiveness in ResNet-based image classification on CIFAR-10. Across three teacher-student pairs -- R50->R18, R34->R18, and R50->R34 -- we compare Logit-KD and Feature-KD under controlled, reproducible conditions (3 seeds, mean ± std reported throughout). We report three main findings. First, student capacity is a key moderating factor in distillation gain: R34 students benefit substantially more from KD than R18 students even when teacher-student accuracy gaps are comparable, with the strongest gain of +0.30pp observed for R50->R34 Feature-KD versus +0.18pp for R34->R18 Feature-KD and +0.00pp for R34->R18 Logit-KD. Second, implementation correctness critically affects Feature-KD: a gradient clipping bug that excluded projection layers suppressed Feature-KD performance and produced misleading comparisons with Logit-KD. After correction, Feature-KD matches or outperforms Logit-KD in two of three pairs, reaching 95.55% on R50->R34 against a baseline of 95.25%. Third, input-resolution-aware architecture is a prerequisite for effective distillation: correcting the ResNet stem for 32x32 inputs raises teacher accuracy by over 5pp -- an order of magnitude larger than any KD gain. All code and results are available at github.com/umutonuryasar/kd-capacity-gap.
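
As background for the Logit-KD setting, a minimal sketch of the classic temperature-scaled distillation loss follows. The abstract does not state the paper's exact loss, temperature, or weighting, so the formulation (KL divergence between softened teacher and student distributions, scaled by T squared, plus hard-label cross-entropy) and the default values temperature=4.0 and alpha=0.9 are assumptions for illustration only.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def logit_kd_loss(student_logits, teacher_logits, label, temperature=4.0, alpha=0.9):
    """alpha * T^2 * KL(teacher || student) + (1 - alpha) * cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(
        pt * math.log(pt / ps)
        for pt, ps in zip(p_teacher, p_student)
        if pt > 0
    )
    cross_entropy = -math.log(softmax(student_logits)[label])
    return alpha * temperature ** 2 * kl + (1 - alpha) * cross_entropy
```

When the student already reproduces the teacher's logits, the KL term vanishes and only the down-weighted cross-entropy remains.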

2605.31187 2026-06-01 cs.CV cs.LG 版本更新

From Local Geometry to Global Pseudo Labeling for Robust Positive Unlabeled Learning under Covariate Shift

从局部几何到全局伪标注：协变量偏移下鲁棒的正无标记学习

Firas Gabetni, Alexandre Rocchi Henry, Nacim Belkhir, Ziyi Liu, Gianni Franchi

发表机构 * U2IS, ENSTA（U2IS，ENSTA）； Institut Polytechnique de Paris（巴黎理工学院）； AMIAD, Pôle Recherche, Palaiseau（AMIAD，研究部，帕莱索）

AI总结提出SPUNA框架，利用局部流形结构逐步发现偏移数据，在协变量偏移下实现正无标记学习，性能达到全监督方法水平。

详情

AI中文摘要

检测协变量偏移对于构建可靠的视觉系统至关重要。虽然大多数先前工作专注于提高对偏移的鲁棒性，但显式检测协变量偏移仍未被充分探索。现有方法通常依赖于全监督训练，需要来自原始分布和偏移分布的有标签样本，这往往不切实际。在本文中，我们表明协变量偏移检测可以通过使用正无标记（PU）学习的弱监督有效解决。然而，在协变量偏移下，分布内数据和偏移数据显著重叠，使得经典PU方法不稳定且对噪声敏感。为克服这一挑战，我们引入了谱PU邻域标注（SPUNA），这是一种几何感知框架，通过利用视觉特征的局部流形结构逐步发现偏移数据。大量实验表明，SPUNA在PU设置中实现了最先进的性能，并且显著匹配了全监督方法的性能。此外，我们的方法在不同类型的偏移之间鲁棒地迁移，展示了强大的泛化能力。

英文摘要

Detecting covariate shift is critical for building reliable vision systems. While most prior work focuses on improving robustness to shift, explicitly detecting covariate shift remains underexplored. Existing approaches typically rely on fully supervised training, requiring labeled examples from both original and shifted distributions, which is often impractical. In this paper, we show that covariate shift detection can be effectively addressed with weaker supervision using Positive Unlabeled (PU) learning. However, under covariate shift, in-distribution and shifted data overlap significantly, making classical PU methods unstable and sensitive to noise. To overcome this challenge, we introduce Spectral PU Neighborhood Annotation (SPUNA), a geometry-aware framework that progressively discovers shifted data by leveraging the local manifold structure of visual features. Extensive experiments show that SPUNA achieves state-of-the-art performance in PU settings and matches the performance of fully supervised methods. Moreover, our approach transfers robustly across different types of shifts, demonstrating strong generalization capabilities.

2605.31177 2026-06-01 cs.CV 版本更新

Vanilla ViT for Automotive Point Cloud Semantic Segmentation

用于汽车点云语义分割的普通ViT

Gilles Puy, Nermin Samet, Alexandre Boulch, Spyros Gidaris, Tuan-Hung VU, Renaud Marlet

发表机构 * LIGM, CNRS, Univ Gustave Eiffel, ENPC, IP Paris, France（LIGM、CNRS、Université Gustave Eiffel、ENPC、IP Paris、法国）

AI总结本文提出VaViT，通过精心设计的标记器、轻量级解码器头和定制数据增强，使普通非分层ViT在大规模激光雷达点云语义分割中达到或超越现有最先进方法。

详情

AI中文摘要

普通Transformer已成为处理文本、音频、图像和视频的事实标准架构，为多模态学习提供了统一的主干。然而，点云语义分割的最先进架构仍然由U-Net架构主导，其中卷积与局部或窗口注意力交错。在这项工作中，我们展示了如何有效利用普通、非分层的ViT进行大规模汽车激光雷达场景的分割。通过精心设计的标记器、轻量级解码器分割头和定制数据增强，我们弥合了性能差距。我们的方法VaViT（Vanilla ViT）在保持ViT架构简单性的同时，匹配或超过了最先进方法的性能。我们在nuScenes、SemanticKITTI和Waymo Open Dataset上进行了广泛评估，以验证我们方法的有效性。代码和模型可在https://github.com/valeoai/VaViT获取。

英文摘要

Plain Transformers have become the de-facto architecture for processing text, audio, image, and video, offering a unified backbone for multimodal learning. However, state-of-the-art architectures for point cloud semantic segmentation remain dominated by U-Net architectures where convolutions are interleaved with local or windowed attention. In this work, we show how to effectively leverage vanilla, non-hierarchical ViTs for segmentation of large-scale automotive lidar scenes. We bridge the performance gap thanks to a carefully designed tokenizer, a lightweight decoder segmentation head, and tailored data augmentations. Our approach, VaViT (for Vanilla ViT), matches or exceeds the performance of state-of-the-art methods while maintaining the simplicity of the ViT architecture. We provide extensive evaluations on nuScenes, SemanticKITTI, and Waymo Open Dataset to validate the efficiency of our method. Code and models are available at https://github.com/valeoai/VaViT.

2605.31174 2026-06-01 cs.CV cs.LG 版本更新

Detect in Any Scene: An Agentic Framework for Object Detection with Experience-Aware Reasoning

任意场景检测：一种具有经验感知推理的目标检测智能体框架

Wenlun Zhang, Jun Yin, Kentaro Yoshioka

发表机构 * Keio University（庆应义塾大学）； Tsinghua University（清华大学）

AI总结提出DetAS/DetAS-X智能体框架，利用多模态大语言模型自适应组合恢复模块和专用检测器，通过自进化经验积累实现经验感知推理，在六个基准上平均F1提升28.36%。

详情

AI中文摘要

现实场景中的目标检测由于图像退化多样和物体分布异质而仍然具有挑战性，这显著阻碍了现有检测器的泛化。传统方法，包括场景特定表示学习和端到端流水线设计，本质上受限于对预定义条件的依赖，缺乏对动态环境的适应性。本文提出DetAS，一种将目标检测表述为动态决策过程的智能体检测框架。DetAS不依赖静态流水线，而是利用多模态大语言模型（MLLM）作为中央智能体，通过从恢复模块和专用检测器的工具箱中选择来自适应地组合检测工作流。具体来说，DetAS包含两个关键组件：自适应图像恢复，动态决定是否以及如何增强图像以进行下游检测；以及多专家检测，集成多个领域专用检测器并通过实例级推理解决它们的预测。为了在细粒度条件下进一步提高决策质量，我们引入了自进化经验积累，并将框架扩展到DetAS-X，该框架从少量标注数据中积累节点级决策经验，并在推理过程中实现经验感知推理。这种机制使系统能够逐步优化其决策策略，并适应各种现实场景。在六个具有挑战性的基准上的大量实验表明，DetAS-X显著优于现有的基于MLLM的检测器，在F1分数上平均提高28.36%，在DarkFace上增益高达37.01%。这些结果展示了智能体检测的前景，并为其在复杂动态环境中的应用奠定了坚实基础。

英文摘要

Object detection in real-world scenarios remains challenging due to diverse image degradations and heterogeneous object distributions, which significantly hinder the generalization of existing detectors. Conventional approaches, including scene-specific representation learning and end-to-end pipeline design, are inherently limited by their reliance on predefined conditions and lack adaptability to dynamic environments. In this paper, we propose DetAS, an agentic detection framework that formulates object detection as a dynamic decision process. Instead of relying on static pipelines, DetAS leverages a Multimodal Large Language Model (MLLM) as a central agent to adaptively compose detection workflows by selecting from a toolbox of restoration modules and specialized detectors. Specifically, DetAS consists of two key components: Self-Adaptive Image Restoration, which dynamically determines whether and how to enhance images for downstream detection, and Multi-Expertise Detection, which integrates multiple domain-specialized detectors and resolves their predictions through instance-level reasoning. To further improve decision quality under fine-grained conditions, we introduce Self-Evolving Experience Harvesting and extend the framework to DetAS-X, which accumulates node-level decision experience from a small set of annotated data and enables experience-aware reasoning during inference. This mechanism allows the system to progressively refine its decision policy and adapt to diverse real-world scenarios. Extensive experiments on six challenging benchmarks demonstrate that DetAS-X significantly outperforms existing MLLM-based detectors, achieving an average improvement of 28.36% in F1 score, with up to 37.01% gain on DarkFace. These results demonstrate the promise of agentic detection and establish a solid foundation for its application in complex and dynamic environments.

2605.31153 2026-06-01 cs.CV 版本更新

BIAS-ID: A Framework for Analyzing Transformation Biases in AI-Generated Image Detectors

BIAS-ID: 分析AI生成图像检测器中变换偏差的框架

Jonas Ricker, Asja Fischer, Erwin Quiring

发表机构 * Ruhr University Bochum（波鸿鲁尔大学）； _fbeta Berlin, Germany（德国柏林_fbeta）

AI总结本文提出BIAS-ID框架，用于分析和量化AI生成图像检测器中的变换偏差，并通过实验揭示多种先进检测方法受偏差影响严重。

详情

AI中文摘要

鉴于网络上有害AI生成图像的激增，可靠地区分真实图像与生成图像已成为一个紧迫的研究课题。虽然许多提出的检测方法在受控设置下表现良好，但在真实世界数据上测试时常常失效。一个潜在的根本原因是检测器训练数据中的细微偏差。因此，检测器可能依赖虚假相关性而非学习真正的取证痕迹。虽然最近的工作已经识别出这个问题，但尚未建立评估检测器实际偏差程度的既定协议。因此，在本文中，我们退一步：首先，我们讨论检测器存在偏差意味着什么，以及这与缺乏鲁棒性有何不同。其次，我们提出BIAS-ID，一个用于分析和量化AI生成图像检测器中变换偏差的透明框架。我们通过对两个数据集上的六个检测器进行评估来验证我们的框架，揭示了几种最先进的检测方法受到偏差的强烈影响。我们的结果强调了偏差感知评估对于开发可靠的AI生成图像检测器的重要性。

英文摘要

Given the surge of harmful AI-generated imagery online, reliably distinguishing authentic images from generated ones has become an urgent research topic. While many proposed detection methods perform well under controlled settings, they often collapse when tested on real-world data. A potential root cause is subtle biases in the detectors' training data. As a result, detectors may rely on spurious correlations instead of learning true forensic artifacts. While a recent line of work has identified the problem, there is not yet an established protocol to evaluate how biased a detector actually is. In this work, we therefore take a step back: First, we discuss what it means for a detector to be biased, and how this differs from a lack of robustness. Second, we propose BIAS-ID, a transparent framework for analyzing and quantifying the presence of transformation biases in AI-generated image detectors. We validate our framework by performing an evaluation of six detectors across two datasets, revealing that several state-of-the-art detection methods are strongly affected by biases. Our results highlight the importance of bias-aware evaluation for developing reliable AI-generated image detectors.

2605.31148 2026-06-01 cs.CV cs.AI cs.CL 版本更新

SpatialAct: Probing Spatial Reasoning-to-Action Capabilities of VLM Agents in 3D Scenes

SpatialAct：探测VLM智能体在3D场景中的空间推理到行动能力

Tianhui Liu, Jie Feng, Zhiheng Zheng, Shengyuan Wang, Yiming Guo, Yanxin Xi, Hangyu Fan, Yong Li, Pan Hui

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Zhongguancun Academy（中关村学院）； Tsinghua University（清华大学）； Helsinki University（赫尔辛基大学）

AI总结本文提出SpatialAct基准，通过多轮交互细化、单步错误检测与修复等任务，揭示当前视觉语言模型在3D场景中从空间推理到行动存在显著差距。

详情

AI中文摘要

人类能够在日常3D环境中轻松感知空间布局、形成认知表征、推理空间关系，并将这种推理转化为行动。尽管最近的视觉语言模型（VLM）在基于观测的空间感知和推理任务上表现出色，但它们是否能够构建连贯的空间理解、据此行动并通过多轮反馈优化行动仍不清楚。为研究这一问题，我们引入了SpatialAct，一个基于模拟器的基准，用于探测3D场景中的行动条件空间推理。从最具挑战性的设置——多轮交互细化开始，我们进一步设计了其分解版本——单步错误检测与修复，以及五个基础空间能力任务，以诊断模型失败的潜在原因。实验揭示了明显的推理到行动差距：当前VLM在孤立的空间推理任务上表现良好，但在多轮反馈中难以维持连贯的空间信念并产生可靠行动，显著不如人类。这些结果表明，即使抽象掉了低级控制，当前VLM智能体在行动引起的环境变化下仍缺乏稳健的空间状态跟踪能力。

英文摘要

Humans can effortlessly perceive spatial layouts, form cognitive representations, reason about spatial relations, and translate such reasoning into actions in everyday 3D environments. Although recent vision-language models (VLMs) have shown promising performance on observation-conditioned spatial perception and reasoning tasks, it remains unclear whether they can build coherent spatial understanding, act upon it, and refine their actions through multi-turn feedback. To study this problem, we introduce SpatialAct, a simulator-grounded benchmark for probing action-conditioned spatial reasoning in 3D scenes. Starting from the most challenging setting, Multi-turn Interactive Refinement, we further design its decomposed counterpart, Single-step Error Detection and Fix, together with five fundamental spatial ability tasks to diagnose the underlying causes of model failures. Experiments reveal a clear reasoning-to-action gap: current VLMs can perform well on isolated spatial reasoning tasks, but struggle to maintain coherent spatial beliefs and produce reliable actions during multi-turn feedback, substantially underperforming humans. These results suggest that current VLM agents still lack robust spatial state tracking under action-induced environment changes, even when low-level control is abstracted away.

2605.31145 2026-06-01 cs.CV cs.AI cs.LG 版本更新

FOCUS: Forcing In-Context Object Localization through Visual Support Constraints and Policy Optimization

FOCUS: 通过视觉支持约束和策略优化强制上下文目标定位

Mohammed Asad Karim, Vinay Kumar Verma

Affiliations: Amazon, Seattle, USA

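AI summary: Proposes a two-stage training framework that optimizes in-context attention between support boxes and query images and adds GRPO-based reinforcement learning, achieving category-agnostic in-context object localization without category supervision; a 7B model outperforms models of up to 72B parameters.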

Comments Accepted at ICML 2026. * Equal Contributions

AI中文摘要

上下文定位（ICL）旨在通过查询图像中的少量支持示例定位目标对象，无需训练或参数更新即可即时操作。尽管视觉语言模型（VLM）快速发展，实现类别无关且基于视觉的ICL仍然是一个未解决的问题，尽管它对图像编辑、个性化视觉搜索和检索等应用至关重要。现有方法脆弱且依赖显式类别监督，这不仅限制了在具有未命名或实例特定对象的现实场景中的适用性，还引入了类别偏差，使预测偏向语义先验而非视觉证据。我们提出一个两阶段训练框架，在无类别监督的情况下显式优化支持边界框与查询图像之间的上下文注意力。我们进一步通过使用组相对策略优化（GRPO）的强化学习来细化定位，直接最小化定位误差。这种公式强制视觉对应优于语义先验，产生鲁棒的实例级定位。实验表明，使用我们的目标训练的7B参数模型优于高达72B参数的模型，证明了上下文感知定位目标可以超越单纯扩展规模。全面的消融实验验证了每个组件的贡献。

英文摘要

In-context localization (ICL) seeks to localize a target object specified by a small set of support examples in a query image, operating on the fly without training or parameter updates. Despite rapid advances in vision-language models (VLMs), achieving category-agnostic and visually grounded ICL remains an open problem, even though it is essential for applications such as image editing, personalized visual search, and retrieval. Existing methods are fragile and rely on explicit category supervision, which not only limits applicability in realistic settings with unnamed or instance-specific objects but also introduces category bias that steers predictions toward semantic priors rather than visual evidence. We introduce a two-stage training framework that explicitly optimizes in-context attention between support bounding boxes and query images without category supervision. We further refine localization via reinforcement learning using Group Relative Policy Optimization (GRPO) to directly minimize localization error. This formulation enforces visual correspondence over semantic priors, yielding robust instance-level localization. Empirically, a 7B-parameter model trained with our objectives outperforms models up to 72B parameters, demonstrating that context-aware localization objectives can surpass scaling alone. Comprehensive ablations validate the contribution of each component.
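As a rough illustration of the GRPO refinement step described in the abstract, the sketch below scores a group of sampled boxes against a reference box and converts the rewards into group-relative advantages. It is a minimal sketch under stated assumptions, not the authors' implementation: the choice of IoU as the reward and the helper names `iou` and `group_relative_advantages` are hypothetical.

```python
from statistics import mean, pstdev


def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the mean and std of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed setup: one query, three boxes sampled from the policy.
target = (10, 10, 50, 50)
sampled = [(10, 10, 50, 50), (20, 20, 60, 60), (100, 100, 120, 120)]
rewards = [iou(box, target) for box in sampled]  # assumption: reward = IoU
advantages = group_relative_advantages(rewards)
```

Boxes that overlap the reference more than the group average receive positive advantages, so a policy-gradient update would push the model toward them without needing a separate value network.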

2605.31137 2026-06-01 cs.CV 版本更新

PolSAR Image Classification using a Hybrid Complex-Valued Network (HybridCVNet)

使用混合复数网络（HybridCVNet）进行PolSAR图像分类

Mohammed Q. Alkhatib

Affiliations: IEEE

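AI summary: Proposes HybridCVNet, a hybrid complex-valued network combining CV-CNN and CV-ViT that improves PolSAR image classification by extracting complementary information and exploiting interdependencies within the data, reaching 97.39% overall accuracy on the Flevoland dataset and a Kappa value of 0.972 on the San Francisco dataset.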

Comments Accepted and Published in IEEE Geoscience and Remote Sensing Letters (GRSL)

DOI: 10.1109/LGRS.2024.3468190

AI中文摘要

近年来，卷积神经网络（CNN）因其在计算机视觉任务中的有效性而成为图像分类的热门方法。现在，研究人员正在探索视觉Transformer（ViT）在遥感和地球观测中的潜力。然而，传统的实值网络常常忽略复数（CV）数据（如极化合成孔径雷达（PolSAR）数据）中重要的相位信息。为了解决这个问题，出现了新的CV深度架构。HybridCVNet是一种新颖的混合网络，融合了CV-CNN和CV视觉Transformer（CV-ViT）技术。它有效地结合了CV 3D和2D CNN作为特征提取器，通过提取互补信息并有效利用数据内部的相互依赖关系，增强了PolSAR图像分类。来自广泛使用的PolSAR数据集的实验结果表明，HybridCVNet优于其他方法，在Flevoland数据集上实现了97.39%的总体精度，并且在仅1%采样率下也显示出潜力，在旧金山数据集上Kappa值为0.972。源代码可通过https://github.com/mqalkhatib/HybridCVNet获取。

英文摘要

Recently, convolutional neural networks (CNNs) have become popular for image classification due to their effectiveness in computer vision tasks. Now, researchers are exploring the potential of vision transformers (ViTs) in remote sensing and Earth observation. However, traditional Real-Valued networks often overlook important phase information in Complex-Valued (CV) data like polarimetric synthetic aperture radar (PolSAR) data. To address this, new CV deep architectures have emerged. HybridCVNet, a novel hybrid network, blends CV-CNN and CV vision transformer (CV-ViT) techniques. It efficiently combines CV 3D and 2D CNNs as feature extractors, enhancing PolSAR image classification by extracting complementary information and effectively leveraging interdependencies within the data. Experimental results from widely-used PolSAR datasets show HybridCVNet outperforms other methods, achieving an overall accuracy of 97.39% on the Flevoland dataset and showing promise even with just a 1% sampling ratio, with a Kappa value of 0.972 on the San Francisco dataset. Source code is accessible through https://github.com/mqalkhatib/HybridCVNet
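The two metrics quoted in the abstract, overall accuracy and the Kappa coefficient, can both be derived from a confusion matrix. The sketch below shows the standard definitions; the 2x2 matrix is made-up illustrative data, not a result from the paper.

```python
def overall_accuracy(cm):
    """Fraction of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def cohen_kappa(cm):
    """Cohen's kappa: agreement corrected for chance agreement."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    observed = sum(cm[i][i] for i in range(n)) / total
    row_sums = [sum(cm[i]) for i in range(n)]
    col_sums = [sum(cm[j][i] for j in range(n)) for i in range(n)]
    expected = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2
    return (observed - expected) / (1 - expected)


# Hypothetical confusion matrix (rows: true class, columns: predicted class).
cm = [[45, 5],
      [10, 40]]
oa = overall_accuracy(cm)
kappa = cohen_kappa(cm)
```

Kappa is lower than overall accuracy whenever some agreement would be expected by chance, which is why remote sensing papers commonly report both.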

2605.31124 2026-06-01 cs.CV 版本更新

QVGGT: Post-Training Quantized Visual Geometry Grounded Transformer

QVGGT: 训练后量化的视觉几何基础Transformer

Zhizhen Pan, Hesong Wang, Huan Wang

Affiliations: Westlake University; Beijing University of Posts and Telecommunications

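AI summary: To address VGGT's large parameter count and limited deployability, proposes the QVGGT quantization framework, which uses selective mixed precision, token filtering, and task-aware scale search to achieve near-lossless W4A16 quantization, significantly reducing memory and accelerating inference.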

Comments Accepted by CVPR 2026. Project page: https://ddsacu.github.io/QVGGT/

AI中文摘要

直接从图像估计3D属性的技术随着视觉几何基础Transformer（VGGT）的提出而迅速发展，该模型能够在前向传播中一次性预测相机参数、深度图和点云。然而，其12亿参数规模严重限制了在无人机和移动AR设备等资源受限平台上的部署。为解决这一限制，我们引入了QVGGT，一个专门为压缩VGGT而设计的量化框架。我们的方法基于以下观察：VGGT内的Transformer块对量化表现出异质性敏感度。因此，我们分析了逐块量化敏感度，并提出了一种选择性混合精度策略，为最脆弱的Transformer块分配更高精度。为了解决由高方差相机和注册令牌引起的量化误差放大问题，我们进一步引入了带相机信息补偿的令牌过滤，从激活校准中移除这些异常值，并使用PCA导出的全局补偿令牌恢复其几何线索。最后，我们开发了一种任务感知尺度搜索机制，不仅通过层重建，还通过多头监督以及相机姿态、深度图和点图之间的跨头几何一致性来评估候选量化尺度。在多个几何感知基准上的大量实验表明，QVGGT实现了近乎无损的W4A16量化，在保持所有3D预测头精度的同时，相比FP32实现了3~4.9倍的内存减少和高达2.8倍的硬件实际加速。我们的方法使得在边缘设备上实现高保真3D感知成为可能，从而在现实世界的受限环境中实现前馈3D重建模型的实际部署。

英文摘要

Estimating 3D attributes directly from images has advanced rapidly with the Visual Geometry Grounded Transformer (VGGT), which predicts camera parameters, depth maps, and point clouds in a single forward pass. However, its 1.2B-parameter scale severely limits deployment on resource-constrained platforms such as UAVs and mobile AR devices. To address this limitation, we introduce QVGGT, a tailored quantization framework designed to compress VGGT. Our approach starts from the observation that transformer blocks within VGGT exhibit heterogeneous sensitivity to quantization. We thus analyze per-block quantization sensitivity and propose a selective mixed-precision strategy that allocates higher precision to the most fragile transformer blocks. To address the amplification of quantization error caused by high-variance camera and register tokens, we further introduce token filtering with camera information compensation, which removes these outliers from activation calibration and restores their geometric cues using a PCA-derived global compensation token. Finally, we develop a task-aware scale search mechanism that evaluates candidate quantization scales not only through layer reconstruction but also through multi-head supervision and cross-head geometric consistency among camera poses, depth maps, and point maps. Extensive experiments on multiple geometry perception benchmarks demonstrate that QVGGT achieves near-lossless W4A16 quantization, preserving the accuracy of all 3D prediction heads while delivering 3x to 4.9x memory reduction and up to 2.8x real hardware speedup over FP32. Our approach makes high-fidelity 3D perception feasible on edge devices, enabling practical deployment of feed-forward 3D reconstruction models in real-world constrained environments.
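To make the idea of a quantization scale search concrete, the sketch below quantizes a weight vector to signed 4-bit codes and picks the scale with the lowest reconstruction error. It is a simplified sketch: it covers only the layer-reconstruction criterion, not the multi-head supervision or cross-head consistency terms the abstract describes, and the candidate clipping ratios and function names are assumptions.

```python
def quantize_int4(weights, scale):
    """Round weights to signed 4-bit integer codes in [-8, 7]."""
    return [max(-8, min(7, round(w / scale))) for w in weights]


def dequantize(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def reconstruction_error(weights, scale):
    """Mean squared error between weights and their quantized version."""
    recon = dequantize(quantize_int4(weights, scale), scale)
    return sum((w - r) ** 2 for w, r in zip(weights, recon)) / len(weights)


def search_scale(weights, ratios=(1.0, 0.9, 0.8, 0.7, 0.6, 0.5)):
    """Try several clipping ratios and keep the scale with the lowest error."""
    max_abs = max(abs(w) for w in weights) or 1.0  # guard against all zeros
    candidates = [ratio * max_abs / 7 for ratio in ratios]
    return min(candidates, key=lambda s: reconstruction_error(weights, s))


# Hypothetical weight row with one outlier (1.50).
weights = [0.02, -0.05, 0.11, -0.30, 0.07, 1.50]
best_scale = search_scale(weights)
codes = quantize_int4(weights, best_scale)
```

A task-aware search in the spirit of the abstract would replace `reconstruction_error` with a score computed on the model's downstream 3D outputs; that part is not sketched here.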

2605.31116 2026-06-01 cs.CV cs.RO 版本更新

NTR: Neural Token Reconstruction for Scene Token Bottleneck in End-to-End Driving

NTR：端到端驾驶中场景令牌瓶颈的神经令牌重建

Jiahui Li, Jiawei Sun, Zixiang Ren, Ming Liu, Jiamin Shi, Ruiteng Zhao, Zhiyang Liu, Liying Liu, Zuoguan Wang, Kaidi Yang

Affiliations: National University of Singapore; Black Sesame Technologies

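AI summary: To address the lack of visual supervision on the scene-token bottleneck in end-to-end driving, proposes the Neural Token Reconstruction (NTR) framework, which uses self-distillation masked latent reconstruction to constrain scene tokens to retain richer visual representations, achieving state-of-the-art driving performance.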

AI中文摘要

最近的无感知端到端自动驾驶方法通过将密集的图像块令牌压缩为紧凑的场景令牌，用于下游轨迹生成和评分，从而绕过了显式的感知输出。虽然这些场景令牌为规划器形成了紧凑的视觉瓶颈，但它们仅从规划目标接收监督，对编码的视觉信息提供了有限的约束。为了解决这一限制，我们引入了神经令牌重建（NTR），一种表示学习框架，直接约束无感知驾驶中的紧凑场景令牌瓶颈。NTR引入了一种自蒸馏掩码潜在重建目标，该目标仅使用紧凑的场景令牌作为重建记忆来重建被掩码的块级潜在特征。这迫使重建梯度仅通过场景令牌瓶颈传递，鼓励场景令牌为规划保留更丰富且更少冗余的视觉表示。我们进一步引入了来自基础模型注释的语义先验，作为弱语义接口，将重建目标偏向于驾驶相关结构，而不引入显式的感知头。所有辅助重建组件在推理时被移除，部署的规划器保持不变。NTR在三个公共自动驾驶基准测试中实现了最先进的性能，包括Waymo E2E上的8.0461 RFS以及NavSim1&2上的94.1 PDMS / 90.9 EPDMS。学习到的场景令牌表现出更低的成对冗余和更高的有效秩，表明有效的瓶颈监督同时改善了紧凑视觉表示学习和规划性能。

英文摘要

Recent perception-free end-to-end (E2E) autonomous driving methods bypass explicit perception outputs by compressing dense image patch tokens into compact scene tokens for downstream trajectory generation and scoring. While these scene tokens form a compact visual bottleneck for the planner, they receive supervision solely from the planning objective, providing limited constraints on the encoded visual information. To address this limitation, we introduce Neural Token Reconstruction (NTR), a representation learning framework to directly constrain the compact scene-token bottleneck in perception-free driving. NTR introduces a self-distillation masked latent reconstruction objective that reconstructs masked patch-level latent features using only compact scene tokens as reconstruction memory. This forces reconstruction gradients to pass exclusively through the scene-token bottleneck, encouraging scene tokens to preserve richer and less redundant visual representations for planning. We further introduce semantic priors derived from foundation-model annotations as a weak semantic interface biasing reconstruction targets toward driving-related structures without introducing explicit perception heads. All auxiliary reconstruction components are removed at inference time, leaving the deployed planner unchanged. NTR achieves state-of-the-art performance on three public autonomous driving benchmarks, including 8.0461 RFS on Waymo E2E and 94.1 PDMS / 90.9 EPDMS on NavSim1&2. The learned scene tokens exhibit lower pairwise redundancy and higher effective rank, indicating that effective bottleneck supervision improves both compact visual representation learning and planning performance.
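The abstract reports that the learned scene tokens show lower pairwise redundancy. One plausible way to measure this, assumed here because the abstract does not give the exact definition, is the mean absolute cosine similarity over all distinct token pairs:

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pairwise_redundancy(tokens):
    """Mean |cosine similarity| over all distinct token pairs.

    Assumed metric for illustration; 0 means mutually orthogonal tokens,
    1 means every token points in the same (or opposite) direction.
    """
    sims = [
        abs(cosine(tokens[i], tokens[j]))
        for i in range(len(tokens))
        for j in range(i + 1, len(tokens))
    ]
    return sum(sims) / len(sims)


# Hypothetical 2-D scene tokens.
diverse_tokens = [[1.0, 0.0], [0.0, 1.0]]
redundant_tokens = [[1.0, 2.0], [2.0, 4.0]]
low = pairwise_redundancy(diverse_tokens)
high = pairwise_redundancy(redundant_tokens)
```

Under this definition, a bottleneck whose tokens carry distinct information scores close to 0, while tokens that duplicate each other score close to 1.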

2605.31115 2026-06-01 cs.CV 版本更新

Polyphony: Diffusion-based Dual-Hand Action Segmentation with Alternating Vision Transformer and Semantic Conditioning

Polyphony: 基于扩散的双手动作分割，采用交替视觉Transformer和语义条件

Hao Zheng, Hu Wang, Tiantian Zheng, Prajjwal Bhattarai, Tuka Alhanai

Affiliations: New York University Abu Dhabi; Mohamed bin Zayed University of Artificial Intelligence

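AI summary: Proposes Polyphony, a three-stage method that addresses inter-hand dependencies, visual asymmetry, and semantic ambiguity in dual-hand action segmentation through alternating training of a dual-hand vision transformer, semantic feature conditioning, and diffusion-based segmentation, achieving state-of-the-art performance on multiple datasets.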

Comments CVPR 2026

AI中文摘要

双手动作分割是从未修剪视频中密集预测双手动作，对于理解复杂的双手活动至关重要。然而，它带来了几个独特的挑战：复杂的手间依赖、双手之间的视觉不对称、主导手垄断梯度的表示冲突以及细粒度动作中的语义模糊性。我们提出了Polyphony，一种三阶段方法，通过以下方式应对这些挑战：(1) 交替双手视觉Transformer，在左右手小批量之间交替训练，以确保双手的梯度贡献平衡，同时共享时空编码器；(2) 语义特征条件化，将视觉特征与结构化的、组合式的动作描述对齐，以增强语义相似动作的区分度；(3) 基于扩散的分割，结合跨手特征融合以实现手间协调，以及自适应损失加权以平衡性能。Polyphony在双手数据集（HA-ViD、ATTACH）上达到了最先进水平，改进高达16.8个百分点，并在单流Breakfast数据集（82.5%）上超越了之前使用12倍大骨干网络的最佳方法。值得注意的是，我们的统一模型使用单个共享骨干网络，超越了需要单独每手模型的基线方法。代码位于https://github.com/x-labs-xyz/Polyphony-Dual-hand-Action-Segmentation。

英文摘要

Dual-hand action segmentation, densely predicting actions for both hands from untrimmed videos, is essential for understanding complex bimanual activities. However, it poses several unique challenges: complex inter-hand dependencies, visual asymmetry between hands, representation conflicts where the dominant hand monopolizes gradients, and semantic ambiguity in fine-grained actions. We propose Polyphony, a three-stage method to address these challenges through: (1) an Alternating Dual-Hand Vision Transformer that alternates training between left- and right-hand mini-batches to ensure balanced gradient contributions from both hands while sharing a spatio-temporal encoder; (2) Semantic Feature Conditioning that aligns visual features with structured, compositional action descriptions to enhance discrimination of semantically similar actions; and (3) Diffusion-Based Segmentation with cross-hand feature fusion for inter-hand coordination and adaptive loss weighting for balancing performance. Polyphony achieves state-of-the-art on both dual-hand datasets (HA-ViD, ATTACH) with improvements up to 16.8 points, and on the single-stream Breakfast dataset (82.5%), outperforming the prior best method that uses a 12x larger backbone. Notably, our unified model with a single shared backbone surpasses baselines requiring separate per-hand models. Code is at https://github.com/x-labs-xyz/Polyphony-Dual-hand-Action-Segmentation.

2605.31108 2026-06-01 cs.CV cs.LG 版本更新

面向多模态智能体的任务聚焦记忆

Tao Zou, Yichen He, Tian Qiu, Yuan Lin, Hang Li

Affiliations: Fudan University

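AI summary: Proposes TaskMem, a reinforcement-learning-based framework for learning task-focused memorization policies; its two-phase training lets multimodal agents dynamically select task-relevant memories from streaming observations, improving VQA accuracy by 5.3% to 7.0% on three streaming benchmarks.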

AI中文摘要

长期记忆对于多模态智能体构建连贯经验、积累世界知识和实现持续学习至关重要。然而，构建有效记忆不仅涉及记忆模块设计和准确性、保真度等基本要求，关键挑战在于决定记忆什么。多模态智能体（如具身智能体）在真实或虚拟环境中持续感知、推理和行动，接收无界的多模态观测流。面对这种信息组合爆炸，智能体必须选择性地保留与其环境角色相关且对未来任务有价值的内容。为弥合这一差距，我们将记忆生成建模为可学习的记忆策略，并引入TaskMem（任务聚焦记忆策略学习），一种基于强化学习的框架，使策略能够动态调整其关注点以适应环境中遇到的实际任务需求。TaskMem采用两阶段训练范式：第一阶段在基本保真度要求下优化记忆质量，学习如何记忆；第二阶段在部署后进行，智能体通过在其基础MLLM上调整适配器来学习记忆什么，利用近期环境任务定义奖励模型，引导记忆策略聚焦于任务相关的内容。为评估我们的方法，我们将VideoMME、EgoLife和EgoTempo重新构建为流式基准，模拟智能体处理流式观测并处理在线到达任务的真实场景。为隔离记忆评估，问题必须仅使用智能体的记忆回答，而不访问原始视频。基于Qwen3-VL-30B-A3B，TaskMem在这些基准上分别将VQA准确率提高了6.3%、7.0%和5.3%。

英文摘要

Long-term memory is essential for multimodal agents to build coherent experience, accumulate world knowledge, and achieve continual learning. However, constructing effective memory goes beyond memory module design and basic requirements such as accuracy and fidelity; the key challenge lies in determining what to memorize. Multimodal agents, such as embodied agents, continuously perceive, reason, and act in real or virtual environments, receiving an unbounded stream of multimodal observations. From this combinatorial explosion of information, an agent must selectively retain content that is relevant to its role in the environment and valuable for future tasks. To bridge this gap, we frame memory generation as a learnable memorization policy and introduce TaskMem (Task-focused Memorization Policy Learning), a reinforcement-learning-based framework that enables the policy to dynamically adjust its focus to the demands of real tasks encountered in the environment. TaskMem adopts a two-phase training paradigm: Phase One learns how to memorize by optimizing memory quality under fundamental fidelity requirements; Phase Two occurs after deployment, where the agent learns what to memorize by tuning an adapter on its base MLLM, using recent environment tasks to define a reward model that guides the memorization policy toward task-relevant content. To evaluate our approach, we reformulate VideoMME, EgoLife, and EgoTempo into streaming benchmarks that simulate a realistic setting in which an agent processes streaming observations and handles tasks arriving online. To isolate memory assessment, the questions must be answered using only the agent's memory, without access to raw video. Built on Qwen3-VL-30B-A3B, TaskMem improves VQA accuracy by 6.3%, 7.0%, and 5.3% on these benchmarks, respectively.

2605.31069 2026-06-01 cs.CV cs.CL 版本更新

Towards Effective Long-Video Event Prediction via Multi-Level Event Semantics Mining

面向有效长视频事件预测的多级事件语义挖掘

Bo Peng, YuanJie Lyu, PengGang Qin, Tong Xu

Affiliations: University of Science and Technology of China

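AI summary: Proposes the VISTA framework, which performs long-video event prediction through multi-level event semantics mining (detail level, event level, and future level), addressing the inability of existing models to precisely extract event details and perform fine-grained analysis.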

DOI: 10.1007/978-981-95-6950-2

AI中文摘要

准确预测未来事件是内容理解和决策制定的基础，涉及多个领域。先前研究主要关注文本或短视频场景，而长视频事件预测具有多模态上下文丰富和叙事复杂的特点，尚未得到充分探索。同时，基于大语言模型和视觉语言模型构建的近期长视频语言模型在长视频问答和摘要方面表现出潜力，但难以泛化到事件预测，因为它们既不能精确提取事件相关细节，也无法对事件发展进行细粒度分析。为弥补这一差距，我们提出VISTA，一个用于长视频事件预测的多级事件语义挖掘框架。首先，VISTA应用以角色为中心的视觉提示精确提取事件相关视觉细节，增强细节级语义；其次，采用知识增强的迭代检索策略，引导大语言模型逐步构建逻辑连贯的事件链，从而改善事件级叙事；最后，VISTA采用类人的先提议后检索策略生成多样化的面向未来的提议并整合多级线索，产生稳健准确的预测。在真实数据集上的大量实验验证了VISTA在长视频事件预测中的有效性。

英文摘要

Accurately predicting future events is fundamental to content understanding and decision-making across various domains. While prior research has primarily focused on text or short-video scenarios, long-video event prediction, characterized by vast multimodal context and more complex narratives, remains underexplored. Meanwhile, although recent Long-Video Language Models (LVLMs), built on Large Language Models (LLMs) and Vision-Language Models (VLMs), have shown promise in long-video question answering and summarization, they struggle to generalize to event prediction, as they can neither precisely extract event-related details nor perform fine-grained analysis of event development. To address this gap, we propose VISTA, a multi-level event semantics mining framework for long-video event prediction. Initially, VISTA applies a character-centric visual prompt to precisely extract event-related visual details, enhancing detail-level semantics; subsequently, it employs a knowledge-enhanced iterative retrieval strategy, guiding the LLM to progressively construct logically coherent event chains, thereby improving event-level narratives; ultimately, VISTA adopts a human-like propose-then-retrieve strategy to generate diverse future-oriented proposals and integrate multi-level clues, producing robust and accurate predictions. Extensive experiments on real-world datasets validate the effectiveness of VISTA for long-video event prediction.

2605.31068 2026-06-01 cs.CV 版本更新

HQ-JEPA: Hybrid Quantum Joint-Embedding Predictive Architecture for Cross-Modal Remote Sensing Representation Learning

HQ-JEPA: 用于跨模态遥感表示学习的混合量子联合嵌入预测架构

Md Aminur Hossain, Ayush V. Patel, Sanjay K. Singh, Biplab Banerjee

Affiliations: Space Applications Centre, Indian Space Research Organisation; Centre of Studies in Resources Engineering, Indian Institute of Technology Bombay

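AI summary: Proposes HQ-JEPA, a hybrid quantum-classical architecture that learns semantic representations from Sentinel-1/2 imagery via joint-embedding prediction, cross-modal alignment, SIGReg Gaussian regularization, and a quantum fidelity loss, achieving competitive and often superior performance over strong baselines on GeoBench classification and segmentation tasks.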

Comments 19 pages

AI中文摘要

我们提出了HQ-JEPA，一种用于跨模态遥感表示学习的混合量子-经典联合嵌入预测架构。该框架将JEPA风格的掩码潜在预测扩展到配对的Sentinel-1和Sentinel-2图像，通过从可见上下文区域预测掩码目标表示，同时在共享嵌入空间中对齐异构模态特征。为了提高表示质量，HQ-JEPA结合了四个互补目标：潜在令牌预测、跨模态令牌对齐、融合潜在空间中基于SIGReg的高斯正则化，以及基于可微SWAP测试的保真度量子相似性（FQS）损失。与像素重建方法不同，HQ-JEPA直接在潜在空间中学习语义表示，并使用基于量子态重叠的相似性作为额外的正则化信号。我们在线性探测和微调设置下，在GeoBench分类和分割任务上评估了预训练编码器。结果表明，HQ-JEPA在强自监督和遥感基础模型基线上取得了具有竞争力且通常更优的性能，证明了将预测性自监督、跨模态几何正则化和基于量子保真度的表示学习相结合对遥感应用的好处。

英文摘要

We introduce HQ-JEPA, a hybrid quantum-classical joint-embedding predictive architecture for cross-modal remote sensing representation learning. The proposed framework extends JEPA-style masked latent prediction to paired Sentinel-1 and Sentinel-2 imagery by predicting masked target representations from visible context regions while aligning heterogeneous modality features in a shared embedding space. To improve representation quality, HQ-JEPA combines four complementary objectives: latent token prediction, cross-modal token alignment, SIGReg-based Gaussian regularization in the fused latent space, and a differentiable SWAP-test-based Fidelity Quantum Similarity (FQS) loss. Unlike pixel reconstruction methods, HQ-JEPA learns semantic representations directly in latent space and uses quantum state-overlap-based similarity as an additional regularization signal. We evaluate the pretrained encoder on GeoBench classification and segmentation tasks under linear probing and fine-tuning settings. Results show that HQ-JEPA achieves competitive and often superior performance over strong self-supervised and remote sensing foundation-model baselines, demonstrating the benefit of integrating predictive self-supervision, cross-modal geometric regularization, and quantum fidelity-based representation learning for remote sensing applications.
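The SWAP test mentioned in the abstract estimates the overlap of two quantum states: the ancilla is measured as 0 with probability 1/2 + 1/2 |<a|b>|^2. The sketch below evaluates that probability classically for small amplitude vectors and turns it into a similarity loss. It is a sketch only: the loss form `1 - fidelity` and the function names are assumptions, and no quantum circuit is simulated.

```python
import math


def normalize(vec):
    """Scale an amplitude vector to unit norm."""
    norm = math.sqrt(sum(abs(complex(x)) ** 2 for x in vec))
    return [complex(x) / norm for x in vec]


def swap_test_p0(a, b):
    """Probability of measuring the ancilla in |0> in an ideal SWAP test."""
    a, b = normalize(a), normalize(b)
    overlap = sum(x.conjugate() * y for x, y in zip(a, b))
    return 0.5 + 0.5 * abs(overlap) ** 2


def fidelity_loss(a, b):
    """Assumed loss form: 1 - |<a|b>|^2, recovered from the SWAP-test probability."""
    fidelity = 2.0 * swap_test_p0(a, b) - 1.0
    return 1.0 - fidelity


# Hypothetical single-qubit embeddings.
same = fidelity_loss([1, 0], [1, 0])
orthogonal = fidelity_loss([1, 0], [0, 1])
```

Identical states give a loss of 0 and orthogonal states give a loss of 1, so minimizing this quantity pulls paired embeddings toward the same state.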

2605.31057 2026-06-01 cs.CV cs.LG 版本更新

LVSA: Training-Free Sparse Attention for Long Video Diffusion

LVSA：长视频扩散的无训练稀疏注意力

Gael Glorian, Ioannis Lamprou, Zhen Zhang, Yujie Yuan, Hongsheng Liu

Affiliations: Distributed Parallel Technology Laboratory, Paris Research Center, Huawei Technologies France; AI Framework and Data Technology Lab, Huawei Technologies Co., Ltd.

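AI summary: Proposes LVSA, a training-free, model-agnostic block-sparse attention method that combines a structured window pattern with rotating global anchors, lowering the compute cost of long-video diffusion inference while removing fixed-grid bias and supporting video generation beyond the training horizon.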

Comments 10 pages, 5 figures, 4 tables. Code: https://github.com/JiusiServe/LongVideoSparseAttention

AI中文摘要

密集自注意力是长视频扩散推理的计算和质量的瓶颈：成本随序列长度二次增长，且超出训练时域时模型收敛到近乎静态的输出，即“冻结”的重复视频。最先进的方法要么成本过高（例如需要重新训练），要么无法以可扩展的方式同时满足性能和质量目标。为此，我们提出长视频稀疏注意力（LVSA），一种无需训练、模型无关的块稀疏注意力方法，用于视频扩散Transformer，它结合了结构化窗口模式与旋转全局锚点，从而消除了导致长时域伪影的固定网格偏差。LVSA结合FlashInfer内核，与密集注意力相比，在Wan 2.1 1.3B上以6倍时域减少计算量达3.17倍，在Wan 2.1 14B上以6倍时域减少2.98倍，在HunyuanVideo 1.5上以1.5倍时域减少3.33倍。除了减少计算量，LVSA还使得HunyuanVideo 1.5能够在2倍时域下生成，否则在单个GPU上会内存不足。此外，与RIFLEx相比，LVSA在Wan 2.1 1.3B上提供高达2.41倍的加速，与UltraViCo相比提供3.27倍的加速。为了展示跨不同平台的适用性，我们将LVSA应用于NPU，与密集注意力相比，在Wan 2.2 A14B上实现高达2.71倍的加速，在Wan 2.1 1.3B上实现3.24倍的加速。为了公平地评估质量，我们引入了VQeval，一个正确评分循环视频失败的工具，而VBench-Long等最先进评估器则会奖励这类失败。LVSA在训练时域长度下生成时质量中性，在扩展长度下质量积极。

英文摘要

Dense self-attention is the compute and quality bottleneck of long-video diffusion inference: cost grows quadratically with the sequence length, and beyond the training horizon the model converges to near-static output, that is, "frozen" repetitive video. State-of-the-art approaches are either too costly, e.g., they require retraining, or fail to satisfy both performance and quality objectives in a scalable manner. To this end, we introduce Long Video Sparse Attention (LVSA), a training-free, model-agnostic block-sparse attention for video diffusion transformers that combines a structured window pattern with rotating global anchors, thus removing the fixed-grid bias which causes long-range temporal artifacts. LVSA, combined with a FlashInfer kernel, reduces compute up to 3.17x on Wan 2.1 1.3B at a 6x horizon, 2.98x on Wan 2.1 14B at a 6x horizon, and 3.33x on HunyuanVideo 1.5 at a 1.5x horizon, compared to dense attention. Beyond reducing compute, LVSA enables HunyuanVideo 1.5 generation at a 2x horizon, which is otherwise out-of-memory on a single GPU. Moreover, LVSA provides speedups up to 2.41x compared to RIFLEx and 3.27x compared to UltraViCo on Wan 2.1 1.3B. To demonstrate applicability across diverse platforms, we apply LVSA on NPUs and achieve speedups up to 2.71x on Wan 2.2 A14B and 3.24x on Wan 2.1 1.3B compared to dense attention. To evaluate quality in a fair way, we introduce VQeval, a tool that properly scores loopy video failures, which are instead rewarded by state-of-the-art evaluators like VBench-Long. LVSA is quality-neutral for generation at training horizon length and quality-positive at extended lengths.
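The combination of a local window with rotating global anchors can be pictured as a block-level boolean mask. The sketch below is one hypothetical construction, since the abstract does not specify the pattern: each query block attends to neighbors within `window`, plus every `stride`-th key block, with the anchor positions shifted by `offset` (for example, per layer or per denoising step). All parameter names are assumptions.

```python
def block_mask(num_blocks, window, stride, offset):
    """Boolean block mask: True where a query block may attend to a key block."""
    mask = []
    for q in range(num_blocks):
        row = []
        for k in range(num_blocks):
            local = abs(q - k) <= window          # structured window pattern
            anchor = (k - offset) % stride == 0   # global anchor, shifted by offset
            row.append(local or anchor)
        mask.append(row)
    return mask


def density(mask):
    """Fraction of block pairs that are actually computed."""
    total = sum(len(row) for row in mask)
    return sum(sum(row) for row in mask) / total


# Hypothetical sizes: 16 blocks, window of 1, one anchor every 4 blocks.
mask_step0 = block_mask(16, window=1, stride=4, offset=0)
mask_step1 = block_mask(16, window=1, stride=4, offset=1)
sparsity_gain = 1.0 / density(mask_step0)
```

Rotating `offset` means that no key block is permanently excluded from global attention, which is the intuition behind avoiding a fixed-grid bias; how the actual method schedules the rotation is not stated in the abstract.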

2605.31048 2026-06-01 cs.CV 版本更新

Rethinking Efficient Crack Segmentation with Task-Aligned Structural-Directional Modeling

重新思考基于任务对齐的结构-方向性建模的高效裂缝分割

Shipeng Liu, Liang Zhao, Dengfeng Chen, Weihua Zhang

Affiliations: XAUAT (Xi'an University of Architecture and Technology)

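AI summary: Treats crack segmentation as sparse structural recovery and proposes the RIFT model family, which preserves local evidence, aggregates directional continuity, and restores crack structures through lightweight multi-scale fusion, achieving the best or tied-best results across 16 main metrics.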

AI中文摘要

最近的裂缝分割方法通常遵循通用的语义分割设计，使用更强的骨干网络、混合CNN-Transformer-Mamba编码器和辅助增强分支。虽然有效，但这引发了疑问：更强的通用特征混合是否是裂缝分割最合适的方向。相反，我们将裂缝分割表述为稀疏结构恢复。裂缝具有有限的类别级语义，但具有很强的形态规律性，即细、稀疏、各向异性、局部碎片化，且容易与纹理或阴影混淆。因此，关键瓶颈在于保留弱结构证据、恢复方向连续性以及抑制背景耦合。我们提出RIFT，一个紧凑的形态对齐裂缝分割模型家族。RIFT设计简单，而不是压缩复杂的通用架构，它保留局部证据，聚合协作方向连续性，并通过轻量多尺度融合恢复裂缝结构。在四个公共基准上的实验表明，RIFT在16个主要指标上对再现的代表性基线取得了最佳或并列最佳结果。RIFT-B提供了最强的整体精度，而RIFT-T提供了最佳的部署效率，仅0.47M参数和高推理速度。拓扑感知评估、消融实验、迁移实验和可视化进一步验证了，当其归纳偏置与裂缝形态匹配时，任务对齐的简单性可以匹配或超越复杂的混合架构。代码：https://github.com/xauat-liushipeng/RIFT

英文摘要

Recent crack segmentation methods often follow generic semantic segmentation designs, using stronger backbones, hybrid CNN-Transformer-Mamba encoders, and auxiliary enhancement branches. Although effective, this raises the question of whether stronger generic feature mixing is the most suitable direction for crack segmentation. We instead formulate crack segmentation as sparse structural recovery. Cracks have limited category-level semantics but strong morphological regularities, being thin, sparse, anisotropic, locally fragmented, and easily confused with textures or shadows. Thus, the key bottleneck lies in preserving weak structural evidence, recovering directional continuity, and suppressing background coupling. We propose RIFT, a compact family of morphology-aligned crack segmentation models. Rather than compressing a complex generic architecture, RIFT is simple by design, preserving local evidence, aggregating cooperative directional continuity, and restoring crack structures through lightweight multi-scale fusion. Experiments on four public benchmarks show that RIFT achieves the best or tied-best results across the 16 main metrics against reproduced representative baselines. RIFT-B gives the strongest overall accuracy, while RIFT-T provides the best deployment efficiency with only 0.47M parameters and high inference speed. Topology-aware evaluation, ablations, transfer experiments, and visualizations further verify that task-aligned simplicity can match or surpass complex hybrid architectures when its inductive bias fits crack morphology. Code: https://github.com/xauat-liushipeng/RIFT

2605.31041 2026-06-01 cs.CV cs.AI 版本更新

Does Visual Information Play a Decisive Role in Vision-Language-Action Model Driving Behavior?

视觉信息在视觉-语言-动作模型驾驶行为中是否起决定性作用？

Jingtao He, Hongliang Lu, Xiaoyun Qiu, Yixuan Wang, Xinhu Zheng

Affiliations: Intelligent Transportation Thrust, The Hong Kong University of Science and Technology (Guangzhou)

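AI summary: Proposes a structured multi-level visual perturbation framework to systematically analyze how much VLA driving models depend on visual information, revealing that dependency patterns vary with the evaluation protocol and are uneven across abstraction levels.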

AI中文摘要

视觉-语言-动作（VLA）模型在自动驾驶中展现出令人期待的能力，凸显了统一多模态架构联合建模感知与规划的潜力。然而，当前基于VLA的驾驶行为如何植根于视觉信息仍知之甚少。现有评估协议主要关注聚合性能指标，缺乏结构化和实用的诊断方法来量化视觉-行为依赖性。在这项工作中，我们引入了一个结构化的多级视觉扰动框架，以系统分析基于VLA的驾驶模型中的视觉-行为依赖性。该框架沿着三个互补维度组织受控视觉扰动：通道级退化、信息级破坏和结构级修改。我们将其应用于基于VLA的驾驶系统，并在开环轨迹预测和交互式闭环安全评估下评估行为响应。实验揭示了依赖于评估的依赖模式以及跨抽象层次的不均匀视觉基础。这些发现呼吁对VLA驾驶模型进行更结构化的分析和原则性设计，以更好地理解视觉信息如何塑造行为，并开发更安全、更鲁棒的系统。

英文摘要

Vision-Language-Action (VLA) models have demonstrated promising capability in autonomous driving, highlighting the potential of unified multimodal architectures for jointly modeling perception and planning. However, how current VLA-based driving behavior is grounded in visual information remains poorly understood. Existing evaluation protocols mainly focus on aggregate performance metrics, lacking structured and practical diagnostics to quantify visual-behavior dependency. In this work, we introduce a structured multi-level visual perturbation framework to analyze visual-behavior dependency in VLA-based driving models systematically. The framework organizes controlled visual perturbations along three complementary dimensions: channel-level degradation, information-level disruption, and structure-level modification. We apply it to VLA-based driving systems and evaluate behavioral responses under both open-loop trajectory prediction and interactive closed-loop safety evaluation. Experimental results reveal evaluation-dependent dependency patterns and uneven visual grounding across abstraction levels. These findings call for more structured analyses and principled design of VLA driving models to better understand how visual information shapes behavior and develop safer, more robust systems.

2605.31033 2026-06-01 cs.CV 版本更新

SlotMemory: Object-Centric KV Memory for Streaming Long-Video Generation

SlotMemory: 面向流式长视频生成的以对象为中心的KV记忆

Weijia Dou, Hui Li, Jiahao Cui, Lei Zhou, Jingdong Wang, Siyu Zhu

Affiliations: Fudan University; Meta Superintelligence Labs; Baidu

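AI summary: Proposes SlotMemory, an object-centric key-value memory mechanism that decomposes the transformer's key-value manifold into discrete semantic slots to enable entity-level persistence and prompt-aware retrieval, achieving a 22.8% relative improvement in dynamic consistency on 60-second interactive narratives.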

AI中文摘要

流式视频生成模型通常依赖于以时间为中心的记忆，将历史上下文组织为原始帧、片段或未聚类的令牌。这种组织方式常导致实体离开画面或交互式提示转换时出现身份漂移和语义不一致。为解决这些限制，我们提出SlotMemory，一种用于流式视频扩散的以对象为中心的键值记忆机制。我们的方法通过将变换器的键值流形分解为离散、可重用的语义槽，将记忆抽象从事件发生的“何时”转移到所表示的“什么”。通过利用这些槽作为路由地址来索引和存储高保真键值令牌，我们实现了跨长时域的实体级持久性和提示感知检索。在使用Wan2.1-T2V-1.3B骨干网络对60秒交互叙事进行评估时，SlotMemory达到了81.61的最先进质量分数，并在动态一致性上比现有最强流式基线相对提升22.8%。我们的结果表明，结构化的语义表示，而非原始时间容量，是持久长视频合成的关键原语。我们的代码和检查点可在https://tj12323.github.io/SlotMemory/获取。

英文摘要

Streaming video generation models typically rely on temporal-centric memory, which organizes historical context as raw frames, chunk segments, or unclustered tokens. This organization frequently leads to identity drift and semantic inconsistency when entities exit the frame or during interactive prompt transitions. To address these limitations, we propose SlotMemory, an object-centric Key-Value memory mechanism for streaming video diffusion. Our approach shifts the memory abstraction from "when" an event occurred to "what" is being represented by decomposing the transformer's key-value manifold into discrete, reusable semantic slots. By utilizing these slots as routing addresses to index and store high-fidelity key-value tokens, we enable entity-level persistence and prompt-aware retrieval across long horizons. Evaluated on 60-second interactive narratives using the Wan2.1-T2V-1.3B backbone, SlotMemory achieves a state-of-the-art quality score of 81.61 and a 22.8 percent relative improvement in dynamic consistency over the strongest existing streaming baseline. Our results demonstrate that structured semantic representation, rather than raw temporal capacity, is the essential primitive for persistent long-form video synthesis. Our codes and checkpoints are available at https://tj12323.github.io/SlotMemory/.

2605.31029 2026-06-01 cs.CV 版本更新

PEEK: Picking Essential frames via Efficient Knowledge distillation

PEEK: 通过高效知识蒸馏提取关键帧

Killian Steunou, Anas Filali Razzouki, Khalil Guetari, Mounîm A. El-Yacoubi, Yannis Tevissen

Affiliations: SAMOVAR, Télécom SudParis; Institut Polytechnique de Paris; Moments Lab

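AI summary: Proposes PEEK, which uses knowledge distillation to transfer a teacher model's frame relevance rankings into a lightweight temporal model, enabling efficient dynamic frame sampling and improving video captioning performance at low frame budgets.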

Comments Supplementary material at https://www.killian-steunou.com/peek/static/pdfs/peek_supplementary.pdf

AI中文摘要

视频语言模型只能处理有限数量的帧，使得帧选择成为高效视频字幕生成的关键瓶颈。大多数字幕生成流程仍依赖均匀采样，该方法计算成本低但忽略视觉内容。自适应帧采样最近成为从视频中选择最具信息量帧的有前景方法，但现有方法计算成本仍然高昂。我们提出PEEK，一种高效的动态帧采样方法，它将字幕条件帧相关性排名从更强的教师模型蒸馏到仅基于视觉内容运行的轻量级时序模型中。我们发现，总体而言，在ActivityNet Captions和MSR-VTT上，我们的方法在所有评估的下游视觉语言模型中优于最先进方法，特别是当仅选择一或两帧进行字幕生成时，在大多数帧预算下获得最佳CIDEr分数。在ActivityNet Captions上，PEEK尤其强大，在16个配置中赢得14个。在MSR-VTT上的零样本评估表明，我们的模型在低帧预算下迁移效果最佳，而在四帧和八帧时结果更为混合，因为时间覆盖和视觉多样性变得更具竞争力。与最近的自适应基线相比，PEEK在低预算场景下更准确且更高效：它仅增加5.2%的字幕生成时间，而CSTA增加65.4%，MaxInfo增加211.9%。我们在https://github.com/momentslab/peek发布代码和预训练检查点。

英文摘要

Video-language models can process only a limited number of frames, making frame selection a key bottleneck for efficient video captioning. Most captioning pipelines still rely on uniform sampling, which is computationally cheap but agnostic to visual content. Adaptive frame sampling has recently emerged as a promising approach for selecting the most informative frames from a video; however, existing methods remain computationally expensive. We introduce PEEK, an efficient dynamic frame sampling method that distills caption-conditioned frame relevance rankings from a stronger teacher model into a lightweight temporal model that operates only on visual content. We find that, overall, on ActivityNet Captions and MSR-VTT, our method outperforms state-of-the-art methods across all evaluated downstream vision language models, especially when only one or two frames are selected for captioning, obtaining the best CIDEr for most frame budgets. On ActivityNet Captions, PEEK is particularly strong, winning 14 out of 16 configurations. Zero-shot evaluation on MSR-VTT shows that our model transfers best at low frame budgets, while results at four and eight frames are more mixed as temporal coverage and visual diversity become increasingly competitive. Compared with recent adaptive baselines, PEEK is both more accurate in the low-budget regime and more efficient: it adds only 5.2% to the captioning time, compared with 65.4% for CSTA and 211.9% for MaxInfo. We release our code and pre-trained checkpoint at https://github.com/momentslab/peek.
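The difference between uniform sampling and score-based frame selection can be shown in a few lines. In the sketch below, the per-frame relevance scores stand in for the output of a lightweight scoring model; the scores themselves and the function names are hypothetical, and the selection rule (top-k, returned in temporal order) is an assumption rather than the method's documented procedure.

```python
def uniform_sample(num_frames, budget):
    """Pick the center frame of each of `budget` equal-length segments."""
    return [int((i + 0.5) * num_frames / budget) for i in range(budget)]


def top_k_frames(scores, budget):
    """Pick the `budget` highest-scoring frames, returned in temporal order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])


# Hypothetical relevance scores for an 8-frame clip.
scores = [0.10, 0.90, 0.30, 0.80, 0.20, 0.05, 0.70, 0.40]
uniform = uniform_sample(len(scores), budget=2)
selected = top_k_frames(scores, budget=2)
```

With a budget of two frames, uniform sampling ignores the scores entirely, while score-based selection keeps the two frames the model rates as most relevant. This is the regime where the abstract reports the largest gains.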

2605.31001 2026-06-01 cs.CV 版本更新

Iterative Framework For Data Augmentation Of Segmented Fingerprints

分割指纹数据增强的迭代框架

João Leonardo H. D. Agnol, Wesley Augusto de Bona, Erick Oliveira Rodrigues, Luiz Fernando Puttow Southier, Jefferson Oliva, Marcelo Filipak, Dalcimar Casanova

发表机构 * Federal University of Technology (UTFPR) Pato Branco, Parana, Brazil（联邦技术大学（UTFPR）帕托布兰科，巴西巴拉那州）

AI总结针对婴儿指纹数据稀缺问题，提出一种迭代数据增强方法，通过在训练用于提取指纹脊线和谷线的卷积神经网络中引入错误，生成多样化的分割指纹变体，实验证明该方法能有效扩展指纹变异性且保持视觉相似性。

DOI: 10.5753/wsis.2024.33666
Journal ref: Anais do XV Workshop de Sistemas de Informação 2024

AI中文摘要

婴儿生物识别由于婴儿与成人之间的生理差异而面临独特挑战，加上可用于研究的数据稀缺，限制了稳健匹配系统的发展。本文提出一种新颖的数据增强方法，使用迭代技术通过在训练用于提取指纹脊线和谷线的卷积神经网络中引入错误，生成分割指纹的多样化变体。在真实婴儿指纹上的实验证明了该方法在扩展指纹变异性方面的有效性，增强后的指纹在细节计数上表现出显著波动，同时仍保持与原始指纹的视觉相似性。研究还强调了该方法在应用不同程度变化到指纹分割方面的可定制性。未来研究包括使用所提框架增强的数据集训练分割和匹配神经网络。

英文摘要

Infant biometrics presents unique challenges due to the physiological differences between infants and adults, compounded by the scarcity of available data for research that limits the development of robust matching systems. This paper proposes a novel data augmentation method that uses iterative techniques to generate diverse variants of segmented fingerprints by inducing errors in a convolutional neural network trained to extract fingerprint ridges and valleys. Experiments on real infant fingerprints demonstrate the method's effectiveness in expanding fingerprint variability, with augmentations exhibiting significant fluctuations in minutiae counts while still retaining visual similarity to the originals. The study also highlights the method's customizable nature for applying varying levels of changes to fingerprint segmentations. Future research includes training segmentation and matching neural networks using datasets augmented by the proposed framework.

2605.30991 2026-06-01 cs.LG cs.CV 版本更新

Parallel Tempering Initial Sampling in Inference-Time Reward Alignment

推理时奖励对齐中的并行回火初始采样

Myeongjun Oh, Gwangho Kim, Sungyoon Lee

发表机构 * Department of Artificial Intelligence（人工智能系）； Department of Computer Science（计算机科学系）

AI总结针对推理时奖励对齐中标准SMC方法因初始采样陷入局部模式的问题，提出基于并行回火的PATHS方法，通过耦合多条回火链实现高效探索，提升对齐质量。

Comments 31 pages, 11 figures

AI中文摘要

推理时奖励对齐无需重新训练即可引导预训练的扩散和基于流的生成模型满足用户指定的奖励。最近，序贯蒙特卡洛（SMC）通过迭代过滤和传播多个粒子成为该任务的有力框架。然而，我们表明基于SMC的标准方法通常性能不佳，因为它们从标准先验初始化粒子，而复杂奖励景观中的高奖励区域极为罕见。此外，我们表明即使最近的奖励感知初始采样方法仍然容易陷入局部模式，因为复杂奖励景观通常是多模态的。为克服这些限制，我们提出PATHS（用于高复杂度奖励采样的并行回火），一种通过并行回火耦合多个采样链的新型初始化方法。PATHS维护一个奖励回火链的阶梯，并定期执行Metropolis交换，从而在平坦化的奖励景观中实现高效探索，缓解模式陷阱问题。我们的分析表明，该机制显著增强了有限预算下对通常难以采样的罕见高奖励区域的探索。在布局到图像和数量感知生成上的实验表明，PATHS在对齐质量上取得了一致的提升，尤其是在复杂提示上。

英文摘要

Inference-time reward alignment steers pretrained diffusion and flow-based generative models to satisfy user-specified rewards without retraining. Recently, Sequential Monte Carlo (SMC) has emerged as a powerful framework for this task by iteratively filtering and propagating multiple particles. However, we show that standard SMC-based methods often suffer from poor performance because they initialize particles from a standard prior, whereas high-reward regions in complex reward landscapes are extremely rare. Further, we show that even recent reward-aware initial sampling approaches remain vulnerable to getting trapped in local modes, as complex reward landscapes are often multi-modal. To overcome these limitations, we propose PATHS (PArallel Tempering for High-complexity reward Sampling), a novel initialization method that couples multiple sampling chains through parallel tempering. PATHS maintains a ladder of reward-tempered chains and periodically performs Metropolis swaps, enabling efficient exploration across flattened reward landscapes, thereby mitigating the mode-trapping issues. Our analysis reveals that this mechanism substantially enhances the finite-budget exploration of rare, high-reward regions that are typically challenging to sample. Experiments on layout-to-image and quantity-aware generation show that PATHS achieves consistent gains in alignment quality, particularly on complex prompts.
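The Metropolis swap step can be sketched with the textbook parallel tempering rule. This is an assumption-laden illustration rather than the PATHS algorithm itself: the abstract does not give the temperature ladder or swap schedule, so `betas`, the neighbour-only sweep, and the target form `prior(x) * exp(beta * reward(x))` are all assumptions.

```python
import math
import random
from typing import List


def swap_acceptance(beta_i: float, beta_j: float,
                    reward_i: float, reward_j: float) -> float:
    """Metropolis acceptance probability for exchanging the states of two chains.

    Chain k is assumed to target prior(x) * exp(beta_k * reward(x)); the prior
    terms cancel in the ratio, leaving only the reward difference.
    """
    log_ratio = (beta_i - beta_j) * (reward_j - reward_i)
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)


def swap_sweep(states: List, rewards: List[float], betas: List[float],
               rng: random.Random) -> int:
    """Attempt swaps between neighbouring chains in place; return the number accepted."""
    accepted = 0
    for k in range(len(betas) - 1):
        p = swap_acceptance(betas[k], betas[k + 1], rewards[k], rewards[k + 1])
        if rng.random() < p:
            # Exchange both the samples and their cached rewards.
            states[k], states[k + 1] = states[k + 1], states[k]
            rewards[k], rewards[k + 1] = rewards[k + 1], rewards[k]
            accepted += 1
    return accepted
```

Under this rule, a high-reward sample found by a flatter (low-beta) chain is always passed to a sharper (high-beta) chain, while the reverse move is accepted only occasionally. That asymmetry is what lets exploration in flattened landscapes feed the chains that concentrate on high-reward regions.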

2605.30987 2026-06-01 cs.CV 版本更新

Benchmarking Single-Step Inpainting Methods for Multi-Object 3D Gaussian Splatting Scenes

多对象3D高斯泼溅场景的单步修复方法基准测试

Finn Dröge, Cecilia Curreli, Abhishek Saroha, Daniel Cremers

发表机构 * Technical University of Munich（慕尼黑工业大学）； Munich Center for Machine Learning（慕尼黑机器学习中心）

AI总结针对3D高斯泼溅场景中的对象移除与修复任务，比较了2D修复器在3D一致性上的表现，发现基于重建的修复器优于生成扩散模型，且从头初始化场景比微调现有场景效果更好，同时引入了一个带真实数据的新多对象场景。

Comments Accepted as an extended abstract to the CVEU Workshop at CVPR 2026

AI中文摘要

对象移除和修复3D高斯泼溅（3DGS）场景面临跨相机视图的3D一致性等挑战。在比较2D修复器及其对3D领域的适用性时，我们发现基于重建的修复器在3D一致性上优于生成扩散模型。将这些2D修复器集成到创建和微调3DGS场景的不同单步方法中，我们的结果表明，从头初始化场景比微调现有场景产生更高质量的结果。使用最先进的生成式2D修复器，我们创建了一个简单的基线，以强调在3D设置中先移除对象再进行修复的重要性。由于360°数据集很少包含真实世界的地面真值，且具有挑战性的遮挡场景同样稀少，我们引入了一个新的多对象场景，其中包含记录的地面真值数据和多个存在对象遮挡的视图。

英文摘要

Object removal and inpainting in 3D Gaussian Splatting (3DGS) scenes face challenges such as maintaining 3D consistency across camera views. In comparing 2D inpainters and their suitability for the 3D domain, we find that reconstruction-based inpainters outperform generative diffusion models in 3D consistency. Integrating these 2D inpainters into different single-step methods for creating and finetuning 3DGS scenes, our results indicate that initializing the scene from scratch produces higher-quality results than finetuning the existing scene. Using a state-of-the-art generative 2D inpainter, we create a straightforward baseline to underline the importance of object removal before inpainting in the 3D setting. Since 360° datasets rarely include real-world ground truths, and challenging occlusion scenarios are equally sparse, we introduce a novel multi-object scene with recorded ground truth data and many views with object occlusions.

2605.30984 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation

生成报告还是重复模板？测量和缓解三维CT报告生成中的模板崩溃

Tom Maye-Lasserre, Yitong Li, Bailiang Jian, Morteza Ghahremani, Benedikt Wiestler, Christian Wachinger

发表机构 * Technical University of Munich (TUM)（慕尼黑工业大学）； TUM Hospital（TUM医院）； Munich Center for Machine Learning (MCML)（慕尼黑机器学习中心）

AI总结针对三维CT报告生成中模型输出多样性低、病理检测能力差的模板崩溃问题，提出解耦框架CLarGen，通过分离临床检测与语言合成，显著提升临床准确性并保持报告流畅性。

AI中文摘要

现代三维医学视觉语言模型（VLM）能够生成流畅的放射学风格文本，但表现出极低的病理检测率和输出多样性，崩溃为低估罕见但关键发现的通用模板。我们将这种失败模式识别为模板崩溃。这种失败源于三维医学成像的独特限制，例如数据有限、标签严重不平衡以及体积编码器的弱信号。在这些限制下，文本生成目标鼓励捷径学习和流畅但基础薄弱的报告。我们通过临床保真度、输出多样性、正常模板偏差和罕见发现存活率系统性地诊断模板崩溃。为了缓解它，我们提出CLarGen，一个解耦框架，将说什么（临床检测）与怎么说（语言合成）分开。CLarGen使用（i）用于多标签病理检测的潜在查询变换器，（ii）用于临床匹配示例的病理引导检索，以及（iii）用于从检测到的发现和检索到的上下文中合成最终报告的医学语言模型。在最新的三维CT报告生成基线中，CLarGen缓解了模板崩溃，并在保持流畅报告的同时显著提高了临床准确性（macro-F1 0.487 vs. 0.189；CRG 0.472 vs. 0.368）。我们的结果表明，明确、可测量的临床基础对于抗模板崩溃的三维CT报告生成至关重要。代码将在接收后发布。

英文摘要

Modern 3D medical vision-language models (VLMs) can generate fluent radiology-style text while exhibiting critically low pathology detection and output diversity, collapsing to generic templates that under-report rare yet critical findings. We identify this failure mode as Template Collapse. This failure stems from the unique constraints of 3D medical imaging, e.g., limited data, severe label imbalance, and weak signals from volumetric encoders. Under these constraints, text-generation objectives encourage shortcut learning and fluent but weakly grounded reports. We systematically diagnose Template Collapse through clinical fidelity, output diversity, normal-template bias, and rare-finding survival. To mitigate it, we propose CLarGen, a decoupled framework that separates what to say (clinical detection) from how to say it (language synthesis). CLarGen uses (i) a Latent Query Transformer for multi-label pathology detection, (ii) pathology-guided retrieval for clinically matched exemplars, and (iii) a medical language model to synthesize the final report from detected findings and retrieved context. Across state-of-the-art 3D CT report generation baselines, CLarGen mitigates Template Collapse and substantially improves clinical accuracy (macro-F1 0.487 vs. 0.189; CRG 0.472 vs. 0.368) while maintaining fluent reporting. Our results suggest that explicit, measurable clinical grounding is essential for template-collapse-resistant 3D CT report generation. Code will be released upon acceptance.

2605.30983 2026-06-01 cs.CV 版本更新

Can BEV Perception Gracefully Degrade under Sensor Failures?

BEV感知能否在传感器故障下优雅降级？

Haifa Zhang, Yijing Wang, Haoyu Wang, Zheng Li, Zhiqiang Zuo

发表机构 * Tianjin Key Laboratory of Intelligent Unmanned Swarm Technology and System（天津市智能无人集群技术与系统重点实验室）； School of Electrical and Information Engineering, Tianjin University（天津大学电气与信息工程学院）； Key Laboratory of System Control and Information Processing, Ministry of Education of China（系统控制与信息处理教育部重点实验室）

AI总结针对多模态BEV感知在传感器损坏时性能骤降的问题，提出Grace-BEV框架，通过主动可靠性评估和动态特征重校准实现优雅降级，在极端LiDAR故障下将mAP从0.0%恢复至34.7%。

AI中文摘要

尽管多模态鸟瞰图（BEV）感知在自动驾驶中取得了显著成功，但现有系统存在一个关键脆弱性：现有融合机制对传感器损坏高度敏感，常导致灾难性性能下降。这种脆弱性主要源于标准融合框架通常以静态方式集成多模态表示，导致在缺失或损坏模态下性能急剧崩溃。相比之下，我们表明通过主动模态可靠性评估可以实现优雅降级。为此，我们提出Grace-BEV，一个轻量级即插即用框架，在多模态融合过程中强制引入主动可靠性感知。Grace-BEV不依赖计算昂贵的跨模态交互，而是利用对齐的BEV空间通过TrustGate路由器显式评估模态可信度，并使用FailSafe融合块动态重新校准特征集成。此外，我们设计了带模态丢弃的三阶段训练策略，以防止模态主导并鼓励在不可靠输入下进行平衡的跨模态学习。在nuScenes-R和nuScenes-C上的大量实验表明，Grace-BEV在各种损坏设置下保持稳健性能。值得注意的是，在标准基线崩溃至0.0%平均精度（mAP）的灾难性LiDAR故障下，Grace-BEV将性能恢复至高达34.7% mAP。此外，它将干净准确率提升高达1.4%，实现了鲁棒性与效率之间的强权衡。

英文摘要

Despite the remarkable success of multi-modal bird's-eye view (BEV) perception in autonomous driving, current systems exhibit a critical vulnerability: existing fusion mechanisms are highly brittle to sensor corruptions, often causing catastrophic performance degradation. This vulnerability largely stems from the fact that standard fusion frameworks typically integrate multi-modal representations in a static manner, leading to a precipitous performance collapse under missing or corrupted modalities. In contrast, we show that graceful degradation is achievable through active modality reliability assessment. To this end, we present Grace-BEV, a lightweight and plug-and-play framework that enforces active reliability awareness during multi-modal fusion. Instead of relying on computationally expensive cross-modal interactions, Grace-BEV leverages the aligned BEV space to explicitly assess modality trustworthiness via a TrustGate Router and dynamically recalibrate feature integration using the FailSafe Fusion Block. Furthermore, we devise a Three-Phase Training strategy with Modality Dropout to prevent modality dominance and encourage balanced cross-modal learning under unreliable inputs. Extensive experiments on nuScenes-R and nuScenes-C show that Grace-BEV maintains robust performance across diverse corruption settings. Notably, under catastrophic LiDAR failures where standard baselines collapse to 0.0% mean Average Precision (mAP), Grace-BEV restores performance to as high as 34.7% mAP. Moreover, it improves clean accuracy by up to 1.4%, achieving a strong trade-off between robustness and efficiency.

2605.30972 2026-06-01 cs.CV 版本更新

BiSegMamba: Efficient Bidirectional Tri-Oriented Mamba for 3D Medical Image Segmentation

BiSegMamba: 用于3D医学图像分割的高效双向三向Mamba

Bakht Zada, Chao Tong, Qile Su, Shuai Zhang

发表机构 * School of Computer Science and Engineering, Beihang University（北航计算机科学与工程学院）； State Key Laboratory of Virtual Reality Technology and Systems, Beihang University（北航虚拟现实技术与系统国家重点实验室）

AI总结提出BiSegMamba，一种基于双向三向Mamba的高效3D医学图像分割网络，通过渐进压缩主干、多尺度空间混合器、双向正交Mamba块和自适应方向融合，在降低计算成本的同时提升分割精度。

Comments 10 pages, 7 figures, 5 tables. Code is available at: https://github.com/bakhtzadaabshare/BiSegMamba

AI中文摘要

精确的3D医学图像分割需要长程体积上下文和精细边界保持。基于CNN的方法全局依赖建模有限，而基于Transformer的模型对于密集3D输入通常计算成本高昂。最近的基于Mamba的方法提供了一种高效替代方案，但现有的体积设计仍依赖于重复的高分辨率扫描、仅前向的顺序建模和固定的方向求和，导致高成本、扫描顺序偏差和次优的方向聚合。我们提出BiSegMamba，一种用于3D医学图像分割的高效双向三向Mamba网络。BiSegMamba遵循紧凑到细节的设计，其中渐进压缩主干（PCS）能够进行高效的潜在空间推理，同时保留浅层高分辨率特征用于重建。多尺度空间混合器（MSSM）在早期阶段捕获局部解剖模式，而提出的双向三向正交Mamba（Bi-ToOM）块使用联合处理的前向和后向扫描序列，从多个正交视图建模长程依赖。自适应方向融合（ADF）学习跨扫描方向的输入相关通道权重，用方向感知融合替代固定求和。在收集的颈动脉CTA数据集和三个公共基准BraTS2023、ACDC和AMOS-CT上的实验表明，BiSegMamba在血管、心脏、脑肿瘤和腹部多器官分割任务中具有良好的泛化能力。与SegMamba-V2相比，BiSegMamba在BraTS2023上性能略有提升，在ACDC和颈动脉数据集上显著改进，同时计算成本降低高达77.9% FLOPs，展示了在通用3D医学图像分割中强大的精度-效率平衡。

英文摘要

Accurate 3D medical image segmentation requires both long-range volumetric context and fine boundary preservation. CNN-based methods have limited global dependency modeling, while Transformer-based models are often computationally expensive for dense 3D inputs. Recent Mamba-based methods provide an efficient alternative, but existing volumetric designs still depend on repeated high-resolution scanning, forward-only sequential modeling, and fixed directional summation, causing high cost, scan-order bias, and suboptimal directional aggregation. We propose BiSegMamba, an efficient bidirectional tri-oriented Mamba network for 3D medical image segmentation. BiSegMamba follows a compact-to-detail design, where a progressive compacting stem (PCS) enables efficient latent-space reasoning while retaining shallow high-resolution features for reconstruction. A multi-scale spatial mixer (MSSM) captures local anatomical patterns in early stages, and the proposed bidirectional tri-oriented Ortho Mamba (Bi-ToOM) block models long-range dependencies from multiple orthogonal views using jointly processed forward and backward scan sequences. Adaptive directional fusion (ADF) learns input-dependent channel-wise weights across scan orientations, replacing fixed summation with orientation-aware fusion. Experiments on a collected carotid CTA dataset and three public benchmarks, BraTS2023, ACDC, and AMOS-CT, show that BiSegMamba generalizes well across vascular, cardiac, brain tumor, and abdominal multi-organ segmentation tasks. Compared with SegMamba-V2, BiSegMamba achieves slightly better performance on BraTS2023 and clear improvements on ACDC and the carotid dataset, while reducing computational cost by up to 77.9% FLOPs, demonstrating a strong accuracy-efficiency balance for general 3D medical image segmentation.

2605.30969 2026-06-01 cs.CV 版本更新

Omni-Supervised Motion Editing: Balancing Change and Invariance through Positive-Negative Learning

全监督运动编辑：通过正负学习平衡变化与不变性

Zhenwu Shi, Jingyu Gong, Peiwei Wang, Xingzan Wang, Tianwen Qian, Wenxi Li, Yuan Fang, Jiao Xie, Lizhuang Ma, Shaohui Lin

发表机构 * Shanghai Institute of Artificial Intelligence for Education, East China Normal University, China（上海人工智能教育研究院，华东师范大学，中国）； School of Computer Science and Technology, East China Normal University, China（华东师范大学计算机科学与技术学院，中国）； School of Statistics, East China Normal University, China（华东师范大学统计学院，中国）； The 27th Research Institute of CETC, Zhengzhou, China（中国电子科技集团第27研究所，郑州，中国）； Key Laboratory of Advanced Theory and Application in Statistics and Data Science, MOE, China（教育部统计与数据科学先进理论与应用重点实验室，中国）； Shanghai Key Laboratory of Computer Software Evaluating and Testing, China（上海计算机软件评测测试重点实验室，中国）； School of Computer Science, Shanghai Jiao Tong University, China（上海交通大学计算机科学学院，中国）

AI总结提出OmniME框架，通过正负学习结合回顾特征监督、运动保持机制和三元组语义对齐，平衡运动编辑中的变化与不变性，在MotionFix和STANCE Adjustment数据集上达到最优性能。

AI中文摘要

基于文本的人体运动编辑旨在根据自然语言指令修改现有运动序列，同时保持原始运动的一致性。现有的基于扩散的方法通常依赖启发式相似性线索或粗糙的全局条件，导致运动失真和次优的语义对齐。关键挑战在于平衡变化（即精确编辑目标区域）和不变性（即保留未编辑部分）。为应对这一挑战，我们提出了一个全监督正负学习框架，名为OmniME。我们的方法集成了三个互补组件：（1）回顾特征监督，在Transformer层之间强制执行从粗到细的一致性；（2）运动保持机制，根据源-目标相似性关注细微变化；（3）基于三元组的语义对齐，增强文本-运动对应关系。这些组件共同形成了一个统一的监督范式，平衡变化与不变性。在MotionFix和STANCE Adjustment数据集上的大量实验表明，OmniME在编辑对齐方面达到了最先进的性能，验证了我们统一学习框架的有效性。我们的源代码和模型已发布在：https://github.com/rocket-ycyer/OmniME.git

英文摘要

Text-based human motion editing aims to modify existing motion sequences according to natural language instructions while maintaining the consistency of the original motion. Existing diffusion-based approaches often rely on heuristic similarity cues or coarse global conditioning, leading to motion distortion and suboptimal semantic alignment. The key challenge lies in balancing change (i.e., precisely editing target regions) and invariance (i.e., preserving unedited parts). To address this challenge, we propose an Omni-Supervised Positive-Negative Learning framework named OmniME. Our method integrates three complementary components: (1) retrospective feature supervision that enforces coarse-to-fine consistency across transformer layers, (2) a motion preservation mechanism that focuses on subtle variations according to the source-target similarity, and (3) triplet-based semantic alignment that strengthens text-motion correspondence. Together, these components form a unified supervision paradigm that balances change and invariance. Extensive experiments on the MotionFix and STANCE Adjustment datasets demonstrate that OmniME achieves state-of-the-art performance in editing alignment, validating the effectiveness of our unified learning framework. Our source code and models have been released at: https://github.com/rocket-ycyer/OmniME.git
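For readers unfamiliar with triplet-based alignment, the standard triplet margin loss is sketched below. This is the generic formulation and only an assumption about the flavour of loss involved: the abstract does not state OmniME's distance function, margin, or how positives and negatives are built, so `margin=0.2` and the Euclidean distance are illustrative choices.

```python
import math
from typing import Sequence


def euclidean(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_margin_loss(anchor: Sequence[float],
                        positive: Sequence[float],
                        negative: Sequence[float],
                        margin: float = 0.2) -> float:
    """Penalise triplets where the positive is not closer than the negative by `margin`."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)
```

In a text-motion setting, the anchor could be an instruction embedding, the positive the correctly edited motion, and the negative a motion that ignores the instruction. The loss is zero once the positive is closer to the anchor than the negative by at least the margin.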

2605.30968 2026-06-01 cs.CV cs.AI 版本更新

Variational Adapter for Cross-modal Similarity Representation

变分适配器用于跨模态相似性表示

WenZhang Wei, Zhipeng Gui, Dehua Peng, Tiandi Ye, Huayi Wu

发表机构 * School of Remote Sensing and Information Engineering（遥感与信息工程学院）； Wuhan University（武汉大学）； School of Data Science and Engineering（数据科学与工程学院）； East China Normal University（华东师范大学）； State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing（测绘遥感信息工程国家重点实验室）

AI总结针对跨模态匹配中细粒度标注稀缺导致二元分类边界压缩和假负样本问题，提出变分适配器VACSR，将匹配任务重构为变分推断问题，通过构建潜在相似性空间和正则化缓解过拟合，在图像-文本检索、域泛化和基类到新类泛化任务上验证了有效性。

Comments Accepted by the 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

视觉-语言模型的核心在于在统一表示空间中度量跨模态相似性。然而，大多数图像-文本匹配或多类图像分类数据集缺乏细粒度的跨模态匹配标注，迫使连续的相似性空间压缩为二元分类边界。这种压缩引入了假负样本，并严重损害了跨模态任务的泛化性能。尽管先前的研究试图通过建模模态内模糊性来缓解这一问题，但往往忽略了固有的标注缺陷，导致不确定性分配次优。为了解决这些挑战，我们提出了一种变分适配器用于跨模态相似性表示（VACSR）。该方法将具有细粒度语义稀缺性的图像-文本匹配重新表述为变分推断问题。它构建了一个跨模态相似性的潜在空间，并使用正则化技术来减轻对二元标注的过拟合。在图像-文本检索、域泛化和基类到新类泛化上的实验证明了所提出方法的有效性和鲁棒的泛化能力。

英文摘要

The core of vision-language models lies in measuring cross-modal similarity within a unified representation space. However, most image-text matching or multi-class image classification datasets lack fine-grained cross-modal matching annotations, forcing the continuous similarity space into binary classification boundaries. This compression induces false negative samples and significantly impairs the generalization performance of cross-modal tasks. While prior research has attempted to mitigate this by modeling intra-modal ambiguity, it often overlooks inherent annotation flaws, leading to suboptimal uncertainty allocation. To address these challenges, we propose a Variational Adapter for Cross-modal Similarity Representation (VACSR). This approach reformulates image-text matching with fine-grained semantic scarcity as a variational inference problem. It constructs a latent space for cross-modal similarity and uses regularization techniques to mitigate overfitting to binary annotations. Experiments on image-text retrieval, domain generalization, and base-to-novel generalization demonstrate the proposed method's effectiveness and robust generalization ability.

2605.30942 2026-06-01 cs.CV 版本更新

PRISM: Progressive Reasoning through Iterative Slot Memory for Vision

PRISM: 通过迭代槽记忆进行渐进推理的视觉架构

Ziyu Wang, Shuangpeng Han, Mengmi Zhang

发表机构 * Deep NeuroCognition Lab, Nanyang Technological University, Singapore（深度神经认知实验室，南洋理工大学，新加坡）

AI总结提出PRISM架构，通过迭代槽记忆进行渐进推理，在图像分类、目标检测和语义分割等任务上取得竞争性能，并在遮挡等不完整观测下展现出更强的鲁棒性。

AI中文摘要

现代视觉模型通过单次前馈传递处理图像，这限制了它们在观测不完整时恢复缺失证据或细化不确定表示的能力。受人类感知迭代性质的启发，我们引入了PRISM（通过迭代槽记忆进行渐进推理），这是一种通过迭代细化对图像进行推理的金字塔视觉架构。在高层次上，PRISM将视觉特征分组为以对象为中心的表示，从学习到的记忆中检索相关模式，并迭代细化表示以解决歧义和恢复缺失信息。这种组织-回忆-细化过程在多个尺度上循环运行，实现了视觉表示的渐进改进。在包括图像分类、目标检测和语义分割在内的标准视觉任务中，PRISM取得了竞争性能，同时在遮挡等不完整观测下展现出更强的鲁棒性。这些结果表明，使用结构化表示和记忆进行迭代推理是构建更具弹性和适应性的视觉模型的一个有前景的方向。源代码和模型将发布。

英文摘要

Modern vision models process images in a single feed-forward pass, which limits their ability to recover missing evidence or refine uncertain representations under incomplete observations. Inspired by the iterative nature of human perception, we introduce PRISM (Progressive Reasoning through Iterative Slot Memory), a pyramid vision architecture that reasons over images through iterative refinement. At a high level, PRISM groups visual features into object-centric representations, retrieves relevant patterns from a learned memory, and iteratively refines the representation to resolve ambiguity and recover missing information. This organize-recall-refine process operates recurrently across multiple scales, enabling progressive improvement of visual representations. Across standard vision tasks, including image classification, object detection, and semantic segmentation, PRISM achieves competitive performance while demonstrating improved robustness under incomplete observations such as occlusion. These results suggest that iterative reasoning with structured representations and memory is a promising direction for building more resilient and adaptive vision models. Source code and models will be released.

2605.30939 2026-06-01 cs.CV 版本更新

关注证据：面向多模态RLVR的证据锚定空间注意力监督

Ruina Hu, Chen Wang, Lai Wei, Jionghao Bai, Bin Yu, Weiran Huang, Kai Wang, Yue Wang

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Zhongguancun Academy（中关村学院）； Zhongguancun Institute of Artificial Intelligence（中关村人工智能研究院）； Nankai University（南开大学）； Shanghai Jiaotong University（上海交通大学）； Zhejiang University（浙江大学）

AI总结提出EASE方法，通过将标注证据区域转化为平滑视觉标记目标，在多模态强化学习训练中引导响应到图像的注意力，从而提升视觉语言模型在感知、幻觉、视觉数学和多模态推理基准上的性能。

AI中文摘要

基于可验证奖励的强化学习（RLVR）通过优化从最终答案中导出的结果奖励来改进视觉语言模型（VLM）。然而，这种仅基于结果的奖励并不能告诉模型哪些图像区域证明了答案的正确性。对于需要视觉定位的问题，这些奖励无法区分由相关视觉证据支持的响应与由语言先验捷径或幸运猜测产生的响应。我们引入了EASE（证据锚定空间注意力），它通过视觉证据过程监督增强了多模态RLVR。EASE将标注的证据区域转换为平滑的视觉标记目标，并在RL训练期间使用它来引导响应到图像的注意力，但仅限于高奖励轨迹。标注仅用作特权训练标签，而推理仅需要原始图像和问题。在Qwen2.5-VL-7B、Qwen3-VL-4B和Qwen3-VL-8B上，EASE在感知、幻觉、视觉数学和多模态推理基准上的平均得分比DAPO高出2.5到3.1分。诊断和消融实验表明，EASE更好地将视觉注意力与标注的证据区域对齐。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) improves vision-language models (VLMs) by optimizing outcome rewards derived from final answers. However, such outcome-only rewards do not tell the model which image regions justify an answer. For questions that require visual grounding, these rewards cannot distinguish responses supported by relevant visual evidence from those produced by language-prior shortcuts or lucky guesses. We introduce EASE (Evidence-Anchored Spatial Attention), which augments multimodal RLVR with visual-evidence process supervision. EASE converts annotated evidence regions into a smoothed visual-token target and uses it to guide response-to-image attention during RL training, but only on high-reward trajectories. The annotations are used solely as privileged training labels, while inference requires only the original image and question. Across Qwen2.5-VL-7B, Qwen3-VL-4B, and Qwen3-VL-8B, EASE raises average scores over DAPO by 2.5 to 3.1 points on perception, hallucination, visual math, and multimodal reasoning benchmarks. Diagnostics and ablations show that EASE better aligns visual attention with annotated evidence regions.

2605.30911 2026-06-01 cs.CV cs.AI 版本更新

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

什么使LVLMs更少产生幻觉？揭示影响幻觉鲁棒性的架构因素

Yusheng He, Jizhe Zhou, Xia Du, Zheng Lin, Jun Luo, Jiancheng Lv

发表机构 * School of Computer Science, Engineering Research Center of Machine Learning and Industry Intelligence, Sichuan University（计算机科学学院，机器学习与产业智能工程研究中心，四川大学）； School of Computer and Information Engineering, Xiamen University of Technology（计算机与信息工程学院，厦门理工大学）； Department of Electrical and Computer Engineering, University of Hong Kong（电气与计算机工程系，香港大学）； College of Computing and Data Science, Nanyang Technological University（计算与数据科学学院，南洋理工大学）

AI总结本文通过将架构设计分解为语言基础、视觉表示和语义对齐三个维度，并引入CoSimUE基准，系统探索了架构因素对LVLMs幻觉鲁棒性的影响，发现模型参数扩展效果有限，而增强视觉编码器、语言基础和语义对齐能分别减少不同类型的幻觉。

AI中文摘要

幻觉仍然是削弱大型视觉-语言模型（LVLMs）可靠性的关键挑战之一。但什么使LVLM更少产生幻觉？许多现有工作专注于改进模型的内部组件。我们认为幻觉从根本上源于模型架构的设计方式。为了研究这一点，我们将架构设计分解为三个维度：语言基础（LF）、视觉表示（VR）和语义对齐（SA），并将幻觉分为共现型、相似型和先前被忽视的不确定型。基于这一框架，我们提出了CoSimUE基准，通过受控文本扰动和随机扰动创建细粒度的幻觉场景，从而建立设计选择与幻觉行为之间的映射。在7个设计方面的实验表明：1）广泛强调的参数规模扩展对减少所有三类幻觉的影响有限；2）更大且训练更好的语言基础可以减少共现型幻觉；3）更强的视觉编码器和更高的分辨率减轻相似型错误；4）有效的对齐策略缓解不确定型幻觉；5）跨维度分析显示，联合增强视觉保真度和对齐质量能带来最全面的改进。本研究首次系统性地将架构级设计与幻觉鲁棒性联系起来，为开发可靠且高效的LVLMs提供了实用指导。

英文摘要

Hallucination remains one of the key challenges undermining the reliability of Large Vision-Language Models (LVLMs). But what makes an LVLM hallucinate less? Many existing efforts focus on improving internal components of the model. We argue that hallucination fundamentally stems from how the model architecture is designed. To investigate this, we factor the architecture design into three dimensions: Linguistic Foundation (LF), Visual Representation (VR), and Semantic Alignment (SA), and categorize hallucinations into Co-occurrence, Similarity, and previously overlooked Uncertainty types. Building on this formulation, we propose CoSimUE, a benchmark that creates fine-grained hallucination scenarios through controlled textual perturbations and random perturbations, enabling mapping between design choices and hallucination behaviors. Experiments across 7 design aspects show that: 1) the widely emphasized scaling of model parameters has only limited impact on reducing all three types of hallucinations; 2) larger and better-trained language foundations can reduce co-occurrence hallucinations; 3) stronger visual encoders and higher resolutions mitigate similarity errors; 4) effective alignment strategies alleviate uncertainty hallucinations; and 5) cross-dimensional analysis reveals that jointly enhancing visual fidelity and alignment quality yields the most comprehensive improvements. This study provides the first systematic exploration linking architecture-level design to hallucination robustness, offering practical guidance for developing reliable and efficient LVLMs.

2605.30904 2026-06-01 cs.CV 版本更新

MergeTok: Unified Continuous and Discrete Visual Tokenization via Token Merging

MergeTok: 通过令牌合并实现统一连续和离散视觉令牌化

Luyuan Zhang, Siyuan Li, Zedong Wang, Qingsong Xie, Cheng Tan, Anna Wang, Yanhao Zhang, Chen Chen, Haonan Lu, Haoqian Wang

发表机构 * Tsinghua University（清华大学）； Westlake University（西湖大学）； Zhejiang University（浙江大学）； Hong Kong University of Science and Technology（香港科技大学）； OPPO； Shanghai AI Lab（上海人工智能实验室）

AI总结提出MergeTok统一令牌化器，通过令牌合并技术联合优化连续VAE和离散VQ令牌化器，实现高保真重建与语义可控离散表示的兼顾。

Comments 11 pages (main text), 7 figures. Preprint. Under review at NeurIPS 2026

AI中文摘要

大多数用于图像生成的视觉令牌化器分为两类，各有互补的局限性：连续VAE提供高保真重建，但遭受密集、纠缠的潜在变量，不适合语义控制；而基于离散VQ的模型能够实现自回归生成，但面临梯度稀疏、训练不稳定和码本崩溃的问题。在这项工作中，我们引入了MergeTok，一个统一的令牌化器，在编码器-解码器架构中联合优化连续（VAE）和离散（VQ）令牌化器，利用令牌合并技术作为语义桥梁。通过在编码过程中聚类相似令牌，MergeTok建立了一个结构先验，提供双重监督信号：（i）在VAE分支中施加合并令牌的语义对齐，将其潜在空间正则化为解缠、语义感知的表示；（ii）推导出组级约束，促进组内多样性和组间排他性，从而稳定VQ训练。MergeTok在ImageNet-256上展示了具有竞争力的重建和生成性能，在匹配令牌预算下，其rFID远低于强VAE和VQ模型，同时产生语义组织的令牌表示，兼容自回归和扩散生成器。这表明单一架构可以赋予视觉令牌化器鲁棒的语义组织和生成器友好的离散性。

英文摘要

Most visual tokenizers for image generation are bifurcated into two families with complementary limitations: continuous VAEs offer high-fidelity reconstruction but suffer from dense, entangled latents that are poorly suited for semantic control, whereas discrete VQ-based models enable autoregressive generation yet struggle with gradient sparsity, unstable training, and codebook collapse. In this work, we introduce MergeTok, a unified tokenizer that jointly optimizes continuous (VAE) and discrete (VQ) tokenizers within an encoder-decoder architecture, leveraging token merging techniques as a semantic bridge. By clustering similar tokens during encoding, MergeTok establishes a structural prior that provides dual supervision signals: (i) it imposes merged-token semantic alignment in the VAE branch, regularizing its latent space toward disentangled, semantic-aware representations; (ii) it derives group-wise constraints, promoting intra-group diversity and inter-group exclusivity that stabilize VQ training. MergeTok shows competitive reconstruction and generation performance on ImageNet-256, with substantially lower rFID than strong VAE and VQ models under matched token budgets, while producing semantically organized token representations compatible with both autoregressive and diffusion generators. This shows that a single architecture can endow visual tokenizers with robust semantic organization and generator-friendly discreteness.

2605.30894 2026-06-01 cs.CV 版本更新

SteerFace: Debiasing Synthetic Face Generation via Adaptive Residue Perturbation

SteerFace: 通过自适应残差扰动消除合成人脸生成中的偏差

Yuxi Mi, Qiuyang Yuan, Jianqing Xu, Yichun Zhou, Xuan Zhao, Jun Wang, Rizen Guo, Shuigeng Zhou

发表机构 * Fudan University（复旦大学）； Youtu Lab, Tencent（腾讯优图实验室）； WeChat Pay Lab33, Tencent（腾讯微信支付实验室33）

AI总结针对合成人脸数据与真实数据分布存在视觉倾向差异的问题，提出SteerFace框架，通过将身份嵌入向随机正交方向扰动作为正则化项，抑制生成器对非身份视觉线索的依赖，从而缩小合成-真实差距。

AI中文摘要

人脸识别训练中合法合规数据的短缺引发了人们对使用合成数据作为替代方案的日益关注。虽然最近的扩散方法能够生成具有强身份一致性和数据多样性的逼真人脸图像，但其下游识别性能仍然存在显著的合成-真实差距。本文识别出视觉倾向（visual tendency）作为一个此前未被充分探索的限制因素，即合成数据表现出不切实际的视觉属性普遍性，从而偏离真实数据分布。视觉倾向可归因于生成器对身份嵌入的条件化，通过这种条件化，共现的残留视觉线索被无意中吸收到学习到的身份语义中。为了阻止生成器利用此类视觉线索，本文提出SteerFace，一个简单高效的训练框架，通过将身份嵌入向嵌入超球面上的随机正交方向引导来扰动身份嵌入。该扰动作为一种身份保持正则化项，惩罚生成器对非身份成分的依赖，理论分析支持了这一点。本文进一步引入一种自适应策略，学习具有样本级偏好和有利总体统计的扰动强度。大量实验表明，SteerFace有效缓解了视觉倾向，在下游人脸识别中优于先前方法，并且在不同训练数据集和生成流程中具有良好的泛化能力。

英文摘要

The shortage of legally compliant data for face recognition training has sparked growing interest in using synthetic data as an alternative. While recent diffusion-based methods enable the generation of photorealistic face images with strong identity adherence and data diversity, their downstream recognition performance still exhibits a significant synthetic-real gap. This paper identifies visual tendency as a previously underexplored limitation, whereby synthetic data exhibit an unrealistic prevalence of visual attributes and thus deviate from the real-data distribution. Visual tendency can be attributed to the generator's conditioning on identity embeddings, through which co-occurring residual visual cues are unintentionally absorbed into learned identity semantics. To discourage the generator from exploiting such visual cues, this paper proposes SteerFace, a simple and efficient training framework that perturbs identity embeddings by steering them toward random orthogonal directions on the embedding hypersphere. The perturbation serves as an identity-preserving regularizer that penalizes the generator's reliance on non-identity components, as supported by theoretical analysis. This paper further introduces an adaptive strategy that learns perturbation strengths with both sample-wise preference and favorable overall statistics. Extensive experiments show that SteerFace effectively mitigates visual tendency, outperforms prior methods in downstream face recognition, and generalizes well across different training datasets and generation pipelines.
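The geometric idea of steering an embedding toward a random orthogonal direction on the hypersphere can be sketched as below. This is a minimal illustration under stated assumptions, not SteerFace's training code: the function names are hypothetical, and the fixed `angle` stands in for the perturbation strength that the paper learns adaptively.

```python
import math
import random
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(dot(v, v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]


def steer(embedding: Sequence[float], angle: float, rng: random.Random) -> List[float]:
    """Rotate a unit embedding by `angle` radians toward a random orthogonal direction."""
    if len(embedding) < 2:
        raise ValueError("need at least two dimensions to find an orthogonal direction")
    e = normalize(embedding)
    while True:
        # Sample a Gaussian vector and remove its component along e.
        g = [rng.gauss(0.0, 1.0) for _ in e]
        proj = dot(g, e)
        residual = [gi - proj * ei for gi, ei in zip(g, e)]
        if dot(residual, residual) > 1e-12:
            break
    u = normalize(residual)
    # The result stays on the unit sphere and has cosine similarity cos(angle) with e.
    return [math.cos(angle) * ei + math.sin(angle) * ui for ei, ui in zip(e, u)]
```

Because the perturbed vector keeps a cosine similarity of exactly `cos(angle)` with the original, a small angle preserves most of the identity signal while randomizing whatever non-identity information the orthogonal components might carry.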

2605.30893 2026-06-01 cs.CV 版本更新

Foundation VAEs for 3D CT Reconstruction, Augmentation, and Generation

用于3D CT重建、增强和生成的基础VAE

Qi Chen, Shuhan Ding, Yu Gu, Nan Liu, Jiang Bian, Alan Yuille, Zongwei Zhou, Jingjing Fu

发表机构 * Department of Computer Science, Johns Hopkins University（约翰霍普金斯大学计算机科学系）； Duke-NUS Medical School（杜克-新加坡国立大学医学院）； Microsoft Research（微软研究院）

AI总结本文发现，在自然图像上预训练的基础VAE可直接用于CT重建、增强和生成，无需训练或微调，通过冻结编解码器实现解剖结构保留和噪声抑制，并在分割和生成任务上取得显著提升。

Comments ICML 2026 Accepted

AI中文摘要

变分自编码器（VAE）将高分辨率CT体积压缩为紧凑的潜在表示，同时保留临床相关结构。然而，从头训练或大量微调CT专用VAE会带来巨大的计算和工程成本，并且在异构扫描仪、协议和疾病下性能常会下降。本文通过一个关键观察向免训练的医学VAE迈出了渐进的一步：一个在自然图像和视频上大规模预训练的基础VAE可以作为CT重建、增强和生成的统一接口。在编码器和解码器均冻结的情况下，基础VAE重建CT体积时保留了解剖结构，同时抑制了采集噪声；在这些重建上训练分割模型，对于胰腺肿瘤和肺肿瘤，表面准确度平均提高了3.9% NSD。在相同的基础VAE潜在空间中，条件潜在扩散模型实现了平均FVD降低3.9%，CT CLIP分数提高36.2%，并在18种疾病的多疾病生成忠实度上提高了2.76% AUC。这些结果表明基础VAE可作为可扩展的CT表示重用和忠实CT生成的实用接口。我们的代码和演示可在 https://github.com/qic999/Foundation-VAE 获取。

英文摘要

Variational autoencoders (VAEs) compress high-resolution CT volumes into compact latents while preserving clinically relevant structure. However, training CT-specific VAEs from scratch or heavily fine-tuning them incurs substantial computational and engineering cost, and often degrades under heterogeneous scanners, protocols, and diseases. This paper makes a progressive stride toward training-free medical VAEs by leveraging a critical observation: a single Foundation VAE, pretrained at scale on natural images and videos, can serve as a unified interface for CT Reconstruction, Augmentation, and Generation. With both encoder and decoder frozen, the Foundation VAE reconstructs CT volumes with preserved anatomy while suppressing acquisition noise; training segmentation models on these reconstructions improves surface accuracy by 3.9% NSD on average for pancreatic and lung tumors. Within the same Foundation VAE latent space, a conditional latent diffusion model achieves 3.9% lower average FVD with 36.2% higher CT CLIP score, and improves multi-disease generation faithfulness across 18 disease types by 2.76% AUC. These results demonstrate Foundation VAEs as a practical interface for scalable CT representation reuse and faithful CT generation. Our code and demo are available at https://github.com/qic999/Foundation-VAE.

2605.30884 2026-06-01 cs.CV 版本更新

GUI-C$^2$: Coarse-to-Fine GUI Grounding via Difficulty-Aware Reinforcement Learning

GUI-C$^2$：基于难度感知强化学习的由粗到细GUI定位

Junlong Li, Chao Hao, Lap-Pui Chau, Yi Wang

发表机构 * The Hong Kong Polytechnic University（香港理工大学）

AI总结提出GUI-C$^2$框架，通过难度感知数据筛选和由粗到细的强化学习机制，解决GUI定位中训练样本难度不均和视觉区域裁剪权衡问题，实现最先进性能。

AI中文摘要

现有的用于GUI定位的智能体强化学习方法在数据层面和策略层面存在局限性。在数据层面，当前方法通常平等对待所有训练样本，尽管它们对基线模型的训练价值随难度而变化。忽视这一点会大大降低训练效率甚至导致崩溃。在策略层面，现有框架难以平衡裁剪较大区域以获取足够上下文和较小区域以减少冗余之间的权衡，这是工具增强定位代理固有的张力。此外，过于复杂的决策对于小参数模型来说难以处理，并显著增加推理时间。为了解决这些问题，在数据层面，我们提出了GUI-D，一个数据挖掘和难度评分流程，通过适当的测试识别值得训练的样本，并分配难度分数以指导后续训练权重。在策略层面，我们提出了GUI-C$^2$，它采用区域门控的由粗到细细化机制，通过模型内部不确定性信号逐步缩小视野，自适应地为大目标保留上下文，同时增强对小目标的精度，并通过改进感知的阶段奖励进行强化，确保每次细化真正提升定位。同时，我们简化了决策过程，大大减少了额外的推理时间。最后，大量实验表明，我们的方法达到了最先进的性能。代码和数据将公开。

英文摘要

Existing agentic reinforcement learning methods for GUI grounding have limitations at two levels. At the data level, current approaches typically treat all training samples equally, although their training value to the baseline model varies with difficulty. Overlooking this can greatly reduce training efficiency or even cause collapse. At the strategy level, existing frameworks struggle to balance the trade-off between cropping larger regions for sufficient context and smaller ones for reduced redundancy, a tension inherent to tool-augmented grounding agents. In addition, overly complex decision-making is difficult for small-parameter models and significantly increases inference time. To address these issues, at the data level, we propose GUI-D, a data mining and difficulty scoring pipeline that identifies the training-worthy samples by proper testing and assigns difficulty scores to guide subsequent training weights. At the strategy level, we propose GUI-C$^2$, which employs an area-gated coarse-to-fine refinement mechanism that progressively narrows the visual field via model-internal uncertainty signals, adaptively reserving context for large targets while amplifying precision for small ones, reinforced by improvement-aware stage rewards that ensure each refinement genuinely advances grounding. Meanwhile, we simplify the decision-making process to greatly reduce additional inference time. Finally, extensive experiments show that our method achieves state-of-the-art performance. The code and data will be publicly available.

2605.30863 2026-06-01 cs.CV cs.GR 版本更新

DSD-GS: Dynamic-Static Decomposition of Gaussian Splatting for Efficient and High-Fidelity Dynamic Scene Reconstruction

DSD-GS: 面向高效高保真动态场景重建的高斯泼溅动态-静态分解

Youngtae Han, Sung-hwan Han, Youngmin Yi

发表机构 * Department of Artificial Intelligence Engineering, Sogang University（西江大学人工智能工程系）

AI总结提出基于前馈高斯泼溅编码器和光流模型的动态-静态分解框架，通过消除静态区域冗余计算，在渲染质量、训练/渲染速度和存储效率上达到最优。

Comments 23 pages, 9 figures, 7 tables

AI中文摘要

动态场景重建和新视角合成是虚拟现实、机器人、数字孪生等下一代视觉智能应用的基础。然而，从任意视角对复杂时变场景进行高保真重建仍是一个重大挑战。现有的动态3DGS方法由于将所有高斯体建模为动态组件，存在计算效率低下的问题。虽然近期基于分解的方法试图解决这一问题，但仍面临重建质量下降和训练时间延长的问题。为缓解这些局限，我们提出一种新颖的动态重建框架，基于高效的静态-动态分解策略，使用前馈高斯泼溅编码器和光流模型。通过消除静态区域的冗余计算，我们的方法实现了最先进的性能，在渲染质量、训练和渲染速度以及存储效率上均优于现有基线。值得注意的是，在Neural 3D数据集上，我们的框架仅需10分钟训练，并在单张NVIDIA RTX 5090 GPU上以1352x1014分辨率实现了超过700 FPS的渲染速度。此外，我们的分解策略消除了COLMAP预处理的需求，并实现了确定性初始化，从而提高了效率和可重复性。

英文摘要

Dynamic scene reconstruction and novel view synthesis are fundamental to next-generation visual intelligence applications such as virtual reality, robotics, and digital twins. However, high-fidelity reconstruction of complex, time-varying scenes from arbitrary viewpoints remains a significant challenge. Existing dynamic 3DGS methods suffer from computational inefficiency, since they model all Gaussians as dynamic components. While recent decomposition-based approaches address this issue, they still struggle with degraded reconstruction quality and prolonged training time. To mitigate these limitations, we propose a novel dynamic reconstruction framework built upon an efficient static-dynamic decomposition strategy using a Feed-Forward Gaussian Splatting encoder and an optical flow model. By eliminating redundant computations on static regions, our method achieves state-of-the-art performance, outperforming existing baselines across rendering quality, training and rendering speed, and storage efficiency. Notably, on the Neural 3D dataset, our framework requires only 10 minutes for training and achieves a rendering speed of over 700 FPS on a single NVIDIA RTX 5090 GPU at resolution of 1352x1014. Furthermore, our decomposition strategy eliminates the need for COLMAP preprocessing and enables deterministic initialization, thereby enhancing both efficiency and reproducibility.

CameraNoise: 通过几何流引导的噪声扭曲实现视频扩散中的忠实相机控制

Haoyu Zhao, Jiaxi Gu, Haoran Chen, Qingping Zheng, Yeying Jin, Hongyi Yang, Junqi Cheng, Yuang Zhang, Zenghui Lu, Huan Yu, Jie Jiang, Peng Shu, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Fudan University（复旦大学）； Tencent（腾讯）； Xiamen University（厦门大学）

AI总结提出CameraNoise方法，通过几何流引导的噪声扭曲将相机运动编码为时间一致的随机表示，实现视频扩散中忠实且几何一致的相机控制。

Comments 28 pages, 16 figures

Journal ref: Proceedings of the Forty-third International Conference on Machine Learning (ICML), 2026

AI中文摘要

精确的相机姿态控制对于视频扩散至关重要，但保持几何一致性仍然是一个挑战。现有方法直接将数值相机参数注入扩散骨干网络，往往无法弥合抽象坐标与视觉内容之间的差距，导致结构失真。为解决这一问题，我们提出CameraNoise，一种流到噪声的扭曲方法，将相机运动编码为时间一致的随机表示。与传统的条件控制不同，CameraNoise将相机姿态直接嵌入噪声空间。这将在忠实保留轨迹动态的同时，将运动与场景外观解耦。具体来说，我们引入了一种新颖的几何引导重投影流和噪声扭曲算法，共同保持扩散的高斯先验，并确保在相机变换下噪声传播的一致性。通过将CameraNoise集成到扩散过程中，我们的框架能够生成稳定、高保真的视频。大量实验表明，我们的方法在视觉质量和轨迹忠实度方面均显著优于先前方法。项目页面和代码可在 https://gulucaptain.github.io/CameraNoise/ 获取。

英文摘要

Precise camera pose control is critical for video diffusion, yet maintaining geometric consistency remains a challenge. Existing methods that directly inject numerical camera parameters into the diffusion backbone often fail to bridge the gap between abstract coordinates and visual content, leading to structural distortions. To address this issue, we propose CameraNoise, a flow-to-noise warping method that encodes camera motion into a temporally coherent stochastic representation. Unlike conventional conditioning, CameraNoise embeds camera poses directly into the noise space. This decouples motion from scene appearance while faithfully preserving trajectory dynamics. Specifically, we introduce a novel Geometry-guided Reprojection Flow and a noise warping algorithm, which jointly preserve the Gaussian prior of diffusion and ensure consistent noise propagation under camera transformations. By integrating CameraNoise into the diffusion process, our framework delivers stable, high-fidelity videos. Extensive experiments demonstrate that our approach significantly outperforms prior methods in both visual quality and trajectory faithfulness. The project page and code are available at: https://gulucaptain.github.io/CameraNoise/.

2605.30769 2026-06-01 cs.CV cs.RO 版本更新

DisPlace: Discriminative Place Projections for Multi-Reference Visual Place Recognition

DisPlace: 面向多参考视觉地点识别的判别性地点投影

Dhyey Manish Rajani, Michael Milford, Tobias Fischer

发表机构 * QUT Centre for Robotics, School of Electrical Engineering and Robotics at the Queensland University of Technology（昆士兰理工大学机器人中心，电气工程与机器人学学院）

AI总结提出DisPlace框架，通过广义特征值问题融合多参考描述符，最大化地点间可分性并抑制地点内变化，提升视觉地点识别在多变条件下的鲁棒性。

Comments Under review

AI中文摘要

视觉地点识别（VPR）的一个关键挑战是在不同环境条件和视角下，将查询图像与参考地图进行匹配。虽然多次参考遍历提高了鲁棒性，但现有的融合策略要么统一聚合参考，要么依赖启发式选择，无法区分保持稳定地点身份的描述符变化与由变化条件或视角引起的变化。在本文中，我们提出DisPlace，一种多参考VPR框架，将多个参考描述符融合为单个紧凑且具有判别性的地点表示。DisPlace将描述符融合表述为一个广义特征值问题，该问题最大化地点间可分性，同时抑制跨参考的地点内变化，而不是保留整体描述符方差。与现有的多参考融合方法不同，DisPlace利用跨参考遍历的变化来识别哪些描述符维度的线性组合保留了地点身份，哪些捕捉了条件或视角特定的变化。我们在Oxford RobotCar、Nordland、Pittsburgh30k和Google Landmarks v2上，使用六种最先进的VPR描述符评估了DisPlace。在54种外观变化条件下，DisPlace在49种中优于七种多参考基线，在视角和非结构化设置下持续改进描述符级融合性能，并且在推理期间比所有比较的融合方法需要更少的存储空间。

英文摘要

A key challenge in Visual Place Recognition (VPR) is matching query images against reference maps captured under diverse environmental conditions and viewpoints. While multiple reference traversals improve robustness, existing fusion strategies either aggregate references uniformly or rely on heuristic selection, without distinguishing descriptor variations that preserve stable place identity from those caused by changing conditions or viewpoints. In this paper, we propose DisPlace, a multi-reference VPR framework that fuses multiple reference descriptors into a single compact and discriminative place representation. DisPlace formulates descriptor fusion as a generalized eigenvalue problem that maximizes between-place separability while suppressing within-place variation across references, rather than preserving overall descriptor variance. Unlike existing multi-reference fusion methods, DisPlace exploits variation across reference traversals to identify which linear combinations of descriptor dimensions preserve place identity and which capture condition- or viewpoint-specific variation. We evaluate DisPlace on Oxford RobotCar, Nordland, Pittsburgh30k, and Google Landmarks v2 across six state-of-the-art VPR descriptors. DisPlace outperforms seven multi-reference baselines in 49 out of 54 appearance-varying conditions, consistently improves descriptor-level fusion performance under viewpoint and unstructured settings, and requires less storage during inference than all compared fusion methods.
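
The abstract does not spell out the DisPlace objective, so the following is only a sketch of the standard discriminant-analysis form that matches its description (maximize between-place separability, suppress within-place variation). The notation is an assumption made here for illustration: $x_{p,r}$ is the descriptor of place $p$ in reference traversal $r$, $\mu_p$ is the mean descriptor of place $p$ over its $n_p$ references, and $\mu$ is the global mean.

```latex
\begin{aligned}
S_w &= \sum_{p} \sum_{r} (x_{p,r} - \mu_p)(x_{p,r} - \mu_p)^\top, \\
S_b &= \sum_{p} n_p \, (\mu_p - \mu)(\mu_p - \mu)^\top, \\
\max_{w} \; \frac{w^\top S_b \, w}{w^\top S_w \, w}
\quad &\Longrightarrow \quad S_b \, w = \lambda \, S_w \, w .
\end{aligned}
```

Stationary points of the ratio satisfy the generalized eigenvalue equation on the right, and the eigenvectors with the largest $\lambda$ give the projection directions. This contrasts with PCA-style fusion, which would keep directions of high overall variance regardless of whether that variance separates places.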

2605.30750 2026-06-01 cs.CV 版本更新

SLAP: The Semantic Least Action Principle for Variational Video-Language Modeling

SLAP: 用于变分视频-语言建模的语义最小作用原理

Xiang Fang, Wanlong Fang

发表机构 * School of Software Engineering, Huazhong University of Science and Technology（华中科技大学软件学院）； Nanyang Technological University, Singapore（新加坡南洋理工大学）

AI总结提出语义最小作用原理(SLAP)，将视频插值建模为黎曼流形上的边界值问题，通过离散欧拉-拉格朗日方程保持对象持久性，解决大视频语言模型中的时间间隙问题。

Comments Accepted by ICML 2026

AI中文摘要

在大视频语言模型（LVLMs）时代，稀疏帧采样的计算需求造成了根本性的“时间间隙”，使模型对关键的因果转换视而不见。现有的依赖于生成幻觉（如潜在扩散）或自回归外推的解决方案往往难以在长时间跨度内保持语义一致性，遭受对象消失和能量不稳定的问题。我们提出从概率生成到变分力学的范式转变，即语义最小作用原理（SLAP）。通过在经典力学和语义动力学之间建立严格的同构关系，我们将潜在视频轨迹建模为由语义拉格朗日量控制的黎曼流形上的路径。通过将插值任务表述为通过离散欧拉-拉格朗日方程求解的边界值问题（BVP），SLAP自然地强制对象持久性，而无需像素级渲染。大量实验证明了我们提出的SLAP的有效性。

英文摘要

In the era of Large Video-Language Models (LVLMs), the computational necessity of sparse frame sampling creates a fundamental "temporal gap", rendering models blind to critical causal transitions. Existing solutions relying on generative hallucination (e.g., latent diffusion) or autoregressive extrapolation often fail to maintain semantic consistency over long horizons, suffering from object vanishing and energetic instability. We propose a paradigm shift from probabilistic generation to variational mechanics with the Semantic Least Action Principle (SLAP). Drawing a rigorous isomorphism between classical mechanics and semantic dynamics, we model the latent video trajectory as a path on a Riemannian manifold governed by a Semantic Lagrangian. By formulating the interpolation task as a Boundary Value Problem (BVP) solved via the discrete Euler-Lagrange equations, SLAP naturally enforces object persistence without pixel-level rendering. Extensive experiments show the effectiveness of our proposed SLAP.
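
For readers unfamiliar with the discrete Euler-Lagrange equations mentioned above, here is their general form from discrete variational mechanics. The abstract does not define the Semantic Lagrangian, so $L_d$ below is a generic discrete Lagrangian and the worked case is a flat-space illustration chosen here, not the paper's model. The endpoints $q_0$ and $q_N$ are the fixed boundary frames, and $D_1$, $D_2$ denote derivatives with respect to the first and second argument.

```latex
\begin{aligned}
S_d(q_0, \dots, q_N) &= \sum_{k=0}^{N-1} L_d(q_k, q_{k+1}), \\
D_2 L_d(q_{k-1}, q_k) + D_1 L_d(q_k, q_{k+1}) &= 0, \qquad k = 1, \dots, N-1, \\[4pt]
\text{e.g. } L_d(a, b) = \frac{\lVert b - a \rVert^2}{2h}
\;&\Longrightarrow\; q_{k+1} - 2 q_k + q_{k-1} = 0 .
\end{aligned}
```

In this illustrative free-particle case the solution is straight-line interpolation between the two boundary states. A non-trivial Lagrangian on a curved latent manifold would bend that path, which is presumably where the semantic structure enters.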

2605.30745 2026-06-01 cs.CV 版本更新

Immuno-VLM: Immunizing Large Vision-Language Models via Generative Semantic Antibodies for Open-World Trustworthiness

Immuno-VLM：通过生成式语义抗体实现大型视觉-语言模型的开放世界可信赖性

Xiang Fang, Wanlong Fang, Wei Ji

发表机构 * School of Software Engineering, Huazhong University of Science and Technology（华中科技大学软件学院）； Nanyang Technological University, Singapore（新加坡南洋理工大学）； Nanjing University（南京大学）

AI总结针对大型视觉-语言模型在开放世界部署中因缺乏负面知识而将未知异常高置信度误分类为已知类别的“语义傲慢”问题，提出受生物免疫负选择启发的Immuno-VLM框架，利用大语言模型的生成推理主动产生“语义抗体”（近分布异常文本描述）来约束已知类决策空间，在ImageNet-1K和四个OOD基准上达到新最优。

Comments Accepted by ICML 2026

AI中文摘要

大型视觉-语言模型通过将视觉特征与广泛语义概念对齐，在零样本识别中取得了前所未有的成功。然而，这种语义抽象在开放世界部署中造成了一个关键漏洞：“语义傲慢”——由于缺乏显式的负面知识，模型会将未知异常高置信度地强行拟合到已知类别中。为了解决这个“开放世界可信赖性悖论”，我们提出了Immuno-VLM，一个受生物启发的框架，它将免疫负选择的生物学原理适应到高维潜在空间。与依赖被动密度估计或低效像素空间异常生成的传统开放集识别方法不同，Immuno-VLM利用大语言模型的生成推理能力主动“幻想”出“语义抗体”，即近分布异常（例如，相似物、上下文异常）的文本描述，这些描述有效地约束了已知类别的决策空间。在ImageNet-1K和四个具有挑战性的OOD基准上的大量实验表明，Immuno-VLM达到了新的最优水平。

英文摘要

Large Vision-Language Models have achieved unprecedented success in zero-shot recognition by aligning visual features with broad semantic concepts. However, this semantic abstraction creates a critical vulnerability in open-world deployment: the "Hubris of Semantics", where models force-fit unknown anomalies into known categories with high confidence due to the lack of explicit negative knowledge. To address this Open-World Trustworthiness Paradox, we propose Immuno-VLM, a bio-inspired framework that adapts the biological principle of Immunological Negative Selection to high-dimensional latent spaces. Departing from traditional Open-Set Recognition methods that rely on passive density estimation or inefficient pixel-space outlier generation, Immuno-VLM leverages the generative reasoning of Large Language Models to actively hallucinate "Semantic Antibodies", textual descriptions of near-distribution outliers (e.g., look-alikes, contextual anomalies) that effectively bound the decision space of known classes. Extensive experiments on ImageNet-1K and four challenging OOD benchmarks reveal that Immuno-VLM establishes a new state-of-the-art.

2605.30742 2026-06-01 cs.CV 版本更新

Annotations Are Not All You Need: A Cross-modal Knowledge Transfer Network for Unsupervised Temporal Sentence Grounding

注释并非全部所需：面向无监督时间语句定位的跨模态知识迁移网络

Xiang Fang, Daizong Liu, Wanlong Fang, Pan Zhou, Yu Cheng, Keke Tang, Kai Zou

发表机构 * Hubei Key Laboratory of Distributed System Security（湖北分布式系统安全重点实验室）； Hubei Engineering Research Center on Big Data Security（湖北大数据安全工程研究中心）； School of Cyber Science and Engineering（网络安全学院）； Huazhong University of Science and Technology（华中科技大学）； Peking University（北京大学）； Henan University（河南大学）； The Chinese University of Hong Kong（香港中文大学）； Guangzhou University（广州大学）； Protagolabs Inc.（Protagolabs公司）

AI总结提出跨模态知识迁移网络，通过从图像-名词和视频-动词任务中迁移实体感知和事件感知知识，实现无监督时间语句定位，无需配对视频-查询标注。

Comments Published in Findings of EMNLP 2023

AI中文摘要

本文研究时间语句定位（TSG）任务。尽管许多优秀工作在该重要课题上取得了显著成就，但它们严重依赖于大量昂贵的视频-查询配对标注，这在现实应用中需要大量人力收集。为此，本文针对更实际但更具挑战性的TSG设置：无监督时间语句定位，其中网络训练期间既没有配对视频-查询标注，也没有片段边界标注。考虑到其他跨模态任务提供了许多易于获取且廉价的标签，我们倾向于收集并将其简单的跨模态对齐知识迁移到我们的复杂场景中：1）首先从配对的图像-名词任务中探索实体感知的对象引导外观知识，并将其适应到每个独立视频帧；2）然后从配对的视频-动词任务中提取事件感知的动作表示，并通过新提出的复制-粘贴方法进一步将动作表示精炼为更实际但复杂的现实案例；3）通过将外观和动作知识调制并迁移到我们具有挑战性的无监督任务中，我们的模型可以直接利用这些通用知识来关联视频和查询，并在无需训练的情况下准确检索相关片段。在两个具有挑战性的数据集（ActivityNet Captions和Charades-STA）上的大量实验证明了我们的有效性，优于现有无监督方法，甚至与有监督方法竞争。

英文摘要

This paper addresses the task of temporal sentence grounding (TSG). Although many respectable works have made decent achievements in this important topic, they severely rely on massive expensive video-query paired annotations, which require a tremendous amount of human effort to collect in real-world applications. To this end, in this paper, we target a more practical but challenging TSG setting: unsupervised temporal sentence grounding, where both paired video-query and segment boundary annotations are unavailable during the network training. Considering that some other cross-modal tasks provide many easily available yet cheap labels, we tend to collect and transfer their simple cross-modal alignment knowledge into our complex scenarios: 1) We first explore the entity-aware object-guided appearance knowledge from the paired Image-Noun task, and adapt them into each independent video frame; 2) Then, we extract the event-aware action representation from the paired Video-Verb task, and further refine the action representation into more practical but complicated real-world cases by a newly proposed copy-paste approach; 3) By modulating and transferring both appearance and action knowledge into our challenging unsupervised task, our model can directly utilize this general knowledge to correlate videos and queries, and accurately retrieve the relevant segment without training. Extensive experiments on two challenging datasets (ActivityNet Captions and Charades-STA) show our effectiveness, outperforming existing unsupervised methods and even competitively beating supervised works.

2605.30734 2026-06-01 cs.LG cs.CV 版本更新

Beyond Accuracy: Evaluating Efficiency, Robustness and Explainability in Deep Learning for Malaria Diagnosis

超越准确率：评估深度学习在疟疾诊断中的效率、鲁棒性和可解释性

Olivier Kanamugire, Kerol Djoumessi

发表机构 * African Institute for Mathematical Sciences（非洲数学科学研究所）； Hertie Institute for AI in Brain Health（赫尔蒂脑健康人工智能研究所）

AI总结本研究在NLM-Malaria数据集上基准测试四种深度学习模型，联合评估预测性能、鲁棒性和事后可解释性，发现轻量级模型在性能上与重型模型相当，但可解释性在图像损坏下脆弱。

Comments Under review

AI中文摘要

疟疾仍然是撒哈拉以南非洲地区的主要死亡原因，该地区诊断基础设施匮乏，使得及时准确的诊断尤其具有挑战性。虽然深度学习为自动化疟疾筛查提供了一条有前景的途径，但临床采用受到计算成本和决策不透明性的阻碍。本研究在NLM-Malaria数据集上基准测试了四种涵盖广泛设计架构和模型容量的深度学习模型，联合评估了预测性能、鲁棒性和事后可解释性。我们发现，轻量级、高效设计的模型在预测性能上与更重的模型相当，Friedman检验确认无统计显著差异。基于CAM的XAI方法一致地定位诊断相关区域，而细粒度归因方法产生的解释针对性较弱，尤其是在使用更重的骨干网络时。在三种图像损坏下的鲁棒性评估进一步揭示，模型置信度下降速度快于准确率，为人工审核提供了实用信号。然而，没有一种XAI方法对损坏具有鲁棒性，即使在预测仍然准确的情况下，解释可靠性也会在临床实践中可能出现的噪声水平下降。这些发现支持在资源受限环境中部署轻量级架构用于疟疾诊断，同时强调事后解释的脆弱性，这是负责任临床部署的重要考虑因素。

英文摘要

Malaria remains a leading cause of mortality in sub-Saharan Africa, where scarce diagnostic infrastructure makes timely, accurate diagnosis particularly challenging. While deep learning offers a compelling path toward automated malaria screening, clinical adoption is hindered by computational cost and opacity in decision-making. This work benchmarks four deep learning models spanning a wide range of design architectures and model capacities on the NLM-Malaria dataset, jointly evaluating predictive performance, robustness, and post-hoc explainability. We find that lightweight, efficient-by-design models match their heavier counterparts in predictive performance, and the Friedman test confirms no statistically significant performance differences. CAM-based XAI methods consistently localize diagnostically relevant regions, while fine-grained attribution methods produce less targeted explanations, particularly with heavier backbones. Robustness evaluation under three types of image corruption further reveals that model confidence degrades faster than accuracy, providing a practical signal for human review. However, no XAI method is robust to corruption, with explanation reliability degrading at noise levels plausible in clinical practice, even when predictions remain accurate. These findings support the deployment of lightweight architectures for malaria diagnosis in resource-constrained settings, while highlighting the vulnerability of post-hoc explanations as an important consideration for responsible clinical deployment.

2605.30716 2026-06-01 cs.CV cs.AI 版本更新

Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation

用于病例级病理学概要报告生成的简单令牌高效视觉语言模型

Zhiyuan Yang, Jiahao Cheng, Vincent Quoc-Huy Trinh, Mahdi S. Hosseini

发表机构 * Department of Computer Science and Software Engineering (CSSE), Concordia University, Montreal, Canada（计算机科学与软件工程系（CSSE），康科迪亚大学，蒙特利尔，加拿大）； Axe Cancer, Centre de recherche du CHUM, Université de Montréal, Montreal, Canada（Axe癌症，CHUM研究中心，蒙特利尔大学，蒙特利尔，加拿大）； Institut de recherche en immunologie et cancérologie (IRIC), Université de Montréal（免疫学与癌症研究所（IRIC），蒙特利尔大学）； Mila - Quebec AI Institute, Montreal, Canada（魁北克AI研究所（Mila），蒙特利尔，加拿大）

AI总结提出一种简单令牌高效的视觉语言模型，通过5倍放大率的512×512补丁和两阶段监督训练，在有限GPU内存下实现病例级多WSI病理报告生成，显著降低序列长度并提升效率。

Comments Accepted by the DeLTA 2026 conference

AI中文摘要

从全切片图像（WSI）生成临床有用的病理报告具有挑战性，原因在于十亿像素分辨率、长视觉令牌序列以及病例级推理的复杂性（单个病例可能包含多个具有异质性组织和模糊发现的WSI）。我们提出了一种简单的令牌高效视觉语言模型，用于病例级概要报告生成，在受限GPU内存下保持实用性。我们的架构遵循最小的三组件设计：冻结的病理补丁编码器、轻量级两层MLP视觉语言对齐器和大语言模型解码器，并带有显式的WSI标记令牌以分隔病例内的切片。训练分两个监督阶段进行：（1）仅对齐器的WSI字幕生成，使用异质WSI-文本对；（2）病例级监督微调，基于病例-报告对进行结构化报告生成。为了减少序列长度，我们使用5倍放大率下的$512 \times 512$补丁表示每个切片，与常用的20倍补丁相比，平均序列长度减少高达64倍。结合高效训练技术，我们仅用半块NVIDIA H100 GPU即可实现实际训练。在两个训练阶段中，我们的方法在ROUGE-L/METEOR/BLEU-4上取得了高分，同时在内存和运行时间上显著更高效。在基于AI的评估中，我们的模型始终优于强基线。大量消融实验表征了性能-效率权衡，并确定了在多WSI设置中提高鲁棒性的简单选择。总体而言，这项工作为高效病理报告生成提供了一个强大且可复现的基线，降低了在有限计算资源下进行多WSI VLM研究的门槛。

英文摘要

Generating clinically useful pathology reports from whole-slide images (WSIs) is challenging due to gigapixel resolution, long visual-token sequences, and the complexity of case-level reasoning, where a single case may contain multiple WSIs with heterogeneous tissues and ambiguous findings. We present a simple token-efficient vision-language model for case-level synoptic report generation that remains practical under constrained GPU memory. Our architecture follows a minimal three-component design: a frozen pathology patch encoder, a lightweight two-layer MLP vision-language aligner, and a large language model decoder, with an explicit WSI marker token to separate slides within a case. Training proceeds in two supervised stages: (1) aligner-only WSI captioning using heterogeneous WSI-text pairs, and (2) case-level supervised fine-tuning on case-report pairs for structured report generation. To reduce sequence length, we represent each slide using $512 \times 512$ patches at $5\times$ magnification, which reduces the average sequence length by up to $64\times$ compared to the commonly used $20\times$ patches. Combined with efficient training techniques, we enable practical training with only half an NVIDIA H100 GPU. Across both training stages, our approach achieves high ROUGE-L/METEOR/BLEU-4 scores while being substantially more efficient in memory and runtime. In AI-based evaluations, our model is consistently preferred over strong baselines. Extensive ablations characterize performance-efficiency trade-offs and identify simple choices that improve robustness in multi-WSI settings. Overall, this work provides a strong, reproducible baseline for efficient pathology report generation, lowering the barrier to multi-WSI VLM research under limited compute.
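
The sequence-length reduction quoted above can be sanity-checked with simple arithmetic. The sketch below assumes one token per patch and a baseline of 256×256 patches at 20× magnification. The baseline patch size is an assumption made here (the abstract only says "commonly used 20× patches"), chosen because it is one configuration that yields the stated upper bound of 64×.

```python
def relative_patch_count(magnification, patch_size):
    """Patches needed to tile a fixed tissue area, up to a constant factor.

    Pixels per side scale linearly with magnification, so patches per side
    scale with magnification / patch_size, and the total with its square.
    """
    per_side = magnification / patch_size
    return per_side ** 2


# Setting from the abstract: 512x512 patches at 5x magnification.
proposed = relative_patch_count(magnification=5, patch_size=512)

# ASSUMED baseline (not stated in the abstract): 256x256 patches at 20x.
baseline_256 = relative_patch_count(magnification=20, patch_size=256)

# Alternative baseline with the same 512x512 patch size at 20x.
baseline_512 = relative_patch_count(magnification=20, patch_size=512)

reduction_vs_256 = baseline_256 / proposed
reduction_vs_512 = baseline_512 / proposed
print(reduction_vs_256, reduction_vs_512)
```

Under these assumptions the magnification change alone accounts for a 16× reduction, and doubling the patch side accounts for the remaining factor of 4.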

2605.30714 2026-06-01 cs.CV 版本更新

Vision-Based Localization in Dense Urban Environments: A Case Study of an Urban Village in China

基于视觉的密集城市环境定位：中国城中村案例研究

Menglin Wu, Rui Cao

发表机构 * Thrust of Urban Governance and Design, Society Hub, The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州）社会枢纽城市治理与设计学域）

AI总结针对密集城市环境中GPS信号不可靠和地图数据不完整的问题，提出一种基于双摄像头系统的低成本视觉地理定位方法，并在广州石牌村数据集上评估现有模型性能。

AI中文摘要

城中村是快速城市化过程中出现的广泛非正规住区，现已成为中国大城市中农民工的主要居住中心。这些区域建筑密集，常导致GPS信号不可靠，而不完整的地图数据进一步影响精确路线规划和导航。这些问题不仅阻碍日常出行，还对应急响应构成重大挑战，因为混乱的道路布局和GPS不准确可能使疏散工作复杂化。为应对这些挑战，我们提出了一种针对密集城市环境的实用视觉地理定位解决方案。我们的方法采用低成本的数据采集流程，利用双摄像头系统（包括全景相机和智能手机相机）捕获同步的360度全景图和查询图像。以广州著名的密集城中村石牌村为案例，我们开发了专门的图像地理定位数据集。然后，我们评估并比较了现有模型在不同场景类型下的性能，以识别其优缺点。研究结果展示了基于视觉的定位在密集城中村环境中的潜力和局限性。我们的框架旨在改善GPS覆盖较差区域的步行导航、最后一公里配送和应急管理，最终支持这些非正规住区中的弱势群体。

英文摘要

Urban villages, the widespread informal settlements which have emerged as a result of rapid urbanization, are now major residential hubs for migrant workers in large cities in China. The dense arrangement of buildings in these areas often leads to unreliable GPS signals, while incomplete mapping data further impairs accurate route planning and navigation. These issues not only hinder everyday mobility but also pose significant challenges for emergency response, as confusing road layouts and GPS inaccuracies can complicate evacuation efforts. To address these challenges, we propose a practical vision-based geo-localization solution tailored for dense urban environments. Our approach features a low-cost data collection pipeline utilizing a dual-camera system, comprising a panoramic camera and a smartphone camera, to capture synchronized 360-degree panoramas and query images. Using Shipai Village, a well-known densely populated urban village in Guangzhou, as a case study, we develop a specialized image geo-localization dataset. We then assess and compare the performance of existing models across various scene types to identify their strengths and weaknesses. The findings demonstrate both the potential and limitations of visual-based localization in dense urban-village environments. Our framework aims to enhance pedestrian navigation, last-mile delivery, and emergency management in areas with poor GPS coverage, ultimately supporting the vulnerable populations living within these informal settlements.

2605.30713 2026-06-01 cs.LG cs.CV cs.MM 版本更新

Diversity Matters: Revisiting Test-Time Compute in Vision-Language Models

多样性至关重要：重新审视视觉-语言模型中的测试时计算

Yijie Tong, Yifan Hou, Shaobo Cui, Antoine Bosselut, Mrinmaya Sachan

发表机构 * ETH Zürich（苏黎世联邦理工学院）； Shanghai Jiao Tong University（上海交通大学）； EPFL（洛桑联邦理工学院）

AI总结针对视觉-语言模型（VLM）中测试时计算（TTC）策略应用不足的问题，提出基于预测熵的ETTC方法，通过利用模型间的置信度差异提升集成性能，理论证明并实验验证其优于多数投票和最佳单模型。

Comments ICML 2026

AI中文摘要

测试时计算（TTC）策略已成为提升大型语言模型（LLM）推理能力的一种轻量级方法。然而，它们在视觉-语言模型（VLM）中的应用和益处尚未得到充分探索。我们对七个VLM和六个基准进行了TTC的系统研究，特别分析了基于特征的评分和多数投票方法。我们发现特征启发式方法失败，而投票在单模型设置中仅带来微小提升。我们从理论上证明，这种局限性源于缺乏预测多样性：当输出高度相关时，投票收益甚微。相比之下，多模型集成提供了更丰富的多样性，但标准的多数投票未能考虑不同模型的能力差异。为解决这一问题，我们提出了基于熵的TTC（ETTC），它根据预测熵选择最自信的预测。在单模型情况下，我们的方法退化为多数投票，但在模型集成中，它利用置信度差异优先考虑更强的模型。我们证明，在温和假设下ETTC优于多数投票，并通过实验表明它始终优于投票和最佳个体模型。关键在于，我们的结果表明，较小的模型可以协同增强较大的模型，释放出标准策略无法实现的集成增益。

英文摘要

Test-time compute (TTC) strategies have emerged as a lightweight approach to boost reasoning in large language models (LLMs). However, their application and benefits for vision-language models (VLMs) remain underexplored. We present a systematic study of TTC across seven VLMs and six benchmarks, specifically analyzing feature-based scoring and majority voting methods. We find that feature heuristics fail and voting yields only modest gains in single-model settings. We theoretically show that this limitation stems from a lack of prediction diversity: when outputs are highly correlated, voting provides little benefit. In contrast, multi-model ensembles offer richer diversity, yet standard majority voting fails to account for varying model capabilities. To address this, we propose Entropy-based TTC (ETTC), which selects the most confident prediction based on predictive entropy. Our method reduces to majority voting in the single-model case, but in model ensembles, it leverages confidence disparities to prioritize stronger models. We prove that ETTC outperforms majority voting under mild assumptions and empirically demonstrate that it consistently surpasses both voting and the best individual model. Crucially, our results show that smaller models can synergistically enhance larger ones, unlocking ensembling gains not achievable with standard strategies.
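
To make the contrast between pooled majority voting and entropy-based selection concrete, here is a minimal sketch. It assumes each model is sampled several times on the same question and that confidence is measured by the Shannon entropy of that model's empirical answer distribution. The abstract does not give the exact ETTC rule, so the function names, the sampling setup, and the tie-breaking are illustrative assumptions, not the authors' implementation.

```python
import math
from collections import Counter


def answer_entropy(samples):
    """Shannon entropy (in nats) of the empirical answer distribution."""
    counts = Counter(samples)
    total = len(samples)
    return -sum((c / total) * math.log(c / total) for c in counts.values())


def majority_vote(samples):
    """Most frequent answer; ties go to the answer that appeared first."""
    return Counter(samples).most_common(1)[0][0]


def pooled_majority_vote(samples_by_model):
    """Standard voting: pool every sample from every model, then vote."""
    pooled = [a for samples in samples_by_model.values() for a in samples]
    return majority_vote(pooled)


def entropy_select(samples_by_model):
    """Hypothetical entropy-based rule: trust the lowest-entropy model."""
    most_confident = min(
        samples_by_model, key=lambda m: answer_entropy(samples_by_model[m])
    )
    return majority_vote(samples_by_model[most_confident])


# Toy data: one small model is fully consistent, two larger ones are not.
samples = {
    "small_vlm": ["B", "B", "B", "B"],
    "large_vlm": ["A", "A", "A", "C"],
    "mid_vlm": ["A", "A", "C", "D"],
}
print(pooled_majority_vote(samples), entropy_select(samples))
```

With a single model there is only one candidate to select, so this rule returns that model's majority answer, which matches the abstract's remark that the method reduces to majority voting in the single-model case.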

2605.30700 2026-06-01 cs.CV cs.LG 版本更新

Mathematical Morphology in Machine Learning

机器学习中的数学形态学

Erick Oliveira Rodrigues, Aura Conci

发表机构 * Universidade Federal Fluminense（弗鲁米嫩塞联邦大学）

AI总结将数学形态学引入机器学习，提出基于形态学重建的快速聚类算法和一种结合闵可夫斯基与切比雪夫距离的新型距离度量，并设计新型形态学分类器以建模形状、密度和分形信息。

Journal ref: SIBGRAPI 2018

AI中文摘要

本工作将数学形态学——一种成熟的视觉计算理论——引入机器学习，以利用标准技术常忽视的形状和密度方面。我们提出了一种基于形态学重建的快速聚类算法，该算法能精确保留聚类形状和密度。该方案具有独特特性：内在的最大聚类感知、无成本的噪声去除以及由结构元素控制的多样化增长模式。此外，我们提出了一种结合闵可夫斯基距离和切比雪夫距离的新型距离度量，对于形态学膨胀非常高效。在 $Z^2$ 离散邻域迭代中，它比曼哈顿距离快约1.3倍，比欧几里得距离快约329.5倍。当使用k近邻（k-NN）分类器在33个UCI数据集上与其他14种距离度量进行评估时，我们的度量在大多数情况下（33例中的26例）达到了高于平均的准确率，并在9个案例中取得了最佳整体准确率。最后，我们引入了新型形态学分类器。与现有文献不同，本方案独特地对数据集中的形状、密度和分形信息进行建模。

英文摘要

This work introduces mathematical morphology, an established visual computing theory, into machine learning to exploit shape and density aspects often overlooked by standard techniques. We propose a fast clustering algorithm based on morphological reconstruction that accurately preserves cluster shapes and density. This scheme offers unique features: an intrinsic sense of maximal clusters, cost-free noise removal, and diverse growth patterns controlled by structuring elements. Additionally, we propose a novel distance metric combining Minkowski and Chebyshev distances, highly efficient for morphological dilations. In $Z^2$ discrete neighbourhood iterations, it is roughly 1.3 times faster than Manhattan and 329.5 times faster than Euclidean distances. When evaluated using a k-Nearest Neighbours (k-NN) classifier across 33 UCI datasets against 14 other distances, our metric achieved above-average accuracies most frequently (26 of 33 cases) and the best overall accuracy in 9 cases. Finally, we introduce novel morphological classifiers. Unlike current literature, this proposal uniquely models shape, density, and fractal information in datasets.
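
The remark about growth patterns controlled by structuring elements can be illustrated with plain binary dilation on $Z^2$. The sketch below is standard textbook morphology, not the paper's clustering algorithm or its new distance metric (neither is specified in the abstract). It only shows that repeated dilation with a 3×3 square grows Chebyshev balls, while a 5-point cross grows Manhattan balls.

```python
# Structuring elements as lists of integer offsets.
SQUARE = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)]  # Chebyshev unit ball
CROSS = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]             # Manhattan unit ball


def dilate(points, structuring_element):
    """Binary dilation of a point set in Z^2 by a structuring element."""
    return {
        (x + dx, y + dy)
        for (x, y) in points
        for (dx, dy) in structuring_element
    }


def grow(seed, structuring_element, steps):
    """Apply dilation repeatedly, starting from the seed points."""
    region = set(seed)
    for _ in range(steps):
        region = dilate(region, structuring_element)
    return region


square_region = grow({(0, 0)}, SQUARE, steps=2)
cross_region = grow({(0, 0)}, CROSS, steps=2)
print(len(square_region), len(cross_region))
```

After two steps the square element covers every point with max(|x|, |y|) ≤ 2 and the cross covers every point with |x| + |y| ≤ 2, so the choice of structuring element fixes the implied distance and therefore the shape in which a region grows.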

2605.30699 2026-06-01 cs.LG cs.CV 版本更新

A Context-Aware Middleware for Medical Image Based Reports: An approach based on image feature extraction and association rules

基于医学图像报告的情境感知中间件：一种基于图像特征提取和关联规则的方法

Erick O. Rodrigues, Jose Viterbo, Aura Conci, Trueman Mac Henry

发表机构 * Department of Computer Science（计算机科学系）； Department of Mathematics & Statistics（数学与统计学系）； Universidade Federal Fluminense（弗鲁米嫩塞联邦大学）； York University（约克大学）

AI总结提出一种情境感知中间件，通过图像特征提取和关联规则，自动将医学图像分派给最合适的医疗人员，以提高医疗工作流程效率。

DOI: 10.1109/AICCSA.2015.7507147
Journal ref: 2015 IEEE/ACS 12th International Conference of Computer Systems and Applications (AICCSA)

AI中文摘要

本工作提出了一种用于医疗工作流程组织和效率提升的情境感知中间件。在医院、实验室和远程放射学公司中，每位医生或技术人员都专注于特定类型的诊断或分析。因此，某些类型的医学图像通常会被转发给特定的医生或特定群体。这种转发非常耗时。也就是说，反复决定谁是最合适的医生，以及他在特定情境下是否可用，既繁琐又可能非常低效。因此，所提出的中间件能够处理并收集每位医疗人员所分析图像的数据。基于收集的数据和当前临床情境，中间件能够推断出谁是最适合接收特定传入医学图像的人员。

英文摘要

This work proposes a context-aware middleware for medical workflow organization and efficiency improvement. In hospitals, laboratories and teleradiology companies, each physician or technician is specialized in a specific kind of diagnosis or analysis. Therefore, certain types of medical images are often forwarded to a certain physician or a certain group. This forwarding is time-consuming: repeatedly deciding who would be the best physician, and whether that physician is available at a given moment in a given context, is exhausting and may be very inefficient. Thus, the proposed middleware has the ability to process and collect data from images analyzed by each staff member. Based on the collected data and the current clinical context, the middleware is able to infer which staff member would be the best fit to receive a given incoming medical image.

2605.30698 2026-06-01 cs.CV cs.AI cs.MA 版本更新

Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

先见后议：用视觉证据对齐多智能体共识

Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma, Yuanzi Li, Zhengren Wang, Yinglong Yang, Yufei Chen, Yikang Wang, Shaoxu Sun, Wentao Zhang

发表机构 * Peking University（北京大学）； Shanghai Jiao Tong University（上海交通大学）； Nanyang Technological University（南洋理工大学）； Renmin University of China（中国人民大学）； Shandong University（山东大学）

AI总结提出EAGLE框架，通过显式暴露各智能体的视觉证据区域并相互验证，实现无需训练的多智能体视觉问答协作，提升共识可靠性。

AI中文摘要

视觉语言模型（VLM）在视觉问答（VQA）上取得了强劲性能。为了减轻个体幻觉和盲点，通过多智能体协作聚合不同视角已成为一种有前景的范式。虽然这种方法在文本问答中取得了巨大成功，但其在多模态领域的潜力仍未充分探索。现有的多智能体VQA方法主要采用以文本为中心的协议，专注于文本讨论而忽略视觉信息的对齐。在这项工作中，我们揭示了一个关键见解：答案级别的共识对于可靠的多智能体VQA是不够的；对齐的视觉证据（智能体所依赖的图像区域的共享支持）对于可信的共识至关重要。为了利用这一见解，我们提出了EAGLE（Evidence-Aligned Grounded muLti-agent rEasoning），一个无需训练的以证据为中心的框架，用于协调多个VLM智能体。EAGLE显式暴露每个智能体的定位区域作为视觉证据，允许对证据进行相互验证，并使用证据一致性指导最终决策。在六个VQA基准上的实验表明，EAGLE在跨领域实现了最佳平均性能，同时保持轻量、可解释且易于部署。

英文摘要

Vision-language models (VLMs) have achieved strong performance on visual question answering (VQA). To mitigate individual hallucinations and blind spots, aggregating diverse perspectives via multi-agent collaboration has emerged as a promising paradigm. While this approach has shown great success in textual QA, its potential in the multimodal domain remains under-explored. Existing multi-agent VQA methods predominantly adapt text-centric protocols, focusing on textual discussions while ignoring the alignment of visual information. In this work, we reveal a key insight: answer-level agreement is insufficient for reliable multi-agent VQA; aligned visual evidence (shared support from the image regions agents rely on) is essential for trustworthy consensus. To leverage this insight, we propose EAGLE (Evidence-Aligned Grounded muLti-agent rEasoning), a training-free evidence-centered framework for coordinating multiple VLM agents. EAGLE explicitly exposes each agent's grounding regions as visual evidence, enables mutual verification over the evidence, and uses evidence consistency to guide final decision-making. Experiments on six VQA benchmarks show that EAGLE achieves best average performance across domains while remaining lightweight, interpretable, and practical for deployment.
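
The gap between answer-level agreement and evidence-level agreement can be shown with a toy scoring rule. The sketch below assumes each agent returns an answer plus one bounding box `(x1, y1, x2, y2)` and scores an answer by the summed pairwise IoU of its supporters' boxes. This rule, the function names, and the data are hypothetical illustrations of evidence consistency, not EAGLE's actual protocol, which the abstract does not detail.

```python
from collections import Counter
from itertools import combinations


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def answer_level_vote(proposals):
    """Baseline: ignore the boxes and pick the most common answer."""
    return Counter(answer for answer, _ in proposals).most_common(1)[0][0]


def evidence_aligned_answer(proposals):
    """Hypothetical rule: prefer the answer whose supporters point at the same region."""
    boxes_by_answer = {}
    for answer, box in proposals:
        boxes_by_answer.setdefault(answer, []).append(box)

    def consistency(answer):
        pairs = combinations(boxes_by_answer[answer], 2)
        return sum(iou(a, b) for a, b in pairs)

    return max(boxes_by_answer, key=consistency)


# Three agents say "cat" but look at unrelated regions;
# two agents say "dog" and ground it in nearly the same region.
proposals = [
    ("cat", (0, 0, 10, 10)),
    ("cat", (50, 50, 60, 60)),
    ("cat", (80, 0, 90, 10)),
    ("dog", (20, 20, 40, 40)),
    ("dog", (22, 22, 42, 42)),
]
print(answer_level_vote(proposals), evidence_aligned_answer(proposals))
```

In this toy case the majority answer has no shared visual support, so an evidence-aware rule overrides it. A practical system would also need a fallback for the case where no answer has overlapping evidence.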

2605.30689 2026-06-01 cs.CV cs.AI 版本更新

ConTrans: Learning Text-enhanced Local-global Temporal Representations for Zero-shot Temporal Action Localization

ConTrans：学习文本增强的局部-全局时间表示用于零样本时间动作定位

Kanchan Keisham, Thenukan Pathmanathan, Thangarajah Akilan

发表机构 * Vellore Institute of Technology, India（维洛雷理工学院，印度）； Lakehead University, Canada（拉克希德大学，加拿大）

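AI summary: Targets two weaknesses of zero-shot temporal action localization, neglected local correlations and limited feature representation capacity, with ConTrans, a multi-scale encoder that combines convolutional inductive biases with Transformer self-attention to jointly capture fine-grained local dependencies and long-range global context; it significantly outperforms existing methods on ActivityNet-1.3 and THUMOS14.

Comments 4 figures, 8 tables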

AI中文摘要

零样本时间动作定位（ZS-TAL）旨在检测和定位未修剪视频中未见过的动作。然而，现有方法主要关注建模长程上下文信息，常常忽略了视频帧之间基于相对偏移的关键局部相关性。此外，由于网络架构的浅层性，其特征表示能力受限，阻碍了性能提升。在本文中，我们通过引入一种新颖的局部-全局多尺度特征表示模块来解决这些局限性。我们提出了一种新颖的多尺度编码器架构，称为ConTrans，它将卷积（Conv）归纳偏置与Transformer自注意力相结合，以共同捕获细粒度的局部依赖和长程全局上下文，从而比现有方法获得更全面的特征表示。在ActivityNet-1.3和THUMOS14数据集上的实验评估表明，ConTrans显著优于现有方法，为ZS-TAL建立了新的基准。

英文摘要

Zero-shot Temporal Action Localization (ZS-TAL) aims to detect and locate previously unseen actions in untrimmed videos. However, existing approaches primarily focus on modeling long-range contextual information, often neglecting the critical relative-offset-based local correlations between video frames. Furthermore, their performance is hindered by limited feature representation capabilities due to the shallow nature of their network architectures. In this paper, we address these limitations by introducing a novel local-global multi-scale feature representation module. We propose a novel multi-scale encoder architecture, termed ConTrans, that integrates convolutional (Conv) inductive biases with transformer Self-attention to jointly capture fine-grained local dependencies and long-range global context, leading to more comprehensive feature representations than existing methods. Experimental evaluations on the ActivityNet-1.3 and THUMOS14 datasets demonstrate that ConTrans significantly outperforms existing methods, establishing a new benchmark for ZS-TAL.

2605.30671 2026-06-01 cs.CV cs.RO 版本更新

WristCompass: Kinematic Coupling as a Learnable Visual Concept for Ego-Camera Orientation

WristCompass: 运动耦合作为可学习的视觉概念用于自我相机朝向估计

Varun Nair, Vidyut Baradwaj, Jiahang He, Anya Singh, Jai Relan, Cabrel Happi

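AI summary: Proposes WristCompass, which treats the kinematic coupling between wrist motion and camera orientation as a visual concept and recovers ego-camera orientation from manipulation video using compact 4D features and GRU temporal modeling; it transfers zero-shot to kitchen video and approaches the performance of a 1B-parameter scene model.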

AI中文摘要

从操作视频中恢复自我相机朝向是从自我中心演示中分离手部运动与相机运动的前提，这是模仿学习的关键步骤。从场景几何推断朝向的常规方法在手部遮挡框架时失效：VGGT，一个1B参数的场景重建模型，在TACO基准测试上的表现甚至不如常数预测器。我们识别出一个替代的视觉概念，它恰好出现在场景几何缺失时：运动耦合动力学，即由手臂-肩-头链施加的手腕运动与相机朝向之间的结构化物理关系。我们发现这个概念是紧凑的（4D手腕间特征优于126D全手关键点）、时序的（需要短窗口上的GRU而非逐帧检索）和物理基础的（由于根植于解剖学而非场景外观，因此可零样本跨数据集迁移）。仅在桌面操作上训练的WristCompass，零样本迁移至Epic Kitchens烹饪视频，实现了14.3°的中位测地误差，并以200K GRU参数接近1B参数场景模型的性能。

英文摘要

Recovering ego-camera orientation from manipulation video is a prerequisite for disentangling hand motion from camera motion, a key step in imitation learning from egocentric demonstrations. The obvious approach, inferring orientation from scene geometry, fails when hands occlude the frame: VGGT, a 1B-parameter scene reconstruction model, scores worse than a constant predictor on the TACO benchmark. We identify an alternative visual concept that is present precisely when scene geometry is absent: kinematic coupling dynamics, the structured physical relationship between wrist motion and camera orientation imposed by the arm-shoulder-head chain. We find that this concept is compact (4D inter-wrist features outperform 126D full hand keypoints), temporal (requiring a GRU over short windows rather than per-frame retrieval), and physically grounded (transferring zero-shot across datasets because it is rooted in anatomy rather than scene appearance). Trained only on tabletop manipulation, WristCompass transfers zero-shot to Epic Kitchens cooking video, achieving 14.3$^\circ$ median geodesic error and approaching the performance of a 1B-parameter scene model at 200K GRU parameters.
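
The reported metric, median geodesic error, is the median over samples of the rotation angle between predicted and ground-truth orientations, θ = arccos((tr(R_pred^T R_true) − 1) / 2). The Python sketch below computes it for 3x3 rotation matrices stored as nested lists. The helper `rot_z` and the function names are illustrative and not taken from the paper's code.

```python
import math
from statistics import median


def rot_z(deg):
    """Rotation matrix for a rotation of `deg` degrees about the z-axis."""
    a = math.radians(deg)
    c, s = math.cos(a), math.sin(a)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def geodesic_deg(r_pred, r_true):
    """Angle in degrees of the relative rotation r_pred^T r_true."""
    # trace(A^T B) equals the sum of elementwise products of A and B.
    trace = sum(r_pred[i][j] * r_true[i][j] for i in range(3) for j in range(3))
    # Clamp to [-1, 1] so rounding error cannot push acos out of its domain.
    cos_theta = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_theta))


def median_geodesic_deg(preds, truths):
    """Median geodesic error over paired predicted and true rotations."""
    return median(geodesic_deg(p, t) for p, t in zip(preds, truths))
```

Because the angle depends only on the relative rotation, the error is invariant to the choice of world frame.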

2605.30639 2026-06-01 cs.CV cs.AI cs.RO 版本更新

PInVerify: An Offline Embodied Benchmark for Active Instance Verification

PInVerify：面向主动实例验证的离线具身基准

Yuhang Jiang

发表机构 * University of Trento（特伦托大学）

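AI summary: Introduces the Active Instance Verification task and PInVerify, an offline embodied benchmark that evaluates embodied agents through multi-view navigation and fine-grained attribute matching, with baselines built on multimodal large language models.

Comments Accepted as a poster at the Foundation Models Meet Embodied Agents (FMEA) Workshop, CVPR 2026. 44 pages including appendix. Code: https://github.com/Avalon-S/PInVerify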

AI中文摘要

具身智能体在导航到目标物体方面取得了显著进展，但到达目标附近并不能保证智能体找到了正确的实例：微妙的属性差异（例如“白色花卉”与“白色条纹”）通常需要近距离、多视角检查。我们通过主动实例验证（AIV）来解决这一差距，该任务要求智能体主动围绕候选对象选择视角，以判断其是否匹配细粒度的自然语言描述。我们将AIV形式化为一个有限视野决策过程，并引入PInVerify，一个用于AIV的离线具身基准：包含18个物体类别的3000个评估场景，以多视角捕获形式提供，并采用6扇区导航拓扑，暴露陷阱视角（可导航但无信息）和不可达扇区。作为参考基线，我们围绕端侧规模（参数规模≤8B）的开源多模态大语言模型（MLLMs）构建了一个无需训练的流水线和一个LoRA微调的端到端智能体，包括属性分解、可见性加权多视角跟踪器和三种下一最佳视角（NBV）策略。在Qwen3-VL（4B/8B）、SenseNova-SI-1.2-InternVL3-8B、CLIP和SigLIP2上的评估中，最佳MLLM基线超过最佳嵌入基线4.9个百分点；GT框消融实验显示检测差距为+3.1个百分点；在测试的NBV策略中，我们未观察到主动视角选择带来的可靠增益。LoRA微调智能体（SFT+GSPO）达到85.6%。PInVerify旨在支持具身AI中主动、细粒度语义验证的进一步研究。代码：https://github.com/Avalon-S/PInVerify。

英文摘要

Embodied agents have made strong progress in navigating to target objects, but reaching the goal vicinity does not guarantee that the agent has found the correct instance: subtle attribute differences (e.g., "white floral" vs. "white striped") often require close-range, multi-view inspection. We address this gap with Active Instance Verification (AIV), a task in which an agent actively selects viewpoints around a candidate object to decide whether it matches a fine-grained natural-language description. We formalize AIV as a finite-horizon decision process and introduce PInVerify, an offline embodied benchmark for AIV: 3,000 evaluation episodes across 18 object categories, delivered as multi-view captures with a 6-sector navigation topology that exposes trap views (navigable but uninformative) and unreachable sectors. As reference baselines we build a training-free pipeline and a LoRA-fine-tuned end-to-end agent around open-source multimodal large language models (MLLMs) at on-device scale ($\leq$8B parameters), with attribute decomposition, a visibility-weighted multi-view tracker, and three next-best-view (NBV) strategies. In our evaluation across Qwen3-VL (4B/8B), SenseNova-SI-1.2-InternVL3-8B, CLIP, and SigLIP2, the best MLLM-based baseline exceeds the best embedding baseline by 4.9 pp; GT-box ablations show a +3.1 pp detection gap; and we do not observe reliable gains from active viewpoint selection within the tested NBV strategies. A LoRA-fine-tuned agent (SFT+GSPO) reaches 85.6%. PInVerify aims to support further work on active, fine-grained semantic verification in embodied AI. Code: https://github.com/Avalon-S/PInVerify.

2605.30631 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Controllable Lung Nodule Synthesis via Histogram-Regularized Latent Diffusion Models

基于直方图正则化潜扩散模型的可控肺结节合成

Arunkumar Kannan, Yanbo Zhang, Han Liu, Michael Baumgartner, Jianing Wang, Alexander Hertel, Bogdan Georgescu, Sasa Grbic

发表机构 * Johns Hopkins University（约翰霍普金斯大学）； Department of Radiology and Nuclear Medicine, University Medical Center Mannheim, Heidelberg University（放射学与核医学科，曼海姆大学医学中心，海德堡大学）

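AI summary: Proposes a histogram-regularized latent diffusion model that synthesizes lung nodules in 3D CT volumes by combining subtype, spatial mask, and HU histogram conditioning with a differentiable feature-space histogram regularization term, so that nodule-specific intensity distributions are modeled accurately and visual realism and subtype consistency improve.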

AI中文摘要

尽管自动诊断系统在基于CT的肺癌筛查中取得了显著成功，但其发展仍受限于多样化、带标注的肺结节数据集的稀缺性。基于扩散的生成模型为数据合成提供了一种有前景的策略；然而，许多现有的条件方法主要优化空间重建损失，这鼓励体素级相似性，但可能不足以约束病灶级强度分布。因此，这些方法可能产生过度平滑的纹理轮廓，并低估不同结节亚型（包括实性、部分实性和磨玻璃结节）的独特衰减特性。为解决这一挑战，我们提出了一种可控潜扩散模型，该模型在全3D CT体积内合成肺结节，同时准确建模结节特异性强度分布。具体而言，我们不只依赖空间损失，还引入了一个基于直方图的正则化项，在生成过程中约束体素强度分布。该模型结合了亚型、空间掩码和Hounsfield单位（HU）直方图条件以及可微特征空间直方图正则化项，以更好地对齐病灶级强度分布，提高合成结节的视觉真实感和亚型一致性。在肺部CT数据上的大量实验表明，我们的框架实现了强烈的视觉真实感，通过定量指标和视觉图灵测试验证。此外，当用于数据增强时，生成的结节提高了下游临床任务的性能，特别是对于代表性不足的结节亚型，并显示出对亚型知情恶性分类的潜在益处。

英文摘要

While automated diagnosis systems have achieved remarkable success in computed tomography (CT)-based lung cancer screening, their development remains limited by the scarcity of diverse, annotated pulmonary nodule datasets. Diffusion-based generative models offer a promising strategy for data synthesis; however, many existing conditional approaches primarily optimize spatial reconstruction losses, which encourage voxel-wise similarity but may inadequately constrain lesion-level intensity distributions. As a result, these methods may produce over-smoothed texture profiles and underrepresent the distinct attenuation characteristics of different nodule subtypes, including solid, part-solid, and ground-glass nodules. To address this challenge, we propose a controllable latent diffusion model that synthesizes pulmonary nodules within full 3D CT volumes while accurately modeling nodule-specific intensity distributions. Specifically, rather than relying solely on spatial losses, we introduce a histogram-based regularization term that constrains voxel intensity distributions during the generative process. The model combines subtype, spatial mask, and Hounsfield unit (HU) histogram conditioning with the differentiable feature-space histogram regularization term to better align lesion-level intensity distributions, improving the visual plausibility and subtype consistency of synthesized nodules. Extensive experiments on lung CT data demonstrate that our framework achieves strong visual realism, validated through both quantitative metrics and a visual Turing test. Furthermore, when used for data augmentation, the generated nodules improve performance in downstream clinical tasks, particularly for underrepresented nodule subtypes, and show a potential benefit for subtype-informed malignancy classification.
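
A hard-binned histogram has zero gradient almost everywhere, so a histogram-based regularizer typically relies on a smooth surrogate. The Python sketch below shows one common choice, a Gaussian-kernel soft histogram compared with an L1 distance. It operates on raw intensity values for readability, whereas the abstract describes a feature-space term; the bin centers, the bandwidth, and the function names are assumptions for illustration, not the paper's formulation.

```python
import math


def soft_histogram(values, centers, bandwidth):
    """Kernel-smoothed histogram: each value spreads weight over all bin centers.

    The per-value weights are a softmax over squared distances, which is smooth
    in `values`, unlike hard binning.
    """
    hist = [0.0] * len(centers)
    for v in values:
        logits = [-0.5 * ((v - c) / bandwidth) ** 2 for c in centers]
        top = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(l - top) for l in logits]
        total = sum(weights)
        for i, w in enumerate(weights):
            hist[i] += w / total
    return [h / len(values) for h in hist]


def histogram_l1(values_a, values_b, centers, bandwidth):
    """L1 distance between the soft histograms of two sets of intensities."""
    ha = soft_histogram(values_a, centers, bandwidth)
    hb = soft_histogram(values_b, centers, bandwidth)
    return sum(abs(a - b) for a, b in zip(ha, hb))
```

In an actual training loop the same computation would be written with an autodiff framework so that the penalty can be backpropagated into the generator.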

2605.30611 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Crafter: A Multi-Agent Harness for Editable Scientific Figure Generation from Diverse Inputs

Crafter: 面向多样化输入的可编辑科学图表生成的多智能体框架

Haozhe Zhao, Shuzheng Si, Zhenhailong Wang, Zheng Wang, Liang Chen, Xiaotong Li, Zhixiang Liang, Maosong Sun, Minjia Zhang

发表机构 * University of Illinois at Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Tsinghua University（清华大学）； Peking University（北京大学）

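AI summary: Proposes Crafter, a multi-agent harness that generates editable scientific figures across figure types and input conditions by treating figures as structured compositions of discrete semantic components, and introduces CraftEditor to convert raster outputs into editable SVGs; it substantially outperforms existing methods on the CraftBench benchmark.

Comments 24 pages, 11 figures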

AI中文摘要

科学图表是传达复杂研究思想最有效的手段之一，但生成出版质量的插图仍然是论文准备中最劳动密集的部分。现有的自动化系统各自针对单一图表类型，且仅接受文本输入，未能解决研究人员实际使用的多样类型和条件；此外，它们的栅格输出无法进行局部修改。由于科学图表是离散语义组件的结构化组合，生成器在这些布局上产生的局部错误需要的不是更强的骨干网络，而是一个框架。我们将这个框架实例化为两个互补系统：Crafter，一个用于图表生成的多智能体框架，无需架构更改即可泛化到多种图表类型和输入条件；以及CraftEditor，它应用相同的模式将栅格输出转换为可编辑的SVG。此外，我们引入了CraftBench，一个涵盖三种图表类型和四种输入条件的基准，并带有手工质量标注。实验表明，Crafter在PaperBanana-Bench和CraftBench上显著优于独立的生成器和智能体基线，消融实验确认了每个组件的独立贡献；CraftEditor忠实地将输出转换为可编辑的SVG，超越了所有基线。我们的代码和基准可在https://github.com/HaozheZhao/Crafter获取。

英文摘要

Scientific figures are among the most effective means of communicating complex research ideas, yet producing publication-quality illustrations remains one of the most labor-intensive parts of paper preparation. Existing automated systems each target a single figure type under text-only input, leaving the diversity of types and conditions researchers actually use unaddressed; their raster outputs further cannot be locally revised. Because scientific figures are structured compositions of discrete semantic components, the localized errors generators produce on such layouts demand not a stronger backbone but a harness. We instantiate this harness in two complementary systems: Crafter, a multi-agent harness for figure generation that generalizes across figure types and input conditions without architectural changes, and CraftEditor, which applies the same pattern to convert raster outputs into editable SVGs. Moreover, we introduce CraftBench, a benchmark spanning three figure types and four input conditions with human quality annotation. Experiments show that Crafter substantially outperforms both standalone generators and the agentic baseline on PaperBanana-Bench and CraftBench, with ablations confirming each component's independent contribution; CraftEditor faithfully converts outputs into editable SVGs that surpass all baselines. Our code and benchmark are available at https://github.com/HaozheZhao/Crafter.

2605.30587 2026-06-01 cs.CV 版本更新

ReGuLaR: Relation-Grounded Latent Reasoning for Large Vision-Language Models

ReGuLaR：面向大型视觉语言模型的基于关系的潜在推理

Zihu Wang, Karthik Somayaji N. S, Peng Li

发表机构 * University of California, Santa Barbara（加州大学圣巴巴拉分校）

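AI summary: Proposes the ReGuLaR framework, which uses a training-time ReGFormer to explicitly ground latent reasoning in the objects and relations of the visual evidence, achieving state-of-the-art performance across diverse benchmarks.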

AI中文摘要

链式思维推理通过用自然语言表述中间推理步骤，显著提升了大视觉语言模型的推理能力。然而，这种离散的文本理由通常不足以编码连续的视觉证据。最近的工作通过将推理转移到连续潜在空间来解决这一限制。尽管取得了有希望的进展，现有方法仍使潜在推理与视觉证据的组合结构和关系结构联系不足。为填补这一空白，我们引入了ReGuLaR，一种基于关系的潜在推理框架，将潜在状态显式地锚定在这些关键但被忽视的视觉证据上。ReGuLaR在训练时使用ReGFormer使潜在推理聚焦于与问题相关的对象及对象间关系，而在推理时模型无需调用ReGFormer即可推理并生成答案。为支持ReGuLaR的训练，我们构建了RGROUNDING-351K，一个标注了关键对象边界框和对象间关系的真实世界视觉语言数据集。在多种基准上的广泛实验表明，ReGuLaR持续优于现有方法，并取得了最先进的性能。我们在投稿中包含了代码，并将在接收后公开发布代码和训练数据。

英文摘要

Chain-of-thought (CoT) reasoning has significantly improved the reasoning ability of large vision-language models (LVLMs) by verbalizing intermediate reasoning steps in natural language. However, such discrete textual rationales are often insufficient for encoding continuous visual evidence. Recent work addresses this limitation by moving reasoning into continuous latent space. Despite promising progress, existing methods leave latent reasoning insufficiently connected to the compositional and relational structure of visual evidence. To address this gap, we introduce ReGuLaR, a relation-grounded latent reasoning framework that explicitly grounds latent states in this critical yet overlooked visual evidence. ReGuLaR uses a training-time ReGFormer to focus latent reasoning on question-relevant objects and inter-object relations, while at inference time the model reasons and generates answers without invoking the ReGFormer. To support training ReGuLaR, we construct RGROUNDING-351K, a real-world vision-language dataset annotated with key object bounding boxes and inter-object relations. Extensive experiments across diverse benchmarks show that ReGuLaR consistently outperforms existing approaches and achieves state-of-the-art performance. We include our code in the submission and will release the code and training data publicly upon acceptance.

2605.30578 2026-06-01 cs.CR cs.CV 版本更新

AdvScene: Rethinking Adversarial Patch Evaluation Through Scene Robustness

AdvScene: 通过场景鲁棒性重新思考对抗补丁评估

Xiaoyong Yuan, Lan Zhang

发表机构 * Clemson University（克莱姆森大学）

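AI summary: Proposes the AdvScene framework, which reconstructs real environments and introduces Adversarial Patch-to-Scene Embedding (APSE) to evaluate the scene robustness of adversarial patches under changing viewpoints, distances, and scene conditions, revealing scene-dependent variation that existing evaluations miss.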

AI中文摘要

对抗补丁是附着在真实物体上以误导AI视觉系统的物理图案。它们的现实世界风险并非由单次成功预测决定，而是取决于在部署后变化的视角、距离和场景条件下是否仍然有效。我们将这一特性称为场景鲁棒性，即部署的补丁在真实环境各种条件下的有效性。然而，现有评估并未很好地衡量场景鲁棒性：真实图像基准虽真实但固定，而模拟器虽可控但未基于特定真实场景。我们提出AdvScene，一个基于场景的框架，用于在重建的真实环境中测量对抗补丁的场景鲁棒性。AdvScene将评估重新定义为操作性测量：给定一个固定的部署补丁，它刻画补丁的操作包络——攻击成功的位置和条件——作为视角、距离和场景上下文的函数。一个关键挑战是攻击通常仅在单个锚定视图中定义，而评估需要一种在视角变化下保持保真度的表示。我们将此形式化为一个约束提升问题，并引入对抗补丁到场景嵌入（APSE），它在保留攻击关键外观并强制局部性、目标表面附着和跨视图一致性的同时，解决跨视图歧义。我们使用真实世界物理数据验证AdvScene，并对现有对抗补丁进行全面评估。结果表明，AdvScene揭示了攻击有效性的显著场景依赖性变化，而现有基于图像或模拟器的评估未能捕获这些变化。

英文摘要

Adversarial patches are physical patterns attached to real objects to mislead AI vision systems. Their real-world risk is not determined by a single successful prediction, but by whether they remain effective after deployment under changing viewpoints, distances, and scene conditions. We refer to this property as scene robustness, the effectiveness of a deployed patch across conditions in a real environment. Yet existing evaluations do not measure scene robustness well: real image benchmarks are realistic but fixed, while simulators are controllable but not grounded in a specific real scene. We present AdvScene, a scene-grounded framework for measuring the scene robustness of adversarial patches in reconstructed real environments. AdvScene reframes evaluation as operational measurement: given a fixed deployed patch, it characterizes the patch's operational envelope - where and when the attack succeeds - as a function of viewpoint, distance, and scene context. A key challenge is that the attack is typically defined only in a single anchor view, while evaluation requires a representation that remains faithful under viewpoint changes. We formalize this as a constrained lifting problem and introduce Adversarial Patch-to-Scene Embedding (APSE), which resolves cross-view ambiguity while preserving attack-critical appearance and enforcing locality, target-surface attachment, and cross-view consistency. We validate AdvScene using real-world physical data and conduct a comprehensive evaluation of existing adversarial patches. Our results show that AdvScene reveals substantial scene-dependent variation in attack effectiveness that is not captured by existing image-centric or simulator-based evaluations.

2605.30561 2026-06-01 cs.CV cs.AI 版本更新

VLM3: Vision Language Models Are Native 3D Learners

VLM3：视觉语言模型是原生3D学习者

Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu, Vikas Chandra, Yangyang Shi

发表机构 * Meta（Meta公司）； Princeton University（普林斯顿大学）

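AI summary: Proposes VLM3, which uses focal length unification, text-based pixel reference, and data mixture and scaling so that standard vision-language models can master diverse 3D tasks without complex architectures or loss functions.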

AI中文摘要

视觉语言模型（VLM）通过提示使统一模型能够解决各种视觉任务，在语义理解方面表现出色。然而，3D理解仍然很大程度上依赖于具有复杂任务特定设计的专家视觉模型。本文要提出的关键论点是，VLM是原生的3D学习者。我们深入的大规模研究表明：1）焦距统一，2）基于文本的像素参考，以及3）数据混合和缩放，是有效3D学习所需的一切。模型架构变化、大模型、大量数据增强以及包括回归公式在内的复杂损失（其中许多构成了专家视觉模型的基础）实际上并不是必要条件。因此，我们提出了VLM3，一种具有最简单设计的可扩展方法，使标准VLM能够掌握多样的3D任务。VLM3不仅大幅提升了VLM深度估计的准确性（0.84 -> 0.9），还实现了多样的3D任务，如像素对应、相机姿态估计和物体级3D理解，在保持标准架构和基于文本的训练的同时，匹配了专家视觉模型的准确性。我们相信VLM3为简单且可扩展的3D学习开辟了新的范式。

英文摘要

Vision Language Models (VLMs) enable a unified model to solve various vision tasks through prompting. They have shown promising performance in semantic understanding. However, 3D understanding still largely relies on expert vision models with complex task-specific designs. The key argument this work wants to make is that VLMs are native 3D learners. Our in-depth large scale study shows that 1) focal length unification, 2) text-based pixel reference and 3) data mixture and scaling, are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentations, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are actually not necessary conditions. As a result, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only advances the VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation and object-level 3D understanding, matching expert vision model accuracy while maintaining standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.

2605.30557 2026-06-01 cs.CV cs.AI cs.CL 版本更新

Seeing Isn't Knowing: Do VLMs Know When Not to Answer Spatial Questions (and Why)?

看见不等于知道：视觉语言模型是否知道何时不回答空间问题（以及为什么）？

Yue Zhang, Zun Wang, Han Lin, Yonatan Bitton, Idan Szpektor, Mohit Bansal

发表机构 * UNC Chapel Hill（北卡罗来纳大学教堂山分校）； Google Research（谷歌研究）

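AI summary: Addresses overconfident answering by vision-language models in spatial reasoning with the SpatialUncertain framework, which uses two challenges, occlusion and viewpoint ambiguity, to evaluate whether models know when to abstain and how to seek reliable evidence.

Comments Website: https://zhangyuejoslin.github.io/spatialuncertain/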

AI中文摘要

VLM-GLoc：视觉语言模型增强的蒙特卡洛定位，用于杂乱准静态环境中的鲁棒语义全局定位

Shivendra Agrawal, Bradley Hayes

发表机构 * University of Colorado Boulder（科罗拉多大学博尔德分校）

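AI summary: Proposes VLM-GLoc, which uses an open-vocabulary vision-language model as a unified semantic observation front-end and, through an inverse semantic proposal mechanism with text-to-map retrieval, achieves robust global localization in quasi-static environments with geometric aliasing and semantic ambiguity.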

AI中文摘要

在几何模糊的准静态环境（如杂货店、办公室、学校和医院）中，全局定位对移动机器人构成重大挑战。具有平行过道和长尾产品分布的杂货店，以及具有重复家具（如椅子、桌子、显示器和门）的办公室和实验室，是常见的室内环境，存在几何甚至语义歧义。传统方法要么依赖独特的几何特征，要么依赖特定领域的视觉管道，这些方法难以处理长尾语义分布和瞬态视觉杂乱。我们提出VLM-GLoc，一种分层语义蒙特卡洛定位（MCL）方法，利用开放词汇视觉语言模型（VLM）作为统一语义观测前端。我们假设VLM具有三重优势：（1）提取高度判别性的丰富文本特征，（2）对模糊或动态对象进行隐式质量过滤，（3）针对数据增强的持久性推理。我们引入一种逆语义提议机制，通过文本到地图检索播种粒子。在两个具有不同特征的真实世界环境和两个不同平台上进行评估：一个3500平方英尺的杂货店（使用手机）和一个3700平方英尺的实验室空间（使用四足机器人），VLM-GLoc分别实现了70%和74%的全局定位成功率，显著优于传统的纯几何和特定领域基线方法。

英文摘要

Global localization in geometrically aliased, quasi-static environments such as grocery stores, offices, schools, and hospitals poses a significant challenge for mobile robots. Grocery stores with parallel aisles and a long-tailed distribution of products, as well as offices and labs with repetitive furniture such as chairs, desks, monitors, and doors, exemplify common indoor environments that present geometric and even semantic ambiguity. Traditional approaches rely either on distinct geometric features or on domain-specific vision pipelines that struggle with long-tail semantic distributions and transient visual clutter. We present VLM-GLoc, a method for hierarchical semantic Monte Carlo Localization (MCL) that leverages open-vocabulary Vision-Language Models (VLMs) as a unified semantic observation front-end. We hypothesize a three-fold benefit from VLMs: (1) extracting highly discriminative rich text features, (2) implicit quality filtering of blurry or dynamic objects, and (3) permanence reasoning for targeted data augmentation. We introduce an inverse semantic proposal mechanism that seeds particles via text-to-map retrieval. Evaluated in two real-world environments with different characteristics and on two different platforms, a 3,500 sq. ft. grocery store with a cellphone and a 3,700 sq. ft. lab space with a quadruped, VLM-GLoc achieves 70% and 74% global localization success respectively, substantially outperforming traditional geometry-only and domain-specific baselines.
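
Seeding particles via text-to-map retrieval can be pictured as follows: text read from the current view is matched against text attached to map landmarks, and particles are drawn near the best matches instead of uniformly over the map. The Python sketch below is a toy version under stated assumptions: the map is a list of (text, position) landmarks, similarity is plain token overlap (Jaccard), and particles are 2D positions with Gaussian spread. None of these choices is taken from the paper, and the function names are hypothetical.

```python
import random


def jaccard(a, b):
    """Token-set overlap between two short text descriptions."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0


def seed_particles(observation, landmarks, n, sigma=0.5, rng=None):
    """Draw n (x, y) particles around landmarks whose text matches the observation.

    landmarks: list of (text, (x, y)). Returns [] when nothing matches, in which
    case a caller would fall back to uniform seeding.
    """
    rng = rng or random.Random()
    weights = [jaccard(observation, text) for text, _ in landmarks]
    if sum(weights) == 0:
        return []
    chosen = rng.choices(landmarks, weights=weights, k=n)
    return [(rng.gauss(x, sigma), rng.gauss(y, sigma)) for _, (x, y) in chosen]
```

A real system would likely use embedding similarity rather than token overlap and would also sample heading, but the proposal structure is the same.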

2605.03337 2026-06-01 cs.CV cs.AI 版本更新

FreeTimeGS++: Secrets of Dynamic Gaussian Splatting and Their Principles

FreeTimeGS++：动态高斯泼溅的秘密及其原理

Lucas Yunkyu Lee, Soonho Kim, Youngwook Kim, Sangmin Kim, Jaesik Park

发表机构 * Seoul National University（首尔国立大学）； POSTECH

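AI summary: Establishes a controlled baseline, FreeTimeGS_ours, to systematically analyze hidden factors in 4D Gaussian Splatting frameworks, uncovering key findings such as temporal partitioning driven by Gaussian durations and the discrepancy between photometric fidelity and spatiotemporal consistency, and proposes FreeTimeGS++, which uses gated marginalization and neural velocity fields for more stable dynamic representations.

Comments Project page: https://yklcs.com/ftgspp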

AI中文摘要

近期4D高斯泼溅（4DGS）的兴起在动态场景重建方面取得了令人瞩目的成果。尽管这些方法表现出卓越的性能，但其背后的具体驱动因素仍未被充分探索，使得对基本原理的系统理解具有挑战性。本文对这些隐藏因素进行了全面分析，以提供对4DGS框架更清晰的视角。我们首先通过形式化和复现最先进的FreeTimeGS的启发式方法，建立了一个受控基线FreeTimeGS_ours。利用该框架，我们沿着其基本轴剖析4DGS，并揭示了关键秘密，包括由高斯持续时间驱动的涌现时态分区以及光度保真度与时空一致性之间的差异。基于这些见解，我们提出了FreeTimeGS++，这是一种采用门控边缘化和神经速度场的原理性方法，以实现卓越的稳定性和鲁棒的动态表示。我们的方法产生了可重复的结果，并降低了运行间方差。我们将发布我们的实现，为未来的4DGS研究提供可靠的基础。

英文摘要

The recent surge in 4D Gaussian Splatting (4DGS) has achieved impressive dynamic scene reconstruction. While these methods demonstrate remarkable performance, the specific drivers behind such gains remain less explored, making a systematic understanding of the underlying principles challenging. In this paper, we perform a comprehensive analysis of these hidden factors to provide a clearer perspective on the 4DGS framework. We first establish a controlled baseline, FreeTimeGS_ours, by formalizing and reproducing the heuristics of the state-of-the-art FreeTimeGS. Using this framework, we dissect 4DGS along its fundamental axes and uncover key secrets, including the emergent temporal partitioning driven by Gaussian durations and the discrepancy between photometric fidelity and spatiotemporal consistency. Based on these insights, we propose FreeTimeGS++, a principled method that employs gated marginalization and neural velocity fields to achieve superior stability and robust dynamic representations. Our approach yields reproducible results with reduced run-to-run variance. We will release our implementation to provide a reliable foundation for future 4DGS research.

2605.30469 2026-06-01 cs.SD cs.CV 版本更新

3DAE: Binaural Quality Assessment for Audio Novel View Synthesis with Spatial Maps and Benchmark

3DAE: 基于空间图谱和基准的音频新视角合成双耳质量评估

Jialu Xu, Yifan Zhou

发表机构 * University of Waterloo（滑铁卢大学）

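AI summary: Proposes 3DAE Map, a full-reference diagnostic framework that supports visual inspection through time-frequency audio error maps (magnitude, ILD, IPD, temporal alignment, loudness, and high-frequency failures), and builds the model-agnostic benchmark 3DAE Bench for evaluating the binaural prediction quality of audio novel-view synthesis models.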

AI中文摘要

3D音频和新视角声学合成模型通常使用全局指标进行评估。然而，全局指标往往隐藏了双耳预测失败的位置和原因。我们提出一个全参考诊断框架，该框架使用时频音频误差图，包括幅度、ILD、IPD、时间对齐、响度和高频故障，形成3D音频误差图（3DAE Map）用于视觉检查。我们将这些诊断方法整合到一个模型无关的基准——空间音频误差基准（3DAE Bench）中，该基准接受任意真实和预测的双耳对，并报告音频新视角合成模型的预测质量。在Replay-NVAS和SoundSpaces上对ViGAS输出的实验显示了不同的主要故障模式：Replay-NVAS上的时间错位和SoundSpaces上的ILD不匹配。总体而言，该框架为音频新视角合成模型开发优化提供了可解释的故障模式总结和直观的视觉图谱。

英文摘要

3D audio and novel-view acoustic synthesis models are usually evaluated with global metrics. However, global metrics often hide where and why binaural prediction fails. We propose a full-reference diagnostic framework that uses time-frequency audio error maps for magnitude, ILD, IPD, temporal alignment, loudness, and high-frequency failures, forming a 3D Audio Error Map (3DAE Map) for visual inspection. We frame these diagnostics into a model-agnostic benchmark, Spatial Audio Error Bench (3DAE Bench), which takes arbitrary ground-truth and predicted binaural pairs and reports the prediction quality of audio novel-view synthesis models. Experiments on ViGAS outputs over Replay-NVAS and SoundSpaces show different dominant failure modes: temporal misalignment on Replay-NVAS and ILD mismatch on SoundSpaces. Overall, the framework provides interpretable failure-mode summaries and intuitive visual maps for developing and optimizing audio novel-view synthesis models.
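
The interaural level difference (ILD) is the left-to-right energy ratio in decibels, and an ILD error map compares that quantity between prediction and ground truth. The Python sketch below is a deliberately simplified, broadband per-frame version with non-overlapping frames. The framework in the abstract works on time-frequency maps, which would repeat the same comparison per frequency bin of a short-time Fourier transform. The frame length and the function names are illustrative assumptions.

```python
import math


def frame_ild_db(left, right, frame=4, eps=1e-12):
    """Per-frame interaural level difference in dB (broadband, non-overlapping frames)."""
    ilds = []
    for start in range(0, min(len(left), len(right)) - frame + 1, frame):
        e_left = sum(x * x for x in left[start:start + frame])
        e_right = sum(x * x for x in right[start:start + frame])
        # eps keeps the ratio finite on silent frames.
        ilds.append(10.0 * math.log10((e_left + eps) / (e_right + eps)))
    return ilds


def ild_error(pred_lr, true_lr, frame=4):
    """Absolute ILD difference per frame between predicted and true (left, right) pairs."""
    pred = frame_ild_db(*pred_lr, frame=frame)
    true = frame_ild_db(*true_lr, frame=frame)
    return [abs(p - t) for p, t in zip(pred, true)]
```

Keeping the error per frame rather than averaging it is what makes the output a map: it shows when the level cue goes wrong, not just that it does.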

2605.30444 2026-06-01 cs.CV 版本更新

Dex2HOI: Dexterous Bimanual Two-Object Interaction Generation

Dex2HOI: 灵巧双手双物体交互生成

Chrysa Pratikaki, Pablo Ruiz-Ponce, Jiankang Deng, Stefanos Zafeiriou, Rolandos Alexandros Potamias

发表机构 * Imperial College London, UK（伦敦帝国学院）； University of Alicante, Spain（阿利坎特大学）

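AI summary: Proposes Dex2HOI, a unified diffusion model that uses dual-stream diffusion and a motion fusion network to generate dexterous bimanual single- and two-object interactions from text, with up to a 540x inference speed-up.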

AI中文摘要

近期4D人-物体交互（HOI）生成的进展使得运动合成越来越逼真，特别是对于单物体操作。然而，当前研究忽视了人类行为的一个固有特性：人们自然地协调双手并同时操作多个物体。为填补这一空白，我们提出了Dex2HOI，一个用于从文本合成单物体和双物体HOI的统一扩散模型。其核心采用双流扩散方法，每个物体在专用交互流中处理，并通过双向交叉注意力进行协调。为了合成最终运动，我们引入了一个运动融合网络，该网络集成了新颖的相对于手的物体表示和应用于整个序列的接触感知条件。通过在带前缀条件的窗口上自回归采样扩散过程，Dex2HOI以实时速度生成任意长的序列，省略了冗余的测试时优化，相比先前最先进方法实现了高达540倍的推理加速。在单物体和双物体基准上的广泛评估展示了最先进的定量结果，标志着超越传统单物体HOI生成、向表达性多物体操作迈出的一步。代码和模型将在接收后发布。

英文摘要

Recent advances in 4D Human-Object Interaction (HOI) generation have enabled increasingly realistic motion synthesis, particularly for single-object manipulation. Yet current research overlooks an inherent property of human behavior: people naturally coordinate both hands and manipulate multiple objects simultaneously. To address this gap, we present Dex2HOI, a unified diffusion model for single- and two-object HOI synthesis from text. At its core, Dex2HOI employs a Dual-Stream Diffusion approach, where each object is processed in a dedicated interaction stream and coordinated through bidirectional cross-attention. To synthesize the final motion, we introduce a Motion Fusion Network integrated with novel hand-relative object representations and contact-aware conditioning applied across the whole sequence. By sampling the diffusion process autoregressively over prefix-conditioned windows, Dex2HOI generates arbitrarily long sequences at real-time speed, omitting redundant test-time optimization and achieving up to a 540x inference speed-up over prior state-of-the-art methods. Extensive evaluation on both single- and two-object benchmarks demonstrates state-of-the-art quantitative results, marking a step beyond conventional single-object HOI generation and toward expressive multi-object manipulation. Code and models will be released upon acceptance.

2605.30431 2026-06-01 cs.CV 版本更新

DTG-Restore: Training-Free Diffusion Refinement for Generative Video Super-Resolution

DTG-Restore: 无需训练的视频超分辨率扩散精炼

Hidir Yesiltepe, Koutilya PNVR, Gaurav Pathak, Navaneeth Bodla, Bharat Singh, Pinar Yanardag, Jinrong Xie

发表机构 * Virginia Tech（弗吉尼亚理工大学）； Adobe（Adobe公司）

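AI summary: Proposes Decoupled Time Guidance (DTG), which decouples the conditional and unconditional branches in time to enhance distorted, low-resolution videos without training, improving structural fidelity and temporal stability.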

AI中文摘要

近期视频扩散模型的进展实现了显著的生成保真度，但利用这些先验进行修复仍受限于标准无分类器引导中条件分支与无条件分支的强耦合。我们提出一种无需训练的框架，通过时间解耦这些信号来增强扭曲和低分辨率视频。我们提出的解耦时间引导（DTG）在更干净的扩散时间步评估无条件分支，提供一个前瞻先验，在抑制扭曲内容复制的同时保持几何结构。这种时间偏置在采样过程中逐渐减弱，使模型能够从结构校正过渡到细节精炼，无需重新训练。结合任何现成的修复模块以即插即用的方式，我们的方法在AI生成和真实世界视频中均能改善感知一致性并恢复合理的结构。为便于评估，我们整理了GenWarp480基准，包含从多种文本到视频模型合成的4400个扭曲480p视频。GenWarp480专注于特征性生成退化，如扭曲面部、身体错位和空间伪影，为评估对生成错误的鲁棒性提供了专门构建的测试平台。大量实验表明，我们的方法在无需任何模型训练的情况下，在结构保真度和时间稳定性方面取得了显著改进。

英文摘要

Recent progress in video diffusion models has enabled remarkable generative fidelity, yet leveraging these priors for restoration remains limited by the strong coupling between conditional and unconditional branches in standard classifier-free guidance. We introduce a training-free framework that enhances distorted and low-resolution videos by decoupling these signals in time. Our proposed Decoupled Time Guidance (DTG) evaluates the unconditional branch at a cleaner diffusion timestep, providing a lookahead prior that preserves geometry while suppressing replication of warped content. This temporal bias is annealed throughout sampling, allowing the model to transition from structure correction to detail refinement without retraining. Combined with any off-the-shelf restoration module in a plug-and-play manner, our approach improves perceptual coherence and restores plausible structure in AI-generated and real-world videos alike. To facilitate evaluation, we curate GenWarp480, a benchmark of 4,400 distorted 480p videos synthesized from diverse text-to-video models. GenWarp480 focuses on characteristic generative degradations such as warped faces, body misalignments, and spatial artifacts, providing a purpose-built testbed for assessing robustness to generative errors. Extensive experiments demonstrate that our method achieves significant improvements in structural fidelity and temporal stability without any model training.
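
Standard classifier-free guidance evaluates both branches at the same timestep t and combines them as eps_u + w * (eps_c − eps_u). The Python sketch below shows how a decoupled variant might look if the unconditional branch is queried at a cleaner timestep t − offset, with the offset decaying to zero over sampling. The linear decay, the convention that smaller t means less noise, and all function names are assumptions made for illustration; the abstract does not give the exact formula or schedule.

```python
def annealed_offset(step, num_steps, max_offset):
    """Timestep offset decaying linearly from max_offset to 0 (assumed schedule)."""
    if num_steps <= 1:
        return 0
    frac = 1.0 - step / (num_steps - 1)
    return int(round(max_offset * frac))


def decoupled_guidance(eps_cond, eps_uncond, x, t, offset, scale):
    """Guided noise estimate with the unconditional branch at a cleaner timestep.

    eps_cond and eps_uncond are callables (x, t) -> list of floats standing in
    for the two branches of a denoiser. With offset == 0 this reduces to
    standard classifier-free guidance.
    """
    t_uncond = max(t - offset, 0)  # assumed convention: smaller t is less noisy
    e_c = eps_cond(x, t)
    e_u = eps_uncond(x, t_uncond)
    return [u + scale * (c - u) for c, u in zip(e_c, e_u)]
```

Setting `max_offset` to 0 recovers the coupled baseline, which makes the effect of the time shift easy to ablate.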

2605.30409 2026-06-01 cs.CV cs.AI 版本更新

$R^3$: 通过相对回归进行3D重建

Congrong Xu, Huachen Gao, Xingyu Chen, Yuliang Xiu, Jun Gao, Anpei Chen

发表机构 * University of Michigan（密歇根大学）； Westlake University（西湖大学）； NVIDIA Research（英伟达研究）

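AI summary: Proposes $R^3$, a 3D reconstruction method based on relative regression that uses a lightweight MLP to predict confidence-weighted relative constraints, supporting both full-context offline reconstruction and causal, bounded-memory streaming reconstruction.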

AI中文摘要

最近的馈送式几何基础模型通过单次前向传播恢复深度和姿态，展现出了令人印象深刻的泛化能力。然而，这些模型通常受限于全局坐标框架假设。这种依赖性成为长上下文和流式重建的一个显著瓶颈，因为它迫使网络维护一个任意的时序原点，并处理随时间无界增长的平移幅度。我们的解决方案，称为$R^3$，采用了相对回归。我们使用一个轻量级MLP来预测置信度加权的相对约束。这些置信度作为一个统一的锚点：在训练期间加权损失，在推理期间指导姿态聚合。$R^3$支持全上下文离线重建和因果、有界内存的流式重建。我们在离线与流式设置下的评估验证了我们的相对机制的有效性。项目页面：https://kevinxu02.github.io/r3-site

英文摘要

Recent feed-forward geometry foundation models have demonstrated impressive generalization by recovering depth and poses in a single forward pass. However, these models are typically constrained by a global coordinate frame assumption. This dependency becomes a significant bottleneck for long-context and streaming reconstruction, as it forces the network to maintain an arbitrary temporal origin and handle translation magnitudes that grow unbounded over time. Our solution, which we call $R^3$, employs relative regression. We employ a lightweight MLP to predict confidence-weighted relative constraints. These confidences serve as a unified anchor: weighting losses during training and guiding pose aggregation during inference. $R^3$ supports both full-context offline reconstruction and causal, bounded-memory streaming. Our evaluation in both offline and streaming settings validates the effectiveness of our relative mechanism. Project page: https://kevinxu02.github.io/r3-site
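
To make confidence-guided pose aggregation concrete: if several already-placed frames each predict where a new frame lies relative to themselves, the new frame's position can be estimated as a confidence-weighted mean of those predictions. The Python sketch below covers translation only and assumes the relative offsets are already expressed in the shared frame; rotations would need a proper rotation average. The function name and this specific rule are illustrative assumptions, not the paper's aggregation scheme.

```python
def aggregate_position(anchors, relatives, confidences):
    """Confidence-weighted mean of (anchor position + predicted relative offset).

    anchors, relatives: lists of equal-length coordinate lists.
    confidences: one non-negative weight per anchor.
    """
    total = sum(confidences)
    if total <= 0:
        raise ValueError("confidences must sum to a positive value")
    dims = len(anchors[0])
    return [
        sum(c * (a[d] + r[d]) for a, r, c in zip(anchors, relatives, confidences)) / total
        for d in range(dims)
    ]
```

Because every term is relative to an existing frame, no global origin is required, and a streaming variant can bound memory by keeping only a fixed number of recent anchors.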

2605.22050 2026-06-01 cs.CV 版本更新

Broken Memories: Detecting and Mitigating Memorization in Diffusion Models with Degraded Generations

破碎的记忆：通过退化生成检测和缓解扩散模型中的记忆化

Yuanmin Huang, Mi Zhang, Chen Chen, Feifei Li, Geng Hong, Xiaoyu You, Min Yang

发表机构 * Fudan University（复旦大学）； East China University of Science and Technology（华东理工大学）

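AI summary: Identifies, for the first time, that memorization in diffusion models induces internal numerical instability that shows up as visually "broken" artifacts; building on this, it introduces empirical stability regions based on latent update norms to quantify stable behavior and designs an on-the-fly framework for step-wise detection and adaptive mitigation that suppresses memorization without altering prompts or guidance, reaching AUC > 0.999 detection and a 0.0% memorization rate on Stable Diffusion 1.4.

Comments KDD 2026, extended version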

DOI: 10.1145/3770855.3817770

AI中文摘要

虽然扩散模型在生成高质量图像方面表现出色，但它们记忆训练数据的倾向带来了显著的隐私和版权风险。在这项工作中，我们首次发现记忆化会导致内部数值不稳定性，通常表现为视觉上的“破碎”伪影。受数值方法中稳定性分析的启发，我们引入了基于潜变量更新范数的经验稳定区域，以定量表征生成过程中的稳定行为。利用这一点，我们提出了一个原则性的、即时的框架，用于逐步骤检测和自适应缓解。我们的方法在不改变提示或引导的情况下抑制记忆化，从而保持语义保真度和图像质量。在Stable Diffusion 1.4上的大量实验表明，我们的方法在缓解后实现了AUC>0.999的检测性能和0.0%的记忆化率，且开销可忽略不计（每张图像约0.01秒）。

英文摘要

While diffusion models excel at generating high-quality images, their tendency to memorize training data poses significant privacy and copyright risks. In this work, we identify for the first time that memorization induces internal numerical instability, often manifesting as visually "broken" artifacts. Inspired by stability analysis in numerical methods, we introduce empirical stability regions based on latent update norms to quantitatively characterize stable behavior during generation. Leveraging this, we propose a principled, on-the-fly framework for step-wise detection and adaptive mitigation. Our approach suppresses memorization without altering prompts or guidance, thereby preserving semantic fidelity and image quality. Extensive experiments on Stable Diffusion 1.4 demonstrate that our method achieves an AUC > 0.999 detection performance and a 0.0% memorization rate after mitigation with negligible overhead (≈0.01 s per image).
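To make the idea of an empirical stability region concrete, here is a minimal sketch under stated assumptions. It measures the L2 norm of each step-to-step latent update, builds a per-step band from a few reference runs, and flags steps of a new run that leave the band. The min/max band with a relative margin, and all function names, are illustrative choices; the abstract does not specify how the region is constructed.

```python
import math


def update_norms(latents):
    """L2 norm of each step-to-step latent update along one sampling trajectory."""
    return [
        math.sqrt(sum((b - a) ** 2 for a, b in zip(prev, cur)))
        for prev, cur in zip(latents, latents[1:])
    ]


def fit_stability_region(reference_runs, margin=0.1):
    """Per-step (low, high) band from reference runs (assumed to be non-memorized)."""
    region = []
    for norms in zip(*(update_norms(run) for run in reference_runs)):
        lo, hi = min(norms), max(norms)
        pad = margin * (hi - lo if hi > lo else hi)
        region.append((lo - pad, hi + pad))
    return region


def unstable_steps(latents, region):
    """Indices of steps whose update norm falls outside the band."""
    return [
        step
        for step, (norm, (lo, hi)) in enumerate(zip(update_norms(latents), region))
        if not lo <= norm <= hi
    ]


reference = [[[0.0], [1.0], [1.5]], [[0.0], [1.2], [1.8]]]
region = fit_stability_region(reference)
flagged = unstable_steps([[0.0], [1.1], [4.0]], region)
```

A per-step check like this can run during sampling, which matches the "on-the-fly, step-wise" framing; what to do once a step is flagged (the adaptive mitigation) is not sketched here.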

2605.30248 2026-06-01 cs.CV 版本更新

GenClaw: Code-Driven Agentic Image Generation

GenClaw: 代码驱动的智能体图像生成

Junyan Ye, Jun He, Zilong Huang, Dongzhi Jiang, Xuan Yang, Rui Chen, Weijia Li

AI总结提出GenClaw，一种代码驱动的智能体图像生成范式，通过概念构思、代码草图绘制和纹理补充三个阶段，将黑盒图像生成转变为可控、可解释的分阶段过程。

Comments 21 pages, 7 figures

AI中文摘要

图像生成模型已从基于文本的像素合成演变为具备视觉理解和工具调用能力的多模态智能体。然而，现有智能体仍受制于底层黑盒图像模型。其工作流程陷入重复的提示重写循环以改进生成，缺乏直接操控画布的机制。本质上，LLMs作为精确视觉构建的“画笔”的潜力尚未被充分挖掘。本文提出GenClaw，一种代码驱动的智能体图像生成范式，使智能体像人类艺术家一样创作：先构思，再素描，最后上色。具体而言，智能体首先通过搜索和推理构建概念知识和上下文。然后利用代码（如SVG、HTML、ThreeJS）渲染可执行的视觉草图。最后，使用图像生成模型补充纹理、材质和逼真度。在此工作流中，代码作为连接语言推理和像素合成的可控中间画布，无缝集成程序逻辑与生成模型的视觉表现力。通过将图像生成从黑盒范式转变为类似真实人类创作的分阶段过程，GenClaw朝着高度可控和可解释的视觉生成系统迈出了一步。

英文摘要

Image generation models have evolved from text-conditioned pixel synthesis toward multimodal agents endowed with visual comprehension and tool invocation capabilities. Yet, existing agents remain at the mercy of underlying black-box image models. Their workflow is trapped in a repetitive cycle of prompt rewriting for generation refinement, leaving them with no mechanism to directly manipulate the canvas. In essence, the potential of LLMs to serve as a genuine "brush" for precise visual construction remains largely untapped. In this paper, we propose GenClaw, a code-driven agentic image generation paradigm that empowers the agent to create like a human artist: first conceptualizing, then sketching, and finally coloring. Specifically, the agent first constructs the conceptual knowledge and context through search and reasoning. It then utilizes code (e.g., SVG, HTML, ThreeJS) to render executable visual sketches. Finally, it employs an image generation model to supplement textures, materials, and photorealism. In this workflow, code serves as a controllable intermediate canvas bridging linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. By transforming image generation from a black-box paradigm into a staged process akin to authentic human creation, GenClaw offers a step toward highly controllable and interpretable visual generation systems.
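As a rough illustration of "code as an intermediate canvas", the sketch below turns a structured layout into an SVG string that can be inspected or edited before any image model is involved. The layout schema, the `render_sketch` helper, and the two supported shapes are hypothetical; the abstract only states that the agent emits code such as SVG, HTML, or ThreeJS.

```python
import xml.etree.ElementTree as ET

# Hypothetical shape templates; only standard SVG elements and attributes are used.
TEMPLATES = {
    "rect": '<rect x="{x}" y="{y}" width="{w}" height="{h}" fill="{fill}"/>',
    "circle": '<circle cx="{cx}" cy="{cy}" r="{r}" fill="{fill}"/>',
}


def render_sketch(width, height, shapes):
    """Render a list of shape dictionaries into a standalone SVG document."""
    body = "".join(TEMPLATES[shape["kind"]].format(**shape) for shape in shapes)
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" '
        'width="{}" height="{}">{}</svg>'.format(width, height, body)
    )


layout = [
    {"kind": "rect", "x": 0, "y": 120, "w": 256, "h": 136, "fill": "#7a5"},   # ground
    {"kind": "circle", "cx": 200, "cy": 50, "r": 24, "fill": "#fc3"},         # sun
]
sketch = render_sketch(256, 256, layout)
root = ET.fromstring(sketch)  # raises ParseError if the sketch is not well-formed
```

Because positions and sizes are explicit numbers, an edit such as moving the sun is a one-field change to the layout instead of another round of prompt rewriting.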

2605.30215 2026-06-01 cs.CV 版本更新

Déjà View: Looping Transformers for Multi-View 3D Reconstruction

Déjà View: 用于多视图3D重建的循环Transformer

Alessandro Burzio, Tobias Fischer, Sven Elflein, Qunjie Zhou, Riccardo de Lutio, Jiawei Ren, Jiahui Huang, Shengyu Huang, Marc Pollefeys, Laura Leal-Taixé, Zan Gojcic, Haithem Turki

发表机构 * NVIDIA ； University of Modena and Reggio Emilia, AImageLab（摩德纳和雷焦艾米利亚大学，AImageLab）； University of Toronto, Vector Institute（多伦多大学，向量研究所）； ETH Zürich（苏黎世联邦理工学院）

AI总结提出DéjàView模型，通过循环应用单个Transformer块进行迭代细化，以更少的参数和计算量在多个3D重建基准上达到或超越大规模前馈模型。

Comments Project Page: https://research.nvidia.com/labs/dvl/projects/dvlt

AI中文摘要

近期的前馈式3D重建Transformer已扩展到超过十亿参数，遵循计算机视觉中模型容量增加的趋势。然而，新出现的证据表明，连续的Transformer层通常表现为类似操作的重复应用，而多视图重建Transformer在解码器深度上逐步优化其预测。我们认为模型深度在一定程度上换来的是迭代，但这种迭代是以互不共享的参数为代价低效获得的，因此我们将迭代显式地融入架构中。我们的模型DéjàView对每个视图的特征循环应用单个Transformer块，进行K步细化。只需训练一次，K即可作为推理时的计算量调节旋钮；在涵盖室内、室外、物体中心和驾驶场景的五个重建基准上，该模型匹配或优于规模大得多的前馈基线，同时仅使用其一小部分参数和相当或更低的计算量。重要的是，在训练数据和计算量相同的条件下，循环块方案优于其他方面完全相同、但每步使用独立参数的变体，这表明显式迭代不仅是计算高效的容量替代方案，而且是多视图3D重建更强的归纳偏置。

英文摘要

Recent feed-forward 3D reconstruction transformers have scaled to over a billion parameters, following the broader trend of increasing model capacity in computer vision. Yet emerging evidence suggests that contiguous transformer layers often behave like repeated applications of similar operations, and multi-view reconstruction transformers refine their predictions progressively across decoder depth. We posit that model depth partially buys iteration, paid for inefficiently in unique parameters, and instead make that iteration explicit in architecture. Our model, DéjàView, applies a single looped transformer block recurrently to per-view features for K refinement steps. Trained once, it exposes K as an inference-time compute knob, matching or outperforming substantially larger feed-forward baselines across five reconstruction benchmarks spanning indoor, outdoor, object-centric, and driving scenes, while using a fraction of their parameters and comparable or lower compute. Importantly, the same looped block formulation outperforms an otherwise identical variant with independent per-step parameters under matched training data and compute, suggesting that explicit iteration is not merely a compute-efficient substitute for capacity but a stronger inductive bias for multi-view 3D reconstruction.
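The weight-sharing idea can be shown with a toy example. The sketch below applies one residual block repeatedly, so the parameter count is fixed while the number of refinement steps K is chosen at call time. The tanh residual block is a stand-in chosen for brevity, not the paper's transformer block, and the names are illustrative.

```python
import math


def make_block(weights, bias):
    """Build a toy residual block x -> x + tanh(W x + b) with fixed parameters."""
    def block(x):
        update = [
            math.tanh(sum(w * v for w, v in zip(row, x)) + b)
            for row, b in zip(weights, bias)
        ]
        return [xi + ui for xi, ui in zip(x, update)]
    return block


def looped_forward(block, features, k):
    """Apply the same block k times; k is an inference-time compute knob."""
    for _ in range(k):
        features = block(features)
    return features


block = make_block([[0.1, -0.2], [0.3, 0.05]], [0.0, -0.1])
x = [1.0, 2.0]
coarse = looped_forward(block, x, 2)   # cheaper, fewer refinement steps
fine = looped_forward(block, x, 8)     # same parameters, more compute
```

The variant the abstract compares against would instead build K separate blocks with independent parameters, which multiplies the parameter count by K.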

2605.30060 2026-06-01 cs.CV 版本更新

SuperVoxelGPT: 自适应有序3D令牌化用于自回归形状生成

Yuan Li, Congyi Zhang, Xifeng Gao, Xiaohu Guo

发表机构 * University of Texas at Dallas（德克萨斯大学达拉斯分校）； Tencent America（腾讯美国）

AI总结提出SuperVoxelGPT框架，通过自适应且有序的超体素令牌化解决自回归3D生成中序列长度与空间顺序的矛盾，实现高质量、高效率的形状生成。

AI中文摘要

自回归多模态大语言模型（MLLMs）能够进行3D生成，但由于3D令牌化不足，难以扩展到高分辨率形状。基于集合的紧凑表示丢弃了确定性的空间排序，导致序列预测模糊，而均匀或基于八叉树的体素网格保留了排序，但代价是严重的冗余和过长的序列。这种结构上的权衡限制了稳定高效的自回归3D生成。我们提出了SuperVoxelGPT，一个以表示优先的框架，通过自适应且确定性的超体素令牌化解决了这一矛盾。给定提示，我们首先预测粗略的几何显著性分布，并使用显著性引导的质心Voronoi细分构建形状自适应的超体素划分，将细粒度单元分配给复杂区域，将较大单元分配给平滑区域。基于文本和有序的超体素布局，我们引入了SuperVoxelVAE，并微调预训练的MLLM以自回归生成超体素令牌。在Trellis-500K上的实验表明，SuperVoxelGPT将令牌序列长度减少到均匀体素令牌化的12.8%，同时实现了最先进的生成质量，并且相比先前方法平均加速10倍。

英文摘要

Autoregressive multimodal large language models (MLLMs) enable 3D generation but struggle to scale to high-resolution shapes due to inadequate 3D tokenizations. Compact set-based representations discard deterministic spatial ordering, leading to ambiguous sequence prediction, while uniform or octree-based voxel grids preserve ordering at the cost of severe redundancy and excessively long sequences. This structural trade-off limits stable and efficient autoregressive 3D generation. We present SuperVoxelGPT, a representation-first framework that resolves this tension through adaptive and deterministically ordered supervoxel tokenization. Given a prompt, we first predict a coarse geometric saliency distribution and construct a shape-adaptive supervoxel partition using saliency-guided centroidal Voronoi tessellation, allocating fine-grained cells to complex regions and larger cells to smooth regions. Conditioned on the text and ordered supervoxel layout, we introduce a SuperVoxelVAE and fine-tune a pretrained MLLM to autoregressively generate supervoxel tokens. Experiments on Trellis-500K show that SuperVoxelGPT reduces token sequence length to 12.8% of uniform voxel tokenization while achieving state-of-the-art generation quality and an average 10× speedup over prior methods.
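A saliency-guided centroidal Voronoi tessellation can be approximated with weighted Lloyd iterations: assign each sample to its nearest site, then move each site to the saliency-weighted centroid of its samples. The 2D sketch below illustrates that generic procedure and then sorts the sites to obtain a deterministic order. The 2D setting, the lexicographic ordering, and the function name are assumptions; the paper's actual 3D partitioning and ordering scheme is not described in the abstract.

```python
def weighted_lloyd(points, saliency, sites, iterations=10):
    """Weighted Lloyd relaxation in 2D, returning sites in a deterministic order."""
    sites = [tuple(site) for site in sites]
    for _ in range(iterations):
        sums = [[0.0, 0.0, 0.0] for _ in sites]  # weighted x, weighted y, total weight
        for (x, y), weight in zip(points, saliency):
            nearest = min(
                range(len(sites)),
                key=lambda i: (x - sites[i][0]) ** 2 + (y - sites[i][1]) ** 2,
            )
            sums[nearest][0] += weight * x
            sums[nearest][1] += weight * y
            sums[nearest][2] += weight
        # Sites with no assigned weight stay where they are.
        sites = [
            (sx / sw, sy / sw) if sw > 0 else old
            for (sx, sy, sw), old in zip(sums, sites)
        ]
    return sorted(sites)  # lexicographic order stands in for a spatial ordering


points = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0)]
uniform = weighted_lloyd(points, [1.0, 1.0, 1.0, 1.0], [(0.0, 0.0), (3.0, 0.0)])
skewed = weighted_lloyd(points, [1.0, 1.0, 1.0, 3.0], [(0.0, 0.0), (3.0, 0.0)])
```

With skewed saliency the second site is pulled toward the heavily weighted sample, which is the mechanism that concentrates cells in salient regions when many sites are used.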

2605.24700 2026-06-01 cs.CV cs.GR 版本更新

SRUG: Shadow-Guided Relightable Urban Scene with Generation Model

SRUG: 基于阴影引导的可重光照城市场景生成模型

Yonghao Zhao, Zexin Yin, Jian Yang, Beibei Wang, Jin Xie

发表机构 * College of Computer Science, Nankai University（南开大学计算机科学学院）； Nankai University（南开大学）； Nanjing University（南京大学）； School of Intelligence Science and Technology, Nanjing University（南京大学智能科学与技术学院）

AI总结提出SRUG框架，利用阴影引导3D补全模型恢复不可见区域几何，结合迭代材质分解和物理光照模型，实现从稀疏输入视图生成可重光照城市场景。

AI中文摘要

从图像或视频创建可重光照的城市场景具有广泛用途，但高度不适定。城市环境通常是无界的，且延伸到可见区域之外。因此，场景的许多部分未被观察到，但这些不可见区域会向可见区域投射阴影。合理建模这些不可见区域投射的阴影具有挑战性，并成为创建可重光照城市场景的主要障碍。同时，稀疏的输入视图和复杂的照明条件进一步使重光照复杂化，因为它们引入了材质分解中的严重歧义。在本文中，我们提出了SRUG（Shadow-guided Relightable Urban Scene with Generation model），一种新颖的框架，旨在解决城市场景中的重光照挑战。SRUG利用阴影引导3D补全模型恢复不可见区域的几何，促进物理合理阴影的合成。此外，SRUG采用迭代材质分解方案，应用大材质模型（LMM）提供材质监督，并迭代分解场景的材质属性，实现鲁棒的材质分解。基于这些组件，我们引入了一个基于物理的光照模型，该模型捕捉城市场景的复杂照明并支持可靠的重光照。大量的定量评估和视觉比较表明，我们的方法在新视图合成和重光照任务中均优于现有方法。

英文摘要

Creating relightable urban scenes from images or videos is widely useful but highly ill-posed. Urban environments are typically unbounded and extend beyond the visible regions. As a result, many portions of the scene remain unobserved, yet these invisible regions can cast shadows onto visible areas. Reasonably modeling shadows cast by such invisible regions is challenging and poses a significant obstacle to creating relightable urban scenes. At the same time, sparse input views and complex illumination conditions further complicate relighting, as they introduce severe ambiguities in material decomposition. In this paper, we propose Shadow-guided Relightable Urban Scene with Generation model (SRUG), a novel framework designed to address relighting challenges in urban scenes. SRUG leverages shadows to guide a 3D completion model for recovering the geometry of invisible regions, promoting the synthesis of physically reasonable shadows. In addition, SRUG employs an iterative material decomposition scheme that applies the large material model (LMM) to provide material supervision and iteratively decompose the scene's material properties, enabling robust material decomposition. Building upon these components, we introduce a physically-based lighting model that captures the complex illumination of urban scenes and supports reliable relighting. Extensive quantitative evaluations and visual comparisons demonstrate that our method outperforms existing approaches in both novel view synthesis and relighting tasks.

2605.29417 2026-06-01 cs.CV 版本更新

ParCo-SDF: Learning Prior-Free Partial-to-Complete Signed Distance Fields of Deformable Objects

ParCo-SDF: 学习可变形物体的无先验部分到完整有符号距离场

Deokmin Hwang, Minseok Song, Daehyung Park

发表机构 * School of Computing, Korea Advanced Institute of Science and Technology, Korea（韩国科学技术院计算机学院）

AI总结提出 ParCo-SDF 两阶段框架，通过时序几何编码和 FiLM 条件 SDF 预测，实现无需物体特定先验的可变形物体部分到完整几何重建。

Comments Accepted at the 23rd International Conference on Ubiquitous Robots (UR 2026), 6 pages

AI中文摘要

本研究针对从点云观测到可变形物体（DOs）的部分到完整几何重建，以实现精确的 DO 操作。最近的 DO 重建方法通常采用隐式神经表示（INRs）来建模连续表面并捕捉结构变异性。然而，这些方法通常依赖于物体特定的形状先验，这虽然提高了训练稳定性，但限制了泛化能力。为了解决这个问题，我们引入了 ParCo-SDF，一个两阶段的部分到完整有符号距离场（SDF）重建框架，包括时序几何编码和随后的 FiLM 条件 SDF 预测。时序编码器捕捉 DO 序列中的结构相似性，实现无先验的稳定训练。基于 FiLM 的条件化在降低网络复杂度的同时保持了重建的表达能力。我们在橡皮筋操作数据集上评估了所提方法与最先进的 DO 表面重建基线，证明了在严重遮挡下的鲁棒和高保真重建。

英文摘要

This study addresses the partial-to-complete geometry reconstruction of deformable objects (DOs) from point-cloud observations toward precise DO manipulation. Recent DO reconstruction approaches often adopt implicit neural representations (INRs) to model continuous surfaces and capture structural variability. However, these methods typically rely on object-specific shape priors that improve training stability but limit generalization. To address this, we introduce ParCo-SDF, a two-stage partial-to-complete signed distance field (SDF) reconstruction framework consisting of temporal geometry encoding followed by FiLM-conditioned SDF prediction. The temporal encoder captures structural similarity across DO sequences, enabling prior-free stable training. FiLM-based conditioning preserves reconstruction expressivity while reducing network complexity. We evaluate the proposed method against a state-of-the-art DO surface reconstruction baseline on a rubber band manipulation dataset, demonstrating robust and high-fidelity reconstruction under severe occlusions.
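FiLM (feature-wise linear modulation) conditions a network by scaling and shifting each feature channel: h' = gamma * h + beta, where gamma and beta are computed from a conditioning code. The sketch below shows that operation with plain lists. How ParCo-SDF produces the code and where it applies FiLM inside the SDF network is not stated in the abstract, so the single linear generator here is an assumption.

```python
def linear(matrix, bias, vector):
    """Dense layer: matrix @ vector + bias."""
    return [sum(w * v for w, v in zip(row, vector)) + b for row, b in zip(matrix, bias)]


def film(features, gamma, beta):
    """Feature-wise linear modulation: scale then shift each channel."""
    return [g * h + b for h, g, b in zip(features, gamma, beta)]


def film_layer(features, code, w_gamma, b_gamma, w_beta, b_beta):
    """Generate gamma and beta from a conditioning code, then modulate the features."""
    gamma = linear(w_gamma, b_gamma, code)
    beta = linear(w_beta, b_beta, code)
    return film(features, gamma, beta)


# A 1-D conditioning code modulating two feature channels.
modulated = film_layer(
    [1.0, 1.0], [1.0],
    w_gamma=[[2.0], [3.0]], b_gamma=[0.0, 0.0],
    w_beta=[[0.0], [0.0]], b_beta=[1.0, -1.0],
)
```

Since only gamma and beta depend on the observation, the SDF trunk can stay small and shared, which is consistent with the abstract's claim of reduced network complexity.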

2605.29299 2026-06-01 cs.CV cs.AI 版本更新

Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models

口袋牙医：通过高效多模态大语言模型实现设备端牙科图像理解

Kai Bian, Xucheng Guo, Bin Chen, Lingyan Ruan, Yiran Shen, Ting Dang, Hong Jia

发表机构 * The University of Auckland, New Zealand（奥克兰大学）； Shandong University, China（山东大学）； The University of Melbourne, Australia（墨尔本大学）

AI总结提出Pocket-Dentist基准，通过评估14种视觉语言模型发现紧凑模型（2B参数）在牙科图像理解中精度更高且计算成本更低，并在iPhone 17 Pro上实现低延迟部署。

AI中文摘要

牙科视觉语言模型的评估在数据集、任务定义和指标上仍然分散，并且常常忽略其计算成本。这限制了它们在专科中心之外的广泛部署用于牙科筛查，而及时推理、有限的硬件以及对患者图像的本地处理对于实用、保护隐私的临床预筛查至关重要。本文提出了Pocket-Dentist，一个面向牙科多模态问答的效率感知基准，它汇集了三个数据集，涵盖约1159名患者、五种任务类型和七种指标。在典型的14种VLM上，我们的结果揭示了一个有趣的观察：紧凑型VLM（例如2B参数模型）在牙科图像理解中精度更高，同时所需计算成本大幅降低。在iPhone 17 Pro上本地部署时，我们微调的紧凑型VLM Pocket-Dentist-2B处理每个样本耗时4.31秒，与7B基线相比延迟降低4.9倍，内存使用减少2.3倍。

英文摘要

Evaluations of dental vision-language models remain fragmented across datasets, task definitions and metrics, and often ignore their computational cost. This limits their widespread deployment for dental screening outside specialist centres, where timely inference, limited hardware, and local handling of patient images are vital for practical, privacy-preserving clinical prescreening. Here we present Pocket-Dentist, an efficiency-aware benchmark for dental multimodal question answering that brings together three datasets spanning approximately 1,159 patients, five task types and seven metrics. Across 14 typical VLMs, our results reveal an interesting observation: compact VLMs (e.g., 2B-parameter models) outperform larger VLMs in accuracy while requiring substantially lower computational costs in dental image understanding. Deployed locally on an iPhone 17 Pro, our finetuned compact VLM Pocket-Dentist-2B processed each sample in 4.31 s, reducing latency by 4.9-fold and memory use by 2.3-fold compared with a 7B baseline.

2605.29198 2026-06-01 cs.CV 版本更新

Guidance Contrastive Token Credit Assignment for Discrete Policy Optimization

引导对比令牌信用分配用于离散策略优化

Shufan Li, Konstantinos Kallidromitis, Akash Gokul, Yuta Kyuragi, Aditya Grover

发表机构 * UCLA ； Panasonic AI Research（松下人工智能研究）； NVIDIA（英伟达）

AI总结针对组优势强化学习方法中令牌级信用分配缺失的问题，提出引导对比策略优化（GCPO），通过正负提示下的对比预测分配令牌级优势，在文本到图像生成和思维链推理任务上优于GRPO和DAPO。

Comments 21 pages, 11 figures

详情

AI中文摘要

基于组优势的强化学习方法，如GRPO和DAPO，在包括数学推理和文本到图像生成在内的多个领域展示了强大的性能。然而，它们对样本级奖励的依赖引入了一个关键限制，即所有令牌的均匀信用分配无法捕捉细粒度的令牌级贡献。为了解决这个问题，我们提出了引导对比策略优化（GCPO），一种新颖的算法，通过对比正负提示下的模型预测来实现每个令牌的信用分配。GCPO不是均匀地广播样本级优势，而是分配与这些对比预测差异成比例的令牌级优势，从而提供更精确和信息丰富的学习信号。实验上，我们发现GCPO强调语义相关区域，例如文本到图像生成中与文本提示对齐的视觉区域，以及思维链任务中推理轨迹内的关键关键词。通过大量实验，GCPO在文本到图像生成和思维链推理基准测试上一致优于GRPO和DAPO基线，证明了其作为离散策略学习的通用且可扩展优化策略的有效性。

英文摘要

Group-advantage-based reinforcement learning methods, such as GRPO and DAPO, have demonstrated strong performance across diverse domains, including mathematical reasoning and text-to-image generation. However, their reliance on sample-level rewards introduces a key limitation as uniform credit assignment across all tokens fails to capture fine-grained, token-level contributions. To address this issue, we propose Guidance Contrastive Policy Optimization (GCPO), a novel algorithm that enables per-token credit assignment by contrasting model predictions under positive and negative prompts. Rather than uniformly broadcasting sample-level advantages, GCPO assigns token-level advantages proportional to the difference between these contrastive predictions, allowing more precise and informative learning signals. Empirically, we find that GCPO emphasizes semantically relevant regions such as visual areas aligned with textual prompts in text-to-image generation, and critical keywords within reasoning traces for chain-of-thought tasks. Through extensive experiments, GCPO consistently outperforms GRPO and DAPO baselines on both text-to-image generation and chain-of-thought reasoning benchmarks, demonstrating its effectiveness as a general and scalable optimization strategy for discrete policy learning.
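The contrast between broadcasting one advantage to every token and distributing it by a contrastive signal can be sketched as follows. Here each token's weight is the absolute gap between its log-probability under a positive and a negative prompt, normalized so the weights average to 1. The absolute-gap measure, the mean-1 normalization, and the uniform fallback are assumptions made for illustration; the abstract only says token-level advantages are proportional to the difference between the contrastive predictions.

```python
def token_advantages(sample_advantage, logp_positive, logp_negative):
    """Distribute a sample-level advantage over tokens by contrastive gap size."""
    gaps = [abs(p - n) for p, n in zip(logp_positive, logp_negative)]
    mean_gap = sum(gaps) / len(gaps)
    if mean_gap == 0:
        # No contrast anywhere: fall back to uniform broadcasting.
        return [sample_advantage] * len(gaps)
    return [sample_advantage * gap / mean_gap for gap in gaps]


# The second token reacts most strongly to the prompt change, so it gets most credit.
advantages = token_advantages(
    2.0,
    logp_positive=[-1.0, -0.5, -2.0],
    logp_negative=[-1.0, -1.5, -2.5],
)
```

With this normalization the mean token advantage equals the sample-level advantage, so the overall update scale stays comparable to uniform broadcasting while the per-token emphasis changes.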

2604.22409 2026-06-01 cs.CV 版本更新

SpaMEM: Benchmarking Dynamic Spatial Reasoning via Perception-Memory Integration in Embodied Environments

SpaMEM：具身环境中通过感知-记忆集成进行动态空间推理的基准测试

Chih-Ting Liao, Xi Xiao, Chunlei Meng, Zhangquan Chen, Yitong Qiao, Weilin Zhou, Tianyang Wang, Xu Zheng, Xin Cao

发表机构 * The University of New South Wales（新南威尔士大学）； The University of Alabama at Birmingham（阿拉巴马大学伯明翰分校）； Fudan University（复旦大学）； Tsinghua University（清华大学）； Zhejiang University（浙江大学）； Xinjiang University（新疆大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结提出SpaMEM基准，通过动作条件场景变换和多模态数据，分层评估多模态大模型在具身环境中的空间信念演化能力，揭示坐标一致性和视觉记忆瓶颈。

AI中文摘要

多模态大语言模型（MLLMs）在静态视觉-空间推理方面取得了进展，但在具身环境中，当信念必须根据环境变化下的自我中心观察不断修正时，它们往往无法保持长期的空间连贯性。我们引入了SpaMEM（动作序列的空间记忆），这是一个大规模诊断基准，通过长交互时间内的动作条件场景变换（生成、放置、移除）来隔离空间信念演化的机制。SpaMEM基于一个物理基础数据集构建，包含来自1000个程序生成房屋中25000多个交互序列的10,601,392张高保真图像，涵盖四种模态（RGB、深度、实例、语义分割）。我们将具身空间推理形式化为一个三级层次结构，包含15个诊断任务：第1级测量单次观察的原子空间感知；第2级利用神谕文本状态历史探测时间推理，以排除感知噪声；第3级要求在同一任务维度下从原始视觉流进行端到端的信念维护。我们还评估了短期（逐步）更新和长期（情节）重建。对代表性开源VLM系列的基准测试揭示了一个一致的堆叠瓶颈：坐标一致的定位仍然是一个硬上限，从第2级到第3级的急剧下降暴露了显著的符号脚手架依赖性，即模型在基于文本的记账中成功，但难以维持稳健的视觉记忆。SpaMEM提供了一个细粒度的诊断标准，并激发了状态表示、信念修正和长期情节集成的显式机制。SpaMEM的一个子集可在https://huggingface.co/datasets/mill-ct-liao/SpaMEM公开获取。

英文摘要

Multimodal large language models (MLLMs) have advanced static visual-spatial reasoning, yet they often fail to preserve long-horizon spatial coherence in embodied settings where beliefs must be continuously revised from egocentric observations under environmental change. We introduce SpaMEM (Spatial Memory from Action Sequences), a large-scale diagnostic benchmark that isolates the mechanics of spatial belief evolution via action-conditioned scene transformations (spawn, place, remove) over long interaction horizons. SpaMEM is built on a physically grounded dataset with 10,601,392 high-fidelity images across four modalities (RGB, depth, instance, semantic segmentation), collected from 25,000+ interaction sequences in 1,000 procedurally generated houses. We formalize embodied spatial reasoning as a three-level hierarchy with 15 diagnostic tasks: Level 1 measures atomic spatial perception from single observations; Level 2 probes temporal reasoning with oracle textual state histories to factor out perceptual noise; and Level 3 requires end-to-end belief maintenance from raw visual streams under the same task dimensions. We further evaluate both short-term (step-wise) updates and long-term (episodic) reconstruction. Benchmarking representative open-source VLM families reveals a consistent stacked bottleneck: coordinate-consistent grounding remains a hard ceiling, and the sharp collapse from Level 2 to Level 3 exposes a pronounced symbolic scaffolding dependency, where models succeed with text-based bookkeeping but struggle to sustain robust visual memory. SpaMEM provides a granular diagnostic standard and motivates explicit mechanisms for state representation, belief revision, and long-horizon episodic integration. A subset of SpaMEM is publicly available at https://huggingface.co/datasets/mill-ct-liao/SpaMEM.
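The "text-based bookkeeping" that models handle well at Level 2 amounts to replaying a symbolic action log. The sketch below keeps a belief state mapping object names to locations and updates it for the three action types named in the abstract. The action schema (`kind`, `object`, `location`) and the exact semantics of each action are hypothetical and are not taken from the benchmark's data format.

```python
def apply_action(belief, action):
    """Return a new belief state after one spawn, place, or remove action."""
    updated = dict(belief)
    kind, name = action["kind"], action["object"]
    if kind == "spawn":
        updated[name] = action["location"]        # a new object appears
    elif kind == "place":
        if name not in updated:
            raise KeyError(f"cannot place unknown object {name!r}")
        updated[name] = action["location"]        # an existing object moves
    elif kind == "remove":
        updated.pop(name, None)                   # the object leaves the scene
    else:
        raise ValueError(f"unknown action kind {kind!r}")
    return updated


def replay(actions, belief=None):
    """Episodic reconstruction: fold a whole action sequence into a final belief."""
    belief = dict(belief or {})
    for action in actions:
        belief = apply_action(belief, action)
    return belief


episode = [
    {"kind": "spawn", "object": "mug", "location": "counter"},
    {"kind": "spawn", "object": "book", "location": "sofa"},
    {"kind": "place", "object": "mug", "location": "table"},
    {"kind": "remove", "object": "book"},
]
final_belief = replay(episode)
```

Level 3 removes this symbolic log: the same final state has to be inferred from raw image streams, which is where the abstract reports the sharp drop.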

2603.09632 2026-06-01 cs.CV cs.CL 版本更新

X-GS: An Extensible Framework for Perceiving and Thinking via 3D Gaussian Splatting

X-GS：基于3D高斯溅射的感知与思考可扩展框架

Yueen Ma, Zenglin Xu, Irwin King

发表机构 * The Chinese University of Hong Kong（香港中文大学）； Fudan University（复旦大学）； Shanghai Academy of AI for Science（上海人工智能科学研究院）

AI总结提出X-GS框架，包含感知器和思考器，统一多种3DGS技术实现实时在线SLAM与语义蒸馏，并支持多模态模型完成下游任务。

AI中文摘要

3D高斯溅射（3DGS）已成为新颖视图合成的强大技术，随后扩展到众多空间AI应用。然而，大多数现有3DGS方法孤立运行，专注于特定领域。本文介绍X-GS，一个包含两个主要组件的可扩展框架。X-GS-感知器统一了广泛的3DGS技术，以实现具有语义蒸馏的实时在线SLAM。X-GS-思考器容纳多模态模型，使其能够与感知器无缝交互以完成下游任务。在我们的X-GS实现中，感知器利用最新的视觉基础模型提高在线SLAM性能，并采用三种关键机制加速语义蒸馏。思考器可以基于对比和生成视觉语言模型构建，并利用感知器的语义高斯溅射解锁3D视觉定位和场景描述等功能。在多个基准上的实验结果表明了X-GS框架的高效性和新解锁的多模态能力。

英文摘要

3D Gaussian Splatting (3DGS) has emerged as a powerful technique for novel view synthesis, subsequently extending into numerous spatial AI applications. However, most existing 3DGS methods operate in isolation, focusing on specific domains. In this paper, we introduce X-GS, an extensible framework consisting of two major components. The X-GS-Perceiver unifies a broad range of 3DGS techniques to enable real-time online SLAM with semantic distillation. The X-GS-Thinker accommodates multimodal models, enabling them to seamlessly interface with the Perceiver to complete downstream tasks. In our implementation of X-GS, the Perceiver leverages the latest vision foundation models to improve online SLAM performance and employs three key mechanisms to accelerate semantic distillation. The Thinker can be built upon both contrastive and generative vision-language models and utilizes the Perceiver's semantic Gaussian splats to unlock capabilities such as 3D visual grounding and scene captioning. Experimental results on diverse benchmarks demonstrate the efficiency and newly unlocked multimodal capabilities of the X-GS framework.

2602.01173 2026-06-01 cs.CV 版本更新

EEmo-Logic: A Unified Dataset and Multi-Stage Framework for Comprehensive Image-Evoked Emotion Assessment

EEmo-Logic：面向全面图像诱发情感评估的统一数据集与多阶段框架

Lancheng Gao, Ziheng Jia, Zixuan Xing, Wei Sun, Huiyu Duan, Guangtao Zhai, Xiongkuo Min

AI总结提出最大图像诱发情感理解数据集EEmoDB和统一多模态大语言模型EEmo-Logic，通过指令微调和任务定制GRPO实现细粒度情感问答与评估。

AI中文摘要

理解图像诱发情感的多维属性和强度细微差别对于提升机器共情能力和赋能多样化人机交互应用至关重要。然而，现有模型仍局限于粗粒度情感感知或推理能力不足。为弥补这一差距，我们引入了EEmoDB，这是迄今为止最大的图像诱发情感理解数据集。它包含跨越5个不同任务类别的5个分析维度，促进全面解读。具体而言，我们通过自动生成从125K张图像中整理了1.2M问答对（EEmoDB-QA），以及从25K张图像中策划了36K数据集（EEmoDB-Assess）用于细粒度评估。此外，我们提出了EEmo-Logic，一个通过指令微调和具有新颖奖励设计的任务定制组相对偏好优化（GRPO）开发的一体化多模态大语言模型（MLLM）。大量实验表明，EEmo-Logic在域内和跨域数据集上实现了稳健性能，在情感问答和细粒度评估方面表现出色。数据集和代码可在https://github.com/workerred/EEmo-Logic获取。

英文摘要

Understanding the multi-dimensional attributes and intensity nuances of image-evoked emotions is pivotal for advancing machine empathy and empowering diverse human-computer interaction applications. However, existing models are still limited to coarse-grained emotion perception or deficient reasoning capabilities. To bridge this gap, we introduce EEmoDB, the largest image-evoked emotion understanding dataset to date. It features 5 analysis dimensions spanning 5 distinct task categories, facilitating comprehensive interpretation. Specifically, we compile 1.2M question-answering (QA) pairs (EEmoDB-QA) from 125K images via automated generation, alongside a 36K dataset (EEmoDB-Assess) curated from 25K images for fine-grained assessment. Furthermore, we propose EEmo-Logic, an all-in-one multimodal large language model (MLLM) developed via instruction fine-tuning and task-customized group relative preference optimization (GRPO) with novel reward design. Extensive experiments demonstrate that EEmo-Logic achieves robust performance in in-domain and cross-domain datasets, excelling in emotion QA and fine-grained assessment. The dataset and code are available at https://github.com/workerred/EEmo-Logic.

2605.25193 2026-06-01 cs.CV 版本更新

SpongeBob: Sync-Aware Harmonious Audio-Visual Generative Editing

SpongeBob：同步感知的和谐视听生成式编辑

Sen Liang, Cong Wang, Fengbin Guan, Zhentao Yu, Yiting Lu, Yuanzhi Wang, Yuan Zhou, Xin Li, Zhibo Chen

发表机构 * University of Science and Technology of China（中国科学技术大学）； Tencent Hunyuan（腾讯混元）

AI总结提出首个端到端视听联合编辑框架SpongeBob，通过双向跨模态交互的同步感知机制和上下文感知模块，解决视频编辑中的音画不同步和语义冲突问题。

AI中文摘要

物理世界中的视觉和声学事件本质上是耦合的，然而现有的视频编辑方法通常采用解耦的流水线，缺乏双向模态交互。这导致两个关键限制：(i) 视听不同步和(ii) 生成的音频与保留内容之间的上下文冲突。为了解决这些问题，我们提出了SpongeBob，这是第一个具有双向跨模态交互的端到端视听联合编辑框架。对于同步，同步感知机制通过双向注意力、时间对齐和空间约束将视觉编辑与声音事件对齐。对于上下文一致性，上下文感知模块利用声学和视觉上下文注意力来防止语义冲突。此外，我们引入了同步保持训练和指导（SPTG），以在不降低质量的情况下增强对齐。由于配对数据的稀缺，我们构建了一个可扩展的数据流水线和一个大规模的主题级数据集。我们还提出了SpongeBob-Bench用于系统评估。实验表明，SpongeBob显著优于现有基线，将Sync-C提高了30%，Ctx-F1提高了12.5%。我们的项目页面位于：https://hy-spongebob.github.io/。

英文摘要

Visual and acoustic events in the physical world are inherently coupled, yet existing video editing methods typically adopt decoupled pipelines, lacking bidirectional modality interaction. This results in two key limitations: (i) audio-visual desynchronization and (ii) contextual conflicts between generated audio and preserved content. To address these, we propose SpongeBob, the first end-to-end audio-visual joint editing framework featuring bidirectional cross-modal interaction. For synchronization, a Sync-Aware Mechanism aligns visual edits with sound events via bidirectional attention, temporal alignment, and spatial constraints. For contextual consistency, a Context-Aware Module leverages acoustic and visual context attention to prevent semantic clashes. Additionally, we introduce Sync-Preserving Training and Guidance (SPTG) to enhance alignment without degrading quality. Due to the scarcity of paired data, we construct a scalable data pipeline and a large-scale subject-level dataset. We also propose SpongeBob-Bench for systematic evaluation. Experiments show SpongeBob significantly outperforms existing baselines, improving Sync-C by 30% and Ctx-F1 by 12.5%. Our project page is available at: https://hy-spongebob.github.io/.

2605.22478 2026-06-01 cs.CV 版本更新

DeliCIR: Deliberative Test-Time Evolutionary Hierarchical Multi-Agents for Composed Image Retrieval

DeliCIR: 用于组合图像检索的深思型测试时进化分层多智能体

Xingtian Pei, Yukun Song, Changwei Wang, Shunpeng Chen, Rongtao Xu, Shengpeng Xu, Shibiao Xu

发表机构 * Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center (National Supercomputer Center in Jinan), Qilu University of Technology (Shandong Academy of Sciences)（算力网络与信息安全教育部重点实验室，山东省计算中心（国家超级计算济南中心），齐鲁工业大学（山东省科学院））； School of Artificial Intelligence, Beijing University of Posts and Telecommunications（北京邮电大学人工智能学院）； School of Artificial Intelligence, Mohamed bin Zayed University（穆罕默德·本·扎耶德大学人工智能学院）

AI总结提出一种分层感知-深思框架PDF，通过分层多智能体架构、意图路由管理器、决策管理器及锦标赛式测试时缩放策略，实现经验自进化与测试时缩放定律在组合图像检索中的首次应用，在三个基准数据集上达到最优性能。

Comments 10 pages, 5 figures, 4 tables

AI中文摘要

组合图像检索（CIR）要求同时保留参考图像的视觉连续性并忠实执行修改文本中指定的语义变量，这构成了该任务的核心挑战。现有方法常常在单一空间中遭受感知近视，或由于底层检索器的感知上限而在迭代协作中陷入逻辑漂移。为解决这一问题，我们提出了一种一站式分层感知-深思框架（PDF），据我们所知，这是首次将经验自进化和测试时缩放定律（TTS）引入CIR。依托分层多智能体架构，PDF首先利用意图路由管理器根据修改意图动态调度多视角工作器感知信号，构建高召回候选池。随后，决策管理器结合无需训练的推理策略蒸馏机制与锦标赛式TTS（T-TTS）策略，实现自进化的细粒度推理，得出最终检索结果。实验结果表明，PDF在三个基准数据集CIRR、CIRCO和FashionIQ上均达到了最优性能。本研究表明，经验驱动的自进化和TTS是实现零样本细粒度多媒体检索的一条极具前景且可扩展的路径。代码将在论文被接收后公开。

英文摘要

Composed Image Retrieval (CIR) requires both preserving the visual continuity of the reference image and faithfully executing the semantic variables specified in the modification text, which constitute the core challenge of the task. Existing methods often suffer from Perception Myopia in a single space, or fall into Logic Drift in iterative collaboration due to the perception ceiling of the underlying retriever. To address this issue, we propose a one-stop hierarchical Perception-to-Deliberation Framework (PDF), which, to the best of our knowledge, is the first to introduce experience self-evolution and Test-Time Scaling Laws (TTS) into CIR. Relying on a hierarchical multi-agent architecture, PDF first utilizes an Intent Routing Manager to dynamically dispatch multi-view Worker perception signals based on modification intents to construct a high-recall candidate pool. Subsequently, the Decision Manager combines a Training-free Reasoning Policy Distillation mechanism with a Tournament-style TTS (T-TTS) strategy to achieve self-evolving fine-grained reasoning, yielding the final retrieval results. Experimental results demonstrate that PDF achieves SOTA performance on three benchmark datasets: CIRR, CIRCO, and FashionIQ. This study indicates that experience-driven self-evolution and TTS represent a highly promising and scalable path for achieving zero-shot fine-grained multimedia retrieval. The code will be made publicly available upon acceptance.
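The abstract does not describe how the tournament-style test-time scaling (T-TTS) strategy works internally. As a generic illustration of the tournament idea only, the sketch below runs a single-elimination bracket over retrieval candidates using a pairwise preference function. In a real system that function would be a model-based judge; here it is a hypothetical stand-in backed by fixed scores.

```python
def tournament(candidates, prefer):
    """Single-elimination selection: prefer(a, b) is True when a beats b."""
    pool = list(candidates)
    if not pool:
        raise ValueError("need at least one candidate")
    while len(pool) > 1:
        survivors = []
        for i in range(0, len(pool) - 1, 2):
            a, b = pool[i], pool[i + 1]
            survivors.append(a if prefer(a, b) else b)
        if len(pool) % 2:
            survivors.append(pool[-1])  # an unpaired candidate gets a bye
        pool = survivors
    return pool[0]


# Hypothetical relevance scores standing in for a pairwise judge.
scores = {"img_a": 0.41, "img_b": 0.77, "img_c": 0.52, "img_d": 0.90, "img_e": 0.63}
winner = tournament(scores, lambda a, b: scores[a] >= scores[b])
```

A bracket needs n - 1 pairwise comparisons for n candidates, so spending more test-time compute can mean a larger candidate pool or repeated brackets; which of these the paper uses is not stated.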

2605.20992 2026-06-01 cs.CV 版本更新

CHOIR: Contact-aware 4D Hand-Object Interaction Reconstruction

CHOIR: 接触感知的4D手物交互重建

Hao Xu, Yilin Liu, Yinqiao Wang, Chi-Wing Fu, Niloy J. Mitra

发表机构 * The Chinese University of Hong Kong（香港中文大学）； University College London（伦敦大学学院）； University College London, Adobe Research（伦敦大学学院，Adobe研究）

AI总结提出CHOIR框架，利用接触作为显式耦合信号，从单目视频中重建手物交互的4D序列，包括手部运动、物体形状与6D姿态以及接触信息，显著提升了物体重建、物理合理性和时间一致性。

AI中文摘要

我们探究是否可以将日常开放世界单目视频转化为可复用的4D交互基元：包括关节手部运动、随时间变化的物体形状与6D姿态，以及接触的时空信息。这种能力将支持真实交互的可扩展挖掘，并在重建之外，支持场景感知的合成与规划。然而，从具有挑战性的单目视频中重建手物交互（HOI）仍然困难：现有方法通常假设已知物体或精心设计的场景，且单独估计的手和物体在杂乱、遮挡和未见物体几何下容易错位。针对这一场景，我们提出CHOIR，一种面向单目相机的接触感知HOI重建框架，利用接触作为手和物体之间的显式耦合信号。CHOIR首先从开放世界视觉先验中初始化一个粗糙的、接触无关的4D HOI序列。然后引入一个生成式HOI空间修正模块，预测射线深度修正并纠正手物相对位置，随后在修正后的几何上推导出初始的逐帧接触对应关系。最后，采用带有动态更新接触约束的接触感知联合优化，强制执行几何、时间和接触一致性。在受控和具有挑战性的视频上的实验表明，CHOIR在物体重建、物理合理性和时间一致性上优于现有最先进方法。

英文摘要

We ask whether everyday open-world monocular videos can be turned into reusable 4D interaction primitives: articulated hand motion, object shape with 6D pose over time, and the when/where of contact. Such a capability would enable scalable mining of real interactions and, beyond reconstruction, support scene-aware synthesis and planning. However, reconstructing hand-object interaction (HOI) from challenging monocular videos remains difficult: methods often assume known objects or curated scenes, and separately estimated hands and objects easily become misaligned under clutter, occlusion, and unseen object geometries. Targeting this setting, we present CHOIR, a Contact-aware HOI Reconstruction framework for a monocular camera, using contact as an explicit coupling signal between hands and objects. CHOIR first initializes a coarse, contact-agnostic 4D HOI sequence from open-world visual priors. It then introduces a generative HOI spatial rectification module that predicts ray-depth corrections and rectifies hand-object relative placement, and then derives initial per-frame contact correspondences on the rectified geometry. Finally, a contact-aware joint optimization with dynamically updated contact constraints enforces geometric, temporal, and contact consistency. Experiments on controlled and challenging videos show that CHOIR improves object reconstruction, physical plausibility, and temporal consistency over state-of-the-art methods.

2605.21007 2026-06-01 cs.CV cs.RO 版本更新

LiteViLNet: Lightweight Vision-LiDAR Fusion Network for Efficient Road Segmentation

LiteViLNet: 轻量级视觉-激光雷达融合网络用于高效道路分割

Daojie Peng, Bingtao Wang, Fulong Ma, Liang Zhang, Jun Ma

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Shandong University（山东大学）

AI总结提出轻量级多模态网络LiteViLNet，通过双流编码器、深度可分离卷积和多尺度特征融合模块，在KITTI数据集上以14.04M参数达到96.36% MaxF，实现精度与效率的平衡。

AI中文摘要

道路分割是自动驾驶和智能机器人系统的基本感知任务，需要高精度和实时推理，特别是在资源受限的边缘设备上部署时。现有的多模态道路分割方法通常依赖重型基于Transformer的编码器以达到最先进的性能，但其巨大的计算成本阻碍了在嵌入式平台上的实时部署。为解决这一困境，我们提出了LiteViLNet，一种轻量级多模态网络，融合RGB纹理信息和LiDAR几何信息用于高效道路分割。具体来说，我们设计了双流轻量级编码器和深度可分离卷积，以最小的参数从两种模态中提取层次特征。我们进一步提出了多尺度特征融合模块（MSFM）以促进不同层次的跨模态交互，以及一个大核桥模块以线性复杂度捕获长距离依赖。在KITTI道路数据集和实际应用上的大量实验表明，LiteViLNet在准确性和效率之间取得了有希望的平衡。值得注意的是，仅用14.04M参数，我们的模型达到了96.36%的MaxF分数，在所有基于CNN的方法中排名最佳，并与更大的基于Transformer的模型相当，在RTX 4060 Ti上模型推理速度为163.79 FPS（在Jetson Orin NX上为22.18 FPS）。它在推理速度上优于许多重型方法，同时保持高度竞争的准确性，充分验证了LiteViLNet在自动驾驶和智能机器人中实时嵌入式部署的潜力。

英文摘要

Road segmentation is a fundamental perception task for autonomous driving and intelligent robotic systems, requiring both high accuracy and real-time inference, especially for deployment on resource-constrained edge devices. Existing multi-modal road segmentation methods often rely on heavy transformer-based encoders to achieve state-of-the-art performance, but their enormous computational cost prohibits real-time deployment on embedded platforms. To address this dilemma, we propose LiteViLNet, a lightweight multi-modal network that fuses RGB texture information and LiDAR geometric information for efficient road segmentation. Specifically, we design a dual-stream lightweight encoder and depth-wise separable convolutions to extract hierarchical features from both modalities with minimal parameters. We further propose a Multi-Scale Feature Fusion Module (MSFM) to facilitate cross-modal interaction at different levels, and a large-kernel-bridge module to capture long-range dependencies with linear complexity. Extensive experiments on the KITTI Road dataset and real-world applications demonstrate that LiteViLNet achieves a promising balance between accuracy and efficiency. Notably, with only 14.04M parameters, our model attains a 96.36% MaxF score, ranking the best among all CNN-based methods and being comparable to larger transformer-based models, and runs at 163.79 FPS in model-only inference on RTX 4060 Ti (22.18 FPS on Jetson Orin NX). It outperforms numerous heavy-weight methods in inference speed while maintaining highly competitive accuracy, fully validating the potential of LiteViLNet for real-time embedded deployment in autonomous driving and intelligent robotics.
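The parameter savings from depth-wise separable convolutions follow from a standard count. A regular k×k convolution has k·k·C_in·C_out weights, while the separable version has k·k·C_in (depth-wise) plus C_in·C_out (1×1 point-wise). The sketch below computes both, ignoring biases; the layer sizes are arbitrary examples, not LiteViLNet's actual configuration.

```python
def standard_conv_params(kernel, c_in, c_out):
    """Weights in a regular kernel x kernel convolution (bias ignored)."""
    return kernel * kernel * c_in * c_out


def depthwise_separable_params(kernel, c_in, c_out):
    """Weights in a depth-wise conv followed by a 1x1 point-wise conv (bias ignored)."""
    depthwise = kernel * kernel * c_in   # one kernel x kernel filter per input channel
    pointwise = c_in * c_out             # 1x1 conv that mixes channels
    return depthwise + pointwise


regular = standard_conv_params(3, 64, 128)
separable = depthwise_separable_params(3, 64, 128)
reduction = regular / separable
```

For a 3×3 layer the ratio approaches 9 as the output channel count grows, which is why this substitution is common in lightweight encoders.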

URL PDF HTML ☆

赞 0 踩 0

2605.18023 2026-06-01 cs.CV 版本更新

DSAA: Dual-Stage Attribute Activation for Fine-grained Open Vocabulary Detection

DSAA: 面向细粒度开放词汇检测的双阶段属性激活

Donghong Jiang, Endian Lin, Hanqing Liu, Mingjie Liu, Luoping Cui, Zhao Yang, Chuang Zhu

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Beijing E-Hualu Information Technology Co., Ltd.（北京易华录信息技术股份有限公司）； State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing, China（通用人工智能国家重点实验室，BIGAI，北京，中国）

AI总结提出DSAA框架，通过文本嵌入阶段的属性前缀适配器和BERT编码阶段的键/值调制器增强属性语义，并引入属性感知对比损失，提升细粒度开放词汇检测性能。

详情

AI中文摘要

开放词汇目标检测（OVD）模型打破了封闭集检测的限制，能够通过自然语言提示识别未见类别。然而，在涉及颜色、材质和纹理等属性的细粒度检测任务中，它们表现出明显的局限性。我们将OVD模型中的这一性能瓶颈归因于一个核心问题：当类别信号占主导时，OVD模型在推理过程中倾向于边缘化属性信息，导致属性与目标对象之间的错误绑定。为了解决这个问题，我们提出了双阶段属性激活（DSAA）框架，通过在两个关键阶段增强属性语义来提升细粒度检测能力。在文本嵌入阶段，我们采用属性前缀适配器（APA）模块生成属性前缀，注入显式的属性先验。为了进一步放大这些属性的影响，我们的键/值（K/V）调制器模块在BERT编码阶段进行干预，选择性地增强对应属性令牌的键和值向量。此外，我们引入了属性感知对比损失，以在训练过程中提高具有不同属性的同类别实例之间的区分度。在FG-OVD基准上的实验结果表明，我们的方法在各种主流开放词汇模型中均有效。

英文摘要

Open-Vocabulary Object Detection (OVD) models break the limitations of closed-set detection, enabling the identification of unseen categories through natural language prompts. However, they exhibit notable limitations in fine-grained detection tasks involving attributes like color, material, and texture. We attribute this performance bottleneck in OVD models to a core issue: when category signals dominate, OVD models tend to marginalize attribute information during inference. This leads to incorrect binding between attributes and target objects. To address this, we propose the Dual-Stage Attribute Activation (DSAA) framework, which enhances fine-grained detection capabilities by strengthening attribute semantics at two critical stages. In the text embedding stage, we employ an Attribute Prefix Adapter (APA) module to generate attribute prefixes that inject explicit attribute priors. To further amplify the influence of these attributes, our Key/Value (K/V) Modulator module then intervenes during the BERT encoding phase, selectively enhancing the Key and Value vectors of the corresponding attribute tokens. In addition, we introduce an attribute-aware contrastive loss to improve discrimination among same-category instances with different attributes during training. Experimental results on the FG-OVD benchmark demonstrate the effectiveness of our method across various mainstream open-vocabulary models.

URL PDF HTML ☆

赞 0 踩 0

2402.17672 2026-06-01 cs.CV eess.IV 版本更新

SDF2Net: Shallow to Deep Feature Fusion Network for PolSAR Image Classification

SDF2Net: 用于PolSAR图像分类的浅层到深层特征融合网络

Mohammed Q. Alkhatib, M. Sami Zitouni, Mina Al-Saad, Nour Aburaed, Hussain Al-Ahmad

AI总结提出一种新颖的三分支复值CNN融合网络SDF2Net，通过浅层到深层特征融合提升PolSAR图像分类精度，在三个数据集上取得优于现有方法的性能。

详情

AI中文摘要

极化合成孔径雷达（PolSAR）图像包含有价值的信息，有助于广泛的土地覆盖解释并生成多样化的输出产品。从PolSAR数据中提取有意义的特征面临与光学图像不同的挑战。深度学习方法为克服PolSAR特征提取中的这些挑战提供了有效解决方案。卷积神经网络（CNN）通过利用内核能力考虑局部信息和PolSAR数据的复值性质，在捕获PolSAR图像特征中发挥关键作用。本研究提出了一种新颖的三分支复值CNN融合网络，称为浅层到深层特征融合网络（SDF2Net），用于PolSAR图像分类。为了验证所提方法的性能，使用Flevoland和San Francisco的机载合成孔径雷达（AIRSAR）数据集以及ESAR Oberpfaffenhofen数据集，将分类结果与多种最先进方法进行比较。结果表明，所提方法在总体精度上有所提升，AIRSAR数据集提升1.3%和0.8%，ESAR数据集提升0.5%。对Flevoland数据的分析强调了SDF2Net模型的有效性，即使在仅1%采样率下，总体精度也达到了96.01%。

英文摘要

Polarimetric synthetic aperture radar (PolSAR) images encompass valuable information that can facilitate extensive land cover interpretation and generate diverse output products. Extracting meaningful features from PolSAR data poses challenges distinct from those encountered in optical imagery. Deep learning (DL) methods offer effective solutions for overcoming these challenges in PolSAR feature extraction. Convolutional neural networks (CNNs) play a crucial role in capturing PolSAR image characteristics by leveraging kernel capabilities to consider local information and the complex-valued nature of PolSAR data. In this study, a novel three-branch fusion of complex-valued CNN, named the Shallow to Deep Feature Fusion Network (SDF2Net), is proposed for PolSAR image classification. To validate the performance of the proposed method, classification results are compared against multiple state-of-the-art approaches using the airborne synthetic aperture radar (AIRSAR) datasets of Flevoland and San Francisco, as well as the ESAR Oberpfaffenhofen dataset. The results indicate that the proposed approach improves overall accuracy, with gains of 1.3% and 0.8% on the AIRSAR datasets and 0.5% on the ESAR dataset. Analyses conducted on the Flevoland data underscore the effectiveness of the SDF2Net model, revealing a promising overall accuracy of 96.01% even with only a 1% sampling ratio.

URL PDF HTML ☆

赞 0 踩 0

2601.15197 2026-06-01 cs.AI cs.CL cs.CV cs.RO 版本更新

LangForce: Bayesian Decomposition of Vision Language Action Models via Latent Action Queries

LangForce: 通过潜在动作查询对视觉语言动作模型进行贝叶斯分解

Shijie Lian, Bin Yu, Xiaopeng Lin, Laurence T. Yang, Zhaolong Shen, Changti Wu, Yuzhuo Miao, Cong Huang, Kai Chen

发表机构 * Huazhong University of Science and Technology（华中科技大学）； Beijing Zhongguancun Academy（北京中关村学院）； Zhongguancun Institute of Artificial Intelligence（中关村人工智能研究院）； Harbin Institute of Technology（哈尔滨工业大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Zhengzhou University（郑州大学）； Beihang University（北京航空航天大学）； East China Normal University（华东师范大学）； DeepCybot Co., Ltd.（DeepCybot有限公司）

AI总结针对VLA模型在训练中因数据偏差导致语言信息被忽略的问题，提出LangForce框架，通过贝叶斯分解和潜在动作查询构建双分支架构，最大化动作与指令的点互信息，无需新数据即可显著提升泛化能力。

Comments ICML 2026

详情

AI中文摘要

视觉-语言-动作（VLA）模型在机器人操作中显示出潜力，但往往难以泛化到新指令或复杂的多任务场景。我们识别出当前训练范式中的一个关键病理：目标驱动的数据收集造成了数据集偏差。在此类数据集中，仅凭视觉观察就能高度预测语言指令，导致指令与动作之间的条件互信息消失，我们将此现象称为信息崩溃。因此，模型退化为忽略语言约束的纯视觉策略，并在分布外（OOD）设置中失败。为解决此问题，我们提出LangForce，一种通过贝叶斯分解强制执行指令跟随的新框架。通过引入可学习的潜在动作查询，我们构建了一个双分支架构，用于估计纯视觉先验 $p(a \mid v)$ 和语言条件后验 $\pi(a \mid v, \ell)$。然后我们优化策略以最大化动作与指令之间的条件点互信息（PMI）。该目标有效惩罚了视觉捷径，并奖励明确解释语言命令的动作。无需新数据，LangForce显著提升了泛化能力。在SimplerEnv和RoboCasa上的大量实验证明了显著改进，包括在具有挑战性的OOD SimplerEnv基准上提升11.3%，验证了我们的方法在动作中稳健地锚定语言的能力。

英文摘要

Vision-Language-Action (VLA) models have shown promise in robot manipulation but often struggle to generalize to new instructions or complex multi-task scenarios. We identify a critical pathology in current training paradigms where goal-driven data collection creates a dataset bias. In such datasets, language instructions are highly predictable from visual observations alone, causing the conditional mutual information between instructions and actions to vanish, a phenomenon we term Information Collapse. Consequently, models degenerate into vision-only policies that ignore language constraints and fail in out-of-distribution (OOD) settings. To address this, we propose LangForce, a novel framework that enforces instruction following via Bayesian decomposition. By introducing learnable Latent Action Queries, we construct a dual-branch architecture to estimate both a vision-only prior $p(a \mid v)$ and a language-conditioned posterior $\pi(a \mid v, \ell)$. We then optimize the policy to maximize the conditional Pointwise Mutual Information (PMI) between actions and instructions. This objective effectively penalizes the vision shortcut and rewards actions that explicitly explain the language command. Without requiring new data, LangForce significantly improves generalization. Extensive experiments on SimplerEnv and RoboCasa demonstrate substantial gains, including an 11.3% improvement on the challenging OOD SimplerEnv benchmark, validating the ability of our approach to robustly ground language in action.
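The conditional PMI objective described in the abstract can be illustrated with a toy calculation. This is a minimal sketch using the standard definition PMI(a; l | v) = log pi(a | v, l) - log p(a | v); the three discrete actions and their probabilities are made-up values, and the sketch does not reproduce LangForce's actual training loss or architecture.

```python
import math


def conditional_pmi(posterior_prob, prior_prob):
    """PMI(a; l | v) = log pi(a | v, l) - log p(a | v)."""
    return math.log(posterior_prob) - math.log(prior_prob)


# Hypothetical distributions over three actions for one observation v.
prior = {"pick": 0.6, "push": 0.3, "open": 0.1}      # vision-only prior p(a | v)
posterior = {"pick": 0.6, "push": 0.1, "open": 0.3}  # language-conditioned pi(a | v, l)

pmi = {a: conditional_pmi(posterior[a], prior[a]) for a in prior}
for action, value in pmi.items():
    print(f"{action}: {value:+.3f}")
```

An action whose probability does not change when the instruction is added ("pick" here) gets a PMI of zero, so it earns no reward for being a vision shortcut, while an action that becomes more likely only because of the instruction ("open") gets a positive score.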

URL PDF HTML ☆

赞 0 踩 0

2511.16084 2026-06-01 cs.CV cs.AI 版本更新

SpectralTrain: A Universal Framework for Hyperspectral Image Classification

SpectralTrain：一种通用的高光谱图像分类框架

Meihua Zhou, Liping Yu, Xinyu Tong, Wai Kin Fung, Ruiguo Hu, Jiarui Zhao, Nan Wan

发表机构 * School of Medical Information, Wannan Medical University（皖南医学院信息学院）； University of Chinese Academy of Sciences（中国科学院大学）； The Chinese University of Hong Kong（香港中文大学）； Northeastern University（东北大学）

AI总结提出SpectralTrain通用训练框架，通过课程学习与基于PCA的光谱下采样提升高光谱图像分类效率，在多个数据集上实现2-7倍训练加速且精度损失小。

详情

AI中文摘要

高光谱图像（HSI）分类通常涉及大规模数据和计算密集的训练，这限制了深度学习模型在实际遥感任务中的部署。本研究引入SpectralTrain，一个通用的、与架构无关的训练框架，通过将课程学习（CL）与基于主成分分析（PCA）的光谱下采样相结合，提高学习效率。通过逐步引入光谱复杂性同时保留关键信息，SpectralTrain能够在显著降低计算成本的情况下高效学习光谱-空间模式。该框架独立于特定架构、优化器或损失函数，并与经典和最先进（SOTA）模型兼容。在三个基准数据集——Indian Pines、Salinas-A和新引入的CloudPatch-7上的大量实验表明，该框架在空间尺度、光谱特性和应用领域上具有很强的泛化能力。结果显示，训练时间一致减少2-7倍，精度变化取决于骨干网络。在云分类上的应用进一步揭示了其在气候相关遥感中的潜力，强调训练策略优化作为HSI模型中架构设计的有效补充。代码可在https://github.com/mh-zhou/SpectralTrain获取。

英文摘要

Hyperspectral image (HSI) classification typically involves large-scale data and computationally intensive training, which limits the practical deployment of deep learning models in real-world remote sensing tasks. This study introduces SpectralTrain, a universal, architecture-agnostic training framework that enhances learning efficiency by integrating curriculum learning (CL) with principal component analysis (PCA)-based spectral downsampling. By gradually introducing spectral complexity while preserving essential information, SpectralTrain enables efficient learning of spectral-spatial patterns at significantly reduced computational cost. The framework is independent of specific architectures, optimizers, or loss functions and is compatible with both classical and state-of-the-art (SOTA) models. Extensive experiments on three benchmark datasets (Indian Pines, Salinas-A, and the newly introduced CloudPatch-7) demonstrate strong generalization across spatial scales, spectral characteristics, and application domains. The results indicate consistent 2-7x training speedups, with small-to-moderate accuracy changes depending on the backbone. Its application to cloud classification further reveals potential in climate-related remote sensing, emphasizing training strategy optimization as an effective complement to architectural design in HSI models. Code is available at https://github.com/mh-zhou/SpectralTrain.

URL PDF HTML ☆

赞 0 踩 0

2605.11367 2026-06-01 cs.CV 版本更新

用于实时无人机桥梁检测的鲁棒轻量级裂缝分类

Wei Li, Haisheng Li, Weijie Li, Jiandong Wang, Kaichen Ma, Luming Yang

发表机构 * Bay Area Super Bridge Maintenance Technology Center, Guangdong Provincial Highway Construction Co., Ltd., Guangdong, China（广东省高速公路建设有限公司湾区超级桥梁维护技术中心，中国广东）； Guangdong AIHISUN Technology Co., Ltd., Guangdong, China（广东AIHISUN技术有限公司，中国广东）

AI总结提出一个由轻量级骨干网络、CBAM注意力模块、基于场景先验的定向鲁棒增强策略和Focal Loss组成的统一轻量级CNN框架，在SDNET2018数据集上以11.21M参数和1.82G FLOPs实现825 FPS推理速度，F1分数提升2.51%，召回率提升3.95%。

详情

AI中文摘要

随着无人机在桥梁结构健康监测中的广泛应用，基于深度学习的自动裂缝检测已成为主要研究热点。然而，实际无人机检测仍面临四个关键挑战：弱裂缝特征、退化成像条件、严重类别不平衡以及实际无人机检测工作流程中有限的计算资源。为了解决这些问题，本文提出了一个统一的轻量级卷积神经网络框架，由四个协同组件组成：轻量级骨干网络、用于通道和空间增强的卷积块注意力模块（CBAM）、基于检测场景先验的定向鲁棒增强策略，以及用于类别不平衡下难样本学习的Focal Loss。在SDNET2018桥面数据集上的实验表明，所提方法仅以11.21M参数和1.82G FLOPs实现了825 FPS的推理速度。与基线模型相比，完整框架的F1分数提高了2.51%，召回率提高了3.95%。此外，Grad-CAM可视化表明，引入的注意力模块将模型关注点从分散区域转移到沿裂缝轨迹的精确跟踪。总体而言，本研究在准确性、速度和鲁棒性之间取得了强平衡，为无人机桥梁检测中地面站辅助的实时部署提供了实用解决方案。源代码可在 https://github.com/skylynf/AttXNet 获取。

英文摘要

With the widespread application of Unmanned Aerial Vehicles (UAVs) in bridge structural health monitoring, deep learning-based automatic crack detection has become a major research focus. However, practical UAV inspections still face four key challenges: weak crack features, degraded imaging conditions, severe class imbalance, and limited computational resources for practical UAV inspection workflows. To address these issues, this paper proposes a unified lightweight convolutional neural network framework composed of four synergistic components: a lightweight backbone network, a Convolutional Block Attention Module (CBAM) for channel and spatial enhancement, a directed robust augmentation strategy based on inspection-scene priors, and Focal Loss for hard-sample learning under class imbalance. Experiments on the SDNET2018 bridge deck dataset show that the proposed method achieves an inference speed of 825 FPS with only 11.21M parameters and 1.82G FLOPs. Compared with the baseline model, the complete framework improves the F1-score by 2.51% and recall by 3.95%. In addition, Grad-CAM visualizations indicate that the introduced attention module shifts the model's focus from scattered regions to precise tracking along crack trajectories. Overall, this study achieves a strong balance among accuracy, speed, and robustness, providing a practical solution for ground-station assisted real-time deployment in UAV bridge inspections. The source code is available at https://github.com/skylynf/AttXNet.
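Focal Loss, which the abstract uses for hard-sample learning under class imbalance, down-weights well-classified samples so that rare or difficult crack images dominate the gradient. The sketch below implements the standard binary form FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t). The values gamma = 2 and alpha = 0.25 are common defaults assumed here for illustration, not hyperparameters reported for this paper.

```python
import math


def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for predicted crack probability p and label y in {0, 1}."""
    p_t = p if y == 1 else 1.0 - p              # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha  # class-balancing weight
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss(0.95, 1)  # confident, correct prediction on a crack image
hard = focal_loss(0.10, 1)  # crack image the model almost misses
ce_hard = -math.log(0.10)   # plain cross-entropy on the same hard sample
print(easy, hard, ce_hard)
```

With gamma = 0 and alpha = 1 the function reduces to ordinary cross-entropy for positive samples, which makes it easy to check an implementation against a known baseline.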

URL PDF HTML ☆

赞 0 踩 0

2604.20395 2026-06-01 cs.CV cs.RO 版本更新

SpaCeFormer: Fast Proposal-Free Open-Vocabulary 3D Instance Segmentation

SpaCeFormer: 快速无提议开放词汇3D实例分割

Chris Choy, Junha Lee, Chunghyun Park, Minsu Cho, Jan Kautz

发表机构 * NVIDIA

AI总结提出SpaCeFormer，一种基于空间曲线变换的无提议方法，在0.12-0.30秒内完成场景分割，比多阶段2D+3D流水线快2-3个数量级，并构建了最大开放词汇3D实例分割数据集SpaCeFormer-3M，在ScanNet200上零样本mAP达11.1，提升2.8倍。

Comments Project page: https://nvlabs.github.io/SpaCeFormer/

详情

AI中文摘要

开放词汇3D实例分割是机器人和AR/VR的核心能力，但先前方法存在瓶颈：多阶段2D+3D流水线聚合基础模型输出需数百秒每场景，而伪标签端到端方法依赖碎片化掩码和外部区域提议。我们提出SpaCeFormer，一种无提议的空间曲线变换器，在标准基准上每场景运行0.12-0.30秒，比多阶段2D+3D流水线快2-3个数量级。我们将其与SpaCeFormer-3M配对，这是最大的开放词汇3D实例分割数据集（通过多视图掩码聚类和多视图VLM标注构建，包含来自7.4K场景的604K实例的3.0M多视图一致描述）；其掩码召回率比先前单视图流水线高21倍（IoU>0.5时54.3% vs 2.5%）。SpaCeFormer结合空间窗口注意力与Morton曲线序列化以获得空间连贯特征，并使用RoPE增强解码器直接从学习到的查询预测实例掩码，无需外部提议。在ScanNet200上，我们实现11.1零样本mAP，比先前最佳无提议方法提升2.8倍；在ScanNet++和Replica上，我们达到22.9和24.1 mAP，超越包括使用多视图2D输入在内的所有先前方法。

英文摘要

Open-vocabulary 3D instance segmentation is a core capability for robotics and AR/VR, but prior methods trade one bottleneck for another: multi-stage 2D+3D pipelines aggregate foundation-model outputs at hundreds of seconds per scene, while pseudo-labeled end-to-end approaches rely on fragmented masks and external region proposals. We present SpaCeFormer, a proposal-free space-curve transformer that runs in 0.12--0.30 seconds per scene across standard benchmarks, 2--3 orders of magnitude faster than multi-stage 2D+3D pipelines. We pair it with SpaCeFormer-3M, the largest open-vocabulary 3D instance segmentation dataset (3.0M multi-view-consistent captions over 604K instances from 7.4K scenes) built through multi-view mask clustering and multi-view VLM captioning; it reaches 21$\times$ higher mask recall than prior single-view pipelines (54.3% vs 2.5% at IoU$>$0.5). SpaCeFormer combines spatial window attention with Morton-curve serialization for spatially coherent features, and uses a RoPE-enhanced decoder to predict instance masks directly from learned queries without external proposals. On ScanNet200 we achieve 11.1 zero-shot mAP, a 2.8$\times$ improvement over the prior best proposal-free method; on ScanNet++ and Replica, we reach 22.9 and 24.1 mAP, surpassing all prior methods including those using multi-view 2D inputs.
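Morton-curve serialization, which the abstract names as the way points are ordered to obtain spatially coherent features, can be illustrated with a generic Z-order sketch. The voxel size, bit depth, and function names below are assumptions made for illustration; the paper's actual implementation may differ.

```python
def morton3d(x, y, z, bits=10):
    """Interleave the low `bits` bits of non-negative integer voxel coordinates."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)
        code |= ((y >> i) & 1) << (3 * i + 1)
        code |= ((z >> i) & 1) << (3 * i + 2)
    return code


def serialize(points, voxel_size=0.05):
    """Return point indices sorted along the Morton (Z-order) curve."""
    mins = [min(p[d] for p in points) for d in range(3)]
    keys = []
    for idx, p in enumerate(points):
        # Quantize each coordinate to a non-negative voxel index.
        voxel = [int((p[d] - mins[d]) / voxel_size) for d in range(3)]
        keys.append((morton3d(*voxel), idx))
    return [idx for _, idx in sorted(keys)]


# Hypothetical three-point cloud on a 1 m grid.
order = serialize([(1.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)], voxel_size=1.0)
print(order)
```

Sorting by the interleaved key places points that are close in 3D near each other in the 1D sequence, so fixed-size windows over that sequence mostly cover spatially local neighborhoods.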

URL PDF HTML ☆

赞 0 踩 0

2604.09429 2026-06-01 cs.CV cs.AI cs.LG 版本更新

Rays as Pixels: Learning A Joint Distribution of Videos and Camera Trajectories

射线即像素：学习视频与相机轨迹的联合分布

Wonbong Jang, Shikun Liu, Soubhik Sanyal, Juan Camilo Perez, Kam Woh Ng, Sanskar Agrawal, Juan-Manuel Perez-Rua, Yiannis Douratsos, Tao Xiang

发表机构 * Meta AI

AI总结提出一种视频扩散模型（Rays as Pixels），通过将相机表示为密集射线像素（raxels）并与视频帧共享潜在空间，联合去噪实现相机轨迹预测和相机控制视频生成。

Comments Accepted to ICML 2026. 9-page main paper plus supplementary material. Project page: https://wbjang.github.io/raysaspixels/

详情

AI中文摘要

从图像恢复相机参数和从新视角渲染场景在计算机视觉和图形学中被视为独立任务。当图像覆盖稀疏或姿态模糊时，这种分离会失效，因为每个任务依赖于另一个任务的输出。我们提出Rays as Pixels，一种视频扩散模型（VDM），学习视频和相机轨迹的联合分布。据我们所知，这是首个在单一框架内预测相机姿态并进行相机控制视频生成的模型。我们将每个相机表示为密集射线像素（raxels），这是一种与视频帧位于同一潜在空间的像素对齐编码，并通过解耦自交叉注意力机制联合去噪两者。一个训练好的模型处理三个任务：从视频预测相机轨迹、沿预定义轨迹从输入图像生成视频、以及从输入图像联合合成视频和轨迹。我们在姿态估计和相机控制视频生成上进行评估，并引入闭环自一致性测试，显示模型预测的姿态及其基于这些姿态的渲染结果一致。与Plücker嵌入的消融实验证实，将相机与视频共享潜在空间显著更有效。

英文摘要

Recovering camera parameters from images and rendering scenes from novel viewpoints have been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task depends on what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. To our knowledge, this is the first model to predict camera poses and perform camera-controlled video generation within a single framework. We represent each camera as dense ray pixels (raxels), a pixel-aligned encoding that lives in the same latent space as video frames, and denoise the two jointly through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, generating video from input images along a pre-defined trajectory, and jointly synthesizing video and trajectory from input images. We evaluate on pose estimation and camera-controlled video generation, and introduce a closed-loop self-consistency test showing that the model's predicted poses and its renderings conditioned on those poses agree. Ablations against Plücker embeddings confirm that representing cameras in a shared latent space with video is substantially more effective.

URL PDF HTML ☆

赞 0 踩 0

2604.20650 2026-06-01 cs.CV 版本更新

MAPRPose: Mask-Aware Proposal and Amodal Refinement for Multi-Object 6D Pose Estimation

MAPRPose: 面向多目标6D姿态估计的掩膜感知提议与模态补全精化

Yang Luo, Yan Gong, Yongsheng Gao, Xiaoying Sun, Jie Zhao

发表机构 * State Key Laboratory of Robotics and Systems, Harbin Institute of Technology（机器人系统国家重点实验室，哈尔滨工业大学）； School of Civil Engineering, Harbin Institute of Technology（土木工程学院，哈尔滨工业大学）； Shenzhen Infinite Meta Robot Co., Ltd（深圳无限元机器人有限公司）

AI总结提出MAPRPose两阶段框架，通过掩膜感知对应关系生成姿态提议和模态补全驱动的ROI预测实现鲁棒精化，在BOP基准上达到76.5%平均召回率，比FoundationPose高3.1%且多目标推理加速43倍。

详情

AI中文摘要

在杂乱场景中，6D物体姿态估计由于严重遮挡和传感器噪声仍然具有挑战性。我们提出MAPRPose，一个两阶段框架，利用掩膜感知对应关系进行姿态提议，并利用模态补全驱动的感兴趣区域（ROI）预测进行鲁棒精化。在掩膜感知姿态提议（MAPP）阶段，我们将2D对应关系提升到3D空间，建立可靠的关键点匹配，并基于对应关系评分生成几何一致的姿态假设，从中选择前K个候选。在精化阶段，我们引入了一个张量化渲染-比较流水线，集成了模态补全掩膜预测和ROI重新对齐（AMPR）模块。通过重建完整的物体几何并动态调整ROI，AMPR减轻了严重遮挡下的定位误差和空间错位。此外，我们的GPU加速RGB-XYZ重投影使得所有N×B个姿态假设能够在单次前向传播中同时精化。在BOP基准上评估，MAPRPose实现了76.5%的最先进平均召回率（AR），比FoundationPose高出3.1% AR，同时在多目标推理中实现了43倍加速。

英文摘要

6D object pose estimation in cluttered scenes remains challenging due to severe occlusion and sensor noise. We propose MAPRPose, a two-stage framework that leverages mask-aware correspondences for pose proposal and amodal-driven Region-of-Interest (ROI) prediction for robust refinement. In the Mask-Aware Pose Proposal (MAPP) stage, we lift 2D correspondences into 3D space to establish reliable keypoint matches and generate geometrically consistent pose hypotheses based on correspondence-level scoring, from which the top-$K$ candidates are selected. In the refinement stage, we introduce a tensorized render-and-compare pipeline integrated with an Amodal Mask Prediction and ROI Re-Alignment (AMPR) module. By reconstructing complete object geometry and dynamically adjusting the ROI, AMPR mitigates localization errors and spatial misalignment under heavy occlusion. Furthermore, our GPU-accelerated RGB-XYZ reprojection enables simultaneous refinement of all $N \times B$ pose hypotheses in a single forward pass. Evaluated on the BOP benchmark, MAPRPose achieves a state-of-the-art Average Recall (AR) of 76.5%, outperforming FoundationPose by 3.1% AR while delivering a 43x speedup in multi-object inference.

URL PDF HTML ☆

赞 0 踩 0

2604.10805 2026-06-01 cs.CV 版本更新

Analytical Modeling and Correction of Distance Error in Homography-Based Ground-Plane Mapping

基于单应性的地面映射中距离误差的解析建模与校正

Mateusz Szulc, Marcin Iwanowski

发表机构 * Institute of Control and Industrial Electronics, Faculty of Electrical Engineering, Warsaw University of Technology, ul. Koszykowa 75, 00-662 Warsaw, Poland

AI总结本文推导了单应性扰动与距离误差的解析关系，提出基于回归和梯度下降的两种校正策略，并通过大规模仿真验证了其有效性。

Comments 7 pages, 4 figures

详情

AI中文摘要

从单目相机准确估计距离对于智能监控系统至关重要。在许多部署中，通过手动选择对应区域初始化的平面单应性将图像坐标映射到地面位置。这种初始化中的微小不准确性会传播为系统性的距离失真。本文推导了单应性扰动与由此产生的距离误差之间的显式关系，表明误差大致随距相机的真实距离呈二次增长。基于该模型，评估了两种简单的校正策略：基于回归的二次误差函数估计和通过基于坐标的梯度下降直接优化单应性。一项包含超过1900万个测试样本的大规模仿真研究表明，当模型可靠拟合时，回归可实现更高的峰值精度，而梯度下降在初始校准较差时具有更强的鲁棒性。这表明，在许多实际系统中，改进几何校准可能比增加模型复杂度带来更大的性能提升。

英文摘要

Accurate distance estimation from monocular cameras is essential for intelligent monitoring systems. In many deployments, image coordinates are mapped to ground positions using planar homographies initialized by manual selection of corresponding regions. Small inaccuracies in this initialization propagate into systematic distance distortions. This paper derives an explicit relationship between homography perturbations and the resulting distance error, showing that the error grows approximately quadratically with the true distance from the camera. Based on this model, two simple correction strategies are evaluated: regression-based estimation of the quadratic error function and direct optimization of the homography via coordinate-based gradient descent. A large-scale simulation study with more than 19 million test samples demonstrates that regression achieves higher peak accuracy when the model is reliably fitted, whereas gradient descent provides greater robustness against poor initial calibration. This suggests that improving geometric calibration may yield greater performance gains than increasing model complexity in many practical systems.
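The regression-based correction mentioned in the abstract can be sketched as follows. This is a simplified illustration that assumes a purely quadratic bias, measured = d + a * d^2, fitted from reference points with known true distances. The one-parameter model, the function names, and the synthetic calibration data are assumptions and are not the paper's exact procedure.

```python
import math


def fit_quadratic_error(true_d, measured_d):
    """Least-squares fit of a in: measured - true ~= a * true**2."""
    num = sum(d * d * (m - d) for d, m in zip(true_d, measured_d))
    den = sum(d ** 4 for d in true_d)
    return num / den


def correct(measured, a):
    """Invert measured = d + a * d**2 for the true distance d >= 0."""
    if abs(a) < 1e-12:
        return measured  # no measurable quadratic bias
    # Positive root of a*d^2 + d - measured = 0 (assumes 1 + 4*a*measured >= 0).
    return (-1.0 + math.sqrt(1.0 + 4.0 * a * measured)) / (2.0 * a)


# Hypothetical calibration points (metres) with a synthetic bias of 0.002 * d^2.
true_d = [5.0, 10.0, 20.0, 40.0]
measured_d = [d + 0.002 * d * d for d in true_d]

a = fit_quadratic_error(true_d, measured_d)
recovered = correct(30.0 + 0.002 * 30.0 ** 2, a)
print(a, recovered)
```

Because the bias grows with the square of the distance, a coefficient that looks negligible near the camera still produces errors of several metres at long range, which is why a cheap regression step can matter in practice.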

URL PDF HTML ☆

赞 0 踩 0

2604.10273 2026-06-01 cs.CV 版本更新

Dual-Exposure Imaging with Events

基于事件的双曝光成像

Mingyuan Lin, Hongyi Liu, Chu He, Wen Yang, Gui-Song Xia, Lei Yu

发表机构 * School of Electronic Information, Wuhan University（武汉大学电子信息学院）； School of Artificial Intelligence, Wuhan University（武汉大学人工智能学院）

AI总结提出事件辅助的双曝光成像算法E-DEI，利用事件相机的高时间分辨率对齐和融合双曝光图像特征，以消除运动伪影和曝光差异，提升低光图像质量。

详情

AI中文摘要

通过结合短曝光和长曝光图像的互补优势，双曝光成像（DEI）在低光场景下增强了图像质量。然而，现有的DEI方法由于场景运动导致的空间位移和不同曝光时间引起的图像特征差异，不可避免地会产生伪影。为了解决这个问题，我们提出了一种新颖的基于事件的双曝光成像（E-DEI）算法，该算法从双曝光图像对和事件中重建高质量图像，利用事件相机的高时间分辨率提供准确的帧间/帧内动态信息。具体来说，我们将这个复杂任务分解为两个子任务的集成，即基于事件的运动去模糊和低光图像增强任务，这指导我们将E-DEI网络设计为双路径并行特征传播架构。我们提出了一个双路径特征对齐与融合（DFAF）模块，以在事件的辅助下有效地对齐和融合从双曝光图像中提取的特征。此外，我们构建了一个包含配对低/正常光图像和事件的真实世界数据集（PIED）。在多个数据集上的实验表明了我们方法的优越性。代码和数据集可在GitHub上获取。

英文摘要

By combining the complementary benefits of short- and long-exposure images, Dual-Exposure Imaging (DEI) enhances image quality in low-light scenarios. However, existing DEI approaches inevitably produce artifacts due to spatial displacement from scene motion and image feature discrepancies from different exposure times. To tackle this problem, we propose a novel Event-based DEI (E-DEI) algorithm, which reconstructs high-quality images from dual-exposure image pairs and events, leveraging the high temporal resolution of event cameras to provide accurate inter-/intra-frame dynamic information. Specifically, we decompose this complex task into an integration of two sub-tasks, i.e., event-based motion deblurring and low-light image enhancement, which guides us to design the E-DEI network as a dual-path parallel feature propagation architecture. We propose a Dual-path Feature Alignment and Fusion (DFAF) module to effectively align and fuse features extracted from dual-exposure images with the assistance of events. Furthermore, we build a real-world Dataset containing Paired low-/normal-light Images and Events (PIED). Experiments on multiple datasets show the superiority of our method. The code and dataset are available on GitHub.

URL PDF HTML ☆

赞 0 踩 0

2603.26885 2026-06-01 cs.CV 版本更新

TTE-CAM: Self-Explainable Class Activation Maps for Pretrained Black-Box CNNs

TTE-CAM：用于预训练黑盒CNN的自解释类激活图

Kerol Djoumessi, Philipp Berens

发表机构 * Hertie Institute for AI in Brain Health, University of Tübingen, Germany（图宾根大学脑健康人工智能研究所）

AI总结提出TTE-CAM框架，通过卷积替换分类头将预训练黑盒CNN转化为自解释模型，在保持预测性能的同时提供忠实解释。

Comments Accepted at MIDL 2026 in the short paper track

2511.11440 2026-06-01 cs.CV cs.CL 版本更新

Synthetic Stimuli, Real Gains: Rethinking VLM Fine-Tuning Through Fully Controlled Data Generation

合成刺激，真实收益：通过完全受控的数据生成重新思考VLM微调

Massimo Rizzoli, Simone Alghisi, Seyed Mahed Mousavi, Giuseppe Riccardi

发表机构 * Signals and Interactive Systems Lab, University of Trento（信号与交互系统实验室，特伦托大学）

AI总结本文提出一种完全受控的数据生成与标注流程，用于微调视觉语言模型（VLM），通过平衡分布和干净标注消除偏差，在空间推理任务上仅用130个样本即可实现均匀性能，并在真实世界数据上提升13%的性能。

详情

AI中文摘要

通过微调获得的视觉语言模型（VLM）的性能提升通常基于对真实世界场景的临时数据收集和标注。尽管有所改进，但这一过程往往容易受到偏差、错误和分布不平衡的影响，导致过拟合和性能不平衡。虽然少数研究探索了合成数据生成，但它们通常缺乏对数据分布和标注质量的控制。在这项工作中，我们通过探索完全受控的数据生成和标注流程，重新评估了模型微调的潜力，获得了具有平衡分布和干净标注的无偏差数据。以识别物体绝对位置的空间推理任务作为用例，我们微调了最先进的VLM，并在合成和真实世界基准上进行了详尽的评估，包括对真实世界场景的可迁移性。我们的实验揭示了两个关键发现：1）在平衡数据上微调可以在视觉场景中产生均匀的性能，并且仅用130个样本就能缓解常见偏差；2）在合成刺激上微调使真实世界数据（COCO）的性能提升了13%，优于在完整COCO训练集上微调的模型。

英文摘要

Performance gains of Vision Language Models (VLMs) obtained by fine-tuning are generally based on ad hoc data collection and annotation of real-world scenes. Despite the improvements, this process is often prone to biases, errors, and distribution imbalance, resulting in overfitting and imbalanced performance. Although a few studies have explored synthetic data generation, they typically lack control over data distribution and annotation quality. In this work, we re-evaluate the potential of model fine-tuning by exploring a fully controlled data generation and annotation pipeline, obtaining bias-free data with balanced distribution and clean annotations. Using the spatial reasoning task of identifying the absolute position of an object as a use case, we fine-tune state-of-the-art VLMs and conduct exhaustive evaluations on both synthetic and real-world benchmarks, including transferability to real-world scenes. Our experiments reveal two key findings: 1) fine-tuning on balanced data yields uniform performance across the visual scene and mitigates common biases with as few as 130 samples; and 2) fine-tuning on synthetic stimuli improves performance by 13% on real-world data (COCO), outperforming models fine-tuned on the full COCO train set.

URL PDF HTML ☆

赞 0 踩 0

2603.19862 2026-06-01 cs.CV cs.LG 版本更新

IsoCLIP: Decomposing CLIP Projectors for Efficient Intra-modal Alignment

IsoCLIP: 分解CLIP投影器以实现高效的模态内对齐

Simone Magistri, Dipam Goswami, Marco Mistretta, Bartłomiej Twardowski, Joost van de Weijer, Andrew D. Bagdanov

发表机构 * Media Integration and Communication Center (MICC), University of Florence, Italy（意大利佛罗伦萨大学媒体集成与通信中心）； Department of Computer Science, Universitat Autònoma de Barcelona, Spain（西班牙巴塞罗那自治大学计算机科学系）； Computer Vision Center, Barcelona, Spain（西班牙巴塞罗那计算机视觉中心）； IDEAS Research Institute, Warsaw, Poland（波兰华沙IDEAS研究所）

AI总结本文通过分析CLIP投影器的谱特性，发现模态间对齐子空间和各向异性方向，提出无训练方法IsoCLIP去除各向异性方向以改善模态内对齐，在模态内检索和分类任务上降低延迟并超越现有方法。

Comments Accepted at CVPR2026

详情

AI中文摘要

视觉-语言模型如CLIP被广泛用于涉及视觉和文本模态的跨模态任务。然而，当个体模态编码器应用于固有的模态内任务（如图像到图像检索）时，其性能因模态内错位而受损。本文研究CLIP中的模态内错位，重点关注将投影前图像和文本嵌入映射到共享嵌入空间的投影器的作用。通过分析应用于投影特征的余弦相似度形式及其与对比CLIP损失的交互，我们发现在训练期间存在一个负责对齐两种模态的跨模态算子，以及第二个仅强制执行模态内归一化但不促进模态内对齐的模态内算子。通过对跨模态算子的谱分析，我们识别出一个近似各向同性的子空间，其中两种模态良好对齐，以及每个模态特有的各向异性方向。我们证明该对齐子空间可以直接从投影器权重中获得，并且去除各向异性方向可改善模态内对齐。我们在模态内检索和分类基准上的实验表明，我们的无训练方法减少了模态内错位，大大降低了延迟，并在多个预训练的类CLIP模型上优于现有方法。代码公开于：https://github.com/simomagi/IsoCLIP。

英文摘要

Vision-Language Models like CLIP are extensively used for inter-modal tasks that involve both visual and text modalities. However, when the individual modality encoders are applied to inherently intra-modal tasks like image-to-image retrieval, their performance suffers from intra-modal misalignment. In this paper we study intra-modal misalignment in CLIP with a focus on the role of the projectors that map pre-projection image and text embeddings into the shared embedding space. By analyzing the form of the cosine similarity applied to projected features, and its interaction with the contrastive CLIP loss, we show that there is an inter-modal operator responsible for aligning the two modalities during training, and a second, intra-modal operator that only enforces intra-modal normalization but does nothing to promote intra-modal alignment. Via spectral analysis of the inter-modal operator, we identify an approximately isotropic subspace in which the two modalities are well-aligned, as well as anisotropic directions specific to each modality. We demonstrate that this aligned subspace can be directly obtained from the projector weights and that removing the anisotropic directions improves intra-modal alignment. Our experiments on intra-modal retrieval and classification benchmarks show that our training-free method reduces intra-modal misalignment, greatly lowers latency, and outperforms existing approaches across multiple pre-trained CLIP-like models. The code is publicly available at: https://github.com/simomagi/IsoCLIP.

URL PDF HTML ☆

赞 0 踩 0

2509.25269 2026-06-01 eess.IV cs.CV cs.LG cs.NA math.NA physics.optics 版本更新

SAW-Bench：在现实世界中学习情境感知

Chuhan Li, Rilyn Han, Joy Hsu, Yongyuan Liang, Rajiv Dhawan, Jiajun Wu, Ming-Hsuan Yang, Xin Eric Wang

发表机构 * University of California, Santa Barbara（加州大学圣芭芭拉分校）； Yale University（耶鲁大学）； Stanford University（斯坦福大学）； University of Maryland, College Park（马里兰大学学院市分校）； Amazon（亚马逊）； University of California, Merced（加州大学默塞德分校）

AI总结提出SAW-Bench基准，通过自录视频和问答对评估多模态基础模型的以观察者为中心的情境感知能力，揭示人类与模型间的显著性能差距。

详情

AI中文摘要

人类感知的一个核心方面是情境感知，即我们能够将自己与周围的物理环境联系起来，并根据上下文推理可能的行动。然而，现有的大多数多模态基础模型（MFM）基准强调以环境为中心的空间关系（场景中物体之间的关系），而很大程度上忽略了需要相对于智能体的视角、姿态和运动进行推理的以观察者为中心的关系。为了填补这一空白，我们引入了SAW-Bench（现实世界中的情境感知），这是一个利用真实世界视频评估以自我为中心的情境感知的新基准。SAW-Bench包含使用Ray-Ban Meta（Gen 2）智能眼镜自录的786个视频，涵盖多样的室内外环境，以及超过2,071个人工标注的问答对。它通过六种不同的感知任务来探测模型对观察者中心的理解。我们的综合评估显示，即使使用性能最佳的MFM Gemini 3 Flash，人类与模型之间的性能差距也达到了37.66%。除了这一差距，我们的深入分析还揭示了一些显著发现；例如，虽然模型可以利用以自我为中心的视频中的部分几何线索，但它们常常无法推断出连贯的相机几何结构，从而导致系统性的空间推理错误。我们将SAW-Bench定位为情境空间智能的基准，超越被动观察，转向理解基于物理的、以观察者为中心的动态。

英文摘要

A core aspect of human perception is situated awareness, the ability to relate ourselves to the surrounding physical environment and reason over possible actions in context. However, most existing benchmarks for multimodal foundation models (MFMs) emphasize environment-centric spatial relations (relations among objects in a scene), while largely overlooking observer-centric relationships that require reasoning relative to the agent's viewpoint, pose, and motion. To bridge this gap, we introduce SAW-Bench (Situated Awareness in the Real World), a novel benchmark for evaluating egocentric situated awareness using real-world videos. SAW-Bench comprises 786 self-recorded videos captured with Ray-Ban Meta (Gen 2) smart glasses spanning diverse indoor and outdoor environments, and over 2,071 human-annotated question-answer pairs. It probes a model's observer-centric understanding with six different awareness tasks. Our comprehensive evaluation reveals a human-model performance gap of 37.66%, even with the best-performing MFM, Gemini 3 Flash. Beyond this gap, our in-depth analysis uncovers several notable findings; for example, while models can exploit partial geometric cues in egocentric videos, they often fail to infer a coherent camera geometry, leading to systematic spatial reasoning errors. We position SAW-Bench as a benchmark for situated spatial intelligence, moving beyond passive observation to understanding physically grounded, observer-centric dynamics.

URL PDF HTML ☆

赞 0 踩 0

2602.15018 2026-06-01 cs.RO cs.CV 版本更新

Neurosim: A Fast Simulator for Neuromorphic Robot Perception

Neurosim: 一种用于神经形态机器人感知的快速模拟器

Richeek Das, Pratik Chaudhari

发表机构 * GRASP Laboratory, University of Pennsylvania（宾夕法尼亚大学GRASP实验室）

AI总结提出Neurosim和Cortex库，通过高速传感器模拟和低延迟通信，支持神经形态感知与控制算法的训练和闭环测试。

Comments 11 pages, 6 figures

详情

AI中文摘要

Neurosim是一个快速、实时、高性能的库，用于模拟动态视觉传感器、RGB相机、深度传感器和惯性传感器等传感器。它还可以模拟复杂动态环境中多旋翼飞行器的敏捷动力学。Neurosim在桌面GPU上可实现高达约2700 FPS的帧率。Neurosim与一个基于ZeroMQ的通信库Cortex集成，以促进与机器学习和机器人工作流的无缝集成。Cortex为Python和C++应用程序提供了一个高吞吐量、低延迟的消息传递系统，原生支持NumPy数组和PyTorch张量。本文讨论了Neurosim和Cortex的设计理念。它展示了如何利用它们来(i)训练神经形态感知和控制算法，例如，在时间同步的多模态数据上使用自监督学习，以及(ii)在闭环中测试这些算法的实时实现。Neurosim和Cortex可在https://github.com/grasp-lyrl/neurosim获取。

英文摘要

Neurosim is a fast, real-time, high-performance library for simulating sensors such as dynamic vision sensors, RGB cameras, depth sensors, and inertial sensors. It can also simulate agile dynamics of multi-rotor vehicles in complex and dynamic environments. Neurosim can achieve frame rates as high as ~2700 FPS on a desktop GPU. Neurosim integrates with a ZeroMQ-based communication library called Cortex to facilitate seamless integration with machine learning and robotics workflows. Cortex provides a high-throughput, low-latency message-passing system for Python and C++ applications, with native support for NumPy arrays and PyTorch tensors. This paper discusses the design philosophy behind Neurosim and Cortex. It demonstrates how they can be used to (i) train neuromorphic perception and control algorithms, e.g., using self-supervised learning on time-synchronized multi-modal data, and (ii) test real-time implementations of these algorithms in closed loop. Neurosim and Cortex are available at https://github.com/grasp-lyrl/neurosim.

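2602.14441 2026-06-01 cs.CV Updated version

MultiPriv: Benchmarking Individual-Level Privacy Reasoning in Vision-Language Models

MultiPriv: 视觉语言模型中个体级隐私推理的基准测试

Xiongtao Sun, Hui Li, Jiaming Zhang, Yujie Yang, Kaili Liu, Ruxin Feng, Wen Jun Tan, Wei Yang Bryan Lim

Affiliations: Xidian University（西安电子科技大学）； Nanyang Technological University（南洋理工大学）

AI summary: Targets the privacy risk that vision-language models can identify individuals by linking multimodal data through hierarchical chain-of-thought reasoning, and proposes MultiPriv, the first benchmark to systematically evaluate individual-level privacy reasoning. It comprises a privacy perception and reasoning framework, a bilingual multimodal dataset, and nine challenging tasks; an evaluation of more than 50 open-source and commercial models finds that 60% of the models can perform individual-level privacy reasoning with up to 80% accuracy.

Chinese abstract (AI-translated)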

现代视觉语言模型（VLM）通过层次化链式推理将碎片化的多模态数据与可识别的个体关联起来，构成了显著的个体级隐私风险。然而，现有的隐私基准在结构上不足以应对这一威胁，因为它们主要评估隐私感知，而未能解决更关键的隐私推理风险：VLM推断和关联分布式信息以构建个体档案的能力。为填补这一空白，我们提出了MultiPriv，这是第一个旨在系统评估VLM中个体级隐私推理的基准。我们引入了隐私感知与推理（PPR）框架，并构建了一个包含合成个体档案的双语多模态数据集，其中标识符（如人脸和姓名）与敏感属性相关联。该设计支持九项具有挑战性的任务，涵盖属性检测、跨图像重新识别和链式推理。我们对超过50个开源和商业VLM进行了大规模评估。在我们的受控基准中，60%的广泛使用的VLM能够以高达80%的准确率进行个体级隐私推理，这表明对个人隐私存在重大潜在威胁。该基准可在https://github.com/CyberChangAn/MultiPriv-PII获取。

English abstract

Modern Vision-Language Models (VLMs) pose significant individual-level privacy risks by linking fragmented multimodal data to identifiable individuals through hierarchical chain-of-thought reasoning. However, existing privacy benchmarks remain structurally insufficient for this threat, as they primarily evaluate privacy perception while failing to address the more critical risk of privacy reasoning: a VLM's ability to infer and link distributed information to construct individual profiles. To address this gap, we propose MultiPriv, the first benchmark designed to systematically evaluate individual-level privacy reasoning in VLMs. We introduce the Privacy Perception and Reasoning (PPR) framework and construct a bilingual multimodal dataset with synthetic individual profiles, where identifiers, such as faces and names, are linked to sensitive attributes. This design enables nine challenging tasks spanning attribute detection, cross-image re-identification, and chained inference. We conduct a large-scale evaluation of over 50 open-source and commercial VLMs. In our controlled benchmark, 60% of widely used VLMs can perform individual-level privacy reasoning with up to 80% accuracy, suggesting a significant potential threat to personal privacy. The benchmark is available at https://github.com/CyberChangAn/MultiPriv-PII.

2602.02220 2026-06-01 cs.CV cs.RO Updated version

LangMap: A Human-Verified Benchmark for Hierarchical Open-Vocabulary Goal Navigation

LangMap：一个用于分层开放词汇目标导航的人工验证基准

Bo Miao, Weijia Liu, Jun Luo, Lachlan Shinnick, Jian Liu, Thomas Hamilton-Smith, Yuhe Yang, Zijie Wu, Vanja Videnovic, Feras Dayoub, Anton van den Hengel

Affiliations: AIML, Adelaide University（AIML，阿德莱德大学）； East China Normal University（华东师范大学）； NERC-RVC, Hunan University（NERC-RVC，湖南大学）； The University of Western Australia（西澳大学）； Breaker Industries

AI summary: Addresses the shortcomings of existing benchmarks for hierarchical semantic goal navigation by proposing the LangMap benchmark, which uses human-verified semantic annotations and a contrastive annotation protocol to support goal navigation at four levels (scene, room, region, and instance), and introduces the PlaNaVid baseline.

Chinese abstract (AI-translated)

语言条件目标导航（LGN）要求智能体在没有逐步指导的情况下定位用户指定的目标。然而，现有基准主要关注类别级目标或依赖视觉语言模型（VLM）生成的实例描述，这些描述通常包含歧义和语义错误，限制了系统性和可靠的评估。我们提出了HieraNav，一个开放词汇的LGN任务，目标在四个分层语义层级上指定：场景、房间、区域和实例。为此，我们提出了Language as a Map（LangMap），据我们所知，这是第一个具有人工验证语义标注的真实世界3D室内导航基准，支持所有四个目标层级的任务。LangMap提供了区域标签以及覆盖414个对象类别的区分性区域和实例描述，通过比较同一场景区域和实例的严格对比注释协议生成，包含超过18K个任务。每个目标都配有简洁和详细的描述，支持跨指令风格的评估。定量和定性分析验证了我们的注释质量；值得注意的是，我们的实例描述在文本到视图匹配上比GOAT-Bench注释高出23个百分点。我们进一步引入了PlaNaVid，一个强大的仅RGB基线，它将有界多样记忆（BDM）与高级规划相结合，以激发用于多目标导航的反应策略。PlaNaVid在没有深度、3D场景表示或对象掩码的情况下实现了顶级成功率。进一步分析表明，记忆和更丰富的上下文提升了性能，而长尾类别、小物体、远距离目标和多目标完成仍然是开放的挑战。该基准可在https://bo-miao.github.io/LangMap获取。

English abstract

Language-conditioned goal navigation (LGN) requires agents to locate user-specified targets without step-by-step guidance. However, existing benchmarks largely focus on category-level goals or rely on instance descriptions generated by vision-language models (VLMs), which often contain ambiguities and semantic errors, limiting systematic and reliable evaluation. We introduce HieraNav, an open-vocabulary LGN task with goals specified at four hierarchical semantic levels: scene, room, region, and instance. To this end, we present Language as a Map (LangMap), to our knowledge the first real-world 3D indoor navigation benchmark with human-verified semantic annotations to support tasks across all four goal levels. LangMap provides region labels and discriminative region and instance descriptions covering 414 object categories, produced through a rigorous contrastive annotation protocol comparing same-scene regions and instances, and contains over 18K tasks. Each target is paired with concise and detailed descriptions, enabling evaluation across instruction styles. Quantitative and qualitative analyses validate our annotation quality; notably, our instance descriptions outperform GOAT-Bench annotations by 23 percentage points in text-to-view matching. We further introduce PlaNaVid, a strong RGB-only baseline that combines Bounded Diverse Memory (BDM) with high-level planning to prime a reactive policy for multi-goal navigation. PlaNaVid achieves top-tier success rates without depth, 3D scene representations, or object masks. Further analysis shows that memory and richer context boost performance, while long-tailed categories, small objects, distant targets, and multi-goal completion remain open challenges. The benchmark is available at https://bo-miao.github.io/LangMap

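2511.11406 2026-06-01 cs.CV Updated version

Robust Low-Rank Sparse Framework for Video-Based Affective Computing

基于视频的情感计算的鲁棒低秩稀疏框架

Feng-Qi Cui, Jinyang Huang, Sirui Zhao, Xinyu Li, Xin Yan, Ziyu Jia, Xiaokang Zhou

Affiliations: Cylingo Group（Cylingo集团）； RIKEN Center for Advanced Intelligence Project, RIKEN（RIKEN高级智能项目中心，RIKEN）

AI summary: Proposes the Low-Rank Sparse Emotion Understanding Framework (LSEF), which uses hierarchical low-rank sparse decomposition to split affective dynamics into emotional bases and transient fluctuations, and adopts a rank-aware optimization strategy to improve robustness and dynamic discrimination.

DOI: 10.1109/TCE.2026.3697969

Chinese abstract (AI-translated)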

基于视频的情感计算（VAC）对于情感分析和人机交互至关重要，但由于复杂的情感动态，存在模型不稳定和表示退化的问题。由于不同情感波动的含义在不同情感背景下可能不同，核心限制在于缺乏一种层次结构机制来分离不同的情感成分，即情感基（长期情感基调）和瞬态波动（短期情感波动）。为解决这一问题，我们提出了低秩稀疏情感理解框架（LSEF），这是一个基于低秩稀疏原理的统一模型，从理论上将情感动态重新定义为层次化的低秩稀疏组合过程。LSEF采用三个即插即用模块：稳定性编码模块（SEM）捕获低秩情感基；动态解耦模块（DDM）分离稀疏瞬态信号；一致性整合模块（CIM）重构多尺度稳定性和反应性一致性。该框架通过秩感知优化（RAO）策略进行优化，该策略自适应地平衡梯度平滑性和敏感性。跨多个数据集的大量实验证实，LSEF显著增强了鲁棒性和动态判别能力，进一步验证了层次化低秩稀疏建模对于理解情感动态的有效性和通用性。

English abstract

Video-based Affective Computing (VAC), vital for emotion analysis and human-computer interaction, suffers from model instability and representational degradation due to complex emotional dynamics. Since the meaning of an emotional fluctuation may differ across emotional contexts, the core limitation is the lack of a hierarchical structural mechanism to disentangle distinct affective components, namely emotional bases (the long-term emotional tone) and transient fluctuations (short-term emotional changes). To address this, we propose the Low-Rank Sparse Emotion Understanding Framework (LSEF), a unified model grounded in the Low-Rank Sparse Principle, which theoretically reframes affective dynamics as a hierarchical low-rank sparse compositional process. LSEF employs three plug-and-play modules: the Stability Encoding Module (SEM) captures low-rank emotional bases, the Dynamic Decoupling Module (DDM) isolates sparse transient signals, and the Consistency Integration Module (CIM) reconstructs multi-scale stability and reactivity coherence. The framework is optimized by a Rank Aware Optimization (RAO) strategy that adaptively balances gradient smoothness and sensitivity. Extensive experiments across multiple datasets confirm that LSEF significantly enhances robustness and dynamic discrimination, which further validates the effectiveness and generality of hierarchical low-rank sparse modeling for understanding affective dynamics.

2601.22412 2026-06-01 cs.CV Updated version

Calibrated Uncertainty for Trustworthy Clinical Gait Analysis Using Probabilistic Multiview Markerless Motion Capture

用于可信临床步态分析的校准不确定性：基于概率多视角无标记运动捕捉

Seth Donahue, Irina Djuraskovic, Kunal Shah, Fabian Sinz, Ross Chafetz, R. James Cotton

Affiliations: Shriners Children’s Lexington（施莱纳儿童医院列克星敦分院）； Dept. of Physical Therapy, University of Kentucky（肯塔基大学物理治疗系）； Shirley Ryan AbilityLab（雪莉·瑞安能力实验室）； Northwestern University（西北大学）； University of Göttingen（哥廷根大学）； Lower Saxony Center for AI & Causal Methods in Medicine（下萨克森人工智能与因果方法医学中心）； Shriners Children’s Philadelphia（施莱纳儿童医院费城分院）

AI summary: Evaluates a probabilistic multi-view markerless motion capture method that estimates joint-angle posterior distributions via variational inference and scores confidence-interval calibration with the Expected Calibration Error, enabling reliable gait analysis that flags unreliable outputs without concurrent ground-truth instrumentation.

Comments 9 pages, 5 figures, EMBS Special Issue

Chinese abstract (AI-translated)

基于视频的人体运动分析在临床实践和研究中具有潜力。然而，多视角无标记运动捕捉（MMMC）的临床实施和信任要求，除了准确性外，这些系统还能为任何个体产生可靠的置信区间以指示其准确程度。基于我们先前利用变分推断估计关节角度后验分布的工作，本研究评估了一种概率MMMC方法的校准性和可靠性。我们分析了来自两个机构的68名参与者的数据，使用仪器化步道和标准标记运动捕捉验证模型。我们通过期望校准误差（ECE）测量置信区间的校准性。模型展示了可靠的校准性，步长和跨步长的ECE值通常<0.1，偏差校正的步态运动学也类似。我们观察到步长和跨步长中位误差分别约为16毫米和12毫米，下肢关节的中位偏差校正运动学误差范围为1.5至3.8度。与校准的ECE一致，模型预测的不确定性大小与观察到的误差测量值强相关。这些发现表明，按照设计，概率模型重建量化了认知不确定性，使其能够在无需同时使用真实标记仪器的情况下识别不可靠的输出。

English abstract

Video-based human movement analysis holds potential for movement assessment in clinical practice and research. However, the clinical implementation and trust of multi-view markerless motion capture (MMMC) require that, in addition to being accurate, these systems produce reliable confidence intervals to indicate how accurate they are for any individual. Building on our prior work utilizing variational inference to estimate joint angle posterior distributions, this study evaluates the calibration and reliability of a probabilistic MMMC method. We analyzed data from 68 participants across two institutions, validating the model against an instrumented walkway and standard marker-based motion capture. We measured the calibration of the confidence intervals using the Expected Calibration Error (ECE). The model demonstrated reliable calibration, yielding ECE values generally < 0.1 for both step and stride length and bias-corrected gait kinematics. We observed a median step and stride length error of ~16 mm and ~12 mm respectively, with median bias-corrected kinematic errors ranging from 1.5 to 3.8 degrees across lower extremity joints. Consistent with the calibrated ECE, the magnitude of the model's predicted uncertainty correlated strongly with observed error measures. These findings indicate that, as designed, the probabilistic model reconstruction quantifies epistemic uncertainty, allowing it to identify unreliable outputs without the need for concurrent ground-truth instrumentation.
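
Illustrative sketch (not taken from the paper): the abstract does not state the exact ECE definition used, so the code below assumes Gaussian per-sample predictions and scores calibration as the mean absolute gap between nominal and observed coverage of central intervals. The function names `interval_coverage` and `expected_calibration_error`, the nominal levels, and the synthetic data are all hypothetical.

```python
import random
from statistics import NormalDist


def interval_coverage(y_true, mu, sigma, level):
    """Fraction of targets inside the central `level` interval of N(mu, sigma^2)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # half-width in standard deviations
    inside = sum(abs(y - m) <= z * s for y, m, s in zip(y_true, mu, sigma))
    return inside / len(y_true)


def expected_calibration_error(y_true, mu, sigma, levels=None):
    """Mean absolute gap between nominal and observed interval coverage."""
    if levels is None:
        levels = [i / 10 for i in range(1, 10)]  # 10%, 20%, ..., 90% intervals
    gaps = [abs(interval_coverage(y_true, mu, sigma, p) - p) for p in levels]
    return sum(gaps) / len(gaps)


if __name__ == "__main__":
    rng = random.Random(0)
    n = 5000
    truth = [rng.gauss(0.0, 1.0) for _ in range(n)]  # synthetic targets, unit spread
    mean = [0.0] * n
    # Predicted sigma matches the data: observed coverage tracks the nominal level.
    print("calibrated   :", expected_calibration_error(truth, mean, [1.0] * n))
    # Predicted sigma is too small: intervals cover far less than claimed.
    print("overconfident:", expected_calibration_error(truth, mean, [0.5] * n))
```

Under these assumptions a well-calibrated predictor yields a value near zero, while an overconfident one (predicted spread too narrow) yields a clearly larger value.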

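2601.22202 2026-06-01 eess.IV cs.CV Updated version

A Survey on Semantic Communication for Vision: Categories, Frameworks, Enabling Techniques, and Applications

面向视觉的语义通信综述：分类、框架、使能技术与应用

Runze Cheng, Yao Sun, Ahmad Taha, Xuesong Liu, David Flynn, Muhammad Ali Imran

Affiliations: James Watt School of Engineering, University of Glasgow（格拉斯哥大学詹姆斯·瓦特工程学院）

AI summary: Systematically reviews semantic communication for visual data (SemCom-Vision), classifying existing approaches by semantic quantization scheme into semantic preservation, semantic expansion, and semantic refinement communication, and summarizing ML-based encoder-decoder models, training algorithms, and knowledge utilization strategies.

DOI: 10.1109/TNSE.2026.3671446
Journal ref: IEEE Transactions on Network Science and Engineering, vol. 13, pp. 8080-8103, 2026

Chinese abstract (AI-translated)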

语义通信（SemCom）作为一种变革性范式，用于流量密集的视觉数据传输，将关注点从原始数据传输转向有意义的内容传输，从而缓解通信资源的日益紧张。然而，要实现SemCom，面临着视觉数据精确语义量化、不同任务和目标下的鲁棒语义提取与重建、利用有效知识的收发端协调以及适应不可预测的无线通信环境等挑战。本文对面向视觉数据语义通信（SemCom-Vision）进行了系统综述，其中进行了计算机视觉（CV）与通信工程的跨学科分析，为机器学习（ML）赋能的SemCom-Vision设计提供全面指导。具体而言，本综述首先阐述了SemCom的基础知识和关键概念。然后，我们引入了一种新的分类视角，根据通过语义量化方案解释的通信目标，将现有的SemCom-Vision方法分为语义保持通信（SPC）、语义扩展通信（SEC）和语义精炼通信（SRC）。此外，本综述阐述了每个SemCom-Vision类别中基于ML的编码器-解码器模型和训练算法，随后介绍了知识结构和利用策略。最后，我们讨论了潜在的SemCom-Vision应用。

English abstract

Semantic communication (SemCom) emerges as a transformative paradigm for traffic-intensive visual data transmission, shifting focus from raw data to meaningful content transmission and relieving the increasing pressure on communication resources. However, to achieve SemCom, challenges are faced in accurate semantic quantization for visual data, robust semantic extraction and reconstruction under diverse tasks and goals, transceiver coordination with effective knowledge utilization, and adaptation to unpredictable wireless communication environments. In this paper, we present a systematic review of SemCom for visual data transmission (SemCom-Vision), wherein an interdisciplinary analysis integrating computer vision (CV) and communication engineering is conducted to provide comprehensive guidelines for the machine learning (ML)-empowered SemCom-Vision design. Specifically, this survey first elucidates the basics and key concepts of SemCom. Then, we introduce a novel classification perspective to categorize existing SemCom-Vision approaches as semantic preservation communication (SPC), semantic expansion communication (SEC), and semantic refinement communication (SRC) based on communication goals interpreted through semantic quantization schemes. Moreover, this survey articulates the ML-based encoder-decoder models and training algorithms for each SemCom-Vision category, followed by knowledge structure and utilization strategies. Finally, we discuss potential SemCom-Vision applications.

2502.12119 2026-06-01 cs.CV cs.AI cs.CL Updated version

PRISM: Self-Pruning Intrinsic Selection Method for Training-Free Multimodal Data Selection

PRISM：免训练多模态数据选择的自剪枝内在选择方法

Jinhe Bi, Aniri, Zengjie Jin, Yifan Wang, Danqi Yan, Wenke Huang, Xiaowen Ma, Sikuan Yan, Artur Hecker, Mang Ye, Xun Xiao, Hinrich Schuetze, Volker Tresp, Yunpu Ma

Affiliations: LMU Munich（慕尼黑大学）； Munich Research Center, Huawei Technologies（慕尼黑研究中心，华为技术）； METEOR ； School of Computer Science, Wuhan University（武汉大学计算机学院）； Munich Center for Machine Learning（慕尼黑机器学习中心）

AI summary: Addresses redundancy in visual instruction data for multimodal large language models with PRISM, a training-free framework that uses implicit re-centering to remove the global semantic drift caused by anisotropy in visual features, enabling efficient data selection that lowers computational cost while improving model performance.

Comments Accepted to ACL 2026 and selected for the Best Paper list; later desk-rejected due to an inadvertent manual bibliography-editing error. Previous versions are withdrawn due to an inadvertent manual bibliography-editing error; please refer to the latest corrected version

Chinese abstract (AI-translated)

视觉指令微调使预训练的多模态大语言模型（MLLMs）能够遵循人类指令以应用于现实场景。然而，这些数据集的快速增长引入了显著的冗余，导致计算成本增加。现有的指令数据选择方法旨在修剪这种冗余，但主要依赖于计算密集型技术，如基于代理的推理或基于训练的指标。因此，这些选择过程产生的巨大计算成本往往加剧了它们本应解决的效率瓶颈，对MLLMs的可扩展和有效微调构成了重大挑战。为了解决这一挑战，我们首先发现了一个关键但先前被忽视的因素：视觉特征分布中固有的各向异性。我们发现这种各向异性引发了全局语义漂移，而忽视这一现象是限制当前数据选择方法效率的关键因素。受此启发，我们设计了PRISM，这是第一个用于高效视觉指令选择的免训练框架。PRISM通过隐式重中心化建模内在视觉语义，精确移除全局背景特征的干扰影响。实验表明，PRISM将数据选择和模型微调的端到端时间减少到传统流程的30%。更值得注意的是，它在实现这一效率的同时提升了性能，在八个多模态和三个语言理解基准上超越了在全数据集上微调的模型，最终相对于基线实现了101.7%的相对改进。代码可在https://github.com/bibisbar/PRISM获取。

English abstract

Visual instruction tuning adapts pre-trained Multimodal Large Language Models (MLLMs) to follow human instructions for real-world applications. However, the rapid growth of these datasets introduces significant redundancy, leading to increased computational costs. Existing methods for selecting instruction data aim to prune this redundancy, but predominantly rely on computationally demanding techniques such as proxy-based inference or training-based metrics. Consequently, the substantial computational costs incurred by these selection processes often exacerbate the very efficiency bottlenecks they are intended to resolve, posing a significant challenge to the scalable and effective tuning of MLLMs. To address this challenge, we first identify a critical, yet previously overlooked, factor: the anisotropy inherent in visual feature distributions. We find that this anisotropy induces a Global Semantic Drift, and overlooking this phenomenon is a key factor limiting the efficiency of current data selection methods. Motivated by this insight, we devise PRISM, the first training-free framework for efficient visual instruction selection. PRISM surgically removes the corrupting influence of global background features by modeling the intrinsic visual semantics via implicit re-centering. Empirically, PRISM reduces the end-to-end time for data selection and model tuning to just 30% of conventional pipelines. More remarkably, it achieves this efficiency while simultaneously enhancing performance, surpassing models fine-tuned on the full dataset across eight multimodal and three language understanding benchmarks, culminating in a 101.7% relative improvement over the baseline. The code is available at https://github.com/bibisbar/PRISM.

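2601.01456 2026-06-01 cs.CV cs.AI cs.LG Updated version

Rethinking Multimodal Few-Shot 3D Point Cloud Segmentation: From Fused Refinement to Decoupled Arbitration

重新思考多模态少样本3D点云分割：从融合精炼到解耦仲裁

Wentao Bian, Fenglei Xu

Affiliations: Suzhou University of Science and Technology（苏州科技大学）

AI summary: Targets the "Plasticity-Stability Dilemma" of the "Fuse-then-Refine" paradigm and CLIP's semantic blindness in multimodal few-shot 3D point cloud segmentation by proposing the Decoupled-experts Arbitration Few-Shot SegNet (DA-FSS), which decouples the semantic and geometric paths and mutually regularizes their gradients for better generalization.

Comments Accepted to IJCAI-ECAI 2026 (Main Track). 9 pages, 3 figures, 3 tables

Chinese abstract (AI-translated)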

本文重新审视多模态少样本3D点云语义分割（FS-PCS），识别出“融合-精炼”范式中的一个冲突：“可塑性-稳定性困境”。此外，CLIP的类间混淆可能导致语义盲区。为解决这些问题，我们提出解耦专家仲裁少样本分割网络（DA-FSS），该模型有效区分语义和几何路径，并相互正则化它们的梯度以实现更好的泛化。DA-FSS采用与MM-FSS相同的主干网络和预训练文本编码器生成文本嵌入，从而提高自由模态的利用率并更好地利用每个模态的信息空间。为此，我们提出并行专家精炼模块以生成每个模态相关性。我们还提出堆叠仲裁模块（SAM）执行卷积融合并为每个模态路径仲裁相关性。并行专家解耦两条路径：几何专家保持可塑性，语义专家确保稳定性。它们通过解耦对齐模块（DAM）协调，该模块在不传播混淆的情况下传递知识。在流行数据集（S3DIS、ScanNet）上的实验表明DA-FSS优于MM-FSS。同时，几何边界、完整性和纹理区分均优于基线。代码可在https://github.com/MoWenQAQ/DA-FSS/获取。

English abstract

In this paper, we revisit multimodal few-shot 3D point cloud semantic segmentation (FS-PCS), identifying a conflict in "Fuse-then-Refine" paradigms: the "Plasticity-Stability Dilemma." In addition, CLIP's inter-class confusion can result in semantic blindness. To address these issues, we present the Decoupled-experts Arbitration Few-Shot SegNet (DA-FSS), a model that effectively distinguishes between semantic and geometric paths and mutually regularizes their gradients to achieve better generalization. DA-FSS employs the same backbone and pre-trained text encoder as MM-FSS to generate text embeddings, which can increase free modalities' utilization rate and better leverage each modality's information space. To achieve this, we propose a Parallel Expert Refinement module to generate each modal correlation. We also propose a Stacked Arbitration Module (SAM) to perform convolutional fusion and arbitrate correlations for each modality pathway. The Parallel Experts decouple two paths: a Geometric Expert maintains plasticity, and a Semantic Expert ensures stability. They are coordinated via a Decoupled Alignment Module (DAM) that transfers knowledge without propagating confusion. Experiments on popular datasets (S3DIS, ScanNet) demonstrate the superiority of DA-FSS over MM-FSS. Meanwhile, geometric boundaries, completeness, and texture differentiation are all superior to the baseline. The code is available at: https://github.com/MoWenQAQ/DA-FSS/.

2601.01075 2026-06-01 cs.LG cs.AI cs.CV Updated version

Flow Equivariant World Models: Memory for Partially Observed Dynamic Environments

流等变世界模型：部分观测动态环境的记忆

Hansen Jin Lillemark, Benhao Huang, Fangneng Zhan, Yilun Du, Thomas Anderson Keller

Affiliations: Kempner Institute, Harvard University（哈佛大学 Kempner 研究所）； ML, Carnegie Mellon University（卡内基梅隆大学 ML 研究所）； SEAS, Harvard University（哈佛大学 SEAS 研究所）

AI summary: Proposes the Flow Equivariant World Modeling framework, which exploits time-parameterized symmetries in a latent memory to achieve stable, accurate long-horizon dynamics prediction under partial observability.

Comments Accepted at ICML 2026

Chinese abstract (AI-translated)

具身系统将世界体验为“流之交响”：多种连续感官输入流与自身运动耦合，并与外部物体的动力学交织。这些感官流和世界的基本动力学遵循平滑的时间参数化对称性，而现有的世界模型忽略了这一点。如果没有尊重这种结构的记忆，部分可观测性对现有方法构成主要障碍：每次观测仅揭示世界的一部分，而未观测区域继续演化。在这项工作中，我们引入了流等变世界建模，这是一个利用潜在记忆中的时间参数化对称性来实现长时程稳定准确动力学预测的框架。潜在记忆随自身运动和推断的外部物体运动等变地移动和变换，使关于视野外区域的信息随时间保持对齐。我们在2D和3D部分观测视频世界建模基准上展示了该框架相对于最先进的扩散、记忆增强和循环世界模型架构的优势。更广泛地说，我们的结果表明，当预测表示按照它们所建模的世界的时间和动力学结构组织时，它们会变得更加强大。项目页面：https://flowequivariantworldmodels.github.io/

English abstract

Embodied systems experience the world as 'a symphony of flows': a combination of many continuous streams of sensory input coupled to self-motion, interwoven with the dynamics of external objects. These sensory streams and the underlying dynamics of the world obey smooth, time-parameterized symmetries which existing world models ignore. Without a memory that respects this structure, partial observability presents a major obstacle to existing methods: each observation reveals only a fraction of the world, while unobserved regions continue to evolve. In this work, we introduce Flow Equivariant World Modeling, a framework that leverages time-parameterized symmetries within a latent memory for stable and accurate dynamics prediction over long horizons. The latent memory shifts and transforms equivariantly with self-motion and inferred external object motion, keeping information about out-of-view regions aligned as time progresses. We demonstrate the advantage of this framework over state-of-the-art diffusion, memory-augmented, and recurrent world model architectures on 2D and 3D partially observed video world modeling benchmarks. More broadly, our results suggest that predictive representations become more powerful when they are organized in line with the temporal and dynamical structure of the world they model. Project page: https://flowequivariantworldmodels.github.io/

2512.02743 2026-06-01 cs.CV cs.AI Updated version

Reasoning-Aware Multimodal Fusion for Hateful Video Detection

面向仇恨视频检测的推理感知多模态融合

Shuonan Yang, Tailin Chen, Jiangbei Yue, Guangliang Cheng, Jianbo Jiao, Zeyu Fu

Affiliations: Multimodal Intelligence Lab（多模态智能实验室）； Department of Computer Science（计算机科学系）； University of Exeter（埃克塞特大学）； School of Computer Science（计算机科学学院）； University of Leeds（利兹大学）； School of Computer Science and Informatics（计算机科学与信息学学院）； University of Liverpool（利物浦大学）； University of Birmingham（伯明翰大学）； Machine Intelligence + x Group（机器智能+X小组）

AI summary: Proposes a reasoning-aware multimodal fusion framework that enables multimodal interaction through Local-Global Context Fusion and Semantic Cross Attention and introduces adversarial reasoning to generate complementary semantic perspectives, improving Macro-F1 and hate-class recall in hateful video detection by 3% and 7%, respectively.

Comments Accepted at Transactions on Machine Learning Research (TMLR)

Chinese abstract (AI-translated)

在线视频中的仇恨言论对数字平台构成日益严重的威胁，尤其是当视频内容变得日益多模态和上下文依赖时。现有方法通常难以有效融合模态间的复杂语义关系，且缺乏理解细微仇恨内容的能力。为解决这些问题，我们提出了一种创新的推理感知多模态融合（RAMF）框架。针对第一个挑战，我们设计了局部-全局上下文融合（LGCF）以捕捉局部显著线索和全局时间结构，并提出语义交叉注意力（SCA）以实现细粒度多模态语义交互。针对第二个挑战，我们引入了对抗推理——一个结构化的三阶段过程，其中视觉语言模型生成（i）客观描述、（ii）仇恨假设推理和（iii）非仇恨假设推理——提供互补的语义视角，丰富模型对细微仇恨意图的上下文理解。在两个真实仇恨视频数据集上的评估表明，我们的方法实现了稳健的泛化性能，在Macro-F1和仇恨类别召回率上分别比现有最先进方法提高了3%和7%。重现我们结果所需的源代码和数据可在https://github.com/Multimodal-Intelligence-Lab-MIL/RAMF获取。

English abstract

Hate speech in online videos poses an increasingly serious threat to digital platforms, especially as video content becomes more multimodal and context-dependent. Existing methods often struggle to effectively fuse the complex semantic relationships between modalities and lack the ability to understand nuanced hateful content. To address these issues, we propose an innovative Reasoning-Aware Multimodal Fusion (RAMF) framework. To tackle the first challenge, we design Local-Global Context Fusion (LGCF) to capture both local salient cues and global temporal structures, and propose Semantic Cross Attention (SCA) to enable fine-grained multimodal semantic interaction. To tackle the second challenge, we introduce adversarial reasoning, a structured three-stage process in which a vision-language model generates (i) objective descriptions, (ii) hate-assumed inferences, and (iii) non-hate-assumed inferences, providing complementary semantic perspectives that enrich the model's contextual understanding of nuanced hateful intent. Evaluations on two real-world hateful video datasets demonstrate that our method achieves robust generalisation performance, improving upon state-of-the-art methods by 3% and 7% in Macro-F1 and hate class recall, respectively. The source code and data required to reproduce our results are available at https://github.com/Multimodal-Intelligence-Lab-MIL/RAMF.

2509.21379 2026-06-01 cs.CV cs.AI Updated version

SAEmnesia: Erasing Concepts in Diffusion Models with Supervised Sparse Autoencoders

SAEmnesia：基于监督稀疏自编码器的扩散模型概念擦除

Enrico Cassano, Riccardo Renzulli, Marco Nurisso, Mirko Zaffaroni, Alan Perotti, Marco Grangetto

Affiliations: University of Turin, Italy（意大利都灵大学）； Intesa Sanpaolo AI Research, Italy（意大利Intesa Sanpaolo人工智能研究院）

AI summary: Proposes SAEmnesia, a supervised sparse autoencoder framework that enforces one-to-one concept-neuron mappings to centralize features, enabling efficient and targeted concept erasure in diffusion models.

Comments Accepted at ICML 2026

Chinese abstract (AI-translated)

扩散模型中的概念遗忘受到特征分裂的阻碍，即概念分布在许多潜在特征上，使得移除它们具有挑战性且计算成本高。我们引入了SAEmnesia，一种监督稀疏自编码器框架，通过强制一对一的概念-神经元映射来克服这一问题。通过在训练过程中系统地标记概念，我们的方法实现了特征集中化，将每个概念绑定到一个可解释的神经元上。这使得概念擦除高度精准且高效。与最先进的基于稀疏自编码器的遗忘方法相比，SAEmnesia将超参数搜索减少了96.67%，并在UnlearnCanvas对象基准上实现了9.22%的提升。我们的方法在顺序遗忘中也表现出卓越的可扩展性，在移除九个对象时准确率提高了28.4%，为精确可控的概念擦除迈出了一步。此外，SAEmnesia在I2P基准上有效抑制了裸体内容，并对对抗攻击保持鲁棒性。源代码可在https://github.com/EIDOSLAB/SAEmnesia获取。

English abstract

Concept unlearning in diffusion models is hampered by feature splitting, where concepts are distributed across many latent features, making their removal challenging and computationally expensive. We introduce SAEmnesia, a supervised sparse autoencoder framework that overcomes this by enforcing one-to-one concept-neuron mappings. By systematically labeling concepts during training, our method achieves feature centralization, binding each concept to a single, interpretable neuron. This enables highly targeted and efficient concept erasure. Compared to the state-of-the-art sparse autoencoder-based unlearning approach, SAEmnesia reduces hyperparameter search by 96.67% and achieves a 9.22% improvement on the UnlearnCanvas benchmark for objects. Our method also shows superior scalability in sequential unlearning, improving accuracy by 28.4% when removing nine objects, establishing a step forward for precise and controllable concept erasure. Moreover, SAEmnesia effectively suppresses nudity on the I2P benchmark and remains robust to adversarial attacks. Source code available at https://github.com/EIDOSLAB/SAEmnesia.
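
Illustrative sketch (not taken from the paper): once a concept is bound to a single sparse-autoencoder neuron, erasing it can be pictured as rescaling that one activation and pushing only the resulting change back through the decoder. The abstract does not describe SAEmnesia's actual intervention, so the ReLU encoder, the `scale` parameter, and the helper names below are assumptions chosen for illustration.

```python
def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def sae_encode(x, w_enc, b_enc):
    """ReLU encoder: sparse, non-negative concept activations."""
    return [max(0.0, h + b) for h, b in zip(matvec(w_enc, x), b_enc)]


def erase_concept(x, w_enc, b_enc, w_dec, neuron, scale=0.0):
    """Rescale one latent neuron and apply only the resulting change to x."""
    z = sae_encode(x, w_enc, b_enc)
    dz = [0.0] * len(z)
    dz[neuron] = (scale - 1.0) * z[neuron]  # change in the targeted activation
    delta = matvec(w_dec, dz)  # decoder maps the latent change to feature space
    return [xi + d for xi, d in zip(x, delta)]


# Toy identity autoencoder: each latent neuron mirrors one feature dimension.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]

if __name__ == "__main__":
    features = [2.0, 3.0]
    print(erase_concept(features, IDENTITY, [0.0, 0.0], IDENTITY, neuron=1))
```

Applying only the decoded difference, rather than replacing the features with the full reconstruction, leaves the autoencoder's reconstruction error untouched; that is a design choice of this sketch, not a claim about the paper.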

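2511.19923 2026-06-01 cs.CV cs.CL Updated version

Non-Parametric Probabilistic Robustness: Conservative Risk Estimation under Unknown Perturbation Distributions

非参数概率鲁棒性：未知扰动分布下的保守风险估计

Zheng Wang, Yi Zhang, Siddartha Khastgir, Carsten Maple, Xingyu Zhao

Affiliations: WMG, University of Warwick, Coventry, United Kingdom（英国考文垂华威大学WMG）； Wuhan University, Wuhan, China（武汉大学，武汉，中国）

AI summary: Proposes the non-parametric probabilistic robustness (NPPR) metric, which learns the perturbation distribution from data to give conservative probabilistic robustness estimates under distributional uncertainty, and develops an estimator based on Gaussian mixture models.

Chinese abstract (AI-translated)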

深度学习模型尽管取得了显著成功，但仍然容易受到微小输入扰动的影响，导致错误输出，这促使最近提出概率鲁棒性（PR）作为对抗鲁棒性（AR）的补充替代方案。然而，现有的PR公式假设扰动分布固定且已知，这在实践中是不现实的期望。为了解决这一限制，我们提出了非参数概率鲁棒性（NPPR），一种更实用的PR度量，不依赖于任何预定义的扰动分布。遵循统计建模中的非参数范式，NPPR直接从数据中学习优化的扰动分布，从而在分布不确定性下实现保守的PR评估。我们进一步开发了基于高斯混合模型（GMM）的NPPR估计器，涵盖了各种输入相关和输入无关的扰动场景。理论分析建立了AR、PR和NPPR之间的关系。在CIFAR-10、CIFAR-100和Tiny ImageNet上使用ResNet18/50、WideResNet50和VGG16的大量实验验证了NPPR作为更实用的鲁棒性度量，与假设最先进技术中使用的常见扰动分布相比，显示出保守（较低）的PR估计。

English abstract

Deep learning (DL) models, despite their remarkable success, remain vulnerable to small input perturbations that can cause erroneous outputs, motivating the recent proposal of probabilistic robustness (PR) as a complementary alternative to adversarial robustness (AR). However, existing PR formulations assume a fixed and known perturbation distribution, an unrealistic expectation in practice. To address this limitation, we propose non-parametric probabilistic robustness (NPPR), a more practical PR metric that does not rely on any predefined perturbation distribution. Following the non-parametric paradigm in statistical modeling, NPPR learns an optimized perturbation distribution directly from data, enabling conservative PR evaluation under distributional uncertainty. We further develop an NPPR estimator based on a Gaussian Mixture Model (GMM), covering various input-dependent and input-independent perturbation scenarios. Theoretical analyses establish the relationships among AR, PR, and NPPR. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet across ResNet18/50, WideResNet50, and VGG16 validate NPPR as a more practical robustness metric, showing conservative (lower) PR estimates than those obtained under the common perturbation distributions assumed by state-of-the-art methods.
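
Illustrative sketch (not taken from the paper): probabilistic robustness can be estimated by Monte Carlo as the share of sampled perturbations that leave the prediction unchanged, and a conservative figure follows from taking the worst case over several perturbation models. The paper learns an optimized GMM from data; this sketch replaces that step with a minimum over two hand-picked mixtures, and the toy classifier, `FIXED`, `BIMODAL`, and all function names are hypothetical.

```python
import random


def classify(x):
    """Toy 1-D classifier standing in for a deep model: label 1 if x >= 0."""
    return 1 if x >= 0.0 else 0


def sample_gmm(rng, weights, means, stds, radius):
    """Draw one perturbation from a 1-D Gaussian mixture, clipped to [-radius, radius]."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    delta = rng.gauss(means[k], stds[k])
    return max(-radius, min(radius, delta))


def estimate_pr(x, gmm, radius=1.0, n=20000, seed=0):
    """Monte Carlo estimate of P[classify(x + delta) == classify(x)]."""
    rng = random.Random(seed)
    label = classify(x)
    weights, means, stds = gmm
    hits = sum(
        classify(x + sample_gmm(rng, weights, means, stds, radius)) == label
        for _ in range(n)
    )
    return hits / n


def conservative_pr(x, candidates, radius=1.0):
    """Worst-case PR over a candidate set (a crude stand-in for a learned distribution)."""
    return min(estimate_pr(x, gmm, radius) for gmm in candidates)


# Hypothetical perturbation models: (weights, means, standard deviations).
FIXED = ([1.0], [0.0], [0.5])  # a single zero-mean Gaussian
BIMODAL = ([0.5, 0.5], [-0.8, 0.8], [0.2, 0.2])  # mass pushed toward the radius

if __name__ == "__main__":
    x = 0.3
    print("PR under fixed Gaussian:", estimate_pr(x, FIXED))
    print("conservative PR        :", conservative_pr(x, [FIXED, BIMODAL]))
```

In this toy setting the two-mode mixture places more mass beyond the decision boundary, so the worst-case estimate is lower than the one obtained under the fixed Gaussian, mirroring the "conservative (lower)" behaviour the abstract reports.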

2511.17185 2026-06-01 cs.CV Updated version

PostCam: Camera-Controllable Novel-View Video Generation with Query-Shared Cross-Attention

PostCam: 基于查询共享交叉注意力的相机可控新视角视频生成

Yipeng Chen, Zhichao Ye, Zhenzhou Fang, Xinyu Chen, Xiaoyu Zhang, Jialing Liu, Nan Wang, Guofeng Zhang, Haomin Liu

Affiliations: State Key Lab of CAD&CG, Zhejiang University（浙江大学CAD与CG国家重点实验室）； Shanghai InSpatio Intelligent Technology Co., Ltd.（上海InSpatio智能技术有限公司）

AI summary: Proposes the PostCam framework, which aligns 6-DoF poses and rendered features through a Query-Shared Cross-Attention mechanism to generate novel-view videos of dynamic scenes with strong detail preservation and precise camera trajectory editing.

Chinese abstract (AI-translated)

我们提出了PostCam，一个用于新视角视频生成的简化框架，在动态场景中实现了优越的细节保留和精确的相机轨迹编辑。当前方法常常在基于姿态的控制（缺乏视觉细节）和基于渲染的引导（对几何精度过于敏感）之间权衡。尽管最近有混合尝试，但由于缺乏有效的跨模态对齐，实现精确的运动和视觉一致性仍然具有挑战性。我们认为，稳健的控制源于多模态信号的深度对齐，而不是增加输入复杂性。我们的核心贡献是查询共享交叉注意力机制，它将6自由度姿态和渲染特征投影到统一的潜在空间中。这使得模型在去噪过程中能够自发地实现运动线索和像素级引导之间的内在一致性。实验表明，PostCam在保持高保真视觉细节的同时，在轨迹精度上比最先进的方法提高了20%，在复杂动态场景中表现出卓越的鲁棒性。我们的项目网页公开在：https://cccqaq.github.io/PostCam.github.io/

English abstract

We propose PostCam, a streamlined framework for novel-view video generation that achieves superior detail preservation and precise camera trajectory editing in dynamic scenes. Current methods often struggle with a trade-off between pose-based control, which lacks visual detail, and rendering-based guidance, which is overly sensitive to geometric accuracy. Despite recent hybrid attempts, achieving precise motion and visual consistency remains challenging due to the lack of effective cross-modal alignment. We argue that robust control stems from the deep alignment of multimodal signals rather than increased input complexity. Our core contribution is the Query-Shared Cross-Attention mechanism, which projects 6-DoF poses and rendered features into a unified latent space. This allows the model to spontaneously achieve intrinsic consistency between motion cues and pixel-level guidance during denoising. Experiments demonstrate that PostCam maintains high-fidelity visual details while outperforming state-of-the-art methods by 20% in trajectory precision, exhibiting superior robustness in complex dynamic scenes. Our project webpage is publicly available at: https://cccqaq.github.io/PostCam.github.io/

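2511.15692 2026-06-01 cs.CV Updated version

Hyperspectral Image Classification using Spectral-Spatial Mixer Network

高光谱图像分类的光谱-空间混合器网络

Mohammed Q. Alkhatib

Affiliations: College of Engineering and IT（工程与信息技术学院）

AI summary: Proposes SS-MixNet, a lightweight deep learning model that extracts local and long-range spectral-spatial features with 3D convolutions and parallel MLP mixer blocks, achieving high-accuracy hyperspectral image classification with only 1% labeled data.

Comments Accepted and published in IEEE WHISPERS2025

DOI: 10.1109/WHISPERS69515.2025.11501574

Chinese abstract (AI-translated)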

本文介绍了SS-MixNet，一种用于高光谱图像（HSI）分类的轻量级且有效的深度学习模型。该架构将用于局部光谱-空间特征提取的3D卷积层与两个并行的MLP风格混合器模块相结合，以捕获光谱和空间维度上的长距离依赖关系。采用基于深度可分离卷积的注意力机制，以最小的计算开销增强判别能力。该模型在QUH-Tangdaowan和QUH-Qingyun数据集上进行了评估，仅使用1%的标注数据进行训练和验证。SS-MixNet在比较的方法中取得了最高性能，包括2D-CNN、3D-CNN、IP-SWIN、SimPoolFormer和HybridKAN，在Tangdaowan和Qingyun数据集上分别达到了95.68%和93.86%的总体准确率。由定量指标和分类图支持的结果证实了该模型在有限监督下提供准确且鲁棒预测的有效性。代码将在以下网址公开：https://github.com/mqalkhatib/SS-MixNet

English abstract

This paper introduces SS-MixNet, a lightweight and effective deep learning model for hyperspectral image (HSI) classification. The architecture integrates 3D convolutional layers for local spectral-spatial feature extraction with two parallel MLP-style mixer blocks that capture long-range dependencies in spectral and spatial dimensions. A depthwise convolution-based attention mechanism is employed to enhance discriminative capability with minimal computational overhead. The model is evaluated on the QUH-Tangdaowan and QUH-Qingyun datasets using only 1% of labeled data for training and validation. SS-MixNet achieves the highest performance among compared methods, including 2D-CNN, 3D-CNN, IP-SWIN, SimPoolFormer, and HybridKAN, reaching 95.68% and 93.86% overall accuracy on the Tangdaowan and Qingyun datasets, respectively. The results, supported by quantitative metrics and classification maps, confirm the model's effectiveness in delivering accurate and robust predictions with limited supervision. The code will be made publicly available at: https://github.com/mqalkhatib/SS-MixNet

2510.22067 2026-06-01 cs.CV 版本更新

Capturing Gaze Shifts for Guidance: Cross-Modal Fusion Enhancement for VLM Hallucination Mitigation

捕捉注视转移以引导：跨模态融合增强用于VLM幻觉缓解

Zheng Qi, Chao Shang, Evangelia Spiliopoulou, Nikolaos Pappas

发表机构 * AWS AI Labs（AWS人工智能实验室）

AI总结提出GIFT方法，通过预计算视觉显著性图并跟踪注视转移，在解码时增强对显著视觉信息和用户查询的注意力，以缓解视觉语言模型中的幻觉问题。

Comments ICML 2026

AI中文摘要

视觉语言模型（VLM）经常产生幻觉，即无法由文本或视觉输入证实的内容。先前的工作主要将其归因于过度依赖语言先验知识而非视觉输入。一些方法尝试通过按注意力分数比例放大视觉令牌注意力来缓解幻觉。然而，这些方法忽视了视觉注意力沉没问题，即注意力经常被错误分配到与任务无关的视觉区域，并且忽略了跨模态融合平衡，仅增强视觉注意力而不调整对用户查询的注意力。这可能导致放大错误区域，同时无法正确解释用户查询。为解决这些挑战，我们提出了一种简单而有效的方法，称为注视转移引导的跨模态融合增强（GIFT）。GIFT通过在用户查询理解过程中跟踪视觉注意力的正向变化（即“注视转移”），预计算整体视觉显著性图，并利用该图在每个解码步骤放大对显著视觉信息和用户查询的注意力。这减少了视觉注意力沉没的影响，因为无关令牌的转移最小，同时确保平衡的跨模态融合以获得良好整合的表示。大量实验表明，GIFT在生成和分类任务中均有效缓解了VLM的幻觉，与贪婪解码相比实现了高达20.7%的改进，同时以低计算开销保持了通用的视觉语言性能。

英文摘要

Vision language models (VLMs) often generate hallucination, i.e., content that cannot be substantiated by either textual or visual inputs. Prior work primarily attributes this to over-reliance on linguistic prior knowledge rather than visual inputs. Some methods attempt to mitigate hallucination by amplifying visual token attention proportionally to their attention scores. However, these methods overlook the visual attention sink problem, where attention is frequently misallocated to task-irrelevant visual regions, and neglect cross-modal fusion balance by enhancing only visual attention without adjusting attention to the user query. This can result in amplifying incorrect areas while failing to properly interpret the user query. To address these challenges, we propose a simple yet effective method called Gaze Shift-Guided Cross-modal Fusion Enhancement (GIFT). GIFT pre-computes a holistic visual saliency map by tracking positive changes in visual attention, or "gaze shifts", during user query comprehension, and leverages this map to amplify attention to both salient visual information and the user query at each decoding step. This reduces the impact of visual attention sink, as irrelevant tokens exhibit minimal shifts, while ensuring balanced cross-modal fusion for well-integrated representation. Extensive experiments show that GIFT effectively mitigates hallucination in VLMs across both generative and classification tasks, achieving up to 20.7% improvement over greedy decoding, while maintaining general vision-language performance with low computational overhead.
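
The gaze-shift idea in this abstract can be illustrated with a small sketch. It assumes that attention over visual tokens is recorded once per query token, that saliency is the accumulated positive change between consecutive records, and that amplification is a simple multiplicative boost followed by renormalization. The function names, the `alpha` parameter, and the boost rule are hypothetical and are not taken from the paper.

```python
def gaze_shift_saliency(attn_steps):
    """Accumulate positive attention changes per visual token.

    attn_steps: list of attention vectors over visual tokens, one per
    query token processed (hypothetical input layout).
    """
    n = len(attn_steps[0])
    saliency = [0.0] * n
    for prev, cur in zip(attn_steps, attn_steps[1:]):
        for v in range(n):
            # Only increases in attention ("gaze shifts") count.
            saliency[v] += max(0.0, cur[v] - prev[v])
    peak = max(saliency)
    # Normalize to [0, 1]; leave all-zero saliency untouched.
    return [s / peak for s in saliency] if peak > 0 else saliency


def amplify(attn, saliency, alpha=1.0):
    """Boost attention on salient tokens, then renormalize to sum to 1."""
    boosted = [a * (1.0 + alpha * s) for a, s in zip(attn, saliency)]
    total = sum(boosted)
    return [b / total for b in boosted]
```

In this toy setup a token that holds high but constant attention (an attention sink) receives zero saliency, so only tokens whose attention grows while the query is read are amplified.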

2511.05875 2026-06-01 cs.HC cs.AI cs.CV 版本更新

Towards a Humanized Social-Media Ecosystem: AI-Augmented HCI Design Patterns for Safety, Agency & Well-Being

迈向人性化的社交媒体生态系统：面向安全、自主与福祉的AI增强人机交互设计模式

Mohd Ruhul Ameen, Akif Islam

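Affiliations: College of Engineering and Computer Sciences, Marshall University, Huntington, WV, USA; Department of Computer Science and Engineering, University of Rajshahi, Rajshahi 6205, Bangladesh

AI summary: Proposes the Human-Layer AI (HL-AI) framework: user-owned, explainable intermediaries in the browser that give users real-time control without platform cooperation, implementing five design patterns (content rewriting, integrity detection, feed curation, behavioral interruption, and recovery mode) to improve social-media safety and user well-being.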

Comments 6 pages, 5 tables, 7 figures, and 2 algorithm tables. Accepted at International Conference on Signal Processing, Information, Communication and Systems (SPICSCON 2025)

DOI: 10.1109/SPICSCON69221.2025.11504070
Journal ref: 2025 IEEE International Conference on Signal Processing, Information, Communication and Systems (SPICSCON)

AI中文摘要

社交平台连接了数十亿人，但其以参与度优先的算法往往对用户施加影响而非与用户协作，加剧了压力、虚假信息和失控感。我们提出Human-Layer AI（HL-AI）——用户拥有的、可解释的中介，位于浏览器中平台逻辑与界面之间。HL-AI赋予人们实用的、即时的控制权，无需平台合作。我们贡献了一个可用的Chrome/Edge原型，实现了五种代表性模式框架——上下文感知帖子重写器、帖子完整性检测器、精细信息流策展器、微退出代理和恢复模式——以及一个统一的数学公式，平衡用户效用、自主成本和风险阈值。评估涵盖技术准确性、可用性和行为结果。结果是一套人性化的控制手段，帮助用户在伤害发生前重写内容、通过完整性提示阅读、有意图地调整信息流、暂停强迫性循环以及在骚扰期间寻求庇护，同时通过解释和覆盖选项保留自主权。该原型为改造当今的信息流以融入安全性、自主性和福祉提供了实用路径，并邀请进行严格的跨文化用户评估。

英文摘要

Social platforms connect billions of people, yet their engagement-first algorithms often work on users rather than with them, amplifying stress, misinformation, and a loss of control. We propose Human-Layer AI (HL-AI)--user-owned, explainable intermediaries that sit in the browser between platform logic and the interface. HL-AI gives people practical, moment-to-moment control without requiring platform cooperation. We contribute a working Chrome/Edge prototype implementing five representative pattern frameworks--Context-Aware Post Rewriter, Post Integrity Meter, Granular Feed Curator, Micro-Withdrawal Agent, and Recovery Mode--alongside a unifying mathematical formulation balancing user utility, autonomy costs, and risk thresholds. Evaluation spans technical accuracy, usability, and behavioral outcomes. The result is a suite of humane controls that help users rewrite before harm, read with integrity cues, tune feeds with intention, pause compulsive loops, and seek shelter during harassment, all while preserving agency through explanations and override options. This prototype offers a practical path to retrofit today's feeds with safety, agency, and well-being, inviting rigorous cross-cultural user evaluation.

2510.15710 2026-06-01 cs.CV 版本更新

UniMedVL: Unifying Medical Multimodal Understanding and Generation through Observation-Knowledge-Analysis

UniMedVL: 通过观察-知识-分析统一医学多模态理解与生成

Junzhi Ning, Wei Li, Cheng Tang, Jiashi Lin, Chenglong Ma, Chaoyang Zhang, Jiyao Liu, Ying Chen, Shujian Gao, Yuandong Pu, Huihui Xu, Chenhui Gou, Ziyan Huang, Yi Xin, Qi Qin, Diping Song, Bin Fu, Guang Yang, Yuanfeng Ji, Tianbin Li, Yanzhou Su, Jin Ye, Shixiang Tang, Zhongying Deng, Lihao Liu, Ming Hu, Junjun He

Affiliations: Shanghai Artificial Intelligence Laboratory; Shanghai Innovation Institute; Shanghai Jiao Tong University; Shanghai Institute of Optics; Fudan University; University of Cambridge; Monash University; DAMO Academy, Alibaba Group; Imperial College London; The University of Hong Kong; The Hong Kong University of Science; Hupan Lab; The Chinese University of Hong Kong

AI summary: Proposes UniMedVL, the first unified medical model, which fuses multimodal understanding and generation through a progressive training pipeline and is validated on a dataset of 5.6M instances spanning 8 imaging modalities.

Comments This submission has been converted to the ICML template

AI中文摘要

医学工作流程通常结合阅读图像与生成视觉和文本输出，使得图像理解和生成成为医学AI的核心。然而，大多数现有系统在孤立模型中处理这些能力，失去了统一架构可以利用的共享知识。为弥合这一差距，我们提出了UniMedVL，这是第一个在单个模型中无缝集成多模态理解和生成能力而无需切换权重的统一医学模型。我们通过定制的渐进式训练流水线实现这一点，其中理解和生成相互增强。为有效训练UniMedVL，我们整理了UniMedVL-5M，这是第一个大规模医学数据集，包含跨越8种医学影像模态的超过560万个实例，专为统一医学理解和生成中的多模态输入输出任务设计。实验结果表明，UniMedVL在五个医学理解基准上取得了有竞争力的性能。关键的是，UniMedVL原生支持多种交错生成任务，例如虚拟染色、超分辨率、跨模态合成，这些对于复杂的医学工作流程至关重要。我们的代码和数据集已公开。

英文摘要

Medical workflows routinely combine reading images with producing visual and textual outputs, making both image understanding and generation central to medical AI. Most existing systems, however, address these abilities in isolated models, losing the shared knowledge that a unified architecture could exploit. To bridge this gap, we present UniMedVL, the first unified medical model that seamlessly integrates multimodal understanding and generation capabilities within a single model without switching weights. We achieve this via a tailored progressive training pipeline where understanding and generation mutually reinforce each other. To effectively train UniMedVL, we curate UniMedVL-5M, the first large-scale medical dataset comprising over 5.6M instances across 8 medical imaging modalities, tailored for multimodal input-output tasks in unified medical understanding and generation. Experimental results demonstrate that UniMedVL achieves competitive performance on five medical understanding benchmarks. Crucially, UniMedVL natively supports diverse interleaved generation tasks, e.g., virtual staining, super-resolution, cross-modal synthesis, essential for complex medical workflows. Our code and dataset are publicly available.

2506.22304 2026-06-01 cs.LG cs.CV 版本更新

Unfolding Generative Flows with Koopman Operators: Trajectory-Preserving Linearization

利用Koopman算子展开生成流：轨迹保持的线性化

Erkan Turan, Aristotelis Siozopoulos, Louis Martinez, Julien Gaubil, Emery Pierson, Maks Ovsjanikov

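Affiliations: University of Athens, Greece

AI summary: Proposes a Koopman-theory-based global linearization that lifts a pretrained conditional flow matching model into a higher-dimensional Koopman space, achieving trajectory-preserving linearization that supports one-step parallel sampling and spectral analysis of generative trajectories.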

AI中文摘要

连续归一化流（CNFs）实现了优雅的生成建模，但受限于其迭代性质，需要昂贵的采样且缺乏中间状态的可解释性。最近的方法通过拉直轨迹或蒸馏端点来加速采样，但将原始生成过程视为黑箱，丢弃了教师模型的中间动态。我们提出了一种根本不同的视角：通过Koopman理论全局线性化流动态，以实现轨迹保持的线性化。通过将预训练的条件流匹配（CFM）模型提升到高维Koopman空间，我们用单个线性算子表示其演化。关键的是，与仅边界蒸馏不同，我们的方法沿整个生成路径强制与教师向量场保持无穷小一致性。我们推导了一个实用的、无模拟的训练目标，确保这种全局对齐，并带来两个关键优势。首先，采样变为一步且可并行化。其次，由于线性化忠实于动态，Koopman算子提供了对生成的独特见解。我们证明，这种结构能够实现先前方法无法实现的新应用，包括发现语义一致的编辑方向、使用与教师对齐的线性算子进行反演以及类条件谱特征。实验上，我们的方法在实现竞争性样本质量的同时，能够对生成流的整个轨迹进行谱分析和控制。

英文摘要

Continuous Normalizing Flows (CNFs) enable elegant generative modeling but remain bottlenecked by their iterative nature requiring costly sampling and lacking interpretability of the intermediate states. Recent approaches accelerate sampling by straightening trajectories or distilling endpoints, yet they treat the original generative process as a black box, discarding the teacher's intermediate dynamics. We propose a fundamentally different perspective: globally linearizing flow dynamics via Koopman theory to achieve trajectory-preserving linearization. By lifting a pre-trained Conditional Flow Matching (CFM) model into a higher-dimensional Koopman space, we represent its evolution with a single linear operator. Crucially, unlike boundary-only distillation, our method enforces infinitesimal consistency with the teacher's vector field along the full generative path. We derive a practical, simulation-free training objective that ensures this global alignment and yields two key benefits. First, sampling becomes one-step and parallelizable. Second, because the linearization is faithful to the dynamics, the Koopman operator provides unique insights on the generation. We demonstrate that this structure enables novel applications unavailable in prior approaches, including discovery of semantically coherent editing directions, inversion with a teacher-aligned linear operator and class-conditional spectral signatures. Empirically, our approach achieves competitive sample quality, while enabling spectral analysis and control of the entire trajectories of generative flows.
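
Why a single linear operator makes sampling one-step can be seen on a textbook Koopman example. The sketch below is not the paper's learned model: it uses a hand-picked nonlinear system, dx1/dt = MU*x1 and dx2/dt = LAM*(x2 - x1^2), whose lifting z = (x1, x2, x1^2) is exactly linear. The constants, function names, and the Taylor-series matrix exponential are all illustrative assumptions.

```python
MU, LAM = -0.1, -1.0
# Linear dynamics dz/dt = K z in the lifted coordinates z = (x1, x2, x1^2).
K = [[MU, 0.0, 0.0],
     [0.0, LAM, -LAM],
     [0.0, 0.0, 2.0 * MU]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def expm(m, terms=40):
    """Matrix exponential via a truncated Taylor series (fine for small norms)."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for order in range(1, terms):
        term = [[v / order for v in row] for row in matmul(term, m)]
        result = [[r + t for r, t in zip(rr, tr)] for rr, tr in zip(result, term)]
    return result


def one_step(x1, x2, horizon=1.0):
    """Jump from t=0 to t=horizon with one matrix-vector product."""
    phi = expm([[v * horizon for v in row] for row in K])
    z0 = [x1, x2, x1 * x1]  # lift
    z = [sum(phi[i][j] * z0[j] for j in range(3)) for i in range(3)]
    return z[0], z[1]       # project back to the original state


def rk4(x1, x2, horizon=1.0, steps=1000):
    """Reference: iterative integration of the nonlinear system."""
    def f(a, b):
        return MU * a, LAM * (b - a * a)
    h = horizon / steps
    for _ in range(steps):
        k1 = f(x1, x2)
        k2 = f(x1 + 0.5 * h * k1[0], x2 + 0.5 * h * k1[1])
        k3 = f(x1 + 0.5 * h * k2[0], x2 + 0.5 * h * k2[1])
        k4 = f(x1 + h * k3[0], x2 + h * k3[1])
        x1 += h * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0]) / 6
        x2 += h * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1]) / 6
    return x1, x2
```

Calling `one_step(1.0, 0.5)` and `rk4(1.0, 0.5)` should agree closely, even though the first uses a single linear map and the second takes 1000 integration steps. In the paper's setting the lifting is learned rather than known in closed form.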

2510.17700 2026-06-01 cs.CV 版本更新

Elastic ViTs from Pretrained Models without Retraining

无需重新训练的预训练模型弹性ViTs

Walter Simoncini, Michael Dorkenwald, Tijmen Blankevoort, Cees G. M. Snoek, Yuki M. Asano

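Affiliations: University of Technology Nuremberg; University of Amsterdam; NVIDIA

AI summary: Proposes SnapViT, which combines gradient information with an evolutionary-algorithm approximation of cross-network structure correlations to perform retraining-free structured pruning, enabling elastic inference across a continuum of compute budgets.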

Comments Accepted at NeurIPS 2025

AI中文摘要

视觉基础模型取得了显著性能，但仅以有限的预定尺寸可用，导致在现实约束下部署选择次优。我们引入SnapViT：用于剪枝视觉Transformer的单次网络近似，一种新的后预训练结构化剪枝方法，可在连续计算预算范围内实现弹性推理。我们的方法高效地将梯度信息与跨网络结构相关性相结合，通过进化算法近似，无需标注数据，可推广到无分类头的模型，且无需重训练。在DINO、SigLIPv2、DeIT和AugReg模型上的实验表明，在各种稀疏度下，该方法优于最先进方法，在单个A100 GPU上不到五分钟即可生成可调整到任何计算预算的弹性模型。我们的主要贡献包括：一种针对预训练ViT的高效剪枝策略，一种新颖的Hessian非对角结构的进化近似，以及一种无需重训练或标签即可保持强大性能的自监督重要性评分机制。代码和剪枝模型可在https://elastic.ashita.nl/获取。

英文摘要

Vision foundation models achieve remarkable performance but are only available in a limited set of pre-determined sizes, forcing sub-optimal deployment choices under real-world constraints. We introduce SnapViT: Single-shot network approximation for pruned Vision Transformers, a new post-pretraining structured pruning method that enables elastic inference across a continuum of compute budgets. Our approach efficiently combines gradient information with cross-network structure correlations, approximated via an evolutionary algorithm, does not require labeled data, generalizes to models without a classification head, and is retraining-free. Experiments on DINO, SigLIPv2, DeIT, and AugReg models demonstrate superior performance over state-of-the-art methods across various sparsities, requiring less than five minutes on a single A100 GPU to generate elastic models that can be adjusted to any computational budget. Our key contributions include an efficient pruning strategy for pretrained Vision Transformers, a novel evolutionary approximation of Hessian off-diagonal structures, and a self-supervised importance scoring mechanism that maintains strong performance without requiring retraining or labels. Code and pruned models are available at: https://elastic.ashita.nl/

2510.09364 2026-06-01 cs.CV 版本更新

VAD-GS: Visibility-Aware Densification for 3D Gaussian Splatting in Dynamic Urban Scenes

VAD-GS：动态城市场景中3D高斯泼溅的可见性感知致密化

Yikang Zhang, Rui Fan

Affiliations: Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University; College of Electronic and Information Engineering, Tongji University; National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University

AI summary: Proposes the VAD-GS framework, which recovers missing geometry in dynamic urban scenes through voxel-based visibility reasoning, diversity-aware view selection, and multi-view stereo reconstruction, improving 3D Gaussian splatting reconstruction quality.

AI中文摘要

3D高斯泼溅（3DGS）在合成高保真新视角方面表现出色。然而，其有效性关键取决于初始化点云的质量。具体而言，要实现对底层场景结构的均匀且完整的点覆盖，需要重叠的观察视锥，这一假设在无边界、动态的城市环境中经常被违反。使用部分初始化的点云训练高斯模型通常会导致失真和伪影，因为相机射线可能无法与有效表面相交，导致梯度错误传播到与遮挡或不可见几何体关联的高斯基元。此外，现有的致密化策略只是从现有基元中克隆和分割高斯基元，无法从缺失结构中重建几何体。为解决这些限制，我们提出了VAD-GS，一个专为具有挑战性的城市场景中几何恢复设计的3DGS框架。我们的方法通过基于体素的可见性推理识别不可靠的几何结构，通过多样性感知视图选择选择信息丰富的支持视图，并通过多视图立体重建恢复缺失结构。这种设计使得即使在缺乏初始点的区域，也能在可靠几何先验的指导下生成新的高斯基元。在Waymo和nuScenes数据集上的大量实验表明，VAD-GS优于最先进的3DGS方法，并显著提高了静态和动态物体的重建几何质量。我们的项目网页位于mias.group/VAD-GS。

英文摘要

3D Gaussian splatting (3DGS) has demonstrated impressive performance in synthesizing high-fidelity novel views. Nonetheless, its effectiveness critically depends on the quality of the initialized point cloud. Specifically, achieving uniform and complete point coverage over the underlying scene structure requires overlapping observation frustums, an assumption that is often violated in unbounded, dynamic urban environments. Training Gaussian models with partially initialized point clouds often leads to distortions and artifacts, as camera rays may fail to intersect valid surfaces, resulting in incorrect gradient propagation to Gaussian primitives associated with occluded or invisible geometry. Additionally, existing densification strategies simply clone and split Gaussian primitives from existing ones, incapable of reconstructing geometry from missing structures. To address these limitations, we propose VAD-GS, a 3DGS framework tailored for geometry recovery in challenging urban scenes. Our method identifies unreliable geometry structures via voxel-based visibility reasoning, selects informative supporting views through diversity-aware view selection, and recovers missing structures via multi-view stereo reconstruction. This design enables the generation of new Gaussian primitives guided by reliable geometric priors, even in regions lacking initial points. Extensive experiments on the Waymo and nuScenes datasets demonstrate that VAD-GS outperforms state-of-the-art 3DGS approaches and significantly improves the quality of reconstructed geometry for both static and dynamic objects. Our project webpage is at mias.group/VAD-GS.

2510.07135 2026-06-01 cs.CV 版本更新

Few-Shot Adaptation Benchmark for Remote Sensing Vision-Language Models

遥感视觉语言模型的少样本适应基准

Karim El Khoury, Maxime Zanella, Christophe De Vleeschouwer, Benoit Macq

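Affiliations: UCLouvain; UMons; Fonds de la Recherche Scientifique

AI summary: Presents the first few-shot adaptation benchmark for remote sensing vision-language models, evaluating three models with five strategies across ten datasets. It finds that models with similar zero-shot performance can differ markedly under few-shot adaptation, highlighting the need for more robust methods.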

AI中文摘要

遥感视觉语言模型（RSVLMs）得益于大规模预训练，在各种任务上展现出强大的零样本性能。然而，它们在低数据场景（如少样本学习）中的泛化能力尚未得到充分探索。在这项工作中，我们提出了第一个用于评估RSVLMs少样本适应方法的结构化基准。我们在十个遥感场景分类数据集上进行了全面实验，将五种广泛使用的少样本适应策略应用于三个具有不同骨干网络的最先进RSVLMs。我们的发现表明，零样本性能相似的模型在少样本适应下可能表现出显著不同的行为，一些RSVLMs天生比其他模型更适合这种适应。性能的变异性以及现有方法中缺乏明确的优胜者，凸显了为遥感定制更鲁棒的少样本适应方法的必要性。为了促进未来研究，我们提供了一个可复现的基准框架和开源代码，以系统评估RSVLMs在少样本条件下的表现。源代码已在Github上公开：https://github.com/elkhouryk/fewshot_RSVLMs

英文摘要

Remote Sensing Vision-Language Models (RSVLMs) have shown remarkable potential thanks to large-scale pretraining, achieving strong zero-shot performance on various tasks. However, their ability to generalize in low-data regimes, such as few-shot learning, remains insufficiently explored. In this work, we present the first structured benchmark for evaluating few-shot adaptation methods on RSVLMs. We conduct comprehensive experiments across ten remote sensing scene classification datasets, applying five widely used few-shot adaptation strategies to three state-of-the-art RSVLMs with varying backbones. Our findings reveal that models with similar zero-shot performance can exhibit markedly different behavior under few-shot adaptation, with some RSVLMs being inherently more amenable to such adaptation than others. The variability of performance and the absence of a clear winner among existing methods highlight the need for the development of more robust methods for few-shot adaptation tailored to RS. To facilitate future research, we provide a reproducible benchmarking framework and open-source code to systematically evaluate RSVLMs under few-shot conditions. The source code is publicly available on Github: https://github.com/elkhouryk/fewshot_RSVLMs

2510.03876 2026-06-01 cs.CV 版本更新

Skin Lesion Classification Based on ResNet-50 Enhanced With Adaptive Spatial Feature Fusion

基于自适应空间特征融合增强的ResNet-50皮肤病变分类

Runhao Liu, Fengyi Zha, Fei Ding, Guangzhen Yao, Peng Zhang

Affiliations: Polytechnic Institute, Zhejiang University, Hangzhou, China; Chu Kochen Honors College, Zhejiang University, Hangzhou, China; Alibaba Group, Chaoyang District, Beijing, China; School of Information Science and Technology, Northeast Normal University, Changchun, China; School of Mathematical Sciences, Zhejiang University, Hangzhou, China

AI summary: Proposes an improved ResNet-50 with Adaptive Spatial Feature Fusion (ASFF), whose dual-branch structure fuses multi-scale semantic and detail features, reaching 93.182% accuracy on an ISIC 2020 subset and generalizing to the ISIC 2019 external validation set.

AI中文摘要

皮肤癌分类因皮肤镜图像中类间相似度高、类内变异大以及伪影的存在而具有挑战性。为解决这些问题，我们提出了一种改进的ResNet-50模型，结合自适应空间特征融合（ASFF），该机制自适应地整合多尺度语义和表面特征以细化表示并减少过拟合。ResNet-50模型通过自适应特征融合机制增强，以实现更有效的多尺度特征提取并提升整体性能。具体而言，双分支设计融合了高层语义特征和中间层细节特征，利用全局平均池化和全连接层生成空间权重，并强调病变相关区域。在ISIC 2020平衡子集（从原始数据集中随机选取的3,297张图像）上评估，基于ASFF的ResNet-50优于多个CNN基线，达到93.182%的准确率，并具有优越的精确率、召回率、特异性和F1分数。其AUC（P-R）达到0.9670，AUC（ROC）达到0.9717。Grad-CAM可视化显示对病变区域的聚焦更加准确。所提模型在ISIC 2019外部验证集上也表现出良好的泛化能力，优于ResNet-50基线。这些发现表明，所提方法为计算机辅助皮肤癌诊断提供了更有效且高效的解决方案。生成代码、权重和混淆矩阵已在https://github.com/Grapesea/ASFF-ResNet50-enhanced开源。

英文摘要

Skin cancer classification is challenging due to high inter-class similarity, intra-class variability, and artifacts in dermoscopic images. To address these issues, we propose an improved ResNet-50 with Adaptive Spatial Feature Fusion (ASFF), which adaptively integrates multi-scale semantic and surface features to refine representations and reduce overfitting. The ResNet-50 model is enhanced with an adaptive feature fusion mechanism to achieve more effective multi-scale feature extraction and improve overall performance. Specifically, a dual-branch design fuses high-level semantic and mid-level detail features, using global average pooling and fully connected layers to produce spatial weights that emphasize lesion-relevant regions. Evaluated on a balanced subset of ISIC 2020 (3,297 images, randomly selected from the original dataset), the ASFF-based ResNet-50 outperforms multiple CNN baselines, achieving 93.182% accuracy with superior precision, recall, specificity, and F1. It also reaches 0.9670 AUC (P-R) and 0.9717 AUC (ROC). Grad-CAM visualizations show more accurate focus on lesion areas. The proposed model also generalizes well to ISIC 2019 external validation, outperforming the ResNet-50 baseline. These findings demonstrate that the proposed approach provides a more effective and efficient solution for computer-aided skin cancer diagnosis. The code, weights, and confusion matrices are open-sourced at https://github.com/Grapesea/ASFF-ResNet50-enhanced.
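
The pooling-then-weighting step of the dual-branch fusion can be sketched in a simplified form. This version assumes one scalar weight per branch (the abstract describes spatial weights, which would vary per location), represents each feature map as a list of channels with flattened spatial values, and uses a single linear layer per branch. All names and the weight layout are hypothetical.

```python
import math


def global_avg_pool(fmap):
    """fmap: list of channels, each a flat list of spatial activations."""
    return [sum(channel) / len(channel) for channel in fmap]


def branch_logit(pooled, weights, bias=0.0):
    """One fully connected unit mapping pooled channels to a scalar logit."""
    return sum(p * w for p, w in zip(pooled, weights)) + bias


def fuse(high, mid, w_high, w_mid):
    """Softmax-weighted sum of a high-level and a mid-level feature map."""
    logits = [branch_logit(global_avg_pool(high), w_high),
              branch_logit(global_avg_pool(mid), w_mid)]
    top = max(logits)
    exps = [math.exp(l - top) for l in logits]  # numerically stable softmax
    a_high, a_mid = (e / sum(exps) for e in exps)
    fused = [[a_high * h + a_mid * m for h, m in zip(ch_h, ch_m)]
             for ch_h, ch_m in zip(high, mid)]
    return fused, (a_high, a_mid)
```

With all-zero layer weights both branches receive weight 0.5 and the output is the plain average of the two maps. Trained weights would shift the balance toward whichever branch is more informative for a given input.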

2509.19452 2026-06-01 cs.RO cs.CV cs.LG 版本更新

HUNT: High-Speed UAV Navigation and Tracking in Unstructured Environments via Instantaneous Relative Frames

HUNT：通过瞬时相对帧在非结构化环境中进行高速无人机导航与跟踪

Alessandro Saviolo, Jeffrey Mao, Giuseppe Loianno

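Affiliations: New York University; University of California Berkeley

AI summary: Proposes the HUNT framework, which uses instantaneous relative frames to unify search and tracking, achieving high-speed flight and robust autonomy.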

AI中文摘要

搜索与救援任务要求无人机既能高速穿越未知的非结构化环境，又能在检测到目标后跟踪目标。在感知退化且无全局定位的情况下实现这两种能力仍是一个开放挑战。最近的相对导航工作通过将规划和控制锚定到可见的检测目标上展示了鲁棒跟踪，但在视野中没有目标时无法进行导航。我们提出了HUNT（高速无人机导航与跟踪），一个实时框架，在单一相对公式中统一了穿越、获取和跟踪。HUNT直接从机载瞬时观测量（如姿态、高度和速度）定义导航目标，从而在搜索过程中实现反应式高速飞行。一旦检测到目标，相同的感知-控制管道无缝过渡到跟踪。在茂密森林、集装箱场地以及使用车辆和人体模型的搜索与救援任务中的户外实验表明，在全局方法失败的情况下，该框架实现了鲁棒自主性。

英文摘要

Search and rescue operations require unmanned aerial vehicles to both traverse unknown unstructured environments at high speed and track targets once detected. Achieving both capabilities under degraded sensing and without global localization remains an open challenge. Recent works on relative navigation have shown robust tracking by anchoring planning and control to a visible detected object, but cannot address navigation when no target is in the field of view. We present HUNT (High-speed UAV Navigation and Tracking), a real-time framework that unifies traversal, acquisition, and tracking within a single relative formulation. HUNT defines navigation objectives directly from onboard instantaneous observables such as attitude, altitude, and velocity, enabling reactive high-speed flight during search. Once a target is detected, the same perception-control pipeline transitions seamlessly to tracking. Outdoor experiments in dense forests, container compounds, and search-and-rescue operations with vehicles and mannequins demonstrate robust autonomy where global methods fail.

2509.21561 2026-06-01 cs.CV 版本更新

Unsupervised Defect Detection for Surgical Instruments

手术器械的无监督缺陷检测

Joseph Huang, Yichi Zhang, Jingxi Yu, Wei Chen, Seunghyun Hwang, Qiang Qiu, Amy R. Reibman, Edward J. Delp, Fengqing Zhu

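Affiliations: Purdue University School of Electrical

AI summary: Targeting false detections caused by textured backgrounds, low sensitivity to small defects, and domain shift in surgical-instrument defect detection, proposes an unsupervised method that combines background masking, patch-based analysis, and efficient domain adaptation.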

2509.20941 2026-06-01 cs.CV 版本更新

Decoding the Surgical Scene: A Scoping Review of Scene Graphs in Surgery

解码手术场景：手术中场景图的范围综述

Angelo Henriques, Korab Hoxha, Daniel Zapp, Peter C. Issa, Nassir Navab, M. Ali Nasseri

Affiliations: School of Computation, Information and Technology, Technical University of Munich; Klinik und Poliklinik für Augenheilkunde, TUM University Hospital; Computer Aided Medical Procedures, Technical University of Munich; Department of Biomedical Engineering, University of Alberta

AI summary: A PRISMA-ScR-guided scoping review that systematically maps scene graph (SG) research in surgery, analyzing 52 studies, revealing a methodological shift from graph neural networks to foundation models and generative AI, and proposing the 'Validation Trinity' evaluation framework to bridge the clinical translation gap.

Comments Submitted and accepted to Medical Image Analysis (DOI: 10.1016/j.media.2026.104083). An interactive version of the summary tables is available at: osf.io/fruq8

DOI: 10.1016/j.media.2026.104083
Journal ref: Medical Image Analysis (2026)

AI中文摘要

随着手术人工智能从像素级检测向复杂推理过渡，场景图（SG）提供了解码动态手术环境所需的结构化关系表示。本项遵循PRISMA-ScR指南的范围综述系统性地绘制了手术中SG研究的发展格局，分析了52项主要研究，以描绘应用和方法论转变。我们的分析揭示了快速增长，但也发现了一个关键的“数据鸿沟”：内部视角研究（例如，从内窥镜视频中识别三元组）占研究的81%，且几乎完全使用真实世界的2D视频，而外部视角的手术室建模则严重依赖模拟数据。在方法论上，我们识别出从基础图神经网络向专门基础模型和生成式AI的决定性转变，这些模型在2025年合计约占研究的50%。至关重要的是，我们的综合表明，场景图正从简单的描述符演变为必要的“神经符号护栏”，提供结构化、可验证的中间表示，以防止日益自主的手术基础模型产生幻觉。尽管前景广阔，但仍存在一个主要的转化差距：所审查的研究均未进入前瞻性临床验证。我们得出结论，弥合这一差距需要超越标准的计算机视觉指标；因此，我们提出“验证三位一体”——优先考虑语义查询成功率、延迟感知准确率和安全关键召回率——作为将基于图的手术人工智能引入临床实践的必要评估框架。

英文摘要

As surgical AI transitions from pixel-level detection to complex reasoning, Scene Graphs (SGs) offer the structured, relational representations necessary to decode dynamic surgical environments. This PRISMA-ScR-guided scoping review systematically maps the evolving landscape of SG research in surgery, analyzing 52 primary studies to chart applications and methodological shifts. Our analysis reveals rapid growth, yet uncovers a critical 'data divide': internal-view research (e.g., triplet recognition from endoscopic video) accounts for 81% of studies and almost exclusively uses real-world 2D video, while external-view operating room modeling relies heavily on simulated data. Methodologically, we identify a decisive shift from foundational graph neural networks to specialized foundation models and generative AI, which together now account for approximately 50% of research in 2025. Crucially, our synthesis suggests that Scene Graphs are evolving from simple descriptors into essential 'neuro-symbolic guardrails', providing the structured, verifiable intermediate representation needed to prevent hallucinations in increasingly autonomous Surgical Foundation Models. Despite this promise, a major translational gap remains: none of the reviewed studies have proceeded to prospective clinical validation. We conclude that bridging this gap requires moving beyond standard computer vision metrics; we therefore propose the 'Validation Trinity' -- prioritizing Semantic Query Success, Latency-Aware Accuracy, and Safety-Critical Recall -- as the necessary evaluation framework to bring graph-based surgical AI into clinical practice.

2509.18898 2026-06-01 cs.CV 版本更新

DeblurSplat: SfM-free 3D Gaussian Splatting with Event Camera for Robust Deblurring

DeblurSplat：基于事件相机的无SfM三维高斯泼溅鲁棒去模糊方法

Pengteng Li, Yunfan Lu, Pinhao Song, Weiyu Guo, Huizai Yao, F. Richard Yu, Hui Xiong

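Affiliations: The Hong Kong University of Science and Technology (Guangzhou); KU Leuven; Carleton University

AI summary: Proposes the first Structure-from-Motion-free deblurring 3D Gaussian splatting method, using a dense stereo module and event streams to achieve high-quality novel view synthesis and efficient rendering.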

Comments Accepted by TMM 2026

AI中文摘要

本文提出首个无需运动恢复结构（SfM）的基于事件相机的去模糊三维高斯泼溅方法，称为DeblurSplat。我们从两个方面解决运动去模糊问题。首先，利用密集立体模块（DUSt3R）的预训练能力，直接从模糊图像中获取准确的初始点云。无需计算相机位姿作为中间结果，避免了不准确相机位姿到初始点云位置的累积误差传递。其次，将事件流引入去模糊流水线，利用其对动态变化的高敏感性。通过从事件流和模糊图像中解码潜在清晰图像，我们可以为场景重建优化提供细粒度监督信号。在多种场景上的大量实验表明，与去模糊3D-GS的最新方法相比，DeblurSplat不仅在新视图生成中表现出高保真度，而且实现了显著的渲染效率。

英文摘要

In this paper, we propose the first Structure-from-Motion (SfM)-free deblurring 3D Gaussian Splatting method using an event camera, dubbed DeblurSplat. We address the motion-deblurring problem in two ways. First, we leverage the pretrained capability of the dense stereo module (DUSt3R) to directly obtain accurate initial point clouds from blurred images. Because camera poses are not computed as an intermediate result, we avoid the transfer of cumulative errors from inaccurate camera poses to the positions of the initial point clouds. Second, we introduce the event stream into the deblurring pipeline for its high sensitivity to dynamic changes. By decoding the latent sharp images from the event stream and blurred images, we can provide a fine-grained supervision signal for scene reconstruction optimization. Extensive experiments across a range of scenes demonstrate that DeblurSplat not only excels in generating high-fidelity novel views but also achieves significant rendering efficiency compared to state-of-the-art deblurring 3D-GS methods.

2506.11653 2026-06-01 cs.CV cs.AI cs.LG 版本更新

DISCO: Mitigating Bias in Deep Learning with Conditional Distance Correlation

DISCO: 使用条件距离相关性减轻深度学习中的偏差

Emre Kavak, Tom Nuno Wolf, Christian Wachinger

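Affiliations: Technical University of Munich, Germany; Konrad Zuse School of Excellence in Reliable AI, Germany; Munich Center for Machine Learning (MCML), Germany

AI summary: Proposes a conditional-independence criterion based on an anticausal model and designs efficient estimators of conditional distance correlation, DISCO$_m$ and sDISCO, used as regularizers to mitigate bias in gradient-based models, outperforming or matching existing methods on multiple datasets.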

Comments Accepted to ICML 2026 (oral)

2509.10114 2026-06-01 cs.CV 版本更新

A Lightweight Ensemble-Based Face Image Quality Assessment Method with Correlation-Aware Loss

一种基于集成学习的轻量级人脸图像质量评估方法及关联感知损失

MohammadAli Hamidi, Hadi Amirpour, Luigi Atzori, Christian Timmerer

Affiliations: DIEE, University of Cagliari, CNIT, University of Cagliari; Department of Information Technology (ITEC), Alpen-Adria Universität Klagenfurt

AI summary: Proposes a lightweight ensemble of MobileNetV3-Small and ShuffleNetV2 trained with the MSECorrLoss loss function, achieving high accuracy at low computational cost on the VQualA benchmark.

Comments This paper has been published in the Proceedings of ICCV 2025. The final published version is available via IEEE Xplore

DOI: 10.1109/ICCVW69036.2025.00362

AI中文摘要

人脸图像质量评估（FIQA）在人脸识别和验证系统中起着关键作用，尤其是在非受控的真实世界环境中。尽管已有多种方法被提出，但通用的无参考图像质量评估技术往往无法捕捉人脸特定的退化。同时，最先进的FIQA模型通常计算量大，限制了其实际应用。我们提出了一种轻量级且高效的FIQA方法，专为野外人脸图像的感知评估而设计。我们的方法集成了两个紧凑的卷积神经网络MobileNetV3-Small和ShuffleNetV2，并通过简单平均进行预测级融合。为了增强与人类感知判断的一致性，我们采用了一种关联感知损失（MSECorrLoss），将均方误差（MSE）与皮尔逊相关正则化器相结合。我们的方法在准确性和计算成本之间取得了良好的平衡，使其适用于实际部署。在VQualA FIQA基准上的实验表明，我们的模型达到了0.9829的斯皮尔曼秩相关系数（SRCC）和0.9894的皮尔逊线性相关系数（PLCC），同时保持在竞赛效率约束内。

英文摘要

Face image quality assessment (FIQA) plays a critical role in face recognition and verification systems, especially in uncontrolled, real-world environments. Although several methods have been proposed, general-purpose no-reference image quality assessment techniques often fail to capture face-specific degradations. Meanwhile, state-of-the-art FIQA models tend to be computationally intensive, limiting their practical applicability. We propose a lightweight and efficient method for FIQA, designed for the perceptual evaluation of face images in the wild. Our approach integrates an ensemble of two compact convolutional neural networks, MobileNetV3-Small and ShuffleNetV2, with prediction-level fusion via simple averaging. To enhance alignment with human perceptual judgments, we employ a correlation-aware loss (MSECorrLoss), combining mean squared error (MSE) with a Pearson correlation regularizer. Our method achieves a strong balance between accuracy and computational cost, making it suitable for real-world deployment. Experiments on the VQualA FIQA benchmark demonstrate that our model achieves a Spearman rank correlation coefficient (SRCC) of 0.9829 and a Pearson linear correlation coefficient (PLCC) of 0.9894, remaining within competition efficiency constraints.
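
A correlation-aware loss of the kind described here can be sketched as mean squared error plus a penalty on low Pearson correlation. The sketch assumes the regularizer takes the form `alpha * (1 - r)`. The weight `alpha`, the function names, and this exact form are assumptions and are not confirmed by the abstract.

```python
import math


def pearson(pred, target, eps=1e-8):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    std_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    std_t = math.sqrt(sum((t - mean_t) ** 2 for t in target))
    return cov / (std_p * std_t + eps)  # eps guards against constant inputs


def mse_corr_loss(pred, target, alpha=0.5):
    """MSE plus a penalty that vanishes when predictions correlate perfectly."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return mse + alpha * (1.0 - pearson(pred, target))
```

The MSE term keeps predicted scores on the right scale, while the correlation term rewards the linear agreement that PLCC measures. A prediction that is uniformly offset from the target is penalized by the first term only.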

2508.20478 2026-06-01 cs.CV 版本更新

Video-MTR: Reinforced Multi-Turn Reasoning for Long Video Understanding

Video-MTR: 用于长视频理解的多轮强化推理

Yuan Xie, Tianshui Chen, Zheng Ge, Lionel Ni

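Affiliations: The Hong Kong University of Science and Technology (Guangzhou); Guangdong University of Technology; X-Era AI Lab

AI summary: Proposes the Video-MTR framework, which uses reinforced multi-turn reasoning to iteratively select key video segments and comprehend the question, trained end-to-end with a gated bi-level reward system, improving accuracy and efficiency on long-video understanding benchmarks.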

Comments Accepted by ICML 2026. Camera-ready version

AI中文摘要

长视频理解因其长期时间依赖性和多事件特性仍然是一个挑战。现有方法通常依赖静态推理或外部视觉语言模型（VLM），但存在复杂性和缺乏端到端训练导致的次优性能等问题。本文提出Video-MTR，一个强化多轮推理框架，旨在实现迭代的关键视频片段选择和问题理解。与传统的单轮预测视频推理流程不同，Video-MTR进行多轮推理，基于对先前处理片段和当前问题的逐步理解，逐步选择视频片段。这种迭代过程允许对视频进行更精细和上下文感知的分析。为确保中间推理过程，我们引入了一种新颖的门控双层奖励系统，结合基于答案正确性的轨迹级奖励和强调帧-查询相关性的轮次级奖励。该系统优化了视频片段选择和问题理解，无需外部VLM，并允许端到端训练。在VideoMME、MLVU和EgoSchema等基准上的大量实验表明，Video-MTR在准确性和效率上均优于现有方法，推动了长视频理解的最新进展。

英文摘要

Long-form video understanding, characterized by long-range temporal dependencies and multiple events, remains a challenge. Existing methods often rely on static reasoning or external visual-language models (VLMs), which face issues like complexity and sub-optimal performance due to the lack of end-to-end training. In this paper, we propose Video-MTR, a reinforced multi-turn reasoning framework designed to enable iterative key video segment selection and question comprehension. Unlike traditional video reasoning pipelines, which generate predictions in a single turn, Video-MTR performs reasoning in multiple turns, selecting video segments progressively based on the evolving understanding of previously processed segments and the current question. This iterative process allows for a more refined and contextually aware analysis of the video. To guide the intermediate reasoning process, we introduce a novel gated bi-level reward system, combining trajectory-level rewards based on answer correctness and turn-level rewards emphasizing frame-query relevance. This system optimizes both video segment selection and question comprehension, eliminating the need for external VLMs and allowing end-to-end training. Extensive experiments on benchmarks like VideoMME, MLVU, and EgoSchema demonstrate that Video-MTR outperforms existing methods in both accuracy and efficiency, advancing the state-of-the-art in long video understanding.
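
One way to read the gated bi-level reward is sketched below. The abstract does not specify the gating rule, so this sketch assumes that turn-level relevance rewards are granted only when the final answer is correct. The function name, the `turn_weight` parameter, and the per-turn relevance scores in [0, 1] are all hypothetical.

```python
def gated_bilevel_rewards(answer_correct, turn_relevance, turn_weight=0.5):
    """Return one reward per reasoning turn.

    answer_correct: whether the final answer of the trajectory is right.
    turn_relevance: frame-query relevance score for each turn (assumed in [0, 1]).
    """
    trajectory_reward = 1.0 if answer_correct else 0.0
    # Gate: turn-level credit is paid out only on successful trajectories,
    # so segment selection cannot be rewarded independently of the answer.
    gate = 1.0 if answer_correct else 0.0
    return [trajectory_reward + gate * turn_weight * r for r in turn_relevance]
```

Under this assumption a trajectory with a wrong answer earns nothing regardless of how relevant its selected segments were, which ties intermediate credit to the end goal.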

2508.19830 2026-06-01 cs.CV cs.AI Updated version

Target-Agnostic Calibration under Distribution Shift with Frequency-Aware Gradient Rectification

Yilin Zhang, Cai Xu, You Wu, Ziyu Guan, Wei Zhao

Affiliations: School of Computer Science and Technology, Xidian University, Xi'an, China

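AI summary: Proposes the Frequency-aware Gradient Rectification (FGR) framework, which low-pass filters a subset of training images to reduce spurious high-frequency cues and learn domain-invariant features, and uses geometric projection to keep in-distribution calibration from degrading, improving calibration under distribution shift without any target-domain information.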

Comments 25 pages, Accepted at ICML 2026

Abstract

Real-world model deployments inevitably encounter distribution shifts, which render the confidence estimates of deep neural networks highly unreliable and pose severe risks in safety-critical applications. Existing methods improve calibration via training-time regularization or post-hoc adjustment, but often rely on access to (or simulation of) target domains, limiting practicality. We propose Frequency-aware Gradient Rectification (FGR), a target-agnostic training framework for robust calibration. From a frequency perspective, FGR applies low-pass filtering to a subset of training images to diminish spurious high-frequency cues and encourage the learning of domain-invariant features. However, the associated information loss can degrade In-Distribution (ID) calibration. To resolve this trade-off, FGR treats ID calibration as a hard constraint and rectifies conflicting parameter updates via geometric projection. This ensures a first-order non-increase in the ID calibration objective without introducing an additional loss-balancing coefficient. Extensive experiments on synthetic, real-world, and semantic shift datasets demonstrate that FGR significantly improves calibration under diverse shifts while preserving ID performance, and it remains compatible with post-hoc calibration methods. Our code is available at https://github.com/YilinZhang107/FGR-Calib.
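
The projection step in the abstract can be sketched in a few lines. This is a minimal illustration, assuming the rectification removes the component of the training gradient that opposes the ID calibration gradient (a standard projection onto the set of non-conflicting directions); the rule actually used in FGR may differ. The function names and the flat-list gradient representation are hypothetical.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def rectify_gradient(g: Sequence[float], g_id: Sequence[float],
                     eps: float = 1e-12) -> List[float]:
    """Project g so that it no longer conflicts with g_id.

    g    : gradient of the training loss (e.g. on low-pass-filtered data)
    g_id : gradient of the in-distribution calibration objective

    Assumed rule (not confirmed by the source): if g . g_id < 0, remove
    the component of g along g_id; otherwise return g unchanged.
    """
    inner = dot(g, g_id)
    norm_sq = dot(g_id, g_id)
    if inner >= 0.0 or norm_sq < eps:
        return list(g)  # no conflict, or no usable constraint direction
    scale = inner / norm_sq
    return [gi - scale * ci for gi, ci in zip(g, g_id)]
```

With a gradient step θ ← θ − ηg, the first-order change in the ID objective is −η (g · g_id). After the projection above, g · g_id ≥ 0, so that change is non-positive, and no loss-balancing coefficient is involved.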

2507.16362 2026-06-01 cs.CV Updated version

LPTR-AFLNet: Lightweight Integrated Chinese License Plate Rectification and Recognition Network

Guangzhu Xu, Pengcheng Zuo, Zhi Ke, Bangjun Lei

Affiliations: Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, China Three Gorges University; College of Computer and Information Technology, China Three Gorges University; Hubei Key Laboratory of Digital Finance Innovation; School of Information Engineering, Hubei University of Economics

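AI summary: Proposes LPTR-AFLNet, a lightweight unified network that combines a perspective transformation correction module with the optimized AFLNet recognition network, uses the recognition output as a weak supervisory signal to guide rectification, and adopts an improved attention module and Focal Loss for efficient, accurate license plate rectification and recognition.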

Comments 28 pages, 33 figures

Abstract

Chinese License Plate Recognition (CLPR) faces numerous challenges in unconstrained and complex environments, particularly due to perspective distortions caused by various shooting angles and the correction of single-line and double-line license plates. Considering the limited computational resources of edge devices, developing a low-complexity, end-to-end integrated network for both correction and recognition is essential for achieving real-time and efficient deployment. In this work, we propose a lightweight, unified network named LPTR-AFLNet for correcting and recognizing Chinese license plates, which combines a perspective transformation correction module (PTR) with an optimized license plate recognition network, AFLNet. The network leverages the recognition output as a weak supervisory signal to effectively guide the correction process, ensuring accurate perspective distortion correction. To enhance recognition accuracy, we introduce several improvements to LPRNet, including an improved attention module to reduce confusion among similar characters and the use of Focal Loss to address class imbalance during training. Experimental results demonstrate the exceptional performance of LPTR-AFLNet in rectifying perspective distortion and recognizing double-line license plate images, maintaining high recognition accuracy across various challenging scenarios. Moreover, on low- to mid-range GPU platforms, the method runs in less than 10 milliseconds, indicating its practical efficiency and broad applicability.
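
For readers unfamiliar with Focal Loss, the sketch below shows the standard single-sample form, FL(p_t) = -alpha * (1 - p_t)^gamma * log(p_t), which down-weights well-classified examples. How AFLNet applies it to character sequences is not described in the abstract, so this is only the generic per-class formulation; the function names and default values are illustrative assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(logits: Sequence[float], target: int,
               gamma: float = 2.0, alpha: float = 1.0) -> float:
    """Standard focal loss for one sample.

    gamma = 0 recovers ordinary cross-entropy; larger gamma shrinks the
    loss of examples the model already classifies confidently.
    """
    p_t = softmax(logits)[target]  # predicted probability of the true class
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)
```

Because frequent characters tend to be classified confidently, the (1 - p_t)^gamma factor reduces their contribution, so rarer or easily confused characters account for more of the training signal.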

2507.17335 2026-06-01 cs.CV cs.CL Updated version

TransLPRNet: Lite Vision-Language Network for Single/Dual-line Chinese License Plate Recognition

Guangzhu Xu, Zhi Ke, Pengcheng Zuo, Bangjun Lei

Affiliations: Hubei Key Laboratory of Intelligent Vision Based Monitoring for Hydroelectric Engineering, China Three Gorges University; College of Computer and Information Technology, China Three Gorges University; Hubei Key Laboratory of Digital Finance Innovation; School of Information Engineering, Hubei University of Economics

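AI summary: To handle diverse license plate types and complex imaging conditions in open environments, proposes a unified solution that integrates a lightweight visual encoder with a text decoder and, through a pre-training framework and a perspective correction network, achieves high-accuracy recognition of single- and double-line Chinese license plates.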

Abstract

License plate recognition in open environments is widely applicable across various domains; however, the diversity of license plate types and imaging conditions presents significant challenges. To address the limitations encountered by CNN- and CRNN-based approaches in license plate recognition, this paper proposes a unified solution that integrates a lightweight visual encoder with a text decoder, within a pre-training framework tailored for single- and double-line Chinese license plates. To mitigate the scarcity of double-line license plate datasets, we constructed a single/double-line license plate dataset by synthesizing images, applying texture mapping onto real scenes, and blending them with authentic license plate images. Furthermore, to enhance the system's recognition accuracy, we introduce a perspective correction network (PTN) that employs license plate corner coordinate regression as an implicit variable, supervised by license plate view classification information. This network offers improved stability, interpretability, and low annotation costs. The proposed algorithm achieves an average recognition accuracy of 99.34% on the corrected CCPD test set under coarse localization disturbance. When evaluated under fine localization disturbance, the accuracy further improves to 99.58%. On the double-line license plate test set, it achieves an average recognition accuracy of 98.70%, with processing speeds reaching up to 167 frames per second, indicating strong practical applicability.
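
Once four plate corners are available, perspective correction reduces to a planar warp. The sketch below, assuming the rectification is a homography fitted to four corner correspondences, solves for the 3x3 matrix with a small Gauss-Jordan elimination. It does not show how PTN regresses the corners, and the helper names and the 94x24 output size used in the usage note are illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def solve_linear(a: List[List[float]], b: List[float]) -> List[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_corners(src: Sequence[Point],
                            dst: Sequence[Point]) -> List[List[float]]:
    """Fit H (with H[2][2] = 1) mapping four src corners to four dst corners."""
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1.0, 0.0, 0.0, 0.0, -u * x, -u * y])
        rhs.append(u)
        rows.append([0.0, 0.0, 0.0, x, y, 1.0, -v * x, -v * y])
        rhs.append(v)
    h = solve_linear(rows, rhs)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def warp_point(H: List[List[float]], x: float, y: float) -> Point:
    """Apply the homography to one point (homogeneous divide included)."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```

For example, mapping the detected corners `[(10, 20), (110, 30), (105, 70), (5, 60)]` to the rectangle `[(0, 0), (94, 0), (94, 24), (0, 24)]` yields a matrix that sends each skewed corner to the matching corner of an upright 94x24 plate.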

2507.11075 2026-06-01 cs.CV cs.AI Updated version

Safeguarding Text-to-Image Generation via Inference-Time Prompt-Noise Optimization

Jiangweizhi Peng, Zhiwei Tang, Gaowen Liu, Charles Fleming, Mingyi Hong

Affiliations: University of Minnesota; Cisco Research; The Chinese University of Hong Kong, Shenzhen

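AI summary: Proposes Prompt-Noise Optimization (PNO), a training-free inference-time method that jointly optimizes the continuous prompt embedding and the injected noise trajectory to suppress unsafe image generation, achieving state-of-the-art performance and robustness to adversarial attacks.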

Abstract

Text-to-Image (T2I) diffusion models are widely recognized for their ability to generate high-quality and diverse images based on text prompts. However, despite recent advances, these models are still prone to generating unsafe images containing sensitive or inappropriate content, which can be harmful to users. Current efforts to prevent inappropriate image generation for diffusion models are easy to bypass and vulnerable to adversarial attacks. How to ensure that T2I models align with specific safety goals remains a significant challenge. In this work, we propose a novel, training-free approach, called Prompt-Noise Optimization (PNO), to mitigate unsafe image generation. Our method introduces a novel optimization framework that leverages both the continuous prompt embedding and the injected noise trajectory in the sampling process to generate safe images. Extensive numerical results demonstrate that our framework achieves state-of-the-art performance in suppressing toxic image generations and demonstrates robustness to adversarial attacks, without needing to tune the model parameters. Furthermore, compared with existing methods, PNO uses comparable generation time while offering the best tradeoff between the conflicting goals of safe generation and prompt-image alignment.
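
The idea of optimizing inputs at inference time while leaving model weights untouched can be shown with a deliberately tiny toy. Everything here is an assumption made for illustration: the "sampler" is just the sum of embedding and noise, the "unsafe score" is a logistic function of a fixed linear probe, and a quadratic penalty stands in for prompt-image alignment. None of these stand-ins come from the paper, and the real method differentiates through an actual diffusion sampling process.

```python
import math
from typing import List, Sequence, Tuple


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def toy_generate(embedding: Sequence[float],
                 noise: Sequence[float]) -> List[float]:
    """Stand-in for a frozen sampler: output = embedding + noise."""
    return [e + z for e, z in zip(embedding, noise)]


def toy_unsafe_score(image: Sequence[float], w: Sequence[float]) -> float:
    """Stand-in for a differentiable safety classifier, in (0, 1)."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, image)))


def optimize_prompt_and_noise(e0: Sequence[float], z0: Sequence[float],
                              w: Sequence[float], steps: int = 50,
                              lr: float = 0.5, lam: float = 0.1
                              ) -> Tuple[List[float], List[float]]:
    """Gradient descent on (embedding, noise); the 'model' w stays fixed.

    Objective: unsafe_score + lam * ||e - e0||^2, where the penalty keeps
    the optimized embedding close to the original prompt embedding.
    """
    e, z = list(e0), list(z0)
    for _ in range(steps):
        s = toy_unsafe_score(toy_generate(e, z), w)
        ds = s * (1.0 - s)  # derivative of the sigmoid w.r.t. its input
        grad_e = [ds * wi + 2.0 * lam * (ei - e0i)
                  for wi, ei, e0i in zip(w, e, e0)]
        grad_z = [ds * wi for wi in w]
        e = [ei - lr * gi for ei, gi in zip(e, grad_e)]
        z = [zi - lr * gi for zi, gi in zip(z, grad_z)]
    return e, z
```

The penalty weight `lam` plays the role of the trade-off the abstract mentions: a larger value keeps the result closer to the original prompt at the cost of a slower drop in the unsafe score.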

2410.15475 2026-06-01 cs.CV Updated version

Multimodal Fusion via Self-Consistent Task-Gradient Fields

Jiayu Xiong, Jing Wang, Jun Xue, Wanlong Wang, Jianlong Kwan, Xiaosen Lyu, Zhouqiang Jiang

Affiliations: Xiamen Key Laboratory of Computer Vision and Pattern Recognition, Huaqiao University, Xiamen, Fujian, China; Huaqiao University, Xiamen, Fujian, China; Wuhan University, Wuhan, Hubei, China; Nakashima Lab, SANKEN, The University of Osaka, Osaka, Japan

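AI summary: Proposes the Self-Consistent Field Autoencoder (SCFAE), which uses the self-consistent field principle to balance task learning with feature organization and separates features into complementary subspaces via task and reconstruction losses, robustly handling missing data and unequal inputs.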

Comments ICML 2026 accepted paper

Abstract

Multimodal learning aims to preserve as much task-related information as possible from different inputs. However, current fusion designs often distort the feedback loop to feature extractors. Aggressively merging modalities entangles their representations, making the feature extractors fragile to incomplete inputs. Meanwhile, attempting to separate features via auxiliary losses frequently introduces optimization conflicts that distract from the primary task. We propose the Self-Consistent Field Autoencoder (SCFAE) to provide a better path for task gradients. Our method follows the self-consistent field principle to balance task learning with feature organization, thereby minimizing mutual information. We use small autoencoders for each modality to keep information intact. The task loss acts as a driving force to select predictive features. The reconstruction loss acts as a constraint to separate these features into independent subspaces. These dual objectives operate through complementary feature subspaces, thereby mitigating optimization interference. We evaluate SCFAE on audio-visual-text, audio-visual, and image-video benchmarks. Results show that SCFAE handles missing data and unequal input sizes more robustly via a simple structure. Gradient analysis confirms that SCFAE avoids conflicts and maintains stable training dynamics.
