arXivDaily: arXiv Daily Academic Digest, updated Monday to Friday
2606.06520 2026-06-08 cs.CV cs.GR new submission

Applying Deep Learning for cockpit segmentation in the context of mixed reality

应用深度学习进行混合现实场景中的驾驶舱分割

Alexandre Leles Sousa, Pedro de Oliveira Nielson, Erick Oliveira Rodrigues, Rafael Francisco dos Santos, Giovani Bernardes Vitor

Affiliations: Laboratório de Robótica, Sistemas Inteligentes e Complexos - RobSIC; Instituto de Ciências Tecnológicas, Universidade Federal de Itajubá, Campus Itabira, MG; Universidade Tecnológica Federal do Paraná - UTFPR, Campus Pato Branco/PR

AI summary: Applies the U-net and DeepLabV3+ convolutional neural networks to segment cockpit images into foreground and background, easing the fusion of virtual and real imagery in mixed reality, and reports about 90% accuracy.

Comments XXV Congresso Brasileiro de Automática - CBA 2024

AI Chinese abstract

计算机视觉是一个持续发展的领域。随着第一人称视角技术的进步,该领域内出现了新的发展机遇。混合现实通过实时显示物理世界中的物体来促进虚拟环境。为此,必须关注用户在此模拟环境中的沉浸感,不断寻求使其更接近可能的期望现实。本文提出开发图像处理,以执行图像分割,识别前景和背景,从而便于虚拟和真实图像的融合。因此,本研究通过摄像头获取用户使用CAT793F非公路卡车模拟器的真实图像,利用人工智能对这些图像进行分割。应用了卷积神经网络架构“U-net”和“DeepLabV3+”来执行图像分割。结果显示,准确率约为90%,并确定了最佳模型。

English abstract

Computer vision is an area that has been growing continuously. With the advance of technologies with a first-person view, new development opportunities have emerged inside the area. Mixed reality promotes virtual environments with objects from the physical world shown in real time. For that, it's necessary to be concerned with the immersion of the user in this simulated environment, increasingly seeking to bring it closer to a possible desired reality. This paper proposes the development of image processing in order to perform the segmentation of images to identify what is foreground and background in order to facilitate the union of virtual and real images. Thus, the present work obtains real images of the user using the off-highway truck simulator CAT793F, through a camera, to be able to perform the segmentation of such images with artificial intelligence techniques. The convolutional neural network architectures "U-net" and "DeepLabV3+" are applied to perform image segmentation. As a result, metrics with around 90% accuracy were presented and the best model was determined.
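As an illustration of how a foreground/background mask could be scored, the sketch below computes pixel accuracy and foreground IoU for binary masks. It is a generic example under assumed conventions (1 = foreground, 0 = background). The function names and the toy masks are hypothetical, and the abstract does not say which metrics besides accuracy were reported.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels whose predicted label equals the ground truth."""
    total = 0
    correct = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            total += 1
            correct += int(p == t)
    return correct / total


def foreground_iou(pred, target):
    """Intersection over union of the foreground class (label 1)."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += int(p == 1 and t == 1)
            union += int(p == 1 or t == 1)
    # Two empty masks agree perfectly, so treat that case as IoU = 1.
    return inter / union if union else 1.0


# Toy 3x4 masks (assumed data, not taken from the paper).
target = [[0, 1, 1, 0],
          [0, 1, 1, 0],
          [0, 0, 0, 0]]
pred = [[0, 1, 1, 0],
        [0, 1, 0, 0],
        [0, 0, 1, 0]]

accuracy = pixel_accuracy(pred, target)
iou = foreground_iou(pred, target)
```

Accuracy alone can look high when the background dominates the frame, which is why a class-specific score such as IoU is often reported next to it.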

2606.06532 2026-06-08 cs.CV new submission

GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning

GOPAgen: 基于结构记忆与层次推理的运动感知高效智能长视频理解

Haozhe Chi, Yang Jin, Yadong Mu

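Affiliations: Peking University

AI summary: Proposes GOPAgen, which combines a motion agent trained on video-codec Groups of Pictures (GOPs), a GOP tree reasoning algorithm, and a structural memory mechanism for efficient long-video understanding, and reports leading performance on several VQA benchmarks.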

AI Chinese abstract

尽管在智能长视频理解方面取得了显著进展,现有方法仍然缺乏详细的运动理解以及高效的内存架构。在本文中,我们提出GOPAgen,一种新颖的方法,该方法首先通过精心设计的运动代理将视频编解码器集成到视频理解框架中,该代理基于视频编解码器中的图像组(GOP)进行训练。我们进一步开发了GOP树推理算法,该算法与视频编解码器自然对齐,增强了模型理解视频中局部细节运动的能力。此外,我们精心设计了一种结构记忆机制,将局部运动信息与结构页面中的详细描述相结合,并提出了一种高效的从粗到精的缩放算法,以充分利用结构记忆。此外,我们将运动矢量数据库纳入框架,以实现不同粒度运动矢量的高效检索。总体而言,我们的方法在各种视频理解基准(包括MotionBench和Egoschema)上取得了优越的视频问答(VQA)性能,从而证明了我们提出框架的优越性。

English abstract

Despite significant progress in agentic long video understanding, existing methods still lack detailed motion comprehension coupled with an efficient memory architecture. In this paper, we propose GOPAgen, a novel approach that first integrates video codec into the video understanding framework via a meticulously designed motion agent trained on Groups of Pictures (GOPs) from video codec. We further develop a GOP tree reasoning algorithm, which is naturally aligned with video codec and enhances the model's ability to understand local detailed motions in videos. Additionally, we carefully design a structural memory mechanism that integrates local motion information with detailed captions in structural pages, and propose an efficient coarse-to-fine zoom-in algorithm to fully exploit the structural memory. Furthermore, we incorporate a motion vector database into the framework to enable efficient retrieval of motion vectors at different granularities. Overall, our method achieves superior Video Question Answering (VQA) performance on various video understanding benchmarks, including MotionBench and Egoschema, thereby demonstrating the superiority of our proposed framework.

2606.06536 2026-06-08 cs.CV cs.AI cs.LG new submission

Attention-Guided Autoencoder Fusion for Insulator Defect Detection Using UAV Transmission-Line Imaging

基于注意力引导自编码器融合的无人机输电线路绝缘子缺陷检测

Malak Allam, Khaled Shaban, Ali Hamdi

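Affiliations: University of Science and Technology of China

AI summary: Proposes the AE-YOLO framework, which uses attention-guided autoencoder fusion and variance-maximizing regularization to address class imbalance and scale variation in insulator defect detection from UAV imagery; it reaches 95.10% mAP@0.5, 5 percentage points above the strongest YOLO baseline.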

AI Chinese abstract

高压输电线路绝缘子的自动缺陷检测仍然具有挑战性,原因在于无人机(UAV)图像中严重的类别不平衡、尺度变化大以及缺陷实例的空间范围小。为了解决这些问题,本文提出了AE-YOLO,一种注意力引导的自编码器增强型YOLO框架,用于鲁棒的绝缘子缺陷检测。该架构在特征金字塔网络-路径聚合网络(FPN-PAN)颈部集成了轻量级瓶颈自编码器,在多尺度特征融合过程中保留了异常敏感信息。整个骨干网络使用卷积块注意力模块(CBAM),增强了特征辨别能力并抑制了背景干扰。该框架还引入了一种方差最大化的自编码器正则化策略,鼓励生成多样化、缺陷判别性的潜在表示。网络使用统一目标进行训练,该目标结合了焦点损失、完全IoU(CIoU)损失和自编码器正则化,以解决前景-背景不平衡问题并提高定位精度。在推理过程中,加权框融合(WBF)结合了YOLOv8、YOLOv10和YOLO11的预测结果。自编码器引导的置信度提升机制提高了对罕见缺陷类别的敏感性。在绝缘子缺陷检测数据集上的实验表明,采用EfficientNetV2骨干网络的AE-YOLO在mAP@0.5上达到95.10%,精度为96.40%,召回率为93.80%。这一性能在mAP@0.5上比最强的YOLO系列基线高出5.0个百分点,在召回率上高出6.7个百分点。这些结果证实了该框架的有效性和适应性。该模型是基于UAV的输电线路巡检和缺陷监测的实用且可扩展的解决方案。

English abstract

Automated defect detection in high-voltage transmission-line insulators remains challenging due to severe class imbalance, large scale variation, and the small spatial extent of defect instances in Unmanned Aerial Vehicle (UAV) imagery. To address these challenges, this paper proposes AE-YOLO, an Attention-Guided AutoEncoder-Enhanced YOLO framework for robust insulator defect detection. The architecture integrates lightweight bottleneck autoencoders within a Feature Pyramid Network-Path Aggregation Network (FPN-PAN) neck. This preserves anomaly-sensitive information during multi-scale feature fusion. Convolutional Block Attention Modules (CBAM) are used throughout the backbone, enhancing feature discrimination and suppressing background interference. The framework also introduces a variance-maximizing autoencoder regularization strategy, which encourages diverse, defect-discriminative latent representations. The network trains using a unified objective that combines focal loss, Complete IoU (CIoU) loss, and autoencoder regularization to address foreground-background imbalance and improve localization accuracy. During inference, Weighted Boxes Fusion (WBF) combines predictions from YOLOv8, YOLOv10, and YOLO11. An autoencoder-guided confidence boosting mechanism improves sensitivity to rare defect categories. Experiments on the Insulator-Defect Detection dataset show that AE-YOLO with an EfficientNetV2 backbone achieves 95.10 percent mAP at 0.5, 96.40 percent precision, and 93.80 percent recall. This performance surpasses the strongest YOLO-family baseline by 5.0 points in mAP at 0.5 and 6.7 points in recall. These results confirm the effectiveness and adaptability of the framework. The model is a practical and scalable solution for UAV-based transmission-line inspection and defect monitoring.
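The focal loss term named in the abstract can be sketched for a single binary prediction as follows. This is the commonly used formulation, not the paper's implementation: the `gamma` and `alpha` defaults are widely used values rather than settings from the paper, and the function name is hypothetical.

```python
import math


def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one prediction.

    p: predicted probability of the positive class, y: label in {0, 1}.
    gamma and alpha are common defaults, not values taken from the paper.
    """
    p = min(max(p, eps), 1.0 - eps)       # keep log() finite
    p_t = p if y == 1 else 1.0 - p        # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # (1 - p_t) ** gamma shrinks the loss of examples that are already easy.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy_background = binary_focal_loss(0.05, 0)  # confident and correct
hard_defect = binary_focal_loss(0.05, 1)      # rare positive that was missed
```

The many easy background regions contribute almost nothing, so the missed defect dominates the gradient. That is the mechanism by which this loss addresses foreground-background imbalance.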

2606.06538 2026-06-08 cs.CV new submission

WorldBench: A Challenging and Visually Diverse Multimodal Reasoning Benchmark

WorldBench: 一个具有挑战性且视觉多样的多模态推理基准

Yida Yin, Harish Krishnakumar, Chung Peng Lee, Boya Zeng, Wenhao Chai, Shengbang Tong, Wenhu Chen, Hu Xu, Xingyu Fu, Gabriel Sarch, Aleksandra Korolova, Zhuang Liu

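Affiliations: Princeton University; NYU; University of Waterloo; Meta, FAIR

AI summary: Proposes WorldBench, which builds a multi-domain taxonomy of visual concepts, collects diverse images, and designs questions that frontier MLLMs fail to answer, in order to evaluate the visual understanding of multimodal large language models and expose their weaknesses.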

Comments Project page: https://worldbench-vl.github.io/

AI Chinese abstract

在现实世界应用中,模型被期望在不同设置下可靠地执行。然而,许多现有的多模态基准扩展了任务类型,但没有捕捉到处理开放视觉输入所需的视觉多样性。我们提出了WorldBench,一个具有挑战性且视觉多样的推理基准,用于评估多模态大语言模型(MLLM)。我们构建了一个跨多个领域(例如,生物)的数千个视觉概念的分类法。在该分类法的指导下,我们从搜索引擎和现有数据集中策划了一个广泛的图像集合,以全面代表视觉世界。通过结构化的试错,我们手动设计了前沿MLLM无法回答的具有挑战性的问题。在定量和人工评估中,WorldBench比任何现有的多样化基准实现了更高的视觉多样性。在WorldBench上评估15个MLLM揭示了视觉理解中的弱点:即使是最强的模型也只达到64.0%的准确率,而一些模型的表现略高于随机水平。我们希望我们的工作强调视觉多样性在构建多模态基准中的重要性。

English abstract

In real-world applications, models are expected to perform reliably across diverse settings. Yet, many existing multimodal benchmarks expand task types without capturing the visual diversity needed to handle open-ended visual inputs. We present WorldBench, a challenging and visually diverse reasoning benchmark to evaluate Multimodal Large Language Models (MLLMs). We build a taxonomy of thousands of visual concepts across multiple domains (e.g., living things). Guided by this taxonomy, we curate a broad collection of images from search engines and existing datasets to comprehensively represent the visual world. Through structured trial-and-error, we manually design challenging questions that frontier MLLMs fail to answer. On quantitative and human evaluations, WorldBench achieves higher visual diversity than any existing diverse benchmark. Evaluating 15 MLLMs on WorldBench reveals weaknesses in visual understanding: even the strongest model reaches only 64.0% accuracy, while some models perform marginally above chance-level. We hope our work highlights the importance of visual diversity in building multimodal benchmarks.

2606.06539 2026-06-08 cs.CV cs.AI cs.LG cs.NE new submission

Synthetic Benchmarks Overstate Forward-Forward Scaling: Real-Data Limits of Layer-Local Training

合成基准高估了前向-前向扩展:真实数据对逐层训练的限制

Yucheng Chen

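Affiliations: Amplimit

AI summary: Uses the DTG-FF method to evaluate how Forward-Forward learning scales on real data; finds that its gap to backpropagation widens as the class count grows, that synthetic tasks overstate its transferability, and that its memory advantage does not hold.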

Comments 23 pages, 6 figures

AI Chinese abstract

前向-前向(FF)学习[Hinton, 2022]用严格的逐层良好性更新取代了反向传播。最近的FF-CNN工作在32x32基准上缩小了与BP的差距,引发了逐层训练是否在现实规模下成为可行替代方案的问题。为了严格探究这一点,我们开发了DTG-FF——动态温度良好性、解耦归一化和多层融合——作为在九个真实数据基准上设定FF系列最先进水平的工具(CIFAR-10上91.8%,以及ImageNet-100 224x224上的首个FF基线),并用它来审计逐层训练实际能扩展到何种程度。(1)真实数据扩展。在相同配方和主干下,架构匹配的BP-DeepSup基线在CIFAR-10/CIFAR-100上分别超过DTG-FF 2.40/5.93个百分点,且差距随类别数增加而扩大。在224x224分辨率下,同一工具仅达到49.4%——这是该尺度下的首个FF基线,而典型BP超过75%[Tian et al., 2020]——暴露了在32x32下不可见的真实数据上限。(2)合成与真实K冲突。在合成教师-学生任务中,随着类别数K增长,DTG-FF越来越优于BP;而在真实图像上,FF-BP差距符号反转并随K扩大。数据集内CIFAR-100粗粒度与细粒度探针将标签层次与图像分布分离:合成K扫描将输出维度与细粒度判别难度混淆,从而高估了FF的可迁移性。(3)系统审计。FF可以在不存储深度激活的情况下实现,但在普通8 GB硬件上,标准BP+梯度累积达到4.18 GB / 157 imgs/s,而DTG-FF为7.90 GB / 138 imgs/s,因此在公平基线支持下,基于内存的理由在此规模下不成立。

English abstract

Forward-Forward (FF) learning [Hinton, 2022] replaces backpropagation with strictly layer-local goodness updates. Recent FF-CNN work has narrowed the gap to BP on 32x32 benchmarks, raising the question of whether layer-local training is becoming a viable alternative at realistic scale. To probe this rigorously, we develop DTG-FF -- dynamic temperature goodness, decoupled normalization, and multi-layer fusion -- as an instrument that sets FF-family state of the art across nine real-data benchmarks (91.8% CIFAR-10 and the first FF baseline at ImageNet-100 224x224), and use it to audit how far layer-local training actually scales. (1) Real-data scaling. Under identical recipe and backbone, an architecture-matched BP-DeepSup baseline beats DTG-FF by 2.40/5.93 pp on CIFAR-10/CIFAR-100, and the gap widens with class count. At 224x224 the same instrument reaches only 49.4% -- the first FF baseline at this scale, versus typical BP above 75% [Tian et al., 2020] -- exposing a real-data ceiling invisible at 32x32. (2) Synthetic vs. real K-conflict. DTG-FF increasingly outperforms BP as class count K grows on synthetic teacher-student tasks, yet on real images the FF-BP gap reverses sign and widens with K. A within-dataset CIFAR-100 coarse vs. fine probe isolates label-hierarchy from image distribution: synthetic K-sweeps confound output dimensionality with fine-grained discrimination difficulty and thereby overstate FF transferability. (3) Systems audit. FF can be implemented without storing depth-wide activations, but on commodity 8 GB hardware standard BP+gradient-accumulation reaches 4.18 GB / 157 imgs/s versus DTG-FF's 7.90 GB / 138 imgs/s, so a memory-based justification for FF at this scale is not supported under fair baselines.
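For readers unfamiliar with layer-local goodness, the sketch below shows the baseline Forward-Forward objective in its common form: goodness is the sum of squared activations of one layer, pushed above a threshold for positive data and below it for negative data. This is not DTG-FF. The dynamic temperature, decoupled normalization, and fusion components are not specified in the abstract, and the function names and threshold value are assumptions.

```python
import math


def goodness(activations):
    """Sum of squared activations of a single layer."""
    return sum(a * a for a in activations)


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ff_layer_loss(pos_activations, neg_activations, theta):
    """Layer-local loss: no gradient from any later layer is needed."""
    pos_term = softplus(theta - goodness(pos_activations))  # want goodness > theta
    neg_term = softplus(goodness(neg_activations) - theta)  # want goodness < theta
    return pos_term + neg_term


well_separated = ff_layer_loss([2.0, 2.0], [0.1, 0.1], theta=2.0)
swapped = ff_layer_loss([0.1, 0.1], [2.0, 2.0], theta=2.0)
```

Because each layer is trained against its own loss, the update needs only that layer's activations. This property is the basis of the memory argument that the abstract's systems audit examines.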

2606.06601 2026-06-08 cs.CV cs.AI cs.LG new submission

Direct 3D-Aware Object Insertion via Decomposed Visual Proxies

通过分解视觉代理实现直接3D感知物体插入

Jingbo Gong, Yikai Wang, Yushi Lan, Yuhao Wan, Ziheng Ouyang, Rui Zhao, Ming-Ming Cheng, Qibin Hou, Chen Change Loy

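Affiliations: Google; Black Forest Labs

AI summary: Proposes the DIRECT framework, which decomposes appearance, geometry, and context guidance to enable object insertion with controllable 3D pose, outperforming existing methods in geometric controllability and visual quality.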

Comments ICML 2026; Project Page: https://gong1130.github.io/DIRECT/

AI Chinese abstract

物体插入旨在将参考对象无缝合成到背景图像的指定区域。最近的基于扩散的方法实现了高视觉质量,但将插入视为简单的2D修复任务,无法显式控制对象的3D姿态,限制了其实用性。我们提出DIRECT(用于参考组合和目标集成的分解注入),一种新颖框架,将交互式姿态操作与高保真2D图像合成相结合,实现姿态可控的物体插入。我们的方法将插入条件分解为三个互补组件:从参考对象捕获视觉细节的外观引导、从用户调整的3D代理派生的几何引导以及来自目标背景的上下文引导。通过将它们注入到不同路径,DIRECT避免了特征纠缠,同时保留了参考外观、遵循用户指定的姿态并使对象适应目标场景。我们还引入了一个自动数据构建流程,以提高训练数据的多样性和质量。实验表明,DIRECT在几何可控性和视觉质量方面均优于先前方法。

English abstract

Object insertion aims to seamlessly composite a reference object into a specified region of a background image. Recent diffusion-based methods achieve high visual quality but formulate insertion as a simple 2D inpainting task, providing no explicit control over the object's 3D pose and limiting their practical applicability. We propose DIRECT (Decomposed Injection for Reference Composition and Target-integration), a novel framework that integrates interactive pose manipulation with high-fidelity 2D image synthesis to enable pose-controllable object insertion. Our method decomposes the insertion conditions into three complementary components: appearance guidance capturing visual details from the reference object, geometry guidance derived from the user-adjusted 3D proxy, and context guidance from the target background. By injecting them through separate pathways, DIRECT avoids feature entanglement and simultaneously preserves reference appearance, follows the user-specified pose, and adapts the object to the target scene. We also introduce an automated data construction pipeline to improve the diversity and quality of training data. Experiments show that DIRECT outperforms previous methods in both geometric controllability and visual quality.

2606.06631 2026-06-08 cs.CV new submission

From Pixels to Newtons: Predicting In Vivo Joint Contact Forces from Monocular Video

从像素到牛顿:从单目视频预测体内关节接触力

Jessy Lauer

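Affiliations: Rowland Institute at Harvard

AI summary: Proposes a physics-free pipeline that predicts 3D hip and knee contact forces from uncalibrated monocular video without markers, force plates, electromyography, subject-specific imaging, or a musculoskeletal model; a transformer fuses motion, shape, activity text, and self-supervised video tokens, and accuracy across 26 patients and 25 activities is comparable to subject-specific musculoskeletal simulation.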

AI Chinese abstract

关节接触力决定植入物寿命、软骨健康和康复效果,影响谁患骨关节炎、谁从关节置换中良好恢复以及谁受益于生物力学干预。然而,它们只能通过侵入性测量,在少数装有仪器的患者中进行。我提出一种无物理流水线,从非标定单目视频预测瞬时3D髋膝接触力:无需标记、力板、肌电图、个体成像或肌肉骨骼模型。每帧恢复参数化身体网格,编码为运动特征,并由变换器解码为力,其姿态流在每一层由身体形状、关节、侧别、活动文本和自监督视频令牌(V-JEPA 2)自适应调制,将髋和膝统一在单一模型中。在来自体内OrthoLoad数据库的26名患者和25个活动类别上的留一受试者交叉验证中,该流水线匹配个体化肌肉骨骼模拟的精度(髋部$0.32 \pm 0.08$ BW RMSE;膝部$0.23 \pm 0.03$ BW RMSE),并分辨出比步态再训练和骨关节炎进展报道的更小的峰值力变化。零样本应用于独立仪器化队列,它媲美或超越先前发表的方法。即使没有精心策划的活动标签,仅视频特征也能保持精度,并实现对原始视频的端到端推理。由预测器驱动,生成式运动先验产生生物力学合理的变体,降低峰值负荷,重新发现预测模拟文献中的策略。该流水线确立非标定单目视频作为估计关节负荷的可行模态,为回顾分析存档临床记录、初级保健筛查和家庭康复追踪开辟道路。

English abstract

Joint contact forces govern implant longevity, cartilage health, and rehabilitation outcomes, shaping who develops osteoarthritis, who recovers well from joint replacement, and who benefits from biomechanical interventions. Yet they remain measurable only invasively, in a few dozen patients with instrumented implants. I present a physics-free pipeline to predict instantaneous 3D hip and knee contact forces from an uncalibrated monocular video: no markers, force plates, electromyography, subject-specific imaging, or musculoskeletal model. Parametric body meshes are recovered per frame, encoded as kinematic features, and decoded into forces by a transformer whose pose stream is adaptively modulated at every layer by body shape, joint, side, activity text, and self-supervised video tokens (V-JEPA 2), unifying hip and knee in a single model. Under leave-one-subject-out cross-validation across 26 patients and 25 activity categories from the in vivo OrthoLoad database, the pipeline matches the accuracy of subject-specific musculoskeletal simulations ($0.32 \pm 0.08$ BW RMSE for hip; $0.23 \pm 0.03$ BW for knee) and resolves peak force changes smaller than those reported for gait retraining and osteoarthritis progression. Applied zero-shot to an independent instrumented cohort, it rivals or outperforms prior published methods. Even without curated activity labels, video features alone preserve accuracy and enable end-to-end inference on raw footage. Driven by the predictor, a generative motion prior produces biomechanically plausible variants with reduced peak loading, rediscovering strategies from the predictive simulation literature. This pipeline establishes uncalibrated monocular video as a viable modality for estimating joint loading, opening a path toward retrospective analysis of archived clinical recordings, primary-care screening, and at-home rehabilitation tracking.
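The evaluation protocol named in the abstract, leave-one-subject-out cross-validation with errors in body-weight (BW) units, can be sketched as below. This assumes BW means body mass times gravitational acceleration. The function names and the sample data are hypothetical and do not reproduce the paper's pipeline.

```python
import math


def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_indices, test_indices), holding out each subject once."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


def rmse_in_body_weight(pred_newtons, true_newtons, body_mass_kg, g=9.81):
    """RMSE of force predictions, normalized by the subject's body weight."""
    body_weight = body_mass_kg * g  # assumed definition of one BW, in newtons
    squared = [((p - t) / body_weight) ** 2
               for p, t in zip(pred_newtons, true_newtons)]
    return math.sqrt(sum(squared) / len(squared))


# One entry per recording; recordings of the same subject never straddle a split.
subjects = ["s1", "s1", "s2", "s3", "s2"]
folds = list(leave_one_subject_out(subjects))
```

Splitting by subject rather than by recording keeps a patient's data out of the training set when that patient is evaluated, so the reported error reflects generalization to unseen people.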

2606.06664 2026-06-08 cs.CV cs.AI cs.LG new submission

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

内在视觉:神经科学启发的概念电路用于解释和引导视觉变换器

Tang Li, Yanlin Chen, Mengmeng Ma, Xi Peng

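Affiliations: University of Science and Technology of China

AI summary: Proposes the ViSAE toolbox, which explains the inner workings of Vision Transformers through neuroscience-inspired concept circuits; it comprises an efficient concept set, an automatic circuit tracing algorithm, and concept editing applications, and improves worst-group accuracy on WaterBirds by 48.2%.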

Comments In Proceedings of the International Conference on Machine Learning, 2026. (acceptance rate 26.6%)

AI Chinese abstract

尽管视觉变换器(ViT)具有高准确率,但其预测可能受到虚假线索的驱动,因此在安全部署前需要理解其内部工作机制。稀疏自编码器(SAE)为将模型表示分解为人类可解释的概念提供了有前景的视角,但由于对概念覆盖范围的控制有限以及特征解释的主观性和不可扩展性,将基于SAE的解释方法应用于ViT仍然具有挑战性。为填补这些空白,受神经科学启发原理的驱动,我们提出了ViSAE,一个通过概念电路理解ViT内部工作机制的机械可解释性工具箱。ViSAE包含三个组成部分:(1)一个包含64K图像和16K视觉基础概念词汇的探测套件,与ImageNet相比,概念覆盖效率提高了20倍,与现有概念集相比,解释准确率提高了28.7%。(2)自上而下的概念读取和自下而上的电路追踪算法,通过概念电路自动恢复ViT内部工作机制。(3)用于审计和引导ViT行为的应用。通过概念编辑,ViSAE在WaterBirds上将最差组准确率提高了48.2%,比现有方法高出23.8%。我们的数据和代码:此 https URL。

English abstract

Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps, motivated by neuroscience-inspired principles, we propose ViSAE, a mechanistic interpretability toolbox for understanding ViT inner workings through concept circuits. ViSAE consists of three components: (1) A probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. (2) Top-down concept reading and Bottom-up circuit tracing algorithms that automatically recover ViT inner workings via concept circuits. (3) Applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves the worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%. Our data and code: https://github.com/deep-real/ViSAE.

2606.06666 2026-06-08 cs.CV new submission

Architecture-Adaptive Uncertainty Fusion for Deepfake Detection

面向深度伪造检测的架构自适应不确定性融合

Ritesh Sharma, Mohammad Ghasemigol, Yuichi Motai

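Affiliations: University of Tokyo; Nagoya University

AI summary: Proposes the Correlation-Optimized Fusion (COF) framework, which adaptively fuses five uncertainty sources by maximizing the Pearson correlation between the fused uncertainty score and prediction errors; it needs no model modification and only 42 seconds of optimization, and outperforms Random Forest under distribution shift.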

AI Chinese abstract

深度伪造检测系统在基准测试中达到近乎完美的准确率,但法医部署需要可靠的预测不确定性。现有的不确定性量化(UQ)方法依赖单一来源,忽略了最优不确定性组合因架构而异。我们提出相关性优化融合(COF),这是一种架构自适应框架,通过概率单纯形上的约束优化最大化融合不确定性分数与预测误差之间的皮尔逊相关性,融合五种互补的不确定性来源——认知、偶然、校准、共形和分布。COF无需模型修改,权重优化仅需42秒,而5模型深度集成需要20-45小时。在FaceForensics++上对11种架构的评估揭示了一个基本权衡:在匹配的训练/评估协议下,非线性方法在域内相关性上比COF高约5-6%(平均r=0.438),但在分布偏移下情况反转。在CelebDF上,COF在11种架构中的9种上优于随机森林,相关性高出高达7.3倍(MaxViT-B: r=0.249 vs. 0.034);RF跨域退化85%至r=0.071,而COF保留显著更多的信号(下降74%至r=0.116)。在CelebDF和DFDC上的跨数据集评估揭示了所有方法的灾难性泛化失败:域内相关性0.41-0.47在外部崩溃至接近零(平均退化90.7%),其中11种架构中有7种出现不确定性反转。这些结果确立了COF作为受控分布部署的实用、可解释框架,并指出域自适应UQ是法医部署的核心开放挑战。

English abstract

Deepfake detection systems achieve near-perfect accuracy on benchmarks, yet forensic deployment demands reliable prediction uncertainty. Existing uncertainty quantification (UQ) methods rely on single sources and ignore that optimal uncertainty composition varies across architectures. We propose Correlation-Optimized Fusion (COF), an architecture-adaptive framework that fuses five complementary uncertainty sources -- epistemic, aleatoric, calibration, conformal, and distributional -- by maximizing Pearson correlation between fused uncertainty scores and prediction errors via constrained optimization on the probability simplex. COF requires no model modifications and only 42 s of weight optimization, compared to 20--45 h for a 5-model Deep Ensemble. Evaluation across eleven architectures on FaceForensics++ reveals a fundamental trade-off: under matched train/evaluation protocol, non-linear methods achieve approximately 5--6% higher in-domain correlation than COF (mean r = 0.438), but this reverses under distribution shift. On CelebDF, COF outperforms Random Forest in 9/11 architectures with up to 7.3x higher correlation (MaxViT-B: r = 0.249 vs. 0.034); RF degrades 85% cross-domain to r = 0.071, whereas COF retains substantially more signal (74% drop to r = 0.116). Cross-dataset evaluation on CelebDF and DFDC reveals catastrophic generalization failure across all methods: in-domain correlations of 0.41--0.47 collapse to near-zero externally (mean degradation 90.7%), with seven of eleven architectures exhibiting uncertainty inversion. These results establish COF as a practical, interpretable framework for controlled-distribution deployment and identify domain-adaptive UQ as the central open challenge for forensic deployment.
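The core idea of fusing uncertainty sources with simplex weights that maximize Pearson correlation with prediction errors can be illustrated with a brute-force sketch. A coarse grid search stands in for the constrained optimizer, which the abstract does not specify. The function names, the two toy sources, and the grid resolution are all assumptions.

```python
import itertools
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def simplex_grid(k, steps):
    """Weight vectors of length k with entries j/steps that sum to 1."""
    for combo in itertools.product(range(steps + 1), repeat=k - 1):
        if sum(combo) <= steps:
            yield [c / steps for c in combo] + [(steps - sum(combo)) / steps]


def fit_fusion_weights(sources, errors, steps=10):
    """Pick the grid point whose fused score correlates best with the errors."""
    best_weights, best_r = None, -2.0
    for weights in simplex_grid(len(sources), steps):
        fused = [sum(w * s[j] for w, s in zip(weights, sources))
                 for j in range(len(errors))]
        r = pearson(fused, errors)
        if r > best_r:
            best_weights, best_r = weights, r
    return best_weights, best_r


# Toy data: source 0 tracks the errors exactly, source 1 is unrelated noise.
errors = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0]
sources = [list(errors), [0.3, 0.1, 0.9, 0.2, 0.8, 0.5]]
weights, r = fit_fusion_weights(sources, errors)
```

The fitted weights are directly readable as the share each source contributes, which is one sense in which a linear fusion of this kind is interpretable. With five sources, a gradient-based or projected optimizer would be more practical than a grid.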

2606.06671 2026-06-08 cs.CV new submission

JA-SIREN: Deterministic Initialization for Sinusoidal Networks via Spectral Matching

JA-SIREN:通过频谱匹配实现正弦网络的确定性初始化

Mohammed Alsakabi, Kejia Hu, John M. Dolan, Ozan K. Tonguz

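Affiliations: Department of Electrical and Computer Engineering, College of Engineering; The Robotics Institute, School of Computer Science; Carnegie Mellon University

AI summary: Proposes JA-SIREN, a deterministic initialization scheme that uses the Discrete Sine Transform and the Jacobi-Anger expansion to analytically match the network's initial spectrum to the target signal, eliminating randomness; it reaches 67.18 dB PSNR on the Kodak dataset, 21.30 dB above the best baseline.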

AI Chinese abstract

现有的隐式神经表示(INR)方法受随机初始化影响,无法保证跨运行的一致性或高质量性能,图像回归中的变化超过2.5 dB(78%)。这种变化对结果可重复性至关重要的科学计算和模拟来说是有问题的。为了解决这个问题,我们提出了Jacobi-Anger正弦表示网络(JA-SIREN),一种基于经典频谱分析的正弦网络确定性初始化方案。通过计算目标信号的离散正弦变换(DST)并利用Jacobi-Anger展开,我们为两层正弦MLP推导出闭式权重,该权重解析地将网络的初始频谱响应与目标信号匹配,无需随机种子或额外的超参数调整。在Kodak数据集上,JA-SIREN实现了67.18 dB的平均PSNR,比最佳基线提高了21.30 dB。这是以零运行间方差实现的,证实了频谱信息初始化是正弦INR中比随机初始化更有效且可重复的替代方案。

English abstract

Existing implicit neural representation (INR) approaches suffer from stochastic initialization that does not guarantee consistent or high-quality performance across runs, with variations reaching more than 2.5 dB (78%) in image regression. This variation is problematic for scientific computing and simulation, where result reproducibility is crucial. To address this problem, we present Jacobi-Anger Sinusoidal Representation Network (JA-SIREN), a deterministic initialization scheme for sinusoidal networks grounded in classical spectral analysis. By computing the Discrete Sine Transform (DST) of the target signal and leveraging the Jacobi-Anger expansion, we derive closed-form weights for a two-layer sinusoidal MLP that analytically match the network's initial spectral response to the target signal, requiring no random seed or additional hyperparameter tuning. On the Kodak dataset, JA-SIREN achieves a mean PSNR of 67.18 dB, a 21.30 dB improvement over the best baseline. This is achieved with zero run-to-run variance, confirming that spectrally-informed initialization is a more effective and reproducible alternative to stochastic initialization for sinusoidal INRs.
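For reference, the Jacobi-Anger expansion named in the abstract is the standard identity below. It shows why the spectrum of a sine applied to a sine is known in closed form. How the paper maps these coefficients onto network weights is not stated in the abstract, and the symbols $a$ and $\omega$ in the last line are illustrative only.

```latex
% Jacobi-Anger expansion, with J_n the Bessel function of the first kind
e^{i z \sin\theta} = \sum_{n=-\infty}^{\infty} J_n(z)\, e^{i n \theta}

% Real and imaginary parts, using J_{-n}(z) = (-1)^n J_n(z)
\cos(z \sin\theta) = J_0(z) + 2 \sum_{k=1}^{\infty} J_{2k}(z) \cos(2k\theta)
\sin(z \sin\theta) = 2 \sum_{k=0}^{\infty} J_{2k+1}(z) \sin\big((2k+1)\theta\big)

% Applied to a sine of a sine, with \theta = \omega x and z = a
\sin\big(a \sin(\omega x)\big) = 2 \sum_{k=0}^{\infty} J_{2k+1}(a) \sin\big((2k+1)\omega x\big)
```

The last line says that a second sinusoidal layer turns a single input frequency $\omega$ into its odd harmonics, weighted by Bessel coefficients. A spectrum of that form can be compared against a sine transform of the target signal.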

2606.06684 2026-06-08 cs.CV new submission

Adaptive Band Selection for Hyperspectral Classification with Spatially Disjoint Evaluation

面向空间分离评估的高光谱分类自适应波段选择

Ikram El-Hajri, Ouassim Karrakchou, Alejandro Mousist

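Affiliations: International University of Rabat, Rabat, Morocco; Thales Alenia Space, Spain

AI summary: Proposes SGBR-HC, which initializes trainable sparse gates from a supervised spectral ranking and determines the number of bands adaptively; under spatially disjoint evaluation it achieves the highest mean overall accuracy and kappa coefficient with about 20 bands.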

Comments 6 pages, 2 figures, 3 tables

AI Chinese abstract

基于可微选择器的高光谱波段选择方法可能对初始化和提取最终离散子集敏感,而预设的波段数量限制了灵活性。我们提出SGBR-HC(光谱组波段排序与硬混凝土初始化),一种两阶段方法,使用监督光谱排序来初始化可训练稀疏门,而不是将排序视为固定选择规则,让所选波段的数量由训练决定。第一阶段通过类别可分性和光谱多样性对训练像素的候选波段进行评分;该排序为第二阶段的门控逻辑值提供种子,第二阶段将稀疏门与空间分类器联合训练。在帕维亚大学和休斯顿2013数据集上进行空间分离评估,并通过在所选波段上重新训练新分类器进行验证,SGBR-HC以大约20个波段实现了最高的平均总体精度和Cohen's kappa。跳过第一阶段导致帕维亚大学的OA下降8.84个百分点,休斯顿2013下降22.15个百分点,证实了排序先验的作用。随机像素分割使帕维亚大学的OA膨胀30.56个百分点,强调了空间泄漏作为关键评估混淆因素。

English abstract

Hyperspectral band selection methods based on differentiable selectors can be sensitive to initialization and to extracting a final discrete subset, while prescribed band counts limit flexibility. We propose SGBR-HC (Spectral-Group Band Ranking with Hard-Concrete initialization), a two-stage method that uses a supervised spectral ranking to initialize trainable sparse gates rather than treating ranking as a fixed selection rule, letting the number of selected bands be determined by training. Stage-1 scores candidate bands from training pixels by class discriminability and spectral diversity; this ranking seeds the gate logits for Stage-2, which trains the sparse gates jointly with a spatial classifier. Under spatially disjoint evaluation on Pavia University and Houston 2013, verified by retraining a fresh classifier on the selected bands, SGBR-HC achieves the highest mean overall accuracy and Cohen's kappa with approximately twenty bands. Bypassing Stage-1 degrades OA by 8.84 pp on Pavia University and 22.15 pp on Houston 2013, confirming the ranking prior's role. Random pixel splits inflate OA on Pavia University by 30.56 pp, underscoring spatial leakage as a critical evaluation confound.
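A hard-concrete gate of the kind the method's name refers to can be sketched as follows, using the standard stretched-and-clamped sigmoid formulation. The mapping from Stage-1 ranking scores to initial logits (`init_logits_from_ranking`) is a hypothetical min-max rescaling chosen for illustration; the abstract does not describe the actual seeding rule, and the constants are the usual defaults rather than the paper's settings.

```python
import math

# Usual hard-concrete constants: stretch interval (GAMMA, ZETA) and temperature BETA.
GAMMA, ZETA, BETA = -0.1, 1.1, 2.0 / 3.0


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sample_gate(log_alpha, rng):
    """Stochastic gate in [0, 1], used while training; rng is a random.Random."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)
    s = sigmoid((math.log(u) - math.log(1.0 - u) + log_alpha) / BETA)
    return min(1.0, max(0.0, s * (ZETA - GAMMA) + GAMMA))


def deterministic_gate(log_alpha):
    """Gate value used at test time; exactly 0.0 means the band is dropped."""
    return min(1.0, max(0.0, sigmoid(log_alpha) * (ZETA - GAMMA) + GAMMA))


def init_logits_from_ranking(scores, scale=3.0):
    """Hypothetical seeding: rescale ranking scores to logits in [-scale, scale]."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) * 2.0 * scale - scale for s in scores]


logits = init_logits_from_ranking([0.9, 0.1, 0.5])
gates = [deterministic_gate(l) for l in logits]
selected_bands = [i for i, g in enumerate(gates) if g > 0.0]
```

Because the stretched sigmoid is clamped, a gate can reach exactly zero, so the number of surviving bands comes out of training instead of being fixed in advance.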

2606.06685 2026-06-08 cs.CV cs.GR new submission

RigPAPR: Rig-Based Animation of Static Neural Point Clouds from a Fixed-Viewpoint Video

RigPAPR:基于固定视角视频的静态神经点云绑定动画

Shichong Peng, Yanshu Zhang, Ke Li

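Affiliations: APEX Lab; School of Computing Science; Simon Fraser University

AI summary: Proposes RigPAPR, which drives a static neural point cloud with direct linear blend skinning, without mesh proxies or pose-dependent corrections; it reduces joint-boundary artifacts on synthetic and real data and improves novel-view PSNR by 3+ dB.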

Comments An overview video is available at https://youtu.be/up3BwRHYWG8

AI Chinese abstract

静态神经点云重建从姿态图像中高保真地捕捉主体。给定这样的重建,我们的目标是使其动画化,以跟随主体的单目固定视角驱动视频(无论是捕获的还是由图像到视频生成产生的),并恢复一个绑定的、可重新姿态的3D资产。现有方法通过直接线性混合蒙皮或网格代理来变形高斯溅射,两者在关节连接处都容易出现伪影,即使有逐基元的校正。我们将伪影追溯到表示:每个溅射携带一个在规范姿态中校准的个体形状,以与其邻居拼接。在刚性LBS下,每个溅射随其骨骼移动但不能弯曲,因此规范拼接在关节边界处断裂成间隙和尖峰。邻近注意力点渲染则没有逐基元的形状;每个像素在渲染时从变形基元的位置重新组合,因此表面自然地随关节运动重新形成。我们提出RigPAPR,它自动绑定静态PAPR点云,并通过单个固定视角视频在直接LBS下驱动它,无需网格代理、姿态依赖校正或类别模板。在合成主体上,RigPAPR在有监督视角下匹配最强基线,在新视角下超过基于网格和高斯溅射的基线3+dB PSNR,并在合成和真实主体上生成更干净的关节边界渲染。

English abstract

Static neural point reconstructions capture a subject at high fidelity from posed images. Given such a reconstruction, we aim to animate it to follow a monocular fixed-viewpoint driving video of the subject, whether captured or produced by image-to-video (I2V) generation, and to recover a rigged, re-posable 3D asset. Existing methods deform Gaussian splats through direct linear blend skinning (LBS) or mesh proxies, both of which are prone to joint-boundary artifacts under articulation, even with per-primitive corrections. We trace the artifact to the representation: each splat carries an individual shape calibrated in the canonical pose to tile with its neighbours. Under rigid LBS, each splat moves with its bone but cannot bend, so the canonical tiling breaks at joint boundaries into gaps and spikes. Proximity attention point rendering (PAPR) instead carries no per-primitive shape; each pixel is recomposed at render time from the deformed primitives' positions, so the surface re-forms naturally with the articulation. We present RigPAPR, which auto-rigs a static PAPR cloud and drives it under direct LBS from a single fixed-viewpoint video, without mesh proxy, pose-dependent correction, or category template. On synthetic subjects, RigPAPR matches the strongest baseline at the supervised view and exceeds mesh-based and Gaussian-splatting baselines at novel views by 3+dB PSNR, with cleaner joint-boundary renderings of both synthetic and real subjects.
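Linear blend skinning itself is simple to state: each point moves to the weighted average of where each bone's rigid transform would send it. The 2D sketch below illustrates only that formula. The two-bone setup, function names, and weights are assumptions, and nothing here reproduces RigPAPR's rigging or rendering.

```python
import math


def rigid_2d(angle, pivot):
    """Rotation by `angle` about `pivot`, returned as (R, t) with v' = R v + t."""
    c, s = math.cos(angle), math.sin(angle)
    px, py = pivot
    rotation = ((c, -s), (s, c))
    translation = (px - (c * px - s * py), py - (s * px + c * py))
    return rotation, translation


def blend_skin(point, weights, transforms):
    """LBS: v' = sum_i w_i * (R_i v + t_i), with weights summing to 1."""
    x, y = point
    out_x = out_y = 0.0
    for w, (rotation, translation) in zip(weights, transforms):
        out_x += w * (rotation[0][0] * x + rotation[0][1] * y + translation[0])
        out_y += w * (rotation[1][0] * x + rotation[1][1] * y + translation[1])
    return out_x, out_y


# Bone 0 stays fixed; bone 1 bends 90 degrees about a joint at (1, 0).
bones = [rigid_2d(0.0, (0.0, 0.0)), rigid_2d(math.pi / 2, (1.0, 0.0))]
fully_on_bone_1 = blend_skin((2.0, 0.0), (0.0, 1.0), bones)
near_the_joint = blend_skin((2.0, 0.0), (0.5, 0.5), bones)
```

Only point positions are transformed here. A primitive that also carries its own shape, such as a splat, would need that shape deformed consistently too, which is the difficulty at joint boundaries that the abstract describes.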

2606.06690 2026-06-08 cs.CV new submission

RPC-GS: Gaussian Splatting with native RPC Rendering for Satellite Imagery

RPC-GS:基于原生RPC渲染的卫星图像高斯泼溅

Valentin Wagner, Sebastian Bullinger, Christoph Bodensteiner, Michael Arens

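Affiliations: Fraunhofer Institute of Optronics, System Technologies and Image Exploitation

AI summary: Proposes RPC-GS, the first Gaussian Splatting framework that uses the RPC model natively; projecting Gaussian means and covariances directly avoids approximation error, and it achieves the lowest reconstruction error on satellite benchmark datasets.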

AI Chinese abstract

我们提出了RPC-GS,这是首个原生使用有理多项式相机(RPC)模型的卫星图像高斯泼溅框架。RPC模型是表示现代推扫式卫星传感器复杂成像几何的事实标准。为了简化渲染,先前的卫星高斯泼溅方法用透视或仿射相机近似替代RPC模型,导致重建过程中的几何误差。RPC-GS通过在泼溅过程中直接通过RPC模型投影高斯均值和协方差,避免了这些近似。我们将RPC模型嵌入一系列精心选择的地理坐标变换链中,该变换表示从适合泼溅的场景坐标到图像坐标的映射。为了映射高斯协方差矩阵,我们推导了基于数值稳健的雅可比协方差投影,用于(部分非线性的)坐标变换。由于RPC缺乏明确的相机深度概念,我们集成了基于度量射线的深度公式。我们在统一框架中对RPC、透视和仿射相机模型进行了基准测试,我们的原生RPC渲染器在领先的卫星基准数据集上始终实现最低的重建误差,在DFC2019上,平均高程误差比透视和仿射近似分别提高了29.6%和63.8%,在IARPA2016上分别提高了9.9%和37.9%。我们公开代码以支持卫星成像领域高斯泼溅的未来研究。

English abstract

We present RPC-GS, the first Gaussian Splatting framework for satellite imagery that operates natively with Rational Polynomial Camera (RPC) models. The RPC model is the de facto standard for representing the complex imaging geometry of modern pushbroom satellite sensors. To simplify rendering, prior satellite Gaussian Splatting methods replace the RPC model with perspective or affine camera approximations, leading to geometric errors during reconstruction. RPC-GS avoids these approximations by projecting Gaussian means and covariances directly through the RPC model during the splatting process. We embed the RPC model in a chain of carefully selected geo-coordinate transformations representing a mapping from splatting-suitable scene coordinates to image coordinates. To map the Gaussian covariance matrices, we derive a numerically robust Jacobian-based covariance projection for the (partially nonlinear) coordinate transformations. Since RPCs lack an explicit notion of camera depth, we integrate a metric ray-based depth formulation. We benchmark RPC, perspective, and affine camera models in a unified framework, with our native RPC renderer consistently achieving the lowest reconstruction error on leading satellite benchmark datasets, improving mean altitude error over perspective and affine approximations by 29.6% and 63.8% on DFC2019, and by 9.9% and 37.9% on IARPA2016. We release our code to support future research of Gaussian Splatting in the satellite imaging domain.
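Jacobian-based covariance projection rests on the first-order rule that a covariance $\Sigma$ maps to $J \Sigma J^\top$ under a nonlinear function with Jacobian $J$. The sketch below applies that rule to a toy rational function standing in for an RPC. The coefficients are made up, and the finite-difference Jacobian is an assumption chosen for brevity, not the paper's derivation.

```python
def toy_rational_projection(p):
    """Stand-in for an RPC: ratios of low-order polynomials (made-up coefficients)."""
    x, y, z = p
    denom = 1.0 + 0.01 * z
    return [(2.0 * x + 0.1 * z) / denom, (3.0 * y - 0.2 * z) / denom]


def numerical_jacobian(f, p, eps=1e-6):
    """Central-difference Jacobian of f at p, as a list of rows."""
    rows = len(f(p))
    jac = [[0.0] * len(p) for _ in range(rows)]
    for j in range(len(p)):
        hi, lo = list(p), list(p)
        hi[j] += eps
        lo[j] -= eps
        f_hi, f_lo = f(hi), f(lo)
        for i in range(rows):
            jac[i][j] = (f_hi[i] - f_lo[i]) / (2.0 * eps)
    return jac


def project_covariance(jac, cov):
    """First-order covariance propagation: J * cov * J^T."""
    m, n = len(jac), len(cov)
    jc = [[sum(jac[i][k] * cov[k][j] for k in range(n)) for j in range(n)]
          for i in range(m)]
    return [[sum(jc[i][k] * jac[j][k] for k in range(n)) for j in range(m)]
            for i in range(m)]


mean_3d = [1.0, 2.0, 0.0]
cov_3d = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
jacobian = numerical_jacobian(toy_rational_projection, mean_3d)
cov_2d = project_covariance(jacobian, cov_3d)
```

A real RPC uses ratios of cubic polynomials in normalized coordinates, for which an analytic Jacobian is available and would normally be preferred over finite differences.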

2606.06695 2026-06-08 cs.CV new submission

S23DR 2026 Winning Solution

S23DR 2026 获胜方案

Jan Skvrna, Miroslav Purkrabek, Lukas Neumann

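Affiliations: Visual Recognition Group; Czech Technical University in Prague

AI summary: Proposes a 3D wireframe reconstruction method based on a conditional set and a flow-matching DiT, with a coarse global prediction, local refinement, and a multi-sample consensus step; it achieves a leading score of HSS = 0.654 in the S23DR 2026 challenge.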

AI Chinese abstract

本文介绍了在S23DR 2026挑战中针对从稀疏SfM、拟合深度和语义分割进行结构化3D线框重建的获胜方案。该方法将顶点视为条件集,并使用以Perceiver风格场景令牌为条件的流匹配DiT对64个顶点令牌进行去噪。全局通道预测粗略结构,船体裁剪的第二通道对其进行细化,小规模的多采样一致性步骤确保随机采样器行为良好。最终系统在私有排行榜上排名第一,达到HSS = 0.654。

English abstract

This text presents the winning solution to the S23DR 2026 challenge for structured 3D wireframe reconstruction from sparse SfM, fitted depth, and semantic segmentations. The method treats vertices as a conditional set and denoises 64 vertex tokens with a flow-matching DiT conditioned on Perceiver-style scene tokens. A global pass predicts the coarse structure, a hull-cropped second pass refines it, and a small multi-sample consensus step keeps the stochastic sampler well behaved. The final system ranked first on the private leaderboard, achieving HSS = 0.654.

2606.06696 2026-06-08 cs.CV cs.AI new submission

MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

MMBU: 大规模多模态生物医学理解基准,用于探测视觉语言模型的感知能力

Ryan D'Cunha, Alejandro Lozano, Xiaoxiao Sun, Daniel Vela Jarquin, Min Woo Sun, Josiah Aklilu, James Burgess, Yuhui Zhang, Ryan Nayebi, Paola Avila, Robayo, Jin Ye, Ming Hu, Zhongying Deng, Junjun He, Xin Chen, Yue Yao, Robert Tibshirani, Jeffrey J. Nirschl, Serena Yeung-Levy

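Affiliations: Stanford University; University of Wisconsin–Madison; Instituto Tecnológico de Monterrey; Monash University; University of Cambridge; Shanghai Jiao Tong University; Shandong University

AI summary: Proposes the MMBU benchmark, covering 35 submodalities, which systematically evaluates the visual perception and generalization of VLMs in the biomedical domain through classification, grounding, and detection tasks; finds that high accuracy can mask perception deficiencies.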

详情
AI中文摘要

视觉和语言模型(VLM)在转变生物医学成像工作流程方面具有巨大潜力,从检测胸部X光片中的病变到显微镜下的细胞特征分析。然而,实现这一潜力需要稳健且细粒度的视觉感知。模型需要正确解释图像中的细微特征,并且必须在不同的生物医学模态、尺度和上下文中做到这一点。尽管如此,当前的基准仍然有限。为了解决这些差距,我们引入了大规模多模态生物医学理解(MMBU)基准。它是迄今为止最大的生物医学视觉和语言基准,涵盖35个子模态,具有丰富的结构化元数据。它包括开放和封闭版本的非接地分类、接地分类和物体检测,从而能够系统地评估模型在生物尺度、临床环境和成像模态上的性能。通过评估15个开源权重和2个前沿VLM,我们发现虽然医学适应为某些模型带来了可衡量的提升,但通常在高准确率报告中的表现可能掩盖了视觉感知和领域泛化方面的缺陷。

英文摘要

Vision and language models (VLMs) hold immense promise to transform biomedical imaging workflows, from detecting lesions in chest X-rays to profiling cellular features in microscopy. Realizing this potential, however, requires robust and fine-grained visual perception. Models need to correctly interpret subtle features in images, and they must do so across diverse biomedical modalities, scales, and contexts. Nevertheless, current benchmarks remain limited. To address these gaps, we introduce the Massive Multimodal Biomedical Understanding (MMBU) benchmark. It is the largest biomedical vision and language benchmark to date, covering 35 submodalities with rich structured metadata. It includes both open and closed versions of ungrounded classification, grounded classification, and object detection, enabling systematic evaluation of model performance across biological scales, clinical settings, and imaging modalities. Evaluating 15 open-weight and 2 frontier VLMs, we find that while medical adaptation provides measurable gains for some models, the high accuracy often reported on established benchmarks can mask deficiencies in visual perception and domain generalization.

2606.06714 2026-06-08 cs.CV 新提交

Anchored, Not Graded: Vision-Language Models Fail at Slant-from-Texture Perception

锚定而非分级:视觉-语言模型在纹理倾斜感知中失败

Qian Zhang, Michal Golovanevsky, Fulvio Domini, James Tompkin

发表机构 * Brown University(布朗大学) Harvard University(哈佛大学)

AI总结 研究视觉-语言模型(VLM)在纹理倾斜感知任务中的表现,发现零样本和上下文提示均产生锚定失败,仅预测少数离散角度,监督微调部分缓解但残留锚定,表明问题在于表示到输出的语言接口无法分级表达。

详情
AI中文摘要

人类从纹理感知表面倾斜时,会表现出系统性的、分级的偏差,这些偏差在心理物理实验中可靠地出现。先前的研究表明,无监督CNN再现了几种类人偏差,而有监督CNN则没有。视觉-语言模型(VLM)是否表现出类似的能力?在多个VLM家族和模型规模中,零样本和上下文提示都产生了独特的失败:倾斜仅在少量锚点(例如0°、±25°、±45°)处被预测,且几乎不依赖于刺激视场、光学倾斜或表面曲率。监督微调部分弥补了这种失败,但残留的锚定仍然存在。虽然高级视觉-语言基准测试的成功可能不需要对低级几何线索的敏感性,但我们将锚定解释为表示到输出语言接口的失败:不一定缺乏几何编码,而是无法以分级形式表达它。

英文摘要

Human perception of surface slant from texture exhibits systematic, graded biases that emerge reliably in psychophysical experiments. Prior work showed that unsupervised CNNs reproduce several human-like biases, while supervised CNNs do not. Do Vision-Language Models (VLMs) exhibit similar competences? Across multiple VLM families and model scales, zero-shot and in-context prompting both produce distinctive failures: slant is predicted at only a small set of anchors (e.g., 0°, ±25°, ±45°) with little dependence on stimulus field of view, optical slant, or surface curvature. Supervised fine-tuning partially remediates the failure, but residual anchoring persists. While success in high-level vision-language benchmarks might not require sensitivity to low-level geometric cues, we interpret anchoring as a failure at the representation-to-output language interface: not necessarily an absence of geometric encoding, but a failure to express it in a graded form.

2606.06760 2026-06-08 cs.CV 新提交

MedSIGHT: Towards Grounded Visual Comprehension in Medical Large Vision-Language Models

MedSIGHT:迈向医学大型视觉语言模型中的基础视觉理解

Aofei Chang, Le Huang, Alex James Boyd, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, Cao Xiao

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MedSIGHT框架,通过区域感知器、医学区域码本和渐进训练策略,统一医学视觉语言模型的语义理解和像素级分割,在72K数据上达到多模态理解与分割的SOTA。

Comments Accepted at ICML 2026

详情
AI中文摘要

医学大型视觉语言模型(Med-LVLMs)最近在视觉语言理解和医学图像分割方面取得了显著进展。然而,现有模型仍难以统一这两种能力,而这对于实现连接视觉发现与语义解释的临床推理至关重要。我们提出MedSIGHT,一个统一框架,赋予Med-LVLMs结构化的像素级理解能力,实现基础视觉理解。MedSIGHT引入了一个新颖的区域感知器模块,生成以区域为中心的标记,将空间信息直接编码到语言模型的表示空间中。我们进一步将医学区域码本引入LLM词汇表,使模型能够生成离散的区域代码,作为解剖和病理区域的符号表示。这些代码通过区域感知器解码以重建分割掩码,实现端到端的空间基础。最后,MedSIGHT使用我们提出的渐进训练策略,将区域感知器、码本和LLM组合起来,逐步稳定地对齐这些模块。仅在72K多模态指令对上训练,MedSIGHT在多种成像模态的医学理解和分割任务上均达到了最先进的性能。

英文摘要

Medical large vision-language models (Med-LVLMs) have recently achieved remarkable progress in vision-language comprehension and medical image segmentation. However, existing models still struggle to unify these two capabilities, which is essential for clinical reasoning that connects visual findings with semantic interpretation. We present MedSIGHT, a unified framework that equips Med-LVLMs with structured, pixel-level understanding for grounded visual comprehension. MedSIGHT introduces a novel Region Perceiver module that produces region-centric tokens, encoding spatial information directly into the representation space of the language model. We further introduce a medical region codebook into the LLM vocabulary, allowing the model to generate discrete region codes as symbolic representations of anatomical and pathological regions. These codes are decoded through the Region Perceiver to reconstruct segmentation masks, achieving end-to-end spatial grounding. Lastly, MedSIGHT combines the Region Perceiver, the codebook, and the LLM using our proposed progressive training strategy, which aligns these modules gradually and stably. Trained on only 72K multimodal instruction pairs, MedSIGHT achieves state-of-the-art performance across diverse imaging modalities on both medical comprehension and segmentation tasks.

2606.06813 2026-06-08 cs.CV cs.AI 新提交

Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation

打破锁定:通过表示调制实现文本到图像生成的多样化

Dahee Kwon, Haeun Lee, Jaesik Choi

发表机构 * KAIST(韩国科学技术院)

AI总结 针对文本到图像模型在固定提示下生成样本过于相似的问题,提出无训练表示级干预方法DAVE,通过选择性衰减早期生成中的零频空间平均分量来增强多样性,保持图像质量且计算开销极小。

Comments Accepted to ICML 2026. Code is available at: https://github.com/daheekwon/DAVE

详情
AI中文摘要

近期基于大规模Transformer骨干和流目标的文本到图像模型在文本-图像对齐和视觉质量方面表现出色,但在固定提示下常生成过于相似的样本。现有的多样性增强方法缓解了这一问题,但通常需要昂贵的采样或辅助优化,带来显著开销。为探究这种同质性的根本原因,我们检查了中间Transformer特征,观察到零频空间平均(DC)分量在生成早期快速收敛,导致早期轨迹锁定,限制了后续变化。基于此观察,我们提出DC衰减多样性增强(DAVE),一种无训练的表示级干预,选择性地在早期阶段衰减该分量。DAVE以可忽略的开销保留采样流程,在保持竞争性图像质量的同时,提高了提示一致性的多样性。

英文摘要

Recent text-to-image models built on large-scale Transformer backbones and flow-based objectives deliver strong text-image alignment and high visual quality, yet often produce overly similar samples under a fixed prompt. Existing diversity-enhancement methods alleviate this issue, but typically require expensive sampling or auxiliary optimization, incurring non-trivial overhead. To investigate the root cause of this homogeneity, we examine intermediate Transformer features and observe that the zero-frequency spatial average (DC) component rapidly converges across seeds early in generation, causing early trajectory lock-in that limits downstream variation. Building on this observation, we propose DC Attenuation for diVersity Enhancement (DAVE), a training-free representation-level intervention that selectively attenuates this component in the early regime. DAVE preserves the sampling pipeline with negligible overhead, improving prompt-consistent diversity while maintaining competitive image quality.
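
To make the idea of attenuating the zero-frequency (DC) component concrete, here is a minimal sketch. It assumes features are stored as a list of spatial tokens with per-channel values, and it scales the per-channel spatial mean by a factor `alpha` while leaving the remaining (non-DC) part untouched. The abstract does not state which layers are modified, the attenuation factor, or the step schedule, so `alpha`, `early_steps`, and both function names are hypothetical.

```python
def attenuate_dc(tokens, alpha):
    """Scale the per-channel spatial mean (DC) of the tokens by alpha.

    tokens: list of N spatial tokens, each a list of C channel values.
    alpha:  hypothetical attenuation factor; 1.0 leaves features unchanged,
            0.0 removes the DC component entirely.
    """
    n = len(tokens)
    c = len(tokens[0])
    dc = [sum(t[ch] for t in tokens) / n for ch in range(c)]
    return [[t[ch] - (1.0 - alpha) * dc[ch] for ch in range(c)] for t in tokens]


def should_attenuate(step, early_steps):
    """Hypothetical schedule: intervene only during the first `early_steps` steps."""
    return step < early_steps


features = [[1.0, 2.0], [3.0, 6.0]]
damped = attenuate_dc(features, alpha=0.5)
```

After the call, the spatial mean of `damped` is half the original mean, while the differences between tokens are preserved.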

2606.06819 2026-06-08 cs.CV 新提交

VideoSEG-O3: A Multi-turn Reinforcement Learning Framework for Reasoning Video Object Segmentation

VideoSEG-O3:用于推理视频对象分割的多轮强化学习框架

Ming Dai, Sen Yang, Boqiang Duan, Boyuan Tong, Jiedong Zhuang, Wankou Yang, Jingdong Wang

AI总结 提出VideoSEG-O3,首个多轮强化学习框架,通过多轮时空思维链和SEG感知logit校准,实现从粗到细的推理视频对象分割,解决复杂视频中的精确像素定位问题。

Comments ICML2026

详情
AI中文摘要

推理视频对象分割(RVOS)需要时间动态、空间细节和语言推理的复杂集成,以实现精确的像素级定位。现有方法局限于对固定初始输入进行推理,缺乏主动获取更多视觉证据的能力,而这对于解决长或复杂视频中的复杂引用通常至关重要。为了解决这个问题,我们提出了VideoSEG-O3,这是第一个用于RVOS的多轮强化学习框架,模拟人类的“从粗到细”认知过程。它采用多轮时空思维链,通过迭代定位关键区间和关键帧来捕获细粒度细节。此外,为了使策略在强化学习阶段能够感知超出[SEG]文本概率的分割质量,我们引入了SEG感知logit校准,将像素级分割反馈直接集成到令牌级logit中。此外,我们设计了一个解耦思考轨迹,将推理过程分层分解为时间、空间和语言维度,并构建了VTS-CoT,一个包含全面推理轨迹的专门冷启动数据集。代码和模型将在以下网址发布:https://github.com/Dmmm1997/VideoSEG-O3。

英文摘要

Reasoning Video Object Segmentation (RVOS) demands a sophisticated integration of temporal dynamics, spatial details, and linguistic reasoning to achieve precise pixel-level localization. Existing methods are limited to reasoning over fixed initial inputs and lack the capacity to actively acquire further visual evidence, which is often essential for resolving complex references in long or intricate videos. To address this, we propose VideoSEG-O3, the first multi-turn reinforcement learning framework for RVOS that emulates the human "coarse-to-fine" cognitive process. It employs a multi-turn temporal-spatial chain-of-thought to capture fine-grained details by iteratively pinpointing critical intervals and keyframes. Additionally, to enable the policy to perceive segmentation quality beyond mere text probability of [SEG] during the RL stage, we introduce SEG-aware logit calibration, which integrates pixel-wise segmentation feedback directly into the token-level logits. Furthermore, we design a decoupled thinking trace to hierarchically decompose the reasoning process into temporal, spatial, and linguistic dimensions, and construct VTS-CoT, a specialized cold-start dataset featuring comprehensive reasoning trajectories. The code and models will be released at https://github.com/Dmmm1997/VideoSEG-O3.

2606.06828 2026-06-08 cs.CV cs.LG 新提交

AdaGRPO: A Capability-Aware Adaptive Enhancement for Flow-based GRPO

AdaGRPO: 一种面向基于流的GRPO的能力感知自适应增强方法

Jiazi Bu, Pengyang Ling, Yujie Zhou, Yibin Wang, Yuhang Zang, Tianyi Wei, Xiaohang Zhan, Jiaqi Wang, Tong Wu, Xingang Pan, Dahua Lin

发表机构 * Shanghai Jiao Tong University(上海交通大学) S-Lab, Nanyang Technological University(南洋理工大学S实验室) Shanghai AI Laboratory(上海人工智能实验室) University of Science and Technology of China(中国科学技术大学) Stanford University(斯坦福大学) Shanghai Innovation Institute(上海创新研究院) The Chinese University of Hong Kong(香港中文大学) Fudan University(复旦大学) CPII under InnoHK(InnoHK下的CPII) Adobe Research(Adobe研究)

AI总结 提出AdaGRPO,通过在线课程过滤策略和跨层级优势融合,解决流模型GRPO中提示选择随机和优势估计缺乏全局视角的问题,提升训练稳定性和性能。

Comments Project Website: https://bujiazi.github.io/adagrpo.github.io/

详情
AI中文摘要

组相对策略优化(GRPO)在将文本到图像(T2I)流模型与人类偏好对齐方面取得了显著成功。然而,我们发现当前基于流的GRPO的学习循环与学习者的当前能力基本脱钩,在提示选择和优势估计方面存在关键盲点:(i)现有方法随机采样提示,忽视了数据选择对强化学习(RL)效能的重大影响——这一因素在大型语言模型的GRPO中被证明至关重要;(ii)它们仅依赖组内统计来评估样本质量,缺乏准确衡量真实策略改进的全局视角。为解决这些问题,我们提出了自适应GRPO(AdaGRPO),一种专为流模型设计的新型能力感知RL算法。具体而言,AdaGRPO由两个主要部分组成:(i)在线课程过滤策略:动态跟踪模型的能力,并自适应选择与其当前学习边界最匹配的提示;(ii)跨层级优势融合:协同整合细粒度组内优势与宏观全局优势,提供全面无偏的策略评估。作为轻量级即插即用模块,AdaGRPO可无缝集成到现有框架如Flow-GRPO、DanceGRPO和Flow-CPS中。大量实验表明,AdaGRPO持续推动性能提升,同时显著稳定流模型的GRPO训练。

英文摘要

Group Relative Policy Optimization (GRPO) has demonstrated remarkable success in aligning text-to-image (T2I) flow models with human preferences. However, we have identified that the learning loop of current flow-based GRPO is fundamentally decoupled from the learner's current capability, suffering from critical blind spots at both prompt selection and advantage estimation: (i) Existing methods sample prompts randomly, overlooking the substantial impact of data selection on reinforcement learning (RL) efficacy, a factor proven crucial in GRPO for large language models; (ii) They evaluate sample quality relying solely on intra-group statistics, lacking a global perspective to accurately measure true policy improvement. To address these issues, we propose Adaptive GRPO (AdaGRPO), a novel capability-aware RL algorithm tailored for flow models. Specifically, AdaGRPO consists of two principal components: (i) Online Curriculum Filtering Strategy: Dynamically tracks the model's proficiency and adaptively selects prompts that best match its current learning boundary; (ii) Cross-Level Advantage Fusion: Synergistically integrates fine-grained intra-group advantages with macro-level global advantages, providing a comprehensive and unbiased policy evaluation. As a lightweight, plug-and-play module, AdaGRPO can be seamlessly integrated with existing frameworks such as Flow-GRPO, DanceGRPO, and Flow-CPS. Extensive experiments demonstrate that AdaGRPO consistently drives performance gains while significantly stabilizing GRPO training for flow models.
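
The two components named in the abstract can be sketched roughly as follows. The intra-group advantage is the standard GRPO normalization of rewards within one prompt's sample group. Everything else is an assumption made for illustration, since the abstract gives no formulas: the global advantage is normalized against hypothetical running statistics `global_mean` and `global_std`, the fusion is a plain convex combination with a made-up weight `lam`, and `keep_prompt` is a guessed form of curriculum filtering based on the group's mean reward.

```python
import statistics


def group_advantages(rewards, eps=1e-8):
    """Standard group-relative advantage: normalize rewards within one group."""
    mu = statistics.fmean(rewards)
    sd = statistics.pstdev(rewards)
    return [(r - mu) / (sd + eps) for r in rewards]


def fused_advantages(rewards, global_mean, global_std, lam=0.5, eps=1e-8):
    """Hypothetical fusion: convex mix of intra-group and global advantages."""
    intra = group_advantages(rewards, eps)
    glob = [(r - global_mean) / (global_std + eps) for r in rewards]
    return [(1.0 - lam) * a + lam * g for a, g in zip(intra, glob)]


def keep_prompt(rewards, low=0.2, high=0.8):
    """Hypothetical curriculum filter: keep prompts whose mean reward is
    neither near-zero (too hard) nor near-one (already solved)."""
    return low <= statistics.fmean(rewards) <= high


adv = fused_advantages([0.0, 1.0], global_mean=0.5, global_std=0.5)
```

With a group of two samples and global statistics that happen to match the group statistics, both terms agree and the fused advantages are close to -1 and +1.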

2606.06850 2026-06-08 cs.CV 新提交

CFRNet: Cycle-Consistent Fixed-Point Training for Real-Time Blind Face Restoration on Consumer Embedded NPUs

CFRNet: 用于消费级嵌入式NPU上实时盲脸修复的循环一致不动点训练

Fuchen Li, Xinyang Wang, Yahui Zhang, Yuhan Chen, Jiahong Guo, Zhuohan Qin, Wenbo Ma

发表机构 * University of Florida(佛罗里达大学) University of Southampton(南安普顿大学) Chongqing University(重庆大学) Qingdao University(青岛大学) Intel Asia-Pacific Research & Development Ltd(英特尔亚太研发有限公司)

AI总结 提出CFRNet,一种2.0M参数的ResNet风格修复网络,通过循环一致不动点训练(CCFP)在消费级NPU上实现高质量盲脸修复,兼顾速度与效果,LPIPS比单次循环降低31%。

Comments 12 pages. Code and project page will be released

详情
AI中文摘要

消费设备上的盲脸修复必须在图像质量与速度和内存之间取得平衡。GFPGAN和CodeFormer等强方法提供了良好的感知质量,但它们依赖于大型预训练生成先验以及注意力、码本查找和风格调制等操作,这些操作难以在消费硬件中使用的小型神经处理单元(NPU)上编译和量化。小型卷积修复器运行速度足够快,但往往过度平滑,并在眼睛、鼻子和嘴巴周围留下伪影。我们提出了CFRNet,一个2.0M参数的ResNet风格修复器,用于在消费级NPU上常见的$256\times256$人脸裁剪尺寸的端侧使用。主要思想是循环一致不动点训练(CCFP)。我们不是训练网络进行单次前向传播然后手动多次运行,而是训练它作为一个不动点算子,使得对修复后的人脸再次应用该网络不会改变人脸。CCFP使用三种训练损失,即渐进式多周期监督、幂等损失和重新退化循环损失,并且在推理时不增加任何成本。为了在我们的部署限制下进行公平比较,我们在相同的$256\times256$分辨率下从头重新训练所有基线。在300张图像的测试集上,CFRNet达到了最佳感知分数(三次循环时LPIPS为0.250,比一次循环低31%),并且在两次循环时也达到了最佳PSNR和SSIM。在HiSilicon Hi3402 NPU上,它以INT8格式每次循环运行约23毫秒,而相同的基线无法编译到该芯片上。循环次数$k$作为一个简单的质量旋钮,无需重新训练:PSNR在$k=2$时最佳,LPIPS在$k=3$时持续改善。我们进一步表明,同样的思想适用于更易于部署的普通CNN,并在车载驾驶员监控板上实时运行模型。

英文摘要

Blind face restoration on consumer devices has to balance image quality against speed and memory. Strong methods such as GFPGAN and CodeFormer give good perceptual quality, but they rely on large pretrained generative priors and on operators such as attention, codebook lookup, and style modulation that are hard to compile and quantize on the small neural processing units (NPUs) used in consumer hardware. Small convolutional restorers run fast enough, but they tend to over-smooth and to leave artifacts around the eyes, nose, and mouth. We present CFRNet, a 2.0M-parameter ResNet-style restorer for on-device use at $256\times256$, the common face-crop size on consumer NPUs. The main idea is Cycle-Consistent Fixed-Point Training (CCFP). Instead of training the network for one pass and then running it several times by hand, we train it to act as a fixed-point operator, so that applying it again to a restored face does not change the face. CCFP uses three training losses, namely progressive multi-cycle supervision, an idempotence loss, and a re-degradation cycle loss, and it adds no cost at inference. To compare fairly under our deployment limits, we retrain all baselines from scratch at the same $256\times256$ resolution. On a 300-image test set, CFRNet reaches the best perceptual score (LPIPS 0.250 at three cycles, which is 31% lower than one cycle) and also the best PSNR and SSIM at two cycles. It runs in about 23 ms per cycle in INT8 on a HiSilicon Hi3402 NPU, while the same baselines cannot be compiled to that chip. The cycle count $k$ acts as a simple quality knob that needs no retraining: PSNR is best at $k=2$ and LPIPS keeps improving up to $k=3$. We further show that the same idea works with a plain CNN that is even easier to deploy, and we run the model in real time on an in-car driver-monitoring board.
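
The fixed-point idea and the cycle count $k$ can be sketched with toy operators. This is only an illustration under assumptions: the abstract names an idempotence loss but does not give its exact form, so the mean absolute difference between one and two applications is a guess, and `clamp01` and `halve` are hypothetical stand-ins for a restoration network.

```python
def run_cycles(restorer, x, k):
    """Apply the same restorer k times; k is the inference-time quality knob."""
    for _ in range(k):
        x = restorer(x)
    return x


def idempotence_loss(restorer, x):
    """Assumed form: mean |f(f(x)) - f(x)|; zero when f(x) is a fixed point of f."""
    once = restorer(x)
    twice = restorer(once)
    return sum(abs(a - b) for a, b in zip(twice, once)) / len(once)


def clamp01(x):
    """Toy operator that is idempotent: clamping twice equals clamping once."""
    return [min(1.0, max(0.0, v)) for v in x]


def halve(x):
    """Toy operator that is not idempotent: every pass changes the output."""
    return [0.5 * v for v in x]


loss_fixed = idempotence_loss(clamp01, [-0.5, 0.3, 1.7])
loss_drift = idempotence_loss(halve, [1.0, 2.0])
three_passes = run_cycles(halve, [8.0], 3)
```

An operator with zero idempotence loss can be cycled any number of times without drifting, which is the property that makes repeated application safe at inference.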

2606.06853 2026-06-08 cs.CV cs.AI 新提交

MotionEnhancer: Leveraging Video Diffusion for Motion-Enhanced Vision-Language Models

MotionEnhancer: 利用视频扩散模型增强运动感知的视觉-语言模型

Yifan Xu, Chao Zhang, Ruifei Ma, Fei Gao, Zhifei Yang, Jiaxing Qi, Zhipeng Chen

发表机构 * School of Computer Science and Engineering, Beihang University(北航计算机科学与工程学院) Beijing Digital Native Digital City Research Center(北京数字原生数字城研究中心) School of Computer Science, Peking University(北京大学计算机学院) School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院)

AI总结 提出MotionEnhancer,通过从视频扩散模型中提取运动先验并利用注意力对齐增强视觉-语言模型的运动理解能力,无需额外参数或架构修改,在运动级视频理解基准上取得一致提升。

Comments Accepted by CVPR 2026

详情
AI中文摘要

新时代见证了视觉-语言模型(VLM)在视频理解任务中的显著能力扩展。虽然当前的VLM在事件或故事级别的理解上表现出色,但它们捕捉细粒度运动细节的能力仍然有限,这主要是由于它们关注高层静态语义结构和宏观事件逻辑。相比之下,视频扩散模型(VDM)擅长建模动态运动模式,得益于大规模视频数据和时序生成的内在需求。在本文中,我们介绍了MotionEnhancer,一种新颖的方法,它利用从强大视频扩散模型中提取的运动先验作为辅助监督,通过注意力对齐增强VLM的运动理解能力。MotionEnhancer包含两个简单的无参数模块:运动敏感头选择(MHS)和运动显著文本标记识别(MTTI),以仅计算的方式直接从VDM中提取和优化与运动相关的注意力。MotionEnhancer为运动理解提供了可扩展的解决方案,无需额外的训练参数、修改现有架构或工具调用。大量实验表明,在两个运动级视频理解基准上,MotionEnhancer能够在最先进的VLM上实现一致的改进,尤其是在运动相关指标上。

英文摘要

Recent work has markedly extended the capabilities of Vision-Language Models (VLMs) in video understanding tasks. While current VLMs excel at event- or story-level understanding, their ability to capture fine-grained motion details remains limited, primarily due to their focus on high-level static semantic structures and macro-event logic. In contrast, Video Diffusion Models (VDMs) are adept at modeling dynamic motion patterns, benefiting from large-scale video data and the intrinsic requirement of temporal generation. In this paper, we introduce MotionEnhancer, a novel approach that leverages motion priors distilled from a powerful video diffusion model as auxiliary supervision to enhance the motion understanding capability of a VLM via attention alignment. MotionEnhancer comprises two simple parameter-free modules, Motion-sensitive Head Selection (MHS) and Motion-salient Text Token Identification (MTTI), to directly extract and optimize motion-related attentions from the VDM in a computation-only manner. MotionEnhancer provides a scalable solution for motion understanding without additional training parameters, modifications to existing architectures, or tool calling. Extensive experiments demonstrate that MotionEnhancer can achieve consistent improvements over state-of-the-art VLMs on two motion-level video understanding benchmarks, especially on motion-related metrics.

2606.06856 2026-06-08 cs.CV 新提交

FS-DVS: A Frequency-Selective Dynamic Visual Sensing Paradigm for Enhancing Information Completeness

FS-DVS:一种增强信息完整性的频率选择性动态视觉传感范式

Feiyu Ji, Xiaokang Yang, Xiaoyun Yuan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出FS-DVS范式,通过在事件触发前集成可学习空间滤波器模拟视网膜神经节细胞聚合机制,自发学习中心-环绕模式以增强中频信息,在目标检测和动作识别中取得显著性能提升。

详情
AI中文摘要

动态视觉传感器(DVS)通过异步报告像素级强度变化,提供卓越的时间分辨率和动态范围。然而,传统DVS依赖每像素独立触发机制,忽略了生物视网膜神经节细胞(RGC)执行的空间整合。因此,它们缺乏对比度敏感函数(CSF)及其对中空间频率的固有敏感性,这不可避免地因亚阈值信号丢失而导致信息不完整。为弥补这一差距,我们提出FS-DVS(频率选择性动态视觉传感器),一种新颖范式,它在事件触发过程之前严格集成一个可学习空间滤波器,以模拟RGC聚合机制。通过开发可微分事件模拟框架,空间滤波器可以与下游任务进行端到端优化。我们的研究揭示,从δ函数开始,学习到的空间滤波器自发演变为强调中频分量的中心-环绕模式,与人类CSF一致。除了在目标检测和动作识别中实现显著的性能提升外,不同任务中向类人CSF特性的一致收敛强调了这种中频选择性机制的普遍性。与单纯提高传感器灵敏度或依赖后处理相比,我们的范式实现了具有高噪声鲁棒性的选择性信息增强,为下一代神经形态传感器提供了稳健且生物合理的蓝图。

英文摘要

Dynamic vision sensors (DVS) offer exceptional temporal resolution and dynamic range by asynchronously reporting pixel-level intensity changes. However, conventional DVS rely on a per-pixel independent triggering mechanism, ignoring the spatial integration performed by biological retinal ganglion cells (RGCs). Consequently, they lack the contrast sensitivity function (CSF) and its inherent sensitivity to mid-spatial frequencies, which inevitably leads to information incompleteness due to sub-threshold signal loss. To bridge this gap, we propose FS-DVS (Frequency-Selective Dynamic Vision Sensor), a novel paradigm that integrates a learnable spatial filter strictly preceding the event triggering process to mimic the RGC aggregation mechanism. By developing a differentiable event simulation framework, the spatial filter can be optimized end-to-end with downstream tasks. Our study reveals that starting from a delta function, the learned spatial filters spontaneously evolve into center-surround patterns that emphasize mid-frequency components, consistently aligning with human CSF. Beyond achieving substantial performance gains in object detection and action recognition, the consistent convergence to human-like CSF characteristics across different tasks underscores the universality of this mid-frequency selective mechanism. Compared to naively increasing sensor sensitivity or relying on post-processing, our paradigm achieves selective information enhancement with high noise resilience, providing a robust, biologically plausible blueprint for next-generation neuromorphic sensors.
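
The effect of placing a spatial filter before event triggering can be sketched in one dimension. This is a simplified, frame-based illustration under assumptions: real DVS pixels fire asynchronously, the paper's filter is learned end-to-end, and the kernel values and threshold below are made up. A delta kernel reproduces conventional per-pixel triggering, while a center-surround kernel lets a weak local change cross the threshold.

```python
def filter_1d(signal, kernel):
    """Same-size correlation with zero padding; kernel length must be odd."""
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - r
            if 0 <= idx < len(signal):
                acc += w * signal[idx]
        out.append(acc)
    return out


def events(prev_log, curr_log, kernel, threshold):
    """Filter both log-intensity frames, then threshold the change.

    Returns +1 (ON), -1 (OFF), or 0 (no event) per pixel.
    """
    a = filter_1d(prev_log, kernel)
    b = filter_1d(curr_log, kernel)
    out = []
    for p, c in zip(a, b):
        d = c - p
        out.append(1 if d >= threshold else -1 if d <= -threshold else 0)
    return out


prev = [0.0, 0.0, 0.0, 0.0, 0.0]
curr = [0.0, 0.0, 0.15, 0.0, 0.0]           # sub-threshold change at one pixel
delta_kernel = [0.0, 1.0, 0.0]              # conventional per-pixel triggering
center_surround = [-0.5, 2.0, -0.5]         # hypothetical hand-picked kernel

plain = events(prev, curr, delta_kernel, threshold=0.2)
filtered = events(prev, curr, center_surround, threshold=0.2)
```

With the delta kernel the 0.15 change stays below the 0.2 threshold and no event fires. With the center-surround kernel the center response is amplified and a single ON event is produced at that pixel.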

2606.06864 2026-06-08 cs.CV cs.LG 新提交

LRMIL: Efficient Low-Resolution Multiple Instance Learning via High-Resolution Knowledge Distillation for Whole Slide Image Classification

LRMIL: 通过高分辨率知识蒸馏实现全切片图像分类的高效低分辨率多实例学习

Yonghan Shin, Won-Ki Jeong

发表机构 * Department of Computer Science and Engineering, Korea University, Seoul, Korea(高丽大学计算机科学与工程系,韩国首尔)

AI总结 提出LRMIL框架,通过两阶段知识蒸馏将高分辨率知识迁移到低分辨率表示,在推理时仅使用低分辨率图像块,显著降低计算成本并提升分类性能。

详情
AI中文摘要

多实例学习(MIL)已成为数字病理学中全切片图像(WSI)分析的标准范式,因为它无需密集标注即可实现切片级预测。现有的MIL方法通常依赖于高分辨率图像块的详尽提取和编码。然而,这种做法在真实临床环境中存在两个关键限制:难以在较低放大倍数下捕获全局视觉线索,并且由于每张切片包含大量高分辨率图像块而导致巨大的计算开销。为了解决这些限制,我们提出了一种高效的低分辨率多实例学习(LRMIL)框架,该框架将高分辨率知识迁移到低分辨率表示。LRMIL采用两阶段蒸馏策略。首先,图像块级别的跨分辨率蒸馏将低分辨率图像块嵌入与高分辨率表示对齐。其次,切片级知识蒸馏在切片级监督和教师指导下训练低分辨率学生MIL模型。在推理时,LRMIL仅处理低分辨率图像块,大幅减少了数据预处理和计算成本。在多个WSI基准上的大量实验表明,LRMIL在实现更高效推理的同时,始终优于最先进的MIL方法。这些结果凸显了LRMIL作为临床病理学中WSI分析的实用且可扩展的解决方案。

英文摘要

Multiple instance learning (MIL) has become a standard paradigm for whole slide image (WSI) analysis in digital pathology, as it enables slide-level prediction without dense annotations. Existing MIL methods typically rely on exhaustive extraction and encoding of high-resolution patches. However, this practice suffers from two critical limitations in real-world clinical settings: it struggles to capture global visual cues at lower magnifications, and incurs substantial computational overhead due to the massive number of high-resolution patches per slide. To address these limitations, we propose an efficient low-resolution multiple instance learning (LRMIL) framework that transfers high-resolution knowledge to low-resolution representations. LRMIL adopts a two-stage distillation strategy. First, patch-level cross-resolution distillation aligns low-resolution patch embeddings with high-resolution representations. Second, slide-level knowledge distillation trains a low-resolution student MIL model under both slide-level supervision and teacher guidance. At inference time, LRMIL operates exclusively on low-resolution patches, substantially reducing data preprocessing and computational cost. Extensive experiments on multiple WSI benchmarks demonstrate that LRMIL consistently outperforms state-of-the-art MIL methods while achieving more efficient inference. These results highlight LRMIL as a practical and scalable solution for WSI analysis in clinical pathology.

2606.06867 2026-06-08 cs.CV 新提交

Multi-FRuGaL: Multimodal Flexible Redundancy-aware Decomposed Gated Learning for Cancer Diagnosis and Prognosis

Multi-FRuGaL:面向癌症诊断与预后的多模态灵活冗余感知分解门控学习

Sanket Kachole, Siddhesh Thakur, Shubham Innani, Sanyukta Adap, Suhang You, Carla Pitarch-Abaigar, Spyridon Bakas

发表机构 * Division of Computational Pathology, Department of Pathology and Laboratory Medicine, Indiana University School of Medicine(计算病理学部,病理学与实验室医学部,印第安纳大学医学院) IU Melvin and Bren Simon Comprehensive Cancer Center(印第安纳大学Melvin和Bren Simon综合癌症中心) Departments of Biostatistics and Health Data Science(生物统计学与健康数据科学部) Radiology and Imaging Sciences(放射学与影像科学部) Neurological Surgery(神经外科) Indiana University School of Medicine(印第安纳大学医学院) Department of Computer Science, Luddy School of Informatics, Computing, and Engineering(计算机科学部,Luddy信息、计算与工程学院)

AI总结 提出Multi-FRuGaL框架,通过分解感知自适应门控中间融合,在缺失模态下学习模态级表示,分离冗余与互补信号,提升癌症诊断与预后性能。

详情
AI中文摘要

现代医学依赖于涵盖放射学、病理学、文本报告和结构化临床信息的异构数据源。然而,真实世界的患者数据常常不完整,存在缺失或稀疏获取的模态,限制了标准多模态融合方法的有效性。为此,我们提出了多模态灵活冗余感知分解门控学习(Multi-FRuGaL)框架,这是一种分解感知的自适应门控中间融合框架,可在数据缺失下执行模态级表示学习。Multi-FRuGaL 集成了每个模态的编码器、信号分解层、输入条件门控网络和信息感知融合目标,以将冗余信号与模态特异性互补信号分离,选择性地提升信息丰富的模态并抑制冗余或噪声输入,即使在多个模态缺失时也能保持良好定义。我们在两个多模态头颈癌队列上评估了 Multi-FRuGaL:HANCOCK 挑战数据集(N = 763),包含五种模态和两个预后终点(5年生存率和2年复发率);以及 HECKTOR 挑战数据集(N = 588),包含三种模态用于人乳头瘤病毒(HPV)状态分类。Multi-FRuGaL 在多个任务上始终比评估的基线方法获得更高的平均性能,将生存预测的 AUC 从 0.601 提高到 0.8496,复发预测的 AUC 从 0.672 提高到 0.8102,并在 HECKTOR 上实现 HPV 预测的 AUC 为 0.975。对于生存分析,它在 HANCOCK 上进一步实现了总生存期的 C-index 为 0.6814,无复发生存期为 0.7421,无进展生存期为 0.7143,在 HECKTOR 上无复发生存期为 0.7203。定性分析进一步表明,即使在严重缺失模态条件下,Multi-FRuGaL 也能学习到判别性和鲁棒的多模态表示。

英文摘要

Modern medicine relies on heterogeneous data sources spanning radiology, pathology, text reports, and structured clinical information. However, real-world patient data are frequently incomplete, with missing or sparsely acquired modalities, limiting the effectiveness of standard multimodal fusion approaches. To this end, we propose the Multimodal Flexible Redundancy-aware decomposed GAted Learning (Multi-FRuGaL) framework, a decomposition-aware, adaptive gated intermediate-fusion framework that performs modality-level representation learning under missing data. Multi-FRuGaL integrates per-modality encoders with a signal decomposition layer, an input-conditioned gating network, and an information-aware fusion objective to separate redundant from modality-specific complementary signals, selectively upweighting informative modalities and suppressing redundant or noisy inputs, and remaining well-defined even when multiple modalities are absent. We evaluate Multi-FRuGaL on two multimodal head and neck cancer cohorts: the HANCOCK challenge dataset (N = 763) comprising five modalities and two prognostic endpoints (5-year survival and 2-year recurrence), and the HECKTOR challenge dataset (N = 588) comprising three modalities for human papillomavirus (HPV) status classification. Multi-FRuGaL consistently achieves higher mean performance than the evaluated baselines across multiple tasks, improving AUC from 0.601 to 0.8496 for survival, from 0.672 to 0.8102 for recurrence, and achieving 0.975 AUC for HPV prediction on HECKTOR. For survival analysis, it further achieves a concordance index of 0.6814 for overall survival, 0.7421 for recurrence-free survival, and 0.7143 for progression-free survival on HANCOCK, and 0.7203 for recurrence-free survival on HECKTOR. Qualitative analyses further show that Multi-FRuGaL learns discriminative and robust multimodal representations, even under severe missing-modality conditions.

2606.06872 2026-06-08 cs.CV cs.AI 新提交

EgoPressDiff: Multimodal Video Diffusion for Egocentric UV-Domain Hand-Pressure Estimation

EgoPressDiff: 用于自我中心UV域手部压力估计的多模态视频扩散模型

Yuan Zeng, Zilue Gao, Yujia Shi, Zongqing Lu, Wenming Yang, QingMin Liao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出EgoPressDiff,一种条件视频扩散框架,通过多模态条件策略(手部姿态、3D网格顶点和深度信息)从视觉输入生成UV压力图,解决了现有方法中的量化误差和时间不一致问题,在EgoPressure数据集上实现SOTA,Volumetric IoU相对提升34%以上。

Comments Accepted to IEEE ICASSP 2026

详情
AI中文摘要

从自我中心视角估计手部表面接触压力对于AR/VR设备、机器人模仿和人体工程学分析至关重要。现有方法通常对压力信号进行离散化并独立处理帧,导致量化误差和时间不一致性。我们提出EgoPressDiff,一种条件视频扩散框架,从视觉输入生成UV压力图。我们方法的核心是一种多模态条件策略,引入PoseNet和顶点编码器,从手部姿态和3D网格顶点中高效提取特征。这些信号与深度信息一起,指导生成过程以确保压力场在物理上是合理的。为了有效融合这些异构特征,我们进一步提出分布校准空间层,在组合前对齐其统计特性。在EgoPressure自我中心视图设置上的评估表明,EgoPressDiff实现了最先进的结果,Volumetric IoU相对先前基线提升超过34%,同时降低MAE并保持高时间精度。我们的项目页面位于 https://egopressdiff.github.io/。

英文摘要

Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. Existing methods often discretize the pressure signal and process frames independently, leading to quantization errors and temporal inconsistencies. We present EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input. The core of our approach is a multi-modal conditioning strategy, introducing a PoseNet and a Vertex Encoder to efficiently extract features from hand pose and 3D mesh vertices. These signals, along with depth information, guide the generative process to ensure the pressure fields are physically grounded. To effectively fuse these heterogeneous features, we further propose a Distribution-Calibrated Spatial Layer, which aligns their statistical properties before combination. Evaluated on the EgoPressure ego-view setting, EgoPressDiff achieves state-of-the-art results, improving Volumetric IoU by over 34% relative to the prior baseline, while reducing MAE and maintaining high temporal accuracy. Our project page is at https://egopressdiff.github.io/.

2606.06875 2026-06-08 cs.CV cs.CR 新提交

Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows

统一安全上下文图像生成:在多模态扩散变换器中通过限制不安全信息流

Xiang Yang, Feifei Li, Mi Zhang, Geng Hong, Xiaoyu You, Mi Wen, Min Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出UVR框架,通过分析注意力动态中的不安全信息流,在无需训练的情况下对输出补丁进行注意力调制,实现图像生成和编辑任务的安全控制,达到91%和77%的擦除率。

Comments ICML26

详情
AI中文摘要

配备多模态注意力(MM-Attn)的扩散变换器(DiTs)已成为图像生成的主导范式。然而,防止有害内容的生成仍然是一个关键挑战,特别是在图像到图像(I2I)编辑任务中。现有的安全机制主要针对文本到图像(T2I)合成或基于U-Net的架构设计,这限制了它们在基于DiT的框架中统一安全缓解的有效性。为弥补这一差距,我们提出了统一视觉安全调节器(UVR),一个无需训练的、在生成图像中调节不安全语义的安全生成框架。UVR基于从信息流角度对MM-Attn中注意力动态的分析。我们识别出一个与任务无关的启动阶段,在该阶段输出补丁中的不安全语义迅速出现并可以被精确定位,随后是特定任务的语义放大和干扰阶段,其中有害信号进一步传播并与良性内容纠缠。基于这些观察,UVR通过统一的、有针对性的注意力调制和对识别出的不安全输出补丁上有害信息流的显式限制来缓解不安全生成。跨多种概念的实验表明,UVR在图像合成和编辑任务中分别实现了91%和77%的擦除率,达到了最先进的安全性能,同时以最小的退化保持了视觉质量和保真度。代码可在以下网址获取:https://github.com/deng12yx/UVR。

英文摘要

Diffusion transformers (DiTs) equipped with multimodal attention (MM-Attn) have become a dominant paradigm for image generation. However, preventing the generation of harmful content remains a critical challenge, particularly in image-to-image (I2I) editing tasks. Existing safety mechanisms are primarily designed for text-to-image (T2I) synthesis or U-Net-based architectures, which limits their effectiveness for unified safety mitigation in DiT-based frameworks. To bridge this gap, we propose Unified Visual Safety Regulator (UVR), a training-free safe generation framework that regulates unsafe semantics in generated images. UVR is grounded in an analysis of attention dynamics from the perspective of information flow in MM-Attn. We identify a task-independent start-up stage, during which unsafe semantics in output patches rapidly emerge and can be accurately localized, followed by task-specific semantic amplification and interference stages, where harmful signals are further propagated and entangled with benign content. Based on these observations, UVR mitigates unsafe generation through unified, targeted attention modulation and explicit restriction of harmful information flow over the identified unsafe output patches. Experiments across various concepts show that UVR achieves state-of-the-art safety performance, with erase rates of 91% and 77% in image synthesis and editing tasks, respectively, while preserving visual quality and fidelity with minimal degradation. Code is available at https://github.com/deng12yx/UVR.

2606.06885 2026-06-08 cs.CV cs.AI 新提交

FreeAnimate: Training-Free Human Image Animation with Preview-Guided Denoising

FreeAnimate: 基于预览引导去噪的无训练人体图像动画

Yuan Zeng, Yujia Shi, Zongqing Lu, QingMin Liao

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出FreeAnimate框架,利用图像扩散模型内在能力实现无训练的人体图像动画,通过预览生成策略提供时序和结构先验,结合反演增强注意力和参考锚定自注意力模块,保证时序一致性和身份保持。

Comments Accepted to IEEE ICASSP 2026

详情
AI中文摘要

人体图像动画已经取得了显著进展,主要得益于扩散模型。然而,现有方法通常需要大量的训练数据和资源才能获得高质量结果,限制了泛化性和可访问性。在这项工作中,我们引入了FreeAnimate,一个无训练框架,利用图像扩散模型的内在能力来实现时序一致性、身份保持和背景稳定性。我们的方法包含一种新颖的预览生成策略,该策略从生成的预览帧中提供时序和结构先验,无需训练即可有效引导姿态对齐和背景一致性。此外,FreeAnimate引入了反演增强注意力和参考锚定自注意力模块,以保证时序一致性和身份保持。实验结果表明,FreeAnimate优于现有的无训练竞争方法和基于训练的基线方法,生成的图像质量可与最先进的方法相媲美,并在不同数据集上展现出强大的泛化能力。我们的项目页面位于 https://freeani.github.io/。

英文摘要

Human Image Animation has seen significant advancements, primarily driven by diffusion models. However, existing methods typically demand substantial training data and resources to achieve high-quality results, limiting generalization and accessibility. In this work, we introduce FreeAnimate, a training-free framework that leverages the inherent capabilities of image diffusion models to enable temporal consistency, identity preservation, and background stability. Our approach incorporates a novel preview generation strategy that provides temporal and structural priors from generated preview frames, effectively guiding pose alignment and background consistency without training. Additionally, FreeAnimate introduces Inversion-Boosted Attention and Reference-Anchored Self-Attention modules to guarantee temporal consistency and identity preservation. Experimental results demonstrate that FreeAnimate outperforms existing training-free competitors and training-based baseline methods, achieving generation quality comparable to state-of-the-art methods and offering robust generalization across diverse datasets. Our project page is at https://freeani.github.io/.

2606.06887 2026-06-08 cs.CV 新提交

ARAPDiffusion: ARAP Regularization for Diffusion-Based Deformable Shape Space Learning

ARAPDiffusion: 基于ARAP正则化的扩散变形形状空间学习

Haibo Liu, Jinghan Ke, Haitao Yang, Xiangru Huang, Georgios Pavlakos, Qixing Huang

发表机构 * University of Texas at Austin(德克萨斯大学奥斯汀分校) Westlake University(西湖大学)

AI总结 提出ARAPDiffusion,一种潜在扩散模型,通过注入ARAP变形模型作为正则化损失,学习变形形状集合的连续形状空间,减少对大量3D训练数据的依赖。

详情
AI中文摘要

本文介绍了ARAPDiffusion,一种潜在扩散模型,用于学习变形形状集合的潜在连续形状空间。关键创新在于将尽可能刚性(ARAP)变形模型作为正则化损失注入潜在扩散(LD),从而减少学习生成模型所需的大量3D训练数据。与标准LD相比,我们展示了如何利用ARAP模型同时改进编码器/解码器和LD模型。训练过程交替使用LD模型定义的合成分布来开发增强形状编码器/解码器的正则化损失,以及使用形状解码器来开发改进LD模型的正则化损失。我们还展示了LD范式在结合无表示LD模型和适用于无序点云的隐式形状解码器方面的优势。无条件和条件形状生成的实验结果证明了ARAPDiffusion相对于基线方法的优势。

英文摘要

This paper introduces ARAPDiffusion, a latent diffusion model that learns the underlying continuous shape space of a collection of deformable shapes. The key innovation is injecting the as-rigid-as-possible (ARAP) deformation model into latent diffusion (LD) as regularization losses, which relaxes the requirement for abundant 3D training data when learning generative models. In contrast to standard LD, we show how the ARAP model can be used to improve both the encoder/decoder and the LD model. The training procedure alternates between two steps: using the synthetic distribution defined by the LD model to develop a regularization loss that enhances the shape encoder/decoder, and using the shape decoder to develop a regularization loss that improves the LD model. We also show the benefit of the LD paradigm in combining a representation-free LD process with an implicit shape decoder that is applicable to unorganized point clouds. The experimental results of unconditional and conditional shape generation demonstrate the advantages of ARAPDiffusion over baseline approaches.

2606.06890 2026-06-08 cs.CV cs.LG 新提交

Diagnosing Visual Ignorance in Vision-Language Models

诊断视觉语言模型中的视觉忽视

Runyu Zhou, Qi Zhang, Qixun Wang, Yisen Wang

发表机构 * Peking University(北京大学)

AI总结 研究视觉语言模型依赖语言先验的内部机制,通过层替换和探针分析揭示多阶段瓶颈,并引入渐进视觉退化指标发现基准测试可能奖励视觉忽视。

详情
AI中文摘要

视觉语言模型(VLM)经常依赖语言先验,产生自信但缺乏视觉证据支持的答案。虽然这种行为被广泛观察到,但其内部机制及对基准评估的影响仍未被充分理解。在这项工作中,我们从机制和行为两个角度研究语言先验依赖。在内部,我们将反事实层替换与有监督的逐层MLP探针相结合,以追踪真实视觉语义和语言先验语义如何在语言解码器中竞争。我们的分析揭示了一个多阶段瓶颈:中间层通常无法有效检索视觉信息,而后续层可能进一步抑制存活的视觉信号,偏向文本空间偏差。在外部,我们引入了一种基于多步高斯模糊的渐进视觉退化度量,用于识别那些即使视觉内容被逐渐破坏,答案仍保持不变的实例。在十二个视觉问答基准和三个代表性VLM上,我们发现相当一部分示例在严重或完全视觉混淆下仍可回答,表明当前基准可能无意中奖励视觉忽视。这些发现表明,语言先验依赖是一种系统性的路由故障,影响模型内部和基准有效性。最后,我们概述了未来的关键研究方向,强调需要设计基于结构隔离或反事实数据的训练分布和评估协议,以强制执行真正的跨模态基础。

英文摘要

Vision-Language Models (VLMs) frequently rely on language priors, producing confident answers that are weakly grounded in visual evidence. While this behavior is widely observed, its internal mechanisms and its impact on benchmark evaluation remain insufficiently understood. In this work, we study language-prior reliance from both mechanistic and behavioral perspectives. Internally, we combine counterfactual layer replacement with supervised layer-wise MLP probing to trace how ground-truth visual semantics and language-prior semantics compete across the language decoder. Our analysis reveals a multi-stage bottleneck: intermediate layers often fail to effectively retrieve visual information, while later layers can further suppress surviving visual signals in favor of text-space biases. Externally, we introduce a progressive visual decay metric based on multi-step Gaussian blurring, which identifies instances whose answers remain invariant even as visual content is increasingly destroyed. Across twelve visual question-answering benchmarks and three representative VLMs, we find that a substantial fraction of examples remain answerable under severe or total visual obfuscation, indicating that current benchmarks can inadvertently reward visual ignorance. These findings demonstrate that language-prior reliance is a systematic routing failure affecting both model internals and benchmark validity. Finally, we outline critical pathways for future research, highlighting the necessity of designing training distributions and evaluation protocols built on structurally isolated or counterfactual data to enforce genuine cross-modal grounding.
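The progressive visual decay idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the blur schedule `sigmas`, the function name `decay_invariance`, and the `model(image, question)` callable are all hypothetical stand-ins, and the score shown (fraction of blur levels at which the answer matches the clean-image answer) is only one plausible way to quantify answer invariance under multi-step Gaussian blurring.

```python
import math

def gaussian_kernel(sigma):
    """1-D normalized Gaussian kernel with radius of about 3*sigma."""
    radius = max(1, int(math.ceil(3 * sigma)))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma)) for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(img, kernel):
    """Convolve every row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(img[0])
    return [
        [sum(k * row[min(max(x + j - radius, 0), width - 1)] for j, k in enumerate(kernel))
         for x in range(width)]
        for row in img
    ]

def transpose(img):
    return [list(col) for col in zip(*img)]

def gaussian_blur(img, sigma):
    """Separable Gaussian blur on a 2-D grayscale image (nested lists of floats)."""
    kernel = gaussian_kernel(sigma)
    return transpose(blur_rows(transpose(blur_rows(img, kernel)), kernel))

def decay_invariance(model, img, question, sigmas=(1.0, 2.0, 4.0, 8.0)):
    """Fraction of blur levels whose answer equals the clean-image answer.

    `model` is a hypothetical callable (image, question) -> answer string.
    A value near 1.0 suggests the answer does not depend on visual content.
    """
    reference = model(img, question)
    unchanged = sum(model(gaussian_blur(img, s), question) == reference for s in sigmas)
    return unchanged / len(sigmas)
```

In this toy setting, a model that ignores the image scores 1.0 on every example, while a model that reads a bright pixel loses its answer as soon as the pixel is smeared out. Examples with a high score for a real VLM would be the candidates for "answerable without looking".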

2606.06891 2026-06-08 cs.CV 新提交

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Stream3D-VLM:基于增量几何先验的在线3D空间理解

Hanxun Yu, Xuan Qu, Lei Ke, Boqiang Zhang, Yuxin Wang, Jianke Zhu, Dong Yu

发表机构 * Zhejiang University(浙江大学) Tencent Hunyuan(腾讯混元) HKUST(香港科技大学) Shenzhen Loop Area Institute(深圳河套学院)

AI总结 提出在线3D视觉语言模型Stream3D-VLM,通过自回归流控制、轻量视觉-空间特征融合模块和几何自适应体素压缩,实现从流式视频中实时理解3D空间,并构建超百万在线3D问答数据集,在多项任务上超越现有模型。

Comments Project Page: https://stream3d-vlm.github.io/

详情
AI中文摘要

尽管3D场景理解取得了进展,但现有的3D大型多模态模型在离线设置下运行,需要完整的场景观测或预定义的视频片段。在本文中,我们提出了一种在线3D视觉语言模型,能够从流式视频中实现实时空间理解。我们的方法基于LLM的下一个词预测目标,采用自回归流控制建模来学习何时响应,并使用轻量级的视觉-空间特征融合(VSFI)模块,将时间对齐的几何先验增量注入视觉流。为了减轻长上下文解码开销,我们提出了一种即插即用的几何自适应体素压缩(GAVC)模块,用于高效的视觉令牌压缩。为了解决流式3D语言数据的稀缺问题,我们进一步开发了一个可扩展的数据生成流程,策划了超过100万个在线时空3D问答对,并建立了一个涵盖29个任务的全面基准。大量实验表明,我们的方法在在线和离线3D空间理解、推理和定位任务上均显著优于专有和开源模型。项目页面见 https://stream3d-vlm.github.io/。

英文摘要

Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, requiring complete scene observations or predefined video clips. In this paper, we present an online 3D vision-language model that enables real-time spatial understanding from streaming video. Our approach adopts autoregressive streaming-control modeling, based on the LLM's next-token prediction objective, to learn when to respond, and employs a lightweight Visual-Spatial Feature Integration (VSFI) module to incrementally inject temporally aligned geometry priors into the visual stream. To alleviate long-context decoding overhead, we propose a plug-and-play Geometry-Adaptive Voxel Compression (GAVC) module for efficient visual token compression. To address the scarcity of streaming 3D-language data, we further develop a scalable data generation pipeline that curates over 1M online spatio-temporal 3D QA pairs and establishes a comprehensive benchmark spanning 29 tasks. Extensive experiments show that our approach significantly outperforms both proprietary and open-source models across online and offline 3D spatial understanding, reasoning, and grounding tasks. The project page is available at https://stream3d-vlm.github.io/.

2606.06899 2026-06-08 cs.CV cs.LG 新提交

Lighting-Aware Representation Learning under Controllable Lighting Variation

可控光照变化下的光照感知表示学习

Lizhen Zhu, Charantej Reddy Pochimireddy, James Z Wang, Brad Wyble

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 提出光照感知表示学习框架,将光照变化作为显式训练信号,通过辅助目标捕获光照依赖变化,在分类和检测任务上优于标准对比学习基线。

详情
AI中文摘要

光照变化仍然是视觉表示学习的主要挑战,因为它们会在环境内部和之间引起显著的外观变化。虽然现有方法通常通过数据增强来鼓励模型对光照变化具有不变性,但这些策略在学习过程中并未显式建模光照信息。受人类视觉理论的启发,我们提出了一种光照感知表示学习框架,该框架将光照变化作为显式训练信号而非需要抑制的干扰因素。我们的方法通过引入一个辅助目标来扩展对比学习,该目标捕获渲染场景中光照依赖的变化,使模型能够联合学习保持语义一致性的表示,同时保持对光照依赖的视觉结构的敏感性。我们在ImageNet、ExDark和PASCAL VOC基准测试上评估了所提模型的图像分类和物体检测任务。结果表明,所提出的光照感知训练在保持相同架构和训练预算的情况下,始终优于标准对比学习基线。此外,我们的方法在监督学习框架和涉及更简单光照变化的设置中表现出有前景的性能,表明其具有超越复杂光照场景的广泛适用性。这些结果显示了它在复杂视觉环境以及更常规的图像处理任务中增强模型鲁棒性和适应性的潜力。

英文摘要

Variations in illumination remain a major challenge for visual representation learning, as they induce substantial appearance changes both across and within environments. While existing approaches typically address this issue through data augmentations that encourage models to become invariant to lighting changes, such strategies do not explicitly model lighting information during learning. Inspired by theories of human vision, we propose a lighting-aware representation learning framework that incorporates illumination variation as an explicit training signal rather than a nuisance factor to be suppressed. Our method extends contrastive learning by introducing an auxiliary objective that captures illumination-dependent variation in rendered scenes, enabling the model to jointly learn representations that preserve semantic consistency while remaining sensitive to lighting-dependent visual structure. We evaluate the proposed model on image classification and object detection tasks across the ImageNet, ExDark, and PASCAL VOC benchmarks. Results demonstrate that the proposed lighting-aware training consistently improves downstream performance over standard contrastive learning baselines, while maintaining the same architecture and training budget. Furthermore, our approach shows promising performance in supervised learning frameworks and under settings involving simpler lighting variation, suggesting broad applicability beyond complex illumination scenarios. These results indicate its potential to enhance model robustness and adaptability in complex visual environments as well as in more conventional image processing tasks.

2606.06901 2026-06-08 cs.CV 新提交

LUCID: Learning Unified Control for Image Deflaring and Exposure Mastery in Nighttime Photography

LUCID:夜间摄影中图像去眩光与曝光控制的统一学习

Tingyu Yang, Yuan Cheng, Xiaoyun Yuan

发表机构 * MoE Key Lab of Artificial Intelligence(教育部人工智能重点实验室) AI Institute(人工智能研究院) School of Computer Science(计算机科学学院) School of Biomedical Engineering(生物医学工程学院) School of Artificial Intelligence(人工智能学院)

AI总结 提出LUCID统一框架,通过眩光解缠模块和扩散驱动模块联合处理夜间图像中的眩光和噪声,并引入四模式训练实现可控恢复,支持HDR重建,性能优于现有方法。

Comments Accepted by SIGGRAPH 2026

详情
AI中文摘要

摄影是用光绘画的艺术,但夜间场景受到相互竞争的退化影响:强烈的眩光掩盖了场景结构,而光子受限区域则陷入噪声。传统方法孤立地处理这些因素,忽略了这些退化本质上是纠缠的。为弥补这一差距,我们引入了LUCID,一个统一框架,将夜间恢复重新定义为连续且可控的过程,而非固定的校正。我们将夜间恢复分解为两个协作组件:一个眩光解缠模块,用于揭开光学伪影的“幕布”,提供可靠的结构指导;以及一个扩散驱动模块,利用生成先验重建干净且曝光良好的图像。关键的是,LUCID通过一种新颖的四模式训练策略引入了显式的可控性,使用户能够通过无分类器引导(CFG)引导恢复过程,并允许对光源及其相关的眩光和鬼影伪影进行选择性控制,同时通过连续曝光控制支持高动态范围(HDR)重建。大量实验表明,LUCID在多种真实夜间场景中始终优于最先进的方法。

英文摘要

Photography is the art of painting with light, yet nighttime scenes are shaped by competing degradations: intense flares obscure scene structure, while photon-limited regions collapse into noise. Conventional approaches address these factors in isolation, overlooking the fact that these degradations are fundamentally entangled. To bridge this gap, we introduce LUCID, a unified framework that reframes nighttime restoration as a continuous and controllable process rather than a fixed correction. We decompose nighttime restoration into two cooperative components: a flare disentanglement module that lifts the 'curtain' of optical artifacts to provide reliable structural guidance, and a diffusion-driven module that leverages generative priors to reconstruct clean and well-exposed imagery. Crucially, LUCID introduces explicit controllability through a novel four-mode training strategy, enabling users to steer the restoration process via classifier-free guidance (CFG) and allowing selective control over light sources and their associated flare and ghosting artifacts, while also supporting high dynamic range (HDR) reconstruction through continuous exposure control. Extensive experiments demonstrate that LUCID consistently outperforms state-of-the-art methods across diverse real-world nighttime scenarios.

2606.06903 2026-06-08 cs.CV cs.AI 新提交

Beyond Skeletons: Learning Animation Directly from Driving Videos with Same2X Training Strategy

超越骨架:使用Same2X训练策略直接从驱动视频学习动画

Yuan Zeng, Yujia Shi, Yuhao Yang, Dongxia Liu, Zongqing Lu, Wenming Yang, Qingmin Liao

发表机构 * Tsinghua University(清华大学) Harbin Institute of Technology(哈尔滨工业大学) Pengcheng Laboratory(鹏城实验室)

AI总结 提出DirectAnimator框架,通过驱动线索三元组和Same2X训练策略,绕过姿态提取直接从原始视频学习动画,实现鲁棒且高质量的人体图像动画生成。

Comments Accepted to ICLR 2026

详情
AI中文摘要

人体图像动画旨在根据从驱动视频中提取的姿态信息,从静态参考图像生成视频。现有方法通常依赖姿态估计器提取中间表示,但在遮挡或复杂姿态下这些信号容易出错。基于这些观察,我们提出了DirectAnimator,一个绕过姿态提取并直接从原始驱动视频学习的框架。我们引入了一个由姿态、面部和位置线索组成的驱动线索三元组,以语义丰富且稳定的形式捕捉运动、表情和对齐,并通过CueFusion DiT块融合它们,以实现去噪过程中的可靠控制。为了使学习在驱动和参考身份不同时依然可靠,我们设计了Same2X训练策略,将跨身份特征与从相同身份数据中学到的特征对齐,从而正则化优化并加速收敛。大量实验表明,DirectAnimator在保持身份的同时达到了最先进的视觉质量,对遮挡和复杂关节运动具有鲁棒性,并且计算资源更少。我们的项目页面位于 https://directanimator.github.io/。

英文摘要

Human image animation aims to generate a video from a static reference image, guided by pose information extracted from a driving video. Existing approaches often rely on pose estimators to extract intermediate representations, but such signals are prone to errors under occlusion or complex poses. Building on these observations, we present DirectAnimator, a framework that bypasses pose extraction and directly learns from raw driving videos. We introduce a Driving Cue Triplet consisting of pose, face, and location cues that captures motion, expression, and alignment in a semantically rich yet stable form, and we fuse them through a CueFusion DiT block for reliable control during denoising. To make learning dependable when the driving and reference identities differ, we devise a Same2X training strategy that aligns cross-ID features with those learned from same-ID data, regularizing optimization and accelerating convergence. Extensive experiments demonstrate that DirectAnimator attains state-of-the-art visual quality and identity preservation while remaining robust to occlusions and complex articulation, and it does so with fewer computational resources. Our project page is at https://directanimator.github.io/.

2606.06908 2026-06-08 cs.CV 新提交

polyDAG: Polynomial Acyclicity Constraints for Efficient Continuous Causal Discovery in Visual Semantic Graphs

polyDAG:用于视觉语义图中高效连续因果发现的多项式无环性约束

Wenhao Zhang, Ramin Ramezani, Tao Han, Kai Hwang, Minyi Guo

发表机构 * Shanghai Jiao Tong University(上海交通大学) University of California, Los Angeles(加州大学洛杉矶分校) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出多项式无环性约束框架polyDAG,用有限多项式迹约束替代矩阵指数约束,实现视觉语义图中高效的连续有向无环图学习,在合成图和CelebA数据集上提升了效率与结构恢复性能。

详情
AI中文摘要

现代图像分析流程通常将图像转换为结构化语义变量,如面部属性、对象概念和场景描述符。学习这些变量之间的有向依赖关系可以生成可解释的视觉语义图,但连续有向无环图学习受到执行无环性成本的限制。我们提出了polyDAG,一个用于视觉语义图中高效连续因果发现的多项式无环性框架。polyDAG用有限多项式迹约束替代矩阵指数无环性约束,并证明了新约束恰好对有向无环图为零。我们进一步推导了一种几何级数实现,避免了显式求和循环,同时保持了相同的无环性条件。在合成Erdős-Rényi图和CelebA面部视觉属性上的实验表明,polyDAG提高了效率和结构恢复能力。在d∈{100,200,500}的修订合成协议上平均,polyDAG将平均结构汉明距离从318.4降低到285.4,并将平均F1分数从0.725提高到0.756。在100个节点时,几何变体运行时间为3.44秒,而指数基线为5.16秒,对应33.4%的加速。代码和数据公开于 https://github.com/wenhaoz-fengcai/polyDAG。

英文摘要

Modern image-analysis pipelines often convert images into structured semantic variables, such as facial attributes, object concepts, and scene descriptors. Learning directed dependencies among these variables can produce interpretable visual semantic graphs, but continuous directed acyclic graph learning is limited by the cost of enforcing acyclicity. We present polyDAG, a polynomial acyclicity framework for efficient continuous causal discovery in visual semantic graphs. polyDAG replaces the matrix-exponential acyclicity constraint with a finite polynomial trace constraint and proves that the new constraint is zero exactly for acyclic graphs. We further derive a geometric-series implementation that avoids the explicit summation loop while preserving the same acyclicity condition. Experiments on synthetic Erdős-Rényi graphs and CelebA facial visual attributes show that polyDAG improves efficiency and structure recovery. Averaged over the revised synthetic protocol with d in {100, 200, 500}, polyDAG reduces mean structural Hamming distance from 318.4 to 285.4 and improves mean F1 score from 0.725 to 0.756. At 100 nodes, the geometric variant runs in 3.44 seconds compared with 5.16 seconds for the exponential baseline, corresponding to a 33.4 percent speedup. Code and data are publicly available at https://github.com/wenhaoz-fengcai/polyDAG.
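To make the idea of a finite polynomial trace constraint concrete, here is a minimal sketch. The abstract does not state the exact polynomial, so the form below, h(W) = sum over k = 1..d of tr((W∘W)^k), is an assumption chosen for illustration: because W∘W is nonnegative, tr((W∘W)^k) sums the weights of closed walks of length k, and any directed cycle in a d-node graph has length at most d, so h(W) is zero exactly when the graph is acyclic. The function names are hypothetical, and the paper's loop-free geometric-series variant is not reproduced here.

```python
def matmul(a, b):
    """Product of two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def trace(a):
    return sum(a[i][i] for i in range(len(a)))

def poly_acyclicity(w):
    """Assumed constraint h(W) = sum_{k=1..d} tr((W o W)^k).

    Returns 0 if and only if the weighted adjacency matrix `w`
    (w[i][j] != 0 meaning an edge i -> j) contains no directed cycle.
    """
    d = len(w)
    a = [[x * x for x in row] for row in w]  # elementwise square: nonnegative weights
    power = a
    h = trace(power)
    for _ in range(d - 1):                   # accumulate tr(A^2) ... tr(A^d)
        power = matmul(power, a)
        h += trace(power)
    return h

# A chain 0 -> 1 -> 2 is acyclic; adding 2 -> 0 closes a 3-cycle.
dag = [[0.0, 1.5, 0.0], [0.0, 0.0, -2.0], [0.0, 0.0, 0.0]]
cyclic = [[0.0, 1.5, 0.0], [0.0, 0.0, -2.0], [0.5, 0.0, 0.0]]
print(poly_acyclicity(dag), poly_acyclicity(cyclic))
```

Unlike the matrix exponential, this sum needs only d - 1 matrix products and no factorial scaling, which is the kind of saving a polynomial constraint is meant to expose. Replacing the loop with a closed-form geometric series is the step the abstract credits for the reported runtime gain.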

2606.06918 2026-06-08 cs.CV 新提交

DRIFT: From Robustness Gaps to Invariance Manifolds for AI-Generated Image Detection

DRIFT: 从鲁棒性差距到AI生成图像检测的不变流形

Abhishek Ameta, Sayan Banerjee, Shreyas Pandith, Harshit, Ankita Chatterjee, Akshay Janardan Bankar, Amit Satish Unde

发表机构 * Samsung Research Institute, Bangalore, India(三星研究所,班加罗尔,印度)

AI总结 提出DRIFT方法,通过冻结视觉基础模型并学习真实图像的结构化不变流形,利用鲁棒和脆弱子空间分解及排序间隔实现AI生成图像检测,在未见生成器和分辨率上表现优异。

Comments Submitted to ECCV 2026

详情
AI中文摘要

生成图像模型的快速演进挑战了现有的AI生成图像检测器,尤其是在面对未见生成器的开放世界场景中。近期无训练方法通过测量冻结视觉基础模型(VFM)中的鲁棒性差距,利用扰动引起的嵌入漂移检测伪造图像。然而,这些方法依赖于预训练继承的固定不变几何结构,缺乏针对检测任务的原则性适应。我们转而将AI生成图像检测表述为在单类监督下学习真实图像的结构化不变流形。基于冻结的VFM,我们引入轻量级投影头,将表示空间分解为互补的鲁棒子空间和脆弱子空间。鲁棒子空间被显式训练以抑制由物理上合理的成像变换引起的变异,近似真实图像流形的切方向,而脆弱子空间则保持对类似编辑扰动的敏感性。结构化的排序间隔强制实现物理不变性与编辑诱导变异性之间的层次分离,使得检测成为相对于所学流形的间隔违反测试。在推理时,两种变换族下的多尺度逐块漂移产生双通道不变性特征和可解释的定位。大量实验表明,该方法在未见生成器和分辨率上具有强大的开放世界泛化能力,始终优于基于无训练鲁棒性的基线方法,同时提供可解释的不变性违反图。

英文摘要

The rapid evolution of generative image models challenges existing AI-generated image detectors, particularly in open-world settings with unseen generators. Recent training-free approaches measure robustness gaps in frozen vision foundation models (VFMs), detecting fakes via perturbation-induced embedding drift. However, these methods rely on fixed invariance geometry inherited from pretraining and lack principled adaptation to the detection task. We instead formulate AI-generated image detection as learning a structured invariance manifold of real images under one-class supervision. Building upon a frozen VFM, we introduce lightweight projection heads that decompose representation space into complementary robust and fragile subspaces. The robust subspace is explicitly trained to suppress variations induced by physically plausible imaging transformations, approximating tangent directions of a real-image manifold, while the fragile subspace retains sensitivity to edit-like perturbations. A structured ordering margin enforces hierarchical separation between physical invariance and edit-induced variability, enabling detection as a margin-violation test relative to the learned manifold. At inference, multi-scale patch-wise drift under both transformation families yields a dual-channel invariance signature and interpretable localization. Extensive experiments demonstrate strong open-world generalization across unseen generators and resolutions, consistently outperforming training-free robustness-based baselines while providing interpretable invariance-violation maps.

2606.06938 2026-06-08 cs.CV 新提交

When CLIP Sees More, It Fights Back Harder: Multi-View Guided Adaptive Counterattacks for Test-Time Adversarial Robustness

当CLIP看得更多,它反击得更猛烈:多视图引导的自适应反击用于测试时对抗鲁棒性

Sunoh Kim, Daeho Um

发表机构 * Dankook University(檀国大学) University of Seoul(首尔市立大学) Yongin, South Korea(韩国龙仁) Seoul, South Korea(韩国首尔)

AI总结 提出多视图引导的自适应反击(MAC),通过构建输入图像的增强视图、执行反击以精炼嵌入、自适应调整反击强度并聚合视图,显著提升CLIP在测试时的对抗鲁棒性。

Comments Accepted in CVPR2026

详情
AI中文摘要

视觉-语言模型如CLIP在零样本识别方面取得了显著成就,但其对对抗扰动的鲁棒性仍然有限。最近提出的测试时反击(TTC)通过在推理过程中扰动输入图像使其远离受损状态来提高CLIP的鲁棒性。然而,TTC在强攻击下仍然脆弱,因为其反击依赖于直接受损的原始视图,并采用噪声驱动的硬门控方案,无法适应变化的损坏严重程度。为了解决这些限制,我们引入了多视图引导的自适应反击(MAC),它针对多个视图执行具有损坏感知软加权的反击。具体来说,MAC首先构建输入图像的增强视图以获得多样化的嵌入。然后,它执行反击以精炼这些视图的受损嵌入。接下来,MAC根据每个视图的估计损坏程度自适应地缩放反击强度。最后,经过自适应反击的视图被聚合以产生鲁棒的最终预测。在20个数据集和多种攻击场景下的广泛实验表明,MAC显著提高了鲁棒性,同时由于其无调优设计,保持了高推理速度和内存效率。我们的代码可在 https://github.com/sunoh-kim/MAC 获取。

英文摘要

Vision-language models such as CLIP have achieved remarkable zero-shot recognition capabilities, yet their robustness against adversarial perturbations remains limited. Test-time counterattack (TTC) was recently proposed to improve CLIP's robustness by perturbing an input image to steer it away from a corrupted state during inference. However, TTC remains fragile under strong attacks because its counterattack relies on a directly corrupted original view and employs a noise-driven hard-gating scheme that cannot adapt to varying corruption severity. To address these limitations, we introduce Multi-view guided Adaptive Counterattack (MAC), which performs counterattacks across multiple views with corruption-aware soft weighting. Specifically, MAC first constructs augmented views of an input image to obtain diverse embeddings. It then performs counterattacks to refine the corrupted embeddings of these views. Next, MAC adaptively scales the counterattack intensity for each view based on its estimated corruption degree. Finally, the adaptively counterattacked views are aggregated to yield a robust final prediction. Extensive experiments across 20 datasets and diverse attack scenarios demonstrate that MAC substantially improves robustness while preserving high inference speed and memory efficiency with its tuning-free design. Our code is available at https://github.com/sunoh-kim/MAC.

2606.06943 2026-06-08 cs.CV cs.AI 新提交

SS-TPT: Stability and Suitability-Guided Test-Time Prompt Tuning for Adversarially Robust Vision-Language Models

SS-TPT:面向对抗鲁棒视觉语言模型的稳定性和适用性引导的测试时提示微调

Sunoh Kim, Daeho Um

发表机构 * Dankook University, Yongin, South Korea(檀国大学,韩国龙仁) University of Seoul, Seoul, South Korea(首尔市立大学,韩国首尔)

AI总结 提出SS-TPT方法,通过稳定性与适用性分数评估增强视图质量,引导测试时提示微调,在保持高吞吐量的同时显著提升对抗鲁棒性。

Comments Accepted in ICML2026

详情
AI中文摘要

视觉语言模型(如CLIP)实现了强大的零样本识别,但在对抗扰动下仍然非常脆弱。最近的测试时自适应防御通过利用大量增强视图来提高鲁棒性,但这导致了不切实际的减速和明确的鲁棒性-吞吐量权衡。为了应对这一挑战,我们提出了稳定性和适用性引导的测试时提示微调(SS-TPT),通过两个互补分数评估每个增强视图的质量:(1)稳定性,衡量对弱增强的预测不变性,以及(2)适用性,衡量视图间的特征空间密度。这些稳定性和适用性(SS)分数通过SS引导的一致性损失和SS加权预测来指导自适应和推理,放大可信视图同时抑制受损视图。大量实验表明,SS-TPT显著优于先前最先进的方法,在不同数据集和不同视图数量下实现了卓越的鲁棒性-吞吐量权衡,从而展示了强大的实用性和泛化性。我们的代码可在以下网址获得:https://github.com/sunoh-kim/SS-TPT。

英文摘要

Vision-language models (VLMs) such as CLIP achieve strong zero-shot recognition but remain highly fragile under adversarial perturbations. Recent test-time adaptation defenses improve robustness by leveraging many augmented views, but this leads to impractical slowdown and a clear robustness-throughput trade-off. To address this challenge, we present Stability and Suitability-guided Test-time Prompt Tuning (SS-TPT), evaluating the quality of each augmented view via two complementary scores: (1) stability, measuring prediction invariance to weak augmentations, and (2) suitability, measuring feature-space density among views. These stability and suitability (SS) scores guide both adaptation and inference through an SS-guided consistency loss and an SS-weighted prediction, amplifying trustworthy views while suppressing corrupted ones. Extensive experiments demonstrate that SS-TPT significantly outperforms prior state-of-the-art methods, achieving superior robustness-throughput trade-offs across diverse datasets and varying numbers of views, thereby demonstrating both strong practicality and generality. Our code is available at https://github.com/sunoh-kim/SS-TPT.

2606.06950 2026-06-08 cs.CV cs.AI 新提交

When is 3D Worth It? A Resource-Performance Frontier for CNNs and Transformers in Lung CT

何时3D值得?肺CT中CNN和Transformer的资源-性能前沿

Md Enamul Hoq, Sharafat Hossain, Imraul Emmaka, Linda Larson-Prior, Lawrence Tarbox, Jonathan Bona, Donald Johann Jr., Fred Prior

发表机构 * Department of Biomedical Informatics, University of Arkansas for Medical Sciences(阿肯色大学医学科学分校生物医学信息学系) Department of Information Science, University of Arkansas at Little Rock(阿肯色大学小石城分校信息科学系) Department of Neuroscience, University of Arkansas for Medical Sciences(阿肯色大学医学科学分校神经科学系)

AI总结 研究在肺CT中2D、2.5D和3D输入对CNN和Transformer的影响,发现2.5D CNN在判别-稳定性权衡上最优,而3D CNN和Transformer存在不稳定性或退化预测。

Comments 8 pages, 6 figures

详情
AI中文摘要

三维模型通常被认为更适合体积医学成像,但其实际价值取决于性能提升是否值得增加的计算成本和复杂性。我们不提出新架构,而是研究在固定训练协议下,输入维度(2D、2.5D、3D)如何影响卷积神经网络(CNN)和视觉Transformer(ViT)的行为。使用无泄漏的NLST队列(n=1,977)和辅助LIDC-IDRI数据,我们发现2.5D CNN在我们的比较中提供了最有利的判别-稳定性权衡(ROC-AUC 0.682,95% CI [0.546, 0.799]),具有稳定的操作点。相比之下,3D CNN表现出阈值不稳定性,而Transformer出现退化预测,例如全正预测。置信区间宽且重叠,因此我们将这些结果呈现为受控的资源-性能前沿和失败模式分类,而非明确的优越性声明。对于类别不平衡的肺癌筛查分类,2D和2.5D输入在性能、稳定性和计算效率之间提供了比全3D表示更可靠的权衡。

英文摘要

Three-dimensional models are widely assumed preferable for volumetric medical imaging, yet their practical value depends on whether performance gains justify added computational cost and complexity. Rather than proposing a new architecture, we study how input dimensionality (2D, 2.5D, 3D) affects model behavior across convolutional neural networks (CNNs) and Vision Transformers (ViTs) under a fixed training protocol. Using a leakage-free NLST cohort (n = 1,977) with supporting LIDC-IDRI data, we find that the 2.5D CNN offers the most favorable discrimination-stability trade-off in our comparison (ROC-AUC 0.682, 95% CI [0.546, 0.799]) with a stable operating point. In contrast, 3D CNNs show threshold instability, and transformers exhibit degenerate predictions, such as all-positive predictions. Confidence intervals are wide and overlapping, so we present these results as a controlled resource-performance frontier and a failure-mode taxonomy rather than as definitive superiority claims. For class-imbalanced lung cancer screening classification, 2D and 2.5D inputs provide a more reliable trade-off between performance, stability, and computational efficiency than full 3D representations.

2606.06958 2026-06-08 cs.CV 新提交

MVSegNet: A Lightweight Boundary-Aware Network for Fetal Lateral Ventricle Segmentation and Atrial Width Estimation in Prenatal Ultrasound

MVSegNet: 一种用于产前超声中胎儿侧脑室分割和侧脑室房部宽度估计的轻量级边界感知网络

Arafat Hossain Sayem

发表机构 * Department of Computer Science & Engineering, Stamford University Bangladesh(孟加拉国斯坦福德大学计算机科学与工程系)

AI总结 提出轻量级边界感知网络MVSegNet,结合多尺度特征提取与边界细化,在584张超声图像上实现侧脑室分割,Dice达80.79%,侧脑室房部宽度平均绝对误差3.40 mm,速度快且参数少。

Comments 11 pages, 3 figures, 4 tables. Code and trained models will be released upon acceptance. Supplementary material available upon request

详情
AI中文摘要

胎儿脑室扩张通过测量产前超声中侧脑室的房部宽度来评估。准确的分割对于这一测量至关重要,但声影、散斑噪声和低对比度使其变得困难。我们开发了MVSegNet,一种轻量级的编码器-解码器网络,结合了多尺度特征提取和边界感知细化。该模型在584张专家标注的经脑室超声图像上使用70/15/15划分进行训练和评估。使用重叠、边界和测量指标与六个分割基线进行了性能比较。MVSegNet实现了80.79%的Dice分数、68.47%的IoU、4.07 mm的豪斯多夫距离和3.40 mm的侧脑室房部宽度平均绝对误差。该模型包含231万个参数,在NVIDIA T4 GPU上以165.6帧/秒的速度运行。MVSegNet在边界和测量指标上优于所有评估的基线,同时保持较低的计算成本,支持其在自动化胎儿超声分析中的应用。

英文摘要

Fetal ventriculomegaly is assessed by measuring the atrial width of the lateral ventricle in prenatal ultrasound. Accurate segmentation is essential for this measurement, but acoustic shadowing, speckle noise, and poor contrast make it difficult. We developed MVSegNet, a lightweight encoder-decoder network combining multi-scale feature extraction and boundary-aware refinement. The model was trained and evaluated on 584 expert-annotated transventricular ultrasound frames using a 70/15/15 split. Performance was compared against six segmentation baselines using overlap, boundary, and measurement metrics. MVSegNet achieved a Dice score of 80.79%, IoU of 68.47%, Hausdorff distance of 4.07 mm, and atrial width mean absolute error of 3.40 mm. The model contains 2.31 million parameters and runs at 165.6 frames per second on an NVIDIA T4 GPU. MVSegNet outperformed all evaluated baselines on boundary and measurement metrics while maintaining low computational cost, supporting its use in automated fetal ultrasound analysis.
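For readers comparing the overlap numbers above, the following minimal sketch shows how Dice and IoU are conventionally computed for a pair of binary masks. It is a generic illustration, not the authors' evaluation code. The function name `dice_and_iou` and the convention that two empty masks score 1.0 are assumptions made here.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two binary masks given as nested lists of 0/1.

    Two empty masks are treated as a perfect match (an assumed convention).
    """
    p = [v for row in pred for v in row]
    t = [v for row in target for v in row]
    inter = sum(1 for a, b in zip(p, t) if a and b)   # pixels set in both masks
    p_sum = sum(1 for a in p if a)
    t_sum = sum(1 for b in t if b)
    union = p_sum + t_sum - inter
    dice = 2 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou

pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
dice, iou = dice_and_iou(pred, target)
print(dice, iou)
```

For a single image the two scores are tied by IoU = Dice / (2 - Dice). That identity does not carry over to dataset means: the map is convex, so a mean IoU is at least the value obtained by plugging the mean Dice into it. Plugging 80.79% in gives about 67.8%, so the reported 68.47% is consistent with per-image averaging.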

2606.06966 2026-06-08 cs.CV 新提交

From Vision to Text: A Compact Multimodal Approach for Robust, Cross-Domain Presentation Attack Detection on ID Cards

从视觉到文本:一种用于身份证件跨域鲁棒演示攻击检测的紧凑多模态方法

Qingwen Zeng, Juan E. Tapia, Sneha Das, Christoph Busch

发表机构 * da/sec-Biometrics and Security Research Group, Hochschule Darmstadt(da/sec生物特征识别与安全研究组,达姆施塔特应用科学大学) Technical University of Denmark (DTU)(丹麦技术大学(DTU))

AI总结 针对身份证件演示攻击检测中的跨域迁移问题,提出一种结合视觉与文本数据的紧凑多模态模型,发现监督微调后泛化强但零样本设置下失效,强调模型容量和真实数据的重要性。

Comments Publication under the revision process on IEEE

详情
AI中文摘要

跨域迁移对身份证件上的演示攻击检测(PAD)构成挑战,因为隐私问题导致可用数据受限。本工作提出一种紧凑的多模态模型,基于新的生成和判别模块,结合视觉和文本数据对真实和合成身份证图像进行PAD。虽然多模态模型在监督微调后表现出强大的泛化能力,但在零样本设置下失败。我们的发现强调,模型容量和真实世界数据对于可靠的PAD至关重要,而现有的合成数据集可能无法反映真实世界的挑战。我们主张重新评估合成数据作为基准,并强调需要更真实、更多样化的数据集以推动PAD研究。

英文摘要

Cross-domain shifts challenge Presentation Attack Detection (PAD) on ID Cards, given the restricted data available due to privacy concerns. This work proposes a compact multimodal model, based on new generative and discriminative blocks, which combines visual and textual data for PAD on genuine and synthetic ID images. While multimodal models exhibit strong generalisation after supervised fine-tuning, they fail in zero-shot settings. Our findings underscore that model capacity and real-world data are essential for reliable PAD, while existing synthetic datasets may not reflect real-world challenges. We argue for a re-evaluation of synthetic data as a benchmark and emphasise the need for more realistic, diverse datasets to advance PAD research.

2606.06978 2026-06-08 cs.CV 新提交

CL-CLIP: CLIP-Based Continual Learning Framework with Cost-Volume Category Decoupling for Object Detection

CL-CLIP: 基于CLIP的持续学习框架与代价体积类别解耦用于目标检测

Zihan Liu, Yuguang Yang, Shengjie Su, Jianing Pang, Linlin Yang, Chunyu Xie, Nikolai Yu. Zolotykh, Baochang Zhang

发表机构 * National College for Excellent Engineers, Beihang University(卓越工程师学院,北京航空航天大学) AI Research, Qihoo 360(360人工智能研究院,奇虎360) School of Electronic Information Engineering, Beihang University(电子信息学院,北京航空航天大学) School of Cyber Science and Technology, Beihang University(网络安全科学与技术学院,北京航空航天大学) School of Computer Science and Engineering, Beihang University(计算机科学与工程学院,北京航空航天大学) State Key Laboratory of Media Convergence and Communication, Communication University of China(媒体融合与传播国家重点实验室,中国传媒大学) Institute of Information Technology, Mathematics and Mechanics, Lobachevsky University(信息技术、数学与力学学院,罗巴切夫斯基大学) School of Artificial Intelligence, Beihang University(人工智能学院,北京航空航天大学)

AI总结 提出CL-CLIP框架,通过代价体积引导的类别解耦,增强开放词汇检测器的持续学习能力,缓解灾难性遗忘,在PASCAL VOC和MS-COCO上显著提升F-ViT基线性能。

详情
AI中文摘要

持续目标检测(COD)要求检测器随时间获取新类别的同时保留先前学习的类别。这一目标与开放词汇检测密切相关,因为两种设置都需要对当前训练阶段注释未完全覆盖的类别进行推理。最近的基于CLIP的开放词汇检测器展现出强大的零样本泛化能力,而F-ViT等框架表明视觉-语言预训练可以为未见类别提供强大的零样本检测能力。然而,实际部署不能保持纯粹的零样本:一旦这些检测器在新引入的类别上持续更新,它们会遭受严重的灾难性遗忘,并迅速失去先前校准的检测能力。因此,我们提出CL-CLIP,一种基于CLIP的COD框架,通过代价体积引导的类别解耦,为开放词汇检测器提供更好的持续学习能力。具体来说,遵循CAT-Seg,我们计算CLIP图像-文本相似度代价体积,定义为视觉令牌与类别文本嵌入之间的密集类别级响应图。这种零样本空间先验将共享区域特征分解为类别特定路径,然后由多专家RoI头处理。在PASCAL VOC和MS-COCO上的大量实验表明,CL-CLIP在持续微调下显著改善了F-ViT基线,并与现有持续目标检测器相比取得了竞争性能,特别是在适应新引入类别的同时保持竞争力的基类性能。

英文摘要

Continual Object Detection (COD) requires a detector to acquire new categories over time while preserving previously learned ones. This goal is closely related to open-vocabulary detection, since both settings require reasoning over categories that are not fully covered by the annotations available at the current training stage. Recent CLIP-based open-vocabulary detectors have shown strong zero-shot generalization, and frameworks such as F-ViT demonstrate that vision-language pretraining can provide powerful zero-shot detection ability for unseen categories. However, real-world deployments cannot remain purely zero-shot: once these detectors are continually updated on newly introduced categories, they suffer severe catastrophic forgetting and quickly lose their previously calibrated detection ability. We therefore propose CL-CLIP, a CLIP-based COD framework that equips open-vocabulary detectors with better continual learning ability through cost-volume-guided category decoupling. Specifically, following CAT-Seg, we compute a CLIP image-text similarity cost volume, defined as dense category-wise response maps between visual tokens and class text embeddings. This zero-shot spatial prior decomposes shared region features into class-specific pathways, which are then processed by a Multi-Expert RoI head. Extensive experiments on PASCAL VOC and MS-COCO show that CL-CLIP substantially improves the F-ViT baseline under continual fine-tuning and achieves competitive performance with existing continual object detectors, especially in adapting to newly introduced categories while preserving competitive base-class performance.
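The cost volume described above, dense category-wise response maps between visual tokens and class text embeddings, can be sketched as a plain cosine-similarity computation. This is an illustration under assumptions: the tensor layout (H x W x D tokens, C x D text embeddings), the use of cosine similarity, and the names `l2_normalize` and `cost_volume` are chosen here and are not taken from the CL-CLIP or CAT-Seg code. Real CLIP features would replace the toy vectors.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def cost_volume(visual_tokens, text_embeddings):
    """Cosine-similarity cost volume.

    visual_tokens: H x W x D nested lists (one D-dim feature per spatial location).
    text_embeddings: C x D nested lists (one D-dim embedding per class name).
    Returns C response maps, each H x W.
    """
    texts = [l2_normalize(t) for t in text_embeddings]
    tokens = [[l2_normalize(tok) for tok in row] for row in visual_tokens]
    return [
        [[sum(a * b for a, b in zip(tok, t)) for tok in row] for row in tokens]
        for t in texts
    ]

# Toy example: a 1 x 2 feature map and two classes in a 2-D embedding space.
tokens = [[[1.0, 0.0], [0.0, 2.0]]]
classes = [[3.0, 0.0], [0.0, 1.0]]
print(cost_volume(tokens, classes))
```

Each of the C maps highlights where one category responds in the image, which is what lets a downstream head route region features along class-specific pathways instead of sharing one representation across old and new classes.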

2606.06991 2026-06-08 cs.CV cs.AI 新提交

Don't Pause: Streaming Video-Language Synchrony for Online Video Understanding

不要暂停:面向在线视频理解的流式视频-语言同步

Zhenyu Yang, Kairui Zhang, Shengsheng Qian, Weiming Dong, Changsheng Xu

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出流式视频-语言同步(SVLS)范式,通过帧驱动转换控制器和流式令牌调节器实现视频帧与语言生成的细粒度同步,在不中断感知的情况下进行实时交互。

Abstract

Online Video Large Language Models (Video-LLMs) have advanced toward seamless human-AI interaction through frame-by-frame processing and proactive responding. However, a critical challenge remains in streaming scenarios: existing models typically pause video perception while generating responses, breaking real-time video-language synchrony and causing stutters. To address this, we introduce a novel paradigm for online video understanding: Streaming Video-Language Synchrony (SVLS), and present LyraV, a live streaming assistant built upon a hierarchical control framework with two core innovations. First, the Frame-Driven Transition Controller (FDTC), a training-free verification-based finite-state machine, makes high-level semantic decisions on when to continue speaking, start a new response, or stay silent. Second, the Streaming Token Pacer (SToP), a plug-and-play lightweight predictive module, dynamically adapts the language generation rate to match the pace of the visual content. Concretely, LyraV performs per-frame incremental, sub-budget decoding: within each frame interval it emits only a small chunk of tokens that fits the real-time budget, so perception is never blocked for a full sentence. Together, these components enable LyraV to seamlessly interleave incoming video frames with generated word tokens, achieving fine-grained synchrony. Extensive experiments conducted on five online and three offline benchmarks demonstrate that LyraV preserves the backbone's general understanding ability while substantially improving streaming synchrony and narrative fluency, delivering 98.29% synchrony with video playback and a real-time processing speed of 3.89 FPS. Interestingly, we observe an empirical capability in LyraV: dynamic reasoning over streaming tokens, enabling continuous interpretation and "thinking" alongside visual input.
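
The per-frame incremental, sub-budget decoding idea can be sketched as a simple interleaving loop. This is only an illustration under stated assumptions: the token budget is a fixed integer `tokens_per_frame` (in the abstract the rate is adapted by SToP), the response is already known, and the function name `interleave` is hypothetical rather than part of LyraV.

```python
def interleave(frames, response_tokens, tokens_per_frame):
    """Interleave frames with small token chunks so no frame waits for a full sentence."""
    timeline = []
    cursor = 0
    for frame in frames:
        timeline.append(("frame", frame))  # perception step for this frame
        # Emit at most `tokens_per_frame` tokens before the next frame arrives.
        chunk = response_tokens[cursor:cursor + tokens_per_frame]
        cursor += len(chunk)
        for token in chunk:
            timeline.append(("token", token))
    # Tokens that did not fit are returned so a caller can continue later.
    return timeline, response_tokens[cursor:]

timeline, leftover = interleave(
    frames=["f0", "f1", "f2"],
    response_tokens=["A", "car", "turns", "left", "now"],
    tokens_per_frame=2,
)
```

With a budget of two tokens per frame, the five-token response is spread over three frame intervals instead of blocking perception until the sentence is complete.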

2606.07024 2026-06-08 cs.CV New submission

GuideCAD: A Lightweight Multimodal Framework for 3D CAD Model Generation via Prefix Embedding

Minseong Kim, Jinyeong Park, Sungho Park, Jibum Kim

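Affiliations: Convergence Research Center for Insect Vectors; Incheon National University; Center for Brain-Machine Interface

AI summary: Proposes the GuideCAD framework, which uses a mapping network with few trainable parameters to convert image embeddings into prefix embeddings and combines a pretrained large language model with a Transformer decoder to generate 3D CAD models, with roughly four times fewer parameters and twice the training efficiency.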

Abstract

Multi-modal approaches used for 3D CAD generation require substantial computational resources, necessitating efficient training. To address this, we propose GuideCAD, which leverages semantically rich visual-textual representations to generate 3D CAD models with only a small number of trainable parameters. Specifically, GuideCAD uses a mapping network that converts image embeddings into prefix embeddings, enabling a pretrained large language model to integrate visual and textual information. A transformer-based decoder then predicts the construction sequence from the visual-textual embeddings in order to generate the 3D CAD model. For experimental evaluation, we construct a new dataset, referred to as GuideCAD, which consists of text-image pairs. Each pair includes a text prompt that represents a 3D CAD construction sequence and its corresponding 3D CAD image. Our experimental results show that GuideCAD generates comparably high-quality 3D CAD models while using approximately four times fewer parameters and achieving twice the training efficiency compared to fine-tuning approaches. We have released the source code and dataset for our method at: https://github.com/mskimS2/GuideCAD
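
The step of converting an image embedding into prefix embeddings can be sketched as follows. The sketch assumes the mapping network is a single linear layer whose output is split into `prefix_len` vectors; the abstract does not specify the real architecture, and the names `map_to_prefix` and `build_llm_input` as well as all numbers are hypothetical.

```python
def map_to_prefix(image_embedding, weights, prefix_len, model_dim):
    """Linearly project an image embedding, then split it into prefix vectors."""
    flat = [sum(w * x for w, x in zip(row, image_embedding)) for row in weights]
    assert len(flat) == prefix_len * model_dim
    return [flat[i * model_dim:(i + 1) * model_dim] for i in range(prefix_len)]

def build_llm_input(prefix, text_token_embeddings):
    """Prepend the prefix vectors to the text token embeddings."""
    return prefix + text_token_embeddings

image_embedding = [1.0, 2.0]
# Hypothetical weights of shape (prefix_len * model_dim) x image_dim = 4 x 2.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
prefix = map_to_prefix(image_embedding, weights, prefix_len=2, model_dim=2)
sequence = build_llm_input(prefix, [[0.5, 0.5]])
```

Only `weights` would be trained in such a setup, which is one way to read the claim that few trainable parameters are needed while the language model stays pretrained.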

2606.07032 2026-06-08 cs.CV cs.AI New submission

Never Seen Before: Benchmarking Genuine Zero-Shot Composed Image Retrieval with Consistent Video-Sourced Datasets

Zhenyu Yang, Zemin Du, Shengsheng Qian, Changsheng Xu

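Affiliations: University of Science and Technology of China

AI summary: Existing zero-shot composed image retrieval datasets suffer from unrelated reference and target images and are not truly zero-shot. This work proposes the ZeroSight benchmark, with consistent reference-target pairs sourced from videos, and SC4CIR, a training-free MLLM-driven method that identifies hard negatives through three symmetric consistency checks. Experiments show that the performance of existing methods is overestimated.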

Abstract

Zero-Shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image based on a query composed of a reference image and a relative caption without training samples. Existing ZS-CIR datasets often suffer from complete irrelevance between reference and target images due to noisy image sources, and do not achieve a true zero-shot scenario as they use public image datasets that models like CLIP have been trained on. To tackle these challenges, we introduce ZeroSight, a novel benchmark for ZS-CIR. It includes a dataset with consistent reference-target pairs sourced from videos, a data construction pipeline, and evaluation methods that consider the ranking of multiple positive and negative target images. We ensure visually and semantically consistent reference-target pairs by extracting frames from a single video and generating relative captions using LLM-assisted methods. To ensure a true zero-shot scenario, we use video data published after March 31, 2022, ensuring it was not included in CLIP's pre-training data. Additionally, we propose a training-free MLLM-driven method, SC4CIR (Symmetric Consistency for CIR), which can effectively identify hard negative targets through 3 symmetric consistency checks. This method is plug-and-play, seamlessly integrating with various CIR methods and significantly improving performance. Our experimental results from 27 methods reveal that current ZS-CIR datasets and evaluation metrics result in inflated retrieval performance, exaggerating the capabilities of CIR methods. Our benchmark and models can be accessed at https://github.com/sotayang/ZeroSight.

2606.07034 2026-06-08 cs.CV New submission

ForensicConcept: Transferable Forensic Concepts for AIGI Detection

Menyanshu Zhou, Ziyin Zhou, Ke Sun, Yunpeng Luo, Jiayi Ji, Xiaoshuai Sun, Rongrong Ji

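Affiliations: University of Science and Technology of China

AI summary: Proposes the ForensicConcept framework, which localizes decision-critical patches via Transformer attribution, builds a concept codebook, and aligns with diffusion features to extract transferable forensic concepts for AI-generated image detection and transfer them across backbones.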

Comments Accepted by ICML 2026

Abstract

AI-generated image detectors achieve high accuracy on in-distribution data but often fail on unseen generators. A key obstacle to understanding this failure is the black-box nature of current detectors: they do not reveal which evidence drives their decisions. We propose ForensicConcept, a framework that extracts explicit forensic concepts from detectors and enables their transfer across backbones. Our method localizes decision-critical patches via Transformer attribution, clusters them into a compact concept codebook, and uses a concept-aligned projection to produce auditable evidence readouts. Motivated by prior studies showing that DINO representations can guide diffusion generation and exhibit concept-level correspondence with diffusion features, we introduce a generation-trace reference based on CleanDIFT diffusion features and quantify backbone-trace alignment via neighborhood-structure consistency (CKNNA). We further propose concept codebook injection to transfer diffusion-derived concepts into target backbones. Experiments on GenImage, GAN-family, and Chameleon benchmarks show consistent improvements over prior methods. We also find that CKNNA alignment predicts transfer effectiveness, providing a principled explanation for why some backbones yield more transferable forensic evidence than others.

2606.07036 2026-06-08 cs.CV cs.AI cs.CE cs.LG New submission

STREAM: Stochastic Riemannian Flow Matching with Anisotropic Decoder for Digital Histopathology Image Generation

Won June Cho, Daeky Jeong, Hyeongyeol Lim, Hongjun Yoon

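Affiliations: DEEPNOID Inc.

AI summary: Proposes the STREAM framework, which uses patch-token features from a histopathology vision foundation model as the latent space and generates high-quality histopathology images via Riemannian flow matching, addressing conditioning collapse, and designs an anisotropic decoder to improve generation quality.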

Comments 27 pages, 7 figures

Abstract

Synthetic histopathology image generation addresses critical challenges in computational pathology, including patient privacy and the growing need for large-scale training data for foundation models. Latent diffusion models have dominated the image generation domain, with recent works emphasizing that the choice of latent space is critical to the quality of generated images. Existing state-of-the-art generative models in histopathology use pretrained Vision Foundation Models (VFMs) as conditioning signals, and we observe that this leads to "conditioning collapse," where the conditioning signal dominates the latent space and lowers the quality and diversity of generated samples. Therefore, we instead use pretrained histopathology VFMs as the latent space itself, leveraging their patch-token features that encode rich semantic information. We empirically show that these features are $\ell_2$-normalized and lie on the unit hypersphere $\mathcal{S}^{d-1}$ with strong angular dominance and intrinsic curvature, making them naturally suited for a Riemannian formulation. We therefore present STREAM, the first framework to apply Riemannian flow matching in the pathology domain. STREAM consists of two stages: 1) a bridge-type stochastic perturbation that establishes per-token rectifiability on $\mathcal{S}^{d-1}$ for training a Diffusion Transformer (DiT) in latent space, and 2) a novel anisotropic decoder that allocates robustness to low-energy directions of the velocity-field Jacobian while preserving fidelity along its high-energy directions. Together, STREAM achieves state-of-the-art reconstruction and generation performance on breast and colorectal cancer datasets. The code will be publicly released upon acceptance.
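
Because the features lie on the unit hypersphere, a flow between two of them has to follow the sphere rather than a straight line. The sketch below shows the generic geodesic (great-circle) interpolant and its tangent velocity, which is the usual regression target in Riemannian flow matching on a sphere. It does not include STREAM's bridge-type stochastic perturbation, the function names are hypothetical, and nearly antipodal endpoints are not handled.

```python
import math

def normalize(v):
    """Project a nonzero vector onto the unit sphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def geodesic_point_and_velocity(x0, x1, t):
    """Return the point at time t on the great-circle arc x0 -> x1 and its velocity."""
    cos_theta = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    theta = math.acos(cos_theta)  # geodesic distance between the endpoints
    if theta < 1e-8:  # endpoints coincide: stay put with zero velocity
        return list(x0), [0.0] * len(x0)
    s = math.sin(theta)
    a, b = math.sin((1 - t) * theta) / s, math.sin(t * theta) / s
    # Time derivatives of the two interpolation coefficients.
    da, db = -theta * math.cos((1 - t) * theta) / s, theta * math.cos(t * theta) / s
    point = [a * p + b * q for p, q in zip(x0, x1)]
    velocity = [da * p + db * q for p, q in zip(x0, x1)]
    return point, velocity

x0 = normalize([1.0, 0.0, 0.0])
x1 = normalize([0.0, 1.0, 0.0])
point, velocity = geodesic_point_and_velocity(x0, x1, 0.5)
```

The interpolated point keeps unit norm and the velocity is orthogonal to it, so the path never leaves the sphere; the velocity magnitude equals the angle between the endpoints.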

2606.07053 2026-06-08 cs.CV cs.LG New submission

TrioPose: Native Triple-Stream Diffusion Transformers for Pose-Guided Text-to-Image Generation

Dian Gu, Zhengyi Yang

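Affiliations: Institute of Automation, Chinese Academy of Sciences

AI summary: Proposes TrioPose, a native triple-stream pose-aware DiT built on the SD3.5M architecture that preserves pretrained stability through layer-wise activation and zero-initialized dual-residual injection, and designs a learnable relational bias mask and pose-guided spatial loss weighting, achieving state-of-the-art performance in multi-person pose-guided generation with an AP of 64.33 on Human-Art.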

Comments 15 pages (9 pages main body, 6 pages references and appendix), 3 figures, 5 tables

Abstract

Pose-guided text-to-image generation often suffers from limb distortions and feature crosstalk in complex multi-person scenarios. While existing UNet-based adapters struggle with long-range spatial dependencies, emerging Multimodal Diffusion Transformers (MM-DiTs) offer superior global modeling. However, naive signal concatenation in MM-DiTs severely disrupts pre-trained latent distributions. To address this, we propose TrioPose, a native pose-driven framework built upon the SD3.5M architecture. Specifically, we introduce a Triple-Stream Pose-Aware DiT (TSPA-DiT) that treats pose as an independent modality. It employs layer-wise activation and zero-initialized dual-residual injection to smoothly enforce geometric constraints while preserving pre-trained latent stability. To resolve severe multi-instance occlusions, we design a Learnable Relational Bias Mask that categorizes topological connectivity into fine-grained physical states, mapping them into continuous attention soft constraints to effectively decouple inter-instance interference. Furthermore, a Pose-Guided Spatial Loss Weighting strategy modulates the native diffusion objective using heatmap-derived error maps, focusing anatomical supervision strictly on distortion-prone regions. Extensive experiments demonstrate that TrioPose achieves state-of-the-art performance across challenging benchmarks, including Human-Art, CrowdPose, and OCHuman. Notably, it attains an AP of 64.33 on Human-Art, representing a 30% improvement over prior art, while setting new standards for visual fidelity and text-image semantic alignment in complex multi-human generation.
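
Why zero initialization preserves the pretrained behaviour can be seen in a small sketch. It assumes a single linear projection added as a residual, which simplifies the dual-residual design named in the abstract; the function names and all numbers are hypothetical.

```python
def linear(weights, vector):
    """Apply a weight matrix (list of rows) to a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def inject(hidden, pose_feature, gate_weights):
    """Add a linearly projected pose feature to the hidden state as a residual."""
    delta = linear(gate_weights, pose_feature)
    return [h + d for h, d in zip(hidden, delta)]

hidden = [0.3, -1.2]                 # stand-in for a pretrained hidden state
pose_feature = [5.0, 7.0, 9.0]       # stand-in for a pose-stream feature
zero_gate = [[0.0] * 3 for _ in range(2)]           # zero initialization
at_init = inject(hidden, pose_feature, zero_gate)
trained_gate = [[0.1, 0.0, 0.0], [0.0, 0.0, 0.2]]   # hypothetical weights after training
after_training = inject(hidden, pose_feature, trained_gate)
```

At initialization the injected branch contributes nothing, so the network starts from exactly the pretrained output and the pose signal is blended in only as the projection weights move away from zero.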

2606.07079 2026-06-08 cs.CV New submission

AsyncPatch Diffusion: spatially-flexible image generation

Samuele Papa, Valentin De Bortoli, Guillaume Couairon, Daniel Sýkora, Romuald Elie, Klaus Greff

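Affiliations: Google DeepMind; University of Amsterdam; The Netherlands Cancer Institute

AI summary: Proposes the AsyncPatch Diffusion framework, which assigns different noise levels to different spatial regions to obtain heterogeneous denoising trajectories, natively supporting inpainting and adaptive generation while maintaining generation quality.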

Comments 36 pages, 14 figures

Abstract

Standard diffusion models corrupt an entire sample with a single shared noise level, forcing all spatial regions to follow the same denoising trajectory. We introduce AsyncPatch Diffusion, a joint-diffusion framework that assigns distinct noise levels to different input dimensions, such as image pixels or latent tokens. We show how this asynchronous corruption defines a valid generative process while supporting a richer family of spatially heterogeneous denoising trajectories, and prove the first valid ELBO for this process. We show that a single pretrained model can perform spatially adaptive generation, where different regions are denoised on different schedules. A key challenge is training: naive independent noise-level sampling overemphasizes highly heterogeneous configurations and underrepresents homogeneous noise levels, which are crucial during sampling. We address this with a controlled noise-level sampler that regulates both the average corruption level and its spatial variability. AsyncPatch achieves generation quality comparable to conventional diffusion on ImageNet 256 and LSUN, while being natively suited for inpainting without task-specific fine-tuning. We further introduce input guidance, which uses clean or partially corrupted regions to guide the generation of unknown regions, improving local consistency and texture matching. Finally, we demonstrate adaptive generation strategies including uncertainty-guided acceleration and autoregressive sampling.
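
The idea of per-region noise levels with controlled mean and variability can be sketched as below. This is not the paper's sampler: it assumes a uniform offset around a shared mean and a linear blend between data and Gaussian noise, and the names `sample_noise_levels` and `corrupt` are hypothetical.

```python
import random

def sample_noise_levels(num_patches, mean_level, spread, rng):
    """Draw one noise level per patch around a shared mean; spread=0 gives a shared level."""
    levels = []
    for _ in range(num_patches):
        offset = rng.uniform(-spread, spread)
        levels.append(min(1.0, max(0.0, mean_level + offset)))  # clamp to [0, 1]
    return levels

def corrupt(patches, levels, rng):
    """Blend each patch with Gaussian noise according to its own level (linear schedule, assumed)."""
    noisy = []
    for patch, t in zip(patches, levels):
        noisy.append([(1.0 - t) * x + t * rng.gauss(0.0, 1.0) for x in patch])
    return noisy

rng = random.Random(0)
patches = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
shared = sample_noise_levels(len(patches), mean_level=0.4, spread=0.0, rng=rng)
mixed = sample_noise_levels(len(patches), mean_level=0.4, spread=0.3, rng=rng)
noisy = corrupt(patches, mixed, rng)
clean = corrupt(patches, [0.0] * len(patches), rng)
```

Setting `spread` to zero recovers the conventional single shared noise level, while a level of zero leaves a patch untouched, which is the situation of a known region during inpainting.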

2606.07090 2026-06-08 cs.CV New submission

Detecting Temporally Localized Manipulations in Authentic Video Streams

Okan Umur, Ali Emre Güşlü, Ibrahim Delibasoglu

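AI summary: Short, realistic manipulated segments inserted into otherwise authentic videos are hard to detect. This work proposes a new dataset for that setting and evaluates two methods, a linear probe on DINOv3 features and a consecutive-frame similarity method, to establish an initial benchmark.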

Abstract

The rapid advancement of video editing and generative artificial intelligence technologies has made realistic video manipulation increasingly accessible. Although existing datasets have significantly advanced research in deepfake detection, object removal, and video inpainting, they do not adequately model scenarios in which a short manipulated segment is inserted into an otherwise authentic video and the original video continues afterward. In this study, we review representative datasets from the literature, analyze their characteristics, and discuss their limitations with respect to temporally localized realistic manipulation detection. Based on this analysis, we motivate the need for a new dataset specifically designed for authentic videos containing short and highly realistic manipulated intervals. Finally, we evaluate two complementary approaches on our custom-curated test set to establish an initial benchmark for this challenging scenario. The first employs a linear probe on DINOv3 features, assessed under three thresholding strategies. The second leverages DINOv3 features with a consecutive frame similarity-based method to detect temporal manipulation boundaries. Together, these experiments provide an initial benchmark for partially manipulated video detection and highlight the need for content-adaptive thresholding mechanisms. The dataset, code, and supplementary materials are publicly available at https://github.com/OkanUmur/temporally-localized-video-manipulation-detection.
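
The second approach, detecting temporal boundaries from the similarity of consecutive frames, can be illustrated with a minimal sketch. It assumes per-frame feature vectors are already available (toy 2-D vectors stand in for DINOv3 features) and uses a fixed threshold; the function names are hypothetical and this is not the released implementation.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def find_boundaries(frame_features, threshold):
    """Return indices i where frame i differs sharply from frame i-1."""
    return [
        i for i in range(1, len(frame_features))
        if cosine(frame_features[i - 1], frame_features[i]) < threshold
    ]

# Toy features: frames 2-3 come from a different "source" than frames 0-1 and 4.
features = [[1.0, 0.1], [1.0, 0.0], [0.0, 1.0], [0.1, 1.0], [1.0, 0.05]]
boundaries = find_boundaries(features, threshold=0.8)
```

The two reported indices mark where the inserted segment starts and where the original video resumes. A single fixed threshold is the weak point of such a scheme, which is consistent with the abstract's call for content-adaptive thresholding.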

2606.07100 2026-06-08 cs.CV cs.RO New submission

LARA: Latent Action Representation Alignment for Vision-Language-Action Models

Mengya Liu, Baoxiong Jia, Jiangyong Huang, Jingze Zhang, Siyuan Huang

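Affiliations: University of Science and Technology of China

AI summary: Proposes the LARA framework, which jointly optimizes a latent action model and a vision-language-action model through representation alignment, using human video data to improve robot manipulation performance, with average gains of about 10%, 5%, and 15% on simulated and real-world benchmarks.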

Abstract

Vision-language-action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model learning with abundant unlabeled human videos, Latent Action Models (LAMs) learn latent action representations from visual dynamics to provide additional supervision for VLA learning. However, LAMs and VLAs are typically trained separately, leaving the LAM ungrounded during VLA training and the VLA constrained by frozen LAM representations. To address these issues, we propose Latent Action Representation Alignment (LARA), a plug-and-play framework that jointly optimizes LAM and VLA via representation alignment. This enables reciprocal benefits: LAMs learn with action trajectories to avoid spurious visual changes, while VLAs are regularized by forward dynamics learned within LAMs to reduce hallucinations of functionally ineffective trajectories. We demonstrate LARA's versatility and effectiveness for pre-training, post-training enhancement of pre-trained VLA models, and LAM refinement, achieving average improvements of ~10%, ~5%, and ~15%, respectively, across 3 simulation benchmarks and 1 meticulously designed real-world robotic manipulation benchmark.

2606.07115 2026-06-08 cs.CV cs.GR New submission

3DMorph: Single-Image-Guided Local 3D Shape Editing and Morphing

Tobias Preintner, Yunfei Deng, Phillip Müller, Sebastian Illing, Adrian König, Thomas Bäck, Elena Raponi, Niki van Stein

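Affiliations: ETH Zürich

AI summary: Proposes 3DMorph, a training-free framework that uses a single edited image to automatically localize the relevant local 3D region and transfer the 2D modification to it, and that also supports generating intermediate shapes, outperforming existing methods on the Delta3D benchmark.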

Comments Accepted to IJCNN 2026

Abstract

Despite recent progress in 3D generation, intuitive editing of existing shapes remains limited. Unlike images, which benefit from well-established inpainting tools, general 3D objects such as meshes still lack simple and effective methods for local shape editing. Existing approaches are often global, domain-specific, require complex user interaction, or focus on appearance (color and texture) rather than geometry. We introduce 3DMorph, a training-free framework for single-image-guided local 3D shape editing and morphing. Given an edited image showing a desired shape modification, our method automatically localizes the relevant 3D region and transfers 2D modifications to 3D while preserving unmodified areas. 3DMorph also enables intermediate shape generation between the original and edited objects, facilitating design exploration. To benchmark editing quality, we introduce Delta3D, an image-guided local 3D editing benchmark with paired ground-truth edits. Experimental results show that 3DMorph translates intuitive 2D edits into 3D, outperforming state-of-the-art generative and editing methods.

2606.07117 2026-06-08 cs.CV cs.AI New submission

Native3D: End-to-End 3D Scene Generation via Unified Mesh-Texture Modeling and Semantic Alignment

Yibo Liu, Ziwei Zhang, Haozhou Pang, Menghao Li, Lanshan He, Gan Qi

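Affiliations: Kuaishou GameMind Lab

AI summary: Proposes Native3D, the first end-to-end 3D scene generation framework that completely bypasses 2D intermediate representations, using a unified mesh-texture joint representation and a 3D representation alignment loss to address geometric distortion and texture detail degradation.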

Abstract

This paper presents Native3D, the first end-to-end 3D scene generation framework that completely bypasses 2D intermediate representations. Traditional approaches typically require adapting 3D representations to the 2D domain to leverage pre-trained diffusion models, which inevitably introduces domain adaptation issues including geometric structural distortion and texture detail degradation. To address these limitations, we design a unified mesh-texture joint representation that simultaneously models both geometric structures and texture features through a Transformer-based scene encoder, effectively maintaining spatial relationships and visual consistency among objects within scenes. We further propose the 3D Representation Alignment Loss (3D REPA Loss), which employs an improved contrastive learning mechanism to align multi-level semantic representations in the latent space, significantly enhancing geometric and textural fidelity. Experimental results demonstrate that Native3D outperforms existing methods in both generation quality and editing flexibility, providing a novel solution for 3D scene editing.

2606.07145 2026-06-08 cs.CV New submission

Consistent-Inversion: Reverse Consistency Guidance for Structure-Preserving Visual Editing

Xiaocheng Lu, Jingcai Guo, Song Guo

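Affiliations: Hong Kong University of Science and Technology; The Hong Kong Polytechnic University

AI summary: Proposes Consistent-Inversion, a training-free reverse consistency guidance framework that checks whether an intermediate target trajectory can be reversed toward the source inversion trajectory under the source prompt and uses the reverse consistency discrepancy to correct early denoising steps, improving edits while preserving structure.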

Comments Submitted to IEEE Transactions on Multimedia; 10 pages, 4 figures

Abstract

Text-guided diffusion models have become effective tools for real-image visual editing, where the edited image must follow a target instruction while preserving editing-irrelevant structure. Most training-free editors rely on inversion: a source image is mapped to a noisy latent trajectory and the terminal latent is reused for target-prompt denoising. This reuse is useful for preservation, but it also couples source reconstruction and target editing. The resulting trajectory mismatch may either damage background/layout details or over-constrain the intended edit. This paper presents Consistent-Inversion, a training-free reverse consistency guidance framework for structure-preserving visual editing. Instead of treating the inverted source latent as a fixed initialization, Consistent-Inversion checks whether an intermediate target trajectory can be reversed toward the source inversion trajectory under the source prompt. To make this check well-defined, we construct an auxiliary target-side noise representation, perform source-guided reverse denoising, and use the resulting reverse consistency discrepancy as a correction signal for selected early target denoising steps. The method does not update model parameters, is compatible with inversion-based editors, and introduces only a small inference overhead when applied sparsely. Experiments on PIE-Bench show that Consistent-Inversion improves background and structural fidelity under a unified SD3.5 protocol while maintaining target-prompt alignment, and compatibility experiments further verify the same correction principle on classical Stable-Diffusion inversion pipelines.

2606.07161 2026-06-08 cs.CV New submission

TraRA: Trajectory-level Recognition Aggregation for Video Text Spotting in Urban Surveillance

Duc Tri Tran, Trung Thanh Nguyen, Vijay John, Phi Le Nguyen, Yasutomo Kawanishi

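Affiliations: RIKEN; Hanoi University of Science and Technology; Nagoya University; Lawrence Technological University; Ritsumeikan University

AI summary: Proposes TraRA, which aggregates text recognition at the trajectory level and exploits temporal and multimodal consistency to address inconsistent frame-level recognition caused by motion blur, occlusion, and similar factors in surveillance video, improving tracking and recognition performance on multiple benchmarks.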

Comments 22nd IEEE International Conference on Advanced Visual and Signal-Based Systems

Abstract

Video Text Spotting (VTS) is essential for urban surveillance and intelligent transportation systems, enabling automated reading of street signs, vehicle markings, and scene text in video streams. However, reliable recognition remains challenging due to dynamic video factors common in surveillance scenarios, including motion blur, occlusion, and scale variation, which degrade frame-level recognition. Existing VTS methods typically perform recognition independently on each frame, leading to inconsistent and inaccurate results across sequences. To address these limitations, we propose TraRA (Trajectory-level Recognition Aggregation for VTS), a plug-and-play method that performs trajectory-level text recognition by leveraging temporal and multimodal consistency. TraRA integrates two key modules: (1) the Temporal Clustering and (2) the Vision-Language Aggregation. The former refines noisy trajectories by grouping temporally and visually coherent text instances, while the latter employs a Low-Rank Adaptation-enhanced Vision-Language model to fuse visual cues with linguistic context across frames. By aggregating information over entire text trajectories, TraRA achieves robust text recognition even under challenging surveillance conditions. Extensive experiments on four public benchmarks, including road and urban scene datasets (RoadText, BOVText, ArTVideo, and ICDAR15), demonstrate that TraRA consistently improves tracking and recognition performance over state-of-the-art VTS methods. The source code is available at https://github.com/trid2912/TraRA.
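
To show why aggregating over a whole trajectory helps compared with per-frame recognition, the sketch below uses confidence-weighted voting. This is a deliberately simple stand-in, not TraRA's Vision-Language Aggregation module (which the abstract describes as a LoRA-enhanced vision-language model); the function name and the per-frame readings are hypothetical.

```python
from collections import defaultdict

def aggregate_trajectory(frame_predictions):
    """Pick the text with the highest summed confidence across one trajectory."""
    scores = defaultdict(float)
    for text, confidence in frame_predictions:
        scores[text] += confidence
    return max(scores.items(), key=lambda item: item[1])[0]

# Hypothetical per-frame readings of one tracked sign; blur corrupts two frames.
trajectory = [("STOP", 0.9), ("ST0P", 0.4), ("STOP", 0.8), ("SIOP", 0.3)]
label = aggregate_trajectory(trajectory)
```

Frame-level output would report three different strings for the same sign, whereas the trajectory-level decision settles on one reading supported by the most confident frames.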

2606.07171 2026-06-08 cs.CV New submission

When Recovery Matters: The Blind Spot of Surrogate Privacy in MLLM Editing

Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang, Sam Kwong

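Affiliations: City University of Hong Kong; Hon Hai Research Institute; Lingnan University

AI summary: Addressing privacy risks in multimodal large model editing, proposes SPPE, the first recovery-oriented benchmark for surrogate-based privacy-preserving editing, covering 36 fine-grained privacy categories and 65 editing instructions, and designs two tasks, editability assessment and surrogate-to-source edit recovery, with a corresponding method for each.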

Abstract

Multimodal Large Language Models (MLLMs) enable flexible instruction-driven image editing, but privacy risks arise when user images expose diverse and user-specific private content. Canonical privacy protection strategies typically substitute sensitive regions with surrogate content before cloud editing. Yet the resulting output is often an edited surrogate rather than the desired edited source image, so local recovery is neglected in both design and evaluation. To this end, we introduce SPPE (Surrogate-based Privacy-Preserving Editing), the first recovery-oriented benchmark covering 36 fine-grained privacy categories and 65 editing instructions. It defines two complementary tasks: 1) editability assessment, which estimates before cloud interaction whether a surrogate can induce an edit consistent with the original image; and 2) surrogate-to-source edit recovery, which evaluates whether the edited surrogate can be transferred back to the private source with the edit effect preserved. We address each task with a dedicated method: ERMA predicts surrogate editability through instruction-aware multimodal relation modeling, while C2E-S2SER performs cycle-consistent recovery by using the surrogate editing pair as visual edit evidence and the source image as a source-preserving anchor. Experiments on SPPE and InstructPix2Pix show consistent improvements on both tasks. For editability assessment, ERMA improves over the best-performing baselines by 13.9% in SRCC and 12.3% in PLCC. For surrogate-to-source edit recovery, C2E-S2SER outperforms SOER across all 8 source integrity and edit consistency metrics on SPPE.

2606.07172 2026-06-08 cs.CV cs.AI cs.CL cs.LG 新提交

Textual Supervision Enhances Geospatial Representations in Vision-Language Models

文本监督增强视觉-语言模型中的地理空间表示

Marcelo Sartori Locatelli, Fernando Tonucci, Jea Kwon, Luiz Felipe Vecchietti, Bryan Nathanael Wijaya, Cheng Yaw Low, Virgilio Almeida, Meeyoung Cha

发表机构 * University of São Paulo(圣保罗大学) National University of Singapore(新加坡国立大学)

AI总结 研究视觉、视觉-语言及多模态模型的地理空间表示能力,发现文本监督能有效提升空间编码,推动地理空间AI发展。

Comments Accepted at ICML 2026

详情
AI中文摘要

地理空间理解是机器学习系统在图像地理定位和空间推理等任务中一个关键但尚未充分探索的维度。在这项工作中,我们分析了三种模型家族获得的地理空间表示:纯视觉架构(如ViT)、视觉-语言模型(如CLIP)和大规模多模态基础模型(如LLaVA、Qwen和Gemma)。通过评估包括人物、地标和日常物体在内的图像聚类(根据可定位程度分组),我们揭示了空间准确性的系统性差距,并表明文本监督增强了地理空间表示的学习。我们的发现表明语言作为编码空间上下文的有效补充模态,以及多模态学习作为推进地理空间AI的关键方向。

英文摘要

Geospatial understanding is a critical yet underexplored dimension in the development of machine learning systems for tasks such as image geolocation and spatial reasoning. In this work, we analyze the geospatial representations acquired by three model families: vision-only architectures (e.g., ViT), vision-language models (e.g., CLIP), and large-scale multimodal foundation models (e.g., LLaVA, Qwen, and Gemma). By evaluating across image clusters, including people, landmarks, and everyday objects, grouped based on the degree of localizability, we reveal systematic gaps in spatial accuracy and show that textual supervision enhances the learning of geospatial representations. Our findings suggest the role of language as an effective complementary modality for encoding spatial context and multimodal learning as a key direction for advancing geospatial AI.

2606.07175 2026-06-08 cs.CV 新提交

Seeing Without Exposing: Adaptive Privacy Control for Open-World, Context-Hungry MLLMs

看见而不暴露:面向开放世界、上下文饥渴型MLLM的自适应隐私控制

Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang, Sam Kwong

发表机构 * City University of Hong Kong(香港城市大学) Hon Hai Research Institute(鸿海研究院) Lingnan University(岭南大学)

AI总结 针对多模态大语言模型在开放世界中面临不可预测敏感信息泄露的隐私挑战,提出无训练方法APD,将隐私元素漂移至语义等价替代物并锚定上下文线索,结合新基准AdaptShield实现隐私保护与上下文保留的平衡提升。

详情
AI中文摘要

多模态大语言模型(MLLM)引发了新的隐私挑战。在数据方面,用户提供的输入通常包含不可预测的敏感信息;而在下游任务方面,模型推理依赖于丰富的视觉上下文,这些上下文本身可能涉及隐私敏感信息。然而,现有的隐私保护方法依赖于预定义的敏感类别和固定的混淆策略,难以应对MLLM中的此类挑战。为解决这一困境,我们提出了锚定隐私漂移(APD),一种无需训练的方法,它将隐私敏感元素漂移到语义等价的替代物,同时将上下文线索锚定到源图像。为了系统评估这种隐私保护和上下文保留的双重目标,我们引入了AdaptShield,一个涵盖22个隐私类别的综合基准,它将传统隐私度量与基于MLLM的上下文效用评估相结合。大量实验表明,我们的方法在隐私净化和内容保留方面实现了平衡改进,在四个MLLM系列(即Qwen2.5、Qwen3、InternVL3和InternVL3.5)上,文本类别的平均增益为10.4%,基于MLLM的评估平均增益为8.5%。

英文摘要

Multimodal large language models (MLLMs) have raised new privacy challenges. On the data side, user-provided inputs often include unpredictable sensitive information; while on the downstream task side, model reasoning depends on rich visual context that may itself be privacy-sensitive. Existing privacy protection methods, however, rely on predefined sensitive categories and fixed obfuscation strategies, struggling to tackle such challenges in MLLMs. To address this dilemma, we propose Anchored Privacy Drifting (APD), a training-free method that drifts privacy-sensitive elements toward semantically equivalent alternatives while anchoring contextual cues to the source image. To systematically evaluate this dual objective of privacy protection and contextual preservation, we introduce AdaptShield, a comprehensive benchmark covering 22 privacy categories, which combines conventional privacy metrics with MLLM-based assessments of contextual utility. Extensive experiments show that our method achieves balanced improvements in both privacy sanitization and content retention, with average gains of 10.4% on textual categories and 8.5% under MLLM-based evaluation across four MLLM series, i.e., Qwen2.5, Qwen3, InternVL3, and InternVL3.5.

2606.07179 2026-06-08 cs.CV cs.MM eess.IV 新提交

EvoGS: Constructing Continuous-Layered Gaussian Splatting with Evolution Tree for Scalable 3D Streaming

EvoGS:基于进化树构建连续分层高斯泼溅以实现可扩展3D流式传输

Yuang Shi, Simone Gasparini, Géraldine Morin, Wei Tsang Ooi

发表机构 * National University of Singapore(新加坡国立大学) IRIT - University of Toulouse(图卢兹大学IRIT实验室) IPAL, IRL2955(IPAL研究所)

AI总结 提出EvoGS,首个连续分层高斯泼溅表示,通过进化树结构实现父-子细化,消除冗余并支持可扩展3D流式传输,传输负载和显存占用分别降低2.4倍和5.5倍。

Comments Project page: https://yuang-ian.github.io/evogs/

详情
AI中文摘要

流式传输3D高斯泼溅需要高度可扩展的渐进式表示。现有渐进式方法依赖离散分层,为每个细节层次累积独立的泼溅集。层间的结构独立性固有地导致误差累积、严重的泼溅冗余以及不受控的质量过渡。我们提出EvoGS,首个连续分层表示。EvoGS组织为进化树,通过显式的、受小波启发的父-子细化生成更精细的细节。这使得子节点能够结构性地纠正祖先误差,产生固有稀疏且高度可压缩的层间信号。大量实验表明,EvoGS将泼溅冗余从超过65%降至低于25%。与最先进的基线相比,它分别将传输负载和GPU显存占用降低高达2.4倍和5.5倍,并实现了适用于实时自适应流式传输的平滑质量过渡。项目页面:https://yuang-ian.github.io/evogs/

英文摘要

Streaming 3D Gaussian Splatting requires highly scalable, progressive representations. Existing progressive methods rely on discrete layering, accumulating separate splat sets for each level of detail. This structural independence between layers inherently leads to error accumulation, severe splat redundancy, and uncontrolled quality transitions. We propose EvoGS, the first continuous-layering representation. Organized as an Evolution Tree, EvoGS generates finer details via an explicit, wavelet-inspired parent-child refinement. This empowers child nodes to structurally correct ancestral errors, yielding inherently sparse and highly compressible inter-layer signals. Extensive experiments show EvoGS reduces splat redundancy from over 65% to under 25%. Compared to state-of-the-art baselines, it reduces transmission payload and GPU VRAM footprint by up to 2.4× and 5.5×, respectively, and achieves smooth quality transitions optimal for real-time adaptive streaming. Project page: https://yuang-ian.github.io/evogs/

2606.07180 2026-06-08 cs.CV cs.LG 新提交

OPTIMUS-Prime: Minimal and Sufficient Concept Explanations for Deep Vision Models

OPTIMUS-Prime:深度视觉模型的最小且充分的概念解释

Arthur Hoarau, Chenrui Zhu, Vu Linh Nguyen

发表机构 * Université de Lorraine(洛林大学) CentraleSupélec Loria(中央理工-高等电力学院 Loria) CNRS(法国国家科学研究中心) Metz, France(法国梅斯) Université de technologie de Compiègne UMR CNRS 7253 Heudiasyc(贡比涅技术大学 UMR CNRS 7253 Heudiasyc) France(法国)

AI总结 提出OPTIMUS框架,基于主蕴含项理论生成视觉热图解释,满足充分性和最小性,提供形式化保证。

详情
AI中文摘要

自动化决策中日益增长的透明度需求已将可解释人工智能(XAI)推向机器学习研究的前沿。然而,在计算机视觉中,现有的解释方法通常优先考虑最终用户的可访问性,而牺牲了形式化保证,在实用性和理论严谨性之间留下了关键差距。在本文中,我们通过引入OPTIMUS(一种用于深度分类模型的基于概念的可视化解释的新框架)来弥补这一差距。OPTIMUS解释采用视觉热图的形式,不仅对最终用户保持可解释性,而且基于成熟的主蕴含项理论,提供了现有基于显著性方法所缺乏的形式化保证。具体来说,OPTIMUS解释满足两个理想性质:充分性,确保被强调的概念可证明地保证分类器的预测;以及最小性,确保这些概念的严格子集不再保留此保证。这两个性质共同产生了逻辑上紧凑且视觉上连贯的解释。我们在视觉分类基准上验证了我们的方法,证明OPTIMUS热图自然且忠实地呈现了模型预测背后的决策相关概念。

英文摘要

The growing demand for transparency in automated decision-making has propelled eXplainable Artificial Intelligence (XAI) to the forefront of machine learning research. In computer vision, however, existing explanation methods often prioritize end-user accessibility at the expense of formal guarantees, leaving a critical gap between practical utility and theoretical rigor. In this paper, we address this gap by introducing OPTIMUS, a novel framework for generating concept-based visual explanations for deep classification models. OPTIMUS explanations take the form of visual heatmaps that not only remain interpretable to end users, but are grounded in the well-established theory of prime implicants, providing formal guarantees that have been largely absent from existing saliency-based methods. Specifically, OPTIMUS explanations satisfy two desirable properties: sufficiency, ensuring that the highlighted concepts provably guarantee the classifier's prediction, and minimality, ensuring that no strict subset of those concepts retains this guarantee. Together, these properties yield explanations that are both logically tight and visually coherent. We validate our approach on a visual classification benchmark, demonstrating that OPTIMUS heatmaps naturally and faithfully surface the decision-relevant concepts underlying model predictions.
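The sufficiency and minimality properties described above can be illustrated with a small deletion-based search. This is a generic sketch, not the OPTIMUS algorithm: it assumes a hypothetical oracle `is_sufficient` that reports whether a set of concepts guarantees the prediction, and it assumes the oracle is monotone (adding concepts never breaks sufficiency). Under that assumption, one pass of deletions returns a set with no sufficient strict subset.

```python
def minimal_sufficient_subset(concepts, is_sufficient):
    """Shrink `concepts` to a subset-minimal set that still satisfies `is_sufficient`.

    Assumes monotonicity: any superset of a sufficient set is also sufficient.
    """
    if not is_sufficient(frozenset(concepts)):
        raise ValueError("the full concept set is not sufficient")
    kept = list(concepts)
    for concept in list(concepts):
        trial = [c for c in kept if c != concept]
        if is_sufficient(frozenset(trial)):
            kept = trial  # this concept is redundant given the others
    return kept


# Hypothetical toy oracle: the prediction is guaranteed whenever both
# "stripes" and "four_legs" are among the active concepts.
def toy_oracle(active):
    return {"stripes", "four_legs"} <= active


explanation = minimal_sufficient_subset(
    ["grass", "stripes", "tail", "four_legs"], toy_oracle
)
```

The search calls the oracle once per concept plus one initial check. In a formal setting the oracle would be a verifier or solver query, which is where the cost of such guarantees typically concentrates.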

2606.07185 2026-06-08 cs.CV 新提交

AdaTok: Self-Budgeting Image Tokenization with Quality-Preserving Dynamic Tokens

AdaTok: 具有质量保持动态令牌的自预算图像令牌化

Xiaocheng Lu, Yuxi Chen, Jie Zhang, Jian Liu, Jingcai Guo, Fangqi Zhu, Tao Han, Song Guo

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出AdaTok,一种自预算离散一维令牌化器,通过表示-分配协同设计(优先表示学习和自适应令牌分配)实现图像自适应令牌数量,在保持重建质量的同时减少平均令牌数。

Comments Preprint; 11 pages, 4 figures

详情
AI中文摘要

图像令牌化器,从二维网格到最近的一维序列,通常用相同固定数量的令牌编码每张图像。然而视觉复杂度高度异质,因此统一预算在简单输入上过度开销,在复杂输入上不足。现有的弹性令牌化器暴露了可变长度重建,但通常将令牌长度作为部署时的操作点、搜索目标或外部预测,而非令牌化器本身的输出。在这项工作中,我们询问离散视觉令牌化器能否一次性自我预算。我们的核心发现是,可操作的弹性需要表示-分配协同设计:前缀必须在不同预算下保持可解码,且令牌化器必须学习每个图像需要哪个前缀。我们提出AdaTok,一种自预算离散一维令牌化器。AdaTok结合了优先表示学习(通过嵌套尾部掩码对令牌排序,并通过多头LoRA解码器头解决预算依赖的语义偏移)和自适应令牌分配(在候选预算上训练轻量级确定性组GRPO策略)。动态帕累托加权在策略训练期间平衡保真度和效率,无需手动权衡扫描。在ImageNet-1K上,AdaTok-Full在256个令牌时达到rFID 1.31,而AdaTok-Adaptive平均仅使用约118个令牌达到rFID 1.50,在可比预算下优于离散一维基线。在自回归图像生成中,较短的适应性表示相比固定256令牌解码实现了约2.1倍的吞吐量,表明视觉令牌数量可以学习为内容条件输出,而非设置为固定超参数。

英文摘要

Image tokenizers, from 2D grids to recent 1D sequences, typically encode every image with the same fixed number of tokens. Yet visual complexity is highly heterogeneous, so a uniform budget overspends on simple inputs and underserves complex ones. Existing elastic tokenizers expose variable-length reconstructions, but often leave token length as a deployment-time operating point, a search target, or an external prediction rather than an output of the tokenizer itself. In this work, we ask whether a discrete visual tokenizer can budget itself in one pass. Our central finding is that actionable elasticity requires a representation--allocation co-design: prefixes must remain decodable across budgets, and the tokenizer must learn which prefix each image needs. We propose AdaTok, a self-budgeting discrete 1D tokenizer. AdaTok combines Prioritized Representation Learning, which orders tokens with nested tail masking and resolves budget-dependent semantic shift through Multi-Head LoRA decoder heads, with Adaptive Token Allocation, which trains a lightweight deterministic-group GRPO policy over candidate budgets. Dynamic Pareto Weighting balances fidelity and efficiency during policy training without manual trade-off sweeps. On ImageNet-1K, AdaTok-Full reaches rFID 1.31 at 256 tokens, while AdaTok-Adaptive attains rFID 1.50 using only ~118 tokens on average, outperforming discrete 1D baselines at comparable budgets. In autoregressive image generation, the shorter adaptive representation yields ~2.1x throughput over a fixed 256-token decode, suggesting that visual token count can be learned as a content-conditioned output rather than set as a fixed hyperparameter.

2606.07222 2026-06-08 cs.CV cs.AI 新提交

DualGate-Net: A Prior-Gated Dual-Encoder Framework for Histopathology Cell Detection

DualGate-Net: 用于组织病理学细胞检测的先验门控双编码器框架

Bahman Jafari Tabaghsar, Son Tran, K. Devaraja, Atul Sajjanhar

发表机构 * School of Information Technology, Deakin University(迪肯大学信息技术学院) Kasturba Medical College, Manipal Academy of Higher Education(马尼帕尔高等教育学院卡斯图巴医学院)

AI总结 提出DualGate-Net,通过可学习的先验门控融合机制自适应调节组织先验影响,结合局部和全局编码器及辅助分支,在OCELOT基准上实现稳健的细胞检测。

Comments 15 pages, 4 figures

详情
AI中文摘要

组织病理学图像中的细胞检测强烈依赖于周围组织背景,其中视觉上相似的细胞在不同微环境下可能属于不同类别。最近的组织感知方法结合了上下文先验,但通常依赖于可能传播噪声信息的静态融合策略。在这项工作中,我们提出了DualGate-Net,一种先验感知的双编码器框架,通过可学习的先验门控融合机制结合了基于ConvNeXtV2的局部编码器和基于SegFormer的全局编码器。所提出的模块自适应地调节组织先验在空间位置上的影响,同时一个辅助的前景重建分支在训练过程中保留高频细胞结构。此外,还引入了辅助的细胞性引导线索以进一步提高定位鲁棒性。在OCELOT基准上的实验表明,该方法在验证集上取得了0.7722的宏F1分数,在测试集上取得了0.7345的宏F1分数,突显了自适应先验整合对于稳健的组织病理学细胞检测的有效性。

英文摘要

Cell detection in histopathology images strongly depends on surrounding tissue context, where visually similar cells may belong to different classes under different microenvironments. Recent tissue-aware methods incorporate contextual priors, but often rely on static fusion strategies that may propagate noisy information. In this work, we propose DualGate-Net, a prior-aware dual-encoder framework that combines a ConvNeXtV2-based local encoder and a SegFormer-based global encoder through a learnable prior-gated fusion mechanism. The proposed module adaptively regulates the influence of tissue priors across spatial locations, while an auxiliary foreground reconstruction branch preserves high-frequency cellular structures during training. In addition, auxiliary cellness-guided cues are incorporated to further improve localization robustness. Experiments on the OCELOT benchmark demonstrate consistent improvements, achieving macro F1-scores of 0.7722 on the validation set and 0.7345 on the test set, highlighting the effectiveness of adaptive prior integration for robust histopathology cell detection.
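The macro F1-scores quoted above average per-class F1 with equal weight for every class. A minimal sketch of that computation from per-class detection counts follows; the class names and counts are made up for illustration, and the rule for matching detections to ground-truth cells (which determines the counts) is outside this sketch.

```python
def f1_from_counts(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when there are no counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(per_class_counts):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    scores = [f1_from_counts(*counts) for counts in per_class_counts.values()]
    return sum(scores) / len(scores)


# Hypothetical (TP, FP, FN) counts for two cell classes.
counts = {"tumor": (80, 20, 20), "background": (30, 10, 30)}
score = macro_f1(counts)  # per-class F1 values are 0.8 and 0.6
```

Because each class contributes equally, a model that does well on the dominant class but poorly on the minority class is penalized more than it would be under a micro-averaged score.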

2606.07233 2026-06-08 cs.CV cs.LG cs.RO 新提交

Does Appearance Help? A Systematic Study of Image-Based Re-Identification in Online 3D Multi-Pedestrian Tracking

外观有帮助吗?在线3D多行人追踪中基于图像的重识别系统研究

Eduardo Borges, Luís Garrote, Urbano J. Nunes

发表机构 * Institute of Systems and Robotics, Department of Electrical and Computer Engineering, University of Coimbra(系统与机器人研究所,电气与计算机工程系,科英布拉大学)

AI总结 系统研究轻量级投影框架下图像重识别在在线3D多目标追踪中的作用,提出级联匹配策略以在低延迟下恢复遮挡轨迹并防止身份切换。

Comments Accepted for publication at the 35th IEEE International Conference on Robot and Human Interactive Communication (RO-MAN 2026)

详情
AI中文摘要

基于LiDAR的3D多目标追踪通常仅依赖几何信息,这在长时间遮挡或拥挤人群环境中往往不足以区分目标。虽然集成基于RGB的重识别提供了保持身份上下文的理论解决方案,但现有方法通常依赖计算昂贵的并行检测器,阻碍了机器人的实时响应。本文通过利用轻量级投影框架解耦移动机器人的几何和外观建模,对在线3D多目标追踪中的基于图像的重识别进行了系统研究。对特征提取架构进行了全面分析,采用轻量级CNN和视觉Transformer,并评估了多种多模态数据关联策略以平衡计算延迟和鲁棒追踪。在KITTI数据集的行人类别上的实验表明,外观和运动成本的朴素线性融合由于视觉噪声而降低了性能。相反,级联匹配策略成功恢复了被遮挡的轨迹而不损害整体精度,有效防止了身份切换以维持人机交互的连续性。我们表明,轻量级架构可以在安全导航所需的低延迟和社交意识所需的判别能力之间提供最优权衡。

英文摘要

LiDAR-based 3D Multi-Object Tracking (MOT) typically relies solely on geometric information, which is often insufficient to distinguish between targets during prolonged occlusions or in crowded human-populated environments. While integrating RGB-based Re-Identification (ReID) offers a theoretical solution for preserving identity context, existing approaches often rely on computationally expensive parallel detectors that hinder real-time robot responsiveness. This work presents a systematic study of image-based ReID in online 3D MOT, utilizing a lightweight projection-based framework to decouple geometric and appearance modeling for mobile robots. A comprehensive analysis of feature extraction architectures is conducted, employing lightweight CNNs and Vision Transformers, and evaluating various multi-modal data association strategies to balance computational latency with robust tracking. Experiments on the Pedestrian class of the KITTI dataset reveal that naive linear fusion of appearance and motion costs degrades performance due to visual noise. Conversely, a cascaded matching strategy successfully recovers occluded tracks without compromising overall precision, effectively preventing identity switches to maintain human-robot interaction continuity. We show that lightweight architectures can offer an optimal trade-off between the low latency required for safe navigation and the discriminative power needed for social awareness.
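To make the contrast between linear fusion and cascaded matching concrete, here is a small sketch of a two-stage association. It is an illustration under stated assumptions, not the paper's implementation: the stage order (motion first, appearance second), the gate values, and the greedy assignment (a stand-in for an optimal assignment solver) are all hypothetical choices.

```python
def greedy_match(cost, rows, cols, max_cost):
    """Pair rows with cols in order of increasing cost, skipping pairs above max_cost."""
    candidates = sorted(
        (cost[r][c], r, c) for r in rows for c in cols if cost[r][c] <= max_cost
    )
    used_rows, used_cols, matches = set(), set(), []
    for _, r, c in candidates:
        if r not in used_rows and c not in used_cols:
            matches.append((r, c))
            used_rows.add(r)
            used_cols.add(c)
    return matches


def cascaded_match(motion_cost, appearance_cost, motion_gate, appearance_gate):
    """Stage 1 associates by motion cost; stage 2 retries leftovers by appearance cost."""
    tracks = list(range(len(motion_cost)))
    detections = list(range(len(motion_cost[0]))) if motion_cost else []
    stage1 = greedy_match(motion_cost, tracks, detections, motion_gate)
    matched_tracks = {t for t, _ in stage1}
    matched_dets = {d for _, d in stage1}
    left_tracks = [t for t in tracks if t not in matched_tracks]
    left_dets = [d for d in detections if d not in matched_dets]
    stage2 = greedy_match(appearance_cost, left_tracks, left_dets, appearance_gate)
    return stage1 + stage2


# Rows are tracks, columns are detections. Track 1 was occluded, so its motion
# prediction is poor, but its appearance still matches detection 1.
motion = [[0.1, 0.9], [0.8, 0.95]]
appearance = [[0.5, 0.5], [0.9, 0.2]]
pairs = cascaded_match(motion, appearance, motion_gate=0.3, appearance_gate=0.4)
```

In this arrangement appearance is consulted only for tracks that geometry could not resolve, so noisy appearance features cannot disturb matches that motion already explains well. A single weighted sum of the two costs offers no such protection.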

2606.07249 2026-06-08 cs.CV 新提交

Reconstructing Multi-Decadal Forest Disturbances: A Spatio-Temporal Transformer Approach

重建多年代森林干扰:一种时空Transformer方法

Linus Scheibenreif, Anton Raichuk, Maxim Neumann

发表机构 * Google DeepMind(谷歌DeepMind)

AI总结 提出时空Transformer框架,同时建模时间轨迹和空间邻域,利用Landsat、Sentinel-1/2数据重建美国1984-2022年森林干扰图,在手动标注验证集上达到高精度并减少空间伪影。

详情
AI中文摘要

准确监测森林干扰对于理解碳动态和土地管理至关重要,但传统方法通常依赖卫星时间序列的逐像素分析,忽略了空间上下文。我们提出了一种深度学习框架,通过同时建模时间轨迹和空间邻域,绘制了美国本土38年(1984-2022)的森林干扰图。通过利用视觉Transformer架构,我们的方法有效过滤了弱监督信号中的噪声,生成了空间连贯的干扰图。我们在多个卫星(Landsat、Sentinel-1、Sentinel-2)和时间窗口(38年及最近6年)上进行了详尽评估,并使用新的人工标注验证数据集(n=300)和独立火周界数据集(n=706)验证了性能。结果凸显了任务的复杂性:我们的时空模型表现出高精度(在MTBS上±1年检测精度高达98.2%,在CONUS验证数据集上高达71.3%,F1分数分别高达75.8%和47.3%),并有效减少了空间伪影,但与逐像素基线相比,在不同干扰类型上存在性能权衡。我们的方法为一致的森林监测提供了有前景的基础。

英文摘要

Accurate monitoring of forest disturbances is essential for understanding carbon dynamics and land management, yet traditional approaches typically rely on pixel-wise analysis of satellite time-series, ignoring spatial context. We present a deep learning framework that maps 38 years (1984-2022) of forest disturbance across the contiguous United States by modeling temporal trajectories and spatial neighborhoods simultaneously. By leveraging a vision transformer architecture, our approach effectively filters noise from weak supervision signals to produce spatially coherent disturbance maps. We perform exhaustive evaluations across multiple satellites (Landsat, Sentinel-1, Sentinel-2) and temporal windows (38 years and the more recent 6 years), validating performance against a novel, manually annotated validation dataset (n=300) and an independent fire perimeter dataset (n=706). The results highlight the complexity of the task: while our spatio-temporal model demonstrates high precision (up to 98.2% for ±1 year detection on MTBS and up to 71.3% on the CONUS validation dataset, with F1-scores up to 75.8% and 47.3%, respectively) and effectively reduces spatial artifacts, it exhibits performance trade-offs across different disturbance regimes compared to pixel-wise baselines. Our method offers a promising foundation for consistent forest monitoring.
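The "±1 year detection" figures above imply that a predicted disturbance year counts as correct when it falls within one year of the reference year. The sketch below shows one plausible way to score that. The counting rules (a prediction outside the tolerance counts as both a false positive and a false negative) are an assumption for illustration and may differ from the paper's protocol.

```python
def tolerant_scores(predicted, reference, tolerance=1):
    """Precision, recall, and F1 for disturbance years matched within `tolerance` years.

    `predicted` and `reference` map a sample id to a year, or to None when no
    disturbance is predicted or recorded for that sample.
    """
    tp = fp = fn = 0
    for key in predicted.keys() | reference.keys():
        p = predicted.get(key)
        r = reference.get(key)
        if p is not None and r is not None and abs(p - r) <= tolerance:
            tp += 1
        else:
            if p is not None:
                fp += 1  # predicted a disturbance with no matching reference year
            if r is not None:
                fn += 1  # reference disturbance missed or dated too far off
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# Hypothetical samples: "a" is within one year, "b" is five years off,
# "c" is a missed disturbance, "d" is a false alarm.
predicted = {"a": 2001, "b": 2010, "c": None, "d": 1995}
reference = {"a": 2002, "b": 2005, "c": 1990, "d": None}
precision, recall, f1 = tolerant_scores(predicted, reference)
```

A gap between precision and F1, like the one in the abstract's reported numbers, arises under this kind of scoring when most predicted disturbances are correct but many reference disturbances go undetected.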

2606.07280 2026-06-08 cs.CV 新提交

Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation

几何感知超图推理用于点云分割中的新类别发现

Zihao Zhang, Aming Wu, Yang Li, Yahong Han, Jialie Shen

发表机构 * School of Artificial Intelligence, College of Intelligence and Computing, Tianjin University(天津大学智能与计算学部人工智能学院) School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机科学与信息工程学院) Department of Computer Science, City St George’s, University of London(伦敦大学城市圣乔治学院计算机科学系)

AI总结 提出超图框架建模高阶关联,结合几何感知原型,实现点云分割中从已知到新类别的协同推理,提升分割精度。

Comments Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

详情
AI中文摘要

点云分割中的新类别发现旨在从已知类别转移知识,自动识别和分割点云中未标注的新类别。现有方法主要依赖成对关联进行类别分配和新类别推理,这限制了其捕捉已知和新类别间复杂关系的能力,可能导致语义分割不准确。为解决此问题,我们引入基于超图的框架,建模类别间的高阶关联,并实现从已知类别到新类别的协同推理,超越传统的成对关系。此外,现有方法倾向于关注语义特征提取,而对点云中的几何信息关注不足。为了更好地利用空间结构,我们提出几何感知原型以增强类别级几何线索的表示。通过超边传播几何信息,所提方法改进了对类别间空间分布的理解,从而实现更准确的分割。在SemanticKITTI和SemanticPOSS数据集上的实验证明了我们方法的有效性和优越性。

英文摘要

Novel class discovery in point cloud segmentation aims to transfer knowledge from known classes to automatically identify and segment unlabeled novel classes in point clouds. Existing methods mainly rely on pairwise associations for class assignment and novel class reasoning, which limits their ability to capture complex relationships among known and novel classes and may lead to inaccurate semantic segmentation. To address this issue, we introduce a hypergraph-based framework that models high-order associations among classes and enables collaborative reasoning from known classes to novel classes beyond traditional pairwise relations. Moreover, existing methods tend to focus on semantic feature extraction while paying insufficient attention to geometric information in point clouds. To better exploit spatial structure, we propose Geometric-Aware Prototypes to enhance the representation of class-level geometric cues. By propagating geometric information through hyperedges, the proposed method improves the understanding of spatial distributions across classes and leads to more accurate segmentation. Experiments on the SemanticKITTI and SemanticPOSS datasets demonstrate the effectiveness and superiority of our method.

2606.07288 2026-06-08 cs.CV cs.GR 新提交

ExMesh: EXplicit Mesh Reconstruction with Topology Adaptation

ExMesh: 具有拓扑自适应的显式网格重建

Chuanjin Fan, Lifan Wu, Wenjie Chang, Hanzhi Chang, Wenfei Yang, Tianzhu Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学) National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory(深空探测全国重点实验室,深空探测实验室)

AI总结 提出ExMesh框架,通过可微优化与离散拓扑更新直接优化显式网格,引入自适应顶点分裂合并和实时UV维护,实现从粗到细的优化,兼顾精度、效率和网格简洁性。

Comments Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)

详情
AI中文摘要

从多视图图像重建表面网格近年来一直是核心挑战。大多数现有方法,无论是隐式还是显式,都依赖于中间表示和后处理步骤(如Marching Cubes或TSDF融合),常常导致伪影和碎片化几何。直接优化显式网格是一种有前景的方法,但它面临两个关键挑战:一是如何自适应细化网格拓扑以捕捉细节而不引入退化面;二是在网格结构演变时如何保持一致的UV坐标以实现高保真纹理映射。为克服这些,我们提出ExMesh,一种新颖的框架,通过将可微优化与离散拓扑更新相结合,直接优化显式网格。具体而言,我们引入自适应顶点分裂合并策略以及实时UV维护,实现从粗到细的优化,同时保持几何完整性。据我们所知,ExMesh是第一个将离散拓扑操作无缝集成到连续可微优化流程中的框架。大量实验表明,ExMesh在精度、计算效率和网格简洁性之间取得了平衡。

英文摘要

Reconstructing surface meshes from multi-view images has remained a core challenge in recent years. Most existing methods, whether implicit or explicit, depend on intermediate representations and post-processing steps like Marching Cubes or TSDF fusion, often resulting in artifacts and fragmented geometry. Directly optimizing explicit meshes is a promising approach. However, it presents two critical challenges. The first is how to adaptively refine mesh topology to capture detail without introducing degenerate faces. The second is how to maintain consistent UV coordinates for high-fidelity texturing as the mesh structure evolves. To overcome these, we propose ExMesh, a novel framework that directly optimizes explicit meshes by integrating differentiable optimization with discrete topology updates. Specifically, we introduce an adaptive vertex splitting and merging strategy, along with real-time UV maintenance, to enable coarse-to-fine optimization while preserving geometric integrity. To our knowledge, ExMesh is the first framework to seamlessly integrate discrete topology operations into a continuous differentiable optimization pipeline. Extensive experiments demonstrate that ExMesh achieves a balance among accuracy, computational efficiency, and mesh conciseness.

2606.07311 2026-06-08 cs.CV cs.AI 新提交

CULTURESCORE: Evaluating Cultural Faithfulness in Video Generation Models

CULTURESCORE: 评估视频生成模型的文化忠实度

Anku Rani, Wei Dai, Shravan Nayak, Pattie Maes, Mahdi M. Kalayeh, Paul Pu Liang

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Mila – Quebec AI Institute(魁北克人工智能研究所) Netflix(网飞)

AI总结 提出CultureScore框架,从身份、背景和行为三个维度评估视频生成的文化忠实度,实验发现当前最佳模型得分仅56.8%,行为维度最困难。

详情
AI中文摘要

随着Veo 3.1和LTX-2等视频生成模型的进步,它们准确表现多元全球文化的能力仍是一个关键但研究不足的前沿。当前的指标如VideoScore仅衡量视觉质量,无法评估文化忠实度。因此,一个将合十礼替换为握手的模型与正确生成该手势的模型获得相同的分数。我们提出CultureScore,一个将文化忠实度分解为三个细粒度维度的组合评估框架:身份(谁被代表)、背景(文化本地化背景)和行为(规范性手势和互动)。我们通过一个覆盖10个国家的评估套件来实施该框架,在三个最先进的模型上生成了6,180个视频。我们的评估显示,当前没有模型能够实现文化忠实的视频生成:表现最好的模型整体CultureScore仅为56.8%,其中行为是最具挑战性的维度,所有模型在该维度上均低于52%。此外,人类偏好排序与CultureScore方向一致,但与VideoScore相反;在视觉质量上得分最高的模型被标注者排在最后,这强调了文化忠实度是公平视频生成的基本标准。

英文摘要

As video generation models like Veo 3.1 and LTX-2 advance, their ability to accurately represent diverse global cultures remains a critical yet understudied frontier. Current metrics, such as VideoScore, only measure visual quality but offer no mechanism for assessing cultural faithfulness. Consequently, a model that replaces a Namaste with a handshake receives the same score as one that generates the gesture correctly. We propose CultureScore, a compositional evaluation framework that decomposes cultural faithfulness into three granular dimensions: Identity (who is represented), Context (culturally localized background), and Behavior (normative gestures and interactions). We operationalize this framework through an evaluation suite spanning 10 countries, yielding 6,180 generated videos across three state-of-the-art models. Our evaluation reveals that no current model achieves culturally faithful video generation: the best-performing model reaches only 56.8% overall CultureScore, with Behavior the most challenging dimension, which remains below 52% across all models. Furthermore, human preference rankings align directionally with CultureScore but are inverted relative to VideoScore; the highest-scoring model on visual quality was ranked last by annotators, underscoring that cultural faithfulness is an essential criterion for equitable video generation.

2606.07326 2026-06-08 cs.CV 新提交

AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

AnchorWorld: 基于视图演化定制的具身自我中心世界模拟

Yu Li, Menghan Xia, Gongye Liu, Xintao Wang, Conglang Zhang, Lei Ke, Yuxuan Lin, Ruihang Chu, Pengfei Wan, Kun Gai, Yujiu Yang

发表机构 * Tsinghua University(清华大学) HUST(华中科技大学) Kling Team, Kuaishou Technology(快手科技 Kling 团队) HKUST(香港科技大学) WHU(武汉大学)

AI总结 提出AnchorWorld框架,利用3D人体运动和外源视角辅助训练增强交互完整性,并通过锚点视图和文本描述实现自我演化世界的灵活定制,显著优于现有方法。

详情
AI中文摘要

尽管交互式世界建模是一个关键前沿,但在实际场景所需的多样化可控性方面仍未被充分探索。为弥补这一差距,我们提出AnchorWorld,一个通过增强交互完整性和灵活的世界定制机制来推进自我中心模拟的框架。首先,我们利用3D人体运动作为主要交互模态。为了补充自我中心视角中不可见或被截断的身体部位,我们引入了一种辅助训练监督,该监督包含了与智能体第一人称感知解耦的外源视角。这使得模型能够观察智能体相对于环境的全身定位,从而促进人-世界交互更稳健的空间基础。此外,我们提出了一种简单而有效的机制来定制自我演化的世界。这是通过在统一的世界坐标系内定义锚点视图,并结合描述局部场景动态演化的文本描述来实现的。实验结果表明,AnchorWorld显著优于最先进的基线方法,而消融研究验证了我们关键设计的有效性。值得注意的是,我们的定制方案展现出有希望的时空几何一致性,并严格遵守规定的演化动力学。

英文摘要

Despite being a pivotal frontier, interactive world modeling remains underexplored in terms of the versatile controllability required by practical scenarios. To bridge this gap, we present AnchorWorld, a framework that advances egocentric simulation through enhanced interaction integrity and a flexible mechanism for world customization. First, we utilize 3D human motion as the primary interaction modality. To complement the out-of-view or truncated body parts in egocentric views, we introduce an auxiliary training supervision that incorporates exogenous viewpoints decoupled from the agent's first-person sensorium. It allows the model to observe the agent's full-body positioning relative to the environment, facilitating a more robust spatial grounding of human-world interactions. Furthermore, we propose a simple yet effective mechanism for customizing self-evolving worlds. This is achieved by defining anchor views within a unified world coordinate system, coupled with textual descriptions dictating the dynamic evolution of local scenes. Experimental results show that AnchorWorld significantly outperforms state-of-the-art baselines, while ablation studies validate the effectiveness of our key designs. Notably, our customization scheme exhibits promising spatio-temporal geometric consistency and adheres strictly to the prescribed evolutionary dynamics.

2606.07333 2026-06-08 cs.CV 新提交

Varifold Moment Invariants for Sustainable and Explainable Contour Feature Extraction

Varifold矩不变量:可持续且可解释的轮廓特征提取

G. Longari, J. -C. Alvarez Paiva, A. B. Tumpach

发表机构 * Computer Vision Lab, Technische Universität Wien, Karlsplatz 13, 1040 Vienna, Austria Wolfgang Pauli Institut, Oskar-Morgensternplatz 1, 1090 Vienna, Austria U.M.R. CNRS 8524, U.F.R. de Mathématiques, 59655 Villeneuve d'Ascq Cédex, France Laboratoire Painlevé, Lille University, 59650 Villeneuve d’Ascq, France

AI总结 提出Varifold矩不变量(VMI)统一框架,结合区域、边界和切线几何生成高判别力几何特征,配合轻量分类器在降低计算成本的同时超越现有轮廓方法。

Comments 29 pages, 12 figures

详情
AI中文摘要

我们引入Varifold矩不变量(VMI)作为许多先前提出的矩不变量的统一框架。这些不变量与其他在平移和旋转下不变的轮廓特征(如扩展高斯图像、椭圆傅里叶描述符或形状分布)密切相关。Varifold矩方法的优势在于能够结合区域的几何、其边界以及与之相切的直线族,从而创建大量具有高判别力和清晰几何意义的不变特征。通过将我们的VMI特征提取与轻量特征分类器随机森林或多层感知器相结合,我们在基于轮廓的方法中超越了现有技术水平,同时大幅降低了计算成本,使我们的算法能够在轻量设备上运行。我们在大量广泛使用的不同类型数据集(叶子、物体、细胞)上测试了我们的分类任务,并以少量几何可解释的特征实现了高精度。

英文摘要

We introduce Varifold Moment Invariants (VMI) as a unifying framework for many previously introduced moment invariants. These invariants are closely related to other contour features that are invariant under translations and rotations, such as the Extended Gaussian Image, Elliptic Fourier Descriptors, or Shape Distributions. The advantage of the varifold approach to moments is that it combines the geometry of the region, its boundary, and the family of lines tangent to it, creating a substantial number of invariant features with high discriminating power and clear geometric meaning. By coupling our VMI feature extraction with lightweight classifiers (Random Forest or Multi-Layer Perceptron), we outperform state-of-the-art contour-based approaches while drastically decreasing the computational cost, to the point that our algorithm can run on lightweight devices. We tested our approach on classification tasks across a large number of widely used datasets of various types (leaves, objects, cells) and achieved high accuracy with a small number of geometrically interpretable features.

2606.07338 2026-06-08 cs.CV 新提交

VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning

VeriDrive: 可验证的反事实监督用于成本高效的视觉-语言规划

Zikai Zhang, Hubert P. H. Shum, Toby P. Breckon

发表机构 * Department of Computer Science, Durham University(杜伦大学计算机科学系)

AI总结 提出VeriDrive框架,通过结构化感知-评估-修正链生成可验证的反事实监督,降低视觉-语言驾驶规划的数据构建成本,并在nuScenes数据集上验证其有效性。

详情
AI中文摘要

视觉-语言驾驶模型越来越多地使用推理监督来连接感知、预测和规划,但现有的驾驶理由通常是自由形式的,且使用前沿模型生成成本高昂。我们提出了VeriDrive,一个构建面向规划的、可验证的反事实监督框架。VeriDrive将驾驶推理转化为结构化的感知-评估-修正链,该链将关键对象锚定于未来运动,使用可规则检查的证据评估替代自我轨迹,将风险意图修正为专家行为,并生成最终规划目标。为了扩展数据构建,VeriDrive结合了本地生成与验证器引导的选择性修正,仅升级无效或困难的样本。我们在nuScenes上构建了VeriDrive数据集,并在Omni-Q协议下进行训练。受控的开环实验表明,VeriDrive在L2、碰撞和交叉指标上优于OmniDrive,同时减少了记录的令牌使用量、生成时间和实际支付的LLM/VLM成本。这些结果表明,可审计的中间字段和结构化修正目标可以在现实注释预算下改进视觉-语言规划监督。代码、提示和验证器脚本即将发布,并将在审稿过程后公开。

英文摘要

Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present VeriDrive, a framework for constructing planning-oriented, verifiable counterfactual supervision. VeriDrive converts driving reasoning into a structured Perception-Evaluation-Revision chain that grounds key objects in future motion, evaluates alternative ego trajectories with rule-checkable evidence, revises risky intent toward expert behavior, and produces final planning targets. To scale data construction, VeriDrive combines local generation with validator-guided selective correction, escalating only invalid or difficult samples. We build the VeriDrive dataset on nuScenes and train under the Omni-Q protocol. Controlled open-loop experiments show that VeriDrive improves L2, Collision, and Intersection over OmniDrive while reducing logged token usage, generation time, and actual paid LLM/VLM cost. These results show that auditable intermediate fields and structured revision targets can improve vision-language planning supervision under realistic annotation budgets. Code, prompts, and validator scripts are coming soon and will be released after the review process.

2606.07355 2026-06-08 cs.CV 新提交

Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition

面向微手势在线识别的时空解耦适配器

Xucheng Shen, Kun Li, Fei Wang, Wei Qian, Jin Jiang, Dan Guo

发表机构 * Hefei University of Technology(合肥工业大学) United Arab Emirates University(阿联酋大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(人工智能研究院,合肥国家综合科学中心) Anhui Evolution Technology Co., Ltd.(安徽进化科技有限公司)

AI总结 提出时空解耦适配器,通过轻量级深度可分离卷积将视频适配分解为独立的时间和空间分支,并引入自适应软平衡增强缓解长尾分布问题,在EI-MiGA挑战赛Track 2中取得第一名。

Comments Technical Report. 1st Place in Micro-gesture Online Recognition in 4th MiGA at IJCAI 2026

详情
AI中文摘要

微手势在线识别旨在对未修剪视频中的细微手势进行时间定位和分类。由于微手势持续时间极短、运动幅度低且视觉线索模糊,捕获判别性的时空表示仍然极具挑战性。现有的参数高效适配器通常采用单分支联合建模时空线索,这可能无法捕获微手势的细粒度模式。为解决这一局限,我们提出了一种时空解耦适配器,通过轻量级深度可分离卷积将视频适配分解为独立的时间和空间分支。此外,为解决基准数据集中的长尾分布问题,我们引入了自适应软平衡增强,该方法根据类别稀有性和学习难度动态分配增强强度,无需手动设置阈值。我们的方法取得了0.43808的F1分数,在第四届EI-MiGA-IJCAI挑战赛的Track 2中排名第一。

英文摘要

Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos. Owing to their extremely short duration, low motion amplitude, and ambiguous visual cues, capturing discriminative spatiotemporal representations remains highly challenging. Existing parameter-efficient adapters typically employ a single branch to model spatial and temporal cues jointly, which may fail to capture the fine-grained patterns of micro-gestures. To address this limitation, we propose a Spatial-Temporal Decoupled Adapter that decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions. In addition, to address the long-tail distribution problem in the benchmark dataset, we introduce Adaptive Soft Balanced Augmentation, which dynamically allocates augmentation intensity based on class rarity and learning difficulty, without manual thresholds. Our method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.

2606.07366 2026-06-08 cs.CV cs.LG cs.RO New submission

Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos

Dash2Sim: 来自野外行车记录仪视频的闭环驾驶仿真

Anurag Ghosh, Francesco Pittaluga, Khiem Vuong, Angela Chen, Juan Alvarez-Padilla, Manmohan Chandraker, Srinivasa Narasimhan

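Affiliations: Carnegie Mellon University; NEC Labs America; MIT; UC San Diego

AI summary: Proposes Dash2Sim, a framework that turns monocular dashcam videos into metric, geo-referenced 4D driving logs for closed-loop simulation, builds the ROADWork4D benchmark dataset, and shows that work zone scenarios are challenging for planners.

Details
AI Chinese abstract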

自动驾驶仿真通常依赖于在少数城市收集的数据或手工编写的合成场景。行车记录仪视频覆盖了更广泛的位置和情况,包括罕见或长尾场景。由于难以从单目野外视频中恢复准确的4D场景,它们被认为不太适用于仿真。施工区是行车记录仪捕捉到的一类长尾情况。我们提出Dash2Sim,一个将野外单目行车记录仪视频转化为度量级、地理参考的4D驾驶日志并与现有仿真器兼容的框架,并针对独立维护的地图验证每个日志,无需标注。我们将Dash2Sim应用于大型视频语料库,创建了ROADWork4D基准数据集,涵盖17个城市的4,244个场景和270万个3D对象。在验证子集ROADWork4D-CL(2,201个场景)上,我们研究了特权闭环规划器,发现施工区场景具有挑战性:尽管基于规则和混合规划器的泛化能力优于基于学习的规划器,但所有规划器均表现不足,无法完成临时施工区通道所需的变道。在规划之外,Dash2Sim恢复的密集深度在新视角合成质量上提高了高达19%(基于感知指标),表明其具有为单目视频的闭环传感器仿真提供丰富条件的潜力。

English abstract

Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. They are considered less usable for simulation because it is difficult to recover accurate 4D scenes from monocular in-the-wild videos. Work zones are one such class of long-tailed situations that dashcams capture. We present Dash2Sim, a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies each one against an independently maintained map without annotations. We apply Dash2Sim to a large video corpus to create the ROADWork4D benchmark dataset, which spans 4,244 scenes with 2.7M 3D objects across 17 cities. On a verified subset ROADWork4D-CL (2,201 scenes), we study privileged closed-loop planners and find that work zone scenarios are difficult: while rule-based and hybrid planners generalize better than learning-based ones, all fall short, failing to make the lane changes that temporary work zone channels require. Beyond planning, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics, suggesting its potential to provide rich conditioning for closed-loop sensor simulation from monocular videos.

2606.07368 2026-06-08 cs.CV cs.AI New submission

Mitosis Detection in the Wild: Multi-Tumor and Context-Aware Generalization in the MIDOG 2025 Challenge

野外有丝分裂检测:MIDOG 2025挑战中的多肿瘤与上下文感知泛化

Marc Aubreville, Jonas Ammeling, Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Robert Klopfleisch, Jiaqi Lv, Shan E Ahmed Raza, Raphaël Bourgade, Thomas Walter, Yasemin Topuz, Songül Varlı, Charles-Antoine Collins-Fekete, Zhuoyan Shen, Navya Sri Kelam, Nitin Singhal, Christian Marzahl, Brian Napora, Tengyou Xu, Hongyan Gu, Mario Vento, Gennaro Percannella, Norbert Ropiak, Izabela Wasiak, Jie Xiao, Shaojun Liu, Seungho Choe, April Khademi, Vidushi Walia, Sujatha Kotte, Andrew Broad, Alex Wright, Guillaume Balezo, Esha Sadia Nasir, Mostafa Jahanifar, Yosuke Yamagishi, Shouhei Hanaoka, Mattia Sarno, Francesco Tortorella, Biwen Meng, Jingxin Liu, Sara Krauss, Daniel Hieber, Lavish Ramchandani, Dev Kumar Das, Mieko Ochi, Yuan Bae, Piotr Giedziun, Mateusz Maniewski, Vangala Govindakrishnan Saipradeep, Naveen Sivadasan, Leire Benito-Del-Valle, Adrian Galdran, Kaustubh Atey, Sameer Anand Jha, Adinath Dukre, Imran Razzak, Maxime W. Lafarge, Viktor H. Koelzer, Nils Porsche, Nikolas Stathonikos, Mitko Veta, Dominik Hirling, Zsanett Zsófia Iván, Peter Horvath, Katharina Breininger, Christof A. Bertram

Affiliations: Flensburg University of Applied Sciences; Technische Hochschule Ingolstadt; University of Veterinary Medicine; Schwarzman Animal Medical Center; Freie Universität Berlin; University of Warwick; MINES Paris - PSL University; Yildiz Technical University; University College London; AIRA MATRIX Private Limited; University of California, Los Angeles; University of Kansas Medical Center; University of Salerno; Cancer Center Sp. z o. o.; th Military Research Hospital in Bydgoszcz; Shenzhen Technology University; Toronto Metropolitan University; Tata Consultancy Services Ltd.; Leeds Teaching Hospitals NHS Trust; The University of Tokyo; Xi’an Jiaotong-Liverpool University; University of Augsburg; Ulm University; Japanese Red Cross Medical Center; Wroclaw University of Science and Technology; TECNALIA, Basque Research and Technology Alliance (BRTA); Indian Institute of Technology Bombay; MBZUAI; University of Basel; University Medical Center Utrecht; TU Eindhoven; HUN-REN Biological Research Centre

AI summary: To address the histological diversity encountered in clinical practice, the MIDOG 2025 challenge evaluated algorithm performance across 12 tumor types and multiple scanning platforms. Models performed reliably in traditional hotspots but degraded significantly in challenging regions and on rare tumors; ensembling raised the F1 score by 1.5 percentage points on average.

Details
AI Chinese abstract

自动有丝分裂检测是计算病理学中一项成熟的任务。虽然之前的基准测试关注扫描仪引起的域偏移,但临床“真实世界”应用要求模型能够对组织学景观中预期的巨大差异具有鲁棒性。MItosis DOmain Generalization (MIDOG) 2025挑战旨在评估算法在空前生物学和上下文多样性下的性能。我们策划了一个包含365个病例的测试数据集,涵盖12种不同的人类、犬和猫肿瘤类型,并在多个扫描平台上数字化。超越手动选择的感兴趣区域(ROI),该挑战还要求在随机组织区域(代表全切片检测情况)和困难区域(富含难负样本的区域)进行检测。在第二个赛道中,我们引入了非典型有丝分裂象(AMF)的分类。检测赛道有18支队伍提交,F1分数最高达0.740。在AMF检测赛道,我们有21个提交,平衡准确率最高达0.908。我们的分析显示,虽然大多数模型在传统热点区域表现可靠,但在困难ROI中性能显著下降,假阳性率增加了两倍。此外,性能在12种肿瘤类型间差异显著,突显了当前最先进架构在遇到罕见或高度多形性恶性肿瘤时的“盲点”。此外,我们评估了集成的有效性,发现F1分数和平衡准确率平均分别提高1.5和1.3个百分点。相比之下,测试时增强(TTA)没有显示出相关改进。MIDOG 2025表明,“野外”有丝分裂检测仍然是一个重大障碍。从仅热点评估到多上下文框架的转变,为临床可靠性提供了更现实的代理指标。

English abstract

Automated mitosis detection is a well-established task in computational pathology. While previous benchmarks focused on scanner-induced domain shift, clinical "real-world" application requires models to be robust across the vast variance to be expected in the histological landscape. The MItosis DOmain Generalization (MIDOG) 2025 challenge was designed to evaluate algorithmic performance across unprecedented biological and contextual diversity. We curated a test dataset of 365 cases, encompassing 12 distinct human, canine and feline tumor types, digitized across multiple scanning platforms. Moving beyond hand-selected hotspots, the challenge required detection also in random tissue areas (representative of the whole slide detection situation) and challenging areas (areas rich in hard negatives). In the second track, we introduced the classification of atypical mitotic figures (AMFs). There were 18 teams submitting to the detection track, with F1 scores ranging up to 0.740. In the AMF detection track, we had 21 submissions with balanced accuracy values up to 0.908. Our analysis reveals that while most models perform reliably in traditional hotspots, significant performance degradation occurs in challenging regions of interest (ROIs), where false positive rates tripled. Furthermore, performance varied significantly across the 12 tumor types, highlighting "blind spots" in current state-of-the-art architectures when encountering rare or highly pleomorphic malignancies. Moreover, we evaluated the effectiveness of ensembling and found mean increases of 1.5 and 1.3 percentage points in F1 score and balanced accuracy, respectively. In contrast, test-time augmentation (TTA) showed no relevant improvement. MIDOG 2025 demonstrates that "in the wild" mitosis detection remains a significant hurdle. The transition from hotspot-only evaluation to a multi-contextual framework provides a more realistic proxy for clinical reliability.

2606.07394 2026-06-08 cs.CV New submission

Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

注意差距:解开视频实例分割中的性能瓶颈

Danial Hamdi, Fardin Ayar, Mahdi Javanmardi

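Affiliations: Computer Engineering Department, Amirkabir University of Technology (Tehran Polytechnic)

AI summary: Proposes a diagnostic framework based on integer linear programming that separates classification, segmentation, and tracking errors, finding that tracking instability is a key bottleneck in video instance segmentation, especially under occlusion, in long videos, and in dense scenes, and that stronger backbones do not remove this algorithmic problem.

Details
AI Chinese abstract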

在视频实例分割(VIS)中,分类、分割和跟踪目标被联合评估,但它们各自对性能损失的贡献仍然不透明。我们引入一个诊断框架,将身份和类别分配表述为整数线性规划(ILP),产生一个模型无关的预言机,分层隔离每个错误源。应用于跨越在线和离线范式的七种VIS方法,在YouTube-VIS 2019/2021和OVIS的诊断子集上,我们的分析揭示了一致的图景。跟踪不稳定是在线方法的关键瓶颈,在严重遮挡下差距超过20 AP,并且随着视频长度和实例密度急剧增长。虽然语义分类在标准基准上有显著贡献,但在跟踪失败最严重的地方其影响变得微不足道。尽管更强的骨干网络大幅提升了默认分数,但它们基本保留了AP跟踪差距,证实了时间脆弱性是算法性的,而非纯粹表示性的。为补充预言机,我们引入了TrackLens,一种可视化工具,将差距大小转化为可观察的查询级故障模式。这些工具共同为瞄准VIS的核心挑战——鲁棒的长期时间关联——提供了系统基础。

English abstract

In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model-agnostic oracle that hierarchically isolates each error source. Applied to seven VIS methods spanning online and offline paradigms across YouTube-VIS 2019/2021 and a diagnostic subset of OVIS, our analysis reveals a consistent picture. Tracking instability is a critical bottleneck for online methods, with gaps exceeding 20 AP under heavy occlusion, and grows sharply with video length and instance density. While semantic classification contributes meaningfully on standard benchmarks, its impact becomes negligible where tracking fails most. Although stronger backbones substantially lift default scores, they leave AP tracking gaps largely intact, confirming that temporal fragility is algorithmic rather than purely representational. To complement the oracle, we introduce TrackLens, a visual tool that translates gap magnitude into observable, query-level failure modes. Together, these tools provide a systematic foundation for targeting VIS's core challenge: robust long-term temporal association.

2606.07401 2026-06-08 cs.CV New submission

RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

RealDocBench: 面向真实世界监管文档的字段级问答与布局理解基准

Ameya Joshi, Joon Kim, Gus Eggert, Joseph Bajor, Cindy Hao, Jing Reyhan, Kushal Byatnal, Eli Badgio

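Affiliations: Extend AI

AI summary: Proposes the RealDocBench benchmark, with two tasks (field-level QA and layout understanding), evaluating 18 systems on real regulated documents and revealing the performance spread and cost/latency trade-offs that single metrics hide.

Details
AI Chinese abstract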

文档解析系统越来越多地部署在高风险、受监管的工作流程中,如抵押贷款承销、财务报告、供应链物流和临床记录。然而,大多数公开基准在干净的学术布局或合成文本上评估解析器,并报告单一的OCR或Markdown级相似度分数。这类文档和指标与下游代理实际需求(即在混乱的真实世界页面上获取特定字段的正确值)相关性较差。我们引入了RealDocBench,这是一个基于真实监管文档构建的双轨基准。问答轨道包含跨越四个领域的581份文档上的1,356个字段级问题,每个问题配有一个类型化的gold_dict键值对答案,解析器按每个字段和严格的每个问题准确率评分。布局轨道包含1,500个人工验证的页面图像,在九类公共分类法下用COCO风格的边界框注释,使用包含邻域感知分割/合并恢复的匈牙利匹配器评分。我们在统一的提取和评分协议下评估了18个系统,涵盖商业解析API、通用视觉语言模型和开源OCR模型,并报告准确率以及每页成本和缓存失效延迟。RealDocBench暴露了单一数字基准隐藏的广泛性能差异、一个持续困难的医学子领域以及不同操作点之间的成本和延迟权衡。我们发布了数据集、解析器适配器和评估工具,以支持文档解析系统的可重复字段级比较。

English abstract

Document parsing systems are increasingly deployed in high-stakes, regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Yet most public benchmarks evaluate parsers on clean academic layouts or synthetic prose, and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents actually need: the correct value for a specific field on a messy real-world page. We introduce RealDocBench, a two-track benchmark built from real regulated documents. The QA track contains 1,356 field-level questions over 581 documents spanning four domains, where each question is paired with a typed gold_dict of key-to-value answers and parsers are scored on both per-field and strict per-question accuracy. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy, scored with a Hungarian matcher that includes adjacency-aware split/merge recovery. We evaluate eighteen systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, under a uniform extraction-and-scoring protocol, and report accuracy alongside per-page cost and cache-busted latency. RealDocBench exposes a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and sharp cost/latency trade-offs across operating points. We release the datasets, parser adapters, and evaluation harness to support reproducible, field-level comparison of document parsing systems.
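
The difference between per-field and strict per-question accuracy can be illustrated with a small sketch. It assumes exact string matching on each gold_dict entry; the function names, the `pred` dictionaries, and the example field names are hypothetical, and the benchmark's actual type-aware scoring is not described in the abstract.

```python
from typing import Dict, List, Tuple

Answer = Dict[str, str]


def score_question(gold_dict: Answer, pred: Answer) -> Tuple[int, int]:
    """Return (correct_fields, total_fields) for one question.

    Assumption: a field counts as correct only on exact string match.
    """
    correct = sum(1 for key, value in gold_dict.items() if pred.get(key) == value)
    return correct, len(gold_dict)


def aggregate(pairs: List[Tuple[Answer, Answer]]) -> Tuple[float, float]:
    """Return (per_field_accuracy, strict_per_question_accuracy)."""
    field_hits = field_total = strict_hits = 0
    for gold_dict, pred in pairs:
        correct, total = score_question(gold_dict, pred)
        field_hits += correct
        field_total += total
        # Strict scoring: the question counts only if every field is right.
        strict_hits += int(correct == total)
    return field_hits / field_total, strict_hits / len(pairs)


# Hypothetical examples: one question with a single wrong field, one fully correct.
examples = [
    ({"borrower": "A. Smith", "loan_amount": "250000"},
     {"borrower": "A. Smith", "loan_amount": "25000"}),
    ({"invoice_date": "2024-03-01"},
     {"invoice_date": "2024-03-01"}),
]
per_field, strict = aggregate(examples)
print(f"per-field: {per_field:.3f}, strict per-question: {strict:.3f}")
```

A single wrong field lowers per-field accuracy only slightly but fails the whole question under strict scoring, which is why reporting both numbers is informative.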

2606.07433 2026-06-08 cs.CV cs.AI cs.MM New submission

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Watch, Remember, Reason: 基于多模态大语言模型的人类视角视频理解

Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang

Affiliations: School of Intelligence Science and Technology, Peking University; Wuhan University; Shanghai Jiao Tong University; Nanyang Technological University; CASIA; University of Tokyo; University of Liverpool; Zhejiang University; National University of Singapore; UC Merced

AI summary: Proposes three functional abilities for video understanding from a human-view perspective (watching, remembering, reasoning), builds a unified framework for analyzing perception, memory, reasoning, and prediction in video MLLMs, and summarizes challenges, methods, applications, and future directions.

Details
AI Chinese abstract

视频理解正被多模态大语言模型(MLLMs)迅速变革,研究从短视频片段转向长视频、多模态和知识密集型视频场景。这些场景要求模型在有限计算预算下处理稀疏证据、长程依赖、多模态对齐和可靠推理。本文从人类视角出发,围绕三个功能能力(观看、记忆和推理)组织基于LLM的视频理解。该视角并非将视频任务视为孤立基准,而是提供一个统一结构,用于分析视频MLLM如何获取证据、保持上下文并产生有依据的输出。我们引入一个公式,通过感知表示、记忆状态、推理轨迹和最终预测来表征视频理解系统。基于此公式,我们识别出时空感知、高效长视频处理、记忆建模、流式理解和忠实推理中的挑战。代表性方法按其视频MLLM系统中的角色进行组织:观看涵盖细粒度、全面、音视频和高效感知;记忆包括离线记忆和流式记忆;推理涵盖纯文本推理和视频辅助推理。我们进一步考察了应用领域,如自我中心、体育、教学、医学和叙事视频,并涵盖了跨任务类型、监督格式、模态和能力维度的训练数据集和评估基准。最后,我们概述了可扩展、记忆感知和有依据的视频智能的开放问题和未来方向。相关工作将在 https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding 持续追踪。

English abstract

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.

2606.07451 2026-06-08 cs.CV cs.AI cs.CL cs.LG New submission

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

TEVI: 基于稀疏自编码器的文本条件视觉表示编辑以改进视觉-语言对齐

Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller, Bernt Schiele

Affiliations: Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany; Department of Language Science and Technology, Saarland University, Saarbrücken, Germany

AI summary: Proposes TEVI, a framework that disentangles image embeddings with sparse autoencoders and selectively reconstructs them through a text-conditioned masking module, improving image-text alignment in vision-language models such as CLIP and yielding gains on several retrieval benchmarks.

Comments 20 pages, 13 figures, 14 tables

Details
AI Chinese abstract

视觉-语言模型(如CLIP)由于共享图像-文本嵌入空间,对多种任务非常有用。尽管如此,图像和文本嵌入往往对齐不佳,影响下游性能。最近的研究表明,这可以归因于信息不平衡:图像包含的信息比其标题描述的更多。在这项工作中,我们提出了TEVI,一个利用标题作为信号来决定从图像嵌入中保留哪些信息的框架。具体来说,我们使用稀疏自编码器来解耦图像嵌入,并训练一个掩码模块,根据给定的标题选择性重构嵌入。在具有合成标题的受控设置中,我们展示了TEVI在保留标题描述的属性同时丢弃其他属性方面的有效性。通过将TEVI应用于在自然图像上训练的CLIP模型,我们进一步在粗粒度短标题(MS COCO, Flickr)和细粒度长标题(IIW, DOCCI)基准上实现了改进的检索性能,在更丰富的标题上获得更强的增益,并在RoCOCO基准上提高了鲁棒性。

English abstract

Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.

2606.07498 2026-06-08 cs.CV New submission

Implicit Data Synthesis for Contrastive Unsupervised Data Augmentation

隐式数据合成用于对比无监督数据增强

Patrick Kage, Trevor Hedges, N. Siddharth, Pavlos Andreadis

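Affiliations: School of Informatics, The University of Edinburgh; Massachusetts Institute of Technology Lincoln Laboratory

AI summary: To address the difficulty of labeling scientific observation data, proposes generating contrastive samples by perturbing network weights rather than the data, and demonstrates performance gains with a SimCLR-based pipeline on radar observations of meteors.

Comments 11 pages, 3 figures, 2 tables

Details
AI Chinese abstract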

科学观测产生大量未标记数据,手工标记费力,因此无监督学习技术对于处理数据集很有价值。在这些方法中,对比学习提供了一种从无标注数据集中提取结构表示的便捷机制。对于自然图像,通用方法是使用多种数据空间增强方法来生成合成样本;然而,对于科学观测,数据空间扰动可能从根本上改变底层数据。我们提出的方法是通过扰动网络权重而非底层数据来生成对比样本,从而更紧密地保留数据结构。我们使用基于SimCLR的管道在雷达流星观测上演示了该技术,并展示了在匹配协议下的性能提升。

English abstract

Scientific observations generate large quantities of unlabeled data which is laborious to hand-label, making unsupervised learning techniques valuable for processing datasets. Among these approaches, contrastive learning provides a convenient mechanism for extracting structural representations from unannotated datasets. For natural imagery, the general approach is to use a variety of data-space augmentation methods in order to generate synthetic samples; however, for scientific observations data-space perturbations can fundamentally alter the underlying data. Our proposed method is to generate contrastive samples by perturbing the network weights rather than the underlying data, thus more closely preserving the structure of the data. We demonstrate this technique using a SimCLR-based pipeline applied over radar observations of meteors, and show performance gains under matched protocols.

2606.07503 2026-06-08 cs.CV New submission

Differences in Detection: Explainability Where it Matters

检测中的差异:可解释性在关键之处

Johannes Theodoridis, Johannes Maucher, Andreas Schilling

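Affiliations: University of Tübingen; Institute for Applied AI; Hochschule der Medien Stuttgart

AI summary: Proposes DnD, a method that directly compares two object detection models through a shared matching algorithm, revealing individual and shared mistakes and guiding explainability methods toward metric-relevant examples.

Comments Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 2026 - How Do Vision Models Work? (HOW)

Details
AI Chinese abstract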

我们提出了检测中的差异(DnD),一种直观的比较两个目标检测模型的方法。基于相同的匹配算法,它补充了平均精度($mAP$)和TIDE误差分析的标准指标,能够直接比较两个模型。更具体地说,我们计算两个模型都识别的真实标签的交集,然后是相应的差集以及两个模型都遗漏的真实标签的补集。与独立的汇总统计比较相比,这种比较更直接、更直观。它揭示了个体和共享的错误,当与错误类型结合时尤其有趣。在这种情况下,检测误差的差异可以自然地通过标准混淆矩阵进行分析。虽然本身有价值,但我们认为DnD的最佳应用之一是引导可解释性方法(如ODAM)关注基于结构化子集的度量相关示例。我们方法的代码可在此处获取:https://github.com/JohannesTheo/differences-in-detection

English abstract

We propose Differences in Detection (DnD), an intuitive method to compare two object detection models. Based on the same matching algorithm, it complements the standard metrics of mean Average Precision ($mAP$) and TIDE error analysis with the ability to compare two models directly. More specifically, we calculate the intersection of ground truth labels that are recognized by both models, followed by the corresponding difference sets and the complement set of ground truth labels that are missed by both models. The resulting comparison is more direct and intuitive than a comparison of independent summary statistics. It reveals individual and shared mistakes and becomes particularly interesting when combined with error types. In this case, the differences in detection errors can be analyzed naturally in a standard confusion matrix. While valuable in itself, we believe that one of the best applications of DnD is to guide explainability methods such as ODAM towards metric-relevant examples, grounded in structured subsets. The code for our method is available here: https://github.com/JohannesTheo/differences-in-detection
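
The set construction described in the abstract can be sketched with plain Python sets. This is a minimal illustration, not the authors' implementation: it assumes the shared matching step has already produced, for each model, the set of ground-truth label IDs it recognized, and the names `dnd_partition`, `found_a`, and `found_b` are hypothetical.

```python
from typing import Dict, Set


def dnd_partition(all_gt: Set[str], found_a: Set[str], found_b: Set[str]) -> Dict[str, Set[str]]:
    """Split ground-truth labels by which of two models recognized them.

    Assumption: found_a and found_b hold ground-truth IDs matched by
    model A and model B under the same matching algorithm.
    """
    found_a = found_a & all_gt  # ignore IDs that are not ground truth
    found_b = found_b & all_gt
    return {
        "both": found_a & found_b,                # recognized by both models
        "only_a": found_a - found_b,              # recognized by A, missed by B
        "only_b": found_b - found_a,              # recognized by B, missed by A
        "neither": all_gt - (found_a | found_b),  # missed by both models
    }


# Hypothetical ground-truth IDs for one evaluation set.
gt = {"g1", "g2", "g3", "g4", "g5"}
parts = dnd_partition(gt, found_a={"g1", "g2", "g3"}, found_b={"g2", "g3", "g4"})
for name, ids in parts.items():
    print(name, sorted(ids))
```

The four subsets are disjoint and together cover every ground-truth label, so each label can be routed to exactly one group for further inspection.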

2606.07508 2026-06-08 cs.CV New submission

Streaming Video Generation with Streaming Force Control

流式视频生成与流力控制

Hanhui Wang, Yiming Xie, Haiwen Feng, Zhaoyang Lv, Shenlong Wang, Huaizu Jiang

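Affiliations: Northeastern University; Impossible Research; University of California, Berkeley; University of Illinois Urbana-Champaign

AI summary: Proposes StreamForce, a framework that uses a unified force representation and a distillation pipeline to achieve causal, unified streaming video generation controlled by local and global time-varying forces, running at up to 16.6 FPS on a single GPU with state-of-the-art force adherence and motion realism.

Details
AI Chinese abstract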

我们提出StreamForce,一个流式视频生成框架,通过连续的力输入实现基于物理的控制。与先前的视频模型不同,这些模型为不同的力类型训练单独的模型,假设固定的力,或依赖非因果处理,StreamForce是一个因果且统一的模型,能够即时且连贯地响应局部和全局的时变力。为此,我们设计了一个统一的力表示作为控制信号,并开发了一个用于力可控视频生成的蒸馏流程。我们的模型结合了自回归效率和力响应性,维持了稳定的光度学和动态真实性。StreamForce在单个GPU上以高达16.6 FPS的速度运行,在力遵循和运动真实性方面均达到了最先进的性能。项目网站:https://neu-vi.github.io/StreamForce/

English abstract

We introduce StreamForce, a streaming video generation framework that enables physically grounded control through continuous force inputs. Unlike prior video models that train separate models for different force types, assume fixed forces, or rely on non-causal processing, StreamForce is a causal and unified model that responds instantly and coherently to both local and global, time-varying forces. To achieve this, we design a unified force representation as a control signal and develop a distillation pipeline for force-controllable video generation. Our model combines autoregressive efficiency with force responsiveness, sustaining stable photometric and dynamic realism. StreamForce runs at up to 16.6 FPS on a single GPU, achieving state-of-the-art performance in both force adherence and motion realism. Project website: https://neu-vi.github.io/StreamForce/

2606.07512 2026-06-08 cs.CV cs.AI cs.CL New submission

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

MemDreamer: 通过分层图记忆和智能体检索机制解耦感知与推理以实现长视频理解

Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen

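Affiliations: Ant Group; Zhejiang University; Central South University; HKUST(GZ)

AI summary: Proposes MemDreamer, a framework that decouples perception and reasoning through Hierarchical Graph Memory and an agentic retrieval mechanism, turning long-video understanding into an agentic exploration process; it reaches SOTA on four benchmarks while limiting the reasoning context window to 2% of full-context ingestion and gaining 12.5 points in accuracy.

Details
AI Chinese abstract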

当前的视觉-语言模型在处理数小时长的视频时面临困难,因为处理完整长度的视觉序列会导致令牌爆炸和注意力稀释。为了克服这一问题,我们引入了MemDreamer,将感知与推理解耦,将长视频理解转化为智能体探索过程。作为一个即插即用的框架,它增量式地流式传输视频以构建分层图记忆,这是一种自顶向下的三层架构,用于语义抽象,并由一个捕获时空和因果关系的基础图锚定。在推理过程中,推理模型采用智能体工具增强的检索,通过观察-推理-行动循环导航层次结构、搜索节点和遍历逻辑边。实验表明,MemDreamer在四个主流基准上取得了最先进的结果,将人类专家的差距缩小到仅3.7个百分点。它将推理上下文窗口限制在全量上下文的仅2%,同时提供了12.5个百分点的绝对准确率提升。此外,统计分析揭示了VLM在逻辑推理和长视频理解基准上的性能之间存在强正线性相关,将智能体能力扩展确立为多模态理解的新范式。

English abstract

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.

2606.07514 2026-06-08 cs.CV New submission

UniSHARP: Universal Sharp Monocular View Synthesis

UniSHARP: 通用锐利单目视图合成

Meixi Song, Dizhe Zhang, Hao Ren, Ruiyang Zhang, Bo Du, Ming-Hsuan Yang, Lu Qi

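Affiliations: Insta360 Research; Sun Yat-sen University; Beihang University; Wuhan University; University of California, Merced

AI summary: Proposes UniSHARP, which extends SHARP to arbitrary camera systems (including fisheye and panoramic) through a unified omnidirectional latent space and a ray-based Gaussian representation, with implicit alignment in feature and Gaussian spaces; it outperforms existing methods by a large margin on a newly constructed benchmark.

Comments Project page: https://insta360-research-team.github.io/Unisharp-website/

Details
AI Chinese abstract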

在这项工作中,我们专注于扩展SHARP(一种流行的逼真视图合成方法),以实现跨连续相机系统(从传统透视相机到广角、鱼眼和全景设置)的通用单目渲染。为了克服SHARP的针孔特定假设,我们的关键思想是将各种图像对齐到统一的全景隐空间中。因此,我们提出了UniSHARP,它在特征空间和高斯空间中执行隐式对齐。具体来说,高斯基元沿射线和径向距离排列在基于射线的通用表示中,而从UniK3D启发的编码器中提取的2D语义和3D空间特征被联合解码以生成完整的高斯云。为了全面评估我们的方法,我们构建了一个覆盖各种场景下多种成像系统的基准。该基准进一步按视场角(FoV)分层,以实现对通用单目渲染任务的细粒度评估。在提出的基准上进行的大量实验证明了UniSHARP的有效性,其性能大幅优于替代方法。项目页面可在此处找到:https://insta360-research-team.github.io/Unisharp-website/

English abstract

In this work, we focus on extending SHARP, the popular photorealistic view synthesis method, for universal monocular rendering across a continuum of camera systems, from conventional perspective cameras to wide-field-of-view, fisheye and omnidirectional panoramic settings. To overcome the pinhole-specific assumptions of SHARP, our key idea is to align various images in a unified omnidirectional latent space. Thus, we propose UniSHARP, which performs implicit alignment in both feature and Gaussian spaces. Specifically, Gaussian primitives are arranged along rays and radial distances in a ray-based universal representation, while 2D semantic and 3D spatial features extracted from UniK3D-inspired encoders are jointly decoded to generate the complete Gaussian cloud. To comprehensively evaluate our method, we construct a benchmark covering diverse imaging systems across various scenes. The benchmark is further stratified by field of view (FoV) to enable fine-grained assessment of the universal monocular rendering task. Extensive experiments on the proposed benchmark demonstrate the effectiveness of UniSHARP, outperforming alternative methods by a large margin. The project page can be found at: https://insta360-research-team.github.io/Unisharp-website/

2606.06498 2026-06-08 cs.GR cs.CV Cross-list

Semantic-Structural Alignment for Generative Pictorial Charts

生成式图形图表的语义-结构对齐

Zhida Sun, Yulin Zhang, Zheng Gu, Min Lu, Bongshin Lee, Daniel Cohen-Or, Hui Huang

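Affiliations: Visual Computing Research Center (VCC), College of Computer Science and Software Engineering (CSSE), Shenzhen University, China

AI summary: Proposes a generative framework that automatically synthesizes pictorial charts with both artistic expressiveness and structural fidelity, using structural-alignment and semantic-alignment mechanisms in a Multi-Modal Diffusion Transformer.

Comments 11 pages, 17 figures, Accepted to ACM TOG

Details
AI Chinese abstract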

传统统计图形精确但往往缺乏图形图表的视觉吸引力、记忆性和参与度。我们提出了一种用于自动合成图形图表的生成式框架,弥合了语义表达与结构保真度之间的差距。我们不是将图表仅仅视为需要风格化的图像,而是将问题构建为一个双条件生成任务,由两个并行的外部控制信号引导:一个捕捉编辑意图语义上下文的文本提示,以及一个提供抽象统计图表全局结构的上下文图像。为了在多模态扩散变压器中增强这些控制,我们引入了两个互补的特征级机制:结构对齐,将空间布局锚定到输入图表;以及语义对齐,从参考图像转移表达性纹理。我们的方法泛化到主要视觉通道(即长度、面积、角度和位置)和多样化的语义领域,生成的图形图表既具有艺术吸引力又结构一致。广泛的定量评估和感知用户研究表明,我们的框架优于传统的可控生成和图像编辑基线,为表达性视觉叙事中高保真、数据驱动的生成建模提供了基础。项目页面:https://ssalign.github.io/。

English abstract

Traditional statistical graphics are precise but often lack the visual appeal, memorability, and engagement of pictorial charts. We present a generative framework for the automated synthesis of pictorial charts that bridges the gap between semantic expression and structural faithfulness. Rather than treating charts merely as images to be stylized, we frame the problem as a dual-conditioned generation task guided by two parallel external control signals: a text prompt capturing the semantic context of the editing intent, and a context image providing the abstract statistical chart's global structure. To reinforce these controls within a Multi-Modal Diffusion Transformer, we introduce two complementary feature-level mechanisms: structural alignment to anchor spatial layouts to the input chart, and semantic alignment to transfer expressive textures from reference images. Generalizing across major visual channels (i.e., length, area, angle, and position) and diverse semantic domains, our method produces pictorial charts that are both artistically compelling and structurally consistent. Extensive quantitative evaluations and perceptual user studies demonstrate that our framework outperforms traditional controllable generation and image editing baselines, providing a foundation for high-fidelity, data-driven generative modeling in expressive visual storytelling. Project page: https://ssalign.github.io/.

2606.06505 2026-06-08 cs.CG cs.AI cs.CV math.DG Cross-list

A Geometric Gaussian Mixture Representation of Plane Curves

平面曲线的几何高斯混合表示

Ali Darijani, Benedikt Stratmann, Jürgen Beyerer

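Affiliations: Fraunhofer IOSB; KIT, IES

AI summary: Proposes a user-defined probabilistic polygonal representation of plane curves that assigns each line segment a normal-direction uncertainty parameter and constructs a Gaussian mixture model, preserving local geometry and normal-direction uncertainty and applying to many curve types.

Details
AI Chinese abstract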

我们引入了一种用户定义的平面曲线概率多边形表示。给定一条曲线,我们在曲线上选择顶点,并通过线段连接相邻顶点以获得多边形近似。每个线段在法线方向上配备一个用户定义的不确定性参数。这产生了一组薄的概率几何基元,它们保留了底层曲线的几何形状,同时将其扩展到理想化的确定性一维公式之外。对于每个线段,我们定义一个随机变量,该变量在线段的切线方向上均匀分布,在线段的法线方向上高斯分布。通过匹配第一和第二中心矩,该构造诱导出一个高斯分量,其均值位于线段中点,协方差编码了切向和法向不确定性。将逐段分量与适当的权重相结合,得到平面曲线的用户定义概率多边形表示的高斯混合模型(GMM)。所提出的框架提供了一个解析上可处理的概率模型,保留了局部几何和法向不确定性。它适用于光滑、封闭、开放、非正则和自交的平面曲线,允许自适应离散化和法向方向上的变化不确定性,从而支持不确定性感知的几何建模。在一组典型平面曲线上的实验表明,所得的GMM捕获了局部切线、局部法线和局部弧长;从而也真实地捕获了底层曲线的全局形状。该表示特别适用于不确定性感知的CAD和数字孪生、机器人中的概率障碍物建模以及概率轨迹规划等应用。

English abstract

We introduce a user-defined probabilistic polygonal representation for plane curves. Given a curve, we select vertices on the curve and connect consecutive vertices by line segments to obtain a polygonal approximation. Each segment is equipped with a user-defined uncertainty parameter in the normal direction. This yields a collection of thin probabilistic geometric primitives that retain the geometry of the underlying curve while extending it beyond the idealized deterministic one-dimensional formulation. For each segment, we define a random variable that is uniformly distributed in the tangent direction of the segment and Gaussian distributed in the normal direction of the segment. By matching the first and the second central moments, this construction induces a Gaussian component whose mean lies at the segment midpoint and whose covariance encodes both tangential and normal uncertainty. Combining the segment-wise components with appropriate weights yields a Gaussian Mixture Model (GMM) representation of the user-defined probabilistic polygonal representation of the plane curve. The proposed framework provides an analytically tractable probabilistic model that preserves local geometry and uncertainty in the normal direction. It applies to smooth, closed, open, non-regular, and self-intersecting plane curves, allows adaptive discretization and varying uncertainty in the normal direction, and as a result supports uncertainty-aware geometric modeling. Experiments on a collection of canonical plane curves show that the resulting GMMs capture local tangent, local normal, and local arc length, so the global shape of the underlying curves is faithfully captured as well. The representation is particularly relevant for applications in uncertainty-aware CAD and digital twins, probabilistic obstacle modeling in robotics, and probabilistic trajectory planning.
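
The moment-matching step can be sketched in a few lines. For a segment of length L with unit tangent t, unit normal n, and normal standard deviation sigma, a uniform distribution along the tangent has variance L^2/12, so the matched Gaussian has its mean at the midpoint and covariance (L^2/12) t t^T + sigma^2 n n^T. The sketch below is an illustration under assumptions, not the authors' code: the function names are hypothetical, and weighting components by segment length is one plausible reading of "appropriate weights".

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Matrix2 = Tuple[Tuple[float, float], Tuple[float, float]]


def segment_component(p: Point, q: Point, sigma: float) -> Tuple[Point, Matrix2, float]:
    """Moment-matched Gaussian for one segment: returns (mean, covariance, length)."""
    dx, dy = q[0] - p[0], q[1] - p[1]
    length = math.hypot(dx, dy)
    tx, ty = dx / length, dy / length  # unit tangent
    nx, ny = -ty, tx                   # unit normal (tangent rotated by 90 degrees)
    var_t = length ** 2 / 12.0         # variance of a uniform on [-L/2, L/2]
    var_n = sigma ** 2                 # user-defined normal variance
    mean = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
    off_diag = var_t * tx * ty + var_n * nx * ny
    cov = ((var_t * tx * tx + var_n * nx * nx, off_diag),
           (off_diag, var_t * ty * ty + var_n * ny * ny))
    return mean, cov, length


def curve_to_gmm(vertices: Sequence[Point], sigmas: Sequence[float]) -> List[Tuple[float, Point, Matrix2]]:
    """Build (weight, mean, covariance) triples for an open polyline.

    Assumption: weights are proportional to segment length.
    """
    comps = [segment_component(vertices[i], vertices[i + 1], sigmas[i])
             for i in range(len(vertices) - 1)]
    total = sum(length for _, _, length in comps)
    return [(length / total, mean, cov) for mean, cov, length in comps]


# An L-shaped polyline with a different normal uncertainty on each segment.
gmm = curve_to_gmm([(0.0, 0.0), (2.0, 0.0), (2.0, 1.0)], sigmas=[0.1, 0.2])
for weight, mean, cov in gmm:
    print(weight, mean, cov)
```

For a closed curve, the last vertex would be connected back to the first before calling `curve_to_gmm`.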

2606.06524 2026-06-08 eess.IV cs.CV cs.LG 交叉投稿

Advanced Flood Prediction with Physics-Guided Deep Learning: Combining UNet, FNO, and SAR/Optical Imagery

基于物理引导深度学习的先进洪水预测:结合UNet、FNO与SAR/光学影像

Tewodros Syum Gebre, Jagrati Talreja, Leila Hashemi-Beni

发表机构 * National Center for Atmospheric Research (NCAR)(国家大气研究中心)

AI总结 提出物理引导深度学习框架,融合多模态遥感与浅水方程约束,通过UNet-FNO混合架构实现高精度洪水预测,IoU达0.82,F1达0.90。

Comments This paper has been accepted for publication in the Proceedings of the IEEE Radar Conference (RadarConf 2026). The final authenticated version will be available through IEEE Xplore

详情
AI中文摘要

由于地面观测有限、地形条件异质以及数据驱动模型中难以强制执行水动力学一致性,准确且可扩展的洪水测绘仍然具有挑战性。本文介绍了一种物理引导的深度学习框架,该框架集成了多模态遥感(Sentinel-1 SAR、Sentinel-2光学影像和DEM衍生的地形特征)与深度平均浅水方程(SWE)的约束。所提出的混合架构结合了用于捕捉精细尺度空间细节的UNet和用于模拟流域尺度水力相互作用的傅里叶神经算子(FNO),而物理信息残差损失确保了质量和动量一致性。在多种洪泛区环境下评估,混合模型在洪水范围预测中实现了0.82的交并比和0.90的F1分数,优于仅使用UNet和仅使用FNO的基线模型。以水动力学模拟作为参考数据,该模型在水深方面实现了0.21米的均方根误差,在流速方面实现了0.15米/秒的均方根误差。物理一致性得以保持,残差低且质量不平衡低于2.1%。消融研究证实,去除基于物理的正则化会显著降低性能,突显了物理约束对稳定性和泛化能力的价值。这些结果表明,将水动力学原理嵌入深度学习可产生更准确、可靠且物理一致的洪水预测,为业务监测和大规模部署提供了巨大潜力。

英文摘要

Accurate and scalable flood mapping remains challenging due to limited ground observations, heterogeneous terrain conditions, and the difficulty of enforcing hydrodynamic consistency within data-driven models. This work introduces a physics-guided deep learning framework that integrates multi-modal remote sensing (Sentinel-1 SAR, Sentinel-2 optical imagery, and DEM-derived terrain features) with constraints from the depth-averaged shallow water equations (SWE). The proposed hybrid architecture combines a UNet to capture fine-scale spatial details with a Fourier Neural Operator (FNO) to model basin-scale hydraulic interactions, while physics-informed residual losses ensure mass and momentum consistency. Evaluated across diverse floodplain settings, the hybrid model achieves an Intersection over Union of 0.82 and an F1 score of 0.90 for flood extent prediction, outperforming UNet-only and FNO-only baselines. Using hydrodynamic simulations as reference data, the model achieves an RMSE of 0.21 m for water depth and 0.15 m/s for flow velocity. Physics consistency is maintained, with low residuals and mass imbalance below 2.1%. Ablation studies confirm that removing physics-based regularization significantly degrades performance, underscoring the value of physical constraints for stability and generalization. These results demonstrate that embedding hydrodynamic principles into deep learning yields more accurate, reliable, and physically coherent flood predictions, offering strong potential for operational monitoring and large-scale deployment.
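For readers comparing the two flood-extent metrics, here is a small sketch of how IoU and F1 are computed from binary masks. When both are computed from the same pooled pixel counts they satisfy F1 = 2·IoU / (1 + IoU), and the reported pair is consistent with that identity (2·0.82/1.82 ≈ 0.90). The abstract does not say how the metrics were aggregated, so this is only a plausibility check. The function name `iou_and_f1` is hypothetical.

```python
def iou_and_f1(pred, truth):
    """Compute IoU and F1 for two equally long sequences of 0/1 labels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    if tp + fp + fn == 0:
        # No flooded pixels in either mask: treat as perfect agreement.
        return 1.0, 1.0
    iou = tp / (tp + fp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return iou, f1

# Flattened toy masks: 3 true positives, 1 false positive, 1 false negative.
iou, f1 = iou_and_f1([1, 1, 1, 0, 0, 1], [1, 1, 0, 0, 1, 1])
```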

2606.06537 2026-06-08 q-bio.QM cs.CV eess.IV 交叉投稿

DSU-Net: An Attention-Enhanced Dense Skip U-Net for Breast Lesion Segmentation in Mammographic Images

DSU-Net:用于乳腺X线图像中乳腺病变分割的注意力增强密集跳跃U-Net

Reza Bozorgpour, Mohammadreza Soltany Sadrabadi

发表机构 * Department of Biomedical Engineering, University of Wisconsin-Milwaukee(威斯康星大学密尔沃基分校生物医学工程系) Department of Mechanical Engineering, Northern Arizona University(北亚利桑那大学机械工程系)

AI总结 提出DSU-Net,通过密集跳跃连接和注意力机制改进特征传播与边界描绘,在CBIS-DDSM数据集上实现高精度乳腺病变分割。

详情
AI中文摘要

乳腺癌仍然是全球女性癌症相关死亡的主要原因之一,因此早期检测对于有效治疗至关重要。乳腺X线摄影是主要的筛查方式;然而,可疑病变的准确勾画仍然具有挑战性,且存在观察者间差异。自动分割方法可以通过提供一致且高效的病变定位来辅助放射科医生。本研究提出了DSU-Net,一种用于乳腺X线图像中自动乳腺病变分割的注意力增强密集跳跃U-Net架构。该框架集成了密集跳跃连接和注意力机制,以改进特征传播、保留空间信息并增强病变边界描绘。实验使用了乳腺摄影筛查数字数据库的精选乳腺成像子集(CBIS-DDSM)。为了解决严重的前景-背景不平衡问题,训练中采用了结合Dice损失、焦点损失和二元交叉熵损失的复合损失函数。所提模型在验证数据集上实现了0.9421的Dice相似系数、0.8905的交并比、0.9711的准确率和0.9878的AUC-ROC。定性评估显示了对不同大小和形态病变的准确勾画,而定量结果证实了病变与背景区域之间的稳健区分。这些发现表明,DSU-Net在乳腺X线图像中提供了准确可靠的乳腺病变分割,并突出了注意力引导深度学习在计算机辅助乳腺癌筛查和诊断中的潜力。

英文摘要

Breast cancer remains one of the leading causes of cancer-related mortality among women worldwide, making early detection essential for effective treatment. Mammography is the primary screening modality; however, accurate delineation of suspicious lesions remains challenging and subject to inter-observer variability. Automated segmentation methods can assist radiologists by providing consistent and efficient lesion localization. This study presents DSU-Net, an attention-enhanced Dense Skip U-Net architecture for automated breast lesion segmentation in mammographic images. The proposed framework integrates dense skip connections and attention mechanisms to improve feature propagation, preserve spatial information, and enhance lesion boundary delineation. Experiments were conducted using the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM). To address severe foreground-background imbalance, a composite loss function combining Dice loss, focal loss, and binary cross-entropy loss was employed during training. The proposed model achieved a Dice Similarity Coefficient of 0.9421, an Intersection over Union of 0.8905, an accuracy of 0.9711, and an AUC-ROC of 0.9878 on the validation dataset. Qualitative evaluation demonstrated accurate delineation of lesions with varying sizes and morphologies, while quantitative results confirmed robust discrimination between lesion and background regions. These findings demonstrate that DSU-Net provides accurate and reliable breast lesion segmentation in mammographic images and highlight the potential of attention-guided deep learning for computer-aided breast cancer screening and diagnosis.
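The composite loss mentioned in the abstract can be sketched as a weighted sum of soft Dice loss, focal loss, and binary cross-entropy over per-pixel probabilities. The equal weights, the focal exponent `gamma=2.0`, and the omission of a class-balancing factor are assumptions of this sketch; the abstract does not give the actual coefficients.

```python
import math

def composite_loss(probs, targets, w_dice=1.0, w_focal=1.0, w_bce=1.0,
                   gamma=2.0, eps=1e-7):
    """Weighted sum of soft Dice, focal, and BCE losses on flattened pixels.

    probs:   predicted foreground probabilities in [0, 1].
    targets: ground-truth labels (0 or 1).
    """
    n = len(probs)
    p_clip = [min(max(p, eps), 1.0 - eps) for p in probs]  # avoid log(0)

    # Soft Dice loss: 1 - 2|P∩T| / (|P| + |T|)
    inter = sum(p * t for p, t in zip(p_clip, targets))
    dice = 1.0 - (2.0 * inter + eps) / (sum(p_clip) + sum(targets) + eps)

    bce = 0.0
    focal = 0.0
    for p, t in zip(p_clip, targets):
        p_t = p if t == 1 else 1.0 - p   # probability assigned to the true class
        bce += -math.log(p_t)
        focal += -((1.0 - p_t) ** gamma) * math.log(p_t)  # down-weights easy pixels
    bce /= n
    focal /= n

    return w_dice * dice + w_focal * focal + w_bce * bce
```

The Dice term responds to overlap regardless of how rare the lesion pixels are, while the focal term reduces the contribution of easily classified background pixels, which is the usual motivation for combining them under class imbalance.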

2606.06540 2026-06-08 eess.IV cs.CV 交叉投稿

ErA: Error-Aware Deep Unrolling Network for Single Image Defocus Deblurring

ErA:用于单图像散焦去模糊的误差感知深度展开网络

Tu Vo, Chan Y. Park

发表机构 * KC Machine Learning Lab(KC机器学习实验室)

AI总结 提出ErA网络,通过联合学习紧凑核基和逐像素权重,并利用增广拉格朗日展开中的误差感知项交替更新和ResUNet去噪器校正核估计误差,在多个数据集上达到最优性能。

详情
AI中文摘要

我们提出了ErA(误差感知深度展开网络),一个用于单图像散焦去模糊的端到端框架。ErA联合学习一个紧凑的核基和逐像素权重,同时增广拉格朗日展开中的一个误差感知项通过交替更新和ResUNet去噪器校正核估计误差。它在DPDD、RealDOF和RTF上达到了最先进的PSNR/SSIM,并在没有真实数据的CUHK上显示出强大的泛化能力。

英文摘要

We introduce ErA (Error-Aware Deep Unrolling Network), an end-to-end framework for single-image defocus deblurring. ErA jointly learns a compact kernel basis and per-pixel weights, while an error-aware term in Augmented Lagrangian unrolling corrects kernel estimation errors via alternating updates and ResUNet denoisers. It achieves state-of-the-art PSNR/SSIM on DPDD, RealDOF, and RTF, and shows strong generalization on CUHK without ground truth.
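A kernel basis with per-pixel weights is commonly used to model spatially varying blur as a weighted sum of a few uniformly blurred copies of the image. The 1-D sketch below illustrates that general idea only; it is an assumption that ErA's forward model has this shape, and the function names are hypothetical.

```python
def conv_same(signal, kernel):
    """1-D correlation with zero padding; output length equals input length."""
    r = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for j, k in enumerate(kernel):
            idx = i + j - r
            if 0 <= idx < len(signal):
                acc += k * signal[idx]
        out.append(acc)
    return out

def spatially_varying_blur(signal, basis, weights):
    """Blend basis-kernel responses with position-dependent weights.

    basis:   list of K kernels.
    weights: weights[k][i] is the mixing weight of kernel k at position i.
    """
    blurred = [conv_same(signal, k) for k in basis]
    return [sum(weights[k][i] * blurred[k][i] for k in range(len(basis)))
            for i in range(len(signal))]

# Identity kernel on the left two samples, 3-tap box blur on the rest.
basis = [[0.0, 1.0, 0.0], [1.0 / 3.0, 1.0 / 3.0, 1.0 / 3.0]]
weights = [[1, 1, 0, 0, 0], [0, 0, 1, 1, 1]]
result = spatially_varying_blur([0.0, 0.0, 3.0, 0.0, 0.0], basis, weights)
```

With only K basis kernels, the per-pixel blur is described by K weights per pixel rather than a full kernel per pixel, which is why such a basis is called compact.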

2606.06627 2026-06-08 cs.RO cs.AI cs.CV cs.LG 交叉投稿

What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

在日常生活人类视频上协同训练机器人操作策略时什么因素重要?

Richard Li, Aditya Prakash, Andrew Wen, Saurabh Gupta, Yilun Du, Pulkit Agrawal

发表机构 * Massachusetts Institute of Technology(麻省理工学院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Harvard University(哈佛大学)

AI总结 研究利用日常互联网视频协同训练机器人操作策略时,手部姿态质量和运动差距对迁移的影响,提出一种协同训练方法,在低机器人数据场景下六个操作任务中绝对成功率提升29.7%。

Comments The project website is here: https://richardrl.github.io/what-matters-cotraining-human-videos/index.html

详情
AI中文摘要

用于协同训练机器人操作策略的人类视频数据集主要由精心策划的演示组成,其中动作被编排成类似机器人行为,并且使用专用硬件捕获3D手部姿态。更丰富的数据源是日常互联网视频,但哪些因素能够实现从这些视频到机器人的迁移仍是一个开放问题。我们使用一个新的数据集(包含532个人类视频,共28小时的高质量三角测量手部标签和自然动作)对此进行研究。我们发现手部姿态质量影响迁移,但即使手部姿态准确,固有的运动差距也会阻碍迁移,除非视觉和策略网络针对每种具身形态进行专门化。我们的协同训练方法在低机器人数据场景下,在六个操作任务中绝对成功率提升29.7%,并带来一致的改进。

英文摘要

Human video datasets used for cotraining robot manipulation policies largely consist of curated demonstrations where motions are orchestrated to resemble robot behavior and 3D hand poses are captured with specialized hardware. A more plentiful source of data is everyday Internet video, but it is an open question what factors enable transfer from such videos to robots. We investigate this using a new dataset of 532 human videos with 28 hours of high-quality triangulated hand labels and natural motions. We find that hand pose quality affects transfer, but even with accurate hands, the inherent motion gap hinders transfer unless the vision and policy networks specialize to each embodiment. Our cotraining recipe yields consistent improvements, with an absolute success rate gain of 29.7% in the low-robot-data regime across six manipulation tasks.

2606.06725 2026-06-08 eess.IV cs.CV 交叉投稿

Compute-Optimal Network Design for Echocardiography Myocardial Segmentation and Perfusion Quantification using Neural Scaling Laws

基于神经缩放定律的超声心动图心肌分割与灌注量化的计算最优网络设计

Clara Rodrigo González, Matthieu Toulemonde, Lasha Gvinianidze, Cameron A. B. Smith, Oscar Bates, Roxy Senior, Fu Siong Ng, Meng-Xing Tang

发表机构 * Department of Bioengineering, Imperial College London(帝国理工学院生物工程系) National Heart and Lung Institute, Imperial College London(帝国理工学院国家心肺研究所) Guy’s and St. Thomas’ NHS Foundation Trust(盖伊和圣托马斯NHS基金会信托)

AI总结 应用神经缩放定律预测心肌分割性能,在CAMUS和CEUS数据集上确定最优网络大小,实现参数减少240倍且性能达最优,自动分割在心肌灌注量化中与资深心脏病专家等效。

Comments 15 pages, 4 figures, 5 tables, journal

详情
AI中文摘要

使用对比增强超声进行心肌灌注量化提供了一种床旁非电离替代核成像模态的方法。然而,其临床采用受到耗时的手动标注的限制。由于域内训练数据匮乏,自动分割已被证明具有挑战性。我们应用当前用于优化大数据集上大型语言模型的策略,将神经缩放定律应用于预测心肌分割的网络性能。我们在数据子集上外推性能,以确定CAMUS超声心动图数据集和25名患者的对比增强超声(CEUS)数据集上的最优网络大小。最后,通过将最终心肌灌注参数与资深心脏病专家获得的参数进行比较,验证了我们模型的临床实用性。基于缩放定律的外推能够预测完整数据集大小下的测试损失,使我们能够选择两个网络,在CAMUS上以240倍的参数减少获得最先进性能。我们观察到缩放定律的梯度从CAMUS迁移到CEUS数据集,但预测损失存在偏差。自动分割的掩膜在心肌灌注量化中与资深心脏病专家表现相当。这些结果确立了神经缩放定律作为小成像数据集上数据驱动计算最优模型设计的实用工具。

英文摘要

Myocardial perfusion quantification using contrast-enhanced ultrasound offers a bedside non-ionizing alternative to nuclear imaging modalities. However, its clinical adoption is hindered by time-consuming manual labelling. Automated segmentation has proved challenging due to a paucity of in-domain training data. Adapting strategies currently used to optimise large language models for large datasets, we apply neural scaling laws to predict network performance for myocardial segmentation. We extrapolate performance on subsets of the data to determine optimal network size on the CAMUS echocardiography dataset and a 25-patient contrast-enhanced ultrasound (CEUS) dataset. Finally, we validate the clinical utility of our models by comparing the final myocardial perfusion parameters with those obtained by a senior cardiologist. Extrapolation based on the scaling law is predictive of test loss at the full dataset size, allowing us to select two networks that obtained state-of-the-art performance on CAMUS with a 240-fold reduction in parameter count. We observe the gradient of the scaling law transfers from CAMUS to the CEUS dataset with a bias in the predicted losses. The automatically segmented masks perform equivalently to a senior cardiologist in myocardial perfusion quantification. These results establish neural scaling laws as a practical tool for data-driven compute-optimal model design for small imaging datasets.
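The extrapolation step can be illustrated by fitting a power law to losses measured on data subsets and predicting the loss at a larger dataset size. The pure power-law form L(N) = a·N^(-b) without an irreducible-loss term, the synthetic numbers, and the function names are all assumptions of this sketch; the abstract does not state the functional form used.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log L = log a - b * log N. Returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(l) for l in losses]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    a = math.exp(my - slope * mx)
    return a, -slope

def predict_loss(a, b, n):
    """Evaluate the fitted power law at dataset size n."""
    return a * n ** (-b)

# Synthetic subset results generated from L(N) = 2 * N^(-0.5).
subset_sizes = [100, 200, 400, 800]
subset_losses = [2.0 * n ** -0.5 for n in subset_sizes]
a, b = fit_power_law(subset_sizes, subset_losses)
full_size_loss = predict_loss(a, b, 3200)
```

Repeating the fit for several network sizes and comparing the predicted full-dataset losses is one way such a procedure could guide the choice of model size before training on all the data.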

2606.06836 2026-06-08 cs.RO cs.AI cs.CV 交叉投稿

Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation

像飞行员一样思考:细粒度长时程无人机导航

Xiangyi Zheng, Xiangyu Wang, Qinan Liao, Zimu Tang, Yue Liao, Dongyue Lyu, Guodong Wang, Junjie Liu, Si Liu

发表机构 * Colab Beihang University(北航) Meituan(美团) National University of Singapore(新加坡国立大学)

AI总结 提出FLIGHT基准和FLIGHT VLA异步架构,通过低频飞行员推理VLM与高频扩散动作模型解耦,实现无人机长时程语义指令下的平滑连续飞行控制。

详情
AI中文摘要

语言引导的无人机代理必须执行长时程语义指令,同时产生平滑、物理可行的连续飞行命令,然而现有的视觉语言导航(VLN)基准通常使用离散或粗粒度的动作,而现有的无人机视觉-语言-动作(VLA)任务则专注于短时、原子化的机动。为了解决无人机任务设置中的这一空白,我们引入了FLIGHT,一个用于混合无人机导航与推理任务的细粒度长时程指令引导基准,该基准结合了多阶段指令与密集的6-DoF轨迹注释,分为两个数据集:细粒度VLN和长时程流。为了使无人机代理具备对任务执行状态和任务规划进行实时飞行推理的能力,同时适应高频、实时的精确控制,我们进一步提出了FLIGHT VLA,一种异步架构,将用于任务状态推理的低频流式飞行员视觉语言模型(VLM)与用于连续控制的高频扩散动作模型解耦,并由显式的飞行员推理文本进行监督,该文本总结了当前飞行状态并预测下一个子目标。在闭环评估中,FLIGHT VLA在我们的FLIGHT基准上持续优于代表性的VLN和VLA基线,实现了更强的多阶段完成、子目标遵循和终端控制。其训练的流式飞行员推理VLM进一步提升了无人机视频推理,验证了我们设计的有效性。

英文摘要

Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce FLIGHT, a Fine-grained Long-horizon Instruction-Guided benchmark for Hybrid UAV navigation and reasoning Tasks, which combines multi-stage instructions with dense 6-DoF trajectory annotations across two dataset splits: Fine-grained VLN and Long-horizon Flow. To endow the UAV agent with the capability of real-time in-flight reasoning over task execution status and mission planning, while simultaneously accommodating high-frequency, real-time precise control, we further propose FLIGHT VLA, an asynchronous architecture that decouples a low-frequency Streaming Pilot Vision-Language Model (VLM) for task-state reasoning from a high-frequency diffusion action model for continuous control, supervised by explicit Pilot Reasoning texts that summarize the current flight state and anticipate the next subgoal. In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on our FLIGHT benchmarks, achieving stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.

2606.06847 2026-06-08 eess.IV cs.CV 交叉投稿

Physics-Driven Semantic Scattering Structure Understanding of Aircraft Target in SAR Images

SAR图像中飞机目标的物理驱动语义散射结构理解

Yifei Yin, Xiaogang Yu, Hao Shi, Liang Chen, Wei Li

发表机构 * School of Information and Electronics, Beijing Institute of Technology(北京理工大学信息与电子学院) National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing(空间智能信息处理国家级重点实验室) Beijing Institute of Remote Sensing Information(北京市遥感信息研究所)

AI总结 针对SAR图像中飞机目标散射中心表示不稳定、弱散射部件缺失的问题,提出物理驱动框架S3U-SAR,通过定义语义散射关键点并利用多维物理先验约束,实现完整拓扑结构重建,在基准数据集上取得最优性能。

详情
AI中文摘要

合成孔径雷达(SAR)因其全天时、全天候观测能力,已成为目标解译不可或缺的手段。在SAR目标解译中,电磁散射信息提供了超越视觉纹理的物理基础线索,并被广泛用于目标解译。然而,现有方法仍以局部散射中心表示为主。这种无序且与部件无关的表示对飞机目标极不稳定。因此,物理存在的弱散射响应部件常被遗漏,导致重建的拓扑结构不完整。为解决这一局限,我们建立了语义散射结构理解作为SAR飞机解译的新范式。定义语义散射关键点以将局部电磁响应与物理上有意义的飞机部件关联,同时引入可见性感知属性以保留弱可观测但物理存在的部件。关键点进一步组织为稳定的语义散射结构。基于此,我们提出S3U-SAR,一个物理驱动框架,用于定位语义散射关键点并构建由多维物理先验(包括散射异质性、刚体拓扑、散斑不确定性)约束的完整表示。进一步引入置信门控联合监督策略以缓解优化冲突。我们构建了KP-SAR-Aircraft-1.0,首个用于语义散射结构理解的细粒度基准。大量实验表明,S3U-SAR相比基线取得了最佳性能。跨类别和跨数据集评估进一步验证了其鲁棒性和可迁移性。

英文摘要

Synthetic aperture radar (SAR) has become indispensable for target interpretation owing to its all-day and all-weather observation capability. In SAR target interpretation, electromagnetic scattering information provides a physically grounded cue beyond visual texture and has been widely exploited. However, existing methods remain dominated by local scattering center representations. Such unordered and component-agnostic representations are highly unstable for aircraft targets. As a result, physically present components with weak scattering responses are often missed, leaving the reconstructed topology incomplete. To address this limitation, we establish Semantic Scattering Structure Understanding as a new paradigm for SAR aircraft interpretation. Semantic scattering keypoints are defined to associate local electromagnetic responses with physically meaningful aircraft components, while visibility-aware attributes are introduced to retain weakly observable yet physically present components. The keypoints are further organized into a stable semantic scattering structure. Building upon this, we propose S3U-SAR, a physics-driven framework that localizes semantic scattering keypoints and constructs a complete representation constrained by multi-dimensional physical priors, including scattering heterogeneity, rigid-body topology, and speckle uncertainty. A confidence-gated joint supervision strategy is further introduced to alleviate optimization conflicts. We construct KP-SAR-Aircraft-1.0, the first fine-grained benchmark for semantic scattering structure understanding. Extensive experiments demonstrate that S3U-SAR outperforms the baselines. Cross-category and cross-dataset evaluations further verify its robustness and transferability.

2606.06878 2026-06-08 cs.RO cs.CV 交叉投稿

A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation

一种用于鲁棒6-DoF抓取姿态估计的跨视图融合框架

Kangjian Zhu, Haobo Jiang, Jianjun Qian, Jin Xie

发表机构 * Nanjing University of Science and Technology(南京理工大学) Nanyang Technological University(南洋理工大学) Nanjing University(南京大学)

AI总结 提出跨视图融合框架,通过辅助视图缓解遮挡,利用自监督对比学习增强点云特征的空间一致性和方向区分性,并设计跨视图对齐圆柱体集成模块融合抓取相关几何,提升角落视图下的6-DoF抓取姿态估计鲁棒性。

Comments Corresponding author: Jin Xie

详情
AI中文摘要

本文提出一种跨视图融合框架,增强了角落视图中6-DoF抓取姿态估计的鲁棒性。我们的框架通过引入辅助视图缓解遮挡,并通过后融合策略避免了耗时的、任务无关的多视图重建。为了增强跨视图融合,我们提出一种自监督对比学习策略,利用跨视图关联来正则化点云特征。简而言之,如果两个点对应相同的3D位置,则跨视图点对被视作匹配;如果它们代表不同的抓取方向,则视为不匹配。该学习策略显著增强了点特征的空间一致性和方向区分性,从而促进了跨视图融合并提高了估计鲁棒性。此外,我们提出一种跨视图对齐圆柱体集成模块,将抓取相关几何融合为综合表示。具体地,该模块首先根据相似性对齐跨视图点和特征,以增强对噪声的鲁棒性。随后,将这些点注册到圆柱坐标系中,强调对抓取重要的旋转对称几何。最后,交替使用局部自注意力和种子交叉注意力层,分别实现单视图内和跨视图间的交互,支持抓取相关几何的细粒度表示。我们的框架在GraspNet-1Billion基准测试和实际应用中均取得了强劲性能。代码可在以下网址获取:https://github.com/KJZhuAutomatic/Cross-view-Grasp

英文摘要

In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views. Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy. To enhance cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features. In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions. The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness. Furthermore, we propose a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry into a comprehensive representation. Specifically, the module first aligns the cross-view points and features according to their similarity to enhance the robustness against noise. Subsequently, these points are registered into the cylindrical coordinate frame, emphasizing the rotation-symmetric geometry which is important for grasping. Finally, local self-attention and seed cross-attention layers are alternately employed, respectively enabling interactions within single views and across views, which supports fine-grained representation of grasp-relevant geometry. Our framework achieves strong performance on the GraspNet-1Billion benchmark and in real-world applications. Code is available at https://github.com/KJZhuAutomatic/Cross-view-Grasp.
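Registering points into a cylindrical coordinate frame amounts to expressing each point as a radius, an angle, and a height relative to a chosen axis. The sketch below shows that conversion. Which axis the module uses is not stated in the abstract (a grasp approach direction would be one guess), so `origin`, `axis`, and `ref` are hypothetical inputs.

```python
import math

def to_cylindrical(point, origin, axis, ref):
    """Return (rho, theta, z) of `point` in a cylinder frame.

    axis: unit vector along the cylinder axis.
    ref:  unit vector orthogonal to `axis`, marking theta = 0.
    """
    d = [p - o for p, o in zip(point, origin)]
    z = sum(di * ai for di, ai in zip(d, axis))          # height along the axis
    radial = [di - z * ai for di, ai in zip(d, axis)]    # component off the axis
    rho = math.sqrt(sum(r * r for r in radial))
    # Third frame vector: axis x ref
    third = [axis[1] * ref[2] - axis[2] * ref[1],
             axis[2] * ref[0] - axis[0] * ref[2],
             axis[0] * ref[1] - axis[1] * ref[0]]
    theta = math.atan2(sum(r * t for r, t in zip(radial, third)),
                       sum(r * e for r, e in zip(radial, ref)))
    return rho, theta, z

rho, theta, z = to_cylindrical((0.0, 2.0, 5.0), (0.0, 0.0, 0.0),
                               (0.0, 0.0, 1.0), (1.0, 0.0, 0.0))
```

In these coordinates a rotation about the axis changes only `theta`, which is what makes rotation-symmetric geometry easy to emphasize.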

2606.06983 2026-06-08 eess.IV cs.AI cs.CV 交叉投稿

DaX: Learning General Pathology Representations Across Scales

DaX: 跨尺度的通用病理学表示学习

Bokai Zhao, Yiyang Zhang, Long Bai, Tai Ma, Hanqing Chao, Minfeng Xu

发表机构 * DAMO Academy, Alibaba Group(达摩院,阿里巴巴集团) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Hupan Lab(湖畔实验室)

AI总结 提出病理视觉基础模型DaX,通过改进DINOv3自监督学习,结合连续放大训练、跨尺度组织视图等设计,在44个公开数据集的161项临床任务上取得最佳平均性能。

详情
AI中文摘要

计算病理学需要能够跨不同临床终点迁移且对放大倍数、染色、扫描仪类型、切片制备和输入分辨率变化保持鲁棒的视觉表示。我们提出DaX,一个病理视觉基础模型,它将DINOv3风格的自监督学习适应到全切片组织病理学。DaX从自然图像DINOv3权重初始化,并融合了连续放大训练、跨尺度组织视图、方向无关和采集鲁棒的数据增强、多输入尺寸训练以及Gram锚定的密集一致性。这些设计旨在连接局部细胞形态与全局组织结构,同时稳定跨输入尺度的密集token级表示。我们进一步构建了一个WSI级基准,包含来自44个公共数据集的161项临床有意义任务,涵盖28,182名患者和34,394张切片,跨越四个临床领域和九个任务类别。所有模型在固定的患者级交叉验证协议下进行评估,并采用折叠级统计排名,从而实现可重复的比较,对分割依赖的变异性不敏感。在该基准上,DaX在任务中取得了最高的平均性能,并持续获得强大的任务级排名分数,其增益涵盖诊断病理学、生物标志物和分子谱分析、组织/标本背景以及风险、反应和预后。这些结果支持DaX作为计算病理学的可迁移视觉编码器,并为未来的病理基础模型提供了标准化的评估框架。项目页面:https://alibaba-damo-academy.github.io/DaX/benchboard/

英文摘要

Computational pathology requires visual representations that transfer across diverse clinical endpoints and remain robust to variation in magnification, staining, scanner type, slide preparation, and input resolution. We present DaX, a pathology vision foundation model that adapts DINOv3-style self-supervised learning to whole-slide histopathology. DaX is initialized from natural-image DINOv3 weights and incorporates continuous magnification training, cross-scale tissue views, orientation-agnostic and acquisition-robust augmentation, multi-input-size training, and Gram-anchored dense consistency. These designs aim to connect local cellular morphology with global tissue architecture while stabilizing dense token-level representations across input scales. We further construct a WSI-level benchmark comprising 161 clinically meaningful tasks from 44 public datasets, covering 28,182 patients and 34,394 slides across four clinical domains and nine task categories. All models are evaluated under a fixed patient-level cross-validation protocol with fold-level statistical ranking, enabling reproducible comparisons that are less sensitive to split-dependent variation. Across this benchmark, DaX achieves the highest mean performance across tasks and consistently strong task-level ranking scores, with gains spanning diagnostic pathology, biomarker and molecular profiling, tissue/specimen context, and risk, response, and prognosis. These results support DaX as a transferable visual encoder for computational pathology and provide a standardized evaluation framework for future pathology foundation models. Project page: https://alibaba-damo-academy.github.io/DaX/benchboard/.

2606.07016 2026-06-08 stat.AP cs.CV 交叉投稿

An Integrated Roadside Sensing and Communication Framework for Vulnerable Road User Safety at Signalized Intersections

信号交叉口弱势道路使用者安全的集成路边感知与通信框架

Parvez Anowar

发表机构 * Department of Civil, Environmental and Construction Engineering, University of Central Florida(中央佛罗里达大学土木、环境与建设工程系)

AI总结 提出集成多模态感知、边缘计算、V2X/P2X通信和自适应信号控制的框架,基于公开数据集R-LiViT分析53,319个标注,发现VRU占49%、昼夜密度差异大、近距离事件变化10倍、83%行人边界框小,支持多模态感知和自适应部署。

Comments 17 pages, 5 figures, 2 tables. Preprint

详情
AI中文摘要

弱势道路使用者(VRU)约占全球城市交通死亡人数的一半,而交叉口集中了不成比例的伤亡。最近关于VRU保护的感知技术综述列举了数十种单传感器和双传感器部署,但所调查的系统均未将多模态感知与边缘侧近碰撞分析以及双向车联万物(V2X)和行人联万物(P2X)消息传递集成在单个交叉口机柜中。本文提出一个信号交叉口VRU保护的综合框架,在感知层结合LiDAR、雷达、RGB相机和热成像相机,在计算层进行基于边缘的预测和替代安全分析,在通信层进行V2X和P2X消息传递,在驱动层进行自适应信号控制。该框架基于使用R-LiViT(首个公开的路边LiDAR-视觉-热成像数据集)的实证案例研究,该数据集提供了200个多模态序列和2,400个标注的RGB-T帧,来自三个德国交叉口。对53,319个检测标注的分析显示,VRU约占所有道路使用者观测的49%;从白天到夜晚,行人密度下降38%,车辆下降45%,而夜间分布显示更高的近距离比例;在三个交叉口的八个独特位置,每帧近距离事件计数变化约10倍;83%的行人边界框在图像空间中较小,表明VRU通常远离任何单个传感器。这些发现支持多模态感知、边缘侧分析和自适应上下文感知部署,而非统一的单传感器解决方案。

英文摘要

Vulnerable road users (VRUs) account for approximately half of urban traffic deaths globally, with intersections concentrating a disproportionate share of these casualties. Recent reviews of sensing technology for VRU protection have cataloged dozens of single-sensor and dual-sensor deployments, yet none of the surveyed systems couples multi-modal sensing with edge-side near-miss analytics and bidirectional vehicle-to-everything (V2X) and pedestrian-to-everything (P2X) messaging in a single intersection cabinet. This paper presents an integrated framework for VRU protection at signalized intersections, combining LiDAR, radar, RGB camera, and thermal camera at the perception layer, edge-based prediction and surrogate-safety analytics at the computation layer, V2X and P2X messaging at the communication layer, and adaptive signal control at the actuation layer. The framework is grounded in an empirical case study using R-LiViT, the first publicly released roadside LiDAR-Visual-Thermal dataset, which provides 200 multi-modal sequences and 2,400 annotated RGB-T frames at three German intersections. Analysis of 53,319 detection annotations reveals that VRUs comprise approximately 49% of all road-user observations, that day-to-night density drops by 38% for pedestrians and 45% for vehicles while the night distribution shows a higher close-proximity share, that per-frame close-proximity event counts vary approximately 10-fold across the eight unique locations at three intersections, and that 83% of pedestrian bounding boxes are small in image space, indicating that VRUs are typically far from any single sensor. These findings support multi-modal sensing, edge-side analytics, and adaptive context-sensitive deployment rather than uniform single-sensor solutions.

2606.07033 2026-06-08 cs.AI cs.CV 交叉投稿

Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

层次化语义约束异构图用于音视频事件定位

Zhe Yang, Ruyi Zhang, Hongtao Chen, Wenrui Li, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Peng Cheng Laboratory(鹏城实验室) Harbin Institute of Technology Suzhou Research Institute(哈尔滨工业大学苏州研究院)

AI总结 提出层次化语义约束异构图框架,通过构建异构图、双向语义约束和双曲空间层次正则化,解决开放词汇音视频事件定位中跨尺度一致性和层次语义一致性问题。

详情
AI中文摘要

开放词汇音视频事件定位(OV-AVEL)联合建模音视频线索,以识别并时间定位事件,包括训练中未见过的类别。现有方法主要在欧几里得空间中学习联合音视频表示,但仍面临两个重大挑战。首先,未见类别缺乏监督信号,难以在多个时间尺度上保持音视频一致性。其次,片段级与视频级语义之间缺乏层次约束,导致模型无法在不同层级间建立语义一致性。为解决这些挑战,我们提出一种层次化语义约束异构图(HSCHG)用于音视频事件定位框架。我们首先在欧几里得空间中构建一个异构层次图,包含音频和视觉片段节点及其对应的视频级节点。我们使用多方向时间边来捕获每个模态内的完整时间信息。同时,我们采用双阈值过滤门控融合策略,仅在对齐置信度高时引入跨模态信息。此外,我们在片段级和视频级表示之间引入双向语义约束,以实现不同层级间的语义一致性。基于此,我们将多级音视频表示和文本原型统一映射到双曲空间中。我们使用层次蕴含正则化损失来表征视频与片段之间的层次关系。大量实验结果表明,我们的方法在OV-AVEL基准上优于现有方法。消融研究进一步验证了我们方法的有效性。

英文摘要

Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose a hierarchical semantic-constrained heterogeneous graph (HSCHG) framework for audio-visual event localization. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes. We use multi-directional temporal edges to capture complete temporal information within each modality. Simultaneously, we employ a dual-threshold filtering gated fusion strategy, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels. Based on this, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.
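Mapping Euclidean features into hyperbolic space is often done with the exponential map at the origin of the Poincaré ball, exp₀(v) = tanh(√c·‖v‖)·v / (√c·‖v‖). The sketch below uses that model as an assumption; the abstract does not say which hyperbolic model or curvature the authors use, and the function names are hypothetical.

```python
import math

def expmap0(v, c=1.0):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return list(v)
    scale = math.tanh(math.sqrt(c) * norm) / (math.sqrt(c) * norm)
    return [scale * x for x in v]

def dist_from_origin(x, c=1.0):
    """Hyperbolic distance between the origin and a point x inside the ball."""
    norm = math.sqrt(sum(xi * xi for xi in x))
    return 2.0 / math.sqrt(c) * math.atanh(math.sqrt(c) * norm)

point = expmap0([0.3, 0.4])          # Euclidean norm of the input is 0.5
depth = dist_from_origin(point)      # equals 2 * 0.5 under this convention
```

Distances grow rapidly toward the boundary of the ball, which is why hyperbolic embeddings are a common choice for tree-like hierarchies such as video-to-segment relations.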

2606.07058 2026-06-08 cs.LG cs.CV math.AT stat.ML 交叉投稿

Constructing VAE Latent Spaces with Prescribed Topology

构建具有指定拓扑的VAE潜在空间

Jilles S. van Hulst, Jakub M. Tomczak, W. P. M. H. Heemels, Duarte J. Antunes

发表机构 * Control Systems Technology Section, Department of Mechanical Engineering, Eindhoven University of Technology(机械工程系控制系统技术部,埃因霍温理工大学) Nature Innovation Laboratory (NatInLab)(自然创新实验室(NatInLab))

AI总结 针对数据流形非欧几里得拓扑导致标准高斯先验不匹配的问题,提出一种构造性数学框架,通过因子化分布和重参数化技巧,为乘积覆盖空间流形(如圆柱、环面、莫比乌斯带等)设计拓扑匹配的先验,提升重建质量和表示忠实性。

Comments 16 pages, 7 figures

详情
AI中文摘要

变分自编码器(VAE)学习高维数据的低维潜在表示。当数据位于具有非欧几里得拓扑的流形上时,标准高斯先验会引入拓扑不匹配,从而降低重建质量并阻碍忠实表示。我们提出了一个构造性数学框架,解决了所有允许乘积覆盖空间的流形的这种不匹配问题。这些流形可表示为基本因子(圆、区间或直线)的乘积,或此类乘积在有限对称群下的商。该类包括圆柱、环面、莫比乌斯带、克莱因瓶和实射影空间。基本因子上的因子化分布产生具有闭式解耦KL散度的乘积拓扑,使得每个潜在因子可以独立塑造,同时保持训练可处理。我们为周期、有界和无界支撑编目了可重参数化的编码器-先验对,并提供了坐标变换,允许标准神经网络输出具有平滑梯度的非欧几里得参数。对于商流形,解码器接收覆盖空间坐标的群不变特征,使得识别点产生相同输出。锚点约束相对于数据固定坐标系或创建软拓扑孔。在合成流形和真实图像数据集(旋转和循环移位MNIST)上的实验证实,拓扑匹配的先验使KL正则化与数据流形对齐。所得到的拓扑感知模型在所有实际相关的正则化强度下均优于高斯基线。代码可从 https://github.com/JvHulst/VAE-Topology 获取。

英文摘要

Variational autoencoders (VAEs) learn low-dimensional latent representations of high-dimensional data. When the data lies on a manifold with non-Euclidean topology, the standard Gaussian prior introduces a topological mismatch that degrades reconstruction quality and prevents faithful representation. We present a constructive mathematical framework that resolves this mismatch for all manifolds that admit a product covering space. These are manifolds expressible as products of elementary factors (circles, intervals, or lines) or as quotients of such products by a finite symmetry group. The class includes cylinders, tori, Möbius strips, Klein bottles, and real projective spaces. Factorized distributions over the elementary factors yield product topologies with closed-form, decoupled KL divergences, so that each latent factor can be shaped independently while keeping training tractable. We catalogue reparametrizable encoder-prior pairs for periodic, bounded, and unbounded supports, and provide coordinate transformations that allow standard neural networks to output non-Euclidean parameters with smooth gradients. For quotient manifolds, the decoder receives group-invariant features of the covering-space coordinates, so that identified points produce identical outputs. Anchor constraints fix the coordinate system relative to the data or create soft topological holes. Experiments on synthetic manifolds and real-image datasets (rotated and cyclically shifted MNIST) confirm that a topology-matched prior aligns KL regularization with the data manifold. The resulting topology-aware models outperform the Gaussian baseline at all practically relevant regularization strengths. The code is available at https://github.com/JvHulst/VAE-Topology.
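To make the idea of a periodic latent factor concrete, the sketch below turns two unconstrained network outputs into a mean angle, adds scaled noise, and wraps the result onto the circle. The wrapped-Gaussian style sampling and the function name `circle_latent` are assumptions for illustration; the abstract says only that reparametrizable encoder-prior pairs are catalogued for periodic supports.

```python
import math

def circle_latent(a, b, sigma, eps):
    """Reparameterized sample on the unit circle.

    a, b:  unconstrained encoder outputs defining the mean direction.
    sigma: positive scale of the angular noise.
    eps:   standard-normal noise drawn outside this function.
    """
    mean_angle = math.atan2(b, a)                       # undefined direction only at a = b = 0
    theta = (mean_angle + sigma * eps) % (2 * math.pi)  # wrap the angle onto the circle
    return theta, (math.cos(theta), math.sin(theta))

theta, embedding = circle_latent(0.0, 1.0, 0.5, 0.0)
```

Passing `(cos θ, sin θ)` to the decoder instead of the raw angle avoids the artificial jump between 0 and 2π, so angles that describe the same point on the circle yield the same decoder input.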

2606.07063 2026-06-08 eess.IV cs.CV 交叉投稿

Beyond Universality: The GCC-FER Dataset and Culture-Aware Adaptation for Dynamic Facial Expression Recognition

超越普遍性:GCC-FER数据集及面向动态面部表情识别的文化感知适应

Sonalika Singh, Jyotirindra Dandapat, Avishi Razdan, Kshipra V. Moghe, Puneet Gupta, Lalan Kumar

发表机构 * Department of Electrical Engineering, Indian Institute of Technology Delhi, India(印度理工学院德里分校电气工程系) Department of Computer Science and Engineering, Indian Institute of Technology Indore, India(印度理工学院印多尔分校计算机科学与工程系) Department of Psychology, COEP Technological University, India(COEP技术大学心理学系)

AI总结 针对动态面部表情识别中文化差异被忽视的问题,提出首个大规模全球跨文化数据集GCC-FER,并设计文化感知适应系统CA-FER,通过自适应校准面部表示减轻文化偏差,实验证明其有效性。

详情
AI中文摘要

动态面部表情识别(DFER)是情感计算、人机交互和智能多媒体系统中的关键使能技术。尽管文化细微差别对FER性能有显著影响,但大多数现有FER系统假设情感表达在人群中普遍一致。这种差异可归因于不同文化中面部肌肉激活模式的系统性差异。推进跨文化FER的主要挑战在于缺乏文化多样性的基准数据集。为解决这一问题,本文引入了一个名为全球跨文化面部表情识别(GCC-FER)的新型混合多元文化视频数据集。GCC-FER包含跨越四种文化群体(非洲、高加索、东亚和南亚)的23,934个视频样本,涵盖七种基本表情,结合了对代表性不足人群的心理学家监督内部数据收集以及对现有来源的严格种族过滤。据我们所知,GCC-FER是首个旨在解决这些人口统计差距的大规模全球跨文化DFER数据集。利用该数据集,为每个文化群体推导出基于行为的文化先验,并为实际部署推导出全局先验。提出了一种文化感知FER(CA-FER)系统,通过自适应重新校准潜在面部表示来减轻文化偏差。在GCC-FER和DFEW上的大量实验表明,所提系统在多文化环境下持续提高了FER性能。

英文摘要

Dynamic Facial Expression Recognition (DFER) is a key enabling technology in affective computing, human-computer interaction, and intelligent multimedia systems. Despite the significant influence of cultural nuances on FER performance, most existing FER systems assume that emotional expressions are universally consistent across populations. Such cultural variation can be attributed to systematic differences in facial muscle activation patterns across cultures. A major challenge in advancing cross-cultural FER lies in the scarcity of culturally diverse benchmark datasets. To address this, a new hybrid multicultural video dataset termed Global Cross-Cultural Facial Expression Recognition (GCC-FER) is introduced. GCC-FER comprises 23,934 video samples spanning four cultural groups (African, Caucasian, East Asian, and South Asian) across seven basic expressions, combining psychologically supervised in-house data collection for underrepresented populations with rigorous ethnicity filtering of existing sources. To the best of our knowledge, GCC-FER is the first large-scale global cross-cultural DFER dataset designed to address these demographic gaps. Leveraging this dataset, behaviorally grounded cultural priors are derived for each cultural group, along with a global prior for practical deployment. A Culture-Aware FER (CA-FER) system is proposed to mitigate cultural bias by adaptively recalibrating latent facial representations. Extensive experiments on GCC-FER and DFEW demonstrate that the proposed system consistently improves FER performance across multicultural settings.

2606.07217 2026-06-08 cs.RO cs.CV cs.LG 交叉投稿

Robotic Policy Adaptation via Weight-Space Meta-Learning

通过权重空间元学习实现机器人策略自适应

Christian Bianchi, Siamak Yousefi, Alessio Sampieri, Andrea Roberti, Luca Rigazio, Fabio Galasso, Luca Franco

发表机构 * ItalAI University of Verona(维罗纳大学) Sapienza University of Rome(罗马萨皮恩扎大学)

AI总结 提出WIZARD框架,通过权重空间元学习从语言指令和演示视频生成任务特定LoRA参数,无需微调即可适应新任务,在LIBERO上性能提升高达14倍。

详情
AI中文摘要

视觉-语言-动作(VLA)模型正成为机器人操作的一种有前景的范式,能够从大规模演示和动作标签语料库中训练通用策略。然而,将这些模型适应新任务通常仍需要任务特定的演示、动作注释和额外的微调,使得部署成本高昂且难以扩展。我们提出WIZARD,一种权重空间元学习框架,通过为冻结的VLA策略生成任务特定的LoRA参数来避免任务特定的微调。仅凭语言指令和简短的演示视频,WIZARD即可在单次前向传播中预测相应的自适应权重,无需目标任务动作标签或测试时优化。在元训练期间,WIZARD学习将任务证据直接映射到专家LoRA更新,在权重空间中捕获任务之间的关系。在LIBERO上的实验表明,WIZARD在未见过的数据集集合上性能提升高达约2倍,在未见过的任务上提升高达约14倍。在Franka Emika Panda机器人上,WIZARD持续优于真实域自适应基线,表明生成的适配器提供了超越仿真的任务级特化。

英文摘要

Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to new tasks still typically requires task-specific demonstrations, action annotations, and additional fine-tuning, making deployment costly and difficult to scale. We propose WIZARD, a weight-space meta-learning framework that sidesteps task-specific fine-tuning by generating task-specific LoRA parameters for a frozen VLA policy. Given only a language instruction and a short demonstration video, WIZARD predicts the corresponding adaptation weights in a single forward pass, without target-task action labels or test-time optimization. During meta-training, WIZARD learns to map task evidence directly to expert LoRA updates, capturing relationships between tasks in weight space. Experiments on LIBERO show that WIZARD improves performance by up to ~2x on unseen dataset collections and up to ~14x on unseen tasks. On a Franka Emika Panda, WIZARD consistently improves over a real-domain adapted baseline, showing that generated adapters provide task-level specialization beyond simulation.
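
To make "generating LoRA parameters for a frozen policy" concrete, the sketch below applies a low-rank update to a frozen weight matrix. The shapes, the `alpha / rank` scaling convention, and the function names are assumptions for illustration; the abstract does not describe how WIZARD predicts the two factors, so they are simply given here as inputs.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def apply_lora(frozen_w, lora_b, lora_a, alpha=1.0):
    """Return W + (alpha / r) * B @ A without modifying the frozen weights.

    frozen_w: d_out x d_in, lora_b: d_out x r, lora_a: r x d_in.
    """
    rank = len(lora_a)
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(frozen_w, delta)]


# Hypothetical 2x3 frozen layer with a rank-1 adapter (values are made up).
W = [[1.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
B = [[1.0], [2.0]]          # d_out x r
A = [[0.5, 0.0, -0.5]]      # r x d_in
W_adapted = apply_lora(W, B, A, alpha=1.0)
```

Because the base weights are never overwritten, a different adapter can be swapped in per task by recomputing only the low-rank term.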

2606.07244 2026-06-08 cs.RO cs.AI cs.CV 交叉投稿

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

超越航点:面向视觉语言导航的轨迹中心航点范式

Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu, Yaowei Wang, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) Pengcheng Laboratory(鹏城实验室)

AI总结 提出轨迹航点范式,通过TSDF引导的扩散策略预测可执行轨迹,解决VLN-CE中航点不可达与规划控制不一致问题,在基准上取得最优性能。

详情
AI中文摘要

连续环境中的视觉语言导航(VLN-CE)要求智能体在类似真实世界的环境中遵循自然语言指令进行导航。大多数VLN-CE方法采用三阶段框架:航点预测器提出可导航航点,导航器选择最佳航点,低层控制器执行移动。然而,这种解耦范式常导致航点不可达或规划与控制不一致。本文提出一种称为轨迹航点的新范式,将每个候选航点锚定到可执行轨迹上。为此,我们设计了TSDF引导的扩散策略作为轨迹航点预测器,引导轨迹生成避开障碍物,从本质上保证预测航点的可达性。进一步提出轨迹增强导航器,将关联轨迹作为额外信息注入规划,实现高层语义决策与低层执行的严格一致性。在VLN-CE基准上的大量实验表明,我们的轨迹航点范式优于基线方法。

英文摘要

Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, a navigator selects the best one, and a low-level controller executes the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that injects the associated trajectory as additional information for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.

2606.07289 2026-06-08 cs.LG cs.CV 交叉投稿

Closed-Form Spectral Regularization for Multi-Task Model Merging

多任务模型融合的闭式谱正则化

Yongxian Wei, Runxi Cheng, Xingxuan Zhang, Li Shen, Chun Yuan, Peng Cui, Dacheng Tao

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) Sun Yat-sen University(中山大学) Nanyang Technological University(南洋理工大学)

AI总结 针对多任务模型融合中的干扰最小化问题,发现迭代求解器实际充当隐式谱正则化器,据此提出基于谱滤波的闭式方法SWUDI及其自适应变体SWUDI-A,显著提升效率并匹配或超越现有方法。

详情
AI中文摘要

模型融合将多个独立微调专家合并为单个多任务模型,无需任何训练数据,降低了大型基础模型的存储、服务和去中心化开发成本。最先进的融合方法将融合表述为逐层二次干扰最小化问题。尽管该问题存在精确的闭式伪逆解,但该解在实践中性能不如数百次梯度下降迭代。迭代循环主导了流程的成本,但其有效性尚未得到解释。我们重新审视这一机制,并表明迭代求解器主要并非作为优化器;相反,它充当了病态正规方程的隐式谱正则化器,其中每层干扰算子的小特征值方向放大了代理噪声。基于这一发现,我们将多任务模型融合形式化为一个带噪线性逆问题,并提出一种由逐方向滤波器参数化的谱滤波估计器。我们通过SWUDI实例化该估计器,这是一种闭式方法,结合了软指数滤波器(匹配迭代下降的梯度流轨迹)和硬top-K截断(抑制放大噪声的小特征值方向)。此外,我们提出了SWUDI-A,一种自适应变体,用逐层秩规则替换全局秩超参数,进一步提高了跨架构的鲁棒性。两种变体共享每个线性层的单个对称特征分解,且不需要训练数据或优化器状态。在四个通用基准和一个涵盖VQA、几何、图表、OCR、定位和模态融合的多模态融合基准上,我们提出的谱求解器匹配或超越了最先进的融合方法。关键的是,它们将挂钟时间减少了28-72倍,峰值GPU内存减少了高达50%。

英文摘要

Model merging combines several independently fine-tuned experts into a single multi-task model without any training data, reducing the storage, serving, and decentralized-development costs of large foundation models. State-of-the-art merging methods formulate merging as a layer-wise quadratic interference minimization problem. Although this problem admits an exact closed-form pseudoinverse solution, that solution underperforms hundreds of iterations of gradient descent in practice. The iterative loop dominates the cost of the pipeline, yet its effectiveness has remained unexplained. We revisit this regime and show that the iterative solver does not primarily act as an optimizer; rather, it serves as an implicit spectral regularizer for an ill-posed normal equation, where small-eigenvalue directions of the per-layer interference operator amplify proxy noise. Building on this finding, we formalize multi-task model merging as a noisy linear inverse problem and propose a spectral filtering estimator parameterized by a per-direction filter. We instantiate this estimator with SWUDI, a closed-form method that combines a soft exponential filter, which matches the gradient-flow trajectory of iterative descent, with a hard top-K truncation that suppresses noise-amplifying small-eigenvalue directions. Furthermore, we propose SWUDI-A, an adaptive variant that replaces the global rank hyperparameter with per-layer rank rules, further improving robustness across architectures. Both variants share a single symmetric eigendecomposition per linear layer and require no training data or optimizer state. Across four general benchmarks and a multimodal merging benchmark spanning VQA, Geometry, Chart, OCR, Grounding, and modality merging, our proposed spectral solvers match or outperform state-of-the-art merging methods. Crucially, they reduce wall-clock time by 28-72x and peak GPU memory by up to 50%.
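
The contrast between the exact pseudoinverse and a spectrally filtered solution can be sketched in the eigenbasis of a symmetric positive semi-definite operator. This is a minimal illustration under stated assumptions, not the SWUDI implementation: the eigenvalues and right-hand-side coefficients are made up, and `filtered_solution`, `t`, and `top_k` are hypothetical names. The soft filter used is the standard gradient-flow result for a quadratic objective started from zero.

```python
import math


def filtered_solution(eigenvalues, rhs_coeffs, t, top_k):
    """Solve diag(eigenvalues) x = rhs in the eigenbasis with a spectral filter.

    Soft part: gradient flow on 0.5*x'Ax - b'x started at x = 0 reaches
    x_i(t) = (1 - exp(-t * lam_i)) / lam_i * b_i after "time" t.
    Hard part: only the top_k largest-eigenvalue directions are kept.
    """
    order = sorted(range(len(eigenvalues)),
                   key=lambda i: eigenvalues[i], reverse=True)
    kept = set(order[:top_k])
    solution = []
    for i, (lam, b) in enumerate(zip(eigenvalues, rhs_coeffs)):
        if i not in kept:
            solution.append(0.0)
        elif lam == 0.0:
            solution.append(t * b)      # limit of the filter as lam -> 0
        else:
            solution.append(-math.expm1(-t * lam) / lam * b)
    return solution


lams = [4.0, 1.0, 1e-6]                                # one tiny eigenvalue
b = [4.0, 1.0, 1.0]
pinv = [bi / li for bi, li in zip(b, lams)]            # exact pseudoinverse
soft = filtered_solution(lams, b, t=50.0, top_k=3)     # soft filter only
both = filtered_solution(lams, b, t=50.0, top_k=2)     # soft filter + truncation
```

In the tiny-eigenvalue direction the pseudoinverse multiplies the right-hand side by about a million, the soft filter caps the gain at roughly `t`, and the truncation removes the direction entirely, while the well-conditioned directions are left essentially unchanged.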

2606.07374 2026-06-08 eess.SP cs.CV 交叉投稿

Beyond Backscatter: InSAR coherence from detected SAR images

超越后向散射:来自检测SAR图像的InSAR相干性

Francescopaolo Sica, Andrea Pulella, Michael Schmitt

发表机构 * Department of Aerospace Engineering, University of the Bundeswehr Munich(慕尼黑联邦国防军大学航空航天工程系) Microwaves and Radar Institute, German Aerospace Center (DLR)(德国航空航天中心 (DLR) 微波与雷达研究所)

AI总结 提出一种深度学习框架,直接从检测SAR图像回归相干性,无需精确配准,使用Residual U-Net学习后向散射幅度与相干性的关系,在多种数据集上验证了高分辨率相干性回归的准确性提升和泛化能力。

Comments 27 pages, 20 figures

详情
AI中文摘要

在这项工作中,我们提出了一个深度学习框架,用于直接从检测SAR图像进行相干性回归,无需精确配准。使用从精确配准的Sentinel-1 SLC数据导出的相干性图训练Residual U-Net,以学习后向散射幅度与相干性之间的关系。模型在12天SLC对上训练,并在不同数据集上进行评估,包括配准的SLC产品和开放存取的分析就绪数据,覆盖不同的辐射特性、几何形状和位置。实验结果表明,与现有的基于强度的方法相比,所提出的方法实现了高分辨率相干性回归,且准确性更高。该网络在多样化的地理位置以及训练时从未见过的不同时间基线之间都能很好地泛化。此外,能够在全球可用的分析就绪数据(例如通过Google Earth Engine分发的地距检测数据)上运行,使其在任务设计、变化监测和多种制图任务中能够大规模应用。

英文摘要

In this work, we propose a deep learning framework for coherence regression directly from detected SAR images, without the need for accurate coregistration. A Residual U-Net is trained using coherence maps derived from precisely coregistered Sentinel-1 SLC data to learn the relationship between backscatter magnitudes and coherence. The model is trained on 12-day SLC pairs and evaluated across different datasets, including coregistered SLC products and open access analysis-ready data, covering diverse radiometric properties, geometries, and locations. Experimental results demonstrate that the proposed method achieves high-resolution coherence regression with improved accuracy compared to existing intensity-based approaches. The network generalizes well across diverse geographical locations and even across different temporal baselines that were never seen at training time. Additionally, the ability to operate on globally available analysis-ready data, such as ground range detected data, e.g., distributed through Google Earth Engine, enables its large-scale application in mission design, change monitoring, and diverse mapping tasks.
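
For context, the quantity being regressed is conventionally estimated from two coregistered complex (SLC) images. The sketch below computes the textbook sample coherence magnitude over one estimation window; the abstract does not state which estimator produced the training targets, so treating it as this one is an assumption, and the sample values are made up.

```python
import math


def sample_coherence(s1, s2):
    """Magnitude of the sample complex coherence over one estimation window."""
    cross = sum(a * b.conjugate() for a, b in zip(s1, s2))
    power1 = sum(abs(a) ** 2 for a in s1)
    power2 = sum(abs(b) ** 2 for b in s2)
    if power1 == 0.0 or power2 == 0.0:
        return 0.0
    return abs(cross) / math.sqrt(power1 * power2)


window = [1 + 1j, 2 - 1j, -1 + 0.5j, 0.5 + 2j]
rotated = [z * complex(0, 1) for z in window]   # constant 90-degree phase shift
unrelated = [1 + 0j, 1j, -1 + 0j, -1j]
reference = [1 + 0j, 1 + 0j, 1 + 0j, 1 + 0j]

high = sample_coherence(window, rotated)        # same scene, phase offset only
low = sample_coherence(unrelated, reference)    # phases cancel in the sum
```

The estimator needs the complex phase of both acquisitions. Detected products keep only amplitude or intensity, which is why a learned mapping from backscatter to coherence is needed in that setting.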

2606.07381 2026-06-08 eess.IV cs.AI cs.CV 交叉投稿

Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios

合成病灶MR图像在低数据场景下自动局灶性皮质发育不良检测中的影响

Prabhjot Kaur, Hakim Ouaalam, Sedat Kandemirli, Sanjay P. Prabhu, Simon K. Warfield

发表机构 * Computational Radiology Laboratory(计算放射学实验室) Boston Children’s Hospital(波士顿儿童医院) Harvard Medical School(哈佛医学院)

AI总结 本研究通过条件生成网络合成FCD病灶MRI数据,评估其真实性及对自动检测的影响,发现合成数据可减少约20%标注需求,但真实数据仍更有效。

详情
AI中文摘要

背景与目的:自动检测局灶性皮质发育不良(FCD)需要大量体素级病灶勾画的MRI数据,这些数据难以获取。本研究旨在生成呈现FCD的合成MRI数据,评估其真实性,并评估其对自动FCD检测的影响,特别是在减少手动标注需求方面。方法:回顾性研究了来自多个(3个)中心的131例FCD患者和90例健康对照的T1加权(T1w)和T2加权液体衰减反转恢复(FLAIR)MRI扫描。通过将生成网络以二元FCD掩膜为条件生成合成MRI。两位神经放射科医生从14张真实和14张合成扫描的随机集合中识别真实图像。训练了三个nnU-Net模型用于检测FCD,分别使用:(i)仅真实数据(35例FCD/35例对照),(ii)真实数据(35例FCD/35例对照)加合成增强,以及(iii)扩展的真实数据(70例FCD/70例对照)。结果:专家区分真实与合成图像的能力有限,T1w分类准确率为60%,FLAIR为70%(评分者间一致性kappa=0.86)。用合成数据增强自动FCD检测使灵敏度提高8.14%(p=0.12),并改善了模型在真实病灶部位的置信度(0.83±0.11至0.89±0.12;p=0.02)。扩展真实数据模型进一步将灵敏度提高至73.8%(p<0.001),置信度提高至0.90±0.14(p=0.01)。结论:条件生成网络可以生成逼真的合成FCD-MRI,在保持同等灵敏度的情况下减少约20%的标注数据需求。当可用时,等量的真实数据仍比合成增强更有效。

英文摘要

Background and Purpose: Automated detection of focal cortical dysplasia (FCD) requires large volumes of voxelwise lesion-delineated MRI data, which are difficult to acquire. This study aims to generate synthetic MRI data exhibiting FCD, assess their realism, and evaluate their impact on automated FCD detection, particularly in reducing the need for manual annotations. Methods: T1-weighted (T1w) and T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR) MRI scans from 131 FCD patients and 90 healthy controls from multiple (3) sites were retrospectively studied. Synthetic MRIs were generated by conditioning a generative network on binary FCD masks. Two neuroradiologists identified real images from a random set of 14 real and 14 synthetic scans. Three nnU-Net models were trained to detect FCD using: (i) real-only (35 FCD / 35 controls), (ii) real (35 FCD / 35 controls) plus synthetic augmentation, and (iii) expanded real data (70 FCD / 70 controls). Results: Experts showed limited ability to distinguish real from synthetic images, with classification accuracy of 60% for T1w and 70% for FLAIR (inter-rater agreement kappa = 0.86). Augmenting automated FCD detection with synthetic data increased sensitivity by 8.14% (p = 0.12) and improved model confidence at true lesion sites (0.83 +/- 0.11 to 0.89 +/- 0.12; p = 0.02). The expanded real-data model further improved sensitivity to 73.8% (p < 0.001) and confidence to 0.90 +/- 0.14 (p = 0.01). Conclusion: Conditional generative networks can generate realistic synthetic FCD-MRIs, reducing labeled data needs by approximately 20% while maintaining equivalent sensitivity. Equivalent amounts of real data, when available, remain more effective than synthetic augmentation.
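
The inter-rater agreement figure can be read against the usual definition of kappa. The sketch below computes Cohen's kappa for two raters; the abstract says only "kappa", so the Cohen variant is an assumption, and the labels are invented for illustration rather than taken from the study.

```python
def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Fraction of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's label frequencies.
    expected = sum((rater_a.count(l) / n) * (rater_b.count(l) / n)
                   for l in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Made-up labels for eight scans ("R" = judged real, "S" = judged synthetic).
rater_1 = ["R", "R", "S", "S", "R", "S", "R", "S"]
rater_2 = ["R", "R", "S", "S", "R", "S", "S", "S"]
kappa = cohens_kappa(rater_1, rater_2)
```

Kappa measures how consistently the two raters label the same scans, not how often they are right, so high agreement is compatible with modest accuracy at telling real from synthetic.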

2606.07464 2026-06-08 cs.RO cs.AI cs.CV 交叉投稿

Planning-aligned Token Compression for Long-Context Autonomous Driving

面向长上下文自动驾驶的规划对齐令牌压缩

Zhixuan Liang, Yuxiao Chen, Yurong You, Peter Karkus, Wenhao Ding, Boyi Li, Alexander Popov, Yan Wang, Maximilian Igl, Yiming Li, Danfei Xu, Nikolai Smolyanskiy, Boris Ivanovic, Ping Luo, Marco Pavone

发表机构 * NVIDIA Research(NVIDIA研究) School of Computing and Data Science, The University of Hong Kong(计算与数据科学学院,香港大学)

AI总结 提出COMPACT-VA框架,基于条件VQ-VAE将长上下文压缩为有界表示,通过规划对齐实现决策关键信息保留,在动态场景中成功率提升超6%,速度提升3.3倍。

Comments 9 pages

详情
AI中文摘要

整体视觉-动作模型代表了自动驾驶中的一种新兴范式。然而,这种架构在编码用于复杂交互的扩展时间上下文时,会产生迅速超过实时计算预算的令牌序列。虽然线性变换器和外部记忆等方法试图使上下文轻量化,但令牌压缩与架构最为兼容,因为它不需要修改主干网络。然而,现有的压缩采用基于规则的启发式方法(如时间衰减),与规划解耦,存在丢失决策关键信息的风险。我们提出COMPACT-VA,一种基于条件VQ-VAE的规划对齐工作记忆框架,将扩展上下文压缩为有界表示。压缩条件同时基于历史轨迹和学习的规划意图,其中后验编码器在训练期间从未来轨迹中提炼规划意图,而先验编码器学习从压缩观测中预测它。压缩记忆与预测的潜在变量拼接,输入策略进行端到端优化,从而在保留决策关键信息的情况下进行规划。我们在历史上下文对行为正确性(如停车、让行或前行)最关键的高信号动态场景中进行评估,并相应地设计了行为指标。在可比的令牌预算下,我们在成功率上实现了超过6%的提升(68.3%),且各项指标一致提升。消融实验验证了规划对齐耦合的有效性。闭环评估证实,与未压缩处理相比,COMPACT-VA在保持一般驾驶性能的同时实现了3.3倍的速度提升和2.7倍的内存减少。

英文摘要

Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches like linear transformers and external memory try to make the context lightweight, token compression is most compatible with the architecture as it requires no backbone modifications. Yet existing compression adopts rule-based heuristics like temporal decay, decoupled from planning, risking loss of decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on conditional VQ-VAE, compressing extended context into bounded representations. Compression is conditioned on both historical trajectory and a learned planning intent that the posterior encoder distills from future trajectories during training, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, feeds the policy for end-to-end optimization, planning with retained decision-critical information. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and accordingly design behavioral metrics. Under comparable token budgets, we achieve >6% improvement (68.3%) on success rates with consistent gains across metrics. Ablations validate planning-aligned coupling effectiveness. Closed-loop evaluation confirms that COMPACT-VA maintains general driving performance with 3.3x speedup and 2.7x memory reduction over uncompressed processing.

2406.00636 2026-06-08 cs.CV 版本更新

T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences

T2LM:基于多句子的长期3D人体运动生成

Taeryung Lee, Fabien Baradel, Thomas Lucas, Kyoung Mu Lee, Gregory Rogez

发表机构 * IPAI & ASRI(IPAI与ASRI) Dept. of ECE, Seoul National University(电子工程系,首尔国立大学) NAVER LABS Europe(NAVER欧洲实验室)

AI总结 提出T2LM框架,利用1D卷积VQVAE和Transformer文本编码器,无需顺序数据即可从多句子生成连续长期3D人体运动,优于先前方法且与单动作SOTA竞争。

Comments CVPR 2024 HuMoGen Workshop

详情
AI中文摘要

本文解决了长期3D人体运动生成的挑战性问题。具体而言,我们旨在从多个句子(即段落)流中生成平滑连接的长时间动作序列。先前的长期运动生成方法大多基于循环方法,使用先前生成的运动块作为下一步的输入。然而,这种方法有两个缺点:1)依赖顺序数据集,成本高昂;2)这些方法在每一步生成的运动之间产生不切实际的间隙。为了解决这些问题,我们引入了简单而有效的T2LM,一个无需顺序数据即可训练的连续长期生成框架。T2LM包含两个组件:一个1D卷积VQVAE,训练将运动压缩为潜在向量序列;以及一个基于Transformer的文本编码器,根据输入文本预测潜在序列。在推理时,一个句子序列被翻译成连续的潜在向量流,然后由VQVAE解码器解码为运动;使用具有局部时间感受野的1D卷积避免了训练序列和生成序列之间的时间不一致性。VQ-VAE上的这个简单约束使其仅用短序列训练即可产生更平滑的过渡。T2LM优于先前的长期生成模型,同时克服了需要顺序数据的限制;它也与最先进的单动作生成模型具有竞争力。

英文摘要

In this paper, we address the challenging problem of long-term 3D human motion generation. Specifically, we aim to generate a long sequence of smoothly connected actions from a stream of multiple sentences (i.e., paragraph). Previous long-term motion generating approaches were mostly based on recurrent methods, using previously generated motion chunks as input for the next step. However, this approach has two drawbacks: 1) it relies on sequential datasets, which are expensive; 2) these methods yield unrealistic gaps between motions generated at each step. To address these issues, we introduce simple yet effective T2LM, a continuous long-term generation framework that can be trained without sequential data. T2LM comprises two components: a 1D-convolutional VQVAE, trained to compress motion to sequences of latent vectors, and a Transformer-based Text Encoder that predicts a latent sequence given an input text. At inference, a sequence of sentences is translated into a continuous stream of latent vectors. This is then decoded into a motion by the VQVAE decoder; the use of 1D convolutions with a local temporal receptive field avoids temporal inconsistencies between training and generated sequences. This simple constraint on the VQ-VAE allows it to be trained with short sequences only and produces smoother transitions. T2LM outperforms prior long-term generation models while overcoming the constraint of requiring sequential data; it is also competitive with SOTA single-action generation models.
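
The vector-quantization step at the heart of a VQ-VAE can be sketched as a nearest-neighbour lookup into a codebook. This is a generic illustration, not the T2LM code: the codebook, the latent vectors, and the `quantize` helper are hypothetical, and the 1D-convolutional encoder and decoder around this step are omitted.

```python
def quantize(latents, codebook):
    """Map each latent vector to the index of its nearest codebook entry."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
            for z in latents]


codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # hypothetical 3-entry codebook
latents = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.7]]   # one latent vector per time step
indices = quantize(latents, codebook)
decoder_inputs = [codebook[k] for k in indices]   # what a decoder would receive
```

Each time step is quantized independently, so latent sequences predicted for consecutive sentences can be concatenated into one stream before decoding.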

2408.08973 2026-06-08 cs.CV 版本更新

Image class translation: visual inspection of class-specific hypotheticals and classification based on translation distance

图像类别翻译:类别特定假设的视觉检查与基于翻译距离的分类

Mikyla K. Bowen, Jesse W. Wilson

发表机构 * College of Natural Sciences, Colorado State University, Colorado, United States of America(科罗拉多州立大学自然科学院) School of Biomedical and Chemical Engineering, Colorado State University, Colorado, United States of America(科罗拉多州立大学生物医学与化学工程学院) Department of Electrical and Computer Engineering, Colorado State University, Colorado, United States of America(科罗拉多州立大学电气与计算机工程学院)

AI总结 提出图像翻译网络用于分类,通过翻译距离作为低维特征进行分类,在皮肤镜和骨髓细胞图像上验证,可解释性优于传统CNN。

Comments 47 pages, 20 figures, submitted revision to SPIE J. Medical Imaging

详情
AI中文摘要

目的:人工智能在医学应用中的主要障碍是自动CNN缺乏可解释性,并且对错误决策(尤其是域外样本)有高置信度。我们提出图像翻译网络用于图像分类的泛化,并展示翻译网络作为传统黑盒分类器更可解释的替代方案的潜力。方法:我们训练一个图像到图像网络,将输入图像翻译为类别特定的假设,然后通过视觉和定量方式将这些假设与输入进行比较。翻译距离(即为了符合某一类别所需的改变程度)被检查其聚类和趋势,并用作分类的简单低维特征向量。结果:在黑色素瘤/良性皮肤镜图像上,翻译距离分类器仅使用2维特征空间就达到了80%的准确率(而传统CNN使用约62,000维特征空间达到85%)。对渲染图像的视觉检查揭示了数据集偏差,例如黑色素瘤照片中比良性病变有更多的比例尺。翻译距离空间中的图像分布揭示了沿着皮肤科医生活检决策的自然分离,而不是恶性与良性之间的分离。在骨髓细胞学图像上,翻译距离分类器在3类(92%准确率对比CNN的89%)和6类(90%对比86%)场景中均优于传统CNN。结论:这一概念验证表明,图像到图像翻译有潜力超越艺术/风格变化,揭示数据集偏差,进行降维和数据集可视化,并且在某些情况下可能优于传统的端到端CNN分类器。

英文摘要

Purpose: A major barrier to the implementation of artificial intelligence for medical applications is automated CNNs' lack of explainability and high confidence for incorrect decisions, specifically with out-of-domain samples. We propose a generalization of image translation networks for image classification and demonstrate translation networks' potential as a more interpretable alternative to conventional black-box classifiers. Approach: We train an image-to-image network to translate an input image to class-specific hypotheticals, and then compare these with the input, both visually and quantitatively. Translation distances, the degree of alteration needed to conform to one class or another, are examined for clusters and trends, and used as a simple low-dimensional feature vector for classification. Results: On melanoma/benign dermoscopy images, a translation distance classifier achieved 80% accuracy using only a 2-dimensional feature space (versus 85% for a conventional CNN using a ~62,000-dimensional feature space). Visual inspection of rendered images revealed dataset biases, like more scalebars in melanoma photographs than in benign lesions. Image distributions in translation distance space revealed a natural separation along the lines of dermatologist decision to biopsy, rather than between malignant and benign. On bone marrow cytology images, translation distance classifiers outperformed a conventional CNN in both 3-class (92% accuracy vs 89% for CNN) and 6-class (90% vs 86% for CNN) scenarios. Conclusions: This proof-of-concept shows the potential for image-to-image translation to go beyond artistic/stylistic changes and to expose dataset biases, perform dimension reduction and dataset visualization, and in some cases, potentially outperform conventional end-to-end CNN classifiers.

2506.01850 2026-06-08 cs.CV cs.AI cs.LG cs.MM 版本更新

MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

MoDA: 面向指令型多模态大语言模型的细粒度视觉定位的调制适配器

Wayner Barrios, Andrés Villa, Juan León Alcázar, SouYoung Jin, Bernard Ghanem

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MoDA调制适配器,通过指令引导的通道级乘法调制增强细粒度视觉定位,在12个基准上对三种MLLM架构取得一致提升,计算开销极小。

Comments Accepted at ICML 2026. Code is available at https://github.com/waybarrios/MoDA

详情
AI中文摘要

多模态大语言模型(MLLMs)通过将预训练的视觉编码器与大语言模型(LLMs)集成,在指令跟随任务中取得了显著成功。然而,现有方法由于视觉补丁表示中的语义纠缠,常常难以实现细粒度的视觉定位,其中单个补丁混合了多个不同的视觉元素,使得模型难以聚焦于指令相关的细节。为了应对这一挑战,我们提出了MoDA(调制适配器),一种轻量级模块,通过指令引导的通道级调制增强视觉定位。与Q-Former等执行加性特征选择的令牌级方法不同,MoDA通过对已对齐特征进行乘法调制在通道级操作,从而实现对每个指令相关嵌入维度的细粒度控制。遵循标准的LLaVA训练协议,MoDA在语言指令与预对齐的视觉特征之间应用交叉注意力,生成动态调制掩码,无需架构修改或额外监督。我们在涵盖视觉问答、视觉中心推理和幻觉检测的12个基准上评估了MoDA,包括最近的2024年基准(MMVP、CV-Bench、MMStar、RealWorldQA),并在三种不同的MLLM架构上进行了测试:LLaVA-1.5、LLaVA-MoRE(2025)和Qwen3-VL(2025)。MoDA在所有三个系列中均取得了一致的提升,在LLaVA-1.5系列的MMVP上提升了+12.0个百分点,在LLaVA-MoRE系列的ScienceQA上提升了+4.8个百分点,在Qwen3-VL上ScienceQA提升了+4.9、RealWorldQA提升了+4.1、GQA提升了+3.8,证实了这些增益在CLIP编码器之外具有泛化性,且计算开销极小(<1% FLOPs)。代码可在https://github.com/waybarrios/MoDA获取。

英文摘要

Multimodal Large Language Models (MLLMs) have achieved remarkable success in instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle with fine-grained visual grounding due to semantic entanglement in visual patch representations, where individual patches blend multiple distinct visual elements, making it difficult for models to focus on instruction-relevant details. To address this challenge, we propose MoDA (Modulation Adapter), a lightweight module that enhances visual grounding through instruction-guided channel-wise modulation. Unlike token-level methods such as Q-Former that perform additive feature selection, MoDA operates at the channel level through multiplicative modulation on already-aligned features, enabling fine-grained control over which embedding dimensions are relevant for each instruction. Following the standard LLaVA training protocol, MoDA applies cross-attention between language instructions and pre-aligned visual features, generating dynamic modulation masks without architectural modifications or additional supervision. We evaluate MoDA across 12 benchmarks spanning visual question answering, vision-centric reasoning, and hallucination detection, including recent 2024 benchmarks (MMVP, CV-Bench, MMStar, RealWorldQA), on three distinct MLLM architectures: LLaVA-1.5, LLaVA-MoRE (2025), and Qwen3-VL (2025). MoDA delivers consistent gains across all three families, with +12.0 points on MMVP for the LLaVA-1.5 family and +4.8 points on ScienceQA for the LLaVA-MoRE family, and +4.9 ScienceQA, +4.1 RealWorldQA, and +3.8 GQA on Qwen3-VL, confirming that the gains generalize beyond CLIP-based encoders with minimal overhead (<1% FLOPs). Code is available at https://github.com/waybarrios/MoDA.
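
Channel-wise multiplicative modulation can be sketched as a gate applied to every embedding dimension of each visual token. The sigmoid gate, the single per-channel mask shared across tokens, and the names below are assumptions for illustration; the abstract says the mask comes from cross-attention between the instruction and the visual features but does not give its exact form.

```python
import math


def modulate(visual_tokens, channel_logits):
    """Scale every channel of every visual token by a sigmoid gate in (0, 1)."""
    gates = [1.0 / (1.0 + math.exp(-g)) for g in channel_logits]
    return [[x * g for x, g in zip(token, gates)] for token in visual_tokens]


# Two hypothetical visual tokens with three channels each.
tokens = [[2.0, 3.0, 4.0],
          [1.0, 1.0, 1.0]]
# Made-up logits: keep channel 0 at half strength, pass channel 1, suppress channel 2.
logits = [0.0, 100.0, -100.0]
modulated = modulate(tokens, logits)
```

Unlike additive selection over tokens, the gate leaves the number of tokens unchanged and only rescales which embedding dimensions contribute.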

2509.24935 2026-06-08 cs.CV cs.AI cs.LG 版本更新

Scalable GANs with Transformers

可扩展的Transformer生成对抗网络

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

发表机构 * KAIST(韩国科学技术院)

AI总结 本文通过紧凑变分自编码器潜在空间和纯Transformer架构,研究了生成对抗网络的可扩展性,并提出了轻量级中间监督和宽度自适应学习率调整来解决缩放时的失败模式,在ImageNet-256上以40个epoch达到2.96的FID。

Comments ICML 2026

详情
AI中文摘要

可扩展性推动了生成建模的最新进展,但其原理在对抗学习中仍未充分探索。我们通过两个在其他类型生成模型中被证明有效的设计选择来研究生成对抗网络(GAN)的可扩展性:在紧凑的变分自编码器潜在空间中训练,以及采用纯Transformer的生成器和判别器。在潜在空间中训练能够在保持感知保真度的同时实现高效计算,而这种效率与普通Transformer自然匹配,后者的性能随计算预算扩展。基于这些选择,我们分析了朴素缩放GAN时出现的失败模式。具体来说,我们发现了随着网络规模扩大,生成器早期层利用不足和优化不稳定的问题。因此,我们提供了简单且对缩放友好的解决方案,如轻量级中间监督和宽度自适应学习率调整。我们的实验表明,GAT——一种纯Transformer的潜在空间GAN——能够在从S到XL的广泛容量范围内可靠地训练。此外,GAT-XL/2在ImageNet-256上仅用40个epoch(比强基线少6倍)就达到了最先进的单步类条件生成性能(FID为2.96)。项目页面:https://hse1032.github.io/GAT。

英文摘要

Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues such as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions such as lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based, latent-space GAN, can be trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines. Project page: https://hse1032.github.io/GAT.

2511.05949 2026-06-08 cs.CV 版本更新

Zero-Shot Polygon Matching with Pre-trained Models for Pose Estimation and Polygon Cloud from Challenging Stereo

基于预训练模型的零样本多边形匹配用于挑战性立体图像的姿态估计和多边形云

Chang Li, Xingtao Peng

发表机构 * Chang Li(李昌) Xingtao Peng(彭兴涛)

AI总结 提出首个零样本多边形匹配范式Z(PM)2,结合预训练模型和手工几何约束,通过双向金字塔匹配和局部-整体二分图优化解决视差不连续、尺度变化等问题,在姿态估计和3D表示中取得领先性能。

详情
AI中文摘要

尽管立体匹配在0D点和1D线基元上已经成熟,但由于视差不连续、尺度变化、训练依赖和泛化能力差等挑战,2D多边形的对应关系建立仍基本未被探索,限制了姿态估计和3D重建等下游任务。为了解决这些问题,我们首次提出了一种基于预训练模型的零样本多边形匹配范式(即Z(PM)2),通过即插即用模块结合学习特征和手工几何约束,将匹配从0D/1D基元扩展到2D多边形。该流程包括三个核心阶段:首先,检测器利用预训练的segment anything模型将分割掩码矢量化成图结构的多边形,融合几何和纹理;其次,全局匹配器使用双向金字塔和多几何约束处理视角变化;第三,局部匹配器利用局部-整体二分图优化解决视差不连续和拓扑不一致。此外,我们开发了多边形匹配引导的姿态估计,利用对应关系获得分布良好、低冗余的同名点,并首创多边形云概念及最优表面生成方法,生成结构完整、语义丰富的3D表示,超越点云和线云。由于没有可直接比较的立体图像多边形匹配方法,我们选择了最接近该任务的最先进方法作为基线。在五个具有挑战性的数据集(ISPRS、KITTI、ScanNet、SceneFlow、DTU)上的大量实验表明,Z(PM)2实现了68.60%的匹配面积分数,比MESA高出约32%,在区域级姿态估计中排名第一,具有竞争力的速度和强大的零样本泛化能力,无需任何训练要求。

英文摘要

While stereo matching has achieved maturity for 0D point and 1D line primitives, establishing correspondences for 2D polygons remains largely unexplored due to challenges including disparity discontinuity, scale variation, training dependency, and poor generalization, limiting downstream tasks such as pose estimation and 3D reconstruction. To address these issues, we are the first to propose a Zero-shot Polygon Matching paradigm with Pre-trained Models (i.e., Z(PM)2), which combines learned features and handcrafted geometric constraints through plug-and-play modules, extending matching from 0D/1D primitives to 2D polygons. The pipeline comprises three core stages: first, a detector leverages the pre-trained Segment Anything Model to vectorize segmentation masks into graph-structured polygons integrating geometry and texture; second, a global matcher uses bidirectional-pyramid and multi-geometric constraints to handle viewpoint variation; third, a local matcher leverages local-holistic bipartite graph optimization to resolve disparity discontinuity and topological inconsistency. Moreover, we develop polygon-matching-guided pose estimation using correspondences to obtain well-distributed, low-redundancy homologous points, and pioneer the polygon cloud concept with an optimal surface generation method, producing structurally complete and semantically rich 3D representations beyond point and line clouds. Since no polygon matching methods from stereo imagery are available for direct comparison, we selected state-of-the-art (SoTA) methods close to this task as baselines. Extensive experiments on five challenging datasets (ISPRS, KITTI, ScanNet, SceneFlow, DTU) show that Z(PM)2 achieves a 68.60% matching area score, outperforming MESA by approximately 32% and ranking first in area-level pose estimation, with competitive speed and strong zero-shot generalization without any training requirement.

2511.06080 2026-06-08 cs.CV cs.CY cs.HC 版本更新

AIDEN: Design and Pilot Study of an AI Assistant for the Visually Impaired

AIDEN:面向视障人士的AI助手设计与初步研究

Luis Marquez-Carpintero, Francisco Gomez-Donoso, Zuria Bauer, Bessie Dominguez-Dager, Alvaro Belmonte-Baeza, Mónica Pina-Navarro, Francisco Morillas-Espejo, Felix Escalona, Miguel Cazorla

发表机构 * Institute for Computer Research, University of Alicante(计算机研究所,阿利坎特大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出AIDEN系统,结合YOLO实时目标检测、LLaVA场景描述与OCR,以及基于盖革计数器隐喻的连续触觉引导,避免听觉过载并保护隐私,实验表明用户满意度高。

详情
AI中文摘要

本文介绍了AIDEN,一种基于人工智能的助手,旨在增强视障人士的自主性和日常生活质量,他们通常在物体识别、文本阅读和陌生环境导航方面遇到困难。现有的解决方案如屏幕阅读器或基于音频的助手虽然便于获取信息,但常常导致听觉过载,并在开放环境中引发隐私问题。AIDEN通过一种混合架构解决了这些限制,该架构集成了用于实时目标检测的YOLO(You Only Look Once)和用于场景描述及光学字符识别(OCR)的大型语言与视觉助手(LLaVA)。该系统的一个关键创新是基于盖革计数器隐喻的连续触觉引导机制,该机制在不占用听觉通道的情况下支持物体居中,同时通过确保不存储个人数据来保护隐私。与视障参与者进行的实证评估使用技术接受模型(TAM)评估了感知易用性和接受度。结果表明用户满意度高,特别是在直观性和感知自主性方面。此外,“寻找物体”功能实现了有效的实时性能。这些发现提供了有希望的证据,表明与传统的以音频为中心的方法相比,多模态触觉-视觉反馈可以改善日常可用性和独立性,从而推动更大规模的临床验证。

英文摘要

This paper presents AIDEN, an artificial intelligence-based assistant designed to enhance the autonomy and daily quality of life of visually impaired individuals, who often struggle with object identification, text reading, and navigation in unfamiliar environments. Existing solutions such as screen readers or audio-based assistants facilitate access to information but frequently lead to auditory overload and raise privacy concerns in open environments. AIDEN addresses these limitations with a hybrid architecture that integrates You Only Look Once (YOLO) for real-time object detection and a Large Language and Vision Assistant (LLaVA) for scene description and Optical Character Recognition (OCR). A key novelty of the system is a continuous haptic guidance mechanism based on a Geiger-counter metaphor, which supports object centering without occupying the auditory channel, while privacy is preserved by ensuring that no personal data are stored. Empirical evaluations with visually impaired participants assessed perceived ease of use and acceptance using the Technology Acceptance Model (TAM). Results indicate high user satisfaction, particularly regarding intuitiveness and perceived autonomy. Moreover, the "Find an Object" feature achieved effective real-time performance. These findings provide promising evidence that multimodal haptic-visual feedback can improve daily usability and independence compared to traditional audio-centric methods, motivating larger-scale clinical validations.

2511.14019 2026-06-08 cs.CV 版本更新

RISE: Single Static Radar-based Indoor Scene Understanding

RISE:基于单静态雷达的室内场景理解

Kaichen Zhou, Laura Dodds, Sayed Saad Afzal, Fadel Adib

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Cartesian Systems

AI总结 提出RISE系统,利用毫米波雷达的多径反射(传统视为噪声)编码几何线索,通过双角度多径增强和模拟到现实的分层扩散框架,实现布局重建和物体检测,在50,000帧数据集上布局重建倒角距离降低60%,首次实现基于毫米波雷达的物体检测。

详情
AI中文摘要

鲁棒且保护隐私的室内场景理解仍然是一个基本开放问题。虽然光学传感器(如RGB和LiDAR)提供高空间保真度,但它们在室内环境中遭受严重遮挡并引入隐私风险。相比之下,毫米波雷达保护隐私并穿透障碍物,但其固有的低空间分辨率使得可靠的几何推理变得困难。我们介绍了RISE,这是首个用于单静态雷达室内场景理解的基准和系统,同时针对布局重建和物体检测。RISE基于一个关键洞察:多径反射——传统上被视为噪声——编码了丰富的几何线索。为了利用这一点,我们提出了一种双角度多径增强方法,显式建模到达角和离开角,以恢复二次(鬼影)反射并揭示不可见结构。在这些增强观测的基础上,一个模拟到现实的分层扩散框架将碎片化的雷达响应转化为完整的布局重建和物体检测。我们的基准包含100条真实室内轨迹中收集的50,000帧数据,形成了首个专门用于单静态雷达室内场景理解的大规模数据集。大量实验表明,与最先进的毫米波布局重建方法相比,RISE将倒角距离降低了60%(降至16厘米),并实现了首个基于毫米波雷达的物体检测,IoU达到58%。这些结果确立了RISE作为使用单静态雷达进行几何感知和隐私保护室内场景理解的新基础。我们的网站和代码可在https://rise-cvpr.github.io获取。

英文摘要

Robust and privacy-preserving indoor scene understanding remains a fundamental open problem. While optical sensors such as RGB and LiDAR offer high spatial fidelity, they suffer from severe occlusions and introduce privacy risks in indoor environments. In contrast, millimeter-wave (mmWave) radar preserves privacy and penetrates obstacles, but its inherently low spatial resolution makes reliable geometric reasoning difficult. We introduce RISE, the first benchmark and system for single-static-radar indoor scene understanding, jointly targeting layout reconstruction and object detection. RISE is built upon the key insight that multipath reflections, traditionally treated as noise, encode rich geometric cues. To exploit this, we propose a Bi-Angular Multipath Enhancement that explicitly models Angle-of-Arrival and Angle-of-Departure to recover secondary (ghost) reflections and reveal invisible structures. On top of these enhanced observations, a simulation-to-reality Hierarchical Diffusion framework transforms fragmented radar responses into complete layout reconstruction and object detection. Our benchmark contains 50,000 frames collected across 100 real indoor trajectories, forming the first large-scale dataset dedicated to single, static, radar-based indoor scene understanding. Extensive experiments show that RISE reduces the Chamfer Distance by 60% (down to 16 cm) compared to the state of the art in mmWave layout reconstruction, and delivers the first mmWave-based object detection, achieving 58% IoU. These results establish RISE as a new foundation for geometry-aware and privacy-preserving indoor scene understanding using a single static radar. Our website and code are available at https://rise-cvpr.github.io.
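
The layout metric quoted above can be illustrated with a small Chamfer distance routine. Definitions vary between papers (squared versus unsquared distances, sums versus means), and the abstract does not say which variant is used, so the symmetric mean of unsquared nearest-neighbour distances below is an assumption; the point sets are made up.

```python
import math


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both ways."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))


# Hypothetical wall segment sampled at three points (metres), offset by 10 cm.
predicted = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
ground_truth = [(0.0, 0.1), (1.0, 0.1), (2.0, 0.1)]
cd = chamfer_distance(predicted, ground_truth)
```

With this variant the result carries the same unit as the input coordinates, which is consistent with a figure reported in centimetres.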

2512.10521 2026-06-08 cs.CV 版本更新

Take a Peek: Efficient Encoder Adaptation for Few-Shot Semantic Segmentation via LoRA

Take a Peek: 通过LoRA高效编码器适应少样本语义分割

Pasquale De Marinis, Gennaro Vessio, Giovanna Castellano

发表机构 * University of Bari(巴里大学)

AI总结 提出TaP方法,利用低秩适应(LoRA)微调编码器,在少样本和跨域少样本语义分割中实现高效适应,提升新类分割性能。

详情
AI中文摘要

少样本语义分割(FSS)旨在仅使用少量标注支持集对查询图像中的新类进行分割。先前研究主要关注改进解码器,但编码器提取未见类有意义特征的能力有限仍是关键瓶颈。本文提出Take a Peek(TaP),一种简单而有效的方法,通过引入基于支持集的轻量级特征空间偏移,增强了编码器对FSS和跨域FSS的适应性。TaP利用低秩适应(LoRA)在支持集上微调编码器,计算开销极小,能够快速适应新类同时减轻灾难性遗忘。我们的方法模型无关,可无缝集成到现有FSS流程中。在多个基准(包括COCO $20^i$、Pascal $5^i$以及跨域数据集DeepGlobe、ISIC和Chest X-ray)上的大量实验表明,TaP在不同模型和shot设置下一致地提升了分割性能。值得注意的是,TaP在复杂的多类场景中取得了显著增益,突显了其在现实场景中的实际有效性。秩敏感性分析还表明,即使采用低秩适应也能实现强性能,从而确保计算效率。通过解决FSS中编码器泛化到新类的关键限制,TaP为构建更鲁棒、高效和可泛化的分割系统铺平了道路。代码可在https://github.com/pasqualedem/TakeAPeek获取。

英文摘要

Few-shot semantic segmentation (FSS) aims to segment novel classes in query images using only a small annotated support set. While prior research has mainly focused on improving decoders, the encoder's limited ability to extract meaningful features for unseen classes remains a key bottleneck. In this work, we introduce Take a Peek (TaP), a simple yet effective method that enhances encoder adaptability for both FSS and cross-domain FSS by inducing a lightweight feature-space shift conditioned on the support set. TaP leverages Low-Rank Adaptation to fine-tune the encoder on the support set with minimal computational overhead, enabling fast adaptation to novel classes while mitigating catastrophic forgetting. Our method is model-agnostic and can be seamlessly integrated into existing FSS pipelines. Extensive experiments across multiple benchmarks, including COCO $20^i$, Pascal $5^i$, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray, demonstrate that TaP consistently improves segmentation performance across diverse models and shot settings. Notably, TaP delivers significant gains in complex multi-class scenarios, highlighting its practical effectiveness in realistic settings. A rank sensitivity analysis also shows that strong performance can be achieved even with low-rank adaptations, thereby ensuring computational efficiency. By addressing a critical limitation in FSS, namely the encoder's generalization to novel classes, TaP paves the way toward more robust, efficient, and generalizable segmentation systems. The code is available at https://github.com/pasqualedem/TakeAPeek.
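To make the Low-Rank Adaptation idea concrete, the following is a minimal sketch of a single linear layer with a frozen weight and a trainable rank-r update, written in plain Python. It is not the TaP implementation: the class name `LoRALinear`, the deterministic initialisation, and the hand-edited entry of `b` (standing in for a gradient step on the support set) are all assumptions made for illustration.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LoRALinear:
    """Frozen weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, weight, rank, alpha=1.0):
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight  # frozen pretrained weights, never updated
        # A gets a small deterministic init here purely for reproducibility.
        self.a = [[0.01 * ((i + j) % 3 - 1) for j in range(d_in)] for i in range(rank)]
        # B starts at zero, so the adapted layer initially equals the frozen one.
        self.b = [[0.0] * rank for _ in range(d_out)]
        self.scale = alpha / rank

    def forward(self, x):
        base = matvec(self.weight, x)
        delta = matvec(self.b, matvec(self.a, x))
        return [y + self.scale * d for y, d in zip(base, delta)]

    def trainable_parameters(self):
        # Only A (rank x d_in) and B (d_out x rank) would receive gradients.
        return len(self.a) * len(self.a[0]) + len(self.b) * len(self.b[0])


layer = LoRALinear(weight=[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], rank=1)
x = [1.0, 2.0, 3.0]
before = layer.forward(x)  # identical to the frozen layer's output
layer.b[0][0] = 1.0        # hypothetical stand-in for a gradient step
after = layer.forward(x)
```

Because `b` starts at zero, adaptation begins from the pretrained behaviour, and only `rank * (d_in + d_out)` values are trained. This is the general property that makes per-support-set fine-tuning cheap.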

2512.12997 2026-06-08 cs.CV cs.AI cs.LG 版本更新

Calibrating Uncertainty for Zero-Shot Adversarial CLIP

校准零样本对抗性CLIP的不确定性

Wenjing Lu, Zerui Tao, Yuning Qiu, Dongping Zhang, Yang Yang, Qibin Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对CLIP在零样本分类中对抗攻击脆弱且不确定性校准差的问题,提出基于狄利克雷分布重参数化的对抗微调目标,统一对齐语义结构与置信度,提升校准性和鲁棒性。

Comments ICML 2026

详情
AI中文摘要

CLIP在零样本分类中表现强劲,但仍易受对抗攻击。先前的对抗微调工作主要匹配干净样本和对抗样本之间的预测logits,忽略了不确定性校准,可能损害零样本泛化能力。在可靠的不确定性估计中,一个常见期望是预测不确定性应随输入难度增加或偏离训练分布而上升。然而,在对抗环境中我们经常观察到相反的情况:扰动不仅降低准确性,还抑制不确定性,导致严重的校准错误和过度自信。这揭示了鲁棒性之外的关键可靠性差距。为弥合这一差距,我们提出了一种考虑准确性和不确定性的CLIP对抗微调目标。通过将CLIP输出重参数化为狄利克雷分布的浓度参数,我们提出了一种统一表示,捕获相对语义结构和置信度大小。这使得在扰动下实现整体分布对齐,超越单一logits锚定,恢复校准的不确定性。在多个零样本基准上的实验表明,我们的方法显著提高了不确定性校准,在保持干净准确性的同时实现了具有竞争力的对抗鲁棒性。

英文摘要

CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Prior adversarial fine-tuning work primarily matches predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and over-confidence. This reveals a critical reliability gap beyond robustness. To bridge this gap, we propose an adversarial fine-tuning objective for CLIP considering both accuracy and uncertainty. By reparameterizing CLIP outputs as the concentration parameters of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and confidence magnitude. This enables holistic distribution alignment under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments across multiple zero-shot benchmarks demonstrate that our method significantly improves uncertainty calibration and achieves competitive adversarial robustness while preserving clean accuracy.
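The Dirichlet reparameterization can be illustrated with a small sketch. The mapping from logits to concentration parameters used here (`alpha = 1 + exp(logit)`) and the vacuity-style uncertainty `K / sum(alpha)` are common choices in evidential deep learning and are assumptions for this example; the abstract does not specify the exact mapping the authors use.

```python
import math


def dirichlet_from_logits(logits):
    """Map class logits to Dirichlet concentrations, mean probabilities, and uncertainty."""
    # Assumed mapping: non-negative evidence exp(logit), plus one per class.
    alphas = [1.0 + math.exp(z) for z in logits]
    strength = sum(alphas)                  # total evidence (confidence magnitude)
    probs = [a / strength for a in alphas]  # expected class probabilities
    uncertainty = len(alphas) / strength    # close to 1 when evidence is scarce
    return probs, uncertainty


confident_probs, confident_u = dirichlet_from_logits([6.0, 0.0, 0.0])
flat_probs, flat_u = dirichlet_from_logits([0.0, 0.0, 0.0])
```

The point of the representation is that the ratios between concentrations carry the relative semantic structure, while their sum carries the confidence magnitude, so the two can be aligned together rather than matching a single logit vector.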

2601.04791 2026-06-08 cs.CV cs.LG 版本更新

Measurement-Consistent Langevin Corrector for Stabilizing Latent Diffusion Inverse Problem Solvers

用于稳定潜在扩散逆问题求解器的测量一致朗之万校正器

Lee Hyoseok, Sohwi Lim, Eunju Cha, Tae-Hyun Oh

发表机构 * Sookmyung Women's University(淑明女子大学)

AI总结 针对潜在扩散模型逆问题求解器的不稳定性,提出测量一致朗之万校正器(MCLC),通过测量一致的朗之万更新缩小求解器与稳定反向扩散之间的差距,实现稳定可靠的潜在空间求解。

Comments ICML 2026

详情
AI中文摘要

尽管潜在扩散模型(LDM)已成为逆问题的强大先验,但现有的基于LDM的求解器经常遭受不稳定性。在这项工作中,我们首先将不稳定性识别为求解器动力学与扩散模型学习的稳定反向扩散动力学之间的差异,并表明减少这种差距可以稳定求解器。基于此,我们引入了测量一致朗之万校正器(MCLC),这是一个理论上有依据的即插即用稳定模块,通过测量一致的朗之万更新来修复基于LDM的逆问题求解器。与先前依赖线性流形假设(通常在潜在空间中不成立)的方法相比,MCLC提供了一种原则性的稳定机制,从而在潜在空间中实现更稳定和可靠的行为。

英文摘要

While latent diffusion models (LDMs) have emerged as powerful priors for inverse problems, existing LDM-based solvers frequently suffer from instability. In this work, we first identify the instability as a discrepancy between the solver dynamics and the stable reverse diffusion dynamics learned by the diffusion model, and show that reducing this gap stabilizes the solver. Building on this, we introduce the Measurement-Consistent Langevin Corrector (MCLC), a theoretically grounded plug-and-play stabilization module that remedies LDM-based inverse problem solvers through measurement-consistent Langevin updates. Compared to prior approaches that rely on linear manifold assumptions, which often fail to hold in latent space, MCLC provides a principled stabilization mechanism, leading to more stable and reliable behavior in latent space.
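For readers unfamiliar with Langevin updates, here is a generic toy sketch of a Langevin step that combines a prior score with a measurement-consistency gradient. It is not MCLC itself: a standard normal prior stands in for the learned diffusion score, the measurement is a single linear functional, and every name and constant below is an assumption chosen for the example.

```python
import math
import random


def langevin_step(z, y, measure_row, sigma, step, rng):
    """One Langevin update targeting p(z | y) for y = <measure_row, z> + noise."""
    residual = y - sum(a * v for a, v in zip(measure_row, z))
    updated = []
    for a, v in zip(measure_row, z):
        prior_score = -v                              # score of a standard normal prior
        likelihood_score = a * residual / sigma ** 2  # gradient of the log-likelihood
        noise = math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
        updated.append(v + step * (prior_score + likelihood_score) + noise)
    return updated


rng = random.Random(0)
z = [0.0, 0.0]
for _ in range(2000):
    z = langevin_step(z, y=2.0, measure_row=[1.0, 0.0], sigma=0.1, step=0.005, rng=rng)
```

In this toy problem the measured coordinate `z[0]` fluctuates around the posterior mean `2.0 / 1.01`, while the unmeasured coordinate `z[1]` is governed by the prior alone. The step size must stay small relative to `sigma ** 2` for the iteration to remain stable.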

2601.09698 2026-06-08 cs.CV 版本更新

COMPOSE: Hypergraph Cover Optimization for Multi-view 3D Human Pose Estimation

COMPOSE:用于多视角三维人体姿态估计的超图覆盖优化

Tony Danjun Wang, Tolga Birdal, Nassir Navab, Lennart Bastian

发表机构 * School of Computation, Information, and Technology, Technical University of Munich(慕尼黑工业大学计算、信息与技术学院) Munich Center for Machine Learning(慕尼黑机器学习中心) Department of Computing, Imperial College London(伦敦帝国理工学院计算系)

AI总结 提出COMPOSE方法,将多视角三维人体姿态估计重构为超图上的加权精确覆盖优化,通过全局组合目标替代局部配对关联,结合几何剪枝与整数线性规划或信念传播求解器,无监督下精度提升显著。

详情
AI中文摘要

从稀疏多视角相机装置中进行三维人体姿态估计是众多应用(包括动作识别、体育分析和人机交互)的基本任务。尽管学习方法在基准测试中占据主导地位,但它们需要大量标注数据集;无训练的基于优化的方法仍然有前景,因为它们通过解决来自二维检测的跨视角对应问题来规避三维监督。现有的组合公式依赖配对关联来建模这一对应问题,并将跨视角的全局一致性仅作为下游约束来强制执行。然而,在遮挡和噪声检测下,调和局部合理的配对匹配变得脆弱,局部错误会全局传播。我们提出COMPOSE,它将多视角三维人体姿态估计重新定义为对人物假设超图上的加权精确覆盖优化。我们的公式用单个全局组合目标替代了配对关联和事后一致性强制执行。为了应对指数级大的候选空间,我们引入了一种几何剪枝策略以及两种互补的求解器:精确整数线性规划公式和通过信念传播的可扩展松弛。在没有任何三维监督的情况下,COMPOSE在平均精度上比最佳基于优化的方法提高了31个百分点,比自监督学习方法提高了13个百分点,证明了高阶组合关联在无训练的多视角三维人体姿态估计中的有效性。

英文摘要

3D human pose estimation from sparse multi-view camera rigs is an essential task for numerous applications, including action recognition, sports analysis, and human-robot interaction. While learned methods dominate the field on benchmarks, they require large annotated datasets; training-free optimization-based methods remain promising as they circumvent 3D supervision by solving a correspondence problem across views from 2D detections. Existing combinatorial formulations rely on pairwise associations to model this correspondence problem and enforce global consistency across views only as a downstream constraint. However, reconciling locally plausible pairwise matches becomes brittle under occlusion and noisy detections, where local errors propagate globally. We propose COMPOSE, which recasts multi-view 3D human pose estimation as a weighted exact-cover optimization over a hypergraph of person hypotheses. Our formulation replaces pairwise association and post-hoc consistency enforcement with a single global combinatorial objective. To address the exponentially large candidate space, we introduce a geometric pruning strategy alongside two complementary solvers: an exact Integer Linear Programming formulation and a scalable relaxation via Belief Propagation. Without any 3D supervision, COMPOSE improves average precision by up to 31 points over the best optimization-based method and 13 points over self-supervised learned methods, demonstrating the effectiveness of higher-order combinatorial association for training-free multi-view 3D human pose estimation.
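The weighted exact-cover formulation can be illustrated on a toy instance. In the sketch below, each hypothesis groups 2D detections from different cameras into one candidate person and carries a cost; the goal is to pick hypotheses so that every detection is used exactly once at minimum total cost. The brute-force search, the detection names, and the costs are all hypothetical; the paper's actual solvers are an Integer Linear Program and a Belief Propagation relaxation.

```python
from itertools import combinations


def best_exact_cover(universe, hypotheses):
    """Return (cost, indices) of the cheapest set of hypotheses covering each item once."""
    best_cost, best_choice = None, None
    for k in range(1, len(hypotheses) + 1):
        for choice in combinations(range(len(hypotheses)), k):
            covered = [d for i in choice for d in hypotheses[i][0]]
            # Exact cover: no detection missing and none used twice.
            if len(covered) == len(universe) and set(covered) == universe:
                cost = sum(hypotheses[i][1] for i in choice)
                if best_cost is None or cost < best_cost:
                    best_cost, best_choice = cost, choice
    return best_cost, best_choice


detections = {"cam1_a", "cam1_b", "cam2_a", "cam2_b"}
hypotheses = [
    ({"cam1_a", "cam2_a"}, 0.2),  # index 0
    ({"cam1_b", "cam2_b"}, 0.3),  # index 1
    ({"cam1_a", "cam2_b"}, 0.1),  # index 2: cheapest single pairing
    ({"cam1_b", "cam2_a"}, 0.9),  # index 3
]
cost, chosen = best_exact_cover(detections, hypotheses)
```

Greedily accepting the cheapest pairing (index 2) forces the expensive index 3 and a total cost of 1.0, whereas the global objective selects indices 0 and 1 for a total of 0.5. This is the kind of locally plausible but globally poor match that a single combinatorial objective avoids.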

2601.22574 2026-06-08 cs.CV cs.AI 版本更新

Enhancing Video Representations with Spatiotemporal-Semantic Residual to Mitigate Hallucinations in Video Large Multimodal Models

利用时空-语义残差增强视频表示以缓解视频大型多模态模型中的幻觉

Yuansheng Gao, Jinman Zhao, Tong Zhang, Xingguo Xu, Wenbin Xing, Han Bao, Zonghui Wang, Wenzhi Chen

发表机构 * Zhejiang University(浙江大学) University of Toronto(多伦多大学) Dalian University of Technology(大连理工大学) Sun Yat-sen University(中山大学)

AI总结 提出ViSSRes方法,通过轻量级MLP网络学习视频表示的残差,从时空和语义一致性优化,在推理时仅需单次前向传播,有效降低幻觉率并提升视频理解性能。

Comments Preprint

详情
AI中文摘要

尽管视频大型多模态模型在视频理解方面取得了强劲性能,但它们仍然存在幻觉问题。现有的推理时干预方法通常在对比解码框架下修改视频,但其启发式设计带来的改进有限且增加了推理延迟。为了解决这些问题,我们提出了ViSSRes,一种通过轻量级MLP风格网络增强视频表示的推理时干预方法。具体来说,我们使用对比随机游走方法来表征视频表示的时空一致性,并引入条件互信息将视频表示与模型的语义理解关联起来。在保持模型主干冻结的情况下,ViSSRes学习视频表示的残差,并从时空和语义一致性角度优化它们。在推理时,ViSSRes仅需单次前向传播,且不会引入显著的额外推理成本。实验表明,ViSSRes在EventHallusion上将LLaVA-NeXT-Video的幻觉率降低了40.69%,并在CoT设置下将MMVU上的视频理解提升了18.36%,证明了其在缓解幻觉方面的有效性。

英文摘要

Although Video Large Multimodal Models have achieved strong performance in video understanding, they still suffer from hallucination. Existing inference-time intervention methods usually modify videos under the contrastive decoding framework, but their heuristic designs bring limited improvements and increase inference latency. To address these issues, we propose ViSSRes, an inference-time intervention method that enhances video representations through a lightweight MLP-style network. Specifically, we use a contrastive random walk approach to characterize the spatiotemporal consistency of video representations, and introduce conditional mutual information to associate video representations with the model's semantic understanding. With the model backbone kept frozen, ViSSRes learns residuals for video representations and optimizes them from both spatiotemporal and semantic consistency perspectives. During inference, ViSSRes requires only a single forward pass and introduces no substantial additional inference cost. Experiments show that ViSSRes reduces the hallucination rate of LLaVA-NeXT-Video on EventHallusion by 40.69% and improves video understanding on MMVU by 18.36% under the CoT setting, demonstrating its effectiveness in mitigating hallucinations.

2602.00163 2026-06-08 cs.CV q-bio.NC 版本更新

Deep Learning Pose Estimation for Multi-Label Recognition of Combined Hyperkinetic Movement Disorders

基于深度学习姿态估计的联合多动性运动障碍多标签识别

Laura Cif, Diane Demailly, Gabriella A. Horvàth, Juan Dario Ortigoza Escobar, Nathalie Dorison, Mayté Castro Jiménez, Cécile A. Hubsch, Thomas Wirth, Gun-Marie Hariz, Sophie Huby, Morgan Dornadic, Zohra Souei, Muhammad Mushhood Ur Rehman, Simone Hemm, Mehdi Boulayme, Eduardo M. Moraud, Jocelyne Bloch, Xavier Vasques

发表机构 * Lausanne University Hospital (CHUV) and University of Lausanne (UNIL)(洛桑大学医院(CHUV)和洛桑大学(UNIL)) Institut du Neurone(神经研究所) Department of Neurology, Clinique Beau Soleil, Institut Mutualiste Montpelliérain(神经科,贝索尔诊所,蒙彼利埃互益研究所) Department of Pediatrics, British Columbia Children’s Hospital(儿科,不列颠哥伦比亚儿童医院) Movement Disorders Unit, Pediatric Neurology Department, Institut de Recerca, Hospital Sant Joan de Déu(运动障碍科,儿童神经科,研究所,圣约翰德杜医院) European Reference Network for Rare Neurological Diseases (ERN-RND)(罕见神经系统疾病欧洲参考网络(ERN-RND)) U-703 Centre for Biomedical Research on Rare Diseases (CIBER-ER), Instituto de Salud Carlos III(罕见疾病生物医学研究中心(CIBER-ER),卡洛斯三世健康研究所) Pediatric Neurosurgery Department, CCMR Neurogenetique, European Reference Network Brainteam Member, Rothschild Foundation Hospital(小儿神经外科部门,CCMR神经遗传学,欧洲参考网络Brainteam成员,罗斯柴尔德基金会医院) Department of Neurology, University Hospital of Strasbourg(神经科,斯特拉斯堡大学医院) Strasbourg Neuroscience Institute, Strasbourg University(斯特拉斯堡神经科学研究所,斯特拉斯堡大学) Institute of Genetics and Cellular biology(遗传学和细胞生物学研究所)

AI总结 针对多动性运动障碍(HMD)临床识别主观性强、表型重叠的问题,提出基于姿态的机器学习框架,从常规临床视频提取关键点时间序列并计算多维度运动学特征,实现多标签分类。

详情
AI中文摘要

多动性运动障碍(HMD),如肌张力障碍、震颤、舞蹈症、肌阵挛和抽动症,是儿童和成人中致残的运动表现。其波动性、间歇性和频繁共存的表达阻碍了临床识别和纵向监测,这些在很大程度上仍然是主观的且易受评估者间变异影响。目前仍缺乏客观且可扩展的方法来从常规临床视频中区分重叠的HMD表型。在此,我们开发了一个基于姿态的机器学习框架,将常规门诊视频转化为解剖学上有意义的关键点时间序列,并计算涵盖统计、时间、频谱以及高阶不规则性-复杂性特征的运动学描述符。

英文摘要

Hyperkinetic movement disorders (HMDs) such as dystonia, tremor, chorea, myoclonus, and tics are disabling motor manifestations across childhood and adulthood. Their fluctuating, intermittent, and frequently co-occurring expressions hinder clinical recognition and longitudinal monitoring, which remain largely subjective and vulnerable to inter-rater variability. Objective and scalable methods to distinguish overlapping HMD phenotypes from routine clinical videos are still lacking. Here, we developed a pose-based machine-learning framework that converts standard outpatient videos into anatomically meaningful keypoint time series and computes kinematic descriptors spanning statistical, temporal, spectral, and higher-order irregularity-complexity features.
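As an illustration of turning a keypoint time series into kinematic descriptors, the sketch below computes a few statistical, temporal, and spectral features for one coordinate of one keypoint. The specific features, the function name, and the synthetic 5 Hz "wrist" signal are assumptions for the example; the abstract does not list the exact descriptors used in the study.

```python
import cmath
import math


def kinematic_descriptors(series, fps):
    """Summarise one keypoint coordinate over time with a few simple features."""
    n = len(series)
    mean = sum(series) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    # Mean absolute frame-to-frame speed, in coordinate units per second.
    speed = sum(abs(b - a) for a, b in zip(series, series[1:])) * fps / (n - 1)
    # Naive DFT magnitude spectrum; bin 0 (the constant offset) is skipped.
    magnitudes = []
    for k in range(1, n // 2 + 1):
        coeff = sum(v * cmath.exp(-2j * math.pi * k * t / n) for t, v in enumerate(series))
        magnitudes.append(abs(coeff))
    peak_bin = 1 + max(range(len(magnitudes)), key=magnitudes.__getitem__)
    return {"mean": mean, "std": std, "mean_speed": speed,
            "dominant_hz": peak_bin * fps / n}


fps = 30
# Synthetic vertical wrist coordinate oscillating at 5 Hz for two seconds.
wrist_y = [0.5 + 0.1 * math.sin(2 * math.pi * 5.0 * t / fps) for t in range(60)]
features = kinematic_descriptors(wrist_y, fps)
```

A rhythmic, tremor-like signal shows up as a sharp spectral peak, whereas irregular movements would spread energy across bins, which is why spectral and irregularity features complement simple statistics. A practical pipeline would use an FFT routine rather than this quadratic-time DFT.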

2602.02014 2026-06-08 cs.CV cs.AI cs.CL cs.LG 版本更新

Rethinking Genomic Modeling Through Optical Character Recognition

通过光学字符识别重新思考基因组建模

Hongxin Xiang, Pengsen Ma, Yunkang Cao, Di Yu, Haowen Chen, Xinyu Yang, Xiangxiang Zeng

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出OpticalDNA框架,将DNA渲染为视觉布局,利用视觉语言模型进行OCR式基因组理解,实现高保真压缩和长序列高效处理,在450k碱基序列上以近20倍更少有效token超越基线模型。

Comments Accepted by ICML 2026

详情
AI中文摘要

最近的基因组基础模型大多采用大型语言模型架构,将DNA视为一维token序列。然而,穷举式顺序阅读在结构上与稀疏且不连续的基因组语义不匹配,导致在低信息背景上的计算浪费,并阻碍了面向长上下文的压缩理解。在此,我们提出OpticalDNA,一个基于视觉的框架,将基因组建模重新定义为光学字符识别(OCR)风格的文档理解。OpticalDNA将DNA渲染为结构化视觉布局,并训练一个具备OCR能力的视觉语言模型,该模型包含视觉DNA编码器和文档解码器,其中编码器生成紧凑、可重建的视觉token以实现高保真压缩。基于这种表示,OpticalDNA定义了基于提示条件的核心基因组原语目标——读取、区域定位、子序列检索和掩码跨度补全——从而学习到布局感知的DNA表示,在减少的有效token预算下保留细粒度的基因组信息。在多种基因组基准测试中,OpticalDNA持续优于最近的基线模型;在长达450k碱基的序列上,它以近20倍更少的有效token实现了最佳整体性能,并且仅调整256k可训练参数就超越了激活参数多达985倍的模型。

英文摘要

Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision-language model with a visual DNA encoder and a document decoder, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion), thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly 20$\times$ fewer effective tokens, and surpasses models with up to 985$\times$ more activated parameters while tuning only 256k trainable parameters.

2602.07025 2026-06-08 cs.CV cs.AI 版本更新

The Geometry of Representational Failures in Vision Language Models

视觉语言模型中表征失败的几何结构

Daniele Savietto, Declan Campbell, André Panisson, Marco Nurisso, Giovanni Petri, Jonathan D. Cohen, Alan Perotti

发表机构 * Dipartimento di Fisica, Università di Torino(都灵大学物理系) Princeton Neuroscience Institute and AI Lab, Princeton University(普林斯顿大学神经科学研究所和AI实验室) Intesa Sanpaolo AI Research(Intesa Sanpaolo AI研究中心) Dipartimento di Scienze Matematiche, Politecnico di Torino(都灵理工大学数学科学系) Network Science Institute, Northeastern University London, UK(英国伦敦东北大学网络科学研究所)

AI总结 通过分析开源视觉语言模型的概念向量几何重叠,揭示多目标视觉任务中幻觉等错误与认知约束的关联,并提出基于干预的验证方法。

详情
AI中文摘要

视觉语言模型在多目标视觉任务中表现出令人困惑的失败,例如幻觉不存在的元素或未能识别干扰中最相似的物体。虽然这些错误反映了人类的认知约束,如“绑定问题”,但在人工系统中驱动这些错误的内部机制仍然知之甚少。在这里,我们通过分析开源视觉语言模型(Qwen、InternVL、Gemma)的表征几何结构,提出了一种机制性见解,比较了提炼“概念向量”(编码视觉概念的潜在方向)的方法。我们通过引导干预验证了概念向量,这些干预在简化和自然视觉任务中可靠地操纵模型行为(例如,强制模型将红色花朵感知为蓝色)。我们观察到这些向量之间的几何重叠与特定错误模式强相关,提供了一个有依据的定量框架来理解内部表征如何塑造模型行为并驱动视觉失败。

英文摘要

Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractions. While these errors mirror human cognitive constraints, such as the "Binding Problem", the internal mechanisms driving them in artificial systems remain poorly understood. Here, we propose a mechanistic insight by analyzing the representational geometry of open-weight VLMs (Qwen, InternVL, Gemma), comparing methodologies to distill "concept vectors", that is, latent directions encoding visual concepts. We validate our concept vectors via steering interventions that reliably manipulate model behavior in both simplified and naturalistic vision tasks (e.g., forcing the model to perceive a red flower as blue). We observe that the geometric overlap between these vectors strongly correlates with specific error patterns, offering a grounded quantitative framework to understand how internal representations shape model behavior and drive visual failures.
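One simple way to obtain a concept vector is the difference between mean activations with and without the concept; geometric overlap can then be measured as cosine similarity, and steering amounts to adding a scaled direction to a hidden state. The sketch below uses that difference-of-means recipe as an assumption (the abstract says several distillation methods are compared without naming them), with tiny made-up activation vectors.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length activation vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def concept_vector(with_concept, without_concept):
    """Difference of mean activations: one common way to estimate a concept direction."""
    pos, neg = mean_vector(with_concept), mean_vector(without_concept)
    return [p - n for p, n in zip(pos, neg)]


def cosine(u, v):
    """Cosine similarity, used here as the measure of geometric overlap."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))


def steer(hidden, direction, strength):
    """Steering intervention: push a hidden state along a concept direction."""
    return [h + strength * d for h, d in zip(hidden, direction)]


# Hypothetical 3-dimensional activations for images with and without each colour.
red = concept_vector([[1.0, 0.0, 0.5], [1.2, 0.1, 0.5]], [[0.0, 0.0, 0.5], [0.2, 0.1, 0.5]])
blue = concept_vector([[0.6, 0.8, 0.5]], [[0.0, 0.0, 0.5]])
overlap = cosine(red, blue)
steered = steer([1.1, 0.05, 0.5], blue, strength=1.0)
```

Two directions with high cosine overlap are hard to tell apart in the representation, which is the intuition behind relating overlap to confusion errors.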

2602.07026 2026-06-08 cs.CV cs.AI cs.MM 版本更新

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

模态间隙驱动的子空间对齐训练范式用于多模态大语言模型

Xiaomin Yu, Yi Xin, Yuhui Zhang, Wenjie Zhang, Chonghan Liu, Hanzhen Zhao, Chen Liu, Xiaoxing Hu, Ziyue Qiao, Hao Tang, Xiaobin Hu, Chengwei Qin, Hui Xiong, Yu Qiao, Shuicheng Yan

发表机构 * HKUST(GZ)(香港科技大学(广州)) NUS(新加坡国立大学) sh AILab SII Stanford(斯坦福大学) UCLA(加州大学洛杉矶分校) Yale(耶鲁大学) SJTU(上海交通大学) GBU(大湾区大学) PKU(北京大学)

AI总结 针对多模态对比学习中的模态间隙问题,提出固定帧模态间隙理论,并基于该理论设计无训练的对齐策略ReAlign和可扩展训练范式ReVision,利用无配对数据实现视觉与语言表示的高效对齐。

详情
AI中文摘要

尽管多模态对比学习在视觉和语言表示对齐方面取得了成功,但一个持久的几何异常——模态间隙——仍然存在:表达相同语义的不同模态的嵌入位于系统性偏移的区域。先前弥合这一间隙的方法大多受限于过于简化的各向同性假设,阻碍了它们在大规模场景中的应用。在本文中,我们通过精确刻画模态间隙的几何形状并利用它进行高效模型扩展来解决这些局限性。首先,我们提出了固定帧模态间隙理论,该理论将冻结参考帧内的模态间隙分解为稳定偏差和各向异性残差。在这种精确建模的指导下,我们引入了ReAlign,一种无需训练的模态对齐策略。利用大量无配对数据的统计信息,ReAlign通过锚点、轨迹和质心对齐三步过程将文本表示对齐到图像表示分布,从而显式纠正几何错位。基于ReAlign,我们提出了ReVision,一种用于多模态大语言模型(MLLMs)的可扩展训练范式。ReVision将ReAlign集成到预训练阶段,使模型在视觉指令微调之前从无配对文本中学习视觉表示的分布,无需大规模、高质量的图像-文本对。我们的框架表明,统计对齐的无配对数据可以有效替代昂贵的图像-文本对,为MLLMs的高效扩展提供了一条稳健的路径。

英文摘要

Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representations with the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.

2602.15287 2026-06-08 cs.CV 版本更新

Consistency-Preserving Diverse Video Generation

保持一致性的多样化视频生成

Xinshuang Liu, Runfa Blark Li, Truong Nguyen

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种联合采样框架,在保持时间一致性的同时提高文本到视频生成中批次内视频的多样性,通过轻量级潜在空间模型避免视频解码和反向传播。

详情
AI中文摘要

文本到视频生成成本高昂,因此每个提示通常只生成少量样本。在这种低样本情况下,最大化每批的价值需要高跨视频多样性。最近的方法提高了图像生成的多样性,但对于视频,它们常常降低视频内的时间一致性,并且需要通过视频解码器进行昂贵的反向传播。我们提出了一种用于流匹配视频生成器的联合采样框架,该框架在保持时间一致性的同时提高了批次多样性。我们的方法应用多样性驱动的更新,然后仅移除会降低时间一致性目标的分量。为了避免图像空间梯度,我们使用轻量级潜在空间模型计算两个目标,避免了视频解码和解码器反向传播。在最新的文本到视频流匹配模型上的实验表明,我们的方法在接近强联合采样基线的多样性的同时,显著提高了时间一致性和颜色自然度。我们的代码可在 https://github.com/XinshuangL/Diverse-Video 获取。

英文摘要

Text-to-video generation is expensive, so only a few samples are typically produced per prompt. In this low-sample regime, maximizing the value of each batch requires high cross-video diversity. Recent methods improve diversity for image generation, but for videos they often degrade within-video temporal consistency and require costly backpropagation through a video decoder. We propose a joint-sampling framework for flow-matching video generators that improves batch diversity while preserving temporal consistency. Our approach applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective. To avoid image-space gradients, we compute both objectives with lightweight latent-space models, avoiding video decoding and decoder backpropagation. Experiments on a state-of-the-art text-to-video flow-matching model show diversity close to strong joint-sampling baselines while substantially improving temporal consistency and color naturalness. Our code is available at https://github.com/XinshuangL/Diverse-Video.
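The step of removing only the components that would decrease the temporal-consistency objective can be read as a gradient projection. The sketch below shows one way to do that for a single update vector: if the diversity update has a negative inner product with the consistency gradient, the component along that gradient is subtracted. Treating the constraint as a single-direction projection is an assumption; the authors' actual procedure may differ, and the function name is illustrative.

```python
def remove_harmful_component(update, consistency_grad):
    """Drop the part of `update` that would decrease the consistency objective."""
    dot = sum(u * g for u, g in zip(update, consistency_grad))
    grad_sq = sum(g * g for g in consistency_grad)
    if dot >= 0.0 or grad_sq == 0.0:
        # The update does not reduce consistency to first order; keep it unchanged.
        return list(update)
    # Project out the component that points against the consistency gradient.
    return [u - (dot / grad_sq) * g for u, g in zip(update, consistency_grad)]


grad = [0.0, 1.0]  # direction that increases temporal consistency
conflicting = remove_harmful_component([1.0, -1.0], grad)
compatible = remove_harmful_component([1.0, 1.0], grad)
```

After projection, the conflicting update keeps its diversity-seeking part `[1.0, 0.0]` but no longer moves against the consistency gradient, while the compatible update passes through untouched.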

2602.19213 2026-06-08 cs.CV 版本更新

SegMoTE: Token-Level Mixture of Experts for Medical Image Segmentation

SegMoTE: 用于医学图像分割的令牌级混合专家模型

Yujie Lu, Jingwen Li, Sibo Ju, Yanzhou Su, he yao, Yisong Liu, Min Zhu, Junlong Cheng

发表机构 * Sichuan University(四川大学) Xinjiang University(新疆大学) Fuzhou University(福州大学) Alibaba DAMO Academy(阿里巴巴达摩院)

AI总结 提出SegMoTE框架,通过令牌级混合专家机制和渐进式提示令牌化,在极低标注成本下实现医学图像分割的跨模态自适应与SOTA性能。

详情
AI中文摘要

医学图像分割对于临床诊断和定量分析至关重要,但由于成像模态的异质性和像素级标注的高成本,仍然具有挑战性。尽管像SAM这样的通用交互式分割模型取得了显著进展,但它们向医学影像的迁移仍面临两个关键瓶颈:(i) 缺乏针对模态和解剖特定任务的自适应机制,限制了在分布外医学场景中的泛化能力;(ii) 当前的医学适应方法在没有选择的情况下对大型异构数据集进行微调,导致噪声监督、更高成本和负迁移。为了解决这些问题,我们提出了SegMoTE,一个高效且自适应的医学图像分割框架。SegMoTE保留了SAM原始的提示接口、高效推理和零样本泛化能力,同时仅引入少量可学习参数以动态适应不同模态和任务。此外,我们设计了一种渐进式提示令牌化机制,实现了全自动分割,显著减少了对标注的依赖。在MedSeg-HQ(一个精心策划的数据集,规模不到现有大型数据集的1%)上训练后,SegMoTE在多种成像模态和解剖任务中达到了SOTA性能。这是首次在极低标注成本下将通用分割模型高效、鲁棒且可扩展地适应到医学领域,推动了基础视觉模型在临床应用中的实际部署。

英文摘要

Medical image segmentation is vital for clinical diagnosis and quantitative analysis, yet remains challenging due to the heterogeneity of imaging modalities and the high cost of pixel-level annotations. Although general interactive segmentation models like SAM have achieved remarkable progress, their transfer to medical imaging still faces two key bottlenecks: (i) the lack of adaptive mechanisms for modality- and anatomy-specific tasks, which limits generalization in out-of-distribution medical scenarios; and (ii) current medical adaptation methods fine-tune on large, heterogeneous datasets without selection, leading to noisy supervision, higher cost, and negative transfer. To address these issues, we propose SegMoTE, an efficient and adaptive framework for medical image segmentation. SegMoTE preserves SAM's original prompt interface, efficient inference, and zero-shot generalization while introducing only a small number of learnable parameters to dynamically adapt across modalities and tasks. In addition, we design a progressive prompt tokenization mechanism that enables fully automatic segmentation, significantly reducing annotation dependence. Trained on MedSeg-HQ, a curated dataset less than 1% the size of existing large-scale datasets, SegMoTE achieves SOTA performance across diverse imaging modalities and anatomical tasks. It represents the first efficient, robust, and scalable adaptation of general segmentation models to the medical domain under extremely low annotation cost, advancing the practical deployment of foundation vision models in clinical applications.
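Token-level mixture-of-experts routing can be sketched in a few lines: each token is scored against per-expert gate vectors, the top-k experts are selected, and their outputs are combined with softmax weights. This is a generic illustration and not SegMoTE's routing: the gate vectors, the two toy experts, and the function name are hypothetical.

```python
import math


def route_token(token, gate_weights, experts, top_k=1):
    """Send one token to its top-k experts and mix their outputs by gate weight."""
    scores = [sum(w * x for w, x in zip(gate, token)) for gate in gate_weights]
    chosen = sorted(range(len(scores)), key=lambda e: scores[e], reverse=True)[:top_k]
    # Softmax over the selected experts only (shifted by the max for stability).
    exps = [math.exp(scores[e] - scores[chosen[0]]) for e in chosen]
    total = sum(exps)
    output = [0.0] * len(token)
    for expert_index, weight in zip(chosen, exps):
        expert_out = experts[expert_index](token)
        output = [o + (weight / total) * v for o, v in zip(output, expert_out)]
    return chosen, output


gate_weights = [[1.0, 0.0], [0.0, 1.0]]  # one gate vector per expert
experts = [
    lambda t: [2.0 * v for v in t],      # hypothetical expert 0
    lambda t: [-v for v in t],           # hypothetical expert 1
]
chosen_a, out_a = route_token([1.0, 0.0], gate_weights, experts)
chosen_b, out_b = route_token([0.0, 1.0], gate_weights, experts)
```

Routing at the token level means different regions of the same image can be handled by different experts, and only the selected experts run for each token, which keeps the added compute small.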

2603.06673 2026-06-08 cs.CV cs.LG 版本更新

Unmixing ATR-μFTIR spectroscopic images of cross-sections of historical oil paintings

历史油画横截面的ATR-μFTIR光谱图像解混

Shivam Pande, Nicolas Nadisic, Francisco Mederos-Henry, Aleksandra Pizurica

发表机构 * Belgian Federal Science Policy(比利时联邦科学政策) FED-tWIN project(FED-tWIN项目) Prf-2022-050 BALaTAI Prf-2021-002 MatCoRe

AI总结 提出一种无监督CNN自编码器,结合加权光谱角距离损失,用于解混ATR-μFTIR高光谱图像,自动估计端元光谱和丰度图,在污染区域提升可解释性。

Comments 5 pages, accepted at EUSIPCO 2026

详情
AI中文摘要

光谱成像已成为遗产科学的核心技术,因为它能够对文物中的材料进行非侵入性、空间分辨的表征。特别是,衰减全反射傅里叶变换红外显微镜(ATR-μFTIR)被广泛用于分析绘画横截面,其中在每个像素处记录光谱以形成高光谱图像(HSI)。解释这些数据是困难的:光谱通常是异质、多层和退化样品中多种物质的混合物,而当前实践仍然严重依赖于与参考库的手动比较。这种工作流程缓慢、主观且难以扩展。我们提出了一种无监督CNN自编码器,用于盲解混ATR-μFTIR HSI,通过基于块建模利用局部空间结构,估计端元光谱及其丰度图。为了减少对超过1500个波段的大气和采集伪影的敏感性,我们引入了一种加权光谱角距离(WSAD)损失,该损失具有从空间平坦度、邻域一致性和光谱粗糙度的稳健度量中自动导出的波段可靠性权重。与标准SAD训练相比,WSAD在易受污染的光谱区域提高了可解释性。我们在凡·艾克兄弟的根特祭坛画的ATR-μFTIR横截面上演示了该方法。

英文摘要

Spectroscopic imaging (SI) has become central to heritage science because it enables non-invasive, spatially resolved characterisation of materials in artefacts. In particular, attenuated total reflection Fourier transform infrared microscopy (ATR-μFTIR) is widely used to analyse painting cross-sections, where a spectrum is recorded at each pixel to form a hyperspectral image (HSI). Interpreting these data is difficult: spectra are often mixtures of several species in heterogeneous, multi-layered and degraded samples, and current practice still relies heavily on manual comparison with reference libraries. This workflow is slow, subjective and hard to scale. We propose an unsupervised CNN autoencoder for blind unmixing of ATR-μFTIR HSIs, estimating endmember spectra and their abundance maps while exploiting local spatial structure through patch-based modelling. To reduce sensitivity to atmospheric and acquisition artefacts across more than 1500 bands, we introduce a weighted spectral angle distance (WSAD) loss with automatic band-reliability weights derived from robust measures of spatial flatness, neighbour agreement and spectral roughness. Compared with standard SAD training, WSAD improves interpretability in contamination-prone spectral regions. We demonstrate the method on an ATR-μFTIR cross-section from the Ghent Altarpiece by the Van Eyck brothers.
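A weighted spectral angle distance can be sketched as the angle between two spectra under a per-band weighted inner product. The formula below is one natural reading of "WSAD" and is an assumption, as are the hand-picked weights; in the paper the band-reliability weights are derived automatically from flatness, neighbour agreement, and roughness measures.

```python
import math


def weighted_spectral_angle(x, y, weights):
    """Angle (radians) between spectra x and y under per-band weights."""
    dot = sum(w * a * b for w, a, b in zip(weights, x, y))
    norm_x_sq = sum(w * a * a for w, a in zip(weights, x))
    norm_y_sq = sum(w * b * b for w, b in zip(weights, y))
    cos_angle = dot / math.sqrt(norm_x_sq * norm_y_sq)
    # Clamp to guard against tiny floating-point overshoot before acos.
    return math.acos(max(-1.0, min(1.0, cos_angle)))


measured = [1.0, 2.0, 3.0, 10.0]   # last band corrupted by an acquisition artefact
reconstructed = [1.0, 2.0, 3.0, 0.0]
uniform = [1.0, 1.0, 1.0, 1.0]      # equal weights reduce to the standard SAD
reliability = [1.0, 1.0, 1.0, 0.0]  # hypothetical weights that ignore the bad band

sad = weighted_spectral_angle(measured, reconstructed, uniform)
wsad = weighted_spectral_angle(measured, reconstructed, reliability)
```

With uniform weights the corrupted band dominates the angle, while down-weighting it shows that the two spectra agree on every reliable band. This is the behaviour that makes the loss less sensitive to contamination-prone regions.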

2603.07704 2026-06-08 cs.CV 版本更新

PARSE: Part-Aware Relational Spatial Modeling

PARSE: 部件感知的关系空间建模

Yinuo Bai, Peijun Xu, Kuixiang Shao, Yuyang Jiao, Jingxuan Zhang, Kaixin Yao, Jiayuan Gu, Jingyi Yu

发表机构 * ShanghaiTech University(上海科技大学) Deemos Technology(德莫斯科技)

AI总结 提出PARSE框架,通过部件级部件中心装配图(PAG)和空间配置求解器,实现几何约束下的无碰撞物理有效场景组装,并构建PARSE-10K数据集,提升3D场景布局推理和生成的真实感。

Comments Project Page: https://otanaaa.github.io/PARSE-project-page/

详情
AI中文摘要

物体间关系是空间智能的基础,但现有表示(如语言介词或物体级场景图)过于粗糙,无法指定哪些区域实际支撑、包含或接触彼此,导致布局模糊且物理不一致。为解决这些歧义,需要部件级表示;因此,我们引入PARSE,一个显式建模物体部件如何交互以确定可行且空间接地场景配置的框架。PARSE的核心是部件中心装配图(PAG),它编码特定物体部件之间的几何关系,以及一个部件感知空间配置求解器,该求解器将这些关系转换为几何约束,以组装无碰撞、物理有效的场景。利用PARSE,我们构建了PARSE-10K数据集,包含10,000个3D室内场景,这些场景基于真实图像布局先验和精心标注的部件形状数据库构建,每个场景具有密集的接触结构和部件级接触图。借助这种结构化、空间接地的监督,在PARSE-10K上微调Qwen3-VL可产生更强的物体级布局推理和更准确的部件级关系理解;此外,在3D生成模型中利用PAG作为结构先验,可生成物理真实感和结构复杂性显著提升的场景。这些结果表明,PARSE显著推进了几何接地的空间推理,并支持生成物理一致的3D场景。

英文摘要

Inter-object relations underpin spatial intelligence, yet existing representations (linguistic prepositions or object-level scene graphs) are too coarse to specify which regions actually support, contain, or contact one another, leading to ambiguous and physically inconsistent layouts. To address these ambiguities, a part-level formulation is needed; therefore, we introduce PARSE, a framework that explicitly models how object parts interact to determine feasible and spatially grounded scene configurations. PARSE centers on the Part-centric Assembly Graph (PAG), which encodes geometric relations between specific object parts, and a Part-Aware Spatial Configuration Solver that converts these relations into geometric constraints to assemble collision-free, physically valid scenes. Using PARSE, we build PARSE-10K, a dataset of 10,000 3D indoor scenes constructed from real-image layout priors and a curated part-annotated shape database, each with dense contact structures and a part-level contact graph. With this structured, spatially grounded supervision, fine-tuning Qwen3-VL on PARSE-10K yields stronger object-level layout reasoning and more accurate part-level relation understanding; furthermore, leveraging PAGs as structural priors in 3D generation models leads to scenes with substantially improved physical realism and structural complexity. Together, these results show that PARSE significantly advances geometry-grounded spatial reasoning and supports the generation of physically consistent 3D scenes.

2603.22278 2026-06-08 cs.CV cs.LG 版本更新

The Dual Mechanisms of Spatial Variable Binding in Vision-Language Models

视觉-语言模型中空间变量绑定的双重机制

Kelly Cui, Nikhil Prakash, Shoval Messica, Ayush Raina, David Bau, Antonio Torralba, Tamar Rott Shaham

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室) Northeastern University(东北大学) Sony Playstation(索尼PlayStation)

AI总结 本文揭示视觉-语言模型通过语言骨干中的内容无关空间关系编码和视觉编码器中的全局布局表示两种机制实现空间变量绑定,其中视觉编码器起主导作用。

Comments 37 pages, 53 figures

详情
AI中文摘要

许多多模态任务,如图像描述和视觉问答,要求视觉-语言模型(VLM)将对象与其属性和空间关系绑定。然而,这种关联在VLM中如何以及在哪里计算仍不清楚。在这项工作中,我们展示了VLM依赖两种并发机制来表示空间变量绑定。在语言模型骨干中,中间层在对应对象的视觉标记之上表示内容无关的空间关系。然而,这种机制在塑造模型预测中仅起次要作用。相反,空间信息的主要来源是视觉编码器,其表示编码了对象的布局,并被语言模型骨干直接利用。值得注意的是,这种空间信号全局分布在视觉标记中,从对象区域扩展到周围的背景区域。我们表明,增强这些源自视觉的空间表示(跨所有图像标记)可以改善不同规模模型在COCO数据集复杂自然图像上的空间变量绑定性能。总之,我们的结果阐明了VLM中空间变量绑定的计算方式,并强调了视觉编码器在实现这一功能中的核心作用。

英文摘要

Many multimodal tasks, such as image captioning and visual question answering, require vision-language models (VLMs) to bind objects with their properties and spatial relations. Yet it remains unclear where and how such associations are computed within VLMs. In this work, we show that VLMs rely on two concurrent mechanisms to represent spatial variable binding. In the language model backbone, intermediate layers represent content-independent spatial relations on top of visual tokens corresponding to objects. However, this mechanism plays only a secondary role in shaping model predictions. Instead, the dominant source of spatial information originates in the vision encoder, whose representations encode the layout of objects and are directly exploited by the language model backbone. Notably, this spatial signal is distributed globally across visual tokens, extending beyond object regions into surrounding background areas. We show that enhancing these vision-derived spatial representations globally across all image tokens improves spatial variable binding performance across models of various sizes on complex natural images from the COCO dataset. Together, our results clarify how spatial variable binding is computed within VLMs and highlight the central role of vision encoders in enabling it.

2604.00270 2026-06-08 cs.CV 版本更新

OmniSch: A Multimodal PCB Schematic Benchmark For Structured Diagram Visual Reasoning

OmniSch:面向结构化图表视觉推理的多模态PCB原理图基准

Taiting Lu, Kaiyuan Lin, Yuxin Tian, Mingjia Wang, Yubo Wang, Muchuan Wang, Sharique Khatri, Akshit Kartik, Yixi Wang, Amey Santosh Rane, Yida Wang, Sung-Liang Chen, Yifan Yang, Yi-Chao Chen, Yincheng Jin, Mahanth Gowda

发表机构 * Pennsylvania State University, USA(宾夕法尼亚州立大学) Independent Researcher(独立研究者) Binghamton University, USA(宾汉姆顿大学) Shanghai Jiao Tong University, China(上海交通大学) Microsoft Research(微软研究院)

AI总结 提出首个多模态PCB原理图理解基准OmniSch,包含四项任务评估大模型在视觉定位、图推理和几何推理上的能力,揭示现有模型在工程图表理解上的显著差距。

详情
AI中文摘要

近期大型多模态模型(LMMs)在视觉定位、文档理解和图表推理任务中取得了快速进展。然而,它们将印刷电路板(PCB)原理图转换为机器可读的空间加权网表图(同时捕获组件属性、连接性和几何信息)的能力仍未被充分探索,尽管这种图表示是实际电子设计自动化(EDA)工作流的基石。为弥补这一差距,我们引入了OmniSch,这是首个旨在评估LMMs在原理图理解和空间网表图构建方面的综合基准。OmniSch包含1,854张真实世界原理图,并包括四项任务:(1)原理图实体的视觉定位,包含109.9K个定位实例,将423.4K个图表语义标签与其视觉区域对齐;(2)图到图推理,理解图表元素间的拓扑关系;(3)几何推理,为每个连接构建依赖于布局的权重;(4)用于视觉搜索的工具增强型智能体推理,调用外部工具完成(1)-(3)。我们的结果揭示了当前LMMs在解释原理图工程制品方面的显著差距,包括不可靠的细粒度定位、脆弱的布局到图解析、不一致的全局连通性推理以及低效的视觉探索。

英文摘要

Recent large multimodal models (LMMs) have made rapid progress in visual grounding, document understanding, and diagram reasoning tasks. However, their ability to convert Printed Circuit Board (PCB) schematic diagrams into machine-readable spatially weighted netlist graphs, jointly capturing component attributes, connectivity, and geometry, remains largely underexplored, even though such graph representations are the backbone of practical electronic design automation (EDA) workflows. To bridge this gap, we introduce OmniSch, the first comprehensive benchmark designed to assess LMMs on schematic understanding and spatial netlist graph construction. OmniSch contains 1,854 real-world schematic diagrams and includes four tasks: (1) visual grounding for schematic entities, with 109.9K grounded instances aligning 423.4K diagram semantic labels to their visual regions; (2) diagram-to-graph reasoning, understanding topological relationships among diagram elements; (3) geometric reasoning, constructing layout-dependent weights for each connection; and (4) tool-augmented agentic reasoning for visual search, invoking external tools to accomplish (1)-(3). Our results reveal substantial gaps of current LMMs in interpreting schematic engineering artifacts, including unreliable fine-grained grounding, brittle layout-to-graph parsing, inconsistent global connectivity reasoning, and inefficient visual exploration.

2604.10578 2026-06-08 cs.CV 版本更新

Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

Rein3D: 基于全景视频扩散模型的强化3D室内场景生成

Dehui Wang, Rong Wei, Yue Shi, Congsheng Xu, Shoufa Chen, Dingxiang Luo, Tianshuo Yang, Xiaokang Yang, Wei Sui, Yusen Qin, Rui Tang, Yao Mu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Manycore Tech Inc.(Manycore科技公司) D-Robotics The University of Hong Kong(香港大学)

AI总结 提出Rein3D框架,结合3D高斯泼溅与视频扩散模型,通过“恢复-细化”范式从稀疏输入生成全局一致的360度室内场景,并构建PanoV2V-15K数据集,显著提升长距离相机探索效果。

详情
AI中文摘要

随着具身AI和VR应用需求的增长,从稀疏输入合成高质量3D室内场景变得尤为重要。然而,现有方法在推断大量未观测区域中的缺失几何结构时难以保持全局一致性,往往产生局部合理但全局不一致的重建结果。我们提出Rein3D,一个通过将显式3D高斯泼溅(3DGS)与视频扩散模型的时间一致先验相结合来重建完整360度室内环境的框架。我们的方法遵循“恢复-细化”范式:采用径向探索策略,沿从原点开始的轨迹渲染不完美的全景视频,从而从粗略的3DGS初始化中有效揭示被遮挡区域。这些序列由全景视频到视频扩散模型恢复,并通过视频超分辨率进一步增强,以合成高保真几何和纹理。最后,这些细化后的视频作为伪真值更新全局3D高斯场。为支持此任务,我们构建了PanoV2V-15K数据集,包含超过15K对干净和退化的全景视频,用于基于扩散的场景恢复。实验表明,Rein3D生成逼真且全局一致的3D场景,与现有基线相比,显著改善了长距离相机探索。

英文摘要

The growing demand for Embodied AI and VR applications has highlighted the need for synthesizing high-quality 3D indoor scenes from sparse inputs. However, existing approaches struggle to infer massive amounts of missing geometry in large unseen areas while maintaining global consistency, often producing locally plausible but globally inconsistent reconstructions. We present Rein3D, a framework that reconstructs full 360-degree indoor environments by coupling explicit 3D Gaussian Splatting (3DGS) with temporally coherent priors from video diffusion models. Our approach follows a "restore-and-refine" paradigm: we employ a radial exploration strategy to render imperfect panoramic videos along trajectories starting from the origin, effectively uncovering occluded regions from a coarse 3DGS initialization. These sequences are restored by a panoramic video-to-video diffusion model and further enhanced via video super-resolution to synthesize high-fidelity geometry and textures. Finally, these refined videos serve as pseudo-ground truths to update the global 3D Gaussian field. To support this task, we construct PanoV2V-15K, a dataset of over 15K paired clean and degraded panoramic videos for diffusion-based scene restoration. Experiments demonstrate that Rein3D produces photorealistic and globally consistent 3D scenes and significantly improves long-range camera exploration compared with existing baselines.

2604.20123 2026-06-08 cs.CV 版本更新

Topology-Aware Skeleton Detection via Lighthouse-Guided Structured Inference

拓扑感知的骨架检测:基于灯塔引导的结构化推理

Daoyong Fu, Xiang Zhang, Zhaohuan Zhan, Fan Yang, Ke Yang

AI总结 提出Lighthouse-Skel方法,通过双分支协作检测骨架置信场和结构锚点,并利用灯塔引导策略重连不连续骨架,提升骨架连续性和结构完整性。

Comments This submission is withdrawn by the authors because we identified substantive issues in the current version that may affect the reliability and interpretation of the results. We are conducting a thorough revision and validation before making the work publicly available again

详情
AI中文摘要

在自然图像中,物体骨架用于表示几何形状。然而,姿态或运动的轻微变化可能导致骨架结构的显著变化,增加骨架检测的难度,并常常导致不连续的骨架。现有方法主要关注点级骨架点检测,忽视了结构连续性在恢复完整骨架中的重要性。为解决此问题,我们提出Lighthouse-Skel,一种通过灯塔引导的结构化推理实现拓扑感知的骨架检测方法。具体来说,我们引入了一个双分支协作检测框架,联合学习骨架置信场和结构锚点(包括端点和连接点)。点分支学习的空间分布引导网络关注拓扑脆弱区域,从而提高骨架检测的准确性。基于学习的骨架置信场,我们进一步提出灯塔引导的拓扑补全策略,该策略将检测到的连接点和断点作为灯塔,沿低成本路径重连不连续的骨架段,从而改善骨架连续性和结构完整性。在四个公开数据集上的实验结果表明,所提方法在实现竞争性检测精度的同时,显著提升了骨架的连通性和结构完整性。

英文摘要

In natural images, object skeletons are used to represent geometric shapes. However, even slight variations in pose or movement can cause noticeable changes in skeleton structure, increasing the difficulty of detecting the skeleton and often resulting in discontinuous skeletons. Existing methods primarily focus on point-level skeleton point detection and overlook the importance of structural continuity in recovering complete skeletons. To address this issue, we propose Lighthouse-Skel, a topology-aware skeleton detection method via lighthouse-guided structured inference. Specifically, we introduce a dual-branch collaborative detection framework that jointly learns skeleton confidence field and structural anchors, including endpoints and junction points. The spatial distributions learned by the point branch guide the network to focus on topologically vulnerable regions, which improves the accuracy of skeleton detection. Based on the learned skeleton confidence field, we further propose a lighthouse-guided topology completion strategy, which uses detected junction points and breakpoints as lighthouses to reconnect discontinuous skeleton segments along low-cost paths, thereby improving skeleton continuity and structural integrity. Experimental results on four public datasets demonstrate that the proposed method achieves competitive detection accuracy while substantially improving skeleton connectivity and structural integrity.
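The idea of reconnecting broken skeleton segments "along low-cost paths" can be illustrated with a shortest-path search over the confidence field. The following is a minimal sketch and not the authors' implementation: the function name `low_cost_path`, the 4-connected grid, and the step cost `1 - confidence` are all assumptions made for illustration.

```python
import heapq


def low_cost_path(confidence, start, goal):
    """Dijkstra search on a 4-connected grid.

    Assumption: entering a pixel costs (1 - confidence), so the path
    prefers pixels that the confidence field marks as likely skeleton.
    """
    rows, cols = len(confidence), len(confidence[0])
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        r, c = node
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < rows and 0 <= nc < cols:
                nd = d + (1.0 - confidence[nr][nc])
                if nd < dist.get((nr, nc), float("inf")):
                    dist[(nr, nc)] = nd
                    prev[(nr, nc)] = node
                    heapq.heappush(heap, (nd, (nr, nc)))
    # Walk back from the goal to recover the path.
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return path[::-1]


# Two "lighthouse" points separated by a low-confidence gap at (0, 1).
field = [
    [0.9, 0.1, 0.9],
    [0.9, 0.9, 0.9],
]
print(low_cost_path(field, (0, 0), (0, 2)))
```

In this toy field the search detours through the high-confidence bottom row instead of crossing the low-confidence pixel directly.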

2605.14166 2026-06-08 cs.CV 版本更新

You Only Landmark Once: Lightweight U-Net Face Super Resolution with YOLO-World Landmark Heatmaps

你只需一次地标:基于YOLO-World地标热图的轻量级U-Net人脸超分辨率

Riccardo Carraro, Anna Briotto, Endi Hysa, Marco Fiorucci, Lamberto Ballan

发表机构 * Università degli Studi di Milano(米兰大学) Istituto Italiano di Tecnologia(意大利理工学院)

AI总结 提出轻量级U-Net,利用YOLO-World生成的地标热图作为监督,无需额外训练辅助网络,实现8倍人脸超分辨率重建,提升关键区域细节。

Comments Accepted for publication at IEEE AVSS 2026 (Notification date: June 5, 2026)

详情
AI中文摘要

人脸图像超分辨率旨在从严重退化的输入中恢复高分辨率人脸图像。在极端放大因子下,精细的面部细节常常丢失,使得准确重建具有挑战性。现有方法通常依赖重型网络架构、对抗训练方案或单独的对齐网络,增加了模型复杂度和计算成本。为解决这些问题,我们提出了一种基于轻量级U-Net的架构,旨在从严重退化的$16 \times 16$输入重建$128 \times 128$面部图像,实现$8 \times$放大。一个关键贡献是一种新颖的无辅助训练监督策略,利用YOLO-World(一种开放词汇目标检测器)生成的热图来定位关键面部特征,如眼睛、鼻子和嘴巴。这些热图被转换为空间权重,形成热图引导的损失,强调语义重要区域的重建误差。与先前需要专用地标或对齐网络的方法不同,我们的方法直接重用检测器输出作为监督,保持高效的训练和推理流程。在对齐的CelebA数据集上的实验表明,所提出的损失一致地改善了定量指标,并产生了更清晰、更逼真的重建。总体而言,我们的结果表明,轻量级网络可以有效地利用检测驱动的先验进行感知上令人信服的极端放大,而无需对抗训练或增加计算成本。

英文摘要

Face image super-resolution aims to recover high-resolution facial images from severely degraded inputs. Under extreme upscaling factors, fine facial details are often lost, making accurate reconstruction challenging. Existing methods typically rely on heavy network architectures, adversarial training schemes, or separate alignment networks, increasing model complexity and computational cost. To address these issues, we propose a lightweight U-Net-based architecture designed to reconstruct $128 \times 128$ facial images from severely degraded $16 \times 16$ inputs, achieving an $8 \times$ magnification. A key contribution is a novel auxiliary-training-free supervision strategy that leverages heatmaps generated by YOLO-World, an open-vocabulary object detector, to localize key facial features such as eyes, nose, and mouth. These heatmaps are converted into spatial weights to form a heatmap-guided loss that emphasizes reconstruction errors in semantically important regions. Unlike prior methods that require dedicated landmark or alignment networks, our approach directly reuses detector outputs as supervision, maintaining an efficient training and inference pipeline. Experiments on the aligned CelebA dataset demonstrate that the proposed loss consistently improves quantitative metrics and produces sharper, more realistic reconstructions. Overall, our results show that lightweight networks can effectively exploit detection-driven priors for perceptually convincing extreme upscaling, without adversarial training or increased computational cost.
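A heatmap-guided loss of the kind described above can be sketched as a spatially weighted reconstruction error. This is a minimal illustration under stated assumptions, not the paper's formulation: the function name `heatmap_weighted_l1`, the weight form `1 + alpha * heatmap`, and the choice of an L1 error are all hypothetical.

```python
def heatmap_weighted_l1(pred, target, heatmap, alpha=1.0):
    """Weighted mean absolute error over a 2D image.

    Assumption: each pixel gets weight w = 1 + alpha * heatmap value,
    so errors near detected facial features count more.
    """
    total, weight_sum = 0.0, 0.0
    for p_row, t_row, h_row in zip(pred, target, heatmap):
        for p, t, h in zip(p_row, t_row, h_row):
            w = 1.0 + alpha * h
            total += w * abs(p - t)
            weight_sum += w
    return total / weight_sum


pred = [[0.0, 0.0], [0.0, 0.0]]
target = [[1.0, 0.0], [0.0, 0.0]]      # one wrong pixel at (0, 0)
on_feature = [[1.0, 0.0], [0.0, 0.0]]  # heatmap peak on the wrong pixel
flat = [[0.0, 0.0], [0.0, 0.0]]        # no feature emphasis

print(heatmap_weighted_l1(pred, target, on_feature))
print(heatmap_weighted_l1(pred, target, flat))
```

The same single-pixel error yields a larger loss when it falls on a heatmap peak, which is the emphasis the abstract describes.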

2605.19611 2026-06-08 cs.CV cs.ET 版本更新

Physics Guided Conditional Diffusion Framework for Generative Inverse Design of Manufacturable Metasurface based Absorbers

基于物理引导的条件扩散模型的超材料吸收体逆向设计

Vineetha Joy, Jamshed Palai, Satwik Sahu, Anshuman Kumar, Amit Sethi, Hema Singh

发表机构 * Centre for Electromagnetics, CSIR-National Aerospace Laboratories(电磁研究中心,国家航空航天实验室) Birla Institute of Technology and Science, Pilani(博拉理工学院皮拉尼分校) Indian Institute of Technology, Bombay(孟买印度理工学院)

AI总结 本文提出了一种基于物理引导的条件扩散框架,用于设计具有特定电磁响应的超材料吸收体,通过特征线性调制和预训练的替代电磁模拟器,提高了设计效率和条件准确性,实验表明该方法在2-18GHz频率范围内能够快速生成实用的超材料结构。

详情
AI中文摘要

针对特定电磁响应的超材料逆向设计需要生成满足严格频谱约束且可制造的几何结构。传统设计方法依赖于全波仿真进行迭代优化,对于大设计空间来说非常耗时且计算密集。此外,常用的生成方法往往条件保真度有限,生成的设计通常包含精细或不规则特征,难以制造。为此,我们提出了一种物理引导的条件质量增强扩散框架,用于超材料吸收体的逆向设计。在这里,由目标反射特性构成的条件信息通过特征线性调制(FiLM)整合到模型中。此外,为了确保符合目标频谱,嵌入了预训练的替代电磁模拟器,通过频谱级损失函数引入物理感知的正则化。通过在2至18GHz频率范围内生成不同类型的反射特性实用的超材料结构,证明了所提模型的有效性。该框架实现了目标频谱与生成设计频谱之间的平均频谱均方误差为0.0006,频段对齐精度为0.958,显示出高条件准确性。此外,模型为相同条件生成多种几何结构,从而为工程师提供多样化的设计选择。所提模型在约30秒内生成合适的设计,而传统方法在同等计算资源下需要数月时间。模型的效率还通过实验测量得到验证。

英文摘要

Inverse design of metasurfaces under continuous electromagnetic constraints requires generation of geometries that simultaneously satisfy stringent spectral specifications and remain manufacturable. Conventional approaches based on iterative full wave simulations are computationally prohibitive for large design spaces, while existing generative models often suffer from poor conditional controllability and limited fabrication awareness. In this regard, we propose a physics guided condition quality enhanced diffusion framework for the inverse design of metasurface based absorbers. Fabrication-aware constraints are incorporated to ensure practical realizability of the generated designs. The framework introduces a conditioning mechanism for continuous spectral specifications, wherein feature-wise linear modulation propagates the condition across the denoising hierarchy, enabling stable and accurate generation with improved spectral controllability. Further, to embed EM consistency directly into the generative learning process, a pre trained surrogate EM simulator is integrated within the diffusion training pipeline. The proposed framework generated physically realizable metasurface designs for diverse reflection characteristics in the frequency range of 2 to 18 GHz, achieving a very low average spectral mean squared error of 0.0006 and a high band alignment accuracy of 0.958. The framework also addresses the fundamentally non-unique nature of inverse EM design by enabling structured multimodal generation of geometrically distinct yet spectrally consistent metasurface designs for the same target response. The proposed model produces the suitable design in approximately 30 seconds, whereas the conventional approach can take several months under comparable computational resources. The efficiency of the model is also established via experimental measurements.
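Feature-wise linear modulation (FiLM) scales and shifts each feature channel with values derived from the condition. The sketch below shows only that per-channel affine step. In a real model `gamma` and `beta` would come from a learned network applied to the target spectrum; here they are fixed, hypothetical numbers, and the function name `film` is an assumption.

```python
def film(features, gamma, beta):
    """Apply y[c][i] = gamma[c] * x[c][i] + beta[c] for every channel c.

    features: list of channels, each a list of activations.
    gamma, beta: one scale and one shift per channel (assumed given).
    """
    return [
        [g * v + b for v in channel]
        for channel, g, b in zip(features, gamma, beta)
    ]


features = [[1.0, 2.0], [3.0, 4.0]]  # two channels, two activations each
gamma = [2.0, 0.5]                   # hypothetical condition-derived scales
beta = [0.0, 1.0]                    # hypothetical condition-derived shifts
print(film(features, gamma, beta))   # [[2.0, 4.0], [2.5, 3.0]]
```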

2605.20950 2026-06-08 cs.CV cs.AI 版本更新

Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

聚焦-然后-上下文:面向视觉-语言模型的主体导向渐进视觉标记缩减

Yulin Zhao, Zheng Zhang

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)) ShenZhen Loop Area Institute(深圳河套学院)

AI总结 本文提出了一种主体导向的渐进视觉标记缩减方法SPpruner,通过模拟人类视觉感知系统的'聚焦-然后-上下文'机制,有效减少视觉标记数量,提升视觉-语言模型的推理效率,实验表明其在速度和资源消耗上均优于现有方法。

详情
AI中文摘要

视觉-语言模型(VLMs)在推理过程中面临由于大规模视觉标记序列带来的计算成本瓶颈。现有的视觉标记缩减方法虽然减轻了这一负担,但无意中保留了与用户查询严格对齐的孤立视觉主体,无法充分探索显著主体及其上下文关系。本文提出SPpruner,一种以主体为中心的渐进缩减范式,模拟人类视觉感知系统的'聚焦-然后-上下文'机制。具体而言,我们首先构建了一个聚焦识别模块,以显式建模视觉显著性与语义相关性之间的相互作用。在此基础上,它可以挖掘全面的视觉主体光谱,确保视觉输入的高保真表示。随后,开发了一个上下文感知的结构扫描模块,用于聚合邻近区域的上下文线索。因此,它可以有效恢复全局关系依赖,以维持保留主体的结构完整性。大量实验表明,我们的范式在速度和资源消耗上均优于现有方法,在Qwen2.5-VL中仅保留22.2%的视觉标记即可实现2.53倍的加速,在LLaVA中实现67%的FLOPs减少,仅导致0.6%的精度下降。

英文摘要

Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference. Existing vision token reduction methods alleviate this burden, but they unintentionally preserve the isolated visual subject strictly aligned with the user's query, which fails to substantially explore salient subjects and their contextual relationships. In this paper, we propose SPpruner, a subject-centric progressive reduction paradigm that emulates the Focus-then-Context mechanism of the human visual perception system. Specifically, we first construct a focus identification module to explicitly model the interplay between visual saliency and semantic relevance. Herein, it can excavate the comprehensive visual subject spectrum to ensure a high-fidelity representation of visual input. Subsequently, a context-aware structural scanning module is developed to aggregate contextual cues from neighboring regions. As such, it can effectively restore global relational dependencies to uphold the structural integrity of the preserved subjects. Extensive experiments demonstrate that our paradigm consistently outperforms SOTA methods, achieving up to 2.53 times speedup with only 22.2% of visual tokens retained in Qwen2.5-VL and a 67% FLOPs reduction on LLaVA with a negligible 0.6% accuracy drop.
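Visual token reduction in general can be pictured as keeping a fixed fraction of tokens ranked by some importance score. The sketch below shows only that generic top-k retention step, not SPpruner's focus or context modules: the function name `keep_top_tokens` and the scores are placeholders, and how scores are computed is left open.

```python
import math


def keep_top_tokens(scores, keep_ratio):
    """Return indices of the highest-scoring tokens, in original order.

    scores: one importance score per visual token (assumed given).
    keep_ratio: fraction of tokens to retain, e.g. 0.222 for 22.2%.
    """
    k = max(1, math.ceil(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore positional order for the LLM


scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.05, 0.6, 0.4]
print(keep_top_tokens(scores, 0.222))  # [1, 3]
```

Sorting the kept indices preserves the spatial order of the surviving tokens, which matters when positional information is attached to them.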

2605.22882 2026-06-08 cs.CV cs.RO 版本更新

GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

GEM-4D:用于机器人操作的几何增强视频世界模型

Kaichen Zhou, Yuzhen Chen, Fangneng Zhan, Hang Hua, Grace Chen, Xinhai Chang, Ao Qu, Yilun Du, Zhuang Liu, Paul Pu Liang, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab(哈佛人工智能与机器人实验室) Harvard University(哈佛大学) Media Lab and EECS(媒体实验室和电子工程与计算机科学系) MIT(麻省理工学院) Princeton University(普林斯顿大学) MIT-IBM Watson AI Lab(麻省理工-IBM沃森人工智能实验室)

AI总结 提出GEM-4D,通过注入从预训练几何基础模型蒸馏的密集4D对应监督,增强视频世界模型的几何一致性,并引入逆动力学模块将视频滚动转换为可执行机器人轨迹,提升操作成功率。

Comments Robotic World Model, Video Generative Model

详情
AI中文摘要

视频世界模型可以从单个指令生成逼真的未来帧,但它们通常无法在时间上一致地跟踪相同的物理点。因此,生成的视频看似合理,但缺乏可靠动作执行(如机器人操作)所需的物理基础。我们提出GEM-4D,一种几何接地视频世界模型,通过在训练期间将预训练几何基础模型蒸馏的密集4D对应监督注入视频生成骨干网络来解决这一限制。这种监督使模型能够联合捕捉外观和几何结构,同时保持单流架构且无额外推理成本。我们进一步引入逆动力学模块,将对应一致的视频滚动转换为可执行的机器人轨迹,从而能够在真实世界和模拟操作中直接部署。GEM-4D在视频预测和几何一致性方面在模拟和真实场景中均达到最先进性能,并将真实世界操作成功率从61%提升至81%。更多结果见https://gem-4d.github.io/。

英文摘要

Video world models can generate realistic futures from a single instruction, but they often fail to track the same physical points consistently across time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision distilled from a pretrained geometry foundation model into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on both video prediction and geometric consistency across both simulation and realistic scenarios and improves real-world manipulation success from 61% to 81%. Additional results are available at https://gem-4d.github.io/.

2605.24011 2026-06-08 cs.CV cs.AI 版本更新

ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models

ActQuant: 面向视觉-语言-动作模型的亚4比特动作引导量化

Arash Akbari, Arman Akbari, Masih Eskandar, Qitao Tan, Yixiao Chen, Jingwu Luo, Bertha Pangaribuan, Liyun Zhang, Jennifer Dy, Geng Yuan, Xue Lin, Gaowen Liu, Stratis Ioannidis, Yanzhi Wang

发表机构 * Northeastern University(东北大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出ActQuant框架,通过动作引导的混合精度后训练量化,在亚4比特权重量化下保持VLA模型性能,并引入OmniModel.cpp实现高效部署。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在具身智能中展现出卓越的动作生成能力,但其高计算量使得在边缘平台部署不切实际。激进的亚4比特权重量化是自然解决方案,但现有后训练量化(PTQ)方法在此情况下性能严重下降。为解决此问题,我们引入ActQuant,一个动作引导的混合精度PTQ框架,包含两个阶段:(1)张量间比特分配器,根据每个权重矩阵对预测智能体动作的贡献程度分配单一比特宽度;(2)张量内尺度优化器,使用动作感知曲率调整每块量化尺度,使动态范围集中在控制影响最大的权重上。为了在设备上实现激进量化的优势,我们进一步引入OmniModel.cpp,一个代理转换流水线,将架构移植到具有高效低位内核的原生C/C++运行时。我们在仿真和真实世界的6自由度UR3机械臂上评估ActQuant,所有模型通过OmniModel.cpp部署。在LIBERO基准上,ActQuant是唯一在每权重3比特或以下运行的方法,在OpenVLA-OFT上保持95.0%的性能,在$π_{0.5}$上保持94.8%。进一步,ActQuant在OpenVLA-OFT上达到2.5 bpw,性能为90.1%,将骨干网络从14.3 GB压缩到2.7 GB(5.3倍)。在物理UR3机械臂上,使用ActQuant量化的$π_{0.5}$保持基线的成功率,同时将内存占用减少2.5倍。

英文摘要

Vision-Language-Action (VLA) models exhibit remarkable action generation for embodied intelligence, but their heavy compute makes deployment on edge platforms impractical. Aggressive, sub-4-bit weight quantization is the natural solution, yet existing post-training quantization (PTQ) methods suffer severe performance degradation in this regime. To address this, we introduce ActQuant, an action-guided mixed-precision PTQ framework that operates in two stages: (1) an inter-tensor bit allocator that assigns each weight matrix a single bit-width based on how much it contributes to predicting the agent's actions; (2) an intra-tensor scale optimizer that tunes per-block quantization scales using action-aware curvature, so that dynamic range is concentrated on the weights most influential for control. To deliver the on-device benefits of our aggressive quantization, we further introduce OmniModel.cpp, an agentic conversion pipeline that ports architectures into a native C/C++ runtime with efficient low-bit kernels. We evaluate ActQuant both in simulation and on a real-world 6-DoF UR3 arm, with all models deployed through OmniModel.cpp. On the LIBERO benchmark, ActQuant is the only method that operates at or below 3 bits-per-weight, retaining 95.0% on OpenVLA-OFT and 94.8% on $π_{0.5}$. Pushed further, ActQuant reaches 2.5 bpw at 90.1% on OpenVLA-OFT, compressing the backbone from 14.3 GB to 2.7 GB (5.3$\times$). On the physical UR3 arm, $π_{0.5}$ quantized with ActQuant retains the baseline's success rate while reducing the memory footprint by 2.5$\times$.
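Per-block quantization scales and a mixed-precision bits-per-weight budget can be illustrated with plain symmetric uniform quantization. This sketch deliberately swaps in the simplest max-abs scale rule; it does not implement ActQuant's action-aware curvature or its bit allocator. All function names are hypothetical.

```python
def quantize_block(block, bits):
    """Symmetric uniform quantization of one weight block.

    Assumption: the scale is max|w| / qmax (plain max-abs scaling).
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in block)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in block]
    return codes, scale


def dequantize_block(codes, scale):
    """Map integer codes back to approximate weights."""
    return [c * scale for c in codes]


def average_bits_per_weight(tensor_sizes, tensor_bits):
    """Size-weighted mean bit-width across tensors (mixed precision)."""
    total = sum(tensor_sizes)
    return sum(n * b for n, b in zip(tensor_sizes, tensor_bits)) / total


codes, scale = quantize_block([1.5, 0.4, -0.2], bits=3)
print(codes, scale)                                # [3, 1, 0] 0.5
print(dequantize_block(codes, scale))              # [1.5, 0.5, 0.0]
print(average_bits_per_weight([100, 300], [4, 2])) # 2.5
print(round(14.3 / 2.7, 1))                        # 5.3, the reported ratio
```

The last line only checks the arithmetic behind the reported compression from 14.3 GB to 2.7 GB.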

2605.25757 2026-06-08 cs.CV 版本更新

Broadband Hyperspectral 3D Imaging using Dispersed Structured Light

宽带高光谱3D成像:使用色散结构光

Suhyun Shin, Yunseong Moon, Ryota Maeda, David B. Lindell, Kiriakos N. Kutulakos, Seung-Hwan Baek

发表机构 * POSTECH South Korea(韩国浦项科技大学) University of Hyogo Japan(日本兵库县立大学) University of Toronto Canada(加拿大多伦多大学)

AI总结 提出一种基于单光谱仪的宽带高光谱3D成像方法,通过可见光和SWIR相机立体设置,利用色散结构光同时重建密集宽带高光谱反射率和精确3D几何,解决了传统方法光谱范围窄、系统复杂的问题。

详情
AI中文摘要

高光谱3D成像能够捕获密集的光谱信息和场景几何,但传统上局限于窄光谱窗口,通常是可见光范围。在这项工作中,我们引入了一种宽带高光谱3D成像(BH3D)方法,将这一能力扩展到整个可见-近红外和短波红外(SWIR)光谱(450-1500 nm)。这种宽覆盖范围至关重要,因为它捕获了互补的物理线索:可见光波长揭示表面外观,而SWIR波段提供对次表面特性和材料组成的洞察。然而,实现BH3D具有挑战性,因为可见光谱硅传感器和SWIR光谱InGaAs传感器之间存在基本的传感器限制,需要复杂的多光谱仪设计。在这里,我们提出了一种单光谱仪BH3D系统,使用包含可见光和SWIR相机的立体设置,重建密集的宽带高光谱反射率以及精确的3D几何。我们的关键思想是使用单个光谱仪将色散结构光扩展到宽带范围。我们建模了宽带色散结构光的图像形成过程,并估计了高光谱反射率和深度。我们在多样化的真实场景上验证了我们的方法,展示了精确的重建,平均光谱角映射器为0.13 rad,均方根误差为0.03,平均深度误差为4.5 mm。我们进一步展示了识别同色异谱材料、通过不透明层成像、揭示钞票上的隐藏特征以及显示血管的能力。

英文摘要

Hyperspectral 3D imaging enables the capture of dense spectral information and scene geometry but has traditionally been confined to narrow spectral windows, typically the visible range. In this work, we introduce a broadband hyperspectral 3D imaging (BH3D) method to extend this capability across the full visible-near-infrared and short-wavelength infrared (SWIR) spectrum (450-1500 nm). This broad coverage is critical as it captures complementary physical cues: visible wavelengths reveal surface appearance, while SWIR bands provide insight into subsurface properties and material composition. However, realizing BH3D is challenging due to fundamental sensor constraints between visible-spectrum silicon and SWIR-spectrum InGaAs sensors, which necessitate complex multi-spectrograph designs. Here we propose a single-spectrograph BH3D system, using a stereo setup comprising visible and SWIR cameras, that reconstructs dense broadband hyperspectral reflectance together with accurate 3D geometry. Our key idea is to extend dispersed structured light to the broadband regime using a single spectrograph. We model the image formation of broadband dispersed structured light, and estimate hyperspectral reflectance and depth. We validate our approach on diverse real-world scenes, demonstrating accurate reconstruction with a mean spectral angle mapper of 0.13 rad, root mean square error of 0.03, and mean depth error of 4.5 mm. We further demonstrate identifying metameric materials, performing imaging through opaque layers, uncovering hidden features on banknotes, and revealing blood vessels.
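The spectral angle mapper reported above is a standard metric: the angle between two spectra treated as vectors, so it ignores overall brightness. A minimal sketch follows; the function name `spectral_angle` and the sample spectra are illustrative only.

```python
import math


def spectral_angle(a, b):
    """Angle in radians between two spectra of equal length.

    Computes arccos(<a, b> / (|a| * |b|)), clamped against rounding.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    cos_angle = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.acos(cos_angle)


reference = [0.2, 0.4, 0.6]
brighter = [0.4, 0.8, 1.2]   # same shape, twice the intensity
different = [0.6, 0.4, 0.2]  # reversed spectral shape

print(spectral_angle(reference, brighter))
print(spectral_angle(reference, different))
```

A uniformly brighter copy of a spectrum gives an angle near zero, while a change in spectral shape gives a clearly positive angle.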

2605.25806 2026-06-08 cs.CV 版本更新

An Analysis Focused on Womens Safety: Can VAD Models Be Enhanced by a Multi-modal Dataset?

聚焦女性安全分析:多模态数据集能否增强VAD模型?

Sangeeta ., Maddikuntla Sai Prajwal, Debi Prosad Dogra, Kamalakar Vijay Thakare, Hyungjoo Jung, Ig-Jae Kim, Heeseung Choi

发表机构 * Indian Institute of Technology Bhubaneswar(印度理工学院布巴内斯瓦尔分校) Artificial Intelligence and Robotics Institute, Korea Institute of Science and Technology(人工智能与机器人研究所,韩国科学技术研究院) Yonsei-KIST Convergence Research Institute, Yonsei University(延世大学KIST融合研究中心)

AI总结 针对现有视频异常检测数据集缺乏女性中心异常样本的问题,提出包含1001个视频及文本描述的多模态基准ExtrAnom,覆盖5种犯罪类型,并验证了多模态方法在检测女性中心异常上的有效性。

Comments 7 pages, 6 figures, 4 tables

详情
AI中文摘要

女性安全对于现代社会至关重要。针对女性的犯罪既发生在白天也发生在低光照条件下。通常,此类事件通过低分辨率的现实监控摄像头捕捉。尽管计算机视觉相关研究取得了显著进展,但专注于女性安全的视频异常检测(VAD)尚未得到充分解决。现有的视频异常数据集包含光照良好、高分辨率、近景视频,未能涵盖女性中心异常,如抢项链、跟踪、不当触摸及其他针对女性的细微犯罪形式。为解决这些问题,我们提出了ExtrAnom数据集,这是一个新的多模态基准,包含1001个带有文本描述的视频(500个正常,501个异常),分为5种不同类型的女性中心犯罪。该数据集包含低光照(8%)、低分辨率(13%)、远景(15%)以及白天(64%)异常视频。它涵盖了异常事件如跟踪(3.9%)、抢项链(17.6%)、绑架(7.3%)、暗杀(2.3%)、骚扰(18.9%)和正常(50%)。每个视频附带4个文本标注,包括一个人工生成和三个大语言模型生成的描述,支持跨模态和基于视觉语言模型(VLM)的验证。创建女性中心数据集的目标是准确检测可能通过视觉观察到的女性中心异常模式。该数据集辅助VLM准确生成视频级描述。ExtrAnom已针对流行的单模态和多模态VAD数据集(如XD-Violence、UCF-Crime和UCA)及最先进方法进行了基准测试。实验表明,现有数据集不足以训练模型检测女性中心异常。

英文摘要

Women's safety and security are paramount for a modern society. Crimes against women occur in daylight as well as in low-light conditions. Often, such events are captured through real-world surveillance cameras that operate at lower resolutions. Despite substantial progress in CV-related research, video anomaly detection (VAD) focused on women's safety has not yet been adequately addressed. Existing video anomaly datasets contain well-lit, high-resolution, close-shot videos, and fail to represent women-centric anomalies such as chain snatching, stalking, inappropriate touch, and other subtle forms of crime against women. To address these problems, we propose the ExtrAnom dataset, a new multi-modal benchmark containing 1001 videos with textual descriptions, 500 normal and 501 anomalous, classified into 5 different types of women-centric crimes. The dataset comprises low-light (8%), low-resolution (13%), long-shot (15%), and daylight (64%) anomalous videos. It covers anomalous events such as stalking (3.9%), chain snatching (17.6%), kidnapping (7.3%), assassinations (2.3%), and harassment (18.9%), alongside normal videos (50%). Each video is supplemented with 4 textual annotations, including one human-generated and three LLM-generated descriptions, enabling cross-modal and VLM-based validations. The aim of creating a women-centric dataset is to accurately detect women-centric anomaly patterns that can be observed visually. The dataset helps VLMs generate accurate video-level descriptions. ExtrAnom has been benchmarked against popular unimodal and multi-modal VAD datasets (e.g., XD-Violence, UCF-Crime, and UCA) and SOTA methods. Experiments reveal that the existing datasets are insufficient to train models for detecting women-centric anomalies.

2606.02450 2026-06-08 cs.CV 版本更新

Reason-Then-Retrieve for CoVR-R with Structured Edit Prompts and Dense-Sparse Fusion

先推理后检索:面向CoVR-R的结构化编辑提示与密集-稀疏融合

DongQing Liu, MengShi Qi, HongWei Ji

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 针对CoVR-R任务,提出一种零样本的“先推理后检索”流水线,利用Qwen3.5-27B生成结构化描述和密集嵌入,并结合TF-IDF稀疏分支进行融合排序,在验证集和测试集上取得了领先性能。

详情
AI中文摘要

CoVR-R研究基于推理的组合视频检索:给定一个参考视频和一个编辑指令,系统必须检索满足编辑的目标视频。主要困难在于目标未被直接描述,必须从对象身份、动作顺序、最终状态、手部交互和场景转换的细粒度变化中推断。我们围绕Qwen3.5-27B构建了一个零样本的“先推理后检索”流水线。对于每个图库视频,模型通过池化生成令牌的隐藏状态(使用令牌相关权重)生成面向检索的结构化描述和密集嵌入。对于每个查询,模型首先对参考视频和指令进行编辑推理,然后生成目标视频描述,其隐藏状态作为查询嵌入。我们通过生成文本上的TF-IDF分支补充密集检索,并使用分割特定权重融合两个排序。在验证集上,当前最佳提交在R@1达到80.81,R@5达到94.86,R@10达到97.11,R@50达到98.59。在盲测集上,R@1达到89.73,R@5达到95.79,R@10达到96.63,R@50达到97.98。

英文摘要

CoVR-R studies reason-aware composed video retrieval: given a reference video and an edit instruction, the system must retrieve the target video that satisfies the edit. The main difficulty is that the target is not described directly; it must be inferred from fine-grained changes in object identity, action order, final state, hand interaction, and scene transition. We build a zero-shot reason-then-retrieve pipeline around Qwen3.5-27B. For each gallery video, the model generates a retrieval-oriented structured description and a dense embedding by pooling generated-token hidden states with token-dependent weights. For each query, the model first performs edit reasoning over the reference video and instruction, then generates a target-video description whose hidden states serve as the query embedding. We complement dense retrieval with a TF-IDF branch over the generated texts and fuse the two rankings with split-specific weights. On validation, the current best submission reaches 80.81 at R@1, 94.86 at R@5, 97.11 at R@10, and 98.59 at R@50. On the blind test split, it reaches 89.73 at R@1, 95.79 at R@5, 96.63 at R@10, and 97.98 at R@50.
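Fusing a dense-embedding ranking with a TF-IDF ranking, and scoring the result with Recall@K, can be sketched as follows. The abstract does not give the fusion rule, so the min-max normalization and the convex combination with a single `dense_weight` are assumptions; all function names are hypothetical.

```python
def min_max(scores):
    """Rescale scores to [0, 1]; constant inputs map to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_and_rank(dense, sparse, dense_weight):
    """Rank gallery items by a weighted sum of normalized scores.

    Assumption: fused = w * dense + (1 - w) * sparse, with w chosen
    per data split.
    """
    d, s = min_max(dense), min_max(sparse)
    fused = [dense_weight * a + (1.0 - dense_weight) * b for a, b in zip(d, s)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


def recall_at_k(rankings, targets, k):
    """Percentage of queries whose target appears in the top k."""
    hits = sum(1 for ranking, t in zip(rankings, targets) if t in ranking[:k])
    return 100.0 * hits / len(targets)


dense = [0.9, 0.8, 0.1]   # e.g. cosine similarities to the query embedding
sparse = [0.0, 5.0, 1.0]  # e.g. TF-IDF scores over generated descriptions
print(fuse_and_rank(dense, sparse, dense_weight=0.5))  # [1, 0, 2]
print(fuse_and_rank(dense, sparse, dense_weight=1.0))  # [0, 1, 2]
```

In this toy case the sparse branch promotes item 1 above item 0, which the dense scores alone would have ranked first.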

2606.02919 2026-06-08 cs.CV 版本更新

Pixel Cube: Diffusion-based Portrait Video Relighting Through Realistic Lighting Reproduction

Pixel Cube: 基于扩散的肖像视频重光照通过真实感光照再现

Yufan Zhang, Yu Ji, Ayo Ajiboye, Rundi Wu, Yu Guo, Changxi Zheng, Jinwei Ye

发表机构 * George Mason University(乔治·梅森大学) LightThought LLC Columbia University(哥伦比亚大学)

AI总结 提出一种基于扩散的方法,利用混合训练数据集和HDR环境图控制,实现动态肖像视频的真实感重光照,保持时间一致性和身份特征。

Comments ACM SIGGRAPH 2026 Journal Track / ACM Transactions on Graphics, 17 pages. Project page: https://yufanzhang82.github.io/PixelCube/

详情
Journal ref
ACM Trans. Graph. 45, 4, Article 119 (July 2026), 17 pages
AI中文摘要

我们提出了一种基于扩散的方法,用于对动态肖像视频进行重光照,实现照片级真实感和时间一致性。我们的方法由一个混合训练数据集驱动,该数据集包含真实拍摄和渲染的动态肖像视频,具有多样的主体外观、面部运动、头部姿态和已知光照条件。具体来说,我们构建了一个基于LED的光照系统,用于真实感光照模拟和高速视频重光照数据采集。通过利用预训练视频扩散模型中嵌入的图像先验,并使用逐帧高动态范围(HDR)环境图作为光照控制,我们训练了一个高性能生成模型,用于真实且保持身份的动态肖像视频重光照。除了环境图控制外,我们的模型还使用合成的背景图像来控制相机的曝光水平和色调。我们的模型可以在提供的新环境下生成时间一致的重光照肖像视频,看起来真实且和谐,并忠实保留主体的表情和精细面部特征,包括肤色、皱纹和胡须。我们的模型在主体外观、运动和光照条件方面对未见数据具有良好的泛化能力。我们使用各种环境图对野外视频进行了广泛的重光照实验,并展示了在肖像摄影中的实际应用。结果表明,我们的方法在照片级真实感、光照和谐性和时间一致性方面达到了最先进的性能。

英文摘要

We present a diffusion-based method for relighting dynamic portrait videos with photorealism and temporal consistency. Our method is fueled by a hybrid training dataset that consists of real-captured and rendered dynamic portrait videos with diverse subject appearances, facial motions, head poses, and known lighting conditions. Specifically, we construct an LED-based lighting system for realistic lighting emulation and high-speed video relighting data acquisition. By leveraging the image priors embedded in pre-trained video diffusion models, and using per-frame high dynamic range (HDR) environment maps as lighting control, we train a high-performance generative model for realistic and identity-preserving dynamic portrait video relighting. In addition to the environment map control, our model uses a synthesized background image to enable control over the camera's exposure level and color tone. Our model can produce temporally consistent relit portrait videos that look realistic and harmonious under a provided new environment and faithfully preserve the subject's expression and fine facial features, including skin tone, wrinkles, and facial hair. Our model generalizes well to unseen data, in terms of the subject appearance, motion, and lighting condition. We perform extensive experiments on relighting in-the-wild videos with various environment maps and demonstrate practical applications on portrait photography. Results show that our method achieves state-of-the-art performance in photorealism, lighting harmony, and temporal consistency.

2606.04349 2026-06-08 cs.CV cs.AI 版本更新

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

MorphoQuant: 面向全模态大语言模型的模态感知量化

Yue Wu, Changyuan Wang, Zixuan Wang, Shilin Ma, Yansong Tang

AI总结 提出MorphoQuant框架,通过分布感知偏差补偿和形态导向量化函数优化,解决全模态大语言模型在4比特后训练量化中的分布异质性和异常值问题,实现精度与效率的优异平衡。

详情
AI中文摘要

传统的后训练量化方法在处理4比特全模态大语言模型时,由于跨模态的极端分布异质性和不同的异常值模式而面临困难。为了解决这一问题,我们提出了MorphoQuant,一种模态感知的PTQ框架,旨在保留跨模态形态并减轻异常值损失。具体来说,我们引入了分布感知偏差补偿,它选择性地将长尾异常值吸收到通道偏差中。该机制在保持异常值幅度的同时,为密集内点维持高精度离散化,从而在多样的模态分布中保持精确的离散化。作为补充,我们提出了形态导向量化函数优化,以协同优化量化网格与偏差掩码,确保跨模态的细粒度对齐。在Qwen2.5-Omni上对MMMU和Video-MME等基准的广泛评估证明了我们方法的优越性。值得注意的是,我们的W4A4模型在ScienceQA上达到了76.63%,显著优于最先进的W4A4方法,并意外地超越了W4A16基线,这充分展示了我们框架在精度-效率权衡方面的卓越表现。

英文摘要

Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework engineered to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism safeguards outlier magnitudes while maintaining high-precision discretization for dense inliers, thereby preserving accurate discretization across diverse modal distributions. Complementing this, we propose Morphology-Directed Quantization Function Optimization (MDQFO) to co-optimize the quantization grid with the bias mask, ensuring fine-grained alignment across modalities. Extensive evaluations on Qwen2.5-Omni across benchmarks like MMMU and Video-MME demonstrate our approach's superiority. Notably, our W4A4 model achieves 76.63% on ScienceQA, significantly outperforming SOTA W4A4 methods and surprisingly surpassing the W4A16 baseline, which fully demonstrates the exceptional accuracy-efficiency trade-off of our framework.

2606.04373 2026-06-08 cs.CV cs.AI 版本更新

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

解耦信息区域的选择性耦合:用于视觉Transformer无数据量化的掩码注意力对齐

Biao Qian, Yang Wang, Yong Wu, Jungong Han

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出MaskAQ方法,通过解耦合成样本中的信息区域并利用掩码注意力对齐全精度模型与量化模型,解决无数据量化中分布不匹配问题。

Comments Accepted to appear at ICML 2026, Seoul, Korea

详情
AI中文摘要

无数据量化(DFQ)通过合成样本解决数据安全问题,无需访问真实数据。由于自注意力机制相比经典卷积运算的优势,DFQ在视觉Transformer(ViT)中日益受到关注。然而,先前的ViT DFQ方法常遭受合成样本与量化模型Q期望输入分布之间的分布不匹配,导致性能次优。本文提出一种新颖的掩码注意力对齐方法用于ViT的无数据量化,称为MaskAQ,揭示了:1)自注意力机制中的语义主要局限于稀疏的补丁子集,称为信息区域;2)信息区域主导了合成样本与Q输出之间的互信息。为此,我们利用合成样本补丁相似性的微分熵最大化,从噪声背景中解耦信息区域。为了与不同的Q耦合,通过掩码注意力对齐目标选择信息区域以对齐全精度模型与Q,从而产生高质量的合成样本。此外,提出周期性样本刷新策略,使MaskAQ能够在训练过程中持续适应Q的演化状态,以保持与合成样本的理想互信息。大量实验验证了MaskAQ在多个骨干网络和下游任务上优于最先进方法。我们的代码可在https://github.com/hfutqian/MaskAQ获取。

英文摘要

Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism over classical convolutional operations. However, previous DFQ methods for ViTs often suffer from a distribution mismatch between the synthetic samples and the input distribution expected by the quantized model Q, resulting in suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data-Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self-attention mechanism are predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q's outputs. To this end, we maximize the differential entropy of the patch similarity of synthetic samples to decouple informative regions from the noisy background. To couple with varied Q, the informative regions are selected to align full-precision models with Q via a masked attention alignment objective, thus yielding high-quality synthetic samples. Furthermore, a periodic sample-refreshing strategy lets MaskAQ continually adapt to the evolving state of Q throughout training, preserving the desired mutual information with synthetic samples. Extensive experiments verify the merits of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at https://github.com/hfutqian/MaskAQ.

2606.05949 2026-06-08 cs.CV 版本更新

Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models

忠实、丰富且精确:T2I模型在自然科学插图生成中的基准测试

Yifan Chang, Jiaxin Ai, Jianwen Sun, Yuandong Pu, Siqi Luo, Liangliang Zhao, Yuchen Ren, Minghao Liu, Yunfei Yu, Yu Qiao, Kaipeng Zhang, Yihao Liu

发表机构 * Shanghai Innovation Institute(上海创新研究院) Shanghai AI Laboratory(上海人工智能实验室) University of Science and Technology of China(中国科学技术大学) Wuhan University(武汉大学) Nankai University(南开大学) Shanghai Jiao Tong University(上海交通大学) Fudan University(复旦大学) ZODA Alaya Studio(Alaya工作室)

AI总结 提出FEPBench基准,通过细粒度原子集标注和三维评估(指令忠实性、推理丰富性、语义精确性)系统评估T2I模型在自然科学插图生成中的表现,发现即使最先进的闭源模型仍存在文本渲染瓶颈、推理丰富性有限以及生成丰富性与精确性难以平衡的问题。

详情
AI中文摘要

科学插图是交流研究发现的重要工具,尤其是在自然科学中,它们可视化复杂的概念和过程。随着文本到图像(T2I)模型能力的增强,研究人员已开始将其用于科学插图生成。然而,现有基准通常从整体层面评估输出,忽略了细粒度元素,同时科学推理能力和输出简洁性仍缺乏量化。我们引入了FEPBench,一个基于跨多个学科和布局类型精心挑选的高质量科学插图构建的基准。借助多模态大语言模型(MLLM)和人类专家,我们提供了细粒度原子集标注,并沿三个维度系统评估T2I模型:指令忠实性、推理丰富性和语义精确性。我们的评估进一步将模型性能分解为视觉、文本、关系和布局元素。结果表明,即使最先进的(SOTA)闭源模型,如GPT Image 2和Nano Banana Pro,仍然存在文本渲染瓶颈、推理丰富性有限以及生成丰富性与精确性难以平衡的问题。这些发现为改进和部署T2I模型进行科学插图生成提供了实用指导。基准数据、原子集标注和评估代码将由我们发布。

英文摘要

Scientific illustrations are essential tools for communicating research findings, especially in natural science, where they visualize complex concepts and processes. As Text-to-Image (T2I) models become increasingly capable, researchers have started to use them for scientific illustration generation. However, existing benchmarks often assess outputs at a holistic level, overlooking fine-grained elements, while scientific reasoning ability and output conciseness remain under-quantified. We introduce FEPBench, a benchmark built from carefully selected high-quality scientific illustrations across multiple disciplines and layout types. With the assistance of multimodal large language models (MLLMs) and human experts, we provide fine-grained atom set annotations and systematically evaluate T2I models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. Our evaluation further decomposes model performance across visual, textual, relation, and layout elements. Results show that even state-of-the-art (SOTA) closed-source models, such as GPT Image 2 and Nano Banana Pro, still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision. These findings provide practical guidance for improving and deploying T2I models in scientific illustration generation. Benchmark data, atom set annotations, and evaluation code will be released.

2606.06002 2026-06-08 cs.CV 版本更新

Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

面向文本到3D室内场景生成的视觉-语言模型中的全局-局部蒙特卡洛树搜索

Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma

发表机构 * State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China(网络与交换技术国家重点实验室,北京邮电大学)

AI总结 提出一种全局-局部蒙特卡洛树搜索方法,通过分层场景表示和PRM引导的MCTS解决文本到3D室内场景生成中的错误传播问题,并构建新基准数据集3DTindo-bench。

详情
AI中文摘要

大型视觉-语言模型在各种任务中取得了显著的推理性能。然而,关于使用LVLM进行文本到3D室内场景生成的研究很少。主要挑战在于,现有的基于LVLM的方法采用思维链顺序决策机制,无法修正早期决策,导致错误传播。在本文中,我们将该任务视为一个受空间和布局常识约束的规划问题。为解决此问题,我们将其建模为具有全局树和局部树的树搜索问题,这与现有的顺序决策方法不同。在全局树中,我们迭代地放置每个对象,并像人类布置房间一样探索多种尝试,其中问题空间表示为树。为了有效搜索树,我们提出了一种分层场景表示和PRM引导的MCTS方法。分层表示将场景抽象为房间级别、区域级别、地板对象级别和支撑对象级别。PRM引导的MCTS方法使用PRM剪枝不必要的分支,并使用MCTS算法平衡探索和利用,以更少的尝试获得最优解。在局部树中,它进一步将每个对象的放置分解为更细的子步骤,包括具体的放置参数。为了使场景整体外观一致,我们利用预训练的扩散图像生成模型为场景中的所有对象预测纹理。由于现有的文本到3D室内场景生成基准在规模和多样性上仍然有限,我们收集了一个新的大规模多样化数据集,包含65种场景类型和3,250条指令,具有不同的尺寸、布局和风格,命名为3DTindo-bench,以更好地评估最先进模型的能力。我们的实验表明,我们的方法比最先进的方法生成更逼真的3D场景。

英文摘要

Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mechanisms that cannot revise earlier decisions, causing error propagation. In this paper, we consider the task as a planning problem constrained by spatial and layout commonsense. To solve this problem, we model it as a tree search problem with global and local trees, which differs from existing sequential decision-making approaches. In the global tree, we place each object iteratively and explore multiple attempts like humans furnishing a room, where the problem space is represented as a tree. To effectively search the tree, we propose a hierarchical scene representation and a PRM-guided MCTS method. This representation abstracts a scene into room level, region level, floor object level, and supported object level. The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation to get an optimal solution with fewer attempts. In the local tree, it further decomposes the placement of each object into finer sub-steps, including the specific placement parameters. To make the whole appearance of the scene consistent, we leverage pre-trained diffusion image generative models to predict textures for all the objects in the scene. As existing benchmarks for text-to-3D indoor scene generation remain limited in scale and diversity, we collect a new large-scale diverse dataset that contains 65 scene types and 3250 instructions with diverse sizes, layouts, and styles, named 3DTindo-bench, to better assess the capability of the state-of-the-art models. Our experiments show that our method generates more realistic 3D scenes than state-of-the-art methods.
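The search step described above (prune with a PRM, then balance exploration and exploitation with MCTS) can be sketched with the standard UCB1 selection rule. The abstract does not say which selection formula or pruning criterion the authors use, so the UCB1 constant `c`, the `prune_below` threshold, and the placement names are all hypothetical.

```python
import math


def ucb1(total_value, visits, parent_visits, c=1.4):
    """Standard UCB1 score: mean value plus an exploration bonus."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children, parent_visits, prune_below=None):
    """Pick the child with the highest UCB1 score among those that survive pruning.

    Each child is a dict with 'name', 'value', 'visits' and 'prm_score'.
    Pruning by a fixed PRM-score threshold is an assumption of this sketch.
    """
    candidates = [
        ch for ch in children
        if prune_below is None or ch["prm_score"] >= prune_below
    ]
    return max(candidates, key=lambda ch: ucb1(ch["value"], ch["visits"], parent_visits))


# Hypothetical candidate placements for one object in the global tree.
children = [
    {"name": "sofa_by_wall", "value": 3.0, "visits": 5, "prm_score": 0.9},
    {"name": "sofa_in_doorway", "value": 0.0, "visits": 0, "prm_score": 0.1},
    {"name": "sofa_center", "value": 1.0, "visits": 1, "prm_score": 0.6},
]

with_pruning = select_child(children, parent_visits=6, prune_below=0.3)
without_pruning = select_child(children, parent_visits=6)
```

Without pruning, the unvisited doorway placement is expanded first; with the assumed threshold it is discarded before any rollout is spent on it, which is the "fewer attempts" effect the abstract describes.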

2606.06042 2026-06-08 cs.CV 版本更新

LoomVideo: Unifying Multimodal Inputs into Video Generation and Editing

LoomVideo: 统一多模态输入到视频生成与编辑

Jianzong Wu, Hao Lian, Jiongfan Yang, Dachao Hao, Ye Tian, Yunhai Tong, Jingyuan Zhu, Biaolong Chen, Qiaosong Qi, Aixi Zhang, Wanggui He, Mushui Liu, Jinlong Liu, Pipei Huang, Hao Jiang

发表机构 * Peking University(北京大学) Alibaba Group(阿里巴巴集团)

AI总结 提出LoomVideo,一种5B参数的高效统一架构,通过多模态大语言模型和零开销Scale-and-Add条件机制,实现视频生成与编辑,显著降低计算复杂度并加速推理。

详情
AI中文摘要

开发能够解释交错多模态输入的统一视频生成和编辑模型是一个有前景但充满挑战的前沿领域。现有的统一框架主要依赖大规模模型(通常为13B参数或更多),并通过拼接序列令牌来引入源视频条件进行编辑。这种拼接不可避免地使序列长度加倍,使自注意力机制的计算复杂度翻两番,带来难以承受的开销。为解决这些瓶颈,我们提出了LoomVideo,一种高效5B参数的统一架构,用于视频生成和编辑。LoomVideo用多模态大语言模型(MLLM)替换标准文本编码器,并采用Deepstack注入机制将多层MLLM特征与扩散变换器(DiT)对齐。关键地,我们引入了一种零开销的Scale-and-Add条件方法用于视频编辑。通过缩放并直接将干净源视频潜变量加到带噪目标潜变量上,这种优雅的设计消除了令牌拼接的需要,大幅降低计算成本,同时保持对复杂非刚性编辑的强大能力。此外,无缝集成了负时间RoPE策略以处理多个参考图像。大量实验表明,我们紧凑的5B模型在全面基准测试中达到了最先进或极具竞争力的性能,在电商和时尚生成场景中展现出卓越优势。得益于零开销条件机制,LoomVideo在推理速度上比类似能力的模型至少快5.41倍,为高度实用和高效的视频基础模型铺平了道路。

英文摘要

Developing unified video generation and editing models capable of interpreting interleaved multimodal inputs is a promising yet challenging frontier. Existing unified frameworks predominantly rely on massive models (typically 13B parameters or more) and incorporate source video conditions for editing by concatenating sequence tokens. This concatenation inevitably doubles the sequence length, quadrupling the computational complexity of the self-attention mechanism and introducing prohibitive overhead. To address these bottlenecks, we present LoomVideo, a highly efficient 5B-parameter unified architecture for both video generation and editing. LoomVideo replaces the standard text encoder with a Multimodal Large Language Model (MLLM) and employs a Deepstack injection mechanism to align multi-layer MLLM features with the Diffusion Transformer (DiT). Crucially, we introduce a zero-overhead Scale-and-Add conditioning approach for video editing. By scaling and directly adding the clean source video latent to the noised target latent, this design eliminates the need for token concatenation, drastically reducing computational cost while maintaining robust capabilities for complex, non-rigid edits. Furthermore, a Negative Temporal RoPE strategy is integrated to handle multiple reference images. Extensive experiments demonstrate that our compact 5B model achieves state-of-the-art or highly competitive performance across comprehensive benchmarks, with particular strength in e-commerce and fashion generation scenarios. Benefiting from the zero-overhead conditioning mechanism, LoomVideo achieves at least a 5.41x acceleration in inference speed compared to models of similar capabilities, paving the way for highly practical and efficient video foundation models.
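The cost argument in this abstract can be checked with a few lines of arithmetic: self-attention scores every query-key pair, so doubling the sequence length quadruples the pair count, while an element-wise add leaves the length unchanged. The sketch below assumes a single scalar `scale` and toy three-element latents; how LoomVideo actually chooses the scale and shapes its latents is not stated in the abstract.

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key pairs one self-attention head scores."""
    return seq_len * seq_len


def scale_and_add(noised_target, clean_source, scale):
    """Element-wise conditioning: target + scale * source, same length as the target."""
    if len(noised_target) != len(clean_source):
        raise ValueError("target and source latents must have the same length")
    return [t + scale * s for t, s in zip(noised_target, clean_source)]


n = 1024  # hypothetical number of video latent tokens

# Token concatenation: the source tokens are appended, so attention sees 2n tokens.
concat_cost = attention_pairs(2 * n)

# Scale-and-Add: the source is folded into the target, so attention sees n tokens.
add_cost = attention_pairs(n)

ratio = concat_cost // add_cost

# Toy latents; scale=0.5 is an arbitrary illustrative value.
conditioned = scale_and_add([0.5, -1.0, 2.0], [1.0, 1.0, -2.0], scale=0.5)
```

Here `ratio` is 4 regardless of `n`, which is the quadrupling the abstract refers to.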

2606.06048 2026-06-08 cs.CV 版本更新

LLM-Conditioned Synthesis of Pathological Gaits via Structured Gait-Language Representations

基于结构化步态-语言表示的LLM条件病理步态合成

Mritula Chandrasekaran, Sanket Kachole, Jarek Francik, Dimitrios Makris

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Toronto(多伦多大学) MIT Media Lab(麻省理工学院媒体实验室)

AI总结 提出一种多模态LLM引导框架,通过结构化文本描述合成病理步态3D数据,利用运动标记化、病理感知语言条件、LLM语义增强和语言到步态生成,改善下游分类性能。

Comments Accepted at CVPR MOMA Workshop 2026 and selected for spotlight presentation at the workshop

详情
AI中文摘要

由于隐私、招募、成本和运动变异性,病理步态数据集仍然稀缺。我们的工作提出了一个多模态LLM引导框架,用于从结构化文本描述中合成病理感知的3D步态数据。该方法为病理步态分类任务生成固定长度的合成骨架步态序列。该框架结合了运动标记化、病理感知语言条件、基于LLM的语义增强和语言到步态生成。一个关键贡献是提出的病理标记器,旨在在离散表示学习期间保留病理特定的运动特征。实验表明,当与真实数据结合时,所提出的合成序列改善了循环分类器的下游分类。最佳结果是在留一受试者协议下,使用真实和合成样本训练的GRU分类器,达到92.77%的准确率。

英文摘要

Pathological gait datasets remain scarce due to privacy, recruitment, cost, and movement variability. Our work presents a multimodal LLM-guided framework for pathology-aware 3D gait data synthesis from structured textual descriptions. The proposed method generates fixed-length synthetic skeleton-based gait sequences for pathological gait classification tasks. The framework combines motion tokenisation, pathology-aware language conditioning, LLM-based semantic augmentation, and language-to-gait generation. A key contribution is the proposed pathological tokeniser, which is designed to preserve pathology-specific motion characteristics during discrete representation learning. Experiments suggest that the proposed synthetic sequences improve downstream classification for recurrent classifiers when combined with real data. The best result is obtained using a GRU classifier trained with real and synthetic samples, achieving 92.77% accuracy under a leave-one-subject-out protocol.
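The leave-one-subject-out protocol mentioned above holds out all sequences of one subject per fold, so the classifier is never tested on a person it saw during training. A minimal sketch of the split follows; the subject labels are hypothetical, and the comment about where synthetic samples go reflects an assumption, since the abstract does not describe how they are mixed into each fold.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Hypothetical dataset: five gait sequences recorded from three subjects.
subject_ids = ["s1", "s1", "s2", "s3", "s3"]

folds = list(leave_one_subject_out(subject_ids))

# Assumption: synthetic sequences would be appended to the training indices
# only, never to the test indices, so accuracy is measured on real data.
```

Each subject appears in exactly one test fold, and reported accuracy is typically the average over folds.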

2606.06224 2026-06-08 cs.CV cs.LG 版本更新

Symb-xMIL: Symbolic Explanations for Multiple Instance Learning in Digital Pathology

Symb-xMIL: 数字病理学中多实例学习的符号解释

Yanqing Luo, Julius Hense, Niklas Prenißl, Andreas Mock, Klaus-Robert Müller, Thomas Schnake, Mina Jamshidi Idaji

发表机构 * Berlin Institute for the Foundations of Learning and Data(柏林学习与数据基础研究院) Machine Learning Group, Technische Universität Berlin(柏林工业大学机器学习组) Institute of Pathology, Charité Universitätsmedizin(夏里特医学院病理学研究所) Berlin Institute of Health at Charité – Universitätsmedizin Berlin, BIH Biomedical Innovation Academy, BIH Charité Digital Clinician Scientist Program(柏林夏里特医学院柏林健康研究院、BIH生物医学创新学院、BIH夏里特数字临床科学家项目) Institute of Pathology, Ludwig Maximilian University of Munich(慕尼黑路德维希-马克西米利安大学病理学研究所) Division of Translational Medical Oncology, DKFZ(德国癌症研究中心(DKFZ)转化医学肿瘤学部) German Cancer Consortium (DKTK), partner site Munich, a partnership between DKFZ and Ludwig-Maximilians-Universität München (LMU)(德国癌症联盟(DKTK)慕尼黑合作站点,由DKFZ与慕尼黑路德维希-马克西米利安大学(LMU)合作组建) Department of Artificial Intelligence, Korea University(高丽大学人工智能系) Max-Planck Institute for Informatics, Saarbrücken, Germany(马克斯·普朗克信息学研究所,德国萨尔布吕肯) Department of Chemistry, Chemical Physics Theory Group, University of Toronto(多伦多大学化学系化学物理理论组) Vector Institute for Artificial Intelligence, Toronto, Canada(向量人工智能研究所,加拿大多伦多) Acceleration Consortium, University of Toronto(多伦多大学加速联盟)

AI总结 提出Symb-xMIL框架,通过量化模型行为与可读决策规则(逻辑关系)的对齐程度,为多实例学习提供结构化的符号解释,并在合成和真实病理数据上验证其有效性。

Comments 23 pages, 18 figures

详情
AI中文摘要

多实例学习(MIL)模型的解释被广泛用于数字组织病理学的验证和发现。现有方法主要依赖于突出显示影响区域的热力图,但不解释如何将不同组织区域的证据组合以产生预测。这限制了可解释性,尤其是当决策依赖于组织特征之间的交互时。我们引入了符号可解释MIL(Symb-xMIL),一种事后解释框架,量化MIL模型的行为与人类可读决策规则(表示为输入特征之间的逻辑关系,如AND、OR、NOT)的对齐程度。这些对齐分数揭示了模型预测背后的语义模式。我们在合成和真实世界的组织病理学数据集上评估了Symb-xMIL。在合成MIL数据上,Symb-xMIL可靠地恢复了真实逻辑规则。在临床肿瘤检测任务中,最佳对齐的规则揭示了异质决策模式并暴露了隐藏的模型错误。在TCGA-HNSCC(头颈癌队列)的HPV预测任务中,我们的框架在HPV状态之外细化了患者生存分层,具有潜在的临床相关性。总体而言,Symb-xMIL将MIL的可解释性从视觉归因扩展到结构化的、基于规则的推理,实现了对模型预测更透明和基于语义的解释。

英文摘要

Explanations of multiple instance learning (MIL) models are widely used for validation and discovery in digital histopathology. Existing methods primarily rely on heatmaps that highlight influential regions but do not explain how evidence from different tissue regions is combined to produce a prediction. This limits interpretability, especially when decisions depend on interactions between tissue features. We introduce Symbolic explainable MIL (Symb-xMIL), a post-hoc explanation framework that quantifies how a MIL model's behavior aligns with human-readable decision rules, expressed as logical relationships (e.g., AND, OR, NOT) between input features. These alignment scores reveal semantic patterns underlying the model's predictions. We evaluate Symb-xMIL on synthetic and real-world histopathology datasets. On synthetic MIL data, Symb-xMIL reliably recovers ground-truth logical rules. In a clinical tumor detection task, the best-aligned rules uncover heterogeneous decision patterns and expose hidden model errors. On an HPV-prediction task on TCGA-HNSCC, a cohort of head and neck cancer, our framework refines patient survival stratification beyond HPV status with potential clinical relevance. Overall, Symb-xMIL extends MIL explainability beyond visual attribution toward structured, rule-based reasoning, enabling more transparent and semantically grounded interpretation of model predictions.
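To make the idea of "alignment with a logical rule" concrete, the sketch below scores candidate rules by how often they agree with a model's binary predictions. This is a simplified stand-in and not the paper's alignment score: the agreement-fraction metric, the two concepts `a` and `b`, and the example predictions are all assumptions for illustration.

```python
# Candidate human-readable rules over two hypothetical tissue concepts a and b.
RULES = {
    "A AND B": lambda a, b: a and b,
    "A OR B": lambda a, b: a or b,
    "A AND NOT B": lambda a, b: a and not b,
}


def alignment(rule, concepts, predictions):
    """Fraction of bags where the rule's output equals the model's prediction."""
    hits = sum(rule(a, b) == p for (a, b), p in zip(concepts, predictions))
    return hits / len(predictions)


# Hypothetical bags: which concepts are present, and what the MIL model predicted.
concepts = [(True, True), (True, False), (False, True), (False, False)]
predictions = [True, False, False, False]

scores = {name: alignment(rule, concepts, predictions) for name, rule in RULES.items()}
best_rule = max(scores, key=scores.get)
```

In this toy case the model fires only when both concepts co-occur, so the conjunction aligns perfectly while the other rules agree on half the bags; a heatmap alone would highlight both concepts without revealing that they must appear together.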

2203.07904 2026-06-08 eess.IV cs.CV cs.LG 版本更新

Unsupervised Learning Based Focal Stack Camera Depth Estimation

基于无监督学习的焦堆相机深度估计

Zhengyu Huang, Weizhi Du, Theodore B. Norris

发表机构 * Center for Ultrafast Optical Science, University of Michigan(超快光学科学中心,密歇根大学) University of Michigan(密歇根大学)

AI总结 提出一种基于无监督深度学习的方法,从焦堆相机图像估计深度,在NYU-v2数据集上相比单图像方法显著提高精度。

详情
Journal ref in Conference on Lasers and Electro-Optics, Technical Digest Series (Optica Publishing Group, 2022), paper JW3A.5
AI中文摘要

我们提出一种基于无监督深度学习的方法,从焦堆相机图像估计深度。在NYU-v2数据集上,我们的方法相比基于单图像的方法实现了更好的深度估计精度。

英文摘要

We propose an unsupervised deep learning based method to estimate depth from focal stack camera images. On the NYU-v2 dataset, our method achieves much better depth estimation accuracy compared to single-image based methods.

2403.05532 2026-06-08 cs.LG cs.CV 版本更新

Twin: Tuning Learning Rate and Weight Decay of Deep Homogeneous Classifiers without Validation

Twin: 无需验证的深度同质分类器学习率和权重衰减调优

Lorenzo Brigato, Stavroula Mougiakakou

发表机构 * ARTORG Center, University of Bern(伯尔尼大学ARTORG中心)

AI总结 提出Twin方法,利用同质网络的间隔最大化动态和训练-测试损失间的经验缩放定律,实现无需验证集的学习率和权重衰减调优,在37个图像分类配置上相对于Oracle基线的平均绝对误差为1.28%。

Comments Accepted at TMLR

详情
AI中文摘要

我们介绍了Tune without Validation (Twin),一种简单有效的流程,用于在没有验证集的情况下调优同质分类器的学习率和权重衰减,无需预留数据,也避免了两步过程。Twin利用了同质网络的间隔最大化动态,以及一条将不同超参数配置下的训练损失与测试损失联系起来的经验缩放定律。这种数学建模产生了一个依赖于所处区间(regime)的、无需验证的选择规则:在不可分区间,训练损失与测试损失呈单调关系,因此可以预测泛化;而在可分区间,由于间隔最大化,参数范数成为泛化的可靠指标。在37个图像分类的数据集-架构配置中,我们证明Twin相对于使用测试准确率选择超参数的Oracle基线,平均绝对误差为1.28%。我们展示了Twin在验证数据稀缺(如小数据场景)或收集困难且成本高昂(如医学成像)的场景中的优势。代码可在 https://github.com/lorenzobrigato/twin 获取。

英文摘要

We introduce Tune without Validation (Twin), a simple and effective pipeline for tuning learning rate and weight decay of homogeneous classifiers without validation sets, eliminating the need to hold out data and avoiding the two-step process. Twin leverages the margin-maximization dynamics of homogeneous networks and an empirical scaling law that links training and test losses across hyper-parameter configurations. This mathematical modeling yields a regime-dependent, validation-free selection rule: in the non-separable regime, training loss is monotonic in test loss and therefore predictive of generalization, whereas in the separable regime, the parameters' norm becomes a reliable indicator of generalization due to margin maximization. Across 37 dataset-architecture configurations for image classification, we demonstrate that Twin achieves a mean absolute error of 1.28% compared to an Oracle baseline that selects HPs using test accuracy. We demonstrate Twin's benefits in scenarios where validation data is scarce, such as small-data regimes, or difficult and costly to collect, as in medical imaging. Code available at https://github.com/lorenzobrigato/twin.

2406.05670 2026-06-08 cs.LG cs.CR cs.CV 版本更新

Certified Robustness to Data Poisoning in Gradient-Based Training

基于梯度的训练中对数据投毒的认证鲁棒性

Philip Sosnin, Mark N. Müller, Maximilian Baader, Calvin Tsay, Matthew Wicker

发表机构 * Department of Computing, Imperial College London, United Kingdom(伦敦帝国理工学院计算系) Department of Computer Science, ETH Zurich, Switzerland(苏黎世联邦理工学院计算机科学系) LogicStar.ai, Switzerland(LogicStar.ai公司) The Alan Turing Institute, United Kingdom(艾伦·图灵研究所)

AI总结 提出首个框架,通过凸松弛过度近似参数更新集,为梯度下降训练的模型提供针对无目标、有目标投毒和后门攻击的可证明鲁棒性保证。

Comments 21 pages, 8 figures

详情
AI中文摘要

现代机器学习流程利用大量公共数据,使得保证数据质量变得不可行,并使模型容易受到投毒和后门攻击。在攻击下可证明地约束模型行为仍然是一个开放问题。在这项工作中,我们通过开发第一个框架来应对这一挑战,该框架在不修改模型或学习算法的情况下,为使用可能被操纵的数据训练的模型的行为提供可证明的保证。特别是,我们的框架针对训练输入和标签的有界和无界操纵,认证了对无目标和有目标投毒以及后门攻击的鲁棒性。我们的方法利用凸松弛来过度近似给定投毒威胁模型下所有可能的参数更新集,从而允许我们为任何基于梯度的学习算法约束所有可达参数的集合。给定这个参数集,我们提供了最坏情况行为的界限,包括模型性能和后门成功率。我们在多个真实世界数据集上展示了我们的方法,这些数据集来自能源消耗、医学成像和自动驾驶等应用。

英文摘要

Modern machine learning pipelines leverage large amounts of public data, making it infeasible to guarantee data quality and leaving models open to poisoning and backdoor attacks. Provably bounding model behavior under such attacks remains an open problem. In this work, we address this challenge by developing the first framework providing provable guarantees on the behavior of models trained with potentially manipulated data without modifying the model or learning algorithm. In particular, our framework certifies robustness against untargeted and targeted poisoning, as well as backdoor attacks, for bounded and unbounded manipulations of the training inputs and labels. Our method leverages convex relaxations to over-approximate the set of all possible parameter updates for a given poisoning threat model, allowing us to bound the set of all reachable parameters for any gradient-based learning algorithm. Given this set of parameters, we provide bounds on worst-case behavior, including model performance and backdoor success rate. We demonstrate our approach on multiple real-world datasets from applications including energy consumption, medical imaging, and autonomous driving.

2509.22685 2026-06-08 eess.IV cs.CV cs.GR 版本更新

VIRTUS-FPP: Virtual Sensor Modeling for Fringe Projection Profilometry in NVIDIA Isaac Sim

VIRTUS-FPP:NVIDIA Isaac Sim中条纹投影轮廓测量的虚拟传感器建模

Adam Haroon, Anush Lakshman, Badrinath Balasubramaniam, Beiwen Li

发表机构 * Department of Mechanical Engineering, Iowa State University(爱荷华州立大学机械工程系) College of Engineering, University of Georgia(佐治亚大学工程学院)

AI总结 提出VIRTUS-FPP,首个在NVIDIA Isaac Sim中实现的端到端虚拟传感器建模框架,用于条纹投影轮廓测量,实现物理保真模拟,无需预校准物理系统,支持亚毫米级重建精度。

Comments 10 pages, 13 figures, accepted for publication in IEEE Sensors Journal

详情
AI中文摘要

条纹投影轮廓测量(FPP)是一种用于3D表面重建的高精度结构光传感技术,但其实际部署常受限于复杂的校准程序、对环境条件的敏感性以及物理实验的高成本。同时,机器人研究日益依赖如NVIDIA Isaac Sim等仿真平台进行可扩展的开发与验证,但目前缺乏FPP等光学计量传感器的精确虚拟表示。本文提出VIRTUS-FPP,这是首个在NVIDIA Isaac Sim中实现的用于条纹投影轮廓测量的端到端虚拟传感器建模框架,能够对完整的FPP流程(包括结构光投影、图像形成、校准和3D重建)进行物理保真模拟,且无需依赖预校准的物理系统。该框架利用逆相机模型表示投影仪,确保了几何和光度保真度与结构光原理一致。通过连接光学计量与机器人仿真,VIRTUS-FPP实现了高保真合成数据生成、传感流程的系统评估以及真实世界FPP系统的数字孪生复制。实验结果表明,该框架具有亚毫米级重建精度,且模拟与物理测量之间具有强对应性,突显了其有效性及在推动感知驱动型机器人、仿真到现实迁移以及可扩展光学传感器设计方面的潜力。

英文摘要

Fringe projection profilometry (FPP) is a high-precision structured-light sensing technique for 3D surface reconstruction, yet its practical deployment is often constrained by complex calibration procedures, sensitivity to environmental conditions, and the high cost of physical experimentation. At the same time, robotics research increasingly relies on simulation platforms such as NVIDIA Isaac Sim for scalable development and validation, but accurate virtual representations of optical metrology sensors such as FPP are not currently available. In this work, we present VIRTUS-FPP, the first end-to-end virtual sensor modeling framework for fringe projection profilometry implemented in NVIDIA Isaac Sim, enabling physically grounded simulation of the complete FPP pipeline, including structured light projection, image formation, calibration, and 3D reconstruction, without dependence on pre-calibrated physical systems. The framework leverages an inverse camera model for projector representation, ensuring geometric and photometric fidelity consistent with structured-light principles. By bridging optical metrology and robotics simulation, VIRTUS-FPP enables high-fidelity synthetic data generation, systematic evaluation of sensing pipelines, and digital twin replication of real-world FPP systems. Experimental results demonstrate sub-millimeter reconstruction accuracy and strong correspondence between simulated and physical measurements, highlighting the framework's effectiveness and its potential to advance perception-driven robotics, simulation-to-reality transfer, and scalable optical sensor design.
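For readers unfamiliar with the 3D reconstruction stage of an FPP pipeline, the sketch below shows textbook N-step phase shifting, which recovers the wrapped phase at one pixel from N fringe images. The abstract does not state which fringe analysis algorithm VIRTUS-FPP uses, so treating it as N-step phase shifting is an assumption, as are the intensity model parameters `offset` and `amplitude`.

```python
import math


def simulate_fringe_intensities(phase, n_steps, offset=0.5, amplitude=0.4):
    """Intensity at one pixel for each shifted pattern: I_k = A + B*cos(phase + 2*pi*k/N)."""
    return [
        offset + amplitude * math.cos(phase + 2 * math.pi * k / n_steps)
        for k in range(n_steps)
    ]


def wrapped_phase(intensities):
    """Recover the wrapped phase in (-pi, pi] from N >= 3 equally shifted intensities."""
    n = len(intensities)
    if n < 3:
        raise ValueError("N-step phase shifting needs at least three images")
    s = sum(i_k * math.sin(2 * math.pi * k / n) for k, i_k in enumerate(intensities))
    c = sum(i_k * math.cos(2 * math.pi * k / n) for k, i_k in enumerate(intensities))
    # The sums evaluate to -(N*B/2)*sin(phase) and (N*B/2)*cos(phase).
    return math.atan2(-s, c)


true_phase = 0.7  # hypothetical phase at one camera pixel, in radians
images = simulate_fringe_intensities(true_phase, n_steps=4)
recovered = wrapped_phase(images)
```

The offset and amplitude cancel in the ratio, which is why the method tolerates ambient light and surface reflectivity changes; a real pipeline would follow this with phase unwrapping and calibration-based triangulation.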

2510.11014 2026-06-08 cs.RO cs.AI cs.CV 版本更新

MatterDoor: Sampling Zero-shot Spatio-semantic Priors using Generative Models

MatterDoor: 使用生成模型采样零样本空间语义先验

Subhransu S. Bhattacharjee, Hao Lu, Dylan Campbell, Rahul Shome

发表机构 * School of Computing, Australian National University(澳大利亚国立大学计算机学院)

AI总结 针对机器人通过门缝观察时场景结构缺失的问题,提出MatterDoor方法,利用预训练生成模型(VLM引导外推、单目深度估计、语义分割)采样隐藏房间的语义3D点云先验,在Matterport3D基准上验证了零样本空间语义先验的有效性。

Comments Under Review

详情
AI中文摘要

自主机器人通常只能通过门缝部分观察房间,墙壁和场景结构隐藏了安全导航和目标导向行动所需的几何和任务相关语义。我们询问现成的预训练生成视觉模型能否将这些缺失结构作为零样本离线先验用于机器人推理。此类先验应支持对未观察结构的空间语义查询,估计隐藏区域中的目标物体似然以及这些区域被占用的概率。给定一个以自我为中心的RGB观测和目标查询,我们的流程使用VLM引导的外推、单目深度估计和语义分割来采样隐藏房间的语义标注3D点云假设。我们引入了MatterDoor,一个源自Matterport3D的门遮挡室内场景基准,并使用生成指标和模拟Stretch机器人目标到达任务评估所得先验。我们的结果表明,无需针对特定问题微调即可推导出对规划有用的空间语义先验。

英文摘要

Autonomous robots often view rooms only partially, through a doorway, where the walls and scene structure hide the geometry and task-relevant semantics needed for safe navigation and goal-directed action. We ask whether off-the-shelf pretrained generative vision models can derive this missing structure as zero-shot offline priors for robot reasoning. Such priors should support spatio-semantic queries over unobserved structure, estimating the target object likelihood in hidden regions and the probability that those regions are occupied. Given an egocentric RGB observation and target query, our pipeline uses VLM-guided outpainting, monocular depth estimation, and semantic segmentation to sample semantically labeled 3D point cloud hypotheses of the hidden room. We introduce MatterDoor, a Matterport3D-derived benchmark of doorway-occluded indoor scenes, and evaluate the resulting priors with generative metrics and simulated Stretch robot object-reaching tasks. Our results suggest that useful spatio-semantic priors for planning can be derived without problem-specific fine-tuning.

2512.00883 2026-06-08 cs.MM cs.CV cs.SD 版本更新

Audio-Visual World Models: Grounding Multisensory Imagination for Embodied Agents

视听世界模型:为具身智能体奠定多感官想象的基础

Jiahua Wang, Leqi Zheng, Jialong Wu, Yaoxin Mao, Shijie Cheng

发表机构 * Tsinghua University(清华大学) Beijing Institute of Technology(北京理工大学)

AI总结 提出视听世界模型(AVWM)统一框架,通过条件扩散Transformer(AV-CDiT)联合预测双耳音频与视觉动态,在30小时基准AVW-4k上实现高保真多模态预测,并验证其在具身导航中的有效性。

详情
AI中文摘要

世界模型通过模拟环境动态使智能体能够规划和推理未来状态。虽然现有方法主要关注视觉观察,但现实世界的感知本质上涉及多种感觉模态。音频提供了关键的空间和时间线索,如声源定位和声学场景属性,但其整合到世界模型中仍相对未被充分探索。先前的工作尚未建立低层动作控制下视听世界建模的通用公式,也未阐明如何联合捕捉物理上合理的双耳音频和视觉动态。本文提出了视听世界模型(AVWM)的统一公式,将多模态环境模拟建模为具有同步视听观测的部分可观测马尔可夫决策过程。作为解决该问题的基础步骤,我们构建了AVW-4k,一个受控基准数据集,包含30小时的双耳视听轨迹,覆盖76个室内环境并带有动作标注。我们提出了AV-CDiT,一种视听条件扩散Transformer,采用新颖的模态专家架构平衡视觉和听觉学习,通过三阶段训练策略优化以实现有效的多模态整合。在该基准上的大量实验表明,AV-CDiT在视觉和听觉模态上实现了高保真多模态预测。此外,我们验证了其在具身导航中的实际效用,证明AVWM改进了视觉-语言模型引导的智能体在连续视听导航中的表现。

英文摘要

World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains relatively underexplored. Prior work has not established a commonly adopted formulation for audio-visual world modeling under low-level action control or clarified how to jointly capture physically grounded binaural audio and visual dynamics. This work presents a unified formulation of Audio-Visual World Models (AVWM), casting multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations. As a foundational step toward this problem, we construct AVW-4k, a controlled benchmark comprising 30 hours of binaural audio-visual trajectories with action annotations across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments on this benchmark demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities. Furthermore, we validate its practical utility in embodied navigation, demonstrating that AVWM improves a vision-language-model-guided agent in continuous audio-visual navigation.

2512.20963 2026-06-08 cs.LG cs.CV 版本更新

Generalization of Diffusion Models Arises with a Balanced Representation Space

扩散模型的泛化源于平衡表示空间

Zekai Zhang, Xiao Li, Xiang Li, Lianghe Shi, Meng Wu, Molei Tao, Qing Qu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 通过分析两层ReLU去噪自编码器,证明记忆化导致局部尖峰表示,而泛化产生平衡表示,并在真实扩散模型中验证,提出基于表示的检测和编辑方法。

Comments Accepted at ICLR 2026. 40 pages, 19 figures. The first two authors contributed equally

详情
AI中文摘要

扩散模型擅长生成高质量、多样化的样本,但当过度拟合训练目标时,它们有记忆训练数据的风险。我们通过表示学习的视角分析了扩散模型中记忆化和泛化之间的区别。通过研究两层ReLU去噪自编码器(DAE),我们证明了(i)记忆化对应于模型在学习的权重中存储原始训练样本以进行编码和解码,产生局部尖峰表示,而(ii)泛化发生在模型捕获局部数据统计时,产生平衡表示。此外,我们在真实的无条件和文本到图像扩散模型上验证了这些理论发现,表明相同的表示结构出现在深度生成模型中,并具有重要的实际意义。基于这些见解,我们提出了一种基于表示的检测记忆化的方法,以及一种无需训练的编辑技术,通过表示引导实现精确控制。总之,我们的结果强调了学习好的表示对于新颖且有意义的生成建模至关重要。

英文摘要

Diffusion models excel at generating high-quality, diverse samples, yet they risk memorizing training data when overfit to the training objective. We analyze the distinctions between memorization and generalization in diffusion models through the lens of representation learning. By investigating a two-layer ReLU denoising autoencoder (DAE), we prove that (i) memorization corresponds to the model storing raw training samples in the learned weights for encoding and decoding, yielding localized spiky representations, whereas (ii) generalization arises when the model captures local data statistics, producing balanced representations. Furthermore, we validate these theoretical findings on real-world unconditional and text-to-image diffusion models, demonstrating that the same representation structures emerge in deep generative models with significant practical implications. Building on these insights, we propose a representation-based method for detecting memorization and a training-free editing technique that allows precise control via representation steering. Together, our results highlight that learning good representations is central to novel and meaningful generative modeling.

2602.00471 2026-06-08 cs.AI cs.CV 版本更新

Dual Latent Memory for Visual Multi-agent System

面向视觉多智能体系统的双潜在记忆

Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, Shuicheng Yan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出L²-VMAS框架,通过双潜在记忆解耦感知与思考,并采用熵驱动主动触发机制,打破视觉多智能体系统的“扩展墙”,在提升准确率的同时大幅降低令牌消耗。

详情
AI中文摘要

尽管视觉多智能体系统(VMAS)有望通过智能体间协作增强综合能力,但经验证据揭示了一个反直觉的“扩展墙”:增加智能体轮次往往会降低性能,同时指数级增加令牌成本。我们将这一失败归因于以文本为中心的通信中固有的信息瓶颈,其中将感知和思维轨迹转换为离散自然语言不可避免地导致语义损失。为此,我们提出了L²-VMAS,一种新颖的模型无关框架,通过双潜在记忆实现智能体间协作。此外,我们解耦了感知与思考,同时动态合成双潜在记忆。另外,我们引入了熵驱动的主动触发,用高效的按需内存访问取代被动信息传输。在骨干网络、规模和多智能体结构上的大量实验表明,我们的方法有效打破了“扩展墙”,具有卓越的可扩展性,平均准确率提高2.7-5.4%,同时令牌使用量减少21.3-44.8%。

英文摘要

While Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, empirical evidence reveals a counter-intuitive "scaling wall": increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text-centric communication, where converting perceptual and thinking trajectories into discrete natural language inevitably induces semantic loss. To this end, we propose L²-VMAS, a novel model-agnostic framework that enables inter-agent collaboration with dual latent memories. Furthermore, we decouple perception and thinking while dynamically synthesizing the dual latent memories. Additionally, we introduce an entropy-driven proactive triggering mechanism that replaces passive information transmission with efficient, on-demand memory access. Extensive experiments across backbones, sizes, and multi-agent structures demonstrate that our method effectively breaks the "scaling wall" with strong scalability, improving average accuracy by 2.7-5.4% while reducing token usage by 21.3-44.8%.
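One plausible reading of "entropy-driven proactive triggering" is that an agent consults shared memory only when its own predictive distribution is uncertain. The sketch below implements that reading with Shannon entropy and a fixed threshold; which distribution the paper measures and how it sets the trigger are not given in the abstract, so the function names, the threshold of 0.7 nats, and the example distributions are hypothetical.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a discrete distribution (zero-probability terms contribute 0)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_query_memory(probs, threshold):
    """Trigger a memory read only when the agent's prediction is uncertain enough."""
    return shannon_entropy(probs) > threshold


# Hypothetical next-token distributions from one agent.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]

threshold = 0.7  # assumed value, in nats

query_when_confident = should_query_memory(confident, threshold)
query_when_uncertain = should_query_memory(uncertain, threshold)
```

Under this reading, confident steps skip the memory access entirely, which is consistent with the reported drop in token usage.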

2602.01740 2026-06-08 cs.AI cs.CV cs.LG 版本更新

MACD: Model-Aware Contrastive Decoding via Counterfactual Data

MACD:基于反事实数据的模型感知对比解码

Qixin Xiao, Kun Zhou

发表机构 * University of Michigan, Ann Arbor, MI, USA(密歇根大学,安娜堡分校) University of California San Diego, La Jolla, CA, USA(加州大学圣地亚哥分校)

AI总结 提出MACD方法,利用视频语言模型自身反馈识别导致幻觉的目标区域,生成目标级反事实输入,结合对比解码减少幻觉,提升多模型在复杂场景下的准确性。

详情
AI中文摘要

视频语言模型(Video-LLMs)容易产生幻觉,当视觉证据薄弱、模糊或存在偏差时,会生成看似合理但无根据的内容。现有方法如对比解码(CD)依赖随机扰动构建对比数据以缓解幻觉,但往往未能针对驱动幻觉的视觉线索或模型弱点。我们提出基于模型感知反事实数据的对比解码(MACD),这是一种结合模型引导的反事实构建与对比解码的推理策略。MACD利用Video-LLM自身的反馈来识别最可能导致幻觉的目标区域,生成有针对性的目标级反事实输入,而非任意的帧或时间修改。这些反事实输入被整合到CD中,以在解码过程中强制进行基于证据的令牌选择。在EventHallusion、MVBench、Perception-test和Video-MME上的实验表明,MACD在包括Qwen和InternVL在内的多种Video-LLM上持续减少幻觉,同时保持或提高任务准确性,在涉及小目标、遮挡目标或共现目标的场景中尤其表现出显著优势。

英文摘要

Video language models (Video-LLMs) are prone to hallucinations, generating plausible but ungrounded content when visual evidence is weak, ambiguous, or biased. Existing methods, such as contrastive decoding (CD), rely on random perturbations to construct contrastive data for hallucination mitigation, but often fail to target the visual cues that drive hallucination or align with model weaknesses. We propose Model-Aware Counterfactual Data based Contrastive Decoding (MACD), an inference strategy that combines model-guided counterfactual construction with contrastive decoding. MACD uses the Video-LLM's own feedback to identify object regions most responsible for hallucination, generating targeted object-level counterfactual inputs rather than arbitrary frame or temporal modifications. These counterfactual inputs are integrated into CD to enforce evidence-grounded token selection during decoding. Experiments on EventHallusion, MVBench, Perception-test, and Video-MME show that MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs, including Qwen and InternVL, with especially strong gains in scenarios involving small, occluded, or co-occurring objects.
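For readers unfamiliar with contrastive decoding, the sketch below shows one common formulation from the visual contrastive decoding literature, not necessarily the exact rule used by MACD. The weights `alpha` and `beta`, the function names, and the toy logits are assumptions for illustration; in MACD the counterfactual logits would come from the model-guided, object-level counterfactual input.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def contrastive_logits(orig, counterfactual, alpha=1.0, beta=0.1):
    # Amplify what the original input supports and subtract what the
    # counterfactual input still predicts. Tokens that are implausible
    # under the original input (prob < beta * max prob) are masked out.
    probs = softmax(orig)
    cutoff = beta * max(probs)
    scores = []
    for lo, lc, p in zip(orig, counterfactual, probs):
        if p < cutoff:
            scores.append(float("-inf"))
        else:
            scores.append((1 + alpha) * lo - alpha * lc)
    return scores

orig = [2.0, 1.8, -3.0]   # logits given the real video
cf = [2.5, 0.5, -3.0]     # logits given the counterfactual video
scores = contrastive_logits(orig, cf)
best = max(range(len(scores)), key=scores.__getitem__)
print(scores, best)
```

Greedy decoding on `orig` alone would pick token 0, but that token stays likely even when the evidence is removed, so the contrastive score prefers token 1, whose probability depends on the visual evidence.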

2603.02220 2026-06-08 cs.LG cs.AI cs.CV 版本更新

Forecasting as Rendering: A 2D Gaussian Splatting Framework for Time Series Forecasting

预测即渲染:面向时间序列预测的2D高斯泼溅框架

Yixin Wang, Yifan Hu, Peiyuan Liu, Naiqi Li, Tao Dai, Shu-Tao Xia

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院)

AI总结 提出TimeGS框架,将时间序列预测转化为2D高斯泼溅生成渲染,通过各向异性高斯核和连续光栅化解决周期内与周期间的建模问题,实现SOTA性能。

详情
AI中文摘要

时间序列预测仍然是一个具有挑战性的问题,因为周期内波动和周期间趋势的复杂纠缠。尽管最近的进展试图将一维序列重塑为二维周期-相位表示,但它们存在两个主要局限性。首先,将重塑后的张量视为静态图像会导致拓扑不匹配,因为标准空间算子在网格边界处切断了时间连续性。其次,依赖统一的固定大小表示会低效地分配建模能力,并且无法为可压缩的非平稳时间模式提供所需的自适应分辨率。为了解决这些局限性,我们引入了TimeGS,这是一个新颖的框架,从根本上将预测范式从回归转变为二维生成渲染。通过将未来序列重新概念化为潜在的二维时间表面,TimeGS利用高斯核的固有各向异性,以灵活的几何对齐自适应地建模复杂变化。为了实现这一点,我们引入了多基高斯核生成(MB-GKG)块,该块从固定字典中合成核以稳定优化,以及多周期时间连续光栅化(MP-CCR)块,该块在周期边界上强制执行严格的时间连续性。在标准基准数据集上的全面实验表明,TimeGS达到了最先进或具有竞争力的性能。代码位于https://github.com/yixinwang1/TimeGS。

英文摘要

Time series forecasting remains a challenging problem due to the intricate entanglement of intra-period fluctuations and inter-period trends. While recent advances have attempted to reshape 1D sequences into 2D period-phase representations, they suffer from two principal limitations. Firstly, treating reshaped tensors as static images results in a topological mismatch, as standard spatial operators sever chronological continuity at grid boundaries. Secondly, relying on uniform fixed-size representations allocates modeling capacity inefficiently and fails to provide the adaptive resolution required for compressible, non-stationary temporal patterns. To address these limitations, we introduce TimeGS, a novel framework that fundamentally shifts the forecasting paradigm from regression to 2D generative rendering. By reconceptualizing the future sequence as a latent 2D temporal surface, TimeGS utilizes the inherent anisotropy of Gaussian kernels to adaptively model complex variations with flexible geometric alignment. To realize this, we introduce a Multi-Basis Gaussian Kernel Generation (MB-GKG) block that synthesizes kernels from a fixed dictionary to stabilize optimization, and a Multi-Period Chronologically Continuous Rasterization (MP-CCR) block that enforces strict temporal continuity across periodic boundaries. Comprehensive experiments on standard benchmark datasets demonstrate that TimeGS attains state-of-the-art or competitive performance. The code is at https://github.com/yixinwang1/TimeGS.
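The 2D period-phase reshaping mentioned above, and the boundary discontinuity it creates, can be seen in a small sketch. The helper name `to_period_phase` and the toy series are assumptions; this shows only the reshaping step that the abstract criticizes, not the TimeGS rendering pipeline.

```python
def to_period_phase(series, period):
    # Rows index periods, columns index the phase within a period.
    # Trailing samples that do not fill a whole period are dropped.
    usable = len(series) - len(series) % period
    return [series[i:i + period] for i in range(0, usable, period)]

series = list(range(12))          # time steps 0..11
grid = to_period_phase(series, period=4)
for row in grid:
    print(row)

# Steps 3 and 4 are neighbours in time, yet they land at opposite ends
# of rows 0 and 1, so a local 2D operator never sees them together.
last_of_row0, first_of_row1 = grid[0][-1], grid[1][0]
```

This is the topological mismatch the abstract describes: chronological neighbours at period boundaries are far apart on the grid.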

2603.21510 2026-06-08 eess.IV cs.CV 版本更新

Unregistered Spectral Image Fusion: Unmixing, Adversarial Learning, and Recoverability

未配准光谱图像融合:解混、对抗学习与可恢复性

Jiahui Song, Sagar Shrestha, Xiao Fu

AI总结 提出无监督框架,通过耦合光谱解混和潜在空间对抗学习同时超分辨未配准的高光谱和多光谱图像,并首次建立可恢复性理论保证。

详情
AI中文摘要

本文研究一对空间未配准的高光谱图像(HSI)和多光谱图像(MSI)的融合问题,两者覆盖大致重叠区域。HSI提供高光谱但低空间分辨率,而MSI则相反。目标是整合它们的互补信息,以提升HSI空间分辨率和MSI光谱分辨率。虽然高光谱-多光谱融合(HMF)已被广泛研究,但未配准设置仍然具有挑战性。许多现有方法仅关注MSI超分辨,而保持HSI不变。监督深度学习方法被提出用于HSI超分辨,但依赖于准确的训练数据,这通常不可用。此外,理论分析主要处理已配准情况,导致未配准HMF理解不足。本文提出一种无监督框架,同时超分辨MSI和HSI。该方法将用于MSI超分辨的耦合光谱解混与用于HSI超分辨的潜在空间对抗学习相结合。在合理的生成模型下,建立了超分辨MSI和HSI可恢复性的理论保证——据我们所知,这是首次为未配准HMF提供此类见解。该方法在半真实和真实HSI-MSI对的不同条件下得到验证。

英文摘要

This paper addresses the fusion of a spatially unregistered pair of images, one hyperspectral (HSI) and one multispectral (MSI), covering roughly overlapping regions. HSIs offer high spectral but low spatial resolution, while MSIs provide the opposite. The goal is to integrate their complementary information to enhance both HSI spatial resolution and MSI spectral resolution. While hyperspectral-multispectral fusion (HMF) has been widely studied, the unregistered setting remains challenging. Many existing methods focus solely on MSI super-resolution, leaving the HSI unchanged. Supervised deep learning approaches have been proposed for HSI super-resolution, but they rely on accurate training data, which is often unavailable. Moreover, theoretical analyses largely address the co-registered case, leaving unregistered HMF poorly understood. In this work, an unsupervised framework is proposed to simultaneously super-resolve both MSI and HSI. The method integrates coupled spectral unmixing for MSI super-resolution with latent-space adversarial learning for HSI super-resolution. Theoretical guarantees on the recoverability of the super-resolved MSI and HSI are established under reasonable generative models -- providing, to our best knowledge, the first such insights for unregistered HMF. The approach is validated on semi-real and real HSI-MSI pairs across diverse conditions.

2603.24576 2026-06-08 cs.RO cs.AI cs.CV 版本更新

Chameleon: Control-Indexed Prospective Memory for Visuomotor Manipulation

Chameleon: 用于视觉运动操控的索引控制前瞻记忆

Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Yuhang Han, Ying Sun, Yang Xiao, Jianfei Yang

发表机构 * MARS Lab, Nanyang Technological University(南洋理工大学MARS实验室) Institute for Infocomm Research, A*STAR, Singapore(新加坡A*STAR信息通信研究院) National University of Singapore(新加坡国立大学)

AI总结 提出Chameleon策略,通过索引控制前瞻记忆解决观察-动作延迟问题,在Camo-Dataset上决策成功率从22.5%提升至80.8%,并在多个基准上达到最优。

Comments Code is available at https://github.com/gxyes/MARS_Chameleon

详情
AI中文摘要

机器人常常在观察到某个信息后很久才执行相应的动作。例如,在藏球游戏中,机器人首先看到哪个杯子藏有球,观察杯子移动,然后才需要选择正确的杯子。仅凭最后的观察不足以做出决策:正确的动作依赖于更早的事件。我们将这种时间间隔称为观察-动作延迟。它使得记忆成为一个策略面对的问题:策略必须保持相似历史记录的可区分性,检索与当前决策相关的过去事件,并将该回忆转换为动作就绪状态。我们将这些需求称为可分离性、可寻址性和前瞻性。我们引入了Chameleon,一个约60M参数的视觉运动策略,用于索引控制的前瞻记忆。Chameleon写入具身事件记忆,保留可分离的历史记录,检索控制相关的痕迹,并训练生成的工作状态具有前瞻性。我们还引入了Camo-Dataset,这是一个真实机器人基准,通过使决策场景视觉模糊来隔离观察-动作延迟,从而必须从早期观察中推断正确动作。Chameleon在Camo-Dataset上将决策/端到端成功率从22.5%/21.3%提高到80.8%/71.3%。在公开的长时记忆基准上,它在LIBERO-10上达到87.1% ± 0.8%,在MemoryBench上达到97.3% ± 4.5%,在MIKASA-Robo上达到75.1% ± 1.4%,在相同规模模型中达到最先进水平,并在报告协议下超过多个更大的VLA基线。探针和消融实验表明,Chameleon学习了可分离、可寻址和前瞻的记忆,并且这些特性驱动了其性能提升。

英文摘要

Robots often observe information that determines a future action long before that action is executed. In a shell game, for example, a robot first sees which cup hides the ball, watches the cups move, and only later needs to choose the correct cup. The final observation alone is not enough for a decision: the correct action depends on an earlier event. We refer to this temporal gap as observation-action delay. It makes memory a policy-facing problem: a policy must keep similar histories distinct, retrieve the past event relevant to the current decision, and convert that recall into an action-ready state. We call these requirements separability, addressability, and prospectiveness. We introduce Chameleon, a ~60M-parameter visuomotor policy for control-indexed prospective memory. Chameleon writes embodied event memory, preserves separable histories, retrieves control-relevant traces, and trains the resulting working state to be prospective. We also introduce Camo-Dataset, a real-robot benchmark that isolates observation-action delay by making the decision scene visually ambiguous, so the correct action must be inferred from earlier observations. Chameleon improves decision/end-to-end success on Camo-Dataset from 22.5%/21.3% to 80.8%/71.3%. On public long-horizon memory benchmarks, it achieves 87.1% +/- 0.8% on LIBERO-10, 97.3% +/- 4.5% on MemoryBench, and 75.1% +/- 1.4% on MIKASA-Robo, setting the state of the art for same-size models and exceeding multiple larger VLA baselines under the reported protocols. Probes and ablations show that Chameleon learns separable, addressable, and prospective memory, and that these properties drive its performance gains.

2606.01072 2026-06-08 cs.RO cs.CV 版本更新

Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs

利用场景图扩展机器人模仿学习的时空上下文

Jianing Qian, Qinhe Peng, Emmanuel Panov, Leonor Fermoselle, Dinesh Jayaraman, Bernadette Bucher, Tarik Kelestemur

发表机构 * University of Pennsylvania(宾夕法尼亚大学) RAI Institute(RAI研究院) University of Michigan(密歇根大学)

AI总结 提出使用场景图作为显式结构化记忆机制,通过动态维护对象中心关系及其时间演化,解决机器人模仿学习中的部分可观测性和长时推理问题。

详情
AI中文摘要

模仿学习使机器人能够通过观察学习如何执行任务。然而,像家庭和办公室这样的真实环境通常由于空间尺度大而严重部分可观测。此外,许多任务涉及执行一系列子任务,要求自主机器人在扩展的时间范围内进行推理。为了解决这些挑战,我们提出在模仿学习中使用场景图作为显式且结构化的记忆机制。通过维护一个动态场景图,捕捉以对象为中心的关系及其随时间的变化,我们的方法允许智能体在任务执行期间保留相关历史上下文,从而有效推理逐步累积的场景信息。我们在模拟移动操作和真实桌面操作上的实验表明,我们的方法显著提高了策略性能,特别是在需要长期推理和在部分可观测性下鲁棒泛化的场景中。

英文摘要

Imitation learning enables robots to learn how to execute tasks via observation. However, real-world environments like homes and offices are often severely partially observed due to their large spatial scales. In addition, many tasks involve executing a series of subtasks requiring autonomous robots to reason over extended time horizons. To address these challenges, we propose using scene graphs as an explicit and structured memory mechanism in imitation learning. By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, our method allows the agent to retain relevant historical context during task execution to efficiently reason over incrementally accrued scene information. Our experiments on simulated mobile manipulation and real-world tabletop manipulation demonstrate that our approach substantially improves policy performance, particularly in settings that demand long-term reasoning and robust generalization under partial observability.

2606.05759 2026-06-08 cs.CV 版本更新

Physics-Guided Deep Unfolding for Blind Cross-Sensor Spectral Super-Resolution via Learning the Spectral Transformation Function

物理引导的深度展开网络用于盲跨传感器光谱超分辨率:通过学习光谱变换函数

Zhaolin Li, Jinsong Chen, Shanxin Guo, Tuo Zhang, Xinglong Zhang, Pan Chen

发表机构 * Center for Geo-Spatial Information, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences(地理信息中心,深圳先进技术研究院,中国科学院) University of Chinese Academy of Sciences(中国科学院大学) Shenzhen Engineering Laboratory of Ocean Environmental Big Data Analysis and Application(深圳海洋环境大数据分析与应用工程实验室)

AI总结 提出一种物理引导的深度展开网络PGU-Net,通过交替优化联合估计高光谱图像和可学习的光谱变换函数,解决盲跨传感器光谱超分辨率问题。

详情
AI中文摘要

高光谱成像为定量遥感提供丰富的光谱信息,然而高光谱传感器成本高昂,因此在许多无人机部署中不可用。光谱超分辨率旨在从多光谱图像重建高光谱图像。大多数现有的SSR方法假设固定且已知的光谱响应函数,因此仅限于单传感器设置。在实际的跨传感器场景中,从HSI到MSI的光谱退化是未知的,并且随传感器特性和场景内容变化,这使得HSI重建病态。本文提出一种物理引导的深度展开网络,称为PGU-Net,通过联合估计HSI和可学习的光谱变换函数来解决盲跨传感器SSR。PGU-Net将交替优化过程展开为端到端可训练的多阶段架构,每个阶段依次更新HSI和STF。两个模块结合了可学习的近端网络和可微的闭式求解器,在保持强表示能力的同时实现物理可解释性。在具有多个SRF的基准数据集(CAVE和NTIRE 2022)上的实验表明,STF(退化算子)的准确恢复以及相对于最先进SSR方法的重建性能提升。此外,在真实无人机跨传感器数据集(Headwall Nano HSI和DJI P4多光谱MSI)上的评估验证了PGU-Net在真正盲条件下的有效性和鲁棒性,并表明估计的STF可能表现出与土地覆盖相关的差异。

英文摘要

Hyperspectral imaging provides rich spectral information for quantitative remote sensing, yet hyperspectral sensors remain costly and thus unavailable in many UAV deployments. Spectral super-resolution (SSR) seeks to reconstruct hyperspectral images (HSIs) from multispectral images (MSIs). Most existing SSR methods assume a fixed and known spectral response function (SRF) and are therefore limited to single-sensor settings. In practical cross-sensor scenarios, the spectral degradation from HSI to MSI is unknown and varies with sensor characteristics and scene content, which renders HSI reconstruction ill-posed. This paper proposes a physics-guided deep unfolding network, termed PGU-Net, to address blind cross-sensor SSR by jointly estimating the HSI and a learnable spectral transformation function (STF). PGU-Net unrolls an alternating optimization procedure into an end-to-end trainable multi-stage architecture, where each stage sequentially updates the HSI and the STF. Both modules combine learnable proximal networks with differentiable closed-form solvers, enabling physical interpretability while retaining strong representation capacity. Experiments on benchmark datasets (CAVE and NTIRE 2022) with multiple SRFs demonstrate accurate recovery of the STF (degradation operator) and improved reconstruction performance over state-of-the-art SSR methods. Furthermore, evaluations on a real UAV cross-sensor dataset (Headwall Nano HSI and DJI P4 Multispectral MSI) verify the effectiveness and robustness of PGU-Net under truly blind conditions, and suggest that the estimated STF may exhibit land-cover-related differences.
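To make the spectral degradation model concrete, the sketch below applies a spectral response matrix to a single hyperspectral pixel and shows why inverting it is ill-posed. The 4-band-to-2-band matrix, the pixel values, and the function name `apply_srf` are illustrative assumptions; they are not taken from PGU-Net or from any real sensor.

```python
def apply_srf(hsi_pixel, srf):
    # srf has one row per MSI band and one column per HSI band;
    # each MSI band is a weighted sum of the HSI bands.
    return [sum(w * x for w, x in zip(row, hsi_pixel)) for row in srf]

# Hypothetical response: each MSI band averages two adjacent HSI bands.
srf = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

pixel_a = [0.2, 0.4, 0.6, 0.8]
pixel_b = [0.4, 0.2, 0.8, 0.6]   # a different spectrum

msi_a = apply_srf(pixel_a, srf)
msi_b = apply_srf(pixel_b, srf)
print(msi_a, msi_b)
```

Both spectra map to approximately the same two-band measurement, so the MSI alone cannot distinguish them. When the response matrix is itself unknown, as in the blind cross-sensor setting, the ambiguity grows, which is why the abstract estimates the HSI and the transformation jointly with learned priors.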

2510.17568 2026-06-08 cs.CV 版本更新

PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation

PAGE-4D: 通过解耦姿态与几何估计实现VGGT-4D感知

Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab, Harvard University(哈佛人工智能与机器人实验室,哈佛大学) Media Lab and Electrical Engineering and Computer Science, Massachusetts Institute of Technology(媒体实验室和电气工程与计算机科学,麻省理工学院) Department of Computing, Imperial College London(计算系,帝国理工学院) Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University(哈佛大学自然与人工智能研究学院)

AI总结 提出PAGE-4D,扩展VGGT到动态场景,通过动态感知聚合器解耦静态与动态信息,同时提升相机姿态估计、深度预测和点云重建性能。

Comments ICLR 2026, VGGT-4D, Dynamic VGGT

详情
AI中文摘要

最近的3D前馈模型,如视觉几何基础变换器(VGGT),在推断静态场景的3D属性方面表现出强大的能力。然而,由于这些模型通常在静态数据集上训练,它们在涉及复杂动态元素的现实场景中(例如移动的人或可变形物体如雨伞)往往表现不佳。为了解决这一限制,我们引入了PAGE-4D,一种将VGGT扩展到动态场景的前馈模型,能够实现相机姿态估计、深度预测和点云重建——全部无需后处理。多任务4D重建的一个核心挑战是任务之间的固有冲突:准确的相机姿态估计需要抑制动态区域,而几何重建则需要对其进行建模。为了解决这一矛盾,我们提出了一种动态感知聚合器,通过预测动态感知掩码来解耦静态和动态信息——抑制姿态估计的运动线索,同时放大几何重建的运动线索。大量实验表明,PAGE-4D在动态场景中始终优于原始VGGT,在相机姿态估计、单目和视频深度估计以及密集点图重建方面取得了更优的结果。必要的代码和额外演示可在链接:https://page4d.github.io/ 获取,包括训练和推理掩码变体以及仅训练掩码变体(=推理时的VGGT架构)。关键词:VGGT-4D,4D感知,动态场景重建。

英文摘要

Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feed-forward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction and point cloud reconstruction - all without post-processing. A central challenge in multitask 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask - suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction. Necessary code and additional demos are available at https://page4d.github.io/, including both the training-and-inference masking variant and the training-only masking variant (= VGGT architecture at inference). Keywords: VGGT-4D, 4D Perception, Dynamic Scene Reconstruction.

2409.13477 2026-06-08 eess.IV cs.CV physics.med-ph 版本更新

A Plug-and-Play Method for Guided Multi-contrast MRI Reconstruction based on Content/Style Modeling

基于内容/风格建模的即插即用式引导多对比度MRI重建方法

Chinmay Rao, Matthias van Osch, Nicola Pezzotti, Jeroen de Bresser, Mark van Buchem, Laurens Beljaards, Jakob Meineke, Elwin de Weerdt, Huangling Lu, Mariya Doneva, Marius Staring

发表机构 * University of Amsterdam(阿姆斯特丹大学) Erasmus University Rotterdam(鹿特丹伊拉斯姆斯大学) Erasmus University Medical Center(伊拉斯姆斯大学医学中心) University of Utrecht(乌得勒支大学)

AI总结 提出一种无需k空间训练数据的模块化即插即用方法PnP-CoSMo,通过内容/风格解耦利用参考扫描引导欠采样对比度重建,在公共和内部数据集上达到或超越端到端方法,并实现更高加速比。

详情
AI中文摘要

由于同一解剖结构的不同MR对比度包含冗余信息,一种对比度可用于引导在同一会话中随后采集的另一种欠采样对比度的重建。为了解决这一利用多对比度侧信息的重建问题,已有多种端到端学习方法被提出。然而,一个关键挑战是需要包含原始k空间数据和配准参考图像的大型配对训练数据集。我们提出了一种模块化的即插即用方法,该方法不需要k空间训练数据,仅依赖于部分配对的图像域数据集。首先学习双对比度MR图像数据的内容/风格模型,随后在迭代重建中作为即插即用算子应用。内容与风格的解耦允许显式表示对比度无关和对比度特定的因素。因此,将先验信息融入重建简化为使用从参考扫描中导出的高质量内容替换估计图像的混叠内容的操作。将该操作与MR数据一致性步骤以及内容估计的校正过程相结合,形成迭代方案。我们将这种新方法命名为PnP-CoSMo。通过设计,它提供了跨对比度的泛化能力,并基于两个给定对比度下的共享和非共享生成因素提供了一个解释框架。我们通过仿真探索了包括可解释性和收敛性在内的多个方面。此外,在公共NYU fastMRI DICOM数据集上展示了其实用性,显示出与端到端方法相当或更优的质量以及更强的泛化能力。在两个内部多线圈数据集上,在给定SSIM下,PnP-CoSMo相比非引导重建实现了高达32.6%的加速。

英文摘要

Since the various MR contrasts of a given anatomy contain redundant information, one contrast can be used to guide the reconstruction of another undersampled contrast acquired subsequently in the same session. To solve this reconstruction problem leveraging multi-contrast side information, several end-to-end learning-based methods have been proposed. However, a key challenge is the requirement for large paired training datasets comprising raw k-space data and aligned reference images. We propose a modular plug-and-play method, which requires no k-space training data and relies solely on partially paired image-domain datasets. A content/style model of two-contrast MR image data is first learned and subsequently applied as a plug-and-play operator in iterative reconstruction. The disentanglement of content and style allows explicit representation of contrast-independent and contrast-specific factors. Consequently, incorporating prior information into the reconstruction reduces to a simple replacement operation on the aliased content of the estimated image using high-quality content derived from the reference scan. Combining this operation with an MR data consistency step, followed by a corrective procedure for the content estimate, yields an iterative scheme. We name this novel approach PnP-CoSMo. It offers, by design, cross-contrast generalizability and provides an explanatory framework based on the shared and non-shared generative factors underlying the two given contrasts. We explore various aspects, including interpretability and convergence, via simulations. Furthermore, its practicality is demonstrated on the public NYU fastMRI DICOM dataset, showing equivalent or superior quality and greater generalizability compared to end-to-end methods. On two in-house multi-coil datasets, PnP-CoSMo enabled up to 32.6% greater acceleration over non-guided reconstruction at given SSIM.

2510.21122 2026-06-08 cs.CV 版本更新

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

NoisyGRPO:通过噪声注入和贝叶斯估计激励多模态CoT推理

Longtian Qiu, Shan Ning, Jiaxuan Sun, Xuming He

发表机构 * ShanghaiTech University(上海科技大学) Shanghai Engineering Research Center of Intelligent Vision and Imaging(上海智能视觉与成像工程研究中心) Lingang Laboratory(临港实验室)

AI总结 NoisyGRPO通过引入可控噪声增强探索并利用贝叶斯框架建模优势估计,提升多模态大语言模型的泛化能力和鲁棒性,尤其在小规模模型上表现突出。

Comments Accepted by NeurIPS 2025. Project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/

详情
Journal ref
Advances in Neural Information Processing Systems 38 (2026) 124239-124267
AI中文摘要

强化学习(RL)在增强多模态大语言模型(MLLMs)的链式推理能力方面展现出潜力。然而,当应用于提升通用链式推理时,现有RL框架往往难以超越训练分布。为此,我们提出NoisyGRPO,一种系统化的多模态RL框架,通过在视觉输入中引入可控噪声以增强探索,并通过贝叶斯框架显式建模优势估计过程。具体而言,NoisyGRPO通过(1)噪声注入探索策略:用高斯噪声扰动视觉输入以鼓励探索更广泛的视觉场景;以及(2)贝叶斯优势估计:将优势估计建模为一个原理性的贝叶斯推断问题,其中注入的噪声水平作为先验,观察到的轨迹奖励作为似然。这种贝叶斯建模融合了两种信息源,以计算轨迹优势的稳健后验估计,有效引导MLLMs偏好视觉支撑的轨迹而非噪声轨迹。在标准链式推理质量、通用能力和幻觉基准测试中,NoisyGRPO显著提高了泛化能力和鲁棒性,尤其是在小规模MLLMs如Qwen2.5-VL 3B的RL设置中。项目页面可在https://artanic30.github.io/project_pages/NoisyGRPO/上获取。

英文摘要

Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/.
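The Bayesian fusion of a prior with an observed reward can be sketched with the textbook conjugate Gaussian update. This is only a generic illustration: the abstract does not specify the distributions NoisyGRPO uses, so the Gaussian forms, the variances, the example numbers, and the function name below are all assumptions.

```python
def gaussian_posterior(prior_mean, prior_var, obs, obs_var):
    # Conjugate update for a Gaussian prior and a Gaussian likelihood:
    # precisions add, and the mean is a precision-weighted average.
    prior_prec = 1.0 / prior_var
    obs_prec = 1.0 / obs_var
    post_var = 1.0 / (prior_prec + obs_prec)
    post_mean = post_var * (prior_prec * prior_mean + obs_prec * obs)
    return post_mean, post_var

# Hypothetical numbers: a heavily noised rollout gets a low prior
# expectation (0.2) but receives a high observed reward (1.0).
post_mean, post_var = gaussian_posterior(
    prior_mean=0.2, prior_var=1.0, obs=1.0, obs_var=1.0
)
print(post_mean, post_var)
```

With equal variances the posterior mean lands midway between the prior and the observation, so a lucky reward on a heavily noised input is discounted rather than taken at face value. Shrinking `prior_var` would pull the estimate further toward the noise-based prior.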

2601.14637 2026-06-08 cs.CV cs.AI cs.CL cs.HC 版本更新

Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis

Forest-Chat: 为交互式森林变化分析适应视觉-语言代理

James Brock, Ce Zhang, Nantheera Anantrasirichai

发表机构 * School of Computer Science, University of Bristol(布里斯托尔大学计算机科学学院) School of Geographical Sciences, University of Bristol(布里斯托尔大学地理科学学院)

AI总结 本文提出Forest-Chat,一种基于LLM的森林变化分析代理,通过多任务处理实现自然语言查询,提升森林变化检测与语义解释的准确性与可解释性。

Comments 28 pages, 9 figures, 12 tables, Submitted to Ecological Informatics

详情
AI中文摘要

高分辨率卫星影像的普及与深度学习的进步为森林监测提供了新机遇。本文提出Forest-Chat,一种基于大语言模型的视觉-语言代理,支持多任务的交互式森林变化分析,包括变化检测、图像描述、对象计数、森林砍伐特征识别和变化推理。Forest-Chat基于多级变化解释(MCI)视觉-语言框架,结合零样本变化检测和多模态零样本变化描述与优化。引入Forest-Change数据集,包含双时相卫星影像、像素级变化掩码和语义变化描述。在Forest-Change数据集上,Forest-Chat在mIoU和BLEU-4指标上达到67.10%和40.17%,在LEVIR-MCI-Trees子集上达到88.13%和34.41%。零样本测试中,其在Forest-Change数据集上达到60.15%和34.00%,在LEVIR-MCI-Trees子集上达到47.32%和18.23%。进一步实验表明,描述优化能注入地理领域知识,但标签域迁移有限。这些发现表明,交互式、基于LLM的系统能支持可访问和可解释的森林变化分析。

英文摘要

The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. This paper introduces Forest-Chat, an LLM-driven agent for forest change analysis, enabling natural language querying across multiple RSICI tasks, including change detection and captioning, object counting, deforestation characterisation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, incorporating zero-shot change detection via AnyChange and multimodal LLM-based zero-shot change captioning and refinement. To support adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and semantic change captions via human annotation and rule-based methods. Forest-Chat achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on Forest-Change, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI. In a zero-shot capacity, it achieves 60.15% and 34.00% on Forest-Change, and 47.32% and 18.23% on LEVIR-MCI-Trees. Further experiments demonstrate the value of caption refinement for injecting geographic domain knowledge into supervised captions, and the system's limited label domain transfer onto JL1-CD-Trees. These findings demonstrate that interactive, LLM-driven systems can support accessible and interpretable forest change analysis.
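Since the results above are reported as mIoU, a minimal sketch of that metric on flattened label maps may help. This is the standard per-class intersection-over-union averaged over classes; the toy predictions and the choice to skip classes absent from both maps are assumptions and may differ from the evaluation protocol used for Forest-Chat.

```python
def mean_iou(pred, target, num_classes):
    # pred and target are flat lists of integer class labels per pixel.
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:               # skip classes absent from both maps
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

# 0 = no change, 1 = change (toy 6-pixel example)
pred = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
score = mean_iou(pred, target, num_classes=2)
print(score)
```

Each class here has an intersection of 2 pixels and a union of 4, giving an IoU of 0.5 per class and an mIoU of 0.5.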

2505.13140 2026-06-08 cs.CV 版本更新

CacheFlow: Fast Human Motion Prediction by Cached Normalizing Flow

CacheFlow: 通过缓存归一化流实现快速的人体运动预测

Takahiro Maeda, Jinkun Cao, Norimichi Ukita, Kris Kitani

发表机构 * Toyota Technological Institute(丰田技术研究所) Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所)

AI总结 CacheFlow通过缓存归一化流生成模型,实现快速3D人体运动预测,相比传统方法速度提升显著,且保持预测精度和模型表达能力。

Comments Accepted at Transactions on Machine Learning Research (TMLR). See https://openreview.net/forum?id=icq5659pQt

详情
Journal ref
Transactions on Machine Learning Research, 2026
AI中文摘要

许多用于3D人体运动预测的密度估计技术需要大量推理时间,通常超过预测时间范围。为解决此问题,我们提出了一种新的基于流的方法,称为CacheFlow。与之前的条件生成模型相比,CacheFlow利用无条件的流生成模型,将高斯混合转化为未来运动的密度。流生成模型的计算结果可以预先计算并缓存。然后,对于条件预测,我们通过一个更轻量的模型将历史轨迹映射到高斯混合中的样本。这种映射方式相比传统条件流模型节省了显著的计算开销。通过这种两阶段方法和缓存慢流模型的计算结果,我们构建了CacheFlow,不损失预测精度和模型表达能力。此推理过程大约在1毫秒内完成,比之前的VAE方法快4倍,比之前的扩散方法快30倍。此外,我们的方法在Human3.6M数据集上展示了改进的密度估计精度,并与SOTA方法具有可比的预测精度。我们的代码和模型可在https://github.com/meaten/CacheFlow上获得。

英文摘要

Many density estimation techniques for 3D human motion prediction require a significant amount of inference time, often exceeding the duration of the predicted time horizon. To address the need for faster density estimation for 3D human motion prediction, we introduce a novel flow-based method for human motion prediction called CacheFlow. Unlike previous conditional generative models that suffer from poor time efficiency, CacheFlow takes advantage of an unconditional flow-based generative model that transforms a Gaussian mixture into the density of future motions. The results of the computation of the flow-based generative model can be precomputed and cached. Then, for conditional prediction, we seek a mapping from historical trajectories to samples in the Gaussian mixture. This mapping can be done by a much more lightweight model, thus saving significant computation overhead compared to a typical conditional flow model. In such a two-stage fashion and by caching results from the slow flow model computation, we build our CacheFlow without loss of prediction accuracy and model expressiveness. This inference process is completed in approximately one millisecond, making it 4 times faster than previous VAE methods and 30 times faster than previous diffusion-based methods on standard benchmarks such as Human3.6M and AMASS datasets. Furthermore, our method demonstrates improved density estimation accuracy and comparable prediction accuracy to a SOTA method on Human3.6M. Our code and models are available at https://github.com/meaten/CacheFlow.

2511.09568 2026-06-08 physics.chem-ph cs.AI cs.CV 版本更新

VEDA: 3D Molecular Generation via Variance-Exploding Diffusion with Annealing

VEDA:通过带退火的方差爆炸扩散实现3D分子生成

Peining Zhang, Jinbo Bi, Minghu Song

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 VEDA将带退火的方差爆炸(VE)扩散与SE(3)等变架构相结合,高效生成构象准确的3D分子结构,兼顾高化学精度与计算效率。

详情
AI中文摘要

扩散模型在3D分子生成中展现出潜力,但面临采样效率与构象准确性之间的根本权衡。基于流的模型虽然速度快,但常产生几何不准确的结构,因为它们难以捕捉分子构象的多模态分布。相比之下,去噪扩散模型更准确但采样慢,这一局限被归因于扩散动力学与SE(3)等变架构之间的整合欠佳。为此,我们提出了VEDA,一个统一的SE(3)等变框架,将方差爆炸(VE)扩散与退火相结合,以高效生成构象准确的3D分子结构。关键技术贡献包括:(1) 一种VE调度,使噪声注入在功能上类似于模拟退火,提高3D准确性并降低松弛能量;(2) 一种新型预处理方案,协调SE(3)等变网络的坐标预测性质与基于残差的扩散目标;(3) 一种新的基于arcsin的调度器,将采样集中在对数信噪比的关键区间。在QM9和GEOM-DRUGS数据集上,VEDA的采样效率与基于流的模型相当,仅用100次采样步骤就实现了最先进的价态稳定性与有效性。更重要的是,以GFN2-xTB优化过程中的松弛能量衡量,VEDA生成的结构非常稳定:能量变化中位数仅为1.72 kcal/mol,显著低于其架构基线SemlaFlow的32.3 kcal/mol。我们的框架表明,有原则地整合VE扩散与SE(3)等变架构可以同时实现高化学精度和计算效率。

英文摘要

Diffusion models show promise for 3D molecular generation, but face a fundamental trade-off between sampling efficiency and conformational accuracy. While flow-based models are fast, they often produce geometrically inaccurate structures, as they have difficulty capturing the multimodal distributions of molecular conformations. In contrast, denoising diffusion models are more accurate but suffer from slow sampling, a limitation attributed to sub-optimal integration between diffusion dynamics and SE(3)-equivariant architectures. To address this, we propose VEDA, a unified SE(3)-equivariant framework that combines variance-exploding diffusion with annealing to efficiently generate conformationally accurate 3D molecular structures. Specifically, our key technical contributions include: (1) a VE schedule that enables noise injection functionally analogous to simulated annealing, improving 3D accuracy and reducing relaxation energy; (2) a novel preconditioning scheme that reconciles the coordinate-predicting nature of SE(3)-equivariant networks with a residual-based diffusion objective, and (3) a new arcsin-based scheduler that concentrates sampling in critical intervals of the logarithmic signal-to-noise ratio. On the QM9 and GEOM-DRUGS datasets, VEDA matches the sampling efficiency of flow-based models, achieving state-of-the-art valency stability and validity with only 100 sampling steps. More importantly, VEDA's generated structures are remarkably stable, as measured by their relaxation energy during GFN2-xTB optimization. The median energy change is only 1.72 kcal/mol, significantly lower than the 32.3 kcal/mol from its architectural baseline, SemlaFlow. Our framework demonstrates that principled integration of VE diffusion with SE(3)-equivariant architectures can achieve both high chemical accuracy and computational efficiency.
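As background for the term "variance-exploding", the sketch below uses the common geometric VE noise schedule, in which clean coordinates are perturbed by Gaussian noise whose scale grows from `sigma_min` to `sigma_max`. This is the generic textbook form, not VEDA's arcsin-based scheduler or its preconditioning; the parameter values and function names are assumptions.

```python
import random

def ve_sigma(t, sigma_min=0.01, sigma_max=10.0):
    # Geometric interpolation of the noise scale for t in [0, 1].
    return sigma_min * (sigma_max / sigma_min) ** t

def ve_perturb(coords, t, rng):
    # VE forward process: x_t = x_0 + sigma(t) * eps, eps ~ N(0, I).
    # The signal is not scaled down; only the noise variance grows.
    s = ve_sigma(t)
    return [[x + s * rng.gauss(0.0, 1.0) for x in atom] for atom in coords]

rng = random.Random(0)
water = [[0.0, 0.0, 0.0], [0.96, 0.0, 0.0], [-0.24, 0.93, 0.0]]  # toy 3-atom geometry
noisy = ve_perturb(water, t=0.5, rng=rng)
print(ve_sigma(0.0), ve_sigma(0.5), ve_sigma(1.0))
```

Sampling runs this process in reverse, from large noise to small. Starting at a high noise scale and lowering it gradually is what the abstract likens to simulated annealing.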

2504.21614 2026-06-08 cs.CV 版本更新

Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection

Mcity数据引擎:通过开放词汇数据选择实现迭代模型改进

Daniel Bogdoll, Rajanikant Patnaik Ananta, Abeyankar Giridharan, Isabel Moore, Gregory Stevens, Henry X. Liu

发表机构 * University of Michigan Transportation Research Institute(密歇根大学交通研究所) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) Texas A&M University(德克萨斯A&M大学)

AI总结 本文提出Mcity数据引擎,通过开放词汇数据选择解决大规模未标记数据中长尾类检测难题,提供从数据采集到模型部署的完整数据开发流程。

Comments Accepted for publication at ITSC 2025

详情
AI中文摘要

随着数据可用性的持续增长,选择和标注适合机器学习模型训练的样本变得越来越具有挑战性。在大量未标记数据中检测感兴趣的长尾类别尤其困难。这一点在智能交通系统(ITS)中尤为突出,其中车队和路侧感知系统会产生大量原始数据。虽然工业界存在用于此类迭代数据选择和模型训练过程的专有数据引擎,但研究人员和开源社区却缺乏一个公开可用的系统。我们提出了Mcity数据引擎,它提供了覆盖完整数据驱动开发周期的模块,从数据采集阶段开始,到模型部署阶段结束。Mcity数据引擎通过开放词汇数据选择过程专注于罕见和新颖的类别。所有代码均以MIT许可证公开发布在GitHub上:https://github.com/mcity/mcity_data_engine

英文摘要

With an ever-increasing availability of data, it has become more and more challenging to select and label appropriate samples for the training of machine learning models. It is especially difficult to detect long-tail classes of interest in large amounts of unlabeled data. This holds especially true for Intelligent Transportation Systems (ITS), where vehicle fleets and roadside perception systems generate an abundance of raw data. While industrial, proprietary data engines for such iterative data selection and model training processes exist, researchers and the open-source community suffer from a lack of an openly available system. We present the Mcity Data Engine, which provides modules for the complete data-based development cycle, beginning at the data acquisition phase and ending at the model deployment stage. The Mcity Data Engine focuses on rare and novel classes through an open-vocabulary data selection process. All code is publicly available on GitHub under an MIT license: https://github.com/mcity/mcity_data_engine