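arXivDaily daily academic digest: synchronized with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems.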

2606.06520 2026-06-08 cs.CV cs.GR New submission

Applying Deep Learning for cockpit segmentation in the context of mixed reality

Alexandre Leles Sousa, Pedro de Oliveira Nielson, Erick Oliveira Rodrigues, Rafael Francisco dos Santos, Giovani Bernardes Vitor

Affiliations: Laboratório de Robótica, Sistemas Inteligentes e Complexos - RobSIC; Instituto de Ciências Tecnológicas, Universidade Federal de Itajubá, Campus Itabira, MG; Universidade Tecnológica Federal do Paraná - UTFPR, Campus Pato Branco/PR

AI summary: This paper applies the U-net and DeepLabV3+ convolutional neural networks to segment cockpit images into foreground and background, easing the fusion of virtual and real imagery in mixed reality, and reports roughly 90% accuracy.

Comments XXV Congresso Brasileiro de Automática - CBA 2024

DOI: 10.20906/CBA2024/4844

Abstract

Computer vision is an area that has been growing continuously. With the advance of technologies with a first-person view, new development opportunities have emerged inside the area. Mixed reality promotes virtual environments with objects from the physical world shown in real time. For that, it's necessary to be concerned with the immersion of the user in this simulated environment, increasingly seeking to bring it closer to a possible desired reality. This paper proposes the development of image processing in order to perform the segmentation of images to identify what is foreground and background in order to facilitate the union of virtual and real images. Thus, the present work obtains real images of the user using the off-highway truck simulator CAT793F, through a camera, to be able to perform the segmentation of such images with artificial intelligence techniques. The convolutional neural network architectures "U-net" and "DeepLabV3+" are applied to perform image segmentation. As a result, metrics with around 90% accuracy were presented and the best model was determined.
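The abstract does not say which metrics stand behind the "around 90%" figure. As a minimal sketch, assuming pixel accuracy and foreground IoU (two common choices for binary segmentation), the evaluation of a predicted foreground/background mask could look like this. The toy masks and function names are illustrative and not taken from the paper.

```python
def pixel_accuracy(pred, truth):
    """Fraction of pixels whose predicted label matches the ground truth."""
    total = 0
    correct = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            total += 1
            correct += int(p == t)
    return correct / total


def foreground_iou(pred, truth):
    """Intersection over union of the foreground (label 1) regions."""
    inter = 0
    union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += int(p == 1 and t == 1)
            union += int(p == 1 or t == 1)
    return inter / union if union else 1.0


# Toy 3x4 masks: 1 = foreground (e.g. cockpit and hands), 0 = background.
truth = [[1, 1, 0, 0],
         [1, 1, 0, 0],
         [1, 1, 1, 0]]
pred = [[1, 1, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 1, 1]]

accuracy = pixel_accuracy(pred, truth)
iou = foreground_iou(pred, truth)
```

Pixel accuracy can look high when one class dominates the frame, which is why IoU is often reported alongside it.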


2606.06532 2026-06-08 cs.CV New submission

GOPAgen: Motion-Aware and Efficient Agentic Long-Video Understanding with Structural Memory and Hierarchical Reasoning

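Haozhe Chi, Yang Jin, Yadong Mu

Affiliations: Peking University

AI summary: Proposes GOPAgen, which achieves efficient long-video understanding through a motion agent built on the video codec's Groups of Pictures (GOPs), a GOP tree reasoning algorithm, and a structural memory mechanism, and attains leading performance on multiple VQA benchmarks.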

Abstract

Despite significant progress in agentic long video understanding, existing methods still lack detailed motion comprehension coupled with an efficient memory architecture. In this paper, we propose GOPAgen, a novel approach that first integrates video codec into the video understanding framework via a meticulously designed motion agent trained on Groups of Pictures (GOPs) from video codec. We further develop a GOP tree reasoning algorithm, which is naturally aligned with video codec and enhances the model's ability to understand local detailed motions in videos. Additionally, we carefully design a structural memory mechanism that integrates local motion information with detailed captions in structural pages, and propose an efficient coarse-to-fine zoom-in algorithm to fully exploit the structural memory. Furthermore, we incorporate a motion vector database into the framework to enable efficient retrieval of motion vectors at different granularities. Overall, our method achieves superior Video Question Answering (VQA) performance on various video understanding benchmarks, including MotionBench and Egoschema, thereby demonstrating the superiority of our proposed framework.
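For readers unfamiliar with codec structure: a Group of Pictures starts at an intra-coded I-frame and runs until the next one. The sketch below groups frame indices into GOPs from a list of frame types. It is a simplified illustration (closed GOPs, frames already in display order); the frame-type sequence is hypothetical and this is not the GOPAgen implementation.

```python
def split_into_gops(frame_types):
    """Group frame indices into GOPs; each I-frame opens a new group."""
    gops = []
    for index, frame_type in enumerate(frame_types):
        if frame_type == "I" or not gops:
            gops.append([])
        gops[-1].append(index)
    return gops


# Hypothetical frame-type sequence, as a decoder might report it.
frame_types = list("IPPBIPPPIB")
gops = split_into_gops(frame_types)
```

Units like these could serve as the leaves of a tree over the video, with coarser nodes covering runs of consecutive GOPs.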


2606.06536 2026-06-08 cs.CV cs.AI cs.LG New submission

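From Pixels to Newtons: Predicting In Vivo Joint Contact Forces from Monocular Video

Jessy Lauer

Affiliations: Rowland Institute at Harvard

AI summary: Proposes a physics-free pipeline that predicts 3D hip and knee contact forces from uncalibrated monocular video, with no markers, force plates, electromyography, subject-specific imaging, or musculoskeletal model. A transformer fuses kinematics, body shape, activity text, and self-supervised video tokens, and across 26 patients and 25 activities it matches the accuracy of subject-specific musculoskeletal simulations.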

Abstract

Joint contact forces govern implant longevity, cartilage health, and rehabilitation outcomes, shaping who develops osteoarthritis, who recovers well from joint replacement, and who benefits from biomechanical interventions. Yet they remain measurable only invasively, in a few dozen patients with instrumented implants. I present a physics-free pipeline to predict instantaneous 3D hip and knee contact forces from an uncalibrated monocular video: no markers, force plates, electromyography, subject-specific imaging, or musculoskeletal model. Parametric body meshes are recovered per frame, encoded as kinematic features, and decoded into forces by a transformer whose pose stream is adaptively modulated at every layer by body shape, joint, side, activity text, and self-supervised video tokens (V-JEPA 2), unifying hip and knee in a single model. Under leave-one-subject-out cross-validation across 26 patients and 25 activity categories from the in vivo OrthoLoad database, the pipeline matches the accuracy of subject-specific musculoskeletal simulations ($0.32 \pm 0.08$ BW RMSE for hip; $0.23 \pm 0.03$ BW for knee) and resolves peak force changes smaller than those reported for gait retraining and osteoarthritis progression. Applied zero-shot to an independent instrumented cohort, it rivals or outperforms prior published methods. Even without curated activity labels, video features alone preserve accuracy and enable end-to-end inference on raw footage. Driven by the predictor, a generative motion prior produces biomechanically plausible variants with reduced peak loading, rediscovering strategies from the predictive simulation literature. This pipeline establishes uncalibrated monocular video as a viable modality for estimating joint loading, opening a path toward retrospective analysis of archived clinical recordings, primary-care screening, and at-home rehabilitation tracking.
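The evaluation protocol mentioned in the abstract (leave-one-subject-out cross-validation, errors in body-weight units) can be sketched as follows. This is an illustration under the assumption that forces are in newtons and normalized by each subject's body weight; the helper names and numbers are made up for the example.

```python
import math


def rmse_bw(pred_n, true_n, body_weight_n):
    """RMSE between predicted and measured forces, expressed in body weights (BW)."""
    errors = [(p - t) / body_weight_n for p, t in zip(pred_n, true_n)]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


def leave_one_subject_out(subject_ids):
    """Yield (held-out subject, train indices, test indices) for each subject."""
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Toy data: three trials from two subjects.
subject_ids = ["s1", "s1", "s2"]
folds = list(leave_one_subject_out(subject_ids))
example_rmse = rmse_bw([700.0, 1400.0], [700.0, 700.0], body_weight_n=700.0)
```

Holding out whole subjects, rather than random trials, keeps a model from being scored on people it has already seen.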


2606.06664 2026-06-08 cs.CV cs.AI cs.LG New submission

Inside the Visual Mind: Neuroscience-Motivated Concept Circuits for Interpreting and Steering Vision Transformers

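Tang Li, Yanlin Chen, Mengmeng Ma, Xi Peng

Affiliations: University of Science and Technology of China

AI summary: Proposes ViSAE, a toolbox that explains the inner workings of Vision Transformers through neuroscience-inspired concept circuits. It comprises an efficient concept set, automatic circuit-tracing algorithms, and concept-editing applications, and improves worst-group accuracy on WaterBirds by 48.2%.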

Comments In Proceedings of the International Conference on Machine Learning, 2026. (acceptance rate 26.6%)

Abstract

Despite high accuracy, Vision Transformer (ViT) predictions can be driven by spurious cues, raising the need to understand their inner workings before safe deployment. Sparse autoencoders (SAEs) provide a promising lens for decomposing model representations into human-interpretable concepts, yet adapting SAE-based interpretation to ViTs remains challenging due to limited control over concept coverage and subjective, non-scalable feature interpretation. To fill the gaps, motivated by neuroscience-inspired principles, we propose ViSAE, a mechanistic interpretability toolbox for understanding ViT inner workings through concept circuits. ViSAE consists of three components: (1) A probing suite with 64K images and a 16K visually grounded concept vocabulary, improving concept coverage efficiency by 20x over ImageNet and interpretation accuracy by 28.7% over existing concept sets. (2) Top-down concept reading and Bottom-up circuit tracing algorithms that automatically recover ViT inner workings via concept circuits. (3) Applications for auditing and steering ViT behavior. Through concept editing, ViSAE improves the worst-group accuracy on WaterBirds by 48.2%, outperforming existing methods by 23.8%. Our data and code: https://github.com/deep-real/ViSAE.
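Worst-group accuracy, the headline metric on WaterBirds, is the minimum accuracy over predefined groups (bird type crossed with background). A minimal sketch with toy predictions follows; the group names and data are illustrative only.

```python
from collections import defaultdict


def worst_group_accuracy(preds, labels, groups):
    """Return (worst accuracy, per-group accuracy) over the groups present."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for p, y, g in zip(preds, labels, groups):
        total[g] += 1
        correct[g] += int(p == y)
    per_group = {g: correct[g] / total[g] for g in total}
    return min(per_group.values()), per_group


# Toy example: each group is a "bird type/background" pair.
labels = [0, 0, 0, 1, 1, 1, 0, 1]
preds = [0, 0, 1, 1, 1, 0, 0, 1]
groups = ["land/land", "land/land", "land/water", "water/water",
          "water/water", "water/land", "land/water", "water/land"]
worst, per_group = worst_group_accuracy(preds, labels, groups)
```

A model that leans on the background as a spurious cue scores well on the matched groups and poorly on the mismatched ones, which this metric exposes.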


2606.06666 2026-06-08 cs.CV New submission

Architecture-Adaptive Uncertainty Fusion for Deepfake Detection

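Ritesh Sharma, Mohammad Ghasemigol, Yuichi Motai

Affiliations: University of Tokyo; Nagoya University

AI summary: Proposes Correlation-Optimized Fusion (COF), a framework that adaptively fuses five uncertainty sources by maximizing the Pearson correlation between the fused uncertainty score and prediction error. It needs no model modifications, its optimization takes only 42 seconds, and it outperforms Random Forest under distribution shift.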

Abstract

Deepfake detection systems achieve near-perfect accuracy on benchmarks, yet forensic deployment demands reliable prediction uncertainty. Existing uncertainty quantification (UQ) methods rely on single sources and ignore that optimal uncertainty composition varies across architectures. We propose Correlation-Optimized Fusion (COF), an architecture-adaptive framework that fuses five complementary uncertainty sources -- epistemic, aleatoric, calibration, conformal, and distributional -- by maximizing Pearson correlation between fused uncertainty scores and prediction errors via constrained optimization on the probability simplex. COF requires no model modifications and only 42 s of weight optimization, compared to 20--45 h for a 5-model Deep Ensemble. Evaluation across eleven architectures on FaceForensics++ reveals a fundamental trade-off: under matched train/evaluation protocol, non-linear methods achieve approximately 5--6% higher in-domain correlation than COF (mean r = 0.438), but this reverses under distribution shift. On CelebDF, COF outperforms Random Forest in 9/11 architectures with up to 7.3x higher correlation (MaxViT-B: r = 0.249 vs. 0.034); RF degrades 85% cross-domain to r = 0.071, whereas COF retains substantially more signal (74% drop to r = 0.116). Cross-dataset evaluation on CelebDF and DFDC reveals catastrophic generalization failure across all methods: in-domain correlations of 0.41--0.47 collapse to near-zero externally (mean degradation 90.7%), with seven of eleven architectures exhibiting uncertainty inversion. These results establish COF as a practical, interpretable framework for controlled-distribution deployment and identify domain-adaptive UQ as the central open challenge for forensic deployment.
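The core idea, choosing non-negative weights that sum to one so that the fused uncertainty correlates as strongly as possible with prediction error, can be illustrated with a small sketch. The abstract does not name the optimizer, so a coarse grid search over the simplex stands in here as a plainly swapped-in technique; the synthetic data, the three toy sources, and all function names are assumptions for illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def simplex_grid(k, steps):
    """Yield every length-k weight vector with entries in {0, 1/steps, ..., 1} summing to 1."""
    def rec(remaining, slots):
        if slots == 1:
            yield (remaining,)
            return
        for i in range(remaining + 1):
            for rest in rec(remaining - i, slots - 1):
                yield (i,) + rest
    for counts in rec(steps, k):
        yield tuple(c / steps for c in counts)


def fuse(weights, sources):
    """Weighted sum of uncertainty scores; sources[j][i] is source j on sample i."""
    n = len(sources[0])
    return [sum(w * s[i] for w, s in zip(weights, sources)) for i in range(n)]


def fit_weights(sources, errors, steps=10):
    """Grid search on the simplex for the weights maximizing Pearson r with the errors."""
    best_w, best_r = None, -2.0
    for w in simplex_grid(len(sources), steps):
        r = pearson(fuse(w, sources), errors)
        if r > best_r:
            best_w, best_r = w, r
    return best_w, best_r


# Synthetic stand-in data: binary errors and three uncertainty sources of varying quality.
random.seed(0)
errors = [random.randint(0, 1) for _ in range(200)]
sources = [
    [e + random.gauss(0, 0.5) for e in errors],  # informative source
    [e + random.gauss(0, 1.0) for e in errors],  # weaker source
    [random.random() for _ in errors],           # pure noise
]
weights, r = fit_weights(sources, errors)
```

Because the grid includes the simplex vertices, the fused score is never worse on the fitting data than the best single source. Whether that advantage survives distribution shift is exactly the question the abstract raises.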


2606.06671 2026-06-08 cs.CV New submission

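RPC-GS: Gaussian Splatting for Satellite Imagery with Native RPC Rendering

Valentin Wagner, Sebastian Bullinger, Christoph Bodensteiner, Michael Arens

Affiliations: Fraunhofer Institute of Optronics, System Technologies and Image Exploitation

AI summary: Proposes RPC-GS, the first Gaussian splatting framework to use the RPC model natively. By projecting Gaussian means and covariances directly, it avoids approximation error and achieves the lowest reconstruction error on satellite benchmark datasets.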


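MedSIGHT: Toward Grounded Visual Understanding in Medical Large Vision-Language Models

Aofei Chang, Le Huang, Alex James Boyd, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, Cao Xiao

Affiliations: University of California, Berkeley

AI summary: Proposes the MedSIGHT framework, which unifies semantic understanding and pixel-level segmentation in medical vision-language models through a Region Perceiver, a medical region codebook, and a progressive training strategy, reaching state-of-the-art multimodal understanding and segmentation with 72K training pairs.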

Comments Accepted at ICML 2026

Abstract

Medical large vision-language models (Med-LVLMs) have recently achieved remarkable progress in vision-language comprehension and medical image segmentation. However, existing models still struggle to unify these two capabilities, which is essential for achieving clinical reasoning that connects visual findings with semantic interpretation. We present MedSIGHT, a unified framework that equips Med-LVLMs with structured, pixel-level understanding for grounded visual comprehension. MedSIGHT introduces a novel Region Perceiver module that produces region-centric tokens, encoding spatial information directly into the representation space of the language model. We further introduce a medical region codebook into the LLM vocabulary, allowing the model to generate discrete region codes as symbolic representations of anatomical and pathological regions. These codes are decoded through the Region Perceiver to reconstruct segmentation masks, achieving end-to-end spatial grounding. Lastly, MedSIGHT combines the Region Perceiver, codebook, and LLM using our proposed progressive training strategy to align these modules gradually and stably. Trained on only 72K multimodal instruction pairs, MedSIGHT achieves state-of-the-art performance across diverse imaging modalities on both medical comprehension and segmentation tasks.


2606.06813 2026-06-08 cs.CV cs.AI New submission

Breaking the Lock-in: Diversifying Text-to-Image Generation via Representation Modulation

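Dahee Kwon, Haeun Lee, Jaesik Choi

Affiliations: KAIST

AI summary: To address text-to-image models producing overly similar samples under a fixed prompt, proposes DAVE, a training-free representation-level intervention that selectively attenuates the zero-frequency spatial-average component early in generation, increasing diversity while preserving image quality at minimal computational overhead.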

Comments Accepted to ICML 2026. Code is available at: https://github.com/daheekwon/DAVE

Abstract

Recent text-to-image models built on large-scale Transformer backbones and flow-based objectives deliver strong text-image alignment and high visual quality, yet often produce overly similar samples under a fixed prompt. Existing diversity-enhancement methods alleviate this issue, but typically require expensive sampling or auxiliary optimization, incurring non-trivial overhead. To investigate the root cause of this homogeneity, we examine intermediate Transformer features and observe that the zero-frequency spatial average (DC) component rapidly converges across seeds early in generation, causing early trajectory lock-in that limits downstream variation. Building on this observation, we propose DC Attenuation for diVersity Enhancement (DAVE), a training-free representation-level intervention that selectively attenuates this component in the early regime. DAVE preserves the sampling pipeline with negligible overhead, improving prompt-consistent diversity while maintaining competitive image quality.
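The DC component of a feature map is its per-channel average over spatial positions. A minimal sketch of attenuating it, assuming features arrive as a list of token vectors and that attenuation means scaling the mean by a factor `alpha` (a hypothetical parameter; the abstract does not give the exact rule or the layers and steps where it applies):

```python
def attenuate_dc(tokens, alpha):
    """Scale the spatial-mean (DC) part of each channel by alpha; deviations are unchanged."""
    n = len(tokens)
    channels = len(tokens[0])
    means = [sum(tok[c] for tok in tokens) / n for c in range(channels)]
    return [[tok[c] - (1.0 - alpha) * means[c] for c in range(channels)]
            for tok in tokens]


# Two spatial tokens with two channels each; channel means are 2 and 12.
tokens = [[1.0, 10.0], [3.0, 14.0]]
damped = attenuate_dc(tokens, alpha=0.5)
```

With `alpha=1.0` the features pass through untouched, and with `alpha=0.0` the mean is removed entirely. Token-to-token differences are preserved in every case.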


2606.06819 2026-06-08 cs.CV New submission

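FS-DVS: A Frequency-Selective Dynamic Vision Sensing Paradigm for Enhanced Information Completeness

Feiyu Ji, Xiaokang Yang, Xiaoyun Yuan

Affiliations: University of Science and Technology of China

AI summary: Proposes the FS-DVS paradigm, which places a learnable spatial filter before event triggering to mimic retinal ganglion cell aggregation. The filter spontaneously learns center-surround patterns that enhance mid-frequency information, yielding substantial gains in object detection and action recognition.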

Abstract

Dynamic vision sensors (DVS) offer exceptional temporal resolution and dynamic range by asynchronously reporting pixel-level intensity changes. However, conventional DVS rely on a per-pixel independent triggering mechanism, ignoring the spatial integration performed by biological retinal ganglion cells (RGCs). Consequently, they lack the contrast sensitivity function (CSF) and its inherent sensitivity to mid-spatial frequencies, which inevitably leads to information incompleteness due to sub-threshold signal loss. To bridge this gap, we propose FS-DVS (Frequency-Selective Dynamic Vision Sensor), a novel paradigm that integrates a learnable spatial filter strictly preceding the event triggering process to mimic the RGC aggregation mechanism. By developing a differentiable event simulation framework, the spatial filter can be optimized end-to-end with downstream tasks. Our study reveals that starting from a delta function, the learned spatial filters spontaneously evolve into center-surround patterns that emphasize mid-frequency components, consistently aligning with human CSF. Beyond achieving substantial performance gains in object detection and action recognition, the consistent convergence to human-like CSF characteristics across different tasks underscores the universality of this mid-frequency selective mechanism. Compared to naively increasing sensor sensitivity or relying on post-processing, our paradigm achieves selective information enhancement with high noise resilience, providing a robust, biologically plausible blueprint for next-generation neuromorphic sensors.
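To make the "filter before triggering" idea concrete, the sketch below thresholds a spatially filtered log-intensity change. A delta kernel reproduces per-pixel triggering, while a hand-written center-surround kernel (an assumption standing in for the learned filter) ignores a uniform brightness step but still responds to a local change. The kernels, threshold, and frames are illustrative and not the paper's simulator.

```python
import math


def filter2d(img, kernel):
    """Same-size 2D correlation with zero padding."""
    h, w = len(img), len(img[0])
    kh, kw = len(kernel), len(kernel[0])
    oy, ox = kh // 2, kw // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0.0
            for i in range(kh):
                for j in range(kw):
                    yy, xx = y + i - oy, x + j - ox
                    if 0 <= yy < h and 0 <= xx < w:
                        acc += kernel[i][j] * img[yy][xx]
            out[y][x] = acc
    return out


def events(prev, curr, kernel, threshold):
    """Return +1, -1, or 0 per pixel from the filtered log-intensity change."""
    diff = [[math.log(c) - math.log(p) for p, c in zip(pr, cr)]
            for pr, cr in zip(prev, curr)]
    filtered = filter2d(diff, kernel)
    return [[int(v >= threshold) - int(v <= -threshold) for v in row]
            for row in filtered]


delta = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # per-pixel triggering
center_surround = [[-0.125, -0.125, -0.125],
                   [-0.125, 1.0, -0.125],
                   [-0.125, -0.125, -0.125]]  # hypothetical zero-mean kernel

prev = [[1.0] * 5 for _ in range(5)]
brighter = [[math.exp(0.6)] * 5 for _ in range(5)]  # global brightness step
spot = [row[:] for row in prev]
spot[2][2] = math.exp(0.6)  # local change at the center pixel

uniform_delta = events(prev, brighter, delta, 0.5)
uniform_cs = events(prev, brighter, center_surround, 0.5)
spot_cs = events(prev, spot, center_surround, 0.5)
```

The delta kernel fires at every pixel on the global step, whereas the zero-mean kernel suppresses that zero-frequency change and fires only for the local spot.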


2606.06864 2026-06-08 cs.CV cs.LG New submission

LRMIL: Efficient Low-Resolution Multiple Instance Learning via High-Resolution Knowledge Distillation for Whole Slide Image Classification

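Yonghan Shin, Won-Ki Jeong

Affiliations: Department of Computer Science and Engineering, Korea University, Seoul, Korea

AI summary: Proposes the LRMIL framework, which transfers high-resolution knowledge to low-resolution representations through two-stage knowledge distillation and uses only low-resolution patches at inference, substantially reducing computational cost while improving classification performance.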

Abstract

Multiple instance learning (MIL) has become a standard paradigm for whole slide image (WSI) analysis in digital pathology, as it enables slide-level prediction without dense annotations. Existing MIL methods typically rely on exhaustive extraction and encoding of high-resolution patches. However, this practice suffers from two critical limitations in real-world clinical settings: it struggles to capture global visual cues at lower magnifications, and incurs substantial computational overhead due to the massive number of high-resolution patches per slide. To address these limitations, we propose an efficient low-resolution multiple instance learning (LRMIL) framework that transfers high-resolution knowledge to low-resolution representations. LRMIL adopts a two-stage distillation strategy. First, patch-level cross-resolution distillation aligns low-resolution patch embeddings with high-resolution representations. Second, slide-level knowledge distillation trains a low-resolution student MIL model under both slide-level supervision and teacher guidance. At inference time, LRMIL operates exclusively on low-resolution patches, substantially reducing data preprocessing and computational cost. Extensive experiments on multiple WSI benchmarks demonstrate that LRMIL consistently outperforms state-of-the-art MIL methods while achieving more efficient inference. These results highlight LRMIL as a practical and scalable solution for WSI analysis in clinical pathology.
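The two stages can be pictured as two loss terms. The abstract does not specify their form, so the sketch below assumes a mean-squared-error alignment for the patch stage and, for the slide stage, cross-entropy plus a temperature-softened KL term, which are standard knowledge-distillation choices. The weight, temperature, and function names are hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    top = max(scaled)
    exps = [math.exp(z - top) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def patch_alignment_loss(student_emb, teacher_emb):
    """Stage 1 (assumed form): MSE between low-res student and high-res teacher embeddings."""
    return sum((s - t) ** 2 for s, t in zip(student_emb, teacher_emb)) / len(student_emb)


def slide_distillation_loss(student_logits, teacher_logits, label,
                            weight=0.5, temperature=2.0):
    """Stage 2 (assumed form): label cross-entropy plus KL(teacher || student) on softened outputs."""
    ce = -math.log(softmax(student_logits)[label])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_teacher, p_student) if t > 0)
    return (1 - weight) * ce + weight * kl


stage1 = patch_alignment_loss([1.0, 2.0], [1.0, 4.0])
stage2 = slide_distillation_loss([0.0, 0.0], [0.0, 0.0], label=0)
```

At inference only the student and the low-resolution patches are needed, which is where the cost saving described in the abstract comes from.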


2606.06867 2026-06-08 cs.CV New submission

Multi-FRuGaL: Multimodal Flexible Redundancy-aware Decomposed Gated Learning for Cancer Diagnosis and Prognosis

Sanket Kachole, Siddhesh Thakur, Shubham Innani, Sanyukta Adap, Suhang You, Carla Pitarch-Abaigar, Spyridon Bakas

Affiliations: Division of Computational Pathology, Department of Pathology and Laboratory Medicine, Indiana University School of Medicine; IU Melvin and Bren Simon Comprehensive Cancer Center; Departments of Biostatistics and Health Data Science; Radiology and Imaging Sciences; Neurological Surgery; Indiana University School of Medicine; Department of Computer Science, Luddy School of Informatics, Computing, and Engineering

AI summary: Proposes the Multi-FRuGaL framework, which uses decomposition-aware, adaptive gated intermediate fusion to learn modality-level representations under missing modalities, separating redundant from complementary signals and improving cancer diagnosis and prognosis.

Abstract

Modern medicine relies on heterogeneous data sources spanning radiology, pathology, text reports, and structured clinical information. However, real-world patient data are frequently incomplete, with missing or sparsely acquired modalities, limiting the effectiveness of standard multimodal fusion approaches. To this end, we propose the Multimodal Flexible Redundancy-aware decomposed GAted Learning (Multi-FRuGaL) framework, a decomposition-aware, adaptive gated intermediate-fusion framework that performs modality-level representation learning under missing data. Multi-FRuGaL integrates per-modality encoders with a signal decomposition layer, an input-conditioned gating network, and an information-aware fusion objective to separate redundant from modality-specific complementary signals, selectively upweighting informative modalities and suppressing redundant or noisy inputs, and remaining well-defined even when multiple modalities are absent. We evaluate Multi-FRuGaL on two multimodal head and neck cancer cohorts: the HANCOCK challenge dataset (N = 763) comprising five modalities and two prognostic endpoints (5-year survival and 2-year recurrence), and the HECKTOR challenge dataset (N = 588) comprising three modalities for human papillomavirus (HPV) status classification. Multi-FRuGaL consistently achieves higher mean performance than the evaluated baselines across multiple tasks, improving AUC from 0.601 to 0.8496 for survival, from 0.672 to 0.8102 for recurrence, and achieving 0.975 AUC for HPV prediction on HECKTOR. For survival analysis, it further achieves a concordance index of 0.6814 for overall survival, 0.7421 for recurrence-free survival, and 0.7143 for progression-free survival on HANCOCK, and 0.7203 for recurrence-free survival on HECKTOR. Qualitative analyses further show that Multi-FRuGaL learns discriminative and robust multimodal representations, even under severe missing-modality conditions.
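One way a gated fusion can stay well-defined when modalities are missing is to normalize the gate only over the modalities that are present. The sketch below shows that single idea with fixed gate scores. In a real model the scores would be predicted from the input, and the redundant versus complementary decomposition described in the abstract is not modeled. Modality names and values are illustrative.

```python
import math


def gated_fusion(embeddings, gate_logits):
    """Fuse available embeddings with a softmax gate restricted to present modalities.

    embeddings: dict of name -> list of floats, or None when the modality is missing.
    gate_logits: dict of name -> float score (fixed here for illustration).
    """
    present = [m for m, e in embeddings.items() if e is not None]
    if not present:
        raise ValueError("at least one modality must be present")
    top = max(gate_logits[m] for m in present)
    exps = {m: math.exp(gate_logits[m] - top) for m in present}
    norm = sum(exps.values())
    gates = {m: exps[m] / norm for m in present}
    dim = len(embeddings[present[0]])
    fused = [sum(gates[m] * embeddings[m][d] for m in present) for d in range(dim)]
    return fused, gates


embeddings = {"ct": [1.0, 0.0], "pathology": None, "clinical": [0.0, 1.0]}
gate_logits = {"ct": 0.0, "pathology": 3.0, "clinical": 0.0}
fused, gates = gated_fusion(embeddings, gate_logits)
```

The missing pathology input receives no weight despite its high score, and the remaining gates still sum to one.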


2606.06872 2026-06-08 cs.CV cs.AI New submission

EgoPressDiff: Multimodal Video Diffusion for Egocentric UV-Domain Hand-Pressure Estimation

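Yuan Zeng, Zilue Gao, Yujia Shi, Zongqing Lu, Wenming Yang, QingMin Liao

Affiliations: University of Science and Technology of China

AI summary: Proposes EgoPressDiff, a conditional video diffusion framework that generates UV pressure maps from visual input using a multimodal conditioning strategy (hand pose, 3D mesh vertices, and depth). It addresses the quantization error and temporal inconsistency of existing methods and achieves state-of-the-art results on the EgoPressure dataset, with a relative Volumetric IoU gain of more than 34%.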

Comments Accepted to IEEE ICASSP 2026

DOI: 10.1109/ICASSP55912.2026.11463813

Abstract

Estimating hand-surface contact pressure from an egocentric view is crucial for AR/VR devices, robotic imitation, and ergonomic analysis. Existing methods often discretize the pressure signal and process frames independently, leading to quantization errors and temporal inconsistencies. We present EgoPressDiff, a conditional video diffusion framework that generates UV-pressure maps from visual input. The core of our approach is a multi-modal conditioning strategy, introducing a PoseNet and a Vertex Encoder to efficiently extract features from hand pose and 3D mesh vertices. These signals, along with depth information, guide the generative process to ensure the pressure fields are physically grounded. To effectively fuse these heterogeneous features, we further propose a Distribution-Calibrated Spatial Layer, which aligns their statistical properties before combination. Evaluated on the EgoPressure ego-view setting, EgoPressDiff achieves state-of-the-art results, improving Volumetric IoU by over 34% relative to the prior baseline, while reducing MAE and maintaining high temporal accuracy. Our project page is at https://egopressdiff.github.io/.
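The abstract reports Volumetric IoU and MAE without defining them. The sketch below assumes the definition commonly used for pressure maps, where Volumetric IoU is the sum of element-wise minima divided by the sum of element-wise maxima, and treats MAE as the mean absolute difference per map element. Both should be read as assumptions rather than the paper's exact evaluation code.

```python
def volumetric_iou(pred, truth):
    """Sum of element-wise minima over sum of element-wise maxima (assumed definition)."""
    num = 0.0
    den = 0.0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            num += min(p, t)
            den += max(p, t)
    return num / den if den else 1.0


def mean_absolute_error(pred, truth):
    """Mean absolute difference over all map elements."""
    diffs = [abs(p - t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    return sum(diffs) / len(diffs)


# Toy 2x2 pressure maps in arbitrary units.
truth = [[0.0, 2.0], [4.0, 0.0]]
pred = [[0.0, 1.0], [4.0, 1.0]]
viou = volumetric_iou(pred, truth)
mae = mean_absolute_error(pred, truth)
```

Unlike a binary contact IoU, this form penalizes wrong pressure magnitudes as well as wrong contact locations.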


2606.06875 2026-06-08 cs.CV cs.CR New submission

Unified Safe In-context Image Generation in Multimodal Diffusion Transformers via Restricting Unsafe Information Flows

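Xiang Yang, Feifei Li, Mi Zhang, Geng Hong, Xiaoyu You, Mi Wen, Min Yang

Affiliations: University of Science and Technology of China

AI summary: Proposes the UVR framework, which analyzes unsafe information flow in attention dynamics and applies training-free attention modulation to output patches, providing safety control for image generation and editing with erase rates of 91% and 77%, respectively.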

Comments ICML26

Abstract

Diffusion transformers (DiTs) equipped with multimodal attention (MM-Attn) have become a dominant paradigm for image generation. However, preventing the generation of harmful content remains a critical challenge, particularly in image-to-image (I2I) editing tasks. Existing safety mechanisms are primarily designed for text-to-image (T2I) synthesis or U-Net-based architectures, which limits their effectiveness for unified safety mitigation in DiT-based frameworks. To bridge this gap, we propose Unified Visual Safety Regulator (UVR), a training-free safe generation framework that regulates unsafe semantics in generated images. UVR is grounded in an analysis of attention dynamics from the perspective of information flow in MM-Attn. We identify a task-independent start-up stage, during which unsafe semantics in output patches rapidly emerge and can be accurately localized, followed by task-specific semantic amplification and interference stages, where harmful signals are further propagated and entangled with benign content. Based on these observations, UVR mitigates unsafe generation through unified, targeted attention modulation and explicit restriction of harmful information flow over the identified unsafe output patches. Experiments across various concepts show that UVR achieves state-of-the-art safety performance by achieving 91% and 77% erase rate in image synthesis and editing tasks, while preserving visual quality and fidelity with minimal degradation. Code is available at https://github.com/deng12yx/UVR.
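The abstract does not detail how information flow is restricted. One generic form of attention restriction, shown here purely as an illustration and not as UVR's actual rule, is to zero the attention a flagged output patch pays to selected key tokens and renormalize the remainder. The indices and weights are made up.

```python
def restrict_attention(weights, blocked_keys):
    """Zero the attention weights on blocked key indices and renormalize the rest."""
    kept = [0.0 if k in blocked_keys else w for k, w in enumerate(weights)]
    total = sum(kept)
    if total == 0:
        raise ValueError("all keys are blocked; nothing left to attend to")
    return [w / total for w in kept]


# One attention row for a flagged output patch over three key tokens.
row = [0.5, 0.25, 0.25]
restricted = restrict_attention(row, blocked_keys={0})
```

Renormalizing keeps the row a valid distribution, so the patch still draws on the remaining context.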


2606.06885 2026-06-08 cs.CV cs.AI New submission

FreeAnimate: Training-Free Human Image Animation with Preview-Guided Denoising

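Yuan Zeng, Yujia Shi, Zongqing Lu, QingMin Liao

Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: Proposes FreeAnimate, a framework that exploits the inherent capabilities of image diffusion models for training-free human image animation. A preview generation strategy supplies temporal and structural priors, and Inversion-Boosted Attention and Reference-Anchored Self-Attention modules maintain temporal consistency and identity preservation.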

Comments Accepted to IEEE ICASSP 2026

DOI: 10.1109/ICASSP55912.2026.11462600

Abstract

Human Image Animation has seen significant advancements, primarily driven by diffusion models. However, existing methods typically demand substantial training data and resources to achieve high-quality results, limiting generalization and accessibility. In this work, we introduce FreeAnimate, a training-free framework that leverages the inherent capabilities of image diffusion models to enable temporal consistency, identity preservation, and background stability. Our approach incorporates a novel preview generation strategy that provides temporal and structural priors from generated preview frames, effectively guiding pose alignment and background consistency without training. Additionally, FreeAnimate introduces Inversion-Boosted Attention and Reference-Anchored Self-Attention modules to guarantee temporal consistency and identity preservation. Experimental results demonstrate that FreeAnimate outperforms existing training-free competitors and training-based baseline methods, achieving generation quality comparable to state-of-the-art methods and offering robust generalization across diverse datasets. Our project page is at https://freeani.github.io/.


2606.06887 2026-06-08 cs.CV New submission

ARAPDiffusion: ARAP Regularization for Diffusion-Based Deformable Shape Space Learning

Haibo Liu, Jinghan Ke, Haitao Yang, Xiangru Huang, Georgios Pavlakos, Qixing Huang

Affiliations: University of Texas at Austin; Westlake University

AI summary: Proposes ARAPDiffusion, a latent diffusion model that injects the ARAP deformation model as regularization losses to learn a continuous shape space for deformable shape collections, reducing the reliance on large amounts of 3D training data.

Abstract

This paper introduces ARAPDiffusion, a latent diffusion model to learn the underlying continuous shape space of a deformable shape collection. The key innovation is in injecting the as-rigid-as-possible (ARAP) deformation model as regularization losses into latent diffusion (LD), relaxing the requirement for abundant 3D training data when learning generative models. In contrast to the standard LD, we show how the ARAP model can be used to improve both the encoder/decoder and the LD model. The training procedure alternates between using the synthetic distribution defined by the LD model to develop a regularization loss that enhances the shape encoder/decoder and using the shape decoder to develop a regularization loss to improve the LD model. We also show the benefit of the LD paradigm in combining a representation-free LD process and an implicit shape decoder that is applicable to unorganized point clouds. The experimental results of unconditional and conditional shape generation demonstrate the advantages of ARAPDiffusion over baseline approaches.
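For reference, the classical ARAP deformation energy for a shape with vertex positions $p_i$ deformed to $p'_i$ is written below. This is the standard textbook form with per-vertex rotations $R_i$, neighborhoods $\mathcal{N}(i)$, and edge weights $w_{ij}$. How the paper turns it into regularization losses on the latent space is not stated in the abstract, so this is background rather than the authors' exact loss.

```latex
E_{\mathrm{ARAP}}(p') \;=\; \min_{\{R_i \in SO(3)\}} \;
\sum_{i} \sum_{j \in \mathcal{N}(i)} w_{ij}\,
\bigl\| (p'_i - p'_j) - R_i\,(p_i - p_j) \bigr\|^2
```

The energy is zero when every local neighborhood moves by a pure rotation, so penalizing it discourages stretching and shearing while still allowing articulated, near-isometric motion.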

2606.06890 2026-06-08 cs.CV cs.LG 新提交

Diagnosing Visual Ignorance in Vision-Language Models

诊断视觉语言模型中的视觉忽视

Runyu Zhou, Qi Zhang, Qixun Wang, Yisen Wang

发表机构 * Peking University（北京大学）

AI总结研究视觉语言模型依赖语言先验的内部机制，通过层替换和探针分析揭示多阶段瓶颈，并引入渐进视觉退化指标发现基准测试可能奖励视觉忽视。

AI中文摘要

视觉语言模型（VLM）经常依赖语言先验，产生自信但缺乏视觉证据支持的答案。虽然这种行为被广泛观察到，但其内部机制及对基准评估的影响仍未被充分理解。在这项工作中，我们从机制和行为两个角度研究语言先验依赖。在内部，我们将反事实层替换与有监督的逐层MLP探针相结合，以追踪真实视觉语义和语言先验语义如何在语言解码器中竞争。我们的分析揭示了一个多阶段瓶颈：中间层通常无法有效检索视觉信息，而后续层可能进一步抑制存活的视觉信号，偏向文本空间偏差。在外部，我们引入了一种基于多步高斯模糊的渐进视觉退化度量，用于识别那些即使视觉内容被逐渐破坏，答案仍保持不变的实例。在十二个视觉问答基准和三个代表性VLM上，我们发现相当一部分示例在严重或完全视觉混淆下仍可回答，表明当前基准可能无意中奖励视觉忽视。这些发现表明，语言先验依赖是一种系统性的路由故障，影响模型内部和基准有效性。最后，我们概述了未来的关键研究方向，强调需要设计基于结构隔离或反事实数据的训练分布和评估协议，以强制执行真正的跨模态基础。

英文摘要

Vision-Language Models (VLMs) frequently rely on language priors, producing confident answers that are weakly grounded in visual evidence. While this behavior is widely observed, its internal mechanisms and its impact on benchmark evaluation remain insufficiently understood. In this work, we study language-prior reliance from both mechanistic and behavioral perspectives. Internally, we combine counterfactual layer replacement with supervised layer-wise MLP probing to trace how ground-truth visual semantics and language-prior semantics compete across the language decoder. Our analysis reveals a multi-stage bottleneck: intermediate layers often fail to effectively retrieve visual information, while later layers can further suppress surviving visual signals in favor of text-space biases. Externally, we introduce a progressive visual decay metric based on multi-step Gaussian blurring, which identifies instances whose answers remain invariant even as visual content is increasingly destroyed. Across twelve visual question-answering benchmarks and three representative VLMs, we find that a substantial fraction of examples remain answerable under severe or total visual obfuscation, indicating that current benchmarks can inadvertently reward visual ignorance. These findings demonstrate that language-prior reliance is a systematic routing failure affecting both model internals and benchmark validity. Finally, we outline critical pathways for future research, highlighting the necessity of designing training distributions and evaluation protocols built on structurally isolated or counterfactual data to enforce genuine cross-modal grounding.
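
The progressive visual decay idea can be illustrated with a small sketch. The abstract does not give the exact metric, so everything below is an assumption for illustration: the function names (`gaussian_blur`, `answer_invariance`), the choice of blur levels, and the `model(image, question)` callable are hypothetical, and the blur is written in pure Python only to keep the example self-contained.

```python
import math

def gaussian_kernel(sigma):
    """1-D Gaussian kernel, normalized to sum to 1."""
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(x * x) / (2 * sigma * sigma))
               for x in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def blur_rows(image, kernel):
    """Convolve each row with the kernel, clamping indices at the borders."""
    radius = len(kernel) // 2
    width = len(image[0])
    return [
        [sum(k * row[min(max(x + d, 0), width - 1)]
             for d, k in zip(range(-radius, radius + 1), kernel))
         for x in range(width)]
        for row in image
    ]

def gaussian_blur(image, sigma):
    """Separable 2-D blur: rows first, then columns via a transpose."""
    kernel = gaussian_kernel(sigma)
    rows_done = blur_rows(image, kernel)
    transposed = [list(col) for col in zip(*rows_done)]
    cols_done = blur_rows(transposed, kernel)
    return [list(col) for col in zip(*cols_done)]

def answer_invariance(model, image, question, sigmas):
    """Fraction of blur levels at which the answer matches the clean-image answer."""
    reference = model(image, question)
    same = sum(model(gaussian_blur(image, s), question) == reference
               for s in sigmas)
    return same / len(sigmas)
```

An invariance score near 1 for a question that should depend on the image suggests the answer comes from language priors rather than visual evidence; a score near 0 suggests the model is actually using the pixels.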

2606.06891 2026-06-08 cs.CV 新提交

Stream3D-VLM: Online 3D Spatial Understanding with Incremental Geometry Priors

Stream3D-VLM：基于增量几何先验的在线3D空间理解

Hanxun Yu, Xuan Qu, Lei Ke, Boqiang Zhang, Yuxin Wang, Jianke Zhu, Dong Yu

发表机构 * Zhejiang University（浙江大学）； Tencent Hunyuan（腾讯混元）； HKUST（香港科技大学）； Shenzhen Loop Area Institute（深圳河套学院）

AI总结提出在线3D视觉语言模型Stream3D-VLM，通过自回归流控制、轻量视觉-空间特征融合模块和几何自适应体素压缩，实现从流式视频中实时理解3D空间，并构建超百万在线3D问答数据集，在多项任务上超越现有模型。

Comments Project Page: https://stream3d-vlm.github.io/

AI中文摘要

尽管3D场景理解取得了进展，但现有的3D大型多模态模型在离线设置下运行，需要完整的场景观测或预定义的视频片段。在本文中，我们提出了一种在线3D视觉语言模型，能够从流式视频中实现实时空间理解。我们的方法基于LLM的下一个词预测目标，采用自回归流控制建模来学习何时响应，并使用轻量级的视觉-空间特征融合（VSFI）模块，将时间对齐的几何先验增量注入视觉流。为了减轻长上下文解码开销，我们提出了一种即插即用的几何自适应体素压缩（GAVC）模块，用于高效的视觉令牌压缩。为了解决流式3D语言数据的稀缺问题，我们进一步开发了一个可扩展的数据生成流程，策划了超过100万个在线时空3D问答对，并建立了一个涵盖29个任务的全面基准。大量实验表明，我们的方法在在线和离线3D空间理解、推理和定位任务上均显著优于专有和开源模型。项目页面见 https://stream3d-vlm.github.io/

英文摘要

Despite advances in 3D scene understanding, existing 3D Large Multimodal Models operate in offline settings, requiring complete scene observations or predefined video clips. In this paper, we present an online 3D vision-language model that enables real-time spatial understanding from streaming video. Our approach adopts autoregressive streaming-control modeling, based on the LLM's next-token prediction objective, to learn when to respond, and employs a lightweight Visual-Spatial Feature Integration (VSFI) module to incrementally inject temporally aligned geometry priors into the visual stream. To alleviate long-context decoding overhead, we propose a plug-and-play Geometry-Adaptive Voxel Compression (GAVC) module for efficient visual token compression. To address the scarcity of streaming 3D-language data, we further develop a scalable data generation pipeline that curates over 1M online spatio-temporal 3D QA pairs and establishes a comprehensive benchmark spanning 29 tasks. Extensive experiments show that our approach significantly outperforms both proprietary and open-source models across online and offline 3D spatial understanding, reasoning, and grounding tasks. The project page is available at https://stream3d-vlm.github.io/

2606.06899 2026-06-08 cs.CV cs.LG 新提交

Detecting Temporally Localized Manipulations in Real Video Streams

检测真实视频流中的时间局部操纵

Okan Umur, Ali Emre Güşlü, Ibrahim Delibasoglu

发表机构 * Okan Umur ； Ali Emre Güşlü ； Ibrahim Delibasoglu

AI总结针对真实视频中插入短时逼真操纵片段难以检测的问题，提出新数据集并评估两种方法：基于DINOv3特征的线性探针和连续帧相似性方法，建立初步基准。

AI中文摘要

视频编辑和生成式人工智能技术的快速发展使得逼真的视频操纵越来越容易实现。尽管现有数据集显著推动了深度伪造检测、对象移除和视频修复的研究，但它们未能充分模拟在真实视频中插入短时操纵片段且原始视频继续播放的场景。在本研究中，我们回顾了文献中的代表性数据集，分析了它们的特征，并讨论了它们在时间局部逼真操纵检测方面的局限性。基于此分析，我们提出了专门针对包含短时且高度逼真操纵间隔的真实视频的新数据集的需求。最后，我们在自定义策划的测试集上评估了两种互补方法，为这一具有挑战性的场景建立了初始基准。第一种方法采用基于DINOv3特征的线性探针，在三种阈值策略下进行评估。第二种方法利用DINOv3特征结合连续帧相似性方法来检测时间操纵边界。这些实验共同为部分操纵视频检测提供了初步基准，并强调了内容自适应阈值机制的必要性。数据集、代码和补充材料可在 https://github.com/OkanUmur/temporally-localized-video-manipulation-detection 公开获取。

英文摘要

The rapid advancement of video editing and generative artificial intelligence technologies has made realistic video manipulation increasingly accessible. Although existing datasets have significantly advanced research in deepfake detection, object removal, and video inpainting, they do not adequately model scenarios in which a short manipulated segment is inserted into an otherwise authentic video and the original video continues afterward. In this study, we review representative datasets from the literature, analyze their characteristics, and discuss their limitations with respect to temporally localized realistic manipulation detection. Based on this analysis, we motivate the need for a new dataset specifically designed for authentic videos containing short and highly realistic manipulated intervals. Finally, we evaluate two complementary approaches on our custom-curated test set to establish an initial benchmark for this challenging scenario. The first employs a linear probe on DINOv3 features, assessed under three thresholding strategies. The second leverages DINOv3 features with a consecutive frame similarity-based method to detect temporal manipulation boundaries. Together, these experiments provide an initial benchmark for partially manipulated video detection and highlight the need for content-adaptive thresholding mechanisms. The dataset, code, and supplementary materials are publicly available at https://github.com/OkanUmur/temporally-localized-video-manipulation-detection.
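
The consecutive-frame similarity approach can be sketched roughly as follows. This is an illustrative assumption, not the authors' code: the per-frame feature vectors stand in for DINOv3 embeddings, the helper names (`cosine`, `find_boundaries`) are hypothetical, and a single fixed threshold is used even though the abstract argues for content-adaptive thresholds.

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def find_boundaries(features, threshold):
    """Return indices i where frame i differs sharply from frame i - 1."""
    return [i for i in range(1, len(features))
            if cosine(features[i - 1], features[i]) < threshold]
```

A short inserted segment shows up as a pair of boundaries: one where the manipulated frames begin and one where the original video resumes. Ordinary scene cuts would trigger the same test, which is one reason a fixed threshold is unlikely to be sufficient.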

2606.07100 2026-06-08 cs.CV cs.RO 新提交

LARA: Latent Action Representation Alignment for Vision-Language-Action Models

LARA: 视觉-语言-动作模型的潜在动作表示对齐

Mengya Liu, Baoxiong Jia, Jiangyong Huang, Jingze Zhang, Siyuan Huang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出LARA框架，通过表示对齐联合优化潜在动作模型和视觉-语言-动作模型，利用人类视频数据提升机器人操作性能，在模拟和真实基准上平均提升约10%、5%和15%。

AI中文摘要

视觉-语言动作（VLA）模型使机器人能够直接从观测和语言指令预测动作，但其性能依赖于大规模、高质量数据，并受到真实机器人动作数据集稀缺的限制。为了利用丰富的未标记人类视频促进VLA模型学习，潜在动作模型（LAM）从视觉动态中学习潜在动作表示，为VLA学习提供额外监督。然而，LAM和VLA通常分开训练，导致LAM在VLA训练期间未接地，且VLA模型受冻结的LAM表示约束。为解决这些问题，我们提出潜在动作表示对齐（LARA），一种即插即用框架，通过表示对齐联合优化LAM和VLA。这使得LAM能够利用动作轨迹学习以避免虚假视觉变化，同时VLA通过LAM中学习的前向动力学进行正则化，减少功能无效轨迹的幻觉。我们展示了LARA在预训练、预训练VLA模型的后训练增强以及LAM细化中的多功能性和有效性，在3个模拟和1个精心设计的真实机器人操作基准上平均提升约10%、约5%和约15%。

英文摘要

Vision-language-action (VLA) models enable robots to predict actions directly from observations and language instructions, but their performance depends on large-scale, high-quality data and is limited by the scarcity of real-world robot action datasets. To facilitate VLA model learning with abundant unlabeled human videos, Latent Action Models (LAMs) learn latent action representations from visual dynamics to provide additional supervision for VLA learning. However, LAMs and VLAs are typically trained separately, leaving the LAM ungrounded during VLA training and the VLA constrained by frozen LAM representations. To address these issues, we propose Latent Action Representation Alignment (LARA), a plug-and-play framework that jointly optimizes the LAM and the VLA via representation alignment. This enables reciprocal benefits: LAMs learn with action trajectories to avoid spurious visual changes, while VLAs are regularized by forward dynamics learned within LAMs to reduce hallucinations of functionally ineffective trajectories. We demonstrate LARA's versatility and effectiveness for pre-training, post-training enhancement of pre-trained VLA models, and LAM refinement, achieving average improvements of ~10%, ~5%, and ~15%, respectively, across three simulation benchmarks and one meticulously designed real-world robotic manipulation benchmark.

2606.07115 2026-06-08 cs.CV cs.GR 新提交

3DMorph: Single-Image-Guided Local 3D Shape Editing and Morphing

3DMorph: 单图引导的局部3D形状编辑与变形

Tobias Preintner, Yunfei Deng, Phillip Müller, Sebastian Illing, Adrian König, Thomas Bäck, Elena Raponi, Niki van Stein

发表机构 * ETH Zürich（苏黎世联邦理工学院）

AI总结提出无训练框架3DMorph，通过单张编辑图像自动定位并转移2D修改到3D局部区域，同时支持中间形状生成，在Delta3D基准上优于现有方法。

Comments Accepted to IJCNN 2026

AI中文摘要

尽管3D生成领域近期取得了进展，但对现有形状的直观编辑仍然有限。与受益于成熟修复工具的图像不同，网格等通用3D对象仍缺乏简单有效的局部形状编辑方法。现有方法通常是全局的、领域特定的、需要复杂的用户交互，或侧重于外观（颜色和纹理）而非几何。我们提出了3DMorph，一个无需训练的框架，用于单图引导的局部3D形状编辑和变形。给定一张显示所需形状修改的编辑图像，我们的方法自动定位相关的3D区域，并将2D修改转移到3D，同时保留未修改的区域。3DMorph还能在原始对象和编辑对象之间生成中间形状，促进设计探索。为了基准测试编辑质量，我们引入了Delta3D，一个带有配对真实编辑的图像引导局部3D编辑基准。实验结果表明，3DMorph将直观的2D编辑转化为3D，优于最先进的生成和编辑方法。

英文摘要

Despite recent progress in 3D generation, intuitive editing of existing shapes remains limited. Unlike images, which benefit from well-established inpainting tools, general 3D objects such as meshes still lack simple and effective methods for local shape editing. Existing approaches are often global, domain-specific, require complex user interaction, or focus on appearance (color and texture) rather than geometry. We introduce 3DMorph, a training-free framework for single-image-guided local 3D shape editing and morphing. Given an edited image showing a desired shape modification, our method automatically localizes the relevant 3D region and transfers 2D modifications to 3D while preserving unmodified areas. 3DMorph also enables intermediate shape generation between the original and edited objects, facilitating design exploration. To benchmark editing quality, we introduce Delta3D, an image-guided local 3D editing benchmark with paired ground-truth edits. Experimental results show that 3DMorph translates intuitive 2D edits into 3D, outperforming state-of-the-art generative and editing methods.

2606.07117 2026-06-08 cs.CV cs.AI 新提交

When Recovery Matters: The Blind Spot of Surrogate Privacy in MLLM Editing

当恢复至关重要时：MLLM编辑中替代隐私的盲点

Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang, Sam Kwong

发表机构 * City University of Hong Kong（香港城市大学）； Hon Hai Research Institute（鸿海研究院）； Lingnan University（岭南大学）

AI总结针对多模态大模型编辑中的隐私风险，提出首个面向恢复的替代隐私保护编辑基准SPPE，涵盖36个细粒度隐私类别和65个编辑指令，并设计可编辑性评估与替代到源编辑恢复两个任务及对应方法。

AI中文摘要

多模态大语言模型（MLLM）支持灵活的指令驱动图像编辑，但当用户图像暴露多样且用户特定的私有内容时，会产生隐私风险。典型的隐私保护策略通常在云端编辑前用替代内容替换敏感区域。然而，结果输出往往是编辑后的替代图像而非期望的编辑后源图像，在设计和评估范围中都忽略了局部恢复。为此，我们引入SPPE（基于替代的隐私保护编辑），这是首个面向恢复的基准，涵盖36个细粒度隐私类别和65个编辑指令。它定义了两个互补任务：1）可编辑性评估，在云端交互前估计替代图像是否能产生与原始图像一致的编辑；2）替代到源编辑恢复，评估编辑后的替代图像是否能转移回私有源图像并保留编辑效果。我们为每个任务提出了专用方法：ERMA通过指令感知的多模态关系建模预测替代可编辑性，而C2E-S2SER通过使用替代编辑对作为视觉编辑证据和源图像作为源保留锚点来执行循环一致性恢复。在SPPE和InstructPix2Pix上的实验表明，两个任务均有一致改进。对于可编辑性评估，ERMA在SRCC上比最佳基线提升13.9%，在PLCC上提升12.3%。对于替代到源编辑恢复，C2E-S2SER在SPPE的所有8个源完整性和编辑一致性指标上优于SOER。

英文摘要

Multimodal Large Language Models (MLLMs) enable flexible instruction-driven image editing, but privacy risks arise when user images expose diverse and user-specific private content. Canonical privacy protection strategies typically substitute sensitive regions with surrogate content before cloud editing. Yet the resulting output is often an edited surrogate rather than the desired edited source image, so local recovery is neglected in both design and evaluation. To this end, we introduce SPPE (Surrogate-based Privacy-Preserving Editing), the first recovery-oriented benchmark covering 36 fine-grained privacy categories and 65 editing instructions. It defines two complementary tasks: 1) editability assessment, which estimates before cloud interaction whether a surrogate can induce an edit consistent with the original image; and 2) surrogate-to-source edit recovery, which evaluates whether the edited surrogate can be transferred back to the private source with the edit effect preserved. We address each task with a dedicated method: ERMA predicts surrogate editability through instruction-aware multimodal relation modeling, while C2E-S2SER performs cycle-consistent recovery by using the surrogate editing pair as visual edit evidence and the source image as a source-preserving anchor. Experiments on SPPE and InstructPix2Pix show consistent improvements on both tasks. For editability assessment, ERMA improves over the best-performing baselines by 13.9% in SRCC and 12.3% in PLCC. For surrogate-to-source edit recovery, C2E-S2SER outperforms SOER across all 8 source integrity and edit consistency metrics on SPPE.
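
SRCC and PLCC are the Spearman rank and Pearson linear correlation coefficients between predicted and reference scores. As a reminder of what the two numbers measure, a minimal pure-Python sketch is given below; the function names are illustrative, and the benchmark's own evaluation scripts may differ (for example in how ties are handled).

```python
import math

def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Average 1-based ranks, so tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC): Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))
```

SRCC rewards getting the ordering of items right regardless of scale, while PLCC also penalizes nonlinear distortions of the predicted scores, which is why the two are usually reported together.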

2606.07172 2026-06-08 cs.CV cs.AI cs.CL cs.LG 新提交

Textual Supervision Enhances Geospatial Representations in Vision-Language Models

文本监督增强视觉-语言模型中的地理空间表示

Marcelo Sartori Locatelli, Fernando Tonucci, Jea Kwon, Luiz Felipe Vecchietti, Bryan Nathanael Wijaya, Cheng Yaw Low, Virgilio Almeida, Meeyoung Cha

发表机构 * University of São Paulo（圣保罗大学）； National University of Singapore（新加坡国立大学）

AI总结研究视觉、视觉-语言及多模态模型的地理空间表示能力，发现文本监督能有效提升空间编码，推动地理空间AI发展。

Comments Accepted at ICML 2026

2606.07175 2026-06-08 cs.CV 新提交

Seeing Without Exposing: Adaptive Privacy Control for Open-World, Context-Hungry MLLMs

看见而不暴露：面向开放世界、上下文饥渴型MLLM的自适应隐私控制

Siyuan Xu, Yibing Liu, Peilin Chen, Yung-Hui Li, Shiqi Wang, Sam Kwong

发表机构 * City University of Hong Kong（香港城市大学）； Hon Hai Research Institute（鸿海研究院）； Lingnan University（岭南大学）

AI总结针对多模态大语言模型在开放世界中面临不可预测敏感信息泄露的隐私挑战，提出无训练方法APD，将隐私元素漂移至语义等价替代物并锚定上下文线索，结合新基准AdaptShield实现隐私保护与上下文保留的平衡提升。

AI中文摘要

多模态大语言模型（MLLM）引发了新的隐私挑战。在数据方面，用户提供的输入通常包含不可预测的敏感信息；而在下游任务方面，模型推理依赖于丰富的视觉上下文，这些上下文本身可能涉及隐私敏感信息。然而，现有的隐私保护方法依赖于预定义的敏感类别和固定的混淆策略，难以应对MLLM中的此类挑战。为解决这一困境，我们提出了锚定隐私漂移（APD），一种无需训练的方法，它将隐私敏感元素漂移到语义等价的替代物，同时将上下文线索锚定到源图像。为了系统评估这种隐私保护和上下文保留的双重目标，我们引入了AdaptShield，一个涵盖22个隐私类别的综合基准，它将传统隐私度量与基于MLLM的上下文效用评估相结合。大量实验表明，我们的方法在隐私净化和内容保留方面实现了平衡改进，在四个MLLM系列（即Qwen2.5、Qwen3、InternVL3和InternVL3.5）上，文本类别的平均增益为10.4%，基于MLLM的评估平均增益为8.5%。

英文摘要

Multimodal large language models (MLLMs) have raised new privacy challenges. On the data side, user-provided inputs often include unpredictable sensitive information; while on the downstream task side, model reasoning depends on rich visual context that may itself be privacy-sensitive. Existing privacy protection methods, however, rely on predefined sensitive categories and fixed obfuscation strategies, struggling to tackle such challenges in MLLMs. To address this dilemma, we propose Anchored Privacy Drifting (APD), a training-free method that drifts privacy-sensitive elements toward semantically equivalent alternatives while anchoring contextual cues to the source image. To systematically evaluate this dual objective of privacy protection and contextual preservation, we introduce AdaptShield, a comprehensive benchmark covering 22 privacy categories, which combines conventional privacy metrics with MLLM-based assessments of contextual utility. Extensive experiments show that our method achieves balanced improvements in both privacy sanitization and content retention, with average gains of 10.4% on textual categories and 8.5% under MLLM-based evaluation across four MLLM series, i.e., Qwen2.5, Qwen3, InternVL3, and InternVL3.5.

2606.07179 2026-06-08 cs.CV cs.MM eess.IV 新提交

EvoGS: Constructing Continuous-Layered Gaussian Splatting with Evolution Tree for Scalable 3D Streaming

EvoGS：基于进化树构建连续分层高斯泼溅以实现可扩展3D流式传输

Yuang Shi, Simone Gasparini, Géraldine Morin, Wei Tsang Ooi

发表机构 * National University of Singapore（新加坡国立大学）； IRIT - University of Toulouse（图卢兹大学IRIT实验室）； IPAL, IRL2955（IPAL研究所）

AI总结提出EvoGS，首个连续分层高斯泼溅表示，通过进化树结构实现父-子细化，消除冗余并支持可扩展3D流式传输，传输负载和显存占用分别降低2.4倍和5.5倍。

Comments Project page: https://yuang-ian.github.io/evogs/

AI中文摘要

流式传输3D高斯泼溅需要高度可扩展的渐进式表示。现有渐进式方法依赖离散分层，为每个细节层次累积独立的泼溅集。层间的结构独立性固有地导致误差累积、严重的泼溅冗余以及不受控的质量过渡。我们提出EvoGS，首个连续分层表示。EvoGS组织为进化树，通过显式的、受小波启发的父-子细化生成更精细的细节。这使得子节点能够结构性地纠正祖先误差，产生固有稀疏且高度可压缩的层间信号。大量实验表明，EvoGS将泼溅冗余从超过65%降至低于25%。与最先进的基线相比，它分别将传输负载和GPU显存占用降低高达2.4倍和5.5倍，并实现了适用于实时自适应流式传输的平滑质量过渡。项目页面：https://yuang-ian.github.io/evogs/

英文摘要

Streaming 3D Gaussian Splatting requires highly scalable, progressive representations. Existing progressive methods rely on discrete layering, accumulating separate splat sets for each level of detail. This structural independence between layers inherently leads to error accumulation, severe splat redundancy, and uncontrolled quality transitions. We propose EvoGS, the first continuous-layering representation. Organized as an Evolution Tree, EvoGS generates finer details via an explicit, wavelet-inspired parent-child refinement. This empowers child nodes to structurally correct ancestral errors, yielding inherently sparse and highly compressible inter-layer signals. Extensive experiments show EvoGS reduces splat redundancy from over 65% to under 25%. Compared to state-of-the-art baselines, it reduces transmission payload and GPU VRAM footprint by up to 2.4× and 5.5×, respectively, and achieves smooth quality transitions well suited to real-time adaptive streaming. Project page: https://yuang-ian.github.io/evogs/

2606.07180 2026-06-08 cs.CV cs.LG 新提交

Reconstructing Multi-Decadal Forest Disturbance: A Spatio-Temporal Transformer Approach

重建多年代森林干扰：一种时空Transformer方法

Linus Scheibenreif, Anton Raichuk, Maxim Neumann

发表机构 * Google DeepMind（谷歌DeepMind）

AI总结提出时空Transformer框架，同时建模时间轨迹和空间邻域，利用Landsat、Sentinel-1/2数据重建美国1984-2022年森林干扰图，在手动标注验证集上达到高精度并减少空间伪影。

AI中文摘要

准确监测森林干扰对于理解碳动态和土地管理至关重要，但传统方法通常依赖卫星时间序列的逐像素分析，忽略了空间上下文。我们提出了一种深度学习框架，通过同时建模时间轨迹和空间邻域，绘制了美国本土38年（1984-2022）的森林干扰图。通过利用视觉Transformer架构，我们的方法有效过滤了弱监督信号中的噪声，生成了空间连贯的干扰图。我们在多个卫星（Landsat、Sentinel-1、Sentinel-2）和时间窗口（38年及最近6年）上进行了详尽评估，并使用新的人工标注验证数据集（n=300）和独立火周界数据集（n=706）验证了性能。结果凸显了任务的复杂性：我们的时空模型表现出高精度（在MTBS上±1年检测精度高达98.2%，在CONUS验证数据集上高达71.3%，F1分数分别高达75.8%和47.3%），并有效减少了空间伪影，但与逐像素基线相比，在不同干扰类型上存在性能权衡。我们的方法为一致的森林监测提供了有前景的基础。

英文摘要

Accurate monitoring of forest disturbances is essential for understanding carbon dynamics and land management, yet traditional approaches typically rely on pixel-wise analysis of satellite time-series, ignoring spatial context. We present a deep learning framework that maps 38 years (1984-2022) of forest disturbance across the contiguous United States by modeling temporal trajectories and spatial neighborhoods simultaneously. By leveraging a vision transformer architecture, our approach effectively filters noise from weak supervision signals to produce spatially coherent disturbance maps. We perform exhaustive evaluations across multiple satellites (Landsat, Sentinel-1, Sentinel-2) and temporal windows (38 years and the more recent 6 years), validating performance against a novel, manually annotated validation dataset (n=300) and an independent fire perimeter dataset (n=706). The results highlight the complexity of the task: while our spatio-temporal model demonstrates high precision (up to 98.2% for ±1 year detection on MTBS and up to 71.3% on the CONUS validation datasets, with F1-scores up to 75.8% and 47.3%, respectively) and effectively reduces spatial artifacts, it exhibits performance trade-offs across different disturbance regimes compared to pixel-wise baselines. Our method offers a promising foundation for consistent forest monitoring.
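
To make the "±1 year detection" figures concrete, the sketch below shows one plausible way to score year-of-disturbance predictions with a tolerance window. The matching rule and the function name `detection_scores` are assumptions for illustration; the abstract does not specify the exact evaluation protocol.

```python
def detection_scores(predicted, reference, tolerance=1):
    """Precision, recall and F1 for per-sample disturbance years.

    Each entry is a year, or None for 'no disturbance'. A prediction counts
    as a true positive only if both years exist and differ by at most
    `tolerance` years (an assumed matching rule).
    """
    tp = fp = fn = 0
    for pred, ref in zip(predicted, reference):
        hit = pred is not None and ref is not None and abs(pred - ref) <= tolerance
        if hit:
            tp += 1
        else:
            if pred is not None:
                fp += 1  # predicted a disturbance that is absent or mistimed
            if ref is not None:
                fn += 1  # missed (or mistimed) a real disturbance
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```

Because F1 is the harmonic mean of precision and recall, an F1-score well below the precision implies a lower recall: the model is rarely wrong when it flags a disturbance but misses a share of real events.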

2606.07280 2026-06-08 cs.CV 新提交

Geometric-Aware Hypergraph Reasoning for Novel Class Discovery in Point Cloud Segmentation

几何感知超图推理用于点云分割中的新类别发现

Zihao Zhang, Aming Wu, Yang Li, Yahong Han, Jialie Shen

发表机构 * School of Artificial Intelligence, College of Intelligence and Computing, Tianjin University（天津大学智能与计算学部人工智能学院）； School of Computer Science and Information Engineering, Hefei University of Technology（合肥工业大学计算机与信息学院）； Department of Computer Science, City St George’s, University of London（伦敦大学城市圣乔治学院计算机科学系）

AI总结提出超图框架建模高阶关联，结合几何感知原型，实现点云分割中从已知到新类别的协同推理，提升分割精度。

Comments Accepted to the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026

AI中文摘要

点云分割中的新类别发现旨在从已知类别转移知识，自动识别和分割点云中未标注的新类别。现有方法主要依赖成对关联进行类别分配和新类别推理，这限制了其捕捉已知和新类别间复杂关系的能力，可能导致语义分割不准确。为解决此问题，我们引入基于超图的框架，建模类别间的高阶关联，并实现从已知类别到新类别的协同推理，超越传统的成对关系。此外，现有方法倾向于关注语义特征提取，而对点云中的几何信息关注不足。为了更好地利用空间结构，我们提出几何感知原型以增强类别级几何线索的表示。通过超边传播几何信息，所提方法改进了对类别间空间分布的理解，从而实现更准确的分割。在SemanticKITTI和SemanticPOSS数据集上的实验证明了我们方法的有效性和优越性。

英文摘要

Novel class discovery in point cloud segmentation aims to transfer knowledge from known classes to automatically identify and segment unlabeled novel classes in point clouds. Existing methods mainly rely on pairwise associations for class assignment and novel class reasoning, which limits their ability to capture complex relationships among known and novel classes and may lead to inaccurate semantic segmentation. To address this issue, we introduce a hypergraph-based framework that models high-order associations among classes and enables collaborative reasoning from known classes to novel classes beyond traditional pairwise relations. Moreover, existing methods tend to focus on semantic feature extraction while paying insufficient attention to geometric information in point clouds. To better exploit spatial structure, we propose Geometric-Aware Prototypes to enhance the representation of class-level geometric cues. By propagating geometric information through hyperedges, the proposed method improves the understanding of spatial distributions across classes and leads to more accurate segmentation. Experiments on the SemanticKITTI and SemanticPOSS datasets demonstrate the effectiveness and superiority of our method.

2606.07288 2026-06-08 cs.CV cs.GR 新提交

ExMesh: EXplicit Mesh Reconstruction with Topology Adaptation

ExMesh: 具有拓扑自适应的显式网格重建

Chuanjin Fan, Lifan Wu, Wenjie Chang, Hanzhi Chang, Wenfei Yang, Tianzhu Zhang

发表机构 * University of Science and Technology of China（中国科学技术大学）； National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory（深空探测全国重点实验室，深空探测实验室）

AI总结提出ExMesh框架，通过可微优化与离散拓扑更新直接优化显式网格，引入自适应顶点分裂合并和实时UV维护，实现从粗到细的优化，兼顾精度、效率和网格简洁性。

Comments Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026 (CVPR 2026)

AI中文摘要

从多视图图像重建表面网格近年来一直是核心挑战。大多数现有方法，无论是隐式还是显式，都依赖于中间表示和后处理步骤（如Marching Cubes或TSDF融合），常常导致伪影和碎片化几何。直接优化显式网格是一种有前景的方法，但它面临两个关键挑战：一是如何自适应细化网格拓扑以捕捉细节而不引入退化面；二是在网格结构演变时如何保持一致的UV坐标以实现高保真纹理映射。为克服这些，我们提出ExMesh，一种新颖的框架，通过将可微优化与离散拓扑更新相结合，直接优化显式网格。具体而言，我们引入自适应顶点分裂合并策略以及实时UV维护，实现从粗到细的优化，同时保持几何完整性。据我们所知，ExMesh是第一个将离散拓扑操作无缝集成到连续可微优化流程中的框架。大量实验表明，ExMesh在精度、计算效率和网格简洁性之间取得了平衡。

英文摘要

Reconstructing surface meshes from multi-view images has remained a core challenge in recent years. Most existing methods, whether implicit or explicit, depend on intermediate representations and post-processing steps like Marching Cubes or TSDF fusion, often resulting in artifacts and fragmented geometry. Directly optimizing explicit meshes is a promising approach. However, it presents two critical challenges. The first is how to adaptively refine mesh topology to capture detail without introducing degenerate faces. The second is how to maintain consistent UV coordinates for high-fidelity texturing as the mesh structure evolves. To overcome these, we propose ExMesh, a novel framework that directly optimizes explicit meshes by integrating differentiable optimization with discrete topology updates. Specifically, we introduce an adaptive vertex splitting and merging strategy, along with real-time UV maintenance, to enable coarse-to-fine optimization while preserving geometric integrity. To our knowledge, ExMesh is the first framework to seamlessly integrate discrete topology operations into a continuous differentiable optimization pipeline. Extensive experiments demonstrate that ExMesh achieves a balance among accuracy, computational efficiency, and mesh conciseness.

2606.07311 2026-06-08 cs.CV cs.AI 新提交

CULTURESCORE: Evaluating Cultural Faithfulness in Video Generation Models

CULTURESCORE: 评估视频生成模型的文化忠实度

Anku Rani, Wei Dai, Shravan Nayak, Pattie Maes, Mahdi M. Kalayeh, Paul Pu Liang

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； Mila – Quebec AI Institute（魁北克人工智能研究所）； Netflix（网飞）

AI总结提出CultureScore框架，从身份、背景和行为三个维度评估视频生成的文化忠实度，实验发现当前最佳模型得分仅56.8%，行为维度最困难。

AI中文摘要

随着Veo 3.1和LTX-2等视频生成模型的进步，它们准确表现多元全球文化的能力仍是一个关键但研究不足的前沿。当前的指标如VideoScore仅衡量视觉质量，无法评估文化忠实度。因此，一个将合十礼替换为握手的模型与正确生成该手势的模型获得相同的分数。我们提出CultureScore，一个将文化忠实度分解为三个细粒度维度的组合评估框架：身份（谁被代表）、背景（文化本地化背景）和行为（规范性手势和互动）。我们通过一个覆盖10个国家的评估套件来实施该框架，在三个最先进的模型上生成了6,180个视频。我们的评估显示，当前没有模型能够实现文化忠实的视频生成：表现最好的模型整体CultureScore仅为56.8%，其中行为是最具挑战性的维度，所有模型在该维度上均低于52%。此外，人类偏好排序与CultureScore方向一致，但与VideoScore相反；在视觉质量上得分最高的模型被标注者排在最后，这强调了文化忠实度是公平视频生成的基本标准。

英文摘要

As video generation models like Veo 3.1 and LTX-2 advance, their ability to accurately represent diverse global cultures remains a critical yet understudied frontier. Current metrics, such as VideoScore, only measure visual quality but offer no mechanism for assessing cultural faithfulness. Consequently, a model that replaces a Namaste with a handshake receives the same score as one that generates the gesture correctly. We propose CultureScore, a compositional evaluation framework that decomposes cultural faithfulness into three granular dimensions: Identity (who is represented), Context (culturally localized background), and Behavior (normative gestures and interactions). We operationalize this framework through an evaluation suite spanning 10 countries, yielding 6,180 generated videos across three state-of-the-art models. Our evaluation reveals that no current model achieves culturally faithful video generation: the best-performing model reaches only 56.8% overall CultureScore, with Behavior the most challenging dimension, which remains below 52% across all models. Furthermore, human preference rankings align directionally with CultureScore but are inverted relative to VideoScore; the highest-scoring model on visual quality was ranked last by annotators, underscoring that cultural faithfulness is an essential criterion for equitable video generation.

2606.07326 2026-06-08 cs.CV 新提交

AnchorWorld: Embodied Egocentric World Simulation with View-based Evolution Customization

AnchorWorld: 基于视图演化定制的具身自我中心世界模拟

Yu Li, Menghan Xia, Gongye Liu, Xintao Wang, Conglang Zhang, Lei Ke, Yuxuan Lin, Ruihang Chu, Pengfei Wan, Kun Gai, Yujiu Yang

发表机构 * Tsinghua University（清华大学）； HUST（华中科技大学）； Kling Team, Kuaishou Technology（快手科技 Kling 团队）； HKUST（香港科技大学）； WHU（武汉大学）

AI总结提出AnchorWorld框架，利用3D人体运动和外源视角辅助训练增强交互完整性，并通过锚点视图和文本描述实现自我演化世界的灵活定制，显著优于现有方法。

AI中文摘要

尽管交互式世界建模是一个关键前沿，但在实际场景所需的多样化可控性方面仍未被充分探索。为弥补这一差距，我们提出AnchorWorld，一个通过增强交互完整性和灵活的世界定制机制来推进自我中心模拟的框架。首先，我们利用3D人体运动作为主要交互模态。为了补充自我中心视角中不可见或被截断的身体部位，我们引入了一种辅助训练监督，该监督包含了与智能体第一人称感知解耦的外源视角。这使得模型能够观察智能体相对于环境的全身定位，从而促进人-世界交互更稳健的空间基础。此外，我们提出了一种简单而有效的机制来定制自我演化的世界。这是通过在统一的世界坐标系内定义锚点视图，并结合描述局部场景动态演化的文本描述来实现的。实验结果表明，AnchorWorld显著优于最先进的基线方法，而消融研究验证了我们关键设计的有效性。值得注意的是，我们的定制方案展现出有希望的时空几何一致性，并严格遵守规定的演化动力学。

英文摘要

Despite being a pivotal frontier, interactive world modeling remains underexplored in terms of the versatile controllability required by practical scenarios. To bridge this gap, we present AnchorWorld, a framework that advances egocentric simulation through enhanced interaction integrity and a flexible mechanism for world customization. First, we utilize 3D human motion as the primary interaction modality. To complement the out-of-view or truncated body parts in egocentric views, we introduce an auxiliary training supervision that incorporates exogenous viewpoints decoupled from the agent's first-person sensorium. It allows the model to observe the agent's full-body positioning relative to the environment, facilitating a more robust spatial grounding of human-world interactions. Furthermore, we propose a simple yet effective mechanism for customizing self-evolving worlds. This is achieved by defining anchor views within a unified world coordinate system, coupled with textual descriptions dictating the dynamic evolution of local scenes. Experimental results show that AnchorWorld significantly outperforms state-of-the-art baselines, while ablation studies validate the effectiveness of our key designs. Notably, our customization scheme exhibits promising spatio-temporal geometric consistency and adheres strictly to the prescribed evolutionary dynamics.

URL PDF HTML ☆

赞 0 踩 0

2606.07333 2026-06-08 cs.CV 新提交

Varifold Moment Invariants for Sustainable and Explainable Contour Feature Extraction

Varifold矩不变量：可持续且可解释的轮廓特征提取

G. Longari, J. -C. Alvarez Paiva, A. B. Tumpach

发表机构 * Computer Vision Lab, Technische Universität Wien, Karlsplatz 13, 1040 Vienna, Austria; Wolfgang Pauli Institut, Oskar-Morgensternplatz 1, 1090 Vienna, Austria ； U.M.R. CNRS 8524, U.F.R. de Mathématiques, 59655 Villeneuve d'Ascq Cédex, France ； Laboratoire Painlevé, Lille University, 59650 Villeneuve d’Ascq, France; Wolfgang Pauli Institut, Oskar-Morgensternplatz 1, 1090 Vienna, Austria

AI总结提出Varifold矩不变量（VMI）统一框架，结合区域、边界和切线几何生成高判别力几何特征，配合轻量分类器在降低计算成本的同时超越现有轮廓方法。

Comments 29 pages, 12 figures

AI中文摘要

我们引入Varifold矩不变量（VMI）作为许多先前提出的矩不变量的统一框架。这些不变量与其他在平移和旋转下不变的轮廓特征（如扩展高斯图像、椭圆傅里叶描述符或形状分布）密切相关。Varifold矩方法的优势在于能够结合区域的几何、其边界以及与之相切的直线族，从而创建大量具有高判别力和清晰几何意义的不变特征。通过将我们的VMI特征提取与轻量特征分类器随机森林或多层感知器相结合，我们在基于轮廓的方法中超越了现有技术水平，同时大幅降低了计算成本，使我们的算法能够在轻量设备上运行。我们在大量广泛使用的不同类型数据集（叶子、物体、细胞）上测试了我们的分类任务，并以少量几何可解释的特征实现了高精度。

英文摘要

We introduce Varifold Moment Invariants (VMI) as a unifying framework for many previously introduced moment invariants. These invariants are closely related to other contour features that are invariant under translations and rotations, such as the Extended Gaussian Image, Elliptic Fourier Descriptors, or Shape Distributions. The advantage of the varifold approach to moments is that it combines the geometry of the region, its boundary, and the family of lines tangent to it, creating a substantial number of invariant features with high discriminating power and clear geometric meaning. By coupling our VMI feature extraction with lightweight classifiers (Random Forest or Multi-Layer Perceptron), we outperform state-of-the-art contour-based approaches while drastically decreasing the computational cost, to the point of allowing our algorithm to run on lightweight devices. We tested our approach on classification tasks on a large number of widely used datasets of various types (leaves, objects, cells) and achieved high accuracy with a small number of geometrically interpretable features.
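
As a point of reference for what "invariant under translations and rotations" means in practice, the sketch below computes the simplest classical moment invariant of a point set, the sum of the second-order central moments. It is not the varifold construction from the paper, whose definition is not given in the abstract; the function names and the sample contour are illustrative.

```python
import math

def first_moment_invariant(points):
    """mu20 + mu02 of a 2-D point set: the mean squared distance to the centroid."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n

def rigid_motion(points, angle, dx, dy):
    """Rotate every point by `angle` radians about the origin, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + dx, s * x + c * y + dy) for x, y in points]
```

Centering on the centroid removes the effect of translation, and the squared distance to the centroid is unchanged by rotation, so the feature is the same for a contour and any rigidly moved copy of it. Richer invariant families add further such quantities so that different shapes become separable.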

2606.07338 2026-06-08 cs.CV 新提交

VeriDrive: Verifiable Counterfactual Supervision for Cost-Efficient Vision-Language Planning

VeriDrive: 可验证的反事实监督用于成本高效的视觉-语言规划

Zikai Zhang, Hubert P. H. Shum, Toby P. Breckon

发表机构 * Department of Computer Science, Durham University（杜伦大学计算机科学系）

AI总结提出VeriDrive框架，通过结构化感知-评估-修正链生成可验证的反事实监督，降低视觉-语言驾驶规划的数据构建成本，并在nuScenes数据集上验证其有效性。

Abstract

Vision-language driving models increasingly use reasoning supervision to bridge perception, prediction, and planning, but existing driving rationales are often free-form and expensive to generate with frontier models. We present VeriDrive, a framework for constructing planning-oriented, verifiable counterfactual supervision. VeriDrive converts driving reasoning into a structured Perception-Evaluation-Revision chain that grounds key objects in future motion, evaluates alternative ego trajectories with rule-checkable evidence, revises risky intent toward expert behavior, and produces final planning targets. To scale data construction, VeriDrive combines local generation with validator-guided selective correction, escalating only invalid or difficult samples. We build the VeriDrive dataset on nuScenes and train under the Omni-Q protocol. Controlled open-loop experiments show that VeriDrive improves L2, Collision, and Intersection over OmniDrive while reducing logged token usage, generation time, and actual paid LLM/VLM cost. These results show that auditable intermediate fields and structured revision targets can improve vision-language planning supervision under realistic annotation budgets. Code, prompts, and validator scripts are coming soon and will be released after the review process.


2606.07355 2026-06-08 cs.CV New submission

Spatial-Temporal Decoupled Adapter for Micro-gesture Online Recognition

Xucheng Shen, Kun Li, Fei Wang, Wei Qian, Jin Jiang, Dan Guo

Affiliations: Hefei University of Technology; United Arab Emirates University; Institute of Artificial Intelligence, Hefei Comprehensive National Science Center; Anhui Evolution Technology Co., Ltd.

AI summary: Proposes a Spatial-Temporal Decoupled Adapter that uses lightweight depthwise convolutions to split video adaptation into independent temporal and spatial branches, and adds Adaptive Soft Balanced Augmentation to mitigate the long-tail distribution; ranked 1st in Track 2 of the EI-MiGA challenge.

Comments Technical Report. 1st Place in Micro-gesture Online Recognition in 4th MiGA at IJCAI 2026

Abstract

Micro-gesture online recognition aims to temporally localize and classify subtle gestures in untrimmed videos. Owing to their extremely short duration, low motion amplitude, and ambiguous visual cues, capturing discriminative spatiotemporal representations remains highly challenging. Existing parameter-efficient adapters typically employ a single branch to model spatial and temporal cues jointly, which may fail to capture the fine-grained patterns of micro-gestures. To address this limitation, we propose a Spatial-Temporal Decoupled Adapter that decomposes video adaptation into independent temporal and spatial branches via lightweight depthwise convolutions. In addition, to address the long-tail distribution problem in the benchmark dataset, we introduce Adaptive Soft Balanced Augmentation, which dynamically allocates augmentation intensity based on class rarity and learning difficulty, without manual thresholds. Our method achieves an F1 score of 0.43808, ranking 1st in Track 2 of the 4th EI-MiGA-IJCAI Challenge.


2606.07366 2026-06-08 cs.CV cs.LG cs.RO New submission

Dash2Sim: Closed-Loop Driving Simulation from in-the-wild Dashcam Videos

Anurag Ghosh, Francesco Pittaluga, Khiem Vuong, Angela Chen, Juan Alvarez-Padilla, Manmohan Chandraker, Srinivasa Narasimhan

Affiliations: Carnegie Mellon University; NEC Labs America; MIT; UC San Diego

AI summary: Proposes Dash2Sim, a framework that converts monocular dashcam videos into metric, geo-referenced 4D driving logs for closed-loop simulation, builds the ROADWork4D benchmark dataset, and shows that work zone scenarios are challenging for planners.

Abstract

Self-driving simulations typically rely on data collected in a small number of cities or on hand-authored synthetic scenarios. Dashcam videos cover a far broader range of locations and situations, including rare or long-tailed scenarios. They are considered less usable for simulation because it is difficult to recover accurate 4D scenes from monocular in-the-wild videos. Work zones are one such class of long-tailed situations that dashcams capture. We present Dash2Sim, a framework that turns in-the-wild monocular dashcam videos into metric, geo-referenced 4D driving logs compatible with existing simulators, and verifies each one against an independently maintained map without annotations. We apply Dash2Sim to a large video corpus to create the ROADWork4D benchmark dataset, which spans 4,244 scenes with 2.7M 3D objects across 17 cities. On a verified subset ROADWork4D-CL (2,201 scenes), we study privileged closed-loop planners and find that work zone scenarios are difficult: while rule-based and hybrid planners generalize better than learning-based ones, all fall short, failing to make the lane changes that temporary work zone channels require. Beyond planning, dense depth recovered by Dash2Sim improves novel-view synthesis quality by up to 19% on perceptual metrics, suggesting its potential to provide rich conditioning for closed-loop sensor simulation from monocular videos.


2606.07368 2026-06-08 cs.CV cs.AI New submission

Mitosis Detection in the Wild: Multi-Tumor and Context-Aware Generalization in the MIDOG 2025 Challenge

Marc Aubreville, Jonas Ammeling, Sweta Banerjee, Viktoria Weiss, Taryn A. Donovan, Robert Klopfleisch, Jiaqi Lv, Shan E Ahmed Raza, Raphaël Bourgade, Thomas Walter, Yasemin Topuz, Songül Varlı, Charles-Antoine Collins-Fekete, Zhuoyan Shen, Navya Sri Kelam, Nitin Singhal, Christian Marzahl, Brian Napora, Tengyou Xu, Hongyan Gu, Mario Vento, Gennaro Percannella, Norbert Ropiak, Izabela Wasiak, Jie Xiao, Shaojun Liu, Seungho Choe, April Khademi, Vidushi Walia, Sujatha Kotte, Andrew Broad, Alex Wright, Guillaume Balezo, Esha Sadia Nasir, Mostafa Jahanifar, Yosuke Yamagishi, Shouhei Hanaoka, Mattia Sarno, Francesco Tortorella, Biwen Meng, Jingxin Liu, Sara Krauss, Daniel Hieber, Lavish Ramchandani, Dev Kumar Das, Mieko Ochi, Yuan Bae, Piotr Giedziun, Mateusz Maniewski, Vangala Govindakrishnan Saipradeep, Naveen Sivadasan, Leire Benito-Del-Valle, Adrian Galdran, Kaustubh Atey, Sameer Anand Jha, Adinath Dukre, Imran Razzak, Maxime W. Lafarge, Viktor H. Koelzer, Nils Porsche, Nikolas Stathonikos, Mitko Veta, Dominik Hirling, Zsanett Zsófia Iván, Peter Horvath, Katharina Breininger, Christof A. Bertram

Affiliations: Flensburg University of Applied Sciences; Technische Hochschule Ingolstadt; University of Veterinary Medicine; Schwarzman Animal Medical Center; Freie Universität Berlin; University of Warwick; MINES Paris - PSL University; Yildiz Technical University; University College London; AIRA MATRIX Private Limited; University of California, Los Angeles; University of Kansas Medical Center; University of Salerno; Cancer Center Sp. z o. o.; th Military Research Hospital in Bydgoszcz; Shenzhen Technology University; Toronto Metropolitan University; Tata Consultancy Services Ltd.; Leeds Teaching Hospitals NHS Trust; The University of Tokyo; Xi’an Jiaotong-Liverpool University; University of Augsburg; Ulm University; Japanese Red Cross Medical Center; Wroclaw University of Science and Technology; TECNALIA, Basque Research and Technology Alliance (BRTA); Indian Institute of Technology Bombay; MBZUAI; University of Basel; University Medical Center Utrecht; TU Eindhoven; HUN-REN Biological Research Centre

AI summary: To address the histological diversity encountered in clinical practice, the MIDOG 2025 challenge evaluated algorithms across 12 tumor types and multiple scanning platforms. Models performed reliably in traditional hotspot regions but degraded markedly in challenging regions and on rare tumors; ensembling raised the F1 score by 1.5 percentage points.

Abstract

Automated mitosis detection is a well-established task in computational pathology. While previous benchmarks focused on scanner-induced domain shift, clinical "real-world" application requires models to be robust across the vast variance to be expected in the histological landscape. The MItosis DOmain Generalization (MIDOG) 2025 challenge was designed to evaluate algorithmic performance across unprecedented biological and contextual diversity. We curated a test dataset of 365 cases, encompassing 12 distinct human, canine and feline tumor types, digitized across multiple scanning platforms. Moving beyond hand-selected hotspots, the challenge also required detection in random tissue areas (representative of the whole slide detection situation) and challenging areas (areas rich in hard negatives). In the second track, we introduced the classification of atypical mitotic figures (AMFs). There were 18 teams submitting to the detection track, with F1 scores ranging up to 0.740. In the AMF detection track, we had 21 submissions with balanced accuracy values up to 0.908. Our analysis reveals that while most models perform reliably in traditional hotspots, significant performance degradation occurs in challenging ROIs, where false positive rates tripled. Furthermore, performance varied significantly across the 12 tumor types, highlighting "blind spots" in current state-of-the-art architectures when encountering rare or highly pleomorphic malignancies. Moreover, we evaluated the effectiveness of ensembling and found mean increases of 1.5 and 1.3 percentage points in F1 score and balanced accuracy, respectively. In contrast, test-time augmentation (TTA) showed no relevant improvement. MIDOG 2025 demonstrates that "in the wild" mitosis detection remains a significant hurdle. The transition from hotspot-only evaluation to a multi-contextual framework provides a more realistic proxy for clinical reliability.


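2606.07394 2026-06-08 cs.CV New submission

Mind the Gap: Disentangling Performance Bottlenecks in Video Instance Segmentation

Danial Hamdi, Fardin Ayar, Mahdi Javanmardi

Affiliations: Computer Engineering Department, Amirkabir University of Technology (Tehran Polytechnic)

AI summary: Proposes a diagnostic framework based on integer linear programming that separates classification, segmentation, and tracking errors. It finds that tracking instability is the main bottleneck in video instance segmentation, especially under occlusion, in long videos, and in dense scenes, and that stronger backbones do not remove this algorithmic problem.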

Abstract

In Video Instance Segmentation (VIS), classification, segmentation, and tracking objectives are jointly evaluated, but their individual contributions to performance loss remain opaque. We introduce a diagnostic framework that formulates identity and class assignment as an Integer Linear Program (ILP), yielding a model-agnostic oracle that hierarchically isolates each error source. Applied to seven VIS methods spanning online and offline paradigms across YouTube-VIS 2019/2021 and a diagnostic subset of OVIS, our analysis reveals a consistent picture. Tracking instability is a critical bottleneck for online methods, with gaps exceeding 20 AP under heavy occlusion, and grows sharply with video length and instance density. While semantic classification contributes meaningfully on standard benchmarks, its impact becomes negligible where tracking fails most. Although stronger backbones substantially lift default scores, they leave AP tracking gaps largely intact, confirming that temporal fragility is algorithmic rather than purely representational. To complement the oracle, we introduce TrackLens, a visual tool that translates gap magnitude into observable, query-level failure modes. Together, these tools provide a systematic foundation for targeting VIS's core challenge: robust long-term temporal association.


2606.07401 2026-06-08 cs.CV New submission

RealDocBench: A Benchmark for Field-Level QA and Layout Understanding on Real-World Regulated Documents

Ameya Joshi, Joon Kim, Gus Eggert, Joseph Bajor, Cindy Hao, Jing Reyhan, Kushal Byatnal, Eli Badgio

Affiliations: Extend AI

AI summary: Proposes the RealDocBench benchmark with two tasks, field-level QA and layout understanding, evaluates 18 systems on real regulated documents, and reveals the performance spread and cost/latency trade-offs that single metrics hide.

Abstract

Document parsing systems are increasingly deployed in high-stakes, regulated workflows such as mortgage underwriting, financial reporting, supply-chain logistics, and clinical records. Yet most public benchmarks evaluate parsers on clean academic layouts or synthetic prose, and report a single OCR or markdown-level similarity score. Such documents and metrics correlate poorly with what downstream agents actually need: the correct value for a specific field on a messy real-world page. We introduce RealDocBench, a two-track benchmark built from real regulated documents. The QA track contains 1,356 field-level questions over 581 documents spanning four domains, where each question is paired with a typed gold_dict of key-to-value answers and parsers are scored on both per-field and strict per-question accuracy. The layout track contains 1,500 human-verified page images annotated with COCO-style bounding boxes under a nine-class public taxonomy, scored with a Hungarian matcher that includes adjacency-aware split/merge recovery. We evaluate eighteen systems, spanning commercial parsing APIs, general-purpose VLMs, and open-source OCR models, under a uniform extraction-and-scoring protocol, and report accuracy alongside per-page cost and cache-busted latency. RealDocBench exposes a wide performance spread that single-number benchmarks hide, a persistently hard medical sub-domain, and sharp cost/latency trade-offs across operating points. We release the datasets, parser adapters, and evaluation harness to support reproducible, field-level comparison of document parsing systems.
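The two QA-track scores mentioned in the abstract (per-field and strict per-question accuracy) can be illustrated with a minimal Python sketch. This is not the benchmark's evaluation harness: the function names, the string-only values, and the whitespace/case normalization are all assumptions made for illustration, and the abstract's "typed" `gold_dict` presumably needs type-aware comparison that is not modeled here.

```python
from typing import Dict, List


def normalize(value: str) -> str:
    """Hypothetical normalization: trim, lowercase, collapse whitespace."""
    return " ".join(value.strip().lower().split())


def score_question(gold: Dict[str, str], pred: Dict[str, str]):
    """Return (correct_fields, total_fields, strict_hit) for one question."""
    correct = sum(
        1
        for key, want in gold.items()
        if key in pred and normalize(pred[key]) == normalize(want)
    )
    # Strict scoring counts a question only if every gold field matches.
    return correct, len(gold), correct == len(gold)


def score_benchmark(golds: List[Dict[str, str]], preds: List[Dict[str, str]]):
    """Aggregate per-field and strict per-question accuracy."""
    correct = total = strict = 0
    for gold, pred in zip(golds, preds):
        c, t, hit = score_question(gold, pred)
        correct += c
        total += t
        strict += int(hit)
    return {"per_field": correct / total, "per_question": strict / len(golds)}


# Made-up example data: one wrong field in the first question.
golds = [
    {"loan_amount": "250000", "borrower": "Jane Doe"},
    {"invoice_date": "2024-01-31"},
]
preds = [
    {"loan_amount": "250000", "borrower": "John Doe"},
    {"invoice_date": "2024-01-31"},
]
scores = score_benchmark(golds, preds)
```

In this toy case two of three fields match, while only one of two questions is fully correct, which shows how the strict score can fall well below the per-field score.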


2606.07433 2026-06-08 cs.CV cs.AI cs.MM New submission

Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang

Affiliations: School of Intelligence Science and Technology, Peking University; Wuhan University; Shanghai Jiao Tong University; Nanyang Technological University; CASIA; University of Tokyo; University of Liverpool; Zhejiang University; National University of Singapore; UC Merced

AI summary: Frames video understanding from a human-view perspective around three functional abilities (watching, remembering, reasoning), builds a unified framework for analyzing perception, memory, reasoning, and prediction in video MLLMs, and surveys challenges, methods, applications, and future directions.

Abstract

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering includes offline and streaming memory, while reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and cover training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related works will be continuously traced at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.


2606.07451 2026-06-08 cs.CV cs.AI cs.CL cs.LG New submission

TEVI: Text-Conditioned Editing of Visual Representations via Sparse Autoencoders for Improved Vision-Language Alignment

Sweta Mahajan, Sukrut Rao, Jiahao Xie, Alexander Koller, Bernt Schiele

Affiliations: Max Planck Institute for Informatics, Saarland Informatics Campus, Saarbrücken, Germany; Department of Language Science and Technology, Saarland University, Saarbrücken, Germany

AI summary: Proposes TEVI, a framework that disentangles image embeddings with sparse autoencoders and selectively reconstructs them through a text-conditioned masking module, improving image-text alignment in vision-language models such as CLIP and yielding gains on several retrieval benchmarks.

Comments 20 pages, 13 figures, 14 tables

Abstract

Vision-language models such as CLIP are highly useful for diverse tasks due to their shared image-text embedding space. Despite this, the image and text embeddings are often poorly aligned, affecting downstream performance. Recent work has shown that this can be attributed to an information imbalance: images contain more information than their captions describe. In this work, we propose TEVI, a framework that uses captions as a signal for what to retain from image embeddings. Specifically, we use sparse autoencoders to disentangle image embeddings and train a masking module to selectively reconstruct the embedding based on a given caption. In a controlled setup with synthetic captions, we show that TEVI is effective at preserving caption-described attributes while discarding others. By applying TEVI to CLIP models trained on natural images, we further achieve improved retrieval performance across coarse-grained short-caption (MS COCO, Flickr) and fine-grained long-caption (IIW, DOCCI) benchmarks, with stronger gains on richer captions, and improved robustness on the RoCOCO benchmark.


2606.07498 2026-06-08 cs.CV New submission

Implicit Data Synthesis for Contrastive Unsupervised Data Augmentation

Patrick Kage, Trevor Hedges, N. Siddharth, Pavlos Andreadis

Affiliations: School of Informatics, The University of Edinburgh; Massachusetts Institute of Technology Lincoln Laboratory

AI summary: For scientific observation data that are hard to label, proposes generating contrastive samples by perturbing network weights rather than the data, and reports performance gains with a SimCLR pipeline on radar meteor observations.

Comments 11 pages, 3 figures, 2 tables

2606.07503 2026-06-08 cs.CV New submission

Differences in Detection: Explainability Where it Matters

Johannes Theodoridis, Johannes Maucher, Andreas Schilling

Affiliations: University of Tübingen; Institute for Applied AI; Hochschule der Medien Stuttgart

AI summary: Proposes DnD, a method that compares two object detection models directly through a shared matching algorithm, reveals individual and shared errors, and steers explainability methods toward metric-relevant examples.

Comments Accepted to IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops 2026 - How Do Vision Models Work? (HOW)

Abstract

We propose Differences in Detection (DnD), an intuitive method to compare two object detection models. Based on the same matching algorithm, it complements the standard metrics of mean Average Precision ($mAP$) and TIDE error analysis with the ability to compare two models directly. More specifically, we calculate the intersection of ground truth labels that are recognized by both models, followed by the corresponding difference sets and the complement set of ground truth labels that are missed by both models. The resulting comparison is more direct and intuitive than a comparison of independent summary statistics. It reveals individual and shared mistakes and becomes particularly interesting when combined with error types. In this case, the differences in detection errors can be analyzed naturally in a standard confusion matrix. While valuable in itself, we believe that one of the best applications of DnD is to guide explainability methods such as ODAM towards metric-relevant examples, grounded in structured subsets. The code for our method is available here: https://github.com/JohannesTheo/differences-in-detection
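The set comparison described in the abstract can be sketched in a few lines of Python. This is only an illustration, not the authors' code: it assumes each model's detections have already been matched to ground-truth labels upstream (for example by an IoU-based matcher), and the function name and the integer label IDs are hypothetical.

```python
def compare_detections(gt_ids, found_by_a, found_by_b):
    """Partition ground-truth labels by which model recognized them.

    gt_ids: all ground-truth label IDs.
    found_by_a / found_by_b: IDs of ground-truth labels that model A / model B
    matched (the matching step itself is assumed to happen elsewhere).
    """
    gt = set(gt_ids)
    a = set(found_by_a) & gt
    b = set(found_by_b) & gt
    return {
        "both": a & b,            # intersection: recognized by both models
        "only_a": a - b,          # difference set: A finds it, B misses it
        "only_b": b - a,          # difference set: B finds it, A misses it
        "neither": gt - (a | b),  # complement: missed by both models
    }


# Made-up example with six ground-truth labels.
subsets = compare_detections(range(1, 7), {1, 2, 3, 4}, {3, 4, 5})
```

The four subsets are disjoint and together cover every ground-truth label, so each one can be handed separately to an error-type analysis or to an explainability method.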


2606.07508 2026-06-08 cs.CV New submission

Streaming Video Generation with Streaming Force Control

Hanhui Wang, Yiming Xie, Haiwen Feng, Zhaoyang Lv, Shenlong Wang, Huaizu Jiang

Affiliations: Northeastern University; Impossible Research; University of California, Berkeley; University of Illinois Urbana-Champaign

AI summary: Proposes StreamForce, a framework that achieves causal, unified streaming video generation through a unified force representation and a distillation pipeline. It supports local and global time-varying force control, runs at 16.6 FPS on a single GPU, and reports state-of-the-art force adherence and motion realism.

2606.07512 2026-06-08 cs.CV cs.AI cs.CL New submission

MemDreamer: Decoupling Perception and Reasoning for Long Video Understanding via Hierarchical Graph Memory and Agentic Retrieval Mechanism

Cong Chen, Guo Gan, Kaixiang Ji, ChaoYang Zhang, Zhen Yang, Guangming Yao, Hao Chen, Jingdong Chen, Yi Yuan, Chunhua Shen

Affiliations: Ant Group; Zhejiang University; Central South University; HKUST(GZ)

AI summary: Proposes MemDreamer, a framework that decouples perception from reasoning through a Hierarchical Graph Memory and an agentic retrieval mechanism, turning long-video understanding into an agentic exploration process. It reaches SOTA on four benchmarks while limiting the reasoning context window to 2% of full-context ingestion and gaining 12.5 points in accuracy.

Abstract

Current Vision-Language Models struggle with hours-long videos because processing full-length visual sequences induces prohibitive token explosion and attention dilution. To overcome this, we introduce MemDreamer to decouple perception and reasoning, shifting long-video understanding into an agentic exploration process. As a plug-and-play framework, it incrementally streams videos to construct a Hierarchical Graph Memory, a top-down three-tier architecture for semantic abstraction, anchored by a foundational graph capturing spatiotemporal and causal relations. During inference, the reasoning model employs agentic tool-augmented retrieval, navigating hierarchies, searching nodes, and traversing logical edges via an Observation-Reason-Action loop. Experiments show MemDreamer achieves SOTA results across four mainstream benchmarks, narrowing the gap with human experts to only 3.7 points. It constrains the reasoning context window to merely 2% of full-context ingestion while delivering a 12.5 point absolute accuracy gain. Furthermore, statistical analysis uncovers a strong positive linear correlation between a VLM's performance on logic reasoning and long-video understanding benchmarks, establishing agentic capability scaling as a new paradigm for multimodal comprehension.


2606.07514 2026-06-08 cs.CV New submission

UniSHARP: Universal Sharp Monocular View Synthesis

Meixi Song, Dizhe Zhang, Hao Ren, Ruiyang Zhang, Bo Du, Ming-Hsuan Yang, Lu Qi

Affiliations: Insta360 Research; Sun Yat-sen University; Beihang University; Wuhan University; University of California, Merced

AI summary: Proposes UniSHARP, which extends SHARP to arbitrary camera systems (including fisheye and panoramic) through a unified omnidirectional latent space and a ray-based Gaussian representation, with implicit alignment in both feature and Gaussian spaces; it outperforms existing methods by a large margin on a newly constructed benchmark.

Comments Project page: https://insta360-research-team.github.io/Unisharp-website/

Abstract

In this work, we focus on extending SHARP, the popular photorealistic view synthesis method, for universal monocular rendering across a continuum of camera systems, from conventional perspective cameras to wide-field-of-view, fisheye and omnidirectional panoramic settings. To overcome the pinhole-specific assumptions of SHARP, our key idea is to align various images in a unified omnidirectional latent space. Thus, we propose UniSHARP, which performs implicit alignment in both feature and Gaussian spaces. Specifically, Gaussian primitives are arranged along rays and radial distances in a ray-based universal representation, while 2D semantic and 3D spatial features extracted from UniK3D-inspired encoders are jointly decoded to generate the complete Gaussian cloud. To comprehensively evaluate our method, we construct a benchmark covering diverse imaging systems across various scenes. The benchmark is further stratified by field of view (FoV) to enable fine-grained assessment of the universal monocular rendering task. Extensive experiments on the proposed benchmark demonstrate the effectiveness of UniSHARP, outperforming alternative methods by a large margin. The project page can be found at: https://insta360-research-team.github.io/Unisharp-website/


2606.06498 2026-06-08 cs.GR cs.CV Cross-list

Semantic-Structural Alignment for Generative Pictorial Charts

Zhida Sun, Yulin Zhang, Zheng Gu, Min Lu, Bongshin Lee, Daniel Cohen-Or, Hui Huang

Affiliations: Visual Computing Research Center (VCC), College of Computer Science and Software Engineering (CSSE), Shenzhen University, China

AI summary: Proposes a generative framework that uses structural-alignment and semantic-alignment mechanisms within a Multi-Modal Diffusion Transformer to automatically synthesize pictorial charts that combine artistic expressiveness with structural fidelity.

Comments 11 pages, 17 figures, Accepted to ACM TOG

DOI: 10.1145/3811313

Abstract

Traditional statistical graphics are precise but often lack the visual appeal, memorability, and engagement of pictorial charts. We present a generative framework for the automated synthesis of pictorial charts that bridges the gap between semantic expression and structural faithfulness. Rather than treating charts merely as images to be stylized, we frame the problem as a dual-conditioned generation task guided by two parallel external control signals: a text prompt capturing the semantic context of the editing intent, and a context image providing the abstract statistical chart's global structure. To reinforce these controls within a Multi-Modal Diffusion Transformer, we introduce two complementary feature-level mechanisms: structural alignment to anchor spatial layouts to the input chart, and semantic alignment to transfer expressive textures from reference images. Generalizing across major visual channels (i.e., length, area, angle, and position) and diverse semantic domains, our method produces pictorial charts that are both artistically compelling and structurally consistent. Extensive quantitative evaluations and perceptual user studies demonstrate that our framework outperforms traditional controllable generation and image editing baselines, providing a foundation for high-fidelity, data-driven generative modeling in expressive visual storytelling. Project page: https://ssalign.github.io/.


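2606.06505 2026-06-08 cs.CG cs.AI cs.CV math.DG Cross-list

A Geometric Gaussian Mixture Representation of Plane Curves

Ali Darijani, Benedikt Stratmann, Jürgen Beyerer

Affiliations: Fraunhofer IOSB; KIT, IES

AI summary: Proposes a user-defined probabilistic polygonal representation of plane curves. Each line segment receives an uncertainty parameter in the normal direction, yielding a Gaussian mixture model that preserves local geometry and normal-direction uncertainty and applies to many curve types.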

Abstract

We introduce a user-defined probabilistic polygonal representation for plane curves. Given a curve, we select vertices on the curve and connect consecutive vertices by line segments to obtain a polygonal approximation. Each segment is equipped with a user-defined uncertainty parameter in the normal direction. This yields a collection of thin probabilistic geometric primitives that retain the geometry of the underlying curve while extending it beyond the idealized deterministic one-dimensional formulation. For each segment, we define a random variable that is uniformly distributed in the tangent direction of the segment and Gaussian distributed in the normal direction. By matching the first and second central moments, this construction induces a Gaussian component whose mean lies at the segment midpoint and whose covariance encodes both tangential and normal uncertainty. Combining the segment-wise components with appropriate weights yields a Gaussian Mixture Model (GMM) representation of the user-defined probabilistic polygonal representation of the plane curve. The proposed framework provides an analytically tractable probabilistic model that preserves local geometry and uncertainty in the normal direction. It applies to smooth, closed, open, non-regular, and self-intersecting plane curves, allows adaptive discretization and varying uncertainty in the normal direction, and as a result supports uncertainty-aware geometric modeling. Experiments on a collection of canonical plane curves show that the resulting GMMs capture local tangent, local normal, and local arc length, so that the global shape of the underlying curves is faithfully captured as well. The representation is particularly relevant for applications in uncertainty-aware CAD and digital twins, probabilistic obstacle modeling in robotics, and probabilistic trajectory planning.
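The moment-matching step in the abstract can be made concrete with a short Python sketch. For a segment of length L with unit tangent t and unit normal n, a variable that is uniform along t and Gaussian (standard deviation sigma) along n has its mean at the midpoint and covariance (L^2/12) t t^T + sigma^2 n n^T, since a uniform distribution over an interval of length L has variance L^2/12. The function names are hypothetical, and weighting components by segment length is an assumption; the abstract only says "appropriate weights".

```python
import math


def segment_gaussian(p, q, sigma):
    """Moment-matched Gaussian for the segment p -> q.

    Returns (mean, covariance, length). sigma is the user-chosen standard
    deviation in the normal direction.
    """
    dx, dy = q[0] - p[0], q[1] - p[1]
    length = math.hypot(dx, dy)
    tx, ty = dx / length, dy / length  # unit tangent
    nx, ny = -ty, tx                   # unit normal (tangent rotated 90 degrees)
    var_t = length ** 2 / 12.0         # variance of a uniform over length L
    var_n = sigma ** 2
    mean = ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)
    off_diag = var_t * tx * ty + var_n * nx * ny
    cov = [
        [var_t * tx * tx + var_n * nx * nx, off_diag],
        [off_diag, var_t * ty * ty + var_n * ny * ny],
    ]
    return mean, cov, length


def polyline_to_gmm(vertices, sigmas, closed=False):
    """Build mixture components, one per segment of the polygon."""
    pairs = list(zip(vertices, vertices[1:]))
    if closed:
        pairs.append((vertices[-1], vertices[0]))
    parts = [segment_gaussian(p, q, s) for (p, q), s in zip(pairs, sigmas)]
    total = sum(length for _, _, length in parts)
    # Assumption: weights proportional to segment length.
    return [
        {"weight": length / total, "mean": mean, "cov": cov}
        for mean, cov, length in parts
    ]


# Made-up L-shaped polyline with a different sigma per segment.
gmm = polyline_to_gmm([(0.0, 0.0), (2.0, 0.0), (2.0, 1.0)], [0.1, 0.2])
```

For the horizontal first segment the covariance is diagonal, with 4/12 along the x axis (tangent) and 0.1 squared along the y axis (normal); the vertical second segment swaps the roles of the axes.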

2606.06524 2026-06-08 eess.IV cs.CV cs.LG 交叉投稿

Advanced Flood Prediction with Physics-Guided Deep Learning: Combining UNet, FNO, and SAR/Optical Imagery

基于物理引导深度学习的先进洪水预测：结合UNet、FNO与SAR/光学影像

Tewodros Syum Gebre, Jagrati Talreja, Leila Hashemi-Beni

发表机构 * National Center for Atmospheric Research (NCAR)（国家大气研究中心）

AI总结提出物理引导深度学习框架，融合多模态遥感与浅水方程约束，通过UNet-FNO混合架构实现高精度洪水预测，IoU达0.82，F1达0.90。

Comments This paper has been accepted for publication in the Proceedings of the IEEE Radar Conference (RadarConf 2026). The final authenticated version will be available through IEEE Xplore

AI中文摘要

由于地面观测有限、地形条件异质以及数据驱动模型中难以强制执行水动力学一致性，准确且可扩展的洪水测绘仍然具有挑战性。本文介绍了一种物理引导的深度学习框架，该框架集成了多模态遥感（Sentinel-1 SAR、Sentinel-2光学影像和DEM衍生的地形特征）与深度平均浅水方程（SWE）的约束。所提出的混合架构结合了用于捕捉精细尺度空间细节的UNet和用于模拟流域尺度水力相互作用的傅里叶神经算子（FNO），而物理信息残差损失确保了质量和动量一致性。在多种洪泛区环境下评估，混合模型在洪水范围预测中实现了0.82的交并比和0.90的F1分数，优于仅使用UNet和仅使用FNO的基线模型。以水动力学模拟作为参考数据，该模型在水深方面实现了0.21米的均方根误差，在流速方面实现了0.15米/秒的均方根误差。物理一致性得以保持，残差低且质量不平衡低于2.1%。消融研究证实，去除基于物理的正则化会显著降低性能，突显了物理约束对稳定性和泛化能力的价值。这些结果表明，将水动力学原理嵌入深度学习可产生更准确、可靠且物理一致的洪水预测，为业务监测和大规模部署提供了巨大潜力。

英文摘要

Accurate and scalable flood mapping remains challenging due to limited ground observations, heterogeneous terrain conditions, and the difficulty of enforcing hydrodynamic consistency within data-driven models. This work introduces a physics-guided deep learning framework that integrates multi-modal remote sensing (Sentinel-1 SAR, Sentinel-2 optical imagery, and DEM-derived terrain features) with constraints from the depth-averaged shallow water equations (SWE). The proposed hybrid architecture combines a UNet to capture fine-scale spatial details with a Fourier Neural Operator (FNO) to model basin-scale hydraulic interactions, while physics-informed residual losses ensure mass and momentum consistency. Evaluated across diverse floodplain settings, the hybrid model achieves an Intersection over Union of 0.82 and an F1 score of 0.90 for flood extent prediction, outperforming UNet-only and FNO-only baselines. Using hydrodynamic simulations as reference data, the model achieves an RMSE of 0.21 m for water depth and 0.15 m/s for flow velocity. Physics consistency is maintained, with low residuals and mass imbalance below 2.1%. Ablation studies confirm that removing physics-based regularization significantly degrades performance, underscoring the value of physical constraints for stability and generalization. These results demonstrate that embedding hydrodynamic principles into deep learning yields more accurate, reliable, and physically coherent flood predictions, offering strong potential for operational monitoring and large-scale deployment.
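To make the idea of a physics-informed residual loss concrete, the sketch below penalizes violations of the depth-averaged continuity equation, ∂h/∂t + ∂(hu)/∂x + ∂(hv)/∂y = 0, on a regular grid. It is an assumption-laden illustration: the abstract does not specify the discretization, so the central differences, the time-midpoint depth, and the function names `continuity_residual` and `mass_loss` are all hypothetical, and the momentum terms are omitted.

```python
import numpy as np

def continuity_residual(h_prev, h_next, u, v, dt, dx, dy):
    """Residual of dh/dt + d(hu)/dx + d(hv)/dy on a regular grid.

    h_prev, h_next: water depth at two consecutive time steps, shape (ny, nx).
    u, v:           depth-averaged velocities in x and y, shape (ny, nx).
    """
    h_mid = 0.5 * (h_prev + h_next)          # depth at the time midpoint
    dh_dt = (h_next - h_prev) / dt
    # Axis 0 is y (rows), axis 1 is x (columns).
    dqx_dx = np.gradient(h_mid * u, dx, axis=1)
    dqy_dy = np.gradient(h_mid * v, dy, axis=0)
    return dh_dt + dqx_dx + dqy_dy

def mass_loss(h_prev, h_next, u, v, dt, dx, dy):
    """Mean squared continuity residual, usable as a regularization term."""
    r = continuity_residual(h_prev, h_next, u, v, dt, dx, dy)
    return float(np.mean(r ** 2))
```

In a training setup this term would typically be added to the data loss with a tunable weight; still water and uniform steady flow both give a loss of zero.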

2606.06537 2026-06-08 q-bio.QM cs.CV eess.IV 交叉投稿

DSU-Net: An Attention-Enhanced Dense Skip U-Net for Breast Lesion Segmentation in Mammographic Images

DSU-Net：用于乳腺X线图像中乳腺病变分割的注意力增强密集跳跃U-Net

Reza Bozorgpour, Mohammadreza Soltany Sadrabadi

发表机构 * Department of Biomedical Engineering, University of Wisconsin-Milwaukee（威斯康星大学密尔沃基分校生物医学工程系）； Department of Mechanical Engineering, Northern Arizona University（北亚利桑那大学机械工程系）

AI总结提出DSU-Net，通过密集跳跃连接和注意力机制改进特征传播与边界描绘，在CBIS-DDSM数据集上实现高精度乳腺病变分割。

AI中文摘要

乳腺癌仍然是全球女性癌症相关死亡的主要原因之一，因此早期检测对于有效治疗至关重要。乳腺X线摄影是主要的筛查方式；然而，可疑病变的准确勾画仍然具有挑战性，且存在观察者间差异。自动分割方法可以通过提供一致且高效的病变定位来辅助放射科医生。本研究提出了DSU-Net，一种用于乳腺X线图像中自动乳腺病变分割的注意力增强密集跳跃U-Net架构。该框架集成了密集跳跃连接和注意力机制，以改进特征传播、保留空间信息并增强病变边界描绘。实验使用了乳腺摄影筛查数字数据库的精选乳腺成像子集（CBIS-DDSM）。为了解决严重的前景-背景不平衡问题，训练中采用了结合Dice损失、焦点损失和二元交叉熵损失的复合损失函数。所提模型在验证数据集上实现了0.9421的Dice相似系数、0.8905的交并比、0.9711的准确率和0.9878的AUC-ROC。定性评估显示了对不同大小和形态病变的准确勾画，而定量结果证实了病变与背景区域之间的稳健区分。这些发现表明，DSU-Net在乳腺X线图像中提供了准确可靠的乳腺病变分割，并突出了注意力引导深度学习在计算机辅助乳腺癌筛查和诊断中的潜力。

英文摘要

Breast cancer remains one of the leading causes of cancer-related mortality among women worldwide, making early detection essential for effective treatment. Mammography is the primary screening modality; however, accurate delineation of suspicious lesions remains challenging and subject to inter-observer variability. Automated segmentation methods can assist radiologists by providing consistent and efficient lesion localization. This study presents DSU-Net, an attention-enhanced Dense Skip U-Net architecture for automated breast lesion segmentation in mammographic images. The proposed framework integrates dense skip connections and attention mechanisms to improve feature propagation, preserve spatial information, and enhance lesion boundary delineation. Experiments were conducted using the Curated Breast Imaging Subset of the Digital Database for Screening Mammography (CBIS-DDSM). To address severe foreground-background imbalance, a composite loss function combining Dice loss, focal loss, and binary cross-entropy loss was employed during training. The proposed model achieved a Dice Similarity Coefficient of 0.9421, an Intersection over Union of 0.8905, an accuracy of 0.9711, and an AUC-ROC of 0.9878 on the validation dataset. Qualitative evaluation demonstrated accurate delineation of lesions with varying sizes and morphologies, while quantitative results confirmed robust discrimination between lesion and background regions. These findings demonstrate that DSU-Net provides accurate and reliable breast lesion segmentation in mammographic images and highlight the potential of attention-guided deep learning for computer-aided breast cancer screening and diagnosis.
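The composite loss mentioned in the abstract (Dice + focal + binary cross-entropy) can be sketched as follows. The abstract does not give the weighting or the focal parameter, so the equal weights, `gamma=2.0`, and the function name `composite_loss` are assumptions made purely for illustration.

```python
import numpy as np

def composite_loss(prob, target, w_dice=1.0, w_focal=1.0, w_bce=1.0,
                   gamma=2.0, eps=1e-7):
    """Weighted sum of Dice, focal, and BCE losses for a binary mask.

    prob:   predicted foreground probabilities in [0, 1].
    target: ground-truth mask with values 0 or 1 (same shape as prob).
    """
    p = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    y = np.asarray(target, dtype=float)

    # Dice loss: overlap-based, so it is insensitive to the large background.
    inter = np.sum(p * y)
    dice = 1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(y) + eps)

    # Per-pixel binary cross-entropy.
    bce_map = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

    # Focal loss: down-weight easy pixels by (1 - p_t)^gamma.
    p_t = y * p + (1.0 - y) * (1.0 - p)
    focal = np.mean((1.0 - p_t) ** gamma * bce_map)

    return w_dice * dice + w_focal * focal + w_bce * np.mean(bce_map)
```

The Dice term addresses foreground-background imbalance at the region level, while the focal term does so at the pixel level by concentrating the gradient on hard pixels.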

2606.06540 2026-06-08 eess.IV cs.CV 交叉投稿

ErA: Error-Aware Deep Unrolling Network for Single Image Defocus Deblurring

ErA：用于单图像散焦去模糊的误差感知深度展开网络

Tu Vo, Chan Y. Park

发表机构 * KC Machine Learning Lab（KC机器学习实验室）

AI总结提出ErA网络，通过联合学习紧凑核基和逐像素权重，并利用增广拉格朗日展开中的误差感知项交替更新和ResUNet去噪器校正核估计误差，在多个数据集上达到最优性能。

2606.06627 2026-06-08 cs.RO cs.AI cs.CV cs.LG 交叉投稿

What Matters When Cotraining Robot Manipulation Policies on Everyday Human Videos?

在日常生活人类视频上协同训练机器人操作策略时什么因素重要？

Richard Li, Aditya Prakash, Andrew Wen, Saurabh Gupta, Yilun Du, Pulkit Agrawal

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Harvard University（哈佛大学）

AI总结研究利用日常互联网视频协同训练机器人操作策略时，手部姿态质量和运动差距对迁移的影响，提出一种协同训练方法，在低机器人数据场景下六个操作任务中绝对成功率提升29.7%。

Comments The project website is here: https://richardrl.github.io/what-matters-cotraining-human-videos/index.html

2606.06725 2026-06-08 eess.IV cs.CV 交叉投稿

Compute-Optimal Network Design for Echocardiography Myocardial Segmentation and Perfusion Quantification using Neural Scaling Laws

基于神经缩放定律的超声心动图心肌分割与灌注量化的计算最优网络设计

Clara Rodrigo González, Matthieu Toulemonde, Lasha Gvinianidze, Cameron A. B. Smith, Oscar Bates, Roxy Senior, Fu Siong Ng, Meng-Xing Tang

发表机构 * Department of Bioengineering, Imperial College London（伦敦帝国理工学院生物工程系）； National Heart and Lung Institute, Imperial College London（伦敦帝国理工学院国家心肺研究所）； Guy’s and St. Thomas’ NHS Foundation Trust（盖伊和圣托马斯NHS信托基金会）

AI总结应用神经缩放定律预测心肌分割性能，在CAMUS和CEUS数据集上确定最优网络大小，实现参数减少240倍且性能达最优，自动分割在心肌灌注量化中与资深心脏病专家等效。

Comments 15 pages, 4 figures, 5 tables, journal

AI中文摘要

使用对比增强超声进行心肌灌注量化提供了一种床旁非电离替代核成像模态的方法。然而，其临床采用受到耗时的手动标注的限制。由于域内训练数据匮乏，自动分割已被证明具有挑战性。我们应用当前用于优化大数据集上大型语言模型的策略，将神经缩放定律应用于预测心肌分割的网络性能。我们在数据子集上外推性能，以确定CAMUS超声心动图数据集和25名患者的对比增强超声（CEUS）数据集上的最优网络大小。最后，通过将最终心肌灌注参数与资深心脏病专家获得的参数进行比较，验证了我们模型的临床实用性。基于缩放定律的外推能够预测完整数据集大小下的测试损失，使我们能够选择两个网络，在CAMUS上以240倍的参数减少获得最先进性能。我们观察到缩放定律的梯度从CAMUS迁移到CEUS数据集，但预测损失存在偏差。自动分割的掩膜在心肌灌注量化中与资深心脏病专家表现相当。这些结果确立了神经缩放定律作为小成像数据集上数据驱动计算最优模型设计的实用工具。

英文摘要

Myocardial perfusion quantification using contrast-enhanced ultrasound offers a bedside non-ionizing alternative to nuclear imaging modalities. However, its clinical adoption is hindered by time-consuming manual labelling. Automated segmentation has proved challenging due to a paucity of in-domain training data. Adapting strategies currently used to optimise large language models for large datasets, we apply neural scaling laws to predict network performance for myocardial segmentation. We extrapolate performance on subsets of the data to determine optimal network size on the CAMUS echocardiography dataset and a 25-patient contrast-enhanced ultrasound (CEUS) dataset. Finally, we validate the clinical utility of our models by comparing the final myocardial perfusion parameters with those obtained by a senior cardiologist. Extrapolation based on the scaling law is predictive of test loss at the full dataset size, allowing us to select two networks that obtained state-of-the-art performance on CAMUS with a 240-fold reduction in parameter count. We observe the gradient of the scaling law transfers from CAMUS to the CEUS dataset with a bias in the predicted losses. The automatically segmented masks perform equivalently to a senior cardiologist in myocardial perfusion quantification. These results establish neural scaling laws as a practical tool for data-driven compute-optimal model design for small imaging datasets.
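The extrapolation idea can be illustrated with a simple power-law fit. This sketch assumes the test loss follows L(D) = a · D^(−b) in the dataset size D with no irreducible-loss term, which is a simplification and not necessarily the functional form used in the paper; `fit_power_law` and `predict_loss` are hypothetical names.

```python
import numpy as np

def fit_power_law(sizes, losses):
    """Fit L(D) = a * D**(-b) by linear regression in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(losses), 1)
    return float(np.exp(intercept)), float(-slope)   # (a, b)

def predict_loss(a, b, size):
    """Extrapolate the fitted law to a new dataset size."""
    return a * size ** (-b)

# Fit on small subsets, then extrapolate to the full dataset size.
subset_sizes = np.array([50.0, 100.0, 200.0, 400.0])
subset_losses = 2.0 * subset_sizes ** (-0.3)         # synthetic losses for illustration
a, b = fit_power_law(subset_sizes, subset_losses)
full_size_loss = predict_loss(a, b, 1800.0)
```

In log-log space the exponent `b` is the slope (the "gradient" the abstract refers to) and `log(a)` is the offset, so a slope that transfers between datasets while the offset does not would show up as a constant bias in the predicted losses.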

2606.06836 2026-06-08 cs.RO cs.AI cs.CV 交叉投稿

Think Like a Pilot: Fine-Grained Long-Horizon UAV Navigation

像飞行员一样思考：细粒度长时程无人机导航

Xiangyi Zheng, Xiangyu Wang, Qinan Liao, Zimu Tang, Yue Liao, Dongyue Lyu, Guodong Wang, Junjie Liu, Si Liu

发表机构 * Colab ； Beihang University（北航）； Meituan（美团）； National University of Singapore（新加坡国立大学）

AI总结提出FLIGHT基准和FLIGHT VLA异步架构，通过低频飞行员推理VLM与高频扩散动作模型解耦，实现无人机长时程语义指令下的平滑连续飞行控制。

AI中文摘要

语言引导的无人机代理必须执行长时程语义指令，同时产生平滑、物理可行的连续飞行命令，然而现有的视觉语言导航（VLN）基准通常使用离散或粗粒度的动作，而现有的无人机视觉-语言-动作（VLA）任务则专注于短时、原子化的机动。为了解决无人机任务设置中的这一空白，我们引入了FLIGHT，一个用于混合无人机导航与推理任务的细粒度长时程指令引导基准，该基准结合了多阶段指令与密集的6-DoF轨迹注释，分为两个数据集：细粒度VLN和长时程流。为了使无人机代理具备对任务执行状态和任务规划进行实时飞行推理的能力，同时适应高频、实时的精确控制，我们进一步提出了FLIGHT VLA，一种异步架构，将用于任务状态推理的低频流式飞行员视觉语言模型（VLM）与用于连续控制的高频扩散动作模型解耦，并由显式的飞行员推理文本进行监督，该文本总结了当前飞行状态并预测下一个子目标。在闭环评估中，FLIGHT VLA在我们的FLIGHT基准上持续优于代表性的VLN和VLA基线，实现了更强的多阶段完成、子目标遵循和终端控制。其训练的流式飞行员推理VLM进一步提升了无人机视频推理，验证了我们设计的有效性。

英文摘要

Language-guided UAV agents must execute long-horizon semantic instructions while producing smooth, physically feasible continuous flight commands, yet existing Vision-Language Navigation (VLN) benchmarks typically use discrete or coarse actions and existing UAV Vision-Language-Action (VLA) tasks focus on short, atomic maneuvers. To address this gap in UAV task settings, we introduce FLIGHT, a Fine-grained Long-horizon Instruction-Guided benchmark for Hybrid UAV navigation and reasoning Tasks, which combines multi-stage instructions with dense 6-DoF trajectory annotations across two dataset splits: Fine-grained VLN and Long-horizon Flow. To endow the UAV agent with the capability of real-time in-flight reasoning over task execution status and mission planning, while simultaneously accommodating high-frequency, real-time precise control, we further propose FLIGHT VLA, an asynchronous architecture that decouples a low-frequency Streaming Pilot Vision-Language Model (VLM) for task-state reasoning from a high-frequency diffusion action model for continuous control, supervised by explicit Pilot Reasoning texts that summarize the current flight state and anticipate the next subgoal. In closed-loop evaluation, FLIGHT VLA consistently surpasses representative VLN and VLA baselines on our FLIGHT benchmarks, achieving stronger multi-stage completion, subgoal adherence, and terminal control. Its trained Streaming Pilot Reasoning VLM further improves UAV video reasoning, validating the effectiveness of our design.
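The decoupling of a slow reasoner from a fast controller can be illustrated with a toy multi-rate loop. This is only a schematic under stated assumptions: it runs synchronously in a single thread (a real system would run the two models asynchronously), and `run_multirate_loop`, `reasoner`, `controller`, and `observe` are hypothetical placeholders, not components of FLIGHT VLA.

```python
def run_multirate_loop(n_ticks, reason_every, reasoner, controller, observe):
    """Run a fast control loop that refreshes its subgoal only every few ticks.

    reasoner(obs, previous_subgoal) -> subgoal   (slow, called every `reason_every` ticks)
    controller(obs, subgoal)        -> command   (fast, called on every tick)
    observe(tick)                   -> obs
    """
    subgoal = None
    commands = []
    for tick in range(n_ticks):
        obs = observe(tick)
        # Low-frequency branch: update the task-state reasoning and next subgoal.
        if tick % reason_every == 0:
            subgoal = reasoner(obs, subgoal)
        # High-frequency branch: always emit a control command for the latest subgoal.
        commands.append(controller(obs, subgoal))
    return commands
```

Between reasoning updates the controller keeps acting on the most recent subgoal, which is what lets the control rate stay high while the reasoning model runs slowly.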

2606.06847 2026-06-08 eess.IV cs.CV 交叉投稿

Physics-Driven Semantic Scattering Structure Understanding of Aircraft Target in SAR Images

SAR图像中飞机目标的物理驱动语义散射结构理解

Yifei Yin, Xiaogang Yu, Hao Shi, Liang Chen, Wei Li

发表机构 * School of Information and Electronics, Beijing Institute of Technology（北京理工大学信息与电子学院）； National Key Laboratory of Science and Technology on Space-Born Intelligent Information Processing（空间智能信息处理国家级重点实验室）； Beijing Institute of Remote Sensing Information（北京市遥感信息研究所）

AI总结针对SAR图像中飞机目标散射中心表示不稳定、弱散射部件缺失的问题，提出物理驱动框架S3U-SAR，通过定义语义散射关键点并利用多维物理先验约束，实现完整拓扑结构重建，在基准数据集上取得最优性能。

AI中文摘要

合成孔径雷达（SAR）因其全天时、全天候观测能力，已成为目标解译不可或缺的手段。在SAR目标解译中，电磁散射信息提供了超越视觉纹理的物理基础线索，并被广泛用于目标解译。然而，现有方法仍以局部散射中心表示为主。这种无序且与部件无关的表示对飞机目标极不稳定。因此，物理存在的弱散射响应部件常被遗漏，导致重建的拓扑结构不完整。为解决这一局限，我们建立了语义散射结构理解作为SAR飞机解译的新范式。定义语义散射关键点以将局部电磁响应与物理上有意义的飞机部件关联，同时引入可见性感知属性以保留弱可观测但物理存在的部件。关键点进一步组织为稳定的语义散射结构。基于此，我们提出S3U-SAR，一个物理驱动框架，用于定位语义散射关键点并构建由多维物理先验（包括散射异质性、刚体拓扑、散斑不确定性）约束的完整表示。进一步引入置信门控联合监督策略以缓解优化冲突。我们构建了KP-SAR-Aircraft-1.0，首个用于语义散射结构理解的细粒度基准。大量实验表明，S3U-SAR相比基线取得了最佳性能。跨类别和跨数据集评估进一步验证了其鲁棒性和可迁移性。

英文摘要

Synthetic aperture radar (SAR) has become indispensable for target interpretation owing to its all-day and all-weather observation capability. In SAR target interpretation, electromagnetic scattering information provides a physically grounded cue beyond visual texture and has been widely exploited. However, existing methods remain dominated by local scattering center representations. Such unordered and component-agnostic representations are highly unstable for aircraft targets. As a result, physically present components with weak scattering responses are often missed, resulting in an incomplete reconstructed topology. To address this limitation, we establish Semantic Scattering Structure Understanding as a new paradigm for SAR aircraft interpretation. Semantic scattering keypoints are defined to associate local electromagnetic responses with physically meaningful aircraft components, while visibility-aware attributes are introduced to retain weakly observable yet physically present components. The keypoints are further organized into a stable semantic scattering structure. Building on this, we propose S3U-SAR, a physics-driven framework to localize semantic scattering keypoints and construct the complete representation constrained by multi-dimensional physical priors, including scattering heterogeneity, rigid-body topology, and speckle uncertainty. A confidence-gated joint supervision strategy is further introduced to alleviate optimization conflicts. We construct KP-SAR-Aircraft-1.0, the first fine-grained benchmark for semantic scattering structure understanding. Extensive experiments demonstrate that S3U-SAR achieves the best performance compared with baselines. Cross-category and cross-dataset evaluations further verify its robustness and transferability.

2606.06878 2026-06-08 cs.RO cs.CV 交叉投稿

A Cross-view Fusion Framework for Robust 6-DoF Grasp Pose Estimation

一种用于鲁棒6-DoF抓取姿态估计的跨视图融合框架

Kangjian Zhu, Haobo Jiang, Jianjun Qian, Jin Xie

发表机构 * Nanjing University of Science and Technology（南京理工大学）； Nanyang Technological University（南洋理工大学）； Nanjing University（南京大学）

AI总结提出跨视图融合框架，通过辅助视图缓解遮挡，利用自监督对比学习增强点云特征的空间一致性和方向区分性，并设计跨视图对齐圆柱体集成模块融合抓取相关几何，提升角落视图下的6-DoF抓取姿态估计鲁棒性。

Comments Corresponding author: Jin Xie

AI中文摘要

本文提出一种跨视图融合框架，增强了角落视图中6-DoF抓取姿态估计的鲁棒性。我们的框架通过引入辅助视图缓解遮挡，并通过后融合策略避免了耗时的、任务无关的多视图重建。为了增强跨视图融合，我们提出一种自监督对比学习策略，利用跨视图关联来正则化点云特征。简而言之，如果两个点对应相同的3D位置，则跨视图点对被视作匹配；如果它们代表不同的抓取方向，则视为不匹配。该学习策略显著增强了点特征的空间一致性和方向区分性，从而促进了跨视图融合并提高了估计鲁棒性。此外，我们提出一种跨视图对齐圆柱体集成模块，将抓取相关几何融合为综合表示。具体地，该模块首先根据相似性对齐跨视图点和特征，以增强对噪声的鲁棒性。随后，将这些点注册到圆柱坐标系中，强调对抓取重要的旋转对称几何。最后，交替使用局部自注意力和种子交叉注意力层，分别实现单视图内和跨视图间的交互，支持抓取相关几何的细粒度表示。我们的框架在GraspNet-1Billion基准测试和实际应用中均取得了强劲性能。代码可在以下网址获取：https://github.com/KJZhuAutomatic/Cross-view-Grasp

英文摘要

In this paper, we propose a cross-view fusion framework that enhances the robustness of 6-DoF grasp pose estimation in corner views. Our framework alleviates occlusion by incorporating an auxiliary view and avoids the time-consuming, task-agnostic multi-view reconstruction through a post-fusion strategy. To enhance cross-view fusion, we propose a self-supervised contrastive learning strategy that leverages cross-view associations to regularize point cloud features. In brief, a cross-view point pair is considered a match if the two points correspond to the same 3D location, and a non-match if they represent distinct grasp directions. The learning strategy significantly enhances the spatial consistency and direction distinctiveness of point features, thereby facilitating cross-view fusion and improving estimation robustness. Furthermore, we propose a cross-view-aligned cylinder integration module to fuse grasp-relevant geometry into a comprehensive representation. Specifically, the module first aligns the cross-view points and features according to their similarity to enhance the robustness against noise. Subsequently, these points are registered into the cylindrical coordinate frame, emphasizing the rotation-symmetric geometry which is important for grasping. Finally, local self-attention and seed cross-attention layers are alternately employed, respectively enabling interactions within single views and across views, which supports fine-grained representation of grasp-relevant geometry. Our framework achieves strong performance on the GraspNet-1Billion benchmark and in real-world applications. Code is available at https://github.com/KJZhuAutomatic/Cross-view-Grasp.
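The step of registering points into a cylindrical coordinate frame can be sketched as below. The abstract does not say how the cylinder axis or origin is chosen, so here both are passed in explicitly, and `to_cylindrical` is a hypothetical helper rather than the module from the paper. The example only shows why such a frame exposes rotation-symmetric geometry: rotating a point about the axis changes its angle but not its radius or height.

```python
import numpy as np

def to_cylindrical(points, origin, axis):
    """Express 3D points as (radius, angle, height) around a cylinder axis.

    points: (n, 3) array; origin: (3,) point on the axis; axis: (3,) direction.
    """
    points = np.asarray(points, dtype=float)
    origin = np.asarray(origin, dtype=float)
    axis = np.asarray(axis, dtype=float)
    axis = axis / np.linalg.norm(axis)

    # Build an orthonormal frame (e1, e2, axis) from any vector not parallel to the axis.
    helper = np.array([1.0, 0.0, 0.0]) if abs(axis[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    e1 = np.cross(axis, helper)
    e1 = e1 / np.linalg.norm(e1)
    e2 = np.cross(axis, e1)

    rel = points - origin
    height = rel @ axis                 # coordinate along the axis
    x, y = rel @ e1, rel @ e2           # coordinates in the plane orthogonal to the axis
    return np.stack([np.hypot(x, y), np.arctan2(y, x), height], axis=1)
```

For a grasp-centric use, the axis would plausibly be aligned with the gripper approach direction, but that choice is an assumption here.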

2606.06983 2026-06-08 eess.IV cs.AI cs.CV 交叉投稿

DaX: Learning General Pathology Representations Across Scales

DaX: 跨尺度的通用病理学表示学习

Bokai Zhao, Yiyang Zhang, Long Bai, Tai Ma, Hanqing Chao, Minfeng Xu

发表机构 * DAMO Academy, Alibaba Group（达摩院，阿里巴巴集团）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； Hupan Lab（湖畔实验室）

AI总结提出病理视觉基础模型DaX，通过改进DINOv3自监督学习，结合连续放大训练、跨尺度组织视图等设计，在44个公开数据集的161项临床任务上取得最佳平均性能。

AI中文摘要

计算病理学需要能够跨不同临床终点迁移且对放大倍数、染色、扫描仪类型、切片制备和输入分辨率变化保持鲁棒的视觉表示。我们提出DaX，一个病理视觉基础模型，它将DINOv3风格的自监督学习适应到全切片组织病理学。DaX从自然图像DINOv3权重初始化，并融合了连续放大训练、跨尺度组织视图、方向无关和采集鲁棒的数据增强、多输入尺寸训练以及Gram锚定的密集一致性。这些设计旨在连接局部细胞形态与全局组织结构，同时稳定跨输入尺度的密集token级表示。我们进一步构建了一个WSI级基准，包含来自44个公共数据集的161项临床有意义任务，涵盖28,182名患者和34,394张切片，跨越四个临床领域和九个任务类别。所有模型在固定的患者级交叉验证协议下进行评估，并采用折叠级统计排名，从而实现可重复的比较，对分割依赖的变异性不敏感。在该基准上，DaX在任务中取得了最高的平均性能，并持续获得强大的任务级排名分数，其增益涵盖诊断病理学、生物标志物和分子谱分析、组织/标本背景以及风险、反应和预后。这些结果支持DaX作为计算病理学的可迁移视觉编码器，并为未来的病理基础模型提供了标准化的评估框架。项目页面：https://alibaba-damo-academy.github.io/DaX/benchboard/

英文摘要

Computational pathology requires visual representations that transfer across diverse clinical endpoints and remain robust to variation in magnification, staining, scanner type, slide preparation, and input resolution. We present DaX, a pathology vision foundation model that adapts DINOv3-style self-supervised learning to whole-slide histopathology. DaX is initialized from natural-image DINOv3 weights and incorporates continuous magnification training, cross-scale tissue views, orientation-agnostic and acquisition-robust augmentation, multi-input-size training, and Gram-anchored dense consistency. These designs aim to connect local cellular morphology with global tissue architecture while stabilizing dense token-level representations across input scales. We further construct a WSI-level benchmark comprising 161 clinically meaningful tasks from 44 public datasets, covering 28,182 patients and 34,394 slides across four clinical domains and nine task categories. All models are evaluated under a fixed patient-level cross-validation protocol with fold-level statistical ranking, enabling reproducible comparisons that are less sensitive to split-dependent variation. Across this benchmark, DaX achieves the highest mean performance across tasks and consistently strong task-level ranking scores, with gains spanning diagnostic pathology, biomarker and molecular profiling, tissue/specimen context, and risk, response, and prognosis. These results support DaX as a transferable visual encoder for computational pathology and provide a standardized evaluation framework for future pathology foundation models. Project page: https://alibaba-damo-academy.github.io/DaX/benchboard/.

2606.07016 2026-06-08 stat.AP cs.CV 交叉投稿

An Integrated Roadside Sensing and Communication Framework for Vulnerable Road User Safety at Signalized Intersections

信号交叉口弱势道路使用者安全的集成路边感知与通信框架

Parvez Anowar

发表机构 * Department of Civil, Environmental and Construction Engineering, University of Central Florida（中央佛罗里达大学土木、环境与建设工程系）

AI总结提出集成多模态感知、边缘计算、V2X/P2X通信和自适应信号控制的框架，基于公开数据集R-LiViT分析53,319个标注，发现VRU占49%、昼夜密度差异大、近距离事件变化10倍、83%行人边界框小，支持多模态感知和自适应部署。

Comments 17 pages, 5 figures, 2 tables. Preprint

AI中文摘要

弱势道路使用者（VRU）约占全球城市交通死亡人数的一半，而交叉口集中了不成比例的伤亡。最近关于VRU保护的感知技术综述列举了数十种单传感器和双传感器部署，但所调查的系统均未将多模态感知与边缘侧近碰撞分析以及双向车联万物（V2X）和行人联万物（P2X）消息传递集成在单个交叉口机柜中。本文提出一个信号交叉口VRU保护的综合框架，在感知层结合LiDAR、雷达、RGB相机和热成像相机，在计算层进行基于边缘的预测和替代安全分析，在通信层进行V2X和P2X消息传递，在驱动层进行自适应信号控制。该框架基于使用R-LiViT（首个公开的路边LiDAR-视觉-热成像数据集）的实证案例研究，该数据集提供了200个多模态序列和2,400个标注的RGB-T帧，来自三个德国交叉口。对53,319个检测标注的分析显示，VRU约占所有道路使用者观测的49%；从白天到夜晚，行人密度下降38%，车辆下降45%，而夜间分布显示更高的近距离比例；在三个交叉口的八个独特位置，每帧近距离事件计数变化约10倍；83%的行人边界框在图像空间中较小，表明VRU通常远离任何单个传感器。这些发现支持多模态感知、边缘侧分析和自适应上下文感知部署，而非统一的单传感器解决方案。

英文摘要

Vulnerable road users (VRUs) account for approximately half of urban traffic deaths globally, with intersections concentrating a disproportionate share of these casualties. Recent reviews of sensing technology for VRU protection have cataloged dozens of single-sensor and dual-sensor deployments, yet none of the surveyed systems couples multi-modal sensing with edge-side near-miss analytics and bidirectional vehicle-to-everything (V2X) and pedestrian-to-everything (P2X) messaging in a single intersection cabinet. This paper presents an integrated framework for VRU protection at signalized intersections, combining LiDAR, radar, RGB camera, and thermal camera at the perception layer, edge-based prediction and surrogate-safety analytics at the computation layer, V2X and P2X messaging at the communication layer, and adaptive signal control at the actuation layer. The framework is grounded in an empirical case study using R-LiViT, the first publicly released roadside LiDAR-Visual-Thermal dataset, which provides 200 multi-modal sequences and 2,400 annotated RGB-T frames at three German intersections. Analysis of 53,319 detection annotations reveals that VRUs comprise approximately 49% of all road-user observations, that day-to-night density drops by 38% for pedestrians and 45% for vehicles while the night distribution shows a higher close-proximity share, that per-frame close-proximity event counts vary approximately 10-fold across the eight unique locations at three intersections, and that 83% of pedestrian bounding boxes are small in image space, indicating that VRUs are typically far from any single sensor. These findings support multi-modal sensing, edge-side analytics, and adaptive context-sensitive deployment rather than uniform single-sensor solutions.

2606.07033 2026-06-08 cs.AI cs.CV 交叉投稿

Hierarchical Semantic-Constrained Heterogeneous Graph for Audio-Visual Event Localization

层次化语义约束异构图用于音视频事件定位

Zhe Yang, Ruyi Zhang, Hongtao Chen, Wenrui Li, Hengyu Man, Wangmeng Zuo, Xiaopeng Fan

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Peng Cheng Laboratory（鹏城实验室）； Harbin Institute of Technology Suzhou Research Institute（哈尔滨工业大学苏州研究院）

AI总结提出层次化语义约束异构图框架，通过构建异构图、双向语义约束和双曲空间层次正则化，解决开放词汇音视频事件定位中跨尺度一致性和层次语义一致性问题。

AI中文摘要

开放词汇音视频事件定位（OV-AVEL）联合建模音视频线索，以识别并时间定位事件，包括训练中未见过的类别。现有方法主要在欧几里得空间中学习联合音视频表示，但仍面临两个重大挑战。首先，未见类别缺乏监督信号，难以在多个时间尺度上保持音视频一致性。其次，片段级与视频级语义之间缺乏层次约束，导致模型无法在不同层级间建立语义一致性。为解决这些挑战，我们提出一种用于音视频事件定位的层次化语义约束异构图（HSCHG）框架。我们首先在欧几里得空间中构建一个异构层次图，包含音频和视觉片段节点及其对应的视频级节点。我们使用多方向时间边来捕获每个模态内的完整时间信息。同时，我们采用双阈值过滤门控融合策略，仅在对齐置信度高时引入跨模态信息。此外，我们在片段级和视频级表示之间引入双向语义约束，以实现不同层级间的语义一致性。基于此，我们将多级音视频表示和文本原型统一映射到双曲空间中。我们使用层次蕴含正则化损失来表征视频与片段之间的层次关系。大量实验结果表明，我们的方法在OV-AVEL基准上优于现有方法。消融研究进一步验证了我们方法的有效性。

英文摘要

Open-vocabulary audio-visual event localization (OV-AVEL) jointly models audio-visual cues to recognize and temporally localize events, including categories unseen during training. Existing methods primarily learn joint audio-visual representations in Euclidean space, but still face two significant challenges. First, the lack of supervision signals for unseen categories makes it difficult to maintain audio-visual consistency across multiple temporal scales. Second, the lack of hierarchical constraints between segment- and video-level semantics prevents the model from establishing semantic consistency across different levels. To address these challenges, we propose a hierarchical semantic-constrained heterogeneous graph (HSCHG) framework for audio-visual event localization. We first construct a heterogeneous hierarchical graph in Euclidean space, which includes audio and visual segment nodes and their corresponding video-level nodes. We use multi-directional temporal edges to capture complete temporal information within each modality. Simultaneously, we employ a dual-threshold filtering gated fusion strategy, introducing cross-modal information only when the alignment confidence is high. Furthermore, we introduce bidirectional semantic constraints between segment- and video-level representations to achieve semantic consistency across different levels. Based on this, we map the multi-level audio-visual representations and text prototypes uniformly into hyperbolic space. We use a hierarchical entailment regularization loss to characterize the hierarchical relationships between videos and segments. Extensive experimental results show that our method outperforms existing methods on the OV-AVEL benchmark. Ablation studies further validate the effectiveness of our method.
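Mapping Euclidean features into hyperbolic space is commonly done with the exponential map at the origin of the Poincaré ball. The sketch below uses that standard construction; whether the paper uses the Poincaré ball or another hyperbolic model is not stated in the abstract, so the model choice, the curvature parameter `c`, and the function names are assumptions. The hierarchical entailment loss itself is not reproduced here.

```python
import numpy as np

def expmap0(v, c=1.0, eps=1e-9):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    v = np.asarray(v, dtype=float)
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    scale = np.tanh(np.sqrt(c) * norm) / (np.sqrt(c) * np.maximum(norm, eps))
    return scale * v

def poincare_distance(x, y, c=1.0):
    """Geodesic distance between two points inside the Poincare ball."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sq_dist = np.sum((x - y) ** 2, axis=-1)
    denom = (1.0 - c * np.sum(x ** 2, axis=-1)) * (1.0 - c * np.sum(y ** 2, axis=-1))
    return np.arccosh(1.0 + 2.0 * c * sq_dist / denom) / np.sqrt(c)
```

Points near the origin behave like general concepts and points near the boundary like specific ones, which is why this geometry is often used to encode hierarchies such as video-level versus segment-level semantics.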

2606.07058 2026-06-08 cs.LG cs.CV math.AT stat.ML 交叉投稿

Constructing VAE Latent Spaces with Prescribed Topology

构建具有指定拓扑的VAE潜在空间

Jilles S. van Hulst, Jakub M. Tomczak, W. P. M. H. Heemels, Duarte J. Antunes

发表机构 * Control Systems Technology Section, Department of Mechanical Engineering, Eindhoven University of Technology（机械工程系控制系统技术部，埃因霍温理工大学）； Nature Innovation Laboratory (NatInLab)（自然创新实验室（NatInLab））

AI总结针对数据流形非欧几里得拓扑导致标准高斯先验不匹配的问题，提出一种构造性数学框架，通过因子化分布和重参数化技巧，为乘积覆盖空间流形（如圆柱、环面、莫比乌斯带等）设计拓扑匹配的先验，提升重建质量和表示忠实性。

Comments 16 pages, 7 figures

AI中文摘要

变分自编码器（VAE）学习高维数据的低维潜在表示。当数据位于具有非欧几里得拓扑的流形上时，标准高斯先验会引入拓扑不匹配，从而降低重建质量并阻碍忠实表示。我们提出了一个构造性数学框架，解决了所有允许乘积覆盖空间的流形的这种不匹配问题。这些流形可表示为基本因子（圆、区间或直线）的乘积，或此类乘积在有限对称群下的商。该类包括圆柱、环面、莫比乌斯带、克莱因瓶和实射影空间。基本因子上的因子化分布产生具有闭式解耦KL散度的乘积拓扑，使得每个潜在因子可以独立塑造，同时保持训练可处理。我们为周期、有界和无界支撑编目了可重参数化的编码器-先验对，并提供了坐标变换，允许标准神经网络输出具有平滑梯度的非欧几里得参数。对于商流形，解码器接收覆盖空间坐标的群不变特征，使得识别点产生相同输出。锚点约束相对于数据固定坐标系或创建软拓扑孔。在合成流形和真实图像数据集（旋转和循环移位MNIST）上的实验证实，拓扑匹配的先验使KL正则化与数据流形对齐。所得到的拓扑感知模型在所有实际相关的正则化强度下均优于高斯基线。代码可从 https://github.com/JvHulst/VAE-Topology 获取。

英文摘要

Variational autoencoders (VAEs) learn low-dimensional latent representations of high-dimensional data. When the data lies on a manifold with non-Euclidean topology, the standard Gaussian prior introduces a topological mismatch that degrades reconstruction quality and prevents faithful representation. We present a constructive mathematical framework that resolves this mismatch for all manifolds that admit a product covering space. These are manifolds expressible as products of elementary factors (circles, intervals, or lines) or as quotients of such products by a finite symmetry group. The class includes cylinders, tori, Möbius strips, Klein bottles, and real projective spaces. Factorized distributions over the elementary factors yield product topologies with closed-form, decoupled KL divergences, so that each latent factor can be shaped independently while keeping training tractable. We catalogue reparametrizable encoder-prior pairs for periodic, bounded, and unbounded supports, and provide coordinate transformations that allow standard neural networks to output non-Euclidean parameters with smooth gradients. For quotient manifolds, the decoder receives group-invariant features of the covering-space coordinates, so that identified points produce identical outputs. Anchor constraints fix the coordinate system relative to the data or create soft topological holes. Experiments on synthetic manifolds and real-image datasets (rotated and cyclically shifted MNIST) confirm that a topology-matched prior aligns KL regularization with the data manifold. The resulting topology-aware models outperform the Gaussian baseline at all practically relevant regularization strengths. The code is available at https://github.com/JvHulst/VAE-Topology.
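For the simplest elementary factor, the circle, the two ideas in the abstract (a coordinate transformation from unconstrained network outputs, and decoder features that are identical for identified points) can be sketched as follows. This is an illustration under assumptions: the wrapped-normal sampler is a stand-in for whichever periodic encoder-prior pair the paper catalogues, and all function names are hypothetical.

```python
import numpy as np

def raw_to_angle(raw):
    """Map an unconstrained 2-vector (e.g. an encoder output) to an angle on the circle."""
    raw = np.asarray(raw, dtype=float)
    return np.arctan2(raw[..., 1], raw[..., 0])

def wrapped_sample(mu, sigma, rng):
    """Reparametrized sample mu + sigma * eps, wrapped to [-pi, pi)."""
    eps = rng.standard_normal(np.shape(mu))
    return (mu + sigma * eps + np.pi) % (2.0 * np.pi) - np.pi

def circle_features(theta):
    """Decoder input that is invariant under theta -> theta + 2*pi*k."""
    theta = np.asarray(theta, dtype=float)
    return np.stack([np.cos(theta), np.sin(theta)], axis=-1)

rng = np.random.default_rng(0)
mu = raw_to_angle(np.array([[0.0, 2.0], [-1.0, 0.0]]))   # angles pi/2 and pi
z = wrapped_sample(mu, 0.1, rng)
features = circle_features(z)                              # shape (2, 2)
```

Feeding `(cos θ, sin θ)` rather than `θ` to the decoder removes the artificial jump at ±π, so latent codes that represent the same point on the circle decode to the same output.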

2606.07063 2026-06-08 eess.IV cs.CV 交叉投稿

Beyond Universality: The GCC-FER Dataset and Culture-Aware Adaptation for Dynamic Facial Expression Recognition

超越普遍性：GCC-FER数据集及面向动态面部表情识别的文化感知适应

Sonalika Singh, Jyotirindra Dandapat, Avishi Razdan, Kshipra V. Moghe, Puneet Gupta, Lalan Kumar

发表机构 * Department of Electrical Engineering, Indian Institute of Technology Delhi, India（印度理工学院德里分校电气工程系）； Department of Computer Science and Engineering, Indian Institute of Technology Indore, India（印度理工学院印多尔分校计算机科学与工程系）； Department of Psychology, COEP Technological University, India（COEP技术大学心理学系）

AI总结针对动态面部表情识别中文化差异被忽视的问题，提出首个大规模全球跨文化数据集GCC-FER，并设计文化感知适应系统CA-FER，通过自适应校准面部表示减轻文化偏差，实验证明其有效性。

AI中文摘要

动态面部表情识别（DFER）是情感计算、人机交互和智能多媒体系统中的关键使能技术。尽管文化细微差别对FER性能有显著影响，但大多数现有FER系统假设情感表达在人群中普遍一致。这种差异可归因于不同文化中面部肌肉激活模式的系统性差异。推进跨文化FER的主要挑战在于缺乏文化多样性的基准数据集。为解决这一问题，本文引入了一个名为全球跨文化面部表情识别（GCC-FER）的新型混合多元文化视频数据集。GCC-FER包含跨越四种文化群体（非洲、高加索、东亚和南亚）的23,934个视频样本，涵盖七种基本表情，结合了对代表性不足人群的心理学家监督内部数据收集以及对现有来源的严格种族过滤。据我们所知，GCC-FER是首个旨在解决这些人口统计差距的大规模全球跨文化DFER数据集。利用该数据集，为每个文化群体推导出基于行为的文化先验，并为实际部署推导出全局先验。提出了一种文化感知FER（CA-FER）系统，通过自适应重新校准潜在面部表示来减轻文化偏差。在GCC-FER和DFEW上的大量实验表明，所提系统在多文化环境下持续提高了FER性能。

英文摘要

Dynamic Facial Expression Recognition (DFER) is a key enabling technology in affective computing, human-computer interaction, and intelligent multimedia systems. Despite the significant influence of cultural nuances on FER performance, most existing FER systems assume that emotional expressions are universally consistent across populations. This variation can be attributed to systematic differences in facial muscle activation patterns across cultures. A major challenge in advancing cross-cultural FER lies in the scarcity of culturally diverse benchmark datasets. To address this, a new hybrid multicultural video dataset termed Global Cross-Cultural Facial Expression Recognition (GCC-FER) is introduced. GCC-FER comprises 23,934 video samples spanning four cultural groups (African, Caucasian, East Asian, and South Asian) across seven basic expressions, combining psychologically supervised in-house data collection for underrepresented populations with rigorous ethnicity filtering of existing sources. To the best of our knowledge, GCC-FER is the first large-scale global cross-cultural DFER dataset designed to address these demographic gaps. Leveraging this dataset, behaviorally grounded cultural priors are derived for each cultural group and a global prior for practical deployment. A Culture-Aware FER (CA-FER) system is proposed to mitigate cultural bias by adaptively recalibrating latent facial representations. Extensive experiments on GCC-FER and DFEW demonstrate that the proposed system consistently improves FER performance across multicultural settings.

2606.07217 2026-06-08 cs.RO cs.CV cs.LG 交叉投稿

Robotic Policy Adaptation via Weight-Space Meta-Learning

通过权重空间元学习实现机器人策略自适应

Christian Bianchi, Siamak Yousefi, Alessio Sampieri, Andrea Roberti, Luca Rigazio, Fabio Galasso, Luca Franco

发表机构 * ItalAI ； University of Verona（维罗纳大学）； Sapienza University of Rome（罗马萨皮恩扎大学）

AI总结提出WIZARD框架，通过权重空间元学习从语言指令和演示视频生成任务特定LoRA参数，无需微调即可适应新任务，在LIBERO上性能提升高达14倍。

AI中文摘要

视觉-语言-动作（VLA）模型正成为机器人操作的一种有前景的范式，能够从大规模演示和动作标签语料库中训练通用策略。然而，将这些模型适应新任务通常仍需要任务特定的演示、动作注释和额外的微调，使得部署成本高昂且难以扩展。我们提出WIZARD，一种权重空间元学习框架，通过为冻结的VLA策略生成任务特定的LoRA参数来避免任务特定的微调。仅凭语言指令和简短的演示视频，WIZARD即可在单次前向传播中预测相应的自适应权重，无需目标任务动作标签或测试时优化。在元训练期间，WIZARD学习将任务证据直接映射到专家LoRA更新，在权重空间中捕获任务之间的关系。在LIBERO上的实验表明，WIZARD在未见过的数据集集合上性能提升高达约2倍，在未见过的任务上提升高达约14倍。在Franka Emika Panda机器人上，WIZARD持续优于真实域自适应基线，表明生成的适配器提供了超越仿真的任务级特化。

英文摘要

Vision-Language-Action (VLA) models are emerging as a promising paradigm for robotic manipulation, enabling general-purpose policies trained from large corpora of demonstrations and action labels. However, adapting these models to new tasks still typically requires task-specific demonstrations, action annotations, and additional fine-tuning, making deployment costly and difficult to scale. We propose WIZARD, a weight-space meta-learning framework that sidesteps task-specific fine-tuning by generating task-specific LoRA parameters for a frozen VLA policy. Given only a language instruction and a short demonstration video, WIZARD predicts the corresponding adaptation weights in a single forward pass, without target-task action labels or test-time optimization. During meta-training, WIZARD learns to map task evidence directly to expert LoRA updates, capturing relationships between tasks in weight space. Experiments on LIBERO show that WIZARD improves performance by up to ~2x on unseen dataset collections and up to ~14x on unseen tasks. On a Franka Emika Panda, WIZARD consistently improves over a real-domain adapted baseline, showing that generated adapters provide task-level specialization beyond simulation.
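The core mechanism, predicting LoRA factors from a task embedding and applying them to a frozen weight, can be sketched with a toy linear hypernetwork. Everything here is an assumption for illustration: WIZARD's actual generator architecture, the way the instruction and video are embedded, and the names `LoRAHyperNet` and `adapted_forward` are not taken from the paper. Only the LoRA update rule W' = W + scale · B A is standard.

```python
import numpy as np

class LoRAHyperNet:
    """Toy linear hypernetwork: task embedding -> low-rank factors (A, B)."""

    def __init__(self, task_dim, d_out, d_in, rank, rng):
        self.shape_a = (rank, d_in)
        self.shape_b = (d_out, rank)
        # Hypothetical meta-learned parameters, randomly initialized here.
        self.proj_a = rng.standard_normal((task_dim, rank * d_in)) * 0.01
        self.proj_b = rng.standard_normal((task_dim, d_out * rank)) * 0.01

    def generate(self, task_embedding):
        """Single forward pass: no gradient steps on the target task."""
        a = (task_embedding @ self.proj_a).reshape(self.shape_a)
        b = (task_embedding @ self.proj_b).reshape(self.shape_b)
        return a, b

def adapted_forward(x, w_frozen, a, b, scale=1.0):
    """Apply a frozen linear layer with the generated low-rank update added."""
    return x @ (w_frozen + scale * (b @ a)).T

rng = np.random.default_rng(0)
hyper = LoRAHyperNet(task_dim=8, d_out=6, d_in=5, rank=2, rng=rng)
w = rng.standard_normal((6, 5))                  # frozen policy weight
a, b = hyper.generate(rng.standard_normal(8))    # task embedding stands in for instruction + video
y = adapted_forward(rng.standard_normal((3, 5)), w, a, b)
```

Because the base weight `w` is never modified, one frozen policy can serve many tasks, each with its own small generated adapter.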

2606.07244 2026-06-08 cs.RO cs.AI cs.CV 交叉投稿

Beyond Waypoints: A Trajectory-Centric Waypointing Paradigm for Vision-Language Navigation

超越航点：面向视觉语言导航的轨迹中心航点范式

Haoxiang Shi, Xiang Deng, Haoyu Zhang, Qiaohui Chu, Yaowei Wang, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳））； Pengcheng Laboratory（鹏城实验室）

AI总结提出轨迹航点范式，通过TSDF引导的扩散策略预测可执行轨迹，解决VLN-CE中航点不可达与规划控制不一致问题，在基准上取得最优性能。

AI中文摘要

连续环境中的视觉语言导航（VLN-CE）要求智能体在类似真实世界的环境中遵循自然语言指令进行导航。大多数VLN-CE方法采用三阶段框架：航点预测器提出可导航航点，导航器选择最佳航点，低层控制器执行移动。然而，这种解耦范式常导致航点不可达或规划与控制不一致。本文提出一种称为轨迹航点的新范式，将每个候选航点锚定到可执行轨迹上。为此，我们设计了TSDF引导的扩散策略作为轨迹航点预测器，引导轨迹生成避开障碍物，从本质上保证预测航点的可达性。进一步提出轨迹增强导航器，将关联轨迹作为额外信息注入规划，实现高层语义决策与低层执行的严格一致性。在VLN-CE基准上的大量实验表明，我们的轨迹航点范式优于基线方法。

英文摘要

Vision-Language Navigation in Continuous Environments (VLN-CE) requires agents to follow natural-language instructions while navigating in real-world-like environments. Most VLN-CE approaches adopt a three-stage framework: a waypoint predictor proposes navigable waypoints, a navigator selects the best waypoint, and a low-level controller executes the movement to it. However, this decoupled paradigm often leads to unreachable waypoints or inconsistencies between planning and control. In this work, instead of predicting isolated waypoints, we introduce a novel paradigm called Trajectory Waypoint, which grounds each candidate waypoint in an executable trajectory. To realize this, we design a Trajectory Waypoint Predictor formulated as a TSDF-guided diffusion policy, which steers trajectory generation away from obstacles, inherently ensuring the reachability of the predicted waypoints. We further propose a trajectory-enhanced navigator that injects the associated trajectory as additional information for planning, enabling strict consistency between high-level semantic decisions and low-level execution. Extensive experiments on the VLN-CE benchmark show that our Trajectory Waypoint paradigm achieves superior performance over the baselines.

2606.07289 2026-06-08 cs.LG cs.CV 交叉投稿

Closed-Form Spectral Regularization for Multi-Task Model Merging

多任务模型融合的闭式谱正则化

Yongxian Wei, Runxi Cheng, Xingxuan Zhang, Li Shen, Chun Yuan, Peng Cui, Dacheng Tao

发表机构 * Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院）； Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）； Sun Yat-sen University（中山大学）； Nanyang Technological University（南洋理工大学）

AI总结针对多任务模型融合中的干扰最小化问题，发现迭代求解器实际充当隐式谱正则化器，据此提出基于谱滤波的闭式方法SWUDI及其自适应变体SWUDI-A，显著提升效率并匹配或超越现有方法。

AI中文摘要

模型融合将多个独立微调专家合并为单个多任务模型，无需任何训练数据，降低了大型基础模型的存储、服务和去中心化开发成本。最先进的融合方法将融合表述为逐层二次干扰最小化问题。尽管该问题存在精确的闭式伪逆解，但该解在实践中性能不如数百次梯度下降迭代。迭代循环主导了流程的成本，但其有效性尚未得到解释。我们重新审视这一机制，并表明迭代求解器主要并非作为优化器；相反，它充当了病态正规方程的隐式谱正则化器，其中每层干扰算子的小特征值方向放大了代理噪声。基于这一发现，我们将多任务模型融合形式化为一个带噪线性逆问题，并提出一种由逐方向滤波器参数化的谱滤波估计器。我们通过SWUDI实例化该估计器，这是一种闭式方法，结合了软指数滤波器（匹配迭代下降的梯度流轨迹）和硬top-K截断（抑制放大噪声的小特征值方向）。此外，我们提出了SWUDI-A，一种自适应变体，用逐层秩规则替换全局秩超参数，进一步提高了跨架构的鲁棒性。两种变体共享每个线性层的单个对称特征分解，且不需要训练数据或优化器状态。在四个通用基准和一个涵盖VQA、几何、图表、OCR、定位和模态融合的多模态融合基准上，我们提出的谱求解器匹配或超越了最先进的融合方法。关键的是，它们将挂钟时间减少了28-72倍，峰值GPU内存减少了高达50%。

英文摘要

Model merging combines several independently fine-tuned experts into a single multi-task model without any training data, reducing the storage, serving, and decentralized-development costs of large foundation models. State-of-the-art merging methods formulate merging as a layer-wise quadratic interference minimization problem. Although this problem admits an exact closed-form pseudoinverse solution, that solution underperforms hundreds of iterations of gradient descent in practice. The iterative loop dominates the cost of the pipeline, yet its effectiveness has remained unexplained. We revisit this regime and show that the iterative solver does not primarily act as an optimizer; rather, it serves as an implicit spectral regularizer for an ill-posed normal equation, where small-eigenvalue directions of the per-layer interference operator amplify proxy noise. Building on this finding, we formalize multi-task model merging as a noisy linear inverse problem and propose a spectral filtering estimator parameterized by a per-direction filter. We instantiate this estimator with SWUDI, a closed-form method that combines a soft exponential filter, which matches the gradient-flow trajectory of iterative descent, with a hard top-K truncation that suppresses noise-amplifying small-eigenvalue directions. Furthermore, we propose SWUDI-A, an adaptive variant that replaces the global rank hyperparameter with per-layer rank rules, further improving robustness across architectures. Both variants share a single symmetric eigendecomposition per linear layer and require no training data or optimizer state. Across four general benchmarks and a multimodal merging benchmark spanning VQA, Geometry, Chart, OCR, Grounding, and modality merging, our proposed spectral solvers match or outperform state-of-the-art merging methods. Crucially, they reduce wall-clock time by 28-72x and peak GPU memory by up to 50%.
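The combination of a soft exponential filter and a hard top-K truncation can be illustrated in the eigenbasis of a single symmetric operator. This is a sketch under stated assumptions, not SWUDI itself: the function name, the flow time `t`, and the toy eigenvalues are invented for illustration, and the soft factor used is the classical gradient-flow one, (1 - exp(-λt)) / λ, obtained by running gradient flow from zero on a quadratic.

```python
import math

def filtered_solution(eigvals, rhs_coeffs, t, top_k):
    """Spectrally filtered solution of diag(eigvals) x = rhs, in the eigenbasis.

    Soft part: gradient flow run for time t recovers the fraction
    (1 - exp(-lam * t)) of the exact component rhs / lam.
    Hard part: only the top_k largest-eigenvalue directions are kept.
    """
    order = sorted(range(len(eigvals)), key=lambda i: eigvals[i], reverse=True)
    keep = set(order[:top_k])
    x = []
    for i, (lam, b) in enumerate(zip(eigvals, rhs_coeffs)):
        if i not in keep or lam <= 0.0:
            x.append(0.0)                       # truncated direction
        else:
            x.append((1.0 - math.exp(-lam * t)) / lam * b)
    return x

eigvals = [10.0, 1.0, 1e-6]          # last direction is badly conditioned
rhs = [10.0, 1.0, 1e-3]              # small noise in the weak direction
exact = [b / lam for lam, b in zip(eigvals, rhs)]     # unfiltered inverse
filtered = filtered_solution(eigvals, rhs, t=5.0, top_k=2)
```

In this toy case the unfiltered inverse blows the tiny noise term up to roughly 1000, while the filtered estimate keeps the two well-conditioned components close to 1 and zeroes the third, which mirrors the noise-amplification argument in the abstract.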

2606.07374 2026-06-08 eess.SP cs.CV 交叉投稿

Beyond Backscatter: InSAR coherence from detected SAR images

超越后向散射：来自检测SAR图像的InSAR相干性

Francescopaolo Sica, Andrea Pulella, Michael Schmitt

发表机构 * Department of Aerospace Engineering, University of the Bundeswehr Munich（慕尼黑联邦国防军大学航空航天工程系）； Microwaves and Radar Institute, German Aerospace Center (DLR)（德国航空航天中心 (DLR) 微波与雷达研究所）

AI总结提出一种深度学习框架，直接从检测SAR图像回归相干性，无需精确配准，使用Residual U-Net学习后向散射幅度与相干性的关系，在多种数据集上验证了高分辨率相干性回归的准确性提升和泛化能力。

Comments 27 pages, 20 figures

AI中文摘要

在这项工作中，我们提出了一个深度学习框架，用于直接从检测SAR图像进行相干性回归，无需精确配准。使用从精确配准的Sentinel-1 SLC数据导出的相干性图训练Residual U-Net，以学习后向散射幅度与相干性之间的关系。模型在12天SLC对上训练，并在不同数据集上进行评估，包括配准的SLC产品和开放存取的分析就绪数据，覆盖不同的辐射特性、几何形状和位置。实验结果表明，与现有的基于强度的方法相比，所提出的方法实现了高分辨率相干性回归，且准确性更高。该网络在多样化的地理位置以及训练时从未见过的不同时间基线之间都能很好地泛化。此外，能够在全球可用的分析就绪数据（例如通过Google Earth Engine分发的地距检测数据）上运行，使其在任务设计、变化监测和多种制图任务中能够大规模应用。

英文摘要

In this work, we propose a deep learning framework for coherence regression directly from detected SAR images, without the need for accurate coregistration. A Residual U-Net is trained using coherence maps derived from precisely coregistered Sentinel-1 SLC data to learn the relationship between backscatter magnitudes and coherence. The model is trained on 12-day SLC pairs and evaluated across different datasets, including coregistered SLC products and open access analysis-ready data, covering diverse radiometric properties, geometries, and locations. Experimental results demonstrate that the proposed method achieves high-resolution coherence regression with improved accuracy compared to existing intensity-based approaches. The network generalizes well across diverse geographical locations and even across different temporal baselines that were never seen at training time. Additionally, the ability to operate on globally available analysis-ready data, such as ground range detected data, e.g., distributed through Google Earth Engine, enables its large-scale application in mission design, change monitoring, and diverse mapping tasks.
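For readers unfamiliar with the regression target, the sketch below computes the standard sample coherence magnitude between two coregistered complex (SLC) windows, |Σ s1·conj(s2)| / sqrt(Σ|s1|² · Σ|s2|²). That the reference maps in the paper use exactly this estimator is an assumption; the sample values are made up.

```python
import cmath
import math

def coherence(s1, s2):
    """Sample coherence magnitude of two equally sized complex windows, in [0, 1]."""
    num = abs(sum(a * b.conjugate() for a, b in zip(s1, s2)))
    den = math.sqrt(sum(abs(a) ** 2 for a in s1) * sum(abs(b) ** 2 for b in s2))
    return num / den if den > 0 else 0.0

primary = [1 + 1j, 2 - 1j, -1 + 0.5j, 0.5 + 2j]
# Same scene with a constant interferometric phase offset: fully coherent.
shifted = [z * cmath.exp(1j * 0.7) for z in primary]
# Per-pixel phase changes between acquisitions: partially decorrelated.
scrambled = [primary[0], -primary[1], primary[2] * 1j, -primary[3]]

gamma_coherent = coherence(primary, shifted)
gamma_decorrelated = coherence(primary, scrambled)
```

The estimator needs the complex phase, which detected (magnitude-only) products discard; that gap is what the learned regression in the abstract is meant to bridge.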

2606.07381 2026-06-08 eess.IV cs.AI cs.CV 交叉投稿

Impact of Synthetic Lesional MR Images in Automated Focal Cortical Dysplasia Detection in Low-Data Scenarios

合成病灶MR图像在低数据场景下自动局灶性皮质发育不良检测中的影响

Prabhjot Kaur, Hakim Ouaalam, Sedat Kandemirli, Sanjay P. Prabhu, Simon K. Warfield

发表机构 * Computational Radiology Laboratory（计算放射学实验室）； Boston Children’s Hospital（波士顿儿童医院）； Harvard Medical School（哈佛医学院）

AI总结本研究通过条件生成网络合成FCD病灶MRI数据，评估其真实性及对自动检测的影响，发现合成数据可减少约20%标注需求，但真实数据仍更有效。

DOI: 10.1111/jon.70137

AI中文摘要

背景与目的：自动检测局灶性皮质发育不良（FCD）需要大量体素级病灶勾画的MRI数据，这些数据难以获取。本研究旨在生成呈现FCD的合成MRI数据，评估其真实性，并评估其对自动FCD检测的影响，特别是在减少手动标注需求方面。方法：回顾性研究了来自多个（3个）中心的131例FCD患者和90例健康对照的T1加权（T1w）和T2加权液体衰减反转恢复（FLAIR）MRI扫描。通过将生成网络以二元FCD掩膜为条件生成合成MRI。两位神经放射科医生从14张真实和14张合成扫描的随机集合中识别真实图像。训练了三个nnU-Net模型用于检测FCD，分别使用：（i）仅真实数据（35例FCD/35例对照），（ii）真实数据（35例FCD/35例对照）加合成增强，以及（iii）扩展的真实数据（70例FCD/70例对照）。结果：专家区分真实与合成图像的能力有限，T1w分类准确率为60%，FLAIR为70%（评分者间一致性kappa=0.86）。用合成数据增强自动FCD检测使灵敏度提高8.14%（p=0.12），并改善了模型在真实病灶部位的置信度（0.83±0.11至0.89±0.12；p=0.02）。扩展真实数据模型进一步将灵敏度提高至73.8%（p<0.001），置信度提高至0.90±0.14（p=0.01）。结论：条件生成网络可以生成逼真的合成FCD-MRI，在保持同等灵敏度的情况下减少约20%的标注数据需求。当可用时，等量的真实数据仍比合成增强更有效。

英文摘要

Background and Purpose: Automated detection of focal cortical dysplasia (FCD) requires large volumes of voxelwise lesion-delineated MRI data, which are difficult to acquire. This study aims to generate synthetic MRI data exhibiting FCD, assess their realism, and evaluate their impact on automated FCD detection, particularly in reducing the need for manual annotations. Methods: T1-weighted (T1w) and T2-weighted Fluid-Attenuated Inversion Recovery (FLAIR) MRI scans from 131 FCD patients and 90 healthy controls from multiple (3) sites were retrospectively studied. Synthetic MRIs were generated by conditioning a generative network on binary FCD masks. Two neuroradiologists identified real images from a random set of 14 real and 14 synthetic scans. Three nnU-Net models were trained to detect FCD using: (i) real-only (35 FCD / 35 controls), (ii) real (35 FCD / 35 controls) plus synthetic augmentation, and (iii) expanded real data (70 FCD / 70 controls). Results: Experts showed limited ability to distinguish real from synthetic images, with classification accuracy of 60% for T1w and 70% for FLAIR (inter-rater agreement kappa = 0.86). Augmenting automated FCD detection with synthetic data increased sensitivity by 8.14% (p = 0.12) and improved model confidence at true lesion sites (0.83 ± 0.11 to 0.89 ± 0.12; p = 0.02). The expanded real-data model further improved sensitivity to 73.8% (p < 0.001) and confidence to 0.90 ± 0.14 (p = 0.01). Conclusion: Conditional generative networks can generate realistic synthetic FCD-MRIs, reducing labeled data needs by approximately 20% while maintaining equivalent sensitivity. Equivalent amounts of real data, when available, remain more effective than synthetic augmentation.

2606.07464 2026-06-08 cs.RO cs.AI cs.CV 交叉投稿

Planning-aligned Token Compression for Long-Context Autonomous Driving

面向长上下文自动驾驶的规划对齐令牌压缩

Zhixuan Liang, Yuxiao Chen, Yurong You, Peter Karkus, Wenhao Ding, Boyi Li, Alexander Popov, Yan Wang, Maximilian Igl, Yiming Li, Danfei Xu, Nikolai Smolyanskiy, Boris Ivanovic, Ping Luo, Marco Pavone

发表机构 * NVIDIA Research（NVIDIA研究院）； School of Computing and Data Science, The University of Hong Kong（香港大学计算与数据科学学院）

AI总结提出COMPACT-VA框架，基于条件VQ-VAE将长上下文压缩为有界表示，通过规划对齐实现决策关键信息保留，在动态场景中成功率提升超6%，速度提升3.3倍。

Comments 9 pages

AI中文摘要

整体视觉-动作模型代表了自动驾驶中的一种新兴范式。然而，这种架构在编码用于复杂交互的扩展时间上下文时，会产生迅速超过实时计算预算的令牌序列。虽然线性变换器和外部记忆等方法试图使上下文轻量化，但令牌压缩与架构最为兼容，因为它不需要修改主干网络。然而，现有的压缩采用基于规则的启发式方法（如时间衰减），与规划解耦，存在丢失决策关键信息的风险。我们提出COMPACT-VA，一种基于条件VQ-VAE的规划对齐工作记忆框架，将扩展上下文压缩为有界表示。压缩条件同时基于历史轨迹和学习的规划意图，其中后验编码器在训练期间从未来轨迹中提炼规划意图，而先验编码器学习从压缩观测中预测它。压缩记忆与预测的潜在变量拼接，输入策略进行端到端优化，从而在保留决策关键信息的情况下进行规划。我们在历史上下文对行为正确性（如停车、让行或前行）最关键的高信号动态场景中进行评估，并相应地设计了行为指标。在可比的令牌预算下，我们在成功率上实现了超过6%的提升（68.3%），且各项指标一致提升。消融实验验证了规划对齐耦合的有效性。闭环评估证实，与未压缩处理相比，COMPACT-VA在保持一般驾驶性能的同时实现了3.3倍的速度提升和2.7倍的内存减少。

英文摘要

Monolithic vision-action models represent an emerging paradigm in autonomous driving. However, this architecture produces token sequences that quickly exceed real-time computational budgets when encoding extended temporal context for complex interactions. While approaches like linear transformers and external memory try to make the context lightweight, token compression is most compatible with the architecture as it requires no backbone modifications. Yet existing compression adopts rule-based heuristics like temporal decay, decoupled from planning, risking loss of decision-critical information. We propose COMPACT-VA, a planning-aligned working memory framework built on conditional VQ-VAE, compressing extended context into bounded representations. Compression is conditioned on both historical trajectory and a learned planning intent that the posterior encoder distills from future trajectories during training, while the prior encoder learns to predict it from compressed observations. The compressed memory, concatenated with the predicted latent, feeds the policy for end-to-end optimization, planning with retained decision-critical information. We evaluate on high-signal dynamic scenarios where historical context is most critical for behavior correctness (e.g., stop, yield, or proceed), and accordingly design behavioral metrics. Under comparable token budgets, we achieve >6% improvement (68.3%) on success rates with consistent gains across metrics. Ablations validate planning-aligned coupling effectiveness. Closed-loop evaluation confirms that COMPACT-VA maintains general driving performance with 3.3x speedup and 2.7x memory reduction over uncompressed processing.
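The vector-quantization step that any VQ-VAE relies on can be sketched in a few lines: each continuous latent is replaced by its nearest codebook entry, so a sequence is represented by a bounded set of indices. This shows only the generic lookup; the conditioning on trajectory and planning intent described in the abstract is not modeled, and the codebook and latents are toy values.

```python
def quantize(vectors, codebook):
    """Map each vector to the index of its nearest codebook entry (squared L2)."""
    def sq_dist(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    indices = [min(range(len(codebook)), key=lambda k: sq_dist(v, codebook[k]))
               for v in vectors]
    return indices, [codebook[k] for k in indices]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]     # toy codebook, 3 entries
latents = [[0.9, 1.2], [0.1, -0.2], [-0.8, 0.7]]     # toy encoder outputs
codes, quantized = quantize(latents, codebook)
```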

2406.00636 2026-06-08 cs.CV 版本更新

T2LM: Long-Term 3D Human Motion Generation from Multiple Sentences

T2LM：基于多句子的长期3D人体运动生成

Taeryung Lee, Fabien Baradel, Thomas Lucas, Kyoung Mu Lee, Gregory Rogez

发表机构 * IPAI & ASRI（IPAI与ASRI）； Dept. of ECE, Seoul National University（首尔国立大学电气与计算机工程系）； NAVER LABS Europe（NAVER欧洲实验室）

AI总结提出T2LM框架，利用1D卷积VQVAE和Transformer文本编码器，无需顺序数据即可从多句子生成连续长期3D人体运动，优于先前方法且与单动作SOTA竞争。

Comments CVPR 2024 HuMoGen Workshop

AI中文摘要

本文解决了长期3D人体运动生成的挑战性问题。具体而言，我们旨在从多个句子（即段落）流中生成平滑连接的长时间动作序列。先前的长期运动生成方法大多基于循环方法，使用先前生成的运动块作为下一步的输入。然而，这种方法有两个缺点：1）依赖顺序数据集，成本高昂；2）这些方法在每一步生成的运动之间产生不切实际的间隙。为了解决这些问题，我们引入了简单而有效的T2LM，一个无需顺序数据即可训练的连续长期生成框架。T2LM包含两个组件：一个1D卷积VQVAE，训练将运动压缩为潜在向量序列；以及一个基于Transformer的文本编码器，根据输入文本预测潜在序列。在推理时，一个句子序列被翻译成连续的潜在向量流，然后由VQVAE解码器解码为运动；使用具有局部时间感受野的1D卷积避免了训练序列和生成序列之间的时间不一致性。VQ-VAE上的这个简单约束使其仅用短序列训练即可产生更平滑的过渡。T2LM优于先前的长期生成模型，同时克服了需要顺序数据的限制；它也与最先进的单动作生成模型具有竞争力。

英文摘要

In this paper, we address the challenging problem of long-term 3D human motion generation. Specifically, we aim to generate a long sequence of smoothly connected actions from a stream of multiple sentences (i.e., a paragraph). Previous long-term motion generation approaches were mostly based on recurrent methods, using previously generated motion chunks as input for the next step. However, this approach has two drawbacks: 1) it relies on sequential datasets, which are expensive; 2) it yields unrealistic gaps between the motions generated at each step. To address these issues, we introduce T2LM, a simple yet effective continuous long-term generation framework that can be trained without sequential data. T2LM comprises two components: a 1D-convolutional VQ-VAE, trained to compress motion into sequences of latent vectors, and a Transformer-based text encoder that predicts a latent sequence given an input text. At inference, a sequence of sentences is translated into a continuous stream of latent vectors, which is then decoded into a motion by the VQ-VAE decoder; the use of 1D convolutions with a local temporal receptive field avoids temporal inconsistencies between training and generated sequences. This simple constraint on the VQ-VAE allows it to be trained with short sequences only and produces smoother transitions. T2LM outperforms prior long-term generation models while overcoming the constraint of requiring sequential data; it is also competitive with SOTA single-action generation models.

2408.08973 2026-06-08 cs.CV 版本更新

Image class translation: visual inspection of class-specific hypotheticals and classification based on translation distance

图像类别翻译：类别特定假设的视觉检查与基于翻译距离的分类

Mikyla K. Bowen, Jesse W. Wilson

发表机构 * College of Natural Sciences, Colorado State University, Colorado, United States of America（科罗拉多州立大学自然科学院）； School of Biomedical and Chemical Engineering, Colorado State University, Colorado, United States of America（科罗拉多州立大学生物医学与化学工程学院）； Department of Electrical and Computer Engineering, Colorado State University, Colorado, United States of America（科罗拉多州立大学电气与计算机工程学院）

AI总结提出图像翻译网络用于分类，通过翻译距离作为低维特征进行分类，在皮肤镜和骨髓细胞图像上验证，可解释性优于传统CNN。

Comments 47 pages, 20 figures, submitted revision to SPIE J. Medical Imaging

AI中文摘要

目的：人工智能在医学应用中的主要障碍是自动CNN缺乏可解释性，并且对错误决策（尤其是域外样本）有高置信度。我们提出图像翻译网络用于图像分类的泛化，并展示翻译网络作为传统黑盒分类器更可解释的替代方案的潜力。 方法：我们训练一个图像到图像网络，将输入图像翻译为类别特定的假设，然后通过视觉和定量方式将这些假设与输入进行比较。翻译距离（即为了符合某一类别所需的改变程度）被检查其聚类和趋势，并用作分类的简单低维特征向量。 结果：在黑色素瘤/良性皮肤镜图像上，翻译距离分类器仅使用2维特征空间就达到了80%的准确率（而传统CNN使用约62,000维特征空间达到85%）。对渲染图像的视觉检查揭示了数据集偏差，例如黑色素瘤照片中比良性病变有更多的比例尺。翻译距离空间中的图像分布揭示了沿着皮肤科医生活检决策的自然分离，而不是恶性与良性之间的分离。在骨髓细胞学图像上，翻译距离分类器在3类（92%准确率对比CNN的89%）和6类（90%对比86%）场景中均优于传统CNN。 结论：这一概念验证表明，图像到图像翻译有潜力超越艺术/风格变化，揭示数据集偏差，进行降维和数据集可视化，并且在某些情况下可能优于传统的端到端CNN分类器。

英文摘要

Purpose: A major barrier to the implementation of artificial intelligence for medical applications is automated CNNs' lack of explainability and high confidence for incorrect decisions, specifically with out-of-domain samples. We propose a generalization of image translation networks for image classification and demonstrate translation networks' potential as a more interpretable alternative to conventional black-box classifiers. Approach: We train an image-to-image network to translate an input image to class-specific hypotheticals, and then compare these with the input, both visually and quantitatively. Translation distances, the degree of alteration needed to conform to one class or another, are examined for clusters and trends, and used as a simple low-dimensional feature vector for classification. Results: On melanoma/benign dermoscopy images, a translation distance classifier achieved 80% accuracy using only a 2-dimensional feature space (versus 85% for a conventional CNN using a ~62,000-dimensional feature space). Visual inspection of rendered images revealed dataset biases, like more scalebars in melanoma photographs than in benign lesions. Image distributions in translation distance space revealed a natural separation along the lines of dermatologist decision to biopsy, rather than between malignant and benign. On bone marrow cytology images, translation distance classifiers outperformed a conventional CNN in both 3-class (92% accuracy vs 89% for CNN) and 6-class (90% vs 86% for CNN) scenarios. Conclusions: This proof-of-concept shows the potential for image-to-image translation to go beyond artistic/stylistic changes and to expose dataset biases, perform dimension reduction and dataset visualization, and in some cases, potentially outperform conventional end-to-end CNN classifiers.
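The idea of classifying from translation distances can be sketched as follows. Everything concrete here is an assumption for illustration: the hypotheticals are hand-written lists standing in for the outputs of a trained translation network, the distance is a mean absolute pixel change, and the decision rule simply picks the class needing the least alteration, whereas the abstract only says the distances serve as a low-dimensional feature vector.

```python
def translation_distance(image, hypothetical):
    """Mean absolute pixel change needed to turn `image` into `hypothetical`."""
    return sum(abs(a - b) for a, b in zip(image, hypothetical)) / len(image)

def classify(image, hypotheticals):
    """Return the least-altered class and the per-class distance features."""
    features = {label: translation_distance(image, h)
                for label, h in hypotheticals.items()}
    return min(features, key=features.get), features

image = [0.2, 0.4, 0.6, 0.8]                      # flattened toy image
hypotheticals = {                                 # stand-ins for network outputs
    "benign": [0.2, 0.5, 0.6, 0.8],
    "melanoma": [0.6, 0.9, 0.1, 0.3],
}
label, features = classify(image, hypotheticals)
```

With two classes the feature vector has two entries, matching the 2-dimensional feature space reported for the dermoscopy experiment.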

2506.01850 2026-06-08 cs.CV cs.AI cs.LG cs.MM 版本更新

MoDA: Modulation Adapter for Fine-Grained Visual Grounding in Instructional MLLMs

MoDA: 面向指令型多模态大语言模型的细粒度视觉定位的调制适配器

Wayner Barrios, Andrés Villa, Juan León Alcázar, SouYoung Jin, Bernard Ghanem

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出MoDA调制适配器，通过指令引导的通道级乘法调制增强细粒度视觉定位，在12个基准上对三种MLLM架构取得一致提升，计算开销极小。

Comments Accepted at ICML 2026. Code is available at https://github.com/waybarrios/MoDA

AI中文摘要

多模态大语言模型（MLLMs）通过将预训练的视觉编码器与大语言模型（LLMs）集成，在指令跟随任务中取得了显著成功。然而，现有方法由于视觉补丁表示中的语义纠缠，常常难以实现细粒度的视觉定位，其中单个补丁混合了多个不同的视觉元素，使得模型难以聚焦于指令相关的细节。为了应对这一挑战，我们提出了MoDA（调制适配器），一种轻量级模块，通过指令引导的通道级调制增强视觉定位。与Q-Former等执行加性特征选择的令牌级方法不同，MoDA通过对已对齐特征进行乘法调制在通道级操作，从而实现对每个指令相关嵌入维度的细粒度控制。遵循标准的LLaVA训练协议，MoDA在语言指令与预对齐的视觉特征之间应用交叉注意力，生成动态调制掩码，无需架构修改或额外监督。我们在涵盖视觉问答、视觉中心推理和幻觉检测的12个基准上评估了MoDA，包括最近的2024年基准（MMVP、CV-Bench、MMStar、RealWorldQA），并在三种不同的MLLM架构上进行了测试：LLaVA-1.5、LLaVA-MoRE（2025）和Qwen3-VL（2025）。MoDA在所有三个系列中均取得了一致的提升，在LLaVA-1.5系列的MMVP上提升了+12.0个百分点，在LLaVA-MoRE系列的ScienceQA上提升了+4.8个百分点，在Qwen3-VL上ScienceQA提升了+4.9、RealWorldQA提升了+4.1、GQA提升了+3.8，证实了这些增益在CLIP编码器之外具有泛化性，且计算开销极小（<1% FLOPs）。代码可在https://github.com/waybarrios/MoDA获取。

英文摘要

Multimodal Large Language Models (MLLMs) have achieved remarkable success in instruction-following tasks by integrating pretrained visual encoders with large language models (LLMs). However, existing approaches often struggle with fine-grained visual grounding due to semantic entanglement in visual patch representations, where individual patches blend multiple distinct visual elements, making it difficult for models to focus on instruction-relevant details. To address this challenge, we propose MoDA (Modulation Adapter), a lightweight module that enhances visual grounding through instruction-guided channel-wise modulation. Unlike token-level methods such as Q-Former that perform additive feature selection, MoDA operates at the channel level through multiplicative modulation on already-aligned features, enabling fine-grained control over which embedding dimensions are relevant for each instruction. Following the standard LLaVA training protocol, MoDA applies cross-attention between language instructions and pre-aligned visual features, generating dynamic modulation masks without architectural modifications or additional supervision. We evaluate MoDA across 12 benchmarks spanning visual question answering, vision-centric reasoning, and hallucination detection, including recent 2024 benchmarks (MMVP, CV-Bench, MMStar, RealWorldQA), on three distinct MLLM architectures: LLaVA-1.5, LLaVA-MoRE (2025), and Qwen3-VL (2025). MoDA delivers consistent gains across all three families, with +12.0 points on MMVP for the LLaVA-1.5 family and +4.8 points on ScienceQA for the LLaVA-MoRE family, and +4.9 ScienceQA, +4.1 RealWorldQA, and +3.8 GQA on Qwen3-VL, confirming that the gains generalize beyond CLIP-based encoders with minimal overhead (<1% FLOPs). Code is available at https://github.com/waybarrios/MoDA.
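Channel-wise multiplicative modulation, as opposed to token-level selection, can be sketched as a per-channel gate applied to every visual token. In MoDA the mask comes from cross-attention between the instruction and the visual features; here the logits are simply hand-set, and the sigmoid gate and function name are assumptions made for illustration.

```python
import math

def modulate(visual_tokens, channel_logits):
    """Scale each channel of each visual token by a sigmoid gate in (0, 1)."""
    gates = [1.0 / (1.0 + math.exp(-z)) for z in channel_logits]
    return [[x * g for x, g in zip(token, gates)] for token in visual_tokens], gates

tokens = [[1.0, 2.0, 3.0],          # two visual tokens, three channels each
          [4.0, 5.0, 6.0]]
logits = [0.0, 20.0, -20.0]         # hypothetical instruction-dependent logits
modulated, gates = modulate(tokens, logits)
```

The token count is unchanged; only the relative weight of embedding dimensions shifts, here halving channel 0, passing channel 1, and suppressing channel 2.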

2509.24935 2026-06-08 cs.CV cs.AI cs.LG 版本更新

Scalable GANs with Transformers

可扩展的Transformer生成对抗网络

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

发表机构 * KAIST（韩国科学技术院）

AI总结本文通过紧凑变分自编码器潜在空间和纯Transformer架构，研究了生成对抗网络的可扩展性，并提出了轻量级中间监督和宽度自适应学习率调整来解决缩放时的失败模式，在ImageNet-256上以40个epoch达到2.96的FID。

Comments ICML 2026

AI中文摘要

可扩展性推动了生成建模的最新进展，但其原理在对抗学习中仍未充分探索。我们通过两个在其他类型生成模型中被证明有效的设计选择来研究生成对抗网络（GAN）的可扩展性：在紧凑的变分自编码器潜在空间中训练，以及采用纯Transformer的生成器和判别器。在潜在空间中训练能够在保持感知保真度的同时实现高效计算，而这种效率与普通Transformer自然匹配，后者的性能随计算预算扩展。基于这些选择，我们分析了朴素缩放GAN时出现的失败模式。具体来说，我们发现了随着网络规模扩大，生成器早期层利用不足和优化不稳定的问题。因此，我们提供了简单且对缩放友好的解决方案，如轻量级中间监督和宽度自适应学习率调整。我们的实验表明，GAT——一种纯Transformer的潜在空间GAN——能够在从S到XL的广泛容量范围内可靠地训练。此外，GAT-XL/2在ImageNet-256上仅用40个epoch（比强基线少6倍）就达到了最先进的单步类条件生成性能（FID为2.96）。项目页面：https://hse1032.github.io/GAT。

英文摘要

Scalability has driven recent advances in generative modeling, yet its principles remain underexplored for adversarial learning. We investigate the scalability of Generative Adversarial Networks (GANs) through two design choices that have proven to be effective in other types of generative models: training in a compact Variational Autoencoder latent space and adopting purely transformer-based generators and discriminators. Training in latent space enables efficient computation while preserving perceptual fidelity, and this efficiency pairs naturally with plain transformers, whose performance scales with computational budget. Building on these choices, we analyze failure modes that emerge when naively scaling GANs. Specifically, we find issues such as underutilization of early layers in the generator and optimization instability as the network scales. Accordingly, we provide simple and scale-friendly solutions such as lightweight intermediate supervision and width-aware learning-rate adjustment. Our experiments show that GAT, a purely transformer-based, latent-space GAN, can be trained reliably across a wide range of capacities (S through XL). Moreover, GAT-XL/2 achieves state-of-the-art single-step, class-conditional generation performance (FID of 2.96) on ImageNet-256 in just 40 epochs, 6x fewer epochs than strong baselines. Project page: https://hse1032.github.io/GAT.

2511.05949 2026-06-08 cs.CV 版本更新

Zero-Shot Polygon Matching with Pre-trained Models for Pose Estimation and Polygon Cloud from Challenging Stereo

基于预训练模型的零样本多边形匹配用于挑战性立体图像的姿态估计和多边形云

Chang Li, Xingtao Peng

发表机构 * Chang Li（李昌）； Xingtao Peng（彭兴涛）

AI总结提出首个零样本多边形匹配范式Z(PM)2，结合预训练模型和手工几何约束，通过双向金字塔匹配和局部-整体二分图优化解决视差不连续、尺度变化等问题，在姿态估计和3D表示中取得领先性能。

AI中文摘要

尽管立体匹配在0D点和1D线基元上已经成熟，但由于视差不连续、尺度变化、训练依赖和泛化能力差等挑战，2D多边形的对应关系建立仍基本未被探索，限制了姿态估计和3D重建等下游任务。为了解决这些问题，我们首次提出了一种基于预训练模型的零样本多边形匹配范式（即Z(PM)2），通过即插即用模块结合学习特征和手工几何约束，将匹配从0D/1D基元扩展到2D多边形。该流程包括三个核心阶段：首先，检测器利用预训练的segment anything模型将分割掩码矢量化成图结构的多边形，融合几何和纹理；其次，全局匹配器使用双向金字塔和多几何约束处理视角变化；第三，局部匹配器利用局部-整体二分图优化解决视差不连续和拓扑不一致。此外，我们开发了多边形匹配引导的姿态估计，利用对应关系获得分布良好、低冗余的同名点，并首创多边形云概念及最优表面生成方法，生成结构完整、语义丰富的3D表示，超越点云和线云。由于没有可直接比较的立体图像多边形匹配方法，我们选择了最接近该任务的最先进方法作为基线。在五个具有挑战性的数据集（ISPRS、KITTI、ScanNet、SceneFlow、DTU）上的大量实验表明，Z(PM)2实现了68.60%的匹配面积分数，比MESA高出约32%，在区域级姿态估计中排名第一，具有竞争力的速度和强大的零样本泛化能力，无需任何训练要求。

英文摘要

While stereo matching has achieved maturity for 0D point and 1D line primitives, establishing correspondences for 2D polygons remains largely unexplored due to challenges including disparity discontinuity, scale variation, training dependency, and poor generalization, limiting downstream tasks such as pose estimation and 3D reconstruction. To address these issues, we are the first to propose a Zero-shot Polygon Matching paradigm with Pre-trained Models (i.e., Z(PM)2), which combines learned features and handcrafted geometric constraints through plug-and-play modules, extending matching from 0D/1D primitives to 2D polygons. The pipeline comprises three core stages. First, a detector leverages the pre-trained Segment Anything Model to vectorize segmentation masks into graph-structured polygons integrating geometry and texture. Second, a global matcher uses bidirectional-pyramid and multi-geometric constraints to handle viewpoint variation. Third, a local matcher leverages local-holistic bipartite graph optimization to resolve disparity discontinuity and topological inconsistency. Moreover, we develop polygon-matching-guided pose estimation using correspondences to obtain well-distributed, low-redundancy homologous points, and pioneer the polygon cloud concept with an optimal surface generation method, producing structurally complete and semantically rich 3D representations beyond point and line clouds. Since no polygon matching methods from stereo imagery are available for direct comparison, we selected state-of-the-art (SoTA) methods close to this task as baselines. Extensive experiments on five challenging datasets (ISPRS, KITTI, ScanNet, SceneFlow, DTU) show that Z(PM)2 achieves a 68.60% matching area score, outperforming MESA by approximately 32% and ranking first in area-level pose estimation, with competitive speed and strong zero-shot generalization without any training requirement.

2511.06080 2026-06-08 cs.CV cs.CY cs.HC 版本更新

AIDEN: Design and Pilot Study of an AI Assistant for the Visually Impaired

AIDEN：面向视障人士的AI助手设计与初步研究

Luis Marquez-Carpintero, Francisco Gomez-Donoso, Zuria Bauer, Bessie Dominguez-Dager, Alvaro Belmonte-Baeza, Mónica Pina-Navarro, Francisco Morillas-Espejo, Felix Escalona, Miguel Cazorla

发表机构 * Institute for Computer Research, University of Alicante（计算机研究所，阿利坎特大学）； ETH Zurich（苏黎世联邦理工学院）

AI总结提出AIDEN系统，结合YOLO实时目标检测、LLaVA场景描述与OCR，以及基于盖革计数器隐喻的连续触觉引导，避免听觉过载并保护隐私，实验表明用户满意度高。

AI中文摘要

本文介绍了AIDEN，一种基于人工智能的助手，旨在增强视障人士的自主性和日常生活质量，他们通常在物体识别、文本阅读和陌生环境导航方面遇到困难。现有的解决方案如屏幕阅读器或基于音频的助手虽然便于获取信息，但常常导致听觉过载，并在开放环境中引发隐私问题。AIDEN通过一种混合架构解决了这些限制，该架构集成了用于实时目标检测的YOLO（You Only Look Once）和用于场景描述及光学字符识别（OCR）的大型语言与视觉助手（LLaVA）。该系统的一个关键创新是基于盖革计数器隐喻的连续触觉引导机制，该机制在不占用听觉通道的情况下支持物体居中，同时通过确保不存储个人数据来保护隐私。与视障参与者进行的实证评估使用技术接受模型（TAM）评估了感知易用性和接受度。结果表明用户满意度高，特别是在直观性和感知自主性方面。此外，“寻找物体”功能实现了有效的实时性能。这些发现提供了有希望的证据，表明与传统的以音频为中心的方法相比，多模态触觉-视觉反馈可以改善日常可用性和独立性，从而推动更大规模的临床验证。

英文摘要

This paper presents AIDEN, an artificial intelligence-based assistant designed to enhance the autonomy and daily quality of life of visually impaired individuals, who often struggle with object identification, text reading, and navigation in unfamiliar environments. Existing solutions such as screen readers or audio-based assistants facilitate access to information but frequently lead to auditory overload and raise privacy concerns in open environments. AIDEN addresses these limitations with a hybrid architecture that integrates You Only Look Once (YOLO) for real-time object detection and a Large Language and Vision Assistant (LLaVA) for scene description and Optical Character Recognition (OCR). A key novelty of the system is a continuous haptic guidance mechanism based on a Geiger-counter metaphor, which supports object centering without occupying the auditory channel, while privacy is preserved by ensuring that no personal data are stored. Empirical evaluations with visually impaired participants assessed perceived ease of use and acceptance using the Technology Acceptance Model (TAM). Results indicate high user satisfaction, particularly regarding intuitiveness and perceived autonomy. Moreover, the "Find an Object" feature achieved effective real-time performance. These findings provide promising evidence that multimodal haptic-visual feedback can improve daily usability and independence compared to traditional audio-centric methods, motivating larger-scale clinical validations.
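One way to read the Geiger-counter metaphor is that vibration pulses arrive faster as the detected object approaches the centre of the camera frame. The sketch below maps a bounding-box centre to a pause between pulses; the linear mapping, the 80 ms and 800 ms limits, and the function name are all assumptions, since the abstract does not give AIDEN's actual mapping.

```python
def pulse_interval_ms(obj_x, obj_y, width, height, fastest=80.0, slowest=800.0):
    """Pause between vibration pulses; shorter when the object is nearer the centre."""
    dx = (obj_x - width / 2) / (width / 2)      # normalised horizontal offset
    dy = (obj_y - height / 2) / (height / 2)    # normalised vertical offset
    offset = min(1.0, (dx * dx + dy * dy) ** 0.5)   # 0 at centre, capped at 1
    return fastest + (slowest - fastest) * offset

centred = pulse_interval_ms(320, 240, 640, 480)     # object in the middle
halfway = pulse_interval_ms(480, 240, 640, 480)     # halfway to the right edge
off_side = pulse_interval_ms(640, 240, 640, 480)    # at the right edge
```

A continuous mapping like this conveys direction of improvement through touch alone, which is consistent with the stated goal of keeping the auditory channel free.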

2511.14019 2026-06-08 cs.CV 版本更新

RISE: Single Static Radar-based Indoor Scene Understanding

RISE：基于单静态雷达的室内场景理解

Kaichen Zhou, Laura Dodds, Sayed Saad Afzal, Fadel Adib

发表机构 * Massachusetts Institute of Technology（麻省理工学院）； Cartesian Systems

AI总结提出RISE系统，利用毫米波雷达的多径反射（传统视为噪声）编码几何线索，通过双角度多径增强和模拟到现实的分层扩散框架，实现布局重建和物体检测，在50,000帧数据集上布局重建倒角距离降低60%，首次实现基于毫米波雷达的物体检测。

AI中文摘要

鲁棒且保护隐私的室内场景理解仍然是一个基本开放问题。虽然光学传感器（如RGB和LiDAR）提供高空间保真度，但它们在室内环境中遭受严重遮挡并引入隐私风险。相比之下，毫米波雷达保护隐私并穿透障碍物，但其固有的低空间分辨率使得可靠的几何推理变得困难。我们介绍了RISE，这是首个用于单静态雷达室内场景理解的基准和系统，同时针对布局重建和物体检测。RISE基于一个关键洞察：多径反射——传统上被视为噪声——编码了丰富的几何线索。为了利用这一点，我们提出了一种双角度多径增强方法，显式建模到达角和离开角，以恢复二次（鬼影）反射并揭示不可见结构。在这些增强观测的基础上，一个模拟到现实的分层扩散框架将碎片化的雷达响应转化为完整的布局重建和物体检测。我们的基准包含100条真实室内轨迹中收集的50,000帧数据，形成了首个专门用于单静态雷达室内场景理解的大规模数据集。大量实验表明，与最先进的毫米波布局重建方法相比，RISE将倒角距离降低了60%（降至16厘米），并实现了首个基于毫米波雷达的物体检测，IoU达到58%。这些结果确立了RISE作为使用单静态雷达进行几何感知和隐私保护室内场景理解的新基础。我们的网站和代码可在https://rise-cvpr.github.io获取。

英文摘要

Robust and privacy-preserving indoor scene understanding remains a fundamental open problem. While optical sensors such as RGB and LiDAR offer high spatial fidelity, they suffer from severe occlusions and introduce privacy risks in indoor environments. In contrast, millimeter-wave (mmWave) radar preserves privacy and penetrates obstacles, but its inherently low spatial resolution makes reliable geometric reasoning difficult. We introduce RISE, the first benchmark and system for single-static-radar indoor scene understanding, jointly targeting layout reconstruction and object detection. RISE is built upon the key insight that multipath reflections, traditionally treated as noise, encode rich geometric cues. To exploit this, we propose a Bi-Angular Multipath Enhancement that explicitly models Angle-of-Arrival and Angle-of-Departure to recover secondary (ghost) reflections and reveal invisible structures. On top of these enhanced observations, a simulation-to-reality Hierarchical Diffusion framework transforms fragmented radar responses into complete layout reconstruction and object detection. Our benchmark contains 50,000 frames collected across 100 real indoor trajectories, forming the first large-scale dataset dedicated to single, static, radar-based indoor scene understanding. Extensive experiments show that RISE reduces the Chamfer Distance by 60% (down to 16 cm) compared to the state of the art in mmWave layout reconstruction, and delivers the first mmWave-based object detection, achieving 58% IoU. These results establish RISE as a new foundation for geometry-aware and privacy-preserving indoor scene understanding using a single static radar. Our website and code are available at https://rise-cvpr.github.io.
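The layout metric quoted above, Chamfer Distance, measures how far each point of one set is from its nearest neighbour in the other. The sketch uses one common symmetric variant (the average of the two directed mean distances, unsquared); several variants exist and the abstract does not say which one RISE reports, so treat the exact definition and the toy points as assumptions.

```python
import math

def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance: mean nearest-neighbour distance, both directions."""
    def one_way(src, dst):
        return sum(min(math.dist(p, q) for q in dst) for p in src) / len(src)
    return 0.5 * (one_way(points_a, points_b) + one_way(points_b, points_a))

# Toy 2D layout corners in metres; the last corner is off by 20 cm.
predicted = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
reference = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.2)]

cd = chamfer_distance(predicted, reference)
```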

2512.10521 2026-06-08 cs.CV 版本更新

Take a Peek: Efficient Encoder Adaptation for Few-Shot Semantic Segmentation via LoRA

Take a Peek：通过LoRA实现面向少样本语义分割的高效编码器适应

Pasquale De Marinis, Gennaro Vessio, Giovanna Castellano

发表机构 * University of Bari（巴里大学）

AI总结提出TaP方法，利用低秩适应（LoRA）微调编码器，在少样本和跨域少样本语义分割中实现高效适应，提升新类分割性能。

AI中文摘要

少样本语义分割（FSS）旨在仅使用少量标注支持集对查询图像中的新类进行分割。先前研究主要关注改进解码器，但编码器提取未见类有意义特征的能力有限仍是关键瓶颈。本文提出Take a Peek（TaP），一种简单而有效的方法，通过引入基于支持集的轻量级特征空间偏移，增强了编码器对FSS和跨域FSS的适应性。TaP利用低秩适应（LoRA）在支持集上微调编码器，计算开销极小，能够快速适应新类同时减轻灾难性遗忘。我们的方法模型无关，可无缝集成到现有FSS流程中。在多个基准（包括COCO $20^i$、Pascal $5^i$以及跨域数据集DeepGlobe、ISIC和Chest X-ray）上的大量实验表明，TaP在不同模型和shot设置下一致地提升了分割性能。值得注意的是，TaP在复杂的多类场景中取得了显著增益，突显了其在现实场景中的实际有效性。秩敏感性分析还表明，即使采用低秩适应也能实现强性能，从而确保计算效率。通过解决FSS中编码器泛化到新类的关键限制，TaP为构建更鲁棒、高效和可泛化的分割系统铺平了道路。代码可在 https://github.com/pasqualedem/TakeAPeek 获取。

英文摘要

Few-shot semantic segmentation (FSS) aims to segment novel classes in query images using only a small annotated support set. While prior research has mainly focused on improving decoders, the encoder's limited ability to extract meaningful features for unseen classes remains a key bottleneck. In this work, we introduce Take a Peek (TaP), a simple yet effective method that enhances encoder adaptability for both FSS and cross-domain FSS by inducing a lightweight feature-space shift conditioned on the support set. TaP leverages Low-Rank Adaptation to fine-tune the encoder on the support set with minimal computational overhead, enabling fast adaptation to novel classes while mitigating catastrophic forgetting. Our method is model-agnostic and can be seamlessly integrated into existing FSS pipelines. Extensive experiments across multiple benchmarks, including COCO $20^i$, Pascal $5^i$, and cross-domain datasets such as DeepGlobe, ISIC, and Chest X-ray, demonstrate that TaP consistently improves segmentation performance across diverse models and shot settings. Notably, TaP delivers significant gains in complex multi-class scenarios, highlighting its practical effectiveness in realistic settings. A rank sensitivity analysis also shows that strong performance can be achieved even with low-rank adaptations, thereby ensuring computational efficiency. By addressing a critical limitation in FSS, namely the encoder's generalization to novel classes, TaP paves the way toward more robust, efficient, and generalizable segmentation systems. The code is available at https://github.com/pasqualedem/TakeAPeek.
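
As a rough illustration of the Low-Rank Adaptation idea the abstract relies on, the sketch below applies the standard LoRA update, W + (alpha / r) · B · A, to a single frozen linear layer. It is not taken from the TaP repository: which encoder layers are adapted, the rank, and the scaling are not given in the abstract, and every name here is hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_forward(x, w_frozen, lora_a, lora_b, alpha):
    """Apply a frozen linear layer plus a low-rank correction.

    Shapes: x is (n, d_in), w_frozen is (d_out, d_in),
    lora_a is (r, d_in), lora_b is (d_out, r).
    Only lora_a and lora_b would be trained; w_frozen is never modified.
    """
    rank = len(lora_a)
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)  # (d_out, d_in) update of rank at most r
    w_eff = [[w + scale * d for w, d in zip(w_row, d_row)]
             for w_row, d_row in zip(w_frozen, delta)]
    # y = x @ w_eff^T: each row of w_eff holds the weights of one output unit.
    return [[sum(xi * wi for xi, wi in zip(x_row, w_row)) for w_row in w_eff]
            for x_row in x]

identity = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 0.0]]             # rank-1 down-projection
b_zero = [[0.0], [0.0]]      # zero-initialised B leaves the layer unchanged
b_tuned = [[0.0], [1.0]]     # hypothetical values after support-set fine-tuning
print(lora_forward([[2.0, 3.0]], identity, a, b_zero, alpha=1.0))
print(lora_forward([[2.0, 3.0]], identity, a, b_tuned, alpha=1.0))
```

Initialising B to zero means the adapted encoder starts out identical to the pretrained one, which is one reason this family of methods is associated with reduced forgetting.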

2512.12997 2026-06-08 cs.CV cs.AI cs.LG 版本更新

Calibrating Uncertainty for Zero-Shot Adversarial CLIP

校准零样本对抗性CLIP的不确定性

Wenjing Lu, Zerui Tao, Yuning Qiu, Dongping Zhang, Yang Yang, Qibin Zhao

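Affiliations: University of Science and Technology of China

AI summary: To address CLIP's vulnerability to adversarial attacks and its poor uncertainty calibration in zero-shot classification, proposes an adversarial fine-tuning objective based on a Dirichlet reparameterization that jointly aligns semantic structure and confidence, improving calibration and robustness.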

Comments ICML 2026

AI中文摘要

CLIP在零样本分类中表现强劲，但仍易受对抗攻击。先前的对抗微调工作主要匹配干净样本和对抗样本之间的预测logits，忽略了不确定性校准，可能损害零样本泛化能力。在可靠的不确定性估计中，一个常见期望是预测不确定性应随输入难度增加或偏离训练分布而上升。然而，在对抗环境中我们经常观察到相反的情况：扰动不仅降低准确性，还抑制不确定性，导致严重的校准错误和过度自信。这揭示了鲁棒性之外的关键可靠性差距。为弥合这一差距，我们提出了一种考虑准确性和不确定性的CLIP对抗微调目标。通过将CLIP输出重参数化为狄利克雷分布的浓度参数，我们提出了一种统一表示，捕获相对语义结构和置信度大小。这使得在扰动下实现整体分布对齐，超越单一logits锚定，恢复校准的不确定性。在多个零样本基准上的实验表明，我们的方法显著提高了不确定性校准，在保持干净准确性的同时实现了具有竞争力的对抗鲁棒性。

英文摘要

CLIP delivers strong zero-shot classification but remains highly vulnerable to adversarial attacks. Prior adversarial fine-tuning work primarily matches predicted logits between clean and adversarial examples, which overlooks uncertainty calibration and may degrade the zero-shot generalization. A common expectation in reliable uncertainty estimation is that predictive uncertainty should increase as inputs become more difficult or shift away from the training distribution. However, we frequently observe the opposite in the adversarial setting: perturbations not only degrade accuracy but also suppress uncertainty, leading to severe miscalibration and over-confidence. This reveals a critical reliability gap beyond robustness. To bridge this gap, we propose an adversarial fine-tuning objective for CLIP considering both accuracy and uncertainty. By reparameterizing CLIP outputs as the concentration parameters of a Dirichlet distribution, we propose a unified representation that captures relative semantic structure and confidence magnitude. This enables holistic distribution alignment under perturbations, moving beyond single-logit anchoring and restoring calibrated uncertainty. Experiments across multiple zero-shot benchmarks demonstrate that our method significantly improves uncertainty calibration and achieves competitive adversarial robustness while preserving clean accuracy.
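
To make the Dirichlet view concrete, here is a small sketch of how logits can be read as concentration parameters. The abstract does not say which mapping the authors use; the exponential mapping below is an assumption chosen because its Dirichlet mean equals the usual softmax, and the function name is hypothetical.

```python
import math

def dirichlet_from_logits(logits):
    """Interpret logits z_k as Dirichlet concentrations alpha_k = exp(z_k).

    Returns (alpha, mean, strength):
      mean     = alpha / sum(alpha), the relative semantic structure (softmax of z)
      strength = sum(alpha), a confidence magnitude that softmax discards
    """
    alpha = [math.exp(z) for z in logits]
    strength = sum(alpha)
    mean = [a / strength for a in alpha]
    return alpha, mean, strength

# Two logit vectors that differ only by a constant shift.
clean = dirichlet_from_logits([2.0, 1.0, 0.0])
shifted = dirichlet_from_logits([3.0, 2.0, 1.0])
print(clean[1], clean[2])
print(shifted[1], shifted[2])
```

Both inputs give the same class probabilities but different total concentration, which is the sense in which a single representation can carry both relative structure and confidence magnitude.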

2601.04791 2026-06-08 cs.CV cs.LG 版本更新

Measurement-Consistent Langevin Corrector for Stabilizing Latent Diffusion Inverse Problem Solvers

用于稳定潜在扩散逆问题求解器的测量一致朗之万校正器

Lee Hyoseok, Sohwi Lim, Eunju Cha, Tae-Hyun Oh

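Affiliations: Sookmyung Women's University

AI summary: To address the instability of latent diffusion inverse problem solvers, proposes the Measurement-Consistent Langevin Corrector (MCLC), which narrows the gap between the solver and stable reverse diffusion through measurement-consistent Langevin updates, yielding stable and reliable latent-space solving.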

Comments ICML 2026

2601.09698 2026-06-08 cs.CV 版本更新

COMPOSE: Hypergraph Cover Optimization for Multi-view 3D Human Pose Estimation

COMPOSE：用于多视角三维人体姿态估计的超图覆盖优化

Tony Danjun Wang, Tolga Birdal, Nassir Navab, Lennart Bastian

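Affiliations: School of Computation, Information, and Technology, Technical University of Munich; Munich Center for Machine Learning; Department of Computing, Imperial College London

AI summary: Proposes COMPOSE, which recasts multi-view 3D human pose estimation as a weighted exact-cover optimization over a hypergraph, replacing local pairwise association with a global combinatorial objective and combining geometric pruning with Integer Linear Programming or Belief Propagation solvers; it improves accuracy substantially without 3D supervision.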

AI中文摘要

从稀疏多视角相机装置中进行三维人体姿态估计是众多应用（包括动作识别、体育分析和人机交互）的基本任务。尽管学习方法在基准测试中占据主导地位，但它们需要大量标注数据集；无训练的基于优化的方法仍然有前景，因为它们通过解决来自二维检测的跨视角对应问题来规避三维监督。现有的组合公式依赖配对关联来建模这一对应问题，并将跨视角的全局一致性仅作为下游约束来强制执行。然而，在遮挡和噪声检测下，调和局部合理的配对匹配变得脆弱，局部错误会全局传播。我们提出COMPOSE，它将多视角三维人体姿态估计重新定义为对人物假设超图上的加权精确覆盖优化。我们的公式用单个全局组合目标替代了配对关联和事后一致性强制执行。为了应对指数级大的候选空间，我们引入了一种几何剪枝策略以及两种互补的求解器：精确整数线性规划公式和通过信念传播的可扩展松弛。在没有任何三维监督的情况下，COMPOSE在平均精度上比最佳基于优化的方法提高了31个百分点，比自监督学习方法提高了13个百分点，证明了高阶组合关联在无训练的多视角三维人体姿态估计中的有效性。

英文摘要

3D human pose estimation from sparse multi-view camera rigs is an essential task for numerous applications, including action recognition, sports analysis, and human-robot interaction. While learned methods dominate the field on benchmarks, they require large annotated datasets; training-free optimization-based methods remain promising as they circumvent 3D supervision by solving a correspondence problem across views from 2D detections. Existing combinatorial formulations rely on pairwise associations to model this correspondence problem and enforce global consistency across views only as a downstream constraint. However, reconciling locally plausible pairwise matches becomes brittle under occlusion and noisy detections, where local errors propagate globally. We propose COMPOSE, which recasts multi-view 3D human pose estimation as a weighted exact-cover optimization over a hypergraph of person hypotheses. Our formulation replaces pairwise association and post-hoc consistency enforcement with a single global combinatorial objective. To address the exponentially large candidate space, we introduce a geometric pruning strategy alongside two complementary solvers: an exact Integer Linear Programming formulation and a scalable relaxation via Belief Propagation. Without any 3D supervision, COMPOSE improves average precision by up to 31 points over the best optimization-based method and 13 points over self-supervised learned methods, demonstrating the effectiveness of higher-order combinatorial association for training-free multi-view 3D human pose estimation.
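
The weighted exact-cover formulation can be illustrated on a toy instance. The sketch below enumerates subsets of hyperedges by brute force instead of using the ILP or Belief Propagation solvers named in the abstract, and it assumes that higher weights are better. The detections and scores are invented for illustration.

```python
from itertools import combinations

def best_exact_cover(detections, hyperedges):
    """Find the highest-weight set of hyperedges covering each detection exactly once.

    hyperedges: list of (frozenset_of_detections, weight); each hyperedge is one
    person hypothesis grouping 2D detections from different views.
    Brute force, so only suitable for tiny instances.
    """
    universe = set(detections)
    best_weight, best_choice = float("-inf"), None
    for size in range(1, len(hyperedges) + 1):
        for choice in combinations(range(len(hyperedges)), size):
            covered = [d for i in choice for d in hyperedges[i][0]]
            # Exact cover: no detection is reused and none is left out.
            if len(covered) == len(set(covered)) and set(covered) == universe:
                weight = sum(hyperedges[i][1] for i in choice)
                if weight > best_weight:
                    best_weight, best_choice = weight, choice
    return best_choice, best_weight

# Hypothetical detections: two in view A, two in view B, one in view C.
detections = ["a1", "a2", "b1", "b2", "c1"]
hypotheses = [
    (frozenset({"a1", "b1", "c1"}), 2.5),
    (frozenset({"a2", "b2"}), 1.5),
    (frozenset({"a1", "b2"}), 1.0),
    (frozenset({"a2", "b1", "c1"}), 1.2),
]
print(best_exact_cover(detections, hypotheses))
```

Because each hypothesis spans several views at once, consistency across views is part of the objective itself; it does not have to be enforced after pairwise matching.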

2601.22574 2026-06-08 cs.CV cs.AI 版本更新

Enhancing Video Representations with Spatiotemporal-Semantic Residual to Mitigate Hallucinations in Video Large Multimodal Models

利用时空-语义残差增强视频表示以缓解视频大型多模态模型中的幻觉

Yuansheng Gao, Jinman Zhao, Tong Zhang, Xingguo Xu, Wenbin Xing, Han Bao, Zonghui Wang, Wenzhi Chen

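Affiliations: Zhejiang University; University of Toronto; Dalian University of Technology; Sun Yat-sen University

AI summary: Proposes ViSSRes, which learns residuals for video representations with a lightweight MLP network, optimized for spatiotemporal and semantic consistency; it needs only a single forward pass at inference, lowers hallucination rates, and improves video understanding.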

Comments Preprint

AI中文摘要

尽管视频大型多模态模型在视频理解方面取得了强劲性能，但它们仍然存在幻觉问题。现有的推理时干预方法通常在对比解码框架下修改视频，但其启发式设计带来的改进有限且增加了推理延迟。为了解决这些问题，我们提出了ViSSRes，一种通过轻量级MLP风格网络增强视频表示的推理时干预方法。具体来说，我们使用对比随机游走方法来表征视频表示的时空一致性，并引入条件互信息将视频表示与模型的语义理解关联起来。在保持模型主干冻结的情况下，ViSSRes学习视频表示的残差，并从时空和语义一致性角度优化它们。在推理时，ViSSRes仅需单次前向传播，且不会引入显著的额外推理成本。实验表明，ViSSRes在EventHallusion上将LLaVA-NeXT-Video的幻觉率降低了40.69%，并在CoT设置下将MMVU上的视频理解提升了18.36%，证明了其在缓解幻觉方面的有效性。

英文摘要

Although Video Large Multimodal Models have achieved strong performance in video understanding, they still suffer from hallucination. Existing inference-time intervention methods usually modify videos under the contrastive decoding framework, but their heuristic designs bring limited improvements and increase inference latency. To address these issues, we propose ViSSRes, an inference-time intervention method that enhances video representations through a lightweight MLP-style network. Specifically, we use a contrastive random walk approach to characterize the spatiotemporal consistency of video representations, and introduce conditional mutual information to associate video representations with the model's semantic understanding. With the model backbone kept frozen, ViSSRes learns residuals for video representations and optimizes them from both spatiotemporal and semantic consistency perspectives. During inference, ViSSRes requires only a single forward pass and introduces no substantial additional inference cost. Experiments show that ViSSRes reduces the hallucination rate of LLaVA-NeXT-Video on EventHallusion by 40.69% and improves video understanding on MMVU by 18.36% under the CoT setting, demonstrating its effectiveness in mitigating hallucinations.

2602.00163 2026-06-08 cs.CV q-bio.NC 版本更新

Deep Learning Pose Estimation for Multi-Label Recognition of Combined Hyperkinetic Movement Disorders

基于深度学习姿态估计的联合多动性运动障碍多标签识别

Laura Cif, Diane Demailly, Gabriella A. Horvàth, Juan Dario Ortigoza Escobar, Nathalie Dorison, Mayté Castro Jiménez, Cécile A. Hubsch, Thomas Wirth, Gun-Marie Hariz, Sophie Huby, Morgan Dornadic, Zohra Souei, Muhammad Mushhood Ur Rehman, Simone Hemm, Mehdi Boulayme, Eduardo M. Moraud, Jocelyne Bloch, Xavier Vasques

Affiliations: Lausanne University Hospital (CHUV) and University of Lausanne (UNIL); Institut du Neurone; Department of Neurology, Clinique Beau Soleil, Institut Mutualiste Montpelliérain; Department of Pediatrics, British Columbia Children’s Hospital; Movement Disorders Unit, Pediatric Neurology Department, Institut de Recerca, Hospital Sant Joan de Déu; European Reference Network for Rare Neurological Diseases (ERN-RND); U-703 Centre for Biomedical Research on Rare Diseases (CIBER-ER), Instituto de Salud Carlos III; Pediatric Neurosurgery Department, CCMR Neurogenetique, European Reference Network Brainteam Member, Rothschild Foundation Hospital; Department of Neurology, University Hospital of Strasbourg; Strasbourg Neuroscience Institute, Strasbourg University; Institute of Genetics and Cellular biology

AI summary: To address the subjectivity of clinical recognition of hyperkinetic movement disorders (HMD) and their overlapping phenotypes, proposes a pose-based machine learning framework that extracts keypoint time series from routine clinical videos and computes multi-dimensional kinematic features for multi-label classification.

2602.02014 2026-06-08 cs.CV cs.AI cs.CL cs.LG 版本更新

Rethinking Genomic Modeling Through Optical Character Recognition

通过光学字符识别重新思考基因组建模

Hongxin Xiang, Pengsen Ma, Yunkang Cao, Di Yu, Haowen Chen, Xinyu Yang, Xiangxiang Zeng

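Affiliations: National University of Singapore; University of Science and Technology of China

AI summary: Proposes OpticalDNA, a framework that renders DNA into visual layouts and uses a vision-language model for OCR-style genomic understanding, enabling high-fidelity compression and efficient processing of long sequences; on sequences up to 450k bases it outperforms baselines with nearly 20 times fewer effective tokens.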

Comments Accepted by ICML 2026

AI中文摘要

最近的基因组基础模型大多采用大型语言模型架构，将DNA视为一维token序列。然而，穷举式顺序阅读在结构上与稀疏且不连续的基因组语义不匹配，导致在低信息背景上的计算浪费，并阻碍了面向长上下文的压缩理解。在此，我们提出OpticalDNA，一个基于视觉的框架，将基因组建模重新定义为光学字符识别（OCR）风格的文档理解。OpticalDNA将DNA渲染为结构化视觉布局，并训练一个具备OCR能力的视觉语言模型，该模型包含视觉DNA编码器和文档解码器，其中编码器生成紧凑、可重建的视觉token以实现高保真压缩。基于这种表示，OpticalDNA定义了基于提示条件的核心基因组原语目标——读取、区域定位、子序列检索和掩码跨度补全——从而学习到布局感知的DNA表示，在减少的有效token预算下保留细粒度的基因组信息。在多种基因组基准测试中，OpticalDNA持续优于最近的基线模型；在长达450k碱基的序列上，它以近20倍更少的有效token实现了最佳整体性能，并且仅调整256k可训练参数就超越了激活参数多达985倍的模型。

英文摘要

Recent genomic foundation models largely adopt large language model architectures that treat DNA as a one-dimensional token sequence. However, exhaustive sequential reading is structurally misaligned with sparse and discontinuous genomic semantics, leading to wasted computation on low-information background and preventing understanding-driven compression for long contexts. Here, we present OpticalDNA, a vision-based framework that reframes genomic modeling as Optical Character Recognition (OCR)-style document understanding. OpticalDNA renders DNA into structured visual layouts and trains an OCR-capable vision-language model with a visual DNA encoder and a document decoder, where the encoder produces compact, reconstructible visual tokens for high-fidelity compression. Building on this representation, OpticalDNA defines prompt-conditioned objectives over core genomic primitives (reading, region grounding, subsequence retrieval, and masked span completion), thereby learning layout-aware DNA representations that retain fine-grained genomic information under a reduced effective token budget. Across diverse genomic benchmarks, OpticalDNA consistently outperforms recent baselines; on sequences up to 450k bases, it achieves the best overall performance with nearly 20$\times$ fewer effective tokens, and surpasses models with up to 985$\times$ more activated parameters while tuning only 256k trainable parameters.

2602.07025 2026-06-08 cs.CV cs.AI 版本更新

The Geometry of Representational Failures in Vision Language Models

视觉语言模型中表征失败的几何结构

Daniele Savietto, Declan Campbell, André Panisson, Marco Nurisso, Giovanni Petri, Jonathan D. Cohen, Alan Perotti

Affiliations: Dipartimento di Fisica, Università di Torino; Princeton Neuroscience Institute and AI Lab, Princeton University; Intesa Sanpaolo AI Research; Dipartimento di Scienze Matematiche, Politecnico di Torino; Network Science Institute, Northeastern University London, UK

AI summary: Analyzes the geometric overlap of concept vectors in open-weight vision-language models, links errors such as hallucination in multi-object visual tasks to cognitive constraints, and proposes intervention-based validation.

AI中文摘要

视觉语言模型在多目标视觉任务中表现出令人困惑的失败，例如幻觉不存在的元素或未能识别干扰中最相似的物体。虽然这些错误反映了人类的认知约束，如“绑定问题”，但在人工系统中驱动这些错误的内部机制仍然知之甚少。在这里，我们通过分析开源视觉语言模型（Qwen、InternVL、Gemma）的表征几何结构，提出了一种机制性见解，比较了提炼“概念向量”（编码视觉概念的潜在方向）的方法。我们通过引导干预验证了概念向量，这些干预在简化和自然视觉任务中可靠地操纵模型行为（例如，强制模型将红色花朵感知为蓝色）。我们观察到这些向量之间的几何重叠与特定错误模式强相关，提供了一个有依据的定量框架来理解内部表征如何塑造模型行为并驱动视觉失败。

英文摘要

Vision-Language Models (VLMs) exhibit puzzling failures in multi-object visual tasks, such as hallucinating non-existent elements or failing to identify the most similar objects among distractions. While these errors mirror human cognitive constraints, such as the "Binding Problem", the internal mechanisms driving them in artificial systems remain poorly understood. Here, we propose a mechanistic insight by analyzing the representational geometry of open-weight VLMs (Qwen, InternVL, Gemma), comparing methodologies to distill "concept vectors" (latent directions encoding visual concepts). We validate our concept vectors via steering interventions that reliably manipulate model behavior in both simplified and naturalistic vision tasks (e.g., forcing the model to perceive a red flower as blue). We observe that the geometric overlap between these vectors strongly correlates with specific error patterns, offering a grounded quantitative framework to understand how internal representations shape model behavior and drive visual failures.

2602.07026 2026-06-08 cs.CV cs.AI cs.MM 版本更新

Modality Gap-Driven Subspace Alignment Training Paradigm For Multimodal Large Language Models

模态间隙驱动的子空间对齐训练范式用于多模态大语言模型

Xiaomin Yu, Yi Xin, Yuhui Zhang, Wenjie Zhang, Chonghan Liu, Hanzhen Zhao, Chen Liu, Xiaoxing Hu, Ziyue Qiao, Hao Tang, Xiaobin Hu, Chengwei Qin, Hui Xiong, Yu Qiao, Shuicheng Yan

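Affiliations: HKUST(GZ); NUS; sh AILab; SII; Stanford; UCLA; Yale; SJTU; GBU; PKU

AI summary: To address the modality gap in multimodal contrastive learning, proposes the Fixed-frame Modality Gap Theory and, building on it, the training-free alignment strategy ReAlign and the scalable training paradigm ReVision, which use unpaired data to align visual and language representations efficiently.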

AI中文摘要

尽管多模态对比学习在视觉和语言表示对齐方面取得了成功，但一个持久的几何异常——模态间隙——仍然存在：表达相同语义的不同模态的嵌入位于系统性偏移的区域。先前弥合这一间隙的方法大多受限于过于简化的各向同性假设，阻碍了它们在大规模场景中的应用。在本文中，我们通过精确刻画模态间隙的几何形状并利用它进行高效模型扩展来解决这些局限性。首先，我们提出了固定帧模态间隙理论，该理论将冻结参考帧内的模态间隙分解为稳定偏差和各向异性残差。在这种精确建模的指导下，我们引入了ReAlign，一种无需训练的模态对齐策略。利用大量无配对数据的统计信息，ReAlign通过锚点、轨迹和质心对齐三步过程将文本表示对齐到图像表示分布，从而显式纠正几何错位。基于ReAlign，我们提出了ReVision，一种用于多模态大语言模型（MLLMs）的可扩展训练范式。ReVision将ReAlign集成到预训练阶段，使模型在视觉指令微调之前从无配对文本中学习视觉表示的分布，无需大规模、高质量的图像-文本对。我们的框架表明，统计对齐的无配对数据可以有效替代昂贵的图像-文本对，为MLLMs的高效扩展提供了一条稳健的路径。

英文摘要

Despite the success of multimodal contrastive learning in aligning visual and linguistic representations, a persistent geometric anomaly, the Modality Gap, remains: embeddings of distinct modalities expressing identical semantics occupy systematically offset regions. Prior approaches to bridge this gap are largely limited by oversimplified isotropic assumptions, hindering their application in large-scale scenarios. In this paper, we address these limitations by precisely characterizing the geometric shape of the modality gap and leveraging it for efficient model scaling. First, we propose the Fixed-frame Modality Gap Theory, which decomposes the modality gap within a frozen reference frame into stable biases and anisotropic residuals. Guided by this precise modeling, we introduce ReAlign, a training-free modality alignment strategy. Utilizing statistics from massive unpaired data, ReAlign aligns text representations with the image representation distribution via a three-step process comprising Anchor, Trace, and Centroid Alignment, thereby explicitly rectifying geometric misalignment. Building on ReAlign, we propose ReVision, a scalable training paradigm for Multimodal Large Language Models (MLLMs). ReVision integrates ReAlign into the pretraining stage, enabling the model to learn the distribution of visual representations from unpaired text before visual instruction tuning, without the need for large-scale, high-quality image-text pairs. Our framework demonstrates that statistically aligned unpaired data can effectively substitute for expensive image-text pairs, offering a robust path for the efficient scaling of MLLMs.

2602.15287 2026-06-08 cs.CV 版本更新

Consistency-Preserving Diverse Video Generation

保持一致性的多样化视频生成

Xinshuang Liu, Runfa Blark Li, Truong Nguyen

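Affiliations: University of California, Berkeley

AI summary: Proposes a joint-sampling framework that raises the diversity of videos within a batch in text-to-video generation while preserving temporal consistency, using lightweight latent-space models to avoid video decoding and decoder backpropagation.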

AI中文摘要

文本到视频生成成本高昂，因此每个提示通常只生成少量样本。在这种低样本情况下，最大化每批的价值需要高跨视频多样性。最近的方法提高了图像生成的多样性，但对于视频，它们常常降低视频内的时间一致性，并且需要通过视频解码器进行昂贵的反向传播。我们提出了一种用于流匹配视频生成器的联合采样框架，该框架在保持时间一致性的同时提高了批次多样性。我们的方法应用多样性驱动的更新，然后仅移除会降低时间一致性目标的分量。为了避免图像空间梯度，我们使用轻量级潜在空间模型计算两个目标，避免了视频解码和解码器反向传播。在最新的文本到视频流匹配模型上的实验表明，我们的方法在接近强联合采样基线的多样性的同时，显著提高了时间一致性和颜色自然度。我们的代码可在 https://github.com/XinshuangL/Diverse-Video 获取。

英文摘要

Text-to-video generation is expensive, so only a few samples are typically produced per prompt. In this low-sample regime, maximizing the value of each batch requires high cross-video diversity. Recent methods improve diversity for image generation, but for videos they often degrade within-video temporal consistency and require costly backpropagation through a video decoder. We propose a joint-sampling framework for flow-matching video generators that improves batch diversity while preserving temporal consistency. Our approach applies diversity-driven updates and then removes only the components that would decrease a temporal-consistency objective. To avoid image-space gradients, we compute both objectives with lightweight latent-space models, avoiding video decoding and decoder backpropagation. Experiments on a state-of-the-art text-to-video flow-matching model show diversity close to strong joint-sampling baselines while substantially improving temporal consistency and color naturalness. Our code is available at https://github.com/XinshuangL/Diverse-Video.
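
The step "apply a diversity-driven update, then remove only the components that would decrease a temporal-consistency objective" resembles a gradient projection. The sketch below shows one way to read it, assuming a first-order test against the gradient of the consistency objective; the authors' actual rule may differ, and the function name is hypothetical.

```python
def consistency_preserving_update(diversity_step, consistency_grad):
    """Remove from a diversity step the component that opposes the consistency gradient.

    If the step already has a non-negative dot product with the gradient of the
    consistency objective, it cannot lower that objective to first order and is
    returned unchanged. Otherwise the conflicting component is projected out.
    """
    dot = sum(d * g for d, g in zip(diversity_step, consistency_grad))
    grad_sq = sum(g * g for g in consistency_grad)
    if dot >= 0 or grad_sq == 0:
        return list(diversity_step)
    coeff = dot / grad_sq
    return [d - coeff * g for d, g in zip(diversity_step, consistency_grad)]

# Toy 2D latent: consistency increases along the second axis.
print(consistency_preserving_update([1.0, -1.0], [0.0, 1.0]))  # conflicting part removed
print(consistency_preserving_update([1.0, 1.0], [0.0, 1.0]))   # already compatible
```

Only the harmful component is dropped, so the part of the update that increases diversity without affecting consistency is kept in full.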

2602.19213 2026-06-08 cs.CV 版本更新

OmniSch: A Multimodal PCB Schematic Benchmark for Structured Diagram Visual Reasoning

Taiting Lu, Kaiyuan Lin, Yuxin Tian, Mingjia Wang, Yubo Wang, Muchuan Wang, Sharique Khatri, Akshit Kartik, Yixi Wang, Amey Santosh Rane, Yida Wang, Sung-Liang Chen, Yifan Yang, Yi-Chao Chen, Yincheng Jin, Mahanth Gowda

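Affiliations: Pennsylvania State University, USA; Independent Researcher; Binghamton University, USA; Shanghai Jiao Tong University, China; Microsoft Research

AI summary: Introduces OmniSch, the first multimodal benchmark for PCB schematic understanding, with four tasks that evaluate large models on visual grounding, graph reasoning, and geometric reasoning, revealing substantial gaps in how existing models understand engineering diagrams.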

AI中文摘要

近期大型多模态模型（LMMs）在视觉定位、文档理解和图表推理任务中取得了快速进展。然而，它们将印刷电路板（PCB）原理图转换为机器可读的空间加权网表图（同时捕获组件属性、连接性和几何信息）的能力仍未被充分探索，尽管这种图表示是实际电子设计自动化（EDA）工作流的基石。为弥补这一差距，我们引入了OmniSch，这是首个旨在评估LMMs在原理图理解和空间网表图构建方面的综合基准。OmniSch包含1,854张真实世界原理图，并包括四项任务：（1）原理图实体的视觉定位，包含109.9K个定位实例，将423.4K个图表语义标签与其视觉区域对齐；（2）图到图推理，理解图表元素间的拓扑关系；（3）几何推理，为每个连接构建依赖于布局的权重；（4）用于视觉搜索的工具增强型智能体推理，调用外部工具完成（1）-（3）。我们的结果揭示了当前LMMs在解释原理图工程制品方面的显著差距，包括不可靠的细粒度定位、脆弱的布局到图解析、不一致的全局连通性推理以及低效的视觉探索。

英文摘要

Recent large multimodal models (LMMs) have made rapid progress in visual grounding, document understanding, and diagram reasoning tasks. However, their ability to convert Printed Circuit Board (PCB) schematic diagrams into machine-readable spatially weighted netlist graphs, jointly capturing component attributes, connectivity, and geometry, remains largely underexplored, even though such graph representations are the backbone of practical electronic design automation (EDA) workflows. To bridge this gap, we introduce OmniSch, the first comprehensive benchmark designed to assess LMMs on schematic understanding and spatial netlist graph construction. OmniSch contains 1,854 real-world schematic diagrams and includes four tasks: (1) visual grounding for schematic entities, with 109.9K grounded instances aligning 423.4K diagram semantic labels to their visual regions; (2) diagram-to-graph reasoning, understanding topological relationships among diagram elements; (3) geometric reasoning, constructing layout-dependent weights for each connection; and (4) tool-augmented agentic reasoning for visual search, invoking external tools to accomplish (1)-(3). Our results reveal substantial gaps in how current LMMs interpret schematic engineering artifacts, including unreliable fine-grained grounding, brittle layout-to-graph parsing, inconsistent global connectivity reasoning, and inefficient visual exploration.

2604.10578 2026-06-08 cs.CV 版本更新

Rein3D: Reinforced 3D Indoor Scene Generation with Panoramic Video Diffusion Models

Rein3D: 基于全景视频扩散模型的强化3D室内场景生成

Dehui Wang, Rong Wei, Yue Shi, Congsheng Xu, Shoufa Chen, Dingxiang Luo, Tianshuo Yang, Xiaokang Yang, Wei Sui, Yusen Qin, Rui Tang, Yao Mu

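Affiliations: Shanghai Jiao Tong University; Manycore Tech Inc.; D-Robotics; The University of Hong Kong

AI summary: Proposes Rein3D, a framework that couples 3D Gaussian Splatting with video diffusion models and follows a "restore-and-refine" paradigm to generate globally consistent 360-degree indoor scenes from sparse inputs; it also builds the PanoV2V-15K dataset and significantly improves long-range camera exploration.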

AI中文摘要

随着具身AI和VR应用需求的增长，从稀疏输入合成高质量3D室内场景变得尤为重要。然而，现有方法在推断大量未观测区域中的缺失几何结构时难以保持全局一致性，往往产生局部合理但全局不一致的重建结果。我们提出Rein3D，一个通过将显式3D高斯泼溅（3DGS）与视频扩散模型的时间一致先验相结合来重建完整360度室内环境的框架。我们的方法遵循“恢复-细化”范式：采用径向探索策略，沿从原点开始的轨迹渲染不完美的全景视频，从而从粗略的3DGS初始化中有效揭示被遮挡区域。这些序列由全景视频到视频扩散模型恢复，并通过视频超分辨率进一步增强，以合成高保真几何和纹理。最后，这些细化后的视频作为伪真值更新全局3D高斯场。为支持此任务，我们构建了PanoV2V-15K数据集，包含超过15K对干净和退化的全景视频，用于基于扩散的场景恢复。实验表明，Rein3D生成逼真且全局一致的3D场景，与现有基线相比，显著改善了长距离相机探索。

英文摘要

The growing demand for Embodied AI and VR applications has highlighted the need for synthesizing high-quality 3D indoor scenes from sparse inputs. However, existing approaches struggle to infer massive amounts of missing geometry in large unseen areas while maintaining global consistency, often producing locally plausible but globally inconsistent reconstructions. We present Rein3D, a framework that reconstructs full 360-degree indoor environments by coupling explicit 3D Gaussian Splatting (3DGS) with temporally coherent priors from video diffusion models. Our approach follows a "restore-and-refine" paradigm: we employ a radial exploration strategy to render imperfect panoramic videos along trajectories starting from the origin, effectively uncovering occluded regions from a coarse 3DGS initialization. These sequences are restored by a panoramic video-to-video diffusion model and further enhanced via video super-resolution to synthesize high-fidelity geometry and textures. Finally, these refined videos serve as pseudo-ground truths to update the global 3D Gaussian field. To support this task, we construct PanoV2V-15K, a dataset of over 15K paired clean and degraded panoramic videos for diffusion-based scene restoration. Experiments demonstrate that Rein3D produces photorealistic and globally consistent 3D scenes and significantly improves long-range camera exploration compared with existing baselines.

2604.20123 2026-06-08 cs.CV 版本更新

Topology-Aware Skeleton Detection via Lighthouse-Guided Structured Inference

拓扑感知的骨架检测：基于灯塔引导的结构化推理

Daoyong Fu, Xiang Zhang, Zhaohuan Zhan, Fan Yang, Ke Yang

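AI summary: Proposes Lighthouse-Skel, which detects a skeleton confidence field and structural anchors with two collaborating branches and uses a lighthouse-guided strategy to reconnect discontinuous skeletons, improving skeleton continuity and structural integrity.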

Comments This submission is withdrawn by the authors because we identified substantive issues in the current version that may affect the reliability and interpretation of the results. We are conducting a thorough revision and validation before making the work publicly available again

AI中文摘要

在自然图像中，物体骨架用于表示几何形状。然而，姿态或运动的轻微变化可能导致骨架结构的显著变化，增加骨架检测的难度，并常常导致不连续的骨架。现有方法主要关注点级骨架点检测，忽视了结构连续性在恢复完整骨架中的重要性。为解决此问题，我们提出Lighthouse-Skel，一种通过灯塔引导的结构化推理实现拓扑感知的骨架检测方法。具体来说，我们引入了一个双分支协作检测框架，联合学习骨架置信场和结构锚点（包括端点和连接点）。点分支学习的空间分布引导网络关注拓扑脆弱区域，从而提高骨架检测的准确性。基于学习的骨架置信场，我们进一步提出灯塔引导的拓扑补全策略，该策略将检测到的连接点和断点作为灯塔，沿低成本路径重连不连续的骨架段，从而改善骨架连续性和结构完整性。在四个公开数据集上的实验结果表明，所提方法在实现竞争性检测精度的同时，显著提升了骨架的连通性和结构完整性。

英文摘要

In natural images, object skeletons are used to represent geometric shapes. However, even slight variations in pose or movement can cause noticeable changes in skeleton structure, increasing the difficulty of detecting the skeleton and often resulting in discontinuous skeletons. Existing methods primarily focus on point-level skeleton detection and overlook the importance of structural continuity in recovering complete skeletons. To address this issue, we propose Lighthouse-Skel, a topology-aware skeleton detection method via lighthouse-guided structured inference. Specifically, we introduce a dual-branch collaborative detection framework that jointly learns a skeleton confidence field and structural anchors, including endpoints and junction points. The spatial distributions learned by the point branch guide the network to focus on topologically vulnerable regions, which improves the accuracy of skeleton detection. Based on the learned skeleton confidence field, we further propose a lighthouse-guided topology completion strategy, which uses detected junction points and breakpoints as lighthouses to reconnect discontinuous skeleton segments along low-cost paths, thereby improving skeleton continuity and structural integrity. Experimental results on four public datasets demonstrate that the proposed method achieves competitive detection accuracy while substantially improving skeleton connectivity and structural integrity.

2605.14166 2026-06-08 cs.CV 版本更新

You Only Landmark Once: Lightweight U-Net Face Super Resolution with YOLO-World Landmark Heatmaps

你只需一次地标：基于YOLO-World地标热图的轻量级U-Net人脸超分辨率

Riccardo Carraro, Anna Briotto, Endi Hysa, Marco Fiorucci, Lamberto Ballan

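Affiliations: Università degli Studi di Milano; Istituto Italiano di Tecnologia

AI summary: Proposes a lightweight U-Net supervised with landmark heatmaps generated by YOLO-World, requiring no additional training of an auxiliary network, to achieve 8x face super-resolution with better detail in key regions.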

Comments Accepted for publication at IEEE AVSS 2026 (Notification date: June 5, 2026)

AI中文摘要

人脸图像超分辨率旨在从严重退化的输入中恢复高分辨率人脸图像。在极端放大因子下，精细的面部细节常常丢失，使得准确重建具有挑战性。现有方法通常依赖重型网络架构、对抗训练方案或单独的对齐网络，增加了模型复杂度和计算成本。为解决这些问题，我们提出了一种基于轻量级U-Net的架构，旨在从严重退化的$16 \times 16$输入重建$128 \times 128$面部图像，实现$8 \times$放大。一个关键贡献是一种新颖的无辅助训练监督策略，利用YOLO-World（一种开放词汇目标检测器）生成的热图来定位关键面部特征，如眼睛、鼻子和嘴巴。这些热图被转换为空间权重，形成热图引导的损失，强调语义重要区域的重建误差。与先前需要专用地标或对齐网络的方法不同，我们的方法直接重用检测器输出作为监督，保持高效的训练和推理流程。在对齐的CelebA数据集上的实验表明，所提出的损失一致地改善了定量指标，并产生了更清晰、更逼真的重建。总体而言，我们的结果表明，轻量级网络可以有效地利用检测驱动的先验进行感知上令人信服的极端放大，而无需对抗训练或增加计算成本。

英文摘要

Face image super-resolution aims to recover high-resolution facial images from severely degraded inputs. Under extreme upscaling factors, fine facial details are often lost, making accurate reconstruction challenging. Existing methods typically rely on heavy network architectures, adversarial training schemes, or separate alignment networks, increasing model complexity and computational cost. To address these issues, we propose a lightweight U-Net-based architecture designed to reconstruct $128 \times 128$ facial images from severely degraded $16 \times 16$ inputs, achieving an $8 \times$ magnification. A key contribution is a novel auxiliary-training-free supervision strategy that leverages heatmaps generated by YOLO-World, an open-vocabulary object detector, to localize key facial features such as eyes, nose, and mouth. These heatmaps are converted into spatial weights to form a heatmap-guided loss that emphasizes reconstruction errors in semantically important regions. Unlike prior methods that require dedicated landmark or alignment networks, our approach directly reuses detector outputs as supervision, maintaining an efficient training and inference pipeline. Experiments on the aligned CelebA dataset demonstrate that the proposed loss consistently improves quantitative metrics and produces sharper, more realistic reconstructions. Overall, our results show that lightweight networks can effectively exploit detection-driven priors for perceptually convincing extreme upscaling, without adversarial training or increased computational cost.
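
A heatmap-guided loss of the kind described can be sketched as a per-pixel weighted reconstruction error. The abstract does not give the exact formula, so the L1 error, the `1 + lam * heatmap` weighting, and the normalisation below are all assumptions made for illustration.

```python
def heatmap_weighted_l1(pred, target, heatmap, lam=1.0):
    """Weighted mean absolute error over a single-channel image.

    Each pixel gets weight 1 + lam * heatmap value, so errors inside landmark
    regions (eyes, nose, mouth) count more than errors elsewhere.
    """
    total, weight_sum = 0.0, 0.0
    for p_row, t_row, h_row in zip(pred, target, heatmap):
        for p, t, h in zip(p_row, t_row, h_row):
            w = 1.0 + lam * h
            total += w * abs(p - t)
            weight_sum += w
    return total / weight_sum

# Toy 2x2 image: two pixels are wrong, one of them lies on a landmark.
pred = [[0.0, 0.0], [0.0, 0.0]]
target = [[1.0, 0.0], [0.0, 1.0]]
heatmap = [[1.0, 0.0], [0.0, 0.0]]
print(heatmap_weighted_l1(pred, target, heatmap, lam=0.0))  # plain L1
print(heatmap_weighted_l1(pred, target, heatmap, lam=1.0))  # landmark error emphasised
```

Since the heatmaps come from an existing detector and only enter the loss, a scheme like this adds no parameters and no inference-time cost to the super-resolution network.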

2605.19611 2026-06-08 cs.CV cs.ET 版本更新

Physics Guided Conditional Diffusion Framework for Generative Inverse Design of Manufacturable Metasurface based Absorbers

基于物理引导的条件扩散模型的超材料吸收体逆向设计

Vineetha Joy, Jamshed Palai, Satwik Sahu, Anshuman Kumar, Amit Sethi, Hema Singh

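Affiliations: Centre for Electromagnetics, CSIR-National Aerospace Laboratories; Birla Institute of Technology and Science, Pilani; Indian Institute of Technology, Bombay

AI summary: Proposes a physics-guided conditional diffusion framework for designing metasurface absorbers with specified electromagnetic responses; using feature-wise linear modulation and a pre-trained surrogate electromagnetic simulator, it improves design efficiency and conditioning accuracy, and experiments show that it quickly generates practical metasurface structures in the 2-18 GHz range.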

AI中文摘要

针对特定电磁响应的超材料逆向设计需要生成满足严格频谱约束且可制造的几何结构。传统设计方法依赖于全波仿真进行迭代优化，对于大设计空间来说非常耗时且计算密集。此外，常用的生成方法往往条件保真度有限，生成的设计通常包含精细或不规则特征，难以制造。为此，我们提出了一种物理引导的条件质量增强扩散框架，用于超材料吸收体的逆向设计。在这里，由目标反射特性构成的条件信息通过特征线性调制（FiLM）整合到模型中。此外，为了确保符合目标频谱，嵌入了预训练的替代电磁模拟器，通过频谱级损失函数引入物理感知的正则化。通过在2至18GHz频率范围内生成不同类型的反射特性实用的超材料结构，证明了所提模型的有效性。该框架实现了目标频谱与生成设计频谱之间的平均频谱均方误差为0.0006，频段对齐精度为0.958，显示出高条件准确性。此外，模型为相同条件生成多种几何结构，从而为工程师提供多样化的设计选择。所提模型在约30秒内生成合适的设计，而传统方法在同等计算资源下需要数月时间。模型的效率还通过实验测量得到验证。

英文摘要

Inverse design of metasurfaces under continuous electromagnetic constraints requires generation of geometries that simultaneously satisfy stringent spectral specifications and remain manufacturable. Conventional approaches based on iterative full-wave simulations are computationally prohibitive for large design spaces, while existing generative models often suffer from poor conditional controllability and limited fabrication awareness. To address this, we propose a physics-guided, condition-quality-enhanced diffusion framework for the inverse design of metasurface-based absorbers. Fabrication-aware constraints are incorporated to ensure practical realizability of the generated designs. The framework introduces a conditioning mechanism for continuous spectral specifications, wherein feature-wise linear modulation propagates the condition across the denoising hierarchy, enabling stable and accurate generation with improved spectral controllability. Further, to embed electromagnetic (EM) consistency directly into the generative learning process, a pre-trained surrogate EM simulator is integrated within the diffusion training pipeline. The proposed framework generates physically realizable metasurface designs for diverse reflection characteristics in the frequency range of 2 to 18 GHz, achieving a very low average spectral mean squared error of 0.0006 and a high band alignment accuracy of 0.958. The framework also addresses the fundamentally non-unique nature of inverse EM design by enabling structured multimodal generation of geometrically distinct yet spectrally consistent metasurface designs for the same target response. The proposed model produces a suitable design in approximately 30 seconds, whereas the conventional approach can take several months under comparable computational resources. The efficiency of the model is also established via experimental measurements.
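
Feature-wise linear modulation (FiLM), the conditioning mechanism named above, scales and shifts each feature channel with values computed from the condition. The sketch below shows the operation on toy data. The linear maps that turn a spectral condition into `gamma` and `beta` are hypothetical stand-ins, since the abstract does not describe the conditioning network.

```python
def condition_to_film_params(condition, w_gamma, w_beta):
    """Map a condition vector to per-channel (gamma, beta) with two linear layers.

    Assumption: gamma is parameterised as 1 + linear(condition), so a zero
    condition leaves the features untouched.
    """
    gamma = [1.0 + sum(w * c for w, c in zip(row, condition)) for row in w_gamma]
    beta = [sum(w * c for w, c in zip(row, condition)) for row in w_beta]
    return gamma, beta

def film(features, gamma, beta):
    """Apply gamma * x + beta to every value of each channel."""
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]

# Two feature channels and a two-value target reflection spectrum (toy numbers).
features = [[1.0, 2.0], [3.0, 4.0]]
condition = [0.5, 1.0]
gamma, beta = condition_to_film_params(
    condition,
    w_gamma=[[1.0, 0.0], [0.0, 1.0]],
    w_beta=[[0.0, 0.0], [2.0, 0.0]],
)
print(film(features, gamma, beta))
```

In a denoising network the same condition would typically produce a separate `(gamma, beta)` pair for each block, which is one way to read "propagates the condition across the denoising hierarchy".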


2605.20950 2026-06-08 cs.CV cs.AI 版本更新

Focus-then-Context: Subject-Centric Progressive Visual Token Reduction for Vision-Language Models

聚焦-然后-上下文：面向视觉-语言模型的主体导向渐进视觉标记缩减

Yulin Zhao, Zheng Zhang

发表机构 * Harbin Institute of Technology, Shenzhen, China（哈尔滨工业大学（深圳））； ShenZhen Loop Area Institute（深圳河套学院）

AI总结本文提出了一种主体导向的渐进视觉标记缩减方法SPpruner，通过模拟人类视觉感知系统的'聚焦-然后-上下文'机制，有效减少视觉标记数量，提升视觉-语言模型的推理效率，实验表明其在速度和资源消耗上均优于现有方法。

详情

AI中文摘要

视觉-语言模型（VLMs）在推理过程中面临由于大规模视觉标记序列带来的计算成本瓶颈。现有的视觉标记缩减方法虽然减轻了这一负担，但无意中保留了与用户查询严格对齐的孤立视觉主体，无法充分探索显著主体及其上下文关系。本文提出SPpruner，一种以主体为中心的渐进缩减范式，模拟人类视觉感知系统的'聚焦-然后-上下文'机制。具体而言，我们首先构建了一个聚焦识别模块，以显式建模视觉显著性与语义相关性之间的相互作用。在此基础上，它可以挖掘全面的视觉主体光谱，确保视觉输入的高保真表示。随后，开发了一个上下文感知的结构扫描模块，用于聚合邻近区域的上下文线索。因此，它可以有效恢复全局关系依赖，以维持保留主体的结构完整性。大量实验表明，我们的范式在速度和资源消耗上均优于现有方法，在Qwen2.5-VL中仅保留22.2%的视觉标记即可实现2.53倍的加速，在LLaVA中实现67%的FLOPs减少，仅导致0.6%的精度下降。

英文摘要

Vision-Language Models (VLMs) face a bottleneck of prohibitive computational costs arising from massive visual token sequences during inference. Existing vision token reduction methods alleviate this burden, but they unintentionally preserve the isolated visual subject strictly aligned with the user's query, which fails to substantially explore salient subjects and their contextual relationships. In this paper, we propose SPpruner, a subject-centric progressive reduction paradigm that emulates the Focus-then-Context mechanism of the human visual perception system. Specifically, we first construct a focus identification module to explicitly model the interplay between visual saliency and semantic relevance. Herein, it can excavate the comprehensive visual subject spectrum to ensure a high-fidelity representation of visual input. Subsequently, a context-aware structural scanning module is developed to aggregate contextual cues from neighboring regions. As such, it can effectively restore global relational dependencies to uphold the structural integrity of the preserved subjects. Extensive experiments demonstrate that our paradigm consistently outperforms SOTA methods, achieving up to 2.53 times speedup with only 22.2% of visual tokens retained in Qwen2.5-VL and a 67% FLOPs reduction on LLaVA with a negligible 0.6% accuracy drop.
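
To make the idea of retaining a fraction of visual tokens concrete, here is a minimal sketch of score-based top-k token retention. It is not SPpruner: the product of saliency and relevance as the ranking score, the function name `prune_tokens`, and the default `keep_ratio` are assumptions for illustration only, and the abstract's context-aware scanning step is not modelled.

```python
from typing import List


def prune_tokens(saliency: List[float], relevance: List[float], keep_ratio: float = 0.222) -> List[int]:
    """Return the indices of the visual tokens to keep, in original order.

    Each token gets a score saliency[i] * relevance[i] (an assumed combination),
    and the top keep_ratio fraction of tokens is retained."""
    if len(saliency) != len(relevance):
        raise ValueError("saliency and relevance must have the same length")
    scores = [s * r for s, r in zip(saliency, relevance)]
    k = max(1, round(keep_ratio * len(scores)))          # always keep at least one token
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    return sorted(top)                                   # preserve spatial order


if __name__ == "__main__":
    saliency = [0.9, 0.1, 0.5, 0.8, 0.2, 0.3, 0.7, 0.4, 0.6]
    relevance = [1.0] * 9
    print(prune_tokens(saliency, relevance))             # keeps 2 of 9 tokens
```

Keeping the surviving indices in their original order matters in practice, because positional information in the language model depends on token order.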


2605.22882 2026-06-08 cs.CV cs.RO 版本更新

GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation

GEM-4D：用于机器人操作的几何增强视频世界模型

Kaichen Zhou, Yuzhen Chen, Fangneng Zhan, Hang Hua, Grace Chen, Xinhai Chang, Ao Qu, Yilun Du, Zhuang Liu, Paul Pu Liang, Mengyu Wang

发表机构 * Harvard AI and Robotics Lab（哈佛人工智能与机器人实验室）； Harvard University（哈佛大学）； Media Lab and EECS（媒体实验室和电子工程与计算机科学系）； MIT（麻省理工学院）； Princeton University（普林斯顿大学）； MIT-IBM Watson AI Lab（麻省理工-IBM沃森人工智能实验室）

AI总结提出GEM-4D，通过注入从预训练几何基础模型蒸馏的密集4D对应监督，增强视频世界模型的几何一致性，并引入逆动力学模块将视频推演结果（rollout）转换为可执行机器人轨迹，提升操作成功率。

Comments Robotic World Model, Video Generative Model

详情

AI中文摘要

视频世界模型可以从单个指令生成逼真的未来帧，但它们通常无法在时间上一致地跟踪相同的物理点。因此，生成的视频看似合理，但缺乏可靠动作执行（如机器人操作）所需的物理基础。我们提出GEM-4D，一种几何锚定的视频世界模型，通过在训练期间将预训练几何基础模型蒸馏的密集4D对应监督注入视频生成骨干网络来解决这一限制。这种监督使模型能够联合捕捉外观和几何结构，同时保持单流架构且无额外推理成本。我们进一步引入逆动力学模块，将对应关系一致的视频推演结果（rollout）转换为可执行的机器人轨迹，从而能够在真实世界和模拟操作中直接部署。GEM-4D在视频预测和几何一致性方面在模拟和真实场景中均达到最先进性能，并将真实世界操作成功率从61%提升至81%。更多结果见https://gem-4d.github.io/。

英文摘要

Video world models can generate realistic futures from a single instruction, but they often fail to track the same physical points consistently across time. As a result, the generated videos appear plausible, yet lack the physical grounding required for reliable action execution, such as robot manipulation. We present GEM-4D, a geometry-grounded video world model that resolves this limitation by injecting dense 4D correspondence supervision distilled from a pretrained geometry foundation model into the video generative backbone during training. This supervision enables the model to jointly capture appearance and geometric structure while retaining a single-stream architecture with no additional inference cost. We further introduce an inverse dynamics module that converts correspondence-consistent video rollouts into executable robot trajectories, enabling direct deployment in both real-world and simulated manipulation. GEM-4D achieves state-of-the-art performance on both video prediction and geometric consistency across both simulation and realistic scenarios and improves real-world manipulation success from 61% to 81%. Additional results are available at https://gem-4d.github.io/.


2605.24011 2026-06-08 cs.CV cs.AI 版本更新

ActQuant: Sub-4-bit Action-Guided Quantization for Vision-Language-Action Models

ActQuant: 面向视觉-语言-动作模型的亚4比特动作引导量化

Arash Akbari, Arman Akbari, Masih Eskandar, Qitao Tan, Yixiao Chen, Jingwu Luo, Bertha Pangaribuan, Liyun Zhang, Jennifer Dy, Geng Yuan, Xue Lin, Gaowen Liu, Stratis Ioannidis, Yanzhi Wang

发表机构 * Northeastern University（东北大学）； University of California, Berkeley（加州大学伯克利分校）

AI总结提出ActQuant框架，通过动作引导的混合精度后训练量化，在亚4比特权重量化下保持VLA模型性能，并引入OmniModel.cpp实现高效部署。

详情

AI中文摘要

视觉-语言-动作（VLA）模型在具身智能中展现出卓越的动作生成能力，但其高计算量使得在边缘平台部署不切实际。激进的亚4比特权重量化是自然解决方案，但现有后训练量化（PTQ）方法在此情况下性能严重下降。为解决此问题，我们引入ActQuant，一个动作引导的混合精度PTQ框架，包含两个阶段：（1）张量间比特分配器，根据每个权重矩阵对预测智能体动作的贡献程度分配单一比特宽度；（2）张量内尺度优化器，使用动作感知曲率调整每块量化尺度，使动态范围集中在控制影响最大的权重上。为了在设备上实现激进量化的优势，我们进一步引入OmniModel.cpp，一个代理转换流水线，将架构移植到具有高效低位内核的原生C/C++运行时。我们在仿真和真实世界的6自由度UR3机械臂上评估ActQuant，所有模型通过OmniModel.cpp部署。在LIBERO基准上，ActQuant是唯一在每权重3比特或以下运行的方法，在OpenVLA-OFT上保持95.0%的性能，在$π_{0.5}$上保持94.8%。进一步，ActQuant在OpenVLA-OFT上达到2.5 bpw，性能为90.1%，将骨干网络从14.3 GB压缩到2.7 GB（5.3倍）。在物理UR3机械臂上，使用ActQuant量化的$π_{0.5}$保持基线的成功率，同时将内存占用减少2.5倍。

英文摘要

Vision-Language-Action (VLA) models exhibit remarkable action generation for embodied intelligence, but their heavy compute makes deployment on edge platforms impractical. Aggressive, sub-4-bit weight quantization is the natural solution, yet existing post-training quantization (PTQ) methods suffer severe performance degradation in this regime. To address this, we introduce ActQuant, an action-guided mixed-precision PTQ framework that operates in two stages: (1) an inter-tensor bit allocator that assigns each weight matrix a single bit-width based on how much it contributes to predicting the agent's actions; (2) an intra-tensor scale optimizer that tunes per-block quantization scales using action-aware curvature, so that dynamic range is concentrated on the weights most influential for control. To deliver the on-device benefits of our aggressive quantization, we further introduce OmniModel.cpp, an agentic conversion pipeline that ports architectures into a native C/C++ runtime with efficient low-bit kernels. We evaluate ActQuant both in simulation and on a real-world 6-DoF UR3 arm, with all models deployed through OmniModel.cpp. On the LIBERO benchmark, ActQuant is the only method that operates at or below 3 bits-per-weight, retaining 95.0% on OpenVLA-OFT and 94.8% on $π_{0.5}$. Pushed further, ActQuant reaches 2.5 bpw at 90.1% on OpenVLA-OFT, compressing the backbone from 14.3 GB to 2.7 GB (5.3$\times$). On the physical UR3 arm, $π_{0.5}$ quantized with ActQuant retains the baseline's success rate while reducing the memory footprint by 2.5$\times$.
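
The two ingredients named in the abstract, per-block quantization scales and a mixed-precision bit budget, can be sketched as follows. This is plain symmetric uniform quantization with a max-abs scale, shown only to fix ideas; ActQuant's action-aware bit allocation and curvature-based scale tuning are not reproduced here, and all function names are hypothetical.

```python
from typing import List, Tuple


def quantize_block(weights: List[float], bits: int) -> Tuple[List[int], float]:
    """Symmetric uniform quantization of one weight block.
    Returns integer codes in [-qmax, qmax] and the block scale."""
    if bits < 2:
        raise ValueError("this sketch needs at least 2 bits")
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0       # max-abs scale (assumed, untuned)
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_block(codes: List[int], scale: float) -> List[float]:
    """Map integer codes back to approximate real-valued weights."""
    return [c * scale for c in codes]


def average_bits(bit_widths: List[int], sizes: List[int]) -> float:
    """Average bits-per-weight when tensor i has sizes[i] weights at bit_widths[i] bits."""
    return sum(b * n for b, n in zip(bit_widths, sizes)) / sum(sizes)


if __name__ == "__main__":
    codes, scale = quantize_block([1.0, -3.0, 0.0, 3.0], bits=3)
    print(codes, scale, dequantize_block(codes, scale))
    print(average_bits([2, 3], [100, 100]))              # two equal tensors at 2 and 3 bits
    print(round(14.3 / 2.7, 1))                          # compression ratio quoted in the abstract
```

A scale optimizer in the spirit of the abstract would replace the max-abs rule with a search over candidate scales per block, scored by an action-prediction loss rather than plain weight error.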


2605.25757 2026-06-08 cs.CV 版本更新

Broadband Hyperspectral 3D Imaging using Dispersed Structured Light

宽带高光谱3D成像：使用色散结构光

Suhyun Shin, Yunseong Moon, Ryota Maeda, David B. Lindell, Kiriakos N. Kutulakos, Seung-Hwan Baek

发表机构 * POSTECH, South Korea（韩国浦项科技大学）； University of Hyogo, Japan（日本兵库县立大学）； University of Toronto, Canada（加拿大多伦多大学）

AI总结提出一种基于单光谱仪的宽带高光谱3D成像方法，通过可见光和SWIR相机立体设置，利用色散结构光同时重建密集宽带高光谱反射率和精确3D几何，解决了传统方法光谱范围窄、系统复杂的问题。

详情

AI中文摘要

高光谱3D成像能够捕获密集的光谱信息和场景几何，但传统上局限于窄光谱窗口，通常是可见光范围。在这项工作中，我们引入了一种宽带高光谱3D成像（BH3D）方法，将这一能力扩展到整个可见-近红外和短波红外（SWIR）光谱（450-1500 nm）。这种宽覆盖范围至关重要，因为它捕获了互补的物理线索：可见光波长揭示表面外观，而SWIR波段提供对次表面特性和材料组成的洞察。然而，实现BH3D具有挑战性，因为可见光谱硅传感器和SWIR光谱InGaAs传感器之间存在基本的传感器限制，需要复杂的多光谱仪设计。在这里，我们提出了一种单光谱仪BH3D系统，使用包含可见光和SWIR相机的立体设置，重建密集的宽带高光谱反射率以及精确的3D几何。我们的关键思想是使用单个光谱仪将色散结构光扩展到宽带范围。我们建模了宽带色散结构光的图像形成过程，并估计了高光谱反射率和深度。我们在多样化的真实场景上验证了我们的方法，展示了精确的重建，平均光谱角映射器为0.13 rad，均方根误差为0.03，平均深度误差为4.5 mm。我们进一步展示了识别同色异谱材料、通过不透明层成像、揭示钞票上的隐藏特征以及显示血管的能力。

英文摘要

Hyperspectral 3D imaging enables the capture of dense spectral information and scene geometry but has traditionally been confined to narrow spectral windows, typically the visible range. In this work, we introduce a broadband hyperspectral 3D imaging (BH3D) method to extend this capability across the full visible-near-infrared and short-wavelength infrared (SWIR) spectrum (450-1500 nm). This broad coverage is critical as it captures complementary physical cues: visible wavelengths reveal surface appearance, while SWIR bands provide insight into subsurface properties and material composition. However, realizing BH3D is challenging due to fundamental sensor constraints between visible-spectrum silicon and SWIR-spectrum InGaAs sensors, which necessitate complex multi-spectrograph designs. Here we propose a single-spectrograph BH3D system, using a stereo setup comprising visible and SWIR cameras, that reconstructs dense broadband hyperspectral reflectance together with accurate 3D geometry. Our key idea is to extend dispersed structured light to the broadband regime using a single spectrograph. We model the image formation of broadband dispersed structured light, and estimate hyperspectral reflectance and depth. We validate our approach on diverse real-world scenes, demonstrating accurate reconstruction with a mean spectral angle mapper of 0.13 rad, root mean square error of 0.03, and mean depth error of 4.5 mm. We further demonstrate identifying metameric materials, performing imaging through opaque layers, uncovering hidden features on banknotes, and revealing blood vessels.
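
The two spectral error metrics quoted in the abstract, the spectral angle mapper (SAM) and the root mean square error, have standard definitions that can be written down directly. The sketch below uses those textbook definitions; how the authors average over pixels and scenes is not stated in the abstract, so the per-pixel functions here are only the building blocks.

```python
import math
from typing import List


def spectral_angle(estimated: List[float], reference: List[float]) -> float:
    """Spectral angle in radians between two reflectance spectra:
    arccos( <a, b> / (|a| |b|) ). It is invariant to overall intensity scaling."""
    dot = sum(a * b for a, b in zip(estimated, reference))
    norm_a = math.sqrt(sum(a * a for a in estimated))
    norm_b = math.sqrt(sum(b * b for b in reference))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("spectral angle is undefined for an all-zero spectrum")
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))  # clamp against rounding error
    return math.acos(cosine)


def rmse(estimated: List[float], reference: List[float]) -> float:
    """Root mean square error between two spectra of equal length."""
    n = len(estimated)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(estimated, reference)) / n)


if __name__ == "__main__":
    ref = [0.2, 0.4, 0.6, 0.5]
    est = [0.25, 0.38, 0.61, 0.45]
    print(spectral_angle(est, ref), rmse(est, ref))
```

Because the angle ignores overall scale, SAM measures spectral shape, while RMSE also penalizes brightness errors; reporting both, as the abstract does, covers the two failure modes.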


2605.25806 2026-06-08 cs.CV 版本更新

An Analysis Focused on Womens Safety: Can VAD Models Be Enhanced by a Multi-modal Dataset?

聚焦女性安全分析：多模态数据集能否增强VAD模型？

Sangeeta ., Maddikuntla Sai Prajwal, Debi Prosad Dogra, Kamalakar Vijay Thakare, Hyungjoo Jung, Ig-Jae Kim, Heeseung Choi

发表机构 * Indian Institute of Technology Bhubaneswar（印度理工学院布巴内斯瓦尔分校）； Artificial Intelligence and Robotics Institute, Korea Institute of Science and Technology（韩国科学技术研究院人工智能与机器人研究所）； Yonsei-KIST Convergence Research Institute, Yonsei University（延世大学延世-KIST融合研究院）

AI总结针对现有视频异常检测数据集缺乏女性中心异常样本的问题，提出包含1001个视频及文本描述的多模态基准ExtrAnom，覆盖5种犯罪类型，并验证了多模态方法在检测女性中心异常上的有效性。

Comments 7 pages, 6 figures, 4 tables

详情

AI中文摘要

女性安全对于现代社会至关重要。针对女性的犯罪既发生在白天也发生在低光照条件下。通常，此类事件通过低分辨率的现实监控摄像头捕捉。尽管计算机视觉相关研究取得了显著进展，但专注于女性安全的视频异常检测（VAD）尚未得到充分解决。现有的视频异常数据集包含光照良好、高分辨率、近景视频，未能涵盖女性中心异常，如抢项链、跟踪、不当触摸及其他针对女性的细微犯罪形式。为解决这些问题，我们提出了ExtrAnom数据集，这是一个新的多模态基准，包含1001个带有文本描述的视频（500个正常，501个异常），分为5种不同类型的女性中心犯罪。该数据集包含低光照（8%）、低分辨率（13%）、远景（15%）以及白天（64%）异常视频。它涵盖了异常事件如跟踪（3.9%）、抢项链（17.6%）、绑架（7.3%）、暗杀（2.3%）、骚扰（18.9%）和正常（50%）。每个视频附带4个文本标注，包括一个人工生成和三个大语言模型生成的描述，支持跨模态和基于视觉语言模型（VLM）的验证。创建女性中心数据集的目标是准确检测可能通过视觉观察到的女性中心异常模式。该数据集辅助VLM准确生成视频级描述。ExtrAnom已针对流行的单模态和多模态VAD数据集（如XD-Violence、UCF-Crime和UCA）及最先进方法进行了基准测试。实验表明，现有数据集不足以训练模型检测女性中心异常。

英文摘要

Women's safety and security are paramount for a modern society. Crimes against women occur in daylight as well as in low-light conditions. Often, such events are captured by real-world surveillance cameras that operate at lower resolutions. Despite substantial progress in CV-related research, video anomaly detection (VAD) focused on women's safety has not yet been adequately addressed. Existing video anomaly datasets contain well-lit, high-resolution, close-shot videos, and fail to represent women-centric anomalies such as chain snatching, stalking, inappropriate touch, and other subtle forms of crime against women. To address these problems, we propose the ExtrAnom dataset, a new multi-modal benchmark containing 1001 videos with textual descriptions, 500 normal and 501 anomalous, classified into 5 different types of women-centric crimes. The dataset comprises low-light (8%), low-resolution (13%), long-shot (15%), and daylight (64%) anomalous videos. It covers anomalous events such as stalking (3.9%), chain snatching (17.6%), kidnapping (7.3%), assassinations (2.3%), and harassment (18.9%), along with normal videos (50%). Each video is supplemented with 4 textual annotations, including one human-generated and three LLM-generated descriptions, enabling cross-modal and VLM-based validations. The aim of creating a women-centric dataset is to accurately detect women-centric anomaly patterns that can be observed visually. The dataset also helps VLMs generate accurate video-level descriptions. ExtrAnom has been benchmarked against popular unimodal and multi-modal VAD datasets (e.g., XD-Violence, UCF-Crime, and UCA) and SOTA methods. Experiments reveal that the existing datasets are insufficient to train models for detecting women-centric anomalies.


2606.02450 2026-06-08 cs.CV 版本更新

Reason-Then-Retrieve for CoVR-R with Structured Edit Prompts and Dense-Sparse Fusion

先推理后检索：面向CoVR-R的结构化编辑提示与密集-稀疏融合

DongQing Liu, MengShi Qi, HongWei Ji

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）

AI总结针对CoVR-R任务，提出一种零样本的“先推理后检索”流水线，利用Qwen3.5-27B生成结构化描述和密集嵌入，并结合TF-IDF稀疏分支进行融合排序，在验证集和测试集上取得了领先性能。

详情

AI中文摘要

CoVR-R研究基于推理的组合视频检索：给定一个参考视频和一个编辑指令，系统必须检索满足编辑的目标视频。主要困难在于目标未被直接描述，必须从对象身份、动作顺序、最终状态、手部交互和场景转换的细粒度变化中推断。我们围绕Qwen3.5-27B构建了一个零样本的“先推理后检索”流水线。对于每个图库视频，模型通过池化生成令牌的隐藏状态（使用令牌相关权重）生成面向检索的结构化描述和密集嵌入。对于每个查询，模型首先对参考视频和指令进行编辑推理，然后生成目标视频描述，其隐藏状态作为查询嵌入。我们通过生成文本上的TF-IDF分支补充密集检索，并使用分割特定权重融合两个排序。在验证集上，当前最佳提交在R@1达到80.81，R@5达到94.86，R@10达到97.11，R@50达到98.59。在盲测集上，R@1达到89.73，R@5达到95.79，R@10达到96.63，R@50达到97.98。

英文摘要

CoVR-R studies reason-aware composed video retrieval: given a reference video and an edit instruction, the system must retrieve the target video that satisfies the edit. The main difficulty is that the target is not described directly; it must be inferred from fine-grained changes in object identity, action order, final state, hand interaction, and scene transition. We build a zero-shot reason-then-retrieve pipeline around Qwen3.5-27B. For each gallery video, the model generates a retrieval-oriented structured description and a dense embedding by pooling generated-token hidden states with token-dependent weights. For each query, the model first performs edit reasoning over the reference video and instruction, then generates a target-video description whose hidden states serve as the query embedding. We complement dense retrieval with a TF-IDF branch over the generated texts and fuse the two rankings with split-specific weights. On validation, the current best submission reaches 80.81 at R@1, 94.86 at R@5, 97.11 at R@10, and 98.59 at R@50. On the blind test split, it reaches 89.73 at R@1, 95.79 at R@5, 96.63 at R@10, and 97.98 at R@50.
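
The dense-sparse fusion step can be sketched as a weighted combination of two per-gallery score lists. The abstract only says the two rankings are fused with split-specific weights, so the min-max normalization, the convex combination, and the name `fuse_and_rank` are assumptions; the dense scores would come from embedding similarity and the sparse scores from a TF-IDF match over generated descriptions.

```python
from typing import List


def minmax(scores: List[float]) -> List[float]:
    """Rescale scores to [0, 1] so dense and sparse scores are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_and_rank(dense: List[float], sparse: List[float], w_dense: float = 0.7) -> List[int]:
    """Return gallery indices sorted from best to worst by the fused score
    w_dense * dense + (1 - w_dense) * sparse (both min-max normalized)."""
    if len(dense) != len(sparse):
        raise ValueError("dense and sparse must score the same gallery")
    d, s = minmax(dense), minmax(sparse)
    fused = [w_dense * x + (1.0 - w_dense) * y for x, y in zip(d, s)]
    return sorted(range(len(fused)), key=lambda i: fused[i], reverse=True)


if __name__ == "__main__":
    dense_scores = [0.9, 0.8, 0.1]     # e.g. cosine similarity of embeddings
    sparse_scores = [0.0, 5.0, 1.0]    # e.g. TF-IDF match scores
    print(fuse_and_rank(dense_scores, sparse_scores, w_dense=0.5))
```

A split-specific weight, as described in the abstract, would simply mean choosing a different `w_dense` for the validation and test splits.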


2606.02919 2026-06-08 cs.CV 版本更新

Pixel Cube: Diffusion-based Portrait Video Relighting Through Realistic Lighting Reproduction

Pixel Cube: 通过真实感光照再现实现基于扩散的肖像视频重光照

Yufan Zhang, Yu Ji, Ayo Ajiboye, Rundi Wu, Yu Guo, Changxi Zheng, Jinwei Ye

发表机构 * George Mason University（乔治梅森大学）； LightThought LLC ； Columbia University（哥伦比亚大学）

AI总结提出一种基于扩散的方法，利用混合训练数据集和HDR环境图控制，实现动态肖像视频的真实感重光照，保持时间一致性和身份特征。

Comments ACM SIGGRAPH 2026 Journal Track / ACM Transactions on Graphics, 17 pages. Project page: https://yufanzhang82.github.io/PixelCube/

详情

DOI: 10.1145/3811400
Journal ref: ACM Trans. Graph. 45, 4, Article 119 (July 2026), 17 pages

AI中文摘要

我们提出了一种基于扩散的方法，用于对动态肖像视频进行重光照，实现照片级真实感和时间一致性。我们的方法由一个混合训练数据集驱动，该数据集包含真实拍摄和渲染的动态肖像视频，具有多样的主体外观、面部运动、头部姿态和已知光照条件。具体来说，我们构建了一个基于LED的光照系统，用于真实感光照模拟和高速视频重光照数据采集。通过利用预训练视频扩散模型中嵌入的图像先验，并使用逐帧高动态范围（HDR）环境图作为光照控制，我们训练了一个高性能生成模型，用于真实且保持身份的动态肖像视频重光照。除了环境图控制外，我们的模型还使用合成的背景图像来控制相机的曝光水平和色调。我们的模型可以在提供的新环境下生成时间一致的重光照肖像视频，看起来真实且和谐，并忠实保留主体的表情和精细面部特征，包括肤色、皱纹和胡须。我们的模型在主体外观、运动和光照条件方面对未见数据具有良好的泛化能力。我们使用各种环境图对野外视频进行了广泛的重光照实验，并展示了在肖像摄影中的实际应用。结果表明，我们的方法在照片级真实感、光照和谐性和时间一致性方面达到了最先进的性能。

英文摘要

We present a diffusion-based method for relighting dynamic portrait videos with photorealism and temporal consistency. Our method is fueled by a hybrid training dataset that consists of real-captured and rendered dynamic portrait videos with diverse subject appearances, facial motions, head poses, and known lighting conditions. Specifically, we construct an LED-based lighting system for realistic lighting emulation and high-speed video relighting data acquisition. By leveraging the image priors embedded in pre-trained video diffusion models, and using per-frame high dynamic range (HDR) environment maps as lighting control, we train a high-performance generative model for realistic and identity-preserving dynamic portrait video relighting. In addition to the environment map control, our model uses a synthesized background image to enable control of the camera's exposure level and color tone. Our model can produce temporally consistent relit portrait videos that look realistic and harmonious under a provided new environment and faithfully preserve the subject's expression and fine facial features, including skin tone, wrinkles, and facial hair. Our model generalizes well to unseen data in terms of subject appearance, motion, and lighting condition. We perform extensive experiments on relighting in-the-wild videos with various environment maps and demonstrate practical applications in portrait photography. Results show that our method achieves state-of-the-art performance in photorealism, lighting harmony, and temporal consistency.


2606.04349 2026-06-08 cs.CV cs.AI 版本更新

MorphoQuant: Modality-Aware Quantization for Omni-modal Large Language Models

MorphoQuant: 面向全模态大语言模型的模态感知量化

Yue Wu, Changyuan Wang, Zixuan Wang, Shilin Ma, Yansong Tang


AI总结提出MorphoQuant框架，通过分布感知偏差补偿和形态导向量化函数优化，解决全模态大语言模型在4比特后训练量化中的分布异质性和异常值问题，实现精度与效率的优异平衡。

详情

AI中文摘要

传统的后训练量化方法在处理4比特全模态大语言模型时，由于跨模态的极端分布异质性和不同的异常值模式而面临困难。为了解决这一问题，我们提出了MorphoQuant，一种模态感知的PTQ框架，旨在保留跨模态形态并减轻异常值损失。具体来说，我们引入了分布感知偏差补偿，它选择性地将长尾异常值吸收到通道偏差中。该机制在保持异常值幅度的同时，为密集内点维持高精度离散化，从而在多样的模态分布中保持精确的离散化。作为补充，我们提出了形态导向量化函数优化，以协同优化量化网格与偏差掩码，确保跨模态的细粒度对齐。在Qwen2.5-Omni上对MMMU和Video-MME等基准的广泛评估证明了我们方法的优越性。值得注意的是，我们的W4A4模型在ScienceQA上达到了76.63%，显著优于最先进的W4A4方法，并意外地超越了W4A16基线，这充分展示了我们框架在精度-效率权衡方面的卓越表现。

英文摘要

Conventional Post-Training Quantization (PTQ) methods struggle with 4-bit Omni-modal Large Language Models (OLLMs) due to the extreme distribution heterogeneity and disparate outlier patterns across modalities. To address this, we propose MorphoQuant, a modality-aware PTQ framework engineered to preserve cross-modal morphology and mitigate outlier loss. Specifically, we introduce Distribution-Aware Bias Compensation (DABC), which selectively absorbs long-tailed outliers into channel-wise biases. This mechanism safeguards outlier magnitudes while maintaining high-precision discretization for dense inliers, thereby preserving accurate discretization across diverse modal distributions. Complementing this, we propose Morphology-Directed Quantization Function Optimization (MDQFO) to co-optimize the quantization grid with the bias mask, ensuring fine-grained alignment across modalities. Extensive evaluations on Qwen2.5-Omni across benchmarks such as MMMU and Video-MME demonstrate our approach's superiority. Notably, our W4A4 model achieves 76.63% on ScienceQA, significantly outperforming SOTA W4A4 methods and surprisingly surpassing the W4A16 baseline, which demonstrates the accuracy-efficiency trade-off of our framework.


2606.04373 2026-06-08 cs.CV cs.AI 版本更新

Selective Coupling of Decoupled Informative Regions: Masked Attention Alignment for Data-Free Quantization of Vision Transformers

解耦信息区域的选择性耦合：用于视觉Transformer无数据量化的掩码注意力对齐

Biao Qian, Yang Wang, Yong Wu, Jungong Han

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出MaskAQ方法，通过解耦合成样本中的信息区域并利用掩码注意力对齐全精度模型与量化模型，解决无数据量化中分布不匹配问题。

Comments Accepted to appear at ICML 2026, Seoul, Korea

详情

AI中文摘要

无数据量化（DFQ）通过合成样本解决数据安全问题，无需访问真实数据。由于自注意力机制相比经典卷积运算的优势，DFQ在视觉Transformer（ViT）中日益受到关注。然而，先前的ViT DFQ方法常遭受合成样本与量化模型Q期望输入分布之间的分布不匹配，导致性能次优。本文提出一种新颖的掩码注意力对齐方法用于ViT的无数据量化，称为MaskAQ，揭示了：1）自注意力机制中的语义主要局限于稀疏的补丁子集，称为信息区域；2）信息区域主导了合成样本与Q输出之间的互信息。为此，我们利用合成样本补丁相似性的微分熵最大化，从噪声背景中解耦信息区域。为了与不同的Q耦合，通过掩码注意力对齐目标选择信息区域以对齐全精度模型与Q，从而产生高质量的合成样本。此外，提出周期性样本刷新策略，使MaskAQ能够在训练过程中持续适应Q的演化状态，以保持与合成样本的理想互信息。大量实验验证了MaskAQ在多个骨干网络和下游任务上优于最先进方法。我们的代码可在https://github.com/hfutqian/MaskAQ获取。

英文摘要

Data-Free Quantization (DFQ) addresses data security concerns by synthesizing samples, without accessing real data. It has garnered increasing attention in the context of Vision Transformers (ViTs), owing to the superiority of the self-attention mechanism over classical convolutional operations. However, prior DFQ methods for ViTs often suffer from a distribution mismatch between the synthetic samples and the input distribution expected by the quantized model Q, resulting in suboptimal performance. In this paper, we propose a novel Masked Attention Alignment approach for Data-Free Quantization of ViTs, named MaskAQ, revealing that: 1) the semantics in the self-attention mechanism are predominantly localized to a sparse subset of patches, called informative regions; 2) the informative regions dominate the mutual information between synthetic samples and Q's outputs. To this end, we incorporate differential entropy maximization over the patch similarity of synthetic samples to decouple informative regions from the noisy background. To couple with varied Q, the informative regions are selected to align full-precision models with Q via a masked attention alignment objective, thus yielding high-quality synthetic samples. Furthermore, a periodic sample refreshing strategy is introduced so that MaskAQ can continually adapt to the evolving state of Q throughout training and preserve the desired mutual information with synthetic samples. Extensive experiments verify the merits of MaskAQ over state-of-the-art approaches across multiple backbones and downstream tasks. Our code is available at https://github.com/hfutqian/MaskAQ.


2606.05949 2026-06-08 cs.CV 版本更新

Faithful, Enriched, and Precise: Benchmarking Natural-Science Illustration Generation by T2I models

忠实、丰富且精确：T2I模型在自然科学插图生成中的基准测试

Yifan Chang, Jiaxin Ai, Jianwen Sun, Yuandong Pu, Siqi Luo, Liangliang Zhao, Yuchen Ren, Minghao Liu, Yunfei Yu, Yu Qiao, Kaipeng Zhang, Yihao Liu

发表机构 * Shanghai Innovation Institute（上海创新研究院）； Shanghai AI Laboratory（上海人工智能实验室）； University of Science and Technology of China（中国科学技术大学）； Wuhan University（武汉大学）； Nankai University（南开大学）； Shanghai Jiao Tong University（上海交通大学）； Fudan University（复旦大学）； ZODA ； Alaya Studio（Alaya工作室）

AI总结提出FEPBench基准，通过细粒度原子集标注和三维评估（指令忠实性、推理丰富性、语义精确性）系统评估T2I模型在自然科学插图生成中的表现，发现即使最先进的闭源模型仍存在文本渲染瓶颈、推理丰富性有限以及生成丰富性与精确性难以平衡的问题。

详情

AI中文摘要

科学插图是交流研究发现的重要工具，尤其是在自然科学中，它们可视化复杂的概念和过程。随着文本到图像（T2I）模型能力的增强，研究人员已开始将其用于科学插图生成。然而，现有基准通常从整体层面评估输出，忽略了细粒度元素，同时科学推理能力和输出简洁性仍缺乏量化。我们引入了FEPBench，一个基于跨多个学科和布局类型精心挑选的高质量科学插图构建的基准。借助多模态大语言模型（MLLM）和人类专家，我们提供了细粒度原子集标注，并沿三个维度系统评估T2I模型：指令忠实性、推理丰富性和语义精确性。我们的评估进一步将模型性能分解为视觉、文本、关系和布局元素。结果表明，即使最先进的（SOTA）闭源模型，如GPT Image 2和Nano Banana Pro，仍然存在文本渲染瓶颈、推理丰富性有限以及生成丰富性与精确性难以平衡的问题。这些发现为改进和部署T2I模型进行科学插图生成提供了实用指导。基准数据、原子集标注和评估代码将由我们发布。

英文摘要

Scientific illustrations are essential tools for communicating research findings, especially in natural science, where they visualize complex concepts and processes. As Text-to-Image (T2I) models become increasingly capable, researchers have started to use them for scientific illustration generation. However, existing benchmarks often assess outputs at a holistic level, overlooking fine-grained elements, while scientific reasoning ability and output conciseness remain under-quantified. We introduce FEPBench, a benchmark built from carefully selected high-quality scientific illustrations across multiple disciplines and layout types. With the assistance of multimodal large language models (MLLMs) and human experts, we provide fine-grained atom set annotations and systematically evaluate T2I models along three dimensions: instruction faithfulness, reasoning enrichment, and semantic precision. Our evaluation further decomposes model performance across visual, textual, relation, and layout elements. Results show that even state-of-the-art (SOTA) closed-source models, such as GPT Image 2 and Nano Banana Pro, still suffer from text-rendering bottlenecks, limited reasoning enrichment, and difficulty balancing generation richness with precision. These findings provide practical guidance for improving and deploying T2I models in scientific illustration generation. Benchmark data, atom set annotations, and evaluation code will be released by us.


2606.06002 2026-06-08 cs.CV 版本更新

Global-Local Monte Carlo Tree Search in Vision-Language Models for Text-to-3D Indoor Scene Generation

面向文本到3D室内场景生成的视觉-语言模型中的全局-局部蒙特卡洛树搜索

Mengshi Qi, Wei Deng, Xianlin Zhang, Huadong Ma

发表机构 * State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, China（网络与交换技术国家重点实验室，北京邮电大学）

AI总结提出一种全局-局部蒙特卡洛树搜索方法，通过分层场景表示和PRM引导的MCTS解决文本到3D室内场景生成中的错误传播问题，并构建新基准数据集3DTindo-bench。

详情

AI中文摘要

大型视觉-语言模型在各种任务中取得了显著的推理性能。然而，关于使用LVLM进行文本到3D室内场景生成的研究很少。主要挑战在于，现有的基于LVLM的方法采用思维链顺序决策机制，无法修正早期决策，导致错误传播。在本文中，我们将该任务视为一个受空间和布局常识约束的规划问题。为解决此问题，我们将其建模为具有全局树和局部树的树搜索问题，这与现有的顺序决策方法不同。在全局树中，我们迭代地放置每个对象，并像人类布置房间一样探索多种尝试，其中问题空间表示为树。为了有效搜索树，我们提出了一种分层场景表示和PRM引导的MCTS方法。分层表示将场景抽象为房间级别、区域级别、地板对象级别和支撑对象级别。PRM引导的MCTS方法使用PRM剪枝不必要的分支，并使用MCTS算法平衡探索和利用，以更少的尝试获得最优解。在局部树中，它进一步将每个对象的放置分解为更细的子步骤，包括具体的放置参数。为了使场景整体外观一致，我们利用预训练的扩散图像生成模型为场景中的所有对象预测纹理。由于现有的文本到3D室内场景生成基准在规模和多样性上仍然有限，我们收集了一个新的大规模多样化数据集，包含65种场景类型和3,250条指令，具有不同的尺寸、布局和风格，命名为3DTindo-bench，以更好地评估最先进模型的能力。我们的实验表明，我们的方法比最先进的方法生成更逼真的3D场景。

英文摘要

Large Vision-Language Models have achieved significant reasoning performance in various tasks. However, there are few studies on text-to-3D indoor scene generation with LVLMs. The main challenge is that prevailing LVLM-based methods employ chain-of-thought sequential decision mechanisms that cannot revise earlier decisions, causing error propagation. In this paper, we consider the task as a planning problem constrained by spatial and layout commonsense. To solve this problem, we model it as a tree search problem with global and local trees, which differs from existing sequential decision-making approaches. In the global tree, we place each object iteratively and explore multiple attempts like humans furnishing a room, where the problem space is represented as a tree. To effectively search the tree, we propose a hierarchical scene representation and a PRM-guided MCTS method. This representation abstracts a scene into room level, region level, floor object level, and supported object level. The PRM-guided MCTS method uses the PRM to prune unnecessary branches and the MCTS algorithm to balance exploration and exploitation to get an optimal solution with fewer attempts. In the local tree, it further decomposes the placement of each object into finer sub-steps, including the specific placement parameters. To make the whole appearance of the scene consistent, we leverage pre-trained diffusion image generative models to predict textures for all the objects in the scene. As existing benchmarks for text-to-3D indoor scene generation remain limited in scale and diversity, we collect a new large-scale diverse dataset that contains 65 scene types and 3250 instructions with diverse sizes, layouts, and styles, named 3DTindo-bench, to better assess the capability of the state-of-the-art models. Our experiments show that our method generates more realistic 3D scenes than state-of-the-art methods.
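
The selection step of a PRM-guided MCTS can be sketched with the standard UCT rule plus a pruning threshold on the process reward model (PRM) score. This is an illustrative assumption about how pruning and selection might interact, not the paper's algorithm: the threshold `prune_below`, the exploration constant `c`, and the data layout are hypothetical.

```python
import math
from typing import List, Optional, Tuple


def uct_score(total_value: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    """Standard UCT: mean value (exploitation) plus an exploration bonus.
    Unvisited children get +inf so each is tried at least once."""
    if visits == 0:
        return math.inf
    return total_value / visits + c * math.sqrt(math.log(parent_visits) / visits)


def select_child(children: List[Tuple[float, int]], prm_scores: List[float],
                 prune_below: float = 0.2, c: float = 1.4) -> Optional[int]:
    """Pick the child placement to explore next.

    children[i] = (total_value, visits); prm_scores[i] is an assumed PRM rating
    of that candidate placement. Children rated below prune_below are skipped.
    Returns None if every child is pruned."""
    parent_visits = max(1, sum(visits for _, visits in children))
    best, best_score = None, -math.inf
    for i, (value, visits) in enumerate(children):
        if prm_scores[i] < prune_below:
            continue                                  # PRM-guided pruning
        score = uct_score(value, visits, parent_visits, c)
        if score > best_score:
            best, best_score = i, score
    return best


if __name__ == "__main__":
    candidates = [(5.0, 10), (0.0, 0), (1.0, 2)]      # e.g. three positions for a sofa
    print(select_child(candidates, prm_scores=[0.9, 0.1, 0.8]))
```

Unlike a chain-of-thought pass that commits to each placement once, this loop can return to a sibling branch when later rollouts lower a node's mean value, which is the revision ability the abstract argues sequential methods lack.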


2606.06042 2026-06-08 cs.CV 版本更新

基于无监督学习的焦堆相机深度估计

Zhengyu Huang, Weizhi Du, Theodore B. Norris

发表机构 * Center for Ultrafast Optical Science, University of Michigan（超快光学科学中心，密歇根大学）； University of Michigan（密歇根大学）

AI总结提出一种基于无监督深度学习的方法，从焦堆相机图像估计深度，在NYU-v2数据集上相比单图像方法显著提高精度。

2403.05532 2026-06-08 cs.LG cs.CV 版本更新

Twin: Tuning Learning Rate and Weight Decay of Deep Homogeneous Classifiers without Validation

Twin: 无需验证的深度同质分类器学习率和权重衰减调优

Lorenzo Brigato, Stavroula Mougiakakou

发表机构 * ARTORG Center, University of Bern（伯尔尼大学ARTORG中心）

AI总结提出Twin方法，利用同质网络的间隔最大化动态和训练-测试损失间的经验缩放定律，实现无需验证集的学习率和权重衰减调优，在37个图像分类配置上达到与Oracle基线1.28%的平均绝对误差。

Comments Accepted at TMLR

详情

AI中文摘要

我们介绍了Tune without Validation (Twin)，一种简单有效的流程，用于调优同质分类器的学习率和权重衰减，无需验证集，消除了保留数据的需求并避免了两步过程。Twin利用了同质网络的间隔最大化动态以及连接不同超参数配置下训练和测试损失的经验缩放定律。这种数学建模产生了一个随情形而定的、无需验证的选择规则：在不可分情形下，训练损失随测试损失单调变化，因此可以预测泛化；而在可分情形下，由于间隔最大化，参数的范数成为泛化的可靠指标。在37个图像分类的数据集-架构配置中，我们证明Twin与使用测试准确率选择超参数的Oracle基线相比，平均绝对误差为1.28%。我们展示了Twin在验证数据稀缺的场景（如小数据场景）或难以收集且成本高昂的场景（如医学成像）中的优势。代码可在 https://github.com/lorenzobrigato/twin 获取。

英文摘要

We introduce Tune without Validation (Twin), a simple and effective pipeline for tuning learning rate and weight decay of homogeneous classifiers without validation sets, eliminating the need to hold out data and avoiding the two-step process. Twin leverages the margin-maximization dynamics of homogeneous networks and an empirical scaling law that links training and test losses across hyper-parameter configurations. This mathematical modeling yields a regime-dependent, validation-free selection rule: in the non-separable regime, training loss is monotonic in test loss and therefore predictive of generalization, whereas in the separable regime, the parameters' norm becomes a reliable indicator of generalization due to margin maximization. Across 37 dataset-architecture configurations for image classification, we demonstrate that Twin achieves a mean absolute error of 1.28% compared to an Oracle baseline that selects HPs using test accuracy. We demonstrate Twin's benefits in scenarios where validation data is scarce, such as small-data regimes, or difficult and costly to collect, as in medical imaging. Code available at https://github.com/lorenzobrigato/twin.


2406.05670 2026-06-08 cs.LG cs.CR cs.CV 版本更新

Certified Robustness to Data Poisoning in Gradient-Based Training

基于梯度的训练中对数据投毒的认证鲁棒性

Philip Sosnin, Mark N. Müller, Maximilian Baader, Calvin Tsay, Matthew Wicker

发表机构 * Department of Computing, Imperial College London, United Kingdom（英国伦敦帝国理工学院计算机系）； Department of Computer Science, ETH Zurich, Switzerland（瑞士苏黎世联邦理工学院计算机科学系）； LogicStar.ai, Switzerland（瑞士LogicStar.ai公司）； The Alan Turing Institute, United Kingdom（英国艾伦·图灵研究所）

AI总结提出首个框架，通过凸松弛过度近似参数更新集，为梯度下降训练的模型提供针对无目标、有目标投毒和后门攻击的可证明鲁棒性保证。

Comments 21 pages, 8 figures

详情

AI中文摘要

现代机器学习流程利用大量公共数据，使得保证数据质量变得不可行，并使模型容易受到投毒和后门攻击。在攻击下可证明地约束模型行为仍然是一个开放问题。在这项工作中，我们通过开发第一个框架来应对这一挑战，该框架在不修改模型或学习算法的情况下，为使用可能被操纵的数据训练的模型的行为提供可证明的保证。特别是，我们的框架针对训练输入和标签的有界和无界操纵，认证了对无目标和有目标投毒以及后门攻击的鲁棒性。我们的方法利用凸松弛来过度近似给定投毒威胁模型下所有可能的参数更新集，从而允许我们为任何基于梯度的学习算法约束所有可达参数的集合。给定这个参数集，我们提供了最坏情况行为的界限，包括模型性能和后门成功率。我们在多个真实世界数据集上展示了我们的方法，这些数据集来自能源消耗、医学成像和自动驾驶等应用。

英文摘要

Modern machine learning pipelines leverage large amounts of public data, making it infeasible to guarantee data quality and leaving models open to poisoning and backdoor attacks. Provably bounding model behavior under such attacks remains an open problem. In this work, we address this challenge by developing the first framework providing provable guarantees on the behavior of models trained with potentially manipulated data without modifying the model or learning algorithm. In particular, our framework certifies robustness against untargeted and targeted poisoning, as well as backdoor attacks, for bounded and unbounded manipulations of the training inputs and labels. Our method leverages convex relaxations to over-approximate the set of all possible parameter updates for a given poisoning threat model, allowing us to bound the set of all reachable parameters for any gradient-based learning algorithm. Given this set of parameters, we provide bounds on worst-case behavior, including model performance and backdoor success rate. We demonstrate our approach on multiple real-world datasets from applications including energy consumption, medical imaging, and autonomous driving.


2509.22685 2026-06-08 eess.IV cs.CV cs.GR 版本更新

VIRTUS-FPP: Virtual Sensor Modeling for Fringe Projection Profilometry in NVIDIA Isaac Sim

VIRTUS-FPP：NVIDIA Isaac Sim中条纹投影轮廓测量的虚拟传感器建模

Adam Haroon, Anush Lakshman, Badrinath Balasubramaniam, Beiwen Li

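Affiliations: Department of Mechanical Engineering, Iowa State University; College of Engineering, University of Georgia

AI summary: Proposes VIRTUS-FPP, the first end-to-end virtual sensor modeling framework for fringe projection profilometry implemented in NVIDIA Isaac Sim; it provides physically grounded simulation without a pre-calibrated physical system and supports sub-millimeter reconstruction accuracy.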

Comments 10 pages, 13 figures, accepted for publication in IEEE Sensors Journal


DOI: 10.1109/JSEN.2026.3698278

AI中文摘要

条纹投影轮廓测量（FPP）是一种用于3D表面重建的高精度结构光传感技术，但其实际部署常受限于复杂的校准程序、对环境条件的敏感性以及物理实验的高成本。同时，机器人研究日益依赖如NVIDIA Isaac Sim等仿真平台进行可扩展的开发与验证，但目前缺乏FPP等光学计量传感器的精确虚拟表示。本文提出VIRTUS-FPP，这是首个在NVIDIA Isaac Sim中实现的用于条纹投影轮廓测量的端到端虚拟传感器建模框架，能够对完整的FPP流程（包括结构光投影、图像形成、校准和3D重建）进行物理保真模拟，且无需依赖预校准的物理系统。该框架利用逆相机模型表示投影仪，确保了几何和光度保真度与结构光原理一致。通过连接光学计量与机器人仿真，VIRTUS-FPP实现了高保真合成数据生成、传感流程的系统评估以及真实世界FPP系统的数字孪生复制。实验结果表明，该框架具有亚毫米级重建精度，且模拟与物理测量之间具有强对应性，突显了其有效性及在推动感知驱动型机器人、仿真到现实迁移以及可扩展光学传感器设计方面的潜力。

英文摘要

Fringe projection profilometry (FPP) is a high-precision structured-light sensing technique for 3D surface reconstruction, yet its practical deployment is often constrained by complex calibration procedures, sensitivity to environmental conditions, and the high cost of physical experimentation. At the same time, robotics research increasingly relies on simulation platforms such as NVIDIA Isaac Sim for scalable development and validation, but accurate virtual representations of optical metrology sensors such as FPP are not currently available. In this work, we present VIRTUS-FPP, the first end-to-end virtual sensor modeling framework for fringe projection profilometry implemented in NVIDIA Isaac Sim, enabling physically grounded simulation of the complete FPP pipeline, including structured light projection, image formation, calibration, and 3D reconstruction, without dependence on pre-calibrated physical systems. The framework leverages an inverse camera model for projector representation, ensuring geometric and photometric fidelity consistent with structured-light principles. By bridging optical metrology and robotics simulation, VIRTUS-FPP enables high-fidelity synthetic data generation, systematic evaluation of sensing pipelines, and digital twin replication of real-world FPP systems. Experimental results demonstrate sub-millimeter reconstruction accuracy and strong correspondence between simulated and physical measurements, highlighting the framework's effectiveness and its potential to advance perception-driven robotics, simulation-to-reality transfer, and scalable optical sensor design.
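For readers unfamiliar with FPP, the phase-retrieval step of such a pipeline can be sketched with the textbook N-step phase-shifting formula. The abstract does not state which phase-retrieval algorithm VIRTUS-FPP uses, so this is a generic illustration; the function names and the default offset and amplitude are assumptions.

```python
import math


def fringe_intensities(phase, n_steps=4, offset=0.5, amplitude=0.4):
    """Simulate one pixel under n_steps equally shifted sinusoidal fringes."""
    return [
        offset + amplitude * math.cos(phase + 2 * math.pi * k / n_steps)
        for k in range(n_steps)
    ]


def wrapped_phase(intensities):
    """Recover the wrapped phase in (-pi, pi] from N >= 3 shifted intensities."""
    n = len(intensities)
    sin_sum = sum(i * math.sin(2 * math.pi * k / n) for k, i in enumerate(intensities))
    cos_sum = sum(i * math.cos(2 * math.pi * k / n) for k, i in enumerate(intensities))
    # The constant offset cancels because the shifts are evenly spaced.
    return math.atan2(-sin_sum, cos_sum)


recovered_four_step = wrapped_phase(fringe_intensities(1.0, n_steps=4))
recovered_three_step = wrapped_phase(fringe_intensities(-2.5, n_steps=3))
```

The recovered phase is wrapped; a real system would still need phase unwrapping and calibration to turn it into depth.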


2510.11014 2026-06-08 cs.RO cs.AI cs.CV 版本更新

MatterDoor: Sampling Zero-shot Spatio-semantic Priors using Generative Models

MatterDoor: 使用生成模型采样零样本空间语义先验

Subhransu S. Bhattacharjee, Hao Lu, Dylan Campbell, Rahul Shome

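Affiliations: School of Computing, Australian National University

AI summary: To recover the scene structure that is hidden when a robot views a room through a doorway, proposes a pipeline that uses pretrained generative models (VLM-guided outpainting, monocular depth estimation, semantic segmentation) to sample semantic 3D point cloud priors of the hidden room, and introduces MatterDoor, a Matterport3D-derived benchmark, to evaluate these zero-shot spatio-semantic priors.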

Comments Under Review


AI中文摘要

自主机器人通常只能通过门缝部分观察房间，墙壁和场景结构隐藏了安全导航和目标导向行动所需的几何和任务相关语义。我们询问现成的预训练生成视觉模型能否将这些缺失结构作为零样本离线先验用于机器人推理。此类先验应支持对未观察结构的空间语义查询，估计隐藏区域中的目标物体似然以及这些区域被占用的概率。给定一个以自我为中心的RGB观测和目标查询，我们的流程使用VLM引导的外推、单目深度估计和语义分割来采样隐藏房间的语义标注3D点云假设。我们引入了MatterDoor，一个源自Matterport3D的门遮挡室内场景基准，并使用生成指标和模拟Stretch机器人目标到达任务评估所得先验。我们的结果表明，无需针对特定问题微调即可推导出对规划有用的空间语义先验。

英文摘要

Autonomous robots often view rooms only partially, through a doorway, where the walls and scene structure hide the geometry and task-relevant semantics needed for safe navigation and goal-directed action. We ask whether off-the-shelf pretrained generative vision models can derive this missing structure as zero-shot offline priors for robot reasoning. Such priors should support spatio-semantic queries over unobserved structure, estimating the target object likelihood in hidden regions and the probability that those regions are occupied. Given an egocentric RGB observation and target query, our pipeline uses VLM-guided outpainting, monocular depth estimation, and semantic segmentation to sample semantically labeled 3D point cloud hypotheses of the hidden room. We introduce MatterDoor, a Matterport3D-derived benchmark of doorway-occluded indoor scenes, and evaluate the resulting priors with generative metrics and simulated Stretch robot object-reaching tasks. Our results suggest that useful spatio-semantic priors for planning can be derived without problem-specific fine-tuning.


2512.00883 2026-06-08 cs.MM cs.CV cs.SD 版本更新

Audio-Visual World Models: Grounding Multisensory Imagination for Embodied Agents

视听世界模型：为具身智能体奠定多感官想象的基础

Jiahua Wang, Leqi Zheng, Jialong Wu, Yaoxin Mao, Shijie Cheng

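Affiliations: Tsinghua University; Beijing Institute of Technology

AI summary: Proposes a unified formulation of Audio-Visual World Models (AVWM) and AV-CDiT, a conditional diffusion Transformer that jointly predicts binaural audio and visual dynamics; achieves high-fidelity multimodal prediction on the 30-hour AVW-4k benchmark and validates its usefulness in embodied navigation.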


AI中文摘要

世界模型通过模拟环境动态使智能体能够规划和推理未来状态。虽然现有方法主要关注视觉观察，但现实世界的感知本质上涉及多种感觉模态。音频提供了关键的空间和时间线索，如声源定位和声学场景属性，但其整合到世界模型中仍相对未被充分探索。先前的工作尚未建立低层动作控制下视听世界建模的通用公式，也未阐明如何联合捕捉物理上合理的双耳音频和视觉动态。本文提出了视听世界模型（AVWM）的统一公式，将多模态环境模拟建模为具有同步视听观测的部分可观测马尔可夫决策过程。作为解决该问题的基础步骤，我们构建了AVW-4k，一个受控基准数据集，包含30小时的双耳视听轨迹，覆盖76个室内环境并带有动作标注。我们提出了AV-CDiT，一种视听条件扩散Transformer，采用新颖的模态专家架构平衡视觉和听觉学习，通过三阶段训练策略优化以实现有效的多模态整合。在该基准上的大量实验表明，AV-CDiT在视觉和听觉模态上实现了高保真多模态预测。此外，我们验证了其在具身导航中的实际效用，证明AVWM改进了视觉-语言模型引导的智能体在连续视听导航中的表现。

英文摘要

World models simulate environmental dynamics to enable agents to plan and reason about future states. While existing approaches have primarily focused on visual observations, real-world perception inherently involves multiple sensory modalities. Audio provides crucial spatial and temporal cues such as sound source localization and acoustic scene properties, yet its integration into world models remains relatively underexplored. Prior work has not established a commonly adopted formulation for audio-visual world modeling under low-level action control or clarified how to jointly capture physically grounded binaural audio and visual dynamics. This work presents a unified formulation of Audio-Visual World Models (AVWM), casting multimodal environment simulation as a partially observable Markov decision process with synchronized audio-visual observations. As a foundational step toward this problem, we construct AVW-4k, a controlled benchmark comprising 30 hours of binaural audio-visual trajectories with action annotations across 76 indoor environments. We propose AV-CDiT, an Audio-Visual Conditional Diffusion Transformer with a novel modality expert architecture that balances visual and auditory learning, optimized through a three-stage training strategy for effective multimodal integration. Extensive experiments on this benchmark demonstrate that AV-CDiT achieves high-fidelity multimodal prediction across visual and auditory modalities. Furthermore, we validate its practical utility in embodied navigation, demonstrating that AVWM improves a vision-language-model-guided agent in continuous audio-visual navigation.


2512.20963 2026-06-08 cs.LG cs.CV 版本更新

Generalization of Diffusion Models Arises with a Balanced Representation Space

扩散模型的泛化源于平衡表示空间

Zekai Zhang, Xiao Li, Xiang Li, Lianghe Shi, Meng Wu, Molei Tao, Qing Qu

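Affiliations: University of Science and Technology of China

AI summary: By analyzing a two-layer ReLU denoising autoencoder, proves that memorization yields localized spiky representations whereas generalization yields balanced representations; validates this on real-world diffusion models and proposes representation-based detection and editing methods.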

Comments Accepted at ICLR 2026. 40 pages, 19 figures. The first two authors contributed equally


AI中文摘要

扩散模型擅长生成高质量、多样化的样本，但当过度拟合训练目标时，它们有记忆训练数据的风险。我们通过表示学习的视角分析了扩散模型中记忆化和泛化之间的区别。通过研究两层ReLU去噪自编码器（DAE），我们证明了（i）记忆化对应于模型在学习的权重中存储原始训练样本以进行编码和解码，产生局部尖峰表示，而（ii）泛化发生在模型捕获局部数据统计时，产生平衡表示。此外，我们在真实的无条件和文本到图像扩散模型上验证了这些理论发现，表明相同的表示结构出现在深度生成模型中，并具有重要的实际意义。基于这些见解，我们提出了一种基于表示的检测记忆化的方法，以及一种无需训练的编辑技术，通过表示引导实现精确控制。总之，我们的结果强调了学习好的表示对于新颖且有意义的生成建模至关重要。

英文摘要

Diffusion models excel at generating high-quality, diverse samples, yet they risk memorizing training data when overfit to the training objective. We analyze the distinctions between memorization and generalization in diffusion models through the lens of representation learning. By investigating a two-layer ReLU denoising autoencoder (DAE), we prove that (i) memorization corresponds to the model storing raw training samples in the learned weights for encoding and decoding, yielding localized spiky representations, whereas (ii) generalization arises when the model captures local data statistics, producing balanced representations. Furthermore, we validate these theoretical findings on real-world unconditional and text-to-image diffusion models, demonstrating that the same representation structures emerge in deep generative models with significant practical implications. Building on these insights, we propose a representation-based method for detecting memorization and a training-free editing technique that allows precise control via representation steering. Together, our results highlight that learning good representations is central to novel and meaningful generative modeling.


2602.00471 2026-06-08 cs.AI cs.CV 版本更新

Dual Latent Memory for Visual Multi-agent System

面向视觉多智能体系统的双潜在记忆

Xinlei Yu, Chengming Xu, Zhangquan Chen, Bo Yin, Cheng Yang, Yongbo He, Yihao Hu, Jiangning Zhang, Cheng Tan, Xiaobin Hu, Shuicheng Yan

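Affiliations: University of Science and Technology of China

AI summary: Proposes the L²-VMAS framework, which decouples perception from thinking through dual latent memories and uses entropy-driven proactive triggering to break the "scaling wall" of visual multi-agent systems, improving accuracy while substantially reducing token usage.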


AI中文摘要

尽管视觉多智能体系统（VMAS）有望通过智能体间协作增强综合能力，但经验证据揭示了一个反直觉的“扩展墙”：增加智能体轮次往往会降低性能，同时指数级增加令牌成本。我们将这一失败归因于以文本为中心的通信中固有的信息瓶颈，其中将感知和思维轨迹转换为离散自然语言不可避免地导致语义损失。为此，我们提出了L²-VMAS，一种新颖的模型无关框架，通过双潜在记忆实现智能体间协作。此外，我们解耦了感知与思考，同时动态合成双潜在记忆。另外，我们引入了熵驱动的主动触发，用高效的按需内存访问取代被动信息传输。在骨干网络、规模和多智能体结构上的大量实验表明，我们的方法有效打破了“扩展墙”，具有卓越的可扩展性，平均准确率提高2.7-5.4%，同时令牌使用量减少21.3-44.8%。

英文摘要

While Visual Multi-Agent Systems (VMAS) promise to enhance comprehensive abilities through inter-agent collaboration, empirical evidence reveals a counter-intuitive "scaling wall": increasing agent turns often degrades performance while exponentially inflating token costs. We attribute this failure to the information bottleneck inherent in text-centric communication, where converting perceptual and thinking trajectories into discrete natural language inevitably induces semantic loss. To this end, we propose L²-VMAS, a novel model-agnostic framework that enables inter-agent collaboration with dual latent memories. Furthermore, we decouple the perception and thinking while dynamically synthesizing dual latent memories. Additionally, we introduce an entropy-driven proactive triggering that replaces passive information transmission with efficient, on-demand memory access. Extensive experiments among backbones, sizes, and multi-agent structures demonstrate that our method effectively breaks the "scaling wall" with superb scalability, improving average accuracy by 2.7-5.4% while reducing token usage by 21.3-44.8%.
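One way to picture an entropy-driven trigger is a rule that consults memory only when the agent's next-token distribution is uncertain. The abstract does not define what the entropy is computed over or how the threshold is chosen, so the distribution, the threshold value, and the function names below are all assumptions made for illustration.

```python
import math


def shannon_entropy(probs):
    """Entropy in nats of a discrete distribution; zero-probability terms are skipped."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def should_access_memory(probs, threshold=1.0):
    """Hypothetical trigger: request memory only when uncertainty exceeds the threshold."""
    return shannon_entropy(probs) > threshold


confident = [0.97, 0.01, 0.01, 0.01]   # peaked distribution, low entropy
uncertain = [0.25, 0.25, 0.25, 0.25]   # uniform distribution, entropy = ln(4)

trigger_when_confident = should_access_memory(confident)
trigger_when_uncertain = should_access_memory(uncertain)
```

Under this rule the memory lookup is skipped for the peaked distribution and performed for the uniform one, which is the sense in which access becomes on-demand rather than passive.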


2602.01740 2026-06-08 cs.AI cs.CV cs.LG 版本更新

MACD: Model-Aware Contrastive Decoding via Counterfactual Data

MACD：基于反事实数据的模型感知对比解码

Qixin Xiao, Kun Zhou

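Affiliations: University of Michigan, Ann Arbor, MI, USA; University of California San Diego, La Jolla, CA, USA

AI summary: Proposes MACD, which uses the Video-LLM's own feedback to identify the object regions responsible for hallucination, generates object-level counterfactual inputs, and combines them with contrastive decoding to reduce hallucination and improve accuracy across multiple models in complex scenarios.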


AI中文摘要

视频语言模型（Video-LLMs）容易产生幻觉，当视觉证据薄弱、模糊或存在偏差时，会生成看似合理但无根据的内容。现有方法如对比解码（CD）依赖随机扰动构建对比数据以缓解幻觉，但往往未能针对驱动幻觉的视觉线索或模型弱点。我们提出基于模型感知反事实数据的对比解码（MACD），这是一种结合模型引导的反事实构建与对比解码的推理策略。MACD利用Video-LLM自身的反馈来识别最可能导致幻觉的目标区域，生成有针对性的目标级反事实输入，而非任意的帧或时间修改。这些反事实输入被整合到CD中，以在解码过程中强制进行基于证据的令牌选择。在EventHallusion、MVBench、Perception-test和Video-MME上的实验表明，MACD在包括Qwen和InternVL在内的多种Video-LLM上持续减少幻觉，同时保持或提高任务准确性，在涉及小目标、遮挡目标或共现目标的场景中尤其表现出显著优势。

英文摘要

Video language models (Video-LLMs) are prone to hallucinations, generating plausible but ungrounded content when visual evidence is weak, ambiguous, or biased. Existing methods, such as contrastive decoding (CD), rely on random perturbations to construct contrastive data for hallucination mitigation, but often fail to target the visual cues that drive hallucination or align with model weaknesses. We propose Model-Aware Counterfactual Data based Contrastive Decoding (MACD), an inference strategy that combines model-guided counterfactual construction with contrastive decoding. MACD uses the Video-LLM's own feedback to identify object regions most responsible for hallucination, generating targeted object-level counterfactual inputs rather than arbitrary frame or temporal modifications. These counterfactual inputs are integrated into CD to enforce evidence-grounded token selection during decoding. Experiments on EventHallusion, MVBench, Perception-test, and Video-MME show that MACD consistently reduces hallucination while maintaining or improving task accuracy across diverse Video-LLMs, including Qwen and InternVL, with especially strong gains in scenarios involving small, occluded, or co-occurring objects.
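The contrastive-decoding step that MACD builds on can be sketched as follows. This shows one common form of generic contrastive decoding, not the authors' implementation: the combination rule, the weight `alpha`, and the toy logits are assumptions, since the abstract does not give the exact formula.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(original, counterfactual, alpha=1.0):
    """Boost tokens whose score drops when the visual evidence is removed.

    Assumed rule: (1 + alpha) * original - alpha * counterfactual.
    """
    return [(1 + alpha) * o - alpha * c for o, c in zip(original, counterfactual)]


# Toy vocabulary of three tokens.
original = [2.0, 1.9, 0.1]         # logits given the real video
counterfactual = [0.5, 1.9, 0.1]   # logits with the key object removed (hypothetical)

adjusted = contrastive_logits(original, counterfactual, alpha=1.0)
probs = softmax(adjusted)
best_token = max(range(len(probs)), key=probs.__getitem__)
```

Token 0 loses support once the object is removed, so it is treated as visually grounded and gains margin; token 1 scores the same either way, which suggests it is driven by language priors rather than evidence.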


2603.02220 2026-06-08 cs.LG cs.AI cs.CV 版本更新

Forecasting as Rendering: A 2D Gaussian Splatting Framework for Time Series Forecasting

预测即渲染：面向时间序列预测的2D高斯泼溅框架

Yixin Wang, Yifan Hu, Peiyuan Liu, Naiqi Li, Tao Dai, Shu-Tao Xia

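Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University; College of Computer Science and Software Engineering, Shenzhen University

AI summary: Proposes TimeGS, which recasts time series forecasting as 2D Gaussian-splatting generative rendering; anisotropic Gaussian kernels and chronologically continuous rasterization address intra-period and inter-period modeling, achieving state-of-the-art or competitive performance.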


AI中文摘要

时间序列预测仍然是一个具有挑战性的问题，因为周期内波动和周期间趋势的复杂纠缠。尽管最近的进展试图将一维序列重塑为二维周期-相位表示，但它们存在两个主要局限性。首先，将重塑后的张量视为静态图像会导致拓扑不匹配，因为标准空间算子在网格边界处切断了时间连续性。其次，依赖统一的固定大小表示会低效地分配建模能力，并且无法为可压缩的非平稳时间模式提供所需的自适应分辨率。为了解决这些局限性，我们引入了TimeGS，这是一个新颖的框架，从根本上将预测范式从回归转变为二维生成渲染。通过将未来序列重新概念化为潜在的二维时间表面，TimeGS利用高斯核的固有各向异性，以灵活的几何对齐自适应地建模复杂变化。为了实现这一点，我们引入了多基高斯核生成（MB-GKG）块，该块从固定字典中合成核以稳定优化，以及多周期时间连续光栅化（MP-CCR）块，该块在周期边界上强制执行严格的时间连续性。在标准基准数据集上的全面实验表明，TimeGS达到了最先进或具有竞争力的性能。代码位于https://github.com/yixinwang1/TimeGS。

英文摘要

Time series forecasting remains a challenging problem due to the intricate entanglement of intra-period fluctuations and inter-period trends. While recent advances have attempted to reshape 1D sequences into 2D period-phase representations, they suffer from two principal limitations. Firstly, treating reshaped tensors as static images results in a topological mismatch, as standard spatial operators sever chronological continuity at grid boundaries. Secondly, relying on uniform fixed-size representations allocates modeling capacity inefficiently and fails to provide the adaptive resolution required for compressible, non-stationary temporal patterns. To address these limitations, we introduce TimeGS, a novel framework that fundamentally shifts the forecasting paradigm from regression to 2D generative rendering. By reconceptualizing the future sequence as a latent 2D temporal surface, TimeGS utilizes the inherent anisotropy of Gaussian kernels to adaptively model complex variations with flexible geometric alignment. To realize this, we introduce a Multi-Basis Gaussian Kernel Generation (MB-GKG) block that synthesizes kernels from a fixed dictionary to stabilize optimization, and a Multi-Period Chronologically Continuous Rasterization (MP-CCR) block that enforces strict temporal continuity across periodic boundaries. Comprehensive experiments on standard benchmark datasets demonstrate that TimeGS attains state-of-the-art or competitive performance. The code is at https://github.com/yixinwang1/TimeGS.
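The 1D-to-2D period-phase reshaping that the abstract attributes to recent work, and the boundary problem it creates, can be seen in a few lines. This sketch assumes the period is known and divides the series length; the helper name is hypothetical and the code is not taken from TimeGS.

```python
def to_period_phase(series, period):
    """Reshape a 1D series into rows of one period each (row = period, column = phase)."""
    if len(series) % period != 0:
        raise ValueError("series length must be a multiple of the period")
    return [series[i:i + period] for i in range(0, len(series), period)]


series = list(range(12))           # time steps 0..11
grid = to_period_phase(series, 4)  # 3 periods x 4 phases

# Time steps 3 and 4 are consecutive, yet they land at opposite ends of
# adjacent rows: grid[0][3] and grid[1][0].
last_of_first_period = grid[0][3]
first_of_second_period = grid[1][0]
```

A spatial operator that treats `grid` as an image sees no adjacency between those two cells, which is the chronological discontinuity at grid boundaries that the abstract describes.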


2603.21510 2026-06-08 eess.IV cs.CV 版本更新

Unregistered Spectral Image Fusion: Unmixing, Adversarial Learning, and Recoverability

未配准光谱图像融合：解混、对抗学习与可恢复性

Jiahui Song, Sagar Shrestha, Xiao Fu

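AI summary: Proposes an unsupervised framework that simultaneously super-resolves unregistered hyperspectral and multispectral images by combining coupled spectral unmixing with latent-space adversarial learning, and establishes what the authors describe as the first recoverability guarantees for this setting.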


AI中文摘要

本文研究一对空间未配准的高光谱图像（HSI）和多光谱图像（MSI）的融合问题，两者覆盖大致重叠区域。HSI提供高光谱但低空间分辨率，而MSI则相反。目标是整合它们的互补信息，以提升HSI空间分辨率和MSI光谱分辨率。虽然高光谱-多光谱融合（HMF）已被广泛研究，但未配准设置仍然具有挑战性。许多现有方法仅关注MSI超分辨，而保持HSI不变。监督深度学习方法被提出用于HSI超分辨，但依赖于准确的训练数据，这通常不可用。此外，理论分析主要处理已配准情况，导致未配准HMF理解不足。本文提出一种无监督框架，同时超分辨MSI和HSI。该方法将用于MSI超分辨的耦合光谱解混与用于HSI超分辨的潜在空间对抗学习相结合。在合理的生成模型下，建立了超分辨MSI和HSI可恢复性的理论保证——据我们所知，这是首次为未配准HMF提供此类见解。该方法在半真实和真实HSI-MSI对的不同条件下得到验证。

英文摘要

This paper addresses the fusion of a pair of spatially unregistered hyperspectral image (HSI) and multispectral image (MSI) covering roughly overlapping regions. HSIs offer high spectral but low spatial resolution, while MSIs provide the opposite. The goal is to integrate their complementary information to enhance both HSI spatial resolution and MSI spectral resolution. While hyperspectral-multispectral fusion (HMF) has been widely studied, the unregistered setting remains challenging. Many existing methods focus solely on MSI super-resolution, leaving HSI unchanged. Supervised deep learning approaches were proposed for HSI super-resolution, but rely on accurate training data, which is often unavailable. Moreover, theoretical analyses largely address the co-registered case, leaving unregistered HMF poorly understood. In this work, an unsupervised framework is proposed to simultaneously super-resolve both MSI and HSI. The method integrates coupled spectral unmixing for MSI super-resolution with latent-space adversarial learning for HSI super-resolution. Theoretical guarantees on the recoverability of the super-resolution MSI and HSI are established under reasonable generative models -- providing, to our best knowledge, the first such insights for unregistered HMF. The approach is validated on semi-real and real HSI-MSI pairs across diverse conditions.


2603.24576 2026-06-08 cs.RO cs.AI cs.CV 版本更新

Chameleon: Control-Indexed Prospective Memory for Visuomotor Manipulation

Chameleon: 用于视觉运动操控的索引控制前瞻记忆

Xinying Guo, Chenxi Jiang, Hyun Bin Kim, Yuhang Han, Ying Sun, Yang Xiao, Jianfei Yang

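Affiliations: MARS Lab, Nanyang Technological University; Institute for Infocomm Research, A*STAR, Singapore; National University of Singapore

AI summary: Proposes the Chameleon policy, which addresses observation-action delay with control-indexed prospective memory; on Camo-Dataset, decision success rises from 22.5% to 80.8%, and it reaches state-of-the-art results among same-size models on several public benchmarks.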

Comments Code is available at https://github.com/gxyes/MARS_Chameleon


AI中文摘要

机器人常常在观察到某个信息后很久才执行相应的动作。例如，在藏球游戏中，机器人首先看到哪个杯子藏有球，观察杯子移动，然后才需要选择正确的杯子。仅凭最后的观察不足以做出决策：正确的动作依赖于更早的事件。我们将这种时间间隔称为观察-动作延迟。它使得记忆成为一个策略面对的问题：策略必须保持相似历史记录的可区分性，检索与当前决策相关的过去事件，并将该回忆转换为动作就绪状态。我们将这些需求称为可分离性、可寻址性和前瞻性。我们引入了Chameleon，一个约60M参数的视觉运动策略，用于索引控制的前瞻记忆。Chameleon写入具身事件记忆，保留可分离的历史记录，检索控制相关的痕迹，并训练生成的工作状态具有前瞻性。我们还引入了Camo-Dataset，这是一个真实机器人基准，通过使决策场景视觉模糊来隔离观察-动作延迟，从而必须从早期观察中推断正确动作。Chameleon在Camo-Dataset上将决策/端到端成功率从22.5%/21.3%提高到80.8%/71.3%。在公开的长时记忆基准上，它在LIBERO-10上达到87.1% ± 0.8%，在MemoryBench上达到97.3% ± 4.5%，在MIKASA-Robo上达到75.1% ± 1.4%，在相同规模模型中达到最先进水平，并在报告协议下超过多个更大的VLA基线。探针和消融实验表明，Chameleon学习了可分离、可寻址和前瞻的记忆，并且这些特性驱动了其性能提升。

英文摘要

Robots often observe information that determines a future action long before that action is executed. In a shell game, for example, a robot first sees which cup hides the ball, watches the cups move, and only later needs to choose the correct cup. The final observation alone is not enough for a decision: the correct action depends on an earlier event. We refer to this temporal gap as observation-action delay. It makes memory a policy-facing problem: a policy must keep similar histories distinct, retrieve the past event relevant to the current decision, and convert that recall into an action-ready state. We call these requirements separability, addressability, and prospectiveness. We introduce Chameleon, a ~60M-parameter visuomotor policy for control-indexed prospective memory. Chameleon writes embodied event memory, preserves separable histories, retrieves control-relevant traces, and trains the resulting working state to be prospective. We also introduce Camo-Dataset, a real-robot benchmark that isolates observation-action delay by making the decision scene visually ambiguous, so the correct action must be inferred from earlier observations. Chameleon improves decision/end-to-end success on Camo-Dataset from 22.5%/21.3% to 80.8%/71.3%. On public long-horizon memory benchmarks, it achieves 87.1% +/- 0.8% on LIBERO-10, 97.3% +/- 4.5% on MemoryBench, and 75.1% +/- 1.4% on MIKASA-Robo, setting the state of the art for same-size models and exceeding multiple larger VLA baselines under the reported protocols. Probes and ablations show that Chameleon learns separable, addressable, and prospective memory, and that these properties drive its performance gains.


2606.01072 2026-06-08 cs.RO cs.CV 版本更新

Expanding Spatial and Temporal Context for Robotic Imitation Learning With Scene Graphs

利用场景图扩展机器人模仿学习的时空上下文

Jianing Qian, Qinhe Peng, Emmanuel Panov, Leonor Fermoselle, Dinesh Jayaraman, Bernadette Bucher, Tarik Kelestemur

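Affiliations: University of Pennsylvania; RAI Institute; University of Michigan

AI summary: Proposes using scene graphs as an explicit, structured memory mechanism that dynamically maintains object-centric relationships and their evolution over time, addressing partial observability and long-horizon reasoning in robotic imitation learning.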


AI中文摘要

模仿学习使机器人能够通过观察学习如何执行任务。然而，像家庭和办公室这样的真实环境通常由于空间尺度大而严重部分可观测。此外，许多任务涉及执行一系列子任务，要求自主机器人在扩展的时间范围内进行推理。为了解决这些挑战，我们提出在模仿学习中使用场景图作为显式且结构化的记忆机制。通过维护一个动态场景图，捕捉以对象为中心的关系及其随时间的变化，我们的方法允许智能体在任务执行期间保留相关历史上下文，从而有效推理逐步累积的场景信息。我们在模拟移动操作和真实桌面操作上的实验表明，我们的方法显著提高了策略性能，特别是在需要长期推理和在部分可观测性下鲁棒泛化的场景中。

英文摘要

Imitation learning enables robots to learn how to execute tasks via observation. However, real-world environments like homes and offices are often severely partially observed due to their large spatial scales. In addition, many tasks involve executing a series of subtasks requiring autonomous robots to reason over extended time horizons. To address these challenges, we propose using scene graphs as an explicit and structured memory mechanism in imitation learning. By maintaining a dynamic scene graph that captures object-centric relationships and their evolution over time, our method allows the agent to retain relevant historical context during task execution to efficiently reason over incrementally accrued scene information. Our experiments on simulated mobile manipulation and real-world tabletop manipulation demonstrate that our approach substantially improves policy performance, particularly in settings that demand long-term reasoning and robust generalization under partial observability.
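A scene graph used as memory can be pictured as a store of object-centric relations that persists when objects leave the field of view. The class below is a hypothetical minimal sketch of that idea, not the paper's data structure: the edge format, the method names, and the rule of overwriting an edge on re-observation are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SceneGraphMemory:
    # Maps (subject, relation) -> (object, time step of the last observation).
    edges: dict = field(default_factory=dict)

    def observe(self, step, relations):
        """Merge the relations visible at this step; edges not seen now are kept."""
        for subject, relation, obj in relations:
            self.edges[(subject, relation)] = (obj, step)

    def query(self, subject, relation):
        """Return (object, last_seen_step), or None if the relation was never observed."""
        return self.edges.get((subject, relation))


memory = SceneGraphMemory()
memory.observe(0, [("mug", "on", "kitchen_table")])
memory.observe(5, [("robot", "in", "hallway")])   # the mug is out of view here
remembered = memory.query("mug", "on")            # still answerable from memory

memory.observe(9, [("mug", "on", "desk")])        # the mug is seen again, moved
updated = memory.query("mug", "on")
```

Keeping the last-seen time step alongside each edge lets a policy judge how stale a remembered relation is, which matters once the scene changes between observations.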


2606.05759 2026-06-08 cs.CV 版本更新

Physics-Guided Deep Unfolding for Blind Cross-Sensor Spectral Super-Resolution via Learning the Spectral Transformation Function

物理引导的深度展开网络用于盲跨传感器光谱超分辨率：通过学习光谱变换函数

Zhaolin Li, Jinsong Chen, Shanxin Guo, Tuo Zhang, Xinglong Zhang, Pan Chen

Affiliations: Center for Geo-Spatial Information, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Shenzhen Engineering Laboratory of Ocean Environmental Big Data Analysis and Application

AI summary: Proposes PGU-Net, a physics-guided deep unfolding network that jointly estimates the hyperspectral image and a learnable spectral transformation function through alternating optimization, addressing blind cross-sensor spectral super-resolution.


AI中文摘要

高光谱成像为定量遥感提供丰富的光谱信息，然而高光谱传感器成本高昂，因此在许多无人机部署中不可用。光谱超分辨率旨在从多光谱图像重建高光谱图像。大多数现有的SSR方法假设固定且已知的光谱响应函数，因此仅限于单传感器设置。在实际的跨传感器场景中，从HSI到MSI的光谱退化是未知的，并且随传感器特性和场景内容变化，这使得HSI重建病态。本文提出一种物理引导的深度展开网络，称为PGU-Net，通过联合估计HSI和可学习的光谱变换函数来解决盲跨传感器SSR。PGU-Net将交替优化过程展开为端到端可训练的多阶段架构，每个阶段依次更新HSI和STF。两个模块结合了可学习的近端网络和可微的闭式求解器，在保持强表示能力的同时实现物理可解释性。在具有多个SRF的基准数据集（CAVE和NTIRE 2022）上的实验表明，STF（退化算子）的准确恢复以及相对于最先进SSR方法的重建性能提升。此外，在真实无人机跨传感器数据集（Headwall Nano HSI和DJI P4多光谱MSI）上的评估验证了PGU-Net在真正盲条件下的有效性和鲁棒性，并表明估计的STF可能表现出与土地覆盖相关的差异。

英文摘要

Hyperspectral imaging provides rich spectral information for quantitative remote sensing, yet hyperspectral sensors remain costly and thus unavailable in many UAV deployments. Spectral super-resolution (SSR) seeks to reconstruct hyperspectral images (HSIs) from multispectral images (MSIs). Most existing SSR methods assume a fixed and known spectral response function (SRF) and are therefore limited to single-sensor settings. In practical cross-sensor scenarios, the spectral degradation from HSI to MSI is unknown and varies with sensor characteristics and scene content, which renders HSI reconstruction ill-posed. This paper proposes a physics-guided deep unfolding network, termed PGU-Net, to address blind cross-sensor SSR by jointly estimating the HSI and a learnable spectral transformation function (STF). PGU-Net unrolls an alternating optimization procedure into an end-to-end trainable multi-stage architecture, where each stage sequentially updates the HSI and the STF. Both modules combine learnable proximal networks with differentiable closed-form solvers, enabling physical interpretability while retaining strong representation capacity. Experiments on benchmark datasets (CAVE and NTIRE 2022) with multiple SRFs demonstrate accurate recovery of the STF (degradation operator) and improved reconstruction performance over state-of-the-art SSR methods. Furthermore, evaluations on a real UAV cross-sensor dataset (Headwall Nano HSI and DJI P4 Multispectral MSI) verify the effectiveness and robustness of PGU-Net under truly blind conditions, and suggest that the estimated STF may exhibit land-cover-related differences.
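Why reconstructing an HSI from an MSI is ill-posed can be shown with a toy spectral degradation model. The sketch assumes a linear model in which each multispectral band is a weighted sum of hyperspectral bands, with a made-up box-shaped response matrix; the paper's learnable STF may take a different form, and the function name is hypothetical.

```python
def apply_srf(srf, hsi_pixel):
    """Project one hyperspectral pixel onto multispectral bands: msi = srf @ hsi."""
    return [sum(w * v for w, v in zip(row, hsi_pixel)) for row in srf]


# Assumed response: 4 hyperspectral bands averaged pairwise into 2 multispectral bands.
srf = [
    [0.5, 0.5, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
]

pixel_a = [0.2, 0.4, 0.6, 0.8]
pixel_b = [0.3, 0.3, 0.7, 0.7]   # a different spectrum

msi_a = apply_srf(srf, pixel_a)
msi_b = apply_srf(srf, pixel_b)
```

Both spectra produce the same multispectral measurement, so the inverse problem needs a prior even when the response is known; when the response itself is unknown, as in the blind cross-sensor case, it has to be estimated together with the image.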


2510.17568 2026-06-08 cs.CV 版本更新

PAGE-4D: VGGT-4D Perception via Disentangled Pose and Geometry Estimation

PAGE-4D: 通过解耦姿态与几何估计实现VGGT-4D感知

Kaichen Zhou, Yuhan Wang, Grace Chen, Xinhai Chang, Gaspard Beaudouin, Fangneng Zhan, Paul Pu Liang, Mengyu Wang

Affiliations: Harvard AI and Robotics Lab, Harvard University; Media Lab and Electrical Engineering and Computer Science, Massachusetts Institute of Technology; Department of Computing, Imperial College London; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University

AI summary: Proposes PAGE-4D, which extends VGGT to dynamic scenes; a dynamics-aware aggregator disentangles static and dynamic information, improving camera pose estimation, depth prediction, and point cloud reconstruction together.

Comments ICLR 2026, VGGT-4D, Dynamic VGGT


AI中文摘要

最近的3D前馈模型，如视觉几何基础变换器（VGGT），在推断静态场景的3D属性方面表现出强大的能力。然而，由于这些模型通常在静态数据集上训练，它们在涉及复杂动态元素的现实场景中（例如移动的人或可变形物体如雨伞）往往表现不佳。为了解决这一限制，我们引入了PAGE-4D，一种将VGGT扩展到动态场景的前馈模型，能够实现相机姿态估计、深度预测和点云重建——全部无需后处理。多任务4D重建的一个核心挑战是任务之间的固有冲突：准确的相机姿态估计需要抑制动态区域，而几何重建则需要对其进行建模。为了解决这一矛盾，我们提出了一种动态感知聚合器，通过预测动态感知掩码来解耦静态和动态信息——抑制姿态估计的运动线索，同时放大几何重建的运动线索。大量实验表明，PAGE-4D在动态场景中始终优于原始VGGT，在相机姿态估计、单目和视频深度估计以及密集点图重建方面取得了更优的结果。必要的代码和额外演示可在链接：https://page4d.github.io/ 获取，包括训练和推理掩码变体以及仅训练掩码变体（=推理时的VGGT架构）。关键词：VGGT-4D，4D感知，动态场景重建。

英文摘要

Recent 3D feed-forward models, such as the Visual Geometry Grounded Transformer (VGGT), have shown strong capability in inferring 3D attributes of static scenes. However, since they are typically trained on static datasets, these models often struggle in real-world scenarios involving complex dynamic elements, such as moving humans or deformable objects like umbrellas. To address this limitation, we introduce PAGE-4D, a feed-forward model that extends VGGT to dynamic scenes, enabling camera pose estimation, depth prediction, and point cloud reconstruction, all without post-processing. A central challenge in multitask 4D reconstruction is the inherent conflict between tasks: accurate camera pose estimation requires suppressing dynamic regions, while geometry reconstruction requires modeling them. To resolve this tension, we propose a dynamics-aware aggregator that disentangles static and dynamic information by predicting a dynamics-aware mask, suppressing motion cues for pose estimation while amplifying them for geometry reconstruction. Extensive experiments show that PAGE-4D consistently outperforms the original VGGT in dynamic scenarios, achieving superior results in camera pose estimation, monocular and video depth estimation, and dense point map reconstruction. Necessary code and additional demos are available at https://page4d.github.io/, including both the training-and-inference masking variant and the training-only masking variant (= VGGT architecture at inference). Keywords: VGGT-4D, 4D Perception, Dynamic Scene Reconstruction.


2409.13477 2026-06-08 eess.IV cs.CV physics.med-ph 版本更新

A Plug-and-Play Method for Guided Multi-contrast MRI Reconstruction based on Content/Style Modeling

基于内容/风格建模的即插即用式引导多对比度MRI重建方法

Chinmay Rao, Matthias van Osch, Nicola Pezzotti, Jeroen de Bresser, Mark van Buchem, Laurens Beljaards, Jakob Meineke, Elwin de Weerdt, Huangling Lu, Mariya Doneva, Marius Staring

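Affiliations: University of Amsterdam; Erasmus University Rotterdam; Erasmus University Medical Center; University of Utrecht

AI summary: Proposes PnP-CoSMo, a modular plug-and-play method that needs no k-space training data; content/style disentanglement lets a reference scan guide reconstruction of the undersampled contrast, matching or exceeding end-to-end methods on public and in-house datasets and enabling greater acceleration.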


AI中文摘要

由于同一解剖结构的不同MR对比度包含冗余信息，一种对比度可用于引导在同一会话中随后采集的另一种欠采样对比度的重建。为了解决这一利用多对比度侧信息的重建问题，已有多种端到端学习方法被提出。然而，一个关键挑战是需要包含原始k空间数据和配准参考图像的大型配对训练数据集。我们提出了一种模块化的即插即用方法，该方法不需要k空间训练数据，仅依赖于部分配对的图像域数据集。首先学习双对比度MR图像数据的内容/风格模型，随后在迭代重建中作为即插即用算子应用。内容与风格的解耦允许显式表示对比度无关和对比度特定的因素。因此，将先验信息融入重建简化为使用从参考扫描中导出的高质量内容替换估计图像的混叠内容的操作。将该操作与MR数据一致性步骤以及内容估计的校正过程相结合，形成迭代方案。我们将这种新方法命名为PnP-CoSMo。通过设计，它提供了跨对比度的泛化能力，并基于两个给定对比度下的共享和非共享生成因素提供了一个解释框架。我们通过仿真探索了包括可解释性和收敛性在内的多个方面。此外，在公共NYU fastMRI DICOM数据集上展示了其实用性，显示出与端到端方法相当或更优的质量以及更强的泛化能力。在两个内部多线圈数据集上，在给定SSIM下，PnP-CoSMo相比非引导重建实现了高达32.6%的加速。

英文摘要

Since the various MR contrasts of a given anatomy contain redundant information, one contrast can be used to guide the reconstruction of another undersampled contrast acquired subsequently in the same session. To solve this reconstruction problem leveraging multi-contrast side information, several end-to-end learning-based methods have been proposed. However, a key challenge is the requirement for large paired training datasets comprising raw k-space data and aligned reference images. We propose a modular plug-and-play method, which requires no k-space training data and relies solely on partially paired image-domain datasets. A content/style model of two-contrast MR image data is first learned and subsequently applied as a plug-and-play operator in iterative reconstruction. The disentanglement of content and style allows explicit representation of contrast-independent and contrast-specific factors. Consequently, incorporating prior information into the reconstruction reduces to a simple replacement operation on the aliased content of the estimated image using high-quality content derived from the reference scan. Combining this operation with an MR data consistency step, followed by a corrective procedure for the content estimate, yields an iterative scheme. We name this novel approach PnP-CoSMo. It offers, by design, cross-contrast generalizability and provides an explanatory framework based on the shared and non-shared generative factors underlying the two given contrasts. We explore various aspects, including interpretability and convergence, via simulations. Furthermore, its practicality is demonstrated on the public NYU fastMRI DICOM dataset, showing equivalent or superior quality and greater generalizability compared to end-to-end methods. On two in-house multi-coil datasets, PnP-CoSMo enabled up to 32.6% greater acceleration over non-guided reconstruction at given SSIM.


2510.21122 2026-06-08 cs.CV 版本更新

NoisyGRPO: Incentivizing Multimodal CoT Reasoning via Noise Injection and Bayesian Estimation

NoisyGRPO：通过噪声注入和贝叶斯估计激励多模态CoT推理

Longtian Qiu, Shan Ning, Jiaxuan Sun, Xuming He

Affiliations: ShanghaiTech University; Shanghai Engineering Research Center of Intelligent Vision and Imaging; Lingang Laboratory

AI summary: NoisyGRPO injects controllable noise to enhance exploration and models advantage estimation within a Bayesian framework, improving the generalization and robustness of multimodal large language models, especially small-scale ones.

Comments Accepted by NeurIPS 2025. Project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/


Journal ref: Advances in Neural Information Processing Systems 38 (2026) 124239-124267

Abstract

Reinforcement learning (RL) has shown promise in enhancing the general Chain-of-Thought (CoT) reasoning capabilities of multimodal large language models (MLLMs). However, when applied to improve general CoT reasoning, existing RL frameworks often struggle to generalize beyond the training distribution. To address this, we propose NoisyGRPO, a systematic multimodal RL framework that introduces controllable noise into visual inputs for enhanced exploration and explicitly models the advantage estimation process via a Bayesian framework. Specifically, NoisyGRPO improves RL training by: (1) Noise-Injected Exploration Policy: Perturbing visual inputs with Gaussian noise to encourage exploration across a wider range of visual scenarios; and (2) Bayesian Advantage Estimation: Formulating advantage estimation as a principled Bayesian inference problem, where the injected noise level serves as a prior and the observed trajectory reward as the likelihood. This Bayesian modeling fuses both sources of information to compute a robust posterior estimate of trajectory advantage, effectively guiding MLLMs to prefer visually grounded trajectories over noisy ones. Experiments on standard CoT quality, general capability, and hallucination benchmarks demonstrate that NoisyGRPO substantially improves generalization and robustness, especially in RL settings with small-scale MLLMs such as Qwen2.5-VL 3B. The project page is available at https://artanic30.github.io/project_pages/NoisyGRPO/.
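
To make the Bayesian advantage estimation step concrete, the sketch below fuses a noise-derived prior with the observed reward using a conjugate Gaussian update. It is an illustrative assumption, not the authors' formulation: the prior mean `1 - noise_level`, the variances `prior_var` and `obs_var`, and the function name `posterior_advantages` are all hypothetical choices made for this example.

```python
from statistics import mean, pstdev


def posterior_advantages(noise_levels, rewards, prior_var=0.25, obs_var=0.25):
    """Group-relative advantages from a Gaussian prior/likelihood fusion.

    Assumption (not from the paper): a rollout generated from an image with
    noise level n in [0, 1] has prior expected quality 1 - n, and its observed
    reward is a noisy measurement of that same quality.
    """
    if len(noise_levels) != len(rewards):
        raise ValueError("need one noise level per reward")

    prior_precision = 1.0 / prior_var
    obs_precision = 1.0 / obs_var

    # Conjugate Gaussian update: precision-weighted mean of prior and observation.
    posteriors = [
        (prior_precision * (1.0 - n) + obs_precision * r)
        / (prior_precision + obs_precision)
        for n, r in zip(noise_levels, rewards)
    ]

    # Group-relative normalisation in the style of GRPO: centre on the group
    # mean and scale by the group standard deviation.
    mu = mean(posteriors)
    sd = pstdev(posteriors)
    if sd == 0.0:
        return [0.0 for _ in posteriors]
    return [(p - mu) / sd for p in posteriors]


if __name__ == "__main__":
    # Four rollouts for one prompt; cleaner inputs (low noise) get a higher prior.
    advantages = posterior_advantages(
        noise_levels=[0.0, 0.2, 0.6, 0.9],
        rewards=[1.0, 1.0, 0.0, 1.0],
    )
    print([round(a, 3) for a in advantages])
```

In this toy setup, two rollouts with the same reward receive different advantages when one was produced from a much noisier image, which is the qualitative behaviour the abstract describes as preferring visually grounded trajectories over noisy ones.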

2601.14637 2026-06-08 cs.CV cs.AI cs.CL cs.HC Updated version

Forest-Chat: Adapting Vision-Language Agents for Interactive Forest Change Analysis

James Brock, Ce Zhang, Nantheera Anantrasirichai

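Affiliations: School of Computer Science, University of Bristol; School of Geographical Sciences, University of Bristol

AI summary: The paper presents Forest-Chat, an LLM-driven agent for forest change analysis that supports natural-language querying across multiple tasks, improving the accuracy and interpretability of forest change detection and semantic interpretation.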

Comments 28 pages, 9 figures, 12 tables, Submitted to Ecological Informatics

DOI: 10.1016/j.ecoinf.2026.103741

Abstract

The increasing availability of high-resolution satellite imagery, together with advances in deep learning, creates new opportunities for forest monitoring workflows. Two central challenges in this domain are pixel-level change detection and semantic change interpretation, particularly for complex forest dynamics. While large language models (LLMs) are increasingly adopted for data exploration, their integration with vision-language models (VLMs) for remote sensing image change interpretation (RSICI) remains underexplored, especially beyond urban environments. This paper introduces Forest-Chat, an LLM-driven agent for forest change analysis, enabling natural language querying across multiple RSICI tasks, including change detection and captioning, object counting, deforestation characterisation, and change reasoning. Forest-Chat builds upon a multi-level change interpretation (MCI) vision-language backbone with LLM-based orchestration, incorporating zero-shot change detection via AnyChange and multimodal LLM-based zero-shot change captioning and refinement. To support adaptation and evaluation in forest environments, we introduce the Forest-Change dataset, comprising bi-temporal satellite imagery, pixel-level change masks, and semantic change captions via human annotation and rule-based methods. Forest-Chat achieves mIoU and BLEU-4 scores of 67.10% and 40.17% on Forest-Change, and 88.13% and 34.41% on LEVIR-MCI-Trees, a tree-focused subset of LEVIR-MCI. In a zero-shot capacity, it achieves 60.15% and 34.00% on Forest-Change, and 47.32% and 18.23% on LEVIR-MCI-Trees. Further experiments demonstrate the value of caption refinement for injecting geographic domain knowledge into supervised captions, and the system's limited label domain transfer onto JL1-CD-Trees. These findings demonstrate that interactive, LLM-driven systems can support accessible and interpretable forest change analysis.
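
The mIoU figures above average the per-class intersection-over-union between predicted and reference change masks. A minimal sketch of that metric for flat label lists follows; the function name `mean_iou`, the two-class setup, and the choice to skip classes absent from both masks are assumptions for illustration, not the evaluation code used by the authors.

```python
def mean_iou(pred, target, num_classes=2):
    """Mean intersection-over-union across classes for flat label sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")

    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union == 0:
            continue  # class absent from both masks: skipped (an assumption)
        ious.append(intersection / union)

    if not ious:
        raise ValueError("no class is present in either mask")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    # 0 = no change, 1 = change; six pixels flattened into a list.
    predicted = [0, 0, 1, 1, 1, 0]
    reference = [0, 1, 1, 1, 0, 0]
    print(mean_iou(predicted, reference))  # prints 0.5
```

Each class here has two correctly labelled pixels out of a union of four, so both per-class IoU values are 0.5 and so is their mean.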

2505.13140 2026-06-08 cs.CV Updated version

CacheFlow: Fast Human Motion Prediction by Cached Normalizing Flow

Takahiro Maeda, Jinkun Cao, Norimichi Ukita, Kris Kitani

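Affiliations: Toyota Technological Institute; Robotics Institute, Carnegie Mellon University

AI summary: CacheFlow caches the outputs of a normalizing-flow generative model to enable fast 3D human motion prediction, running substantially faster than prior methods while preserving prediction accuracy and model expressiveness.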

Comments Accepted at Transactions on Machine Learning Research (TMLR). See https://openreview.net/forum?id=icq5659pQt

Journal ref: Transactions on Machine Learning Research, 2026

Abstract

Many density estimation techniques for 3D human motion prediction require a significant amount of inference time, often exceeding the duration of the predicted time horizon. To address the need for faster density estimation for 3D human motion prediction, we introduce a novel flow-based method for human motion prediction called CacheFlow. Unlike previous conditional generative models that suffer from poor time efficiency, CacheFlow takes advantage of an unconditional flow-based generative model that transforms a Gaussian mixture into the density of future motions. The results of the computation of the flow-based generative model can be precomputed and cached. Then, for conditional prediction, we seek a mapping from historical trajectories to samples in the Gaussian mixture. This mapping can be done by a much more lightweight model, thus saving significant computation overhead compared to a typical conditional flow model. In such a two-stage fashion and by caching results from the slow flow model computation, we build our CacheFlow without loss of prediction accuracy and model expressiveness. This inference process is completed in approximately one millisecond, making it 4 times faster than previous VAE methods and 30 times faster than previous diffusion-based methods on standard benchmarks such as Human3.6M and AMASS datasets. Furthermore, our method demonstrates improved density estimation accuracy and comparable prediction accuracy to a SOTA method on Human3.6M. Our code and models are available at https://github.com/meaten/CacheFlow.
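
The two-stage idea, caching the outputs of an expensive unconditional transform and conditioning through a cheap lookup, can be sketched as follows. This is a toy analogue rather than CacheFlow itself: `slow_flow`, `light_encoder`, the one-dimensional latent space, and the nearest-neighbour lookup are all hypothetical stand-ins chosen for brevity.

```python
import bisect
import math
import random


def slow_flow(z):
    """Stand-in for an expensive invertible map from a latent to a future motion."""
    return math.sinh(z) + 0.5 * z


def build_cache(num_samples=512, seed=0):
    """Offline stage: sample latents once and store the flow outputs, sorted by latent."""
    rng = random.Random(seed)
    latents = sorted(rng.gauss(0.0, 1.0) for _ in range(num_samples))
    return latents, [slow_flow(z) for z in latents]


def light_encoder(history):
    """Stand-in for the lightweight model: maps a past trajectory to a latent."""
    velocity = history[-1] - history[-2]
    return max(-3.0, min(3.0, velocity))


def predict(history, cache):
    """Online stage: encode the history, then reuse the nearest cached output."""
    latents, outputs = cache
    z = light_encoder(history)
    i = bisect.bisect_left(latents, z)
    # Pick whichever neighbouring cached latent is closer to z.
    candidates = [j for j in (i - 1, i) if 0 <= j < len(latents)]
    nearest = min(candidates, key=lambda j: abs(latents[j] - z))
    return outputs[nearest]


if __name__ == "__main__":
    cache = build_cache()               # paid once, before any query arrives
    print(predict([0.0, 0.4], cache))   # no call to slow_flow at query time
```

The point of the design is that `slow_flow` is only ever called in `build_cache`; each query costs one cheap encoding and one binary search over the cached latents.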

2511.09568 2026-06-08 physics.chem-ph cs.AI cs.CV Updated version

VEDA: 3D Molecular Generation via Variance-Exploding Diffusion with Annealing

Peining Zhang, Jinbo Bi, Minghu Song

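Affiliations: University of Science and Technology of China

AI summary: VEDA combines annealed variance-exploding diffusion with an SE(3)-equivariant architecture to efficiently generate accurate 3D molecular structures, achieving both high chemical accuracy and computational efficiency.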

DOI: 10.1609/aaai.v40i33.40063

Abstract

Diffusion models show promise for 3D molecular generation, but face a fundamental trade-off between sampling efficiency and conformational accuracy. While flow-based models are fast, they often produce geometrically inaccurate structures, as they have difficulty capturing the multimodal distributions of molecular conformations. In contrast, denoising diffusion models are more accurate but suffer from slow sampling, a limitation attributed to sub-optimal integration between diffusion dynamics and SE(3)-equivariant architectures. To address this, we propose VEDA, a unified SE(3)-equivariant framework that combines variance-exploding diffusion with annealing to efficiently generate conformationally accurate 3D molecular structures. Specifically, our key technical contributions include: (1) a VE schedule that enables noise injection functionally analogous to simulated annealing, improving 3D accuracy and reducing relaxation energy; (2) a novel preconditioning scheme that reconciles the coordinate-predicting nature of SE(3)-equivariant networks with a residual-based diffusion objective; and (3) a new arcsin-based scheduler that concentrates sampling in critical intervals of the logarithmic signal-to-noise ratio. On the QM9 and GEOM-DRUGS datasets, VEDA matches the sampling efficiency of flow-based models, achieving state-of-the-art valency stability and validity with only 100 sampling steps. More importantly, VEDA's generated structures are remarkably stable, as measured by their relaxation energy during GFN2-xTB optimization. The median energy change is only 1.72 kcal/mol, significantly lower than the 32.3 kcal/mol from its architectural baseline, SemlaFlow. Our framework demonstrates that principled integration of VE diffusion with SE(3)-equivariant architectures can achieve both high chemical accuracy and computational efficiency.
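
For orientation, the sketch below shows a generic variance-exploding forward process with the geometric noise schedule commonly used for VE diffusion. It does not reproduce VEDA's annealing, preconditioning, or arcsin-based scheduler; the values of `SIGMA_MIN` and `SIGMA_MAX` and the function names are assumptions for illustration.

```python
import math
import random

SIGMA_MIN = 0.01   # assumed smallest noise scale
SIGMA_MAX = 10.0   # assumed largest noise scale


def ve_sigma(t):
    """Geometric VE schedule: sigma grows from SIGMA_MIN (t=0) to SIGMA_MAX (t=1)."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    return SIGMA_MIN * (SIGMA_MAX / SIGMA_MIN) ** t


def ve_perturb(coords, t, rng):
    """Add isotropic Gaussian noise of scale sigma(t); the signal is not rescaled."""
    sigma = ve_sigma(t)
    return [[x + sigma * rng.gauss(0.0, 1.0) for x in atom] for atom in coords]


def log_snr(t):
    """Log signal-to-noise ratio for unit-variance data: log(1 / sigma(t)^2)."""
    return -2.0 * math.log(ve_sigma(t))


if __name__ == "__main__":
    rng = random.Random(0)
    # Three atoms with made-up 3D coordinates.
    molecule = [[0.0, 0.0, 0.0], [1.1, 0.0, 0.0], [0.0, 1.1, 0.0]]
    for t in (0.0, 0.5, 1.0):
        noisy = ve_perturb(molecule, t, rng)
        print(t, round(ve_sigma(t), 4), round(log_snr(t), 3), noisy[1])
```

Because the clean coordinates are never scaled down, the noise variance "explodes" relative to the signal as `t` approaches 1, and the log signal-to-noise ratio falls linearly in `t` under this geometric schedule.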

2504.21614 2026-06-08 cs.CV Updated version

Mcity Data Engine: Iterative Model Improvement Through Open-Vocabulary Data Selection

Daniel Bogdoll, Rajanikant Patnaik Ananta, Abeyankar Giridharan, Isabel Moore, Gregory Stevens, Henry X. Liu

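Affiliations: University of Michigan Transportation Research Institute; Karlsruhe Institute of Technology; Texas A&M University

AI summary: The paper presents the Mcity Data Engine, which uses open-vocabulary data selection to address the detection of long-tail classes in large volumes of unlabeled data and covers the full data-driven development cycle from data acquisition to model deployment.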

Comments Accepted for publication at ITSC 2025

DOI: 10.1109/ITSC60802.2025.11423797

Abstract

With an ever-increasing availability of data, it has become more and more challenging to select and label appropriate samples for the training of machine learning models. It is especially difficult to detect long-tail classes of interest in large amounts of unlabeled data. This holds especially true for Intelligent Transportation Systems (ITS), where vehicle fleets and roadside perception systems generate an abundance of raw data. While industrial, proprietary data engines for such iterative data selection and model training processes exist, researchers and the open-source community suffer from a lack of an openly available system. We present the Mcity Data Engine, which provides modules for the complete data-based development cycle, beginning at the data acquisition phase and ending at the model deployment stage. The Mcity Data Engine focuses on rare and novel classes through an open-vocabulary data selection process. All code is publicly available on GitHub under an MIT license: https://github.com/mcity/mcity_data_engine
