arXivDaily: arXiv daily academic digest, updated Monday through Friday
2606.06485 2026-06-05 cs.CV Version update

PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

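Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang, Liujuan Cao

Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University

AI summary: Proposes the PAR3D framework, which uses part-aware 3D representation learning and hierarchical segmentation query generation to address the weakness of existing 3D-MLLMs in fine-grained part understanding, yielding substantial gains on part-level question answering and referring segmentation.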

Comments Project page: https://atrovast.github.io/PAR3D/

Abstract

Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.

2606.06477 2026-06-05 cs.CV Version update

Complexity-Balanced Diffusion Splitting

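Noam Issachar, Dani Lischinski, Raanan Fattal

Affiliations: The Hebrew University of Jerusalem

AI summary: Proposes the Complexity-Balanced Splitting (CBS) framework, which partitions the diffusion timeline into segments of equal approximation burden and allocates more capacity to difficult regions, improving generation quality across several architectures and datasets without raising inference cost.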

Abstract

Standard continuous-time generative models rely on monolithic architectures that must navigate vastly different signal regimes, from isotropic noise to intricate data distributions. While scaling model capacity improves performance, deploying a massive network uniformly across the entire generative timeline is inherently inefficient. In this work, we propose Complexity-Balanced Splitting (CBS), a principled framework for temporal capacity allocation that distributes the generative workload across multiple specialized sub-networks. Grounded in function approximation theory and de Boor's equidistribution principle, CBS partitions the diffusion timeline into segments of equal approximation burden, allocating more representational capacity to regions where the generative dynamics are more difficult to model. To estimate this local complexity, we introduce two complementary and tractable monitor functions: a spatial measure based on the flow's Dirichlet energy, and a geometric measure based on the acceleration of the sampling trajectories. Using a lightweight auxiliary model to estimate these complexity profiles, our approach eliminates the need for heuristic temporal splits or computationally expensive search procedures. Extensive evaluation across multiple architectures (SiT, JiT, and UNet) and datasets demonstrates that CBS consistently improves synthesis quality without increasing per-step inference cost. In particular, CBS improves FID by ~35% on SiT-XL with CFG relative to naive temporal partitioning. Project page is available at https://noamissachar.github.io/CBS/.
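
The equidistribution idea in the abstract above can be illustrated with a small numerical sketch. This is not the authors' implementation: the function name `equidistribute`, the linear monitor profile, and the choice of four segments are all assumptions made for illustration. Given samples of a positive monitor function over the timeline, the code places breakpoints so that each segment carries the same integrated monitor mass.

```python
import numpy as np

def equidistribute(t, monitor, num_segments):
    """Return breakpoints so each segment holds equal integrated monitor mass.

    Hypothetical helper: `monitor` stands in for a complexity profile such as
    the ones the abstract describes; it must be strictly positive.
    """
    t = np.asarray(t, dtype=float)
    m = np.asarray(monitor, dtype=float)
    # Cumulative trapezoidal integral of the monitor function over time.
    increments = 0.5 * (m[1:] + m[:-1]) * np.diff(t)
    cumulative = np.concatenate([[0.0], np.cumsum(increments)])
    # Target mass at each breakpoint: 0, 1/K, 2/K, ..., 1 of the total.
    targets = np.linspace(0.0, cumulative[-1], num_segments + 1)
    # Invert the (increasing) cumulative curve by linear interpolation.
    return np.interp(targets, cumulative, t)

t = np.linspace(0.0, 1.0, 1001)
monitor = 1.0 + 9.0 * t  # assumed profile: dynamics get harder toward t = 1
breaks = equidistribute(t, monitor, num_segments=4)
print(breaks)
```

With this assumed profile the segments become narrower toward t = 1, so a fixed number of sub-networks would concentrate where the monitor is largest.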

2606.06476 2026-06-05 cs.CV Version update

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

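Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu

Affiliations: The University of Hong Kong; Shanghai AI Laboratory; Shanghai Jiao Tong University; Fudan University; Beihang University

AI summary: Proposes the Astra framework, which trains a VLM policy with reinforcement learning to interact with a Bagel-based world simulator and generate imagined visual evidence during reasoning, addressing unobserved layouts, cross-view consistency, and reasoning from alternative viewpoints in spatial reasoning.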

Comments Project page: https://zcmax.github.io/projects/Thinking-With-Imagination

Abstract

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.

2606.06458 2026-06-05 cs.LG cs.AI cs.CV Version update

In-Context Multiple Instance Learning

Alexander Möllers, Marvin Sextro, Julius Hense, Gabriel Dernbach, Klaus-Robert Müller

Affiliations: Berlin Institute for the Foundations of Learning and Data; Machine Learning Group, Technische Universität Berlin; Aignostics; Institute of Pathology, Charité – Universitätsmedizin Berlin; Max-Planck Institute for Informatics; Department of Artificial Intelligence, Korea University

AI summary: Proposes an in-context learner with a Perceiver-style architecture that is pretrained on synthetic data and solves new multiple instance learning tasks from a handful of labeled bags without gradient updates, outperforming supervised baselines that require task-specific training on 12 benchmarks.

Abstract

Multiple Instance Learning (MIL) addresses problems where supervision is available at the level of bags of instances and has been successfully applied in fields ranging from computational pathology to satellite imagery. Nevertheless, existing algorithms struggle in the low-label regime that characterizes many real-world applications. Flexible models overfit and rigid ones fail to adapt to the task at hand. We show that pretraining an in-context learner with a Perceiver-style architecture on synthetic data yields a model that can solve new tasks from a handful of labeled bags. At inference time, classification happens in a single forward pass and requires no gradient updates. We propose and investigate different synthetic data generators for bag-structured data and find that they capture complementary inductive biases. A model pretrained on a mixture of these generators inherits their per-task strengths and achieves the best average performance across twelve MIL benchmarks, outperforming supervised baselines that require task-specific training.

2606.06390 2026-06-05 cs.CV cs.AI Version update

HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

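Wenbo Li, Xiaoliang Ju, Zipeng Qin, Rongyao Fang, Hongsheng Li

Affiliations: Ace Robotics; CUHK MMLab; Shenzhen Loop Area Institute

AI summary: Proposes a unified hierarchical framework that trains a large language model on a large-scale dataset of real floorplans to generate whole-home floorplans, uses image generation models and a VLM-based refiner to lay out furniture and small objects, and attaches physical attributes, textures, and lighting to produce controllable, highly realistic whole-home scenes.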

Abstract

Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: https://kairos-homeworld.github.io/

2606.06379 2026-06-05 cs.CV cs.AI Version update

EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models

Qiwei Zeng, Hao Wang, Jinghao Lin, Shuchang Ye, Yuezhe Yang, Yige Peng, Haoyuan Che, Jinman Kim, Lei Bi

Affiliations: Jilin University; School of Computer Science, The University of Sydney; ByteDance; Institute of Translational Medicine, Shanghai Jiao Tong University

AI summary: Proposes EasyLens, a training-free plug-and-play module that amplifies the subtle-lesion representations of medical vision-language models by building a pathology-anatomy prototype space, selecting lesion-relevant patches through counterfactual reasoning, and applying morphology-guided residual enhancement.

Abstract

Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. However, their practical utility remains limited by insufficient sensitivity to subtle lesions, whose visual evidence is often sparse, low-contrast, and embedded within complex anatomical context. As local visual tokens are aggregated, these weak lesion cues can become underrepresented in global image representations, making them difficult for medical VLMs to recognize. Existing efforts to improve lesion sensitivity mainly rely on medical-domain vision-encoder pre-training, clinical-term-guided alignment, or trainable pathological representation enhancement. Although effective, these approaches usually require additional training or model-specific adaptation and may overfit to particular disease morphologies, limiting their applicability to frozen medical VLMs. To address these limitations, we propose EasyLens, a training-free plug-and-play subtle-lesion representation amplifier for medical VLMs. EasyLens first constructs EasyBank, a pathology-anatomy prototype space that provides lesion-related prototypes and anatomy-aware normal references for comparing suspicious patches against both pathological and normal anatomical patterns. To avoid blindly amplifying normal tissues, EasyTag selects lesion-relevant patches through counterfactual prototype reasoning. To counteract the dilution of subtle lesion cues in global image representations, EasyAmplifier strengthens the selected lesion-relevant patch representations through morphology-guided residual enhancement, thereby increasing their contribution to the global image embedding. Experiments on multiple medical image datasets and frozen medical VLM backbones show that EasyLens improves subtle-lesion detection and outperforms existing encoder-enhancement baselines.

2606.06369 2026-06-05 cs.CV Version update

Visual Commonsense Driven Knowledge Refinements for Scene Graph Generation

Maëlic Neau, Salim Baloch, Jakob Suchan, Zoe Falomir, Mehul Bhatt

Affiliations: Computing Science Department, Umeå University; School of Computer Science & Engineering, Constructor University; School of Science and Technology, Örebro University; CoDesign Lab EU

AI summary: Proposes a model-agnostic, semantically guided knowledge refinement framework that mines commonsense constraints from training data and applies declarative commonsense reasoning to correct scene graph predictions at inference time, with no manual rules or retraining, consistently improving strong baselines on three benchmarks.

Abstract

Learning-driven Scene Graph Generation (SGG) models excel on frequent relation types but degrade sharply under annotation sparsity, failing to capture reliable visual commonsense knowledge. We propose a model-agnostic, semantically guided knowledge refinement framework that systematically mines commonsense-grounded constraints from training data (capturing spatial, functional, and qualitative relational regularities) and uses general declarative commonsense reasoning to correct and refine ranked SGG predictions at inference time. The framework requires no manual rule authoring and no model retraining, and it transfers across datasets and architectures. On three standard benchmarks, we obtain consistent improvements over strong baselines, demonstrating that structured visual commonsense reasoning over deep scene semantics is a practical and effective complement to purely learning-based scene graph generation.

2606.06363 2026-06-05 cs.CV Version update

GMBFormer: An NDVI-Guided Global Memory Bank Transformer for Urban Green-Space Extraction from Ultra-High-Resolution Imagery

Hao Lei, Xi Cheng, Chenlu Shu, Zhiheng Chen, Zhengjie Duan, Haoyu Wang, Zhanfeng Shen

Affiliations: College of Geophysics, Chengdu University of Technology; National Engineering Research Center for Geomatics, Aerospace Information Research Institute, Chinese Academy of Sciences, and University of Chinese Academy of Sciences

AI summary: To address the limited semantic reuse among visually similar vegetation patterns and the blurred fusion of NDVI and RGB features in urban green-space extraction from ultra-high-resolution imagery, proposes the GMBFormer framework, which decouples NDVI as a physics-informed gate and performs selective prototype retrieval from a global memory bank, improving segmentation accuracy on three datasets.

Comments 34 pages, 5 figures

Abstract

Urban green-space extraction from ultra-high-resolution (UHR) imagery is commonly performed patch by patch, which limits semantic reuse among spatially separated but visually similar vegetation patterns. Directly injecting the Normalized Difference Vegetation Index (NDVI) into red-green-blue (RGB) backbones can also blur the roles of visual appearance learning and physical vegetation confidence. We propose GMBFormer, a SegFormer-based framework that replaces adjacency-driven feature propagation with selective, similarity-driven prototype retrieval. Only RGB channels enter the backbone and decoder, while NDVI is decoupled as a physics-informed gate that admits high-confidence vegetation descriptors into a compact global memory bank through momentum updates. During training and inference, the current patch queries stored prototypes through memory-mediated cross-attention, and the retrieved response is integrated with bounded overhead. Experiments use a self-constructed Chengdu UHR dataset with 7,700 labeled 512 × 512 patches and two reduced-label settings derived from the public International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam dataset. Under the same training and evaluation protocol, GMBFormer obtains mean intersection over union (mIoU)/mean Dice (mDice) scores of 89.25%/94.31%, 92.17%/95.92%, and 83.72%/90.86%, respectively, improving on the controlled SegFormer-B4 baseline in each setting. Ablation studies indicate that decoupled NDVI admission, memory retrieval, capacity, and momentum jointly shape the final performance.
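
As a rough illustration of what an NDVI-gated momentum update of a memory bank could look like, the sketch below admits only descriptors whose NDVI exceeds a threshold and blends them into the nearest stored prototype. Everything here is an assumption rather than the paper's design: the function name `update_memory`, the threshold of 0.4, the momentum of 0.9, and the nearest-prototype assignment by cosine similarity are not stated in the abstract.

```python
import numpy as np

def update_memory(bank, descriptors, ndvi, threshold=0.4, momentum=0.9):
    """Return a new bank after an NDVI-gated momentum update (illustrative only)."""
    # NDVI acts purely as a gate: it decides admission and is never a feature.
    admitted = descriptors[ndvi > threshold]
    if admitted.shape[0] == 0:
        return bank.copy()
    # Cosine similarity between admitted descriptors and stored prototypes.
    d = admitted / np.linalg.norm(admitted, axis=1, keepdims=True)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    nearest = np.argmax(d @ b.T, axis=1)
    new_bank = bank.copy()
    for slot in np.unique(nearest):
        mean_desc = admitted[nearest == slot].mean(axis=0)
        # Exponential moving average keeps prototypes stable across patches.
        new_bank[slot] = momentum * bank[slot] + (1.0 - momentum) * mean_desc
    return new_bank

bank = np.array([[1.0, 0.0], [0.0, 1.0]])
descriptors = np.array([[2.0, 0.0], [0.0, 4.0], [3.0, 3.0]])
ndvi = np.array([0.8, 0.1, 0.2])  # only the first descriptor passes the gate
new_bank = update_memory(bank, descriptors, ndvi)
print(new_bank)
```

In this toy case only the first prototype moves, because the other two descriptors are rejected by the gate regardless of how similar they are to stored prototypes.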

2606.06359 2026-06-05 cs.CV Version update

Comparison of Deep Learning Frameworks For Rice Disease Mapping From UAV Multispectral Imaging

Yadav Raj Ghimire, Jagrati Talreja, Tewodros Syum Gebre, Timothy Agboada, Shikha V. Chandel, Leila Hashemi Beni

Affiliations: University of California, Los Angeles

AI summary: Uses CNN and Transformer models to segment the severity of rice bacterial leaf blight in UAV multispectral imagery, finding that lightweight CNN backbones are more reliable for operational monitoring and that vegetation indices bring small, consistent improvements.

Comments This paper has been accepted in IGARSS 2026. Copyright 2026 IEEE

Abstract

In this study, UAV multispectral imagery is used to segment the severity of bacterial leaf blight (BLB) in rice using convolutional neural networks (CNNs) and transformer-based models. The evaluated architectures include U-Net with a ResNet-101 encoder, U-Net++ with EfficientNet-B3 and EfficientNet-B7, DeepLabV3+, and SegFormer, all trained under a common pipeline with three input configurations (multispectral only, multispectral+NDVI, and multispectral+NDRE). Experiments are conducted using the publicly available BLB dataset with performance reported using mean IoU (mIoU), mean F1 (mF1), mean accuracy (mAcc), precision, and recall. U-Net++ with EfficientNet-B3 achieved the highest performance, with an mIoU of 97.62%. SegFormer obtained lower segmentation accuracy but comparable inference speed. Overall, the results indicate that lightweight CNN backbones remain more reliable for operational BLB monitoring while integration of vegetation indices provides small and consistent improvements. The study also highlights the value of standardised UAV datasets to compare disease mapping methods and encourages the use of CNN architectures for field implementation.
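
The "multispectral+NDVI" and "multispectral+NDRE" input configurations mentioned above can be sketched as appending one index channel to the band stack. NDVI and NDRE are the standard normalized differences of near-infrared against red and red-edge reflectance; the band names, the dictionary layout, and the helper `add_index_channel` are assumptions for illustration, not the study's pipeline.

```python
import numpy as np

def normalized_difference(a, b, eps=1e-8):
    """(a - b) / (a + b), with a small epsilon to avoid division by zero."""
    return (a - b) / (a + b + eps)

def add_index_channel(bands, index):
    """Stack the raw bands and append one vegetation-index channel.

    `bands` is a dict of 2D reflectance arrays; the keys used here
    ('red', 'red_edge', 'nir') are hypothetical names.
    """
    if index == "ndvi":
        extra = normalized_difference(bands["nir"], bands["red"])
    elif index == "ndre":
        extra = normalized_difference(bands["nir"], bands["red_edge"])
    else:
        raise ValueError(f"unknown index: {index}")
    return np.stack(list(bands.values()) + [extra], axis=0)

bands = {
    "red": np.full((2, 2), 0.2),
    "red_edge": np.full((2, 2), 0.4),
    "nir": np.full((2, 2), 0.6),
}
with_ndvi = add_index_channel(bands, "ndvi")  # shape (4, 2, 2)
with_ndre = add_index_channel(bands, "ndre")
print(with_ndvi[-1], with_ndre[-1])
```

The resulting array has one more channel than the raw band stack, so a segmentation network consuming it would need its first layer sized accordingly.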

2606.06338 2026-06-05 cs.CV Version update

StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset

Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang, Ruizhe Li, Hanlin Ge, Zhongyuan Wang, Chunxia Xiao, Chao Liang

Affiliations: School of Computer Science, Wuhan University; National Engineering Research Center for Multimedia Software; Hubei Key Laboratory of Multimedia and Network Communication Engineering

AI summary: Proposes the StoryVideoQA dataset and the PlotTree method, using a multi-agent collaboration framework to automatically generate large-scale question-answer pairs for deep video understanding and a hierarchical plot structure to improve reasoning over complex storylines.

Comments Accepted by IJCV 2026

Journal ref International Journal of Computer Vision (2026)

Abstract

Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. To address these challenges, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Although it can generate high-quality question-answer pairs (QAs) for TV series, it suffers significant performance degradation when handling longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework to generate high-quality DVU datasets for both TV series and movies. By integrating a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework is used to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours of diverse story videos, including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent that reorganizes long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms-mmap/StoryVideoQA/

2606.06329 2026-06-05 cs.LG cs.CG cs.CV stat.ML Version update

Efficient Mean Curvature Computation on High-Dimensional Data Manifolds

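Alexandre L. M. Levada

Affiliations: Federal University of São Carlos

AI summary: To address the prohibitive O(m^4) per-point cost of the original method for computing local mean curvature on high-dimensional datasets, proposes a fast estimator based on an algebraic identity and truncated SVD that reduces the cost to O(k^2 m + k m p^2), achieving speedups of 50 to 300 times on real-world datasets with negligible loss of accuracy.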

Comments 31 pages, 2 figures and 5 tables

Abstract

Estimating local mean curvature at each point of a high-dimensional dataset is a key ingredient of geometry-aware machine learning algorithms, such as the Mean Curvature Boundary Points (MCBP) method. The naive implementation of this computation, based on a local shape operator approximated from k-nearest neighbor patches, involves an explicit construction of a matrix $H$ whose trace form yields an $O(m^4)$ cost per point, rendering the approach intractable for datasets with more than a few dozen features. This paper introduces two complementary contributions that together reduce this cost by several orders of magnitude. The first contribution is an exact algebraic identity. This identity, derived from the orthogonality of the eigenvectors of the covariance matrix and the cyclicity of the trace operator, eliminates $H$ entirely and reduces the per-point cost to $O(m^2)$ after the eigendecomposition. The second contribution addresses the remaining $O(m^3)$ bottleneck of the full eigendecomposition. Since the local covariance matrix has rank at most $k-1 \ll m$, we replace it with a truncated SVD of the $k \times m$ centered data matrix, an $O(k^2 m)$ operation, and derive an analytical approximation for the contribution of the null-space eigenvectors based on the expected value of their outer product under the Haar measure. The resulting estimator has total cost $O(k^2 m + k m p^2)$, where $p = k-1$. Experiments on real-world datasets confirm speedups of 50 to 300 times relative to the original implementation, with negligible loss when the fast estimator is used to replace the original version. By providing a scalable and data-driven estimate of local curvature, the proposed method establishes curvature as a practical geometric feature for a broad range of machine learning tasks, from classical to modern deep learning pipelines.
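
The rank argument behind the second contribution can be checked numerically. The sketch below does not implement the curvature estimator or the Haar-measure approximation; it only shows, for an assumed random patch with k = 10 neighbors and m = 200 features, that the nonzero eigenpairs of the m × m local covariance matrix can be read off a thin SVD of the k × m centered data matrix. The variable names and sizes are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
k, m = 10, 200                       # assumed patch size and feature count
patch = rng.normal(size=(k, m))      # stand-in for a k-nearest-neighbor patch
centered = patch - patch.mean(axis=0)

# Thin SVD of the k x m centered matrix: roughly O(k^2 m) work.
_, s, vt = np.linalg.svd(centered, full_matrices=False)
eig_from_svd = s[: k - 1] ** 2 / (k - 1)   # nonzero covariance eigenvalues
basis = vt[: k - 1]                        # matching eigenvectors, one per row

# Reference: full eigendecomposition of the m x m covariance, O(m^3).
cov = centered.T @ centered / (k - 1)
eig_full = np.linalg.eigvalsh(cov)[::-1][: k - 1]

print(np.allclose(eig_from_svd, eig_full))
```

Centering removes one degree of freedom, which is why only k - 1 singular values are kept; the remaining m - k + 1 directions span the null space that the abstract treats analytically.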

2606.06309 2026-06-05 cs.CV Version update

RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling

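Chensheng Dai, Shengjun Zhang, Yifan Li, Zhang Zhang, Zheng Zhu, Yueqi Duan

Affiliations: Tsinghua University; GigaAI

AI summary: To address the slow inference of DiT video generation models, proposes RhymeFlow, a training-free framework that identifies keyframes and densely denoises only those while non-keyframes progressively skip steps, and adds a latent trajectory projection module to preserve temporal consistency, achieving faster inference and better quality.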

Comments Project Page: https://simon-dcs.github.io/Website-of-RhymeFlow/, Code: https://github.com/Simon-Dcs/RhymeFlow

Abstract

Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce computational complexity within each individual denoising step through techniques such as sparse attention and KV-caching. However, they rigidly adhere to the inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must be subjected to a complete, dense denoising process across all diffusion timesteps. We observe that, because adjacent frames share corresponding content and motion, once keyframes with critical semantic transitions are anchored, the intermediate states of the other frames often follow more predictable trajectories, which indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Since the skipped intermediate states of non-keyframes break temporal coherence in the keyframe denoising steps, leading to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines with higher inference speed and better visual quality.

2606.06294 2026-06-05 cs.CV cs.AI Version update

Towards One-to-Many Temporal Grounding

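Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li

Affiliations: University of Science and Technology of China

AI summary: For the One-to-Many Temporal Grounding (OMTG) task, proposes a systematic solution comprising a benchmark, a dataset, and reward functions, substantially improving multi-segment video grounding performance.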

Comments Accepted to ICML'26

Abstract

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query, a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both preciseness and completeness. Extensive experiments show our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.
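
To make the one-to-many setting concrete, the sketch below computes the standard temporal IoU between two segments and a simple count-based score over multi-segment predictions. The abstract names Count Accuracy (C-Acc) but does not define it, so the definition used here (the fraction of queries whose predicted segment count equals the ground-truth count) is an assumption, as are the function names and sample segments. EtF1 is not sketched because its definition is not given.

```python
def temporal_iou(a, b):
    """Intersection over union of two (start, end) segments, in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def count_accuracy(pred_lists, gt_lists):
    """Assumed C-Acc: share of queries with the right number of segments."""
    hits = sum(len(p) == len(g) for p, g in zip(pred_lists, gt_lists))
    return hits / len(gt_lists)

# Two hypothetical queries; the second prediction misses one segment.
preds = [[(0.0, 5.0), (8.0, 12.0)], [(1.0, 2.0)]]
gts = [[(0.0, 4.0), (9.0, 12.0)], [(1.0, 2.0), (6.0, 7.0)]]

print(temporal_iou((0.0, 10.0), (5.0, 15.0)))
print(count_accuracy(preds, gts))
```

A model tuned for one-to-one grounding would tend to return a single segment per query, which this kind of count-based score penalizes whenever the ground truth holds several disjoint segments.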

2606.06292 2026-06-05 cs.CV cs.RO 版本更新

Synthetic Data Generation and Vision-based Wrinkle and Keypoint Detection for Bimanual Cloth Manipulation

合成数据生成与基于视觉的褶皱和关键点检测用于双手布料操作

Ariel Herrera, Xueyang Kang, Atal Anil Kumar

发表机构 * Department of Engineering, University of Luxembourg(卢森堡大学工程系) School of Electrical and Electronic Engineering, Nanyang Technological University(南洋理工大学电子与电气工程学院) Université de Lorraine, Arts et Metiers Institute of Technology, LCFC(洛林大学,艺术与工艺技术学院,LCFC)

AI总结 针对布料操作中视觉感知难题,提出基于Blender的合成数据生成管道和结合CNN与YOLOv8-OpenCV的感知框架,实现褶皱抓取和关键点熨烫,关键点模型平均位置误差1.7615像素。

详情
AI中文摘要

纺织品的机器人操作仍然具有挑战性,因为连续变形和自遮挡阻碍了估计布料状态所需的鲁棒视觉感知。为了解决缺乏标注真实世界数据的问题,我们开发了一个基于Blender的合成管道,导出自动标注的关键点,并将人工标注的渲染图与真实世界数据结合训练褶皱检测器。我们提出了一个感知框架,集成了用于置换不变关键点检测的CNN和用于从结构褶皱中提取抓取点的YOLOv8-OpenCV管道。一个提出的双手算法利用该系统通过褶皱拉伸完全折叠的服装,一旦角落出现就过渡到基于关键点的熨烫。关键点模型实现了1.7615像素的平均位置误差(MPE)。感知系统无需微调即可迁移到物理织物上,优于在高遮挡状态下失败或在严重褶皱上产生误报的基线方法。

英文摘要

Robotic manipulation of textiles remains challenging because continuous deformation and self-occlusions hinder the robust visual perception required to estimate the cloth's state. To address the lack of annotated real-world data, we developed a Blender-based synthetic pipeline exporting auto-annotated keypoints, and combined manually labeled renders with real-world data to train a wrinkle detector. We present a perception framework integrating a CNN for permutation-invariant keypoint detection and a YOLOv8-OpenCV pipeline to extract grasping points from structural wrinkles. A proposed bimanual algorithm uses this system to stretch fully folded garments via wrinkles, transitioning to keypoint-based ironing once corners emerge. The keypoint model achieves a Mean Position Error (MPE) of 1.7615 pixels. The perception system transfers to physical fabrics without fine-tuning, outperforming baselines that fail in high-occlusion states or yield false positives on severe folds.

2606.06278 2026-06-05 cs.CV 版本更新

Geodesic Flow Matching on a Riemannian Degradation Manifold for Blind Image Restoration

黎曼退化流形上的测地流匹配用于盲图像恢复

Akshay Janardan Bankar, Ankita Chatterjee, Sayan Banerjee, Shreyas Pandith, Kalakonda Sai Shashank, Amit Satish Unde

发表机构 * Samsung Research Institute(三星研究院)

AI总结 提出在低维黎曼流形上显式建模退化,通过联合图像-流形空间上的测地流匹配目标学习内在传输动力学,实现盲图像恢复。

Comments Submitted to ECCV 2026

详情
AI中文摘要

盲图像恢复需要从被未知且可能混合退化破坏的观测中恢复干净图像。虽然近期基于确定性流的方法将恢复建模为将退化图像映射到干净图像的传输过程,但它们通常依赖欧几里得插值,隐含假设线性退化几何。本文中,我们显式地将退化建模为低维黎曼流形上的点,并将恢复表述为联合图像-流形空间上的测地传输。通过测地流匹配目标,我们学习尊重退化空间曲率的内在传输动力学。该框架推广了线性流匹配,为混合退化作为测地组合提供了原则性处理,并为泛化到未见退化提供了清晰的理论解释。

英文摘要

Blind image restoration requires recovering clean images from observations corrupted by unknown and potentially mixed degradations. While recent deterministic flow-based methods model restoration as transport processes that map degraded images to clean ones, they typically rely on Euclidean interpolation, implicitly assuming linear degradation geometry. In this paper, we explicitly model degradations as points on a low-dimensional Riemannian manifold and formulate restoration as geodesic transport on the joint image-manifold space. Using a geodesic flow matching objective, we learn intrinsic transport dynamics that respect the curvature of degradation space. This framework generalizes linear flow matching, provides a principled treatment of mixed degradations as geodesic compositions, and yields a clean theoretical interpretation for generalization beyond observed degradations.
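To illustrate the difference between Euclidean interpolation and geodesic transport that the abstract draws, here is a minimal sketch on the unit sphere. The sphere is only a stand-in chosen for this example; the abstract does not specify the degradation manifold or its metric, and the function names `lerp` and `slerp` are illustrative rather than taken from the paper.

```python
import math

def lerp(x0, x1, t):
    """Straight-line (Euclidean) interpolation; leaves the sphere for 0 < t < 1."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]

def slerp(x0, x1, t):
    """Geodesic (great-circle) interpolation between two unit vectors."""
    dot = max(-1.0, min(1.0, sum(a * b for a, b in zip(x0, x1))))
    omega = math.acos(dot)  # angle between the endpoints
    if omega < 1e-8:
        return lerp(x0, x1, t)  # endpoints coincide; the paths agree
    s = math.sin(omega)
    if s < 1e-8:
        raise ValueError("geodesic is not unique for antipodal points")
    w0 = math.sin((1.0 - t) * omega) / s
    w1 = math.sin(t * omega) / s
    return [w0 * a + w1 * b for a, b in zip(x0, x1)]
```

At the midpoint between two orthogonal unit vectors, the geodesic path stays on the sphere (norm 1), while the straight-line path cuts through its interior. This is the kind of geometric mismatch that a flow matching objective built on linear interpolation would ignore.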

2606.06255 2026-06-05 cs.RO cs.CV cs.DC 版本更新

RadiusFPS: Efficient Farthest Point Sampling on CPUs and GPUs via Spherical Voxel Pruning

RadiusFPS:通过球形体素剪枝在CPU和GPU上实现高效最远点采样

Ziyang Yu, Xiang Li, Qiong Chang, Jun Miyazaki

发表机构 * School of Computing(计算学院) Institute of Science(科学研究院) Tokyo(东京)

AI总结 提出RadiusFPS框架,利用球形体素剪枝加速最远点采样(FPS),在保持标准更新规则的同时,通过保守几何边界和坐标点跳过测试减少冗余计算,并在GPU上实现融合核,显著提升速度并降低内存占用。

Comments 28 pages, 15 figures

详情
AI中文摘要

点云是机器人感知的主要感官表示,支撑着基于激光雷达的自动驾驶、同时定位与地图构建(SLAM)和导航。在这些流程中,最远点采样(FPS)是最著名的下采样算子,其均匀覆盖保留了下游感知所依赖的几何结构。然而,经典FPS的大时间复杂度与现代3D传感器每秒百万点的速率难以匹配,使其成为与机器人系统的实时性和有限机载计算预算相冲突的主要延迟瓶颈。因此,我们提出RadiusFPS,一种基于球形体素剪枝的FPS加速框架,在相同初始化和打破平局策略下保留标准FPS更新规则。通过用球形体素索引点云,RadiusFPS推导出保守的几何边界,在每次迭代中剪枝冗余距离计算,并辅以坐标点跳过测试去除残余更新。我们进一步引入RadiusFPS-G,一种线程束级别的GPU实现,将体素选择、剪枝和距离更新融合到内存合并的核中,消除了昂贵的全局内存往返。在室内(S3DIS、ScanNet)和室外LiDAR(SemanticKITTI)基准测试中,RadiusFPS-G相比基于GPU的FPS实现了高达2.5倍的加速,在评估方法中与QuickFPS相当或更优,同时使用大约一半的GPU内存,并具有可比较的分割精度。当与基于学习的FastPoint采样器结合时,生成的流程在所有评估配置中实现了最快的端到端推理。这些特性使得高质量的FPS风格采样对于延迟和内存受限的机器人视觉变得实用。

英文摘要

Point clouds are a primary sensory representation for robotic perception, underpinning LiDAR-based autonomous driving, simultaneous localization and mapping (SLAM), and navigation. Within these pipelines, Farthest Point Sampling (FPS) is the best-known downsampling operator, as its uniform coverage preserves the geometric structure on which downstream perception relies. However, the high time complexity of classical FPS scales poorly with the million-point-per-second rates of modern 3D sensors, making it a dominant latency bottleneck that conflicts with the real-time requirements and limited onboard compute budgets of robotic systems. Therefore, we propose RadiusFPS, an FPS acceleration framework based on spherical voxel pruning that preserves the standard FPS update rule under the same initialization and tie-breaking policy. By indexing the point cloud with spherical voxels, RadiusFPS derives a conservative geometric bound that prunes redundant distance computations in each iteration, complemented by a coordinate-wise point-skip test that removes residual updates. We further introduce RadiusFPS-G, a warp-level GPU implementation that fuses voxel selection, pruning, and distance update into memory-coalesced kernels, eliminating costly global-memory round-trips. On indoor (S3DIS, ScanNet) and outdoor LiDAR (SemanticKITTI) benchmarks, RadiusFPS-G attains up to 2.5x speedup over GPU-based FPS and matches or exceeds QuickFPS among the evaluated methods while using roughly half its GPU memory, with comparable segmentation accuracy. When coupled with the learning-based FastPoint sampler, the resulting pipeline achieves the fastest end-to-end inference among all evaluated configurations. These properties make high-quality FPS-style sampling practical for latency- and memory-constrained robotic vision.
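For reference, the classical FPS baseline that the abstract sets out to accelerate can be sketched as follows. This is a minimal pure-Python illustration of the standard update rule only: it keeps each point's squared distance to its nearest selected sample and then picks the point with the largest such distance. It does not implement the spherical voxel pruning, the point-skip test, or the GPU kernels. The function name, the default seed index, and lowest-index tie-breaking are assumptions made for this example.

```python
def farthest_point_sampling(points, k, start=0):
    """Classical O(n*k) farthest point sampling over a list of coordinate tuples."""
    n = len(points)
    if not 0 < k <= n:
        raise ValueError("k must be between 1 and len(points)")
    # min_d2[i]: squared distance from point i to its nearest selected sample.
    min_d2 = [float("inf")] * n
    selected = [start]
    for _ in range(k - 1):
        last = points[selected[-1]]
        best_i, best_d2 = -1, -1.0
        for i, p in enumerate(points):
            d2 = sum((a - b) ** 2 for a, b in zip(p, last))
            if d2 < min_d2[i]:
                min_d2[i] = d2  # standard FPS distance update
            # Strict '>' keeps the lowest index on ties (assumed policy).
            if min_d2[i] > best_d2:
                best_i, best_d2 = i, min_d2[i]
        selected.append(best_i)
    return selected
```

Every iteration touches all n points, which is the redundant distance work that a conservative geometric bound could skip for points whose nearest-sample distance cannot change.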

2606.06249 2026-06-05 cs.CV cs.LG 版本更新

GRAMformer: Any-Order Modality Interactions via Volumetric Multimodal Cross-Attention

GRAMformer: 通过体积多模态交叉注意力实现任意顺序模态交互

Giordano Cicchetti, Eleonora Grassucci, Danilo Comminiello

发表机构 * Dept. of Information Engineering, Electronics, and Telecommunications, Sapienza University of Rome(信息工程、电子与电信系,罗马萨皮恩扎大学)

AI总结 提出体积多模态交叉注意力(VMA)机制,通过计算查询与多模态键向量的联合几何体积来建模任意顺序的模态交互,并集成到新型多模态Transformer架构GRAMformer中,提升多模态学习的有效性和效率。

详情
AI中文摘要

基于Transformer的多模态模型依赖注意力机制来整合异构模态间的信息。尽管取得了成功,现有的多模态注意力公式通过成对点积交互的集合或将所有模态拼接成键来计算分数,即使多个模态应该被联合参与。因此,当前方法要么在模态数量上产生二次复杂度,要么无法显式建模依赖于多个表示联合配置的交互。在这项工作中,我们引入了体积多模态交叉注意力(VMA),一种新颖的交叉注意力机制,其中注意力分数被定义为查询和多个模态特定键的联合几何的函数。VMA计算跨多个模态的查询和键向量所张成的体积,捕获超越成对相似性的联合多模态依赖,实现任意顺序模态交互的原生建模。我们将VMA集成到我们新颖的多模态Transformer架构中,命名为GRAMformer,该架构专门设计用于整合任意数量的模态。我们在多模态学习任务上评估了所提出的模型,展示了改进的有效性和效率。

英文摘要

Transformer-based multimodal models rely on attention mechanisms to integrate information across heterogeneous modalities. Despite their success, existing multimodal attention formulations compute their scores through collections of pairwise dot-product interactions or by concatenating all the modalities into the keys, even when multiple modalities should be jointly involved. As a consequence, current approaches either incur quadratic complexity in the number of modalities or fail to explicitly model interactions that depend on the joint configuration of multiple representations. In this work, we introduce the Volumetric Multimodal cross-Attention (VMA), a novel cross-attention mechanism in which attention scores are defined as a function of the joint geometry of a query and multiple modality-specific keys. VMA computes the volume spanned by query and key vectors across multiple modalities, capturing joint multimodal dependencies beyond pairwise similarity, enabling native modeling of any-order modality interactions. We integrate VMA into our novel multimodal transformer architecture, named GRAMformer, explicitly designed to integrate any number of modalities. We evaluate the proposed model on multimodal learning tasks, demonstrating improved effectiveness and efficiency.
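The "volume spanned by query and key vectors" can be made concrete with the Gram determinant: for vectors v_1, ..., v_m, the volume of the parallelotope they span is the square root of det(G), where G[i][j] is the inner product of v_i and v_j. The sketch below computes only this quantity. How VMA turns the volume into an attention score is not stated in the abstract, so nothing here should be read as the authors' formulation, and the function name is hypothetical.

```python
import math

def gram_volume(vectors):
    """Volume of the parallelotope spanned by the vectors: sqrt(det(G))."""
    m = len(vectors)
    # Gram matrix of pairwise inner products.
    g = [[sum(a * b for a, b in zip(u, v)) for v in vectors] for u in vectors]
    det = 1.0
    # Gaussian elimination with partial pivoting to obtain det(G).
    for c in range(m):
        pivot = max(range(c, m), key=lambda r: abs(g[r][c]))
        if abs(g[pivot][c]) < 1e-12:
            return 0.0  # linearly dependent vectors span zero volume
        if pivot != c:
            g[c], g[pivot] = g[pivot], g[c]
            det = -det
        det *= g[c][c]
        for r in range(c + 1, m):
            f = g[r][c] / g[c][c]
            for k in range(c, m):
                g[r][k] -= f * g[c][k]
    return math.sqrt(max(det, 0.0))
```

Nearly aligned vectors span a small volume and mutually orthogonal unit vectors span volume 1, so a single scalar summarizes the joint configuration of one query and several keys without enumerating pairs.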

2606.06242 2026-06-05 cs.CL cs.AI cs.CV cs.IR 版本更新

Benchmarking Open-Source Layout Detection Models for Data Snapshot Extraction from Institutional Documents

面向机构文档数据快照提取的开源布局检测模型基准测试

AJ Carl P. Dy, Aivin V. Solatorio

发表机构 * Development Data Group Office of the World Bank Group Chief Statistician(世界银行发展数据分析组办公室世界银行统计主任) The World Bank(世界银行)

AI总结 针对机构文档中图表数据快照提取任务,构建基准数据集并评估多个开源布局检测模型,发现现有模型在操作型文档上泛化能力不足,存在内容混淆、碎片化及上下文缺失等问题。

Comments 23 pages, 8 figures

详情
AI中文摘要

机构文档中的图表包含大量操作和分析信息。当前从文档中提取视觉内容的方法主要围绕通用文档布局分析,将图表视为统一相关的文档对象,而非具有语义意义的分析产物。在这项工作中,我们引入了一个基准数据集和评估框架,用于数据快照提取,即识别和定位机构文档中具有语义意义的视觉产物的任务。该基准涵盖人道主义报告、世界银行政策研究工作论文和项目评估文件,并包含包含可重用分析信息的图表注释。利用该数据集,我们对多个开源布局检测模型进行了基准测试,并评估了检测性能和空间提取质量。结果表明,尽管当前模型在传统学术基准上表现强劲,但在操作型机构文档上难以泛化。常见的失败模式包括分析内容与非分析内容混淆、复合分析产物碎片化,以及解释所需的上下文信息提取不完整。这些发现凸显了通用文档布局分析与操作上有用的数据快照提取之间持续存在的差距。我们发布了源PDF、注释数据集、元数据和源代码,以支持操作型文档智能的未来研究。数据集可在https://huggingface.co/datasets/ai4data/data-snapshot获取,源代码可在https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot获取。

英文摘要

Institutional documents contain substantial amounts of operational and analytical information embedded within figures and tables. Current approaches for extracting visual content from documents are largely built around generic document layout analysis, where figures and tables are treated as uniformly relevant document objects rather than semantically meaningful analytical artifacts. In this work, we introduce a benchmark dataset and evaluation framework for data snapshot extraction, the task of identifying and localizing semantically meaningful visual artifacts within institutional documents. The benchmark spans humanitarian reports, World Bank policy research working papers, and project appraisal documents, and includes annotations for figures and tables that contain reusable analytical information. Using this dataset, we benchmarked multiple open-source layout detection models and evaluated both detection performance and spatial extraction quality. Our results show that current models struggle to generalize to operational institutional documents despite strong performance on conventional academic benchmarks. Common failure modes include confusion between analytical and non-analytical content, fragmentation of composite analytical artifacts, and incomplete extraction of contextual information required for interpretation. These findings highlight a persistent gap between generic document layout analysis and operationally useful data snapshot extraction. We release the source PDFs, annotation dataset, metadata, and source code to support future research in operational document intelligence. The dataset is available at https://huggingface.co/datasets/ai4data/data-snapshot and the source code is available at https://github.com/worldbank/ai4data/tree/main/experimental/data-snapshot.

2606.06228 2026-06-05 cs.CV 版本更新

SAM-Flow: Source-Anchored Masked Flow for Training-Free Image Editing

SAM-Flow:源锚定掩码流用于免训练图像编辑

Haowang Cui, Rui Chen, Tao Luo, Tao Guo, Zheng Qin, Jiaze Wang

发表机构 * Tianjin Key Laboratory of Imaging and Sensing Microelectronic Technology, School of Microelectronics, Tianjin University(天津影像与传感微电子技术重点实验室,微电子学院,天津大学) School of Cyber Security, Tianjin University(网络安全学院,天津大学)

AI总结 提出SAM-Flow框架,通过源锚定掩码流和动态软掩码机制实现局部免训练图像编辑,有效防止背景泄漏。

Comments Code is available at: https://github.com/chwbob/Sam-Flow

详情
AI中文摘要

免训练图像编辑最近因能够利用强大的预训练扩散和流匹配模型修改真实图像而无需额外训练,引起了越来越多的关注。然而,现有的基于反演和基于差分流的方法通常执行全局潜在传输,这不可避免地会将编辑效果传播到非目标区域并导致背景泄漏。为了解决这个问题,我们提出了SAM-Flow,一种源锚定掩码流框架,用于局部免训练图像编辑。SAM-Flow不是更新整个潜在表示,而是首先使用侦察图像和令牌接地注意力图来定位可编辑的语义区域。然后,它仅在这些区域内应用差分速度更新,同时将剩余区域锚定到源图像潜在轨迹。为了进一步提高空间稳定性和边界自然性,我们引入了一种时变源锚定投影机制,具有动态软掩码、过渡区域和时间掩码累积。所提出的方法是即插即用的,可以集成到主流流匹配骨干网络(如Stable Diffusion 3和FLUX)中,无需任何微调。大量的定性和定量实验表明,SAM-Flow实现了准确的语义编辑,同时显著改善了背景保持,为免训练图像编辑提供了一种简单且通用的局部编辑范式。代码可在 https://github.com/chwbob/Sam-Flow 获取。

英文摘要

Training-free image editing has recently attracted increasing attention due to its ability to modify real images using powerful pre-trained diffusion and flow-matching models without additional training. However, existing inversion-based and differential-flow-based methods usually perform global latent transport, which inevitably propagates editing effects to non-target regions and leads to background leakage. To address this problem, we propose SAM-Flow, a source-anchored masked flow framework for localized training-free image editing. Instead of updating the whole latent representation, SAM-Flow first uses a scout image and token-grounded attention maps to localize the editable semantic regions. It then applies differential velocity updates only within these regions, while anchoring the remaining areas to the source-image latent trajectory. To further improve spatial stability and boundary naturalness, we introduce a time-varying source-anchored projection mechanism with dynamic soft masks, transition regions, and temporal mask accumulation. The proposed method is plug-and-play and can be integrated with mainstream flow-matching backbones such as Stable Diffusion 3 and FLUX without any fine-tuning. Extensive qualitative and quantitative experiments demonstrate that SAM-Flow achieves accurate semantic editing while significantly improving background preservation, providing a simple and general localized editing paradigm for training-free image editing. Code is available at: https://github.com/chwbob/Sam-Flow.
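The core idea of editing inside a mask while anchoring everything else to the source trajectory can be sketched as a single blended update step. This is a simplified illustration under stated assumptions: latents are flat lists of floats, the update is a plain Euler step, and the soft mask is given. The abstract does not spell out the exact projection rule, the mask schedule, or the temporal accumulation, and the function and parameter names are hypothetical.

```python
def masked_anchor_step(z, v_edit, z_source_next, mask, dt):
    """One Euler step: follow the edit velocity where mask ~ 1, the source latent where mask ~ 0."""
    out = []
    for zi, vi, si, mi in zip(z, v_edit, z_source_next, mask):
        edited = zi + dt * vi  # update driven by the editing velocity
        # Soft blend: mask values strictly between 0 and 1 form a transition region.
        out.append(mi * edited + (1.0 - mi) * si)
    return out
```

With a binary mask this reduces to a hard copy of the source latent outside the edit region, which is what would prevent edits from leaking into the background. Fractional mask values give the smoother boundary that soft masks are meant to provide.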

2606.06217 2026-06-05 cs.CV cs.AI 版本更新

DisasterBench: A Multimodal Benchmark for UAV-Based Disaster Response in Complex Environments

DisasterBench: 复杂环境中基于无人机灾害响应的多模态基准

Tan Zhang, Quanyou Li, Lu Zhang, Jun Liu, Xiaofeng Zhu, Ping Hu

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出DisasterBench多模态基准,涵盖14种灾害场景和9个响应任务,并设计轻量级模型DisasterVL通过三阶段优化在边缘设备上实现高效推理。

详情
AI中文摘要

当灾难发生时,响应者不仅需要回答正在发生什么,还需要回答为什么发生、接下来会发生什么以及现在该做什么,而这些通常来自嘈杂的低空无人机视角,并在现场计算资源紧张的情况下进行。然而,现有的大多数多模态基准侧重于感知(例如识别/描述),覆盖的灾害类型有限,并且对实际应急响应所需的多阶段推理支持不足。我们引入了DisasterBench,一个用于复杂环境中基于无人机灾害响应的多阶段多模态推理基准。DisasterBench涵盖14种灾害相关场景类型和9个响应关键任务,覆盖灾前、灾中和灾后阶段,具有细粒度的灾害-任务映射,明确测试因果归因、传播预测、损害分析和决策导向推理。为了在边缘设备上实现推理,我们进一步提出了DisasterVL,一个轻量级多模态模型,通过三阶段流水线进行优化,结合领域指令微调、思维链引导的多模态对齐以及基于强化学习的策略优化。在21个流行的MLLM上的实验表明,我们的2B参数DisasterVL优于所有评估的开源模型,并显著缩小了与最先进闭源模型的差距,实现了与GPT-4o相当的推理准确性和更高的效率。项目页面:https://github.com/TanmouTT/DisasterBench。

英文摘要

When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a multi-stage multimodal reasoning benchmark for UAV-based disaster response in complex environments. DisasterBench spans 14 disaster-related scene types and 9 response-critical tasks across pre-, during-, and post-disaster stages, with fine-grained disaster-task mappings that explicitly test causal attribution, propagation prediction, damage analysis, and decision-oriented reasoning. To enable reasoning on the edge, we further propose DisasterVL, a lightweight multimodal model optimized with a three-stage pipeline combining domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement learning-based policy optimization. Experiments across 21 popular MLLMs show that our 2B-parameter DisasterVL outperforms all evaluated open-source models and substantially narrows the gap to state-of-the-art closed-source models, achieving GPT-4o-comparable reasoning accuracy with superior efficiency. The project page is available at https://github.com/TanmouTT/DisasterBench.

2606.06199 2026-06-05 cs.CV cs.GR 版本更新

SC-MFJ: A Simple Haptic Quality Metric for Medical Image Segmentation

SC-MFJ: 一种用于医学图像分割的简单触觉质量度量

Souraj Adhikary, Negar Chabi, Andre Mastmeyer

发表机构 * Jade University of Applied Sciences(亚德应用科学大学)

AI总结 针对手术模拟中触觉渲染对分割表面质量的需求,提出SC-MFJ度量,通过虚拟触笔行走测量接触力抖动,揭示了几何度量无法发现的触觉质量差异。

Comments 11 pages, 5 figures, 5 tables, http://www.wscg.eu/

详情
AI中文摘要

标准分割度量如Dice和Hausdorff距离测量几何重叠,但无法判断分割表面是否适合手术模拟中的触觉渲染。我们提出SC-MFJ(表面约束平均力抖动),一种简单、廉价的度量,通过多次短虚拟触笔行走采样分割器官表面,并测量由此产生的接触力抖动程度。该度量从现有分割输出计算,每个病例约需一分钟CPU时间。我们在五折交叉验证中对80个病例评估了三种胰腺CT分割方法——原始二值nnU-Net输出、高斯平滑输出和学习的符号距离函数(SDF)回归。SC-MFJ显示,原始二值基线与简单高斯后处理之间的触觉质量差距达147倍,而Dice和HD95完全无法察觉这一差异。它还表明,尽管需要完整的模型重新训练,学习的SDF回归产生的触觉质量比高斯平滑更不稳定,病例级标准差为168 N/s²,而高斯平滑为22 N/s²。在LiTS肝脏数据集(131个病例)上的第二次评估证实了这些发现的普遍性:二值到高斯的差距扩大到189倍,且高斯平滑在所有折中始终产生一致的低力抖动。我们的结果表明,对于触觉模拟应用,一行后处理步骤可能就足够了,而像SC-MFJ这样廉价的度量可以标记出几何度量遗漏的问题。

英文摘要

Standard segmentation metrics such as Dice and Hausdorff distance measure geometric overlap but say nothing about whether a segmented surface is suitable for haptic rendering in surgical simulation. We propose SC-MFJ (Surface-Constrained Mean Force Jerk), a simple, inexpensive metric that samples a segmented organ surface with many short virtual stylus walks and measures how jerky the resulting contact forces are. The metric is computed from existing segmentation outputs and uses roughly one minute of CPU time per case. We evaluate three pancreas CT segmentation approaches, namely binary nnU-Net output, Gaussian-smoothed output, and learned signed distance function (SDF) regression, across 80 cases in five-fold cross-validation. SC-MFJ reveals a 147x gap in haptic quality between the raw binary baseline and simple Gaussian post-processing, a difference entirely invisible to Dice and HD95. It also shows that learned SDF regression, despite requiring full model retraining, produces more variable haptic quality than Gaussian smoothing, with a case-level standard deviation of 168 N/s² compared with 22 N/s² for Gaussian. A second evaluation on the LiTS liver dataset (131 cases) confirms the generality of these findings: the binary-to-Gaussian gap widens to 189x, and Gaussian smoothing again produces consistently low force jerk across all folds. Our results suggest that for haptic simulation applications, a one-line post-processing step may be sufficient, and that a cheap metric like SC-MFJ can flag problems that geometric metrics miss.

2606.06194 2026-06-05 cs.RO cs.CV 版本更新

ActiveMimic: Egocentric Video Pretraining with Active Perception

ActiveMimic: 基于主动感知的自我中心视频预训练

Xingyao Lin, Guojin Zhong, Tianyi Lu, Ziyi Ye, Yichen Zhu, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Current Robotics NeoteAI

AI总结 提出ActiveMimic框架,从自我中心人类视频中恢复同步的相机和手腕轨迹,将相机运动建模为视角动作,联合学习主动感知和操作技能,使预训练模型在机器人任务上达到与机器人数据预训练相当的性能。

Comments Project Page: https://activemimic.github.io/

详情
AI中文摘要

自我中心人类视频为机器人数据预训练提供了一种可扩展的替代方案,但在此类视频上预训练的模型始终不如在机器人数据上预训练的模型。我们将这一差距归因于缺失的信号,即自我中心视频中的主动感知行为,其中人类在操作过程中不断重新定位视角,导致标准流程视为噪声的相机运动。为解决这一问题,我们提出了ActiveMimic,一个预训练框架,从单个身体佩戴的RGB相机中恢复同步的相机和手腕轨迹,将相机运动建模为视角动作,并在适应目标机器人之前,从野外自我中心人类视频中联合学习主动感知和操作。实验表明,在具有不同主动感知需求的任务中,ActiveMimic始终优于在人类视频上预训练的基线,并与在机器人数据上预训练的最先进模型相匹配。进一步分析提供了证据,表明主动感知能力源自自我中心人类视频预训练而非机器人特定微调,确认了主动感知是解锁自我中心人类视频用于机器人预训练的关键。

英文摘要

Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.

2606.06186 2026-06-05 cs.CV 版本更新

Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models

对抗攻击已揭示答案:面向视觉语言模型的定向偏差引导测试时防御

Liangsheng Liu, Si Chen, Jiamin Wu, Weiwei Feng, Zhixin Cheng, Xiaotian Yin, Wenfei Yang, Tianzhu Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学) National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory(国家深空探测重点实验室,深空探测实验室) The Chinese University of Hong Kong(香港中文大学) Zhejiang University(浙江大学) Ant Group(蚂蚁集团)

AI总结 提出定向偏差引导防御(DBD),利用对抗样本在CLIP特征空间中沿主导方向偏移的现象,通过估计防御方向并采用DB分数双流重建策略恢复鲁棒表示,在15个数据集上实现最先进对抗鲁棒性且保持干净准确率。

Comments Accepted by ICLR2026

详情
AI中文摘要

视觉语言模型(VLM),如CLIP,展现出强大的零样本泛化能力,但仍高度易受对抗扰动影响,在现实应用中构成严重风险。针对VLM的测试时防御最近成为一种有前景且高效的方法,无需昂贵的大规模重训练即可防御对抗攻击。在这项工作中,我们发现了一个令人惊讶的现象:在多种输入变换下,CLIP特征空间中的对抗图像始终沿主导方向偏移,而干净图像则呈现分散模式。我们假设这种主导偏移(称为防御方向)与对抗偏移相反,将特征指向正确的类别中心。基于这一见解,我们提出了定向偏差引导防御(DBD),一种测试时框架,用于估计防御方向,并采用基于DB分数的双流重建策略恢复鲁棒表示。在15个数据集上的实验表明,DBD不仅实现了最先进的对抗鲁棒性,同时保持了干净准确率,还揭示了对抗准确率甚至可能超过干净准确率的反直觉结果。这表明对抗扰动内在地编码了关于真实决策边界的定向先验信息。

英文摘要

Vision-Language Models (VLMs), such as CLIP, have shown strong zero-shot generalization but remain highly vulnerable to adversarial perturbations, posing serious risks in real-world applications. Test-time defenses for VLMs have recently emerged as a promising and efficient approach to defend against adversarial attacks without requiring costly large-scale retraining. In this work, we uncover a surprising phenomenon: under diverse input transformations, adversarial images in CLIP's feature space consistently shift along a dominant direction, in contrast to the dispersed patterns of clean images. We hypothesize that this dominant shift, termed the Defense Direction, opposes the adversarial shift, pointing features back toward their correct class centers. Building on this insight, we propose Directional Bias-guided Defense (DBD), a test-time framework that estimates the Defense Direction and employs a DB-score-based two-stream reconstruction strategy to recover robust representations. Experiments on 15 datasets demonstrate that DBD not only achieves SOTA adversarial robustness while preserving clean accuracy, but also reveals the counterintuitive result that adversarial accuracy can even surpass clean accuracy. This demonstrates that adversarial perturbations inherently encode directional priors about the true decision boundary.

2606.06158 2026-06-05 cs.CV 版本更新

Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting

通过时间冗余掩码和潜在修复的自适应分词

Kevin Dave, Sai Aditya Patkuri, Chhaya Kumar Das, Gouranga Bala, R. Venkatesh Babu, Rajeshkumar SA

发表机构 * Phronetic AI IISc Bangalore IIT Bombay

AI总结 提出一种无参数的自适应视频分词机制,利用冻结连续分词器的潜在空间中的时间冗余,通过阈值丢弃冗余位置,并使用轻量级潜在修复变压器重建,实现内容驱动的令牌分配和高效推理。

详情
AI中文摘要

自适应视频分词旨在根据序列的底层视觉复杂度动态分配令牌预算。当前的连续方法通过迭代二值化搜索或训练神经回归器实现,而离散方法通常需要全速率解码器来估计信息内容。我们证明这些计算开销并非必要。我们表明,冻结的连续视频分词器的潜在空间固有地编码了可直接利用的时间冗余:潜在表示在连续帧之间变化最小的空间位置携带接近零的额外信息。我们引入了一种无参数的自适应令牌分配机制,该机制对每个位置的时间L1差异应用固定阈值,识别并丢弃冗余的潜在位置。因此,压缩率自然地从输入内容中产生,而不是自上而下地强制执行:静态场景被积极压缩,而高度动态的序列保留更多令牌。为了重建丢弃的位置,我们提出了潜在修复变压器(LIT),一种轻量级的分解时空注意力架构。得到的推理流水线非常高效,仅需一次编码器前向传播和一次LIT前向传播,消除了辅助路由网络的需求。在TokenBench和DAVIS(近期分词器使用的标准基准)上的评估表明,我们的框架产生了有意义的、内容驱动的令牌分配,同时保持了有竞争力的重建保真度,并且相比连续自适应基线(ElasticTok-CV)实现了31倍的推理加速,相比离散信息论基线(InfoTok)实现了约2倍的加速。

英文摘要

Adaptive video tokenisation seeks to dynamically allocate token budgets based on the underlying visual complexity of a sequence. Current continuous-regime approaches achieve this via iterative binarised searches or trained neural regressors, while discrete methods often require a full-rate decoder pass to estimate information content. We demonstrate that such computational overheads are not strictly necessary. We show that the latent space of a frozen continuous video tokeniser inherently encodes temporal redundancy that can be exploited directly: spatial positions whose latent representations change minimally between consecutive frames carry near-zero additional information. We introduce a parameter-free adaptive token allocation mechanism that applies a fixed threshold to per-position temporal-L1 differences, identifying and dropping redundant latent positions. Consequently, the compression rate emerges naturally from the input content rather than being enforced top-down: static scenes get compressed aggressively, while highly dynamic sequences retain more tokens. To reconstruct the dropped positions, we propose the Latent Inpainting Transformer (LIT), a lightweight factorised spatial-temporal attention architecture. The resulting inference pipeline is highly efficient, requiring only a single encoder pass and one LIT forward pass, eliminating the need for auxiliary routing networks. Evaluations across TokenBench and DAVIS, which are the standard benchmarks used by recent tokenisers, indicate that our framework yields meaningful, content-driven token allocation while maintaining competitive reconstruction fidelity, and delivers a 31x inference-time speedup over the continuous adaptive baseline (ElasticTok-CV) and an approximately 2x speedup over the discrete information-theoretic baseline (InfoTok).
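The thresholding rule described above can be sketched in a few lines. This is an illustration under assumptions the abstract does not confirm: the temporal L1 difference is averaged over channels, each frame is compared against the immediately preceding frame, the first frame is always kept, and the threshold value is supplied by the caller. The function name and the nested-list data layout are hypothetical.

```python
def temporal_keep_mask(latents, threshold):
    """latents[t][p] is the channel vector at frame t, position p.

    Returns keep[t][p]: True where the position changed enough to retain its token.
    """
    keep = [[True] * len(latents[0])]  # assumption: first frame kept in full
    for t in range(1, len(latents)):
        row = []
        for prev, cur in zip(latents[t - 1], latents[t]):
            # Mean absolute change across channels at this spatial position.
            l1 = sum(abs(a - b) for a, b in zip(cur, prev)) / len(cur)
            row.append(l1 > threshold)
        keep.append(row)
    return keep
```

Because the mask depends only on the content, a static clip yields almost no kept positions after the first frame while a fast-moving clip keeps most of them. That is the sense in which the compression rate emerges from the input instead of being fixed in advance.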

2606.06155 2026-06-05 cs.RO cs.CV cs.MM 版本更新

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

AffordanceVLA:一种通过可供性感知理解赋能动作生成的视觉-语言-动作模型

Qize Yu, Jiadi You, Yuran Wang, Jiaqi Liang, Bowen Ping, Yang Tian, Yue Chen, Minghong Cai, Zeying Gong, Ruihai Wu, Yinchuan Li, Junwei Liang, Yingcong Chen

发表机构 * Peking University(北京大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Chinese University of Hong Kong(香港中文大学) Knowin AI

AI总结 提出AffordanceVLA框架,通过引入结构化可供性预测作为任务导向的中间表示,解决VLA模型中语义空间与具身控制策略的结构不匹配问题,实现精确的感知-动作映射。

Comments Preprint. Code and project page are available. Code: https://github.com/Skywalker-yqz/AffordanceVLA Project page: https://skywalker-yqz.github.io/AffordanceVLA/

详情
AI中文摘要

视觉-语言-动作(VLA)模型利用预训练视觉-语言模型(VLM)的丰富世界知识来实现指令跟随的机器人操作。然而,VLM语义空间与具身控制策略之间的结构不匹配常常阻碍精确感知-动作映射的学习。为解决这一挑战,我们提出AffordanceVLA,一个统一框架,引入结构化可供性预测作为任务导向的中间表示,以建立更精确和鲁棒的感知-动作映射。具体而言,我们通过三个互补组件逐步建模操作先验:1) Which2Act,通过视觉潜在预测进行以物体为中心的定位以抑制干扰;2) Where2Act,通过可供性图估计进行2D交互定位;3) How2Act,用于引导操作策略的3D几何推理。这些可供性线索提供了空间定位、语义条件化和动作耦合的中间表示,从而自然地桥接视觉、语言和动作。我们将这些模块集成到具有专门专家的混合Transformer(MoT)架构中,并使用三阶段训练策略和渐进式数据课程训练模型。为克服机器人数据集中密集可供性标签的稀缺性,我们还开发了一个鲁棒的自动化数据增强流水线。在仿真和真实世界中的大量实验表明,AffordanceVLA在多种操作场景中实现了强大的性能。

英文摘要

Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception-action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act for object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act for 2D interaction localization via affordance map estimation; and 3) How2Act for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language, and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.

2606.06142 2026-06-05 cs.CV 版本更新

Computation-Aware Event-to-Frame Reconstruction via Selective Attention

计算感知的基于选择性注意力的事件到帧重建

Jingqian Wu, Yunbo Jia, Edmund Y. Lam

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种高效的事件到帧重建框架,通过循环编码器-解码器、选择性上下文融合和轻量级混合注意力机制,在保持重建质量的同时降低计算复杂度。

详情
AI中文摘要

事件到帧(E2F)重建将异步事件流与基于帧的视觉流水线连接起来,但现有方法通常在重建质量和计算效率之间面临权衡。在这项工作中,我们提出了一种高效的E2F框架,强调因果时间建模和计算感知设计。该架构采用循环编码器-解码器,以紧凑的隐藏状态逐步聚合事件信息。为了提高在快速运动和光照变化下的鲁棒性,引入了一种选择性上下文融合策略,将事件驱动的特征与先验强度线索相结合。在此融合过程中,一种轻量级混合注意力机制增强了特征选择性,而无需依赖繁重的注意力操作。在标准基准上的实验结果表明,所提出的方法在保持重建性能竞争力的同时,在准确性和模型复杂度之间取得了良好的平衡。

英文摘要

Event-to-frame (E2F) reconstruction bridges asynchronous event streams with frame-based vision pipelines, but existing methods often face a trade-off between reconstruction quality and computational efficiency. In this work, we propose an efficient E2F framework that emphasizes causal temporal modeling and computation-aware design. The architecture adopts a recurrent encoder-decoder to incrementally aggregate event information with compact hidden states. To improve robustness under fast motion and illumination variations, a selective context fusion strategy is introduced to integrate event-driven features with prior intensity cues. Within this fusion process, a lightweight hybrid attention mechanism enhances feature selectivity without relying on heavy attention operations. Experimental results on standard benchmarks demonstrate that the proposed approach achieves competitive reconstruction performance while maintaining a favorable balance between accuracy and model complexity.

2606.06120 2026-06-05 cs.CV 版本更新

Diff-CA: Separating Common and Salient Factors with Diffusion Models

Diff-CA: 使用扩散模型分离共同因素和显著因素

Michaël Soumm, Alexandre Fournier Montgieux, Yunlong He, Pietro Gori, Alasdair Newson

发表机构 * INRIA at Univ. Grenoble Alpes(法国格勒诺布尔大学INRIA实验室) CEA List, Palaiseau(法国CEA列表,帕莱索) Télécom Paris, Institut Polytechnique de Paris(巴黎电信学院,巴黎理工学院)

AI总结 提出一种基于扩散模型的条件框架,通过弱监督学习将图像条件分解为共同因素和显著因素,实现对比分析中的因素分离,并保持高保真图像生成质量。

详情
AI中文摘要

对比分析旨在将两个数据分布之间的共同因素与仅对其中一个分布显著的因素分离开来。现有的对比方法基于生成模型(如VAE或GAN),这些模型通常受到重建和图像质量有限的困扰,这阻碍了有效的潜在因素分离,并限制了它们在高保真图像生成和编辑中的应用。我们提出了一种新颖的扩散模型条件框架,能够在不牺牲生成质量的情况下实现对比分解。我们首先训练一个无需提示、以图像为条件的扩散模型,然后学习使用弱监督将条件分解为共同因素和显著因素。我们证明了先前工作中通常假设的加性对比分解在温和条件下是可识别的。这种分解通过仅交换或插值显著因素来实现有针对性的操作。

英文摘要

Contrastive Analysis aims to separate factors that are common between two data distributions from those that are salient to only one of them. Existing contrastive methods are based on generative models (e.g., VAEs or GANs) that often suffer from limited reconstruction and image quality, which hampers effective latent factor separation and limits their applicability to high-fidelity image generation and editing. We propose a novel conditioning framework for diffusion models that enables contrastive decomposition without compromising generation quality. We first train a prompt-free, image-conditioned diffusion model, and then learn to decompose the conditioning into a common and a salient factor, using weak supervision. We prove that the additive contrastive factorization, commonly assumed in prior work, is identifiable under mild conditions. This factorization enables targeted operations by swapping or interpolating only the salient factor.

2606.06103 2026-06-05 cs.CV 版本更新

MS-DKC: A Dataset Knowledge Card Framework for Designing and Adapting Medical Image Segmentation Models

MS-DKC:用于设计和适配医学图像分割模型的数据集知识卡片框架

Tariq M. Khan, Syed Saud Naqvi, Thantrira Porntaveetus, Hamid Alinejad-Rokny, Shahzaib Iqbal, Imran Razzak, Mohammad AU Khan

发表机构 * Center of Excellence in Precision Medicine and Digital Health, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand(精准医学与数字健康中心,朱拉隆功大学牙科学院,泰国曼谷) Department of Computer Engineering, COMSATS University Islamabad, Islamabad, Pakistan(计算机工程系,COMSATS伊斯兰堡大学,巴基斯坦伊斯兰堡) School of Biomedical Engineering, UNSW, Sydney, NSW, Australia(生物医学工程学院,新南威尔士大学,澳大利亚悉尼,新南威尔士) Visiting Scholar (Collaborative Projects), Center of Excellence in Precision Medicine and Digital Health, Chulalongkorn University, Bangkok, Thailand(访问学者(合作项目),精准医学与数字健康中心,朱拉隆功大学,泰国曼谷) Department of Computing, Abasyn University Islamabad Campus (AUIC), Islamabad, Pakistan(计算系,阿巴斯扬大学伊斯兰堡校区(AUIC),巴基斯坦伊斯兰堡) Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates(穆罕默德·本·扎耶德人工智能大学,阿布扎比,阿拉伯联合酋长国) College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia(计算机与信息科学学院,苏丹王子大学,沙特阿拉伯利雅得)

AI总结 提出MS-DKC框架,通过显式记录数据集特征(如前景占有率、形态、边界模糊性等)并映射到失败模式、设计先验和风险对齐标准,指导医学图像分割模型的设计与适配,在DRIVE、ISIC2018和ACDC数据集上验证了数据集条件化设计的有效性。

详情
AI中文摘要

医学图像分割通常被定义为寻找更强架构的问题,但这可能掩盖一个更基本的问题:数据集对模型有什么要求?在医学影像中,这种要求由前景占有率、形态、边界模糊性、拓扑敏感性、标注质量、采集变异和操作点决定。本文介绍了医学分割数据集知识卡片(MS-DKC),一个使这些因素显式化的框架。MS-DKC通过图像/采集、形态、监督、上下文依赖和部署风险描述符记录数据集证据。这些描述符被映射到失败模式、设计先验和风险对齐标准,使分割设计比架构优先比较更具可追溯性。我们在DRIVE、ISIC2018和ACDC上评估了MS-DKC,它们代表了不同的场景。DRIVE包含稀疏、细小的分支血管,有利于细节保持模型、敏感性感知优化、阈值分析和拓扑感知指标。DKC-TNet-v2以35103个参数达到了Dice 0.8044和IoU 0.6730,而SA-UNetv2-DKC-AmbRef达到了Dice 0.8141、IoU 0.6865、敏感性0.8265、特异性0.9804和AUC 0.9853。ISIC2018涉及紧凑但外观可变的病变;在Att-Next-Topo/ATTNext上基于验证约束的评分函数选择产生了MS-DKC-AttNextTopo-VCSF-NoAug,Dice 0.8872、IoU 0.8214、精确率0.9173、边界F1 0.4878和ASSD 4.13,而合理的添加未能改善风险对齐的轮廓。ACDC提供了一个多类心脏案例,其中MS-DKC推荐四类softmax分割、类别平衡的Dice/CE监督和类别级表面评估。总体而言,结果支持数据集条件化设计:不同的数据集需要不同的先验、操作点和证据,然后才能判断模型是否合适。

英文摘要

Medical image segmentation is often framed as a search for stronger architectures, but this can obscure a more fundamental question: what does the dataset require from the model? In medical imaging, this requirement is shaped by foreground occupancy, morphology, boundary ambiguity, topology sensitivity, annotation quality, acquisition variation, and operating point. This paper introduces the Medical Segmentation Dataset Knowledge Card (MS-DKC), a framework for making these factors explicit. MS-DKC records dataset evidence through image/acquisition, morphology, supervision, context-dependence, and deployment-risk descriptors. These descriptors are mapped to failure modes, design priors, and risk-aligned criteria, making segmentation design more traceable than architecture-first comparison. We evaluate MS-DKC on DRIVE, ISIC2018, and ACDC, representing distinct regimes. DRIVE contains sparse, thin, branching vessels, favoring detail-preserving models, sensitivity-aware optimization, threshold analysis, and topology-aware metrics. DKC-TNet-v2 achieved Dice 0.8044 and IoU 0.6730 with 35103 parameters, while SA-UNetv2-DKC-AmbRef reached Dice 0.8141, IoU 0.6865, sensitivity 0.8265, specificity 0.9804, and AUC 0.9853. ISIC2018 involves compact but appearance-variable lesions; validation-constrained score-function selection on Att-Next-Topo/ATTNext produced MS-DKC-AttNextTopo-VCSF-NoAug with Dice 0.8872, IoU 0.8214, precision 0.9173, Boundary F1 0.4878, and ASSD 4.13, while plausible additions failed to improve the risk-aligned profile. ACDC provides a multi-class cardiac case, where MS-DKC recommends four-class softmax segmentation, class-balanced Dice/CE supervision, and class-wise surface evaluation. Overall, the results support dataset-conditioned design: different datasets require different priors, operating points, and evidence before a model can be judged appropriate.
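
The overlap and operating-point metrics quoted above (Dice, IoU, sensitivity, specificity, precision) can all be derived from the confusion counts of one binary mask pair. The sketch below is only an illustration of the standard definitions; the function name `segmentation_metrics` and the toy masks are assumptions, and it is not the evaluation code used for MS-DKC.

```python
import numpy as np

def segmentation_metrics(pred, target):
    """Standard binary segmentation metrics from two same-shaped 0/1 masks."""
    pred = np.asarray(pred, dtype=bool)
    target = np.asarray(target, dtype=bool)
    tp = int(np.sum(pred & target))     # foreground predicted as foreground
    fp = int(np.sum(pred & ~target))    # background predicted as foreground
    fn = int(np.sum(~pred & target))    # foreground missed
    tn = int(np.sum(~pred & ~target))   # background predicted as background

    def safe_div(a, b):
        # Return NaN instead of raising when a class is absent.
        return a / b if b else float("nan")

    return {
        "dice": safe_div(2 * tp, 2 * tp + fp + fn),
        "iou": safe_div(tp, tp + fp + fn),
        "sensitivity": safe_div(tp, tp + fn),
        "specificity": safe_div(tn, tn + fp),
        "precision": safe_div(tp, tp + fp),
    }

# Toy masks (hypothetical data): tp=2, fp=1, fn=1, tn=4.
pred = [[1, 1, 0, 0], [0, 1, 0, 0]]
target = [[1, 0, 0, 0], [0, 1, 1, 0]]
metrics = segmentation_metrics(pred, target)
print(metrics)
```

For a single mask pair, IoU = Dice / (2 - Dice). Scores averaged per image over a dataset need not satisfy this identity exactly, so the reported Dice and IoU values cannot be cross-checked this way with certainty.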

2606.06100 2026-06-05 cs.CV 版本更新

HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning

HyperVis:洛伦兹双曲面上的连续潜在视觉关系图用于组合推理

Moshiur Farazi, Sameera Ramasinghe, Mahbub Ahmed Turza, Shafin Rahman

发表机构 * Data Science and AI, University of Doha for Science and Technology, Qatar(数据科学与人工智能,多哈科学技术大学,卡塔尔) Pluralis Research, Australia(Pluralis研究,澳大利亚) Department of Electrical and Computer Engineering, North South University, Bangladesh(电气与计算机工程系,北南大学,孟加拉国)

AI总结 针对视觉语言模型在组合推理中理解物体间关系的困难,提出HyperVis方法,通过计算密集视觉关系张量并投影到洛伦兹双曲面,利用空间物理(IoA驱动的蕴含锥和外部角排斥)增强层次结构,在训练时作为正则化器提升生成式VQA性能,在推理时作为关系编码器提升判别式组合评分。

详情
AI中文摘要

视觉语言模型(VLM)在需要理解物体间关系的组合推理中表现不佳。一个自然的补救措施是从现成的场景图生成器(SGG)注入显式场景图三元组$\langle s, p, o \rangle$,但我们发现这会产生反效果:离散文本标签与连续视觉模态冲突,导致GQA准确率从60.38%降至58.86%。我们提出\textbf{HyperVis},完全绕过了SGG的语义瓶颈。从$N$个类别无关的区域提议出发,通过空间偏置交叉注意力计算密集的$O(N^2)$视觉关系张量,将其投影到洛伦兹双曲面上,并通过空间物理(即IoA驱动的蕴含锥和外部角排斥)强制执行层次结构。我们发现HyperVis以两种互补的方式发挥作用:(1)作为\textit{训练时正则化器},双曲关系损失塑造了LoRA表示,提高了生成式VQA性能(GQA 61.03%对比无关系损失的LoRA微调57.21%,恢复并超越基线);(2)作为\textit{推理时关系编码器},双曲前缀令牌提升了判别式组合评分(SugarCrepe 79.94%,比基线高6.25个百分点)。学习到的曲率稳定在$\kappa=4.0$,比先前的双曲VLM高一个数量级(先前$\kappa$通常趋近于零),表明连续视觉特征确实需要强曲率空间的指数体积。受控的欧几里得消融实验证实了这种分解:关系流水线在平坦空间中对LoRA的正则化效果相当(GQA 60.81%),但组合增益是双曲空间特有的(SugarCrepe比欧几里得高4.58个百分点),且欧几里得训练中的蕴含损失高出约6倍。代码将在后续公布。

英文摘要

Vision-Language Models (VLMs) struggle with compositional reasoning that requires understanding inter-object relationships. A natural remedy is to inject explicit scene graph triplets $\langle s, p, o \rangle$ from an off-the-shelf scene graph generator (SGG), but we show this backfires: discrete text labels collide with the continuous visual modality, degrading GQA accuracy from 60.38\% to 58.86\%. We propose \textbf{HyperVis}, which bypasses the SGG semantic bottleneck entirely. From $N$ class-agnostic region proposals, we compute a dense $O(N^2)$ visual relation tensor via spatially-biased cross-attention, project it onto a Lorentz hyperboloid, and enforce hierarchy through spatial physics, namely IoA-driven entailment cones and exterior-angle repulsion. We discover that HyperVis contributes in two complementary ways: (1) as a \emph{training-time regularizer}, the hyperbolic relational losses shape LoRA representations that improve generative VQA (GQA 61.03\% vs.\ 57.21\% for LoRA fine-tuning without relational losses, recovering and surpassing the baseline); and (2) as an \emph{inference-time relational encoder}, hyperbolic prefix tokens boost discriminative compositional scoring (SugarCrepe 79.94\%, $+$6.25pp over baseline). The learned curvature stabilises at $\kappa{=}4.0$, an order of magnitude above prior hyperbolic VLMs where $\kappa$ typically collapses toward zero, indicating that continuous visual features genuinely require the exponential volume of strongly curved space. A controlled Euclidean ablation confirms this decomposition: the relational pipeline regularises LoRA comparably in flat space (GQA 60.81\%), but the compositionality gain is specifically hyperbolic (SugarCrepe $+$4.58pp over Euclidean), with entailment loss ${\sim}6{\times}$ higher in Euclidean training. Codes are available at TBA.
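
As a rough illustration of what projecting features onto a Lorentz hyperboloid can look like, the sketch below uses the textbook exponential map at the origin for a hyperboloid of curvature $-\kappa$. This is an assumption about the general construction, not the HyperVis implementation; the function names and the toy vectors are hypothetical, and only the value $\kappa = 4.0$ comes from the abstract.

```python
import numpy as np

def exp_map_origin(v, kappa):
    """Map tangent vectors v, shape (..., n), onto the hyperboloid
    {x : -x_0^2 + |x_space|^2 = -1/kappa} (curvature -kappa)."""
    v = np.asarray(v, dtype=float)
    sqrt_k = np.sqrt(kappa)
    norm = np.linalg.norm(v, axis=-1, keepdims=True)
    safe = np.maximum(norm, 1e-12)  # avoid 0/0 for the zero vector
    time = np.cosh(sqrt_k * norm) / sqrt_k
    space = np.sinh(sqrt_k * norm) * v / (sqrt_k * safe)
    return np.concatenate([time, space], axis=-1)

def lorentz_inner(x, y):
    """Lorentzian inner product with signature (-, +, ..., +)."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def lorentz_distance(x, y, kappa):
    """Geodesic distance between two points on the hyperboloid."""
    arg = np.clip(-kappa * lorentz_inner(x, y), 1.0, None)  # guard rounding below 1
    return np.arccosh(arg) / np.sqrt(kappa)

kappa = 4.0                                  # curvature value quoted in the abstract
vectors = np.array([[0.3, -0.4], [0.0, 0.0]])  # toy features; second row maps to the origin
points = exp_map_origin(vectors, kappa)
print(lorentz_inner(points, points))          # each entry should be close to -1/kappa
print(lorentz_distance(points[1], points[0], kappa))
```

With this map the geodesic distance from the origin equals the Euclidean norm of the input vector, which makes the sketch easy to sanity-check.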

2606.06078 2026-06-05 cs.CV 版本更新

Knowledge Distillation for Visual Autoregressive Models

视觉自回归模型的知识蒸馏

Elia Peruzzo, Aritra Bhowmik, Guillaume Sautiere, Yuki M Asano, Amirhossein Habibian

发表机构 * Qualcomm AI Research(高通人工智能研究) University of Technology Nuremberg(纽伦堡技术大学)

AI总结 针对视觉自回归模型计算开销大的问题,提出VarKD蒸馏框架,通过选择性教师监督和减少令牌级歧义,在ImageNet上多个AR骨干网络中优于现有蒸馏方法。

详情
AI中文摘要

自回归图像生成模型具有高表达能力但计算密集,因此需要有效的模型压缩。知识蒸馏是模型压缩的自然方法,已在语言建模中得到广泛研究,但其在视觉自回归生成中的行为尚未充分探索。在这项工作中,我们首次系统研究了AR图像模型的蒸馏策略。我们的分析表明,虽然标准蒸馏可以带来有意义的收益,但最近为语言开发的方法不能直接迁移到图像:长解码视野和视觉令牌歧义使得教师监督不可靠,尤其是在学生条件下的上下文中。为了解决这个问题,我们提出了VarKD,一个针对视觉自回归模型的蒸馏框架,它在学生样本上进行蒸馏,同时选择性应用教师监督并减少令牌级歧义。在ImageNet上多个AR骨干网络上的实验表明,VarKD始终优于先前的蒸馏基线,缩小了与大规模模型的差距。

英文摘要

Autoregressive (AR) image generation models are highly expressive but computationally intensive, motivating effective model compression. Knowledge distillation (KD) is a natural approach for model compression and has been widely studied in language modeling, yet its behavior in visual AR generation remains underexplored. In this work, we present the first systematic study of distillation strategies for AR image models. Our analysis shows that while standard distillation can yield meaningful gains, recent methods developed for language do not directly transfer to images: long decoding horizons and visual token ambiguity make teacher supervision unreliable, especially under student-conditioned contexts. To address this, we propose VarKD, a distillation framework for visual autoregressive models that distills on student samples while selectively applying teacher supervision and reducing token-level ambiguity. Experiments on ImageNet across multiple AR backbones show that VarKD consistently outperforms prior distillation baselines, narrowing the gap to large-scale models.
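
For context, the "standard distillation" the abstract compares against is usually a token-level KL divergence between teacher and student next-token distributions. The sketch below shows that generic, temperature-scaled loss in NumPy; it is an assumption about the baseline family, not VarKD itself, and the function names and toy logits are hypothetical.

```python
import numpy as np

def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    z = logits - np.max(logits, axis=axis, keepdims=True)
    return z - np.log(np.sum(np.exp(z), axis=axis, keepdims=True))

def kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Mean forward KL(teacher || student) over token positions, scaled by T^2.

    Both inputs have shape (positions, vocab_size)."""
    s = log_softmax(np.asarray(student_logits, dtype=float) / temperature)
    t = log_softmax(np.asarray(teacher_logits, dtype=float) / temperature)
    kl_per_position = np.sum(np.exp(t) * (t - s), axis=-1)
    return float(np.mean(kl_per_position) * temperature ** 2)

# Toy logits for 2 token positions over a 4-entry visual codebook.
teacher = np.array([[2.0, 0.5, 0.1, -1.0], [0.0, 0.0, 3.0, 0.0]])
student = np.array([[1.0, 1.0, 0.0, 0.0], [0.5, 0.0, 1.0, 0.0]])
print(kd_loss(student, teacher, temperature=2.0))
```

The loss is zero when student and teacher logits agree and positive otherwise; how supervision is selectively applied on student-generated samples is specific to the paper and is not modelled here.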

2606.06074 2026-06-05 cs.CV 版本更新

VZCrash: A Large-Scale IMU Dataset of Ego-Vehicle Crashes

VZCrash:大规模自车碰撞IMU数据集

Tommaso Bianconcini, Henrique Piñeiro Monteagudo, Aurel Pjetri, Tomaso Trinci, Leonardo Taccari

发表机构 * Verizon Connect

AI总结 提出VZCrash,目前最大的真实车辆碰撞IMU数据集,包含超过31,000个验证碰撞和158,000个负样本,并基于该数据集对多种碰撞检测方法进行了基准测试和规模效应分析。

Comments Accepted at the 2026 IEEE International Conference on Intelligent Transportation Systems (ITSC 2026). VZCrash is publicly available at this URL: https://huggingface.co/datasets/vzc-research-chapter/VZCrash

详情
AI中文摘要

我们介绍了VZCrash,这是目前最大的公开真实车辆碰撞数据集,包含惯性测量单元(IMU)遥测数据。该数据集包含超过31,000个经过验证的碰撞事件和158,000个负样本,包括困难案例和干扰项。每个样本包含100 Hz的加速度和角速度,以及1 Hz的GPS速度。VZCrash中的事件由安装在美国各地行驶的73,010辆不同尺寸商用车辆上的设备捕获,时间跨度数年。我们还利用该数据集的规模进行了广泛的实验研究。首先,我们对从简单的基于阈值的启发式方法到最先进的深度学习模型等多种方法进行了基准测试。然后,我们进行了一项实验,证明了数据规模对于训练高质量碰撞检测模型的重要性,并表明当这些模型需要部署到真实环境中时,规模尤其重要。

英文摘要

We introduce VZCrash, the largest publicly available dataset of real-world vehicle collision data featuring Inertial Measurement Unit (IMU) telemetry. The dataset contains more than 31,000 validated crashes and 158,000 negative samples, including hard cases and distractors. Each sample includes acceleration and angular velocity at 100 Hz, and GPS speed at 1 Hz. Events in VZCrash were captured by devices installed on a fleet of 73,010 commercial vehicles of different sizes driving in the United States over the span of several years. We also present an extensive experimental study enabled by the volume of the dataset. We first benchmark several different approaches, from a simple threshold-based heuristic to state-of-the-art deep learning models. Then, we present an experiment demonstrating the importance of scaling data to train high-quality crash detection models, and we show that scale is especially important when these models need to be deployed into a real-world environment.
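
The simplest baseline family mentioned above, a threshold-based heuristic, could look roughly like the following sketch on a 100 Hz accelerometer trace. The threshold of 4 g, the minimum run length, the assumption that gravity is still present in the signal, and the function name `detect_crash` are all hypothetical; none of them are taken from VZCrash or its benchmark.

```python
import numpy as np

G = 9.80665  # standard gravity in m/s^2

def detect_crash(accel_xyz, threshold_g=4.0, min_samples=3):
    """Return the index of the first sample of the first run of at least
    `min_samples` consecutive samples whose acceleration magnitude exceeds
    `threshold_g`, or None if there is no such run.

    accel_xyz: array of shape (T, 3) in m/s^2, assumed to be sampled at 100 Hz."""
    magnitude_g = np.linalg.norm(np.asarray(accel_xyz, dtype=float), axis=1) / G
    run = 0
    for i, above in enumerate(magnitude_g > threshold_g):
        run = run + 1 if above else 0
        if run >= min_samples:
            return i - min_samples + 1
    return None

# Two seconds of a stationary signal (gravity only), then a synthetic 50 ms spike.
quiet = np.tile([0.0, 0.0, G], (200, 1))
crash = quiet.copy()
crash[120:125, 0] = 6.0 * G
print(detect_crash(quiet), detect_crash(crash))
```

Requiring several consecutive samples above the threshold is one simple way to ignore single-sample glitches; hard negatives such as potholes or harsh braking are exactly where a heuristic like this is expected to struggle.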

2606.06066 2026-06-05 cs.CV cs.GR 版本更新

FontFusion: Enhancing Generative Text in Diffusion Models with Typographic Conditioning

FontFusion: 通过排版条件增强扩散模型中的生成文本

Marian Lupascu, Nipun Jindal, Ionut Mironica, Zhaowen Wang

发表机构 * Adobe Research(Adobe研究院) Department of Computer Science, University of Bucharest(布加勒斯特大学计算机科学系)

AI总结 提出FontFusion框架,通过层次化token表示、位置感知嵌入和多级token丢弃策略,在扩散Transformer中实现精确字体控制与文本可读性的平衡,显著提升排版保真度。

Comments 12 pages, 8 figures, accepted at ICANN 2026

详情
AI中文摘要

扩散模型中的排版生成面临持续的权衡:精确的字体控制通常会降低文本可读性,而保持可读性往往牺牲排版保真度。我们提出FontFusion,一种用于扩散Transformer(DiT)架构的即插即用条件框架,通过三个核心创新解决了这一困境:(1)层次化token表示,在多个粒度上建立明确的文本-字体关系;(2)位置感知嵌入,在排版和图像内容之间创建空间绑定;(3)多级token丢弃策略,提高计算效率和对未见字体的泛化能力。我们对字体嵌入空间的系统评估表明,结合DeepFont和DINOv2的双编码器在排版任务上优于任何单一编码器。FontFusion在挑战性装饰字体上相比单编码器基线实现了76%的相对改进,相比无条件模型字体一致性增益超过约68-76%,同时无需重新训练即可集成到现有DiT架构中。

英文摘要

Typography generation in diffusion models faces a persistent trade-off: enabling precise font control typically degrades text legibility, while maintaining readability often sacrifices typographic fidelity. We present FontFusion, a plug-and-play conditioning framework for Diffusion Transformer (DiT) architectures that resolves this dilemma through three core innovations: (1) a hierarchical token representation establishing explicit text-font relationships at multiple granularities, (2) position-aware embeddings creating spatial bindings between typography and image content, and (3) a multi-level token dropping strategy improving both computational efficiency and generalization to unseen fonts. Our systematic evaluation of font embedding spaces reveals that a dual encoder combining DeepFont and DINOv2 outperforms any single encoder for typography tasks. FontFusion demonstrates 76% relative improvement on challenging decorative fonts over single-encoder baselines and font consistency gains exceeding approximately 68-76% over unconditioned models, while integrating into existing DiT architectures without retraining.

2606.06060 2026-06-05 cs.CV 版本更新

ReCache: Learning Budget-Aware Caching Schedules for Diffusion Models via REINFORCE

ReCache: 通过REINFORCE学习扩散模型的预算感知缓存调度

Mishan Aliev, Eva Neudachina, Ilya Bykov, Aleksandr Oganov, Kirill Struminsky, Aibek Alanov, Denis Rakitin

发表机构 * HSE University(俄罗斯高等经济学院) Yandex Research(Yandex研究院)

AI总结 提出ReCache,利用策略梯度学习在给定计算预算下最大化生成质量的去噪步骤重计算调度,无需标注数据且兼容多种缓存机制。

详情
AI中文摘要

现代扩散模型生成高质量图像和视频,但其迭代去噪过程导致推理成本高昂。特征缓存通过重用或预测相邻去噪步骤的中间激活来加速采样,利用沿反向轨迹的计算冗余。本文关注缓存调度:选择哪些去噪步骤应完全重计算。现有调度要么是固定的(如均匀),要么根据每步误差启发式自适应选择;这两种情况下,实际计算成本是手动调整阈值的副作用,而非用户可指定的量。我们提出ReCache,它反转了这一过程:给定目标预算k,学习最大化生成质量的重计算调度,将计算变为可直接控制的输入。ReCache通过策略梯度训练,避开了通过完整扩散推理的反向传播,且不使用任何标注数据。来自无缓存推理的生成作为匹配目标,并配以生成质量的奖励。ReCache兼容任何缓存机制,包括特征重用和特征预测;对于每种机制,单个训练好的策略在推理时适应不同计算预算。ReCache持续优于调度基线:在FLUX上减少$\times5.04$ FLOPs时,与DiCache相比,LPIPS降低31%(从0.456降至0.316);在Wan 2.1上实现$\sim\times2.6$加速时,与均匀HiCache相比,LPIPS降低65%(从0.480降至0.169),VBench分数提升7%(5.6分,从70.4升至76.0)。代码见https://github.com/thecrazymage/ReCache。

英文摘要

Modern diffusion models generate high-quality images and videos, but their iterative denoising process makes inference expensive. Feature caching accelerates sampling by reusing or predicting intermediate activations across neighboring denoising steps, exploiting the redundancy of computations along the reverse trajectory. In this work, we focus on the caching schedule: selecting which denoising steps should be fully recomputed. Existing schedules are either fixed (e.g. uniform) or chosen adaptively from per-step error heuristics; in both cases, the actual compute cost is a side-effect of hand-tuned thresholds rather than a quantity the user can specify. We propose ReCache, which inverts this: given a target budget k, it learns the recomputation schedule that maximizes generation quality, turning compute into a directly controllable input. ReCache trains via policy gradients, sidestepping backpropagation through full diffusion inference, and uses no labelled data. Generations from uncached inference serve as matching targets, paired with a reward for generation quality. ReCache is compatible with any caching mechanism, including feature reuse and feature forecasting; for each mechanism, a single trained policy adapts across computational budgets at inference time. ReCache consistently outperforms scheduling baselines: under a $\times5.04$ FLOPs reduction on FLUX, it reduces LPIPS by 31% (from 0.456 to 0.316) compared to DiCache; on Wan 2.1 at a $\sim \times2.6$ speedup, it drops LPIPS by 65% (from 0.480 to 0.169) and boosts the VBench score by 7% (5.6 points, from 70.4 to 76.0) over uniform HiCache. Code is available at https://github.com/thecrazymage/ReCache.
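
To make the idea of learning a budget-constrained schedule with policy gradients more concrete, here is a toy REINFORCE sketch. It samples exactly k recomputation steps without replacement from learnable per-step logits and updates the logits from a scalar reward. The sampling scheme, the synthetic reward `toy_reward`, the `importance` vector, and all hyperparameters are assumptions made for illustration; the actual ReCache policy and reward are not described in enough detail in the abstract to reproduce here.

```python
import numpy as np

def sample_schedule(logits, k, rng):
    """Sample k distinct steps (sequential softmax without replacement).
    Returns the sorted steps and the gradient of their log-probability
    with respect to the logits."""
    remaining = list(range(len(logits)))
    chosen, grad = [], np.zeros_like(logits)
    for _ in range(k):
        z = logits[remaining] - np.max(logits[remaining])
        p = np.exp(z) / np.sum(np.exp(z))
        j = int(rng.choice(len(remaining), p=p))
        grad[remaining] -= p          # d log p / d logits = one-hot - p
        grad[remaining[j]] += 1.0
        chosen.append(remaining.pop(j))
    return sorted(chosen), grad

def toy_reward(schedule, importance):
    # Hypothetical stand-in for a generation-quality reward.
    return float(np.sum(importance[schedule]))

def train(importance, k, iters=2000, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    logits = np.zeros(len(importance))
    baseline = 0.0
    for _ in range(iters):
        schedule, grad = sample_schedule(logits, k, rng)
        reward = toy_reward(schedule, importance)
        logits += lr * (reward - baseline) * grad  # REINFORCE ascent step
        baseline += 0.1 * (reward - baseline)      # running-mean baseline
    return logits

# 8 denoising steps, budget k=3; steps 0, 4 and 7 matter in this toy setup.
importance = np.array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0])
logits = train(importance, k=3)
print(sorted(np.argsort(-logits)[:3].tolist()))
```

Because the budget k is enforced by the sampler itself, compute is an input rather than a by-product of a threshold, which mirrors the framing in the abstract. No gradient flows through the reward, so the reward could in principle come from a full, non-differentiable sampling run.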

2606.06039 2026-06-05 cs.CV 版本更新

Texture-preserving implicit neural representation for Cone beam CT truncated reconstruction

保留纹理的隐式神经表示用于锥束CT截断重建

Genyuan Zhang, Junyao Wang, Haoran Lan, Chuandong Tan, Songtao Zhu, Fenglin Liu

发表机构 * National Key Research and Development Program of China(中华人民共和国国家重点研发计划) National Natural Science Foundation of China(中华人民共和国国家自然科学基金) Fundamental Research Funds for the Central Universities(中央高校基本科研业务费)

AI总结 提出一种自监督的3D重建框架,基于神经场景表示,结合物理迭代细化模块,解决锥束CT截断重建中的伪影和纹理丢失问题。

详情
AI中文摘要

锥束计算机断层扫描(CBCT)经常受到数据截断的影响,这引入了严重的伪影并限制了有效视场(FOV)。现有的用于截断锥束CT重建的深度学习方法存在严重局限性,包括严格依赖有监督的真实数据和未能考虑连续3D空间截断变化。为了解决这些挑战,我们引入了一个基于神经场景表示的自监督3D重建框架。通过在投影监督下将空间坐标直接映射到辐射密度,我们的方法固有地绕过了传统的滤波和反投影操作,从而从根本上消除了截断引起的环状伪影,同时实现了鲁棒的连续3D数据外推。然而,坐标网络容易受到固有的频谱偏差影响,这导致临床关键的高频纹理严重丢失。为了解决这一瓶颈,我们进一步将基于物理的迭代细化模块集成到神经场景表示架构中。利用来自坐标网络的无伪影外推体积作为最优初始化,该模块逐步从原始投影中重新提取高频结构信息并将其注入体积中。在模拟和真实数据集上的大量实验表明,我们的方法成功地将神经网络的优异伪影抑制和外推能力与迭代算法的高保真细节保留统一起来。

英文摘要

Cone-beam computed tomography (CBCT) frequently suffers from data truncation, which introduces severe artifacts and limits the effective field of view (FOV). Existing deep learning methods for truncated CBCT reconstruction suffer from serious limitations, including a strict reliance on supervised ground truth and a failure to account for continuous 3D spatial truncation variations. To address these challenges, we introduce a self-supervised 3D reconstruction framework based on neural scene representations. By directly mapping spatial coordinates to radiodensity under projection supervision, our approach inherently bypasses traditional filtering and backprojection operations, thereby fundamentally eliminating truncation-induced ring artifacts while enabling robust continuous 3D data extrapolation. However, coordinate networks are susceptible to an inherent spectral bias, which leads to a severe loss of clinically vital high-frequency textures. To resolve this bottleneck, we further incorporate a physics-based iterative refinement module into the neural scene representation architecture. Leveraging the artifact-free, extrapolated volume from the coordinate network as an optimal initialization, this module progressively re-extracts and injects high-frequency structural information from the original projections back into the volume. Extensive experiments on both simulated and real-world datasets demonstrate that our method successfully unifies the exceptional artifact suppression and extrapolation capabilities of neural networks with the high-fidelity detail preservation of iterative algorithms.

2606.06020 2026-06-05 cs.CV 版本更新

ReSAGE-PAR: Representational Similarity Assessment for Generative Expansion in Pedestrian Attribute Recognition

ReSAGE-PAR:行人属性识别中生成式扩展的表征相似性评估

Pablo Ayuso-Albizu, Pablo Carballeira, Juan C. SanMiguel, Paula Moral

发表机构 * Universidad Autónoma de Madrid(马德里自治大学)

AI总结 针对行人属性识别数据稀缺问题,提出ReSAGE-PAR管道,通过扩散模型生成图像并利用贝叶斯分类器验证属性,实现可扩展的高保真数据集扩展,在标准骨干网络上提升高达8.7%。

Comments Under review at IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

详情
AI中文摘要

为了解决行人属性识别(PAR)中有限的数据多样性和数据稀缺问题,我们探索了使用基于属性提示的扩散模型进行图像合成。虽然这能够实现行人图像的可控生成,但它面临两个关键挑战:(i)高质量预训练数据与低分辨率、非标准监控裁剪之间的领域差距,以及(ii)需要可靠的属性验证以防止生成幻觉。在本文中,我们引入了一个稳健的生成-评分-自动标注管道,称为ReSAGE-PAR(PAR中生成式扩展的表征相似性评估),它弥合了这一领域差距,并实现了可扩展、高保真的数据集扩展。首先,我们使用定制的基于LoRA的图像到图像方法,将预训练的扩散模型适应到原生PAR分辨率。其次,我们提取生成图像与其条件提示之间的视觉-语言对齐分数,利用包括标签一致和不一致补充的综合提示策略。最后,我们制定了一个贝叶斯分类器,将这些连续分数转换为可靠的二值伪标签。大量评估证明了ReSAGE-PAR在保留空间先验和验证属性方面的有效性。当集成到PAR训练中时,ReSAGE-PAR一致地带来了显著的改进——在标准骨干网络上实现了高达8.7%的提升,并将最先进的框架推向了新的性能水平。这证明了其作为可扩展PAR增强的架构无关解决方案的价值。ReSAGE-PAR的完整代码库可在http://www-vpu.eps.uam.es/publications/ReSAGE-PAR公开获取。

英文摘要

To address the limited diversity and data scarcity in Pedestrian Attribute Recognition (PAR), we explore image synthesis using diffusion models guided by attribute-based prompts. While this enables the controlled generation of pedestrian images, it faces two critical challenges: (i) the domain gap between high-quality pre-training data and low-resolution, non-standard surveillance crops, and (ii) the need for reliable attribute verification to prevent generative hallucinations. In this paper, we introduce a robust generate-score-autolabel pipeline called ReSAGE-PAR (REpresentational Similarity Assessment for Generative Expansion in PAR) that bridges this domain gap and enables scalable, high-fidelity dataset expansion. First, we adapt pre-trained diffusion models to native PAR resolutions using a tailored LoRA-based Image-to-Image approach. Second, we extract vision-language alignment scores between the generated images and their conditioning prompts, utilizing a comprehensive prompting strategy that includes label-consistent and inconsistent complements. Finally, we formulate a Bayesian classifier that converts these continuous scores into reliable binary pseudo-labels. Extensive evaluations demonstrate the effectiveness of ReSAGE-PAR in preserving spatial priors and verifying attributes. When integrated into PAR training, ReSAGE-PAR consistently yields significant improvements, achieving gains of up to 8.7% on standard backbones and pushing state-of-the-art frameworks to new performance levels. This proves its value as an architecture-agnostic solution for scalable PAR enhancement. The complete codebase for ReSAGE-PAR is publicly available at http://www-vpu.eps.uam.es/publications/ReSAGE-PAR.

2606.05999 2026-06-05 cs.CV cs.AI 版本更新

ATT-CR: Adaptive Triangular Transformer for Cloud Removal

ATT-CR: 自适应三角变换器用于云去除

Yang Wu, Ye Deng, Pengna Li, Wenli Huang, Kangyi Wu, Xiaomeng Xin, Jinjun Wang

发表机构 * Xi’an Jiaotong University(西安交通大学) School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics(计算机与人工智能学院,西南财经大学) Ningbo University of Technology(宁波工程学院)

AI总结 提出自适应三角变换器(ATT-CR),通过三角注意力和特征选择门控模块降低计算复杂度并减少云像素干扰,实现高效云去除。

详情
AI中文摘要

云去除旨在准确重建遥感图像中被云遮挡的地面物体。现有的基于Transformer的方法利用自注意力有效建模云图像中的长距离依赖,取得了显著效果。然而,它们存在以下问题:1)自注意力的高计算复杂度限制了可扩展性;2)在注意力计算中将云像素和干净像素均视为有效,会在后续层中引入干扰,导致性能次优。为解决这些挑战,我们提出了自适应三角变换器用于云去除(ATT-CR),该模型有效降低了计算成本并减轻了云像素的干扰。具体而言,它包含两个核心组件:三角注意力(TAN)和特征选择门控模块(FSGM)。TAN使用下三角和上三角矩阵近似Softmax注意力,计算复杂度为O(N),显著降低了计算成本。而FSGM与TAN集成,自适应地区分云特征和干净特征,从而最小化无效信息引入后续层。在云去除基准上的大量实验表明,ATT-CR相比现有方法具有更优的性能。

英文摘要

Cloud removal aims to accurately reconstruct the ground objects obscured by clouds in remote sensing images. Existing Transformer-based methods utilizing self-attention have shown impressive results by effectively modeling long-range dependencies in cloudy images. However, they suffer from the following issues: 1) the high computational complexity of self-attention limits scalability; 2) treating both cloudy and clean pixels as valid within the attention computation brings disturbances in subsequent layers, leading to suboptimal performance. To address these challenges, we propose the Adaptive Triangular Transformer for Cloud Removal (ATT-CR), a model that effectively reduces computational costs and mitigates interference from cloudy pixels. Specifically, it consists of two core components: Triangular Attention (TAN) and Feature Selected Gating Module (FSGM). TAN employs lower and upper triangular matrices to approximate Softmax attention with O(N) computational complexity, significantly reducing the computational costs. The FSGM, on the other hand, integrates with TAN to adaptively distinguish between cloudy and clean features, which minimizes the introduction of invalid information into subsequent layers. Extensive experiments on cloud removal benchmarks demonstrate that ATT-CR delivers superior performance compared to existing methods.

2606.05998 2026-06-05 cs.CV cs.AI 版本更新

Deep Learning-based 3D Oral Cavity Reconstruction Using 2D Intraoral Images

基于深度学习的二维口内图像三维口腔重建

Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm

发表机构 * KAIST(韩国科学技术院)

AI总结 提出一种仅用十张二维口内图像进行三维口腔重建的软件方法,采用MobileNetV2与多头注意力机制,降低成本和不适,实现自动化重建。

Comments 4 pages, 5 figures. English version of a paper presented at the Korea Multimedia Society Conference, November 2025

详情
AI中文摘要

口腔三维建模是牙科中最关键的阶段之一,常用的方法如印模和口内扫描各有显著局限。印模法将藻酸盐或硅胶材料放入托盘并插入患者口腔形成阴模,存在患者不适、材料变形误差及存储运输困难等问题。口内扫描仪利用结构光或激光技术实时直接扫描口腔结构,效果先进但设备成本极高。为解决这些问题,本文提出一种基于软件的方法,仅使用从不同角度拍摄的十张二维口内图像重建三维口腔模型,无需专用硬件设备。该方法降低成本,消除物理扫描设备需求,减少患者不适,并实现自动化三维重建。模型在公开的Dental3DS数据集(包含950个上颌样本)上训练,采用MobileNetV2作为图像编码器,结合多头注意力进行多视图特征融合。所提模型在最近邻匹配(距离阈值0.035)下达到77.49%的准确率。然而,预测顶点倾向于集中在真实值的高密度区域,导致重建模型上的点分布不均匀。

英文摘要

Oral 3D modelling is one of the most essential stages in dentistry, and many different approaches, such as impression taking and intraoral scanning, are commonly used for this phase, each with notable limitations. Impression taking, which involves placing alginate or silicone material in a tray and inserting it into the patient's oral cavity to form a negative mold, suffers from significant patient discomfort, material deformation errors, and difficulties in storage and transportation. Intraoral scanners, which directly scan oral structures in real time using structured light or laser technology, produce state-of-the-art results but are associated with substantially high equipment costs. To address these limitations, this paper proposes a software-based approach that reconstructs a 3D oral model using only ten 2D intraoral images captured from different angles, requiring no dedicated hardware devices. The proposed method reduces cost, eliminates the need for physical scanning equipment, minimises patient discomfort, and enables automated 3D reconstruction. The model is trained on the publicly available Dental3DS dataset, comprising 950 upper jaw samples, and employs MobileNetV2 as the image encoder combined with Multi-head Attention for multi-view feature fusion. The proposed model achieves an accuracy of 77.49%, measured by nearest-neighbor matching with a distance threshold of 0.035. However, predicted vertices tend to concentrate in high-density regions of the ground truth, resulting in uneven point distribution across the reconstructed model.
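
One plausible reading of "accuracy measured by nearest-neighbor matching with a distance threshold of 0.035" is the fraction of predicted vertices whose nearest ground-truth vertex lies within that distance. The sketch below implements that reading; the matching direction (prediction to ground truth), the coordinate normalisation behind the 0.035 value, and the function name `nn_accuracy` are assumptions rather than details from the paper.

```python
import numpy as np

def nn_accuracy(pred_vertices, gt_vertices, threshold=0.035):
    """Fraction of predicted vertices within `threshold` of their nearest
    ground-truth vertex (brute-force pairwise distances)."""
    pred = np.asarray(pred_vertices, dtype=float)   # shape (P, 3)
    gt = np.asarray(gt_vertices, dtype=float)       # shape (G, 3)
    pairwise = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    nearest = pairwise.min(axis=1)                  # shape (P,)
    return float(np.mean(nearest <= threshold))

# Toy point sets (hypothetical): three predictions land close to a GT vertex, one does not.
gt = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
pred = [[0.01, 0.0, 0.0], [1.0, 0.02, 0.0], [0.5, 0.5, 0.5], [0.0, 1.03, 0.0]]
print(nn_accuracy(pred, gt))
```

A one-directional metric of this kind does not penalise predictions that cluster around a few ground-truth vertices, which is consistent with the uneven point distribution the abstract reports; a symmetric measure such as Chamfer distance would expose that effect.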

2606.05997 2026-06-05 cs.CV 版本更新

Multimodal Sexism Identification and Characterization using Large Language Models and Gradient Boosting

使用大语言模型和梯度提升的多模态性别歧视识别与表征

Kyriakos Chaviaras, Maria Lymperaiou, Athanasios Voulodimos

发表机构 * Artificial Intelligence and Learning Systems Laboratory(人工智能与学习系统实验室) School of Electrical and Computer Engineering(电气与计算机工程学院) National Technical University of Athens(雅典国立技术大学)

AI总结 提出基于特征工程和梯度提升回归模型的后融合管道,结合视觉、文本、人口统计、生物特征及LLM语义指标,用于识别和表征模因和短视频中的多模态性别歧视。

详情
AI中文摘要

我们介绍了AILS-NTUA提交给CLEF EXIST 2026实验室的工作,解决模因(任务2)和短视频(任务3)中的多模态性别歧视识别与表征问题。我们的系统采用基于特征工程的后融合管道,围绕梯度提升回归模型和层次化后处理构建。对于模因,我们结合了视觉、文本、人口统计、生物特征和LLM衍生的语义指标,旨在捕捉刻板印象、物化、讽刺和厌女等高层次线索。对于视频,我们研究了特征选择、基于帧的视觉表示、基于OCR的文本特征、声学描述符和传感器衍生元数据的影响。开发结果表明,聚焦的LLM衍生语义线索改善了模因性别歧视识别,而视频性能对特征维度和跨模态噪声高度敏感。对于视频,开发结果倾向于紧凑的特征选择,但官方测试结果表明这一结论不能完全推广到未见数据,其中未过滤的表征泛化更好。总体而言,我们的发现强调了针对静态模因进行目标语义特征工程的有用性,以及在嘈杂的短视频环境中需要更鲁棒的时间建模。

英文摘要

We present the AILS-NTUA submission to the EXIST 2026 Lab at CLEF, addressing multimodal sexism identification and characterization in memes (Task 2) and short-form videos (Task 3). Our system follows a feature-engineered late-fusion pipeline built around gradient-boosted regression models and hierarchical post-processing. For memes, we combine visual, textual, demographic, biometric, and LLM-derived semantic indicators designed to capture high-level cues such as stereotyping, objectification, irony, and misogyny. For videos, we investigate the effect of feature selection, frame-based visual representations, OCR-based textual features, acoustic descriptors, and sensor-derived metadata. Development results show that focused LLM-derived semantic cues improve meme sexism identification, while video performance is highly sensitive to feature dimensionality and cross-modal noise. For videos, development results favor compact feature selection, but official test results show that this conclusion does not fully transfer to unseen data, where the unfiltered representation generalizes better. Overall, our findings highlight the usefulness of targeted semantic feature engineering for static memes and the need for more robust temporal modeling in noisy short-form video settings.

2606.05981 2026-06-05 cs.CV cs.LG 版本更新

Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder

基于视觉感知的多模态大语言模型条件编辑扩散的视频率流式风格化:蒸馏UNet + MLLM文本编码器上的非对称批处理推理

Yoshiyuki Ootani

发表机构 * Independent researcher(独立研究员)

AI总结 针对蒸馏扩散模型中文本编码器成为瓶颈的问题,提出一种结合非对称CUDA流水线、编译友好的ControlNet-LLLite重构和周期性条件刷新调度的流式管线,在消费级GPU上实现视频率实时风格化编辑。

Comments 12 pages, 4 figures, 12 tables. Under review at IEEE Transactions on Circuits and Systems for Video Technology. Code, evaluation harness, and the released v3 Temporal LLLite adapter weights are at https://github.com/otanl/dreamlite-stream (also mirrored to Hugging Face and Zenodo)

详情
AI中文摘要

扩散U-Net的激进蒸馏反转了实时文本到图像流水线的逐帧瓶颈:一旦去噪器成为4步或1步蒸馏的学生模型,文本编码器就成为关键路径。这种反转在视觉感知编辑扩散中最为严重,其中编码器是多模态大语言模型(MLLM)。我们研究了一个0.39B蒸馏编辑U-Net与2.13B MLLM文本编码器(Qwen3-VL)配对的情况,并提出了一种针对该场景的流式管线,该管线围绕三种工程机制构建:非对称侧流/主流CUDA流水线,带有批处理文本编码器摊销(以及可选的静态提示缓存);一种编译友好的ControlNet-LLLite重构,将整个U-Net +适配器堆栈折叠成单个融合图;以及一个带有钩子子集的周期性条件刷新调度,用于摊销每帧条件成本。在单个消费级RTX 3090 Ti上,512x512分辨率下,管线在批大小B=8时维持27.4 fps,B=16时维持29.6 fps,端到端p50延迟分别约为0.5和1.0秒;相同操作点在RTX 4090上测得54.9 fps,在RTX 5090上测得74.1 fps。我们报告的是视频率流式吞吐量而非交互式低延迟,并将我们的数据与相同堆栈的StreamDiffusion重运行进行对比,作为系统上下文,而非基准优越性声明。对于训练的油画风格,发布的时序适配器在剪辑内噪声中泛化到19个未使用的DAVIS-2017序列和来自七个来源的15个非DAVIS剪辑;对未见风格族的提示级泛化有限,并单独报告。

英文摘要

Aggressive distillation of the diffusion U-Net inverts the per-frame bottleneck of real-time text-to-image pipelines: once the denoiser is a 4-step or 1-step distilled student, the text encoder becomes the critical path. This inversion is most acute in vision-aware edit diffusion, where the encoder is a multimodal large language model (MLLM). We study the case of a 0.39B distilled edit U-Net paired with a 2.13B MLLM text encoder (Qwen3-VL) and present a streaming pipeline targeted at this regime built around three engineering mechanisms: asymmetric side-stream / main-stream CUDA pipelining with batched text-encoder amortisation (and optional static-prompt caching), a compile-friendly ControlNet-LLLite reformulation that folds the entire U-Net + adapter stack into a single fused graph, and a periodic conditioning-refresh schedule with a hook subset that amortises the per-frame conditioning cost. On a single consumer RTX 3090 Ti at 512x512 the pipeline sustains 27.4 fps over a 480-frame run at batch size B=8 and 29.6 fps at B=16, with end-to-end p50 latency of approximately 0.5 and 1.0 seconds respectively; the same operating point measures 54.9 fps on RTX 4090 and 74.1 fps on RTX 5090. We report video-rate streaming throughput rather than interactive low latency, and locate our numbers against same-stack StreamDiffusion re-runs as systems context, not as a benchmark superiority claim. For the trained oil-painting style, the released temporal adapter generalises within in-clip noise to 19 unused DAVIS-2017 sequences and 15 non-DAVIS clips from seven sources; prompt-level generalisation to unseen style families is bounded and reported separately.

2606.05975 2026-06-05 cs.CV cs.RO 版本更新

T-FunS3D: Task-Driven Hierarchical Open-Vocabulary 3D Functionality Segmentation

T-FunS3D:任务驱动的分层开放词汇3D功能分割

Jingkun Feng, Reza Sabzevari

发表机构 * P4MARS Lab at the Faculty of Aerospace Engineering, Delft University of Technology(代尔夫特理工大学航空航天工程学院P4MARS实验室)

AI总结 提出T-FunS3D方法,通过构建开放词汇场景图并利用视觉语言模型,实现任务驱动的分层3D功能分割,在保持性能的同时提升速度和降低内存消耗。

详情
AI中文摘要

开放词汇3D功能分割使机器人能够在3D场景中定位功能性物体组件。这是一项需要空间理解和任务解释的挑战性任务。当前的开放词汇3D分割方法主要关注物体级识别,而场景级部分分割方法试图详尽地分割整个场景,导致资源密集且耗时。在粒度、准确性和速度之间平衡分割性能仍然是一个挑战。作为缓解这一问题的一步,我们引入了T-FunS3D,一种任务驱动的分层开放词汇3D功能分割方法,为机器人应用提供可操作的感知。我们的方法以室内场景的3D点云和带姿态的RGB-D图像作为输入。通过提取环境中的实例及其视觉嵌入,我们构建了一个开放词汇场景图。给定任务描述,T-FunS3D识别场景图中最相关的实例,并利用视觉语言模型定位其功能组件。在SceneFun3D数据集上的实验表明,T-FunS3D在开放词汇3D功能分割方面与最先进方法相当,同时实现了更快的运行时间和更少的内存使用。

英文摘要

Open-vocabulary 3D functionality segmentation enables robots to localize functional object components in 3D scenes. It is a challenging task that requires spatial understanding and task interpretation. Current open-vocabulary 3D segmentation methods primarily focus on object-level recognition, while scene-wide part segmentation methods attempt to segment the entire scene exhaustively, making them highly resource-intensive and time-consuming. Balancing segmentation performance in terms of granularity, accuracy, and speed remains a challenge. As one step towards alleviating this, we introduce T-FunS3D, a task-driven hierarchical open-vocabulary 3D functionality segmentation method that provides actionable perception for robotic applications. Our method takes as input the 3D point cloud and posed RGB-D images of an indoor scene. We construct an open-vocabulary scene graph by extracting instances and their visual embeddings in the environment. Given a task description, T-FunS3D identifies the most relevant instances in the scene graph and locates their functional components leveraging a vision-language model. Experiments on the SceneFun3D dataset demonstrate that T-FunS3D is comparable to the state of the art in open-vocabulary 3D functionality segmentation, while achieving faster runtime and reduced memory usage.

2606.05931 2026-06-05 cs.CL cs.AI cs.CV cs.IR cs.LG cs.MM eess.AS 版本更新

To Be Multimodal or Not to Be: Query-Adaptive Audio-Visual Person Retrieval via Active Modality Detection

多模态还是非多模态:通过主动模态检测的查询自适应音视频人物检索

Erfan Loweimi, Mengjie Qian, Kate Knill, Guanfeng Wu, Chi-Ho Chan, Abbas Haider, Muhammad Awan, Josef Kittler, Hui Wang, Mark Gales

发表机构 * University of Cambridge(剑桥大学) Queen's University Belfast(贝尔法斯特女王大学) University of Surrey(萨里大学) Cisco(思科) Southwest Jiaotong University(西南交通大学) Teesside University(泰赛德大学)

AI总结 提出一种查询自适应框架,通过跨模态分数一致性检测主动模态,在BBC Rewind语料库上达到94.2%的P@1,优于单模态和固定融合方法。

Comments INTERSPEECH 2026

详情
AI中文摘要

当通过语音和面部从视频档案中检索一个人时,系统应该是多模态的吗?在实际的广播档案中,与精心策划的基准不同,目标可能只被听到但未被看到、只被看到但未被听到,或者两者兼有。融合来自缺失模态的分数会引入噪声,使精度低于最佳单模态系统。我们提出了一种查询自适应框架,通过跨模态分数一致性检测主动模态:当两种模态都活跃时,由一种模态检索的文件在另一种模态上也得分高;当一种模态缺失时,这种一致性被破坏。由这些跨模态特征驱动的分类器实现了89%的检测准确率。在BBC Rewind语料库(包含超过12,000个广播视频)上,自适应系统达到了94.2%的P@1,优于仅语音(82.9%)、仅面部(93.4%)和固定融合(90.0%),恢复了与具有真实模态标签的Oracle(96.6%)之间差距的64%。

英文摘要

When retrieving a person from a video archive by voice and face, should the system be multimodal or not? In real-world broadcast archives, unlike curated benchmarks, a target may be heard but unseen, seen but unheard, or both. Fusing scores from an absent modality injects noise, degrading precision below the best unimodal system. We propose a query-adaptive framework that detects active modalities via cross-modal score consistency: when both modalities are active, files retrieved by one also score highly on the other; this agreement breaks down when a modality is absent. Classifiers driven by these cross-modal features achieve 89% detection accuracy. On the BBC Rewind corpus (with over 12,000 broadcast videos) the adaptive system attains 94.2% P@1, outperforming speaker-only (82.9%), face-only (93.4%), and fixed fusion (90.0%), recovering 64% of the gap to an oracle with ground-truth modality labels (96.6%).
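The cross-modal consistency idea above can be illustrated with a small sketch: take the top files retrieved by one modality and check how well they score on the other. The paper trains classifiers on such features; the fixed threshold, the fallback to the modality with the higher top score, and the names `mean_cross_score`, `adaptive_ranking`, and `gap_recovered` are all simplifying assumptions made for illustration.

```python
def mean_cross_score(retrieving, other, k):
    """Mean score under `other` of the top-k files ranked by `retrieving`."""
    top_ids = sorted(retrieving, key=retrieving.get, reverse=True)[:k]
    return sum(other[i] for i in top_ids) / len(top_ids)


def adaptive_ranking(voice, face, k=2, threshold=0.5):
    """Choose fusion or a single modality from cross-modal agreement.

    voice and face map file ids to similarity scores in [0, 1].
    The threshold rule stands in for a trained classifier (assumption).
    """
    agreement = min(mean_cross_score(voice, face, k),
                    mean_cross_score(face, voice, k))
    if agreement >= threshold:
        # Both modalities look active: average the scores.
        scores = {f: 0.5 * (voice[f] + face[f]) for f in voice}
        mode = "fusion"
    elif max(voice.values()) >= max(face.values()):
        scores, mode = dict(voice), "voice-only"
    else:
        scores, mode = dict(face), "face-only"
    ranking = sorted(scores, key=scores.get, reverse=True)
    return mode, ranking


def gap_recovered(system, baseline, oracle):
    """Fraction of the baseline-to-oracle gap closed by the system."""
    return (system - baseline) / (oracle - baseline)


# Target heard but unseen: face scores are uninformative noise.
voice = {"a": 0.9, "b": 0.8, "c": 0.1, "d": 0.2}
face = {"a": 0.1, "b": 0.2, "c": 0.3, "d": 0.25}
print(adaptive_ranking(voice, face))
print(round(gap_recovered(94.2, 90.0, 96.6), 2))
```

The abstract does not say which system the 64% gap recovery is measured from. Taking fixed fusion (90.0%) as the baseline gives (94.2 − 90.0) / (96.6 − 90.0) ≈ 0.64, which matches the reported figure.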

2606.05917 2026-06-05 cs.CV cs.CL 版本更新

MemoryCard: Topic-Aware Multi-Modal Clue Compression for Long-Video Question Answering

MemoryCard: 面向长视频问答的主题感知多模态线索压缩

Qing Yang, Pengcheng Huang, Xinze Li, Zhenghao Liu, Yukun Yan, Yu Gu, Ge Yu, Gang Li, Maosong Sun

发表机构 * School of Computer Science and Engineering, Northeastern University(东北大学计算机科学与工程学院) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) Digital China Group(数字中国集团)

AI总结 提出MemoryCard框架,通过将长视频分割为主题事件单元并生成事件级摘要和代表性视觉时刻,以记忆卡形式增强VLMs的长视频问答能力,在相同视觉令牌预算下准确率提升高达21.8%。

Comments 21 pages, 8 figures

详情
AI中文摘要

长视频问答对视觉语言模型(VLMs)仍然具有挑战性,因为与答案相关的证据通常稀疏、短暂且时间上分散在冗长的视频上下文中。现有的以帧为中心的方法通过均匀采样、查询感知帧选择、视觉令牌压缩和自适应分辨率策略来提高效率。然而,它们仍然依赖孤立和零散的帧作为基本证据单元,限制了VLMs有效捕获连贯事件级语义的能力。为解决这一限制,我们提出了MemoryCard,一种基于视频记忆的增强框架,将长视频组织成自包含的记忆卡。具体来说,MemoryCard首先对视频和对齐的文本执行自读过程,将视频分割为语义连贯的单元,每个单元对应一个不同的主题或事件。对于每个单元,它生成事件级视频要点并选择代表性视觉时刻,然后将其渲染为统一的记忆卡,用于检索和问答。实验结果表明,在可比的视觉令牌预算下,MemoryCard持续提高了长视频问答性能,准确率相对提升高达21.8%。所有代码可在https://github.com/NEUIR/MemoryCard获取。

英文摘要

Long-video question answering remains challenging for Vision-Language Models (VLMs), as answer-relevant evidence is often sparse, transient, and temporally dispersed across lengthy video contexts. Existing frame-centric approaches improve efficiency through uniform sampling, query-aware frame selection, visual-token compression, and adaptive resolution strategies. However, they still rely on isolated and fragmented frames as the fundamental evidence units, limiting VLMs' ability to effectively capture coherent event-level semantics. To address this limitation, we propose MemoryCard, a video-memory-based augmentation framework that organizes long videos into self-contained Memory Cards. Specifically, MemoryCard first performs a self-reading process over videos and aligned utterances to segment the video into semantically coherent units, each corresponding to a distinct topic or event. For each unit, it generates an event-level video gist and selects representative visual moments, which are then rendered into unified Memory Cards for retrieval and question answering. Experimental results demonstrate that MemoryCard consistently improves long-video QA performance under comparable visual-token budgets, achieving up to a 21.8% relative improvement in accuracy. All code is available at https://github.com/NEUIR/MemoryCard.

2606.05916 2026-06-05 cs.CV 版本更新

Unveiling the Unknown: Open Vocabulary Object Detection with Scene Graphs

揭示未知:基于场景图的开放词汇目标检测

Yi Chen, Yinghao Lu, Zhehao Li, Chenchen Yan, Jiafei Wu, Chong Wang, Jiangbo Qian

发表机构 * Faculty of Electrical Engineering and Computer Science, Ningbo University(宁波大学电气工程与计算机科学学院) Faculty of Computing, Georg-August-Universität Göttingen(哥廷根大学计算机学院) Merchants’ Guild Economics and Cultural Intelligent Computing Laboratory, Ningbo University(宁波大学商帮经济与文化智能计算实验室) School of Software Technology, Zhejiang University(浙江大学软件学院)

AI总结 提出场景引导的关系建模检测框架,利用场景图捕获候选区域与上下文对象之间的结构化语义和空间关系,并通过关系注意力模块和场景文本对齐分支增强开放词汇目标检测性能。

详情
AI中文摘要

开放词汇目标检测旨在识别训练数据中未出现的新目标类别。许多基于知识蒸馏的方法通过将预训练视觉-语言模型的知识迁移到目标检测中,展现了有前景的性能。然而,这些方法往往忽略了对象之间结构化的、图像特定的关系,例如交互和空间布局。这种忽视可能严重限制检测新类别的有效性。为解决这一问题,我们提出了一种场景引导的关系建模检测框架。该框架利用场景图捕获候选区域与其上下文对象之间的结构化语义和空间关系。它显式建模相邻区域之间的交互,并引入关系注意力模块隐式增强从场景图中提取的关键关系线索。此外,我们提出了一种基于场景的文本对齐分支,从字幕中蒸馏类别知识以指导关系对齐。该方法促进了视觉关系与语义信息的无缝集成,从而提升检测性能。大量实验表明,我们的模型在COCO和LVIS数据集上对新类别的AP优于其他OVOD方法。

英文摘要

Open-vocabulary object detection seeks to identify novel object categories that were not part of the training data. Many knowledge distillation-based approaches have shown promising performance by transferring knowledge from pre-trained vision-language models to object detection. However, these methods often overlook structured, image-specific relationships between objects, such as interactions and spatial arrangements. This oversight can significantly restrict the effectiveness of detecting novel categories. To address this issue, we propose a Scene-guided Relational Modeling detection framework. This framework utilizes scene graphs to capture structured semantic and spatial relationships between candidate regions and their contextual objects. It explicitly models interactions among neighboring regions and incorporates a Relation Attention Module to implicitly amplify the key relational cues extracted from the scene graph. Furthermore, we present a scene-based textual alignment branch that distills category knowledge from captions to guide relational alignment. This approach facilitates a seamless integration of visual relations with semantic information for enhanced detection performance. Comprehensive experiments show that our model achieves superior performance compared to other OVOD methods, improving the AP for novel categories on COCO and LVIS datasets.

2606.05915 2026-06-05 cs.CV 版本更新

CamFlow+: Hybrid Motion Bases for 2D Camera Motion Estimation with Stabilization Applications

CamFlow+: 用于二维相机运动估计的混合运动基及其稳定应用

Haipeng Li, Zhen Liu, Zhanglei Yang, Hai Jiang, Tianhao Zhou, Zhengzhe Liu, Ping Tan, Bing Zeng, Shuaicheng Liu

发表机构 * School of Information and Communication Engineering, University of Electronic Science and Technology of China(电子科技大学信息与通信工程学院) University of Electronic Science and Technology of China(电子科技大学) School of Aeronautics and Astronautics, Sichuan University(四川大学航空宇航学院) YingCai Honors College, University of Electronic Science and Technology of China(电子科技大学英才实验学院) Lingnan University(岭南大学) Hong Kong University of Science and Technology and Shenzhen Loop Area Institute(香港科技大学及深圳河套学院)

AI总结 提出CamFlow+混合基框架,通过结合单应性物理基、随机基和深度平移基在稠密光流空间中直接估计二维相机运动,并引入深度感知平滑项,有效处理平移、深度变化和局部视差,在相机运动估计和视频稳定任务中取得最优效果。

详情
AI中文摘要

估计二维相机运动是计算机视觉和计算摄影的基础。现有的基于单应性的方法在平面场景或纯旋转情况下效果良好,但在相机平移、深度变化和局部视差方面表现不佳;局部单应性和网格模型提高了灵活性,但仍依赖于分片平面假设。我们提出CamFlow+,一个混合基框架,直接在稠密光流空间中表示二维相机运动。CamFlow+结合了单应性导出的物理基、从单应性流中采样的随机基以及从深度和相机内参导出的深度平移基,在保持相机运动规律的同时放松了单平面约束。一个深度感知平滑项进一步在连续深度区域正则化平移引起的视差,同时保留深度边界附近的运动变化。我们在GHOF-Cam上评估CamFlow+,这是一个相机运动基准,通过掩蔽光流基准中的动态对象和不适定遮挡区域来隔离相机引起的运动。实验表明,CamFlow+改进了稀疏和稠密相机运动估计。在数字视频稳定中,CamFlow+还提高了全局和局部稳定性,在盲用户研究中实现了最佳top-1偏好率。代码和数据集将在项目页面上提供:https://lhaippp.github.io/CamFlow+。

英文摘要

Estimating 2D camera motion is fundamental to computer vision and computational photography. Existing homography-based methods work well for planar scenes or pure rotation, but struggle with camera translation, depth variation, and local parallax; local homography and mesh-based models improve flexibility but still rely on piecewise planar assumptions. We introduce CamFlow+, a hybrid-basis framework that represents 2D camera motion directly in dense-flow space. CamFlow+ combines homography-derived physical bases, stochastic bases sampled from homography flows, and depth-translational bases derived from depth and camera intrinsics, relaxing the single-plane constraint while preserving camera-motion regularity. A depth-aware smoothness term further regularizes translation-induced parallax in continuous-depth regions while preserving motion changes near depth boundaries. We evaluate CamFlow+ on GHOF-Cam, a camera-motion benchmark that masks out dynamic objects and ill-posed occlusion regions in an optical-flow benchmark to isolate camera-induced motion. Experiments show that CamFlow+ improves sparse and dense camera-motion estimation. In digital video stabilization, CamFlow+ also improves global and local stability, achieving the best top-1 preference rate in a blind user study. Code and datasets will be available on the project page: https://lhaippp.github.io/CamFlow+.
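One way to read "representing camera motion in dense-flow space with motion bases" is as a weighted sum of basis flow fields, with the weights fitted to an observed flow. The sketch below assumes exactly that linear model and solves for the weights by least squares. The helper names, the 2x2 pixel grid, and the toy bases (two global translations plus one inverse-depth parallax basis) are hypothetical and much simpler than the bases described in the abstract.

```python
def solve(A, b):
    """Solve the square system A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def fit_basis_weights(bases, flow):
    """Least-squares weights w minimising ||sum_k w[k] * bases[k] - flow||^2.

    Each basis and the flow are flattened lists [u0, v0, u1, v1, ...].
    Uses the normal equations, so the bases must be linearly independent.
    """
    gram = [[sum(x * y for x, y in zip(bi, bj)) for bj in bases] for bi in bases]
    rhs = [sum(x * y for x, y in zip(bi, flow)) for bi in bases]
    return solve(gram, rhs)


def compose(bases, weights):
    """Weighted sum of basis flow fields."""
    return [sum(w * b[i] for w, b in zip(weights, bases)) for i in range(len(bases[0]))]


# Four pixels; inverse depth per pixel is made up for the example.
inv_depth = [1.0, 0.5, 0.25, 1.0]
shift_x = [1.0, 0.0] * 4                      # uniform horizontal motion
shift_y = [0.0, 1.0] * 4                      # uniform vertical motion
parallax_x = [c for d in inv_depth for c in (d, 0.0)]  # horizontal parallax scaled by 1/depth
bases = [shift_x, shift_y, parallax_x]

observed = compose(bases, [2.0, -1.0, 0.5])
print(fit_basis_weights(bases, observed))
```

The parallax basis reflects the standard fact that flow induced by camera translation scales with inverse depth, which a single homography cannot express for a scene with depth variation.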

2606.05912 2026-06-05 cs.CV 版本更新

Self-Learning Expression Deformations for Data-Efficient Gaussian Avatars

自学习表情形变用于数据高效的高斯化身

Jiahao Yang, Xiaohang Yang, Qing Wang, Yilan Dong, Gregory Slabaugh, Shanxin Yuan

发表机构 * Queen Mary University of London(伦敦大学玛丽女王学院)

AI总结 提出自适应高斯表情框架,通过自监督学习表情驱动的形变,结合2D高斯面元和符号距离场,实现从极少量输入数据(单帧、单目或单张图像)重建高保真可动画化身。

详情
AI中文摘要

使用3D高斯表示建模动态面部表情由于其非结构化特性仍然具有挑战性。传统的高斯化身流程需要大量的多视角和序列表情数据,限制了可扩展性和可访问性。在这项工作中,我们引入了自适应性高斯表情(SAGE),一个自学习表情诱导的高斯形变框架,能够从最小输入数据中实现高保真、可动画的化身。我们的方法联合优化2D高斯面元和符号距离场(SDF)以强制实现紧凑的、表面对齐的高斯分布,同时一个自监督的表情学习阶段用几何和外观一致性约束取代了长时间的训练序列。这种设计允许在多种重建场景下灵活部署:在多视角设置中,仅需单帧(时间步)而非数千帧;在单目设置中,仅需头部旋转而无需表情序列;在单次设置中,无需预训练或先验。实验表明,我们的方法在重建和动画质量上与最先进方法相当,同时将数据需求降低了几个数量级。我们的结果突显了自监督高斯形变学习作为迈向可访问、数据高效化身创建的一步的潜力。

英文摘要

Modeling dynamic facial expressions using 3D Gaussian representations remains challenging due to their unstructured nature. Conventional Gaussian avatar pipelines require extensive multiview and sequential expression data, limiting scalability and accessibility. In this work, we introduce Self-Adaptive Gaussian Expression (SAGE), a framework for self-learning expression-induced Gaussian deformations that enables high-fidelity, animatable avatars from minimal input data. Our method jointly optimizes 2D Gaussian surfels and a Signed Distance Field (SDF) to enforce compact, surface-aligned Gaussian distributions, while a self-supervised expression learning phase replaces long training sequences with geometric and appearance consistency constraints. This design allows flexible deployment across multiple reconstruction regimes: in the multiview setting, only a single frame (timestep) is required instead of thousands; in the monocular setting, only head rotations are needed without expression sequences; and in the one-shot setting, no pretraining or priors are necessary. Experiments demonstrate that our approach achieves reconstruction and animation quality comparable to state-of-the-art methods, while reducing data requirements by several orders of magnitude. Our results highlight the potential of self-supervised Gaussian deformation learning as a step toward accessible, data-efficient avatar creation.

2606.05896 2026-06-05 cs.CV 版本更新

Resonant Minds: Closed-Loop Social Avatars with Theory of Mind

共鸣心智:具备心智理论的闭环社交虚拟人

Jianxu Shangguan, Jing Xu, Hang Ye, Xiaoxuan Ma, Yizhou Wang, Wentao Zhu

发表机构 * University of Washington(华盛顿大学) Peking University(北京大学) Carnegie Mellon University(卡内基梅隆大学) Eastern Institute of Technology, Ningbo(宁波东方理工大学)

AI总结 提出一个闭环双智能体框架,通过整合感知、社会推理(基于心智理论)和多模态生成,实现具备社交智能的虚拟人,并在信息不对称数据集上取得优于全信息脚本模式的对话质量。

详情
AI中文摘要

创建具有真正社交智能的逼真数字人需要将认知推理和多模态生成统一在一个连贯的框架内。当前的方法将这些视为独立的任务:大型语言模型擅长对话但缺乏具身表达,而基于扩散的说话头模型实现了视觉保真度但忽略了社会认知。为了弥合这一差距,我们提出了一个闭环双智能体框架,将感知、社会推理和表达整合到一个连续的交互循环中。感知模块从视频中分析伙伴的多模态行为,而社会推理模块通过心智理论推断隐藏的心理状态,并通过集成机制选择响应。然后,表达模块生成情感可控的双智能体视频,合成说话者的言语和表情以及听者的反应行为,捕捉先前工作中缺失的双向动态。我们构建了一个分层的角色-场景数据集,包含基于心理学的角色和私人社交目标,以支持信息不对称下的评估。在该数据集上的实验表明,在对话质量和视频生成指标上均具有竞争性或优越的性能。值得注意的是,我们的方法在关键对话质量维度上甚至超过了全信息脚本模式,这表明在不确定性下显式的心理状态推断可以比无限制的信息访问引发更周到的对话。

英文摘要

Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners' multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable dual-agent videos synthesizing both speaker speech and expression alongside listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access.

2606.05873 2026-06-05 cs.RO cs.AI cs.CV cs.LG 版本更新

LadderMan: Learning Humanoid Perceptive Ladder Climbing

LadderMan: 学习人形机器人感知爬梯

Siheng Zhao, Yuanhang Zhang, Ziqi Lu, Pieter Abbeel, Rocky Duan, Koushil Sreenath, Yue Wang, C. Karen Liu, Guanya Shi

发表机构 * Amazon FAR(亚马逊FAR) USC(美国南加州大学) UC Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) CMU(卡内基梅隆大学)

AI总结 提出LadderMan系统,通过两阶段学习管道和视觉基础模型,使人形机器人能够鲁棒地攀爬多种梯子并在梯子上进行操控。

详情
AI中文摘要

人形机器人在以人为中心的环境中具有巨大潜力,但由于稀疏的立足点和手抓点、复杂的全身协调以及对感知和控制误差的敏感性,爬梯仍然是最具挑战性的任务之一。我们提出了LadderMan,一个统一的系统,使人形机器人能够鲁棒地攀爬多种梯子并在这种受限条件下进行操控。我们的攀爬策略基于一个可扩展的两阶段学习管道,其中我们使用混合运动跟踪从单个参考运动学习多个攀爬专家,并通过混合模仿和强化学习将这些专家蒸馏成一个统一的基于深度视觉的运动攀爬策略。为了实现真实世界部署,我们利用视觉基础模型来弥合深度感知中的模拟到现实差距。基于学习到的攀爬策略,我们进一步使用双智能体公式训练一个独立的操控策略,允许通过遥操作在梯子上进行稳定操控。实验表明,LadderMan在多种几何形状的梯子上实现了鲁棒的攀爬,以零样本方式成功迁移到真实世界硬件,并在具有挑战性的梯子约束下支持各种操控任务。视频结果见https://ladderman-robot.github.io。

英文摘要

Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at https://ladderman-robot.github.io.

2606.05849 2026-06-05 physics.optics cs.CV 版本更新

Inverse Design of Realizable Metasurface based Absorbers using Improved Conditioning and Diversity Enhanced Progressively Growing GANs

利用改进的条件化和多样性增强的渐进式生长GAN实现可实现的超表面吸收体的逆向设计

Vineetha Joy, Mohammad Abdullah, Pramit Pal, Anshuman Kumar, Amit Sethi, Hema Singh

发表机构 * Centre for Electromagnetics, CSIR-National Aerospace Laboratories(CSIR-国家航空航天实验室电磁学中心) Birla Institute of Technology and Science, Pilani, Rajasthan(博拉理工学院,皮拉尼,拉贾斯坦邦) Indian Institute of Technology, Bombay, Maharashtra(印度理工学院孟买分校,马哈拉施特拉邦)

AI总结 提出一种基于渐进式生长WGAN-GP与特征线性调制条件化的生成式逆向设计框架,结合替代辅助光谱对齐损失和行列式点过程多样性正则化,实现连续光谱约束下物理一致且多样化的超表面吸收体设计。

详情
AI中文摘要

超表面能够精确操控电磁波,用于波束转向、传感和隐身技术等应用。然而,由于迭代全波仿真驱动优化的计算成本高昂,以及现有生成方法在条件保真度和多样性方面的限制,具有目标电磁响应的超表面的逆向设计仍然具有挑战性。为了解决这些问题,本文提出了一种生成式逆向设计框架,用于在连续光谱约束下实现可控且物理一致的超表面合成。该方法采用渐进式生长Wasserstein生成对抗网络,结合梯度惩罚和基于特征线性调制的条件化,以实现连续光谱和制造约束的稳定传播。通过替代辅助光谱对齐损失,将电磁一致性直接嵌入生成学习过程,从而在训练期间实现物理约束生成。此外,引入基于行列式点过程的多样性正则化策略,以生成几何多样但光谱一致的实现,对应同一目标响应。通过在2至18 GHz频率范围内生成具有不同反射特性的实际可实现的超表面吸收体,证明了所提框架的有效性。电磁仿真验证了生成的设计以高精度满足目标规格。最终提出的框架实现了平均均方误差0.0052、多样性分数0.8730、波段对齐精度0.8533以及有效电磁设计生成百分比89.57,清晰展示了其生成高精度、多样化、电磁一致且可制造的超表面配置的能力。

英文摘要

Metasurfaces enable precise manipulation of electromagnetic waves for applications such as beam steering, sensing, and stealth technology. However, inverse design of metasurfaces with targeted electromagnetic (EM) responses remains challenging due to the computational expense of iterative optimization driven by full-wave simulation and the limited conditioning fidelity and diversity of existing generative approaches. To address these challenges, this paper presents a generative inverse design framework for controllable and physically consistent metasurface synthesis under continuous spectral constraints. The proposed approach employs a progressively growing Wasserstein generative adversarial network with gradient penalty, integrated with conditioning based on feature-wise linear modulation, for stable propagation of continuous spectral and fabrication constraints. EM consistency is embedded directly into the generative learning process through a surrogate-assisted spectral alignment loss, enabling physics-constrained generation during training. Further, a diversity regularization strategy based on a determinantal point process is incorporated to generate geometrically diverse yet spectrally consistent realizations for the same target response. The effectiveness of the proposed framework is demonstrated through the generation of practically realizable metasurface absorbers exhibiting diverse reflection characteristics in the frequency range of 2 to 18 GHz. EM simulations validate that the generated designs meet the target specifications with high accuracy. The final framework achieves an average mean squared error of 0.0052, a diversity score of 0.8730, a band alignment accuracy of 0.8533, and a valid EM design generation percentage of 89.57, demonstrating its capability to generate accurate, diverse, electromagnetically consistent, and fabricable metasurface configurations.
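The intuition behind a determinantal-point-process diversity term can be shown with a toy calculation: build a similarity kernel over a batch of generated designs and take its determinant, which shrinks toward zero when designs are near-duplicates and approaches one when they are well separated. The abstract does not give the exact form of the regularizer, so the RBF kernel, the bandwidth, and the names `rbf_kernel`, `determinant`, and `dpp_diversity` are assumptions for illustration only.

```python
import math


def rbf_kernel(designs, bandwidth=1.0):
    """Pairwise RBF similarity matrix between design vectors."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return [[math.exp(-sqdist(a, b) / (2.0 * bandwidth ** 2)) for b in designs]
            for a in designs]


def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    M = [row[:] for row in matrix]
    n = len(M)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det  # a row swap flips the sign
        det *= M[col][col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return det


def dpp_diversity(designs, bandwidth=1.0):
    """Determinant of the similarity kernel: near 0 for duplicates, near 1 for spread-out sets."""
    return determinant(rbf_kernel(designs, bandwidth))


near_duplicates = [[0.0, 0.0], [0.01, 0.0], [0.0, 0.01]]
spread_out = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
print(dpp_diversity(near_duplicates), dpp_diversity(spread_out))
```

In a training loop, a term such as the negative log-determinant would typically be added to the generator loss so that batches conditioned on the same target spectrum are pushed apart geometrically.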

2606.05833 2026-06-05 cs.CV cs.AI 版本更新

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

从视频中学习几何表示以实现空间智能多模态大语言模型

Haibo Wang, Lifu Huang

发表机构 * University of California, Davis(加州大学戴维斯分校)

AI总结 提出GeoVR框架,通过从2D视频序列中蒸馏3D几何知识(包括相机姿态、深度图、尺度因子和多尺度3D特征),重塑多模态大语言模型的内部表示以赋予其空间智能,在空间推理基准上达到最先进性能。

详情
AI中文摘要

多模态大语言模型(MLLMs)在2D语义理解方面表现出色,但缺乏内在的3D感知能力,导致其表示无法在视频帧间保持几何和空间一致性。鉴于大规模3D数据的稀缺性,我们提出了GeoVR,一种新颖的框架,仅使用2D视频序列学习几何表示。该方法有效地重构了MLLMs内部的语义潜在空间,以解锁空间智能。GeoVR并非采用浅层的特征混合,而是通过从预训练的3D基础模型中蒸馏几何知识来重塑MLLM的内部表示。这是通过一种多目标学习策略实现的,该策略由四个互补的几何目标驱动:(1)估计帧间相机姿态以嵌入变化的视角动态,(2)回归密集深度图以锚定物理距离,(3)预测度量尺度因子以进行真实世界校准,以及(4)蒸馏多尺度3D特征以对齐中间特征空间。在这些显式的物理和几何约束的引导下,模型的内部表示自然地发展出强大的3D感知能力。在空间推理基准上的大量实验表明,GeoVR实现了最先进的性能,为赋予基础模型空间智能建立了一种新范式。

英文摘要

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

2606.05829 2026-06-05 cs.CV 版本更新

Gender Artifacts from Art History to Text-to-Image Generation

从艺术史到文本到图像生成中的性别伪影

Piera Riccio, Miriam Doh, Benedikt Höltgen, Noa Garcia, Nanne van Noord

发表机构 * University of Amsterdam(阿姆斯特丹大学) Université Libre de Bruxelles(布鲁塞尔自由大学) Hasso Plattner Institute, University of Potsdam(波茨坦大学哈索·普拉特纳研究所) The University of Osaka(大阪大学)

AI总结 通过提出性别伪影度量(PixelSGA和MaskSGA),研究了艺术风格中性别表征与视觉特征的关系,并发现文本到图像生成模型会放大历史来源中的性别伪影。

详情
AI中文摘要

艺术风格植根于特定的社会历史背景,这些背景编码了社会等级,包括不同的性别建构。然而,在人工智能研究中,风格长期以来被视为一种表面层次的视觉属性:一种应用于内容中性场景的颜色、笔触和纹理的滤镜。我们引入了第一个数据集来研究历史图像和生成图像中性别表征与风格之间的相互作用。StyleGender包含跨越19种艺术风格的74k张图像,包括带有风格和性别注释的艺术历史图像、在受控风格和性别提示下由T2I生成的图像,以及一个语义对齐集,使得可以直接比较艺术史与生成结果。通过提出两种集合性别伪影(SGA)度量(PixelSGA和MaskSGA),在像素级别和构图结构中捕捉性别信号,我们展示了:(1) 性别表征塑造了不同艺术风格的视觉特征,(2) 风格关键词将这些模式带入T2I生成中,(3) 生成模型倾向于放大历史来源中观察到的性别伪影。

英文摘要

Artistic styles are rooted in specific socio-historical contexts that encode social hierarchies, including distinct constructions of gender. Yet in AI research, style has long been treated as a surface-level visual property: a filter of color, brushstroke, and texture applied to otherwise content-neutral scenes. We introduce the first dataset to investigate the interplay between gender representation and style in both historical and generated images. StyleGender comprises 74k images spanning 19 artistic styles, consisting of art-historical images with style and gender annotations, T2I-generated images under controlled style and gender prompts, and a semantically aligned set enabling direct art history-to-generation comparison. By proposing two Set Gender Artifact (SGA) metrics (PixelSGA and MaskSGA), capturing gender signals at the pixel level and in compositional structure, we show that (1) gender representation shapes visual features across artistic styles, (2) style keywords carry these patterns into T2I generation, and (3) generative models tend to amplify gender artifacts beyond what is observed in historical sources.

2606.05785 2026-06-05 cs.CV cs.AI cs.LG 版本更新

Next-Generation Parallel Decoder for LPDR: Architectural Optimization and Class-Balanced GAN-Augmentation

下一代LPDR并行解码器:架构优化与类别平衡的GAN增强

Shawaiz Obaid, Nida Chandio, Neha Jamil, Muhammad Khuram Shahzad

发表机构 * School of Electrical Engineering and Computer Science, National University of Sciences & Technology, Islamabad, Pakistan

AI总结 针对车牌检测与识别中的空间字符不匹配和数据不平衡问题,提出交叉空间混合注意力和类别平衡合成增强方法,将少数省份车牌识别率从78.2%提升至91.5%,同时保持152 FPS的实时处理性能。

Comments 8 pages, 7 figures

详情
AI中文摘要

实时车牌检测与识别(LPDR)是现代智慧城市的基石。尽管YOLOV5-PDLPR模型通过并行解码器方法显著提高了系统效率,但其性能仍受训练集中空间字符不匹配和数据不平衡的影响。本文通过引入交叉空间混合注意力(CSHA)和类别平衡合成增强(CBSA)来解决这些局限性。进行了涉及75,000个合成样本的广泛研究,并在四个基准数据集(CCPD、CLPD、PKU和一个应用特定数据集)上进行了评估。实验结果表明,少数省份车牌识别率从78.2%大幅提升至91.5%,同时保持152 FPS的实时处理性能。结果表明,结合空间感知并行解码与类别平衡增强为高速车牌识别系统提供了有效解决方案。

英文摘要

Real-Time License Plate Detection and Recognition (LPDR) forms the backbone of modern smart cities. Although the YOLOV5-PDLPR model substantially improved system efficiency through a parallel decoder approach, its performance is still affected by spatial character mismatches and data imbalance within the training set. This paper addresses these limitations by introducing Cross-Spatial Hybrid Attention (CSHA) and Class-Balanced Synthetic Augmentation (CBSA). An extensive study involving 75,000 synthetic samples is conducted and evaluated on four benchmarks: CCPD, CLPD, PKU, and an application-specific dataset. Experimental results demonstrate a substantial improvement in the recognition rate of minority provincial license plates from 78.2% to 91.5% while maintaining real-time processing performance of 152 FPS. The results indicate that spatially-aware parallel decoding combined with class-balanced augmentation provides an effective solution for high-speed license plate recognition systems.

2606.05778 2026-06-05 cs.CV 版本更新

Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment

超越绝对分数:基于编辑诱导差异的通用图像美学评估

Qifei Jia, Xintong Yao, Minghao Li, Yajie Chai, Qiming Lu, Baoyue Shen, Yasen Zhang, Runyu Shi, Ying Huang, Yue Zhang

发表机构 * Xiaomi Corporation, Beijing, China(小米公司,北京,中国)

AI总结 提出RED-Aes框架,利用可控图像编辑模型模拟人类审美推理,通过相对编辑诱导差异学习通用美学原则,实现跨场景泛化。

详情
AI中文摘要

传统的图像美学评估(IAA)方法主要依赖于回归绝对平均意见分数(MOS)。然而,这种范式忽视了人类审美感知固有的动态性质,这种感知依赖于对隐含视觉参考的无意识比较。因此,缺乏对美学差异的因果推理使得模型无法学习通用的美学原则,从而限制了它们在多样化场景中的泛化能力。在这项工作中,我们重新思考IAA任务,并提出相对编辑诱导差异美学学习(RED-Aes),一种新颖的框架,利用可控图像编辑模型模拟人类审美推理过程。RED-Aes不拟合绝对分数分布,而是显式学习驱动美学变化的视觉因素。为了支持这一范式,我们构建了RED-20k数据集,包含基于编辑的图像对、定量美学差异和思维链(CoT)推理。此外,我们引入了一种由相对排序一致性奖励引导的三阶段训练策略,仅通过相对监督优化模型。大量实验表明,RED-Aes在多个公共基准上取得了最先进的性能,展现出优越的泛化能力。

英文摘要

Traditional Image Aesthetic Assessment (IAA) methods mainly rely on regressing absolute Mean Opinion Scores (MOS). However, such a paradigm overlooks the inherently dynamic nature of human aesthetic perception, which relies on subconscious comparison against implicit visual references. Consequently, the lack of causal reasoning regarding aesthetic differences prevents models from learning generalizable aesthetic principles, thus limiting their generalization across diverse scenarios. In this work, we rethink the IAA task and propose Relative Edit-induced Difference Aesthetic learning (RED-Aes), a novel framework that leverages controllable image editing models to simulate the human aesthetic reasoning process. Instead of fitting absolute score distributions, RED-Aes explicitly learns the visual factors that drive aesthetic changes. To support this paradigm, we construct the RED-20k dataset, which comprises editing-based image pairs, quantitative aesthetic differences, and Chain-of-Thought (CoT) reasoning. Furthermore, we introduce a three-stage training strategy guided by a relative ranking consistency reward, optimizing the model solely via relative supervision. Extensive experiments demonstrate that RED-Aes achieves state-of-the-art performance on multiple public benchmarks, exhibiting superior generalization capabilities.

2606.05769 2026-06-05 cs.CV 版本更新

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

在预测之前想象:用于视频事件预测的交错潜在视觉推理

Tianxiang Jiang, Linquan Wu, Sheng Xia, Songze Li, Ziang Yan, Haoyu Yang, Yu Qiao, Yi Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai AI Laboratory(上海人工智能实验室) City University of Hong Kong(香港城市大学) Nanjing University(南京大学) Fudan University(复旦大学) Zhejiang University(浙江大学) University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出Future-L1框架,通过交错潜在视觉推理在自回归解码中交替语言token和连续潜在视觉跨度,结合LA-DAPO强化学习优化,在视频事件预测任务上取得最先进结果。

Comments https://github.com/OpenGVLab/Future-L1

详情
AI中文摘要

视频事件预测(VEP)要求模型从部分视频证据中推断未观察到的未来状态。现有的视频多模态大语言模型(MLLMs)通常在文本空间中将中间未来推理进行语言化:一旦视觉证据被语言化,细粒度的运动、几何和交互线索可能会丢失,导致看似合理但视觉上无根据的幻觉。我们引入了Future-L1,一种交错潜在视觉推理框架,允许MLLM在自回归解码过程中在语言token和连续潜在视觉跨度之间交替。为了训练这种能力,我们通过选择未来视觉提示有助于预测的示例,并将潜在状态与未来帧嵌入对齐,构建了Future-L1-50K数据集,然后使用LA-DAPO(一种具有结果对比和时间多样性奖励的潜在感知RL目标)进一步优化采样的潜在轨迹。Future-L1在两个基准测试上均取得了新的最先进结果:在FutureBench上,它将Qwen3-VL-8B从61.0提升至85.4,并超过之前最佳Video-CoE 10.4分;在TwiFF-Bench上,它将平均得分从2.44提升至3.04。这些结果表明,面向未来的视频推理受益于在潜在空间中保留中间视觉语义,而不是将每个推理步骤都转换为文本。

英文摘要

Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction and align latent states to future-frame embeddings, then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.

2606.05760 2026-06-05 cs.CV 版本更新

ExpSpeech-Net: Multimodal Fusion of Expression and Speech for Deepfake Detection

ExpSpeech-Net: 表情与语音的多模态融合用于深度伪造检测

Ruchika Sharma, Rudresh Dwivedi

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出轻量级ExpSpeech-Net模型,通过融合面部表情和语音模式,利用SqueezeNet和RNN骨干网络及智能特征选择,实现高效深度伪造检测,准确率达94.5%。

详情
AI中文摘要

深度伪造视频日益挑战在线内容的可信度。许多现有检测方法依赖于复杂、资源密集型的模型,限制了其实用性。本研究引入了ExpSpeech-Net深度伪造检测(SqN-R-DFD)模型,该模型以SqueezeNet和RNN(循环神经网络)为骨干,提供了一个轻量级且高效的深度伪造检测框架,能够同时分析面部表情和语音模式。该方法采用了先进的特征提取,例如基于ISLBT的图像特征和用于信号的MPNCC,并结合使用SASMA(鹬辅助黏液霉菌算法)的智能特征选择策略,确保检测模型获得最优且平衡的输入。通过结合SqueezeNet和RNN,有效捕捉深度伪造视频中的细微不一致性。该框架实现了94.5%的准确率、99.3%的精确率和96.8%的F-measure,优于传统方法。这表明,将多种模态与智能预处理和特征选择相结合,能够实现适用于日常应用的实用、实时深度伪造检测。

英文摘要

Deepfake videos are increasingly challenging the credibility of online content. Many existing detection methods rely on complex, resource-intensive models, which limits their practical use. The study introduces the ExpSpeech-Net deepfake detection (SqN-R-DFD) model, which uses SqueezeNet and an RNN (Recurrent Neural Network) as its backbone, providing a lightweight and efficient deepfake detection framework that simultaneously analyzes facial expressions and speech patterns. The approach incorporates advanced feature extraction, such as ISLBT-based features for images and MPNCC for signals, along with a smart feature-selection strategy using SASMA (Sandpiper-Assisted Slime Mould Algorithm), ensuring optimal and balanced input to the detection models. By combining SqueezeNet and an RNN, subtle inconsistencies in deepfake videos are captured effectively. The framework achieves 94.5% accuracy, 99.3% precision, and a 96.8% F-measure, outperforming conventional methods. This demonstrates that integrating multiple modalities with intelligent preprocessing and feature selection enables practical, real-time deepfake detection suitable for everyday applications.
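The abstract reports precision and F-measure but not recall. If the F-measure is the usual F1 score (an assumption, since the abstract does not specify the weighting), recall follows from rearranging F = 2PR / (P + R) into R = FP / (2P − F). The function name below is hypothetical.

```python
def recall_from_precision_and_f1(precision, f1):
    """Recover recall R from precision P and F1, using F1 = 2PR / (P + R)."""
    return f1 * precision / (2.0 * precision - f1)


# Reported figures: precision 99.3%, F-measure 96.8%.
recall = recall_from_precision_and_f1(0.993, 0.968)
print(round(recall, 3))
```

Under that assumption the implied recall is about 94.4%, so the detector would be missing roughly one in eighteen positives while raising very few false alarms.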

2606.05758 2026-06-05 cs.CV cs.AI cs.LG 版本更新

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

DRIFT:一种用于视觉-语言模型中连续输出解码的残差流适配器

Zhuoming Liu, Jinhong Lin, Kwan Man Cheng, Lin Zhang, Shayok Bagchi, Yin Li

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) West Lafayette Jr./Sr. High School(韦斯特拉法叶高中)

AI总结 提出DRIFT框架,通过结合基础预测器和基于流匹配的生成式精化模块,将预训练视觉-语言模型适配到连续解码任务,在视觉定位和机器人控制等任务上优于回归和生成方法。

详情
AI中文摘要

许多现代视觉-语言模型(VLM)基于离散标记的自回归解码。虽然基于文本的输出接口支持可扩展的预训练和跨多种任务的强零样本泛化,但它们不适用于需要精确连续输出的问题,例如定位事件的时间边界或生成机器人控制动作。为了解决这一挑战,我们提出了DRIFT,一个用于将预训练VLM适配到连续解码任务的通用框架。DRIFT结合了一个基础预测器(提供目标输出的粗略估计)和一个基于流匹配的生成式精化模块(迭代改进预测)。这种残差公式将生成建模问题从学习全局输出分布转变为在强先验周围建模局部残差分布,大大简化了优化。我们在感知和规划任务上评估了DRIFT,包括视觉定位和机器人控制。在跨越MLLM、VLA和WAM的多个任务和架构中,DRIFT始终优于一组强大的基于回归和生成的方法。

英文摘要

Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression- and generative-based solutions.
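
The residual formulation described above can be sketched as a toy, inference-only example: a coarse base prediction plus a residual that is transported from noise to its target by Euler integration of a velocity field. Everything here is an assumption made for illustration: `base_predictor`, `oracle_velocity`, and `refine` are hypothetical names, and the oracle velocity (which knows the target residual) stands in for the learned flow-matching network, whose architecture and training the abstract does not specify.

```python
import random


def base_predictor(x):
    """Hypothetical coarse regressor: right trend, deliberately biased."""
    return 2.0 * x


def oracle_velocity(r_t, t, r_target):
    """Stand-in for a learned velocity network (assumption).

    On the linear path r_t = (1 - t) * r_0 + t * r_target, the velocity that
    carries r_t to r_target at t = 1 is (r_target - r_t) / (1 - t).
    """
    return (r_target - r_t) / (1.0 - t)


def refine(x, r_target, steps=4, seed=0):
    """Return (coarse estimate, coarse estimate + refined residual)."""
    rng = random.Random(seed)
    coarse = base_predictor(x)
    r = rng.gauss(0.0, 1.0)  # the residual starts as Gaussian noise
    for k in range(steps):
        t = k / steps
        r += (1.0 / steps) * oracle_velocity(r, t, r_target)  # Euler step
    return coarse, coarse + r


x = 3.0
target = 2.0 * x + 0.5  # toy ground truth: the base predictor is off by 0.5
coarse, refined = refine(x, r_target=target - base_predictor(x))
print(coarse, refined)
```

The point of the sketch is only the decomposition: the generative part models a small residual around a prior rather than the full output.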

2606.05753 2026-06-05 cs.CV 版本更新

Cosine Misleads: Auxiliary Losses Reshape Vision Language Models, Not Their Latents

余弦误导:辅助损失重塑视觉语言模型,而非其潜变量

XiuYu Zhang, Junfeng Fang, Zhenkai Liang

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 本文通过实验发现,在视觉语言模型的潜视觉推理中,余弦相似度等对齐损失与准确性负相关,并引入PRISM诊断工具揭示潜变量被绕过,辅助损失主要通过共享参数重塑语言模型。

详情
AI中文摘要

潜视觉推理(LVR)在视觉语言模型(VLM)的感知和答案生成之间插入有监督的潜变量。该领域使用这些潜变量与其视觉目标之间的对齐(即余弦相似度或均方误差)作为训练损失和质量指标,假设更好的对齐会产生更好的答案。我们通过设计包含五种LVR变体的矩阵进行测试,发现该假设被颠覆:余弦对齐与所有五种变体的准确性呈负相关(r=-0.94)。为了解释这一点,我们引入了PRISM,一对推理时诊断工具:一个线性探针,询问答案在何处可解码;一个破坏性测试,询问潜变量是否承担负载。有监督的潜变量在很大程度上被绕过。破坏它们最多使准确性变化四个百分点。答案在潜变量下游可解码,但在潜变量处不可解码,并且这种可解码性差距的大小预测了每个变体在扰动下对其潜变量的依赖程度。与信息瓶颈对损失的解释一致,辅助目标通过共享参数而非其名义上优化的潜变量来重塑语言模型。

英文摘要

Latent visual reasoning (LVR) inserts supervised latent tokens between perception and answer generation in vision-language models (VLMs). The field uses alignment between these latents and their visual targets, i.e., cosine similarity or mean squared error (MSE), as both the training loss and the quality metric, assuming that better alignment yields a better answer. We test this with a designed matrix of five LVR variants and find the assumption inverted: cosine alignment is negatively correlated with accuracy across all five (r=-0.94). To explain this, we introduce PRISM, a pair of inference-time diagnostics: a linear probe that asks where the answer is decodable, and a corruption test that asks whether the latent is load-bearing. The supervised latents are largely bypassed. Corrupting them shifts accuracy by at most four points. The answer is decodable downstream of the latent but not at it, and the size of this decodability gap predicts how much each variant relies on its latent under perturbation. Consistent with an Information Bottleneck reading of the loss, the auxiliary objective reshapes the language model via shared parameters rather than via the latent variable it nominally optimizes.
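
The two alignment measures named above, and the Pearson correlation used to relate alignment to accuracy, can be written out in a few lines. This is a minimal standard-library sketch; the vectors and the per-variant numbers are synthetic placeholders chosen for illustration, not values from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; insensitive to their scale."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def mse(a, b):
    """Mean squared error; sensitive to scale as well as direction."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# A latent that points in the target's direction but has the wrong scale.
latent = [1.0, 2.0, 3.0]
target = [2.0, 4.0, 6.0]
cos = cosine_similarity(latent, target)
err = mse(latent, target)

# Synthetic per-variant numbers (placeholders, not the paper's data).
alignment = [0.1, 0.2, 0.3, 0.4, 0.5]
accuracy = [0.9, 0.8, 0.7, 0.6, 0.5]
r = pearson_r(alignment, accuracy)
print(cos, err, r)
```

The first pair shows that cosine similarity and MSE can disagree about how well a latent is aligned, and the second shows what a strongly negative correlation coefficient looks like on toy data.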

2606.05737 2026-06-05 cs.CV cs.AI cs.LG cs.RO 版本更新

Let It Be Simple: One-Step Action Generation for Vision-Language-Action Models

让它简单:视觉-语言-动作模型的单步动作生成

Yitong Chen, Shiduo Zhang, Jingjing Gong, Xipeng Qiu

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai Innovation Institute(上海创新研究院) Fudan University(复旦大学)

AI总结 针对视觉-语言-动作(VLA)模型,提出通过偏置训练时间分布至高噪声状态,实现无需教师模型、蒸馏或辅助目标的单步动作生成,性能可匹配十步解码。

Comments 20 pages, 10 figures

详情
AI中文摘要

基于扩散的视觉-语言-动作(VLA)模型通常继承图像生成的观点:动作通过迭代去噪生成。我们认为VLA动作生成具有不同的条件-目标结构:策略以丰富的观测、语言和状态为条件,但仅预测紧凑的低维动作块。在这种不对称性下,强单步动作生成不一定需要为图像合成开发的先进单步方法。我们保持标准速度预测,不添加教师模型、蒸馏阶段或辅助目标;在我们的主要方案中,我们简单地将训练时间分布偏向高噪声状态。我们首先在受控的MNIST网格到序列任务中隔离效果,然后通过广泛的机器人策略实验进行测试。在标准LIBERO、LIBERO-Plus和LIBERO-Pro上,使用高噪声偏置调度训练的单步策略通常匹配相同方案下的十步解码,并且在标准LIBERO上可以超过使用均匀时间分布训练的十步策略。真实机器人双臂YAM RSS评估提供了相同采样器趋势的小样本跨架构检查。在具有30M动作头的1.4B VLM模型上,单步解码在LIBERO-Long上达到95.6%。这些结果表明,强单步VLA动作生成可以从标准扩散训练中涌现,而无需引入为图像生成开发的完整少步扩散机制。

英文摘要

Diffusion-based vision-language-action (VLA) models often inherit the image-generation view: actions are generated by iterative denoising. We argue that VLA action generation has a different condition-target structure: the policy is conditioned on rich observations, language, and state, but predicts only a compact, low-dimensional action chunk. Under this asymmetry, strong one-step action generation should not necessarily require the advanced one-step methods developed for image synthesis. We keep standard velocity prediction and add no teacher model, distillation stage, or auxiliary objective; in our main recipe, we simply bias the training time distribution toward high-noise states. We first isolate the effect in a controlled MNIST grid-to-sequence task, then test it with extensive robot-policy experiments. Across standard LIBERO, LIBERO-Plus, and LIBERO-Pro, one-step policies trained with high-noise biased schedules generally match ten-step decoding under the same recipe, and on standard LIBERO can exceed ten-step policies trained with a uniform time distribution. A real-robot bimanual YAM RSS evaluation gives a small-sample cross-architecture check of the same sampler trend. On a 1.4B VLM model with a 30M action head, one-step decoding reaches 95.6% on LIBERO-Long. These results show that strong one-step VLA action generation can emerge from standard diffusion training, without importing the full few-step diffusion machinery developed for image generation.
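
A minimal sketch of what "biasing the training time distribution toward high-noise states" could look like follows. It rests on several assumptions that the abstract does not confirm: the convention that t = 1 is pure noise and t = 0 is clean data, a power-law sampler with density proportional to t^(k-1), and the hypothetical names `sample_high_noise_time` and `training_pair`. The authors' actual schedule may differ.

```python
import random


def sample_high_noise_time(rng, sharpness=3.0):
    """Draw t in [0, 1) with density sharpness * t**(sharpness - 1).

    sharpness = 1 recovers the uniform distribution; larger values push
    samples toward t = 1, which is the high-noise end in this sketch.
    """
    return rng.random() ** (1.0 / sharpness)


def training_pair(action, noise, t):
    """Noisy input and velocity target on the linear path (assumed convention)."""
    x_t = (1.0 - t) * action + t * noise
    velocity = noise - action  # points from data toward noise
    return x_t, velocity


rng = random.Random(0)
biased = [sample_high_noise_time(rng) for _ in range(10000)]
uniform = [sample_high_noise_time(rng, sharpness=1.0) for _ in range(10000)]
mean_biased = sum(biased) / len(biased)
mean_uniform = sum(uniform) / len(uniform)

# With an exact velocity, one Euler step from x_t of size t recovers the action.
x_t, v = training_pair(action=0.3, noise=-1.2, t=0.9)
recovered = x_t - 0.9 * v
print(mean_biased, mean_uniform, recovered)
```

With sharpness 3 the expected value of t is 3/4, compared with 1/2 for uniform sampling, so most training examples come from heavily noised inputs, the regime a one-step decoder starts from.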

2606.05736 2026-06-05 cs.CV 版本更新

VTI-CoT: Visual-Textual Interleaved Chain of Thought for Video Reasoning

VTI-CoT: 用于视频推理的视觉-文本交织思维链

Shufan Zhang, Ziyue Lin, Bairun Wang, Lei Jin, Xuanding Ding, Xinzhu Ma, Kunlin Yang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) University of Hong Kong(香港大学) Beijing Shanwei Zhixing Technology Co., Ltd.(北京尚维智行科技有限公司) Tsinghua University(清华大学) Beihang University(北航)

AI总结 提出VTI-CoT框架,通过视觉-文本交织的思维链结合OCR压缩技术,提升视频推理准确性和训练效率。

Comments 25 pages, 7 figures

详情
AI中文摘要

视频推理旨在理解视频中的复杂时间事件和因果关系。最近,思维链(CoT)被引入该领域以提高推理准确性。然而,现有的基于CoT的视频推理方法主要依赖纯文本信息进行逻辑推理,忽略了推理过程中的关键视觉信息。受人类在推理过程中回顾视觉片段的认知机制启发,我们提出了VTI-CoT,一种视觉-文本交织的CoT框架。VTI-CoT将文本推理步骤与相应的视觉帧相结合。针对现有数据集中缺乏视觉-文本交织CoT的问题,我们开发了一个自动标注流程来构建高质量的多模态CoT数据。此外,对长视频进行推理需要越来越长的CoT token序列,这严重阻碍了训练收敛和效率。为了解决这个问题,我们采用基于光学字符识别(OCR)的压缩技术,将CoT监督信号压缩到单个画布上。实验结果表明,VTI-CoT在相同参数规模的模型中达到了最先进的性能,同时显著提高了训练效率。

英文摘要

Video reasoning aims to understand complex temporal events and causal relationships within videos. Recently, Chain-of-Thought (CoT) has been introduced to this field to enhance reasoning accuracy. However, existing CoT-based video reasoning methods primarily rely on text-only information for logical deduction, overlooking critical visual information during the inference process. Inspired by the human cognitive mechanism of reviewing visual segments during inference, we propose VTI-CoT, a Visual-Textual Interleaved CoT framework. VTI-CoT integrates textual reasoning steps with corresponding visual frames. Given the scarcity of visual-textual interleaved CoT in existing datasets, we develop an automated annotation pipeline to construct high-quality multimodal CoT data. Further, reasoning over long-form videos entails increasingly long CoT token sequences, which severely hinders training convergence and efficiency. To address this, we employ Optical Character Recognition (OCR)-based compression techniques to compress CoT supervision signals into a single canvas. Experimental results demonstrate that VTI-CoT achieves state-of-the-art performance among models of the same parameter scale while significantly improving training efficiency.

2606.05730 2026-06-05 cs.CV 版本更新

TextWand: A Unified Framework for Scene Text Editing

TextWand:场景文本编辑的统一框架

Shuyu Wang, Zhile Guan, Hongxiu Chen, Yule Duan, Weiqi Li, Xin Shan, Ronggang Wang, Jian Zhang

发表机构 * School of Electronic and Computer Engineering, Peking University(电子与计算机工程学院,北京大学)

AI总结 提出TextWand统一框架,通过渲染和擦除原子操作分解复杂编辑任务,结合ORPE编码和RAS策略,实现场景文本的移除、生成和替换,并在新基准TextWand-Bench上超越现有模型。

详情
AI中文摘要

我们提出TextWand,一个通用框架,将场景文本移除、生成和替换统一到单个模型中。通过将复杂的编辑任务分解为渲染和擦除的原子原语,TextWand实现了对文本外观和背景完整性的精确控制。具体来说,我们引入了一种新颖的设计——叠加参考位置编码(ORPE),以强制执行像素级布局保真度和示例驱动的风格控制,同时采用一种新策略——区域自适应抑制(RAS),以确保干净的文本擦除。为了解决现有单任务数据集中缺乏通用场景文本编辑综合基准的问题,我们构建了TextWand-Bench。大量实验表明,TextWand在场景文本移除、生成和替换任务中,通过提供更优的文本内容准确性、布局和风格一致性以及整体图像质量,超越了现有的领先开源和闭源模型。

英文摘要

We propose TextWand, a general-purpose framework that unifies scene text removal, generation, and replacement into a single model. By decomposing complex editing tasks into the atomic primitives of rendering and erasure, TextWand achieves precise control over both text appearance and background integrity. Specifically, we introduce a novel design, Overlay-Reference Positional Encoding (ORPE), to enforce pixel-level layout fidelity and exemplar-driven style control, alongside a new strategy, Region-Adaptive Suppression (RAS), to ensure clean text erasure. To address the absence of a comprehensive benchmark for general-purpose scene text editing among existing single-task datasets, we construct TextWand-Bench. Extensive experiments demonstrate that TextWand outperforms existing leading open-source and closed-source models by delivering superior text content accuracy, layout and style consistency, and overall image quality across scene text removal, generation and replacement tasks.

2606.05718 2026-06-05 cs.CV cs.AI cs.LG 版本更新

ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

ViCuR: 视觉线索作为多模态在策略蒸馏中的可恢复特权

Kanghui Tian, Siyuan Liu, Ziang Yan, Sheng Xia, Shuai Dong, Yi Wang

发表机构 * Shanghai AI Laboratory(上海人工智能实验室) Fudan University(复旦大学) Nanjing University(南京大学)

AI总结 提出ViCuR框架,通过将教师特权从答案侧替换为输入中的视觉线索,并引入轻量级线索恢复模块,解决多模态在策略蒸馏中的训练-测试不匹配问题,在七个基准上显著提升学生模型性能。

Comments 25 pages, 11 figures. Preprint, under review

详情
AI中文摘要

在策略蒸馏(OPD)通过在教师监督下,对学生自身策略采样的轨迹进行训练来改进推理。在多模态推理中,一种常见的扩展是使用特权教师,该教师观察仅在训练时可用的信号,如参考答案或理由。然而,这种答案侧特权造成了训练-测试不匹配:教师的监督可能依赖于学生无法获得的信号,鼓励捷径模仿而非基于视觉的推理。我们提出ViCuR,一种基于视觉的特权教师蒸馏框架,用视觉线索(输入中与查询相关的证据)取代答案侧特权。由于这些线索来源于推理时可用的相同视觉输入,它们的证据可由学生恢复。为此,ViCuR引入了一个轻量级线索恢复模块,在预填充期间使用专用的汇点令牌交叉注意力,将任务相关的视觉证据聚合到内部表示中,而不改变推理接口或需要辅助的线索生成损失。在七个基准上,使用Qwen3-VL-2B和8B学生,ViCuR在总体平均性能上持续优于基于答案的在策略自蒸馏,分别提升+1.19和+1.24。它还能自然地扩展到更强的教师OPD,超越OPD基线+0.64和+1.08,并在8B规模上具有一致的域外增益。这些结果表明,在多模态在策略蒸馏中,教师特权的设计与教师强度同等重要。

英文摘要

On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher's supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.

2606.05708 2026-06-05 cs.CV 版本更新

Real-Time Threat Detection from Surveillance Cameras using Machine Learning

基于机器学习的监控摄像头实时威胁检测

Gajendra Mandal, J. P. Patra, Priyansh Mahant

发表机构 * GitHub

AI总结 提出基于YOLOv8的实时目标检测框架,利用自定义钝器数据集与公开枪支刀具数据集训练模型,实现监控场景下枪支、刀具和钝器的有效检测。

详情
AI中文摘要

确保人口密集的城市环境中的公共安全仍然是一个关键挑战,需要部署智能和自动化的视频监控系统。传统的监控方法严重依赖人工监控,效率低下且容易受到人为疲劳、响应延迟和观察错误的影响。为了克服这些限制,本文提出了一种基于实时目标检测的监控框架。该系统专注于检测枪支、刀具以及印度监控场景中常见于暴力活动的区域特定钝器。本文的一个关键贡献是使用移动相机收集的自定义数据集,包含336张标记的钝器图像,如铁棒、木棍和塑料棒。该数据集与公开的7,623张枪支和刀具图像数据集合并,形成包含7,959张图像、三个类别(枪、刀、钝器)的合并数据集。使用该合并数据集训练基于YOLOv8的目标检测模型以实现实时性能。实验评估表明,增加训练时长显著提高了钝器类别的召回率和平均精度,且未出现过拟合迹象。总体而言,所提出的框架在准确性和效率之间取得了有效平衡,使其适用于校园、公共空间和交通区域等真实监控环境中的部署。

英文摘要

Ensuring public safety in densely populated urban environments remains a critical challenge, necessitating the deployment of intelligent and automated video surveillance systems. Traditional surveillance approaches rely heavily on manual monitoring, which is inefficient and susceptible to human fatigue, delayed response, and observational errors. To overcome these limitations, this work presents a real-time object detection-based surveillance framework. The proposed system focuses on detecting guns, knives, and region-specific blunt objects commonly involved in violent activities in Indian surveillance scenarios. A key contribution of this work is the use of a custom-created dataset collected using a mobile camera, consisting of 336 labeled images of blunt objects such as iron rods, wooden sticks, and plastic rods. This dataset is combined with a publicly available dataset of 7,623 images of guns and knives, forming a consolidated dataset of 7,959 images across three classes: gun, knife, and blunt object. The combined dataset is used to train a YOLOv8-based object detection model for real-time performance. Experimental evaluation shows that increasing the training duration significantly improves recall and average precision for the blunt object class without signs of overfitting. Overall, the proposed framework achieves an effective balance between accuracy and efficiency, making it suitable for deployment in real-world surveillance environments such as campuses, public spaces, and transportation areas.

2606.05703 2026-06-05 cs.CV 版本更新

Parallel Jacobi Decoding for Fast Autoregressive Image Generation

并行雅可比解码用于快速自回归图像生成

Boya Liao, Ying Li, Siyong Jian, Huan Wang

发表机构 * Westlake University(西湖大学)

AI总结 提出并行雅可比解码(PJD),通过二维空间域扩展草稿令牌并调整注意力掩码,实现无需训练的自回归图像生成加速,在保持生成质量的同时获得4.8倍至6.4倍加速。

Comments Accepted by CVPR 2026

详情
AI中文摘要

自回归(AR)模型在生成高保真图像方面表现出色。然而,其固有的顺序逐令牌预测导致推理速度显著变慢。最近的研究引入了雅可比式解码来加速自回归图像生成。初始扩展草稿序列提高了效率,但由于一维序列中的错误传播阻碍收敛,加速很快饱和。观察到图像表现出强烈的局部空间相关性,我们提出了并行雅可比解码(PJD),一种无需训练的解码方法,在二维空间域中扩展草稿令牌以实现高效的空间并行细化。PJD调整注意力掩码以减轻错误累积并提高收敛稳定性。在多个数据集上的大量实验表明,PJD在多种自回归图像生成模型上实现了4.8倍至6.4倍的加速,同时保持了具有竞争力的生成质量。

英文摘要

Autoregressive (AR) models have demonstrated remarkable performance in generating high-fidelity images. However, their inherently sequential next-token prediction leads to significantly slower inference. Recent studies have introduced Jacobi-style decoding to accelerate autoregressive image generation. Extending the draft sequence initially improves efficiency, yet the acceleration quickly saturates as error propagation in the one-dimensional sequence hinders convergence. Observing that images exhibit strong local spatial correlations, we propose Parallel Jacobi Decoding (PJD), a training-free decoding approach that expands draft tokens in the two-dimensional spatial domain to enable efficient spatially parallel refinement. PJD adjusts the attention mask to mitigate error accumulation and improve convergence stability. Extensive experiments on diverse datasets show that PJD achieves 4.8x-6.4x acceleration across multiple autoregressive image generation models while maintaining competitive generation quality.
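
For readers unfamiliar with Jacobi-style decoding, the one-dimensional baseline that PJD builds on can be sketched as a fixed-point iteration: all draft positions are recomputed in parallel from the previous draft until nothing changes. This is a toy illustration only; `next_token` is a hypothetical deterministic stand-in for a greedy autoregressive model, and the sketch does not implement PJD's two-dimensional draft expansion or its attention-mask adjustment.

```python
def next_token(prefix):
    """Hypothetical deterministic 'model': greedy next token given a prefix."""
    return (sum(prefix) * 31 + 7 * len(prefix)) % 17


def sequential_decode(prompt, n):
    """Ordinary autoregressive decoding: n dependent steps."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt, n):
    """Jacobi decoding: refine a whole draft in parallel until a fixed point."""
    draft = [0] * n  # arbitrary initial guess
    iterations = 0
    while True:
        iterations += 1
        # Every position is recomputed from the previous draft; with a real
        # model this would be a single batched forward pass.
        new = [next_token(list(prompt) + draft[:i]) for i in range(n)]
        if new == draft:
            return draft, iterations
        draft = new


prompt = [3, 1, 4]
tokens, iterations = jacobi_decode(prompt, 8)
print(tokens, iterations)
```

The fixed point is exactly the greedy sequential output, and it is reached in at most n + 1 parallel iterations because each iteration makes at least one additional leading position correct. Any speedup comes from cases where several positions converge in the same iteration.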

2606.05702 2026-06-05 cs.AI cs.CV 版本更新

Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

Seeing Time: 视觉-语言模型中的时间顺序推理与捷径偏差基准测试

Haoyu Zhou, Qing Qing, Caichong Li, Qixin Zhang, Yongcheng Jing, Ziqi Xu, Juncheng Hu, Xikun Zhang, Renqiang Luo

发表机构 * College of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) School of Computer Science, Wuhan University(武汉大学计算机学院) School of Computing Technologies, RMIT University(皇家墨尔本理工学院计算技术学院)

AI总结 本文提出一个新基准,通过三个专门数据集评估视觉-语言模型在图像内和跨图像的时间顺序推理能力,并揭示模型常利用颜色等表面线索而非真正时间特征。

详情
AI中文摘要

近期视觉-语言模型(VLM)在解释复杂视觉语义方面取得了显著进展,但其时间顺序推理能力仍未得到充分探索。本文引入了一个新颖的基准,专门用于评估VLM如何感知和推理图像内及跨图像的时间顺序信息。与现有基于视频的基准(侧重于帧序列)不同,我们的工作深入探讨了时间判断的基本逻辑以及向多模态集成的扩展。为此,我们构建了三个专门数据集:一个包含跨越长时间历史周期的视觉相似物体,另一个按不同事件和物体类型分类,第三个将图像与时间敏感的新闻文本配对以实现跨模态对齐。通过大量实验,我们分析了模型是否在不同类别间表现出性能差异,并关键地探讨了它们是否依赖“错误捷径”(如图像颜色而非真正的时间特征)。我们的结果表明,尽管VLM显示出潜力,但它们经常利用灰度与彩色滤镜等表面线索来绕过真正的时间顺序推理。通过提供这些高质量数据集和严格的评估框架,我们提供了一个诊断工具,用于识别当前局限性并指导开发更稳健、逻辑更严密的多模态模型。源代码见 https://github.com/LuoRenqiang/ChronoVision。

英文摘要

Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video-based benchmarks that focus on frame sequencing, our work delves into the underlying logic of chronological judgment and the expansion toward multimodal integration. To facilitate this, we construct three specialized datasets: one containing visually similar objects spanning long historical durations, another categorized by diverse event and object types, and a third pairing images with time-sensitive news text for cross-modal alignment. Through extensive experiments, we analyze whether models exhibit performance disparities across categories and, crucially, explore whether they rely on "incorrect shortcuts", such as image color rather than genuine chronological features. Our results reveal that while VLMs show promise, they frequently exploit superficial cues like grayscale versus color filters to bypass authentic chronological reasoning. By providing these high-quality datasets and a rigorous evaluation framework, we offer a diagnostic tool to identify current limitations and guide the development of more robust, logically grounded multimodal models. The source code is available at https://github.com/LuoRenqiang/ChronoVision.

2606.05700 2026-06-05 cs.CV cs.LG 版本更新

T-SAR-JEPA: Self-Supervised Temporal Anomaly Detection in SAR Amplitude Stacks via Latent Prediction

T-SAR-JEPA:通过潜在预测在SAR幅度堆栈中进行自监督时间异常检测

Kerod Woldesenbet, Abem Woldesenbet

发表机构 * Independent Researcher(独立研究者) Dakota State University(达科塔州立大学)

AI总结 提出T-SAR-JEPA框架,通过自监督潜在预测在SAR幅度堆栈中检测时间异常,在DFC 2026数据集上达到77.0%的ROC-AUC,优于多种基线方法。

Comments Won IEEE GRSS Data Fusion Contest 2026; to appear in IGARSS 2026 proceedings

详情
AI中文摘要

我们提出了T-SAR-JEPA,一个通过潜在预测在SAR幅度堆栈中进行时间异常检测的自监督框架。来自SAR-JEPA的ViT-Base/16编码器在39,300个Capella图像块上通过局部掩码重建和梯度特征预测进行领域自适应。一个带有正弦时间编码的时间Transformer从K=7次采集中预测未来潜在状态,渐进式解冻显著降低了验证损失。该模型仅基于幅度操作;InSAR相干性仅作为独立的伪真实标签。在DFC 2026数据集(300个时间序列,三个感兴趣区域)上,T-SAR-JEPA在夏威夷喷发窗口上实现了77.0%的ROC-AUC,优于RX、PaDiM、线性AR和LSTM基线(约50%)。99.9%的空间一致性(p < 0.001,置换检验)确认了结构化检测。代码:https://github.com/TerraLatent/t-sar-jepa

英文摘要

We present T-SAR-JEPA, a self-supervised framework for temporal anomaly detection in SAR amplitude stacks via latent prediction. A ViT-Base/16 encoder from SAR-JEPA is domain-adapted on 39,300 Capella patches using local masked reconstruction with gradient feature prediction. A temporal transformer with sinusoidal time encoding forecasts future latent states from K=7 acquisitions, with progressive unfreezing substantially reducing validation loss. The model operates on amplitude alone; InSAR coherence serves exclusively as independent pseudo-ground-truth. On the DFC 2026 dataset (300 time-series, three AOIs), T-SAR-JEPA achieves ROC-AUC of 77.0% on the Hawaii eruption window, outperforming RX, PaDiM, Linear AR, and LSTM baselines (~50%). Spatial coherence of 99.9% (p < 0.001, permutation test) confirms structured detections. Code: https://github.com/TerraLatent/t-sar-jepa
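
To make the reported metric concrete, ROC-AUC can be computed as the probability that a randomly chosen anomalous sample receives a higher score than a randomly chosen normal one, with ties counted as one half. The sketch below is a plain-Python illustration with made-up scores; `roc_auc` is a hypothetical helper, not code from the linked repository.

```python
def roc_auc(scores, labels):
    """Pairwise (Mann-Whitney) form of ROC-AUC for binary labels 0/1."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
informative = roc_auc([0.1, 0.4, 0.35, 0.8], labels)    # mostly ranks anomalies higher
uninformative = roc_auc([0.5, 0.5, 0.5, 0.5], labels)   # constant score
print(informative, uninformative)
```

A constant or random scorer lands at 0.5, which is why baselines near 50% in the abstract correspond to chance-level detection.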

2606.05677 2026-06-05 cs.CV cs.AI cs.CL 版本更新

LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

LongSpace: 从感知到回忆的视频长程空间记忆探索

Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, Yuanteng Chen, Tao Liu, Lan Yang, Longteng Guo, Honggang Zhang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Zhongguancun Academy(中关村学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) The Chinese University of Hong Kong(香港中文大学) Xi’an Jiaotong University(西安交通大学)

AI总结 针对长视频中空间记忆的挑战,提出LongSpace框架,通过分块建模、3D结构线索注入和层级感知记忆实现长程空间推理,并在LongSpace-Bench等基准上验证其有效性。

详情
AI中文摘要

多模态大语言模型(MLLMs)在图像和视频理解方面取得了进展,并且能够处理更长的视觉输入。自动驾驶和机器人导航等长程任务不仅需要识别当前视图,模型还必须记住并检索之前观察到的空间布局、路线、视角变化和物体状态。为了评估这一能力,我们引入了LongSpace-Bench,一个用于长程空间记忆的房间导览视频基准,涵盖场景感知、空间关系和空间记忆。在这项工作中,我们进一步提出了LongSpace,一个用于长视频空间推理的记忆框架。LongSpace将长视频建模为连续的块,将3D结构线索注入早期解码器层,并构建层级感知记忆以进行问题引导的检索。在多个空间推理基准上的实验表明,LongSpace改善了长视频空间理解,进一步证明了显式空间记忆是长程视频MLLMs的关键能力。

英文摘要

Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.

2606.05675 2026-06-05 cs.LG cs.CV 版本更新

Two-Way Is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning

双向优于单向:基于循环一致性的双向对齐用于无样本类增量学习

Hongye Xu, Bartosz Krawczyk

发表机构 * Chester F. Carlson Center for Imaging Science(切斯特·F·卡尔森影像科学中心) Rochester Institute of Technology(罗切斯特理工学院)

AI总结 提出BiCyc方法,通过双向投影器对齐和循环一致性目标,解决无样本类增量学习中原型漂移和单向投影偏差问题,减少灾难性遗忘并提升准确率。

Comments Published as a conference paper at ICLR 2026. 23 pages, 8 figures. Code: https://github.com/HXuSz11/BiCyc_ICLR2026

详情
AI中文摘要

持续学习(CL)旨在使模型在不遗忘先前知识的情况下获取新技能。在无样本类增量学习(EFCIL)中,由于无法存储过去数据,这一挑战被放大,旧类的表示漂移尤其有害。基于原型的EFCIL因其高效性而具有吸引力,但随着嵌入空间的演化,原型会发生漂移;因此,基于投影的漂移补偿已成为一种流行的补救措施。然而,我们表明,现有的单向投影引入了系统性偏差:它们要么追溯性地扭曲当前特征几何结构,要么仅局部对齐旧类,导致跨任务累积的循环不一致性。我们提出BiCyc,一种具有循环一致性目标的双向投影器对齐方法。BiCyc联合优化两个映射(旧到新和新到旧),并采用停止梯度门控,使得传输和表示共同演化。分析表明,循环损失在白化空间中将奇异谱向单位值收缩,并且类均值和协方差的改进传输导致分类对数几率扰动更小,从而保留旧类决策并减轻灾难性遗忘。实验上,在标准EFCIL基准测试中,BiCyc显著减少了遗忘并提高了从头开始设置下的准确率,同时在预训练细粒度场景中保持竞争力。

英文摘要

Continual learning (CL) seeks models that acquire new skills without erasing prior knowledge. In exemplar-free class-incremental learning (EFCIL), this challenge is amplified because past data cannot be stored, making representation drift for old classes particularly harmful. Prototype-based EFCIL is attractive for its efficiency, yet prototypes drift as the embedding space evolves; therefore, projection-based drift compensation has become a popular remedy. We show, however, that existing one-directional projections introduce systematic bias: they either retroactively distort the current feature geometry or align past classes only locally, leaving cycle inconsistencies that accumulate across tasks. We introduce BiCyc, a bidirectional projector alignment approach with a cycle-consistency objective. BiCyc jointly optimizes two maps, old-to-new and new-to-old, with stop-gradient gating so that transport and representation co-evolve. Analytically, we show that the cycle loss contracts the singular spectrum toward unity in whitened space, and that improved transport of class means and covariances yields smaller perturbations of classification log-odds, preserving old-class decisions and mitigating catastrophic forgetting. Empirically, across standard EFCIL benchmarks, BiCyc substantially reduces forgetting and improves accuracy in from-scratch settings, while remaining competitive in the pretrained fine-grained regime.
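
The cycle-consistency idea can be illustrated with two small linear maps: features sent old-to-new and back again should return to where they started. The sketch below is a toy under stated assumptions: 2x2 matrices stand in for the projectors, only the old-to-new-to-old direction is shown, and the stop-gradient gating and joint optimization described in the abstract are omitted. The names `matvec` and `cycle_loss` are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def cycle_loss(old_to_new, new_to_old, features):
    """Mean squared error between each feature and its round trip."""
    total = 0.0
    for f in features:
        round_trip = matvec(new_to_old, matvec(old_to_new, f))
        total += sum((r - x) ** 2 for r, x in zip(round_trip, f))
    return total / len(features)


old_to_new = [[2.0, 0.0], [0.0, 0.5]]
exact_inverse = [[0.5, 0.0], [0.0, 2.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
features = [[1.0, 2.0], [3.0, -1.0]]

consistent = cycle_loss(old_to_new, exact_inverse, features)
inconsistent = cycle_loss(old_to_new, identity, features)
print(consistent, inconsistent)
```

When the backward map inverts the forward map the loss vanishes; any mismatch between the two directions shows up as a positive penalty, which is the signal a cycle-consistency objective minimizes.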

2606.05665 2026-06-05 cs.CV 版本更新

V2V-Bench: A Comprehensive Benchmark for Video-to-Video Generation Evaluation

V2V-Bench:视频到视频生成评估的综合基准

Tao Liu, Leela Krishna, Gouti Pavan Kumar, Sreeja K, Vishav Garg

发表机构 * Centific Global Solutions Inc.(Centific全球解决方案公司)

AI总结 针对视频到视频生成评估中现有指标无法同时衡量编辑指令遵循和帧级对应的问题,提出包含11个维度、5个类别的V2V-Bench基准,评估三个模型并验证其与人类判断高度相关。

Comments Accepted at ICML 2026 workshop

详情
AI中文摘要

视频到视频(V2V)生成难以评估,因为输出必须同时遵循编辑指令并保持与源视频的帧级对应,而现有的T2V和I2V指标无法捕捉这一点。我们引入了V2V-Bench,一个包含11个维度的基准,分为五个类别:时间对齐、结构保真度、变换质量、视频质量和语义对齐。V2V-Bench将多样化的源视频与具有挑战性的编辑任务配对,并评估了两个商业模型Grok Imagine和Gemini Veo3,以及一个开源模型Open Sora 2。结果显示模型优势互补:Grok在编辑保真度上表现更好,而Veo3在视觉质量上更强。在六个V2V特定维度上,V2V-Bench与人类判断的Spearman相关系数达到0.905。

英文摘要

Video-to-video (V2V) generation is difficult to evaluate because outputs must both follow editing instructions and preserve frame-level correspondence with the source video, which existing T2V and I2V metrics do not capture. We introduce V2V-Bench, an 11-dimension benchmark organized into five categories: temporal alignment, structural fidelity, transformation quality, video quality, and semantic alignment. V2V-Bench pairs diverse source videos with challenging editing tasks and evaluates two commercial models, Grok Imagine and Gemini Veo3, and one open-source model, Open Sora 2. Results show complementary model strengths: Grok performs better on editing fidelity, while Veo3 achieves stronger visual quality. On six V2V-specific dimensions, V2V-Bench reaches a Spearman correlation of 0.905 with human judgments.
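
The agreement figure above is a Spearman rank correlation, which compares the ordering of metric scores with the ordering of human ratings rather than their raw values. A minimal standard-library sketch follows; the score lists are synthetic placeholders, not benchmark data, and `average_ranks` and `spearman` are hypothetical helper names.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)


metric_scores = [1.0, 2.0, 3.0, 4.0, 5.0]   # synthetic automatic scores
human_scores = [2.0, 1.0, 4.0, 3.0, 5.0]    # synthetic human ratings
rho = spearman(metric_scores, human_scores)
print(rho)
```

Because only ranks matter, any monotonic rescaling of the automatic scores leaves the coefficient unchanged.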

2606.05652 2026-06-05 cs.CV 版本更新

CoFi-UCGen: Coarse-to-Fine Unsupervised Conditional Generation without Label Priors

CoFi-UCGen:无标签先验的粗到细无监督条件生成

Shengxi Li, Zhaokun Hu, Ce Zheng, Mai Xu, Jingyuan Xia, Si Liu

发表机构 * Department of Electronic Information Engineering, Beihang University(电子信息工程系,北航) School of Cyber Science and Technology, Beihang University(网络安全科学与技术学院,北航) College of Electronic Science, National University of Defense Technology(电子科学学院,国防科技大学) Institute of Artificial Intelligence, Beihang University(人工智能研究院,北航)

AI总结 提出粗到细的无监督条件生成框架CoFi-UCGen,通过对抗语义互学习理论和位编码实现无标签条件下的全局与细粒度语义解耦,并利用扩散模型层次调制机制控制生成。

详情
AI中文摘要

无监督条件图像生成(UCGen)旨在不依赖人工标注标签的情况下控制生成,但由于跨粒度的非结构化语义表示而仍然具有挑战性。为了解决这个问题,我们提出了一种新颖的粗到细UCGen框架(CoFi-UCGen),该框架明确地将全局语义与细粒度变化解耦,据我们所知,这是首次在没有任何标签的情况下成功实现粗粒度和细粒度条件生成。具体来说,我们首先提出对抗语义互学习理论,以确保图像和潜在空间之间的语义一致性和完整性。基于这种一致性,我们提出位编码来学习结构化的粗粒度潜在空间,并进一步证明从我们的位编码中继承的独特全局语义,同时保留用于生成的独立噪声采样。在这些位编码的基础上,我们建立了细粒度语义基础,并在扩散模型中引入了层次调制机制,通过从粗条件逐层注入,在生成过程中逐步控制细粒度属性。大量实验表明,在没有任何标签先验或预训练特征提取器的情况下,我们的CoFi-UCGen在图像质量、语义一致性和控制准确性方面始终优于现有的UCGen方法,验证了显式粗到细语义分解对于具有挑战性的UCGen任务的有效性。

英文摘要

Unsupervised conditional image generation (UCGen) aims to control generation without relying on manually annotated labels, yet remains challenging due to unstructured semantic representations across granularities. To address this, we propose a novel coarse-to-fine UCGen framework (CoFi-UCGen) that explicitly disentangles global semantics from fine-grained variations and which, to the best of our knowledge, is the first successful attempt at both coarse- and fine-grained conditional generation without any labels. More specifically, we first propose the adversarial semantic reciprocal learning theory to ensure semantic consistency and completeness between images and latent spaces. Based on this consistency, we propose bit-codes to learn a structured coarse-grained latent space, and further prove that distinct global semantics are inherited from our bit-codes while independent noise sampling is preserved for generation. Building upon these bit-codes, we establish a fine-grained semantic basis and introduce a hierarchical modulation mechanism in diffusion models, which injects coarse conditions layer by layer to progressively control fine-grained attributes during generation. Extensive experiments demonstrate that, without any label priors or pre-trained feature extractors, CoFi-UCGen consistently outperforms existing UCGen methods in terms of image quality, semantic consistency, and control accuracy, verifying the effectiveness of explicit coarse-to-fine semantic decomposition for the challenging UCGen task.

2606.05650 2026-06-05 cs.MM cs.CV cs.GR cs.NI 版本更新

GS-NFS: Bandwidth-adaptive Streaming of Dynamic Gaussian Splats and Point Clouds

GS-NFS: 动态高斯溅射和点云的带宽自适应流传输

Rajrup Ghosh, Haodong Wang, Haoran Hong, Eduardo Pavez, Amartya Chaudhuri, Weiwu Pang, Harsha V. Madhyastha, Antonio Ortega, Ramesh Govindan

发表机构 * University of Southern California(南加州大学)

AI总结 提出GS-NFS方法,通过GPU并行加速动态3DGS帧的编解码,实现全帧率运行,速度比现有技术快1-2个数量级,同时保持竞争性的压缩性能和渲染质量。

详情
AI中文摘要

动态3D高斯溅射(3DGS)作为一种3D视频流技术具有很大前景,因为它能够以高保真度表示复杂的3D场景。在该方法中,3D视频的每一帧将环境表示为一组高斯体,每个高斯体具有位置以及其他属性,如尺度、旋转、不透明度和颜色。帧捕捉了精细细节,允许从任意视角观看,但数据量比2D视频帧大一个数量级或更多。最近的一系列工作探索了如何压缩动态3DGS帧,但这些方法通常较慢,部分原因是它们的压缩技术不适合高效加速。GS-NFS在GPU上加速动态3DGS的压缩和解压缩,达到能够以全帧率编码和解码的程度。它通过开发基于GPU的新型并行化方法,对现有的高斯位置和属性编码算法进行并行化来实现这一点。因此,它在编码和解码一帧时比现有技术快1-2个数量级,同时提供具有竞争力的压缩性能和渲染质量。

英文摘要

Dynamic 3D Gaussian Splatting (3DGS) holds great promise as a 3D video streaming technology since it can represent complex 3D scenes with high fidelity. In this approach, every frame in a 3D video represents the environment as a collection of Gaussians with position and other attributes such as scale, rotation, opacity, and color. Frames capture fine details and permit viewing from arbitrary perspectives, but they are an order of magnitude or more larger than 2D video frames. A line of recent work has explored how to compress dynamic 3DGS frames, but these approaches are often slow, in part because their compression techniques are not amenable to efficient acceleration. GS-NFS accelerates dynamic 3DGS compression and decompression on a GPU, to the point where it can encode and decode at full frame rate. It achieves this by developing novel GPU-based parallelizations of existing algorithms for encoding both positions and attributes of Gaussians. As a result, it is 1-2 orders of magnitude faster than the state-of-the-art in encoding and decoding a frame, while offering competitive compression performance and rendering quality.

2606.05641 2026-06-05 cs.CV 版本更新

Multi-Task Crack Foundation Model for Engineering-Reliable Crack Representation and Topology Preservation in Civil Infrastructure

面向工程可靠裂缝表示与拓扑保持的土木基础设施多任务裂缝基础模型

Blessing Agyei Kyem, Joshua Kofi Asamoah, Eugene Denteh, Armstrong Aboah

发表机构 * North Dakota State University(北达科他州立大学)

AI总结 提出 CrackGeoFM 多任务框架,结合冻结视觉基础骨干与裂缝专用适配模块,实现掩码预测、骨架重建和不确定性估计,在20个数据集上达到最优分割、拓扑保持和校准不确定性。

Comments 60 pages, 17 figures, 11 tables

详情
AI中文摘要

可靠的裂缝评估不仅需要准确的像素级掩码,还需要在域偏移下保持稳定的连通裂缝几何形状和置信度估计。然而,现有的分割模型在实现高重叠分数的同时,可能会使裂缝碎片化、遗漏细小分支,并且无法提供校准的不确定性。为了解决这一问题,本文提出了 CrackGeoFM,一个多任务框架,它将冻结的视觉基础骨干与裂缝专用适配相结合,用于掩码预测、骨架重建和不确定性估计。该框架集成了频率引导的裂缝增强模块(FCEM)以增强高频裂缝线索,裂缝域特征适配模块(CFAM)以将冻结骨干特征适配到裂缝域模式,以及结构感知多任务解码器(SMTD)以联合解码掩码、骨架和不确定性。在20个裂缝数据集上,CrackGeoFM 实现了最先进的分割性能、改进的拓扑保持、校准的不确定性以及仅需五张标注图像的有效少样本适应。这些结果支持可靠、可泛化且面向工程的裂缝分析,用于基础设施评估。

英文摘要

Reliable crack assessment requires not only accurate pixel-level masks but also connected crack geometry and confidence estimates that remain stable under domain shift. However, existing segmentation models can achieve high overlap scores while fragmenting cracks, missing fine branches, and providing no calibrated uncertainty. To address this gap, this paper proposes CrackGeoFM, a multi-task framework that combines a frozen visual foundation backbone with crack-specific adaptation for mask prediction, skeleton reconstruction, and uncertainty estimation. The framework integrates a Frequency-Guided Crack Enhancement Module (FCEM) to enhance high-frequency crack cues, a Crack-Domain Feature Adaptation Module (CFAM) to adapt frozen backbone features to crack-domain patterns, and a Structure-Aware Multi-Task Decoder (SMTD) to jointly decode masks, skeletons, and uncertainty. Across 20 crack datasets, CrackGeoFM achieves state-of-the-art segmentation, improved topology preservation, calibrated uncertainty, and effective few-shot adaptation with only five labeled images. These results support reliable, generalizable, and engineering-oriented crack analysis for infrastructure assessment.

2606.05635 2026-06-05 cs.CV cs.MM 版本更新

ShotCrop$^3$: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions

ShotCrop$^3$:将人物中心图像裁剪为电影级三镜头构图

Dehong Kong, Lina Lei, Lingtao Zheng, Chenyang Wu, Ailing Zhang, Xinran Qin, Teng Ma, Jiaqi Xu, Zhixin Wang, Zhikai Chen, Xuecheng Qi, Renjing Pei, Fan Li

发表机构 * Huawei Noah’s Ark Lab(华为诺亚实验室) Sun Yat-sen University(中山大学)

AI总结 提出三镜头构图任务,通过三阶段训练流程(思维链微调、半监督微调和组相对策略优化)从单张人物中心图像生成远景、中景和特写三张裁剪图,并附带简短描述,以支持视觉叙事。

详情
AI中文摘要

先前关于美学构图的工作通常产生单一美观的裁剪,忽略了从一个场景中构图多个镜头的叙事价值。在实践中,多镜头构图对于下游创意工作流程至关重要:商业海报通常需要不同重点(例如,背景、主体和情感/产品细节)的多个裁剪来呈现关键故事节拍。因此,我们提出了三镜头构图(TSC),这是一个构图任务,从单张人物中心图像生成一个三镜头集(远景、中景和特写),每个镜头都配有简短的镜头描述以支持视觉叙事。为了在有限的专家标注下学习TSC,我们引入了ShotCrop,它经历了一个三阶段训练过程:首先应用思维链监督微调以建立基本推理和美学裁剪技能,然后使用高置信度伪标签进行半监督微调以进一步增强美学能力,最后通过针对ShotCrop的组相对策略优化(GRPO-S)进行优化,使用为其定制的复合奖励。具体来说,我们的伪标签策略结合了基于MLLM的评分、美学评估和CLIP相似度,以保留高置信度的训练信号。此外,我们提出了TSC-Bench,一个包含1.2k个专家标注测试用例的基准。值得注意的是,ShotCrop在镜头定位准确率上比GPT-5平均提高了2.82倍。

英文摘要

Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats. Therefore, we propose Triple-Shot Compositions (TSC), a composition task that generates a three-shot set (establishing, medium, and close-up) from a single human-centric image, each paired with a brief shot description to support visual narration. To learn TSC with limited expert annotations, we introduce ShotCrop, which undergoes a three-stage training process: it first applies Chain-of-Thought supervised fine-tuning to establish basic reasoning and aesthetic shot-cropping skills, then performs semi-supervised fine-tuning with high-confidence pseudo labels to further enhance aesthetic capability, and is finally optimized with Group Relative Policy Optimization for ShotCrop (GRPO-S) using a composite reward tailored to it. Specifically, our pseudo-labeling strategy combines MLLM-based scoring, aesthetic assessment, and CLIP similarity to retain high-confidence training signals. In addition, we present TSC-Bench, a benchmark of 1.2k expert-annotated test cases. Notably, ShotCrop achieves an average improvement of 2.82 times over GPT-5 in shot localization accuracy.
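
The abstract reports "shot localization accuracy" without defining it. As a rough illustration only, the sketch below scores a predicted triple of crops against reference crops with box IoU; the IoU criterion, the 0.5 threshold, the `(x1, y1, x2, y2)` box layout, and the dictionary keys are all assumptions, not details from the paper.

```python
# Hypothetical scoring of a three-shot crop set; not the paper's metric.
SHOTS = ("establishing", "medium", "close_up")  # assumed key names


def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes in pixels."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)


def triple_shot_accuracy(predicted, reference, threshold=0.5):
    """Fraction of the three shots whose crop overlaps the reference enough."""
    hits = sum(iou(predicted[s], reference[s]) >= threshold for s in SHOTS)
    return hits / len(SHOTS)
```

A call such as `triple_shot_accuracy(pred, ref)` takes two dictionaries keyed by the three shot names, each holding one crop box.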

2606.05624 2026-06-05 cs.CV cs.GR 版本更新

KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion

KV-Control: 用于轨迹控制文本到运动的参数高效K/V注入

Tengjiao Sun, Pengcheng Fang, Xiaoyu Zhan, Yanwen Guo, Dongjie Fu, Xiaohao Cai, Hansung Kim

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 提出KV-Control,一种紧凑的注意力侧控制接口,通过部分标记化运动基元和轨迹编码器注入键/值记忆,实现精确的轨迹控制而不覆盖预训练的文本条件运动先验。

详情
AI中文摘要

文本条件3D人体运动模型现在可以从提示中合成合理的运动,但实际动画和具身代理工作流程很少止步于文本:角色可能需要遵循草绘的根路径,达到末端执行器目标,或满足多关节轨迹,同时保持语言描述的步态、风格和意图。这暴露了一个控制权衡。轨迹控制器应该精确而不覆盖预训练的文本条件运动先验,但现有解决方案要么复制生成器的大部分以重新获得每层控制访问,要么将大部分成本转移到测试时优化。我们引入KV-Control,一种用于冻结掩码文本到运动变换器的紧凑注意力侧控制接口。关键思想是将几何约束作为自注意力中的记忆提供,而不是通过全局姿态标记注入或仅在输出侧强制执行。为了支持该接口,我们共同设计了部分标记化的运动基元和控制器:PartVQ学习解剖对齐的部分码本,T-Concat将每个帧-部分标记暴露为注意力可寻址站点,KV-Control在每个自注意力层注入控制条件的键/值记忆,同时保留预训练的查询流、文本交叉注意力、FFN和所有骨干权重。生成的适配器仅在共享轨迹编码器之上添加可训练的注入参数,但在继承的细化协议下以亚厘米精度跟踪根和多关节约束,同时保留文本条件的运动质量。KV-Control将轨迹条件重新定义为轻量级记忆检索,为文本到运动生成提供了一个小型、精确且透明的控制接口。

英文摘要

Text-conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied-agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end-effector target, or satisfy a multi-joint trajectory while still preserving the gait, style, and intent described by language. This exposes a control trade-off. A trajectory controller should be precise without overwriting the pretrained text-conditioned motion prior, yet existing solutions either duplicate large portions of the generator to regain per-layer control access or move much of the cost to test-time optimization. We introduce KV-Control, a compact attention-side control interface for frozen masked text-to-motion transformers. The key idea is to make geometric constraints available as memory inside self-attention rather than injecting them through a global pose token or enforcing them only at the output side. To support this interface, we co-design a part-tokenized motion substrate and controller: PartVQ learns anatomy-aligned part codebooks, T-Concat exposes each frame-part token as an attention-addressable site, and KV-Control injects control-conditioned key/value memories at every self-attention layer while preserving the pretrained query stream, text cross-attention, FFN, and all backbone weights. The resulting adapter adds only trainable injection parameters atop a shared trajectory encoder, yet tracks root and multi-joint constraints with sub-centimeter accuracy under the inherited refinement protocol while retaining text-conditioned motion quality. KV-Control reframes trajectory conditioning as lightweight memory retrieval, providing a small, precise, and transparent control interface for text-to-motion generation.
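
The idea of exposing constraints as extra key/value memory inside self-attention can be sketched in a few lines. This is a minimal single-head illustration under stated assumptions: the function names, the plain-list tensors, and simple concatenation of control rows are hypothetical, and the paper's actual injection mechanism may differ.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Single-head scaled dot-product attention over plain Python lists."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def attend_with_control(queries, keys, values, ctrl_keys, ctrl_values):
    """Append control-derived K/V rows (assumed to come from a trajectory
    encoder); the queries are passed through unchanged."""
    return attend(queries, keys + ctrl_keys, values + ctrl_values)
```

With empty control memory the result equals ordinary attention, which mirrors the abstract's point that the pretrained query stream and backbone stay untouched.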

2606.05611 2026-06-05 cs.CV 版本更新

What's Under the Skin? Estimating Swine Body Condition

皮肤之下是什么?估算猪体况

Mk Bashar, Kuljit Bhatti, Gary Rohrer, Madonna Benjamin, Tami Brown-Brandl, Daniel Morris

AI总结 提出PigFormer系统,利用RGB-D深度图像通过两阶段流程(几何前端和切片注意力编码器)预测猪的皮下背膘厚度、腰肌深度和总组织厚度,实现非接触式体况监测。

详情
AI中文摘要

母猪体况是养殖者的重要指标,因为它对泌乳性能和仔猪存活率有很大影响。然而,生产中使用的体况测量方法(如视觉评分和卡尺)与底层组织成分的相关性较差。超声波扫描可以直接测量皮下背膘厚度和腰肌深度,但操作劳动密集且无法规模化生产。我们提出了PigFormer,一个端到端的两阶段系统,它从天花板安装的RGB-D相机获取原始深度帧,并预测最后肋骨处的皮下背膘厚度、腰肌深度和总组织厚度。第一阶段是几何前端,通过SAM3-to-MaskDINO分割蒸馏、地平面去除和方向归一化将原始深度转换为标准化高度图。第二阶段是切片注意力编码器,将每个高度图视为一系列横截面切片,并捕捉沿整个背侧表面的空间关系。在两个设施的多站点数据集(319头母猪和小母猪实例)上,PigFormer实现了2.43毫米的背膘平均绝对误差和3.87毫米的整体平均绝对误差。它优于强大的单阶段ResNet-18和ViT-small基线。PigFormer为商业养猪生产中实现连续、自动化、非接触式体况监测提供了一条实用途径。代码可在https://github.com/iambashar/Pigformer获取。

英文摘要

Sow body condition is an important indicator for growers as it has a large impact on lactation performance and piglet survival. However, body condition measures used during production, such as visual scoring and calipers, correlate poorly with underlying tissue composition. Ultrasound scans can provide direct measurements of subcutaneous backfat thickness and loin muscle depth, but their operation is labor intensive and not scalable for production. We present PigFormer, an end-to-end two-stage system that takes raw depth frames from a ceiling-mounted RGB-D camera and predicts subcutaneous backfat thickness, loin muscle depth, and total tissue thickness at the last rib. Stage 1 is a geometric front-end that converts raw depth into a standardized height map via SAM3-to-MaskDINO segmentation distillation, ground-plane removal, and orientation normalization. Stage 2 is a Slice Attention Encoder that treats each height map as a sequence of cross-sectional slices and captures spatial relationships along the full dorsal surface. On a multi-site dataset of 319 sow and gilt instances from two facilities, PigFormer achieves 2.43 mm backfat MAE and 3.87 mm overall MAE. It outperforms strong single-stage ResNet-18 and ViT-small baselines. PigFormer offers a practical path toward continuous, automated, non-contact body condition monitoring in commercial swine production. Code is available at https://github.com/iambashar/Pigformer.

2606.05587 2026-06-05 cs.CV cs.AI cs.LG 版本更新

HDST-GNN: Heterogeneous Dynamic Spatiotemporal Graph Neural Networks for Multi-Object Tracking in UAV Aerial Imagery

HDST-GNN:用于无人机航拍图像多目标跟踪的异质动态时空图神经网络

Phillip Jiang

发表机构 * Phillip Jiang(菲利普·姜)

AI总结 针对无人机航拍中目标小、密集、遮挡导致身份切换的问题,提出异质动态时空图神经网络HDST-GNN,通过高度自适应边构建、异质节点表示和遮挡门控时序聚合提升跟踪性能。

Comments 18 pages, 4 figures, 6 tables

详情
AI中文摘要

无人机航拍图像的多目标跟踪(MOT)面临独特挑战:序列间高度变化、目标小而密集、频繁遮挡导致身份切换。现有基于图的跟踪器假设固定空间上下文并统一处理所有目标,忽略了检测、活跃轨迹和丢失目标等异质生命周期状态。我们提出HDST-GNN,一种异质动态时空图神经网络,包含三项创新。首先,高度自适应边构建根据平均目标面积估计相机高度代理,并相应调整图连接半径。其次,异质节点表示将检测(D型)、确认轨迹(T型)和丢失轨迹(L型)建模为不同节点类型,具有专用投影和类型化边关系。第三,遮挡门控时序聚合根据每个节点的遮挡置信度门控其注意力贡献,防止被遮挡节点破坏邻居嵌入。HDST-GNN使用可微Sinkhorn头部,结合交叉熵和三元组损失进行端到端训练。在VisDrone2019-MOT上使用oracle检测时,HDST-GNN达到94.51% MOTA和97.24% IDF1,比SORT高出+5.0 MOTA点,身份切换减少81%。使用真实YOLOv8n检测时,HDST-GNN相比SORT身份切换减少49%。消融研究证实了每个组件的独立贡献。

英文摘要

Multi-object tracking (MOT) from UAV imagery presents unique challenges: altitude varies across sequences, objects are small and densely packed, and frequent occlusion causes identity switches. Existing graph-based trackers assume fixed spatial context and treat all objects uniformly, ignoring the heterogeneous lifecycle states of detections, active tracklets, and lost targets. We propose HDST-GNN, a Heterogeneous Dynamic Spatiotemporal Graph Neural Network with three novel contributions. First, Altitude-Adaptive Edge Construction estimates a camera-altitude proxy from mean object area and adjusts the graph connectivity radius accordingly. Second, Heterogeneous Node Representation models detections (Type-D), confirmed tracklets (Type-T), and lost tracklets (Type-L) as distinct node types with dedicated projections and typed edge relations. Third, Occlusion-Gated Temporal Aggregation gates each node's attention contribution by its occlusion confidence, preventing occluded nodes from corrupting neighbour embeddings. HDST-GNN is trained end-to-end with a differentiable Sinkhorn head using joint cross-entropy and triplet loss. On VisDrone2019-MOT with oracle detections, HDST-GNN achieves 94.51% MOTA and 97.24% IDF1, outperforming SORT by +5.0 MOTA points and reducing identity switches by 81%. With real YOLOv8n detections, HDST-GNN reduces identity switches by 49% vs. SORT. Ablation studies confirm the independent contribution of each component.
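
The altitude-adaptive edge construction described above can be illustrated with a small sketch. It assumes boxes in `(x, y, w, h)` pixel form and a connectivity radius proportional to the square root of the mean box area; the gain `k` and that proportionality are hypothetical choices, since the abstract only says the radius is adjusted "accordingly".

```python
import math


def altitude_proxy(boxes):
    """Mean box area in pixels; smaller objects suggest a higher camera."""
    areas = [w * h for (_, _, w, h) in boxes]
    return sum(areas) / len(areas)


def adaptive_radius(boxes, k=3.0):
    """Radius scaled to the typical object size (k is an assumed gain)."""
    return k * math.sqrt(altitude_proxy(boxes))


def build_edges(boxes, k=3.0):
    """Connect every pair of boxes whose centers lie within the radius."""
    radius = adaptive_radius(boxes, k)
    centers = [(x + w / 2.0, y + h / 2.0) for (x, y, w, h) in boxes]
    edges = []
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if math.dist(centers[i], centers[j]) <= radius:
                edges.append((i, j))
    return edges
```

Under this scaling, two objects 100 px apart are linked when they appear large (low altitude) and left unlinked when they appear small (high altitude).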

2606.05586 2026-06-05 cs.CV cs.MM 版本更新

BMCR: Adaptive Backbone Module Composition via Reinforcement Learning for Remote Sensing Object Detection

BMCR: 基于强化学习的自适应主干模块组合用于遥感目标检测

Wenlin Liu, Xikun Hu, Ping Zhong

发表机构 * College of Electronic Science and Technology, National University of Defense Technology(电子科学与技术学院,国防科技大学)

AI总结 提出BMCR方法,通过强化学习动态组合CNN和ViT的模块化主干,解决遥感目标检测中不同复杂度输入的自适应特征提取问题,在多个数据集上取得领先性能。

详情
AI中文摘要

在遥感目标检测中,卷积神经网络擅长捕捉局部细节,而视觉Transformer更擅长全局上下文建模。然而,现有检测器通常依赖单一固定主干或手动设计的混合架构,无法自适应地利用这些互补优势处理不同复杂度的输入。为解决这一局限,我们提出基于强化学习的主干模块组合(BMCR)。BMCR从现成的CNN和ViT主干中分解出可重用模块,动态组装输入自适应推理路径。为实现跨家族组合,我们首先构建了一个可扩展的模块工具箱。具体而言,我们将代表性的CNN和ViT主干分解为可重用的功能模块,并为每个模块封装明确的结构、语义和计算元数据,以实现兼容性感知的组装。为弥合基于网格的CNN特征与基于令牌的ViT表示之间的差距,我们设计了一种轻量级的基于最优传输(OT)的过渡接口,在保持空间一致性的同时确保分布感知对齐。然后,将主干组合过程建模为序列决策问题,其中策略网络根据中间多尺度观测逐步选择任务相关模块。为稳定可重用模块和路由策略的联合优化,我们进一步开发了自适应模块协同优化(AMCO)策略,在训练过程中协调模块更新、路由探索和奖励分配。在DOTA-v1.0、DOTA-v1.5和DIOR-R上,BMCR分别达到79.31%、73.41%和71.86%的mAP,在保持竞争效率的同时,超越强静态和动态基线最多2.5个百分点。

英文摘要

In remote sensing object detection, Convolutional Neural Networks (CNNs) excel at capturing local details while Vision Transformers (ViTs) are better at global context modeling. However, existing detectors typically rely on a single fixed backbone or a manually designed hybrid architecture, and thus fail to adaptively exploit these complementary strengths across inputs of diverse complexity. To address this limitation, we propose Backbone Module Composition via Reinforcement Learning (BMCR). BMCR dynamically assembles input-adaptive inference paths from reusable modules decomposed from off-the-shelf CNN and ViT backbones. To enable such cross-family composition, we first construct an extensible module toolbox. Specifically, we decompose representative CNN and ViT backbones into reusable functional modules and encapsulate each module with explicit structural, semantic, and computational metadata for compatibility-aware assembly. To bridge the gap between grid-based CNN features and token-based ViT representations, we design a lightweight Optimal Transport (OT) based transition interface that ensures distribution-aware alignment while respecting spatial consistency. The backbone composition process is then formulated as a sequential decision problem, in which a policy network progressively selects task-relevant modules according to intermediate multi-scale observations. To stabilize the joint optimization of reusable modules and the routing policy, we further develop an Adaptive Module Cooperative Optimization (AMCO) strategy that coordinates module updating, routing exploration, and reward assignment during training. On DOTA-v1.0, DOTA-v1.5 and DIOR-R, BMCR achieves 79.31%, 73.41% and 71.86% mAP, respectively, surpassing strong static and dynamic baselines by up to 2.5 points while maintaining competitive efficiency.
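
The abstract names an Optimal Transport based interface between grid features and tokens but does not say how the transport plan is computed. One common choice is entropic OT solved with Sinkhorn iterations; the sketch below assumes that solver, uniform marginals, and a precomputed cost matrix, none of which are confirmed by the source.

```python
import math


def sinkhorn(cost, reg=0.1, iters=200):
    """Entropic OT plan between n grid cells and m tokens with uniform mass.

    cost[i][j] is an assumed dissimilarity between cell i and token j.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass on each grid cell
    b = [1.0 / m] * m  # mass on each token
    kernel = [[math.exp(-c / reg) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
```

Each row of the returned plan says how one grid cell's mass is spread over the tokens, so a soft grid-to-token mapping can be read off directly.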

2606.05581 2026-06-05 cs.GR cs.CV cs.LG 版本更新

Monte Carlo Steklov Operators for Large-Scale Geometry Processing in the Wild

蒙特卡洛Steklov算子用于大规模野外几何处理

Arman Maesumi, Tanish Makadia, Aruna Anderson, Oras Phongpanangam, Justin Solomon, Daniel Ritchie

发表机构 * Brown University(布朗大学) Loyola Marymount University(洛约拉玛丽蒙特大学) Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出一种蒙特卡洛方法估计Dirichlet-to-Neumann算子及其Steklov特征模态,实现鲁棒且高效的体积算子计算,并应用于大规模3D对比表示学习。

Comments 21 pages

详情
AI中文摘要

内在方法填充了网格几何处理的默认工具箱。内在算子,特别是拉普拉斯算子,是对等距不变性有要求的方法的基础,因此已用于许多形状分析、学习和编辑算法。然而,内在方法的前提假设在处理野外几何时变得脆弱,因为(i)网格质量无法保证,(ii)许多网格由多个连通分量建模。在这种情况下,体积构造定义更清晰,因为可以放宽对表面拓扑的限制。本文提出了一种蒙特卡洛方法,用于估计Dirichlet-to-Neumann (DtN)算子——一种边界到边界的体积算子——及其相关的Steklov特征模态。我们基于蒙特卡洛几何处理的最新发展,将该边界算子本身作为估计对象。通过体积随机过程定义的DtN算子被推广到外部域,通过周围环境空间耦合断开的分量。我们表明,我们的方法在计算Steklov谱时比现有的边界元方法快几个数量级,同时对低质量三角剖分、高分辨率网格和多分量几何保持鲁棒。为了展示这种可扩展性,我们计算了来自未策划的Objaverse数据集的约450,000个形状的内外Steklov特征谱。我们将这些算子集成到Steklov-CLIP中,这是一种基于网格的神经网络,使用体积谱算子进行大规模对比3D表示学习。得到的网络学习到语义上有意义的全局和密集形状表示,说明几何上有原则的体积算子可以在现代3D数据集规模上变得实用。

英文摘要

Intrinsic methods fill the default toolbox for geometry processing on meshes. Intrinsic operators, in particular the Laplacian, underlie methods that require invariance to isometry and have hence been employed in many algorithms for shape analysis, learning, and editing. However, intrinsic methods are predicated on assumptions that quickly become brittle when working with in-the-wild geometry, where (i) mesh quality is not guaranteed, and (ii) many meshes are modeled with multiple connected components. In such settings, volumetric constructions are better-defined, since restrictions on surface topology can be relaxed. This paper presents a Monte Carlo method for estimating the Dirichlet-to-Neumann (DtN) operator -- a boundary-to-boundary volumetric operator -- and its associated Steklov eigenmodes. We build on recent developments in Monte Carlo geometry processing by casting this boundary operator itself as the subject of estimation. The DtN operator, defined through a volumetric stochastic process, is then generalized to the exterior domain, where it couples disconnected components through the surrounding ambient space. We show that our method is orders of magnitude faster than existing boundary-element approaches for computing Steklov spectra while remaining robust to poor triangulations, high-resolution meshes, and multi-component geometry. To demonstrate this scalability, we compute interior and exterior Steklov eigenspectra for approximately 450,000 shapes from the uncurated Objaverse dataset. We incorporate these operators into Steklov-CLIP, a mesh-based neural network that uses volumetric spectral operators for large-scale contrastive 3D representation learning. The resulting network learns semantically meaningful global and dense shape representations, illustrating that geometrically-principled volumetric operators can be made practical at the scale of modern 3D datasets.
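
For readers unfamiliar with Monte Carlo estimators for Laplace-type problems, the sketch below shows walk-on-spheres, a standard estimator for a harmonic function with Dirichlet boundary data, on the unit disk. It is only a simpler building block for intuition: it does not estimate the Dirichlet-to-Neumann operator, and the abstract does not state which estimator the paper uses.

```python
import math
import random


def walk_on_spheres(x, y, boundary_value, eps=1e-3, rng=random):
    """One walk inside the unit disk; returns the boundary value where it ends."""
    while True:
        r = math.hypot(x, y)
        dist = 1.0 - r  # distance from the current point to the unit circle
        if dist < eps:
            # Close enough: project onto the circle and read the boundary data.
            return boundary_value(x / r, y / r)
        # Jump to a uniformly random point on the largest inscribed circle.
        theta = rng.uniform(0.0, 2.0 * math.pi)
        x += dist * math.cos(theta)
        y += dist * math.sin(theta)


def estimate_harmonic(x, y, boundary_value, walks=4000, seed=0):
    """Average many walks to estimate the harmonic extension at (x, y)."""
    rng = random.Random(seed)
    total = sum(walk_on_spheres(x, y, boundary_value, rng=rng) for _ in range(walks))
    return total / walks
```

With boundary data `g(bx, by) = bx`, the exact harmonic extension is `u(x, y) = x`, so the estimate at `(0.3, 0.2)` should land near 0.3 up to Monte Carlo noise. No mesh is needed, which is the property that makes this family of methods attractive for poorly triangulated geometry.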

2606.05576 2026-06-05 cs.CV 版本更新

UltraVR: A Diagnostic Ultra-Resolution Image-VQA Benchmark for Evidence-Grounded Reasoning

UltraVR:面向证据推理的诊断性超分辨率图像VQA基准

Gexin Huang, Yanting Yang, Myeongkyun Kang, Beidi Zhao, Jun Zhou, Chen Zhou, Gang Wang, Zu-hua Gao, Xiaoxiao Li

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Vector Institute(向量研究所) BC Cancer Agency(不列颠哥伦比亚癌症中心) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出UltraVR基准,通过结构化思维链标注诊断视觉语言模型在超分辨率图像上的证据推理能力,发现模型在证据定位和局部感知环节错误集中。

Comments 10 pages, 1 figure

详情
AI中文摘要

视觉语言模型(VLM)在视觉问答和多模态推理基准上表现出色。然而,它们在超分辨率图像上的能力——其中关键证据微小、细微、空间遥远或分布广泛——仍不清楚。现有评估主要报告最终答案准确率,对模型是否获取并整合必要视觉证据的洞察有限。我们引入UltraVR,一个面向超分辨率图像上基于证据的视觉推理的诊断性基准。UltraVR涵盖四个高价值场景:CCTV监控、遥感(RS)、全切片图像(WSI)病理学和工业异常检测(AD)。这些领域提出互补挑战:拥挤CCTV场景中的细粒度目标定位、RS中的长程空间比较、WSI中的多尺度证据导航以及重复工业布局中的细微不规则检测。除了标准QA三元组,每个实例包括一个结构化的真实思维链,包含步骤级问题、中间答案和推理标签。这些标签将推理分解为证据定位、局部感知、量化、证据整合和决策推断,从而实现对黑盒评分的流程级诊断。使用UltraVR,我们评估前沿VLM,并表明当前模型在超分辨率推理上仍远不可靠。重要的是,结构化注释使我们能够定位从视觉到决策流水线中的失败:错误集中在证据定位和局部感知,而当提供中间视觉事实时,下游推理通常能够恢复。这些发现表明UltraVR是一个诊断性测试平台,不仅衡量VLM是否回答正确,还衡量其超分辨率推理过程在何处中断。

英文摘要

Vision-language models (VLMs) excel on visual question answering and multimodal reasoning benchmarks. Yet their capability on ultra-resolution images - where critical evidence is tiny, subtle, spatially distant, or distributed - remains unclear. Existing evaluations largely report final-answer accuracy, offering limited insight into whether models acquire and integrate the necessary visual evidence. We introduce UltraVR, a diagnostic benchmark for evidence-grounded visual reasoning over ultra-resolution images. UltraVR spans four high-value scenarios: CCTV surveillance, remote sensing (RS), whole-slide image (WSI) pathology, and industrial anomaly detection (AD). These domains pose complementary challenges: fine-grained object grounding in crowded CCTV scenes, long-range spatial comparison in RS, multi-scale evidence navigation in WSI, and subtle irregularity detection in repetitive industrial layouts. Beyond standard QA triples, each instance includes a structured ground-truth chain of thought with step-level questions, intermediate answers, and reasoning labels. These labels decompose reasoning into evidence grounding, local perception, quantification, evidence integration, and decision inference, enabling process-level diagnosis over black-box scoring. Using UltraVR, we evaluate frontier VLMs and show that current models remain far from reliable on ultra-resolution reasoning. Importantly, the structured annotations allow us to localize failures across the visual-to-decision pipeline: errors concentrate in evidence grounding and local perception, while downstream inference often recovers when intermediate visual facts are supplied. These findings demonstrate UltraVR as a diagnostic testbed for measuring not only whether VLMs answer correctly, but where their ultra-resolution reasoning process breaks.
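
The process-level diagnosis described above amounts to finding, for each instance, the first annotated step a model gets wrong and counting failures per reasoning label. The sketch below assumes a hypothetical record layout (`steps`, `label`, `answer`, `predictions`) and exact string matching after normalization; the benchmark's real schema and scoring rule are not given in the abstract.

```python
from collections import Counter


def first_failure(steps, predictions):
    """Label of the first step whose prediction misses the reference answer."""
    for step, pred in zip(steps, predictions):
        if pred.strip().lower() != step["answer"].strip().lower():
            return step["label"]
    return None  # every step matched


def failure_histogram(instances):
    """Count first failures per reasoning label across all instances."""
    counts = Counter()
    for inst in instances:
        label = first_failure(inst["steps"], inst["predictions"])
        if label is not None:
            counts[label] += 1
    return counts
```

A histogram dominated by "evidence grounding" and "local perception" would correspond to the failure pattern the abstract reports.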

2606.05536 2026-06-05 cs.CV 版本更新

Dual Feature Decoupling for Fine-Grained OOD Detection

面向细粒度OOD检测的双重特征解耦

Xiaokun Li, Yaping Huang, Qingji Guan

发表机构 * School of Computer Science and Technology, Beijing Jiaotong University(计算机科学与技术学院,北京交通大学)

AI总结 提出双重特征解耦网络(DFDNet),通过空间-频率解耦和重建引导解耦模块,解决细粒度分类中因类间差异小和背景干扰导致的OOD检测难题。

详情
AI中文摘要

分布外(OOD)检测是将机器学习模型应用于现实场景时不可或缺的技术。现有大多数OOD检测方法都是在类间分布差异较大的理想化假设下开发的,而很大程度上忽略了以细微变化为特征的细粒度任务,如医学图像分类和车辆识别。细粒度子类别之间的高视觉相似性,加上背景因素的干扰,使得OOD检测极具挑战性。为了解决这个问题,我们提出了一种新颖的双重特征解耦网络(DFDNet),从特征解缠的角度解决细粒度OOD检测。所提出的DFDNet包含两个关键组件:空间-频率解耦模块和重建引导解耦模块。空间-频率解耦模块旨在保留对分类有判别性的内容特征,同时抑制与任务无关的风格信息。另一方面,重建引导解耦模块引入了一种新颖的像素级对抗重建任务,以进一步去除低层、非判别性信息,并增强类别特定的高层语义表示。大量实验表明,我们的方法在多个数据集上取得了有竞争力的性能提升。

英文摘要

Out-of-distribution (OOD) detection is an indispensable technique when applying machine learning models to real-world scenarios. Most existing OOD detection methods have been developed under the idealized assumption of large inter-class distributional differences, while largely overlooking fine-grained tasks characterized by subtle variations, such as medical image classification and vehicle recognition. The high visual similarity among fine-grained subcategories, together with the interference of background factors, makes OOD detection extremely challenging. To tackle this problem, we propose a novel Dual Feature Decoupling Network (DFDNet), which addresses fine-grained OOD detection from the perspective of feature disentanglement. The proposed DFDNet comprises two key components: a spatial-frequency decoupling module and a reconstruction-guided decoupling module. The spatial-frequency decoupling module is designed to preserve content features that are discriminative for classification while suppressing task-irrelevant style information. The reconstruction-guided decoupling module, in turn, introduces a novel pixel-level adversarial reconstruction task to further remove low-level, non-discriminative information and enhance category-specific high-level semantic representations. Extensive experiments demonstrate that our method achieves competitive performance improvements on multiple datasets.

2606.05535 2026-06-05 cs.CV cs.AI 版本更新

Noise-Aware Visual Representation Learning for Medical Visual Question Answering

面向医学视觉问答的噪声感知视觉表示学习

I Putu Adi Pratama, Bahadorreza Ofoghi, Atul Sajjanhar, Shang Gao

发表机构 * Deakin University(迪肯大学)

AI总结 提出一种噪声感知的医学视觉问答框架,通过去噪自编码器学习鲁棒的视觉表示,并利用低秩适配高效微调,在SLAKE和PathVQA基准上提升了抗噪性和性能。

Comments 15 pages, 2 figures. Conference submission

详情
AI中文摘要

医学视觉问答(Med-VQA)通过使AI模型能够解释医学图像并回答临床相关问题,在临床决策支持方面具有巨大潜力。近期方法通常通过轻量级映射网络将现成的视觉编码器与大语言模型(LLM)连接起来,以降低计算成本。然而,这些方法往往忽视了处理视觉表示中噪声和小无关变化的重要性。为应对这些挑战,我们提出了一种噪声感知的Med-VQA框架,该框架在视觉嵌入映射到LLM输入空间之前,引入了一个去噪自编码器。去噪自编码器经过预训练,能够从被破坏的输入中重建干净的视觉嵌入,从而鼓励模型学习对噪声不敏感的鲁棒视觉表示。然后,使用多层感知器(MLP)将得到的嵌入投影到语言模型嵌入空间中,形成为LLM提供图像信息的视觉前缀令牌。为了实现无需完全重新训练的高效适配,我们采用低秩适配(LoRA)进行参数高效微调。所提出的方法在SLAKE和PathVQA基准上进行了评估。实验结果表明,该方法在多个评估标准下对噪声输入嵌入具有更强的鲁棒性,同时保持了有竞争力的干净性能。这些发现表明,学习更鲁棒的视觉表示可以提升Med-VQA的性能和鲁棒性。

英文摘要

Medical visual question answering (Med-VQA) has strong potential for clinical decision support by enabling AI models to interpret medical images and answer clinically relevant queries. Recent approaches typically connect off-the-shelf vision encoders with large language models (LLMs) through lightweight mapping networks to reduce computational cost. However, these methods often overlook the importance of handling noise and small irrelevant changes in visual representations. To address these challenges, we propose a noise-aware Med-VQA framework that incorporates a denoising autoencoder before visual embeddings are mapped into the input space of an LLM. The denoising autoencoder is pretrained to reconstruct clean visual embeddings from corrupted inputs, encouraging the model to learn robust visual representations that are less sensitive to noise. The resulting embeddings are then projected into the language model embedding space using a multi-layer perceptron (MLP), forming visual prefix tokens that provide image information to the LLM. To enable efficient adaptation without full retraining, we employ parameter-efficient fine-tuning using low-rank adaptation (LoRA). The proposed method is evaluated on the SLAKE and PathVQA benchmarks. Experimental results show improved robustness to noisy input embeddings while maintaining competitive clean performance across multiple evaluation criteria. These findings suggest that learning more robust visual representations can enhance Med-VQA performance and robustness.
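
The denoising pretraining step can be pictured as building (corrupted, clean) embedding pairs and penalizing the reconstruction error. The sketch below covers only that data and loss side; additive Gaussian noise, the `sigma` value, and mean squared error are assumptions, since the abstract does not specify the corruption or the loss.

```python
import random


def corrupt(embedding, sigma, rng):
    """Add zero-mean Gaussian noise to each coordinate (assumed corruption)."""
    return [x + rng.gauss(0.0, sigma) for x in embedding]


def denoising_pairs(embeddings, sigma=0.1, seed=0):
    """Build (noisy input, clean target) pairs for autoencoder pretraining."""
    rng = random.Random(seed)
    return [(corrupt(e, sigma, rng), e) for e in embeddings]


def reconstruction_loss(reconstructed, clean):
    """Mean squared error between a reconstruction and the clean embedding."""
    return sum((r - c) ** 2 for r, c in zip(reconstructed, clean)) / len(clean)
```

An autoencoder trained on these pairs is asked to map each noisy input back to its clean target; feeding the noisy input straight into `reconstruction_loss` gives the error an identity mapping would incur.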

2606.05533 2026-06-05 cs.LG cs.AI cs.CV cs.RO 版本更新

What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

物体能做什么,而非它们是什么:面向功能可供性推理的功能潜在空间

Rohan Siva, Neel P. Bhatt, Yunhao Yang, Seoyoung Lee, Nishant Gadde, Christian Ellis, Alvaro Velasquez, Zhangyang Wang, Ufuk Topcu

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Neurosymbolic Intelligence(神经符号智能) University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 提出A4D框架,通过构建基于功能可供性的共享潜在空间,将视觉观察映射到该空间并测量与可供性的距离,实现基于物体功能而非外观的规划推理,显著提升泛化能力和推理效率。

Comments Code, videos, and data available at: https://A4Dance-reasoning.github.io

详情
AI中文摘要

现有的机器人规划系统依赖于基于外观的推理,其中视觉观察被编码到围绕物体外观组织的潜在空间中(例如,根据外观识别“手推车”)。然而,规划需要推理物体的任务相关功能(例如,物体是否“可移动”),而基于外观的潜在空间无法捕捉这些信息。因此,现有方法难以泛化到新颖的机器人-物体交互。我们通过功能可供性推理解决这一泛化能力有限的问题,使规划基于任务相关的物体功能而非仅外观。我们提出A4D,它将视觉观察映射到一个围绕可供性(例如“可移动”)组织的共享潜在空间中。通过将视觉观察投影到这个功能潜在空间并测量它们与可供性的接近程度,A4D推断出与观察物体相关的功能。此外,我们引入了一种可供性发现机制,扩展潜在空间以处理现有可供性不足的未见场景。A4D利用功能潜在空间中的接近度来量化可供性推理的不确定性,并选择性地触发可供性发现。我们在涉及多样化和未见可供性的多个规划任务上评估A4D。A4D在现有可供性上达到94%的推理准确率,比最先进方法高出超过15个百分点;在不到原始训练数据10%的情况下,将新可供性推理准确率从70%提升到90%以上,并实现100倍更快的推理。代码、视频和数据可在https://A4Dance-reasoning.github.io获取。

英文摘要

Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a "cart" based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is "movable"), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions. We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4D, which maps visual observations into a shared latent space structured around affordances (e.g., "movable"). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4D infers functionalities relevant to the observed object. Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4D uses proximity in the functional latent space to quantify uncertainty in affordance inference and selectively triggers affordance discovery. We evaluate A4D across several planning tasks involving diverse and unseen affordances. A4D achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art approaches by over 15 percentage points, improves new-affordance inference accuracy from 70% to over 90% with fewer than 10% of the original training data, and enables 100x faster inference. Code, videos, and data available at: https://A4Dance-reasoning.github.io.
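
Proximity-based affordance inference with an uncertainty trigger can be sketched as a nearest-prototype lookup. The sketch assumes each affordance is represented by one prototype vector, that proximity is cosine similarity, and that discovery fires when the best similarity falls below a fixed threshold; all three are illustrative assumptions rather than details from the paper.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def infer_affordance(embedding, prototypes, threshold=0.8):
    """Return (best_affordance, similarity, needs_discovery).

    prototypes maps an affordance name to its (hypothetical) latent vector.
    """
    name, sim = max(
        ((n, cosine(embedding, p)) for n, p in prototypes.items()),
        key=lambda item: item[1],
    )
    # Low similarity to every known affordance is treated as high uncertainty.
    return name, sim, sim < threshold
```

An observation close to the "movable" prototype is labeled directly, while one that sits between prototypes returns `needs_discovery=True` and would be handed to the discovery mechanism.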

2606.05531 2026-06-05 cs.CV cs.AI cs.CL cs.LG 版本更新

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Almieyar-Oryx-BloomBench:一个用于视觉语言模型认知知情评估的双语多模态基准

Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari

发表机构 * University of British Columbia(不列颠哥伦比亚大学) Zuse School(Zuse学校) Qatar Computing Research Institute (QCRI)(卡塔尔计算研究所) Hamad Bin Khalifa University(哈马德·本·哈利法大学)

AI总结 针对现有基准无法诊断视觉语言模型真实推理能力的问题,提出基于Bloom认知分类学的双语多模态基准BloomBench,系统评估六个认知层次,揭示模型在事实回忆和创造性合成方面的深层局限。

Comments Accepted to ACL 2026 Findings

详情
AI中文摘要

尽管视觉语言模型(VLM)取得了快速进展,但该领域缺乏能够严格诊断其真实推理能力并描绘出向类人多模态智能有意义进展的基准。大多数现有评估侧重于零散或脱节的任务,掩盖了关键的认知弱点,并为有针对性的改进提供了很少的见解。为了弥补这一差距,我们引入了BloomBench,这是Almieyar基准系列的一部分,也是第一个基于人类认知的、双语(英语-阿拉伯语)的多模态VLM基准。基于Bloom分类学,BloomBench通过精心设计的图像-问题-答案任务系统地评估六个认知层次(记忆、理解、应用、分析、评估、创造)。通过半自动化流水线构建,并通过分层混合质量保证协议验证,确保了可扩展性、文化包容性和语言保真度。利用这一框架,我们对最先进的VLM进行了全面研究,以诊断其认知特征。我们的分析揭示了明显的认知不对称:尽管最先进的模型在语义理解方面达到了强大的性能上限,但它们在事实回忆和创造性合成方面存在显著困难。这表明当前的一般多模态能力掩盖了特定认知层次的深层局限性。此外,我们的研究突出了阿拉伯语和英语之间的关键性能差距,暴露了当前跨语言多模态推理的局限性。这些发现为开发更符合认知和包容性的VLM奠定了基础。基准框架和数据集可在以下网址获取:https://github.com/qcri/Almieyar-Oryx-BloomBench。

英文摘要

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset are available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.

2606.05515 2026-06-05 cs.CV 版本更新

BRepCLIP: Contrastive Multimodal Pretraining on BRep Primitives for CAD Understanding

BRepCLIP: 面向CAD理解的BRep基元对比多模态预训练

Muhammad Usama, Didier Stricker, Mohammad Sadil Khan, Muhammad Zeshan Afzal

发表机构 * DFKI, Germany(德意志联邦共和国DFKI) RPTU Kaiserslautern-Landau, Germany(德国凯撒斯劳滕-兰道大学)

AI总结 提出BRepCLIP框架,通过对比预训练对齐CAD边界表示(BRep)几何与语言/图像嵌入,显著提升检索和零样本分类性能。

详情
AI中文摘要

CAD模型表示学习在很大程度上是一个开放问题。尽管3D表示学习在点云和网格方面蓬勃发展,但CAD的原生格式——边界表示(BReps),它编码精确的参数曲面、曲线及其拓扑,作为表示学习基元却很少受到关注。我们引入BRepCLIP,这是第一个通过对比预训练将BRep几何与语言和图像嵌入对齐的框架。我们将每个CAD对象建模为面令牌和边令牌的序列,分别使用独立的离散词汇表表示曲面和曲线几何,并附加空间和语义描述符来捕获曲面类型(例如,圆柱面、环面、NURBS)和曲线基元(例如,直线、圆弧、B样条)。一个Transformer编码器将这些令牌聚合成全局BRep嵌入,通过联合对比目标与CLIP的文本和图像编码器对齐。BRepCLIP生成的嵌入比现有的基于点的替代方案更具判别性和语义基础,在ABC、CADParser和Automate数据集上,Top-1检索比OpenShape分别提高40.4%、22.0%和23.9%,在FabWave上的零样本分类Top-1分数提高15%。我们进一步展示了其作为CAD感知相似度度量的实用性,用于评估文本和图像条件CAD生成,确立了结构感知预训练对于多模态CAD理解的重要性。项目页面见 https://muhammadusama100.github.io/BrepClip2026/

英文摘要

Learning representations of CAD models is a largely open problem. While 3D representation learning has flourished around point clouds and meshes, the native format of CAD, boundary representations (BReps), which encode exact parametric surfaces, curves, and their topology, has received little attention as a representation learning substrate. We introduce BRepCLIP, the first framework to align BRep geometry with language and image embeddings through contrastive pretraining. We model each CAD object as a sequence of face and edge tokens with separate discrete vocabularies for surface and curve geometry, augmented with spatial and semantic descriptors that capture surface types (e.g., cylindrical, torus, NURBS) and curve primitives (e.g., line, arc, B-spline). A transformer encoder aggregates these tokens into a global BRep embedding, aligned with CLIP's text and image encoders via a joint contrastive objective. BRepCLIP generates more discriminative and semantically grounded embeddings than existing point-based alternatives, improving Top-1 retrieval over OpenShape by 40.4%, 22.0%, and 23.9% on ABC, CADParser, and Automate, respectively, and improving zero-shot classification on FabWave by 15% in Top-1 score. We further demonstrate its utility as a CAD-aware similarity metric for evaluating text and image-conditioned CAD generation, establishing the importance of structure-aware pretraining for multimodal CAD understanding. Project page is available at https://muhammadusama100.github.io/BrepClip2026/
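
A contrastive alignment objective of the kind mentioned above is commonly written as a symmetric cross-entropy over a similarity matrix, as in CLIP. The sketch below shows that form for one pair of modalities (for example BRep embeddings against text embeddings); the temperature value and the reduction to a single modality pair are assumptions, and the paper's joint objective may combine terms differently.

```python
import math


def _diag_cross_entropy(sim, temperature):
    """Mean of -log softmax(row i)[i]; matched pairs sit on the diagonal."""
    loss = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_z - logits[i]
    return loss / len(sim)


def clip_style_loss(sim, temperature=0.07):
    """Symmetric contrastive loss over a square similarity matrix.

    sim[i][j] is the similarity between BRep embedding i and text embedding j.
    """
    transposed = [list(col) for col in zip(*sim)]
    return 0.5 * (
        _diag_cross_entropy(sim, temperature)
        + _diag_cross_entropy(transposed, temperature)
    )
```

The loss is near zero when each BRep embedding is most similar to its own caption and grows when matched pairs are confused with other items in the batch.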

2606.05506 2026-06-05 cs.CV 版本更新

Robust Scene Transfer for PointGoal Navigation via Privileged Sensor Guided Contrastive Learning

基于特权传感器引导对比学习的点目标导航鲁棒场景迁移

Amirhossein Zhalehmehrabi, Tiziano Tezze, Alberto Castelini, Alessandro Farinelli

发表机构 * University of Padua(帕多瓦大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出一种传感器引导的自适应对比学习框架,利用特权LiDAR传感器在训练时引导视觉编码器学习导航相关结构,并通过解耦表征学习与策略优化以及跨阶段域不匹配来提升策略级场景迁移能力。

Comments 8 pages, Submitted to RAL

详情
AI中文摘要

我们提出了一种用于点目标导航中视觉表征学习的传感器引导自适应对比学习框架。在训练过程中,特权LiDAR传感器通过几何感知相似度度量和自适应温度缩放来引导对比目标,鼓励视觉嵌入捕获导航相关结构而非场景特定外观。得到的编码器被独立预训练、冻结,并用作强化学习的感知骨干,将表征学习与策略优化解耦。我们进一步在表征预训练和策略学习之间引入跨阶段域不匹配,以抑制环境特定捷径并促进对任务相关特征的依赖。在高保真模拟中的大量实验表明,我们的方法显著提高了跨多种室内外环境的策略级场景迁移。在部署时,智能体仅依赖单目RGB观测以及标准任务相关输入(如目标位置和本体感觉信号),无需访问LiDAR或其他特权传感器。我们的方法在严重外观和语义变化下优于大型预训练视觉模型和标准对比基线。我们还发布了一个多模态数据集,以支持未来关于导航中特权引导视觉表征学习的研究。代码可在以下网址获取:

英文摘要

We propose a sensor-guided adaptive contrastive learning framework for visual representation learning in PointGoal navigation. During training, privileged LiDAR sensing guides the contrastive objective through a geometry-aware similarity metric and adaptive temperature scaling, encouraging visual embeddings to capture navigation-relevant structure rather than scene-specific appearance. The resulting encoder is pretrained independently, frozen, and used as the perceptual backbone for reinforcement learning, decoupling representation learning from policy optimization. We further introduce a cross-stage domain mismatch between representation pretraining and policy learning to suppress environment-specific shortcuts and promote reliance on task-relevant features. Extensive experiments in high-fidelity simulation demonstrate that our approach significantly improves policy-level scene transfer across diverse indoor and outdoor environments. At deployment, the agent relies only on monocular RGB observations together with standard task-related inputs such as goal position and proprioceptive signals, without access to LiDAR or other privileged sensors. Our method outperforms large pretrained vision models and standard contrastive baselines under severe appearance and semantic shifts. We also release a multimodal dataset to support future research on privileged-guided visual representation learning for navigation. The code is available at:

2606.05491 2026-06-05 cs.CV cs.RO 版本更新

Unpaired RGB-Thermal Gaussian-Splatting Using Visual Geometric Transformers

无配对RGB-热成像高斯泼溅使用视觉几何变换器

Jean Cordonnier, Chenghao Xu, Olga Fink, Malcolm Mielle

发表机构 * Ecole Polytechnique Federale de Lausanne(瑞士联邦理工学院洛桑分校) Schindler EPFL Lab(迅达EPFL实验室)

AI总结 提出一种无配对RGB-热成像新视角合成框架,利用VGGT估计各模态相机位姿并通过Procrustes对齐,结合多模态3D高斯泼溅实现联合重建,在保持RGB保真度的同时实现热成像视图合成。

Comments Accepted at ICRA 2026's Workshop MM-SpatialAI: Multi-Modal Spatial AI for Robust Navigation and Open-World Understanding

详情
AI中文摘要

结合RGB和热成像的多模态新视角合成(NVS)能够利用视觉和热信息进行精确的3D场景重建。然而,现有方法通常依赖于精确校准的RGB-热成像图像对或立体设置,限制了可扩展性和实际部署。为了解决这个问题,我们引入了一个无配对RGB-热成像NVS框架,该框架利用VGGT(一种3D前馈变换器架构)独立估计每个模态的相机位姿。然后使用Procrustes算法与跨模态特征匹配器对齐位姿集,从而无需配对校准即可实现联合配准。在此对齐基础上,我们进一步提出了一种多模态3D高斯泼溅方法,直接从无配对的RGB和热成像图像中学习。在多种场景上的实验表明,我们的方法在热成像视图合成中取得了有竞争力的性能,同时保持了RGB保真度。此外,我们表明现有的重建方法可能产生缺乏跨模态一致性的特定模态重建。因此,我们引入了一个基准框架,以严格评估每个模态的图像合成以及重建场景的多模态一致性。

英文摘要

Multi-modal novel view synthesis (NVS) combining RGB and thermal imagery enables precise 3D scene reconstruction with visual and thermal information. However, existing methods typically rely on precisely calibrated RGB-thermal image pairs or stereo setups, limiting scalability and practical deployment. To address this, we introduce a framework for unpaired RGB-thermal NVS that leverages VGGT, a 3D feed-forward transformer architecture, to independently estimate camera poses for each modality. The pose sets are then aligned using the Procrustes algorithm with a cross-modal feature matcher, enabling joint registration without paired calibration. Building on this alignment, we further propose a multi-modal 3D Gaussian Splatting approach that learns directly from unpaired RGB and thermal images. Experiments on diverse scenes demonstrate that our method achieves competitive performance in thermal view synthesis while maintaining RGB fidelity. Moreover, we show that existing reconstruction approaches can produce modality-specific reconstructions that lack cross-modal consistency. We thus introduce a benchmarking framework to rigorously evaluate both per-modality image synthesis and the multi-modal coherence of reconstructed scenes.
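
As a rough illustration of the pose-set alignment step, the sketch below solves a similarity Procrustes problem in closed form. It is a 2D simplification that represents points as complex numbers and assumes correspondences are already given. The abstract's pipeline works with 3D camera poses and obtains correspondences from a cross-modal feature matcher, and the 3D case is typically solved with an SVD instead. The function names `procrustes_2d` and `apply_transform` are hypothetical.

```python
import cmath


def procrustes_2d(source, target):
    """Least-squares scale, rotation, and translation mapping source onto target.

    Points are complex numbers x + yj. Returns (a, t) such that a * p + t
    approximates q, where a = scale * exp(1j * angle).
    """
    n = len(source)
    mu_s = sum(source) / n
    mu_t = sum(target) / n
    # Closed-form minimizer of sum |a * (p - mu_s) - (q - mu_t)|^2 over complex a.
    num = sum((p - mu_s).conjugate() * (q - mu_t) for p, q in zip(source, target))
    den = sum(abs(p - mu_s) ** 2 for p in source)
    a = num / den
    t = mu_t - a * mu_s
    return a, t


def apply_transform(a, t, points):
    """Apply the similarity transform to every point."""
    return [a * p + t for p in points]


# Hypothetical example: positions from one modality, and the same positions
# expressed in another frame (scaled by 2, rotated 30 degrees, shifted).
a_true = 2 * cmath.exp(1j * cmath.pi / 6)
t_true = 3 - 1j
src = [0j, 1 + 0j, 1j, 2 + 1j]
tgt = [a_true * p + t_true for p in src]
a_est, t_est = procrustes_2d(src, tgt)
```

With noise-free correspondences the recovered transform matches the true one up to floating-point error; with noisy matches it is the least-squares best fit.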

2606.05489 2026-06-05 cs.CV cs.DB 版本更新

LLM-Guided ANN Index Optimization for Human-Object Interaction Retrieval

LLM引导的ANN索引优化用于人-物交互检索

Shahrzad Esmat, Chaunte W. Lacewell, Sameh Gobriel, Nilesh Jain, Ali Jannesari

发表机构 * Iowa State University(爱荷华州立大学) Intel Corporation(英特尔公司)

AI总结 提出一种基于大语言模型的阶段感知智能体,通过耦合参数空间的分阶段优化,在HICO-DET等基准上显著提升向量检索吞吐量。

Comments 13 pages, 5 figures, 8 tables

详情
AI中文摘要

检索系统支撑着现代AI应用——涵盖视觉搜索、推荐引擎和多模态问答。现代多阶段检索系统需要联合优化高度耦合的参数,然而传统的超参数优化(HPO)方法——包括树结构Parzen估计器(TPE)和高斯过程贝叶斯优化——依赖于独立性假设,这从根本上阻止了它们在这些耦合配置空间中的导航。我们通过一个阶段感知的大语言模型(LLM)智能体来解决这一限制,该智能体将每个提案基于其完整的优化历史进行条件化,在阶段划分的探索、利用和微调阶段中导航耦合参数空间。在HICO-DET人-物交互检索基准上使用Intel VDMS(视觉数据管理系统)进行评估,我们的智能体在SIEVE(向量搜索效率的保障索引评估,一种质量约束的吞吐量指标)下比Optuna TPE高出+33.3%,比VDTuner高出+34.2%,相比UniIR实现了15.3倍的吞吐量提升。在三个基准上的验证证实,智能体的优势随参数耦合程度增加而增长:在HICO-DET(高耦合)上+33.3%,在GLDv2(中等耦合)上方法收敛于1%以内,在SIFT1M(近独立控制)上收敛于3.6%以内。在Milvus上的跨系统验证确认,优化器在所有三个数据集上排名第一且无需修改,展示了跨向量数据库管理系统(VDBMS)平台的可迁移性。

英文摘要

Retrieval systems underpin modern AI applications -- spanning visual search, recommendation engines, and multi-modal question answering. Modern multi-stage retrieval systems require the joint optimization of highly coupled parameters, yet traditional hyperparameter optimization (HPO) methods -- including Tree-structured Parzen Estimators (TPE) and Gaussian Process Bayesian Optimization -- rely on an independence assumption that fundamentally prevents them from navigating these coupled configuration spaces. We address this limitation with a phase-aware large language model (LLM) agent that conditions each proposal on its full optimization history, navigating the coupled parameter space across phase-partitioned exploration, exploitation, and fine-tuning stages. Evaluated on the HICO-DET human-object interaction retrieval benchmark using Intel VDMS (Visual Data Management System), our agent outperforms Optuna TPE by +33.3% and VDTuner by +34.2% under SIEVE (Safeguarded Index Evaluation of Vector-search Efficiency, a quality-constrained throughput metric), delivering a 15.3x throughput gain over UniIR. Validation across three benchmarks confirms that the agent's advantage grows with the degree of parameter coupling: +33.3% on HICO-DET (high coupling), methods converge within 1% on GLDv2 (moderate coupling) and within 3.6% on SIFT1M (near-independent control). Cross-system validation on Milvus confirms the optimizer ranks first on all three datasets without modification, demonstrating transferability across vector database management system (VDBMS) platforms.

2606.05478 2026-06-05 cs.CV cs.LG 版本更新

Can We Predict The Human Preference For Text-to-Image Content Prior To Generation And Is It Even Useful To Do So?

我们能否在生成之前预测文生图内容的人类偏好,以及这样做是否有用?

Joong Ho Kim, Keith G. Mills

发表机构 * LSU ATHENA Lab(LSU ATHENA实验室)

AI总结 研究在扩散模型生成图像前预测人类偏好评分(HPM)的可行性,并利用该预测提升生成质量,同时评估不同HPM的适用性。

Comments Code is available at https://github.com/LSU-ATHENA/HPM-Predict

详情
AI中文摘要

扩散模型(DM)通过从用户提示中合成高质量、逼真的视觉内容,彻底改变了文本驱动的生成。而先前视觉生成的进展(如VAE和GAN)主要基于感知或视觉相似性指标(如FID、PSNR)进行评估,DM的进展促进了更先进的人类偏好指标(HPM)的发展,这些指标将人类判断建模并量化为标量值。然而,DM使用固有的随机过程合成内容,其中随机噪声种子生成。初始随机噪声直接定性和定量地影响生成输出的质量。这种影响在本地部署场景的小型模型中尤为显著。鉴于这一现象,我们首先研究在投入计算资源进行生成之前,我们能在多大程度上预测标量HPM分数。进一步,我们研究能在多大程度上利用这种预测来改善生成图像的质量,并研究哪些HPM最适合此任务。我们的研究表明,这不仅是可能的,而且可以实现可忽略的硬件开销。

英文摘要

Diffusion Models (DMs) have revolutionized text-driven generation by enabling the synthesis of high-quality, photorealistic visual content from user prompts. Whereas prior advances in visual generation such as VAEs and GANs were primarily evaluated on perceptual or visual similarity metrics such as FID and PSNR, DM advances have fostered the development of more advanced Human Preference Metrics (HPMs) that model and quantify human judgment as scalar values. However, DMs synthesize content using an inherently stochastic process in which random noise seeds generation. The initial random noise directly affects the quality of generated outputs, both qualitatively and quantitatively. This influence is especially pronounced in smaller models intended for local deployment. Given this phenomenon, we first investigate to what extent we can predict scalar HPM scores prior to committing compute resources for generation. We then investigate to what extent we can leverage such predictions to improve the quality of generated images, and study which HPMs are best suited for this task. Our investigation reveals that this is not only possible but also achievable with negligible hardware overhead.

2606.05471 2026-06-05 cs.CV 版本更新

Formal Concept Lattices are Good Semantic Scaffolds for Concept-Based Learning

形式概念格是基于概念学习的好语义支架

Deepika SN Vemuri, Sayanta Adhikari, Ankit Saha, Krishn Vishwas Kher, Vineeth N Balasubramanian

发表机构 * Amazon, India(亚马逊(印度)) Microsoft Research(微软研究院)

AI总结 本文利用形式概念分析中的概念格作为语义支架,指导神经网络在不同深度层次学习分层结构的概念表示,从而提升可解释性和干预效果。

Comments Accepted at ICML 2026

详情
AI中文摘要

学习语义对于深度学习模型的可解释性和与人类推理的一致性至关重要。基于概念的模型通过有意义的语义抽象来表示类别,但通常将所有概念视为在单个神经网络层学习的扁平、无结构集合。这忽略了人类语义理解的一个基本属性:概念按层次组织,从一般到具体。虽然深度网络确实学习了视觉特征的层次结构,但这种结构很少与显式的语义层次对齐。借鉴形式概念分析,我们证明了形式概念格提供了原则性的语义支架来指导神经网络学习。这些格自然地根据概念的普遍性级别确定了应在网络的何处学习概念。这使得模型能够在其深度中发展出分阶段、语义基础的表示。在真实世界数据集上的实验结果表明,我们的模型产生了更可解释的嵌入,支持更有效的干预,并学习了既有意义又具有层次结构的概念表示。

英文摘要

Learning semantics is essential for deep learning models to be interpretable and better aligned with human reasoning. Concept-based models approach this by representing classes through meaningful semantic abstractions, but typically treat all concepts as a flat, unstructured set learned at a single neural network layer. This overlooks a fundamental property of human semantic understanding: concepts are organized hierarchically, from general to specific. While deep networks do learn a hierarchy of visual features, this structure is rarely aligned with explicit semantic hierarchies. Drawing on Formal Concept Analysis, we demonstrate that formal concept lattices provide principled semantic scaffolds to guide neural network learning. These lattices naturally identify where in the network concepts should be learned based on their level of generality. This allows the model to develop staged, semantically grounded representations throughout its depth. Empirical results on real-world datasets show that our models produce more interpretable embeddings, support more effective interventions, and learn concept representations that are both meaningful and hierarchically structured.
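
To make the notion of a formal concept lattice concrete, the sketch below enumerates the formal concepts of a tiny object-attribute context by brute force and orders them from general to specific. The toy context, the function names, and the use of extent size as a generality measure are illustrative assumptions; how the paper maps lattice levels to network layers is not specified in this abstract.

```python
from itertools import combinations


def extent(context, attrs):
    """Objects that have every attribute in attrs."""
    return frozenset(o for o, a in context.items() if attrs <= a)


def intent(context, objs):
    """Attributes shared by every object in objs."""
    shared = set(frozenset().union(*context.values()))
    for o in objs:
        shared &= context[o]
    return frozenset(shared)


def formal_concepts(context):
    """All (extent, intent) pairs closed under the two derivation operators."""
    all_attrs = sorted(frozenset().union(*context.values()))
    concepts = set()
    for r in range(len(all_attrs) + 1):
        for subset in combinations(all_attrs, r):
            e = extent(context, frozenset(subset))
            concepts.add((e, intent(context, e)))
    return concepts


# Hypothetical context: which attributes each object has.
context = {
    "sparrow": {"animal", "bird", "flies"},
    "penguin": {"animal", "bird"},
    "dog": {"animal"},
}
concepts = formal_concepts(context)
# Larger extents correspond to more general concepts.
general_to_specific = sorted(concepts, key=lambda c: -len(c[0]))
```

Brute-force enumeration is exponential in the number of attributes, so it only suits small contexts; dedicated algorithms exist for larger ones.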

2606.05460 2026-06-05 cs.CV 版本更新

ORACLE-CT: Anatomy-Aware Support Pooling for CT Classification

ORACLE-CT:用于CT分类的解剖感知支持池化

Lavsen Dahal, Yubraj Bhandari, Geoffrey Rubin, Joseph Y. Lo

发表机构 * Center for Virtual Imaging Trials, RAI Labs, Department of Radiology, Duke University(虚拟成像试验中心,RAI实验室,放射学系,杜克大学) Electrical and Computer Engineering, Pratt School of Engineering, Duke University(电气与计算机工程,工程学院,杜克大学) Department of Mathematics, Trinity College of Arts & Sciences, Duke University(数学系,艺术与科学学院,杜克大学) Department of Radiology and Imaging Sciences, University of Arizona College of Medicine(放射学与影像科学系,亚利桑那大学医学院)

AI总结 提出ORACLE-CT框架,通过多器官分割定义标签特定的解剖支持区域并限制注意力池化,解决CT分类中局部疾病证据与全局聚合不匹配的问题,在多个编码器上提升性能。

详情
AI中文摘要

腹部CT疾病分类具有挑战性,因为每次扫描都是一个包含许多可能发现的大3D体积,而诊断证据通常局限于特定器官或解剖隔室。大多数研究级分类器使用与解剖无关的池化或注意力来聚合编码器特征,造成了局部疾病证据与全局证据聚合之间的不匹配。我们提出ORACLE-CT,一个与编码器无关的解剖感知聚合框架,它使用多器官分割来定义标签特定的解剖支持,并将注意力池化限制在相关区域。该框架支持单器官、多器官联合、比较、局部和全局支持策略。我们使用三个编码器系列评估ORACLE-CT:DINOv3、I3D-ResNet-121和放射学原生Pillar-0编码器。模型在MERLIN上进行端到端训练,并在内部评估以及在冻结外部迁移到Duke-Abdomen和AMOS下进行评估。与全局平均池化相比,支持掩蔽池化将DINOv3的MERLIN宏AUROC/AUPRC从0.838/0.638提高到0.858/0.676,将I3D-ResNet-121从0.829/0.617提高到0.848/0.659。在协调的10标签外部评估中,DINOv3在Duke-Abdomen上从0.802/0.628提高到0.835/0.683,在AMOS上从0.742/0.313提高到0.762/0.350,I3D-ResNet-121也有类似增益。对于Pillar-0,大部分增益来自学习注意力,解剖掩蔽的额外收益较小。ORACLE-CT提高了区分度和外部鲁棒性,同时保留了预测与解剖证据之间的可审计联系。

英文摘要

Abdominal CT disease classification is challenging because each scan is a large 3D volume with many possible findings, while diagnostic evidence is often confined to specific organs or anatomical compartments. Most study-level classifiers aggregate encoder features using anatomy-agnostic pooling or attention, creating a mismatch between localized disease evidence and global evidence aggregation. We propose ORACLE-CT, an encoder-agnostic anatomy-aware aggregation framework that uses multi-organ segmentation to define label-specific anatomical supports and restrict attention pooling to relevant regions. The framework supports single-organ, multi-organ union, comparative, localized, and global support strategies. We evaluate ORACLE-CT with three encoder families: DINOv3, I3D-ResNet-121, and the radiology-native Pillar-0 encoder. Models are trained end-to-end on MERLIN and evaluated internally and under frozen external transfer to Duke-Abdomen and AMOS. Compared with global average pooling, support-masked pooling improved MERLIN macro-AUROC/AUPRC from 0.838/0.638 to 0.858/0.676 for DINOv3 and from 0.829/0.617 to 0.848/0.659 for I3D-ResNet-121. On harmonized 10-label external evaluation, DINOv3 improved on Duke-Abdomen from 0.802/0.628 to 0.835/0.683 and on AMOS from 0.742/0.313 to 0.762/0.350, with similar gains for I3D-ResNet-121. For Pillar-0, most gains came from learned attention, with smaller additional benefit from anatomical masking. ORACLE-CT improves discrimination and external robustness while preserving an auditable link between predictions and anatomical evidence.
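
The idea of restricting attention pooling to an anatomical support can be sketched as a softmax over only the tokens inside a mask. This is a simplified illustration with assumptions: features are plain lists, attention scores are given rather than learned, and `masked_attention_pool` is a hypothetical name, not the framework's API.

```python
import math


def masked_attention_pool(features, scores, mask):
    """Softmax-weighted average of feature vectors, restricted to mask == True."""
    kept = [i for i, inside in enumerate(mask) if inside]
    if not kept:
        raise ValueError("support mask is empty")
    peak = max(scores[i] for i in kept)  # stabilize the exponentials
    weights = {i: math.exp(scores[i] - peak) for i in kept}
    z = sum(weights.values())
    dim = len(features[0])
    pooled = [0.0] * dim
    for i in kept:
        w = weights[i] / z
        for d in range(dim):
            pooled[d] += w * features[i][d]
    return pooled


# Hypothetical tokens: the third one has a high score but lies outside the
# organ-specific support, so it must not influence the pooled vector.
token_features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
token_scores = [0.0, 0.0, 10.0]
organ_mask = [True, True, False]
pooled = masked_attention_pool(token_features, token_scores, organ_mask)
```

Because excluded tokens receive exactly zero weight, the pooled vector depends only on the masked region, which is what keeps the prediction traceable to that region.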

2606.05458 2026-06-05 cs.CV 版本更新

Horse Eye Blink Detection and Classification for Equine Affective State Assessment

马匹眼睛眨眼检测与分类用于马匹情感状态评估

João Alves, Signe Møller-Skuldbøl, Pia Haubro Andersen, Rikke Gade

发表机构 * Visual Analysis and Perception Lab, Aalborg University(视觉分析与感知实验室,奥尔堡大学) Department of Animal Biosciences, Swedish University of Agricultural Sciences(动物生物科学系,瑞典农业科学大学)

AI总结 本研究开发并评估了三种基于视频的马匹眨眼自动分类方法(帧级YOLOv12检测器、光流幅度阈值法和微调VideoMAE模型),在公开数据集上实现了眨眼分类宏F1分数0.898和二元眨眼检测0.926,展示了细粒度动作单元检测在马匹福利监测中的潜力和挑战。

Comments CVPRW2026 CV4Animals

详情
AI中文摘要

自动检测马匹面部动作单元(AUs)是评估马匹疼痛和情感状态的一个有前景但尚未充分探索的途径。半眨眼和全眨眼运动被认为是疼痛和压力的识别指标,但作为微表情,其细微、精细的特性使其容易被肉眼忽略,只能通过逐帧视频检查才能辨别,这使得从视频中进行可靠的自动检测成为一项特别艰巨的任务。我们开发并评估了三种从马匹视频中自动分类眨眼的方法:基于帧的YOLOv12检测器、光流幅度阈值方法以及微调的VideoMAE模型,并在公开数据集上进行了测试。我们在眨眼分类任务上达到了0.898的宏F1分数,在二元眨眼检测上达到了0.926。我们的结果突显了细粒度AU检测在马匹福利监测中的潜力和固有挑战。

英文摘要

Automated detection of equine facial action units (AUs) is a promising yet under-explored avenue for pain and affective state assessment in horses. Half and full-blink movements are recognised indicators of pain and stress, but as micro-expressions, their subtle, fine-grained nature makes them easily missed by the naked eye and only discernible through frame-by-frame video inspection, making reliable automated detection from video a particularly demanding task. We develop and evaluate three methods for automated blink classification from horse videos: a frame-based YOLOv12 detector, an optical flow magnitude thresholding approach, and a fine-tuned VideoMAE model, tested on a publicly available dataset. We achieve a macro-F1 score of 0.898 on blink classification and 0.926 on binary blink detection. Our results highlight both the potential and the inherent challenges of fine-grained AU detection for equine welfare monitoring.

2606.05455 2026-06-05 cs.CV 版本更新

Disentangled Fine-Grained Prototype Learning for Incomplete Image-Tabular Classification

面向不完整图像-表格分类的解缠细粒度原型学习

Feixiang Zhou, Jianyang Xie, Zhuangzhi Gao, Qinkai Yu, Fu Wang, Yuheng Fan, Jing Li, Zheheng Jiang, Yitian Zhao, Yanda Meng, He Zhao, Gregory Y. H. Lip, Yalin Zheng

发表机构 * School of Eye and Vision Sciences, University of Liverpool, U.K.(利物浦大学眼科与视觉科学学院) Department of Cardiovascular and Metabolic Medicine, University of Liverpool, U.K.(利物浦大学心血管与代谢医学系) School of Computer Science, University of Exeter, U.K.(埃克塞特大学计算机科学学院) School of Computer Science and Engineering, South China University of Technology, China(华南理工大学计算机科学与工程学院) School of Computing and Mathematical Sciences, University of Leicester, U.K.(莱斯特大学计算科学与数学科学学院) Ningbo Institute of Industrial Technology, Chinese Academy of Sciences, China(中国科学院宁波工业技术研究所) Bioengineering Program, Biological and Environmental Science and Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Saudi Arabia(阿卜杜拉国王科技大学(KAUST)生物与环境科学与工程学部(BESE)生物工程项目,沙特阿拉伯)

AI总结 针对图像-表格多模态学习中缺失模态问题,提出DFPL框架,通过共享-特定原型建模、原型级解缠和细粒度对齐,实现鲁棒分类。

详情
AI中文摘要

缺失模态问题在广泛的多媒体应用中(包括产品理解、推荐系统和医疗诊断)对图像-表格多模态学习构成了重大挑战。当两种模态高度异质时,这一挑战尤为突出,因为图像和表格属性在语义粒度和数据分布上存在显著差异。现有方法通过对全局令牌平均特征进行解缠和对齐来学习模态不变表示,仅捕获粗粒度的跨模态一致性,忽略了细粒度的语义和分布错位,这阻碍了在缺失模态下利用互补线索。为了解决这个问题,我们提出了DFPL,一种用于细粒度原型学习的新框架。具体来说,共享-特定原型建模(SSPM)提取紧凑且多样化的共享和模态特定原型,并进一步执行原型级解缠以抑制冗余的模态内相关性。此外,我们提出了一个原型引导的细粒度对齐(PFA)模块,该模块在统一的原型空间内联合强制执行原型级分布匹配和原型到类别的语义对齐,从而跨模态保留细粒度的分布和语义一致性。我们还引入了一个类别感知的多尺度聚合(CMA)模块,从全局和原型级别自适应地聚合共享语义和模态特定特征,以实现鲁棒的预测。在三个不同的图像-表格基准上的大量实验表明,我们的方法在各种缺失模态设置下优于先前的方法。代码将公开提供。

英文摘要

The missing-modality problem poses a significant challenge in image-tabular multimodal learning across a wide range of multimedia applications, including product understanding, recommendation systems, and medical diagnosis. This challenge is particularly pronounced when the two modalities are highly heterogeneous, as images and tabular attributes differ substantially in their semantic granularity and data distributions. Existing methods learn modality-invariant representations through disentanglement and alignment over global token-averaged features, capturing only coarse cross-modal consistency and overlooking fine-grained semantic and distributional misalignment, which hampers the exploitation of complementary cues under missing modalities. To address this, we propose DFPL, a novel framework for fine-grained prototype learning. Specifically, Shared-Specific Prototype Modeling (SSPM) extracts compact and diverse shared and modality-specific prototypes, and further performs prototype-level disentanglement to suppress redundant intra-modality correlations. Additionally, we propose a Prototype-guided Fine-grained Alignment (PFA) module that jointly enforces prototype-level distribution matching and prototype-to-class semantic alignment within a unified prototype space, thereby preserving both fine-grained distributional and semantic consistency across modalities. We further introduce a Class-aware Multi-scale Aggregation (CMA) module to adaptively aggregate shared semantics and modality-specific characteristics from global and prototype levels for robust predictions. Extensive experiments on three diverse image-tabular benchmarks demonstrate the superiority of our method compared to the previous approaches under various missing-modality settings. Code will be made publicly available.

2606.05437 2026-06-05 cs.RO cs.CV 版本更新

Uncertainty-Aware Adaptive Sensor Fusion for Autonomous Navigation

不确定性感知的自适应传感器融合用于自主导航

Simegnew Yihunie Alaba, Yuichi Motai

发表机构 * IEEE

AI总结 提出一种结合无迹卡尔曼滤波(UKF)的混合深度学习方法,通过不确定性感知的自适应融合视觉和惯性特征,提高自主导航中视觉惯性里程计(VIO)的位姿估计精度。

Comments 13 pages

详情
AI中文摘要

本文介绍了一种混合深度学习方法,与无迹卡尔曼滤波(UKF)相结合,以增强自主导航中视觉惯性里程计(VIO)的位姿估计精度。所提出的模型采用视觉变换器(ViT)网络有效捕获惯性测量单元(IMU)数据的时间依赖性,并利用多尺度卷积神经网络(MCNN)从视觉数据中学习基于光流的运动线索。自适应传感器融合模块通过利用估计的不确定性动态加权IMU和视觉特征,从而在多样且具有挑战性的环境条件下提高鲁棒性。此外,提出了一种新颖的不确定性感知损失函数,将预测不确定性明确纳入学习过程,使得在噪声、不完整或不可靠的传感器输入下实现鲁棒且准确的导航。在KITTI数据集上的全面评估表明,所提出的方法显著优于基线方法,在绝对轨迹误差(ATE)和相对位姿误差(RPE)方面实现了优越性能。该轻量且计算高效的模型在NVIDIA A100 GPU上以155 FPS处理数据,非常适合部署在资源受限的自主系统中。

英文摘要

This work introduces a hybrid deep learning approach integrated with an Unscented Kalman Filter (UKF) to enhance pose estimation accuracy in Visual-Inertial Odometry (VIO) for autonomous navigation. The proposed model employs a Vision Transformer (ViT) network to effectively capture temporal dependencies from inertial measurement unit (IMU) data and utilizes a Multiscale Convolutional Neural Network (MCNN) to learn optical flow-based motion cues from visual data. An adaptive sensor fusion module dynamically weights IMU and visual features by leveraging estimated uncertainty, thus improving robustness in diverse and challenging environmental conditions. Additionally, a novel uncertainty-aware loss function is proposed to explicitly incorporate prediction uncertainty into the learning process, enabling robust and accurate navigation under noisy, incomplete, or unreliable sensor inputs. Comprehensive evaluations on the KITTI dataset demonstrate that the proposed method significantly outperforms baseline approaches, achieving superior performance in terms of Absolute Trajectory Error (ATE) and Relative Pose Error (RPE). The lightweight and computationally efficient model processes data at 155 FPS on an NVIDIA A100 GPU, making it highly suitable for deployment in resource-constrained autonomous systems.
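
One classical way to weight two sensor estimates by their uncertainty is inverse-variance fusion, sketched below. This is a stand-in technique chosen for illustration, not the learned adaptive fusion module described in the abstract, and the function name and example numbers are hypothetical.

```python
def inverse_variance_fusion(estimates, variances):
    """Fuse scalar estimates, weighting each by the inverse of its variance.

    Returns the fused estimate and its variance. More certain inputs
    (smaller variance) receive larger weights.
    """
    weights = [1.0 / v for v in variances]
    total = sum(weights)
    fused = sum(w * e for w, e in zip(weights, estimates)) / total
    return fused, 1.0 / total


# Hypothetical forward-displacement estimates (metres) from the inertial
# branch and the visual branch, with their predicted variances.
equal_fused, equal_var = inverse_variance_fusion([1.0, 3.0], [1.0, 1.0])
skewed_fused, skewed_var = inverse_variance_fusion([1.0, 3.0], [1.0, 3.0])
```

When both sources are equally uncertain the result is the plain average; when the second source is three times as uncertain, the fused value moves toward the first.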

2606.05379 2026-06-05 cs.CV 版本更新

Deep Learning-assisted AMD Staging based on OCT and OCT Angiography

基于OCT和OCT血管成像的深度学习辅助AMD分期

Yukun Guo, Tristan T. Hormel, An-Lun Wu, Liqin Gao, Min Gao, Steven T. Bailey, Yali Jia

发表机构 * Casey Eye Institute, Oregon Health & Science University(俄勒冈健康与科学大学Casey眼科研究所) Department of Biomedical Engineering, Oregon Health & Science University(俄勒冈健康与科学大学生物医学工程系) Department of Ophthalmology, Mackay Memorial Hospital(马偕纪念医院眼科部)

AI总结 利用OCT和OCTA数据,开发并评估基于EfficientNet的深度学习模型,用于自动分级年龄相关性黄斑变性(AMD)严重程度,其中基于生物标志物的模型表现最佳,尤其对早期AMD检测有价值。

详情
AI中文摘要

开发和评估使用光学相干断层扫描(OCT)和OCT血管成像(OCTA)数据自动分级年龄相关性黄斑变性(AMD)严重程度的深度学习模型。研究对象为271名年龄≥50岁、具有不同AMD严重程度的参与者。使用扫频OCTA系统(SOLIX; Visionix/Optovue Inc., CA)获取中央黄斑6×6 mm OCT/OCTA体积。根据AREDS简化严重程度量表,将AMD严重程度分为四个阶段(无AMD、早期AMD、中期AMD和晚期AMD)。开发了三种使用不同输入模态的深度学习模型:(1)来自分割病理特征(包括视网膜液、玻璃膜疣、地图样萎缩(GA)和黄斑新生血管(MNV))的生物标志物图;(2)二维(2D)en face OCT和OCTA投影;(3)三维(3D)OCT/OCTA体积。使用归一化输入、数据增强和五折交叉验证训练基于EfficientNet的架构。分析了来自271名参与者351只眼睛的总共2030个OCT/OCTA体积。所有模型均表现出强大的AMD分期性能,与参考标准具有高度一致性(QWK ≥ 0.83)。基于生物标志物的模型实现了最高的整体性能(QWK = 0.85 ± 0.03,均值±标准差)和最佳的早期AMD检测(F1分数 = 0.59 ± 0.14)。3D模型的性能与2D OCT/OCTA模型相当(QWK = 0.83 ± 0.04 vs. 0.83 ± 0.09),而2D OCT/OCTA模型显示出最高的精确度(0.79 ± 0.06)并最准确地识别出无AMD的眼睛。使用OCT/OCTA数据的深度学习模型可以准确、自动地对AMD严重程度进行分级。在评估的方法中,基于生物标志物的模型提供了最平衡的性能,并对早期AMD检测显示出特别的价值。

英文摘要

This study develops and evaluates deep learning models for automated grading of age-related macular degeneration (AMD) severity using optical coherence tomography (OCT) and OCT angiography (OCTA) data. The cohort comprised 271 participants aged >= 50 years with varying AMD severities. Central macular 6 x 6 mm OCT/OCTA volumes were acquired using a swept-source OCTA system (SOLIX; Visionix/Optovue Inc., CA). AMD severity was graded into four stages (No AMD, Early AMD, Intermediate AMD, and Advanced AMD) according to the AREDS simplified severity scale. Three deep learning models were developed using different input modalities: (1) biomarker maps derived from segmented pathological features, including retinal fluid, drusen, geographic atrophy (GA), and macular neovascularization (MNV); (2) two-dimensional (2D) en face OCT and OCTA projections; and (3) three-dimensional (3D) OCT/OCTA volumes. EfficientNet-based architectures were trained using normalized inputs, data augmentation, and five-fold cross-validation. A total of 2,030 OCT/OCTA volumes from 351 eyes of 271 participants were analyzed. All models demonstrated strong AMD staging performance with substantial agreement with the reference standard (quadratic weighted kappa, QWK >= 0.83). The biomarker-based model achieved the highest overall performance (QWK = 0.85 +/- 0.03, mean +/- standard deviation) and the best detection of early AMD (F1-score = 0.59 +/- 0.14). The 3D model achieved performance comparable to the 2D OCT/OCTA model (QWK = 0.83 +/- 0.04 vs. 0.83 +/- 0.09), while the 2D OCT/OCTA model showed the highest precision (0.79 +/- 0.06) and most accurately identified eyes without AMD. Deep learning models using OCT/OCTA data can accurately and automatically grade AMD severity. Among the evaluated approaches, the biomarker-based model provided the most balanced performance and showed particular value for early AMD detection.
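
The agreement metric QWK reported here is the quadratic weighted kappa, which penalizes disagreements by the squared distance between ordinal grades. The sketch below computes it from scratch for integer labels 0..K-1; the four-stage example labels are made up for illustration and are not data from the study.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Cohen's kappa with quadratic weights for ordinal labels 0..num_classes-1."""
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(num_classes)) for j in range(num_classes)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2 / (num_classes - 1) ** 2
            weighted_observed += w * observed[i][j]
            # Expected count under independence of the two raters' marginals.
            weighted_expected += w * row[i] * col[j] / n
    return 1.0 - weighted_observed / weighted_expected


# Hypothetical grades: 0 = No AMD, 1 = Early, 2 = Intermediate, 3 = Advanced.
reference = [0, 1, 2, 3]
kappa_perfect = quadratic_weighted_kappa(reference, [0, 1, 2, 3], 4)
kappa_reversed = quadratic_weighted_kappa(reference, [3, 2, 1, 0], 4)
```

Perfect agreement gives 1, chance-level agreement gives about 0, and systematically reversed grading gives a negative value. The function assumes at least two distinct grades occur, otherwise the expected term is zero.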

2606.05375 2026-06-05 cs.CV cs.AI 版本更新

Three-Dimensional Retinal Microvasculature Restoration in OCT Angiography

OCT血管造影中的三维视网膜微血管修复

Yukun Guo, Min Gao, Tristan T. Hormel, Steven T. Bailey, Thomas S. Hwang, Yali Jia

发表机构 * Casey Eye Institute, Oregon Health & Science University(俄勒冈健康与科学大学Casey眼科研究所) Department of Biomedical Engineering, Oregon Health & Science University(俄勒冈健康与科学大学生物医学工程系)

AI总结 提出基于EfficientNet-B5编码器和含空间-通道挤压激励模块的解码器的深度学习算法,从单次OCTA体数据恢复毛细血管解剖结构,显著提升图像质量与微血管保真度。

详情
AI中文摘要

光学相干断层扫描血管造影(OCTA)是一种用于成像视网膜微血管的强大技术。然而,由于成像伪影,获取可靠的视网膜血流和视网膜无灌注区域量化具有挑战性。现有方法主要关注噪声抑制、投影伪影去除或信号增强,以改善OCTA在横截面或二维(2D)正面投影中的图像质量,而忽略了内在的三维血管结构。在本研究中,我们提出了一种基于深度学习的算法,用于从单个OCTA体数据中恢复毛细血管解剖血管结构。该网络由EfficientNet-B5编码器和结合了并行空间与通道挤压激励模块的解码器组成,通过跳跃连接保持空间分辨率。使用三个相邻B帧作为输入,预测修复后的中间B帧。我们使用峰值信噪比(PSNR)和结构相似性指数(SSIM)评估模型性能,以多次扫描平均生成的真值作为基准。结果表明,与原始单次OCTA体数据相比,所提模型显著(p < 0.001)提高了图像质量,PSNR为26.16 ± 1.26对比22.23 ± 0.78,SSIM为0.91 ± 0.02对比0.72 ± 0.03。所提模型还显著(p < 0.001)提高了微血管保真度,通过模型输出与真值之间的Dice系数重叠测量,在多个不同血管板层上,2D和3D分别至少提高3.8%和51.2%。

英文摘要

Optical coherence tomographic angiography (OCTA) is a powerful technique for imaging retinal microvasculature. However, acquiring reliable quantification of retinal blood flow and areas of retinal nonperfusion is challenging because of imaging artifacts. Existing methods primarily focus on noise suppression, projection artifact removal, or signal enhancement to improve the image quality of OCTA in cross-sectional or two-dimensional (2D) en face projections, while neglecting the intrinsic three-dimensional vascular architecture. In this study, we propose a deep learning-based algorithm for restoring capillary anatomical vasculature from a single OCTA volume. The network consists of an EfficientNet-B5 encoder and a decoder incorporating concurrent spatial and channel squeeze-and-excitation modules, connected via skip connections to preserve spatial resolution. Three adjacent B-frames are used as input to predict the restored middle B-frame. We evaluated the performance of the model using the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) against ground truth generated from averaging multiple scans. The results show that the proposed model significantly (both p < 0.001) improved image quality compared with the original single OCTA volume, with a PSNR of 26.16 +/- 1.26 vs. 22.23 +/- 0.78 and an SSIM of 0.91 +/- 0.02 vs. 0.72 +/- 0.03. The proposed model also significantly (p < 0.001) improved microvascular fidelity, measured by the Dice coefficient overlap between the model output and ground truth, in both 2D and 3D by at least 3.8% and 51.2%, respectively, across several different vascular slabs.
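
For reference, the PSNR figure used in this evaluation follows a standard formula based on mean squared error. The sketch below applies it to flattened intensity lists scaled to [0, 1]; the pixel values are invented for illustration, and SSIM is omitted because it needs windowed statistics.

```python
import math


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized intensity lists."""
    if len(reference) != len(restored):
        raise ValueError("inputs must have the same length")
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Hypothetical flattened B-frame intensities: averaged ground truth vs. output.
ground_truth = [0.0, 1.0, 0.5, 0.5]
model_output = [0.1, 0.9, 0.5, 0.5]
score_db = psnr(ground_truth, model_output)
```

Higher values mean the restored frame is closer to the multi-scan average; each 10 dB step corresponds to a tenfold reduction in mean squared error.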

2606.05359 2026-06-05 cs.CV 版本更新

Recovering Physically Plausible Human-Object Interactions from Monocular Videos

从单目视频中恢复物理上可信的人-物交互

Dingbang Huang, Etienne Vouga, Qixing Huang, Georgios Pavlakos

发表机构 * University of Texas at Austin(德克萨斯大学奥斯汀分校) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出RePHO方法,通过物理引导的重建框架和强化学习策略,从单目视频中恢复物理上可信的人-物交互,解决了现有方法中的穿透和物体漂浮问题。

Comments CVPR 2026. Project Page: https://dingbang777.github.io/RePHO/

详情
AI中文摘要

在本文中,我们提出了RePHO,一种从单目视频中重建物理上可信的人-物交互(HOI)的方法。现有的基于运动学的方法虽然能产生视觉上合理的运动,但常常导致物理上不合理的伪影,如相互穿透和物体漂浮。为了克服这些问题,我们引入了一个物理引导的重建框架。我们从运动学估计开始,然后通过强化学习(RL)训练一个策略来细化它。该策略被优化以在物理模拟器中重现交互。由于运动学估计通常带有噪声,简单的RL训练可能会失败。因此,我们提出了一种自适应采样策略,具有双重自我更新机制,可以识别具有最丰富信息和最可靠运动学重建的帧。我们的过程逐步提高重建质量,并产生物理一致的HOI序列。我们在两个标准的HOI基准上展示了我们的方法,并在物理合理性指标上取得了比现有方法明显的改进。项目页面:https://dingbang777.github.io/RePHO/

英文摘要

In this paper, we propose RePHO, a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. While existing kinematic-based approaches produce visually plausible motion, they often result in physically implausible artifacts such as interpenetration and object floating. To overcome these issues, we introduce a physics-guided reconstruction framework. We begin with a kinematic estimate and then refine it by training a policy with reinforcement learning (RL). This policy is optimized to reproduce the interaction in a physics simulator. Because kinematic estimates are typically noisy, naive RL training can fail. Therefore, we propose an adaptive sampling strategy with a dual self-updating mechanism that can identify the frames with the most informative and reliable kinematic reconstruction. Our process progressively improves reconstruction quality and yields physically consistent HOI sequences. We demonstrate our approach on two standard HOI benchmarks and achieve clear improvements in physical plausibility metrics over state-of-the-art methods. Project Page: https://dingbang777.github.io/RePHO/

2606.05354 2026-06-05 cs.CV 版本更新

LightVesselNet: An Ultra-Lightweight Sub-100K Parameter Network for Retinal Blood Vessel Segmentation

LightVesselNet:用于视网膜血管分割的超轻量级亚10万参数网络

Shadman Sobhan, Farhana Jalil

发表机构 * Department of Electrical & Electronic Engineering, Bangladesh University of Engineering and Technology (BUET)(电子与电气工程系,孟加拉国工程与技术大学)

AI总结 提出LightVesselNet,一种仅75K参数的紧凑编码器-解码器网络,结合通道与空间注意力、多尺度特征聚合和亚像素上采样,在五个公开数据集上实现与大型模型相当的视网膜血管分割性能,适用于资源受限的临床环境。

详情
AI中文摘要

视网膜血管分割在糖尿病视网膜病变和青光眼的早期检测中起着至关重要的作用。虽然最近的深度学习模型取得了很高的分割精度,但它们通常需要大量的计算资源,使得在边缘设备上的实际部署变得困难。在本文中,我们提出了LightVesselNet,一种专为资源受限环境中的视网膜血管分割设计的高效神经网络。尽管仅包含75K参数,LightVesselNet的性能与更大的模型相比具有竞争力。该网络采用紧凑的编码器-解码器架构,并增强了通道和空间注意力机制、瓶颈处的多尺度特征聚合模块以及解码器中的亚像素上采样策略。专用的边缘残差连接在整个解码过程中保留了精细的血管细节。在五个公开数据集:DRIVE、STARE、CHASEDB1、FIVES和HRF上进行的大量实验,分别获得了0.8189、0.8499、0.8640、0.8634、0.8096的灵敏度分数和0.8070、0.8072、0.8181、0.8649、0.7686的Dice系数。与最先进模型相比,LightVesselNet显示出更高的效率(性能与参数或GFlops之比)。跨数据集评估证实了模型的泛化能力。总体而言,LightVesselNet是低资源临床环境和移动筛查工具中部署的有力候选者。

英文摘要

Retinal blood vessel segmentation plays a vital role in the early detection of diabetic retinopathy and glaucoma. While recent deep learning models have achieved high segmentation accuracy, they typically require heavy computational resources, making real-world deployment on edge devices difficult. In this paper, we propose LightVesselNet, an efficient neural network designed for retinal vessel segmentation in resource-constrained environments. Despite containing only 75K parameters, LightVesselNet performs competitively with much larger models. The network employs a compact encoder-decoder architecture enhanced with channel and spatial attention mechanisms, a multi-scale feature aggregation module at the bottleneck, and a subpixel upsampling strategy in the decoder. A dedicated edge residual connection preserves fine vessel detail throughout decoding. Extensive experiments on five publicly available datasets (DRIVE, STARE, CHASEDB1, FIVES, and HRF) yield sensitivity scores of 0.8189, 0.8499, 0.8640, 0.8634, and 0.8096, and Dice coefficients of 0.8070, 0.8072, 0.8181, 0.8649, and 0.7686, respectively. LightVesselNet shows improved efficiency (performance relative to parameter count or GFLOPs) compared to state-of-the-art models. Cross-dataset evaluation confirms the model's generalisation capability. Overall, LightVesselNet is a strong candidate for deployment in low-resource clinical settings and mobile screening tools.
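
The two segmentation metrics quoted above, sensitivity and the Dice coefficient, can both be derived from true-positive, false-positive, and false-negative pixel counts. The sketch below shows the standard definitions on flattened binary masks; the example masks are invented and the function names are illustrative.

```python
def confusion_counts(pred, truth):
    """Count true positives, false positives, and false negatives for binary masks."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, fn


def dice_coefficient(pred, truth):
    """Dice = 2 * TP / (2 * TP + FP + FN)."""
    tp, fp, fn = confusion_counts(pred, truth)
    return 2 * tp / (2 * tp + fp + fn)


def sensitivity(pred, truth):
    """Sensitivity (recall) = TP / (TP + FN): the fraction of vessel pixels found."""
    tp, _, fn = confusion_counts(pred, truth)
    return tp / (tp + fn)


# Hypothetical flattened masks: 1 = vessel pixel, 0 = background.
predicted_mask = [1, 1, 0, 0, 1]
true_mask = [1, 0, 0, 1, 1]
dice_value = dice_coefficient(predicted_mask, true_mask)
sens_value = sensitivity(predicted_mask, true_mask)
```

Sensitivity ignores false positives, so a model can raise it by over-segmenting; Dice penalizes both error types, which is why the two are usually reported together.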

2606.05347 2026-06-05 cs.CV Updated version

TopoPult-SSL: Gland-Mask-Free Cross-Device Meibomian Gland Segmentation via Self-Distilled Weak Clinical Priors

Nicolò Savioli, Luca Del Tongo

Affiliations * OdaxAI S.R.L.; Topcon Group - VISIA Imaging S.R.L.

AI summary Proposes TopoPult-SSL, a two-stage framework that uses eyelid masks and clinical metadata as weak priors and applies self-distillation for cross-device meibomian gland segmentation, achieving high accuracy without target gland masks.

Comments 13 pages, 4 figures, 5 tables

Details
Abstract

Every new clinical imaging device creates a domain shift where dense gland masks are expensive, yet cheap clinical signals (eyelid outlines, Pult grades, morphometric ratios) are routinely recorded. We present TopoPult-SSL, a two-stage framework for cross-device meibomian gland segmentation. Stage 1 adapts a source-trained model without target gland masks in the training loss, using four weak-prior anchors driven by target eyelid masks and clinical metadata only. Stage 2, when target gland masks are available, distils complementary Stage-1 teachers into a single compact student via supervised self-distillation. We develop and validate the technique on the public MGD-1k to CAMG research benchmark (1,000 to 100 images, different device), where the distilled model achieves Dice 0.716 ± 0.006 (best 0.726), surpassing UA-MT (0.710) and the ensemble teacher (0.720) with a single pass. The gland-mask-free Stage-1 variant reaches Precision 0.694 vs. 0.30-0.34 for SAM/MedSAM (p<0.001), enabling deployment without dense gland contouring. Code and reproducibility scripts are released.
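The Dice and Precision figures quoted in this abstract are overlap metrics between a predicted and a reference binary mask. A minimal sketch of how they are commonly computed, assuming masks stored as nested lists of 0/1; the function names and the empty-mask conventions are illustrative assumptions, not taken from the paper's released code:

```python
def _flatten(mask):
    """Flatten a 2D mask (list of rows) into a single list of 0/1 values."""
    return [int(v) for row in mask for v in row]


def dice_coefficient(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for two binary masks of equal size."""
    p, t = _flatten(pred), _flatten(target)
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(a * b for a, b in zip(p, t))
    total = sum(p) + sum(t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def precision(pred, target):
    """Precision = TP / (TP + FP): fraction of predicted pixels that are correct."""
    p, t = _flatten(pred), _flatten(target)
    if len(p) != len(t):
        raise ValueError("masks must have the same number of pixels")
    true_positives = sum(a * b for a, b in zip(p, t))
    predicted = sum(p)
    # Assumed convention: an empty prediction has precision 0.
    return 0.0 if predicted == 0 else true_positives / predicted
```

A prediction can score high Precision with a lower Dice when it marks few pixels but marks them correctly, which is why the two numbers are reported separately.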

2606.05328 2026-06-05 cs.GR cs.AI cs.CV cs.LG Updated version

The Invisible Hand of Physics: When Video Diffusion Models Know More Than They Show

Parsa Esmati, Somjit Nath, Katja Hofmann, Derek Nowrouzezahrai, Samira Ebrahimi Kahou, Majid Mirmehdi

Affiliations * University of Bristol; McGill University; Mila–Quebec AI Institute; Microsoft Research; University of Calgary

AI summary Probes the latent trajectories of video diffusion models by inverting the diffusion process and finds that physical plausibility is linearly decodable from diffusion transformer states with 81.27% accuracy, indicating that physically meaningful representations arise as a byproduct of generative denoising.

Details
Abstract

Modern video diffusion models generate increasingly realistic and temporally coherent videos, motivating their use as candidate world simulators. Yet it remains unclear whether these models internally encode physical structure, or merely reproduce motion patterns seen during training. We study this question by probing video diffusion models along latent trajectories corresponding to real videos with known physical plausibility. To obtain such trajectories, we approximately invert the deterministic sampling process by integrating the learned velocity field backward from a clean video latent to noise, giving access to the model's intermediate states and attention maps. Using these recovered trajectories, we show that physical plausibility is linearly decodable from diffusion transformer states across IntPhys and InfLevel, reaching around 81.27% average accuracy and outperforming dedicated representation-learning baselines such as V-JEPA and VideoMAE. Surprisingly, this signal is absent from the VAE latent input and emerges inside the denoising transformer itself, despite the model not being trained with a self-supervised predictive objective. These findings suggest that physically meaningful representations can arise as a byproduct of generative denoising.
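"Linearly decodable" means that a single linear classifier trained on frozen features can separate plausible from implausible clips. A minimal sketch of such a probe, assuming the features have already been extracted: random 2-D vectors stand in for diffusion transformer states, and the data, dimensions, and hyperparameters are illustrative placeholders rather than the paper's setup.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_linear_probe(features, labels, lr=0.5, epochs=200):
    """Fit logistic regression (weights w, bias b) by full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - y
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, features, labels):
    """Fraction of examples whose thresholded linear score matches the label."""
    correct = 0
    for x, y in zip(features, labels):
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((score > 0) == bool(y))
    return correct / len(labels)


# Synthetic stand-in features: label 1 = "plausible", label 0 = "implausible".
rng = random.Random(0)
features, labels = [], []
for _ in range(100):
    features.append([rng.gauss(1.0, 0.3), rng.gauss(1.0, 0.3)])
    labels.append(1)
    features.append([rng.gauss(-1.0, 0.3), rng.gauss(-1.0, 0.3)])
    labels.append(0)

w, b = train_linear_probe(features, labels)
accuracy = probe_accuracy(w, b, features, labels)
```

The sketch scores the probe on its own training data for brevity; a real probing study would report accuracy on held-out clips, and would compare against the same probe trained on the VAE latents to show where the signal emerges.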

2606.05290 2026-06-05 cs.CV cs.AI cs.MM Updated version

Do Models Share Safety Representations? Cross-Model Steering for Safe Visual Generation

Tobia Poppi, Silvia Cappelletti, Sara Sarto, Florian Schiffers, Garin Kessler, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

Affiliations * University of Modena and Reggio Emilia; University of Pisa; Amazon Prime Video

AI summary Proposes the first cross-model safety steering framework, which estimates a safety direction in a source language model and transfers it to a target generator, enabling safety control without target-side unsafe data and without sacrificing generation quality.

Comments Project page: https://aimagelab.github.io/cross-model-safety-representations/

Details
Abstract

Recent progress in generative modeling has made safety control a central challenge, yet existing approaches remain largely model-specific, requiring retraining or tailored interventions for each new architecture. In this work, we ask whether safety can be represented as a portable latent direction, learned once and reused across heterogeneous generators. We introduce the first framework for cross-model safety steering, in which a safety direction is estimated in a source LLM from paired safe-unsafe prompts, transported to a target generator through a lightweight alignment fitted on benign data alone, and applied at inference time. Crucially, our pipeline never accesses unsafe data on the target side, isolating whether safety can be transferred through shared representation geometry. Beyond a single global direction, we also identify a multi-vector extension that captures category-specific safety behaviors, enabling more selective control. We evaluate our approach in text-to-image and text-to-video generation across diverse source-target model pairs. Across models, transferred safety directions achieve ASR reduction and CLIP-Score/FID trade-offs comparable to directions learned natively on the target model using unsafe data, while requiring no target-side unsafe data. This indicates that safety improvements do not come at the expense of generation quality. Our results point to a modular view of safety: safety-relevant behavior is not purely model-local, but can be controlled through latent directions that persist across models. This suggests a new path toward lightweight, reusable safety mechanisms that do not require target-side unsafe data.
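The three stages named in this abstract (estimate a direction in the source model, transport it, apply it at inference) can be illustrated with a toy sketch. The abstract does not specify the estimator or the form of the alignment, so everything here is an assumption: a difference-of-means direction, a plain linear map for the alignment, and additive steering, which is one common activation-steering recipe. All function names are hypothetical.

```python
import math


def safety_direction(safe_states, unsafe_states):
    """Unit-norm mean difference between paired safe and unsafe hidden states."""
    dim, n = len(safe_states[0]), len(safe_states)
    diff = [0.0] * dim
    for s, u in zip(safe_states, unsafe_states):
        for i in range(dim):
            diff[i] += (s[i] - u[i]) / n
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0:
        raise ValueError("safe and unsafe states have identical means")
    return [d / norm for d in diff]


def transport(direction, alignment):
    """Map a source-space direction into the target space with a linear map.

    `alignment` is a (target_dim x source_dim) matrix. Per the abstract it
    would be fitted on benign data only; here it is simply given (assumption).
    """
    return [sum(a * d for a, d in zip(row, direction)) for row in alignment]


def steer(hidden, direction, alpha):
    """Shift a target hidden state along the transported direction."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


# Toy usage: 2-D source space, 3-D target space.
source_direction = safety_direction(
    safe_states=[[1.0, 0.0], [1.0, 0.0]],
    unsafe_states=[[0.0, 0.0], [0.0, 0.0]],
)
toy_alignment = [[0.0, 1.0], [1.0, 0.0], [0.0, 0.0]]
target_direction = transport(source_direction, toy_alignment)
steered = steer([1.0, 1.0, 1.0], target_direction, alpha=2.0)
```

Only the first function touches unsafe data, and it runs on the source model, which mirrors the abstract's point that the target side never sees unsafe examples.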

2606.05275 2026-06-05 cs.CV cs.AI Updated version

Personal AI Agent for Camera Roll VQA

Thao Nguyen, Krishna Kumar Singh, Donghyun Kim, Yong Jae Lee, Yuheng Li

Affiliations * University of Wisconsin-Madison; Korea University; Adobe Research

AI summary Presents the camroll dataset and camroll-agent, an agent that uses hierarchical memory and a tool set to tackle long-horizon, highly personalized visual question answering over personal camera rolls.

Comments Project page, code, and demo: https://thaoshibe.github.io/camroll

Details
Abstract

We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before"). Given the vast nature of the personal camera roll (i.e., multiple years, hundreds to thousands of photos), a successful AI assistant needs to understand a long-horizon, highly personalized visual content stream in order to navigate and locate the correct and/or relevant information. To support this, we collect and manually annotate questions that mimic real-world usage. The final dataset, camroll, contains 50 users, 31,476 images, and 2,500 QA pairs. We further design camroll-agent, a conversational AI agent equipped with hierarchical memory and a minimal set of tools for efficient navigation over large, personalized visual memory. Experimental results show that camroll-agent outperforms numerous baselines and existing methods for long-context understanding in AI agent systems. Together, the camroll dataset and camroll-agent highlight the gap in AI agents' long-context reasoning: personalized visual memory requires different approaches from standard long-context textual memory, especially when consistency, visual details, and user-specific context are present.

2606.05261 2026-06-05 cs.CV cs.AI cs.LG Updated version

NIV: Neural Axis Variations for Variable Font Generation

Nadav Benedek, Ariel Shamir, Ohad Fried

Affiliations * Reichman University

AI summary Proposes NIV, which predicts per-point displacements of glyph outlines to automatically convert a static font into a variable font supporting continuous multi-axis interpolation, and validates its generalization on a newly constructed dataset.

Details
Abstract

Variable fonts enable continuous variation of glyph geometry along semantic design axes such as weight, width, slant, and optical size. However, constructing a variable font from a static font remains a labor-intensive process requiring expert typographic design and manual specification of glyph variation data. We introduce NIV (Neural Axis Variations), a method that automatically converts a static font into a fully functional variable font. Given glyph outlines and a set of desired design axes, NIV predicts per-point displacements. The model operates directly on vector glyph geometry and employs a novel Property Embedding mechanism that captures interactions between multiple axes, enabling consistent multi-axis variation within a unified framework. We train NIV on a newly constructed dataset derived from variable Google Fonts, comprising over one million variation tuples. The resulting model generalizes across unseen code points, unseen font styles, high-complexity CJK glyphs, and even out-of-distribution handwriting inputs. The generated outputs are standard variable font files supporting continuous interpolation via existing rendering engines. To facilitate research, we release the dataset, the complete training and inference implementation, and trained models at https://github.com/ndvbd/NIV. Beyond typography, our approach demonstrates how structured geometric objects with continuous parametric variation can be synthesized using neural deformations.
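To make "per-point displacements" concrete, the sketch below shows how such displacements could move a glyph outline once they are available. It is a simplified model under stated assumptions: each axis contributes independently and linearly, whereas real variable fonts can also store deltas for regions that span several axes. The function name and data layout are hypothetical and are not NIV's interface.

```python
def apply_axis_deltas(outline, deltas_by_axis, axis_coords):
    """Return outline points displaced by the weighted sum of per-axis deltas.

    outline:        list of (x, y) control points of the static glyph
    deltas_by_axis: {axis_tag: list of (dx, dy)}, one delta per outline point
    axis_coords:    {axis_tag: normalized axis coordinate, e.g. in [-1, 1]}
    """
    result = []
    for idx, (x, y) in enumerate(outline):
        for tag, coord in axis_coords.items():
            dx, dy = deltas_by_axis[tag][idx]
            x += coord * dx
            y += coord * dy
        result.append((x, y))
    return result


# Toy usage: a two-point stem whose right edge moves outward as weight grows.
stem = [(0.0, 0.0), (10.0, 0.0)]
weight_deltas = {"wght": [(0.0, 0.0), (4.0, 0.0)]}
half_bold = apply_axis_deltas(stem, weight_deltas, {"wght": 0.5})
```

Because the displaced outline is a continuous function of the axis coordinate, any intermediate weight can be rendered without storing a separate static font for it.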

2606.05259 2026-06-05 cs.CV Updated version

VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

Lin Fu, Zheyuan Yang, Yang Wang, Tingyu Song, Arman Cohan, Yilun Zhao

Affiliations * University of California, Berkeley; Stanford University; University of Toronto; University of Washington; University of Michigan

AI summary Proposes VideoKR, the first large-scale training corpus of its kind, comprising 315K video reasoning examples built with a human-in-the-loop, skill-oriented generation pipeline to strengthen knowledge- and reasoning-intensive video understanding, with effectiveness validated on an expert-annotated benchmark.

Comments ICML 2026 Spotlight

Details
Abstract

We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and reasoning-intensive video understanding. It comprises 315K video reasoning examples over 145K newly collected, CC-licensed, expert-domain videos. We develop a human-in-the-loop, skill-oriented example generation pipeline that targets progressively deeper video reasoning capabilities while ensuring the difficulty, diversity, and reliability of both the examples and their CoT rationales. We also curate VideoKR-Eval, a new expert-annotated benchmark where questions require genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts. Our experiments show that, under a standard SFT→GRPO pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning, highlighting data design as a key driver of progress in video reasoning. We further conduct comprehensive ablations to isolate the contributions of VideoKR, providing actionable insights for future work.

2606.05255 2026-06-05 eess.IV cs.CV cs.GR Updated version

Oklch+: A Three-Parameter Extension of Oklab for Improved Color Difference Prediction

Naoyuki Uchida

Affiliations * Independent Researcher

AI summary Proposes Oklch+, which extends Oklab with a power transformation on the L-axis and a Naka-Rushton compression on the C-axis; on the COMBVD dataset it matches the color difference prediction accuracy of CIEDE2000 with three parameters (STRESS = 29.09 vs. 29.13) and substantially outperforms Oklab.

Comments 3 figures, 8 tables. Submitted to Color Research & Application

Details
Abstract

Oklab and its cylindrical representation Oklch are widely adopted in interpolation and design workflows as perceptually motivated color spaces, but their color difference prediction accuracy falls short of CIEDE2000. We propose Oklch+, a three-parameter extension of Oklab comprising a power transformation on the L-axis and a Naka-Rushton compression on the C-axis, with Euclidean distance computed in the resulting transformed Oklab coordinates. The Naka-Rushton function is bounded in [0,1], reflecting the saturating nature of chroma sensitivity at high colorimetric values. Evaluated on COMBVD -- 3,813 suprathreshold color difference pairs spanning six independent experimental datasets -- Oklch+ achieves STRESS = 29.09, closely matching CIEDE2000 (29.13; difference = 0.04), using only three parameters optimized against color difference data compared to approximately 17 for CIEDE2000. Cross-validation on a held-out BFD-P D65 subset (2,028 pairs) confirms generalization (STRESS = 26.14), with Oklch+ substantially outperforming Oklab (51.45) and achieving STRESS comparable to CIEDE2000 (24.12) on the held-out set. Improvement over Oklab (47.35) is confirmed across all six COMBVD sub-datasets. Because Oklch+ defines a coordinate system in which Euclidean distance approximates perceptual distance, linear interpolation in the transformed space offers substantially improved perceptual uniformity relative to Oklab. Current evaluation is limited to the sRGB-centered COMBVD dataset; validation in high-chroma regions with empirical observer-rated discrimination data remains future work.
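The construction described in this abstract (power law on L, Naka-Rushton compression on chroma, Euclidean distance afterwards) can be sketched as follows. This is an illustration under stated assumptions: the split of the three parameters into an exponent `p` and a Naka-Rushton pair `n`, `sigma` is a guess at the parametrization, the values below are arbitrary placeholders rather than the fitted ones, and the `stress` helper follows the commonly used definition of the STRESS index.

```python
import math


def naka_rushton(c, n, sigma):
    """Saturating response c^n / (c^n + sigma^n); lies in [0, 1) for c >= 0."""
    if c <= 0:
        return 0.0
    cn = c ** n
    return cn / (cn + sigma ** n)


def transform(lab, p, n, sigma):
    """Apply a power law to L and Naka-Rushton compression to chroma, keeping hue."""
    lightness, a, b = lab
    chroma = math.hypot(a, b)
    hue = math.atan2(b, a)
    chroma_t = naka_rushton(chroma, n, sigma)
    return (max(lightness, 0.0) ** p,
            chroma_t * math.cos(hue),
            chroma_t * math.sin(hue))


def color_difference(lab1, lab2, p, n, sigma):
    """Euclidean distance between two Oklab colors in the transformed coordinates."""
    return math.dist(transform(lab1, p, n, sigma), transform(lab2, p, n, sigma))


def stress(predicted, visual):
    """STRESS index between predicted and visual differences (lower is better)."""
    f = sum(e * e for e in predicted) / sum(e * v for e, v in zip(predicted, visual))
    numerator = sum((e - f * v) ** 2 for e, v in zip(predicted, visual))
    denominator = sum((f * v) ** 2 for v in visual)
    return 100.0 * math.sqrt(numerator / denominator)


# Placeholder parameters (assumptions, not the paper's fitted values).
P, N, SIGMA = 0.8, 1.0, 0.2
grey = (0.6, 0.0, 0.0)
reddish = (0.6, 0.1, 0.05)
difference = color_difference(grey, reddish, P, N, SIGMA)
```

STRESS is invariant to a uniform rescaling of the predicted differences, which is why formulas with very different output ranges, such as Oklab distance and CIEDE2000, can be compared on the same scale.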

2606.05254 2026-06-05 cs.LG cs.CV cs.RO Updated version

Flash-WAM: Modality-Aware Distillation for World Action Models

Arman Akbari, Ci Zhang, Arash Akbari, Lin Zhao, Yixiao Chen, Weiwei Chen, Xuan Zhang, Geng Yuan, Yanzhi Wang

Affiliations * Northeastern University; University of Georgia; EmbodyX Inc.

AI summary Addresses the failure of distillation in world action models that jointly generate video and robot actions, caused by asymmetric noise distributions across modalities, by proposing Flash-WAM, a modality-aware step-distillation framework that chooses a parametrization matched to each modality's noise regime, enabling single-step inference and a large speedup.

Details
Abstract

World-action models (WAMs) jointly generate future video and robot actions through iterative diffusion, achieving strong performance on manipulation benchmarks but requiring tens of denoising steps, a cost that precludes real-time control. Step distillation has emerged as the natural remedy, but off-the-shelf methods break down in the joint video-action setting because video and action streams use different SNR-shifted noise schedules and reach training with substantially different marginal noise distributions, an asymmetry that single-modality distillation methods cannot accommodate. We introduce Flash-WAM, a modality-aware step-distillation framework inspired by consistency distillation that selects the consistency function for each modality to match its noise regime: a linear-gradient-scaling parametrization for the action stream's low-noise regime, paired with a variance-preserving parametrization for the video stream's high-noise regime. This choice is grounded in a structural analysis of the consistency-function family that characterizes the achievable gradient scaling under the consistency boundary condition. Instantiated on LingBot-VA, Flash-WAM compresses inference to a single step in each modality. On RoboTwin 2.0, this reduces per-chunk latency from 8.1 seconds to 348 ms on an NVIDIA L40S, a 23× speedup that enables real-time inference. Flash-WAM preserves task success on simulation benchmarks (85.5% on RoboTwin 2.0, 95.7% on LIBERO) and substantially recovers real-world performance (60% average on a Unitree G1 humanoid robot), while naive consistency distillation drops to 24% at the same step budget.

2606.05185 2026-06-05 cs.CY cs.CV cs.LG Updated version

Drishti AI-Event Guardian: An Intelligent Real-Time Crowd Monitoring and Emergency Response System for Mass Gathering Events

Ritabrata Roy Choudhury, Arkajyoti Karmakar, Rudra Pratap Mitra

Affiliations * School of Computer Engineering, Kalinga Institute of Industrial Technology; School of Electronics Engineering, Kalinga Institute of Industrial Technology

AI summary Proposes the Drishti AI-Event Guardian framework, which combines multimodal deep learning techniques such as YOLOv8, anomaly detection, and gradient-boosted regression for real-time crowd density estimation, anomaly detection, predictive modeling, facial recognition, medical emergency reporting, a chatbot, and intelligent guard reallocation; low latency and high accuracy are validated on the Kumbh Mela and RCB Victory Parade events.

Comments 22 pages

Details
Abstract

Mass gathering events are associated with critical safety incidents caused by insufficient crowd monitoring and inadequate emergency response coordination. Traditional surveillance systems lack intelligent analytics, resulting in delayed threat identification, poor resource deployment, and weak support for vulnerable individuals during dense public assemblies. This paper presents Drishti AI-Event Guardian, an intelligent crowd management framework using deep learning for public safety enhancement. The architecture combines multimodal data from CCTV networks and UAV platforms, processed by models on Google Vertex AI infrastructure. Core methods include real-time crowd density estimation using YOLOv8, spatiotemporal anomaly detection, and predictive crowd-flow modeling through gradient-boosted regression. Drishti also integrates four modules: (i) facial recognition for missing person identification with crowd-wide notification; (ii) medical emergency reporting with automated dispatch; (iii) a conversational AI chatbot for reports and complaints; and (iv) an intelligent guard reallocation engine that dynamically reassigns personnel in response to crowd density changes. The system is evaluated on two scenarios, the Kumbh Mela gathering and the RCB Victory Parade event, achieving a crowd density estimation MAE of 3.2 persons/m², an anomaly detection F1-score of 0.91, a facial recognition precision of 0.93, and a median alert latency of 111 ms. Predictive congestion modeling provides five-minute forecasts with a MAPE of 8.3%, enabling preemptive intervention. The chatbot resolved 89% of incident filings without human operators, while guard reallocation reduced responder deployment latency by 34% versus manual reassignment. Results demonstrate a shift from passive surveillance toward active crowd intelligence and a scalable foundation for events ranging from local gatherings to mega festivals.

2606.05172 2026-06-05 cs.HC cs.CV Updated version

Is This Edit Correct? A Multi-Dimensional Benchmark for Reasoning-Aware Image Editing

Yixuan Ding, Wei Huang, Ruijie Quan, Xiaojuan Qi, Yi Yang

Affiliations * Zhejiang University; The University of Hong Kong

AI summary Proposes the RE-Edit benchmark, which evaluates image editing systems along five reasoning dimensions (physical, environmental, cultural, causal, and referential), finds that existing models fall short in reasoning about implicit logical constraints, and introduces a lightweight reasoning-guided post-edit baseline.

Comments 23 pages, 10 figures, 7 tables

Details
Abstract

Diffusion-based image editing has achieved strong visual fidelity under natural language instructions, yet most existing systems still operate at the level of surface instruction following, without reasoning about the implicit contextual constraints embedded in real user requests. This often leads to visually plausible but logically inconsistent edits. In this work, we introduce RE-Edit, a benchmark for REasoning-aware image Editing that evaluates image editing systems across five complementary reasoning dimensions: physical, environmental, cultural, causal, and referential. RE-Edit comprises 1,000 carefully curated samples, each designed such that visual plausibility alone is insufficient and correct editing requires satisfying implicit logical constraints. To support fine-grained analysis, we establish dimension-aligned evaluation criteria and conduct a comprehensive study of ten open-source and two commercial image editing models. Our results show that even advanced systems frequently struggle with implicit multi-dimensional reasoning despite producing high-quality visuals. We further present a lightweight reasoning-guided post-edit baseline as an initial exploration, illustrating how inserting explicit reasoning can help mitigate such failures in a model-agnostic manner.

2606.04811 2026-06-05 cs.CV Updated version

Dream.exe: Can Video Generation Models Dream Executable Robot Manipulation?

Rui Zhao, Kaiming Yang, Jifeng Zhu, Siyang Chen, Ziqi Wang, Weijia Wu, Kevin Qinghong Lin, Heng Wang, Mike Zheng Shou

Affiliations * Show Lab, National University of Singapore; University of Oxford; Tencent

AI summary Proposes the Dream.exe evaluation framework, which uses a video-to-execution pipeline to test whether motion produced by video generation models translates into executable robot manipulation, and finds that visual quality does not predict executability.

Details
Abstract

Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.

2606.03998 2026-06-05 eess.SP cs.CV Updated version

TGSD: Topology-Guided State-Space Diffusion Framework for EEG Spatial Super-Resolution

Zijian Kang, Weiming Zeng, Yueyang Li, Shengyu Gong, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

Affiliations * Lab of Digital Image and Intelligent Computation, Shanghai Maritime University; Department of Language Science and Technology, The Hong Kong Polytechnic University; Affiliated Lianyungang Hospital of Xuzhou Medical University

AI summary Proposes the TGSD framework, a topology-guided state-space diffusion model that uses a Hierarchical Spatial Prior Encoder and a Conditional State-Space Diffusion Reconstructor to recover high-density signals from low-density EEG, outperforming baseline methods on the SEED and PhysioNet MM/I datasets.

Details
Abstract

Low-density EEG is more suitable for wearable and IoT-based brain sensing, but sparse electrode sampling often lacks sufficient spatial information to characterize cross-regional neural activity. EEG spatial super-resolution aims to recover dense-channel EEG from sparse recordings, yet remains challenging because channel missingness typically occurs at the whole-channel level, spatiotemporal dependencies over the full electrode layout are often underexplored, and the mapping from sparse to dense signals is inherently ambiguous. To address these issues, we propose TGSD, a topology-guided state-space diffusion framework for EEG spatial super-resolution. TGSD first employs a Hierarchical Spatial Prior Encoder to learn topology-aware priors over the complete electrode layout by integrating local geometric relationships with region-level contextual information. Based on these priors and sparse observations, a Conditional State-Space Diffusion Reconstructor progressively generates missing-channel signals through reverse diffusion, while alternating temporal and channel-wise state-space modeling captures long-range temporal dynamics and inter-channel dependencies in a unified framework. Experiments on the SEED and PhysioNet MM/I datasets show that TGSD consistently outperforms representative baselines under different super-resolution factors in both reconstruction fidelity and downstream classification performance. These results demonstrate the effectiveness of combining topology-aware spatial priors with conditional diffusion for enhancing practical low-density EEG sensing in wearable and IoT scenarios. The official implementation code is available at https://github.com/jtggz/TGSD.

2606.03730 2026-06-05 cs.CV Updated version

Beyond False Stability: High-Noise Drift Gating for Test-Time Adversarial Defenses in Vision-Language Models

Hashmat Shadab Malik, Muzammal Naseer, Salman Khan

Affiliations * Mohamed Bin Zayed University of AI, UAE; Khalifa University, UAE; Australian National University, Australia

AI summary To address the vulnerability of vision-language models to adversarial attacks at test time, proposes a training-free, plug-in high-noise drift-gating mechanism that triggers a defense when it detects feature instability under high noise, improving the clean-robust trade-off.

Details
Abstract

Vision-language models (VLMs) such as CLIP show strong zero-shot generalization but remain highly vulnerable to adversarial attacks. Adversarial training improves robustness but is computationally expensive, motivating test-time defenses. Recent approaches exploit how CLIP's visual representations respond to stochastic perturbations: aggregating predictions across noisy views, constructing Gaussian noise-averaged anchors and interpolating features toward them, or applying counter-perturbations. These strategies improve robustness but often degrade clean accuracy, yielding an unfavorable clean-robust trade-off. We revisit stochastic test-time defenses and identify an underexplored noise-regime transition in CLIP's representation space. Prior work explored perturbations mainly in the weak-noise regime, where adversarial examples can appear unusually stable (false stability). Our analysis shows this reverses as perturbation strength grows: beyond the weak-noise regime, adversarial representations become markedly more unstable than clean ones, giving a clearer separation signal. The transition is consistent across uniform and Gaussian noise, photometric and geometric transforms, datasets, and diverse attacks. It largely disappears in adversarially trained models, suggesting it is tied to the fragile local-basin geometry of adversarial representations in non-robust CLIP. We propose a training-free, plug-in drift-gated mechanism that uses high-noise feature drift as a lightweight gating signal to trigger existing test-time defenses only when adversarial-like instability is detected. Across 13 datasets it consistently improves the clean-robust trade-off. On eight fine-grained datasets, mean clean+adversarial accuracy rises from 65.7% to 71.4% for counterattack defenses and 68.4% to 73.2% for noise-anchoring; on ImageNet and four shifted variants, from 56.1% to 66.2% and 62.1% to 67.6%.
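The gating idea in this abstract, running a test-time defense only when features move a lot under strong noise, can be illustrated with a toy sketch. The drift statistic (mean cosine distance over noisy copies), the noise level, the threshold, and the two toy encoders are all assumptions chosen for illustration; the abstract does not give the paper's exact formulation.

```python
import math
import random


def cosine_distance(u, v):
    """1 - cosine similarity between two non-zero feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def feature_drift(encode, image, sigma, n_views=8, seed=0):
    """Mean cosine distance between features of an input and of its noisy copies."""
    rng = random.Random(seed)
    reference = encode(image)
    total = 0.0
    for _ in range(n_views):
        noisy = [x + rng.gauss(0.0, sigma) for x in image]
        total += cosine_distance(reference, encode(noisy))
    return total / n_views


def predict_with_gate(encode, defend, image, sigma, threshold):
    """Run the costly defense only when high-noise drift exceeds the threshold."""
    if feature_drift(encode, image, sigma) > threshold:
        return defend(image), True
    return encode(image), False


def stable_encoder(x):
    """Toy encoder whose features barely move under input noise."""
    return [x[0], x[1]]


def fragile_encoder(x):
    """Toy encoder that amplifies one input coordinate, mimicking instability."""
    return [x[0], 100.0 * x[1]]


SIGMA, THRESHOLD = 0.5, 0.2
stable_drift = feature_drift(stable_encoder, [10.0, 0.0], SIGMA)
fragile_drift = feature_drift(fragile_encoder, [1.0, 0.0], SIGMA)
```

Inputs that pass the gate skip the defense and keep their original features, which is consistent with the abstract's claim of an improved clean-robust trade-off.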

2606.03100 2026-06-05 cs.CV cs.LG 版本更新

Zero-Shot 3D Question Answering via Hierarchical View-to-Token Transportation

零样本3D问答通过层级视图到令牌传输

Dongsheng Wang, Dawei Su, Hui Huang

发表机构 * Dongsheng Wang(王东生) Dawei Su(苏大卫) Hui Huang(黄慧)

AI总结 提出KeyVT方法,通过层级视图和令牌级输入上下文收集,结合像素特征与相机参数评估视图重要性,并利用最优传输识别代表性令牌,实现零样本3D问答性能提升。

Comments Accepted at ICML 2026. 19 pages, 6 figures

详情
AI中文摘要

最近,通过2D视觉-语言模型(VLM)进行零样本3D场景理解因其有前景的空间推理能力而受到越来越多的研究关注。通常,从3D点云中采样多个2D视图,并输入预训练的VLM以回答给定问题。这种范式凸显了输入上下文质量的关键作用,并提出了在有限输入预算下尽可能保留与任务相关的3D细节的挑战。我们提出了KeyVT,一种在视图和令牌级别进行输入上下文收集的层级方法。具体来说,我们将像素特征与相机参数结合,并基于语义内容和几何位置评估视图重要性,从而得到空间一致且与任务相关的视图。此外,我们通过最优传输(OT)框架识别代表性令牌来解决选定视图中补丁之间的冗余问题,其中视图令牌和关键令牌被公式化为嵌入空间中的两个离散分布。这些关键令牌通过最小化OT距离期望覆盖所有视图特征。我们在三个广泛使用的基准上评估了我们的框架,结果表明与现有的无调优方法相比有显著改进,并且性能与基于训练的方法相当。

英文摘要

Recently, zero-shot 3D scene understanding via 2D Vision-Language Models (VLMs) has gained increasing research interest due to their promising spatial reasoning capabilities. Typically, multiple 2D views are sampled from a 3D point cloud and fed into pre-trained VLMs to answer a given question. This paradigm highlights the critical role of input context quality and raises the challenge of retaining as many task-relevant 3D details as possible under a limited input budget. We propose \texttt{KeyVT}, a hierarchical approach for input context collection at both the view and token levels. Specifically, we combine pixel features with camera parameters and assess view importance based on both semantic content and geometric position, resulting in spatially consistent and task-relevant views. Furthermore, we address redundancy among patches across selected views by identifying representative tokens under the optimal transport (OT) framework, where view tokens and key tokens are formulated as two discrete distributions in the embedding space. These key tokens are expected to cover all view features by minimizing the OT distance. We evaluate our framework on three widely used benchmarks, demonstrating significant improvements over existing tuning-free methods and performance comparable to training-based approaches.
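
To make the optimal-transport step concrete, here is a minimal sketch of the OT cost between a set of view tokens and a set of candidate key tokens, both treated as uniform discrete distributions. The abstract does not say which solver KeyVT uses; the entropic (Sinkhorn) solver, the squared Euclidean cost, and the parameters `eps` and `n_iters` are assumptions for illustration only.

```python
import math

def sq_dist(a, b):
    """Squared Euclidean distance between two embeddings."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def sinkhorn_cost(view_tokens, key_tokens, eps=1.0, n_iters=200):
    """Transport cost of the entropic-regularised plan between two uniform distributions."""
    n, m = len(view_tokens), len(key_tokens)
    cost = [[sq_dist(v, k) for k in key_tokens] for v in view_tokens]
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    a, b = [1.0 / n] * n, [1.0 / m] * m
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the two marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]
    return sum(plan[i][j] * cost[i][j] for i in range(n) for j in range(m))
```

A lower cost means the key tokens sit closer to the view features they are meant to cover. A larger `eps` spreads the plan more evenly, so the reported cost rises even for well-matched sets.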

2606.02031 2026-06-05 cs.LG cs.AI cs.CL cs.CV 版本更新

OpenWebRL: Demystifying Online Multi-turn Reinforcement Learning for Visual Web Agents

OpenWebRL: 揭秘视觉网络代理的在线多轮强化学习

Rui Yang, Qianhui Wu, Yuxi Chen, Hao Bai, Wenlin Yao, Hao Cheng, Baolin Peng, Huan Zhang, Tong Zhang, Jianfeng Gao

发表机构 * UIUC(伊利诺伊大学香槟分校) Microsoft(微软)

AI总结 提出OpenWebRL框架,通过在线多轮强化学习在真实网站上训练视觉网络代理,以4B参数模型在基准测试中达到开源最优,并与闭源系统竞争。

Comments 36 pages, 11 figures

详情
AI中文摘要

构建强大的视觉网络代理需要长程推理、精确定位以及与动态真实网站的稳健交互。尽管进展迅速,最强的系统仍然大多是专有的,而开放代理仍然严重依赖于对大量策划的网络轨迹进行监督式后训练。这种依赖造成了主要的可扩展性瓶颈:高质量演示的收集成本高昂,而静态数据集对多样且不断变化的开放网络的覆盖有限。尽管在线强化学习在基于文本的代理中显示出前景,但其直接用于在实时网站上训练视觉网络代理的潜力仍未得到充分探索。在本文中,我们介绍了OpenWebRL,一个用于在真实网站上通过在线多轮强化学习训练视觉网络代理的开放框架。OpenWebRL涵盖了完整的训练流程,包括可扩展的实时浏览器基础设施、监督初始化、多模态上下文管理、轨迹级成功判断以及高效的多轮策略优化。使用该框架,我们训练了OpenWebRL-4B,在具有挑战性的实时网络基准测试中建立了新的开源最优水平。仅使用0.4K初始化轨迹和2.2K开放式强化学习训练任务,OpenWebRL-4B在Online-Mind2Web上达到67.0%的成功率,在DeepShop上达到64.0%,优于之前类似或更大规模的开放代理,并与包括OpenAI CUA和Gemini CUA在内的专有系统保持竞争力。除了强大的基准性能外,我们还系统研究了使在线强化学习对视觉网络代理有效的关键设计选择,并分析了强化学习如何改进代理推理。总体而言,我们的工作为构建更强大、可重复且成本效益更高的开放网络代理提供了一条实用路径。我们将发布我们的训练数据、模型和代码以支持未来的研究。

英文摘要

Building capable visual web agents requires long-horizon reasoning, precise grounding, and robust interaction with dynamic real-world websites. Despite rapid progress, the strongest systems remain largely proprietary, while open agents still depend heavily on supervised post-training over large collections of curated web trajectories. This dependence creates a major scalability bottleneck: high-quality demonstrations are expensive to collect, and static datasets offer limited coverage of the diverse, ever-changing open web. Although online RL has shown promise for text-based agents, its potential for training visual web agents directly on live websites remains largely underexplored. In this paper, we introduce OpenWebRL, an open framework for training visual web agents with online multi-turn RL on real websites. OpenWebRL covers the full training pipeline, including scalable live-browser infrastructure, supervised initialization, multimodal context management, trajectory-level success judging, and efficient multi-turn policy optimization. Using this framework, we train OpenWebRL-4B, which establishes a new open-source state of the art on challenging live-web benchmarks. With only 0.4K initialization trajectories and 2.2K open-ended RL training tasks, OpenWebRL-4B achieves 67.0% success on Online-Mind2Web and 64.0% on DeepShop, outperforming prior open agents of similar or larger scale and remaining competitive with proprietary systems including OpenAI CUA and Gemini CUA. Beyond strong benchmark performance, we systematically study the key design choices that make online RL effective for visual web agents, and analyze how RL improves agentic reasoning. Overall, our work offers a practical path toward building more capable, reproducible, and cost-efficient open web agents. We will release our training data, models, and code to support future research.

2606.01935 2026-06-05 cs.CV 版本更新

Unified Driving Tokens: Representation- and Geometry-Guided Discrete Tokenizer for Driving World Models and Planning

统一驾驶令牌:面向驾驶世界模型和规划的表示与几何引导的离散分词器

Ziyang Yao, Zeyu Zhu, YunCheng Jiang, Zibin Guo, Huijing Zhao

发表机构 * Peking University(北京大学) Xiaomi EV(小米电动车)

AI总结 提出一种表示引导与几何增强的离散分词器,通过联合监督学习紧凑令牌,同时优化重建保真度、表示一致性和规划性能。

详情
AI中文摘要

离散视觉令牌应为基于令牌的世界建模和自动驾驶规划提供紧凑表示。然而,大多数分词器继承自图像生成,主要针对像素重建进行优化,这可能导致易于生成的内容与对驾驶决策有用的解码内容之间存在差距。我们提出了一种表示引导和几何增强的分词器,在联合监督下学习离散令牌。该分词器通过特征解码将其离散瓶颈与冻结的DINO特征空间对齐,同时通过感知损失和对抗损失的RGB重建保留外观。为了注入几何状态相关线索,我们在训练期间添加了相邻帧深度和相对姿态监督,并通过多码本量化稳定联合目标。我们使用轻量级规划读出和GPT风格的下一个令牌世界模型评估相同的学习令牌。在NAVSIM上的实验表明,在固定解码器下,重建保真度和表示一致性得到改善,规划性能具有竞争力,并且在匹配设置下生成质量更好。

英文摘要

Discrete visual tokens should provide a compact representation for both token-based world modeling and planning in autonomous driving. However, most tokenizers are inherited from image generation and are optimized mainly for pixel reconstruction, which may leave a gap between what is easy to generate and what is useful to decode for driving decisions. We present a representation-guided and geometry-enhanced tokenizer that learns discrete tokens under joint supervision. The tokenizer aligns its discrete bottleneck with a frozen DINO feature space through feature decoding, while preserving appearance via RGB reconstruction with perceptual and adversarial losses. To inject geometric state-related cues, we add adjacent-frame depth and relative-pose supervision during training and stabilize joint objectives with multi-codebook quantization. We evaluate the same learned tokens with a lightweight planning readout and a GPT-style next-token world model. Experiments on NAVSIM show improved reconstruction fidelity and representation consistency, competitive planning performance under a fixed decoder, and better generative quality under matched settings.

2606.01822 2026-06-05 cs.CV 版本更新

Hierarchically Decoupled Mixture-of-Experts for Robust Traffic Sign Recognition in Complex Driving Scenarios

用于复杂驾驶场景中鲁棒交通标志识别的分层解耦混合专家模型

Mingxiao Wang, Xiaozhen Qu, Bolin Gao, Tong Wang, Lei He

发表机构 * School of Automotive and Traffic Engineering, Liaoning University of Technology(辽宁工业大学汽车与交通工程学院) State Key Laboratory of Intelligent Green Vehicles and Mobility, School of Vehicle and Mobility, Tsinghua University(智能绿色车辆与交通全国重点实验室,清华大学车辆与运载学院)

AI总结 提出分层解耦异构混合专家框架CBDES MoE TSR,通过图像级动态路由机制选择最优专家模型,在复合交通标志数据集上mAP50-95达76.8%,比基线提升2.3%且计算开销降低39.4%。

Comments 9 figures, 3 tables

详情
AI中文摘要

交通标志检测是自动驾驶和智能交通系统中环境感知的基本组成部分。然而,现有大多数检测器依赖具有全局共享参数的静态推理,限制了其适应多样化和非结构化交通场景的能力。因此,单个静态模型通常难以同时处理清晰的近距样本和诸如远距离小目标或恶劣天气环境等挑战性条件。为解决这一局限,我们提出了CBDES MoE TSR,一种用于交通标志识别的分层解耦异构混合专家(MoE)框架。该框架通过引入异构YOLO专家池和轻量级门控网络,摆脱了传统的全局共享参数范式,实现了图像级动态路由机制。基于输入图像的语义特征,门控模块从专家池中选择性激活最合适的专家模型,实现从固定参数拟合到按需动态表示的转变。这种设计增强了特定场景下的特征提取能力,同时保持了可控的推理开销。实验结果表明,所提方法在复合交通标志数据集上实现了检测精度与效率的显著平衡。具体而言,我们的方法达到了76.8%的mAP50-95,相比基线方法(74.5%)提升了2.3%,同时计算开销降低了约39.4%。这些结果有力地验证了所提方法的有效性。

英文摘要

Traffic sign detection is a fundamental component of environmental perception in autonomous driving and intelligent transportation systems. However, most existing detectors rely on static inference with globally shared parameters, limiting their ability to adapt to diverse and unstructured traffic scenarios. As a result, a single static model often struggles to handle both clear near-range samples and challenging conditions such as distant small targets or adverse weather. To address this limitation, we propose CBDES MoE TSR, a hierarchically decoupled heterogeneous mixture-of-experts (MoE) framework for traffic sign recognition. The proposed framework departs from the conventional globally shared parameter paradigm by introducing a heterogeneous You Only Look Once (YOLO) expert pool together with a lightweight gating network, enabling an image-level dynamic routing mechanism. Based on the semantic characteristics of the input image, the gating module selectively activates the most suitable expert model from the expert pool, enabling a shift from fixed parameter fitting to on-demand dynamic representation. This design enhances feature extraction capability for specific scenarios while maintaining controlled inference overhead. Experimental results demonstrate that the proposed method balances detection accuracy and efficiency on the composite traffic sign dataset. Specifically, our method attains an mAP50-95 of 76.8%, a 2.3-percentage-point improvement over the baseline method (74.5%), while reducing computational overhead by approximately 39.4%. These findings validate the effectiveness of the proposed approach.
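
Image-level routing of this kind can be sketched as top-1 gating over a pool of detectors. This is an illustrative sketch only: the `gate` and `experts` callables are hypothetical stand-ins, and the abstract does not state whether CBDES MoE TSR uses exactly one expert per image or a softmax gate.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route_image(gate_logits):
    """Return the index of the highest-scoring expert and the gate probabilities."""
    probs = softmax(gate_logits)
    best = max(range(len(probs)), key=lambda i: probs[i])
    return best, probs

def detect(image, gate, experts):
    """Run only the selected expert, so per-image cost is one gate plus one detector."""
    idx, _ = route_image(gate(image))
    return experts[idx](image), idx
```

Because only one expert runs per image, inference cost depends on which expert the gate picks rather than on the size of the whole pool.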

2606.01113 2026-06-05 cs.CV 版本更新

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

R^3: 基于推理引导的召回与重排序的组合视频检索

Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Weili Guan, Liqiang Nie

发表机构 * Shandong University(山东大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 提出R^3零样本组合视频检索流程,通过生成推理轨迹增强查询表示,并融合重排序验证候选视频,有效解决源视频与编辑指令组合检索的挑战。

详情
AI中文摘要

CoVR-R挑战评估组合视频检索,系统需根据参考视频和文本编辑指令从大型图库中检索目标视频。该设置不是标准的视频-文本检索问题:查询由源视频中的视觉证据和编辑隐含的变换共同定义。强嵌入模型可提供可扩展的候选召回,但可能无法充分表达目标侧后果,如状态变化、动作替换、对象保留或时间一致性。成对多模态重排序器可直接验证此类细节,但全面重排序整个图库在计算上不可行。我们提出R^3,一个基于推理引导的召回与重排序的零样本组合视频检索流程。核心思想是将源-编辑查询转化为推理基础的检索程序,而非将编辑文本视为短标题。首先,模型生成推理轨迹,描述应用编辑后预期的目标视频。然后,将轨迹与源视频一起编码为推理增强查询,并通过一致性门控残差规则与基础组合查询的检索分数融合。最后,重排序器通过直接源-候选比较验证召回候选。实验证明了我们方法在应对该挑战中的有效性。代码可在https://github.com/Lee-zixu/R-3获取。

英文摘要

The CoVR-R challenge evaluates composed video retrieval, where a system must retrieve a target video from a large gallery given a reference video and a textual edit instruction. This setting is not a standard video-text retrieval problem: the query is defined by both the visual evidence in the source video and the transformation implied by the edit. A strong embedding model can provide scalable candidate recall, but it may under-express target-side consequences such as state changes, action replacement, object preservation, or temporal consistency. A pairwise multimodal reranker can verify such details more directly, but exhaustive reranking over the full gallery is computationally infeasible. We present $\mathbb{R}^3$, a zero-shot composed video retrieval pipeline built around Reasoning-guided Recalling and Reranking. The core idea is to turn the source-edit query into a reasoning-grounded retrieval program rather than treating the edit text as a short caption. First, the model generates a reasoning trace that describes the expected target video after applying the edit. Then the trace is encoded together with the source video as a reasoning-augmented query, and its retrieval score is fused with the base composed query through an agreement-gated residual rule. Finally, a reranker verifies the recalled candidates with direct source-candidate comparison. Experiments demonstrate the effectiveness of our method in addressing this challenge. Code is available at https://github.com/Lee-zixu/R-3.
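
The recall, fuse, and rerank stages can be sketched as follows. The abstract does not define the agreement-gated residual rule, so the gate used here (top-k overlap between the two rankings) and the residual `weight` are assumptions invented for illustration, as are all function names.

```python
def top_k(scores, k):
    """Video ids with the k highest scores, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

def fuse_scores(base, reasoning, k=3, min_overlap=1, weight=0.5):
    """Add the reasoning-query score as a residual only when both rankings agree.

    Agreement is measured as top-k overlap; this gate is an illustrative assumption.
    """
    overlap = len(set(top_k(base, k)) & set(top_k(reasoning, k)))
    gate = 1.0 if overlap >= min_overlap else 0.0
    return {vid: base[vid] + gate * weight * reasoning[vid] for vid in base}

def recall_then_rerank(base, reasoning, rerank_fn, n_recall=2):
    """Recall candidates with fused scores, then reorder them with a pairwise reranker."""
    fused = fuse_scores(base, reasoning)
    candidates = top_k(fused, n_recall)
    return sorted(candidates, key=rerank_fn, reverse=True)
```

Reranking only the `n_recall` recalled candidates is what keeps the expensive pairwise comparison off the full gallery.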

2606.00616 2026-06-05 cs.CV cs.AI 版本更新

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

暂停与思考:面向视频基础辅助动作建议的数据集与基准

Shivam Singh, Saptarshi Majumder, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

发表机构 * Advanced Micro Devices, Inc.(超威半导体公司)

AI总结 提出 pause-and-think-T 数据集和 pause-and-think-B 基准,通过推理监督训练紧凑模型,在视频场景理解与目标规划任务中达到与大型模型相当的性能。

详情
AI中文摘要

最近的视觉语言模型(VLM)在视频中的基础推理、时间一致性和上下文感知规划方面存在困难。我们引入了 pause-and-think-T,一个以推理为中心的训练数据集,鼓励模型暂停、基于视觉证据进行推理,并生成简洁、可操作的响应。该数据集在生成答案之前促进结构化推理,引导模型走向类人、基于场景的辅助。我们微调了一个紧凑的 4B 参数模型,并在面向上下文理解和目标规划任务的 pause-and-think-B 基准上对其进行了评估。该模型在参数比 Qwen3-VL-235B(58.9%)少 59 倍的情况下达到了 58.0% 的准确率,在场景理解上与 GPT-5.2 匹配,并超越了 GPT-4o。除了我们的基准之外,该模型在 EgoThink 和 TempCompass 上也表现出强大的分布外性能,在可操作性、辅助性、属性识别、情境推理和时间顺序方面取得了显著提升,且无需特定基准训练。我们的结果表明,有针对性的推理监督使紧凑模型能够提供可操作的、基于视觉的指导,同时泛化到训练数据之外,而无需进行大规模模型扩展。

英文摘要

Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context-aware planning in videos. We introduce pause-and-think-T, a reasoning-centric training dataset that encourages models to pause, reason over visual evidence, and produce concise, actionable responses. The dataset promotes structured reasoning prior to answer generation, guiding models toward human-like, scene-grounded assistance. We fine-tune a compact 4B-parameter model and evaluate it on our pause-and-think-B benchmark targeting contextual understanding and goal planning tasks. The model achieves 58.0% accuracy with 59x fewer parameters than Qwen3-VL-235B (58.9%), matching GPT-5.2 on scene understanding and surpassing GPT-4o. Beyond our benchmark, it also shows strong out-of-distribution performance on EgoThink and TempCompass, with substantial gains in affordance, assistance, attribution recognition, situated reasoning, and temporal order, without benchmark-specific training. Our results indicate that targeted reasoning supervision enables compact models to deliver actionable, visually grounded guidance while generalizing beyond training data, without requiring large-scale model expansion.

2606.00522 2026-06-05 cs.CV 版本更新

A Trajectory-Driven Spatio-Temporal Refinement Solution for CVPR 2026 8th UG2+ Challenge Track 3: DOST

面向 CVPR 2026 第八届 UG2+ 挑战赛赛道三(DOST)的轨迹驱动时空细化解决方案

Hongzhen Li, Miao Yu, Leilei Cao, Youwei Pan, Yingfang Zhu, Fengjie Zhu

发表机构 * TEX AI, Transsion Holdings(TEX AI,传音控股)

AI总结 基于 SegAnyMo 框架,通过数据域自适应和时空后处理模块,提升严重大气畸变下的动态目标分割性能,在挑战赛中获第二名。

详情
AI中文摘要

在这项工作中,我们提出了针对第八届 UG2+ 挑战赛(CVPR 2026)赛道三:湍流中动态目标分割(DOST)的解决方案。我们的方法建立在强大的基线框架 Segment Any Motion (SegAnyMo) 之上,该框架提供了强大的掩码生成和运动跟踪能力。为了进一步提升在严重大气畸变下的分割性能,我们提出了两个关键改进。首先,我们采用以数据为中心的域自适应策略。通过从 DAVIS 数据集和 DOST 数据集的子集中选取序列,并结合模拟大气波动退化,显著扩展了训练数据,增强了模型对复杂几何畸变的鲁棒性。其次,我们引入了时空后处理模块。该细化步骤有效去除了持续存在的边界连接假前景和短时碎片噪声,同时严格保留了真实小目标并保持帧间的原始个体标签。通过上述组合策略,我们的方法在挑战赛中获得了第二名。

英文摘要

In this work, we present our solution for the 8th UG2+ Challenge (CVPR 2026) Track 3: Dynamic Object Segmentation in Turbulence (DOST). Our method is built upon the strong baseline framework Segment Any Motion (SegAnyMo), which provides powerful mask generation and motion tracking capabilities. To further boost segmentation performance under severe atmospheric distortions, we propose two key improvements. First, we employ a data-centric domain adaptation strategy. We significantly expand our training data by incorporating selected sequences from the DAVIS dataset alongside a subset of the DOST dataset, and apply simulated atmospheric fluctuation degradations to enhance the model's robustness against complex geometric distortions. Second, we introduce a spatio-temporal post-processing module. This refinement step effectively removes persistent boundary-connected false foregrounds and short-lived fragmented noise, while strictly preserving genuine small targets and maintaining original individual labels across frames. With these combined strategies, our method ranks 2nd in the challenge.

2605.30819 2026-06-05 cs.CV cs.GR 版本更新

Function2Scene: 3D Indoor Scene Layout from Functional Specifications

Function2Scene: 基于功能规范的3D室内场景布局

Ruiqi Wang, Qimin Chen, Daniel Ritchie, Angel X. Chang, Manolis Savva, Kai Wang, Hao Zhang

发表机构 * Simon Fraser University(西蒙弗雷泽大学) Brown University(布朗大学)

AI总结 提出Function2Scene框架,通过解析自然语言设计简报中的用户角色和活动,从17个功能约束准则生成布局,并利用LLM和VLM的迭代检查-修复循环优化,在30个专业案例中94.3%的成对比较优于基线方法。

Comments project page: https://function2scene.github.io/

详情
AI中文摘要

大多数文本驱动的3D室内场景合成方法从以物体为中心的提示生成房间,询问应放置什么家具而不是如何使用空间。然而,在实际室内设计中,布局的好坏取决于其对居住者的支持程度,例如他们的活动和身体需求。我们引入了Function2Scene,一个从功能规范(即描述谁将使用房间以及他们需要在那里做什么的自然语言设计简报)生成3D室内布局的框架。给定这样的规范,我们的系统解析居住者角色和活动,从涵盖空间、人体工程学、活动和环境考虑的17个标准分类中导出一组定制的功能设计约束,并使用这些约束来指导布局生成。Function2Scene不依赖LLM直接生成最终场景,而是通过工具增强的检查-修复循环进行迭代评估和细化,结合几何测量、基于LLM的上下文推理和基于VLM的视觉评估。在30个专业编写的室内设计案例上的实验表明,Function2Scene生成的布局比最近的基于LLM的场景合成基线更好地满足功能需求,我们的结果在94.3%的成对比较中被偏好。我们的工作将文本驱动的室内场景合成从放置合理的物体重新定义为设计支持人类使用的空间。

英文摘要

Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how the space is used. Yet in real interior design, a layout is judged by how well it supports its occupants, e.g., their activities and physical needs. We introduce Function2Scene, a framework for generating 3D indoor layouts from functional specifications, i.e., natural-language design briefs describing who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria spanning spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on an LLM to directly produce a final scene, Function2Scene performs iterative evaluation and refinement through a tool-augmented check-and-repair loop, combining geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that better satisfy functional requirements than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.

2605.30467 2026-06-05 cs.CV 版本更新

Clustering Guided Domain-Specific Pretrained Foundation Model for Very High-Resolution Arctic Remote Sensing

聚类引导的领域特定预训练基础模型用于极高分辨率北极遥感

Amal S. Perera, Chandi Witharana, Elias Manos, Michael Pimenta, Anna K. Liljedahl

发表机构 * Woodwell Climate Research Center(伍德韦尔气候研究中心)

AI总结 提出结合多样性感知区域图像筛选与掩码自编码器自监督预训练的北极遥感基础模型,在四个标注数据集上显著提升前景F1分数。

详情
AI中文摘要

本研究引入了一种新颖的北极聚焦遥感基础模型(RSFM),通过将多样性感知的区域尺度图像筛选与Vision Transformer(ViT)编码器的掩码自编码器(MAE)自监督预训练相结合,用于极高空间分辨率(VHSR)卫星图像分析。利用光谱和采集元数据描述符,在可扩展的亲和传播聚类工作流中,从267 TB的Vantor VHSR图像中选取约300万张图块。这种筛选策略旨在减少视觉重复或低信息区域的过采样,同时保留研究区域内广泛的场景多样性。我们在筛选后的语料库上使用领域适应的MAE重建目标预训练了ViT-Large编码器,生成了用于下游特征映射的北极特定Transformer权重。预训练编码器被集成到一个现有的位置感知检测与分割框架中,并在四个手工标注的北极数据集上进行了评估。与ImageNet初始化的ViT-Large基线相比,北极MAE预训练在基础设施、IWP、RTS和TCNs上分别产生了0.87、0.72、0.93和0.87的前景平均F1分数一致提升,提高了约5-8个百分点。所提出的模型在所有下游比较中也优于Prithvi-EO-2.0,最小的增益对应至少15个百分点的平均F1提升,这表明在筛选的北极VHSR图像上进行领域特定的自监督预训练,为精细尺度的北极制图提供了比通用地球观测基础模型更具可迁移性的表示。这些结果证明,在保持架构和MAE目标不变的情况下,优化区域尺度的预训练数据分布可以产生一个可重用的北极领域编码器,用于多种VHSR遥感应用。

英文摘要

This study introduces a novel Arctic-focused remote sensing foundation model (RSFM) by combining diversity-aware regional-scale image curation with masked autoencoder (MAE) self-supervised pretraining of a Vision Transformer (ViT) encoder for very-high-spatial-resolution (VHSR) satellite image analysis. Spectral and acquisition-metadata descriptors were used in a scalable affinity-propagation clustering workflow to select approximately 3 million chips from 267 TB of Vantor VHSR imagery. This curation strategy was designed to reduce oversampling of visually repetitive or low-information areas while preserving broad scene diversity across the study domain. We pretrained a ViT-Large encoder on the curated corpus using a domain-adapted MAE reconstruction objective, producing Arctic-specific transformer weights for downstream feature mapping. The pretrained encoder was integrated into an existing location-aware detection and segmentation framework and evaluated across four hand-labeled Arctic datasets. Compared to an ImageNet-initialized ViT-Large baseline, Arctic MAE pretraining produced consistent improvements, reaching foreground mean F1 scores of 0.87, 0.72, 0.93, and 0.87 for infrastructure, IWP, RTS, and TCNs, respectively, an increase of approximately 5-8 percentage points. The proposed model also outperformed Prithvi-EO-2.0 in all downstream comparisons, with the smallest gain corresponding to an improvement of at least 15 percentage points in mean F1, suggesting that domain-specific self-supervised pretraining on curated Arctic VHSR imagery provides more transferable representations for fine-scale Arctic mapping than a general-purpose Earth observation foundation model. These results demonstrate that optimizing the pretraining data distribution at regional scale, while keeping the architecture and MAE objective fixed, can produce a reusable Arctic-domain encoder for multiple VHSR remote sensing applications.

2605.24481 2026-06-05 cs.CV 版本更新

OmniEgo-R$^2$: A Routed Reasoning Framework for the 1st Cross-Domain EgoCross Challenge at CVPR 2026

OmniEgo-R$^2$:面向CVPR 2026首届跨领域EgoCross挑战赛的路由推理框架

Zixu Li, Zhiwei Chen, Zhiheng Fu, Wenbo Wang, Yupeng Hu, Weili Guan, Liqiang Nie

发表机构 * Shandong University(山东大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 针对跨领域自我中心视频推理中的时间边界模糊、语义粒度不匹配和决策不稳定问题,提出OmniEgo-R$^2$路由推理框架,在Source-Limited和Open-Source赛道均获第二名。

Comments Technical Report for the 1st Cross-Domain EgoCross Challenge at CVPR 2026

详情
AI中文摘要

CVPR 2026 EgoVis首届跨领域EgoCross挑战赛评估多模态大语言模型在手术、工业、极限运动和动物视角等自我中心视频上的推理能力。我们在Source-Limited和Open-Source赛道均获得第二名。在本报告中,我们将EgoCross定义为一个鲁棒的跨领域具身视频推理问题,而非简单的多项选择视觉问答任务。我们识别出三个关键挑战:(C1)时间边界模糊,关键状态转换稀疏采样且常发生在帧间;(C2)跨领域语义粒度不匹配,相同能力需要不同的领域特定视觉语法;(C3)接近选项下的决策不稳定,长多模态推理可能选择无支撑的干扰项或产生畸形输出。为解决这些问题,我们提出OmniEgo-R$^2$(全领域自我中心路由推理),一个统一的路由推理流水线,包括时间证据归一化、领域无关能力路由、结构化感知-动态-决策推理、边界感知选项验证和防御性答案校准。OmniEgo-R$^2$使用每个EgoCross领域上的Qwen3-VL-4B-SFT检查点作为视觉语言骨干,并用轻量级测试时推理和解析程序包装。最终提交在Source-Limited赛道获得66.35%总体准确率,在Open-Source赛道获得66.77%,均位列第二。代码见https://github.com/Lee-zixu/OmniEgo-R2。

英文摘要

The 1st Cross-Domain EgoCross Challenge at EgoVis, CVPR 2026 evaluates whether multimodal large language models can reason over egocentric videos across surgery, industry, extreme sports, and animal perspective. We achieved second place in both the Source-Limited and Open-Source tracks. In this report, we formulate EgoCross as a robust cross-domain embodied video reasoning problem rather than a simple multiple-choice visual question answering task. We identify three key challenges: (C1) temporal boundary ambiguity, where critical state transitions are sparsely sampled and often occur between frames; (C2) cross-domain semantic granularity mismatch, where the same capability requires different domain-specific visual grammar; and (C3) decision instability under close options, where long multimodal reasoning can select unsupported distractors or produce malformed outputs. To address them, we propose OmniEgo-R$^2$ (Omnidomain Egocentric Routed Reasoning), a unified routed reasoning pipeline consisting of temporal-evidence normalization, domain-agnostic capability routing, structured perception--dynamics--decision reasoning, boundary-aware option verification, and defensive answer calibration. OmniEgo-R$^2$ uses the Qwen3-VL-4B-SFT checkpoints on each EgoCross domain as the visual-language backbone, and wraps them with lightweight test-time reasoning and parsing programs. Our final submissions obtain 66.35% overall accuracy in the Source-Limited track and 66.77% in the Open-Source track, ranking second on both leaderboards. The code is available at https://github.com/Lee-zixu/OmniEgo-R2.

2605.26761 2026-06-05 cs.CV 版本更新

Once-For-All: A Train-Once and Select-Anytime Framework for Multimodal Instruction Tuning

Once-For-All: 一种用于多模态指令微调的“一次训练,随时选择”框架

Mingkang Dong, Hongyi Cai, Xiwen Lei, Jie Li, Tao Zhang, Muxin Pu

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 提出OFA框架,通过一次训练可迁移的选择器,无需重新计算即可从任意数据集或模型中筛选出最具信息量的多模态指令数据,实现高效微调。

Comments 15 pages, 6 figures. Mingkang Dong and Hongyi Cai contributed equally to this work. Muxin Pu is the corresponding author

详情
AI中文摘要

多模态指令微调是适应视觉语言模型(VLM)的事实标准方法,然而指令数据高度冗余,使得数据选择对训练效率至关重要。现有方法从特定模型或数据集中导出选择信号,因此每当目标模型或候选池发生变化时,必须从头重新计算标准,代价高昂。为了解决这一问题,我们提出了OFA,一个数据选择框架,该框架训练一次可重用的选择器,并将其应用于任何数据集或模型而无需重新计算。OFA在冻结的CLIP空间中对多模态指令进行聚类,从聚类结构中导出伪标签,并仅训练几个epoch的轻量级选择器;该选择器最不确信的样本被选为最具信息量的样本。一旦训练完成,冻结的选择器可直接跨数据集和模型规模迁移。选择器在LLaVA-665K上训练一次,然后应用于LLaVA-665K本身,以及无需任何重新训练的未见过的Vision-Flan-186K。仅选择15%的数据,OFA在10个下游基准测试中达到了全数据性能的98.3%;在较小的Vision-Flan-186K上,迁移的选择器比全数据训练高出10.6%,证实了学习到的信号泛化到了选择器训练期间从未见过的数据集。相同的选定子集在Qwen2.5-VL-3B和LLaVA-v1.5-7B上均有益于VLM,无需针对每个模型重新计算,从而将选择与目标模型解耦。这些结果表明,单个可迁移的选择器为高效的多模态指令微调提供了一种有效且可重用的解决方案。

英文摘要

Multimodal instruction tuning is the de facto recipe for adapting vision-language models (VLMs), yet instruction data are highly redundant, making data selection critical for training efficiency. Existing methods derive selection signals from a specific model or dataset, so whenever the target model or candidate pool changes, the criteria must be recomputed from scratch at substantial cost. To address this, we propose OFA, a data selection framework that trains a reusable selector once and applies it to any dataset or model without recomputation. OFA clusters multimodal instructions in a frozen CLIP space, derives pseudo labels from the cluster structure, and trains a lightweight selector for only a few epochs; samples on which this selector is least confident are selected as the most informative. Once trained, the frozen selector transfers directly across datasets and model scales. The selector is trained once on LLaVA-665K and applied both to LLaVA-665K itself and, without any retraining, to the unseen Vision-Flan-186K. Selecting only 15% of the data, OFA achieves 98.3% of full-data performance across 10 downstream benchmarks; on the smaller Vision-Flan-186K, the transferred selector surpasses full-data training by 10.6%, confirming that the learned signal generalizes to datasets never seen during selector training. The same selected subsets benefit both Qwen2.5-VL-3B and LLaVA-v1.5-7B without per-model recomputation, decoupling selection from the target model. These results demonstrate that a single, transferable selector provides an effective and reusable solution for efficient multimodal instruction tuning.
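
The final selection step, keeping the samples on which the selector is least confident, can be sketched as below. The abstract does not define confidence; taking it as the maximum predicted probability over pseudo-label classes is an assumption, and the function names are hypothetical.

```python
def max_prob_confidence(class_probs):
    """Confidence of one sample, assumed here to be its highest class probability."""
    return max(class_probs)

def select_least_confident(confidences, fraction=0.15):
    """Return sorted indices of the lowest-confidence samples, keeping `fraction` of them."""
    n_keep = max(1, round(len(confidences) * fraction))
    order = sorted(range(len(confidences)), key=lambda i: confidences[i])
    return sorted(order[:n_keep])
```

Since the selector is frozen after training, the same two functions could be applied to a new candidate pool by scoring it once; nothing in the selection step depends on the downstream VLM.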

2605.26236 2026-06-05 cs.CV cs.SD 版本更新

DuoGesture: Neuro-Inspired and Biomechanically Informed Dual-Stream Co-Speech Gesture Generation

DuoGesture: 神经启发与生物力学约束的双流共语手势生成

Ferdinand Paar, Lanmiao Liu, Aslı Özyürek, Serge Thill, Esam Ghaleb

发表机构 * Max Planck Institute for Psycholinguistics(马克斯·普朗克心理语言学研究所) Radboud University(拉德堡德大学) Utrecht University(乌得勒支大学)

AI总结 提出DuoGesture,一种神经启发和生物力学约束的双流方法,通过语义变分信息瓶颈协调语义流和节拍流,实现语义表达与生物力学合理的节律运动。

详情
AI中文摘要

共语手势生成需要语义表达性和生物力学合理的节律运动。现有的整体手势模型混合了基于词汇的语义手势和频繁的韵律对齐节拍手势,这限制了语义基础、语音-运动对齐和运动平滑性。我们提出DuoGesture,一种神经启发和生物力学约束的双流方法,将共语手势合成分解为耦合的语义流和节拍流。两个流通过语义变分信息瓶颈协调,这是一个随机帧级门控,学习何时语义手势应覆盖节律节拍运动。语义流由运动基础语义条件控制,该条件用运动-语言表示替代纯语言词嵌入,为手势的长尾词汇触发提供运动对齐的语义先验。节拍流进一步由惯性节拍先验正则化,这是一个基于人体测量学的臂链模块,减少抖动并提高节律一致性而不约束语义帧。客观评估和主观实验表明,DuoGesture优于强整体基线,而组件消融证实了语义基础、随机流选择和生物力学正则化的互补作用。

英文摘要

Co-speech gesture generation requires both semantic expressivity and biomechanically plausible rhythmic motion. Existing holistic gesture models mix lexically grounded semantic gestures with frequent prosody-aligned beat gestures. This limits semantic grounding, speech-motion alignment, and kinematic smoothness. We propose \emph{DuoGesture}, a neuro-inspired and biomechanically informed dual-stream approach that decomposes co-speech gesture synthesis into coupled semantic and beat streams. The two streams are coordinated by a \emph{Semantic Variational Information Bottleneck}, a stochastic frame-level gate that learns when semantic gestures should override rhythmic beat motion. The semantic stream is controlled by \emph{Motion-Grounded Semantic Conditioning}, which replaces purely linguistic word embeddings with motion-language representations to provide motion-aligned semantic priors for long-tailed lexical triggers of gestures. The beat stream is further regularised by an \emph{Inertial Beat Prior}, an anthropometry-weighted arm-chain module that reduces jitter and improves rhythmic consistency without constraining semantic frames. Objective evaluations and subjective experiments show that DuoGesture outperforms strong holistic baselines, while component ablations confirm the complementary roles of semantic grounding, stochastic stream selection, and biomechanical regularisation.

2602.03890 2026-06-05 cs.CV 版本更新

4DPC$^2$hat: Towards Dynamic Point Cloud Understanding with Failure-Aware Bootstrapping

4DPC$^2$hat: 面向动态点云理解的失败感知自举学习

Xindan Zhang, Weilong Yan, Yufei Shi, Xuerui Qiu, Tao He, Ying Li, Ming Li, Hehe Fan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出首个针对动态点云理解的多模态大语言模型4DPC$^2$hat,通过构建大规模跨模态数据集4DPC$^2$hat-200K和引入Mamba增强的时间推理模块及失败感知自举学习策略,显著提升了动作理解与时间推理能力。

Comments Accepted by ICML 2026

详情
AI中文摘要

点云提供了3D对象的紧凑且富有表现力的表示,最近已被集成到多模态大语言模型(MLLMs)中。然而,现有方法主要关注静态对象,而理解动态点云序列仍基本未被探索。这一限制主要是由于缺乏大规模跨模态数据集以及在时空上下文中建模运动的难度。为弥补这一差距,我们提出了4DPC$^2$hat,这是首个专为动态点云理解设计的MLLM。为此,我们通过一个精心设计的两阶段流程构建了大规模跨模态数据集4DPC$^2$hat-200K,该流程包括拓扑一致的4D点构建和两级标注。该数据集包含超过44K个动态对象序列、700K个点云帧和200K个精心策划的问答对,支持关于计数、时间关系、动作、空间关系和外观的查询。在该框架的核心,我们引入了一个Mamba增强的时间推理MLLM,以捕捉点云序列中的长程依赖和动态模式。此外,我们提出了一种失败感知的自举学习策略,该策略迭代地识别模型缺陷并生成有针对性的问答监督,以持续增强相应的推理能力。大量实验表明,与现有模型相比,我们的4DPC$^2$hat显著提高了动作理解和时间推理能力,为4D动态点云理解奠定了坚实基础。

英文摘要

Point clouds provide a compact and expressive representation of 3D objects, and have recently been integrated into multimodal large language models (MLLMs). However, existing methods primarily focus on static objects, while understanding dynamic point cloud sequences remains largely unexplored. This limitation is mainly caused by the lack of large-scale cross-modal datasets and the difficulty of modeling motions in spatio-temporal contexts. To bridge this gap, we present 4DPC$^2$hat, the first MLLM tailored for dynamic point cloud understanding. To this end, we construct a large-scale cross-modal dataset 4DPC$^2$hat-200K via a meticulous two-stage pipeline consisting of topology-consistent 4D point construction and two-level captioning. The dataset contains over 44K dynamic object sequences, 700K point cloud frames, and 200K curated question-answer (QA) pairs, supporting inquiries about counting, temporal relationship, action, spatial relationship, and appearance. At the core of the framework, we introduce a Mamba-enhanced temporal reasoning MLLM to capture long-range dependencies and dynamic patterns among a point cloud sequence. Furthermore, we propose a failure-aware bootstrapping learning strategy that iteratively identifies model deficiencies and generates targeted QA supervision to continuously strengthen corresponding reasoning capabilities. Extensive experiments demonstrate that our 4DPC$^2$hat significantly improves action understanding and temporal reasoning compared with existing models, establishing a strong foundation for 4D dynamic point cloud understanding.

2605.29219 2026-06-05 cs.CV 版本更新

SalsaAgent: A multimodal embodied language model for interactive dance generation

SalsaAgent: 一种用于交互式舞蹈生成的多模态具身语言模型

Payam Jome Yazdian, Zoe Stanley, Angelica Lim

发表机构 * Simon Fraser University(西蒙弗雷泽大学)

AI总结 提出SalsaAgent语言模型,通过非语言运动令牌传递和两阶段令牌到扩散管道,生成与人类领舞者及音乐背景交互的全身萨尔萨舞蹈动作。

Comments Project page: https://pjyazdian.github.io/Salsa-Agent

详情
AI中文摘要

人形机器人之间的交互涉及双向和非语言反应性、协调与同步。为了构建具有社会意识的机器人和交互式虚拟代理,我们提出了SalsaAgent,一种语言模型,能够生成表达性的全身萨尔萨舞蹈动作,以响应人类领舞者并配合背景音乐。我们将交互形式化为非语言运动令牌传递,扩展了大语言模型(LLM)的词汇表,以处理离散运动令牌、成对关系令牌和音频。我们的贡献包括:用于全身和运动关系的新令牌、使用自动推导的骨架动力学文本描述进行令牌对齐的LLM微调,以及两阶段令牌到扩散管道。主观和客观评估表明,我们的方法在运动质量、音乐与伙伴协调以及一致的双人空间行为方面具有有效性,显著优于基线方法。

英文摘要

Interaction between humanoids involves bidirectional and nonverbal reactivity, coordination and synchrony. Toward socially aware robots and interactive virtual agents, we present SalsaAgent, a language model that generates expressive, full-body salsa dance motions in reaction to a human leader and against a contextual music backdrop. We formulate interaction as nonverbal motion token passing, extending the vocabulary of a large language model (LLM) to process discrete motion tokens, pairwise relation tokens, and audio. Our contributions include new tokens for full-body and motion relations, LLM fine-tuning using automatically derived text descriptions of skeleton dynamics for token grounding, and a two-stage token-to-diffusion pipeline. Subjective and objective evaluations demonstrate the effectiveness of our approach in terms of motion quality, music and partner coordination, and consistent two-person spatial behavior, with significant improvements over baselines.

2605.25970 2026-06-05 cs.CV 版本更新

PathWISE: Multi-Agent Cancer Pathway Triaging Ontology Learning from Clinical Flowcharts

PathWISE: 基于临床流程图的多智能体癌症路径分诊本体学习

Sofiat Abioye, Ufaq Khan, Shazad Ashraf, Mohammed Adil Butt, Andrew D. Beggs, Adam Byfield, Anusha Jose, Junaid Qadir, Muhammad Bilal

发表机构 * Birmingham City University(伯明翰城市大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) University Hospitals Birmingham NHS Foundation Trust(伯明翰大学医院国家健康服务信托基金) NHS England (National Health Service)(英格兰国家健康服务局) Qatar University(卡塔尔大学)

AI总结 提出PathWISE五阶段流水线,结合四个基于LLM的智能体、确定性深度优先搜索审计器和Java编译器批评者,将临床流程图转化为可执行的HL7 CQL库,覆盖100%患者路径,并在五种NHS癌症路径上验证。

Comments 13 pages, 4 figures

详情
AI中文摘要

临床路径以视觉流程图形式传播,其中空间拓扑、箭头方向、颜色编码和字体粗细编码了关键的分诊逻辑,但这些逻辑对计算系统仍然不可访问。我们提出PathWISE,一个五阶段流水线,结合四个基于LLM的智能体与确定性深度优先搜索审计器和Java编译器批评者,将这些不可计算的人工制品转化为经过验证、可执行的HL7临床质量语言(CQL)库,可部署为FHIR CDS Hooks服务。专门构建的智能体将流程图结构提取为类型化有向图,执行确定性路径枚举,对每个节点的可计算性进行结构化语义审计,生成经官方Java CQL-to-ELM编译器验证的术语约束CQL定义,并产生覆盖100%枚举患者路径的路由逻辑。在五种英国NHS癌症路径(结直肠、肺、皮肤、上消化道和乳腺)上展示,PathWISE审计多达183个节点(混合配置下182个),识别四个问题类别中的544个结构化治理发现,实现100%语法编译成功,其中UNCOMPUTABLE节点被赋予false占位符以保持可编译性,同时暴露治理差距供临床审查,并为字典覆盖的概念产生零幻觉术语代码。关键的是,PathWISE将非确定性LLM推理限制在知识提取上,而确定性图数学和标准编译器支撑每个验证步骤。

英文摘要

Clinical pathways are disseminated as visual flowcharts where spatial topology, arrow direction, colour coding, and font weight encode critical triage logic that remains inaccessible to computational systems. We present PathWISE, a five-phase pipeline combining four LLM-based agents with a deterministic depth-first search auditor and a Java compiler critic, transforming these non-computable artefacts into validated, executable HL7 Clinical Quality Language (CQL) libraries deployable as FHIR CDS Hooks services. Purpose-built agents extract flowchart structure into a typed directed graph, perform deterministic path enumeration, conduct a structured semantic audit of every node's computability, generate terminology-constrained CQL definitions verified by the official Java CQL-to-ELM compiler, and produce routing logic covering 100% of enumerated patient journeys. Demonstrated across five UK NHS cancer pathways (colorectal, lung, skin, upper GI, and breast), PathWISE audits up to 183 nodes (182 under the Hybrid configuration), identifies 544 structured governance findings across four issue categories, achieves 100% syntactic compilation success, with UNCOMPUTABLE nodes receiving false placeholders that preserve compilability while surfacing governance gaps for clinical review, and produces zero hallucinated terminology codes for dictionary-covered concepts. Critically, PathWISE confines non-deterministic LLM inference to knowledge extraction while deterministic graph mathematics and a standard compiler underpin every verification step.
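
The deterministic path enumeration step described above can be illustrated with a small sketch. It is not the PathWISE implementation: the graph layout (a dictionary mapping each node to its successors), the function name `enumerate_paths`, and the node labels are all assumptions made for illustration.

```python
def enumerate_paths(graph, start):
    """Return every simple path from `start` to a terminal node.

    `graph` maps a node name to the list of its successors; a node with
    no successors is treated as the end of a patient journey.
    (Hypothetical data layout, not taken from the paper.)
    """
    paths = []
    stack = [(start, [start])]  # iterative depth-first search
    while stack:
        node, path = stack.pop()
        successors = graph.get(node, [])
        if not successors:
            paths.append(path)
            continue
        # Push in reverse so successors are explored in their listed order.
        for nxt in reversed(successors):
            # Skip nodes already on the path so cyclic flowcharts terminate.
            # A branch whose only exits are back-edges is dropped in this sketch.
            if nxt not in path:
                stack.append((nxt, path + [nxt]))
    return paths


# Toy flowchart with made-up node labels.
toy_graph = {
    "referral": ["triage"],
    "triage": ["colonoscopy", "discharge"],
    "colonoscopy": ["mdt"],
    "mdt": [],
    "discharge": [],
}
all_paths = enumerate_paths(toy_graph, "referral")
```

Because the traversal is deterministic, the same flowchart graph always yields the same list of paths, so a coverage claim can be checked by comparing generated routing rules against this list.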

2605.25956 2026-06-05 cs.CV 版本更新

RAPTOR+: A Visually Grounded Vision-Language Framework to Improve Clinical Trust and Auditability in Automated Cancer Referral Processing

RAPTOR+: 一种基于视觉的视觉-语言框架,用于提高自动化癌症转诊处理中的临床信任度和可审计性

Sofiat Abioye, Ufaq Khan, Shazad Ashraf, Anusha Jose, Adam Byfield, Lukman Akanbi, Muhammad Bilal

发表机构 * Birmingham City University(伯明翰城市大学) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学) University Hospitals Birmingham NHS Foundation Trust(伯明翰大学医院 NHS 基础信托) NHS England(英格兰国家卫生服务体系)

AI总结 提出RAPTOR+多模态框架,通过微调视觉-语言模型实现端到端转诊理解,在结直肠癌转诊表单上显著提升提取准确性和证据定位能力。

Comments 12 pages, 4 figures

详情
AI中文摘要

紧急疑似结直肠癌(CRC)转诊会因半结构化临床文档通常需要人工审查和转录而造成操作瓶颈。原始的RAPTOR系统使用大型语言模型进行结构化提取,但依赖单独的OCR阶段,使其易受手写、布局变化和视觉证据链接丢失的影响。我们提出RAPTOR+,一种多模态扩展,使用视觉-语言模型(VLM)进行端到端转诊理解。我们在223份临床整理的CRC紧急转诊表单上评估了微调VLM、商业和开源零样本VLM以及基于OCR的原始流水线。我们还引入了一种基于定位的评估框架,同时衡量提取准确性和证据定位。结果显示零样本模型存在明显的定位差距。Gemini 2.5 Flash实现了92.6%的读取准确率,但严格安全性仅为1.2%。相比之下,微调的Qwen3-VL-8B实现了96.1%的读取准确率和60.6%的严格安全性,显著改善了可验证的证据定位。这些发现表明,任务特定的微调对于可靠、可审计的临床文档理解至关重要。RAPTOR+使得提取的转诊决策能够与视觉证据关联,支持更安全、更高效的癌症转诊分诊。

英文摘要

Urgent suspected colorectal cancer (CRC) referrals create operational bottlenecks because semi-structured clinical documents often require manual review and transcription. The original RAPTOR system used Large Language Models for structured extraction but relied on a separate OCR stage, making it vulnerable to handwriting, layout variation, and loss of visual evidence linkage. We present RAPTOR+, a multimodal extension that uses Vision-Language Models (VLMs) for end-to-end referral understanding. We evaluate fine-tuned VLMs, commercial and open-source zero-shot VLMs, and the original OCR-based pipeline on 223 clinically curated CRC urgent referral forms. We also introduce a grounding-aware evaluation framework that measures both extraction accuracy and evidence localisation. Results show a clear grounding gap in zero-shot models. Gemini 2.5 Flash achieved 92.6% Reading Accuracy but only 1.2% Strict Safety. In contrast, fine-tuned Qwen3-VL-8B achieved 96.1% Reading Accuracy and 60.6% Strict Safety, substantially improving verifiable evidence grounding. These findings show that task-specific fine-tuning is essential for reliable, auditable clinical document understanding. RAPTOR+ enables extracted referral decisions to be linked to visual evidence, supporting safer and more efficient cancer referral triage.

2605.24500 2026-06-05 cs.CV 版本更新

EgoAdapt: A Multi-Scene Egocentric Adaptation Method for CVPR 2026 HD-EPIC VQA Challenge

EgoAdapt: CVPR 2026 HD-EPIC VQA挑战赛的多场景自我中心适应方法

Zhiwei Chen, Yupeng Hu, Zixu Li, Zhiheng Fu, Guozhi Qiu, Weili Guan, Liqiang Nie

发表机构 * Shandong University(山东大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 提出EgoAdapt方法,通过类别条件路由、校准选项评分和测试时一致性适应,解决自我中心视频问答中通用推理与异构时空语义结构不匹配的问题。

Comments Technical Report for CVPR 2026 HD-EPIC VQA Challenge

详情
AI中文摘要

本技术报告介绍了我们针对CVPR 2026 HD-EPIC VQA挑战赛的解决方案EgoAdapt(通过类别、校准和一致性进行自我中心适应)。HD-EPIC评估视觉语言模型是否能够对真实的第一人称厨房视频进行推理,其中答案的证据可能是短暂的手-物体交互、长食谱轨迹、与固定装置的空间关系或微妙的注视线索。该基准包含26K个多项选择题,涵盖七个宏观类别:食谱、食材、营养、细粒度动作、3D感知、物体运动和注视。我们观察到主要困难不仅在于模型容量,还在于单一通用推理配方与基准的异构时间、空间和语义结构之间的不匹配。我们的方法EgoAdapt引入了三个推理时组件:(1)类别条件路由,包含每类提示、帧预算和采样率;(2)校准选项评分,使用字母标记似然和生成一致性评估所有候选答案,而非仅依赖直接生成;(3)测试时一致性适应,针对模糊情况聚合选项排列和验证式提示的预测。该设计显著优于现有的HD-EPIC基线。

英文摘要

This technical report presents our solution, EgoAdapt (Egocentric Adaptation via Category, Calibration, and Consistency), to the CVPR 2026 HD-EPIC VQA challenge. HD-EPIC evaluates whether a vision-language model can reason over realistic first-person kitchen videos, where the evidence for an answer may be a short hand-object interaction, a long recipe trajectory, a spatial relation to a fixture, or a subtle gaze cue. The benchmark contains 26K multiple-choice questions across seven macro-categories: recipe, ingredient, nutrition, fine-grained action, 3D perception, object motion, and gaze. We observe that the main difficulty is not only model capacity, but also the mismatch between a single generic inference recipe and the heterogeneous temporal, spatial, and semantic structure of the benchmark. Our method, EgoAdapt, introduces three inference-time components: (1) category-conditioned routing with per-category prompts, frame budgets, and sampling rates; (2) calibrated option scoring that evaluates all candidate answers with letter-token likelihoods and generation agreement instead of relying only on direct generation; and (3) test-time consistency adaptation that aggregates predictions across option permutations and verification-style prompts for ambiguous cases. This design substantially improves over the available HD-EPIC baselines.
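
Component (3), aggregating predictions across option permutations, can be sketched as a majority vote. This is a minimal illustration under assumptions: `predict` is a hypothetical callable that stands in for the vision-language model and returns the position of the chosen answer in the list it is shown, and the vote rule (simple majority) is assumed rather than taken from the report.

```python
from collections import Counter
from itertools import permutations


def permutation_vote(options, predict, max_perms=6):
    """Majority vote over several orderings of the answer options.

    `predict(shuffled_options)` is a hypothetical model wrapper that returns
    the position of the chosen answer within the list it receives.
    Returns the index of the winning option in the original `options` list.
    """
    votes = Counter()
    for n, perm in enumerate(permutations(range(len(options)))):
        if n >= max_perms:
            break
        shuffled = [options[i] for i in perm]
        picked_pos = predict(shuffled)
        votes[perm[picked_pos]] += 1  # map the pick back to the original index
    return votes.most_common(1)[0][0]


def biased_model(shuffled):
    """Toy stand-in: answers 'whisk' unless 'spoon' is listed first."""
    return 0 if shuffled[0] == "spoon" else shuffled.index("whisk")


winner = permutation_vote(["knife", "whisk", "spoon"], biased_model)
```

The toy predictor has a position bias that fires on two of the six orderings; voting over all orderings outweighs it, which is the motivation for permutation-based aggregation on ambiguous questions.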

2605.24496 2026-06-05 cs.CV 版本更新

EgoAction: Egocentric Action Composition with Reliability-Aware Temporal Fusion for the EPIC-KITCHENS Action Detection Challenge at CVPR 2026

EgoAction: 面向 EPIC-KITCHENS 动作检测挑战的可靠性感知时间融合自我中心动作组合 (CVPR 2026)

Zhiheng Fu, Zixu Li, Zhiwei Chen, Fangxu Liu, Yupeng Hu, Weili Guan, Liqiang Nie

发表机构 * Shandong University(山东大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 提出 EgoAction 统一解耦检测与融合流水线,通过动态加权融合(DWF)自适应组合动词和名词检测流,解决自我中心视频中动作边界定位的可靠性问题。

Comments Technical Report for CVPR 2026 EPIC-KITCHENS-100 Action Detection Challenge

详情
AI中文摘要

EPIC-KITCHENS-100 动作检测挑战评估模型能否在长段未裁剪的自我中心视频中定位每个动作的起始和结束,并分配相应的动词-名词动作标签。在本报告中,我们将提交的方法表述为 EgoAction(基于可靠性感知时间融合的自我中心动作组合),这是一个统一的解耦检测和融合流水线。该流水线使用 EPIC 微调的 VideoMAE-L 特征,训练带有因果时间建模的独立名词和动词时间检测器,从 top 名词-动词对中组合动作假设,并在后处理时引入置信度自适应边界融合规则。关键观察是动词和名词流通常以不同方式失败:动词分数对运动过渡敏感,而名词分数对手-物体可见性和物体杂乱敏感。因此,当其中一个流退化时,其预测边界的固定算术平均值会放大定位误差。我们用动态加权融合(DWF)替换这种硬编码的平均值,DWF 将最大名词和动词分类置信度归一化为提议级别的边界权重,并线性组合两个区间。这种轻量级张量运算将边界权威转移到更可靠的流,同时保留解耦的动作评分机制。结合滑动窗口推理、top-K 名词-动词动作组合和类别级 Soft-NMS,EgoAction 为自我中心时间动作检测提供了一个紧凑且可复现的系统。

英文摘要

The EPIC-KITCHENS-100 Action Detection challenge evaluates whether a model can localize the start and end of each action in long untrimmed egocentric videos and assign the corresponding verb--noun action label. In this report, we formulate our submission as EgoAction (Egocentric Action Composition with Reliability-Aware Temporal Fusion), a unified decoupled detection and fusion pipeline. The pipeline uses EPIC-finetuned VideoMAE-L features, trains separate noun and verb temporal detectors with causal temporal modeling, composes action hypotheses from top noun--verb pairs, and introduces a confidence-adaptive boundary fusion rule at post-processing time. The key observation is that verb and noun streams often fail differently: verb scores are sensitive to motion transitions, whereas noun scores are sensitive to hand-object visibility and object clutter. A fixed arithmetic mean of their predicted boundaries can therefore amplify localization errors when one stream degenerates. We replace this hard-coded mean with Dynamic Weighted Fusion (DWF), which normalizes the maximum noun and verb classification confidences into proposal-wise boundary weights and linearly combines the two intervals. This lightweight tensor-only operator shifts boundary authority toward the more reliable stream while preserving the decoupled action scoring mechanism. Together with sliding-window inference, top-K noun--verb action composition, and class-wise Soft-NMS, EgoAction provides a compact and reproducible system for egocentric temporal action detection.
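
The Dynamic Weighted Fusion rule described above can be sketched in a few lines. The abstract does not give the exact normalization, so this sketch assumes the two confidences are divided by their sum; the function name, argument names, and the fallback to a plain mean when both confidences are zero are likewise assumptions.

```python
import numpy as np


def dynamic_weighted_fusion(noun_seg, verb_seg, noun_conf, verb_conf):
    """Fuse per-proposal [start, end] intervals from the noun and verb streams.

    noun_seg, verb_seg:   shape (N, 2), predicted boundaries per proposal.
    noun_conf, verb_conf: shape (N,), maximum classification confidence
                          of each stream for that proposal.
    Assumed normalization: w_noun = noun_conf / (noun_conf + verb_conf).
    """
    noun_seg = np.asarray(noun_seg, dtype=float)
    verb_seg = np.asarray(verb_seg, dtype=float)
    noun_conf = np.asarray(noun_conf, dtype=float)
    verb_conf = np.asarray(verb_conf, dtype=float)

    total = noun_conf + verb_conf
    safe_total = np.where(total > 0, total, 1.0)  # avoid division by zero
    # Fall back to the arithmetic mean when both confidences are zero.
    w_noun = np.where(total > 0, noun_conf / safe_total, 0.5)[:, None]
    w_verb = 1.0 - w_noun
    return w_noun * noun_seg + w_verb * verb_seg


fused = dynamic_weighted_fusion(
    noun_seg=[[0.0, 10.0]], verb_seg=[[2.0, 14.0]],
    noun_conf=[0.9], verb_conf=[0.1],
)
```

With confidences 0.9 and 0.1 the fused interval sits close to the noun stream's boundaries, whereas equal confidences reproduce the fixed arithmetic mean that the report replaces.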

2605.24470 2026-06-05 cs.CV 版本更新

TempRet: Temporal Enhancement and Two-Stage Reranking for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge

TempRet: 面向CVPR 2026 EPIC-KITCHENS-100多实例检索挑战的时间增强与两阶段重排序

Zixu Li, Yupeng Hu, Zhiwei Chen, Zhiheng Fu, Xiaowei Zhu, Weili Guan, Liqiang Nie

发表机构 * Shandong University(山东大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 针对第一人称视频检索中时间动态被忽视的问题,提出基于CLIP双编码器、视频端时间Transformer和两阶段重排序的TempRet方法,在EK-100 MIR基准上达到67.97%平均mAP和82.92%平均nDCG。

Comments Technical Report for CVPR 2026 EPIC-KITCHENS-100 Multi-Instance Retrieval Challenge

详情
AI中文摘要

视频-文本检索在大规模视觉-语言预训练的推动下取得了显著进展,但大多数现有方法继承了图像-文本检索的一个隐含假设:视觉语义可以逐帧捕获。这一假设忽视了第一人称视频的时间动态性。EPIC-KITCHENS-100多实例检索(MIR)挑战进一步提高了要求,提供软标签相关性矩阵而非二元标签,要求模型能够解决跨模态的分级语义对应。在本报告中,我们提出了面向CVPR 2026 EPIC-KITCHENS-100 MIR挑战的解决方案,称为TempRet。我们的方法基于CLIP双编码器骨干,并引入两个关键组件来应对时间和跨模态挑战。首先,一个时间Transformer仅在视频端操作,通过可学习的位置编码和帧级CLIP特征上的多头自注意力来建模帧间依赖关系。其次,一个两阶段重排序流程首先通过双编码器检索Top-K候选,然后使用配备图像-文本匹配(ITM)头的交叉编码器细化其分数。整个系统使用对称多相似性损失进行训练,以利用挑战提供的软标签相关性矩阵。我们的方法在EK-100 MIR基准上实现了67.97%的平均mAP和82.92%的平均nDCG,证明了时间建模和跨模态细化对第一人称视频检索的有效性。

英文摘要

Video-text retrieval has witnessed remarkable progress driven by large-scale vision-language pretraining, yet most existing approaches inherit an implicit assumption from image-text retrieval: that visual semantics can be captured frame-by-frame. This assumption overlooks the temporal dynamics of egocentric videos. The EPIC-KITCHENS-100 Multi-Instance Retrieval (MIR) challenge further raises the bar by providing soft-label relevance matrices rather than binary labels, demanding models that can resolve graded semantic correspondences across modalities. In this report, we present our solution, termed TempRet, to the CVPR 2026 EPIC-KITCHENS-100 MIR challenge. Our approach builds upon a CLIP-based dual-encoder backbone and introduces two key components to address the temporal and cross-modal challenges. First, a temporal transformer operates exclusively on the video side, modeling inter-frame dependencies through learnable positional encodings and multi-head self-attention over frame-level CLIP features. Second, a two-stage reranking pipeline first retrieves Top-K candidates via the dual-encoder, then refines their scores using a cross-encoder equipped with an Image-Text Matching (ITM) head. The entire system is trained with Symmetric Multi-Similarity Loss to exploit the soft-label relevance matrices provided by the challenge. Our method achieves 67.97% average mAP and 82.92% average nDCG on the EK-100 MIR benchmark, demonstrating the effectiveness of temporal modeling and cross-modal refinement for egocentric video retrieval.
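
The two-stage reranking pipeline can be sketched as follows. This is an illustration under assumptions rather than the TempRet code: stage one uses cosine similarity between precomputed embeddings, and `cross_score` is a hypothetical callable standing in for the cross-encoder's ITM head.

```python
import numpy as np


def two_stage_rerank(query_emb, cand_embs, cross_score, k=5):
    """Retrieve Top-K candidates with a dual encoder, then rerank them.

    query_emb:   shape (D,), query embedding.
    cand_embs:   shape (N, D), candidate embeddings.
    cross_score: hypothetical callable, candidate index -> fine-grained score;
                 it is only called for the K shortlisted candidates.
    Returns candidate indices ordered by the fine-grained score.
    """
    query_emb = np.asarray(query_emb, dtype=float)
    cand_embs = np.asarray(cand_embs, dtype=float)

    # Stage 1: cheap cosine similarity over all candidates.
    q = query_emb / np.linalg.norm(query_emb)
    c = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    coarse = c @ q
    top_k = np.argsort(-coarse)[:k]

    # Stage 2: expensive scoring restricted to the shortlist.
    fine = np.array([cross_score(int(i)) for i in top_k])
    return top_k[np.argsort(-fine)].tolist()


toy_cands = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
toy_cross = {0: 0.2, 1: 0.8}  # made-up cross-encoder scores for the shortlist
ranking = two_stage_rerank([1.0, 0.0], toy_cands, toy_cross.__getitem__, k=2)
```

The design trades accuracy for cost: the dual encoder scores every candidate with one matrix product, while the slower cross-encoder sees only K items, so K bounds the extra inference time.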

2506.08809 2026-06-05 cs.CV eess.IV 版本更新

Training-Free Inference for High-Resolution Sinogram Completion

无需训练的高分辨率sinogram补全

Jiaze E, Srutarshi Banerjee, Tekin Bicer, Guannan Wang, Yanfu Zhang, Bin Ren

发表机构 * William & Mary(威廉玛丽学院) Argonne National Laboratory(阿贡国家实验室) University of Chicago(芝加哥大学)

AI总结 本文提出了一种无需训练的高效扩散推理方法HRSino,用于高分辨率sinogram补全,通过自适应分配推理努力来提高计算效率和补全精度。

详情
AI中文摘要

高分辨率sinogram补全对于计算断层扫描重建至关重要,因为缺失的投影可能会引入严重的伪影。尽管扩散模型为该任务提供了强大的生成先验,但其推理成本随着分辨率的增加而变得不可接受。我们提出HRSino,一种无需训练且高效的扩散推理方法,用于高分辨率sinogram补全。通过显式考虑信号特性中的空间异质性,如频谱稀疏性和局部复杂性,HRSino在空间区域和分辨率上自适应地分配推理努力,而不是应用统一的高分辨率扩散步骤。这使得在粗粒度上能够捕捉全局一致性,同时仅在必要时细化局部细节。实验结果表明,与最先进的框架相比,HRSino将峰值内存使用量减少了高达30.81%,推理时间减少了高达17.58%,并在不同数据集和分辨率上保持补全精度。

英文摘要

High-resolution sinogram completion is critical for computed tomography reconstruction, as missing projections can introduce severe artifacts. While diffusion models provide strong generative priors for this task, their inference cost grows prohibitively with resolution. We propose HRSino, a training-free and efficient diffusion inference approach for high-resolution sinogram completion. By explicitly accounting for spatial heterogeneity in signal characteristics, such as spectral sparsity and local complexity, HRSino allocates inference effort adaptively across spatial regions and resolutions, rather than applying uniform high-resolution diffusion steps. This enables global consistency to be captured at coarse scales while refining local details only where necessary. Experimental results show that HRSino reduces peak memory usage by up to 30.81% and inference time by up to 17.58% compared to the state-of-the-art framework, and maintains completion accuracy across datasets and resolutions.

2605.19839 2026-06-05 cs.CV 版本更新

When Preference Labels Fall Short: Aligning Diffusion Models from Real Data

当偏好标签不足时:从真实数据对齐扩散模型

Weiyan Chen, Weijian Deng, Yao Xiao, Weijie Tu, ZiYi Dong, Ibrahim Radwan, Liang Lin, Pengxu Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文研究了真实数据作为偏好对齐的替代监督源,通过数据驱动的方法,利用真实图像作为参考点,对比生成或扰动样本以构建偏好信号,无需手动标注的偏好对,实验证明真实数据监督能有效对齐扩散模型并达到与现有偏好方法相当的性能。

Comments ICML 2026 Camera Ready; Project Page: https://cwyxx.github.io/RealAlign

详情
AI中文摘要

偏好对齐旨在通过学习优选样本与非优选样本的比较来引导生成模型。在实践中,大多数现有方法依赖于从模型生成图像中构造的偏好对。这种监督本质上是相对的,当两个样本都表现出伪影或视觉质量有限时,其模糊性使得难以推断何为真正理想的输出。在本工作中,我们探讨了真实数据是否可以作为偏好对齐的替代监督源。我们采用以数据为中心的视角,研究了一种整理策略,将真实图像作为参考点,并通过将其与生成或扰动样本进行对比,构建偏好信号,而无需手动标注的偏好对。通过实证分析,我们证明了基于真实数据的监督能有效指导扩散模型的对齐,并达到与现有基于偏好方法相当的性能。我们的结果表明,真实数据为偏好对齐提供了一个实用且互补的监督源,并突显了标签高效对齐策略的方向。代码和模型可在https://cwyxx.github.io/RealAlign获取。

英文摘要

Preference alignment aims to guide generative models by learning from comparisons between preferred and non-preferred samples. In practice, most existing approaches rely on preference pairs constructed from model-generated images. Such supervision is inherently relative and can be ambiguous when both samples exhibit artifacts or limited visual quality, making it difficult to infer what constitutes a truly desirable output. In this work, we investigate whether real data can serve as an alternative source of supervision for preference alignment. We adopt a data-centric perspective and study a curation strategy that treats real images as reference points and constructs preference signals by contrasting them with generated or perturbed samples, without requiring manually annotated preference pairs. Through empirical analysis, we show that real-data-based supervision provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods. Our results suggest that real data offers a practical and complementary source of supervision for preference alignment and highlight directions of label-efficient alignment strategies. Code and models are available at https://cwyxx.github.io/RealAlign.

2510.00054 2026-06-05 cs.CV cs.AI 版本更新

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

HiDe: 通过分层解耦重新思考高分辨率MLLMs中的Zoom-IN方法

Xianjie Liu, Yiman Hu, Yixiong Zou, Liang Wu, Jian Xu, Bo Zheng

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出HiDe框架,通过分层解耦方法解决高分辨率图像中背景干扰导致的视觉理解问题,提升多模态大语言模型在高分辨率图像任务中的性能。

Comments Accepted by ICML2026

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉理解任务中取得了显著进展。然而,它们在高分辨率图像上的性能仍然不够理想。尽管现有方法通常将这一限制归因于感知约束,并认为MLLMs难以识别小物体,从而使用'缩放进'策略以获得更好的细节,我们的分析揭示了不同的原因:主要问题不是物体大小,而是由复杂的背景干扰引起的。我们通过一系列解耦实验系统分析了这种'缩放进'操作,并提出了一种无需训练的分层解耦框架(HiDe),该框架使用基于标记的注意力解耦(TAD)来解耦问题标记并识别关键信息标记,然后利用其注意力权重实现与目标视觉区域的精确对齐。随后,它利用布局保持解耦(LPD)将这些区域与背景解耦,并重建一个紧凑的表示,该表示在保留基本空间布局的同时消除了背景干扰。HiDe在V*Bench、HRBench4K和HRBench8K上设定了新的SOTA,将Qwen2.5-VL 7B和InternVL3 8B提升至SOTA(在V*Bench上分别为92.1%和91.6%),甚至超过了强化学习方法。经过优化后,HiDe的内存使用比之前的无训练方法减少了75%。代码可在https://tennine2077.github.io/HiDe.github.io/上提供。

英文摘要

Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. Existing approaches often attribute this limitation to perceptual constraints, arguing that MLLMs struggle to recognize small objects, and therefore use "zoom in" strategies to recover detail. Our analysis reveals a different cause: the main issue is not object size but complex background interference. We systematically analyze this "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. Subsequently, it employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to SOTA (92.1% and 91.6% on V*Bench), even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is available at https://tennine2077.github.io/HiDe.github.io/.

2605.16716 2026-06-05 cs.CV cs.AI 版本更新

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

MAVEN:面向多元文化文本到视频生成的多智能体框架

Shuowei Li, Yuming Zhao, Parth Bhalerao, Oana Ignat

发表机构 * Santa Clara University(圣克拉拉大学)

AI总结 提出MAVEN多智能体提示优化框架,通过并行或串行分解提示为人物、动作、地点维度,提升单文化和跨文化文本到视频生成的文化保真度,并构建包含243个文化提示和972个视频的基准进行评估。

Comments 14 pages, 6 figures, 11 tables, appendix included. Preprint

详情
AI中文摘要

文本到视频(T2V)生成在视觉保真度方面取得了快速进展,但其在单个提示中忠实呈现多种文化的能力仍未被充分探索。我们提出MAVEN,一个多智能体提示优化框架,旨在提高单文化和跨文化T2V生成中的文化保真度。MAVEN将提示分解为人物、动作和地点维度,由并行或串行运行的专业智能体处理。为了支持系统评估,我们贡献了一个新的基准,包含243个基于文化的提示和972个对应视频,涵盖三种文化(中文、美式、罗马尼亚)、三种动作类别以及单文化和跨文化场景。结合基于CLIP的指标、VLM作为评判的评估和视频质量测量的评估表明,多智能体优化,特别是并行专业化,在保持视觉质量和时间一致性的同时,显著提高了文化相关性。数据集和代码可在https://github.com/AIM-SCU/MAVEN获取。

英文摘要

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi-agent prompt refinement framework designed to improve cultural fidelity in both mono-cultural and cross-cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations combining CLIP-based metrics, VLM-as-judge assessments, and video-quality measures show that multi-agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available at https://github.com/AIM-SCU/MAVEN.

2605.14028 2026-06-05 cs.CV 版本更新

Unified Pix Token And Word Token Generative Language Model

统一像素标记与词标记的生成语言模型

Haun Leung, ZiNan Wang

发表机构 * Beihang University(北京航空航天大学)

AI总结 本文提出一种统一像素标记和词标记的生成语言模型,通过引入图像无监督预训练、颜色折叠、全局条件注意力近似等方法,提升模型在图像细节识别上的能力,实验表明该模型在小模型和有限数据下仍表现优异。

Comments 13 pages, 6 figures

详情
AI中文摘要

自从视觉Transformer(ViT)出现以来,它已被广泛应用于生成语言模型和生成视觉模型中。尤其是在当前最先进的开源多模态模型中,通过CLIP或SigLIP方法获得的ViT被用作视觉编码器的骨干网络,帮助它们获得视觉理解能力。但这种方法在细节视觉理解上存在局限,例如在图像中难以识别小文本或数字。为了解决这些问题,我们提出了一种新的模型,将像素标记和词标记统一到生成语言模型中。该新模型还具有每个图像像素都有其自己的标记嵌入、颜色折叠、全局条件注意力近似和图像无监督预训练等特性。我们使用我们的新模型进行了图像无监督预训练实验,以探索其潜力。实验结果表明,即使在小模型和有限训练数据下,其性能也很好。我们相信我们的模型也符合扩展定律,只要模型参数和训练数据增加,其性能将继续提高。

英文摘要

Since its emergence, the Vision Transformer (ViT) has been widely used in generative language models and generative visual models. In particular, in current state-of-the-art open-source multimodal models, a ViT trained with the CLIP or SigLIP method serves as the vision encoder backbone that provides visual understanding capabilities. However, this approach limits the understanding of visual details, for example making it difficult to recognize small text or numbers in images. To address these issues, we propose a new model that unifies pix tokens and word tokens in a single generative language model. In the new model, each pixel of an image has its own token embedding; the model also features color folding, global conditional attention approximation, and unsupervised image pretraining. We conducted unsupervised image pretraining experiments with the new model to explore its potential. The experimental results show that it performs well even with a small model and limited training data. We believe our model also conforms to the scaling law: as model parameters and training data increase, its performance will continue to improve.

2604.20329 2026-06-05 cs.CV cs.AI 版本更新

Image Generators are Generalist Vision Learners

图像生成器是通用视觉学习者

Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T. Barron, Kyle Genova, Nithish Kannen, Sherry Ben, Yandong Li, Mandy Guo, Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie, Howard Zhou, Kaiming He, Thomas Funkhouser, Jean-Baptiste Alayrac, Radu Soricut

发表机构 * Google(谷歌)

AI总结 本文研究了图像生成器在视觉理解中的通用学习能力,通过引入Vision Banana模型,展示了图像生成训练如何像语言模型预训练一样,使模型在多种视觉任务中取得最佳性能,证明了图像生成预训练在构建基础视觉模型中的核心作用。

Comments Project Page: http://vision-banana.github.io

详情
AI中文摘要

近期的研究表明,图像和视频生成器表现出零样本视觉理解行为,这种行为类似于大型语言模型(LLM)通过生成式预训练发展出语言理解和推理的新兴能力。尽管长期以来人们推测能够生成视觉内容意味着能够理解它,但缺乏证据表明生成式视觉模型已发展出强大的理解能力。在本文中,我们证明图像生成训练的作用类似于LLM预训练,使模型学习到强大的、通用的视觉表示,从而在各种视觉任务中取得最先进的性能。我们引入了Vision Banana,一个通过指令微调Nano Banana Pro(NBP)在原始训练数据和少量视觉任务数据混合中构建的通用模型。通过将视觉任务的输出空间参数化为RGB图像,我们无缝地将感知重新框架为图像生成。我们的通用模型Vision Banana在涉及2D和3D理解的多种视觉任务中取得了最先进的结果,超越或匹敌零样本领域专家,包括Segment Anything Model 3在分割任务中的表现,以及Depth Anything系列在度量深度估计中的表现。我们展示了这些结果可以通过轻量级指令微调实现,而不牺牲基础模型的图像生成能力。优越的结果表明图像生成预训练是一种通用视觉学习者。它还表明图像生成是视觉任务的统一和通用接口,类似于文本生成在语言理解和推理中的作用。我们可能正在见证计算机视觉中的重大范式转变,其中生成式视觉预训练在构建生成和理解的基础视觉模型中发挥核心作用。

英文摘要

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.

2605.05367 2026-06-05 cs.CV cs.AI 版本更新

Tamaththul3D: High-Fidelity 3D Saudi Sign Language Avatars from Monocular Video

Tamaththul3D: 从单目视频高保真重建沙特手语3D虚拟形象

Eyad Alghamdi, Sattam Altuuaim, Obay Ghulam, Abdulrahman Qutah, Yousef Basoodan

发表机构 * University of Jeddah(吉达大学) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 本文提出Tamaththul3D方法,通过几何逆运动学对前臂链进行对齐,结合2D监督肩部优化,实现了阿拉伯语手语的高保真3D虚拟形象重建,并在五个不同语言类型的手语数据集上实现了泛化能力。

详情
AI中文摘要

现有的3D手语虚拟形象重建方法仅在西方手语上开发和评估,且没有任何阿拉伯手语数据集的3D参数注解,这阻碍了阿拉伯聋人社区基于虚拟形象的无障碍应用发展。我们发布了首个SMPL-X参数注解的Ishara-500沙特手语数据集,使阿拉伯手语的定量评估和下游手语生成成为可能。我们引入Tamaththul3D,一种通过几何逆运动学对齐手部和身体估计,随后通过2D监督肩部优化的重建流程。闭式积分与特定身体和手估计器的选择无关:任何SMPL-X兼容的身体估计器和任何MANO兼容的手估计器均可替换,我们通过单独替换每个模块来证明这一点。Tamaththul3D在手部误差上比先前方法低达32%,运行速度比最强基线快32倍,并在没有数据集特定适应的情况下泛化到五个不同语言类型的手语数据集。

英文摘要

Existing 3D sign language avatar reconstruction methods are developed and evaluated exclusively on Western sign languages, and no 3D parametric annotations exist for any Arabic Sign Language dataset, a gap that blocks the development of avatar-based accessibility applications for the Arab Deaf community. We release the first SMPL-X parametric annotations for the Ishara-500 Saudi Sign Language dataset, enabling quantitative evaluation and downstream sign language generation for Arabic Sign Language. We introduce Tamaththul3D, a reconstruction pipeline that aligns hand and body estimates through geometric inverse kinematics on the forearm chain followed by 2D-supervised shoulder refinement. The closed-form integration is decoupled from the specific choice of body and hand estimators: any SMPL-X-compatible body estimator and any MANO-compatible hand estimator can be substituted, as we demonstrate by swapping each module independently. Tamaththul3D achieves up to 32% lower hand error than prior methods, runs 32x faster than the strongest baseline, and generalizes across five typologically distinct sign languages without dataset-specific adaptation.

2605.09989 2026-06-05 cs.RO cs.CV 版本更新

StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

StereoPolicy:通过立体视觉改进机器人操作策略

Evans Han, Yunfan Jiang, Yingke Wang, Haoyue Xiao, Huang Huang, Jianwen Xie, Jiajun Wu, Li Fei-Fei, Ruohan Zhang

发表机构 * Stanford University(斯坦福大学) Northwestern University(西北大学) Lambda, Inc(Lambda公司)

AI总结 该研究提出StereoPolicy,一种利用立体视觉提升机器人操作策略的框架,通过同步立体图像对增强几何推理,无需构建显式3D表示,在多个仿真和真实机器人任务中优于RGB、RGB-D、点云等基线方法。

详情
AI中文摘要

最近的机器人模仿学习进展产生了能够从视觉输入中操控多样化物体的强大视觉-运动策略。然而,单目观测缺乏深度信息,这对于在杂乱或几何复杂的场景中进行精确操作至关重要。显式的深度图和点云在现实世界操作中往往噪声大且易碎。我们引入了StereoPolicy,一种视觉-运动策略学习框架,直接利用同步的立体图像对来改进几何推理,而无需构建显式的3D表示。StereoPolicy通过预训练的2D视觉编码器处理每张图像,并通过基于交叉注意力的Stereo Transformer融合左右特征,隐式地捕捉空间对应关系和视差线索。该框架与基于扩散和预训练的视觉-语言-动作(VLA)策略集成,在三个仿真基准和七个真实机器人桌面和双臂移动操作任务中,相比RGB、RGB-D、点云和多视角基线方法均实现了持续改进。我们的结果表明,立体视觉能够将预训练的2D表示与3D几何理解联系起来,以提升机器人操作性能。

英文摘要

Recent advances in robot imitation learning have produced powerful visuomotor policies that manipulate diverse objects from visual inputs. However, monocular observations lack depth information, which is critical for precise manipulation in cluttered or geometrically complex scenes. Explicit depth maps and point clouds are often noisy and fragile in real-world manipulation. We introduce StereoPolicy, a visuomotor policy learning framework that directly leverages synchronized stereo image pairs to improve geometric reasoning without constructing explicit 3D representations. StereoPolicy processes each image with pretrained 2D vision encoders and fuses left-right features through a cross-attention-based Stereo Transformer, capturing spatial correspondence and disparity cues implicitly. The framework integrates with diffusion-based and pretrained vision-language-action (VLA) policies, delivering consistent improvements over RGB, RGB-D, point cloud, and multi-view baselines across three simulation benchmarks and seven real-robot tabletop and bimanual mobile manipulation tasks. Our results show that stereo vision bridges 2D pretrained representations and 3D geometric understanding for robotic manipulation.

2605.08215 2026-06-05 cs.CV cs.LG cs.RO 版本更新

Test-Time Training for Visual Foresight Vision-Language-Action Models

测试时训练用于视觉前瞻视觉-语言-动作模型

Sangwu Park, Wonjoong Kim, Yeonjun In, Sein Kim, Hongseok Kang, Chanyoung Park

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出了一种测试时训练方法,用于增强视觉前瞻视觉-语言-动作模型在面对分布外数据时的鲁棒性,通过引入适应性更新过滤机制来减少测试时更新带来的实际挑战。

Comments Accepted at ICML 2026 Workshop on Continual Adaptation at Scale (CATS)

详情
AI中文摘要

Visual Foresight VLA (VF-VLA) 已成为最近 VLA 中的重要架构选择,因其出色的性能。然而,VF-VLA 的固有设计使其特别容易受到分布外(OOD)偏移的影响。由于动作的质量直接取决于预测未来视觉信息的准确性,OOD 条件会影响两个阶段。为了解决这一脆弱性,我们提出了测试时训练视觉前瞻 VLA($T^3$VF),这是一种受观察启发的测试时训练方法,即预测的未来图像及其后续观察形成自然的监督对。为了进一步解决由于随意测试时更新而产生的实际挑战,我们引入了自适应更新过滤机制。经验上,$T^3$VF 在不改变任何架构或辅助模块的情况下,以适度的额外推理成本缓解了 VF-VLA 的 OOD 脆弱性。

英文摘要

Visual Foresight VLA (VF-VLA) has become a prominent architectural choice among recent VLAs due to its impressive performance. Nevertheless, the inherent design of VF-VLA makes it particularly vulnerable to out-of-distribution (OOD) shifts. Because the quality of the action directly depends on the accuracy of the predicted future visual information, OOD conditions affect both stages at once. To address this vulnerability, we propose Test-Time Training Visual Foresight VLA ($T^3$VF), a test-time training approach motivated by the observation that the predicted future image and its subsequent observation form a natural supervision pair. To further address the practical challenges that arise from indiscriminate test-time updates, we introduce an adaptive update filtering mechanism. Empirically, $T^3$VF mitigates the OOD vulnerability of VF-VLA at a modest additional inference cost, without requiring any architectural modification or auxiliary modules.

2604.10528 2026-06-05 cs.CV 版本更新

BareBones: Benchmarking Zero-Shot Geometric Comprehension in VLMs

BareBones: 视觉语言模型中零样本几何理解的基准测试

Aaditya Baranwal, Vishal Yadav, Abhishek Rajora

发表机构 * University of Central Florida(中佛罗里达大学) University of Calgary(卡尔加里大学)

AI总结 提出BareBones基准,通过去除RGB纹理仅保留轮廓,测试26个视觉语言模型在零样本几何形状理解上的表现,发现模型存在严重的纹理偏置悬崖。

Comments Accepted at CVPR (13th FGVC Workshop) 2026

详情
AI中文摘要

尽管视觉语言模型(VLM)在多种多模态任务中展现出卓越的零样本识别能力,但这些架构是否真正理解几何结构,还是仅仅利用RGB纹理和上下文先验作为统计捷径,仍然是一个未解之谜。现有的评估未能分离这一机制,将语义推理与纹理映射混为一谈,并依赖于不精确的标注,这些标注无意中泄露了环境线索。为解决这一空白,我们引入了$\textbf{BareBones}$,一个旨在压力测试纯几何形状理解的零样本基准。我们整理了六个数据集中几何不同类别的像素级轮廓:五个已建立的分割来源(ImageNet-S、DIS5K、ThinObject5K、PASCAL VOC、CUB-200)以及我们新颖的旗舰集合WTP-Bench,建立了一个无噪声的几何分类体系。WTP-Bench是一个极端的、细粒度的视觉谜题,迫使模型仅从边界轮廓中识别类间几何概念。我们对26个最先进的专有和开源权重VLM(例如GPT-4.1、Gemini、Claude Sonnet 4.5、LLaVA)的评估揭示,在去除RGB纹理的情况下,模型性能一致且严重崩溃,我们将这一现象称为$\textit{纹理偏置悬崖}$。通过记录普遍的结构性盲点,BareBones为真正的几何基础建立了一个严格的衡量标准。项目页面:https://eternal-f1ame.github.io/WTP-Bench/

英文摘要

While Vision-Language Models (VLMs) demonstrate remarkable zero-shot recognition capabilities across a diverse spectrum of multimodal tasks, it remains an open question whether these architectures genuinely comprehend geometric structure or merely exploit RGB textures and contextual priors as statistical shortcuts. Existing evaluations fail to isolate this mechanism, conflating semantic reasoning with texture mapping and relying on imprecise annotations that inadvertently leak environmental cues. To address this gap, we introduce $\textbf{BareBones}$, a zero-shot benchmark designed to stress-test pure geometric shape comprehension. We curate pixel-level silhouettes of geometrically distinct classes across six datasets: five established segmentation sources (ImageNet-S, DIS5K, ThinObject5K, PASCAL VOC, CUB-200) and our novel flagship collection, WTP-Bench, establishing a noise-free geometric taxonomy. WTP-Bench is an extreme, fine-grained visual puzzle that forces models to identify inter-class geometric concepts from boundary contours alone. Our evaluation of 26 state-of-the-art proprietary and open-weight VLMs (e.g., GPT-4.1, Gemini, Claude Sonnet 4.5, LLaVA) reveals a consistent, severe performance collapse under RGB deprivation, a phenomenon we term the $\textit{Texture Bias Cliff}$. By documenting universal structural blindspots, BareBones establishes a rigorous yardstick for genuine geometric grounding. Project Page: https://eternal-f1ame.github.io/WTP-Bench/

2605.00174 2026-06-05 cs.AR cs.CV 版本更新

DPU or GPU for Accelerating Neural Networks Inference -- Why not both? Split CNN Inference

DPU 或 GPU 加速神经网络推断——为何不两者都用?分割 CNN 推断

Ali Emre Oztas, Mahir Demir, James Garside, Mikel Luján

发表机构 * The University of Manchester(曼彻斯特大学)

AI总结 本文提出了一种将 CNN 推断任务分割到 DPU 和 GPU 上的方法,以降低延迟。通过在 DPU 处理初始层,GPU 处理剩余层,结合 GNN 分割索引预测方法,实现了比单一 DPU 或 GPU 更高的效率提升。

详情
AI中文摘要

边缘设备上的视频和图像流需要低延迟。为解决此问题,神经网络(NN)被广泛应用,先前的研究主要集中在使用单个硬件单元如图形处理单元(GPU)、现场可编程门阵列(FPGA)和深度学习处理单元(DPU)来加速这些网络。然而,通过结合这些单元可以进一步减少延迟。本文提出将 CNN 推断任务分割到 DPU 和 GPU 上(Split CNN 推断)。第一个分割部分由处理输入图像的初始 CNN 层组成,在 Versal VCK190 的 AI 引擎(DPU)上运行。DPU 在数据源附近处理第一部分。GPU 以异步流水线方式运行剩余的层。NVIDIA RTX 2080 GPU 处理第二部分,同时减少了数据源(存储/摄像头)与 GPU 之间的数据传输。此外,提出了一种基于图神经网络(GNN)的分割索引预测方法,以自动化 Split 推断所需的 CNN 分割。本文分析了 LeNet-5、ResNet18/50/101/152、VGG16 和 MobileNetv2 等成熟模型。结果表明,相比仅使用 DPU 的执行,延迟改善最高达 2.48 倍;相比仅使用 GPU 的执行,延迟改善最高达 3.37 倍。训练好的 GNN 模型在适当的设备之间分割层的准确率为 96.27%。

英文摘要

Video and image streaming on edge devices requires low latency. To address this, Neural Networks (NNs) are widely used, and prior work mainly focuses on accelerating them with single hardware units such as Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and Deep Learning Processing Units (DPUs). However, latency can be reduced further by combining these units. In this paper, partitioning CNN inference across DPU and GPU (Split CNN Inference) is proposed. The first partition, which consists of the initial CNN layers processing the input images, runs on the AI engines (DPU) of a Versal VCK190. The DPU processes the first partition near the source of the data. Pipelined asynchronously, a GPU runs the remaining layers. The GPU (NVIDIA RTX 2080) processes the second partition, with reduced data transfer between the data source (storage/camera) and the GPU. Furthermore, a Graph Neural Network (GNN)-based partition index prediction method is proposed to automate the partitioning of CNNs needed for Split Inference. Well-established models such as LeNet-5, ResNet18/50/101/152, VGG16, and MobileNetv2 are analyzed. Results demonstrate up to 2.48x latency improvement over DPU-only execution and up to 3.37x over GPU-only execution. The trained GNN model splits the layers between the appropriate devices with 96.27% accuracy.
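
To make the partitioning idea concrete, the sketch below picks a split index by brute force instead of the GNN predictor the abstract describes. It assumes per-layer latencies on each device and a transfer cost at every cut point are known, and that in a steady-state pipeline the slower stage bounds per-frame latency. The function name `best_split` and all timings are made up for illustration.

```python
def best_split(dpu_ms, gpu_ms, transfer_ms):
    """Return (k, bottleneck_ms): layers [0, k) on the DPU, [k, n) on the GPU.

    transfer_ms[k] is the assumed cost of moving the tensor at cut k to the GPU
    (k = 0 sends the raw input; k = n sends only the final output).
    """
    n = len(dpu_ms)
    assert len(gpu_ms) == n and len(transfer_ms) == n + 1
    best = None
    for k in range(n + 1):
        dpu_stage = sum(dpu_ms[:k])
        gpu_stage = transfer_ms[k] + sum(gpu_ms[k:])
        bottleneck = max(dpu_stage, gpu_stage)   # pipelined stages overlap
        if best is None or bottleneck < best[1]:
            best = (k, bottleneck)
    return best


# Made-up per-layer timings in milliseconds, for illustration only.
dpu = [1.0, 1.0, 4.0, 4.0]
gpu = [2.0, 2.0, 1.0, 1.0]
xfer = [3.0, 2.0, 1.0, 1.0, 0.5]
k, bottleneck = best_split(dpu, gpu, xfer)
```

With these numbers the best cut is after the second layer: the early layers are cheap on the DPU and shrink the tensor that has to reach the GPU, so the pipeline bottleneck drops well below either single-device configuration.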

2604.27343 2026-06-05 cs.CV 版本更新

JI-ADF: Joint-Individual Learning with Adaptive Decision Fusion for Multimodal Skin Lesion Classification

JI-ADF:联合-个体学习与自适应决策融合用于多模态皮肤病变分类

Phan Nguyen, Dat Cao, Hien Kha, Hien Chu, Minh Le, Trang Pham, Nguyen Quoc Khanh Le

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 本文提出JI-ADF框架,通过整合皮肤镜图像、临床照片和结构化患者数据,实现基于临床的皮肤病变分类,采用多模态表示学习和自适应决策融合机制,提升跨模态推理能力,并在MILK10k数据集上验证了其在实际临床场景中的可靠性。

详情
AI中文摘要

皮肤病变分类对早期皮肤病诊断至关重要,但许多现有计算机辅助系统主要依赖皮肤镜图像,而未能充分利用临床实践中常规可用的多模态证据。为解决这一问题,我们提出JI-ADF,一种三模态深度学习框架,整合皮肤镜图像、临床照片和结构化患者元数据,用于基于临床的皮肤病变分类。所提出的架构结合了联合多模态表示学习、模态特定的辅助监督以及自适应决策融合机制,该机制在每个样本基础上动态校准模态贡献。为进一步增强跨模态推理并保持模态特定证据,我们进一步引入了多模态融合注意力(MMFA)模块。我们在大规模MILK10k基准上评估了JI-ADF,该基准反映了真实世界临床获取条件和严重的类别不平衡。所提出的方法在病变类别上表现出强大且均衡的性能,提高了灵敏度和Dice分数,同时保持高特异性和良好的校准。广泛的分析,包括模态消融、校准评估和Grad-CAM可视化,进一步证实了模型的鲁棒性和临床意义的行为。这些结果表明,JI-ADF为实际临床场景中的多模态皮肤病变分类提供了可靠且实用的基础。

英文摘要

Skin lesion classification is essential for early dermatological diagnosis, yet many existing computer-aided systems rely primarily on dermoscopic images and underutilize the multimodal evidence routinely available in clinical practice. To address this gap, we propose \textbf{JI-ADF}, a trimodal deep learning framework that integrates dermoscopic images, clinical photographs, and structured patient metadata for clinically grounded skin lesion classification. The proposed architecture combines joint multimodal representation learning with modality-specific auxiliary supervision and an adaptive decision fusion mechanism that dynamically calibrates modality contributions on a per-sample basis. To enhance cross-modal reasoning while preserving modality-specific evidence, we further introduce a multimodal fusion attention (MMFA) module. We evaluate JI-ADF on the large-scale MILK10k benchmark, which reflects real-world clinical acquisition conditions and severe class imbalance. The proposed method demonstrates strong and well-balanced performance across lesion categories, improving sensitivity and Dice score while maintaining high specificity and good calibration. Extensive analyses, including modality ablation, calibration evaluation, and Grad-CAM visualization, further confirm the robustness and clinically meaningful behavior of the model. These results indicate that JI-ADF provides a reliable and practical foundation for multimodal skin lesion classification in real-world clinical settings.

2604.19741 2026-06-05 cs.CV 版本更新

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

CityRAG: 通过空间感知的视频生成进入城市

Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng, Bharath Hariharan, Jason Y. Zhang, Noah Snavely, Philipp Henzler

发表机构 * Google(谷歌) Cornell University(康奈尔大学) Stanford University(斯坦福大学)

AI总结 CityRAG通过利用地理注册数据的大型语料库,生成空间一致且可导航的真实环境视频,其核心方法是结合学习的先验知识和时空不一致训练数据,以实现复杂的运动和外观变化。

Comments Project page: cityrag.github.io

详情
AI中文摘要

我们解决了生成一个空间一致且可导航的环境的问题,该环境是真实位置的模拟。现有的视频生成模型可以产生一个与文本(T2V)或图像(I2V)提示一致的合理序列。然而,能够重建在任意天气条件和动态物体配置下的真实世界对于下游应用如自动驾驶和机器人模拟至关重要。为此,我们提出了CityRAG,一个视频生成模型,利用大规模地理注册数据作为上下文,将生成过程与物理场景结合,同时保持对复杂运动和外观变化的学习先验。CityRAG依赖于时间不一致的训练数据,教会模型将场景的底层属性与瞬时属性语义解耦。我们的实验表明,CityRAG能够生成连贯的分钟级、物理一致的视频序列,保持数千帧的天气和光照条件,实现回环闭合,并导航复杂的轨迹以重建真实世界地理。

英文摘要

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.

2604.16502 2026-06-05 cs.CV 版本更新

Topology-Aware Layer Pruning for Large Vision-Language Models

面向拓扑的层剪枝用于大型视觉-语言模型

Pengcheng Zheng, Chaoning Zhang, Ya Wen, Wang Liu, Qigan Sun, Jiarong Mo, Jiaquan Zhang, Jewon Lee, Tae-Ho Kim, Kuien Liu, Tianyu Li, Caiyan Qin, Yang Yang

AI总结 本文提出了一种面向拓扑的层剪枝框架,用于大型视觉-语言模型,通过利用拓扑持续同调量化层间拓扑一致性,实现自适应剪枝以保留关键表示转换。

Comments This manuscript has been withdrawn by the authors. It reproduced the methodology of Gardinazzi et al., arXiv:2410.11042, without citation, and utilized code and data from the associated repository (github.com/RitAreaSciencePark/ZigZagLLMs) without disclosure, in violation of the MIT License. A revised future version with full attribution may be prepared. For any feedback, please contact Pengcheng Zheng

详情
AI中文摘要

大型语言模型(LLMs)在自然语言理解和推理方面展示了强大的能力,而最近的扩展将视觉输入纳入其中,使它们能够处理多模态信息。尽管有这些进展,大型视觉-语言模型(LVLMs)仍然带来了显著的计算和内存成本,阻碍了在资源受限场景中的部署。现有的层剪枝方法通常依赖于局部相似性度量或静态代理信号,无法捕捉模型深度中表示的全局和动态演变,这往往导致关键转换层被移除。为了解决这一限制,我们提出了一种面向拓扑的层剪枝框架用于LVLMs。具体而言,我们将层的隐藏状态表示为点云,并利用\textit{simplicial complexes}来建模其演变。通过利用\textit{zigzag persistent homology},我们量化了层间拓扑一致性,并实现了能够保留关键表示转换的自适应剪枝。在多样化的多模态基准上的广泛实验表明,所提出的框架在各种稀疏率范围内均优于现有剪枝方法。我们的代码可在https://github.com/zpc456/TopoVLM上获得。

英文摘要

Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning, while recent extensions that incorporate visual inputs enable them to process multimodal information. Despite these advances, Large Vision-Language Models (LVLMs) incur substantial computational and memory costs, hindering deployment in resource-constrained scenarios. Existing layer pruning methods typically rely on local similarity metrics or static proxy signals, failing to capture the global and dynamic evolution of representations across model depth, which often leads to the removal of transition-critical layers. To address this limitation, we propose a topology-aware layer pruning framework for LVLMs. Specifically, we represent layer-wise hidden states as point clouds and model their evolution using \textit{simplicial complexes}. By leveraging \textit{zigzag persistent homology}, we quantify inter-layer topological consistency and enable adaptive pruning that preserves critical representational transitions. Extensive experiments on diverse multimodal benchmarks demonstrate that the proposed framework consistently outperforms existing pruning methods across a wide range of sparsity ratios. Our code is available at https://github.com/zpc456/TopoVLM.

2604.16370 2026-06-05 cs.CL cs.AI cs.CV 版本更新

Brain-CLIPLM: Semantic Compression for EEG-to-Text Decoding

Brain-CLIPLM: 用于EEG到文本解码的语义压缩

Xiaoli Yang, Huiyuan Tian, Yurui Li, Jianyu Zhang, Shijian Li, Gang Pan

发表机构 * Beijing Institute of Technology, Beijing, China(北京理工大学,北京,中国)

AI总结 该研究提出Brain-CLIPLM框架,通过语义锚点恢复和锚点引导的句子重建,解决EEG信号低信噪比和信息带宽限制的问题,实现了更高的文本检索准确率。

详情
AI中文摘要

从非侵入性脑电图(EEG)解码自然语言仍受限于低信噪比和有限的信息带宽。这提出了一个核心问题:能否从此类信号中可靠地恢复句子级语言?在现实的信息约束下,这种直接恢复的假设可能过强。我们提出语义压缩假设:非侵入性EEG可能保留可恢复的语义锚点,而非句子完整的词法-句法形式。从这一视角,直接句子重建相对于EEG可恢复的信息规模过于细粒度。为解决这种不匹配,我们提出了Brain-CLIPLM,一个两阶段框架,将EEG到文本解码分解为语义锚点恢复和锚点引导的句子重建。第一阶段使用对比学习将词级EEG证据与固定关键词词表对齐,并恢复有序的语义锚点。第二阶段使用基于检索的大型语言模型和思维链推理提示从这些锚点中重建句子意义,遵循粒度匹配原则,使解码复杂度与可恢复的神经信息规模相匹配。在合并的苏黎世认知语言处理(ZuCo)基准上,Brain-CLIPLM实现了67.6%的Top-5和85.0%的Top-25句子检索准确率,其中在中间锚点粒度下表现最强。控制分析,包括排列检验,显示EEG衍生的锚点携带超出语言模型先验的句子特定信息。这些发现表明,EEG到文本解码更适合被视为在锚点引导句子重建之前恢复压缩的语义内容。

英文摘要

Decoding natural language from non-invasive electroencephalography (EEG) remains constrained by low signal-to-noise ratio and limited information bandwidth. This raises a central question: can sentence-level language be reliably recovered from such signals? Under realistic information constraints, this direct-recovery assumption may be too strong. We introduce a semantic compression hypothesis: non-invasive EEG may preserve recoverable semantic anchors rather than the full lexical--syntactic form of a sentence. From this perspective, direct sentence reconstruction is overly fine-grained relative to the recoverable information scale of EEG. To address this mismatch, we propose Brain-CLIPLM, a two-stage framework that decomposes EEG-to-text decoding into semantic-anchor recovery and anchor-guided sentence reconstruction. Stage 1 uses contrastive learning to align word-level EEG evidence with a fixed keyword vocabulary and recover ordered semantic anchors. Stage 2 uses a retrieval-grounded large language model with chain-of-thought reasoning prompts to reconstruct sentence meaning from these anchors, following a granularity matching principle that aligns decoding complexity with the recoverable neural information scale. On the combined Zurich Cognitive Language Processing (ZuCo) benchmark, Brain-CLIPLM achieves 67.6\% Top-5 and 85.0\% Top-25 sentence retrieval accuracy, with the strongest performance at intermediate anchor granularity. Control analyses, including a permutation test, show that EEG-derived anchors carry sentence-specific information beyond language-model priors. These findings suggest that EEG-to-text decoding is better framed as recovering compressed semantic content before anchor-guided sentence reconstruction.
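
For readers unfamiliar with the Top-5 and Top-25 figures above, the sketch below shows how Top-k retrieval accuracy is commonly computed: a query counts as a hit when its true candidate is among the k highest-scoring candidates. The score matrix is invented for illustration and does not come from the paper.

```python
def top_k_accuracy(score_rows, true_indices, k):
    """Fraction of queries whose true candidate is among the k highest scores."""
    hits = 0
    for scores, true_idx in zip(score_rows, true_indices):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if true_idx in ranked[:k]:
            hits += 1
    return hits / len(true_indices)


scores = [
    [0.9, 0.1, 0.3],   # true sentence 0 is ranked first
    [0.2, 0.4, 0.8],   # true sentence 1 is ranked second
    [0.7, 0.6, 0.1],   # true sentence 2 is ranked last
]
truth = [0, 1, 2]
top1 = top_k_accuracy(scores, truth, 1)
top2 = top_k_accuracy(scores, truth, 2)
```

Accuracy can only grow with k, which is why a Top-25 figure is always at least as high as the Top-5 figure on the same candidate pool.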

2410.04960 2026-06-05 cs.CV 版本更新

On Efficient Variants of Segment Anything Model: A Survey

关于Segment Anything Model(SAM)高效变体的综述

Xiaorui Sun, Jun Liu, Heng Tao Shen, Xiaofeng Zhu, Ping Hu

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院) School of Computing and Communications(计算与通信学院) School of Computer Science and Technology(计算机科学与技术学院)

AI总结 本文综述了Segment Anything Model(SAM)高效变体的研究,探讨了在保持准确性的同时提升效率的核心技术和方法,并评估了不同硬件上的性能。

Comments IJCV

详情
AI中文摘要

Segment Anything Model(SAM)是图像分割任务的基础模型,以其在多样化应用中的强大泛化能力而闻名。然而,其出色的性能伴随着显著的计算和资源需求,使其在资源受限的环境中(如边缘设备)部署变得困难。为此,提出了一系列SAM变体以在保持准确性的同时提高效率。本文提供了对这些高效SAM变体的首次全面回顾。我们首先探讨了推动这项研究的动力,然后介绍了SAM中使用的核心技术和模型加速方法。接着,我们详细探讨了SAM加速策略,按方法进行分类,并讨论了几个未来研究方向。最后,我们在各种硬件上对这些方法进行了统一和广泛的评估,评估了它们在代表性基准上的效率和准确性,并提供了整体性能的清晰比较。

英文摘要

The Segment Anything Model (SAM) is a foundational model for image segmentation tasks, known for its strong generalization across diverse applications. However, its impressive performance comes with significant computational and resource demands, making it challenging to deploy in resource-limited environments such as edge devices. To address this, a variety of SAM variants have been proposed to enhance efficiency while keeping accuracy. This survey provides the first comprehensive review of these efficient SAM variants. We begin by exploring the motivations driving this research. We then present core techniques used in SAM and model acceleration. This is followed by a detailed exploration of SAM acceleration strategies, categorized by approach, and a discussion of several future research directions. Finally, we offer a unified and extensive evaluation of these methods across various hardware, assessing their efficiency and accuracy on representative benchmarks, and providing a clear comparison of their overall performance.

2604.06052 2026-06-05 cs.CV 版本更新

Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models

注意,我可以请你决定吗?定位扩散模型中的生成选择

Katarzyna Zaleska, Łukasz Popek, Monika Wysoczańska, Kamil Deja

发表机构 * Warsaw University of Technology(华沙理工大学) valeo.ai IDEAS Research Institute(IDEAS研究所)

AI总结 本文提出基于探测的定位技术,发现自注意力层是解决模糊概念的关键,并设计ICM方法通过干预少量自注意力层实现精确去偏。

Comments CVPR 2026

详情
AI中文摘要

文本到图像扩散模型展现出卓越的生成能力,但其内部运作仍然不透明,尤其是在处理不完全描述性提示时。在这种情况下,模型必须做出隐式决策以生成文本中未明确指定的细节。本文研究了这一决策过程并非分散而是计算上局部化在模型架构中的假设。虽然现有的定位技术专注于提示相关的干预,但我们注意到这种显式条件可能与隐式决策不同。因此,我们引入了一种基于探测的定位技术,以识别概念属性可分性最高的层。我们的发现表明,模糊概念的分辨主要由自注意力层控制,将其确定为最有效的干预点。基于这一发现,我们提出了ICM(隐式选择修改)——一种精确的引导方法,对少量层进行有针对性的干预。大量实验证实,与现有最先进方法相比,干预这些特定的自注意力层能产生更优的去偏性能,并最小化较不精确方法常见的伪影。代码可在https://github.com/kzaleskaa/icm获取。

英文摘要

Text-to-image diffusion models exhibit remarkable generative capabilities, yet their internal operations remain opaque, particularly when handling prompts that are not fully descriptive. In such scenarios, models must make implicit decisions to generate details not explicitly specified in the text. This work investigates the hypothesis that this decision-making process is not diffuse but is computationally localized within the model's architecture. While existing localization techniques focus on prompt-related interventions, we notice that such explicit conditioning may differ from implicit decisions. Therefore, we introduce a probing-based localization technique to identify the layers with the highest attribute separability for concepts. Our findings indicate that the resolution of ambiguous concepts is governed principally by self-attention layers, identifying them as the most effective point for intervention. Based on this discovery, we propose ICM (Implicit Choice-Modification) - a precise steering method that applies targeted interventions to a small subset of layers. Extensive experiments confirm that intervening on these specific self-attention layers yields superior debiasing performance compared to existing state-of-the-art methods, minimizing artifacts common to less precise approaches. The code is available at https://github.com/kzaleskaa/icm.

2602.19190 2026-06-05 cs.CV cs.AI 版本更新

FUSAR-GPT: A Spatiotemporal Feature-Embedded and Two-Stage Decoupled Visual Language Model for SAR Imagery

FUSAR-GPT:一种嵌入时空特征和两阶段解耦的视觉语言模型,用于合成孔径雷达图像

Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun, Xiaorong Guo, Qingchen Fang, Ruyi Zhang, Xinpeng Zhou, Haipeng Wang

发表机构 * Fudan University(复旦大学) Discipline and Technology Center of Microwave Vision Intelligent Sensing, Fudan University(微波视觉智能感知学科与技术中心,复旦大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 本文提出FUSAR-GPT,一种专门针对合成孔径雷达图像的视觉语言模型,通过嵌入时空特征和两阶段解耦方法,在多个遥感视觉语言基准测试中实现了最先进的性能。

详情
AI中文摘要

全天候、全天时合成孔径雷达(SAR)的智能解译研究对于推进遥感应用至关重要。近年来,尽管视觉语言模型(VLMs)在RGB图像上展示了强大的开放世界理解能力,但直接应用于SAR领域时,由于成像机制的复杂性、对散射特征的敏感性和高质量文本语料的稀缺性,其性能受到严重限制。为系统解决这一问题,我们构建了首个SAR图像-文本-AlphaEarth特征三元组数据集,并开发了FUSAR-GPT,一种专门用于SAR的VLM。FUSAR-GPT创新性地引入了一个地理空间基线模型作为“世界知识”先验,并通过“时空锚点”将多源遥感时间特征嵌入模型的视觉主干中,从而实现对SAR图像中目标稀疏表示的动态补偿。此外,我们设计了一种两阶段SFT策略,以解耦大模型的知识注入和任务执行。时空特征嵌入和两阶段解耦范式使FUSAR-GPT在多个典型遥感视觉语言基准测试中实现了最先进的性能,领先主流基线模型超过10%。

英文摘要

Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance across several typical remote sensing visual-language benchmark tests, significantly outperforming mainstream baseline models by over 10%.

2603.16652 2026-06-05 cs.CV 版本更新

Efficient Brood Cell Detection in Layer Trap Nests for Bees and Wasps: Balancing Labeling Effort and Species Coverage

蜂类和黄蜂分层巢穴陷阱中高效育雏细胞检测:平衡标注工作量与物种覆盖度

Chenchang Liu, Felix Fornoff, Annika Grasreiner, Patrick Maeder, Henri Greil, Marco Seeland

发表机构 * Technical University of Ilmenau(伊尔梅瑙技术大学) University of Zurich(苏黎世大学)

AI总结 提出基于深度学习的育雏细胞检测与分类方法,通过约束假阳性损失策略减少标注工作量并缓解类别不平衡,提升检测性能。

详情
AI中文摘要

监测洞穴筑巢的野生蜂类和黄蜂对生物多样性研究和保护至关重要。分层巢穴陷阱(LTNs)正成为研究这些昆虫丰度和物种丰富度的宝贵工具,可深入了解其筑巢活动和生态需求。然而,手动评估LTNs以检测和分类育雏细胞既费时又费力。为此,我们提出一种基于深度学习的方法,用于高效检测和分类LTNs中的育雏细胞。LTNs由于育雏细胞密集排列,导致每张图像的标注工作量很高。此外,我们观察到类别分布显著不平衡,常见物种的出现次数明显多于稀有物种。对常见物种进行全面标注既耗时又加剧数据不平衡,而部分标注则导致数据不完整,从而降低模型性能。为了减少标注工作量并减轻未标注数据的影响,我们引入了一种新颖的约束假阳性损失(CFPL)策略。CFPL动态屏蔽未标注数据的预测,防止其在训练过程中干扰分类损失。实验结果表明,我们的方法提高了检测性能,平衡了模型准确性和标注工作量,同时缓解了类别不平衡问题。

英文摘要

Monitoring cavity-nesting wild bees and wasps is vital for biodiversity research and conservation. Layer trap nests (LTNs) are emerging as a valuable tool to study the abundance and species richness of these insects, offering insights into their nesting activities and ecological needs. However, manually evaluating LTNs to detect and classify brood cells is labor-intensive and time-consuming. To address this, we propose a deep-learning-based approach for efficient brood cell detection and classification in LTNs. LTNs present additional challenges due to densely packed brood cells, leading to a high labeling effort per image. Moreover, we observe a significant imbalance in class distribution, with common species having notably more occurrences than rare species. Comprehensive labeling of common species is time-consuming and exacerbates data imbalance, while partial labeling introduces data incompleteness, which degrades model performance. To reduce labeling effort and mitigate the impact of unlabeled data, we introduce a novel Constrained False Positive Loss (CFPL) strategy. CFPL dynamically masks predictions from unlabeled data, preventing them from interfering with the classification loss during training. Experimental results demonstrate that our method improves detection performance, balances model accuracy against labeling effort, and mitigates class imbalance.
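
The masking idea behind CFPL can be sketched with a plain masked binary cross-entropy. This is a simplified stand-in, not the paper's loss: it assumes each prediction already carries a flag saying whether it falls on a labeled brood cell, whereas the abstract describes the masking as dynamic and does not give the rule. The function name `masked_bce` and the numbers are hypothetical.

```python
import math


def masked_bce(probs, targets, is_labeled):
    """Mean binary cross-entropy over predictions tied to labeled cells only."""
    total, count = 0.0, 0
    for p, t, keep in zip(probs, targets, is_labeled):
        if not keep:                      # unlabeled region: contributes no loss
            continue
        p = min(max(p, 1e-7), 1 - 1e-7)   # clamp for numerical safety
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
        count += 1
    return total / count if count else 0.0


probs = [0.9, 0.8, 0.7]
targets = [1, 0, 0]
labeled = [True, True, False]   # the third detection sits on an unlabeled cell
masked = masked_bce(probs, targets, labeled)
unmasked = masked_bce(probs, targets, [True, True, True])
```

Without the mask, the confident third detection would be punished as a false positive even though it may be a real but unannotated brood cell; masking removes that penalty from the average.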

2603.08491 2026-06-05 cs.CV 版本更新

Global Cross-Modal Geo-Localization: A Million-Scale Dataset and a Physical Consistency Learning Framework

全球跨模态地理定位:一个百万级数据集和一个物理一致性学习框架

Yutong Hu, Jinhui Chen, Chaoqiang Xu, Yuan Kou, Sili Zhou, Shaocheng Yan, Pengcheng Shi, Qingwu Hu, Jiayuan Li

发表机构 * School of Remote Sensing and Information Engineering, Wuhan University(武汉大学遥感与信息工程学院) First Surveying and Mapping Institute of Hunan Province(湖南省第一测绘院)

AI总结 本文提出CORE数据集和PLANET框架,用于解决全球跨模态地理定位问题,通过大规模数据和物理一致性学习提升定位的鲁棒性和全球适用性。

详情
AI中文摘要

跨模态地理定位(CMGL)将地面级文本描述与带有地理标签的航空影像匹配,这对于行人导航和应急响应至关重要。然而,现有研究受限于狭窄的地理覆盖和简单的场景多样性,无法反映全球建筑风格和地形特征的巨大空间异质性。为弥合这一差距并促进通用定位,我们引入CORE,首个专注于全球CMGL的百万级数据集。CORE包含来自六个大洲225个不同地理区域的1,034,786张跨视角图像,在多样的环境条件和城市布局中提供前所未有的视角多样性。我们利用大视觉-语言模型(LVLMs)的零样本推理能力来合成高质量的场景描述,富含判别性线索。此外,我们提出一个物理定律意识的网络(PLANET)用于跨模态地理定位。PLANET引入了一种新的对比学习范式,指导文本表示在捕捉卫星影像的内在物理特征方面发挥作用。在各种地理区域的广泛实验中,PLANET显著优于现有最先进方法,建立了新的基准,为稳健、大规模的地理定位奠定了基础。数据集和源代码将在https://github.com/YtH0823/CORE发布。

英文摘要

Cross-modal Geo-localization (CMGL) matches ground-level text descriptions with geo-tagged aerial imagery, which is crucial for pedestrian navigation and emergency response. However, existing studies are constrained by narrow geographic coverage and simplistic scene diversity, failing to reflect the immense spatial heterogeneity of global architectural styles and topographic features. To bridge this gap and facilitate universal positioning, we introduce CORE, the first million-scale dataset dedicated to global CMGL. CORE comprises 1,034,786 cross-view images sampled from 225 distinct geographic regions across six continents, offering an unprecedented variety of perspectives in varying environmental conditions and urban layouts. We leverage the zero-shot reasoning of Large Vision-Language Models (LVLMs) to synthesize high-quality scene descriptions rich in discriminative cues. Furthermore, we propose a physical-law-aware network (PLANET) for cross-modal geo-localization. PLANET introduces a novel contrastive learning paradigm to guide textual representations in capturing the intrinsic physical signatures of satellite imagery. Extensive experiments across varied geographic regions demonstrate that PLANET significantly outperforms state-of-the-art methods, establishing a new benchmark for robust, global-scale geo-localization. The dataset and source code will be released at https://github.com/YtH0823/CORE.
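
The abstract does not spell out PLANET's contrastive objective, so the sketch below substitutes a standard text-to-image InfoNCE loss to show the general shape of such training: each text embedding should score highest against its own image embedding and lower against the others in the batch. The pairing convention, the temperature value, and the toy vectors are assumptions.

```python
import math


def info_nce(text_vecs, image_vecs, temperature=0.07):
    """Text-to-image InfoNCE; row i of each list is assumed to be a matching pair."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    texts = [normalize(v) for v in text_vecs]
    images = [normalize(v) for v in image_vecs]
    loss = 0.0
    for i, t in enumerate(texts):
        logits = [sum(a * b for a, b in zip(t, img)) / temperature for img in images]
        m = max(logits)                               # stabilise the log-sum-exp
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += log_denom - logits[i]                 # -log softmax at the true pair
    return loss / len(texts)


aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The loss is close to zero when matching pairs are aligned and large when each text is closer to the wrong image, which is the pressure that pulls descriptions toward their own aerial tiles.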

2603.07294 2026-06-05 cs.CV cs.AI 版本更新

MAviS: A Multimodal Conversational Assistant For Avian Species

MAviS:一种用于鸟类物种的多模态对话助手

Yevheniia Kryklyvets, Mohammed Irfan Kurpath, Sahal Shaji Mullappilly, Jinxing Zhou, Fahad Shahbaz Khan, Rao Anwer, Salman Khan, Hisham Cholakkal

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文提出MAviS数据集和MAviS-Chat模型,通过整合图像、音频和文本信息,提升对鸟类物种的细粒度理解与多模态问答能力,并展示了在生态应用中领域适应的多模态大语言模型的重要性。

Comments EMNLP 2025

详情
AI中文摘要

细粒度理解和特定物种的多模态问答对于推进生物多样性保护和生态监测至关重要。然而,现有的多模态大语言模型在处理如鸟类物种等专业领域时面临挑战,难以提供准确且上下文相关的信息。为此,我们引入了MAviS数据集,这是一个大规模的多模态鸟类物种数据集,整合了图像、音频和文本模态,涵盖超过1000种鸟类物种,包含预训练和指令微调子集,并补充了结构化的问答对。基于MAviS数据集,我们引入了MAviS-Chat,一种支持音频、视觉和文本的多模态大语言模型,旨在实现细粒度物种理解、多模态问答和场景特定描述生成。最后,为了定量评估,我们提出了MAviS-Bench,一个包含超过25,000个问答对的基准测试,用于评估跨模态的鸟类物种特定感知和推理能力。实验结果表明,MAviS-Chat大幅优于基线模型MiniCPM-o-2.6,实现了最先进的开源结果,并展示了我们指令微调MAviS数据集的有效性。我们的发现强调了在生态应用中领域适应的多模态大语言模型的必要性。

英文摘要

Fine-grained understanding and species-specific multimodal question answering are vital for advancing biodiversity conservation and ecological monitoring. However, existing multimodal large language models face challenges when it comes to specialized topics like avian species, making it harder to provide accurate and contextually relevant information in these areas. To address this limitation, we introduce the MAviS-Dataset, a large-scale multimodal avian species dataset that integrates image, audio, and text modalities for over 1,000 bird species, comprising both pretraining and instruction-tuning subsets enriched with structured question-answer pairs. Building on the MAviS-Dataset, we introduce MAviS-Chat, a multimodal LLM that supports audio, vision, and text and is designed for fine-grained species understanding, multimodal question answering, and scene-specific description generation. Finally, for quantitative evaluation, we present MAviS-Bench, a benchmark of over 25,000 QA pairs designed to assess avian species-specific perceptual and reasoning abilities across modalities. Experimental results show that MAviS-Chat outperforms the baseline MiniCPM-o-2.6 by a large margin, achieving state-of-the-art open-source results and demonstrating the effectiveness of our instruction-tuned MAviS-Dataset. Our findings highlight the necessity of domain-adaptive multimodal LLMs for ecological applications.

2602.16705 2026-06-05 cs.RO cs.CV 版本更新

HERO: Learning Humanoid End-Effector Control for Visual Whole-Body Open-Vocabulary Object Grasping

HERO:学习人形机器人末端执行器控制,用于视觉全身开放词汇物体抓取

Runpei Dong, Ziyan Li, Arjun Gupta, Xialin He, Saurabh Gupta

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 该研究提出HERO方法,通过结合大视觉模型和仿真训练,实现了视觉全身开放词汇物体抓取任务中末端执行器的高精度控制和场景理解,显著提升了抓取精度和泛化能力。

Comments Project page: https://hero-humanoid.github.io/

详情
AI中文摘要

对任意真实场景中的物体进行视觉移动操作(loco-manipulation)需要精确的末端执行器(EE)控制,以及从视觉输入(如RGB-D图像)中获得的可泛化场景理解。现有的模仿学习和仿真到现实方法通过单体式端到端学习同时学习这两个方面,因此难以扩展。在本工作中,我们针对每个问题采用最合适的工具:用大视觉模型实现可泛化的场景理解,用仿真训练实现精确的末端执行器控制,从而得到一个整体模块化的移动操作系统,表现出强大的泛化能力。我们的核心技术创新是HERO,一个通过结合经典机器人学和机器学习实现的准确残差感知末端执行器跟踪策略。它利用a)逆运动学将残差末端执行器目标转换为参考轨迹,b)一个学习的神经前向模型用于准确的前向运动学,以及c)目标调整和重新规划。这些创新共同将末端执行器跟踪误差减少到2.44厘米,比最强的先前方法提升5.5倍。我们的整体系统在多样化的现实环境中运行,从办公室到咖啡馆,机器人能够可靠地抓取放置在43厘米至92厘米高的表面上的各种日常物体(如杯子、苹果、玩具)。系统性的模块化和端到端测试验证了我们提出设计的有效性。我们相信我们的进展为训练人形机器人与日常物体互动开辟了新途径。

英文摘要

Visual loco-manipulation of arbitrary in-the-wild objects requires accurate end-effector (EE) control and a generalizable understanding of the scene from visual inputs (e.g., RGB-D images). Existing imitation and sim2real methods jointly learn both these aspects via monolithic end-to-end learning and are thus hard to scale. In this work, we bring to bear the best tools for each of these problems -- large vision models for generalizable scene understanding and simulated training for accurate EE control -- leading to an overall modular loco-manipulation system that exhibits strong generalization. Our core technical innovation is HERO, an accurate residual-aware EE tracking policy made possible by combining classical robotics with machine learning. It uses a) inverse kinematics to convert residual end-effector targets into reference trajectories, b) a learned neural forward model for accurate forward kinematics, and c) goal adjustment and replanning. Together, these innovations reduce the end-effector tracking error to 2.44 cm, outperforming the strongest prior method by 5.5x. Our overall system operates in diverse real-world environments, from offices to coffee shops, where the robot reliably grasps various everyday objects (e.g., mugs, apples, toys) on surfaces ranging from 43 cm to 92 cm in height. Systematic modular and end-to-end tests demonstrate the effectiveness of our proposed design. We believe our advances open up new ways of training humanoids to interact with daily objects.
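
The goal-adjustment step listed above can be illustrated in one dimension. The sketch below is not HERO's policy: it leaves out inverse kinematics and the learned forward model, and simply shifts the commanded goal by the observed residual so that a biased low-level tracker ends up at the true target. The tracker `biased_tracker` and its 10% undershoot and 2 cm offset are invented for the example.

```python
def track_with_goal_adjustment(target, tracker, steps=5):
    """Shift the commanded goal by the observed residual until the
    achieved end-effector position matches the true target."""
    command = target
    for _ in range(steps):
        achieved = tracker(command)     # where the end effector actually ends up
        command += target - achieved    # residual-aware goal adjustment
    return tracker(command)


def biased_tracker(command):
    # Hypothetical low-level controller: 10% undershoot plus a 2 cm offset.
    return 0.9 * command - 0.02


target_m = 0.50
naive_error = abs(biased_tracker(target_m) - target_m)
adjusted_error = abs(track_with_goal_adjustment(target_m, biased_tracker) - target_m)
```

In this toy setting each adjustment shrinks the remaining error by a factor of ten, so a few replanning rounds remove a bias that a single open-loop command would leave in place.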

2602.16149 2026-06-05 cs.CV 版本更新

Toward Trustworthy Portrait Editing: Evaluation of Demographic Misrepresentation in I2I Models

迈向可信的人像编辑:评估 I2I 模型中的人口统计误表示

Huichan Seo, Minki Hong, Sieun Choi, Jihie Kim, Jean Oh

发表机构 * arXiv

AI总结 本文通过控制基准测试,评估了指令引导的图像到图像编辑器中身份保留失败的两个模式(软擦除和刻板印象替换),发现肤色变浅等偏差普遍存在且人口统计不均,并提出提示级约束作为缓解措施。

Comments 22 pages, 10 figures. Huichan Seo, Minki Hong and Sieun Choi contributed equally

详情
AI中文摘要

指令引导的图像到图像(I2I)编辑器越来越多地用于消费者和专业视觉工作流程,其可信度不仅取决于提示遵循性,还取决于与身份相关属性的公平保留。我们形式化了两种失败模式:软擦除,即请求的编辑被弱实现或静默抑制;以及刻板印象替换,即编辑引入未请求的、符合刻板印象的人口统计属性。使用包含5,040张编辑人像的控制基准,我们通过视觉语言模型评分和人工评估,评估了这些失败在三个近期开源编辑器中的表现。结果表明,身份保留失败普遍存在且人口统计不均。特别是,62-71%的输出表现出肤色变浅,其中印度和黑人源人像受影响率为72-75%,而白人源人像为44%,表明当身份约束未明确指定时,输出层面存在向更浅肤色或更白人外观漂移的趋势。在一项缓解案例研究中,提示级外观约束将非白人源人像的种族变化评分降低了最多1.48分,而白人源人像基本不变,且无需修改模型权重。这些发现表明,身份保留并非I2I人像编辑系统的统一属性,而是一种分布不均的可信度失败,具有直接的社会后果。在部署规模上,这种静默扭曲可能塑造AI中介的自我表征并强化表征差异。我们引入了一种用于公平性感知评估和生成式编辑系统治理的控制审计协议。项目页面:https://seochan99.github.io/i2i-demographic-bias

英文摘要

Instruction-guided image-to-image (I2I) editors are increasingly used in consumer and professional visual workflows, where trustworthiness depends not only on prompt compliance but also on equitable preservation of identity-relevant attributes. We formalize two failure modes: Soft Erasure, where requested edits are weakly realized or silently suppressed, and Stereotype Replacement, where edits introduce unrequested, stereotype-consistent demographic attributes. Using a controlled benchmark of 5,040 edited portraits, we evaluate these failures across three recent open-weight editors with vision-language model scoring and human evaluation. Our results show that identity-preservation failures are pervasive and demographically uneven. In particular, 62--71% of outputs exhibit skin lightening, with Indian and Black source portraits affected at 72--75%, compared with 44% for White source portraits, indicating output-level drift toward lighter or more White-presenting appearances when identity constraints are underspecified. In a mitigation case study, prompt-level appearance constraints reduce race-change scores for non-White source portraits by up to 1.48 points, while leaving White source portraits largely unchanged, without modifying model weights. These findings show that identity preservation is not a uniform property of I2I portrait editing systems, but an unevenly distributed trustworthiness failure with direct social consequences. At deployment scale, such silent distortions can shape AI-mediated self-representation and reinforce representational disparities. We introduce a controlled audit protocol for fairness-aware evaluation and governance of generative editing systems. Project page: https://seochan99.github.io/i2i-demographic-bias

2602.08749 2026-06-05 cs.CV 版本更新

Shifting the Breaking Point of Flow Matching for Multi-Instance Editing

推移流匹配在多实例编辑中的失效临界点

Carmine Zaccagnino, Fabio Quattrini, Enis Simsar, Marta Tintoré Gazulla, Rita Cucchiara, Alessio Tonioni, Silvia Cascianelli

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 针对流匹配模型在多实例编辑中语义纠缠的问题,提出实例解耦注意力机制,通过分割联合注意力操作强制实例-文本指令与空间区域的绑定,实现单次前向传播的实例级编辑。

Comments Accepted at ICML 2026

详情
AI中文摘要

流匹配模型最近作为扩散模型的高效替代方案出现,特别是在文本引导的图像生成和编辑中,通过连续时间动力学提供更快的推理。然而,现有的基于流的编辑器主要支持全局或单指令编辑,在多实例场景中表现不佳,其中参考输入的多个部分必须独立编辑而不受语义干扰。我们将此限制归因于全局条件速度场和联合注意力机制,它们纠缠了并发编辑。为了解决这个问题,我们引入了实例解耦注意力,一种分割联合注意力操作的机制,在速度场估计期间强制实例特定文本指令与空间区域之间的绑定。我们在自然图像编辑和新引入的具有区域级编辑指令的文本密集信息图表基准上评估了我们的方法。实验结果表明,我们的方法促进了编辑解耦和局部性,同时保持了全局输出的一致性,实现了单次前向传播的实例级编辑。

英文摘要

Flow matching models have recently emerged as an efficient alternative to diffusion, especially for text-guided image generation and editing, offering faster inference through continuous-time dynamics. However, existing flow-based editors predominantly support global or single-instruction edits and struggle with multi-instance scenarios, where multiple parts of a reference input must be edited independently without semantic interference. We identify this limitation as a consequence of globally conditioned velocity fields and joint attention mechanisms, which entangle concurrent edits. To address this issue, we introduce Instance-Disentangled Attention, a mechanism that partitions joint attention operations, enforcing binding between instance-specific textual instructions and spatial regions during velocity field estimation. We evaluate our approach on both natural image editing and a newly introduced benchmark of text-dense infographics with region-level editing instructions. Experimental results demonstrate that our approach promotes edit disentanglement and locality while preserving global output coherence, enabling single-pass, instance-level editing.

2602.08503 2026-06-05 cs.CV cs.CL cs.LG 版本更新

Learning Self-Correction in Vision-Language Models via Rollout Augmentation

通过回滚增强学习视觉-语言模型中的自我纠正

Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出一种基于回滚增强的强化学习框架Octopus,通过重新组合现有回滚生成密集的自我纠正示例,提高样本效率并稳定RL优化,同时引入响应遮蔽策略以解耦自我纠正与直接推理,从而在7个基准测试中实现开源VLM的SOTA性能。

Comments 18 pages

详情
Journal ref
ICML 2026
AI中文摘要

自我纠正对于解决视觉-语言模型(VLMs)中的复杂推理问题至关重要。然而,现有的强化学习(RL)方法难以学会自我纠正,因为有效的自我纠正行为极少出现,导致学习信号极其稀疏。为了解决这一挑战,我们提出了纠正专用回滚(correction-specific rollouts,Octopus),一种RL回滚增强框架,通过重新组合现有回滚来合成密集的自我纠正示例。这种增强借助回滚复用提高了样本效率,同时通过平衡监督稳定了RL优化。此外,我们引入了一种响应掩码策略,将自我纠正与直接推理解耦,避免信号冲突,使两种行为都能被有效学习。基于此,我们提出了Octopus-8B,一种具有可控自我纠正能力的推理VLM。在7个基准测试中,它在开源VLM中取得了SOTA性能,比最佳RLVR基线高出1.0分,同时每步训练时间仅为其0.72倍。

英文摘要

Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 points while requiring only 0.72× the training time per step.

2602.07428 2026-06-05 cs.CV 版本更新

Row-Column Separated Attention Based Low-Light Image/Video Enhancement

基于行-列分离注意力的低光照图像/视频增强

Chengqi Dong, Zhiyuan Cao, Tuoshi Qi, Kexin Wu, Yixing Gao, Fan Tang

发表机构 * School of Artificial Intelligence, Jilin University, China(吉林大学人工智能学院) College of Software, Jilin University, China(吉林大学软件学院) Institute of Computing Technology, Chinese Academy of Sciences, China(中国科学院计算技术研究所)

AI总结 本文提出了一种行-列分离注意力模块(RCSA),用于改进U-Net结构以增强低光照图像和视频,通过减少参数和计算量来利用全局信息指导局部信息,同时提出两种时间损失函数以保持时间一致性。

详情
AI中文摘要

U-Net结构被广泛用于低光照图像/视频增强。增强的图像在没有适当全局信息指导的情况下,会导致局部噪声较大和细节丢失。注意力机制可以更好地关注和利用全局信息。然而,对图像的注意力可能会显著增加参数和计算量。我们提出了一种行-列分离注意力模块(RCSA),插入到改进的U-Net之后。RCSA模块的输入是特征图的行和列的均值和最大值,利用全局信息以较少的参数指导局部信息。我们提出两种时间损失函数,将该方法应用于低光照视频增强并保持时间一致性。在LOL、MIT Adobe FiveK图像和SDSD视频数据集上的广泛实验表明了我们方法的有效性。代码可在https://github.com/cq-dong/URCSA上公开获取。

英文摘要

The U-Net structure is widely used for low-light image/video enhancement. Without proper guidance from global information, however, the enhanced images contain regions with heavy local noise and lose fine detail. Attention mechanisms can better focus on and exploit global information, but applying attention to full images can significantly increase the number of parameters and the computational cost. We propose a Row-Column Separated Attention (RCSA) module inserted after an improved U-Net. The RCSA module takes as input the mean and maximum of each row and column of the feature map, using global information to guide local information with fewer parameters. We propose two temporal loss functions to apply the method to low-light video enhancement and maintain temporal consistency. Extensive experiments on the LOL, MIT Adobe FiveK image, and SDSD video datasets demonstrate the effectiveness of our approach. The code is publicly available at https://github.com/cq-dong/URCSA.
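As a rough illustration of the row/column statistics described in the abstract, the sketch below computes the mean and maximum of each row and column of a feature map with NumPy. The function names and the sigmoid gating in `toy_row_col_gate` are assumptions made for illustration only; the abstract does not specify how the statistics are turned into attention weights, and the authors' repository should be consulted for the actual module.

```python
import numpy as np


def row_col_descriptors(feat):
    """Summarize a (C, H, W) feature map by per-row and per-column statistics.

    Returns:
        rows: (C, H, 2) array holding [mean, max] of every row.
        cols: (C, W, 2) array holding [mean, max] of every column.
    """
    row_mean = feat.mean(axis=2)  # average over the width -> one value per row
    row_max = feat.max(axis=2)
    col_mean = feat.mean(axis=1)  # average over the height -> one value per column
    col_max = feat.max(axis=1)
    rows = np.stack([row_mean, row_max], axis=-1)
    cols = np.stack([col_mean, col_max], axis=-1)
    return rows, cols


def toy_row_col_gate(feat):
    """Hypothetical gating (NOT the paper's module): rescale each pixel by a
    sigmoid of its row statistics and a sigmoid of its column statistics."""
    rows, cols = row_col_descriptors(feat)
    row_gate = 1.0 / (1.0 + np.exp(-rows.sum(axis=-1)))  # (C, H)
    col_gate = 1.0 / (1.0 + np.exp(-cols.sum(axis=-1)))  # (C, W)
    return feat * row_gate[:, :, None] * col_gate[:, None, :]
```

The point of the design is visible in the shapes: the descriptors hold on the order of H + W values per channel rather than H × W, which is consistent with the abstract's claim of using global information with fewer parameters.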

2602.03410 2026-06-05 cs.CV 版本更新

UnHype: CLIP-Guided Hypernetworks for Dynamic LoRA Unlearning

UnHype: 基于CLIP的超网络用于动态LoRA反学习

Piotr Wójcik, Maksym Petrenko, Wojciech Gromski, Przemysław Spurek, Maciej Zieba

发表机构 * Institute of Computer Science, University of Warsaw(华沙大学计算机科学研究所)

AI总结 本文提出UnHype框架,通过将超网络引入单概念和多概念LoRA训练,解决传统LoRA方法在概念语义适应性差、难以平衡删除相关概念与保持泛化能力以及多概念同时删除时的可扩展性问题,展示了在物体擦除、名人擦除和色情内容删除等任务中的有效性。

Comments 23 pages, 11 figures. Accepted at ICML 2026. Code: https://github.com/gmum/UnHype/ Project Page: https://gmum.github.io/UnHype/

详情
AI中文摘要

近期大规模扩散模型的进步加剧了人们对其潜在滥用的担忧,特别是生成逼真但有害或具有社会破坏性的内容。这一挑战推动了对有效机器反学习的研究,即在不损害模型整体生成能力的情况下,选择性地移除特定知识或概念。在各种方法中,低秩适应(LoRA)已成为一种有效且高效的微调方法,用于实现定向反学习。然而,基于LoRA的方法对概念语义的适应性往往有限,并且难以在删除密切相关的概念与保持对更广泛含义的泛化之间取得平衡。此外,当必须同时删除多个概念时,这些方法面临可扩展性挑战。为了解决这些限制,我们引入了UnHype框架,该框架将超网络引入单概念和多概念LoRA训练中。所提出的架构可以直接插入Stable Diffusion以及现代基于流的文本到图像模型,并在其中表现出稳定的训练行为和有效的概念控制。在推理过程中,超网络根据CLIP嵌入动态生成自适应的LoRA权重,使反学习更具上下文感知能力和可扩展性。我们在多个具有挑战性的任务上评估了UnHype,包括物体擦除、名人擦除和色情内容删除,展示了其有效性和通用性。代码见GitHub:https://github.com/gmum/UnHype。

英文摘要

Recent advances in large-scale diffusion models have intensified concerns about their potential misuse, particularly in generating realistic yet harmful or socially disruptive content. This challenge has spurred growing interest in effective machine unlearning, the process of selectively removing specific knowledge or concepts from a model without compromising its overall generative capabilities. Among various approaches, Low-Rank Adaptation (LoRA) has emerged as an effective and efficient method for fine-tuning models toward targeted unlearning. However, LoRA-based methods often exhibit limited adaptability to concept semantics and struggle to balance removing closely related concepts with maintaining generalization across broader meanings. Moreover, these methods face scalability challenges when multiple concepts must be erased simultaneously. To address these limitations, we introduce UnHype, a framework that incorporates hypernetworks into single- and multi-concept LoRA training. The proposed architecture can be directly plugged into Stable Diffusion as well as modern flow-based text-to-image models, where it demonstrates stable training behavior and effective concept control. During inference, the hypernetwork dynamically generates adaptive LoRA weights based on the CLIP embedding, enabling more context-aware, scalable unlearning. We evaluate UnHype across several challenging tasks, including object erasure, celebrity erasure, and explicit content removal, demonstrating its effectiveness and versatility. See the code on GitHub: https://github.com/gmum/UnHype.
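To make the idea of a hypernetwork that emits LoRA weights from an embedding more concrete, here is a minimal NumPy sketch. Everything in it is an assumption for illustration: `ToyLoRAHypernetwork`, its single linear maps, and `adapted_weight` are hypothetical names, and the random vector stands in for a real CLIP embedding. It is not the UnHype architecture, only the generic pattern "embedding in, low-rank update out".

```python
import numpy as np


class ToyLoRAHypernetwork:
    """Hypothetical linear hypernetwork: maps a concept embedding to the two
    low-rank LoRA factors A (rank x d_in) and B (d_out x rank)."""

    def __init__(self, embed_dim, d_in, d_out, rank=4, seed=0):
        rng = np.random.default_rng(seed)
        self.d_in, self.d_out, self.rank = d_in, d_out, rank
        # One linear map per LoRA factor; a real hypernetwork would be deeper.
        self.to_a = rng.normal(0.0, 0.02, size=(embed_dim, rank * d_in))
        self.to_b = rng.normal(0.0, 0.02, size=(embed_dim, d_out * rank))

    def lora_factors(self, embedding):
        a = (embedding @ self.to_a).reshape(self.rank, self.d_in)
        b = (embedding @ self.to_b).reshape(self.d_out, self.rank)
        return a, b


def adapted_weight(base_weight, a, b, scale=1.0):
    """Standard LoRA update: W' = W + scale * (B @ A)."""
    return base_weight + scale * (b @ a)
```

Because the factors are recomputed from the embedding at inference time, different concepts yield different low-rank updates without training one adapter per concept, which is the scalability argument the abstract makes.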

2601.21288 2026-06-05 cs.AI cs.CV 版本更新

Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

Drive-KD:自动驾驶中用于视觉语言模型的多教师知识蒸馏

Weitong Lian, Zecong Tang, Haoran Li, Tianjian Gao, Yifei Wang, Zixu Wang, Lingyi Meng, Tengju Ru, Zhejun Cui, Yichen Zhu, Hangshuo Cao, Qi Kang, Tianxing Chen, Kaixuan Wang, Yu Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出Drive-KD框架,通过将自动驾驶分解为感知-推理-规划三元组,并利用知识蒸馏转移能力,构建了专用教师模型,并通过非对称梯度投影缓解跨能力梯度冲突,验证了方法在不同模型家族和规模上的泛化能力,展示了蒸馏模型在自动驾驶任务中的优越性能。

详情
AI中文摘要

自动驾驶是一项重要且安全关键的任务,近期大型语言模型(LLM)和视觉语言模型(VLM)的进展为该领域的推理和规划带来了新的可能。然而,大模型需要大量GPU内存且推理延迟较高,而传统的监督微调(SFT)往往难以弥补小模型的能力差距。为了解决这些限制,我们提出了Drive-KD,一个将自动驾驶分解为“感知-推理-规划”三元组并通过知识蒸馏迁移这些能力的框架。我们将层特定的注意力确定为蒸馏信号,构建出优于基线的能力专用单教师模型。此外,我们将这些单教师设置统一到多教师蒸馏框架中,并引入非对称梯度投影以缓解跨能力梯度冲突。广泛的评估验证了我们的方法在不同模型家族和规模上的泛化能力。实验表明,我们蒸馏得到的InternVL3-1B模型GPU内存占用减少约42倍、吞吐量提高约11.4倍,在DriveBench上的整体性能优于同家族的预训练78B模型,并在规划维度上超越GPT-5.1,为高效的自动驾驶VLM提供了见解。

英文摘要

Autonomous driving is an important and safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to bridge the capability gaps of small models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a "perception-reasoning-planning" triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Moreover, we unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient conflicts. Extensive evaluations validate the generalization of our method across diverse model families and scales. Experiments show that our distilled InternVL3-1B model, with ~42 times less GPU memory and ~11.4 times higher throughput, achieves better overall performance than the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension, providing insights toward efficient autonomous driving VLMs.

2601.18219 2026-06-05 physics.med-ph cs.CV cs.LG 版本更新

Automated HER2 scoring with uncertainty quantification using lensfree holography and deep learning

利用无透镜全息和深度学习进行自动HER2评分及不确定性量化

Che-Yung Shen, Xilin Yang, Yuzhu Li, Leon Lenk, Aydogan Ozcan

发表机构 * Electrical and Computer Engineering Department, University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校电气与计算机工程系) Bioengineering Department, University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校生物工程系) California NanoSystems Institute (CNSI), University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校加州纳米系统研究所) Department of Computer Science, University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校计算机科学系)

AI总结 本文提出了一种基于无透镜全息和深度学习的紧凑型、低成本系统,用于自动免疫组化染色乳腺组织切片的HER2评分,通过贝叶斯蒙特卡洛Dropout策略提高诊断可靠性,实现了高准确率的HER2分类和评分。

Comments 23 Pages, 6 Figures, 1 Table

详情
Journal ref
BME Frontiers, AAAS (2026)
AI中文摘要

准确评估人类表皮生长因子受体2(HER2)的表达对于乳腺癌的诊断、预后和治疗选择至关重要;然而,大多数现有的数字HER2评分方法依赖于笨重且昂贵的光学系统。本文提出了一种紧凑且经济的无透镜全息平台,结合深度学习,对免疫组化染色的乳腺组织切片进行自动HER2评分。该系统在RGB激光照明下捕获染色HER2组织切片的无透镜衍射图案,并以约84 mm²/分钟的有效吞吐量获取约1250 mm²样本区域上的复场信息。为提高诊断可靠性,我们采用了基于贝叶斯蒙特卡洛Dropout的不确定性量化策略,为每个预测提供自主的不确定性估计,支持可靠且稳健的HER2评分,整体修正率为30.4%。在由412个独立组织样本组成的盲测集上,结合不确定性量化,本方法在4类(0、1+、2+、3+)HER2分类中达到84.9%的测试准确率,在二分类(0/1+ vs. 2+/3+)HER2评分中达到94.8%的准确率。总体而言,这种无透镜全息方法为便携式、高吞吐量和低成本的HER2评分提供了一条实用途径,尤其适用于缺乏传统数字病理基础设施的资源有限环境。

英文摘要

Accurate assessment of human epidermal growth factor receptor 2 (HER2) expression is critical for breast cancer diagnosis, prognosis, and therapy selection; yet, most existing digital HER2 scoring methods rely on bulky and expensive optical systems. Here, we present a compact and cost-effective lensfree holography platform integrated with deep learning for automated HER2 scoring of immunohistochemically stained breast tissue sections. The system captures lensfree diffraction patterns of stained HER2 tissue sections under RGB laser illumination and acquires complex field information over a sample area of ~1,250 mm^2 at an effective throughput of ~84 mm^2 per minute. To enhance diagnostic reliability, we incorporated an uncertainty quantification strategy based on Bayesian Monte Carlo dropout, which provides autonomous uncertainty estimates for each prediction and supports reliable, robust HER2 scoring, with an overall correction rate of 30.4%. Using a blinded test set of 412 unique tissue samples, our approach achieved a testing accuracy of 84.9% for 4-class (0, 1+, 2+, 3+) HER2 classification and 94.8% for binary (0/1+ vs. 2+/3+) HER2 scoring with uncertainty quantification. Overall, this lensfree holography approach provides a practical pathway toward portable, high-throughput, and cost-effective HER2 scoring, particularly suited for resource-limited settings, where traditional digital pathology infrastructure is unavailable.
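For readers unfamiliar with Monte Carlo dropout, the following sketch shows the generic technique on a toy two-layer classifier: dropout stays active at inference, the model is run several times, and the entropy of the averaged class probabilities serves as an uncertainty score. The network, its weights, and the names `mc_dropout_predict`, `n_passes` and `drop_p` are assumptions for illustration; the paper's actual network and decision rule are not described in the abstract.

```python
import numpy as np


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mc_dropout_predict(x, w1, w2, n_passes=50, drop_p=0.5, seed=0):
    """Run a toy 2-layer classifier n_passes times with dropout kept ON.

    x: (N, D) inputs, w1: (D, H) and w2: (H, K) weights.
    Returns the mean class probabilities (N, K) and the predictive
    entropy (N,), which is larger when the passes disagree or are unsure.
    """
    rng = np.random.default_rng(seed)
    probs = []
    for _ in range(n_passes):
        h = np.maximum(x @ w1, 0.0)            # ReLU hidden layer
        mask = rng.random(h.shape) >= drop_p   # fresh dropout mask on every pass
        h = h * mask / (1.0 - drop_p)          # inverted-dropout rescaling
        probs.append(softmax(h @ w2))
    mean_probs = np.mean(probs, axis=0)
    entropy = -np.sum(mean_probs * np.log(mean_probs + 1e-12), axis=-1)
    return mean_probs, entropy
```

In a scoring pipeline, predictions whose entropy exceeds a chosen threshold could be flagged for review instead of being reported automatically; the threshold would have to be tuned on validation data.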

2404.10370 2026-06-05 cs.CV cs.LG 版本更新

Know Yourself Better: Diverse Object-Related Features Improve Open Set Recognition

Jiawen Xu, Margret Keuper

发表机构 * Technical University Berlin(柏林技术大学) University of Mannheim(曼海姆大学)

AI总结 研究通过分析特征多样性提升开放集识别性能,提出了一种利用特征多样性的新型开放集识别方法。

详情
AI中文摘要

开放集识别(OSR)是机器学习中的关键方面,旨在解决推理过程中检测新类别的挑战。在深度学习领域,训练于封闭集数据的神经分类器通常难以识别新类别,导致错误预测。为了解决这一问题,已提出各种启发式方法,允许模型通过声明"I don't know"来表达不确定性。然而,文献中仍存在空白,因为对这些方法的底层机制探讨有限。在本文中,我们对开放集识别方法进行了分析,重点在于特征多样性方面。我们的研究揭示了学习多样化的判别特征与增强OSR性能之间存在显著相关性。基于这一见解,我们提出了一种新的OSR方法,利用特征多样性的优势。通过在标准OSR测试平台上的严格评估,证明了我们方法的有效性,显示出相对于最新方法的显著改进。

英文摘要

Open set recognition (OSR) is a critical aspect of machine learning, addressing the challenge of detecting novel classes during inference. Within the realm of deep learning, neural classifiers trained on a closed set of data typically struggle to identify novel classes, leading to erroneous predictions. To address this issue, various heuristic methods have been proposed, allowing models to express uncertainty by stating "I don't know." However, a gap in the literature remains, as there has been limited exploration of the underlying mechanisms of these methods. In this paper, we conduct an analysis of open set recognition methods, focusing on the aspect of feature diversity. Our research reveals a significant correlation between learning diverse discriminative features and enhancing OSR performance. Building on this insight, we propose a novel OSR approach that leverages the advantages of feature diversity. The efficacy of our method is substantiated through rigorous evaluation on a standard OSR testbench, demonstrating a substantial improvement over state-of-the-art methods.

2601.08446 2026-06-05 cs.CV cs.LG 版本更新

Noise-Adaptive Regularization for Robust Multi-Label Remote Sensing Image Classification

针对鲁棒多标签遥感图像分类的噪声自适应正则化

Tom Burgert, Julia Henkel, Begüm Demir

发表机构 * Burgert et al.(Burgert 等)

AI总结 本文提出了一种噪声自适应正则化方法NAR,通过区分加性噪声和减性噪声,提升遥感多标签分类的鲁棒性,实验表明在不同噪声场景下均优于现有方法。

Comments Submitted to TGRS

详情
AI中文摘要

可靠多标签分类(MLC)方法的发展已成为遥感(RS)研究中的重要方向。随着RS数据规模的扩大,标注过程越来越多地依赖主题产品或众包流程以降低人工标注成本。尽管成本效益高,这些策略往往以部分错误标注的形式引入多标签噪声。在MLC中,标签噪声以加性噪声、减性噪声或两者的混合形式出现。先前工作大多忽略了这一区别,通常将噪声标注视为监督信号,缺乏能够明确适应不同噪声类型的机制。为了解决这一限制,我们提出NAR,一种噪声自适应正则化方法,它在半监督学习框架中明确区分加性和减性噪声。NAR采用基于置信度的标签处理机制,动态保留高置信度标签条目,暂时停用中等置信度条目,并通过翻转纠正低置信度条目。这种选择性抑制监督与早期学习正则化(ELR)相结合,以稳定训练并减轻对损坏标签的过拟合。在加性、减性及混合噪声场景中的实验表明,NAR在不同噪声情况下均比现有方法更稳健。性能提升在减性及混合噪声情况下最为显著,表明适应性抑制和选择性纠正噪声监督为遥感MLC中的噪声鲁棒学习提供了一种有效策略。

英文摘要

The development of reliable methods for multi-label classification (MLC) has become a prominent research direction in remote sensing (RS). As the scale of RS data continues to expand, annotation procedures increasingly rely on thematic products or crowdsourced procedures to reduce the cost of manual annotation. While cost-effective, these strategies often introduce multi-label noise in the form of partially incorrect annotations. In MLC, label noise arises as additive noise, subtractive noise, or a combination of both in the form of mixed noise. Previous work has largely overlooked this distinction and commonly treats noisy annotations as supervised signals, lacking mechanisms that explicitly adapt learning behavior to different noise types. To address this limitation, we propose NAR, a noise-adaptive regularization method that explicitly distinguishes between additive and subtractive noise within a semi-supervised learning framework. NAR employs a confidence-based label handling mechanism that dynamically retains label entries with high confidence, temporarily deactivates entries with moderate confidence, and corrects low confidence entries via flipping. This selective attenuation of supervision is integrated with early-learning regularization (ELR) to stabilize training and mitigate overfitting to corrupted labels. Experiments across additive, subtractive, and mixed noise scenarios demonstrate that NAR consistently improves robustness compared with existing methods. Performance improvements are most pronounced under subtractive and mixed noise, indicating that adaptive suppression and selective correction of noisy supervision provide an effective strategy for noise robust learning in RS MLC.
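The three-way label handling described above (retain, temporarily deactivate, flip) can be sketched as follows. This is a simplified reading under stated assumptions: "confidence" is taken to be the model's agreement with the given label, the thresholds `tau_low` and `tau_high` are hypothetical values, and the paper's separate treatment of additive versus subtractive noise and its ELR term are not modeled here.

```python
import numpy as np


def handle_noisy_labels(probs, labels, tau_low=0.2, tau_high=0.6):
    """Confidence-based handling of multi-label entries (illustrative only).

    probs:  predicted probability that each class is present.
    labels: given (possibly noisy) binary labels, same shape as probs.
    Returns (new_labels, active), where active marks entries that should
    contribute to the loss in the current training step.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Confidence = how strongly the model agrees with the given label.
    conf = np.where(labels == 1, probs, 1.0 - probs)
    flip = conf < tau_low     # low confidence: treat the label as wrong and flip it
    keep = conf >= tau_high   # high confidence: retain the label unchanged
    new_labels = np.where(flip, 1 - labels, labels)
    active = keep | flip      # moderate-confidence entries are masked out for now
    return new_labels, active
```

For example, with `probs = [0.9, 0.4, 0.05, 0.1]` and `labels = [1, 1, 1, 0]`, the first and last entries are retained, the second is deactivated, and the third is flipped from 1 to 0.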

2601.08182 2026-06-05 cs.CV 版本更新

Second-order Gaussian directional derivative representations for image high-resolution corner detection

二阶高斯方向导数表示法用于图像高分辨率角点检测

Jiamiao Lu, Dongbo Xie, Junjie Qiu, Lingkun Ma, Changming Sun, Weichuan Zhang

发表机构 * School of Electronic Information and Artificial Intelligence, Shaanxi University of Science and Technology(陕西科技大学电子信息与人工智能学院) CSIRO Data61

AI总结 本文提出了一种新的高分辨率角点检测方法,通过二阶高斯方向导数(SOGDD)滤波器对END型和L型高分辨率角点模型进行平滑处理,发现了高分辨率角点的多种特征,从而实现了对相邻角点的精确检测,实验结果表明该方法在定位误差、图像模糊变换鲁棒性、图像匹配和3D重建方面优于现有方法。

Comments 11pages, 9 figures

详情
AI中文摘要

角点检测被广泛应用于各种计算机视觉任务,如图像匹配和3D重建。我们的研究指出,张等人使用简单角点模型获得一系列角点特征的方法在理论上存在缺陷,因为相邻角点的灰度信息会相互影响。为了解决上述问题,本文使用二阶高斯方向导数(SOGDD)滤波器对两种典型的高分辨率角点模型(即END型和L型模型)进行平滑处理。然后分别推导了这两种角点模型的SOGDD表示,发现了许多高分辨率角点的特征,从而使得我们能够展示如何选择高斯滤波尺度以从图像中获取强度变化信息,准确描绘相邻角点。此外,本文首次提出了一种新的高分辨率角点检测方法,能够准确检测相邻角点。实验结果表明,所提出的方法在定位误差、对图像模糊变换的鲁棒性、图像匹配和3D重建方面均优于现有方法。

英文摘要

Corner detection is widely used in various computer vision tasks, such as image matching and 3D reconstruction. Our research indicates that there are theoretical flaws in Zhang et al.'s use of a simple corner model to obtain a series of corner characteristics, because the grayscale information of two adjacent corners can affect each other. To address this issue, this work uses a second-order Gaussian directional derivative (SOGDD) filter to smooth two typical high-resolution corner models (i.e., END-type and L-type models). We then derive the SOGDD representations of these two corner models separately and identify a number of characteristics of high-resolution corners, which show how to select Gaussian filtering scales so that the intensity variation information obtained from images accurately depicts adjacent corners. In addition, we propose, for the first time, a high-resolution corner detection method for images that can accurately detect adjacent corner points. Experimental results verify that the proposed method outperforms state-of-the-art methods in terms of localization error, robustness to image blur transformation, image matching, and 3D reconstruction.
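For orientation, the textbook form of a second-order Gaussian directional derivative is given below. It assumes an isotropic Gaussian with scale σ and a unit direction n_θ = (cos θ, sin θ); the paper may use a different (for example anisotropic) kernel, so this is a reference formula rather than the authors' exact filter.

```latex
\[
g_\sigma(x, y) = \frac{1}{2\pi\sigma^2}
  \exp\!\left(-\frac{x^2 + y^2}{2\sigma^2}\right)
\]
\[
\frac{\partial^2 g_\sigma}{\partial \mathbf{n}_\theta^2}
  = \cos^2\theta \, \frac{\partial^2 g_\sigma}{\partial x^2}
  + 2 \sin\theta \cos\theta \, \frac{\partial^2 g_\sigma}{\partial x \, \partial y}
  + \sin^2\theta \, \frac{\partial^2 g_\sigma}{\partial y^2}
  = \frac{(x\cos\theta + y\sin\theta)^2 - \sigma^2}{\sigma^4} \, g_\sigma(x, y)
\]
```

The closed form makes the role of the scale explicit: the response changes sign where the projection of (x, y) onto the direction equals ±σ, so the choice of σ determines whether the responses of two nearby corners overlap.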

2601.06056 2026-06-05 cs.CY cs.AI cs.CV 版本更新

Using street view images and visual LLMs to predict heritage values for governance support: Risks, ethics, and policy implications

利用街景图像和视觉大语言模型预测遗产价值以支持治理:风险、伦理与政策影响

Tim Johansson, Mikael Mangold, Kristina Dabrock, Anna Donarelli, Ingrid Campo-Ruiz

发表机构 * RISE Research Institutes of Sweden AB(瑞典RISE研究机构) Malmö University(马尔默大学) Forschungszentrum Jülich GmbH(朱利奇研究中心) Uppsala University(乌普萨拉大学)

AI总结 本研究利用街景图像和视觉大语言模型评估瑞典建筑遗产价值,以支持建筑翻新计划的制定,探讨了方法中的问题、潜在改进以及使用LLM数据的伦理风险。

详情
AI中文摘要

在2025年至2026年期间,欧盟成员国必须实施《建筑性能能效指令》,要求所有成员国制定国家建筑翻新计划。在瑞典,没有全面记录具有遗产价值的建筑的国家注册表,这被视为阻碍建筑翻新计划制定分析的障碍。本研究旨在帮助瑞典当局了解瑞典建筑存量中的遗产价值。通过对瑞典各地(N=154710)的街景图像中的建筑进行多模态大语言模型(LLM)分析,评估了可见的遗产价值指示方面。使用LLM的零样本预测作为基础,确定了潜在具有遗产价值的建筑,覆盖500万平方米的供暖地板面积。本文呈现了预测结果和所学到的经验,并将其与瑞典建筑翻新计划的制定相结合,作为治理的一部分。讨论了方法中的问题和潜在的改进。探讨了当局使用基于LLM的数据的潜在风险,重点是透明性、错误检测和阿谀奉承的问题。

英文摘要

During 2025 and 2026, the Energy Performance of Buildings Directive is being implemented in the European Union member states, requiring all member states to have National Building Renovation Plans. In Sweden, there is no comprehensive national register of buildings with heritage values. This is seen as a barrier for the analyses underlying the development of Building Renovation Plans by the involved Swedish authorities. The purpose of this research was to assist Swedish authorities in developing information on heritage values in the Swedish building stock. Buildings in street view images from all over Sweden (N = 154,710) have been analysed using multimodal Large Language Models (LLMs) to assess visible aspects indicative of heritage value. Zero-shot predictions by LLMs were used as a basis for identifying buildings with potential heritage values for 5.0 million square meters of heated floor area. In this paper, the results of the predictions and lessons learned are presented and related to the development of the Swedish Building Renovation Plan as part of governance. The problems with the method and potential improvements are discussed. Risks with authorities' use of LLM-based data are addressed, with a focus on issues of transparency, error detection and sycophancy.

2601.02730 2026-06-05 cs.CV 版本更新

HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps

HOLO:用于SD地图细粒度视觉定位的单应性引导位姿估计网络

Xuchang Zhong, Xu Cao, Jinke Feng, Hao Fang

发表机构 * Beijing Institute of Technology(北京理工大学) University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出了一种基于单应图的视觉定位网络,用于多视角图像与标准定义地图之间的细粒度视觉定位,通过构建满足单应约束的输入对,利用单应关系引导特征融合并限制姿态输出到有效区域,提高了训练效率和定位精度。

详情
AI中文摘要

标准定义(SD)地图上的视觉定位已成为自动驾驶中一种有前途的低成本和可扩展的解决方案。然而,现有基于回归的方法往往忽视了固有的几何先验,导致训练效率低下和定位精度有限。本文提出了一种新的基于单应图的姿态估计网络,用于多视角图像与标准定义(SD)地图之间的细粒度视觉定位。我们通过将地面视图特征投影到BEV域并强制与地图特征进行语义对齐来构建满足单应约束的输入对。然后利用单应关系引导特征融合,并将姿态输出限制在有效可行区域,这在训练效率和定位精度上都显著优于依赖注意力融合和直接3-自由度姿态回归的先前方法。到目前为止,这是首次将BEV语义推理与单应学习统一起来用于图像到地图定位的工作。此外,通过显式建模单应变换,所提出的框架自然支持跨分辨率输入,增强了模型的灵活性。在nuScenes数据集上的广泛实验表明,我们的方法显著优于现有的视觉定位方法。代码和预训练模型将公开发布以促进未来研究。

英文摘要

Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusion and direct 3-DoF pose regression. To the best of our knowledge, this is the first work to unify BEV semantic reasoning with homography learning for image-to-map localization. Furthermore, by explicitly modeling homography transformations, the proposed framework naturally supports cross-resolution inputs, enhancing model flexibility. Extensive experiments on the nuScenes dataset demonstrate that our approach significantly outperforms existing state-of-the-art visual localization methods. Code and pretrained models will be publicly released to foster future research.

2512.21218 2026-06-05 cs.CV 版本更新

Latent Implicit Visual Reasoning

潜在隐式视觉推理

Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig

发表机构 * University of California, Berkeley(加州大学伯克利分校) Xero MIT-IBM Watson AI Lab(麻省理工-IBM Watson人工智能实验室)

AI总结 本文提出了一种任务无关的机制,训练大规模多模态模型(LMMs)在无需显式中间监督的情况下发现和使用潜在视觉推理标记,从而在多种视觉中心任务中优于直接监督微调,并在不使用辅助图像、边界框、图像裁剪、深度图或思维链注释的情况下,与或优于先前基于文本和显式视觉中间推理方法相媲美。

详情
AI中文摘要

尽管大规模多模态模型(LMMs)取得了显著进展,但它们仍然主要以文本为中心,依赖语言作为其核心推理模态。因此,它们处理以视觉为主的推理任务的能力有限。最近的方法试图通过使用辅助图像、深度图或图像裁剪来监督中间视觉步骤,以解决这个问题。然而,这些策略对“有用的”视觉抽象应是什么样子施加了限制性的先验,带来了高昂的标注成本,并且难以跨任务泛化。为了解决这一关键限制,我们提出了潜在隐式视觉推理(LIVR),一种任务无关的机制,训练LMMs在没有显式中间监督的情况下发现并使用潜在视觉推理标记。这些标记进行全局注意,并以任务自适应的方式重新编码图像,使模型无需手工设计的监督即可提取相关视觉信息。LIVR在多种以视觉为中心的任务和多个LMM骨干网络上始终优于直接监督微调。在更广泛的比较中,LIVR与先前基于文本和显式视觉中间推理的方法相当或更优,同时不需要辅助图像、边界框、图像裁剪、深度图或思维链标注等额外的中间监督。项目页面:https://www.chuyishang.com/livr/

英文摘要

While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose Latent Implicit Visual Reasoning (LIVR), a task-agnostic mechanism that trains LMMs to discover and use latent visual reasoning tokens without explicit intermediate supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. LIVR consistently outperforms direct supervised fine-tuning across diverse vision-centric tasks and multiple LMM backbones. In broader comparisons, LIVR remains competitive with or outperforms prior text-based and explicit-visual-intermediate reasoning methods, while requiring no additional intermediate supervision such as helper images, bounding boxes, image crops, depth maps, or chain-of-thought annotations. Our project page can be found here: https://www.chuyishang.com/livr/

2512.15153 2026-06-05 cs.CV 版本更新

Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning

利用多模态思维链推理的可解释动作形式评估

Mengshi Qi, Yeteng Wu, Wulian Yun, Xianlin Zhang, Huadong Ma

发表机构 * State Key Laboratory of Networking and Switching Technology(网络与交换技术国家重点实验室)

AI总结 本文提出了一种新的动作形式评估任务,并引入了一个包含大量健身和武术视频的多级标注数据集CoT-AFA,通过引入新的链式思维解释方法,提出了可解释性健身评估框架,以提升动作分析能力。

详情
AI中文摘要

评估人类动作是否标准并提供合理的反馈以提高动作标准化程度在现实场景中非常重要但具有挑战性。然而,当前视频理解方法主要关注动作是什么和在哪里,无法满足要求。同时,现有数据集缺乏指示动作标准化程度的标签,动作质量评估数据集缺乏可解释性和详细反馈。因此,我们定义了一个新的人类动作形式评估(AFA)任务,并引入了一个新的多样化数据集CoT-AFA,其中包含大量健身和武术视频,具有多级标注以进行全面的视频分析。我们通过引入一种新的链式思维解释范式来丰富CoT-AFA数据集。与提供孤立反馈不同,我们的解释提供了一个完整的推理过程--从识别一个动作步骤到分析其结果并提出具体的解决方案。此外,我们提出了一种名为可解释性健身评估器的框架,不仅可以判断动作,还可以解释原因并提供解决方案。该框架采用两个并行处理流和动态门控机制来融合视觉和语义信息,从而提升其分析能力。实验结果表明,我们的方法在解释生成(例如,CIDEr提升16.0%)、动作分类(准确率提升2.7%)和质量评估(准确率提升2.1%)方面均取得了改进,揭示了CoT-AFA在未来研究中的巨大潜力。我们的数据集和源代码可在https://github.com/MICLAB-BUPT/EFA上获取。

英文摘要

Evaluating whether a human action is performed to standard and providing reasonable feedback to improve it is crucial but challenging in real-world scenarios. However, current video understanding methods are mainly concerned with what the action is and where it occurs, which does not meet these requirements. Meanwhile, most existing datasets lack labels indicating the degree of action standardization, and action quality assessment datasets lack explainability and detailed feedback. Therefore, we define a new Human Action Form Assessment (AFA) task and introduce a new diverse dataset, CoT-AFA, which contains a large number of fitness and martial arts videos with multi-level annotations for comprehensive video analysis. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm. Instead of offering isolated feedback, our explanations provide a complete reasoning process, from identifying an action step to analyzing its outcome and proposing a concrete solution. Furthermore, we propose a framework named Explainable Fitness Assessor, which can not only judge an action but also explain why and provide a solution. This framework employs two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information, thereby boosting its analytical capabilities. The experimental results demonstrate that our method achieves improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy) and quality assessment (+2.1% in accuracy), revealing the great potential of CoT-AFA for future studies. Our dataset and source code are available at https://github.com/MICLAB-BUPT/EFA.

2512.08560 2026-06-05 cs.CV 版本更新

BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain

BrainExplore: 在人脑中大规模发现可解释的视觉表征

Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, Michal Irani

发表机构 * Weizmann Institute of Science(魏茨曼科学研究所) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文提出了一种大规模自动化框架,用于发现和解释人脑皮层中的视觉表征,通过无监督的数据驱动分解方法发现候选可解释模式,并通过识别最能激发这些模式的自然图像生成自然语言描述,从而揭示了数千种覆盖多种不同视觉概念的可解释模式,包括此前未报告的细粒度表征。

详情
AI中文摘要

理解人类大脑如何表示视觉概念,以及这些表示在哪些脑区编码,仍然是一个长期存在的挑战。几十年的研究已经提升了我们对视觉表征的理解,但脑信号仍然很大且复杂,可能的视觉概念空间非常广阔。因此,大多数研究仍处于小规模,依赖手动检查,专注于特定区域和概念,并很少进行系统验证。我们提出了一种大规模、自动化的框架,用于在人脑皮层上发现和解释视觉表征。我们的方法包括两个主要阶段。首先,我们通过无监督、数据驱动的分解方法在fMRI活动中发现候选可解释模式。其次,我们通过识别最能激发这些模式的自然图像集,并生成这些图像共享视觉意义的自然语言描述来解释每个模式。为了扩展这一过程,我们引入了一个自动化流程,测试多个候选解释,分配可靠性分数,并为每个脑区模式选择最佳描述。我们的框架揭示了成千上万种可解释模式,涵盖了许多不同的视觉概念,包括此前未报告的细粒度表征。

英文摘要

Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and concepts, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns reliability scores, and selects the best description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.

2512.05774 2026-06-05 cs.CV cs.AI cs.CL 版本更新

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

主动视频感知:用于代理长视频理解的迭代证据寻求

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, Juan Carlos Niebles

发表机构 * Salesforce AI Research(Salesforce AI研究院) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 本文提出了一种主动视频感知框架AVP,通过迭代计划-观察-反思过程,主动决定视频内容的观察目标和时间,以提高长视频理解的准确性和效率。

Comments Website: https://activevideoperception.github.io/

详情
AI中文摘要

长视频理解(LVU)具有挑战性,因为回答现实世界查询往往依赖于稀疏、时间分散的线索,这些线索隐藏在数小时的大部分冗余和无关内容中。尽管代理流程提高了视频推理能力,但现有框架依赖于查询无关的描述器来感知视频信息,这浪费了计算资源并模糊了细粒度的时间和空间信息。受主动感知理论的启发,我们主张LVU代理应主动决定观察什么、何时和在哪里观察,并持续评估当前观察是否足够回答查询。我们提出了主动视频感知(AVP),一种证据寻求框架,将视频视为交互环境,并直接从像素中获取紧凑、查询相关的证据。具体而言,AVP运行一个迭代的计划-观察-反思过程,使用MLLM代理。在每个轮次中,计划者提出有针对性的视频交互,观察者执行以提取时间戳证据,反思者评估证据对查询的充分性,要么终止并给出答案,要么触发进一步观察。在五个LVU基准测试中,AVP实现了最高整体准确率,有显著提升。值得注意的是,AVP在平均整体准确率上比最佳代理方法高出5.7%,同时仅需18.4%的推理时间和12.4%的输入令牌。

英文摘要

Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest overall accuracy with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average overall accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
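
The plan-observe-reflect loop described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the `plan`, `observe`, and `reflect` callables, the `Evidence` record, and the `max_rounds` cap are all hypothetical stand-ins for the MLLM agents.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Evidence:
    start_s: float  # start of the inspected clip, in seconds
    end_s: float    # end of the inspected clip, in seconds
    note: str       # what the observer reported for this clip


# Hypothetical agent signatures; in AVP these roles are played by MLLM agents.
Planner = Callable[[str, List[Evidence]], Tuple[float, float]]
Observer = Callable[[Tuple[float, float]], Evidence]
Reflector = Callable[[str, List[Evidence]], Optional[str]]


def active_perception(query: str, plan: Planner, observe: Observer,
                      reflect: Reflector, max_rounds: int = 5):
    """Iterate plan -> observe -> reflect until the reflector returns an answer."""
    evidence: List[Evidence] = []
    for _ in range(max_rounds):
        window = plan(query, evidence)      # decide where and when to look next
        evidence.append(observe(window))    # extract time-stamped evidence
        answer = reflect(query, evidence)   # None means "not yet sufficient"
        if answer is not None:
            return answer, evidence
    return None, evidence                   # round budget exhausted
```

The reflector doubles as the stopping rule, so the loop only inspects as many clips as the query needs, up to the assumed round budget.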

2511.20158 2026-06-05 cs.CV 版本更新

Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs

在安全对齐的连续视觉指令微调中实现和谐参数适应

Ziqi Wang, Chang Che, Qi Wang, Hui Ma, Zenglin Shi, Cees G. M. Snoek, Meng Wang

发表机构 * Hefei University of Technology(合肥工业大学) Tsinghua University(清华大学) University of Amsterdam(阿姆斯特丹大学)

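AI summary This paper studies how to balance safety and task performance in continual visual instruction tuning of safety-aligned MLLMs, and proposes Harmonious Parameter Adaptation (HPA), a post-training framework that mitigates forgetting through parameter partition, balanced selection, and orthogonal adjustment.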

详情
AI中文摘要

尽管连续视觉指令微调(CVIT)在适应多模态大语言模型(MLLMs)方面显示出潜力,但现有研究大多集中在没有安全对齐的模型上。这种关键疏忽忽略了现实中的MLLMs本质上需要此类机制以缓解潜在风险。在本文中,我们关注CVIT在安全对齐的MLLMs中的应用,并观察到在连续适应过程中,模型不仅会经历任务遗忘,还会表现出安全性的下降。实现安全性和任务性能之间的和谐平衡仍然是一个关键挑战。为此,我们提出了和谐参数适应(HPA),一种由基于聚焦的参数分区、和谐平衡的参数选择和正交参数调整组成的后训练框架。具体而言,HPA根据参数对安全或任务性能的关注程度将其分为两种类型,并从平衡的角度选择聚焦的参数以保留。此外,HPA对参数更新施加正交约束,以进一步缓解灾难性遗忘。在CVIT基准和安全评估数据集上的大量实验表明,HPA比现有基线更好地保持了高安全性和减轻了遗忘问题。代码可在https://github.com/Minato-Zackie/HPA上获得。

英文摘要

While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines. Code is available at https://github.com/Minato-Zackie/HPA.
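
As a rough illustration of what an orthogonality constraint on a parameter update can look like, the sketch below removes from an update its component along one direction. The abstract does not say what the updates are kept orthogonal to, so the single `protected` direction and the function name `project_out` are assumptions made only for this example.

```python
def project_out(update, protected):
    """Remove from `update` its component along the `protected` direction.

    Both arguments are plain lists of floats of equal length. The result is
    orthogonal to `protected` (up to floating-point error).
    """
    dot_up = sum(u * p for u, p in zip(update, protected))
    dot_pp = sum(p * p for p in protected)
    if dot_pp == 0.0:
        return list(update)  # nothing to protect against
    scale = dot_up / dot_pp
    return [u - scale * p for u, p in zip(update, protected)]


# Hypothetical 2-parameter example: the first axis is the protected direction.
adjusted = project_out([3.0, 4.0], [1.0, 0.0])
```

Here `adjusted` keeps only the part of the update that does not move the parameters along the protected axis.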

2511.13183 2026-06-05 cs.CV 版本更新

GenTract: Generative Global Tractography

GenTract:生成式全局束追踪

Alec Sargood, Lemuel Puglisi, Elinor Thompson, Mirco Musolesi, Daniel C. Alexander

发表机构 * Hawkes Institute and Department of Computer Science, University College London, UK(霍克斯研究所和计算机科学系,伦敦大学学院,英国) Department of Maths and Computer Science, University of Catania, Italy(数学和计算机科学系,卡塔尼亚大学,意大利) AI Centre and Department of Computer Science, University College London, UK(人工智能中心和计算机科学系,伦敦大学学院,英国)

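AI summary This paper proposes GenTract, a generative-model-based global tractography method that learns a direct mapping from dMRI to complete, anatomically plausible streamlines, improving precision and reliability on low-resolution and noisy data.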

Comments Upload of camera-ready

详情
AI中文摘要

束追踪是通过扩散磁共振成像(dMRI)推断大脑白质路径轨迹的过程。局部束追踪方法通过逐步跟随局部纤维方向估计来构建束流,易产生误差累积和高假阳性率,尤其是在噪声或低分辨率数据中。相比之下,全局方法试图通过优化束流集合以最大化与底层纤维方向估计的兼容性,但计算成本较高。为解决这些挑战,我们引入GenTract,这是首个生成式全局束追踪模型。我们将束追踪视为生成任务,学习从dMRI到完整、解剖学合理束流的直接映射。我们比较了基于扩散和流匹配的两种范式,并评估了GenTract在与现有最先进基线方法的性能。值得注意的是,GenTract在精度上比次优方法DDTracking和TractOracle分别高出1.8倍和2.1倍。在具有挑战性的低分辨率和噪声设置中,其优势更加明显,比最接近的竞争对手高出3.5倍。通过在研究级数据上产生高精度的束流图,同时在不完美的低分辨率数据上保持可靠性,GenTract代表了全局束追踪的一个有前景的解决方案。

英文摘要

Tractography is the process of inferring the trajectories of white-matter pathways in the brain from diffusion magnetic resonance imaging (dMRI). Local tractography methods, which construct streamlines by following local fiber orientation estimates stepwise through an image, are prone to error accumulation and high false positive rates, particularly on noisy or low-resolution data. In contrast, global methods, which attempt to optimize a collection of streamlines to maximize compatibility with underlying fiber orientation estimates, are computationally expensive. To address these challenges, we introduce GenTract, the first generative model for global tractography. We frame tractography as a generative task, learning a direct mapping from dMRI to complete, anatomically plausible streamlines. We compare both diffusion-based and flow matching paradigms and evaluate GenTract's performance against state-of-the-art baselines. Notably, GenTract achieves precision 1.8x and 2.1x higher than the next-best methods, DDTracking and TractOracle, respectively. This advantage becomes even more pronounced in challenging low-resolution and noisy settings, where it outperforms the closest competitor by a factor of 3.5. By producing tractograms with high precision on research-grade data while also maintaining reliability on imperfect, lower-resolution data, GenTract represents a promising solution for global tractography.

2511.10254 2026-06-05 cs.CV 版本更新

Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis

Facial-R1: 通过推理与识别对齐实现面部情绪分析

Jiulong Wu, Yucheng Shen, Lingyong Yan, Haixin Sun, Deguo Xia, Jizhou Huang, Min Cao

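AI summary This paper proposes Facial-R1, a three-stage alignment framework that addresses hallucinated reasoning and the misalignment between reasoning and recognition in facial emotion analysis, and introduces the FEA-20K benchmark dataset; experiments show state-of-the-art performance on multiple standard benchmarks.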

Comments Withdrawn by the authors due to pending intellectual property considerations. The authors have determined that the current version contains material that should not have been publicly disseminated at this stage

详情
英文摘要

Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.

2510.23497 2026-06-05 cs.CV 版本更新

VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

VOLD:通过在线蒸馏将LLM推理能力转移到视觉语言模型

Walid Bousselham, Hilde Kuehne, Cordelia Schmid

发表机构 * Tuebingen AI Center(图宾根人工智能中心) University of Tuebingen(图宾根大学) MIT-IBM Watson AI Lab(MIT-IBM沃森人工智能实验室) Inria, École Normale Supérieure, CNRS, PSL Research University(法国国家信息与自动化研究所、巴黎高等师范学院、法国国家科学研究中心、巴黎文理研究大学)

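AI summary This paper proposes VOLD, a framework that transfers the reasoning ability of text-only models to vision-language models by combining Group Relative Policy Optimization with on-policy distillation, improving reasoning performance and verifying the importance of cold-start alignment for online training.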

Comments www.walidbousselham.com/VOLD/

详情
AI中文摘要

训练视觉语言模型(VLMs)进行复杂推理仍是一项具有挑战性的任务,例如由于高质量图像-文本推理数据稀缺。相反,基于文本的推理资源丰富且可扩展,但如何利用它们来增强VLM推理仍是一个开放性问题。为此,我们提出了VOLD,一种将推理能力从文本-only教师模型转移到VLM学生模型的框架。为此,VOLD结合了通过组相对策略优化(GRPO)进行的强化学习与在线蒸馏,使学生推理轨迹能够由教师模型引导,从而在单独使用GRPO时获得显著提升。我们进一步表明,在此场景中,在线训练阶段有效的转移需要冷启动对齐,并且在教师和学生之间缺乏足够的分布对齐时,在线蒸馏无法提供有意义的指导。我们评估了VOLD在MMMU-Pro、MathVision、MathVista和LogicVista等多样化的基准测试中,显示出VOLD显著优于基线模型,并在现有最先进水平上取得显著提升。我们的消融研究显示了通过SFT进行冷启动对齐在文本-only教师与在线蒸馏中的重要性。

英文摘要

Training vision-language models (VLMs) for complex reasoning remains a challenging task, in part due to the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but it is still an open question how to leverage them for VLM reasoning. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a margin. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.

2504.10020 2026-06-05 cs.CL cs.AI cs.CV 版本更新

The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?

性能提升的幻象:为何对比解码无法减轻多模态大语言模型中的对象幻觉?

Hao Yin, Guangzong Si, Zilei Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Eastern Institute of Technology, Ningbo(宁波东部技术研究所)

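AI summary This paper studies the effectiveness of contrastive decoding for mitigating object hallucinations in multimodal large language models (MLLMs) and finds that its performance gains stem mainly from two misleading factors, challenging the assumed effectiveness of contrastive decoding strategies.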

详情
AI中文摘要

对比解码策略被广泛用于减少多模态大语言模型(MLLMs)中的对象幻觉。这些方法通过构建对比样本来诱导幻觉,然后在输出分布中抑制它们。然而,本文证明此类方法无法有效缓解幻觉问题。在POPE基准测试中观察到的性能提升主要由两个误导性因素驱动:(1)对模型输出分布的粗略、单向调整;(2)自适应可能性约束,将采样策略简化为贪婪搜索。为进一步说明这些问题,我们引入了一系列虚假改进方法,并将其性能与对比解码技术进行评估。实验结果揭示了对比解码中观察到的性能提升与其缓解幻觉的初衷无关。我们的发现挑战了对比解码策略有效性的常见假设,并为开发真正有效的MLLMs幻觉解决方案铺平了道路。

英文摘要

Contrastive decoding strategies are widely used to reduce object hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on the POPE benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model's output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.
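
To make the second factor concrete, here is a small sketch of a commonly used contrastive decoding rule: a token is scored as (1 + alpha) * logit_orig - alpha * logit_contrast, but only tokens whose original probability is at least beta times the maximum probability remain candidates. The logits and the values of `alpha` and `beta` are made up for illustration, and the paper's exact setup may differ.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_candidates(logits_orig, logits_contrast, alpha=1.0, beta=0.1):
    """Return {token_index: contrastive score} for tokens passing the plausibility cut."""
    p_orig = softmax(logits_orig)
    cutoff = beta * max(p_orig)  # adaptive plausibility constraint
    scores = {}
    for i, (lo, lc) in enumerate(zip(logits_orig, logits_contrast)):
        if p_orig[i] >= cutoff:
            scores[i] = (1 + alpha) * lo - alpha * lc
    return scores


# Toy 3-token vocabulary (illustrative numbers only).
logits_orig = [2.0, 1.5, -1.0]
logits_contrast = [2.5, 0.5, -1.0]
loose = contrastive_candidates(logits_orig, logits_contrast, beta=0.1)
strict = contrastive_candidates(logits_orig, logits_contrast, beta=1.0)
```

With the loose cutoff, two tokens survive and the contrast term changes which one scores highest. With `beta=1.0` only the original argmax token survives, so decoding is greedy on the original distribution no matter what the contrastive branch says.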

2508.09697 2026-06-05 cs.LG cs.CV 版本更新

Towards Label-Noise Resistant Learning via Optimal Brain Damage Masking

通过最优脑损伤遮蔽实现抗标签噪声学习

Xinlei Zhang, Fan Liu, Chuanyi Zhang, Fan Cheng, Qian Li, Yuhui Zheng

发表机构 * Hohai University(河海大学)

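AI summary This paper proposes a label-noise-resistant learning method based on Optimal Brain Damage theory, which masks redundant connections to reduce the propagation of noisy gradients and improve model robustness.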

详情
AI中文摘要

噪声标签在现实世界中不可避免。由于深度神经网络强大的记忆能力,这些噪声标签会导致显著的性能下降。现有的噪声鲁棒方法主要集中在鲁棒损失函数和样本选择上,对动态架构适应的探索相对有限。本文重新审视了标签噪声存在下模型连接的作用。直观上,噪声标签引起的性能下降源于噪声梯度的反向传播。由于最终分类器层是这种误差传播的主要通道,直接丢弃分类器中的冗余连接可以在根源上截断噪声梯度。为了识别这些冗余连接,我们利用模型压缩中的经典最优脑损伤(OBD)理论,该理论指出造成微小损失扰动的参数可以安全移除而不影响性能。基于这一原则,我们发现遮蔽低激活边可以保持网络的正常拟合能力,同时有效降低噪声梯度传播的风险。为了将这一理论洞察与实际训练相结合,我们提出了一种新的选择性边遮蔽(SEM)机制,用于广泛采用的全连接(FC)层,以增强模型对噪声标签的鲁棒性。SEM可以自适应地只保留最重要的边用于信息传播,同时抑制由噪声标签引起的梯度误差。作为插件式组件,SEM可以无缝集成到各种噪声鲁棒方法中,包括鲁棒损失函数和样本选择。在合成和现实世界基准上的广泛评估表明,我们的OBD驱动方法在性能上始终优于最先进的方法。

英文摘要

Noisy labels are inevitable in real-world scenarios. Due to the strong capacity of deep neural networks to memorize corrupted labels, these noisy labels cause significant performance degradation. Existing noise-robust methods have mainly focused on robust loss functions and sample selection, with comparatively limited exploration of dynamic architectural adaptation. In this paper, we rethink the role of model connectivity in the presence of label noise. Intuitively, performance degradation caused by noisy labels stems from the backpropagation of noisy gradients. Since the final classifier layer acts as the primary gateway for this error propagation, directly discarding redundant connections within the classifier can structurally intercept noisy gradients at the root. Consequently, to identify these redundant connections, we leverage the seminal Optimal Brain Damage (OBD) theory from model compression, which posits that parameters causing negligible loss perturbation can be safely removed without impairing performance. Guided by this principle, we reveal that masking low-activation edges maintains the network's normal fitting capacity while effectively reducing the risk of backpropagating noisy gradients. To bridge this theoretical insight with practical training, we propose a novel Selective Edge Masking (SEM) mechanism for the widely-adopted fully connected (FC) layer to enhance model robustness against noisy labels. It can adaptively preserve only the most critical edges for information propagation while suppressing gradient errors caused by noisy labels. As a plug-and-play component, SEM can be seamlessly integrated into various noise-robust methods, including robust loss functions and sample selection. Extensive evaluations on both synthetic and real-world benchmarks demonstrate that our OBD-driven approach consistently outperforms state-of-the-art methods.
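
The idea of masking low-activation edges in a fully connected layer can be sketched as below. The abstract does not specify how edge activation is measured or how many edges are kept, so the per-edge score |w_ij * x_j| and the `keep_ratio` parameter are assumptions chosen for illustration, not the SEM mechanism itself.

```python
def masked_fc_forward(weights, x, keep_ratio=0.5):
    """Forward pass of y = W x that keeps only the strongest edges per output unit.

    weights: list of rows, one row of input weights per output unit.
    x: list of input activations.
    Returns (outputs, mask), where mask[i][j] is 1 if edge j -> i was kept.
    """
    k = max(1, int(len(x) * keep_ratio))
    outputs, mask = [], []
    for row in weights:
        contrib = [w * xj for w, xj in zip(row, x)]  # per-edge contribution
        order = sorted(range(len(x)), key=lambda j: abs(contrib[j]), reverse=True)
        kept = set(order[:k])
        mask.append([1 if j in kept else 0 for j in range(len(x))])
        outputs.append(sum(contrib[j] for j in kept))
    return outputs, mask


# Hypothetical 2x4 layer with unit inputs.
W = [[1.0, -2.0, 0.1, 0.0],
     [0.5, 0.4, 0.3, 3.0]]
outputs, mask = masked_fc_forward(W, [1.0, 1.0, 1.0, 1.0], keep_ratio=0.5)
```

In an autograd framework the same mask would multiply the weight matrix, so a masked edge contributes nothing to the output and receives zero gradient for that sample.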

2509.15061 2026-06-05 cs.RO cs.CV 版本更新

Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

Ask-to-Clarify: 通过多轮对话解决指令歧义

Xingyao Lin, Xinghao Zhu, Tianyi Lu, Sicheng Xie, Hui Zhang, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China(复旦大学计算机科学与人工智能学院) Shanghai Innovation Institute, Shanghai, China(上海创新研究院) Mechanical Systems Control Lab, UC Berkeley, California, USA(伯克利机械系统控制实验室)

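AI summary This paper proposes the Ask-to-Clarify framework, which resolves instruction ambiguity through multi-turn dialogue; it combines a vision-language model with a diffusion model and is trained with a two-stage knowledge-insulation strategy, yielding more efficient collaborative embodied agents across multiple tasks.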

Comments 9 pages, 4 figures, 7 tables

详情
AI中文摘要

具身代理的最终目标是创造能够与人类交互的合作者,而非仅仅执行指令的被动执行者。这要求代理能够通过沟通、协调和适应行动来响应人类反馈。最近,视觉语言代理(VLAs)的进步为实现这一目标提供了途径。然而,大多数当前基于VLAs的具身代理仍处于单向模式:接收指令并执行,而无反馈。这种做法在现实场景中往往失效,因为指令通常存在歧义。在本文中,我们提出了Ask-to-Clarify框架来解决这一问题。该框架首先通过多轮对话解决模糊的指令,然后生成低层动作。具体来说,Ask-to-Clarify框架由两个组件组成:一个用于协作的视觉语言模型(VLM)和一个用于动作的扩散模型。我们还引入了一个连接模块,该模块根据VLM的输出生成扩散模型的条件。该模块通过指令调整观察来生成可靠的条件。我们采用两阶段知识绝缘策略来训练我们的框架。首先,我们使用模糊解决对话数据微调协作组件以处理歧义。然后,我们在冻结协作组件的情况下整合动作组件。这在保持交互能力的同时,微调扩散模型以生成动作。训练策略保证了我们的框架能够首先提问,然后生成动作。在推理过程中,一个信号检测器充当路由器,帮助框架在提问和执行之间切换。我们在8个现实任务中评估了Ask-to-Clarify框架,结果表明它在现有最先进的VLAs中表现更优。结果表明,所提出的框架及其训练策略为协作式具身代理提供了一条可行路径。

英文摘要

The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recently, advances in vision-language-action models (VLAs) have offered a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue. Then it generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components: a VLM for collaboration and a diffusion model for action. We also introduce a connection module that generates conditions for the diffusion model based on the output of the VLM. This module adjusts the observation based on the instructions to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component using ambiguity-solving dialogue data to handle ambiguity. Then, we integrate the action component while freezing the collaboration component. This preserves the interaction abilities while fine-tuning the diffusion model to generate actions. The training strategy guarantees our framework can first ask questions, then generate actions. During inference, a signal detector functions as a router that helps our framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework in 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.

2503.22929 2026-06-05 cs.CV 版本更新

Self-supervised Feature Disentanglement and Augmentation Network for One-class Face Anti-spoofing

自监督特征解耦与增强网络用于单类面部反伪装

Pei-Kai Huang, Jun-Xiong Chong, Ming-Tsung Hsu, Fang-Yu Hsu, Yi-Ting Lin, Kai-Heng Chien, Hao-Chiang Shao, Chiou-Ting Hsu

发表机构 * National Tsing Hua University(国立清华大学)

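AI summary This paper proposes an Unsupervised Feature Disentanglement and Augmentation Network (UFDANet) that improves the generalizability of one-class face anti-spoofing by disentangling liveness and domain features; experiments show that it outperforms existing one-class methods and is comparable to two-class methods.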

详情
AI中文摘要

面部反伪装(FAS)技术旨在通过区分真实活体面部与欺骗性尝试来增强面部身份认证的安全性。虽然双类FAS方法可能因过拟合训练攻击而性能不佳,单类FAS方法能处理未见过的攻击但对活体特征中混杂的领域信息不够鲁棒。为此,我们提出了一种无监督特征解耦与增强网络(UFDANet),一种单类FAS技术,通过解耦特征增强面部图像以提升泛化能力。UFDANet采用新颖的无监督特征解耦方法分离活体和领域特征,促进判别性特征学习。它整合了非分布活体特征增强方案以合成未见过的欺骗类活体特征,从而增强活体特征的表示性和判别性。此外,UFDANet还整合了领域特征增强流程以合成未见过的领域特征,从而实现更好的泛化能力。广泛实验表明,所提出的UFDANet优于现有单类FAS方法,并在与现有最先进双类FAS方法的性能上具有可比性。

英文摘要

Face anti-spoofing (FAS) techniques aim to enhance the security of facial identity authentication by distinguishing authentic live faces from deceptive attempts. While two-class FAS methods risk overfitting to training attacks to achieve better performance, one-class FAS approaches handle unseen attacks well but are less robust to domain information entangled within the liveness features. To address this, we propose an Unsupervised Feature Disentanglement and Augmentation Network (UFDANet), a one-class FAS technique that enhances generalizability by augmenting face images via disentangled features. UFDANet employs a novel unsupervised feature disentangling method to separate the liveness and domain features, facilitating discriminative feature learning. It integrates an out-of-distribution liveness feature augmentation scheme to synthesize new liveness features of unseen spoof classes, which deviate from the live class, thus enhancing the representability and discriminability of liveness features. Additionally, UFDANet incorporates a domain feature augmentation routine to synthesize unseen domain features, thereby achieving better generalizability. Extensive experiments demonstrate that the proposed UFDANet outperforms previous one-class FAS methods and achieves comparable performance to state-of-the-art two-class FAS methods.

2507.12336 2026-06-05 cs.CV 版本更新

Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors

无监督单目多视图扩散先验的3D关键点发现

Subin Jeon, In Cho, Junyoung Hong, Woong Oh Cho, Seon Joo Kim

发表机构 * Yonsei University(延世大学)

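AI summary This paper proposes KeyDiff3D, a framework that accurately predicts 3D keypoints from a single image by exploiting the geometric priors in a pretrained multi-view diffusion model and converting implicit 3D priors into explicit 3D feature volumes, enabling keypoint estimation and 3D object manipulation.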

Comments Accepted at CVPR 2026. Project page: https://subin6.github.io/keydiff3d-project/

详情
AI中文摘要

大多数现有的3D关键点估计方法依赖于手动标注或校准的多视角图像,这两种方法都昂贵且难以收集。本文引入KeyDiff3D框架,该框架能够从单张图像准确预测3D关键点,从而消除对昂贵数据采集的依赖。为此,我们利用预训练的多视角扩散模型中嵌入的强大几何先验。在我们的框架中,扩散模型从单张图像生成多视角图像,作为监督信号,为模型提供3D几何线索。我们还引入了3D特征提取器,将扩散特征中隐含的3D先验转换为显式的3D特征体。除了准确的关键点估计外,我们还引入了一条管道,使由扩散模型生成的3D对象得以操控。在多样化的数据集上,包括Human3.6M、CUB-200-2011、斯坦福狗、以及多个真实世界和非领域输入,实验结果突显了我们的方法在准确性、泛化能力和从单张图像生成3D对象并进行操控方面的有效性。

英文摘要

Most existing 3D keypoint estimation methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect. This paper introduces KeyDiff3D, a framework that can accurately predict 3D keypoints from a single image, thus eliminating the need for such expensive data acquisitions. To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, the diffusion model generates multi-view images from a single image, serving as supervision signals to provide 3D geometric cues to our model. We also introduce a 3D feature extractor that transforms implicit 3D priors embedded in the diffusion features into explicit 3D feature volumes. Beyond accurate keypoint estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. Experimental results on diverse datasets, including Human3.6M, CUB-200-2011, Stanford Dogs, and several in-the-wild and out-of-domain inputs, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.

2506.22078 2026-06-05 cs.CV 版本更新

Towards Accurate Heart Rate Measurement from Ultra-Short Video Clips via Periodicity-Guided rPPG Estimation and Signal Reconstruction

通过周期性引导的rPPG估计与信号重建实现从超短视频片段中准确的心率测量

Pei-Kai Huang, Ya-Ting Chan, Kuan-Wen Chen, Chiou-Ting Hsu, Xiaoding Wang, Md. Jalil Piran

发表机构 * National Tsing Hua University(国立清华大学) Fujian Normal University(福建师范大学) Sungkyunkwan University(成均馆大学)

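AI summary To measure heart rate from ultra-short video clips, this paper proposes a periodicity-guided rPPG estimation method and a signal reconstruction technique that improve measurement accuracy, and validates them on multiple benchmark datasets.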

详情
AI中文摘要

许多远程心率(HR)测量方法专注于从持续约10秒的视频片段中估计远程光体积脉动图(rPPG)信号,但常常忽略了从超短视频片段中估计心率的必要性。在本文中,我们旨在通过专门解决两个关键挑战来准确测量超短2秒视频片段中的心率。首先,为了解决超短视频片段中心跳周期数量有限的问题,我们提出了一种有效的周期性引导的rPPG估计方法,该方法强制在从超短片段中估计的rPPG信号与其更长的真实信号之间的周期性保持一致。其次,为了解决由于频谱泄漏导致的估计不准确问题,我们提出包含生成器来从超短片段中重建更长的rPPG信号,同时保持其周期性一致性,以实现更准确的心率测量。在四个rPPG估计基准数据集上的大量实验表明,我们提出的方法不仅能够准确测量超短视频片段中的心率,而且在rPPG估计技术中实现了最先进的性能。

英文摘要

Many remote Heart Rate (HR) measurement methods focus on estimating remote photoplethysmography (rPPG) signals from video clips lasting around 10 seconds but often overlook the need for HR estimation from ultra-short video clips. In this paper, we aim to accurately measure HR from ultra-short 2-second video clips by specifically addressing two key challenges. First, to overcome the limited number of heartbeat cycles in ultra-short video clips, we propose an effective periodicity-guided rPPG estimation method that enforces consistent periodicity between rPPG signals estimated from ultra-short clips and their much longer ground truth signals. Next, to mitigate estimation inaccuracies due to spectral leakage, we propose including a generator to reconstruct longer rPPG signals from ultra-short ones while preserving their periodic consistency to enable more accurate HR measurement. Extensive experiments on four rPPG estimation benchmark datasets demonstrate that our proposed method not only accurately measures HR from ultra-short video clips but also outperforms previous rPPG estimation techniques to achieve state-of-the-art performance.
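
A short worked example shows why a 2-second clip is hard for frequency-based HR estimation. A window of T seconds gives a discrete Fourier transform whose bins are 1/T Hz apart, that is, 60/T beats per minute. The 30 fps frame rate and the 40 to 180 bpm search band below are assumptions for illustration, not values taken from the paper.

```python
def hr_bin_spacing_bpm(clip_seconds: float) -> float:
    """Spacing between adjacent DFT frequency bins, expressed in beats per minute."""
    return 60.0 / clip_seconds


def candidate_hr_bins(clip_seconds, fps=30.0, lo_bpm=40.0, hi_bpm=180.0):
    """List the DFT bin centers (in bpm) that fall inside the assumed HR band."""
    n = int(round(clip_seconds * fps))       # number of frames in the clip
    step = hr_bin_spacing_bpm(n / fps)
    bins = []
    k = 0
    while k * step <= hi_bpm and k <= n // 2:  # stay below the Nyquist bin
        if k * step >= lo_bpm:
            bins.append(k * step)
        k += 1
    return bins


short_bins = candidate_hr_bins(2.0)   # 2-second clip
long_bins = candidate_hr_bins(10.0)   # 10-second clip
```

Under these assumptions the 2-second clip offers only five bins, 30 bpm apart, so a true rate of 75 bpm falls between the 60 and 90 bpm bins and its energy leaks into both. The 10-second clip has bins every 6 bpm. This resolution gap is what reconstructing a longer signal is meant to close.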

2506.20263 2026-06-05 cs.CV 版本更新

Hierarchical Mask-Enhanced Dual Reconstruction Network for Few-Shot Fine-Grained Image Classification

层次化掩码增强双重建网络用于少样本细粒度图像分类

Ning Luo, Meiyin Hu, Huan Wan, Yanyan Yang, Zhuohang Jiang, Xin Wei

发表机构 * Nanjing University(南京大学)

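AI summary This paper proposes the Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN), which combines dual-layer feature reconstruction with mask-enhanced feature processing to distinguish visually similar subclasses in few-shot fine-grained image classification; experiments show that it outperforms existing methods on three fine-grained datasets.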

详情
AI中文摘要

少样本细粒度图像分类(FS-FGIC)具有挑战性,因为它需要在极少量标记示例下区分视觉相似的子类。现有方法存在关键限制:基于度量的方法丢失空间信息并导致局部特征错位,而基于重建的方法未充分利用层次特征信息且缺乏对判别关键区域的选择性关注。我们提出层次化掩码增强双重建网络(HMDRN),整合双层特征重建与掩码增强特征处理。HMDRN通过可学习权重利用不同网络层次的互补视觉信息,平衡高层语义表示与中层结构细节。它包含一个空间二进制掩码增强的Transformer模块,可选择增强判别区域并过滤背景噪声。在三个细粒度数据集上,HMDRN在Conv-4和ResNet-12背骨上均优于现有最先进方法。消融研究验证了每个组件的有效性,显示双层重建增强类间判别能力,而掩码增强转换减少类内变化。

英文摘要

Few-shot fine-grained image classification (FS-FGIC) is challenging as it requires distinguishing visually similar subclasses with extremely limited labeled examples. Existing methods suffer from critical limitations: metric-based methods lose spatial information and misalign local features, while reconstruction-based methods underuse hierarchical feature information and lack selective focus on discriminative key regions. We propose the Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN), integrating dual-layer feature reconstruction with mask-enhanced feature processing. HMDRN leverages complementary visual information from different network hierarchies via learnable weights, balancing high-level semantic representations with mid-level structural details. It incorporates a spatial binary mask-enhanced transformer module that selectively enhances discriminative regions while filtering background noise. On three fine-grained datasets, HMDRN consistently outperforms state-of-the-art methods with both Conv-4 and ResNet-12 backbones. Ablation studies validate each component's effectiveness, showing dual-layer reconstruction enhances inter-class discrimination while mask-enhanced transformation reduces intra-class variations.

2506.10145 2026-06-05 cs.CV 版本更新

RoCA: Robust Cross-Domain End-to-End Autonomous Driving

RoCA: 面向鲁棒跨域端到端自动驾驶的框架

Rajeev Yasarla, Shizhong Han, Hsin-Pai Cheng, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Yunxiao Shi, Risheek Garrepalli, Hong Cai, Fatih Porikli

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of California, San Diego(加州大学圣地亚哥分校) University of California, Los Angeles(加州大学洛杉矶分校) University of California, Davis(加州大学戴维斯分校)

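AI summary This paper proposes RoCA, a framework that models the joint probabilistic distribution over ego and surrounding-vehicle information in the end-to-end autonomous driving pipeline, improving cross-domain generalization and robustness without extra inference computation.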

Comments accepted for ICML 2026

详情
AI中文摘要

端到端(E2E)自动驾驶最近作为一种新范式出现,具有显著潜力。然而,很少有研究探讨了跨域部署的实际挑战(例如城市)。尽管一些工作将大型语言模型(LLMs)纳入其中以利用其开放世界知识,但LLMs无法保证跨域驾驶性能且在域适应过程中可能产生 prohibitive 重训练成本。本文提出RoCA,一种新颖的框架用于鲁棒跨域端到端自动驾驶。RoCA在E2E管道中对编码ego和周围车辆信息的token的联合概率分布进行建模。通过高斯过程(GP)实例化,RoCA学习一组具有相应轨迹的基底token,这些token跨越了多样化的驾驶场景。然后,给定任何驾驶场景,它能够概率性地推断未来轨迹。通过将RoCA与源域训练中的基础E2E模型结合,我们提升了基础模型的泛化能力,而无需额外的推理计算。此外,RoCA在新目标域上实现了鲁棒适应,显著优于直接微调。我们广泛评估了RoCA在各种跨域场景中,并展示其在领域泛化和适应性能方面表现强劲。

英文摘要

End-to-end (E2E) autonomous driving has recently emerged as a new paradigm, offering significant potential. However, few studies have looked into the practical challenge of deployment across domains (e.g., cities). Although several works have incorporated Large Language Models (LLMs) to leverage their open-world knowledge, LLMs do not guarantee cross-domain driving performance and may incur prohibitive retraining costs during domain adaptation. In this paper, we propose RoCA, a novel framework for robust cross-domain E2E autonomous driving. RoCA formulates the joint probabilistic distribution over the tokens that encode ego and surrounding vehicle information in the E2E pipeline. Instantiating with a Gaussian process (GP), RoCA learns a set of basis tokens with corresponding trajectories, which span diverse driving scenarios. Then, given any driving scene, it is able to probabilistically infer the future trajectory. By using RoCA together with a base E2E model in source-domain training, we improve the generalizability of the base model, without requiring extra inference computation. In addition, RoCA enables robust adaptation on new target domains, significantly outperforming direct finetuning. We extensively evaluate RoCA on various cross-domain scenarios and show that it achieves strong domain generalization and adaptation performance.

2408.11336 2026-06-05 cs.LG cs.CV 版本更新

FATE: Focal-modulated Attention Encoder for Multivariate Time-series Forecasting

FATE:用于多变量时间序列预测的焦点调节注意力编码器

Tajamul Ashraf, Janibul Bashir

发表机构 * GAASH Research Lab(GAASH研究实验室) Department of Information Technology(信息科技系) National Institute of Technology Srinagar(斯利那加国立理工学院)

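AI summary This paper proposes FATE, a new transformer architecture for reliable multivariate time-series forecasting. FATE introduces a tensorized focal modulation mechanism that explicitly captures spatiotemporal correlations in time series and offers interpretability through two modulation scores; benchmarks on seven diverse real-world datasets demonstrate its superior performance on long-horizon multivariate meteorological datasets.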

详情
AI中文摘要

气候变化是21世纪最紧迫的全球挑战之一,其后果包括海平面上升、冰川融化以及日益极端的天气模式。准确的预测对于监测这些现象和支持缓解策略至关重要。尽管最近的数据驱动模型,包括CNNs、RNNs和基于注意力的Transformer,在时间序列预测中显示出潜力,但它们在处理序列依赖性和有限并行性方面存在困难,尤其是在长视界、多变量气象数据集中。在本文中,我们提出了Focal Modulated Attention Encoder(FATE),一种新的Transformer架构,用于可靠的多变量时间序列预测。与传统模型不同,FATE引入了张量化的焦点调节机制,以显式捕捉时间序列数据中的时空相关性。我们进一步提出了两个调节分数,通过突出影响预测的关键环境特征来提供可解释性。我们在七个不同的现实世界数据集上基准测试FATE,包括ETTh1、ETTm2、Traffic、Weather5k、USA-Canada、Europe和LargeST数据集,并显示其在所有最先进的方法,包括温度数据集上都表现优异。我们的消融研究也表明,FATE能够很好地推广到更广泛的多变量时间序列预测任务中。

英文摘要

Climate change stands as one of the most pressing global challenges of the twenty-first century, with far-reaching consequences such as rising sea levels, melting glaciers, and increasingly extreme weather patterns. Accurate forecasting is critical for monitoring these phenomena and supporting mitigation strategies. While recent data-driven models for time-series forecasting, including CNNs, RNNs, and attention-based transformers, have shown promise, they often struggle with sequential dependencies and limited parallelization, especially in long-horizon, multivariate meteorological datasets. In this work, we present the Focal Modulated Attention Encoder (FATE), a novel transformer architecture designed for reliable multivariate time-series forecasting. Unlike conventional models, FATE introduces a tensorized focal modulation mechanism that explicitly captures spatiotemporal correlations in time-series data. We further propose two modulation scores that offer interpretability by highlighting critical environmental features influencing predictions. We benchmark FATE across seven diverse real-world datasets, including ETTh1, ETTm2, Traffic, Weather5k, USA-Canada, Europe, and LargeST, and show that it consistently outperforms all state-of-the-art methods, including on temperature datasets. Our ablation studies also demonstrate that FATE generalizes well to broader multivariate time-series forecasting tasks.

2506.10601 2026-06-05 cs.CV Version update

Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection

语义解耦的空间分区引导的点监督定向物体检测

Xinyuan Liu, Hang Xu, Zirui Chen, Yike Ma, Chenggang Yan, Feng Dai

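Affiliations * Institute of Computing Technology, Chinese Academy of Sciences; Hefei University of Technology

AI summary This paper proposes SSP, a training-efficient framework that combines rule-driven prior injection with data-driven label purification to address insufficient sample assignment and poor pseudo-label quality in single-point-supervised detection. Experiments show that SSP achieves notable mAP gains on DOTA-v1.0 and other datasets with low training time and memory usage.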

Comments Published in Pattern Recognition, 2026

Details
Journal ref
Pattern Recognition, Volume 180, Part B, Article 114079 (2026)
AI Chinese abstract

鉴于其降低标注成本的能力,基于单点标注的弱监督学习已成为定向物体检测的研究焦点。与经典的教师-学生范式相比,简单模型范式(如PointOBB-v2)可以进一步显著减少训练所需的资源,同时保证强大的性能。后者在低成本训练方面具有更大的潜力,但此类方法仍面临样本分配不足和伪标签质量差的挑战。在本文中,我们提出了一种训练高效的框架,称为SSP,该框架结合了规则驱动的先验注入和数据驱动的标签净化。具体而言,SSP引入了两种设计:(1)基于像素级空间分区的样本分配,通过对像素图进行空间分区,紧凑地估计物体尺度的上下界,并挖掘高质量的正样本和困难负样本;(2)基于语义空间分区的框提取,从由语义图调制的空间分区中推导实例,并将其转换为伪框以监督检测器。在DOTA-v1.0和其他数据集上的实验证明了SSP的优越性:与基线相比,SSP实现了+6.73%的mAP提升,同时仅需2小时的训练时间和6GB的GPU内存。此外,当SSP与更强的检测器结合时,mAP可以达到50.81%。代码可在 https://github.com/antxinyuan/ssp 获得。

English abstract

Given its ability to reduce annotation costs, weakly supervised learning based on single-point annotations has emerged as a research focus in oriented object detection. Compared with the classical teacher-student paradigm, the simple model paradigm (e.g., PointOBB-v2) can substantially reduce the resources required for training while ensuring strong performance. The latter exhibits greater potential for low-cost training, yet such methods still face the challenges of insufficient sample assignment and poor pseudo-label quality. In this paper, we propose a training-efficient framework named SSP, which synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two designs: (1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps; and (2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and converts them into pseudo-boxes for supervising detectors. Experiments on DOTA-v1.0 and other datasets demonstrate SSP's superiority: it achieves a +6.73% mAP improvement over the baseline while requiring only 2 h of training time and 6 GB of GPU memory. Furthermore, when SSP is integrated with a stronger detector, the mAP can reach 50.81%. The code is available at https://github.com/antxinyuan/ssp.

2503.23300 2026-06-05 cs.CV cs.RO Version update

Learning Predictive Visuomotor Coordination

学习预测性视觉-运动协调

Wenqi Jia, Bolin Lai, Miao Liu, Danfei Xu, James M. Rehg

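Affiliations * University of Illinois Urbana-Champaign; Georgia Tech; Meta AI

AI summary This paper introduces a forecasting-based task for modeling visuomotor coordination, predicting head pose, gaze direction, and upper-body motion from egocentric visual and kinematic observations, and shows the importance of multimodal integration for understanding visuomotor coordination.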

Comments CVPR 2026 Findings

Details
AI Chinese abstract

理解并预测人类视觉-运动协调对于机器人学、人机交互和辅助技术的应用至关重要。本文介绍了一种基于预测的视觉-运动协调建模任务,目标是从第一人称视觉和运动学观测中预测头部姿态、目光方向和上半身运动。我们提出了一种视觉-运动协调表示(VCR),学习这些多模态信号之间的结构时间依赖性。我们扩展了基于扩散的运动建模框架,整合了第一人称视觉和运动学序列,实现了时间一致且准确的视觉-运动预测。我们的方法在大规模EgoExo4D数据集上进行了评估,展示了在多样化现实活动中的强大泛化能力。我们的结果强调了多模态整合在理解视觉-运动协调中的重要性,为视觉-运动学习和人类行为建模的研究做出了贡献。项目页面:https://vjwq.github.io/VCR/.

English abstract

Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling. Project Page: https://vjwq.github.io/VCR/.

2503.14295 2026-06-05 cs.CV cs.AI Version update

PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation

PC-Talk: 用于音频驱动说话面部生成的精确面部动画控制

Baiqin Wang, Xiangyu Zhu, Fan Shen, Hao Xu, Zhen Lei

Affiliations * MAIS, Institute of Automation, Chinese Academy of Sciences; School of Artificial Intelligence, University of Chinese Academy of Sciences; Psyche AI.INC; HKUST; CAIR, HKISI, Chinese Academy of Sciences; SCSE, FIE, M.U.S.T

AI summary To address insufficient control over facial animation in audio-driven talking face generation, this paper proposes the PC-Talk framework, which improves lip-audio alignment and emotion control to increase the diversity and user-friendliness of generated videos.

Comments 10 pages, 6 figures. Accepted to CVPR 2026

Details
AI Chinese abstract

近年来,音频驱动说话面部生成在唇同步方面取得了显著进展。然而,当前方法往往缺乏对面部动画(如说话风格和情绪表达)的充分控制,导致输出结果单一。本文聚焦于改进两个关键因素:唇音对齐和情感控制,以增强说话视频的多样性和易用性。唇音对齐控制关注说话风格和唇部运动幅度等元素,而情感控制则专注于生成逼真的情绪表达,允许对强度等多属性进行修改。为实现精确的面部动画控制,我们提出了一种新的框架PC-Talk,通过隐式关键点变形实现唇音对齐和情感控制。首先,我们的唇音对齐控制模块实现了对说话风格的精确编辑,并调整唇部运动幅度以模拟不同语音音量水平,保持与音频的同步。其次,我们的情感控制模块生成生动的情绪面部特征,通过纯粹的情绪变形实现。该模块还允许对强度进行精细修改,并在不同面部区域组合多种情绪。我们的方法在广泛的实验中展示了出色的控制能力,并在HDTF和MEAD数据集上取得了最先进的性能。

English abstract

Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over facial animation such as speaking style and emotional expression, resulting in uniform outputs. In this paper, we focus on improving two key factors: lip-audio alignment and emotion control, to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control focuses on elements like speaking style and the scale of lip movements, whereas emotion control is centered on generating realistic emotional expressions, allowing for modifications in multiple attributes such as intensity. To achieve precise control of facial animation, we propose a novel framework, PC-Talk, which enables lip-audio alignment and emotion control through implicit keypoint deformations. First, our lip-audio alignment control module facilitates precise editing of speaking styles at the word level and adjusts lip movement scales to simulate varying vocal loudness levels, maintaining lip synchronization with the audio. Second, our emotion control module generates vivid emotional facial features with pure emotional deformation. This module also enables the fine modification of intensity and the combination of multiple emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on both HDTF and MEAD datasets in extensive experiments.

2502.06434 2026-06-05 cs.CV cs.LG Version update

Unifying Dataset Pruning and Distillation for Efficient Large-scale Compression

统一数据集剪枝与蒸馏以实现高效大规模压缩

Lingao Xiao, Songhua Liu, Yang He, Xinchao Wang

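Affiliations * University of Science and Technology of China

AI summary This paper proposes a unified dataset compression benchmark to examine the convergence of dataset pruning and dataset distillation, finds that soft-label distillation underperforms pruning at small dataset sizes, and proposes a hard-label dataset compression method whose PCA (Prune, Combine, and Augment) framework emphasizes image quality and storage efficiency.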

Comments Accepted by ICML 2026

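Details
AI Chinese abstract

数据集剪枝(DP)和数据集蒸馏(DD)在输出上有根本差异:DP选择原始图像子集,而DD生成合成图像。最近,DD对原始图像的依赖增加表明两种方法趋于融合。为研究这种融合趋势,我们提出统一的数据集压缩(DC)基准。该基准揭示了软标签-DD的有趣权衡:虽然软标签提供有价值信息,但它们可能使蒸馏过程变得不那么必要,因为蒸馏图像可能不总能优于随机子集。此外,基准表明在当前阶段,数据集剪枝在小数据集规模下优于数据集蒸馏。鉴于这些观察,我们探索硬标签-DC作为互补方法,强调图像质量的同时提供显著的存储效率。我们的PCA(Prune, Combine, and Augment)是首个不依赖软标签而是聚焦图像质量的框架。(1)“P”表示基于数据集剪枝指标选择简单样本,(2)“C”表示有效地组合这些样本,(3)“A”表示在训练过程中应用受约束的图像增强。我们的代码可在 https://github.com/ArmandXiao/Unifying-Dataset-Pruning-and-Distillation 获得。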

English abstract

Dataset pruning (DP) and dataset distillation (DD) fundamentally differ in their outputs: DP selects original image subsets, while DD generates synthetic images. Recently, DD's increasing reliance on original images suggests a convergence of the two directions. To investigate this convergence trend, we propose a unified dataset compression (DC) benchmark. This benchmark reveals an interesting trade-off for soft-label-DD: while soft labels provide valuable information, they can make the distillation process less essential, as distilled images may not always outperform random subsets. In addition, the benchmark reveals that, at the current stage, dataset pruning outperforms dataset distillation at small dataset sizes. Given these observations, we explore hard-label-DC as a complementary approach that emphasizes image quality while offering substantial storage efficiency. Our PCA (Prune, Combine, and Augment) is the first framework that does not rely on soft labels but instead focuses on image quality. (1) "P" means selecting easy samples based on dataset pruning metrics, (2) "C" indicates combining these samples effectively, and (3) "A" is to apply constrained image augmentation during training. Our code is available at https://github.com/ArmandXiao/Unifying-Dataset-Pruning-and-Distillation
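The "P" step above (selecting easy samples by a pruning metric) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `select_easy_samples`, the convention that a lower score means an easier sample, and the per-class balancing are all assumptions made for illustration only.

```python
import numpy as np

def select_easy_samples(scores, labels, per_class):
    """Return sorted indices of the `per_class` lowest-score samples per class.

    Assumption (hypothetical): a lower score means an easier sample.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    keep = []
    for cls in np.unique(labels):
        idx = np.flatnonzero(labels == cls)
        # Stable sort so tied scores keep their original order.
        order = idx[np.argsort(scores[idx], kind="stable")]
        keep.extend(order[:per_class].tolist())
    return sorted(keep)

# Toy data: six samples, two classes, made-up difficulty scores.
scores = [0.9, 0.1, 0.5, 0.3, 0.8, 0.2]
labels = [0, 0, 0, 1, 1, 1]
selected = select_easy_samples(scores, labels, per_class=2)
```

With the toy scores, the two lowest-scoring samples of each class are kept. The "C" and "A" steps are not sketched because the abstract does not specify how samples are combined or how augmentation is constrained.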

2502.02487 2026-06-05 cs.CV Version update

Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives

Hier-EgoPack:具有多样任务视角的层次化第一人称视频理解

Simone Alberto Peirone, Francesca Pistilli, Antonio Alliegro, Tatiana Tommasi, Giuseppe Averta

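Affiliations * Department of Control and Computer Engineering

AI summary This paper proposes Hier-EgoPack, which extends EgoPack to multi-granularity temporal reasoning by introducing a hierarchical architecture and GNN layers, effectively addressing video understanding across multiple downstream tasks.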

Comments Project webpage at https://sapeirone.github.io/hier-egopack

Details
AI Chinese abstract

我们对人类活动视频流的理解本质上是多方面的:在短短几秒钟内,我们能够把握正在发生的事情,识别场景中物体的相关性和互动,并预测即将发生的事情,所有这些都在一起发生。为了赋予自主系统这种整体感知,学习如何关联概念、在不同任务中抽象知识,并在学习新技能时利用任务协同是至关重要的。在这方面的一个重要进展是EgoPack,这是一个统一的框架,用于在多样化的任务中理解人类活动,具有最小的开销。EgoPack促进下游任务之间的信息共享和协作,这对于高效学习新技能至关重要。在本文中,我们介绍了Hier-EgoPack,它通过在不同时间粒度上进行推理来扩展EgoPack,从而将其适用范围扩展到更广泛的下游任务。为此,我们提出了一种新的层次化架构用于时间推理,配备了专门设计的GNN层,以有效应对多粒度推理的挑战。我们在多个Ego4D基准上评估了我们的方法,涉及片段级和帧级推理,展示了我们的层次化统一架构如何同时有效地解决这些多样化任务。

English abstract

Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, all at once. To endow autonomous systems with such holistic perception, learning how to correlate concepts, abstract knowledge across diverse tasks, and leverage task synergies when learning novel skills is essential. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by also enabling reasoning across diverse temporal granularities, expanding its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning effectively. We evaluate our approach on multiple Ego4D benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously.

2412.07583 2026-06-05 cs.CV cs.AI Version update

Mobile Video Diffusion

移动视频扩散

Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian

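Affiliations * Qualcomm AI Research

AI summary This paper proposes MobileVD, a mobile-optimized video diffusion model that substantially reduces memory and computational cost by lowering frame resolution, incorporating multi-scale temporal representations, and introducing two new pruning schemes, enabling efficient video generation on mobile devices.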

Details
AI Chinese abstract

视频扩散模型已实现了出色的真实感和可控性,但受限于高计算需求,难以在移动设备上应用。本文介绍了首个针对移动端优化的视频扩散模型。从Stable Video Diffusion (SVD) 的时空UNet出发,我们通过降低帧分辨率、引入多尺度时间表示,以及引入两种用于减少通道数和时间块数量的新剪枝方案,来降低内存和计算成本。此外,我们采用对抗微调将去噪过程减少到单步。我们的模型称为MobileVD,效率提高了523倍(1817.2 vs. 4.34 TFLOPs),质量略有下降(FVD 149 vs. 171),在Xiaomi-14 Pro上为14x512x256像素的视频片段生成潜变量仅需1.7秒。我们的结果可在 https://qualcomm-ai-research.github.io/mobile-video-diffusion/ 查看。

English abstract

Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemes to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, dubbed MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/

2308.10897 2026-06-05 cs.CV Version update

Can Language Models Learn to Listen?

语言模型能否学会倾听?

Evonne Ng, Sanjay Subramanian, Dan Klein, Angjoo Kanazawa, Trevor Darrell, Shiry Ginosar

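Affiliations * University of California, Berkeley

AI summary This paper proposes a framework for generating appropriate listener facial responses from the speaker's words, feeding quantized facial motion elements as additional language tokens into a transformer-based large language model to improve the quality of listener responses.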

Comments ICCV 2023; Project page: https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/

Details
AI Chinese abstract

我们提出了一种框架,用于在双人社交互动中根据说话人的词语生成适当的面部回应。给定一个包含说话人词语及其时间戳的输入转录,我们的方法自回归地预测听众的回应:一系列听众的面部动作,通过VQ-VAE进行量化。由于动作是语言的一部分,我们提出将量化后的原子动作元素作为额外的语言token输入到基于transformer的大型语言模型中。使用仅在文本上预训练的语言模型权重初始化transformer,可以显著提高听众回应的质量,优于从头开始训练transformer。我们通过定量指标和定性用户研究展示了生成的听众动作流畅且反映了语言语义。在我们的评估中,我们分析了模型利用口语文本的时间和语义方面的能力。项目页面:https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/

English abstract

We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words. Given an input transcription of the speaker's words with their timestamps, our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE. Since gesture is a language component, we propose treating the quantized atomic motion elements as additional language token inputs to a transformer-based large language model. Initializing our transformer with the weights of a language model pre-trained only on text results in significantly higher quality listener responses than training a transformer from scratch. We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study. In our evaluation, we analyze the model's ability to utilize temporal and semantic aspects of spoken text. Project page: https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/
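The idea of quantizing listener motion and treating the resulting codes as extra language tokens can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: it shows only the nearest-codebook lookup used in vector quantization, with a toy codebook, and the id-offset scheme in `to_token_ids` (appending motion codes after a text vocabulary of hypothetical size 50,000) is an assumption.

```python
import numpy as np

def quantize(features: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Map each feature vector (T, D) to the index of its nearest code (K, D)."""
    # Squared Euclidean distance between every feature and every codebook entry.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

def to_token_ids(code_indices: np.ndarray, text_vocab_size: int) -> np.ndarray:
    """Offset motion codes so their ids follow the text vocabulary (assumed scheme)."""
    return code_indices + text_vocab_size

# Toy codebook with K=3 entries of dimension D=2, and three per-frame features.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
motion = np.array([[0.1, -0.1], [0.9, 1.2], [-0.8, 1.9]])

codes = quantize(motion, codebook)
tokens = to_token_ids(codes, text_vocab_size=50_000)
```

In a trained VQ-VAE the features would come from a learned encoder and the codebook would be learned jointly; here both are hand-written so the lookup step can be read in isolation.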