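arXivDaily: a daily academic digest synced with the full arXiv feed, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.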

2606.06485 2026-06-05 cs.CV Updated version

PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang, Liujuan Cao

Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University

AI summary: Proposes the PAR3D framework, which uses part-aware 3D representation learning and hierarchical segmentation query generation to address the weakness of existing 3D-MLLMs in fine-grained part understanding, yielding substantial gains on part-level question answering and referring segmentation.

Comments Project page: https://atrovast.github.io/PAR3D/

Abstract

Recent advances in 3D multimodal large language models (3D-MLLMs) have enabled unified solutions for 3D scene understanding tasks, including visual question answering, captioning, and referring segmentation. However, existing 3D-MLLMs remain largely object-centric, limiting their ability to model fine-grained part structures that are essential for embodied interaction with 3D environments. In this work, we present PAR3D, a unified part-aware 3D-MLLM framework that enables models to understand, reason about, and ground both objects and their parts in 3D scenes. To enable training and evaluation of part-aware 3D scene understanding, we introduce ScenePart, a synthetic 3D scene dataset with part-level annotations and language instructions. We further develop Part-Aware 3D Representation Learning to enrich 3D visual representations with fine-grained part-level semantics, and propose Hierarchical Segmentation Query Generation to ground part targets via hierarchical object-part queries. Extensive experiments show that our method substantially improves part-level question answering and referring segmentation, while also achieving strong performance across object-level vision-language tasks.

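2606.06477 2026-06-05 cs.CV Updated version

Complexity-Balanced Diffusion Splitting

Noam Issachar, Dani Lischinski, Raanan Fattal

Affiliations: The Hebrew University of Jerusalem

AI summary: Proposes the Complexity-Balanced Splitting (CBS) framework, which partitions the diffusion timeline into segments of equal approximation burden and allocates more capacity to difficult regions, improving generation quality across several architectures and datasets without increasing per-step inference cost.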

Abstract

Standard continuous-time generative models rely on monolithic architectures that must navigate vastly different signal regimes, from isotropic noise to intricate data distributions. While scaling model capacity improves performance, deploying a massive network uniformly across the entire generative timeline is inherently inefficient. In this work, we propose Complexity-Balanced Splitting (CBS), a principled framework for temporal capacity allocation that distributes the generative workload across multiple specialized sub-networks. Grounded in function approximation theory and de Boor's equidistribution principle, CBS partitions the diffusion timeline into segments of equal approximation burden, allocating more representational capacity to regions where the generative dynamics are more difficult to model. To estimate this local complexity, we introduce two complementary and tractable monitor functions: a spatial measure based on the flow's Dirichlet energy, and a geometric measure based on the acceleration of the sampling trajectories. Using a lightweight auxiliary model to estimate these complexity profiles, our approach eliminates the need for heuristic temporal splits or computationally expensive search procedures. Extensive evaluation across multiple architectures (SiT, JiT, and UNet) and datasets demonstrates that CBS consistently improves synthesis quality without increasing per-step inference cost. In particular, CBS improves FID by ~35% on SiT-XL with CFG relative to naive temporal partitioning. Project page is available at https://noamissachar.github.io/CBS/.
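
The equidistribution idea behind the partition can be illustrated with a minimal sketch. It assumes a hypothetical monitor function sampled on a time grid and places breakpoints so that each segment carries the same share of the integrated monitor. The function name `equidistribute` and the toy monitor are illustrative assumptions; the monitors described in the abstract (Dirichlet energy, trajectory acceleration) are not reproduced here.

```python
from bisect import bisect_left


def equidistribute(times, monitor, num_segments):
    """Return num_segments + 1 breakpoints that split `times` into segments
    holding equal shares of the integral of `monitor` (trapezoid rule).

    Hypothetical helper for illustration; not the authors' implementation.
    """
    # Cumulative integral of the monitor along the time grid.
    cum = [0.0]
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        cum.append(cum[-1] + 0.5 * (monitor[i] + monitor[i - 1]) * dt)
    total = cum[-1]

    breaks = [times[0]]
    for j in range(1, num_segments):
        target = total * j / num_segments
        # First grid index whose cumulative value reaches the target.
        i = min(len(cum) - 1, max(1, bisect_left(cum, target)))
        lo, hi = cum[i - 1], cum[i]
        # Linear interpolation inside the grid cell (an approximation).
        frac = 0.0 if hi == lo else (target - lo) / (hi - lo)
        breaks.append(times[i - 1] + frac * (times[i] - times[i - 1]))
    breaks.append(times[-1])
    return breaks


# Toy monitor that is largest near t = 0, so early segments come out shorter.
grid = [i / 10 for i in range(11)]
toy_monitor = [10.0 - 9.0 * t for t in grid]
print(equidistribute(grid, toy_monitor, 2))
```

With a constant monitor the breakpoints are evenly spaced; a monitor that peaks in one region pulls breakpoints toward it, which is the sense in which more sub-networks end up covering the harder part of the timeline.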

2606.06476 2026-06-05 cs.CV Updated version

Thinking with Imagination: Agentic Visual Spatial Reasoning with World Simulators

Chenming Zhu, Jingli Lin, Yilin Long, Peizhou Cao, Tai Wang, Jiangmiao Pang, Xihui Liu

Affiliations: The University of Hong Kong; Shanghai AI Laboratory; Shanghai Jiao Tong University; Fudan University; Beihang University

AI summary: Proposes the Astra framework, in which a VLM policy trained with reinforcement learning interacts with a Bagel-based world simulator to generate imagined visual evidence during reasoning, addressing unobserved layouts, cross-view consistency, and reasoning from alternative viewpoints in spatial reasoning.

Comments Project page: https://zcmax.github.io/projects/Thinking-With-Imagination

Abstract

While Vision-Language Models (VLMs) have shown strong visual reasoning capabilities, their spatial reasoning abilities remain largely constrained to the observed images and text-oriented chain-of-thought. They often struggle to infer unobserved layouts, maintain cross-view consistency, and reason from alternative viewpoints when only limited egocentric observations are available. In this work, we study this problem as thinking with imagination, where a VLM actively acquires imagined visual evidence by interacting with a world simulator during reasoning. We propose Astra, an agentic spatial reasoning framework that empowers VLMs with action-conditioned visual imagination. Specifically, Astra couples Astra-VL, an RL-trained VLM policy, with Astra-WM, a Bagel-based world simulator that generates novel-view observations from context images and natural-language camera motions. To provide reliable imagined evidence, Astra-WM is trained with view consistency tuning to improve pose and content consistency across views. In the RL stage, we propose a world-simulator-in-the-loop two-phase RL curriculum to stabilize tool-use exploration and advance the model's ability to invoke the simulator only when imagined observations improve over direct answering. Experiments demonstrate that both the world simulator and the agentic policy are necessary: Astra-WM improves simulator-augmented Gemini-3-Flash on MMSI-Bench from 45.1 to 49.5, while Astra-VL improves the Qwen3-VL backbone from 29.8 to 38.8 on MMSI-Bench and from 36.8 to 42.7 on MindCube. These results show that imagined observations can provide useful spatial evidence, but effective world-model-augmented reasoning requires learning when, where, and how to imagine.

2606.06458 2026-06-05 cs.LG cs.AI cs.CV Updated version

In-Context Multiple Instance Learning

Alexander Möllers, Marvin Sextro, Julius Hense, Gabriel Dernbach, Klaus-Robert Müller

Affiliations: Berlin Institute for the Foundations of Learning and Data; Machine Learning Group, Technische Universität Berlin; Aignostics; Institute of Pathology, Charité – Universitätsmedizin Berlin; Max-Planck Institute for Informatics; Department of Artificial Intelligence, Korea University

AI summary: Proposes an in-context learner with a Perceiver-style architecture that is pretrained on synthetic data and solves new multiple instance learning tasks from a handful of labeled bags without gradient updates, achieving the best average performance across 12 benchmarks and outperforming supervised baselines that require task-specific training.

Abstract

Multiple Instance Learning (MIL) addresses problems where supervision is available at the level of bags of instances and has been successfully applied in fields ranging from computational pathology to satellite imagery. Nevertheless, existing algorithms struggle in the low-label regime that characterizes many real-world applications. Flexible models overfit and rigid ones fail to adapt to the task at hand. We show that pretraining an in-context learner with a Perceiver-style architecture on synthetic data yields a model that can solve new tasks from a handful of labeled bags. At inference time, classification happens in a single forward pass and requires no gradient updates. We propose and investigate different synthetic data generators for bag-structured data and find that they capture complementary inductive biases. A model pretrained on a mixture of these generators inherits their per-task strengths and achieves the best average performance across twelve MIL benchmarks, outperforming supervised baselines that require task-specific training.

2606.06390 2026-06-05 cs.CV cs.AI Updated version

HomeWorld: A Unified Floorplan-to-Furnished Framework for Generating Controllable, Densely Interactive Whole-Home Scenes

Wenbo Li, Xiaoliang Ju, Zipeng Qin, Rongyao Fang, Hongsheng Li

Affiliations: Ace Robotics; CUHK MMLab; Shenzhen Loop Area Institute

AI summary: Proposes a unified hierarchical framework that trains a large language model on a large-scale dataset of real floorplans to generate whole-home floorplans, uses image generation models and a VLM-based refiner to lay out furniture and small objects, and attaches physical attributes, textures, and lighting, enabling controllable, highly realistic whole-home scene generation.

Abstract

Indoor scene generation is crucial for robot simulation and modern interior design. However, complex layouts together with scarce 3D scene data make learning-based generation challenging. Existing methods often rely on hand-crafted rules or focus on isolated sub-tasks (e.g., floorplan synthesis or single-room furnishing), producing whole-home scenes that lack global coherence, realism, and simulation readiness. To mitigate these limitations, we propose a unified hierarchical framework that decomposes indoor scene synthesis into controllable stages. First, we curate a large-scale dataset of 300K real residential floorplans to train a large language model for whole-home floorplan generation. With detailed descriptions and a K-D tree-based representation, our method enables fine-grained, controllable whole-home floorplan generation. Building upon the generated whole-home floorplan, we leverage image generation models to draft furniture layouts from multi-level roaming viewpoints, and then generate the layouts of small manipulable objects on different supporting surfaces (e.g., cabinets, desks, and dining tables) for embodied AI simulation. During furniture and object layout generation, a VLM-based refiner iteratively corrects furniture and object placement, and a 3D generative model enables flexible replacement of individual assets. We further attach basic physical attributes and simple surface texture and lighting setups to complete the pipeline for embodied AI use. Experiments and user studies demonstrate that our pipeline produces indoor spaces with greater layout diversity and stronger 3D design appeal, outperforming prior methods on both quantitative and qualitative metrics. Finally, alongside our generation pipeline, we will release the floorplan dataset and 5K fully furnished scenes to the community. Project Page: https://kairos-homeworld.github.io/

2606.06379 2026-06-05 cs.CV cs.AI Updated version

EasyLens: A Training-Free Plug-and-Play Subtle-Lesion Representation Amplifier for Medical Vision-Language Models

Qiwei Zeng, Hao Wang, Jinghao Lin, Shuchang Ye, Yuezhe Yang, Yige Peng, Haoyuan Che, Jinman Kim, Lei Bi

Affiliations: Jilin University; School of Computer Science, The University of Sydney; ByteDance; Institute of Translational Medicine, Shanghai Jiao Tong University

AI summary: Proposes EasyLens, a training-free plug-and-play module that amplifies the representation of subtle lesions in medical vision-language models by building a pathology-anatomy prototype space, selecting lesion-relevant patches through counterfactual reasoning, and applying morphology-guided residual enhancement.

Abstract

Medical vision-language models (VLMs) have shown increasing potential for clinical image interpretation, including lesion detection and report generation. However, their practical utility remains limited by insufficient sensitivity to subtle lesions, whose visual evidence is often sparse, low-contrast, and embedded within complex anatomical context. As local visual tokens are aggregated, these weak lesion cues can become underrepresented in global image representations, making them difficult for medical VLMs to recognize. Existing efforts to improve lesion sensitivity mainly rely on medical-domain vision-encoder pre-training, clinical-term-guided alignment, or trainable pathological representation enhancement. Although effective, these approaches usually require additional training or model-specific adaptation and may overfit to particular disease morphologies, limiting their applicability to frozen medical VLMs. To address these limitations, we propose EasyLens, a training-free plug-and-play subtle-lesion representation amplifier for medical VLMs. EasyLens first constructs EasyBank, a pathology-anatomy prototype space that provides lesion-related prototypes and anatomy-aware normal references for comparing suspicious patches against both pathological and normal anatomical patterns. To avoid blindly amplifying normal tissues, EasyTag selects lesion-relevant patches through counterfactual prototype reasoning. To counteract the dilution of subtle lesion cues in global image representations, EasyAmplifier strengthens the selected lesion-relevant patch representations through morphology-guided residual enhancement, thereby increasing their contribution to the global image embedding. Experiments on multiple medical image datasets and frozen medical VLM backbones show that EasyLens improves subtle-lesion detection and outperforms existing encoder-enhancement baselines.

2606.06369 2026-06-05 cs.CV Updated version

Visual Commonsense Driven Knowledge Refinements for Scene Graph Generation

Maëlic Neau, Salim Baloch, Jakob Suchan, Zoe Falomir, Mehul Bhatt

Affiliations: Computing Science Department, Umeå University; School of Computer Science & Engineering, Constructor University; School of Science and Technology, Örebro University; CoDesign Lab EU

AI summary: Proposes a model-agnostic, semantics-guided knowledge refinement framework that mines commonsense constraints from training data and uses declarative commonsense reasoning to correct scene graph predictions at inference time, without hand-written rules or retraining, consistently improving strong baselines on three benchmarks.

2606.06363 2026-06-05 cs.CV Updated version

GMBFormer: An NDVI-Guided Global Memory Bank Transformer for Urban Green-Space Extraction from Ultra-High-Resolution Imagery

Hao Lei, Xi Cheng, Chenlu Shu, Zhiheng Chen, Zhengjie Duan, Haoyu Wang, Zhanfeng Shen

Affiliations: College of Geophysics, Chengdu University of Technology; National Engineering Research Center for Geomatics, Aerospace Information Research Institute, Chinese Academy of Sciences, and University of Chinese Academy of Sciences

AI summary: To address limited semantic reuse among visually similar vegetation patterns and the ambiguity of fusing NDVI with RGB features in urban green-space extraction from ultra-high-resolution imagery, proposes the GMBFormer framework, which decouples NDVI as a physics-informed gate and performs selective prototype retrieval from a global memory bank, improving segmentation accuracy in three evaluation settings.

Comments 34 pages, 5 figures

Abstract

Urban green-space extraction from ultra-high-resolution (UHR) imagery is commonly performed patch by patch, which limits semantic reuse among spatially separated but visually similar vegetation patterns. Directly injecting the Normalized Difference Vegetation Index (NDVI) into red-green-blue (RGB) backbones can also blur the roles of visual appearance learning and physical vegetation confidence. We propose GMBFormer, a SegFormer-based framework that replaces adjacency-driven feature propagation with selective, similarity-driven prototype retrieval. Only RGB channels enter the backbone and decoder, while NDVI is decoupled as a physics-informed gate that admits high-confidence vegetation descriptors into a compact global memory bank through momentum updates. During training and inference, the current patch queries stored prototypes through memory-mediated cross-attention, and the retrieved response is integrated with bounded overhead. Experiments use a self-constructed Chengdu UHR dataset with 7,700 labeled 512 x 512 patches and two reduced-label settings derived from the public International Society for Photogrammetry and Remote Sensing (ISPRS) Potsdam dataset. Under the same training and evaluation protocol, GMBFormer obtains mean intersection over union (mIoU)/mean Dice (mDice) scores of 89.25%/94.31%, 92.17%/95.92%, and 83.72%/90.86%, respectively, improving the controlled SegFormer-B4 baseline in each setting. Ablation studies indicate that decoupled NDVI admission, memory retrieval, capacity, and momentum jointly shape the final performance.
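
The gating idea can be sketched in a few lines. NDVI itself is the standard index (NIR - Red) / (NIR + Red); everything else below is an assumption made for illustration: the threshold of 0.4, the momentum of 0.9, the dot-product slot selection, and the function names are hypothetical and not taken from the paper.

```python
def ndvi(nir, red, eps=1e-8):
    """Standard NDVI: (NIR - Red) / (NIR + Red)."""
    return (nir - red) / (nir + red + eps)


def momentum_update(slot, descriptor, momentum=0.9):
    """Exponential moving average of a memory slot toward a new descriptor."""
    return [momentum * s + (1.0 - momentum) * d for s, d in zip(slot, descriptor)]


def admit(memory, descriptor, patch_ndvi, threshold=0.4, momentum=0.9):
    """Update the most similar slot only if the NDVI gate is open.

    `threshold`, `momentum`, and dot-product matching are illustrative
    assumptions, not values or design choices reported by the authors.
    """
    if patch_ndvi < threshold:
        return memory  # Gate closed: low vegetation confidence, memory untouched.
    # Pick the slot with the largest dot-product similarity to the descriptor.
    best = max(
        range(len(memory)),
        key=lambda i: sum(a * b for a, b in zip(memory[i], descriptor)),
    )
    updated = list(memory)
    updated[best] = momentum_update(memory[best], descriptor, momentum)
    return updated


bank = [[1.0, 0.0], [0.0, 1.0]]
print(admit(bank, [0.0, 2.0], ndvi(nir=0.8, red=0.2)))
```

Keeping NDVI out of the feature path and using it only to decide what enters the bank mirrors the decoupling described in the abstract: appearance is learned from RGB, while NDVI only controls admission.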

2606.06359 2026-06-05 cs.CV Updated version

Comparison of Deep Learning Frameworks For Rice Disease Mapping From UAV Multispectral Imaging

Yadav Raj Ghimire, Jagrati Talreja, Tewodros Syum Gebre, Timothy Agboada, Shikha V. Chandel, Leila Hashemi Beni

Affiliations: University of California, Los Angeles

AI summary: Segments the severity of rice bacterial leaf blight in UAV multispectral imagery using CNN and Transformer models, finding that lightweight CNN backbones are more reliable for operational monitoring and that vegetation indices bring small but consistent improvements.

Abstract

In this study, UAV multispectral imagery is used to segment the severity of bacterial leaf blight (BLB) in rice using convolutional neural networks (CNNs) and transformer-based models. The evaluated architectures include U-Net with a ResNet-101 encoder, U-Net++ with EfficientNet-B3 and EfficientNet-B7, DeepLabV3+, and SegFormer, all trained under a common pipeline with three input configurations (multispectral only, multispectral+NDVI, and multispectral+NDRE). Experiments are conducted using the publicly available BLB dataset, with performance reported using mean IoU (mIoU), mean F1 (mF1), mean accuracy (mAcc), precision, and recall. U-Net++ with EfficientNet-B3 achieved the highest performance, with an mIoU of 97.62%. SegFormer obtained lower segmentation accuracy but comparable inference speed. Overall, the results indicate that lightweight CNN backbones remain more reliable for operational BLB monitoring, while integration of vegetation indices provides small and consistent improvements. The study also highlights the value of standardised UAV datasets to compare disease mapping methods and encourages the use of CNN architectures for field implementation.
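
For readers unfamiliar with the headline metric, here is a minimal sketch of mean IoU over flattened label maps. It assumes classes absent from both prediction and ground truth are skipped and that scores are pooled over all pixels; the paper's exact averaging protocol is not stated in the abstract, so treat both choices as assumptions.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union across classes for flat label sequences."""
    ious = []
    for c in range(num_classes):
        # Pixels where both maps agree on class c.
        inter = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        # Pixels where either map says class c.
        union = sum(1 for t, p in zip(y_true, y_pred) if t == c or p == c)
        if union > 0:  # Skip classes that appear in neither map (assumption).
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Four pixels, two severity classes; one pixel of class 0 is mislabeled as 1.
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))
```

In the toy call, class 0 scores 1/2 and class 1 scores 2/3, so the mean is 7/12.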

2606.06338 2026-06-05 cs.CV Updated version

StoryVideoQA: Scaling Deep Video Understanding with a Large-Scale, Multi-Genre and Auto-Generated Dataset

Zhengqian Wu, Zhixian Liu, Aodong Chen, Jingyang Zhang, Ruizhe Li, Hanlin Ge, Zhongyuan Wang, Chunxia Xiao, Chao Liang

Affiliations: School of Computer Science, Wuhan University; National Engineering Research Center for Multimedia Software; Hubei Key Laboratory of Multimedia and Network Communication Engineering

AI summary: Introduces the StoryVideoQA dataset and the PlotTree method, using a multi-agent collaboration framework to automatically generate large-scale question-answer pairs for deep video understanding and a hierarchical plot structure to improve reasoning over complex storylines.

Comments Accepted by IJCV 2026

DOI: 10.1007/s11263-026-02898-w
Journal ref: International Journal of Computer Vision (2026)

Abstract

Video question answering (VideoQA) aims to answer questions about given videos. While existing approaches excel on factoid VideoQA, they struggle with deep video understanding (DVU), which requires the comprehension of complex storylines. This challenge arises from the inherent long-range video content, multi-faceted question types, and instance-level story elements, all of which constrain the scale and diversity of manually constructed DVU datasets. To address these issues, we previously introduced StoryMind to automatically construct DVU datasets with balanced fine-grained topics. Although it can generate high-quality question-answer pairs (QAs) for TV series, it suffers significant performance degradation when handling longer and more complex movies. In this paper, we further design StoryMindv2, an enhanced multi-agent collaboration framework that generates high-quality DVU datasets for both TV series and movies. By integrating a novel supervisor-guided generation mechanism and a refined multi-reviewer voting strategy, the framework is used to construct StoryVideoQA, the largest DVU dataset to date, featuring over 363K QAs on 393.2 hours of diverse story videos, including TV series (avg. 1,635 seconds) and movies (avg. 7,878 seconds). Comprehensive evaluations of 20 state-of-the-art VideoQA methods on this large-scale benchmark reveal that they cannot fully maintain long-range character associations or construct a coherent understanding of complex storylines. To bridge this gap, we propose PlotTree, a novel video understanding agent that reorganizes long-range video content into a hierarchical plot structure, enabling efficient storyline reasoning on StoryVideoQA. Project page: https://github.com/nercms-mmap/StoryVideoQA/

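2606.06329 2026-06-05 cs.LG cs.CG cs.CV stat.ML Updated version

Efficient Mean Curvature Computation on High-Dimensional Data Manifolds

Alexandre L. M. Levada

Affiliations: Federal University of São Carlos

AI summary: To address the prohibitive $O(m^4)$ per-point cost of the original method for computing local mean curvature on high-dimensional datasets, proposes a fast estimator based on an exact algebraic identity and a truncated SVD that lowers the cost to $O(k^2 m + k m p^2)$, achieving 50x to 300x speedups on real-world datasets with negligible loss of accuracy.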

Comments 31 pages, 2 figures and 5 tables

Abstract

Estimating local mean curvature at each point of a high-dimensional dataset is a key ingredient of geometry-aware machine learning algorithms, such as the Mean Curvature Boundary Points (MCBP) method. The naive implementation of this computation, based on a local shape operator approximated from k-nearest neighbor patches, involves an explicit construction of a matrix $H$ whose trace form yields an $O(m^4)$ cost per point, rendering the approach intractable for datasets with more than a few dozen features. This paper introduces two complementary contributions that together reduce this cost by several orders of magnitude. The first contribution is an exact algebraic identity. This identity, derived from the orthogonality of the eigenvectors of the covariance matrix and the cyclicity of the trace operator, eliminates $H$ entirely and reduces the per-point cost to $O(m^2)$ after the eigendecomposition. The second contribution addresses the remaining $O(m^3)$ bottleneck of the full eigendecomposition. Since the local covariance matrix has rank at most $k-1 \ll m$, we replace it with a truncated SVD of the $k \times m$ centered data matrix, an $O(k^2 m)$ operation, and derive an analytical approximation for the contribution of the null-space eigenvectors based on the expected value of their outer product under the Haar measure. The resulting estimator has total cost $O(k^2 m + k m p^2)$, where $p = k-1$. Experiments on real-world datasets confirm speedups of 50 to 300 times relative to the original implementation, with negligible loss when the fast estimator is used to replace the original version. By providing a scalable and data-driven estimate of local curvature, the proposed method establishes curvature as a practical geometric feature for a broad range of machine learning tasks, from classical to modern deep learning pipelines.
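
The rank bound and the trace cyclicity mentioned above can be checked on a toy patch. The sketch below is not the paper's estimator; it only assumes a hypothetical patch with k = 3 neighbors and m = 5 features and verifies two standard facts: the centered rows sum to zero (so the rank is at most k - 1), and the m x m covariance has the same trace as the k x k Gram matrix, since $X_c^T X_c$ and $X_c X_c^T$ share their nonzero eigenvalues.

```python
def center(rows):
    """Subtract the column means from a k x m patch given as a list of rows."""
    k, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / k for j in range(m)]
    return [[r[j] - means[j] for j in range(m)] for r in rows]


def covariance(xc):
    """m x m matrix Xc^T Xc / (k - 1)."""
    k, m = len(xc), len(xc[0])
    return [[sum(xc[i][a] * xc[i][b] for i in range(k)) / (k - 1)
             for b in range(m)] for a in range(m)]


def gram(xc):
    """k x k matrix Xc Xc^T / (k - 1); cheaper to handle when k << m."""
    k, m = len(xc), len(xc[0])
    return [[sum(xc[i][a] * xc[j][a] for a in range(m)) / (k - 1)
             for j in range(k)] for i in range(k)]


def trace(mat):
    return sum(mat[i][i] for i in range(len(mat)))


# Hypothetical patch: k = 3 neighbors, m = 5 features.
patch = [[1.0, 2.0, 3.0, 4.0, 5.0],
         [2.0, 1.0, 0.0, 1.0, 2.0],
         [0.0, 3.0, 3.0, 2.0, 7.0]]
xc = center(patch)
print(trace(covariance(xc)), trace(gram(xc)))
```

The two printed traces agree up to rounding, even though one matrix is 5 x 5 and the other 3 x 3; this size gap is what makes working from the k x m data matrix attractive when m is large.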

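2606.06309 2026-06-05 cs.CV Updated version

RhymeFlow: Training-Free Acceleration for Video Generation with Asynchronous Denoising Flow Scheduling

Chensheng Dai, Shengjun Zhang, Yifan Li, Zhang Zhang, Zheng Zhu, Yueqi Duan

Affiliations: Tsinghua University; GigaAI

AI summary: To address slow inference in DiT-based video generation models, proposes RhymeFlow, a training-free framework that identifies keyframes and densely denoises only those while non-keyframes progressively skip steps, and adds a latent trajectory projection module to preserve temporal consistency, achieving faster inference and better quality.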

Comments Project Page: https://simon-dcs.github.io/Website-of-RhymeFlow/, Code: https://github.com/Simon-Dcs/RhymeFlow

Abstract

Video generation models based on Diffusion Transformers (DiTs) have achieved remarkable performance in video synthesis, yet they suffer from high inference latency and computational costs due to the quadratic complexity of 3D attention. Existing acceleration methods primarily reduce computational complexity within each individual denoising step through techniques such as sparse attention and KV-caching. However, they rigidly adhere to the inherent constraint of the standard diffusion pipeline: every frame in the target video sequence must be subjected to a complete, dense denoising process across all diffusion timesteps. We observe that, because adjacent frames share corresponding content and motion, once keyframes with critical semantic transitions are anchored, the intermediate states of the other frames often follow more predictable trajectories, which indicates that such a uniform, dense denoising process is inherently redundant for natural video data. To this end, we introduce RhymeFlow, a training-free framework that decouples the denoising trajectories of different frames. Specifically, we first identify a sparse set of pivotal keyframes that dominate the latent semantic evolution. Then, only these keyframes undergo dense, step-by-step denoising to ensure structural integrity, while non-keyframes progressively skip denoising steps to minimize computational cost. Since skipped intermediate states of non-keyframes break the temporal coherence in keyframe denoising steps, leading to visual degradation, we further introduce a latent trajectory projection module, which enables keyframes to interact with a complete and temporally consistent sequence representation. Extensive experiments on current DiT-based video generation models demonstrate that our method outperforms existing baselines with higher inference speed and better visual quality.
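
To make the asynchronous schedule concrete, here is a deliberately simplified sketch. It assumes a fixed skipping rule (non-keyframes are denoised only every `stride`-th step), which is a stand-in for whatever progressive rule the method actually uses; keyframe selection and the latent trajectory projection module are not modeled. All names are hypothetical.

```python
def build_schedule(num_frames, keyframes, num_steps, stride=2):
    """Per-step lists of frame indices that are denoised at that step.

    Keyframes are active at every step; other frames only when
    step % stride == 0 (a fixed-stride simplification, assumed here).
    """
    key = set(keyframes)
    schedule = []
    for step in range(num_steps):
        active = [f for f in range(num_frames)
                  if f in key or step % stride == 0]
        schedule.append(active)
    return schedule


def active_fraction(schedule, num_frames):
    """Share of frame-steps actually computed, relative to dense denoising."""
    total = len(schedule) * num_frames
    return sum(len(active) for active in schedule) / total


sched = build_schedule(num_frames=4, keyframes=[0, 3], num_steps=4, stride=2)
print(sched, active_fraction(sched, 4))
```

In this toy setting the two keyframes run all four steps while the two middle frames run only two, so 12 of the 16 frame-steps are computed.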

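2606.06294 2026-06-05 cs.CV cs.AI Updated version

Towards One-to-Many Temporal Grounding

Qi Xu, Yue Tan, Shihao Chen, Jiahao Meng, Anna Wang, Shunping Ji, Hao Fei, Jason Li

Affiliations: University of Science and Technology of China

AI summary: For the One-to-Many Temporal Grounding (OMTG) task, proposes a systematic solution comprising a benchmark, a dataset, and reward functions, substantially improving multi-segment video grounding performance.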

Comments Accepted to ICML'26

Abstract

Temporal Grounding (TG) aims to localize video segments corresponding to a textual query. Prior research predominantly focuses on single-segment retrieval. Real-world scenarios, however, often require localizing multiple disjoint segments for a single query, a setting we term One-to-Many Temporal Grounding (OMTG). Previous state-of-the-art MLLMs, optimized for one-to-one settings, struggle in this context, often yielding near-zero scores due to a lack of event cardinality perception. To bridge this gap, we present a systematic solution with three key contributions. First, we establish the first comprehensive OMTG benchmark, introducing Count Accuracy (C-Acc) and Effective Temporal F1 (EtF1) as evaluation metrics. Second, we curate a high-quality OMTG dataset comprising 56k samples through a sophisticated construction pipeline. Third, we develop novel temporal and caption reward functions specifically designed for OMTG. In particular, the caption reward leverages Chain-of-Thought reasoning over dense video captions to explicitly guide policy optimization toward both precision and completeness. Extensive experiments show that our model achieves a new state-of-the-art EtF1 of 43.65% on OMTG Bench, outperforming Gemini 2.5 Pro and Seed-1.8 by 15.85% and 15.61%, respectively.

2606.06292 2026-06-05 cs.CV cs.RO 版本更新

Synthetic Data Generation and Vision-based Wrinkle and Keypoint Detection for Bimanual Cloth Manipulation

合成数据生成与基于视觉的褶皱和关键点检测用于双手布料操作

Ariel Herrera, Xueyang Kang, Atal Anil Kumar

发表机构 * Department of Engineering, University of Luxembourg（卢森堡大学工程系）； School of Electrical and Electronic Engineering, Nanyang Technological University（南洋理工大学电子与电气工程学院）； Université de Lorraine, Arts et Metiers Institute of Technology, LCFC（洛林大学，艺术与工艺技术学院，LCFC）

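AI summary: To tackle visual perception in cloth manipulation, the paper proposes a Blender-based synthetic data generation pipeline and a perception framework combining a CNN with a YOLOv8-OpenCV pipeline, enabling wrinkle-based grasping and keypoint-based ironing; the keypoint model reaches a mean position error of 1.7615 pixels.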

AI中文摘要

纺织品的机器人操作仍然具有挑战性，因为连续变形和自遮挡阻碍了估计布料状态所需的鲁棒视觉感知。为了解决缺乏标注真实世界数据的问题，我们开发了一个基于Blender的合成管道，导出自动标注的关键点，并将人工标注的渲染图与真实世界数据结合训练褶皱检测器。我们提出了一个感知框架，集成了用于置换不变关键点检测的CNN和用于从结构褶皱中提取抓取点的YOLOv8-OpenCV管道。一个提出的双手算法利用该系统通过褶皱拉伸完全折叠的服装，一旦角落出现就过渡到基于关键点的熨烫。关键点模型实现了1.7615像素的平均位置误差（MPE）。感知系统无需微调即可迁移到物理织物上，优于在高遮挡状态下失败或在严重褶皱上产生误报的基线方法。

英文摘要

Robotic manipulation of textiles remains challenging because continuous deformation and self-occlusions hinder the robust visual perception required to estimate the cloth's state. To address the lack of annotated real-world data, we developed a Blender-based synthetic pipeline exporting auto-annotated keypoints, and combined manually labeled renders with real-world data to train a wrinkle detector. We present a perception framework integrating a CNN for permutation-invariant keypoint detection and a YOLOv8-OpenCV pipeline to extract grasping points from structural wrinkles. A proposed bimanual algorithm uses this system to stretch fully folded garments via wrinkles, transitioning to keypoint-based ironing once corners emerge. The keypoint model achieves a Mean Position Error (MPE) of 1.7615 pixels. The perception system transfers to physical fabrics without fine-tuning, outperforming baselines that fail in high-occlusion states or yield false positives on severe folds.

2606.06278 2026-06-05 cs.CV 版本更新

Geodesic Flow Matching on a Riemannian Degradation Manifold for Blind Image Restoration

黎曼退化流形上的测地流匹配用于盲图像恢复

Akshay Janardan Bankar, Ankita Chatterjee, Sayan Banerjee, Shreyas Pandith, Kalakonda Sai Shashank, Amit Satish Unde

发表机构 * Samsung Research Institute（三星研究院）

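AI summary: The paper explicitly models degradations on a low-dimensional Riemannian manifold and learns the intrinsic transport dynamics through a geodesic flow matching objective on the joint image-manifold space, enabling blind image restoration.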

Comments Submitted to ECCV 2026

2606.06255 2026-06-05 cs.RO cs.CV cs.DC 版本更新

RadiusFPS: Efficient Farthest Point Sampling on CPUs and GPUs via Spherical Voxel Pruning

RadiusFPS：通过球形体素剪枝在CPU和GPU上实现高效最远点采样

Ziyang Yu, Xiang Li, Qiong Chang, Jun Miyazaki

发表机构 * School of Computing（计算学院）； Institute of Science（科学研究院）； Tokyo（东京）

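AI summary: The paper proposes the RadiusFPS framework, which accelerates farthest point sampling (FPS) with spherical voxel pruning. It keeps the standard update rule while cutting redundant computation through a conservative geometric bound and a coordinate-wise point-skip test, and implements fused kernels on the GPU, yielding significant speedups and lower memory use.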

Comments 28 pages, 15 figures

AI中文摘要

点云是机器人感知的主要感官表示，支撑着基于激光雷达的自动驾驶、同时定位与地图构建（SLAM）和导航。在这些流程中，最远点采样（FPS）是最著名的下采样算子，其均匀覆盖保留了下游感知所依赖的几何结构。然而，经典FPS的大时间复杂度与现代3D传感器每秒百万点的速率难以匹配，使其成为与机器人系统的实时性和有限机载计算预算相冲突的主要延迟瓶颈。因此，我们提出RadiusFPS，一种基于球形体素剪枝的FPS加速框架，在相同初始化和打破平局策略下保留标准FPS更新规则。通过用球形体素索引点云，RadiusFPS推导出保守的几何边界，在每次迭代中剪枝冗余距离计算，并辅以坐标点跳过测试去除残余更新。我们进一步引入RadiusFPS-G，一种线程束级别的GPU实现，将体素选择、剪枝和距离更新融合到内存合并的核中，消除了昂贵的全局内存往返。在室内（S3DIS、ScanNet）和室外LiDAR（SemanticKITTI）基准测试中，RadiusFPS-G相比基于GPU的FPS实现了高达2.5倍的加速，在评估方法中与QuickFPS相当或更优，同时使用大约一半的GPU内存，并具有可比较的分割精度。当与基于学习的FastPoint采样器结合时，生成的流程在所有评估配置中实现了最快的端到端推理。这些特性使得高质量的FPS风格采样对于延迟和内存受限的机器人视觉变得实用。

英文摘要

Point clouds are a primary sensory representation for robotic perception, underpinning LiDAR-based autonomous driving, simultaneous localization and mapping (SLAM), and navigation. Within these pipelines, Farthest Point Sampling (FPS) is the most well-known downsampling operator, as its uniform coverage preserves the geometric structure on which downstream perception relies. However, the high time complexity of classical FPS scales poorly with the million-point-per-second rates of modern 3D sensors, making it a dominant latency bottleneck that conflicts with the real-time and limited onboard compute budgets of robotic systems. Therefore, we propose RadiusFPS, an FPS acceleration framework based on spherical voxel pruning that preserves the standard FPS update rule under the same initialization and tie-breaking policy. By indexing the point cloud with spherical voxels, RadiusFPS derives a conservative geometric bound that prunes redundant distance computations in each iteration, complemented by a coordinate-wise point-skip test that removes residual updates. We further introduce RadiusFPS-G, a warp-level GPU implementation that fuses voxel selection, pruning, and distance update into memory-coalesced kernels, eliminating costly global-memory round-trips. On indoor (S3DIS, ScanNet) and outdoor LiDAR (SemanticKITTI) benchmarks, RadiusFPS-G attains up to 2.5x speedup over GPU-based FPS and matches or exceeds QuickFPS among the evaluated methods while using roughly half its GPU memory, with comparable segmentation accuracy. When coupled with the learning-based FastPoint sampler, the resulting pipeline achieves the fastest end-to-end inference among all evaluated configurations. These properties make high-quality FPS-style sampling practical for latency- and memory-constrained robotic vision.
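
For reference, classical FPS, the baseline that RadiusFPS accelerates, can be sketched in a few lines of NumPy. This is the textbook O(NK) algorithm, not the RadiusFPS method: it has no voxel index and no pruning. The function name and the fixed start index are assumptions; `np.argmax` returns the first maximum, which gives one deterministic tie-breaking policy.

```python
import numpy as np


def farthest_point_sampling(points: np.ndarray, k: int, start: int = 0) -> np.ndarray:
    """Classical FPS: return indices of k points, assuming 1 <= k <= len(points)."""
    n = points.shape[0]
    selected = np.empty(k, dtype=np.int64)
    # Squared distance from every point to its nearest already-selected point.
    min_dist = np.full(n, np.inf)
    current = start
    for i in range(k):
        selected[i] = current
        # Standard update rule: refresh nearest-sample distances with the new sample.
        dist = np.sum((points - points[current]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, dist)
        # The next sample is the point farthest from the current sample set.
        current = int(np.argmax(min_dist))
    return selected
```

Every iteration touches all N points, which is the per-iteration cost that a pruning scheme aims to reduce while leaving the selected indices unchanged.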

2606.06249 2026-06-05 cs.CV cs.LG 版本更新

DisasterBench: 复杂环境中基于无人机灾害响应的多模态基准

Tan Zhang, Quanyou Li, Lu Zhang, Jun Liu, Xiaofeng Zhu, Ping Hu

发表机构 * University of Electronic Science and Technology of China（电子科技大学）

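AI summary: The paper proposes DisasterBench, a multimodal benchmark covering 14 disaster scene types and 9 response tasks, and designs DisasterVL, a lightweight model optimized in three stages for efficient inference on edge devices.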

AI中文摘要

当灾难发生时，响应者不仅需要回答正在发生什么，还需要回答为什么发生、接下来会发生什么以及现在该做什么，而这些通常来自嘈杂的低空无人机视角，并在现场计算资源紧张的情况下进行。然而，现有的大多数多模态基准侧重于感知（例如识别/描述），覆盖的灾害类型有限，并且对实际应急响应所需的多阶段推理支持不足。我们引入了DisasterBench，一个用于复杂环境中基于无人机灾害响应的多阶段多模态推理基准。DisasterBench涵盖14种灾害相关场景类型和9个响应关键任务，覆盖灾前、灾中和灾后阶段，具有细粒度的灾害-任务映射，明确测试因果归因、传播预测、损害分析和决策导向推理。为了在边缘设备上实现推理，我们进一步提出了DisasterVL，一个轻量级多模态模型，通过三阶段流水线进行优化，结合领域指令微调、思维链引导的多模态对齐以及基于强化学习的策略优化。在21个流行的MLLM上的实验表明，我们的2B参数DisasterVL优于所有评估的开源模型，并显著缩小了与最先进闭源模型的差距，实现了与GPT-4o相当的推理准确性和更高的效率。项目页面：https://github.com/TanmouTT/DisasterBench。

英文摘要

When a disaster unfolds, responders must answer not only what is happening, but also why it is happening, what will happen next, and what to do now, often from noisy low-altitude UAV views and under tight on-site compute constraints. However, most existing multimodal benchmarks emphasize perception (e.g., recognition/description), cover limited disaster types, and provide insufficient support for the multi-stage reasoning required in practical emergency response. We introduce DisasterBench, a multi-stage multimodal reasoning benchmark for UAV-based disaster response in complex environments. DisasterBench spans 14 disaster-related scene types and 9 response-critical tasks across pre-, during-, and post-disaster stages, with fine-grained disaster-task mappings that explicitly test causal attribution, propagation prediction, damage analysis, and decision-oriented reasoning. To enable reasoning on the edge, we further propose DisasterVL, a lightweight multimodal model optimized with a three-stage pipeline combining domain instruction tuning, chain-of-thought-guided multimodal alignment, and reinforcement learning-based policy optimization. Experiments across 21 popular MLLMs show that our 2B-parameter DisasterVL outperforms all evaluated open-source models and substantially narrows the gap to state-of-the-art closed-source models, achieving GPT-4o-comparable reasoning accuracy with superior efficiency. The project page is available at https://github.com/TanmouTT/DisasterBench.

2606.06199 2026-06-05 cs.CV cs.GR 版本更新

SC-MFJ: A Simple Haptic Quality Metric for Medical Image Segmentation

SC-MFJ: 一种用于医学图像分割的简单触觉质量度量

Souraj Adhikary, Negar Chabi, Andre Mastmeyer

发表机构 * Jade University of Applied Sciences（亚德应用科学大学）

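AI summary: Motivated by the surface quality that haptic rendering in surgical simulation demands of segmentations, the paper proposes the SC-MFJ metric, which measures contact-force jerk along virtual stylus walks and reveals haptic quality differences that geometric metrics cannot detect.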

Comments 11 pages, 5 figures, 5 tables, http://www.wscg.eu/

AI中文摘要

标准分割度量如Dice和Hausdorff距离测量几何重叠，但无法判断分割表面是否适合手术模拟中的触觉渲染。我们提出SC-MFJ（表面约束平均力抖动），一种简单、廉价的度量，通过多次短虚拟触笔行走采样分割器官表面，并测量由此产生的接触力抖动程度。该度量从现有分割输出计算，每个病例约需一分钟CPU时间。我们在五折交叉验证中对80个病例评估了三种胰腺CT分割方法——原始二值nnU-Net输出、高斯平滑输出和学习的符号距离函数（SDF）回归。SC-MFJ显示，原始二值基线与简单高斯后处理之间的触觉质量差距达147倍，而Dice和HD95完全无法察觉这一差异。它还表明，尽管需要完整的模型重新训练，学习的SDF回归产生的触觉质量比高斯平滑更不稳定，病例级标准差为168 N/s²，而高斯平滑为22 N/s²。在LiTS肝脏数据集（131个病例）上的第二次评估证实了这些发现的普遍性：二值到高斯的差距扩大到189倍，且高斯平滑在所有折中始终产生一致的低力抖动。我们的结果表明，对于触觉模拟应用，一行后处理步骤可能就足够了，而像SC-MFJ这样廉价的度量可以标记出几何度量遗漏的问题。

英文摘要

Standard segmentation metrics such as Dice and Hausdorff distance measure geometric overlap but say nothing about whether a segmented surface is suitable for haptic rendering in surgical simulation. We propose SC-MFJ (Surface-Constrained Mean Force Jerk), a simple, inexpensive metric that samples a segmented organ surface with many short virtual stylus walks and measures how jerky the resulting contact forces are. The metric is computed from existing segmentation outputs and uses roughly one minute of CPU time per case. We evaluate three pancreas CT segmentation approaches across 80 cases in five-fold cross-validation: binary nnU-Net output, Gaussian-smoothed output, and learned signed distance function (SDF) regression. SC-MFJ reveals a 147x gap in haptic quality between the raw binary baseline and simple Gaussian post-processing, a difference entirely invisible to Dice and HD95. It also shows that learned SDF regression, despite requiring full model retraining, produces more variable haptic quality than Gaussian smoothing, with a case-level standard deviation of 168 N/s² compared with 22 N/s² for Gaussian. A second evaluation on the LiTS liver dataset (131 cases) confirms the generality of these findings: the binary-to-Gaussian gap widens to 189x, and Gaussian smoothing again produces consistently low force jerk across all folds. Our results suggest that for haptic simulation applications, a one-line post-processing step may be sufficient, and that a cheap metric like SC-MFJ can flag problems that geometric metrics miss.
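
The abstract does not give the formula for SC-MFJ, but the reported unit (N/s²) is consistent with a second time derivative of force. Under that reading, a minimal sketch of a mean force jerk over one stylus walk is a second finite difference of the contact-force magnitude. The function name, the uniform time step, and the use of the mean absolute value are assumptions, and the surface sampling that produces the force trace is not shown.

```python
import numpy as np


def mean_force_jerk(force: np.ndarray, dt: float) -> float:
    """Mean absolute second time derivative of a 1-D force trace, in N/s^2.

    Assumed definition for illustration; the paper's exact formula may differ.
    """
    if force.size < 3:
        raise ValueError("need at least three force samples")
    # Second finite difference approximates d^2 F / dt^2 on a uniform grid.
    second_diff = np.diff(force, n=2) / dt**2
    return float(np.mean(np.abs(second_diff)))
```

A smoothly varying force trace scores near zero under this sketch, while a trace with abrupt steps, as a voxelised binary surface might produce, scores high.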

2606.06194 2026-06-05 cs.RO cs.CV 版本更新

ActiveMimic: Egocentric Video Pretraining with Active Perception

ActiveMimic: 基于主动感知的自我中心视频预训练

Xingyao Lin, Guojin Zhong, Tianyi Lu, Ziyi Ye, Yichen Zhu, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Fudan University（复旦大学）； Shanghai Innovation Institute（上海创新研究院）； Current Robotics ； NeoteAI

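AI summary: The paper proposes the ActiveMimic framework, which recovers synchronized camera and wrist trajectories from egocentric human video, models camera motion as a viewpoint action, and jointly learns active perception and manipulation skills, so that the pretrained model performs on par with models pretrained on robot data on robot tasks.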

Comments Project Page: https://activemimic.github.io/

AI中文摘要

自我中心人类视频为机器人数据预训练提供了一种可扩展的替代方案，但在此类视频上预训练的模型始终不如在机器人数据上预训练的模型。我们将这一差距归因于缺失的信号，即自我中心视频中的主动感知行为，其中人类在操作过程中不断重新定位视角，导致标准流程视为噪声的相机运动。为解决这一问题，我们提出了ActiveMimic，一个预训练框架，从单个身体佩戴的RGB相机中恢复同步的相机和手腕轨迹，将相机运动建模为视角动作，并在适应目标机器人之前，从野外自我中心人类视频中联合学习主动感知和操作。实验表明，在具有不同主动感知需求的任务中，ActiveMimic始终优于在人类视频上预训练的基线，并与在机器人数据上预训练的最先进模型相匹配。进一步分析提供了证据，表明主动感知能力源自自我中心人类视频预训练而非机器人特定微调，确认了主动感知是解锁自我中心人类视频用于机器人预训练的关键。

英文摘要

Egocentric human video offers a scalable alternative to robot data for pretraining, yet models pretrained on such video consistently underperform those pretrained on robot data. We attribute this gap to a missing signal, the active perception behavior in egocentric videos, where humans continuously reposition their viewpoint during manipulation, inducing camera motion that standard pipelines treat as noise. To address this, we present ActiveMimic, a pretraining framework that recovers synchronized camera and wrist trajectories from a single body-worn RGB camera, models camera motion as a viewpoint action, and jointly learns active perception and manipulation from in-the-wild egocentric human video before adapting to a target robot. Empirically, real-world experiments across tasks with diverse active perception demands show that ActiveMimic consistently surpasses baselines pretrained on human video and matches state-of-the-art models pretrained on robot data. Further analysis provides evidence that active perception capability originates from egocentric human video pretraining rather than robot-specific fine-tuning, confirming active perception as the key to unlocking egocentric human video for robot pretraining.

2606.06186 2026-06-05 cs.CV 版本更新

Adversarial Attacks Already Tell the Answer: Directional Bias-Guided Test-time Defense for Vision-Language Models

对抗攻击已揭示答案：面向视觉语言模型的定向偏差引导测试时防御

Liangsheng Liu, Si Chen, Jiamin Wu, Weiwei Feng, Zhixin Cheng, Xiaotian Yin, Wenfei Yang, Tianzhu Zhang

发表机构 * University of Science and Technology of China（中国科学技术大学）； National Key Laboratory of Deep Space Exploration, Deep Space Exploration Laboratory（国家深空探测重点实验室，深空探测实验室）； The Chinese University of Hong Kong（香港中文大学）； Zhejiang University（浙江大学）； Ant Group（蚂蚁集团）

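AI summary: The paper proposes Directional Bias-guided Defense (DBD), which exploits the observation that adversarial examples shift along a dominant direction in CLIP's feature space. It estimates the Defense Direction and recovers robust representations with a DB-score-based two-stream reconstruction strategy, achieving state-of-the-art adversarial robustness on 15 datasets while preserving clean accuracy.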

Comments Accepted by ICLR2026

AI中文摘要

视觉语言模型（VLM），如CLIP，展现出强大的零样本泛化能力，但仍高度易受对抗扰动影响，在现实应用中构成严重风险。针对VLM的测试时防御最近成为一种有前景且高效的方法，无需昂贵的大规模重训练即可防御对抗攻击。在这项工作中，我们发现了一个令人惊讶的现象：在多种输入变换下，CLIP特征空间中的对抗图像始终沿主导方向偏移，而干净图像则呈现分散模式。我们假设这种主导偏移（称为防御方向）与对抗偏移相反，将特征指向正确的类别中心。基于这一见解，我们提出了定向偏差引导防御（DBD），一种测试时框架，用于估计防御方向，并采用基于DB分数的双流重建策略恢复鲁棒表示。在15个数据集上的实验表明，DBD不仅实现了最先进的对抗鲁棒性，同时保持了干净准确率，还揭示了对抗准确率甚至可能超过干净准确率的反直觉结果。这表明对抗扰动内在地编码了关于真实决策边界的定向先验信息。

英文摘要

Vision-Language Models (VLMs), such as CLIP, have shown strong zero-shot generalization but remain highly vulnerable to adversarial perturbations, posing serious risks in real-world applications. Test-time defenses for VLMs have recently emerged as a promising and efficient approach to defend against adversarial attacks without requiring costly large-scale retraining. In this work, we uncover a surprising phenomenon: under diverse input transformations, adversarial images in CLIP's feature space consistently shift along a dominant direction, in contrast to the dispersed patterns of clean images. We hypothesize that this dominant shift, termed the Defense Direction, opposes the adversarial shift, pointing features back toward their correct class centers. Building on this insight, we propose Directional Bias-guided Defense (DBD), a test-time framework that estimates the Defense Direction and employs a DB-score-based two-stream reconstruction strategy to recover robust representations. Experiments on 15 datasets demonstrate that DBD not only achieves SOTA adversarial robustness while preserving clean accuracy, but also reveals the counterintuitive result that adversarial accuracy can even surpass clean accuracy. This demonstrates that adversarial perturbations inherently encode directional priors about the true decision boundary.

2606.06158 2026-06-05 cs.CV 版本更新

Adaptive Tokenisation Via Temporal Redundancy Masking And Latent Inpainting

通过时间冗余掩码和潜在修复的自适应分词

Kevin Dave, Sai Aditya Patkuri, Chhaya Kumar Das, Gouranga Bala, R. Venkatesh Babu, Rajeshkumar SA

发表机构 * Phronetic AI ； IISc Bangalore ； IIT Bombay

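AI summary: The paper proposes a parameter-free adaptive video tokenisation mechanism that exploits temporal redundancy in the latent space of a frozen continuous tokeniser, drops redundant positions by thresholding, and reconstructs them with a lightweight Latent Inpainting Transformer, giving content-driven token allocation and efficient inference.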

AI中文摘要

自适应视频分词旨在根据序列的底层视觉复杂度动态分配令牌预算。当前的连续方法通过迭代二值化搜索或训练神经回归器实现，而离散方法通常需要全速率解码器来估计信息内容。我们证明这些计算开销并非必要。我们表明，冻结的连续视频分词器的潜在空间固有地编码了可直接利用的时间冗余：潜在表示在连续帧之间变化最小的空间位置携带接近零的额外信息。我们引入了一种无参数的自适应令牌分配机制，该机制对每个位置的时间L1差异应用固定阈值，识别并丢弃冗余的潜在位置。因此，压缩率自然地从输入内容中产生，而不是自上而下地强制执行：静态场景被积极压缩，而高度动态的序列保留更多令牌。为了重建丢弃的位置，我们提出了潜在修复变压器（LIT），一种轻量级的分解时空注意力架构。得到的推理流水线非常高效，仅需一次编码器前向传播和一次LIT前向传播，消除了辅助路由网络的需求。在TokenBench和DAVIS（近期分词器使用的标准基准）上的评估表明，我们的框架产生了有意义的、内容驱动的令牌分配，同时保持了有竞争力的重建保真度，并且相比连续自适应基线（ElasticTok-CV）实现了31倍的推理加速，相比离散信息论基线（InfoTok）实现了约2倍的加速。

英文摘要

Adaptive video tokenisation seeks to dynamically allocate token budgets based on the underlying visual complexity of a sequence. Current continuous-regime approaches achieve this via iterative binarised searches or trained neural regressors, while discrete methods often require a full-rate decoder pass to estimate information content. We demonstrate that such computational overheads are not strictly necessary. We show that the latent space of a frozen continuous video tokeniser inherently encodes temporal redundancy that can be exploited directly: spatial positions whose latent representations change minimally between consecutive frames carry near-zero additional information. We introduce a parameter-free adaptive token allocation mechanism that applies a fixed threshold to per-position temporal-L1 differences, identifying and dropping redundant latent positions. Consequently, the compression rate emerges naturally from the input content rather than being enforced top-down: static scenes get compressed aggressively, while highly dynamic sequences retain more tokens. To reconstruct the dropped positions, we propose the Latent Inpainting Transformer (LIT), a lightweight factorised spatial-temporal attention architecture. The resulting inference pipeline is highly efficient, requiring only a single encoder pass and one LIT forward pass, eliminating the need for auxiliary routing networks. Evaluations across TokenBench and DAVIS, which are the standard benchmarks used by recent tokenisers, indicate that our framework yields meaningful, content-driven token allocation while maintaining competitive reconstruction fidelity, and delivers a 31× inference-time speedup over the continuous adaptive baseline (ElasticTok-CV) and an approximately 2× speedup over the discrete information-theoretic baseline (InfoTok).
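
The thresholding step described above can be illustrated with a small NumPy sketch. It assumes latents shaped (T, H, W, C), averages the absolute difference over channels, compares each frame with the frame immediately before it, and always keeps the first frame; all four choices, along with the function name, are assumptions for illustration rather than details confirmed by the abstract.

```python
import numpy as np


def temporal_keep_mask(latents: np.ndarray, threshold: float) -> np.ndarray:
    """Return a boolean (T, H, W) mask; True marks latent positions to keep.

    latents: array of shape (T, H, W, C) from a frozen tokeniser (assumed layout).
    """
    # Per-position L1 change between consecutive frames, averaged over channels.
    diff = np.abs(latents[1:] - latents[:-1]).mean(axis=-1)  # shape (T-1, H, W)
    keep = np.ones(latents.shape[:3], dtype=bool)  # the first frame is always kept
    keep[1:] = diff > threshold  # positions that barely changed are dropped
    return keep
```

The fraction of `True` entries falls out of the content itself: a static clip keeps little beyond the first frame, while a fast-moving clip keeps most positions.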

2606.06155 2026-06-05 cs.RO cs.CV cs.MM 版本更新

AffordanceVLA: A Vision-Language-Action Model Empowering Action Generation through Affordance-Aware Understanding

AffordanceVLA：一种通过可供性感知理解赋能动作生成的视觉-语言-动作模型

Qize Yu, Jiadi You, Yuran Wang, Jiaqi Liang, Bowen Ping, Yang Tian, Yue Chen, Minghong Cai, Zeying Gong, Ruihai Wu, Yinchuan Li, Junwei Liang, Yingcong Chen

发表机构 * Peking University（北京大学）； Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； The Chinese University of Hong Kong（香港中文大学）； Knowin AI

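AI summary: The paper proposes the AffordanceVLA framework, which introduces structured affordance forecasting as a task-oriented intermediate representation to address the structural mismatch between semantic spaces and embodied control policies in VLA models, achieving precise perception-action mapping.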

Comments Preprint. Code and project page are available. Code: https://github.com/Skywalker-yqz/AffordanceVLA Project page: https://skywalker-yqz.github.io/AffordanceVLA/

AI中文摘要

视觉-语言-动作（VLA）模型利用预训练视觉-语言模型（VLM）的丰富世界知识来实现指令跟随的机器人操作。然而，VLM语义空间与具身控制策略之间的结构不匹配常常阻碍精确感知-动作映射的学习。为解决这一挑战，我们提出AffordanceVLA，一个统一框架，引入结构化可供性预测作为任务导向的中间表示，以建立更精确和鲁棒的感知-动作映射。具体而言，我们通过三个互补组件逐步建模操作先验：1）Which2Act，通过视觉潜在预测进行以物体为中心的定位以抑制干扰；2）Where2Act，通过可供性图估计进行2D交互定位；3）How2Act，用于引导操作策略的3D几何推理。这些可供性线索提供了空间定位、语义条件化和动作耦合的中间表示，从而自然地桥接视觉、语言和动作。我们将这些模块集成到具有专门专家的混合Transformer（MoT）架构中，并使用三阶段训练策略和渐进式数据课程训练模型。为克服机器人数据集中密集可供性标签的稀缺性，我们还开发了一个鲁棒的自动化数据增强流水线。在仿真和真实世界中的大量实验表明，AffordanceVLA在多种操作场景中实现了强大的性能。

英文摘要

Vision-Language-Action (VLA) models leverage the rich world knowledge of pretrained vision-language models (VLMs) to enable instruction-following robotic manipulation. However, the structural mismatch between VLM semantic spaces and embodied control policies often hinders the learning of precise perception-action mappings. To address this challenge, we propose AffordanceVLA, a unified framework that introduces structured affordance forecasting as a task-oriented intermediate representation to establish a more precise and robust perception-action mapping. Specifically, we progressively model manipulation priors through three complementary components: 1) Which2Act for object-centric grounding via visual latent prediction to suppress distractions; 2) Where2Act for 2D interaction localization via affordance map estimation; and 3) How2Act for 3D geometric reasoning to guide manipulation policies. These affordance cues provide spatially grounded, semantically conditioned, and action-coupled intermediate representations, thereby naturally bridging vision, language and action. We integrate these modules into a Mixture-of-Transformer (MoT) architecture with specialized experts and train the model using a three-stage training strategy with a progressive data curriculum. To overcome the scarcity of dense affordance labels in robotic datasets, we also develop a robust automated data augmentation pipeline. Extensive experiments in simulation and the real world demonstrate that AffordanceVLA achieves strong performance across diverse manipulation scenarios.

2606.06142 2026-06-05 cs.CV 版本更新

Computation-Aware Event-to-Frame Reconstruction via Selective Attention

计算感知的基于选择性注意力的事件到帧重建

Jingqian Wu, Yunbo Jia, Edmund Y. Lam

发表机构 * University of California, Berkeley（加州大学伯克利分校）

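AI summary: The paper proposes an efficient event-to-frame reconstruction framework that lowers computational complexity while maintaining reconstruction quality through a recurrent encoder-decoder, selective context fusion, and a lightweight hybrid attention mechanism.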

AI中文摘要

事件到帧（E2F）重建将异步事件流与基于帧的视觉流水线连接起来，但现有方法通常在重建质量和计算效率之间面临权衡。在这项工作中，我们提出了一种高效的E2F框架，强调因果时间建模和计算感知设计。该架构采用循环编码器-解码器，以紧凑的隐藏状态逐步聚合事件信息。为了提高在快速运动和光照变化下的鲁棒性，引入了一种选择性上下文融合策略，将事件驱动的特征与先验强度线索相结合。在此融合过程中，一种轻量级混合注意力机制增强了特征选择性，而无需依赖繁重的注意力操作。在标准基准上的实验结果表明，所提出的方法在保持重建性能竞争力的同时，在准确性和模型复杂度之间取得了良好的平衡。

英文摘要

Event-to-frame (E2F) reconstruction bridges asynchronous event streams with frame-based vision pipelines, but existing methods often face a trade-off between reconstruction quality and computational efficiency. In this work, we propose an efficient E2F framework that emphasizes causal temporal modeling and computation-aware design. The architecture adopts a recurrent encoder-decoder to incrementally aggregate event information with compact hidden states. To improve robustness under fast motion and illumination variations, a selective context fusion strategy is introduced to integrate event-driven features with prior intensity cues. Within this fusion process, a lightweight hybrid attention mechanism enhances feature selectivity without relying on heavy attention operations. Experimental results on standard benchmarks demonstrate that the proposed approach achieves competitive reconstruction performance while maintaining a favorable balance between accuracy and model complexity.

2606.06120 2026-06-05 cs.CV 版本更新

Diff-CA: Separating Common and Salient Factors with Diffusion Models

Diff-CA: 使用扩散模型分离共同因素和显著因素

Michaël Soumm, Alexandre Fournier Montgieux, Yunlong He, Pietro Gori, Alasdair Newson

发表机构 * INRIA at Univ. Grenoble Alpes（法国格勒诺布尔大学INRIA实验室）； CEA List, Palaiseau（法国CEA列表，帕莱索）； Télécom Paris, Institut Polytechnique de Paris（巴黎电信学院，巴黎理工学院）

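AI summary: The paper proposes a conditioning framework for diffusion models that uses weak supervision to decompose the image conditioning into a common factor and a salient factor, separating factors for contrastive analysis while retaining high-fidelity image generation.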

AI中文摘要

对比分析旨在将两个数据分布之间的共同因素与仅对其中一个分布显著的因素分离开来。现有的对比方法基于生成模型（如VAE或GAN），这些模型通常受到重建和图像质量有限的困扰，这阻碍了有效的潜在因素分离，并限制了它们在高保真图像生成和编辑中的应用。我们提出了一种新颖的扩散模型条件框架，能够在不牺牲生成质量的情况下实现对比分解。我们首先训练一个无需提示、以图像为条件的扩散模型，然后学习使用弱监督将条件分解为共同因素和显著因素。我们证明了先前工作中通常假设的加性对比分解在温和条件下是可识别的。这种分解通过仅交换或插值显著因素来实现有针对性的操作。

英文摘要

Contrastive Analysis aims to separate factors that are common between two data distributions from those that are salient to only one of them. Existing contrastive methods are based on generative models (e.g., VAEs or GANs) that often suffer from limited reconstruction and image quality, which hampers effective latent factor separation and limits their applicability to high-fidelity image generation and editing. We propose a novel conditioning framework for diffusion models that enables contrastive decomposition without compromising generation quality. We first train a prompt-free, image-conditioned diffusion model, and then learn to decompose the conditioning into a common and a salient factor, using weak supervision. We prove that the additive contrastive factorization, commonly assumed in prior work, is identifiable under mild conditions. This factorization enables targeted operations by swapping or interpolating only the salient factor.

2606.06103 2026-06-05 cs.CV 版本更新

MS-DKC: A Dataset Knowledge Card Framework for Designing and Adapting Medical Image Segmentation Models

MS-DKC：用于设计和适配医学图像分割模型的数据集知识卡片框架

Tariq M. Khan, Syed Saud Naqvi, Thantrira Porntaveetus, Hamid Alinejad-Rokny, Shahzaib Iqbal, Imran Razzak, Mohammad AU Khan

发表机构 * Center of Excellence in Precision Medicine and Digital Health, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand（精准医学与数字健康中心，朱拉隆功大学牙科学院，泰国曼谷）； Department of Computer Engineering, COMSATS University Islamabad, Islamabad, Pakistan（计算机工程系，COMSATS伊斯兰堡大学，巴基斯坦伊斯兰堡）； School of Biomedical Engineering, UNSW, Sydney, NSW, Australia（生物医学工程学院，新南威尔士大学，澳大利亚悉尼，新南威尔士）； Visiting Scholar (Collaborative Projects), Center of Excellence in Precision Medicine and Digital Health, Chulalongkorn University, Bangkok, Thailand（访问学者（合作项目），精准医学与数字健康中心，朱拉隆功大学，泰国曼谷）； Department of Computing, Abasyn University Islamabad Campus (AUIC), Islamabad, Pakistan（计算系，阿巴斯扬大学伊斯兰堡校区（AUIC），巴基斯坦伊斯兰堡）； Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates（Mohamed bin Zayed人工智能大学，阿布扎比，阿拉伯联合酋长国）； College of Computer and Information Sciences, Prince Sultan University, Riyadh, Saudi Arabia（计算机与信息科学学院，苏丹王子大学，沙特阿拉伯利雅得）

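AI summary: The paper proposes the MS-DKC framework, which explicitly records dataset characteristics (such as foreground occupancy, morphology, and boundary ambiguity) and maps them to failure modes, design priors, and risk-aligned criteria to guide the design and adaptation of medical image segmentation models; dataset-conditioned design is validated on DRIVE, ISIC2018, and ACDC.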

AI中文摘要

医学图像分割通常被定义为寻找更强架构的问题，但这可能掩盖一个更基本的问题：数据集对模型有什么要求？在医学影像中，这种要求由前景占有率、形态、边界模糊性、拓扑敏感性、标注质量、采集变异和操作点决定。本文介绍了医学分割数据集知识卡片（MS-DKC），一个使这些因素显式化的框架。MS-DKC通过图像/采集、形态、监督、上下文依赖和部署风险描述符记录数据集证据。这些描述符被映射到失败模式、设计先验和风险对齐标准，使分割设计比架构优先比较更具可追溯性。我们在DRIVE、ISIC2018和ACDC上评估了MS-DKC，它们代表了不同的场景。DRIVE包含稀疏、细小的分支血管，有利于细节保持模型、敏感性感知优化、阈值分析和拓扑感知指标。DKC-TNet-v2以35103个参数达到了Dice 0.8044和IoU 0.6730，而SA-UNetv2-DKC-AmbRef达到了Dice 0.8141、IoU 0.6865、敏感性0.8265、特异性0.9804和AUC 0.9853。ISIC2018涉及紧凑但外观可变的病变；在Att-Next-Topo/ATTNext上基于验证约束的评分函数选择产生了MS-DKC-AttNextTopo-VCSF-NoAug，Dice 0.8872、IoU 0.8214、精确率0.9173、边界F1 0.4878和ASSD 4.13，而合理的添加未能改善风险对齐的轮廓。ACDC提供了一个多类心脏案例，其中MS-DKC推荐四类softmax分割、类别平衡的Dice/CE监督和类别级表面评估。总体而言，结果支持数据集条件化设计：不同的数据集需要不同的先验、操作点和证据，然后才能判断模型是否合适。

英文摘要

Medical image segmentation is often framed as a search for stronger architectures, but this can obscure a more fundamental question: what does the dataset require from the model? In medical imaging, this requirement is shaped by foreground occupancy, morphology, boundary ambiguity, topology sensitivity, annotation quality, acquisition variation, and operating point. This paper introduces the Medical Segmentation Dataset Knowledge Card (MS-DKC), a framework for making these factors explicit. MS-DKC records dataset evidence through image/acquisition, morphology, supervision, context-dependence, and deployment-risk descriptors. These descriptors are mapped to failure modes, design priors, and risk-aligned criteria, making segmentation design more traceable than architecture-first comparison. We evaluate MS-DKC on DRIVE, ISIC2018, and ACDC, representing distinct regimes. DRIVE contains sparse, thin, branching vessels, favoring detail-preserving models, sensitivity-aware optimization, threshold analysis, and topology-aware metrics. DKC-TNet-v2 achieved Dice 0.8044 and IoU 0.6730 with 35103 parameters, while SA-UNetv2-DKC-AmbRef reached Dice 0.8141, IoU 0.6865, sensitivity 0.8265, specificity 0.9804, and AUC 0.9853. ISIC2018 involves compact but appearance-variable lesions; validation-constrained score-function selection on Att-Next-Topo/ATTNext produced MS-DKC-AttNextTopo-VCSF-NoAug with Dice 0.8872, IoU 0.8214, precision 0.9173, Boundary F1 0.4878, and ASSD 4.13, while plausible additions failed to improve the risk-aligned profile. ACDC provides a multi-class cardiac case, where MS-DKC recommends four-class softmax segmentation, class-balanced Dice/CE supervision, and class-wise surface evaluation. Overall, the results support dataset-conditioned design: different datasets require different priors, operating points, and evidence before a model can be judged appropriate.

2606.06100 2026-06-05 cs.CV 版本更新

HyperVis: Continuous Latent Visual Relational Graphs on the Lorentz Hyperboloid for Compositional Reasoning

HyperVis：洛伦兹双曲面上的连续潜在视觉关系图用于组合推理

Moshiur Farazi, Sameera Ramasinghe, Mahbub Ahmed Turza, Shafin Rahman

发表机构 * Data Science and AI, University of Doha for Science and Technology, Qatar（数据科学与人工智能，多哈科学技术大学，卡塔尔）； Pluralis Research, Australia（Pluralis研究，澳大利亚）； Department of Electrical and Computer Engineering, North South University, Bangladesh（电气与计算机工程系，北南大学，孟加拉国）

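AI summary: To address the difficulty vision-language models have in understanding inter-object relationships during compositional reasoning, the paper proposes HyperVis, which computes a dense visual relation tensor, projects it onto a Lorentz hyperboloid, and enforces hierarchy through spatial physics (IoA-driven entailment cones and exterior-angle repulsion). It improves generative VQA as a training-time regularizer and discriminative compositional scoring as an inference-time relational encoder.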

AI中文摘要

视觉语言模型（VLM）在需要理解物体间关系的组合推理中表现不佳。一个自然的补救措施是从现成的场景图生成器（SGG）注入显式场景图三元组$\langle s, p, o \rangle$，但我们发现这会产生反效果：离散文本标签与连续视觉模态冲突，导致GQA准确率从60.38%降至58.86%。我们提出HyperVis，完全绕过了SGG的语义瓶颈。从$N$个类别无关的区域提议出发，通过空间偏置交叉注意力计算密集的$O(N^2)$视觉关系张量，将其投影到洛伦兹双曲面上，并通过空间物理（即IoA驱动的蕴含锥和外部角排斥）强制执行层次结构。我们发现HyperVis以两种互补的方式发挥作用：（1）作为训练时正则化器，双曲关系损失塑造了LoRA表示，提高了生成式VQA性能（GQA 61.03%对比无关系损失的LoRA微调57.21%，恢复并超越基线）；（2）作为推理时关系编码器，双曲前缀令牌提升了判别式组合评分（SugarCrepe 79.94%，比基线高6.25个百分点）。学习到的曲率稳定在$\kappa=4.0$，比先前的双曲VLM高一个数量级（先前$\kappa$通常趋近于零），表明连续视觉特征确实需要强曲率空间的指数体积。受控的欧几里得消融实验证实了这种分解：关系流水线在平坦空间中对LoRA的正则化效果相当（GQA 60.81%），但组合增益是双曲空间特有的（SugarCrepe比欧几里得高4.58个百分点），且欧几里得训练中的蕴含损失高出约6倍。代码将在后续公布。

英文摘要

Vision-Language Models (VLMs) struggle with compositional reasoning that requires understanding inter-object relationships. A natural remedy is to inject explicit scene graph triplets $\langle s, p, o \rangle$ from an off-the-shelf scene graph generator (SGG), but we show this backfires: discrete text labels collide with the continuous visual modality, degrading GQA accuracy from 60.38% to 58.86%. We propose HyperVis, which bypasses the SGG semantic bottleneck entirely. From $N$ class-agnostic region proposals, we compute a dense $O(N^2)$ visual relation tensor via spatially-biased cross-attention, project it onto a Lorentz hyperboloid, and enforce hierarchy through spatial physics, namely IoA-driven entailment cones and exterior-angle repulsion. We discover that HyperVis contributes in two complementary ways: (1) as a training-time regularizer, the hyperbolic relational losses shape LoRA representations that improve generative VQA (GQA 61.03% vs. 57.21% for LoRA fine-tuning without relational losses, recovering and surpassing the baseline); and (2) as an inference-time relational encoder, hyperbolic prefix tokens boost discriminative compositional scoring (SugarCrepe 79.94%, +6.25 pp over baseline). The learned curvature stabilises at $\kappa = 4.0$, an order of magnitude above prior hyperbolic VLMs where $\kappa$ typically collapses toward zero, indicating that continuous visual features genuinely require the exponential volume of strongly curved space. A controlled Euclidean ablation confirms this decomposition: the relational pipeline regularises LoRA comparably in flat space (GQA 60.81%), but the compositionality gain is specifically hyperbolic (SugarCrepe +4.58 pp over Euclidean), with entailment loss $\sim 6\times$ higher in Euclidean training. Codes are available at TBA.
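
To make the geometry concrete, here is a minimal NumPy sketch of placing Euclidean feature vectors on a Lorentz hyperboloid and measuring geodesic distance there. It assumes the common convention that the hyperboloid of curvature $-\kappa$ is the set of points with Lorentz inner product $-1/\kappa$, and it lifts features by solving for the time coordinate; the abstract does not say which projection HyperVis uses, so the lift and all function names are assumptions.

```python
import numpy as np


def lift_to_hyperboloid(x_space: np.ndarray, kappa: float) -> np.ndarray:
    """Prepend the time coordinate so that <x, x>_L = -1/kappa (assumed convention)."""
    x_time = np.sqrt(1.0 / kappa + np.sum(x_space**2, axis=-1, keepdims=True))
    return np.concatenate([x_time, x_space], axis=-1)


def lorentz_inner(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Lorentz inner product: minus the time product plus the spatial dot product."""
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)


def lorentz_distance(x: np.ndarray, y: np.ndarray, kappa: float) -> np.ndarray:
    """Geodesic distance on the hyperboloid of curvature -kappa."""
    # Clip guards against arguments slightly below 1 caused by rounding.
    arg = np.clip(-kappa * lorentz_inner(x, y), 1.0, None)
    return np.arccosh(arg) / np.sqrt(kappa)
```

With a larger `kappa` the hyperboloid sits closer to the origin and curves more strongly, so the same Euclidean feature vectors map to a regime where volume grows faster with radius, which is the property the abstract ties to hierarchy.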

2606.06078 2026-06-05 cs.CV 版本更新

ReCache: 通过REINFORCE学习扩散模型的预算感知缓存调度

Mishan Aliev, Eva Neudachina, Ilya Bykov, Aleksandr Oganov, Kirill Struminsky, Aibek Alanov, Denis Rakitin

发表机构 * HSE University（俄罗斯高等经济学院）； Yandex Research（Yandex研究院）

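AI summary: The paper proposes ReCache, which uses policy gradients to learn the schedule of recomputed denoising steps that maximizes generation quality under a given compute budget; it needs no labelled data and is compatible with multiple caching mechanisms.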

AI中文摘要

现代扩散模型生成高质量图像和视频，但其迭代去噪过程导致推理成本高昂。特征缓存通过重用或预测相邻去噪步骤的中间激活来加速采样，利用沿反向轨迹的计算冗余。本文关注缓存调度：选择哪些去噪步骤应完全重计算。现有调度要么是固定的（如均匀），要么根据每步误差启发式自适应选择；这两种情况下，实际计算成本是手动调整阈值的副作用，而非用户可指定的量。我们提出ReCache，它反转了这一过程：给定目标预算k，学习最大化生成质量的重计算调度，将计算变为可直接控制的输入。ReCache通过策略梯度训练，避开了通过完整扩散推理的反向传播，且不使用任何标注数据。来自无缓存推理的生成作为匹配目标，并配以生成质量的奖励。ReCache兼容任何缓存机制，包括特征重用和特征预测；对于每种机制，单个训练好的策略在推理时适应不同计算预算。ReCache持续优于调度基线：在FLUX上减少$\times5.04$ FLOPs时，与DiCache相比，LPIPS降低31%（从0.456降至0.316）；在Wan 2.1上实现$\sim \times2.6$加速时，与均匀HiCache相比，LPIPS降低65%（从0.480降至0.169），VBench分数提升7%（5.6分，从70.4升至76.0）。代码见https://github.com/thecrazymage/ReCache。

英文摘要

Modern diffusion models generate high-quality images and videos, but their iterative denoising process makes inference expensive. Feature caching accelerates sampling by reusing or predicting intermediate activations across neighboring denoising steps, exploiting the redundancy of computations along the reverse trajectory. In this work, we focus on the caching schedule: selecting which denoising steps should be fully recomputed. Existing schedules are either fixed (e.g. uniform) or chosen adaptively from per-step error heuristics; in both cases, the actual compute cost is a side-effect of hand-tuned thresholds rather than a quantity the user can specify. We propose ReCache, which inverts this: given a target budget k, it learns the recomputation schedule that maximizes generation quality, turning compute into a directly controllable input. ReCache trains via policy gradients, sidestepping backpropagation through full diffusion inference, and uses no labelled data. Generations from uncached inference serve as matching targets, paired with a reward for generation quality. ReCache is compatible with any caching mechanism, including feature reuse and feature forecasting; for each mechanism, a single trained policy adapts across computational budgets at inference time. ReCache consistently outperforms scheduling baselines: under a $\times5.04$ FLOPs reduction on FLUX, it reduces LPIPS by 31% (from 0.456 to 0.316) compared to DiCache; on Wan 2.1 at a $\sim \times2.6$ speedup, it drops LPIPS by 65% (from 0.480 to 0.169) and boosts the VBench score by 7% (5.6 points, from 70.4 to 76.0) over uniform HiCache. Code is available at https://github.com/thecrazymage/ReCache.

2606.06039 2026-06-05 cs.CV 版本更新

Texture-preserving implicit neural representation for Cone beam CT truncated reconstruction

保留纹理的隐式神经表示用于锥束CT截断重建

Genyuan Zhang, Junyao Wang, Haoran Lan, Chuandong Tan, Songtao Zhu, Fenglin Liu

发表机构 * National Key Research and Development Program of China（中华人民共和国国家重点研发计划）； National Natural Science Foundation of China（中华人民共和国国家自然科学基金）； Fundamental Research Funds for the Central Universities（中央高校基本科研业务费）

AI总结提出一种自监督的3D重建框架，基于神经场景表示，结合物理迭代细化模块，解决锥束CT截断重建中的伪影和纹理丢失问题。

详情

AI中文摘要

锥束计算机断层扫描（CBCT）经常受到数据截断的影响，这引入了严重的伪影并限制了有效视场（FOV）。现有的用于截断锥束CT重建的深度学习方法存在严重局限性，包括严格依赖有监督的真实数据和未能考虑连续3D空间截断变化。为了解决这些挑战，我们引入了一个基于神经场景表示的自监督3D重建框架。通过在投影监督下将空间坐标直接映射到辐射密度，我们的方法固有地绕过了传统的滤波和反投影操作，从而从根本上消除了截断引起的环状伪影，同时实现了鲁棒的连续3D数据外推。然而，坐标网络容易受到固有的频谱偏差影响，这导致临床关键的高频纹理严重丢失。为了解决这一瓶颈，我们进一步将基于物理的迭代细化模块集成到神经场景表示架构中。利用来自坐标网络的无伪影外推体积作为最优初始化，该模块逐步从原始投影中重新提取高频结构信息并将其注入体积中。在模拟和真实数据集上的大量实验表明，我们的方法成功地将神经网络的优异伪影抑制和外推能力与迭代算法的高保真细节保留统一起来。

英文摘要

Cone-beam computed tomography (CBCT) frequently suffers from data truncation, which introduces severe artifacts and limits the effective field of view (FOV). Existing deep learning methods for truncated CBCT reconstruction suffer from serious limitations, including a strict reliance on supervised ground truth and a failure to account for continuous 3D spatial truncation variations. To address these challenges, we introduce a self-supervised 3D reconstruction framework based on neural scene representations. By directly mapping spatial coordinates to radiodensity under projection supervision, our approach inherently bypasses traditional filtering and backprojection operations, thereby fundamentally eliminating truncation-induced ring artifacts while enabling robust continuous 3D data extrapolation. However, coordinate networks are susceptible to an inherent spectral bias, which leads to a severe loss of clinically vital high-frequency textures. To resolve this bottleneck, we further incorporate a physics-based iterative refinement module into the neural scene representation architecture. Leveraging the artifact-free, extrapolated volume from the coordinate network as an optimal initialization, this module progressively re-extracts and injects high-frequency structural information from the original projections back into the volume. Extensive experiments on both simulated and real-world datasets demonstrate that our method successfully unifies the exceptional artifact suppression and extrapolation capabilities of neural networks with the high-fidelity detail preservation of iterative algorithms.

2606.06020 2026-06-05 cs.CV 版本更新

ReSAGE-PAR: Representational Similarity Assessment for Generative Expansion in Pedestrian Attribute Recognition

ReSAGE-PAR：行人属性识别中生成式扩展的表征相似性评估

Pablo Ayuso-Albizu, Pablo Carballeira, Juan C. SanMiguel, Paula Moral

发表机构 * Universidad Autónoma de Madrid（马德里自治大学）

AI总结针对行人属性识别数据稀缺问题，提出ReSAGE-PAR管道，通过扩散模型生成图像并利用贝叶斯分类器验证属性，实现可扩展的高保真数据集扩展，在标准骨干网络上提升高达8.7%。

Comments Under review at IEEE Transactions on Circuits and Systems for Video Technology (TCSVT)

详情

AI中文摘要

为了解决行人属性识别（PAR）中有限的数据多样性和数据稀缺问题，我们探索了使用基于属性提示的扩散模型进行图像合成。虽然这能够实现行人图像的可控生成，但它面临两个关键挑战：（i）高质量预训练数据与低分辨率、非标准监控裁剪之间的领域差距，以及（ii）需要可靠的属性验证以防止生成幻觉。在本文中，我们引入了一个稳健的生成-评分-自动标注管道，称为ReSAGE-PAR（PAR中生成式扩展的表征相似性评估），它弥合了这一领域差距，并实现了可扩展、高保真的数据集扩展。首先，我们使用定制的基于LoRA的图像到图像方法，将预训练的扩散模型适应到原生PAR分辨率。其次，我们提取生成图像与其条件提示之间的视觉-语言对齐分数，利用包括标签一致和不一致补充的综合提示策略。最后，我们制定了一个贝叶斯分类器，将这些连续分数转换为可靠的二值伪标签。大量评估证明了ReSAGE-PAR在保留空间先验和验证属性方面的有效性。当集成到PAR训练中时，ReSAGE-PAR一致地带来了显著的改进——在标准骨干网络上实现了高达8.7%的提升，并将最先进的框架推向了新的性能水平。这证明了其作为可扩展PAR增强的架构无关解决方案的价值。ReSAGE-PAR的完整代码库可在http://www-vpu.eps.uam.es/publications/ReSAGE-PAR公开获取。

英文摘要

To address the limited diversity and data scarcity in Pedestrian Attribute Recognition (PAR), we explore image synthesis using diffusion models guided by attribute-based prompts. While this enables the controlled generation of pedestrian images, it faces two critical challenges: (i) the domain gap between high-quality pre-training data and low-resolution, non-standard surveillance crops, and (ii) the need for reliable attribute verification to prevent generative hallucinations. In this paper, we introduce a robust generate-score-autolabel pipeline called ReSAGE-PAR (REpresentational Similarity Assessment for Generative Expansion in PAR) that bridges this domain gap and enables scalable, high-fidelity dataset expansion. First, we adapt pre-trained diffusion models to native PAR resolutions using a tailored LoRA-based Image-to-Image approach. Second, we extract vision-language alignment scores between the generated images and their conditioning prompts, utilizing a comprehensive prompting strategy that includes label-consistent and inconsistent complements. Finally, we formulate a Bayesian classifier that converts these continuous scores into reliable binary pseudo-labels. Extensive evaluations demonstrate the effectiveness of ReSAGE-PAR in preserving spatial priors and verifying attributes. When integrated into PAR training, ReSAGE-PAR consistently yields significant improvements, achieving gains of up to 8.7% on standard backbones and pushing state-of-the-art frameworks to new performance levels. This proves its value as an architecture-agnostic solution for scalable PAR enhancement. The complete codebase for ReSAGE-PAR is publicly available at http://www-vpu.eps.uam.es/publications/ReSAGE-PAR.
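
The last pipeline step, turning a continuous alignment score into a binary pseudo-label with a Bayesian classifier, can be illustrated with a small sketch. The abstract does not say how the class-conditional score distributions are modelled, so the Gaussian likelihoods, the example means and standard deviations, the prior, and the function names below are all assumptions made for illustration.

```python
import math
from typing import Tuple


def gaussian_pdf(x: float, mean: float, std: float) -> float:
    """Density of a normal distribution (assumed score model, not from the paper)."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def posterior_positive(score: float,
                       pos: Tuple[float, float],
                       neg: Tuple[float, float],
                       prior_pos: float = 0.5) -> float:
    """P(attribute present | score) by Bayes' rule; pos and neg are (mean, std)."""
    joint_pos = gaussian_pdf(score, *pos) * prior_pos
    joint_neg = gaussian_pdf(score, *neg) * (1.0 - prior_pos)
    return joint_pos / (joint_pos + joint_neg)


def pseudo_label(score: float,
                 pos: Tuple[float, float],
                 neg: Tuple[float, float],
                 prior_pos: float = 0.5,
                 threshold: float = 0.5) -> int:
    """Binary pseudo-label: 1 if the posterior reaches the threshold, else 0."""
    return int(posterior_positive(score, pos, neg, prior_pos) >= threshold)


# Hypothetical score statistics for images that do / do not show the attribute.
POS_STATS = (0.30, 0.05)
NEG_STATS = (0.20, 0.05)
labels = [pseudo_label(s, POS_STATS, NEG_STATS) for s in (0.18, 0.32)]
```

With equal priors and equal variances the decision boundary falls midway between the two means. Changing `prior_pos` shifts it, which is one way a rare attribute could be handled in such a scheme.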

2606.05999 2026-06-05 cs.CV cs.AI 版本更新

ATT-CR: Adaptive Triangular Transformer for Cloud Removal

ATT-CR: 自适应三角变换器用于云去除

Yang Wu, Ye Deng, Pengna Li, Wenli Huang, Kangyi Wu, Xiaomeng Xin, Jinjun Wang

发表机构 * Xi’an Jiaotong University（西安交通大学）； School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics（计算机与人工智能学院，西南财经大学）； Ningbo University of Technology（宁波工程学院）

AI总结提出自适应三角变换器（ATT-CR），通过三角注意力和特征选择门控模块降低计算复杂度并减少云像素干扰，实现高效云去除。

详情

AI中文摘要

云去除旨在准确重建遥感图像中被云遮挡的地面物体。现有的基于Transformer的方法利用自注意力有效建模云图像中的长距离依赖，取得了显著效果。然而，它们存在以下问题：1）自注意力的高计算复杂度限制了可扩展性；2）在注意力计算中将云像素和干净像素均视为有效，会在后续层中引入干扰，导致性能次优。为解决这些挑战，我们提出了自适应三角变换器用于云去除（ATT-CR），该模型有效降低了计算成本并减轻了云像素的干扰。具体而言，它包含两个核心组件：三角注意力（TAN）和特征选择门控模块（FSGM）。TAN使用下三角和上三角矩阵近似Softmax注意力，计算复杂度为O(N)，显著降低了计算成本。而FSGM与TAN集成，自适应地区分云特征和干净特征，从而最小化无效信息引入后续层。在云去除基准上的大量实验表明，ATT-CR相比现有方法具有更优的性能。

英文摘要

Cloud removal aims to accurately reconstruct the ground objects obscured by clouds in remote sensing images. Existing Transformer-based methods utilizing self-attention have shown impressive results by effectively modeling long-range dependencies in cloudy images. However, they suffer from the following issues: 1) the high computational complexity of self-attention limits scalability; 2) treating both cloudy and clean pixels as valid within the attention computation brings disturbances in subsequent layers, leading to suboptimal performance. To address these challenges, we propose the Adaptive Triangular Transformer for Cloud Removal (ATT-CR), a model that effectively reduces computational costs and mitigates interference from cloudy pixels. Specifically, it consists of two core components: Triangular Attention (TAN) and Feature Selected Gating Module (FSGM). TAN employs lower and upper triangular matrices to approximate Softmax attention with O(N) computational complexity, significantly reducing the computational costs. The FSGM, on the other hand, integrates with TAN to adaptively distinguish between cloudy and clean features, which minimizes the introduction of invalid information into subsequent layers. Extensive experiments on cloud removal benchmarks demonstrate that ATT-CR delivers superior performance compared to existing methods.
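
The abstract does not spell out how TAN is built, so the sketch below is not the paper's module. It only shows a generic linear-attention identity that makes the O(N) claim plausible: unnormalised attention out_i = sum_j (q_i . k_j) v_j splits into a lower-triangular part (j <= i) and a strictly upper-triangular part (j > i), and each part can be accumulated with a running d-by-e state instead of an N-by-N matrix. Softmax normalisation and any learned feature maps are omitted, and all function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def outer_add(state: Matrix, k: List[float], v: List[float]) -> Matrix:
    """Return state + k v^T as a new d-by-e matrix."""
    return [[state[a][b] + k[a] * v[b] for b in range(len(v))]
            for a in range(len(k))]


def apply_query(q: List[float], state: Matrix) -> List[float]:
    """Return q^T state, a vector of length e."""
    return [sum(q[a] * state[a][b] for a in range(len(q)))
            for b in range(len(state[0]))]


def triangular_linear_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Linear-time evaluation via a prefix pass and a suffix pass."""
    n, d, e = len(Q), len(Q[0]), len(V[0])
    zero = [[0.0] * e for _ in range(d)]
    lower, state = [], zero
    for i in range(n):                      # lower triangle: j <= i
        state = outer_add(state, K[i], V[i])
        lower.append(apply_query(Q[i], state))
    upper, state = [None] * n, zero
    for i in reversed(range(n)):            # strict upper triangle: j > i
        upper[i] = apply_query(Q[i], state)
        state = outer_add(state, K[i], V[i])
    return [[lo + up for lo, up in zip(lower[i], upper[i])] for i in range(n)]


def dense_attention(Q: Matrix, K: Matrix, V: Matrix) -> Matrix:
    """Quadratic reference: out_i = sum_j (q_i . k_j) v_j."""
    out = []
    for q in Q:
        row = [0.0] * len(V[0])
        for k, v in zip(K, V):
            score = sum(a * b for a, b in zip(q, k))
            row = [r + score * x for r, x in zip(row, v)]
        out.append(row)
    return out


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
V = [[1.0], [2.0], [3.0]]
fast = triangular_linear_attention(Q, K, V)
```

Each pass touches every token once and keeps only a d-by-e state, so the cost grows linearly with the number of tokens N for fixed feature sizes.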

2606.05998 2026-06-05 cs.CV cs.AI 版本更新

Deep Learning-based 3D Oral Cavity Reconstruction Using 2D Intraoral Images

基于深度学习的二维口内图像三维口腔重建

Jihun Cho, Soo-Yeon Jeong, Eun-Jeong Bae, Sun-Young Ihm

发表机构 * KAIST（韩国科学技术院）

AI总结提出一种仅用十张二维口内图像进行三维口腔重建的软件方法，采用MobileNetV2与多头注意力机制，降低成本和不适，实现自动化重建。

Comments 4 pages, 5 figures. English version of a paper presented at the Korea Multimedia Society Conference, November 2025

详情

AI中文摘要

口腔三维建模是牙科中最关键的阶段之一，常用的方法如印模和口内扫描各有显著局限。印模法将藻酸盐或硅胶材料放入托盘并插入患者口腔形成阴模，存在患者不适、材料变形误差及存储运输困难等问题。口内扫描仪利用结构光或激光技术实时直接扫描口腔结构，效果先进但设备成本极高。为解决这些问题，本文提出一种基于软件的方法，仅使用从不同角度拍摄的十张二维口内图像重建三维口腔模型，无需专用硬件设备。该方法降低成本，消除物理扫描设备需求，减少患者不适，并实现自动化三维重建。模型在公开的Dental3DS数据集（包含950个上颌样本）上训练，采用MobileNetV2作为图像编码器，结合多头注意力进行多视图特征融合。所提模型在最近邻匹配（距离阈值0.035）下达到77.49%的准确率。然而，预测顶点倾向于集中在真实值的高密度区域，导致重建模型上的点分布不均匀。

英文摘要

Oral 3D modelling is one of the most essential stages in dentistry, and many different approaches, such as impression taking and intraoral scanning, are commonly used for this phase, each with notable limitations. Impression taking, which involves placing alginate or silicone material in a tray and inserting it into the patient's oral cavity to form a negative mold, suffers from significant patient discomfort, material deformation errors, and difficulties in storage and transportation. Intraoral scanners, which directly scan oral structures in real time using structured light or laser technology, produce state-of-the-art results but are associated with substantially high equipment costs. To address these limitations, this paper proposes a software-based approach that reconstructs a 3D oral model using only ten 2D intraoral images captured from different angles, requiring no dedicated hardware devices. The proposed method reduces cost, eliminates the need for physical scanning equipment, minimises patient discomfort, and enables automated 3D reconstruction. The model is trained on the publicly available Dental3DS dataset, comprising 950 upper jaw samples, and employs MobileNetV2 as the image encoder combined with Multi-head Attention for multi-view feature fusion. The proposed model achieves an accuracy of 77.49%, measured by nearest-neighbor matching with a distance threshold of 0.035. However, predicted vertices tend to concentrate in high-density regions of the ground truth, resulting in uneven point distribution across the reconstructed model.
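
The reported accuracy is described only as nearest-neighbor matching with a distance threshold of 0.035. One plausible reading, sketched below, counts a predicted vertex as correct when its nearest ground-truth vertex lies within the threshold. The matching direction (prediction to ground truth), the coordinate units, and the function name `nn_accuracy` are assumptions, not details given in the abstract.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def nn_accuracy(predicted: Sequence[Point],
                ground_truth: Sequence[Point],
                threshold: float = 0.035) -> float:
    """Fraction of predicted vertices within `threshold` of their nearest
    ground-truth vertex (brute-force search, fine for small point sets)."""
    if not predicted or not ground_truth:
        raise ValueError("both point sets must be non-empty")
    hits = 0
    for p in predicted:
        nearest = min(math.dist(p, g) for g in ground_truth)
        if nearest <= threshold:
            hits += 1
    return hits / len(predicted)


gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
pred = [(0.01, 0.0, 0.0), (1.0, 0.03, 0.0), (0.5, 0.0, 0.0), (0.0, 0.2, 0.0)]
score = nn_accuracy(pred, gt)

# Every prediction collapsed onto one ground-truth vertex still scores 1.0.
collapsed = nn_accuracy([(0.0, 0.0, 0.0)] * 4, gt)
```

Under this one-directional reading, a model can score well while leaving parts of the surface uncovered, which would be consistent with the uneven point distribution the authors report. A symmetric measure such as Chamfer distance would penalise that, though the abstract does not say whether one was used.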

2606.05997 2026-06-05 cs.CV 版本更新

Multimodal Sexism Identification and Characterization using Large Language Models and Gradient Boosting

使用大语言模型和梯度提升的多模态性别歧视识别与表征

Kyriakos Chaviaras, Maria Lymperaiou, Athanasios Voulodimos

发表机构 * Artificial Intelligence and Learning Systems Laboratory（人工智能与学习系统实验室）； School of Electrical and Computer Engineering（电气与计算机工程学院）； National Technical University of Athens（雅典国立技术大学）

AI总结提出基于特征工程和梯度提升回归模型的后融合管道，结合视觉、文本、人口统计、生物特征及LLM语义指标，用于识别和表征模因和短视频中的多模态性别歧视。

详情

AI中文摘要

我们介绍了AILS-NTUA提交给CLEF EXIST 2026实验室的工作，解决模因（任务2）和短视频（任务3）中的多模态性别歧视识别与表征问题。我们的系统采用基于特征工程的后融合管道，围绕梯度提升回归模型和层次化后处理构建。对于模因，我们结合了视觉、文本、人口统计、生物特征和LLM衍生的语义指标，旨在捕捉刻板印象、物化、讽刺和厌女等高层次线索。对于视频，我们研究了特征选择、基于帧的视觉表示、基于OCR的文本特征、声学描述符和传感器衍生元数据的影响。开发结果表明，聚焦的LLM衍生语义线索改善了模因性别歧视识别，而视频性能对特征维度和跨模态噪声高度敏感。对于视频，开发结果倾向于紧凑的特征选择，但官方测试结果表明这一结论不能完全推广到未见数据，其中未过滤的表征泛化更好。总体而言，我们的发现强调了针对静态模因进行目标语义特征工程的有用性，以及在嘈杂的短视频环境中需要更鲁棒的时间建模。

英文摘要

We present the AILS-NTUA submission to the EXIST 2026 Lab at CLEF, addressing multimodal sexism identification and characterization in memes (Task 2) and short-form videos (Task 3). Our system follows a feature-engineered late-fusion pipeline built around gradient-boosted regression models and hierarchical post-processing. For memes, we combine visual, textual, demographic, biometric, and LLM-derived semantic indicators designed to capture high-level cues such as stereotyping, objectification, irony, and misogyny. For videos, we investigate the effect of feature selection, frame-based visual representations, OCR-based textual features, acoustic descriptors, and sensor-derived metadata. Development results show that focused LLM-derived semantic cues improve meme sexism identification, while video performance is highly sensitive to feature dimensionality and cross-modal noise. For videos, development results favor compact feature selection, but official test results show that this conclusion does not fully transfer to unseen data, where the unfiltered representation generalizes better. Overall, our findings highlight the usefulness of targeted semantic feature engineering for static memes and the need for more robust temporal modeling in noisy short-form video settings.

2606.05981 2026-06-05 cs.CV cs.LG 版本更新

Video-Rate Streaming Stylization on a Vision-Aware MLLM-Conditioned Edit Diffusion: Asymmetric Batched Inference on a Distilled UNet + MLLM Text Encoder

基于视觉感知的多模态大语言模型条件编辑扩散的视频率流式风格化：蒸馏UNet + MLLM文本编码器上的非对称批处理推理

Yoshiyuki Ootani

发表机构 * Independent researcher（独立研究员）

AI总结针对蒸馏扩散模型中文本编码器成为瓶颈的问题，提出一种结合非对称CUDA流水线、编译友好的ControlNet-LLLite重构和周期性条件刷新调度的流式管线，在消费级GPU上实现视频率实时风格化编辑。

Comments 12 pages, 4 figures, 12 tables. Under review at IEEE Transactions on Circuits and Systems for Video Technology. Code, evaluation harness, and the released v3 Temporal LLLite adapter weights are at https://github.com/otanl/dreamlite-stream (also mirrored to Hugging Face and Zenodo)

详情

AI中文摘要

扩散U-Net的激进蒸馏反转了实时文本到图像流水线的逐帧瓶颈：一旦去噪器成为4步或1步蒸馏的学生模型，文本编码器就成为关键路径。这种反转在视觉感知编辑扩散中最为严重，其中编码器是多模态大语言模型（MLLM）。我们研究了一个0.39B蒸馏编辑U-Net与2.13B MLLM文本编码器（Qwen3-VL）配对的情况，并提出了一种针对该场景的流式管线，该管线围绕三种工程机制构建：非对称侧流/主流CUDA流水线，带有批处理文本编码器摊销（以及可选的静态提示缓存）；一种编译友好的ControlNet-LLLite重构，将整个U-Net +适配器堆栈折叠成单个融合图；以及一个带有钩子子集的周期性条件刷新调度，用于摊销每帧条件成本。在单个消费级RTX 3090 Ti上，512x512分辨率下，管线在批大小B=8时维持27.4 fps，B=16时维持29.6 fps，端到端p50延迟分别约为0.5和1.0秒；相同操作点在RTX 4090上测得54.9 fps，在RTX 5090上测得74.1 fps。我们报告的是视频率流式吞吐量而非交互式低延迟，并将我们的数据与相同堆栈的StreamDiffusion重运行进行对比，作为系统上下文，而非基准优越性声明。对于训练的油画风格，发布的时序适配器在剪辑内噪声中泛化到19个未使用的DAVIS-2017序列和来自七个来源的15个非DAVIS剪辑；对未见风格族的提示级泛化有限，并单独报告。

英文摘要

Aggressive distillation of the diffusion U-Net inverts the per-frame bottleneck of real-time text-to-image pipelines: once the denoiser is a 4-step or 1-step distilled student, the text encoder becomes the critical path. This inversion is most acute in vision-aware edit diffusion, where the encoder is a multimodal large language model (MLLM). We study the case of a 0.39B distilled edit U-Net paired with a 2.13B MLLM text encoder (Qwen3-VL) and present a streaming pipeline targeted at this regime built around three engineering mechanisms: asymmetric side-stream / main-stream CUDA pipelining with batched text-encoder amortisation (and optional static-prompt caching), a compile-friendly ControlNet-LLLite reformulation that folds the entire U-Net + adapter stack into a single fused graph, and a periodic conditioning-refresh schedule with a hook subset that amortises the per-frame conditioning cost. On a single consumer RTX 3090 Ti at 512x512 the pipeline sustains 27.4 fps over a 480-frame run at batch size B=8 and 29.6 fps at B=16, with end-to-end p50 latency of approximately 0.5 and 1.0 seconds respectively; the same operating point measures 54.9 fps on RTX 4090 and 74.1 fps on RTX 5090. We report video-rate streaming throughput rather than interactive low latency, and locate our numbers against same-stack StreamDiffusion re-runs as systems context, not as a benchmark superiority claim. For the trained oil-painting style, the released temporal adapter generalises within in-clip noise to 19 unused DAVIS-2017 sequences and 15 non-DAVIS clips from seven sources; prompt-level generalisation to unseen style families is bounded and reported separately.
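
The distinction the abstract draws between video-rate throughput and interactive latency can be checked with simple arithmetic on the reported RTX 3090 Ti numbers. The sketch below assumes frames are processed in full batches at the sustained rate, which the abstract implies but does not state, and the helper names are hypothetical.

```python
def batch_period_s(fps: float, batch_size: int) -> float:
    """Seconds between completed batches at a sustained frame rate."""
    return batch_size / fps


def run_duration_s(fps: float, num_frames: int) -> float:
    """Wall-clock seconds needed to emit `num_frames` at a sustained rate."""
    return num_frames / fps


# Reported operating points on a single RTX 3090 Ti at 512x512.
period_b8 = batch_period_s(fps=27.4, batch_size=8)    # about 0.29 s per batch
period_b16 = batch_period_s(fps=29.6, batch_size=16)  # about 0.54 s per batch
duration_480 = run_duration_s(fps=27.4, num_frames=480)
```

Doubling the batch size raises throughput only slightly (27.4 to 29.6 fps) while the time per batch nearly doubles, which matches the reported p50 latency growing from roughly 0.5 s to roughly 1.0 s.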

2606.05975 2026-06-05 cs.CV cs.RO 版本更新

揭示未知：基于场景图的开放词汇目标检测

Yi Chen, Yinghao Lu, Zhehao Li, Chenchen Yan, Jiafei Wu, Chong Wang, Jiangbo Qian

发表机构 * Faculty of Electrical Engineering and Computer Science, Ningbo University（宁波大学电气工程与计算机科学学院）； Faculty of Computing, Georg-August-Universität Göttingen（哥廷根大学计算机学院）； Merchants’ Guild Economics and Cultural Intelligent Computing Laboratory, Ningbo University（宁波大学商帮经济与文化智能计算实验室）； School of Software Technology, Zhejiang University（浙江大学软件学院）

AI总结提出场景引导的关系建模检测框架，利用场景图捕获候选区域与上下文对象之间的结构化语义和空间关系，并通过关系注意力模块和场景文本对齐分支增强开放词汇目标检测性能。

详情

AI中文摘要

开放词汇目标检测旨在识别训练数据中未出现的新目标类别。许多基于知识蒸馏的方法通过将预训练视觉-语言模型的知识迁移到目标检测中，展现了有前景的性能。然而，这些方法往往忽略了对象之间结构化的、图像特定的关系，例如交互和空间布局。这种忽视可能严重限制检测新类别的有效性。为解决这一问题，我们提出了一种场景引导的关系建模检测框架。该框架利用场景图捕获候选区域与其上下文对象之间的结构化语义和空间关系。它显式建模相邻区域之间的交互，并引入关系注意力模块隐式增强从场景图中提取的关键关系线索。此外，我们提出了一种基于场景的文本对齐分支，从字幕中蒸馏类别知识以指导关系对齐。该方法促进了视觉关系与语义信息的无缝集成，从而提升检测性能。大量实验表明，我们的模型在COCO和LVIS数据集上对新类别的AP优于其他OVOD方法。

英文摘要

Open-vocabulary object detection seeks to identify novel object categories that were not part of the training data. Many knowledge distillation-based approaches have shown promising performance by transferring knowledge from pre-trained vision-language models to object detection. However, these methods often overlook structured, image-specific relationships between objects, such as interactions and spatial arrangements. This oversight can significantly restrict the effectiveness of detecting novel categories. To address this issue, we propose a Scene-guided Relational Modeling detection framework. This framework utilizes scene graphs to capture structured semantic and spatial relationships between candidate regions and their contextual objects. It explicitly models interactions among neighboring regions and incorporates a Relation Attention Module to implicitly amplify the key relational cues extracted from the scene graph. Furthermore, we present a scene-based textual alignment branch that distills category knowledge from captions to guide relational alignment. This approach facilitates a seamless integration of visual relations with semantic information for enhanced detection performance. Comprehensive experiments show that our model achieves superior performance compared to other OVOD methods, improving the AP for novel categories on COCO and LVIS datasets.

2606.05915 2026-06-05 cs.CV 版本更新

CamFlow+: Hybrid Motion Bases for 2D Camera Motion Estimation with Stabilization Applications

CamFlow+: 用于二维相机运动估计的混合运动基及其稳定应用

Haipeng Li, Zhen Liu, Zhanglei Yang, Hai Jiang, Tianhao Zhou, Zhengzhe Liu, Ping Tan, Bing Zeng, Shuaicheng Liu

发表机构 * School of Information and Communication Engineering, University of Electronic Science and Technology of China（电子科技大学信息与通信工程学院）； University of Electronic Science and Technology of China（电子科技大学）； School of Aeronautics and Astronautics, Sichuan University（四川大学航空宇航学院）； YingCai Honors College, University of Electronic Science and Technology of China（电子科技大学英才实验学院）； Lingnan University（岭南大学）； Hong Kong University of Science and Technology and Shenzhen Loop Area Institute（香港科技大学及深圳河套学院）

AI总结提出CamFlow+混合基框架，通过结合单应性物理基、随机基和深度平移基在稠密光流空间中直接估计二维相机运动，并引入深度感知平滑项，有效处理平移、深度变化和局部视差，在相机运动估计和视频稳定任务中取得最优效果。

详情

AI中文摘要

估计二维相机运动是计算机视觉和计算摄影的基础。现有的基于单应性的方法在平面场景或纯旋转情况下效果良好，但在相机平移、深度变化和局部视差方面表现不佳；局部单应性和网格模型提高了灵活性，但仍依赖于分片平面假设。我们提出CamFlow+，一个混合基框架，直接在稠密光流空间中表示二维相机运动。CamFlow+结合了单应性导出的物理基、从单应性流中采样的随机基以及从深度和相机内参导出的深度平移基，在保持相机运动规律的同时放松了单平面约束。一个深度感知平滑项进一步在连续深度区域正则化平移引起的视差，同时保留深度边界附近的运动变化。我们在GHOF-Cam上评估CamFlow+，这是一个相机运动基准，通过掩蔽光流基准中的动态对象和不适定遮挡区域来隔离相机引起的运动。实验表明，CamFlow+改进了稀疏和稠密相机运动估计。在数字视频稳定中，CamFlow+还提高了全局和局部稳定性，在盲用户研究中实现了最佳top-1偏好率。代码和数据集将在项目页面上提供：https://lhaippp.github.io/CamFlow+。

英文摘要

Estimating 2D camera motion is fundamental to computer vision and computational photography. Existing homography-based methods work well for planar scenes or pure rotation, but struggle with camera translation, depth variation, and local parallax; local homography and mesh-based models improve flexibility but still rely on piecewise planar assumptions. We introduce CamFlow+, a hybrid-basis framework that represents 2D camera motion directly in dense-flow space. CamFlow+ combines homography-derived physical bases, stochastic bases sampled from homography flows, and depth-translational bases derived from depth and camera intrinsics, relaxing the single-plane constraint while preserving camera-motion regularity. A depth-aware smoothness term further regularizes translation-induced parallax in continuous-depth regions while preserving motion changes near depth boundaries. We evaluate CamFlow+ on GHOF-Cam, a camera-motion benchmark that masks out dynamic objects and ill-posed occlusion regions in an optical-flow benchmark to isolate camera-induced motion. Experiments show that CamFlow+ improves sparse and dense camera-motion estimation. In digital video stabilization, CamFlow+ also improves global and local stability, achieving the best top-1 preference rate in a blind user study. Code and datasets will be available on the project page: https://lhaippp.github.io/CamFlow+.

2606.05912 2026-06-05 cs.CV 版本更新

Self-Learning Expression Deformations for Data-Efficient Gaussian Avatars

自学习表情形变用于数据高效的高斯化身

Jiahao Yang, Xiaohang Yang, Qing Wang, Yilan Dong, Gregory Slabaugh, Shanxin Yuan

发表机构 * Queen Mary University of London（伦敦大学玛丽女王学院）

AI总结提出自适应高斯表情框架，通过自监督学习表情驱动的形变，结合2D高斯面元和符号距离场，实现从极少量输入数据（单帧、单目或单张图像）重建高保真可动画化身。

详情

AI中文摘要

使用3D高斯表示建模动态面部表情由于其非结构化特性仍然具有挑战性。传统的高斯化身流程需要大量的多视角和序列表情数据，限制了可扩展性和可访问性。在这项工作中，我们引入了自适应性高斯表情（SAGE），一个自学习表情诱导的高斯形变框架，能够从最小输入数据中实现高保真、可动画的化身。我们的方法联合优化2D高斯面元和符号距离场（SDF）以强制实现紧凑的、表面对齐的高斯分布，同时一个自监督的表情学习阶段用几何和外观一致性约束取代了长时间的训练序列。这种设计允许在多种重建场景下灵活部署：在多视角设置中，仅需单帧（时间步）而非数千帧；在单目设置中，仅需头部旋转而无需表情序列；在单次设置中，无需预训练或先验。实验表明，我们的方法在重建和动画质量上与最先进方法相当，同时将数据需求降低了几个数量级。我们的结果突显了自监督高斯形变学习作为迈向可访问、数据高效化身创建的一步的潜力。

英文摘要

Modeling dynamic facial expressions using 3D Gaussian representations remains challenging due to their unstructured nature. Conventional Gaussian avatar pipelines require extensive multiview and sequential expression data, limiting scalability and accessibility. In this work, we introduce Self-Adaptive Gaussian Expression (SAGE), a framework for self-learning expression-induced Gaussian deformations that enables high-fidelity, animatable avatars from minimal input data. Our method jointly optimizes 2D Gaussian surfels and a Signed Distance Field (SDF) to enforce compact, surface-aligned Gaussian distributions, while a self-supervised expression learning phase replaces long training sequences with geometric and appearance consistency constraints. This design allows flexible deployment across multiple reconstruction regimes: in the multiview setting, only a single frame (timestep) is required instead of thousands; in the monocular setting, only head rotations are needed without expression sequences; and in the one-shot setting, no pretraining or priors are necessary. Experiments demonstrate that our approach achieves reconstruction and animation quality comparable to state-of-the-art methods, while reducing data requirements by several orders of magnitude. Our results highlight the potential of self-supervised Gaussian deformation learning as a step toward accessible, data-efficient avatar creation.

2606.05896 2026-06-05 cs.CV 版本更新

Resonant Minds: Closed-Loop Social Avatars with Theory of Mind

共鸣心智：具备心智理论的闭环社交虚拟人

Jianxu Shangguan, Jing Xu, Hang Ye, Xiaoxuan Ma, Yizhou Wang, Wentao Zhu

发表机构 * University of Washington（华盛顿大学）； Peking University（北京大学）； Carnegie Mellon University（卡内基梅隆大学）； Eastern Institute of Technology, Ningbo（宁波东方理工大学）

AI总结提出一个闭环双智能体框架，通过整合感知、社会推理（基于心智理论）和多模态生成，实现具备社交智能的虚拟人，并在信息不对称数据集上取得优于全信息脚本模式的对话质量。

详情

AI中文摘要

创建具有真正社交智能的逼真数字人需要将认知推理和多模态生成统一在一个连贯的框架内。当前的方法将这些视为独立的任务：大型语言模型擅长对话但缺乏具身表达，而基于扩散的说话头模型实现了视觉保真度但忽略了社会认知。为了弥合这一差距，我们提出了一个闭环双智能体框架，将感知、社会推理和表达整合到一个连续的交互循环中。感知模块从视频中分析伙伴的多模态行为，而社会推理模块通过心智理论推断隐藏的心理状态，并通过集成机制选择响应。然后，表达模块生成情感可控的双智能体视频，合成说话者的言语和表情以及听者的反应行为，捕捉先前工作中缺失的双向动态。我们构建了一个分层的角色-场景数据集，包含基于心理学的角色和私人社交目标，以支持信息不对称下的评估。在该数据集上的实验表明，在对话质量和视频生成指标上均具有竞争性或优越的性能。值得注意的是，我们的方法在关键对话质量维度上甚至超过了全信息脚本模式，这表明在不确定性下显式的心理状态推断可以比无限制的信息访问引发更周到的对话。

英文摘要

Creating lifelike digital humans with genuine social intelligence requires unifying cognitive reasoning and multimodal generation within a coherent framework. Current approaches treat these as separate tasks: Large Language Models excel at dialogue but lack embodied expression, while diffusion-based talking head models achieve visual fidelity but ignore social cognition. To bridge this gap, we propose a closed-loop dual-agent framework integrating perception, social reasoning, and expression into a continuous interaction cycle. The perception module analyzes partners' multimodal behaviors from video, while the social reasoning module infers hidden mental states through Theory of Mind and selects responses via an ensemble mechanism. The expression module then generates emotion-controllable dual-agent videos synthesizing both speaker speech and expression alongside listener reactive behaviors, capturing bidirectional dynamics absent in prior work. We construct a hierarchical Persona-Scenario dataset with psychologically grounded personas and private social goals to support evaluation under information asymmetry. Experiments on this dataset demonstrate competitive or superior performance on both dialogue quality and video generation metrics. Notably, our method surpasses even the full-information Script mode on key dialogue quality dimensions, suggesting that explicit mental state inference under uncertainty can elicit more thoughtful dialogue than unrestricted information access.

2606.05873 2026-06-05 cs.RO cs.AI cs.CV cs.LG 版本更新

LadderMan: Learning Humanoid Perceptive Ladder Climbing

LadderMan: 学习人形机器人感知爬梯

Siheng Zhao, Yuanhang Zhang, Ziqi Lu, Pieter Abbeel, Rocky Duan, Koushil Sreenath, Yue Wang, C. Karen Liu, Guanya Shi

发表机构 * Amazon FAR（亚马逊FAR）； USC（美国南加州大学）； UC Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）； CMU（卡内基梅隆大学）

AI总结提出LadderMan系统，通过两阶段学习管道和视觉基础模型，使人形机器人能够鲁棒地攀爬多种梯子并在梯子上进行操控。

详情

AI中文摘要

人形机器人在以人为中心的环境中具有巨大潜力，但由于稀疏的立足点和手抓点、复杂的全身协调以及对感知和控制误差的敏感性，爬梯仍然是最具挑战性的任务之一。我们提出了LadderMan，一个统一的系统，使人形机器人能够鲁棒地攀爬多种梯子并在这种受限条件下进行操控。我们的攀爬策略基于一个可扩展的两阶段学习管道，其中我们使用混合运动跟踪从单个参考运动学习多个攀爬专家，并通过混合模仿和强化学习将这些专家蒸馏成一个统一的基于深度视觉的运动攀爬策略。为了实现真实世界部署，我们利用视觉基础模型来弥合深度感知中的模拟到现实差距。基于学习到的攀爬策略，我们进一步使用双智能体公式训练一个独立的操控策略，允许通过遥操作在梯子上进行稳定操控。实验表明，LadderMan在多种几何形状的梯子上实现了鲁棒的攀爬，以零样本方式成功迁移到真实世界硬件，并在具有挑战性的梯子约束下支持各种操控任务。视频结果见https://ladderman-robot.github.io。

英文摘要

Humanoid robots hold great promise for operating in human-centered environments, yet ladder climbing remains one of the most challenging tasks due to sparse footholds and handholds, complex whole-body coordination, and sensitivity to perception and control errors. We present LadderMan, a unified system that enables humanoid robots to robustly climb diverse ladders and perform manipulation under such constrained conditions. Our climbing policy is built on a scalable two-stage learning pipeline, where we use hybrid motion tracking to learn multiple climbing experts from a single reference motion, and distill these experts into a unified depth-based visuomotor climbing policy via hybrid imitation and reinforcement learning. To enable real-world deployment, we leverage vision foundation models to bridge the sim-to-real gap in depth perception. Building on the learned climbing policy, we further train a separate manipulation policy using a dual-agent formulation, allowing stable on-ladder manipulation via teleoperation. Experiments demonstrate that LadderMan achieves robust ladder climbing across a wide range of geometries, successfully transfers to real-world hardware in a zero-shot manner, and supports various manipulation tasks under challenging ladder constraints. Video results are available at https://ladderman-robot.github.io.

2606.05849 2026-06-05 physics.optics cs.CV 版本更新

Inverse Design of Realizable Metasurface based Absorbers using Improved Conditioning and Diversity Enhanced Progressively Growing GANs

利用改进的条件化和多样性增强的渐进式生长GAN实现可实现的超表面吸收体的逆向设计

Vineetha Joy, Mohammad Abdullah, Pramit Pal, Anshuman Kumar, Amit Sethi, Hema Singh

发表机构 * Centre for Electromagnetics, CSIR-National Aerospace Laboratories（CSIR-国家航空航天实验室电磁学中心）； Birla Institute of Technology and Science, Pilani, Rajasthan（博拉理工学院皮拉尼校区，拉贾斯坦邦）； Indian Institute of Technology, Bombay, Maharashtra（印度理工学院孟买分校，马哈拉施特拉邦）

AI总结提出一种基于渐进式生长WGAN-GP与特征线性调制条件化的生成式逆向设计框架，结合替代辅助光谱对齐损失和行列式点过程多样性正则化，实现连续光谱约束下物理一致且多样化的超表面吸收体设计。

详情

AI中文摘要

超表面能够精确操控电磁波，用于波束转向、传感和隐身技术等应用。然而，由于迭代全波仿真驱动优化的计算成本高昂，以及现有生成方法在条件保真度和多样性方面的限制，具有目标电磁响应的超表面的逆向设计仍然具有挑战性。为了解决这些问题，本文提出了一种生成式逆向设计框架，用于在连续光谱约束下实现可控且物理一致的超表面合成。该方法采用渐进式生长Wasserstein生成对抗网络，结合梯度惩罚和基于特征线性调制的条件化，以实现连续光谱和制造约束的稳定传播。通过替代辅助光谱对齐损失，将电磁一致性直接嵌入生成学习过程，从而在训练期间实现物理约束生成。此外，引入基于行列式点过程的多样性正则化策略，以生成几何多样但光谱一致的实现，对应同一目标响应。通过在2至18 GHz频率范围内生成具有不同反射特性的实际可实现的超表面吸收体，证明了所提框架的有效性。电磁仿真验证了生成的设计以高精度满足目标规格。最终提出的框架实现了平均均方误差0.0052、多样性分数0.8730、波段对齐精度0.8533以及有效电磁设计生成百分比89.57，清晰展示了其生成高精度、多样化、电磁一致且可制造的超表面配置的能力。

英文摘要

Metasurfaces enable precise manipulation of electromagnetic (EM) waves for applications such as beam steering, sensing, and stealth technology. However, inverse design of metasurfaces with targeted EM responses remains challenging due to the computational expense of iterative optimization driven by full-wave simulation and the limited conditioning fidelity and diversity of existing generative approaches. To address these challenges, this paper presents a generative inverse design framework for controllable and physically consistent metasurface synthesis under continuous spectral constraints. The proposed approach employs a progressively growing Wasserstein generative adversarial network with gradient penalty, integrated with feature-wise linear modulation-based conditioning for stable propagation of continuous spectral and fabrication constraints. EM consistency is embedded directly into the generative learning process through a surrogate-assisted spectral alignment loss, enabling physics-constrained generation during training. Further, a determinantal point process-based diversity regularization strategy is incorporated to generate geometrically diverse yet spectrally consistent realizations for the same target response. The effectiveness of the proposed framework is demonstrated through the generation of practically realizable metasurface absorbers exhibiting diverse reflection characteristics in the frequency range of 2 to 18 GHz. EM simulations validate that the generated designs meet the target specifications with high accuracy. The final proposed framework achieved an average mean squared error of 0.0052, a diversity score of 0.8730, a band alignment accuracy of 0.8533, and a valid EM design generation percentage of 89.57, demonstrating its capability to generate highly accurate, diverse, electromagnetically consistent, and fabrication-realizable metasurface configurations.
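
Feature-wise linear modulation, the conditioning mechanism named in the abstract, scales and shifts each feature channel using values predicted from the condition. The sketch below shows only that generic operation on nested lists. The linear maps that produce the scale and shift, the example weights, and the two-element condition vector are illustrative assumptions and do not reflect the authors' network.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def linear(weights: Matrix, bias: Vector, x: Vector) -> Vector:
    """Affine map: one output per row of `weights`."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def film(features: Matrix, condition: Vector,
         w_gamma: Matrix, b_gamma: Vector,
         w_beta: Matrix, b_beta: Vector) -> Matrix:
    """Apply gamma_c * x + beta_c to every value in channel c.

    `features` holds one list of activations per channel; gamma and beta
    are predicted from the condition (e.g. samples of a target spectrum).
    """
    gamma = linear(w_gamma, b_gamma, condition)
    beta = linear(w_beta, b_beta, condition)
    return [[g * x + b for x in channel]
            for channel, g, b in zip(features, gamma, beta)]


features = [[1.0, 2.0], [3.0, 4.0]]   # two channels, two activations each
condition = [0.5, 1.0]                # hypothetical target-response samples
modulated = film(
    features, condition,
    w_gamma=[[2.0, 0.0], [0.0, 1.0]], b_gamma=[0.0, 0.0],
    w_beta=[[0.0, 1.0], [1.0, 0.0]], b_beta=[0.0, 0.5],
)
```

Because the condition enters as a per-channel scale and shift at each layer, a continuous-valued target such as a reflection spectrum can influence every resolution of a progressively grown generator without being concatenated to the input.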

2606.05833 2026-06-05 cs.CV cs.AI 版本更新

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

从视频中学习几何表示以实现空间智能多模态大语言模型

Haibo Wang, Lifu Huang

发表机构 * University of California, Davis（加州大学戴维斯分校）

AI总结提出GeoVR框架，通过从2D视频序列中蒸馏3D几何知识（包括相机姿态、深度图、尺度因子和多尺度3D特征），重塑多模态大语言模型的内部表示以赋予其空间智能，在空间推理基准上达到最先进性能。

详情

AI中文摘要

多模态大语言模型（MLLMs）在2D语义理解方面表现出色，但缺乏内在的3D感知能力，导致其表示无法在视频帧间保持几何和空间一致性。鉴于大规模3D数据的稀缺性，我们提出了GeoVR，一种新颖的框架，仅使用2D视频序列学习几何表示。该方法有效地重构了MLLMs内部的语义潜在空间，以解锁空间智能。GeoVR并非采用浅层的特征混合，而是通过从预训练的3D基础模型中蒸馏几何知识来重塑MLLM的内部表示。这是通过一种多目标学习策略实现的，该策略由四个互补的几何目标驱动：（1）估计帧间相机姿态以嵌入变化的视角动态，（2）回归密集深度图以锚定物理距离，（3）预测度量尺度因子以进行真实世界校准，以及（4）蒸馏多尺度3D特征以对齐中间特征空间。在这些显式的物理和几何约束的引导下，模型的内部表示自然地发展出强大的3D感知能力。在空间推理基准上的大量实验表明，GeoVR实现了最先进的性能，为赋予基础模型空间智能建立了一种新范式。

英文摘要

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.


2606.05829 2026-06-05 cs.CV 版本更新

Gender Artifacts from Art History to Text-to-Image Generation

从艺术史到文本到图像生成中的性别伪影

Piera Riccio, Miriam Doh, Benedikt Höltgen, Noa Garcia, Nanne van Noord

发表机构 * University of Amsterdam（阿姆斯特丹大学）； Université Libre de Bruxelles（布鲁塞尔自由大学）； Hasso Plattner Institut, University of Potsdam（波茨坦大学哈索·普拉特纳研究所）； The University of Osaka（大阪大学）

AI总结通过提出性别伪影度量（PixelSGA和MaskSGA），研究了艺术风格中性别表征与视觉特征的关系，并发现文本到图像生成模型会放大历史来源中的性别伪影。


AI中文摘要

艺术风格植根于特定的社会历史背景，这些背景编码了社会等级，包括不同的性别建构。然而，在人工智能研究中，风格长期以来被视为一种表面层次的视觉属性：一种应用于内容中性场景的颜色、笔触和纹理的滤镜。我们引入了第一个数据集来研究历史图像和生成图像中性别表征与风格之间的相互作用。StyleGender包含跨越19种艺术风格的74k张图像，包括带有风格和性别注释的艺术历史图像、在受控风格和性别提示下由T2I生成的图像，以及一个语义对齐集，使得可以直接比较艺术史与生成结果。通过提出两种集合性别伪影（SGA）度量（PixelSGA和MaskSGA），在像素级别和构图结构中捕捉性别信号，我们展示了：(1) 性别表征塑造了不同艺术风格的视觉特征，(2) 风格关键词将这些模式带入T2I生成中，(3) 生成模型倾向于放大历史来源中观察到的性别伪影。

英文摘要

Artistic styles are rooted in specific socio-historical contexts that encode social hierarchies, including distinct constructions of gender. Yet in AI research, style has long been treated as a surface-level visual property: a filter of color, brushstroke, and texture applied to otherwise content-neutral scenes. We introduce the first dataset to investigate the interplay between gender representation and style in both historical and generated images. StyleGender comprises 74k images spanning 19 artistic styles, comprising art historical images with style and gender annotations, T2I-generated images under controlled style and gender prompts, and a semantically aligned set enabling direct art history-to-generation comparison. By proposing two Set Gender Artifact (SGA) metrics (PixelSGA and MaskSGA), capturing gender signals at the pixel level and in compositional structure, we show that (1) gender representation shapes visual features across artistic styles, (2) style keywords carry these patterns into T2I generation, and (3) generative models tend to amplify gender artifacts beyond what is observed in historical sources.


2606.05785 2026-06-05 cs.CV cs.AI cs.LG 版本更新

Next-Generation Parallel Decoder for LPDR: Architectural Optimization and Class-Balanced GAN-Augmentation

下一代LPDR并行解码器：架构优化与类别平衡的GAN增强

Shawaiz Obaid, Nida Chandio, Neha Jamil, Muhammad Khuram Shahzad

发表机构 * School of Electrical Engineering & Computer Science, National University of Sciences & Technology, Islamabad, Pakistan（巴基斯坦国立科技大学电气工程与计算机科学学院，伊斯兰堡）

AI总结针对车牌检测与识别中的空间字符不匹配和数据不平衡问题，提出交叉空间混合注意力和类别平衡合成增强方法，将少数省份车牌识别率从78.2%提升至91.5%，同时保持152 FPS的实时处理性能。

Comments 8 pages, 7 figures


AI中文摘要

实时车牌检测与识别（LPDR）是现代智慧城市的基石。尽管YOLOV5-PDLPR模型通过并行解码器方法显著提高了系统效率，但其性能仍受训练集中空间字符不匹配和数据不平衡的影响。本文通过引入交叉空间混合注意力（CSHA）和类别平衡合成增强（CBSA）来解决这些局限性。进行了涉及75,000个合成样本的广泛研究，并在四个基准数据集（CCPD、CLPD、PKU和一个应用特定数据集）上进行了评估。实验结果表明，少数省份车牌识别率从78.2%大幅提升至91.5%，同时保持152 FPS的实时处理性能。结果表明，结合空间感知并行解码与类别平衡增强为高速车牌识别系统提供了有效解决方案。

英文摘要

Real-Time License Plate Detection and Recognition (LPDR) forms the backbone of modern smart cities. Although the YOLOV5-PDLPR model substantially improved system efficiency through a parallel decoder approach, its performance is still affected by spatial character mismatches and data imbalance within the training set. This paper addresses these limitations by introducing Cross-Spatial Hybrid Attention (CSHA) and Class-Balanced Synthetic Augmentation (CBSA). An extensive study involving 75,000 synthetic samples is conducted and evaluated on four benchmarks: CCPD, CLPD, PKU, and an application-specific dataset. Experimental results demonstrate a substantial improvement in the recognition rate of minority provincial license plates from 78.2% to 91.5% while maintaining real-time processing performance of 152 FPS. The results indicate that spatially-aware parallel decoding combined with class-balanced augmentation provides an effective solution for high-speed license plate recognition systems.


2606.05778 2026-06-05 cs.CV 版本更新

Beyond Absolute Scores: Relative Edit-induced Difference for Generalizable Image Aesthetic Assessment

超越绝对分数：基于编辑诱导差异的通用图像美学评估

Qifei Jia, Xintong Yao, Minghao Li, Yajie Chai, Qiming Lu, Baoyue Shen, Yasen Zhang, Runyu Shi, Ying Huang, Yue Zhang

发表机构 * Xiaomi Corporation, Beijing, China（小米公司，北京，中国）

AI总结提出RED-Aes框架，利用可控图像编辑模型模拟人类审美推理，通过相对编辑诱导差异学习通用美学原则，实现跨场景泛化。


AI中文摘要

传统的图像美学评估（IAA）方法主要依赖于回归绝对平均意见分数（MOS）。然而，这种范式忽视了人类审美感知固有的动态性质，这种感知依赖于对隐含视觉参考的无意识比较。因此，缺乏对美学差异的因果推理使得模型无法学习通用的美学原则，从而限制了它们在多样化场景中的泛化能力。在这项工作中，我们重新思考IAA任务，并提出相对编辑诱导差异美学学习（RED-Aes），一种新颖的框架，利用可控图像编辑模型模拟人类审美推理过程。RED-Aes不拟合绝对分数分布，而是显式学习驱动美学变化的视觉因素。为了支持这一范式，我们构建了RED-20k数据集，包含基于编辑的图像对、定量美学差异和思维链（CoT）推理。此外，我们引入了一种由相对排序一致性奖励引导的三阶段训练策略，仅通过相对监督优化模型。大量实验表明，RED-Aes在多个公共基准上取得了最先进的性能，展现出优越的泛化能力。

英文摘要

Traditional Image Aesthetic Assessment (IAA) methods mainly rely on regressing absolute Mean Opinion Scores (MOS). However, such a paradigm overlooks the inherently dynamic nature of human aesthetic perception, which relies on subconscious comparison against implicit visual references. Consequently, the lack of causal reasoning regarding aesthetic differences prevents models from learning generalizable aesthetic principles, thus limiting their generalization across diverse scenarios. In this work, we rethink the IAA task and propose Relative Edit-induced Difference Aesthetic learning (RED-Aes), a novel framework that leverages controllable image editing models to simulate the human aesthetic reasoning process. Instead of fitting absolute score distributions, RED-Aes explicitly learns the visual factors that drive aesthetic changes. To support this paradigm, we construct the RED-20k dataset, which comprises editing-based image pairs, quantitative aesthetic differences, and Chain-of-Thought (CoT) reasoning. Furthermore, we introduce a three-stage training strategy guided by a relative ranking consistency reward, optimizing the model solely via relative supervision. Extensive experiments demonstrate that RED-Aes achieves state-of-the-art performance on multiple public benchmarks, exhibiting superior generalization capabilities.
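To make the idea of purely relative supervision concrete, here is a minimal sketch of a ranking-consistency reward for one edited image pair. It assumes the reward is 1 when the predicted ordering of the two images agrees with the sign of the labeled aesthetic difference and 0 otherwise. The abstract does not define the reward, so the function name, its signature, and the `tie_margin` parameter are hypothetical.

```python
def ranking_consistency_reward(pred_a, pred_b, label_diff, tie_margin=0.0):
    """Return 1.0 if the predicted ordering matches the labeled ordering.

    pred_a, pred_b: model scores for the original and the edited image.
    label_diff:     labeled aesthetic difference (edited minus original).
    tie_margin:     differences at or below this magnitude count as ties
                    (an illustrative assumption).
    """
    pred_diff = pred_b - pred_a
    if abs(label_diff) <= tie_margin:
        # Labeled tie: reward only a predicted tie.
        return 1.0 if abs(pred_diff) <= tie_margin else 0.0
    # Otherwise the two differences must share a sign.
    return 1.0 if pred_diff * label_diff > 0 else 0.0


agrees = ranking_consistency_reward(3.0, 4.0, label_diff=0.5)
disagrees = ranking_consistency_reward(4.0, 3.0, label_diff=0.5)
```

The reward depends only on the ordering of the two scores, never on their absolute values, which is the property that distinguishes this setup from MOS regression.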


2606.05769 2026-06-05 cs.CV 版本更新

Imagine Before You Predict: Interleaved Latent Visual Reasoning for Video Event Prediction

在预测之前想象：用于视频事件预测的交错潜在视觉推理

Tianxiang Jiang, Linquan Wu, Sheng Xia, Songze Li, Ziang Yan, Haoyu Yang, Yu Qiao, Yi Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）； Shanghai AI Laboratory（上海人工智能实验室）； City University of Hong Kong（香港城市大学）； Nanjing University（南京大学）； Fudan University（复旦大学）； Zhejiang University（浙江大学）； University of Electronic Science and Technology of China（电子科技大学）

AI总结提出Future-L1框架，通过交错潜在视觉推理在自回归解码中交替语言token和连续潜在视觉跨度，结合LA-DAPO强化学习优化，在视频事件预测任务上取得最先进结果。

Comments https://github.com/OpenGVLab/Future-L1


AI中文摘要

视频事件预测（VEP）要求模型从部分视频证据中推断未观察到的未来状态。现有的视频多模态大语言模型（MLLMs）通常在文本空间中将中间未来推理进行语言化：一旦视觉证据被语言化，细粒度的运动、几何和交互线索可能会丢失，导致看似合理但视觉上无根据的幻觉。我们引入了Future-L1，一种交错潜在视觉推理框架，允许MLLM在自回归解码过程中在语言token和连续潜在视觉跨度之间交替。为了训练这种能力，我们通过选择未来视觉提示有助于预测的示例，并将潜在状态与未来帧嵌入对齐，构建了Future-L1-50K数据集，然后使用LA-DAPO（一种具有结果对比和时间多样性奖励的潜在感知RL目标）进一步优化采样的潜在轨迹。Future-L1在两个基准测试上均取得了新的最先进结果：在FutureBench上，它将Qwen3-VL-8B从61.0提升至85.4，并超过之前最佳Video-CoE 10.4分；在TwiFF-Bench上，它将平均得分从2.44提升至3.04。这些结果表明，面向未来的视频推理受益于在潜在空间中保留中间视觉语义，而不是将每个推理步骤都转换为文本。

英文摘要

Video event prediction (VEP) requires models to infer unobserved future states from partial video evidence. Existing video MLLMs usually verbalize intermediate future reasoning in text space: once visual evidence is verbalized, fine-grained motion, geometry, and interaction cues can be lost, leading to plausible but visually ungrounded hallucinations. We introduce Future-L1, an interleaved latent visual reasoning framework that lets an MLLM alternate between language tokens and continuous latent visual spans during autoregressive decoding. To train this capability, we construct Future-L1-50K by selecting examples where future visual hints help prediction and align latent states to future-frame embeddings, then further optimize sampled latent trajectories with LA-DAPO, a latent-aware RL objective with outcome-contrastive and temporal-diversity rewards. Future-L1 achieves new state-of-the-art results on both benchmarks: on FutureBench, it improves Qwen3-VL-8B from 61.0 to 85.4 and exceeds the previous best Video-CoE by 10.4 points; on TwiFF-Bench, it improves the average score from 2.44 to 3.04. These results suggest that future-oriented video reasoning benefits from preserving intermediate visual semantics in latent space rather than translating every reasoning step into text.


2606.05760 2026-06-05 cs.CV 版本更新

ExpSpeech-Net: Multimodal Fusion of Expression and Speech for Deepfake Detection

ExpSpeech-Net: 表情与语音的多模态融合用于深度伪造检测

Ruchika Sharma, Rudresh Dwivedi

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）

AI总结提出轻量级ExpSpeech-Net模型，通过融合面部表情和语音模式，利用SqueezeNet和RNN骨干网络及智能特征选择，实现高效深度伪造检测，准确率达94.5%。


AI中文摘要

深度伪造视频日益挑战在线内容的可信度。许多现有检测方法依赖于复杂、资源密集型的模型，限制了其实用性。本研究引入了ExpSpeech-Net深度伪造检测（SqN-R-DFD）模型，该模型以SqueezeNet和RNN（循环神经网络）为骨干，提供了一个轻量级且高效的深度伪造检测框架，能够同时分析面部表情和语音模式。该方法采用了先进的特征提取，例如基于ISLBT的图像特征和用于信号的MPNCC，并结合使用SASMA（鹬辅助黏液霉菌算法）的智能特征选择策略，确保检测模型获得最优且平衡的输入。通过结合SqueezeNet和RNN，有效捕捉深度伪造视频中的细微不一致性。该框架实现了94.5%的准确率、99.3%的精确率和96.8%的F-measure，优于传统方法。这表明，将多种模态与智能预处理和特征选择相结合，能够实现适用于日常应用的实用、实时深度伪造检测。

英文摘要

Deepfake videos are increasingly challenging the credibility of online content. Many existing detection methods rely on complex, resource-intensive models, which limits their practical use. The study introduces the ExpSpeech-Net deepfake detection (SqN-R-DFD) model, which utilizes SqueezeNet and RNN (Recurrent Neural Network) as its backbone, providing a lightweight and efficient deepfake detection framework that simultaneously analyzes facial expressions and speech patterns. The approach incorporates advanced feature extraction, such as ISLBT-based features for image and MPNCC for signals, along with a smart feature-selection strategy using SASMA (Sandpiper-Assisted Slime Mould Algorithm), ensuring optimal and balanced input to the detection models. By combining SqueezeNet and an RNN, subtle inconsistencies in deepfake videos are captured effectively. The framework achieves 94.5% accuracy, precision of 99.3%, and F-measure of 96.8%, outperforming conventional methods. This demonstrates that integrating multiple modalities with intelligent preprocessing and feature selection enables practical, real-time deepfake detection suitable for everyday applications.


2606.05758 2026-06-05 cs.CV cs.AI cs.LG 版本更新

DRIFT: A Residual Flow Adapter for Decoding Continuous Outputs in Vision-Language Models

DRIFT：一种用于视觉-语言模型中连续输出解码的残差流适配器

Zhuoming Liu, Jinhong Lin, Kwan Man Cheng, Lin Zhang, Shayok Bagchi, Yin Li

发表机构 * University of Wisconsin–Madison（威斯康星大学麦迪逊分校）； West Lafayette Jr./Sr. High School（韦斯特拉法叶高中）

AI总结提出DRIFT框架，通过结合基础预测器和基于流匹配的生成式精化模块，将预训练视觉-语言模型适配到连续解码任务，在视觉定位和机器人控制等任务上优于回归和生成方法。


AI中文摘要

许多现代视觉-语言模型（VLM）基于离散标记的自回归解码。虽然基于文本的输出接口支持可扩展的预训练和跨多种任务的强零样本泛化，但它们不适用于需要精确连续输出的问题，例如定位事件的时间边界或生成机器人控制动作。为了解决这一挑战，我们提出了DRIFT，一个用于将预训练VLM适配到连续解码任务的通用框架。DRIFT结合了一个基础预测器（提供目标输出的粗略估计）和一个基于流匹配的生成式精化模块（迭代改进预测）。这种残差公式将生成建模问题从学习全局输出分布转变为在强先验周围建模局部残差分布，大大简化了优化。我们在感知和规划任务上评估了DRIFT，包括视觉定位和机器人控制。在跨越MLLM、VLA和WAM的多个任务和架构中，DRIFT始终优于一组强大的基于回归和生成的方法。

英文摘要

Many modern vision-language models (VLMs) build on autoregressive decoding of discrete tokens. While text-based output interfaces enable scalable pretraining and strong zero-shot generalization across diverse tasks, they are poorly suited for problems that require precise continuous outputs, such as localizing temporal boundaries of events or generating robotic control actions. To address this challenge, we propose DRIFT, a general framework for adapting pretrained VLMs to continuous decoding tasks. DRIFT combines a base predictor, which provides a coarse estimate of the target output, with a generative refinement module based on flow matching that iteratively improves the prediction. This residual formulation transforms the generative modeling problem from learning a global output distribution to modeling a localized residual distribution around a strong prior, substantially simplifying optimization. We evaluate DRIFT on both perception and planning tasks, including visual grounding and robotic control. Across multiple tasks and architectures spanning MLLMs, VLAs, and WAMs, DRIFT consistently outperforms a strong set of regression- and generative-based solutions.
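The residual formulation can be illustrated with a one-dimensional toy. The sketch below assumes the common linear-path flow matching setup (interpolate between a noise sample and the target, regress the constant velocity along that path) and Euler integration at inference. The abstract does not specify these details, so the function names and the `oracle_velocity` stand-in for a trained network are hypothetical.

```python
import random


def flow_matching_pair(residual, rng):
    """Build one training example for a residual target (linear-path assumption).

    Returns the interpolated point x_t, the time t, and the velocity the
    network would be trained to regress.
    """
    x0 = rng.gauss(0.0, 1.0)          # noise sample
    t = rng.random()                  # time in [0, 1)
    x_t = (1.0 - t) * x0 + t * residual
    target_v = residual - x0          # constant velocity along the straight path
    return x_t, t, target_v


def refine(base_pred, velocity, steps=4, x0=0.0):
    """Euler-integrate the velocity field, then add the residual to the base."""
    x = x0
    dt = 1.0 / steps
    for k in range(steps):
        x += dt * velocity(x, k * dt)
    return base_pred + x


TRUE_RESIDUAL = 0.3  # toy value: ground truth minus the base predictor's estimate


def oracle_velocity(x, t):
    """Stand-in for a trained network: exact field for a single-point target."""
    return (TRUE_RESIDUAL - x) / (1.0 - t)


refined = refine(base_pred=2.0, velocity=oracle_velocity)
x_t, t, v = flow_matching_pair(TRUE_RESIDUAL, random.Random(0))
```

Because the generative module only has to model the gap left by the base predictor, its target distribution is concentrated near zero, which is the simplification the abstract attributes to the residual formulation.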


2606.05753 2026-06-05 cs.CV 版本更新

TextWand：场景文本编辑的统一框架

Shuyu Wang, Zhile Guan, Hongxiu Chen, Yule Duan, Weiqi Li, Xin Shan, Ronggang Wang, Jian Zhang

发表机构 * School of Electronic and Computer Engineering, Peking University（电子与计算机工程学院，北京大学）

AI总结提出TextWand统一框架，通过渲染和擦除原子操作分解复杂编辑任务，结合ORPE编码和RAS策略，实现场景文本的移除、生成和替换，并在新基准TextWand-Bench上超越现有模型。


AI中文摘要

我们提出TextWand，一个通用框架，将场景文本移除、生成和替换统一到单个模型中。通过将复杂的编辑任务分解为渲染和擦除的原子原语，TextWand实现了对文本外观和背景完整性的精确控制。具体来说，我们引入了一种新颖的设计——叠加参考位置编码（ORPE），以强制执行像素级布局保真度和示例驱动的风格控制，同时采用一种新策略——区域自适应抑制（RAS），以确保干净的文本擦除。为了解决现有单任务数据集中缺乏通用场景文本编辑综合基准的问题，我们构建了TextWand-Bench。大量实验表明，TextWand在场景文本移除、生成和替换任务中，通过提供更优的文本内容准确性、布局和风格一致性以及整体图像质量，超越了现有的领先开源和闭源模型。

英文摘要

We propose TextWand, a general-purpose framework that unifies scene text removal, generation, and replacement into a single model. By decomposing complex editing tasks into the atomic primitives of rendering and erasure, TextWand achieves precise control over both text appearance and background integrity. Specifically, we introduce a novel design, Overlay-Reference Positional Encoding (ORPE), to enforce pixel-level layout fidelity and exemplar-driven style control, alongside a new strategy, Region-Adaptive Suppression (RAS), to ensure clean text erasure. To address the absence of a comprehensive benchmark for general-purpose scene text editing among existing single-task datasets, we construct TextWand-Bench. Extensive experiments demonstrate that TextWand outperforms existing leading open-source and closed-source models by delivering superior text content accuracy, layout and style consistency, and overall image quality across scene text removal, generation and replacement tasks.


2606.05718 2026-06-05 cs.CV cs.AI cs.LG 版本更新

ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

ViCuR: 视觉线索作为多模态在策略蒸馏中的可恢复特权

Kanghui Tian, Siyuan Liu, Ziang Yan, Sheng Xia, Shuai Dong, Yi Wang

发表机构 * Shanghai AI Laboratory（上海人工智能实验室）； Fudan University（复旦大学）； Nanjing University（南京大学）

AI总结提出ViCuR框架，通过将教师特权从答案侧替换为输入中的视觉线索，并引入轻量级线索恢复模块，解决多模态在策略蒸馏中的训练-测试不匹配问题，在七个基准上显著提升学生模型性能。

Comments 25 pages, 11 figures. Preprint, under review


AI中文摘要

在策略蒸馏（OPD）通过在教师监督下，对学生自身策略采样的轨迹进行训练来改进推理。在多模态推理中，一种常见的扩展是使用特权教师，该教师观察仅在训练时可用的信号，如参考答案或理由。然而，这种答案侧特权造成了训练-测试不匹配：教师的监督可能依赖于学生无法获得的信号，鼓励捷径模仿而非基于视觉的推理。我们提出ViCuR，一种基于视觉的特权教师蒸馏框架，用视觉线索（输入中与查询相关的证据）取代答案侧特权。由于这些线索来源于推理时可用的相同视觉输入，它们的证据可由学生恢复。为此，ViCuR引入了一个轻量级线索恢复模块，在预填充期间使用专用的汇点令牌交叉注意力，将任务相关的视觉证据聚合到内部表示中，而不改变推理接口或需要辅助的线索生成损失。在七个基准上，使用Qwen3-VL-2B和8B学生，ViCuR在总体平均性能上持续优于基于答案的在策略自蒸馏，分别提升+1.19和+1.24。它还能自然地扩展到更强的教师OPD，超越OPD基线+0.64和+1.08，并在8B规模上具有一致的域外增益。这些结果表明，在多模态在策略蒸馏中，教师特权的设计与教师强度同等重要。

英文摘要

On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher's supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.


2606.05708 2026-06-05 cs.CV 版本更新

Real-Time Threat Detection from Surveillance Cameras using Machine Learning

基于机器学习的监控摄像头实时威胁检测

Gajendra Mandal, J. P. Patra, Priyansh Mahant

发表机构 * GitHub

AI总结提出基于YOLOv8的实时目标检测框架，利用自定义钝器数据集与公开枪支刀具数据集训练模型，实现监控场景下枪支、刀具和钝器的有效检测。


AI中文摘要

确保人口密集的城市环境中的公共安全仍然是一个关键挑战，需要部署智能和自动化的视频监控系统。传统的监控方法严重依赖人工监控，效率低下且容易受到人为疲劳、响应延迟和观察错误的影响。为了克服这些限制，本文提出了一种基于实时目标检测的监控框架。该系统专注于检测枪支、刀具以及印度监控场景中常见于暴力活动的区域特定钝器。本文的一个关键贡献是使用移动相机收集的自定义数据集，包含336张标记的钝器图像，如铁棒、木棍和塑料棒。该数据集与公开的7,623张枪支和刀具图像数据集合并，形成包含7,959张图像、三个类别（枪、刀、钝器）的合并数据集。使用该合并数据集训练基于YOLOv8的目标检测模型以实现实时性能。实验评估表明，增加训练时长显著提高了钝器类别的召回率和平均精度，且未出现过拟合迹象。总体而言，所提出的框架在准确性和效率之间取得了有效平衡，使其适用于校园、公共空间和交通区域等真实监控环境中的部署。

英文摘要

Ensuring public safety in densely populated urban environments remains a critical challenge, necessitating the deployment of intelligent and automated video surveillance systems. Traditional surveillance approaches rely heavily on manual monitoring, which is inefficient and susceptible to human fatigue, delayed response, and observational errors. To overcome these limitations, this work presents a real-time object detection-based surveillance framework. The proposed system focuses on detecting guns, knives, and region-specific blunt objects commonly involved in violent activities in Indian surveillance scenarios. A key contribution of this work is the use of a custom-created dataset collected using a mobile camera, consisting of 336 labeled images of blunt objects such as iron rods, wooden sticks, and plastic rods. This dataset is combined with a publicly available dataset of 7,623 images of guns and knives, forming a consolidated dataset of 7,959 images across three classes: gun, knife, and blunt object. The combined dataset is used to train a YOLOv8-based object detection model for real-time performance. Experimental evaluation shows that increasing the training duration significantly improves recall and average precision for the blunt object class without signs of overfitting. Overall, the proposed framework achieves an effective balance between accuracy and efficiency, making it suitable for deployment in real-world surveillance environments such as campuses, public spaces, and transportation areas.
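The dataset figures quoted above can be checked with a few lines of arithmetic, which also shows how small the custom blunt-object set is relative to the public gun and knife data. The abstract gives only the combined count for guns and knives, so the sketch keeps them as one bucket. The variable names are illustrative.

```python
# Image counts as stated in the abstract.
counts = {"blunt_object": 336, "gun_and_knife": 7623}

total_images = sum(counts.values())
shares = {name: n / total_images for name, n in counts.items()}

# How many gun/knife images exist per blunt-object image.
imbalance_ratio = counts["gun_and_knife"] / counts["blunt_object"]

print(f"total images: {total_images}")
print(f"blunt-object share: {shares['blunt_object']:.2%}")
print(f"gun+knife images per blunt-object image: {imbalance_ratio:.1f}")
```

The blunt-object class makes up only about 4% of the merged dataset, which may help explain why longer training is reported to matter most for that class.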


2606.05703 2026-06-05 cs.CV 版本更新

Parallel Jacobi Decoding for Fast Autoregressive Image Generation

并行雅可比解码用于快速自回归图像生成

Boya Liao, Ying Li, Siyong Jian, Huan Wang

发表机构 * Westlake University（西湖大学）

AI总结提出并行雅可比解码（PJD），通过二维空间域扩展草稿令牌并调整注意力掩码，实现无需训练的自回归图像生成加速，在保持生成质量的同时获得4.8倍至6.4倍加速。

Comments Accepted by CVPR 2026


AI中文摘要

自回归（AR）模型在生成高保真图像方面表现出色。然而，其固有的顺序逐令牌预测导致推理速度显著变慢。最近的研究引入了雅可比式解码来加速自回归图像生成。初始扩展草稿序列提高了效率，但由于一维序列中的错误传播阻碍收敛，加速很快饱和。观察到图像表现出强烈的局部空间相关性，我们提出了并行雅可比解码（PJD），一种无需训练的解码方法，在二维空间域中扩展草稿令牌以实现高效的空间并行细化。PJD调整注意力掩码以减轻错误累积并提高收敛稳定性。在多个数据集上的大量实验表明，PJD在多种自回归图像生成模型上实现了4.8倍至6.4倍的加速，同时保持了具有竞争力的生成质量。

英文摘要

Autoregressive (AR) models have demonstrated remarkable performance in generating high-fidelity images. However, their inherently sequential next-token prediction leads to significantly slower inference. Recent studies have introduced Jacobi-style decoding to accelerate autoregressive image generation. Extending the draft sequence initially improves efficiency, yet the acceleration quickly saturates as error propagation in the one-dimensional sequence hinders convergence. Observing that images exhibit strong local spatial correlations, we propose Parallel Jacobi Decoding (PJD), a training-free decoding approach that expands draft tokens in the two-dimensional spatial domain to enable efficient spatially parallel refinement. PJD adjusts the attention mask to mitigate error accumulation and improve convergence stability. Extensive experiments on diverse datasets show that PJD achieves 4.8x-6.4x acceleration across multiple autoregressive image generation models while maintaining competitive generation quality.
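For readers unfamiliar with Jacobi-style decoding, the sketch below shows the one-dimensional baseline that the abstract builds on, not PJD itself. A whole draft sequence is refined in parallel until it stops changing, and the fixed point matches ordinary greedy decoding. The `next_token` function is a hypothetical deterministic toy standing in for a real autoregressive model, and the list comprehension simulates what would be a single causally masked forward pass.

```python
def next_token(prefix):
    """Hypothetical toy model: the greedy next token for a given prefix."""
    return (sum(prefix) * 31 + len(prefix) * 7 + 3) % 11


def greedy_decode(prompt, n):
    """Standard sequential decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt, n, init=0):
    """Refine all n draft tokens in parallel until a fixed point is reached.

    Each iteration recomputes every position from the previous draft.
    After k iterations the first k tokens are guaranteed correct, so the
    loop finishes within n + 1 iterations.
    """
    draft = [init] * n
    iterations = 0
    while True:
        iterations += 1
        new = [next_token(list(prompt) + draft[:i]) for i in range(n)]
        if new == draft:
            return draft, iterations
        draft = new


prompt = [1, 2, 3]
tokens, iterations = jacobi_decode(prompt, n=8)
reference = greedy_decode(prompt, n=8)
```

Any speedup comes from positions that converge earlier than the worst case. According to the abstract, PJD expands the draft in the two-dimensional spatial domain rather than along this one-dimensional sequence.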


2606.05702 2026-06-05 cs.AI cs.CV 版本更新

Seeing Time: Benchmarking Chronological Reasoning and Shortcut Biases in Vision-Language Models

Seeing Time: 视觉-语言模型中的时间顺序推理与捷径偏差基准测试

Haoyu Zhou, Qing Qing, Caichong Li, Qixin Zhang, Yongcheng Jing, Ziqi Xu, Juncheng Hu, Xikun Zhang, Renqiang Luo

发表机构 * College of Computer Science and Technology, Jilin University（吉林大学计算机科学与技术学院）； College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算与数据科学学院）； School of Computer Science, Wuhan University（武汉大学计算机学院）； School of Computing Technologies, RMIT University（皇家墨尔本理工大学计算技术学院）

AI总结本文提出一个新基准，通过三个专门数据集评估视觉-语言模型在图像内和跨图像的时间顺序推理能力，并揭示模型常利用颜色等表面线索而非真正时间特征。


AI中文摘要

近期视觉-语言模型（VLM）在解释复杂视觉语义方面取得了显著进展，但其时间顺序推理能力仍未得到充分探索。本文引入了一个新颖的基准，专门用于评估VLM如何感知和推理图像内及跨图像的时间顺序信息。与现有基于视频的基准（侧重于帧序列）不同，我们的工作深入探讨了时间判断的基本逻辑以及向多模态集成的扩展。为此，我们构建了三个专门数据集：一个包含跨越长时间历史周期的视觉相似物体，另一个按不同事件和物体类型分类，第三个将图像与时间敏感的新闻文本配对以实现跨模态对齐。通过大量实验，我们分析了模型是否在不同类别间表现出性能差异，并关键地探讨了它们是否依赖“错误捷径”（如图像颜色而非真正的时间特征）。我们的结果表明，尽管VLM显示出潜力，但它们经常利用灰度与彩色滤镜等表面线索来绕过真正的时间顺序推理。通过提供这些高质量数据集和严格的评估框架，我们提供了一个诊断工具，用于识别当前局限性并指导开发更稳健、逻辑更严密的多模态模型。源代码见 https://github.com/LuoRenqiang/ChronoVision。

英文摘要

Recent advancements in Vision-Language Models (VLMs) have significantly enhanced their ability to interpret complex visual semantics, yet their capacity for chronological reasoning remains under-explored. In this paper, we introduce a novel benchmark specifically designed to evaluate how VLMs perceive and reason about chronological information within and across images. Unlike existing video-based benchmarks that focus on frame sequencing, our work delves into the underlying logic of chronological judgment and the expansion toward multimodal integration. To facilitate this, we construct three specialized datasets: one containing visually similar objects spanning long historical durations, another categorized by diverse event and object types, and a third pairing images with time-sensitive news text for cross-modal alignment. Through extensive experiments, we analyze whether models exhibit performance disparities across categories and, crucially, explore whether they rely on "incorrect shortcuts", such as image color rather than genuine chronological features. Our results reveal that while VLMs show promise, they frequently exploit superficial cues like grayscale versus color filters to bypass authentic chronological reasoning. By providing these high-quality datasets and a rigorous evaluation framework, we offer a diagnostic tool to identify current limitations and guide the development of more robust, logically grounded multimodal models. The source code is available at https://github.com/LuoRenqiang/ChronoVision.


2606.05700 2026-06-05 cs.CV cs.LG 版本更新

T-SAR-JEPA: Self-Supervised Temporal Anomaly Detection in SAR Amplitude Stacks via Latent Prediction

T-SAR-JEPA：通过潜在预测在SAR幅度堆栈中进行自监督时间异常检测

Kerod Woldesenbet, Abem Woldesenbet

发表机构 * Independent Researcher（独立研究者）； Dakota State University（达科塔州立大学）

AI总结提出T-SAR-JEPA框架，通过自监督潜在预测在SAR幅度堆栈中检测时间异常，在DFC 2026数据集上达到77.0%的ROC-AUC，优于多种基线方法。

Comments Won IEEE GRSS Data Fusion Contest 2026; to appear in IGARSS 2026 proceedings

2606.05677 2026-06-05 cs.CV cs.AI cs.CL 版本更新

LongSpace: Exploring Long-Horizon Spatial Memory from Perception to Recall in Video

LongSpace: 从感知到回忆的视频长程空间记忆探索

Shiqiang Lang, Jing Liu, Haoyang He, Peiwen Sun, Yuanteng Chen, Tao Liu, Lan Yang, Longteng Guo, Honggang Zhang

发表机构 * Beijing University of Posts and Telecommunications（北京邮电大学）； Zhongguancun Academy（中关村学院）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； The Chinese University of Hong Kong（香港中文大学）； Xi’an Jiaotong University（西安交通大学）

AI总结针对长视频中空间记忆的挑战，提出LongSpace框架，通过分块建模、3D结构线索注入和层级感知记忆实现长程空间推理，并在LongSpace-Bench等基准上验证其有效性。


AI中文摘要

多模态大语言模型（MLLMs）在图像和视频理解方面取得了进展，并且能够处理更长的视觉输入。自动驾驶和机器人导航等长程任务不仅需要识别当前视图，模型还必须记住并检索之前观察到的空间布局、路线、视角变化和物体状态。为了评估这一能力，我们引入了LongSpace-Bench，一个用于长程空间记忆的房间导览视频基准，涵盖场景感知、空间关系和空间记忆。在这项工作中，我们进一步提出了LongSpace，一个用于长视频空间推理的记忆框架。LongSpace将长视频建模为连续的块，将3D结构线索注入早期解码器层，并构建层级感知记忆以进行问题引导的检索。在多个空间推理基准上的实验表明，LongSpace改善了长视频空间理解，进一步证明了显式空间记忆是长程视频MLLMs的关键能力。

英文摘要

Multimodal Large Language Models (MLLMs) have advanced image and video understanding and can increasingly handle longer visual inputs. Long-horizon tasks such as autonomous driving and robotic navigation require more than recognizing the current view, as models must remember and retrieve previously observed spatial layouts, routes, viewpoint changes, and object states. To evaluate this capability, we introduce LongSpace-Bench, a room-tour video benchmark for long-horizon spatial memory, covering scene perception, spatial relations, and spatial memory. In this work, we further propose LongSpace, a memory framework for long-video spatial reasoning. LongSpace models long videos as sequential chunks, incorporates 3D structural cues into early decoder layers, and constructs layer-aware memory for question-guided retrieval. Experiments on multiple spatial reasoning benchmarks show that LongSpace improves long-video spatial understanding, further demonstrating explicit spatial memory as a key capability for long-horizon video MLLMs.


2606.05675 2026-06-05 cs.LG cs.CV 版本更新

Two-Way Is Better Than One: Bidirectional Alignment with Cycle Consistency for Exemplar-Free Class-Incremental Learning

双向优于单向：基于循环一致性的双向对齐用于无样本类增量学习

Hongye Xu, Bartosz Krawczyk

发表机构 * Chester F. Carlson Center for Imaging Science（切斯特·F·卡尔森影像科学中心）； Rochester Institute of Technology（罗切斯特理工学院）

AI总结提出BiCyc方法，通过双向投影器对齐和循环一致性目标，解决无样本类增量学习中原型漂移和单向投影偏差问题，减少灾难性遗忘并提升准确率。

Comments Published as a conference paper at ICLR 2026. 23 pages, 8 figures. Code: https://github.com/HXuSz11/BiCyc_ICLR2026


AI中文摘要

持续学习（CL）旨在使模型在不遗忘先前知识的情况下获取新技能。在无样本类增量学习（EFCIL）中，由于无法存储过去数据，这一挑战被放大，旧类的表示漂移尤其有害。基于原型的EFCIL因其高效性而具有吸引力，但随着嵌入空间的演化，原型会发生漂移；因此，基于投影的漂移补偿已成为一种流行的补救措施。然而，我们表明，现有的单向投影引入了系统性偏差：它们要么追溯性地扭曲当前特征几何结构，要么仅局部对齐旧类，导致跨任务累积的循环不一致性。我们提出BiCyc，一种具有循环一致性目标的双向投影器对齐方法。BiCyc联合优化两个映射（旧到新和新到旧），并采用停止梯度门控，使得传输和表示共同演化。分析表明，循环损失在白化空间中将奇异谱向单位值收缩，并且类均值和协方差的改进传输导致分类对数几率扰动更小，从而保留旧类决策并减轻灾难性遗忘。实验上，在标准EFCIL基准测试中，BiCyc显著减少了遗忘并提高了从头开始设置下的准确率，同时在预训练细粒度场景中保持竞争力。

面向工程可靠裂缝表示与拓扑保持的土木基础设施多任务裂缝基础模型

Blessing Agyei Kyem, Joshua Kofi Asamoah, Eugene Denteh, Armstrong Aboah

发表机构 * NDSU（北达科他州立大学）

AI总结提出 CrackGeoFM 多任务框架，结合冻结视觉基础骨干与裂缝专用适配模块，实现掩码预测、骨架重建和不确定性估计，在20个数据集上达到最优分割、拓扑保持和校准不确定性。

Comments 60 pages, 17 figures, 11 tables


AI中文摘要

可靠的裂缝评估不仅需要准确的像素级掩码，还需要在域偏移下保持稳定的连通裂缝几何形状和置信度估计。然而，现有的分割模型在实现高重叠分数的同时，可能会使裂缝碎片化、遗漏细小分支，并且无法提供校准的不确定性。为了解决这一问题，本文提出了 CrackGeoFM，一个多任务框架，它将冻结的视觉基础骨干与裂缝专用适配相结合，用于掩码预测、骨架重建和不确定性估计。该框架集成了频率引导的裂缝增强模块（FCEM）以增强高频裂缝线索，裂缝域特征适配模块（CFAM）以将冻结骨干特征适配到裂缝域模式，以及结构感知多任务解码器（SMTD）以联合解码掩码、骨架和不确定性。在20个裂缝数据集上，CrackGeoFM 实现了最先进的分割性能、改进的拓扑保持、校准的不确定性以及仅需五张标注图像的有效少样本适应。这些结果支持可靠、可泛化且面向工程的裂缝分析，用于基础设施评估。

英文摘要

Reliable crack assessment requires not only accurate pixel-level masks but also connected crack geometry and confidence estimates that remain stable under domain shift. However, existing segmentation models can achieve high overlap scores while fragmenting cracks, missing fine branches, and providing no calibrated uncertainty. To address this gap, this paper proposes CrackGeoFM, a multi-task framework that combines a frozen visual foundation backbone with crack-specific adaptation for mask prediction, skeleton reconstruction, and uncertainty estimation. The framework integrates a Frequency-Guided Crack Enhancement Module (FCEM) to enhance high-frequency crack cues, a Crack-Domain Feature Adaptation Module (CFAM) to adapt frozen backbone features to crack-domain patterns, and a Structure-Aware Multi-Task Decoder (SMTD) to jointly decode masks, skeletons, and uncertainty. Across 20 crack datasets, CrackGeoFM achieves state-of-the-art segmentation, improved topology preservation, calibrated uncertainty, and effective few-shot adaptation with only five labeled images. These results support reliable, generalizable, and engineering-oriented crack analysis for infrastructure assessment.


2606.05635 2026-06-05 cs.CV cs.MM 版本更新

ShotCrop³: Cropping Human-Centric Images into Cinematic Triple-Shot Compositions

ShotCrop³：将人物中心图像裁剪为电影级三镜头构图

Dehong Kong, Lina Lei, Lingtao Zheng, Chenyang Wu, Ailing Zhang, Xinran Qin, Teng Ma, Jiaqi Xu, Zhixin Wang, Zhikai Chen, Xuecheng Qi, Renjing Pei, Fan Li

发表机构 * Huawei Noah’s Ark Lab（华为诺亚实验室）； Sun Yat-sen University（中山大学）

AI总结提出三镜头构图任务，通过三阶段训练流程（思维链微调、半监督微调和组相对策略优化）从单张人物中心图像生成远景、中景和特写三张裁剪图，并附带简短描述，以支持视觉叙事。


AI中文摘要

先前关于美学构图的工作通常产生单一美观的裁剪，忽略了从一个场景中构图多个镜头的叙事价值。在实践中，多镜头构图对于下游创意工作流程至关重要：商业海报通常需要不同重点（例如，背景、主体和情感/产品细节）的多个裁剪来呈现关键故事节拍。因此，我们提出了三镜头构图（TSC），这是一个构图任务，从单张人物中心图像生成一个三镜头集（远景、中景和特写），每个镜头都配有简短的镜头描述以支持视觉叙事。为了在有限的专家标注下学习TSC，我们引入了ShotCrop，它经历了一个三阶段训练过程：首先应用思维链监督微调以建立基本推理和美学裁剪技能，然后使用高置信度伪标签进行半监督微调以进一步增强美学能力，最后通过针对ShotCrop的组相对策略优化（GRPO-S）进行优化，使用为其定制的复合奖励。具体来说，我们的伪标签策略结合了基于MLLM的评分、美学评估和CLIP相似度，以保留高置信度的训练信号。此外，我们提出了TSC-Bench，一个包含1.2k个专家标注测试用例的基准。值得注意的是，ShotCrop在镜头定位准确率上比GPT-5平均提高了2.82倍。

英文摘要

Prior work on aesthetic composition typically produces a single aesthetically pleasing crop, overlooking the narrative value of composing multiple shots from one scene. In practice, multi-shot composition is critical for downstream creative workflows: commercial posters often require multiple crops with different emphases (e.g., context, subject, and emotion/product details) to present key story beats. Therefore, we propose Triple-Shot Compositions (TSC), a composition task that generates a three-shot set (establishing, medium, and close-up) from a single human-centric image, each paired with a brief shot description to support visual narration. To learn TSC with limited expert annotations, we introduce ShotCrop, which undergoes a three-stage training process: it first applies Chain-of-Thought supervised fine-tuning to establish basic reasoning and aesthetic shot-cropping skills, then performs semi-supervised fine-tuning with high-confidence pseudo labels to further enhance aesthetic capability, and is finally optimized with Group Relative Policy Optimization for ShotCrop (GRPO-S) using a composite reward tailored for it. Specifically, our pseudo-labeling strategy combines MLLM-based scoring, aesthetic assessment, and CLIP similarity to retain high-confidence training signals. In addition, we present TSC-Bench, a benchmark of 1.2k expert-annotated test cases. Notably, ShotCrop achieves an average improvement of 2.82 times over GPT-5 in shot localization accuracy.


2606.05624 2026-06-05 cs.CV cs.GR 版本更新

KV-Control: Parameter-Efficient K/V Injection for Trajectory-Controlled Text-to-Motion

KV-Control: 用于轨迹控制文本到运动的参数高效K/V注入

Tengjiao Sun, Pengcheng Fang, Xiaoyu Zhan, Yanwen Guo, Dongjie Fu, Xiaohao Cai, Hansung Kim

发表机构 * University of Science and Technology of China（中国科学技术大学）； Tsinghua University（清华大学）

AI总结提出KV-Control，一种紧凑的注意力侧控制接口，通过部分标记化运动基元和轨迹编码器注入键/值记忆，实现精确的轨迹控制而不覆盖预训练的文本条件运动先验。

详情

AI中文摘要

文本条件3D人体运动模型现在可以从提示中合成合理的运动，但实际动画和具身代理工作流程很少止步于文本：角色可能需要遵循草绘的根路径，达到末端执行器目标，或满足多关节轨迹，同时保持语言描述的步态、风格和意图。这暴露了一个控制权衡。轨迹控制器应该精确而不覆盖预训练的文本条件运动先验，但现有解决方案要么复制生成器的大部分以重新获得每层控制访问，要么将大部分成本转移到测试时优化。我们引入KV-Control，一种用于冻结的掩码式文本到运动Transformer的紧凑注意力侧控制接口。关键思想是将几何约束作为自注意力中的记忆提供，而不是通过全局姿态标记注入或仅在输出侧强制执行。为了支持该接口，我们共同设计了按身体部位标记化的运动基底和控制器：PartVQ学习与解剖结构对齐的部位码本，T-Concat将每个帧-部位标记暴露为注意力可寻址的位置，KV-Control在每个自注意力层注入控制条件的键/值记忆，同时保留预训练的查询流、文本交叉注意力、FFN和所有骨干权重。生成的适配器仅在共享轨迹编码器之上添加可训练的注入参数，但在继承的细化协议下以亚厘米精度跟踪根和多关节约束，同时保留文本条件的运动质量。KV-Control将轨迹条件重新定义为轻量级记忆检索，为文本到运动生成提供了一个小型、精确且透明的控制接口。

英文摘要

Text-conditioned 3D human motion models now synthesize plausible motions from prompts, but practical animation and embodied-agent workflows rarely stop at text: a character may need to follow a sketched root path, hit an end-effector target, or satisfy a multi-joint trajectory while still preserving the gait, style, and intent described by language. This exposes a control trade-off. A trajectory controller should be precise without overwriting the pretrained text-conditioned motion prior, yet existing solutions either duplicate large portions of the generator to regain per-layer control access or move much of the cost to test-time optimization. We introduce KV-Control, a compact attention-side control interface for frozen masked text-to-motion transformers. The key idea is to make geometric constraints available as memory inside self-attention rather than injecting them through a global pose token or enforcing them only at the output side. To support this interface, we co-design a part-tokenized motion substrate and controller: PartVQ learns anatomy-aligned part codebooks, T-Concat exposes each frame-part token as an attention-addressable site, and KV-Control injects control-conditioned key/value memories at every self-attention layer while preserving the pretrained query stream, text cross-attention, FFN, and all backbone weights. The resulting adapter adds only trainable injection parameters atop a shared trajectory encoder, yet tracks root and multi-joint constraints with sub-centimeter accuracy under the inherited refinement protocol while retaining text-conditioned motion quality. KV-Control reframes trajectory conditioning as lightweight memory retrieval, providing a small, precise, and transparent control interface for text-to-motion generation.
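The idea of offering constraints "as memory inside self-attention" can be illustrated with a single attention head in plain Python. This is a sketch of the general key/value-injection pattern under simplifying assumptions (one head, no learned projections, list-based vectors); the function names are hypothetical and the paper's actual layer may differ.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention for one head, using plain lists."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def attend_with_injected_memory(queries, keys, values, ctrl_keys, ctrl_values):
    """Append control-conditioned K/V rows as extra memory slots.

    The queries, keys, and values of the frozen model are left untouched;
    only the set of slots that each query may read from grows.
    """
    return attend(queries, keys + ctrl_keys, values + ctrl_values)


q = [[1.0, 0.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 0.0], [0.0, 1.0]]
baseline = attend(q, k, v)
controlled = attend_with_injected_memory(q, k, v, [[4.0, 0.0]], [[0.0, 5.0]])
```

With no control rows the layer reduces exactly to the original attention, which is why this style of adapter leaves the pretrained behavior intact whenever no constraint is supplied.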


2606.05611 2026-06-05 cs.CV 版本更新

What's Under the Skin? Estimating Swine Body Condition

皮肤之下是什么？估算猪体况

Mk Bashar, Kuljit Bhatti, Gary Rohrer, Madonna Benjamin, Tami Brown-Brandl, Daniel Morris

AI总结提出PigFormer系统，利用RGB-D深度图像通过两阶段流程（几何前端和切片注意力编码器）预测猪的皮下背膘厚度、腰肌深度和总组织厚度，实现非接触式体况监测。

详情

AI中文摘要

母猪体况是养殖者的重要指标，因为它对泌乳性能和仔猪存活率有很大影响。然而，生产中使用的体况测量方法（如视觉评分和卡尺）与底层组织成分的相关性较差。超声波扫描可以直接测量皮下背膘厚度和腰肌深度，但操作劳动密集且无法规模化生产。我们提出了PigFormer，一个端到端的两阶段系统，它从天花板安装的RGB-D相机获取原始深度帧，并预测最后肋骨处的皮下背膘厚度、腰肌深度和总组织厚度。第一阶段是几何前端，通过SAM3-to-MaskDINO分割蒸馏、地平面去除和方向归一化将原始深度转换为标准化高度图。第二阶段是切片注意力编码器，将每个高度图视为一系列横截面切片，并捕捉沿整个背侧表面的空间关系。在两个设施的多站点数据集（319头母猪和小母猪实例）上，PigFormer实现了2.43毫米的背膘平均绝对误差和3.87毫米的整体平均绝对误差。它优于强大的单阶段ResNet-18和ViT-small基线。PigFormer为商业养猪生产中实现连续、自动化、非接触式体况监测提供了一条实用途径。代码可在https://github.com/iambashar/Pigformer获取。

英文摘要

Sow body condition is an important indicator for growers as it has a large impact on lactation performance and piglet survival. However, body condition measures used during production, such as visual scoring and calipers, correlate poorly with underlying tissue composition. Ultrasound scans can provide direct measurements of subcutaneous backfat thickness and loin muscle depth, but their operation is labor intensive and not scalable for production. We present PigFormer, an end-to-end two-stage system that takes raw depth frames from a ceiling-mounted RGB-D camera and predicts subcutaneous backfat thickness, loin muscle depth, and total tissue thickness at the last rib. Stage 1 is a geometric front-end that converts raw depth into a standardized height map via SAM3-to-MaskDINO segmentation distillation, ground-plane removal, and orientation normalization. Stage 2 is a Slice Attention Encoder that treats each height map as a sequence of cross-sectional slices and captures spatial relationships along the full dorsal surface. On a multi-site dataset of 319 sow and gilt instances from two facilities, PigFormer achieves 2.43 mm backfat MAE and 3.87 mm overall MAE. It outperforms strong single-stage ResNet-18 and ViT-small baselines. PigFormer offers a practical path toward continuous, automated, non-contact body condition monitoring in commercial swine production. Code is available at https://github.com/iambashar/Pigformer.


2606.05587 2026-06-05 cs.CV cs.AI cs.LG 版本更新

HDST-GNN: Heterogeneous Dynamic Spatiotemporal Graph Neural Networks for Multi-Object Tracking in UAV Aerial Imagery

HDST-GNN：用于无人机航拍图像多目标跟踪的异质动态时空图神经网络

Phillip Jiang

发表机构 * Phillip Jiang（菲利普·姜）

AI总结针对无人机航拍中目标小、密集、遮挡导致身份切换的问题，提出异质动态时空图神经网络HDST-GNN，通过高度自适应边构建、异质节点表示和遮挡门控时序聚合提升跟踪性能。

Comments 18 pages, 4 figures, 6 tables

详情

AI中文摘要

无人机航拍图像的多目标跟踪（MOT）面临独特挑战：序列间高度变化、目标小而密集、频繁遮挡导致身份切换。现有基于图的跟踪器假设固定空间上下文并统一处理所有目标，忽略了检测、活跃轨迹和丢失目标等异质生命周期状态。我们提出HDST-GNN，一种异质动态时空图神经网络，包含三项创新。首先，高度自适应边构建根据平均目标面积估计相机高度代理，并相应调整图连接半径。其次，异质节点表示将检测（D型）、确认轨迹（T型）和丢失轨迹（L型）建模为不同节点类型，具有专用投影和类型化边关系。第三，遮挡门控时序聚合根据每个节点的遮挡置信度门控其注意力贡献，防止被遮挡节点破坏邻居嵌入。HDST-GNN使用可微Sinkhorn头部，结合交叉熵和三元组损失进行端到端训练。在VisDrone2019-MOT上使用oracle检测时，HDST-GNN达到94.51% MOTA和97.24% IDF1，比SORT高出+5.0 MOTA点，身份切换减少81%。使用真实YOLOv8n检测时，HDST-GNN相比SORT身份切换减少49%。消融研究证实了每个组件的独立贡献。

英文摘要

Multi-object tracking (MOT) from UAV imagery presents unique challenges: altitude varies across sequences, objects are small and densely packed, and frequent occlusion causes identity switches. Existing graph-based trackers assume fixed spatial context and treat all objects uniformly, ignoring the heterogeneous lifecycle states of detections, active tracklets, and lost targets. We propose HDST-GNN, a Heterogeneous Dynamic Spatiotemporal Graph Neural Network with three novel contributions. First, Altitude-Adaptive Edge Construction estimates a camera-altitude proxy from mean object area and adjusts the graph connectivity radius accordingly. Second, Heterogeneous Node Representation models detections (Type-D), confirmed tracklets (Type-T), and lost tracklets (Type-L) as distinct node types with dedicated projections and typed edge relations. Third, Occlusion-Gated Temporal Aggregation gates each node's attention contribution by its occlusion confidence, preventing occluded nodes from corrupting neighbour embeddings. HDST-GNN is trained end-to-end with a differentiable Sinkhorn head using joint cross-entropy and triplet loss. On VisDrone2019-MOT with oracle detections, HDST-GNN achieves 94.51% MOTA and 97.24% IDF1, outperforming SORT by +5.0 MOTA points and reducing identity switches by 81%. With real YOLOv8n detections, HDST-GNN reduces identity switches by 49% vs. SORT. Ablation studies confirm the independent contribution of each component.
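A differentiable Sinkhorn head turns a matrix of association scores into a soft assignment by alternating row and column normalization. The sketch below shows only that core iteration on a square score matrix; it omits the extra "unmatched" row and column that trackers usually add, and the function name and iteration count are illustrative assumptions rather than details from the paper.

```python
import math


def sinkhorn(scores, n_iters=50):
    """Alternately normalize rows and columns of exp(scores).

    For a square matrix the result approaches a doubly stochastic matrix,
    which can be read as a soft track-to-detection assignment.
    """
    P = [[math.exp(s) for s in row] for row in scores]
    n_cols = len(P[0])
    for _ in range(n_iters):
        # Make every row sum to 1.
        P = [[v / sum(row) for v in row] for row in P]
        # Make every column sum to 1.
        col_sums = [sum(row[j] for row in P) for j in range(n_cols)]
        P = [[v / col_sums[j] for j, v in enumerate(row)] for row in P]
    return P


# Rows: existing tracklets. Columns: new detections.
assignment = sinkhorn([[2.0, 0.0], [0.0, 2.0]])
```

Because every step is a differentiable division, gradients from a cross-entropy loss on the assignment matrix can flow back into the network that produced the scores.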


2606.05586 2026-06-05 cs.CV cs.MM 版本更新

面向细粒度OOD检测的双重特征解耦

Xiaokun Li, Yaping Huang, Qingji Guan

发表机构 * School of Computer Science and Technology, Beijing Jiaotong University（计算机科学与技术学院，北京交通大学）

AI总结提出双重特征解耦网络(DFDNet)，通过空间-频率解耦和重建引导解耦模块，解决细粒度分类中因类间差异小和背景干扰导致的OOD检测难题。

详情

AI中文摘要

分布外（OOD）检测是将机器学习模型应用于现实场景时不可或缺的技术。现有大多数OOD检测方法都是在类间分布差异较大的理想化假设下开发的，而很大程度上忽略了以细微变化为特征的细粒度任务，如医学图像分类和车辆识别。细粒度子类别之间的高视觉相似性，加上背景因素的干扰，使得OOD检测极具挑战性。为了解决这个问题，我们提出了一种新颖的双重特征解耦网络（DFDNet），从特征解缠的角度解决细粒度OOD检测。所提出的DFDNet包含两个关键组件：空间-频率解耦模块和重建引导解耦模块。空间-频率解耦模块旨在保留对分类有判别性的内容特征，同时抑制与任务无关的风格信息。另一方面，重建引导解耦模块引入了一种新颖的像素级对抗重建任务，以进一步去除低层、非判别性信息，并增强类别特定的高层语义表示。大量实验表明，我们的方法在多个数据集上取得了有竞争力的性能提升。

英文摘要

Out-of-distribution detection (OOD) is an indispensable technique when applying machine learning models to real-world scenarios. Most existing OOD detection methods have been developed under the idealized assumption of large inter-class distributional differences, while largely overlooking fine-grained tasks characterized by subtle variations, such as medical image classification and vehicle recognition. The high visual similarity among fine-grained subcategories, together with the interference of background factors, makes OOD detection extremely challenging. To tackle this problem, we propose a novel Dual Feature Decoupling Network (DFDNet), which addresses fine-grained OOD detection from the perspective of feature disentanglement. The proposed DFDNet comprises two key components: a spatial-frequency decoupling module and a reconstruction-guided decoupling module. The spatial-frequency decoupling module is designed to preserve content features that are discriminative for classification while suppressing task-irrelevant style information. On the other hand, the reconstruction-guided decoupling module introduces a novel pixel-level adversarial reconstruction task to further remove low-level, non-discriminative information and enhance category-specific high-level semantic representations. Extensive experiments demonstrate that our method achieves competitive performance improvements on multiple datasets.


2606.05535 2026-06-05 cs.CV cs.AI 版本更新

Noise-Aware Visual Representation Learning for Medical Visual Question Answering

面向医学视觉问答的噪声感知视觉表示学习

I Putu Adi Pratama, Bahadorreza Ofoghi, Atul Sajjanhar, Shang Gao

发表机构 * Deakin University（迪肯大学）

AI总结提出一种噪声感知的医学视觉问答框架，通过去噪自编码器学习鲁棒的视觉表示，并利用低秩适配高效微调，在SLAKE和PathVQA基准上提升了抗噪性和性能。

Comments 15 pages, 2 figures. Conference submission

详情

AI中文摘要

医学视觉问答（Med-VQA）通过使AI模型能够解释医学图像并回答临床相关问题，在临床决策支持方面具有巨大潜力。近期方法通常通过轻量级映射网络将现成的视觉编码器与大语言模型（LLM）连接起来，以降低计算成本。然而，这些方法往往忽视了处理视觉表示中噪声和小无关变化的重要性。为应对这些挑战，我们提出了一种噪声感知的Med-VQA框架，该框架在视觉嵌入映射到LLM输入空间之前，引入了一个去噪自编码器。去噪自编码器经过预训练，能够从被破坏的输入中重建干净的视觉嵌入，从而鼓励模型学习对噪声不敏感的鲁棒视觉表示。然后，使用多层感知器（MLP）将得到的嵌入投影到语言模型嵌入空间中，形成为LLM提供图像信息的视觉前缀令牌。为了实现无需完全重新训练的高效适配，我们采用低秩适配（LoRA）进行参数高效微调。所提出的方法在SLAKE和PathVQA基准上进行了评估。实验结果表明，该方法在多个评估标准下对噪声输入嵌入具有更强的鲁棒性，同时保持了有竞争力的干净性能。这些发现表明，学习更鲁棒的视觉表示可以提升Med-VQA的性能和鲁棒性。

英文摘要

Medical visual question answering (Med-VQA) has strong potential for clinical decision support by enabling AI models to interpret medical images and answer clinically relevant queries. Recent approaches typically connect off-the-shelf vision encoders with large language models (LLMs) through lightweight mapping networks to reduce computational cost. However, these methods often overlook the importance of handling noise and small irrelevant changes in visual representations. To address these challenges, we propose a noise-aware Med-VQA framework that incorporates a denoising autoencoder before visual embeddings are mapped into the input space of an LLM. The denoising autoencoder is pretrained to reconstruct clean visual embeddings from corrupted inputs, encouraging the model to learn robust visual representations that are less sensitive to noise. The resulting embeddings are then projected into the language model embedding space using a multi-layer perceptron (MLP), forming visual prefix tokens that provide image information to the LLM. To enable efficient adaptation without full retraining, we employ parameter-efficient fine-tuning using low-rank adaptation (LoRA). The proposed method is evaluated on the SLAKE and PathVQA benchmarks. Experimental results show improved robustness to noisy input embeddings while maintaining competitive clean performance across multiple evaluation criteria. These findings suggest that learning more robust visual representations can enhance Med-VQA performance and robustness.
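Low-rank adaptation keeps a pretrained weight matrix frozen and learns only a small additive update. The sketch below shows the standard LoRA forward pass for a single linear layer with list-based matrices; the shapes, the `alpha` scaling, and the helper names follow the common LoRA formulation and are not taken from this paper's code.

```python
def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def lora_forward(W, A, B, x, alpha=1.0):
    """Compute y = W x + (alpha / r) * B (A x).

    W (d_out x d_in) is frozen; A (r x d_in) and B (d_out x r) are the
    only trainable parameters, with rank r much smaller than d_in.
    """
    r = len(A)
    base = matvec(W, x)
    low_rank = matvec(B, matvec(A, x))
    return [b + (alpha / r) * l for b, l in zip(base, low_rank)]


W = [[1.0, 0.0], [0.0, 1.0]]   # frozen pretrained weight
A = [[1.0, 1.0]]               # rank-1 down-projection
B = [[0.5], [0.0]]             # rank-1 up-projection
y = lora_forward(W, A, B, [2.0, 3.0])
```

Initializing `B` to zeros makes the adapted layer start out identical to the pretrained one, which is the usual LoRA initialization.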


2606.05533 2026-06-05 cs.LG cs.AI cs.CV cs.RO 版本更新

What Objects Enable, Not What They Are: Functional Latent Spaces for Affordance Reasoning

物体能做什么，而非它们是什么：面向功能可供性推理的功能潜在空间

Rohan Siva, Neel P. Bhatt, Yunhao Yang, Seoyoung Lee, Nishant Gadde, Christian Ellis, Alvaro Velasquez, Zhangyang Wang, Ufuk Topcu

发表机构 * The University of Texas at Austin（德克萨斯大学奥斯汀分校）； Neurosymbolic Intelligence（神经符号智能）； University of Colorado Boulder（科罗拉多大学博尔德分校）

AI总结提出A4D框架，通过构建基于功能可供性的共享潜在空间，将视觉观察映射到该空间并测量与可供性的距离，实现基于物体功能而非外观的规划推理，显著提升泛化能力和推理效率。

Comments Code, videos, and data available at: https://A4Dance-reasoning.github.io

详情

AI中文摘要

现有的机器人规划系统依赖于基于外观的推理，其中视觉观察被编码到围绕物体外观组织的潜在空间中（例如，根据外观识别“手推车”）。然而，规划需要推理物体的任务相关功能（例如，物体是否“可移动”），而基于外观的潜在空间无法捕捉这些信息。因此，现有方法难以泛化到新颖的机器人-物体交互。我们通过功能可供性推理解决这一泛化能力有限的问题，使规划基于任务相关的物体功能而非仅外观。我们提出A4D，它将视觉观察映射到一个围绕可供性（例如“可移动”）组织的共享潜在空间中。通过将视觉观察投影到这个功能潜在空间并测量它们与可供性的接近程度，A4D推断出与观察物体相关的功能。此外，我们引入了一种可供性发现机制，扩展潜在空间以处理现有可供性不足的未见场景。A4D利用功能潜在空间中的接近度来量化可供性推理的不确定性，并选择性地触发可供性发现。我们在涉及多样化和未见可供性的多个规划任务上评估A4D。A4D在现有可供性上达到94%的推理准确率，比最先进方法高出超过15个百分点；在不到原始训练数据10%的情况下，将新可供性推理准确率从70%提升到90%以上，并实现100倍更快的推理。代码、视频和数据可在https://A4Dance-reasoning.github.io获取。

英文摘要

Existing robot planning systems rely on appearance-based reasoning, where visual observations are encoded into latent spaces organized around object appearances (e.g., recognizing a "cart" based on how it looks). However, planning requires reasoning about task-relevant functionalities of objects (e.g., whether an object is "movable"), which appearance-based latent spaces do not capture. As a result, existing approaches struggle to generalize to novel robot-object interactions. We address this limited generalizability through affordance reasoning, enabling planning based on task-relevant object functionalities instead of appearance alone. We introduce A4D, which maps visual observations into a shared latent space structured around affordances (e.g., "movable"). By projecting visual observations into this functional latent space and measuring their proximity to affordances, A4D infers functionalities relevant to the observed object. Furthermore, we introduce an affordance discovery mechanism that expands the latent space to handle unseen scenarios where existing affordances are insufficient. A4D uses proximity in the functional latent space to quantify uncertainty in affordance inference and selectively triggers affordance discovery. We evaluate A4D across several planning tasks involving diverse and unseen affordances. A4D achieves 94% inference accuracy on existing affordances, outperforming state-of-the-art approaches by over 15 percentage points, improves new-affordance inference accuracy from 70% to over 90% with less than 10% of the original training data, and enables 100x faster inference. Code, videos, and data are available at: https://A4Dance-reasoning.github.io.
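Inference by proximity in a shared latent space can be sketched as nearest-anchor lookup with a similarity threshold. Everything specific below is an assumption made for illustration: the use of cosine similarity, the two-dimensional toy embeddings, the `min_similarity` value, and the function names are not taken from the A4D release.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def infer_affordance(embedding, anchors, min_similarity=0.8):
    """Return (label, similarity, needs_discovery).

    `anchors` maps each known affordance name to its latent vector.
    A low best similarity is treated as high uncertainty and flags the
    observation for affordance discovery instead of forcing a label.
    """
    label, sim = max(
        ((name, cosine(embedding, vec)) for name, vec in anchors.items()),
        key=lambda pair: pair[1],
    )
    return label, sim, sim < min_similarity


anchors = {"movable": [1.0, 0.0], "graspable": [0.0, 1.0]}
confident = infer_affordance([0.9, 0.1], anchors)
uncertain = infer_affordance([0.7, 0.7], anchors)
```

The same distance that selects the label doubles as the uncertainty signal, so discovery is only triggered for observations that sit far from every known affordance.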


2606.05531 2026-06-05 cs.CV cs.AI cs.CL cs.LG 版本更新

Almieyar-Oryx-BloomBench: A Bilingual Multimodal Benchmark for Cognitively Informed Evaluation of Vision-Language Models

Almieyar-Oryx-BloomBench：一个用于视觉语言模型认知知情评估的双语多模态基准

Mohammad Mahdi Abootorabi, Omid Ghahroodi, Anas Madkoor, Marzia Nouri, Doratossadat Dastgheib, Mohamed Hefeeda, Ehsaneddin Asgari

发表机构 * University of British Columbia（不列颠哥伦比亚大学）； Zuse School（Zuse学校）； Qatar Computing Research Institute (QCRI)（卡塔尔计算研究所）； Hamad Bin Khalifa University（哈马德·本·哈利法大学）

AI总结针对现有基准无法诊断视觉语言模型真实推理能力的问题，提出基于Bloom认知分类学的双语多模态基准BloomBench，系统评估六个认知层次，揭示模型在事实回忆和创造性合成方面的深层局限。

Comments Accepted to ACL 2026 Findings

详情

AI中文摘要

尽管视觉语言模型（VLM）取得了快速进展，但该领域缺乏能够严格诊断其真实推理能力并描绘出向类人多模态智能有意义进展的基准。大多数现有评估侧重于零散或脱节的任务，掩盖了关键的认知弱点，并为有针对性的改进提供了很少的见解。为了弥补这一差距，我们引入了BloomBench，这是Almieyar基准系列的一部分，也是第一个基于人类认知的、双语（英语-阿拉伯语）的多模态VLM基准。基于Bloom分类学，BloomBench通过精心设计的图像-问题-答案任务系统地评估六个认知层次（记忆、理解、应用、分析、评估、创造）。通过半自动化流水线构建，并通过分层混合质量保证协议验证，确保了可扩展性、文化包容性和语言保真度。利用这一框架，我们对最先进的VLM进行了全面研究，以诊断其认知特征。我们的分析揭示了明显的认知不对称：尽管最先进的模型在语义理解方面达到了强大的性能上限，但它们在事实回忆和创造性合成方面存在显著困难。这表明当前的一般多模态能力掩盖了特定认知层次的深层局限性。此外，我们的研究突出了阿拉伯语和英语之间的关键性能差距，暴露了当前跨语言多模态推理的局限性。这些发现为开发更符合认知和包容性的VLM奠定了基础。基准框架和数据集可在以下网址获取：https://github.com/qcri/Almieyar-Oryx-BloomBench。

英文摘要

Despite the rapid progress of Vision-Language Models (VLMs), the field lacks benchmarks that rigorously diagnose their true reasoning abilities and chart meaningful progress toward human-like multimodal intelligence. Most existing evaluations focus on piecemeal or disconnected tasks, obscuring critical cognitive weaknesses and providing little insight for targeted improvement. To address this gap, we introduce BloomBench, part of the Almieyar benchmarking series, the first cognitively human-grounded, bilingual (English-Arabic) multimodal benchmark for VLMs. Grounded in Bloom's Taxonomy, BloomBench systematically evaluates six levels of cognition (Remember, Understand, Apply, Analyze, Evaluate, Create) through carefully designed image-question-answer tasks. Built with a semi-automated pipeline and validated through a stratified hybrid quality assurance protocol, it ensures scalability, cultural inclusivity, and linguistic fidelity. Leveraging this framework, we conduct a comprehensive study of state-of-the-art VLMs to diagnose their cognitive profiles. Our analysis reveals a sharp cognitive asymmetry: while state-of-the-art models achieve strong performance ceilings in semantic understanding, they struggle substantially with factual recall and creative synthesis. This demonstrates that current general multimodal proficiency masks deeper limitations in specific cognitive layers. Furthermore, our study highlights a critical performance gap between Arabic and English, exposing limitations in current cross-lingual multimodal reasoning. These findings establish a foundation for developing more cognitively aligned and inclusive VLMs. The benchmark framework and dataset is available at: https://github.com/qcri/Almieyar-Oryx-BloomBench.


2606.05515 2026-06-05 cs.CV 版本更新

ORACLE-CT：用于CT分类的解剖感知支持池化

Lavsen Dahal, Yubraj Bhandari, Geoffrey Rubin, Joseph Y. Lo

发表机构 * Center for Virtual Imaging Trials, RAI Labs, Department of Radiology, Duke University（虚拟成像试验中心，RAI实验室，放射学系，杜克大学）； Electrical and Computer Engineering, Pratt School of Engineering, Duke University（电气与计算机工程，工程学院，杜克大学）； Department of Mathematics, Trinity College of Arts & Sciences, Duke University（数学系，艺术与科学学院，杜克大学）； Department of Radiology and Imaging Sciences, University of Arizona College of Medicine（放射学与影像科学系，亚利桑那大学医学院）

AI总结提出ORACLE-CT框架，通过多器官分割定义标签特定的解剖支持区域并限制注意力池化，解决CT分类中局部疾病证据与全局聚合不匹配的问题，在多个编码器上提升性能。

详情

AI中文摘要

腹部CT疾病分类具有挑战性，因为每次扫描都是一个包含许多可能发现的大3D体积，而诊断证据通常局限于特定器官或解剖隔室。大多数研究级分类器使用与解剖无关的池化或注意力来聚合编码器特征，造成了局部疾病证据与全局证据聚合之间的不匹配。我们提出ORACLE-CT，一个与编码器无关的解剖感知聚合框架，它使用多器官分割来定义标签特定的解剖支持，并将注意力池化限制在相关区域。该框架支持单器官、多器官联合、比较、局部和全局支持策略。我们使用三个编码器系列评估ORACLE-CT：DINOv3、I3D-ResNet-121和放射学原生Pillar-0编码器。模型在MERLIN上进行端到端训练，并在内部评估以及在冻结外部迁移到Duke-Abdomen和AMOS下进行评估。与全局平均池化相比，支持掩蔽池化将DINOv3的MERLIN宏AUROC/AUPRC从0.838/0.638提高到0.858/0.676，将I3D-ResNet-121从0.829/0.617提高到0.848/0.659。在协调的10标签外部评估中，DINOv3在Duke-Abdomen上从0.802/0.628提高到0.835/0.683，在AMOS上从0.742/0.313提高到0.762/0.350，I3D-ResNet-121也有类似增益。对于Pillar-0，大部分增益来自学习注意力，解剖掩蔽的额外收益较小。ORACLE-CT提高了区分度和外部鲁棒性，同时保留了预测与解剖证据之间的可审计联系。

英文摘要

Abdominal CT disease classification is challenging because each scan is a large 3D volume with many possible findings, while diagnostic evidence is often confined to specific organs or anatomical compartments. Most study-level classifiers aggregate encoder features using anatomy-agnostic pooling or attention, creating a mismatch between localized disease evidence and global evidence aggregation. We propose ORACLE-CT, an encoder-agnostic anatomy-aware aggregation framework that uses multi-organ segmentation to define label-specific anatomical supports and restrict attention pooling to relevant regions. The framework supports single-organ, multi-organ union, comparative, localized, and global support strategies. We evaluate ORACLE-CT with three encoder families: DINOv3, I3D-ResNet-121, and the radiology-native Pillar-0 encoder. Models are trained end-to-end on MERLIN and evaluated internally and under frozen external transfer to Duke-Abdomen and AMOS. Compared with global average pooling, support-masked pooling improved MERLIN macro-AUROC/AUPRC from 0.838/0.638 to 0.858/0.676 for DINOv3 and from 0.829/0.617 to 0.848/0.659 for I3D-ResNet-121. On harmonized 10-label external evaluation, DINOv3 improved on Duke-Abdomen from 0.802/0.628 to 0.835/0.683 and on AMOS from 0.742/0.313 to 0.762/0.350, with similar gains for I3D-ResNet-121. For Pillar-0, most gains came from learned attention, with smaller additional benefit from anatomical masking. ORACLE-CT improves discrimination and external robustness while preserving an auditable link between predictions and anatomical evidence.
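Restricting attention pooling to an anatomical support amounts to taking the softmax only over tokens inside the organ mask. The following is a minimal sketch of that idea with list-based features; the function name, the flat token layout, and the precomputed attention scores are assumptions, not the ORACLE-CT implementation.

```python
import math


def support_masked_attention_pool(features, scores, support_mask):
    """Pool token features with a softmax limited to the support region.

    features:     one feature vector per token (e.g. per 3D patch)
    scores:       one attention logit per token
    support_mask: True for tokens inside the label's anatomical support
    """
    inside = [i for i, keep in enumerate(support_mask) if keep]
    if not inside:
        raise ValueError("support mask is empty")
    m = max(scores[i] for i in inside)
    exps = {i: math.exp(scores[i] - m) for i in inside}
    total = sum(exps.values())
    dim = len(features[0])
    return [sum(exps[i] / total * features[i][d] for i in inside)
            for d in range(dim)]


features = [[1.0, 0.0], [0.0, 1.0], [10.0, 10.0]]
scores = [0.0, 0.0, 5.0]
# The third token has the highest score but lies outside the organ mask.
pooled = support_masked_attention_pool(features, scores, [True, True, False])
```

Tokens outside the mask receive exactly zero weight, so a prediction can be traced back to the anatomical region that was allowed to contribute.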


2606.05458 2026-06-05 cs.CV 版本更新

Horse Eye Blink Detection and Classification for Equine Affective State Assessment

马匹眼睛眨眼检测与分类用于马匹情感状态评估

João Alves, Signe Møller-Skuldbøl, Pia Haubro Andersen, Rikke Gade

发表机构 * Visual Analysis and Perception Lab, Aalborg University（视觉分析与感知实验室，奥尔堡大学）； Department of Animal Biosciences, Swedish University of Agricultural Sciences（动物生物科学系，瑞典农业科学大学）

AI总结本研究开发并评估了三种基于视频的马匹眨眼自动分类方法（帧级YOLOv12检测器、光流幅度阈值法和微调VideoMAE模型），在公开数据集上实现了眨眼分类宏F1分数0.898和二元眨眼检测0.926，展示了细粒度动作单元检测在马匹福利监测中的潜力和挑战。

Comments CVPRW2026 CV4Animals

详情

AI中文摘要

自动检测马匹面部动作单元（AUs）是评估马匹疼痛和情感状态的一个有前景但尚未充分探索的途径。半眨眼和全眨眼运动被认为是疼痛和压力的识别指标，但作为微表情，其细微、精细的特性使其容易被肉眼忽略，只能通过逐帧视频检查才能辨别，这使得从视频中进行可靠的自动检测成为一项特别艰巨的任务。我们开发并评估了三种从马匹视频中自动分类眨眼的方法：基于帧的YOLOv12检测器、光流幅度阈值方法以及微调的VideoMAE模型，并在公开数据集上进行了测试。我们在眨眼分类任务上达到了0.898的宏F1分数，在二元眨眼检测上达到了0.926。我们的结果突显了细粒度AU检测在马匹福利监测中的潜力和固有挑战。

英文摘要

Automated detection of equine facial action units (AUs) is a promising yet under-explored avenue for pain and affective state assessment in horses. Half and full-blink movements are recognised indicators of pain and stress, but as micro-expressions, their subtle, fine-grained nature makes them easily missed by the naked eye and only discernible through frame-by-frame video inspection, making reliable automated detection from video a particularly demanding task. We develop and evaluate three methods for automated blink classification from horse videos: a frame-based YOLOv12 detector, an optical flow magnitude thresholding approach, and a fine-tuned VideoMAE model, tested on a publicly available dataset. We achieve a macro-F1 score of 0.898 when doing blink classification and 0.926 on binary blink detection. Our results highlight both the potential and the inherent challenges of fine-grained AU detection for equine welfare monitoring.
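Of the three methods, optical flow magnitude thresholding is the simplest to sketch: given one flow-magnitude value per frame for the eye region, frames above a threshold are grouped into candidate events. The threshold, the minimum event length, and the function name below are illustrative assumptions, and computing the flow itself is left out.

```python
def detect_events(flow_magnitudes, threshold, min_frames=2):
    """Return (start, end) frame-index pairs where magnitude stays above threshold.

    Runs shorter than `min_frames` are discarded as noise.
    """
    events, start = [], None
    for i, mag in enumerate(flow_magnitudes):
        if mag > threshold and start is None:
            start = i                      # a run begins
        elif mag <= threshold and start is not None:
            if i - start >= min_frames:    # a run ends; keep it if long enough
                events.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the clip.
    if start is not None and len(flow_magnitudes) - start >= min_frames:
        events.append((start, len(flow_magnitudes) - 1))
    return events


magnitudes = [0.1, 0.2, 1.5, 1.8, 1.2, 0.1, 0.9, 0.1, 1.1, 1.3]
events = detect_events(magnitudes, threshold=0.8)
```

Here the single-frame spike at index 6 is rejected, while the two longer runs are kept as candidate blinks. Telling half blinks from full blinks would need a further rule or classifier on top of these segments.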


2606.05455 2026-06-05 cs.CV 版本更新

Disentangled Fine-Grained Prototype Learning for Incomplete Image-Tabular Classification

面向不完整图像-表格分类的解缠细粒度原型学习

Feixiang Zhou, Jianyang Xie, Zhuangzhi Gao, Qinkai Yu, Fu Wang, Yuheng Fan, Jing Li, Zheheng Jiang, Yitian Zhao, Yanda Meng, He Zhao, Gregory Y. H. Lip, Yalin Zheng

发表机构 * School of Eye and Vision Sciences, University of Liverpool, U.K.（利物浦大学眼科与视觉科学学院）； Department of Cardiovascular and Metabolic Medicine, University of Liverpool, U.K.（利物浦大学心血管与代谢医学系）； School of Computer Science, University of Exeter, U.K.（埃克塞特大学计算机科学学院）； School of Computer Science and Engineering, South China University of Technology, China（华南理工大学计算机科学与工程学院）； School of Computing and Mathematical Sciences, University of Leicester, U.K.（莱斯特大学计算科学与数学科学学院）； Ningbo Institute of Industrial Technology, Chinese Academy of Sciences, China（中国科学院宁波工业技术研究所）； Bioengineering Program, Biological and Environmental Science and Engineering Division (BESE), King Abdullah University of Science and Technology (KAUST), Saudi Arabia（阿卜杜拉国王科技大学（KAUST）生物与环境科学与工程学部生物工程项目，沙特阿拉伯）

AI总结针对图像-表格多模态学习中缺失模态问题，提出DFPL框架，通过共享-特定原型建模、原型级解缠和细粒度对齐，实现鲁棒分类。

详情

AI中文摘要

缺失模态问题在广泛的多媒体应用中（包括产品理解、推荐系统和医疗诊断）对图像-表格多模态学习构成了重大挑战。当两种模态高度异质时，这一挑战尤为突出，因为图像和表格属性在语义粒度和数据分布上存在显著差异。现有方法通过对全局令牌平均特征进行解缠和对齐来学习模态不变表示，仅捕获粗粒度的跨模态一致性，忽略了细粒度的语义和分布错位，这阻碍了在缺失模态下利用互补线索。为了解决这个问题，我们提出了DFPL，一种用于细粒度原型学习的新框架。具体来说，共享-特定原型建模（SSPM）提取紧凑且多样化的共享和模态特定原型，并进一步执行原型级解缠以抑制冗余的模态内相关性。此外，我们提出了一个原型引导的细粒度对齐（PFA）模块，该模块在统一的原型空间内联合强制执行原型级分布匹配和原型到类别的语义对齐，从而跨模态保留细粒度的分布和语义一致性。我们还引入了一个类别感知的多尺度聚合（CMA）模块，从全局和原型级别自适应地聚合共享语义和模态特定特征，以实现鲁棒的预测。在三个不同的图像-表格基准上的大量实验表明，我们的方法在各种缺失模态设置下优于先前的方法。代码将公开提供。

英文摘要

The missing-modality problem poses a significant challenge in image-tabular multimodal learning across a wide range of multimedia applications, including product understanding, recommendation systems, and medical diagnosis. This challenge is particularly pronounced when the two modalities are highly heterogeneous, as images and tabular attributes differ substantially in their semantic granularity and data distributions. Existing methods learn modality-invariant representations through disentanglement and alignment over global token-averaged features, capturing only coarse cross-modal consistency and overlooking fine-grained semantic and distributional misalignment, which hampers the exploitation of complementary cues under missing modalities. To address this, we propose DFPL, a novel framework for fine-grained prototype learning. Specifically, Shared-Specific Prototype Modeling (SSPM) extracts compact and diverse shared and modality-specific prototypes, and further performs prototype-level disentanglement to suppress redundant intra-modality correlations. Additionally, we propose a Prototype-guided Fine-grained Alignment (PFA) module that jointly enforces prototype-level distribution matching and prototype-to-class semantic alignment within a unified prototype space, thereby preserving both fine-grained distributional and semantic consistency across modalities. We further introduce a Class-aware Multi-scale Aggregation (CMA) module to adaptively aggregate shared semantics and modality-specific characteristics from global and prototype levels for robust predictions. Extensive experiments on three diverse image-tabular benchmarks demonstrate the superiority of our method compared to the previous approaches under various missing-modality settings. Code will be made publicly available.


2606.05437 2026-06-05 cs.RO cs.CV 版本更新

Uncertainty-Aware Adaptive Sensor Fusion for Autonomous Navigation

不确定性感知的自适应传感器融合用于自主导航

Simegnew Yihunie Alaba, Yuichi Motai

发表机构 * IEEE

AI总结提出一种结合无迹卡尔曼滤波（UKF）的混合深度学习方法，通过不确定性感知的自适应融合视觉和惯性特征，提高自主导航中视觉惯性里程计（VIO）的位姿估计精度。

Comments 13 pages

详情

AI中文摘要

本文介绍了一种混合深度学习方法，与无迹卡尔曼滤波（UKF）相结合，以增强自主导航中视觉惯性里程计（VIO）的位姿估计精度。所提出的模型采用视觉变换器（ViT）网络有效捕获惯性测量单元（IMU）数据的时间依赖性，并利用多尺度卷积神经网络（MCNN）从视觉数据中学习基于光流的运动线索。自适应传感器融合模块通过利用估计的不确定性动态加权IMU和视觉特征，从而在多样且具有挑战性的环境条件下提高鲁棒性。此外，提出了一种新颖的不确定性感知损失函数，将预测不确定性明确纳入学习过程，使得在噪声、不完整或不可靠的传感器输入下实现鲁棒且准确的导航。在KITTI数据集上的全面评估表明，所提出的方法显著优于基线方法，在绝对轨迹误差（ATE）和相对位姿误差（RPE）方面实现了优越性能。该轻量且计算高效的模型在NVIDIA A100 GPU上以155 FPS处理数据，非常适合部署在资源受限的自主系统中。

英文摘要

This work introduces a hybrid deep learning approach integrated with an Unscented Kalman Filter (UKF) to enhance pose estimation accuracy in Visual-Inertial Odometry (VIO) for autonomous navigation. The proposed model employs a Vision Transformer (ViT) network to effectively capture temporal dependencies from inertial measurement unit (IMU) data and utilizes a Multiscale Convolutional Neural Network (MCNN) to learn optical flow-based motion cues from visual data. An adaptive sensor fusion module dynamically weights IMU and visual features by leveraging estimated uncertainty, thus improving robustness in diverse and challenging environmental conditions. Additionally, a novel uncertainty-aware loss function is proposed to explicitly incorporate prediction uncertainty into the learning process, enabling robust and accurate navigation under noisy, incomplete, or unreliable sensor inputs. Comprehensive evaluations on the KITTI dataset demonstrate that the proposed method significantly outperforms baseline approaches, achieving superior performance in terms of Absolute Trajectory Error (ATE) and Relative Pose Error (RPE). The lightweight and computationally efficient model processes data at 155 FPS on an NVIDIA A100 GPU, making it highly suitable for deployment in resource-constrained autonomous systems.
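Two textbook building blocks convey the flavor of uncertainty-weighted fusion and an uncertainty-aware loss: inverse-variance weighting of two estimates, and a Gaussian negative log-likelihood with a predicted log-variance. The abstract does not give the paper's actual fusion module or loss, so both functions below are stand-ins with hypothetical names, not the authors' formulation.

```python
import math


def fuse(visual, inertial, var_visual, var_inertial):
    """Inverse-variance weighting: the less uncertain source gets more weight."""
    w_v = 1.0 / var_visual
    w_i = 1.0 / var_inertial
    return [(w_v * v + w_i * i) / (w_v + w_i) for v, i in zip(visual, inertial)]


def uncertainty_aware_loss(pred, target, log_var):
    """Gaussian NLL up to a constant: 0.5 * exp(-s) * err^2 + 0.5 * s.

    A large predicted log-variance s down-weights the squared error but is
    itself penalized, so the model cannot claim high uncertainty for free.
    """
    terms = [0.5 * math.exp(-s) * (p - t) ** 2 + 0.5 * s
             for p, t, s in zip(pred, target, log_var)]
    return sum(terms) / len(terms)


equal = fuse([1.0], [3.0], var_visual=1.0, var_inertial=1.0)
trust_visual = fuse([1.0], [3.0], var_visual=0.25, var_inertial=1.0)
loss = uncertainty_aware_loss([1.0, 2.0], [1.0, 2.5], [0.0, 0.0])
```

With equal variances the fusion is a plain average; shrinking the visual variance pulls the fused value toward the visual estimate.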


2606.05379 2026-06-05 cs.CV 版本更新

Deep Learning-assisted AMD Staging based on OCT and OCT Angiography

基于OCT和OCT血管成像的深度学习辅助AMD分期

Yukun Guo, Tristan T. Hormel, An-Lun Wu, Liqin Gao, Min Gao, Steven T. Bailey, Yali Jia

发表机构 * Casey Eye Institute, Oregon Health & Science University（俄勒冈健康与科学大学Casey眼科研究所）； Department of Biomedical Engineering, Oregon Health & Science University（俄勒冈健康与科学大学生物医学工程系）； Department of Ophthalmology, Mackay Memorial Hospital（马偕纪念医院眼科）

AI总结利用OCT和OCTA数据，开发并评估基于EfficientNet的深度学习模型，用于自动分级年龄相关性黄斑变性（AMD）严重程度，其中基于生物标志物的模型表现最佳，尤其对早期AMD检测有价值。

详情

AI中文摘要

开发和评估使用光学相干断层扫描（OCT）和OCT血管成像（OCTA）数据自动分级年龄相关性黄斑变性（AMD）严重程度的深度学习模型。研究对象为271名年龄≥50岁、具有不同AMD严重程度的参与者。使用扫频OCTA系统（SOLIX; Visionix/Optovue Inc., CA）获取中央黄斑6×6 mm OCT/OCTA体积。根据AREDS简化严重程度量表，将AMD严重程度分为四个阶段（无AMD、早期AMD、中期AMD和晚期AMD）。开发了三种使用不同输入模态的深度学习模型：（1）来自分割病理特征（包括视网膜液、玻璃膜疣、地图样萎缩（GA）和黄斑新生血管（MNV））的生物标志物图；（2）二维（2D）en face OCT和OCTA投影；（3）三维（3D）OCT/OCTA体积。使用归一化输入、数据增强和五折交叉验证训练基于EfficientNet的架构。分析了来自271名参与者351只眼睛的总共2030个OCT/OCTA体积。所有模型均表现出强大的AMD分期性能，与参考标准具有高度一致性（QWK ≥ 0.83）。基于生物标志物的模型实现了最高的整体性能（QWK = 0.85 ± 0.03，均值±标准差）和最佳的早期AMD检测（F1分数 = 0.59 ± 0.14）。3D模型的性能与2D OCT/OCTA模型相当（QWK = 0.83 ± 0.04 vs. 0.83 ± 0.09），而2D OCT/OCTA模型显示出最高的精确度（0.79 ± 0.06）并最准确地识别出无AMD的眼睛。使用OCT/OCTA数据的深度学习模型可以准确、自动地对AMD严重程度进行分级。在评估的方法中，基于生物标志物的模型提供了最平衡的性能，并对早期AMD检测显示出特别的价值。

英文摘要

This study develops and evaluates deep learning models for automated grading of age-related macular degeneration (AMD) severity using optical coherence tomography (OCT) and OCT angiography (OCTA) data. The cohort comprised 271 participants aged ≥ 50 years with varying AMD severities. Central macular 6 × 6 mm OCT/OCTA volumes were acquired using a swept-source OCTA system (SOLIX; Visionix/Optovue Inc., CA). AMD severity was graded into four stages (No AMD, Early AMD, Intermediate AMD, and Advanced AMD) according to the AREDS simplified severity scale. Three deep learning models were developed using different input modalities: (1) biomarker maps derived from segmented pathological features, including retinal fluid, drusen, geographic atrophy (GA), and macular neovascularization (MNV); (2) two-dimensional (2D) en face OCT and OCTA projections; and (3) three-dimensional (3D) OCT/OCTA volumes. EfficientNet-based architectures were trained using normalized inputs, data augmentation, and five-fold cross-validation. A total of 2,030 OCT/OCTA volumes from 351 eyes of 271 participants were analyzed. All models demonstrated strong AMD staging performance with substantial agreement with the reference standard (QWK ≥ 0.83). The biomarker-based model achieved the highest overall performance (QWK = 0.85 ± 0.03, mean ± standard deviation) and the best detection of early AMD (F1-score = 0.59 ± 0.14). The 3D model achieved performance comparable to the 2D OCT/OCTA model (QWK = 0.83 ± 0.04 vs. 0.83 ± 0.09), while the 2D OCT/OCTA model showed the highest precision (0.79 ± 0.06) and most accurately identified eyes without AMD. Deep learning models using OCT/OCTA data can accurately and automatically grade AMD severity. Among the evaluated approaches, the biomarker-based model provided the most balanced performance and showed particular value for early AMD detection.
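The agreement figures above are quadratic weighted kappa (QWK) values, which penalize a prediction more the further it lands from the reference stage. A minimal sketch of the standard definition follows, assuming stages are encoded as integers 0 to 3; the function name is illustrative and this is not the study's evaluation code.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """QWK = 1 - sum(w * observed) / sum(w * expected), w_ij = (i-j)^2 / (k-1)^2.

    Undefined (division by zero) when both raters use one single class.
    """
    n = len(y_true)
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    hist_true = [sum(row) for row in observed]
    hist_pred = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    num = den = 0.0
    for i in range(n_classes):
        for j in range(n_classes):
            w = (i - j) ** 2 / (n_classes - 1) ** 2
            num += w * observed[i][j]
            # Expected count under independent marginals.
            den += w * hist_true[i] * hist_pred[j] / n
    return 1.0 - num / den


# 0 = No AMD, 1 = Early, 2 = Intermediate, 3 = Advanced
perfect = quadratic_weighted_kappa([0, 1, 2, 3], [0, 1, 2, 3], 4)
reversed_ = quadratic_weighted_kappa([0, 1, 2, 3], [3, 2, 1, 0], 4)
```

Because the weights grow quadratically, confusing Early with Intermediate AMD costs far less than confusing No AMD with Advanced AMD, which suits an ordinal staging task.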


2606.05375 2026-06-05 cs.CV cs.AI 版本更新

Three-Dimensional Retinal Microvasculature Restoration in OCT Angiography

OCT血管造影中的三维视网膜微血管修复

Yukun Guo, Min Gao, Tristan T. Hormel, Steven T. Bailey, Thomas S. Hwang, Yali Jia

发表机构 * Casey Eye Institute, Oregon Health & Science University（俄勒冈健康与科学大学Casey眼科研究所）； Department of Biomedical Engineering, Oregon Health & Science University（俄勒冈健康与科学大学生物医学工程系）

AI总结提出基于EfficientNet-B5编码器和含空间-通道挤压激励模块的解码器的深度学习算法，从单次OCTA体数据恢复毛细血管解剖结构，显著提升图像质量与微血管保真度。

详情

AI中文摘要

光学相干断层扫描血管造影（OCTA）是一种用于成像视网膜微血管的强大技术。然而，由于成像伪影，获取可靠的视网膜血流和视网膜无灌注区域量化具有挑战性。现有方法主要关注噪声抑制、投影伪影去除或信号增强，以改善OCTA在横截面或二维（2D）正面投影中的图像质量，而忽略了内在的三维血管结构。在本研究中，我们提出了一种基于深度学习的算法，用于从单个OCTA体数据中恢复毛细血管解剖血管结构。该网络由EfficientNet-B5编码器和结合了并行空间与通道挤压激励模块的解码器组成，通过跳跃连接保持空间分辨率。使用三个相邻B帧作为输入，预测修复后的中间B帧。我们使用峰值信噪比（PSNR）和结构相似性指数（SSIM）评估模型性能，以多次扫描平均生成的真值作为基准。结果表明，与原始单次OCTA体数据相比，所提模型显著（p < 0.001）提高了图像质量，PSNR为26.16 ± 1.26对比22.23 ± 0.78，SSIM为0.91 ± 0.02对比0.72 ± 0.03。所提模型还显著（p < 0.001）提高了微血管保真度，通过模型输出与真值之间的Dice系数重叠测量，在多个不同血管板层上，2D和3D分别至少提高3.8%和51.2%。

英文摘要

Optical coherence tomographic angiography (OCTA) is a powerful technique for imaging retinal microvasculature. However, acquiring reliable quantification of retinal blood flow and areas of retinal nonperfusion is challenging because of imaging artifacts. Existing methods primarily focus on noise suppression, projection artifact removal, or signal enhancement to improve the image quality of OCTA in cross-sectional or two-dimensional (2D) en face projections, while neglecting the intrinsic three-dimensional vascular architecture. In this study, we propose a deep learning-based algorithm for restoring capillary anatomical vasculature from a single OCTA volume. The network consists of an EfficientNet-B5 encoder and a decoder incorporating concurrent spatial and channel squeeze-and-excitation modules, connected via skip connections to preserve spatial resolution. Three adjacent B-frames are used as input to predict the restored middle B-frame. We evaluated the performance of the model using the peak signal-to-noise ratio (PSNR) and structural similarity index measure (SSIM) against ground truth generated from averaging multiple scans. The results show that the proposed model significantly (both p < 0.001) improved image quality compared with the original single OCTA volume, with a PSNR of 26.16 ± 1.26 vs. 22.23 ± 0.78 and an SSIM of 0.91 ± 0.02 vs. 0.72 ± 0.03. The proposed model also significantly (p < 0.001) improved microvascular fidelity, measured by the Dice coefficient overlap between the model output and ground truth, in both 2D and 3D by at least 3.8% and 51.2%, respectively, across several different vascular slabs.
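Two pieces of the pipeline are easy to make concrete: grouping a volume into triplets of adjacent B-frames, and scoring a restored frame with PSNR. The sketch below assumes intensities scaled to [0, 1] and simply skips the first and last frames, since the abstract does not say how volume edges are handled; the function names are hypothetical.

```python
import math


def three_frame_windows(volume):
    """Return (previous, middle, next) B-frame triplets for interior frames."""
    return [(volume[i - 1], volume[i], volume[i + 1])
            for i in range(1, len(volume) - 1)]


def psnr(reference, restored, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two 2D frames."""
    flat_ref = [v for row in reference for v in row]
    flat_out = [v for row in restored for v in row]
    mse = sum((a - b) ** 2 for a, b in zip(flat_ref, flat_out)) / len(flat_ref)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / mse)


# A toy volume of five 1x2 B-frames.
volume = [[[0.0, 1.0]] for _ in range(5)]
windows = three_frame_windows(volume)
score = psnr([[0.0, 1.0]], [[0.1, 0.9]])
```

A per-pixel error of 0.1 on a unit intensity range gives a PSNR of about 20 dB, which helps put the reported 22.23 to 26.16 dB improvement into perspective.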

2606.05359 2026-06-05 cs.CV 版本更新

Recovering Physically Plausible Human-Object Interactions from Monocular Videos

从单目视频中恢复物理上可信的人-物交互

Dingbang Huang, Etienne Vouga, Qixing Huang, Georgios Pavlakos

发表机构 * University of Texas at Austin（德克萨斯大学奥斯汀分校）； Shanghai Jiao Tong University（上海交通大学）

AI总结提出RePHO方法，通过物理引导的重建框架和强化学习策略，从单目视频中恢复物理上可信的人-物交互，解决了现有方法中的穿透和物体漂浮问题。

Comments CVPR 2026. Project Page: https://dingbang777.github.io/RePHO/

详情

AI中文摘要

在本文中，我们提出了RePHO，一种从单目视频中重建物理上可信的人-物交互（HOI）的方法。现有的基于运动学的方法虽然能产生视觉上合理的运动，但常常导致物理上不合理的伪影，如相互穿透和物体漂浮。为了克服这些问题，我们引入了一个物理引导的重建框架。我们从运动学估计开始，然后通过强化学习（RL）训练一个策略来细化它。该策略被优化以在物理模拟器中重现交互。由于运动学估计通常带有噪声，简单的RL训练可能会失败。因此，我们提出了一种自适应采样策略，具有双重自我更新机制，可以识别具有最丰富信息和最可靠运动学重建的帧。我们的过程逐步提高重建质量，并产生物理一致的HOI序列。我们在两个标准的HOI基准上展示了我们的方法，并在物理合理性指标上取得了比现有方法明显的改进。项目页面：https://dingbang777.github.io/RePHO/

英文摘要

In this paper, we propose RePHO, a method to reconstruct physically plausible human-object interactions (HOI) from monocular videos. While existing kinematic-based approaches produce visually plausible motion, they often result in physically implausible artifacts such as interpenetration and object floating. To overcome these issues, we introduce a physics-guided reconstruction framework. We begin with a kinematic estimate and then refine it by training a policy with reinforcement learning (RL). This policy is optimized to reproduce the interaction in a physics simulator. Because kinematic estimates are typically noisy, naive RL training can fail. Therefore, we propose an adaptive sampling strategy with a dual self-updating mechanism that can identify the frames with the most informative and reliable kinematic reconstruction. Our process progressively improves reconstruction quality and yields physically consistent HOI sequences. We demonstrate our approach on two standard HOI benchmarks and achieve clear improvements in physical plausibility metrics over state-of-the-art methods. Project Page: https://dingbang777.github.io/RePHO/

2606.05354 2026-06-05 cs.CV 版本更新

LightVesselNet: An Ultra-Lightweight Sub-100K Parameter Network for Retinal Blood Vessel Segmentation

LightVesselNet：用于视网膜血管分割的超轻量级亚10万参数网络

Shadman Sobhan, Farhana Jalil

发表机构 * Department of Electrical & Electronic Engineering, Bangladesh University of Engineering and Technology (BUET)（电子与电气工程系，孟加拉国工程与技术大学）

AI总结提出LightVesselNet，一种仅75K参数的紧凑编码器-解码器网络，结合通道与空间注意力、多尺度特征聚合和亚像素上采样，在五个公开数据集上实现与大型模型相当的视网膜血管分割性能，适用于资源受限的临床环境。

详情

AI中文摘要

视网膜血管分割在糖尿病视网膜病变和青光眼的早期检测中起着至关重要的作用。虽然最近的深度学习模型取得了很高的分割精度，但它们通常需要大量的计算资源，使得在边缘设备上的实际部署变得困难。在本文中，我们提出了LightVesselNet，一种专为资源受限环境中的视网膜血管分割设计的高效神经网络。尽管仅包含75K参数，LightVesselNet的性能与更大的模型相比具有竞争力。该网络采用紧凑的编码器-解码器架构，并增强了通道和空间注意力机制、瓶颈处的多尺度特征聚合模块以及解码器中的亚像素上采样策略。专用的边缘残差连接在整个解码过程中保留了精细的血管细节。在五个公开数据集：DRIVE、STARE、CHASEDB1、FIVES和HRF上进行的大量实验，分别获得了0.8189、0.8499、0.8640、0.8634、0.8096的灵敏度分数和0.8070、0.8072、0.8181、0.8649、0.7686的Dice系数。与最先进模型相比，LightVesselNet显示出更高的效率（性能与参数或GFlops之比）。跨数据集评估证实了模型的泛化能力。总体而言，LightVesselNet是低资源临床环境和移动筛查工具中部署的有力候选者。

英文摘要

Retinal blood vessel segmentation plays a vital role in the early detection of diabetic retinopathy and glaucoma. While recent deep learning models have achieved high segmentation accuracy, they typically require heavy computational resources, making real-world deployment on edge devices difficult. In this paper, we propose LightVesselNet, an efficient neural network designed for retinal vessel segmentation in resource-constrained environments. Despite containing only 75K parameters, LightVesselNet performs competitively with much larger models. The network employs a compact encoder-decoder architecture enhanced with channel and spatial attention mechanisms, a multi-scale feature aggregation module at the bottleneck, and a subpixel upsampling strategy in the decoder. A dedicated edge residual connection preserves fine vessel detail throughout decoding. Extensive experiments on five publicly available datasets (DRIVE, STARE, CHASEDB1, FIVES, and HRF) yield sensitivity scores of 0.8189, 0.8499, 0.8640, 0.8634, and 0.8096, and Dice coefficients of 0.8070, 0.8072, 0.8181, 0.8649, and 0.7686, respectively. LightVesselNet shows improved efficiency (performance relative to parameter count or GFLOPs) compared to state-of-the-art models. Cross-dataset evaluation confirms the model's generalisation capability. Overall, LightVesselNet is a strong candidate for deployment in low-resource clinical settings and mobile screening tools.
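
The abstract mentions a subpixel upsampling strategy in the decoder but gives no layer details. A minimal sketch of the usual pixel-shuffle rearrangement, assuming a channel-first `(C*r*r, H, W)` layout and the channel ordering used by common deep learning frameworks (the function name and nested-list format are illustrative only):

```python
def pixel_shuffle(features, r):
    """Rearrange a (C*r*r, H, W) nested list into (C, H*r, W*r).

    Each group of r*r input channels becomes one output channel whose
    spatial resolution is r times larger in both directions.
    """
    channels_in = len(features)
    if channels_in % (r * r) != 0:
        raise ValueError("channel count must be divisible by r*r")
    height, width = len(features[0]), len(features[0][0])
    channels_out = channels_in // (r * r)
    out = [
        [[0.0] * (width * r) for _ in range(height * r)]
        for _ in range(channels_out)
    ]
    for c in range(channels_out):
        for i in range(r):          # row offset inside each r x r output cell
            for j in range(r):      # column offset inside each r x r output cell
                src = features[c * r * r + i * r + j]
                for h in range(height):
                    for w in range(width):
                        out[c][h * r + i][w * r + j] = src[h][w]
    return out
```

Because the rearrangement has no parameters of its own, the only learned cost is the convolution that produces the `C*r*r` channels, which fits a sub-100K parameter budget.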

2606.05347 2026-06-05 cs.CV 版本更新

TopoPult-SSL: Gland-Mask-Free Cross-Device Meibomian Gland Segmentation via Self-Distilled Weak Clinical Priors

TopoPult-SSL: 通过自蒸馏弱临床先验实现无腺体掩膜的跨设备睑板腺分割

Nicolò Savioli, Luca Del Tongo

发表机构 * OdaxAI S.R.L.（OdaxAI公司）； Topcon Group — VISIA Imaging S.R.L.（Topcon集团——VISIA成像公司）

AI总结提出TopoPult-SSL两阶段框架，利用眼睑掩膜和临床元数据作为弱先验，通过自蒸馏实现跨设备睑板腺分割，无需目标腺体掩膜即可达到高精度。

Comments 13 pages, 4 figures, 5 tables

详情

AI中文摘要

每一种新的临床成像设备都会造成域偏移，其中密集的腺体掩膜成本高昂，而廉价的临床信号——眼睑轮廓、Pult分级、形态测量比率——则被常规记录。我们提出TopoPult-SSL，一个用于跨设备睑板腺分割的两阶段框架。第一阶段在训练损失中不使用目标腺体掩膜，仅通过目标眼睑掩膜和临床元数据驱动的四个弱先验锚点来适应源域训练模型。第二阶段，当目标腺体掩膜可用时，通过监督自蒸馏将互补的第一阶段教师模型蒸馏成一个紧凑的学生模型。我们在公共MGD-1k到CAMG研究基准（1000到100张图像，不同设备）上开发并验证了该技术，蒸馏模型达到Dice 0.716±0.006（最佳0.726），单次推理超越UA-MT（0.710）和集成教师（0.720）。无腺体掩膜的第一阶段变体达到精确度0.694，而SAM/MedSAM为0.30-0.34（p<0.001），使得无需密集腺体轮廓即可部署。代码和可复现脚本已发布。

英文摘要

Every new clinical imaging device creates a domain shift where dense gland masks are expensive yet cheap clinical signals -- eyelid outlines, Pult grades, morphometric ratios -- are routinely recorded. We present TopoPult-SSL, a two-stage framework for cross-device meibomian gland segmentation. Stage 1 adapts a source-trained model without target gland masks in the training loss, using four weak-prior anchors driven by target eyelid masks and clinical metadata only. Stage 2, when target gland masks are available, distils complementary Stage-1 teachers into a single compact student via supervised self-distillation. We develop and validate the technique on the public MGD-1k to CAMG research benchmark (1,000 to 100 images, different device), where the distilled model achieves Dice 0.716+/-0.006 (best 0.726), surpassing UA-MT (0.710) and the ensemble teacher (0.720) -- with a single pass. The gland-mask-free Stage-1 variant reaches Precision 0.694 vs. 0.30-0.34 for SAM/MedSAM (p<0.001), enabling deployment without dense gland contouring. Code and reproducibility scripts are released.

2606.05328 2026-06-05 cs.GR cs.AI cs.CV cs.LG 版本更新

NIV: 用于可变字体生成的神经轴变化

Nadav Benedek, Ariel Shamir, Ohad Fried

发表机构 * Reichman University（雷赫曼大学）

AI总结提出NIV方法，通过预测字形轮廓的逐点位移，自动将静态字体转换为支持多轴连续插值的可变字体，并在新构建的数据集上验证其泛化能力。

详情

AI中文摘要

可变字体能够沿语义设计轴（如字重、字宽、倾斜和光学尺寸）实现字形几何的连续变化。然而，从静态字体构建可变字体仍然是一个劳动密集型过程，需要专业的字体设计和对字形变化数据的手动规范。我们引入了NIV（神经轴变化），一种自动将静态字体转换为功能齐全的可变字体的方法。给定字形轮廓和一组期望的设计轴，NIV预测每点的位移。该模型直接操作矢量字形几何，并采用一种新颖的属性嵌入机制，捕获多个轴之间的相互作用，从而在统一框架内实现一致的多轴变化。我们在一个新构建的源自可变Google字体的数据集上训练NIV，该数据集包含超过一百万个变化元组。得到的模型能够泛化到未见过的码点、未见过的字体样式、高复杂度的CJK字形，甚至分布外的手写输入。生成的输出是标准的可变字体文件，支持通过现有渲染引擎进行连续插值。为了促进研究，我们在https://github.com/ndvbd/NIV上发布了数据集、完整的训练和推理实现以及训练好的模型。超越字体排印，我们的方法展示了如何使用神经变形合成具有连续参数变化的结构化几何对象。

英文摘要

Variable fonts enable continuous variation of glyph geometry along semantic design axes such as weight, width, slant, and optical size. However, constructing a variable font from a static font remains a labor-intensive process requiring expert typographic design and manual specification of glyph variation data. We introduce NIV (Neural Axis Variations), a method that automatically converts a static font into a fully functional variable font. Given glyph outlines and a set of desired design axes, NIV predicts per-point displacements. The model operates directly on vector glyph geometry and employs a novel Property Embedding mechanism that captures interactions between multiple axes, enabling consistent multi-axis variation within a unified framework. We train NIV on a newly constructed dataset derived from variable Google Fonts, comprising over one million variation tuples. The resulting model generalizes across unseen code points, unseen font styles, high-complexity CJK glyphs, and even out-of-distribution handwriting inputs. The generated outputs are standard variable font files supporting continuous interpolation via existing rendering engines. To facilitate research, we release the dataset, the complete training and inference implementation, and trained models at https://github.com/ndvbd/NIV. Beyond typography, our approach demonstrates how structured geometric objects with continuous parametric variation can be synthesized using neural deformations.
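
To make "per-point displacements" concrete, the sketch below shows how per-axis deltas move outline points once they are known. It assumes a simplified model with one delta set per axis, peaking at normalized axis value 1.0, and no cross-axis (corner) deltas; real variable fonts use regions with start, peak, and end values, and NIV's Property Embedding is not modelled here. All names are hypothetical.

```python
def apply_axis_deltas(outline, axis_deltas, location):
    """Shift each outline point by location-weighted per-axis deltas.

    outline:     list of (x, y) points of the default glyph.
    axis_deltas: {axis_tag: list of (dx, dy)}, deltas at normalized value 1.0.
    location:    {axis_tag: normalized axis value}, clamped to [0, 1] here.
    """
    result = []
    for index, (x, y) in enumerate(outline):
        for tag, deltas in axis_deltas.items():
            scalar = min(max(location.get(tag, 0.0), 0.0), 1.0)
            dx, dy = deltas[index]
            x += scalar * dx
            y += scalar * dy
        result.append((x, y))
    return result
```

Under this simplified model, a network that predicts `axis_deltas` for a static glyph supplies everything a renderer needs to interpolate continuously along each axis.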

2606.05259 2026-06-05 cs.CV 版本更新

VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding

VideoKR：迈向知识和推理密集型视频理解

Lin Fu, Zheyuan Yang, Yang Wang, Tingyu Song, Arman Cohan, Yilun Zhao

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Stanford University（斯坦福大学）； University of Toronto（多伦多大学）； University of Washington（华盛顿大学）； University of Michigan（密歇根大学）

AI总结提出VideoKR，首个大规模训练语料库，通过人工参与的技能导向生成管道构建315K视频推理示例，增强知识和推理密集型视频理解，并在专家标注基准上验证其有效性。

Comments ICML 2026 Spotlight

详情

AI中文摘要

我们介绍了VideoKR，这是第一个专门设计用于增强知识和推理密集型视频理解的大规模训练语料库。它包含315K个视频推理示例，覆盖145K个新收集的、CC许可的、专家领域的视频。我们开发了一个人工参与的、技能导向的示例生成管道，针对逐步深入的视频推理能力，同时确保示例及其CoT推理的难度、多样性和可靠性。我们还策划了VideoKR-Eval，一个新的专家标注基准，其中的问题需要真正的视频理解和知识密集型推理，而不是文本捷径。我们的实验表明，在标准SFT→GRPO流程下，基于VideoKR后训练的模型在知识密集型视频推理上优于先前的后训练方法，同时在通用视频推理上保持竞争力，突出了数据设计作为视频推理进展的关键驱动因素。我们进一步进行了全面的消融实验，以分离VideoKR的贡献，为未来工作提供可操作的见解。

英文摘要

We introduce VideoKR, the first large-scale training corpus specifically designed to strengthen knowledge- and reasoning-intensive video understanding. It comprises 315K video reasoning examples over 145K newly collected, CC-licensed, expert-domain videos. We develop a human-in-the-loop, skill-oriented example generation pipeline that targets progressively deeper video reasoning capabilities while ensuring the difficulty, diversity, and reliability of both the examples and their CoT rationales. We also curate VideoKR-Eval, a new expert-annotated benchmark where questions require genuine video understanding and knowledge-intensive reasoning rather than textual shortcuts. Our experiments show that, under a standard SFT→GRPO pipeline, models post-trained on VideoKR outperform prior post-training approaches on knowledge-intensive video reasoning while remaining competitive on general video reasoning, highlighting data design as a key driver of progress in video reasoning. We further conduct comprehensive ablations to isolate the contributions of VideoKR, providing actionable insights for future work.

2606.05255 2026-06-05 eess.IV cs.CV cs.GR 版本更新

Oklch+: A Three-Parameter Extension of Oklab for Improved Color Difference Prediction

Oklch+: Oklab的三参数扩展以改进色差预测

Naoyuki Uchida

发表机构 * Independent Researcher（独立研究者）

AI总结提出Oklch+，通过L轴幂变换和C轴Naka-Rushton压缩扩展Oklab，在COMBVD数据集上以三个参数达到与CIEDE2000相当的色差预测精度（STRESS=29.09 vs 29.13），并显著优于Oklab。

Comments 3 figures, 8 tables. Submitted to Color Research & Application

详情

AI中文摘要

Oklab及其圆柱表示Oklch作为感知驱动的颜色空间，在插值和设计工作流程中被广泛采用，但其色差预测精度不如CIEDE2000。我们提出Oklch+，这是Oklab的一个三参数扩展，包括L轴上的幂变换和C轴上的Naka-Rushton压缩，并在变换后的Oklab坐标中计算欧氏距离。Naka-Rushton函数在[0,1]内有界，反映了在高色度值时色度敏感度的饱和特性。在COMBVD（包含跨越六个独立实验数据集的3,813对超阈值色差对）上评估，Oklch+实现了STRESS=29.09，与CIEDE2000（29.13；差异=0.04）紧密匹配，仅使用了针对色差数据优化的三个参数，而CIEDE2000约需17个参数。在保留的BFD-P D65子集（2,028对）上的交叉验证确认了泛化能力（STRESS=26.14），Oklch+显著优于Oklab（51.45），并在保留集上达到与CIEDE2000（24.12）相当的STRESS。在所有六个COMBVD子数据集上均确认了相对于Oklab（47.35）的改进。由于Oklch+定义了一个欧氏距离近似感知距离的坐标系，变换空间中的线性插值相对于Oklab提供了显著改善的感知均匀性。当前评估仅限于以sRGB为中心的COMBVD数据集；在高色度区域使用经验观察者评级的辨别数据进行验证仍是未来工作。

英文摘要

Oklab and its cylindrical representation Oklch are widely adopted in interpolation and design workflows as perceptually motivated color spaces, but their color difference prediction accuracy falls short of CIEDE2000. We propose Oklch+, a three-parameter extension of Oklab comprising a power transformation on the L-axis and a Naka-Rushton compression on the C-axis, with Euclidean distance computed in the resulting transformed Oklab coordinates. The Naka-Rushton function is bounded in [0,1], reflecting the saturating nature of chroma sensitivity at high colorimetric values. Evaluated on COMBVD -- 3,813 suprathreshold color difference pairs spanning six independent experimental datasets -- Oklch+ achieves STRESS = 29.09, closely matching CIEDE2000 (29.13; difference = 0.04), using only three parameters optimized against color difference data compared to approximately 17 for CIEDE2000. Cross-validation on a held-out BFD-P D65 subset (2,028 pairs) confirms generalization (STRESS = 26.14), with Oklch+ substantially outperforming Oklab (51.45) and achieving STRESS comparable to CIEDE2000 (24.12) on the held-out set. Improvement over Oklab (47.35) is confirmed across all six COMBVD sub-datasets. Because Oklch+ defines a coordinate system in which Euclidean distance approximates perceptual distance, linear interpolation in the transformed space offers substantially improved perceptual uniformity relative to Oklab. Current evaluation is limited to the sRGB-centered COMBVD dataset; validation in high-chroma regions with empirical observer-rated discrimination data remains future work.
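
The following is a rough sketch of the kind of transform the abstract describes: a power function on L, a Naka-Rushton compression on C, and Euclidean distance in the transformed coordinates, scored with the standard STRESS index. The parameter names `gamma`, `n`, and `sigma` and their default values are placeholders chosen for illustration; the abstract does not give the fitted values or the exact parameterization.

```python
import math


def oklch_plus_coords(L, a, b, gamma=1.0, n=1.0, sigma=0.2):
    """Map Oklab (L, a, b) to transformed Cartesian coordinates.

    gamma, n, sigma are hypothetical placeholders, not the paper's values.
    """
    chroma = math.hypot(a, b)
    hue = math.atan2(b, a)
    L_t = L ** gamma                      # power transform on lightness
    c_n = chroma ** n
    C_t = c_n / (c_n + sigma ** n)        # Naka-Rushton, bounded in [0, 1)
    return (L_t, C_t * math.cos(hue), C_t * math.sin(hue))


def delta_e(lab1, lab2, **params):
    """Euclidean distance between two colors in the transformed space."""
    return math.dist(oklch_plus_coords(*lab1, **params),
                     oklch_plus_coords(*lab2, **params))


def stress(predicted, visual):
    """STRESS index (0 = perfect agreement up to a scale factor)."""
    f = (sum(e * e for e in predicted)
         / sum(e * v for e, v in zip(predicted, visual)))
    numerator = sum((e - f * v) ** 2 for e, v in zip(predicted, visual))
    denominator = sum((f * v) ** 2 for v in visual)
    return 100.0 * math.sqrt(numerator / denominator)
```

Fitting would then amount to searching over the three parameters for the lowest `stress` between `delta_e` values and the visual differences in a dataset such as COMBVD. `math.dist` requires Python 3.8 or later.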

2606.05254 2026-06-05 cs.LG cs.CV cs.RO 版本更新

Dream.exe: 视频生成模型能否梦想出可执行的机器人操作？

Rui Zhao, Kaiming Yang, Jifeng Zhu, Siyang Chen, Ziqi Wang, Weijia Wu, Kevin Qinghong Lin, Heng Wang, Mike Zheng Shou

发表机构 * Show Lab, National University of Singapore（新加坡国立大学Show实验室）； University of Oxford（牛津大学）； Tencent（腾讯）

AI总结提出Dream.exe评估框架，通过视频到执行流水线测试视频生成模型产生的运动能否转化为可执行的机器人操作，发现视觉质量不能预测可执行性。

详情

AI中文摘要

视频生成模型在合成视觉上引人注目的内容方面取得了令人印象深刻的进展，但其输出仍然局限于虚拟领域。一个自然的问题随之而来：当这些模型生成的视频离开屏幕进入现实时，它们对物理世界的反映有多好？我们提出机器人操作作为这个问题的具体、可测量的窗口：如果一个模型真正内化了物理定律，它所描绘的运动应该转化为可执行的机器人行为。我们引入了Dream.exe，一个通过视频到执行流水线来操作这一标准的评估框架。给定一个场景图像和任务描述，Dream.exe合成一个操作视频，将生成的运动转换为机器人轨迹，并在物理模拟器中执行，产生纯视觉指标无法提供的接地信号。使用这个流水线，我们评估了8个模型，涵盖前沿闭源生成器、开源生成器和机器人专用模型。我们的基准测试包括101个手动策划的操作任务，分为三个物理复杂度级别，通过视觉质量、轨迹保真度和执行成功率进行测量。令人鼓舞的是，几个模型取得了可测量的执行成功率，表明从互联网规模数据中学习的生成先验已经编码了有意义的物理知识。然而，视觉质量被证明是执行性的差预测器，暴露了标准视觉评估未捕获的模型能力维度。Dream.exe将在https://github.com/showlab/Dream.exe开源。

英文摘要

Video generation models have made impressive strides in synthesizing visually compelling content, yet their outputs remain confined to the virtual domain. A natural question follows: how well do these models reflect the physical world when their generated videos leave the screen and enter reality? We propose robotic manipulation as a concrete, measurable window onto this question: if a model has truly internalized physical laws, the motion it depicts should translate into executable robot behavior. We introduce Dream.exe, an evaluation framework that operationalizes this criterion through a video-to-execution pipeline. Given a scene image and a task description, Dream.exe synthesizes a manipulation video, converts the generated motion into robot trajectories, and executes them in a physics simulator, yielding a grounding signal that purely visual metrics cannot offer. Using this pipeline, we evaluate 8 models spanning frontier closed-source generators, open-source generators, and robot-specific models. Our benchmark covers 101 manually curated manipulation tasks at three levels of physical complexity, measured across visual quality, trajectory fidelity, and execution success. Encouragingly, several models achieve measurable execution success, suggesting that generative priors learned from internet-scale data already encode meaningful physical knowledge. Yet visual quality proves a poor predictor of executability, exposing a dimension of model capability that standard visual evaluations do not capture. Dream.exe will be open-sourced at https://github.com/showlab/Dream.exe.

2606.03998 2026-06-05 eess.SP cs.CV 版本更新

TGSD: Topology-Guided State-Space Diffusion Framework for EEG Spatial Super-Resolution

TGSD: 拓扑引导的状态空间扩散用于EEG空间超分辨率

Zijian Kang, Weiming Zeng, Yueyang Li, Shengyu Gong, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

发表机构 * Lab of Digital Image and Intelligent Computation, Shanghai Maritime University（数字图像与智能计算实验室，上海海事大学）； Department of Language Science and Technology, The Hong Kong Polytechnic University（语言科学与技术系，香港理工大学）； Affiliated Lianyungang Hospital of Xuzhou Medical University（徐州医科大学附属连云港医院）

AI总结提出TGSD框架，通过拓扑引导的状态空间扩散模型，利用分层空间先验编码器和条件状态空间扩散重建器，从低密度EEG恢复高密度信号，在SEED和PhysioNet MM/I数据集上优于基线方法。

详情

AI中文摘要

低密度EEG更适合可穿戴和基于物联网的大脑传感，但稀疏的电极采样通常缺乏足够的空间信息来表征跨区域的神经活动。EEG空间超分辨率旨在从稀疏记录中恢复密集通道EEG，但由于通道缺失通常发生在整个通道级别，全电极布局上的时空依赖性往往未被充分探索，且从稀疏到密集信号的映射本质上具有模糊性，因此仍然具有挑战性。为了解决这些问题，我们提出了TGSD，一种用于EEG空间超分辨率的拓扑引导状态空间扩散框架。TGSD首先采用分层空间先验编码器，通过整合局部几何关系与区域级上下文信息，学习完整电极布局上的拓扑感知先验。基于这些先验和稀疏观测，条件状态空间扩散重建器通过反向扩散逐步生成缺失通道信号，同时交替进行时间和通道维度的状态空间建模，在统一框架中捕捉长程时间动态和通道间依赖性。在SEED和PhysioNet MM/I数据集上的实验表明，TGSD在不同超分辨率因子下，在重建保真度和下游分类性能方面均持续优于代表性基线。这些结果证明了将拓扑感知空间先验与条件扩散相结合，在可穿戴和物联网场景中增强实用低密度EEG传感的有效性。官方实现代码可在https://github.com/jtggz/TGSD获取。

英文摘要

Low-density EEG is more suitable for wearable and IoT-based brain sensing, but sparse electrode sampling often lacks sufficient spatial information to characterize cross-regional neural activity. EEG spatial super-resolution aims to recover dense-channel EEG from sparse recordings, yet remains challenging because channel missingness typically occurs at the whole-channel level, spatiotemporal dependencies over the full electrode layout are often underexplored, and the mapping from sparse to dense signals is inherently ambiguous. To address these issues, we propose TGSD, a topology-guided state-space diffusion framework for EEG spatial super-resolution. TGSD first employs a Hierarchical Spatial Prior Encoder to learn topology-aware priors over the complete electrode layout by integrating local geometric relationships with region-level contextual information. Based on these priors and sparse observations, a Conditional State-Space Diffusion Reconstructor progressively generates missing-channel signals through reverse diffusion, while alternating temporal and channel-wise state-space modeling captures long-range temporal dynamics and inter-channel dependencies in a unified framework. Experiments on the SEED and PhysioNet MM/I datasets show that TGSD consistently outperforms representative baselines under different super-resolution factors in both reconstruction fidelity and downstream classification performance. These results demonstrate the effectiveness of combining topology-aware spatial priors with conditional diffusion for enhancing practical low-density EEG sensing in wearable and IoT scenarios. The official implementation code is available at https://github.com/jtggz/TGSD.

2606.03730 2026-06-05 cs.CV 版本更新

统一驾驶令牌：面向驾驶世界模型和规划的表示与几何引导的离散分词器

Ziyang Yao, Zeyu Zhu, YunCheng Jiang, Zibin Guo, Huijing Zhao

发表机构 * Peking University（北京大学）； Xiaomi EV（小米电动车）

AI总结提出一种表示引导与几何增强的离散分词器，通过联合监督学习紧凑令牌，同时优化重建保真度、表示一致性和规划性能。

详情

AI中文摘要

离散视觉令牌应为基于令牌的世界建模和自动驾驶规划提供紧凑表示。然而，大多数分词器继承自图像生成，主要针对像素重建进行优化，这可能导致易于生成的内容与对驾驶决策有用的解码内容之间存在差距。我们提出了一种表示引导和几何增强的分词器，在联合监督下学习离散令牌。该分词器通过特征解码将其离散瓶颈与冻结的DINO特征空间对齐，同时通过感知损失和对抗损失的RGB重建保留外观。为了注入几何状态相关线索，我们在训练期间添加了相邻帧深度和相对姿态监督，并通过多码本量化稳定联合目标。我们使用轻量级规划读出和GPT风格的下一个令牌世界模型评估相同的学习令牌。在NAVSIM上的实验表明，在固定解码器下，重建保真度和表示一致性得到改善，规划性能具有竞争力，并且在匹配设置下生成质量更好。

英文摘要

Discrete visual tokens should provide a compact representation for both token-based world modeling and planning in autonomous driving. However, most tokenizers are inherited from image generation and are optimized mainly for pixel reconstruction, which may leave a gap between what is easy to generate and what is useful to decode for driving decisions. We present a representation-guided and geometry-enhanced tokenizer that learns discrete tokens under joint supervision. The tokenizer aligns its discrete bottleneck with a frozen DINO feature space through feature decoding, while preserving appearance via RGB reconstruction with perceptual and adversarial losses. To inject geometric state-related cues, we add adjacent-frame depth and relative-pose supervision during training and stabilize joint objectives with multi-codebook quantization. We evaluate the same learned tokens with a lightweight planning readout and a GPT-style next-token world model. Experiments on NAVSIM show improved reconstruction fidelity and representation consistency, competitive planning performance under a fixed decoder, and better generative quality under matched settings.
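
The abstract names multi-codebook quantization without specifying the variant. As one possible reading, the sketch below assumes a product-style scheme in which the feature vector is split into equal chunks and each chunk is snapped to the nearest entry of its own codebook; a residual scheme would be an equally plausible alternative. The function name and data layout are hypothetical.

```python
def quantize_multi_codebook(vector, codebooks):
    """Split a vector into len(codebooks) chunks and quantize each separately.

    Returns the chosen code index per codebook and the concatenated codes.
    Assumes len(vector) is divisible by len(codebooks) and that every code
    in codebook k has the same length as chunk k.
    """
    chunk = len(vector) // len(codebooks)
    indices, quantized = [], []
    for k, codebook in enumerate(codebooks):
        part = vector[k * chunk:(k + 1) * chunk]
        # Nearest neighbour by squared Euclidean distance.
        best = min(
            range(len(codebook)),
            key=lambda i: sum((p - c) ** 2 for p, c in zip(part, codebook[i])),
        )
        indices.append(best)
        quantized.extend(codebook[best])
    return indices, quantized
```

With K codebooks of size N, a token can take N^K distinct values while each lookup only searches N entries, which is one reason such schemes are used to stabilize training with several competing objectives.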

2606.01822 2026-06-05 cs.CV 版本更新

Hierarchically Decoupled Mixture-of-Experts for Robust Traffic Sign Recognition in Complex Driving Scenarios

用于复杂驾驶场景中鲁棒交通标志识别的分层解耦混合专家模型

Mingxiao Wang, Xiaozhen Qu, Bolin Gao, Tong Wang, Lei He

发表机构 * School of Automotive and Traffic Engineering, Liaoning University of Technology（辽宁工业大学汽车与交通工程学院）； State Key Laboratory of Intelligent Green Vehicles and Mobility, School of Vehicle and Mobility, Tsinghua University（智能绿色车辆与交通全国重点实验室，清华大学车辆与运载学院）

AI总结提出分层解耦异构混合专家框架CBDES MoE TSR，通过图像级动态路由机制选择最优专家模型，在复合交通标志数据集上mAP50-95达76.8%，比基线（74.5%）提升2.3个百分点，且计算开销降低约39.4%。

Comments 9 figures, 3 tables

详情

AI中文摘要

交通标志检测是自动驾驶和智能交通系统中环境感知的基本组成部分。然而，现有大多数检测器依赖具有全局共享参数的静态推理，限制了其适应多样化和非结构化交通场景的能力。因此，单个静态模型通常难以同时处理清晰的近距样本和诸如远距离小目标或恶劣天气环境等挑战性条件。为解决这一局限，我们提出了CBDES MoE TSR，一种用于交通标志识别的分层解耦异构混合专家（MoE）框架。该框架通过引入异构YOLO专家池和轻量级门控网络，摆脱了传统的全局共享参数范式，实现了图像级动态路由机制。基于输入图像的语义特征，门控模块从专家池中选择性激活最合适的专家模型，实现从固定参数拟合到按需动态表示的转变。这种设计增强了特定场景下的特征提取能力，同时保持了可控的推理开销。实验结果表明，所提方法在复合交通标志数据集上实现了检测精度与效率的良好平衡。具体而言，我们的方法达到了76.8%的mAP50-95，相比基线方法（74.5%）提升了2.3个百分点，同时计算开销降低了约39.4%。这些结果验证了所提方法的有效性。

英文摘要

Traffic sign detection is a fundamental component of environmental perception in autonomous driving and intelligent transportation systems. However, most existing detectors rely on static inference with globally shared parameters, limiting their ability to adapt to diverse and unstructured traffic scenarios. As a result, a single static model often struggles to simultaneously handle both clear near-range samples and challenging conditions such as distant small targets or adverse weather environments. To address this limitation, we propose CBDES MoE TSR, a hierarchically decoupled heterogeneous mixture-of-experts (MoE) framework for traffic sign recognition. The proposed framework departs from the conventional globally shared parameter paradigm by introducing a heterogeneous You Only Look Once (YOLO) expert pool together with a lightweight gating network, enabling an image-level dynamic routing mechanism. Based on the semantic characteristics of the input image, the gating module selectively activates the most suitable expert model from the expert pool, enabling a shift from fixed parameter fitting to on-demand dynamic representation. This design enhances feature extraction capability for specific scenarios while maintaining controlled inference overhead. Experimental results demonstrate that the proposed method achieves a favorable balance between detection accuracy and efficiency on the composite traffic sign dataset. Specifically, our method attains an mAP50-95 of 76.8%, a 2.3-percentage-point improvement over the baseline method (74.5%), while simultaneously reducing computational overhead by approximately 39.4%. These findings support the effectiveness of the proposed approach.
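
Image-level routing of the kind described above can be sketched as a gate that scores each expert once per image and runs only the top-scoring one. The linear gate, the feature vector, and the stand-in expert callables below are assumptions for illustration; the abstract does not describe the gating network's actual architecture.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_image(image_features, gate_weights, experts):
    """Select and run a single expert for one image (top-1 routing).

    image_features: list of floats summarizing the whole image.
    gate_weights:   one weight row per expert (a hypothetical linear gate).
    experts:        list of callables, e.g. detectors of different sizes.
    """
    logits = [
        sum(w * f for w, f in zip(row, image_features))
        for row in gate_weights
    ]
    probs = softmax(logits)
    best = max(range(len(experts)), key=lambda k: probs[k])
    # Only the chosen expert runs, so per-image cost is one detector
    # plus the small gate.
    return best, probs, experts[best](image_features)
```

Because only one expert executes per image, average compute depends on how often the gate picks the lighter experts, which is consistent with the reported reduction in overhead alongside higher accuracy.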

2606.01113 2026-06-05 cs.CV 版本更新

R^3: Composed Video Retrieval via Reasoning-Guided Recalling and Re-ranking

R^3: 基于推理引导的召回与重排序的组合视频检索

Zixu Li, Yupeng Hu, Zhiheng Fu, Zhiwei Chen, Weili Guan, Liqiang Nie

发表机构 * Shandong University（山东大学）； Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳））

AI总结提出R^3零样本组合视频检索流程，通过生成推理轨迹增强查询表示，并融合重排序验证候选视频，有效解决源视频与编辑指令组合检索的挑战。

详情

AI中文摘要

CoVR-R挑战评估组合视频检索，系统需根据参考视频和文本编辑指令从大型图库中检索目标视频。该设置不是标准的视频-文本检索问题：查询由源视频中的视觉证据和编辑隐含的变换共同定义。强嵌入模型可提供可扩展的候选召回，但可能无法充分表达目标侧后果，如状态变化、动作替换、对象保留或时间一致性。成对多模态重排序器可直接验证此类细节，但全面重排序整个图库在计算上不可行。我们提出R^3，一个基于推理引导的召回与重排序的零样本组合视频检索流程。核心思想是将源-编辑查询转化为推理基础的检索程序，而非将编辑文本视为短标题。首先，模型生成推理轨迹，描述应用编辑后预期的目标视频。然后，将轨迹与源视频一起编码为推理增强查询，并通过一致性门控残差规则与基础组合查询的检索分数融合。最后，重排序器通过直接源-候选比较验证召回候选。实验证明了我们方法在应对该挑战中的有效性。代码可在https://github.com/Lee-zixu/R-3获取。

英文摘要

The CoVR-R challenge evaluates composed video retrieval, where a system must retrieve a target video from a large gallery given a reference video and a textual edit instruction. This setting is not a standard video-text retrieval problem: the query is defined by both the visual evidence in the source video and the transformation implied by the edit. A strong embedding model can provide scalable candidate recall, but it may under-express target-side consequences such as state changes, action replacement, object preservation, or temporal consistency. A pairwise multimodal reranker can verify such details more directly, but exhaustive reranking over the full gallery is computationally infeasible. We present $\mathbb{R}^3$, a zero-shot composed video retrieval pipeline built around Reasoning-guided Recalling and Reranking. The core idea is to turn the source-edit query into a reasoning-grounded retrieval program rather than treating the edit text as a short caption. First, the model generates a reasoning trace that describes the expected target video after applying the edit. Then the trace is encoded together with the source video as a reasoning-augmented query, and its retrieval score is fused with the base composed query through an agreement-gated residual rule. Finally, a reranker verifies the recalled candidates with direct source-candidate comparison. Experiments demonstrate the effectiveness of our method in addressing this challenge. Code is available at https://github.com/Lee-zixu/R-3.

2606.00616 2026-06-05 cs.CV cs.AI 版本更新

Pause and Think: A Dataset and Benchmark for Video-Grounded Assistive Action Suggestion

暂停与思考：面向视频基础辅助动作建议的数据集与基准

Shivam Singh, Saptarshi Majumder, Pratik Prabhanjan Brahma, Zicheng Liu, Emad Barsoum

发表机构 * Advanced Micro Devices, Inc.（超威半导体公司）

AI总结提出 pause-and-think-T 数据集和 pause-and-think-B 基准，通过推理监督训练紧凑模型，在视频场景理解与目标规划任务中达到与大型模型相当的性能。

详情

AI中文摘要

最近的视觉语言模型（VLM）在视频中的基础推理、时间一致性和上下文感知规划方面存在困难。我们引入了 pause-and-think-T，一个以推理为中心的训练数据集，鼓励模型暂停、基于视觉证据进行推理，并生成简洁、可操作的响应。该数据集在生成答案之前促进结构化推理，引导模型走向类人、基于场景的辅助。我们微调了一个紧凑的 4B 参数模型，并在面向上下文理解和目标规划任务的 pause-and-think-B 基准上对其进行了评估。该模型在参数比 Qwen3-VL-235B（58.9%）少 59 倍的情况下达到了 58.0% 的准确率，在场景理解上与 GPT-5.2 匹配，并超越了 GPT-4o。除了我们的基准之外，该模型在 EgoThink 和 TempCompass 上也表现出强大的分布外性能，在可操作性、辅助性、属性识别、情境推理和时间顺序方面取得了显著提升，且无需特定基准训练。我们的结果表明，有针对性的推理监督使紧凑模型能够提供可操作的、基于视觉的指导，同时泛化到训练数据之外，而无需进行大规模模型扩展。

英文摘要

Recent Vision-Language Models (VLMs) struggle with grounded reasoning, temporal consistency, and context-aware planning in videos. We introduce pause-and-think-T, a reasoning-centric training dataset that encourages models to pause, reason over visual evidence, and produce concise, actionable responses. The dataset promotes structured reasoning prior to answer generation, guiding models toward human-like, scene-grounded assistance. We fine-tune a compact 4B-parameter model and evaluate it on our pause-and-think-B benchmark targeting contextual understanding and goal planning tasks. The model achieves 58.0% accuracy with 59x fewer parameters than Qwen3-VL-235B (58.9%), matching GPT-5.2 on scene understanding and surpassing GPT-4o. Beyond our benchmark, it also shows strong out-of-distribution performance on EgoThink and TempCompass, with substantial gains in affordance, assistance, attribution recognition, situated reasoning, and temporal order, without benchmark-specific training. Our results indicate that targeted reasoning supervision enables compact models to deliver actionable, visually grounded guidance while generalizing beyond training data, without requiring large-scale model expansion.

2606.00522 2026-06-05 cs.CV 版本更新

A Trajectory-Driven Spatio-Temporal Refinement Solution for CVPR 2026 8th UG2+ Challenge Track 3: DOST

面向 CVPR 2026 第八届 UG2+ 挑战赛赛道三（DOST）的轨迹驱动时空细化解决方案

Hongzhen Li, Miao Yu, Leilei Cao, Youwei Pan, Yingfang Zhu, Fengjie Zhu

发表机构 * TEX AI, Transsion Holdings（TEX AI，传音控股）

AI总结基于 SegAnyMo 框架，通过数据域自适应和时空后处理模块，提升严重大气畸变下的动态目标分割性能，在挑战赛中获第二名。

详情

AI中文摘要

在这项工作中，我们提出了针对第八届 UG2+ 挑战赛（CVPR 2026）赛道三：湍流中动态目标分割（DOST）的解决方案。我们的方法建立在强大的基线框架 Segment Any Motion (SegAnyMo) 之上，该框架提供了强大的掩码生成和运动跟踪能力。为了进一步提升在严重大气畸变下的分割性能，我们提出了两个关键改进。首先，我们采用以数据为中心的域自适应策略。通过从 DAVIS 数据集和 DOST 数据集的子集中选取序列，并结合模拟大气波动退化，显著扩展了训练数据，增强了模型对复杂几何畸变的鲁棒性。其次，我们引入了时空后处理模块。该细化步骤有效去除了持续存在的边界连接假前景和短时碎片噪声，同时严格保留了真实小目标并保持帧间的原始个体标签。通过上述组合策略，我们的方法在挑战赛中获得了第二名。

英文摘要

In this work, we present our solution for the 8th UG2+ Challenge (CVPR 2026) Track 3: Dynamic Object Segmentation in Turbulence (DOST). Our method is built upon the strong baseline framework Segment Any Motion (SegAnyMo), which provides powerful mask generation and motion tracking capabilities. To further boost the segmentation performance under severe atmospheric distortions, we propose two key improvements. First, we employ a data-centric domain adaptation strategy. We significantly expand our training data by incorporating selected sequences from the DAVIS dataset alongside a subset of the DOST dataset, and apply simulated atmospheric fluctuation degradations to enhance the model's robustness against complex geometric distortions. Second, we introduce a spatio-temporal post-processing module. This refinement step effectively removes persistent boundary-connected false foregrounds and short-lived fragmented noise, while strictly preserving genuine small targets and maintaining original individual labels across frames. With these combined strategies, our proposed method ranks second in the challenge.

2605.30819 2026-06-05 cs.CV cs.GR 版本更新

Function2Scene: 3D Indoor Scene Layout from Functional Specifications

Function2Scene: 基于功能规范的3D室内场景布局

Ruiqi Wang, Qimin Chen, Daniel Ritchie, Angel X. Chang, Manolis Savva, Kai Wang, Hao Zhang

发表机构 * Simon Fraser University（西蒙弗雷泽大学）； Brown University（布朗大学）

AI总结提出Function2Scene框架，通过解析自然语言设计简报中的用户角色和活动，从17个功能约束准则生成布局，并利用LLM和VLM的迭代检查-修复循环优化，在30个专业案例中94.3%的成对比较优于基线方法。

Comments project page: https://function2scene.github.io/

详情

AI中文摘要

大多数文本驱动的3D室内场景合成方法从以物体为中心的提示生成房间，询问应放置什么家具而不是如何使用空间。然而，在实际室内设计中，布局的好坏取决于其对居住者的支持程度，例如他们的活动和身体需求。我们引入了Function2Scene，一个从功能规范（即描述谁将使用房间以及他们需要在那里做什么的自然语言设计简报）生成3D室内布局的框架。给定这样的规范，我们的系统解析居住者角色和活动，从涵盖空间、人体工程学、活动和环境考虑的17个标准分类中导出一组定制的功能设计约束，并使用这些约束来指导布局生成。Function2Scene不依赖LLM直接生成最终场景，而是通过工具增强的检查-修复循环进行迭代评估和细化，结合几何测量、基于LLM的上下文推理和基于VLM的视觉评估。在30个专业编写的室内设计案例上的实验表明，Function2Scene生成的布局比最近的基于LLM的场景合成基线更好地满足功能需求，我们的结果在94.3%的成对比较中被偏好。我们的工作将文本驱动的室内场景合成从放置合理的物体重新定义为设计支持人类使用的空间。

英文摘要

Most text-driven 3D indoor scene synthesis methods generate rooms from object-centric prompts, asking what furniture should be placed rather than how the space is used. Yet in real interior design, a layout is judged by how well it supports its occupants, e.g., their activities and physical needs. We introduce Function2Scene, a framework for generating 3D indoor layouts from functional specifications, i.e., natural-language design briefs describing who will use a room and what they need to do there. Given such a specification, our system parses occupant personas and activities, derives a customized set of functional design constraints from a taxonomy of 17 criteria spanning spatial, ergonomic, activity, and environmental considerations, and uses these constraints to guide layout generation. Rather than relying on an LLM to directly produce a final scene, Function2Scene performs iterative evaluation and refinement through a tool-augmented check-and-repair loop, combining geometric measurements, LLM-based contextual reasoning, and VLM-based visual assessment. Experiments on 30 professionally written interior-design cases show that Function2Scene produces layouts that better satisfy functional requirements than recent LLM-based scene synthesis baselines, with our results preferred in 94.3% of pairwise comparisons. Our work reframes text-driven indoor scene synthesis from placing plausible objects to designing spaces that support human use.


2605.30467 2026-06-05 cs.CV 版本更新

无需训练的高分辨率sinogram补全

Jiaze E, Srutarshi Banerjee, Tekin Bicer, Guannan Wang, Yanfu Zhang, Bin Ren

发表机构 * William & Mary（威廉玛丽学院）； Argonne National Laboratory（阿贡国家实验室）； University of Chicago（芝加哥大学）

AI总结本文提出了一种无需训练的高效扩散推理方法HRSino，用于高分辨率sinogram补全，通过在空间区域和分辨率上自适应分配推理计算量，在保持补全精度的同时降低峰值内存占用和推理时间。

详情

AI中文摘要

高分辨率sinogram补全对于计算机断层扫描（CT）重建至关重要，因为缺失的投影可能会引入严重的伪影。尽管扩散模型为该任务提供了强大的生成先验，但其推理成本随分辨率的提高而急剧增长，以至难以承受。我们提出HRSino，一种无需训练且高效的扩散推理方法，用于高分辨率sinogram补全。通过显式考虑信号特性中的空间异质性（如频谱稀疏性和局部复杂性），HRSino在空间区域和分辨率上自适应地分配推理计算量，而不是统一地执行高分辨率扩散步骤。这样可以在粗尺度上捕捉全局一致性，同时仅在必要之处细化局部细节。实验结果表明，与最先进的框架相比，HRSino将峰值内存使用量最多减少30.81%，推理时间最多减少17.58%，并在不同数据集和分辨率上保持补全精度。

英文摘要

High-resolution sinogram completion is critical for computed tomography reconstruction, as missing projections can introduce severe artifacts. While diffusion models provide strong generative priors for this task, their inference cost grows prohibitively with resolution. We propose HRSino, a training-free and efficient diffusion inference approach for high-resolution sinogram completion. By explicitly accounting for spatial heterogeneity in signal characteristics, such as spectral sparsity and local complexity, HRSino allocates inference effort adaptively across spatial regions and resolutions, rather than applying uniform high-resolution diffusion steps. This enables global consistency to be captured at coarse scales while refining local details only where necessary. Experimental results show that HRSino reduces peak memory usage by up to 30.81% and inference time by up to 17.58% compared to the state-of-the-art framework, and maintains completion accuracy across datasets and resolutions.


2605.19839 2026-06-05 cs.CV 版本更新

When Preference Labels Fall Short: Aligning Diffusion Models from Real Data

当偏好标签不足时：从真实数据对齐扩散模型

Weiyan Chen, Weijian Deng, Yao Xiao, Weijie Tu, ZiYi Dong, Ibrahim Radwan, Liang Lin, Pengxu Wei

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文研究了真实数据作为偏好对齐的替代监督源，通过数据驱动的方法，利用真实图像作为参考点，对比生成或扰动样本以构建偏好信号，无需手动标注的偏好对，实验证明真实数据监督能有效对齐扩散模型并达到与现有偏好方法相当的性能。

Comments ICML 2026 Camera Ready; Project Page: https://cwyxx.github.io/RealAlign

详情

AI中文摘要

偏好对齐旨在通过学习优选样本与非优选样本的比较来引导生成模型。在实践中，大多数现有方法依赖于从模型生成图像中构造的偏好对。这种监督本质上是相对的，当两个样本都表现出伪影或视觉质量有限时，其模糊性使得难以推断何为真正理想的输出。在本工作中，我们探讨了真实数据是否可以作为偏好对齐的替代监督源。我们采用以数据为中心的视角，研究了一种整理策略，将真实图像作为参考点，并通过将其与生成或扰动样本进行对比，构建偏好信号，而无需手动标注的偏好对。通过实证分析，我们证明了基于真实数据的监督能有效指导扩散模型的对齐，并达到与现有基于偏好方法相当的性能。我们的结果表明，真实数据为偏好对齐提供了一个实用且互补的监督源，并突显了标签高效对齐策略的方向。代码和模型可在https://cwyxx.github.io/RealAlign获取。

英文摘要

Preference alignment aims to guide generative models by learning from comparisons between preferred and non-preferred samples. In practice, most existing approaches rely on preference pairs constructed from model-generated images. Such supervision is inherently relative and can be ambiguous when both samples exhibit artifacts or limited visual quality, making it difficult to infer what constitutes a truly desirable output. In this work, we investigate whether real data can serve as an alternative source of supervision for preference alignment. We adopt a data-centric perspective and study a curation strategy that treats real images as reference points and constructs preference signals by contrasting them with generated or perturbed samples, without requiring manually annotated preference pairs. Through empirical analysis, we show that real-data-based supervision provides effective guidance for aligning diffusion models and achieves performance comparable to existing preference-based methods. Our results suggest that real data offers a practical and complementary source of supervision for preference alignment and highlight directions of label-efficient alignment strategies. Code and models are available at https://cwyxx.github.io/RealAlign.


2510.00054 2026-06-05 cs.CV cs.AI 版本更新

HiDe: Rethinking The Zoom-IN method in High Resolution MLLMs via Hierarchical Decoupling

HiDe: 通过分层解耦重新思考高分辨率MLLMs中的Zoom-IN方法

Xianjie Liu, Yiman Hu, Yixiong Zou, Liang Wu, Jian Xu, Bo Zheng

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文提出HiDe框架，通过分层解耦方法解决高分辨率图像中背景干扰导致的视觉理解问题，提升多模态大语言模型在高分辨率图像任务中的性能。

Comments Accepted by ICML2026

详情

AI中文摘要

多模态大语言模型（MLLMs）在视觉理解任务中取得了显著进展。然而，它们在高分辨率图像上的性能仍然不够理想。现有方法通常将这一限制归因于感知约束，认为MLLMs难以识别小物体，因而采用“放大”（zoom in）策略来获取更多细节；我们的分析则揭示了不同的原因：主要问题不在于物体大小，而在于复杂的背景干扰。我们通过一系列解耦实验系统分析了这种“放大”操作，并提出了一种无需训练的分层解耦框架（HiDe）。该框架先使用逐token注意力解耦（TAD）对问题token进行解耦并识别关键信息token，再利用它们的注意力权重实现与目标视觉区域的精确对齐；随后利用布局保持解耦（LPD）将这些区域与背景解耦，并重建一个紧凑的表示，在保留基本空间布局的同时消除背景干扰。HiDe在V*Bench、HRBench4K和HRBench8K上取得了新的SOTA，将Qwen2.5-VL 7B和InternVL3 8B提升至SOTA水平（在V*Bench上分别为92.1%和91.6%），甚至超过了强化学习方法。经过优化后，HiDe的内存使用比此前的无训练方法减少了75%。代码见https://tennine2077.github.io/HiDe.github.io/。

英文摘要

Multimodal Large Language Models (MLLMs) have made significant strides in visual understanding tasks. However, their performance on high-resolution images remains suboptimal. While existing approaches often attribute this limitation to perceptual constraints and argue that MLLMs struggle to recognize small objects, leading them to use "zoom in" strategies for better detail, our analysis reveals a different cause: the main issue is not object size, but rather caused by complex background interference. We systematically analyze this "zoom in" operation through a series of decoupling experiments and propose the Hierarchical Decoupling Framework (HiDe), a training-free framework that uses Token-wise Attention Decoupling (TAD) to decouple the question tokens and identify the key information tokens, then leverages their attention weights to achieve precise alignment with the target visual regions. Subsequently, it employs Layout-Preserving Decoupling (LPD) to decouple these regions from the background and reconstructs a compact representation that preserves essential spatial layouts while eliminating background interference. HiDe sets a new SOTA on V*Bench, HRBench4K, and HRBench8K, boosting Qwen2.5-VL 7B and InternVL3 8B to SOTA (92.1% and 91.6% on V*Bench), even surpassing RL methods. After optimization, HiDe uses 75% less memory than the previous training-free approach. Code is provided in https://tennine2077.github.io/HiDe.github.io/.


2605.16716 2026-06-05 cs.CV cs.AI 版本更新

MAVEN: A Multi-Agent Framework for Multicultural Text-to-Video Generation

MAVEN：面向多元文化文本到视频生成的多智能体框架

Shuowei Li, Yuming Zhao, Parth Bhalerao, Oana Ignat

发表机构 * Santa Clara University（圣克拉拉大学）

AI总结提出MAVEN多智能体提示优化框架，通过并行或串行分解提示为人物、动作、地点维度，提升单文化和跨文化文本到视频生成的文化保真度，并构建包含243个文化提示和972个视频的基准进行评估。

Comments [14] pages, [6] figures, [11] tables, appendix included. Preprint

详情

AI中文摘要

文本到视频（T2V）生成在视觉保真度方面取得了快速进展，但其在单个提示中忠实呈现多种文化的能力仍未被充分探索。我们提出MAVEN，一个多智能体提示优化框架，旨在提高单文化和跨文化T2V生成中的文化保真度。MAVEN将提示分解为人物、动作和地点维度，由并行或串行运行的专业智能体处理。为了支持系统评估，我们贡献了一个新的基准，包含243个基于文化的提示和972个对应视频，涵盖三种文化（中国、美国、罗马尼亚）、三种动作类别以及单文化和跨文化场景。结合基于CLIP的指标、VLM作为评判的评估以及视频质量度量的评测表明，多智能体优化，特别是并行专业化，在保持视觉质量和时间一致性的同时，显著提高了文化相关性。数据集和代码可在https://github.com/AIM-SCU/MAVEN获取。

英文摘要

Text-to-video (T2V) generation has rapidly progressed in visual fidelity, yet its ability to faithfully represent multiple cultures within a single prompt remains underexplored. We introduce MAVEN, a multi-agent prompt refinement framework designed to improve cultural fidelity in both mono-cultural and cross-cultural T2V generation. MAVEN decomposes prompts into person, action, and location dimensions, handled by specialized agents operating in parallel or sequentially. To support systematic evaluation, we contribute a new benchmark of 243 culturally grounded prompts and 972 corresponding videos, spanning three cultures (Chinese, American, Romanian), three action categories, and both mono-cultural and cross-cultural scenarios. Evaluations combining CLIP-based metrics, VLM-as-judge assessments, and video quality measures show that multi-agent refinement, particularly parallel specialization, significantly improves cultural relevance while preserving visual quality and temporal consistency. The dataset and code are available at https://github.com/AIM-SCU/MAVEN.


2605.14028 2026-06-05 cs.CV 版本更新

Unified Pix Token And Word Token Generative Language Model

统一像素标记与词标记的生成语言模型

Haun Leung, ZiNan Wang

发表机构 * Beihang University（北京航空航天大学）

AI总结本文提出一种统一像素标记和词标记的生成语言模型，通过引入图像无监督预训练、颜色折叠、全局条件注意力近似等方法，提升模型在图像细节识别上的能力，实验表明该模型在小模型和有限数据下仍表现优异。

Comments 13 pages, 6 figures

详情

AI中文摘要

自从视觉Transformer（ViT）出现以来，它已被广泛应用于生成语言模型和生成视觉模型中。尤其是在当前最先进的开源多模态模型中，通过CLIP或SigLIP方法获得的ViT被用作视觉编码器的骨干网络，帮助它们获得视觉理解能力。但这种方法在细节视觉理解上存在局限，例如在图像中难以识别小文本或数字。为了解决这些问题，我们提出了一种新的模型，将像素标记和词标记统一到生成语言模型中。该新模型还具有每个图像像素都有其自己的标记嵌入、颜色折叠、全局条件注意力近似和图像无监督预训练等特性。我们使用我们的新模型进行了图像无监督预训练实验，以探索其潜力。实验结果表明，即使在小模型和有限训练数据下，其性能也很好。我们相信我们的模型也符合扩展定律，只要模型参数和训练数据增加，其性能将继续提高。

英文摘要

Since the emergence of Vision Transformer (ViT), it has been widely used in generative language model and generative visual model. Especially in the current state-of-art open source multimodal models, ViT obtained by CLIP or SigLIP method serves as the vision encoder backbone to help them acquire visual understanding capabilities. But this method leads to limitations in visual understanding for details, such as difficulty in recognizing small text or numbers in images. To address these issues, we propose a new model to unify pix token and word token into the generative language model. The new model also features with each pix of image having its own token embedding, color folding, global conditional attention approximation and image unsupervised pretraining. We conducted image unsupervised pretraining experiments using our new model to explore its potential. The experimental results show that it has good performance even in small model and with limited training data. We believe our model also conforms to the scaling law, as long as model parameters and training data increased, its performance will continue to improve.


2604.20329 2026-06-05 cs.CV cs.AI 版本更新

Image Generators are Generalist Vision Learners

图像生成器是通用视觉学习者

Valentin Gabeur, Shangbang Long, Songyou Peng, Paul Voigtlaender, Shuyang Sun, Yanan Bao, Karen Truong, Zhicheng Wang, Wenlei Zhou, Jonathan T. Barron, Kyle Genova, Nithish Kannen, Sherry Ben, Yandong Li, Mandy Guo, Suhas Yogin, Yiming Gu, Huizhong Chen, Oliver Wang, Saining Xie, Howard Zhou, Kaiming He, Thomas Funkhouser, Jean-Baptiste Alayrac, Radu Soricut

发表机构 * Google（谷歌）

AI总结本文研究了图像生成器在视觉理解中的通用学习能力，通过引入Vision Banana模型，展示了图像生成训练如何像语言模型预训练一样，使模型在多种视觉任务中取得最佳性能，证明了图像生成预训练在构建基础视觉模型中的核心作用。

Comments Project Page: http://vision-banana.github.io

详情

AI中文摘要

近期的研究表明，图像和视频生成器表现出零样本视觉理解行为，这让人联想到大型语言模型（LLM）通过生成式预训练发展出语言理解和推理的涌现能力。尽管长期以来人们推测能够生成视觉内容意味着能够理解它，但一直缺少证据表明生成式视觉模型已发展出强大的理解能力。在本文中，我们证明图像生成训练的作用类似于LLM预训练，使模型学习到强大的、通用的视觉表示，从而在各种视觉任务中取得最先进的性能。我们引入了Vision Banana，一个通用模型，它是在Nano Banana Pro（NBP）的原始训练数据与少量视觉任务数据的混合上对NBP进行指令微调而构建的。通过将视觉任务的输出空间参数化为RGB图像，我们无缝地将感知重新表述为图像生成。我们的通用模型Vision Banana在涉及2D和3D理解的多种视觉任务中取得了最先进的结果，超越或匹敌零样本的领域专用模型，包括分割任务上的Segment Anything Model 3，以及度量深度估计上的Depth Anything系列。我们展示了这些结果可以通过轻量级指令微调实现，而不牺牲基础模型的图像生成能力。这些优越的结果表明图像生成预训练是一种通用视觉学习者，也表明图像生成可以作为视觉任务的统一和通用接口，类似于文本生成在语言理解和推理中的作用。我们可能正在见证计算机视觉中的重大范式转变，其中生成式视觉预训练在构建兼顾生成和理解的基础视觉模型中发挥核心作用。

英文摘要

Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.


2605.05367 2026-06-05 cs.CV cs.AI 版本更新

Tamaththul3D: High-Fidelity 3D Saudi Sign Language Avatars from Monocular Video

Tamaththul3D: 从单目视频高保真重建沙特手语3D虚拟形象

Eyad Alghamdi, Sattam Altuuaim, Obay Ghulam, Abdulrahman Qutah, Yousef Basoodan

发表机构 * University of Jeddah（吉达大学）； King Abdullah University of Science and Technology（阿卜杜拉国王科技大学）

AI总结本文提出Tamaththul3D方法，先在前臂链上通过几何逆运动学对齐手部与身体估计，再结合2D监督的肩部细化，实现沙特手语的高保真3D虚拟形象重建，并可泛化到五种类型学上不同的手语。

详情

AI中文摘要

现有的3D手语虚拟形象重建方法仅在西方手语上开发和评估，且尚无任何阿拉伯手语数据集具备3D参数化标注，这一空白阻碍了面向阿拉伯聋人社区的基于虚拟形象的无障碍应用的发展。我们为Ishara-500沙特手语数据集发布了首个SMPL-X参数化标注，使阿拉伯手语的定量评估和下游手语生成成为可能。我们引入Tamaththul3D，一种重建流程：先在前臂链上通过几何逆运动学对齐手部和身体估计，再进行2D监督的肩部细化。这一闭式整合与具体的身体和手部估计器选择解耦：任何兼容SMPL-X的身体估计器和任何兼容MANO的手部估计器均可替换，我们通过独立替换每个模块证明了这一点。Tamaththul3D的手部误差比先前方法最多低32%，运行速度比最强基线快32倍，并且无需针对数据集的适配即可泛化到五种类型学上不同的手语。

英文摘要

Existing 3D sign language avatar reconstruction methods are developed and evaluated exclusively on Western sign languages, and no 3D parametric annotations exist for any Arabic Sign Language dataset, a gap that blocks the development of avatar-based accessibility applications for the Arab Deaf community. We release the first SMPL-X parametric annotations for the Ishara-500 Saudi Sign Language dataset, enabling quantitative evaluation and downstream sign language generation for Arabic Sign Language. We introduce Tamaththul3D, a reconstruction pipeline that aligns hand and body estimates through geometric inverse kinematics on the forearm chain followed by 2D-supervised shoulder refinement. The closed-form integration is decoupled from the specific choice of body and hand estimators: any SMPL-X-compatible body estimator and any MANO-compatible hand estimator can be substituted, as we demonstrate by swapping each module independently. Tamaththul3D achieves up to 32% lower hand error than prior methods, runs 32x faster than the strongest baseline, and generalizes across five typologically distinct sign languages without dataset-specific adaptation.


2605.09989 2026-06-05 cs.RO cs.CV 版本更新

DPU 或 GPU 加速神经网络推断——为何不两者都用？分割 CNN 推断

Ali Emre Oztas, Mahir Demir, James Garside, Mikel Luján

发表机构 * The University of Manchester（曼彻斯特大学）

AI总结本文提出了一种将 CNN 推断任务分割到 DPU 和 GPU 上的方法，以降低延迟。通过在 DPU 处理初始层，GPU 处理剩余层，结合 GNN 分割索引预测方法，实现了比单一 DPU 或 GPU 更高的效率提升。

详情

AI中文摘要

边缘设备上的视频和图像流处理需要低延迟。为此，神经网络（NN）被广泛使用，先前的研究主要集中于使用单一硬件单元来加速这些网络，例如图形处理单元（GPU）、现场可编程门阵列（FPGA）和深度学习处理单元（DPU）。然而，将这些单元组合使用可以进一步降低延迟。本文提出将 CNN 推断划分到 DPU 和 GPU 上（Split CNN 推断）。第一个分区由处理输入图像的初始 CNN 层组成，运行在 Versal VCK190 的 AI 引擎（DPU）上，即 DPU 在靠近数据源的位置处理第一个分区。GPU（NVIDIA RTX 2080）以异步流水线方式运行剩余的层，处理第二个分区，同时数据源（存储/摄像头）与 GPU 之间的数据传输也得以减少。此外，本文提出了一种基于图神经网络（GNN）的划分索引预测方法，以自动完成 Split 推断所需的 CNN 划分。本文分析了 LeNet-5、ResNet18/50/101/152、VGG16 和 MobileNetv2 等成熟模型。结果表明，与仅使用 DPU 执行相比，延迟最多改善 2.48 倍；与仅使用 GPU 执行相比，延迟最多改善 3.37 倍。训练好的 GNN 模型在合适的设备之间划分各层的准确率为 96.27%。

英文摘要

Video and image streaming on edge devices requires low latency. To address this, Neural Networks (NNs) are widely used, and prior work mainly focuses on accelerating them with single hardware units such as Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and Deep Learning Processing Units (DPUs). However, further reductions in latency can be observed by combining these units. In this paper, partitioning CNN inference across DPU and GPU (Split CNN Inference) is proposed. The first partition runs on the AI engines (DPU) of a Versal VCK190, which consists of initial CNN layers processing the input images. The DPU processes the first partition near the source of the data. Pipelined asynchronously, a GPU runs the remaining layers. The GPU (NVIDIA RTX 2080) processes the second partition, albeit having reduced the data transfer between the data source (storage/camera) and the GPU. Furthermore, a Graph Neural Network (GNN)-based partition index prediction method is proposed to automate the partitioning of CNNs needed for Split Inference. Well established models such as LeNet-5, ResNet18/50/101/152, VGG16, and MobileNetv2 are analyzed. Results demonstrate up to 2.48x latency improvement over DPU-only execution and up to 3.37x over GPU-only execution. The trained GNN model splits the layers between the appropriate devices with 96.27% accuracy.
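
The partitioning idea in this abstract can be illustrated with a small sketch. It assumes that per-layer latencies on each device and a transfer cost at every layer boundary have already been measured; the function name `best_split` and all numbers are hypothetical. The exhaustive search below is only a stand-in for the paper's GNN-based partition index predictor, which the abstract does not describe in enough detail to reproduce. Because the two stages overlap in an asynchronous pipeline, the sustained time per frame is set by the slower stage, so the sketch minimizes that bottleneck.

```python
def best_split(dpu_ms, gpu_ms, transfer_ms):
    """Return (split_index, bottleneck_ms) for a two-stage DPU -> GPU pipeline.

    Layers [0, k) run on the DPU and layers [k, n) run on the GPU.
    transfer_ms[k] is the assumed cost of moving the tensor that crosses
    boundary k to the GPU (k = 0 means the raw input is sent directly).
    """
    n = len(dpu_ms)
    assert len(gpu_ms) == n and len(transfer_ms) == n + 1
    best = None
    for k in range(n + 1):
        dpu_stage = sum(dpu_ms[:k])
        gpu_stage = transfer_ms[k] + sum(gpu_ms[k:])
        # Stages overlap, so the slower one bounds the time per frame.
        bottleneck = max(dpu_stage, gpu_stage)
        if best is None or bottleneck < best[1]:
            best = (k, bottleneck)
    return best


# Illustrative numbers only, not measurements from the paper.
dpu_ms = [1.0, 1.0, 4.0, 4.0]
gpu_ms = [2.0, 2.0, 1.0, 1.0]
transfer_ms = [3.0, 1.0, 0.5, 0.5, 0.5]
split, bottleneck = best_split(dpu_ms, gpu_ms, transfer_ms)
```

With these toy numbers the search places the first two layers on the DPU, where the intermediate tensor is cheaper to move than the raw input. A learned predictor would replace the loop when profiling every candidate split is too expensive.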


2604.27343 2026-06-05 cs.CV 版本更新

JI-ADF: Joint-Individual Learning with Adaptive Decision Fusion for Multimodal Skin Lesion Classification

JI-ADF：联合-个体学习与自适应决策融合用于多模态皮肤病变分类

Phan Nguyen, Dat Cao, Hien Kha, Hien Chu, Minh Le, Trang Pham, Nguyen Quoc Khanh Le

发表机构 * University of California, San Diego（加州大学圣地亚哥分校）

AI总结本文提出JI-ADF框架，通过整合皮肤镜图像、临床照片和结构化患者数据，实现基于临床的皮肤病变分类，采用多模态表示学习和自适应决策融合机制，提升跨模态推理能力，并在MILK10k数据集上验证了其在实际临床场景中的可靠性。

详情

AI中文摘要

皮肤病变分类对早期皮肤病诊断至关重要，但许多现有计算机辅助系统主要依赖皮肤镜图像，而未能充分利用临床实践中常规可用的多模态证据。为解决这一问题，我们提出JI-ADF，一种三模态深度学习框架，整合皮肤镜图像、临床照片和结构化患者元数据，用于基于临床的皮肤病变分类。所提出的架构结合了联合多模态表示学习、模态特定的辅助监督以及自适应决策融合机制，该机制在每个样本基础上动态校准模态贡献。为进一步增强跨模态推理并保持模态特定证据，我们进一步引入了多模态融合注意力（MMFA）模块。我们在大规模MILK10k基准上评估了JI-ADF，该基准反映了真实世界临床获取条件和严重的类别不平衡。所提出的方法在病变类别上表现出强大且均衡的性能，提高了灵敏度和Dice分数，同时保持高特异性和良好的校准。广泛的分析，包括模态消融、校准评估和Grad-CAM可视化，进一步证实了模型的鲁棒性和临床意义的行为。这些结果表明，JI-ADF为实际临床场景中的多模态皮肤病变分类提供了可靠且实用的基础。

英文摘要

Skin lesion classification is essential for early dermatological diagnosis, yet many existing computer-aided systems rely primarily on dermoscopic images and underutilize the multimodal evidence routinely available in clinical practice. To address this gap, we propose JI-ADF, a trimodal deep learning framework that integrates dermoscopic images, clinical photographs, and structured patient metadata for clinically grounded skin lesion classification. The proposed architecture combines joint multimodal representation learning with modality-specific auxiliary supervision and an adaptive decision fusion mechanism that dynamically calibrates modality contributions on a per-sample basis. To enhance cross-modal reasoning while preserving modality-specific evidence, we further introduce a multimodal fusion attention (MMFA) module. We evaluate JI-ADF on the large-scale MILK10k benchmark, which reflects real-world clinical acquisition conditions and severe class imbalance. The proposed method demonstrates strong and well-balanced performance across lesion categories, improving sensitivity and Dice score while maintaining high specificity and good calibration. Extensive analyses, including modality ablation, calibration evaluation, and Grad-CAM visualization, further confirm the robustness and clinically meaningful behavior of the model. These results indicate that JI-ADF provides a reliable and practical foundation for multimodal skin lesion classification in real-world clinical settings.


2604.19741 2026-06-05 cs.CV 版本更新

CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

CityRAG: 通过空间感知的视频生成进入城市

Gene Chou, Charles Herrmann, Kyle Genova, Boyang Deng, Songyou Peng, Bharath Hariharan, Jason Y. Zhang, Noah Snavely, Philipp Henzler

发表机构 * Google（谷歌）； Cornell University（康奈尔大学）； Stanford University（斯坦福大学）

AI总结CityRAG利用大规模地理注册数据作为上下文，生成在空间上有依据且可导航的真实环境视频；其核心是在保留针对复杂运动和外观变化的学习先验的同时，借助时间上未对齐的训练数据，使模型将底层场景与其瞬时属性解耦。

Comments Project page: cityrag.github.io

详情

AI中文摘要

我们研究如何生成一个具有3D一致性、可导航且在空间上有依据的环境，即对真实地点的模拟。现有的视频生成模型可以生成与文本（T2V）或图像（I2V）提示一致的合理序列。然而，在任意天气条件和动态物体配置下重建真实世界的能力，对于自动驾驶和机器人仿真等下游应用至关重要。为此，我们提出了CityRAG，一个视频生成模型，它利用大规模地理注册数据作为上下文，使生成结果锚定于物理场景，同时保留针对复杂运动和外观变化的学习先验。CityRAG依赖于时间上未对齐的训练数据，这教会模型在语义上将底层场景与其瞬时属性解耦。我们的实验表明，CityRAG能够生成连贯的、长达数分钟且符合物理场景的视频序列，在数千帧中保持天气和光照条件，实现回环闭合，并沿复杂轨迹导航以重建真实世界的地理环境。

英文摘要

We address the problem of generating a 3D-consistent, navigable environment that is spatially grounded: a simulation of a real location. Existing video generative models can produce a plausible sequence that is consistent with a text (T2V) or image (I2V) prompt. However, the capability to reconstruct the real world under arbitrary weather conditions and dynamic object configurations is essential for downstream applications including autonomous driving and robotics simulation. To this end, we present CityRAG, a video generative model that leverages large corpora of geo-registered data as context to ground generation to the physical scene, while maintaining learned priors for complex motion and appearance changes. CityRAG relies on temporally unaligned training data, which teaches the model to semantically disentangle the underlying scene from its transient attributes. Our experiments demonstrate that CityRAG can generate coherent minutes-long, physically grounded video sequences, maintain weather and lighting conditions over thousands of frames, achieve loop closure, and navigate complex trajectories to reconstruct real-world geography.


2604.16502 2026-06-05 cs.CV 版本更新

Topology-Aware Layer Pruning for Large Vision-Language Models

面向拓扑的层剪枝用于大型视觉-语言模型

Pengcheng Zheng, Chaoning Zhang, Ya Wen, Wang Liu, Qigan Sun, Jiarong Mo, Jiaquan Zhang, Jewon Lee, Tae-Ho Kim, Kuien Liu, Tianyu Li, Caiyan Qin, Yang Yang

AI总结本文提出了一种拓扑感知的层剪枝框架，用于大型视觉-语言模型，通过利用zigzag持续同调量化层间拓扑一致性，实现自适应剪枝以保留关键表示转换。

Comments This manuscript has been withdrawn by the authors. It reproduced the methodology of Gardinazzi et al., arXiv:2410.11042, without citation, and utilized code and data from the associated repository (github.com/RitAreaSciencePark/ZigZagLLMs) without disclosure or violate the MIT License. A revised future version with full attribution may be prepared. For any feedback, please contact Pengcheng Zheng

详情

AI中文摘要

大型语言模型（LLMs）在自然语言理解和推理方面展示了强大的能力，而最近的扩展将视觉输入纳入其中，使它们能够处理多模态信息。尽管有这些进展，大型视觉-语言模型（LVLMs）仍然带来了显著的计算和内存成本，阻碍了在资源受限场景中的部署。现有的层剪枝方法通常依赖于局部相似性度量或静态代理信号，无法捕捉模型深度中表示的全局和动态演变，这往往导致关键转换层被移除。为了解决这一限制，我们提出了一种拓扑感知的层剪枝框架用于LVLMs。具体而言，我们将各层的隐藏状态表示为点云，并利用单纯复形（simplicial complexes）来建模其演变。通过利用zigzag持续同调（zigzag persistent homology），我们量化了层间拓扑一致性，并实现了能够保留关键表示转换的自适应剪枝。在多样化的多模态基准上的广泛实验表明，所提出的框架在各种稀疏率范围内均优于现有剪枝方法。我们的代码可在https://github.com/zpc456/TopoVLM上获得。

英文摘要

Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning, while recent extensions that incorporate visual inputs enable them to process multimodal information. Despite these advances, Large Vision-Language Models (LVLMs) incur substantial computational and memory costs, hindering deployment in resource-constrained scenarios. Existing layer pruning methods typically rely on local similarity metrics or static proxy signals, failing to capture the global and dynamic evolution of representations across model depth, which often leads to the removal of transition-critical layers. To address this limitation, we propose a topology-aware layer pruning framework for LVLMs. Specifically, we represent layer-wise hidden states as point clouds and model their evolution using simplicial complexes. By leveraging zigzag persistent homology, we quantify inter-layer topological consistency and enable adaptive pruning that preserves critical representational transitions. Extensive experiments on diverse multimodal benchmarks demonstrate that the proposed framework consistently outperforms existing pruning methods across a wide range of sparsity ratios. Our code is available at https://github.com/zpc456/TopoVLM.


2604.16370 2026-06-05 cs.CL cs.AI cs.CV 版本更新

FUSAR-GPT: 一种嵌入时空特征和两阶段解耦的视觉语言模型，用于合成孔径雷达图像

Xiaokun Zhang, Yi Yang, Ziqi Ye, Baiyun, Xiaorong Guo, Qingchen Fang, Ruyi Zhang, Xinpeng Zhou, Haipeng Wang

发表机构 * Fudan University（复旦大学）； Discipline and Technology Center of Microwave Vision Intelligent Sensing, Fudan University（微波视觉智能感知学科与技术中心，复旦大学）； Shanghai Innovation Institute（上海创新研究院）

AI总结本文提出FUSAR-GPT，一种专门针对合成孔径雷达图像的视觉语言模型，通过嵌入时空特征和两阶段解耦方法，在多个遥感视觉语言基准测试中实现了最先进的性能。

详情

AI中文摘要

面向全天候、全天时合成孔径雷达（SAR）的智能解译研究对于推进遥感应用至关重要。近年来，尽管视觉语言模型（VLMs）在RGB图像上展示了强大的开放世界理解能力，但由于成像机制复杂、对散射特征敏感以及高质量文本语料稀缺，它们直接应用于SAR领域时性能严重受限。为系统解决这一问题，我们构建了首个SAR图像-文本-AlphaEarth特征三元组数据集，并开发了FUSAR-GPT，一种专门面向SAR的VLM。FUSAR-GPT引入了一个地理空间基线模型作为“世界知识”先验，并通过“时空锚点”将多源遥感时序特征嵌入模型的视觉主干中，从而对SAR图像中目标的稀疏表示进行动态补偿。此外，我们设计了一种两阶段SFT策略，以解耦大模型的知识注入和任务执行。时空特征嵌入和两阶段解耦范式使FUSAR-GPT在多个典型遥感视觉语言基准测试中取得了最先进的性能，比主流基线模型高出10%以上。

英文摘要

Research on the intelligent interpretation of all-weather, all-time Synthetic Aperture Radar (SAR) is crucial for advancing remote sensing applications. In recent years, although Visual Language Models (VLMs) have demonstrated strong open-world understanding capabilities on RGB images, their performance is severely limited when directly applied to the SAR field due to the complexity of the imaging mechanism, sensitivity to scattering features, and the scarcity of high-quality text corpora. To systematically address this issue, we constructed the inaugural SAR Image-Text-AlphaEarth feature triplet dataset and developed FUSAR-GPT, a VLM specifically for SAR. FUSAR-GPT innovatively introduces a geospatial baseline model as a 'world knowledge' prior and embeds multi-source remote-sensing temporal features into the model's visual backbone via 'spatiotemporal anchors', enabling dynamic compensation for the sparse representation of targets in SAR images. Furthermore, we designed a two-stage SFT strategy to decouple the knowledge injection and task execution of large models. The spatiotemporal feature embedding and the two-stage decoupling paradigm enable FUSAR-GPT to achieve state-of-the-art performance across several typical remote sensing visual-language benchmark tests, significantly outperforming mainstream baseline models by over 10%.


2603.16652 2026-06-05 cs.CV 版本更新

Efficient Brood Cell Detection in Layer Trap Nests for Bees and Wasps: Balancing Labeling Effort and Species Coverage

蜂类和黄蜂分层巢穴陷阱中的高效育雏巢室检测：平衡标注工作量与物种覆盖度

Chenchang Liu, Felix Fornoff, Annika Grasreiner, Patrick Maeder, Henri Greil, Marco Seeland

发表机构 * Technical University of Ilmenau（伊尔梅瑙技术大学）； University of Zurich（苏黎世大学）

AI总结提出基于深度学习的育雏巢室检测与分类方法，通过约束假阳性损失（CFPL）策略减少标注工作量并缓解类别不平衡，提升检测性能。

详情

AI中文摘要

监测在孔洞中筑巢的野生蜂类和黄蜂对生物多样性研究和保护至关重要。分层巢穴陷阱（LTNs）正成为研究这些昆虫丰度和物种丰富度的宝贵工具，可深入了解其筑巢活动和生态需求。然而，手动评估LTNs以检测和分类育雏巢室既费时又费力。为此，我们提出一种基于深度学习的方法，用于高效检测和分类LTNs中的育雏巢室。由于育雏巢室密集排列，LTNs带来了额外的挑战，使每张图像的标注工作量很高。此外，我们观察到类别分布显著不平衡，常见物种的出现次数明显多于稀有物种。对常见物种进行全面标注既耗时又会加剧数据不平衡，而部分标注则导致数据不完整，从而降低模型性能。为了减少标注工作量并减轻未标注数据的影响，我们引入了一种新颖的约束假阳性损失（CFPL）策略。CFPL动态屏蔽来自未标注数据的预测，防止其在训练过程中干扰分类损失。实验结果表明，我们的方法提高了检测性能，平衡了模型准确性和标注工作量，同时缓解了类别不平衡问题。

英文摘要

Monitoring cavity-nesting wild bees and wasps is vital for biodiversity research and conservation. Layer trap nests (LTNs) are emerging as a valuable tool to study the abundance and species richness of these insects, offering insights into their nesting activities and ecological needs. However, manually evaluating LTNs to detect and classify brood cells is labor-intensive and time-consuming. To address this, we propose a deep learning based approach for efficient brood cell detection and classification in LTNs. LTNs present additional challenges due to densely packed brood cells, leading to a high labeling effort per image. Moreover, we observe a significant imbalance in class distribution, with common species having notably more occurrences than rare species. Comprehensive labeling of common species is time-consuming and exacerbates data imbalance, while partial labeling introduces data incompleteness which degrades model performance. To reduce labeling effort and mitigate the impact of unlabeled data, we introduce a novel Constrained False Positive Loss (CFPL) strategy. CFPL dynamically masks predictions from unlabeled data, preventing them from interfering with the classification loss during training. Experimental results demonstrate that our method improves detection performance, balances model accuracy and labeling effort, while also mitigating class imbalance.
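
The masking idea behind CFPL can be sketched as follows. The abstract does not give the loss formula, so this is only an assumed minimal form: a binary cross-entropy in which predictions flagged as falling on unlabeled regions are skipped. The function `masked_bce` and the `labeled` flags are hypothetical names, and how the actual method decides which predictions to mask is not shown here.

```python
import math


def masked_bce(probs, targets, labeled):
    """Mean binary cross-entropy over predictions whose `labeled` flag is True.

    Predictions on unlabeled regions are skipped, so a confident detection
    there is not penalized as if it were a false positive.
    """
    eps = 1e-7
    total, count = 0.0, 0
    for p, t, keep in zip(probs, targets, labeled):
        if not keep:
            continue
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
        count += 1
    return total / count if count else 0.0


# Toy example: the third prediction sits on an unlabeled brood cell.
probs = [0.9, 0.2, 0.8]
targets = [1, 0, 0]
labeled = [True, True, False]
loss_masked = masked_bce(probs, targets, labeled)
loss_naive = masked_bce(probs, targets, [True, True, True])
```

In the toy example the naive loss treats the confident third prediction as an error, while the masked loss ignores it, which is the interference the abstract says CFPL is meant to prevent.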


2603.08491 2026-06-05 cs.CV 版本更新

通过rollout增强，在视觉-语言模型中学习自我纠正

Yi Ding, Ziliang Qiu, Bolian Li, Ruqi Zhang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文提出一种基于rollout增强的强化学习框架Octopus，通过重新组合已有rollout生成密集的自我纠正示例，提高样本效率并稳定RL优化，同时引入响应掩码策略以解耦自我纠正与直接推理，从而在7个基准测试中取得开源VLM中的SOTA性能。

Comments 18 pages

详情

Journal ref: ICML 2026

AI中文摘要

自我纠正对于解决视觉-语言模型（VLMs）中的复杂推理问题至关重要。然而，现有的强化学习（RL）方法难以学会自我纠正，因为有效的自我纠正行为极少出现，导致学习信号极其稀疏。为应对这一挑战，我们提出了correction-specific rollouts（Octopus），一种RL rollout增强框架，通过重新组合已有的rollout来合成密集的自我纠正示例。这种增强借助rollout复用提高了样本效率，同时通过均衡的监督稳定了RL优化。此外，我们引入了一种响应掩码策略，将自我纠正与直接推理解耦，避免信号冲突，使两种行为都能被有效学习。在此基础上，我们推出了Octopus-8B，一种具有可控自我纠正能力的推理VLM。在7个基准测试中，它在开源VLM中取得了SOTA性能，比最佳RLVR基线高出1.0分，而每步训练时间仅为其0.72倍。

英文摘要

Self-correction is essential for solving complex reasoning problems in vision-language models (VLMs). However, existing reinforcement learning (RL) methods struggle to learn it, as effective self-correction behaviors emerge only rarely, making learning signals extremely sparse. To address this challenge, we propose correction-specific rollouts (Octopus), an RL rollout augmentation framework that synthesizes dense self-correction examples by recombining existing rollouts. This augmentation simultaneously improves sample efficiency due to rollout reuse and stabilizes RL optimization through balanced supervision. Furthermore, we introduce a response-masking strategy that decouples self-correction from direct reasoning, avoiding signal conflicts and enabling both behaviors to be learned effectively. Building on this, we introduce Octopus-8B, a reasoning VLM with controllable self-correction capability. Across 7 benchmarks, it achieves SoTA performance among open-source VLMs, outperforming the best RLVR baseline by 1.0 score while requiring only 0.72× training time per step.


2602.07428 2026-06-05 cs.CV 版本更新

Row-Column Separated Attention Based Low-Light Image/Video Enhancement

基于行-列分离注意力的低光照图像/视频增强

Chengqi Dong, Zhiyuan Cao, Tuoshi Qi, Kexin Wu, Yixing Gao, Fan Tang

发表机构 * School of Artificial Intelligence, Jilin University, China（吉林大学人工智能学院）； College of Software, Jilin University, China（吉林大学软件学院）； Institute of Computing Technology, Chinese Academy of Sciences, China（中国科学院计算技术研究所）

AI总结本文提出了一种行-列分离注意力模块（RCSA），用于改进U-Net结构以增强低光照图像和视频，通过减少参数和计算量来利用全局信息指导局部信息，同时提出两种时间损失函数以保持时间一致性。

详情

DOI: 10.1111/cgf.15192

AI中文摘要

U-Net结构被广泛用于低光照图像/视频增强。增强的图像在没有适当全局信息指导的情况下，会导致局部噪声较大和细节丢失。注意力机制可以更好地关注和利用全局信息。然而，对图像的注意力可能会显著增加参数和计算量。我们提出了一种行-列分离注意力模块（RCSA），插入到改进的U-Net之后。RCSA模块的输入是特征图的行和列的均值和最大值，利用全局信息以较少的参数指导局部信息。我们提出两种时间损失函数，将该方法应用于低光照视频增强并保持时间一致性。在LOL、MIT Adobe FiveK图像和SDSD视频数据集上的广泛实验表明了我们方法的有效性。代码可在https://github.com/cq-dong/URCSA上公开获取。

英文摘要

U-Net structure is widely used for low-light image/video enhancement. The enhanced images result in areas with large local noise and loss of more details without proper guidance for global information. Attention mechanisms can better focus on and use global information. However, attention to images could significantly increase the number of parameters and computations. We propose a Row-Column Separated Attention module (RCSA) inserted after an improved U-Net. The RCSA module's input is the mean and maximum of the row and column of the feature map, which utilizes global information to guide local information with fewer parameters. We propose two temporal loss functions to apply the method to low-light video enhancement and maintain temporal consistency. Extensive experiments on the LOL, MIT Adobe FiveK image, and SDSD video datasets demonstrate the effectiveness of our approach. The code is publicly available at https://github.com/cq-dong/URCSA.
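
The input that the abstract describes for the RCSA module, the mean and maximum of each row and column of a feature map, can be sketched in a few lines. This covers only that pooling step for a single 2D channel; how the module turns these statistics into attention weights is not stated in the abstract and is not shown. The name `row_col_stats` is hypothetical.

```python
def row_col_stats(feature_map):
    """Return per-row and per-column (mean, max) pairs for a 2D feature map."""
    rows = [(sum(r) / len(r), max(r)) for r in feature_map]
    cols = [(sum(c) / len(c), max(c)) for c in zip(*feature_map)]
    return rows, cols


# Toy 2 x 3 feature map (H = 2 rows, W = 3 columns).
fmap = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]
rows, cols = row_col_stats(fmap)
```

For an H × W map this produces 2(H + W) numbers rather than H·W, which is consistent with the abstract's point that global information is used with fewer parameters and computations than attention over every pixel.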

2602.03410 2026-06-05 cs.CV 版本更新

UnHype: CLIP-Guided Hypernetworks for Dynamic LoRA Unlearning

UnHype：用于动态LoRA反学习的CLIP引导超网络

Piotr Wójcik, Maksym Petrenko, Wojciech Gromski, Przemysław Spurek, Maciej Zieba

发表机构 * Institute of Computer Science, University of Warsaw（华沙大学计算机科学研究所）

AI总结本文提出UnHype框架，通过将超网络引入单概念和多概念LoRA训练，解决传统LoRA方法在概念语义适应性差、难以平衡删除相关概念与保持泛化能力以及多概念同时删除时的可扩展性问题，展示了在物体擦除、名人擦除和色情内容删除等任务中的有效性。

Comments 23 pages, 11 figures. Accepted at ICML 2026. Code: https://github.com/gmum/UnHype/ Project Page: https://gmum.github.io/UnHype/

AI中文摘要

近期大规模扩散模型的进步加剧了人们对其潜在滥用的担忧，特别是生成逼真但有害或具有社会破坏性的内容。这一挑战推动了有效机器反学习的研究，即在不损害模型整体生成能力的情况下，选择性地移除特定知识或概念。在各种方法中，低秩适应（LoRA）已成为一种有效且高效的方法，用于对模型进行面向定向反学习的微调。然而，基于LoRA的方法对概念语义的适应性往往有限，并且难以在删除密切相关的概念与保持对更广泛语义的泛化能力之间取得平衡。此外，当必须同时删除多个概念时，这些方法面临可扩展性挑战。为了解决这些限制，我们引入了UnHype框架，该框架将超网络引入单概念和多概念LoRA训练中。所提出的架构可以直接插入Stable Diffusion以及现代基于流的文本到图像模型，并在其中表现出稳定的训练行为和有效的概念控制。在推理过程中，超网络根据CLIP嵌入动态生成自适应的LoRA权重，使反学习更具上下文感知能力和可扩展性。我们在多个具有挑战性的任务上评估了UnHype，包括物体擦除、名人擦除和色情内容删除，展示了其有效性和通用性。代码见GitHub：https://github.com/gmum/UnHype。

英文摘要

Recent advances in large-scale diffusion models have intensified concerns about their potential misuse, particularly in generating realistic yet harmful or socially disruptive content. This challenge has spurred growing interest in effective machine unlearning, the process of selectively removing specific knowledge or concepts from a model without compromising its overall generative capabilities. Among various approaches, Low-Rank Adaptation (LoRA) has emerged as an effective and efficient method for fine-tuning models toward targeted unlearning. However, LoRA-based methods often exhibit limited adaptability to concept semantics and struggle to balance removing closely related concepts with maintaining generalization across broader meanings. Moreover, these methods face scalability challenges when multiple concepts must be erased simultaneously. To address these limitations, we introduce UnHype, a framework that incorporates hypernetworks into single- and multi-concept LoRA training. The proposed architecture can be directly plugged into Stable Diffusion as well as modern flow-based text-to-image models, where it demonstrates stable training behavior and effective concept control. During inference, the hypernetwork dynamically generates adaptive LoRA weights based on the CLIP embedding, enabling more context-aware, scalable unlearning. We evaluate UnHype across several challenging tasks, including object erasure, celebrity erasure, and explicit content removal, demonstrating its effectiveness and versatility. See the code on GitHub: https://github.com/gmum/UnHype.

2601.21288 2026-06-05 cs.AI cs.CV 版本更新

Drive-KD: Multi-Teacher Distillation for VLMs in Autonomous Driving

Drive-KD：自动驾驶中用于视觉语言模型的多教师知识蒸馏

Weitong Lian, Zecong Tang, Haoran Li, Tianjian Gao, Yifei Wang, Zixu Wang, Lingyi Meng, Tengju Ru, Zhejun Cui, Yichen Zhu, Hangshuo Cao, Qi Kang, Tianxing Chen, Kaixuan Wang, Yu Zhang

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结本文提出Drive-KD框架，通过将自动驾驶分解为感知-推理-规划三元组，并利用知识蒸馏转移能力，构建了专用教师模型，并通过非对称梯度投影缓解跨能力梯度冲突，验证了方法在不同模型家族和规模上的泛化能力，展示了蒸馏模型在自动驾驶任务中的优越性能。

AI中文摘要

自动驾驶是一个重要且安全关键的任务，最近大型语言模型（LLM）和视觉语言模型（VLM）的进展为该领域提供了新的推理和规划可能性。然而，大模型需要大量GPU内存并表现出较高的推理延迟，而传统监督微调（SFT）往往难以弥补小模型的能力差距。为了解决这些限制，我们提出了Drive-KD，一个将自动驾驶分解为“感知-推理-规划”三元组并通过知识蒸馏转移这些能力的框架。我们将层特定的注意力确定为蒸馏信号，构建出优于基线的能力专用单教师模型。此外，我们将这些单教师设置统一到多教师蒸馏框架中，并引入非对称梯度投影以缓解跨能力梯度冲突。广泛的评估验证了我们的方法在不同模型家族和规模上的泛化能力。实验表明，我们蒸馏得到的InternVL3-1B模型GPU内存占用减少约42倍、吞吐量提高约11.4倍，在DriveBench上的整体性能优于同家族的预训练78B模型，并在规划维度上超越GPT-5.1，为高效自动驾驶VLM提供了新的见解。

英文摘要

Autonomous driving is an important and safety-critical task, and recent advances in LLMs/VLMs have opened new possibilities for reasoning and planning in this domain. However, large models demand substantial GPU memory and exhibit high inference latency, while conventional supervised fine-tuning (SFT) often struggles to bridge the capability gaps of small models. To address these limitations, we propose Drive-KD, a framework that decomposes autonomous driving into a "perception-reasoning-planning" triad and transfers these capabilities via knowledge distillation. We identify layer-specific attention as the distillation signal to construct capability-specific single-teacher models that outperform baselines. Moreover, we unify these single-teacher settings into a multi-teacher distillation framework and introduce asymmetric gradient projection to mitigate cross-capability gradient conflicts. Extensive evaluations validate the generalization of our method across diverse model families and scales. Experiments show that our distilled InternVL3-1B model, with ~42 times less GPU memory and ~11.4 times higher throughput, achieves better overall performance than the pretrained 78B model from the same family on DriveBench, and surpasses GPT-5.1 on the planning dimension, providing insights toward efficient autonomous driving VLMs.
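
The abstract does not spell out how the asymmetric gradient projection works. The sketch below is only one plausible reading, stated as an assumption: when the gradient of a secondary capability conflicts with the gradient of a primary capability (negative inner product), the conflicting component is removed from the secondary gradient while the primary gradient is left untouched. The function names and the notion of a "primary" capability are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def project_if_conflicting(grad: Sequence[float], anchor: Sequence[float]) -> List[float]:
    """Drop the component of `grad` along `anchor` when the two conflict.

    Assumed rule: a conflict means dot(grad, anchor) < 0. A zero anchor gives
    an inner product of 0, so it never reaches the division below.
    """
    inner = dot(grad, anchor)
    if inner >= 0:
        return list(grad)
    scale = inner / dot(anchor, anchor)
    return [g - scale * a for g, a in zip(grad, anchor)]


def combine(primary: Sequence[float], secondary: Sequence[float]) -> List[float]:
    """One-sided combination: only the secondary gradient is projected."""
    projected = project_if_conflicting(secondary, primary)
    return [p + s for p, s in zip(primary, projected)]


if __name__ == "__main__":
    planning_grad = [0.0, 1.0]     # treated as the primary capability here
    perception_grad = [1.0, -1.0]  # conflicts with planning along the 2nd axis
    print(combine(planning_grad, perception_grad))
```

The one-sided design keeps the update along the primary direction intact, in contrast to symmetric schemes that project each gradient against the other.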

2601.18219 2026-06-05 physics.med-ph cs.CV cs.LG 版本更新

Automated HER2 scoring with uncertainty quantification using lensfree holography and deep learning

利用无透镜全息和深度学习进行自动HER2评分及不确定性量化

Che-Yung Shen, Xilin Yang, Yuzhu Li, Leon Lenk, Aydogan Ozcan

发表机构 * Electrical and Computer Engineering Department, University of California, Los Angeles, CA, 90095, USA（加州大学洛杉矶分校电气与计算机工程系）； Bioengineering Department, University of California, Los Angeles, CA, 90095, USA（加州大学洛杉矶分校生物工程系）； California NanoSystems Institute (CNSI), University of California, Los Angeles, CA, 90095, USA（加州大学洛杉矶分校加州纳米系统研究所）； Department of Computer Science, University of California, Los Angeles, CA, 90095, USA（加州大学洛杉矶分校计算机科学系）

AI总结本文提出了一种基于无透镜全息和深度学习的紧凑型、低成本系统，用于对免疫组化染色的乳腺组织切片进行自动HER2评分，通过贝叶斯蒙特卡洛Dropout策略提高诊断可靠性，实现了高准确率的HER2分类和评分。

Comments 23 Pages, 6 Figures, 1 Table

DOI: 10.34133/bmef.0278
Journal ref: BME Frontiers, AAAS (2026)

AI中文摘要

准确评估人类表皮生长因子受体2（HER2）的表达对于乳腺癌的诊断、预后和治疗选择至关重要；然而，大多数现有的数字HER2评分方法依赖于笨重且昂贵的光学系统。本文提出了一种紧凑且经济的无透镜全息平台，结合深度学习，用于对免疫组化染色的乳腺组织切片进行自动HER2评分。该系统在RGB激光照明下捕获染色HER2组织切片的无透镜衍射图案，并在约1250 mm²的样本区域上以约84 mm²/分钟的有效吞吐量获取复场（complex field）信息。为提高诊断可靠性，我们采用了基于贝叶斯蒙特卡洛Dropout的不确定性量化策略，为每个预测提供自主的不确定性估计，支持可靠且稳健的HER2评分，整体修正率为30.4%。在由412个独立组织样本组成的盲测集上，结合不确定性量化，本方法在4类（0，1+，2+，3+）HER2分类中实现了84.9%的测试准确率，在二分类（0/1+ vs. 2+/3+）HER2评分中实现了94.8%的准确率。总体而言，这种无透镜全息方法提供了一条通往便携式、高吞吐量和低成本HER2评分的实用途径，特别适用于缺乏传统数字病理基础设施的资源有限环境。

英文摘要

Accurate assessment of human epidermal growth factor receptor 2 (HER2) expression is critical for breast cancer diagnosis, prognosis, and therapy selection; yet, most existing digital HER2 scoring methods rely on bulky and expensive optical systems. Here, we present a compact and cost-effective lensfree holography platform integrated with deep learning for automated HER2 scoring of immunohistochemically stained breast tissue sections. The system captures lensfree diffraction patterns of stained HER2 tissue sections under RGB laser illumination and acquires complex field information over a sample area of ~1,250 mm^2 at an effective throughput of ~84 mm^2 per minute. To enhance diagnostic reliability, we incorporated an uncertainty quantification strategy based on Bayesian Monte Carlo dropout, which provides autonomous uncertainty estimates for each prediction and supports reliable, robust HER2 scoring, with an overall correction rate of 30.4%. Using a blinded test set of 412 unique tissue samples, our approach achieved a testing accuracy of 84.9% for 4-class (0, 1+, 2+, 3+) HER2 classification and 94.8% for binary (0/1+ vs. 2+/3+) HER2 scoring with uncertainty quantification. Overall, this lensfree holography approach provides a practical pathway toward portable, high-throughput, and cost-effective HER2 scoring, particularly suited for resource-limited settings, where traditional digital pathology infrastructure is unavailable.
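
To make the uncertainty-quantification step more concrete, here is a minimal sketch of Monte Carlo dropout scoring: several stochastic forward passes are averaged, and the entropy of the averaged class probabilities decides whether a prediction is accepted or flagged. The entropy threshold, the function names, and the example probabilities are assumptions for illustration only; the abstract does not state which uncertainty measure or cutoff the authors use.

```python
import math
from typing import List, Sequence, Tuple

HER2_LABELS = ["0", "1+", "2+", "3+"]


def mean_probabilities(passes: Sequence[Sequence[float]]) -> List[float]:
    """Average class probabilities over T stochastic (dropout-on) passes."""
    n_passes = len(passes)
    n_classes = len(passes[0])
    return [sum(p[c] for p in passes) / n_passes for c in range(n_classes)]


def predictive_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def score_with_uncertainty(
    passes: Sequence[Sequence[float]], max_entropy: float
) -> Tuple[str, float, bool]:
    """Return (label, entropy, accepted).

    `max_entropy` is a hypothetical cutoff: predictions above it would be
    flagged for review instead of being reported automatically.
    """
    mean = mean_probabilities(passes)
    entropy = predictive_entropy(mean)
    best = max(range(len(mean)), key=lambda c: mean[c])
    return HER2_LABELS[best], entropy, entropy <= max_entropy


if __name__ == "__main__":
    agreeing = [[0.0, 0.0, 0.0, 1.0], [0.0, 0.0, 0.0, 1.0]]
    disagreeing = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    print(score_with_uncertainty(agreeing, max_entropy=0.5))
    print(score_with_uncertainty(disagreeing, max_entropy=0.5))
```

Passes that agree give near-zero entropy and are accepted, while passes that split between two scores give an entropy of ln 2 and are flagged under the assumed cutoff of 0.5.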

利用街景图像和视觉大语言模型预测遗产价值以支持治理：风险、伦理与政策影响

Tim Johansson, Mikael Mangold, Kristina Dabrock, Anna Donarelli, Ingrid Campo-Ruiz

发表机构 * RISE Research Institutes of Sweden AB（瑞典RISE研究机构）； Malmö University（马尔默大学）； Forschungszentrum Jülich GmbH（朱利奇研究中心）； Uppsala University（乌普萨拉大学）

AI总结本研究利用街景图像和视觉大语言模型评估瑞典建筑遗产价值，以支持建筑翻新计划的制定，探讨了方法中的问题、潜在改进以及使用LLM数据的伦理风险。

AI中文摘要

在2025年至2026年期间，欧盟各成员国正在实施《建筑能效指令》，该指令要求所有成员国制定国家建筑翻新计划。瑞典没有全面记录具有遗产价值建筑的国家登记册，相关瑞典主管部门认为这阻碍了制定建筑翻新计划所需的分析。本研究旨在帮助瑞典主管部门获取瑞典建筑存量中遗产价值的相关信息。我们使用多模态大语言模型（LLM）分析了瑞典各地街景图像中的建筑（N=154 710），以评估能够指示遗产价值的可见特征。以LLM的零样本预测为基础，识别出可能具有遗产价值的建筑，涉及500万平方米的供暖建筑面积。本文呈现了预测结果和经验教训，并将其与作为治理一部分的瑞典建筑翻新计划的制定联系起来。文中讨论了该方法存在的问题和可能的改进，并探讨了主管部门使用基于LLM的数据所带来的风险，重点是透明性、错误检测和谄媚（sycophancy）问题。

英文摘要

During 2025 and 2026, the Energy Performance of Buildings Directive is being implemented in the European Union member states, requiring all member states to have National Building Renovation Plans. In Sweden, there is no comprehensive national register of buildings with heritage values. This is seen as a barrier for the analyses underlying the development of Building Renovation Plans by the involved Swedish authorities. The purpose of this research was to assist Swedish authorities in developing information on heritage values in the Swedish building stock. Buildings in street view images from all over Sweden (N=154 710) have been analysed using multimodal Large Language Models (LLMs) to assess visible aspects indicative of heritage value. Zero-shot predictions by LLMs were used as a basis for identifying buildings with potential heritage values for 5.0 million square meters of heated floor area. In this paper, the results of the predictions and lessons learned are presented and related to the development of the Swedish Building Renovation Plan as part of governance. The problems with the method and potential improvements are discussed. Risks with authorities' use of LLM-based data are addressed, with a focus on issues of transparency, error detection and sycophancy.

2601.02730 2026-06-05 cs.CV 版本更新

HOLO: Homography-Guided Pose Estimator Network for Fine-Grained Visual Localization on SD Maps

HOLO：面向标清（SD）地图细粒度视觉定位的单应性引导位姿估计网络

Xuchang Zhong, Xu Cao, Jinke Feng, Hao Fang

发表机构 * Beijing Institute of Technology（北京理工大学）； University of Science and Technology of China（中国科学技术大学）

AI总结本文提出了一种单应性引导的位姿估计网络，用于多视角图像与标清（SD）地图之间的细粒度视觉定位，通过构建满足单应约束的输入对，利用单应关系引导特征融合并将位姿输出限制在有效区域内，提高了训练效率和定位精度。

AI中文摘要

标清（SD）地图上的视觉定位已成为自动驾驶中一种有前途的低成本、可扩展的解决方案。然而，现有基于回归的方法往往忽视了固有的几何先验，导致训练效率欠佳、定位精度有限。本文提出了一种新的单应性引导的位姿估计网络，用于多视角图像与标清（SD）地图之间的细粒度视觉定位。我们将地面视图特征投影到BEV域，并强制其与地图特征进行语义对齐，从而构建满足单应约束的输入对。然后利用单应关系引导特征融合，并将位姿输出限制在有效可行区域内，与依赖注意力融合和直接3自由度位姿回归的先前方法相比，显著提高了训练效率和定位精度。据我们所知，这是首个将BEV语义推理与单应性学习统一起来用于图像到地图定位的工作。此外，通过显式建模单应变换，所提出的框架自然支持跨分辨率输入，增强了模型的灵活性。在nuScenes数据集上的广泛实验表明，我们的方法显著优于现有最先进的视觉定位方法。代码和预训练模型将公开发布以促进未来研究。

英文摘要

Visual localization on standard-definition (SD) maps has emerged as a promising low-cost and scalable solution for autonomous driving. However, existing regression-based approaches often overlook inherent geometric priors, resulting in suboptimal training efficiency and limited localization accuracy. In this paper, we propose a novel homography-guided pose estimator network for fine-grained visual localization between multi-view images and standard-definition (SD) maps. We construct input pairs that satisfy a homography constraint by projecting ground-view features into the BEV domain and enforcing semantic alignment with map features. Then we leverage homography relationships to guide feature fusion and restrict the pose outputs to a valid feasible region, which significantly improves training efficiency and localization accuracy compared to prior methods relying on attention-based fusion and direct 3-DoF pose regression. To the best of our knowledge, this is the first work to unify BEV semantic reasoning with homography learning for image-to-map localization. Furthermore, by explicitly modeling homography transformations, the proposed framework naturally supports cross-resolution inputs, enhancing model flexibility. Extensive experiments on the nuScenes dataset demonstrate that our approach significantly outperforms existing state-of-the-art visual localization methods. Code and pretrained models will be publicly released to foster future research.
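
As background for the link between a 3-DoF pose and a homography, the sketch below builds the 3 x 3 matrix of a planar rigid motion (translation `tx`, `ty` and rotation `yaw`) and applies it to a point in homogeneous coordinates. A planar rigid motion is a special case of a homography; how HOLO parameterizes or constrains its homography is not described in the abstract, so this is a generic geometric illustration with hypothetical function names, not the paper's method.

```python
import math
from typing import List, Tuple

Matrix3 = List[List[float]]


def pose_to_homography(tx: float, ty: float, yaw: float) -> Matrix3:
    """3x3 matrix of a planar rigid motion (a special case of a homography)."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [
        [c, -s, tx],
        [s, c, ty],
        [0.0, 0.0, 1.0],
    ]


def warp_point(h: Matrix3, x: float, y: float) -> Tuple[float, float]:
    """Apply a homography to (x, y) and divide by the homogeneous coordinate."""
    u = h[0][0] * x + h[0][1] * y + h[0][2]
    v = h[1][0] * x + h[1][1] * y + h[1][2]
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return u / w, v / w


if __name__ == "__main__":
    # Rotate by 90 degrees, then shift by (1, 2).
    h = pose_to_homography(1.0, 2.0, math.pi / 2)
    print(warp_point(h, 1.0, 0.0))
```

Because the last row is fixed to (0, 0, 1), only three numbers are free, which is why a matrix of this form cannot leave the set of valid 3-DoF poses.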

2512.21218 2026-06-05 cs.CV 版本更新

Latent Implicit Visual Reasoning

潜在隐式视觉推理

Kelvin Li, Chuyi Shang, Leonid Karlinsky, Rogerio Feris, Trevor Darrell, Roei Herzig

发表机构 * University of California, Berkeley（加州大学伯克利分校）； Xero ； MIT-IBM Watson AI Lab（麻省理工-IBM Watson人工智能实验室）

AI总结本文提出了一种任务无关的机制，训练大规模多模态模型（LMMs）在无需显式中间监督的情况下发现和使用潜在视觉推理标记，从而在多种以视觉为中心的任务中优于直接监督微调，并在不使用辅助图像、边界框、图像裁剪、深度图或思维链标注的情况下，达到与先前基于文本和显式视觉中间推理的方法相当或更优的性能。

AI中文摘要

尽管大规模多模态模型（LMMs）已取得显著进展，但它们仍然主要以文本为中心，依赖语言作为其核心推理模态。因此，它们在处理以视觉为主的推理任务时能力有限。最近的方法试图通过辅助图像、深度图或图像裁剪来监督中间视觉步骤，以解决这一问题。然而，这些策略对“有用的”视觉抽象应当是什么样子施加了限制性的先验，带来了高昂的标注成本，并且难以跨任务泛化。为了解决这一关键限制，我们提出了潜在隐式视觉推理（LIVR），一种任务无关的机制，训练LMMs在无需显式中间监督的情况下发现和使用潜在视觉推理标记。这些标记进行全局注意，并以任务自适应的方式重新编码图像，使模型无需手工设计的监督即可提取相关视觉信息。LIVR在多种以视觉为中心的任务和多个LMM骨干网络上均稳定优于直接监督微调。在更广泛的比较中，LIVR与先前基于文本和显式视觉中间推理的方法相当或更优，同时不需要额外的中间监督，如辅助图像、边界框、图像裁剪、深度图或思维链标注。我们的项目页面：https://www.chuyishang.com/livr/

英文摘要

While Large Multimodal Models (LMMs) have made significant progress, they remain largely text-centric, relying on language as their core reasoning modality. As a result, they are limited in their ability to handle reasoning tasks that are predominantly visual. Recent approaches have sought to address this by supervising intermediate visual steps with helper images, depth maps, or image crops. However, these strategies impose restrictive priors on what "useful" visual abstractions look like, add heavy annotation costs, and struggle to generalize across tasks. To address this critical limitation, we propose Latent Implicit Visual Reasoning (LIVR), a task-agnostic mechanism that trains LMMs to discover and use latent visual reasoning tokens without explicit intermediate supervision. These tokens attend globally and re-encode the image in a task-adaptive way, enabling the model to extract relevant visual information without hand-crafted supervision. LIVR consistently outperforms direct supervised fine-tuning across diverse vision-centric tasks and multiple LMM backbones. In broader comparisons, LIVR remains competitive with or outperforms prior text-based and explicit-visual-intermediate reasoning methods, while requiring no additional intermediate supervision such as helper images, bounding boxes, image crops, depth maps, or chain-of-thought annotations. Our project page can be found here: https://www.chuyishang.com/livr/

2512.15153 2026-06-05 cs.CV 版本更新

Explainable Action Form Assessment by Exploiting Multimodal Chain-of-Thoughts Reasoning

利用多模态思维链推理的可解释动作形式评估

Mengshi Qi, Yeteng Wu, Wulian Yun, Xianlin Zhang, Huadong Ma

发表机构 * State Key Laboratory of Networking and Switching Technology（网络与交换技术国家重点实验室）

AI总结本文提出了一种新的动作形式评估任务，并引入了一个包含大量健身和武术视频的多级标注数据集CoT-AFA，通过引入新的链式思维解释方法，提出了可解释性健身评估框架，以提升动作分析能力。

AI中文摘要

评估人类动作是否标准并提供合理的反馈以提高动作标准化程度在现实场景中非常重要但具有挑战性。然而，当前视频理解方法主要关注动作是什么和在哪里，无法满足要求。同时，现有数据集缺乏指示动作标准化程度的标签，动作质量评估数据集缺乏可解释性和详细反馈。因此，我们定义了一个新的人类动作形式评估（AFA）任务，并引入了一个新的多样化数据集CoT-AFA，其中包含大量健身和武术视频，具有多级标注以进行全面的视频分析。我们通过引入一种新的链式思维解释范式来丰富CoT-AFA数据集。与提供孤立反馈不同，我们的解释提供了一个完整的推理过程--从识别一个动作步骤到分析其结果并提出具体的解决方案。此外，我们提出了一种名为可解释性健身评估器的框架，不仅可以判断动作，还可以解释原因并提供解决方案。该框架采用两个并行处理流和动态门控机制来融合视觉和语义信息，从而提升其分析能力。实验结果表明，我们的方法在解释生成（例如，CIDEr提升16.0%）、动作分类（准确率提升2.7%）和质量评估（准确率提升2.1%）方面均取得了改进，揭示了CoT-AFA在未来研究中的巨大潜力。我们的数据集和源代码可在https://github.com/MICLAB-BUPT/EFA上获取。

英文摘要

Evaluating whether a human action is performed to standard and providing reasonable feedback to improve it is crucial but challenging in real-world scenarios. However, current video understanding methods are mainly concerned with what the action is and where it occurs, which does not meet this need. Meanwhile, most existing datasets lack labels indicating the degree of action standardization, and action quality assessment datasets lack explainability and detailed feedback. Therefore, we define a new Human Action Form Assessment (AFA) task and introduce a new, diverse dataset, CoT-AFA, which contains a large number of fitness and martial arts videos with multi-level annotations for comprehensive video analysis. We enrich the CoT-AFA dataset with a novel Chain-of-Thought explanation paradigm. Instead of offering isolated feedback, our explanations provide a complete reasoning process--from identifying an action step to analyzing its outcome and proposing a concrete solution. Furthermore, we propose a framework named Explainable Fitness Assessor, which can not only judge an action but also explain why and provide a solution. This framework employs two parallel processing streams and a dynamic gating mechanism to fuse visual and semantic information, thereby boosting its analytical capabilities. The experimental results demonstrate that our method achieves improvements in explanation generation (e.g., +16.0% in CIDEr), action classification (+2.7% in accuracy) and quality assessment (+2.1% in accuracy), revealing the great potential of CoT-AFA for future studies. Our dataset and source code are available at https://github.com/MICLAB-BUPT/EFA.

2512.08560 2026-06-05 cs.CV 版本更新

BrainExplore: Large-Scale Discovery of Interpretable Visual Representations in the Human Brain

BrainExplore: 在人脑中大规模发现可解释的视觉表征

Navve Wasserman, Matias Cosarinsky, Yuval Golbari, Aude Oliva, Antonio Torralba, Tamar Rott Shaham, Michal Irani

发表机构 * Weizmann Institute of Science（魏茨曼科学研究所）； Massachusetts Institute of Technology（麻省理工学院）

AI总结本文提出了一种大规模自动化框架，用于发现和解释人脑皮层中的视觉表征，通过无监督的数据驱动分解方法发现候选可解释模式，并通过识别最能激发这些模式的自然图像生成自然语言描述，从而揭示了数千种覆盖多种不同视觉概念的可解释模式，包括此前未报告的细粒度表征。

AI中文摘要

理解人类大脑如何表示视觉概念，以及这些表示在哪些脑区编码，仍然是一个长期存在的挑战。几十年的研究已经提升了我们对视觉表征的理解，但脑信号仍然很大且复杂，可能的视觉概念空间非常广阔。因此，大多数研究仍处于小规模，依赖手动检查，专注于特定区域和概念，并很少进行系统验证。我们提出了一种大规模、自动化的框架，用于在人脑皮层上发现和解释视觉表征。我们的方法包括两个主要阶段。首先，我们通过无监督、数据驱动的分解方法在fMRI活动中发现候选可解释模式。其次，我们通过识别最能激发这些模式的自然图像集，并生成这些图像共享视觉意义的自然语言描述来解释每个模式。为了扩展这一过程，我们引入了一个自动化流程，测试多个候选解释，分配可靠性分数，并为每个脑区模式选择最佳描述。我们的框架揭示了成千上万种可解释模式，涵盖了许多不同的视觉概念，包括此前未报告的细粒度表征。

英文摘要

Understanding how the human brain represents visual concepts, and in which brain regions these representations are encoded, remains a long-standing challenge. Decades of work have advanced our understanding of visual representations, yet brain signals remain large and complex, and the space of possible visual concepts is vast. As a result, most studies remain small-scale, rely on manual inspection, focus on specific regions and concepts, and rarely include systematic validation. We present a large-scale, automated framework for discovering and explaining visual representations across the human cortex. Our method comprises two main stages. First, we discover candidate interpretable patterns in fMRI activity through unsupervised, data-driven decomposition methods. Next, we explain each pattern by identifying the set of natural images that most strongly elicit it and generating a natural-language description of their shared visual meaning. To scale this process, we introduce an automated pipeline that tests multiple candidate explanations, assigns reliability scores, and selects the best description for each voxel pattern. Our framework reveals thousands of interpretable patterns spanning many distinct visual concepts, including fine-grained representations previously unreported.

2512.05774 2026-06-05 cs.CV cs.AI cs.CL 版本更新

Active Video Perception: Iterative Evidence Seeking for Agentic Long Video Understanding

主动视频感知：用于代理长视频理解的迭代证据寻求

Ziyang Wang, Honglu Zhou, Shijie Wang, Junnan Li, Caiming Xiong, Silvio Savarese, Mohit Bansal, Michael S. Ryoo, Juan Carlos Niebles

发表机构 * Salesforce AI Research（Salesforce AI研究院）； University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）

AI总结本文提出了一种主动视频感知框架AVP，通过迭代计划-观察-反思过程，主动决定视频内容的观察目标和时间，以提高长视频理解的准确性和效率。

Comments Website: https://activevideoperception.github.io/

AI中文摘要

长视频理解（LVU）具有挑战性，因为回答现实世界查询往往依赖于稀疏、时间分散的线索，这些线索隐藏在数小时的大部分冗余和无关内容中。尽管代理流程提高了视频推理能力，但现有框架依赖于查询无关的描述器来感知视频信息，这浪费了计算资源并模糊了细粒度的时间和空间信息。受主动感知理论的启发，我们主张LVU代理应主动决定观察什么、何时和在哪里观察，并持续评估当前观察是否足够回答查询。我们提出了主动视频感知（AVP），一种证据寻求框架，将视频视为交互环境，并直接从像素中获取紧凑、查询相关的证据。具体而言，AVP运行一个迭代的计划-观察-反思过程，使用MLLM代理。在每个轮次中，计划者提出有针对性的视频交互，观察者执行以提取时间戳证据，反思者评估证据对查询的充分性，要么终止并给出答案，要么触发进一步观察。在五个LVU基准测试中，AVP实现了最高整体准确率，有显著提升。值得注意的是，AVP在平均整体准确率上比最佳代理方法高出5.7%，同时仅需18.4%的推理时间和12.4%的输入令牌。

英文摘要

Long video understanding (LVU) is challenging because answering real-world queries often depends on sparse, temporally dispersed cues buried in hours of mostly redundant and irrelevant content. While agentic pipelines improve video reasoning capabilities, prevailing frameworks rely on a query-agnostic captioner to perceive video information, which wastes computation on irrelevant content and blurs fine-grained temporal and spatial information. Motivated by active perception theory, we argue that LVU agents should actively decide what, when, and where to observe, and continuously assess whether the current observation is sufficient to answer the query. We present Active Video Perception (AVP), an evidence-seeking framework that treats the video as an interactive environment and acquires compact, query-relevant evidence directly from pixels. Concretely, AVP runs an iterative plan-observe-reflect process with MLLM agents. In each round, a planner proposes targeted video interactions, an observer executes them to extract time-stamped evidence, and a reflector evaluates the sufficiency of the evidence for the query, either halting with an answer or triggering further observation. Across five LVU benchmarks, AVP achieves the highest overall accuracy with significant improvements. Notably, AVP outperforms the best agentic method by 5.7% in average overall accuracy while requiring only 18.4% of the inference time and 12.4% of the input tokens.
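
The plan-observe-reflect process described above can be sketched as a small control loop. Everything here is a simplified assumption: the planner, observer, and reflector are passed in as plain callables, evidence is a list of `(timestamp, text)` pairs, and `max_rounds` is a hypothetical budget. In the paper these roles are played by MLLM agents whose prompts and interfaces are not given in the abstract.

```python
from typing import Callable, List, Optional, Tuple

Evidence = Tuple[float, str]  # (timestamp in seconds, short description)


def active_perception_loop(
    query: str,
    plan: Callable[[str, List[Evidence]], object],
    observe: Callable[[object], List[Evidence]],
    reflect: Callable[[str, List[Evidence]], Optional[str]],
    max_rounds: int = 5,
) -> Tuple[Optional[str], List[Evidence]]:
    """Iterate plan -> observe -> reflect until an answer or the budget ends."""
    evidence: List[Evidence] = []
    for _ in range(max_rounds):
        action = plan(query, evidence)       # decide what/when/where to look
        evidence.extend(observe(action))     # collect time-stamped evidence
        answer = reflect(query, evidence)    # None means "not sufficient yet"
        if answer is not None:
            return answer, evidence
    return None, evidence


# Toy stand-ins so the loop can be run without any model.
def toy_plan(query: str, evidence: List[Evidence]) -> int:
    return len(evidence)  # look at the next unseen segment


def toy_observe(segment: int) -> List[Evidence]:
    return [(60.0 * segment, "segment %d summary" % segment)]


def toy_reflect(query: str, evidence: List[Evidence]) -> Optional[str]:
    return "answer found" if len(evidence) >= 3 else None


if __name__ == "__main__":
    print(active_perception_loop("what happens last?", toy_plan, toy_observe, toy_reflect))
```

Letting the reflector return `None` until the evidence suffices means the loop halts as early as possible, so only the segments actually inspected are ever processed.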

2511.20158 2026-06-05 cs.CV 版本更新

Harmonious Parameter Adaptation in Continual Visual Instruction Tuning for Safety-Aligned MLLMs

面向安全对齐MLLM的持续视觉指令微调中的和谐参数适应

Ziqi Wang, Chang Che, Qi Wang, Hui Ma, Zenglin Shi, Cees G. M. Snoek, Meng Wang

发表机构 * Hefei University of Technology（合肥工业大学）； Tsinghua University（清华大学）； University of Amsterdam（阿姆斯特丹大学）

AI总结本文研究了在安全对齐的连续视觉指令微调中如何平衡安全性和任务性能，提出了一种名为和谐参数适应（HPA）的后训练框架，通过参数分区、平衡选择和正交调整来缓解遗忘问题。

AI中文摘要

尽管连续视觉指令微调（CVIT）在适应多模态大语言模型（MLLMs）方面显示出潜力，但现有研究大多集中在没有安全对齐的模型上。这种关键疏忽忽略了现实中的MLLMs本质上需要此类机制以缓解潜在风险。在本文中，我们关注CVIT在安全对齐的MLLMs中的应用，并观察到在连续适应过程中，模型不仅会经历任务遗忘，还会表现出安全性的下降。实现安全性和任务性能之间的和谐平衡仍然是一个关键挑战。为此，我们提出了和谐参数适应（HPA），一种由基于聚焦的参数分区、和谐平衡的参数选择和正交参数调整组成的后训练框架。具体而言，HPA根据参数对安全或任务性能的关注程度将其分为两种类型，并从平衡的角度选择聚焦的参数以保留。此外，HPA对参数更新施加正交约束，以进一步缓解灾难性遗忘。在CVIT基准和安全评估数据集上的大量实验表明，HPA比现有基线更好地保持了高安全性和减轻了遗忘问题。代码可在https://github.com/Minato-Zackie/HPA上获得。

英文摘要

While continual visual instruction tuning (CVIT) has shown promise in adapting multimodal large language models (MLLMs), existing studies predominantly focus on models without safety alignment. This critical oversight ignores the fact that real-world MLLMs inherently require such mechanisms to mitigate potential risks. In this work, we shift our focus to CVIT for safety-aligned MLLMs and observe that during continual adaptation, the model not only suffers from task forgetting but also exhibits degradation in its safety. Achieving a harmonious balance between safety and task performance remains a crucial challenge. To address this, we propose Harmonious Parameter Adaptation (HPA), a post-training framework composed of focusing-based parameter partition, harmoniously balanced parameter selection, and orthogonal parameter adjustment. Specifically, HPA partitions parameters into two types based on their focus on safety or task performance, and selects the focused ones to preserve from a balanced perspective. In addition, HPA imposes orthogonality constraints on parameter updates to further alleviate catastrophic forgetting. Extensive experiments on the CVIT benchmark and safety evaluation datasets demonstrate that HPA better maintains high safety and mitigates forgetting than existing baselines. Code is available at https://github.com/Minato-Zackie/HPA.

2511.13183 2026-06-05 cs.CV 版本更新

GenTract: Generative Global Tractography

GenTract：生成式全局束追踪

Alec Sargood, Lemuel Puglisi, Elinor Thompson, Mirco Musolesi, Daniel C. Alexander

发表机构 * Hawkes Institute and Department of Computer Science, University College London, UK（霍克斯研究所和计算机科学系，伦敦大学学院，英国）； Department of Maths and Computer Science, University of Catania, Italy（数学和计算机科学系，卡塔尼亚大学，意大利）； AI Centre and Department of Computer Science, University College London, UK（人工智能中心和计算机科学系，伦敦大学学院，英国）

AI总结本文提出GenTract，一种基于生成模型的全局束追踪方法，通过学习从dMRI到完整解剖学合理束流的直接映射，提高了在低分辨率和噪声数据下的精度和可靠性。

Comments Upload of camera-ready

AI中文摘要

束追踪是通过扩散磁共振成像（dMRI）推断大脑白质路径轨迹的过程。局部束追踪方法在图像中逐步跟随局部纤维方向估计来构建束流，容易产生误差累积和较高的假阳性率，在噪声较大或分辨率较低的数据上尤其如此。相比之下，全局方法试图优化一组束流，使其与底层纤维方向估计的一致性最大化，但计算成本较高。为解决这些挑战，我们引入GenTract，这是首个用于全局束追踪的生成模型。我们将束追踪视为生成任务，学习从dMRI到完整、解剖学上合理的束流的直接映射。我们比较了基于扩散和基于流匹配的两种范式，并将GenTract与最先进的基线方法进行了性能比较。值得注意的是，GenTract的精度分别比次优方法DDTracking和TractOracle高出1.8倍和2.1倍。在具有挑战性的低分辨率和噪声设置中，这一优势更加明显，其精度达到最接近竞争方法的3.5倍。GenTract既能在研究级数据上生成高精度的束流图，又能在不完美的低分辨率数据上保持可靠性，是全局束追踪的一个有前景的解决方案。

英文摘要

Tractography is the process of inferring the trajectories of white-matter pathways in the brain from diffusion magnetic resonance imaging (dMRI). Local tractography methods, which construct streamlines by following local fiber orientation estimates stepwise through an image, are prone to error accumulation and high false positive rates, particularly on noisy or low-resolution data. In contrast, global methods, which attempt to optimize a collection of streamlines to maximize compatibility with underlying fiber orientation estimates, are computationally expensive. To address these challenges, we introduce GenTract, the first generative model for global tractography. We frame tractography as a generative task, learning a direct mapping from dMRI to complete, anatomically plausible streamlines. We compare both diffusion-based and flow matching paradigms and evaluate GenTract's performance against state-of-the-art baselines. Notably, GenTract achieves precision 1.8x and 2.1x higher than the next-best methods, DDTracking and TractOracle, respectively. This advantage becomes even more pronounced in challenging low-resolution and noisy settings, where it outperforms the closest competitor by a factor of 3.5. By producing tractograms with high precision on research-grade data while also maintaining reliability on imperfect, lower-resolution data, GenTract represents a promising solution for global tractography.

2511.10254 2026-06-05 cs.CV 版本更新

Facial-R1: Aligning Reasoning and Recognition for Facial Emotion Analysis

Facial-R1: 通过推理与识别对齐实现面部情绪分析

Jiulong Wu, Yucheng Shen, Lingyong Yan, Haixin Sun, Deguo Xia, Jizhou Huang, Min Cao

AI总结本文提出Facial-R1框架，通过三阶段对齐方法解决面部情绪分析中推理与识别不一致及推理幻觉的问题，并引入FEA-20K基准数据集，验证了其在多个标准基准上的最佳性能。

Comments Withdrawn by the authors due to pending intellectual property considerations. The authors have determined that the current version contains material that should not have been publicly disseminated at this stage

英文摘要

Facial Emotion Analysis (FEA) extends traditional facial emotion recognition by incorporating explainable, fine-grained reasoning. The task integrates three subtasks: emotion recognition, facial Action Unit (AU) recognition, and AU-based emotion reasoning to model affective states jointly. While recent approaches leverage Vision-Language Models (VLMs) and achieve promising results, they face two critical limitations: (1) hallucinated reasoning, where VLMs generate plausible but inaccurate explanations due to insufficient emotion-specific knowledge; and (2) misalignment between emotion reasoning and recognition, caused by fragmented connections between observed facial features and final labels. We propose Facial-R1, a three-stage alignment framework that effectively addresses both challenges with minimal supervision. First, we employ instruction fine-tuning to establish basic emotional reasoning capability. Second, we introduce reinforcement training guided by emotion and AU labels as reward signals, which explicitly aligns the generated reasoning process with the predicted emotion. Third, we design a data synthesis pipeline that iteratively leverages the prior stages to expand the training dataset, enabling scalable self-improvement of the model. Built upon this framework, we introduce FEA-20K, a benchmark dataset comprising 17,737 training and 1,688 test samples with fine-grained emotion analysis annotations. Extensive experiments across eight standard benchmarks demonstrate that Facial-R1 achieves state-of-the-art performance in FEA, with strong generalization and robust interpretability.

2510.23497 2026-06-05 cs.CV 版本更新

VOLD: Reasoning Transfer from LLMs to Vision-Language Models via On-Policy Distillation

VOLD：通过同策略（on-policy）蒸馏将LLM的推理能力迁移到视觉语言模型

Walid Bousselham, Hilde Kuehne, Cordelia Schmid

发表机构 * Tuebingen AI Center（图宾根人工智能中心）； University of Tuebingen（图宾根大学）； MIT-IBM Watson AI Lab（MIT-IBM沃森人工智能实验室）； Inria, École Normale Supérieure, CNRS, PSL Research University（法国国家信息与自动化研究所、巴黎高等师范学院、法国国家科学研究中心、巴黎文理研究大学）

AI总结本文提出VOLD框架，通过同策略（on-policy）蒸馏将纯文本模型的推理能力迁移到视觉语言模型，将组相对策略优化（GRPO）与同策略蒸馏相结合以提升推理性能，并验证了冷启动对齐对在线训练阶段的重要性。

Comments www.walidbousselham.com/VOLD/

AI中文摘要

训练视觉语言模型（VLMs）进行复杂推理仍是一项具有挑战性的任务，原因之一是高质量的图像-文本推理数据稀缺。相反，基于文本的推理资源丰富且可扩展，但如何利用它们来增强VLM推理仍是一个开放性问题。为此，我们提出了VOLD，一种将推理能力从纯文本教师模型迁移到VLM学生模型的框架。VOLD将基于组相对策略优化（GRPO）的强化学习与同策略（on-policy）蒸馏相结合，使学生模型的推理轨迹能够得到教师模型的引导，从而相比单独使用GRPO获得显著提升。我们进一步表明，在此场景中，要在在线训练阶段实现有效迁移，冷启动对齐必不可少；如果教师与学生之间缺乏足够的分布对齐，同策略蒸馏就无法提供有意义的指导。我们在MMMU-Pro、MathVision、MathVista和LogicVista等多样化的基准测试上评估了VOLD，结果表明VOLD显著优于基线模型，并超越了现有最先进水平。我们的消融研究表明，在使用纯文本教师进行同策略蒸馏时，通过SFT进行冷启动对齐十分重要。

英文摘要

Training vision-language models (VLMs) for complex reasoning remains a challenging task, in part because of the scarcity of high-quality image-text reasoning data. Conversely, text-based reasoning resources are abundant and scalable, but how to leverage them for VLM reasoning is still an open question. To address this problem, we propose VOLD, a framework to transfer reasoning capabilities from text-only teacher models to VLM student models. To this end, VOLD combines reinforcement learning via Group Relative Policy Optimization (GRPO) with on-policy distillation, which allows the student reasoning traces to be guided by the teacher model, resulting in a significant gain over using GRPO alone. We further show that a cold-start alignment is essential for an effective transfer during the online training phase in this scenario and that without sufficient distributional alignment between teacher and student, on-policy distillation fails to provide meaningful guidance. We evaluate VOLD across diverse benchmarks including MMMU-Pro, MathVision, MathVista, and LogicVista, showing that VOLD outperforms the baseline model significantly and improves over the state of the art by a margin. Our ablation shows the importance of a cold-start alignment via SFT for on-policy distillation with a text-only teacher.
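
For readers unfamiliar with GRPO, the sketch below shows the group-relative advantage it is commonly described with: the rewards of several responses sampled for the same prompt are standardized within the group. Using the population standard deviation and the `eps` stabilizer are assumptions of this example, and the on-policy distillation term that VOLD adds is not shown because the abstract does not specify its form.

```python
import math
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group of sampled responses.

    advantage_i = (r_i - mean(r)) / (std(r) + eps)
    """
    n = len(rewards)
    mean = sum(rewards) / n
    variance = sum((r - mean) ** 2 for r in rewards) / n  # population variance
    std = math.sqrt(variance)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one prompt: two correct (1.0), two wrong (0.0).
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # If every answer gets the same reward, no response is preferred.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

When all responses in a group receive the same reward, every advantage is zero and the group contributes no policy-gradient signal, which is one commonly cited reason for adding an extra guidance signal such as a teacher.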

2504.10020 2026-06-05 cs.CL cs.AI cs.CV 版本更新

The Mirage of Performance Gains: Why Contrastive Decoding Fails to Mitigate Object Hallucinations in MLLMs?

性能提升的幻象：为何对比解码无法减轻多模态大语言模型中的对象幻觉？

Hao Yin, Guangzong Si, Zilei Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）； Eastern Institute of Technology, Ningbo（宁波东方理工大学）

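AI summary: This paper examines whether contrastive decoding actually mitigates object hallucinations in multimodal large language models (MLLMs) and finds that the reported gains stem mainly from two misleading factors, challenging the assumed effectiveness of contrastive decoding strategies.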

详情

AI中文摘要

对比解码策略被广泛用于减少多模态大语言模型（MLLMs）中的对象幻觉。这些方法通过构建对比样本来诱导幻觉，然后在输出分布中抑制它们。然而，本文证明此类方法无法有效缓解幻觉问题。在POPE基准测试中观察到的性能提升主要由两个误导性因素驱动：（1）对模型输出分布的粗略、单向调整；（2）自适应可能性约束，将采样策略简化为贪婪搜索。为进一步说明这些问题，我们引入了一系列虚假改进方法，并将其性能与对比解码技术进行评估。实验结果揭示了对比解码中观察到的性能提升与其缓解幻觉的初衷无关。我们的发现挑战了对比解码策略有效性的常见假设，并为开发真正有效的MLLMs幻觉解决方案铺平了道路。

英文摘要

Contrastive decoding strategies are widely used to reduce object hallucinations in multimodal large language models (MLLMs). These methods work by constructing contrastive samples to induce hallucinations and then suppressing them in the output distribution. However, this paper demonstrates that such approaches fail to effectively mitigate the hallucination problem. The performance improvements observed on the POPE benchmark are largely driven by two misleading factors: (1) crude, unidirectional adjustments to the model's output distribution and (2) the adaptive plausibility constraint, which reduces the sampling strategy to greedy search. To further illustrate these issues, we introduce a series of spurious improvement methods and evaluate their performance against contrastive decoding techniques. Experimental results reveal that the observed performance gains in contrastive decoding are entirely unrelated to its intended goal of mitigating hallucinations. Our findings challenge common assumptions about the effectiveness of contrastive decoding strategies and pave the way for developing genuinely effective solutions to hallucinations in MLLMs.
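
To make factor (2) concrete, here is a minimal sketch of one contrastive decoding step with an adaptive plausibility constraint, in the form commonly used by contrastive decoding methods: logits from the original input are contrasted with logits from a hallucination-inducing input, and only tokens whose original probability is at least `beta` times the maximum are kept. The function names, the `alpha` and `beta` defaults, and the toy logits are assumptions chosen for illustration; the paper's own experimental setup may differ.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # stabilize; masked entries stay at -inf
    e = np.exp(z)
    return e / e.sum()

def contrastive_decode_step(logits, logits_contrast, alpha=1.0, beta=0.1):
    """Return the contrastive distribution and the plausibility mask."""
    p_orig = softmax(logits)
    # Adaptive plausibility constraint: keep tokens that the original
    # model already finds reasonably likely.
    keep = p_orig >= beta * p_orig.max()
    # Amplify the original logits and subtract the contrastive ones.
    scores = (1.0 + alpha) * logits - alpha * logits_contrast
    scores = np.where(keep, scores, -np.inf)
    return softmax(scores), keep

# Toy example with a peaked original distribution (hypothetical values).
logits = np.array([4.0, 1.0, 0.5, 0.0])
logits_contrast = np.array([3.0, 1.5, 0.5, 0.0])
probs, keep = contrastive_decode_step(logits, logits_contrast)
print(keep.sum(), probs)
```

In this toy case only one token passes the constraint, so sampling from `probs` always returns the same token as greedy search, which is the effect the abstract describes.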

2508.09697 2026-06-05 cs.LG cs.CV 版本更新

Towards Label-Noise Resistant Learning via Optimal Brain Damage Masking

通过最优脑损伤遮蔽实现抗标签噪声学习

Xinlei Zhang, Fan Liu, Chuanyi Zhang, Fan Cheng, Qian Li, Yuhui Zheng

发表机构 * Hohai University（河海大学）

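AI summary: This paper proposes a label-noise-resistant learning method grounded in Optimal Brain Damage theory, which masks redundant connections to limit the propagation of noisy gradients and improve model robustness.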

详情

AI中文摘要

噪声标签在现实世界中不可避免。由于深度神经网络强大的记忆能力，这些噪声标签会导致显著的性能下降。现有的噪声鲁棒方法主要集中在鲁棒损失函数和样本选择上，对动态架构适应的探索相对有限。本文重新审视了标签噪声存在下模型连接的作用。直观上，噪声标签引起的性能下降源于噪声梯度的反向传播。由于最终分类器层是这种误差传播的主要通道，直接丢弃分类器中的冗余连接可以在根源上截断噪声梯度。为了识别这些冗余连接，我们利用模型压缩中的经典最优脑损伤（OBD）理论，该理论指出造成微小损失扰动的参数可以安全移除而不影响性能。基于这一原则，我们发现遮蔽低激活边可以保持网络的正常拟合能力，同时有效降低噪声梯度传播的风险。为了将这一理论洞察与实际训练相结合，我们提出了一种新的选择性边遮蔽（SEM）机制，用于广泛采用的全连接（FC）层，以增强模型对噪声标签的鲁棒性。SEM可以自适应地只保留最重要的边用于信息传播，同时抑制由噪声标签引起的梯度误差。作为插件式组件，SEM可以无缝集成到各种噪声鲁棒方法中，包括鲁棒损失函数和样本选择。在合成和现实世界基准上的广泛评估表明，我们的OBD驱动方法在性能上始终优于最先进的方法。

英文摘要

Noisy labels are inevitable in real-world scenarios. Due to the strong capacity of deep neural networks to memorize corrupted labels, these noisy labels cause significant performance degradation. Existing noise-robust methods have mainly focused on robust loss functions and sample selection, with comparatively limited exploration of dynamic architectural adaptation. In this paper, we rethink the role of model connectivity in the presence of label noise. Intuitively, performance degradation caused by noisy labels stems from the backpropagation of noisy gradients. Since the final classifier layer acts as the primary gateway for this error propagation, directly discarding redundant connections within the classifier can structurally intercept noisy gradients at the root. Consequently, to identify these redundant connections, we leverage the seminal Optimal Brain Damage (OBD) theory from model compression, which posits that parameters causing negligible loss perturbation can be safely removed without impairing performance. Guided by this principle, we reveal that masking low-activation edges maintains the network's normal fitting capacity while effectively reducing the risk of backpropagating noisy gradients. To bridge this theoretical insight with practical training, we propose a novel Selective Edge Masking (SEM) mechanism for the widely-adopted fully connected (FC) layer to enhance model robustness against noisy labels. It can adaptively preserve only the most critical edges for information propagation while suppressing gradient errors caused by noisy labels. As a plug-and-play component, SEM can be seamlessly integrated into various noise-robust methods, including robust loss functions and sample selection. Extensive evaluations on both synthetic and real-world benchmarks demonstrate that our OBD-driven approach consistently outperforms state-of-the-art methods.
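
For readers unfamiliar with OBD, the classic criterion ranks each parameter by the loss increase expected from deleting it, using a diagonal Hessian approximation: s_k = 0.5 * h_kk * w_k^2. The sketch below applies that criterion to a small fully connected weight matrix and masks the lowest-saliency edges. It illustrates the general OBD idea only; the paper's SEM mechanism selects edges by its own activation-based rule, and the names `obd_saliency`, `mask_low_saliency`, `keep_ratio`, and all numbers here are hypothetical.

```python
import numpy as np

def obd_saliency(weights, hessian_diag):
    """Second-order estimate of the loss increase from zeroing each weight."""
    return 0.5 * hessian_diag * weights**2

def mask_low_saliency(weights, saliency, keep_ratio=0.5):
    """Keep the highest-saliency edges and zero out the rest."""
    k = max(1, int(round(keep_ratio * saliency.size)))
    threshold = np.sort(saliency, axis=None)[-k]   # k-th largest saliency
    mask = saliency >= threshold
    return weights * mask, mask

# Toy 2x2 classifier layer with an assumed diagonal Hessian.
weights = np.array([[0.5, -2.0], [0.1, 1.0]])
hessian_diag = np.array([[1.0, 0.5], [4.0, 0.1]])

saliency = obd_saliency(weights, hessian_diag)
masked_weights, mask = mask_low_saliency(weights, saliency, keep_ratio=0.5)
print(saliency)
print(mask)
```

Note that a large weight is not automatically salient: the weight 1.0 in the second row is dropped because its assumed curvature is small.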

2509.15061 2026-06-05 cs.RO cs.CV 版本更新

Ask-to-Clarify: Resolving Instruction Ambiguity through Multi-turn Dialogue

Ask-to-Clarify: 通过多轮对话解决指令歧义

Xingyao Lin, Xinghao Zhu, Tianyi Lu, Sicheng Xie, Hui Zhang, Xipeng Qiu, Zuxuan Wu, Yu-Gang Jiang

发表机构 * College of Computer Science and Artificial Intelligence, Fudan University, Shanghai, China（复旦大学计算机科学与人工智能学院）； Shanghai Innovation Institute, Shanghai, China（上海创新研究院）； Mechanical Systems Control Lab, UC Berkeley, California, USA（伯克利机械系统控制实验室）

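AI summary: This paper proposes the Ask-to-Clarify framework, which resolves instruction ambiguity through multi-turn dialogue. It combines a vision-language model with a diffusion model and is trained with a two-stage knowledge-insulation strategy, moving toward more effective collaborative embodied agents across multiple tasks.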

Comments 9 pages, 4 figures, 7 tables

详情

AI中文摘要

具身代理的最终目标是创造能够与人类交互的合作者，而非仅仅被动执行指令的执行者。这要求代理能够根据人类反馈进行沟通、协调并调整其行动。最近，视觉-语言-动作模型（VLAs）的进步为实现这一目标提供了途径。然而，大多数当前基于VLA的具身代理仍处于单向模式：接收指令并执行，而无反馈。这种做法在现实场景中往往失效，因为指令通常存在歧义。在本文中，我们提出了Ask-to-Clarify框架来解决这一问题。该框架首先在多轮对话中通过提问消解有歧义的指令，然后端到端地生成低层动作。具体来说，Ask-to-Clarify框架由两个组件组成：一个用于协作的视觉语言模型（VLM）和一个用于动作的扩散模型。我们还引入了一个连接模块，该模块根据VLM的输出为扩散模型生成条件。该模块根据指令调整观测，以生成可靠的条件。我们采用两阶段知识隔离策略来训练我们的框架。首先，我们使用歧义消解对话数据微调协作组件以处理歧义。然后，我们在冻结协作组件的情况下接入动作组件。这样既保留了交互能力，又能微调扩散模型以生成动作。该训练策略保证我们的框架能够先提问、再生成动作。在推理过程中，一个信号检测器充当路由器，帮助框架在提问和执行动作之间切换。我们在8个现实任务中评估了Ask-to-Clarify框架，其表现优于现有最先进的VLA。结果表明，所提出的框架及其训练策略为协作式具身代理提供了一条可行路径。

英文摘要

The ultimate goal of embodied agents is to create collaborators that can interact with humans, not mere executors that passively follow instructions. This requires agents to communicate, coordinate, and adapt their actions based on human feedback. Recently, advances in VLAs have offered a path toward this goal. However, most current VLA-based embodied agents operate in a one-way mode: they receive an instruction and execute it without feedback. This approach fails in real-world scenarios where instructions are often ambiguous. In this paper, we address this problem with the Ask-to-Clarify framework. Our framework first resolves ambiguous instructions by asking questions in a multi-turn dialogue. Then it generates low-level actions end-to-end. Specifically, the Ask-to-Clarify framework consists of two components, one VLM for collaboration and one diffusion for action. We also introduce a connection module that generates conditions for the diffusion based on the output of the VLM. This module adjusts the observation by instructions to create reliable conditions. We train our framework with a two-stage knowledge-insulation strategy. First, we fine-tune the collaboration component using ambiguity-solving dialogue data to handle ambiguity. Then, we integrate the action component while freezing the collaboration one. This preserves the interaction abilities while fine-tuning the diffusion to generate actions. The training strategy guarantees our framework can first ask questions, then generate actions. During inference, a signal detector functions as a router that helps our framework switch between asking questions and taking actions. We evaluate the Ask-to-Clarify framework in 8 real-world tasks, where it outperforms existing state-of-the-art VLAs. The results suggest that our proposed framework, along with the training strategy, provides a path toward collaborative embodied agents.

2503.22929 2026-06-05 cs.CV 版本更新

Self-supervised Feature Disentanglement and Augmentation Network for One-class Face Anti-spoofing

自监督特征解耦与增强网络用于单类面部反伪装

Pei-Kai Huang, Jun-Xiong Chong, Ming-Tsung Hsu, Fang-Yu Hsu, Yi-Ting Lin, Kai-Heng Chien, Hao-Chiang Shao, Chiou-Ting Hsu

发表机构 * National Tsing Hua University（国立清华大学）

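AI summary: This paper proposes an unsupervised feature disentanglement and augmentation network (UFDANet) that disentangles liveness features from domain features to improve the generalization of one-class face anti-spoofing. Experiments show that it outperforms existing one-class methods and is comparable to two-class methods.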

详情

AI中文摘要

面部反伪装（FAS）技术旨在通过区分真实活体面部与欺骗性尝试来增强面部身份认证的安全性。双类FAS方法为追求更好的性能，存在对训练中的攻击类型过拟合的风险；单类FAS方法能较好地处理未见过的攻击，但对活体特征中纠缠的领域信息不够鲁棒。为此，我们提出了一种无监督特征解耦与增强网络（UFDANet），这是一种单类FAS技术，通过解耦特征对面部图像进行增强以提升泛化能力。UFDANet采用新颖的无监督特征解耦方法分离活体特征和领域特征，促进判别性特征学习。它整合了分布外（out-of-distribution）活体特征增强方案，合成偏离活体类的未见欺骗类的新活体特征，从而增强活体特征的表示性和判别性。此外，UFDANet还整合了领域特征增强流程以合成未见过的领域特征，从而实现更好的泛化能力。大量实验表明，所提出的UFDANet优于以往的单类FAS方法，并取得了与最先进的双类FAS方法相当的性能。

英文摘要

Face anti-spoofing (FAS) techniques aim to enhance the security of facial identity authentication by distinguishing authentic live faces from deceptive attempts. While two-class FAS methods risk overfitting to training attacks to achieve better performance, one-class FAS approaches handle unseen attacks well but are less robust to domain information entangled within the liveness features. To address this, we propose an Unsupervised Feature Disentanglement and Augmentation Network (UFDANet), a one-class FAS technique that enhances generalizability by augmenting face images via disentangled features. UFDANet employs a novel unsupervised feature disentangling method to separate the liveness and domain features, facilitating discriminative feature learning. It integrates an out-of-distribution liveness feature augmentation scheme to synthesize new liveness features of unseen spoof classes, which deviate from the live class, thus enhancing the representability and discriminability of liveness features. Additionally, UFDANet incorporates a domain feature augmentation routine to synthesize unseen domain features, thereby achieving better generalizability. Extensive experiments demonstrate that the proposed UFDANet outperforms previous one-class FAS methods and achieves comparable performance to state-of-the-art two-class FAS methods.

2507.12336 2026-06-05 cs.CV 版本更新

Unsupervised Monocular 3D Keypoint Discovery from Multi-View Diffusion Priors

基于多视图扩散先验的无监督单目3D关键点发现

Subin Jeon, In Cho, Junyoung Hong, Woong Oh Cho, Seon Joo Kim

发表机构 * Yonsei University（延世大学）

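AI summary: This paper proposes KeyDiff3D, a framework that predicts 3D keypoints from a single image by exploiting the geometric priors of a pretrained multi-view diffusion model, converting implicit 3D priors into explicit 3D feature volumes to support keypoint estimation and 3D object manipulation.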

Comments Accepted at CVPR 2026. Project page: https://subin6.github.io/keydiff3d-project/

详情

AI中文摘要

大多数现有的3D关键点估计方法依赖于手动标注或校准的多视角图像，这两种方法都昂贵且难以收集。本文引入KeyDiff3D框架，该框架能够从单张图像准确预测3D关键点，从而消除对昂贵数据采集的依赖。为此，我们利用预训练的多视角扩散模型中嵌入的强大几何先验。在我们的框架中，扩散模型从单张图像生成多视角图像，作为监督信号，为模型提供3D几何线索。我们还引入了3D特征提取器，将扩散特征中隐含的3D先验转换为显式的3D特征体。除了准确的关键点估计外，我们还引入了一条管道，使由扩散模型生成的3D对象得以操控。在多样化的数据集上，包括Human3.6M、CUB-200-2011、斯坦福狗、以及多个真实世界和非领域输入，实验结果突显了我们的方法在准确性、泛化能力和从单张图像生成3D对象并进行操控方面的有效性。

英文摘要

Most existing 3D keypoint estimation methods rely on manual annotations or calibrated multi-view images, both of which are expensive to collect. This paper introduces KeyDiff3D, a framework that can accurately predict 3D keypoints from a single image, thus eliminating the need for such expensive data acquisitions. To achieve this, we leverage powerful geometric priors embedded in a pretrained multi-view diffusion model. In our framework, the diffusion model generates multi-view images from a single image, serving as supervision signals to provide 3D geometric cues to our model. We also introduce a 3D feature extractor that transforms implicit 3D priors embedded in the diffusion features into explicit 3D feature volumes. Beyond accurate keypoint estimation, we further introduce a pipeline that enables manipulation of 3D objects generated by the diffusion model. Experimental results on diverse datasets, including Human3.6M, CUB-200-2011, Stanford Dogs, and several in-the-wild and out-of-domain inputs, highlight the effectiveness of our method in terms of accuracy, generalization, and its ability to enable manipulation of 3D objects generated by the diffusion model from a single image.

2506.22078 2026-06-05 cs.CV 版本更新

Towards Accurate Heart Rate Measurement from Ultra-Short Video Clips via Periodicity-Guided rPPG Estimation and Signal Reconstruction

通过周期性引导的rPPG估计与信号重建实现从超短视频片段中准确的心率测量

Pei-Kai Huang, Ya-Ting Chan, Kuan-Wen Chen, Chiou-Ting Hsu, Xiaoding Wang, Md. Jalil Piran

发表机构 * National Tsing Hua University（国立清华大学）； Fujian Normal University（福建师范大学）； Sungkyunkwan University（成均馆大学）

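AI summary: Targeting heart rate measurement from ultra-short video clips, this paper proposes a periodicity-guided rPPG estimation method and a signal reconstruction technique, and validates them on several benchmark datasets.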

详情

AI中文摘要

许多远程心率（HR）测量方法专注于从持续约10秒的视频片段中估计远程光体积脉动图（rPPG）信号，但常常忽略了从超短视频片段中估计心率的必要性。在本文中，我们旨在通过专门解决两个关键挑战来准确测量超短2秒视频片段中的心率。首先，为了解决超短视频片段中心跳周期数量有限的问题，我们提出了一种有效的周期性引导的rPPG估计方法，该方法强制在从超短片段中估计的rPPG信号与其更长的真实信号之间的周期性保持一致。其次，为了解决由于频谱泄漏导致的估计不准确问题，我们提出包含生成器来从超短片段中重建更长的rPPG信号，同时保持其周期性一致性，以实现更准确的心率测量。在四个rPPG估计基准数据集上的大量实验表明，我们提出的方法不仅能够准确测量超短视频片段中的心率，而且在rPPG估计技术中实现了最先进的性能。

英文摘要

Many remote Heart Rate (HR) measurement methods focus on estimating remote photoplethysmography (rPPG) signals from video clips lasting around 10 seconds but often overlook the need for HR estimation from ultra-short video clips. In this paper, we aim to accurately measure HR from ultra-short 2-second video clips by specifically addressing two key challenges. First, to overcome the limited number of heartbeat cycles in ultra-short video clips, we propose an effective periodicity-guided rPPG estimation method that enforces consistent periodicity between rPPG signals estimated from ultra-short clips and their much longer ground truth signals. Next, to mitigate estimation inaccuracies due to spectral leakage, we propose including a generator to reconstruct longer rPPG signals from ultra-short ones while preserving their periodic consistency to enable more accurate HR measurement. Extensive experiments on four rPPG estimation benchmark datasets demonstrate that our proposed method not only accurately measures HR from ultra-short video clips but also outperforms previous rPPG estimation techniques to achieve state-of-the-art performance.
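
A quick calculation shows why 2-second clips are hard for spectrum-based HR estimation: the spacing between DFT frequency bins is the reciprocal of the clip duration, so a shorter clip gives a coarser heart-rate grid, and a true rate that falls between bins leaks into its neighbors. The sketch below assumes a 30 fps camera (an assumption; the abstract does not state a frame rate) and uses a hypothetical helper `hr_bin_spacing_bpm`.

```python
import numpy as np

def hr_bin_spacing_bpm(clip_seconds, fps=30.0):
    """Spacing between adjacent DFT bins, expressed in beats per minute."""
    n_frames = int(round(clip_seconds * fps))
    freqs_hz = np.fft.rfftfreq(n_frames, d=1.0 / fps)
    # Adjacent bins are fps / n_frames = 1 / clip_seconds Hz apart.
    return (freqs_hz[1] - freqs_hz[0]) * 60.0

print(hr_bin_spacing_bpm(10.0))  # 10 s clip: 0.1 Hz spacing, i.e. 6 bpm
print(hr_bin_spacing_bpm(2.0))   # 2 s clip: 0.5 Hz spacing, i.e. 30 bpm
```

With bins 30 bpm apart, a plain spectral peak cannot separate, say, 70 bpm from 85 bpm, which is consistent with the abstract's motivation for reconstructing a longer signal before measuring HR.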

2506.20263 2026-06-05 cs.CV 版本更新

Hierarchical Mask-Enhanced Dual Reconstruction Network for Few-Shot Fine-Grained Image Classification

层次化掩码增强双重建网络用于少样本细粒度图像分类

Ning Luo, Meiyin Hu, Huan Wan, Yanyan Yang, Zhuohang Jiang, Xin Wei

发表机构 * Nanjing University（南京大学）

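AI summary: This paper proposes the Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN), which combines dual-layer feature reconstruction with mask-enhanced feature processing to distinguish visually similar subclasses in few-shot fine-grained image classification. Experiments show that it outperforms existing methods on three fine-grained datasets.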

详情

AI中文摘要

少样本细粒度图像分类（FS-FGIC）具有挑战性，因为它需要在极少量标记示例下区分视觉相似的子类。现有方法存在关键限制：基于度量的方法丢失空间信息并导致局部特征错位，而基于重建的方法未充分利用层次特征信息且缺乏对判别关键区域的选择性关注。我们提出层次化掩码增强双重建网络（HMDRN），整合双层特征重建与掩码增强特征处理。HMDRN通过可学习权重利用不同网络层次的互补视觉信息，平衡高层语义表示与中层结构细节。它包含一个空间二进制掩码增强的Transformer模块，可选择增强判别区域并过滤背景噪声。在三个细粒度数据集上，HMDRN在Conv-4和ResNet-12背骨上均优于现有最先进方法。消融研究验证了每个组件的有效性，显示双层重建增强类间判别能力，而掩码增强转换减少类内变化。

英文摘要

Few-shot fine-grained image classification (FS-FGIC) is challenging as it requires distinguishing visually similar subclasses with extremely limited labeled examples. Existing methods suffer from critical limitations: metric-based methods lose spatial information and misalign local features, while reconstruction-based methods underuse hierarchical feature information and lack selective focus on discriminative key regions. We propose the Hierarchical Mask-enhanced Dual Reconstruction Network (HMDRN), integrating dual-layer feature reconstruction with mask-enhanced feature processing. HMDRN leverages complementary visual information from different network hierarchies via learnable weights, balancing high-level semantic representations with mid-level structural details. It incorporates a spatial binary mask-enhanced transformer module that selectively enhances discriminative regions while filtering background noise. On three fine-grained datasets, HMDRN consistently outperforms state-of-the-art methods with both Conv-4 and ResNet-12 backbones. Ablation studies validate each component's effectiveness, showing dual-layer reconstruction enhances inter-class discrimination while mask-enhanced transformation reduces intra-class variations.

2506.10145 2026-06-05 cs.CV 版本更新

RoCA: Robust Cross-Domain End-to-End Autonomous Driving

RoCA: 面向鲁棒跨域端到端自动驾驶的框架

Rajeev Yasarla, Shizhong Han, Hsin-Pai Cheng, Apratim Bhattacharyya, Shweta Mahajan, Litian Liu, Yunxiao Shi, Risheek Garrepalli, Hong Cai, Fatih Porikli

发表机构 * Qualcomm AI Research（高通人工智能研究）

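AI summary: This paper proposes RoCA, a framework that models the joint probability distribution over the tokens encoding ego and surrounding-vehicle information in an end-to-end autonomous driving pipeline, improving cross-domain generalization and robustness without extra inference computation.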

Comments accepted for ICML 2026

详情

AI中文摘要

端到端（E2E）自动驾驶最近作为一种新范式出现，具有显著潜力。然而，很少有研究探讨跨域（例如不同城市）部署的实际挑战。尽管一些工作引入了大型语言模型（LLMs）以利用其开放世界知识，但LLMs无法保证跨域驾驶性能，且在域适应过程中可能产生高昂的重训练成本。本文提出RoCA，一种用于鲁棒跨域端到端自动驾驶的新框架。RoCA在E2E管道中对编码自车（ego）和周围车辆信息的token的联合概率分布进行建模。通过高斯过程（GP）实例化，RoCA学习一组带有相应轨迹的基底token，这些token覆盖了多样化的驾驶场景。然后，给定任何驾驶场景，它能够概率性地推断未来轨迹。通过在源域训练中将RoCA与基础E2E模型结合使用，我们提升了基础模型的泛化能力，而无需额外的推理计算。此外，RoCA能够在新目标域上实现鲁棒适应，显著优于直接微调。我们在各种跨域场景中对RoCA进行了广泛评估，结果表明它具有很强的域泛化和域适应性能。

英文摘要

End-to-end (E2E) autonomous driving has recently emerged as a new paradigm, offering significant potential. However, few studies have looked into the practical challenge of deployment across domains (e.g., cities). Although several works have incorporated Large Language Models (LLMs) to leverage their open-world knowledge, LLMs do not guarantee cross-domain driving performance and may incur prohibitive retraining costs during domain adaptation. In this paper, we propose RoCA, a novel framework for robust cross-domain E2E autonomous driving. RoCA formulates the joint probabilistic distribution over the tokens that encode ego and surrounding vehicle information in the E2E pipeline. Instantiating with a Gaussian process (GP), RoCA learns a set of basis tokens with corresponding trajectories, which span diverse driving scenarios. Then, given any driving scene, it is able to probabilistically infer the future trajectory. By using RoCA together with a base E2E model in source-domain training, we improve the generalizability of the base model, without requiring extra inference computation. In addition, RoCA enables robust adaptation on new target domains, significantly outperforming direct finetuning. We extensively evaluate RoCA on various cross-domain scenarios and show that it achieves strong domain generalization and adaptation performance.
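
The abstract gives no equations, so as a rough illustration of what "probabilistically infer the future trajectory" can mean under a Gaussian process, the sketch below runs standard GP regression from token embeddings to trajectory vectors with an RBF kernel. Everything specific here is assumed for illustration: the kernel choice, the `length_scale` and `noise` values, the 2-D toy tokens, and the function names. It is not RoCA's actual formulation.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two sets of row vectors."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dist / length_scale**2)

def gp_predict(basis_tokens, basis_trajs, query_tokens, noise=1e-6):
    """GP posterior mean and per-query variance for the trajectories."""
    k_bb = rbf_kernel(basis_tokens, basis_tokens)
    k_bb = k_bb + noise * np.eye(len(basis_tokens))
    k_qb = rbf_kernel(query_tokens, basis_tokens)
    mean = k_qb @ np.linalg.solve(k_bb, basis_trajs)
    cov = rbf_kernel(query_tokens, query_tokens) - k_qb @ np.linalg.solve(k_bb, k_qb.T)
    return mean, np.diag(cov)

# Hypothetical basis tokens and their associated (flattened) trajectories.
basis_tokens = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
basis_trajs = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])

# One query on a basis token, one far from all of them.
queries = np.array([[0.0, 0.0], [1.5, 1.5]])
mean, var = gp_predict(basis_tokens, basis_trajs, queries)
print(mean)
print(var)
```

The first query reproduces its basis trajectory with near-zero variance, while the second gets a much larger variance, which is the kind of uncertainty signal a GP-based formulation can offer for unfamiliar scenes.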

2408.11336 2026-06-05 cs.LG cs.CV 版本更新

FATE: Focal-modulated Attention Encoder for Multivariate Time-series Forecasting

FATE：用于多变量时间序列预测的焦点调节注意力编码器

Tajamul Ashraf, Janibul Bashir

发表机构 * GAASH Research Lab（GAASH研究实验室）； Department of Information Technology（信息技术系）； National Institute of Technology Srinagar（斯利那加国立理工学院）

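AI summary: This paper proposes FATE, a new transformer architecture for reliable multivariate time-series forecasting. FATE introduces a tensorized focal modulation mechanism that explicitly captures spatiotemporal correlations, offers interpretability through two modulation scores, and is benchmarked on seven real-world datasets, where it performs strongly on long-horizon multivariate meteorological data.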

详情

AI中文摘要

气候变化是21世纪最紧迫的全球挑战之一，其后果包括海平面上升、冰川融化以及日益极端的天气模式。准确的预测对于监测这些现象和支持缓解策略至关重要。尽管最近的数据驱动模型，包括CNNs、RNNs和基于注意力的Transformer，在时间序列预测中显示出潜力，但它们在处理序列依赖性和有限并行性方面存在困难，尤其是在长视界、多变量气象数据集中。在本文中，我们提出了Focal Modulated Attention Encoder（FATE），一种新的Transformer架构，用于可靠的多变量时间序列预测。与传统模型不同，FATE引入了张量化的焦点调节机制，以显式捕捉时间序列数据中的时空相关性。我们进一步提出了两个调节分数，通过突出影响预测的关键环境特征来提供可解释性。我们在七个不同的现实世界数据集上基准测试FATE，包括ETTh1、ETTm2、Traffic、Weather5k、USA-Canada、Europe和LargeST数据集，并显示其在所有最先进的方法，包括温度数据集上都表现优异。我们的消融研究也表明，FATE能够很好地推广到更广泛的多变量时间序列预测任务中。

英文摘要

Climate change stands as one of the most pressing global challenges of the twenty-first century, with far-reaching consequences such as rising sea levels, melting glaciers, and increasingly extreme weather patterns. Accurate forecasting is critical for monitoring these phenomena and supporting mitigation strategies. While recent data-driven models for time-series forecasting, including CNNs, RNNs, and attention-based transformers, have shown promise, they often struggle with sequential dependencies and limited parallelization, especially in long-horizon, multivariate meteorological datasets. In this work, we present Focal Modulated Attention Encoder (FATE), a novel transformer architecture designed for reliable multivariate time-series forecasting. Unlike conventional models, FATE introduces a tensorized focal modulation mechanism that explicitly captures spatiotemporal correlations in time-series data. We further propose two modulation scores that offer interpretability by highlighting critical environmental features influencing predictions. We benchmark FATE across seven diverse real-world datasets, including ETTh1, ETTm2, Traffic, Weather5k, USA-Canada, Europe, and LargeST, and show that it consistently outperforms all state-of-the-art methods, including on temperature datasets. Our ablation studies also demonstrate that FATE generalizes well to broader multivariate time-series forecasting tasks.

2506.10601 2026-06-05 cs.CV 版本更新

Semantic-decoupled Spatial Partition Guided Point-supervised Oriented Object Detection

语义解耦的空间分区引导的点监督定向物体检测

Xinyuan Liu, Hang Xu, Zirui Chen, Yike Ma, Chenggang Yan, Feng Dai

发表机构 * Institute of Computing Technology, Chinese Academy of Sciences（中国科学院计算技术研究所）； Hefei University of Technology（合肥工业大学）

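AI summary: This paper proposes SSP, a training-efficient framework that combines rule-driven prior injection with data-driven label purification to address insufficient sample assignment and poor pseudo-label quality under single-point annotation. Experiments on DOTA-v1.0 and other datasets show notable mAP gains with low training time and memory use.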

Comments Published in Pattern Recognition, 2026

详情

DOI: 10.1016/j.patcog.2026.114079
Journal ref: Pattern Recognition, Volume 180, Part B, Article 114079 (2026)

AI中文摘要

鉴于其减少标注成本的能力，基于单点注释的弱监督学习已成为定向物体检测研究的焦点。与经典教师-学生范式相比，简单的模型范式（如PointOBB-v2）可以显著减少训练所需的资源，同时保证强大的性能。后者在低成本训练中具有更大的潜力，但此类方法仍面临样本分配不足和伪标签质量差的挑战。在本文中，我们提出了一种训练高效的框架，称为SSP，该框架结合了规则驱动的先验注入和数据驱动的标签净化。具体而言，SSP引入了两种设计：（1）像素级空间分区基于的样本分配，通过像素映射的空间分区估计物体尺度的上下界，并通过空间分区挖掘高质量的正样本和困难负样本；（2）语义空间分区基于的框提取，通过由语义地图调节的空间分区推导实例，并将其转换为伪框以监督检测器。在DOTA-v1.0和其他数据集上的实验表明，SSP的优越性：与基线相比，SSP实现了+6.73%的mAP提升，同时仅需2小时的训练时间和6GB的GPU内存。此外，当SSP与更强的检测器结合时，mAP可以达到50.81%。代码可在https://github.com/antxinyuan/ssp上获得。

英文摘要

Given its ability to reduce annotation costs, weakly supervised learning based on single-point annotations has emerged as a research focus in oriented object detection. Compared with the classical teacher-student paradigm, the simple model paradigm (e.g., PointOBB-v2) can further reduce the resources required for training substantially while ensuring strong performance. The latter exhibits greater potential for low-cost training, yet such methods still face challenges of insufficient sample assignment and poor pseudo-label quality. In this paper, we propose a training-efficient framework named SSP, which synergizes rule-driven prior injection and data-driven label purification. Specifically, SSP introduces two designs: (1) Pixel-level Spatial Partition-based Sample Assignment, which compactly estimates the upper and lower bounds of object scales and mines high-quality positive samples and hard negative samples through spatial partitioning of pixel maps; (2) Semantic Spatial Partition-based Box Extraction, which derives instances from spatial partitions modulated by semantic maps and converts them into pseudo-boxes for supervising detectors. Experiments on DOTA-v1.0 and other datasets demonstrate SSP's superiority: it achieves +6.73% mAP improvement compared with the baseline, while requiring only 2 h of training time and 6 GB of GPU memory. Furthermore, when SSP is integrated with a stronger detector, the mAP can reach 50.81%. The code is available at https://github.com/antxinyuan/ssp.

2503.23300 2026-06-05 cs.CV cs.RO 版本更新

Learning Predictive Visuomotor Coordination

学习预测性视觉-运动协调

Wenqi Jia, Bolin Lai, Miao Liu, Danfei Xu, James M. Rehg

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Georgia Tech（佐治亚理工学院）； Meta AI

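AI summary: This paper introduces a forecasting-based visuomotor coordination modeling task that predicts head pose, gaze direction, and upper-body motion from egocentric visual and kinematic observations, and it shows the importance of multimodal integration for understanding visuomotor coordination.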

Comments CVPR 2026 Findings

详情

AI中文摘要

理解并预测人类视觉-运动协调对于机器人学、人机交互和辅助技术的应用至关重要。本文介绍了一种基于预测的视觉-运动协调建模任务，目标是从第一人称视觉和运动学观测中预测头部姿态、目光方向和上半身运动。我们提出了一种视觉-运动协调表示（VCR），学习这些多模态信号之间的结构时间依赖性。我们扩展了基于扩散的运动建模框架，整合了第一人称视觉和运动学序列，实现了时间一致且准确的视觉-运动预测。我们的方法在大规模EgoExo4D数据集上进行了评估，展示了在多样化现实活动中的强大泛化能力。我们的结果强调了多模态整合在理解视觉-运动协调中的重要性，为视觉-运动学习和人类行为建模的研究做出了贡献。项目页面：https://vjwq.github.io/VCR/.

英文摘要

Understanding and predicting human visuomotor coordination is crucial for applications in robotics, human-computer interaction, and assistive technologies. This work introduces a forecasting-based task for visuomotor modeling, where the goal is to predict head pose, gaze, and upper-body motion from egocentric visual and kinematic observations. We propose a Visuomotor Coordination Representation (VCR) that learns structured temporal dependencies across these multimodal signals. We extend a diffusion-based motion modeling framework that integrates egocentric vision and kinematic sequences, enabling temporally coherent and accurate visuomotor predictions. Our approach is evaluated on the large-scale EgoExo4D dataset, demonstrating strong generalization across diverse real-world activities. Our results highlight the importance of multimodal integration in understanding visuomotor coordination, contributing to research in visuomotor learning and human behavior modeling. Project Page: https://vjwq.github.io/VCR/.

2503.14295 2026-06-05 cs.CV cs.AI 版本更新

PC-Talk: Precise Facial Animation Control for Audio-Driven Talking Face Generation

PC-Talk: 用于音频驱动说话面部生成的精确面部动画控制

Baiqin Wang, Xiangyu Zhu, Fan Shen, Hao Xu, Zhen Lei

发表机构 * MAIS, Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所多模态人工智能系统全国重点实验室）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； Psyche AI.INC（Psyche AI公司）； HKUST（香港科技大学）； CAIR, HKISI, Chinese Academy of Sciences（中国科学院香港创新研究院人工智能与机器人创新中心）； SCSE, FIE, M.U.S.T（澳门科技大学创新工程学院计算机科学与工程学院）

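AI summary: To address the limited control over facial animation in audio-driven talking face generation, this paper proposes the PC-Talk framework, which improves lip-audio alignment and emotion control to increase the diversity and user-friendliness of generated videos.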

Comments 10 Pages, 6 figures. Accepted in CVPR2026

详情

AI中文摘要

近年来，音频驱动说话面部生成在唇同步方面取得了显著进展。然而，当前方法往往缺乏对面部动画（如说话风格和情绪表达）的充分控制，导致输出结果单一。本文聚焦于改进两个关键因素：唇音对齐和情感控制，以增强说话视频的多样性和易用性。唇音对齐控制关注说话风格和唇部运动幅度等元素，而情感控制则专注于生成逼真的情绪表达，允许对强度等多属性进行修改。为实现精确的面部动画控制，我们提出了一种新的框架PC-Talk，通过隐式关键点变形实现唇音对齐和情感控制。首先，我们的唇音对齐控制模块实现了对说话风格的精确编辑，并调整唇部运动幅度以模拟不同语音音量水平，保持与音频的同步。其次，我们的情感控制模块生成生动的情绪面部特征，通过纯粹的情绪变形实现。该模块还允许对强度进行精细修改，并在不同面部区域组合多种情绪。我们的方法在广泛的实验中展示了出色的控制能力，并在HDTF和MEAD数据集上取得了最先进的性能。

英文摘要

Recent advancements in audio-driven talking face generation have made great progress in lip synchronization. However, current methods often lack sufficient control over facial animation such as speaking style and emotional expression, resulting in uniform outputs. In this paper, we focus on improving two key factors: lip-audio alignment and emotion control, to enhance the diversity and user-friendliness of talking videos. Lip-audio alignment control focuses on elements like speaking style and the scale of lip movements, whereas emotion control is centered on generating realistic emotional expressions, allowing for modifications in multiple attributes such as intensity. To achieve precise control of facial animation, we propose a novel framework, PC-Talk, which enables lip-audio alignment and emotion control through implicit keypoint deformations. First, our lip-audio alignment control module facilitates precise editing of speaking styles at the word level and adjusts lip movement scales to simulate varying vocal loudness levels, maintaining lip synchronization with the audio. Second, our emotion control module generates vivid emotional facial features with pure emotional deformation. This module also enables the fine modification of intensity and the combination of multiple emotions across different facial regions. Our method demonstrates outstanding control capabilities and achieves state-of-the-art performance on both HDTF and MEAD datasets in extensive experiments.

2502.06434 2026-06-05 cs.CV cs.LG 版本更新

Unifying Dataset Pruning and Distillation for Efficient Large-scale Compression

统一数据集剪枝与蒸馏以实现高效大规模压缩

Lingao Xiao, Songhua Liu, Yang He, Xinchao Wang

发表机构 * University of Science and Technology of China（中国科学技术大学）

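AI summary: This paper proposes a unified dataset compression benchmark to study the convergence of dataset pruning and dataset distillation, finds that pruning outperforms distillation at small dataset sizes, and introduces PCA, a hard-label dataset compression framework that emphasizes image quality and storage efficiency.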

Comments Accepted by ICML 2026

详情

AI中文摘要

数据集剪枝（DP）和数据集蒸馏（DD）在输出上有根本差异：DP选择原始图像子集，而DD生成合成图像。最近，DD对原始图像的依赖增加表明两种方法趋于融合。为研究这种融合趋势，我们提出统一的数据集压缩（DC）基准。该基准揭示了软标签-DD的有趣权衡：虽然软标签提供有价值信息，但它们可能使蒸馏过程变得不必要，因为蒸馏图像可能不总能优于随机子集。此外，基准表明在当前阶段，数据集剪枝在小数据集上优于数据集蒸馏。鉴于这些观察，我们探索硬标签-DC作为互补方法，强调图像质量的同时提供显著的存储效率。我们的PCA（Prune, Combine, and Augment）是首个不依赖软标签而是聚焦图像质量的框架。（1）“P”表示基于数据集剪枝指标选择简单样本；（2）“C”表示有效地组合这些样本；（3）“A”表示在训练期间应用受约束的图像增强。我们的代码可在https://github.com/ArmandXiao/Unifying-Dataset-Pruning-and-Distillation获得。

英文摘要

Dataset pruning (DP) and dataset distillation (DD) fundamentally differ in their outputs: DP selects original image subsets, while DD generates synthetic images. Recently, DD's increasing reliance on original images suggests a convergence of the two directions. To investigate this convergence trend, we propose a unified dataset compression (DC) benchmark. This benchmark reveals an interesting trade-off for soft-label-DD: while soft labels provide valuable information, they can make the distillation process less essential, as distilled images may not always outperform random subsets. In addition, the benchmark reveals that in current stages, dataset pruning outperforms dataset distillation at small dataset sizes. Given these observations, we explore hard-label-DC as a complementary approach that emphasizes image quality while offering substantial storage efficiency. Our PCA (Prune, Combine, and Augment) is the first framework that does not rely on soft labels but instead focuses on image quality. (1) "P" means selecting easy samples based on dataset pruning metrics, (2) "C" indicates combining these samples effectively, and (3) "A" is to apply constrained image augmentation during training. Our code is available at https://github.com/ArmandXiao/Unifying-Dataset-Pruning-and-Distillation

2502.02487 2026-06-05 cs.CV 版本更新

Hier-EgoPack: Hierarchical Egocentric Video Understanding with Diverse Task Perspectives

Hier-EgoPack：具有多样任务视角的层次化第一人称视频理解

Simone Alberto Peirone, Francesca Pistilli, Antonio Alliegro, Tatiana Tommasi, Giuseppe Averta

发表机构 * Department of Control and Computer Engineering（控制与计算机工程系）

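AI summary: This paper proposes Hier-EgoPack, which extends EgoPack with a hierarchical architecture and GNN layers for reasoning across multiple temporal granularities, allowing it to handle video understanding across a range of downstream tasks.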

Comments Project webpage at https://sapeirone.github.io/hier-egopack

详情

DOI: 10.1109/tpami.2025.3621326

AI中文摘要

我们对人类活动视频流的理解本质上是多方面的：在短短几秒钟内，我们能够把握正在发生的事情，识别场景中物体的相关性和互动，并预测即将发生的事情，所有这些都在一起发生。为了赋予自主系统这种整体感知，学习如何关联概念、在不同任务中抽象知识，并在学习新技能时利用任务协同是至关重要的。在这方面的一个重要进展是EgoPack，这是一个统一的框架，用于在多样化的任务中理解人类活动，具有最小的开销。EgoPack促进下游任务之间的信息共享和协作，这对于高效学习新技能至关重要。在本文中，我们介绍了Hier-EgoPack，它通过在不同时间粒度上进行推理来扩展EgoPack，从而将其适用范围扩展到更广泛的下游任务。为此，我们提出了一种新的层次化架构用于时间推理，配备了专门设计的GNN层，以有效应对多粒度推理的挑战。我们在多个Ego4D基准上评估了我们的方法，涉及片段级和帧级推理，展示了我们的层次化统一架构如何同时有效地解决这些多样化任务。

英文摘要

Our comprehension of video streams depicting human activities is naturally multifaceted: in just a few moments, we can grasp what is happening, identify the relevance and interactions of objects in the scene, and forecast what will happen soon, everything all at once. To endow autonomous systems with such a holistic perception, learning how to correlate concepts, abstract knowledge across diverse tasks, and leverage task synergies when learning novel skills is essential. A significant step in this direction is EgoPack, a unified framework for understanding human activities across diverse tasks with minimal overhead. EgoPack promotes information sharing and collaboration among downstream tasks, essential for efficiently learning new skills. In this paper, we introduce Hier-EgoPack, which advances EgoPack by enabling reasoning also across diverse temporal granularities, which expands its applicability to a broader range of downstream tasks. To achieve this, we propose a novel hierarchical architecture for temporal reasoning equipped with a GNN layer specifically designed to tackle the challenges of multi-granularity reasoning effectively. We evaluate our approach on multiple Ego4D benchmarks involving both clip-level and frame-level reasoning, demonstrating how our hierarchical unified architecture effectively solves these diverse tasks simultaneously.

2412.07583 2026-06-05 cs.CV cs.AI version update

Mobile Video Diffusion

Haitam Ben Yahia, Denis Korzhenkov, Ioannis Lelekas, Amir Ghodrati, Amirhossein Habibian

Affiliation: Qualcomm AI Research

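AI summary: This paper proposes MobileVD, a mobile-optimized video diffusion model that substantially cuts memory and compute cost by lowering the frame resolution, introducing multi-scale temporal representations, and applying two new pruning schemes, enabling efficient video generation on mobile devices.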

DOI: 10.1109/ICCV51701.2025.01808

English abstract

Video diffusion models have achieved impressive realism and controllability but are limited by high computational demands, restricting their use on mobile devices. This paper introduces the first mobile-optimized video diffusion model. Starting from a spatio-temporal UNet from Stable Video Diffusion (SVD), we reduce memory and computational cost by reducing the frame resolution, incorporating multi-scale temporal representations, and introducing two novel pruning schemas to reduce the number of channels and temporal blocks. Furthermore, we employ adversarial finetuning to reduce the denoising to a single step. Our model, coined MobileVD, is 523x more efficient (1817.2 vs. 4.34 TFLOPs) with a slight quality drop (FVD 149 vs. 171), generating latents for a 14x512x256 px clip in 1.7 seconds on a Xiaomi-14 Pro. Our results are available at https://qualcomm-ai-research.github.io/mobile-video-diffusion/

2308.10897 2026-06-05 cs.CV version update

Can Language Models Learn to Listen?

Evonne Ng, Sanjay Subramanian, Dan Klein, Angjoo Kanazawa, Trevor Darrell, Shiry Ginosar

Affiliation: University of California, Berkeley

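AI summary: This paper proposes a framework for generating appropriate listener facial responses from a speaker's words. Quantized facial motion elements are fed as additional language tokens to a transformer-based large language model, improving the quality of the listener responses.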

Comments ICCV 2023; Project page: https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/

English abstract

We present a framework for generating appropriate facial responses from a listener in dyadic social interactions based on the speaker's words. Given an input transcription of the speaker's words with their timestamps, our approach autoregressively predicts a response of a listener: a sequence of listener facial gestures, quantized using a VQ-VAE. Since gesture is a language component, we propose treating the quantized atomic motion elements as additional language token inputs to a transformer-based large language model. Initializing our transformer with the weights of a language model pre-trained only on text results in significantly higher quality listener responses than training a transformer from scratch. We show that our generated listener motion is fluent and reflective of language semantics through quantitative metrics and a qualitative user study. In our evaluation, we analyze the model's ability to utilize temporal and semantic aspects of spoken text. Project page: https://people.eecs.berkeley.edu/~evonne_ng/projects/text2listen/
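As a rough illustration of "treating the quantized atomic motion elements as additional language token inputs", the sketch below maps each motion frame to its nearest codebook entry and shifts the resulting indices past the text vocabulary so both kinds of token can share one sequence. It makes several assumptions: a real VQ-VAE also has a learned encoder and decoder, which are omitted here, and the ID-offset scheme, the names `nearest_code` and `motion_to_token_ids`, and all toy values are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence


def nearest_code(vector: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vector, codebook[i]))


def motion_to_token_ids(
    frames: Sequence[Sequence[float]],
    codebook: Sequence[Sequence[float]],
    text_vocab_size: int,
) -> List[int]:
    """Quantize each motion frame, then shift its index past the text vocabulary."""
    return [text_vocab_size + nearest_code(frame, codebook) for frame in frames]


# Hypothetical toy values: a 1000-word text vocabulary and a 3-entry codebook.
TEXT_VOCAB_SIZE = 1000
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]

speaker_text_ids = [17, 42]                       # already-tokenized speaker words
listener_frames = [[0.1, -0.1], [0.9, 0.2], [0.2, 0.8]]

# One mixed sequence a language model could be trained on autoregressively.
sequence = speaker_text_ids + motion_to_token_ids(
    listener_frames, CODEBOOK, TEXT_VOCAB_SIZE
)
```

Offsetting the motion indices keeps them from colliding with text token IDs, which in turn lets a text-pretrained model reuse its existing embedding rows and only add new ones for motion.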
