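arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering and systems.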

English abstract

Many moments in the real world do not wait for a user to ask. A fire starts on a security monitor, an expression flickers across a video call, or a product a viewer wants flashes by in a livestream. Yet today's large models remain mostly turn-based by design: they answer only when addressed, and even video-call apps that appear interactive still operate as question-answer systems, reacting only when polled or prompted. We argue for a different paradigm: a model that is present in the world like a person. It continuously watches what is happening now, decides on its own whether to speak or stay silent, interacts in real time, and delegates to a background model when the problem is hard. To advance interaction models and their adoption across domains, we make two fully open-sourced contributions. First, we release JoyAI-VL-Interaction, an 8B-scale, vision-first VL-interaction model. The model makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model, and it excels at vision-triggered responsiveness and time awareness. We pair it with a transferable training recipe, from which capabilities we never trained for emerge, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck. Second, we release a complete, deployable system built around that model. The system streams any ongoing video into the model, making it genuinely present in the world. All other components are pluggable, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent. Across six real-world scenarios, human raters prefer JoyAI-VL-Interaction over the in-app video-call assistants of Doubao and Gemini by a wide margin. To our knowledge, this is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system.

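2606.14883 2026-06-16 cs.CV cs.LG New submission

Understanding Cross-Modal Contributions in Continual Vision-Language Models: A Theoretical Perspective

Salimeh Sekeh, Mary Wisell

Affiliations: San Diego State University

AI summary: This paper analyzes cross-modal (vision-language) contributions in continual vision-language models from a theoretical standpoint. It proposes a new perspective, validates it empirically, and examines how task order and inter-task similarity affect the robustness of those contributions, reporting improved generalization.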

English abstract

Continual vision-language models are commonly addressed through sequential fine-tuning; however, although this paradigm enables adaptation to new environments (tasks), it inherently emphasizes the contribution of previously learned environments (tasks) at the expense of the stability required to preserve previously acquired knowledge. While existing approaches have adequately studied continual learning and catastrophic forgetting in vision-language models (VLMs), the theoretical understanding of modality-specific contributions across a sequence of environments remains largely unexplored. In this paper, we present a new theoretical perspective to understand the cross-modal (vision-language) contributions to consecutive environments. We empirically evaluate our theoretical findings on large VLMs and demonstrate their effectiveness in capturing environment-level cross-modal contributions. Our analysis provides deeper insights into continual VLMs, highlighting their contribution robustness to varying task orders and inter-task similarities, and their improved generalization performance.

2606.15160 2026-06-16 cs.CV cs.LG New submission

DLWM: Diverse Latent World Models for Efficient Multimodal Reasoning

David Huang, Lianlei Shan

Affiliations: University of Toronto; Tsinghua University

AI summary: Proposes DLWM, a framework that combines latent-space reasoning with reinforcement learning and uses diverse latent hypotheses plus a resource-aware policy to make multimodal reasoning more efficient, improving accuracy by 2-5 points while cutting memory usage by 24%.

Comments: Preprint. 9 pages main text, 15 pages total including appendix, 2 figures

English abstract

Reasoning capabilities of multimodal large language models (MLLMs) have improved considerably in recent years. Existing approaches typically rely on explicit chain-of-thought or continuous latent-space trajectories to enhance multi-step reasoning. However, these methods generally assume that an input admits a single latent interpretation and unfold reasoning along a fixed path or under a uniform computation budget. In real-world multimodal settings, visual observations are often subject to occlusion, blur, viewpoint variation, or semantic ambiguity, giving rise to multiple plausible interpretations. A uniform reasoning strategy not only limits the model's ability to explore multiple hypotheses but also incurs high memory usage and rollout cost. We present DLWM (Diverse Latent World Models), a multimodal reasoning framework that combines latent-space reasoning with reinforcement learning. First, we construct a set of diverse latent world hypotheses in continuous latent space, each capturing a different plausible interpretation of the visual input, and unfold latent reasoning independently on each hypothesis. An orthogonality-based diversity regularizer explicitly prevents hypothesis collapse. Second, we formulate the latent reasoning process as a resource-constrained sequential decision problem and introduce a resource-aware reinforcement learning policy that adaptively allocates computation across hypotheses, dynamically deciding whether to expand, terminate, or merge reasoning paths, thereby substantially reducing memory footprint and improving rollout efficiency. Experiments on multiple multimodal reasoning benchmarks demonstrate that DLWM outperforms existing methods by 2-5 points in accuracy while reducing memory usage by 24%.
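
As a rough illustration of what an orthogonality-based diversity regularizer can look like, the sketch below penalizes the squared cosine similarity between every pair of hypothesis vectors, so the penalty is zero when the hypotheses are mutually orthogonal and grows as they collapse onto one direction. The function name and the pairwise-cosine form are assumptions made for illustration; the abstract does not state the exact loss used in DLWM.

```python
import math


def diversity_penalty(hypotheses):
    """Sum of squared cosine similarities over all distinct pairs.

    `hypotheses` is a list of non-zero latent vectors (lists of floats).
    This is an illustrative stand-in, not the loss from the paper.
    """
    # Normalize each hypothesis to unit length so only direction matters.
    unit = []
    for h in hypotheses:
        norm = math.sqrt(sum(x * x for x in h))
        unit.append([x / norm for x in h])

    penalty = 0.0
    for i in range(len(unit)):
        for j in range(i + 1, len(unit)):
            cos = sum(a * b for a, b in zip(unit[i], unit[j]))
            penalty += cos * cos
    return penalty


orthogonal = [[1.0, 0.0], [0.0, 1.0]]  # two distinct interpretations
collapsed = [[1.0, 0.0], [2.0, 0.0]]   # same direction, i.e. collapse

print(diversity_penalty(orthogonal))
print(diversity_penalty(collapsed))
```

Adding such a term to the training loss would push hypotheses apart; how it is weighted against the task loss is a design choice the abstract leaves open.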

2606.15651 2026-06-16 cs.CV New submission

Self-Questioning Vision-Language Models: Reinforcement Learning for Compositional Visual Reasoning

Saraswathy Amjith

Affiliations: Massachusetts Institute of Technology

AI summary: Proposes a self-questioning framework that uses GRPO reinforcement learning to train a VLM to decompose questions and answer the resulting sub-questions, improving compositional visual reasoning; evaluated on CLEVR and A-OKVQA.

English abstract

Vision-Language Models (VLMs) are AI systems that process both images and text, yet they often struggle with compositional visual reasoning questions that require chaining multiple steps together, such as identifying objects, counting them, and comparing the results. Existing approaches improve this reasoning by training models on human-written step-by-step explanations, but creating these annotations is expensive and difficult to scale. We propose a self-questioning framework that trains a VLM to break visual questions into smaller sub-questions and answer each one before producing a final response, using a reinforcement learning algorithm called Group Relative Policy Optimization (GRPO). The model is never shown examples of how to decompose questions; it discovers this behavior on its own, guided by a reward signal that scores whether the output contains sub-questions and whether the final answer is correct. We apply this framework to a 3-billion-parameter model, training on both synthetic scenes of geometric shapes (CLEVR) and real-world photographs (A-OKVQA). On A-OKVQA, both self-questioning and standard reinforcement learning substantially improve accuracy over the untrained model (52.2% and 51.6% vs. 46.8%). We introduce the first self-questioning VLM by rewarding not only the final answer, as standard RL does, but also the generation of intermediate sub-questions, enabling the model to discover compositional decomposition strategies. These results suggest that teaching AI systems to ask themselves intermediate questions is a promising strategy for complex visual reasoning, particularly when the difficulty of a question warrants explicit step-by-step decomposition.
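
The reward described in this abstract has two parts: one for producing sub-questions and one for a correct final answer. The sketch below shows one way such a reward could be scored and then turned into group-relative advantages, the normalization step that GRPO applies within each group of sampled outputs. The output format (`Q1:` lines and a final `Answer:` line), the function names, and the 0.5/1.0 weights are hypothetical choices for illustration; the abstract does not specify them.

```python
import re
import statistics


def self_question_reward(output, gold_answer, format_weight=0.5):
    """Score one sampled output (hypothetical format and weights)."""
    # Format term: does the output contain at least one "Q:" / "Q1:" line?
    has_subquestions = bool(re.search(r"^Q\d*:", output, flags=re.MULTILINE))
    # Correctness term: does the "Answer:" line match the gold answer?
    match = re.search(r"^Answer:\s*(.+)$", output, flags=re.MULTILINE)
    correct = (
        match is not None
        and match.group(1).strip().lower() == gold_answer.lower()
    )
    return format_weight * has_subquestions + 1.0 * correct


def group_relative_advantages(rewards):
    """Normalize rewards within one group: (r - mean) / std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All samples scored the same, so none is preferred.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


samples = [
    "Q1: How many cubes are there?\nA1: 2\nAnswer: 2",
    "Answer: 2",
    "Q1: How many cubes are there?\nA1: 3\nAnswer: 3",
    "Answer: 3",
]
rewards = [self_question_reward(s, gold_answer="2") for s in samples]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

Because advantages are computed relative to the group, an output that both decomposes the question and answers correctly is pushed up relative to one that only answers correctly, which is the behavior the abstract attributes to the sub-question reward.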

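2606.15663 2026-06-16 cs.CV New submission

OneFocus: Enabling Real-World X-ray Security Screening with a Unified Vision-Language Model

Jiali Wen, Hongxia Gao, Litao Li, Yixin Chen, Kaijie Zhang, Qianyun Liu, Xiaoqin Wen

AI summary: To address poor adaptation to new contraband types and limited visual understanding in X-ray contraband detection, the paper introduces the MMXray dataset and OneFocus, a unified vision-language model supporting four core tasks (question answering, localization, classification, and image understanding), and reports state-of-the-art performance.

Comments: 17 pages, 10 figures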

English abstract

X-ray contraband detection is critical for security in large-scale logistics and transportation, yet conventional detectors struggle to adapt to emerging contraband types and lack fundamental visual understanding. Vision-language models (VLMs) offer strong generalization but are hindered by the scarcity of high-quality X-ray image-caption data. To bridge this critical gap, we present MMXray, a meticulously curated benchmark of 52,124 image-caption pairs spanning 28 fine-grained classes of X-ray contraband. To enrich MMXray with realistic occlusion patterns, we further introduce CleanDET, a dedicated synthesis dataset containing clean foreground contraband images from 28 categories and background images with diverse density levels, together with AnyContraSyn, a controllable synthesis method designed to operate on CleanDET. We also develop OnePipe, an extensible pipeline for systematic data curation. Built on MMXray, we propose OneFocus, a unified VLM that supports four core tasks: visual question answering, contraband localization, classification, and image understanding. OneFocus achieves state-of-the-art performance in X-ray contraband understanding and demonstrates robust cross-domain generalization, establishing a strong vision-language baseline for security screening.

2606.15765 2026-06-16 cs.CV New submission

Task-Instructed Causal Routing of Vision Foundation Models for Multi-Task Learning

Donghyun Han, Yuseok Bae, Jung Uk Kim, Hyung-Il Kim

Affiliations: Electronics and Telecommunications Research Institute (ETRI); Kyung Hee University; Chonnam National University

AI summary: Proposes TIGER, which uses natural-language task instructions to guide a routing network and combines this with counterfactual causal alignment to coordinate multiple heterogeneous vision foundation models for multi-task dense prediction, outperforming recent baselines on NYUD-v2 and Pascal Context.

Comments: 17 pages, 6 figures

English abstract

Vision foundation models (VFMs) have demonstrated strong robustness and transferability across a wide range of visual tasks. However, each model typically encodes strong inductive biases shaped by its pre-training objective and data domain, resulting in fragmented yet complementary visual knowledge. As a result, a single model often struggles to capture the diverse visual representations required across multiple dense prediction tasks. To address this limitation, we propose TIGER (Task-Instruction-Guided Expert Routing), a framework that coordinates multiple heterogeneous VFMs for multi-task dense prediction. Instead of naively aggregating expert features, TIGER leverages natural-language task instructions to guide a routing network that assigns token-level expert weights conditioned on task semantics, enabling adaptive integration of complementary expert features. TIGER further introduces a counterfactual loss that aligns routing decisions with each expert's causal contribution by measuring prediction changes when experts are excluded, encouraging more reliable and interpretable routing. We evaluate TIGER on two multi-task dense prediction benchmarks, NYUD-v2 and Pascal Context, where it consistently outperforms recent multi-task learning baselines while keeping all VFMs frozen. These results demonstrate that combining instruction-guided expert routing with counterfactual causal alignment enables effective coordination of heterogeneous vision foundation models.

2606.15920 2026-06-16 cs.CV New submission

OmniOPSD: Rationale-Privileged On-Policy Self-Distillation for Affective Computing

Zebang Cheng, Shuimu Chen, Boxue Yang, Yuanshen Guan, Jingyi Chen, Zheng Lian, Xiaojiang Peng, Fei Ma, LaiZhong Cui, Qi Tian

Affiliations: Shenzhen University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Tsinghua University; Shanghai Jiao Tong University; University of Science and Technology of China; Shenzhen Technology University; Tongji University; Huawei

AI summary: To address reward sparsity when training multimodal large models on complex reasoning tasks, the paper proposes OmniOPSD, which uses frontier-model-generated rationales as privileged evidence for the teacher rather than as imitation targets for the student, and provides dense token-level supervision through on-policy self-distillation. It reports a state-of-the-art average score of 84.19 on MER-UniBench.

English abstract

Reinforcement learning for multimodal large language models (MLLMs) is often hindered by severe reward sparsity in complex reasoning tasks. This challenge is particularly pronounced in human-centered scenarios involving states, emotions, intentions, and behaviors, where heterogeneous multimodal signals and subjective human factors make high-quality chain-of-thought (CoT) annotations expensive and difficult to obtain. Although many multimodal datasets provide expert-annotated ground-truth labels, directly using these labels for supervised fine-tuning may encourage shortcut learning in multimodal perception and provides limited transparency for safety-critical human-AI interaction. To address these limitations, we propose OmniOPSD, a Rationale-Privileged On-Policy Self-Distillation framework that uses frontier-generated rationales as teacher-side privileged evidence rather than student imitation targets. OmniOPSD uses frontier-generated evidence-aware rationales only as training-time privileged evidence context for a local teacher. The student samples its own rollout from the original multimodal input, while the rationale-privileged teacher scores the same tokens and provides dense token-level supervision. Thus, the student learns on its own trajectory distribution without directly imitating frontier-model completions, and inference requires no labels, rationales, CoT annotations, or closed-source model access. Experiments on MER-UniBench show that OmniOPSD achieves state-of-the-art performance with an average score of 84.19, and ablations further support the value of rationale-privileged teacher guidance.

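2606.15982 2026-06-16 cs.CV New submission

Mind the Gap: Diagnosing Constraint Discovery Failures in Text-in-Image Editing

Rui Gui

AI summary: Using a controlled diagnostic built on text-in-image editing, the paper studies how multimodal models fail to discover unstated visual dependency constraints. Models reach only 46% recall when left to discover constraints themselves, versus 94% when the constraints are provided explicitly.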

English abstract

A key challenge in multimodal reasoning is determining which visual dependencies become relevant under a specific task, rather than merely recognizing visible content. We study this through edit-induced constraint discovery in text-in-image editing, a controlled diagnostic setting where a local text change can activate secondary consistency constraints: given a valid editing instruction and an image, can a model identify the secondary regions that must also change? Across 461 diagnostic cases, four MLLMs, and 19 constraint subtypes, models recover only 46% case-level macro recall under unguided prompting versus 94% when constraints are explicitly provided, suggesting that a substantial portion of the failure arises when models must decide which unstated dependencies to surface. Oracle-field decomposition shows that case-specific causal explanations are the most effective partial guidance (0.782 recall), above region names (0.610) or type labels (0.646), suggesting that edit-specific causal cues account for much of the oracle gain. A downstream experiment further shows that higher self-discovery recall does not necessarily improve task performance: unverified self-discovery introduces false positives that offset recall gains, motivating precision-aware constraint elicitation.
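
For readers unfamiliar with the metric, the sketch below shows one plausible reading of "case-level macro recall": compute recall separately for each diagnostic case, then average across cases so that every case counts equally regardless of how many constraints it has. The data layout, the region names, and the decision to skip cases without gold constraints are assumptions for illustration, not details taken from the paper.

```python
def case_level_macro_recall(cases):
    """Average of per-case recall.

    `cases` is a list of (gold_regions, predicted_regions) pairs of sets.
    """
    recalls = []
    for gold, predicted in cases:
        if not gold:
            # Assumption: a case with no gold constraints has no defined recall.
            continue
        recalls.append(len(gold & predicted) / len(gold))
    return sum(recalls) / len(recalls)


cases = [
    # The model finds one of the two secondary regions that must change.
    ({"unit_price", "total"}, {"unit_price"}),
    # The model finds the one required region but adds a false positive.
    ({"expiry_date"}, {"expiry_date", "logo"}),
]
print(case_level_macro_recall(cases))
```

The second case has perfect recall despite the spurious `logo` prediction, which is why the abstract's closing point about false positives requires looking at precision as well as recall.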

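2606.16067 2026-06-16 cs.CV New submission

Stepwise Token Selection for Efficient Multimodal Large Language Models

Landi He, Shawn Young, Lijian Xu

Affiliations: Shenzhen University of Advanced Technology

AI summary: Proposes a pointer-based stepwise visual token selection method that is trained end to end through a differentiable relaxation and decides dynamically how many tokens to keep. With 88.9% of visual tokens removed, it retains 94.6% of the original accuracy and speeds up prefill by 1.88x.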

English abstract

In multimodal large language models (MLLMs), inference cost is largely dominated by the visual token prefix rather than the language backbone, making token reduction a key factor for improving efficiency. Existing approaches typically assign independent importance scores to visual tokens and retain a fixed number of top-ranked tokens, implicitly assuming token independence and a uniform compression ratio across inputs. In this work, we reformulate visual token pruning as a sequential decision-making process. Specifically, we introduce a pointer-style selection mechanism that iteratively chooses informative tokens, conditioning each decision on previously selected ones, and dynamically determines when to stop via a learned termination action. This enables joint optimization of both the selected subset and its size. To enable end-to-end training under standard language modeling objectives, we design a differentiable relaxation based on a variance-preserving noise interpolation scheme, allowing gradients to propagate through the discrete selection process. Extensive experiments on LLaVA-v1.5-7B and Qwen2.5-VL-7B demonstrate that our approach consistently outperforms fixed-ratio baselines across different compression levels. Under aggressive pruning that removes 88.9% of visual tokens, our method preserves 94.6% of the original accuracy while achieving a 1.88x speed-up in prefill latency.

2606.16092 2026-06-16 cs.CV cs.AI New submission

VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

Young Rok Jang, Hyesoo Kong, Kyunghwan An, Jae Sub Huh, Gyeonghun Kim, Stanley Jungkyu Choi

Affiliations: LG AI Research

AI summary: Introduces the VinQA dataset and studies two encoding methods (Page Encoding and Modality Encoding) for generating long-form answers interleaved with cited visual elements. Evaluated with the M-GroSE framework, fine-tuned Qwen2.5-VL models substantially narrow the gap to proprietary models.

Comments: Accepted to CVPR 2026. Main paper: 5 figures, 4 tables; includes supplementary material

English abstract

Real-world documents combine text with tables, charts, photographs, and diagrams arranged in diverse layouts, yet existing research on multimodal large language models (MLLMs) for document QA predominantly produces text-only responses, underutilizing these visual elements. We introduce VinQA, a dataset for long-form answer generation where cited visual elements are explicitly interleaved with their supporting text and grounded in relevant document pages. To support this task, we study two encoding methods for feeding raw document page images into an MLLM, along with their visual-element citation mechanisms: (1) Page Encoding, which directly encodes full-page images with bounding boxes of visual elements and treats these boxed regions as citable units; and (2) Modality Encoding, which parses each page to extract text and crop visual elements, encodes them separately, and uses these cropped elements as citable units. In our experiments, we propose M-GroSE, a multimodal evaluation framework extending GroUSE to assess answers along four dimensions: completeness, answer relevancy, faithfulness, and unanswerability. We additionally report Visual Source F1 to directly measure visual citation accuracy. Although proprietary frontier models still achieve the best overall scores on the VinQA test split, fine-tuning open Qwen2.5-VL models on the training split substantially improves their performance and narrows this gap. Modality Encoding is initially more robust for complex documents with long text, many visual elements, and diverse citation requirements. After training on VinQA, however, Page Encoding reaches a comparable level, competing effectively even without the explicit parsing used in Modality Encoding. Finally, Visual G-Eval, an MLLM-based judge, confirms that fine-tuned models insert visual elements at semantically appropriate positions with faithful supporting text.

2606.16158 2026-06-16 cs.CV cs.CL New submission

Focus When Necessary: Adaptive Routing and Collaborative Grounding for Training-Free Visual Grounding

Yifan Wang, Peiming Li, Shiyu Li, Zhiyuan Hu, Xiaochen Yang, Wenming Yang, Yang Tang, Zheng Wei

Affiliations: East China University of Science and Technology; Tsinghua University; Shanghai Artificial Intelligence Laboratory; University of Science and Technology of China

AI summary: Proposes LazyMCoT, a dynamic framework whose adaptive routing estimates uncertainty, skips extra processing for easy queries, and sends hard samples to a collaborative grounding module for two-stage refinement, improving reasoning accuracy while lowering average inference latency.

English abstract

While Multimodal Large Language Models (MLLMs) excel in cross-modal reasoning, they often struggle to perceive fine-grained details in complex high-resolution images. Recent training-free methods address this through image scaling and localized cropping. However, applying these manipulations indiscriminately introduces computational redundancy for simple queries and can degrade accuracy by truncating essential global context or introducing irrelevant background noise. To this end, we propose LazyMCoT, a dynamic and training-free framework that adaptively allocates visual grounding efforts based on sample difficulty. The framework features an Adaptive Routing mechanism that evaluates predictive uncertainty using first-token statistics from a single forward pass. This efficiently bypasses confident cases while ensuring the recall of difficult samples via conformal calibration. For these challenging cases, a Collaborative Grounding module integrates the inherent cross-modal attention of the model with an external visual expert through a two-stage refinement process. This refinement process generates a precise localized display to recover small or occluded targets. Extensive experiments across diverse benchmarks demonstrate that LazyMCoT rivals training-based approaches by simultaneously improving reasoning accuracy and reducing average inference latency. Our code is available at https://github.com/TencentBAC/LazyMCoT.
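
To make the routing idea concrete, the sketch below uses the entropy of the first-token distribution as the uncertainty score and picks a routing threshold from a calibration set of known-hard samples, in the style of split conformal prediction. Choosing the k-th smallest calibration score with k = floor(alpha * (n + 1)) means a new, exchangeable hard sample falls below the threshold with probability at most alpha. Entropy as the statistic, the function names, and the calibration scores are all assumptions for illustration; the abstract does not say which first-token statistics or calibration procedure LazyMCoT uses.

```python
import math


def first_token_entropy(probs):
    """Shannon entropy (in nats) of the first-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def calibrate_threshold(hard_scores, alpha):
    """Pick a threshold so hard samples are routed with probability >= 1 - alpha.

    `hard_scores` are uncertainty scores of calibration samples known to
    need grounding; `alpha` in (0, 1) is the tolerated miss rate.
    """
    n = len(hard_scores)
    k = math.floor(alpha * (n + 1))
    if k < 1:
        # Too few calibration samples for this alpha: route everything.
        return float("-inf")
    return sorted(hard_scores)[k - 1]


def needs_grounding(probs, threshold):
    """Route to the expensive grounding path when uncertainty is high."""
    return first_token_entropy(probs) >= threshold


# Hypothetical calibration scores from hard samples.
hard_scores = [0.2, 0.9, 0.5, 0.7, 1.1, 0.6, 0.8]
threshold = calibrate_threshold(hard_scores, alpha=0.25)

print(threshold)
print(needs_grounding([0.5, 0.5], threshold))    # uncertain first token
print(needs_grounding([0.99, 0.01], threshold))  # confident first token
```

Lowering `alpha` moves the threshold down, which routes more queries to the grounding path and trades latency for recall of hard cases.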

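2606.16193 2026-06-16 cs.CV cs.AI cs.LG New submission

LOCUS: Local Visual Cue Search Enhances Fine-Grained Perception in Multimodal Large Language Models

Zhou Tao, Fang Zhang, Zewen Ding, Shida Wang, Xiaokun Sun, YongXiang Hua, Haoyu Cao, Linli Xu

Affiliations: University of Science and Technology of China; State Key Laboratory of Cognitive Intelligence

AI summary: Proposes LOCUS, a training framework that uses a verifiable local cue search proxy task to make MLLMs internalize fine-grained evidence selection, improving localization-sensitive visual understanding without changing the inference interface.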

English abstract

Multimodal Large Language Models (MLLMs) remain unreliable on fine-grained visual perception, even when high-resolution inputs preserve the necessary local details. We identify this limitation as visual context rot: decisive evidence may exist in the full image, yet fail to be reliably selected and used amid redundant visual context. We propose LOCUS (LOcal visual CUe Search), a training framework that teaches MLLMs to internalize local evidence search through a verifiable proxy task. During training, LOCUS provides a local crop as a visual cue and optimizes the model to recover its spatial support in the full image using an IoU-based reward. The visual cue is used only during training, leaving the standard image-question inference interface unchanged. Experiments across fine-grained perception, hallucination, general understanding, and reasoning benchmarks show that LOCUS improves localization-sensitive visual understanding while preserving broad capabilities. Attention analyses further indicate stronger focus on task-relevant evidence regions, suggesting that training-time visual cue search provides an effective route to internalized fine-grained evidence selection.
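
The IoU-based reward mentioned above can be sketched as the intersection-over-union between the box the model predicts for the cue and the box the crop was actually taken from. The `(x1, y1, x2, y2)` box format and the use of raw IoU as the reward value are assumptions for illustration; the abstract does not say whether LOCUS thresholds or otherwise shapes the score.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap; zero when the boxes do not intersect.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


crop_source = (0.0, 0.0, 2.0, 2.0)  # where the training crop came from
predicted = (1.0, 1.0, 3.0, 3.0)    # where the model says the cue is

print(iou(crop_source, predicted))
```

Because the crop's true location is known by construction, the reward can be checked automatically, which is what makes the proxy task verifiable.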

2606.16601 2026-06-16 cs.CV New submission

DifferAD-R1: A Difference-Guided Industrial Anomaly Localization with Multimodal Large Language Models

Dingrong Wang, Xian Tao, Zhen Qu, Hengliang Luo, Xinyi Gong, Fei Shen, Zhengtao Zhang, Guiguang Ding

Affiliations: Institute of Automation, Chinese Academy of Sciences (CAS); School of Artificial Intelligence, University of Chinese Academy of Sciences; CASI Vision Technology Co., Ltd.; Shandong Laboratory of Aluminum Advanced Manufacturing in Binzhou (SLAAMB), Binzhou Institute of Technology, Weiqiao-UCAS Science and Technology Park; Space Information Research Institute, Hangzhou Dianzi University; School of Software, Tsinghua University

AI summary: Proposes DifferAD-R1, which uses a difference-guided dual-image paradigm to recast anomaly localization as a one-shot difference grounding problem, and adds a dual-consistency localization reward and a difficulty-aware strategy; it outperforms existing methods on the AD-DualDiff dataset.

Comments: Submitted to IEEE Transactions on Circuits and Systems for Video Technology

English abstract

Industrial anomaly localization aims to accurately identify and localize abnormal regions in industrial products, addressing the critical challenge of detecting unseen defect categories in real-world scenarios. Traditional closed-set methods often suffer from poor cross-scenario generalization, while existing Multimodal Large Language Model (MLLM)-based approaches face two core limitations: they either adopt QA-style paradigms misaligned with the practical demands of localization, or rely on standard optimization techniques such as Group Relative Policy Optimization (GRPO), which fails to deliver effective learning signals for subtle defects. To tackle these issues, this paper proposes DifferAD-R1, an MLLM-augmented reinforcement learning framework tailored for industrial anomaly localization. We design a Difference-Guided dual-image paradigm, which reformulates the localization task as a one-shot difference grounding problem to effectively explore cross-scenario anomalies. A Dual-Consistency Localization Reward is developed for hard-to-detect anomalies, enhancing optimization stability and robustness. Additionally, we integrate a difficulty-aware strategy with adaptive reweighting and group-wise resampling to prioritize learning on challenging instances. To facilitate evaluations in real-world industrial settings, we construct the AD-DualDiff dataset, comprising 13K paired images across 20 categories. Experimental results demonstrate that DifferAD-R1 significantly outperforms existing baselines and achieves competitive performance compared to large-scale models like Qwen3-VL (235B parameters). Our code is publicly available at: https://github.com/Rong2026/work-1.

2606.16615 2026-06-16 cs.CV 新提交

SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

SUP-MCRL：面向EEG视觉解码的感知主体统一伪特征编码多模态对比表示学习

Shengyu Gong, Weiming Zeng, Yueyang Li, Zijian Kang, Hongjie Yan, Wai Ting Siok, Nizhuan Wang

发表机构 * Lab of Digital Image and Intelligent Computation, Shanghai Maritime University（上海海事大学数字图像与智能计算实验室）； Department of Language Science and Technology, The Hong Kong Polytechnic University（香港理工大学语言科学与技术系）； Affiliated Lianyungang Hospital of Xuzhou Medical University（徐州医科大学附属连云港医院）

AI总结提出SUP-MCRL框架，通过语义感知视觉编码器、统一EEG增强器和原型渐进增强器，解决多模态对比学习中语义一致性和主体选择性问题，在THINGS-EEG零样本任务上达到66.0%/91.9%的Top-1/Top-5准确率。

AI中文摘要

非侵入式脑机接口在泛化到自然视觉体验时，神经视觉解码面临严重的保真度退化。传统的多模态对比表示学习仅优化几何距离对齐，忽略了语义一致性和主体选择性，导致虚假的零样本对齐。我们提出SUP-MCRL，一个统一框架，集成了三种协作机制：(1) 语义实体感知视觉编码器(SAVE)，学习空间注意力以提取语义内容，无需预训练的显著性模型；(2) 统一EEG增强器(UEE)，采用多尺度空洞卷积和频带间注意力实现自适应跨主体鲁棒性；(3) 基于原型的渐进增强器(PPA)，维护一个EMA更新的伪特征池以防止表示崩溃。在THINGS-EEG上的零样本实验实现了66.0%/91.9%（Top-1/Top-5）的个体内准确率和24.0%/52.9%的LOSO准确率，超越了现有最先进方法。代码可在https://github.com/NZWANG/SUP-MCRL获取。

英文摘要

Non-invasive brain-computer interfaces suffer severe fidelity degradation in neural visual decoding when generalizing to natural visual experiences. Conventional multimodal contrastive representation learning solely optimizes geometric distance alignment, neglecting semantic consistency and subject selectivity, causing spurious zero-shot alignment. We propose SUP-MCRL, a unified framework integrating three collaborative mechanisms: (1) Semantic-entity Aware Visual Encoder (SAVE), learning spatial attention to extract semantic content without pre-trained saliency models; (2) Unified EEG Enhancer (UEE), employing multi-scale atrous convolutions and inter-band attention for adaptive cross-subject robustness; and (3) Prototype-based Progressive Augmenter (PPA), maintaining an EMA-updated pseudo-feature pool to prevent representation collapse. Zero-shot experiments on THINGS-EEG achieve 66.0%/91.9% (Top-1/Top-5) intra-subject and 24.0%/52.9% LOSO accuracy, surpassing state-of-the-art methods. Code is available at https://github.com/NZWANG/SUP-MCRL.
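The "EMA-updated pseudo-feature pool" mentioned for PPA can be pictured as a per-class prototype that drifts slowly toward new features. The following is a minimal sketch under assumptions: the momentum value, the dictionary layout, and the label keys are illustrative choices, not details taken from the SUP-MCRL repository.

```python
def ema_update(pool, label, feature, momentum=0.9):
    """Blend a new feature vector into the stored prototype for `label`."""
    if label not in pool:
        pool[label] = list(feature)  # first sighting: store the feature as-is
    else:
        pool[label] = [momentum * p + (1.0 - momentum) * f
                       for p, f in zip(pool[label], feature)]
    return pool[label]

pool = {}
ema_update(pool, "dog", [1.0, 0.0])
ema_update(pool, "dog", [0.0, 1.0])
print(pool["dog"])  # prototype moved only slightly toward the second feature
```

A high momentum keeps the pool stable across batches, which is the usual motivation for using an exponential moving average here.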


2606.16667 2026-06-16 cs.CV 新提交

Look Again Before You Abstain: Budgeted Conformal Evidence Acquisition for Reliable Vision-Language Model

在放弃之前再看一眼：预算约束下的共形证据获取用于可靠的视觉-语言模型

Jian Xu, Delu Zeng, John Paisley, Qibin Zhao

发表机构 * South China University of Technology（华南理工大学）； RIKEN Center for Advanced Intelligence Project（RIKEN先进智能研究中心）； Columbia University（哥伦比亚大学）

AI总结针对视觉-语言模型幻觉问题，提出预算约束共形证据获取（BCEA）方法，通过三级决策（回答、放弃或获取额外视觉证据）在有限计算预算下控制幻觉率，并恢复有限样本保证。

AI中文摘要

大型视觉-语言模型（LVLMs）会产生幻觉：它们断言图像不支持的视觉细节。一个原则性的解决方案是使用无分布保证的选择性预测——验证每个声明，当声明没有依据时放弃，从而使断言声明中的幻觉率有可证明的界限。然而，我们表明，这个保证是以残酷的代价换来的：为了在平衡的对象存在基准上将幻觉率保持在5%以下，最先进的共形过滤器必须在超过80%的声明上放弃。我们认为，当更多视觉证据可以廉价获取时，放弃是浪费的，并引入了预算约束共形证据获取（BCEA），它将二元回答/放弃决策替换为三向选择：回答、放弃或在有限计算预算下通过重新检查图像（缩放、裁剪或应用特定声明的干预）获取额外视觉证据。我们有两个观察。首先，天真地将获取插入到校准的过滤器中会破坏统计保证——实际风险超过目标多达17个百分点——因为获取步骤破坏了共形校准所依赖的可交换性。其次，将整个获取策略折叠到得分函数中，并在获取后得分上重新校准，恢复了有限样本保证，同时仍然恢复覆盖。BCEA进一步使用结构化的、声明类型特定的干预。在POPE基准和COCO构建的存在性和空间关系声明上，针对四个开源VLM，BCEA将幻觉率控制在目标水平，并持续提高覆盖，优于保证放弃的基线。

英文摘要

Large vision-language models (LVLMs) hallucinate: they assert visual details that the image does not support. A principled remedy is selective prediction with a distribution-free guarantee: verify each claim and abstain when the claim is not grounded, so that the hallucination rate among asserted claims is provably bounded. We show, however, that this guarantee is bought at a brutal price: to keep the hallucination rate below 5% on a balanced object-existence benchmark, a state-of-the-art conformal filter must abstain on more than 80% of claims. We argue that abstention is wasteful when more visual evidence is cheaply available, and introduce Budgeted Conformal Evidence Acquisition (BCEA), which replaces the binary answer/abstain decision with a three-way choice: answer, abstain, or acquire additional visual evidence by re-examining the image (zooming, cropping, or applying a claim-specific intervention) under a bounded compute budget. We make two observations. First, acquisition that is plugged naively into a calibrated filter breaks the statistical guarantee (realized risk overshoots the target by up to 17 points) because the acquisition step destroys the exchangeability that conformal calibration relies on. Second, folding the entire acquisition policy into the score function and re-calibrating on post-acquisition scores restores the finite-sample guarantee while still recovering coverage. BCEA further uses structured, claim-type-specific interventions. Across the POPE benchmark and COCO-constructed existence and spatial-relation claims, on four open VLMs, BCEA controls the hallucination rate at the target level and consistently improves coverage over a guaranteed-abstention baseline.
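To make the three-way choice and the "calibrate on post-acquisition scores" point concrete, here is a simplified sketch of threshold calibration for selective answering. It is an assumption-laden illustration: the `+1` correction is a crude pessimistic adjustment rather than the paper's conformal procedure, it carries no formal guarantee, and the function names and data are hypothetical.

```python
def calibrate_threshold(cal, alpha):
    """Smallest threshold whose corrected error rate among accepted claims is <= alpha.

    cal: list of (score, is_hallucination) pairs, with scores computed AFTER the
    acquisition policy has run, so calibration sees the same scores as deployment.
    Returns None when no threshold qualifies (abstain on everything).
    """
    best = None
    for t in sorted({s for s, _ in cal}):
        accepted = [bad for s, bad in cal if s >= t]
        risk = (sum(accepted) + 1) / (len(accepted) + 1)  # pessimistic correction
        if risk <= alpha and (best is None or t < best):
            best = t
    return best

def decide(score, threshold, budget_left):
    """Three-way choice: answer, acquire more visual evidence, or abstain."""
    if threshold is not None and score >= threshold:
        return "answer"
    return "acquire" if budget_left > 0 else "abstain"

# Hypothetical calibration set: 20 grounded claims, 2 hallucinated ones.
cal = [(0.80 + 0.01 * i, False) for i in range(20)] + [(0.5, True), (0.4, True)]
t = calibrate_threshold(cal, alpha=0.1)
print(t, decide(0.95, t, 2), decide(0.3, t, 2), decide(0.3, t, 0))
```

The design point mirrored from the abstract is only that the scores fed to calibration should already include whatever re-examination the deployed policy performs.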


2606.16783 2026-06-16 cs.CV cs.AI cs.LG 新提交

Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations

Gen-VCoT: 基于扩散的RGB中间表示的生成式视觉思维链推理

Zhiqiang Zhou, Junliang Dai, Xu Ling

发表机构 * Hunan Chemical Industry Vocational and Technical College（湖南化工职业技术学院）

AI总结提出Gen-VCoT框架，利用专家视觉模型生成RGB图像作为推理中间步骤，通过自适应路由器选择推理深度，在空间和深度问题上分别提升25%和50%，但简单事实查询性能下降，表明最优表示依赖于任务。

Comments 12 pages, 5 figures

2606.14786 2026-06-16 cs.MM cs.AI cs.CV 交叉投稿

MatchLM2Lite: A Scalable MLLM-to-Lite Framework for Reproduced Content Identification

MatchLM2Lite: 一种可扩展的MLLM-to-Lite框架用于重复内容识别

Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang, Zirui Zhu, Kanchan Sarkar, Kun Xu

发表机构 * TikTok（字节跳动）； National University of Singapore School of Computing（新加坡国立大学计算机学院）

AI总结提出MatchLM2Lite框架，通过将多模态大语言模型蒸馏为轻量模型，实现视频、音频和文本联合建模的实时重复内容识别，在降低35倍计算成本的同时保持高准确率，并成功部署于大规模生产环境。

DOI: 10.1145/3770855.3818444

AI中文摘要

内容审核对于在线视频平台确保内容安全、保护创作者和维持积极的用户体验至关重要。除了过滤有害内容，平台必须大规模保证内容真实性，以便用户接触到多样化、原创的视频，而非低价值的重复内容。我们提出MatchLM2Lite，一个实时、生产级的重复内容识别（RCI）系统，它利用多模态大语言模型（MLLM）的强大理解能力，将其蒸馏为一个小型且推理速度快的模型。我们的系统联合建模视频、音频和文本信号，对视频对进行操作以生成细粒度的重复分数。该系统包含两个模块，MatchLM和MatchLite，以及一个两阶段训练方案。首先，我们高容量的MLLM，MatchLM，作为教师模型定义RCI性能的上限。然后，其能力被蒸馏到一个紧凑的学生模型MatchLite中。这种设计使MatchLite能够在视频对上实现低延迟、高吞吐量的推理，同时保留MatchLM的大部分准确性，使其适合集成到实时推荐系统中。MatchLM相比我们之前的生产模型F1分数提高了+8.57。经过知识蒸馏后，MatchLite保留了+6.55的F1分数提升，同时计算成本降低了35倍。大规模部署后，MatchLM2Lite实现了高效的成对多模态RCI，以高每秒查询数（QPS）稳定服务在线流量，端到端延迟低于30秒。该系统在不降低用户参与度的情况下，将我们平台上的重复视频观看率降低了2.5%，证明了其在大规模生产环境中的有效性。

英文摘要

Content moderation is critical for online video platforms to ensure content safety, protect creators, and sustain positive user experiences. Beyond filtering harmful content, platforms must guarantee content authenticity at scale so that users are exposed to diverse, original videos rather than low-value reproductions. We present MatchLM2Lite, a real-time, production-grade reproduced content identification (RCI) system that leverages the powerful understanding of a multimodal large language model (MLLM) distilled into a small and fast-inference model. Our system jointly models video, audio, and text signals, operating on pairs of videos to produce fine-grained reproduction scores. The system comprises two modules, MatchLM and MatchLite, and a two-stage training recipe. First, our high-capacity MLLM, MatchLM, serves as a teacher model to define the upper bound of RCI performance. Its capabilities are then distilled into a compact student model, MatchLite. This design allows MatchLite to deliver low-latency, high-throughput inference on video pairs while preserving much of MatchLM's accuracy, making it suitable for integration into real-time recommendation systems. MatchLM achieves an F1-score improvement of +8.57 compared to our previous production model. After knowledge distillation, MatchLite retains a +6.55 gain in F1-score while reducing computational cost by 35x. Deployed at scale, MatchLM2Lite enables efficient, pairwise multimodal RCI, stably serving online traffic at high queries per second (QPS) with an end-to-end latency below 30 seconds. This system has reduced the reproduced video view rate on our platform by 2.5% without degrading user engagement, demonstrating its effectiveness in a large-scale production environment.
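The teacher-to-student step can be sketched with the classic soft-label distillation loss. The abstract does not say which loss MatchLM2Lite uses, so the temperature-scaled KL divergence below is an assumption chosen for illustration, and the logits are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

teacher = [4.0, 1.0, 0.0]  # hypothetical teacher logits for one video pair
print(distillation_loss([4.0, 1.0, 0.0], teacher))  # identical logits give zero loss
print(distillation_loss([0.0, 1.0, 4.0], teacher))  # disagreement gives a positive loss
```

In practice such a term is usually combined with a supervised loss on ground-truth labels; how the two are weighted is a tuning choice not covered here.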


2606.15427 2026-06-16 cs.LG cs.AI cs.CV 交叉投稿

Post-Launch Capability Expansion of Vision-Language Models via Prompting for On-Orbit Spacecraft Inspection

通过提示实现视觉语言模型发射后能力扩展用于在轨航天器检测

Nicholas A. Welsh, Lennon J. Shikhman, Monty Nehru Attazs, Seemanthini K. Putane, Van Minh Nguyen, Ryan T. White

发表机构 * Florida Institute of Technology（佛罗里达理工学院）； University of Florida（佛罗里达大学）

AI总结研究利用提示驱动的视觉语言模型在轨扩展语义能力，无需修改权重即可通过自然语言提示检测新航天器部件，在129张图像上零样本实例分割达到0.385 mAP@0.5。

Comments 5 pages, 1 figure, 2 tables. Equal contribution by Nicholas A. Welsh and Lennon Shikhman. Published in the CVPR2026 Workshop on AI4Space

AI中文摘要

星载检测系统通常在发射前部署感知模型，之后更新模型权重或扩展固定标签集在操作上变得不可行。虽然监督模型可以在飞行前集成，但在轨道上添加新的语义能力需要重新训练和重新上传参数。我们研究提示驱动的视觉语言模型是否能够实现发射后语义扩展，允许通过自然语言提示指定新的航天器部件，而无需修改星载权重。我们在一个包含129张先前未见卫星图像的测试集上，采用严格冻结的单次推理协议，评估了航天器部件的零样本实例分割。在固定全局阈值且无后处理的情况下，SAM3达到0.385 mAP@0.5和0.267 mAP@0.5:0.95。性能强烈依赖于尺度：大型结构元素如航天器主体（0.639 AP@0.50）和太阳翼（0.598 AP@0.5）定位可靠，而相对较小的附件如天线（0.221 AP@0.5）和推进器（0.081 AP@0.5）仍然困难。提示形式影响性能，包含空间和几何描述符的结构化提示相比短类别名称提示提升高达82%。该模型在当代嵌入式GPU的内存和计算范围内运行，表明提示驱动的定位可以为主要航天器结构提供发射后语义扩展的实用机制，同时突显了在轨道域偏移下细粒度部件零样本定位的局限性。

英文摘要

Spaceborne inspection systems often deploy perception models prior to launch, after which updating model weights or expanding fixed label sets becomes operationally impractical. While supervised models can be integrated pre-flight, adding new semantic capabilities in orbit requires retraining and re-uploading parameters. We investigate whether prompt-driven vision-language models can enable post-launch semantic expansion, allowing new spacecraft components to be specified via natural-language prompts without modifying onboard weights. We evaluate zero-shot instance segmentation of spacecraft components under a strictly frozen, single-pass inference protocol on a test set of 129 images of previously unseen satellites. Under fixed global thresholds and no post-processing, SAM3 achieves 0.385 mAP@0.5 and 0.267 mAP@0.5:0.95. Performance is strongly scale-dependent: large structural elements like spacecraft bodies (0.639 AP@0.5) and solar arrays (0.598 AP@0.5) localize reliably, while relatively small appendages like antennas (0.221 AP@0.5) and thrusters (0.081 AP@0.5) remain difficult. Prompt formulation influences performance, with structured prompts incorporating spatial and geometric descriptors yielding up to 82% improvement over short category-name prompts. The model operates within the memory and compute envelope of contemporary embedded GPUs, suggesting prompt-driven grounding can provide a practical mechanism for post-launch semantic extension of dominant spacecraft structures while highlighting limitations of zero-shot localization for fine-scale components under orbital domain shift.
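The gap between mAP@0.5 and mAP@0.5:0.95 comes down to how strictly a predicted mask must overlap the ground truth before it counts as a match. A minimal sketch of that overlap test, assuming masks are represented as sets of pixel coordinates (a representation chosen here for brevity, not the paper's evaluation code):

```python
def mask_iou(pred, truth):
    """Intersection over union of two binary masks given as sets of (row, col)."""
    union = pred | truth
    return len(pred & truth) / len(union) if union else 0.0

def is_match(pred, truth, iou_threshold):
    """A prediction counts as a true positive only if IoU clears the threshold."""
    return mask_iou(pred, truth) >= iou_threshold

# Hypothetical 4x4 predicted mask, shifted one pixel relative to the ground truth.
pred = {(r, c) for r in range(4) for c in range(4)}
truth = {(r, c + 1) for r in range(4) for c in range(4)}

print(mask_iou(pred, truth))
print(is_match(pred, truth, 0.5), is_match(pred, truth, 0.95))
```

A one-pixel shift on a small mask already fails the stricter thresholds, which is consistent with small appendages scoring lower than large structures.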


2606.15694 2026-06-16 cs.MM cs.AI cs.CV cs.LG 交叉投稿

基于知识的视觉问答系统综述：视觉推理任务中的知识生命周期

Jiaqi Deng, Zonghan Wu, Huan Huo, Guandong Xu

发表机构 * University of Technology Sydney（悉尼科技大学）； East China Normal University（华东师范大学）； Education University of Hong Kong（香港教育大学）

AI总结综述基于知识的视觉问答（KB-VQA）方法，将其分为知识表示、检索和推理三个阶段，并探讨大语言模型带来的变革，指出未来研究方向。

Comments Accepted at TKDE, 20 pages, 5 figures, 4 tables

DOI: 10.1109/TKDE.2026.3699946
Journal ref: IEEE Transactions on Knowledge and Data Engineering, 2026

AI中文摘要

基于知识的视觉问答（KB-VQA）扩展了通用视觉问答（VQA），不仅需要理解视觉和文本输入，还需要广泛的知识，从而在多种实际应用中取得显著进展。KB-VQA引入了独特的挑战，包括对齐来自不同模态和来源的异构信息、从嘈杂或大规模存储库中检索相关知识，以及执行复杂推理以从组合上下文中推断答案。随着大语言模型（LLMs）的发展，KB-VQA系统也经历了显著变革，LLMs作为强大的知识库、检索增强生成器和强推理器。尽管取得了实质性进展，但目前尚无全面综述系统性地组织和回顾现有的KB-VQA方法。本综述旨在通过建立KB-VQA方法的结构化分类法，并将系统分为主要阶段：知识表示、知识检索和知识推理，来填补这一空白。通过探索各种知识集成技术并识别持续存在的挑战，本文还概述了有前景的未来研究方向，为推进KB-VQA模型及其应用提供了基础。

英文摘要

Knowledge-based Vision Question Answering (KB-VQA) extends general Vision Question Answering (VQA) by requiring not only the understanding of visual and textual inputs but also an extensive range of knowledge, enabling significant advancements across various real-world applications. KB-VQA introduces unique challenges, including the alignment of heterogeneous information from diverse modalities and sources, the retrieval of relevant knowledge from noisy or large-scale repositories, and the execution of complex reasoning to infer answers from the combined context. With the advancement of Large Language Models (LLMs), KB-VQA systems have also undergone a notable transformation, where LLMs serve as powerful knowledge repositories, retrieval-augmented generators, and strong reasoners. Despite substantial progress, no comprehensive survey currently exists that systematically organizes and reviews the existing KB-VQA methods. This survey aims to fill this gap by establishing a structured taxonomy of KB-VQA approaches and categorizing the systems into three main stages: knowledge representation, knowledge retrieval, and knowledge reasoning. By exploring various knowledge integration techniques and identifying persistent challenges, this work also outlines promising future research directions, providing a foundation for advancing KB-VQA models and their applications.


2507.17588 2026-06-16 cs.CV cs.CL 版本更新

Dual-branch Prompting for Multimodal Machine Translation

双分支提示用于多模态机器翻译

Jie Wang, Zhendong Yang, Liansong Zong, Xiaobo Zhang, Dexian Wang, Ji Zhang

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University（西南交通大学计算机与人工智能学院）； School of Computer and Software Engineering, Xihua University（西华大学计算机与软件工程学院）； School of Intelligent Medicine, Chengdu University of Traditional Chinese Medicine（成都中医药大学智能医学学院）

AI总结提出基于扩散模型的双分支提示框架D2P-MMT，利用重建图像过滤视觉噪声，通过分布对齐损失提升鲁棒翻译性能。

Comments This manuscript has been fully accepted and published by ACM Transactions on Multimedia Computing, Communications, and Applications (ACM TOMM)

AI中文摘要

多模态机器翻译（MMT）通常通过整合对齐的视觉特征来增强纯文本翻译。尽管取得了显著进展，最先进的MMT方法在推理时通常依赖于配对的图像-文本输入，并且对无关的视觉噪声敏感，这限制了它们的鲁棒性和实际应用性。为了解决这些问题，我们提出了D2P-MMT，一种基于扩散的双分支提示框架，用于鲁棒的视觉引导翻译。具体来说，D2P-MMT仅需要源文本和由预训练扩散模型生成的重建图像，该图像自然地过滤掉分散注意力的视觉细节，同时保留语义线索。在训练期间，模型使用双分支提示策略从真实图像和重建图像中联合学习，鼓励丰富的跨模态交互。为了弥合模态差距并减轻训练-推理差异，我们引入了一种分布对齐损失，强制两个分支的输出分布之间的一致性。在Multi30K数据集上的大量实验表明，与现有最先进方法相比，D2P-MMT实现了更优的翻译性能。我们的代码公开于 https://github.com/MentaY/DDP。

英文摘要

Multimodal Machine Translation (MMT) typically enhances text-only translation by incorporating aligned visual features. Despite the remarkable progress, state-of-the-art MMT approaches often rely on paired image-text inputs at inference and are sensitive to irrelevant visual noise, which limits their robustness and practical applicability. To address these issues, we propose D2P-MMT, a diffusion-based dual-branch prompting framework for robust vision-guided translation. Specifically, D2P-MMT requires only the source text and a reconstructed image generated by a pre-trained diffusion model, which naturally filters out distracting visual details while preserving semantic cues. During training, the model jointly learns from both authentic and reconstructed images using a dual-branch prompting strategy, encouraging rich cross-modal interactions. To bridge the modality gap and mitigate training-inference discrepancies, we introduce a distributional alignment loss that enforces consistency between the output distributions of the two branches. Extensive experiments on the Multi30K dataset demonstrate that D2P-MMT achieves superior translation performance compared to existing state-of-the-art approaches. Our code is publicly available at https://github.com/MentaY/DDP.


2601.06212 2026-06-16 cs.CV cs.AI 版本更新

Akasha 2: Hamiltonian State Space Duality and Visual-Language Joint Embedding Predictive Architecture

Akasha 2: 哈密顿状态空间对偶与视觉-语言联合嵌入预测架构

Yani Meziani

发表机构 * Independent AI Researcher（独立AI研究员）； Québec (QC), Canada（魁北克（QC），加拿大）

AI总结提出 Akasha 2 多模态架构，结合哈密顿状态空间对偶与视觉-语言联合嵌入预测，通过稀疏混合哈密顿专家和哈密顿流匹配实现超低延迟视频预测与合成，在保持能量守恒下取得 SOTA 性能。

Comments No supporting claims were validated in this automated agentic R&D research run

AI中文摘要

我们提出了 Akasha 2，一种最先进的多模态架构，它集成了哈密顿状态空间对偶（H-SSD）与视觉-语言联合嵌入预测架构（VL-JEPA）。该系统利用 Mamba-3 选择性状态空间模型（SSM），并通过稀疏混合哈密顿专家（SMoE-HE）增强，后者通过辛积分强制执行潜在物理守恒定律。对于视觉合成，我们引入了哈密顿流匹配（HFM）和持久化 3D 高斯泼溅（3DGS），在移动硬件上实现了超低延迟（<50ms）。这项工作在潜在世界模型中建立了一个新范式，通过全息记忆架构实现了前所未有的时空一致性。我们的方法表明，将物理启发的归纳偏置融入神经架构可带来显著改进：最先进的视频预测（FVD: 287），比扩散模型快 4 倍的视觉合成，以及相比 Transformer 基线 3-18 倍的推理加速，同时在长时间范围内保持能量守恒。

英文摘要

We present Akasha 2, a state-of-the-art multimodal architecture that integrates Hamiltonian State Space Duality (H-SSD) with Visual-Language Joint Embedding Predictive Architecture (VL-JEPA). The system leverages the Mamba-3 Selective State Space Model (SSM) augmented by a Sparse Mixture of Hamiltonian Experts (SMoE-HE) that enforces latent physical conservation laws through symplectic integration. For visual synthesis, we introduce Hamiltonian Flow Matching (HFM) and persistent 3D Gaussian Splatting (3DGS), enabling ultra-low latency (<50ms) on mobile hardware. This work establishes a new paradigm in latent world models, achieving unprecedented spatiotemporal coherence through a holographic memory architecture. Our approach demonstrates that incorporating physics-inspired inductive biases into neural architectures yields significant improvements: state-of-the-art video prediction (FVD: 287), 4x faster visual synthesis than diffusion models, and 3-18x inference speedup over transformer baselines while maintaining energy conservation over extended horizons.
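For readers unfamiliar with the term, "symplectic integration" refers to update rules that preserve the phase-space structure of Hamiltonian dynamics and therefore keep energy bounded over long horizons. The sketch below is a generic textbook illustration on a harmonic oscillator, comparing leapfrog with explicit Euler; it is not Akasha 2 code and makes no claim about how that system is implemented.

```python
def leapfrog(q, p, dt, steps, force=lambda q: -q):
    """Symplectic leapfrog for H = p^2/2 + V(q); the default force is for V = q^2/2."""
    for _ in range(steps):
        p += 0.5 * dt * force(q)  # half kick
        q += dt * p               # full drift
        p += 0.5 * dt * force(q)  # half kick
    return q, p

def explicit_euler(q, p, dt, steps, force=lambda q: -q):
    """Non-symplectic baseline: energy grows steadily for the oscillator."""
    for _ in range(steps):
        q, p = q + dt * p, p + dt * force(q)
    return q, p

def energy(q, p):
    return 0.5 * (p * p + q * q)

print(energy(*leapfrog(1.0, 0.0, 0.1, 1000)))        # stays close to the initial 0.5
print(energy(*explicit_euler(1.0, 0.0, 0.1, 1000)))  # drifts far above 0.5
```

The contrast between the two printed energies is the property that "maintaining energy conservation over extended horizons" alludes to.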


2601.08010 2026-06-16 cs.CV 版本更新

MolSight: 基于图像的分子属性预测

Aaditya Baranwal, Akshaj Gupta, Yogesh S Rawat, Shruti Vyas

发表机构 * University of Central Florida（中央佛罗里达大学）； Birla Institute of Technology and Science（比拉理工学院和科学学院）

AI总结 MolSight首次系统研究基于视觉的分子属性预测，通过10种视觉架构和7种预训练策略，在10个下游任务中展示性能，提出化学引导课程提升效果，以更低的FLOPs实现优异结果。

AI中文摘要

每种合成分子均可绘制为2D骨架图，但现代属性预测更关注分子图、3D构象或大参数语言模型。我们提出MolSight，首次系统研究基于视觉的分子属性预测。使用10种视觉架构、7种预训练策略和2M分子图像，在10个下游任务中评估性能，涵盖物理性质回归、药物发现分类和量子化学预测。为应对预训练分子结构复杂度差异，提出化学引导课程：五种结构复杂度描述符将语料库分为五个难度递增的层级，持续优于非课程基线。证明单个渲染的bond-line图像经视觉编码器处理即可实现有竞争力的分子属性预测，即仅凭视觉获得化学洞察。最佳课程训练配置在10个基准中的5个上取得最佳结果，在全部10个基准上均位列前两名，而FLOPs比最接近的多模态竞争者低80倍。

英文摘要

Every molecule ever synthesised can be drawn as a 2D skeletal diagram, yet in modern property prediction this universally available representation has received less focus in favour of molecular graphs, 3D conformers, or billion-parameter language models, each imposing its own computational and data-engineering overhead. We present MolSight, the first systematic large-scale study of vision-based Molecular Property Prediction (MPP). Using 10 vision architectures, 7 pre-training strategies, and 2M molecule images, we evaluate performance across 10 downstream tasks spanning physical-property regression, drug-discovery classification, and quantum-chemistry prediction. To account for the wide variation in structural complexity across pre-training molecules, we further propose a chemistry-informed curriculum: five structural complexity descriptors partition the corpus into five tiers of increasing chemical difficulty, consistently outperforming non-curriculum baselines. We show that a single rendered bond-line image, processed by a vision encoder, is sufficient for competitive molecular property prediction, i.e. chemical insight from sight alone. The best curriculum-trained configuration achieves the top result on 5 of 10 benchmarks and top two on all 10, at 80× lower FLOPs than the nearest multi-modal competitor.


2605.18313 2026-06-16 cs.CV cs.AI 版本更新

Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering

Wasserstein均衡解码用于可靠的医疗视觉问答

Luca Hagen, Johanna P. Müller, Weitong Zhang, Mengyun Qiao, Bernhard Kainz

发表机构 * Friedrich-Alexander University Erlangen-Nürnberg（弗里德里希-亚历山大厄林根-纽伦堡大学）； Imperial College London（伦敦帝国理工学院）； University College London（伦敦大学学院）

AI总结本文提出了一种基于Wasserstein距离的均衡解码方法，用于改进医疗视觉问答系统，通过语义感知的停止准则提高解码效率和准确性，同时在VQA-RAD和PathVQA数据集上实现了显著的性能提升。

AI中文摘要

小型视觉-语言模型（2-8B）由于隐私限制、有限的连接性和低延迟要求，适合临床部署。然而，其有限的容量会加剧生成合理但错误的输出。我们扩展了之前仅限于纯文本、封闭式NLP任务的博弈论解码方法，应用于开放式的医疗视觉问答（VQA）。我们引入了一种语义感知的Wasserstein停止准则，以取代基于词序的匹配，使收敛基于近义候选答案之间的语义共识，避免因临床等效排名交换导致的不必要的迭代。在VQA-RAD和PathVQA上，我们相对于贪心和判别基线获得了一致且具有统计显著性的改进。在VQA-RAD上，我们将Qwen3-VL-2B提高了3.5个百分点（p < 0.01），超过了贪心解码的4B模型，在更大规模上呈现出相似趋势。在PathVQA上，使用BDG的Gemma-3-4B尽管没有领域特定的微调，仍与贪心解码的MedGemma-4B表现相当。在与经典BDG的准确性相等时，Wasserstein准则将平均收敛迭代次数减少了约20%，在提高推理效率的同时保留了博弈论均衡行为。代码可在https://github.com/luca-hagen/Wasserstein-BDG-medical-VQA上获得。

英文摘要

Small vision-language models (2-8B) are well-suited for clinical deployment due to privacy constraints, limited connectivity, and low-latency requirements favouring on-device or on-premise inference. However, their limited capacity exacerbates the generation of plausible but incorrect outputs. We extend game-theoretic decoding, previously restricted to text-only, closed-ended NLP tasks, to vision-language models for open-ended Medical VQA. We introduce a semantically aware Wasserstein stopping criterion that replaces lexical order matching, enabling convergence based on semantic consensus among near-synonymous candidate answers and avoiding unnecessary iterations caused by clinically equivalent ranking swaps. On VQA-RAD and PathVQA, we obtain consistent, statistically significant improvements over greedy and discriminative baselines. On VQA-RAD, we improve Qwen3-VL-2B by +3.5 percentage points (p < 0.01), surpassing the greedy 4B model, with similar trends at larger scales. On PathVQA, Gemma-3-4B with BDG matches MedGemma-4B under greedy decoding despite no domain-specific fine-tuning. At accuracy parity with classic BDG, the Wasserstein criterion reduces average convergence iterations by approximately 20%, improving inference efficiency while preserving the game-theoretic equilibrium behaviour. Code is available at https://github.com/luca-hagen/Wasserstein-BDG-medical-VQA.


2606.14752 2026-06-16 cs.CV cs.AI cs.LG cs.RO 新提交

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

X-Tokenizer: 一种用于视觉-语言-动作预训练的多模态动作分词器

Xirui Kang, Yanpei Shi, Lucy Liang, Roy Gan, Dongxiu Liu, Pushi Zhang, Danpeng Chen, Xiaoyi Qin, Yinan Zheng, Jinliang Zheng, Hao Wang, Xianyuan Zhan, Hang Su

发表机构 * Square Robot ； City University of Hong Kong（香港城市大学）； Tsinghua University（清华大学）

AI总结提出X-Tokenizer，通过语义残差量化（SRQ）和掩码动作建模（MAM）将动作离散化为语义接口，在2.4M轨迹上预训练后提升VLA模型的多模态接地和长程任务性能。

Comments Project page: https://x-square-robot.github.io/X-Tokenizer_projectPage/

AI中文摘要

现代视觉-语言-动作（VLA）模型必须桥接预训练的视觉-语言推理和精确的连续机器人控制。现有的动作分词器主要为了重建而离散化动作，产生的编码保留了运动几何结构，但仅向主干网络提供弱语义监督。因此，我们将动作分词化不仅视为压缩，而是作为多模态推理与可执行控制之间的语义接口学习。为此，我们引入了X-Tokenizer，一种轻量级的编码器-语义残差量化（SRQ）-解码器架构，为多种机械臂形态提供共享的动作接口。其关键组件SRQ在残差向量量化上施加了非对称结构：第一层通过掩码动作建模（MAM）训练，形成捕获粗略运动意图的离散动作语言，而更深层则保持面向重建的残差，保留细粒度细节。为了进一步将动作标记与多模态语义对齐，X-Tokenizer通过与预训练基础模型的表示空间进行对比对齐以及下一帧视觉-语言特征预测进行预训练。在2.4M轨迹（2.0B动作帧）上预训练后，单个冻结的X-Tokenizer作为表示塑造的监督信号插入混合离散-连续VLA中。X-Tokenizer在真实世界聚合指标上达到最佳，并在RoboTwin 2.0模拟中表现强劲。在多模态接地（+13.5%）和长程任务（+8.25）上优于FAST，表明动作分词器作为VLA预训练的语义接口，而不仅仅是动作压缩。

英文摘要

Modern Vision-Language-Action (VLA) models must bridge pretrained vision-language reasoning and precise continuous robot control. Existing action tokenizers discretize actions primarily for reconstruction, producing codes that preserve motion geometry but provide only weak semantic supervision to the backbone. We therefore formulate action tokenization not as mere compression, but as semantic interface learning between multimodal reasoning and executable control. To this end, we introduce X-Tokenizer, a lightweight encoder-Semantic Residual Quantization (SRQ)-decoder architecture that provides a shared action interface across diverse robotic arm embodiments. Its key component, SRQ, imposes an asymmetric structure on residual vector quantization: the first level is trained with Masked Action Modeling (MAM) to form a discrete action language that captures coarse motion intent, while deeper levels remain reconstruction-oriented residuals that preserve fine-grained details. To further align action tokens with multimodal semantics, X-Tokenizer is pretrained with contrastive alignment to the representation space of a pretrained foundation model and with next-frame vision-language feature prediction. Pretrained on 2.4M trajectories (2.0B action frames), a single frozen X-Tokenizer plugs into a mixed discrete-continuous VLA as a representation-shaping supervision signal. X-Tokenizer achieves top real-world aggregate and strong RoboTwin 2.0 simulation results. Outperforming FAST in multimodal grounding (+13.5%) and long-horizon tasks (+8.25), it shows that action tokenizers serve as semantic interfaces for VLA pretraining beyond mere action compression.
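The residual vector quantization that SRQ builds on can be sketched in a few lines: each level quantizes whatever the previous levels failed to capture, so the first code carries the coarse content and later codes carry detail. This is plain RVQ with hand-written toy codebooks; the Masked Action Modeling objective that makes the first level "semantic" in X-Tokenizer is not reproduced here.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to `vec` in squared Euclidean distance."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def rvq_encode(vec, codebooks):
    """Quantize level by level; each level encodes the remaining residual."""
    residual, codes = list(vec), []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected codewords across levels to reconstruct the vector."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

# Toy 2-D codebooks (hypothetical): level 1 holds coarse directions, level 2 small fixes.
coarse = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
fine = [[0.1, 0.0], [0.0, 0.1], [-0.1, 0.0], [0.0, -0.1]]

codes = rvq_encode([0.9, 0.0], [coarse, fine])
print(codes, rvq_decode(codes, [coarse, fine]))
```

Reading only the first code gives the coarse motion direction, which loosely matches the abstract's description of the first level capturing "coarse motion intent".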


2606.14772 2026-06-16 cs.CV cs.AI 新提交

ScoutVLA: UAV-Centric Active Perception via a Dual-Expert VLA Model for Open-World Embodied Question Answering

ScoutVLA：面向开放世界具身问答的无人机中心主动感知双专家VLA模型

Wenhao Lu, Zhengqiu Zhu, Xiaofeng Wang, Xiaoran Zhang, Yatai Ji, Yong Zhao, Yue Hu, Yingzhen Nie, Jinlong Zhu, Zheng Zhu

发表机构 * National Key Laboratory of Digital Intelligent Modeling and Simulation, National University of Defense Technology（国防科技大学数字智能建模与仿真国家重点实验室）； GigaAI

AI总结针对无人机在室外具身问答中细粒度视角调整不足的问题，提出ScoutVLA模型，采用解耦双专家架构（视觉语言专家推断语义意图，动作专家生成连续视角调整轨迹），并通过知识隔离机制平衡连续控制与语义推理，在仿真和真实实验中显著优于基线方法。

AI中文摘要

空中具身问答（EQA）要求无人机（UAV）主动感知环境并回答自然语言问题。现有的室外EQA系统通常在目标进入无人机视野后停止，导致寻找证据所需的问题的细粒度视角调整问题仍未解决。为解决此问题，我们引入FG-EQA，一个细粒度主动感知EQA基准，包含超过4万条模拟轨迹和1千条真实轨迹。受侦察蜂“摇摆舞”的启发（它们迭代调整飞行路径以验证目标信息），我们提出ScoutVLA，一种用于室外EQA的证据驱动视觉-语言-动作模型。为模拟这种主动探索行为，ScoutVLA采用解耦双专家架构：视觉语言专家推断语义意图以识别缺失证据，而独立动作专家使用高自由度流匹配生成连续视角调整轨迹。为平衡连续控制和语义推理的竞争需求，我们设计了一种解耦训练策略，其中包含知识隔离机制，防止动作梯度抹除模型的多模态推理能力。大量仿真实验和定性真实世界实地研究均验证了ScoutVLA相对于最先进基线的优越性，平均严格成功率高10.48倍，平均QA正确率高7.72倍。

英文摘要

Aerial Embodied Question Answering (EQA) requires Unmanned Aerial Vehicles (UAVs) to actively perceive the environment and answer natural language questions. Existing outdoor EQA systems usually stop once the target enters the UAV's field of view, leaving the fine-grained viewpoint adjustment needed for evidence-seeking questions largely unresolved. To address this issue, we introduce FG-EQA, a fine-grained active perception EQA benchmark with more than 40K simulated trajectories and 1K real-world trajectories. Drawing inspiration from the "waggle dance" of scout bees, which iteratively adjust their flight paths to verify target information, we propose ScoutVLA, an evidence-driven Vision-Language-Action model for outdoor EQA. To emulate this active exploration behavior, ScoutVLA features a decoupled dual-expert architecture: a vision-language expert infers the semantic intent to identify missing evidence, while an independent action expert employs high-DoF flow matching to generate continuous viewpoint-refinement trajectories. To balance the competing demands of continuous control and semantic reasoning, we devise a decoupled training strategy with a knowledge insulation mechanism that prevents the action gradients from erasing the model's multimodal reasoning ability. Extensive simulated experiments and a qualitative real-world field study both verify the superiority of ScoutVLA over the state-of-the-art baselines, demonstrating a 10.48× higher average strict success rate and a 7.72× higher average QA correctness.


2606.14841 2026-06-16 cs.CV 新提交

Multi-HMR 2: Multi-Person Camera-Centric Human Detection, Mesh Recovery and Tracking

Multi-HMR 2：多人相机中心人体检测、网格恢复与跟踪

Guénolé Fiche, Philippe Weinzaepfel, Romain Brégier, Fabien Baradel

发表机构 * NAVER LABS Europe（NAVER LABS欧洲）

AI总结提出基于DETR的框架Multi-HMR 2，联合预测场景一致相机和人体网格，实现度量3D定位与跟踪，无需真实内参或视频监督，在保持骨盆中心性能的同时显著提升检测与定位精度。

AI中文摘要

人体网格恢复（HMR）的大多数进展集中在骨盆中心恢复，忽视了相机坐标系中的度量3D定位和检测精度——这两个因素对于人机交互和社交场景理解等实际应用至关重要。当前的评估协议通常忽略这些方面，强调每人的根中心恢复而非相机空间感知。因此，现有方法依赖于固定的相机假设或手工后处理，限制了其鲁棒性和实际部署。我们提出了Multi-HMR 2，一个简单而鲁棒的基于DETR的框架，用于多人相机中心的人体检测、网格恢复和跟踪。Multi-HMR 2预测一个场景一致的相机以及人体网格，无需真实内参即可实现度量3D定位。此外，通过从SAM2中提取基于图像的记忆特征，Multi-HMR 2扩展到跟踪，无需视频监督即可实现一致的同一性关联。尽管概念简单——无手工组件、无视频输入、无真实相机——Multi-HMR 2在保持最先进的骨盆中心性能的同时，显著提高了检测精度和度量3D定位。

英文摘要

Most advances in human mesh recovery (HMR) have focused on pelvis-centered recovery, overlooking metric 3D localization and detection accuracy in the camera coordinate system - two key factors for real-world applications such as human-robot interaction and social scene understanding. Current evaluation protocols often ignore these aspects, emphasizing per-person, root-centered recovery rather than camera-space perception. As a result, existing approaches rely on fixed camera assumptions or handcrafted post-processing, limiting their robustness and practical deployment. We introduce Multi-HMR 2, a simple yet robust DETR-based framework for Multi-person Camera-centric Human detection, mesh Recovery, and tracking. Multi-HMR 2 predicts a scene-consistent camera together with human meshes, enabling metric 3D localization without ground-truth intrinsics. Moreover, by distilling image-based memory features from SAM2, Multi-HMR 2 extends to tracking, achieving consistent identity association without video supervision. Despite its conceptual simplicity - no handcrafted components, no video input, and no ground-truth cameras - Multi-HMR 2 achieves state-of-the-art pelvis-centered performance while substantially improving detection accuracy and metric 3D localization.


2606.15099 2026-06-16 cs.CV cs.LG cs.RO 新提交

Think Less, Act Early: Reinforced Latent Reasoning with Early Exit in Vision-Language-Action Models

少思考，早行动：视觉-语言-动作模型中带早退的强化潜在推理

Dianqiao Lei, Lianlei Shan

AI总结提出AVA-VLA框架，通过强化学习去噪和早退策略优化潜在推理轨迹，在LIBERO上实现6倍推理加速和98.3%平均成功率。

Comments Accepted at ICML 2026

AI中文摘要

现有的视觉-语言-动作（VLA）模型主要依赖显式的思维链（CoT）推理来桥接感知和动作。虽然有效，但这种范式在多步骤任务中面临高计算成本和错误传播的问题。在本文中，我们提出了自适应变量对齐VLA（AVA-VLA），一种新颖的潜在推理VLA框架，将推理建模为一系列不可观测的潜在变量，绕过了显式文本生成的需求。然而，潜在轨迹本质上容易受到噪声干扰和与下游目标不对齐的影响。为了解决这个问题，我们引入了一种基于强化学习的去噪机制，将潜在状态生成视为一个顺序决策过程，通过任务级奖励优化推理轨迹。此外，我们结合了一种早退策略，根据状态置信度自适应地终止推理，实现了深度和效率之间的动态权衡。在具身决策基准上的大量实验表明，AVA-VLA在LIBERO上实现了比显式CoT方法6倍的推理加速，同时达到了98.3%的平均成功率，在效率和长期稳定性上均优于全推理基线。

英文摘要

Existing Vision-Language-Action (VLA) models predominantly rely on explicit Chain-of-Thought (CoT) reasoning to bridge perception and action. While effective, this paradigm suffers from high computational costs and error propagation in multi-step tasks. In this paper, we propose Adaptive Variable Alignment VLA (AVA-VLA), a novel Latent Reasoning VLA framework that models reasoning as a sequence of unobservable latent variables, bypassing the need for explicit text generation. However, latent trajectories are inherently susceptible to noise interference and misalignment with downstream objectives. To address this, we introduce a Reinforcement Learning-based Denoising mechanism that treats latent state generation as a sequential decision process, optimizing reasoning trajectories via task-level rewards. Furthermore, we incorporate an Early-Exit Strategy that adaptively terminates reasoning based on state confidence, enabling a dynamic trade-off between depth and efficiency. Extensive experiments on embodied decision benchmarks demonstrate that AVA-VLA achieves a 6x inference speedup over explicit CoT methods while attaining a 98.3% average success rate on LIBERO, improving both efficiency and long-horizon stability over full-reasoning baselines.
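The early-exit idea, stopping latent reasoning once confidence is high enough, can be sketched as a bounded loop. Everything named here is hypothetical: `step_fn` stands in for one latent reasoning update plus a confidence head, and the threshold and toy confidence curve are illustrative rather than taken from AVA-VLA.

```python
def reason_with_early_exit(step_fn, state, max_steps=8, threshold=0.9):
    """Run reasoning steps until confidence clears `threshold` or steps run out.

    step_fn(state) -> (new_state, confidence) is a stand-in for the model.
    Returns the final state and the number of steps actually used.
    """
    steps_used = 0
    for steps_used in range(1, max_steps + 1):
        state, confidence = step_fn(state)
        if confidence >= threshold:
            break  # confident enough: act now instead of reasoning further
    return state, steps_used

def toy_step(state):
    """Toy model: each step halves the remaining uncertainty."""
    new_state = state + 1
    return new_state, 1.0 - 0.5 ** new_state

print(reason_with_early_exit(toy_step, 0))                              # exits before max_steps
print(reason_with_early_exit(toy_step, 0, max_steps=5, threshold=0.99)) # uses the full budget
```

Lowering the threshold trades reasoning depth for latency, which is the "dynamic trade-off between depth and efficiency" the abstract describes.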


2606.15142 2026-06-16 cs.CV cs.RO 新提交

MotionVLA: Vision-Language-Action Model for Humanoid Motion

MotionVLA：面向人形运动的视觉-语言-动作模型

Nonghai Zhang, Siyu Zhai, Yanjun Li, Zeyu Zhang, Zhihan Yin, Yandong Guo, Boxin Shi, Hao Tang

发表机构 * School of Computer Science, Peking University（北京大学计算机科学学院）； AI 2 Robotics

AI总结针对人形运动生成中低频姿态与高频物理信号量化不匹配的问题，提出双流频率分词器DSFT和基于Qwen3.5的MotionVLA模型，在HumanML3D和MBench上显著提升多样性一致性和运动条件一致性。

详情

AI中文摘要

从场景图像和文本生成逼真的人形运动涉及低频姿态语义和高频物理动力学。然而，许多现有方法使用单个共享码本对运动进行分词，将异质运动信号强制映射到相同的量化空间。我们对人体运动数据的频域分析揭示了单码本量化与运动统计之间的明显不匹配：五个DCT系数捕获了93%的关节位置能量，但仅捕获了37%的关节速度能量，这可能导致量化偏向姿态统计，而低估高频速度分量。第二个挑战在于使标准自回归模型有效建模运动序列中的高频物理信号。因此，我们提出了DSFT，一种双流频率分词器，将运动分离为基础流和物理流，并使用DCT截断和BPE独立压缩它们。此外，我们提出了MotionVLA，一个基于Qwen3.5的模型，将基础令牌和物理令牌排列在统一序列中，其中物理令牌在基础令牌之后预测。在HumanML3D和MBench上的实验表明，尽管使用轻量级2B骨干网络，MotionVLA在HumanML3D上将与真实数据的多样性差距减少了50%以上，并在MBench上将运动条件一致性提高了3.8%，支持频率感知的双流解耦作为自回归运动生成的有效公式。代码：https://github.com/AIGeeksGroup/MotionVLA。网站：https://aigeeksgroup.github.io/MotionVLA。

English abstract

Generating realistic humanoid motion from scene images and text involves both low-frequency pose semantics and high-frequency physical dynamics. However, many existing methods tokenize motion with a single shared codebook, forcing heterogeneous motion signals into the same quantization space. Our frequency-domain analysis of human motion data reveals a clear mismatch between single-codebook quantization and motion statistics: five DCT coefficients capture 93% of joint-position energy but only 37% of joint-velocity energy, which can bias quantization toward pose statistics and under-represent high-frequency velocity components. A second challenge lies in adapting a standard autoregressive model to effectively model high-frequency physical signals in motion sequences. Therefore, we propose DSFT, a dual-stream frequency tokenizer that separates motion into Base and physical streams and compresses them independently with DCT truncation and BPE. Furthermore, we present MotionVLA, a Qwen3.5-based model that arranges Base and physical tokens in a unified sequence, where Phys tokens are predicted after Base tokens. Experiments on HumanML3D and MBench show that, despite using a lightweight 2B backbone, MotionVLA reduces the Diversity gap to real data by over 50% on HumanML3D and improves Motion-Condition Consistency by 3.8% on MBench, supporting frequency-aware dual-stream decoupling as an effective formulation for autoregressive motion generation. Code: https://github.com/AIGeeksGroup/MotionVLA. Website: https://aigeeksgroup.github.io/MotionVLA.
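
The frequency-domain observation in the abstract above (a few DCT coefficients capture most joint-position energy but far less joint-velocity energy) can be illustrated on a synthetic signal. This is a sketch under stated assumptions, not the authors' analysis: the toy trajectory, the length `N = 64`, and the helper names `dct2` and `low_freq_energy_fraction` are all hypothetical, and the resulting fractions are not the paper's 93% and 37%.

```python
import math
from typing import List


def dct2(x: List[float]) -> List[float]:
    """Orthonormal DCT-II, so the sum of squared coefficients equals the signal energy."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        acc = sum(v * math.cos(math.pi * (i + 0.5) * k / n) for i, v in enumerate(x))
        out.append(scale * acc)
    return out


def low_freq_energy_fraction(x: List[float], keep: int = 5) -> float:
    """Fraction of total energy held by the first `keep` DCT coefficients."""
    coeffs = dct2(x)
    total = sum(c * c for c in coeffs)
    return sum(c * c for c in coeffs[:keep]) / total


N = 64
# Toy joint trajectory (assumption): a slow pose component plus a small fast jitter.
position = [
    math.cos(math.pi * (i + 0.5) * 1 / N) + 0.1 * math.cos(math.pi * (i + 0.5) * 20 / N)
    for i in range(N)
]
# Finite-difference velocity: differencing amplifies the fast component.
velocity = [b - a for a, b in zip(position, position[1:])]

pos_frac = low_freq_energy_fraction(position)
vel_frac = low_freq_energy_fraction(velocity)
print(f"position: {pos_frac:.3f}, velocity: {vel_frac:.3f}")
```

Differencing scales each frequency component roughly in proportion to its frequency, so the velocity signal keeps much less of its energy in the first five coefficients than the position signal does. A single truncation level shared by both signals would therefore favor pose statistics, which is the mismatch the abstract describes.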

2606.15287 2026-06-16 cs.CV New submission

G2IA: Geometry-Guided Instance-Aware Retrieval and Refinement for Cross-Modal Place Recognition

Xianyun Jiao, Jingyi Xu, Zhongmiao Yan, Xieyuanli Chen, Lin Pei

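Affiliations: Shanghai Jiao Tong University; National University of Defense Technology

AI summary: Proposes G2IA, a framework that combines geometry-guided, instance-aware retrieval with cross-modal verification of local shapes and spatial layouts to address the modality gap and perceptual aliasing in image-to-point-cloud place recognition.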

English abstract

Cross-modal place recognition (CMPR) enables camera-only robots to localize against pre-built LiDAR maps in autonomous navigation scenarios. This image-to-point-cloud setting is challenged by two coupled ambiguities: the modality gap between perspective RGB appearance and sparse metric geometry, and perceptual aliasing among urban places with similar roads, facades, intersections, and object arrangements. Instead of treating CMPR as a single global descriptor matching problem, we argue that reliable retrieval requires both geometry-aware representation alignment and fine-grained candidate verification. In this paper, we propose G2IA, a geometry-guided instance-aware framework for image-to-point-cloud place recognition. In the retrieval stage, visual geometry priors from VGGT and instance features are integrated to construct place descriptors that are more compatible with LiDAR-derived map representations. In the refinement stage, the retrieved candidates are re-ranked by explicitly verifying whether local instance shapes and their relative spatial layouts are consistent across modalities. Experiments on public benchmarks demonstrate that G2IA consistently improves image-to-point-cloud place recognition under different localization thresholds, and exhibits strong cross-dataset generalization.

2606.15341 2026-06-16 cs.CV New submission

CausalDrive: Real-time Causal World Models for Autonomous Driving

Tianyi Yan, Huan Zheng, Dubing Chen, Meizhi Qu, Yingying Shen, Lijun Zhou, Mingfei Tu, Bing Wang, Guang Chen, Hangjun Ye, Haiyang Sun, Cheng-zhong Xu, Jianbing Shen

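Affiliations: SKL-IOTSC, CIS, University of Macau; Xiaomi EV; CASIA (Institute of Automation, Chinese Academy of Sciences)

AI summary: Proposes CausalDrive, a controllable, real-time driving world renderer that achieves interactive simulation through causal prediction and a Context-Forced DMD architecture, supporting closed-loop evaluation, reinforcement-learning post-training, and human-in-the-loop simulation.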

English abstract

World models have emerged as a promising paradigm for scaling autonomous driving (AD) data, yet existing video generative models fall short as interactive simulators. Layout-conditioned renderers rely on "oracle" future trajectories of all background agents, rendering them strictly non-reactive. Conversely, pure action-conditioned predictors lack semantic control over complex interactions and suffer from prohibitive diffusion latencies, hindering closed-loop policy learning. To bridge this gap, we present CausalDrive, a controllable, real-time foundation driving world renderer. CausalDrive operates solely on the initial front-view frame, the ego-vehicle's trajectory, and a macroscopic text prompt. By excluding future NPC layouts, we compel the model to intrinsically predict causal interactions, enabling text-driven control over Driving Sociology, allowing users to dynamically orchestrate diverse counterfactual reactions to identical ego-actions. To overcome the efficiency bottleneck and address the covariate shift in autoregressive generation, we propose a novel Context-Forced DMD architecture. This combines continuous flow-matching with a self-correcting distillation objective, achieving interactive speeds of 12 FPS. This breakthrough transforms the passive video generator into a playable neural simulator. We demonstrate its versatility across three downstream applications: (1) generative closed-loop evaluation with significantly mitigated collision artifacts, (2) large-scale Reinforcement Learning (RL) post-training driven by a Video2Reward module, and (3) real-time human-in-the-loop simulation. Extensive experiments validate that policies trained within CausalDrive's reactive scenarios exhibit superior interaction capabilities in the real world.

2606.15869 2026-06-16 cs.CV New submission

Metis: A Generalizable and Efficient World-Action Model for Autonomous Driving and Urban Navigation

Jingyu Li, Zhe Liu, Dongnan Hu, Junjie Wu, Zipei Ma, Wenxiao Wu, Chao Han, Zhihui Hao, Zhikang Liu, Kun Zhan, Jiankang Deng, Xiatian Zhu, Li Zhang

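Affiliations: Fudan University; Shanghai Innovation Institute; The University of Hong Kong; Tongji University; Li Auto Inc.; Huazhong University of Science and Technology; Imperial College London; University of Surrey

AI summary: Proposes Metis, a framework that decouples video generation from action prediction and uses a Mixture-of-Transformers architecture with an asymmetric attention mask to achieve efficient inference and generalization, reaching state-of-the-art performance on several navigation benchmarks.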

English abstract

World action models (WAMs) have shown great promise for autonomous driving and urban navigation. Built upon Vision-Language-Action models or video generation models, existing approaches suffer from two key limitations: (1) high inference latency due to future observation prediction at test time, and (2) tightly coupled video and action modeling leading to representational mismatch and degraded generalization. To address both issues, we propose Metis, an end-to-end WAM framework that decouples video generation and action prediction. Specifically, Metis employs a Mixture-of-Transformers architecture with dedicated experts for video generation and action prediction, preserving the intrinsic distributional properties of each task. To enhance efficiency, we introduce an asymmetric attention mask that enables joint training of both experts while allowing the action model to bypass explicit video generation during inference. This design ensures training-inference consistency and significantly reduces computational costs without compromising planning performance. Extensive experiments demonstrate state-of-the-art performance on the NAVSIM navhard and navtest benchmarks and the CityWalker navigation benchmark, validating both the generalizability and efficiency across diverse tasks. Real-robot deployments further confirm the practical feasibility of our approach.
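
One way to picture the asymmetric attention mask mentioned in the abstract above is the sketch below. The token layout (observation, then future-video, then action tokens) and the exact visibility rules are assumptions made for illustration. The abstract only states that the mask lets both experts train jointly while the action model skips video generation at inference, and `asymmetric_mask` is a hypothetical helper, not the authors' implementation.

```python
from typing import List


def asymmetric_mask(n_obs: int, n_video: int, n_action: int) -> List[List[bool]]:
    """Return mask[q][k], True when query token q may attend to key token k."""
    kinds = ["obs"] * n_obs + ["video"] * n_video + ["action"] * n_action
    allowed = {
        "obs": {"obs"},
        # Assumed: video tokens may read everything during joint training.
        "video": {"obs", "video", "action"},
        # Assumed: action tokens never read video tokens.
        "action": {"obs", "action"},
    }
    return [[k in allowed[q] for k in kinds] for q in kinds]


mask = asymmetric_mask(n_obs=2, n_video=3, n_action=2)
for row in mask:
    print("".join("x" if visible else "." for visible in row))
```

Because no action row points at a video column, deleting the video tokens at inference leaves the action tokens' attention pattern unchanged under this assumed layout, so training and inference stay consistent.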

2606.16202 2026-06-16 cs.CV cs.AI cs.RO New submission

EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video

Hyunjin Kim, Ri-Zhao Qiu, Guangqi Jiang, Xiaolong Wang

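Affiliations: UC San Diego

AI summary: Proposes EgoPhys, a framework that builds physical digital twins of deformable objects from egocentric RGB video using generalizable priors. It predicts spring stiffness fields without test-time optimization and outperforms baselines in reconstruction, future prediction, and zero-shot generalization.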

Comments Project Page: https://hjhyunjinkim.github.io/EgoPhys

English abstract

Humans naturally understand object physics through everyday interactions, but faithfully predicting complex deformable dynamics, such as elastic materials and fabrics, remains a major challenge for computer vision and robotics. We present EgoPhys, a framework that constructs deformable physical digital twins from egocentric RGB-only video using generalizable priors. EgoPhys overcomes the limitations of existing methods and enables controllable deformable digital-twin generation from egocentric videos by distilling per-object inverse-physics solutions into a compact codebook, which allows it to predict dense spring stiffness fields for unseen objects without per-spring test-time optimization. Trained with generalizable priors from diverse egocentric interactions, EgoPhys outperforms baselines in reconstruction, future prediction, and zero-shot generalization. To support training and evaluation, we curate an egocentric interaction dataset covering diverse deformable objects, scenes, and manipulation styles. We deploy EgoPhys on a real xArm6 robot, demonstrating that a digital twin initialized from a single egocentric human play video can serve as an internal world representation to aid in deformable-object planning, highlighting egocentric RGB observations as a scalable path toward real-to-sim pipelines.

2606.16253 2026-06-16 cs.CV cs.AI New submission

Learned Image Compression for Vision-Language-Action Models

Hyeonjun Kim, Jegwang Ryu, Sangbeom Ha, Junhyeok Lee, Jun-Hyuk Kim, Hyemin Ahn, Jaeho Lee

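Affiliations: POSTECH; Soongsil University; Chung-Ang University

AI summary: Proposes SPARC, a framework that uses adaptive bitrate allocation and a tilted rate loss to preserve VLA robot-control performance at low bandwidth, outperforming conventional codecs.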

English abstract

Vision-language-action (VLA) models increasingly rely on high-frequency multi-camera observations, making visual communication a major bottleneck for real-time robotic control in bandwidth-constrained or distributed deployment settings. Existing image and video codecs, however, are designed to preserve generic visual fidelity rather than the control performance of downstream VLA policies. In this work, we introduce SPARC (SPatially Adaptive Rate Control), a learned image compression framework tailored for VLA-driven robots. Our key observation is that the importance of visual information varies substantially across both camera views and spatial regions within an image. Based on this observation, SPARC employs a lightweight temporal mask selector that adaptively allocates bitrate over latent representations according to task relevance while leveraging temporal context. We further introduce a tilted rate loss that stabilizes training by reducing the tendency of entropy-based objectives to over-suppress rare yet task-critical visual patterns. Experiments on diverse robotic benchmarks, including RoboCasa365, VLABench, and LIBERO, show that SPARC consistently achieves stronger control performance than conventional image/video codecs and recent learned compression methods under the same bitrate budget. We additionally demonstrate real-world deployment benefits in remote-control settings, where our method substantially improves the bitrate-success tradeoff.

2606.16274 2026-06-16 cs.CV New submission

GraphWorld: Long-Horizon Planning with World Models for End-to-End Autonomous Driving

Ziying Song, Caiyan Jia, Lin Liu, Lei Yang, Shengkai Zhang, Feiyang Jia, Fengda Zhao, Peiliang Wu, Shaoqing Xu, Chen Lv, Yadan Luo

Affiliations: Beijing Key Laboratory of Traffic Data Mining and Embodied Intelligence, School of Computer Science and Technology, Beijing Jiaotong University; School of Artificial Intelligence (School of Software), Yanshan University; School of Mechanical and Aerospace Engineering, Nanyang Technological University; University of Macau; The University of Queensland

AI summary: Proposes GraphWorld, a framework that strengthens long-horizon planning through latent world modeling. It models neighboring agents with an ego-centric interaction graph and generates safe trajectories via world-state-conditioned planning, significantly reducing collision rates.

Comments 16 pages, 5 figures

English abstract

End-to-end autonomous driving (E2E-AD) has made significant progress by unifying perception, prediction, and planning within a single learning framework, achieving strong performance in short-horizon decision making. However, most existing E2E-AD methods remain confined to short-horizon planning and lack the ability to model long-term temporal dependencies, which severely limits their generalization and safety in complex and highly interactive driving scenarios. In this work, we propose GraphWorld, an E2E-AD framework that explicitly enhances long-horizon planning through latent world modeling. We introduce an Ego-Centric Interaction Graph, which adaptively models critical neighboring agents based on spatial proximity and propagates relational context to planning queries via cross-node cross-attention. We present World-State-Conditioned Planning, which learns ego-centric latent world representations by modeling interactions between the ego vehicle and surrounding agents. This latent world state captures key interaction dynamics and safety-relevant semantics, and serves as a conditioning signal to guide long-horizon, safety-aware trajectory planning. Extensive experiments on Bench2Drive, NAVSIMv1/2, and nuScenes demonstrate that GraphWorld significantly reduces collision rates and improves long-horizon planning performance, validating its effectiveness in complex driving environments.

2606.16278 2026-06-16 cs.CV cs.AI New submission

Decoupled Object-Centric Video Understanding for Generating Robot Manipulation Commands

Thanh Nguyen Canh, Thanh-Tuan Tran, Haolan Zhang, Ziyan Gao, Xiem HoangVan, Nak Young Chong

Affiliations: School of Information Science, Japan Advanced Institute of Science and Technology; University of Engineering and Technology, Vietnam National University; Department of Robotics, Hanyang University

AI summary: Proposes a framework that decouples action recognition from object selection: TSM classifies the action, an object selection algorithm identifies task-relevant objects, and a VLM is used to generate precise commands, yielding significant gains on a modified Something-Something V2 dataset.

English abstract

Translating video demonstrations into executable robot commands remains challenging because existing methods often fail to identify which objects are functionally involved in the demonstrated action. As a result, they may generate commands that are linguistically plausible but operationally ambiguous. We propose an object-centric video understanding framework that decouples action recognition from object identification to generate precise, grammar-free manipulation commands. Our approach integrates Temporal Shift Modules (TSM) for efficient spatio-temporal action classification with a novel Object Selection algorithm that identifies task-relevant objects through trajectory-based role classification, blur detection, and overlap minimization. The selected objects are then processed by Vision-Language Models (VLMs) for robust category recognition and zero-shot generalization. Evaluated on a modified Something-Something V2 dataset, our method achieves 86.79% action classification accuracy and BLEU-4 scores of 0.337 on standard objects and 0.261 on novel objects. These results improve over the strongest task-specific baseline by 80.2% and 143.9%, respectively. Larger gains are observed in METEOR and CIDEr, reaching 157.9% and 171.7% on novel objects. Across all semantic metrics, our approach consistently outperforms task-specific methods and remains competitive with, or surpasses, large general-purpose VLMs while retaining a modular, object-centric design.

2606.16474 2026-06-16 cs.CV cs.RO New submission

MVOFormer: Flow-Semantic Transformer for Robust Monocular Visual Odometry

Jituo Li, Shunwang Sun, Jialu Zhang, Xinqi Liu, Jinyao Hu, Zhicheng Lu, Sajad Saeedi, Guodong Lu

Affiliations: State Key Laboratory of Fluid Power and Mechatronic Systems, Zhejiang University; Zhejiang Key Laboratory of Industrial Big Data and Robot Intelligent Systems; School of Mechanical Engineering, Zhejiang University; Robotics Institute, Zhejiang University; School of Artificial Intelligence and Robotics, Hunan University; Rural Health Research Institute, Charles Sturt University; University College London

AI summary: Proposes MVOFormer, a Transformer framework that pairs a flow-semantic dual-branch encoder with an iterative multimodal decoder, fusing dense geometric motion with semantic priors for coarse-to-fine pose refinement and significantly outperforming existing methods in zero-shot generalization.

Comments 8 pages, 6 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

English abstract

Monocular visual odometry (MVO) is foundational to autonomous navigation and robotic localization. However, existing learning-based MVO approaches often struggle with either a lack of interpretable, complementary features or overly complex multi-stage architectures. These limitations inherently restrict their robustness and cross-domain generalization. In this work, we propose MVOFormer, a novel transformer framework for robust monocular visual odometry. Our architecture features a Flow-Semantic Dual Branch Encoder that synergizes dense geometric motion cues with object-centric semantic priors, explicitly distinguishing static structures from dynamic distractors. These representations are then fused by an Iterative Multimodal Decoder, enabling coarse-to-fine pose refinement while dynamically suppressing attention on unreliable regions. Extensive evaluations demonstrate that, without any target-domain fine-tuning, MVOFormer achieves superior zero-shot generalization and robustness, significantly outperforming prior learning-based frame-to-frame methods across diverse benchmarks including TartanAir, KITTI, TUM-RGBD, and ETH3D-SLAM.

2606.16569 2026-06-16 cs.CV cs.RO New submission

PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models

Zhiang Chen, Nahyuk Lee, Boyang Sun, Taein Kwon, Marc Pollefeys, Zuria Bauer, Sunghwan Hong

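Affiliations: ETH Zurich; VGG, University of Oxford; ETH AI Center

AI summary: Proposes PROSE, which uses pretrained vision-language models to lift RGB sequences into object-level 3D scene graphs and matches instances using an object-height prior and same/different queries. It performs egocentric scene registration without training or a depth sensor and surpasses geometric and scene-graph baselines on the Aria benchmarks.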

Comments Project page: https://rckola.github.io/prose/

English abstract

Registering two captures of the same indoor space taken at different times underpins persistent spatial memory for robots and AR systems, yet the realistic version of this task is egocentric and its most scalable form is RGB-only. Head-mounted cameras yield blurry, fast-moving, partially overlapping views from which dense geometry is hard to recover. Classical registration leans on exactly the clean point clouds this setting lacks, while learned scene-graph methods require a pre-built or annotated graph and a trained matcher that we find brittle under egocentric data. We take a different route, using a pretrained vision-language model as the source of both scene understanding and cross-scan matching. Our method, PROSE (Prompted Scene rEgistration), lifts each RGB sequence into an object-level 3D scene graph using off-the-shelf foundation models for geometry, segmentation, and language, then prompts the same VLM to match object instances across the two RGB sequences. To make this matching tractable and reliable, we leverage object heights as a prior and verify each proposed match with a paired same/different query, then solve for the rigid transform by hypothesizing a candidate per matched object and selecting the one with the strongest geometric consensus. PROSE adds no learned parameters and requires no depth sensor, training, or annotated graph. On the egocentric Aria Digital Twin and Aria Everyday Activities benchmarks, it outperforms both geometric and learned scene-graph baselines in registration accuracy, on ground-truth and RGB-reconstructed point clouds alike, and the scene graph it produces transfers directly to downstream tasks.

2606.16898 2026-06-16 cs.CV cs.AI New submission

Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization

Dongbin Na, Chanwoo Kim, Giyun Choi, Dooyoung Hong

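Affiliations: RGA Inc.

AI summary: Proposes Semantic Flip, a framework that trains a lightweight rejection module on synthesized auxiliary OOD samples so that a frozen vision-language model can refuse robustly without external OOD annotations, outperforming strong prompting baselines on embodied question answering and spatial localization benchmarks.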

Comments 18 pages, 3 figures. Code and data: https://github.com/ndb796/SemanticFlip ; project page: https://ndb796.github.io/SemanticFlip

English abstract

Detecting unanswerable user queries remains essential for the reliable deployment of real-world embodied agents. However, modern vision-language models (VLMs) often generate overly confident answers even when the available visual memory cannot support the query. Such overconfidence poses various task-dependent risks. In Embodied Question Answering, the agent may give the user misleading information; in spatial reasoning for navigation, it may select an arbitrary coordinate and physically guide the user there. Despite these high stakes, only a few prior studies directly address when and how an embodied VLM should respond with "I do not know." This work proposes Semantic Flip, a simple yet effective framework that synthesizes auxiliary out-of-distribution (OOD) samples for embodied refusal without requiring external OOD annotations. The key idea is to independently transform the query and video memory to construct auxiliary OOD pairs that lack sufficient visual grounding. These synthesized pairs enable training a lightweight rejection module on top of a frozen pretrained VLM. The module attaches to any existing VLM-based pipeline without retraining the underlying model. Across two complementary benchmarks, Semantic Flip consistently outperforms strong prompting baselines. This work also introduces SpaceReject, a new refusal benchmark for spatial localization with deliberately unanswerable queries over long video memory, where Semantic Flip achieves an F1 score of 0.9559. The source code and datasets are publicly available at https://github.com/ndb796/SemanticFlip.
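
For readers unfamiliar with the metric reported on SpaceReject, the sketch below computes the standard F1 score from confusion counts. Treating "refuse" as the positive class is an assumption, since the abstract does not say which class is scored, and the counts are made-up illustration values, not results from the paper.

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; returns 0.0 when undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 90 correct refusals, 5 unnecessary refusals,
# and 10 unanswerable queries that were answered anyway.
example = f1_score(tp=90, fp=5, fn=10)
print(round(example, 4))
```

F1 penalizes both unnecessary refusals (false positives) and missed refusals (false negatives), so a high score requires the rejection module to be selective in both directions.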

2606.16960 2026-06-16 cs.CV New submission

SurroundNEXO: Ego-Centric Metric Bridging for Spatially Consistent Geometry in Autonomous Driving

Shuai Yuan, Runxi Tang, Yuzhou Ji, Fudong Ge, Hanshi Wang, Yifei Wang, Xianming Zeng, Jianyun Xu, Xingliang Liu, Yanfeng Wang, Zhipeng Zhang

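Affiliations: School of Artificial Intelligence, Shanghai Jiao Tong University; Hello Inc.

AI summary: Proposes SurroundNEXO, a framework that uses ego-centric geometry (Ego-Ray Positional Encoding) and sparse LiDAR metric anchors to address metric depth prediction and spatial consistency under low multi-camera overlap, with significant gains on several benchmarks.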

English abstract

Modern autonomous driving depends on accurate metric 3D understanding for perception, reconstruction, and planning, which in turn requires reliable multi-camera depth prediction. However, the outward-facing nature of vehicle-mounted surround-view camera rigs inherently limits visual overlap across views, challenging the correspondence-based assumptions that underpin conventional multi-view geometry. To bridge this gap, we present SurroundNEXO, named after the Spanish word nexo for a geometric link, a low-overlap multi-camera metric depth framework that grounds cross-view reasoning in ego-centric geometry rather than dense visual correspondences. Instead of directly enforcing early global fusion, SurroundNEXO first assigns image tokens globally comparable ego-frame viewing directions through Ego-Ray Positional Encoding, then uses sparse LiDAR measurements as metric anchors to propagate absolute scale cues, and finally expands feature interaction progressively from view-local modeling to decomposed spatio-temporal reasoning and global integration. This design enables metric-scale depth prediction with improved spatial consistency across weakly overlapping cameras. Across low-overlap autonomous driving benchmarks, including NuScenes, Waymo and DDAD, SurroundNEXO reduces single-view error by 33.2%, improves cross-view consistency by 10.5%, and enhances metric reconstruction quality by 25.6% compared with SOTA methods. It further remains robust under extremely sparse depth prompts and exhibits strong zero-shot generalization to unseen camera layouts.
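
The idea of giving image tokens "globally comparable ego-frame viewing directions" rests on a standard computation: back-project a pixel through the camera intrinsics, then rotate the ray into the vehicle frame. The sketch below shows only that geometric step under a pinhole-camera assumption. How the paper turns such rays into a positional encoding is not described in the abstract, and the intrinsics, the rotation matrix, and the function name are hypothetical.

```python
import math
from typing import List, Sequence


def pixel_to_ego_ray(
    u: float, v: float,
    fx: float, fy: float, cx: float, cy: float,
    cam_to_ego: Sequence[Sequence[float]],
) -> List[float]:
    """Back-project pixel (u, v) with a pinhole model and rotate the ray into the ego frame."""
    ray_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
    ray_ego = [sum(cam_to_ego[r][c] * ray_cam[c] for c in range(3)) for r in range(3)]
    norm = math.sqrt(sum(x * x for x in ray_ego))
    return [x / norm for x in ray_ego]


# Assumed conventions: camera axes are x right, y down, z forward;
# ego axes are x forward, y left, z up. This rotation is for a front camera.
FRONT_CAM_TO_EGO = [
    [0.0, 0.0, 1.0],
    [-1.0, 0.0, 0.0],
    [0.0, -1.0, 0.0],
]

center_ray = pixel_to_ego_ray(640, 360, 1000, 1000, 640, 360, FRONT_CAM_TO_EGO)
right_ray = pixel_to_ego_ray(1640, 360, 1000, 1000, 640, 360, FRONT_CAM_TO_EGO)
print(center_ray, right_ray)
```

Because every camera's rays are expressed in the same ego frame, directions from weakly overlapping cameras become directly comparable even when the images share few pixels, which is the property the abstract relies on.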

2606.14879 2026-06-16 cs.RO cs.CV cs.LG Cross-list

VANDERER: Map-Free Exploration using Future-Aware and Visual-Curiosity-Guided Diffusion Policy

Venkata Naren Devarakonda, Raktim Gautam Goswami, Prashanth Krishnamurthy, Farshad Khorrami

Affiliations: Control/Robotics Research Laboratory (CRRL), Department of Electrical and Computer Engineering, NYU Tandon School of Engineering; New York University Abu Dhabi (NYUAD) Center for Artificial Intelligence and Robotics (CAIR)

AI summary: Proposes VANDERER, a framework that uses a visual curiosity module to guide a pretrained diffusion policy, achieving efficient map-free exploration from monocular images alone and exploring 13.4% more area on average than NoMaD across diverse simulated environments.

English abstract

Mobile agents require efficient exploration strategies to map unseen environments and autonomously plan tasks. Traditional methods rely on generating occupancy maps and optimizing the sequence in which unexplored regions are visited. However, in sensor-constrained settings, such as those limited to monocular cameras, generating accurate occupancy maps is challenging. To address this, we propose VANDERER, an exploration framework that leverages a Visual Curiosity Module (VCM) to guide pre-trained diffusion policies using only monocular image data. This curiosity module predicts the outcomes of proposed actions via a navigation world model and evaluates them through a curiosity cost. The cost then guides the diffusion process toward generating actions that maximize exploration. Evaluated across diverse simulated environments, VANDERER consistently outperforms established baselines, exploring an average of 13.4% more area than NoMaD. Our results reveal a direct correlation between visual and geometric curiosity in outdoor environments, demonstrating that VANDERER can effectively leverage this relationship for efficient exploration using sensor-constrained agents.

2606.15133 2026-06-16 cs.RO cs.CV Cross-list

DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects

Tianshan Zhang, Yijia Duan, Yanjun Li, Zeyu Zhang, Hao Tang

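Affiliations: School of Computer Science, Peking University

AI summary: Proposes DragMesh-2, a contact-driven framework for dexterous hand interaction with articulated objects, combined with PICA, a physically informed contact-aware training mechanism, to improve robustness to varying contact loads without tactile feedback.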

Comments Code: https://github.com/AIGeeksGroup/DragMesh-2. Website: https://aigeeksgroup.github.io/DragMesh-2

英文摘要

Dexterous interaction with articulated objects is important for household, assistive, and humanoid manipulation, where multi-finger hands can provide compliant contact patterns beyond parallel-jaw grasping. However, articulated-object manipulation differs from static-object manipulation: the target part cannot be directly actuated, and its motion must emerge through sustained physical hand-handle contact. This makes the transition from object-centric articulated generation to hand-driven dexterous hand-object interaction non-trivial, since geometric trajectory replay or open-loop execution does not model the contact dynamics required to move the articulated part. Moreover, policies trained only for task completion under fixed dynamics can overfit nominal contact loads, especially without tactile or force feedback, and may degrade when the contact load changes. To address these challenges, we present DragMesh-2, a contact-driven framework for dexterous interaction with articulated objects that extends articulated interaction from object-centric generation to hand-driven dexterous hand-object interaction, where articulated motion must arise through physical contact. We further propose PICA, a physically informed contact-aware training mechanism that injects physical signals into policy learning without tactile or force feedback, improving robustness and task success under changing contact loads. Finally, we conduct systematic evaluation across multiple damping conditions and articulated-object categories to study robustness under contact-load variation, and provide a pure-geometry dexterous interaction resource to support future loco-manipulation and humanoid hand-object interaction research. Across seven GAPartNet objects, DragMesh-2 achieves stronger robustness under contact-load variation than the compared methods while maintaining high task success across damping conditions.

2606.15594 2026-06-16 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 交叉投稿

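Vehicle-Centric Route Generation via Lane-Based Map Localization: An Effective and Low-Cost Approach

基于车道的地图定位实现以车辆为中心的路线生成：一种有效且低成本的方法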

Hong-Shiang Lin, Jung-Hsin Chen, Yu-Luen Tzeng, Wei-Hao Chen, Yi-Chen Lee, Li-Jhe Chen, Peng-Yuan Chen

发表机构 * National Taipei University（台北国立大学）

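AI summary: Proposes the OLRA framework, which matches the navigation route against camera-detected lane lines to generate driver-view routes through low-cost map localization, improving localization accuracy and route consistency and outperforming OpenPilot on the nuScenes dataset.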

Comments 14 pages, 18 figures. Under Review

2606.16436 2026-06-16 cs.RO cs.CV 交叉投稿

V2P-Manip: Learning Dexterous Manipulation from Monocular Human Videos

V2P-Manip：从单目人类视频学习灵巧操作

Kaihan Chen, Yanming Shao, Haifeng Ji, Xiaokang Yang, Yao Mu

发表机构 * Zhejiang University（浙江大学）； Shanghai Jiao Tong University（上海交通大学）； Shanghai AI Laboratory（上海人工智能实验室）

AI总结提出V2P-Manip框架，从单目人类演示视频中提取具有视觉保真度和物理合理性的轨迹，通过两阶段精炼实现空间对齐与物理一致性，在TACO和OakInk基准上显著优于先前方法。

AI中文摘要

实现自主机器人灵巧操作需要大规模精确、类人的动作序列。作为昂贵遥操作数据的可扩展补充，从单目视频中提取兼具视觉保真度和物理合理性的轨迹是具身智能的一个有前景的前沿方向。为此，我们引入V2P-Manip，一个高效的框架，旨在直接从人类演示视频中学习灵巧操作策略。我们建立了一个高效、集成的流水线，涵盖3D资产获取、轨迹估计和灵巧策略学习。为了弥合视觉感知与物理约束之间的差距，我们引入了一个两阶段精炼过程，以强制执行空间对齐和物理一致性。在TACO和OakInk基准上的评估表明，我们的方法在姿态精度、对非结构化环境的适应性以及训练效率方面显著优于先前方法。最终，实验结果证实了在多个合成操作任务上平均成功率超过75%，并验证了提取的操作先验在不同灵巧手形态上的适应性。

英文摘要

Achieving autonomous robotic dexterous manipulation requires precise, human-like action sequences at scale. As a scalable supplement to costly teleoperation data, extracting trajectories with both visual fidelity and physical plausibility from monocular videos represents a promising frontier in embodied AI. To this end, we introduce V2P-Manip, an efficient framework designed to learn dexterous manipulation policies directly from human demonstration videos. We establish an efficient, integrated pipeline encompassing 3D asset acquisition, trajectory estimation, and dexterous policy learning. To bridge the gap between visual perception and physical constraints, we introduce a two-stage refinement process to enforce spatial alignment and physical consistency. Evaluations on the TACO and OakInk benchmarks demonstrate that our approach significantly outperforms previous methods in pose accuracy, adaptability to unstructured environments, and training efficiency. Ultimately, experimental results confirm an average success rate of over 75% across multiple synthetic manipulation tasks and validate the adaptability of the extracted manipulation priors across diverse dexterous hand embodiments.

2606.16690 2026-06-16 cs.RO cs.AI cs.CV 交叉投稿

FDIO: Frequency Decomposed Inertial Odometry

FDIO：频率分解惯性里程计

Shanshan Zhang, Liqin Wu, Wenying Cao, Lingxiang Zheng, Yu Yang

发表机构 * Department of Information and Communication Engineering, National and Local Joint Engineering Research Center of Navigation and Location Based Services, Xiamen University（信息与通信工程系、导航与位置服务国家与地方联合工程研究中心、厦门大学）； Department of Electronic Science, State Key Laboratory of Physical Chemistry of Solid Surfaces, Xiamen University（电子科学系、固体表面物理化学国家重点实验室、厦门大学）

AI总结针对双设备采集场景中IMU信号耦合问题，提出频率分解惯性里程计（FDIO），通过拉普拉斯金字塔分解信号、Mamba模块建模低频长程运动和多尺度卷积提取高频局部特征，在五个数据集上平均绝对轨迹误差降低33.3%。

AI中文摘要

行人惯性里程计（PIO）仅利用惯性测量单元（IMU）采集的加速度和角速度测量值估计自主行人运动，使其在消费级定位应用中具有极高价值。然而，在双设备采集设置下，自由携带的移动设备收集的IMU信号本质上是复合信号，其中人体躯干的全局运动与局部肢体运动引起的扰动耦合在一起。这种耦合使得精确的人体运动建模更具挑战性。为解决这一问题，本文提出了频率分解惯性里程计（FDIO）。该方法首先使用拉普拉斯金字塔将输入IMU信号分解为低频和高频分量。然后采用Mamba模块从低频分量中建模长程运动信息，并使用多尺度卷积模块从高频分量中提取细粒度局部动态特征。在五个公开PIO数据集上的实验表明，FDIO的平均绝对轨迹误差为3.221米，平均相对轨迹误差为2.550米，与RoNIN ResNet基线相比，误差分别降低了33.3%和16.7%。这些结果验证了所提出的频率分解策略的有效性。据我们所知，这项工作是将Mamba和频率分解架构引入惯性里程计的早期尝试之一。

英文摘要

Pedestrian inertial odometry (PIO) estimates autonomous pedestrian motion using only acceleration and angular velocity measurements collected by an inertial measurement unit (IMU), making it highly valuable for consumer-level localization applications. However, under a dual-device acquisition setting, IMU signals collected by a freely carried mobile device are inherently composite signals in which the global motion of the human torso is coupled with perturbations induced by local limb motion. This coupling makes accurate human motion modeling more challenging. To address this issue, this paper proposes frequency decomposed inertial odometry (FDIO). The proposed method first decomposes input IMU signals into low-frequency and high-frequency components using a Laplacian pyramid. It then adopts a Mamba module to model long-range motion information from the low-frequency component and uses a multi-scale convolution module to extract fine-grained local dynamic features from the high-frequency component. Experiments on five public PIO datasets show that FDIO achieves an average absolute trajectory error of 3.221 m and an average relative trajectory error of 2.550 m, reducing the errors by 33.3% and 16.7%, respectively, compared with the RoNIN ResNet baseline. These results validate the effectiveness of the proposed frequency decomposition strategy. To the best of our knowledge, this work is among the first efforts to introduce Mamba and a frequency decomposition architecture into inertial odometry.
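
As a rough illustration of the decomposition step described in the abstract above, the sketch below splits one IMU channel into a low-frequency and a high-frequency component with a single-level Laplacian pyramid. The binomial kernel, the single pyramid level, the linear upsampling, and the function names `smooth` and `laplacian_split` are all assumptions made for illustration; the abstract does not state the configuration FDIO actually uses.

```python
import numpy as np

def smooth(x, kernel=(1, 4, 6, 4, 1)):
    """Low-pass filter a 1D signal with a normalized binomial kernel (assumed choice)."""
    k = np.asarray(kernel, dtype=float)
    k /= k.sum()
    pad = len(k) // 2
    xp = np.pad(x, pad, mode="edge")      # replicate edges so the output keeps its length
    return np.convolve(xp, k, mode="valid")

def laplacian_split(x):
    """One-level Laplacian pyramid: returns (low, high) such that low + high == x."""
    x = np.asarray(x, dtype=float)
    coarse = smooth(x)[::2]               # blur, then downsample by a factor of 2
    pos = np.arange(len(x)) / 2.0         # positions of the original samples on the coarse grid
    low = np.interp(pos, np.arange(len(coarse)), coarse)  # linear upsampling to full length
    high = x - low                        # the residual carries the fine, fast-changing detail
    return low, high
```

In a pipeline of the kind the abstract describes, `low` would feed the long-range sequence model and `high` the local convolutional branch; because `high` is defined as the residual, no information is lost by the split.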

2602.07343 2026-06-16 cs.CV cs.AI cs.LG cs.RO 版本更新

Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation

通过文字看道路：一种语言引导的RGB-T驾驶场景分割框架

Ruturaj Reddy, Hrishav Bakul Barua, Junn Yong Loo, Thanh Thi Nguyen, Ganesh Krishnasamy

发表机构 * National University of Singapore（新加坡国立大学）； University of Technology Sydney（悉尼科技大学）

AI总结提出CLARITY框架，利用视觉语言模型先验动态调整RGB-T融合策略，并引入暗目标语义保留和层次化解码器，在MFNet数据集上达到62.3% mIoU和77.5% mAcc的新SOTA。

AI中文摘要

在恶劣光照、照明和阴影条件下，道路场景的鲁棒语义分割仍然是自动驾驶应用的核心挑战。RGB-热融合是一种标准方法，但现有方法在所有条件下统一应用静态融合策略，导致模态特定噪声在网络中传播。因此，我们提出CLARITY，它根据检测到的场景条件动态调整融合策略。在视觉语言模型（VLM）先验的引导下，网络学习根据光照状态调节每种模态的贡献，同时利用对象嵌入进行分割，而不是应用固定的融合策略。我们进一步引入了两种机制：一种保留有效的暗对象语义，这些语义在先前的噪声抑制方法中被错误丢弃；另一种是层次化解码器，它在不同尺度上强制结构一致性，以锐化薄对象的边界。在MFNet数据集上的实验表明，CLARITY建立了新的最先进水平（SOTA），实现了62.3%的mIoU和77.5%的mAcc。

英文摘要

Robust semantic segmentation of road scenes under adverse illumination, lighting, and shadow conditions remains a core challenge for autonomous driving applications. RGB-Thermal fusion is a standard approach, yet existing methods apply static fusion strategies uniformly across all conditions, allowing modality-specific noise to propagate throughout the network. Hence, we propose CLARITY, which dynamically adapts its fusion strategy to the detected scene condition. Guided by vision-language model (VLM) priors, the network learns to modulate each modality's contribution based on the illumination state while leveraging object embeddings for segmentation, rather than applying a fixed fusion policy. We further introduce two mechanisms: one that preserves valid dark-object semantics that prior noise-suppression methods incorrectly discard, and a hierarchical decoder that enforces structural consistency across scales to sharpen boundaries on thin objects. Experiments on the MFNet dataset demonstrate that CLARITY establishes a new state of the art (SOTA), achieving 62.3% mIoU and 77.5% mAcc.
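
For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute mIoU and mAcc from a confusion matrix. The helper names are hypothetical, and the convention of averaging only over classes that occur in the ground truth is an assumption; the abstract does not say which evaluation protocol was used on MFNet.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the ground-truth class, columns index the predicted class."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    idx = y_true * num_classes + y_pred
    return np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)

def miou_macc(cm):
    """Mean IoU and mean per-class accuracy over classes present in the ground truth."""
    cm = cm.astype(float)
    tp = np.diag(cm)                 # correctly labelled pixels per class
    gt = cm.sum(axis=1)              # ground-truth pixels per class
    pred = cm.sum(axis=0)            # predicted pixels per class
    present = gt > 0                 # assumed convention: skip classes absent from the labels
    iou = tp[present] / (gt + pred - tp)[present]
    acc = tp[present] / gt[present]
    return iou.mean(), acc.mean()
```

mAcc only penalizes missed pixels of a class, while mIoU also penalizes false positives, which is why the two numbers can differ substantially on thin or rare classes.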

2603.07920 2026-06-16 cs.CV 版本更新

RLPR: Radar-to-LiDAR Place Recognition via Two-Stage Asymmetric Cross-Modal Alignment for Autonomous Driving

RLPR：面向自动驾驶的两阶段非对称跨模态对齐雷达-激光雷达地点识别

Zhangshuo Qi, Jingyi Xu, Luqi Cheng, Shichen Wen, Guangming Xiong

发表机构 * Beijing Institute of Technology（北京理工大学）； Shanghai Jiaotong University（上海交通大学）

AI总结提出RLPR框架，通过双流网络提取结构特征，并利用两阶段非对称跨模态对齐策略，实现雷达与激光雷达之间的鲁棒地点识别，在四个数据集上达到最优性能。

Comments Accepted by IEEE Robotics and Automation Letters (RA-L) 2026

AI中文摘要

全天候自主性对于自动驾驶至关重要，这需要在不同场景下实现可靠的定位。虽然激光雷达地点识别被广泛部署用于此任务，但其性能在恶劣天气下会下降。相反，基于雷达的方法虽然具有天气鲁棒性，但受限于雷达地图的普遍不可用性。为了弥合这一差距，雷达到激光雷达的地点识别（将雷达扫描定位到现有激光雷达地图中）引起了越来越多的兴趣。然而，提取模态间共享的判别性和可泛化特征仍然具有挑战性，加之缺乏大规模配对训练数据以及不同雷达类型之间的信号异质性。在这项工作中，我们提出了RLPR，一个鲁棒的雷达到激光雷达地点识别框架，兼容单芯片、扫描和4D雷达。我们首先设计了一个双流网络来提取结构特征，这些特征抽象掉了传感器特定的信号属性（例如多普勒或RCS）。随后，基于我们对雷达和激光雷达之间任务特定非对称性的观察，我们引入了一种两阶段非对称跨模态对齐（TACMA）策略，该策略利用预训练的雷达分支作为判别性锚点来指导对齐过程。在四个数据集上的实验表明，RLPR实现了最先进的识别精度，并具有强大的零样本泛化能力。

英文摘要

All-weather autonomy is critical for autonomous driving, which necessitates reliable localization across diverse scenarios. While LiDAR place recognition is widely deployed for this task, its performance degrades in adverse weather. Conversely, radar-based methods, though weather-resilient, are hindered by the general unavailability of radar maps. To bridge this gap, radar-to-LiDAR place recognition, which localizes radar scans within existing LiDAR maps, has garnered increasing interest. However, extracting discriminative and generalizable features shared between modalities remains challenging, compounded by the scarcity of large-scale paired training data and the signal heterogeneity across radar types. In this work, we propose RLPR, a robust radar-to-LiDAR place recognition framework compatible with single-chip, scanning, and 4D radars. We first design a dual-stream network to extract structural features that abstract away from sensor-specific signal properties (e.g., Doppler or RCS). Subsequently, motivated by our task-specific asymmetry observation between radar and LiDAR, we introduce a two-stage asymmetric cross-modal alignment (TACMA) strategy, which leverages the pre-trained radar branch as a discriminative anchor to guide the alignment process. Experiments on four datasets demonstrate that RLPR achieves state-of-the-art recognition accuracy with strong zero-shot generalization capabilities.

2606.08525 2026-06-16 cs.CV 版本更新

DriveReward: A Comprehensive Dataset and Generative Vision-Language Reward Model for Autonomous Driving

DriveReward：面向自动驾驶的综合数据集与生成式视觉语言奖励模型

Qimao Chen, Fang Li, Yuechen Luo, Zehan Zhang, Haiyang Sun, Fangzhen Li, Bing Wang, Guang Chen, Yang Ji, Jiong Deng, Hongwei Xie, Hangjun Ye, Long Chen, Yi Zhang

发表机构 * Tsinghua University（清华大学）； Xiaomi EV（小米汽车）

AI总结提出DriveReward数据集和专用视觉语言奖励模型，通过反事实标注和时序视觉引导，解决自动驾驶中奖励获取的泛化问题，在强化学习和轨迹选择中取得与基于规则方法相当的性能。

AI中文摘要

奖励模型在强化学习和自动驾驶的多模态轨迹选择中起着关键作用。然而，获取此类奖励通常依赖于手工设计的基于规则的目标或感知真值，这阻碍了数据扩展的泛化能力。虽然视觉语言模型在其他领域已被证明可作为奖励模型，但其在驾驶任务中的有效性尚未得到充分探索。在这项工作中，我们通过以下方式弥合这一差距：（1）引入DriveReward，一个通过时间接地视觉引导严格标注的推理轨迹评估数据集，并增加了反事实驾驶行为；（2）以及一个专门的视觉语言奖励模型。为了解决传统数据集中失败案例稀缺的问题，我们提出了一种反事实数据标注方案，构建包含多种驾驶风格和错误行为的案例。在我们提出的基准上的评估显示，即使是领先的开源和专有视觉语言模型也无法在所有任务中表现出色，突显出现有模型仍有很大的改进空间。基于这些发现，我们随后定制了一个专门的1B奖励模型，在特定任务的奖励对齐上优于更大的视觉语言模型。最后，我们通过将奖励模型集成到强化学习微调和多模态轨迹评分中，在多个基线上验证了其有效性，在开环和闭环评估中均达到了与基于规则的奖励计算相当的性能。

英文摘要

Reward models play a pivotal role in reinforcement learning (RL) and multi-modal trajectory selection for autonomous driving. However, acquiring such rewards typically relies on hand-crafted rule-based objectives or perception ground truth, which hinders generalization when scaling data. While Vision-Language Models (VLMs) have demonstrated feasibility as reward models in other domains, their effectiveness in driving tasks remains underexplored. In this work, we bridge this gap by (1) introducing DriveReward, a reasoning trajectory evaluation dataset rigorously labeled via temporally grounded visual guidance and augmented with counterfactual driving behaviors, and (2) developing a specialized Vision-Language Reward Model. To address the scarcity of failure cases in conventional datasets, we propose a counterfactual data annotation scheme to construct cases encompassing diverse driving styles and erroneous behaviors. Evaluations on our proposed benchmark reveal that even leading open-source and proprietary VLMs fail to excel across all tasks, highlighting significant room for improvement in existing models. Building on these findings, we subsequently tailor a specialized 1B reward model that outperforms larger VLMs on task-specific reward alignment. Finally, we validate our reward model's effectiveness by integrating it into RL finetuning and multi-modal trajectory scoring across multiple baselines, achieving performance comparable to rule-based reward calculations in both open-loop and closed-loop evaluation.

2606.10862 2026-06-16 cs.CV cs.AI 版本更新

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination

LIBERO-Occ：通过视角想象评估和改进场景诱导遮挡下的视觉-语言-动作模型

Taishan Li, Jiwen Zhang, Siyuan Wang, Xuanjing Huang, Zhongyu Wei

发表机构 * Fudan University（复旦大学）； Shanghai Innovation Institute（上海创新研究院）； Chinese University of Hong Kong（香港中文大学）

AI总结针对VLA模型在场景遮挡下性能下降的问题，提出LIBERO-Occ基准和视角想象方法，通过生成互补视图提升鲁棒性。

Comments 14 pages, 7 figures

AI中文摘要

视觉-语言-动作（VLA）模型在标准操作基准上取得了强劲的性能，但大多数评估假设任务相关物体完全可见。这一假设在现实场景中经常不成立，因为遮挡使得操作部分可观察。本文研究了场景诱导遮挡作为VLA模型的一个基本挑战，并引入了LIBERO-Occ，一个面向遮挡的LIBERO扩展。实验表明，最先进的VLA在遮挡下性能显著下降。为解决这一问题，我们提出了视角想象（VIM），该方法从遮挡的主观测中生成互补视图，并基于观察和想象证据共同进行动作预测。VIM在任务套件、遮挡类型和严重程度上提高了鲁棒性，且无需在部署时增加额外摄像头，表明视角想象是部分可观察操作中感知完成的一种有前景的机制。我们的基准和相应代码可在以下网址获取：this https URL。

英文摘要

Vision-Language-Action (VLA) models achieve strong performance on standard manipulation benchmarks, but most evaluations assume that task-relevant objects are fully visible. This assumption often fails in realistic settings, where occlusion makes manipulation partially observable. In this paper, we study scene-induced occlusion as a fundamental challenge for VLA models and introduce LIBERO-Occ, an occlusion-oriented extension of LIBERO. Experiments show that state-of-the-art VLAs suffer substantial performance degradation under occlusion. To address this issue, we propose Viewpoint Imagination (VIM), which generates a complementary view from an occluded primary observation and conditions action prediction on both observed and imagined evidence. VIM improves robustness across task suites, occlusion types, and severity levels without requiring additional cameras at deployment time, suggesting that viewpoint imagination is a promising mechanism for perception completion in partially observable manipulation. Our benchmark and corresponding code are available at https://github.com/litsh/Libero-Occ.

2606.13674 2026-06-16 cs.CV 版本更新

RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

RepWAM：基于表示视觉-动作分词器的世界动作建模

Junke Wang, Qihang Zhang, Shuai Yang, Yiming Luo, Yujun Shen, Zuxuan Wu, Yu-Gang Jiang, Yinghao Xu

发表机构 * Institute of Trustworthy Embodied AI, Fudan University（复旦大学可信具身人工智能研究所）； Robbyant, Ant Group（蚂蚁集团 Robbyant）； Hongkong University of Science and Technology（香港科技大学）

AI总结提出RepWAM，一种基于表示视觉-动作分词器的世界动作模型，通过联合建模未来视觉状态和潜在动作，在真实和仿真机器人操作任务中取得优异性能。

AI中文摘要

本文提出RepWAM，一种基于表示视觉-动作分词器的表示中心世界动作模型（WAM）。现有的WAM通常从预训练的视频生成模型中继承面向重建的视频分词器。尽管这些分词器保留了视觉保真度，但仅靠像素重建对学习连接未来预测与机器人控制的指令跟随动态提供的指导有限。为解决此问题，我们探索了一种语义视觉-动作潜在空间用于表示中心的全局动作建模。具体来说，我们训练了一个表示视觉-动作分词器，将视觉输入映射为对齐的视觉和潜在动作标记。然后，我们预训练WAM以在语言指令下联合建模未来视觉状态和连接它们的潜在动作，随后适应真实机器人轨迹以实现闭环操作。在真实世界操作任务和仿真基准上的实验表明，RepWAM在多种操作设置中展现出强劲性能，而消融实验凸显了语义视觉-动作分词相对于面向重建替代方案的价值。这些结果确立了表示视觉-动作分词作为世界动作模型的有前途的基础，并朝着通用机器人策略迈出了一步。代码和权重将在以下网址提供：this https URL。

英文摘要

This work presents RepWAM, a representation-centric world action model (WAM) built on representation visual-action tokenizers. Existing WAMs typically inherit reconstruction-oriented video tokenizers from pretrained video generation models. Although these tokenizers preserve visual fidelity, pixel reconstruction alone provides limited guidance for learning instruction-following dynamics that connect future prediction with robot control. To address this, we explore a semantic visual-action latent space for representation-centric world action modeling. Specifically, we train a representation visual-action tokenizer that maps visual inputs into aligned visual and latent action tokens. We then pretrain our WAM to jointly model future visual states and the latent actions that connect them under language instructions, followed by adaptation to real robot trajectories for closed-loop manipulation. Experiments on real-world manipulation tasks and simulation benchmarks show that RepWAM delivers strong performance across diverse manipulation settings, while ablations highlight the value of semantic visual-action tokenization over reconstruction-oriented alternatives. These results establish representation visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. Code and weights will be available at https://github.com/wdrink/RepWAM.

2509.18428 2026-06-16 cs.RO cs.CV 版本更新

MapDream: Task-Driven Map Learning for Vision-Language Navigation

MapDream: 面向视觉-语言导航的任务驱动地图学习

Guoxin Lian, Shuo Wang, Yucheng Wang, Yongcai Wang, Maiyue Chen, Kaihui Wang, Bo Zhang, Zhizhong Su, Deying Li, Zhaoxin Fan

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出MapDream框架，通过自回归鸟瞰图生成联合学习地图与动作预测，在R2R-CE和RxR-CE上达到单目最优性能。

AI中文摘要

视觉-语言导航（VLN）要求智能体在部分可观测的3D环境中遵循自然语言指令，这促使地图表示能够聚合超出局部感知的空间上下文。然而，现有大多数方法依赖于独立于导航策略构建的手工地图。我们认为，地图应该是由导航目标直接塑造的学习表示，而非详尽的重建。基于这一见解，我们提出MapDream，一种地图在环框架，将地图构建表述为自回归鸟瞰图（BEV）图像合成。该框架联合学习地图生成和动作预测，将环境上下文蒸馏为紧凑的三通道BEV地图，仅保留导航关键的可通行性。监督预训练引导了可靠的地图到控制接口，而自回归设计通过强化微调实现端到端联合优化。在R2R-CE和RxR-CE上的实验取得了最先进的单目性能，验证了任务驱动的生成式地图学习。

英文摘要

Vision-Language Navigation (VLN) requires agents to follow natural language instructions in partially observed 3D environments, motivating map representations that aggregate spatial context beyond local perception. However, most existing approaches rely on hand-crafted maps constructed independently of the navigation policy. We argue that maps should instead be learned representations shaped directly by navigation objectives rather than exhaustive reconstructions. Based on this insight, we propose MapDream, a map-in-the-loop framework that formulates map construction as autoregressive bird's-eye-view (BEV) image synthesis. The framework jointly learns map generation and action prediction, distilling environmental context into a compact three-channel BEV map that preserves only navigation-critical affordances. Supervised pre-training bootstraps a reliable mapping-to-control interface, while the autoregressive design enables end-to-end joint optimization through reinforcement fine-tuning. Experiments on R2R-CE and RxR-CE achieve state-of-the-art monocular performance, validating task-driven generative map learning.

2602.13197 2026-06-16 cs.RO cs.CV cs.LG 版本更新

Imitating What Works: Simulation-Filtered Modular Policy Learning from Human Videos

模仿有效的方法：基于仿真过滤的人类视频模块化策略学习

Albert J. Zhai, Kuo-Hao Zeng, Jiasen Lu, Ali Farhadi, Shenlong Wang, Wei-Chiu Ma

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Allen Institute for AI（Allen人工智能研究所）； University of Washington（华盛顿大学）； Cornell University（康奈尔大学）

AI总结提出Perceive-Simulate-Imitate框架，通过仿真过滤人类视频中的抓取-轨迹对，学习任务导向的抓取与后抓取运动策略，无需机器人数据即可实现鲁棒操作。

Comments Transactions on Machine Learning Research (TMLR)

AI中文摘要

通过观看人类视频学习操作技能的能力有潜力为机器人学习解锁新的高度可扩展数据源。本文研究抓取操作，其中任务涉及在抓取物体后执行各种后抓取运动。人类视频为学习后抓取运动提供了强信号，但对于学习先决的抓取行为帮助较小，尤其是对于没有类人手的机器人。一个有前景的方法是采用模块化策略设计，利用专用抓取生成器产生稳定抓取。然而，任意稳定抓取通常与任务不兼容，阻碍机器人执行期望的下游运动。为解决这一挑战，我们提出Perceive-Simulate-Imitate (PSI)框架，该框架使用通过仿真中配对抓取-轨迹过滤处理的人类视频运动数据来训练模块化操作策略。这一仿真步骤用抓取适用性标签扩展轨迹数据，从而允许对任务导向的抓取能力进行监督学习。通过真实世界实验，我们展示了该框架可以在没有任何机器人数据的情况下高效学习精确操作技能，相比直接使用抓取生成器，性能显著更鲁棒。

英文摘要

The ability to learn manipulation skills by watching videos of humans has the potential to unlock a new source of highly scalable data for robot learning. Here, we tackle prehensile manipulation, in which tasks involve grasping an object before performing various post-grasp motions. Human videos offer strong signals for learning the post-grasp motions, but they are less useful for learning the prerequisite grasping behaviors, especially for robots without human-like hands. A promising way forward is to use a modular policy design, leveraging a dedicated grasp generator to produce stable grasps. However, arbitrary stable grasps are often not task-compatible, hindering the robot's ability to perform the desired downstream motion. To address this challenge, we present Perceive-Simulate-Imitate (PSI), a framework for training a modular manipulation policy using human video motion data processed by paired grasp-trajectory filtering in simulation. This simulation step extends the trajectory data with grasp suitability labels, which allows for supervised learning of task-oriented grasping capabilities. We show through real-world experiments that our framework can be used to learn precise manipulation skills efficiently without any robot data, resulting in significantly more robust performance than using a grasp generator naively.

2606.12978 2026-06-16 cs.RO cs.CV cs.SY eess.SY 版本更新

Trajectory-Level Redirection Attacks on Vision-Language-Action Models

轨迹级重定向攻击对视觉-语言-动作模型

Gokul Puthumanaillam, Vardhan Dongre, Pranay Thangeda, Hooshang Nayyeri, Dilek Hakkani-Tür, Melkior Ornik

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文发现VLA模型存在轨迹级漏洞：看似保留原始指令的对抗性提示，能重定向机器人最终物理结果，并提出了命令保持的轨迹重定向威胁模型和在线提示搜索方法。

AI中文摘要

视觉-语言-动作（VLA）策略将自然语言引入闭环机器人控制，使机器人能够直接从文本指令执行操作任务。同一接口赋予文本在控制中的循环角色，因为提示在每个重新规划步骤中被重复使用，每个提示条件化的动作会改变策略所作用的未来观测。现有的VLA攻击研究对抗性提示，这些提示引发目标低级动作或使此类动作在变化的图像中持续存在。我们识别出一个更强的轨迹级故障模式：一个提示仍然$\textit{看起来}$指定了预期任务，但重定向了最终物理结果。我们在数学上将这种设置形式化为$\textit{命令保持的轨迹重定向}$，这是一种仅提示的威胁模型，其中攻击者在情节开始前选择一个提示，所有策略和环境组件保持不变，并且提示必须保持接近良性指令，同时省略目标词和纠正语言。为了找到这样的提示，我们引入了一种在线提示搜索方法，该方法使用滚动来发现扰动，其闭环行为跟踪目标任务，同时满足命令保持约束。在仿真和硬件上的实验表明，接近良性的提示扰动可以将VLA滚动重定向到攻击者指定的目标。这些结果暴露了VLA指令基础中的轨迹级漏洞：看似保留预期命令的文本仍然可以让对手控制机器人的最终物理结果。项目网站：此https URL

英文摘要

Vision-language-action (VLA) policies bring natural language into closed-loop robot control, enabling robots to execute manipulation tasks directly from text instructions. The same interface gives text a recurring role in control because the prompt is reused at every replanning step, and each prompt-conditioned action changes the future observations on which the policy acts. Existing VLA attacks study adversarial prompts that elicit targeted low-level actions or make such actions persist across changing images. We identify a stronger trajectory-level failure mode: a prompt that still appears to specify the intended task but redirects the final physical outcome. We mathematically formalize this setting as command-preserving trajectory redirection, a prompt-only threat model in which the attacker chooses one prompt before the episode, all policy and environment components remain fixed, and the prompt must stay close to the benign instruction while omitting target words and correction language. To find such prompts, we introduce an on-policy prompt search method that uses rollouts to discover perturbations whose closed-loop behavior tracks a target task while satisfying the command-preserving constraints. Experiments in simulation and on hardware show that near-benign prompt perturbations can redirect VLA rollouts to attacker-specified targets. These results expose a trajectory-level vulnerability in VLA instruction grounding: text that appears to preserve the intended command can still give an adversary control over the robot's final physical outcome. Project website: https://vla-redirection-attack.github.io/

2606.13769 2026-06-16 cs.RO cs.CV cs.LG 版本更新

$\mu_0$: A Scalable 3D Interaction-Trace World Model

$\mu_0$: 一种可扩展的3D交互轨迹世界模型

Seungjae Lee, Yoonkyo Jung, Jusuk Lee, Jonghun Shin, Amir Hossein Shahidzadeh, Yao-Chih Lee, H. Jin Kim, Jia-Bin Huang, Furong Huang

发表机构 * University of Maryland, College Park（马里兰大学帕克分校）； Seoul National University（首尔大学）

AI总结提出基于3D轨迹的可扩展世界模型$\mu_0$，通过预测交互点轨迹实现跨本体机器人学习，无需动作标签，性能媲美有监督模型。

AI中文摘要

能够捕捉动作如何引起物理变化的世界模型使得可扩展的机器人学习成为可能，而无需依赖特定本体的动作标签。像素空间视频模型提供了广泛的视觉先验，但将模型容量消耗在密集外观重建上，而直接动作模型则需要特定本体的标签，阻碍了可扩展性。我们提出$\mu_0$，一种基于3D轨迹的可扩展世界模型。$\mu_0$不是预测密集像素或直接建模动作，而是预测显著交互点（如物体、工具、手和接触区域）的平滑3D轨迹，从而产生一个紧凑、与本体无关的运动接口。为了能够从多样化的视频源进行训练，我们的TraceExtract系统通过选择关键点、构建全局对齐的轨迹以及将运动片段与层次化语言描述关联，自动提取3D监督。这种TraceExtract监督通过将预训练的视觉-语言骨干网络与模块化轨迹专家相结合来预训练$\mu_0$，其中轨迹专家通过B样条控制点表示每个查询并预测未来轨迹。实验表明，$\mu_0$在2D和3D轨迹预测方面均优于基线方法，包括轨迹预测模型和分词VLM方法。由于$\mu_0$是冻结且可重用的，它可以与动作专家配对用于下游机器人本体。尽管是无动作预训练，由此产生的轨迹条件策略在性能上与使用动作监督预训练的VLA模型（如$\pi_0$）相当。这些结果确立了3D轨迹作为跨本体操作的可扩展和可迁移表示。

英文摘要

World models that capture how actions induce physical change enable scalable robot learning without reliance on embodiment-specific action labels. Pixel-space video models provide broad visual priors but expend model capacity on dense appearance reconstruction, while direct action models require embodiment-specific labels that hinder scalability. We present $\mu_0$, a scalable world model based on 3D traces. Rather than predicting dense pixels or directly modeling actions, $\mu_0$ forecasts smooth 3D trajectories for salient interaction points such as objects, tools, hands, and contact regions, yielding a compact, embodiment-agnostic motion interface. To enable training from diverse video sources, our TraceExtract system automatically extracts 3D supervision by selecting keypoints, constructing globally aligned traces, and associating motion segments with hierarchical language captions. This TraceExtract supervision pretrains $\mu_0$ by combining a pretrained vision-language backbone with a modular trace expert, which represents each query via B-spline control points and predicts future traces. Experiments show that $\mu_0$ outperforms baselines in both 2D and 3D trace prediction, including trace prediction models and tokenized VLM methods. Because $\mu_0$ is frozen and reusable, it can be paired with action experts for downstream robot embodiments. Despite action-free pretraining, the resulting trace-conditioned policies achieve performance competitive with VLA models pretrained with action supervision, such as $\pi_0$. These results establish 3D traces as a scalable and transferable representation for cross-embodiment manipulation.
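
To make the B-spline control-point representation mentioned in the abstract above more concrete, here is a minimal sketch that turns a handful of 3D control points into a smooth sampled trajectory. The cubic degree, the uniform knot spacing, and the function name `cubic_bspline` are assumptions for illustration only; the abstract does not specify the spline order or parameterization used by the trace expert.

```python
import numpy as np

def cubic_bspline(control_points, samples_per_segment=10):
    """Sample a uniform cubic B-spline defined by N >= 4 control points of shape (N, 3)."""
    P = np.asarray(control_points, dtype=float)
    u = np.linspace(0.0, 1.0, samples_per_segment, endpoint=False)
    # Uniform cubic B-spline basis; every row sums to 1, so samples are
    # convex combinations of four consecutive control points.
    basis = np.stack([
        (1 - u) ** 3,
        3 * u ** 3 - 6 * u ** 2 + 4,
        -3 * u ** 3 + 3 * u ** 2 + 3 * u + 1,
        u ** 3,
    ], axis=1) / 6.0                                   # shape (S, 4)
    # Slide a window of four control points along the sequence, one segment per window.
    segments = [basis @ P[i:i + 4] for i in range(len(P) - 3)]
    return np.concatenate(segments, axis=0)            # shape ((N - 3) * S, 3)
```

Predicting a few control points instead of every future waypoint keeps the output compact and makes the resulting trace smooth by construction, which is consistent with the "smooth 3D trajectories" the abstract describes.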

2606.14735 2026-06-16 cs.CV 新提交

UtVAA: Ultra-tiny Vision Transformer with Affix Attention for Mobile Image Classification

UtVAA: 用于移动图像分类的带有Affix Attention的超微型视觉Transformer

Romiyal George, Sathiyamohan Nishankar, Selvarajah Thuseethan, Roshan G. Ragel

发表机构 * University of Peradeniya（佩拉德尼亚大学）； Charles Darwin University（查尔斯·达尔文大学）

AI总结提出超微型ViT架构UtVAA，通过Affix Attention块结合局部与全局特征，在极低参数量和FLOPs下实现高精度图像分类，适用于移动设备。

Comments 13 pages, 7 figures

AI中文摘要

视觉Transformer（ViT）在图像分类中展现了强大的表示能力。然而，其二次自注意力复杂度和大量参数限制了在资源受限的移动和边缘设备上的部署。本文介绍了UtVAA，一种超微型视觉Transformer架构，专为在严格计算预算下进行高效视觉识别而设计。它包含一个新颖的Affix Attention块，该块结合了深度可分离局部特征提取、线性自注意力、用于空间依赖建模的坐标注意力，以及一个轻量级三元融合策略来整合局部和全局表示。此外，Dilated Bottleneck块通过使用扩张深度可分离卷积扩展感受野，同时通过残差连接保持低FLOPs和稳定优化。UtVAA实现了可扩展的Tiny、Medium和Large变体，其中最小的模型包含204.67K参数和53.95M FLOPs。在CIFAR-10、CIFAR-100、PlantVillage-Tomato和SLIF-Tomato数据集上的实验结果表明，UtVAA在百万参数以下的范围内达到了有竞争力的准确率。总体而言，结果表明基于Transformer的视觉模型可以重新设计为超微型架构，而不会显著损失判别性能，使得UtVAA适用于移动和边缘部署。代码可在https://github.com/romiyal/UtVAA获取。

英文摘要

Vision Transformers (ViTs) have demonstrated strong representation capability in image classification. However, their quadratic self-attention complexity and large parameter counts limit deployment on resource-constrained mobile and edge devices. This paper introduces UtVAA, an ultra-tiny Vision Transformer architecture designed for efficient visual recognition under strict computational budgets. It incorporates a novel Affix Attention block that combines depthwise-pointwise local feature extraction, linear self-attention, coordinate attention for spatial dependency modelling, and a lightweight ternary fusion strategy to integrate local and global representations. In addition, Dilated Bottleneck blocks expand the receptive field using dilated depthwise separable convolutions while maintaining low FLOPs and stable optimisation through residual connections. UtVAA is implemented in scalable Tiny, Medium, and Large variants, with the smallest model containing 204.67K parameters and 53.95M FLOPs. Experimental results on CIFAR-10, CIFAR-100, PlantVillage-Tomato and SLIF-Tomato datasets show that UtVAA achieves competitive accuracy within a sub-million-parameter regime. Overall, the results demonstrate that transformer-based vision models can be redesigned into ultra-tiny architectures without significant loss in discriminative performance, making UtVAA suitable for mobile and edge deployment. Code is available at https://github.com/romiyal/UtVAA
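
The parameter savings behind the depthwise-pointwise (depthwise separable) convolutions mentioned in the abstract above can be checked with simple arithmetic. The helper names below are hypothetical and the channel sizes in the usage note are arbitrary examples, not values taken from UtVAA.

```python
def conv_params(c_in, c_out, k, bias=False):
    """Weight count of a standard k x k convolution."""
    return k * k * c_in * c_out + (c_out if bias else 0)

def dw_separable_params(c_in, c_out, k, bias=False):
    """Depthwise k x k conv (one filter per input channel) followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise
```

For example, with 64 input channels, 128 output channels, and a 3 x 3 kernel, the standard convolution needs 73,728 weights while the separable version needs 8,768, roughly an 8x reduction. Savings of this kind are what make sub-million-parameter budgets plausible.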

2606.14770 2026-06-16 cs.CV cs.AI cs.IR cs.LG 新提交

An Empirical Analysis of Optimization Dynamics and Sparsity Boundaries in Large-Scale Pedestrian Attribute Recognition

大规模行人属性识别中的优化动态与稀疏边界实证分析

Houssam El Mir

发表机构 * College of Computer Science and Technology, Zhejiang University of Technology（浙江工业大学计算机科学与技术学院）

AI总结针对行人属性识别中极端类别不平衡问题，提出多标签焦点损失校准配置（alpha=0.50, gamma=2.0），在零计算开销下匹配BCE基线并提升难例挖掘，同时识别出0.1%正样本率下的稀疏墙边界。

AI中文摘要

行人属性识别（PAR）对于视频监控至关重要，支持法医搜索和重识别系统。当将PETA和PA-100K合并为一个包含109,000张图像的复合语料库时，极端类别不平衡仍然是一个基本障碍，其中少数属性的正样本比例低于1%。这导致标准BCE优化抑制稀有特征，我们称之为多数负类欺骗陷阱。我们在ResNet-18骨干网络上对多标签焦点损失超参数（alpha和gamma）进行了系统消融。校准配置（alpha=0.50, gamma=2.0）实现了62.32%的宏F1分数，与BCE基线相当，同时保留了优越的难例挖掘和收敛动态。我们的方法使用纯损失函数工程，边缘部署零计算开销。我们识别出稀疏墙，这是一个硬边界，当正样本比例低于0.1%时，全局损失重新加权失效，需要实例级干预。

英文摘要

Pedestrian Attribute Recognition (PAR) is critical for video surveillance, enabling forensic search and re-identification systems. Extreme class imbalance remains a fundamental obstacle when merging PETA and PA-100K into a 109,000-image composite corpus, where minority attributes have positive sample fractions below 1%. This causes standard BCE optimization to suppress rare traits, a phenomenon we term the majority negative class cheating trap. We present a systematic ablation of Multi-Label Focal Loss hyperparameters (alpha and gamma) on a ResNet-18 backbone. A calibrated configuration (alpha=0.50, gamma=2.0) achieves a Macro F1-score of 62.32%, matching the BCE baseline while preserving superior hard-example mining and convergence dynamics. Our approach uses pure loss-function engineering with zero computational overhead for edge deployment. We identify the Sparsity Wall, a hard boundary where positive sample fractions below 0.1% make global loss reweighting ineffective, requiring instance-level intervention.
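
A minimal sketch of a multi-label focal loss with the calibrated values from the abstract above (alpha=0.50, gamma=2.0) is given below. It follows the widely used sigmoid focal-loss form applied independently to each attribute; whether the paper uses exactly this formulation, this reduction, or this `eps` guard is an assumption, and the function name is hypothetical.

```python
import numpy as np

def multilabel_focal_loss(logits, targets, alpha=0.50, gamma=2.0, eps=1e-12):
    """Sigmoid focal loss averaged over all (sample, attribute) entries."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    p = 1.0 / (1.0 + np.exp(-logits))                  # per-attribute probability
    # Positives: down-weight confident hits via (1 - p)^gamma.
    pos = -alpha * (1.0 - p) ** gamma * targets * np.log(p + eps)
    # Negatives: down-weight confident rejections via p^gamma.
    neg = -(1.0 - alpha) * p ** gamma * (1.0 - targets) * np.log(1.0 - p + eps)
    return float(np.mean(pos + neg))
```

With gamma=0 and alpha=0.5 this reduces to half of the ordinary BCE loss, so the gamma term is the only thing changing the optimization: it shrinks the contribution of easy, mostly negative entries. Because the change lives entirely in the loss, the deployed network is unchanged, which matches the abstract's zero-overhead claim.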

2606.14871 2026-06-16 cs.CV cs.AI 新提交

An Ensemble Deep Learning Approach for Reliable and Scalable Lemon Leaf Disease Classification

一种可靠且可扩展的柠檬叶病害分类集成深度学习方法

Shayan Abrar, Sudeepta Mandal, Abdul Awal Yasir, Sonjoy Bhattacharjee, Sadman Haque Bhuiyan, Samanta Ghosh, Rafi Ahamed

发表机构 * Dept. of CSE（计算机科学与工程系）； American International University-Bangladesh（美国国际大学-孟加拉国）； East West University（东-西大学）； North South University（北南大学）

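AI summary: Proposes a deep learning ensemble of InceptionV3 and MobileNetV2, combined with adversarial training and Grad-CAM visualization, that reaches 99.27% accuracy on a 9-class lemon leaf disease dataset for reliable classification.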

Comments 5 pages, 12 figures, 3 Tables, Presented at 18th IEEE International Conference on Computational Intelligence and Communication Networks (CICN) 2026

2606.14886 2026-06-16 cs.CV cs.AI 新提交

Improved Knowledge Distillation for Land-Use Image Classification

改进的知识蒸馏用于土地利用图像分类

Arundhuti Sur, Abhiroop Chatterjee, Susmita Ghosh, Emmett Ientilucci

发表机构 * Jadavpur University（贾达沃大学）； Rochester Institute of Technology（罗切斯特理工学院）

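AI summary: Proposes an improved knowledge distillation framework that transfers knowledge from a VGG16 teacher network to a lightweight MobileNetV2 student network using combined hard and soft supervision, reaching 99.04% accuracy on three datasets and outperforming baseline methods.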

Comments Accepted by IGARSS 2026

2606.15134 2026-06-16 cs.CV cs.AI cs.LG 新提交

Beyond Scalar Distances: Semantic Attribute Gradients from Frozen MLLMs for Visual Embeddings

超越标量距离：来自冻结MLLM的语义属性梯度用于视觉嵌入

Shubhang Bhatnagar, Dheeraj Baiju, Narendra Ahuja

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

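AI summary: Proposes SAGA, a framework that uses a frozen multimodal large language model (MLLM) with a GRPO reward mechanism to give the vision encoder attribute-level supervision in place of conventional scalar distances, improving zero-shot image retrieval.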

AI中文摘要

用于检索的视觉编码器通常通过类标签监督进行训练：每个训练对简化为一个标量，均匀地将嵌入推远或拉近，就好像每个视觉属性要么不同要么匹配。一个多模态大语言模型（MLLM），在展示相同的一对图像时，能够阐述这些属性并利用它们预测图像是否共享一个类别。我们提出\textbf{SAGA}，一个框架，将这种基于语言、属性感知的感知转化为编码器本身的训练信号。具体来说，我们使用组相对策略优化（GRPO）来奖励MLLM对视觉编码器令牌的正确预测。由于正确的预测要求这些令牌暴露该对之间不同或匹配的具体属性，梯度推动编码器编码这些属性，用属性解析的监督取代统一的成对标量。一个辅助的注意力蒸馏损失将编码器的嵌入锚定到MLLM关注的令牌上，一个标准的度量学习损失塑造嵌入几何结构以进行最近邻检索。MLLM在整个过程中被冻结，在推理时被丢弃，与度量学习基线的部署成本相匹配。在CUB-200-2011、Cars-196、FGVC-Aircraft和iNaturalist Aves上的零样本图像检索中，SAGA在Recall@1上比最先进的基线提高了3到6个百分点。

英文摘要

Vision encoders for retrieval are typically trained with class-label supervision: each training pair reduces to a scalar that uniformly pushes the embeddings apart or pulls them together, as if every visual attribute either differed or matched. A multimodal large language model (MLLM), shown the same pair, can articulate those attributes and use them to predict whether the images share a class. We propose SAGA, a framework that turns this language-grounded, attribute-aware perception into a training signal for the encoder itself. Specifically, we use Group Relative Policy Optimization (GRPO) to reward the MLLM for correct predictions on the vision encoder's tokens. Since correct predictions require those tokens to expose the specific attributes that differ or match between the pair, the gradient pushes the encoder to encode them, replacing the uniform pair-level scalar with attribute-resolved supervision. An auxiliary attention-distillation loss anchors the encoder's embedding to tokens the MLLM attended to, and a standard metric-learning loss shapes the embedding geometry for nearest-neighbour retrieval. The MLLM is frozen throughout and discarded at inference, matching the deployment cost of a metric-learning baseline. In zero-shot image retrieval on CUB-200-2011, Cars-196, FGVC-Aircraft, and iNaturalist Aves, SAGA improves Recall@1 by 3 to 6 points over state-of-the-art baselines.
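The "group relative" part of GRPO can be sketched in a few lines: each sampled prediction is scored against the mean and spread of rewards in its own group, so no separate value network is needed. The binary reward (1.0 for a correct same-class prediction) and the use of the population standard deviation are assumptions made for illustration, not details confirmed by the abstract.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an implementation choice
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical group of four sampled predictions for one image pair:
# three predicted the shared-class label correctly, one did not.
adv = group_relative_advantages([1.0, 0.0, 1.0, 1.0])
print(adv)
```

Correct samples receive a positive advantage and the incorrect one a negative advantage; when every sample in a group earns the same reward, all advantages are zero and the group contributes no gradient.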

2606.15151 2026-06-16 cs.CV cs.LG 新提交

HiRo: A Compact Four-Directional Hierarchical Reservoir Token-Mixer for Efficient Image Classification

HiRo：一种用于高效图像分类的紧凑型四方向分层储层令牌混合器

Md Farhadul Islam, Ishan Thakkar, J. Todd Hastings

发表机构 * University of Kentucky（肯塔基大学）

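AI summary: Proposes HiRo, which mixes tokens locally and across windows through four-directional scanning and a two-stage slice-and-mix reservoir module, reaching high accuracy on MNIST and CIFAR-10/100 with fewer than 1M parameters.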

Comments Accepted at ICONS 2026

AI中文摘要

最近的图像分类模型必须在局部特征建模、跨窗口交互和参数效率之间取得平衡。许多高性能架构依赖于完全可训练的令牌混合器，这改善了表示学习但增加了参数数量、优化复杂性和计算成本。我们提出了一种参数高效的图像分类模型HiRo，它将移位窗口分区与多方向分层储层计算相结合。图像被划分为非重叠块（视为令牌），线性投影、归一化，并添加二维正弦位置编码，然后在局部窗口内处理。在每个窗口内，令牌沿四个方向扫描，并通过两级切片混合储层模块。在第一阶段，方向序列被分割成连续的切片，每个切片由具有可训练闭环读出的固定储层处理。得到的切片输出使用开始、结束和均值表示进行汇总，然后由每个方向的第二阶段固定储层混合。混合后的切片表示被扩展回令牌级别并与第一阶段输出融合，之后四个方向的输出重新对齐并平均。连续块在常规窗口和移位窗口之间交替以实现跨窗口交互，随后是层归一化、残差前馈网络和用于分类的全局池化。该设计将常规和移位窗口分区与分层多方向储层相结合，构建了一个高效的局部到跨窗口令牌混合框架用于图像分类。尽管使用的可训练参数少于1M，且内存和时间显著低于基于Transformer的基线，HiRo在MNIST、CIFAR-10和CIFAR-100上分别达到了99.46%、85.57%和59.10%的准确率。

英文摘要

Recent image classification models must balance local feature modeling, cross-window interaction, and parameter efficiency. Many high-performing architectures rely on fully trainable token-mixers, which improve representation learning but increase parameter count, optimization complexity, and computational cost. We propose a parameter-efficient image classification model called HiRo that integrates shifted-window partitioning with multi-directional hierarchical reservoir computing. Images are divided into non-overlapping patches (treated as tokens), linearly projected, normalized, and enriched with 2D sinusoidal positional encodings, then processed within local windows. Inside each window, tokens are scanned in four directions and passed through a two-stage slice-and-mix reservoir module. In the first stage, directional sequences are split into contiguous slices, each processed by its own fixed reservoir with a trainable closed-loop readout. The resulting slice outputs are summarized using the start, end, and mean representations, and then mixed by a second-stage fixed reservoir for each direction. The mixed slice representations are expanded back to the token level and fused with the first-stage outputs, after which the four directional outputs are realigned and averaged. Consecutive blocks alternate between regular and shifted windows to enable cross-window interaction, followed by layer normalization, a residual feed-forward network, and global pooling for classification. This design combines regular and shifted window partitioning with hierarchical multi-directional reservoirs to form an efficient local-to-cross-window token-mixing framework for image classification. Despite using under 1M trainable parameters and significantly less memory and time than transformer-style baselines, HiRo achieves 99.46%, 85.57%, and 59.10% accuracy on MNIST, CIFAR-10, and CIFAR-100, respectively.
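The scan-and-slice bookkeeping described above can be sketched without any reservoir at all. The example below treats tokens as scalars, assumes the four directions are forward and backward row-major and column-major orders (the abstract does not specify them), and summarizes each contiguous slice by its start, end, and mean. In the paper these summaries are taken over reservoir outputs; the reservoirs themselves are omitted here.

```python
def four_direction_scans(window):
    """Return the tokens of a 2D window in four scan orders."""
    rows = [list(r) for r in window]
    row_major = [t for r in rows for t in r]
    col_major = [rows[i][j] for j in range(len(rows[0])) for i in range(len(rows))]
    return {
        "row_forward": row_major,
        "row_backward": row_major[::-1],
        "col_forward": col_major,
        "col_backward": col_major[::-1],
    }


def slice_summaries(sequence, slice_len):
    """Split a sequence into contiguous slices; summarize each by (start, end, mean)."""
    out = []
    for i in range(0, len(sequence), slice_len):
        s = sequence[i:i + slice_len]
        out.append((s[0], s[-1], sum(s) / len(s)))
    return out


scans = four_direction_scans([[1, 2], [3, 4]])
summaries = slice_summaries(scans["row_forward"], 2)
print(scans["col_forward"], summaries)
```

Because every direction sees the same tokens in a different order, reversing the backward scans before averaging is what the abstract calls realignment.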

2606.15282 2026-06-16 cs.CV 新提交

Enhancing Precision Agriculture with a Hybrid Deep Learning Framework for Multi-Class Plant Disease Classification and Interpretability

利用混合深度学习框架增强精准农业：多类植物病害分类与可解释性

Hasibul Islam Sufi, Ridam Roy, Shayla Alam Setu, Mahimul Islam Nadim

发表机构 * Department of Computer Science and Engineering, Daffodil International University（计算机科学与工程系，达福尔国际大学）

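AI summary: Proposes a hybrid ResNet-ViT architecture for multi-class plant disease classification that reaches 98.58% accuracy on 38 classes of leaf images and uses interpretability techniques such as Grad-CAM to localize diseased regions.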

AI中文摘要

本研究提出了一种整体深度学习架构，用于从高分辨率叶片图像中对植物病害进行多类分类，特别关注ResNet-50和混合ResNet + Vision Transformer (ViT)设计的行为。一个专门收集的图像数据库包含15,200张训练图像和3,800张验证图像，涵盖多种作物的38个类别，包括番茄、苹果、葡萄等，经过预处理步骤如调整大小、归一化和数据增强以增强模型鲁棒性。训练了多种架构，包括ResNet-50、MobileNetV2和EfficientNet-B0，并与混合ResNet + ViT模型进行比较。所有模型使用AdamW优化器和交叉熵损失进行微调，并应用早停以防止过拟合并确保泛化。此外，实现了可解释性技术如Grad-CAM和显著性图以指示病害相关区域，同时进行基于分割的分析以识别叶片的受影响部分。在所有考虑的架构中，ResNet-50达到了最高准确率98.74%，而混合ResNet + ViT模型达到了竞争性的98.58%，表明混合架构在捕捉局部和全局信息方面是有效的。实验结果展示了基于Transformer的模型在实现高精度、可解释且计算高效的基于计算机的多类多病害分类系统方面的潜力，为栽培管理实践和精准农业提供了有用的帮助。

英文摘要

This study proposes an overall deep learning architecture for multi-class classification of plant diseases from high-resolution leaf imagery, with a particular interest in investigating the behavior of ResNet-50 and a hybrid ResNet + Vision Transformer (ViT) design. A specially gathered image database with 15,200 training images and 3,800 validation images, spanning 38 classes across multiple crops (including tomato, apple, and grape), was subjected to preprocessing steps such as resizing, normalization, and data augmentation to enhance model robustness. Multiple architectures, including ResNet-50, MobileNetV2, and EfficientNet-B0, were trained and compared with the hybrid ResNet + ViT model. All models were fine-tuned using the AdamW optimizer and cross-entropy loss, with early stopping applied to prevent overfitting and ensure generalization. Furthermore, interpretability techniques such as Grad-CAM and saliency maps were implemented to indicate disease-relevant regions, while segmentation-based analysis was performed to identify the affected parts of a leaf. Among all the considered architectures, ResNet-50 achieved the highest accuracy of 98.74%, whereas the hybrid ResNet + ViT model achieved a competitive accuracy of 98.58%, showing that the hybrid architecture was effective in capturing both local and global information. The experimental results showcase the promise of transformer-based models to achieve highly accurate, interpretable, and computationally efficient multi-class, multi-disease classification systems, providing helpful assistance for cultivation management practices as well as for precision farming.

2606.15355 2026-06-16 cs.CV 新提交

迈向全貌：小面积移动传感器的累积指纹映射与重建

Xiongjun Guan, Jianjiang Feng, Jie Zhou

发表机构 * Tsinghua University（清华大学）

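AI summary: To address the mismatch between acquisition and recognition in small-area mobile fingerprint sensing, proposes an accumulative mapping and reconstruction framework that converts a sequence of local observations into a unified fingerprint state matched only once, improving efficiency and robustness.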

AI中文摘要

移动设备上的小面积指纹传感在采集与识别之间造成了根本性的不匹配：每次触摸仅捕获一个微小且姿态变化的局部补丁，而可靠的生物特征匹配最终需要一个稳定且足够完整的指纹表示。现有流程主要通过将重复触摸视为独立的局部模板来应对这种不匹配，这导致重复注册、重复匹配，且无法保证足够的全局覆盖。在本文中，我们提出了一种不同的公式，即针对小面积移动传感的\emph{累积指纹映射与重建}。该视角并非分别匹配每个局部补丁，而是将一系列局部观测转换为一个统一的指纹状态，该状态随着新触摸的到来而逐步细化，并可在整合后仅匹配一次。作为一个具体基线，我们提出了一种经典流程，执行补丁级结构特征提取、特征级配准与融合、指纹图构建以及基于相位的脊线重建。更重要的是，我们将此基线定位在一个更广泛的移动指纹框架内，该框架集成了结构化令牌学习、两阶段姿态推理和基于扩散的生成式重建。这一观点将移动指纹识别从多次捕获多次匹配处理重新构建为累积地图构建、状态细化和一次性匹配，为小面积移动平台提供了一条通向高效、姿态鲁棒且易于部署的生物特征识别的原则性路径。基线实现已在 https://github.com/XiongjunGuan/FpReconstruction 公开发布。

英文摘要

Small-area fingerprint sensing on mobile devices creates a fundamental mismatch between acquisition and recognition: each touch captures only a tiny, pose-varying local patch, while reliable biometric matching ultimately requires a stable and sufficiently complete fingerprint representation. Existing pipelines largely cope with this mismatch by treating repeated touches as independent partial templates, which leads to repeated registration, repeated matching, and no guarantee of adequate global coverage. In this paper, we advocate a different formulation, namely accumulative fingerprint mapping and reconstruction for small-area mobile sensing. Rather than matching every partial patch separately, the proposed perspective converts a sequence of local observations into a unified fingerprint state that is progressively refined as new touches arrive and can be matched only once after consolidation. As a concrete baseline, we present a classical pipeline that performs patch-wise structural feature extraction, feature-level registration and fusion, fingerprint map construction, and phase-based ridge reconstruction. More importantly, we position this baseline within a broader mobile fingerprint framework that integrates structured token learning, two-stage pose reasoning, and diffusion-based generative reconstruction. This viewpoint reframes mobile fingerprint recognition from multi-capture multi-match processing to accumulative map building, state refinement, and one-shot matching, offering a principled route toward efficient, pose-robust, and deployment-friendly biometrics for small-area mobile platforms. The baseline implementation has been publicly released at https://github.com/XiongjunGuan/FpReconstruction.

2606.15763 2026-06-16 cs.CV 新提交

The Circumplex Degeneracy Behind the Rare-Class Limit in Affect Recognition

情感识别中稀有类别极限背后的圆周退化

Van Thong Huynh, Hong Hai Nguyen, Soo-Hyung Kim

发表机构 * Faculty of CSE, Ho Chi Minh City University of Technology (HCMUT), VNUHCM（胡志明市理工大学计算机科学与工程学院, 越南国家大学胡志明市分校）； Dept. of AI, FPT University（FPT大学人工智能系）； Dept. of AI Convergence, Chonnam National University（全南大学人工智能融合系）

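AI summary: A multi-task study shows that rare-expression recognition failures stem from degeneracy on Russell's circumplex rather than from class imbalance; the paper proposes a circumplex-cost optimal-transport term, but its gain is not geometric, and the structure of rare-class errors is affected by a visual confound.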

AI中文摘要

野外表情识别在少数稀有情感上持续失败，标准解释是类别不平衡。通过在两个基准上的受控多任务研究，我们表明失败反而是情感几何的一个属性：稀有类别在Russell圆周上是退化的，这种退化限制了任何损失或代价所能达到的效果。我们的工具是一个圆周代价最优传输项，通过效价-唤醒距离对表情混淆进行定价。该项提高了官方得分和表情宏F1，但大多数研究省略的对照显示，增益并非几何性的：一个均匀代价（相当于通用置信度惩罚）在Aff-Wild2上与它匹配（p=0.625），并在AffectNet上显著超过它（比基线高+0.057，大于圆周项）。几何重塑的是错误的结构，使它们在Aff-Wild2上情感上更接近真相（与均匀对照相比p=0.031），但这种效果在AffectNet上不成立，因为圆周远角的一个视觉混淆压倒了它。相比之下，稀有类别失败在我们检查的两个数据集上都是稳定的：退化对（Aff-Wild2上的愤怒-恐惧，AffectNet上的愤怒-蔑视）抵抗基于频率的干预、传输项以及专门为分离它们而构建的动作单元增强代价。我们得出结论，稀有表情的进展需要区分这些类别的表示，而不是重新定价其混淆的监督，我们提供了区分两者的对照和指标。

英文摘要

In-the-wild expression recognition persistently fails on a few rare emotions, and the standard explanation is class imbalance. Through a controlled multi-task study on two benchmarks, we show the failure is instead a property of affect geometry: the rare classes are degenerate on Russell's circumplex, and that degeneracy bounds what any loss or cost can achieve. Our instrument is a circumplex-cost optimal-transport term that prices expression confusions by their valence-arousal distance. The term improves the official score and expression macro-F1, but a control most studies omit shows the gain is not geometric: a uniform cost, equivalent to a generic confidence penalty, matches it on Aff-Wild2 (p=0.625) and significantly exceeds it on AffectNet (+0.057 over base, larger than the circumplex). What the geometry reshapes is the structure of the errors, making them affectively nearer the truth on Aff-Wild2 (p=0.031 against the uniform control), an effect that does not survive on AffectNet, where a visual confound at the far corner of the circumplex overwhelms it. The rare-class failure, by contrast, is stable across both datasets we examine: the degenerate pairs (anger-fear on Aff-Wild2, anger-contempt on AffectNet) resist frequency-based interventions, the transport term, and an action-unit-augmented cost built specifically to separate them. We conclude that progress on rare expressions requires representations that distinguish the classes, not supervision that reprices their confusions, and we provide the controls and metrics needed to tell the two apart.
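A small sketch can show both the circumplex cost and the uniform-cost control. When the target is a single class, the optimal-transport cost of a predicted distribution reduces to the cost-weighted sum of its mass. The valence-arousal coordinates below are illustrative placeholders and not values from the paper; the function names are likewise hypothetical.

```python
import math

# Illustrative (valence, arousal) coordinates; placeholders, not the paper's values.
VA = {
    "happy": (0.8, 0.5),
    "sad": (-0.7, -0.4),
    "anger": (-0.6, 0.7),
    "fear": (-0.6, 0.8),
}


def circumplex_cost(a, b):
    """Euclidean distance between two emotions in valence-arousal space."""
    return math.dist(VA[a], VA[b])


def uniform_cost(a, b):
    """Control: every confusion costs the same."""
    return 0.0 if a == b else 1.0


def transport_penalty(pred, true_label, cost):
    """OT cost of moving predicted mass onto a one-hot target:
    sum over classes of pred[c] * cost(true_label, c)."""
    return sum(p * cost(true_label, label) for label, p in pred.items())


pred = {"happy": 0.1, "sad": 0.1, "anger": 0.3, "fear": 0.5}
geo = transport_penalty(pred, "anger", circumplex_cost)
uni = transport_penalty(pred, "anger", uniform_cost)
print(geo, uni)
```

Under the uniform cost the penalty is exactly 1 minus the probability of the true class, which is the generic confidence penalty the abstract mentions. Under the circumplex cost, an anger-fear confusion is almost free because the two classes nearly coincide in valence-arousal space, which illustrates why a geometric cost cannot separate a degenerate pair.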

2606.16161 2026-06-16 cs.CV 新提交

InfoGeo: 面向跨视角泛化无人机地理定位的信息论目标中心学习

Hongyang Zhang, Maonan Wang, Ziyao Wang, Hongrui Yin, Man-On Pun

发表机构 * The University of Hong Kong（香港大学）

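AI summary: Proposes InfoGeo, a framework based on information bottleneck theory that uses object-centric structural alignment and cross-view knowledge constraints to improve the robustness and generalization of UAV cross-view geo-localization under domain shift.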

AI中文摘要

跨视角地理定位（CVGL）是GPS拒止环境中精确定位和导航的基础，旨在将地面或无人机图像与卫星视图匹配。现有方法通常依赖全局特征对齐，但受区域纹理和天气条件变化引起的显著域偏移影响。在无人机场景中，由于更广的视角不可避免地引入密集的细粒度目标，造成严重视觉杂乱，这一问题更为突出。为此，我们从目标中心学习（OCL）中汲取灵感，提出InfoGeo，一个旨在增强鲁棒性和泛化能力的信息论框架。InfoGeo将优化重新表述为信息瓶颈过程，包含两个核心目标：（i）通过跨视图对齐目标中心结构关系，最大化视图不变信息；（ii）通过跨视图知识约束，最小化视图特定噪声信号。在多种基准和挑战场景上的广泛评估表明，InfoGeo显著优于现有最先进方法。

英文摘要

Cross-view geo-localization (CVGL) is fundamental for precise localization and navigation in GPS-denied environments, aiming to match ground or UAV imagery with satellite views. Existing approaches often rely on global feature alignment, but they suffer from substantial domain shifts induced by varying regional textures and weather conditions. This issue becomes even more pronounced in UAV-based scenarios, where the broader perspective inevitably introduces dense, fine-grained objects, creating significant visual clutter. To address this, we draw inspiration from Object-Centric Learning (OCL) and propose InfoGeo, an information-theoretic framework designed to enhance robustness and generalization. InfoGeo reformulates the optimization as an information bottleneck process with two core objectives: (i) maximizing view-invariant information by aligning the object-centric structural relations across views, and (ii) minimizing view-specific noisy signals through cross-view knowledge constraints. Extensive evaluations across diverse benchmarks and challenging scenarios demonstrate that InfoGeo significantly outperforms state-of-the-art methods.

2606.03654 2026-06-16 cs.CV cs.NA math.NA 版本更新

Graph Regularized Non-negative Reduced Biquaternion Matrix Factorization for Color Image Recognition

图正则化非负简化四元数矩阵分解用于彩色图像识别

Hailang Wu, Yonghe Liu, Bingxuan Yu, Chaoqian Li

发表机构 * School of Mathematics and Statistics, Yunnan University（云南大学数学与统计学学院）

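AI summary: Because non-negative reduced biquaternion matrix factorization ignores local geometric structure, the paper proposes a graph-regularized model that preserves local structure through a graph Laplacian regularizer, derives a component-wise alternating projected gradient algorithm, and obtains competitive results in color image recognition.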

AI中文摘要

非负简化四元数矩阵分解（NRBMF）利用简化四元数（RB）矩阵的乘积，将彩色图像像素的非负约束纳入分解过程。然而，NRBMF主要关注重构精度，未利用图像数据的局部几何结构，这可能限制所学低维特征的判别能力。为解决此问题，我们提出了一种图正则化非负简化四元数矩阵分解（GNRBMF）模型用于彩色图像识别。该模型将图拉普拉斯正则化项引入简化四元数系数矩阵，鼓励原始空间中的邻近样本在学习的特征空间中具有相似表示。同时，GNRBMF在简化四元数域中保留了NRBMF的非负保持特性。为求解优化问题，推导了一种分量交替投影梯度算法，并分析了其收敛性。实验结果表明，所提出的GNRBMF模型在某些测试设置下取得了具有竞争力或更优的识别性能。

英文摘要

Non-negative reduced biquaternion matrix factorization (NRBMF) uses the product of reduced biquaternion (RB) matrices to incorporate the non-negativity constraints of color image pixels into the factorization process. However, NRBMF mainly focuses on reconstruction accuracy and does not explicitly exploit the local geometric structure of image data, which may limit the discriminative ability of the obtained low-dimensional coefficient representations. To address this issue, we propose a graph regularized non-negative reduced biquaternion matrix factorization (GNRBMF) model for color image recognition. The proposed model incorporates a graph Laplacian regularizer into the reduced biquaternion coefficient matrix, encouraging nearby samples in the original space to have similar coefficient representations. Meanwhile, GNRBMF retains the non-negativity property of NRBMF in the reduced biquaternion algebra. To solve the optimization problem, a component-wise alternating projected gradient algorithm is derived, and its convergence properties are analyzed. Experimental results on three color image datasets show that the proposed GNRBMF model achieves competitive or superior recognition performance compared with several methods in most tested settings.
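The effect of a graph Laplacian regularizer is easiest to see in the real-valued case, where tr(H^T L H) with L = D - W equals half the weighted sum of squared distances between connected samples. The sketch below checks that identity on a toy graph. It uses ordinary real vectors as a stand-in, and it does not implement reduced biquaternion arithmetic or the paper's algorithm.

```python
def pairwise_penalty(H, W):
    """0.5 * sum_ij W[i][j] * ||h_i - h_j||^2, where H[i] is sample i's representation."""
    n = len(H)
    total = 0.0
    for i in range(n):
        for j in range(n):
            d2 = sum((a - b) ** 2 for a, b in zip(H[i], H[j]))
            total += W[i][j] * d2
    return 0.5 * total


def trace_form(H, W):
    """tr(H^T L H) with L = D - W, for a symmetric weight matrix W."""
    n = len(H)
    degree = [sum(W[i]) for i in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(n):
            l_ij = (degree[i] if i == j else 0.0) - W[i][j]
            total += l_ij * sum(a * b for a, b in zip(H[i], H[j]))
    return total


# Toy data: three samples, with sample 1 connected to samples 0 and 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W = [[0.0, 1.0, 0.0], [1.0, 0.0, 2.0], [0.0, 2.0, 0.0]]
print(pairwise_penalty(H, W), trace_form(H, W))
```

Both forms give the same value, so minimizing the trace term pulls the representations of strongly connected samples together, which is how nearby samples end up with similar coefficient representations.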

2606.14716 2026-06-16 cs.CV cs.AI cs.RO 新提交

基于测地线框架的掩膜提议投票用于鲁棒图像分割

Li Liu, Mingzhu Wang, Zhenjiang Li, Da Chen, Laurent D. Cohen

发表机构 * Yuanshen Rehabilitation Institute, Shanghai Jiao Tong University School of Medicine（上海交通大学医学院附属瑞金康复医院）； Yueyang Hospital of Integrated Traditional Chinese and Western Medicine, Shanghai University of Traditional Chinese Medicine（上海中医药大学附属岳阳中西医结合医院）； Department of Radiation Oncology, Shandong Cancer Hospital and Institute, Shandong First Medical University, Shandong Academy of Medical Sciences（山东第一医科大学附属山东省肿瘤医院放疗科）； University Paris Dauphine, PSL Research University, CNRS, UMR 7534, CEREMADE（巴黎多芬纳大学，PSL研究大学，法国国家科学研究中心，UMR 7534，CEREMADE）

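AI summary: Proposes a mask proposal voting framework that uses adaptive domain construction and a weighted voting mechanism to overcome the dependence of classical minimal-path methods on initialization, achieving robust segmentation in complex scenes.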

AI中文摘要

尽管取得了巨大进步，但准确的分割仍然是一项具有挑战性的任务，尤其是在背景杂乱、强度变化复杂和拓扑外观多样的场景中。最小路径模型在解决图像分割任务中展现了强大的能力。然而，基于最小路径的分割方法的性能严重受限于模型初始化，从而限制了其在实际中的应用范围。在这项工作中，我们提出了一种新颖的掩膜提议投票框架，克服了经典方法的主要缺点，即使在复杂场景下也能实现鲁棒分割。首先，我们引入了一种高效的方法来构建自适应域切割，作为初始化基于区域的最小割演化的约束，从而可以生成多样且可靠的掩膜提议候选，大大增加了这些提议准确覆盖目标区域的可能性。其次，我们提出了一种新的掩膜投票方案，构建编码最终分割信息的投票得分图。与经典的路径投票方法相比，我们的模型允许引入先验知识，为每个单独的掩膜分配不同的重要性。因此，所提出的分割模型能够在复杂场景下准确描绘对象边界，并且对初始化不敏感。实验表明，我们的方法在准确性和鲁棒性上始终优于最先进的基于最小路径的方法。

英文摘要

Despite great advances, accurate segmentation remains a challenging task, especially in scenarios with cluttered backgrounds, complex intensity variations, and varied topological appearance. Minimal path models have shown a strong ability to address image segmentation tasks. However, the performance of minimal path-based segmentation approaches is heavily influenced by model initialization, which limits their application scope in practice. In this work, we propose a novel mask proposal voting framework that overcomes the major drawback of classical approaches, allowing robust segmentation even in complicated scenarios. First, we introduce an efficient method for constructing adaptive domain cuts as a constraint for initializing the region-based min-cut evolution, by which diverse and reliable mask proposal candidates can be generated, substantially increasing the likelihood that these proposals accurately cover the target region. Second, we propose a new mask voting scheme to build a voting score map encoding the final segmentation information. In contrast to classical path voting methods, our model allows incorporating priors to assign different importance to each individual mask. As a consequence, the proposed segmentation model is capable of accurately delineating object boundaries under complex scenarios, and is insensitive to initialization. Experiments demonstrate that our method consistently outperforms state-of-the-art minimal path-based approaches in both accuracy and robustness.
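The weighted voting step can be sketched as follows, assuming binary mask proposals of equal size and one prior weight per mask. The normalization by the total weight and the 0.5 threshold are illustrative assumptions; the abstract does not state how the score map is converted into the final segmentation.

```python
def vote_masks(masks, weights, threshold=0.5):
    """Build a weighted voting score map from binary masks and threshold it.

    masks   -- list of 2D lists of 0/1 values, all the same shape
    weights -- one non-negative importance weight per mask
    """
    total = sum(weights)
    height, width = len(masks[0]), len(masks[0][0])
    score = [
        [sum(w * m[y][x] for m, w in zip(masks, weights)) / total for x in range(width)]
        for y in range(height)
    ]
    final = [[1 if s >= threshold else 0 for s in row] for row in score]
    return score, final


# Three hypothetical proposals on a 1x3 image; the first is trusted twice as much.
proposals = [[[1, 1, 0]], [[1, 0, 0]], [[1, 1, 1]]]
score, final = vote_masks(proposals, weights=[2, 1, 1])
print(score, final)
```

A pixel covered only by a low-weight proposal stays below the threshold, so one poorly initialized proposal cannot change the outcome on its own.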

2606.15049 2026-06-16 cs.CV 新提交

多视角特征高阶融合用于空间弱目标检测与分割

Weilong Guo, Yuhan Sun, Shengyang Li

发表机构 * Technology and Engineering Center for Space Utilization, Chinese Academy of Sciences（中国科学院空间应用工程与技术中心）； Key Laboratory of Space Utilization, Chinese Academy of Sciences（中国科学院空间应用重点实验室）； University of Chinese Academy of Sciences（中国科学院大学）

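AI summary: For detecting and segmenting weak objects in space imagery, proposes multi-view feature high-order fusion (MHF), which aggregates accurate and rich weak-object features through high-order feature perception and recursive task-contribution gated selection, and serves as a plug-and-play module that significantly improves a range of vision models.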

AI中文摘要

弱目标在空间应用的图像和视频中很常见。然而，从它们有限的外观信息中学习合适的表示是困难的。受多视角学习的启发，我们开发了简单的多视角注意力机制，将其输出视为多视角特征。我们还提出了一种多视角特征高阶融合方法（MHF），以聚合更准确和丰富的弱目标特征。我们的MHF将常用的低阶特征融合方法扩展到高阶。它增强了模型捕获弱目标相关和互补信息的能力。这是通过引入高阶多视角特征感知和递归任务贡献门控选择多视角特征来实现的。新操作高度灵活且可定制，与多视角特征表示的各种变体兼容。我们在两个新构建的空间科学数据集和一个开放的大规模卫星视频数据集上进行了大量实验。我们的MHF作为一个即插即用模块，显著改进了各种基于视觉Transformer和卷积的检测与分割模型。我们在三个数据集上的两个任务上都取得了最先进的精度。我们的MHF可以成为视觉建模的新基础模块，有效地从多视角学习角度表示弱目标。代码将在https://github.com/Kingdroper/MHF 提供。

英文摘要

Weak objects are common in images and videos of space applications. However, it is hard to learn proper representations from their limited appearance information. Inspired by multi-view learning, we develop simple multi-view attentions, treating their outputs as multi-view features. We also propose a multi-view feature high-order fusion method (MHF) to aggregate more accurate and richer features of weak objects. Our MHF extends the commonly used low-order feature fusion method to higher orders. It enhances the model's capacity to capture relevant and complementary information about weak objects. This is achieved by introducing high-order multi-view feature perception and a recursive task-contribution gated selection of multi-view features. The new operation is highly flexible and customizable, and it is compatible with many variants of multi-view feature representations. We conduct extensive experiments on two newly constructed space science datasets and an open, large-scale satellite video dataset. Our MHF serves as a plug-and-play module and significantly improves various vision-transformer-based and convolution-based detection and segmentation models. We achieve state-of-the-art accuracy on both tasks across all three datasets. Our MHF can be a new basic module for visual modeling that effectively represents weak objects in terms of multi-view learning. The code will be available at https://github.com/Kingdroper/MHF.

2606.15253 2026-06-16 cs.CV 新提交

Focus, Align, and Sustain: Counteracting Gradient Dilution in Incremental Object Detection

聚焦、对齐与维持：对抗增量目标检测中的梯度稀释

Aoting Zhang, Dongbao Yang, Chang Liu, Xiaopeng Hong, Yu Zhou

发表机构 * University of Science and Technology of China（中国科学技术大学）

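AI summary: Proposes the FAS framework, which focuses discriminative signals with prior-injected queries, aligns assignments with deterministic anchor distillation, and sustains old-class distributions with manifold-support replay, addressing the performance degradation that gradient dilution causes in incremental object detection.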

Comments Accepted by ICML2026

AI中文摘要

将检测Transformer适应到增量目标检测（IOD）面临系统性挑战，因为基于集合的优化本质上被顺序学习所不稳定。在这项工作中，我们识别出梯度稀释是性能下降的根本原因，其中保留旧知识所需的优化信号逐渐减弱。这种现象表现为保留梯度在幅度、方向和支撑覆盖上的级联侵蚀，由三个紧密耦合的因素驱动：信号分散，其中前景梯度被背景噪声淹没；分配漂移，其中随机查询-目标匹配导致不一致的梯度轨迹；以及支撑衰减，其中保留样本的梯度不足以覆盖旧类特征空间，在新类干扰下削弱决策边界。为对抗此，我们提出FAS，一个统一的框架，在增量学习中聚焦、对齐和维持梯度流。具体地，我们引入注入先验的查询，通过从源头过滤背景干扰来聚焦判别信号。我们进一步提出确定性锚点蒸馏，以对齐查询-目标分配并在不稳定匹配下跨阶段强制执行语义一致性。最后，我们设计流形支撑回放，以维持旧类的分布支撑，对抗持续更新引起的表示侵蚀。大量实验表明，FAS恢复了鲁棒的优化动态，并优于最先进的方法，在具有挑战性的40+10x4增量设置中实现了超过5.0 AP的提升。

英文摘要

Adapting Detection Transformers to Incremental Object Detection (IOD) poses a systemic challenge, as set-based optimization is inherently destabilized by sequential learning. In this work, we identify Gradient Dilution as the root cause of performance degradation, wherein optimization signals required to preserve old knowledge are progressively weakened. This phenomenon manifests as a cascading erosion of preservation gradients in magnitude, direction, and support coverage, driven by three tightly coupled factors: Signal Dispersion, where foreground gradients are overwhelmed by background noise; Assignment Drift, where stochastic query-target matching induces inconsistent gradient trajectories; and Support Attrition, where gradients from retained samples insufficiently cover the old-class feature space, weakening decision boundaries under interference from new classes. To counteract this, we propose FAS, a unified framework that Focuses, Aligns, and Sustains gradient flow throughout incremental learning. Specifically, we introduce prior-injected queries to focus discriminative signals by filtering background interference at the source. We further propose deterministic anchor distillation to align query-target assignments and enforce semantic consistency across stages under unstable matching. Finally, we devise manifold-support replay to sustain distributional support of old classes, counteracting representational erosion induced by continual updates. Extensive experiments show that FAS restores robust optimization dynamics and outperforms state-of-the-art methods, achieving over 5.0 AP improvement in the challenging 40+10x4 incremental setting.

2606.15286 2026-06-16 cs.CV 新提交

Decoupled Motion Representation Learning for Moving Infrared Small Target Detection

解耦运动表示学习用于移动红外小目标检测

Guoyi Zhang, Peiwen Wu, Han Wang, Xiangpeng Xu, Xiaohu Zhang

发表机构 * School of Aeronautics and Astronautics, Sun Yat-sen University（中山大学航空航天学院）

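AI summary: Because the motions of targets, platforms, and backgrounds are highly coupled in dynamic scenes, proposes a decoupled motion representation learning framework in which an explicit motion branch models globally coherent motion, an implicit branch captures local anomalies, and a coherent-motion-guided anomaly reasoning module suppresses false alarms; it significantly outperforms existing methods in complex dynamic scenes.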

AI中文摘要

动态场景中的红外小目标检测仍然具有挑战性，原因是目标、成像平台和动态背景之间的运动高度耦合。现有的多帧方法通常执行隐式时间建模，其中连贯的背景动态主导运动对应学习，导致检测与虚警之间存在固有的权衡。在这项工作中，我们观察到背景运动表现出强烈的全局连贯性，而小目标主要对应稀疏的局部运动异常。此外，许多虚警响应与全局连贯运动模式保持高度一致性，表明它们主要源于连贯的背景动态而非真实目标运动。基于这些观察，我们提出了一种解耦运动表示学习框架用于移动红外小目标检测。具体地，引入显式运动分支，利用预训练的光流先验建模全局连贯运动动态，并采用结构保持的自监督适应策略进行红外运动对应学习。同时，设计了基于可变形特征对齐的隐式运动分支，在连贯运动引导下捕捉目标敏感的局部运动异常。此外，提出了连贯运动引导的局部异常推理模块，在局部运动建模过程中识别并抑制由连贯运动引起的虚假响应。在两个具有挑战性的红外小目标检测基准上的大量实验表明，所提方法在复杂运动的动态场景中持续优于现有最先进方法，同时保持了良好的推理效率。

英文摘要

Infrared small target detection in dynamic scenes remains challenging due to the highly coupled motions among targets, imaging platforms, and dynamic backgrounds. Existing multi-frame methods usually perform implicit temporal modeling, where coherent background dynamics dominate motion correspondence learning, leading to an inherent trade-off between detection and false alarms. In this work, we observe that background motions exhibit strong global coherence, whereas small targets mainly correspond to sparse local motion anomalies. Moreover, many false-alarm responses maintain high consistency with globally coherent motion patterns, indicating that they mainly originate from coherent background dynamics rather than genuine target motions. Based on these observations, we propose a decoupled motion representation learning framework for moving infrared small target detection. Specifically, an explicit motion branch is introduced to model globally coherent motion dynamics using pretrained optical flow priors, together with a structure-preserving self-supervised adaptation strategy for infrared motion correspondence learning. Meanwhile, an implicit motion branch based on deformable feature alignment is designed to capture target-sensitive local motion anomalies under coherent motion guidance. Furthermore, a coherent-motion-guided local anomaly reasoning module is proposed to identify and suppress coherent-motion-induced false responses during localized motion modeling. Extensive experiments on two challenging infrared small target detection benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art approaches, particularly in dynamic scenes with complex motions, while maintaining favorable inference efficiency.

2606.15409 2026-06-16 cs.CV 新提交

Segmentation-based Detection for Efficient Multi-Task Spacecraft Perception

基于分割检测的高效多任务航天器感知

Sivaperuman Muniyasamy, Surendar Devasundaram

发表机构 * University of Arizona（亚利桑那大学）

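AI summary: For the multi-task demands of space vision perception, proposes a lightweight architecture that integrates a MobileNetV3 encoder with a U-Net-style decoder and derives detection boxes from segmentation masks, earning an overall score of 0.9482 and second place in the SPARK 2026 Challenge.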

Comments 8 pages, 2 figures, 6 tables. CVPRW AI4SPACE-SPARK 2026 Challenge Stream-1 First Place Winners. Code is available at https://github.com/sivaastro/segdet-spark

AI中文摘要

基于视觉的感知是空间态势感知以及自主在轨操作（如交会、对接、服务和导航）的基础。然而，该领域的进展受到标注空间图像稀缺以及具有挑战性的视觉域特性（包括剧烈的光照变化、低信噪比和高对比度）的限制。我们针对SPARK 2026挑战赛的Stream 1，该任务要求一个单一模型完成多目标类型的航天器分类、检测和细粒度部件分割。我们提出了一种紧凑架构，集成了MobileNetV3编码器和U-Net风格解码器，结合了计算效率与精确的密集预测。在单航天器场景下，检测通过预测部件掩码的并集解析得到，避免了单独的边界框回归头。我们的方法取得了0.9482的整体排行榜分数，其中分类、检测和分割的任务特定分数分别为1.0000、0.9788和0.8917。所提出的方法在SPARK 2026挑战赛中总体排名第二，表明轻量级编码器-解码器架构能够为实际星载视觉系统提供强大的多任务性能。

英文摘要

Vision-based perception is fundamental to Space Situational Awareness and autonomous on-orbit operations such as rendezvous, docking, servicing, and navigation. However, progress in this area is limited by the scarcity of annotated space imagery and by challenging visual-domain characteristics including severe illumination changes, low signal-to-noise ratio, and high contrast. We address Stream 1 of the SPARK 2026 Challenge, which requires a single model for spacecraft classification, detection, and fine-grained component segmentation across multiple target types. We propose a compact architecture that integrates a MobileNetV3 encoder with a U-Net-style decoder, combining computational efficiency with accurate dense prediction. Detection is derived analytically from the union of predicted component masks, avoiding a separate bounding-box regression head in the single-spacecraft setting. Our method achieved an overall leaderboard score of 0.9482, with task-specific scores of 1.0000 in classification, 0.9788 in detection, and 0.8917 in segmentation. The proposed approach ranked second overall in the SPARK 2026 Challenge, demonstrating that lightweight encoder-decoder architectures can deliver strong multi-task performance for practical onboard space vision systems.
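Deriving a detection box from the union of component masks needs no learned regression head. A minimal sketch follows, assuming binary masks stored as nested lists and a single spacecraft per image; the function name and the (x_min, y_min, x_max, y_max) box convention are illustrative choices.

```python
def bbox_from_masks(masks):
    """Tight box (x_min, y_min, x_max, y_max) around the union of binary masks.

    Returns None when no mask contains a foreground pixel.
    """
    xs, ys = [], []
    for mask in masks:
        for y, row in enumerate(mask):
            for x, value in enumerate(row):
                if value:
                    xs.append(x)
                    ys.append(y)
    if not xs:
        return None
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical 3x4 component masks: a body and a solar panel.
body = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
panel = [[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 1]]
print(bbox_from_masks([body, panel]))
```

Because the box is computed from the masks, any improvement in segmentation quality carries over to detection, at the cost of assuming exactly one object per image.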

2606.15590 2026-06-16 cs.CV 新提交

Unlocking Diffusion Hierarchies: Adaptive Timestep Selection for Zero-Shot Segmentation

解锁扩散层次：自适应时间步选择用于零样本分割

Ramin Nakhli, Mahesh Ramachandran, Luca Ballan

发表机构 * Google（谷歌）

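AI summary: Proposes an adaptive timestep selection mechanism that exploits the hierarchical semantic progression in the denoising process of diffusion models and, together with Contextual Similarity Maps that fuse high-resolution attention with U-Net features, improves zero-shot segmentation.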

AI中文摘要

零样本分割最近通过利用大规模文本到图像扩散模型（如Stable Diffusion）中的丰富视觉先验取得了显著改进。然而，当前的基于扩散的方法常常面临空间分辨率和上下文信息之间的权衡，以及依赖单一静态时间步进行特征提取的限制。为了克服这些挑战，我们的工作引入了两项关键进展。首先，我们的上下文相似度图将高分辨率注意力图与丰富的U-Net编码器特征融合，提供了细粒度且鲁棒的逐像素表示。其次，我们识别出不同扩散模型的去噪过程中存在一种涌现的层次语义进展：表示从早期时间步的部分级抽象过渡到后期阶段的物体级抽象。利用这一洞察，我们引入了一种机制来自适应地为每个像素选择最优时间步。大量实验表明，我们的方法持续优于现有的零样本分割基线，验证了将上下文特征与动态层次时间步选择相结合的有效性。

英文摘要

Zero-shot segmentation has recently shown notable improvement by leveraging the rich visual priors in large-scale text-to-image diffusion models, such as Stable Diffusion. However, current diffusion-based methods often face limitations due to the trade-off between spatial resolution and contextual information, as well as their reliance on a single static timestep for feature extraction. To overcome these challenges, our work introduces two key advancements. First, our Contextual Similarity Maps fuse high-resolution attention maps with rich U-Net encoder features, providing both fine-grained and robust per-pixel representations. Second, we identify an emergent hierarchical semantic progression within the denoising process of various diffusion models: representations transition from part-level abstractions at earlier timesteps to object-level abstractions at later stages. Leveraging this insight, we introduce a mechanism to adaptively select the optimal timestep for each pixel. Extensive experiments demonstrate that our method consistently outperforms existing zero-shot segmentation baselines, validating the efficacy of combining contextual features with dynamic, hierarchical timestep selection.


2606.15786 2026-06-16 cs.CV cs.AI physics.geo-ph New submission

Domain-Guided Prompting of the Segment Anything Model for Seismic Interpretation: The Role of Attributes, Visualization, and Hybrid Prompts

Aniq Ahmad, Heather Bedle, Ahmad Mustafa

Affiliations: School of Geosciences, University of Oklahoma; King Fahd University of Petroleum and Minerals

AI summary: Proposes a zero-shot adaptation framework that pairs geologic-target-aware selection of seismic attributes and colormaps with a hybrid prompting strategy, improving SAM's segmentation accuracy in seismic interpretation without fine-tuning.

Abstract

The advent of large pretrained foundation models for computer vision has significantly improved the efficiency of visual data interpretation. The Segment Anything Model (SAM), in particular, offers powerful zero-shot segmentation capabilities through prompt-based interaction, making it a promising tool for seismic interpretation. However, most existing applications of SAM rely on fine-tuning for specific geological targets, which requires extensive labeled data, incurs high computational cost, and often compromises the model's generalization capability. In this study, we introduce a principled framework for zero-shot adaptation of foundation models to seismic data. The framework is built on two key components: (1) aligning seismic attributes and visualization choices (e.g., colormaps) with the geological target of interest, and (2) employing a hybrid prompting strategy that combines sparse user-defined point prompts with dense mask prompts derived from SAM's internal feature activations. We systematically evaluate this framework across multiple geological targets, datasets, prompt configurations, and seismic attribute representations. Our results demonstrate that geologic-target-aware selection of seismic attributes and colormaps, combined with hybrid prompting, enhances the separability of geological features and improves boundary delineation and segmentation accuracy relative to point-based prompting alone. Our findings show that, when these components are jointly applied, SAM can achieve competitive segmentation performance in a fully zero-shot setting, thereby eliminating the need to retrain SAM for each geologic feature. This work establishes a practical and scalable pathway to leverage foundation models in seismic interpretation, reducing reliance on labeled data while preserving model generality.


2606.16119 2026-06-16 cs.CV New submission

EdgeZSAD: Practical Zero-Shot Anomaly Detection on Edge Devices

Taewan Cho, Andrew Jaeyong Choi

Affiliations: Gachon University; Plaid Labs Inc.

AI summary: Targeting edge-deployment constraints, proposes a compact zero-shot anomaly detection system built on a TinyViT-21M-512 backbone, an asymmetric global-local readout (EdgeGLR), and a reproducible source-side training recipe (Real-IAD-DR); it reaches high accuracy on several industrial benchmarks and is directly deployable.

Abstract

Industrial inspection needs zero-shot anomaly detection (ZSAD) that remains useful under edge deployment constraints. Recent methods often rely on ViT-L foundation backbones (~300M parameters), which exceed the memory and operator budget of typical embedded hardware. We study this regime through EdgeZSAD, a compact reference system built around a TinyViT-21M-512 backbone, an asymmetric global-local readout (EdgeGLR), and a reproducible source-side training recipe (Real-IAD-DR). We train a single checkpoint in a source-trained, target-unseen protocol and evaluate it across six industrial benchmarks. Across three independent runs, the resulting model reaches an average image AUROC of 91.6 on MVTec-AD and 88.2 on VisA, while remaining directly deployable on Jetson Orin Nano Super (TensorRT FP16) and RB5 Gen2 (QNN GPU FP16). Across the six device-rescored benchmarks, image-AUROC drift stays below 0.2 points, indicating that the exported graph preserves host-side ranking behavior in the evaluated deployment setting.
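The image-AUROC drift reported above compares the ranking quality of host-side and on-device anomaly scores. A minimal sketch of such a check, assuming binary image labels and one score per image from each run, is shown below; the FP16 round trip is only a stand-in for a real device export, and all names and numbers are hypothetical.

```python
import numpy as np

def auroc(labels, scores):
    """Rank-based AUROC (Mann-Whitney U statistic); tied scores share an average rank."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(len(scores), dtype=float)
    i = 0
    while i < len(scores):
        j = i
        while j + 1 < len(scores) and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        ranks[order[i:j + 1]] = (i + j) / 2.0 + 1.0  # average 1-based rank of the tie group
        i = j + 1
    n_pos = int(labels.sum())
    n_neg = len(labels) - n_pos
    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)

labels = [0, 0, 1, 1]
host_scores = np.array([0.1, 0.4, 0.35, 0.8])
# Stand-in for a device run: round the host scores through FP16.
device_scores = host_scores.astype(np.float16).astype(float)
drift_points = 100.0 * abs(auroc(labels, host_scores) - auroc(labels, device_scores))
```

AUROC depends only on the ordering of scores, so a deployment that perturbs scores without reordering them shows zero drift, as in this toy case.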


2606.16124 2026-06-16 cs.CV New submission

Training-Free Open-Vocabulary Visual Grounding for Remote Sensing Images and Videos

Ke Li, Di Wang, Yongshan Zhu, Ting Wang, Weiping Ni, Tao Lei, Quan Wang, Xinbo Gao

Affiliations: School of Computer Science and Technology, Xidian University; Interdisciplinary Institute of Artificial Intelligence, Xidian University; School of Artificial Intelligence, Xidian University; Northwest Institute of Nuclear Technology; School of Physics and Information Engineering, Fuzhou University

AI summary: Proposes RSVG-ZeroOV, a training-free framework that uses frozen generic foundation models and an Overview-Focus-Evolve paradigm for zero-shot open-vocabulary remote sensing visual grounding, extends it to spatio-temporal video grounding, and outperforms existing zero-shot methods on multiple benchmarks.

Abstract

Remote sensing visual grounding (RSVG) aims to localize a referred target in a remote sensing image or video according to a natural language expression. Existing RSVG methods usually rely on task-specific manual annotations, which are costly to collect and inevitably limited in covering the diversity of real-world geospatial scenarios. As a result, they often struggle to generalize to open-vocabulary queries involving novel objects, fine-grained attributes, complex spatial relationships, and functional semantics. In this paper, we propose RSVG-ZeroOV, a training-free framework that leverages frozen generic foundation models for zero-shot open-vocabulary RSVG. RSVG-ZeroOV follows an Overview-Focus-Evolve paradigm, which exploits the distinct yet complementary attention patterns of vision-language models (VLMs) and diffusion models (DMs) to progressively generate precise grounding results. Specifically, (i) Overview utilizes a VLM to extract cross-attention maps that capture semantic correlations between the referring expression and visual regions; (ii) Focus leverages the fine-grained modeling priors of a DM to compensate for object structure and shape information often overlooked by VLM attention; and (iii) Evolve introduces a simple yet effective attention evolution module to suppress irrelevant activations, yielding purified object masks. To handle video inputs, we further present Video RSVG-ZeroOV, which extends image-level grounding to spatio-temporal grounding through a query-relevant key-frame selector and a temporal propagator, enabling efficient and temporally coherent video grounding without video annotations or fine-tuning. Extensive experiments on six image and video grounding benchmarks show that RSVG-ZeroOV consistently outperforms existing zero-shot baselines and achieves competitive or superior performance compared with weakly- and fully-supervised methods.


2606.16302 2026-06-16 cs.CV New submission

SPDA-SAM: A Self-Prompted Depth-Aware Segment Anything Model for Instance Segmentation

Yihan Shang, Wei Wang, Chao Huang, Xinghui Dong

Affiliations: State Key Laboratory of Physical Oceanography and the Faculty of Information Science and Engineering, Ocean University of China; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University

AI summary: Proposes SPDA-SAM, which uses a self-prompt module and coarse-to-fine RGB-D fusion to address SAM's reliance on manual prompts and its lack of depth information, outperforming existing methods on 12 datasets.

Abstract

Recently, the Segment Anything Model (SAM) has demonstrated strong generalizability in various instance segmentation tasks. However, its performance depends heavily on the quality of manual prompts. In addition, the RGB images that instance segmentation methods normally use inherently lack depth information. As a result, the ability of these methods to perceive spatial structures and delineate object boundaries is hindered. To address these challenges, we propose a Self-prompted Depth-Aware SAM (SPDA-SAM) for instance segmentation. Specifically, we design a Semantic-Spatial Self-prompt Module (SSSPM), which extracts the semantic and spatial prompts from the image encoder and the mask decoder of SAM, respectively. Furthermore, we introduce a Coarse-to-Fine RGB-D Fusion Module (C2FFM), in which the features extracted from a monocular RGB image are fused with the depth map estimated from it. In particular, the structural information in the depth map provides coarse-grained guidance for feature fusion, while local variations in depth are encoded in order to fuse fine-grained feature representations. To our knowledge, SAM has not been explored in such a self-prompted and depth-aware manner. Experimental results demonstrate that SPDA-SAM outperforms its state-of-the-art counterparts across twelve different datasets. We attribute these promising results to the guidance of the self-prompts and to the coarse-to-fine RGB-D fusion, which compensates for the loss of spatial information.


2604.18866 2026-06-16 cs.CV Updated version

Disagreement-Based Cross-Model Routing for Implicit Video Question Answering

Durga Sandeep Saluru

Affiliations: Independent Researcher

AI summary: To address the single-model accuracy ceiling and the failure of self-consistency strategies in implicit video question answering, proposes a label-free, training-free, disagreement-driven cross-model routing method that sends questions whose samples disagree to a second model, raising AvgAcc by 1.43 on the ImplicitQA benchmark.

Abstract

We study multiple-choice video question answering on the ImplicitQA benchmark, where the correct answer is never explicitly shown but must be inferred from off-screen events, line-of-sight cues, causal structure, and cross-shot spatial layout. On this benchmark a single frontier video LLM already operates near its accuracy ceiling, and we observe that conventional self-consistency strategies -- majority voting across repeated samples of the same model -- can hurt rather than help, because the model's errors on hard questions are correlated. We propose disagreement-based cross-model routing, a pure inference-time procedure that requires no labels and no training. We triple-sample a native-video model (Gemini 3.1 Pro Preview) at temperature zero, exploit the genuine sample-to-sample variance of its video-processing pipeline to identify the roughly 20% subset of questions where the three samples disagree, and route only that subset to a second model from a different family (Claude Opus 4.8) that consumes uniformly sampled frames with adaptive thinking. On the 1001-question validation set with public ground truth -- our main evaluation -- the method improves AvgAcc by +1.43 over the best single sample of the primary model, with per-category gains concentrated on Motion & Trajectory (+5.49), Inferred Counting (+3.45), and Vertical Spatial Reasoning (+1.82) -- the categories most dependent on cross-shot reference resolution. The same pipeline applied to the held-out 172-question CVPR 2026 ImplicitQA challenge test set achieves 82.03 AvgAcc / 79.71 MacroAvgAcc (+1.81 over the best single sample of the primary model), confirming the validation result on an independent split.
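The routing rule can be sketched in a few lines of plain Python. The sketch assumes each model is a callable that maps a question to a multiple-choice letter, and that the second model's answer is used as the final answer whenever the primary samples disagree; the abstract does not spell out that last step, so treat it as an assumption. `make_stub` is a hypothetical helper that stands in for real model calls.

```python
def route_by_disagreement(question, primary, secondary, n_samples=3):
    """Sample the primary model n_samples times; escalate only when the samples disagree.

    Returns (answer, source), where source is "primary" or "secondary".
    """
    samples = [primary(question) for _ in range(n_samples)]
    if len(set(samples)) == 1:
        return samples[0], "primary"        # unanimous: keep the primary answer
    return secondary(question), "secondary"  # disagreement: ask the other model family

def make_stub(answers):
    """Hypothetical stand-in for a model: replays a fixed list of answers."""
    it = iter(answers)
    return lambda question: next(it)

agree = route_by_disagreement("q1", make_stub(["A", "A", "A"]), lambda q: "B")
disagree = route_by_disagreement("q2", make_stub(["A", "C", "A"]), lambda q: "B")
```

Note that a 2-of-3 majority is deliberately not trusted here: any disagreement triggers escalation, which matches the abstract's observation that same-model majority voting can hurt on hard questions.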


2606.14724 2026-06-16 cs.CV cs.AI New submission

VigilFormer: Deformable Attention for Video Anomaly Detection with Causal Risk Inference

Xinze Zhang

Affiliations: University of Southern California

AI summary: Proposes the VigilFormer framework, which combines deformable spatio-temporal attention with causal temporal modeling and uses sparse attention, contrastive multiple-instance learning, and adaptive frame skipping to achieve real-time anomaly detection while maintaining high accuracy.

Abstract

Video anomaly detection in surveillance settings must balance detection accuracy against real-time throughput, a tension that existing methods address either through stronger feature extractors or more efficient architectures, but rarely both. We present VigilFormer, a unified framework that combines deformable spatio-temporal attention with causal temporal modeling to detect anomalies in untrimmed surveillance video. The proposed Deformable Spatio-Temporal Encoder (DSTE) attends to a sparse set of informative locations across frames, avoiding the quadratic cost of dense attention while retaining the ability to capture irregular motion patterns. A Causal Anomaly Classifier (CAC) applies dilated causal convolutions over snippet-level features and optimizes a contrastive multiple-instance learning objective that separates anomalous and normal representations without frame-level labels. To meet deployment constraints, an Adaptive Confidence Scheduler (ACS) dynamically skips low-information frames at inference time, reducing redundant computation in static scenes. Evaluated on UCF-Crime, ShanghaiTech, and CUHK Avenue, VigilFormer achieves AUC scores of 87.83%, 97.21%, and 89.74% respectively, at 41.5 FPS on a single GPU, outperforming recent weakly-supervised methods in both accuracy and speed.
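The dilated causal convolution used by the Causal Anomaly Classifier can be illustrated on a single channel. This is a generic sketch of the operation, not the paper's layer: the real model works on multi-channel snippet features, and the kernel, dilation, and input here are hypothetical.

```python
import numpy as np

def dilated_causal_conv1d(x, w, dilation):
    """Compute y[t] = sum_k w[k] * x[t - k * dilation], treating x as zero before t = 0.

    x: (T,) input sequence, w: (K,) kernel. Output has the same length as x.
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    T, K = len(x), len(w)
    pad = (K - 1) * dilation
    xp = np.concatenate([np.zeros(pad), x])  # left padding only, so no future leakage
    y = np.zeros(T)
    for k in range(K):
        start = pad - k * dilation
        y += w[k] * xp[start:start + T]
    return y

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = dilated_causal_conv1d(x, w=[1.0, 1.0], dilation=2)  # y[t] = x[t] + x[t - 2]

# Causality check: changing the last input must not affect earlier outputs.
x_changed = x.copy()
x_changed[-1] = 100.0
y_changed = dilated_causal_conv1d(x_changed, w=[1.0, 1.0], dilation=2)
```

Stacking such layers with growing dilation widens the temporal receptive field without looking at future snippets, which is what makes the classifier usable in a streaming setting.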


2606.14730 2026-06-16 cs.CV New submission

Hierarchical GRU with Input-Conditioned Slot Queries for Ball Action Anticipation

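Parthsarthi Rawat

Affiliations: GameChanger by Dick’s Sporting Goods

AI summary: Proposes a hierarchical model that uses a local Transformer, a GRU, and an input-conditioned event-slot decoder, combined with frequency-reweighted Hungarian matching and Gaussian soft labels, reaching 17.91% mAP on the SoccerNet benchmark.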

Comments CVPR 2026 SoccerNet Ball Action Anticipation Challenge, Validated Rank 4
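Gaussian soft labels, which the summary of this entry mentions, replace a one-hot event time with a smooth bump so that near-miss predictions are penalized less. The sketch below shows one common construction; the width `sigma`, the max-combination of overlapping events, and the function name are assumptions, since the listing gives no details.

```python
import numpy as np

def gaussian_soft_labels(num_steps, event_times, sigma):
    """Build a (num_steps,) target that peaks at 1.0 on each event time.

    Overlapping bumps are combined with a maximum so the target stays in [0, 1].
    """
    t = np.arange(num_steps, dtype=float)
    target = np.zeros(num_steps)
    for e in event_times:
        bump = np.exp(-0.5 * ((t - e) / sigma) ** 2)
        target = np.maximum(target, bump)
    return target

soft = gaussian_soft_labels(num_steps=10, event_times=[3], sigma=1.0)
```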

2606.14762 2026-06-16 cs.CV cs.AI New submission

Scribby: A Multi-Level LLM Framework for Semantic Video Analysis

Julian Abelarde, Hugo Garrido-Lestache Belinchon

Affiliations: Department of Computer Science and Software Engineering, Milwaukee School of Engineering

AI summary: Proposes an LLM-based video summarization framework that balances macro-level comprehension with micro-level semantic analysis through micro-level indexing (analyzing the full transcript, individual sentences, and semantic groupings) and uses relevance-based heatmaps to visualize semantic chunking and matching.

Abstract

As video content continues to expand across educational platforms, recorded lectures, and live-streamed entertainment, the need for efficient and structured analysis of long-form footage has increased [1]. Although many existing AI programs provide high-level video summaries based on AI-generated transcripts [2, 3, 4, 5], these approaches are often limited to coarse overviews and lack detailed analysis of a video's structure, thematic progression, and semantic relationships, all of which are required for comprehensive video analysis. This paper proposes an LLM-based video summarization framework that balances macro-level comprehension with micro-level semantic analysis [6, 12, 13]. The first stage of the process indexes the video at a micro level by (1) analyzing the full transcript, (2) analyzing individual transcript sentences, and (3) grouping these sentences by semantic similarity using an LLM as a judge [6, 13]. Contextual continuity is retained during sentence-level processing by incorporating both the global transcript analysis and adjacent sentence information into each evaluation prompt. This framework establishes a foundation for video analysis tools that visualize semantic chunking and semantic matching through relevance-based heatmaps. Limitations and future expansions of the framework are also discussed.


2606.14765 2026-06-16 cs.CV cs.AI cs.LG cs.MM New submission

Momentum-Guided Semantic Forecasting (MoFore) for Self-Supervised Video Representation Learning

Qinwu Xu

Affiliations: Qinwu Xu, PhD

AI summary: Proposes the MoFore framework, which performs self-supervised video representation learning by forecasting future latent embeddings, adds contrastive regularization to prevent representation collapse, and verifies temporal consistency and semantic structure on UCF101.

Comments 13 pages, 5 Figures, and 2 Tables

Abstract

Self-supervised video representation learning has recently advanced through contrastive learning, masked reconstruction, and predictive representation learning. Reconstruction-based approaches such as MAE and VideoMAE learn representations by recovering masked visual content (He et al., 2022; Tong et al., 2022), while contrastive methods such as CLIP learn semantically meaningful embedding spaces through representation alignment (Radford et al., 2021). In this work, we introduce a Momentum-Guided Semantic Forecasting framework (MoFore) for self-supervised video representation learning. Instead of optimizing for pixel-level reconstruction or task-specific semantic alignment, the proposed method learns temporally predictive video representations by forecasting future latent embeddings from temporally distant context clips. To improve robustness across temporal scales, we further introduce randomized temporal-gap forecasting during training. The framework combines predictive latent forecasting with contrastive regularization to encourage temporal consistency while preventing representation collapse. Experiments on the UCF101 dataset demonstrate that the proposed framework learns temporally consistent and semantically meaningful video representations without using action labels during training. Quantitative analysis shows strong temporal stability and emergent category-level structure in the learned embedding space, while qualitative retrieval experiments reveal motion-aware organization across related activities. Overall, the results suggest that long-range latent forecasting provides an effective and computationally efficient approach for self-supervised video representation learning without relying on reconstruction-based objectives.
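The combination of latent forecasting and contrastive regularization can be sketched with an InfoNCE-style loss, where each predicted future embedding should match its own target and not the other targets in the batch. The abstract does not give the exact loss, so the InfoNCE form, the temperature, and the uniform gap sampler below are assumptions, and all names are hypothetical.

```python
import numpy as np

def info_nce(pred, target, temperature=0.1):
    """Contrastive loss: row i of pred should be closest to row i of target.

    pred, target: (B, D) arrays of predicted and actual future embeddings.
    """
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    target = target / np.linalg.norm(target, axis=1, keepdims=True)
    logits = pred @ target.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def sample_gap(rng, min_gap, max_gap):
    """Draw a random temporal gap (in clips) between context and forecast target."""
    return int(rng.integers(min_gap, max_gap + 1))

rng = np.random.default_rng(0)
target = rng.normal(size=(4, 8))
loss_matched = info_nce(target, target)            # perfect forecasts
loss_mismatched = info_nce(target, target[::-1])   # forecasts paired with wrong targets
gap = sample_gap(rng, min_gap=2, max_gap=8)
```

The other targets in the batch act as negatives, which is one standard way to discourage the collapsed solution where every clip maps to the same embedding.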


2606.14778 2026-06-16 cs.CV cs.AI New submission

FactCheck: Feasibility-aware Long-term Action Anticipation with Multi-agent Collaboration

Rui Cao, Jiannong Cao, Bo Yuan, Zhiyuan Wen, Mingjin Zhang

Affiliations: The Hong Kong Polytechnic University; China Mobile

AI summary: Proposes FactCheck, a multi-agent framework with a closed-loop "Observe-Plan-Verify" mechanism that verifies feasibility against a History Action Graph and outperforms existing methods on EPIC-Kitchens-55 and EGTEA Gaze+.

Abstract

Long-term action anticipation (LTA) aims to predict an ordered sequence of future verb-noun actions from a partially observed video. While this task serves as the foundation for embodied intelligence, anticipating physically feasible long-term actions remains a critical challenge. Existing methods, which operate in an open-loop manner, often hallucinate non-existent objects, violate object affordances, or disregard object states, as they lack explicit mechanisms to verify action feasibility against the physical environment. To address this, we propose FactCheck, a novel multi-agent collaboration framework that improves feasibility through a closed-loop "Observe-Plan-Verify" mechanism. FactCheck decomposes the complex LTA task into specialized roles: an Observer that recognizes historical actions from video observations and constructs a dual-form structured memory, comprising a History Action Abstract that captures high-level human intentions and environmental status, and a History Action Graph that encodes object states and temporal dependencies; a Planner that generates draft future actions conditioned on both low-level historical actions and high-level History Action Abstract; and a Verifier that rigorously validates the draft against the History Action Graph and refines infeasible actions. Extensive experiments on the EPIC-Kitchens-55 and EGTEA Gaze+ benchmarks demonstrate that FactCheck consistently outperforms state-of-the-art methods. Our work establishes a new paradigm for feasibility-aware long-term action anticipation, effectively closing the loop of action recognition, action prediction and action verification.


2606.15200 2026-06-16 cs.CV New submission

Keep It in Mind: User Centric Continual Spatial Intelligence Reasoning in Egocentric Video Streams

Yun Wang, Junbin Xiao, Han Lyu, Yifan Wang, Jing Zuo, Zhanjie Zhang, Hong Huang, Dapeng Wu, Angela Yao

Affiliations: University of Science and Technology of China

AI summary: Proposes the UCS-Bench dataset and the DirectMe framework, which incrementally builds a structured spatial memory to align dynamic spatial reasoning and long-term memory with the user's real-time location in egocentric video streams, significantly improving the spatial reasoning of multimodal large language models.

Comments 45 pages. https://icml.cc/virtual/2026/poster/63682

Journal ref: ICML 2026

Abstract

We introduce UCS-Bench, a dataset spanning 170+ hours of egocentric visual observations with 8.1K+ timestamped questions for diagnosing User-Centric Continual Spatial intelligence in egocentric video streams. UCS-Bench targets a new problem that emphasizes dynamic spatial reasoning, long-term memory, and their alignment with users' real-time locations. We propose DirectMe, a framework that incrementally constructs and maintains a structured spatial memory from streaming egocentric observations. DirectMe enables robust tracking and recall of object locations, all relative to the user's movement over time. By tightly coupling visual perception with memory updates and spatial reasoning, our approach supports long-horizon queries that require recalling interactions, resolving viewpoint-induced ambiguities, and adapting to dynamic scenes. Our experiments show that DirectMe significantly improves the spatial reasoning of leading multimodal LLMs; it also surpasses many spatially aware and long-form streaming video models. We hope our benchmark and solution will advance spatial intelligence research for egocentric AI assistants. Data and code are available at https://github.com/cocowy1/UCS-Bench.


2606.15275 2026-06-16 cs.CV New submission

MamBOA: State-Space Architecture for Video Recognition

Mustafa Bora Çelik

Affiliations: Ankara Medipol University

AI summary: Proposes the MamBOA framework, which uses an interleaved scan structure to turn the selective state-space recurrence (S6) into a motion synthesizer that encodes motion from consecutive features extracted by a backbone, enabling efficient temporal modeling for fine-grained action recognition.

Comments 15 pages, 7 figures. Code available at https://github.com/BOA-clk/MamBOA

Abstract

Fine-grained action recognition demands temporal reasoning that general-purpose architectures address through different cost-accuracy tradeoffs: 3D dense operators couple computation to the input volume, while difference-based methods approximate motion through rigid, hand-crafted subtraction of uncontextualized features - each reflecting a deliberate design choice with corresponding limitations in expressiveness or flexibility. We present MamBOA, a backbone-agnostic temporal framework built upon a novel interleaved scan structure that recasts the selective state-space recurrence (S6) as a native motion synthesizer. By interleaving consecutive feature representations extracted from a pretrained backbone into a single alternating sequence, the proposed scan structurally drives the recurrence to encode both temporal observations of each position within a shared hidden state, separated by only a single decay step - rendering the inter-frame transition an intrinsic component of the state dynamics rather than an externally computed quantity. A cascade of dedicated alignment and decoding operations then distills this joint encoding into an explicit motion representation, which a dual-path pooling mechanism adaptively aggregates by balancing attention-driven selection with uniform temporal coverage. The framework interfaces seamlessly with CNN, Transformer, and Mamba backbone families, adding only ~2.1 GFLOPs per feature pair. On Diving48, MamBOA achieves 85.02% Top-1 accuracy with an image-pretrained backbone and 86.24% with a video-pretrained backbone processing the entire video in a single forward pass - demonstrating that structurally induced state-space dynamics constitute a principled and general foundation for motion modeling.
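The interleaving step can be sketched with array slicing: the tokens of two consecutive frames are merged so that each spatial position's two observations sit next to each other in the sequence. The sketch assumes per-frame features of shape (N, D) and covers only the interleave, not the S6 recurrence or the decoding stages; names and shapes are hypothetical.

```python
import numpy as np

def interleave_pair(feat_t, feat_t1):
    """Merge two (N, D) feature maps into one (2N, D) alternating sequence.

    Output order: pos0@t, pos0@t+1, pos1@t, pos1@t+1, ...
    """
    n, d = feat_t.shape
    out = np.empty((2 * n, d), dtype=feat_t.dtype)
    out[0::2] = feat_t    # even slots hold the earlier frame
    out[1::2] = feat_t1   # odd slots hold the later frame
    return out

frame_a = np.zeros((3, 2))  # toy features for frame t
frame_b = np.ones((3, 2))   # toy features for frame t+1
sequence = interleave_pair(frame_a, frame_b)
```

With this ordering, a recurrent scan reaches the later observation of a position exactly one step after the earlier one, which is the "single decay step" separation the abstract describes.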


2606.15320 2026-06-16 cs.CV New submission

Conditional Multi-Event Temporal Grounding in Long-Form Video

Yuanhao Zou, Arthad Kulkarni, Lucas Tonanez, Lincoln Spencer, Guangyu Sun, Tianxingjian Ding, Andong Deng, Yi Li, Shuangjun Liu, Yuan Li, Dashan Gao, Ning Bi, Taotao Jing, Shuai Zhang, Chen Chen

Affiliations: University of Central Florida; Qualcomm AI Research

AI summary: Proposes the CoMET-Bench benchmark and the CoMET-Agent framework for localizing all events that satisfy compositional temporal and spatial conditions in long-form video, improving F1@0.5 by 6.1%.

Abstract

Multimodal large language models have made rapid progress in video temporal grounding, yet real-world applications routinely require localizing every event that satisfies compositional temporal and spatial conditions. Existing benchmarks fall short: they localize only a single moment per query, count without temporal conditions, or treat grounding and counting as disjoint tasks. We introduce CoMET-Bench for Conditional Multi-Event Temporal Grounding in long-form video, comprising 2789 queries over 600 videos averaging 33.8 minutes across five real-world domains, with each query composed from 4 temporal conditions, 3 spatial conditions, and a dedicated negative-query subset. We further propose a unified evaluation protocol jointly measuring counting, grounding, and negative-query recognition, including a new Rejection-F1 metric that prevents trivial gaming by lazy "always-empty" models. Benchmarking a broad suite of MLLMs, agent-based, and grounding-specialized methods reveals that existing approaches remain far from solving this task. Building on these findings, we propose CoMET-Agent, a training-free agentic framework that reformulates the task as structured search-and-aggregate, improving F1@0.5 by 6.1% over GPT-5 purely through structural reasoning. Failure analysis further surfaces three open directions: fine-grained entity tracking, position-uniform retrieval, and causal event pairing.
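An F1@0.5 score for multi-event grounding can be sketched as one-to-one matching of predicted and ground-truth segments at a temporal IoU threshold. The greedy matching and the convention that an empty prediction for an empty ground truth scores 1.0 are assumptions; the benchmark's own protocol, including its Rejection-F1 metric, is not specified in the abstract.

```python
def temporal_iou(a, b):
    """IoU of two (start, end) segments in seconds."""
    inter = max(0.0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0

def f1_at_iou(pred, gt, thresh=0.5):
    """F1 with greedy one-to-one matching at IoU >= thresh."""
    if not pred and not gt:
        return 1.0  # assumed convention: correctly predicting "no event"
    matched = set()
    true_pos = 0
    for p in pred:
        best_j, best_iou = None, thresh
        for j, g in enumerate(gt):
            if j in matched:
                continue
            iou = temporal_iou(p, g)
            if iou >= best_iou:
                best_j, best_iou = j, iou
        if best_j is not None:
            matched.add(best_j)
            true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)

gt_events = [(0.0, 10.0), (20.0, 30.0)]
pred_events = [(1.0, 9.0), (21.0, 40.0), (50.0, 60.0)]
score = f1_at_iou(pred_events, gt_events)
```

In the toy example only the first prediction clears the threshold (IoU 0.8); the second overlaps its target with IoU 0.45 and the third matches nothing, so both count as false positives.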


2606.15417 2026-06-16 cs.CV New submission

From Frames to Temporal Graphs: In-Context Egocentric Action Recognition with Vision-Language Models

Bessie Dominguez-Dager, Francisco Gomez-Donoso, Miguel Cazorla, Marc Pollefeys, Daniel Barath, Zuria Bauer

Affiliations: University of Alicante; ETH Zürich; Microsoft

AI summary: Proposes converting videos into Temporal Action Graphs by generating natural-language narratives through multi-stage prompting and then structuring them, which enables in-context learning and substantially improves few-shot action recognition over zero-shot inference on EGTEA and Epic-Kitchens-100.

Abstract

Action reasoning in egocentric video requires capturing fine-grained transitions of hand-object interactions, a task where general-purpose Vision-Language Models (VLMs) often struggle when operating directly on raw pixels. We propose to decouple visual perception from symbolic reasoning by converting videos into Temporal Action Graphs. In a multi-stage prompting pipeline, we first generate dense natural language narratives over short temporal windows as a semantic bottleneck, then formalize them into structured, open-vocabulary graph representations. On the EGTEA and Epic-Kitchens-100 datasets, the symbolic representation unlocks efficient in-context learning: few-shot graph demonstrations yield substantial accuracy gains over zero-shot frame and graph-based inference alike. Even in the zero-shot setting, graph-based reasoning remains competitive with pixel-based inference despite potential pretraining contamination favoring the latter. Across 11 open-weight VLMs from 6 model families ranging from 2B to 235B parameters, our findings indicate that current VLMs are more effective as symbolic reasoners than as direct visual observers. By projecting video into the language domain, we provide a scalable, fine-tuning-free alternative to end-to-end approaches that better leverages these models' latent reasoning strengths. The code will be made public.


2606.15486 2026-06-16 cs.CV New submission

ST-DiffEye: Diffusion-based Continuous Gaze Generation via Joint Scanpath-Trajectory Modeling

Brian Nlong Zhao, Ozgur Kara, Junho Kim, James M. Rehg

Affiliations: University of Illinois Urbana-Champaign

AI summary: Proposes ST-DiffEye, a joint trajectory-scanpath diffusion framework that models the two modalities together by concatenating them as additional input channels, and introduces an evaluation method based on the Continuous Ranked Probability Score (CRPS). It reaches state-of-the-art performance on visual search and free-viewing tasks.

Abstract

We study the problem of human gaze modeling, which aims to generate the gaze patterns a viewer produces while observing a visual stimulus. Gaze is primarily captured through two modalities: continuous eye-tracking trajectories, which describe fine-grained motion dynamics, and discrete scanpaths, which describe high-level fixation structure. Because gaze varies substantially across viewers and trials, we treat this variability as a defining property rather than noise and model gaze as a stochastic generative process. Existing generative gaze models supervise on only one of these two representations in isolation. We hypothesize that trajectories and scanpaths describe gaze at complementary scales and are jointly informative during training, and test this hypothesis through ST-DiffEye, a joint trajectory-scanpath diffusion framework that couples both modalities by concatenating them as an additional raw input channel, requiring no architectural overhead beyond an input and output channel expansion. We further introduce a principled evaluation framework based on the Continuous Ranked Probability Score (CRPS), which generalizes any existing sequence similarity metric into a proper scoring rule that jointly assesses the accuracy and diversity of generated gaze. Experiments on task-driven visual search, covering both target-present and target-absent scenarios, and on free-viewing benchmarks demonstrate state-of-the-art performance. These results, along with detailed ablations, confirm the benefit of joint modeling and the value of distribution-aware evaluation in capturing the intrinsic variability of human gaze. Project webpage: https://st-diffeye.github.io/
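As a rough illustration of the CRPS-based evaluation mentioned above, the sketch below uses the sample-based energy form, CRPS ≈ E d(X, y) − ½ E d(X, X′), with a pluggable distance d. The abstract does not say which estimator or distance the authors use, so the function names (`sample_crps`, `mean_point_distance`) and the choice of mean point-wise Euclidean distance are assumptions made only for this example.

```python
import math

def mean_point_distance(seq_a, seq_b):
    """Hypothetical sequence metric: mean Euclidean distance between paired 2D points."""
    return sum(math.hypot(ax - bx, ay - by)
               for (ax, ay), (bx, by) in zip(seq_a, seq_b)) / len(seq_a)

def sample_crps(samples, observed, dist):
    """Energy-form estimate: mean distance to the observation minus half the mean
    pairwise distance among generated samples. Lower is better."""
    n = len(samples)
    accuracy = sum(dist(s, observed) for s in samples) / n
    spread = sum(dist(a, b) for a in samples for b in samples) / (n * n)
    return accuracy - 0.5 * spread

observed = [(1.0, 0.0)]
diverse = [[(0.0, 0.0)], [(2.0, 0.0)]]    # two samples bracketing the observation
collapsed = [[(0.0, 0.0)], [(0.0, 0.0)]]  # two identical samples

print(sample_crps(diverse, observed, mean_point_distance))    # 0.5
print(sample_crps(collapsed, observed, mean_point_distance))  # 1.0
```

Both ensembles are equally far from the observation on average, but the diverse one scores lower because the spread term rewards covering the variability, which is the joint accuracy-and-diversity behaviour the abstract describes.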


2606.15527 2026-06-16 cs.CV cs.AI New submission

Selective Synergistic Learning for Video Object-Centric Learning

WonJun Moon, Jae-Pil Heo

Affiliations: KAIST; Sungkyunkwan University

AI summary: Proposes Selective Synergistic Learning (SSync), which selectively distills reliable cues through linear-complexity pseudo-labeling, avoiding error propagation, improving video object decomposition quality, and serving as a plug-and-play module.

Abstract

Typical video object-centric learning (VOCL) approaches employ slot-based frameworks that rely on reconstruction-driven encoder-decoder architectures, where learning is mediated by two spatial maps: attention maps from the encoder and object maps from the decoder. As these two distinct maps exhibit different properties, a recent dense alignment strategy attempted to reconcile this discrepancy by enforcing agreement across all spatio-temporal patches via contrastive learning. However, this indiscriminate alignment inadvertently propagates the inherent weaknesses of each module, such as noisy encoder predictions and blurred decoder boundaries. Moreover, computing dense similarities across all pairs incurs a computational cost quadratic in the total number of spatio-temporal patches, severely limiting scalability. Motivated by this, we propose Selective Synergistic Learning (SSync). Instead of exhaustive patch-to-patch alignment, SSync prevents error propagation by selectively distilling only the most reliable cues: leveraging the encoder strictly for boundary refinement and the decoder for interior denoising. This is realized via pseudo-labeling with linear complexity, eliminating the need for quadratic spatial comparisons. In addition, to prevent the reinforcement of architectural biases such as slot redundancy, we introduce transitive pseudo-label merging, which consolidates overlapping slots based on spatio-temporal activation consistency. Extensive studies demonstrate that SSync improves decomposition quality and serves as a versatile, plug-and-play module while also exhibiting exceptional robustness to slot configurations. Code is available at github.com/wjun0830/SSync.
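The transitive merging of overlapping slots can be pictured with a small union-find sketch. This is not the authors' procedure: representing each slot as a set of activated patch indices, measuring overlap by IoU, and the 0.5 threshold are all assumptions chosen to make the idea of transitivity concrete.

```python
def merge_slots(slot_masks, iou_threshold=0.5):
    """Group slots whose activation sets overlap, transitively.

    slot_masks: list of sets of activated spatio-temporal patch indices (assumed format).
    Returns a sorted list of groups, each a list of slot indices.
    """
    parent = list(range(len(slot_masks)))

    def find(i):
        # Follow parent links to the group root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in range(len(slot_masks)):
        for b in range(a + 1, len(slot_masks)):
            inter = len(slot_masks[a] & slot_masks[b])
            union = len(slot_masks[a] | slot_masks[b])
            if union and inter / union >= iou_threshold:
                parent[find(b)] = find(a)  # merge the two groups

    groups = {}
    for i in range(len(slot_masks)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

masks = [{1, 2, 3, 4}, {2, 3, 4, 5}, {3, 4, 5, 6}, {10, 11}]
print(merge_slots(masks))  # [[0, 1, 2], [3]]
```

Slots 0 and 2 overlap too little to merge directly (IoU 2/6), but both overlap slot 1 (IoU 3/5), so all three end up in one group. The loop here compares pairs of slots, not pairs of patches, so it stays cheap when the slot count is small.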


2606.15992 2026-06-16 cs.CV New submission

Multi-Task Tennis Stroke Biomechanics Analysis Using MediaPipe Pose

Jigyashman Hazarika

Affiliations: Kaggle

AI summary: Proposes a multi-task pipeline that automatically recognizes stroke type, predicts shot direction, and grades posture quality from RGB video, with a rule-based feedback layer that provides coaching tips. In the cross-player test, stroke-type accuracy drops by only 0.8%.

Comments: 14 pages, 9 figures

Abstract

We built a multi-task pipeline for tennis stroke biomechanics from plain RGB video. On top of pose-based stroke recognition, it adds two new tasks, predicting shot direction and grading posture quality, plus a rule-based feedback layer that suggests coaching tips. Strokes are found automatically using a weighted joint velocity score, s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder, removing the need for manual annotation. Pose comes from MediaPipe Pose Landmarker (33 landmarks, metric world coordinates), with each stroke turned into a 30-frame by 39-feature sequence for TennisTransformerGPU, a compact 564,103-parameter transformer (4 layers, 4 heads, d=128) with three parallel output heads. Trained on 1,281 labeled strokes from 7 pros and 1 amateur across 11 videos, it hits 83.7% stroke-type accuracy, 61.9% on direction, and 62.6% on posture under a random 80/20 split. The interesting test is cross-player: train on pros, evaluate on the amateur. Stroke type barely budges, 82.9%, a 0.8% drop. Direction prediction does not transfer; it just falls back to the majority class. An ablation shows why world coordinates matter so much here: switching to image-space landmarks tanks cross-player stroke-type accuracy from 83% to 47% and direction from 68% to 21%. Everything runs on Kaggle's free T4 GPU tier and is fully reproducible.
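The weighted joint velocity score quoted in the abstract can be sketched as follows. Only the weights 0.5, 0.3, and 0.2 come from the abstract; the peak-picking rule, the `threshold` and `min_gap` parameters, and the function names are assumptions for illustration, since the abstract does not say how strokes are located from s(t).

```python
def stroke_score(v_wrist, m_elbow, m_shoulder):
    """Per-frame score s(t) = 0.5 v_wrist + 0.3 m_elbow + 0.2 m_shoulder."""
    return [0.5 * w + 0.3 * e + 0.2 * s
            for w, e, s in zip(v_wrist, m_elbow, m_shoulder)]

def detect_strokes(score, threshold, min_gap):
    """Assumed detector: local maxima at or above `threshold`,
    at least `min_gap` frames after the previously kept peak."""
    peaks = []
    for t in range(1, len(score) - 1):
        is_peak = (score[t] >= threshold
                   and score[t] > score[t - 1]
                   and score[t] >= score[t + 1])
        if is_peak and (not peaks or t - peaks[-1] >= min_gap):
            peaks.append(t)
    return peaks

# Toy per-frame motion magnitudes with two bursts of activity.
v_wrist    = [0, 1, 4, 1, 0, 0, 1, 5, 1, 0]
m_elbow    = [0, 1, 2, 1, 0, 0, 1, 2, 1, 0]
m_shoulder = [0, 0, 1, 0, 0, 0, 0, 1, 0, 0]

score = stroke_score(v_wrist, m_elbow, m_shoulder)
print(detect_strokes(score, threshold=2.0, min_gap=3))  # [2, 7]
```

Each detected frame index could then serve as the centre of the fixed-length window that is turned into a feature sequence, though how the window is placed is not stated in the abstract.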


2606.16342 2026-06-16 cs.CV New submission

When the Past Matters: FlashBack Memory for Precipitation Nowcasting

Yuhao Du, Boxiao Huang, Chengrong Wu, Jiankai Zhang

Affiliations: College of Atmospheric Sciences, Lanzhou University; Fuqua School of Business, Duke University; Department of Computer Science, University of Manchester; Supercomputing Center of Lanzhou University

AI summary: Proposes the FlashBack Memory module, which dynamically retrieves key historical states and fuses them adaptively, strengthening the spatiotemporal representation of recurrent models and significantly improving the accuracy and temporal consistency of high-resolution precipitation forecasts.

Abstract

Accurate precipitation nowcasting is crucial for disaster mitigation and socio-economic planning, yet existing methods often struggle with false alarms, missed events, and long-range dependency modeling at high spatiotemporal resolution. To address these challenges, we propose FlashBack Memory (FB), a module that dynamically retrieves key historical states and integrates them via an adaptive fusion gate, enhancing the spatiotemporal representation capability of recurrent models. We incorporate FB into PredRNN, PredRNNpp, MIM, MotionRNN, and PredRNN-V2, and evaluate on the CIKM2017, Shanghai2020, and SEVIR datasets. Experimental results demonstrate that FB significantly improves MSE, MAE, SSIM, and CSI metrics, particularly for high-intensity rainfall and long-sequence predictions, while reducing false alarms and missed events and enhancing temporal consistency and spatial localization. The proposed method provides a general and efficient memory enhancement mechanism, improving the overall performance of recurrent precipitation nowcasting models.
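Because the abstract frames its gains in terms of false alarms, missed events, and CSI, a short sketch of the Critical Success Index may help: CSI = hits / (hits + misses + false alarms), computed after thresholding forecast and observation. The flat-list input format and the threshold of 10 in the example are assumptions, not values from the paper.

```python
def csi(pred, obs, threshold):
    """Critical Success Index over paired forecast/observation values.

    A cell counts as an event when its value is at or above `threshold`.
    """
    hits = misses = false_alarms = 0
    for p, o in zip(pred, obs):
        p_event, o_event = p >= threshold, o >= threshold
        if p_event and o_event:
            hits += 1
        elif o_event:
            misses += 1          # observed event that was not forecast
        elif p_event:
            false_alarms += 1    # forecast event that did not occur
    denom = hits + misses + false_alarms
    return hits / denom if denom else float("nan")

pred = [0, 5, 12, 30, 8]
obs  = [0, 11, 15, 2, 9]
print(csi(pred, obs, threshold=10))  # 0.333... (1 hit, 1 miss, 1 false alarm)
```

Correct non-events do not enter the score, which is why CSI is sensitive to exactly the two failure modes the abstract highlights.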


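2606.16353 2026-06-16 cs.CV cs.AI New submission

Behavioral Fingerprinting of Vision Encoders in Image-to-Image Generative Models: Classifying Six Commercial APIs by Training Paradigm

Hunter Hill

Affiliations: H. Hill

AI summary: Tests six commercial image-to-image AI systems with a content-adaptive, sub-JND adversarial perturbation pipeline. Measured by DINOv2 ViT-B/14 token distances, edit-trained models and T2I base models adapted at sampling time form two distinct behavioral bands in a 2D plane.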

2606.14792 2026-06-16 cs.CV cs.AI New submission

Efficient Reinforcement for Visual-Textual Thinking with Discrete Diffusion Model

Yoonjeon Kim, Yuhta Takida, Chieh-Hsin Lai, Eunho Yang, Yuki Mitsufuji

Affiliations: KAIST; Sony AI; AITRICS; Sony Group Corporation

AI summary: Proposes replacing autoregressive models with discrete diffusion models for multimodal reinforcement learning, reducing computation through localized visual editing, and designs a factorized reward assignment strategy to address cross-modal interference.

Abstract

RL-based post-training has been widely adopted to enable interleaved visual and textual reasoning in unified multimodal models capable of both text and image generation. However, most existing approaches are built upon autoregressive (AR) unified models, which require full image regeneration during visual reasoning. In this work, we demonstrate that multimodal discrete diffusion models are effective alternatives to AR models for reinforcement learning in interleaved reasoning, owing to their ability to perform efficient visual rollouts via localized visual editing rather than full image-token regeneration. This reduces rollout computation during GRPO by 26.9% compared to AR baselines, with minimal performance drop. Despite the improved efficiency, we find that joint reward assignment, which employs a shared reward signal across modalities, introduces cross-modal interference between unrelated image and text token sequences during RL updates. To address this issue, we propose factorized reward assignment, a strategy that assigns rewards independently to text and vision segments. With factorized reward assignment, our RL approach achieves an 11.2% improvement over joint reward assignment and a 38.04% improvement over the base model.
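The difference between joint and factorized reward assignment can be sketched with GRPO-style group-normalized advantages. The abstract states only that rewards are assigned independently to text and vision segments; the per-rollout reward lists, the token tagging scheme, and the function names below are assumptions for this example.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantage: reward centred and scaled within the rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def factorized_token_advantages(segment_tags, text_rewards, vision_rewards):
    """Give each token the advantage of its own modality.

    segment_tags[i] lists "text" or "vision" for every token of rollout i (assumed format).
    """
    text_adv = group_advantages(text_rewards)
    vision_adv = group_advantages(vision_rewards)
    return [[text_adv[i] if tag == "text" else vision_adv[i] for tag in tags]
            for i, tags in enumerate(segment_tags)]

# Two rollouts: the first has better text and worse image, the second the opposite.
tags = [["text", "vision"], ["text", "vision"]]
adv = factorized_token_advantages(tags, text_rewards=[1.0, 0.0], vision_rewards=[0.0, 1.0])
print(adv)  # text token of rollout 0 is positive, its vision token negative
```

Under a joint scheme both rollouts would receive the same summed reward and therefore zero advantage everywhere, so the good text of the first rollout and the good image of the second would go unrewarded. Separating the signals is one way to read the cross-modal interference the abstract describes.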


2606.14972 2026-06-16 cs.CV New submission

GeoStream: Towards Streaming Video Generation with Precise Camera Control

Yizhou Zhao, Yifan Wang, Xiaoyuan Wang, Yushu Wu, Hao Zhang, Moayed Haji-Ali, Rameen Abdal, Ashkan Mirzaei, Yanyu Li, Willi Menapace, Laszlo Jeni, Sergey Tulyakov, Peter Wonka, Chaoyang Wang

Affiliations: CMU; Northeastern University; UIUC; Rice University; Snap Inc.; KAUST

AI summary: Proposes the GeoStream framework, which uses a self-refreshing 3D cache and on-policy distillation to achieve precise metric-scale camera control in autoregressive streaming video generation, addressing the control failure and distribution shift that existing methods suffer when the viewpoint moves.

Abstract

Accurate interactive camera control is essential for video-based world models, but most existing approaches learn camera motion implicitly, leading to inaccurate control under out-of-distribution trajectories. Explicit geometric conditioning improves controllability, but existing methods are non-autoregressive and rely on a static 3D cache built from an initial frame, which becomes ineffective once the viewpoint moves beyond the original frustum. We propose GeoStream, a framework that enables precise metric-scale camera control in autoregressive streaming video generation. Our method maintains a self-refreshing 3D cache that is periodically updated online from the model's own outputs: we estimate depth from the most recently generated frame, unproject to 3D, and reproject into the target view to produce point reprojections as geometric conditioning for subsequent synthesis. By the same principle, the conditioning seen during training is also rendered from the student's own generated frames, yielding a fully on-policy distillation that naturally aligns the train and inference conditioning distributions. Unlike prior work that uses off-policy condition noising, our approach trains the model against the exact error distribution it encounters at inference, mitigating both standard autoregressive drift and the second-order geometric feedback loop that arises when the cache itself is derived from generated outputs. Quantitative and qualitative results show that our approach substantially improves camera controllability.
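The unproject-then-reproject step behind the 3D cache can be sketched with a plain pinhole camera model. The intrinsics (`fx`, `fy`, `cx`, `cy`), the convention that `R` and `t` map source-camera coordinates to target-camera coordinates, and the function names are assumptions for illustration; the abstract does not specify the camera model or the depth estimator.

```python
def unproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with metric depth to a 3D point in the source camera frame."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)

def reproject(point, R, t, fx, fy, cx, cy):
    """Move a source-frame point into the target camera frame and project it.

    Returns None when the point lies behind the target camera.
    """
    x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
    if z <= 0:
        return None
    return (fx * x / z + cx, fy * y / z + cy)

fx = fy = 100.0
cx = cy = 50.0
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

p = unproject(50.0, 50.0, 2.0, fx, fy, cx, cy)  # centre pixel, 2 m away
print(p)                                         # (0.0, 0.0, 2.0)

# Target camera shifted 0.5 m to the right: the point appears 0.5 m to the left.
print(reproject(p, identity, (-0.5, 0.0, 0.0), fx, fy, cx, cy))  # (25.0, 50.0)
```

Applying this to every pixel of the latest generated frame yields the point reprojections that condition the next frames. Because the depth itself comes from generated content, errors can feed back into the cache, which is the second-order loop the abstract says on-policy training is meant to mitigate.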


2606.15188 2026-06-16 cs.CV New submission

Adaptive Inference-Time Scaling via Early-Step Latent Verification for Image Editing

Yue Yu, Yang Jiao, Jiayu Wang, Qi Dai, Jingjing Chen

Affiliations: College of Computer Science and Artificial Intelligence, Fudan University; Microsoft Research Asia; Institute of Trustworthy Embodied AI, Fudan University

AI summary: Proposes the VeriLatent framework, which verifies initial noise at an early step using latent-space editing activation maps, enabling adaptive inference-time scaling and improving image editing quality and efficiency.

Abstract

Instruction-based image editing has made notable progress with recent advances in generative models. However, the quality of the edited result is still influenced by the randomly sampled initial noise, particularly in complex editing scenarios. An unsuitable initial noise may lead to unsatisfactory editing results. Recent inference-time scaling methods address this issue by sampling multiple initial noises and selecting better candidates. Nevertheless, most of them follow a decode-then-verify scheme which introduces an efficiency-accuracy trade-off. When decoding is performed after limited inference steps, the decoded images often remain too noisy for reliable assessment, whereas sufficiently denoised images require much higher computational cost. To address this issue, we propose VeriLatent, a plug-and-play adaptive inference-time scaling framework with early-step latent verification for image editing. Specifically, we propose a novel verifier that scores each initial noise through a latent-space editing activation map at an early stage. It identifies promising candidates by assessing whether they can induce an effective edit in the correct region. This enables efficient early pruning without decoding latents into images. Building on this, we further develop an adaptive search strategy for inference-time scaling. It allocates inference budgets according to editing difficulty, thereby reducing the number of function evaluations (NFE). Extensive experiments on multiple benchmarks and different base models demonstrate that VeriLatent consistently improves both editing performance and inference-time scaling efficiency.


2606.15389 2026-06-16 cs.CV New submission

Timestep Rescheduling in Diffusion Inversion

Shangquan Sun, Ting Gong, Zhirui Liu, Jiamin Wu, Runkai Zhao, Mianxin Liu, Wenqi Ren, Xiaochun Cao

Affiliations: Nanyang Technological University

AI summary: Addresses the effect of timestep selection on inversion accuracy in diffusion inversion by proposing a nonuniform timestep scheduler based on global rescaling and local dynamic programming, which reduces inversion error and improves image reconstruction and editing.

Comments: Accepted by ICML 2026. 23 pages, including appendices

Abstract

Diffusion inversion, which maps images back to the Gaussian latent space of a diffusion model, is a critical task for image reconstruction and editing. While DDIM enables fast deterministic inversion, it inherently introduces deviations that accumulate into noticeable inversion errors. Existing methods often address this by solving a fixed-point problem but largely overlook how the selection of the diffusion timestep in the noise scheduler influences inversion fidelity. In this work, we reveal that the deviation scale in diffusion inversion is strongly dependent on the timestep size, and exhibits a parabolic trend, with larger errors concentrated at both small and large timesteps. Based on this finding, we propose a simple yet effective nonuniform timestep scheduler that integrates a global rescaling with a local dynamic programming based rescheduling, enabling a strategic allocation of computational effort that minimizes the overall inversion error and preserves higher inversion accuracy. Our method serves as an off-the-shelf enhancement for existing inversion techniques and requires no extra parameters or computational overhead. Through extensive experiments, we verify that integrating our scheduler consistently boosts the performance of existing inversion methods, achieving superior results in image reconstruction and editing.


2606.15534 2026-06-16 cs.CV New submission

Track2View: 4D-Consistent Camera-Controlled Video Generation via Paired 3D Point Tracks

Feng Qiao, Zhaochong An, Zhexiao Xiong, Serge Belongie, Nathan Jacobs

Affiliations: Washington University in St. Louis; University of Copenhagen

AI summary: Proposes Track2View, which uses paired 3D point tracks to give a video diffusion transformer explicit spatiotemporal correspondences for rendering video from novel viewpoints, reaching state-of-the-art visual quality, view synchronization, and camera accuracy.

Abstract

Re-rendering an existing video from a novel camera viewpoint requires the output to follow the prescribed camera trajectory while preserving the appearance and dynamics of the original scene across every frame. Existing methods rely on per-frame pose embeddings, noisy point-cloud renderings, or implicit learned correspondences, none of which provides an explicit, temporally continuous link between source and target pixels. We propose Track2View, which conditions a video diffusion transformer on paired 3D point tracks: sparse trajectories of scene points projected into both the source and target camera views. These tracks provide explicit spatiotemporal correspondences that are temporally continuous by construction, encoding what content should appear where and when. At the core of Track2View is a dual-view track conditioner that transfers visual context from source to target view through parameter-free geometric operations and learned temporal aggregation, ensuring generalization to arbitrary camera trajectories without memorizing specific motions. We further introduce a data curation pipeline that extracts one-to-one track correspondences by running a 3D point tracker on temporally concatenated multi-camera view pairs. On a 400-video benchmark spanning static and dynamic scenes, Track2View achieves state-of-the-art results across visual quality, view synchronization, and camera accuracy, reducing rotation error by 30-65% and translation error by 61-72% relative to leading baselines. Project page: https://qjizhi.github.io/track2view


2606.15592 2026-06-16 cs.CV New submission

DenseControl: Instance-Level Controllable Synthesis of Dense Crowd Image

Juncheng Wang, Lei Shang, Wang Lu, Baigui Sun, Shujun Wang

Affiliations: The Hong Kong Polytechnic University; Tongyi Lab, Alibaba Group; Tsinghua University

AI summary: Proposes the DenseControl pipeline, which uses an Isolated Object Embedding map and an Implicit Scale Embedding strategy to precisely control instance position, size, background, style, and attributes in dense crowd images, achieving the best results in synthesis quality and downstream applications.

Comments: Accepted to IEEE TMM

Abstract

In this paper, we introduce DenseControl, a novel pipeline for generating dense crowd images. Specifically, DenseControl positions and sizes each generated instance to align precisely with predefined coordinates and scales. On top of this, we further allow control over the background, style, and attributes of instances. The motivation behind DenseControl stems from two main challenges observed in synthesizing crowd images: embedding the control signal and maintaining topological integrity when imparting instance scale guidance. To address these, we first introduce the Isolated Object Embedding (IOE) map, a novel representation that facilitates spatial location control while mitigating the difficulty the model has in learning projections. Second, we propose an Implicit Scale Embedding (ISE) strategy that integrates with the IOE map to encode precise scale information. To further enhance the efficacy of combining ISE with the IOE map, we incorporate a Position Shortcut mechanism that enhances cross-attention to alleviate projection challenges. We evaluate DenseControl through two lenses: synthesis quality and applicability in potential applications. Experiments across different control conditions demonstrate that DenseControl achieves state-of-the-art results in dense crowd image synthesis. Furthermore, we showcase applications in augmenting crowd analysis under data scarcity, transfer learning, and weather generalization scenes, to highlight the practical utility of DenseControl. The codebase will be released.


2606.15796 2026-06-16 cs.CV cs.AI New submission

DifFRACT: Diffusion Feature Reconstruction and Attribution for Circuit Tracing

Artyom Mazur, Nina Konovalova, Aibek Alanov

Affiliations: HSE University; FusionBrain Lab

AI summary: Extends transcoder-based circuit tracing to multimodal diffusion transformers. Timestep-conditioned transcoders are trained to approximate MLP sublayers, yielding exact feature-level attribution and recovering interpretable circuits that reveal the mechanisms of attribute binding and cross-stream semantic propagation.

Abstract

Mechanistic interpretability seeks to explain neural network behavior by decomposing model computations into interpretable features and circuits. While transcoder-based circuit tracing has recently enabled detailed causal analyses of large language models, multimodal diffusion transformers for image generation remain comparatively opaque. We still lack tools for understanding how semantic information propagates across denoising steps and how text and image representations interact within double-stream MM-DiT architectures. Existing methods provide only partial insight: attention maps expose a limited view of token interactions, while sparse autoencoders can discover interpretable features but do not directly reveal how these features are transformed and composed through nonlinear MLP layers. In this work, we extend transcoder-based circuit tracing to multimodal diffusion transformers. We train timestep-conditioned transcoders that faithfully approximate the input-output behavior of MLP sublayers in FLUX.1[schnell]. By replacing MLPs with transcoders and linearizing the remaining computation, we obtain exact feature-to-feature attribution and recover compact, interpretable circuits. Empirically, our transcoders match or slightly outperform sparse autoencoders on the sparsity-faithfulness tradeoff. The resulting circuits reveal mechanisms underlying attribute binding and cross-stream semantic propagation, and provide causal explanations for systematic generation errors. Moreover, circuit-guided interventions are substantially more precise and effective than standard SAE-based steering. Our results demonstrate that transcoder-based circuit analysis is feasible for state-of-the-art diffusion transformers and provides a powerful framework for understanding and controlling multimodal generative models. The code is available at https://github.com/Artalmaz31/DifFRACT


2606.15819 2026-06-16 cs.CV cs.AI New submission

SACE: Concept Erasure at the Semantic Singularity in Visual Autoregressive Models

Siya Yang, Nanxiang Jiang, Zhaoxin Fan, Yunfeng Diao

Affiliations: School of Computer Science and Engineering, Sun Yat-sen University; School of Computer Science and Technology, Beijing Institute of Technology

AI summary: Addresses the semantic collapse and visual artifacts caused by applying existing erasure techniques to visual autoregressive models. Proposes the Semantic Singularity Axiom, validates it with Incremental Semantic Saliency Analysis, and introduces SACE, the first scale-aware concept erasure framework, which couples an entropy-regularized erasure objective with a restorative preservation loss at the first scale to achieve precise concept erasure.

Abstract

The rapid progress of visual autoregressive (VAR) models has unlocked a transformative frontier for high-fidelity text-to-image synthesis, while heightening concerns over the safety alignment of generated content. Naive application of existing erasure techniques to VAR models causes catastrophic semantic collapse and visual artifacts, since these techniques are predominantly designed for the homogeneous denoising steps of diffusion models. To address this foundational challenge, we first propose the Semantic Singularity Axiom, which posits that any target semantic concept embedded within a prompt is definitively locked at Scale-0. We then rigorously validate this axiom through our proposed Incremental Semantic Saliency Analysis (ISSA), which also enables the community to transparently inspect the coarse-to-fine semantic injection process. Guided by this insight, we introduce the first scale-aware concept erasure framework (SACE) for VAR models. By strictly confining interventions to the first scale, our approach couples an Entropy-Regularized Erasure Objective, which prevents high-entropy sampling degeneration, with a restorative preservation loss that safely anchors the integrity of entangled benign priors. Extensive experiments demonstrate that our method achieves surgical concept erasure performance across various domains with minimal training overhead, resolving in a timely and elegant manner the critical safety vulnerabilities inherent in emerging VAR architectures. Code is available at: https://github.com/limerenceysy/SACE.


2606.15848 2026-06-16 cs.CV New submission

EmoZone-Talker: Regional Semantic Control of Audio-Driven 3DGS Talking Heads via Facial Action Units

Tingting Chen, Shaojun Wang, Huaye Zhang, Diqiong Jiang, Chenglizhao Chen

Affiliations: China University of Petroleum (East China)

AI summary: Proposes the EmoZone-Talker framework, which resolves the conflict between audio and expression signals through regional decoupling and temporal modeling, enabling fine-grained, interpretable control of facial expressions.

Abstract

3D Gaussian Splatting (3DGS) has shown strong potential for high-fidelity talking head synthesis. However, enabling fine-grained, interpretable, and editable facial expression control remains fundamentally challenging due to intrinsic conflicts between speech-driven facial dynamics and explicit expression signals. Existing methods rely on implicit multimodal fusion, leading to spatial entanglement and temporal instability. We present EmoZone-Talker, a novel framework that reformulates audio-driven facial animation as a structured spatial-temporal coordination problem under cross-modal conflicts. Our approach introduces an explicit spatial disentanglement and temporal dynamics modeling of facial motion. Specifically, we propose Synergy Zones with Prioritized Attention Bias (SZ-PAB) to explicitly decouple modality contributions via region-wise constraints guided by anatomical priors, and a Channel-Independent Temporal AU Encoder (CIT-AE) to model temporally coherent AU dynamics. By integrating these representations into 3D Gaussian deformation, EmoZone-Talker enables precise and interpretable control over facial expressions. Extensive experiments demonstrate that our method improves expression controllability and realism, with notable gains in upper-face accuracy and temporal coherence, while preserving high rendering quality and accurate lip synchronization. Code will be publicly released to facilitate reproducibility and further research.


2606.15889 2026-06-16 cs.CV New submission

SiGnature: Explicit Motion Diffusion for Stylized Semantic Gesture

Adi Rosenthal, Tomer Koren, Nadav Shaked, Doron Friedman, Ariel Shamir

Affiliations: Reichman University

AI summary: Proposes the SiGnature framework, which uses an explicit joint-rotation space and the training-free inference mechanism JMI to achieve precise control of semantic gestures while preserving speaker style with high fidelity, outperforming existing methods.

Abstract

While recent advances in co-speech gesture generation have achieved impressive rhythmic synchronization, synthesizing gestures that are both semantically meaningful and faithful to a speaker's unique non-verbal style remains an open challenge. Semantic gestures, such as iconic shapes or deictic pointing, are statistically sparse, making them difficult to learn effectively within standard generative models. We present SiGnature, a framework for Stylized and Semantic Gesture generation that reconciles precise semantic control with high-fidelity style preservation. Unlike prevalent methods that rely on entangled latent representations, SiGnature operates in an explicit joint-rotation space. This design enables our core contribution, Joint Motion Integration (JMI), a training-free inference mechanism capable of injecting any external motion sequence, particularly in-the-wild semantic gestures, directly into the diffusion process. JMI automatically identifies the specific "active joints" conveying a semantic action and injects them into the generation, while relying on the diffusion backbone to synthesize the remaining body dynamics, including posture and flow, in accordance with the pre-learned style of the target speaker. This allows for the plug-and-play integration of arbitrary motions, including complex semantic gestures, without retraining or introducing the "Frankenstein" artifacts typical of cut-and-paste methods. Extensive experiments and perceptual studies demonstrate that SiGnature offers superior semantic motion control while maintaining smooth and natural co-speech gesture generation and preserving the distinct characteristics of the speaker, thereby outperforming state-of-the-art baselines.


2606.16103 2026-06-16 cs.CV New submission

SceneCraft: Interactive System for Image Editing via Scene Graph

Duc-Manh Phan, Ngoc-Dai Tran, Duy-Khang Do, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

Affiliations: University of Science, Ho Chi Minh, Vietnam; Vietnam National University, Ho Chi Minh, Vietnam; University of Dayton, Dayton, Ohio, USA

AI summary: Proposes the SceneCraft framework, which represents images as scene graphs that users manipulate directly to perform complex edits; it automatically generates precise prompts, reducing linguistic ambiguity and improving editing quality and user control.

Abstract

Recent advances in generative AI have enabled natural language-driven image editing, yet existing systems often fail in complex scenes with multiple interacting objects because they rely heavily on users crafting precise text prompts. To address the absence of structured control, we propose SceneCraft, a novel interactive framework that bridges user intent and model execution by representing images as editable scene graphs. Instead of guessing text prompts through trial and error, users interact directly with a visual graph to perform complex spatial and relational operations. These graph modifications are automatically translated into precise, context-aware editing prompts, effectively eliminating linguistic ambiguity. To ensure robust and diverse results, structured prompts are dispatched to multiple state-of-the-art generative models. Evaluations across diverse editing scenarios show that SceneCraft provides a more intuitive control mechanism, significantly reducing the cognitive burden of manual prompt engineering while generating outputs that users consistently rate as higher in quality and fidelity.
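
To make the idea of turning graph modifications into editing prompts concrete, here is a minimal Python sketch. It assumes a scene graph can be reduced to a set of (subject, relation, object) triples and uses a made-up prompt template; the abstract does not describe SceneCraft's actual graph schema or prompt format, so every name below is hypothetical.

```python
def describe(triple):
    """Render one (subject, relation, object) triple as plain text."""
    subject, relation, obj = triple
    return f"{subject} {relation} {obj}"


def edits_to_prompt(before, after):
    """Turn the difference between two triple sets into an edit prompt.

    `before` and `after` are sets of triples. The wording of the prompt
    is an illustrative template, not the one used by the paper.
    """
    removed = sorted(before - after)
    added = sorted(after - before)
    parts = [f"remove the relation '{describe(t)}'" for t in removed]
    parts += [f"make it so that {describe(t)}" for t in added]
    return "; ".join(parts) if parts else "no change"


# A user drags the "cat" node from the sofa to under the table.
before = {("cat", "on", "sofa"), ("lamp", "next to", "sofa")}
after = {("cat", "under", "table"), ("lamp", "next to", "sofa")}
prompt = edits_to_prompt(before, after)
```

Because the prompt is derived from the graph difference rather than typed by hand, unchanged relations (the lamp here) never enter the instruction.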


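2606.16131 2026-06-16 cs.CV cs.LG New submission

Shift-and-Sum Quantization for Visual Autoregressive Models

Jaehyeon Moon, Bumsub Ham

Affiliations: Yonsei University; Articron

AI summary: Proposes a post-training quantization framework for visual autoregressive models that uses shift-and-sum quantization to reduce errors in attention-value products and a resampling strategy for calibration data, setting a new state of the art on image generation and related tasks.

Comments: ICLR 2026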

Abstract

Post-training quantization (PTQ) enables efficient deployment of deep networks using a small set of data. Its application to visual autoregressive models (VAR), however, remains relatively unexplored. We identify two key challenges for applying PTQ to VAR: (i) large reconstruction errors in attention-value products, especially at coarse scales where high attention scores occur more frequently; and (ii) a discrepancy between the sampling frequencies of codebook entries and their predicted probabilities due to limited calibration data. To address these challenges, we propose a PTQ framework tailored for VAR. First, we introduce a shift-and-sum quantization method that reduces reconstruction errors by aggregating quantized results from symmetrically shifted duplicates of value tokens. Second, we present a resampling strategy for calibration data that aligns sampling frequencies of codebook entries with their predicted probabilities. Experiments on class-conditional image generation, inpainting, outpainting, and class-conditional editing show consistent improvements across VAR architectures, establishing a new state of the art in PTQ for VAR.
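
The intuition behind aggregating quantized results from symmetrically shifted duplicates can be seen on a scalar toy problem. The Python sketch below is one plausible reading of the idea, assuming a uniform round-to-nearest quantizer and a shift of a quarter step; it is not the paper's algorithm, which operates on value tokens inside attention and may choose shifts differently.

```python
import math


def quantize(x, step):
    """Uniform round-to-nearest quantizer with the given step size."""
    return step * math.floor(x / step + 0.5)


def shift_and_sum(x, step, shift):
    """Average the quantized values of two symmetrically shifted copies of x."""
    return 0.5 * (quantize(x + shift, step) + quantize(x - shift, step))


def max_abs_error(fn, values):
    """Worst-case reconstruction error of `fn` over a list of inputs."""
    return max(abs(fn(v) - v) for v in values)


step = 1.0
values = [i / 100.0 for i in range(-300, 301)]
plain_err = max_abs_error(lambda v: quantize(v, step), values)
shifted_err = max_abs_error(lambda v: shift_and_sum(v, step, step / 4), values)
```

With these settings the averaged output lands on a grid of half the step size, so the worst-case error drops from `step / 2` to `step / 4` even though each individual copy is still quantized with the original step.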


2606.16184 2026-06-16 cs.CV cs.MM New submission

Closed-Loop Triplet Synergistic Generation for Long-Form Video

Xinlei Yin, Xiulian Peng, Xiao Li, Zhiwei Xiong, Yan Lu

Affiliations: University of Science and Technology of China; Microsoft Research Asia

AI summary: Proposes the CoTriSyGen framework, which uses a closed-loop visual-text-memory synergy process with an analyzer performing intra-shot and inter-shot refinement to address identity drift and inconsistency in long-form video generation.

Abstract

Multi-shot long-form video generation remains challenging due to identity drift and compounding inconsistencies across shots. While storyboard-driven pipelines improve controllability, they are often executed in a feed-forward manner, with limited mechanisms to incorporate generated visual evidence back into subsequent conditioning. We propose CoTriSyGen, an agentic framework that formulates multi-shot long video generation as a closed-loop visual-text-memory synergy process, where planned intent, persistent memory, and generated visuals are jointly leveraged for iterative correction and long-range coherence. A vision-language-model-based analyzer reasons over this triplet and produces updates to both prompts and memory along two pathways: (i) intra-shot refinement, which triggers targeted regeneration when semantic or compositional violations are detected and refines image-to-video prompt for coherent motions; and (ii) inter-shot refinement, which rewrites subsequent-shot prompts to propagate newly manifested entities or attributes and improve prompt quality (e.g., compositional grounding and cinematic fluency) based on generated evidence. The loop is grounded in an entity-centric memory modeled as a mutable visual state that evolves as the story progresses, which is continuously updated by both the generator and the analyzer by adding new and evolved entities to reflect appearance changes, accumulated multi-view evidence, and multi-entity compositions. Experiments on our curated StoryBench benchmark demonstrate substantial improvements in cross-shot consistency, prompt adherence, and cinematic continuity over representative methods.


2606.16241 2026-06-16 cs.CV New submission

Structure-Semantic Co-optimized Latent Diffusion Model for Fast Visual Anagram Synthesis

Xiang Gao, Yunpeng Jia

Affiliations: School of Digital Media and Design Arts, Beijing University of Posts and Telecommunications

AI summary: Proposes S2CO-Anagram, a structure-semantic co-optimization framework that combines null-text structure alignment, semantic enhancement, and attention-guided noise fusion to generate high-resolution visual anagrams with high visual harmony and semantic fidelity at very low computational cost.

Abstract

A visual anagram is an intriguing form of art in which a single image presents different conceptual interpretations under transformations such as flipping or rotation. Recent work has achieved visual anagram synthesis by leveraging pretrained text-to-image (T2I) diffusion models, yet it still suffers from several key limitations, including computational inefficiency, suboptimal aesthetic quality, and weak semantic fidelity and expressiveness. This work focuses on generating visual anagrams with substantially improved visual quality at minimal computational cost, thereby advancing intelligent creation of illusionary digital art. To increase image resolution while reducing time overhead, we adapt the cutting-edge parallel denoising algorithm from pixel-based T2I models to adversarially distilled latent-based ones, and accordingly propose a structure-semantic co-optimization (S2CO) framework to counteract the consequent visual degradation. At the core of our approach, the S2CO framework comprises three key innovations: (i) null-text structure alignment optimization; (ii) semantic enhancement optimization; and (iii) attention-guided noise fusion. Building upon these components, our method, dubbed S2CO-Anagram, generates higher-resolution anagram images with noticeably better visual harmony and semantic faithfulness than related SOTA approaches, while achieving substantially faster inference. Code will be publicly available.


2606.16317 2026-06-16 cs.CV New submission

Training-free sparse attention based on cumulative energy filtering

Chunlu Li, Yixuan Pan, Bai Du, Zhenyuan Chen, Yanzhao Li, Hui Dong, Hui Wang, Zhiqiang Zou

Affiliations: Huawei Technologies

AI summary: Proposes a dynamic thresholding strategy that raises sparsity while holding recall fixed and integrates deeply with Flash Attention, requiring no extra mask computation; on Wan 2.2, sparsity rises from 61.42% to 82% with a VBench drop of less than 5%.

Abstract

Sparse attention accelerates Diffusion Transformers (DiTs) for video generation by computing only the important tokens while skipping the rest. The token selection strategy is key to balancing sparsity and accuracy. We formulate the token filtering process as a dual-goal optimization problem: maximizing sparsity and minimizing accuracy degradation. Existing algorithms cannot fulfill both objectives simultaneously. For example, Top-p only considers the accuracy constraint, while Top-k maintains a fixed computational budget but loosens the accuracy constraint. This paper demonstrates that maintaining a fixed recall rate is sufficient for ensuring accuracy, whereas a fixed threshold is suboptimal for reducing computational cost. Therefore, we propose a dynamic thresholding scheme to improve sparsity while maintaining the same level of accuracy. Furthermore, our algorithm is deeply integrated with Flash Attention (FA), eliminating the need for any additional masking computation overhead. Experimental results on Wan 2.2 validate that, compared to the BLASST algorithm, which is also integrated with FA, our dynamic thresholding strategy enhances sparsity from 61.42% to 82% with a VBench metric drop of less than 5%. This results in an approximate 15% reduction in attention computation and a 1.61x increase in computational efficiency, which is 1.18x higher than that of BLASST.
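
The contrast the abstract draws between a fixed budget (Top-k) and a fixed recall target can be illustrated with a small Python sketch. It treats recall as the cumulative softmax mass of the kept keys, which is an assumption on our part; the paper's dynamic thresholding rule and its Flash Attention integration are not reproduced here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_by_recall(scores, recall):
    """Keep the fewest keys whose cumulative attention mass reaches `recall`."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= recall:
            break
    return sorted(kept), mass


def select_top_k(scores, k):
    """Keep a fixed budget of k keys, whatever mass they happen to cover."""
    probs = softmax(scores)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    return sorted(order), sum(probs[i] for i in order)


peaked = [8.0, 1.0, 0.5, 0.2, 0.1, 0.0, -0.3, -1.0]   # one dominant key
flat = [1.0, 0.9, 1.1, 1.0, 0.95, 1.05, 1.0, 0.9]     # nearly uniform row

kept_peaked, mass_peaked = select_by_recall(peaked, 0.9)
kept_flat, mass_flat = select_by_recall(flat, 0.9)
kept_k, mass_k = select_top_k(flat, 2)
```

On the peaked row a single key already covers the recall target, so almost everything can be skipped; on the flat row a fixed budget of two keys covers far less than 90% of the mass, which is the accuracy loss a fixed-k rule risks.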


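2606.16401 2026-06-16 cs.CV New submission

MMDiff: Extending Diffusion Transformers for Multi-Modal Generation

Yagmur Akarken, Orest Kupyn, Christian Rupprecht

Affiliations: University of Oxford, Visual Geometry Group

AI summary: Proposes the MMDiff framework, which uses a frozen diffusion transformer with lightweight decoders to jointly generate images and multiple dense perceptual modalities; it finds that multi-timestep feature fusion with spatially varying aggregation weights is key, and achieves strong performance on tasks such as semantic segmentation.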

Abstract

Diffusion transformers have demonstrated remarkable generative capabilities, yet the rich perceptual representations computed across their denoising trajectory are discarded once the content is rendered. We present MMDiff, a framework that transforms a frozen diffusion transformer into a multi-modal generative system that jointly produces images alongside any combination of dense perceptual modalities using lightweight decoder heads. Our central finding is that perceptual information is temporally distributed along the denoising trajectory, and that multi-timestep feature fusion with spatially varying aggregation weights is essential, improving semantic segmentation results by up to 28.7% mIoU over single-timestep extraction. We further adopt concept-driven attention extraction for interpretable spatial guidance, and show that frozen diffusion features are competitive with and complementary to state-of-the-art encoders such as DINOv3. By training only lightweight decoder heads on a frozen backbone, we achieve strong performance in semantic segmentation, salient object detection, and depth estimation, and demonstrate that this framework enables effective synthetic data generation at scale.
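
As a rough picture of what multi-timestep feature fusion with spatially varying aggregation weights could look like, the Python sketch below mixes features from several denoising timesteps with a separate softmax weight vector at every spatial position. The shapes, the softmax normalization, and the function name are assumptions for illustration; the abstract does not specify how MMDiff computes its weights.

```python
import math


def fuse_timesteps(features, logits):
    """Fuse per-timestep features with position-dependent weights.

    features[t][p] is the feature vector at timestep t and position p.
    logits[t][p] is a scalar score; a softmax over t at each position p
    gives the aggregation weights for that position.
    """
    num_steps = len(features)
    num_positions = len(features[0])
    channels = len(features[0][0])
    fused = []
    for p in range(num_positions):
        m = max(logits[t][p] for t in range(num_steps))
        w = [math.exp(logits[t][p] - m) for t in range(num_steps)]
        z = sum(w)
        w = [x / z for x in w]
        fused.append([
            sum(w[t] * features[t][p][c] for t in range(num_steps))
            for c in range(channels)
        ])
    return fused


# Two timesteps, two positions, two channels.
features = [
    [[1.0, 0.0], [1.0, 0.0]],   # features from an early timestep
    [[0.0, 1.0], [0.0, 1.0]],   # features from a late timestep
]
uniform = fuse_timesteps(features, [[0.0, 0.0], [0.0, 0.0]])
selective = fuse_timesteps(features, [[10.0, -10.0], [-10.0, 10.0]])
```

With uniform scores every position gets the plain average; with position-dependent scores the first position leans on the early timestep and the second on the late one, which a single global weight per timestep could not express.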


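2606.16767 2026-06-16 cs.CV New submission

Text-Vision Co-Instructed Image Editing

Chenxi Xie, Yuhui Wu, Qiaosi Yi, Lei Zhang

Affiliations: The Hong Kong Polytechnic University; OPPO Research Institute

AI summary: Proposes the TV-Edit framework, which combines the semantic expressiveness of textual instructions with the spatial guidance of sparse visual instructions to achieve precise, intent-faithful image editing, significantly outperforming existing methods.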

Abstract

Existing image editing methods can be generally categorized into textual instruction-based and visual prompt-based ones. Textual instructions are semantically expressive, but are limited by the coarse granularity of spatial control of the editing results. In contrast, visual prompts such as drag and point can provide precise spatial guidance, but are limited by the inherent ambiguity in semantic intent. To unify the strength of textual and visual prompts, we present Text-Vision Co-Instructed Image Editing, which jointly models textual instructions as semantic intent and sparse visual instructions as spatial guidance, aiming to achieve precise and intent-faithful image manipulation. To this end, we first construct a textual-visual instruction paired dataset with more than 23K samples derived from dynamic videos, enabling aligned supervision for cross-modal instruction. We then propose TV-Edit, a Textual-Visual instruction unified Editing framework to contextualize drag or point-based visual instructions with image-text semantics and lift them into semantic-aware control representations for pretrained editing backbones. By integrating semantic intent and spatial constraints, TV-Edit leads to more precise spatial control, less instruction ambiguity, and stronger structural consistency than text-only or drag-based alternatives. Finally, we establish TV-Edit-Bench, a deliberately designed benchmark to evaluate semantic faithfulness, spatial alignment, and visual consistency with ground-truth references and controlled textual-visual variations for reliable assessment. Our experiments across multiple editing backbones demonstrate that TV-Edit consistently yields more precise and intent-faithful edits, significantly outperforming state-of-the-art instruction-based and drag-based baselines.


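2606.16799 2026-06-16 cs.CV cs.AI New submission

Decoupling Semantics from Distortions: Multi-Scale Two-Stream Vision-Language Alignment for AI-Generated Image Quality Assessment

Zijie Meng

AI summary: Proposes MST-CLIPIQA, a multi-scale two-stream framework that achieves hierarchical vision-language alignment through explicit representational decoupling, setting new state-of-the-art results on five benchmarks with average SRCC gains of 1.11% on quality and 2.35% on text-image correspondence.

Comments: 11 pages, 2 figures. Accepted by ICME 2026 (spotlight)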

Abstract

Existing vision-language model (VLM)-based AI-generated image quality assessment (AIGIQA) methods suffer from a fundamental semantic-distortion dimensional conflict: monolithic representations optimized for semantic discrimination inherently entangle compositional understanding with low-level perceptual sensitivity, rendering them blind to fine-grained quality degradations. We introduce MST-CLIPIQA, a multi-scale two-stream framework that achieves hierarchical vision-language alignment through explicit representational decoupling. Our architecture leverages dual CLIP encoders with complementary patch granularities: coarse-grained streams capture global semantic coherence while fine-grained streams preserve textural signatures and artifact patterns. An information bottleneck-inspired gated fusion mechanism performs adaptive cross-scale distillation, with optional cross-attention enabling prompt-anchored correspondence evaluation when generation prompts are available. Extensive experiments across five benchmarks establish new state-of-the-art results, achieving average improvements of 1.11 percent SRCC on quality and 2.35 percent SRCC on text-image correspondence prediction, while maintaining efficiency with only 0.8M trainable parameters. Our project is available at https://github.com/YMlinfeng/MST-CLIPIQA.


2606.16866 2026-06-16 cs.CV New submission

Redirecting the Flow: Image Customization through Attention Distribution Shift

Jie Li, Suorong Yang, Jian Zhao, Furao Shen

Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; School of Computer Science, Nanjing University; School of Electronic Science and Engineering, Nanjing University

AI summary: Proposes a Conditional Attention Distribution Shift formulation grounded in maximum entropy theory and realizes efficient subject-driven image generation with the dual-branch CustomShift architecture, outperforming existing methods on the DreamBooth and Custom101 benchmarks.

Abstract

Subject-driven image customization aims to generate images that not only follow textual instructions but also preserve the identity of a given reference subject. Existing approaches, including test-time fine-tuning, encoder-based methods, and token competition in shared attention spaces, suffer from limited efficiency, misalignment between extracted reference features and the generative process, and interference from irrelevant information. To address these limitations, we formulate the customization task as a distribution shift induced by incorporating reference images into text-to-image generation, and derive a Conditional Attention Distribution Shift formulation grounded in maximum entropy theory. Building on this formulation, we propose CustomShift, a dual-branch architecture based on Stable Diffusion 3. The Reference-Alignment Branch leverages self-attention between reference images and subject names to achieve layer-wise alignment with latent representations, while the Cross-Guidance Branch integrates textual and reference cues to guide generation. Experiments on the DreamBooth and Custom101 benchmarks demonstrate that our method consistently outperforms state-of-the-art approaches, achieving a better balance between semantic fidelity and subject consistency.


2606.16993 2026-06-16 cs.CV New submission

DreamX-World 1.0: A General-Purpose Interactive World Model

DreamX Team, Yancheng Bai, Rui Chen, Xiangxiang Chu, Rujing Dang, Hao Dou, Bingjie Gao, Qiwen Gu, Siyu Hong, Jiachen Lei, Geng Li, Jifan Li, Ruimin Lin, Qingfeng Shi, Bingze Song, Lei Sun, Jing Tang, Ruitian Tian, Jun Wang, Jiahong Wu, Pengfei Zhang, Shen Zhang, Jiashu Zhu

AI summary: Proposes DreamX-World 1.0, a general-purpose interactive text/image-to-video world model that achieves controllable long-horizon generation through E-PRoPE camera control, causal-forcing autoregressive generation, memory-conditioned scene persistence, and event instruction tuning, surpassing existing methods on multiple metrics.

Comments: Project page: https://amap-ml.github.io/DreamX_World, Code: https://github.com/AMAP-ML/DreamX-World

Abstract

DreamX-World 1.0 is a general-purpose interactive text/image-to-video world model for controllable long-horizon generation. It supports camera navigation, revisits to previously observed regions, and promptable events across photorealistic, game-style, and stylized domains. Our data engine combines camera-accurate Unreal Engine rendering, action-rich gameplay recordings, and real-world videos with recovered camera geometry. For camera control, we introduce E-PRoPE, a lightweight variant of projective positional encoding that retains PRoPE's projective camera geometry while applying camera-aware attention to spatially reduced tokens. We convert a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. Training on self-generated long-horizon contexts exposes the model to its own generated history and reduces the style and color drift that accumulates across autoregressive chunks. Memory-Conditioned Scene Persistence retrieves earlier views through camera-geometry-based retrieval, while residual recycling makes the conditioning path less sensitive to imperfect memory latents. Event Instruction Tuning adds composable event control, and reinforcement learning alignment recovers camera control and visual quality after distillation. With mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, DreamX-World 1.0 reaches up to 16 FPS on eight RTX 5090 GPUs. On our 5-second basic evaluation, DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 and LingBot-World in overall score, which achieve 80.79 and 80.45, respectively.


2606.17049 2026-06-16 cs.CV New submission

BRDFusion: Physics Meets Generation for Urban Scene Inverse Rendering

Yi-Ruei Liu, Jie-Ying Lee, Zheng-Hui Huang, Yu-Lun Liu, Chih-Hao Lin

Affiliations: National Yang Ming Chiao Tung University; University of Illinois Urbana-Champaign; National Taiwan University

AI summary: Proposes the BRDFusion framework, which combines physical modeling with generative priors for inverse rendering of urban scenes, fixing artifacts while maintaining physical consistency and supporting novel-view relighting, night simulation, and dynamic object editing.

Comments: Project page: https://shigon255.github.io/brdfusion-page/

Abstract

Inverse rendering of urban scenes from captured videos enables numerous applications, including content creation and autonomous driving simulation. Physically-based rendering methods follow and control lighting physics, but suffer from reconstruction and rendering artifacts. While generative models produce realistic videos, they offer limited consistency and controllability. We present BRDFusion, a unified framework that combines two complementary models for inverse and forward rendering. Specifically, BRDFusion recovers explicit, consistent scene properties with physical modeling and alleviates optimization ambiguity with generative priors. During forward rendering, the physical model provides controllable rendering from the scene configuration, and the generative model denoises and fixes artifacts. Therefore, our method produces high-quality videos while allowing precise control, outperforming baselines in real and synthetic scenes. Moreover, BRDFusion supports novel-view relighting, night simulation, and dynamic object insertion/editing. Project page: https://shigon255.github.io/brdfusion-page/


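2606.14721 2026-06-16 cs.GR cs.CV cs.RO Cross-list

DC-Motion: Decoupling Semantics and Details via Discrete-Continuous Tokens for Human Motion Generation

Hequan Wang, Jiaxu Zhang, Zhengbo Zhang, Zhigang Tu

Affiliations: Wuhan University

AI summary: Proposes the DC-Motion framework, which uses a discrete-continuous VAE to decompose motion into discrete semantic tokens and continuous detail residuals, and combines a masked autoregressive model with a residual diffusion model to generate high-quality motion from complex text instructions.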

Abstract

Text-to-motion generation requires synthesizing physically realistic dynamics that strictly follow complex and long-horizon textual instructions. Existing approaches rely on homogeneous representation spaces that may fail to capture the hierarchical nature of human motion, with diffusion models struggling at compositional semantic reasoning and AR models sacrificing fine-grained physical details due to quantization. To address this, we introduce DC-Motion, a factorized generative framework designed to explicitly decouple semantics and details via discrete-continuous tokens. A Discrete-Continuous VAE (DC-VAE) first decomposes motion into discrete tokens for semantics and continuous residuals for fine-grained dynamics. Then, a masked AR model predicts the discrete structure from text, and a lightweight residual diffusion model recovers the continuous physical details. Extensive experiments demonstrate that DC-Motion improves the ability to follow complex instructions. By balancing semantic controllability and physical realism, our approach offers a highly adaptable modeling paradigm for human motion generation. On both the HumanML3D and KIT-ML datasets, DC-Motion achieves state-of-the-art performance, delivering the best FID for motion realism and the best R-precision for text alignment.
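
The split into discrete tokens plus continuous residuals can be pictured with a nearest-neighbor codebook lookup. The Python sketch below is a generic vector-quantization-with-residual example under our own assumptions (a fixed toy codebook, squared Euclidean distance); it is not the DC-VAE from the paper, whose encoder, codebook, and residual model are learned.

```python
def nearest_code(x, codebook):
    """Index of the codebook entry closest to x in squared Euclidean distance."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((a - b) ** 2 for a, b in zip(x, codebook[k])),
    )


def decompose(x, codebook):
    """Split x into a discrete token index and a continuous residual."""
    k = nearest_code(x, codebook)
    residual = [a - b for a, b in zip(x, codebook[k])]
    return k, residual


def reconstruct(k, residual, codebook):
    """Recombine the coarse code with the fine-grained residual."""
    return [b + r for b, r in zip(codebook[k], residual)]


codebook = [[0.0, 0.0], [1.0, 1.0]]     # hypothetical two-entry codebook
frame = [0.9, 1.2]                      # one toy motion feature vector
token, residual = decompose(frame, codebook)
restored = reconstruct(token, residual, codebook)
```

The token carries the coarse structure that a sequence model can predict from text, while the residual holds exactly the detail that quantization alone would discard.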


2502.10389 2026-06-16 cs.CV cs.AI Updated version

Region-Adaptive Sampling for Diffusion Transformers

Ziming Liu, Yifan Yang, Chengruidong Zhang, Yiqi Zhang, Lili Qiu, Yang You, Yuqing Yang

Affiliations: National University of Singapore; Microsoft Research

AI summary: Proposes RAS, a training-free adaptive sampling strategy that dynamically assigns different sampling ratios to image regions, speeding up diffusion transformers by 2.36x to 2.51x with minimal quality loss.

Comments: CVPR'26 Poster

Abstract

Diffusion models (DMs) have become the leading choice for generative tasks across diverse domains. However, their reliance on multiple sequential forward passes significantly limits real-time performance. Previous acceleration methods have primarily focused on reducing the number of sampling steps or reusing intermediate results, failing to leverage variations across spatial regions within the image due to the constraints of convolutional U-Net structures. By harnessing the flexibility of Diffusion Transformers (DiTs) in handling a variable number of tokens, we introduce RAS, a novel, training-free sampling strategy that dynamically assigns different sampling ratios to regions within an image based on the focus of the DiT model. Our key observation is that during each sampling step, the model concentrates on semantically meaningful regions, and these areas of focus exhibit strong continuity across consecutive steps. Leveraging this insight, RAS updates only the regions currently in focus, while other regions are updated using cached noise from the previous step. The model's focus is determined based on the output from the preceding step, capitalizing on the temporal consistency we observed. We evaluate RAS on Stable Diffusion 3 and Lumina-Next-T2I, achieving speedups of up to 2.36x and 2.51x, respectively, with minimal degradation in generation quality. Additionally, a user study reveals that RAS delivers comparable quality under human evaluation while achieving a 1.6x speedup. Our approach makes a significant step towards more efficient diffusion transformers, enhancing their potential for real-time applications.
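
The update rule described above (fresh predictions for focused regions, cached noise elsewhere) can be sketched in a few lines of Python. This is a toy under stated assumptions: tokens are scalars, `predict_noise` is a hypothetical stand-in for a DiT forward pass over the selected tokens, `focus_scores` is assumed to come from the previous step's output, and the latent update is a simple Euler-style step rather than the schedulers used in the paper.

```python
def ras_step(latents, cached_noise, focus_scores, ratio, predict_noise, step_size):
    """One region-adaptive step over a flat list of token latents.

    Only the top `ratio` fraction of tokens (ranked by focus_scores) are
    sent through predict_noise; the rest reuse cached_noise.
    """
    n = len(latents)
    k = max(1, int(round(ratio * n)))
    active = sorted(range(n), key=lambda i: focus_scores[i], reverse=True)[:k]
    fresh = predict_noise([latents[i] for i in active])
    noise = list(cached_noise)
    for i, eps in zip(active, fresh):
        noise[i] = eps                      # refresh only the focused tokens
    new_latents = [x - step_size * e for x, e in zip(latents, noise)]
    return new_latents, noise, sorted(active)


# Toy run: four tokens, half of them refreshed per step.
latents = [0.0, 0.0, 0.0, 0.0]
cached = [0.0, 0.0, 0.0, 0.0]
focus = [0.9, 0.1, 0.8, 0.2]
new_latents, new_cache, active = ras_step(
    latents, cached, focus, ratio=0.5,
    predict_noise=lambda xs: [1.0 for _ in xs],   # dummy model
    step_size=0.5,
)
```

The saving comes from `predict_noise` seeing only `k` tokens instead of all `n`, which is possible because a transformer, unlike a convolutional U-Net, accepts a variable-length token list.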


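2505.04486 2026-06-16 cs.CV cs.AI cs.LG Updated version

Efficient Flow Matching using Latent Variables

Anirban Samaddar, Yixuan Sun, Viktor Nilsson, Sandeep Madireddy

Affiliations: Argonne National Laboratory; KTH Royal Institute of Technology

AI summary: Proposes Latent-CFM, which conditions on data features extracted by a pretrained deep latent variable model to improve the training efficiency and generation quality of flow matching models, outperforming existing methods on image and physical-field generation tasks.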

Abstract

Flow matching models have shown great potential in image generation tasks among probabilistic generative models. However, most flow matching models in the literature do not explicitly utilize the underlying clustering structure in the target data when learning the flow from a simple source distribution like the standard Gaussian. This leads to inefficient learning, especially for many high-dimensional real-world datasets, which often lie on a low-dimensional manifold. To this end, we present $\texttt{Latent-CFM}$, which provides efficient training strategies by conditioning on the features extracted from data using pretrained deep latent variable models. Through experiments on synthetic data from multi-modal distributions and widely used image benchmark datasets, we show that $\texttt{Latent-CFM}$ exhibits improved generation quality with significantly less training and computation than state-of-the-art flow matching models by adopting pretrained lightweight latent variable models. Beyond natural images, we consider generative modeling of spatial fields stemming from physical processes. Using a 2D Darcy flow dataset, we demonstrate that our approach generates more physically accurate samples than competing approaches. In addition, through latent space analysis, we demonstrate that our approach can be used for conditional image generation conditioned on latent features, which adds interpretability to the generation process.
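
To make the idea of conditioning a flow on latent features concrete, the sketch below builds one training example for a standard linear-interpolation conditional flow matching objective. This is an assumption-laden illustration: the paper's exact probability path and encoder are not given in the abstract, and `cfm_training_example` and `encode` are hypothetical names.

```python
import random
from typing import Callable, Dict, List


def cfm_training_example(
    x1: List[float],
    encode: Callable[[List[float]], List[float]],
    rng: random.Random,
) -> Dict[str, object]:
    """Build one (input, target) pair for latent-conditioned flow matching.

    Assumes the common straight-line path x_t = (1 - t) * x0 + t * x1 with
    target velocity x1 - x0; `encode` stands in for a pretrained latent
    variable model and is hypothetical.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]        # sample from the Gaussian source
    t = rng.random()                               # time in [0, 1)
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]
    target = [b - a for a, b in zip(x0, x1)]       # velocity the network regresses
    z = encode(x1)                                 # latent features used as condition
    return {"t": t, "xt": xt, "z": z, "target": target}


example = cfm_training_example([1.0, 2.0], lambda x: [sum(x)], random.Random(0))
print(example["z"])  # [3.0]
```

A vector-field network would take `xt`, `t`, and `z` as inputs and be trained to match `target`, so the latent features inform the flow about which mode of the data it is heading toward.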

2602.04789 2026-06-16 cs.CV 版本更新

Light Forcing: Accelerating Autoregressive Video Diffusion via Sparse Attention

轻量强制：通过稀疏注意力加速自回归视频扩散

Chengtao Lv, Yumeng Shi, Yushi Huang, Ruihao Gong, Shen Ren, Wenya Wang

发表机构 * Nanjing University of Posts and Telecommunications（南京邮电大学）

AI总结针对自回归视频生成模型中注意力二次复杂度问题，提出首个稀疏注意力方案Light Forcing，通过块感知增长机制和分层稀疏注意力实现质量和效率提升。

Comments ICML 2026

AI中文摘要

先进的自回归（AR）视频生成模型提高了视觉保真度和交互性，但注意力的二次复杂度仍然是高效部署的主要瓶颈。虽然现有的稀疏注意力解决方案在双向模型上显示出潜力，但我们发现将这些解决方案应用于AR模型会导致显著的性能下降，原因有二：孤立地考虑块生成以及对过去信息上下文的利用不足。基于这些观察，我们提出\textsc{Light Forcing}，这是\textit{首个}专为AR视频生成模型设计的稀疏注意力解决方案。它引入了\textit{块感知增长}机制来定量估计每个块的贡献，从而决定其稀疏性分配。这种渐进式稀疏性增加策略使得当前块在生成过程中能够继承早期块中的先验知识。此外，我们引入了\textit{分层稀疏注意力}，以从粗到细的方式捕捉信息丰富的历史和局部上下文。这种两级掩码选择策略（即帧级和块级）能够自适应地处理不同的注意力模式。大量实验表明，我们的方法在质量（例如，VBench上84.5）和效率（例如，端到端加速1.2至1.3倍）上均优于现有的稀疏注意力方法。结合其他高效方案，\textsc{Light Forcing}在多种GPU上进一步实现了2.0至3.0倍的端到端加速（例如，RTX 5090上27.4 FPS，H100上33.9 FPS）。代码已通过\href{https://github.com/chengtao-lv/LightForcing}{此链接}发布。

英文摘要

Advanced autoregressive (AR) video generation models have improved visual fidelity and interactivity, but the quadratic complexity of attention remains a primary bottleneck for efficient deployment. While existing sparse attention solutions have shown promise on bidirectional models, we identify that applying these solutions to AR models leads to considerable performance degradation for two reasons: isolated consideration of chunk generation and insufficient utilization of past informative context. Motivated by these observations, we propose \textsc{Light Forcing}, the \textit{first} sparse attention solution tailored for AR video generation models. It incorporates a \textit{Chunk-Aware Growth} mechanism to quantitatively estimate the contribution of each chunk, which determines its sparsity allocation. This progressive sparsity increase strategy enables the current chunk to inherit prior knowledge in earlier chunks during generation. Additionally, we introduce a \textit{Hierarchical Sparse Attention} to capture informative historical and local context in a coarse-to-fine manner. Such a two-level mask selection strategy (i.e., frame and block level) can adaptively handle diverse attention patterns. Extensive experiments demonstrate that our method outperforms existing sparse attention methods in quality (e.g., 84.5 on VBench) and efficiency (e.g., $1.2{\sim}1.3\times$ end-to-end speedup). Combined with other efficient solutions, \textsc{Light Forcing} further achieves a $2.0{\sim}3.0\times$ end-to-end speedup across diverse GPUs (e.g., 27.4\,FPS on RTX 5090 and 33.9\,FPS on H100). Code is released via this \href{https://github.com/chengtao-lv/LightForcing}{link}.
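
The coarse-to-fine, two-level mask selection mentioned above (frame level, then block level) might look roughly like the following sketch. The scoring rule is an assumption: the abstract does not say how frames or blocks are ranked, so this example ranks a frame by its best block score, and `hierarchical_select` is a hypothetical name.

```python
from typing import List, Sequence, Tuple


def hierarchical_select(
    block_scores: Sequence[Sequence[float]],
    top_frames: int,
    top_blocks: int,
) -> List[Tuple[int, int]]:
    """Pick (frame, block) pairs to attend to, coarse to fine.

    block_scores[f][b] is an assumed relevance of block b in past frame f.
    Step 1 keeps the `top_frames` best frames; step 2 keeps the `top_blocks`
    best blocks inside each kept frame.
    """
    # Coarse level: summarize each frame by its strongest block (an assumption).
    frame_scores = [max(row) for row in block_scores]
    ranked_frames = sorted(range(len(block_scores)), key=lambda f: frame_scores[f], reverse=True)
    frames = sorted(ranked_frames[:top_frames])

    selected: List[Tuple[int, int]] = []
    for f in frames:
        row = block_scores[f]
        # Fine level: rank blocks only within frames that survived step 1.
        ranked_blocks = sorted(range(len(row)), key=lambda b: row[b], reverse=True)
        selected.extend((f, b) for b in sorted(ranked_blocks[:top_blocks]))
    return selected


scores = [
    [0.10, 0.20, 0.00],
    [0.90, 0.30, 0.80],
    [0.05, 0.70, 0.60],
]
print(hierarchical_select(scores, top_frames=2, top_blocks=2))
# [(1, 0), (1, 2), (2, 1), (2, 2)]
```

Only the selected pairs would be unmasked in the attention computation, so the cost scales with `top_frames * top_blocks` rather than with the full history.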

2602.13344 2026-06-16 cs.CV eess.IV 版本更新

FireRed-Image-Edit-1.0 Technical Report

FireRed-Image-Edit-1.0 技术报告

Super Intelligence Team, Changhao Qiao, Chao Hui, Chen Li, Cunzheng Wang, Dejia Song, Jiale Zhang, Jing Li, Qiang Xiang, Runqi Wang, Shuang Sun, Wei Zhu, Xu Tang, Yao Hu, Yibo Chen, Yuhao Huang, Yuxuan Duan, Zhiyi Chen, Ziyuan Guo

发表机构 * Super Intelligence Team（超级智能团队）； Xiaohongshu Inc.（小红书公司）

AI总结提出FireRed-Image-Edit扩散变换器，通过数据整理、多阶段训练和评估优化，在指令图像编辑上达到最先进性能，并开源代码、模型和基准。

AI中文摘要

我们提出FireRed-Image-Edit，一种基于指令的图像编辑扩散变换器，通过系统优化数据整理、训练方法和评估设计，实现了最先进的性能。我们构建了一个16亿样本的训练语料库，包含来自不同来源的9亿文本到图像和7亿图像编辑对。经过严格清洗、分层、自动标注和两阶段过滤，我们保留了超过1亿高质量样本，在生成和编辑之间取得平衡，确保强大的语义覆盖和指令对齐。我们的多阶段训练流程通过预训练、监督微调和强化学习逐步构建编辑能力。为了提高数据效率，我们引入了多条件感知桶采样器用于可变分辨率批处理，以及带有动态提示重索引的随机指令对齐。为了稳定优化并增强可控性，我们提出了用于DPO的非对称梯度优化、用于文本编辑的具有布局感知OCR奖励的DiffusionNFT，以及用于身份保持的可微一致性损失。我们进一步建立了REDEdit-Bench，一个涵盖15个编辑类别的综合基准，包括新引入的美化和低级增强任务。在REDEdit-Bench和公共基准（ImgEdit和GEdit）上的大量实验表明，与开源和专有系统相比，我们的性能具有竞争力或更优。为了支持未来研究，我们的代码、模型和基准套件在 https://github.com/FireRedTeam/FireRed-Image-Edit/ 公开提供。

英文摘要

We present FireRed-Image-Edit, a diffusion transformer for instruction-based image editing that achieves state-of-the-art performance through systematic optimization of data curation, training methodology, and evaluation design. We construct a 1.6B-sample training corpus, comprising 900M text-to-image and 700M image editing pairs from diverse sources. After rigorous cleaning, stratification, auto-labeling, and two-stage filtering, we retain over 100M high-quality samples balanced between generation and editing, ensuring strong semantic coverage and instruction alignment. Our multi-stage training pipeline progressively builds editing capability via pre-training, supervised fine-tuning, and reinforcement learning. To improve data efficiency, we introduce a Multi-Condition Aware Bucket Sampler for variable-resolution batching and Stochastic Instruction Alignment with dynamic prompt re-indexing. To stabilize optimization and enhance controllability, we propose Asymmetric Gradient Optimization for DPO, DiffusionNFT with layout-aware OCR rewards for text editing, and a differentiable Consistency Loss for identity preservation. We further establish REDEdit-Bench, a comprehensive benchmark spanning 15 editing categories, including newly introduced beautification and low-level enhancement tasks. Extensive experiments on REDEdit-Bench and public benchmarks (ImgEdit and GEdit) demonstrate competitive or superior performance against both open-source and proprietary systems. To support future research, our code, models, and benchmark suite are publicly available at https://github.com/FireRedTeam/FireRed-Image-Edit/ .

2603.01371 2026-06-16 cs.CV 版本更新

改进的基于表示自动编码器的基线

Jaskirat Singh, Boyang Zheng, Zongze Wu, Richard Zhang, Eli Shechtman, Saining Xie

发表机构 * Adobe Research（Adobe研究院）； ANU（澳大利亚国立大学）； New York University（纽约大学）

AI总结本文研究了基于表示自动编码器（RAE）的设计选择，发现三个见解，简化并改进了RAE。首先，研究了一种通用公式，将表示定义为最后k个编码器层的总和，而不是仅最终层。其次，研究了RAE与表示对齐（REPA）的假设，发现两者具有互补的工作机制。最后，改进了RAE在无分类器指导（CFG）中的表现，通过重新参数化DiT模型输出，实现了无需训练第二个模型的指导效果。RAEv2在ImageNet-256上达到了1.06的gFID，且训练效率显著提高。

英文摘要

Representation Autoencoders (RAE) replace the traditional VAE with pretrained vision encoders. In this paper, we systematically investigate several design choices and find three insights which simplify and improve RAE. First, we study a generalized formulation where the representation is defined as the sum of the last k encoder layers rather than solely the final layer. This simple change greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces). Second, we study the prevalent assumption that RAE (using pretrained representation as encoder) replaces representation alignment (REPA), which distills the same representation to intermediate layers instead. Through large-scale empirical analysis, we uncover a surprising finding: RAE and REPA exhibit complementary working mechanisms, allowing the same representation to be used as both encoder and target for intermediate diffusion layers. Finally, the original RAE struggles with classifier-free guidance (CFG) and requires training a second, weaker diffusion model for AutoGuidance (AG). We show that REPA itself can be viewed as x-prediction in RAE latent space. By simply re-parameterizing the output of the DiT model, it can provide guidance for "free". Overall, RAEv2 leads to more than 10x faster convergence over the original RAE, achieving a state-of-the-art gFID of 1.06 in just 80 epochs on ImageNet-256. On FDr6, RAEv2 achieves a state-of-the-art 2.17 at just 80 epochs compared to the previous best 3.26 (800 epochs) without any post-training. This motivates EPFID@k (epochs to reach unguided gFID < k) as a measure of training efficiency. RAEv2 attains an EPFID@2 of 35 epochs, versus 177 for the original RAE. We also validate our approach across diverse settings for text-to-image generation and navigation world models, showing consistent improvements. The code is available at https://raev2.github.io.
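
The EPFID@k metric is defined in the abstract only as "epochs to reach unguided gFID < k", so a direct reading can be sketched in a few lines. The function name `epfid_at_k`, the 1-indexed epoch convention, and the example curve are assumptions for illustration; the numbers below are not results from the paper.

```python
from typing import Optional, Sequence


def epfid_at_k(gfid_per_epoch: Sequence[float], k: float) -> Optional[int]:
    """Return the first 1-indexed epoch whose unguided gFID is strictly below k.

    Returns None if the threshold is never reached within the recorded epochs.
    """
    for epoch, gfid in enumerate(gfid_per_epoch, start=1):
        if gfid < k:
            return epoch
    return None


# Hypothetical gFID curve, one value per epoch.
curve = [9.8, 5.1, 3.0, 2.2, 1.9, 1.7]
print(epfid_at_k(curve, 2.0))  # 5
print(epfid_at_k(curve, 1.0))  # None
```

Under this reading, a lower EPFID@k means the model crosses the quality threshold earlier in training, which is how the abstract's comparison of 35 versus 177 epochs should be interpreted.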

2605.19876 2026-06-16 cs.CV 版本更新

Structural Energy Guidance for View-Consistent Text-to-3D Generation

基于结构能量的视图一致文本到3D生成

Qing Zhang, Jinguang Tong, Jing Zhang, Jie Hong, Xuesong Li

发表机构 * Australian National University（澳大利亚国立大学）； CSIRO（澳大利亚联邦科学与工业研究组织）； The University of Hong Kong（香港大学）

AI总结本文针对基于扩散模型的文本到3D生成中视图不一致问题，提出无需训练的SEGS框架，通过在U-Net特征的PCA子空间中构建结构能量并注入去噪过程，提升多视图一致性，实验表明有效降低Janus率并提升视图一致性评分。

2605.25449 2026-06-16 cs.CV 版本更新

Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion

Pantheon360: 通过3D感知的360°视频扩散驯服数字孪生生成

Ting-Hsuan Chen, Ying-Huan Chen, Tao Tu, Jie-Ying Lee, Cho-Ying Wu, Fangzhou Lin, Hengyuan Zhang, David Paz, Xinyu Huang, Yuliang Guo, Yu-Lun Liu, Yue Wang, Liu Ren

发表机构 * University of Southern California（南加州大学）； National Yang Ming Chiao Tung University（国立阳明交通大学）； Cornell University（康奈尔大学）； Bosch Research（博世研究）

AI总结提出Pantheon360框架，利用显式3D缓存从稀疏360°输入生成高保真视频，实现全局几何一致性和可控相机路径，解决传统透视视频生成器视野受限导致的跨视图不一致和时间漂移问题。

Comments Accepted to CVPR 2026. Project page: https://koi953215.github.io/pantheon360_page/

AI中文摘要

从视频生成完整的数字孪生需要精确的相机控制、全局场景覆盖以及严格的空间-时间一致性约束，由于透视视频生成器的视野（FoV）有限，这些要求仍然具有挑战性。其狭窄的FoV迫使使用长轨迹或多视图轨迹，从而加剧了跨视图不一致和时间漂移。我们认为360°视频生成提供了一种自然的解决方案：全景覆盖简化了轨迹设计，并为保持一致性提供了强大的全局上下文。我们提出Pantheon360：通过3D感知的360°视频扩散驯服数字孪生生成，这是一个可控的360°视频生成框架，能够从稀疏的360°输入合成高保真视频。关键思想是一个显式的3D缓存，从输入中重建，作为任何用户定义相机路径的几何骨架。这使得扩散模型可以专注于逼真的纹理细化，同时3D缓存强制执行全局几何一致性。实验表明，Pantheon360实现了卓越的视觉质量和无与伦比的几何一致性，为下游仿真和数字孪生应用提供了可靠且灵活的360°场景生成。

英文摘要

Generating complete digital twins from videos requires precise camera control, global scene coverage, and strict spatial-temporal consistency constraints that remain challenging for perspective video generators due to their limited field of view (FoV). Their narrow FoV forces long or multi-view trajectories, amplifying cross-view inconsistency and temporal drift. We argue that 360° video generation offers a natural solution: panoramic coverage simplifies trajectory design and provides a strong global context for maintaining coherence. We introduce Pantheon360: Taming Digital Twin Generation via 3D-Aware 360° Video Diffusion, a controllable 360° video generation framework that synthesizes high-fidelity videos from sparse 360° inputs. The key idea is an explicit 3D Cache, reconstructed from the input, which serves as a geometric scaffold for any user-defined camera path. This allows the diffusion model to focus on photorealistic texture refinement while the 3D Cache enforces global geometric consistency. Experiments show that Pantheon360 achieves superior visual quality and unmatched geometric coherence, enabling reliable and flexible 360° scene generation for downstream simulation and digital-twin applications.

2605.29509 2026-06-16 cs.CV 版本更新

Ultra Flash: 将实时流式视频生成扩展到高分辨率

Luxury, Jie Huang, Zihao Fan, Xiaoxiao Ma, Jun-hao Zhuang, Yuming Li, Zeyue Xue, Siming Fu, Haoran Li, Mingchen Zhong, Guohui Zhang, Shichen Ma, Yijun Liu, Jiaqi Shi, Yanwen Ma, Yaofeng Su, Haoyu Wang, Yaowei Li, Songchun Zhang, Weiyang Jin, Yuxuan Bian, Shiyi Zhang, Haojun Xu, Shuai Lu, Xin Han, Wei Tang, Haoyang Huang, Nan Duan

发表机构 * JD Explore Academy（京东探索研究院）； USTC（中国科学技术大学）； PKU（北京大学）； THU（清华大学）； BUAA（北京航空航天大学）； FDU（复旦大学）； HKUST（香港科技大学）； HKU（香港大学）； CUHK（香港中文大学）

AI总结提出Ultra Flash级联框架，通过架构保持的超分辨率训练、因果流式潜在上采样器和高分辨率解码器、以及级联优化方案，在单GPU上实现1K分辨率约30 FPS和2K分辨率约18 FPS的实时高分辨率流式视频生成。

AI中文摘要

尽管最近的自回归视频扩散模型在流式质量上取得了显著成果，但它们仍局限于低分辨率（如480P），使得高效、可扩展的实时高分辨率视频生成成为一个根本性的开放挑战。为弥补这一差距，我们提出了Ultra Flash，一个能够实时生成高分辨率视频的级联流式框架。Ultra Flash在单GPU上实现约30 FPS（1K分辨率）和约18 FPS（2K分辨率），通过三个关键贡献：（1）一种保持架构的T2V到TV2V超分辨率训练范式，结合面向AIGC的数据降级流水线，有效保留基础模型的生成能力，从而在级联到主流低分辨率生成模型后增强高分辨率细节；（2）一个因果流式潜在上采样器与高分辨率解码器配对，增强时空连贯性，同时实现高效的潜在空间缩放和精确的高分辨率解码，且计算开销可忽略；（3）一种级联高分辨率流式视频生成优化方案，首先对超分辨率模型进行混合奖励增强的稀疏因果化和单步蒸馏，然后引入带有动态缓存管理的级联流式自强迫偏好优化，共同增强整体连贯性、提高质量，并实现实时高分辨率流式视频生成。大量实验表明，Ultra Flash能够可靠地生成超高分辨率流式视频，同时保持最先进的视觉质量和卓越效率。

英文摘要

While recent autoregressive video diffusion models achieve remarkable streaming quality, they remain confined to low resolutions (e.g., 480P), leaving efficient, scalable, real-time high-resolution video generation a fundamental open challenge. To bridge this gap, we present Ultra Flash, a cascaded streaming framework capable of real-time high-resolution video generation. Ultra Flash achieves ~30 FPS at 1K resolution and ~18 FPS at 2K resolution on a single GPU through three key contributions: (1) an architecture-preserving T2V-to-TV2V super-resolution training paradigm coupled with an AIGC-oriented data degradation pipeline that effectively preserves the generative capability of the base model, enabling enhanced high-resolution detail when cascaded after mainstream low-resolution generative models; (2) a causal streaming latent upsampler paired with a high-resolution decoder, which enhances spatiotemporal coherence while enabling efficient latent spatial scaling and precise high-resolution decoding with negligible computational overhead; and (3) a cascade high-resolution streaming video generation optimization scheme that first performs hybrid-reward-enhanced sparse causalization and single-step distillation of the super-resolution model, then introduces cascaded streaming self-forcing preference optimization with dynamic cache management, jointly enhancing overall coherence, improving quality, and enabling real-time high-resolution streaming video generation. Extensive experiments demonstrate that Ultra Flash reliably produces ultra-high-resolution streaming video while maintaining state-of-the-art visual quality and superior efficiency. Project Page: https://xin1u.github.io/UltraFlash/

2606.11751 2026-06-16 cs.CV cs.AI 版本更新

AnchorEdit: Maintaining Temporal Consistency in Multi-turn Image Editing via Causal Memory

AnchorEdit: 通过因果记忆在多轮图像编辑中保持时间一致性

Hang Xu, Xiaoxiao Ma, Guohui Zhang, Yu Hu, Siming Fu, Jie Huang, Lin Song, Haoyang Huang, Nan Duan, Feng Zhao

发表机构 * University of Science and Technology of China（中国科学技术大学）； JD Explore Academy（京东探索研究院）

AI总结提出首个自回归扩散框架AnchorEdit，通过因果记忆机制和自展开策略解决多轮编辑中的身份漂移和误差累积问题，在10轮以上交互中保持高保真度。

Comments Code: https://github.com/xuhang07/AnchorEdit

AI中文摘要

多轮图像编辑对于迭代设计至关重要，但当前模型在连续步骤中常面临身份漂移和误差累积。现有研究利用视频先验保持一致性，但其依赖的双向注意力与交互式编辑的因果、顺序性质根本不符。本文提出AnchorEdit，首个专为高分辨率、长期多轮编辑设计的自回归（AR）扩散框架。AnchorEdit通过三阶段训练课程弥合视频先验与因果推理之间的差距：保持身份的单轮预训练、使用新颖的自展开策略进行因果AR强制微调以缓解暴露偏差，以及用于高效4步生成的一致性蒸馏。在推理过程中，我们引入记忆机制来锚定初始主体身份，并确保在扩展编辑轨迹上的稳定外推。为评估性能，我们提供了一个新的高分辨率多轮编辑基准，旨在压力测试长期稳定性。大量实验表明，AnchorEdit达到了最先进的结果，即使在10轮以上的交互中也能保持卓越的主体保真度和指令遵循能力。

英文摘要

Multi-turn image editing is essential for iterative design, yet current models often struggle with identity drift and error accumulation over successive steps. While existing research leverages video priors for consistency, their reliance on bidirectional attention is fundamentally misaligned with the causal, sequential nature of interactive editing. In this paper, we propose AnchorEdit, the first autoregressive (AR) diffusion-based framework designed specifically for high-resolution, long-term multi-turn editing. AnchorEdit bridges the gap between video priors and causal inference through a three-stage training curriculum: identity-preserving single-turn pretraining, causal AR forcing fine-tuning with a novel self-rollout strategy to mitigate exposure bias, and consistency distillation for efficient 4-step generation. During inference, we introduce a memory mechanism to anchor the initial subject identity and ensure stable extrapolation across extended editing trajectories. To evaluate performance, we provide a new high-resolution multi-turn editing benchmark designed to stress-test long-horizon stability. Extensive experiments demonstrate that AnchorEdit achieves state-of-the-art results, maintaining exceptional subject fidelity and instruction following even over 10+ interaction rounds.

2606.13655 2026-06-16 cs.CV cs.GR 版本更新

Flex4DHuman: Flexible Multi-view Video Diffusion for 4D Human Reconstruction

Flex4DHuman：面向4D人体重建的灵活多视角视频扩散模型

Jen-Hao Cheng, Yipeng Wang, Hao Zhang, Gengshan Yang, Jenq-Neng Hwang

发表机构 * University of Washington（华盛顿大学）； World Labs

AI总结提出Flex4DHuman，一种基于相对相机位姿条件化的多视角视频扩散模型，无需显式几何先验即可将单目或稀疏多视角视频转换为密集多视角视频，并用于4D高斯溅射重建。

Comments Project Page: https://andy-cheng.github.io/Flex4DHuman/

AI中文摘要

我们提出Flex4DHuman，一种多视角视频扩散模型，它通过仅使用相对相机位姿条件化，将动态主体的单目或稀疏多视角视频转换为同步的密集多视角视频。与先前依赖骨架、深度图、法线或渲染目标视角几何的人体中心方法不同，Flex4DHuman不需要显式几何先验，而是通过相对相机位姿位置编码来条件化生成。生成的视频可直接被下游重建流程用于创建动态4D高斯溅射。基于Wan 2.1 1.3B文本到视频模型，Flex4DHuman保留了骨干架构，并通过五轴位置编码编码相机和视角信息，该编码将时空RoPE扩展了视角索引和连续SE(3)相对相机几何。三阶段课程逐步训练模型以进行位姿跟随、灵活的参考到目标视角生成以及时间展开。为支持时间展开，我们使用干净的历史目标视角令牌进行训练。我们还添加了多视角字幕以实现测试时文本控制。结合现成的4D高斯溅射阶段，我们的框架将单目静态相机视频提升为动态4D高斯溅射。在DNA-Rendering和ActorsHQ上的实验表明，Flex4DHuman超越了先前最先进的方法，而相同的公式在混合人体-动物训练后泛化到动物类别。这些能力使Flex4DHuman成为从随意单目视频进行可扩展4D内容创建的实际一步，适用于仿真、游戏、AR/VR和视频重拍。

英文摘要

We present Flex4DHuman, a multi-view video diffusion model that transforms a monocular or sparse multi-view video of a dynamic subject into synchronized dense multi-view videos using only relative camera-pose conditioning. Unlike prior human-centric methods that rely on skeletons, depth maps, normals, or rendered target-view geometry, Flex4DHuman requires no explicit geometry priors and instead conditions generation through relative camera-pose positional encoding. The generated videos can be directly ingested by downstream reconstruction pipelines to create dynamic 4D Gaussian splats. Built on the Wan 2.1 1.3B text-to-video model, Flex4DHuman preserves the backbone architecture and encodes camera and view information through a five-axis positional encoding that extends spatio-temporal RoPE with view indices and continuous SE(3) relative camera geometry. A three-stage curriculum progressively trains the model for pose following, flexible reference-to-target view generation, and temporal rollout. To support temporal rollout, we train with clean historical target-view tokens. We also add multi-view captions to enable test-time text control. Combined with an off-the-shelf 4D Gaussian Splatting stage, our framework lifts monocular static-camera videos into dynamic 4D Gaussian splats. Experiments on DNA-Rendering and ActorsHQ show that Flex4DHuman surpasses prior state-of-the-art methods, while the same formulation generalizes to animal categories after mixed human-animal training. These capabilities make Flex4DHuman a practical step toward scalable 4D content creation from casual monocular videos for simulation, gaming, AR/VR, and video re-shooting.

2509.24223 2026-06-16 cs.LG cs.CV stat.ML 版本更新

Semantic Editing with Coupled Stochastic Differential Equations

耦合随机微分方程的语义编辑

Jianxin Zhang, Clayton Scott

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出耦合随机微分方程（coupled SDEs）引导预训练生成模型的采样过程，无需重新训练即可实现高提示保真度和近像素级一致性的语义编辑。

2606.14811 2026-06-16 cs.CV 新提交

S23DR 2026: End-to-End 3D Wireframe Prediction via DETR-Style Set Prediction with Contrastive Denoising

S23DR 2026：基于对比去噪的DETR风格集合预测实现端到端3D线框预测

Nitiz Khanal

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出WireframeDETR方法，直接对3D点云进行DETR风格集合预测，无需中间顶点检测，通过对比去噪训练、多尺度编码器和渐进辅助损失权重实现端到端3D线框预测，在S23DR 2026挑战赛上取得0.575 HSS。

Comments Technical report; S23DR 2026 Challenge submission

2606.15328 2026-06-16 cs.CV 新提交

SGFormer++: Semantic Graph Transformer for Incremental 3D Scene Graph Generation

SGFormer++：用于增量式3D场景图生成的语义图Transformer

Mengshi Qi, Changsheng Lv, Zijian Fu, Xianlin Zhang, Huadong Ma

发表机构 * State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications（北京邮电大学网络与交换技术国家重点实验室）

AI总结提出SGFormer++，通过图嵌入层和语义注入层实现全局消息传递，并引入空间引导特征适配器和级联二值预测头解决增量场景图生成中的灾难性遗忘问题，在3DSSG基准上达到最优性能。

AI中文摘要

本文提出SGFormer++，一种用于3D场景图生成（SGG）的新型语义图Transformer，旨在将点云场景解析为语义结构图，其中节点表示检测到的对象实例，边编码它们的成对关系，核心挑战在于建模复杂的全局场景结构。现有基于图卷积网络（GCN）的方法存在过平滑和感受野有限的问题，而SGFormer++利用Transformer层作为骨干网络实现全局消息传递。具体地，我们引入了两个专为3D SGG定制的关键组件：（1）图嵌入层++，以线性计算复杂度高效集成边缘感知的全局上下文；（2）语义注入层++，利用来自大语言模型（LLM）和视觉-语言模型（VLM）的语言先验丰富视觉特征，在不引入额外可训练参数的情况下增强语义表示。为进一步解决增量式SGG（I-SGG）的实际挑战（其中新的关系类别顺序到达），我们为SGFormer++配备了新颖的空间引导特征适配器，利用主语-宾语空间几何校准谓词特征以应对尺度变化，以及级联二值预测头，通过任务增量分类器扩展和logit蒸馏缓解灾难性遗忘。在3DSSG基准上的大量实验表明，SGFormer++在标准和增量设置下均达到最先进性能：在增量设置下，谓词A@1绝对提升4.49%。代码和数据可在 https://github.com/Andy20178/SGFormer 获取。

英文摘要

In this paper, we propose SGFormer++, a novel Semantic Graph Transformer for 3D scene graph generation (SGG), which aims to parse point cloud scenes into semantic structural graphs, where nodes denote detected object instances and edges encode their pairwise relationships, with the core challenge lying in modeling complex global scene structure. While existing graph convolutional network (GCN)-based methods suffer from over-smoothing and limited receptive fields, SGFormer++ leverages Transformer layers as its backbone to enable global message passing. Specifically, we introduce two key components tailored for 3D SGG: (1) a Graph Embedding Layer++ that efficiently integrates edge-aware global context with linear computational complexity, and (2) a Semantic Injection Layer++ that enriches visual features with linguistic priors from large language models (LLMs) and vision-language models (VLMs), boosting semantic representation without introducing extra trainable parameters. To further address the practical challenge of incremental SGG (I-SGG), where new relationship categories arrive sequentially, we equip SGFormer++ with a novel Spatial-guided Feature Adapter, which calibrates predicate features using subject-object spatial geometry to counter scale variation, and a Cascaded Binary Prediction Head that mitigates catastrophic forgetting via task-incremental classifier expansion and logit distillation. Extensive experiments on the 3DSSG benchmark demonstrate that SGFormer++ achieves state-of-the-art performance in both standard and incremental settings: it yields a significant 4.49% absolute improvement in Predicate A@1 under the incremental setting. Code and data are available at: https://github.com/Andy20178/SGFormer.

2606.15659 2026-06-16 cs.CV 新提交

SpatialAvatar-0: High-Quality 4D Head Avatar with Multi-Stage Reconstruction

SpatialAvatar-0: 高质量4D头部虚拟形象的多阶段重建

Yiran Wang, Zeyu Zhang, Yuanming Li, Ziming Wang, Yang Zhao

发表机构 * USYD（悉尼大学）； SpatialReal ； ZJU（浙江大学）； La Trobe（拉筹伯大学）

AI总结提出基于FLAME网格绑定高斯表示的多阶段框架，通过前馈生成器和10K迭代布局保持微调，实现跨域零样本和单目基准领先性能。

AI中文摘要

高质量4D头部虚拟形象（来自一张或少量源肖像）是远程呈现、AR/VR和数字人交互的核心。3D高斯泼溅（3DGS）已成为主导表示，两个互补范式（可泛化的前馈预测器和逐主体精炼器）并行成熟。然而，现有前馈预测器在单一数据集族上训练，具有硬编码的源数量，继承了相应的领域偏差。逐主体精炼器需要30万至60万次迭代，并依赖自适应致密化，这会破坏上游高斯布局，导致两个范式无法端到端共享表示。为桥接两个范式，我们提出SpatialAvatar-0，基于共享的FLAME网格绑定高斯表示：一个前馈生成器，具有无参数的K源均值池化，以及一个从单目时序到多视角空间的两阶段调度，防止身份先验在小多视角集上坍缩。我们进一步引入一个10K迭代的布局保持逐主体精炼循环，冻结FLAME绑定和高斯数量，并用三分量抗尖峰正则化替代致密化。在VFHQ/HDTF跨域零样本上，我们超越域内领先者GAGAvatar +1.5 dB PSNR，尽管从未在任一测试域上训练；在SplattingAvatar单目基准上，我们领先所有报告指标，超越30万次迭代的GeoAvatar +1.3 dB PSNR，且逐主体调度比常见SOTA基线短至60倍。网站：https://spatialwalk.github.io/SpatialAvatar-0。

英文摘要

High-quality 4D head avatars from one or a few source portraits are central to telepresence, AR/VR, and digital-human interaction. 3D Gaussian Splatting (3DGS) has emerged as the dominant representation, with two complementary regimes (generalizable feed-forward predictors and per-subject refiners) maturing in parallel. However, existing feed-forward predictors are trained on a single dataset family with a hard-coded source count, inheriting the corresponding domain bias. Per-subject refiners require 300K to 600K iterations and rely on adaptive densification that destroys upstream Gaussian layouts, preventing the two regimes from sharing a representation end-to-end. To bridge both regimes, we propose SpatialAvatar-0 on a shared FLAME-mesh-bound Gaussian representation: a feed-forward generator with a parameter-free K-source mean-pool and a monocular-temporal to multi-view-spatial two-phase schedule that anchors against identity-prior collapse onto the smaller multi-view set. We further introduce a 10K-iter layout-preserving per-subject refinement loop that freezes the FLAME-binding and Gaussian count and replaces densification with a three-component anti-spike regularization. On VFHQ/HDTF cross-domain zero-shot we surpass the in-domain leader GAGAvatar by +1.5 dB PSNR despite never training on either test domain, and on the SplattingAvatar monocular benchmark we lead every reported metric, surpassing the 300K-iter GeoAvatar by +1.3 dB PSNR with an up to 60x shorter per-subject schedule than common SOTA baselines. Website: https://spatialwalk.github.io/SpatialAvatar-0.
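
A parameter-free mean-pool over K source features, as named in the abstract, can be sketched as below. The point of the sketch is that averaging has no learned weights and does not depend on K or on source order, which is one plausible reason it avoids a hard-coded source count. The function name and the flat-list feature layout are assumptions, not the paper's interface.

```python
from typing import List, Sequence


def mean_pool_sources(features: Sequence[Sequence[float]]) -> List[float]:
    """Average K per-source feature vectors into a single vector.

    No learned parameters are involved, so the same function works for any
    K >= 1 and gives the same result for any ordering of the sources.
    """
    if not features:
        raise ValueError("at least one source feature vector is required")
    dim = len(features[0])
    if any(len(f) != dim for f in features):
        raise ValueError("all source feature vectors must share one dimension")
    k = len(features)
    return [sum(f[i] for f in features) / k for i in range(dim)]


print(mean_pool_sources([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
print(mean_pool_sources([[1.0, 2.0]]))              # [1.0, 2.0]
```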

2606.15681 2026-06-16 cs.CV 新提交

VEPHand: 大规模视图高效光度手部性能捕捉

Zhengyang Shen, Kai-Hung Chang, Erroll Wood, Deying Kong, Bo Peng, Timo Bolkart, Jinlong Yang, Bowen Zhao, Danhang Tang, Sasa Petrovic, Emre Aksan, Jérémy Riviere, Vassilis Choutas, Delio Vicini, Jay Busch, Shichen Liu, Zhe Cao, Hugh Liu, JingJing Shen, Jonathan Taylor, Mingsong Dou

发表机构 * Google XR

AI总结提出面向有限视角（约20个）的端到端手部动态捕捉与配准管线，通过无掩膜神经方法和物理启发框架解决几何歧义与自接触变形难题，在12000+序列上验证了高保真重建与配准。

AI中文摘要

鲁棒、高保真的3D手部捕捉是数字人创建的基础，但在实际多视角系统中仍具挑战性，这些系统需要在丰富光度信息与有限视角密度导致的重建几何歧义之间取得平衡。本文提出一种端到端的动态手部性能捕捉与配准管线，专为视图高效设置（约20个视角）设计。我们通过两项主要创新应对关键挑战。首先，为克服重建困难（如视角重叠有限和背景杂乱），我们的无掩膜神经方法通过场景参数化和场景特定密度正则化，从无掩膜图像中鲁棒地提取精细的手部几何和外观。其次，针对配准挑战（如准确捕捉非线性皮肤变形和确保严重自接触时的合理结果），我们提出一个物理启发框架。它通过优化个性化手部模型规范四面体网格内的固有体积偏移以及姿态参数，将重建与个性化手部模型对齐。该方法在鲁棒损失和优化支持下，捕捉精细表面变形，确保在严重关节运动和自接触下的合理结果，并对输入噪声表现出强容忍性。我们在超过12000个序列的大规模数据集上展示了自动化管线的可扩展性和鲁棒性，并从中导出一个大规模、高质量合成2D/3D手部数据集用于训练下游任务。这展示了该方法在单手、复杂双手交互和自然手物操作中的有效性。我们的方法在视图高效、无掩膜场景下实现了最先进的重建保真度和高精度配准。项目页面：https://zyshen021.github.io/VEPHand/。

英文摘要

Robust, high-fidelity 3D hand capture, while fundamental to digital human creation, remains challenging with practical multi-view systems that balance rich photometry with the geometric ambiguities of reconstruction arising from limited viewpoint density. This paper presents an end-to-end pipeline for dynamic hand performance capture and registration, specifically designed for view-efficient setups ($\sim$20 views). We address key challenges with two primary innovations. First, to overcome reconstruction difficulties like limited view overlap and background clutter, our mask-free neural method robustly extracts detailed hand geometry and appearance from unmasked images using scene parameterization and scenario-specific density regularization. Second, addressing registration challenges such as accurately capturing non-linear skin deformations and ensuring plausible results during severe self-contact, we propose a physics-inspired framework. It aligns reconstructions to a personalized hand model by optimizing intrinsic volumetric offsets within its canonical tetrahedral mesh, alongside pose parameters. This approach, supported by robust losses and optimization, captures fine surface deformations, ensures plausible results under severe articulation and self-contact, and demonstrates strong tolerance to input noise. We demonstrate the scalability and robustness of our automated pipeline on an extensive dataset of over 12,000 sequences, from which we also derive a large-scale, high-quality synthetic 2D/3D hand dataset for training downstream tasks. This showcases its effectiveness for single hands, intricate two-hand interactions, and natural hand-object manipulations. Our method achieves state-of-the-art reconstruction fidelity in view-efficient, unmasked scenarios and highly accurate registration. Our project page is available at https://zyshen021.github.io/VEPHand/.

2606.16048 2026-06-16 cs.CV 新提交

PointDiffusion: Diffusion-Based Scene Completion in the Point Cloud Domain

PointDiffusion: 点云领域的基于扩散的场景补全

Chidera Agbasiere, Mikhail Sannikov, Faith Ogunwoye, Erik Shaikhiev, Alex Kozinov, Ilya Mikhalchuk, Iana Zhura, Dzmitry Tsetserukou

发表机构 * Intelligent Space Robotics Laboratory, Skolkovo Institute of Science and Technology（斯科尔科沃科学技术学院智能空间机器人实验室）

AI总结提出多令牌高斯VAE和锚点ICP地面真值精化，实现单步扩散场景补全，在SemanticKITTI上平方倒角距离降低16倍，推理延迟降低25-143倍。

AI中文摘要

从稀疏LiDAR点云重建密集3D场景是自动驾驶中的基本挑战，其中潜在扩散模型提供了一种有前景的解决方案。然而，现有方法依赖于对象级自编码器，这些自编码器在室外尺度下会崩溃为不稳定的全局表示，并且受到由里程计漂移破坏的地面真值数据的影响，这系统地降低了监督质量。此外，多步扩散推理会带来难以承受的延迟，无法实时部署。我们提出了一种新颖的多令牌高斯VAE，具有交叉注意力池化，用于稳定的场景级LiDAR压缩，并结合基于锚点的ICP地面真值精化流水线，消除了训练监督中的漂移引入噪声。这些组件共同实现了一个无支架的单步扩散补全模型，在SemanticKITTI序列08上将平方倒角距离减少了约16倍（从0.396 m^2降至0.024 m^2），分别比LiDiff和ScoreLiDAR高出17-19%和10-11%，并且推理延迟降低了25-143倍。我们的结果表明，在此设置下，数据质量主导模型设计，多令牌潜在空间为基于潜在扩散的场景补全提供了稳定的第一阶段。

英文摘要

Reconstructing dense 3D scenes from sparse LiDAR point clouds is a fundamental challenge in autonomous driving, where latent diffusion models offer a promising solution. However, existing approaches rely on object-level autoencoders that collapse into unstable global representations at outdoor scale and suffer from ground truth data corrupted by odometry drift that systematically degrades supervision quality. Furthermore, multi-step diffusion inference incurs prohibitive latency for real-time deployment. We propose a novel multi-token Gaussian VAE with cross-attention pooling for stable scene-scale LiDAR compression, combined with an anchor-based ICP ground truth refinement pipeline that eliminates drift-induced noise from training supervision. Together, these components enable a scaffold-free single-step diffusion completion model that achieves an approximately 16x reduction in squared Chamfer distance on SemanticKITTI seq. 08 (0.396 m^2 to 0.024 m^2), surpasses LiDiff and ScoreLiDAR by 17-19% and 10-11%, respectively, and operates at 25-143x lower inference latency. Our results demonstrate that data quality dominates model design in this regime and that multi-token latent spaces provide a stable first stage for latent diffusion-based scene completion.
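
For readers unfamiliar with the reported metric, here is a small sketch of a squared Chamfer distance between two point sets, followed by a check of the "approximately 16x" figure from the two values quoted in the abstract. Chamfer definitions vary (sum versus mean, one-sided versus symmetric); this version assumes the symmetric mean-of-nearest-squared-distances form, which may differ from the benchmark's exact protocol.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _sq_dist(p: Point, q: Point) -> float:
    """Squared Euclidean distance between two 3D points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def squared_chamfer(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric squared Chamfer distance (assumed definition).

    For each point, take the squared distance to its nearest neighbor in the
    other set, average per direction, and add the two directions.
    """
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    a_to_b = sum(min(_sq_dist(p, q) for q in b) for p in a) / len(a)
    b_to_a = sum(min(_sq_dist(q, p) for p in a) for q in b) / len(b)
    return a_to_b + b_to_a


pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(squared_chamfer(pred, gt))  # 1.0

# Ratio of the two values quoted in the abstract (in m^2).
print(round(0.396 / 0.024, 1))    # 16.5
```

This brute-force version is quadratic in the number of points; real LiDAR evaluation would normally use a spatial index such as a k-d tree.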

2606.16323 2026-06-16 cs.CV cs.GR 新提交

Local-GS：通过Tile局部Warp一致性加速3D高斯泼溅

Yang Luo, Yan Gong, Yongsheng Gao, Jie Zhao, Xinyu Zhang, Huaping Liu

发表机构 * State Key Laboratory of Robotics and Systems, Harbin Institute of Technology（哈尔滨工业大学机器人技术与系统国家重点实验室）； State Key Laboratory of Intelligent Green Vehicle and Mobility, School of Vehicle and Mobility, Tsinghua University（清华大学车辆与运载学院智能绿色车辆与交通国家重点实验室）； Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）

AI总结提出Local-GS，通过基于SIMT执行边界组织高斯原语，设计提升、剔除和混合三阶段warp一致渲染范式，在不降低质量的前提下实现最高7.76倍加速。

AI中文摘要

3D高斯泼溅（3DGS）通过将场景表示为各向异性3D高斯原语的密集集合，显著推进了实时新视角合成。然而，高斯的不规则空间分布通常导致GPU利用率低下，因为warp发散和冗余计算降低了渲染性能。为了解决这个问题，我们提出了Local-GS，一种warp一致的渲染范式，它根据SIMT（单指令多线程）执行边界而非场景几何来组织高斯原语。具体来说，我们提出了三个warp一致阶段：提升阶段，在tile级别预计算共享参数；剔除阶段，丢弃没有贡献的warp；混合阶段，用统一的指令流替换逐像素分支。在多个数据集上的广泛基准测试中，Local-GS在不牺牲质量的情况下提高了效率。作为一种即插即用的优化，它为所有测试的基线提供了额外的性能提升，在Deep Blending场景上实现了7.76倍的加速。

英文摘要

3D Gaussian Splatting (3DGS) has significantly advanced real-time novel view synthesis by representing scenes as dense collections of anisotropic 3D Gaussian primitives. However, the irregular spatial distribution of Gaussians often leads to poor GPU utilization, as warp divergence and redundant computation degrade rendering performance. To address this, we present Local-GS, a warp-coherent rendering paradigm that organizes Gaussian primitives with respect to SIMT (Single Instruction, Multiple Threads) execution boundaries rather than scene geometry. Specifically, we propose three warp-coherent stages: a hoisting stage that precomputes shared parameters at tile level, a culling stage that discards warps with no contribution, and a blending stage that replaces per-pixel branching with a uniform instruction stream. Across extensive benchmarks on multiple datasets, Local-GS improves efficiency without compromising quality. As a plug-and-play optimization, it provides additional performance gains to all tested baselines, culminating in a $7.76\times$ speedup on Deep Blending scenes.

2606.16593 2026-06-16 cs.CV 新提交

Rotational Symmetry based Object Pose Estimation from Point Clouds in the Absence of Known 3D Models

基于旋转对称性的无已知3D模型点云物体姿态估计

Weichen Dai, Ruixun Yu, Yangjie Tang, Yifan Du, Yiyang Zhang, Donglei Sun, Hua Zhang

发表机构 * Key Laboratory of Brain Machine Collaborative Intelligence of Zhejiang Province, School of Computer Science, Hangzhou Dianzi University（浙江省脑机协同智能重点实验室，杭州电子科技大学计算机学院）； Advanced Intelligent Manufacturing Research Group, the University of Nottingham Ningbo China（先进智能制造研究组，宁波诺丁汉大学）

AI总结提出利用工业物体的旋转对称性，通过迭代优化联合估计姿态与点云，无需已知3D模型，在合成和真实数据集上达到与有模型方法相当的性能。

AI中文摘要

物体姿态估计对许多工业应用至关重要，例如使用机器人进行自动喷漆。然而，保密性问题常常限制了对高质量3D模型的访问，给基于点云的姿态估计带来了重大挑战。在这种情况下，旋转对称性——许多工业物体易于获取的特征——可以提供有价值的先验信息以促进姿态估计。在本文中，我们提出了一种方法，利用工业物体中常见的旋转对称性来解决缺乏3D模型带来的挑战。通过迭代优化过程，物体姿态与点云细化联合估计。该优化依赖于旋转对称性约束损失。为了构建这一损失，每个3D点根据当前估计的姿态旋转，并利用旋转对称性通过最近邻搜索识别多个对应点。然后使用这些对应点计算旋转对称性约束损失，迭代地细化姿态和点云。通过将旋转对称性显式地纳入优化过程，所提出的方法实现了鲁棒的姿态估计，并在不同物体类型上具有良好的泛化能力。该方法在一个专门为无已知3D模型的点云创建的数据集上进行了评估，该数据集包含四类合成物体和一个从生产线收集的真实轮毂。实验结果表明，所提出的方法实现了与依赖已知3D模型的方法相当的性能。

英文摘要

Object pose estimation is crucial to many industrial applications, with one example being automated spray painting using a robot. However, confidentiality concerns often limit access to high-quality 3D models, posing a significant challenge for point-cloud-based pose estimation. In such scenarios, rotational symmetry, a readily accessible characteristic of many industrial objects, can provide valuable prior information to facilitate pose estimation. In this paper, we propose a method that leverages the rotational symmetry commonly found in industrial objects to address the challenge caused by the absence of 3D models. The object pose is jointly estimated with point cloud refinement through an iterative optimization process. This optimization relies on a rotational symmetry constraint loss. To construct this loss, each 3D point is rotated according to the currently estimated pose, and multiple correspondences are identified using nearest-neighbor search by exploiting the rotational symmetry property. These correspondences are then used to compute the rotational symmetry constraint loss, which iteratively refines both the pose and the point cloud. By explicitly incorporating rotational symmetry into the optimization process, the proposed method achieves robust pose estimation and generalizes well across diverse object types. The proposed method is evaluated on a dataset specifically created for point clouds without known 3D models, consisting of four categories of synthetic objects and one real wheel hub collected from a production line. Experimental results demonstrate that the proposed method achieves performance comparable to methods that rely on known 3D models.
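
The idea of a rotational symmetry constraint loss can be illustrated with a deliberately simplified 2D sketch. This is not the authors' formulation: the function name `symmetry_loss`, the restriction to a 2D center estimate (instead of a full 3D pose), and the mean-squared nearest-neighbor residual are all assumptions made for illustration. Each point is rotated about the estimated center by every non-trivial multiple of 2π/n, and the squared distance to its nearest neighbor in the original cloud is accumulated.

```python
import math


def symmetry_loss(points, center, n_fold):
    """Mean squared nearest-neighbor residual under n-fold rotation about `center`.

    Hypothetical 2D simplification: a correct center of an n-fold symmetric
    point set maps the set onto itself, so the loss is close to zero.
    """
    cx, cy = center
    total, count = 0.0, 0
    for k in range(1, n_fold):
        angle = 2.0 * math.pi * k / n_fold
        c, s = math.cos(angle), math.sin(angle)
        for x, y in points:
            dx, dy = x - cx, y - cy
            # Rotate the point about the estimated center.
            rx, ry = cx + c * dx - s * dy, cy + s * dx + c * dy
            # Nearest-neighbor correspondence in the original cloud.
            total += min((rx - qx) ** 2 + (ry - qy) ** 2 for qx, qy in points)
            count += 1
    return total / count


square = [(1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
loss_true_center = symmetry_loss(square, (0.0, 0.0), 4)
loss_shifted_center = symmetry_loss(square, (0.3, 0.0), 4)
```

In this toy case the loss vanishes (up to floating-point error) at the true center and grows once the center estimate is shifted, which is the signal an iterative optimizer would descend on.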

2606.16672 2026-06-16 cs.CV 新提交

Sinkhorn-CPD: Robust point cloud registration via unbalanced entropic optimal transport

Sinkhorn-CPD：通过非平衡熵最优传输实现鲁棒点云配准

Jin Zhang, Mingyang Zhao, Bing Liu, Xin Jiang

发表机构 * LMIB & School of Mathematical Sciences, Beihang University（北京航空航天大学数学科学学院与LMIB）； State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences（中国科学院数学与系统科学研究院数学科学国家重点实验室）； Beijing Key Laboratory of Artificial Intelligence Innovation and Application in the Machine Tool Industry, School of Artificial Intelligence, Beihang University（北京航空航天大学人工智能学院北京市机床行业人工智能创新与应用重点实验室）； University of Chinese Academy of Sciences（中国科学院大学）

AI总结提出Sinkhorn-CPD，用双KL散度惩罚替代CPD的目标边际约束，通过非平衡熵最优传输和广义Sinkhorn迭代实现鲁棒点云配准，方差自动退火无需手动调参。

Comments 14 pages, 10 figures; journal version published in Computer-Aided Design

DOI: 10.1016/j.cad.2026.104104
Journal ref: Computer-Aided Design 199 (2026) 104104

AI中文摘要

相干点漂移（CPD）因其软对应和闭式参数更新而被广泛用于刚性点云配准。然而，CPD的目标边际约束迫使每个观测值（包括离群点）恰好接收单位概率质量。在严重离群点和部分重叠情况下，这一假设会降低配准精度。最优传输（OT）方法可以通过非平衡公式处理缺失质量，但需要手动调整退火调度。本文提出Sinkhorn-CPD，用双Kullback-Leibler惩罚替代CPD的目标边际约束，使算法能够丢弃两侧的离群点。由此得到的公式是一个完全非平衡的熵最优传输问题，可通过广义Sinkhorn迭代高效求解。此外，Sinkhorn-CPD保留了CPD的闭式Procrustes和方差更新。在我们的方法中，方差sigma^2扮演熵正则化参数的角色，从而自动产生从扩散到尖锐对应的退火调度，无需手动调节温度。在合成、跨类别和扫描到CAD基准上的实验表明，Sinkhorn-CPD达到了最先进的精度，对离群点和部分重叠具有强鲁棒性。

英文摘要

Coherent Point Drift (CPD) is widely used for rigid point cloud registration because of its soft correspondences and closed-form parameter updates. However, CPD's target-side marginal constraint forces every observation, including outliers, to receive exactly unit probability mass. This assumption degrades registration accuracy under heavy outliers and partial overlap. Optimal transport (OT) methods can handle missing mass through unbalanced formulations, but require hand-tuned annealing schedules. In this paper, we propose Sinkhorn-CPD, which replaces CPD's target-side marginal constraint with dual Kullback-Leibler penalties, allowing the algorithm to discard outliers on both sides. The resulting formulation is a fully unbalanced entropic optimal transport problem, which can be efficiently solved by generalized Sinkhorn iterations. Moreover, Sinkhorn-CPD preserves the closed-form Procrustes and variance updates of CPD. In our method, the variance sigma^2 plays the role of the entropic regularization parameter, which induces an automatic annealing schedule from diffuse to sharp correspondences without manual temperature tuning. Experiments on synthetic, cross-category, and scan-to-CAD benchmarks show that Sinkhorn-CPD achieves state-of-the-art accuracy, with strong robustness to outliers and partial overlap.
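
The generalized Sinkhorn iterations mentioned above can be sketched in a few lines. This is a minimal pure-Python sketch of the standard scaling updates for entropic optimal transport with KL penalties on both marginals, not the authors' implementation: the function name `unbalanced_sinkhorn`, the penalty weight `rho`, the fixed iteration count, and the toy 1D data are assumptions. Following the abstract, `eps` would correspond to the variance sigma^2; the Procrustes and variance updates of CPD are omitted here.

```python
import math


def unbalanced_sinkhorn(cost, a, b, eps, rho, iters=500):
    """Scaling iterations for entropic OT with KL penalties on both marginals.

    cost: n x m cost matrix, a/b: source/target masses,
    eps: entropic regularization, rho: KL penalty weight (assumed name).
    """
    n, m = len(a), len(b)
    K = [[math.exp(-cost[i][j] / eps) for j in range(m)] for i in range(n)]
    f = rho / (rho + eps)  # exponent < 1 softens the marginal constraints
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        Kv = [sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        u = [(a[i] / Kv[i]) ** f for i in range(n)]
        Ktu = [sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
        v = [(b[j] / Ktu[j]) ** f for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy 1D example: the last target point (6.0) is an outlier.
source = [0.0, 1.0, 2.0]
target = [0.0, 1.0, 2.0, 6.0]
cost = [[(x - y) ** 2 for y in target] for x in source]
plan = unbalanced_sinkhorn(cost, [1 / 3] * 3, [0.25] * 4, eps=1.0, rho=1.0)
col_mass = [sum(row[j] for row in plan) for j in range(4)]
```

Because the marginals are only penalized rather than enforced, the outlier column receives far less mass than the inlier columns, which is the behavior the abstract attributes to replacing the hard target-side constraint.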

2606.17027 2026-06-16 cs.CV 新提交

MeshLoom: Feed-Forward Non-Rigid Registration of Mesh Sequences

MeshLoom: 网格序列的前馈式非刚性配准

Jianqi Chen, Jiraphon Yenphraphai, Xiangjun Tang, Sergey Tulyakov, Chaoyang Wang, Peter Wonka, Rameen Abdal

发表机构 * KAUST Saudi Arabia（沙特阿拉伯阿卜杜拉国王科技大学）； Snap Inc. United States of America（美国Snap公司）； Purdue University United States of America（美国普渡大学）

AI总结提出MeshLoom，一种前馈式配准网络，通过拓扑感知编码器-解码器直接重建网格序列的顶点变形，实现秒级多网格配准，并在非刚性配准任务上达到最先进水平，同时支持运动插值和网格变形。

Comments Project page: https://meshloom.github.io/

AI中文摘要

我们提出MeshLoom，一种前馈式配准网络，可直接重建网格序列中的顶点变形。我们的方法将非刚性配准推进到超越现有模型，这些模型通常受限于昂贵的逐实例优化、狭窄的物体类别、仅成对输入或仅仅是中间输出。该网络简单高效，可在数秒内配准多个网格。其核心在于拓扑感知的编码器-解码器设计。具体来说，我们首先引入一种拓扑感知的点表示，将锚点（参考）网格的拓扑编码到其逐顶点特征中。这种表示增强了网络对锚点网格几何结构的理解，并区分了欧几里得接近但测地距离远的点。然后，我们提出一种多模态编码器，将这种锚点网格表示与每帧的互补线索（如形状潜变量和图像特征）融合。这些多源信号被压缩成一个紧凑的全局运动嵌入，捕捉密集的帧间对应关系。一个轻量级解码器随后用锚点网格点表示查询该全局嵌入，检索目标时间戳处的逐顶点变形。通过在多种运动和物体类别上的大量实验，我们表明MeshLoom在非刚性配准上达到了最先进的结果。此外，我们发现我们的全局嵌入-然后-查询范式自然地使网络能够生成中间时间戳的变形，这扩展了MeshLoom到运动插值和网格变形。项目页面：https://meshloom.github.io/。

英文摘要

We present MeshLoom, a feed-forward registration network that directly reconstructs vertex deformations across mesh sequences. Our approach advances non-rigid registration beyond existing models, which are typically constrained by costly per-instance optimization, narrow object categories, pairwise-only inputs, or merely intermediate outputs. The network is simple and efficient, registering multiple meshes within seconds. At its core lies a topology-aware encoder--decoder design. Specifically, we first introduce a topology-aware point representation that encodes the anchor (reference) mesh's topology into its per-vertex features. This representation strengthens the network's understanding of the anchor-mesh geometry and disambiguates points that are Euclidean-close yet geodesically distant. We then propose a multi-modal encoder that fuses this anchor-mesh representation with complementary cues from each frame, such as shape latents and image features. These multi-source signals are compressed into a compact global motion embedding that captures dense inter-frame correspondence. A lightweight decoder then queries this global embedding with the anchor-mesh point representation, retrieving per-vertex deformations at target timestamps. Through extensive experiments across diverse motions and object categories, we show that MeshLoom achieves state-of-the-art results on non-rigid registration. In addition, we find that our global embedding-then-query paradigm naturally enables the network to generate deformations at intermediate timestamps, which extends MeshLoom to motion interpolation and mesh morphing. Project page: https://meshloom.github.io/ .

2606.15238 2026-06-16 cs.GR cs.CV 交叉投稿

HairLRM: Strand-based Hair Modeling via Large Reconstruction Models

HairLRM：基于大型重建模型的发丝建模

Yuefan Shen, Yican Dong, Xiufeng Huang, Zhongtian Zheng, Youyi Zheng, Kui Wu

发表机构 * LIGHTSPEED Shenzhen China（LIGHTSPEED深圳中国）； State Key Lab of CAD and CG, Zhejiang University Hangzhou China（计算机辅助设计与图形学国家重点实验室，浙江大学杭州中国）； Hong Kong Baptist University Hong Kong China（香港浸会大学，中国香港）； LIGHTSPEED Los Angeles CA USA (2026)（LIGHTSPEED洛杉矶CA美国（2026））

AI总结针对传统发丝建模从2D图像推断3D结构的不适定性问题，提出结合大型重建模型的几何先验，利用双方向自编码器将粗几何提升为高保真发丝，通过潜在空间优化和表面引导细化解决矢量场奇点，实现鲁棒且精确的发丝重建。

Comments ACM SIGGRAPH 2026 Conference Paper

2508.09977 2026-06-16 cs.CV 版本更新

A Survey on 3D Gaussian Splatting Applications: Segmentation, Editing, and Generation

3D高斯泼溅应用综述：分割、编辑与生成

Shuting He, Peilin Ji, Yitong Yang, Changshuo Wang, Jiayi Ji, Yinglin Wang, Henghui Ding

发表机构 * Shanghai University of Finance and Economics（上海财经大学）； University College London（伦敦大学学院）； Xiamen University（厦门大学）； Fudan University（复旦大学）

AI总结综述3D高斯泼溅在分割、编辑和生成三大任务中的应用，总结代表性方法、监督策略和学习范式，并分析公共基准上的比较结果。

Comments IEEE TPAMI, GitHub Repo: https://github.com/heshuting555/Awesome-3DGS-Applications

Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

AI中文摘要

在新视角合成背景下，3D高斯泼溅（3DGS）最近作为神经辐射场（NeRF）的高效且具有竞争力的替代方案出现，能够实时实现高保真度的逼真渲染。除了新视角合成，3DGS的显式和紧凑特性使其能够应用于需要几何和语义理解的广泛下游任务。本综述全面概述了3DGS应用的最新进展。首先回顾了3DGS的重建基础，接着介绍了问题公式化、2D基础模型以及相关的基于NeRF的研究领域，这些为下游3DGS应用提供了信息。然后，我们将3DGS应用分为三个基础任务：分割、编辑和生成，以及建立在这些基础能力之上或与之紧密耦合的其他功能应用。对于每个任务，我们总结了代表性方法、监督策略和学习范式，突出了共享的设计原则和新兴趋势。还总结了常用数据集和评估协议，以及最近方法在公共基准上的比较分析。为了支持持续的研究和开发，我们在 https://github.com/heshuting555/Awesome-3DGS-Applications 上维护了一个持续更新的论文、代码和资源仓库。

英文摘要

In the context of novel view synthesis, 3D Gaussian Splatting (3DGS) has recently emerged as an efficient and competitive counterpart to Neural Radiance Field (NeRF), enabling high-fidelity photorealistic rendering in real time. Beyond novel view synthesis, the explicit and compact nature of 3DGS enables a wide range of downstream applications that require geometric and semantic understanding. This survey provides a comprehensive overview of recent progress in 3DGS applications. It first reviews the reconstruction preliminaries of 3DGS, followed by the problem formulation, 2D foundation models, and related NeRF-based research areas that inform downstream 3DGS applications. We then categorize 3DGS applications into three foundational tasks: segmentation, editing, and generation, alongside additional functional applications built upon or tightly coupled with these foundational capabilities. For each, we summarize representative methods, supervision strategies, and learning paradigms, highlighting shared design principles and emerging trends. Commonly used datasets and evaluation protocols are also summarized, along with comparative analyses of recent methods across public benchmarks. To support ongoing research and development, a continually updated repository of papers, code, and resources is maintained at https://github.com/heshuting555/Awesome-3DGS-Applications.

2510.09088 2026-06-16 cs.CV 版本更新

MambaH-Fit: Rethinking Hyper-surface Fitting-based Point Cloud Normal Estimation via State Space Modelling

MambaH-Fit: 基于状态空间建模的超曲面拟合点云法线估计再思考

Weijia Wang, Yuanzhi Su, Pei-Gen Ye, Yuan-Gen Wang

发表机构 * Guangzhou University（广州大学）； Hong Kong Polytechnic University（香港理工大学）； Beijing Institute of Technology（北京理工大学）

AI总结提出MambaH-Fit框架，通过注意力驱动层次特征融合和逐块状态空间模型，增强局部几何细节建模，提升点云法线估计的精度和鲁棒性。

Comments 11 pages, 12 figures

AI中文摘要

我们提出了MambaH-Fit，一个专为基于超曲面拟合的点云法线估计设计的状态空间建模框架。现有的法线估计方法在建模细粒度几何结构方面往往不足，从而限制了预测法线的准确性。最近，状态空间模型（SSMs），特别是Mamba，通过以线性复杂度捕捉长程依赖关系展示了强大的建模能力，并激发了对点云处理的适应性。然而，现有的基于Mamba的方法主要关注理解全局形状结构，而对局部细粒度几何细节的建模仍很大程度上未被探索。为了解决上述问题，我们首先引入了一种注意力驱动的层次特征融合（AHFF）方案，以自适应地融合多尺度点云块特征，显著增强了局部点云邻域中的几何上下文学习。在此基础上，我们进一步提出了逐块状态空间模型（PSSM），该模型通过状态动力学将点云块建模为隐式超曲面，从而实现对法线预测的有效细粒度几何理解。在基准数据集上的大量实验表明，我们的方法在准确性、鲁棒性和灵活性方面优于现有方法。消融研究进一步验证了所提出组件的贡献。

英文摘要

We present MambaH-Fit, a state space modelling framework tailored for hyper-surface fitting-based point cloud normal estimation. Existing normal estimation methods often fall short in modelling fine-grained geometric structures, thereby limiting the accuracy of the predicted normals. Recently, state space models (SSMs), particularly Mamba, have demonstrated strong modelling capability by capturing long-range dependencies with linear complexity and inspired adaptations to point cloud processing. However, existing Mamba-based approaches primarily focus on understanding global shape structures, leaving the modelling of local, fine-grained geometric details largely under-explored. To address the issues above, we first introduce an Attention-driven Hierarchical Feature Fusion (AHFF) scheme to adaptively fuse multi-scale point cloud patch features, significantly enhancing geometric context learning in local point cloud neighbourhoods. Building upon this, we further propose Patch-wise State Space Model (PSSM) that models point cloud patches as implicit hyper-surfaces via state dynamics, enabling effective fine-grained geometric understanding for normal prediction. Extensive experiments on benchmark datasets show that our method outperforms existing ones in terms of accuracy, robustness, and flexibility. Ablation studies further validate the contribution of the proposed components.

2512.10840 2026-06-16 cs.CV 版本更新

PoseGAM: Robust Unseen Object Pose Estimation via Geometry-Aware Multi-View Reasoning

PoseGAM: 通过几何感知多视图推理实现鲁棒的未见物体姿态估计

Jianqi Chen, Biao Zhang, Xiangjun Tang, Peter Wonka

发表机构 * KAUST（阿卜杜拉国王科技大学）

AI总结提出PoseGAM，一种基于多视图基础模型的几何感知框架，直接预测未见物体的6D姿态，无需显式匹配，通过点云几何和特征网络整合几何信息，在多个基准上平均AR提升5.1%。

Comments Accepted by CVPR 2026 (Oral). Project page: https://windvchen.github.io/PoseGAM/

AI中文摘要

6D物体姿态估计，即预测物体相对于相机的变换，对于未见物体仍然具有挑战性。现有方法通常依赖于在查询图像与物体模型或模板图像之间显式构建特征对应关系。在这项工作中，我们提出了PoseGAM，一种几何感知的多视图框架，直接从查询图像和多个模板图像预测物体姿态，消除了显式匹配的需要。该方法基于最近的多视图基础模型架构，通过两种互补机制整合物体几何信息：显式的基于点的几何和来自几何表示网络的学习特征。此外，我们构建了一个包含超过19万个物体的大规模合成数据集，涵盖多种环境条件，以增强鲁棒性和泛化能力。在多个基准上的广泛评估表明，我们的方法达到了最先进的性能，与先前方法相比平均AR提高了5.1%，在单个数据集上最高提升了17.6%，显示出对未见物体的强泛化能力。项目页面：https://windvchen.github.io/PoseGAM/ 。

英文摘要

6D object pose estimation, which predicts the transformation of an object relative to the camera, remains challenging for unseen objects. Existing approaches typically rely on explicitly constructing feature correspondences between the query image and either the object model or template images. In this work, we propose PoseGAM, a geometry-aware multi-view framework that directly predicts object pose from a query image and multiple template images, eliminating the need for explicit matching. Built upon recent multi-view-based foundation model architectures, the method integrates object geometry information through two complementary mechanisms: explicit point-based geometry and learned features from geometry representation networks. In addition, we construct a large-scale synthetic dataset containing more than 190k objects under diverse environmental conditions to enhance robustness and generalization. Extensive evaluations across multiple benchmarks demonstrate our state-of-the-art performance, yielding an average AR improvement of 5.1% over prior methods and achieving up to 17.6% gains on individual datasets, indicating strong generalization to unseen objects. Project page: https://windvchen.github.io/PoseGAM/ .

2601.13565 2026-06-16 cs.CV cs.RO eess.IV 版本更新

Learning Fine-Grained Correspondence with Cross-Perspective Perception for Open-Vocabulary 6D Object Pose Estimation

面向开放词汇6D物体姿态估计的跨视角感知细粒度对应学习

Yu Qin, Shimeng Fan, Fan Yang, Zixuan Xue, Zijie Mai, Wenrui Chen, Kailun Yang, Zhiyong Li

发表机构 * School of Artificial Intelligence and Robotics and the National Engineering Research Center of Robot Visual Perception and Control Technology, Hunan University（人工智能与机器人学院和机器人视觉感知与控制技术国家工程研究中心，湖南大学）； State Key Laboratory of Autonomous Intelligent Unmanned Systems, Tongji University（自主智能无人系统国家重点实验室，同济大学）； School of Computer Science and Engineering, Hunan University of Science and Technology（计算机科学与工程学院，湖南科技大学）

AI总结提出FiCoP框架，通过物体中心解耦、跨视角全局感知模块和补丁相关预测器，实现空间约束的细粒度对应，显著提升开放世界6D姿态估计的鲁棒性。

Comments Accepted to IEEE Robotics and Automation Letters (RA-L). The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP

AI中文摘要

开放词汇6D物体姿态估计使机器人能够仅凭自然语言指令操控任意未见过的物体。然而，现有方法的一个关键限制是它们依赖于无约束的全局匹配策略。在开放世界场景中，尝试将锚点特征与整个查询图像空间进行匹配会引入过多的歧义，因为目标特征容易与背景干扰物混淆。为解决这一问题，我们提出了细粒度对应姿态估计（FiCoP），这是一个从易受噪声影响的全局匹配过渡到空间约束的补丁级对应的框架。为了系统地消除背景干扰，FiCoP首先采用以物体为中心的解耦步骤，将目标从宏观环境噪声中隔离出来。基于这个局部区域，我们的核心方法创新有两个方面。首先，提出了跨视角全局感知（CPGP）模块，通过显式上下文推理和文本引导的语义注入融合双视图特征，建立结构一致性。其次，我们设计了一个补丁相关预测器（PCP），利用补丁到补丁的相关矩阵作为结构先验。这生成一个精确的块状关联图，作为空间滤波器，强制执行细粒度、抗噪声的匹配。在REAL275和Toyota-Light数据集上的实验表明，与最先进方法相比，FiCoP的平均召回率分别提高了8.0%和6.1%，突显了其在复杂、无约束的开放世界环境中为机器人代理提供鲁棒和泛化感知的能力。源代码将在 https://github.com/zjjqinyu/FiCoP 公开。

英文摘要

Open-vocabulary 6D object pose estimation empowers robots to manipulate arbitrary unseen objects guided solely by natural language. However, a critical limitation of existing approaches is their reliance on unconstrained global matching strategies. In open-world scenarios, trying to match anchor features against the entire query image space introduces excessive ambiguity, as target features are easily confused with background distractors. To resolve this, we propose Fine-grained Correspondence Pose Estimation (FiCoP), a framework that transitions from noise-prone global matching to spatially-constrained patch-level correspondence. To systematically eliminate background interference, FiCoP first employs an object-centric disentanglement step to isolate the target from macro-level environmental noise. Building upon this localized region, our core methodological innovations are twofold. Firstly, a Cross-Perspective Global Perception (CPGP) module is proposed to fuse dual-view features, establishing structural consensus through explicit context reasoning and text-guided semantic injection. Secondly, we design a Patch Correlation Predictor (PCP) that leverages a patch-to-patch correlation matrix as a structural prior. This generates a precise block-wise association map, acting as a spatial filter to enforce fine-grained, noise-resilient matching. Experiments on the REAL275 and Toyota-Light datasets demonstrate that FiCoP improves Average Recall by 8.0% and 6.1%, respectively, compared to the state-of-the-art method, highlighting its capability to deliver robust and generalized perception for robotic agents operating in complex, unconstrained open-world environments. The source code will be made publicly available at https://github.com/zjjqinyu/FiCoP.

2605.15796 2026-06-16 cs.CV 版本更新

Cross-Modal Registration Between 3D and 2D Fingerprints via Pose-Aware Unwrapping and Point-Cloud Fusion

通过姿态感知展开和点云融合实现3D与2D指纹的跨模态配准

Xiongjun Guan, Jianjiang Feng, Jie Zhou

发表机构 * Department of Automation, Tsinghua University（自动化系，清华大学）

AI总结本文提出统一框架，实现3D指纹预处理以及与接触式和非接触式2D指纹的配准，结合非参数可视化展开、点云融合、姿态归一化和姿态感知配准策略，提升3D与2D指纹兼容性。

AI中文摘要

三维（3D）指纹保留全局手指几何和局部脊线结构，避免接触引起的变形，但难以与传统二维（2D）指纹系统集成。本文针对3D采集与跨模态匹配之间的中间阶段，提出统一框架，用于3D指纹预处理以及与接触式和非接触式2D模态的配准。框架结合四个组件：1）非参数可视化与展开方法，将3D指纹点云转换为等效于滚动捺印的2D表示，无需依赖全局手指形状模型；2）点云融合管道，将多次局部3D采集配准并拼接为更完整的指纹模型；3）基于椭圆的姿态归一化方法，用于规范的手指对齐；4）姿态感知的跨模态配准策略，提高3D指纹与非接触式和接触式2D指纹的兼容性。在自建的多模态指纹数据库（含150根手指）上的实验表明，所提框架实现了脊线级3D配准精度、鲁棒的姿态估计和一致的2D兼容性提升。特别是3D融合误差集中在0.09 mm左右，非接触式2D-3D配准达到脊线尺度投影精度，姿态感知展开相对于通用3D展开提高了真实匹配分数。这些结果支持3D指纹作为跨异构指纹模态的有效几何桥梁。基线实现已在 https://github.com/XiongjunGuan/3DFpVisual 公开发布。

英文摘要

Three-dimensional (3D) fingerprints preserve global finger geometry and local ridge structure while avoiding contact-induced deformation, but they remain difficult to integrate with legacy two-dimensional (2D) fingerprint systems. This paper addresses the intermediate stage between 3D acquisition and cross-modal matching, and presents a unified framework for 3D fingerprint preprocessing and registration across contactless and contact-based 2D modalities. The framework combines four components: 1) a nonparametric visualization and unwrapping method that converts a 3D fingerprint point cloud into a rolled-equivalent 2D representation without relying on a global finger-shape model; 2) a point-cloud fusion pipeline that registers and mosaics multiple partial 3D captures into a more complete fingerprint model; 3) an ellipse-based pose normalization method for canonical finger alignment; and 4) a pose-aware cross-modal registration strategy that improves compatibility between 3D fingerprints and both contactless and contact-based 2D fingerprints. Experiments on a self-collected multimodal fingerprint database containing 150 fingers show that the proposed framework achieves ridge-level 3D registration accuracy, robust pose estimation, and consistent gains in 2D compatibility. In particular, the 3D fusion error is concentrated around 0.09 mm, contactless 2D--3D registration reaches ridge-scale projection accuracy, and pose-aware unwrapping improves genuine matching scores relative to generic 3D unwrapping. These results support the use of 3D fingerprints as an effective geometric bridge across heterogeneous fingerprint modalities. The baseline implementation has been publicly released at https://github.com/XiongjunGuan/3DFpVisual.

2606.10550 2026-06-16 cs.CV cs.GR 版本更新

LentiAvatar: Pseudo-Multiview Reconstruction and Subpixel Prism Rendering for Real-Time Stereoscopic Communication

LentiAvatar：用于实时立体通信的伪多视图重建与亚像素棱镜渲染

Chufeng Fang, Dongdong Teng, Lilin Liu

发表机构 * Sun Yat-sen University（中山大学）

AI总结提出LentiAvatar系统，通过单目视频重建可控头部化身，并利用亚像素编码的柱透镜光栅显示实现实时裸眼立体通信，采用伪多视图监督和轮廓感知损失提升侧视质量。

Comments 10 pages, 5 figures, 3 tables

AI中文摘要

实时立体视频通信一直是沉浸式远程呈现的目标，但实际系统仍需要专门的捕获设备，或将远程用户限制为单个肖像视图。我们提出LentiAvatar，一种高斯头部化身系统，将单目化身捕获与亚像素编码的裸眼柱透镜光栅显示连接起来，用于实时自动立体通信。从单目肖像视频中，LentiAvatar重建可控头部化身，并针对显示引起的横向观看区域进行优化。该方法利用自然头部转动作为伪多视图（PMV）监督，以约束在单目训练中弱观察的区域，包括头发、耳朵、下颌轮廓和颈部边界。可靠的侧帧按偏航角分箱，对齐到虚拟相机，并在严格的头部和头发域内进行监督；轮廓感知损失和分阶段正则化进一步抑制鬼影、alpha泄漏和深度不稳定性，同时保留横向细节。在运行时，LentiAvatar渲染32个虚拟视图，并将其编码为具有校准亚像素路由掩码的4K柱透镜光栅图像。实时跟踪原型保持10.65 FPS，而特定主体的蒸馏驱动将相同的显示管线提升至38.49 FPS。

英文摘要

Real-time stereoscopic video communication has long been a goal of immersive telepresence, yet practical systems still require specialized capture rigs or reduce remote users to a single portrait view. We present LentiAvatar, a Gaussian head-avatar system that connects monocular avatar capture with subpixel-encoded glasses-free lenticular display for real-time autostereoscopic communication. From a monocular portrait video, LentiAvatar reconstructs a controllable head avatar and optimizes it for the lateral viewing zones induced by the display. The method uses natural head turns as pseudo-multiview (PMV) supervision to constrain regions that are otherwise weakly observed in monocular training, including hair, ears, jaw contours, and neck boundaries. Reliable side frames are yaw-binned, aligned to virtual cameras, and supervised within a strict head-and-hair domain; contour-aware losses and staged regularization further suppress ghosting, alpha leakage, and depth instability while preserving lateral detail. At runtime, LentiAvatar renders 32 virtual views and encodes them into a 4K lenticular raster with calibrated subpixel-routing masks. The live-tracker prototype sustains 10.65 FPS, and a subject-specific distilled driver raises the same display pipeline to 38.49 FPS.

2510.18189 2026-06-16 cs.GR cs.CV 版本更新

A Generalizable Light Transport 3D Embedding for Global Illumination

一种可泛化的全局光照光传输3D嵌入

Bing Xu, Mukund Varma T, Cheng Wang, Tzu-Mao Li, Lifan Wu, Bartlomiej Wronski, Ravi Ramamoorthi, Marco Salvi

发表机构 * UC San Diego and NVIDIA USA（加州大学圣迭戈分校和美国NVIDIA公司）； UC San Diego USA（加州大学圣迭戈分校（美国））； NVIDIA USA（美国NVIDIA公司）； UC San Diego USA and NVIDIA USA（加州大学圣迭戈分校和美国NVIDIA公司）

AI总结提出一种可泛化的3D光传输嵌入方法，通过点云和Transformer直接预测全局光照，无需光栅化或路径追踪线索，适用于多种室内场景。

Comments SIGGRAPH 2026

DOI: 10.1145/3799902.3811095

AI中文摘要

全局光照（GI）对于真实感渲染至关重要，但由于模拟间接光传输的复杂性，计算成本仍然很高。最近的神经方法主要依赖于逐场景优化，有时扩展到处理相机或几何体的变化。跨场景泛化的努力大多停留在2D屏幕空间，例如神经去噪或基于G-buffer的GI预测，这些方法常常遭受视角不一致和空间理解有限的问题。我们提出了一种可泛化的3D光传输嵌入，直接从3D场景配置近似全局光照，而不使用光栅化或路径追踪线索。每个场景被表示为具有几何和材质特征的点云。一个可扩展的Transformer建模全局点对点交互，将这些特征编码为神经基元。在渲染时，每个查询点通过最近邻搜索检索附近的基元，并通过交叉注意力聚合它们的潜在特征，以预测所需的渲染量。我们展示了在具有不同布局、几何体和材质的多样化室内场景中，漫反射全局光照预测的结果。为辐照度估计训练的嵌入可以通过有限的微调快速适应新的渲染任务。我们还展示了用于光泽材质空间方向辐射场估计的初步结果，并展示了归一化场如何加速无偏路径引导。该方法突显了一条将学习先验集成到渲染管线中的路径，而无需显式的光线追踪光照线索。

英文摘要

Global illumination (GI) is essential for realistic rendering but remains computationally expensive due to the complexity of simulating indirect light transport. Recent neural methods have mainly relied on per-scene optimization, sometimes extended to handle changes in camera or geometry. Efforts toward cross-scene generalization have largely stayed in 2D screen space, such as neural denoising or G-buffer based GI prediction, which often suffer from view inconsistency and limited spatial understanding. We propose a generalizable 3D light transport embedding that approximates global illumination directly from 3D scene configurations, without using rasterized or path-traced cues. Each scene is represented as a point cloud with geometric and material features. A scalable transformer models global point-to-point interactions to encode these features into neural primitives. At render time, each query point retrieves nearby primitives via nearest-neighbor search and aggregates their latent features through cross-attention to predict the desired rendering quantity. We demonstrate results on diffuse global illumination prediction across diverse indoor scenes with varying layouts, geometry, and materials. The embedding trained for irradiance estimation can be quickly adapted to new rendering tasks with limited fine-tuning. We also present preliminary results for spatial-directional radiance field estimation for glossy materials and show how the normalized field can accelerate unbiased path guiding. This approach highlights a path toward integrating learned priors into rendering pipelines without explicit ray-traced illumination cues.

2606.14727 2026-06-16 cs.CV 新提交

FairGen: Preference-Aligned Diffusion for Demographically Equitable Medical Image Synthesis

FairGen: 用于人口统计公平医学图像生成的偏好对齐扩散模型

Zhimin Li, Ruichen Zhang, Zhen Tan, Howard J Aizenstein, Jingtong Hu, Tianlong Chen

发表机构 * University of Pittsburgh, Swanson School of Engineering（匹兹堡大学斯旺森工程学院）； The University of North Carolina at Chapel Hill, Department of Computer Science（北卡罗来纳大学教堂山分校计算机科学系）； Arizona State University, School of Computing and Augmented Intelligence（亚利桑那州立大学计算与增强智能学院）； University of Pittsburgh, Department of Psychiatry（匹兹堡大学精神病学系）

AI总结提出FairGen框架，通过将医生偏好嵌入扩散模型生成过程，合成人口统计平衡的医学图像，在皮肤、胸片和脑MRI任务上分别实现95.9%、80.0%和35.2%的公平性提升，同时保持诊断准确性。

Comments Accepted for publication in npj Digital Medicine. 20 pages, 6 figures

详情

AI中文摘要

医学影像学是现代诊断的核心，人工智能系统越来越多地用于支持基于图像的分析，以提高效率、准确性和医疗可及性。然而，医疗保健获取的不平等和疾病患病率的差异导致临床图像数据中存在严重的人口统计不平衡。由于疾病在不同人口群体中可能表现出不同的特征，使得某些表型表现自然罕见，这种不平衡进一步加剧。在这种不平衡数据上训练的AI模型有可能延续诊断偏见并扩大医疗差距。本文介绍了FairGen，一个公平感知的扩散框架，它在合成人口统计平衡的医学图像的同时保留与病理相关的视觉特征。通过将医生对齐的偏好嵌入生成过程，FairGen在合成和下游分类过程中改善了子组覆盖。应用于皮肤病学、放射学和神经影像学基准任务，FairGen在皮肤图像上实现了95.9%的公平性提升，在胸部X光片上实现了80.0%，在脑MRI上实现了35.2%，同时相对于在原始临床数据上训练的模型保持了有竞争力的诊断准确性。面向临床医生的专家评审和在独立队列上的外部验证进一步支持这些增益超越了标准保真度指标，并且不局限于原始分布内数据集。

英文摘要

Medical imaging is central to modern diagnostics, and artificial intelligence (AI) systems are increasingly used to support image-based analysis by improving efficiency, accuracy, and access to care. However, inequities in healthcare access and differential disease prevalence create severe demographic imbalances in clinical image data. Such imbalances are compounded by the fact that diseases can manifest with distinct features across demographic groups, rendering certain phenotypic presentations naturally rare. AI models trained on such imbalanced data risk perpetuating diagnostic bias and widening healthcare disparities. Here we introduce FairGen, a fairness-aware diffusion framework that synthesizes demographically balanced medical images while preserving pathology-relevant visual features. By embedding physician-aligned preferences into the generation process, FairGen improves subgroup coverage during synthesis and downstream classification. Applied to dermatology, radiology, and neuroimaging benchmark tasks, FairGen achieves fairness improvements of 95.9% for skin images, 80.0% for chest radiography, and 35.2% for brain MRI, while maintaining competitive diagnostic accuracy relative to models trained on original clinical data. Clinician-facing expert review and external validation on independent cohorts further support that these gains extend beyond standard fidelity metrics and are not confined to the original in-distribution datasets.

2606.14731 2026-06-16 cs.CV 新提交

BBR-Net: Boundary-Balanced Replay for Continual Medical Image Segmentation

BBR-Net：用于连续医学图像分割的边界平衡重放

Zahid Ullah, Sieun Choi, Jihie Kim

发表机构 * Department of Computer Science and Artificial Intelligence, Dongguk University（东国大学计算机科学与人工智能系）

AI总结提出边界平衡重放网络（BBR-Net），通过边界感知优先级和类别平衡选择重放样本，在连续心脏超声分割中减少灾难性遗忘并保持目标域适应能力。

AI中文摘要

在域漂移下，基于重放的方法通常保留外观信息而没有显式建模解剖结构，因此连续学习在医学图像分割中仍然具有挑战性。本研究探究结构一致性是否控制连续心脏超声分割中的知识保留。我们提出边界平衡重放网络（BBR-Net），它使用边界感知优先级和类别平衡来选择重放样本，以保留解剖信息丰富的区域。该方法在CAMUS和CardiacNet上进行了前向（CAMUS到CardiacNet）和反向（CardiacNet到CAMUS）任务顺序的评估。在前向设置中，BBR-Net将源任务性能保持在接近离线联合训练参考的水平，同时显著减少灾难性遗忘并保持竞争性的目标任务适应。消融结果表明，边界感知优先级有助于保留，并且当与类别感知采样结合时，改善了源任务保留与目标任务适应之间的平衡。相反，反向设置揭示，当初始表示从噪声大且结构不一致的数据中学习时，结构感知重放会失败。为了隔离这种效应，我们进行了受控的结构扰动分析，逐步破坏源任务边界，同时保持数据集、架构和训练协议固定。随着结构可靠性降低，遗忘持续增加，表明重放有效性受存储结构信息质量的强烈影响，而不仅仅是记忆容量。这些发现表明，在域漂移下保留解剖结构是连续医学图像分割的核心因素，重放机制应考虑结构可靠性以支持稳健的知识保留。

English abstract

Continual learning for medical image segmentation remains challenging under domain shift because replay-based methods often preserve appearance information without explicitly modeling anatomical structure. This study investigates whether structural consistency governs knowledge retention in continual cardiac ultrasound segmentation. We propose the Boundary-Balanced Replay Network (BBR-Net), which selects replay samples using boundary-aware priority and class balance to preserve anatomically informative regions. The method is evaluated on CAMUS and CardiacNet under forward (CAMUS to CardiacNet) and reverse (CardiacNet to CAMUS) task orders. In the forward setting, BBR-Net retains source-task performance close to an offline joint-training reference, while markedly reducing catastrophic forgetting and preserving competitive target-task adaptation. Ablation results show that boundary-aware prioritization contributes to retention and improves the balance between source-task preservation and target-task adaptation when combined with class-aware sampling. In contrast, the reverse setting reveals that structure-aware replay fails when initial representations are learned from noisy and structurally inconsistent data. To isolate this effect, we conduct a controlled structural perturbation analysis by progressively corrupting source-task boundaries while keeping the dataset, architecture, and training protocol fixed. Forgetting increases consistently as structural reliability decreases, suggesting that replay effectiveness is strongly influenced by the quality of stored structural information, rather than by memory capacity alone. These findings indicate that preserving anatomical structure under domain shift is a central factor in continual medical image segmentation, and that replay mechanisms should account for structural reliability to support robust knowledge retention.
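The replay selection described in this abstract can be illustrated with a minimal sketch. It assumes each candidate carries a 2D label mask, scores it by the fraction of pixels that sit on a label boundary, and fills the replay buffer round-robin over classes so that no class dominates. The scoring rule, the names `boundary_score` and `select_replay`, and the round-robin balancing are illustrative assumptions, not the authors' implementation.

```python
from collections import defaultdict

def boundary_score(mask):
    """Fraction of pixels whose right or lower neighbour has a different label.

    Assumption: this simple edge count stands in for a boundary-aware priority.
    """
    h, w = len(mask), len(mask[0])
    edges = 0
    for i in range(h):
        for j in range(w):
            right = j + 1 < w and mask[i][j] != mask[i][j + 1]
            down = i + 1 < h and mask[i][j] != mask[i + 1][j]
            if right or down:
                edges += 1
    return edges / (h * w)

def select_replay(samples, budget):
    """Pick up to `budget` sample ids from (sample_id, dominant_class, mask) tuples.

    Within each class, samples are ranked by boundary score (highest first);
    classes are then visited round-robin to keep the buffer class-balanced.
    """
    by_class = defaultdict(list)
    for sid, cls, mask in samples:
        by_class[cls].append((boundary_score(mask), sid))
    for cls in by_class:
        by_class[cls].sort(reverse=True)
    chosen = []
    classes = sorted(by_class)
    while len(chosen) < budget and any(by_class[c] for c in classes):
        for c in classes:
            if by_class[c] and len(chosen) < budget:
                chosen.append(by_class[c].pop(0)[1])
    return chosen
```

With a budget smaller than the pool, the sketch keeps the most boundary-rich sample of each class before taking a second sample from any class.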


2606.14749 2026-06-16 cs.CV cs.AI New submission

Automated 3D Kinematic Monitoring for Circadian Activity and Anomaly Detection in Juvenile Fish

幼鱼昼夜活动与异常检测的自动化三维运动监测

Chih-Wei Huang, Chang-Wen Huang, Chung-Ping Chiang, Tsung-Wei Pan

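Affiliations: AI Research Center, National Taiwan Ocean Univ.（台湾海洋大学人工智能研究中心）； Dept. of Aquaculture, National Taiwan Ocean Univ.（台湾海洋大学水产养殖系）； Center of Excellence for the Oceans, National Taiwan Ocean University（台湾海洋大学海洋卓越研究中心）

AI summary: Proposes a high-throughput 3D behavioral phenotyping framework that combines deep-learning object detection with binocular stereo vision, enabling real-time monitoring, body-length estimation, and 3D trajectory reconstruction of juvenile fish in high-density environments. It is the first to quantify the true physical speed of free-swimming juvenile fish, and it establishes a circadian locomotion baseline for early warning of physiological stress.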

2606.14759 2026-06-16 cs.CV cs.AI New submission

Temporally Consistent and Controllable Video Generation of 2D Cine CMR via Latent Space Motion Modeling

基于潜在空间运动建模的二维电影心脏磁共振时序一致且可控视频生成

Yiheng Cao, Gustavo Andrade-Miranda, Jiatian Zhang, Guillaume Sallé, Xin Gao

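Affiliations: Suzhou Institute of Biomedical Engineering and Technology, Chinese Academy of Sciences（苏州生物医学工程与技术研究所，中国科学院）； SyCoIA, IMT Mines Ales（SyCoIA，IMT Mines Ales）

AI summary: Proposes a text-to-video generation method that decouples cardiac spatial structure from temporal motion: a fine-tuned diffusion model synthesizes the initial frame, and a latent flow model conditioned on a cardiac phase embedding generates the full motion, achieving high temporal consistency and anatomical controllability.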

Journal ref: ISBI 2026 - IEEE International Symposium on Biomedical Imaging, Apr 2026, London, United Kingdom. pp.1-4

AI Chinese abstract

电影心脏磁共振是评估心脏功能的金标准，但公共数据集的稀缺限制了先进数据驱动模型的发展。为解决这一限制，我们提出一种生成方法，用于合成时间上连贯且解剖上一致的心脏序列。我们的文本到视频框架将心脏空间结构与时间运动解耦。首先，一个微调的扩散模型根据临床文本提示合成初始帧，控制解剖特征。然后，一个以心脏相位嵌入为条件的潜在流模型生成完整的心脏运动，确保空间一致性和时间控制。我们的模型生成解剖和病理多样化的序列，具有高时间连贯性和对输入提示的强保真度，图像真实感的FID为31.68，文本-图像对齐的CLIP得分为31.04。这些实验结果突显了其产生高保真、按需医疗数据的潜力，为数据稀缺提供了可扩展的解决方案。

English abstract

Cine cardiac magnetic resonance is the gold standard for assessing cardiac function, but the scarcity of public datasets limits the development of advanced data-driven models. To address this limitation, we propose a generative method for synthesizing temporally coherent and anatomically consistent cardiac sequences. Our text-to-video framework decouples cardiac spatial structure from temporal motion. First, a fine-tuned diffusion model synthesizes an initial frame from a clinical text prompt, controlling anatomical features. Then, a latent flow model conditioned on a cardiac phase embedding generates the complete cardiac motion, ensuring spatial consistency and temporal control. Our model generates anatomically and pathologically diverse sequences with high temporal coherence and strong fidelity to input prompts, achieving a FID of 31.68 for image realism and a CLIP score of 31.04 for text-image alignment. These experimental results highlight its potential to produce high-fidelity, on-demand medical data, offering a scalable solution to data scarcity.


2606.14766 2026-06-16 cs.CV cs.AI cs.MA New submission

XMedFusion: A Knowledge-Guided Multimodal Perception and Reasoning Framework for Autonomous Medical Systems

XMedFusion：面向自主医疗系统的知识引导多模态感知与推理框架

Hamza Riaz, Arham Haroon, Maha Baig, Muhammad Dawood Rizwan, Muhammad Naseer Bajwa, Muhammad Moazam Fraz

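Affiliations: National University of Sciences and Technology (NUST)（巴基斯坦国立科技大学）； University of Oxford（牛津大学）

AI summary: Proposes XMedFusion, a modular AI framework in which agents for visual perception, knowledge graph construction, and retrieval-guided generation cooperate to strengthen visual grounding and the capture of clinical findings in radiology report generation, significantly outperforming baseline models on a public dataset.

Comments: Accepted at the 2026 International Conference on Robotics and Automation in Industry (ICRAI)

AI Chinese abstract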

自主医疗和机器人系统日益依赖智能感知与推理能力来解释视觉数据并支持临床决策。放射学报告生成是此类自动化诊断工作流的关键组成部分，然而现有的端到端多模态模型常因视觉基础薄弱而导致不可靠的解释和细微临床发现的遗漏。本文提出XMedFusion，一个模块化AI框架，设计为自主医疗系统的智能感知与推理模块。该框架将视觉信息分解为协调的功能组件，模拟专家驱动的分析，包括提取图像基础证据的视觉感知智能体、构建临床相关发现结构的知识图谱构建智能体，以及确保报告结构一致的检索引导起草过程。合成智能体通过推理驱动的验证迭代整合视觉和结构化证据，生成可靠且可解释的诊断输出。在公共胸部X光片数据集上的实验评估表明，与基线视觉-语言模型相比，在BLEU-1上提升0.0493至0.3359，ROUGE-L上提升0.0863至0.2440，METEOR上提升0.0829至0.1708，同时在语义评估指标如一致性（2.38至7.80）和准确性（2.34至6.93）上也有显著提升。结果突出了结构化多智能体感知与推理在增强智能医学成像系统的鲁棒性、透明度和自动化方面的有效性，使其能够集成到自主医疗和机器人诊断工作流中。

English abstract

Autonomous medical and robotic systems increasingly rely on intelligent perception and reasoning capabilities to interpret visual data and support clinical decision making. Radiology report generation represents a critical component of such automated diagnostic workflows, yet existing end-to-end multimodal models often suffer from weak visual grounding, resulting in unreliable interpretations and omission of subtle clinical findings. This paper presents XMedFusion, a modular AI framework designed as an intelligent perception and reasoning module for autonomous medical systems. The proposed framework decomposes visual information into coordinated functional components that emulate expert-driven analysis, including a visual perception agent that extracts image-grounded evidence, a knowledge graph construction agent that structures clinically relevant findings, and a retrieval-guided drafting process that ensures a consistent reporting structure. A synthesis agent iteratively integrates visual and structured evidence through reasoning-driven verification to produce reliable and interpretable diagnostic outputs. Experimental evaluation on a public chest radiograph dataset demonstrates significant improvements over baseline vision-language models, achieving gains from 0.0493 to 0.3359 in BLEU-1, 0.0863 to 0.2440 in ROUGE-L, and 0.0829 to 0.1708 in METEOR, along with substantial improvements in semantic evaluation metrics such as Consistency (2.38 to 7.80) and Accuracy (2.34 to 6.93). The results highlight the effectiveness of structured multi-agent perception and reasoning for enhancing robustness, transparency, and automation in intelligent medical imaging systems, enabling integration into autonomous healthcare and robotic diagnostic workflows.


2606.14803 2026-06-16 cs.CV New submission

Physics-Driven Zero-Shot MRI Reconstruction Based on Non-Local Image Priors

Lingtong Zhang, Wenlei Li, Mu He, Li Xiao, Yang Ji

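Affiliations: School of Information Science and Technology, University of Science and Technology of China（中国科学技术大学信息科学技术学院）

AI summary: Proposes a physics-driven zero-shot self-supervised learning framework that uses a coil-sensitivity-map-guided dynamic repository, SPIRiT regularization, and a non-local self-similarity pixel bank to address insufficient supervision and overfitting in undersampled MRI reconstruction, reaching state-of-the-art performance on the FastMRI dataset.

AI Chinese abstract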

零样本自监督学习（ZS-SSL）已成为加速磁共振成像（MRI）重建的一种有前景的范式，消除了对全采样外部数据集的依赖。然而，仅从单个欠采样扫描中学习存在监督稀缺和优化不稳定的问题，常常导致过拟合或伪影。为了解决这些挑战，我们提出了一种鲁棒的物理驱动ZS-SSL框架，将物理一致性与图像域非局部先验协同结合。我们的方法引入了三项核心创新：（1）线圈灵敏度图（CSM）引导的动态存储库，通过基于线圈灵敏度约束过滤物理不一致的伪影来稳定训练轨迹；（2）基于SPIRiT的正则化，通过学习的相关核和随机掩蔽强制执行k空间自一致性；（3）非局部自相似性（NSS）像素库，利用前两个模块建立的高保真参考显式挖掘非局部解剖相似性，从而增强图像域的监督。在FastMRI数据集上的大量实验表明，我们的方法实现了最先进的性能，特别是在高加速因子下，有效弥合了零样本学习与监督方法之间的差距。代码可在https://github.com/Zolento/NS-SSL获取。

English abstract

Zero-Shot Self-Supervised Learning (ZS-SSL) has emerged as a promising paradigm for accelerated Magnetic Resonance Imaging (MRI) reconstruction, eliminating the reliance on fully-sampled external datasets. However, learning solely from a single under-sampled scan suffers from supervision scarcity and optimization instability, often leading to overfitting or artifacts. To address these challenges, we propose a robust physics-driven ZS-SSL framework that synergizes physical consistency with image-domain non-local priors. Our method introduces three core innovations: (1) a Coil Sensitivity Map (CSM)-Guided Dynamic Repository, which stabilizes the training trajectory by filtering physically inconsistent artifacts based on coil sensitivity constraints; (2) a SPIRiT-based regularization, which enforces k-space self-consistency via a learned correlation kernel and stochastic masking; (3) a Non-Local Self-Similarity (NSS) Pixel Bank, which leverages the high-fidelity reference established by the former modules to explicitly mine non-local anatomical similarities, thereby augmenting supervision in the image domain. Extensive experiments on the FastMRI dataset demonstrate that our approach achieves state-of-the-art performance, particularly under high acceleration factors, effectively bridging the gap between zero-shot learning and supervised methods. The code is available at https://github.com/Zolento/NS-SSL.


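2606.15129 2026-06-16 cs.CV cs.AI New submission

Trustworthy Multi-View Deep Learning Classification of Fetal Congenital Heart Disease Based on Feature-Level and Decision-Level Fusion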

Tan Zhou, Shifa Yao, Suncheng Xiang, Dahong Qian, Baoying Ye

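AI summary: Proposes a multi-view deep learning framework that fuses multi-view echocardiographic images through feature extraction, attention mechanisms, and uncertainty-based decision-making for high-accuracy binary classification of fetal congenital heart disease.

AI Chinese abstract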

先天性心脏病（CHD）是指胚胎发育期间心脏和大血管发育异常导致的解剖结构异常。传统诊断方法往往难以达到高准确率和效率，尤其是在心脏解剖结构复杂的情况下。本研究提出了一种专门的多视图深度学习框架，利用超声心动图图像进行CHD二分类。使用包含五个视图的大规模CHD数据集训练模型，使其能够整合多角度图像数据。该框架利用先进的特征提取和注意力机制提高诊断精度和可靠性。还集成了基于不确定性的决策组件以处理低质量图像，从而增强诊断效果。实验结果表明，该方法在我们的数据集上达到了顶级性能，并为早期CHD检测提供了稳健的工具，凸显了其临床应用的潜力。数据集和源代码将在论文被接收后发布。

English abstract

Congenital heart disease (CHD) refers to the abnormal anatomical structure caused by the abnormal development of the heart and great vessels during embryonic development. Traditional diagnostics often fail to achieve high accuracy and efficiency, especially given the complexity of cardiac anatomy. This study presents a specialized multi-view deep learning framework for CHD binary classification using echocardiographic images. A large-scale CHD dataset, including five views, was used to train the model, enabling it to integrate multi-angle image data. The framework utilizes advanced feature extraction and attention mechanisms to improve diagnostic precision and reliability. An uncertainty-based decision-making component is also integrated to handle low-quality images, enhancing diagnostic outcomes. Experimental results show that this method achieves top-tier performance on our dataset and provides a robust tool for early CHD detection, underscoring its potential for clinical use. The dataset and source code will be released upon paper acceptance.


2606.15304 2026-06-16 cs.CV New submission

MNet++: An Extended 2D/3D Network for Anisotropic Medical Image Segmentation

Kirsten Odendaal, Rade Bajic

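Affiliations: School of Computing, Georgia Institute of Technology（佐治亚理工学院计算学院）

AI summary: Reproduces and extends the hybrid 2D/3D convolutional network MNet, introducing adaptive fusion gating and a VMamba state-space module that improve segmentation performance while preserving robustness to anisotropy.

AI Chinese abstract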

本工作展示了MNet的完整复现与扩展，MNet是一种专为各向异性医学图像分割设计的混合2D/3D卷积网络。在nnU-Net框架内重新实现了原始架构，以验证其报告的性能和对可变体素间距（即各向异性）的鲁棒性。在匹配的预处理和计算约束下，在PROMISE前列腺MRI和LiTS肝脏CT的受控子集上进行了实验。复现的MNet在PROMISE上达到了89.0 +/- 0.9%的Dice相似系数（DSC），与已发表结果相差0.8%，在LiTS上肝脏和肿瘤分割分别达到94.3 +/- 1.9%和54.6 +/- 3.1%。进一步引入了两种轻量级扩展：(1) 一种学习的融合门控机制，实现自适应2D-3D特征融合；(2) 一个VMamba状态空间模块，用于高效的长程深度建模。空间门控变体以不到3%的推理开销将DSC提高了+0.8%，而VMamba提高了性能一致性，将PROMISE Dice变异降低至+/- 0.7%，并在LiTS肝脏上达到最强性能，Dice为95.8%。两种扩展均保持了MNet对各向异性的鲁棒性，在1-4 mm体素间距下Dice变化为1.5%。总体而言，该研究证实了MNet的可复现性，并表明自适应融合和状态空间建模有潜力进一步增强各向异性条件下的分割可靠性。然而，需要进一步测试才能得出明确结论。

English abstract

This work demonstrates a full reproduction and extension of MNet, a hybrid 2D/3D convolutional network designed for anisotropic medical image segmentation. The original architecture was re-implemented within the nnU-Net framework to verify its reported performance and robustness to variable voxel spacing, known as anisotropy. Experiments were conducted on PROMISE prostate MRI and a controlled subset of LiTS liver CT under matched preprocessing and compute constraints. The reproduced MNet achieved a Dice similarity coefficient (DSC) of 89.0 +/- 0.9% on PROMISE, within 0.8% of the published result, and 94.3 +/- 1.9% / 54.6 +/- 3.1% for liver and tumor segmentation on LiTS, respectively. Two lightweight extensions were further introduced: (1) a learned Fusion Gating mechanism enabling adaptive 2D-3D feature blending, and (2) a VMamba state-space module for efficient long-range depth modelling. The Spatial Gating variant improved DSC by +0.8% with less than 3% inference overhead, while VMamba improved performance consistency, reducing PROMISE Dice variation to +/- 0.7% and achieving the strongest LiTS liver performance at 95.8% Dice. Both extensions preserved MNet robustness to anisotropy, with delta Dice = 1.5% across 1-4 mm voxel spacing. Overall, the study confirms MNet reproducibility and demonstrates that adaptive fusion and state-space modelling have the potential to further strengthen segmentation reliability under anisotropic conditions. However, further tests are required to provide definitive conclusions.
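Two quantities in this abstract lend themselves to a short sketch: the adaptive 2D-3D feature blending and the Dice similarity coefficient used for evaluation. The code below assumes an element-wise sigmoid gate that mixes a 2D and a 3D feature vector, and computes Dice on flat binary masks. The gate form and the function names are assumptions for illustration; the paper's Fusion Gating module may be defined differently.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(feat_2d, feat_3d, gate_logits):
    """Blend two feature vectors element-wise: g * feat_2d + (1 - g) * feat_3d.

    Assumption: g = sigmoid(logit), with the logits learned elsewhere.
    """
    fused = []
    for a, b, z in zip(feat_2d, feat_3d, gate_logits):
        g = sigmoid(z)
        fused.append(g * a + (1.0 - g) * b)
    return fused

def dice(pred, target):
    """Dice similarity coefficient 2|A∩B| / (|A| + |B|) for flat 0/1 masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Two empty masks are treated as a perfect match by convention.
    return 1.0 if total == 0 else 2.0 * inter / total
```

A gate logit of zero weights both branches equally, while large positive or negative logits let the model fall back to the 2D or 3D branch respectively.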


2606.15457 2026-06-16 cs.CV cs.LG New submission

Lesion-DDPM: Lesion-Enhanced 3D Diffusion for MS MRI Synthesis

Lesion-DDPM：用于MS MRI合成的病灶增强3D扩散模型

Weidong Zhang, Yongchan Jung, Shafayat Mowla Anik, Furen Xiao, Vasudevan Janarthanan, Enkhzaya Chuluunbaatar, Byeong Kil Lee, Jeeho Ryoo

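Affiliations: University of Texas at Arlington（德克萨斯大学阿灵顿分校）； University of Texas at San Antonio（德克萨斯大学圣安东尼奥分校）； University of Texas at Dallas（德克萨斯大学达拉斯分校）； National Taiwan University Hospital（国立台湾大学医院）； National University of Mongolia（蒙古国立大学）； University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI summary: Proposes Lesion-DDPM, a 3D conditional diffusion framework that achieves lesion-aware FLAIR synthesis through multi-level anatomical mask injection and a lesion-weighted reconstruction loss, significantly improving Dice scores on the downstream MS lesion segmentation task.

AI Chinese abstract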

3D FLAIR MRI被广泛推荐为多发性硬化（MS）脑部成像的标准MRI序列之一，但公开可用的MS数据集仍然相对较小，且在不同扫描仪、采集协议和病灶模式上存在差异。这种稀缺性和异质性阻碍了稳健的神经影像机器学习模型的发展，尤其对于旨在合成图像同时保留小而稀疏病灶的生成模型而言，这是一个挑战。我们提出了Lesion-DDPM，一种用于病灶感知FLAIR合成的3D条件扩散框架，该框架结合了多级解剖掩膜注入以及病灶加权重建损失，以在保持整体大脑结构的同时强调病灶体素。使用MSLesSeg数据集的精选子集，我们将Lesion-DDPM与代表性的最先进GAN和扩散模型进行比较，评估图像生成指标和下游3D U-Net分割性能。在我们的实验中，Lesion-DDPM在所有方法中实现了最低的病灶区域重建误差。在下游3D U-Net病灶分割任务中，仅使用Lesion-DDPM生成的扫描训练并在真实MRI上评估的模型达到了0.616的Dice分数，而最佳竞争合成数据集为0.569。当将Lesion-DDPM图像添加到真实训练集中时，Dice分数进一步增加到0.685。

English abstract

3D FLAIR MRI is widely recommended as one of the standard MRI sequences for brain imaging in multiple sclerosis (MS), but publicly available MS datasets remain relatively small and vary across scanners, acquisition protocols, and lesion patterns. This scarcity and variability hinder the development of robust neuroimaging machine learning models and are particularly challenging for generative models that aim to synthesize images while preserving small, sparse lesions. We propose Lesion-DDPM, a 3D conditional diffusion framework for lesion-aware FLAIR synthesis that incorporates multi-level anatomical mask injection together with a lesion-weighted reconstruction loss to emphasize lesion voxels while maintaining global brain structure. Using a curated subset of the MSLesSeg dataset, we compare Lesion-DDPM with representative state-of-the-art GAN- and diffusion-based models, assessing both image-generation metrics and downstream 3D U-Net segmentation. In our experiments, Lesion-DDPM achieved the lowest lesion-region reconstruction error among all methods. In a downstream 3D U-Net lesion segmentation task, a model trained only on Lesion-DDPM-generated scans and evaluated on real MRIs reached a Dice score of 0.616 compared with 0.569 for the best competing synthetic dataset. When Lesion-DDPM images were added to the real training set, the Dice score further increased to 0.685.
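The idea of a lesion-weighted reconstruction loss can be sketched as a weighted mean squared error in which lesion voxels count more than background voxels. The weight value, the normalization by total weight, and the function name below are assumptions made for illustration; the abstract does not specify the exact form used in Lesion-DDPM.

```python
def lesion_weighted_mse(pred, target, lesion_mask, lesion_weight=5.0):
    """Weighted MSE over flat, non-empty voxel lists.

    Voxels with lesion_mask == 1 get weight `lesion_weight` (an assumed
    hyperparameter); all other voxels get weight 1.0. The sum of weighted
    squared errors is divided by the sum of weights.
    """
    num = 0.0
    den = 0.0
    for p, t, m in zip(pred, target, lesion_mask):
        w = lesion_weight if m else 1.0
        num += w * (p - t) ** 2
        den += w
    return num / den
```

Under this weighting, the same unit error costs more when it falls on a lesion voxel than on background, which is what pushes a model to preserve small, sparse lesions.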


2606.15611 2026-06-16 cs.CV cs.AI New submission

Mutual Distillation of Dual-Foundation Models for Semi-Supervised PET/CT Segmentation

双基础模型的相互蒸馏用于半监督PET/CT分割

Fuyou Mao, Beining Wu, Yanfeng Jiang, Bohan Xu, Lixin Lin, Naye Ji, Hao Zhang, Yan Tang

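Affiliations: Central South University（中南大学）； Hangzhou Dianzi University（杭州电子科技大学）； Communication University of Zhejiang（浙江传媒学院）； Northeastern University（东北大学）

AI summary: Proposes the MuDuo framework, which distills knowledge from SAM-Med3D (for CT) and SegAnyPET (for PET) into a lightweight student network for semi-supervised organ segmentation, reaching state-of-the-art performance on the AutoPET dataset with only 5 annotated samples.

Comments: MICCAI 2026

AI Chinese abstract (translated)

Organ segmentation in PET/CT is essential for quantitative analysis and radiotherapy planning in oncology. To reduce the high annotation cost of PET/CT segmentation, semi-supervised learning (SSL) offers a practical and effective way to develop deep models with limited labeled data. Recent vision foundation models have shown remarkable adaptability and improved efficiency. In this work, we propose a mutual distillation framework that seamlessly leverages structural and functional foundation models, which act as modality-specific generalists, to distill knowledge from structural CT and metabolic PET imaging. By bridging the gap between the task-specific precision of the student model and the segmentation priors of the generalist foundation models, we propose MuDuo, a mutual distillation framework that synergistically uses SAM-Med3D for CT and SegAnyPET for PET and distills their knowledge into a lightweight student network. Our method eliminates the need for manual prompts while maximizing the utility of unlabeled data for automatic segmentation, achieving state-of-the-art performance on the AutoPET dataset with only 5 labeled cases. Our source code is available at https://github.com/Wu-beining/MuDuo.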

Learning a Sampling-Free Variational DNN Plugin to Refine OOD Segmentations and Estimate Uncertainty from Tiny Training Sets

Jimut B. Pal, Suyash P. Awate

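Affiliations: Centre for Machine Intelligence and Data Science (C-MInDS), Indian Institute of Technology (IIT) Bombay（印度理工学院孟买分校机器智能与数据科学中心）； Computer Science and Engineering (CSE) Department, Indian Institute of Technology (IIT) Bombay（印度理工学院孟买分校计算机科学与工程系）

AI summary: Proposes VarDeepPCA, a lightweight variational DNN framework that learns a distribution of valid anatomical geometries from small in-distribution datasets without target-domain data or pre-training, achieves sampling-free inference by reinterpreting the softmax mapping, provides uncertainty estimates, and significantly improves the anatomical plausibility and accuracy of OOD segmentations across 4 clinical applications.

Comments: Accepted at the Journal of Machine Learning for Biomedical Imaging

AI Chinese abstract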

深度神经网络（DNN）由于扫描仪和采集协议的变化，经常无法泛化到分布外（OOD）的医学图像。由于获取和标注新医学数据集的成本高昂，重新训练DNN模型以应对这些分布偏移通常不切实际。为了解决这个问题，我们引入了VarDeepPCA，一种新颖的轻量级变分DNN框架，旨在通过利用内在几何先验来恢复/精炼退化的分割图。与需要目标域数据或大量预训练的现有方法不同，我们的VarDeepPCA仅使用小的分布内（ID）数据集显式学习有效解剖几何的分布。理论上，我们的新颖变分学习框架利用对softmax映射的重新解释来隐式执行精确分布建模，从而实现计算高效、无采样的学习和推理。这也使VarDeepPCA能够为其恢复的分割图提供不确定性估计。我们在4种不同的临床应用上，使用14个公开可用的数据集，涉及心肌、神经视网膜边缘、前列腺和胎儿头部分割，对我们的框架进行了实证验证。与15种现有方法的比较表明，VarDeepPCA一致地恢复了现有方法在OOD数据上产生的分割图，以（i）显著提高几何的解剖合理性和分割的临床实用性，以及（ii）显著减少误差，而不需要比现有方法更多的训练数据。

English abstract

Deep neural networks (DNNs) frequently fail to generalize to out-of-distribution (OOD) medical images because of variations in scanners and acquisition protocols. Retraining DNN models to address these distribution shifts is often impractical due to the high cost of acquiring and annotating new medical datasets. To address this, we introduce VarDeepPCA, a novel lightweight variational DNN framework designed to restore/refine degraded segmentation maps by leveraging intrinsic geometric priors. Unlike existing approaches that require target-domain data or extensive pre-training, our VarDeepPCA explicitly learns a distribution of valid anatomical geometries using only small in-distribution (ID) datasets. Theoretically, our novel variational learning framework leverages a reinterpretation of the softmax mapping to implicitly perform exact distribution modeling, thereby enabling computationally efficient, sampling-free learning and inference. This also enables VarDeepPCA to provide uncertainty estimates associated with its restored segmentation maps. We empirically validate our framework across 4 distinct clinical applications, using 14 publicly available datasets, involving segmentation of the myocardium, neuroretinal rim, prostate, and fetal head. Comparisons against 15 existing methods demonstrate that VarDeepPCA consistently restores segmentation maps produced by the existing methods on OOD data to (i) significantly improve anatomical plausibility of geometries and clinical utility of the segmentations, and (ii) significantly reduce errors, without needing any more training data than that used by existing methods.


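2606.15861 2026-06-16 cs.CV New submission

Trusting the Right Prediction for the Wrong Reasons: A LIME-Based Interpretability Analysis of Deep Learning for Lung Cancer Diagnosis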

Samarpan Poudel, Vladislav D Veksler

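Affiliations: Caldwell University School of Business and Computer Science（考德威尔大学商业与计算机科学学院）

AI summary: Uses LIME to analyze the decision consistency of three deep learning models (CNN, ResNet50, ViT) for lung cancer CT classification, finding that predictions agree closely while explanation regions differ markedly, so prediction agreement cannot substitute for reasoning consistency.

AI Chinese abstract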

肺癌是癌症相关死亡的主要原因，每年约有250万新发病例和180万死亡病例，使得可靠诊断成为临床优先事项。尽管深度学习模型在肺癌分类中取得了强劲性能，但评估主要集中于预测准确性，其决策过程尚未得到充分检验。本研究比较了三种架构不同的模型：卷积神经网络（CNN）、预训练ResNet50和视觉Transformer（ViT），均在IQ-OTH/NCCD肺癌CT数据集上训练。应用局部可解释模型无关解释（LIME）来研究模型推理。除了标准性能指标外，还引入了一个双相关框架来测量模型对之间的预测一致性和解释一致性。所有三个模型均取得了强劲的分类性能，ResNet50达到98.61%的准确率，CNN为97.91%，ViT为93.75%，同时所有模型的ROC-AUC得分均为0.99。所有模型对的预测相关性超过0.99，表明输出高度一致。然而，LIME解释相关性仍低于0.26，揭示了用于得出这些预测的图像区域存在实质性差异。对误分类样本的分析进一步识别出一致的空间模式：错误预测与肺实质外的注意力相关，而正确预测主要集中于肺区域内部。这些发现表明，预测一致性是推理一致性的一个糟糕代理，并且可解释性评估必须被视为临床AI系统中与预测性能并列的独立验证标准。

English abstract

Lung cancer is the leading cause of cancer-related mortality, with approximately 2.5 million new cases and 1.8 million deaths annually, making reliable diagnosis a clinical priority. Although deep learning models have achieved strong performance in lung cancer classification, evaluation has largely focused on predictive accuracy, leaving their decision-making processes insufficiently examined. This study compares three architecturally distinct models: a Convolutional Neural Network (CNN), a pretrained ResNet50, and a Vision Transformer (ViT), trained on the IQ-OTH/NCCD lung cancer CT dataset. Local Interpretable Model-Agnostic Explanations (LIME) were applied to investigate model reasoning. In addition to standard performance metrics, a dual-correlation framework was introduced to measure both prediction agreement and explanation agreement across model pairs. All three models achieved strong classification performance, with ResNet50 attaining 98.61% accuracy, CNN 97.91%, and ViT 93.75%, while all achieved ROC-AUC scores of 0.99. Prediction correlations exceeded 0.99 across all model pairs, indicating highly consistent outputs. However, LIME explanation correlations remained below 0.26, revealing substantial differences in the image regions used to reach those predictions. Analysis of misclassified samples further identified a consistent spatial pattern: incorrect predictions were associated with attention outside the lung parenchyma, whereas correct predictions focused primarily within lung regions. These findings demonstrate that prediction agreement is a poor proxy for reasoning consistency, and that interpretability evaluation must be treated as an independent validation criterion alongside predictive performance in clinical AI systems.
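The dual-correlation idea can be sketched as two Pearson correlations computed for a pair of models: one over their predicted probabilities and one over their explanation weights. The sketch assumes the LIME output for each model has already been flattened into a per-region weight vector; the function names and that flattening are illustrative assumptions, and the study's exact procedure may differ.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def dual_correlation(preds_a, preds_b, expl_a, expl_b):
    """Return (prediction agreement, explanation agreement) for two models.

    preds_*: predicted probabilities over the same samples.
    expl_*:  explanation weights over the same image regions (assumed to be
             precomputed, e.g. from LIME superpixel weights).
    """
    return pearson(preds_a, preds_b), pearson(expl_a, expl_b)
```

Two models can score near 1.0 on the first value and near zero or below on the second, which is the pattern the abstract reports: consistent outputs reached from different image regions.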


2606.16153 2026-06-16 cs.CV cs.AI New submission

A Comprehensive Survey of Medical Image Segmentation: Challenges, Benchmarks, and Beyond

医学图像分割综述：挑战、基准与未来展望

Pengyu Zhu, Xiaojing Zhang, Kunbo Zhang, Chunyan Zhang, Zhenyu Wang

Affiliations: School of Control and Computer Engineering, North China Electric Power University（华北电力大学控制与计算机工程学院）； SPIC Digital Technology Co., Ltd（国家电投数字科技有限公司）； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； Department 6 of Health Care, Second Medical Center, People’s Liberation Army General Hospital（中国人民解放军总医院第二医学中心健康医学科六病区）

AI summary: Systematically reviews medical image segmentation methods based on the U-Net, Transformer, and SAM architectures and analyzes the main challenges, aiming to guide future research and advance clinical translation.

Comments: 12 pages, 3 figures, 1 table. All related resources are available at https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main

AI Chinese abstract

医学图像分割在临床诊断、治疗规划、疾病监测和神经系统疾病识别中发挥着关键作用。本文对其系统发展进行了全面综述，涵盖了广泛使用的公开数据集、基于U-Net、Transformer和SAM架构的代表性方法及其关键评估指标与差异，随后从多个角度分析了主要挑战。与专注于单一模型家族或特定临床应用的综述不同，本综述将基于U-Net、Transformer和SAM的方法组织在一个统一的分析框架内，特别关注它们在提高分割精度和效率方面的有效性。本工作旨在指导医学图像分割的未来研究并支持临床转化，所有相关资源均可在我们的GitHub仓库中公开获取：https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main。

English abstract

Medical image segmentation plays a critical role in clinical diagnostics, treatment planning, disease monitoring, and neurological disorder identification. This article presents a comprehensive review of its systematic development, covering widely used public datasets, representative methods built on the U-Net, Transformer, and SAM architectures, and key evaluation metrics with their differences, followed by an analysis of major challenges from multiple perspectives. Unlike surveys that focus on a single model family or a specific clinical application, this review organizes U-Net-, Transformer-, and SAM-based methods within a unified analytical framework, with a particular focus on their effectiveness in improving segmentation accuracy and efficiency. This work aims to guide future research and support clinical translation of medical image segmentation, with all related resources publicly available in our GitHub repository: https://github.com/andrew-pengyu/Awsome_MedSeg/tree/main.


2606.16180 2026-06-16 cs.CV cs.LG New submission

To forget is to preserve: Machine Unlearning for 3D medical image segmentation

遗忘即保留：面向3D医学图像分割的机器遗忘

Nitesh Kumar Singh, Akhilesh Singh, Arjun Arora

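Affiliations: University of California, San Diego（加州大学圣地亚哥分校）

AI summary: Motivated by data privacy regulations, studies approximate unlearning strategies based on four mechanisms for 3D medical image segmentation, evaluated with the Dice coefficient and MAE, and finds that the Noisy Label strategy gives the best balance between the forget set and the retain set.

Comments: 9 pages, 5 figures

AI Chinese abstract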

随着新的数据隐私法规（如GDPR [1]）允许个人要求从训练好的机器学习模型中删除其任何个人信息，人们开始推动研究从模型中遗忘数据以遵守这些法律。在这方面，基于四种机制，我们考虑了几种应用于MRBrainS18数据集 [2] 的近似遗忘策略。我们使用3D ResNet-50 [3] 作为分割的骨干架构，该架构已通过Med3D框架 [4] 进行预训练。以预训练模型为基线，我们评估了在两类主体（即保留和遗忘）上的相应保留准确率。我们通过Dice相似系数和平均绝对误差（MAE）值评估这些方法，使用两个独立的训练周期（20和50个epoch）。结果表明，噪声标签策略具有最佳的整体权衡，在50个epoch后，遗忘集准确率下降93%，同时保留集准确率保持84%。所有其他策略在更高的epoch数下表现出极端的遗忘水平，同时其保留集性能也出现灾难性退化。本研究结果为在主体特定水平上的遗忘提供了严格的性能指标基线，并为从业者选择适当策略提供了明确标准。

English abstract

With new data privacy laws such as the General Data Protection Regulation (GDPR) [1] that allow individuals to ask that any of their personal information be erased from trained machine learning models, there has been a push to investigate the unlearning of data from models as a way to comply with these laws. In this regard, based on four mechanics, we consider several approximate unlearning strategies applied to the MRBrainS18 dataset [2]. We use a 3D ResNet-50 [3] as a backbone architecture for segmentation that has been pre-trained with the Med3D framework [4]. Considering the pre-trained model as a baseline, we evaluate respective retention accuracy on 2 types of subjects, i.e., retain and forget. We assess these approaches through their Dice similarity coefficient and mean absolute error (MAE) values using two separate training horizons 20 and 50 epochs. The results show that the Noisy Label strategy had the best overall trade-off with a decrease of 93% in the forget set while maintaining 84% accuracy for the retained set after 50 epochs. All other strategies showed extreme levels of forgetting at higher epoch numbers while also demonstrating catastrophic degradation of their retain set performance. The results of this study provide a strict baseline of performance metrics for unlearning on a subject-specific level and provide practitioners with clear criteria for selecting the proper strategies.


2606.16212 2026-06-16 cs.CV cs.AI New submission

LUCID: Learned Undersampling-Adaptive Consistency-Guided Inference with Deterministic Flow Matching for Sparse-View CT Reconstruction

LUCID：基于确定性流匹配的学习型欠采样自适应一致性引导稀疏视角CT重建

Jigang Duan, Jiayi Wang, Heran Wang, Ping Yang, Genwei Ma, Xing Zhao

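Affiliations: School of Mathematical Sciences, Capital Normal University（首都师范大学数学科学学院）； National Center for Applied Mathematics Beijing, Capital Normal University（首都师范大学北京国家应用数学中心）； Academy for Multidisciplinary Studies, Capital Normal University（首都师范大学交叉科学研究院）

AI summary: Proposes the LUCID framework, which uses a flow-matching generative prior and a sparsity-adaptive strategy, with a degradation-matched initial state and projection-domain consistency correction, to achieve stable sparse-view CT reconstruction across sampling densities and reduce artifacts and hallucinated structures.

AI Chinese abstract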

稀疏视角CT通过获取更少的投影视图来减少辐射剂量和扫描时间，但角度欠采样使得重建严重病态，导致条纹伪影、结构模糊和细节丢失。现有的监督方法通常受限于特定的采样设置，而生成方法在严重欠采样下可能引入解剖上不一致的幻觉样结构。我们提出Lucid，一种基于流匹配生成先验的稀疏自适应、一致性引导重建框架，用于稀疏视角CT。Lucid仅在高品质CT图像上训练，学习高斯分布与高品质CT图像分布之间的连续传输，与视角采样无关。在推理过程中，显式纳入采样稀疏度水平，以调整单个预训练模型的生成轨迹。具体地，Lucid通过稀疏度加权融合稀疏视角FBP图像和高斯噪声构建退化匹配的初始状态，执行稀疏度调制的流匹配更新，并在每次先验更新后应用投影域数据一致性校正。在多种稀疏视角设置下的实验表明，Lucid在不同采样密度下实现稳定的重建性能，提高图像质量和结构保真度，并降低生成式稀疏视角CT重建中幻觉样结构的风险。

English abstract

Sparse-view CT reduces radiation dose and scanning time by acquiring fewer projection views, but angular undersampling makes reconstruction severely ill-posed, causing streak artifacts, structural blurring, and loss of fine details. Existing supervised methods are often tied to specific sampling settings, whereas generative methods may introduce anatomically inconsistent hallucination-like structures under severe undersampling. We propose Lucid, a sparsity-adaptive, consistency-guided reconstruction framework based on a Flow Matching generative prior for sparse-view CT. Lucid is trained only on high-quality CT images to learn a continuous transport between a Gaussian distribution and the high-quality CT image distribution, independent of view sampling. During inference, the sampling sparsity level is explicitly incorporated to adapt the generative trajectory of a single pretrained model. Specifically, Lucid constructs a degradation-matched initial state by sparsity-weighted fusion of the sparse-view FBP image and Gaussian noise, performs sparsity-modulated Flow Matching updates, and applies projection-domain data-consistency correction after each prior update. Experiments under multiple sparse-view settings show that Lucid achieves stable reconstruction performance across different sampling densities, improves image quality and structural fidelity, and reduces the risk of hallucination-like structures in generative sparse-view CT reconstruction.
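The degradation-matched initial state mentioned in this abstract can be pictured as a convex blend of the sparse-view FBP image and Gaussian noise, with the blend weight set by how sparse the acquisition is. The sketch below is only one plausible reading: the linear mapping from view count to sparsity level, the convex blend itself, and the function names are all assumptions, since the abstract does not give the formula.

```python
import random

def sparsity_level(num_views, full_views):
    """Assumed mapping: 0.0 for a fully sampled scan, approaching 1.0 as views vanish."""
    return 1.0 - num_views / full_views

def initial_state(fbp_image, num_views, full_views, seed=0):
    """Blend a flat FBP image with unit Gaussian noise.

    x0 = (1 - s) * fbp + s * noise, where s is the assumed sparsity level.
    Sparser scans start closer to pure noise, so the generative prior does
    more of the work; denser scans start closer to the FBP image.
    """
    s = sparsity_level(num_views, full_views)
    rng = random.Random(seed)
    return [(1.0 - s) * v + s * rng.gauss(0.0, 1.0) for v in fbp_image]
```

In the described method this state would then be refined by sparsity-modulated flow-matching updates with a projection-domain data-consistency correction after each step; those stages are not shown here.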


2606.16234 2026-06-16 cs.CV cs.AI New submission

Propagating Structural Guidance: Synthesizing Fluorescein Angiography from Fundus Images and Sparse OCT Scans

传播结构引导：从眼底图像和稀疏OCT扫描合成荧光素血管造影

Tengfei Ma, Ruiqi Wu, Chenran Zhang, Ye Geng, Na Su, Xiangyuan Duanmu, Tao Zhou, Yi Zhou, Wen Fan

Affiliations: School of Computer Science and Engineering, Southeast University（东南大学计算机科学与工程学院）； Key Laboratory of New Generation Artificial Intelligence Technology and Its Interdisciplinary Applications, Ministry of Education（教育部新一代人工智能技术及其跨学科应用重点实验室）； Tianyuan Honors School, Nanjing Medical University（南京医科大学天元荣誉学院）； Nanjing University of Science and Technology（南京理工大学）； Department of Ophthalmology, The First Affiliated Hospital of Nanjing Medical University（南京医科大学第一附属医院眼科）

AI summary: Proposes a framework for synthesizing fluorescein angiography (FFA) from color fundus photography (CFP) and sparse OCT scans, using spatially aligned cross-modal fusion and token-level contrastive learning to achieve non-invasive FFA synthesis and improve downstream diagnostic performance.

Comments: Accepted to MICCAI 2026 (Early Accept)

AI Chinese abstract

眼底荧光素血管造影（FFA）对于评估视网膜血管异常至关重要，但其获取具有侵入性且并非总是可行。相比之下，彩色眼底摄影（CFP）无创且广泛可用，这推动了CFP到FFA合成的研究。然而，先前的工作仅依赖CFP表面纹理，从根本上限制了重建功能性血管信息和细微病理变化的能力。为了解决这个问题，我们提出了一种新颖的框架，该框架利用光学相干断层扫描（OCT）提供的结构引导，从CFP合成FFA。我们构建了一个包含来自3,676只患者眼睛的配对CFP、FFA和OCT的多模态视网膜成像数据集——这是视网膜成像中首个三模态对齐数据集。为了弥合OCT和眼底模态之间的空间差距，我们提出了空间对齐跨模态融合（SACMF）模块，该模块将深度分辨的OCT特征投影到眼底平面，并通过自适应层归一化将其注入CFP编码器。除了特征融合，我们还引入了令牌级跨模态对齐（TCMA），这是一种令牌级对比学习策略，在对应空间位置显式对齐CFP和FFA表示。我们的方法相比最先进的方法实现了更优的合成性能。此外，大量实验表明，我们方法合成的FFA图像在提升下游疾病诊断性能方面比现有方法带来更大的改进，突显了我们的方法作为常规工作流程中无创决策支持工具的临床潜力。代码可在https://github.com/while-plus/OCT-guide-FFA-Syn获取。

English abstract

Fundus fluorescein angiography (FFA) is critical for assessing retinal vascular abnormalities, but its acquisition is invasive and not always feasible. In contrast, color fundus photography (CFP) is non-invasive and widely accessible, which has motivated studies on CFP-to-FFA synthesis. However, prior works rely solely on CFP surface texture, fundamentally limiting the ability to reconstruct functional vascular information and subtle pathological changes. To address this, we propose a novel framework that synthesizes FFA from CFP with structural guidance provided by optical coherence tomography (OCT). We construct a multi-modal retinal imaging dataset with paired CFP, FFA, and OCT from 3,676 patient eyes--the first tri-modally aligned dataset in retinal imaging. To bridge the spatial gap between OCT and fundus modalities, we propose a Spatially Aligned Cross-Modal Fusion (SACMF) module that projects depth-resolved OCT features onto the fundus plane and injects them into the CFP encoder via adaptive layer normalization. Beyond feature fusion, we further introduce Token-wise Cross-Modality Alignment (TCMA), a token-level contrastive learning strategy that explicitly aligns CFP and FFA representations at corresponding spatial positions. Our method achieves superior synthesis performance compared to state-of-the-art methods. Moreover, extensive experiments demonstrate that the FFA images synthesized by our approach bring greater improvements in downstream disease diagnosis performance than existing methods, highlighting the clinical potential of our approach as a non-invasive decision-support tool in routine workflows. The code is available at https://github.com/while-plus/OCT-guide-FFA-Syn.

2606.16294 2026-06-16 cs.CV q-bio.NC New submission

AURA: 细菌细胞学分析中治疗模糊性下的主动响应归因

Kartik Jhawar, Mrunmayee Deshpande, Wilfried Moreira, Guillermo C. Bazan, Lipo Wang

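Affiliations: Nanyang Technological University; Institute of High Performance Computing, A*STAR; University of California, Santa Barbara

AI summary: Targets the problem that only some of the drugs in an antibiotic combination actually act. Proposes AURA, a constrained, energy-based inverse attribution method that decomposes residual morphology and selects the subset with the lowest reconstruction energy, reaching 95.47% exact-match accuracy in cross-replicate transfer.

AI Chinese abstract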

当细菌样本暴露于多种抗生素时，并非每种施加的药物都必然起作用：如果细菌对其中一种药物耐药，则该药物不会留下形态学痕迹。因此，临床上有意义的量不是施加了哪些抗生素，而是哪些抗生素是活跃的。我们表明，在真实的大肠杆菌显微成像中，这两者严重脱钩（简单地假设施加的组合等于活跃组合，正确率仅约37%），然而现有的计算工具不适合恢复活跃集。前向扰动模型如scGen、CPA和IMPA旨在根据处理预测外观，而非反向推断，将其反转使用会导致性能严重下降；判别式图像分类器倾向于记忆菌株和批次特定的纹理，并且无法跨实验重复迁移。我们引入AURA，它将任务重新定义为基于能量的约束逆归因。其核心归纳偏置是活跃集必须是施加集的子集；这压缩了候选空间，并让AURA通过将残余形态分解为抗生素响应原子并选择重构能量最低的子集来推断施加抗生素中的活跃子集，测试时不使用菌株标签。AURA-E添加了证据感知的弃权，当候选解释仍然近乎同等合理时不给出预测。在大肠杆菌细胞学分析数据集的跨重复迁移中，AURA以95.47%的精确匹配准确率恢复活跃抗生素组合。

English abstract

When a bacterial sample is exposed to several antibiotics, not every applied drug necessarily acts: if the organism is resistant to one of them, that drug leaves no morphological trace. The clinically meaningful quantity is therefore not which antibiotics were applied, but which ones were active. We show that these two are sharply decoupled in real E. coli microscopy - naively assuming the applied combination equals the active one is correct only about 37% of the time - yet existing computational tools are ill-suited to recovering the active set. Forward perturbation models such as scGen, CPA, and IMPA are designed to predict appearance from treatment, not the reverse, and inverting them degrades sharply; discriminative image classifiers tend to memorise strain- and batch-specific texture and fail to transfer across experimental replicates. We introduce AURA, which reframes the task as constrained, energy-based inverse attribution. Its central inductive bias is that the active set must be a subset of the applied set; this collapses the candidate space and lets AURA infer the active subset of applied antibiotics by decomposing residual morphology into antibiotic response atoms and selecting the subset with the lowest reconstruction energy, using no strain label at test time. AURA-E adds evidence-aware abstention, withholding a prediction when candidate explanations remain near-equally plausible. On cross-replicate transfer in an E. coli cytological profiling dataset, AURA recovers the active antibiotic combination with 95.47% exact-match accuracy.
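
The subset constraint and lowest-energy selection described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the AURA implementation: response atoms are assumed to combine additively, the energy is taken to be a squared reconstruction error, and the `margin` abstention rule is a hypothetical stand-in for the evidence-aware abstention of AURA-E. Because candidates are restricted to subsets of the applied set, a sample treated with n drugs has only 2^n candidate explanations.

```python
from itertools import combinations


def subsets(applied):
    """Yield every subset of the applied drugs, including the empty set."""
    for k in range(len(applied) + 1):
        yield from combinations(applied, k)


def energy(residual, atoms, subset):
    """Squared error between the residual and the sum of the subset's atoms."""
    recon = [sum(atoms[d][i] for d in subset) for i in range(len(residual))]
    return sum((r - x) ** 2 for r, x in zip(residual, recon))


def infer_active(residual, atoms, applied, margin=0.0):
    """Return the lowest-energy subset, or None if the runner-up is within `margin`."""
    scored = sorted((energy(residual, atoms, s), s) for s in subsets(applied))
    best_energy, best = scored[0]
    if len(scored) > 1 and scored[1][0] - best_energy < margin:
        return None  # abstain: two explanations are nearly equally plausible
    return set(best)


# Hypothetical two-dimensional response atoms for drugs "A" and "B".
atoms = {"A": [1.0, 0.0], "B": [0.0, 1.0]}
print(infer_active([1.0, 0.0], atoms, ["A", "B"]))              # {'A'}
print(infer_active([1.0, 0.0], atoms, ["A", "B"], margin=2.0))  # None
```

With both drugs applied but only the "A" pattern present in the residual, the subset containing only "A" reconstructs it exactly, so "B" is attributed as inactive. Raising `margin` makes the sketch abstain because the next-best subset is within the tolerance.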

2606.16484 2026-06-16 cs.CV cs.AI cs.MM New submission

Unified Multimodal Model for Brain MRI Imputation and Understanding

统一多模态模型用于脑MRI补全与理解

Zhiyun Song, Che Liu, Tian Xia, Avinash Kori, Wenjia Bai

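Affiliations: Department of Computing, Imperial College London; Department of Brain Sciences, Imperial College London

AI summary: Proposes UniBrain, a model that jointly handles brain MRI modality imputation and image understanding through a unified training strategy, using self-alignment and a dynamic hidden state mechanism, and achieves high performance on a multi-disease dataset.

Comments: Early accepted to MICCAI 2026

AI Chinese abstract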

多模态大语言模型（MLLMs）在医学领域具有巨大潜力，因为它们继承了LLM的知识，并允许以自然语言集成、分析和解释多种数据模态。然而，医学MLLMs面临重大挑战，特别是高质量训练数据的稀缺以及现实临床环境中数据缺失的频繁发生。在此，我们提出了一种新颖的统一多模态模型UniBrain，用于脑磁共振图像（MRI）分析。为了解决潜在的脑MRI模态缺失问题，我们采用统一训练策略进行联合成像模态补全和脑图像理解。在训练过程中，构建了交错且描述丰富的数据流，以自回归方式训练模型，从而实现基于生成的多模态数据的医学推理。引入自对齐策略，利用密集图像嵌入学习细粒度解剖特征，无需详细的图像描述。此外，我们提出了一种动态隐藏状态机制，以缓解长上下文多模态推理中的暴露偏差。在多疾病脑MRI数据集上的大量实验表明，UniBrain在模态不完全的各种情况下，在脑图像补全、理解和疾病诊断方面均取得了高性能。

English abstract

Multimodal large language models (MLLMs) hold great potential for medicine, as they inherit knowledge from LLMs and allow multiple data modalities to be integrated, analysed and interpreted in natural language. However, the field of medical MLLMs is constrained by non-trivial challenges, notably the scarcity of high-quality training data and the frequent occurrence of missing data in the real-world clinical setting. Here, we propose a novel unified multimodal model, UniBrain, for brain magnetic resonance image (MRI) analysis. To address potential missing brain MRI modalities, we employ a unified training strategy to perform joint imaging modality imputation and brain image understanding. During training, an interleaved and description-enriched data flow is constructed to train the model in an autoregressive manner, enabling medical reasoning with generated multimodal data. A self-alignment strategy is introduced to leverage dense image embeddings to learn fine-grained anatomical features without requiring detailed image captions. Furthermore, we propose a dynamic hidden state mechanism to alleviate the exposure bias during long-context multimodal inference. Extensive experiments on a multi-disease brain MRI dataset demonstrate that UniBrain achieves high performance for brain image imputation, understanding, and disease diagnosis under various extents of modality incompleteness.

2606.16573 2026-06-16 cs.CV New submission

Transformation-driven generation of comparable projection images from multimodal anatomical scenes

从多模态解剖场景生成可比较投影图像的变换驱动方法

Dariusz Pojda, Krzysztof Domino, Michał Tarnawski, Agnieszka Anna Tomaka

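Affiliations: Institute of Theoretical and Applied Informatics, Polish Academy of Sciences

AI summary: Proposes a transformation-driven framework that generates reproducible projection-space observations from multimodal anatomical data, demonstrated on mandibular-motion scenarios, producing directly comparable virtual radiographic projections across anatomical configurations.

Comments: 36 pages, 11 figures

AI Chinese abstract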

本工作解决了从异质解剖场景生成可重复投影空间观测的计算问题，其中组件可能经历独立的空间变换。我们提出了一种变换驱动框架，用于从多模态解剖数据生成合成投影图像，并在下颌运动场景中进行了演示。与主要为配准、投影真实感或渲染效率设计的传统数字重建放射影像（DRR）方法不同，所提出的公式将投影成像视为对显式表示的解剖场景进行观测的过程。独立可变换的基于体积和表面的解剖对象嵌入到共享场景表示中，并通过显式变换直接传播到投影空间。投影几何、采集建模、材料解释和图像呈现保持显式分离，从而能够在保持可重复性和生成投影之间直接可比性的同时，对方法假设进行可控探索。特别强调了与颅面分析相关的变换驱动解剖场景，包括下颌运动和治疗性复位。使用由CT/CBCT体积、分割结构、表面模型以及辅助解剖或治疗对象组成的共享解剖参考场景，该框架能够在保持相同成像假设的同时，从多种解剖配置生成直接可比的VirtualRTG投影。该方法并非旨在实现完全物理逼真的放射模拟，而是为研究解剖-投影关系、运动可观测性和变换感知成像工作流提供可控且可重复的方法学环境。

English abstract

This work addresses the computational problem of generating reproducible projection-space observations from heterogeneous anatomical scenes whose components may undergo independent spatial transformations. We propose a transformation-driven framework for synthetic projection imaging from multimodal anatomical data and demonstrate it on mandibular-motion scenarios. In contrast to conventional Digitally Reconstructed Radiograph (DRR) approaches primarily designed for registration, projection realism, or rendering efficiency, the proposed formulation treats projection imaging as an observation process operating on an explicitly represented anatomical scene. Independently transformable volumetric and surface-based anatomical objects are embedded within a shared scene representation and propagated directly into projection space through explicit transformations. Projection geometry, acquisition modelling, material interpretation, and image presentation remain explicitly separated, enabling controlled exploration of methodological assumptions while preserving reproducibility and direct comparability between generated projections. Particular emphasis is placed on transformation-driven anatomical scenarios relevant to craniofacial analysis, including mandibular motion and therapeutic repositioning. Using a shared anatomical reference scene composed of CT/CBCT volumes, segmented structures, surface models, and auxiliary anatomical or therapeutic objects, the framework enables generation of directly comparable VirtualRTG projections from multiple anatomical configurations while preserving identical imaging assumptions. Rather than aiming at fully physically faithful radiographic simulation, the proposed approach provides a controllable and reproducible methodological environment for studying anatomy--projection relationships, motion observability, and transformation-aware imaging workflows.
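
The idea of moving scene objects independently while keeping the imaging assumptions fixed can be illustrated with a toy scene. This sketch is only an illustration under simplifying assumptions and is not the VirtualRTG pipeline: objects are bare point lists, each transform is a rotation about the z axis plus a translation, and the projection is a parallel projection along z (real radiographic geometry is typically divergent). The object names and the `render` helper are hypothetical.

```python
import math


def rotate_z(point, angle_rad):
    """Rotate a 3D point about the z axis."""
    x, y, z = point
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return (c * x - s * y, s * x + c * y, z)


def translate(point, offset):
    """Shift a 3D point by an offset vector."""
    return tuple(p + o for p, o in zip(point, offset))


def project_parallel(point):
    """Parallel projection along z: keep (x, y) and drop depth."""
    return (point[0], point[1])


def render(scene, transforms):
    """Apply each object's own transform, then the same projection to every point."""
    identity = (0.0, (0.0, 0.0, 0.0))
    image = {}
    for name, points in scene.items():
        angle, offset = transforms.get(name, identity)
        moved = [translate(rotate_z(p, angle), offset) for p in points]
        image[name] = [project_parallel(p) for p in moved]
    return image


scene = {"skull": [(1.0, 0.0, 5.0)], "mandible": [(1.0, 0.0, 2.0)]}
rest = render(scene, {})
opened = render(scene, {"mandible": (math.pi / 2, (0.0, 0.0, 0.0))})
```

Because only the mandible's transform differs between `rest` and `opened`, any difference between the two projections is attributable to that transform, which is the sense in which the generated projections are directly comparable.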

2606.16658 2026-06-16 cs.CV New submission

Vision-Language Models as Zero-Annotation Oracles in Histopathology

视觉-语言模型作为组织病理学中的零标注预言机

Vishal Jain, Giorgio Buzzanca, Sarah Cechnicka, Maarten Naesens, Priyanka Koshy, Tri Nguyen, Jesper Kers, Candice Roufosse, Bernhard Kainz

Affiliations: Imperial College London; Leiden University Medical Center; KU Leuven; University Hospitals Leuven; University Medical Center Utrecht; Friedrich-Alexander University Erlangen-Nürnberg

AI summary: Proposes a coarse-to-fine method that uses general-purpose vision-language models as zero-annotation oracles for foreground segmentation, outperforming supervised baselines on special stains and distilling lightweight student models from the resulting pseudo-labels.

Comments: 11 pages, 1 figure, 6 tables. Code available at https://github.com/VishalJ99/vlm-wsi-auto-context

AI Chinese abstract

前景分割是每个计算病理学流程的关键第一步，但现有方法依赖于手工调整的启发式规则或监督模型，这些模型过度拟合狭窄的染色和扫描仪分布，在特殊染色（如Jones银染或Elastica van Gieson）上无声失败。我们提出一种粗到细方法，将前景分割重新定义为视觉感知任务，并利用通用视觉-语言模型（VLM）作为零标注预言机。我们的关键洞察是，组织与背景的区分是一个自然图像识别问题，而非组织病理学问题，因此在互联网规模语料上训练的VLM能够泛化到领域特定模型无法处理的场景。我们引入了Leica-75基准，包含跨越三种染色家族的75张肾移植全切片图像。在Leica-75上，我们的方法在分布外染色上实现了最高分割质量（Jones Dice 0.858 +/- 0.027，EVG Dice 0.853 +/- 0.041），交叉染色方差比最佳监督基线低7倍，同时在分布内H&E上保持竞争力。使用自动筛选示例（Auto-context）的少样本提示挽救了Stress-32（n=32）上的困难案例，Stress-32是一个精心设计的压力测试子集（2B模型Dice从0.470提升至0.819）。基于VLM的标注审查与人类专家共识一致（模糊检测kappa=0.989；分割掩码审查的平均精确率/召回率分级准确率0.708 vs. 人类0.646）。生成的伪标签用于蒸馏轻量学生模型，其性能与教师模型相当，而运行成本仅为教师模型的一小部分。我们的框架为数字病理学中持续存在的基础设施瓶颈提供了原则性、可扩展的解决方案。

English abstract

Foreground segmentation is the critical first step of every computational pathology pipeline, yet existing methods rely on hand-tuned heuristics or supervised models that overfit to narrow stain and scanner distributions, failing silently on specialised stains such as Jones silver or Elastica van Gieson. We propose a coarse-to-fine approach that recasts foreground segmentation as a visual perception task and leverages general-purpose vision-language models (VLMs) as zero-annotation oracles. Our key insight is that tissue-versus-background discrimination is a natural-image recognition problem, not a histopathological one, so VLMs trained on internet-scale corpora generalise where domain-specific models cannot. We introduce Leica-75, a benchmark of 75 renal transplant whole-slide images spanning three stain families. On Leica-75, our method achieves the highest segmentation quality on out-of-distribution stains (Dice 0.858 +/- 0.027 on Jones, 0.853 +/- 0.041 on EVG) with 7x lower cross-stain variance than the best supervised baseline, while remaining competitive on in-distribution H&E. Few-shot prompting with automatically curated exemplars (Auto-context) rescues hard cases on Stress-32 (n=32), a curated stress-test subset (Dice 0.470 to 0.819 for the 2B model). VLM-based annotation review matches human expert consensus (kappa=0.989 for blur detection; mean precision/recall grading accuracy 0.708 vs. human 0.646 for segmentation mask review). The resulting pseudo-labels are used to distil lightweight student models that are as performant as the teacher model while running for a fraction of the cost. Our framework provides a principled, scalable solution to a persistent infrastructure bottleneck in digital pathology.
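
The segmentation results above are reported as Dice scores. For readers unfamiliar with the metric, the following minimal sketch computes the Dice coefficient, 2|A∩B| / (|A| + |B|), for two binary masks given as flat lists. Returning 1.0 when both masks are empty is a convention assumed here; the abstract does not state how the paper handles that case.

```python
def dice(pred, truth):
    """Dice coefficient 2|A∩B| / (|A| + |B|) for two flat binary masks."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    # Assumed convention: two empty masks agree perfectly.
    return 1.0 if total == 0 else 2.0 * intersection / total


print(dice([1, 1, 0, 0], [1, 0, 1, 0]))  # 0.5
```

A score of 0.858 therefore means the overlap is about 86% of the average size of the predicted and reference foreground regions.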

2606.16749 2026-06-16 cs.CV New submission

Structure-aware Knowledge-guided Heterogeneous Mamba for Zygomaticomaxillary Suture Assessment

结构感知知识引导的异构Mamba用于颧上颌缝评估

Xiaoqi Guo, Birui Chen, Xinquan Yang, Chaoyun Zhang, Xuefen Liu, Mianjie Zheng, Kun Tang, Xuguang Li, Wen Ma, Yanhua Xu, Linlin Shen

Affiliations: College of Computer Science and Software Engineering, Shenzhen University; School of Artificial Intelligence, Shenzhen University; Affiliated Stomatology Hospital of Kunming Medical University; Shenzhen University General Hospital

AI summary: Presents the first public ZMS dataset (3,790 images covering ages 4 to 24) and the SKMamba framework, which uses a decoupled dual-path architecture, an Implicit Edge Extractor, and a Cross-Modal Semantic Alignment module for automated ZMS maturation assessment, outperforming existing methods.

AI Chinese abstract

颧上颌缝（ZMS）是连接颧骨和上颌骨的关键上颌周围结构，是上颌前移过程中的主要阻力部位，其成熟状态直接影响矫形干预的时机和效果。然而，由于缝线中微妙的高频过渡以及相邻阶段之间的全局语义模糊性，ZMS成熟的准确分期仍然具有挑战性。为解决这一问题，我们提出了首个公开ZMS数据集，包含3790张覆盖4至24岁全年龄范围的ZMS图像。基于该数据集，我们提出了SKMamba，一种结构感知和知识引导的基于Mamba的多模态框架，用于自动化ZMS成熟度评估。SKMamba采用解耦的双路径架构，模拟经验丰富的正畸医生使用的分层诊断过程。我们首先引入隐式边缘提取器（IEE），利用结构预训练减少小梁噪声并突出缝线边界。作为补充，设计了跨模态语义对齐（CSA）模块，用于整合来自大语言模型（LLM）的解剖描述。该模块有助于将局部形态线索与全局语义描述对齐，同时确保客观形态证据仍是决策的主要依据。在我们的ZMS数据集上的大量实验表明，SKMamba相比现有方法实现了最先进的性能。代码可在https://github.com/galaxygxq1116/SKMamba获取。

English abstract

The Zygomaticomaxillary Suture (ZMS) is a key circummaxillary structure that connects the zygomatic bone and the maxilla. It serves as a primary site of resistance during maxillary advancement, and its maturation status directly influences the timing and efficacy of orthopedic interventions. However, accurate staging of ZMS maturation remains challenging due to subtle high-frequency transitions in suture lines and the global semantic ambiguity between adjacent stages. To address this, we present the first public ZMS dataset, comprising 3,790 ZMS images covering the entire age range from 4 to 24 years. Based on this dataset, we propose SKMamba, a Structure-aware and Knowledge-guided Mamba-based multi-modal framework for automated ZMS maturation assessment. SKMamba adopts a decoupled dual-path architecture that mimics the hierarchical diagnostic process used by experienced orthodontists. We first introduce an Implicit Edge Extractor (IEE), which leverages structural pre-training to reduce trabecular noise and accentuate sutural boundaries. Complementarily, a Cross-Modal Semantic Alignment (CSA) module is designed to incorporate anatomical descriptions from a large language model (LLM). This module helps align local morphological cues with global semantic descriptions while ensuring that objective morphological evidence remains the primary basis for decisions. Extensive experiments on our ZMS dataset demonstrate that SKMamba achieves state-of-the-art performance compared to existing methods. Code is available at https://github.com/galaxygxq1116/SKMamba.

2606.16756 2026-06-16 cs.CV New submission

3D Classification of Paramagnetic Rim Lesions in Multiple Sclerosis via Asymmetric QSM-FLAIR Modeling

多发性硬化症中顺磁性边缘病变的3D分类：基于非对称QSM-FLAIR建模

Veronica Pignedoli, Giacomo Boffa, Nicoletta Noceti, Matilde Inglese, Francesca Odone, Matteo Moro

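Affiliations: MaLGa, DIBRIS, University of Genova; DINOGMI, University of Genova; IRCCS Azienda Ospedaliera Metropolitana

AI summary: Proposes a 3D multimodal deep learning framework that uses asymmetric QSM-FLAIR modeling to automatically classify paramagnetic rim lesions in multiple sclerosis, improves robustness under limited data through self-supervised pretraining and contrastive regularization, and is evaluated on a cohort of 88 patients.

Comments: 10 pages, 3 figures, accepted at MICCAI 2026. GitHub link: https://github.com/veronicapignedoli/FRODO

AI Chinese abstract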

在磁敏感加权MRI上识别的顺磁性边缘病变（Rim$^+$）最近已成为多发性硬化症（MS）慢性活动性炎症的特异性生物标志物，并与长期残疾进展相关。然而，磁敏感成像和专家判读仍局限于专业中心，视觉评估耗时且可变，且Rim$^+$病变的低患病率给自动分析带来了严重的类别不平衡挑战。我们提出了一种3D多模态深度学习框架，用于从定量磁化率图（QSM）和FLAIR MRI中进行病变级别的Rim$^+$/Rim$^-$分类。该架构通过将QSM作为主要磁敏感驱动信号并用FLAIR衍生的结构上下文进行条件化，显式建模了模态非对称性。为了提高在有限数据下的鲁棒性，我们采用了自监督多模态预训练，随后进行带有对比正则化的监督微调。该方法在临床采集的88名MS患者队列中进行了评估，以专家病变标注作为参考标准。结果显示了相比先前架构的性能提升，支持了非对称多模态建模在自动识别慢性活动性病变中的有效性。

English abstract

Paramagnetic rim lesions (Rim$^+$) identified on susceptibility-sensitive MRI have recently emerged as a specific biomarker of chronic active inflammation in Multiple Sclerosis (MS) and are associated with long-term disability progression. However, susceptibility imaging and expert interpretation remain limited to specialized centers, visual assessment is time-consuming and variable, and the low prevalence of Rim$^+$ lesions poses severe class imbalance challenges for automated analysis. We propose a 3D multimodal deep learning framework for lesion-level Rim$^+$/Rim$^-$ classification from Quantitative Susceptibility Mapping (QSM) and FLAIR MRI. The architecture explicitly models modality asymmetry by treating QSM as the primary susceptibility-driven signal and conditioning it with FLAIR-derived structural context. To improve robustness under limited data, we employ self-supervised multimodal pretraining followed by supervised fine-tuning with contrastive regularization. The method was evaluated on a clinically acquired cohort of 88 people with MS with expert lesion annotations as reference standard. Results highlight improved performance compared to prior architectures, supporting the effectiveness of asymmetric multimodal modeling for automated chronic active lesion identification.

2606.16794 2026-06-16 cs.CV New submission

LLM-Based Visual Explanation Evaluation Framework for Assessing the Explainability of Facial Skin Disease Classification Models

基于LLM的视觉解释评估框架：用于评估面部皮肤病分类模型的可解释性

Gyuyeon Na

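Affiliations: AI and Business Analytics, Ewha Womans University

AI summary: Proposes an LLM-based visual explanation evaluation framework that uses progressive prompt engineering to assess the quality of Grad-CAM explanations in facial skin disease diagnosis models, focusing on lesion localization and trustworthiness.

AI Chinese abstract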

本研究提出了一个特定领域的基于LLM的视觉解释评估框架，用于评估面部皮肤病诊断模型中Grad-CAM解释的质量。以往研究主要关注通过数据增强技术提升分类性能，而较少系统性地检验模型解释是否基于临床相关的病变区域。在本研究中，对基于EfficientNet-B0、MobileNetV3和ResNet18的面部皮肤病分类模型应用了几何增强、颜色增强和混合增强策略。采用Grad-CAM生成代表模型决策过程的视觉解释。此外，利用GPT-5.5、Gemini 3.5 Flash和Claude Sonnet 4.6设计了LLM-as-a-Judge评估框架，从病变定位和解释可信度两个角度评估Grad-CAM解释。为提高评估一致性和临床基础，引入了渐进式提示工程策略，包含评估准则、临床知识、惩罚规则和结构化输出格式。

English abstract

This study proposes a domain-specific LLM-based Visual Explanation Evaluation Framework for assessing Grad-CAM explanations in facial skin disease diagnosis models. While previous studies have primarily focused on improving classification performance through data augmentation techniques, relatively few studies have systematically examined whether model explanations are grounded in clinically relevant lesion regions. In this study, geometric augmentation, color-based augmentation, and mixed augmentation strategies were applied to facial skin disease classification models based on EfficientNet-B0, MobileNetV3, and ResNet18. Grad-CAM was employed to generate visual explanations representing the models' decision-making processes. Furthermore, an LLM-as-a-Judge evaluation framework was designed using GPT-5.5, Gemini 3.5 Flash, and Claude Sonnet 4.6 to assess Grad-CAM explanations from the perspectives of lesion localization and explanation trustworthiness. To improve evaluation consistency and clinical grounding, a progressive prompt engineering strategy was introduced, incorporating evaluation rubrics, clinical knowledge, penalty rules, and structured output formats.

2606.16991 2026-06-16 cs.CV cs.LG New submission

A Multi-Center Benchmark for Abdominal Disease Diagnosis and Report Generation from Non-Contrast CT

基于非增强CT的腹部疾病诊断与报告生成的多中心基准

Mariam Elbakry, Aliaa Sayed Sheha, Salma Hassan Tantawy, Aya Yassin, Concetto Spampinato, Karim Lekadir, Xiaomeng Li, Marawan Elbatel

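Affiliations: Ain Shams University; The Hong Kong University of Science and Technology; University of Catania; Universitat de Barcelona

AI summary: Introduces a multi-center benchmark that learns to synthesize contrast-enhanced findings from non-contrast CT for multi-organ abdominal disease diagnosis and automated report generation. Experiments indicate that non-contrast CT retains diagnostic signal, with an average AUC of 69.1% (internal) and 63.1% (external).

Comments: Early Accept (top ~9%), MICCAI 2026

AI Chinese abstract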

多期增强CT（CECT）广泛用于腹部病变表征，但存在造影剂肾病风险、增加采集负担并加重放射科医生工作量。为解决这些问题，我们引入了一个新的多中心基准，用于多器官腹部疾病诊断和自动放射报告生成，该基准学习从单期非增强CT（NCCT）合成增强CT发现。为此，我们从两个中心收集了配对NCCT-CECT研究及其对应的增强放射报告的大规模数据集，分为内部集和外部验证队列。在统一评估协议下，我们对五种当代深度学习架构进行了基准测试，涵盖胸部专用、腹部专用和通用多模态领域。大量实验表明，NCCT保留了诊断信号，在内部队列和外部队列上分别实现了平均多器官AUC 69.1%和63.1%。通过公开发布该数据集和标准化基准，本研究旨在促进未来对更安全、资源高效且全球可及的免造影腹部成像工作流程的研究。代码地址：https://github.com/xmed-lab/TriALS-Report。

English abstract

Multiphasic contrast-enhanced CT (CECT) is widely used for abdominal lesion characterization, yet it carries inherent risks of contrast-induced nephropathy, escalates acquisition burden, and heavily contributes to radiologist workload. To address these challenges, we introduce a novel multi-center benchmark for multi-organ abdominal disease diagnosis and automated radiology report generation, which learns to synthesize contrast-enhanced findings from single-phase non-contrast CT (NCCT). To support this, we curated a large-scale dataset of paired NCCT-CECT studies and their corresponding contrast-enhanced radiology reports from two centers, partitioned into internal sets and an external validation cohort. Under a unified evaluation protocol, we benchmarked five contemporary deep learning architectures encompassing chest-specific, abdomen-specific, and general-purpose multimodal domains. Extensive experiments demonstrate that NCCT retains diagnostic signals, achieving an average multi-organ AUC of 69.1% on the internal cohort and 63.1% on the external cohort, respectively. By releasing this dataset and standardized benchmark publicly, this study aims to catalyze future research into safer, resource-efficient, and globally accessible contrast-free abdominal imaging workflows. Code is available at: https://github.com/xmed-lab/TriALS-Report.
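
The headline numbers above are averages of per-organ AUCs. As a rough illustration of what such a figure summarises, the sketch below computes ROC AUC as the probability that a positive case receives a higher score than a negative one, then takes an unweighted mean over organs. The unweighted averaging and the organ names are assumptions for illustration; the abstract does not specify how the benchmark aggregates.

```python
def auc(labels, scores):
    """ROC AUC: probability that a positive outranks a negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def mean_auc(per_organ):
    """Unweighted mean of per-organ AUCs; per_organ maps organ -> (labels, scores)."""
    values = [auc(labels, scores) for labels, scores in per_organ.values()]
    return sum(values) / len(values)


# Hypothetical toy predictions for two organs.
toy = {
    "liver": ([1, 0], [0.9, 0.1]),
    "kidney": ([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1]),
}
print(mean_auc(toy))  # 0.875
```

Since an uninformative scorer yields an AUC of 0.5, values of 69.1% and 63.1% indicate usable but clearly imperfect discrimination, which is consistent with the abstract's wording that NCCT "retains diagnostic signals".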

2606.14828 2026-06-16 eess.IV cs.AI cs.CV Cross-list

Leptomeningeal Collateral Detection on DSA via Vessel-Graph Neural Networks

基于血管图神经网络的DSA软脑膜侧支检测

Junyong Cao, Hakim Baazaoui, Chinmay Prabhakar, Suprosanna Shit, Lukas Bastian Otto, Susanne Wegener, Bjoern Menze, Ezequiel de la Rosa

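Affiliations: University of Zurich; University Hospital Zurich

AI summary: Proposes a hybrid graph-pixel architecture that classifies individual vessel segments on a DSA-derived vessel graph. It is reported as the first method to individualize leptomeningeal collaterals in DSA, with a PR-AUC of 0.434 that exceeds the graph-only and pixel-only baselines.

AI Chinese abstract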

软脑膜侧支（LMCs）是急性缺血性卒中的重要预后因素。现有自动化方法依赖CT血管造影（CTA），但单个LMCs通常太小而无法在CTA上分辨，限制了这些方法只能进行粗略的侧支评分。数字减影血管造影（DSA）以更高的分辨率可视化单个侧支，但当前评估仍依赖主观的手动分级量表，存在评分者间一致性差的问题。我们提出一个框架，将侧支检测形式化为对从DSA导出的图上的单个血管段进行分类。一种混合图-像素架构将拓扑感知的图分支与密集像素分支相结合，在共享的节点概率空间中融合。在五折交叉验证中，融合模型的PR-AUC达到0.434，优于纯图（0.403）和纯像素（0.362）基线。据我们所知，这是首个能够在DSA中实现LMCs个体化的方法，允许对每个血管进行精确的定量评估。这种整合将DSA评估转向客观评价，支持未来对单个LMCs的生物标志物和模式发现。

English abstract

Leptomeningeal collaterals (LMCs) are an important prognostic factor in acute ischemic stroke. Existing automated methods rely on CT angiography (CTA), but individual LMCs are often too small to be resolved on CTA, limiting these methods to coarse collateral scoring. Digital subtraction angiography (DSA) visualizes individual collaterals at superior resolution, yet current assessment remains subjective, relying on manual grading scales that suffer from poor inter-rater agreement. We present a framework that formulates collateral detection as the classification of individual vessel segments on a graph derived from DSA. A hybrid graph-pixel architecture combines a topology-aware graph branch with a dense pixel branch, fused in a shared node-probability space. In a five-fold cross-validation setting, the fused model achieves a PR-AUC of 0.434, outperforming the graph-only (0.403) and pixel-only (0.362) baselines. To our knowledge, this is the first method to enable the individualization of LMCs in DSA, allowing for precise per-vessel quantitative assessment. This integration shifts DSA assessment toward objective evaluation, supporting future biomarker and pattern discovery for individual LMCs.
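
To make the reported PR-AUC and the idea of fusing two branches in a shared node-probability space more concrete, here is a minimal sketch. It is not the paper's architecture: the fusion shown is a plain convex combination with a hypothetical `weight`, and PR-AUC is estimated with average precision, which is one common estimator and may differ from the one the authors used.

```python
def fuse(graph_probs, pixel_probs, weight=0.5):
    """Convex combination of two per-segment probability lists."""
    return [weight * g + (1.0 - weight) * p for g, p in zip(graph_probs, pixel_probs)]


def average_precision(labels, scores):
    """Mean of precision@k over the ranks k that hold a positive segment."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("average precision needs at least one positive")
    return sum(precisions) / len(precisions)


# Four vessel segments, two of which are collaterals (label 1).
labels = [1, 0, 1, 0]
fused = fuse([0.9, 0.9, 0.5, 0.1], [0.9, 0.7, 0.9, 0.1])
ap = average_precision(labels, fused)
```

Unlike ROC AUC, the chance level of average precision is roughly the fraction of positive segments, so a value such as 0.434 has to be read against how rare collateral segments are in the data.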

2606.15000 2026-06-16 eess.IV cs.CV Cross-list

Polyp-D2ATL: Deep Domain-Adaptive Transfer Learning for Colorectal Polyp Classification under Label Distribution Shift

Polyp-D2ATL：标签分布偏移下用于结直肠息肉分类的深度域自适应迁移学习

Sajad Jabarzadeh Ghandilu, Maryam Sadat Hosseini Azad, Shahriar Baradaran Shokouhi, Emad Fatemizadeh

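Affiliations: School of Electrical Engineering, Sharif University of Technology; School of Electrical Engineering, Iran University of Science and Technology

AI summary: Proposes the Polyp-D2ATL framework, which uses a dedicated training strategy to address imbalanced data, label distribution shift, and cross-modality generalization, and significantly outperforms existing models on the PICCOLO dataset.

Comments: 15 pages, 5 figures, 7 tables

AI Chinese abstract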

早期且高准确率地预测结直肠息肉，作为最危险癌症类型之一的重要标志，将有助于挽救更多生命。尽管结直肠息肉分类取得了进展，但在获得能够诊断真实场景中伴有不同特征的难以预测息肉的自动化息肉预测系统方面仍存在许多挑战，其中模型需要成功处理不平衡数据、标签分布偏移和跨模态泛化。在本研究中，我们提出了Polyp-D2ATL，一种新颖的框架，并辅以特定的训练策略，缓解了这些限制，并有效预测了属于NICE分类的不同类别息肉。我们在PICCOLO验证集和测试集上的大量实验表明，所提出的Polyp-D2ATL在各种可靠指标上显著优于现有最先进模型，在验证集上达到了82.38%的准确率、77.49%的宏F1分数和87.47%的特异性，同时在保留的测试集上取得了一致的改进，证明了所提出方法的泛化能力和临床适用性。

English abstract

Early and accurate prediction of colorectal polyps, an important sign of one of the most dangerous types of cancer, can save lives. Despite advances in colorectal polyp classification, many challenges remain in building an automated polyp prediction system that can diagnose difficult-to-predict polyps with varied features in real scenarios, where the model must handle imbalanced data, label distribution shift, and cross-modality generalization. In this study, we propose Polyp-D2ATL, a novel framework with a specific training strategy that mitigates these limitations and effectively predicts the different classes of polyps in the NICE classification. Our extensive experiments on the PICCOLO validation and test sets demonstrate that the proposed Polyp-D2ATL significantly outperforms existing state-of-the-art models across various reliable metrics, achieving an accuracy of 82.38%, a Macro-F1 of 77.49%, and a specificity of 87.47% on the validation set, alongside consistent improvements on the held-out test set, which demonstrates the generalization capacity and clinical applicability of the proposed approach.
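
Accuracy, Macro-F1, and specificity can diverge noticeably on imbalanced data, which is why the abstract reports all three. The sketch below shows one way to compute them from label lists. The one-vs-rest, macro-averaged definition of multi-class specificity is an assumption made here; the abstract does not say which averaging the paper uses.

```python
def one_vs_rest_counts(y_true, y_pred, cls):
    """Confusion counts (tp, fp, fn, tn) treating `cls` as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn


def macro_metrics(y_true, y_pred):
    """Accuracy plus macro-averaged F1 and specificity over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s, specs = [], []
    for cls in classes:
        tp, fp, fn, tn = one_vs_rest_counts(y_true, y_pred, cls)
        f1_denominator = 2 * tp + fp + fn
        f1s.append(2 * tp / f1_denominator if f1_denominator else 0.0)
        specs.append(tn / (tn + fp) if (tn + fp) else 0.0)
    accuracy = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
    return {
        "accuracy": accuracy,
        "macro_f1": sum(f1s) / len(f1s),
        "macro_specificity": sum(specs) / len(specs),
    }


# Toy three-class example with made-up labels.
metrics = macro_metrics([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0])
```

Macro averaging gives every class equal weight regardless of its frequency, so a model that neglects a rare polyp class is penalised in Macro-F1 even when its overall accuracy stays high.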

2606.15037 2026-06-16 cs.CL cs.CV Cross-list

ReportQA: QA-Based Radiology Report Evaluation

ReportQA: 基于问答的放射学报告评估

Yiming Shi, Shaoshuai Yang, Xi Chen, Haolin Li, Hengyu Zhang, Che Jiang, Kaiwen Wang, Xun Zhu, Dong Xie, Fei Wang, Dejing Dou, Miao Li, Ji Wu

Affiliations: Department of Electronic Engineering, Tsinghua University; College of AI, Tsinghua University; Beijing National Research Center for Information Science and Technology; Beijing Electronic Digital & Intelligence

AI summary: Proposes the ReportQA framework, which uses knowledge trees and LLMs to extract structured information from reports and generate QA pairs, uses question-answering accuracy as the evaluation metric, and aligns better with radiologist judgments than existing metrics.

AI Chinese abstract

放射学报告评估对于推进自动报告生成至关重要。自然语言生成指标具有有限的临床相关性。临床效能（CE）指标评估重要的医学发现，但主要关注存在性且仅覆盖有限的实体集。由于严重依赖人工标注，CE指标难以扩展临床实体或属性。在临床实践中，放射学报告作为信息传递的媒介。临床医生使用它们执行下游诊断任务，而无需直接检查图像。基于这一见解，我们提出了ReportQA，一个临床相关且灵活的放射学报告评估框架，支持对放射学报告生成系统进行详细的定量分析。我们首先收集涵盖多种成像模态和解剖区域的数据集。然后，在放射科医生的指导下构建临床实体和属性的知识树，并使用大型语言模型（LLM）从原始报告中提取结构化信息。接下来，我们从预定义模板生成QA对，并通过自过滤和基于报告的过滤进行质量控制。在评估期间，将报告视为上下文，LLM作为评判模型来回答QA对。基于得到的QA准确率，我们引入了QAScore指标。与现有指标相比，QAScore显示出与放射科医生判断更好的对齐。在多个最先进的视觉-语言模型上的实验表明，当前基于报告的推理范式难以学习细粒度的临床表示，并表现出强烈的负先验偏差。相比之下，问题驱动的推理提供了一种更有效的替代方案。为了可重复性和可扩展性，我们发布了知识树、结构化报告和QA对，以及用于QA构建和评估的流水线代码。

English abstract

Radiology report evaluation is essential for advancing automated report generation. Natural language generation metrics have limited clinical relevance. Clinical efficacy (CE) metrics evaluate important medical findings, but focus mainly on presence and cover only a limited set of entities. Due to heavy reliance on manual annotations, it is difficult for CE metrics to extend clinical entities or attributes. In clinical practice, radiology reports serve as a medium for information transfer. Clinicians use them to perform downstream diagnostic tasks without directly inspecting images. Based on this insight, we propose ReportQA, a clinically relevant and flexible radiology report evaluation framework, supporting detailed quantitative analysis of radiology report generation systems. We first collect datasets covering multiple imaging modalities and anatomical regions. We then construct knowledge trees of clinical entities and attributes with radiologist guidance, and use large language models (LLMs) to extract structured information from raw reports. Next, we generate QA pairs from predefined templates and apply quality control through self-filtering and report-based filtering. During evaluation, the report is treated as context, and an LLM acts as a judge model to answer the QA pairs. Based on the resulting QA accuracy, we introduce the QAScore metric. Compared with existing metrics, QAScore shows better alignment with radiologist judgments. Experiments on multiple state-of-the-art vision-language models reveal that current report-based inference paradigms struggle to learn fine-grained clinical representations and exhibit strong negative prior biases. In contrast, question-driven inference provides a more effective alternative. For reproducibility and extensibility, we release the knowledge trees, structured reports, and QA pairs, along with the pipeline code for QA construction and evaluation.
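
The evaluation step described above reduces to scoring answers that a judge model gives when it sees only the generated report. The sketch below shows that final scoring step as plain accuracy over QA pairs. It is a simplified assumption of how a QA-accuracy score might be computed, not the released QAScore code: the exact-match comparison after case and whitespace normalisation, and the function names, are hypothetical.

```python
def normalise(answer):
    """Lower-case an answer and collapse runs of whitespace."""
    return " ".join(answer.lower().split())


def qa_accuracy(gold_answers, judged_answers):
    """Fraction of QA pairs where the judge's answer matches the gold answer."""
    if not gold_answers or len(gold_answers) != len(judged_answers):
        raise ValueError("need the same non-zero number of gold and judged answers")
    correct = sum(
        1 for g, a in zip(gold_answers, judged_answers) if normalise(g) == normalise(a)
    )
    return correct / len(gold_answers)


# Gold answers come from the reference report; judged answers from the generated one.
gold = ["Yes", "left lower lobe", "no"]
judged = ["yes", "Left  lower lobe", "yes"]
score = qa_accuracy(gold, judged)  # 2 of 3 answers agree
```

Because each question targets one entity or attribute, a per-question breakdown of the same comparison shows which findings a generated report gets wrong, which is the kind of detailed analysis the abstract refers to.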

2509.25594 2026-06-16 cs.CV cs.AI Updated version

K-Prism: A Knowledge-Guided and Prompt Integrated Universal Medical Image Segmentation Model

K-Prism: 一种知识引导与提示集成的通用医学图像分割模型

Bangwei Guo, Yunhe Gao, Meng Ye, Difei Gu, Yang Zhou, Leon Axel, Dimitris Metaxas

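Affiliations: Rutgers University; Stanford University; The University of Texas at Arlington; New York University

AI summary: Proposes K-Prism, a unified segmentation framework that integrates three knowledge paradigms (semantic priors, in-context knowledge, and interactive feedback) through a dual-prompt representation and a Mixture-of-Experts decoder, achieving state-of-the-art performance in semantic, in-context, and interactive segmentation on 18 datasets.

Journal ref: International Conference on Learning Representations (ICLR), 2026

AI Chinese abstract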

医学图像分割是临床决策的基础，但现有模型仍然碎片化。它们通常基于单一知识源训练，并针对特定任务、模态或器官。这种碎片化与临床实践形成鲜明对比，在临床实践中，专家无缝整合多种知识：来自训练集的解剖先验、来自参考病例的基于示例的推理，以及通过实时交互的迭代细化。我们提出了$\textbf{K-Prism}$，一个统一的分割框架，通过系统整合三种知识范式来反映这种临床灵活性：(i) 从标注数据集中学习的$\textit{语义先验}$，(ii) 来自少样本参考示例的$\textit{上下文知识}$，以及(iii) 来自用户输入（如点击或涂鸦）的$\textit{交互反馈}$。我们的关键见解是，这些异构知识源可以编码为双提示表示：定义$\textit{分割什么}$的1-D稀疏提示和指示$\textit{关注哪里}$的2-D密集提示，然后通过混合专家（MoE）解码器动态路由。这种设计使得范式之间灵活切换，并能够在不同任务上进行联合训练，而无需修改架构。在涵盖多种模态（CT、MRI、X射线、病理、超声等）的18个公共数据集上的全面实验表明，K-Prism在语义、上下文和交互分割设置中均达到了最先进的性能。

English abstract

Medical image segmentation is fundamental to clinical decision-making, yet existing models remain fragmented. They are usually trained on single knowledge sources and specific to individual tasks, modalities, or organs. This fragmentation contrasts sharply with clinical practice, where experts seamlessly integrate diverse knowledge: anatomical priors from training, exemplar-based reasoning from reference cases, and iterative refinement through real-time interaction. We present $\textbf{K-Prism}$, a unified segmentation framework that mirrors this clinical flexibility by systematically integrating three knowledge paradigms: (i) $\textit{semantic priors}$ learned from annotated datasets, (ii) $\textit{in-context knowledge}$ from few-shot reference examples, and (iii) $\textit{interactive feedback}$ from user inputs like clicks or scribbles. Our key insight is that these heterogeneous knowledge sources can be encoded into a dual-prompt representation: 1-D sparse prompts defining $\textit{what}$ to segment and 2-D dense prompts indicating $\textit{where}$ to attend, which are then dynamically routed through a Mixture-of-Experts (MoE) decoder. This design enables flexible switching between paradigms and joint training across diverse tasks without architectural modifications. Comprehensive experiments on 18 public datasets spanning diverse modalities (CT, MRI, X-ray, pathology, ultrasound, etc.) demonstrate that K-Prism achieves state-of-the-art performance across semantic, in-context, and interactive segmentation settings.
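
The phrase "dynamically routed through a Mixture-of-Experts (MoE) decoder" can be unpacked with a generic routing sketch. This is a textbook-style top-k gating example, not K-Prism's decoder: the softmax gate, the choice of `k`, the renormalisation of the selected weights, and the representation of each expert's output as a plain vector are all assumptions for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route(gate_logits, expert_outputs, k=2):
    """Mix the outputs of the k highest-weighted experts with renormalised weights."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    dim = len(expert_outputs[0])
    return [
        sum(probs[i] / norm * expert_outputs[i][d] for i in top) for d in range(dim)
    ]


# Three hypothetical experts; the gate strongly disfavours the third.
mixed = route([0.0, 0.0, -100.0], [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]], k=2)
```

In this toy case the first two experts receive equal weight and the third is ignored, so the mixed output is their average. In a prompt-driven model the gate logits would depend on the prompts, so different prompt types would engage different experts without any change to the architecture.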

2602.02186 2026-06-16 cs.CV Updated version

Learning Topology-Aware Implicit Field for Unified Pulmonary Tree Modeling with Incomplete Topological Supervision

学习拓扑感知隐式场用于不完整拓扑监督下的统一肺树建模

Ziqiao Weng, Jiancheng Yang, Kangxian Xie, Bo Zhou, Weidong Cai

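Affiliations: School of Computer Science, The University of Sydney; ELLIS Institute Finland; Aalto University; Department of Computer Science and Engineering, University of Buffalo; Department of Radiology, Northwestern University

AI summary: Proposes TopoField, a framework that learns a continuous implicit field from sparse point clouds, repairs topological incompleteness in pulmonary trees without complete annotations, and jointly performs anatomical labeling and lung segment reconstruction with high efficiency and robustness.

Comments: 20 pages

AI Chinese abstract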

从CT图像中提取的肺树经常表现出拓扑不完整性，例如缺失或断开的分支，这严重降低了下游解剖分析的质量，并限制了现有肺树建模流程的适用性。当前方法通常依赖密集体积处理、显式图推理或通用点云补全先验，导致效率有限、结构感知弱以及在现实结构损坏下的鲁棒性降低。我们提出TopoField，一个拓扑感知隐式建模框架，将拓扑修复视为首要的建模问题，并实现肺树分析的统一多任务推理。TopoField使用稀疏表面和骨架点云表示肺部解剖结构，并通过在\textit{已经}不完整的树上合成引入的结构破坏进行训练，学习一个支持拓扑修复的连续隐式场，无需依赖完整或显式的断开标注。基于修复后的隐式表示，通过任务特定的隐式函数在单次前向传播中联合推断解剖标记和肺段重建。在Lung3D+数据集上的大量实验表明，TopoField在具有挑战性的不完整场景下持续改善拓扑完整性，并实现准确的解剖标记和肺段重建。我们进一步在外部分割模型的真实不完整输出上验证TopoField，展示了其对现实分割流程的适用性。由于其隐式公式，TopoField实现了高计算效率，每个案例完成所有任务仅需一秒多，突显了其在大规模和时间敏感的临床应用中的实用性。

English abstract

Pulmonary trees extracted from CT images frequently exhibit topological incompleteness, such as missing or disconnected branches, which substantially degrades downstream anatomical analysis and limits the applicability of existing pulmonary tree modeling pipelines. Current approaches typically rely on dense volumetric processing, explicit graph reasoning, or generic point cloud completion priors, leading to limited efficiency, weak structural awareness, and reduced robustness under realistic structural corruption. We propose TopoField, a topology-aware implicit modeling framework that treats topology repair as a first-class modeling problem and enables unified multi-task inference for pulmonary tree analysis. TopoField represents pulmonary anatomy using sparse surface and skeleton point clouds and learns a continuous implicit field that supports topology repair without relying on complete or explicit disconnection annotations, by training on synthetically introduced structural disruptions over \textit{already} incomplete trees. Building upon the repaired implicit representation, anatomical labeling and lung segment reconstruction are jointly inferred through task-specific implicit functions within a single forward pass. Extensive experiments on the Lung3D+ dataset demonstrate that TopoField consistently improves topological completeness and achieves accurate anatomical labeling and lung segment reconstruction under challenging incomplete scenarios. We further validate TopoField on real incomplete outputs from an external segmentation model, demonstrating its applicability to realistic segmentation pipelines. Owing to its implicit formulation, TopoField attains high computational efficiency, completing all tasks in just over one second per case, highlighting its practicality for large-scale and time-sensitive clinical applications.

2603.12514 2026-06-16 cs.CV cs.LG 版本更新

CT-VDETR: Semi-supervised 3D Trauma Detection in Computed Tomography (CT) scans using Dense Vertex Relative Position Encoding

CT-VDETR：使用密集顶点相对位置编码的CT扫描半监督3D创伤检测

Shivam Chaudhary, Sheethal Bhat, Andreas Maier

发表机构 * University of Freiburg（弗赖堡大学）

AI总结提出CT-VDETR框架，结合自监督预训练和半监督transformer检测，在仅78个标注体数据上实现31.33% mAP@0.50，为纯监督基线（20.45%）的1.53倍。

Comments v2: Updated results with corrected dataset split. Revised Table 1 (mAP@0.50: 31.33% SSL vs 20.45% baseline, 1.53x improvement; mAP@0.75: 30.95% vs 10.45%, 2.96x improvement). Updated validation curves showing stable convergence. No methodology changes. 7 pages, 4 figures, 2 tables. Code: https://github.com/shivasmic/3d-trauma-detection-ssl

详情

AI中文摘要

在腹部CT中准确检测和定位创伤性损伤仍然具有挑战性，因为体素级标注有限且获取成本高。我们提出了一种标签高效的3D腹部创伤检测框架，该框架将自监督预训练与半监督基于transformer的检测相结合。首先，我们在1098个CT体数据上使用掩码图像建模（MIM）预训练3D U-Net编码器，用于解剖表示学习。接着，我们通过特征适配器将V-DETR适应到密集体积CT，该适配器将编码器特征网格转换为紧凑的token序列，用于transformer解码。然后，将预训练编码器与V-DETR和3D顶点相对位置编码（3D V-RPE）集成，以改善不规则形状损伤的定位。最后，在半监督教师-学生一致性正则化中，利用额外的2000个未标注体数据进行检测器训练。据我们所知，这是3D DETR风格检测器首次应用于RSNA腹部创伤检测任务。在该基准上，所提方法仅使用78个标注训练体数据就达到了31.33%的测试mAP@0.50，相当于纯监督训练的1.53倍提升。这些结果表明，将医学领域预训练与半监督学习相结合是标签稀缺的3D医学检测的有效策略。

英文摘要

Accurate detection and localization of traumatic injuries in abdominal CT remain challenging because voxel-level annotations are limited and expensive to obtain. We present a label-efficient framework for 3D abdominal trauma detection that combines self-supervised pretraining with semi-supervised transformer-based detection. First, we use Masked Image Modeling (MIM) on 1098 CT volumes to pretrain a 3D U-Net encoder for anatomical representation learning. Next, we adapt V-DETR to dense volumetric CT through a feature adapter that converts the encoder feature grid into a compact token sequence for transformer decoding. The pretrained encoder is then integrated with V-DETR and 3D Vertex Relative Position Encoding (3D V-RPE) to improve the localization of irregularly shaped injuries. Finally, semi-supervised teacher-student consistency regularization leverages 2,000 additional unlabeled volumes during detector training. To the best of our knowledge, this is the first application of a 3D DETR-style detector to the RSNA abdominal trauma detection task. On this benchmark, the proposed method achieves 31.33% test mAP@0.50 using only 78 labeled training volumes, corresponding to a 1.53x improvement over supervised-only training. These results show that combining medical-domain pretraining with semi-supervised learning is an effective strategy for label-scarce 3D medical detection.
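
As a quick sanity check on the numbers quoted in this entry, the sketch below recomputes the improvement factors from the mAP values given in the v2 comment. The dictionary and function names are illustrative only and are not taken from the authors' repository.

```python
# mAP values in percent, copied from the v2 comment of this entry:
# (semi-supervised result, supervised-only baseline).
reported = {
    "mAP@0.50": (31.33, 20.45),
    "mAP@0.75": (30.95, 10.45),
}

def improvement_factor(ssl: float, baseline: float) -> float:
    """Ratio of the semi-supervised score to the supervised-only baseline."""
    return ssl / baseline

# Round to two decimals, the precision used in the comment.
factors = {name: round(improvement_factor(*pair), 2) for name, pair in reported.items()}
print(factors)
```

Both ratios agree with the 1.53x and 2.96x figures stated in the comment, so the "improvement" is a ratio to the baseline rather than an absolute gain in percentage points.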

2603.15525 2026-06-16 cs.CV cs.HC 版本更新

Clinically Aware Synthetic Image Generation for Concept Coverage in Chest X-ray Models

临床感知的合成图像生成用于胸部X光模型的概念覆盖

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

发表机构 * University of Edinburgh（爱丁堡大学）； NHS Lothian（洛锡安国家健康服务）

AI总结提出CARPA框架，通过解剖约束的概念扰动生成合成胸部X光图像，扩展临床概念覆盖，提升模型性能与可靠性。

Comments Accepted for presentation at the IJCAI-ECAI 2026 RobustifAI workshop

详情

AI中文摘要

用于胸部X光诊断的深度学习模型受到公开训练数据集中临床有意义概念组合覆盖有限的限制。虽然合成图像生成已被探索以增加数据多样性，但现有方法很少强制执行临床或解剖约束，限制了其在提高模型可靠性方面的效用。我们提出了CARPA，一个临床感知和解剖基础的合成胸部X光生成框架，该框架在保持解剖结构的同时对临床概念向量进行有针对性的扰动。通过生成具有受控概念插入和删除的解剖忠实合成图像，CARPA扩展了临床相关的概念覆盖。我们通过七种骨干架构评估CARPA，在合成子集上微调模型，并在一个保留的MIMIC-CXR基准上进行测试。与先前的概念扰动方法相比，在CARPA生成的图像上微调一致地提高了精确率-召回率性能，降低了预测不确定性，并改善了模型校准。结构和语义分析表明高解剖保真度、强概念对齐和低语义不确定性。两位专家放射科医生的评估进一步确认了真实感和临床一致性。这些结果共同表明，解剖基础的概念扰动能够更有效地利用合成数据，提高胸部X光分类模型的性能和可靠性，并支持更安全的临床部署。

英文摘要

Deep learning models for chest X-ray diagnosis are constrained by limited coverage of clinically meaningful concept combinations in publicly available training datasets. While synthetic image generation has been explored to increase data diversity, existing methods rarely enforce clinical or anatomical constraints, limiting utility for improving model reliability. We propose CARPA, a clinically aware and anatomically grounded framework for synthetic chest X-ray generation that applies targeted perturbations to clinical concept vectors while preserving anatomical structure. By producing anatomically faithful synthetic images with controlled concept insertions and deletions, CARPA expands clinically relevant concept coverage. We evaluate CARPA across seven backbone architectures by fine-tuning models on synthetic subsets and testing on a held-out MIMIC-CXR benchmark. Compared to prior concept perturbation approaches, fine-tuning on CARPA-generated images consistently improves precision-recall performance, reduces predictive uncertainty, and improves model calibration. Structural and semantic analyses demonstrate high anatomical fidelity, strong concept alignment, and low semantic uncertainty. Evaluation by two expert radiologists further confirms realism and clinical agreement. Together, these results show that anatomically grounded concept perturbations enable more effective use of synthetic data, improving both performance and reliability of chest X-ray classification models and supporting safer clinical deployment.

2605.05761 2026-06-16 cs.CV 版本更新

iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models

iTRIALSPACE：用于肺CT模型受控评估的可编程虚拟病灶试验

Fakrul Islam Tushar, Umme Hafsa Momy, Joseph Y. Lo, Geoffrey D. Rubin

发表机构 * Department of Radiology and Imaging Sciences, University of Arizona（亚利桑那大学放射学与影像科学系）； Department of Biomedical Engineering, Florida International University（佛罗里达国际大学生物医学工程系）； Center for Virtual Imaging Trials, Department of Radiology, Duke University Medical Center（杜克大学医学中心放射科虚拟成像试验中心）

AI总结提出可编程评估框架iTRIALSPACE，通过四阶段流水线（结节分析、试验规范、掩膜插入、CT合成）构建受控虚拟病灶试验，揭示固定基准无法发现的模型缺陷。

Comments 11 pages, 13 figures, 13 tables

详情

AI中文摘要

我们引入iTRIALSPACE，一个用于肺CT模型受控评估的可编程评估框架。标准基准是静态回顾性集合，混杂了病灶大小、肺叶分布、解剖结构和采集背景，使得难以确定什么因素在结构上驱动模型准确性。iTRIALSPACE通过四阶段流水线（多数据集结节分析、显式试验规范、解剖感知掩膜插入和ControlNet条件CT合成）将真实临床CT和病灶轮廓组合成受控虚拟病灶试验来解决这一限制。该框架基于一个统一的54属性结节分析数据集，涵盖来自七个公共CT源的13,140个标注结节，并实例化为13种试验模式。我们在一个涵盖三种医学VLM、四种空间引导条件和三种临床任务的55,469样本虚拟病灶研究中评估iTRIALSPACE。在所有13种模式下，合成基底保持在真实到真实FID基线内，且合成性能排名强烈转移到真实临床数据（ρ = 0.93，p < 10^{-15}）。受控试验模式揭示了固定分布基准无法获得的发现，包括在肺叶均衡采样下的捷径驱动尺寸预测崩溃，以及双交叉分析中宿主与供体方差比达到8.9倍和3.3倍。这些结果将iTRIALSPACE定位为一种可审计的评估基础设施，用于超越静态回顾性基准的受控、可证伪测试。

英文摘要

We introduce iTRIALSPACE, a programmable evaluation framework for controlled assessment of lung CT models. Standard benchmarks are static retrospective collections that entangle lesion size, lobe prevalence, anatomy, and acquisition context, making it difficult to determine what structurally drives model accuracy. iTRIALSPACE addresses this limitation by composing real clinical CTs and lesion profiles into controlled virtual lesion trials through a four-stage pipeline: multi-dataset nodule profiling, explicit trial specification, anatomy-aware mask insertion, and ControlNet-conditioned CT synthesis. The framework is built on a unified 54-attribute nodule-profile dataset spanning 13,140 annotated nodules from seven public CT sources and instantiated as 13 trial modes. We evaluate iTRIALSPACE in a 55,469-sample Virtual Lesion Study spanning three medical VLMs, four spatial-guidance conditions, and three clinical tasks. Across all 13 modes, the synthetic substrate remains within the real-to-real FID baseline, and synthetic performance rankings transfer strongly to real clinical data ($\rho = 0.93$, $p < 10^{-15}$). Controlled trial modes expose findings unavailable to fixed-distribution benchmarks, including shortcut-driven size prediction collapse under lobe-equalized sampling and host-to-donor variance ratios of 8.9x and 3.3x in twin-cross analysis. These results position iTRIALSPACE as an auditable evaluation infrastructure for controlled, falsifiable testing beyond static retrospective benchmarks.
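
The abstract reports that performance rankings on synthetic data transfer to real data with ρ = 0.93. It does not say which statistic ρ denotes. Assuming it is a Spearman rank correlation, the sketch below shows how such a value could be computed for paired scores. The scores are invented for illustration and the helper names are hypothetical.

```python
def ranks(values):
    """Rank positions (1 = smallest). Assumes there are no tied values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result

def spearman_rho(x, y):
    """Spearman rank correlation via the no-ties formula 1 - 6*sum(d^2)/(n(n^2-1))."""
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks(x), ranks(y)))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Invented scores for four model configurations (not from the paper).
synthetic_scores = [0.61, 0.74, 0.58, 0.80]
real_scores = [0.55, 0.70, 0.57, 0.77]
rho = spearman_rho(synthetic_scores, real_scores)
print(rho)
```

A value near 1 means that configurations ranked highly on synthetic trials are also ranked highly on real data, which is the kind of transfer the abstract describes.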

2606.02877 2026-06-16 cs.CV 版本更新

Pathway-Structured Privileged Distillation for Deployable Computational Pathology

面向可部署计算病理学的通路结构特权蒸馏

Yongxin Guo, Hao Lu, Onur Koyun, Muhammet Demir, Metin Gurcan

发表机构 * School of Medicine, Wake Forest University（维克森林大学医学院）

AI总结提出MoPE框架，通过通路索引病理专家和记忆使用对齐，将多模态学习转化为仅组织学推理的特权蒸馏，提升全切片图像推理性能。

详情

AI中文摘要

整合转录组学和组织病理学可以改善癌症风险建模，但在常规环境中RNA分析的有限可用性限制了其实用性。本文引入了通路专家混合（MoPE），这是一个知识蒸馏框架，将多模态学习重新定义为仅组织学推理的特权蒸馏。MoPE的动机来自RNA谱和全切片图像之间的部分可观测性：组织学可以捕获某些分子程序相关的形态学后果，但不能期望重建完整的转录组状态。MoPE编码RNA衍生的通路，并通过记忆使用对齐将分子监督转移到通路索引的病理专家。在各种公共基准测试和两个独立的乳腺癌队列中，与基线方法相比，MoPE持续改善了仅WSI推理性能。通路使用分析和人工审核的视觉检查提供了模型行为和候选形态学相关读数的有限检查。这些结果支持通路结构特权蒸馏作为在训练期间利用分子信息同时保持无RNA推理的有前途的途径。

英文摘要

Integrating transcriptomics and histopathology can improve cancer risk modelling, yet practical use is constrained by the limited availability of RNA profiling in routine settings. Here we introduce Mixture of Pathway Experts (MoPE), a knowledge-distillation framework that reframes multimodal learning as privileged distillation for histology-only inference. MoPE is motivated by the partial observability between RNA profiles and whole-slide images: histology can capture morphology-linked consequences of certain molecular programmes, but cannot be expected to reconstruct the full transcriptomic state. MoPE encodes RNA-derived pathways and transfers the molecular supervision to pathway-indexed pathology experts through memory-usage alignment. Across diverse public benchmarks and two independent breast cancer cohorts, MoPE consistently improved WSI-only inference performance relative to baseline methods. Pathway-usage analyses and human-audited visual inspection provide bounded inspection of model behaviour and candidate morphology-linked readouts. These results support pathway-structured privileged distillation as a promising route to using molecular information during training while preserving RNA-free inference.

2411.05824 2026-06-16 eess.IV cs.CV cs.LG 版本更新

Navigating Distribution Shifts in Medical Image Analysis: A Survey

医学图像分析中的分布偏移导航：综述

Zixian Su, Jingwei Guo, Xi Yang, Qiufeng Wang, Frans Coenen, Amir Hussain, Kaizhu Huang

发表机构 * Life Simulation Research Center, Beijing Academy of Artificial Intelligence（北京智源人工智能研究院生命模拟研究中心）； Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology（阿卜杜拉国王科技大学电气与数学科学与工程学部）； Department of Intelligent Science, School of Advanced Technology, Xi’an Jiaotong-Liverpool University（西交利物浦大学先进技术学院智能科学系）； Computer Science, School of Computer Science and Informatics, University of Liverpool（利物浦大学计算机科学与信息学学院）； SDAIA-KFUPM Joint Research Centre for Artificial Intelligence, King Fahd University of Petroleum and Minerals（法赫德国王石油与矿产大学SDAIA-KFUPM人工智能联合研究中心）； Nuffield Department of Primary Care Health Sciences, University of Oxford（牛津大学纳菲尔德初级保健健康科学系）

AI总结本文系统综述了应对医学图像分析中分布偏移的深度学习方法，按临床约束分类为联合训练、联邦学习、微调和域泛化，并揭示方法从显式对齐向不确定性建模的转变。

详情

AI中文摘要

医学图像分析（MedIA）已成为现代医疗保健中不可或缺的一部分，增强了临床诊断和个性化治疗。尽管深度学习（DL）技术取得了显著进展，但其实际部署面临分布偏移带来的挑战，即基于特定数据集训练的模型在不同医院或患者群体的数据上表现不佳。为解决这一问题，研究人员积极开发策略以提高DL模型的适应性，使其能够在陌生环境中有效使用。本文系统综述了将DL技术应用于受分布偏移影响的MedIA系统的方法。我们并非按技术特征组织现有方法，而是明确将现实临床约束（如有限的数据可访问性、严格的隐私要求和异构协作协议）与能够解决这些约束的技术范式联系起来。通过建立操作约束与方法论演变之间的这种联系，我们将现有工作分类为联合训练、联邦学习、微调和域泛化，每种方法对应特定的医疗场景。除了这种分类，我们的实证分析表明，随着这些范式中域信息逐渐变得不可访问，性能改进变得越来越受限，并进一步揭示了方法论焦点从显式分布对齐向不确定性感知建模的逐渐转变，最终指向在实际MedIA中需要更多可部署性感知的设计。

英文摘要

Medical Image Analysis (MedIA) has become indispensable in modern healthcare, enhancing clinical diagnostics and personalized treatment. Despite the remarkable advancements supported by deep learning (DL) technologies, their practical deployment faces challenges posed by distribution shifts, where models trained on specific datasets underperform on data from different hospitals or patient populations. To address this issue, researchers have been actively developing strategies to increase the adaptability of DL models, enabling their effective use in unfamiliar environments. This paper systematically reviews approaches that apply DL techniques to MedIA systems affected by distribution shifts. Rather than organizing existing methods by technical characteristics, we explicitly bridge real-world clinical constraints -- such as limited data accessibility, strict privacy requirements, and heterogeneous collaboration protocols -- with the technical paradigms able to address them. By establishing this connection between operational constraints and methodological evolution, we categorize existing works into Joint Training, Federated Learning, Fine-tuning, and Domain Generalization, each aligned with specific healthcare scenarios. Beyond this taxonomy, our empirical analysis suggests that, as domain information becomes progressively less accessible across these paradigms, performance improvements become increasingly constrained, and further uncovers a gradual shift in methodological focus from explicit distribution alignment toward uncertainty-aware modeling, ultimately pointing to the need for more deployability-aware design in real-world MedIA.

2505.05647 2026-06-16 eess.SP cs.CV 版本更新

A New k-Space Model for Non-Cartesian Fourier Imaging

一种用于非笛卡尔傅里叶成像的新k空间模型

Chin-Cheng Chan, Justin P. Haldar

发表机构 * USC Center for Advanced Research Computing（USC高级研究计算中心）； Signal and Image Processing Institute（信号与图像处理研究所）

AI总结针对传统基于体素的傅里叶成像模型计算成本高、收敛慢且易产生伪影的问题，提出一种基于傅里叶域基展开的新模型，在非笛卡尔MRI重建中实现更优图像质量和更低计算复杂度。

详情

AI中文摘要

在过去的几十年中，使用基于模型的方法重建傅里叶成像数据一直很流行，这些方法可以轻松地融入物理约束和先进的正则化/机器学习先验。最常见的建模方法是将连续图像表示为平移的“体素”基函数的线性组合。尽管这种基于体素的模型已被广泛研究和部署，但它存在长期以来的局限性，包括高计算成本、慢收敛和易产生伪影。在这项工作中，我们从新的角度重新审视该模型，识别出可能之前被忽视的新问题（包括不良近似、环绕和零空间特性）。我们的见解促使我们提出一种新模型，该模型对先前方法的局限性（旧的和新的）更具鲁棒性。具体来说，新模型基于傅里叶域基展开，而不是标准的图像域体素方法。在非笛卡尔MRI重建背景下呈现的示例结果表明，新模型能够改善图像质量（减少伪影）和/或降低计算复杂度（更快的计算和更好的收敛）。

英文摘要

For the past several decades, it has been popular to reconstruct Fourier imaging data using model-based approaches that can easily incorporate physical constraints and advanced regularization/machine learning priors. The most common modeling approach is to represent the continuous image as a linear combination of shifted "voxel" basis functions. Although well-studied and widely-deployed, this voxel-based model is associated with longstanding limitations, including high computational costs, slow convergence, and a propensity for artifacts. In this work, we reexamine this model from a fresh perspective, identifying new issues that may have been previously overlooked (including undesirable approximation, wrap-around, and nullspace characteristics). Our insights motivate us to propose a new model that is more resilient to the limitations (old and new) of the previous approach. Specifically, the new model is based on a Fourier-domain basis expansion rather than the standard image-domain voxel-based approach. Illustrative results, which are presented in the context of non-Cartesian MRI reconstruction, demonstrate that the new model enables improved image quality (reduced artifacts) and/or reduced computational complexity (faster computations and improved convergence).
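
For readers unfamiliar with the two modeling choices contrasted in this abstract, the equations below give a hedged sketch. The first two lines are the standard voxel-based model and its Fourier samples, which follow from the shift theorem. The last line is only a generic form of a Fourier-domain basis expansion. The symbols $\psi$, $a_p$, and $\boldsymbol{\kappa}_p$ are assumptions made here for illustration, and the paper's actual construction may differ.

```latex
% Voxel-based (image-domain) model: shifted copies of one basis function \varphi
\[
f(\mathbf{x}) \approx \sum_{n=1}^{N} c_n \, \varphi(\mathbf{x} - \mathbf{x}_n)
\]

% Fourier samples at non-Cartesian locations \mathbf{k}_m (shift theorem)
\[
d_m = \int f(\mathbf{x}) \, e^{-i 2\pi \mathbf{k}_m \cdot \mathbf{x}} \, d\mathbf{x}
\approx \hat{\varphi}(\mathbf{k}_m) \sum_{n=1}^{N} c_n \, e^{-i 2\pi \mathbf{k}_m \cdot \mathbf{x}_n}
\]

% Generic Fourier-domain basis expansion (illustrative form only)
\[
\hat{f}(\mathbf{k}) \approx \sum_{p=1}^{P} a_p \, \psi(\mathbf{k} - \boldsymbol{\kappa}_p)
\]
```

In the voxel model the unknowns $c_n$ live on an image grid and every measurement couples to all of them through a non-uniform Fourier sum. A Fourier-domain expansion instead places the unknowns directly in k-space, which is the change of viewpoint the abstract describes.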

2604.25371 2026-06-16 q-bio.QM cs.CV 版本更新

PhyloSDF: Phylogenetically-Conditioned Neural Generation of 3D Skull Morphology via Residual Flow Matching

PhyloSDF: 基于系统发育条件的残差流匹配神经生成3D颅骨形态

Kaikwan Lau, Gary P. T. Choi

发表机构 * Department of Mathematics（数学系）

AI总结提出PhyloSDF模型，结合系统发育一致性损失和残差条件流匹配，从少量样本生成符合系统发育关系的3D颅骨形态，在达尔文雀数据集上优于扩散模型和标准流匹配。

详情

AI中文摘要

生成新颖、生物学上可信的三维形态结构是计算进化生物学中的一个基本挑战，其难点在于极端的数据稀缺性以及对生成形状必须尊重物种间系统发育关系的要求。在这项工作中，我们提出了PhyloSDF，一个基于系统发育条件的神经生成模型，用于3D生物形态，它整合了两项创新：(1) 一个由新型系统发育一致性损失正则化的DeepSDF自动解码器，该损失使潜在空间结构与进化距离相关（Pearson r=0.993）；(2) 一个残差条件流匹配（Residual CFM）架构，将生成分解为解析的物种质心查找和学习到的残差预测，从而能够从每个物种仅约4个标本进行生成。我们在达尔文雀及其近缘物种的24个物种的100个微CT扫描颅骨上评估了PhyloSDF。该模型生成的新网格在潜在编码层面达到真实种内变异的88-129%，所有180个生成网格均经验证并非对训练样本的记忆。残差CFM在保真度（Chamfer距离0.00181 vs. 0.00190）和形态测量Fréchet距离（10,641 vs. 13,322）上均超越了去噪扩散（在此尺度下完全失败）、标准流匹配（模式坍缩至3-6%变异）以及高斯混合基线。跨18个物种的留一物种实验展示了系统发育外推能力，平滑的潜在插值产生了生物学上可信的祖先颅骨重建。

英文摘要

Generating novel, biologically plausible three-dimensional morphological structures is a fundamental challenge in computational evolutionary biology, hampered by extreme data scarcity and the requirement that generated shapes respect phylogenetic relationships among species. In this work, we present PhyloSDF, a phylogenetically-conditioned neural generative model for 3D biological morphology that integrates two innovations: (1) a DeepSDF auto-decoder regularized by a novel Phylogenetic Consistency Loss that structures the latent space to correlate with evolutionary distances (Pearson r=0.993); (2) a Residual Conditional Flow Matching (Residual CFM) architecture that factorizes generation into analytic species-centroid lookup and learned residual prediction, enabling generation from as few as ~4 specimens per species. We evaluate PhyloSDF on 100 micro-CT-scanned skulls of Darwin's Finches and their relatives across 24 species. The model generates novel meshes achieving 88-129% of real intra-species variation at the code level, with all 180 generated meshes verified as non-memorized. Residual CFM surpasses denoising diffusion (which fails entirely at this scale), standard flow matching (which mode-collapses to 3-6% variation), and a Gaussian mixture baseline in both fidelity (Chamfer Distance 0.00181 vs. 0.00190) and morphometric Fréchet distance (10,641 vs. 13,322). Leave-one-species-out experiments across 18 species demonstrate phylogenetic extrapolation capability, and smooth latent interpolations produce biologically plausible ancestral skull reconstructions.
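
The fidelity comparison above is reported as a Chamfer Distance between shapes. The abstract does not state which variant is used (squared or unsquared distances, sum or mean), so the sketch below assumes one common convention: the mean squared nearest-neighbour distance, summed over both directions. The function name and the toy point sets are illustrative.

```python
def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets.

    Assumed convention: for each point, take the squared Euclidean distance
    to its nearest neighbour in the other set, average per set, then add
    the two directions. This brute-force version is O(len(a) * len(b)).
    """
    def one_way(src, dst):
        total = 0.0
        for s in src:
            total += min(sum((p - q) ** 2 for p, q in zip(s, d)) for d in dst)
        return total / len(src)

    return one_way(points_a, points_b) + one_way(points_b, points_a)

# Toy 3D point sets: the second is the first shifted by 0.1 along x.
shape_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
shape_b = [(x + 0.1, y, z) for x, y, z in shape_a]
print(chamfer_distance(shape_a, shape_b))
```

Identical point sets give zero, and the value grows as surfaces drift apart. Absolute values depend on the convention and on how shapes are normalized, so the reported 0.00181 and 0.00190 are comparable only with each other.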

2606.15886 2026-06-16 cs.CV 新提交

Text region detection in historical astronomical diagrams

历史天文图中的文本区域检测

Zeynep Sonat Baltacı, Raphaël Baena, Fei Meng, Somkéo Norindr, Florence Somer, Matthieu Husson, Mathieu Aubry

发表机构 * LIGM, ENPC, IP Paris, Univ Gustave Eiffel, CNRS, Marne-la-Vallée, France（LIGM, 国立桥路学校, 巴黎理工学院, 古斯塔夫·埃菲尔大学, 法国国家科学研究中心, 马恩拉瓦莱, 法国）； LTE, CNRS, PSL-Observatoire de Paris, SU, EIDA Project（LTE, 法国国家科学研究中心, 巴黎文理研究大学-巴黎天文台, 索邦大学, EIDA项目）

AI总结提出包含948张历史天文图的大规模数据集，跨越十个世纪和七种语言传统，并设计Poly-DETR模型实现文本区域检测。

详情

AI中文摘要

文本检测是历史文献分析中的关键任务。尽管手稿和地图的文本检测已有数据集和基准，但数学图表中的文本研究鲜受关注。为此，我们引入一个大规模、多样化、开放获取的数据集，包含948张历史天文图，共计10,940个定向多边形文本区域。数据集跨越十个世纪（8至18世纪）和七种主要语言传统：阿拉伯语和波斯语（115张）、中文（332张）、拜占庭语（233张）、拉丁语（185张）、希伯来语（48张）和梵语（35张）。它涵盖了从符号到多行段落的广泛图表风格和文本内容。每个文本实例都标注了有序多边形，精确描绘文本区域并编码阅读方向。此外，我们为拉丁图表中的2,293个区域标注了20个类别标签。我们在数据集上评估了多个强基线，包括TESTR、DeepSolo++以及Poly-DETR（我们设计的DINO-DETR的简单扩展，用于预测有序多边形顶点）。Poly-DETR在MTHv2和cBAD2019基准上达到最先进性能，并在我们的数据集上提供了坚实、简单的基线。代码和数据集在线提供。

英文摘要

Text detection is a crucial task in the analysis of historical documents. While datasets and benchmarks exist for text detection in manuscripts and maps, the study of text in mathematical diagrams has received little attention. To address this, we introduce a large-scale, diverse, open-access dataset of 948 historical astronomical diagrams containing 10,940 oriented polygonal text regions. Our dataset spans ten centuries (8th to 18th) and seven main linguistic traditions: Arabic and Persian (115), Chinese (332), Byzantine (233), Latin (185), Hebrew (48), and Sanskrit (35). It captures a wide range of diagram styles and textual content, from symbols to multi-line paragraphs. Each text instance is annotated with ordered polygons that precisely delineate text regions and encode the reading direction. In addition, we annotated the 2,293 regions in Latin diagrams with 20 class labels. We evaluated several strong baselines on our dataset, including TESTR, DeepSolo++, and Poly-DETR, a simple extension of DINO-DETR that we design to predict ordered polygon vertices. Poly-DETR achieves state-of-the-art performance on the MTHv2 and cBAD2019 benchmarks and provides a solid, simple baseline on our dataset. Code and dataset available online.

2606.15987 2026-06-16 cs.CV cs.DL 新提交

A Text Recognition Dataset from Sahidic Coptic Ancient Manuscripts

来自萨希迪克科普特古代手稿的文本识别数据集

Fabio Quattrini, Carmine Zaccagnino, Costanza Bianchi, Silvia Cascianelli, Rita Cucchiara

发表机构 * University of Modena and Reggio Emilia（摩德纳和雷焦艾米利亚大学）

AI总结针对低资源手写文本识别，构建了萨希迪克科普特古代手稿数据集SCAM，并评估了多种先进方法的性能，揭示了当前方法在低资源历史文本上的局限性。

Comments Accepted at ICDAR 2026

详情

AI中文摘要

在这项工作中，我们针对低资源场景下的手写文本识别（HTR），这些场景源于代表性不足的语言、稀有文字以及历史文献典型的退化视觉条件。我们引入了SCAM（萨希迪克科普特古代手稿），这是一个从已灭绝的萨希迪克科普特方言书写的数字化古代手稿构建的新行级数据集。该数据集反映了现实且具有挑战性的环境，因为它结合了跨图书馆的异构采集条件以及典型的文献退化，如墨水褪色、透印和材料劣化。除了视觉复杂性外，由于萨希迪克科普特的资源稀缺、其不常见的字母表以及方言特有的变音符号，SCAM还带来了显著的语言挑战。为了支持低资源HTR的研究，我们基于不同范式对几种最先进的方法进行了基准测试，突出了它们在此环境中的局限性和优势。我们的结果强调了当前在资源丰富的现代文字上的HTR性能与基于历史的低资源场景之间的差距，从而为未来的发展提供了参考点。

英文摘要

In this work, we target Handwritten Text Recognition (HTR) in low-resource scenarios, which arise from underrepresented languages, rare scripts, and degraded visual conditions typical of historical documents. We introduce SCAM (Sahidic Coptic Ancient Manuscripts), a new line-level dataset built from digitized ancient manuscripts written in the extinct Sahidic Coptic dialect. The dataset reflects a realistic and challenging setting, as it combines heterogeneous acquisition conditions across libraries with typical manuscript degradations such as ink fading, bleed-through, and material deterioration. In addition to visual complexity, SCAM poses significant linguistic challenges due to the scarcity of resources for Sahidic Coptic, its uncommon alphabet, and dialect-specific diacritics. To support research in low-resource HTR, we benchmark several state-of-the-art approaches based on different paradigms, highlighting their limitations and strengths in this setting. Our results underline the gap between current HTR performance on well-resourced modern scripts and historically grounded, low-resource scenarios, thus providing a reference point for future developments.

2406.17148 2026-06-16 cs.CV 版本更新

MixTeX: Data-Efficient LaTeX OCR via Synthetic Pretraining and Limited Fine-Tuning

MixTeX: 通过合成预训练和有限微调实现数据高效的LaTeX OCR

Yuhan Xu, Yijun Zhao, Renqing Luo, Gary M. Weiss

发表机构 * arXiv

AI总结提出MixTeX，通过合成预训练（无需真实LaTeX源）和少量真实样本微调，实现数据高效的LaTeX OCR，在英中印刷和手写基准上优于依赖大数据集的方法。

详情

AI中文摘要

LaTeX OCR将科学文档图像转换为可编辑的LaTeX代码。现有系统依赖大型配对数据集，这些数据集收集成本高，且对于低资源语言有限。本文提出了MixTeX，一种数据高效的系统，使用合成预训练而无需真实的LaTeX源。与依赖arXiv数据集的Nougat不同，我们通过随机配对语法正确的维基百科文本与LaTeX公式来生成训练数据，仅需语法正确性。这消除了对真实文档集合的依赖，实现了可扩展的数据生成（1.2亿个token），并支持低资源语言。在合成预训练之后，适应仅需400个真实样本。在包含印刷和手写英文及中文的977样本基准上的评估表明，这种两阶段策略在需要更少人力和计算的情况下，优于在大型真实数据集上训练的方法。数据、代码和模型公开可用。

英文摘要

LaTeX OCR converts scientific document images into editable LaTeX code. Existing systems rely on large paired datasets, which are costly to collect and limited for low-resource languages. This paper presents MixTeX, a data-efficient system using synthetic pretraining without real LaTeX sources. Unlike Nougat, which depends on arXiv datasets, we generate training data by randomly pairing grammatical Wikipedia text with LaTeX formulas, requiring only syntactic correctness. This eliminates dependency on real document collections, enables scalable data generation (120M tokens), and supports low-resource languages. Following synthetic pretraining, adaptation requires only 400 real samples. Evaluation on a 977-sample benchmark with printed and handwritten English and Chinese shows that this two-stage strategy outperforms methods trained on large real datasets while requiring less human effort and computation. Data, code, and models are publicly available.
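
The data-generation idea in this abstract, randomly pairing grammatical text with LaTeX formulas, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the function name, the `p_formula` parameter, and the tiny text and formula pools are all hypothetical, and the step that renders each sample to an image for OCR training is omitted.

```python
import random

def make_synthetic_sample(sentences, formulas, rng, n_sentences=3, p_formula=0.5):
    """Build one synthetic LaTeX target by interleaving text and inline formulas.

    Sentences and formulas are drawn independently, so the result is
    syntactically valid LaTeX but carries no real mathematical meaning.
    """
    parts = []
    for sentence in rng.sample(sentences, n_sentences):
        parts.append(sentence)
        if rng.random() < p_formula:
            parts.append(f"${rng.choice(formulas)}$")
    return " ".join(parts)

# Hypothetical stand-ins for the Wikipedia text and formula pools.
sentences = [
    "The river flows north.",
    "Glass is an amorphous solid.",
    "Birds migrate in autumn.",
    "The bridge opened in 1932.",
]
formulas = [r"\frac{a}{b}", r"x^{2} + y^{2}", r"\sum_{i=1}^{n} i"]

rng = random.Random(0)  # seeded so repeated runs give the same sample
sample = make_synthetic_sample(sentences, formulas, rng)
print(sample)
```

Because only syntactic correctness is required, a generator of this kind needs no real document collection, which is what makes the approach attractive for low-resource languages.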

2603.01016 2026-06-16 cs.CV eess.IV eess.SP 版本更新

Implementation of Licensed Plate Detection and Noise Removal in Image Processing

图像处理中车牌检测与噪声去除的实现

Yiquan Gao

发表机构 * Asia Pacific University, Malaysia（亚太大学，马来西亚）

AI总结本文实现了一种车牌检测与噪声去除方法，通过图像处理技术提高车牌识别系统的准确性和鲁棒性。

Comments 13 pages. This is the author's version (accepted manuscript). Published version available at https://www.ijarse.com/ADMIN/admin/postimages/images/fullpdf/1519302304_SVCET2087ijarse.pdf

详情

Journal ref: International Journal of Advance Research in Science and Engineering, Vol. 07, Special Issue No. 02, pp. 678-690, ISSN: 2319-8354, Feb. 2018

AI中文摘要

汽车车牌识别系统是一种图像处理技术，用于通过捕获汽车车牌来识别车辆。汽车车牌识别技术也称为自动车牌识别、自动车辆识别、汽车车牌识别或汽车光学字符识别。在马来西亚，随着如今车辆数量的迅速增加，道路上相当多的车辆带来了对汽车车牌识别系统的巨大需求。汽车车牌识别系统可以应用于电子停车支付系统、高速公路收费系统、交通监控系统以及作为警察执法工具。此外，汽车车牌识别系统技术还有潜力与生物学、航空航天等其他不同领域的各种技术相结合，以实现解决某些专门问题的目标。

英文摘要

A car license plate recognition system is an image processing technology that identifies vehicles by capturing their license plates. The technology is also known as automatic number-plate recognition, automatic vehicle identification, car license plate recognition, or optical character recognition for cars. In Malaysia, the rapidly growing number of vehicles on the road has created considerable demand for car license plate recognition systems. Such systems can be deployed in electronic parking payment systems, highway toll systems, and traffic surveillance systems, and as police enforcement tools. Car license plate recognition also has the potential to be combined with techniques from other fields, such as biology and aerospace, to solve specialized problems.

2606.08781 2026-06-16 cs.CV 版本更新

DeepMine-Mamba: Mitigating Information Dilution in Mamba-Based State Space Models for Document Image Binarization

DeepMine-Mamba：缓解基于Mamba的状态空间模型在文档图像二值化中的信息稀释问题

Sheng-Wei Chan, Yung-Che Wang, Hsin-Jui Pan, Chia-Min Lin, Jen-Shiun Chiang

发表机构 * Department of Electrical and Computer Engineering, Tamkang University（淡江大学电机与计算机工程系）

AI总结提出DeepMine-Mamba框架，通过抗稀释门控机制选择性恢复笔画敏感局部响应，抑制无关背景增强，解决Mamba状态空间模型在文档二值化中弱前景线索被稀释的问题。

Comments code will be released on https://github.com/henrychan0719/Deep-Mine-Mamba

详情

AI中文摘要

文档图像二值化旨在从退化的背景中分离前景文本，同时保留细、断裂和低对比度的笔画。尽管深度学习方法提高了二值化性能，但大多数现有方法依赖于卷积、基于Transformer或生成架构，而基于Mamba的状态空间模型在此任务中尚未被充分探索。在这项工作中，我们研究了基于Mamba的特征传播，并观察到直接的状态空间传播可能会在长程建模过程中稀释弱前景线索，特别是淡墨迹、碎片化字符和边界敏感的笔画细节。为了解决这个问题，我们提出了DeepMine-Mamba，一个基于Mamba的二值化框架，配备了一种新颖的抗稀释门控机制，该机制估计传播引起的特征变化，并选择性地恢复笔画敏感的局部响应，同时抑制不必要的背景增强。在严格的留一年验证协议下，对DIBCO/H-DIBCO基准的实验表明，DeepMine-Mamba取得了具有竞争力的整体性能，在各基准年份上具有较高的平均FM和Fps。消融结果进一步表明，抗稀释门控是缓解传播引起的前景稀释并改善笔画保留的关键组件。

英文摘要

Document image binarization aims to separate foreground text from degraded backgrounds while preserving thin, broken, and low-contrast strokes. Although deep learning methods have improved binarization performance, most existing approaches rely on convolutional, transformer-based, or generative architectures, while Mamba-based state space models remain largely unexplored for this task. In this work, we investigate Mamba-based feature propagation and observe that direct state-space propagation may dilute weak foreground cues during long-range modeling, especially faint ink traces, fragmented characters, and boundary-sensitive stroke details. To address this problem, we propose DeepMine-Mamba, a Mamba-based binarization framework equipped with a novel Anti-Dilution Gate that estimates propagation-induced feature changes and selectively restores stroke-sensitive local responses while suppressing unnecessary background enhancement. Experiments on DIBCO/H-DIBCO benchmarks under a strict leave-one-year-out protocol show that DeepMine-Mamba achieves competitive overall performance, with strong average FM and Fps across benchmark years. Ablation results further show that the Anti-Dilution Gate is the key component for mitigating propagation-induced foreground dilution and improving stroke preservation.

2606.14773 2026-06-16 cs.CV cs.AI 新提交

Double-Helix Vision (DH-V2): A Geometry-Based Visual Sampler for Bandwidth-Constrained Perception

双螺旋视觉 (DH-V2)：一种基于几何的带宽受限感知视觉采样器

Jinwen Wen

发表机构 * Independent Researcher（独立研究者）

AI总结提出双螺旋视觉(DH)，一种基于黄金比例螺旋轨迹的几何采样器，将2D图像压缩为1D信号，在4K分辨率下实现1433倍压缩比，在1080p分辨率下于纯CPU硬件上0.52ms完成感知，在CIFAR-10上较均匀随机采样准确率提升6.03%。

Comments 5 pages, 3 figures, 5 tables. Code and benchmarks: https://github.com/JackJ-C/double-helix-vision-tool

详情

AI中文摘要

我们提出双螺旋视觉(DH)，一种基于几何的视觉采样器，利用成对的黄金比例启发螺旋轨迹将2D图像压缩为紧凑的1D信号。DH不是均匀处理每个像素，而是采用两个相位偏移的螺旋（Alpha和Beta，偏移180度）以生物启发的中央凹方式采样图像：中心高密度，外围稀疏覆盖。在4K分辨率下，DH实现了1433倍压缩比（减少99.93%），同时保留场景的几何结构。完整的感知流水线——包括空间映射、时间碰撞检测和帧内结构视差估计——在仅CPU硬件上以1080p分辨率运行仅需0.52毫秒，无需神经网络依赖。在CIFAR-10上，在极端采样预算下（每个螺旋K=128个点），DH比均匀随机采样获得了+6.03%的准确率提升。提供了一个可序列化为JSON的机器人API，以2.7 KB的数据包提供亚毫秒级空间感知报告。代码和基准测试在MIT许可下提供。

英文摘要

We present Double-Helix Vision (DH), a geometry-based visual sampler that compresses 2D images into compact 1D signals using paired golden-ratio-inspired spiral trajectories. Rather than processing every pixel uniformly, DH employs two phase-shifted helices (Alpha and Beta, offset by 180 degrees) to sample the image with biologically-inspired foveation: high density at the center, sparse coverage at the periphery. At 4K resolution, DH achieves a 1,433x compression ratio (99.93% reduction) while preserving the geometric structure of the scene. The full perception pipeline -- including spatial mapping, temporal collision detection, and intra-frame structural disparity estimation -- runs in 0.52 ms at 1080p on CPU-only hardware, with no neural network dependencies. On CIFAR-10 at extreme sampling budgets (K=128 points per helix), DH achieves a +6.03% accuracy gain over uniform random sampling. A JSON-serializable Robotics API is provided, delivering sub-millisecond spatial perception reports in 2.7 KB packets. Code and benchmarks are available under the MIT License.
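
To make the sampling idea concrete, the sketch below places samples along two golden-angle spirals offset by 180 degrees and reads a 2D image out as a 1D signal. It is a hedged illustration, not the released tool: the function names are hypothetical, and the linear radius law is an assumption chosen only because it yields denser samples near the centre. The last lines check the reported 99.93% reduction against the 1,433x ratio.

```python
import math

# Golden angle in radians (about 2.39996), as used in sunflower-style spirals.
GOLDEN_ANGLE = math.pi * (3.0 - math.sqrt(5.0))

def helix_points(k, width, height, phase=0.0):
    """Return k (x, y) sample positions on one spiral centred in the image.

    The radius grows linearly with the sample index, so samples are densest
    near the centre (an assumed foveation law, not necessarily the paper's).
    """
    cx, cy = (width - 1) / 2.0, (height - 1) / 2.0
    r_max = min(cx, cy)
    points = []
    for i in range(k):
        r = r_max * (i + 0.5) / k
        theta = i * GOLDEN_ANGLE + phase
        points.append((cx + r * math.cos(theta), cy + r * math.sin(theta)))
    return points

def sample_1d(image, points):
    """Read nearest-pixel values along the spiral, giving a 1D signal."""
    return [image[int(round(y))][int(round(x))] for x, y in points]

width, height, k = 64, 48, 128
alpha = helix_points(k, width, height)                # first helix
beta = helix_points(k, width, height, phase=math.pi)  # second helix, offset by 180 degrees

# Toy grayscale image stored as a list of rows.
image = [[(x + y) % 256 for x in range(width)] for y in range(height)]
signal = sample_1d(image, alpha) + sample_1d(image, beta)

# Data reduction implied by a 1,433x compression ratio, in percent.
reduction_pct = 100.0 * (1.0 - 1.0 / 1433.0)
print(len(signal), round(reduction_pct, 2))
```

With K=128 points per helix, the signal has 256 values regardless of image resolution, which is why the compression ratio grows with image size.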

2606.14781 2026-06-16 cs.CV 新提交

融合迁移先验与物理分解的水下图像增强

Haochen Hu, Yanrui Bin, Zhengyan Zhang, Minchen Wei, Chih-yung Wen, Bing Wang

发表机构 * The Hong Kong Polytechnic University（香港理工大学）

AI总结提出一种无需配对标签的迁移学习方法，将水下图像增强分解为全局颜色校正、去雾和背景噪声抑制，利用跨域先验监督各步骤，实现物理一致的增强。

详情

AI中文摘要

水下图像在不同水质条件下拍摄，导致复杂的退化，包括颜色偏差、低对比度和模糊效应。最近，基于学习的方法已显示出在水下图像增强（UIE）方面的潜力。然而，以往的大多数工作侧重于训练策略或网络设计，使增强结果与数据集中的标签良好对齐，忽略了标签是从先前UIE方法的增强结果中选取的，这些伪标签存在噪声。因此，它们的模型性能在一定程度上并不令人满意。然而，收集水下图像的真实标签具有挑战性。在这项工作中，我们提出了一种基于迁移学习的UIE方法，该方法不需要水下图像具有成对的噪声或真实标签来学习。相反，首先根据水下物理将UIE任务分解为全局颜色校正、去雾和背景噪声抑制。然后，利用来自其他视觉任务的多种先验作为每个步骤的跨域监督。通过这种方式，通过迁移学习实现了一种新颖的UIE，并且物理对齐的UIE分解提供了理论上的合理性。定性和定量实验表明，我们基于物理和先验融合的方法在UIE任务中达到了SOTA性能，并有效提升了下游视觉任务，显著优于基准方法。项目仓库：https://github.com/Haru2022/P2-UIE。

英文摘要

Underwater images are captured under diverse water-medium conditions, leading to complex degradation, including color bias, low contrast, and blur. Recently, learning-based methods have demonstrated their potential for underwater image enhancement (UIE). However, most previous work focuses on the training strategy or network design to align the enhanced result with the labels in datasets, ignoring that these labels are selected from the enhanced results of previous UIE methods and are therefore noisy pseudo-labels. Consequently, the performance of such models is limited. Collecting true labels for underwater images, however, is challenging. In this work, we propose a transfer learning-based UIE method that does not require underwater images to have paired noisy or true labels for learning. Instead, the UIE task is first divided into global color correction, haze removal, and background noise suppression following underwater physics. Then multiple types of priors from other vision tasks are leveraged as cross-domain supervision in each step. In this way, a novel UIE method is obtained via transfer learning, and the physics-aligned UIE decomposition provides theoretical soundness. Qualitative and quantitative experiments demonstrate that our proposal, based on the fusion of physics and priors, achieves SOTA performance in the UIE task and effectively boosts downstream vision tasks, significantly outperforming benchmark methods. Project repo: https://github.com/Haru2022/P2-UIE.

2606.15857 2026-06-16 cs.CV 新提交

A Dual-Branch Collaborative Framework for Joint Optimization of Underwater Image Enhancement and Object Detection

用于水下图像增强与目标检测联合优化的双分支协作框架

Liyuan Cao, Zheng Liu, Guanghao Liao, Yonghui Yang, Qi Li

发表机构 * School of Electronic and Information Engineering, University of Science and Technology Liaoning（辽宁科技大学电子与信息工程学院）

AI总结提出一种双分支水下图像增强框架，通过细节增强和颜色恢复分支分别提升纹理细节和校正色偏，在提升视觉质量的同时兼顾检测性能与效率，在URPC数据集上使YOLOv8的mAP50提升2.1%。

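AI abstract (translated from Chinese)

Because of wavelength-dependent light absorption and scattering, underwater images often suffer from color distortion and blurred details, which limits the performance of underwater object detection. Existing underwater image enhancement methods focus mainly on improving visual quality and still struggle to balance enhancement quality, processing efficiency, and downstream detection performance. This paper therefore proposes an efficient dual-branch underwater image enhancement framework for object detection. The detail-enhancement branch restores texture detail in dark regions by raising brightness and local contrast. The color-restoration branch uses adaptive compensation to reduce color distortion and improve color gradation. By combining the complementary outputs of the two branches, the framework supplies clearer, more informative images for object detection. On the UIEB and EUVP datasets the method reaches UIQM scores of 2.249 and 2.576, respectively. Applied to YOLOv8 detection on the URPC dataset, it raises mAP50 by 2.1% over the baseline. Extensive experiments show that the method improves object detection in complex underwater scenes while balancing enhancement quality and processing efficiency.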

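Interferometric Synthetic Aperture Radar (InSAR) can effectively monitor volcanic deformation; however, the observed signals are often corrupted by atmospheric phase delays, seasonal surface changes, and decorrelation effects. Existing atmospheric correction methods, such as those based on numerical weather models, reduce these effects but cannot always remove atmospheric artefacts and may introduce residual biases. To address these limitations, we propose a novel learning-based method for denoising unwrapped InSAR interferograms, using a hybrid training strategy that combines physics-driven synthetic deformation with real atmospheric noise. Specifically, we introduce WaveDINO, a wavelet-based multi-scale denoising framework conditioned on frozen DINOv3 foundation-model features and terrain information. Training uses synthetic magma-source deformation superimposed on short-term interferograms, exposing the network to realistic atmospheric statistics while retaining known ground truth. Performance is evaluated on controlled synthetic data and on long-term real interferograms from Laguna del Maule (Chile) and Campi Flegrei (Italy), and validated with independent GNSS measurements. WaveDINO consistently outperforms competing models, improves agreement with GNSS measurements, reduces the mean GNSS misfit by approximately 3% and 19% at the two sites, respectively, and surpasses weather-model-based corrections.

English abstract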

Interferometric Synthetic Aperture Radar (InSAR) enables effective monitoring of volcanic deformation; however, the observed signals are often corrupted by atmospheric phase delays, seasonal surface changes, and decorrelation effects. Existing atmospheric correction methods, such as numerical weather model-based methods, can reduce these effects but do not consistently remove atmospheric artefacts and may introduce residual biases. To address these limitations, we propose a novel learning-based method for denoising unwrapped InSAR interferograms, using a hybrid training strategy that combines physically motivated synthetic deformation with real atmospheric noise. Specifically, we introduce WaveDINO, a wavelet-based multi-scale denoising framework conditioned on frozen DINOv3 foundation-model features and terrain information. Training uses synthetic magma-source deformation superimposed on short-term interferograms to expose the network to realistic atmospheric statistics while retaining known ground truth. Performance is evaluated on both controlled synthetic data and long-term real interferograms from Laguna del Maule (Chile) and Campi Flegrei (Italy), with independent GNSS measurements used for validation. WaveDINO consistently outperforms competing models, improving agreement with GNSS measurements, and reducing mean GNSS misfit by approximately 3% and 19% at two sites, respectively, while surpassing weather-model-based corrections.

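2606.15352 2026-06-16 eess.IV cs.CV cs.GR cross-list

Chroma-gated, differentiable OKLCH interpolation: Continuous Oklab fallback for color-cast reduction

Naoyuki Uchida

Affiliations: Independent Researcher

AI summary: Addresses the two kinds of color cast that OKLCH interpolation produces near the neutral axis by proposing a differentiable chroma gate that continuously blends the OKLCH and linear Oklab paths, handling both casts uniformly without endpoint tests, and validates its effectiveness.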

Comments 14 pages, 5 figures. Ancillary files: reproducibility scripts (symbolic verification, evaluation, and figure generation)

English abstract

OKLCH -- the cylindrical (lightness, chroma, hue) form of Ottosson's Oklab color space -- is the interpolation space recommended by CSS Color 4 for gradients and color-mix(), and it is now broadly deployed. Its polar parameterization, however, casts color near the neutral axis in two ways: (1) an inter-hue detour between two chromatic endpoints that sweeps through an unintended hue (blue to yellow visibly passing through green), and (2) an off-line bow when one endpoint is achromatic. Existing remedies are uniformly two-valued -- a threshold switch that fires only at an achromatic endpoint -- so they address only (2); on chromatic pairs every one of them reduces to raw OKLCH, leaving the (1) inter-hue cast untreated. We introduce Continuous Oklab fallback (COFb), a one-parameter, differentiable chroma gate $w(C)=C^n/(C^n+σ^n)$ that continuously blends the OKLCH path toward the linear Oklab path as chroma falls. A single gate reduces the (1) cast that the two-valued family leaves untreated and unifies the handling of (1) and (2) without any endpoint test. We characterize a cast-hue trade-off frontier, adopt a default ($n=1$, the rational Michaelis-Menten form; $σ\approx0.19$ for a typical sRGB palette, from a normalization-independent cast-half criterion), and verify the gate's properties symbolically. At the default, COFb halves the inter-hue path detour (mean lateral deviation -49.5%, chroma-weighted hue excursion -35.5%). We also state the method's limits: on (2) alone the two-valued switch remains better, and like any Cartesian blend COFb does not preserve chroma. In deployment, COFb runs entirely in plain Oklab (a,b) to sRGB, so it serves as a fallback that delivers the same cast-reduced gradients where modern CSS color interpolation (color-mix(in oklch) and the like) is unavailable -- older engines, image and video pipelines, or GPU shaders.
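
The gate formula in the abstract can be tried out with a small sketch. Only $w(C)=C^n/(C^n+σ^n)$ and the defaults $n=1$, $σ\approx0.19$ come from the abstract; the helper names, the shorter-arc hue rule, and in particular the choice to evaluate the gate at the chroma of the linear Oklab point are assumptions made for illustration and may differ from the paper's definition.

```python
import math

def chroma_gate(c, sigma=0.19, n=1.0):
    """w(C) = C^n / (C^n + sigma^n): 0 on the neutral axis, tending to 1 at high chroma."""
    cn = c ** n
    return cn / (cn + sigma ** n)

def lerp(x, y, t):
    return x + (y - x) * t

def lab_to_lch(lab):
    L, a, b = lab
    return L, math.hypot(a, b), math.degrees(math.atan2(b, a)) % 360.0

def lch_to_lab(lch):
    L, c, h = lch
    return L, c * math.cos(math.radians(h)), c * math.sin(math.radians(h))

def gated_mix(lab0, lab1, t, sigma=0.19, n=1.0):
    """Blend the polar (OKLCH) and Cartesian (Oklab) interpolants with weight w(C)."""
    # Straight-line path in Oklab.
    linear = tuple(lerp(p, q, t) for p, q in zip(lab0, lab1))
    # Polar path: lightness and chroma linearly, hue along the shorter arc.
    L0, c0, h0 = lab_to_lch(lab0)
    L1, c1, h1 = lab_to_lch(lab1)
    dh = ((h1 - h0 + 180.0) % 360.0) - 180.0
    polar = lch_to_lab((lerp(L0, L1, t), lerp(c0, c1, t), (h0 + dh * t) % 360.0))
    # ASSUMPTION: the gate is evaluated at the chroma of the linear-path point.
    w = chroma_gate(math.hypot(linear[1], linear[2]), sigma, n)
    # w = 1 keeps the OKLCH path, w = 0 falls back to the linear Oklab path.
    return tuple(lerp(p, q, w) for p, q in zip(linear, polar))
```

Under these assumptions, two endpoints with opposite hues and equal chroma meet at the neutral axis halfway through, where the gate is zero and the mix follows the straight Oklab line instead of sweeping through an intermediate hue.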

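2606.16107 2026-06-16 eess.IV cs.CV cs.MM cross-list

Variable-Rate Deep Image Compression based on Low-Rank Adaptation by Progressive Learning

Xing-Yu Xu, Chen-Hsiu Huang, Ja-Ling Wu

Affiliations: National Taiwan University

AI summary: Proposes a progressive learning method based on Low-Rank Adaptation (LoRA) that introduces a LoRA Rate-Adaptive Module (LoRAM) to achieve variable-rate deep image compression, adding no computational complexity at inference and saving 99% in parameter storage, 90% in datasets, and 97% in training steps.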

English abstract

In the digital age, image compression is crucial for numerous applications, including web media, streaming services, high-resolution medical imaging, and connected vehicle networks, enabling efficient data storage and transmission. With the increasing demand for high-quality image communication, the need for advanced compression techniques becomes increasingly critical. Numerous Deep Image Compression (DIC) techniques have recently been introduced, showing impressive performance compared to traditional standards. However, variable-rate image compression remains an unresolved issue. Some DIC methods deploy multiple networks to attain different compression rates, whereas others use a single model, which often results in higher computational complexity and reduced performance. This work proposes a progressive learning approach for variable-rate image compression based on a parameter-efficient fine-tuning method, Low-Rank Adaptation (LoRA). We introduce an additional LoRA Rate-Adaptive Module (LoRAM) in DIC methods. Because LoRA weights can be merged by re-parameterization, our proposed method does not introduce additional computational complexity during inference. Compared to methods utilizing multiple models, comprehensive experiments demonstrate that our approach achieves competitive performance, saving 99% in parameter storage, 90% in datasets, and 97% in training steps.
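
The claim that merged LoRA weights add no inference cost follows from generic LoRA algebra, which a toy sketch can make concrete. This shows only the standard merge W' = W + BA on hand-picked 2x2 numbers; it is not the paper's LoRAM module, and all matrices and function names here are illustrative assumptions.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def lora_forward(W, A, B, x):
    """Unmerged path: frozen weight plus a separate low-rank branch, W x + B (A x)."""
    return add(matmul(W, x), matmul(B, matmul(A, x)))

def merge_lora(W, A, B):
    """Fold the low-rank update into the base weight: W' = W + B A."""
    return add(W, matmul(B, A))

# Toy sizes: a 2x2 base weight with a rank-1 adapter (values are illustrative).
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[1.0], [2.0]]      # 2 x r, with r = 1
A = [[0.5, -1.0]]       # r x 2
x = [[2.0], [3.0]]      # input as a column vector

unmerged = lora_forward(W, A, B, x)        # two matrix paths at inference
merged = matmul(merge_lora(W, A, B), x)    # one matrix multiply, same result
```

Both paths give the same output, so after merging the adapted layer costs exactly one matrix multiply, the same as the base layer. A variable-rate scheme along these lines would presumably store one small (A, B) pair per rate next to a single shared W, which is consistent with the storage saving the abstract reports.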

2606.16261 2026-06-16 physics.optics cs.CV cs.NE physics.app-ph cross-list

Wavelength-Multiplexed 2D Beam Steering via a Passive Diffractive Network

Che-Yung Shen, Yuhang Li, Cagatay Isil, Tianyi Gan, Mona Jarrahi, Aydogan Ozcan

Affiliations: Electrical and Computer Engineering Department, University of California, Los Angeles; Bioengineering Department, University of California, Los Angeles; California NanoSystems Institute (CNSI), University of California, Los Angeles

AI summary: Proposes a passive diffractive optical network that uses wavelength as the control parameter for arbitrary 2D beam steering; cascaded diffractive layers are optimized with deep learning, and a 25 x 25 steering array over 625 wavelength channels is verified numerically and experimentally, with subwavelength accuracy and high channel fidelity.

Comments 20 Pages, 4 Figures

English abstract

We introduce a wavelength-addressable diffractive optical network that transforms illumination wavelength into a high-dimensional control parameter for arbitrarily programmable 2D beam steering. The proposed passive architecture comprises cascaded spatially optimized diffractive layers, jointly designed using deep learning, to rapidly map distinct wavelengths to predefined/desired output angles. Unlike conventional single-layer dispersive optical elements, which are physically restricted to 1D linear mapping, this framework harnesses complex wavefront transformations to utilize the illumination wavelength as an intrinsic addressing key for arbitrary 2D beam steering, eliminating the need for mechanical scanning or electronic phase control. We numerically demonstrate wavelength-controlled beam steering across 625 wavelength channels spanning 400-750 nm, realizing a 25 x 25 array of independently addressable beam positions with subwavelength positioning accuracy and high channel fidelity. Unlike conventional gratings, which constrain wavelength routing to a linear trajectory, the proposed diffractive network performs nonlocal wavefront transformations, enabling arbitrary wavelength-to-angle mappings across a 2D field of view. We further validate the proposed framework experimentally in both the terahertz and visible spectral regimes, demonstrating wavelength-multiplexed beam steering using 3D fabricated passive diffractive layers at terahertz frequencies and phase-only spatial light modulators in the visible spectrum. This wavelength-addressable diffractive architecture establishes a compact and scalable paradigm for high-speed programmable beam steering, with potential applications in optical communications, routing, imaging, sensing, and emerging photonic information-processing systems.

2606.17048 2026-06-16 cs.LG cs.CV stat.ML cross-list

Exact Posterior Score Estimation for Solving Linear Inverse Problems

Abbas Mammadov, Ozgur Kara, Kaan Oktay, Iskander Azangulov, Adil Kaan Akan, Hyungjin Chung, James Matthew Rehg, Yee Whye Teh

Affiliations: University of Oxford; UIUC; EverEx

AI summary: Proposes the Exact Posterior Score (EPS) method, which uses a closed-form posterior score to turn linear inverse problems into denoising problems, needs no gradients or projections, and outperforms existing methods on FFHQ and ImageNet.

English abstract

Diffusion and flow-based models learn powerful data priors by training a denoiser to reverse Gaussian corruption. To use this prior to solve a linear inverse problem, one needs to sample from the posterior, but the score that the prior provides is the unconditional score, not the posterior score. Existing methods either steer a fixed pretrained denoiser with approximate measurement-matching corrections, or train a conditional restoration model that abandons the denoising structure of the prior. We derive the exact posterior score in closed form for linear Gaussian inverse problems under general Gaussian interpolants, and show that posterior sampling reduces to a denoising problem at an operator-dependent shifted pivot under an anisotropic noise covariance. We turn this identity into Exact Posterior Score (EPS), a denoising training objective that preserves the input/output structure of standard pretraining and can therefore be trained from scratch or fine-tuned from a pretrained denoiser. At inference, EPS uses the same sampler as the underlying backbone, with no likelihood gradients or projections. We evaluate EPS on five linear inverse problems across FFHQ and ImageNet, where it outperforms training-free and training-based baselines on fidelity, perceptual, and distributional metrics, while using roughly an order of magnitude fewer denoiser evaluations than gradient-based posterior samplers.

2511.12024 2026-06-16 cs.CV replacement

Null-Space Diffusion Distillation Unlocks Speed, Fidelity and Realism in Lensless Imaging

Jose Reinaldo Cunha Santos A V Silva Neto, Hodaka Kawachi, Yasushi Yagi, Tomoya Nakamura

Affiliations: D3 Center, The University of Osaka; The University of Osaka; SANKEN, The University of Osaka; Grad. Sch. of Eng. Sci., The University of Osaka

AI summary: Proposes Null-Space Diffusion Distillation (NSDD), which distills a structured diffusion prior into a feed-forward network for high-quality single-pass reconstruction that balances measurement consistency, perceptual quality, and inference speed.

Comments 10 pages without references, 5 figures, 5 tables

English abstract

Lensless imaging reconstructs scenes from highly multiplexed measurements, resulting in a severely ill-posed inverse problem. In this work, we identify a fundamental trade-off between measurement consistency, perceptual quality, and inference speed across lensless reconstruction paradigms. Traditional methods favor consistency but produce perceptually degraded results, supervised approaches achieve high-quality reconstructions with fast inference but may violate physical constraints, and diffusion-prior methods achieve high perceptual quality and consistency, particularly when structured constraints such as range-null decomposition are used, but remain slow due to iterative sampling. Motivated by this observation, we propose Null-Space Diffusion Distillation (NSDD), a single-pass reconstruction model that distills structured diffusion-prior inference into an efficient feed-forward network. NSDD learns to produce high-quality reconstructions that preserve measurement consistency while avoiding costly iterative sampling. Experimental results demonstrate that NSDD achieves perceptual quality and consistency competitive with diffusion-prior methods, while providing significantly faster inference and offering a favorable balance across all three objectives. Furthermore, ablation experiments show that distilling the range-null decomposition improves reconstruction quality and robustness over unstructured full-reconstruction distillation, including on unseen real scenes. These results highlight the potential of structure-aware distillation for efficient lensless imaging. Code is available at github.com/JRCSAVSN/NullSpaceDiffusionDistillation.
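
The range-null decomposition mentioned in the abstract can be illustrated with a toy operator. The sketch below uses the generic identity x = A⁺y + (I − A⁺A)x̂ with a pair-averaging operator standing in for the measurement process; a real lensless forward model is not pair averaging, and nothing here reproduces NSDD itself. The operator and all names are assumptions chosen so the pseudo-inverse is trivial to write down.

```python
def forward(x):
    """Toy measurement operator A: average each adjacent pair (2x downsampling)."""
    return [(x[i] + x[i + 1]) / 2.0 for i in range(0, len(x), 2)]

def pinv(y):
    """Pseudo-inverse A^+ of the pair-averaging operator: repeat each sample twice."""
    return [v for v in y for _ in range(2)]

def range_null_compose(y, x_hat):
    """x = A^+ y + (I - A^+ A) x_hat."""
    # Range-space part is fixed by the measurement alone.
    range_part = pinv(y)
    # Null-space part keeps only what the measurement cannot see in x_hat.
    null_part = [u - v for u, v in zip(x_hat, pinv(forward(x_hat)))]
    return [r + n for r, n in zip(range_part, null_part)]

y = [1.0, 4.0]                   # observed measurement
x_hat = [0.0, 2.0, 3.0, 7.0]     # any candidate, e.g. a network or prior output
x = range_null_compose(y, x_hat)
```

Whatever candidate x_hat is supplied, re-measuring the composed result returns y exactly, which is why a prior or a distilled network only has to supply plausible null-space content.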

2511.12572 2026-06-16 cs.CV replacement

Through-Foliage Surface-Temperature Reconstruction for Early Wildfire Detection

Mohamed Youssef, Lukas Brunner, Klaus Rundhammer, Gerald Czech, Oliver Bimber

Affiliations: Department of Computer Science, Johannes Kepler University; Fire Brigade St. Agatha; Upper Austria Fire Brigade Headquarter

AI summary: Combines signal processing and machine learning, using synthetic aperture sensing and a visual state space model to recover thermal signals from blurred data for automated drone-based wildfire monitoring; temperature reconstruction error drops markedly in simulation and field experiments.

English abstract

We present a method to reconstruct surface temperatures through forest vegetation by combining signal processing and machine learning, enabling fully automated aerial wildfire monitoring with drones for early fire detection. Synthetic aperture (SA) sensing reduces canopy occlusion but introduces thermal blur. To overcome this, we train a visual state space model to recover subtle thermal signals of partially occluded soil and fire hotspots from blurred data. To address limited real-world training data, we generate realistic surface temperature simulations using a latent diffusion model, temperature augmentation, and procedural thermal forest modeling. On simulated datasets, our method reduces RMSE by a factor of 2-2.5 versus conventional thermal and uncorrected SA imaging; in field experiments on hotspots, RMSE improved by 12.8-fold and 2.6-fold, respectively. Our approach also generalizes to other thermal signals, including human signatures, capturing morphology and extent (critical where simple thresholding fails), while conventional imaging struggles with partial occlusion.

2601.19506 2026-06-16 cs.CV replacement

Bridging Information Asymmetry: A Hierarchical Framework for Blind Face Restoration with Reduced Uncertainty

Zhengjian Yao, Jiakui Hu, Kaiwen Li, Hangzhou He, Xinliang Zhang, Shuang Zeng, Lei Zhu, Yanye Lu

Affiliations: Biomedical Engineering Department, College of Future Technology, Peking University; Institute of Medical Technology, Peking University Health Science Center; National Biomedical Imaging Center, Peking University

AI summary: Proposes Pref-Restore, a hierarchical framework that reduces uncertainty in blind face restoration through semantic information augmentation, texture-level fidelity alignment, and fidelity-constrained preference optimization, achieving identity-sensitive, high-fidelity reconstruction.

Comments Accepted by TPAMI

English abstract

Blind face restoration remains a persistent challenge due to the inherent ill-posedness of reconstructing holistic structures from severely constrained observations. Current generative paradigms, while capable of synthesizing realistic facial details, remain limited by the under-constrained nature of blind restoration, where severely degraded inputs can be mapped to plausible yet identity-inconsistent outputs. To address this issue, we present Pref-Restore, a hierarchical framework for BFR with reduced restoration uncertainty. Our design is organized around three complementary principles: (1) Semantic Information Augmentation, where an auto-regressive semantic branch converts image and text cues into structured tokens that provide a stable high-level anchor; (2) Texture-level Fidelity Alignment, where the diffusion generator is trained under this anchor to recover identity-relevant details; and (3) Fidelity-constrained Preference Optimization, where a face-aware reward refines the diffusion trajectory while controlling the quality-fidelity trade-off. Extensive experiments on synthetic and real-world benchmarks show that Pref-Restore achieves state-of-the-art performance, with stronger identity-sensitive fidelity and lower restoration uncertainty across repeated sampling. Systematic ablations further attribute these gains to the proposed hierarchical design, showing the necessity of staged training, the robustness and quality dependence of the text pathway, and the benefit of fidelity-constrained preference optimization.

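2606.06176 2026-06-16 cs.CV replacement

RQUL-UIE: Revitalizing Quality-Unstable Labels for Underwater Image Enhancement via In-Dataset Self-Supervision

Haochen Hu, Yanrui Bin, Chih-yung Wen, Bing Wang

Affiliations: The Hong Kong Polytechnic University

AI summary: Proposes a diffusion-based, in-dataset self-supervised learning strategy that scores label quality and quantizes it into noise levels for step-wise denoising supervision, combined with a Fourier-based refinement network, so that unstable labels are used effectively to improve underwater image enhancement quality.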

English abstract

Underwater Image Enhancement (UIE) is essential for mitigating degradations caused by the water medium. Although learning-based methods have advanced significantly, most rely on paired datasets with unstable label quality, which bottlenecks model performance. This paper proposes a diffusion-based, in-dataset self-supervised learning strategy designed to exploit the quality distribution of training labels. Specifically, we evaluate label quality via semantic perception embeddings from a pre-trained diffusion model in a training-free manner. These quality scores are subsequently quantized into noise-level indices, guiding a multi-step denoising process for level-wise supervision. This mechanism prevents low-quality labels from degrading the model while maximizing their utility during training. Furthermore, a Fourier-based refinement network is incorporated to explicitly reconstruct high-frequency components. Extensive evaluations demonstrate that our method consistently outperforms SOTA approaches in restoration quality. The code and pre-trained model will be made available upon acceptance.

2606.14748 2026-06-16 cs.CV cs.AI new submission

Is My Vision-Language Data in Your AI? Membership Inference Test (MINT) Demo 2

Daniel DeAlcala, Gonzalo Mancera, Julian Fierrez, Aythami Morales, Ruben Tolosana, Ruben Vera-Rodriguez

Affiliations: Universidad Autonoma de Madrid

AI summary: Presents the Membership Inference Test (MINT) framework, which detects training data with several architectures, reaches accuracy of up to 90% on face recognition and LLMs, and builds a multimodal auditing platform.

Comments IEEE Conf. on Computers, Software, and Applications (COMPSAC), 2026

English abstract

We present the Membership Inference Test (MINT) Demo 2, a framework designed to improve transparency in machine learning training processes. MINT is a technique for experimentally determining whether specific data were used during machine learning model training. We establish the theoretical framework and propose multiple architectures for MINT depending on the amount of information known about the models that are being audited. Experimental results using a popular face recognition model, 4 state-of-the-art LLMs, and multiple, diverse, and large-scale public image and text databases achieve promising accuracy levels in the detection of training data of up to 90%. Building on these results, we introduce a comprehensive web platform that expands these capabilities to image and text modalities. The platform integrates a diverse technological stack, including MINT, aMINT, and gMINT, allowing users to audit a wide range of models. This demonstrator aims to promote AI transparency and provides a practical tool to foster compliance with emerging AI regulations.

2606.14783 2026-06-16 cs.CV cs.CR new submission

The Vision Encoder as a Privacy Boundary: Visual-Token Side Channels in Encoder-Free Vision-Language Models

Chenyu Zhou, Qiliang Jiang, Shuning Wu, Xu Zhou

Affiliations: School of Engineering, Institute of Science Tokyo; College of Control Science and Engineering, Zhejiang University; Department of Electrical and Computer Engineering, National University of Singapore

AI summary: Studies privacy leakage through visual-token side channels in encoder-free vision-language models: decoder attacks recover images and text from intermediate visual tokens, spatial sampling fidelity is found to be the key factor, and the KV cache is shown to leak as well.

English abstract

A vision encoder compresses image pixels into semantic embeddings, implicitly acting as a privacy boundary by preserving semantic content while attenuating pixel-local detail required for exact text recovery. Encoder-free vision-language models (VLMs) remove this boundary by routing image patches directly into the language-model token stream, thereby exposing an architectural privacy attack surface: intermediate visual tokens become a pre-output side channel. Under a token-access adversary, decoders invert visual-token streams from two encoder-free VLMs, Gemma4 and Fuyu, recovering recognizable image structure and readable held-out access codes, whereas matched encoder-based controls localize target regions but recover no exact strings. Within-model ablations show that the operative factor is spatial sampling fidelity of the visual-token grid, especially character-direction sampling density, rather than token or value count. The leakage is not limited to exported tokens: Gemma4 layer-0 key-value cache tensors are directly invertible, placing the side channel within KV caches commonly persisted by production serving stacks for decoding efficiency. The attack survives clutter, realistic document degradation, and zero-shot transfer to public document images, and it resists value-level defenses such as additive noise and quantization. Effective mitigation must therefore reduce spatial sampling, making removal of the vision encoder a first-class privacy decision in VLM deployment.

2606.15169 2026-06-16 cs.CV new submission

Label Shift Aware Adaptation for Online Zero-shot Learning with Contrastive Language-Image Pre-Training (CLIP)

Pengxiao Han, Changkun Ye, Yanshuo Wang, Jinguang Tong, Miaohua Zhang, Xuesong Li, Jie Hong, Lars Petersson

Affiliations: Australian National University; China North Vehicle Research Institute; The Hong Kong Polytechnic University; Griffith University; CSIRO; The University of Hong Kong

AI summary: Targets the mismatch between test data and CLIP's training data distribution in online zero-shot learning with the Label Shift Aware (LSA) method, which improves classification through domain adaptation and label shift correction.

English abstract

Vision-language models like Contrastive Language-Image Pre-Training (CLIP) have been extensively studied in data-scarce scenarios. A particularly challenging and realistic task in this area is online zero-shot learning with CLIP, where unknown test samples are predicted sequentially in random order by CLIP while keeping the feature extraction and model parameters fixed during the sequential inference phase. Most existing approaches in this setting address the problem by adapting representations online using incoming test samples, while neglecting the distribution of the data on which CLIP was initially trained. This mismatch can lead to degraded performance when the label distribution in the test data differs from that of the training domain. To address this gap, we propose Label Shift Aware (LSA), which formulates the online zero-shot classification task as a domain adaptation problem. Specifically, LSA adapts the predictions computed by CLIP, which was trained on an unknown source distribution, to a target distribution using only unlabeled test data, and applies label shift correction to mitigate the mismatch between the source and target domains. The extensive experiments across multiple datasets demonstrate that the proposed LSA consistently outperforms state-of-the-art online zero-shot learning methods based on CLIP.
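
Label shift correction in its classical black-box form can be sketched in a few lines. The code below is the textbook prior re-weighting with an EM estimate of the target class prior from unlabeled predictions; it is not the LSA algorithm, whose details the abstract does not give. The uniform source prior and every name below are assumptions for illustration, since CLIP's true source label distribution is unknown.

```python
def reweight(probs, source_prior, target_prior):
    """Prior-shift correction: p_t(y|x) ∝ p_s(y|x) * pi_t(y) / pi_s(y)."""
    scaled = [p * t / s for p, s, t in zip(probs, source_prior, target_prior)]
    z = sum(scaled)
    return [v / z for v in scaled]

def estimate_target_prior(batch_probs, source_prior, iters=20):
    """EM estimate of the target class prior from unlabeled predictions."""
    prior = list(source_prior)
    for _ in range(iters):
        # E-step: correct every prediction with the current prior estimate.
        post = [reweight(p, source_prior, prior) for p in batch_probs]
        # M-step: the new prior is the mean corrected posterior per class.
        prior = [sum(col) / len(post) for col in zip(*post)]
    return prior

# ASSUMPTION: uniform source prior over two classes; toy softmax outputs.
source = [0.5, 0.5]
stream = [[0.9, 0.1]] * 8 + [[0.2, 0.8]] * 2
target = estimate_target_prior(stream, source)
corrected = [reweight(p, source, target) for p in stream]
```

In an online setting the prior estimate would be refreshed as test samples arrive rather than computed over a fixed batch; the batch form is used here only to keep the sketch short.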

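2606.15202 2026-06-16 cs.CV new submission

Comparing Human Gaze and Vision-Language Model Attention in Safety-Relevant Environments

Marta Vallejo, Siwen Wang

Affiliations: Heriot-Watt University

AI summary: Using an eye-tracking experiment and vision-language models such as GPT-4o, compares how humans and models distribute attention in safety-relevant scenes and finds that the models can approximate human gaze patterns without eye-tracking training data.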

Comments 30 pages, 33 figures. Submitted as a preprint. Code and data available upon reasonable request

English abstract

Human visual attention plays an important role in how people perceive and respond to environments containing potential risks. This study investigates whether large vision-language models can identify the same regions of a scene that attract human attention in safety-relevant environments. Eye-tracking data were collected from ten participants viewing 33 scene images representing environments with varying levels of potential risk using Pupil Invisible wearable glasses. Gaze coordinates were mapped onto stimulus images to generate population-averaged human gaze heatmaps. In parallel, GPT-4o was prompted through the OpenAI Vision Application Programming Interface (API) to generate spatial predictions of visual attention, which were converted into saliency maps for comparison with human gaze patterns. Spatial alignment between human gaze heatmaps and model-generated saliency maps was evaluated using four complementary metrics: Pearson correlation (r = 0.515 ± 0.117), Normalised Scanpath Saliency (NSS = 0.988 ± 0.323), Kullback-Leibler divergence (KL = 1.766 ± 0.844), and Area Under the Receiver Operating Characteristic Curve using the Judd formulation (AUC-Judd = 0.806 ± 0.076). A cross-model comparison with Gemini Pro, Gemini Flash, and Claude showed that all models exceeded the AUC-Judd chance baseline of 0.5 and achieved positive NSS scores. Gemini Pro demonstrated the strongest spatial localisation according to three of the four metrics, whereas GPT-4o produced the closest distributional match to human attention as measured by KL divergence. These findings suggest that large vision-language models can identify regions that broadly correspond to where humans direct visual attention in safety-relevant scenes without requiring eye-tracking training data. The results highlight the potential of vision-language models as a scalable tool for approximating human attentional patterns.
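
Three of the four alignment metrics named in the abstract have short standard definitions, sketched below on flattened maps. These follow the common saliency-benchmark formulas, but the study's exact preprocessing (smoothing, resizing, choice of epsilon, population versus sample standard deviation) is not stated, so the details and all names here are assumptions. AUC-Judd is omitted to keep the sketch short.

```python
import math

def pearson_cc(sal, gaze):
    """Pearson correlation between two maps flattened to equal-length lists."""
    n = len(sal)
    ms, mg = sum(sal) / n, sum(gaze) / n
    cov = sum((s - ms) * (g - mg) for s, g in zip(sal, gaze))
    var_s = sum((s - ms) ** 2 for s in sal)
    var_g = sum((g - mg) ** 2 for g in gaze)
    return cov / math.sqrt(var_s * var_g)

def nss(sal, fixations):
    """Mean z-scored saliency at fixated pixels (fixations is a 0/1 mask)."""
    n = len(sal)
    mean = sum(sal) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in sal) / n)  # population std (assumption)
    hits = [(s - mean) / std for s, f in zip(sal, fixations) if f]
    return sum(hits) / len(hits)

def kl_divergence(gaze, sal, eps=1e-12):
    """KL(gaze || sal) after normalising both maps to sum to 1."""
    zg, zs = sum(gaze), sum(sal)
    return sum((g / zg) * math.log(eps + (g / zg) / (s / zs + eps))
               for g, s in zip(gaze, sal))

# Toy 4-pixel maps (illustrative values only).
saliency = [0.1, 0.2, 0.3, 0.4]
gaze_map = [0.1, 0.1, 0.3, 0.5]
fix_mask = [0, 0, 0, 1]
scores = (pearson_cc(saliency, gaze_map), nss(saliency, fix_mask),
          kl_divergence(gaze_map, saliency))
```

Higher correlation and NSS indicate better agreement, while KL divergence is a dissimilarity, so lower is better; that is why the abstract can name one model as best on three metrics and a different one as best on KL.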

2606.15608 2026-06-16 cs.CV new submission

On the Adversarial Robustness of Multimodal LLM Judges

Zihan Wang, Guansong Pang, Zelin Liu, Wenjun Miao, Jin Zheng, Xiao Bai

Affiliations: School of Computer Science and Engineering, Beihang University; State Key Laboratory of Virtual Reality Technology and System, Beihang University; State Key Laboratory of Software Development Environment, Jiangxi Research Institute, Beihang University; School of Computing and Information Systems, Singapore Management University

AI summary: Proposes the RobustMLLMJudge framework for evaluating the adversarial robustness of multimodal large language models used as judges, and designs the MGSIA attack, which generates transferable score-inflating perturbations through semantic induction and high-score manifold alignment, exposing their vulnerability.

详情

AI中文摘要

多模态大语言模型（MLLMs）越来越多地被用作自动评判器，例如用于图像质量和安全评估。然而，它们的对抗鲁棒性在很大程度上尚未被探索，威胁到自动评判的公平性和可靠性。为弥补这一差距，我们引入了RobustMLLMJudge，这是第一个用于评估通用MLLM在充当评判器时对抗鲁棒性的通用框架。它涵盖了针对质量与安全评估场景中主流评判方法的各种攻击。利用RobustMLLMJudge，我们发现：i) 不同的MLLM评判器极易受到分数膨胀的对抗攻击；ii) 尽管这些攻击方法有效，但由于MLLM评判器评估协议中的独特约束，它们面临关键挑战。我们进一步提出了MGSIA，即流形引导语义诱导攻击，这是一种绕过这些约束的新方法，能够对MLLM评判器实施更有效且可迁移的攻击。MGSIA的核心思想是将肯定性语义诱导与高分流形对齐相结合：它最大化评判器对二元语义查询产生肯定性响应（例如“是”）的概率，同时将对抗性表示正则化到从代理协议估计的高分中心附近。这些目标共同产生可迁移的分数膨胀扰动。大量实验证明了MGSIA在不同评估场景下欺骗先进MLLM评判器的优越性和泛化能力，凸显了对鲁棒MLLM评判器的需求。代码和数据将在https://github.com/mala-lab/RobustMLLMJudge提供。

英文摘要

Multimodal Large Language Models (MLLMs) are increasingly used as automated judges, e.g., for image quality and safety assessment. However, their adversarial robustness remains largely unexplored, threatening the fairness and reliability of automated judging. To bridge this gap, we introduce RobustMLLMJudge, the first general framework for evaluating the adversarial robustness of general-purpose MLLMs when functioning as judges. It covers diverse attacks against popular judge approaches across quality and safety evaluation scenarios. Using RobustMLLMJudge, we reveal that i) different MLLM judges are highly vulnerable to score-inflating adversarial attacks; and ii) although effective, these attack methods face a critical challenge due to unique constraints in the evaluation protocols of MLLM judges. We further propose MGSIA, namely Manifold-Guided Semantic Induction Attack, a novel method that bypasses these constraints to enable more effective and transferable attacks on MLLM judges. The core idea of MGSIA is to combine affirmative semantic induction with high-score manifold alignment: it maximizes the probability that judges yield affirmative responses (e.g., "Yes") to binary semantic queries, while regularizing adversarial representations toward high-score centers estimated from proxy protocols. Together, these objectives yield transferable score-inflating perturbations. Extensive experiments demonstrate the superiority and generalizability of MGSIA in deceiving advanced MLLM judges under different evaluation scenarios, highlighting the need for robust MLLM judges. Code and data will be made available at https://github.com/mala-lab/RobustMLLMJudge.

2606.15779 2026-06-16 cs.CV cs.LG new submission

The Importance of Phase in Neural Representations: An Internal Oppenheim-Lim Test for Image Classifiers

Alper Yıldırım

AI summary: Using internal phase-magnitude transplant experiments, finds that the predictions of image classifiers (e.g., PRISM2D, GFNet, ViT-B/16) depend mainly on phase/sign information, while image-specific magnitude contributes little to the readout; ResNet-50 carries a latent sign code before its ReLUs, which points to a mechanism behind the texture-shape gap between CNNs and attention models.

Abstract

Oppenheim and Lim (1981) showed that natural images stay recognizable when reconstructed from their Fourier phase alone, while the magnitude carries little of their identity. We ask whether trained image classifiers reproduce this asymmetry inside their hidden layers, and we test it causally: given two images, we transplant the phase of one onto the magnitude of the other at a chosen layer and record which image the prediction follows. In PRISM2D, GFNet, and ViT-B/16 the prediction follows the phase or sign donor, and deleting all image-specific magnitude barely moves accuracy, so identity rides on phase while image-specific magnitude is largely dispensable to the readout. ResNet-50 at first seems to break the pattern, because transplanting sign after its ReLUs does nothing; a fair intervention before the ReLU reveals a strong latent sign code in the late blocks, and a DC-only control shows the readout consumes a channel-wise spatial average. Controls rule out the trivial case in which magnitude simply stops depending on the image. The architectures therefore share a phase/sign identity code but expose it in different bases, set by rectification and readout geometry, which gives a mechanistic account of the texture--shape gap between CNNs and attention models.
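
The transplant operation described above can be sketched at the pixel level with NumPy's FFT. This is an illustrative sketch only: the function names are hypothetical, the paper applies the intervention to hidden-layer activations and not to raw arrays, and the sign version is an assumed real-valued analogue for rectified features.

```python
import numpy as np

def transplant_phase(magnitude_source, phase_donor):
    """Combine the Fourier magnitude of one array with the phase of another."""
    mag = np.abs(np.fft.fft2(magnitude_source))
    phase = np.angle(np.fft.fft2(phase_donor))
    hybrid_spectrum = mag * np.exp(1j * phase)
    # For real inputs the hybrid spectrum is Hermitian, so the imaginary
    # part of the inverse transform is numerical noise only.
    return np.real(np.fft.ifft2(hybrid_spectrum))

def transplant_sign(magnitude_source, sign_donor):
    """Real-valued analogue: keep |a| from one map and the sign from another."""
    return np.abs(magnitude_source) * np.sign(sign_donor)

rng = np.random.default_rng(0)
a = rng.standard_normal((8, 8))   # supplies the magnitude
b = rng.standard_normal((8, 8))   # supplies the phase
hybrid = transplant_phase(a, b)
```

In the experiment, the hybrid would be passed through the remaining layers and the prediction compared with the labels of the two source images to see which donor it follows.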

2606.15117 2026-06-16 cs.MM cs.AI cs.CV cs.LG cs.SD cross-list

Teacher-Student Structure for Domain Adaptation in Ensemble Audio-Visual Video Deepfake Detection

Elham Abolhasani, Maryam Ramezani, Hamid R. Rabiee

Affiliations: Department of Computer Engineering, Sharif University of Technology

AI summary: Proposes the EAV-DFD method, which adds a teacher-student domain adaptation mechanism to improve generalization to unseen domains, raising AUC by 4.09%, 17.94%, and 0.5% on three datasets.

DOI: 10.1109/TAI.2025.3642217

Abstract

The rapid advancement of generative AI models is leading to more realistic deepfake media, encompassing the manipulation of audio, video, or both. This raises severe privacy and societal concerns. Numerous studies in this area have yielded promising intra-domain results; however, these models frequently exhibit decreased efficacy when faced with data from dissimilar domains. Consequently, recent deepfake detection approaches focus on enhancing generalization through techniques that incorporate all input modalities, including audio, images, and their interactions. In this regard, we propose EAV-DFD, a generalized deep ensemble audio-visual model combined with a domain adaptation mechanism that uses a teacher-student framework to improve the model's performance and generalization on unseen domains. To evaluate the model's performance, we used the FakeAVCeleb dataset as the primary domain and the DFDC, Deepfake_TIMIT, and PolyGlotFake datasets as unseen domains. Our experimental results demonstrate that the proposed framework is effective for domain adaptation, improving the model's AUC by 4.09%, 17.94%, and 0.5% on the three unseen datasets while using only a small portion of each to train the student model. This leads to a novel deepfake detection model capable of adapting to new domains and interpreting which modality has been manipulated, highlighting the potential of our approach for real-world applications.

2606.15993 2026-06-16 cs.CY cs.CV cross-list

Classifying by Proxy: Explainable and Reproducible Ensemble of Proxy Tasks for Child Sexual Abuse Imagery Classification

Clara Ernesto, Carlos Caetano, Sandra Avila, João Macedo, Camila Laranjeira, Leo S. F. Ribeiro

Affiliations: Instituto de Ciências Matemáticas e de Computação (ICMC), Universidade Estadual de São Paulo (USP); Instituto de Computação (IC), Universidade Estadual de Campinas (UNICAMP); Departamento de Ciência da Computação, Universidade Federal de Minas Gerais (UFMG); Instituto Federal de Educação, Ciência e Tecnologia de Minas Gerais (IFMG)

AI summary: Proposes an ensemble of proxy tasks for child sexual abuse imagery classification that improves reproducibility, explainability, and distribution security while reaching 91.9% balanced accuracy on the RCPD dataset.

Comments: 12 pages, 7 figures, 7 tables. Accepted at ACM FAccT 2026

Abstract

Child Sexual Abuse Imagery (CSAI) classification systems are needed solutions for lessening the psychological impacts often felt by law enforcement agents responsible for evaluating these materials and for efficient removal of these materials from the web. However, due to the nature of the task, researching and developing such systems is not a trivial endeavor. The images are highly sensitive, and the related datasets are under restrictive access regimes, which means most studies in the area are not reproducible or distributable and are therefore hard to compare and validate. More concerning still, most models for this task today lack an aspect often desired by law enforcement agents: explainability. In this paper, we apply an ensemble of Proxy Tasks -- tasks that correlate to CSAI classification -- yielding improvements in reproducibility, explainability, and security for distribution. This concept is applied for the first time to real CSAI, with a novel selection of relevant Proxy Tasks (selected from the CSAI literature) and training adaptations to the original framework. Our final model achieves competitive results, yielding 91.9% balanced accuracy on the RCPD dataset with the best Proxy Task combination. We furthermore contrast these results with the best-in-class representation learning model, DINO, and show that our ensemble improves accuracy and provides explanations for its classification results, a feature that a single deep learning model can seldom provide.

2606.16196 2026-06-16 cs.LG cs.CV cross-list

When Confidence Lacks Concepts: Interpretable OOD Detection via Representation Perturbations

Anju Chhetri, Pratik Shrestha, Ramesh Rana, Prashnna Gyawali, Binod Bhattarai

Affiliations: NepAl Applied Mathematics and Informatics Institute for research; West Virginia University; Kathmandu University; University College London; University of Aberdeen

AI summary: Proposes an interpretable OOD detection framework based on class-conditioned semantic perturbations and sparse autoencoders, which analyzes representation stability to both detect OOD inputs and explain the internal mechanisms involved.

Abstract

Deep neural networks have achieved remarkable performance across medical imaging tasks, yet their tendency to overgeneralize under distributional shifts poses a major obstacle to safe clinical deployment. Out-of-Distribution (OOD) detection methods aim to mitigate this risk, but most existing approaches rely on opaque internal signals with poorly understood semantic meaning, limiting trust in safety-critical settings. In this work, we propose an interpretable OOD detection framework that probes the stability of model predictions under class-conditioned semantic perturbations. Leveraging sparse autoencoders (SAEs), we learn class-specific concept vectors from in-distribution data that disentangle dense intermediate representations into sparse, semantically meaningful components. At inference, we perturb deeper-layer representations using the concept vectors associated with the model's predicted class and measure the stability of the class logits. We hypothesize that in-distribution samples exhibit low sensitivity to such perturbations, as their representations align with class-specific semantic directions, whereas OOD samples show amplified deviations due to representational misalignment. By framing OOD detection as a concept-conditioned stability analysis, our approach provides both a discriminative OOD signal and an interpretable lens into the internal mechanisms driving model uncertainty, making it particularly suitable for high-stakes medical applications.
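
The inference-time probe described above can be sketched as follows. This is a toy illustration under stated assumptions and not the authors' implementation: the two-layer `head`, the `eps` step size, the L2 norm of the logit shift as the score, and the `concept_vectors` array (which in the paper would come from a sparse autoencoder) are all hypothetical stand-ins.

```python
import numpy as np

def head(h, W1, W2):
    """Toy two-layer classifier head acting on a hidden representation."""
    return W2 @ np.maximum(W1 @ h, 0.0)

def stability_score(h, concept_vectors, W1, W2, eps=0.5):
    """Logit shift after nudging h along the predicted class's concept vector."""
    logits = head(h, W1, W2)
    pred = int(np.argmax(logits))
    direction = concept_vectors[pred]
    direction = direction / (np.linalg.norm(direction) + 1e-12)
    perturbed = head(h + eps * direction, W1, W2)
    # A larger shift means the prediction is less stable under the perturbation.
    return float(np.linalg.norm(perturbed - logits))
```

Under the abstract's hypothesis, a threshold on this score would separate in-distribution samples (small shift) from OOD samples (large shift); the sketch only shows how such a score could be computed.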

2606.16535 2026-06-16 cs.LG cs.CV cs.SC cross-list

Assessing Reliability of Symbol Detection in Concept Bottleneck Models

Javier Fumanal-Idocin, Javier Andreu-Perez

Affiliations: University of Essex

AI summary: Studies the reliability of symbol detection in concept bottleneck models (CBMs) by swapping independently trained concept detectors and classification heads to identify concepts prone to spurious firing, and proposes a reliability-aware training strategy, validated on CUB-200-2011 and a synthetic task.

Abstract

Concept Bottleneck Models (CBMs) are a relevant tool for explainable Artificial Intelligence because they make their predictions through human-interpretable symbols. However, high task accuracy does not guarantee that these symbols are detected faithfully: jointly trained CBMs may encode task-specific shortcuts in the bottleneck, making their explanations unreliable. In this paper, we study concept-detection reliability by swapping independently trained concept detectors and classification heads that share the same symbolic vocabulary. We use the resulting performance degradation, concept-level metrics, and symbol-wise uncertainty estimates to identify concepts that are especially prone to spurious firing. Finally, we propose a reliability-aware training strategy in which a shared concept detector is optimized with multiple classification heads and penalized for relying on globally or instance-wise unreliable symbols. On CUB-200-2011 with full concept supervision, detectors and heads are almost freely interchangeable (swap drop below one accuracy point, relative retention above 99%, and no concept detected below chance), whereas on a controlled synthetic task we show that, as the concept-supervision weight is reduced, models keep near-perfect task accuracy while swapped accuracy and agreement with the ground-truth concepts collapse to chance. Our reliability-aware training substantially mitigates this leakage, roughly doubling swap accuracy in the leaky regime.

2501.01908 2026-06-16 cs.CV cs.LG eess.IV physics.med-ph version update

Training-Free Adversarial Robustness in Computational MRI

Mahdi Saberi, Chi Zhang, Mehmet Akçakaya

Affiliations: arXiv

AI summary: Proposes a method that mitigates adversarial attacks on MRI reconstruction models without retraining; based on cyclic measurement consistency, it minimizes an objective within a small neighborhood of the attacked input and substantially reduces the effect of adversarial perturbations.

Comments: International Conference on Machine Learning (ICML), 2026

Abstract

Deep learning (DL) methods have become the state-of-the-art for reconstructing sub-sampled magnetic resonance imaging (MRI) data. However, studies have shown that these methods are susceptible to small adversarial input perturbations, resulting in major distortions in the output images. Various strategies have been proposed to reduce the effects of these attacks, but they require retraining. In this work, we propose a novel approach for mitigating adversarial attacks on MRI reconstruction models without any retraining. Based on the idea of cyclic measurement consistency, we devise a novel mitigation objective that is minimized in a small ball around the attack input. Results show that our method substantially reduces the impact of adversarial perturbations across different datasets, attack types/strengths and PD-DL networks, and qualitatively and quantitatively outperforms conventional mitigation methods. We also introduce a practically relevant scenario for small adversarial perturbations that models impulse noise in raw data, which relates to herringbone artifacts, and show the applicability of our approach in this setting. Finally, we show our mitigation approach remains effective in two realistic extension scenarios: a blind setup, where the attack strength or algorithm is not known to the user; and an adaptive attack setup, where the attacker has full knowledge of the defense strategy.

2507.02288 2026-06-16 cs.CV cs.LG version update

Prompt Disentanglement via Language Guidance and Representation Alignment for Domain Generalization

De Cheng, Zhipeng Xu, Xinyang Jiang, Dongsheng Li, Nannan Wang, Xinbo Gao

Affiliations: School of Telecommunications Engineering, the State Key Laboratory of Integrated Services Networks (ISN), Xidian University, Xi’an, China; Microsoft Research Asia, Shanghai, China

AI summary: Proposes using a large language model to disentangle text prompts automatically and introduces Worst Explicit Representation Alignment, which adds abstract prompts to increase source-domain diversity, in order to learn domain-invariant visual representations; the method outperforms existing approaches on multiple benchmarks.

DOI: 10.1109/TPAMI.2026.3661049
Journal ref: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 48, no. 6, pp. 6799-6816, June 2026

Abstract

Domain Generalization (DG) seeks to develop a versatile model capable of performing effectively on unseen target domains. Notably, recent advances in pre-trained Visual Foundation Models (VFMs), such as CLIP, have demonstrated considerable potential in enhancing the generalization capabilities of deep learning models. Despite the increasing attention toward VFM-based domain prompt tuning within DG, the effective design of prompts capable of disentangling invariant features across diverse domains remains a critical challenge. In this paper, we propose addressing this challenge by leveraging the controllable and flexible language prompt of the VFM. Noting that the text modality of VFMs is naturally easier to disentangle, we introduce a novel framework for text feature-guided visual prompt tuning. This framework first automatically disentangles the text prompt using a large language model (LLM) and then learns domain-invariant visual representation guided by the disentangled text feature. However, relying solely on language to guide visual feature disentanglement has limitations, as visual features can sometimes be too complex or nuanced to be fully captured by descriptive text. To address this, we introduce Worst Explicit Representation Alignment (WERA), which extends text-guided visual prompts by incorporating an additional set of abstract prompts. These prompts enhance source domain diversity through stylized image augmentations, while alignment constraints ensure that visual representations remain consistent across both the original and augmented distributions. Experiments conducted on major DG datasets, including PACS, VLCS, OfficeHome, DomainNet, and TerraInc, demonstrate that our proposed method outperforms state-of-the-art DG methods.

2511.20710 2026-06-16 cs.CV cs.AI cs.CR version update

Are Neuro-Inspired Multi-Modal Vision-Language Models Resilient to Membership Inference Privacy Leakage?

David Amebley, Sayanton Dibbo

Affiliations: The University of Alabama; Alabama Center for the Advancement of AI; Trustworthy AI Lab; Department of Computer Science, The University of Alabama

AI summary: Studies the resilience of neuro-inspired multi-modal vision-language models (VLMs) to image-text-based membership inference attacks and proposes a topological regularization framework; experiments show that neuro VLMs markedly lower attack success while preserving model utility.

Comments: Accepted at USENIX WOOT '26

Abstract

In the age of agentic AI, the growing deployment of multi-modal models (MMs) has introduced new attack vectors that can leak sensitive training data, causing privacy leakage. This paper investigates a black-box privacy attack, the membership inference attack (MIA), on multi-modal vision-language models (VLMs). State-of-the-art research analyzes privacy attacks primarily on unimodal AI-ML systems, while recent studies indicate that MMs can also be vulnerable to privacy attacks. While researchers have demonstrated that biologically inspired neural network representations can improve unimodal model resilience against adversarial attacks, it remains unexplored whether neuro-inspired MMs are resilient against privacy attacks. In this work, we introduce a systematic neuroscience-inspired topological regularization (tau) framework to analyze the resilience of MM VLMs against image-text-based inference privacy attacks. We examine this phenomenon using three VLMs (BLIP, PaliGemma 2, and ViT-GPT2) across three benchmark datasets (COCO, CC3M, and NoCaps). Our experiments compare the resilience of baseline and neuro VLMs (with topological regularization), where the tau > 0 configuration defines the NEURO variant of a VLM. Our results on the BLIP model using the COCO dataset show that MIA attack success in NEURO VLMs drops by 24% in mean ROC-AUC, while model utility (similarity between generated and reference captions) remains comparable in terms of MPNet and ROUGE-2 metrics. This shows that neuro VLMs are comparatively more resilient against privacy attacks without significantly compromising model utility. Our extensive evaluation with the PaliGemma 2 and ViT-GPT2 models on two additional datasets, CC3M and NoCaps, further validates the consistency of the findings. This work contributes to the growing understanding of privacy risks in MMs and provides evidence of the resilience of neuro VLMs to privacy threats.
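
Attack success is reported above as ROC-AUC. A minimal sketch of how that number could be computed from per-sample attack scores follows; it uses the standard rank-based (Mann-Whitney) form of AUC. The function name and the idea of scoring each sample by, say, caption similarity are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np

def roc_auc(member_scores, nonmember_scores):
    """Probability that a random member outscores a random non-member.

    Ties count as half, which matches the area under the ROC curve.
    """
    m = np.asarray(member_scores, dtype=float)[:, None]
    n = np.asarray(nonmember_scores, dtype=float)[None, :]
    return float(np.mean((m > n) + 0.5 * (m == n)))

# Hypothetical attack scores: higher means "more likely a training member".
auc = roc_auc([0.9, 0.2], [0.5, 0.1])
```

An AUC of 0.5 corresponds to an attacker who does no better than guessing, so a defence that lowers AUC toward 0.5 is reducing membership leakage.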

2603.17531 2026-06-16 cs.CV cs.AI cs.CR version update

Rel-Zero: Harnessing Patch-Pair Invariance for Robust Zero-Watermarking Against AI Editing

Pengzhen Chen, Yanwei Liu, Xiaoyan Gu, Xiaojun Chen, Wu Liu, Weiping Wang

AI summary: To counter the threat that AI editing poses to image authenticity, proposes the Rel-Zero zero-watermarking framework, which exploits the invariance of patch-pair relational distances under editing to derive a robust watermark without modifying the original image; experiments show that it outperforms existing methods.

Comments: accepted to CVPR 2026

Abstract

Recent advancements in diffusion-based image editing pose a significant threat to the authenticity of digital visual content. Traditional embedding-based watermarking methods often introduce perceptible perturbations to maintain robustness, inevitably compromising visual fidelity. Meanwhile, existing zero-watermarking approaches, typically relying on global image features, struggle to withstand sophisticated manipulations. In this work, we uncover a key observation: while individual image patches undergo substantial alterations during AI-based editing, the relational distance between patch pairs remains relatively invariant. Leveraging this property, we propose Relational Zero-Watermarking (Rel-Zero), a novel framework that requires no modification to the original image but derives a unique zero-watermark from these editing-invariant patch relations. By grounding the watermark in intrinsic structural consistency rather than absolute appearance, Rel-Zero provides a non-invasive yet resilient mechanism for content authentication. Extensive experiments demonstrate that Rel-Zero achieves substantially improved robustness across diverse editing models and manipulations compared to prior zero-watermarking approaches.
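
The idea of deriving a signature from relations between patches, without touching the image, can be sketched as below. This is a loose illustration under stated assumptions and not the Rel-Zero algorithm: raw pixel patches stand in for whatever features the paper uses, and the keyed comparison of two patch-pair distances per bit, along with every function name and parameter, is hypothetical.

```python
import numpy as np

def patch_vectors(image, patch=4):
    """Split a 2-D image into non-overlapping patches, one flattened row per patch."""
    h, w = image.shape
    blocks = image[: h - h % patch, : w - w % patch]
    blocks = blocks.reshape(h // patch, patch, w // patch, patch).swapaxes(1, 2)
    return blocks.reshape(-1, patch * patch)

def relational_signature(image, key=0, n_bits=64, patch=4):
    """One bit per keyed comparison of two patch-pair distances."""
    vecs = patch_vectors(image, patch)
    rng = np.random.default_rng(key)  # the key fixes which pairs are compared
    idx = rng.integers(0, len(vecs), size=(n_bits, 4))
    d_first = np.linalg.norm(vecs[idx[:, 0]] - vecs[idx[:, 1]], axis=1)
    d_second = np.linalg.norm(vecs[idx[:, 2]] - vecs[idx[:, 3]], axis=1)
    return (d_first > d_second).astype(np.uint8)

def bit_agreement(sig_a, sig_b):
    """Fraction of matching bits between two signatures."""
    return float(np.mean(sig_a == sig_b))
```

Verification would recompute the signature on a suspect image with the same key and compare it with the registered one; because only distances between patches enter the bits, a change that shifts all patches alike leaves the signature largely intact.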

2603.24058 2026-06-16 cs.CV cs.AI version update

Mitigating Object Hallucinations in LVLMs via Attention Imbalance Rectification

Han Sun, Qin Li, Peixin Wang, Min Zhang

Affiliations: Shanghai Key Laboratory of Trustworthy Computing, East China Normal University

AI summary: Identifies attention imbalance across modalities and among tokens as a causal factor in object hallucination and proposes AIR, a lightweight decoding-time intervention that rectifies the imbalance by reallocating attention weights, cutting hallucination by up to 35.1% on several benchmarks while improving general capability.

Comments: CVPR 2026 Findings Track, code is available at https://github.com/Ice-wave/AIR

Abstract

Object hallucination in Large Vision-Language Models (LVLMs) severely compromises their reliability in real-world applications, posing a critical barrier to their deployment in high-stakes scenarios such as autonomous driving and medical image analysis. Through systematic empirical investigation, we find that imbalanced attention allocation, both across modalities (i.e., vision and language) and within modalities (among individual tokens), exhibits a strong causal correlation with the occurrence of object hallucination. Leveraging this insight, we introduce a novel concept termed attention imbalance, which not only quantifies the degree of attention disparity but also visually delineates the underlying patterns (e.g., over-attentiveness to irrelevant language tokens or under-attentiveness to discriminative visual features) that drive object hallucination. To mitigate object hallucination, we further propose Attention Imbalance Rectification (AIR), a lightweight decoding-time intervention method that reallocates attention weights and adjusts attention distributions to rectify modality-wise and token-wise imbalances. Extensive evaluations on four mainstream LVLMs and three benchmarks (CHAIR, POPE, and MM-Vet) with seven baselines demonstrate that AIR consistently reduces object hallucination rates, achieving up to a 35.1% reduction compared to the baselines, while improving LVLMs' general capability by up to 15.9% across diverse vision-language tasks.

2605.00591 2026-06-16 cs.CV version update

Intrinsic Gradient Suppression for Label-Noise Prompt Tuning in Vision-Language Models

Jiayu Li, Jiaxin Qi, Sheng Zhou, Jiaqiang Huang, Xiansheng Hua

Affiliations: University of Science and Technology of China; Tsinghua University

AI summary: Proposes Double-Softmax Prompt Tuning (DSPT), whose intrinsic gradient suppression mechanism adaptively damps the gradients of high-error noisy samples, enabling robust prompt tuning under label noise.

Abstract

Contrastive vision-language models like CLIP exhibit remarkable zero-shot generalization. However, prompt tuning remains highly sensitive to label noise, as mislabeled samples generate disproportionately large gradients that can overwhelm pre-trained priors. We argue that because CLIP already provides a near-optimal initialization, adaptation should be inherently conservative, particularly against the extreme gradient updates common in noisy settings. To this end, we propose Double-Softmax Prompt Tuning (DSPT), a hyperparameter-free method for intrinsic gradient suppression. By applying a sequential probabilistic normalization, DSPT induces a self-adaptive saturation zone that suppresses gradients from high-error noisy samples while maintaining informative updates. We also provide both theoretical analysis and empirical evidence about how this mechanism achieves adaptive suppression. This design transforms "gradient vanishing", traditionally a training bottleneck, into a principled noise-filtering shield for label-noise prompt tuning. Extensive experiments confirm that this simple, drop-in design achieves state-of-the-art robustness across various noisy benchmarks, outperforming methods with complex architectures and handcrafted hyperparameters.
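
The saturation effect described above can be checked numerically on a toy example. The sketch below compares the cross-entropy gradient with respect to the logits under a single softmax and under two softmaxes applied in sequence. It assumes the plain composition softmax(softmax(z)) with no temperature or scaling, which the abstract does not specify, so it illustrates the mechanism and is not DSPT itself.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def grad_single(z, y):
    """Gradient w.r.t. z of -log softmax(z)[y]."""
    g = softmax(z)
    g[y] -= 1.0
    return g

def grad_double(z, y):
    """Gradient w.r.t. z of -log softmax(softmax(z))[y], via the chain rule."""
    p = softmax(z)
    g = softmax(p)   # second normalisation
    g[y] -= 1.0      # gradient w.r.t. the inner probabilities p
    # Multiply by the softmax Jacobian diag(p) - p p^T.
    return p * (g - np.dot(p, g))

z = np.array([10.0, 0.0, 0.0])  # the model confidently predicts class 0
y = 1                           # but the (possibly noisy) label says class 1
ratio = np.linalg.norm(grad_double(z, y)) / np.linalg.norm(grad_single(z, y))
```

For a confidently contradicted label the inner softmax is nearly one-hot, its Jacobian is close to zero, and the double-softmax gradient is orders of magnitude smaller than the single-softmax one, which is the suppression behaviour the abstract relies on.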

2606.00435 2026-06-16 cs.CV cs.AI version update

MMLongEmbed: Benchmarking Multimodal Embedding Models in Long-Context Scenarios

Haitian Wang, Ruoxi Sun, Quantong Qiu, Juntao Li, Junhui Li, Hua Chen, Jinxiong Chang, Min Zhang

Affiliations: Soochow University; Ant Group

AI summary: Addresses the lack of systematic evaluation of multimodal embedding models in long-context scenarios by proposing MMLongEmbed, the first comprehensive benchmark, which covers retrieval tasks over text, document, and video modalities and reveals bottlenecks such as reliance on superficial feature matching and difficulty capturing deep semantic dependencies.

Abstract

Recent advancements have significantly expanded the theoretical context windows of Multimodal Embedding Models (MEMs). However, larger context windows do not necessarily translate into effective comprehension and representation of long-context multimodal inputs, which remains a critical bottleneck for real-world deployment. To address the lack of systematic evaluation in this setting, we introduce MMLongEmbed, the first comprehensive benchmark for evaluating MEMs in long-context scenarios. MMLongEmbed comprises four retrieval tasks spanning multiple context-length ranges, covering text, document, and video modalities. Through extensive evaluation of state-of-the-art models, we find that current architectures rely heavily on superficial feature matching and struggle to capture deep semantic and structural dependencies. We further observe that performance degradation varies systematically with context length and key information placement. Moreover, models exhibit substantially different robustness to redundant contextual information across modalities. For reproducibility, the benchmark and code are publicly available.

2606.14757 2026-06-16 cs.CV cs.LG 新提交

Spatial Priors via Space Filling Curves for Small and Limited Data Vision Transformers

基于空间填充曲线的小型与有限数据视觉Transformer的空间先验

Leyla Naz Candogan, Arshia Afzal, Pol Puigdemont, Volkan Cevher

发表机构 * ETH Zürich（苏黎世联邦理工学院）

AI总结提出VIOLIN，一种轻量级掩码注意力机制，通过空间填充曲线编码空间结构，以极小的参数和计算开销为视觉Transformer注入空间归纳偏置，在小模型和有限数据场景下显著提升性能。

Comments ICML 2026

详情

AI中文摘要

尽管视觉Transformer（ViT）已成为许多计算机视觉任务中的主导骨干网络，但由于置换等变性，其注意力机制缺乏显式的空间归纳偏置。这在模型容量小或训练数据有限的情况下尤为重要。受线性Transformer中的注意力掩码策略和视觉状态空间模型（SSM）的扫描模式的启发，我们引入了VIOLIN，一种轻量级掩码注意力机制，通过空间填充曲线（SFC）在注意力中编码空间结构，仅增加不到0.0015%的额外参数和可忽略的计算开销。VIOLIN使用多条SFC扫描图像，构建曲线特定的衰减掩码，然后将其组合并与注意力矩阵相乘。在广泛的评估中，VIOLIN持续提升性能。在有限数据场景下，例如在VTAB-1K上进行微调时，它提升了所有任务组的准确率，在空间信息至关重要的任务上提升高达8.7%。它可以与参数高效微调方法（如LoRA）结合，进一步提高性能。除了微调，VIOLIN在ImageNet-1K上预训练期间改进了各种小型ViT架构（如DeiT、DINO）。此外，在高度依赖位置信息的像素级CIFAR-100训练中，VIOLIN将准确率提升了高达7.2%。总体而言，VIOLIN提供了一种计算高效且有效的方式，将空间归纳偏置注入ViT，特别有利于小模型和有限数据场景。

Abstract

Although Vision Transformers (ViTs) have become the dominant backbone in many computer vision tasks, their attention mechanism lacks explicit spatial inductive biases because of permutation equivariance. This becomes particularly important in two settings: when model capacity is small or training data is limited. Inspired by the attention masking strategies in Linear Transformers and the scanning patterns of Vision SSMs, we introduce VIOLIN, a lightweight masked attention mechanism that encodes spatial structure within attention via Space Filling Curves (SFCs) with less than 0.0015% extra parameters and negligible computational overhead. VIOLIN scans the image using multiple SFCs to construct curve-specific decay masks, which are then combined and multiplied with the attention matrix. Across a wide range of evaluations, VIOLIN consistently improves performance. In limited-data regimes such as fine-tuning on VTAB-1K, it boosts accuracy across all task groups and by up to 8.7% on the tasks where spatial information is essential. It can be combined with parameter-efficient fine-tuning methods such as LoRA to further increase performance. Beyond fine-tuning, VIOLIN improves various small-scale ViT architectures (e.g., DeiT, DINO) during pretraining on ImageNet-1K. Additionally, on pixel-level CIFAR-100 training, a task that is highly dependent on location information, VIOLIN increases accuracy by up to 7.2%. Overall, VIOLIN provides a computationally efficient yet effective way to inject spatial inductive bias into ViTs, especially benefiting small models and limited-data settings.
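The masking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the choice of curves (a Z-order curve and a snake scan), the exponential decay form `gamma ** |rank_i - rank_j|`, and the averaging used to combine the masks are all assumptions made for illustration, since the abstract does not specify them.

```python
def morton_index(row, col, bits=4):
    """Z-order position of a cell: interleave the bits of row and col."""
    index = 0
    for b in range(bits):
        index |= ((col >> b) & 1) << (2 * b)
        index |= ((row >> b) & 1) << (2 * b + 1)
    return index


def snake_index(row, col, width):
    """Boustrophedon scan: left to right on even rows, right to left on odd rows."""
    return row * width + (col if row % 2 == 0 else width - 1 - col)


def curve_ranks(height, width, curve):
    """For each token (row-major order), return its rank along the chosen curve."""
    cells = [(r, c) for r in range(height) for c in range(width)]
    if curve == "morton":
        keys = [morton_index(r, c) for r, c in cells]
    elif curve == "snake":
        keys = [snake_index(r, c, width) for r, c in cells]
    else:
        raise ValueError(f"unknown curve: {curve}")
    order = sorted(range(len(cells)), key=keys.__getitem__)
    ranks = [0] * len(cells)
    for rank, token in enumerate(order):
        ranks[token] = rank
    return ranks


def decay_mask(ranks, gamma):
    """mask[i][j] = gamma ** |rank_i - rank_j| (assumed decay form)."""
    return [[gamma ** abs(ri - rj) for rj in ranks] for ri in ranks]


def combined_mask(height, width, gamma=0.9, curves=("morton", "snake")):
    """Average the curve-specific masks (averaging is an assumption)."""
    masks = [decay_mask(curve_ranks(height, width, c), gamma) for c in curves]
    n = height * width
    return [[sum(m[i][j] for m in masks) / len(masks) for j in range(n)]
            for i in range(n)]


def apply_mask(attention, mask):
    """Element-wise product of an attention matrix with the combined mask."""
    return [[a * m for a, m in zip(a_row, m_row)]
            for a_row, m_row in zip(attention, mask)]


# A 4x4 token grid with uniform attention, purely for demonstration.
mask = combined_mask(4, 4)
uniform_attention = [[1 / 16] * 16 for _ in range(16)]
masked_attention = apply_mask(uniform_attention, mask)
```

Because each mask depends only on token ranks along a curve, it adds no per-token parameters; in this sketch the only tunable quantity is the hypothetical decay rate `gamma`.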


2606.14760 2026-06-16 cs.CV cs.AI New submission

GeoRoPE: Ground-Aware Rotary Adaptation for Remote Sensing Foundation Models

Yu Luo, Kun Hu, Mengwei He, Xiaogang Zhu, Shan Zeng, Allen Benter, Wei Xiang, Patrick Filippi, Thomas Francis Bishop, Zhiyong Wang

Affiliations: The University of Sydney; Edith Cowan University; Adelaide University; Wuhan Polytechnic University; Climate, Orange Agricultural Institute; La Trobe University

AI summary: Proposes GeoRoPE, which uses geo-coordinate calibration and geo-frequency calibration to address scale mismatch in remote sensing foundation models, improving cross-resolution robustness and scale-sensitive representation learning.


Abstract

Remote-sensing foundation models (RSFMs) benefit from pretraining on imagery from multiple sensors and ground sampling distances (GSDs), but such exposure alone does not resolve scale mismatch during downstream adaptation. A fixed token-grid offset can correspond to different ground distances across sensors, making grid-based positional priors physically inconsistent. Meanwhile, heterogeneous spatial granularity means that compact urban regions and homogeneous landscapes may require different positional sensitivities even under the same GSD. Therefore, we propose GeoRoPE, a ground-aware, RoPE-compatible, and parameter-efficient spatial adaptation method for RSFMs. GeoRoPE recalibrates token-level positional interactions from two complementary aspects. First, Geo-Coordinate Calibration (GCC) rescales raw token-grid offsets according to the ground distance represented by one token-grid step, producing geo-calibrated relative coordinates across GSDs. Second, Geo-Frequency Calibration (GFC) adjusts the native RoPE frequency with a relation-specific factor, enabling position-sensitive adaptation to scene-dependent spatial granularity. GeoRoPE is injected into pretrained RSFMs through a lightweight adapter, preserving the frozen spatial prior while adding geo-aware positional corrections. Experiments across multiple RSFMs, sensors, resolutions, and downstream tasks demonstrate that GeoRoPE improves cross-resolution robustness and scale-sensitive representation learning.
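The coordinate rescaling idea can be sketched with simple arithmetic. This is an illustration under stated assumptions rather than the paper's formulation: the reference step of 160 m, the 16-pixel patch size, the two example sensors, and the way the frequency factor enters the angle are all hypothetical.

```python
def ground_step_m(gsd_m_per_px, patch_px):
    """Ground distance (metres) covered by one token-grid step."""
    return gsd_m_per_px * patch_px


def geo_calibrated_offset(token_offset, gsd_m_per_px, patch_px, reference_step_m):
    """Rescale a (d_row, d_col) token offset into units of a reference ground step."""
    scale = ground_step_m(gsd_m_per_px, patch_px) / reference_step_m
    d_row, d_col = token_offset
    return (d_row * scale, d_col * scale)


def rotary_angle(relative_coord, base_frequency, frequency_factor=1.0):
    """Rotation angle for one RoPE frequency, with an optional multiplicative factor.

    In the abstract the factor is relation-specific; here it is a plain argument.
    """
    return relative_coord * base_frequency * frequency_factor


# Hypothetical sensors: 10 m and 0.5 m GSD, both tokenised with 16-pixel patches.
REFERENCE_STEP_M = 160.0
coarse = geo_calibrated_offset((0, 3), 10.0, 16, REFERENCE_STEP_M)
fine = geo_calibrated_offset((0, 3), 0.5, 16, REFERENCE_STEP_M)
```

The same three-token offset spans 480 m on the coarse sensor and 24 m on the fine one, so the calibrated coordinates differ by a factor of 20 even though the raw grid offsets are identical.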


2606.14780 2026-06-16 cs.CV cs.LG New submission

YTClickbait21K: Human-Annotated Multimodal Dataset for YouTube Clickbait Detection Across Diverse Channels and Content Categories

Md. Minhazul Islam, Md. Tanbeer Jubaer, Amith Khandakar, Shovon Sarker, Sumaiya Rahman, Md. Masum Mia, Mohamed Arselene Ayari, Hamed Noori

Affiliations: Department of Computer Science and Engineering, Rajshahi University of Engineering & Technology; Department of Electrical Engineering, Qatar University; Department of Civil and Environmental Engineering, Qatar University; SenseNet Inc.

AI summary: To address the lack of large-scale, high-quality multimodal data for clickbait detection on video platforms, the authors build YTClickbait21K, a human-annotated dataset of 21,238 videos from 40 channels in 29 countries covering categories such as news, entertainment, education, and gaming. Quality is ensured through independent labeling by three annotators with majority voting, and the dataset serves as a benchmark for multimodal semantic understanding and automated content moderation.


Abstract

Clickbait content on video-sharing platforms poses a significant challenge to information reliability, yet progress in automated detection has been constrained by the lack of large-scale, high-quality multimodal datasets. We present YTClickbait21K, a human-annotated YouTube clickbait dataset comprising 21,238 videos collected from 40 channels across 29 countries, covering diverse content categories such as news, entertainment, education, and gaming. Each sample includes structured metadata (title, description, engagement statistics) along with associated thumbnail images, enabling comprehensive multimodal analysis. To ensure annotation quality, every video was independently labeled by three annotators using a standardized decision framework that incorporates textual, visual, and cross-modal consistency cues, with final labels determined through majority voting. The dataset exhibits substantial inter-annotator agreement (k=0.65), confirming reliable labeling despite the inherent subjectivity of clickbait detection. By combining scale, annotation rigor, and multimodal richness, this dataset provides a robust benchmark for developing and evaluating machine learning models, facilitating research in cross-modal semantic understanding, and advancing automated content moderation systems.
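The labeling protocol (three independent annotators, majority vote, an agreement coefficient) can be sketched as follows. The abstract does not say which agreement statistic the reported k refers to; the sketch assumes a kappa-type coefficient and uses Fleiss' kappa as one common choice for more than two raters. The label names and the toy ratings are made up for illustration.

```python
from collections import Counter


def majority_label(labels):
    """Return the label chosen by more than half of the annotators."""
    label, votes = Counter(labels).most_common(1)[0]
    if 2 * votes <= len(labels):
        raise ValueError("no strict majority")
    return label


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for items that each received the same number of ratings."""
    n_items = len(ratings)
    n_raters = len(ratings[0])
    counts = [[Counter(item)[c] for c in categories] for item in ratings]
    # Observed agreement: fraction of agreeing rater pairs per item, averaged.
    per_item = [
        (sum(k * k for k in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall share of each category.
    shares = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    expected = sum(p * p for p in shares)
    return (observed - expected) / (1 - expected)


# Toy ratings: four videos, three annotators each (hypothetical labels).
ratings = [
    ["clickbait", "clickbait", "clickbait"],
    ["normal", "normal", "normal"],
    ["clickbait", "clickbait", "normal"],
    ["normal", "normal", "clickbait"],
]
final_labels = [majority_label(item) for item in ratings]
kappa = fleiss_kappa(ratings, ["clickbait", "normal"])
```

With three annotators and two labels a strict majority always exists, which is one practical reason to use an odd number of raters.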


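2606.14795 2026-06-16 cs.CV New submission

Position: The Systemic Lack of Agency in Visual Reasoning

Yizhao Huang, Haoyang Chen, Shiqin Wang, Pohsun Huang, Jiayuan Li, Haoyuan Du, Yandong Shi, Zheng Wang, Zhixiang Wang

AI summary: Argues that current vision-language models cannot perform implicit reasoning because they lack the capacity for autonomous exploration, and proposes the V-IRD benchmark to evaluate this ability; experiments show that strong semantic recognition does not equate to active visual exploration.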

Comments Accepted by ICML 2026


Abstract

This paper argues that a systemic lack of Agency constrains the implicit reasoning capabilities of current Vision-Language Models (VLMs). Implicit reasoning refers to the ability to autonomously discover and utilize hidden visual evidence to bridge information gaps, rather than merely relying on explicitly specified targets. This capacity underlies human visual understanding and everyday reasoning. We argue that this limitation arises from a tendency to approach visual reasoning primarily as passive semantic retrieval, rather than as active, situated reasoning that depends on autonomous visual exploration. As a result, most existing benchmarks primarily assess Passive Capacity, leaving this aspect of reasoning largely unmeasured. To address this gap, we introduce the Visual Implicit Reasoning Diagnosing Benchmark (V-IRD), which targets this missing quadrant by requiring models to derive answers strictly through autonomous visual analysis. Our results show that, despite strong retrieval abilities, prominent VLMs struggle to utilize reference objects and to attend to visual evidence that requires self-directed inquiry. Simply put, strong semantic recognition does not equate to active visual exploration, revealing a critical gap in current VLMs. More information can be found at https://haoychen.github.io/Implicit-Reasoning/


2606.14926 2026-06-16 cs.CV New submission

FlexPooling with Simple Auxiliary Classifiers in Deep Networks

Muhammad Ali, Omar Alsuwaidi, Salman Khan

Affiliations: Department of Computer Vision, Mohamed bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE

AI summary: Proposes FlexPooling, an adaptive pooling method that replaces standard pooling with a learned weighted average, and attaches simple auxiliary classifiers, improving accuracy by 1-3% on several image classification datasets.


Journal ref: VISAPP 4 (18th), 497-505 2023


Abstract

In computer vision, the basic pipeline of most convolutional neural networks consists of multiple feature extraction layers, where the input signal is downsampled to a lower resolution in each subsequent layer. This downsampling process is commonly referred to as pooling, which is an essential operation in CNNs. Pooling improves robustness against transformations, reduces the number of trainable parameters, increases the receptive field, and lowers computation time. Since pooling is a lossy process but remains important for extracting high-level information from low-level representations, it is important to preserve the most prominent information from previous activations to improve network discriminability. Standard pooling is usually performed using dense pooling methods, such as max pooling or average pooling, or through strided convolutional kernels. In this paper, we propose a simple yet effective adaptive pooling method, called FlexPooling, which generalizes average pooling by learning a weighted average over activations jointly with the rest of the network. We further show that attaching Simple Auxiliary Classifiers (SAC) to the CNN improves performance, and we demonstrate the effectiveness of the proposed method compared with standard pooling methods. Experiments on multiple popular image classification datasets show that FlexPooling consistently outperforms baseline networks, achieving approximately 1 to 3 percent improvement in accuracy.
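A weighted-average pooling layer of the kind described can be sketched in a few lines. This is only an illustration of the general idea: sharing one weight per slot of a 2x2 window and normalising the weights with a softmax are assumptions, not details taken from the paper, and no training loop is shown.

```python
import math


def softmax(logits):
    """Turn unconstrained logits into positive weights that sum to 1."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_pool_2x2(feature_map, logits):
    """Pool non-overlapping 2x2 windows with one learnable weight per window slot.

    `logits` holds four values (top-left, top-right, bottom-left, bottom-right);
    in a real network they would be trained jointly with the other parameters.
    """
    weights = softmax(logits)
    pooled = []
    for r in range(0, len(feature_map) - 1, 2):
        row = []
        for c in range(0, len(feature_map[0]) - 1, 2):
            window = [
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            ]
            row.append(sum(w * x for w, x in zip(weights, window)))
        pooled.append(row)
    return pooled


feature_map = [[1.0, 2.0, 5.0, 6.0],
               [3.0, 4.0, 7.0, 8.0]]
as_average = weighted_pool_2x2(feature_map, [0.0, 0.0, 0.0, 0.0])
near_corner = weighted_pool_2x2(feature_map, [0.0, 0.0, 0.0, 10.0])
```

With equal logits the layer reduces to average pooling, which is the sense in which a learned weighted average generalizes it; strongly unequal logits move the output toward a single slot of the window.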


2606.14958 2026-06-16 cs.CV cs.IR cs.LG New submission

MVEB: Massive Video Embedding Benchmark

Adnan El Assadi, Roman Solomatin, Isaac Chung, Chenghao Xiao, Deep Shah, Manan Dey, Shriya Sudhakar, Zacharie Bugaud, Wissam Siblini, Ayush Sunil Munot, Yashwanth Devavarapu, Rakshitha Ireddi, Michelle Yang, Márton Kardos, Niklas Muennighoff, Kenneth Enevoldsen

AI summary: Proposes the MVEB benchmark, whose 23 tasks are used to evaluate 33 video embedding models; finds that no single model dominates and that the contribution of audio depends on annotation provenance; the benchmark is integrated into the MTEB ecosystem.


Abstract

We introduce the Massive Video Embedding Benchmark (MVEB), a 23-task benchmark for video embeddings spanning classification, zero-shot classification, clustering, pair classification, retrieval, and video-centric question answering. We evaluate 33 models and find that no single model dominates: MLLM-based embeddings lead on classification, clustering, pair classification, and QA; multimodal binding leads on retrieval and zero-shot classification; generative MLLMs without contrastive adaptation collapse on cross-modal tasks. Paired video-only vs. audio+video evaluations show that audio's contribution depends on dataset annotation provenance: audio helps when labels were produced from both modalities and hurts when they were produced from visuals alone, a six-point gap consistent across model families. MVEB is derived from MVEB+, a 184-task pool, and is designed to maintain task diversity while reducing evaluation cost. It integrates into the MTEB ecosystem for unified evaluation across text, image, audio, and video. We release MVEB and all 184 tasks along with code and a leaderboard at https://github.com/embeddings-benchmark/mteb.


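2606.15055 2026-06-16 cs.CV cs.AI New submission

XPASS-Vis: A Cross-Domain Personalized Image Aesthetic Assessment Dataset

Takato Hayashi, Hiroaki Takahara, Candy Olivia Mawalim, Hiromi Narimatsu, Akisato Kimura, Shiro Kumano, Shogo Okada

Affiliations: Japan Advanced Institute of Science and Technology; Communication Science Laboratories, NTT, Inc.

AI summary: Proposes XPASS-Vis, the first dataset for cross-domain personalized image aesthetic assessment, covering three domains (art, fashion, and landscape) with 6,526 stimuli rated by 129 annotators, and establishes baseline models for transferring personalized aesthetic preferences across domains; unsupervised domain adaptation recovers about 60% of the supervised upper bound.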


Abstract

Personalized image aesthetic assessment (PIAA) seeks to model, at the individual level, the subjective nature of aesthetic judgments toward artworks and photographs. Aesthetic preference is known to be both deeply personal and partially consistent across visual domains. Yet existing PIAA datasets and methods are largely confined to a single domain, or provide too few samples per annotator within each domain to enable personalization across domains. Consequently, the cross-domain generalization of personalized aesthetic preferences remains largely unexplored. To address this gap, we introduce XPASS-Vis, the first dataset explicitly designed for cross-domain PIAA. XPASS-Vis comprises 6,526 stimuli from three visual domains (art, fashion, and landscape) rated by 129 annotators, yielding 87,836 user-stimulus interactions, each annotated with an overall aesthetic score and nine aesthetic-emotion ratings. Notably, each annotator rated more than 200 stimuli per domain, providing sufficient per-domain coverage to support personalization both within and across domains. Moreover, we establish baseline models for cross-domain PIAA under unsupervised domain adaptation (UDA), where a model trained on a labeled source domain is transferred to an unlabeled target domain. A systematic evaluation of representative UDA approaches shows that the best-performing method recovers approximately 60% (Spearman's ρ = 0.28) of the supervised upper bound under a fully unsupervised setting. This provides encouraging evidence that personalized aesthetic preferences are, to a meaningful extent, transferable across visual domains. At the same time, a substantial gap remains, highlighting the need for PIAA-specific adaptation strategies. XPASS-Vis and the accompanying baselines provide a foundation for future research on cross-domain PIAA. All datasets and code will be made publicly available upon acceptance.


2606.15749 2026-06-16 cs.CV cs.AI cs.SY eess.SY New submission

OmniTraffic: A Controllable Generation Pipeline and Benchmark for Spatio-Temporal Traffic Reasoning

Maonan Wang, Zhengyan Huang, Kemou Jiang, Yuhang Fu, Jiayue Zhu, Yuxin Cai, Xingchen Zou, Qiaosheng Zhang, Yi Yu, Ding Wang, Xi Chen, Ben M. Chen, Yuxuan Liang, Zhiyong Cui, Man On Pun, Yirong Chen

Affiliations: The Chinese University of Hong Kong, Shenzhen; Shanghai AI Lab; Beihang University; Nanyang Technological University; The Hong Kong University of Science and Technology (Guangzhou); The Chinese University of Hong Kong

AI summary: Proposes OmniTraffic, a controllable generation pipeline and benchmark built on 3D reconstructions of 12 real intersections; using 8M VQA samples and a 3K human-verified test set, it evaluates 11 frontier MLLMs, reveals a pronounced human-model gap in topology and spatio-temporal reasoning, and shows that fine-tuning on simulated data improves performance on real scenes.

Comments 34 pages, 28 figures


Abstract

Traffic scene understanding requires models to reason beyond object recognition, including lane topology, multi-view geometry, temporal evolution, and signal-phase semantics. However, existing traffic-oriented multimodal benchmarks largely emphasize passive visual recognition or isolated video understanding, offering limited support for evaluating structure-aware traffic reasoning under controlled conditions. We introduce OmniTraffic, a controllable generation pipeline and benchmark for spatio-temporal traffic reasoning. Built around 12 real-world intersections reconstructed into editable 3D traffic environments and complemented by surveillance footage from two countries, OmniTraffic supports both controlled and natural-condition evaluation. It defines a three-level task hierarchy spanning scene perception, multi-view and temporal reasoning, and decision support. Using structured traffic metadata, OmniTraffic generates synchronized multi-view VQA samples covering vehicle states, lane functions, view-BEV correspondence, temporal dynamics, and signal-phase analysis, resulting in 8M VQA samples and a 3K human-verified test set. Evaluation of eleven frontier MLLMs reveals a large human-model gap, with the most pronounced failures in topology-grounded and spatio-temporal reasoning tasks. Fine-tuning a lightweight MLLM on simulated OmniTraffic data further improves performance on real-world traffic scenes, demonstrating the value of simulation-generated supervision for traffic-specific multimodal reasoning. Beyond a fixed dataset, OmniTraffic provides an extensible pipeline with configurable intersections, camera views, traffic demands, signal phases, visual conditions, and rare events.


2606.15867 2026-06-16 cs.CV New submission

CogCanvas: A Benchmark for Evaluating Multi-Subject Reference-Based Image Generation

Long-Bao Nguyen, Quang-Khai Tran, Tam V. Nguyen, Minh-Triet Tran, Trung-Nghia Le

Affiliations: University of Science, Ho Chi Minh City, Vietnam; University of Dayton, Ohio, United States; Vietnam National University, Ho Chi Minh City, Vietnam

AI summary: Proposes the CogCanvas benchmark, with 1,952 reference images and 1,361 compositional prompts, to evaluate generation involving multiple identities, object binding, and background scenes; introduces the BG-Sim and Attr-VQA metrics and finds that existing models degrade severely beyond three subjects.


Abstract

Multi-subject reference-based image generation requires jointly preserving multiple human identities, binding per-person objects and fashion items, and respecting a specified background scene, a regime where current diffusion models remain brittle. Existing benchmarks evaluate only one axis at a time and none jointly captures multi-identity composition with human-object interaction, background grounding, and spatial plausibility. We introduce CogCanvas, a benchmark of 1,952 curated reference images spanning 100 celebrity identities, 115 distinctive objects and fashion items, and 29 real-world background scenes including landmarks, from which we construct 1,361 compositional prompts covering 2-5 person group sizes. The curation pipeline combines DINOv2-based deduplication, two-stage aesthetic filtering, and automated derivation of structured interaction and position graphs that serve as ground-truth supervision. CogCanvas supports three tasks, reference-based multi-human-object generation (primary), text-to-image compositional generation, and reference retrieval, under a unified six-axis evaluation protocol. We introduce two metrics tailored to the multi-reference setting: BG-Sim, which scores background fidelity on SAM 3-masked regions via DINOv3 feature similarity, and Attr-VQA, which uses a multimodal LLM to verify per-subject attribute binding and inter-person interactions against the structured graphs. Benchmarking five SOTA methods reveals that every model degrades substantially as group size grows from 2 to 5, with near-complete failure on object/fashion binding beyond three subjects.


2606.15956 2026-06-16 cs.CV cs.AI cs.LG New submission

You Don't Need Strong Assumptions: Visual Representation Learning via Temporal Differences

Ninad Daithankar, Alexi Gladstone, Yann LeCun, Heng Ji

Affiliations: UIUC; New York University

AI summary: Proposes TDV, which learns from video in a self-supervised manner based on a causal assumption (the past causes the future), avoids strong inductive biases, and reaches the state of the art on dense spatial tasks.


Abstract

Progress in AI has largely been driven by methods that assume less. As compute and data increase, approaches with weaker inductive biases generally outperform those with stronger assumptions. This is particularly characteristic of the field of Visual Representation Learning, where approaches have gone from being dominated by Supervised Learning, to Weakly Supervised Learning, to the now widespread success of Self-Supervised Learning without human labels. Yet, even modern Self-Supervised Learning approaches still depend on strong inductive biases such as augmentations, masking, or cropping. If this trend holds, even these remaining biases should become bottlenecks at scale, and our experiments confirm this: the optimal strength of inductive biases decreases as data grows. This motivates the search for approaches that rely on fewer assumptions. To this end, we introduce Temporal Difference in Vision (TDV), a new paradigm for self-supervised learning from video that avoids existing inductive biases, relying instead on a causal assumption that the past causes the future. TDV functions by jointly training an image encoder and a motion encoder so that the current frame's representation plus the encoded motion equals the next frame's representation. Despite not leveraging any strong inductive biases, TDV matches state-of-the-art recipes on dense spatial tasks, laying the foundation for representation learning without strong assumptions.
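The training constraint stated in the abstract (current representation plus encoded motion equals the next representation) can be written as a simple loss. The sketch below is a toy under stated assumptions: the squared-error form, the linear encoders, and the shared weight matrix are illustrative choices, and the abstract does not describe the real architectures or how trivial solutions are avoided.

```python
def image_encoder(frame, weights):
    """Toy linear image encoder: one dot product per output dimension."""
    return [sum(w * x for w, x in zip(row, frame)) for row in weights]


def motion_encoder(frame_t, frame_next, weights):
    """Toy motion encoder: encodes the pixel difference between two frames."""
    diff = [b - a for a, b in zip(frame_t, frame_next)]
    return [sum(w * d for w, d in zip(row, diff)) for row in weights]


def temporal_difference_loss(z_current, motion, z_next):
    """Mean squared error between (current representation + motion) and the next one."""
    predicted = [z + m for z, m in zip(z_current, motion)]
    return sum((p - t) ** 2 for p, t in zip(predicted, z_next)) / len(z_next)


# Two tiny "frames" of three pixels each and a hypothetical 2x3 weight matrix.
frame_t = [1.0, 2.0, 3.0]
frame_next = [2.0, 2.0, 5.0]
W = [[1.0, 0.0, -1.0],
     [0.5, 0.5, 0.5]]

z_t = image_encoder(frame_t, W)
z_next = image_encoder(frame_next, W)
loss_with_motion = temporal_difference_loss(z_t, motion_encoder(frame_t, frame_next, W), z_next)
loss_without_motion = temporal_difference_loss(z_t, [0.0, 0.0], z_next)
```

In this linear toy the motion code explains the change between frames exactly, so the loss vanishes; ignoring motion leaves a nonzero residual.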


2606.16015 2026-06-16 cs.CV New submission

Stringalign: Moving beyond summary statistics with a transparent Unicode-aware tool for evaluating automatic transcription models

Yngve Mardal Moe, Marie Roald

Affiliations: Independent researcher; The National Library of Norway

AI summary: Proposes the Stringalign library, which uses transparent preprocessing and error analysis to address ambiguity in how character and word error rates are defined, supporting reproducible evaluation of HTR, OCR, and ASR models.


Abstract

Comparing text strings is crucial when evaluating and understanding the performance of various text processing tasks such as document recognition and audio transcription. With an increasingly complex landscape of AI-based handwritten text recognition (HTR), optical character recognition (OCR) and automatic speech recognition (ASR) models, there is a need for tools that facilitate evaluation in a flexible and reproducible way. This paper presents Stringalign, a Python library designed to simplify the evaluation process for automatic transcription projects and facilitate transparent evaluation. Stringalign's tools for examining and visualising both the rate of errors and the types of errors a model makes give insights into possible improvements and help inform model selection for a particular task. Widely used string comparison metrics, such as the character and word error rates (CER and WER), although useful, can be ambiguous due to varying definitions of what constitutes a character and a word. Stringalign addresses this challenge by ensuring all preprocessing (i.e. normalisation and tokenisation) is transparent and easily replicable, and by providing tools to move beyond summary statistics and analyse common model errors. Moreover, Stringalign adheres to FAIR (Findable, Accessible, Interoperable, and Reusable) principles for research software while staying lightweight and easy to integrate into researchers' existing workflows. In this paper, we discuss challenges with character- and word-level string comparisons and show through examples that, where existing tools can yield opaque and sometimes confusing results, Stringalign provides an easy-to-use and unambiguous alternative.
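The ambiguity in "what constitutes a character" is easy to reproduce with the Python standard library alone. The sketch below does not use Stringalign and makes no claim about its API; it defines a plain Levenshtein-based CER over code points and shows that the score depends on whether the two strings were Unicode-normalised first. The function names are hypothetical.

```python
import unicodedata


def edit_distance(reference, prediction):
    """Levenshtein distance over code points (insert, delete, substitute cost 1)."""
    previous = list(range(len(prediction) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, pred_char in enumerate(prediction, start=1):
            current.append(min(
                previous[j] + 1,                            # deletion
                current[j - 1] + 1,                         # insertion
                previous[j - 1] + (ref_char != pred_char),  # substitution or match
            ))
        previous = current
    return previous[-1]


def cer(reference, prediction, form=None):
    """Character error rate, optionally after Unicode normalisation (e.g. "NFC")."""
    if form is not None:
        reference = unicodedata.normalize(form, reference)
        prediction = unicodedata.normalize(form, prediction)
    return edit_distance(reference, prediction) / len(reference)


reference = "caf\u00e9"      # precomposed e-acute: 4 code points
prediction = "cafe\u0301"    # "e" + combining acute accent: 5 code points

raw_cer = cer(reference, prediction)
normalised_cer = cer(reference, prediction, form="NFC")
```

Both strings render identically, yet the raw comparison reports a 50% error rate while the normalised one reports none, which is why a reported CER is hard to interpret unless the preprocessing is stated.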


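2606.16185 2026-06-16 cs.CV New submission

Learned JPEG Compression for DNN Vision

Kaixiang Zheng, Ahmed H. Salamah, Siyu Chen, En-Hui Yang

Affiliations: University of Waterloo

AI summary: Proposes the J4D framework, which uses a differentiable JPEG codec and an information-theoretic rate estimate to optimize JPEG encoding parameters so that DNN inference performance improves at low compression rates; experiments show a compression-rate reduction of up to 80.05% at the same accuracy.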


Abstract

JPEG, a lossy image compression technique designed for human viewers, has maintained its dominance for decades. However, in the era of artificial intelligence (AI), a substantial portion of image data, often compressed by JPEG, is and will continue to be consumed by deep neural networks (DNNs) instead of humans, thus creating a need to optimize JPEG for DNN inference performance. To this end, we propose learned JPEG compression for DNN vision (J4D), a novel training framework for determining JPEG encoding parameters to minimize compression rate while maximizing DNN inference performance. The major challenge of solving this optimization problem lies in representing the JPEG codec and compression rate in closed form. By incorporating a differentiable soft quantizer based on a probabilistic quantization scheme, we not only obtain a differentiable proxy for the JPEG codec, but are also able to compute the entropy of the coded source analytically, which is a close estimate of the actual compression rate. Equipped with both the differentiable JPEG codec and the information-theoretic rate estimator, we are then able to solve the aforementioned optimization problem with backpropagation. After training, the learned encoding parameters will be subsequently used in actual JPEG encoding based on probabilistic quantization. Extensive experimental results across multiple datasets and DNN architectures demonstrate that J4D consistently and significantly outperforms the default JPEG and other competitive JPEG codecs optimized for DNNs. Notably, compared to the default JPEG, J4D achieves an increase in accuracy by as much as 11.60% at the same rate, or a reduction of compression rate up to 80.05% at the same accuracy. Additionally, with the help of J4D, we show the potential to design universal JPEG encoding parameters for various DNN architectures for the first time.
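To make the idea of a soft quantizer with an analytic rate estimate more concrete, here is a small sketch. It is a generic stand-in, not the paper's probabilistic quantization scheme: the softmax over negative squared distances, the temperature, the index range, and the use of the averaged index distribution as the rate estimate are all assumptions chosen for illustration.

```python
import math

LEVELS = list(range(-8, 9))  # hypothetical range of quantisation indices


def index_probabilities(coeff, step, temperature=1.0):
    """Soft assignment of one coefficient to each quantisation index."""
    scores = [-((coeff - k * step) ** 2) / temperature for k in LEVELS]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_quantize(coeff, step, temperature=1.0):
    """Expected reconstruction value; smooth in both coeff and step."""
    probs = index_probabilities(coeff, step, temperature)
    return sum(p * k * step for p, k in zip(probs, LEVELS))


def estimated_rate_bits(coeffs, step, temperature=1.0):
    """Entropy (bits per coefficient) of the averaged index distribution."""
    mean_probs = [0.0] * len(LEVELS)
    for coeff in coeffs:
        for i, p in enumerate(index_probabilities(coeff, step, temperature)):
            mean_probs[i] += p / len(coeffs)
    return -sum(p * math.log2(p) for p in mean_probs if p > 0.0)


# Toy DCT coefficients; a larger step size should need fewer bits.
coeffs = [-12.0, -3.0, 0.0, 4.0, 11.0]
rate_fine = estimated_rate_bits(coeffs, step=1.0)
rate_coarse = estimated_rate_bits(coeffs, step=16.0)
```

Because both the reconstruction and the rate estimate are smooth functions of `step`, a gradient-based optimizer could in principle trade rate against a task loss, which is the role the differentiable codec plays in the abstract.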


2606.16256 2026-06-16 cs.CV cs.LG New submission

KeepLoRA++: Continual Learning with Layer-Scaled Residual Gradient Adaptation

Mao-Lin Luo, Yi-Lin Zhang, Zi-Hao Zhou, Yankun Hong, Xialiang Tong, Mingxuan Yuan, Tong Wei, Min-Ling Zhang

Affiliations: School of Computer Science and Engineering, Southeast University; Key Laboratory of Computer Network and Information Integration, Southeast University, Ministry of Education; Huawei Noah's Ark Lab

AI summary: Targets the conflict among retaining pre-trained knowledge, retaining knowledge of earlier tasks, and learning new knowledge in continual learning for pre-trained vision-language models. Proposes KeepLoRA++, a layer-scaled residual gradient adaptation method that restricts LoRA parameter updates to the residual subspace and applies shallow-to-deep layer scaling to balance the three, outperforming baselines on image classification, visual question answering, and video understanding tasks.


Abstract

Continual learning for pre-trained vision-language models requires balancing three competing objectives: retaining pre-trained knowledge, preserving knowledge from a sequence of learned tasks, and maintaining the plasticity to acquire new knowledge. This paper presents KeepLoRA++, which balances these objectives through a unified dual-dimensional knowledge retention mechanism. We analyze the knowledge distribution of the Transformer architecture from both inter-layer and intra-layer perspectives. The inter-layer perspective examines how retention is distributed across layers, while the intra-layer perspective focuses on the parameter space within each layer. Our analysis reveals a structural property: general transferable knowledge is mainly encoded in the shallow layers and the principal subspace of the parameters, while task-specific adaptations are localized in the deep layers and the residual subspace. Motivated by this insight, KeepLoRA++ introduces a layer-scaled residual gradient adaptation method. New tasks are learned by restricting LoRA parameter updates to the residual subspace, combined with a shallow-to-deep layer scaling, to prevent interference with previously acquired capabilities. Specifically, the gradient of a new task is projected onto a subspace orthogonal to both the principal subspace of the pre-trained model and the dominant directions of previous task features, while simultaneously assigning smaller update magnitudes to shallow layers and larger ones to deeper layers. Our theoretical analysis and empirical evaluations confirm that KeepLoRA++ successfully balances these three competing objectives, consistently outperforming representative baselines across image classification, visual question answering, and video understanding tasks.
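The two ingredients named in the abstract, projecting the gradient away from protected directions and scaling updates by depth, can be sketched on plain vectors. This is a simplification under stated assumptions: the real method operates on LoRA parameter matrices, and the orthonormal toy basis, the linear scaling rule `(layer_index + 1) / num_layers`, and the learning rate are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def project_out(gradient, protected_basis):
    """Remove the components of `gradient` along each orthonormal protected direction."""
    residual = list(gradient)
    for direction in protected_basis:
        coeff = dot(residual, direction)
        residual = [r - coeff * d for r, d in zip(residual, direction)]
    return residual


def layer_scale(layer_index, num_layers):
    """Smaller updates for shallow layers, larger for deep ones (assumed linear rule)."""
    return (layer_index + 1) / num_layers


def scaled_residual_update(gradient, protected_basis, layer_index, num_layers, lr=0.1):
    """Gradient-descent step restricted to the residual subspace and scaled by depth."""
    residual = project_out(gradient, protected_basis)
    scale = layer_scale(layer_index, num_layers)
    return [-lr * scale * r for r in residual]


# Protected directions stand in for the principal subspace plus old-task directions.
protected = [[1.0, 0.0, 0.0],
             [0.0, 1.0, 0.0]]
gradient = [3.0, 4.0, 5.0]

residual = project_out(gradient, protected)
shallow_update = scaled_residual_update(gradient, protected, layer_index=0, num_layers=4)
deep_update = scaled_residual_update(gradient, protected, layer_index=3, num_layers=4)
```

After projection the update has no component along the protected directions, so in this toy a step of any size leaves those directions unchanged; the depth scaling then only controls how far each layer moves within the remaining subspace.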


2606.16334 2026-06-16 cs.CV 新提交

Chronological Blindness: Benchmarking Temporal Reasoning in Vision-Language Models with CHRONOSIGHT

时间盲：使用CHRONOSIGHT基准测试视觉语言模型的时间推理能力

Parthaw Goswami, Jaynto Goswami Deep

发表机构 * Department of Computer Science, University of Missouri（密苏里大学计算机科学系）； SAP

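AI summary: Proposes the CHRONOSIGHT benchmark, which evaluates the temporal reasoning of vision-language models along five dimensions; finds a large gap between models and humans (human average 0.89, best model 0.40) and shows that lightweight LoRA fine-tuning raises CHRONODELTA accuracy from near zero to 0.43.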


AI中文摘要

人类对视觉场景的感知本质上是时间性的。我们本能地识别水果是在成熟还是腐烂，建筑是在进展还是被拆除，以及两张同一主体的照片之间大致相隔多少时间。大型视觉语言模型（VLM）是否具备这种能力仍然是一个开放且具有实际重要性的问题。我们引入了CHRONOSIGHT，一个严格控制的基准，评估视觉时间推理的五个维度：CHRONORANK（图像序列的时间顺序排序）、CHRONOLOCATE（从单张图像定位阶段顺序）、CHRONODELTA（估计两张图像之间经过的时间，采用对数尺度）、CHRONOREVERSE（检测时间反转序列）以及CHRONOODD（识别集合中的时间异常值）。该基准包含来自八个过程系列（生物生长、食物转化、物理风化、建筑、环境变化、人类衰老、天文现象和城市动态）的1000个项目，时间跨度从分钟到千年。我们在两种提示模式下评估了八个开源VLM（参数从5亿到190亿），并收集了人类表现基线。人类在所有任务上的平均表现为0.89；最佳开源模型（Qwen2.5-VL-7B）在直接提示下达到0.40，我们将这一差距称为时间盲。在151个样本上进行轻量级LoRA微调，将CHRONODELTA的准确率从接近零提升到0.43，并零样本迁移到相关任务（CHRONOODD：0.37；CHRONOREVERSE：0.64），这表明瓶颈部分在于指令遵循而非视觉感知。基准、代码和预测将在接收后发布。

英文摘要

Human perception of visual scenes is inherently temporal. We instinctively recognise whether a fruit is ripening or rotting, whether construction is progressing or being demolished, and approximately how much time separates two photographs of the same subject. Whether large vision-language models (VLMs) share this competence remains an open and practically important question. We introduce CHRONOSIGHT, a rigorously controlled benchmark evaluating five dimensions of visual temporal reasoning: CHRONORANK (chronological ordering of image sequences), CHRONOLOCATE (ordinal stage localisation from a single image), CHRONODELTA (estimation of time elapsed between two images on a logarithmic scale), CHRONOREVERSE (detection of temporally reversed sequences), and CHRONOODD (identification of a temporal outlier within a set). The benchmark comprises 1,000 items across eight process families (biological growth, food transformation, physical weathering, construction, environmental change, human ageing, astronomical phenomena, and urban dynamics) spanning timescales from minutes to millennia. We evaluate eight open-source VLMs (500M to 19B parameters) under two prompting regimes and collect human performance baselines. Human performance averages 0.89 across tasks; the best open model (Qwen2.5-VL-7B) reaches 0.40 under direct prompting, a gap we term chronological blindness. Lightweight LoRA fine-tuning on 151 examples raises CHRONODELTA accuracy from near-zero to 0.43 and transfers zero-shot to related tasks (CHRONOODD: 0.37; CHRONOREVERSE: 0.64), suggesting the bottleneck is partly instruction following rather than visual perception. Benchmark, code, and predictions will be released upon acceptance.


2606.16633 2026-06-16 cs.CV cs.AI 新提交

DCP-Prune: Ultra-Low Token Pruning with Distribution Consistency Preservation

DCP-Prune：基于分布一致性保持的超低令牌剪枝

Xifeng Xue, Xiaokang Wang, Zirui Li, Ming-Ming Cheng, Guolei Sun

发表机构 * College of Computer Science, Nankai University（南开大学计算机学院）； Nanjing University of Posts and Telecommunications（南京邮电大学）

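AI summary: Proposes the DCP-Prune framework, which uses Anchor-Context Graph Recovery (ACGR) and Text-Aware Token Cluster Selection (TATCS) to preserve distribution consistency under ultra-low token budgets, retaining 92.1% of the upper-bound average performance on LLaVA-1.5-7B with only 16 visual tokens.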

Comments The code will be released at: https://github.com/EMVision-NK/DCP-Prune


AI中文摘要

最近的视觉令牌剪枝方法在中等令牌预算下能有效保持模型性能，但在超低令牌预算下变得不稳定。我们的分析表明，随着剪枝预算减少，精度下降通常伴随着更大的特征分布偏移。关键的是，这种分布偏移的程度与性能下降强相关。为了更好地表征这一现象，我们引入了一种轻量级的分布一致性度量来估计保留令牌与完整令牌之间的分布偏移。受这些观察启发，我们提出了一个两阶段剪枝框架，包括锚点-上下文图恢复（ACGR）和文本感知令牌聚类选择（TATCS）。具体地，ACGR在令牌移除前转移上下文信息，而TATCS在检测到严重分布偏移时动态重新选择代表性令牌。大量实验表明，我们的方法在超低令牌预算下实现了更优且更稳定的性能。值得注意的是，在仅使用16个视觉令牌的情况下，它在LLaVA-1.5-7B上保留了92.1%的上限平均性能。

英文摘要

Recent vision token pruning methods effectively preserve model performance under moderate token budgets but become unstable under ultra-low token budgets. Our analysis shows that as the pruning budget decreases, accuracy degradation is often accompanied by larger feature distribution shifts. Critically, the degree of this distribution shift strongly correlates with performance degradation. To better characterize this phenomenon, we introduce a lightweight distribution consistency metric to estimate the distribution shift between retained and full tokens. Motivated by these observations, we propose a two-stage pruning framework consisting of Anchor-Context Graph Recovery (ACGR) and Text-Aware Token Cluster Selection (TATCS). Specifically, ACGR transfers contextual information before token removal, while TATCS dynamically re-selects representative tokens when severe distribution shift is detected. Extensive experiments demonstrate that our method achieves superior and more stable performance under ultra-low token budgets. Notably, it retains 92.1% of the upper-bound average performance on LLaVA-1.5-7B with only 16 visual tokens.
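The abstract above does not spell out its distribution consistency metric, so the following is only a hypothetical first-moment proxy for the kind of quantity it describes: the cosine distance between the mean feature of all visual tokens and the mean feature of the retained ones. The names `mean_vector` and `mean_shift` are invented for this sketch.

```python
import math


def mean_vector(tokens):
    """Per-dimension mean of a list of equal-length feature vectors."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def mean_shift(full_tokens, keep_idx):
    """Cosine distance between the mean feature of all tokens and of the kept ones.

    Assumption: this is an illustrative proxy, not the paper's metric.
    """
    full_mean = mean_vector(full_tokens)
    kept_mean = mean_vector([full_tokens[i] for i in keep_idx])
    dot = sum(a * b for a, b in zip(full_mean, kept_mean))
    norm = math.sqrt(sum(a * a for a in full_mean)) * math.sqrt(sum(b * b for b in kept_mean))
    return 1.0 - dot / norm


# Two groups of tokens; a budget of two tokens can cover both groups or only one.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
balanced = mean_shift(tokens, keep_idx=[0, 2])  # one token from each group
skewed = mean_shift(tokens, keep_idx=[0, 1])    # only the first group
```

Under this proxy, the selection that covers both groups shows almost no shift, while the one-sided selection shows a clearly larger one, which is the kind of signal a re-selection step could react to.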


2606.16638 2026-06-16 cs.CV 新提交

MVM-IOD: An Industrial Object-Centric Benchmark Dataset for the Evaluation of 3D Reconstruction Methods

MVM-IOD：用于评估3D重建方法的工业对象中心基准数据集

Robert Langendörfer, Markus Hillemann, Markus Ulrich

发表机构 * Machine Vision Metrology, Institute of Photogrammetry and Remote Sensing, Karlsruhe Institute of Technology, Germany（德国卡尔斯鲁厄理工学院摄影测量与遥感研究所机器视觉计量学）

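AI summary: To address the shortage of datasets for evaluating 3D reconstruction in industrial settings, proposes the MVM-IOD dataset, which contains images of industrial objects, reference camera poses, and reference point clouds; evaluates several SOTA methods and finds that feed-forward methods are sensitive to out-of-distribution images.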


AI中文摘要

工业应用中的3D对象重建和相机位姿估计是挑战性任务，因为错误代价高昂且计算时间通常有限。典型工业对象的复杂性进一步增加了这些任务的难度。现有的大多数数据集并未描绘真实的工业场景。因此，我们引入了机器视觉计量工业对象数据集（MVM-IOD）。通过将安装在工业机器人臂末端执行器上的相机在对象周围的半球上移动，系统性地捕获典型工业对象的图像。MVM-IOD包含参考相机位姿和参考3D点云，以及9个对象和2种背景选择所获得的RGB图像，共产生18个场景，这允许评估所有基于图像的方法，这些方法计算3D重建、相机位姿或场景的新视图。基于MVM-IOD，我们广泛评估了当前的SOTA 3D重建和相机位姿估计方法，例如运动恢复结构、多视图立体、近期的前馈方法（Visual Geometry Grounded Transformer, π3）和2D高斯泼溅，并报告我们的发现作为未来研究的基线。实验表明，像我们这样的捕获设置为前馈方法生成分布外图像，导致次优的点云和相机位姿。然而，通过应用简单的预处理步骤，这些分布外图像可以更接近训练分布。因此，在某些工业应用中，应谨慎使用前馈方法。

英文摘要

3D object reconstruction and camera pose estimation in industrial applications are challenging tasks, as errors are costly while the computation time is often limited. The complexity of typical industrial objects further complicates these tasks. Most of the existing datasets in this context do not depict realistic industrial scenarios. Therefore, we introduce the Machine Vision Metrology Industrial Object Dataset (MVM-IOD). Images of typical industrial objects are captured systematically by moving a camera, mounted at the end effector of an industrial robot arm, on a hemisphere around the objects. MVM-IOD contains reference camera poses and reference 3D point clouds, as well as the acquired RGB images of 9 objects and 2 background choices, resulting in 18 scenes, which allows evaluation of all image-based methods that compute a 3D reconstruction, camera poses, or novel views of a scene. Based on MVM-IOD, we extensively evaluate current SOTA 3D reconstruction and camera pose estimation methods, such as Structure from Motion, Multi-View Stereo, recent feed-forward methods (Visual Geometry Grounded Transformer, π3), and 2D Gaussian Splatting, and report our findings as a baseline for future research. The experiments show that capture setups like ours generate out-of-distribution images for feed-forward methods, leading to suboptimal point clouds and camera poses. However, these out-of-distribution images can be shifted closer to the training distribution by applying simple preprocessing steps. Consequently, in certain industrial applications, feed-forward methods should be used with caution.


2606.16861 2026-06-16 cs.CV 新提交

An Open-Source Monitoring Framework for Data Exploration and Progress Tracking in Multi-Center Radiology Studies

一个用于多中心放射学研究中数据探索与进度跟踪的开源监控框架

Markus Bujotzek, Jonas Scherer, Stefan Denner, Peter Neher, Benjamin Hamm, Lorenz Feineis, Uenal Akuenal, Andreas Bucher, Tobias Penzkofer, Klaus Maier-Hein

发表机构 * German Cancer Research Center（德国癌症研究中心）； University of Heidelberg（海德堡大学）； University Hospital Frankfurt（法兰克福大学医院）； Charité Universitätsmedizin Berlin（柏林夏里特医学院）； Berlin Institute of Health（柏林健康研究所）

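AI summary: Proposes a lightweight, open-source monitoring architecture built on the Grafana-Prometheus stack that aggregates and visualizes metrics from distributed sites, enabling privacy-preserving data exploration and progress monitoring; deployed and validated across the 38 university hospitals of Germany's RACOON consortium.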


AI中文摘要

多中心研究对于推进医学和放射学研究至关重要。数据探索、协作发现和研究进度监控对于最大化其潜力至关重要。然而，在实践中，这些过程通常依赖于手动通信和共享表格，这些表格很快就会过时，并阻碍大型分布式研究中的高效协调。这凸显了对专用监控解决方案的需求，以提供对研究进度的透明和最新洞察。我们提出了一种轻量级、开源的多中心研究监控架构，基于广泛使用的Grafana-Prometheus栈。该框架从分布式研究站点收集聚合的监控指标，并通过可配置的仪表板进行可视化。作为一个真实世界的部署示例，该框架被集成到医学影像平台Kaapana中，并在一个大型多中心研究网络中进行评估。通过在德国范围内的RACOON联盟中部署我们的解决方案，我们展示了其在所有38家德国大学医院中实现隐私保护的数据探索和研究进度监控的能力。该监控框架支持分布式研究活动的透明协调，并可促进大规模多中心研究的更高效管理。源代码和Kaapana集成可在https://github.com/MIC-DKFZ/study-monitoring-kaapana公开获取。

英文摘要

Multi-center studies are crucial for advancing medical and radiological research. Data exploration, collaboration discovery, and study progress monitoring are essential for maximizing their potential. However, in practice these processes often rely on manual communication and shared tables, which quickly become outdated and hinder efficient coordination in large distributed studies. This highlights the need for dedicated monitoring solutions that provide transparent and up-to-date insights into study progress. We propose a lightweight, open-source monitoring architecture for multi-center studies based on the widely used Grafana-Prometheus stack. The framework collects aggregated monitoring metrics from distributed study sites and visualizes them through configurable dashboards. As a real-world deployment example, the framework is integrated into the medical imaging platform Kaapana and evaluated within a large multi-center research network. By deploying our solution within the Germany-wide RACOON consortium, we demonstrate its ability to enable privacy-preserving data exploration and study progress monitoring across all 38 German university clinics. The monitoring framework supports transparent coordination of distributed research activities and can facilitate more efficient management of large-scale multi-center studies. The source code and Kaapana integration are publicly available at https://github.com/MIC-DKFZ/study-monitoring-kaapana.
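As a rough illustration of what "aggregated monitoring metrics" could look like on the wire, the sketch below renders per-site counts in the Prometheus text exposition format (`# HELP`, `# TYPE`, then one sample per label set). The metric name `study_imaging_series`, the label names, and the site identifiers are hypothetical and not taken from the framework; the sketch also assumes label values contain no quotes, backslashes, or newlines, so it skips escaping.

```python
def render_gauge(name, help_text, samples):
    """Render samples as a Prometheus text-exposition gauge.

    `samples` maps a tuple of (label, value) pairs to a number. Only
    aggregate counts are exported, never per-patient records.
    """
    lines = [f"# HELP {name} {help_text}", f"# TYPE {name} gauge"]
    for labels, value in sorted(samples.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in labels)
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical aggregated counts reported by one study site.
samples = {
    (("modality", "CT"), ("site", "site_a")): 120,
    (("modality", "MR"), ("site", "site_a")): 45,
}
text = render_gauge(
    "study_imaging_series",
    "Aggregated number of imaging series per site and modality",
    samples,
)
```

A Prometheus server scraping such an endpoint would see only the aggregated counts, which is one plausible way to keep exploration privacy-preserving while still feeding Grafana dashboards.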


2606.16868 2026-06-16 cs.CV cs.AI cs.DC 新提交

Federated Medical Image Segmentation under Real-World Label Noise: A Benchmark Suite for Noisy Label Learning Method Selection

真实世界标签噪声下的联邦医学图像分割：面向噪声标签学习方法选择的基准套件

Markus Bujotzek, Dimitrios Bounias, Stefan Denner, Ralf Floca, Maximilian Fischer, Peter Neher, Klaus Maier-Hein

发表机构 * Division of Medical Image Computing, German Cancer Research Center（德国癌症研究中心医学图像计算部）； Medical Faculty, University of Heidelberg（海德堡大学医学院）； Heidelberg Institute of Radiation Oncology (HIRO), National Center for Radiation Research in Oncology (NCRO)（海德堡放射肿瘤学研究所（HIRO），国家放射肿瘤学研究中心（NCRO））； Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital（海德堡大学医院放射肿瘤科模式分析与学习组）； Faculty of Mathematics and Computer Science, University of Heidelberg（海德堡大学数学与计算机科学学院）； National Center for Tumor Diseases (NCT), NCT Heidelberg, a partnership between DKFZ and the university medical center Heidelberg（国家肿瘤疾病中心（NCT），NCT海德堡，DKFZ与海德堡大学医学中心的合作机构）

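AI summary: Targeting real-world label noise in federated learning (such as contour disagreement and missing or confused structures), proposes a benchmark suite that combines diverse real-world noisy datasets, client-noise scenarios, and noise-targeted evaluation to support systematic assessment and selection of noisy label learning methods.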


AI中文摘要

虽然联邦学习（FL）能够在不集中敏感数据的情况下实现协作式医学图像分割，但实际部署常因跨站点的标签缺陷（如轮廓不一致、结构缺失或多余、标签混淆）而复杂化。联邦噪声标签学习（FNLL）旨在减轻这些影响，但在实践中仍未被充分利用，因为现有证据主要基于合成噪声、简化设置和有限的实际噪声评估。我们通过引入一个基准套件来弥补这一差距，该套件结合了多样化的真实世界噪声数据集、与部署相关的客户端噪声场景以及针对标签噪声的评估，以支持系统的FNLL评估和知情的方法选择。该套件将来自不同来源的精心策划的真实世界噪声医学图像分割数据集与一个全面的联邦分割框架相结合，包括各种客户端噪声场景和针对噪声的评估。所提出的套件为医学图像分割中的FNLL评估提供了现实且具有区分性的基础，并为公平基准测试、数据集特定的标签噪声表征以及未来在现实联邦设置下的方法开发建立了可重复使用的基础。代码可在 https://github.com/MIC-DKFZ/FedSegNoiseBench 获取。

英文摘要

While federated learning (FL) enables collaborative medical image segmentation without centralizing sensitive data, real-world deployment is frequently complicated by cross-site label imperfections such as contour disagreement, missing or additional structures, and confused labels. Federated noisy label learning (FNLL) aims to mitigate these effects, yet remains underused in practice as existing evidence is largely based on synthetic noise, simplified settings, and limited real-world noisy evaluation. We address this gap by introducing a benchmark suite that combines diverse real-world noisy datasets, deployment-relevant client-noise scenarios, and label-noise-targeted evaluation to support systematic FNLL assessment and informed method selection. The suite combines curated real-world noisy medical image segmentation datasets from diverse sources with a comprehensive federated segmentation framework including various client-noise scenarios and noise-targeted evaluation. The presented suite provides a realistic and discriminative basis for FNLL evaluation in medical image segmentation and establishes a reusable foundation for fair benchmarking, dataset-specific label-noise characterization, and future method development under realistic federated settings. Code is available at https://github.com/MIC-DKFZ/FedSegNoiseBench.


2606.17020 2026-06-16 cs.CV cs.AI 新提交

FusionRS: A Large-Scale RGB-Infrared Remote Sensing Dataset for Dual-Modal Vision-Language Foundation Models

FusionRS: 用于双模态视觉-语言基础模型的大规模RGB-红外遥感数据集

Jiaju Han, Ben Zhang, Xuemeng Sun, Qike Zhang, Yuxian Dong, Chengyin Hu, Fengyu Zhang, Yiwei Wei, Jiujiang Guo

发表机构 * China University of Petroleum-Beijing at Karamay（中国石油大学（北京）克拉玛依校区）； University of Electronic Science and Technology of China（电子科技大学）； Tianjin University（天津大学）

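AI summary: To address the lack of infrared data for remote sensing vision-language models, proposes FusionRS, the first large-scale RGB-infrared-text dataset, built by translating RGB images into infrared-style counterparts and pairing them with IR-aware captions; dual-modal foundation models trained on it improve RGB-IR alignment and dual-modal captioning.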


AI中文摘要

遥感视觉-语言模型推动了地球观测理解的发展，但现有工作大多集中于RGB图像，红外数据中的互补信息尚未得到充分探索。红外图像提供了独特的线索，包括热强度结构、物体边界和光照不变场景特征，这些可以丰富超越传统RGB观测的视觉-语言学习。然而，用于遥感视觉-语言建模的大规模RGB-红外-文本数据集仍然缺失。为填补这一空白，我们引入了FusionRS，这是首个专为遥感双模态视觉-语言学习设计的大规模RGB-红外-文本数据集。FusionRS通过将多样的公开RGB遥感图像翻译为红外风格对应物，形成对齐的RGB-IR图像对。每对图像都配有常规场景描述和红外感知描述，后者在保留语义内容的同时明确描述红外特有的视觉属性。基于FusionRS，我们训练了用于RGB-IR联合理解的双模态视觉-语言基础模型。我们首先训练CLIP风格的模型进行RGB-IR-文本对齐，然后微调生成式VLM用于双模态RGB-IR字幕生成。实验表明，与仅RGB和非红外感知训练设置相比，FusionRS改进了RGB-IR对齐、红外到文本检索和双模态字幕生成。消融研究进一步验证了红外感知描述对于加强红外-语言对齐至关重要，突显了模态特定文本监督对于更可扩展的RGB-红外遥感视觉-语言表示学习的重要性。

英文摘要

Remote sensing vision-language models have advanced Earth observation understanding, but most existing work remains centered on RGB imagery, leaving the complementary information in infrared data underexplored. Infrared images provide distinctive cues, including thermal intensity structures, object boundaries, and illumination-invariant scene features, which can enrich visual-language learning beyond conventional RGB observations. However, a large-scale RGB-infrared-text dataset for remote sensing vision-language modeling is still absent. To address this gap, we introduce FusionRS, the first large-scale RGB-infrared-text dataset designed for dual-modal vision-language learning in remote sensing. FusionRS is constructed by translating diverse public RGB remote sensing images into infrared-style counterparts, forming aligned RGB-IR image pairs. Each pair is associated with conventional scene captions and IR-aware captions that explicitly describe infrared-specific visual properties while preserving semantic content. Based on FusionRS, we train dual-modal vision-language foundation models for RGB-IR joint understanding. We first train CLIP-style models for RGB-IR-text alignment, and then fine-tune generative VLMs for dual-modal RGB-IR captioning. Experiments show that FusionRS improves RGB-IR alignment, infrared-to-text retrieval, and dual-modal captioning over RGB-only and non-IR-aware training settings. Ablation studies further verify that IR-aware captions are crucial for strengthening infrared-language alignment, highlighting the importance of modality-specific textual supervision for more scalable RGB-infrared remote sensing vision-language representation learning.
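The "CLIP-style" alignment mentioned above is commonly trained with a symmetric contrastive (InfoNCE) loss between two sets of paired embeddings. The sketch below shows that generic loss for one pair of modalities, for example infrared image embeddings and caption embeddings; the temperature value and the way the paper combines RGB, IR, and text are not stated in the abstract and are assumptions here.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]


def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] among all keys."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(q)


def symmetric_contrastive_loss(a, b, temperature=0.07):
    """Average of the a-to-b and b-to-a InfoNCE terms."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


# Toy embeddings: row i of each list is meant to describe the same scene.
ir_embeddings = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]
text_shuffled = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = symmetric_contrastive_loss(ir_embeddings, text_aligned)
loss_shuffled = symmetric_contrastive_loss(ir_embeddings, text_shuffled)
```

The loss is low when each embedding is closest to its own partner and high when the pairing is scrambled, which is why captions that describe infrared-specific properties can plausibly tighten infrared-language alignment.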


2606.15048 2026-06-16 cs.LG cs.CV 交叉投稿

Temporal Difference Learning for Diffusion Models

扩散模型的时间差分学习

Qizhen Ying, Yangchen Pan, Victor Adrian Prisacariu, Junfeng Wen

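AI summary: Proposes a temporal difference (TD) objective that treats the diffusion process as a Markov reward process and casts denoising as policy evaluation from reinforcement learning, enforcing cross-time consistency along the denoising trajectory and significantly improving generation quality in few-step sampling.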

Comments 15 pages, 4 figures. Accepted at ICML 2026


AI中文摘要

扩散模型通常使用专注于单个时间步（或相邻对）的局部去噪目标的损失函数进行训练，这并不强制去噪轨迹上预测之间的一致性。这种跨时间一致性的缺乏会降低性能，尤其是对于少步采样器。我们引入了一个时间差分（TD）目标，惩罚模型沿去噪路径的多步进展的不一致性。通过将扩散过程重新表述为马尔可夫奖励过程，并将去噪视为强化学习中的策略评估问题，我们推导出一个统一的TD方法，适用于离散和连续时间扩散公式。我们进一步提出了一种基于样本的加权方法，稳定训练。实验表明，使用我们的TD训练可以显著提高由FID衡量的样本质量，当采样步数较少时优势更强，突显了其在低计算预算场景下的实用价值。我们进行了消融研究以证明我们的设计选择，包括成对损失加权、正则化权重和单步跨度。总体而言，我们的TD方法可以作为一种通用的即插即用模块，强制跨时间一致性并提高不同扩散生成模型的生成质量。

英文摘要

Diffusion models are typically trained with objectives that focus on local denoising targets at individual time steps (or adjacent pairs), which do not enforce consistency between predictions along the denoising trajectory. This lack of cross-time consistency can degrade performance, especially for few-step samplers. We introduce a temporal difference (TD) objective that penalizes inconsistency of the model's multi-step progress along the denoising path. By reformulating the diffusion process as a Markov reward process and casting denoising as a policy evaluation problem in reinforcement learning, we derive a unified TD approach that applies to both discrete- and continuous-time diffusion formulations. We further propose a principled sample-based reweighting method that stabilizes training. Empirically, we show that using our TD training can significantly improve sample quality measured by FID, with stronger advantages when the number of sampling steps is small, highlighting its practical utility under low-computation-budget scenarios. We provide ablation studies to justify our design choices, including pairwise loss reweighting, regularization weight, and one-step stride. Overall, our TD approach can be a general drop-in that enforces cross-time consistency and improves generation quality across different diffusion generative models.
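For readers unfamiliar with the reinforcement-learning side, the sketch below shows plain TD(0) policy evaluation on a small deterministic Markov reward process. It is the textbook algorithm, not the paper's diffusion objective: the chain, rewards, and step size are made up, and each state can loosely be read as one step along a denoising trajectory.

```python
def td0_evaluate(num_states, rewards, gamma=1.0, alpha=0.1, sweeps=500):
    """TD(0) value estimates for a deterministic chain 0 -> 1 -> ... -> terminal.

    rewards[s] is the reward received when leaving state s.
    """
    values = [0.0] * (num_states + 1)  # the last entry is the terminal state, fixed at 0
    for _ in range(sweeps):
        for s in range(num_states):
            # Bootstrapped target: immediate reward plus the successor's estimate.
            target = rewards[s] + gamma * values[s + 1]
            # The TD error measures how inconsistent the two estimates are.
            values[s] += alpha * (target - values[s])
    return values[:num_states]


values = td0_evaluate(num_states=3, rewards=[1.0, 1.0, 1.0])
```

The update drives each state's estimate toward agreement with its successor's, which is the same kind of cross-time consistency the abstract asks of predictions along the denoising path.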


2606.15615 2026-06-16 cs.LG cs.CV 交叉投稿

MoECa: Aligning Feature Reuse with Expert Decomposition in Diffusion Transformers

MoECa: 在扩散变换器中对齐特征复用与专家分解

Maoliang Li, Haojing Chen, Jiayu Chen, Zihao Zheng, Xinhao Sun, Hailong Zou, Xiang Chen

发表机构 * School of Computer Science, Peking University（北京大学计算机科学学院）； School of Software Engineering, University of Electronic Science and Technology of China（电子科技大学软件工程学院）

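AI summary: Targeting redundant computation across timesteps in DiT-MoE, proposes MoECa, a fine-grained caching framework that reuses features at the expert-branch level and adds expert-aware adaptive control and synchronized cache updates, achieving up to 2.83× speedup on multiple models with minimal quality loss.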

Comments under review


AI中文摘要

基于混合专家模型的扩散变换器（DiT-MoE）通过稀疏激活提升了模型容量，但扩散推理仍然受限于跨时间步的冗余计算。现有的缓存方法主要在token级别操作，这在DiT-MoE中变得次优，因为每个token更新内部被分解为多个路由专家分支。我们的分析表明，DiT-MoE中的跨时间步冗余在专家分支级别比在整个token级别更易于表征。基于这一观察，我们提出MoECa，一种细粒度的缓存框架，跨时间步执行分支级特征复用。MoECa进一步引入了专家感知的自适应控制和MoE与注意力路径之间的同步缓存更新，以维持稳定的中间状态。在多个DiT-MoE模型上的实验表明，MoECa在速度-质量权衡上始终优于先前的缓存方法，实现了高达2.83倍的推理加速且质量退化极小。

英文摘要

Diffusion Transformers with Mixture-of-Experts (DiT-MoE) improve model capacity under sparse activation, but diffusion inference is still bottlenecked by redundant computation across timesteps. Existing caching methods mainly operate at the token level, which becomes suboptimal in DiT-MoE because each token update is internally decomposed into multiple routed expert branches. Our analysis shows that cross-timestep redundancy in DiT-MoE is better characterized at the expert-branch level than at the whole-token level. Based on this observation, we propose MoECa, a fine-grained caching framework that performs branch-level feature reuse across timesteps. MoECa further introduces expert-aware adaptive control and synchronized cache updates across MoE and attention paths to maintain stable intermediate states. Experiments on multiple DiT-MoE models show that MoECa consistently achieves a better speed-quality trade-off than prior caching methods, with up to 2.83× inference speedup and minimal quality degradation.
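To make the contrast between token-level and branch-level caching concrete, here is a toy sketch in which a token's MoE output is a gate-weighted sum of expert branches and each branch keeps its own cache. The class name `BranchCache`, the scalar inputs, and the fixed reuse threshold are all simplifying assumptions for illustration; the paper's adaptive control and cache synchronization are not modeled.

```python
class BranchCache:
    """Hypothetical per-expert cache: reuse a branch output when its input barely changed."""

    def __init__(self, experts, tol):
        self.experts = experts  # list of callables, one per expert branch
        self.tol = tol          # reuse threshold on the absolute input change
        self.cached = {}        # expert index -> (input, output)
        self.recomputed = 0     # number of branch evaluations actually performed

    def branch(self, idx, x):
        hit = self.cached.get(idx)
        if hit is not None and abs(x - hit[0]) <= self.tol:
            return hit[1]       # reuse the cached branch output
        out = self.experts[idx](x)
        self.cached[idx] = (x, out)
        self.recomputed += 1
        return out

    def forward(self, x, routing):
        """routing: list of (expert index, gate weight) chosen for this timestep."""
        return sum(w * self.branch(i, x) for i, w in routing)


experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
cache = BranchCache(experts, tol=0.05)
y0 = cache.forward(1.0, [(0, 0.5), (1, 0.5)])   # timestep t: both branches computed
y1 = cache.forward(1.01, [(0, 0.5), (2, 0.5)])  # timestep t+1: expert 0 reused, expert 2 computed
```

At the second timestep the routing changes for one branch only, so a whole-token cache would have to choose between reusing a stale output and recomputing everything, whereas the per-branch cache recomputes just the newly routed expert.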


2606.16075 2026-06-16 cs.LG cs.CV 交叉投稿

AME: A Multi-Type Contributor Attribution Framework in Generative AI Markets

AME：生成式AI市场中的多类型贡献者归属框架

Yang Shi, Songwen Pei, Yang Gao, Bingxue Zhang

发表机构 * University of Shanghai for Science and Technology（上海理工大学）； Fudan University（复旦大学）

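AI summary: For the problem of value allocation in multi-stage generative AI collaboration, proposes the AME framework, which integrates heterogeneous data contribution valuation, data rights mapping, and trustworthy execution, yielding allocations more consistent with human reference judgments at low execution cost.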


AI中文摘要

生成式AI通过异构贡献者（包括训练数据、基础模型、微调行为和提示）之间的多阶段协作实现价值创造。然而，如何公平分配数据价值仍未得到充分探索。本文将多阶段生成式AI价值分配定义为一个新的研究问题，并识别出三个核心挑战：异构数据贡献评估、数据权利映射和可信执行。我们提出AME（归属-映射-执行）框架，这是一个统一框架，将数据贡献评估、数据权利映射和可信执行整合到单个工作流中。实验结果表明，AME框架实现了与人类参考判断更一致的数据价值分配结果，同时保持低成本的可信执行。我们的工作为生成式AI数据市场中的价值评估和收益分配提供了初步基础。

英文摘要

Generative AI enables value creation through multi-stage collaboration among heterogeneous contributors, including training data, base models, fine-tuning behaviors, and prompts. However, how to fairly allocate the data value remains largely unexplored. This paper formulates multi-stage generative AI value allocation as a new research problem and identifies three core challenges: heterogeneous data contribution valuation, data rights mapping, and trustworthy execution. We propose the AME (Attribution-Mapping-Execution) framework, a unified framework that integrates data contribution valuation, data rights mapping, and trustworthy execution into a single workflow. Experimental results demonstrate that the AME framework achieves data value allocation outcomes more consistent with human reference judgments while maintaining low-cost trustworthy execution. Our work provides an initial foundation for value assessment and revenue allocation in generative AI data markets.


2505.10496 2026-06-16 cs.CV 版本更新

CheXGenBench: A Unified Benchmark For Fidelity, Privacy and Utility of Synthetic Chest Radiographs

CheXGenBench：合成胸片保真度、隐私和实用性的统一基准

Raman Dutt, Pedro Sanchez, Yongchen Yao, Steven McDonagh, Sotirios A. Tsaftaris, Timothy Hospedales

发表机构 * University of Edinburgh（爱丁堡大学）； Samsung AI Center, Cambridge（剑桥三星AI中心）

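AI summary: Proposes CheXGenBench, the first unified evaluation framework that jointly measures the fidelity, privacy risk, and downstream utility of synthetic chest radiograph generators; covers 11 leading T2I models and exposes limitations of current models on long-tailed distributions, privacy risk, and downstream multimodal tasks.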

Comments Published in Transactions on Machine Learning Research (06/2026)


Journal ref: Transactions on Machine Learning Research (2026)

AI中文摘要

结构化基准测试推动了真实世界图像的文字条件生成，但合成放射图像生成尚无此类基准。尽管这是一个高度活跃的研究领域，现有研究仍采用不一致的评估协议，缺乏对三个最关键标准（生成保真度、隐私风险和下游实用性）的统一评估。为解决这些局限，我们引入CheXGenBench，这是首个用于合成胸片生成的统一评估框架，可同时评估前沿文本到图像（T2I）生成模型的保真度、隐私风险和下游实用性。我们的评估协议包含20多项定量指标，覆盖11种领先的T2I架构，并支持即插即用地集成新模型。通过严格且公平的评估协议，我们建立了所有维度的全面基线最新技术水平（SoTA）性能，以指导未来研究。此外，我们的结果揭示了当前生成模型的若干局限：首先，即使是SoTA模型也难以处理长尾医学分布；其次，无论保真度质量如何，模型都存在高隐私风险；第三，尽管合成数据已有利于下游分类，但对下游多模态任务的实用性有限。基于这些结果，我们提出了具体的研究方向以推动该领域发展。代码见 https://github.com/Raman1121/CheXGenBench。

英文摘要

Structured benchmarks have advanced text-conditional image generation for real-world imagery; however, no such benchmark exists for synthetic radiograph generation. Despite being a highly active area of research, existing studies continue to adopt inconsistent evaluation protocols and lack a unified assessment of the three most critical criteria: generative fidelity, privacy risk, and downstream utility. To address these limitations, we introduce CheXGenBench, the first unified evaluation framework for synthetic chest radiograph generation that simultaneously assesses fidelity, privacy risks, and downstream utility across frontier text-to-image (T2I) generative models. Our evaluation protocol, comprising over 20 quantitative metrics, covers 11 leading T2I architectures with plug-and-play integration for newer models. Through a rigorous and fair evaluation protocol, we establish comprehensive baseline state-of-the-art (SoTA) performances across all dimensions to guide future research. Furthermore, our results uncover several limitations of current generative models: first, even SoTA models struggle with long-tailed medical distributions; second, models pose high privacy risks regardless of fidelity quality; and third, while synthetic data already benefits downstream classification, it is of limited utility for downstream multimodal tasks. Drawing from these results, we propose concrete research directions to advance the field. The code is available at https://github.com/Raman1121/CheXGenBench


2508.07797 2026-06-16 cs.CV 版本更新

Power Battery Detection

动力电池检测

Xiaoqi Zhao, Peiqian Cao, Chenyang Yu, Zonglei Feng, Lihe Zhang, Hanqi Liu, Jiaming Zuo, Youwei Pang, Jinsong Ouyang, Weisi Lin, Georges El Fakhri, Huchuan Lu, Xiaofeng Liu

发表机构 * Yale University, USA（耶鲁大学，美国）； Dalian University of Technology, China（大连理工大学，中国）； Volkswagen Automotive Co., Ltd（大众汽车有限公司）； X3000 Inspection Co., Ltd（X3000检测有限公司）； Nanyang Technological University, Singapore（南洋理工大学，新加坡）

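AI summary: For the task of localizing plate endpoints in power battery X-ray images, proposes PBD5K, the first large-scale benchmark, and MDCNeXt, a point-level segmentation model that improves detection accuracy using multi-dimensional structural cues and state space modules.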

Comments Accepted by International Journal of Computer Vision (IJCV). Code: https://github.com/NTU-AI4X/X-ray-PBD


AI中文摘要

动力电池是电动汽车的关键部件，其内部结构缺陷可能带来严重安全风险。我们对一项新任务——动力电池检测（PBD）进行了全面研究，该任务旨在从工业X射线图像中定位阴极和阳极板的密集端点，用于质量检测。人工检测效率低且易出错，而传统视觉算法难以处理密集排列的极板、低对比度、尺度变化和成像伪影。为解决这一问题并推动对该有意义任务的关注，我们提出了PBD5K，这是该任务的第一个大规模基准，包含来自九种电池类型的5000张X射线图像，具有细粒度标注和八种真实世界视觉干扰。为支持可扩展且一致的标注，我们开发了一种智能标注流程，结合了图像过滤、模型辅助预标注、交叉验证和分层质量评估。我们将PBD表述为一个点级分割问题，并提出了MDCNeXt，该模型旨在提取和整合来自极板本身的多维结构线索，包括点、线和计数信息。为改善极板间的区分并抑制视觉干扰，MDCNeXt引入了两个状态空间模块。第一个是提示过滤模块，学习由任务特定提示引导的对比关系。第二个是密度感知重排序模块，在高极板密度区域细化分割。此外，我们提出了一种距离自适应掩码生成策略，在阳极和阴极位置的不同空间分布下提供鲁棒的监督。源代码和数据集将在 https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD（PBD5K）公开。

英文摘要

Power batteries are essential components in electric vehicles, where internal structural defects can pose serious safety risks. We conduct a comprehensive study on a new task, power battery detection (PBD), which aims to localize the dense endpoints of cathode and anode plates from industrial X-ray images for quality inspection. Manual inspection is inefficient and error-prone, while traditional vision algorithms struggle with densely packed plates, low contrast, scale variation, and imaging artifacts. To address this issue and draw more attention to this meaningful task, we present PBD5K, the first large-scale benchmark for this task, consisting of 5,000 X-ray images from nine battery types with fine-grained annotations and eight types of real-world visual interference. To support scalable and consistent labeling, we develop an intelligent annotation pipeline that combines image filtering, model-assisted pre-labeling, cross-verification, and layered quality evaluation. We formulate PBD as a point-level segmentation problem and propose MDCNeXt, a model designed to extract and integrate multi-dimensional structural cues, including point, line, and count information, from the plate itself. To improve discrimination between plates and suppress visual interference, MDCNeXt incorporates two state space modules. The first is a prompt-filtered module that learns contrastive relationships guided by task-specific prompts. The second is a density-aware reordering module that refines segmentation in regions with high plate density. In addition, we propose a distance-adaptive mask generation strategy to provide robust supervision under varying spatial distributions of anode and cathode positions. The source code and datasets will be publicly available at PBD5K (https://github.com/Xiaoqi-Zhao-DLUT/X-ray-PBD).


2512.00885 2026-06-16 cs.CV 版本更新

HanDyVQA: A Video QA Benchmark for Fine-Grained Hand-Object Interaction Dynamics

HanDyVQA：面向细粒度手-物交互动态的视频问答基准

Masatoshi Tateno, Gido Kato, Hirokatsu Kataoka, Yoichi Sato, Takuma Yagi

发表机构 * Institute of Industrial Science, The University of Tokyo（东京大学工业科学研究所）； National Institute of Advanced Industrial Science and Technology (AIST)（国家先进工业科学与技术研究院）； Waseda University（早稻田大学）； Visual Geometry Group, University of Oxford（牛津大学视觉几何组）

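AI summary: Proposes the HanDyVQA benchmark, which uses six question types (11.1K QA pairs) and 10.3K segmentation masks to evaluate video models' fine-grained spatio-temporal reasoning about manipulation and its effects in hand-object interaction; the best model, Gemini-2.5-Pro, reaches only 73% accuracy (humans: 97%).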

Comments CVPR 2026, Project page: https://masatate.github.io/HanDyVQA-project-page/

详情

AI中文摘要

手-物交互（HOI）本质上涉及动态过程，其中人的操作会在物体上产生不同的时空效果。然而，现有的语义HOI基准要么关注操作，要么关注效果，但都停留在粗粒度层面，缺乏捕捉HOI中潜在动态的细粒度时空推理。我们引入了HanDyVQA，一个细粒度视频问答基准，全面覆盖HOI的操作和效果两个方面。HanDyVQA包含六种互补的问题类型（动作、过程、物体、位置、状态变化和物体部件），总计11.1K个多项选择问答对。收集的问答对识别操作风格、手/物体运动和部件级状态变化。HanDyVQA还包括10.3K个用于物体和物体部件问题的分割掩码，从而能够评估视频物体分割中的物体/部件级推理。我们在基准上评估了最新的视频基础模型，发现即使表现最好的模型Gemini-2.5-Pro也仅达到73%的平均准确率，远低于人类表现（97%）。进一步分析揭示了在空间关系、运动和部件级几何理解方面仍存在的挑战。我们还发现，将显式的HOI相关线索整合到视觉特征中可以提高性能，为开发具有更深层HOI动态理解的未来模型提供了见解。

英文摘要

Hand-object interaction (HOI) inherently involves dynamics where human manipulations produce distinct spatio-temporal effects on objects. However, existing semantic HOI benchmarks focus either on manipulation or on the resulting effects at a coarse level, lacking fine-grained spatio-temporal reasoning to capture the underlying dynamics in HOI. We introduce HanDyVQA, a fine-grained video question-answering benchmark that comprehensively covers both the manipulation and effect aspects of HOI. HanDyVQA comprises six complementary question types (Action, Process, Objects, Location, State Change, and Object Parts), totalling 11.1K multiple-choice QA pairs. The collected QA pairs involve recognizing manipulation styles, hand/object motions, and part-level state changes. HanDyVQA also includes 10.3K segmentation masks for Objects and Object Parts questions, enabling the evaluation of object/part-level reasoning in video object segmentation. We evaluated recent video foundation models on our benchmark and found that even the best-performing model, Gemini-2.5-Pro, reached only 73% average accuracy, which is far from human performance (97%). Further analysis shows the remaining challenges in spatial relationship, motion, and part-level geometric understanding. We also found that integrating explicit HOI-related cues into visual features improves performance, offering insights for developing future models with a deeper understanding of HOI dynamics.


2512.01095 2026-06-16 cs.CV cs.AI cs.LG 版本更新

CycliST: A Video Language Model Benchmark for Reasoning on Cyclical State Transitions

CycliST：用于循环状态转换推理的视频语言模型基准

Simon Kohaut, Daniel Ochs, Shun Zhang, Benedict Flade, Julian Eggert, Kristian Kersting, Devendra Singh Dhami

发表机构 * Artificial Intelligence and Machine Learning Lab, TU Darmstadt（达姆施塔特工业大学人工智能与机器学习实验室）； Konrad Zuse School of Excellence in Learning and Intelligent Systems (ELIZA)（康拉德·楚泽学习与智能系统卓越学院（ELIZA））； Honda Research Institute Europe GmbH, Offenbach, Germany（本田欧洲研究院，奥芬巴赫，德国）； Uncertainty in Artificial Intelligence Group, TU Eindhoven（埃因霍温理工大学人工智能不确定性小组）； Hessian Center for AI (hessian.AI)（黑森人工智能中心（hessian.AI））； Center for Cognitive Science（认知科学中心）； German Center for Artificial Intelligence (DFKI)（德国人工智能中心（DFKI））

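AI summary: Proposes the CycliST benchmark, which uses synthetic videos to evaluate video language models' textual reasoning over cyclical state transitions, revealing limitations of existing models in detecting cyclic patterns, temporal understanding, and quantitative analysis.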

Comments Published in the Journal of Data-centric Machine Learning Research (DMLR); https://openreview.net/forum?id=l03g53HUL2


Journal ref: Journal of Data-centric Machine Learning Research, 2026

AI中文摘要

我们提出了CycliST，这是一个新颖的基准数据集，旨在评估视频语言模型（VLM）在循环状态转换上的文本推理能力。CycliST通过生成合成的、结构丰富的视频序列来捕捉现实世界过程的基本方面，这些视频序列具有物体运动和视觉属性的周期性模式。CycliST采用分层评估系统，通过改变循环物体的数量、场景杂乱程度和光照条件逐步增加难度，挑战最先进模型的时空认知能力。我们使用当前最先进的VLM（包括开源和专有模型）进行了大量实验，揭示了它们在泛化到循环动力学（如线性和轨道运动）以及视觉属性（如颜色和尺度）随时间变化方面的局限性。我们的结果表明，当前的VLM难以可靠地检测和利用循环模式，缺乏时间理解的概念，并且无法从场景中提取定量信息（如运动物体的数量），突显了需要解决的重要技术差距。更具体地说，我们发现没有单一模型在性能上始终领先：大小和架构与结果的相关性不强，且没有模型在所有任务上同样成功。通过提供有针对性的挑战和全面的评估框架，CycliST为超越当前最先进水平的视觉推理模型在理解周期性模式方面铺平了道路。

英文摘要

We present CycliST, a novel benchmark dataset designed to evaluate Video Language Models (VLM) on their ability for textual reasoning over cyclical state transitions. CycliST captures fundamental aspects of real-world processes by generating synthetic, richly structured video sequences featuring periodic patterns in object motion and visual attributes. CycliST employs a tiered evaluation system that progressively increases difficulty through variations in the number of cyclic objects, scene clutter, and lighting conditions, challenging state-of-the-art models on their spatio-temporal cognition. We conduct extensive experiments with current state-of-the-art VLMs, both open-source and proprietary, and reveal their limitations in generalizing to cyclical dynamics such as linear and orbital motion, as well as time-dependent changes in visual attributes like color and scale. Our results demonstrate that present-day VLMs struggle to reliably detect and exploit cyclic patterns, lack a notion of temporal understanding, and are unable to extract quantitative insights from scenes, such as the number of objects in motion, highlighting a significant technical gap that needs to be addressed. More specifically, we find no single model consistently leads in performance: neither size nor architecture correlates strongly with outcomes, and no model succeeds equally well across all tasks. By providing a targeted challenge and a comprehensive evaluation framework, CycliST paves the way for visual reasoning models that surpass the state-of-the-art in understanding periodic patterns.

2512.07925 2026-06-16 cs.CV cs.AI 版本更新

Near-Real-Time Conflict-Related Fire Detection in Sudan Using Unsupervised Deep Learning

苏丹冲突相关火灾的近实时检测：基于无监督深度学习

Kuldip Singh Atwal, Dieter Pfoser, Daniel Rothbart

发表机构 * George Mason University（乔治梅森大学）

AI总结提出轻量级VAE模型结合Planet Labs 4波段影像，在24-30小时内无监督检测苏丹冲突火灾区域，优于余弦距离、CVA和IR-MAD方法。

DOI: 10.1016/j.srs.2026.100446
Journal ref: Science of Remote Sensing, Volume 13, 2026, 100446, ISSN 2666-0172

AI中文摘要

苏丹持续的武装冲突凸显了快速监测冲突相关火灾影响区域的必要性。深度学习和高频卫星影像的最新进展使得近实时评估战区的活跃火点和火烧迹地成为可能。本研究提出了一种近实时监测方法，使用轻量级变分自编码器（VAE）模型，结合空间分辨率为3米的4波段Planet Labs影像。我们证明，在有利的观测条件下，利用可获取的商业卫星数据，这些受影响区域可在约24至30小时内被检测到。为此，我们对一个最初为10波段影像设计的VAE模型进行了调整，使其能够有效处理高分辨率4波段输入。模型以无监督方式训练，学习正常地表状态的紧凑潜在表示，并通过量化时间配对潜在嵌入之间的变化来识别燃烧特征。性能在苏丹的五个案例研究中进行评估，并与余弦距离、CVA和IR-MAD进行比较，评估指标为在时间配对影像块之间计算的精确率、召回率、F1分数以及精确率-召回率曲线下面积（AUPRC）。结果表明，所提方法始终优于其他方法，在高度不平衡的火灾检测场景中实现了更高的召回率和F1分数，同时保持了可行的精确率。使用8波段影像和时间序列影像的实验相比单一4波段输入仅带来边际性能提升，突显了所提轻量级方法在可扩展的近实时冲突监测中的有效性。

英文摘要

Ongoing armed conflict in Sudan highlights the need for rapid monitoring of conflict-related fire-affected areas. Recent advances in deep learning and high-frequency satellite imagery enable near-real-time assessment of active fires and burn scars in war zones. This study presents a near-real-time monitoring approach using a lightweight Variational Auto-Encoder (VAE)-based model integrated with 4-band Planet Labs imagery at 3 m spatial resolution. We demonstrate that these impacted regions can be detected within approximately 24 to 30 hours under favorable observational conditions using accessible, commercially available satellite data. To achieve this, we adapt a VAE-based model, originally designed for 10-band imagery, to operate effectively on high-resolution 4-band inputs. The model is trained in an unsupervised manner to learn compact latent representations of nominal land-surface conditions and identify burn signatures by quantifying changes between temporally paired latent embeddings. Performance is evaluated across five case studies in Sudan and compared against cosine distance, CVA, and IR-MAD using precision, recall, F1-score, and the area under the precision-recall curve (AUPRC) computed between temporally paired image tiles. Results show that the proposed approach consistently outperforms the other methods, achieving higher recall and F1-scores while maintaining viable precision in highly imbalanced fire-detection scenarios. Experiments with 8-band imagery and temporal image sequences yield only marginal performance gains over single 4-band inputs, underscoring the effectiveness of the proposed lightweight approach for scalable, near-real-time conflict monitoring.
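
The idea of scoring change between temporally paired latent embeddings can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the abstract does not say which distance is applied to the latents, so the Euclidean distance, the function names, and the threshold below are all assumptions. Cosine distance is included because the abstract names it as a baseline.

```python
import math


def euclidean_change(z_before, z_after):
    """Euclidean distance between the latent vectors of one tile at two dates.

    Assumption: the abstract only says changes between paired latent
    embeddings are quantified; the exact distance is not specified.
    """
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(z_before, z_after)))


def cosine_distance(x, y):
    """Cosine distance, one of the baselines named in the abstract."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)


def flag_changed_tiles(latent_pairs, threshold):
    """Mark a tile as changed when its latent shift exceeds the threshold.

    latent_pairs: list of (z_before, z_after) tuples, one per image tile.
    threshold: hypothetical cut-off that would be tuned on validation data.
    """
    return [euclidean_change(zb, za) > threshold for zb, za in latent_pairs]
```

Sweeping the threshold over all observed scores would give the precision-recall curve from which an AUPRC figure is computed.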

2601.16713 2026-06-16 cs.CV 版本更新

A Human-in-the-Loop Label Error Detection Framework Applied to Arabic-Script HTR Datasets

应用于阿拉伯文字手写文本识别数据集的人在回路标签错误检测框架

Sana Al-azzawi, Elisa Barney, Marcus Liwicki

发表机构 * University of St. Thomas（圣托马斯大学）

AI总结提出CER-HV两阶段框架，结合字符错误率检测和人工验证，在阿拉伯文字HTR数据集中高效识别标签错误，提升识别性能。

AI中文摘要

尽管近期取得了进展，阿拉伯文字的手写文本识别（HTR）仍落后于拉丁文字HTR。部分问题在于数据集质量。为帮助缩小这一差距，我们提出了一个用于检测标签错误的两阶段框架（CER-HV）。第一阶段（CER）是基于字符错误率的噪声检测器，构建在卷积循环神经网络（CRNN）架构上。第二阶段（HV）是对第一阶段检测到的噪声样本进行人在回路（HITL）验证。将CER-HV框架应用于多个阿拉伯文字数据集，可以识别出带有标签错误的样本，包括转录、分割、方向和非文本内容错误，这些错误会显著影响HTR性能。框架的第一阶段识别这些错误的精确率最高可达90%（前50）。我们还表明，我们的CRNN在六个评估数据集中的五个上达到了最先进的性能，在KHATT（阿拉伯语）上达到8.46%的字符错误率（CER），在PHTI（普什图语）上达到8.22%，在Ajami上达到10.59%，在Muharaf（阿拉伯语）上达到10.11%，所有这些均未进行任何数据清洗。我们在PHTD（波斯语）数据集上建立了11.3% CER的新基线。应用CER-HV进行数据集清洗并重新训练后，评估CER最多降低了1.8个百分点。尽管我们的实验专注于阿拉伯文字语言的文档，但该框架是通用的，可以应用于其他文本识别数据集。

英文摘要

Despite recent advances, Handwritten Text Recognition (HTR) for Arabic-script languages still lags behind Latin-script HTR. Part of the problem is dataset quality. To help close this gap, we propose a two-stage framework (CER-HV) for detecting label errors. Stage 1 (CER) is a Character-Error-Rate-based noise detector built on a Convolutional Recurrent Neural Network (CRNN) architecture. Stage 2 (HV) is the Human-In-The-Loop (HITL) Verification of noisy samples detected by the first stage. Applying the CER-HV framework to multiple Arabic-script datasets can identify samples with label errors, including transcription, segmentation, orientation, and non-text content errors, that can markedly affect HTR performance. These errors were identified by the first stage of the framework with up to 90% (top-50) precision. We also show that our CRNN achieves state-of-the-art performance across five of the six evaluated datasets, reaching 8.46% Character Error Rate (CER) on KHATT (Arabic), 8.22% on PHTI (Pashto), 10.59% on Ajami, and 10.11% on Muharaf (Arabic), all without any data cleaning. We establish a new baseline of 11.3% CER on the PHTD (Persian) dataset. Applying CER-HV improves evaluation CER by up to 1.8 percentage points after dataset cleaning and retraining. Although our experiments focus on documents written in an Arabic-script language, the framework is general and can be applied to other text recognition datasets.
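
The first stage described above can be illustrated with a small sketch: compute the character error rate between each ground-truth label and a recognizer's prediction, then pass the highest-CER samples to a human for verification. This is an assumption-laden outline rather than the paper's code. The sample tuple layout, the function names, and the plain "sort by CER, take the top k" rule are hypothetical, and the CRNN that would produce the predictions is not shown.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def top_k_suspects(samples, k):
    """Return ids of the k samples whose label disagrees most with the model.

    samples: list of (sample_id, label, prediction) tuples (hypothetical layout).
    """
    scored = [(cer(label, pred), sample_id) for sample_id, label, pred in samples]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [sample_id for _, sample_id in scored[:k]]
```

A reported "top-50 precision" would then be the fraction of the 50 highest-ranked samples that a human reviewer confirms as genuine label errors.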

2602.04525 2026-06-16 cs.CV cs.AI 版本更新

SLUM-i: Semi-supervised Learning for Urban Mapping of Informal Settlements and Data Quality Benchmarking

SLUM-i: 非正规住区城市制图的半监督学习与数据质量基准测试

Muhammad Taha Mukhtar, Syed Musa Ali Kazmi, Khola Naseem, Muhammad Ali Chattha, Andreas Dengel, Sheraz Ahmed, Muhammad Naseer Bajwa, Muhammad Imran Malik

发表机构 * School of Electrical Engineering and Computer Science, National University of Sciences and Technology (NUST)（电气工程与计算机科学学院，国立科学与技术大学（NUST））； Smart Data & Knowledge Services, German Research Center for Artificial Intelligence (DFKI)（智能数据与知识服务，德国人工智能研究中心（DFKI））

AI总结针对非正规住区制图中标注稀缺和数据质量挑战，提出半监督分割框架，集成类别自适应阈值和DINOv2过滤机制，在跨三大洲七城市实验中mIoU提升最高5.9个百分点。

Comments 10 pages, 8 figures, 5 tables

AI中文摘要

快速的城市扩张推动了低收入和中等收入国家主要城市非正规住区的增长，巴基斯坦的拉合尔和卡拉奇以及印度的孟买就是突出的例子。然而，这些住区的大规模制图不仅受到标注稀缺的严重限制，还受到固有数据质量挑战的制约，特别是正式与非正式结构之间的高光谱模糊性和显著的标注噪声。我们通过引入一个从头构建的拉合尔基准数据集，以及从经过验证的行政边界导出的卡拉奇和孟买配套数据集来解决这一问题，这些数据集总计约900平方公里的城市区域。该集合还补充了来自撒哈拉以南非洲和拉丁美洲先前文献中的四个城市，并为每个城市提供了全面的数据质量评估。我们还提出了一个半监督分割框架，旨在缓解标准半监督学习流程中固有的类别不平衡和分布不匹配问题。我们的方法集成了类别自适应阈值机制，该机制动态调整置信度阈值以防止少数类抑制，以及基于DINOv2的未标记池过滤器，该过滤器在训练前移除分布外的图块以减少协变量偏移。跨越三大洲七个城市、重复五个随机种子的广泛实验表明，与最先进的半监督基线相比，mIoU最高提升5.9个百分点，且两个组件均与架构无关，不增加推理开销。

英文摘要

Rapid urban expansion has fueled the growth of informal settlements in major cities of low- and middle-income countries, with Lahore and Karachi in Pakistan and Mumbai in India serving as prominent examples. However, large-scale mapping of these settlements is severely constrained not only by the scarcity of annotations but also by inherent data quality challenges, specifically high spectral ambiguity between formal and informal structures and significant annotation noise. We address this by introducing a benchmark dataset for Lahore, constructed from scratch, along with companion datasets for Karachi and Mumbai, which were derived from verified administrative boundaries, totaling approximately 900 km$^2$ of urban area. This collection is supplemented by four cities from prior literature across Sub-Saharan Africa and Latin America, with comprehensive data quality assessments provided for each city. We also propose a semi-supervised segmentation framework designed to mitigate the class imbalance and distribution mismatch inherent in standard semi-supervised learning pipelines. Our method integrates a Class-Aware Adaptive Thresholding mechanism that dynamically adjusts confidence thresholds to prevent minority class suppression, and a DINOv2-based unlabeled pool filter that removes out-of-distribution tiles prior to training to reduce covariate shift. Extensive experiments across seven cities spanning three continents, repeated over five random seeds, demonstrate gains of up to +5.9 pp mIoU over state-of-the-art semi-supervised baselines, with both components being architecture-agnostic and adding no inference overhead.
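
To make the idea of class-aware adaptive thresholding concrete, here is one possible sketch. It is not the paper's rule, which the abstract does not spell out. It assumes a simple scheme in which each class's pseudo-label threshold is a shared base threshold scaled by that class's mean confidence relative to the most confident class, so a hard minority class gets a lower bar. All names and the scaling rule are hypothetical.

```python
def class_adaptive_thresholds(mean_confidence, base=0.95):
    """Scale a base threshold per class by relative mean confidence.

    mean_confidence: dict mapping class name -> running mean of the model's
    confidence on that class (assumed to be tracked during training).
    """
    top = max(mean_confidence.values())
    return {cls: base * conf / top for cls, conf in mean_confidence.items()}


def select_pseudo_labels(predictions, thresholds):
    """Keep predictions whose confidence clears the threshold of their class.

    predictions: list of (predicted_class, confidence) pairs.
    """
    return [(cls, conf) for cls, conf in predictions if conf >= thresholds[cls]]
```

With a single global threshold of 0.9, a prediction of "informal" at confidence 0.6 would be discarded. Under the scaled thresholds it can survive, which is the effect the abstract describes as preventing minority class suppression.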

2602.09764 2026-06-16 cs.CV cs.IR cs.LG 版本更新

Self-Supervised Learning as Discrete Communication

自监督学习作为离散通信

Kawtar Zaher, Ilyass Moummad, Olivier Buisson, Alexis Joly

AI总结将视觉自监督学习视为教师与学生网络间的离散通信过程，通过固定容量二进制信道传输语义信息，使用逐元素二元交叉熵目标强制离散一致性，并引入编码率正则化促进结构化表示，在图像分类、检索和密集预测任务上优于连续对齐基线。

AI中文摘要

大多数自监督学习（SSL）方法通过对齐同一输入的不同视图来学习连续视觉表示，对信息如何在表示维度间进行结构化提供的控制有限。在这项工作中，我们将视觉自监督学习视为教师网络与学生网络之间的离散通信过程，其中语义信息通过固定容量的二进制信道传输。学生网络不是对齐连续特征，而是预测教师网络产生的多标签二进制消息。通过逐元素二元交叉熵目标强制离散一致性，同时编码率正则化项鼓励有效利用受限信道，促进结构化表示。我们进一步表明，周期性地重新初始化投影头通过鼓励嵌入在多个离散编码中保持可预测性来增强这种效果。大量实验表明，在图像分类、检索和密集视觉预测任务中，以及通过自监督适应在领域转移下，该方法持续优于连续对齐基线。除了骨干表示，我们分析了学习到的二进制编码，并表明它们形成了一种紧凑且信息丰富的离散语言，捕获了可跨类别复用的语义因子。

英文摘要

Most self-supervised learning (SSL) methods learn continuous visual representations by aligning different views of the same input, offering limited control over how information is structured across representation dimensions. In this work, we frame visual self-supervised learning as a discrete communication process between a teacher and a student network, where semantic information is transmitted through a fixed-capacity binary channel. Rather than aligning continuous features, the student predicts multi-label binary messages produced by the teacher. Discrete agreement is enforced through an element-wise binary cross-entropy objective, while a coding-rate regularization term encourages effective utilization of the constrained channel, promoting structured representations. We further show that periodically reinitializing the projection head strengthens this effect by encouraging embeddings that remain predictive across multiple discrete encodings. Extensive experiments demonstrate consistent improvements over continuous agreement baselines on image classification, retrieval, and dense visual prediction tasks, as well as under domain shift through self-supervised adaptation. Beyond backbone representations, we analyze the learned binary codes and show that they form a compact and informative discrete language, capturing semantic factors reusable across classes.
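
The element-wise binary cross-entropy objective mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions: the teacher's message is obtained here by thresholding teacher logits at zero, which the abstract does not confirm, and the coding-rate regularizer and projection-head reinitialization are omitted. Function names are hypothetical.

```python
import math


def binarize(teacher_logits):
    """Turn teacher logits into a binary message (assumed rule: sign threshold)."""
    return [1.0 if t > 0 else 0.0 for t in teacher_logits]


def bce_with_logits(student_logits, bits):
    """Mean element-wise binary cross-entropy between student logits and bits.

    Uses the numerically stable form max(s, 0) - s * b + log(1 + exp(-|s|)).
    """
    total = 0.0
    for s, b in zip(student_logits, bits):
        total += max(s, 0.0) - s * b + math.log1p(math.exp(-abs(s)))
    return total / len(bits)
```

Each bit is treated as an independent binary label, which is what makes the target a multi-label message rather than a single categorical code.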

2602.15720 2026-06-16 cs.CV 版本更新

ToaSt: Token Channel Selection and Structured Pruning for Efficient ViT

ToaSt: 面向高效ViT的令牌通道选择与结构化剪枝

Hyunchan Moon, Cheonjun Park, Steven L. Waslander

发表机构 * University of California, Berkeley（加州大学伯克利分校）

AI总结提出ToaSt框架，对多头自注意力模块进行耦合头结构化剪枝，对前馈网络采用训练无关的令牌通道选择方法，在多种ViT模型上实现精度与效率的优越平衡。

Comments Accepted at ICML 2026

SLU-2K：基于问题的手语翻译语义评估基准

Zeno Testa, Antonino Furnari, Lorenzo Baraldi, Natalia Díaz-Rodríguez

发表机构 * University of Modena and Reggio Emilia（摩德纳和雷吉奥艾米利亚大学）； University of Catania（卡塔尼亚大学）； University of Granada（格拉纳达大学）； CITIC & DaSCI Institute（CITIC与DaSCI研究所）

AI总结提出SLU-2K基准，通过2350个视频问答对评估手语翻译的语义理解，揭示当前系统在语义正确性上的不足。

Comments Accepted at the GenSign Workshop, CVPR 2026

AI中文摘要

手语翻译（SLT）通常使用表面形式指标（如BLEU和ROUGE）进行评估，这些指标奖励词汇重叠，但不直接衡量翻译是否保留了源手语序列的含义。这与将SLT集成到辅助技术中的最终目标相悖。在这项工作中，我们将重点从手语翻译（SLT）转向手语理解（SLU），特别强调语义理解。具体来说，我们根据系统从输入视频中正确恢复原始句子关键语义方面的能力来评估系统，例如发生的动作以及关于人和物体的事实。为了系统地实现这种评估，我们提出了SLU-2K，这是一个基于流行的PHOENIX-2014T和CSL-Daily数据集的2350个封闭式视频问答对的数据集。为了获得SLU-2K，我们提出并广泛评估了一个自动数据生成流水线，该流水线生成7个类别的问题，即动作、位置、数字、物体、人物、时间和天气条件。我们通过评估流行的多模态大语言模型（MLLM）和两个代表性的最先进系统MMSTL和SpaMo，展示了SLU-2K的潜力。我们的结果表明，MLLM达到了接近随机的性能，突显了当前AI系统中需要更系统地集成SLU。此外，在领域内数据上精心微调的最先进翻译系统仍然存在显著的语义差距，结果范围从56.7%到75.2%。这些发现表明，当前的SLT评估协议高估了真正的理解，未来的进展不仅应通过流畅性和n-gram重叠来衡量，还应通过语义正确性来衡量。代码、提示和基准文件可在 https://github.com/ZenoTsT/SLU-2K 获取。

英文摘要

Sign Language Translation (SLT) is typically evaluated with surface-form metrics such as BLEU and ROUGE, which reward lexical overlap but do not directly measure whether a translation preserves the meaning of the source sign sequence. This is in contrast with the final objective of integrating SLT in assistive technology. In this work, we shift the focus from Sign Language Translation (SLT) to Sign Language Understanding (SLU), with particular emphasis on semantic understanding. Specifically, we evaluate systems based on their ability to correctly recover, from the input video, key semantic aspects of the original sentence, such as actions taking place and facts about people and objects. To enable this evaluation systematically, we propose SLU-2K, a dataset of 2,350 closed-ended video question-answer pairs based on the popular PHOENIX-2014T and CSL-Daily datasets. To obtain SLU-2K, we propose and extensively evaluate an automated data generation pipeline which produces questions across 7 categories, namely actions, locations, numbers, objects, people, time, and weather conditions. We show the potential of SLU-2K by evaluating popular Multimodal Large Language Models (MLLMs) and two representative state-of-the-art systems, MMSTL and SpaMo. Our results show that MLLMs reach near-random performance, highlighting the need for a more systematic integration of SLU in current AI systems. Furthermore, state-of-the-art translation systems carefully fine-tuned on in-domain data still exhibit a substantial semantic gap, with results ranging from 56.7% to 75.2%. These findings suggest that current SLT evaluation protocols overestimate true understanding and that future progress should be measured not only by fluency and n-gram overlap, but also by semantic correctness. Code, prompts, and benchmark files are available at https://github.com/ZenoTsT/SLU-2K

2606.04184 2026-06-16 cs.CV 版本更新

GroupToM-Bench: Benchmarking Group Theory of Mind and Nonlinear Social Emergence in MLLMs

GroupToM-Bench: 多模态大语言模型中群体心智理论和非线性社会涌现的基准测试

Weidong Tang, Jierui Li, Yueling Hou, Zihan Mei, Can Zhang, Xinyan Wan, Zhiyuan Liang, Pengfei Zhou, Yang You, Wangbo Zhao

发表机构 * Xidian University（西安电子科技大学）； National University of Singapore（新加坡国立大学）； University of Electronic Science and Technology of China（电子科技大学）； University of Science and Technology of China（中国科学技术大学）

AI总结针对多模态大语言模型在群体心智理论推理上的不足，提出GroupToM-Bench基准，通过七级认知审计框架评估模型从微观BDI状态到宏观结果预测的因果链，揭示模型在处理社会结构和非线性集体动态上的缺陷。

Comments ACL 2026 (Main Conference)

AI中文摘要

真正的通用智能不仅需要物理世界模型，还需要社会世界模型：即推断个体心理状态如何相互作用并结晶为群体层面结果的能力。尽管在个体层面的心智理论推理方面取得了显著进展，现有的多模态大语言模型在这一更广泛的任务上仍然失败。集体行为从社会张力、从众动态和结构约束中非线性地涌现，这意味着它不能通过简单地对个体意图求和来恢复。我们提出了GroupToM-Bench，第一个针对群体层面心智理论的多模态基准，围绕一个跨越微观层面BDI状态（信念、欲望、意图）、中观层面群体张力和结构约束以及宏观层面结果预测和机制归因的因果链构建。为了探测这一完整弧线，我们开发了一个七级认知审计框架。实验揭示了当前模型与人类基线之间的差距，突出了模型在处理社会结构和非线性集体动态方面的失败。

英文摘要

True general intelligence requires not only a model of the physical world but also a social world model: the capacity to infer how individual mental states interact and crystallize into group-level outcomes. Despite notable progress in individual-level Theory of Mind (ToM) reasoning, existing multimodal large language models fail at this broader task. Collective behavior emerges non-linearly from social tensions, conformity dynamics, and structural constraints, meaning it cannot be recovered by merely summing individual intentions. We present GroupToM-Bench, the first multimodal benchmark for group-level ToM, built around a causal chain spanning micro-level BDI states (belief, desire, intention), meso-level group tension and structural constraints, and macro-level outcome prediction and mechanistic attribution. To probe this full arc, we develop a seven-level cognitive audit framework. Experiments reveal a gap between current models and human baselines, highlighting a failure to process social structures and non-linear collective dynamics.

2606.07086 2026-06-16 cs.CV cs.LG 版本更新

IGLU：集成高斯线性单元激活函数

Mingi Kang, Zai Yang, Jeova Farias Sales Rocha Neto

发表机构 * Bowdoin College（鲍登学院）

AI总结提出IGLU激活函数，基于半正态混合分布推导出闭式表达，其门控为柯西CDF，通过单一锐度参数在恒等与ReLU行为间插值，重尾特性保证非零梯度，并给出仅含ReLU操作的有理近似，在视觉和语言任务上达到或超越ReLU/GELU性能。

AI中文摘要

激活函数对深度神经网络至关重要，控制着梯度流、优化稳定性和表示能力。在历史深度架构中，ReLU一直是激活函数的主要选择，而现代基于Transformer的模型越来越多地采用更平滑的替代方案，如GELU和其他自门控替代方案。尽管它们在经验上取得了成功，但这些函数之间的数学关系及其有效性背后的原理仍仅被部分理解。我们引入了IGLU，一个参数化激活函数，作为在半正态混合分布下的GELU门控的尺度混合推导得出。该推导产生了一个闭式表达式，其门控分量恰好是柯西CDF，提供了一个原则性的单参数族，通过单一锐度参数$\sigma$在类恒等和类ReLU行为之间连续插值。与GELU的高斯门控不同，IGLU的重尾柯西门控在负尾处以多项式衰减，保证所有有限输入的非零梯度，并对梯度消失具有更强的鲁棒性。我们进一步引入了IGLU-Approx，一种计算高效的IGLU有理近似，完全用ReLU操作表示，消除了超越函数求值。通过在CIFAR-10、CIFAR-100和WikiText-103上使用ResNet-20、ViT-Tiny和GPT-2 Small进行的评估，IGLU在视觉和语言数据集上相对于ReLU和GELU基线实现了具有竞争力或更优的性能，而IGLU-Approx以大幅降低的计算成本恢复了这一性能。特别地，我们表明在高度不平衡的分类数据集中，使用重尾门控带来了显著的性能提升。

英文摘要

Activation functions are fundamental to deep neural networks, governing gradient flow, optimization stability, and representational capacity. While ReLU has been the dominant activation function in historic deep architectures, modern transformer-based models are increasingly adopting smoother alternatives such as GELU and other self-gated functions. Despite their empirical success, the mathematical relationships among these functions and the principles underlying their effectiveness remain only partially understood. We introduce IGLU, a parametric activation function derived as a scale mixture of GELU gates under a half-normal mixing distribution. This derivation yields a closed-form expression whose gating component is exactly the Cauchy CDF, providing a principled one-parameter family that continuously interpolates between identity-like and ReLU-like behavior via a single sharpness parameter $\sigma$. Unlike GELU's Gaussian gate, IGLU's heavy-tailed Cauchy gate decays polynomially in the negative tail, guaranteeing non-zero gradients for all finite inputs and offering greater robustness to vanishing gradients. We further introduce IGLU-Approx, a computationally efficient rational approximation of IGLU expressed entirely in terms of ReLU operations that eliminates transcendental function evaluation. Through evaluations on CIFAR-10, CIFAR-100, and WikiText-103 across ResNet-20, ViT-Tiny, and GPT-2 Small, IGLU achieves competitive or superior performance on both vision and language datasets against ReLU and GELU baselines, with IGLU-Approx recovering this performance at substantially reduced computational cost. In particular, we show that employing a heavy-tailed gate leads to considerable performance gains in heavily imbalanced classification datasets.
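
A self-gated activation whose gate is a Cauchy CDF can be written down directly, and a small sketch shows the interpolation the abstract describes. The parameterization below, x multiplied by the Cauchy CDF evaluated at x/sigma, is an assumption: the abstract states only that the gate is the Cauchy CDF and that a single sharpness parameter controls it, so the paper's exact scaling may differ. IGLU-Approx is not reproduced here.

```python
import math


def cauchy_cdf(u):
    """CDF of the standard Cauchy distribution."""
    return 0.5 + math.atan(u) / math.pi


def iglu(x, sigma=1.0):
    """Cauchy-gated activation x * F(x / sigma), an assumed form of IGLU.

    Small sigma sharpens the gate toward a step function (ReLU-like).
    Large sigma flattens the gate toward 0.5, so the output approaches
    the linear map x / 2 under this assumed parameterization.
    """
    return x * cauchy_cdf(x / sigma)


def relu(x):
    """Reference ReLU for comparison."""
    return max(x, 0.0)
```

Because the arctangent approaches its limits polynomially rather than exponentially, the gate never collapses to exactly zero for finite negative inputs, which is the heavy-tail property the abstract contrasts with GELU's Gaussian gate.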

2605.00873 2026-06-16 cs.MM cs.AI cs.CV 版本更新

BRITE: A Benchmark for Reliable and Interpretable T2V Evaluation on Implausible Scenarios

BRITE：面向不可信场景的可靠可解释文本到视频评估基准

Advait Tilak, Jiwon Choi, Nazifa Mouli, Wei Le

AI总结提出BRITE基准，通过人工参与协议统一不可信提示、细粒度音视频一致性评估和可解释QA评估，揭示现有模型在对象-动作绑定和音视频同步上的显著缺陷。

AI中文摘要

逼真文本到视频（T2V）生成的快速发展带来了对最新评估方法的迫切需求。现有基准大多忽略了不可信场景，并且不衡量音视频对齐。我们引入BRITE，这是第一个将（1）不可信提示、（2）音视频一致性的细粒度评估以及（3）基于QA的可解释评估统一为全面T2V基准的框架。与完全自动化的基于多模态LLM的流水线（容易产生幻觉和提示歧义）不同，BRITE通过严格的人工参与协议保证基准创建的可靠性。评估五个最先进模型（Sora 2、Veo 3.1、Runway Gen4.5、Pixverse V5.5和Qwen3Max），我们揭示了一个关键性能差距：虽然模型在静态对象组合方面表现出色，但在对象-动作绑定和音视频同步方面表现出显著退化。我们的框架为社区提供了一个可靠、可解释的基准和评估框架，能够检测和定位下一代T2V模型的局限性，特别是对于流形外提示。

英文摘要

The rapid advancement of photorealistic Text-to-Video (T2V) generation creates an urgent need for up-to-date evaluation methods. Existing benchmarks largely overlook implausible scenarios and do not measure audio-visual alignment. We introduce BRITE, the first framework that unifies (1) implausible prompting, (2) fine-grained assessment of audio-visual consistency, and (3) QA-based interpretable evaluation into a comprehensive T2V benchmark. Unlike fully automated Multimodal LLM-based pipelines, which are prone to hallucination and prompt ambiguity, BRITE guarantees reliability through a rigorous human-in-the-loop protocol for benchmark creation. Evaluating five state-of-the-art models (Sora 2, Veo 3.1, Runway Gen4.5, Pixverse V5.5, and Qwen3Max), we reveal a critical performance gap: while models excel at static object composition, they exhibit significant degradation in object-action binding and audio-visual synchronization. BRITE offers the community a reliable, interpretable benchmark and evaluation framework that can detect and locate limitations in the next generation of T2V models, especially for off-manifold prompts.

2606.14764 2026-06-16 cs.CV cs.DM 新提交

Avoiding Exponential Blow-Up in Distributive Lattice Submodular Minimization

避免分配格次模最小化中的指数爆炸

Ishant Shanu

AI总结针对分配格上次模函数最小化中因布尔格变换导致的空间指数膨胀问题，提出仅在分配格内工作的通用框架，显著提升运行效率。

2606.14963 2026-06-16 cs.CV cs.AI 新提交

Multi-Modal Attention for Automated Disaster Damage Assessment Using Remote Sensing Imagery and Deep Learning

基于遥感影像和深度学习的多模态注意力自动灾害损伤评估

Tewodros Syum Gebre, Jagrati Talreja, Leila Hashemi-Beni

发表机构 * Built Environment Department, College of Science and Technology, North Carolina A&T State University（北卡罗来纳农工州立大学科技学院建筑环境系）； United Nations University Institute for Water, Environment and Health（联合国大学水、环境与健康研究所）

AI总结提出一种多模态注意力机制融合双时相遥感影像的深度学习框架，实现建筑物损伤四分类（无/轻微/严重/毁坏），准确率达94.90%。

Comments This paper has been accepted for publication in ISPRS Congress 2026 and the 47th Canadian Symposium on Remote Sensing (CSRS 2026) Annals

AI中文摘要

及时准确的灾害损伤评估对于有效的应急响应、资源分配和恢复至关重要。传统方法通常依赖人工检查或稀疏数据，往往速度慢且易出错。本文介绍了一种利用遥感影像和深度学习自动化建筑损伤分类的新框架。使用灾前和灾后卫星影像，我们的模型将建筑物分为四个损伤等级：无损伤、轻微损伤、严重损伤和毁坏。核心创新是一种多模态注意力机制，融合双时相特征以显式检测和评估结构变化。我们采用轻量级ConvNeXT-Tiny骨干网络，确保高效处理而不牺牲性能。主要贡献包括：（1）用于多模态数据融合的交叉注意力模块，（2）针对大规模数据集的优化预处理流程，以及（3）鲁棒的数据增强技术。在大规模灾害数据集上的实验表明，总体分类准确率达到94.90%。该模型能有效区分损伤类别，并对不完整数据保持鲁棒性。本系统显著提高了评估速度和准确性，有助于应急响应人员优先安排干预措施。本研究通过将多时相影像与深度学习相结合，推进了自动化灾害损伤检测，为实时响应提供了可扩展的解决方案。

英文摘要

Timely and accurate disaster damage assessment is crucial for effective emergency response, resource allocation, and recovery. Traditional methods, which often rely on manual inspections or sparse data, are typically slow and error-prone. This paper introduces a novel framework leveraging remote sensing imagery and deep learning to automate building damage classification. Using pre- and post-disaster satellite imagery, our model categorizes buildings into four damage levels: no damage, minor damage, major damage, and destroyed. The core innovation is a multi-modal attention mechanism that fuses bi-temporal features to explicitly detect and assess structural changes. We employ a lightweight ConvNeXT-Tiny backbone to ensure efficient processing without compromising performance. Key contributions include: (1) a cross-attention module for multi-modal data fusion, (2) an optimized preprocessing pipeline for large-scale datasets, and (3) robust data augmentation techniques. Experiments on a large-scale disaster dataset demonstrate an overall classification accuracy of 94.90%. The model effectively discriminates between damage categories and remains resilient to incomplete data. This system significantly improves assessment speed and accuracy, aiding emergency responders in prioritizing interventions. This work advances automated disaster damage detection by integrating multi-temporal imagery with deep learning, offering a scalable solution for real-time response.
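
The cross-attention fusion named in the contributions can be illustrated with plain scaled dot-product attention. This is a generic sketch, not the paper's module: it assumes post-disaster features act as queries and pre-disaster features as keys and values, uses a single head with no learned projections, and all names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product cross-attention.

    queries: feature vectors from one image (assumed: post-disaster).
    keys, values: feature vectors from the other image (assumed: pre-disaster).
    Returns one fused vector per query.
    """
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, values))
                      for j in range(len(values[0]))])
    return fused
```

Each post-disaster location thus gathers the pre-disaster context most similar to it, which gives a downstream classifier a direct before-and-after comparison for the four damage levels.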

2606.15198 2026-06-16 cs.CV cs.HC 新提交

City landscape in sight: A crowdsourced framework for unlocking urban-scale window view perceptions from real estate imagery

城市景观在望：一种从房地产图像解锁城市尺度窗景感知的众包框架

Chucai Peng, Sijie Yang, Ang Liu, Yang Xiang, Zhixiang Zhou, Filip Biljecki

发表机构 * National University of Singapore（新加坡国立大学）

AI总结提出一种利用房地产平台真实窗景图像（WVI）进行大规模感知映射的方法，通过混合神经网络模型预测六维感知并分析空间分布，发现楼层高度和窗景组成（如天空、树木比例）对感知有非线性影响。

AI中文摘要

通过住宅窗户看到的城市景观影响生活质量，然而城市尺度上实际窗景的感知仍研究不足。本研究提出一种大规模感知映射方法，使用从中国武汉房地产平台收集的12,334张真实住宅窗景图像（WVI），这是一种罕见探索的城市景观图像形式，相比以往研究中常见的渲染或模拟窗景具有优势。通过非沉浸式虚拟现实平台，我们基于499张WVI从304名参与者收集了27,477对六维感知（如生动性）的比较。训练了一个混合神经网络模型来预测所有众包WVI的人类感知并绘制其空间分布。结果显示，整个城市存在显著的空间自相关，具有明显的热点和冷点。楼层高度强烈影响人类感知：较高楼层提供更受欢迎和更广阔的窗景，而较低楼层为居民提供安静和生动的视野。推理模型进一步表明，窗景组成至关重要：高比例的天空、树木和低层建筑增强人们的偏好和生动性感知，而高层建筑的高比例增加单调和压抑感。重要的是，这些影响是非线性的：某些元素的过度存在会改变其对人类感知的影响。这项工作推进了城市尺度上居民视觉体验的理解，并为以人为本的城市规划和房地产优化窗户视觉景观提供了基于证据的指导。

英文摘要

City landscapes viewed through home windows influence quality of life, yet perceptions of actual window views at the urban scale remain understudied. This study presents an approach for large-scale mapping of perceptions using 12,334 window view images (WVIs) collected from actual residential properties listed on real estate platforms in Wuhan, China, representing a rarely explored form of urban view imagery that offers advantages over the rendered or simulated window views commonly examined in previous studies. Through a non-immersive virtual reality platform, we collected 27,477 pairwise comparisons across six perceptual dimensions (e.g., Vivid) from 304 participants based on 499 WVIs. A hybrid neural network model was trained to predict human perceptions of all crowdsourced WVIs and map their spatial distribution. Results reveal significant spatial autocorrelation with distinct hot and cold spots across the whole city. Floor level strongly influences human perceptions: while higher floors offer more preferred and extensive window views, lower-floor windows provide residents with quiet and vivid views. An inference model further shows that window view composition matters considerably: high ratios of sky, trees, and low-rise buildings enhance people's preferences and perceptions of vividness, whereas high ratios of high-rise buildings increase perceptions of monotony and oppression. Importantly, these effects are non-linear: the excessive presence of certain elements can alter their impact on human perception. This work advances urban-scale understanding of residents' visual experiences and provides evidence-based guidance for human-centric urban planning and real estate to optimise visual landscapes from windows.

2606.15351 2026-06-16 cs.CV 新提交

Facial Affect Analysis for Service-Oriented Systems: Advances, Challenges, and Future Visions

面向服务系统的面部情感分析：进展、挑战与未来愿景

Spyridon Georgiou, Aggelos Psiris, Thomas Lagkas, Vasileios Argyriou, Panagiotis Sarigiannidis, Iraklis Varlamis, Georgios Th. Papadopoulos

发表机构 * International Hellenic University（国际希腊大学）； Democritus University of Thrace（色雷斯德谟克利特大学）； Kingston University London（伦敦金斯顿大学）； University of Western Macedonia（西马其顿大学）

AI总结本文从系统工程角度综述面部情感分析在服务导向软件生态系统中的进展，强调可组合性和可靠性需求，并指出基准性能提升不足以满足服务化部署，需兼顾鲁棒性、公平性、隐私性等运行时保障。

AI中文摘要

面部情感分析（FAA）正从独立的识别任务演变为服务导向软件生态系统（SoSE）中可复用的感知能力。本文保留了FAA的方法论核心，同时通过可组合和可靠服务的系统工程需求重新诠释近期进展。我们回顾了静态和动态表情分析、动作单元和微表情建模以及现代CNN、Transformer、图神经网络和混合架构的代表性进展，然后根据这些进展在边缘、云和混合服务管道中的操作适配性进行解读。综合强调决定可部署性的SoSE关注点：面向不确定性输出的服务契约、延迟和可用性包络、生命周期监控与重新校准、治理感知集成以及跨独立演化组件的互操作性。我们的分析表明，仅凭基准性能提升不足以满足SoSE就绪性；分布偏移下的鲁棒性、干预稳定性、公平性、隐私姿态和运行时保证同样关键。最后，我们提出了将FAA视为具有显式接口、可测量质量属性和可问责生命周期管理的操作服务组件的路线图。

英文摘要

Facial Affect Analysis (FAA) is evolving from a stand-alone recognition task into a reusable perception capability for Service-Oriented Software Ecosystems (SoSE). This paper preserves the FAA methodological core while reframing recent advances through systems-engineering requirements for composable and dependable services. We review representative progress in static and dynamic expression analysis, action-unit and micro-expression modeling, and modern CNN, Transformer, graph, and hybrid architectures, then interpret these advances by their operational fit in edge, cloud, and hybrid service pipelines. The synthesis emphasizes SoSE concerns that determine deployability: service contracts for uncertainty-aware outputs, latency and availability envelopes, lifecycle monitoring and recalibration, governance-aware integration, and interoperability across independently evolving components. Our analysis shows that benchmark gains alone are insufficient for SoSE readiness; robustness under shift, intervention stability, fairness, privacy posture, and runtime guarantees are equally critical. We conclude with a roadmap for treating FAA as an operational service component with explicit interfaces, measurable quality attributes, and accountable lifecycle management.

2606.16271 2026-06-16 cs.CV cs.LG 新提交

Contrastive Learning for Seismic Horizon Tracking with Domain-Specific Priors

基于领域先验的对比学习用于地震层位追踪

Alexandre Thouvenot, Lionel Boillot, Vincent Gripon

发表机构 * IMT Atlantique, LAB-STICC, UMR CNRS 6285（IMT Atlantique, LAB-STICC, CNRS 6285联合实验室）； TotalEnergies, OneTech（道达尔能源公司, OneTech）

AI总结提出自监督融合信号与纹理的方法，利用信号导出的局部层位对应作为领域先验训练纹理深度学习模型，通过对比学习保持层位身份，实现跨不连续面的层位追踪。

Comments 5 pages, 5 figures. Submitted to the IEEE GRSL for possible publication

AI中文摘要

无监督3D地震层位追踪面临一个关键限制：基于信号的传播器提供精确的迹级对齐，但在断层附近常失败，而纹理驱动的深度模型对不连续性更鲁棒，但通常以标记数据需求和降低迹级精度为代价。我们提出了一种自监督融合两种范式的方法，其中信号导出的局部层位对应作为领域先验来训练基于纹理的深度学习模型。具体来说，我们从反射体斜率估计可靠的迹间流，并将其用于形成对比目标中的正对，同时将训练限制在高置信度邻域，可选地使用断层掩码增强。目标不是推断不连续性附近的模糊对应，而是跨不连续性保持层位身份。结果，网络学习到体素级嵌入，保持局部信号连续性，同时通过相似性搜索实现跨不连续性的层位传播。在公共F3数据集和含断层合成数据集上的实验实现了比无监督基线更低的平均绝对误差（MAE），并且与使用单个标记切片的半监督方法性能相当。

英文摘要

Unsupervised 3D seismic horizon tracking faces a key limitation: signal-based propagators provide accurate trace-level alignment but often fail near faults, whereas texture-driven deep models are more robust to discontinuities, typically at the cost of labeled data requirements and reduced trace-level precision. We propose a self-supervised fusion of both paradigms in which signal-derived local horizon correspondences act as domain-specific priors to train a texture-based deep learning model. Specifically, we estimate reliable trace-to-trace flows from reflector slopes and use them to form positive pairs in a contrastive objective, while restricting training to high-confidence neighborhoods, optionally augmented with a fault mask. The objective is not to infer ambiguous correspondences close to discontinuities, but to preserve horizon identity across them. As a result, the network learns voxel-wise embeddings that preserve local signal continuity while enabling horizon propagation beyond discontinuities through similarity search. Experiments on the public F3 dataset and a faulted synthetic dataset achieve lower mean absolute error (MAE) than unsupervised baselines and competitive performance against a semi-supervised method using a single labeled slice.
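
A contrastive objective over positive pairs can be sketched with the common InfoNCE loss. The abstract says only "a contrastive objective", so the choice of InfoNCE, the cosine similarity, the temperature value, and the function names are assumptions. In the setting described, the anchor and positive would be embeddings of two voxels linked by a slope-derived trace-to-trace flow, and the negatives would be embeddings of voxels on other horizons.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor: -log softmax score of the positive pair."""
    logits = [cosine_similarity(anchor, positive) / temperature]
    logits += [cosine_similarity(anchor, n) / temperature for n in negatives]
    peak = max(logits)
    # Log-sum-exp with the maximum subtracted for numerical stability.
    log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_denominator - logits[0]
```

Minimizing this loss pulls embeddings on the same horizon together and pushes other horizons apart, so a similarity search in embedding space can later continue a horizon on the far side of a fault.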

2606.16837 2026-06-16 cs.CV cs.AI cs.SD 新提交

Robust Spoofed Speech Detection via Temporal Pyramid Modeling

基于时间金字塔建模的鲁棒语音伪造检测

Mahtab Masoudi Nezhad, Nima Karimian

发表机构 * Lane Department of Computer Science and Electrical Engineering, West Virginia University（西弗吉尼亚大学莱恩计算机科学与电气工程系）； Bellini College of Artificial Intelligence, Cybersecurity and Computing, University of South Florida（南佛罗里达大学贝利尼人工智能、网络安全与计算学院）

AI总结提出时间金字塔适配器，通过多尺度时间卷积捕获局部伪影和全局韵律异常，结合自监督XLS-R表示，在多个数据集上显著优于基线模型。

AI中文摘要

伪造语音检测日益受到逼真合成、语音转换和重放攻击的挑战，跨数据集泛化仍然是主要限制。本文提出时间金字塔适配器，利用具有不同感受野的并行时间卷积来捕获多尺度伪造线索，从局部伪影到全局韵律异常。我们还集成了自监督XLS-R表示，并结合前端适配器，包括Mel、Sinc和用于多尺度时间建模的时间金字塔设计。所提出的模型在多个基准上进行了评估，包括ASVspoof 2017、ASVspoof 2021 (DF/LA)、PartialSpoof、DiffSSD和多语言HQ-MPSD数据集。实验结果表明，时间金字塔模型在PartialSpoof数据库上获得了99.24%的AUC和3.87%的EER，显著优于基础模型和多个SOTA基线，如LCNN-BLSTM（9.87% EER）和TRACE（8.08% EER）。此外，多语言评估证实伪造伪影与语言无关。尽管自监督表示提高了鲁棒性，但在领域和语言偏移下性能仍会下降，凸显了需要更好的适应和校准策略。

英文摘要

Spoofed speech detection is increasingly challenged by realistic synthesis, voice conversion, and replay attacks, with cross-dataset generalization remaining a major limitation. In this work, we propose a Temporal Pyramid Adapter that utilizes parallel temporal convolutions with varying receptive fields to capture multi-scale spoofing cues, ranging from local artifacts to global prosodic irregularities. We also integrate self-supervised XLS-R representations combined with front-end adapters, including Mel, Sinc, and a Temporal Pyramid design for multi-scale temporal modeling. The proposed model is evaluated across multiple benchmarks, including ASVspoof 2017, ASVspoof 2021 (DF/LA), PartialSpoof, DiffSSD, and multilingual HQ-MPSD datasets. Experimental results demonstrate that the Temporal Pyramid model obtains an AUC of 99.24% and an EER of 3.87% on the PartialSpoof database, significantly outperforming the base model and several SOTA baselines such as LCNN-BLSTM (9.87% EER) and TRACE (8.08% EER). Additionally, multilingual evaluations confirm that spoofing artifacts are independent of language. While self-supervised representations improve robustness, performance degrades under domain and language shifts, highlighting the need for better adaptation and calibration strategies.
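
Since the results above are reported as equal error rate (EER), a short sketch of how that metric is typically computed may help. This is a simple threshold sweep, not an official scoring tool. It assumes higher scores mean "more likely bona fide", and it approximates the EER as the average of the false acceptance and false rejection rates at the threshold where the two are closest.

```python
def equal_error_rate(bonafide_scores, spoof_scores):
    """Approximate EER by sweeping every observed score as a threshold.

    A sample is accepted as bona fide when its score is >= the threshold.
    """
    thresholds = sorted(set(bonafide_scores) | set(spoof_scores))
    best_gap, eer = None, None
    for t in thresholds:
        # False rejection: genuine speech scored below the threshold.
        frr = sum(s < t for s in bonafide_scores) / len(bonafide_scores)
        # False acceptance: spoofed speech scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

An EER of 3.87% therefore means that, at the operating point where both error types are balanced, roughly 3.87% of spoofed clips are accepted and about the same share of genuine clips is rejected.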


2606.16870 2026-06-16 cs.CV cs.GR new submission

Latent Space Reinforcement Learning for Inverse Material Estimation in Food Fracture Simulation

潜空间强化学习用于食品断裂模拟中的逆材料估计

Adrian Ramlal, Yuhao Chen, John S. Zelek

Affiliations: University of Waterloo

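AI summary: Material parameters in food fracture simulation are hard to measure directly. This work proposes a goal-conditioned policy trained with latent-space reinforcement learning that estimates material parameters from a description of fracture behavior in a single forward pass, outperforming the original parameter space by 23%.

Comments Accepted in IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 MetaFood Workshop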

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 9573-9581

AI Chinese abstract

食品操作的真实视觉模拟需要精确的材料参数，但这些参数难以直接测量，且在单个食品的异质区域间变化。我们解决了从不可微的连续损伤力学模拟器中断裂行为的目标描述中估计材料参数的逆问题。以剥橙子为测试案例，我们在2000次正向模拟上训练神经代理，并比较协方差矩阵自适应进化策略（CMA-ES，一种无梯度进化优化器）与近端策略优化（PPO，一种强化学习算法）在原始9维参数空间和两个学习的4维潜表示上的表现。由于不同橙子具有不同的材料属性，实用的逆系统必须能够处理任意目标而无需重新训练。我们训练了一个目标条件PPO策略，该策略学习通用的逆映射：给定任意剥皮行为的目标描述，该策略在单次前向传递（8次代理评估，约10毫秒）中产生材料参数估计。在归一化流潜空间中使用共享代理评估器，目标条件策略通过模拟器验证时实现了0.642的实际恢复率，比原始参数空间高出23%。从策略输出初始化CMA-ES细化的热启动扩展进一步将恢复率提升至0.828，使用540次评估。这些发现为食品逆物理提供了实用框架，并为从食品操作的视频观测中通过视觉驱动识别材料奠定了基础。

English abstract

Realistic visual simulation of food manipulation requires accurate material parameters, yet these are difficult to measure directly and vary across the heterogeneous regions of a single food item. We address the inverse problem of estimating material parameters from a target description of fracture behavior in a non-differentiable continuum damage mechanics simulator. Using orange peeling as a test case, we train a neural surrogate on 2,000 forward simulations and compare Covariance Matrix Adaptation Evolution Strategy (CMA-ES, a gradient-free evolutionary optimizer) with Proximal Policy Optimization (PPO, a reinforcement learning algorithm) across the original 9-dimensional parameter space and two learned 4-dimensional latent representations. Since different oranges have different material properties, a practical inverse system must handle arbitrary targets without retraining. We train a goal-conditioned PPO policy that learns a general inverse mapping: given any target description of peeling behavior, the policy produces a material parameter estimate in a single forward pass (8 surrogate evaluations, approximately 10ms). Operating in a normalizing flow latent space with a shared surrogate evaluator, the goal-conditioned policy achieves 0.642 actual recovery when validated through the simulator, outperforming the original parameter space by 23%. A warm-start extension that initializes CMA-ES refinement from the policy's output further improves recovery to 0.828 with 540 evaluations. These findings provide a practical framework for inverse food physics and lay groundwork for vision-driven material identification from video observations of food manipulation.
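The warm-start idea in the abstract (take a fast policy estimate, then refine it with a gradient-free optimizer against a surrogate) can be sketched as follows. Everything here is illustrative: the two-parameter `surrogate` is a made-up stand-in for the neural surrogate, the starting point stands in for a policy output, and a simple elitist random search is substituted for CMA-ES to keep the example dependency-free.

```python
import random


def surrogate(params):
    """Hypothetical surrogate: maps 2 material parameters to a 2-number descriptor."""
    a, b = params
    return (a * a + b, a - b)


def loss(params, target):
    """Squared error between the predicted and the target descriptor."""
    return sum((p - t) ** 2 for p, t in zip(surrogate(params), target))


def refine(start, target, budget=200, sigma=0.1, seed=0):
    """Elitist random search from a warm start (a simple substitute for CMA-ES)."""
    rng = random.Random(seed)
    best, best_loss = list(start), loss(start, target)
    for _ in range(budget):
        candidate = [x + rng.gauss(0.0, sigma) for x in best]
        candidate_loss = loss(candidate, target)
        if candidate_loss < best_loss:  # keep a candidate only if it improves
            best, best_loss = candidate, candidate_loss
    return best, best_loss


if __name__ == "__main__":
    target = surrogate((0.8, 0.3))   # descriptor of the "true" material
    policy_guess = [0.5, 0.5]        # stands in for the policy's one-shot output
    refined, refined_loss = refine(policy_guess, target)
    print(loss(policy_guess, target), refined_loss, refined)
```

Because the search only accepts improvements, the refined estimate is never worse than the warm start under the surrogate; whether it also improves recovery in the real simulator is a separate question that the abstract answers with simulator validation.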


2606.16951 2026-06-16 cs.CV eess.IV new submission

Simulation-Based Multi-Fillet Evaluation of Woody Breast Poultry Fillets

基于仿真的多鸡胸肉木质化评估

Chirantan Sen Mukherjee, Seung-Chul Yoon, William J. Beksi

Affiliations: Department of Computer Science and Engineering, The University of Texas at Arlington; Quality and Safety Assessment Research Unit, U.S. National Poultry Research Center, USDA Agricultural Research Service

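AI summary: To address the throughput bottleneck of single-fillet inspection, proposes a top-down multi-fillet detection architecture that generates a dataset through physics-based simulation and extracts a 2D shape deformation score, enabling simultaneous evaluation of multiple fillets.

Comments To be published in the 2026 International Conference on Automation Science and Engineering (CASE)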

AI Chinese abstract

木质化鸡胸肉是现代肉鸡的一种肌病，导致胸肌异常僵硬和纤维化，降低肉质并造成重大经济损失。最先进的自动WB检测依赖于侧视成像系统，分析单个鸡胸肉从传送带落下时的弯曲行为。虽然高度准确，但该方法受限于单鸡胸肉视野，在商业加工线上造成吞吐量瓶颈。本文通过一种利用俯视相机配置的新型多鸡胸肉检测架构来解决这一限制。为验证我们的方法，首先开发了工业传送系统的高保真数字孪生。然后，合成多样化的3D鸡胸肉网格数据集，并使用基于物理的仿真引擎模拟其粘弹性弯曲动力学。最后，从俯视视角提取连续的二维形状变形分数，模拟鸡胸肉经过滚轮边缘的过程。实验结果表明，俯视形状分数有效捕捉鸡胸肉弯曲时的轮廓变化，为同时多鸡胸肉WB评估提供了鲁棒且可扩展的侧视成像系统替代方案。

English abstract

Woody breast (WB) is a myopathy in modern broiler chickens that causes the breast muscle to become unusually stiff and fibrous, leading to decreased meat quality and significant economic losses. State-of-the-art automated WB detection relies on a side-view imaging system to analyze the bending behavior of a single fillet as it falls off a conveyor belt. While highly accurate, this approach is constrained by its single-fillet field of view, creating throughput bottlenecks on commercial processing lines. In this paper, we address this limitation via a novel multi-fillet detection architecture utilizing a top-down camera configuration. To validate our approach, we first develop a high-fidelity digital twin of an industrial conveyor system. Next, we synthesize a diverse dataset of 3D fillet meshes and model their viscoelastic bending dynamics using a physics-based simulation engine. Lastly, a continuous 2D shape deformation score is extracted from the top-down perspective as the simulated fillets traverse the roller precipice. Experimental results demonstrate that the top-down shape score effectively captures the contour changes of the fillets as they bend, providing a robust and scalable alternative to a side-view imaging system for simultaneous multi-fillet WB evaluation.


2603.04592 2026-06-16 cs.CL cs.CV cross-list

From Static Inference to Dynamic Interaction: A Survey of Streaming Large Language Models

从静态推理到动态交互：流式大型语言模型综述

Junlong Tong, Zilong Wang, YuJie Ren, Peiran Yin, Hao Wu, Wei Zhang, Xiaoyu Shen

Affiliations: Shanghai Jiao Tong University; Institute of Digital Twin, Eastern Institute of Technology

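AI summary: Establishes a unified definition of streaming LLMs, proposes a systematic taxonomy, and surveys their methods, applications, and future directions.

Comments Accepted by ACL 2026 Findings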

AI Chinese abstract

标准大型语言模型（LLM）主要设计用于预定义输入的静态推理，这限制了它们在动态实时场景中的适用性。为解决这一差距，流式LLM范式应运而生。然而，现有流式LLM的定义仍然零散，混淆了流式生成、流式输入和交互式流式架构，且缺乏系统分类法。本文对流式LLM进行了全面概述和分析。首先，我们基于数据流和动态交互建立了流式LLM的统一定义，以澄清现有歧义。基于这一定义，我们提出了当前流式LLM的系统分类法，并对其底层方法进行了深入讨论。此外，我们探讨了流式LLM在现实场景中的应用，并概述了有前景的研究方向，以支持流式智能的持续进展。我们在以下网址维护一个持续更新的相关论文仓库：https://github.com/EIT-NLP/Awesome-Streaming-LLMs。

English abstract

Standard Large Language Models (LLMs) are predominantly designed for static inference with pre-defined inputs, which limits their applicability in dynamic, real-time scenarios. To address this gap, the streaming LLM paradigm has emerged. However, existing definitions of streaming LLMs remain fragmented, conflating streaming generation, streaming inputs, and interactive streaming architectures, while a systematic taxonomy is still lacking. This paper provides a comprehensive overview and analysis of streaming LLMs. First, we establish a unified definition of streaming LLMs based on data flow and dynamic interaction to clarify existing ambiguities. Building on this definition, we propose a systematic taxonomy of current streaming LLMs and conduct an in-depth discussion on their underlying methodologies. Furthermore, we explore the applications of streaming LLMs in real-world scenarios and outline promising research directions to support ongoing advances in streaming intelligence. We maintain a continuously updated repository of relevant papers at https://github.com/EIT-NLP/Awesome-Streaming-LLMs.


2606.14750 2026-06-16 eess.AS cs.AI cs.CV cs.SD cross-list

Pixel-TTS: Image based Text Rendering for Robust Text-to-Speech

Pixel-TTS: 基于图像的文字渲染实现鲁棒文本转语音

Adarsh Arigala, Arjun Gangwar, S Umesh, Yova Kementchedjhieva

Affiliations: SPRING Lab, Indian Institute of Technology, Madras, India; MBZUAI, UAE

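AI summary: Proposes Pixel-TTS, a framework that renders text as images and generates embeddings through 2D convolution, eliminating embedding matrix expansion, improving robustness to unseen characters and orthographic variations, and enabling zero-shot generalization.

Comments 5 pages, 4 figures, 4 tables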

AI Chinese abstract

近期基于像素的文本建模进展表明，将文本表示为图像能使模型利用视觉线索进行语言理解。将文本锚定在其视觉形式上，允许具有不同Unicode编码的结构相似字符产生相似的嵌入，从而有益于跨语言和零样本场景。传统的基于文本的方法独立处理每个字符，限制了向未见字符的泛化，并在跨语言适应时需要嵌入扩展。我们提出Pixel-TTS，首个视觉接地语音合成框架。它将文本渲染为图像，并通过2D卷积层投影以生成嵌入。这种设计在微调过程中消除了嵌入矩阵扩展，同时提高了对未见字符和拼写变体的鲁棒性。大量实验表明，Pixel-TTS在强基线上实现了有竞争力的性能、更快的收敛和鲁棒的零样本泛化。

English abstract

Recent advances in pixel-based text modeling show that representing text as images enables models to exploit visual cues for language understanding. Grounding text in its visual form allows structurally similar characters with different Unicode encodings to produce similar embeddings, benefiting cross-lingual and zero-shot scenarios. Conventional text-based approaches treat each character independently, limiting generalization to unseen characters and requiring embedding expansion during cross-lingual adaptation. We propose Pixel-TTS, the first framework for visually grounded speech synthesis. It renders text as images and projects them through a 2D convolutional layer to generate embeddings. This design eliminates embedding matrix expansion during fine-tuning while improving robustness to unseen characters and orthographic variations. Extensive experiments show Pixel-TTS achieves competitive performance with strong baselines, faster convergence and robust zero-shot generalization.
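The claim that structurally similar characters with different Unicode encodings produce similar embeddings can be made concrete with a toy sketch. The 3x3 glyph bitmaps, the two fixed filters, and the function names below are all hypothetical; a real system would rasterize text with a font renderer and learn its convolution weights.

```python
# Hypothetical 3x3 glyph bitmaps. Latin "A" (U+0041) and Cyrillic "А" (U+0410)
# are drawn identically, as they are in most fonts.
GLYPHS = {
    "A": [[0, 1, 0], [1, 1, 1], [1, 0, 1]],
    "\u0410": [[0, 1, 0], [1, 1, 1], [1, 0, 1]],
    "L": [[1, 0, 0], [1, 0, 0], [1, 1, 1]],
}

# Two fixed 2x2 filters standing in for learned convolution kernels.
FILTERS = [
    [[1, 0], [0, 1]],
    [[0, 1], [1, 0]],
]


def conv2d_valid(image, kernel):
    """Plain 2D cross-correlation without padding."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(kernel[i][j] * image[r + i][c + j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out


def embed(char):
    """Embedding = flattened feature maps of the rendered glyph."""
    features = []
    for kernel in FILTERS:
        for row in conv2d_valid(GLYPHS[char], kernel):
            features.extend(row)
    return features


if __name__ == "__main__":
    print(embed("A"))
    print(embed("\u0410") == embed("A"))
    print(embed("L") == embed("A"))
```

A lookup table indexed by code point would need separate rows for U+0041 and U+0410, and a new row for every unseen character; here the embedding depends only on the pixels, so no table has to grow.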


2606.14808 2026-06-16 eess.IV cs.CV cs.IT math.IT cross-list

计算机视觉中的推理：分类、模型、任务与方法论

Ayushman Sarkar, Zhenyu Yu, Mohd Yamani Idna Idris

Affiliations: Department of Computer Science and Engineering, Birbhum Institute of Engineering and Technology; College of Computer Science and Artificial Intelligence, Fudan University; Faculty of Computer Science and Information Technology, Universiti Malaya

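AI summary: Provides a systematic taxonomy of reasoning in computer vision covering relational, symbolic, temporal, causal, and commonsense reasoning, surveys implementations from graph-based models to multimodal large language models along with evaluation protocols, identifies open challenges, and sets out future research directions.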

AI Chinese abstract

视觉推理对于许多超越表面级物体检测和分类的计算机视觉任务至关重要。尽管在关系、符号、时间、因果和常识推理方面取得了进展，但现有综述通常只涵盖问题的一部分，例如视觉问答、场景图生成、神经符号AI或多模态思维链，很少同时分析推理类型、方法论和评估协议。本综述填补了这一空白。通过结构化文献回顾，我们将视觉推理分为五大类型（关系、符号、时间、因果和常识），并考察每种类型如何在从基于图的模型、记忆网络、注意力机制、神经符号系统到视觉语言模型（VLM）和多模态大语言模型（MLLM）的方法中实现，包括视觉思维链、视觉编程、工具增强和测试时推理。然后，我们回顾了功能正确性、结构一致性和因果有效性的评估协议，并分析了它们在泛化性、可重复性、忠实性和解释性方面的局限性。我们还识别了开放挑战：扩展到复杂场景、更深入地整合符号和神经范式、缺乏全面基准、基础模型中的语言先验捷径和幻觉，以及弱监督下的推理。最后，我们为视觉系统设定了一个研究议程，并认为连接感知和推理对于透明、可信和跨领域模型是必要的，特别是在自动驾驶和医疗诊断等高风险场景中。

English abstract

Visual reasoning matters for many computer vision tasks that go beyond surface-level object detection and classification. Despite progress in relational, symbolic, temporal, causal, and commonsense reasoning, existing surveys typically cover only one part of the problem, such as visual question answering, scene-graph generation, neuro-symbolic AI, or multimodal chain-of-thought, and rarely analyze reasoning types, methodologies, and evaluation protocols together. This survey addresses that gap. Following a structured literature review, we group visual reasoning into five major types (relational, symbolic, temporal, causal, and commonsense) and examine how each is implemented across methods that range from graph-based models, memory networks, attention mechanisms, and neuro-symbolic systems to reasoning with vision-language models (VLMs) and multimodal large language models (MLLMs), including visual chain-of-thought, visual programming, and tool-augmented and test-time reasoning. We then review evaluation protocols for functional correctness, structural consistency, and causal validity, and we analyze their limits in generalizability, reproducibility, faithfulness, and explanatory power. We also identify open challenges: scaling to complex scenes, integrating symbolic and neural paradigms more deeply, the shortage of comprehensive benchmarks, language-prior shortcuts and hallucination in foundation models, and reasoning under weak supervision. Finally, we set out a research agenda for vision systems and argue that connecting perception and reasoning is necessary for transparent, trustworthy, and cross-domain models, especially in high-stakes settings such as autonomous driving and medical diagnostics.


2508.17254 2026-06-16 cs.CV cs.AI replacement

A biological vision inspired framework for machine perception of abutting grating illusory contours

一种受生物视觉启发的机器感知对接光栅错觉轮廓框架

Xiao Zhang, Kai-Fu Yang, Xian-Shi Zhang, Hong-Zhi You, Hong-Mei Yan, Yong-Jie Li

Affiliations: Sichuan Cancer Hospital & Institute, School of Life Science and Technology, University of Electronic Science and Technology of China

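AI summary: Proposes ICPNet, a network inspired by the visual cortex that uses multi-scale feature projection, feature interaction attention, and edge fusion modules to substantially improve perception of abutting grating illusory contours.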

AI Chinese abstract

更高层次的机器智能需要与人类感知和认知对齐。深度神经网络（DNN）主导的机器智能在各种现实任务中表现出色。然而，最近证据表明，DNN无法感知如对接光栅这样的错觉轮廓，这与人类感知模式不一致。与以往工作不同，我们提出了一种受视觉皮层电路启发的新型深度网络，称为错觉轮廓感知网络（ICPNet）。在ICPNet中，设计了多尺度特征投影（MFP）模块以提取多尺度表示。为了增强前馈和反馈特征之间的交互，引入了特征交互注意力模块（FIAM）。此外，受人类感知中形状偏见的启发，通过边缘融合模块（EFM）进行的边缘检测任务注入了形状约束，引导网络关注前景。我们在现有的AG-MNIST测试集和本文构建的AG-Fashion-MNIST测试集上评估了我们的方法。综合实验结果表明，ICPNet对对接光栅错觉轮廓的敏感度显著高于最先进模型，在各个子集上的top-1准确率均有显著提升。这项工作有望使基于DNN的模型向人类级智能迈进一步。

English abstract

Higher levels of machine intelligence demand alignment with human perception and cognition. Machine intelligence dominated by deep neural networks (DNNs) has demonstrated exceptional performance across various real-world tasks. Nevertheless, recent evidence suggests that DNNs fail to perceive illusory contours such as the abutting grating, a discrepancy with human perception patterns. Departing from previous works, we propose a novel deep network called the illusory contour perception network (ICPNet), inspired by the circuits of the visual cortex. In ICPNet, a multi-scale feature projection (MFP) module is designed to extract multi-scale representations. To boost the interaction between feedforward and feedback features, a feature interaction attention module (FIAM) is introduced. Moreover, drawing inspiration from the shape bias observed in human perception, an edge detection task conducted via the edge fusion module (EFM) injects shape constraints that guide the network to concentrate on the foreground. We assess our method on the existing AG-MNIST test set and the AG-Fashion-MNIST test sets constructed in this work. Comprehensive experimental results reveal that ICPNet is significantly more sensitive to abutting grating illusory contours than state-of-the-art models, with notable improvements in top-1 accuracy across various subsets. This work is expected to be a step towards human-level intelligence for DNN-based models.


2602.08029 2026-06-16 gr-qc astro-ph.IM cs.CV replacement

Dynamic Black-hole Emission Tomography with Physics-informed Neural Fields

基于物理信息神经场的动态黑洞发射断层成像

Berthy T. Feng, Andrew A. Chael, David Bromley, Aviad Levis, William T. Freeman, Katherine L. Bouman

Affiliations: Caltech; MIT; NSF IAIFI; Princeton University; Niels Bohr International Academy; University of Toronto

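AI summary: Proposes PI-DEF, which uses differentiable neural rendering to jointly reconstruct a 4D emissivity field and a 3D velocity field from EHT measurements, incorporates physics as a soft constraint, and significantly outperforms existing methods on simulated data.

Comments CVPR 2026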

AI Chinese abstract

随着静态黑洞成像的成功，下一个前沿是黑洞的动态和三维成像。恢复黑洞附近的动态三维气体将揭示宇宙中以前未见的部分，并为新的物理模型提供信息。然而，只有从单一视角进行的稀疏射电测量是可能的，这使得动态三维重建问题严重不适定。此前，BH-NeRF通过假设气体的开普勒动力学来解决不适定问题，但这种假设在黑洞附近失效，因为黑洞的强大引力吸引和增强的电磁活动使流体动力学复杂化。为了克服BH-NeRF的限制性假设，我们提出了PI-DEF，一种基于物理信息的方法，使用可微神经渲染根据EHT测量拟合4D（时间+3D）发射率场。我们的方法联合重建3D速度场与4D发射率场，并将速度作为发射率动力学的软约束。在模拟数据上的实验中，我们发现与BH-NeRF和物理无关方法相比，重建精度显著提高。我们展示了我们的方法如何用于估计黑洞的其他物理参数，例如其自旋。

English abstract

With the success of static black-hole imaging, the next frontier is the dynamic and 3D imaging of black holes. Recovering the dynamic 3D gas near a black hole would reveal previously-unseen parts of the universe and inform new physics models. However, only sparse radio measurements from a single viewpoint are possible, making the dynamic 3D reconstruction problem significantly ill-posed. Previously, BH-NeRF addressed the ill-posed problem by assuming Keplerian dynamics of the gas, but this assumption breaks down near the black hole, where the strong gravitational pull of the black hole and increased electromagnetic activity complicate fluid dynamics. To overcome the restrictive assumptions of BH-NeRF, we propose PI-DEF, a physics-informed approach that uses differentiable neural rendering to fit a 4D (time + 3D) emissivity field given EHT measurements. Our approach jointly reconstructs the 3D velocity field with the 4D emissivity field and enforces the velocity as a soft constraint on the dynamics of the emissivity. In experiments on simulated data, we find significantly improved reconstruction accuracy over both BH-NeRF and a physics-agnostic approach. We demonstrate how our method may be used to estimate other physics parameters of the black hole, such as its spin.


2604.16592 2026-06-16 cs.RO cs.AI cs.CV cs.ET replacement

Human Cognition in Machines: A Unified Perspective of World Models

机器中的人类认知：世界模型的统一视角

Timothy Rupprecht, Pu Zhao, Amir Taherin, Arash Akbari, Arman Akbari, Yumei He, Tooba Imtiaz, Sean Duffy, Juyi Lin, Yixiao Chen, Rahul Chowdhury, Enfu Nan, Yixin Shen, Yifan Cao, Haochen Zeng, Weiwei Chen, Geng Yuan, Jennifer Dy, Sarah Ostadabbas, Xuan Zhang, David Kaeli, Edmund Yeh, Yanzhi Wang

Affiliations: Northeastern University; EmbodyX Inc.; Tulane University; Cornell University; University of Georgia

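AI summary: Proposes a unified framework that integrates cognitive functions such as memory and perception, finds that motivation and metacognition remain under-researched, and introduces epistemic world models as a new category.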

AI Chinese abstract

本报告通过区分先前工作在认知功能上的创新来审视世界模型。许多工作声称其世界模型具有近乎人类般的认知能力。评估这些主张需要基于人类和机器认知理论的第一原理。在迈向类人世界模型的过程中，我们提出了一个概念性的统一框架，该框架完全整合了所有认知功能（即记忆、感知、语言、推理、想象、动机和元认知），并指出现有研究的空白，以指导未来技术的发展。特别是，我们发现动机（尤其是内在动机）和元认知仍然严重研究不足，并提出了基于主动推理和全局工作空间理论的具体方向来解决这些空白。我们还引入了认知世界模型，这是一个新的类别，涵盖在结构化知识上运行的科学发现代理框架。我们的分类法应用于视频、具身和认知世界模型，提出了先前分类法未涉及的研究方向。

English abstract

This report on world models distinguishes prior works by the cognitive functions in which they innovate. Many works claim an almost human-like cognitive capability in their world models. Evaluating these claims requires a proper grounding in first principles from human and machine cognition theory. In moving towards human-like world models, we present a unified conceptual framework for world models that fully incorporates all the cognitive functions (i.e., memory, perception, language, reasoning, imagining, motivation, and metacognition) and identify gaps in existing research as a guide for the future state of the art. In particular, we find that motivation (especially intrinsic motivation) and metacognition remain drastically under-researched, and we propose concrete directions to address these gaps informed by active inference and global workspace theory. We also introduce epistemic world models, a new category encompassing agent frameworks for scientific discovery that operate over structured knowledge. Our taxonomy, applied to video, embodied, and epistemic world models, suggests research directions that prior taxonomies have not.


2605.05372 2026-06-16 cs.CV cs.AI replacement

Two Steps Are All You Need: Efficient 3D Point Cloud Anomaly Detection with Consistency Models

两步即可：基于一致性模型的高效3D点云异常检测

Pranav A, Shashank B, Pranav Siddappa, Dominik Seuss, Minal Moharir, Subramanya KN

Affiliations: R.V. College of Engineering; Technical University of Applied Sciences Würzburg-Schweinfurt

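AI summary: Proposes a reconstruction-based anomaly detection method built on consistency learning that simplifies inference to one or two network evaluations, enabling low-latency 3D point cloud anomaly detection suited to resource-constrained devices.

Comments Accepted to CVPR 2026, at the 9th Workshop on Efficient Deep Learning for Computer Vision (ECV). To be published in the IEEE/CVF CVPR 2026 Workshop Proceedings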

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 3479-3487

AI Chinese abstract

扩散模型正在重新定义3D点云数据中的异常检测。随着3D传感成为现代制造的关键，可靠的异常检测对于高吞吐量的质量保证和过程控制至关重要。然而，在资源受限且延迟敏感的系统中，实际部署仍然有限。现有方法往往在复杂未遮挡区域计算上不可行或不可靠，而扩散管道本质上受限于迭代去噪。在本文中，我们通过一致性学习重构基于重建的异常检测，使能够在一次或两次网络评估中直接预测无异常几何。我们进一步引入了一种新的混合损失公式，明确强制重建至干净数据。这种设计显著降低了推理成本，达到比当前最先进方法快80倍的运行时间，无需GPU加速，同时保持强大的检测性能。它在Anomaly-ShapeNet上以76.20%的I-AUROC优于R3D-AD，在Real3DAD上以72.80%的I-AUROC保持竞争力，使在资源受限平台上实现高效、低延迟的异常检测成为可能，包括无人机、智能工业相机和其他边缘设备。

English abstract

Diffusion models are rapidly redefining 3D anomaly detection in point cloud data. As 3D sensing becomes integral to modern manufacturing, reliable anomaly detection is essential for high-throughput quality assurance and process control. Yet practical deployment on resource-constrained, latency-critical systems remains limited. Existing methods are often computationally prohibitive or unreliable in complex, unmasked regions, and diffusion pipelines are inherently bottlenecked by iterative denoising. In this work, we address this bottleneck by reformulating reconstruction-based anomaly detection through consistency learning, enabling direct prediction of anomaly-free geometry in one or two network evaluations. We further introduce a novel hybrid loss formulation that explicitly enforces reconstruction toward clean data. This design substantially reduces inference cost, achieving up to 80x faster runtime than the current state-of-the-art method, without GPU acceleration, while preserving strong detection performance. It outperforms R3D-AD on Anomaly-ShapeNet with 76.20% I-AUROC and remains competitive on Real3DAD with 72.80% I-AUROC, enabling efficient, low-latency anomaly detection on resource-constrained platforms, including drones, smart industrial cameras, and other edge devices.
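Reconstruction-based detection compares a scanned object with the anomaly-free geometry predicted for it. The sketch below shows one common way to turn that comparison into scores. It is an assumption, not the paper's scoring rule: the abstract does not say how scores are computed, so nearest-neighbor distance per point and max-pooling per object are chosen here for illustration, and the reconstruction is hard-coded instead of coming from a consistency model.

```python
import math


def point_scores(points, reconstruction):
    """Per-point score: distance to the nearest reconstructed point."""
    return [min(math.dist(p, q) for q in reconstruction) for p in points]


def object_score(points, reconstruction):
    """Object-level score: the largest per-point deviation (assumed pooling)."""
    return max(point_scores(points, reconstruction))


if __name__ == "__main__":
    scanned = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.0, 5.0)]  # last point bulges out
    clean = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]  # stand-in for the predicted clean geometry
    print(point_scores(scanned, clean))
    print(object_score(scanned, clean))
```

The scoring step is cheap compared with reconstruction, which is why reducing the reconstruction to one or two network evaluations dominates the runtime savings the abstract reports.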


2603.24724 2026-06-16 cs.CV cs.AI replacement

Is Geometry Enough? An Evaluation of Landmark-Based Gaze Estimation

几何足够吗？基于标记的注视估计评估

Daniele Agostinelli, Thomas Agostinelli, Andrea Generosi, Maura Mengoni

Affiliations: Department of Industrial Engineering and Mathematical Sciences, Università Politecnica delle Marche; Department of Science and Information Technology, Università Pegaso

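AI summary: Evaluates gaze estimation based on facial landmarks, using a standardized pipeline to extract and normalize landmarks from three large-scale datasets and training lightweight regression models. In cross-domain evaluation the models are comparable to ResNet18 baselines, suggesting that sparse geometric features can support robust gaze estimation.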

DOI: 10.1109/ACCESS.2026.3696778

AI Chinese abstract

基于外观的注视估计通常依赖深度卷积神经网络（CNNs）。这些模型准确但计算成本高且作为“黑箱”，可解释性差。基于面部标记的几何方法是轻量级替代方案，但其性能限制和泛化能力在现代基准中仍待探索。本文全面评估了基于标记的注视估计，引入标准化流程提取和归一化三个大型数据集（Gaze360、ETH-XGaze、GazeGene）的标记，并训练轻量级回归模型，具体为极端梯度提升树和两种神经架构：整体多层感知机（MLP）和设计捕捉双眼几何的孪生MLP。发现基于标记的模型在领域内评估表现较低，可能由于数据集中的标记检测噪声引入。然而，在跨域评估中，所提出的MLP架构的泛化能力与ResNet18基线相当。这些发现表明稀疏几何特征编码了足够的信息以支持鲁棒的注视估计，为高效、可解释且隐私友好的边缘应用铺平了道路。源代码和生成的基于标记的数据集可在https://github.com/daniele-agostinelli/LandmarkGaze.git获取。

English abstract

Appearance-based gaze estimation frequently relies on deep Convolutional Neural Networks (CNNs). These models are accurate, but computationally expensive and act as "black boxes", offering little interpretability. Geometric methods based on facial landmarks are a lightweight alternative, but their performance limits and generalization capabilities remain underexplored in modern benchmarks. In this study, we conduct a comprehensive evaluation of landmark-based gaze estimation. We introduce a standardized pipeline to extract and normalize landmarks from three large-scale datasets (Gaze360, ETH-XGaze, and GazeGene) and train lightweight regression models, specifically Extreme Gradient Boosted trees and two neural architectures: a holistic Multi-Layer Perceptron (MLP) and a siamese MLP designed to capture binocular geometry. We find that landmark-based models exhibit lower performance in within-domain evaluation, likely due to noise introduced into the datasets by the landmark detector. Nevertheless, in cross-domain evaluation, the proposed MLP architectures show generalization capabilities comparable to those of ResNet18 baselines. These findings suggest that sparse geometric features encode sufficient information for robust gaze estimation, paving the way for efficient, interpretable, and privacy-friendly edge applications. The source code and generated landmark-based datasets are available at https://github.com/daniele-agostinelli/LandmarkGaze.git.


2106.14490 2026-06-16 cs.CV replacement

Making Images Real Again: A Comprehensive Survey on Deep Image Composition

让图像重现真实：深度图像合成的全面综述

Li Niu, Wenyan Cong, Liu Liu, Yan Hong, Bo Zhang, Jing Liang, Liqing Zhang

Affiliations: MOE Key Lab of Artificial Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University

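AI summary: Surveys the sub-tasks and combined task of deep image composition, summarizes existing methods, datasets, and evaluation metrics, and contributes libcom, the first image composition toolbox.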

AI Chinese abstract

作为常见的图像编辑操作，图像合成（也称为对象/主体插入/添加/合成）旨在将一幅图像的前景与另一幅背景图像结合，生成合成图像。然而，许多问题可能导致合成图像不真实。这些问题可以总结为前景与背景之间的不一致，包括外观不一致、几何不一致和语义不一致。图像合成任务可以分解为多个子任务，每个子任务针对一个或多个问题。具体而言，对象放置旨在为前景找到合理的尺度、位置和形状。图像融合旨在解决前景与背景之间的不自然边界。图像调和旨在调整前景的光照统计。阴影（或反射）生成旨在为前景生成合理的阴影（或反射）。这些子任务可以按顺序或并行执行以获得逼真的合成图像。据我们所知，目前没有关于图像合成的先前综述。在本文中，我们对图像合成的子任务和综合任务进行了全面综述。对于每个任务，我们总结了现有方法、可用数据集和常见评估指标。图像合成的数据集和代码汇总在https://github.com/bcmi/Awesome-Object-Insertion。我们还贡献了首个图像合成工具箱：libcom https://github.com/bcmi/libcom，它集成了10多个与图像合成相关的功能。该工具箱的最终目标是通过简单的`import libcom`解决所有图像合成问题。基于libcom工具箱，我们还开发了一个在线图像合成工作台https://libcom.ustcnewly.com。

English abstract

As a common image editing operation, image composition/compositing, which is also called object/subject insertion/addition/compositing, aims to combine the foreground from one image and another background image to produce a composite image. However, there are many issues that could make the composite images unrealistic. These issues can be summarized as the inconsistency between foreground and background, which includes appearance inconsistency, geometry inconsistency, and semantic inconsistency. The image composition task can be decomposed into multiple sub-tasks, each of which targets one or more issues. Specifically, object placement aims to find a reasonable scale, location, and shape for the foreground. Image blending aims to address the unnatural boundary between foreground and background. Image harmonization aims to adjust the illumination statistics of the foreground. Shadow (resp., reflection) generation aims to generate a plausible shadow (resp., reflection) for the foreground. These sub-tasks can be executed sequentially or in parallel to acquire realistic composite images. To the best of our knowledge, there is no previous survey on image composition. In this paper, we conduct a comprehensive survey of the sub-tasks and the combined task of image composition. For each one, we summarize the existing methods, available datasets, and common evaluation metrics. Datasets and code for image composition are summarized at https://github.com/bcmi/Awesome-Object-Insertion. We have also contributed the first image composition toolbox: libcom https://github.com/bcmi/libcom, which assembles 10+ image-composition-related functions. The ultimate goal of this toolbox is to solve all image composition problems with a simple `import libcom`. Based on the libcom toolbox, we also develop an online image composition workbench https://libcom.ustcnewly.com.


2511.00352 2026-06-16 cs.CV cs.AI replacement

Detecting AI-Generated Images via Diffusion Snap-Back Reconstruction: A Forensic Approach

通过扩散快回重建检测AI生成图像：一种取证方法

Mohd Ruhul Ameen, Akif Islam

Affiliations: Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology

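AI summary: Detects AI-generated images from how an image responds when a diffusion model reconstructs it, using LPIPS and related metrics to measure how closely the image matches the model's denoising behavior. Experiments report an AUROC of 0.993 under five-fold cross-validation.

Comments Accepted at the 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence and Networking (QPAIN 2026)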

DOI: 10.1109/QPAIN69676.2026.11545865
Journal ref: 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN)

AI Chinese abstract

生成图像模型的快速发展使数字媒体发生了变革，使得人类观察者或许多传统检测方法难以可靠地区分AI生成图像和真实照片。现代文本到图像系统如Stable Diffusion和DALL E能够生成极其逼真的图像，使其看起来完全自然，留下很少或没有传统深度伪造检测器可以依赖的可见伪影。这一挑战对虚假信息控制、机构身份验证和政治和法律领域中的数字信任有实际影响。我们不搜索隐藏的像素级痕迹，而是观察图像在被轻微扰动和由扩散模型重建时的反应。我们称之为扩散快回。通过跟踪不同重建强度下感知相似性度量（LPIPS、SSIM和PSNR）的变化，我们捕捉到紧凑且可解释的信号，揭示图像与扩散模型学习的去噪行为的接近程度。在包含4000张人类和AI生成图像的平衡数据集上评估，所提出的方法在分层五折交叉验证中达到AUROC 0.993，在使用仅逻辑回归的测试集上达到0.990。初步的鲁棒性测试显示，该方法在常见的现实世界失真如图像压缩和添加噪声下仍保持稳定。虽然我们的实验使用单一扩散主干进行，但结果表明，重建行为可以作为合成媒体检测的可靠且可扩展的基础，随着生成模型变得越来越逼真。

English abstract

The rapid advancement of generative image models has transformed digital media to the point where AI-generated images can no longer be reliably distinguished from authentic photographs by human observers or many conventional detection methods. Modern text-to-image systems such as Stable Diffusion and DALL-E can now generate images so realistic that they often appear completely natural, leaving few or no visible artifacts for traditional deepfake detectors to rely on. This challenge has practical consequences for misinformation control, institutional identity verification, and digital trust in political and legal contexts. Instead of searching for hidden pixel-level traces, we take a different approach: we observe how an image responds when it is gently disturbed and reconstructed by a diffusion model. We call this behavior diffusion snap-back. By tracking how perceptual similarity measures (LPIPS, SSIM, and PSNR) change across different reconstruction strengths, we capture compact and interpretable signals that reveal how closely an image aligns with the diffusion model's learned denoising behavior. Evaluated on a balanced dataset of 4,000 human and AI-generated images, the proposed method achieves an AUROC of 0.993 under stratified five-fold cross-validation and 0.990 on a holdout split using only logistic regression. Initial robustness tests show that the method remains stable under common real-world distortions such as image compression and added noise. Although our experiments were conducted using a single diffusion backbone, the results indicate that reconstruction behavior can serve as a reliable and scalable foundation for synthetic media detection as generative models continue to grow more realistic.
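The feature extraction step, tracking a similarity measure across reconstruction strengths, can be sketched with PSNR alone. This is a simplified illustration under stated assumptions: `toy_reconstruct` is a hypothetical placeholder for the diffusion model's perturb-and-denoise pass, images are flat lists of pixel values, and LPIPS and SSIM are omitted because they need external models or more code.

```python
import math


def psnr(original, reconstructed, max_val=255.0):
    """Peak signal-to-noise ratio in dB: 10 * log10(MAX^2 / MSE)."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return float("inf")
    return 10.0 * math.log10(max_val ** 2 / mse)


def toy_reconstruct(image, strength):
    """Placeholder for a diffusion reconstruction: pulls pixels toward the mean."""
    mean = sum(image) / len(image)
    return [p + strength * (mean - p) for p in image]


def snap_back_curve(image, reconstruct, strengths=(0.2, 0.5, 0.8)):
    """One PSNR value per reconstruction strength; usable as a feature vector."""
    return [psnr(image, reconstruct(image, s)) for s in strengths]


if __name__ == "__main__":
    image = [0.0, 100.0, 200.0, 50.0]
    print(snap_back_curve(image, toy_reconstruct))
```

In the approach the abstract describes, curves like this one (computed for several similarity measures) would be the inputs to a logistic regression classifier; the classifier itself is not shown here.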


2510.23785 2026-06-16 cs.CV cs.AI replacement

CountFormer: A Transformer Framework for Learning Visual Repetition and Structure in Class-Agnostic Object Counting

CountFormer：一种用于学习类无关物体计数中视觉重复和结构的Transformer框架

Md Tanvir Hossain, Akif Islam, Mohd Ruhul Ameen

Affiliations: University of California, Berkeley

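AI summary: CountFormer uses DINOv2 features and positional embeddings to study structural consistency in exemplar-free object counting and achieves competitive performance on FSC-147.

Comments Accepted at the 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence and Networking (QPAIN 2026)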

DOI: 10.1109/QPAIN69676.2026.11546546
Journal ref: 2026 IEEE 2nd International Conference on Quantum Photonics, Artificial Intelligence & Networking (QPAIN)

AI Chinese abstract

人类通常通过观察视觉重复和组成来计数不熟悉的物体，而非仅依赖物体类别。然而，许多无示例计数模型在这种情况下表现不佳，当物体包含对称组件、重复子结构或部分遮挡时可能过度计数。我们引入了CountFormer，这是一种受CounTR启发的密度回归框架的受控适应，其中图像编码器被自监督视觉基础模型DINOv2取代。所得的Transformer特征与显式的二维位置嵌入结合，并通过轻量级卷积网络解码，以生成密度图，其积分给出最终计数。我们的目标不是提出新的计数架构，而是研究在严格无示例设置下，基于基础模型的表示是否能提高结构一致性。在FSC-147上，CountFormer在官方基准上实现了有竞争力的表现（MAE 19.06，RMSE 118.45）。定性分析表明，对于某些结构复杂的物体，部件层面的过度计数错误更少，而总体误差与先前方法大致一致。敏感性分析显示，评估指标强烈受少量极端高密度场景的影响。总体而言，结果突显了表示质量在无示例物体计数中的作用。

English abstract

Humans can often count unfamiliar objects by observing visual repetition and composition, rather than relying only on object categories. However, many exemplar-free counting models struggle in such situations and may overcount when objects contain symmetric components, repeated substructures, or partial occlusion. We introduce CountFormer, a controlled adaptation of a density-regression framework inspired by CounTR, where the image encoder is replaced with the self-supervised vision foundation model DINOv2. The resulting transformer features are combined with explicit two-dimensional positional embeddings and decoded by a lightweight convolutional network to produce a density map whose integral gives the final count. Our goal is not to propose a new counting architecture, but to study whether foundation-based representations improve structural consistency under a strictly exemplar-free setting. On FSC-147, CountFormer achieves competitive performance under the official benchmark (MAE 19.06, RMSE 118.45). Qualitative analysis suggests fewer part-level overcounting errors for some structurally complex objects, while overall error remains broadly consistent with prior approaches. Sensitivity analysis shows that evaluation metrics are strongly affected by a small number of extreme high-density scenes. Overall, the results highlight the role of representation quality in exemplar-free object counting.
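Two points in the abstract lend themselves to a short worked example: the count is the integral (here, the sum) of the predicted density map, and MAE and RMSE react differently to a few extreme scenes. The numbers below are made up for illustration and are not results from the paper.

```python
import math


def count_from_density(density_map):
    """The predicted count is the sum of all density values."""
    return sum(sum(row) for row in density_map)


def mae(predicted, actual):
    """Mean absolute error over a set of images."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def rmse(predicted, actual):
    """Root mean squared error over a set of images."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


if __name__ == "__main__":
    density = [[0.5, 0.5], [1.0, 0.0]]
    print(count_from_density(density))

    # Three near-perfect predictions and one badly missed dense scene.
    predicted = [1, 1, 1, 100]
    actual = [2, 2, 2, 3]
    print(mae(predicted, actual), rmse(predicted, actual))
```

Because RMSE squares each error, a single dense scene with a large miss can dominate it, which is consistent with the abstract's remark that a small number of extreme high-density scenes strongly affects the metrics.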


2601.18045 2026-06-16 cs.CV cs.AI replacement

Leveraging Persistence Image to Enhance Robustness and Performance in Curvilinear Structure Segmentation

利用持续图像增强曲线结构分割的鲁棒性和性能

Zhuangzhi Gao, Feixiang Zhou, He Zhao, Xiuju Chen, Xiaoxin Li, Qinkai Yu, Yitian Zhao, Alena Shantsila, Gregory Y. H. Lip, Eduard Shantsila, Yalin Zheng

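Affiliations: University of Science and Technology of China

AI summary: This paper proposes PIs-Regressor and Topology SegNet, which learn persistence images directly to make curvilinear structure segmentation more robust and accurate; experiments indicate that topological features improve the accuracy of medical image segmentation.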

Comments Accepted by IEEE International Symposium on Biomedical Imaging (ISBI) 2026. 5 pages, 3 figures

DOI: 10.1109/ISBI61048.2026.11515783
Journal ref: 2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), London, United Kingdom, 2026

Abstract

Segmenting curvilinear structures in medical images is essential for analyzing morphological patterns in clinical applications. Integrating topological properties, such as connectivity, improves segmentation accuracy and consistency. However, extracting and embedding such properties, especially from Persistence Diagrams (PDs), is challenging due to their non-differentiability and computational cost. Existing approaches mostly encode topology through handcrafted loss functions, which generalize poorly across tasks. In this paper, we propose PIs-Regressor, a simple yet effective module that learns persistence images (PIs), which are finite, differentiable representations of topological features, directly from data. Together with Topology SegNet, which fuses these features in both downsampling and upsampling stages, our framework integrates topology into the network architecture itself rather than into auxiliary losses. Unlike existing methods that depend heavily on handcrafted loss functions, our approach directly incorporates topological information into the network structure, leading to more robust segmentation. Our design is flexible and can be seamlessly combined with other topology-based methods to further enhance segmentation performance. Experimental results show that integrating topological features enhances model robustness, effectively handling challenges like overexposure and blurring in medical imaging. On three curvilinear benchmarks, our approach demonstrates state-of-the-art performance in both pixel-level accuracy and topological fidelity.
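For readers unfamiliar with persistence images, the sketch below shows the standard way one is rasterised from a persistence diagram: each (birth, death) pair is moved to (birth, persistence) coordinates, weighted by its persistence, and smoothed with a Gaussian. This is a minimal illustration of the representation, not the PIs-Regressor module, which according to the abstract learns such images from data. The grid size, `sigma`, and the linear weighting are assumptions chosen for the example.

```python
import math


def persistence_image(diagram, resolution=8, sigma=0.1, extent=1.0):
    """Rasterise (birth, death) pairs into a resolution x resolution grid.

    Columns index birth, rows index persistence (death - birth). Each pair
    contributes a Gaussian bump sampled at pixel centres and scaled by its
    persistence, so short-lived features (likely noise) contribute little.
    """
    step = extent / resolution
    image = [[0.0] * resolution for _ in range(resolution)]
    for birth, death in diagram:
        pers = death - birth
        for row in range(resolution):
            y = (row + 0.5) * step          # persistence coordinate of this pixel
            for col in range(resolution):
                x = (col + 0.5) * step      # birth coordinate of this pixel
                sq_dist = (x - birth) ** 2 + (y - pers) ** 2
                image[row][col] += pers * math.exp(-sq_dist / (2 * sigma ** 2))
    return image


if __name__ == "__main__":
    # One hypothetical topological feature born at 0.2 and dying at 0.65.
    for row in persistence_image([(0.2, 0.65)], resolution=4):
        print(["%.3f" % value for value in row])
```

Because every pixel is a smooth function of the birth and death values, the image can serve as a regression target or an intermediate feature in a network, which is the property the abstract relies on.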

2411.13602 2026-06-16 eess.IV cs.AI cs.CV Updated version

Translating Electrocardiograms to Cardiac Magnetic Resonance Imaging Useful for Cardiac Assessment and Disease Screening: A Multi-Center Study

Zhengyao Ding, Ziyu Li, Yujian Hu, Youyao Xu, Chengchen Zhao, Yiheng Mao, Haitao Li, Zhikang Li, Qian Li, Jing Wang, Yue Chen, Mengjia Chen, Longbo Wang, Xuesen Chu, Weichao Pan, Ziyi Liu, Fei Wu, Hongkun Zhang, Ting Chen, Zhengxing Huang

Affiliations: College of Computer Science and Technology, Zhejiang University; Department of Vascular Surgery, The First Affiliated Hospital of Zhejiang University School of Medicine; Department of Cardiology, The First Affiliated Hospital, Zhejiang University School of Medicine; Department of Radiology, The First Affiliated Hospital, Zhejiang University School of Medicine; Department of Vascular Surgery, Quzhou People’s Hospital; Department of Cardiology, The Second Affiliated Hospital of Zhejiang University School of Medicine; China Ship Scientific Research Center; Guangdong Transtek Medical Electronics Co., Ltd.

AI summary: This paper proposes CardioNets, a deep learning framework that translates 12-lead ECG signals into CMR-level functional parameters and synthetic images, aiming to make large-scale cardiovascular disease screening more efficient and accessible.

Comments 29 pages, 7 figures

DOI: 10.1056/AIoa2500549
Journal ref: NEJM AI 2026;3(4)

Abstract

Cardiovascular diseases (CVDs) are the leading cause of global mortality, necessitating accessible and accurate diagnostic tools. While cardiac magnetic resonance imaging (CMR) provides gold-standard insights into cardiac structure and function, its clinical utility is limited by high cost and complexity. In contrast, electrocardiography (ECG) is inexpensive and widely available but lacks the granularity of CMR. We propose CardioNets, a deep learning framework that translates 12-lead ECG signals into CMR-level functional parameters and synthetic images, enabling scalable cardiac assessment. CardioNets integrates cross-modal contrastive learning and generative pretraining, aligning ECG with CMR-derived cardiac phenotypes and synthesizing high-resolution CMR images via a masked autoregressive model. Trained on 159,819 samples from five cohorts, including the UK Biobank (n=42,483) and MIMIC-IV-ECG (n=164,550), and externally validated on independent clinical datasets (n=3,767), CardioNets achieved strong performance across disease screening and phenotype estimation tasks. In the UK Biobank, it improved cardiac phenotype regression R2 by 24.8% and cardiomyopathy AUC by up to 39.3% over baseline models. In MIMIC, it increased AUC for pulmonary hypertension detection by 5.6%. Generated CMR images showed 36.6% higher SSIM and 8.7% higher PSNR than prior approaches. In a reader study, ECG-only CardioNets achieved 13.9% higher accuracy than human physicians using both ECG and real CMR. These results suggest that CardioNets offers a promising, low-cost alternative to CMR for large-scale CVD screening, particularly in resource-limited settings. Future efforts will focus on clinical deployment and regulatory validation of ECG-based synthetic imaging.

2512.00572 2026-06-16 cs.CV cs.AI Updated version

Integrating Skeleton Based Representations for Robust Yoga Pose Classification Using Deep Learning Models

Mohammed Mohiuddin, Syed Mohammod Minhaz Hossain, Sumaiya Khanam, Prionkar Barua, Aparup Barua, MD Tamim Hossain

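Affiliations: Department of Computer Science and Engineering, Premier University

AI summary: This paper introduces the Yoga-16 dataset and systematically evaluates three deep learning models, showing that skeleton representations outperform raw images for yoga pose classification; VGG16 with MediaPipe skeleton input reaches 96.09% accuracy.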

DOI: 10.1038/s41598-025-23726-0

Abstract

Yoga is a popular form of exercise worldwide due to its spiritual and physical health benefits, but incorrect postures can lead to injuries. Automated yoga pose classification has therefore gained importance to reduce reliance on expert practitioners. While human pose keypoint extraction models have shown high potential in action recognition, systematic benchmarking for yoga pose recognition remains limited, as prior works often focus solely on raw images or a single pose extraction model. In this study, we introduce a curated dataset, 'Yoga-16', which addresses limitations of existing datasets, and systematically evaluate three deep learning architectures (VGG16, ResNet50, and Xception), using three input modalities (direct images, MediaPipe Pose skeleton images, and YOLOv8 Pose skeleton images). Our experiments demonstrate that skeleton-based representations outperform raw image inputs, with the highest accuracy of 96.09% achieved by VGG16 with MediaPipe Pose skeleton input. Additionally, we provide interpretability analysis using Grad-CAM, offering insights into model decision-making for yoga pose classification with cross-validation analysis.

2501.05436 2026-06-16 cs.CV Updated version

Scale-invariant brain morphometry: application to sulcal depth

Maxime Dieudonné, Guillaume Auzias, Julien Lefèvre

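Affiliations: Institut National de la Santé et de la Recherche Médicale (INSERM), U954, Université de Nantes

AI summary: This paper studies how brain size affects sulcal depth as a morphometric feature, proposes a scale-invariant method for estimating sulcal depth, and demonstrates its biological relevance on a large sample.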

Comments GA and JL contributed equally to this work

DOI: 10.1016/j.compbiomed.2026.111754
Journal ref: Computers in Biology and Medicine, Volume 212, 2026, 111754, ISSN 0010-4825

Abstract

The geometry of the human cortex is complex and highly variable, with interactions between brain size, cortical folding, and age well-documented in the literature. However, few studies have explored how global brain size influences morphometry features of the cortical surface derived from anatomical MRI. In this work, we focus on sulcal depth, an imaging phenotype that has gained attention in both basic research and clinical applications. We make key contributions to the field by: 1) providing the first quantitative analysis of the influence of brain size on sulcal depth measurements; 2) introducing a novel, scale-invariant method for sulcal depth estimation based on an original formalization of the problem; 3) presenting a validation framework and sharing our code and benchmark data with the community; and 4) demonstrating the biological relevance of our new sulcal depth measure using a large sample of 1,987 subjects spanning the developmental period from 26 weeks post-conception to adulthood.

2410.20202 2026-06-16 cs.CV Updated version

An Efficient Watermarking Method for Latent Diffusion Models via Low-Rank Adaptation and Dynamic Loss Weighting

Dongdong Lin, Yue Li, Benedetta Tondi, Kaiqing Lin, Bin Li, Mauro Barni

Affiliations: Xiamen Key Laboratory of Data Security and Blockchain Technology, Huaqiao University, Xiamen 361021, China; Department of Information Engineering and Mathematics of the University of Siena, Italy; Guangdong Provincial Key Laboratory of Intelligent Information Processing, Shenzhen University, Shenzhen 518060, China; Shenzhen Key Laboratory of Media Security, Shenzhen University, Shenzhen 518060, China; SZU-AFS Joint Innovation Center for AI Technology, Shenzhen University, Shenzhen 518060, China

AI summary: This paper proposes an efficient watermarking method for latent diffusion models based on low-rank adaptation, using dynamic loss weighting to balance generation quality against watermark fidelity so that watermarks are embedded quickly and accurately without compromising the quality of generated images.

DOI: 10.1016/j.eswa.2026.133172
Journal ref: Expert Systems with Applications. 331 (2026) 133172

Abstract

The rapid proliferation of Deep Neural Networks (DNNs) is driving a surge in model watermarking technologies, as the trained models themselves constitute valuable intellectual property. Existing watermarking approaches primarily focus on modifying model parameters or altering sampling behaviors. However, with the emergence of increasingly large models, improving the efficiency of watermark embedding becomes essential to manage increasing computational demands. Prioritizing efficiency not only optimizes resource utilization, making the watermarking process more applicable to large models, but also mitigates potential degradation of model performance. In this paper, we propose an efficient watermarking method for Latent Diffusion Models (LDMs) based on Low-Rank Adaptation (LoRA). The core idea is to introduce trainable low-rank parameters into the frozen LDM to embed the watermark, thereby preserving the integrity of the original model weights. Furthermore, a dynamic loss weight scheduler is designed to adaptively balance the objectives of generative quality and watermark fidelity, enabling the model to achieve effective watermark embedding with minimal impact on the quality of the generated images. Experimental results show that the proposed method ensures fast and accurate watermark embedding and high-quality generated images, while maintaining a level of robustness on par with, and in some cases superior to, state-of-the-art approaches. Moreover, the method generalizes well across different datasets and base LDMs. Code is available at: https://github.com/MrDongdongLin/EW-LoRA.
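The core mechanism named in this abstract, adding trainable low-rank parameters beside frozen weights, can be sketched for a single linear layer. The first function below follows the usual LoRA formulation (frozen `W` plus a scaled product `B A`). The second is a purely hypothetical loss-weighting rule included only to show where a dynamic weight would enter; the abstract does not describe the authors' actual scheduler, and none of these names come from the EW-LoRA repository.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def lora_forward(w_frozen, lora_a, lora_b, x, scale=1.0):
    """Compute W x + scale * B (A x).

    w_frozen is d_out x d_in and is never updated; lora_a is r x d_in and
    lora_b is d_out x r, with rank r much smaller than d_in and d_out.
    """
    base = matvec(w_frozen, x)
    delta = matvec(lora_b, matvec(lora_a, x))
    return [b + scale * d for b, d in zip(base, delta)]


def watermark_weight(quality_loss, watermark_loss, eps=1e-8):
    """Hypothetical rule: give the watermark term more weight while it lags."""
    return watermark_loss / (quality_loss + watermark_loss + eps)


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]
    a = [[1.0, 1.0]]                 # rank-1 adapter, assumed shapes
    b_zero = [[0.0], [0.0]]          # common initialisation: adapter starts as a no-op
    b_trained = [[1.0], [2.0]]
    print(lora_forward(w, a, b_zero, [3.0, 4.0]))
    print(lora_forward(w, a, b_trained, [3.0, 4.0]))
    print(watermark_weight(quality_loss=1.0, watermark_loss=3.0))
```

With the adapter initialised to zero the layer reproduces the frozen model exactly, which is why this design leaves the original weights intact and lets the watermark be removed or swapped by dropping the adapter.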

2410.13439 2026-06-16 cs.LG cs.CL cs.CV Updated version

Similarity-Dissimilarity Loss for Multi-label Supervised Contrastive Learning

Guangming Huang, Yunfei Long, Cunjin Luo

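Affiliations: University of Essex; Queen Mary University of London

AI summary: This paper proposes the Similarity-Dissimilarity Loss, which dynamically re-weights samples to address positive-sample selection in multi-label settings, provides theoretical proofs, and unifies single-label and multi-label supervised contrastive learning; experiments show that it outperforms baselines on image, text, and medical data.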

Comments Accepted by Transactions on Machine Learning Research (TMLR)

Abstract

Supervised contrastive learning has achieved remarkable success by leveraging label information; however, determining positive samples in multi-label scenarios remains a critical challenge. In multi-label supervised contrastive learning (MSCL), multi-label relations are not yet fully defined, leading to ambiguity in identifying positive samples and formulating contrastive loss functions to construct the representation space. To address these challenges, we: (i) systematically formulate multi-label relations in MSCL, (ii) propose a novel Similarity-Dissimilarity Loss, which dynamically re-weights samples based on similarity and dissimilarity factors, (iii) further provide theoretically grounded proofs for our method through rigorous mathematical analysis that supports the formulation and effectiveness, and (iv) offer a unified form and paradigm for both single-label and multi-label supervised contrastive loss. We conduct experiments on both image and text modalities and further extend the evaluation to the medical domain. The results show that our method consistently outperforms baselines in comprehensive evaluations, demonstrating its effectiveness and robustness.

2509.00176 2026-06-16 cs.CV cs.AI Updated version

Waste-Bench: A Comprehensive Benchmark for Evaluating VLLMs in Cluttered Environments

Muhammad Ali, Salman Khan

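Affiliations: Mohamed Bin Zayed University of Artificial Intelligence

AI summary: This paper introduces the Waste-Bench benchmark for evaluating the robustness and accuracy of VLLMs in complex environments, revealing the need to improve VLLM performance in such settings.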

Journal ref: Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP 2025), pp. 31019-31032, 2025

Abstract

Recent advancements in Large Language Models (LLMs) have paved the way for Vision Large Language Models (VLLMs) capable of performing a wide range of visual understanding tasks. While LLMs have demonstrated impressive performance on standard natural images, their capabilities have not been thoroughly explored on cluttered datasets that feature complex environments and deformed objects. In this work, we introduce a novel dataset specifically designed for waste classification in real-world scenarios, characterized by complex environments and deformed objects. Along with this dataset, we present an in-depth evaluation approach to rigorously assess the robustness and accuracy of VLLMs. The introduced dataset and comprehensive analysis provide valuable insights into the performance of VLLMs under challenging conditions. Our findings highlight the critical need for further advances in VLLM robustness so that these models perform better in complex environments. The dataset and code for our experiments will be made publicly available.

2505.15408 2026-06-16 cs.CV Updated version

Mouse Lockbox Dataset: Behavior Recognition for Mice Solving Lockboxes

Patrik Reiske, Marcus N. Boon, Niek Andresen, Sole Traverso, Katharina Hohlbaum, Lars Lewejohann, Christa Thöne-Reineke, Olaf Hellwich, Henning Sprekeler

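Affiliations: Max Planck Institute for Biological Cybernetics, Berlin, Germany

AI summary: This paper presents a video dataset of mice solving complex mechanical puzzles for evaluating frame-level action classification methods, with human-annotated labels that support the study of the challenges in automatically labeling fine-grained behaviors.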

Comments Accepted and published (poster) at the CV4Animals: Computer Vision for Animal Behavior Tracking and Modeling workshop, in conjunction with Computer Vision and Pattern Recognition (CVPR) 2025

DOI: 10.1007/s11263-026-02908-x

Abstract

Machine learning and computer vision methods have a major impact on the study of natural animal behavior, as they enable the (semi-)automatic analysis of vast amounts of video data. Mice are the standard mammalian model system in most research fields, but the datasets available today to refine such methods focus either on simple or social behaviors. In this work, we present a video dataset of individual mice solving complex mechanical puzzles, so-called lockboxes. The more than 110 hours of total playtime show their behavior recorded from three different perspectives. As a benchmark for frame-level action classification methods, we provide human-annotated labels for all videos of two different mice, that equal 13% of our dataset. Our keypoint (pose) tracking-based action classification framework illustrates the challenges of automated labeling of fine-grained behaviors, such as the manipulation of objects. We hope that our work will help accelerate the advancement of automated action and behavior classification in the computational neuroscience community. Our dataset is publicly available at https://doi.org/10.14279/depositonce-23850

2411.07742 2026-06-16 cs.CV Updated version

Efficient 3D Perception on Multi-Sweep Point Cloud with Gumbel Spatial Pruning

Tianyu Sun, Jianhao Li, Xueqian Zhang, Zhongdao Wang, Bailan Feng, Hengshuang Zhao

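Affiliations: Department of Electronic Engineering, Tsinghua University; Department of Computer Science and Engineering, Beihang University; Noah’s Ark Lab; Department of Computer Science, University of Hong Kong

AI summary: This paper studies point cloud perception in outdoor environments: it accumulates multiple consecutive point cloud sweeps to improve perception accuracy and introduces a Gumbel Spatial Pruning layer that removes redundant points, improving 3D perception performance.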

Abstract

This paper studies point cloud perception within outdoor environments. Existing methods face limitations in recognizing objects located at a distance or occluded, due to the sparse nature of outdoor point clouds. In this work, we observe a significant mitigation of this problem by accumulating multiple temporally consecutive point cloud sweeps, resulting in a remarkable improvement in perception accuracy. However, the computation cost also increases, hindering previous approaches from utilizing a large number of point cloud sweeps. To tackle this challenge, we find that a considerable portion of points in the accumulated point cloud is redundant, and discarding these points has minimal impact on perception accuracy. We introduce a simple yet effective Gumbel Spatial Pruning (GSP) layer that dynamically prunes points based on a learned end-to-end sampling. The GSP layer is decoupled from other network components and thus can be seamlessly integrated into existing point cloud network architectures. Without incurring additional computational overhead, we increase the number of point cloud sweeps from 10, a common practice, to as many as 40. Consequently, there is a significant enhancement in perception performance. For instance, in nuScenes 3D object detection and BEV map segmentation tasks, our pruning strategy improves several 3D perception baseline methods.
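The sampling idea behind a Gumbel-based pruning layer can be sketched with the Gumbel-max trick: each point gets a keep logit, independent Gumbel noise is added to the keep and drop options, and the point survives if the keep option wins. The code below is only an inference-style illustration under that assumption. The abstract does not specify how the GSP layer produces its logits or how gradients are passed through the sampling step (a relaxed or straight-through estimator is the usual choice), and all names here are hypothetical.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel sample, clamping u away from 0 and 1."""
    u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_keep_mask(keep_logits, rng):
    """Return one boolean per point: True means the point is kept.

    The drop option has logit 0, so a point is kept with probability
    sigmoid(keep_logit) under the Gumbel-max trick.
    """
    mask = []
    for logit in keep_logits:
        keep_score = logit + sample_gumbel(rng)
        drop_score = 0.0 + sample_gumbel(rng)
        mask.append(keep_score > drop_score)
    return mask


def prune_points(points, keep_logits, seed=0):
    """Filter a list of (x, y, z) points with a sampled keep mask."""
    rng = random.Random(seed)
    mask = gumbel_keep_mask(keep_logits, rng)
    return [point for point, keep in zip(points, mask) if keep]


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (2.0, 2.0, 2.0)]
    # Hypothetical logits: strongly positive keeps a point, strongly negative drops it.
    print(prune_points(cloud, [50.0, -50.0, 50.0]))
```

Pruning before the heavy backbone layers is what would let the number of accumulated sweeps grow without a matching growth in compute, which is the trade-off the abstract describes.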

2504.18179 2026-06-16 cs.CV cs.LG Updated version

Label-independent hyperparameter-free self-supervised single-view deep subspace clustering

Lovro Sindicic, Ivica Kopriva

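Affiliations: Division of Computing and Data Science, Ruđer Bošković Institute

AI summary: This paper proposes a single-view deep subspace clustering method that needs no hyperparameter tuning, improving clustering through a layer-wise self-expression loss, a subspace-structured norm, a multi-stage learning framework, and a relative-error stopping mechanism.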

Comments 35 pages; 1 figure; 10 Tables

DOI: 10.1016/j.neucom.2025.132260

Abstract

Deep subspace clustering (DSC) algorithms face several challenges that hinder their widespread adoption across various application domains. First, clustering quality is typically assessed using only the encoder's output layer, disregarding valuable information present in the intermediate layers. Second, most DSC approaches treat representation learning and subspace clustering as independent tasks, limiting their effectiveness. Third, they assume the availability of a held-out dataset for hyperparameter tuning, which is often impractical in real-world scenarios. Fourth, learning termination is commonly based on clustering error monitoring, requiring external labels. Finally, their performance often depends on post-processing techniques that rely on labeled data. To address these limitations, we introduce a novel single-view DSC approach that: (i) minimizes a layer-wise self-expression loss using a joint representation matrix; (ii) optimizes a subspace-structured norm to enhance clustering quality; (iii) employs a multi-stage sequential learning framework, consisting of pre-training and fine-tuning, enabling the use of multiple regularization terms without hyperparameter tuning; (iv) incorporates a relative error-based self-stopping mechanism to terminate training without labels; and (v) retains a fixed number of leading coefficients in the learned representation matrix based on prior knowledge. We evaluate the proposed method on six datasets representing faces, digits, and objects. The results show that our method outperforms most linear SC algorithms with carefully tuned hyperparameters while maintaining competitive performance with the best-performing linear approaches.
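Two ingredients listed in this abstract, a self-expression loss and a label-free stopping rule based on relative error, can be sketched in a few lines. The loss below is the generic single-layer form, the squared Frobenius norm of X - XC with samples as columns. The stopping rule compares consecutive loss values. Both are illustrations under stated assumptions: the paper's layer-wise formulation, its exact criterion, and the tolerance `tol` are not given in the abstract.

```python
def self_expression_loss(x, c):
    """Squared Frobenius norm of X - X C.

    x is d x n with one sample per column; c is n x n with a zero diagonal,
    so each sample must be reconstructed from the other samples.
    """
    d, n = len(x), len(x[0])
    total = 0.0
    for i in range(d):
        for j in range(n):
            recon = sum(x[i][k] * c[k][j] for k in range(n))
            total += (x[i][j] - recon) ** 2
    return total


def should_stop(losses, tol=1e-3):
    """Stop when the relative change between the last two losses is below tol."""
    if len(losses) < 2 or losses[-2] == 0:
        return False
    return abs(losses[-1] - losses[-2]) / abs(losses[-2]) < tol


if __name__ == "__main__":
    # Three toy samples (columns); the first two are identical, the third is not.
    x = [[1.0, 1.0, 2.0],
         [0.0, 0.0, 1.0]]
    # Samples 0 and 1 reconstruct each other; sample 2 is left unexplained.
    c = [[0.0, 1.0, 0.0],
         [1.0, 0.0, 0.0],
         [0.0, 0.0, 0.0]]
    print(self_expression_loss(x, c))
    print(should_stop([10.0, 5.0]), should_stop([5.0, 4.999]))
```

Neither function touches ground-truth labels, which is the point of the label-independent design: both the training signal and the termination test depend only on the data and the learned representation matrix.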

2502.05214 2026-06-16 eess.IV cs.AI cs.CV Updated version

CoRPA: Adversarial Image Generation for Chest X-rays Using Concept Vector Perturbations and Generative Models

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

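Affiliations: School of Informatics, University of Edinburgh; NHS Lothian

AI summary: This paper proposes CoRPA, a clinically focused adversarial attack framework for medical imaging that perturbs concept vectors to generate adversarial radiological reports and images, exposing the vulnerability of medical AI in realistic clinical scenarios.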

DOI: 10.1109/ICHI64645.2025.00057

Abstract

Deep learning models for medical image classification tasks are becoming widely implemented in AI-assisted diagnostic tools, aiming to enhance diagnostic accuracy, reduce clinician workloads, and improve patient outcomes. However, their vulnerability to adversarial attacks poses significant risks to patient safety. Current attack methodologies use general techniques such as model querying or pixel value perturbations to generate adversarial examples designed to fool a model. These approaches may not adequately address the unique characteristics of clinical errors stemming from missed or incorrectly identified clinical features. We propose the Concept-based Report Perturbation Attack (CoRPA), a clinically-focused black-box adversarial attack framework tailored to the medical imaging domain. CoRPA leverages clinical concepts to generate adversarial radiological reports and images that closely mirror realistic clinical misdiagnosis scenarios. We demonstrate the utility of CoRPA using the MIMIC-CXR-JPG dataset of chest X-rays and radiological reports. Our evaluation reveals that deep learning models exhibiting strong resilience to conventional adversarial attacks are significantly less robust when subjected to CoRPA's clinically-focused perturbations. This underscores the importance of addressing domain-specific vulnerabilities in medical AI systems. By introducing a specialized adversarial attack framework, this study provides a foundation for developing robust, real-world-ready AI models in healthcare, ensuring their safe and reliable deployment in high-stakes clinical environments.

2403.19444 2026-06-16 cs.LG cs.CV Updated version

Leveraging Expert Input for Robust and Explainable AI-Assisted Lung Cancer Detection in Chest X-rays

Amy Rafferty, Rishi Ramaesh, Ajitha Rajan

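Affiliations: School of Informatics, University of Edinburgh; NHS Lothian

AI summary: This paper examines the interpretability and robustness of an InceptionV3-based lung cancer detection model and proposes ClinicXAI, an expert-driven approach that produces clinically relevant explanations and is more robust under adversarial attacks.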

DOI: 10.1109/ICHI64645.2025.00071

Abstract

Deep learning models show significant potential for advancing AI-assisted medical diagnostics, particularly in detecting lung cancer through medical image modalities such as chest X-rays. However, the black-box nature of these models poses challenges to their interpretability and trustworthiness, limiting their adoption in clinical practice. This study examines both the interpretability and robustness of a high-performing lung cancer detection model based on InceptionV3, utilizing a public dataset of chest X-rays and radiological reports. We evaluate the clinical utility of multiple explainable AI (XAI) techniques, including both post-hoc and ante-hoc approaches, and find that existing methods often fail to provide clinically relevant explanations, displaying inconsistencies and divergence from expert radiologist assessments. To address these limitations, we collaborated with a radiologist to define diagnosis-specific clinical concepts and developed ClinicXAI, an expert-driven approach leveraging the concept bottleneck methodology. ClinicXAI generated clinically meaningful explanations which closely aligned with the practical requirements of clinicians while maintaining high diagnostic accuracy. We also assess the robustness of ClinicXAI in comparison to the original InceptionV3 model by subjecting both to a series of widely utilized adversarial attacks. Our analysis demonstrates that ClinicXAI exhibits significantly greater resilience to adversarial perturbations. These findings underscore the importance of incorporating domain expertise into the design of interpretable and robust AI systems for medical diagnostics, paving the way for more trustworthy and effective AI solutions in healthcare.

1. Multimodal and Vision-Language Models (42 papers)

FUSE: Quantifying Uncertainty in Vision-Language Models by Bayesian Fusing Epistemic and Aleatoric Uncertainty

HorusEye: Language as Dynamic Attention for Emergency Visual Analysis

Beyond Self-Attention: Sub-Quadratic Vision Transformers for Fast Image Captioning

Disentangling Hallucinations: Orthogonal Semantic Projection for Robust Interpretability

JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

Understanding Cross-Modal Contributions in Continual Vision-Language Models: A Theoretical Perspective

DLWM: Diverse Latent World Models for Efficient Multimodal Reasoning

Self-Questioning Vision-Language Models: Reinforcement Learning for Compositional Visual Reasoning

OneFocus: Enabling Real-World X-ray Security Screening with a Unified Vision-Language Model

Task-Instructed Causal Routing of Vision Foundation Models for Multi-Task Learning

OmniOPSD: Rationale-Privileged On-Policy Self-Distillation for Affective Computing

Mind the Gap: Diagnosing Constraint Discovery Failures in Text-in-Image Editing

Stepwise Token Selection for Efficient Multimodal Large Language Models

VinQA: Visual Elements Interleaved Long-form Answer Generation for Real-World Multimodal Document QA

Focus When Necessary: Adaptive Routing and Collaborative Grounding for Training-Free Visual Grounding

Cascaded Sparse Autoencoders Learn Multi-Level Visual Concepts in Multimodal LLMs

GRACE: Boosting Video MLLMs with Grounded Action-Centric Evidence for Viewer Sentiment Prediction

UniDDT: Unifying Multimodal Understanding and Generation with Decoupled Diffusion Transformer

VisualClaw: A Real-Time, Personalized Agent for the Physical World

LOCUS: Local Visual Cue Search for Enhancing Fine-Grained Perception in Multimodal Large Language Models

DifferAD-R1: A Difference-Guided Industrial Anomaly Localization with Multimodal Large Language Models

SUP-MCRL: Subject-aware Unified Pseudo-feature Coded Multimodal Contrastive Representation Learning for EEG Visual Decoding

Look Again Before You Abstain: Budgeted Conformal Evidence Acquisition for Reliable Vision-Language Model

Gen-VCoT: Generative Visual Chain-of-Thought Reasoning via Diffusion-Based RGB Intermediate Representations

MatchLM2Lite: A Scalable MLLM-to-Lite Framework for Reproduced Content Identification

Post-Launch Capability Expansion of Vision-Language Models via Prompting for On-Orbit Spacecraft Inspection

MAF: Multimodal Adaptive Few-shot Prompting for Sentiment Analysis with MLLMs

Mitigating Visual Hallucinations in Multimodal Systems through Retrieval-Augmented Reliability-Aware Inference

Lost at the End: Primacy Bias in Multimodal Retrieval-Augmented Question Answering

Context-Aware RL for Agentic and Multimodal LLMs

A Comprehensive Survey of Knowledge-Based Vision Question Answering Systems: The Lifecycle of Knowledge in Visual Reasoning Task

Dual-branch Prompting for Multimodal Machine Translation

Akasha 2: Hamiltonian State Space Duality and Visual-Language Joint Embedding Predictive Architecture

CASHEW: Stabilizing Multimodal Reasoning via Iterative Trajectory Aggregation

SAMTok: Representing Any Mask with Two Words

When RAG Hurts: Diagnosing and Mitigating Attention Distraction in Retrieval-Augmented LVLMs

UniT: Unified Multimodal Chain-of-Thought Test-time Scaling

Cross-modal Identity Mapping: Minimizing Information Loss in Modality Conversion via Reinforcement Learning

Proact-VL: A Proactive VideoLLM for Real-Time AI Companions

GEASS: Gated Evidence-Adaptive Selective Caption Trust for Vision-Language Models

MolSight: Molecular Property Prediction with Images

Wasserstein Equilibrium Decoding for Reliable Medical Visual Question Answering

2. Embodied Intelligence, Robotics, and Autonomous Driving (45 papers)

X-Tokenizer: A Multimodal Action Tokenizer for Vision-Language-Action Pretraining

ScoutVLA: UAV-Centric Active Perception via a Dual-Expert VLA Model for Open-World Embodied Question Answering

Multi-HMR 2: Multi-Person Camera-Centric Human Detection, Mesh Recovery and Tracking

Think Less, Act Early: Reinforced Latent Reasoning with Early Exit in Vision-Language-Action Models

MotionVLA: Vision-Language-Action Model for Humanoid Motion

G2IA: Geometry-Guided Instance-Aware Retrieval and Refinement for Cross-Modal Place Recognition

CausalDrive: Real-time Causal World Models for Autonomous Driving

Metis: A Generalizable and Efficient World-Action Model for Autonomous Driving and Urban Navigation

EgoPhys: Learning Generalizable Physics Models of Deformable Objects from Egocentric Video

Learned Image Compression for Vision-Language-Action Models

GraphWorld: Long-Horizon Planning with World Models for End-to-End Autonomous Driving

RealityBridge: Bridging Editable 3D Gaussian Splatting Driving Simulations and Real-World Videos

GraphBEV++: Multi-Modal Feature Alignment for Autonomous Driving

Instance-Aware Knowledge Distillation for Semi-Supervised Learning of an On-Board Multi-Task Dense Prediction Model for Collision Avoidance System

Decoupled Object-Centric Video Understanding for Generating Robotic Manipulation Commands

MVOFormer: Flow-Semantic Transformer for Robust Monocular Visual Odometry

PROSE: Training-Free Egocentric Scene Registration with Vision-Language Models

Semantic Flip: Synthetic OOD Generation for Robust Refusal in Embodied Question Answering and Spatial Localization

SurroundNEXO: Ego-Centric Metric Bridging for Spatially Consistent Geometry in Autonomous Driving

VANDERER: Map-Free Exploration using Future-Aware and Visual-Curiosity-Guided Diffusion Policy

DragMesh-2: Physically Plausible Dexterous Hand-Object Interaction with Articulated Objects

Pixels to Proofs: Probabilistically-Safe Latent World Model Control via Parallel Conformal Robust MPC

Towards Next-Generation Healthcare: A Survey of Medical Embodied AI for Perception, Decision-Making, and Action

Learning New Tasks via Reusable Skills: Skill-Compositional Experts for Embodied Continual Learning

Effective and Low-cost Lane-based Map Localization for Vehicle-Centric Route Generation

V2P-Manip: Learning Dexterous Manipulation from Monocular Human Videos

PATCH: Action-Chunk-Conditioned Latent Patch Innovation Monitoring for Robot Manipulation

R2RDreamer: 3D-aware Data Augmentation for Spatially-generalized 2D Manipulation Policies

Geometric Action Model for Robot Policy Learning

Planning with Unified Multimodal Models

CoIRL-AD: Collaborative-Competitive Imitation-Reinforcement Learning in Latent World Models for Autonomous Driving

FDIO: Frequency Decomposed Inertial Odometry

Seeing Roads Through Words: A Language-Guided Framework for RGB-T Driving Scene Segmentation

RLPR: Radar-to-LiDAR Place Recognition via Two-Stage Asymmetric Cross-Modal Alignment for Autonomous Driving

DriveReward: A Comprehensive Dataset and Generative Vision-Language Reward Model for Autonomous Driving

LIBERO-Occ: Evaluating and Improving Vision-Language-Action Models under Scene-Induced Occlusion via Viewpoint Imagination