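arXivDaily daily academic digest: synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, and electrical engineering & systems.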

2606.17296 2026-06-17 cs.CV New submission

Pareto LoRA: Mitigating Modality Imbalance in Unified Multimodal Models via Pareto-Optimal Gradient Integration

Xiwen Wei, Mark Nutter, Madhusudhanan Srinivasan, Radu Marculescu

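Affiliations: The University of Texas at Austin; Advanced Micro Devices, Inc.

AI summary: To address the drop in image generation quality when language gradients dominate optimization during LoRA fine-tuning of unified multimodal models, the paper proposes Pareto LoRA, a Pareto-optimal gradient integration strategy that balances the text and image objectives by modulating gradient direction and strength, improving perceptual image quality on the CoMM benchmark by up to 44.9%.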

Abstract

Unified multimodal models (UMMs) have recently emerged as a promising paradigm for integrating multimodal understanding and generation within a single autoregressive transformer. However, during multimodal instruction tuning, these models often exhibit pronounced modality imbalance: language gradients dominate optimization, thus leading to lower image generation quality, especially under parameter-efficient fine-tuning such as LoRA. In this work, we systematically analyze modality imbalance in LoRA-based fine-tuning of UMMs for interleaved text-image generation. We show that vision modality performance degrades substantially more than text modality performance when compared to unimodal counterparts, and that modality-specific gradients can differ by orders of magnitude across various tasks and layers. Motivated by this observation, we reformulate the multimodal instruction tuning as a bi-objective optimization problem and propose Pareto LoRA, a Pareto-optimal gradient integration strategy that balances the text and image objectives by modulating the gradient direction and strength. Experiments on the CoMM benchmark with Emu2 demonstrate that Pareto LoRA consistently improves multimodal generation balance, achieving up to 44.9% gains in perceptual image quality over vanilla LoRA while maintaining comparable text performance.
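The abstract does not spell out the integration rule, so the following is only a minimal sketch of one classical way to combine the gradients of two objectives: the closed-form min-norm convex combination used in multiple-gradient descent. It is not the paper's Pareto LoRA update; the function name `min_norm_combination` and the toy gradients are assumptions made for illustration.

```python
def min_norm_combination(g_text, g_image):
    """Return (alpha, g) with g = alpha*g_text + (1-alpha)*g_image of minimal norm.

    Classical two-objective min-norm rule (an assumption here, not the
    paper's stated update). alpha is clipped to [0, 1].
    """
    diff = [a - b for a, b in zip(g_text, g_image)]
    denom = sum(d * d for d in diff)
    if denom == 0.0:  # identical gradients: any weighting gives the same result
        return 0.5, list(g_text)
    # Minimizer of ||g_image + alpha * (g_text - g_image)||^2 over alpha.
    num = sum((b - a) * b for a, b in zip(g_text, g_image))
    alpha = min(1.0, max(0.0, num / denom))
    g = [alpha * a + (1.0 - alpha) * b for a, b in zip(g_text, g_image)]
    return alpha, g


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy case: the text gradient is ten times larger than the image gradient.
g_text, g_image = [10.0, 0.0], [0.0, 1.0]
alpha, g = min_norm_combination(g_text, g_image)
# The dominant text gradient receives a small weight, and the combined
# direction has a positive inner product with both objectives' gradients.
print(alpha, dot(g, g_text), dot(g, g_image))
```

In this toy case the larger gradient is down-weighted rather than allowed to set the update direction, which is the kind of direction-and-strength balancing the abstract describes in general terms.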

2606.17389 2026-06-17 cs.CV cs.AI cs.CL cs.LG New submission

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

Logan Mann, Yi Xia, Ajit Saravanan, Ishan Dave, Saadullah Ismail, Shikhar Shiromani, Emily Huang, Ruizhe Li, Kevin Zhu

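Affiliations: University of California, Santa Barbara; Algoverse AI Research; University of California, Berkeley

AI summary: The paper introduces the VLM Reliability Probe (VRP). Using structural-attention metrics and an analysis of generation dynamics, it finds that spatial attention is nearly uncorrelated with accuracy (R ≈ 0.001), whereas self-consistency is the dominant predictor of reliability (R = 0.429), revealing a "symbolic detachment" between visual features and final generation.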

Comments 16 pages. Accepted to the ICLR 2026 Workshop on Multimodal Intelligence. Code: https://github.com/itsloganmann/VLM-Reliability-Probe

Abstract

Multimodal Foundation Models are increasingly used as reasoning agents, making reliability, knowing when a model may hallucinate, critical. A common intuition, which we call the Attention-Confidence Assumption, holds that reliability follows from "structural" visual perception: tight attention on relevant regions should signal a trustworthy answer, while scattered attention signals confusion. We challenge this through the VLM Reliability Probe (VRP), a systematic cross-family study of reliability signals in contemporary Vision-Language Models (VLMs). We introduce structural-attention metrics, cluster counts (C_k) and spatial entropy (H_s), to quantify the visual encoder's gaze, and track its evolution (Delta H_s) across layers. This reveals a "Symbolic Detachment": models often "Early Lock" visual features only to diffuse attention later, severing early perception from final generation. Contrary to the grounding hypothesis, we find a "Cluster Failure": spatial attention has near-zero correlation (R approx 0.001) with accuracy. Instead, reliability is a phenomenon of generation dynamics and internal-state distributions. Self-Consistency, the agreement rate across sampled reasoning paths, is the dominant predictor of truth (R = 0.429). Scaling causal interventions exposes a sharp architectural divergence: LLaVA locks its prediction in a fragile late-stage bottleneck, whereas PaliGemma and Qwen2-VL distribute reliability globally, staying resilient even when ~50% or more of their most predictive layer is destroyed. For current VLMs, reliability signals are detached from visual grounding maps and are best inferred from generation-time dynamics and hidden-state probes.
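As a rough illustration of the two kinds of signal contrasted above, the sketch below computes a Shannon entropy over a flattened attention map and a majority-vote agreement rate over sampled answers. The abstract does not give the exact definitions of H_s or of the agreement rate, so both formulas, and the names `spatial_entropy` and `self_consistency`, are assumptions.

```python
import math
from collections import Counter


def spatial_entropy(attention):
    """Shannon entropy (nats) of a flattened, non-negative attention map."""
    total = sum(attention)
    if total <= 0:
        raise ValueError("attention map must have positive mass")
    probs = [a / total for a in attention if a > 0]
    return -sum(p * math.log(p) for p in probs)


def self_consistency(answers):
    """Return the majority answer and the fraction of samples that agree with it."""
    answer, count = Counter(answers).most_common(1)[0]
    return answer, count / len(answers)


# Hypothetical inputs: a uniform 2x2 attention map and four sampled answers.
print(spatial_entropy([1, 1, 1, 1]))
print(self_consistency(["cat", "cat", "dog", "cat"]))
```

A tightly focused map gives low entropy and a diffuse map gives high entropy; the agreement rate depends only on the sampled outputs, not on where the model looked.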

2606.17410 2026-06-17 cs.CV New submission

Attention Alignment Between Humans and Vision-Language Models

Isaac R. Christian, Udith Haputhanthrige, Hanna Hornfeld, Declan Campbell, Samuel Nastase, Taylor Webb, Michael Graziano

Affiliations: Princeton Neuroscience Institute, Princeton University; Department of Psychology, Princeton University; Department of Computer Science, Princeton University; Department of Psychology and Center for Computational Language Sciences, University of Southern California; Department of Psychology, Université de Montréal

AI summary: The study compares spatial attention maps from six vision-language models with human gaze heatmaps and finds that decoder architecture (LSTM vs. Transformer) dominates alignment: LSTM decoders align more closely but are spatially diffuse with little task differentiation, while Transformer decoders attend more sharply and differentiate tasks more strongly.

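Abstract (translated from the AI-generated Chinese version)

Visual perception relies on top-down goals and bottom-up sensory mechanisms. Vision-language models implement both, which lets us treat each component as a separable hypothesis about what drives where we look. We compare spatial attention maps from six vision-language models with human gaze heatmaps recorded on 200 images across two tasks (general description and social captioning). The six models span a 2×2 factorial design of CNN vs. ViT encoders crossed with LSTM vs. Transformer decoders, plus Molmo 7B-D and Qwen3.5 9B. We find that both decoder and encoder architecture affect alignment, but the decoder choice dominates. LSTM decoders raise alignment by 40-50 percentage points over Transformer decoders (80-87% vs. 40-59% of the human noise ceiling, respectively). By contrast, CNN encoders contribute a secondary advantage of 5-20 percentage points over ViT encoders, depending on the decoder family, with CNN-LSTM the best-aligned model overall (85-87%). Despite their alignment advantage, LSTM decoders produce attention maps that are spatially diffuse and minimally task-differentiated, whereas ViT-Transformer, the weakest-aligned model, shows the sharpest spatial concentration and the strongest task differentiation. A hemispatial-neglect simulation confirms that ablating attention affects LSTM decoders more than Transformer decoders. In an exploratory extension using synthetic neural responses simulated with TRIBE, gaze alignment and neural correlation dissociate: CNN-Transformer attention maps predict synthetic brain activity better despite lower gaze alignment, with attention maps best predicting early visual cortex. Overall, the top-down and bottom-up components trade off in what they predict across behavioral and synthetic neural data.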

Unified Multimodal Autoregressive Modeling: A Shared-Context Visual Tokenizer Is the Key to Unification

Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai

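Affiliations: Institute of Trustworthy Embodied AI, Fudan University; Shanghai Innovation Institute; Qwen Team, Alibaba Inc.

AI summary: Proposes UniAR, a framework that bridges visual understanding and generation with a single discrete visual tokenizer, using parallel bitwise prediction and diffusion-based decoding to reach state-of-the-art image generation and editing while staying competitive on multimodal understanding.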

Comments Accepted by ICML2026. Project page https://sharelab-sii.github.io/uniar-web

Abstract

Unified Multimodal Modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two disparate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework where a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model adopts parallel-bitwise-prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on discrete visual tokens to decode high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
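To make "lookup-free bitwise quantization" concrete, here is a minimal sketch of the general idea as it is commonly described: each feature channel is quantized to one bit by its sign, so no codebook lookup is needed and d channels index 2^d codes. The abstract does not give UniAR's actual scheme; the function `lfq_quantize` and the sign-threshold rule are assumptions.

```python
def lfq_quantize(features):
    """Quantize each channel to one bit by sign and pack the bits into an index.

    Returns (bits, codes, index): the raw bits, the {-1, +1} quantized
    vector, and the integer token id formed by reading the bits MSB-first.
    """
    bits = [1 if x > 0 else 0 for x in features]
    codes = [1.0 if b else -1.0 for b in bits]
    index = 0
    for b in bits:
        index = (index << 1) | b
    return bits, codes, index


# Hypothetical 4-channel feature: 4 bits address 2**4 = 16 distinct codes.
bits, codes, index = lfq_quantize([0.3, -1.2, 0.0, 2.5])
print(bits, codes, index, 2 ** len(bits))
```

Each added channel doubles the number of addressable codes, which is one way an effective vocabulary can grow without storing a larger embedding table.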

2606.17449 2026-06-17 cs.CL cs.AI cs.CV cs.LG cs.MM Cross-list

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

Zehang Wei, Jiaxin Dai, Jiamin Yan, Xiang Xiang

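Affiliations: School of Computer Science & Tech, Huazhong University of Science and Technology; School of AI and Automation, Huazhong University of Science and Technology

AI summary: Proposes MODE-RAG, a multi-agent system that uses variational free energy and internal attention states to dynamically gate interventions, combining Monte Carlo Tree Search with logit perturbations to reduce hallucinations and logical fabrications in multimodal retrieval-augmented generation.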

Comments To be presented at ACL 2026

Abstract

While Multimodal Retrieval-Augmented Generation (M-RAG) enhances Large Vision-Language Models, it remains highly susceptible to cross-modal hallucinations, causal fabrications, and sycophancy. Furthermore, existing mitigation pipelines often face an intervention paradox: static rules tend to unnecessarily disrupt accurate generations, whereas leaving the multi-modal reasoning completely unguided allows existing mismatches to cascade into severe logical fabrications. To quantify and mitigate these hallucinations, we propose a Multi-Agent system, MODE-RAG, driven by Variational Free Energy (VFE) and internal attention states to dynamically gate interventions. High-risk queries are routed to five stage-specific agents, integrating Monte Carlo Tree Search (MCTS) for rigorous causal derivation and logit perturbations to penalize sycophancy. Dedicated Correction and Overseer agents ensure formatting stability and perform post-hoc factual verification. To objectively evaluate our approach, we introduce ModeVent, a challenging subset derived from the MultiVent dataset. Extensive experiments indicate that our system effectively reduces hallucination rates and logical fabrication, significantly improving the robustness of M-RAG systems.

2601.00215 2026-06-17 cs.CV cs.CL Updated version

Disentangling Perception and Reasoning in Multimodal LLMs via Reward Design

Omar Sharif, Eftekhar Hossain, Nikhil Singh, Patrick Ng

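AI summary: Studies the perception and reasoning bottlenecks in multimodal LLMs, finds perception to be the main constraint, and uses reward design to improve visually grounded reasoning, yielding a 5.56-point gain over the base model.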

Comments 24 pages, 15 Figures, 10 Tables

Abstract

Reinforcement learning with verifiable rewards has driven major gains in LLM reasoning, and it is intuitive to assume this recipe will transfer well to multimodal models. However, multimodal models do two things: first, perceive what is in an image, then reason about what it implies. Because these stages are graded jointly, it is hard to tell how much room reasoning alone has to grow. We study this on algorithmic visual puzzles, where both components are necessary and show that perception, not reasoning, is the binding constraint. Replacing images with simple textual descriptions raises performance by over 20 points on average for Claude models. We then evaluate six reward designs aimed at inducing visual grounding during reasoning without chain-of-thought supervision. Training Qwen-2.5-VL-7B with GRPO, reward design induces long, structured reasoning with self-reflection and visual references, yielding a 5.56-point gain over the base model. These gains are, however, uneven; no single reward improves all categories, and rewards with verifiable accuracy signals trade out-of-domain transfer for in-domain accuracy. These results point to perception-aware reward design as a path forward, so that signals correct perception at its source rather than the reasoning that inherits its errors.

2602.18746 2026-06-17 cs.CV Updated version

DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

Xinglong Sun, Kevin Xie, Jenny Schmalfuss, Despoina Paschalidou, Xiuming Zhang, Sanja Fidler, Kashyap Chitta, Jose M. Alvarez

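Affiliations: NVIDIA

AI summary: Proposes DriveJudge, which combines rule-based evaluation with VLM reasoning and selectively invokes physically grounded rule functions for interpretable, context-aware driving evaluation, outperforming existing methods on driving quality classification and trajectory preference selection.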

Comments Under Review

Abstract

Autonomous driving has shifted towards end-to-end policy learning, where reliable, interpretable policy evaluation is a fundamental challenge as driving quality is highly context-dependent. Commonly used rule-based driving metrics like EPDMS are interpretable but lack context-awareness, while recent VLM-based evaluations are context-aware but limited by ambiguous VLM outputs and weak physical grounding. To evaluate driving in a manner that is both interpretable and context-aware, we introduce DriveJudge. DriveJudge is a driving evaluation agent that combines rule-grounded evaluation with Vision-Language Model (VLM) reasoning and selectively invokes physically-grounded deterministic rule functions after interpreting the environmental context. To train and evaluate DriveJudge, we curate a large-scale dataset of 33,577 challenging driving samples with human annotations on whether the driving behavior is reasonable in the given scenario. With this dataset, we address the underexplored problem of driving metric evaluation, and introduce two human-aligned benchmark tasks: Driving Quality Classification and Trajectory Preference Selection. DriveJudge outperforms EPDMS for driving quality classification by 21.23 AUC, and the recent VLM-based DriveCritic for trajectory preference selection by 6.5%, setting a new standard for interpretable and precise driving evaluation.

2606.17386 2026-06-17 cs.CV cs.AI cs.RO New submission

TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

Zikang Xiong, Weixin Li, Zhouchonghao Wu, Akshay Rangesh, Saarth Bonde, Grantland Hall, Chen Tang, Yihan Hu, Wei Zhan

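Affiliations: Applied Intuition; UCLA; UC Berkeley

AI summary: Proposes an end-to-end driving method that needs no expert demonstrations: a policy is pretrained by self-play in a vectorized simulator and then aligned with a pretrained vision backbone, lowering data cost while matching or exceeding existing methods.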

Abstract

End-to-end autonomous driving has achieved state-of-the-art performance on benchmarks and real-world deployments. Its standard training recipe, however, is expensive across all stages: collecting and labeling millions of driving frames is costly, and closed-loop RL on images is bottlenecked by the per-step cost of photorealistic rendering plus a forward pass through a large vision backbone. Self-play in vectorized simulators changes the economics: millions of rollout steps per second, and a state distribution naturally rich in collisions, near-misses, and recoveries that no driving log contains. Our approach exploits this asymmetry by decoupling learning to drive from learning to see. We pretrain a single policy by self-play, then align its latent space with a pretrained vision backbone, through the action KL divergence and a batch-relational low-rank structural loss. The action target comes from the self-play policy, so alignment never supervises against a logged trajectory: a paired dataset of (image, scene-state) frames suffices, with no need for the curated expert demonstrations that imitation pretraining is built on. On photorealistic 3D Gaussian splatting closed-loop scenarios, the resulting end-to-end policy matches or exceeds prior end-to-end methods.
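The alignment step above uses an action KL divergence whose target comes from the self-play policy. As a minimal sketch of that kind of term, the code below computes KL(teacher || student) over a discrete action set. Whether the paper's action space is discrete, and which KL direction it uses, is not stated in the abstract; the names `softmax` and `action_kl` and the toy distributions are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def action_kl(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) over a shared discrete action set."""
    return sum(
        p * math.log(p / max(q, eps))
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


teacher = [0.5, 0.5]           # hypothetical self-play policy output
student = softmax([2.0, 0.0])  # hypothetical vision-policy logits
print(action_kl(teacher, student))
```

Because the target distribution is produced by the teacher policy on the paired scene state, a loss of this form needs no logged expert action for the frame.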

2606.17463 2026-06-17 cs.CV cs.RO New submission

WeaveLA: Event Driven Cross-Subtask Latent Memory Weaving for Repetitive Robot Manipulation

Shoujing Zhu, Zhenyang Liu, Fungmiu Wang, Jiafeng Wang, Bo Yue, Guiliang Liu, Simo Wu, Xiangyang Xue, Taiping Zeng

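Affiliations: Fudan University; School of Data Science, The Chinese University of Hong Kong, Shenzhen; Shanghai Innovation Institute; Shenzhen Loop Area Institute

AI summary: To address the lack of cross-subtask information flow in short-window VLA policies, the paper proposes WeaveLA, which uses event triggers to compress each completed sub-task into latent tokens and inject them into the action-generation path of the next sub-task. It adds a lightweight cross-subtask channel while preserving the base policy's short-window interface, raising success on the hardest repetition task from 0% to 47.8%.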

Abstract

Vision-Language-Action (VLA) policies have achieved remarkable single-step manipulation, yet they remain brittle precisely where each stage depends on what was just completed. The core issue is structural: short-window VLAs lack an explicit channel for routing information across sub-task boundaries, and existing memory-augmented variants either write at every frame, retrieve from demonstration-time stages, or fire at sub-goal events without performing an explicit sub-task-to-sub-task hand-off into the action expert. We identify the sub-goal completion event as the natural temporal unit for cross-subtask memory hand-off, and present WeaveLA (Weave Latent memory for Vision-Language-Action policies), a cross-subtask memory interface that, on top of a frozen VLA backbone, compresses each completed segment into latent tokens via query-driven attention pooling and routes them directly into the action-generation path of the next sub-task. This event-triggered, action-side design preserves the base policy's short-window interface while adding a lightweight cross-subtask channel. Through stratified evaluation on RoboMME with a $π_{0.5}$ backbone, WeaveLA's gains land exactly where the channel is needed: on the hardest repetition slice (SwingXtimes, $N{=}3$), success rises from $0\%$ to $47.8\%$, while single-execution episodes remain unchanged. Per-episode paired analysis confirms the gains are confined to tasks whose causal structure requires cross-subtask information.

2606.17480 2026-06-17 cs.CV cs.RO New submission

MOCHI: Motion Enhancement of Collaborative Human-Object Interactions

Jiye Lee, Yonghun Choi, Jungdam Won

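Affiliations: Department of Computer Science and Engineering, Seoul National University

AI summary: To address misaligned hand-object contact, motion jitter, and missing finger detail in multi-human object interaction data, the paper proposes MOCHI, a two-stage framework that first generates physically plausible hand grasps through optimization and then refines full-body motion with a diffusion model, effectively enhancing noisy data.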

Comments SIGGRAPH 2026 Journal (ACM TOG); Project page: https://jiyewise.github.io/projects/MOCHI/

DOI: 10.1145/3811308

Abstract

Collaborative human-object interaction shows dynamic and complex movements that require mutual anticipation and continuous adjustment between participants and the shared object. Modeling such collaborative multi-human object interaction (MHOI) scenarios requires high-quality data acquisition as a foundational step; however, this is challenging due to the inherent complexity of MHOI where human-human and human-object interactions occur simultaneously. Such complexity leads to noisy MHOI captures characterized by several artifacts: contact misalignment between hands and objects, motion jitter and temporal inconsistencies in the captured sequences, and missing or incomplete finger-level articulation details. To address these challenges, we present MOCHI (MOtion Enhancement of Collaborative Human-object Interactions), a two-stage framework for enhancing noisy MHOI data. Our approach first generates physically plausible hand grasps through optimization from noisy body input, producing grasps that are both physically plausible and semantically consistent with the body pose, where these optimized grasps are extended into complete hand-object interaction sequences. Consequently, the full-body motion for all participants is refined through a diffusion-based noise optimization framework that uses single-person motion priors. During the optimization process, we introduce optimization objectives to encode human-object and human-human interaction information within these single-person priors. Experimental results demonstrate the effectiveness of our pipeline across diverse MHOI data, either acquired by existing capture methods or synthesized by generative models. We further show robustness of our system across varying numbers of participants and types of interactions, and demonstrate various applications including keyframe-based MHOI creation and data augmentation through varying object geometries.

2606.18250 2026-06-17 cs.CV New submission

Future Dynamic 3D Reconstruction: A 3D World Model with Disentangled Ego-Motion

Nils Morbitzer, Jonathan Evers, Artem Savkin, Thomas Stauner, Nassir Navab, Federico Tombari, Stefano Gasperini

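Affiliations: Technical University of Munich (TUM); Munich Center for Machine Learning (MCML)

AI summary: Proposes FR3D, a world model that decouples the scene's 3D evolution from the agent's trajectory and uses teacher-student distillation to achieve geometrically consistent future dynamic 3D reconstruction from monocular observations with zero-shot generalization.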

Comments ICML 2026. Project page: https://fr3d-wm.github.io

Abstract

Forecasting the evolution of dynamic environments is crucial for autonomous agents. While generative world models have recently achieved high photorealism in 2D video synthesis by mixing ego-motion and environmental dynamics within the image plane, they exhibit physical inconsistencies, such as morphing or vanishing objects, especially over long time horizons. In this paper, we propose FR3D, a world model that predicts a persistent 3D latent representation for future dynamic 3D reconstruction. Unlike prior works that treat the world as a sequence of image-based features, FR3D explicitly decouples the 3D evolution of the scene from the agent's trajectory, treating the inferred ego-motion as a latent proxy for action. This disentanglement resolves the ambiguities between self-motion and world-motion, ensuring geometric consistency into the future. Furthermore, we introduce a teacher-student distillation strategy that leverages the spatial "common sense" of off-the-shelf foundation models, leading to robust zero-shot generalization. Extensive experiments demonstrate FR3D's strong performance for future dynamic 3D reconstruction from monocular observations across multiple datasets, even 2 seconds into the future. Project page: https://fr3d-wm.github.io.

2606.17256 2026-06-17 cs.RO cs.CV Cross-list

Contrastive Action-Image Pre-training for Visuomotor Control

Yuvan Sharma, Dantong Niu, Anirudh Pai, Zekai Wang, Zhuoyang Liu, Baifeng Shi, Stefano Saravalle, Boning Shao, Ruijie Zheng, Jing Wang, Konstantinos Kallidromitis, Yusuke Kato, Fabio Galasso, Yuke Zhu, Danfei Xu, Linxi "Jim" Fan, Jitendra Malik, Trevor Darrell, Roei Herzig

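Affiliations: UC Berkeley; NVIDIA; Sapienza University of Rome; Panasonic; ItalAI

AI summary: Proposes CAIP, which uses 3D hand keypoints from large-scale egocentric video as a proxy action signal and learns a unified action-image representation through contrastive learning, substantially improving dexterous manipulation with little robot data.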

Abstract

Existing vision encoders for robotics face a fundamental bottleneck: robotic datasets lack the scale necessary for large-scale pre-training. Prior work circumvents this data scarcity by turning to internet-scale image and language data or egocentric human video. While these models show promise, neither paradigm learns from paired vision and action data, which downstream visuomotor control policies require. However, robot trajectories, the most direct source of this paired signal, are not available at pre-training scale, motivating us to extract action signals from abundant human video instead. To this end, we introduce CAIP (Contrastive Action-Image Pre-training), a vision encoder that treats human hand poses from large-scale egocentric video as a proxy for end-effector actions. By extracting 3D hand keypoints, a representation that aligns naturally with downstream robot action spaces, CAIP learns a unified action-image representation through a contrastive objective. Leveraging 32,041 hours of egocentric human video and only 88 hours of robotic manipulation data, CAIP outperforms state-of-the-art vision encoders including DINOv2, SigLIP, MVP, and R3M. Evaluated on a challenging real-world dexterous manipulation setup using Dexmate Vega and Sharpa Wave hands, CAIP yields performance gains of more than 30% on tasks involving folding, pouring, and fine-grained manipulation. Our results show that our method of contrastive action-centric pre-training yields a scalable path to achieving robust visual representations better suited for physical interaction.
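The abstract names a contrastive objective over paired action and image representations but not its exact form. The sketch below shows a CLIP-style symmetric InfoNCE loss as one plausible instance; the temperature value, the symmetric averaging, and the function name `info_nce` are assumptions rather than details taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def info_nce(actions, images, temperature=0.1):
    """Symmetric InfoNCE over a batch; row i of each list is a matched pair."""
    a = [_normalize(v) for v in actions]
    m = [_normalize(v) for v in images]
    n = len(a)
    # Cosine-similarity logits between every action and every image embedding.
    logits = [
        [sum(x * y for x, y in zip(a[i], m[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def cross_entropy(rows):
        # Mean negative log-probability of the diagonal (matched) entry.
        total = 0.0
        for i, row in enumerate(rows):
            peak = max(row)
            log_z = peak + math.log(sum(math.exp(v - peak) for v in row))
            total += log_z - row[i]
        return total / len(rows)

    columns = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy(logits) + cross_entropy(columns))


# Hypothetical 2-D embeddings: matched pairs vs. deliberately swapped pairs.
aligned = info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned, swapped)
```

The loss is small when each hand-pose embedding is closest to its own frame and large when pairs are mismatched, which is what pulls the two modalities into a shared space.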

2606.17376 2026-06-17 cs.RO cs.CV Cross-list

Contactless Respiratory Monitoring on Heterogeneous Mobile Robots: A Multimodal Edge-Computing Framework

Milind Rampure, Shadman Sakib, Haley Patel, Zahid Hasan, Nirmalya Roy

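Affiliations: University of Maryland, Baltimore County

AI summary: Proposes a multimodal contactless respiratory-rate monitoring framework for heterogeneous mobile robots that combines adaptive sensor selection, keypoint-guided ROI extraction, and signal-quality filtering to monitor robustly across platforms and lighting conditions without platform-specific tuning.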

Comments 8 pages, 6 figures. To appear in Proceedings of the 8th International Workshop on IoT Applications and Industry 5.0 (IoTI5 2026), co-located with IEEE DCOSS-IoT 2026, Reykjavik, Iceland, June 2026

Abstract

Respiratory-rate (RR) monitoring is a critical component of remote triage and victim assessment in emergency response, disaster recovery, and infectious-disease scenarios, where minimizing physical contact can reduce responder risk and improve operational safety. However, field deployment of contactless RR monitoring remains challenging due to variable illumination, posture changes, platform heterogeneity, and the impracticality of wearable sensors in hazardous environments. In this paper, we present a modality-adaptive contactless RR monitoring framework for heterogeneous mobile robots with onboard edge computing. The proposed system combines brightness-adaptive sensor selection across RGB, thermal, near-infrared (NIR), and low-light cameras, keypoint-guided chest ROI extraction for posture-robust monitoring, and a signal-quality-index (SQI)-based filtering mechanism for reliable respiratory estimation. We implement and evaluate the framework on three robotic platforms spanning quadruped and wheeled locomotion and multiple edge-computing architectures. Experiments conducted across diverse lighting conditions, subject poses, and robot-to-subject distances demonstrate that the framework generalizes across platforms without per-platform algorithmic retuning, while revealing modality-specific operational boundaries. RGB provides the broadest coverage up to 8m, NIR remains effective up to 6m, thermal is reliable only at short range, and low-light sensing supports monitoring in complete darkness up to 8m. Overall, the results demonstrate the feasibility of multimodal contactless RR monitoring on mobile robots and support its use as a foundation for autonomous triage and victim assessment in hazardous search-and-rescue settings.
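The abstract does not describe how a respiratory rate is computed from the chest-ROI signal, so the sketch below uses a deliberately simple stand-in: counting upward zero crossings of the mean-removed signal. The function `estimate_rr_bpm`, the 10 Hz sampling rate, and the synthetic sine trace are all assumptions for illustration; the paper's estimator and its SQI filter may differ.

```python
import math


def estimate_rr_bpm(signal, fs_hz):
    """Estimate breaths per minute by counting upward zero crossings."""
    mean = sum(signal) / len(signal)
    centered = [s - mean for s in signal]
    # One upward crossing per breathing cycle for a clean periodic signal.
    crossings = sum(
        1 for prev, cur in zip(centered, centered[1:]) if prev < 0.0 <= cur
    )
    duration_s = len(signal) / fs_hz
    return crossings / duration_s * 60.0


# Synthetic 60 s chest-motion trace at 0.25 Hz, i.e. 15 breaths per minute.
fs = 10.0
trace = [math.sin(2 * math.pi * 0.25 * (i / fs) + 0.5) for i in range(600)]
print(estimate_rr_bpm(trace, fs))
```

A real deployment would need band-pass filtering and a quality gate before trusting such a count, which is the role the abstract assigns to SQI-based filtering.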

2606.17408 2026-06-17 cs.RO cs.CV cs.LG Cross-list

Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies

Meipo Dai, Qiyuan Zhuang, He-Yang Xu, Ying-Jie Shuai, Yijun Wang, Qi Dou, Xiu-Shen Wei

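Affiliations: Southeast University; The Chinese University of Hong Kong

AI summary: Proposes LeaP, which uses a lightweight MLP to predict a proprioception-conditioned diagonal Gaussian as the source prior for action generation in place of the standard Gaussian, reaching an 81.6% average success rate on 15 RoboTwin tasks and beating baselines by 6.5 to 25.5 percentage points.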

Abstract

Generative robot policies typically begin action generation from an observation-independent standard Gaussian distribution, leaving the choice of source distribution underexplored. This work asks a simple question: where should action generation begin? We propose LeaP, a Learnable source Prior that replaces the standard Gaussian with a proprioception-conditioned diagonal Gaussian over action chunks. Parameterized by a lightweight MLP, LeaP jointly predicts the mean and state-adaptive variance of the source distribution, while keeping the downstream generator architecture and inference solver unchanged. This design provides an observation-informed yet stochastic initialization, allowing the generator to focus on precise action refinement rather than transporting samples from an uninformed noise source. On 15 RoboTwin manipulation tasks, LeaP achieves an average success rate of 81.6%, outperforming four representative baselines -- including deterministic-source methods, a no-prior counterpart, and a diffusion-bridge policy -- by 6.5 to 25.5 percentage points. The same prior consistently improves both flow-matching and diffusion-bridge generators, while using fewer parameters and converging faster. The advantage carries over to real-world deployment, where LeaP attains the best performance. These results suggest that the source distribution is an independent and reusable design axis for generative robot policies, complementary to the choice of generative dynamics.
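The idea of a proprioception-conditioned diagonal Gaussian source can be sketched as a small MLP that outputs a mean and a log-variance, followed by a reparameterized draw. The layer sizes, the tanh activation, the log-variance parameterization, and the names `predict_prior`, `sample_source`, and `params` below are assumptions; the abstract only states that a lightweight MLP jointly predicts the mean and a state-adaptive variance.

```python
import math
import random


def predict_prior(proprio, params):
    """One-hidden-layer MLP mapping proprioception to (mean, log_var)."""
    hidden = [
        math.tanh(sum(w * x for w, x in zip(row, proprio)) + b)
        for row, b in zip(params["W1"], params["b1"])
    ]
    mean = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(params["W_mu"], params["b_mu"])
    ]
    log_var = [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(params["W_lv"], params["b_lv"])
    ]
    return mean, log_var


def sample_source(mean, log_var, rng):
    """Draw x0 ~ N(mean, diag(exp(log_var))) via reparameterization."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


# Toy weights (hypothetical): 2-D proprioception, 2 hidden units, 3-D action.
params = {
    "W1": [[0.5, -0.3], [0.1, 0.8]], "b1": [0.0, 0.1],
    "W_mu": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], "b_mu": [0.0, 0.0, 0.0],
    "W_lv": [[0.1, 0.1], [0.1, 0.1], [0.1, 0.1]], "b_lv": [-1.0, -1.0, -1.0],
}
mean, log_var = predict_prior([0.1, -0.2], params)
x0 = sample_source(mean, log_var, random.Random(0))
print(mean, log_var, x0)
```

The sample `x0` would replace the standard-Gaussian draw that a flow-matching or diffusion generator normally starts from; setting the mean to zero and the log-variance to zero recovers that standard source.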


2606.17446 2026-06-17 cs.RO cs.CV 交叉投稿

AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation

AnnotateAnything：面向机器人操作的3D资产自动标注

Haoran Lu, Mutian Shen, Shuyang Yu, Yu Xiao, Songling Liu, Jianshu Zhang, Shang Wu, Yue Chen, Guo Ye, Jiayi Wang, Zhaoran Wang, Han Liu

发表机构 * Northwestern University（西北大学）； Peking University（北京大学）

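AI summary: Proposes AnnotateAnything, a framework whose two pipelines, vision-language annotation and physics annotation, automatically generate executable manipulation labels for 3D assets, improving simulation data-collection efficiency and task success rates.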

AI中文摘要

仿真使得可扩展的机器人数据收集成为可能，但原始3D资产仅提供几何信息，缺乏指定机器人应在何处以及如何操作的语义、交互和物理知识。在这项工作中，我们提出了AnnotateAnything，一个通用的自动标注框架，将被动3D资产转换为具有结构化、多样化和可执行操作标签的、可用于操作的资产。AnnotateAnything围绕两个互补的流水线构建。首先，一个统一的视觉-语言标注流水线，利用视觉-语言推理来推断对象语义、交互约束和3D接地线索，为识别有意义的交互区域提供人类先验指导。其次，一个全自动且大规模并行的物理标注流水线，通过候选生成、几何优化和轨迹生成，将这些先验知识嵌入每个资产的几何和物理约束中。该流水线生成多样且可执行的动作标注，包括抓取姿态、灵巧接触、关节运动路径点、插入方向、悬挂可供性和导航目标。利用生成的标注，我们进一步构建了一个跨不同对象、任务和机器人形态的异步并行仿真数据收集系统。实验表明，与现有的标注和数据生成流水线相比，AnnotateAnything在标注效率、数据收集效率和任务成功率方面均表现优越，同时支持下游任务如可供性检测、机器人VQA和视觉指令微调。我们在项目页面上提供项目材料，并计划发布完整代码、标注和基准以促进未来研究。视频、代码、演示资产和标注在补充材料中提供。项目页面：此https URL。

英文摘要

Simulation enables scalable robot data collection, but raw 3D assets provide only geometry, lacking the semantic, interactive, and physical knowledge needed to specify where and how robots should act. In this work, we present AnnotateAnything, a general automatic annotation framework that converts passive 3D assets into manipulation-ready assets with structured, diverse, and executable manipulation labels. AnnotateAnything is built around two complementary pipelines. First, a unified visual-language annotation pipeline uses vision-language reasoning to infer object semantics, interaction constraints, and 3D-grounded cues, providing human-prior guidance for identifying meaningful interaction regions. Second, a fully automatic and massively parallel physics annotation pipeline grounds these priors in each asset's geometry and physical constraints through candidate generation, geometry optimization, and trajectory generation. This pipeline produces diverse and executable action annotations, including grasp poses, dexterous contacts, articulation waypoints, insertion directions, hanging affordances, and navigation targets. Using the generated annotations, we further build an asynchronous parallel simulation data-collection system across diverse objects, tasks, and robot embodiments. Experiments demonstrate that AnnotateAnything achieves superior annotation efficiency, data-collection efficiency, and task success rates over existing annotation and data-generation pipelines, while also supporting downstream tasks such as affordance detection, robotic VQA, and visual instruction finetuning. We provide project materials on the project page and plan to release the full code, annotations, and benchmark to facilitate future research. Videos, code, demo assets, and annotations are provided in supplementary materials. Project page: https://tourmaline-caramel-169490.netlify.app.


2606.17511 2026-06-17 cs.RO cs.AI cs.CV 交叉投稿

MagicSim: A Unified Infrastructure for Executable Embodied Interaction

MagicSim: 可执行具身交互的统一基础设施

Haoran Lu, Songling Liu, Yue Chen, Guo Ye, Mutian Shen, Shuyang Yu, Yu Xiao, Jihai Zhao, Shang Wu, Jianshu Zhang, Xiangtian Gui, Chuye Hong, Yuran Wang, Maojiang Su, Jiayi Wang, Ruihai Wu, Zhaoran Wang, Han Liu

发表机构 * Northwestern University（西北大学）； Peking University（北京大学）； University of California, Berkeley（加州大学伯克利分校）； ShanghaiTech University（上海科技大学）

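AI summary: Proposes MagicSim, an embodied-interaction infrastructure built on a deterministic batched runtime and a shared MDP; YAML specifications decouple contents, placement, behavior, and agent exposure, unifying world construction, execution, evaluation, and automatic trajectory generation.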

AI中文摘要

机器人学习和具身智能体现在需要模拟作为连接控制、技能和规划的共享执行基底，而不仅仅是渲染器、控制器测试平台或固定任务环境。现有的流水线通过“魔法”动作、脱节的训练环境或仅前向渲染来分割这些层，无法重现、评估和标注同一情节。我们提出MagicSim，一个围绕确定性批处理运行时和共享马尔可夫决策过程（MDP）构建的具身交互基础设施。通过YAML优先的规范解耦内容、放置、行为和智能体暴露，MagicSim在单一重置-步进循环中构建多样化的可执行世界，涵盖任务族、交互模式、物理、布局、传感器、化身和机器人具身。一个通用的执行接口通过控制器、原子技能、规划器原语和异步规划将高级命令具体化，将其实现为机器人动作而非模拟器端的状态编辑。一个任务定义支持三种能力：基准测试和强化学习评估、自动收集接口（自动将命令转化为具体轨迹）以及面向智能体/VLM的交互。对于自动执行，命令流经Command->Skill->Planner->Robot->Record流水线，而每个环境的命令、技能、规划、重试、标注和情节状态在共享物理滴答之上独立推进。成功的展开被保存为结构化的多模态轨迹，将语言监督、动作表示、视觉/几何表示和任务级别状态与执行的情节对齐。因此，MagicSim在一个规划器在环运行时中统一了多样化的世界构建、具身执行、任务评估、自动展开生成和交互式智能体接口。

英文摘要

Robot learning and embodied agents now require simulation to serve as a shared execution substrate linking control, skills, and planning, not only as a renderer, controller testbed, or fixed task environment. Existing pipelines split these layers with "magic" actions, disconnected training environments, or forward-only renders that cannot reproduce, evaluate, and annotate the same episode. We present MagicSim, an embodied interaction infrastructure built around one deterministic batched runtime and a shared Markov decision process (MDP). From YAML-first specifications that decouple contents, placement, behavior, and agent exposure, MagicSim constructs diverse executable worlds spanning task families, interaction regimes, physics, layouts, sensors, avatars, and robot embodiments in one reset-and-step loop. A common execution interface grounds high-level commands through controllers, atomic skills, planner primitives, and asynchronous planning, realizing them as robot actions rather than simulator-side state edits. One task definition supports three capabilities: benchmark and RL evaluation, an autocollect interface that automatically turns commands into grounded trajectories, and agent/VLM-facing interaction. For automatic execution, commands flow through a Command->Skill->Planner->Robot->Record pipeline, while per-environment command, skill, planning, retry, annotation, and episode states advance independently above the shared physics tick. Successful rollouts are saved as structured multimodal trajectories aligning language supervision, action representations, visual/geometric representations, and task-level status with the executed episode. MagicSim thus unifies diverse world construction, embodied execution, task evaluation, automatic rollout generation, and interactive agent interfaces in one planner-in-the-loop runtime.


2606.17520 2026-06-17 cs.RO cs.CV 交叉投稿

GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments

GASE：基于高斯溅射的自动化系统用于重建具身仿真环境

Jiawei Zhang, Yiming Yan, Chao Liang, Nuo Xu, Seson Sun, Qichen Zhang, Yuhao Xu, Yantai Yang, Yingqiao Wang, Qin Jin, Zhipeng Zhang

发表机构 * AutoLab, SAI, Shanghai Jiao Tong University（上海交通大学SAI学院AutoLab实验室）； AIM3 Lab, School of Information, Renmin University of China（中国人民大学信息学院AIM3实验室）； Research Lab, Anyverse Dynamics（Anyverse Dynamics研究实验室）

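AI summary: Proposes GASE, a system that scans environments with multi-view video streams from panoramic camera arrays, extracts foreground objects with a camera-pose-based strategy, inpaints the scene, and reconstructs objects and background independently before importing them into physics simulators; segmentation accuracy improves by over 10%, and real-robot deployments stay within a 10% performance gap.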

AI中文摘要

在现实世界中训练具身代理需要熟练的操作人员和昂贵的硬件。仿真环境通过实现大规模、低成本的数据增强提供了一种引人注目的替代方案。因此，快速构建具有最小仿真到现实差距的高保真仿真场景已成为机器人学习的关键目标。尽管基于重建的方法提供了优越的视觉质量，但当前的工作流程受到低效的数据采集和次优的前景物体提取的阻碍。因此，我们提出了GASE，一个高度自动化的仿真场景构建系统。GASE利用全景相机阵列的多视角视频流实现快速环境扫描。为确保高质量的资产生成，我们的流程引入了一种基于相机位姿的策略，在2D域中跨帧鲁棒地提取物体，随后进行高保真场景修复。前景物体和静态背景随后被独立重建，并无缝导入物理仿真器用于策略训练。大量实验表明，GASE在分割精度上比现有的基于3D高斯的方法提高了超过10%，同时实现了最先进的修复质量。此外，在操作和导航任务中的真实机器人部署保持了与纯真实世界数据训练策略相比低于10%的性能差距。这些结果证实GASE为弥合仿真到现实差距提供了高效且高度有效的解决方案。代码将发布。

英文摘要

Training embodied agents in the real world requires skilled operators and expensive hardware. Simulation environments offer a compelling alternative by enabling large-scale, cost-effective data augmentation. Consequently, rapidly constructing high-fidelity simulation scenes with a minimal sim-to-real gap has become a critical objective in robot learning. While reconstruction-based methods provide superior visual quality, current workflows are hindered by inefficient data acquisition and subpar foreground object extraction. We thus propose GASE, a highly automated system for simulation scene construction. GASE leverages multi-view video streams from panoramic camera arrays to enable rapid environment scanning. To ensure high-quality asset generation, our pipeline introduces a camera-pose-based strategy that robustly extracts objects across frames in the 2D domain, followed by high-fidelity scene inpainting. Foreground objects and the static background are then reconstructed independently and seamlessly imported into physics simulators for policy training. Extensive experiments demonstrate that GASE outperforms existing 3D Gaussian-based methods in segmentation accuracy by over 10% while achieving state-of-the-art inpainting quality. Furthermore, real-robot deployments across manipulation and navigation tasks maintain a performance gap of less than 10% compared to policies trained purely on real-world data. These results confirm that GASE provides an efficient and highly effective solution for bridging the sim-to-real gap. Code will be released.


2606.17598 2026-06-17 cs.RO cs.CV 交叉投稿

MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation

MuseVLA: 一种用于机器人操作的自适应多模态感知视觉-语言-动作模型

Xingyuming Liu, Ruichun Ma, Heyu Guo, Qixiu Li, Qingwen Yang, Lin Luo, Shiqi Jiang, Chenren Xu, Jiaolong Yang, Baining Guo

发表机构 * School of Computer Science, Peking University（北京大学计算机科学学院）； Microsoft Research Asia（微软亚洲研究院）； Princeton University（普林斯顿大学）； Tsinghua University（清华大学）

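AI summary: Proposes MuseVLA, which achieves adaptive multimodal sensing by integrating sensors as on-demand tools; it designs a unified sensor-image representation and introduces a data synthesis pipeline, reaching an average success rate of 80.6% on dexterous hand manipulation tasks and significantly outperforming RGB-only and multisensory baselines.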

AI中文摘要

人类自然地利用多种感知模态与物理世界交互，而大多数用于机器人的视觉-语言-动作（VLA）模型仅依赖RGB观测。这限制了它们感知难以或无法从RGB相机推断的物理属性（如温度、声音或雷达响应）的能力。我们提出MuseVLA，一种自适应多模态感知VLA模型，将新型传感器作为按需工具集成到机器人操作中。给定任务指令和视觉上下文，MuseVLA首先生成一个传感器令牌和目标描述，选择要调用的感知模态和关注对象，类似于带参数的工具调用。然后，它将选定的传感器测量值转换为接地传感器图像，这是一种统一的中间表示，编码异构读数以进行多模态融合和动作生成。这种设计将传感器特定处理与VLA主干解耦，实现了多种模态的高效集成。为了减少对昂贵的多传感器机器人数据集的需求，我们进一步引入了一种数据合成流水线，用接地传感器图像增强现有的RGB视频数据集，从而实现对未见过的传感器引导任务的泛化。我们在真实机器人上评估了MuseVLA，涉及需要多模态感知输入的挑战性灵巧手操作任务，包括温度引导的拾取与放置、音频驱动的物体搜索和雷达辅助的隐藏物体检索。MuseVLA平均成功率达到80.6%，显著优于仅RGB和多模态VLA基线，并在未见任务上表现出强大的零样本能力。

英文摘要

Humans naturally leverage diverse sensing modalities to interact with the physical world, while most Vision-Language-Action (VLA) models for robotics rely solely on RGB observations. This limits their ability to perceive physical properties that are difficult or impossible to infer from RGB cameras, such as temperature, sound, or radar response. We present MuseVLA, an adaptive multimodal sensing VLA model that integrates novel sensors as on-demand tools for robotic manipulation. Given a task instruction and visual context, MuseVLA first generates a sensor token and target description that select the sensing modality to invoke and what to attend to, analogous to a tool call with arguments. It then converts the selected sensor measurement into a grounded sensor image, a unified intermediate representation that encodes heterogeneous readings for multimodal fusion and action generation. This design decouples sensor-specific processing from the VLA backbone, enabling efficient integration of diverse modalities. To reduce the need for expensive multisensory robot datasets, we further introduce a data synthesis pipeline that augments existing RGB video datasets with grounded sensor images, enabling generalization to unseen sensor-guided tasks. We evaluate MuseVLA on a real-world robot across challenging dexterous hand manipulation tasks that require multimodal sensing inputs, including temperature-guided pick-and-place, audio-driven object search, and radar-assisted hidden object retrieval. MuseVLA achieves 80.6% success rate on average, outperforming RGB-only and multisensory VLA baselines significantly, and exhibits strong zero-shot capabilities on unseen tasks.


2606.17739 2026-06-17 cs.RO cs.AI cs.CV cs.MA 交叉投稿

ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents

ED3R: 能量感知的分布式灾难检测——基于协作机器人智能体

Lina Magoula, Nikolaos Koursioumpas, Nancy Alonistioti, Ramin Khalili

发表机构 * Dept. of Informatics and Telecommunications, National and Kapodistrian University of Athens（雅典大学信息学与电信系）； Huawei Heisenberg Research Center (Munich)（华为海森堡研究中心（慕尼黑））

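AI summary: Proposes ED3R, a framework that combines hierarchical cooperation between a robot and a remote controller with distributed neural regression prediction to detect wildfires under uncertainty at minimal energy cost, with a mission success rate of up to 97.18%, up to 36.4% lower energy consumption, and up to 41% faster detection.

Comments 14 pages, 9 figures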

AI中文摘要

机器人技术有望支持环境监测和自然灾害管理，在这些场景中，决策必须在不确定性、资源限制和严格操作约束下做出。在关键任务（如野火）中，机器人智能体不仅需要以足够置信度识别危险事件，还需管理能量成本和检测时间。本文介绍ED3R，一种用于不确定性下野火检测的能量感知分布式框架。ED3R实现了机器人与远程控制器之间的分层协作决策：远程控制器决定机器人的运动，而机器人感知环境并决定在何处（机载或远程）以及如何执行野火检测。共同目标是以所需置信度检测野火，同时最小化任何机器人操作消耗的能量。ED3R进一步集成了避免附近障碍物、防止冗余探索、实现自适应早期任务完成以及通过自定义惩罚函数确保可行性的机制。ED3R还引入了前瞻能力，通过分布式神经回归模型使智能体能够在执行前评估候选策略以预测未来。该框架通过逼真的机器人仿真、消融研究和基线比较进行评估。总体而言，ED3R的任务成功率高达97.18%。尤其是在最具挑战性的任务中，它比基线减少高达36.4%的能量消耗，并提前高达41%检测到野火。

英文摘要

Robotics are expected to support environmental monitoring and natural disaster management, where decisions must be made under uncertainty, resource limitations, and strict operational constraints. In critical missions, such as wildfires, robotic agents must not only identify hazardous events with sufficient confidence, but also manage the energy cost and time until detection. This paper introduces ED3R, an energy-aware distributed framework for wildfire detection under uncertainty. ED3R enables hierarchical cooperative decision-making between a robot and a remote controller. The remote controller decides upon the robot's motion, while the robot senses the environment and decides where to execute the wildfire detection (onboard or remotely) and how. The common goal is to detect wildfires with a required confidence while minimizing the energy consumed by any robot operation. ED3R further integrates mechanisms to avoid nearby obstacles, prevent redundant exploration, enable adaptive early mission completion, and ensure feasibility through a custom penalty function. ED3R also introduces a forward-looking capability, enabled through distributed neural regression models that allow the agents to anticipate the future by evaluating candidate strategies before execution. The framework is evaluated through realistic robotics simulations, ablation studies, and baseline comparisons. Overall, ED3R achieves a mission success rate of up to 97.18%. Especially in the most demanding missions, it reduces energy consumption by up to 36.4% and detects wildfires up to 41% faster than baselines.
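The decision rule in this abstract (reach a required confidence, minimise energy, penalise infeasible choices, and score candidates before execution) can be sketched as follows. This is a hypothetical illustration only: the candidate names, the lookup-based predictors standing in for the learned regression models, the linear penalty, and `penalty_weight` are all assumptions, not the paper's formulation.

```python
def select_strategy(candidates, predict_energy, predict_confidence,
                    required_confidence, penalty_weight=1000.0):
    """Pick the candidate with the lowest penalised predicted energy."""
    best, best_cost = None, float("inf")
    for cand in candidates:
        energy = predict_energy(cand)
        # Penalise any shortfall against the required detection confidence.
        shortfall = max(0.0, required_confidence - predict_confidence(cand))
        cost = energy + penalty_weight * shortfall
        if cost < best_cost:
            best, best_cost = cand, cost
    return best, best_cost


# Hypothetical candidates; the lookups below stand in for learned regressors.
candidates = [
    {"name": "onboard_low_res", "energy": 2.0, "confidence": 0.80},
    {"name": "onboard_high_res", "energy": 5.0, "confidence": 0.93},
    {"name": "remote_high_res", "energy": 3.5, "confidence": 0.95},
]
best, cost = select_strategy(
    candidates,
    predict_energy=lambda c: c["energy"],
    predict_confidence=lambda c: c["confidence"],
    required_confidence=0.90,
)
print(best["name"], cost)
```

With these made-up numbers the cheapest option misses the confidence requirement and is penalised, so the remote high-resolution option is selected at a cost of 3.5.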


2606.18112 2026-06-17 cs.RO cs.CV 交叉投稿

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

Qwen-RobotNav 技术报告：为智能体导航系统设计的可扩展导航模型

Jiazhao Zhang, Gengze Zhou, Hale Yin, Yiyang Huang, Zixing Lei, Qihang Peng, Haoqi Yuan, Jie Zhang, Xudong Guo, Xiaoyue Chen, An Yang, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou, Zhuoyuan Yu, Jingyang Fan, Zhixuan Liang, Pei Lin, Ye Wang, Anzhe Chen, Kun Yan, Xiao Xu, Jiahao Li, Lulu Hu, Minying Zhang, Shurui Li, Wenhu Xiao, Shuai Bai, Xuancheng Ren, Chenxu Lv, Chenfei Wu, Xiong-Hui Chen

发表机构 * Qwen Team（通义实验室）

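AI summary: Proposes Qwen-RobotNav, a scalable navigation model whose parameterised interface supports multiple task modes and adjustable observation parameters; it is trained on 15.6M samples, with vision-language co-training to prevent behavioural collapse, sets new state-of-the-art results on several navigation benchmarks, and shows zero-shot generalisation.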

AI中文摘要

智能体导航系统需要一个基础导航模型，其观测策略可以在推理时从外部重新配置，因为指令跟随、目标搜索、目标跟踪和自动驾驶共享相同的感知规划主干，但对视觉流的消费方式有根本不同的要求。我们提出 Qwen-RobotNav，一个建立在 Qwen-RobotNav 上的可扩展导航模型，通过一个具有两个互补维度的参数化接口来解决这个问题：多个任务模式选择导航行为，以及可控的观测参数（例如，token 预算、每个摄像头的权重）控制视觉历史的编码方式。通过训练时对所有参数进行随机化，Qwen-RobotNav 对任何推理时配置都具有鲁棒性，无需对 Qwen-RobotNav 主干进行任何架构修改。我们在15.6M样本上训练 Qwen-RobotNav；与视觉语言数据联合训练防止了在仅轨迹训练中观察到的反应性动作序列映射器的坍缩。参数化接口也使 Qwen-RobotNav 成为智能体系统的自然构建块：对于长时域场景，上层规划器将目标分解为子任务，并在情节中动态切换 Qwen-RobotNav 的任务模式和上下文策略，通过重复调用同一模型组合出复杂行为。大量实验表明，Qwen-RobotNav 在主要导航基准上取得了新的最优结果。该模型从2B到8B参数展现出良好的扩展性，联合多任务训练发展出一个跨任务族迁移的共享空间规划基板，并在多样环境中对真实世界机器人展现出强大的零样本泛化能力。

英文摘要

Agentic navigation systems require a base navigation model whose observation strategy can be externally reconfigured at inference time, because instruction following, object search, target tracking, and autonomous driving share the same perception-planning backbone yet demand fundamentally different strategies for consuming the visual stream. We present Qwen-RobotNav, a scalable navigation model built on Qwen-RobotNav that addresses it through a parameterised interface with two complementary dimensions: multiple task modes that select the navigation behaviour, and controllable observation parameters (e.g., token budget, per-camera weights) that govern how visual history is encoded. With training-time randomization over all parameters, Qwen-RobotNav is robust to any inference-time configuration requiring zero architectural modification to the Qwen-RobotNav backbone. We train Qwen-RobotNav on 15.6M samples; co-training with vision-language data prevents the collapse into reactive action-sequence mappers observed in trajectory-only training. The parameterised interface also makes Qwen-RobotNav a natural building block for agentic systems: for long-horizon scenarios, an upper-level planner decomposes goals into sub-tasks and dynamically switches Qwen-RobotNav's task mode and context strategy mid-episode, composing complex behaviours from repeated calls to the same model. Extensive experiments show that Qwen-RobotNav sets new state-of-the-art results across major navigation benchmarks. The model exhibits favourable scaling from 2B to 8B parameters, with joint multi-task training developing a shared spatial-planning substrate that transfers across task families, and demonstrates strong zero-shot generalisation to real-world robots across diverse environments.
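The training-time randomisation over interface parameters mentioned in this abstract might look roughly like the sketch below. The task-mode names follow the four tasks listed in the abstract, but the token-budget values, the number of cameras, the sampling distributions, and the `allocate_tokens` helper are hypothetical and are not taken from the technical report.

```python
import random

TASK_MODES = [
    "instruction_following",
    "object_search",
    "target_tracking",
    "autonomous_driving",
]


def sample_config(rng, num_cameras=4, token_budgets=(256, 512, 1024, 2048)):
    """Sample one interface configuration for a training example."""
    raw = [rng.random() + 1e-6 for _ in range(num_cameras)]
    total = sum(raw)
    return {
        "task_mode": rng.choice(TASK_MODES),
        "token_budget": rng.choice(token_budgets),
        "camera_weights": [r / total for r in raw],   # normalised to sum to 1
    }


def allocate_tokens(config):
    """Split the token budget across cameras in proportion to their weights."""
    budget = config["token_budget"]
    alloc = [int(budget * w) for w in config["camera_weights"]]
    # Hand the rounding remainder to the camera with the largest share.
    alloc[alloc.index(max(alloc))] += budget - sum(alloc)
    return alloc


rng = random.Random(0)
config = sample_config(rng)
tokens_per_camera = allocate_tokens(config)
```

Sampling a fresh configuration per training example is one plausible way to expose a model to the whole configuration space, so that an upper-level planner can later switch modes and budgets at inference time.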


2603.28251 2026-06-17 cs.CV cs.AI 版本更新

Improving and Evaluating Hand-Object Interaction Detection

Ahmad Darkhalil, Dima Damen, David Fouhey

发表机构 * School of Computer Science, University of Bristol, Bristol, UK（布里斯托大学计算机科学学院）； Computer Science and Electrical and Computer Engineering, New York University, NY, US（纽约大学计算机科学与电气与计算机工程系）

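AI summary: Proposes HOI-DETR, a framework that brings hand-object and object-object interactions into the Co-DETR architecture and substantially improves detection performance on four datasets, raising mAP by more than 20 percentage points.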

Comments Project page: https://ahmaddarkhalil.github.io/HOI-DETR/

2606.17475 2026-06-17 cs.CV 新提交

StereoFactory: A Unified Merging Framework for Robust Stereo Matching

StereoFactory: 一种用于鲁棒立体匹配的统一合并框架

Xianda Guo, Pinhan Fu, Ruilin Wang, Wenke Huang, Mang Ye, Qin Zou

发表机构 * School of Computer Science, Wuhan University（武汉大学计算机学院）； D-Star Robotics ； Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算与数据科学学院）

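AI summary: Proposes StereoFactory, a coarse-to-fine evolutionary framework for adaptive model merging that selects model subsets with a genetic algorithm and optimizes module-level routing with CMA-ES, lowering error on several benchmarks while requiring far less time than joint retraining.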

AI中文摘要

立体匹配通过在大规模数据集上训练的基础模型取得了进展，但这种范式存在可扩展性瓶颈：引入新数据需要昂贵的联合重新训练。模型合并提供了一种可扩展的事后替代方案，在源检查点可用后整合来自专门模型的知识。然而，现有的合并方法通常保留所有可用模型或依赖贪婪包含，这可能会保留有害的任务向量干扰。我们提出StereoFactory，一种用于自适应模型合并的粗到细进化框架。第一阶段采用遗传算法搜索模型子集的组合空间，确定哪些模型应该参与。第二阶段通过CMA-ES优化对所选任务向量进行架构自适应路由，并可选地进行模块级缩放，解决模块级知识专门化问题（不同功能模块对知识源表现出不同偏好）。在两个架构和四个基准上的实验表明，在相同检查点池下，StereoFactory始终达到最佳的四基准平均值，相对于最强的受控基线，在NMRF上将平均误差从3.80降至3.30，在FoundationStereo上从2.88降至2.19。事后搜索仅需要相应联合重新训练挂钟时间的2.7–3.7%。分析表明，知识贡献本质上是模块特定的，所选子集可以在架构间转移且性能下降最小。代码将在接收后公开发布于：此 https URL。

英文摘要

Stereo matching has advanced through foundation models trained on large-scale datasets, yet this paradigm suffers from a scalability bottleneck: incorporating new data requires costly joint retraining. Model merging offers a scalable post-hoc alternative by integrating knowledge from specialized models after source checkpoints are available. However, existing merging methods typically retain all available models or rely on greedy inclusion, which can preserve harmful task-vector interference. We propose StereoFactory, a coarse-to-fine evolutionary framework for adaptive model merging. Stage 1 employs a genetic algorithm to search the combinatorial space of model subsets, determining which models should participate. Stage 2 addresses module-level knowledge specialization (different functional modules exhibit distinct preferences for knowledge sources) through CMA-ES optimization of architecture-adaptive routing over the selected task vectors, with optional module-level scaling. Experiments across two architectures and four benchmarks demonstrate that StereoFactory consistently achieves the best four-benchmark average under the same checkpoint pool, reducing the average error from 3.80 to 3.30 on NMRF and from 2.88 to 2.19 on FoundationStereo relative to the strongest controlled baseline. The post-hoc search requires only 2.7-3.7% of the corresponding joint-retraining wall-clock time. Analysis reveals that knowledge contributions are inherently module-specific, and selected subsets can transfer across architectures with minimal degradation. Code will be publicly released upon acceptance at: https://github.com/XiandaGuo/StereoFactory.
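The two ingredients named in this abstract, choosing which checkpoints participate and scaling their task vectors per module, can be illustrated with a toy sketch. It makes several assumptions: parameters are tiny lists, the validation error is a made-up squared distance, the per-module scale is fixed by hand, and an exhaustive subset search stands in for the genetic algorithm and CMA-ES, neither of which is implemented here.

```python
from itertools import combinations


def task_vector(base, finetuned):
    """Task vector: per-module difference between a fine-tuned and the base model."""
    return {m: [f - b for f, b in zip(finetuned[m], base[m])] for m in base}


def merge(base, task_vectors, subset, module_scale):
    """Add the selected task vectors to the base, scaled per module."""
    merged = {m: list(w) for m, w in base.items()}
    for i in subset:
        for m, delta in task_vectors[i].items():
            s = module_scale.get(m, 1.0)
            merged[m] = [w + s * d for w, d in zip(merged[m], delta)]
    return merged


def toy_error(params):
    """Made-up validation error: squared distance to a hypothetical optimum."""
    target = {"encoder": [1.0, 2.0], "decoder": [2.0]}
    return sum((p - t) ** 2 for m in params for p, t in zip(params[m], target[m]))


base = {"encoder": [0.0, 0.0], "decoder": [1.0]}
experts = [
    {"encoder": [1.0, 0.0], "decoder": [1.0]},
    {"encoder": [0.0, 2.0], "decoder": [3.0]},
    {"encoder": [-4.0, 0.0], "decoder": [1.0]},   # an expert whose task vector interferes
]
tvs = [task_vector(base, e) for e in experts]
scale = {"decoder": 0.5}                          # hand-picked module-level scaling

# Exhaustive search over non-empty subsets (a stand-in for the evolutionary search).
subsets = [s for r in range(1, len(tvs) + 1) for s in combinations(range(len(tvs)), r)]
best = min(subsets, key=lambda s: toy_error(merge(base, tvs, s, scale)))
merged = merge(base, tvs, best, scale)
```

In this toy setting the search drops the third expert, whose task vector pulls the encoder away from the optimum, which is the kind of harmful interference that retaining every available model would preserve.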


2606.17561 2026-06-17 cs.CV 新提交

RT-Counter: Real-Time Text-Guided Open-Vocabulary Object Counting

RT-Counter：实时文本引导的开放词汇目标计数

Hao-Yuan Ma, Li Zhang, Zhiwei Zhu, Jie Gao

发表机构 * School of Computer Science and Technology, Soochow University（苏州大学计算机科学与技术学院）

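AI summary: Proposes RT-Counter, a real-time text-guided open-vocabulary counting framework that uses a Visual Prototype Textualization module and Weaving Transformer layers to run in real time while keeping accuracy high, with an MAE of 13.30 on FSC147 at 112.48 FPS.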

AI中文摘要

文本引导的开放词汇目标计数（TOOC）旨在对自然语言描述指定的类别中的对象进行计数。尽管视觉-语言预训练模型已成功应用于TOOC任务，但在计数场景中仍面临细粒度空间理解和实时推理需求的挑战。为解决这些限制，本文提出一种实时TOOC框架，称为实时计数器（RT-Counter），它不仅实现了良好的计数精度，而且具有高计算效率。RT-Counter设计了一种新颖的视觉原型文本化（VPT）模块，该模块可以将学习到的视觉特征投影到文本特征空间，然后生成包含视觉原型难以捕获的抽象信息和文本难以描述的详细原型信息的特征，增强了对象级视觉-语言模型的计数能力。此外，RT-Counter集成了我们的编织Transformer（Weaformer）层，以极低的计算成本保持高描述能力。Weaformer层采用了一种新颖的混合注意力机制，可以高效地编织局部和全局视觉特征。在三个公共数据集上的大量实验表明，RT-Counter成功打破了TOOC中精度与速度的权衡。在FSC147上实现具有竞争力的MAE 13.30的同时，RT-Counter以112.48 FPS运行，比现有TOOC领先方法快7.4倍，参数效率高4倍以上。我们的工作旨在平衡TOOC中的高精度和实时性能。代码可在以下网址获取：this https URL。

英文摘要

Text-guided open-vocabulary object counting (TOOC) aims to count objects belonging to the categories specified by natural language descriptions. Although vision-language pre-trained models have been successfully applied to TOOC tasks, they still struggle with fine-grained spatial understanding and real-time inference requirements in counting scenarios. To address these limitations, this paper proposes a real-time TOOC framework, called the Real-Time Counter (RT-Counter), that achieves not only good counting accuracy but also high computational efficiency. RT-Counter designs a novel Visual Prototype Textualization (VPT) module that can project learned visual features into a text feature space and then generate features containing the abstract information that is hard to capture with visual prototypes and the detailed prototype information that is difficult to describe in text, enhancing the object-level visual-language model's counting capabilities. Additionally, RT-Counter incorporates our Weaving Transformer (Weaformer) layers, maintaining high descriptive power at a fraction of the computational cost. The Weaformer layer adopts a novel hybrid attention mechanism that can efficiently weave together local and global visual features. Extensive experiments on three public datasets show that RT-Counter successfully breaks the accuracy-speed trade-off in TOOC. While achieving a competitive MAE of 13.30 on FSC147, RT-Counter operates at 112.48 FPS, making it 7.4x faster and over 4x more parameter-efficient than the existing leading methods in TOOC. Our work aims at balancing high accuracy and real-time performance in TOOC. Code is available at: https://github.com/Jason-Mar1/RT-Counter.


2606.17650 2026-06-17 cs.CV cs.CL 新提交

MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block

MambaCount: 基于空间稀疏状态空间对偶块的高效文本引导开放词汇目标计数

Hao-Yuan Ma, Li Zhang, Minjie Qiang, Jie Gao

发表机构 * School of Computer Science and Technology, Soochow University（苏州大学计算机科学与技术学院）

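AI summary: Proposes MambaCount, a framework whose Spatial Sparse State Space Duality block addresses Mamba's limited bidirectional dependency modeling in non-causal vision tasks and the high entropy of spatial token responses, enabling open-vocabulary object counting with linear complexity and a test MAE of 12.23 on FSC-147.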

AI中文摘要

文本引导的开放词汇目标计数（TOOC）旨在估计由文本提示描述的目标数量，在具有大规模变化的密集场景中尤其具有挑战性。现有的TOOC方法主要依赖Transformer，其相对于图像分辨率的二次复杂度限制了可扩展性。Mamba因其线性复杂度提供了一种有前景的替代方案。然而，先前基于Mamba的方法存在两个主要限制。一方面，Mamba固有的因果公式限制了非因果视觉任务所需的双向空间依赖建模。另一方面，现有的基于Mamba的视觉模型往往忽略了空间token响应中无约束的高熵，这可能削弱局部细节和高频线索。为了解决这些限制，我们提出了MambaCount，一种基于空间稀疏状态空间对偶（S^4D）块的高效框架。具体来说，我们分析并重构了Mamba中隐藏状态的衰减动态，以缓解因果建模引入的依赖约束。此外，我们引入了空间token选择（STS）子块，以减少Mamba中空间token响应的无约束高熵。另外，我们设计了多粒度原型（MGP），以在不同语义级别识别类似目标的区域，改善跨模态对齐和可解释性。在FSC-147上的大量实验表明，MambaCount在无需二次查询的方法中达到了最先进的性能，测试MAE为12.23，同时保持了线性复杂度。

英文摘要

Text-guided Open-vocabulary Object Counting (TOOC) aims to estimate the number of objects described by text prompts, which is particularly challenging in dense scenes with large scale variations. Existing TOOC approaches predominantly rely on Transformers, whose quadratic complexity with respect to image resolution limits their scalability. Mamba offers a promising alternative due to its linear complexity. However, previous Mamba-based methods have two main limitations. On the one hand, the inherent causal formulation of Mamba constrains the bidirectional spatial dependency modeling required by non-causal vision tasks. On the other hand, existing Mamba-based vision models often overlook the unconstrained high entropy in the spatial token responses, which can weaken local details and high-frequency cues. To address these limitations, we propose MambaCount, an efficient framework built on the Spatial Sparse State Space Duality (S^4D) block. Specifically, we analyze and reconstruct the decay dynamics of hidden states in Mamba to alleviate the dependency constraints introduced by causal modeling. Moreover, we introduce a Spatial Token Selection (STS) sub-block to reduce the unconstrained high entropy in spatial token responses within Mamba. In addition, we design Multi-Granularity Prototypes (MGP) to identify object-like regions at different semantic levels, improving cross-modal alignment and interpretability. Extensive experiments on FSC-147 demonstrate that MambaCount achieves state-of-the-art performance among methods without secondary querying, obtaining a test MAE of 12.23, while retaining linear complexity.


2606.17966 2026-06-17 cs.CV 新提交

Reload-Mamba: Hierarchical Anti-Dilution State-Space Modeling for Multi-Class Semantic Segmentation

Reload-Mamba：用于多类语义分割的分层抗稀释状态空间建模

Sheng-Wei Chan, Hsin-Jui Pan, Jen-Shiun Chiang

发表机构 * Department of Electrical and Computer Engineering, Tamkang University（淡江大学电机与计算机工程系）

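AI summary: Proposes Reload-Mamba, a framework that counters the response dilution caused by Mamba's state-space propagation with a boundary-supervised local detail prior, a class-uncertainty-aware Reload Gate, and a hierarchical multi-level Reload mechanism, reaching 47.9% single-scale mIoU on ADE20K, 83.2% single-scale mIoU on Cityscapes, and 87.8% mIoU on PASCAL VOC 2012 val.

Comments 23 pages, 4 figures, 17 tables. Code will be released soon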

AI中文摘要

基于Mamba的状态空间模型为高分辨率密集预测提供了线性时间的长程建模能力，但顺序状态空间传播会削弱多类语义分割中关键的边界敏感和细节敏感响应。我们提出Reload-Mamba，一种语义分割框架，通过三个分割特定设计解决这种传播导致的响应稀释问题：(i) 边界监督的局部细节先验，使用真实边界掩码显式训练，以识别需要响应恢复的区域；(ii) 类不确定性感知的Reload门控，将来自预重载辅助头的逐像素类熵作为额外的门控信号，该公式仅在多类密集预测下提供信息；(iii) 分层多级Reload机制，在三个解码器级别应用抗稀释细化，并自上而下融合恢复的表示。基于ConvNeXt-Tiny编码器、多尺度解码器和具有像素级方向注意力的四方向Mamba扫描，Reload-Mamba在ADE20K上达到47.9%单尺度（48.9%多尺度）mIoU，在Cityscapes上达到83.2%单尺度mIoU。在标准DeepLab风格协议下使用ResNet-101 + COCO预训练，Reload-Mamba在PASCAL VOC 2012 val上达到87.8% mIoU。控制消融实验表明，三个分割特定设计各自贡献了超出直接移植先前为二值化提出的抗稀释架构的性能，在ADE20K上相比直接移植基线累积提升了+2.2 mIoU。

英文摘要

Mamba-based state space models offer linear-time long-range modeling for high-resolution dense prediction, but sequential state-space propagation can attenuate boundary-sensitive and detail-sensitive responses that are critical in multi-class semantic segmentation. We propose Reload-Mamba, a semantic segmentation framework that addresses this propagation-induced response dilution through three segmentation-specific designs: (i) a boundary-supervised local detail prior that is explicitly trained with ground-truth boundary masks to identify regions requiring response restoration; (ii) a class-uncertainty-aware Reload Gate that incorporates per-pixel class entropy from a pre-reload auxiliary head as an additional gating signal, a formulation that is informative only under multi-class dense prediction; and (iii) a hierarchical multi-level Reload mechanism that applies anti-dilution refinement at three decoder levels and fuses the restored representations top-down. Built upon a ConvNeXt-Tiny encoder with a multi-scale decoder and four-directional Mamba scanning with pixel-wise directional attention, Reload-Mamba achieves 47.9% single-scale (48.9% multi-scale) mIoU on ADE20K and 83.2% single-scale mIoU on Cityscapes. With ResNet-101 + COCO pre-training under the standard DeepLab-style protocol, Reload-Mamba reaches 87.8% mIoU on PASCAL VOC 2012 val. Controlled ablations show that each of the three segmentation-specific designs contributes beyond a direct port of the prior anti-dilution architecture proposed for binarization, cumulatively improving over the direct-port baseline by +2.2 mIoU on ADE20K.
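The per-pixel class entropy used as a gating signal in design (ii) can be computed as in the sketch below. Only the entropy itself is standard; normalising it by log C and using it directly as a blend weight in `reload_blend` are assumptions made for illustration, since the abstract does not give the Reload Gate's actual formula.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one pixel's class logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    z = sum(exps)
    return [e / z for e in exps]


def normalized_entropy(logits):
    """Class entropy of one pixel, scaled to [0, 1] by dividing by log(C)."""
    probs = softmax(logits)
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(logits))


def reload_blend(propagated, local_detail, gate):
    """Hypothetical gate: blend propagated features with local-detail features."""
    return [(1.0 - gate) * p + gate * d for p, d in zip(propagated, local_detail)]


interior_gate = normalized_entropy([8.0, 0.0, 0.0])   # confident pixel, low entropy
boundary_gate = normalized_entropy([1.0, 1.0, 1.0])   # ambiguous pixel, maximal entropy
restored = reload_blend([0.0, 2.0], [1.0, 4.0], 0.5)
```

A confident interior pixel yields a gate near 0 and an ambiguous pixel yields a gate near 1, so under this assumed blend restoration is concentrated where the auxiliary head is uncertain about the class.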


2510.23798 2026-06-17 cs.CV cs.AI 版本更新

A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras

一种基于几何和深度学习的可复现流水线，用于利用原位摄像头监测城市河流中的漂浮人为碎片

Gauthier Grimmer, Romain Wenger, Clément Flint, Germain Forestier, Gilles Rixhon, Valentin Chardon

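AI summary: Proposes a framework combining a geometric model with deep learning that uses fixed cameras to continuously quantify and monitor floating debris in urban rivers, compares models on accuracy and inference speed under complex environmental conditions, and estimates debris size through projective geometry.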

AI中文摘要

河流中漂浮人为碎片的扩散已成为一个紧迫的环境问题，对生物多样性、水质以及人类活动（如航行和娱乐）产生不利影响。本研究提出了一种新颖的方法框架，利用固定的原位摄像头监测上述废弃物。本研究提供了两个关键贡献：（i）利用深度学习对漂浮碎片进行连续量化和监测；（ii）在复杂环境条件下，识别出在精度和推理速度方面最合适的深度学习模型。这些模型在多种环境条件和学习配置下进行测试，包括与数据泄漏相关的偏差实验。此外，实现了一个几何模型，用于从二维图像估计检测对象的实际尺寸。该模型利用了相机的内参和外参特性。本研究结果强调了数据集构建协议的重要性，特别是在负样本图像的整合和时间泄漏的考虑方面。最后，证明了使用投影几何结合回归校正进行公制物体估计的可行性。该方法为开发稳健、低成本、自动化的城市水生环境监测系统铺平了道路。

英文摘要

The proliferation of floating anthropogenic debris in rivers has emerged as a pressing environmental concern, exerting a detrimental influence on biodiversity, water quality, and human activities such as navigation and recreation. The present study proposes a novel methodological framework for monitoring the aforementioned waste, utilising fixed, in-situ cameras. This study provides two key contributions: (i) the continuous quantification and monitoring of floating debris using deep learning and (ii) the identification of the most suitable deep learning model in terms of accuracy and inference speed under complex environmental conditions. These models are tested in a range of environmental conditions and learning configurations, including experiments on biases related to data leakage. Furthermore, a geometric model is implemented to estimate the actual size of detected objects from a 2D image. This model takes advantage of both intrinsic and extrinsic characteristics of the camera. The findings of this study underscore the significance of the dataset constitution protocol, particularly with respect to the integration of negative images and the consideration of temporal leakage. In conclusion, the feasibility of metric object estimation using projective geometry coupled with regression corrections is demonstrated. This approach paves the way for the development of robust, low-cost, automated monitoring systems for urban aquatic environments.
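The geometric size estimate described in this abstract can be sketched with a plain pinhole model. This is a simplified illustration under stated assumptions, not the paper's model: there is no lens distortion, the water surface is a flat plane at a known distance below the camera, the camera has pitch only (no roll or yaw), the object lies on the water surface, and the regression correction mentioned in the abstract is omitted. All camera values are hypothetical.

```python
import math


def pixel_to_water_plane(u, v, fx, fy, cx, cy, cam_height, pitch):
    """Back-project pixel (u, v) onto the plane Z = 0 (the water surface).

    World frame: X right, Y forward, Z up. The camera sits at (0, 0, cam_height)
    and is pitched down from the horizontal by `pitch` radians.
    """
    # Ray direction in the camera frame (x right, y down, z forward), z fixed to 1.
    dx, dy = (u - cx) / fx, (v - cy) / fy
    denom = dy * math.cos(pitch) + math.sin(pitch)
    if denom <= 0.0:
        raise ValueError("ray does not hit the water plane")
    t = cam_height / denom                      # ray length scale at Z = 0
    return (t * dx, t * (math.cos(pitch) - dy * math.sin(pitch)))


def metric_length(p1, p2, **camera):
    """Distance in metres between the water-plane points of two pixels."""
    a = pixel_to_water_plane(*p1, **camera)
    b = pixel_to_water_plane(*p2, **camera)
    return math.hypot(a[0] - b[0], a[1] - b[1])


# Hypothetical camera looking straight down from 5 m with a 1000 px focal length.
camera = dict(fx=1000.0, fy=1000.0, cx=640.0, cy=360.0,
              cam_height=5.0, pitch=math.radians(90))
length_m = metric_length((640.0, 360.0), (740.0, 360.0), **camera)

# Same camera tilted 45 degrees: the image centre lands 5 m ahead of the camera.
tilted = dict(camera, pitch=math.radians(45))
centre_hit = pixel_to_water_plane(640.0, 360.0, **tilted)
```

For the downward-looking camera, a 100-pixel span maps to about 0.5 m on the water plane, which follows from height times pixel span divided by focal length (5 × 100 / 1000).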


2601.18252 2026-06-17 cs.CV cs.AI cs.LG stat.ML 版本更新

Co-PLNet: A Collaborative Point-Line Network for Prompt-Guided Wireframe Parsing

Co-PLNet: 一种用于提示引导的线框解析的协作点线网络

Chao Wang, Xuanying Li, Cheng Dai, Jinglei Feng, Yuxiang Luo, Hao Qin, Yuqi Ouyang

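AI summary: Proposes Co-PLNet, a point-line collaborative framework that exchanges spatial cues through a Point-Line Prompt Encoder and enforces point-line consistency with a Cross-Guidance Line Decoder, improving the accuracy and robustness of wireframe parsing on the Wireframe and YorkUrban datasets.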

AI中文摘要

线框解析旨在恢复线段及其连接点，以形成结构化的几何表示，用于同时定位与地图构建（SLAM）等下游任务。现有方法分别预测线和点，并在事后进行调和，导致不匹配和鲁棒性降低。我们提出Co-PLNet，一个点线协作框架，在两个任务之间交换空间线索，其中早期检测通过点线提示编码器（PLP-Encoder）转换为空间提示，该编码器将几何属性编码为紧凑且空间对齐的图。交叉引导线解码器（CGL-Decoder）随后通过基于互补提示的稀疏注意力细化预测，强制点线一致性和效率。在Wireframe和YorkUrban上的实验显示，准确性和鲁棒性持续改进，同时具有有利的实时效率，证明了我们在结构化几何感知中的有效性。我们的代码可在该 https URL 获取。

英文摘要

Wireframe parsing aims to recover line segments and their junctions to form a structured geometric representation useful for downstream tasks such as Simultaneous Localization and Mapping (SLAM). Existing methods predict lines and junctions separately and reconcile them post-hoc, causing mismatches and reduced robustness. We present Co-PLNet, a point-line collaborative framework that exchanges spatial cues between the two tasks, where early detections are converted into spatial prompts via a Point-Line Prompt Encoder (PLP-Encoder), which encodes geometric attributes into compact and spatially aligned maps. A Cross-Guidance Line Decoder (CGL-Decoder) then refines predictions with sparse attention conditioned on complementary prompts, enforcing point-line consistency and efficiency. Experiments on Wireframe and YorkUrban show consistent improvements in accuracy and robustness, together with favorable real-time efficiency, demonstrating our effectiveness for structured geometry perception. Our code is available at https://github.com/GalacticHogrider/Co-PLNet.

2602.14771 2026-06-17 cs.CV cs.AI cs.LG cs.MM cs.NE 版本更新

GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

GOT-JEPA：基于联合嵌入预测架构的通用目标跟踪与模型自适应及遮挡处理

Shih-Fang Chen, Jun-Cheng Chen, I-Hong Jhuo, Yen-Yu Lin

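AI summary: Proposes the GOT-JEPA framework, which improves generalization by predicting tracking models rather than image features, and designs OccuSolver to enhance occlusion perception; effectiveness is validated on seven benchmarks.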

Comments Accepted by IEEE Transactions on Circuits and Systems for Video Technology (TCSVT). This research focuses on learning model adaptation for adverse and dynamic environments, as well as fine-grained occlusion perception for tracking

DOI: 10.1109/TCSVT.2026.3675005
Journal ref: IEEE Transactions on Circuits and Systems for Video Technology 2026

AI中文摘要

人类视觉系统通过整合当前观测与先前观测信息、适应目标和场景变化、以及精细推理遮挡来跟踪物体。相比之下，最近的通用目标跟踪器通常针对训练目标进行优化，这限制了在未见场景中的鲁棒性和泛化能力，并且它们的遮挡推理仍然粗糙，缺乏对遮挡模式的详细建模。为了解决这些在泛化和遮挡感知方面的局限性，我们提出了GOT-JEPA，一个模型预测预训练框架，将JEPA从预测图像特征扩展到预测跟踪模型。给定相同的历史信息，教师预测器从干净的当前帧生成伪跟踪模型，学生预测器学习从当前帧的损坏版本预测相同的伪跟踪模型。这种设计提供了稳定的伪监督，并明确训练预测器在遮挡、干扰和其他不利观测下产生可靠的跟踪模型，从而提高了对动态环境的泛化能力。基于GOT-JEPA，我们进一步提出了OccuSolver来增强目标跟踪的遮挡感知。OccuSolver调整了一个以点为中心的点跟踪器，用于目标感知的可见性估计和详细的遮挡模式捕获。在跟踪器迭代生成的目标先验条件下，OccuSolver逐步细化可见性状态，增强遮挡处理，并产生更高质量的参考标签，逐步改进后续模型预测。在七个基准上的广泛评估表明，我们的方法有效增强了跟踪器的泛化能力和鲁棒性。

英文摘要

The human visual system tracks objects by integrating current observations with previously observed information, adapting to target and scene changes, and reasoning about occlusion at fine granularity. In contrast, recent generic object trackers are often optimized for training targets, which limits robustness and generalization in unseen scenarios, and their occlusion reasoning remains coarse, lacking detailed modeling of occlusion patterns. To address these limitations in generalization and occlusion perception, we propose GOT-JEPA, a model-predictive pretraining framework that extends JEPA from predicting image features to predicting tracking models. Given identical historical information, a teacher predictor generates pseudo-tracking models from a clean current frame, and a student predictor learns to predict the same pseudo-tracking models from a corrupted version of the current frame. This design provides stable pseudo supervision and explicitly trains the predictor to produce reliable tracking models under occlusions, distractors, and other adverse observations, improving generalization to dynamic environments. Building on GOT-JEPA, we further propose OccuSolver to enhance occlusion perception for object tracking. OccuSolver adapts a point-centric point tracker for object-aware visibility estimation and detailed occlusion-pattern capture. Conditioned on object priors iteratively generated by the tracker, OccuSolver incrementally refines visibility states, strengthens occlusion handling, and produces higher-quality reference labels that progressively improve subsequent model predictions. Extensive evaluations on seven benchmarks show that our method effectively enhances tracker generalization and robustness.

2606.14081 2026-06-17 cs.CV cs.AI cs.LG eess.IV 版本更新

Clay-CNN Hybrids: Leveraging Geospatial Foundation Models as Auxiliary Context for Landslide Detection

Clay-CNN混合模型：利用地理基础模型作为滑坡检测的辅助上下文

Huong Binh Vu

发表机构 * Harvard University（哈佛大学）

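AI summary: To address extreme class imbalance in landslide detection, proposes a hybrid approach that injects the geospatial foundation model Clay v1.5 as auxiliary context at the U-Net bottleneck, reaching 64.5% F1 on the Landslide4Sense benchmark and outperforming the Clay-only and U-Net baselines.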

AI中文摘要

灾后快速滑坡制图对灾害响应至关重要，但由于极端类别不平衡，自动化仍然困难。本研究评估了地理基础模型（GFM）Clay v1.5是否能够改善Landslide4Sense（L4S）基准上的像素级滑坡分割，该基准包含3,799个训练块，具有14个Sentinel-2和地形波段，约2%的正像素。我们比较了三种策略：Clay作为主编码器并融合多尺度残差地形、在瓶颈处注入Clay语义上下文的U-Net骨干、以及标准U-Net基线。采用两阶段低秩适应（LoRA）的混合U-Net + Clay模型在三个随机种子上的最佳测试F1为64.5±1.8%，超过了纯Clay骨干（55.2±3.6%）和U-Net基线（59.9%）。由于缺乏多尺度跳跃连接，Clay作为独立编码器的性能低于U-Net，但其预训练表示在作为辅助上下文注入时持续提升了性能。这些发现表明，GFM在滑坡检测中最有效的方式是补充空间细节丰富的卷积架构，而非替代它们。

英文摘要

Rapid post-event landslide mapping is essential for disaster response but remains difficult to automate due to extreme class imbalance. This study evaluates whether Clay v1.5, a Geospatial Foundation Model (GFM), can improve pixel-level landslide segmentation on the Landslide4Sense (L4S) benchmark, which contains 3,799 training chips with 14 Sentinel-2 and terrain bands and approximately 2% positive pixels. We compare three strategies: Clay as the primary encoder with multi-scale residual terrain fusion, a U-Net backbone augmented with Clay semantic context at the bottleneck, and a standard U-Net baseline. The hybrid U-Net + Clay model with two-stage Low-Rank Adaptation (LoRA) achieved the best test F1 of 64.5 +/- 1.8% over three seeds, surpassing the Clay-only backbone (55.2 +/- 3.6%) and the U-Net baseline (59.9%). Clay as a standalone encoder underperformed the U-Net due to the absence of multi-scale skip connections, but its pretrained representations consistently improved performance when injected as auxiliary context. These findings suggest that GFMs are most effective for landslide detection when they complement spatially detailed convolutional architectures rather than replace them.

2606.15937 2026-06-17 cs.CV 版本更新

GOOSE-M2F: Adapting Mask2Former for High-Fidelity, Long-Tailed Fine-Grained Semantic Segmentation in Unstructured Outdoor Terrain

GOOSE-M2F：适配Mask2Former用于非结构化户外地形的高保真、长尾细粒度语义分割

Jyothiraditya Lingam, Nikhileswara Rao Sulake, Sai Manikanta Eswar Machara

发表机构 * Rajiv Gandhi University of Knowledge Technologies, Nuzvid, India（拉吉夫·甘地知识技术大学，努兹维德，印度）

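AI summary: To address long-tailed fine-grained semantic segmentation in unstructured outdoor terrain, proposes GOOSE-M2F, which combines 200 object queries, a feature refinement module, and an auxiliary supervision head with a multi-stage training strategy, reaching 70.08% composite mIoU on the GOOSE benchmark.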

Comments This solution placed 3rd in the GOOSE 2D Fine-Grained Semantic Segmentation (FGSS) Challenge at ICRA 2026

AI中文摘要

我们提出GOOSE-M2F，这是Mask2Former针对GOOSE 2D细粒度语义分割（FGSS）挑战（ICRA 2026）的任务特定适配。GOOSE基准涵盖非结构化户外地形中的64个细粒度类别，具有严重的长尾分布，其中稀有类别每张图像占据少于50个像素。我们扩展了Swin-Large Mask2Former基线，并贡献了三个针对性改进：（1）200个对象查询以消除表示饱和；（2）结合ASPP-lite和CBAM双注意力的特征精炼模块（FRM）；（3）为稀有类别提供直接逐像素梯度的辅助监督头。多阶段训练策略结合了分布平衡损失、稀有类别复制粘贴增强、动态IoU感知重加权和EMA。在推理时，采用密集滑动窗口引擎，结合2D高斯核融合和4尺度TTA，提升了+10.57%。GOOSE-M2F达到70.08%官方复合mIoU（细粒度63.55%，粗粒度76.61%），在GOOSE 2D FGSS排行榜上位列第三。代码和训练好的模型已公开：GitHub https://github.com/Aditya-Lingam-9000/GOOSE-M2F 和 Hugging Face https://huggingface.co/XYZ9843/GOOSE-M2F。

英文摘要

We present GOOSE-M2F, a task-specific adaptation of Mask2Former for the GOOSE 2D Fine-Grained Semantic Segmentation (FGSS) Challenge at ICRA 2026. The GOOSE benchmark spans 64 fine-grained classes across unstructured outdoor terrain with a severely long-tailed distribution, where rare classes occupy fewer than 50 pixels per image. We extend the Swin-Large Mask2Former baseline with three targeted contributions: (1) 200 object queries to eliminate representational saturation; (2) a Feature Refinement Module (FRM) combining ASPP-lite and CBAM dual-attention; and (3) an Auxiliary Supervision Head that delivers direct per-pixel gradients for rare classes. A multi-stage training strategy pairs Distribution-Balanced loss, Rare-Class Copy-Paste augmentation, dynamic IoU-aware re-weighting, and EMA. At inference, a dense sliding-window engine with 2D Gaussian kernel blending and 4-scale TTA adds +10.57%. GOOSE-M2F achieves 70.08% Official Composite mIoU (63.55% fine, 76.61% coarse), placing 3rd on the GOOSE 2D FGSS leaderboard. Code and trained models are publicly available at GitHub: https://github.com/Aditya-Lingam-9000/GOOSE-M2F and Hugging Face: https://huggingface.co/XYZ9843/GOOSE-M2F.
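The composite score quoted in this abstract is numerically consistent with an unweighted mean of the fine and coarse mIoU values. The abstract does not state the official formula, so the averaging rule in this small Python check is an assumption, used only as a sanity check on the three reported numbers.

```python
# Reported values from the abstract (percent mIoU).
fine_miou = 63.55
coarse_miou = 76.61
reported_composite = 70.08

# Assumption: the composite is the unweighted mean of the two scores.
# The abstract does not confirm this definition.
composite = (fine_miou + coarse_miou) / 2

print(f"mean of fine and coarse mIoU: {composite:.2f}")
print(f"reported composite mIoU:      {reported_composite:.2f}")
```

If the challenge weights the two granularities differently, only the single averaging line would need to change.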

2606.17279 2026-06-17 cs.CV 新提交

Training LLMs with Reinforcement Learning over Digital Twin Representations for Reasoning-Intensive Surgical VideoQA

基于数字孪生表示的强化学习训练LLMs用于推理密集型手术视频问答

Yiqing Shen, Han Zhang, Mathias Unberath

发表机构 * Johns Hopkins University（约翰霍普金斯大学）

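AI summary: Proposes a reinforcement learning framework that builds digital twin representations from surgical foundation models to decouple visual perception from reasoning, and introduces hierarchical representations and a novel reward, achieving state-of-the-art performance on three benchmarks.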

AI中文摘要

手术视频问答需要跨语义、空间和时间维度的多步推理。现有方法在架构上将视频压缩为离散令牌表示，并将视觉感知与推理耦合。这种方法割裂了连续的空间-时间关系，已被证明限制了多步推理能力。我们引入了一个强化学习框架，通过操作由手术基础模型构建的数字孪生表示，训练大型语言模型解耦感知与推理。此外，我们引入了跨帧、时间窗口和程序级别的分层表示，并带有概率不确定性估计。最后，我们提出了一种新颖的奖励，结合了格式验证与通过临床合理性评估和不确定性感知校准进行的准确性评估。为了展示该方法的能力，我们引入了REAL-Colon-Reason，一个包含2000个问题-答案对、涵盖三个复杂度级别的结肠镜基准。我们在REAL-Colon-Reason以及两个现有手术视频问答基准REAL-Colon-VQA和EndoVis18-VQA上取得了最先进的性能。

英文摘要

Surgical video question answering requires multi-step reasoning across semantic, spatial, and temporal dimensions. Existing methods architecturally compress videos into discrete token representations and couple visual perception with reasoning. This approach fragments continuous spatial-temporal relationships and has been shown to restrict multi-step reasoning capabilities. We introduce a reinforcement learning (RL) framework that trains large language models (LLMs) to decouple perception from reasoning by operating over digital twin representations constructed from surgical foundation models. Additionally, we introduce hierarchical representations across frame, temporal window, and procedure levels with probabilistic uncertainty estimates. Finally, we propose a novel reward that combines format validation with accuracy assessment through clinical plausibility evaluation and uncertainty-aware calibration for training. To demonstrate the capabilities of this approach, we introduce REAL-Colon-Reason, a colonoscopic benchmark with 2000 question-answer pairs across three complexity levels. We achieve state-of-the-art performance on REAL-Colon-Reason and two existing surgical VideoQA benchmarks REAL-Colon-VQA and EndoVis18-VQA.

2606.17298 2026-06-17 cs.CV 新提交

Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins

面向手术室视频的推理式文本-视频检索：基于动作驱动数字孪生

Yiqing Shen, Hao Ding, Mathias Unberath

发表机构 * Johns Hopkins University（约翰霍普金斯大学）

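AI summary: Proposes OR3, which converts video clips into structured representations via action-driven digital twins (ActDTs), uses a large language model to generate hypothetical ActDTs for retrieval, and applies evidence-grounded refinement to reason over implicit queries, outperforming the baselines on operating-room video retrieval.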

AI中文摘要

手术室中的文本-视频检索是实现手术室安全的关键技术，它允许利益相关者检索和检查特定事件的记录。然而，由于最安全关键的事件可能不遵循常见结构，为了充分发挥其潜力，文本-视频检索必须能够处理需要推理才能识别正确视频的隐式查询（例如，剪断前的一步）。然而，现有方法依赖于无法对此类查询进行推理的全局嵌入。我们提出OR3，一种文本-视频检索方法，它将视频片段转换为动作驱动数字孪生（ActDTs），将并发的“主体-动作-对象”三元组分组到非重叠的时间间隔内。此外，与通过配对编码器进行跨模态匹配不同，OR3执行基于想象的检索，其中大语言模型从查询生成假设的ActDTs。这通过使用针对ActDT定制的难负样本训练的单一编码器实现模态内匹配。最后，基于证据的修正根据与顶级候选者的差异修正想象的ActDTs，以捕获特定于程序的模式。我们从MM-OR构建了一个基准，包含来自机器人膝关节手术的386个视频片段的276个隐式查询，涵盖四个推理类别。OR3实现了57.6的R@1和77.3的R@5，优于最强基线。这些结果表明，OR3通过时间动作推理实现了视觉上相似的手术室视频片段之间的细粒度区分。

英文摘要

Text-to-video retrieval in operating rooms (OR) is an enabling technology for OR safety, as it allows stakeholders to retrieve and inspect recordings of specific events. However, because the most safety-critical events may not follow the common structure, text-to-video retrieval can reach its full potential only if it handles implicit queries that require reasoning to identify the right video (e.g., the step right before clipping). Existing methods rely on global embeddings that cannot reason over such queries. We propose OR3, a text-to-video retrieval method that converts clips into action-driven digital twins (ActDTs), grouping concurrent subject-action-object triplets under non-overlapping temporal intervals. Moreover, rather than cross-modal matching through paired encoders, OR3 performs imagination-based retrieval where an LLM generates hypothetical ActDTs from queries. This enables intra-modal matching via a single encoder trained with ActDT-tailored hard negatives. Finally, evidence-grounded refinement revises imagined ActDTs based on discrepancies with top candidates to capture procedure-specific patterns. We construct a benchmark from MM-OR with 276 implicit queries across four reasoning categories over 386 clips from robotic knee procedures. OR3 achieves 57.6 R@1 and 77.3 R@5, outperforming the strongest baseline. These results demonstrate that OR3 enables fine-grained discrimination between visually similar OR video clips through temporal action reasoning.
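For readers unfamiliar with the R@1 and R@5 figures quoted in this abstract, the following is a minimal Python sketch of Recall@k for retrieval. It assumes exactly one correct clip per query, which the abstract does not state; the function name `recall_at_k` and the toy clip identifiers are hypothetical.

```python
def recall_at_k(ranked_ids_per_query, target_ids, k):
    """Percentage of queries whose target clip appears in the top-k results.

    Assumes one correct clip per query (an assumption, not from the source).
    """
    hits = sum(
        1
        for ranked, target in zip(ranked_ids_per_query, target_ids)
        if target in ranked[:k]
    )
    return 100.0 * hits / len(target_ids)


# Toy example: four queries that all target clip "a".
rankings = [
    ["a", "b", "c"],  # target ranked first
    ["b", "a", "c"],  # target ranked second
    ["c", "b", "a"],  # target ranked third
    ["b", "c", "d"],  # target not retrieved
]
targets = ["a", "a", "a", "a"]

r_at_1 = recall_at_k(rankings, targets, k=1)
r_at_2 = recall_at_k(rankings, targets, k=2)
r_at_3 = recall_at_k(rankings, targets, k=3)
```

Under this definition, 57.6 R@1 would mean that the correct clip is ranked first for 57.6% of the queries.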

2606.17615 2026-06-17 cs.CV cs.AI 新提交

SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation

SkillMoV: 基于原型条件门控的视图混合路由用于统一多视角熟练度估计

Edoardo Bianchi, Antonio Liotta

发表机构 * Free University of Bozen-Bolzano（博尔扎诺自由大学）

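AI summary: Proposes the SkillMoV framework, which uses a Mixture-of-View Projector (MoVP) for proficiency estimation from multi-scenario, multi-view video, reaching 50.17% accuracy in the Exos setting of EgoExo4D and surpassing the strongest compared Exos result.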

AI中文摘要

从视频中估计人类熟练度是自动化技能评估的关键挑战，应用于体育教练、音乐教学、手术培训和工作场所学习。现有方法通常专注于单一场景或依赖共享的多视角聚合，限制了其适应异构摄像机视角和活动领域的能力。我们提出SkillMoV，一个统一的、参数高效的框架，用于从同步多视角视频中进行多场景熟练度估计。其核心是混合视图投影器（MoVP），将混合专家范式适应于摄像机特定的视角特征。MoVP由四个阶段组成：(i) 一个具有12个专家MLP的混合视图软路由器，无需摄像机身份监督即可学习视角相关的专家偏好；(ii) 跨视角注意力以对齐同步摄像机；(iii) 可学习的原型锚定，以类级参考向量条件化表示；(iv) 一个原型条件门控投影，生成最终技能嵌入。我们在EgoExo4D上评估SkillMoV，涵盖六个技能领域和三种单独训练的视角配置：Ego、Exos和Ego+Exos。SkillMoV在Exos设置中达到50.17%的总体准确率，单个模型在所有场景上联合训练，超过比较方法中报告的最强Exos结果3.57个百分点。在Ego+Exos中，SkillMoV接近该设置的最佳报告结果（47.63%对48.20%）。在选定的Exos配置上的消融实验验证了每个组件：MoV路由比注意力聚合提高+6.61个百分点，跨视角注意力+4.92个百分点，原型锚定+4.07个百分点，随机视角丢弃+3.90个百分点。通过LoRA适配，SkillMoV仅训练其参数的23.32%，并且相对于仅LoRA基线增加了有限的测量开销。

英文摘要

Estimating human proficiency from video is a key challenge for automated skill assessment, with applications in sports coaching, music pedagogy, surgical training, and workplace learning. Existing approaches often focus on individual scenarios or rely on shared multi-view aggregation, limiting their ability to adapt to heterogeneous camera viewpoints and activity domains. We introduce SkillMoV, a unified, parameter-efficient framework for multi-scenario proficiency estimation from synchronized multi-view video. At its core, SkillMoV introduces a Mixture-of-View Projector (MoVP), which adapts the mixture-of-experts paradigm to camera-specific view features. MoVP is composed of four stages: (i) a Mixture-of-View soft router with twelve expert MLPs that learns view-dependent expert preferences without camera-identity supervision; (ii) cross-view attention to align synchronized cameras; (iii) learnable prototype anchoring to condition the representation on class-level reference vectors; and (iv) a prototype-conditioned gated projection that produces the final skill embedding. We evaluate SkillMoV on EgoExo4D across six skill domains and three separately trained view configurations: Ego, Exos, and Ego+Exos. SkillMoV reaches 50.17% overall accuracy in the Exos setting with a single model trained jointly across all scenarios, surpassing the strongest reported Exos result among the compared methods by 3.57 percentage points. In Ego+Exos, SkillMoV remains close to the best reported result in that setting (47.63% versus 48.20%). Ablations on the selected Exos configuration validate each component: MoV routing contributes +6.61 pp over attentive aggregation, cross-view attention +4.92 pp, prototype anchoring +4.07 pp, and stochastic view dropout +3.90 pp. Through LoRA adaptation, SkillMoV trains only 23.32% of its parameters and adds limited measured overhead relative to a LoRA-only baseline.

2606.17627 2026-06-17 cs.CV cs.AI 新提交

Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition

分、议、决：一种用于细粒度自我中心动作识别的多智能体框架

Alessandro Sottovia, Alessandro Torcinovich, Oswald Lanz

发表机构 * Faculty of Engineering, Free University of Bozen-Bolzano（博尔扎诺自由大学工程学院）

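AI summary: Proposes a zero-shot multi-agent framework that improves fine-grained egocentric action recognition through video segmentation, deliberation among heterogeneous VLM specialists, and Borda count aggregation.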

AI中文摘要

在自我中心视频中进行细粒度动作识别对视觉语言模型（VLM）具有挑战性：动作通常仅在小视觉线索上有所不同，而单个模型往往偏向于这些线索的一个子集。我们提出了“分、议、决”（Divide, Deliberate, Decide），一个完全本地化的零样本多智能体框架，其中（i）一个VLM编排器将视频分块，并为每个片段提出一个top-k候选标签列表，（ii）一个由来自不同开放模型系列的异构VLM专家组成的集成体进行结构化协商，包括一轮同行咨询问题，以及（iii）使用Borda计数聚合智能体排名，并且编排器根据专家的证据重新排名自己的预测。整个流程在本地运行，无需微调。实验表明，我们的方法在零样本动作识别性能上比基线有积极改进，突出了异构协商步骤的影响，表明增益来自去相关的模型先验而非额外的计算。

英文摘要

Fine-grained action recognition in egocentric video is challenging for Vision-Language Models (VLMs): actions often differ only in small visual cues, and a single model tends to be biased toward a subset of these cues. We propose Divide, Deliberate, Decide, a fully-local, zero-shot multi-agent framework in which (i) a VLM orchestrator chunks the video and proposes a top-k candidate label list per segment, (ii) an ensemble of heterogeneous VLM specialists, drawn from different open model families, engages in a structured deliberation that includes a peer-consultation round of questions, and (iii) agent rankings are aggregated with a Borda count and the orchestrator re-ranks its own prediction in light of the specialists' evidence. The entire pipeline runs locally with no fine-tuning. Experiments show that our method improves zero-shot action recognition performance over the baseline and highlight the influence of the heterogeneous deliberation step, indicating that the gain stems from decorrelated model priors rather than from additional compute.
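The Borda count aggregation named in step (iii) can be sketched in a few lines of Python. This is the textbook rule, not necessarily the authors' exact variant: the point scheme, the alphabetical tie-break, the function name `borda_aggregate`, and the example labels are all assumptions made for illustration.

```python
from collections import defaultdict


def borda_aggregate(rankings):
    """Aggregate ranked label lists with a Borda count.

    Each ranking lists candidate labels from best to worst. A label at
    position i in a ranking of length k earns k - 1 - i points; a label
    absent from a ranking earns nothing from it.
    """
    scores = defaultdict(int)
    for ranking in rankings:
        k = len(ranking)
        for position, label in enumerate(ranking):
            scores[label] += k - 1 - position
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda label: (-scores[label], label))


# Hypothetical top-3 lists from three specialist agents.
agent_rankings = [
    ["cut onion", "peel onion", "wash onion"],
    ["peel onion", "cut onion", "wash onion"],
    ["cut onion", "wash onion", "peel onion"],
]
final_ranking = borda_aggregate(agent_rankings)
```

Here "cut onion" collects 5 points, "peel onion" 3, and "wash onion" 1, so a label that is ranked highly by most agents wins even when one agent disagrees.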

2606.17798 2026-06-17 cs.CV cs.AI 新提交

LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams

LiveStarPro: 具有分层记忆的主动式流视频理解用于长时域流

Zhenyu Yang, Kairui Zhang, Bing Wang, Shengsheng Qian, Changsheng Xu

发表机构 * IEEE

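AI summary: Proposes LiveStarPro, which enables proactive understanding of long-horizon streaming video through three components (Streaming Verification Decoding, Streaming Causal Attention Masks, and Tree-Structured Hierarchical Memory), improving semantic correctness by 28.9% and reducing timing error by 18.2%.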

AI中文摘要

尽管视频大语言模型（Video-LLMs）取得了显著进展，当前的在线架构仍然难以同时处理连续视频流、自主决定何时响应以及保持长时域上下文记忆。这些障碍削弱了实时响应能力，并在长时间交互中导致严重遗忘。在这项工作中，我们引入了LiveStarPro，一个专为长时域流上的主动视频理解而设计的直播助手。LiveStarPro的设计基于三个互补组件。第一个组件是流验证解码（SVeD），一种通过单次困惑度验证识别适当响应时机的推理框架，从而消除了对显式静音标记的依赖。第二个组件是流因果注意力掩码（SCAM），一种训练策略，它在可变长度流上强制实现增量视频-语言对齐。第三个组件是树结构分层记忆（TSHM），一种递归记忆架构，它将驱逐的历史信息组织成事件链，从而能够从有效无界的视频流中高效检索。为了在现实在线条件下促进全面评估，我们进一步提出了OmniStarPro，一个大规模基准测试，涵盖15个多样化的真实世界场景，并扩展到小时级流以评估长期回忆。大量实验表明，LiveStarPro持续超越现有方法，在语义正确性上提升28.9%，时序误差降低18.2%，而其流式键值缓存进一步在相同模型上实现了1.58倍的推理加速。模型和代码在此https URL公开。

英文摘要

Despite the remarkable progress of Video Large Language Models (Video-LLMs), current online architectures still struggle to simultaneously process continuous video streams, decide autonomously when to respond, and preserve long-horizon contextual memory. These obstacles undermine real-time responsiveness and cause severe forgetting throughout prolonged interactions. In this work, we introduce LiveStarPro, a live streaming assistant that is designed for proactive video understanding over long-horizon streams. The design of LiveStarPro rests on three complementary components. The first component is Streaming Verification Decoding (SVeD), an inference framework that identifies the appropriate response timing through single-pass perplexity verification, thereby eliminating the dependency on explicit silence tokens. The second component is Streaming Causal Attention Masks (SCAM), a training strategy that enforces incremental video-language alignment over variable-length streams. The third component is Tree-Structured Hierarchical Memory (TSHM), a recursive memory architecture that organizes evicted historical information into event chains and consequently enables efficient retrieval from effectively unbounded video streams. To facilitate a comprehensive evaluation under realistic online conditions, we further present OmniStarPro, a large-scale benchmark that spans 15 diverse real-world scenarios and that extends to hour-scale streams for the assessment of long-term recall. Extensive experiments demonstrate that LiveStarPro consistently surpasses existing methods, attaining a 28.9% improvement in semantic correctness and an 18.2% reduction in timing error, while its streaming key-value cache further yields a 1.58x inference speedup over the same model without caching. The model and the code are publicly available at https://github.com/sotayang/LiveStarPro.

2603.22281 2026-06-17 cs.CV cs.AI cs.CL cs.LG cs.RO 版本更新

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

ThinkJEPA：赋予潜在世界模型大型视觉-语言推理能力

Haichao Zhang, Yijiang Li, Shwai He, Tushar Nagarajan, Mingfei Chen, Jianglin Lu, Ang Li, Yun Fu

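AI summary: Proposes the ThinkJEPA framework, which combines a dense JEPA branch with a sparse VLM thinker branch and uses a hierarchical pyramid representation extraction module to achieve fine-grained motion modeling and long-horizon semantic guidance, outperforming baselines on hand-manipulation trajectory prediction.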

Comments 10 pages, 5 figures

AI中文摘要

潜在世界模型（如V-JEPA2）的最新进展展示了从视频观测预测未来世界状态的能力。然而，短观测窗口的密集预测限制了时间上下文，可能导致预测偏向局部低层次外推，难以捕捉长程语义并降低下游效用。相比之下，视觉-语言模型（VLM）通过对均匀采样帧进行推理，提供强大的语义基础和通用知识，但由于计算驱动的稀疏采样、语言输出瓶颈（将细粒度交互状态压缩为文本导向表示）以及适应小规模动作条件数据集时的数据分布不匹配，它们不适合作为独立的密集预测器。我们提出了一种VLM引导的JEPA风格潜在世界建模框架，通过双时间路径结合密集帧动态建模与长程语义指导：一个密集JEPA分支用于细粒度运动和交互线索，以及一个均匀采样的VLM“思考者”分支，具有更大的时间步长以提供知识丰富的指导。为了有效传递VLM的渐进推理信号，我们引入了一个分层金字塔表示提取模块，将多层VLM表示聚合成与潜在预测兼容的指导特征。在手部操作轨迹预测实验上，我们的方法优于强VLM-only基线和JEPA预测器基线，并展现出更鲁棒的长程展开行为。

英文摘要

Recent progress in latent world models (e.g., V-JEPA2) has shown promising capability in forecasting future world states from video observations. Nevertheless, dense prediction from a short observation window limits temporal context and can bias predictors toward local, low-level extrapolation, making it difficult to capture long-horizon semantics and reducing downstream utility. Vision-language models (VLMs), in contrast, provide strong semantic grounding and general knowledge by reasoning over uniformly sampled frames, but they are not ideal as standalone dense predictors due to compute-driven sparse sampling, a language-output bottleneck that compresses fine-grained interaction states into text-oriented representations, and a data-regime mismatch when adapting to small action-conditioned datasets. We propose a VLM-guided JEPA-style latent world modeling framework that combines dense-frame dynamics modeling with long-horizon semantic guidance via a dual-temporal pathway: a dense JEPA branch for fine-grained motion and interaction cues, and a uniformly sampled VLM "thinker" branch with a larger temporal stride for knowledge-rich guidance. To transfer the VLM's progressive reasoning signals effectively, we introduce a hierarchical pyramid representation extraction module that aggregates multi-layer VLM representations into guidance features compatible with latent prediction. Experiments on hand-manipulation trajectory prediction show that our method outperforms both a strong VLM-only baseline and a JEPA-predictor baseline, and yields more robust long-horizon rollout behavior.

2606.17257 2026-06-17 cs.CV cs.AI 新提交

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

Pulling The REINS: 通过表示引导实现视频扩散模型的无训练安全对齐

Rohit Kundu, Arindam Dutta, Sarosij Bose, Athula Balachandran, Amit K. Roy-Chowdhury

发表机构 * University of California, Riverside（加州大学河滨分校）； YouTube (Google)（YouTube（谷歌））

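AI summary: Proposes REINS, which steers the internal representations of video diffusion models along a linear direction at inference time to achieve training-free safety alignment, avoiding harmful content generation without degrading general capability.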

AI中文摘要

开源视频扩散模型能够生成从暴力到虚假信息等逼真的不安全内容，然而现有防御要么需要昂贵的安全微调（这会降低通用能力），要么应用容易被对抗性提示绕过的外部过滤器。我们提出REINS（表示空间推理时安全引导），一种无训练方法，通过在推理时引导其内部表示向安全生成方向对齐视频扩散模型。我们的关键发现是，安全相关结构线性编码在视频扩散Transformer的隐藏状态激活中，并且通过基于二元安全标签的监督PCA发现的一个单一方向足以分离安全与不安全的生成轨迹。在推理时，将该方向添加到中间Transformer层的隐藏状态中，将生成从有害内容重定向到语义相关的安全替代方案，无需权重更新、无需概念枚举，且计算开销可忽略。通过机制分析，我们揭示了虽然安全信息随Transformer深度单调累积，但引导效果在中间层（约50%深度）达到峰值，暴露了信息可用性与下游传播能力之间的基本权衡。我们在9个视频扩散模型、多个参数规模（1.3B-5B）以及文本到视频和图像到视频生成上评估REINS，据我们所知，这是视频生成文献中最广泛的安全评估套件。

英文摘要

Open-weight video diffusion models can generate photorealistic unsafe content, from violence to misinformation, yet existing defenses either require expensive safety fine-tuning that degrades general capability, or apply external filters that are trivially bypassed by adversarial prompts. We present REINS (REpresentation-space INference-time Safety steering), a training-free method that aligns video diffusion models at inference time by steering their internal representations toward safe generation. Our key finding is that safety-relevant structure is linearly encoded in the hidden-state activations of video diffusion transformers, and a single direction, discovered via Supervised PCA on binary safety labels, suffices to separate safe from unsafe generation trajectories. At inference, adding this direction to hidden states at an intermediate transformer layer redirects generation from harmful content to semantically related safe alternatives, with no weight updates, no concept enumeration, and negligible computational overhead. Through mechanistic analysis, we reveal that while safety information accumulates monotonically with transformer depth, steering effectiveness peaks at intermediate layers (~50% depth), exposing a fundamental tradeoff between information availability and downstream propagation capacity. We evaluate REINS across 9 video diffusion models, multiple parameter scales (1.3B-5B), and both text-to-video and image-to-video generation; to our knowledge, this is the broadest safety evaluation suite in the video generation literature.
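The core operation described in this abstract, adding a fixed direction to a hidden state, can be illustrated with a small pure-Python sketch. It is not the REINS implementation: the unit normalization, the scale `alpha`, the toy vectors, and the helper names `steer` and `projection` are assumptions, and the sketch does not show how the direction is found with Supervised PCA.

```python
import math


def unit(vector):
    """Scale a vector to unit Euclidean norm."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def steer(hidden, direction, alpha):
    """Return hidden + alpha * unit(direction); no model weights change."""
    d = unit(direction)
    return [h + alpha * di for h, di in zip(hidden, d)]


def projection(hidden, direction):
    """Scalar component of hidden along the unit direction."""
    d = unit(direction)
    return sum(h * di for h, di in zip(hidden, d))


# Toy 4-dimensional hidden state and a hypothetical "safe" direction.
hidden_state = [0.5, -1.0, 2.0, 0.0]
safety_direction = [3.0, 0.0, 4.0, 0.0]
alpha = 2.0  # hypothetical steering strength

steered = steer(hidden_state, safety_direction, alpha)
shift = projection(steered, safety_direction) - projection(hidden_state, safety_direction)
```

The component along the direction moves by exactly `alpha`, while coordinates orthogonal to the direction are left unchanged, which is why such an edit adds almost no computation at inference time.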

2606.17310 2026-06-17 cs.CV 新提交

SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues

SierpinskiCam: 基于谢尔宾斯基三角形图案线索的相机控制视频重拍

Suttisak Wizadwongsa, Hyelin Nam, Supasorn Suwajanakorn, Jeong Joon Park

发表机构 * University of Michigan, Ann Arbor（密歇根大学安娜堡分校）； VISTEC, Thailand（泰国威斯泰克科学技术研究院）

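AI summary: Proposes SierpinskiCam, which augments geometric guidance with Sierpinski dome texture cues and introduces a reference video conditioning mechanism to address sparse regions when the camera departs far from the source trajectory in monocular video retaking, improving camera controllability, geometric consistency, and video quality.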

Comments 20 pages, 13 figures

AI中文摘要

从单个单目视频沿用户定义的相机轨迹生成场景的新颖渲染，称为视频重拍，是内容创作和视觉效果中一个引人注目但困难的问题。现有的几何引导方法从源视频重建4D表示，并沿目标轨迹渲染以条件视频扩散模型。然而，当目标相机偏离源轨迹时，这种引导会退化，导致新暴露区域稀疏或完全缺失。我们提出SierpinskiCam，通过使用包含丰富可跟踪特征的谢尔宾斯基圆顶纹理线索来增强基于几何的引导，从而解决了这一限制，即使在大的视角变化下也能保持跟踪。我们进一步引入了一种参考视频条件机制，将源视频令牌附加到目标令牌序列，并使用负RoPE索引分离两个流，从而无需架构修改或逐视频适应即可实现外观基础。大量实验表明，SierpinskiCam在多样且具有挑战性的重拍场景中，在相机可控性、几何一致性和视频质量方面取得了显著提升。项目页面：此https URL。

英文摘要

Generating novel renderings of a scene along user-defined camera trajectories from a single monocular video, dubbed video retaking, is a compelling but difficult problem in content creation and visual effects. Existing geometry-guided approaches reconstruct a 4D representation from the source video and render it along the target trajectory to condition video diffusion models. However, this guidance degrades as the target camera departs from the source trajectory, leaving newly revealed regions sparse or entirely missing. We propose SierpinskiCam, which addresses this limitation by augmenting geometry-based guidance with Sierpinski dome texture cues that contain rich trackable features even under large viewpoint changes. We further introduce a reference video conditioning mechanism that appends source-video tokens to the target-token sequence and separates the two streams with negative RoPE indices, enabling appearance grounding without architectural modification or per-video adaptation. Extensive experiments show that SierpinskiCam achieves significant gains in camera controllability, geometric consistency, and video quality across diverse and challenging retaking scenarios. Project page: https://hyelinnam.github.io/SierpinskiCam/.

2606.17342 2026-06-17 cs.CV 新提交

Learning a Maximum Entropy Model for Visual Textures using Diffusion

使用扩散学习视觉纹理的最大熵模型

Xinyuan Zhao, Eero P. Simoncelli

发表机构 * New York University（纽约大学）； Flatiron Institute（熨斗研究所）

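AI summary: Proposes the first diffusion-based method for unsupervised learning of the statistics of a maximum entropy texture model; with only 512 statistics it generates texture images of quality comparable to or better than the current state-of-the-art model (about 177k statistics), and supports smooth interpolation.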

AI中文摘要

视觉纹理——包含重复元素的空间均匀图像区域（例如草地、树皮）——在视觉场景中普遍存在，并为识别和分析材料及物体提供重要线索。许多现有纹理模型从单张纹理图像中提取关键统计量，然后通过匹配这些统计量生成视觉上相似的高质量样本。然而，它们的统计量要么是手工设计的，要么基于为其他目的（如物体识别）预训练的网络。在这里，我们开发了第一个用于无监督学习一组统计量的原理性方法，这些统计量用于约束最大熵概率模型。我们利用为生成扩散模型开发的方法来推导训练和采样程序，并将这些与通过匹配统计量进行采样的传统方法进行比较。尽管我们训练的模型很紧凑（512个统计量），但它生成的纹理图像质量与当前最先进的模型（约177k统计量）相当或更好。通过合成对一个模型不可区分但对另一个模型差异最大的图像，对两个模型进行更直接的比较，揭示了它们的相对优势和劣势。最后，我们表明，与以前的统计纹理模型不同，在我们的模型表示空间中的直线轨迹生成均匀的纹理样本，这些样本在两个端点的特征之间平滑插值。

英文摘要

Visual textures -- spatially homogeneous image regions containing repeated elements (e.g. a field of grass, the bark of a tree) -- are ubiquitous in visual scenes and provide important cues for recognizing and analyzing materials and objects. A number of existing texture models extract essential statistics from a single texture image, and can then generate high-quality samples that are visually similar to the original by matching these statistics. However, their statistics are either hand-designed or based on a network pretrained for another purpose (e.g., object recognition). Here, we develop the first principled method for unsupervised learning of a set of statistics that are used to constrain a maximum entropy probability model. We leverage methods developed for generative diffusion models to derive training and sampling procedures, and compare these to the traditional method of sampling via matching the statistics. Despite the compactness of our trained model (512 statistics), it generates texture images whose quality is as good as or better than the current state-of-the-art model (~177k statistics). A more direct comparison of the two models, obtained by synthesizing images that are indistinguishable for one model but maximally different for the other, reveals their relative strengths and weaknesses. Finally, we show that unlike previous statistical texture models, a straight trajectory in the representation space of our model generates homogeneous texture samples that interpolate smoothly between the features of the two end points.

2606.17431 2026-06-17 cs.CV 新提交

Visual Retrieval-Augmented Generation for Silhouette-Guided Animal Art

视觉检索增强生成：基于轮廓引导的动物艺术创作

Quoc-Duy Tran, Anh-Tuan Vo, Trung-Nghia Le

发表机构 * University of Science, VNU-HCM（胡志明市国立大学理科大学）； Vietnam National University, Ho Chi Minh（胡志明市国立大学）

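AI summary: Proposes the Visual Retrieval-Augmented Generation (Visual-RAG) framework, which retrieves animal shapes structurally similar to natural silhouettes and uses ControlNet and IP-Adapter to guide a diffusion model in generating animal art, realizing computational pareidolia.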

Comments SOICT 2025

AI中文摘要

生成式AI已经提升了渲染逼真或艺术图像的能力，但在人类创造力的一个关键方面仍然有限：解释模糊形状。这种现象根植于空想性视错觉，使人类能够从云、石头或树叶等随机图案中感知有意义的形状。为了在计算上复制这一想象过程，我们引入了视觉检索增强生成（Visual-RAG），这是一个直接从自然轮廓生成动物艺术的框架。我们的方法从包含28,586个高质量轮廓的精选语料库中检索结构相似的动物形状，并将其作为参考示例，通过ControlNet和IP-Adapter引导基于扩散的生成。消融研究证实，使用RANSAC的形状上下文提供了最准确的匹配，而去除形状标准化会使内点比率降至仅13.4%，强调了结构保真度在Visual-RAG中的重要性。一项包含12名参与者的用户研究从美学、轮廓保真度和整体印象方面评估了输出结果。结果表明，虽然Visual-RAG提供了合理的解释，但在实现高感知影响力方面仍存在挑战。这项工作为计算空想性视错觉奠定了基础，展示了机器如何为想象发现的早期阶段做出贡献。

英文摘要

Generative AI has advanced the ability to render photorealistic or artistic images, yet it remains limited in a key aspect of human creativity: interpreting ambiguous shapes. This phenomenon, rooted in pareidolia, allows humans to perceive meaningful forms in random patterns such as clouds, stones, or leaves. To computationally replicate this imaginative process, we introduce Visual Retrieval-Augmented Generation (Visual-RAG), a framework that generates animal art directly from natural silhouettes. Our method retrieves structurally similar animal shapes from a curated corpus of 28,586 high-quality silhouettes and uses them as reference exemplars to guide diffusion-based generation with ControlNet and IP-Adapter. Ablation studies confirm that Shape Context with RANSAC provides the most accurate alignment, while removing shape standardization reduces the inlier ratio to just 13.4%, underscoring the importance of structural fidelity in Visual-RAG. A user study with 12 participants evaluated the outputs in terms of aesthetics, silhouette fidelity, and overall impression. Results reveal that while Visual-RAG provides plausible interpretations, challenges remain in achieving high perceptual impact. This work lays the foundation for computational pareidolia, showing how machines can contribute to the early stages of imaginative discovery.

2606.17584 2026-06-17 cs.CV cs.LG 新提交

Root-Selecting Fixed-Point Inversion for Rectified Flows via Trajectory Straightness

基于轨迹直线度的整流流根选择不动点反演

Semin Kim, Jihwan Yoon, Seunghoon Hong

发表机构 * KAIST（韩国科学技术院）

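AI summary: Proposes SelFix, which achieves exact inversion in rectified flows by selecting the fixed-point solutions that yield straighter inverse trajectories, improving image reconstruction and editing quality.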

AI中文摘要

找到生成给定数据样本的初始噪声（称为反演）是下游应用（如无训练图像编辑）的关键组成部分。现有的不动点反演方法通过将每个反演步骤表述为不动点问题来提高反演精度，但它们缺乏一个原则性的机制来选择实践中可能出现的多个不动点解。我们观察到不同的选择会引发不同的反演轨迹，导致重建和编辑质量的显著变化。对于整流流，我们进一步发现这种变化与轨迹直线度密切相关，这促使我们将直线度作为原则性的选择标准。我们提出SelFix，一种不动点反演方法，它选择诱导更直逆轨迹的不动点解，同时在标准局部假设下保持收敛到精确的反演根。在FLUX.1-dev和PIE-Bench上的实验表明，SelFix改进了不动点反演，实现了比先前反演基线更强的真实图像重建和更好的源保持提示编辑。代码可在该https URL获取。

英文摘要

Finding the initial noise that generates a given data sample, known as inversion, is a key component for downstream applications such as training-free image editing. Existing fixed-point inversion methods improve inversion accuracy by formulating each inversion step as a fixed-point problem, but they lack a principled mechanism for selecting among multiple fixed-point solutions that can arise in practice. We observe that different selections induce different inversion trajectories, leading to substantial variation in reconstruction and editing quality. For rectified flows, we further find that this variation is closely associated with trajectory straightness, motivating straightness as a principled selection criterion. We propose SelFix, a fixed-point inversion method that selects fixed-point solutions inducing straighter inverse trajectories while retaining convergence to an exact inverse root under standard local assumptions. Experiments on FLUX.1-dev and PIE-Bench show that SelFix improves fixed-point inversion, achieving stronger real-image reconstruction and better source-preserving prompt-based editing than prior inversion baselines. The code is available at https://github.com/seminkim/selfix.

2606.17590 2026-06-17 cs.CV new submission

TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization

TivTok：广播时间不变令牌以实现可扩展视频分词

Weiliang Chen, Yuanhui Huang, Xuebo Wang, Yueqi Duan

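Affiliations * Department of Electronic Engineering, Tsinghua University; Department of Automation, Tsinghua University; Kuaishou Technology

AI summary: Proposes TivTok, a reuse-aware video tokenizer that factorizes video into Time-Invariant (TIV) and Time-Variant (TV) tokens for efficient compression and long-video modeling, reaching an rFVD of 12.65 on the standard benchmark and a 2.91x gain in compression efficiency.

AI Chinese abstract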

视频分词是可扩展视频生成的基础，因为令牌数量直接决定计算成本和可建模视频长度。现有分词器主要通过将视频压缩为更少令牌来提高可扩展性，但它们通常跨帧和块重复表示持久内容，如静态背景和一致物体外观。本文提出TivTok（时间不变分词器），一种可重用感知的视频分词器，使持久信息随时间可重用。TivTok用时间不变（TIV）令牌（编码跨帧共享信息）和时间变化（TV）令牌（编码帧特定残差）表示一个片段。为获得这种分解，我们引入范围诱导分解（SIF），为两个令牌组分配不同的注意力范围：TIV令牌关注整个片段，而每个TV令牌仅访问其对应帧及TIV令牌。在解码器中，不变广播（IB）跨帧和块重用相同的TIV令牌，用于并行重建和长视频分词。实验表明，TivTok在标准$16{\times}256{\times}256$基准上达到12.65的rFVD，与评估基线相比，128帧视频的压缩效率提升2.91倍，同时仅使用下采样分词器所需令牌的1.1%。

English abstract

Video tokenization is fundamental to scalable video generation, as the number of tokens directly determines the computational cost and the length of videos that can be modeled. Existing tokenizers mainly improve scalability by compressing videos into fewer tokens, but they often continue to represent persistent content, such as static backgrounds and consistent object appearances, repeatedly across frames and chunks. In this paper, we propose TivTok (Time-Invariant Tokenizer), a reuse-aware video tokenizer that makes persistent information reusable across time. TivTok represents a clip with Time-Invariant (TIV) tokens that encode information shared across frames and Time-Variant (TV) tokens that encode frame-specific residuals. To obtain this factorization, we introduce Scope-Induced Factorization (SIF), which assigns different attention scopes to the two token groups: TIV tokens attend to the full clip, whereas each TV token only accesses its corresponding frame together with the TIV tokens. In the decoder, Invariant Broadcasting (IB) reuses the same TIV tokens across frames and chunks for parallel reconstruction and long-video tokenization. Experiments show that TivTok achieves an rFVD of 12.65 on the standard $16{\times}256{\times}256$ benchmark and improves compression efficiency by 2.91$\times$ for 128-frame videos compared with the evaluated baselines, while using only 1.1\% of the tokens required by downsample-based tokenizers in our evaluation.
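
The scope rule described in this abstract (TIV tokens see the full clip, each TV token sees only its own frame plus the TIV tokens) can be sketched as a boolean attention mask. This is a minimal illustration and not the authors' implementation: the token layout, the function name `build_scope_mask`, and the choice to let TIV tokens attend to one another are all assumptions.

```python
def build_scope_mask(num_frames, patches_per_frame, num_tiv, tv_per_frame):
    """Return mask[q][k] = True where query token q may attend to key k.

    Assumed key layout: all frame patches (frame-major), then the TIV tokens.
    Assumed query layout: TIV tokens first, then TV tokens (frame-major).
    """
    num_patches = num_frames * patches_per_frame
    num_keys = num_patches + num_tiv
    mask = []

    # TIV queries: full-clip scope, i.e. every patch of every frame.
    # Letting TIV tokens also see each other is an assumption.
    for _ in range(num_tiv):
        mask.append([True] * num_keys)

    # TV queries: patches of their own frame, plus all TIV tokens.
    for frame in range(num_frames):
        start = frame * patches_per_frame
        stop = start + patches_per_frame
        row = [start <= k < stop or k >= num_patches for k in range(num_keys)]
        for _ in range(tv_per_frame):
            mask.append(list(row))
    return mask


mask = build_scope_mask(num_frames=3, patches_per_frame=4, num_tiv=2, tv_per_frame=1)
tv_frame1 = mask[2 + 1]  # the TV query that belongs to frame 1
print(sum(tv_frame1))    # 4 own-frame patches + 2 TIV tokens
```

Under this layout a TV query never sees patches from other frames, so anything it needs from the rest of the clip has to come through the shared TIV tokens.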

2606.17619 2026-06-17 cs.CV new submission

RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation

RAVA: 检索增强的视角对齐用于主题驱动图像生成

Qiwei Yan, Zhiqiang Yuan, Chongyang Li, Jiapei Zhang, Ying Deng, Jinchao Zhang, Jie Zhou

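Affiliations * WeChat AI, Tencent Inc.

AI summary: Proposes RAVA, a framework that supplies geometric evidence through retrieval augmentation to address viewpoint drift and structural mismatch in cross-subject viewpoint alignment, enabling reliable viewpoint control while preserving identity.

AI Chinese abstract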

参考驱动图像生成在身份保持方面取得了快速进展，但跨不同主体的可靠视角控制仍然难以理解。难点不仅在于生成目标主体的新图像：模型必须推断一个主体的隐含视角，并仅使用图像级证据将其转移到另一个主体，无需相机姿态、深度或基于射线的条件。在这种设置下，现有基于多个图像参考的生成器通常依赖虚假的语义相关性，导致视角漂移、部分级结构不匹配以及缺失或不支持的目标特定内容。我们将这一挑战形式化为跨主体视角对齐，并提出RAVA，一个检索增强框架，在生成前提供显式几何证据。RAVA首先学习一个跨实例视角嵌入，检索与锚点视角对齐的目标主体图像，然后应用基于LogDet的子集选择策略，保留一个既视角一致又结构互补的紧凑参考集。最后，选定的参考被微调的多参考图像生成器使用。实验表明，通用语义嵌入在此任务上几乎是随机的，而所提出的检索器显著提高了视角检索质量。在跨主体生成上，RAVA在相同生成骨干下始终优于零样本基线和更强的检索替代方案。这些结果表明，跨主体视角对齐受益于检索增强的几何基础，而非仅依赖端到端生成。

English abstract

Reference-driven image generation has made rapid progress on identity preservation, but reliable viewpoint control across different subjects remains poorly understood. The difficulty is not merely generating a new image of the target subject: the model must infer the implicit viewpoint of one subject and transfer it to another subject using only image-level evidence, without camera poses, depth, or ray-based conditions. In this setting, existing generators conditioned on multiple image references often rely on spurious semantic correlations, which lead to viewpoint drift, part-level structural mismatches, and missing or unsupported target-specific content. We formulate this challenge as cross-subject viewpoint alignment and propose RAVA, a retrieval-augmented framework that supplies explicit geometric evidence before generation. RAVA first learns a cross-instance viewpoint embedding that retrieves target-subject images aligned with the anchor viewpoint, then applies a LogDet-based subset selection strategy to retain a compact reference set that is both view-consistent and structurally complementary. The selected references are finally consumed by a fine-tuned multi-reference image generator. Experiments show that generic semantic embeddings are nearly random for this task, while the proposed retriever substantially improves viewpoint retrieval quality. On cross-subject generation, RAVA consistently outperforms zero-shot baselines and stronger retrieval alternatives under the same generation backbone. These results indicate that cross-subject viewpoint alignment benefits from retrieval-augmented geometric grounding rather than relying on end-to-end generation alone.
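
The abstract names a LogDet-based subset selection step without giving its form. One generic way to realize such a step is greedy log-determinant maximization over a kernel that mixes per-item relevance with pairwise similarity (in the style of determinantal point processes). The sketch below illustrates that generic idea only; the kernel `L_ij = r_i * sim_ij * r_j`, the function names, and the toy numbers are assumptions, not the paper's method.

```python
import math


def log_det(matrix):
    """Log-determinant of a small symmetric positive-definite matrix (Cholesky)."""
    n = len(matrix)
    chol = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = matrix[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return float("-inf")  # not positive definite
                chol[i][j] = math.sqrt(s)
                total += 2.0 * math.log(chol[i][j])
            else:
                chol[i][j] = s / chol[j][j]
    return total


def greedy_logdet_select(relevance, similarity, k):
    """Greedily pick k items maximizing log det of L_S, L_ij = r_i * sim_ij * r_j."""
    selected = []
    for _ in range(k):
        best_item, best_score = None, float("-inf")
        for cand in range(len(relevance)):
            if cand in selected:
                continue
            idx = selected + [cand]
            sub = [[relevance[a] * similarity[a][b] * relevance[b] for b in idx] for a in idx]
            score = log_det(sub)
            if score > best_score:
                best_item, best_score = cand, score
        if best_item is None:
            break
        selected.append(best_item)
    return selected


# Toy data: items 0 and 1 are near-duplicates; item 2 is less relevant but different.
relevance = [1.0, 0.95, 0.8]
similarity = [
    [1.0, 0.98, 0.1],
    [0.98, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
print(greedy_logdet_select(relevance, similarity, k=2))  # prints [0, 2]
```

The determinant shrinks when two chosen items are nearly identical, so the second pick skips the near-duplicate in favor of a complementary item, which matches the abstract's goal of a compact reference set that is consistent yet structurally complementary.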

2606.17800 2026-06-17 cs.CV new submission

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

MaineCoon: 追求实时音视频社交世界模型

Lichen Bai, Tianhao Zhang, Shitong Shao, Dingwei Tan, Qiyu Zhong, Zhengpeng Xie, Haopeng Li, Qinghao Huang, Dandan Shen, Tengjiao Ji, Wei Wang, Peicheng Wu, Yuxuan Zhao, Xiangyu Zhu, Welly Luo, Shurui Yang, Zeke Xie

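Affiliations * Catnip AI Team

AI summary: Presents MaineCoon, described as the first real-time audio-visual autoregressive model with 22B parameters, supporting streaming generation at up to 47.5 FPS and sub-second interaction on a single GPU; it is optimized for social-interactive applications and introduces self-resampling, cross-modal alignment, domain-aware preference optimization, and reinforced online-policy distillation.

Comments 32 pages, 13 figures, 3 tables

AI Chinese abstract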

随着全球视频内容越来越多地在社交平台上用于互动社交目的，为社交世界构建的视频生成模型至关重要，但以往研究在很大程度上忽视了这一点。在这项工作中，我们定义了社交世界模型的位置，并构建了一个原型模型作为实现这一目标的第一步。虽然以往的世界模型成功模拟了物理环境或游戏世界探索，但它们从根本上与以人为中心的社交动态脱节。为了弥合这一差距，作为社交世界模型的第一步，我们提出了MaineCoon，这是首个实时音视频自回归模型，拥有22B参数，能够在单个GPU上实现高达47.5 FPS的创纪录帧率的实时流式生成和亚秒级交互。据我们所知，MaineCoon也是首个专门针对社交互动应用优化的实时音视频生成模型。为了实现高效稳定的训练，我们在MaineCoon中引入了多种新技术，包括自重采样、跨模态表示对齐、领域感知偏好优化和强化在线策略蒸馏（ROPD）。我们还设计了首个智能体流式推理框架，支持千秒级甚至更长的生成，同时通过智能体缓存管理和提示规划减轻漂移。这些创新显著加速了训练，同时优化了实时推理性能。我们相信，这项工作不仅为高质量、低延迟和长时域的音视频自回归模型设定了新的最先进（SOTA）性能基准，而且指出了下一代AI原生社交平台所需的范式转变。

English abstract

As an increasing majority of global video content is consumed on social platforms for interactive social purposes, video generation models built for social worlds are important but largely overlooked by previous studies. In this work, we define the position of social world models and build a prototype model as the first step towards this goal. While previous world models successfully simulate physical environments or gaming world exploration, they remain fundamentally detached from human-centric social dynamics. To bridge this gap as the first step to social world models, we present MaineCoon, the first real-time audio-visual autoregressive model that has 22B parameters and is capable of real-time streaming generation and sub-second interaction, with a record-breaking frame rate of up to 47.5 FPS, on a single GPU. To the best of our knowledge, MaineCoon is also the first real-time audio-visual generation model specifically optimized for social-interactive applications. To enable efficient and stable training, we introduce several novel techniques into MaineCoon, including self-resampling, cross-modal representation alignment, domain-aware preference optimization, and reinforced online-policy distillation (ROPD). We also design the first agentic streaming inference framework that supports thousand-second-scale or even longer generation while mitigating drift with agentic cache management and prompt planning. These innovations significantly accelerate training while optimizing real-time inference performance. We believe this work not only sets a new state-of-the-art (SOTA) performance benchmark for high-quality, low-latency, and long-horizon audio-visual autoregressive models, but also points out the paradigm shift desired for next-generation AI-native social platforms.

2606.18156 2026-06-17 cs.CV cs.AI new submission

ReAge3D: Re-Aging 3D Faces with View Consistency

ReAge3D：具有视角一致性的3D人脸回龄

Libing Zeng, Li Ma, Mingming He, Ning Yu, Paul Debevec, Nima Khademi Kalantari

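Affiliations * Texas A&M University; Netflix Eyeline Studios

AI summary: Proposes the ReAge3D framework, which combines the 2D diffusion model DiffReaging with a center-out editing propagation strategy to achieve multi-view-consistent 3D face re-aging that preserves identity and detail, outperforming existing methods.

AI Chinese abstract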

我们提出了一种新颖的框架，用于实现逼真且可控的3D人脸回龄，生成高度详细、保留身份的结果。现有的3D编辑方法虽然对粗粒度的语义变化有效，但不适合回龄，因为即使回龄2D视图之间的微小不一致也会导致对微妙但感知上重要的年龄相关细节的过度平滑。为了解决这一挑战，我们首先引入了一个基于2D扩散的回龄模型DiffReaging，该模型在合成生成的图像对上训练。我们进一步提出了一种中心向外编辑传播策略，利用该回龄模型重建多视图一致的回龄图像。具体来说，从回龄的正面枢轴视图开始，我们通过扭曲和我们提出的Masked-DiffReaging过程重建其余视图。通过在扩散过程的每一步注入现有内容，Masked-DiffReaging确保重建区域与现有像素保持连贯。由此产生的一致回龄视图集监督回龄3D表示的优化。我们的方法在视觉上和定量上都优于现有的3D编辑技术，能够对3D人脸模型中的年龄变换进行平滑、细粒度的控制。

English abstract

We present a novel framework for realistic and controllable 3D face re-aging which produces highly detailed, identity-preserving results. Existing 3D editing methods, while effective for coarse semantic changes, are not well suited for re-aging, as even small inconsistencies across re-aged 2D views can lead to over-smoothing of subtle but perceptually important age-related details. To address this challenge, we first introduce a 2D diffusion-based re-aging model, DiffReaging, trained on synthetically generated image pairs. We further propose a center-out editing propagation strategy that leverages this re-aging model to reconstruct multi-view-consistent re-aged images. Specifically, starting from a re-aged frontal pivot view, we reconstruct the remaining views through warping and our proposed Masked-DiffReaging process. By injecting existing content at every step of the diffusion process, Masked-DiffReaging ensures that the reconstructed regions remain coherent with existing pixels. The resulting consistent set of re-aged views supervises the optimization of the re-aged 3D representation. Our method outperforms existing 3D editing techniques both visually and quantitatively, enabling smooth, fine-grained control over age transformations in 3D face models.

2606.17432 2026-06-17 cs.GR cs.CV cross-list

Edit3DGS: Unified Framework for Dynamic Head Editing via 2D Instruction-Guided Diffusion and 3D Gaussian Splatting

Edit3DGS：通过2D指令引导扩散与3D高斯泼溅的动态头部编辑统一框架

Duy-Dat Tran, Trung-Nghia Le

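Affiliations * University of Science, VNU-HCM, Ho Chi Minh, Vietnam; Vietnam National University, Ho Chi Minh, Vietnam

AI summary: Proposes Edit3DGS, a unified framework that combines 2D instruction-guided diffusion with 3D Gaussian Splatting for controllable editing of dynamic 3D heads, supporting operations such as expression changes and attribute modification while keeping identity and motion dynamics consistent.

Comments SOICT 2025

AI Chinese abstract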

我们提出Edit3DGS，一个用于动态3D头部编辑的统一框架，它将2D指令引导扩散与3D高斯泼溅相结合。与先前分别处理基于帧的编辑或静态3D重建的方法不同，我们的方法将图像域中的语义可控性与逼真、时间一致的3D表示结合起来。给定输入视频，可编辑的面部区域被掩码并使用文本条件扩散模型进行修改，以支持细粒度操作，如表情变换、属性修改和外观细化。然后，编辑后的帧通过3D高斯泼溅聚合，生成一个连贯、高保真的化身，同时保留身份和运动动态。为了强制一致性，Edit3DGS采用了多视图批量编辑和轻量级修复策略，以恢复跨时间步丢失的表情。实验结果表明，我们的框架能够实现可控、无伪影的头部编辑，并具有平滑的时间过渡，在虚拟化身、沉浸式通信、电影制作和交互媒体中具有实际应用。

English abstract

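Principled Reinforcement Learning for Flow Matching Emerges from Chunk-Level Policy Optimization

Yifu Luo, Haoyuan Sun, Xinhao Hu, Penghui Du, Keyu Fan, Bo Li, Sinan Du, Xu Wan, Zhiyu Chen, Bo Xia, Yongzhe Chang, Changqian Yu, Kun Gai, Tiantian Zhang, Xueqian Wang

Affiliations * GitHub

AI summary: Proposes GCPO, a reinforcement learning method for flow matching based on chunk-level policy optimization; by grouping consecutive steps into coherent chunks and moving policy optimization to the chunk level, it mitigates inaccurate advantage attribution, and experiments show that it outperforms existing methods on text-to-image generation.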

Comments ICML 2026

2602.06806 2026-06-17 cs.CV cs.LG updated version

RAIGen: Rare Attribute Identification in Text-to-Image Generative Models

RAIGen: 文本到图像生成模型中的罕见属性识别

Silpa Vadakkeeveetil Sreelatha, Dan Wang, Serge Belongie, Muhammad Awais, Anjan Dutta

Affiliations * University of California, Berkeley

AI summary: Proposes the RAIGen framework, which uses Matryoshka Sparse Autoencoders and a novel minority metric to discover rare attributes in diffusion models without labels and supports attribute amplification.

Comments Accepted at ICML 2026. Webpage and code available at https://github.com/VSSILPA/RAIGen

AI Chinese abstract

文本到图像扩散模型实现了令人印象深刻的生成质量，但继承并放大了训练数据中的偏差，扭曲了语义属性的覆盖。先前的工作以两种方式解决这一问题。封闭集方法在预定义的公平性类别（如性别、种族）中减轻偏差，假设社会显著的少数属性是先验已知的。开放集方法将任务框架化为偏差识别，突出主导输出的多数属性。两者都忽略了一个互补的任务：揭示在数据分布中代表性不足（社会、文化或风格）但仍编码在模型表示中的罕见或少数特征。我们介绍了RAIGen，据我们所知，这是第一个用于扩散模型中无标签罕见属性发现的框架，不需要预定义的少数类别。RAIGen利用Matryoshka稀疏自编码器和一种新颖的少数度量，结合神经元激活频率与语义独特性，识别出那些其最高激活图像揭示代表性不足属性的可解释神经元。实验表明，RAIGen在Stable Diffusion中发现了超出固定公平性类别的属性，可扩展到更大的模型如SDXL，支持跨架构的系统审计，并在生成过程中实现罕见属性的定向放大。项目页面可在 https://vssilpa.github.io/RAIGen_webpage/ 获取。

English abstract

Text-to-image diffusion models achieve impressive generation quality but inherit and amplify training-data biases, skewing coverage of semantic attributes. Prior work addresses this in two ways. Closed-set approaches mitigate biases in predefined fairness categories (e.g., gender, race), assuming socially salient minority attributes are known a priori. Open-set approaches frame the task as bias identification, highlighting majority attributes that dominate outputs. Both overlook a complementary task: uncovering rare or minority features underrepresented in the data distribution (social, cultural, or stylistic) yet still encoded in model representations. We introduce RAIGen, the first framework, to our knowledge, for label-free rare-attribute discovery in diffusion models, requiring no predefined minority categories. RAIGen leverages Matryoshka Sparse Autoencoders and a novel minority metric combining neuron activation frequency with semantic distinctiveness to identify interpretable neurons whose top-activating images reveal underrepresented attributes. Experiments show RAIGen discovers attributes beyond fixed fairness categories in Stable Diffusion, scales to larger models such as SDXL, supports systematic auditing across architectures, and enables targeted amplification of rare attributes during generation. The project page is available at https://vssilpa.github.io/RAIGen_webpage/ .

2603.03485 2026-06-17 cs.CV cs.AI cs.RO updated version

Phys4D: Fine-Grained Physics-Consistent 4D Modeling from Video Diffusion

Phys4D: 从视频扩散模型实现细粒度物理一致的4D建模

Haoran Lu, Shang Wu, Songling Liu, Jianshu Zhang, Maojiang Su, Guo Ye, Chenwei Xu, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Zhaoran Wang, Han Liu

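AI summary: Proposes the Phys4D pipeline, which learns physics-consistent 4D world representations from video diffusion models through three-stage training (pseudo-supervised pretraining, physics-grounded supervised fine-tuning, and reinforcement-learning correction), substantially improving fine-grained spatiotemporal and physical consistency.

AI Chinese abstract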

最近的视频扩散模型作为大规模生成式世界模型已经取得了令人印象深刻的能力。然而，这些模型通常难以保持细粒度的物理一致性，随时间表现出物理上不合理的动态。在这项工作中，我们提出了 Phys4D，一个从视频扩散模型中学习物理一致的4D世界表示的流水线。Phys4D 采用三阶段训练范式，逐步将外观驱动的视频扩散模型提升为物理一致的4D世界表示。我们首先通过大规模伪监督预训练引导出稳健的几何和运动表示，为4D场景建模奠定基础。然后，我们使用模拟生成的数据进行基于物理的监督微调，强制执行时间一致的4D动态。最后，我们应用基于模拟的强化学习来纠正难以通过显式监督捕获的残留物理违规。为了评估超越外观指标的细粒度物理一致性，我们引入了一套4D世界一致性评估，探测几何一致性、运动稳定性和长期物理合理性。实验结果表明，与外观驱动的基线相比，Phys4D 显著改善了细粒度时空和物理一致性，同时保持了强大的生成性能。我们的项目页面可在此 https URL 获取。

English abstract

Recent video diffusion models have achieved impressive capabilities as large-scale generative world models. However, these models often struggle with fine-grained physical consistency, exhibiting physically implausible dynamics over time. In this work, we present Phys4D, a pipeline for learning physics-consistent 4D world representations from video diffusion models. Phys4D adopts a three-stage training paradigm that progressively lifts appearance-driven video diffusion models into physics-consistent 4D world representations. We first bootstrap robust geometry and motion representations through large-scale pseudo-supervised pretraining, establishing a foundation for 4D scene modeling. We then perform physics-grounded supervised fine-tuning using simulation-generated data, enforcing temporally consistent 4D dynamics. Finally, we apply simulation-grounded reinforcement learning to correct residual physical violations that are difficult to capture through explicit supervision. To evaluate fine-grained physical consistency beyond appearance-based metrics, we introduce a set of 4D world consistency evaluations that probe geometric coherence, motion stability, and long-horizon physical plausibility. Experimental results demonstrate that Phys4D substantially improves fine-grained spatiotemporal and physical consistency compared to appearance-driven baselines, while maintaining strong generative performance. Our project page is available at https://sensational-brioche-7657e7.netlify.app/

2605.09313 2026-06-17 cs.CV updated version

Show the Signal, Hide the Noise: Spectral Forcing for Pixel-Space Diffusion

Weichen Fan, Haiwen Diao, Penghao Wu, Ziwei Liu

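Affiliations * S-Lab, Nanyang Technological University

AI summary: Proposes Spectral Forcing, which applies a time-varying low-pass filter to the noisy input of pixel-space diffusion models to steer the model toward signal-bearing frequency bands, improving training efficiency and generation quality.

Comments Code link: https://github.com/WeichenFan/Spectral_Forcing

AI Chinese abstract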

像素空间扩散模型在全带宽噪声图像上训练，但去噪器可用的有用信号强烈依赖于频率。在整流流扩散和自然图像幂律谱下，每个时间$t$的频带数据-噪声等高线$k^{*}(t) = (1-t)^{-2/\alpha}$将信号承载的低频区域与噪声主导的高频区域分开。我们表明，这种隐式的由粗到细结构不仅仅是描述性的：它引发了一个容量分配问题。标准的像素空间去噪器必须内部发现移动的带宽边界，并可能在最优预测退化为确定性基线而非数据分布建模的频率-时间区域上花费计算。为了显式化这个边界，我们引入了频谱强制，一个无参数、时间条件的2D-DCT低通算子，在补丁嵌入器之前应用于噪声输入。其截止频率随扩散时间单调增加，并在数据端点处变为恒等映射。通过受控的合成实验，我们确定了该算子有益的机制：粗补丁分词和其高频内容主要是噪声而非必要信号的数据。在ImageNet-256上使用JiT-700M/32，频谱强制在不同训练周期中一致地改进了FID和Inception Score，展示了训练过程中的稳健增益；在更细的分词下，频谱强制仍然具有竞争力。我们进一步将未修改的算子插入SenseNova-U1，一个统一的文本到图像模型，它改进了DPG-Bench和GenEval，表明输入侧频谱先验可以超越类条件生成。这些结果表明了一条通往容量高效的像素空间扩散的途径：展示信号并隐藏噪声。

English abstract

Pixel-space diffusion models are trained on full-bandwidth noisy images, yet the useful signal available to the denoiser is strongly frequency dependent. Under rectified-flow diffusion and natural-image power-law spectra, the per-band data-to-noise contour $k^{*}(t) = (1-t)^{-2/\alpha}$ separates a signal-bearing low-frequency region from a noise-dominated high-frequency region at each time $t$. We show that this implicit coarse-to-fine structure is not merely descriptive: it induces a capacity-allocation problem. A standard pixel-space denoiser must discover the moving bandwidth boundary internally and can spend computation on frequency-time regions where the optimal prediction collapses to deterministic baselines rather than data-distribution modeling. To make this boundary explicit, we introduce Spectral Forcing, a parameter-free, time-conditional 2D-DCT low-pass operator applied to the noisy input before the patch embedder. Its cutoff expands monotonically with the diffusion time and becomes the identity at the data endpoint. Through controlled synthetic experiments, we identify the regime in which the operator is beneficial: coarse patch tokenization and data whose high-frequency content is predominantly noise rather than essential signal. On ImageNet-256 with JiT-700M/32, Spectral Forcing consistently improves both FID and Inception Score across different training epochs, demonstrating robust gains throughout training; at finer tokenization, Spectral Forcing remains competitive. We further insert the unchanged operator into SenseNova-U1, a unified text-to-image model, where it improves DPG-Bench and GenEval, showing that the input-side spectral prior transfers beyond class-conditional generation. These results suggest a route to capacity-efficient pixel-space diffusion by showing the signal and hiding the noise.
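
To make the contour formula concrete, the sketch below evaluates $k^{*}(t) = (1-t)^{-2/\alpha}$ and turns it into a keep-mask over 2D-DCT coefficient indices. It is a rough illustration under stated assumptions rather than the released operator: the exponent `alpha = 2.0`, the use of the radial index `sqrt(u^2 + v^2)` as the frequency measure, and the helper names `k_star` and `keep_mask` are all hypothetical, and the DCT itself is omitted.

```python
import math


def k_star(t, alpha=2.0):
    """Data-to-noise contour k*(t) = (1 - t)^(-2 / alpha) from the abstract.

    t = 1 is treated as the data endpoint, where the contour diverges.
    alpha = 2.0 is an assumed power-law exponent, not a value from the paper.
    """
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must lie in [0, 1]")
    if t == 1.0:
        return math.inf
    return (1.0 - t) ** (-2.0 / alpha)


def keep_mask(size, t, alpha=2.0):
    """Boolean size x size mask over 2D-DCT indices (u, v).

    Keeping coefficients with radial index sqrt(u^2 + v^2) <= k*(t) is an
    assumed way to turn the contour into a cutoff.
    """
    cutoff = k_star(t, alpha)
    return [[math.hypot(u, v) <= cutoff for v in range(size)] for u in range(size)]


for t in (0.0, 0.5, 0.9, 1.0):
    kept = sum(sum(row) for row in keep_mask(8, t))
    print(f"t={t:.1f}  k*={k_star(t):.2f}  kept={kept}/64")
```

Because the cutoff grows with `t` and diverges at `t = 1`, the mask passes every coefficient at the data endpoint, which mirrors the abstract's statement that the operator becomes the identity there.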

2606.15614 2026-06-17 cs.CV updated version

Variational Test-time Optimization for Diffusion Synchronization

扩散同步的变分测试时优化

Hyunsoo Lee, Farrin Marouf Sofian, Kushagra Pandey, Stephan Mandt

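Affiliations * Seoul National University; University of California, Irvine

AI summary: Proposes a variational test-time optimization framework based on optimal control that optimizes control variables to guide multiple trajectories in collaborative generation, improving diffusion synchronization without additional training.

Comments Preprint. Project website: https://hleephilip.github.io/SyncVC/

AI Chinese abstract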

协同生成通过协调多个扩散轨迹来扩展预训练先验的能力，已成为扩展扩散模型适用性的强大范式。在现有方法中，扩散同步通过引入通用引导机制提供了场景无关的解决方案。然而，当前的同步方法严重依赖启发式方法，并且仍然需要针对特定任务进行调整，这限制了它们的泛化能力和性能。在这项工作中，我们基于最优控制数学推导了一个同步框架，为扩散同步提供了原理性解释。在采样过程中，我们优化控制变量以引导多个轨迹朝向一致解，同时保持接近底层扩散先验。我们的方法完全在测试时运行，无需额外训练，因此当与强大的预训练先验结合时，能够在多样化的生成场景中广泛应用。我们在三个代表性的协同生成任务上展示了相对于基线的持续改进，涵盖了广泛的模态和应用。除了性能提升，我们的工作为协同生成建立了新的基础，为将预训练生成模型扩展到新的协同生成设置开辟了一条原理性路径。

English abstract

Collaborative generation, which coordinates multiple diffusion trajectories to extend the capabilities of pretrained priors, has emerged as a powerful paradigm for broadening the applicability of diffusion models. Among existing approaches, diffusion synchronization provides a scenario-agnostic solution by introducing general guidance mechanisms. However, current synchronization approaches rely heavily on heuristics and still require task-specific tailoring, which limits their generalizability and performance. In this work, we mathematically derive a synchronization framework based on optimal control, providing a principled explanation of diffusion synchronization. During sampling, we optimize control variables to guide multiple trajectories toward coherent solutions while remaining close to the underlying diffusion prior. Our method operates entirely at test time without additional training, thereby enabling broad applicability across diverse generation scenarios when combined with strong pretrained priors. We demonstrate consistent improvements over baselines on three representative collaborative generation tasks, covering a wide range of modalities and applications. Beyond performance gains, our work establishes a novel foundation for collaborative generation, opening a principled path toward extending pretrained generative models to new collaborative generation settings.

2606.16449 2026-06-17 cs.CV updated version

PermaVid: Consistent Video Generation Across Edits via Disentangled Context Memory

PermaVid: 通过解耦上下文记忆实现编辑下的一致视频生成

Shuai Yang, Bingjie Gao, Ziwei Liu, Jiaqi Wang, Dahua Lin, Tong Wu

Affiliations * Shanghai Jiao Tong University; Stanford University; S-Lab, Nanyang Technological University; The Chinese University of Hong Kong; Shanghai Innovation Institute

AI summary: Proposes the PermaVid framework, which uses a context memory disentangled into semantic appearance and geometric structure, together with an edit-aware update strategy, to generate video that stays consistent over the long term after editing operations.

Comments Project page: https://ys-imtech.github.io/projects/PermaVid/

AI Chinese abstract

在编辑操作下的一致视频生成需要持久性：当编辑修改场景外观或布局时，后续生成应在时间和视角上保持连贯。然而，现有的记忆设计在修改后难以维持长期一致性，因为存储的上下文可能变得过时或无效。为了解决这个问题，我们提出了PermaVid，一种新颖的框架，基于多模态上下文记忆，将空间上下文解耦为语义外观和几何结构，并采用编辑感知的记忆更新和检索策略，使记忆演化与后续观察保持一致。具体来说，我们开发了两个互补的记忆库：一个RGB上下文记忆，捕获外观感知的观察同时隐式编码几何；一个深度上下文记忆，保留与语义解耦的纯几何结构。基于此设计，我们引入了一个记忆引导的视频生成模型，在从混合模态记忆上下文中提取的参考条件下执行多模态特征融合。实验表明，我们的方法在编辑后保持了强大的长期语义和结构一致性，显著优于现有方法。

English abstract

Consistent video generation under editing operations requires persistence: when edits modify scene appearance or layout, subsequent generations should remain coherent across time and viewpoints. However, existing memory designs struggle to maintain long-term consistency after such modifications, as stored contexts may become outdated or invalid. To address this, we propose PermaVid, a novel framework built upon a multi-modal context memory that disentangles spatial context into semantic appearance and geometric structure, together with an edit-aware memory update and retrieval strategy that keeps memory evolution aligned with subsequent observations. Specifically, we develop two complementary memory banks: an RGB context memory that captures appearance-aware observations while implicitly encoding geometry, and a depth context memory that preserves geometry-only structure disentangled from semantics. Building on this design, we introduce a memory-guided video generation model that performs multi-modal feature fusion under reference conditions drawn from mixed-modality memory contexts. Experiments demonstrate that our method maintains strong long-term semantic and structural consistency after edits, significantly outperforming state-of-the-art methods.

2606.17438 2026-06-17 cs.CV new submission

Contact-Based Fringe Projection Profilometry for High-Resolution 3-D Surface Measurement of Reflective and Transparent Objects

基于接触式条纹投影轮廓术的高分辨率反射与透明物体三维表面测量

Ingu Yeo, Hyung-Gun Chi, Jae-Sang Hyun

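Affiliations * Department of Mechanical Engineering, Yonsei University; Yonsei Institute for Embodied Intelligence, Yonsei University

AI summary: To address the limited depth accuracy on reflective or transparent objects and the calibration difficulty of GelSight sensors, proposes a contact-based 3-D measurement method built on digital fringe projection that achieves high-accuracy full-field 3-D reconstruction through triangulation.

AI Chinese abstract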

本文提出一种基于数字条纹投影（DFP）系统的接触式三维表面测量方法，属于以商业成功的GelSight传感器为代表的视觉触觉传感家族。此类传感器已被证明对机器人指尖操作和接触传感有效。然而，由于GelSight采用RGB LED光度立体视觉，它不直接测量绝对深度，而是通过积分估计的表面梯度来推断深度，这可能累积重建误差；此外，随着传感区域增大，校准变得越来越困难，并且在高反射或透明物体上深度精度受到挑战。为克服这些缺点，我们提出一种基于条纹投影的接触测量技术，在涂覆硅胶的接触表面上执行基于三角测量的三维重建，提供接触区域上密集的逐像素表面几何和全场三维形状测量。通过将高精度数字条纹投影集成到传感器中，我们的方法简化了大面积校准，并提高了复杂表面的深度精度。实验结果，包括与GelSight Mini传感器的直接比较、球面拟合精度评估和不确定性分析，证实了所提方法显著提高了基于结构光的三维测量的精度和稳定性，允许可靠重建具有不同光学特性的物体。

English abstract

This paper presents a contact-based 3-D surface measurement method based on a Digital Fringe Projection (DFP) system, belonging to the vision-based tactile sensing family pioneered by the commercially successful GelSight sensor. Such sensors have proven effective for robotic fingertip manipulation and contact sensing. However, because GelSight employs photometric stereo with RGB LEDs, it does not measure absolute depth directly but instead infers it by integrating estimated surface gradients, which can accumulate reconstruction errors; in addition, it becomes increasingly difficult to calibrate as the sensing area grows, and its depth accuracy is challenged on highly reflective or transparent objects. To overcome these drawbacks, we propose a fringe-projection-based contact measurement technique that performs triangulation-based 3-D reconstruction on a coated silicone contact surface, providing dense per-pixel surface geometry and full-field 3-D shape measurement over the contact region. By integrating high-accuracy digital fringe projection into the sensor, our approach simplifies calibration over larger areas and enhances depth precision for complex surfaces. Experimental results, including a direct comparison with a GelSight Mini sensor, a sphere-fitting accuracy evaluation, and an uncertainty analysis, confirm that the proposed method significantly improves the accuracy and stability of structured-light-based 3-D measurements, allowing reliable reconstruction of objects with diverse optical properties.

2606.17824 2026-06-17 cs.CV cs.AI new submission

4DSloMo: 4D Reconstruction of High-Speed Scenes with Asynchronous Capture

Yutian Chen, Shi Guo, Tianshuo Yang, Lihe Ding, Xiuyuan Yu, Jinwei Gu, Tianfan Xue

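AI summary: Proposes a high-speed 4D capture system that uses only low-frame-rate cameras; an asynchronous capture scheme raises the equivalent frame rate to 100-200 FPS, and a video diffusion model repairs sparse-view artifacts, enabling high-quality high-speed 4D reconstruction.

Comments Webpage: https://openimaginglab.github.io/4DSloMo/

AI Chinese abstract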

从多视角视频重建快速动态场景对于高速运动分析和逼真的4D重建至关重要。然而，大多数4D捕获系统的帧率限制在30 FPS以下，直接从低帧率输入进行高速运动的4D重建可能导致不理想的结果。在这项工作中，我们提出了一种仅使用低帧率相机的高速4D捕获系统，通过新颖的捕获和处理模块实现。在捕获方面，我们提出了一种异步捕获方案，通过错开相机的开始时间来提高有效帧率。通过分组相机并利用25 FPS的基础帧率，我们的方法实现了100-200 FPS的等效帧率，无需专门的高速相机。在处理方面，我们还提出了一种新颖的生成模型来修复由4D稀疏视图重建引起的伪影，因为异步减少了每个时间戳的视角数量。具体来说，我们提出训练一个基于视频扩散的伪影修复模型用于稀疏4D重建，该模型细化缺失细节、保持时间一致性并提高整体重建质量。实验结果表明，与同步捕获相比，我们的方法显著增强了高速4D重建。

English abstract

Reconstructing fast-dynamic scenes from multi-view videos is crucial for high-speed motion analysis and realistic 4D reconstruction. However, the majority of 4D capture systems are limited to frame rates below 30 FPS (frames per second), and a direct 4D reconstruction of high-speed motion from low-FPS input may lead to undesirable results. In this work, we propose a high-speed 4D capturing system using only low-FPS cameras, through novel capturing and processing modules. On the capturing side, we propose an asynchronous capture scheme that increases the effective frame rate by staggering the start times of cameras. By grouping cameras and leveraging a base frame rate of 25 FPS, our method achieves an equivalent frame rate of 100-200 FPS without requiring specialized high-speed cameras. On the processing side, we also propose a novel generative model to fix artifacts caused by 4D sparse-view reconstruction, as asynchrony reduces the number of viewpoints at each timestamp. Specifically, we propose to train a video-diffusion-based artifact-fix model for sparse 4D reconstruction, which refines missing details, maintains temporal consistency, and improves overall reconstruction quality. Experimental results demonstrate that our method significantly enhances high-speed 4D reconstruction compared to synchronous capture.
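
The frame-rate arithmetic behind the asynchronous scheme can be checked with a few lines. The sketch below assumes evenly spaced start offsets and group counts of 4 and 8, which are inferred from the stated 25 FPS base rate and 100-200 FPS range rather than given in the abstract; the function names are hypothetical.

```python
from fractions import Fraction


def staggered_timestamps(base_fps, num_groups, frames_per_camera):
    """Capture times (seconds) for each camera group, as exact fractions.

    Group g starts g / (num_groups * base_fps) seconds late, so the groups
    together sample time uniformly at num_groups * base_fps.
    Evenly spaced offsets are an assumption; the abstract only says that
    camera start times are staggered.
    """
    frame_period = Fraction(1, base_fps)
    offset_step = frame_period / num_groups
    return [
        [g * offset_step + n * frame_period for n in range(frames_per_camera)]
        for g in range(num_groups)
    ]


def effective_fps(groups):
    """Effective frame rate of the merged, time-sorted capture."""
    merged = sorted(t for group in groups for t in group)
    return 1 / (merged[1] - merged[0])


for num_groups in (4, 8):
    groups = staggered_timestamps(base_fps=25, num_groups=num_groups, frames_per_camera=3)
    print(num_groups, "groups ->", effective_fps(groups), "FPS")
```

The cost of the higher temporal rate is visible in the same structure: each timestamp is observed by only one group, so the number of viewpoints per instant drops by a factor of `num_groups`, which is the sparse-view problem the abstract's artifact-fix model is meant to address.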

2509.12742 2026-06-17 cs.CV updated version

Effective Gaussian Management for High-fidelity Object Reconstruction

高保真物体重建的有效高斯管理

Jiateng Liu, Hao Gao, Jiu-Cheng Xie, Chi-Man Pun, Jian Xiong, Haolun Li, Junxin Chen, Feng Xu

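AI summary: Proposes a Gaussian management framework that combines selective attribute activation, adaptive representation, and task-decoupled pruning with a regularized surface reconstruction module, achieving high-fidelity appearance and geometry reconstruction with fewer parameters.

AI Chinese abstract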

本文提出了一种有效的高斯管理框架，用于外观和几何的高保真场景重建。与最近将所有基元在优化过程中统一处理的高斯泼溅（GS）管线不同，我们的框架显式管理高斯的属性激活、表示和剪枝。具体来说，我们的框架首先引入GauSep，一种新的致密化策略，选择性地激活高斯颜色或法线属性，以缓解由双重监督产生的破坏性梯度冲突。我们进一步提出GauRep，一种自适应高斯表示，动态调整球谐函数（SHs）阶数并执行任务解耦剪枝，以在个体和全局层面减少冗余。为了为上述管理过程提供可靠的几何监督，我们还引入了CoRe，一个正则化表面重建模块，通过置信度机制从SDF分支蒸馏鲁棒的法线场到高斯表示。值得注意的是，所提出的高斯管理与各种重建架构兼容，可以无缝集成以提高性能同时减小模型大小。大量实验表明，与最先进方法相比，我们的方法在外观和几何重建上实现了优越或可比的性能，同时使用了显著更少的参数。

English abstract

This paper proposes an effective Gaussian management framework for high-fidelity scene reconstruction of both appearance and geometry. Unlike recent Gaussian Splatting (GS) pipelines that treat all primitives uniformly during optimization, our framework explicitly manages the attribute activation, representation, and pruning of Gaussians. Specifically, our framework first introduces GauSep, a novel densification strategy that selectively activates Gaussian color or normal attributes to alleviate destructive gradient conflicts arising from dual supervision. We further propose GauRep, an adaptive Gaussian representation that dynamically adjusts spherical harmonics (SHs) orders and performs task-decoupled pruning to reduce redundancy at both the individual and global levels. To provide reliable geometric supervision for the above management process, we additionally introduce CoRe, a regularized surface reconstruction module that distills robust normal fields from an SDF branch to the Gaussian representation through a confidence mechanism. Notably, the proposed Gaussian management is compatible with various reconstruction architectures and can be seamlessly integrated to improve performance while reducing the size of the model. Extensive experiments demonstrate that our approach achieves superior or comparable performance in appearance and geometry reconstruction compared with state-of-the-art methods, while using significantly fewer parameters.

2512.09373 2026-06-17 cs.CV updated version

FUSER: Feed-Forward MUltiview 3D Registration Transformer and SE(3)$^N$ Diffusion Refinement

FUSER: 前馈多视图3D配准Transformer与SE(3)^N扩散精化

Haobo Jiang, Jin Xie, Jian Yang, Liang Yu, Jianmin Zheng

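Affiliations * Nanyang Technological University; Alibaba Group; Nanjing University

AI summary: Proposes FUSER, described as the first feed-forward multiview registration transformer, which predicts global poses directly in a unified latent space without pairwise matching, and introduces FUSER-DF, an SE(3)^N diffusion refinement framework that corrects its estimates.

Comments Accepted to CVPR 2026 (Oral)

AI Chinese abstract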

多视图点云的配准传统上依赖于广泛的成对匹配来构建用于全局同步的位姿图，这在计算上昂贵且在没有整体几何约束的情况下本质上是不适定的。本文提出了FUSER，第一个前馈多视图配准Transformer，它在统一、紧凑的潜在空间中联合处理所有扫描，直接预测全局位姿，无需任何成对估计。为了保持可处理性，FUSER通过稀疏3D CNN将每个扫描编码为低分辨率超点特征，该网络保留绝对平移线索，并通过几何交替注意力模块执行高效的扫描内和扫描间推理。特别地，我们从现成的基础模型中转移2D注意力先验，以增强3D特征交互和几何一致性。基于FUSER，我们进一步引入了FUSER-DF，一个SE(3)^N扩散精化框架，通过在联合SE(3)^N空间中进行去噪来校正FUSER的估计。FUSER作为代理多视图配准模型来构建去噪器，并推导了先验条件SE(3)^N变分下界用于去噪监督。在3DMatch、ScanNet和ArkitScenes上的大量实验表明，我们的方法实现了优越的配准精度和出色的计算效率。

English abstract

Registration of multiview point clouds conventionally relies on extensive pairwise matching to build a pose graph for global synchronization, which is computationally expensive and inherently ill-posed without holistic geometric constraints. This paper proposes FUSER, the first feed-forward multiview registration transformer that jointly processes all scans in a unified, compact latent space to directly predict global poses without any pairwise estimation. To maintain tractability, FUSER encodes each scan into low-resolution superpoint features via a sparse 3D CNN that preserves absolute translation cues, and performs efficient intra- and inter-scan reasoning through a Geometric Alternating Attention module. In particular, we transfer 2D attention priors from off-the-shelf foundation models to enhance 3D feature interaction and geometric consistency. Building upon FUSER, we further introduce FUSER-DF, an SE(3)$^N$ diffusion refinement framework to correct FUSER's estimates via denoising in the joint SE(3)$^N$ space. FUSER acts as a surrogate multiview registration model to construct the denoiser, and a prior-conditioned SE(3)$^N$ variational lower bound is derived for denoising supervision. Extensive experiments on 3DMatch, ScanNet and ArkitScenes demonstrate that our approach achieves superior registration accuracy and outstanding computational efficiency.

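2605.07971 2026-06-17 cs.CV cs.LG Updated version

DVD: Discrete Voxel Diffusion for 3D Generation and Editing

Zhengrui Xiang, Jiaqi Wu, Fupeng Sun, Heliang Zheng, Yingzhen Li

Affiliations: Imperial College London; Math Magic; Hitem3D

AI summary: Proposes Discrete Voxel Diffusion (DVD), a discrete diffusion framework that treats voxel occupancy as a discrete variable to generate sparse voxels, estimate their uncertainty, and edit them, avoiding continuous-to-discrete thresholding and giving more interpretable generation dynamics.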

Abstract

We introduce Discrete Voxel Diffusion (DVD), a discrete diffusion framework to generate, assess, and edit sparse voxels for SLat (Structured LATent) based 3D generative pipelines. Although discrete diffusion has not generally displaced continuous diffusion in image-like generation, we show that it can be an effective first-stage prior for sparse voxel scaffolds. By treating voxel occupancy as a native discrete variable, DVD avoids continuous-to-discrete thresholding and provides a simple framework for voxel generation, uncertainty estimation, and editing. Beyond quality gains, DVD provides more interpretable generation dynamics through explicit categorical modeling. Furthermore, we leverage the predictive entropy as a robust uncertainty metric to identify ambiguous voxel regions and complicated samples, facilitating tasks such as data filtering and quality assessment. Finally, we propose a lightweight fine-tuning strategy using block-structured perturbation patterns. This approach empowers the model to inpaint and edit voxels within a single sampling round, requiring negligible auxiliary computation and no additional model evaluations. Code is available at https://github.com/TeCai/DVD.
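
The predictive-entropy idea in the abstract can be illustrated with a minimal sketch. It assumes that each voxel comes with a predicted occupancy probability and scores it with the binary Shannon entropy. The function names, the 0.5-bit threshold, and the toy grid are hypothetical and are not taken from the DVD code base.

```python
import math

def binary_entropy(p: float) -> float:
    """Shannon entropy (in bits) of a Bernoulli occupancy probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # a certain prediction carries no uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def ambiguous_voxels(probs: dict, threshold: float = 0.5) -> list:
    """Return the voxel coordinates whose entropy exceeds the threshold."""
    return sorted(xyz for xyz, p in probs.items() if binary_entropy(p) > threshold)

def sample_uncertainty(probs: dict) -> float:
    """Mean per-voxel entropy, usable as a per-sample score for data filtering."""
    return sum(binary_entropy(p) for p in probs.values()) / len(probs)

# Toy occupancy predictions keyed by (x, y, z) voxel index (hypothetical values).
toy = {(0, 0, 0): 0.99, (0, 0, 1): 0.50, (0, 1, 0): 0.02, (1, 0, 0): 0.70}
flagged = ambiguous_voxels(toy)
```

Here `flagged` holds the two voxels whose probabilities sit closest to 0.5, and `sample_uncertainty(toy)` gives one number per sample that could drive the data filtering mentioned above.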

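2606.08402 2026-06-17 cs.CV cs.AI cs.MA Updated version

SceneConductor: 3D Scene Generation from a Single Image with Multi-Agent Orchestration

Jeonghwan Kim, Yushi Lan, Yongwei Chen, Hieu Trung Nguyen, Chuanyu Pan, Xingang Pan

Affiliations: Nanyang Technological University; University of Oxford; Meshy AI

AI summary: Proposes a multi-agent orchestration framework that decomposes single-image 3D scene generation into three stages (scene initialization, environment construction, and multi-agent refinement) and introduces a geometry-aware layout predictor; the method outperforms existing approaches in geometric accuracy, spatial consistency, and perceptual realism.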

Abstract

Generating complete 3D scenes from a single image requires inferring globally consistent geometry, object relationships, and environmental context from inherently ambiguous visual evidence. Despite recent progress in joint layout-and-mesh generation, existing methods often rely on holistic or weakly decomposed pipelines that entangle many factors at once and demand extensive scene-level supervision, limiting their generalization to complex real-world environments. We propose a multi-agent orchestration framework that decomposes single-image 3D scene generation into three structured stages: scene initialization, environment construction, and multi-agent refinement. The initialization stage extracts image-derived object masks, builds object-level 3D representations, and predicts an initial spatial layout to form a coarse 3D scene. The environment-construction stage then leverages this initialization together with point-map geometry to build an environmental scaffold of supporting surfaces, room boundaries, materials, and illumination. Finally, in the refinement stage, a planner agent identifies structural and visual inconsistencies, applies simple corrections directly, and dispatches specialist agents for complex localized revisions that are reintegrated into the global scene. To provide reliable structural initialization while reducing reliance on scene-level annotations, we further introduce a geometry-aware layout predictor supervised by sparse geometric priors derived from point maps. Unlike fully supervised layout generators, the predictor can be trained from segmentation-level data and generalizes robustly to diverse real-world scenes. Extensive experiments on benchmark datasets show that our method consistently outperforms prior approaches in geometric accuracy, spatial consistency, and perceptual realism.

2606.17340 2026-06-17 cs.CV cs.AI New submission

Geometry-Consistent Endoscopic Representations for Image-Guided Navigation via Structured Foundation Model Adaptation

Hongchao Shu, Roger D. Soberanis-Mukul, Hao Ding, Morgan Ringel, Mali Shen, Saif Iftekar Sayed, Hedyeh Rafii-Tari, Mathias Unberath

Affiliations: Department of Computer Science, Johns Hopkins University; Semaphor Surgical; Johnson & Johnson MedTech

AI summary: Proposes a unified framework that combines a synthetic data pipeline with Hierarchy-Aware Geometry-Semantic Adaptation to learn geometry-consistent, domain-robust image representations, improving pose estimation and depth prediction in monocular endoscopy.

Abstract

Accurate vision-based navigation in monocular endoscopy is difficult due to limited depth cues, weak tissue texture, non-rigid deformation, and substantial appearance variation across domains, all of which complicate pose estimation, depth prediction, and image-to-anatomy alignment. Although recent vision foundation models have shown promise, their learned representations often remain insufficiently geometry-consistent, hindering stable feature correspondence and limiting their reliability for downstream navigation tasks. We propose a unified framework for learning geometry-consistent and domain-robust image representations for monocular endoscopy. The framework combines a synthetic data pipeline that provides accurate geometric supervision with Hierarchy-Aware Geometry-Semantic Adaptation, a structured alternative to standard LoRA that inserts low-rank adapters selectively across the transformer hierarchy and couples them with layer-wise training objectives to encourage geometric correspondence in intermediate features and semantic consistency in deeper features. Experiments on public and proprietary datasets show improved geometric and semantic representation quality, leading to better performance on downstream navigation tasks including pose estimation and monocular depth estimation. The learned representations show favorable synthetic-to-real transfer on clinical bronchoscopy and provide a useful initialization for adaptation to sinus endoscopy and colonoscopy under limited supervision. The framework also shows favorable scaling with model size and training data. These results support hierarchy-aware, geometry-guided adaptation as a practical approach for endoscopic representation learning.
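
For readers unfamiliar with the baseline being modified, the following minimal sketch shows the standard LoRA weight update, W + (alpha / r) * B A, applied only to a chosen subset of layers. It is not the paper's method: the layer-wise training objectives are not shown, and the helper names, the toy matrices, and the choice of which layers receive adapters are assumptions made for illustration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_weight(w, a, b, alpha):
    """Effective weight W + (alpha / r) * B @ A of a standard LoRA adapter."""
    r = len(a)  # rank = number of rows of A
    delta = matmul(b, a)
    scale = alpha / r
    return [[wij + scale * dij for wij, dij in zip(wr, dr)] for wr, dr in zip(w, delta)]

def adapt_selected_layers(weights, adapters, selected, alpha=2.0):
    """Apply adapters only to the layer indices in `selected` (hypothetical choice)."""
    return [lora_weight(w, *adapters[i], alpha) if i in selected else w
            for i, w in enumerate(weights)]

# Toy 2x2 layers with rank-1 adapters.
w = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]          # A: r x d_in  (r = 1)
b = [[0.5], [0.0]]        # B: d_out x r
weights = [w, w, w]
adapters = {i: (a, b) for i in range(3)}
adapted = adapt_selected_layers(weights, adapters, selected={1, 2})
```

Layer 0 keeps its frozen weight, while layers 1 and 2 receive the low-rank correction, which is the kind of selective placement across the hierarchy that the abstract describes.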

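2606.17343 2026-06-17 cs.CV stat.AP New submission

Bayesian Magnetic Resonance Joint Image Reconstruction and Uncertainty Quantification using Sparsity Prior Models and Markov Chain Monte Carlo Sampling

Ahmed Karam Eldaly, Matteo Figini, Daniel C. Alexander

Affiliations: Department of Computer Science, University of Exeter; UCL Hawkes Institute, Department of Computer Science, University College London

AI summary: Proposes an uncertainty quantification framework built on compressed sensing MR image reconstruction, formulated as a Bayesian linear inverse problem with sparsity priors (total variation or wavelet transform) and sampled by MCMC with a split-and-augmented Gibbs sampler; on single-coil and multi-coil datasets it reports better reconstruction than the corresponding optimisation-based methods and more informative uncertainty maps than existing deep learning-based methods.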

Abstract

We propose a novel framework for uncertainty quantification using compressed sensing magnetic resonance image reconstruction. The problem is formulated within a Bayesian framework as a linear inverse problem, with prior distributions assigned to the unknown model parameters. Specifically, the image to be reconstructed is assumed to be sparse in a given basis. We develop a general framework applicable to any basis and, as examples, test the sparsity of the image in (1) its spatial gradients, using a total variation prior model, and (2) its wavelet transform. A Markov chain Monte Carlo (MCMC) method, based on a split-and-augmented Gibbs sampler, is then employed to sample from the posterior distribution of the unknown parameters. The non-differentiable conditional distributions are efficiently sampled using a proximal MCMC method. The proposed algorithms are validated on both single-coil and multi-coil datasets using various k-space sub-sampling patterns and ratios. The results demonstrate the superior performance of each proposed approach in reconstructing images compared to its counterpart optimisation-based method. Moreover, compared with existing deep learning-based methods, our framework effectively quantifies uncertainty, showing a notable correlation between the estimated uncertainty maps and the error maps computed from the ground truth and reconstructed images.
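
Two building blocks behind this kind of pipeline can be sketched briefly. The first is soft thresholding, the proximal operator of an L1 penalty applied directly to coefficients (for example wavelet coefficients), which is the standard ingredient that proximal methods rely on for non-differentiable sparsity terms. The second is a per-pixel posterior mean and standard deviation computed from samples, a common way to form an uncertainty map. The sampler itself is not shown, and the function names and toy draws are hypothetical.

```python
import math

def soft_threshold(v: float, lam: float) -> float:
    """Proximal operator of lam * |x|: shrink v toward zero by lam."""
    return math.copysign(max(abs(v) - lam, 0.0), v)

def posterior_summary(samples):
    """Per-pixel posterior mean and standard deviation from sampled images."""
    n = len(samples)
    means = [sum(px) / n for px in zip(*samples)]
    stds = [math.sqrt(sum((x - m) ** 2 for x in px) / n)
            for px, m in zip(zip(*samples), means)]
    return means, stds

shrunk = [soft_threshold(v, 0.5) for v in (-2.0, -0.3, 0.0, 0.4, 1.5)]

# Three hypothetical MCMC draws of a flattened 3-pixel image.
draws = [[1.0, 0.0, 2.0], [1.0, 0.2, 4.0], [1.0, 0.4, 6.0]]
mean_img, std_map = posterior_summary(draws)
```

Small coefficients are set to zero and large ones are shrunk by 0.5, while `std_map` is largest at the pixel where the draws disagree most, which is the behaviour an uncertainty map is meant to expose.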

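2606.17379 2026-06-17 cs.CV cs.AI eess.IV New submission

MeiBRD: Meta-Learning Intraoperative Biomechanical Residual Deformation

Casey Meisenzahl, Jon Heiselman, Michael Holtz, Yubo Ye, Michael Miga, Linwei Wang

Affiliations: Rochester Institute of Technology; Vanderbilt University

AI summary: Proposes a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences, learns the residual deformation with a graph neural diffusion function, and uses meta-learning to adapt quickly from intraoperative samples; it outperforms existing methods on a liver phantom.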

Abstract

Accurate intraoperative liver registration is challenging due to substantial soft-tissue deformation and sparse intraoperative measurements. Biomechanical models regularize this ill-posed problem with prior knowledge but exhibit persistent prediction bias due to simplifying assumptions, while data-driven learning solutions struggle with data efficiency, generalization, and physical plausibility. We propose a hybrid registration framework that adapts a biomechanical prior using sparse intraoperative correspondences. Rather than learning a full deformation field, we learn a residual deformation function that corrects linear biomechanical predictions, modeled as a graph neural diffusion function with geometry-aware attention over the 3D liver mesh. To enable long-range information transfer from sparse observations, we view sparse intraoperative measurements as context samples for which input-output pairs of the residual deformation function are fully observed, casting the problem as learning to learn this residual function from intraoperative context samples with feedforward meta-learners. Experiments on a deformable liver phantom dataset demonstrate improved registration accuracy and generalization compared to rigid, biomechanical, and data-driven baselines, particularly for out-of-distribution geometries and deformations.

2606.17437 2026-06-17 cs.CV cs.AI New submission

Spatio-Temporal Fusion Model for Standard View Classification of Echocardiographic Videos

Bo Gou, Jicheng Zhang, Jianlong Xiong, Tao He, Bentian Liu, Hai Wu, Yijiao Wang, Yu Zhang, Yujia Yang, Yun Dai, Jian Liu, Jie Wang

Affiliations: Department of Ultrasound, The First Affiliated Hospital of Chengdu Medical College, School of Clinical Medicine, Chengdu Medical College; College of Computer Science, Sichuan University; Department of Medical Ultrasound, West China Hospital of Sichuan University; Cancer Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College

AI summary: Addresses data scarcity and the difficulty of fusing spatial and temporal features in echocardiographic view classification with an uncertainty-aware dual-stream CNN-LSTM fusion model, which achieves competitive performance on EV9V, described by the authors as the largest publicly available echocardiography video dataset.

Abstract

Automated classification of standard echocardiographic views is crucial for efficient clinical workflow but faces three main challenges. First, publicly available datasets are scarce and limited in scale and view coverage. Second, the performance of some modern video-level architectures for echocardiographic view classification remains underexplored. Third, some view categories exhibit highly similar spatial appearances, making single-frame features insufficient for discrimination, while heterogeneous frame quality complicates robust temporal information fusion. To address these challenges, we release the Echocardiographic Videos of Nine Views (EV9V) dataset, comprising 5,138 videos, 910,579 frames, and 9 standard views, which is, to the best of our knowledge, the largest publicly available echocardiography video dataset. Using EV9V, we systematically benchmark representative video classification architectures, including Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and Transformers. Furthermore, we propose a Spatio-Temporal Fusion Model (STFM), an efficient dual-stream CNN-LSTM (Long Short-Term Memory) framework that jointly captures spatial anatomical structures and temporal cardiac dynamics. The proposed framework leverages uncertainty-aware learning to preferentially sample representative video segments during training and evidence-based fusion during inference, improving robustness to variations in frame quality across echocardiographic videos. Extensive experiments demonstrate that our method achieves competitive performance across diverse video classification models, validating the effectiveness of uncertainty-aware spatio-temporal learning for echocardiographic view classification. The code is available at https://github.com/bgx666/stfm.

2606.17675 2026-06-17 cs.CV New submission

Do We Really Need Diffusion? A Fast U-Net for Paired Medical Image Translation

Alicia Pirwass, Birte Glimm, Michael Munz, Hans-Joachim Wilke

Affiliations: Institute of Artificial Intelligence, Ulm University; Institute of Orthopaedic Research and Biomechanics, Centre for Trauma Research, University Hospital Ulm; AI for Sensor Data Analytics Research Group, Ulm University of Applied Sciences

AI summary: Compares a lightweight 4-level U-Net with a Denoising Diffusion Probabilistic Model (DDPM) for estimating fat fraction from T2-weighted MRI and finds that the U-Net is both more accurate and faster.

Abstract

Magnetic resonance imaging-signal fat fraction (MRI-SFF) quantifies tissue fat and serves as an established biomarker for metabolic and musculoskeletal disorders. However, its acquisition requires specialized MRI sequences that are not routinely available. We investigate whether SFF can be estimated from widely available T2-weighted (T2w) MRI via image-to-image translation (I2I). We further compare a lightweight 4-level U-Net to a state-of-the-art Denoising Diffusion Probabilistic Model (DDPM) using a dataset of 230 048 paired 2D images (183 517 train, 23 621 val, 22 910 test) from the German National Cohort (NAKO). Both models clearly outperform the identity baseline (Pearson correlation r = 0.769, mean absolute error MAE = 0.070 ± 0.054), which confirms that the models learn a non-trivial cross-modal mapping. Interestingly, the lightweight U-Net outperforms the DDPM in both correlation (r = 0.975 vs. 0.962) and error (MAE = 0.014 ± 0.015 vs. 0.019 ± 0.019), while reducing inference time by a factor of 208 (25.2 ms vs. 5 227.2 ms per image using 50 Denoising Diffusion Implicit Model (DDIM) steps). The strong clinical performance at substantially reduced computational cost enables real-time clinical use.
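
The two evaluation metrics named in the abstract, and the arithmetic behind its dataset and timing figures, can be reproduced with a short sketch. The pixel values below are hypothetical and are not data from the paper; only the split sizes and per-image times are taken from the abstract.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def mean_absolute_error(x, y):
    """Average absolute difference between prediction and target."""
    return sum(abs(a - b) for a, b in zip(x, y)) / len(x)

# Hypothetical fat-fraction values for four pixels (not data from the paper).
target = [0.10, 0.20, 0.40, 0.80]
pred = [0.12, 0.18, 0.43, 0.77]
r = pearson_r(target, pred)
mae = mean_absolute_error(target, pred)

# Figures quoted in the abstract.
split_total = 183_517 + 23_621 + 22_910   # train + val + test
speedup = 5227.2 / 25.2                   # DDPM time / U-Net time per image
```

The three splits add up to the stated 230 048 pairs. The rounded per-image times give a ratio of about 207.4, so the quoted factor of 208 presumably reflects unrounded timings.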

2606.17702 2026-06-17 cs.CV cs.AI New submission

SegTME-UNI2: A Foundation Model-Based Framework for Generalisable Multiclass Cell Segmentation and LLM-Driven Tumour Microenvironment Characterisation in Histopathology

Wan Siti Halimatul Munirah Wan Ahmad, Faris Syahmi Samidi, Mohammad Badal Ahmmed, Vimal Angela Thiviyanathan, Selvam James Thavaraj, Anwar P. P. Abdul Majeed

Affiliations: Department of Data Science and Artificial Intelligence, School of Computing and Artificial Intelligence, Faculty of Engineering and Technology, Sunway University; Faculty of Dentistry, Universiti Malaya

AI summary: Proposes the SegTME-UNI2 framework, which pairs the UNI2-H pathology foundation model with a dual-head UperNet decoder for six-class semantic segmentation and nuclear instance segmentation, addresses the lack of annotations with a three-stage pseudo-label curriculum, and uses an LLM to generate clinically interpretable TME reports.

Abstract

Characterising the tumour microenvironment (TME) from routine H&E-stained histology images requires simultaneous cell segmentation, feature extraction, and interpretable clinical reporting. We present SEGTME-UNI2, a unified framework addressing these requirements. Its core is UNI2-UPERHOVER, a dual-head segmentation model pairing the UNI2-H pathology foundation model (ViT-Giant, pretrained on >100M tiles from 100K slides) with two parallel UperNet decoders: one for six-class semantic segmentation and one for horizontal-vertical gradient regression enabling watershed-based nuclear instance separation. To address the lack of pixel-level annotations in large real-world repositories, UNI2-UPERHOVER undergoes a three-stage progressive pseudo-label curriculum. Each stage trains a fresh model without weight transfer, driving improvement entirely via increased pseudo-label quality: Stage 1: Uses human-annotated PanNuke (7,901 images, 189,744 nuclei, 0.25 um/pixel). Stage 2: Uses entropy-filtered pseudo-labels from the Stage 1 model on 271,711 TCGA-UT scale-0 patches (0.5 um/pixel). Stage 3: Uses pseudo-labels from the Stage 2 model on all 1,608,060 TCGA-UT patches across six resolution scales (0.5-1.0 um/pixel). Segmentation outputs feed a structured TME feature extraction pipeline computing 20+ per-patch compositional, morphological, spatial entropy, and intercellular distance metrics. These are encoded as JSON and passed to a fine-tuned NVIDIA BioNeMo GPT model to generate clinically interpretable TME narratives. Preliminary validation on held-out PanNuke and TCGA-UT partitions demonstrates framework feasibility and internal consistency. The pseudo-labelled TCGA-UT dataset and UNI2-UPERHOVER checkpoint are publicly released to support large-scale TME profiling and spatial biology research.

2606.17710 2026-06-17 cs.CV cs.AI cs.CL cs.LG New submission

Vision-language models for chest radiography do not always need the image

Mahshad Lotfinia, Sebastian Ziegelmayer, Lisa Adams, Daniel Truhn, Andreas Maier, Soroosh Tayebi Arasteh

Affiliations: Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg; Department of Diagnostic and Interventional Radiology, TUM University Clinic, School of Medicine and Health, Klinikum rechts der Isar, Technical University of Munich; Lab for AI in Medicine, RWTH Aachen University; Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen

AI summary: Using a causal audit, shows that many medical vision-language models rely on text priors rather than the image for chest radiograph tasks, with a text-only model performing close to multimodal models, and proposes an evaluation framework based on image dependence.

Abstract

Medical vision-language models report strong chest radiograph accuracy, and this is increasingly read as evidence that they use the image. That inference is unsafe: a model exploiting finding-name priors scores like one that reads the scan, and no standard benchmark separates them. We introduce a causal audit that intervenes on the image, occluding the relevant region, occluding an irrelevant one, and swapping in another patient's same-label scan, and combines three behavioral metrics to test whether a correct answer depends on the image. Across nine systems, a text-only model with no image access reaches within 5.7 accuracy points of the best multimodal one, and a 119-billion-parameter multimodal model is statistically indistinguishable from a 7-billion text-only baseline. The audit splits the cohort into three models that ignore the image, one that is unstable, and five that use it selectively, for a subset of findings; the categories hold across a second dataset, resolution, and prompt phrasing. Against board-certified radiologists, a text-only model is statistically indistinguishable from a radiologist's accuracy while grounding at zero, whereas the image-using models ground at radiologist-comparable rates. Reported confidence flags ungrounded answers only when a model uses the image. Grounding audits, not accuracy, should gate clinical deployment.
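
The logic of an intervention-based audit can be sketched as follows. The abstract does not name its three behavioral metrics, so this example uses a single hypothetical rule: a correct answer counts as grounded only if it breaks when the relevant region is occluded and survives when an irrelevant region is occluded. The same-label swap intervention is omitted, and all names and toy records are assumptions.

```python
from dataclasses import dataclass

@dataclass
class AuditCase:
    correct_original: bool        # answer on the untouched image
    correct_relevant_occ: bool    # answer with the relevant region occluded
    correct_irrelevant_occ: bool  # answer with an irrelevant region occluded

def is_grounded(case: AuditCase) -> bool:
    """Hypothetical rule: correct, sensitive to the relevant region only."""
    return (case.correct_original
            and not case.correct_relevant_occ
            and case.correct_irrelevant_occ)

def grounding_rate(cases) -> float:
    """Fraction of originally correct answers that count as grounded."""
    correct = [c for c in cases if c.correct_original]
    if not correct:
        return 0.0
    return sum(is_grounded(c) for c in correct) / len(correct)

# Toy records: one model never reacts to occlusion, the other usually does.
text_prior_model = [AuditCase(True, True, True)] * 4
image_using_model = [AuditCase(True, False, True)] * 3 + [AuditCase(True, True, True)]
```

Both toy models are 100% accurate on the original images, yet their grounding rates are 0.0 and 0.75, which is the gap between accuracy and image use that the abstract argues standard benchmarks cannot see.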

2606.17742 2026-06-17 cs.CV q-bio.NC New submission

BrainWorld: A Structural-Prior-Conditioned Generative Model for Whole-Brain 4D fMRI Dynamics

Junfeng Xia, Wenhao Ye, Junxiang Zhang, Xuanye Pan, Mo Wang, Quanying Liu

Affiliations: Department of Biomedical Engineering, Southern University of Science and Technology; School of Biomedical Engineering, Shenzhen University

AI summary: Proposes BrainWorld, which conditions on structural MRI as an anatomical prior and generates whole-brain 4D fMRI dynamics through a denoising process; across 22 datasets it generates stable trajectories of up to 400 frames and improves downstream tasks through generated-example augmentation.

Abstract

Whole-brain 4D fMRI generation is valuable for modeling functional brain dynamics, yet existing fMRI foundation models mainly target representation learning and downstream prediction rather than conditional predictive generation. We introduce BrainWorld, a structural-prior-conditioned generative model for whole-brain 4D fMRI dynamics. BrainWorld uses sMRI as subject-level anatomical context to guide future fMRI generation, integrating structural information into the denoising process rather than treating it as a parallel modality. Evaluated on 22 datasets spanning diverse cohorts and brain states, BrainWorld generates stable 4D fMRI trajectories up to 400 frames, improves downstream performance through generated-example augmentation, and learns transferable multimodal representations that outperform baselines. Together, these results establish BrainWorld as a condition-aware generative framework for long-horizon brain dynamics modeling and multimodal representation learning.

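2606.17836 2026-06-17 cs.CV cs.AI cs.CG cs.GR New submission

When LLMs Analyze Scars: From Images to Clinically Meaningful Features

Ruman Wang, Hangting Ye

Affiliations: Liaoning University of Traditional Chinese Medicine; School of Artificial Intelligence, Jilin University

AI summary: Proposes the ScaFE framework, which uses an LLM as a knowledge-driven feature engineer to turn high-dimensional images into low-dimensional, clinically interpretable features, and outperforms end-to-end deep learning methods on scar classification with scarce data.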

Abstract

Medical image classification faces a fundamental dilemma: while deep learning models achieve remarkable performance at scale, real-world clinical scenarios often suffer from severe data scarcity due to annotation costs, privacy constraints, and disease rarity. This challenge is particularly pronounced in pathological scar classification, where differentiating keloids from hypertrophic scars requires subtle expert knowledge and labeled images are extremely limited. We propose a novel paradigm that repositions large language models (LLMs) as knowledge-driven feature engineers rather than end-to-end classifiers. We call this framework ScaFE (Scar Feature Engineering). Our key insight is that LLMs encode rich medical knowledge that can be externalized as executable feature extraction code, enabling the transformation of high-dimensional images into low-dimensional, clinically interpretable representations. Specifically, we prompt an LLM with established scar assessment criteria to generate deterministic Python code that extracts features aligned with clinical scoring systems such as the Vancouver Scar Scale. Our approach offers three key advantages: (1) data efficiency, achieving robust performance with limited training samples by decoupling knowledge acquisition from statistical learning; (2) privacy preservation, as raw images are processed locally without exposure to external LLMs; and (3) interpretability, through explicit features grounded in clinical reasoning. Extensive experiments on scar classification demonstrate that, under limited data conditions, our method consistently outperforms both end-to-end deep learning baselines and LLMs used as black-box classifiers, establishing a promising direction for integrating LLMs into data-efficient and clinically transparent medical AI systems.
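
To make the idea of deterministic feature extraction code concrete, here is a minimal sketch of the kind of function such a pipeline might produce. It is not code from the paper: the two features, their names, and the toy pixel values are hypothetical, chosen as rough image proxies for the vascularity and pigmentation items of the Vancouver Scar Scale.

```python
def mean_channel(pixels, idx):
    """Average of one RGB channel over a list of (r, g, b) pixels."""
    return sum(p[idx] for p in pixels) / len(pixels)

def scar_features(scar_pixels, skin_pixels):
    """Deterministic, low-dimensional features from RGB pixels in the 0-255 range."""
    def redness(pixels):
        r, g, b = (mean_channel(pixels, i) for i in range(3))
        return r / (r + g + b)

    def lightness(pixels):
        return sum(sum(p) / 3 for p in pixels) / len(pixels) / 255.0

    return {
        # Proxy for vascularity: how much redder the scar is than nearby skin.
        "redness_excess": redness(scar_pixels) - redness(skin_pixels),
        # Proxy for pigmentation: how much lighter or darker the scar is.
        "lightness_diff": lightness(scar_pixels) - lightness(skin_pixels),
    }

# Hypothetical pixels sampled inside the scar and from surrounding skin.
scar = [(200, 80, 80), (220, 100, 90)]
skin = [(210, 170, 150), (200, 160, 140)]
feats = scar_features(scar, skin)
```

Because the output is a small dictionary of named quantities, a simple classifier can be trained on few samples and each feature can be inspected directly, which is the data-efficiency and interpretability argument the abstract makes.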

2606.18123 2026-06-17 cs.CV New submission

Predicting Immune Biomarkers with MultiModal Mixture-of-Expert Pathology Foundation Models Empowers Precision Oncology

Tianyu Liu, Ziqing Wang, Zhaokang Liang, Tong Ding, Peter Humphrey, Lorraine Colón-Cartagena, Emily Ling-Lin Pai, Kenneth Tou En Chang, Mohamed Kahila, Jonathan Chong Kai Liew, Tinglin Huang, Rex Ying, Kaize Ding, Faisal Mahmood, Wengong Jin

Affiliations: Program of Computational Biology and Bioinformatics, Yale University; Broad Institute of MIT and Harvard; Department of Statistics and Data Science, Northwestern University; Department of Computer Science, Northeastern University; Department of Computer Science, Harvard University; Department of Pathology, Yale University; Department of Anatomic Pathology and Laboratory Medicine, Hospital of the University of Pennsylvania; Department of Pathology and Laboratory Medicine, University of California, San Francisco; Department of Pathology and Laboratory Medicine, KK Women's and Children's Hospital; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania

AI summary: Proposes MixTIME, a multimodal foundation model with a mixture-of-experts architecture that integrates pathology foundation models from different modalities to predict multiplex immunofluorescence protein expression from HE whole-slide images; it achieves state-of-the-art performance on 17 protein markers and strengthens downstream tasks such as spatial domain identification and survival prediction.

Comments 5 figures

Abstract

Predicting immune biomarkers associated with the tumor immune microenvironment (TIME) is critical for advancing precision oncology, yet existing approaches are largely limited to single image modalities and suffer from insufficient resolution and incomplete utilization of complementary clinical and biological information. Here we introduce MixTIME, a multimodal foundation model that leverages a mixture-of-experts (MoE) architecture to integrate pathology foundation models trained across distinct modalities, namely image-only (UNIv2), image-text (CONCHv1.5), and image-transcriptomic (STPath) representations, for pixel-level and slide-level prediction of multiplex immunofluorescence (mIF) protein expression from hematoxylin and eosin (HE) whole-slide images. MixTIME employs a learnable router to dynamically weight expert contributions and is trained with a distribution- and tendency-aware loss function. Benchmarked on two datasets of different scales, MixTIME achieves state-of-the-art performance across 17 protein markers as measured by correlation metrics. The predicted mIF profiles substantially enhance downstream tasks, including spatial domain identification, survival prediction, and AI-assisted pathology report generation validated by expert pathologists from multiple institutes across the world. Furthermore, MixTIME enables longitudinal tracking of protein expression dynamics across clinical time points and reveals protein-gene interaction patterns linked to drug resistance and immune suppression in tumor microenvironments. Collectively, MixTIME provides a scalable framework for multimodal biomarker discovery and clinical translation in computational pathology.
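
The router mechanism mentioned in the abstract can be illustrated with a generic softmax gate over three experts. This is a minimal sketch of mixture-of-experts weighting in general, not MixTIME's architecture: in a trained model the router logits would come from a learned layer, whereas here they are fixed, and the function names and toy predictions are hypothetical.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def route(router_logits, expert_outputs):
    """Weight each expert's prediction by its softmax router score and sum."""
    weights = softmax(router_logits)
    dim = len(expert_outputs[0])
    fused = [sum(w * out[d] for w, out in zip(weights, expert_outputs))
             for d in range(dim)]
    return weights, fused

# Hypothetical per-tile predictions of two marker intensities from three experts
# (stand-ins for image-only, image-text, and image-transcriptomic models).
outputs = [[0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]
weights, fused = route([0.0, 0.0, 0.0], outputs)
sharp_weights, sharp_fused = route([5.0, 0.0, 0.0], outputs)
```

With equal logits the fused prediction is the plain average of the experts; raising one logit moves the output toward that expert, which is how a router can favour different modalities for different inputs.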

2606.17213 2026-06-17 cs.CL cs.CV 交叉投稿

Revisiting LLM Adaptation for 3D CT Report Generation: A Study of Scaling and Diagnostic Priors

重新审视用于3D CT报告生成的LLM适应：缩放与诊断先验研究

Vanshali Sharma, Andrea M. Bejar, Halil Ertugrul Aktas, Quoc-Huy Trinh, Debesh Jha, Gorkem Durak, Ulas Bagci

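Affiliations * Northwestern University; University of South Dakota; Aalto University

AI Summary: Proposes RAD3D-Prefix, a lightweight diagnostic-prior framework that keeps the large language model frozen and fuses multi-label classification logits with image embeddings, generating 3D CT reports with few trainable parameters. It outperforms comparable parameter-efficient baselines and generalizes well out of domain.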

详情

AI中文摘要

多模态学习的最新进展，包括大型语言模型（LLM）和视觉-语言模型（VLM），已展现出对自然图像的强大适应性。然而，将其扩展到医学领域，特别是体积（3D）图像，由于高计算复杂度、体积依赖性和视觉特征与临床术语之间的语义差距而具有挑战性。在有限的医学数据上对LLM进行朴素微调常常导致过拟合和临床幻觉，其中语言流畅性优先于临床事实性。在本研究中，我们研究了用于体积CT报告生成的参数高效适应策略，并引入了RAD3D-Prefix，一种轻量级的诊断先验条件框架，最大限度地减少了对大量参数训练的需求。该模块将图像嵌入与多标签诊断分类逻辑相结合，保留了关键的临床细节，同时弥合了语义差距。通过保持LLM冻结，我们的方法需要最少的可训练参数，并减轻了在小规模、特定领域数据集上过拟合的风险。通过对从96.1M到1.6B参数的LLM进行系统研究，我们发现微调对较小的LLM最有益，而冻结较大的（约1B+）LLM并仅训练轻量级投影层在性能、泛化性和计算效率之间提供了优越的权衡。在多个自动指标和一项临床读者研究中，RAD3D-Prefix优于可比较的参数高效基线，并在使用比全微调替代方案少得多的可训练参数的情况下，展现出强大的域外泛化能力。

英文摘要

Recent advances in multimodal learning, including large language models (LLMs) and vision-language models (VLMs), have demonstrated strong adaptability to natural images. However, extending their use to the medical domain, particularly for volumetric (3D) images, is challenging due to high computational complexity, volumetric dependencies and the semantic gap between visual features and clinical terminology. Naively fine-tuning LLMs on limited medical data often leads to overfitting and clinical hallucination, where linguistic fluency is prioritized over clinical factuality. In this study, we investigate parameter-efficient adaptation strategies for volumetric CT report generation and introduce RAD3D-Prefix, a lightweight diagnostic-prior conditioning framework that minimizes the need for extensive parameter training. This module integrates image embeddings with multi-label diagnostic classification logits, preserving critical clinical details while bridging the semantic gap. By keeping the LLM frozen, our method requires minimal trainable parameters and mitigates the risk of overfitting on small, domain-specific datasets. Through a systematic study spanning LLMs from 96.1M to 1.6B parameters, we find that fine-tuning is most beneficial for smaller LLMs, whereas freezing larger (~1B+) LLMs and training only lightweight projection layers provides a superior trade-off between performance, generalization, and computational efficiency. Across multiple automatic metrics and a clinical reader study, RAD3D-Prefix outperforms comparable parameter-efficient baselines and demonstrates strong out-of-domain generalization while using substantially fewer trainable parameters than fully fine-tuned alternatives.

2606.17295 2026-06-17 eess.IV cs.CV 交叉投稿

Phenotyping TPF via Self-Supervised Learning: A Label-Agnostic Framework with Expert Validation

通过自监督学习进行胫骨平台骨折表型分析：一种具有专家验证的标签无关框架

Miral Elnakib, Muhammad Saad, Ahmad Al-Kabbany

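Affiliations * Faculty of Sciences; Alexandria University; Multimedia Interaction and Communication Lab; Wearables, Biosensing, and Biosignal Processing Research Lab; Arab Academy for Science and Technology

AI Summary: Proposes a label-agnostic self-supervised framework that learns fracture representations directly from radiographs using SimCLR and clustering. It discovers four imaging-derived phenotypes that blinded expert review found stable and clinically interpretable, and that are orthogonal to conventional classification.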

详情

AI中文摘要

人工智能在胫骨平台骨折特征描述中的全部潜力尚未实现，受限于对标注数据集的根本依赖，而标注数据集的一致性无法保证：传统的分类方案如Schatzker和AO/OTA存在观察者间变异性，导致监督模型学习的是人类分歧而非稳定的骨折形态。我们设计、实现并验证了一个标签无关的框架，通过直接从影像数据中学习骨折表征来消除这一约束，无需观察者分配的标签。使用RadImageNet预训练的ResNet-50编码器，在154张清洁的膝关节X光片上通过SimCLR对比目标进行微调，之前进行数据清洗协议，之后进行UMAP降维和k-means聚类，以发现四种影像衍生表型。通过盲法专家审查协议评估表型有效性，由两名独立临床医生进行。四种表型表现出稳健的稳定性（bootstrap ARI = 0.319 +/- 0.041）、强内部凝聚力（轮廓系数 = 0.511），以及两名评审者在盲法条件下给出3-5/5的一致性评分；一种表型被一致认为表现出粉碎性——一种在没有监督信号的情况下分离出的高复杂性特征。与Schatzker标签的跨分区比较得出ARI = 0.013，证实了与传统分类边界的正交性。值得注意的是，锚定于既定分类词汇的专家评审者在Schatzker对齐度最低的地方认为影像衍生组是异质的，这表明Schatzker训练的感知和标签无关的嵌入几何测量的是正交维度。这些发现确立了标签无关的SSL表型分析作为传统分类的可重复且临床可解释的补充。

英文摘要

The full potential of artificial intelligence in tibial plateau fracture characterisation remains unrealised, constrained by a fundamental dependency on labelled datasets whose consistency cannot be guaranteed: conventional classification schemes such as Schatzker and AO/OTA suffer from inter-observer variability, causing supervised models to learn human disagreement rather than stable fracture morphology. We design, implement, and validate a label-agnostic framework that eliminates this constraint by learning fracture representations directly from imaging data without observer-assigned labels. A RadImageNet-pretrained ResNet-50 encoder is fine-tuned on 154 cleaned knee radiographs using the SimCLR contrastive objective, preceded by a data cleaning protocol and followed by UMAP dimensionality reduction and k-means clustering to discover four imaging-derived phenotypes. Phenotype validity is assessed through a blinded expert review protocol administered to two independent clinicians. The four phenotypes demonstrate robust stability (bootstrap ARI = 0.319 +/- 0.041), strong internal cohesion (silhouette = 0.511), and coherence ratings of 3-5/5 from both reviewers under blinded conditions; one phenotype was unanimously identified as exhibiting comminution -- a high-complexity feature isolated without any supervisory signal. Inter-partition comparison against Schatzker labels yields ARI = 0.013, confirming orthogonality to conventional classification boundaries. Notably, expert reviewers anchored to established classification vocabularies perceived imaging-derived groups as heterogeneous precisely where Schatzker alignment was lowest, suggesting that Schatzker-trained perception and label-agnostic embedding geometry measure orthogonal dimensions. These findings establish label-agnostic SSL phenotyping as a reproducible and clinically interpretable complement to conventional classification.
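
The abstract reports agreement between partitions as an adjusted Rand index (ARI): 0.319 for bootstrap stability and 0.013 against Schatzker labels. As a reading aid, here is a small pure-Python sketch of the standard ARI formula. The function name and the toy label lists are illustrative assumptions and are not taken from the paper's code.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Standard ARI between two partitions of the same items (needs at least 2 items)."""
    n = len(labels_a)
    # Pairs of items that share a cluster in both partitions.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    # Pairs of items that share a cluster within each partition separately.
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        # Degenerate case, e.g. both partitions put everything in one cluster.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)

same_up_to_renaming = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
fully_crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Here `same_up_to_renaming` is 1.0 because cluster names do not matter, while `fully_crossed` is about -0.5, which is worse than chance. A value near 0, like the 0.013 reported against Schatzker labels, means the two partitions agree about as often as random assignments would.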

2606.17504 2026-06-17 eess.IV cs.CV 交叉投稿

Two-Stage Fine-Tuning of ResNet50 for High-Sensitivity Melanoma Detection on Dermoscopic Images

ResNet50的两阶段微调用于皮肤镜图像中高灵敏度黑色素瘤检测

Aryan Bhagat

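AI Summary: Proposes a two-stage fine-tuning method for ResNet50 (head-only training followed by low-learning-rate fine-tuning of all layers) to address class imbalance and suboptimal transfer learning. It reaches an AUC-ROC of 0.9559 and 87.56% sensitivity on 3,826 test images, outperforming single-stage fine-tuning.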

Comments 13 pages, 4 figures, 4 tables. Code available at https://github.com/Aryanbhagat23/melanoma-detection

详情

AI中文摘要

黑色素瘤是最危险的皮肤癌，早期检测五年生存率超过99%，但一旦扩散则急剧下降。本文提出并评估了一种两阶段微调方法，用于皮肤镜图像上的二分类黑色素瘤检测，基于ResNet50。解决的核心挑战是类别不平衡和单阶段微调导致的迁移学习次优。在分层训练/验证/测试分割后，仅对训练集应用随机过采样以实现1:1类别平衡。第一阶段冻结ResNet50骨干网络，仅训练分类头；第二阶段以1e-5的低学习率联合微调所有层，以防止对已学习视觉特征的灾难性遗忘。在包含3826张图像的独立测试集上，模型实现了AUC-ROC为0.9559，准确率88.34%，灵敏度87.56%，特异度89.13%，F1分数88.29%。消融研究证实两阶段协议显著优于单阶段微调，灵敏度提升超过4%。Grad-CAM可视化展示了正确的病变定位。提供了完全可部署的Streamlit检测应用程序及所有训练代码。

英文摘要

Melanoma is the most dangerous form of skin cancer with five-year survival rates exceeding 99% when detected early but falling sharply once the disease spreads. This paper proposes and evaluates a two-stage fine-tuning approach for ResNet50 applied to binary melanoma classification on dermoscopic images. The core challenges addressed are class imbalance and suboptimal transfer learning from single-stage fine-tuning. After stratified train/validation/test splitting, random oversampling was applied exclusively to the training set to achieve a 1:1 class balance. Stage 1 trained only the classification head with the ResNet50 base frozen, while Stage 2 fine-tuned all layers jointly at a low learning rate of 1e-5 to prevent catastrophic forgetting of learned visual features. On an independent test set of 3,826 images, the model achieved an AUC-ROC of 0.9559, accuracy of 88.34%, sensitivity of 87.56%, specificity of 89.13%, and F1-score of 88.29%. An ablation study confirms the two-stage protocol significantly outperforms single-stage fine-tuning, with sensitivity gains of over 4%. Grad-CAM visualizations demonstrate correct lesion localization. A fully deployable Streamlit detection application is provided alongside all training code.
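
Two details in the abstract are easy to get wrong in practice: oversampling is applied only to the training split, and sensitivity and specificity are computed per class. The sketch below illustrates both with the standard library. It is not the author's code (which the Comments line points to); the function names, the seed, and the toy labels are assumptions made for illustration.

```python
import random

def oversample_minority(samples, labels, seed=0):
    """Randomly duplicate minority-class samples until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for s, y in zip(samples, labels):
        by_class.setdefault(y, []).append(s)
    target = max(len(items) for items in by_class.values())
    out_samples, out_labels = [], []
    for y, items in by_class.items():
        # Draw extra copies with replacement to reach the majority-class size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for s in items + extra:
            out_samples.append(s)
            out_labels.append(y)
    return out_samples, out_labels

def sensitivity_specificity(y_true, y_pred):
    """Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP). Positive class is 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Call this on the training split only; validation and test data stay untouched.
train_x, train_y = oversample_minority(list(range(10)), [1, 1] + [0] * 8)
sens, spec = sensitivity_specificity([1, 1, 1, 0, 0, 0, 0], [1, 1, 0, 0, 0, 0, 1])
```

Oversampling before the split would let copies of the same image land in both training and test sets and inflate the reported metrics, which is why the abstract stresses that it was applied exclusively to the training set.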

2606.17791 2026-06-17 cs.CL cs.CV 交叉投稿

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

Slop悖论：合成标准化如何侵蚀AI重写放射学报告中的临床不确定性和跨模态对齐

Samar Ansari

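Affiliations * School of Computing and Engineering Sciences, University of Chester

AI Summary: Measures, in a controlled experiment, the information degradation caused by AI rewriting of radiology reports. EHR summarization destroys the most content yet preserves image-text alignment, whereas standardized rewriting and teaching case preparation do the reverse and cause larger alignment losses, a pattern the author calls the slop paradox.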

详情

AI中文摘要

AI辅助临床文档工具越来越多地使用大型语言模型（LLMs）对放射学报告进行摘要、标准化和重新格式化。我们提出了对由此产生的信息退化的受控测量。使用印第安纳大学数据集的450份胸部X光报告，我们通过三种真实的LLM重写任务生成合成版本：电子健康记录摘要、标准化重写和教学病例准备。我们测量了实体侵蚀（通过医学命名实体识别）、对冲崩溃（临床不确定性语言的丧失）和跨模态对齐退化（通过BiomedCLIP图像-文本相似度）。我们的核心发现是信息损失与跨模态保真度之间的分离。电子健康记录摘要在内容层面最具破坏性，侵蚀了51.4%的临床实体和43.7%的对冲语言，但它几乎完全保留了图像-文本对齐（下降2.5%）。旨在生成更干净训练数据的两个任务，即标准化重写和教学病例准备，则相反：它们保留了更多实体（分别侵蚀26.8%和29.3%），但导致14.9-16.5%的对齐下降，是电子健康记录摘要的六到七倍。我们称之为slop悖论：使临床文本看起来更干净以用于多模态训练的重写恰恰使其偏离图像。与我们预先指定的假设相反，罕见病理并未优先退化：在九次罕见与常见比较中，没有差异在多重比较校正后幸存，且名义差异方向相反（常见>罕见），因此污染对特定条件监测不可见。退化的主要决定因素是AI重写任务的类型，而非临床内容。这些发现对多模态医学AI数据集构建和AI辅助临床文档的治理具有重要意义。

英文摘要

AI-assisted clinical documentation tools increasingly summarize, standardize, and reformat radiology reports using large language models (LLMs). We present a controlled measurement of the resulting information degradation. Using 450 chest X-ray reports from the Indiana University dataset, we generate synthetic versions via three realistic LLM rewriting tasks: EHR summarization, standardized rewriting, and teaching case preparation. We measure entity erosion (via medical NER), hedging collapse (loss of clinical uncertainty language), and cross-modal alignment degradation (via BiomedCLIP image-text similarity). Our central finding is a dissociation between information loss and cross-modal fidelity. EHR summarization is the most destructive at the content level, eroding 51.4% of clinical entities and 43.7% of hedging language, yet it preserves image-text alignment almost entirely (a 2.5% drop). The two tasks meant to produce cleaner training data, standardized rewriting and teaching case preparation, do the reverse: they preserve more entities (26.8% and 29.3% eroded) but cause 14.9-16.5% alignment drops, six to seven times those of EHR summarization. We term this the slop paradox: rewriting that makes clinical text look cleaner for multimodal training is precisely what pulls it away from the image. Contrary to our pre-specified hypothesis, rare pathologies were not preferentially degraded: across nine rare-versus-common comparisons, no difference survived multiple-comparison correction, and nominal differences ran in the opposite direction (common > rare), so contamination is invisible to condition-specific monitoring. The dominant determinant of degradation is the type of AI rewriting task, not the clinical content. These findings bear on multimodal medical AI dataset construction and the governance of AI-assisted clinical documentation.
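
The erosion percentages and the "six to seven times" comparison in the abstract follow from simple ratios. The sketch below shows one plausible way to compute an erosion rate as the share of original items missing after rewriting, and checks the ratio arithmetic using the figures quoted in the abstract. The function name and the example entity lists are hypothetical; the paper's exact matching rules for entities and hedges are not given here.

```python
def erosion_rate(original_items, rewritten_items):
    """Fraction of distinct original items that no longer appear after rewriting."""
    original = set(original_items)
    if not original:
        return 0.0
    lost = original - set(rewritten_items)
    return len(lost) / len(original)

# Hypothetical entities extracted from one report before and after rewriting.
before = ["effusion", "cardiomegaly", "pneumothorax", "nodule"]
after = ["effusion", "nodule"]
rate = erosion_rate(before, after)

# Alignment drops quoted in the abstract, in percent.
ehr_drop = 2.5
ratios = [round(drop / ehr_drop, 1) for drop in (14.9, 16.5)]
```

In this toy case `rate` is 0.5, since two of four entities were lost. The `ratios` list evaluates to 6.0 and 6.6, which matches the abstract's statement that the alignment drops are six to seven times those of EHR summarization.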

DynFS-MoE: 用于创伤后癫痫诊断的动态功能-结构混合专家模型

Jun-En Ding, Spencer Chen, Henry Noren, Daniel Valdivia, Christine Yohn, Suhina Patel, Taylor Zink, Hai Sun, Feng Liu

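Affiliations * Department of Systems Engineering, Stevens Institute of Technology; Department of Neurosurgery, Robert Wood Johnson Medical School, Rutgers University

AI Summary: Proposes a dynamic multimodal mixture-of-experts framework that fuses functional and structural MRI through time-aware functional-structural encoding and class-conditioned expert routing. It outperforms static fusion baselines on three binary classification tasks and reveals meaningful ROI interactions.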

详情

AI中文摘要

创伤后癫痫（PTE）是创伤性脑损伤（TBI）的严重并发症，但由于其在大脑中诱导的复杂结构和功能改变，早期识别仍然具有挑战性。为了解决这个问题，我们提出了一个动态多模态混合专家（MoE）框架，通过时间感知功能-结构编码和类别条件专家路由，整合功能性和结构性MRI。在该框架内，模态特定和跨模态专家学习互补表示，而模态-类别MoE（MCoE）模块根据每个分类目标动态分配专家权重。跨三个二分类任务的实验结果表明，该框架始终优于静态融合基线，高可解释性分析进一步揭示了有意义的感兴趣区域（ROI）交互。这种动态多模态专家框架有效捕获了类别依赖的脑交互模式，并为PTE诊断和风险分层提供了一种可解释的方法。

英文摘要

Post-traumatic epilepsy (PTE) is a severe complication of traumatic brain injury (TBI), yet early identification remains challenging due to the complex structural and functional alterations it induces in the brain. To address this, we propose a dynamic multimodal Mixture-of-Experts (MoE) framework that integrates functional and structural MRI through time-aware functional-structural encoding and class-conditioned expert routing. Within this framework, modality-specific and cross-modal experts learn complementary representations, while a Modality-Class MoE (MCoE) module dynamically dispatches expert weights according to each classification objective. Experimental results across three binary classification tasks demonstrate that the framework consistently outperforms static fusion baselines, and high-interpretability analyses further reveal meaningful region-of-interest (ROI) interactions. This dynamic multimodal expert framework effectively captures class-dependent brain interaction patterns and provides an interpretable approach for PTE diagnosis and risk stratification.

2507.04704 2026-06-17 q-bio.QM cs.AI cs.CV 版本更新

SPATIA: Multimodal Generation and Prediction of Spatial Cell Phenotypes

SPATIA: 空间细胞表型的多模态生成与预测

Zhenglun Kong, Mufan Qiu, John Boesen, Xiang Lin, Sukwon Yun, Tianlong Chen, Manolis Kellis, Marinka Zitnik

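AI Summary: Proposes SPATIA, a model that fuses cell morphology, gene expression, and spatial context, using confidence-aware flow matching and morphology-profile alignment for multi-scale generation and prediction. It is benchmarked against 18 models across 12 tasks, where it improves on state-of-the-art results.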

Comments ICML 2026

详情

AI中文摘要

理解细胞形态、基因表达和空间上下文如何共同塑造组织功能是生物学中的一个核心挑战。基于图像的空间转录组学技术现在能够提供细胞图像和基因表达谱的高分辨率测量，但现有方法通常孤立地分析这些模态或以有限的分辨率进行分析。我们通过引入SPATIA来解决这个问题，这是一个多层次的生成和预测模型，通过融合从细胞到组织水平的形态、基因表达和空间上下文，学习统一的、空间感知的表征。SPATIA还结合了一个空间条件生成框架，该框架具有置信感知的OT重加权和形态-谱对齐，用于建模目标状态形态分布。具体来说，我们提出了一个置信感知的流匹配目标，该目标基于不确定性对弱最优传输对进行重加权。我们进一步应用形态-谱对齐来鼓励有生物学意义的图像生成，从而能够建模微环境依赖的表型转变。我们组装了一个多尺度数据集，包含17个组织中的2590万个细胞-基因对。我们在12项任务上对SPATIA与18个模型进行了基准测试，涵盖表型生成、注释、聚类、基因插补和跨模态预测等类别。SPATIA相比最先进模型取得了改进，生成保真度提高了8%，预测准确率提高了3%。

英文摘要

Understanding how cellular morphology, gene expression, and spatial context jointly shape tissue function is a central challenge in biology. Image-based spatial transcriptomics technologies now provide high-resolution measurements of cell images and gene expression profiles, but existing methods typically analyze these modalities in isolation or at limited resolution. We address the problem by introducing SPATIA, a multi-level generative and predictive model that learns unified, spatially aware representations by fusing morphology, gene expression, and spatial context from the cell to the tissue level. SPATIA also incorporates a spatially conditioned generative framework with confidence-aware OT reweighting and morphology-profile alignment for modeling target-state morphology distributions. Specifically, we propose a confidence-aware flow matching objective that reweights weak optimal-transport pairs based on uncertainty. We further apply morphology-profile alignment to encourage biologically meaningful image generation, enabling the modeling of microenvironment-dependent phenotypic transitions. We assembled a multi-scale dataset consisting of 25.9 million cell-gene pairs across 17 tissues. We benchmark SPATIA against 18 models across 12 tasks, spanning categories such as phenotype generation, annotation, clustering, gene imputation, and cross-modal prediction. SPATIA achieves improved performance over state-of-the-art models, improving generative fidelity by 8% and predictive accuracy by up to 3%.

2606.13919 2026-06-17 eess.IV cs.AI cs.CV 版本更新

GMN4AD: Graph Matching Network for Alzheimer's Disease Diagnosis with Test-Time Domain Adaptation using Multi-centered Structure Magnetic Resonance Imaging

GMN4AD：基于图匹配网络的阿尔茨海默病诊断与测试时域适应方法在多中心结构磁共振成像中的应用

Chen Zhao, Huan Huang, Yixin Xie, Jiajing Huang, Weihua Zhou

Affiliations * Department of Computer Science, Kennesaw State University; Department of Information Technology, Kennesaw State University; School of Data Science and Analytics, Kennesaw State University; Department of Applied Computing, Michigan Technological University

AI Summary: Proposes GMN4AD, which uses a graph matching network to model relationships between heterogeneous brain graphs and adds a test-time domain adaptation strategy. It outperforms existing methods on three public datasets, giving a robust approach to AD diagnosis.

详情

AI中文摘要

阿尔茨海默病（AD）是一种进行性神经退行性疾病，影响数百万老年人，预计未来几年患病率将显著上升。早期诊断，特别是在轻度认知障碍（MCI）阶段，对于及时干预至关重要。结构磁共振成像（sMRI）已成为检测AD相关脑变化的关键模态，但传统的基于图的方法通常难以处理模态和站点间异质性，限制了诊断性能。在本文中，我们提出了用于阿尔茨海默病诊断的图匹配网络（GMN4AD），旨在建模来自神经影像数据的异质脑图之间的交互。与将每个脑图独立处理的传统方法不同，GMN4AD利用图匹配来捕获跨图关系，提高诊断精度。此外，我们引入了一种测试时域适应策略，结合对比学习来减轻推理过程中的域偏移。在三个公共AD数据集上的大量实验表明，GMN4AD相比最先进方法实现了优越的性能，为AD诊断提供了鲁棒且可泛化的解决方案。

英文摘要

Alzheimer's Disease (AD) is a progressive neurodegenerative disorder that affects millions of older adults, with prevalence expected to rise significantly in the coming years. Early diagnosis, particularly during the mild cognitive impairment (MCI) stage, is critical for timely intervention. Structural Magnetic Resonance Imaging (sMRI) has emerged as a key modality for detecting AD-related brain changes, but traditional graph-based approaches often struggle with modality and inter-site heterogeneity, limiting diagnostic performance. In this paper, we propose Graph Matching Network for Alzheimer's Disease Diagnosis (GMN4AD), designed to model interactions between heterogeneous brain graphs derived from neuroimaging data. Unlike conventional methods that treat each brain graph independently, GMN4AD leverages graph matching to capture cross-graph relationships, enhancing diagnostic precision. Furthermore, we introduce a test-time domain adaptation strategy that combines contrastive learning to mitigate domain shifts during inference. Extensive experiments on three public AD datasets demonstrate that GMN4AD achieves superior performance compared to state-of-the-art methods, offering a robust and generalizable solution for AD diagnosis.

2601.05212 2026-06-17 cs.CV 版本更新

FlowLet: Conditional 3D Brain MRI Synthesis using Wavelet Flow Matching

FlowLet: 基于小波流匹配的条件性3D脑MRI合成

Danilo Danese, Angela Lombardi, Matteo Attimonelli, Giuseppe Fasano, Tommaso Di Noia

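Affiliations * Politecnico di Bari; Sapienza University of Rome

AI Summary: Proposes FlowLet, a framework that applies flow matching in an invertible 3D wavelet domain to generate age-conditioned 3D brain MRI, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show high-fidelity volumes and improved brain age prediction for underrepresented age groups.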

Comments Accepted at Medical Image Analysis (Elsevier)

详情

DOI: 10.1016/j.media.2026.104161

AI中文摘要

脑磁共振成像（MRI）在研究神经发育、衰老和疾病中起着核心作用。一个关键应用是脑年龄预测（BAP），它从MRI数据中估计个体的生物脑年龄。有效的BAP模型需要大规模、多样化和年龄平衡的数据集，而现有的3D MRI数据集在人口统计学上存在偏差，限制了公平性和泛化能力。获取新数据成本高昂且受到伦理约束，这促使了生成性数据增强。当前的生成方法通常基于潜在扩散模型，这些模型在学习的低维潜在空间中操作，以应对体积MRI数据的内存需求。然而，这些方法在推理时通常较慢，可能因潜在压缩而引入伪影，并且很少以年龄为条件，从而影响BAP性能。在这项工作中，我们提出了FlowLet，一个条件生成框架，通过在可逆3D小波域中利用流匹配来合成年龄条件化的3D MRI，有助于避免重建伪影并降低计算需求。实验表明，FlowLet以少量采样步骤生成高保真体积。使用FlowLet生成的数据训练BAP模型可改善低代表性年龄组的性能，基于区域的分析确认了解剖结构的保留。

英文摘要

Brain Magnetic Resonance Imaging (MRI) plays a central role in studying neurological development, aging, and diseases. One key application is Brain Age Prediction (BAP), which estimates an individual's biological brain age from MRI data. Effective BAP models require large, diverse, and age-balanced datasets, whereas existing 3D MRI datasets are demographically skewed, limiting fairness and generalizability. Acquiring new data is costly and ethically constrained, motivating generative data augmentation. Current generative methods are often based on latent diffusion models, which operate in learned low dimensional latent spaces to address the memory demands of volumetric MRI data. However, these methods are typically slow at inference, may introduce artifacts due to latent compression, and are rarely conditioned on age, thereby affecting the BAP performance. In this work, we propose FlowLet, a conditional generative framework that synthesizes age-conditioned 3D MRIs by leveraging flow matching within an invertible 3D wavelet domain, helping to avoid reconstruction artifacts and reducing computational demands. Experiments show that FlowLet generates high-fidelity volumes with few sampling steps. Training BAP models with data generated by FlowLet improves performance for underrepresented age groups, and region-based analysis confirms preservation of anatomical structures.
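
The abstract's argument rests on the wavelet transform being invertible, so that no learned decoder is needed to return to image space. A one-level orthonormal Haar transform on a 1D signal shows the idea. This is only an illustration: the abstract does not say which wavelet FlowLet uses, and the real method works on 3D volumes, so the Haar choice, the 1D setting, and the function names are assumptions.

```python
import math

def haar_forward(signal):
    """One-level orthonormal Haar transform of an even-length 1D signal."""
    s = 1 / math.sqrt(2)
    approx = [(signal[i] + signal[i + 1]) * s for i in range(0, len(signal), 2)]
    detail = [(signal[i] - signal[i + 1]) * s for i in range(0, len(signal), 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Exact inverse of haar_forward: recovers the original samples."""
    s = 1 / math.sqrt(2)
    out = []
    for a, d in zip(approx, detail):
        out.append((a + d) * s)
        out.append((a - d) * s)
    return out

signal = [4.0, 2.0, 5.0, 5.0, 1.0, 3.0]
approx, detail = haar_forward(signal)
reconstructed = haar_inverse(approx, detail)
```

Up to floating-point rounding, `reconstructed` equals `signal`. A generative model that samples wavelet coefficients can therefore map them back to voxels without the compression loss that a learned latent autoencoder may introduce.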

2606.17355 2026-06-17 cs.CV 新提交

Complex Layout Classification in the Wild: A Low-Resource Approach with Layout-Preserving Augmentations

野外复杂版面分类：一种低资源方法及版面保持增强

Sharva Gogawale, Iddo Hakim, Gal Grudka, Mohammad Suliman, Omer Ventura, Daria Vasyutinsky-Shapira, Berat Kurar-Barakat, Nachum Dershowitz

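Affiliations * School of Computer Science and AI, Tel Aviv University

AI Summary: Addresses low-resource complex layout classification with a CNN-based classifier trained using layout-preserving augmentations, namely narrow anisotropic Gaussian masks and reflection-induced label transformations, which markedly improve classification under annotation scarcity.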

详情

AI中文摘要

许多数字化语料库面临低资源问题，因为标注可能稀缺、页面扫描噪声大且分辨率低，或者版面结构复杂，对自动转录质量产生负面影响。低资源语言的鲁棒分类模型开发受到缺乏大规模标注数据和页面版面频繁语义复杂性的制约。为此，我们整理了一个复杂版面数据集，根据分隔区域手动分为八种版面类型。为克服数据稀缺，我们提出了一种基于CNN的分类器的新型训练策略，采用强领域感知增强来改善泛化。我们利用窄各向异性高斯掩码抑制偶然文本细节，同时保留基本分隔，迫使模型学习全局几何排列。此外，我们实施反射诱导标签变换以丰富训练分布，同时保持不对称类别间的标签一致性。结果表明，版面特定增强可以在严重标注稀缺下显著改善页面级版面分类。

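Affiliations * National University of Defense Technology; Chinese University of Hong Kong, Shenzhen; Intelligent Game and Decision Lab

AI Summary: Finds that multimodal large language models form correct visual predictions in intermediate layers that are then overridden by text in the final output. Using the direction of prediction change (85% of failures shift toward text, 89% of successes shift toward vision), the authors propose CALRD, a training-free method that improves results on conflict benchmarks by up to 9.4%.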

Comments Accepted at IJCAI 2026. 16 pages, 10 figures

详情

AI中文摘要

当视觉与文本矛盾时，多模态大语言模型（MLLMs）始终偏向文本，即使图像提供了明确的相反证据。这种偏见对需要视觉基础的应用构成风险，但其原因尚不清楚。本文中，我们揭示了一个令人惊讶的发现：模型最初往往是正确的，在中间层形成基于视觉的正确预测，然后在最终输出中改变主意，偏向文本。我们称之为“后层文本覆盖”。视觉信息已被编码，只是未能保留到输出。更有趣的是，我们发现预测的变化方式揭示了其正确性：85%的失败转向文本，而89%的成功转向视觉。这种方向性特征使得一种简单而有效的干预成为可能：当我们检测到自信的视觉预测被抑制时，我们将其恢复。我们提出了CALRD（冲突感知层参考解码），一种无需训练的方法，在推理时恢复被覆盖的预测。在五种不同架构的MLLM上的实验表明，在冲突基准上绝对提升高达9.4%，同时基本保持标准性能，无需训练或外部知识。它恢复了模型已知但未能保留的信息。

英文摘要

When vision contradicts text, multimodal large language models (MLLMs) consistently favor text, even when images provide clear evidence otherwise. This bias poses risks for applications requiring visual grounding, yet its cause remains unclear. In this paper, we uncover a surprising finding: models often get it right initially, forming correct vision-based predictions in their intermediate layers, before changing their minds and favoring text in the final output. We call this "late-layer textual override". The visual information is encoded; it simply does not survive to the output. More intriguingly, we find that how predictions change reveals whether they're correct: 85% of failures shift toward text, while 89% of successes shift toward vision. This directional signature enables a simple but powerful intervention: when we detect a confident visual prediction being suppressed, we restore it. We propose CALRD (Conflict-Aware Layer Reference Decoding), a training-free method that recovers overridden predictions at inference time. Experiments across five MLLMs of varying architectures demonstrate up to 9.4% absolute improvements on conflict benchmarks while largely preserving standard performance, without training or external knowledge. It recovers what the model already knew but failed to preserve.
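
The intervention sketched in the abstract, restoring a confident intermediate prediction that the final layer overrode, can be illustrated with a toy rule. This is not CALRD itself: the abstract gives no decision rule, so the confidence threshold, the `(answer, confidence)` data layout, and the function name are hypothetical, and the sketch simply assumes that intermediate-layer answers are the vision-grounded candidates.

```python
def restore_overridden(layer_preds, final_pred, threshold=0.8):
    """Return the most confident intermediate answer if the final output overrode it.

    layer_preds: list of (answer, confidence) pairs read out from intermediate
    layers, ordered from shallow to deep (an assumed readout, for illustration).
    """
    best_answer, best_conf = max(layer_preds, key=lambda pair: pair[1])
    if best_conf >= threshold and best_answer != final_pred:
        # A confident intermediate prediction was suppressed: restore it.
        return best_answer
    return final_pred

# The image shows a red car while the prompt text claims it is blue.
overridden = restore_overridden([("red", 0.40), ("red", 0.93), ("blue", 0.60)], "blue")
# No intermediate answer is confident enough, so the final output stands.
kept = restore_overridden([("red", 0.40), ("red", 0.55), ("blue", 0.60)], "blue")
```

The first call returns "red" and the second returns "blue". A real system would also need a way to read answers out of intermediate layers and to confirm that the shift is toward text, which this toy omits.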

2606.17961 2026-06-17 cs.CV cs.AI 新提交

Robustness of Similarity-based Positional Encoding Under Rotations: Theoretical Analysis and Experimental Validation

基于相似性的位置编码在旋转下的鲁棒性：理论分析与实验验证

Andrea Santomauro, Luigi Portinale, Giorgio Leonardi

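Affiliations * Computer Science Institute, DiSIT, University of Piemonte Orientale, Alessandria, Italy

AI Summary: Analyzes theoretically and validates experimentally the stability of similarity-based positional encoding (simPE) under rotational perturbations, proving a bounded perturbation in Frobenius norm and showing that simPE outperforms standard learned positional encoding on several datasets.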

详情

AI中文摘要

位置编码是Transformer架构的基本组成部分，因为它注入了关于输入空间或序列排列的信息。在标准绝对位置编码和正弦编码的最新替代方案中，基于相似性的位置编码（simPE）已成为一种通过成对关系表示位置结构的灵活框架。simPE最初是为医学成像应用设计的，其中几何鲁棒性尤为重要：在图像采集过程中，由于成像仪器、患者定位或轻微的采集偏差，自然会产生小旋转。尽管具有经验上的前景，但simPE在几何扰动下的理论行为尚未完全表征。在本文中，我们研究了simPE对旋转的鲁棒性，结合了形式化的理论分析和实验验证。我们首先证明simPE通常不是旋转不变的。然后，我们证明，在基本分量的温和Lipschitz假设下，simPE在旋转扰动下是稳定的，并推导了Frobenius范数下的显式扰动界限。我们在四个受控数据集上实验验证了这些发现——一个合成Arrow数据集、一个合成Shapes数据集（四个几何形状类别）、一个合成Digits数据集和一个基准图像分类数据集（FashionMNIST）——其中训练和验证图像保持固定的规范方向，而测试图像则经受逐渐增大的旋转角度。在所有数据集中，simPE在旋转下的准确率、F1分数、精确率和召回率方面始终优于标准学习位置编码，特别是在小到中等角度范围内，这证实了理论稳定性保证。

英文摘要

Positional encoding is a fundamental component of Transformer architectures, as it injects information about the spatial or sequential arrangement of inputs. Among recent alternatives to standard absolute and sinusoidal encodings, similarity-based positional encoding (simPE) has emerged as a flexible framework for representing positional structure through pairwise relations. simPE was originally designed for medical imaging applications, where geometric robustness is especially relevant: small rotations naturally arise during image acquisition, induced by imaging instruments, patient positioning, or slight acquisition misalignments. Despite its empirical promise, the theoretical behavior of simPE under geometric perturbations has not been fully characterized. In this paper, we study the robustness of simPE with respect to rotations, combining formal theoretical analysis with experimental validation. We first show that simPE is generally not rotation-invariant. We then prove that, under mild Lipschitz assumptions on the elementary components, simPE is stable under rotational perturbations and derive explicit perturbation bounds in Frobenius norm. We validate these findings experimentally on four controlled datasets--a synthetic Arrow dataset, a synthetic Shapes dataset (four geometric shape categories), a synthetic Digits dataset, and a benchmark image classification dataset (FashionMNIST)--in which training and validation images are kept in a fixed canonical orientation while test images are subjected to increasing rotation angles. Across all datasets, simPE consistently outperforms standard learned positional encoding in terms of accuracy, F1 score, precision, and recall under rotation, particularly in the small-to-moderate angle regime, corroborating the theoretical stability guarantees.

2606.18008 2026-06-17 cs.CV 新提交

PhaseWin: An Efficient Search Algorithm for Faithful Visual Attribution

PhaseWin：一种用于忠实视觉归因的高效搜索算法

Zihan Gu, Ruoyu Chen, Junchi Zhang, Li Liu, Xiaochun Cao, Hua Zhang

Affiliations * Institute of Information Engineering, Chinese Academy of Sciences; School of Cyber Security, University of Chinese Academy of Sciences; Shanghai Center for Mathematical Sciences, Fudan University; College of Electronic Science and Technology, National University of Defense Technology; School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University

AI Summary: Proposes PhaseWin, an algorithm that uses phased window search to reduce the computational complexity of visual attribution from O(n²) to O(n), sharply cutting the number of model evaluations while keeping faithfulness close to that of greedy search.

Comments 26 pages, 29 figures

详情

AI中文摘要

BusterX++: 迈向基于MLLM的统一跨模态AI生成内容检测与解释

Haiquan Wen, Tianxiao Li, Zhenglin Huang, Yiwei He, Guangliang Cheng

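AI Summary: Proposes BusterX++, a unified multimodal large language model that uses a pure reinforcement learning strategy to transfer capability across image and video forgery detection, outperforming existing methods.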

详情

AI中文摘要

生成式AI的快速发展显著提升了图像和视频合成质量，加剧了多模态视觉错误信息的风险。最近的多模态大模型通过推理和解释在透明化AI生成内容检测方面展现出潜力，但现有方法大多将图像和视频取证视为孤立任务，跨模态协同作用尚未充分探索。为解决这一问题，我们提出了BusterX++，一个统一的多模态大模型，用于联合图像和视频检测并具备可解释推理能力。我们还引入了GenBuster-Bench++，一个精心策划、难度对齐的基准测试，包含平衡的图像和视频样本，覆盖最新的生成模型和多样化的真实场景。利用这一受控设置，我们重新审视了广泛采用的SFT → RL后训练范式。值得注意的是，我们的发现表明，仅由稀疏结果奖励驱动的单阶段纯RL策略在统一和单模态设置中始终匹配或超越强SFT+RL基线。我们的关键洞察是，SFT降低了策略熵，限制了策略搜索空间并抑制了探索自由度。相比之下，单阶段纯RL在整个训练过程中保持较高的策略熵，有效解锁了图像和视频取证之间跨模态能力迁移的自发涌现。大量实验表明，BusterX++达到了最先进的性能，突显了RL在统一跨模态视觉推理中的强大潜力。

英文摘要

The rapid advancement of generative AI has substantially improved image and video synthesis, amplifying the risk of multimodal visual misinformation. Recent MLLMs have shown promise for transparent AI-generated content detection through reasoning and explanation, yet existing approaches largely treat image and video forensics as isolated tasks, leaving cross-modal synergies underexplored. To address this, we present BusterX++, a unified MLLM for joint image and video detection with interpretable reasoning. We also introduce GenBuster-Bench++, a meticulously curated, difficulty-aligned benchmark containing balanced image and video samples spanning recent generation models and diverse real-world scenarios. Using this controlled setting, we revisit the widely adopted SFT → RL post-training paradigm. Notably, our findings demonstrate that a single-stage, pure RL strategy driven strictly by sparse outcome rewards consistently matches or surpasses a strong SFT+RL baseline across both unified and single-modality settings. Our key insight reveals that SFT imposes lower policy entropy, which restricts the policy search space and dampens exploratory freedom. In contrast, single-stage pure RL maintains higher policy entropy throughout training, effectively unlocking the spontaneous emergence of cross-modal capability transfer between image and video forensics. Extensive experiments demonstrate that BusterX++ achieves state-of-the-art performance, highlighting the powerful potential of RL for unified cross-modal visual reasoning.

2604.00605 2026-06-17 cs.CV 版本更新

Fluently Lying: Adversarial Robustness Can Be Substrate-Dependent

流利地撒谎：对抗鲁棒性可能依赖于底层架构

Daye Kang, Hyeongboo Baek

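AI Summary: Reports a new adversarial failure mode, Quality Corruption (QC), in which the detection count is largely preserved while accuracy collapses. It appears in only one of the four SNN detectors tested (EMS-YOLO), suggesting that adversarial failure modes can be substrate-dependent.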

Comments Withdrawn by the authors due to an implementation bug discovered in the main experimental pipeline. The bug affects the main results, and therefore the empirical claims and conclusions of the paper are no longer supported

详情

AI中文摘要

用于监控和防御对抗攻击下目标检测器的主要工具假设，当精度下降时，检测数量也会同步下降。这种耦合是假设的，并未经过测量。我们报告了在单个模型上观察到的反例：在标准PGD攻击下，EMS-YOLO（一种脉冲神经网络（SNN）目标检测器）保留了超过70%的检测结果，而mAP从0.528骤降至0.042。我们将这种保持检测数量但精度崩溃的现象称为质量崩溃（QC），以区别于在非目标评估中占主导地位的抑制现象。在四种SNN架构和两种威胁模型（l-infinity和l-2）下，QC仅出现在测试的四种检测器之一（EMS-YOLO）中。在该模型上，所有五种标准防御组件均未能检测或缓解QC，这表明防御生态系统可能依赖于一种基于单一底层架构校准的共享假设。据我们所知，这些结果首次证明对抗失败模式可能依赖于底层架构。

英文摘要

The primary tools used to monitor and defend object detectors under adversarial attack assume that when accuracy degrades, detection count drops in tandem. This coupling was assumed, not measured. We report a counterexample observed on a single model: under standard PGD, EMS-YOLO, a spiking neural network (SNN) object detector, retains more than 70% of its detections while mAP collapses from 0.528 to 0.042. We term this count-preserving accuracy collapse Quality Corruption (QC), to distinguish it from the suppression that dominates untargeted evaluation. Across four SNN architectures and two threat models (l-infinity and l-2), QC appears only in one of the four detectors tested (EMS-YOLO). On this model, all five standard defense components fail to detect or mitigate QC, suggesting the defense ecosystem may rely on a shared assumption calibrated on a single substrate. These results provide, to our knowledge, the first evidence that adversarial failure modes can be substrate-dependent.

2510.11709 2026-06-17 cs.LG cs.AI cs.CV 版本更新

HLS-GPT：一种用于大陆尺度NASA协调Landsat和Sentinel-2（HLS）所有波段任意日期反射率重建的生成式预训练Transformer

Junjie Li, Hankui K. Zhang, David P. Roy

Affiliations * Geospatial Sciences Center of Excellence, Department of Geography and Geospatial Sciences, South Dakota State University; Department of Geography, Environment, and Spatial Sciences, & Center for Global Change and Earth Observations, Michigan State University

AI Summary: Proposes HLS-GPT, a large-scale generative pretrained Transformer for reconstructing NASA HLS 30 m surface reflectance for all bands, any date, and any pixel location. A hierarchical Transformer architecture handles the differing spectral band configurations; trained on more than 0.25 million pixels, the model reconstructs reflectance robustly under diverse land surface conditions.

详情

AI中文摘要

最近用于Landsat和Sentinel-2反射率时间序列重建的深度学习方法仍然受到光谱覆盖范围有限、地理可扩展性受限或基于补丁的短时间上下文设计的限制。我们提出了HLS-GPT，一种大规模生成式预训练Transformer模型，用于重建所有波段、任意日期和任意像素位置的NASA协调Landsat Sentinel-2 30米地表反射率。HLS-GPT使用层次化Transformer架构处理Landsat和Sentinel-2的不同光谱波段配置，并基于单像素12个月时间序列进行操作。为了捕捉地理和季节变化，该模型使用来自美国本土超过25万训练像素的九年HLS时间序列进行训练。一种随机裁剪和掩码策略提取不同起始日期的12个月周期，掩码50%的有效观测，并训练模型从剩余观测中重建被掩码的反射率值。使用超过62000个独立测试像素的评估显示，在多样化的地表条件下，包括复杂的作物物候和稀疏、不规则的观测，重建效果稳健。留一观测评估在所有HLS光谱波段上实现了低于0.026的重建RMSE，可见光波段相对RMSE低于35%，其他波段低于13%。尽管Landsat上缺少红边波段，红边波段误差与红波段和近红外波段误差相当。敏感性分析随机掩码10%至90%的测试观测，显示当掩码10%至50%的观测时，性能仅适度下降，所有波段RMSE低于0.028。在九个独立的109×109公里美国本土HLS图块上的图像重建进一步表明，HLS-GPT优于两种传统方法和NASA-IBM Prithvi模型。

英文摘要

Recent deep learning methods for Landsat and Sentinel-2 reflectance time series reconstruction remain limited by restricted spectral coverage, limited geographic scalability, or patch-based designs with short temporal contexts. We present HLS-GPT, a large-scale generative pretrained Transformer model for reconstructing NASA Harmonized Landsat Sentinel-2 30 m surface reflectance for all bands, any date, and any pixel location. HLS-GPT uses a hierarchical Transformer architecture to handle the different spectral band configurations of Landsat and Sentinel-2 and operates on single-pixel 12-month time series. To capture geographic and seasonal variability, the model was trained with nine years of HLS time series from more than 0.25 million training pixels across the conterminous United States. A random cropping and masking strategy extracts 12-month periods with varying start dates across epochs, masks 50% of valid observations, and trains the model to reconstruct the masked reflectance values from the remaining observations. Evaluation using more than 62,000 independent test pixels shows robust reconstruction under diverse land surface conditions, including complex crop phenology and sparse, irregular observations. Leave-one-observation-out evaluation achieved reconstruction RMSE below 0.026 for all HLS spectral bands, with relative RMSE below 35% for visible bands and below 13% for other bands. Red-edge band errors were comparable to red and near-infrared errors despite the absence of red-edge bands on Landsat. Sensitivity analyses that randomly masked 10% to 90% of test observations showed only modest degradation when 10% to 50% of observations were masked, with all-band RMSE below 0.028. Image reconstruction over nine independent 109 by 109 km CONUS HLS tiles further demonstrates that HLS-GPT outperforms two conventional methods and the NASA-IBM Prithvi model.
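
The training recipe in the abstract (crop a window with a random start date, mask half of the valid observations, reconstruct the masked values) and the RMSE used to score it can be sketched in a few lines. This is a toy under stated assumptions rather than the HLS-GPT pipeline: it uses one band, one value per time step with `None` for missing observations, and hypothetical function names and parameters.

```python
import math
import random

def crop_and_mask(series, window_len, mask_fraction=0.5, rng=None):
    """Crop a random window, then hide a fraction of its valid observations."""
    rng = rng or random.Random(0)
    start = rng.randrange(len(series) - window_len + 1)
    window = series[start:start + window_len]
    valid = [i for i, v in enumerate(window) if v is not None]
    masked_idx = set(rng.sample(valid, int(len(valid) * mask_fraction)))
    # Model inputs: masked positions look like missing observations.
    inputs = [None if i in masked_idx else v for i, v in enumerate(window)]
    # Training targets: the true reflectance at the masked positions.
    targets = {i: window[i] for i in masked_idx}
    return start, inputs, targets

def rmse(pred, truth):
    """Root-mean-square error between predicted and true reflectance values."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))

# Two years of made-up monthly reflectance; None marks cloudy or missing dates.
series = [0.10, None, 0.12, 0.15, None, 0.20, 0.22, 0.25, None, 0.21, 0.18, 0.14,
          0.11, 0.10, None, 0.16, 0.19, 0.23, None, 0.26, 0.22, 0.17, 0.13, 0.10]
start, inputs, targets = crop_and_mask(series, window_len=12, rng=random.Random(42))
error = rmse([0.1, 0.2], [0.1, 0.4])
```

Drawing a new start position each epoch exposes the model to every seasonal phase, and scoring only the masked positions mirrors the leave-out evaluation described in the abstract. The real model operates on irregularly dated multi-band observations, which this sketch does not represent.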


2606.18180 2026-06-17 cs.CV 新提交

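Colab NAS: Obtaining lightweight task-specific convolutional neural networks following Occam's razor

Colab NAS：遵循奥卡姆剃刀原则获取轻量级任务特定卷积神经网络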

Andrea Mattia Garavagno, Daniele Leonardis, Antonio Frisoli

发表机构 * Institute of Mechanical Intelligence, Scuola Superiore Sant’Anna of Pisa（机械智能研究所，比萨圣安娜高等学院）

AI总结提出ColabNAS，一种低成本的硬件感知神经架构搜索方法，通过受奥卡姆剃刀启发的无导数搜索策略，仅使用Google Colaboratory、Kaggle Kernel等免费在线GPU服务，以3.1 GPU小时获得轻量级任务特定CNN，并在Visual Wake Word数据集上达到最先进结果。

详情

DOI: 10.1016/j.future.2023.11.003
Journal ref: Future Generation Computer Systems, vol. 152, pp. 152-159, 2024

AI中文摘要

当目标应用是一个定制且边界明确的问题，并且有足够的数据从头训练网络时，当前流行的做法，即从在大型数据集上训练的卷积神经网络（CNN）进行迁移学习，可能是小题大做。另一方面，训练定制且更轻量的CNN，在从头训练的情况下需要专业知识，在硬件感知神经架构搜索（HW NAS）的情况下则需要高端计算资源，这限制了不常开发神经网络的人员使用该技术。为此，我们提出ColabNAS，一种经济实惠的HW NAS技术，用于生成轻量级的任务特定CNN。其新颖的无导数搜索策略受奥卡姆剃刀原则启发，仅使用Google Colaboratory和Kaggle Kernel等免费在线GPU服务，在3.1 GPU小时内即可在Visual Wake Word数据集（标准TinyML基准）上获得最先进的结果。

英文摘要

The current trend of applying transfer learning from convolutional neural networks (CNNs) trained on large datasets can be overkill when the target application is a custom and delimited problem, with enough data to train a network from scratch. On the other hand, the training of custom and lighter CNNs requires expertise, in the from-scratch case, and/or high-end resources, as in the case of hardware-aware neural architecture search (HW NAS), limiting access to the technology by non-habitual NN developers. For this reason, we present ColabNAS, an affordable HW NAS technique for producing lightweight task-specific CNNs. Its novel derivative-free search strategy, inspired by Occam's razor, makes it possible to obtain state-of-the-art results on the Visual Wake Word dataset, a standard TinyML benchmark, in just 3.1 GPU hours using free online GPU services such as Google Colaboratory and Kaggle Kernel.


2404.09790 2026-06-17 cs.CV 版本更新

NTIRE 2024 Challenge on Image Super-Resolution (x4): Methods and Results

NTIRE 2024图像超分辨率挑战赛（x4）：方法与结果

Zheng Chen, Zongwei Wu, Eduard Zamfir, Kai Zhang, Yulun Zhang, Radu Timofte, Xiaokang Yang, Hongyuan Yu, Cheng Wan, Yuxin Hong, Zhijuan Huang, Yajun Zou, Yuan Huang, Jiamin Lin, Bingnan Han, Xianyu Guan, Yongsheng Yu, Daoan Zhang, Xuanwu Yin, Kunlong Zuo, Jinhua Hao, Kai Zhao, Kun Yuan, Ming Sun, Chao Zhou, Hongyu An, Xinfeng Zhang, Zhiyuan Song, Ziyue Dong, Qing Zhao, Xiaogang Xu, Pengxu Wei, Zhi-chao Dou, Gui-ling Wang, Chih-Chung Hsu, Chia-Ming Lee, Yi-Shiuan Chou, Cansu Korkmaz, A. Murat Tekalp, Yubin Wei, Xiaole Yan, Binren Li, Haonan Chen, Siqi Zhang, Sihan Chen, Amogh Joshi, Nikhil Akalwadi, Sampada Malagi, Palani Yashaswini, Chaitra Desai, Ramesh Ashok Tabib, Ujwala Patil, Uma Mudenagudi, Anjali Sarvaiya, Pooja Choksy, Jagrit Joshi, Shubh Kawa, Kishor Upla, Sushrut Patwardhan, Raghavendra Ramachandra, Sadat Hossain, Geongi Park, S. M. Nadim Uddin, Hao Xu, Yanhui Guo, Aman Urumbekov, Xingzhuo Yan, Wei Hao, Minghan Fu, Isaac Orais, Samuel Smith, Ying Liu, Wangwang Jia, Qisheng Xu, Kele Xu, Weijun Yuan, Zhan Li, Wenqin Kuang, Ruijin Guan, Ruting Deng, Zhao Zhang, Bo Wang, Suiyi Zhao, Yan Luo, Yanyan Wei, Asif Hussain Khan, Christian Micheloni, Niki Martinel

发表机构 * CVLAI

AI总结本文回顾NTIRE 2024图像超分辨率挑战赛（x4），总结参赛方案和成果，推动单图像超分辨率性能边界并概述当前趋势。

Comments NTIRE 2024 webpage: https://cvlai.net/ntire/2024. Code: https://github.com/zhengchen1999/NTIRE2024_ImageSR_x4

详情

DOI: 10.1109/CVPRW63382.2024.00617
Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024, pp. 6108-6132

AI中文摘要

本文回顾了NTIRE 2024图像超分辨率（$\times$4）挑战赛，重点介绍了提出的解决方案和获得的结果。该挑战涉及利用先验信息从低分辨率（LR）输入生成对应的高分辨率（HR）图像，放大倍数为四倍。LR图像来源于双三次下采样退化。挑战的目标是获得具有最先进SR性能的设计/解决方案，对计算资源（如模型大小和FLOPs）或训练数据没有限制。该赛道在DIV2K测试数据集上使用PSNR指标评估性能。比赛吸引了199名注册者，其中20支队伍提交了有效参赛作品。这一集体努力不仅推动了单图像SR的性能边界，还提供了对该领域当前趋势的全面概述。

英文摘要

This paper reviews the NTIRE 2024 challenge on image super-resolution ($\times$4), highlighting the solutions proposed and the outcomes obtained. The challenge involves generating corresponding high-resolution (HR) images, magnified by a factor of four, from low-resolution (LR) inputs using prior information. The LR images originate from bicubic downsampling degradation. The aim of the challenge is to obtain designs/solutions with the most advanced SR performance, with no constraints on computational resources (e.g., model size and FLOPs) or training data. The track of this challenge assesses performance with the PSNR metric on the DIV2K testing dataset. The competition attracted 199 registrants, with 20 teams submitting valid entries. This collective endeavour not only pushes the boundaries of performance in single-image SR but also offers a comprehensive overview of current trends in this field.
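
The PSNR metric used to rank entries follows the standard definition PSNR = 10 log10(MAX^2 / MSE). The sketch below is illustrative only: the function name `psnr`, the flat list of pixel values, and the peak value of 255 are assumptions, and the abstract does not state the challenge's exact evaluation settings (for example colour space or border cropping).

```python
import math

def psnr(reference, restored, max_value=255.0):
    """PSNR in dB between two equal-length sequences of pixel values."""
    mse = sum((r - s) ** 2 for r, s in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf  # identical images have unbounded PSNR
    return 10.0 * math.log10(max_value ** 2 / mse)
```

A higher value means the restored image is closer to the reference in the mean-squared-error sense, which is why PSNR rewards pixel-wise fidelity rather than perceived realism.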


2504.14582 2026-06-17 cs.CV 版本更新

NTIRE 2025 Challenge on Image Super-Resolution (x4): Methods and Results

NTIRE 2025 图像超分辨率（×4）挑战赛：方法与结果

Zheng Chen, Kai Liu, Jue Gong, Jingkai Wang, Lei Sun, Zongwei Wu, Radu Timofte, Yulun Zhang, Xiangyu Kong, Xiaoxuan Yu, Hyunhee Park, Suejin Han, Hakjae Jeon, Dafeng Zhang, Hyung-Ju Chun, Donghun Ryou, Inju Ha, Bohyung Han, Lu Zhao, Yuyi Zhang, Pengyu Yan, Jiawei Hu, Pengwei Liu, Fengjun Guo, Hongyuan Yu, Pufan Xu, Zhijuan Huang, Shuyuan Cui, Peng Guo, Jiahui Liu, Dongkai Zhang, Heng Zhang, Huiyuan Fu, Huadong Ma, Yanhui Guo, Sisi Tian, Xin Liu, Jinwen Liang, Jie Liu, Jie Tang, Gangshan Wu, Zeyu Xiao, Zhuoyuan Li, Yinxiang Zhang, Wenxuan Cai, Vijayalaxmi Ashok Aralikatti, Nikhil Akalwadi, G Gyaneshwar Rao, Chaitra Desai, Ramesh Ashok Tabib, Uma Mudenagudi, Marcos V. Conde, Alejandro Merino, Bruno Longarela, Javier Abad, Weijun Yuan, Zhan Li, Zhanglu Chen, Boyang Yao, Aagam Jain, Milan Kumar Singh, Ankit Kumar, Shubh Kawa, Divyavardhan Singh, Anjali Sarvaiya, Kishor Upla, Raghavendra Ramachandra, Chia-Ming Lee, Yu-Fan Lin, Chih-Chung Hsu, Risheek V Hiremath, Yashaswini Palani, Yuxuan Jiang, Qiang Zhu, Siyue Teng, Fan Zhang, Shuyuan Zhu, Bing Zeng, David Bull, Jingwei Liao, Yuqing Yang, Wenda Shao, Junyi Zhao, Qisheng Xu, Kele Xu, Sunder Ali Khowaja, Ik Hyun Lee, Snehal Singh Tomar, Rajarshi Ray, Klaus Mueller, Sachin Chaudhary, Surya Vashisth, Akshay Dudhane, Praful Hambarde, Satya Naryan Tazi, Prashant Patil, Santosh Kumar Vipparthi, Subrahmanyam Murala, Bilel Benjdira, Anas M. Ali, Wadii Boulila, Zahra Moammeri, Ahmad Mahmoudi-Aznaveh, Ali Karbasi, Hossein Motamednia, Liangyan Li, Guanhua Zhao, Kevin Le, Yimo Ning, Haoxuan Huang, Jun Chen

发表机构 * CVPR 2025

AI总结本文介绍NTIRE 2025图像超分辨率（×4）挑战赛，包括恢复和感知两个子赛道，总结比赛设计、数据集、评估协议及25个团队的提交方法。

Comments NTIRE 2025 webpage: https://www.cvlai.net/ntire/2025. Code: https://github.com/zhengchen1999/NTIRE2025_ImageSR_x4

详情

DOI: 10.1109/CVPRW67362.2025.00141
Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025, pp. 1525-1535

AI中文摘要

本文介绍了NTIRE 2025图像超分辨率（×4）挑战赛，这是CVPR 2025第10届NTIRE Workshop的关联竞赛之一。该挑战旨在从通过双三次下采样生成的×4比例低分辨率图像中恢复高分辨率图像，目标是开发有效的网络设计或解决方案以实现最先进的超分辨率性能。为反映图像超分辨率研究的双重目标，挑战包含两个子赛道：（1）恢复赛道，强调像素级精度，根据PSNR对提交结果进行排名；（2）感知赛道，关注视觉真实感，根据感知分数对结果进行排名。共有286名参与者注册了比赛，25个团队提交了有效作品。本报告总结了挑战设计、数据集、评估协议、主要结果以及每个团队的方法。该挑战作为基准，旨在推动图像超分辨率领域的最先进技术并促进其进步。

英文摘要

This paper presents the NTIRE 2025 image super-resolution ($\times$4) challenge, one of the associated competitions of the 10th NTIRE Workshop at CVPR 2025. The challenge aims to recover high-resolution (HR) images from low-resolution (LR) counterparts generated through bicubic downsampling with a $\times$4 scaling factor. The objective is to develop effective network designs or solutions that achieve state-of-the-art SR performance. To reflect the dual objectives of image SR research, the challenge includes two sub-tracks: (1) a restoration track, which emphasizes pixel-wise accuracy and ranks submissions based on PSNR; and (2) a perceptual track, which focuses on visual realism and ranks results by a perceptual score. A total of 286 participants registered for the competition, with 25 teams submitting valid entries. This report summarizes the challenge design, datasets, evaluation protocol, main results, and the methods of each team. The challenge serves as a benchmark to advance the state of the art and foster progress in image SR.


2512.16978 2026-06-17 cs.CV 版本更新

A Benchmark for Omni-Modal Reasoning in Long Videos

长视频全模态推理基准

Mohammed Irfan Kurpath, Jaseel Muhammad Kaithakkodan, Jinxing Zhou, Sahal Shaji Mullappilly, Mohammad Almansoori, Noor Ahsan, Beknur Kalmakhanbet, Sambal Shikhar, Rishabh Lalla, Jean Lahoud, Mariette Awad, Fahad Shahbaz Khan, Salman Khan, Rao Muhammad Anwer, Hisham Cholakkal

AI总结提出LongShOTBench基准，用于评估长视频中视觉、语音和环境音频的全模态推理，并引入无需训练的全模态证据搜索智能体LongShOTAgent；在评估的105个模型中，它是最强的无需训练系统，总体得分66.64%。

详情

AI中文摘要

长视频全模态理解需要整合视觉、语音和环境音频，并进行连贯的长上下文推理。现有的视频基准往往在时间尺度、模态覆盖、开放式交互和可解释评分之间顾此失彼。为弥补这一不足，我们引入了LongShOTBench，一个围绕三个相互耦合的目标设计的长视频理解基准：整体全模态整合、意图驱动的开放式交互和评分细则级诊断。它基于真实观看场景构建单轮和多轮问题，并通过系统化的任务考察视觉、语音、环境音频、时间和跨模态推理。每个条目包含一个参考答案和一份加权的准则级评分细则，使评估能够识别哪些感知事实、时间关联、模态依据要求和推理步骤得到满足或被遗漏。所有样本均经过人工验证，以提高依据性、清晰度和评分细则的可靠性。我们还引入了LongShOTAgent，一个无需训练的全模态证据搜索智能体，它将全视频预处理与定向检索、查询自适应的片段细化，以及基于视觉、语音和非语音音频证据的显式论断验证相结合。其迭代的搜索-细化-验证循环会暴露中间证据，并让各模态的专用模型在回答之前重新分析相关时刻。我们评估了105个具备视频能力的模型，涵盖开源全模态模型、视觉语言系统、音频LLM、智能体流水线和闭源API。当前的MLLM远未使LongShOTBench饱和，而我们的LongShOTAgent是最强的无需训练系统，总体得分达到66.64%。通过发布基准、排行榜和方法，我们为推进长视频全模态推理提供了一个共享、可解释的测试平台。代码、数据和排行榜可在 https://longshot.cvmbzuai.com/ 获取。

英文摘要

Long-form omni-modal video understanding requires integrating vision, speech, and ambient audio with coherent long-context reasoning. Existing video benchmarks often trade off temporal scale, modality coverage, open-ended interaction, and interpretable scoring. To address this gap, we introduce LongShOTBench, a long video understanding benchmark designed around three coupled goals: holistic omni-modal integration, intent-driven open-ended interaction, and rubric-level diagnosis. It builds single- and multi-turn questions from real viewing scenarios, with systematic tasks probing visual, speech, ambient-audio, temporal, and cross-modal reasoning. Each item includes a reference answer and a weighted criterion-level rubric, letting evaluation identify which perceptual facts, temporal links, modality-grounding requirements, and reasoning steps are satisfied or missed. All samples are manually verified to improve grounding, clarity, and rubric reliability. We also introduce LongShOTAgent, a training-free omni-modal evidence-seeking agent coupling full-video preprocessing with targeted retrieval, query-adaptive segment refinement, and explicit claim verification over visual, speech, and non-speech audio evidence. Its iterative search-refine-verify loop exposes intermediate evidence and lets modality-specific specialists re-analyze relevant moments before answering. We evaluate 105 video-capable models spanning open-source omni-modal models, vision-language systems, audio LLMs, agentic pipelines and closed-source APIs. Current MLLMs remain far from saturating LongShOTBench, while our LongShOTAgent is the strongest training-free system, reaching 66.64% overall. By releasing the benchmark, leaderboard, and method, we provide a shared, interpretable testbed for advancing long-form omni-modal video reasoning. Code, data, and the leaderboard are available at https://longshot.cvmbzuai.com/.


2601.19099 2026-06-17 cs.CV cs.AI 版本更新

m2sv: A Scalable Benchmark for Map-to-Street-View Spatial Reasoning

m2sv: 地图到街景空间推理的可扩展基准

Yosub Shin, Michael Buriek, Igor Molybog

AI总结提出m2sv基准，通过匹配朝北俯视图与街景图像推断相机方向，评估VLM空间推理能力；最佳模型准确率65.2%，低于人类72.0%，揭示几何对齐与推理一致性的差距。

详情

AI中文摘要

视觉-语言模型（VLM）在许多多模态基准上表现强劲，但在需要将抽象俯视图表示与第一人称视图对齐的空间推理任务上仍然脆弱。我们引入m2sv，一个用于地图到街景空间推理的可扩展基准，要求模型将朝北的俯视地图与在同一真实世界路口拍摄的街景图像对齐，从而推断相机的朝向。我们发布了m2sv-20k，一个歧义程度受控、地理分布多样的基准，以及m2sv-sft-11k，一个用于监督微调的精选结构化推理轨迹集。尽管在现有多模态基准上表现强劲，评估中最佳的VLM在m2sv上仅达到65.2%的准确率，低于人类标注者的平均72.0%（一位专家可达95%），且标注者间一致性较高（$\kappa$最高达0.76）。虽然监督微调和强化学习带来了稳定的提升，但跨基准评估显示迁移效果有限。除了总体准确率，我们还利用结构信号和人力投入系统分析了地图到街景推理的难度，并对经过适配的开源模型进行了广泛的失败分析。我们的发现凸显了几何对齐、证据聚合和推理一致性方面持续存在的差距，为未来研究跨视角的、有依据的空间推理提供了动力。

英文摘要

Vision-language models (VLMs) achieve strong performance on many multimodal benchmarks but remain brittle on spatial reasoning tasks that require aligning abstract overhead representations with egocentric views. We introduce m2sv, a scalable benchmark for map-to-street-view spatial reasoning that asks models to infer camera viewing direction by aligning a north-up overhead map with a Street View image captured at the same real-world intersection. We release m2sv-20k, a geographically diverse benchmark with controlled ambiguity, along with m2sv-sft-11k, a curated set of structured reasoning traces for supervised fine-tuning. Despite strong performance on existing multimodal benchmarks, the best evaluated VLM achieves only 65.2% accuracy on m2sv, below human annotators who reach 72.0% on average (and 95% for an expert) with strong inter-annotator agreement ($\kappa$ up to 0.76). While supervised fine-tuning and reinforcement learning yield consistent gains, cross-benchmark evaluations reveal limited transfer. Beyond aggregate accuracy, we systematically analyze difficulty in map-to-street-view reasoning using both structural signals and human effort, and conduct an extensive failure analysis of adapted open models. Our findings highlight persistent gaps in geometric alignment, evidence aggregation, and reasoning consistency, motivating future work on grounded spatial reasoning across viewpoints.


2603.26551 2026-06-17 cs.CV cs.AI 版本更新

Beyond MACs: Hardware Efficient Architecture Design for Vision Backbones

超越MACs：面向视觉骨干网络的硬件高效架构设计

Moritz Nottebaum, Matteo Dunnhofer, Christian Micheloni

发表机构 * Machine Learning and Perception Lab, University of Udine（乌迪内大学机器学习与感知实验室）； Centre for Vision Research, York University（约克大学视觉研究中心）

AI总结针对MACs指标在边缘设备上的不足，提出基于硬件效率洞察的LowFormer骨干网络，通过轻量级Lowtention模块实现显著加速。

Comments Accepted at International Journal of Computer Vision (IJCV)

详情

DOI: 10.1007/s11263-026-02873-5
Journal ref: Int J Comput Vis 134, 295 (2026)

AI中文摘要

视觉骨干网络在现代计算机视觉中扮演核心角色，提升其效率可直接惠及广泛的下游应用。为衡量效率，许多论文依赖MACs（乘累加操作数）作为执行时间的预测指标。本文通过实验证明了该指标的缺陷，尤其是在边缘设备场景下。通过对比常见架构设计元素的MAC计数和执行时间，我们识别出高效执行的关键因素，并为优化骨干网络设计提供见解。基于这些见解，我们提出LowFormer，一个新的视觉骨干网络家族。LowFormer采用精简的宏观和微观设计，其中包括Lowtention，一种多头自注意力的轻量级替代方案。Lowtention不仅更高效，还在ImageNet上取得了更优的结果。此外，我们提出LowFormer的边缘GPU版本，它在边缘GPU和桌面GPU上的速度均可较其基线进一步提升。我们在更小的图像分类数据集上对LowFormer进行评估，并将其适配到目标检测、语义分割、图像检索和视觉目标跟踪等多个下游任务，从而展示其广泛的适用性。与近期最先进的骨干网络相比，LowFormer模型在各种硬件平台上均实现了显著加速。代码和模型可在 https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md 获取。

英文摘要

Vision backbone networks play a central role in modern computer vision. Enhancing their efficiency directly benefits a wide range of downstream applications. To measure efficiency, many publications rely on MACs (Multiply Accumulate operations) as a predictor of execution time. In this paper, we experimentally demonstrate the shortcomings of such a metric, especially in the context of edge devices. By contrasting the MAC count and execution time of common architectural design elements, we identify key factors for efficient execution and provide insights to optimize backbone design. Based on these insights, we present LowFormer, a novel vision backbone family. LowFormer features a streamlined macro and micro design that includes Lowtention, a lightweight alternative to Multi-Head Self-Attention. Lowtention not only proves more efficient, but also enables superior results on ImageNet. Additionally, we present an edge GPU version of LowFormer, that can further improve upon its baseline's speed on edge GPU and desktop GPU. We demonstrate LowFormer's wide applicability by evaluating it on smaller image classification datasets, as well as adapting it to several downstream tasks, such as object detection, semantic segmentation, image retrieval, and visual object tracking. LowFormer models consistently achieve remarkable speed-ups across various hardware platforms compared to recent state-of-the-art backbones. Code and models are available at https://github.com/altair199797/LowFormer/blob/main/Beyond_MACs.md.
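
For readers unfamiliar with the metric, a MAC count depends only on layer shapes, which is part of why it can diverge from measured execution time. The sketch below counts MACs for a standard 3x3 convolution and for a depthwise-separable replacement. The layer sizes and the function name `conv2d_macs` are hypothetical and are not taken from LowFormer; the code makes no claim about latency.

```python
def conv2d_macs(h_out, w_out, c_in, c_out, kernel, groups=1):
    """Multiply-accumulate count of a 2D convolution (bias ignored)."""
    return h_out * w_out * c_out * (c_in // groups) * kernel * kernel

# Hypothetical layer: 56x56 output map, 64 input and 64 output channels.
standard = conv2d_macs(56, 56, 64, 64, kernel=3)
depthwise = conv2d_macs(56, 56, 64, 64, kernel=3, groups=64)
pointwise = conv2d_macs(56, 56, 64, 64, kernel=1)
separable = depthwise + pointwise

print(f"standard:  {standard:,} MACs")
print(f"separable: {separable:,} MACs ({standard / separable:.1f}x fewer)")
```

The separable variant needs far fewer MACs, but whether it actually runs faster on a given device has to be measured, which is the point the abstract makes about relying on MACs alone.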


2605.23176 2026-06-17 cs.CV 版本更新

DRIVESPATIAL: A Benchmark for Spatiotemporal Intelligence in VLMs for Autonomous Driving

DRIVESPATIAL：自动驾驶中视觉语言模型时空智能的基准

Hao Vo, Khoa Vo, Phu Loc Nguyen, Sieu Tran, Duc Minh Nguyen, Ngo Xuan Cuong, Gladys Gawugah, Sreevenkata Anjani Tishita Godavarthi, Chase Rainwater, Nghi D. Q. Bui, Anh Nguyen, Duy Minh Ho Nguyen, Ngan Le

发表机构 * University of Arkansas, USA（美国阿肯色大学）； Google Research, Google（谷歌研究院）； University of Liverpool, UK（英国利物浦大学）； Max Planck Research School for Intelligent Systems（马克斯·普朗克智能系统研究学校）

AI总结提出DriveSpatial基准，通过多视角、时空推理任务评估视觉语言模型在自动驾驶中的场景构建、关系理解、时序推理和泛化能力，发现人类与模型间存在显著差距。

详情

AI中文摘要

自动驾驶中的时空智能要求智能体将多视角观测整合为连贯的场景表示，跨视角和时间保持物体连续性，并推理空间关系、交互和未来动态。然而，现有的自动驾驶视觉语言基准主要关注单视角、静态、自我中心或单源问答，尚不清楚当前视觉语言模型（VLM）能否真正构建和推理动态驾驶场景。我们引入了DriveSpatial，一个包含来自五个大规模自动驾驶数据集的20个任务、15.6K人工验证问答对的基准。DriveSpatial评估四种能力：认知场景构建、多视角关系理解、时序推理和泛化。与之前的基准不同，DriveSpatial是从一个动态多关系场景图生成的，该图编码了物体状态、空间关系、交互、相机可见性和时间对应关系，从而产生强制进行真正的跨视角和时空推理的问答对。评估15个代表性VLM揭示了显著的人机差距：最强模型落后人类28.4分，其中认知场景构建成为关键瓶颈。进一步诊断表明，仅语言提示不足，而显式BEV基础一致地提升性能。这些结果表明，当前VLM缺乏可靠的时空驾驶智能所需的场景构建能力。DriveSpatial及其构建流程将发布以支持未来研究。

英文摘要

Spatiotemporal intelligence in autonomous driving (AD) requires an agent to integrate multi-view observations into a coherent scene representation, maintain object continuity across viewpoints and time, and reason about spatial relations, interactions, and future dynamics. However, existing AD vision-language benchmarks largely focus on single-view, static, ego-centric, or single-source question answering, leaving it unclear whether current Vision-Language Models (VLMs) can truly construct and reason over dynamic driving scenes. We introduce DriveSpatial, a benchmark of 15.6K human-verified QA pairs across 20 tasks from five large-scale AD datasets. DriveSpatial evaluates four abilities: Cognitive Scene Construction, Multi-view Relational Understanding, Temporal Reasoning, and Generalization. Unlike prior benchmarks, DriveSpatial is generated from a dynamic multi-relational scene graph that encodes object states, spatial relations, interactions, camera visibility, and temporal correspondences, enabling QA pairs that enforce genuine cross-view and spatiotemporal reasoning. Evaluating 15 representative VLMs reveals a substantial human-model gap: the strongest model trails humans by 28.4 points, with Cognitive Scene Construction emerging as the key bottleneck. Further diagnostics show that language-only prompting is insufficient, while explicit BEV grounding consistently improves performance. These results suggest that current VLMs lack the scene-construction ability needed for reliable spatiotemporal driving intelligence. DriveSpatial and its construction pipeline will be released to support future research.


2606.14096 2026-06-17 cs.CV 版本更新

A New Multi-Domain Benchmark for Micro-Action Recognition and Detection

微动作识别与检测的新多领域基准

Yanbin Hao, Pengyu Liu, Xing Wei, Xun Yang, Dan Guo, Meng Wang

发表机构 * School of Computer Science and Information Engineering, Hefei University of Technology（合肥工业大学计算机科学与信息工程学院）； School of Information Science and Technology, University of Science and Technology of China（中国科学技术大学信息科学技术学院）

AI总结提出MMA-82，一个大规模多领域微动作基准，扩展至82个类别、4个领域，涵盖识别与多标签检测任务，实验表明现有方法在域迁移、长尾分布等场景下仍面临挑战。

Comments 10 pages, 9 figures

详情

AI中文摘要

微动作是全身层面持续时间短、幅度低的细微身体运动，能够揭示潜在意图、非自愿反应和细粒度情感变化。我们之前的MA-52基准为微动作识别提供了重要基础，但在规模、场景多样性、任务覆盖和评估协议方面仍有限。为了将微动作分析推向更真实和全面的场景，我们引入了MMA-82，这是MA-52的大规模多领域扩展。MMA-82将标签空间从52个细粒度微动作类别扩展到82个，并涵盖四个不同领域，包括实验室访谈、街头访谈、精神病患者访谈和情感丰富的电视视频，最终从454名受试者中获得了77,856个标注实例。基于MMA-82，我们建立了两个核心任务：微动作识别和多标签微动作检测。对于识别，我们进一步定义了域内和跨域协议，包括少样本和零样本设置，以评估模型的鲁棒性、可迁移性和泛化能力。大量实验表明，当前方法在真实微动作理解中仍面临困难，尤其是在域迁移、长尾类别分布和复杂时间定位下。除了基准测试，我们还研究了微动作与情感之间的关系，表明微动作与情感状态密切相关，并为面部微表情提供补充线索，以改进情感识别。这些结果表明，MMA-82是真实微动作分析的全面且具有挑战性的基准，也是以人为中心的AI的宝贵资源。MMA-82可在 https://lpynow.github.io/MMA-82-AIM/ 获取。

英文摘要

Micro-actions are short-duration, low-amplitude subtle body movements at the whole-body level that can reveal latent intentions, involuntary reactions, and fine-grained affective changes. Our previous MA-52 benchmark has provided an important foundation for micro-action recognition, but it remains limited in scale, scene diversity, task coverage, and evaluation protocols. To advance micro-action analysis toward more realistic and comprehensive settings, we introduce MMA-82, a large-scale multi-domain extension of MA-52. MMA-82 expands the label space from 52 to 82 fine-grained micro-action categories and covers four distinct domains, including laboratory interviews, street interviews, psychiatric patient interviews, and emotion-rich television videos, resulting in 77,856 annotated instances from 454 subjects. Built upon MMA-82, we establish two core tasks: Micro-Action Recognition and Multi-label Micro-Action Detection. For recognition, we further define in-domain and cross-domain protocols, including few-shot and zero-shot settings, to evaluate model robustness, transferability, and generalization. Extensive experiments show that current methods still struggle with realistic micro-action understanding, especially under domain shift, long-tailed category distributions, and complex temporal localization. Beyond benchmarking, we investigate the relationship between micro-actions and emotion, showing that micro-actions are strongly associated with emotional states and provide complementary cues to facial micro-expressions for improved emotion recognition. These results demonstrate that MMA-82 serves as a comprehensive and challenging benchmark for realistic micro-action analysis and a valuable resource for human-centered AI. MMA-82 is available at https://lpynow.github.io/MMA-82-AIM/.


2606.14383 2026-06-17 cs.CV 版本更新

IndustryBench-MIPU: Benchmarking Multi-Image Attribute Value Extraction for Industrial Products

IndustryBench-MIPU：面向工业产品的多图像属性值提取基准

Haonan Qi, Jin Cao, Yongqi Zhang, Xintong Wang, Weidong Tang, Bin Chen, Chengfu Huo, Haojun Pan, Hengyu You, Jing Li, Yingde Wang, Liang Ding

发表机构 * Multimodal and Industrial AI Team（多模态与工业AI团队）； Taobao&Tmall, Alibaba Group（淘宝&天猫，阿里巴巴集团）

AI总结提出首个多图像工业产品理解基准IndustryBench-MIPU，通过结构化属性提取任务评估多模态大模型在规格表、铭牌、技术图纸上的文本识别、视觉推理、领域知识和跨图像证据整合能力，发现多图像完整性是核心瓶颈。

详情

AI中文摘要

工业产品（如阀门和断路器）由密集的技术规格定义，这些规格决定着供应链中的采购、兼容性和安全性。这些规格分散在多张异构的产品图像中，包括规格表、铭牌和技术图纸，然而多模态大语言模型（MLLMs）能否可靠地恢复它们仍未得到充分研究。为填补这一空白，我们引入了IndustryBench-MIPU，这是首个用于多图像工业产品理解的大规模基准，围绕结构化属性提取构建，即从产品图像中恢复属性-值对。该任务同时考察规格表和铭牌上的文本识别、技术图纸上的视觉推理、解读工业术语所需的领域知识，以及为汇集分散规格而进行的跨图像证据整合。具体而言，该基准包含4,559个产品、27,652张图像，以及覆盖18个工业类别的103,703条标注，通过多模型共识和三层质量保证构建。在单图像和产品级多图像两种设置下评估九个MLLM，揭示了明显的完整性差距：模型的精确率很高（86-94%），但最佳模型仅恢复了49.9%的产品级属性；从单图像提取转为多图像提取，召回率下降15-34个百分点。核心瓶颈在于多图像完整性，而非单图像准确性。数据集和代码已公开。

英文摘要

Industrial products such as valves and circuit breakers are defined by dense technical specifications that govern procurement, compatibility, and safety across supply chains. These specifications are scattered across multiple heterogeneous product images, including specification tables, nameplates, and technical drawings, yet whether Multimodal Large Language Models (MLLMs) can reliably recover them remains underexplored. To fill this gap, we introduce IndustryBench-MIPU, the first large-scale benchmark for multi-image industrial product understanding, built around structured attribute extraction -- recovering property-value pairs from product images. This task jointly probes text recognition on specification tables and nameplates, visual reasoning over technical drawings, domain knowledge to decode industrial terminology, and cross-image evidence integration to assemble scattered specifications. Concretely, the benchmark comprises 4,559 products across 27,652 images with 103,703 annotations spanning 18 industrial categories, constructed through multi-model consensus and three-tier quality assurance. Evaluating nine MLLMs under both single-image and product-level multi-image settings reveals a stark completeness gap: models achieve high precision (86--94%) but the best recovers only 49.9% of product-level attributes; moving from single-image to multi-image extraction costs 15--34 percentage points of recall. Multi-image completeness, not single-image accuracy, is the core bottleneck. Dataset and code are publicly available.
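
The completeness gap reported above (high precision, low product-level recall) can be made concrete with a small example. This is a sketch under assumptions: exact-match comparison of `(attribute, value)` pairs, the function name `precision_recall`, and the sample attributes are all hypothetical, since the abstract does not describe the benchmark's matching protocol.

```python
def precision_recall(predicted, gold):
    """Exact-match precision and recall over (attribute, value) pairs."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall

# Hypothetical product: four annotated attributes spread over several images.
gold = {("voltage", "230 V"), ("poles", "3"), ("current", "16 A"), ("ip_rating", "IP65")}
# A model that reads only the nameplate gets two of them, both correct.
predicted = {("voltage", "230 V"), ("poles", "3")}

precision, recall = precision_recall(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Everything the model extracted is right, so precision is perfect, yet half of the product's attributes are missing, which is the pattern the benchmark reports at scale.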


2411.06842 2026-06-17 eess.IV cs.CV 版本更新

Evaluating Synthetic Data Generation for Domain Generalization in Fetal Brain MRI Segmentation

评估胎儿脑MRI分割中域泛化的合成数据生成

Vladyslav Zalevskyi, Thomas Sanchez, Margaux Roulet, Busra Bulut, Hélène Lajous, Jordina Aviles Verdera, Sara Neves Silva, Georg Langs, Gregor Kasprian, Roxane Licandro, Jana Hutter, Hamza Kebiri, Meritxell Bach Cuadra

发表机构 * Department of Radiology, Lausanne University Hospital and University of Lausanne (UNIL)（洛桑大学医院放射科和洛桑大学（UNIL））； CIBM Center for Biomedical Imaging（CIBM生物医学成像中心）； Institute for Information Processing, Leibniz University Hannover（汉诺威莱布尼茨大学信息处理研究所）； Department of Biomedical Engineering, School of Biomedical Engineering & Imaging Sciences, King’s College London（伦敦国王学院生物医学工程与成像科学学院生物医学工程系）； Department of Biomedical Imaging and Image-Guided Therapy, Division of Neuroradiology and Musculoskeletal Radiology, Medical University of Vienna（维也纳医学大学生物医学成像与影像引导治疗系神经放射学与肌肉骨骼放射学科）； Department of Biomedical Imaging and Image-guided Therapy, Computational Imaging Research Lab (CIR), Medical University of Vienna（维也纳医学大学生物医学成像与影像引导治疗系计算成像研究实验室（CIR））； Christian Doppler Laboratory for Mathematical Modelling and Simulation of Next-Generation Medical Ultrasound Devices, Medical University of Vienna（维也纳医学大学下一代医学超声设备数学建模与仿真克里斯蒂安·多普勒实验室）； Comprehensive Center for Artificial Intelligence in Medicine, Medical University of Vienna（维也纳医学大学医学人工智能综合中心）； Division of Neuroradiology and Musculoskeletal Radiology, Department of Biomedical Imaging and Image–guided Therapy, Medical University of Vienna（维也纳医学大学生物医学成像与影像引导治疗系神经放射学与肌肉骨骼放射学科）

AI总结针对胎儿脑MRI分割中数据异质性和标注不足问题，研究基于域随机化的合成数据生成策略，提出FetalSynthSeg框架，通过高斯混合强度建模和强度聚类提升跨域鲁棒性，在多个数据集上达到最优性能。

详情

AI中文摘要

从磁共振成像（MRI）中分割胎儿脑组织对于研究神经发育至关重要，但由于数据异质性和标注有限，这一任务仍然具有挑战性。域随机化（DR）通过合成具有随机伪影、对比度和分辨率的训练图像，近来成为一种有前景的单源域泛化策略。在这项工作中，我们研究了如何最大化基于DR的方法的域外（OOD）泛化能力。我们评估了几种用于DR的合成数据生成策略，并特别关注我们最近提出的框架FetalSynthSeg。我们表明，简单的基于高斯混合的强度建模优于更复杂的基于物理的模拟，并且强度聚类（根据强度细分组织类别）提高了OOD鲁棒性。在来自四个站点、涵盖0.55-3T场强以及T1w和T2w两种对比度的348个胎儿受试者上进行评估，FetalSynthSeg在多个FeTA 2024测试数据集上达到了最先进的性能（Dice分数80-85），并首次在T2w以外的模态上实现了鲁棒的胎儿脑分割（在dHCP-T1w数据集上Dice为80）。与BOUNTI、nnU-Net集成和FeTA 2024获胜方案等最先进的方法相比，FetalSynthSeg在保持对域偏移的强鲁棒性的同时，提供了相当或更优的准确性。我们的代码、模型权重和便于推理的Docker镜像可在 https://hub.docker.com/r/vzalevskyi/fetalsynthseg 获取。

英文摘要

Fetal brain tissue segmentation from magnetic resonance imaging (MRI) is crucial for studying neurodevelopment, but remains challenging due to data heterogeneity and limited annotations. Domain randomization (DR) has recently emerged as a promising strategy for single-source domain generalization by synthesizing training images with randomized artifacts, contrast, and resolution. In this work, we investigate how to maximize the out-of-domain (OOD) generalization of DR-based methods. We evaluate several synthetic data generation strategies for DR, with a particular focus on our recently proposed framework, FetalSynthSeg. We show that simple Gaussian mixture-based intensity modeling outperforms more complex physics-based simulations, and that intensity clustering (subdividing tissue classes based on intensity) improves OOD robustness. Evaluated on 348 fetal subjects from four sites spanning 0.55-3T and both T1w and T2w contrasts, FetalSynthSeg reaches state-of-the-art performance on several FeTA 2024 testing datasets (80-85 Dice score) and, for the first time, offers robust segmentation on modalities other than T2w for fetal brain segmentation (80 Dice on dHCP-T1w dataset). Compared with state-of-the-art methods such as BOUNTI, nnU-Net ensemble, and the FeTA 2024 winner, FetalSynthSeg delivers comparable or superior accuracy while maintaining strong robustness across domain shifts. Our code, model weights, and Docker image ready for easy inference are available at https://hub.docker.com/r/vzalevskyi/fetalsynthseg.


2502.00241 2026-06-17 cs.LG cs.AI cs.CL cs.CV 版本更新

Mordal: Automated Pretrained Model Selection for Vision Language Models

Mordal: 面向视觉语言模型的自动化预训练模型选择

Shiqi He, Insu Jang, Mosharaf Chowdhury

AI总结提出Mordal框架，通过减少候选模型数量和评估时间，自动化搜索用户定义任务的最佳视觉语言模型，相比网格搜索降低GPU耗时8.9-11.6倍，加权Kendall's τ平均提升69%。

详情

AI中文摘要

将多种模态融入大型语言模型（LLMs）是增强其对非文本数据理解、使其能够执行多模态任务的有效方式。视觉语言模型（VLMs）因其在医疗、机器人和无障碍等领域的众多实际应用，成为增长最快的多模态模型类别。然而，尽管文献中不同的VLM在不同基准测试中展现出令人印象深刻的视觉能力，它们都是由人类专家手工设计的；目前尚无自动化框架来创建特定任务的多模态模型。我们引入Mordal，一种自动化多模态模型搜索框架，能够高效地为用户定义的任务找到最佳VLM，无需人工干预。Mordal通过减少搜索过程中需考虑的候选模型数量以及最小化评估每个剩余候选模型所需的时间来实现这一目标。我们的评估表明，Mordal能够找到给定问题的最佳VLM，其GPU耗时比网格搜索低8.9倍至11.6倍。我们还发现，Mordal在不同任务上平均比最先进的模型选择方法实现约69%更高的加权Kendall's τ。

英文摘要

Incorporating multiple modalities into large language models (LLMs) is a powerful way to enhance their understanding of non-textual data, enabling them to perform multimodal tasks. Vision language models (VLMs) form the fastest growing category of multimodal models because of their many practical use cases, including in healthcare, robotics, and accessibility. Unfortunately, even though different VLMs in the literature demonstrate impressive visual capabilities in different benchmarks, they are handcrafted by human experts; there is no automated framework to create task-specific multimodal models. We introduce Mordal, an automated multimodal model search framework that efficiently finds the best VLM for a user-defined task without manual intervention. Mordal achieves this both by reducing the number of candidates to consider during the search process and by minimizing the time required to evaluate each remaining candidate. Our evaluation shows that Mordal can find the best VLM for a given problem using $8.9\times$ to $11.6\times$ lower GPU hours than grid search. We have also discovered that Mordal achieves about 69% higher weighted Kendall's $\tau$ on average than the state-of-the-art model selection method across diverse tasks.
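
A weighted Kendall's tau scores how well a predicted ranking of candidate models agrees with their true ranking, while letting some pairs count more than others. The abstract does not say which weighting scheme is used, so the sketch below is a generic pairwise-weighted form: the function name `weighted_kendall_tau`, the `weight` callback, and the `top_heavy` weight are hypothetical illustrations, not the paper's metric.

```python
def weighted_kendall_tau(x, y, weight=None):
    """Weighted Kendall rank correlation between score lists x and y.

    weight(i, j) gives the importance of the pair (i, j); with the default
    constant weight this reduces to the ordinary Kendall tau-a.
    """
    weight = weight or (lambda i, j: 1.0)
    num = den = 0.0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            w = weight(i, j)
            s = (x[i] - x[j]) * (y[i] - y[j])
            num += w * ((s > 0) - (s < 0))  # +1 concordant, -1 discordant, 0 tie
            den += w
    return num / den

def top_heavy(i, j):
    """Illustrative weight: pairs involving top-ranked items matter more.

    Assumes the items are listed in order of true performance, best first.
    """
    return 1.0 / (i + 1) + 1.0 / (j + 1)

true_scores = [0.9, 0.8, 0.7, 0.6]   # candidates listed best first
swap_top = [0.8, 0.9, 0.7, 0.6]      # prediction confuses the two best
swap_bottom = [0.9, 0.8, 0.6, 0.7]   # prediction confuses the two worst
print(weighted_kendall_tau(true_scores, swap_top, top_heavy))
print(weighted_kendall_tau(true_scores, swap_bottom, top_heavy))
```

With a top-heavy weight, confusing the two best candidates is penalised more than confusing the two worst, which matches the practical goal of model selection: getting the top of the ranking right.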


2505.00986 2026-06-17 cs.LG cs.CV 版本更新

EmbodiTTA: Resource-Efficient Test-Time Adaptation for Embodied Visual Systems

EmbodiTTA：面向具身视觉系统的资源高效测试时自适应

Xiao Ma, Young D. Kwon, Dong Ma

AI总结提出按需测试时自适应范式OD-TTA，通过轻量域移检测、源域选择和分离批归一化更新，在边缘设备上实现高效准确的自适应，显著降低计算和能耗开销。

详情

AI中文摘要

连续测试时自适应（CTTA）持续对每个到达的数据批次调整部署模型。虽然达到了最优精度，但现有的CTTA方法由于巨大的内存开销和能耗，在资源受限的边缘设备上实际应用性差。本文首先引入一种新范式——按需TTA，仅在检测到显著域移时触发自适应。然后，我们提出OD-TTA，一种用于边缘设备上准确高效自适应的按需TTA框架。OD-TTA包含三项创新技术：1）轻量级域移检测机制，仅在需要时激活TTA，大幅降低总体计算开销；2）源域选择模块，选择合适的源模型进行自适应，确保高且鲁棒的精度；3）解耦的批归一化（BN）更新方案，实现小批量下的内存高效自适应。大量实验表明，OD-TTA在显著降低能量和计算开销的同时，实现了可比甚至更好的性能，使TTA成为实际可行的技术。

英文摘要

Continual Test-time adaptation (CTTA) continuously adapts the deployed model on every incoming batch of data. While achieving optimal accuracy, existing CTTA approaches present poor real-world applicability on resource-constrained edge devices, due to the substantial memory overhead and energy consumption. In this work, we first introduce a novel paradigm -- on-demand TTA -- which triggers adaptation only when a significant domain shift is detected. Then, we present OD-TTA, an on-demand TTA framework for accurate and efficient adaptation on edge devices. OD-TTA comprises three innovative techniques: 1) a lightweight domain shift detection mechanism to activate TTA only when it is needed, drastically reducing the overall computation overhead, 2) a source domain selection module that chooses an appropriate source model for adaptation, ensuring high and robust accuracy, 3) a decoupled Batch Normalization (BN) update scheme to enable memory-efficient adaptation with small batch sizes. Extensive experiments show that OD-TTA achieves comparable and even better performance while reducing the energy and computation overhead remarkably, making TTA a practical reality.
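
The on-demand idea, adapting only when a shift is detected rather than on every batch, can be sketched with a simple drift check. The abstract does not describe the detection statistic, so everything here is an assumption made for illustration: the class name `ShiftDetector`, the scalar per-batch statistic, the threshold, and the momentum value are hypothetical and do not reproduce OD-TTA.

```python
class ShiftDetector:
    """Flag a domain shift when a batch statistic drifts from its running mean."""

    def __init__(self, threshold=0.5, momentum=0.9):
        self.threshold = threshold
        self.momentum = momentum
        self.reference = None

    def update(self, batch_stat):
        """Return True if adaptation should be triggered for this batch."""
        if self.reference is None:
            self.reference = batch_stat
            return False
        shifted = abs(batch_stat - self.reference) > self.threshold
        if shifted:
            # Re-anchor on the new domain so adaptation is triggered once.
            self.reference = batch_stat
        else:
            self.reference = (self.momentum * self.reference
                              + (1 - self.momentum) * batch_stat)
        return shifted

detector = ShiftDetector()
stream = [0.0, 0.05, 0.1, 2.0, 2.05, 2.1]  # hypothetical per-batch statistics
flags = [detector.update(s) for s in stream]
print(flags)
```

In this toy stream only the fourth batch triggers adaptation, so the expensive update runs once instead of six times, which is the source of the energy and compute savings the abstract describes.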


2512.21315 2026-06-17 cs.LG cs.CV stat.ML 版本更新

Does the Data Processing Inequality Reflect Practice? On the Utility of Low-Level Tasks

数据处理不等式是否反映实践？论低级任务的有用性

Roy Turgeman, Tom Tirer

AI总结本文研究低级处理（如去噪、编码）如何提升分类性能，证明在有限样本下存在预处理可提高准确率，并通过实验验证理论趋势。

Comments ICLR 2026 (camera-ready). Code is available at: https://github.com/serveroy/process-before-you-classify


Journal ref: The Fourteenth International Conference on Learning Representations (ICLR 2026)

AI中文摘要

数据处理不等式是一个信息论原理，指出信号的信息内容不能通过处理观测数据而增加。特别地，它表明在解决分类问题之前，增强信号或对其进行编码没有益处。对于最优贝叶斯分类器，这一断言可以被证明是正确的。然而，在实践中，尽管现代深度神经网络具有强大的能力，但在高级下游任务之前执行“低级”任务仍然很常见。在本文中，我们旨在理解低级处理何时以及为何对分类有益。我们提出了一个二元分类设置的综合理论研究，其中我们考虑一个与最优贝叶斯分类器紧密相连的分类器，并随着训练样本数量的增加而收敛到它。我们证明，对于任何有限数量的训练样本，存在一种预分类处理可以提高分类准确率。我们还探讨了类分离、训练集大小和类平衡对该过程相对增益的影响。我们通过理论设置的经验研究来支持我们的理论。最后，我们进行了一项实证研究，调查去噪和编码对基准数据集上实际深度分类器性能的影响。具体来说，我们改变了训练集的大小和类别分布以及噪声水平，并展示了与我们的理论结果一致的趋势。

英文摘要

The data processing inequality is an information-theoretic principle stating that the information content of a signal cannot be increased by processing the observations. In particular, it suggests that there is no benefit in enhancing the signal or encoding it before addressing a classification problem. This assertion can be proven to be true for the case of the optimal Bayes classifier. However, in practice, it is common to perform "low-level" tasks before "high-level" downstream tasks despite the overwhelming capabilities of modern deep neural networks. In this paper, we aim to understand when and why low-level processing can be beneficial for classification. We present a comprehensive theoretical study of a binary classification setup, where we consider a classifier that is tightly connected to the optimal Bayes classifier and converges to it as the number of training samples increases. We prove that for any finite number of training samples, there exists a pre-classification processing that improves the classification accuracy. We also explore the effect of class separation, training set size, and class balance on the relative gain from this procedure. We support our theory with an empirical investigation of the theoretical setup. Finally, we conduct an empirical study where we investigate the effect of denoising and encoding on the performance of practical deep classifiers on benchmark datasets. Specifically, we vary the size and class distribution of the training set, and the noise level, and demonstrate trends that are consistent with our theoretical results.
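For reference, the inequality discussed in the abstract can be written as follows. The notation is assumed here and does not appear in the abstract: Y is the class label, X the observation, and T any (possibly randomized) processing of X, so that Y, X, and T(X) form a Markov chain.

```latex
I\bigl(Y;\, T(X)\bigr) \;\le\; I(Y;\, X)
\qquad \text{whenever } Y \to X \to T(X) \text{ is a Markov chain.}
```

The inequality bounds the information available to any classifier; as the abstract notes, it does not by itself determine how a classifier trained on finitely many samples will perform.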


2602.03300 2026-06-17 cs.LG cs.AI cs.CL cs.CV 版本更新

R1-SyntheticVL: Is Synthetic Data from Generative Models Ready for Multimodal Large Language Model?

R1-SyntheticVL：生成模型的合成数据是否已为多模态大语言模型做好准备？

Jingyi Zhang, Tianyi Lin, Huanjin Yao, Xiang Lan, Shunyu Liu, Jiaxing Huang

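AI summary: Proposes Collective Adversarial Data Synthesis (CADS), which uses collective intelligence and adversarial learning to automatically generate high-quality, diverse, and challenging multimodal data for improving multimodal large language models (MLLMs) on complex real-world tasks.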

Comments ICML 2026 Camera Ready


AI中文摘要

在这项工作中，我们旨在开发有效的数据合成技术，自主合成多模态训练数据，以增强MLLM解决复杂现实任务的能力。为此，我们提出了集体对抗数据合成（CADS），这是一种新颖且通用的方法，用于合成高质量、多样且具有挑战性的多模态数据。CADS的核心思想是利用集体智能确保高质量和多样化的生成，同时探索对抗学习以合成具有挑战性的样本，从而有效驱动模型改进。具体来说，CADS包含两个循环阶段：集体对抗数据生成（CAD-Generate）和集体对抗数据判断（CAD-Judge）。CAD-Generate利用集体知识共同生成新的多样化多模态数据，而CAD-Judge则协作评估合成数据的质量。此外，CADS引入了一种对抗上下文优化机制，以优化生成上下文，鼓励生成具有挑战性和高价值的数据。通过CADS，我们构建了MMSynthetic-20K并训练了我们的模型R1-SyntheticVL，该模型在多个基准测试中表现出优越的性能。

英文摘要

In this work, we aim to develop effective data synthesis techniques that autonomously synthesize multimodal training data for enhancing MLLMs in solving complex real-world tasks. To this end, we propose Collective Adversarial Data Synthesis (CADS), a novel and general approach to synthesize high-quality, diverse and challenging multimodal data for MLLMs. The core idea of CADS is to leverage collective intelligence to ensure high-quality and diverse generation, while exploring adversarial learning to synthesize challenging samples for effectively driving model improvement. Specifically, CADS operates with two cyclic phases, i.e., Collective Adversarial Data Generation (CAD-Generate) and Collective Adversarial Data Judgment (CAD-Judge). CAD-Generate leverages collective knowledge to jointly generate new and diverse multimodal data, while CAD-Judge collaboratively assesses the quality of synthesized data. In addition, CADS introduces an Adversarial Context Optimization mechanism to optimize the generation context to encourage challenging and high-value data generation. With CADS, we construct MMSynthetic-20K and train our model R1-SyntheticVL, which demonstrates superior performance on various benchmarks.


2606.17427 2026-06-17 cs.CV cs.HC 新提交

Impact of Hand Impairment and Occlusions on Hand Pose Estimation Accuracy in Augmented Reality Applications

手部损伤和遮挡对增强现实应用中手部姿态估计精度的影响

Damian M. Manzone, Mathew Szymanowski, Olga Taran, Shuo Cai, Melissa Marquez-Chin, Tammy Zeng, Hardeep Singh, Cesar Marquez-Chin, José Zariffa

发表机构 * KITE Research Institute, Toronto Rehabilitation Institute, University Health Network（大学健康网络多伦多康复研究所KITE研究所）； Institute of Biomedical Engineering, University of Toronto（多伦多大学生物医学工程研究所）； Department of Health Sciences and Technology, ETH Zürich（苏黎世联邦理工学院健康科学与技术系）； Department of Occupational Science & Occupational Therapy and the Rehabilitation Sciences Institute, University of Toronto（多伦多大学职业科学与职业治疗系及康复科学研究所）

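AI summary: Evaluates the pose estimation accuracy of the HoloLens 2 and several pose estimation algorithms under hand impairment and object occlusion, finding that predictions generalize to populations with hand impairment and that clear objects give a small accuracy advantage.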


AI中文摘要

混合现实应用可用于手部康复。增强现实（AR）头戴式显示器（HMD）尤其适合开展具有生态效度的任务，因为使用者可以看到真实环境并与真实物体交互，同时在HMD上接收额外提示。虽然这些应用依赖于准确的手部姿态估计，但手部损伤或真实物体交互造成的遮挡对姿态估计精度的影响尚缺乏研究。此外，AR HMD的预测与最先进姿态估计方法之间的比较也尚未建立。本研究评估了HoloLens 2 HMD和最先进姿态估计算法（WiLoR、HaMeR、WildHands和MediaPipe）在颈脊髓损伤患者（cSCI；n=13，神经损伤平面：C3-C6；美国脊髓损伤协会损伤分级：A-D）和15名未受伤对照者与透明和不透明物体交互时的姿态估计精度。3D关节位置的真实值通过多摄像头装置的三角测量生成。姿态估计精度在cSCI组和未受伤对照组之间没有差异，表明HoloLens 2和姿态估计算法的3D关节预测可以泛化到手部损伤人群。此外，透明物体相比不透明物体有微小的精度优势（0.1毫米），WiLoR和HaMeR的预测比HoloLens 2略精确（2毫米）。总体而言，这些结果表明HoloLens 2可能适用于手部康复应用，所生成的数据集可用于改进面向手部损伤人群的姿态估计方法。

英文摘要

Mixed reality applications can be designed for hand rehabilitation. Augmented reality (AR) head mounted displays (HMDs) specifically allow for ecologically valid tasks because individuals can see their real environment and interact with real objects while receiving additional cues on the HMD. While these applications rely on accurate hand pose estimation, there is a gap in investigating the influence of hand impairment or occlusion from real-object interactions on pose estimation accuracy. Further, comparisons between AR HMD predictions and state-of-the-art pose estimation methods have not been established. The current study assessed pose estimation accuracy of the HoloLens 2 HMD and state-of-the-art pose estimation algorithms (WiLoR, HaMeR, WildHands, and MediaPipe) while individuals with cervical spinal cord injury (cSCI; n = 13, Neurological Level of Injury: C3-C6; American Spinal Injury Association Impairment Scale: A-D) and 15 uninjured controls interacted with clear and opaque objects. Ground truth estimates of 3D joint positions were generated via triangulation from a multi-camera setup. Pose estimation accuracy did not differ between the cSCI and uninjured control groups suggesting that 3D joint predictions from the HoloLens 2 and pose estimation algorithms can generalize to populations with hand impairment. Further, clear objects provided a small accuracy advantage over opaque objects (0.1 mm) and predictions from both WiLoR and HaMeR were slightly more accurate than the HoloLens 2 (2 mm). Overall, these results suggest that the HoloLens 2 may be viable for hand rehabilitation applications and the dataset generated can be used to refine pose estimation methods for hand-impaired populations.


2606.17713 2026-06-17 cs.CV 新提交

Heterogeneous SAR-optical fusion for near-real-time land use and land cover mapping under cloud contamination: A novel framework and global benchmark dataset

异质SAR-光学融合用于云污染下近实时的土地利用和土地覆盖制图：新框架与全球基准数据集

Jiangong Xu, Weibao Xue, Xiaoyu Yu, Jun Pan, Xinlian Liang, Mi Wang

发表机构 * State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing（测绘遥感信息工程国家重点实验室）； School of Computer Science and Information Engineering（计算机科学与信息工程学院）； Hangzhou International Innovation Institute（杭州国际创新研究院）； Oriental Space Port Research Institute（东方航天港研究院）； Hubei Luojia Laboratory（湖北珞珈实验室）

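AI summary: To address unreliable optical remote sensing under cloud contamination, proposes CloudLULC-Net, an end-to-end heterogeneous SAR-optical fusion framework that combines optical reliability modulation, heterogeneous information adaptive aggregation, and a unified semantic mapping transformer for near-real-time LULC mapping, and builds CloudLULC-Set, a global benchmark dataset of 40,223 triplets; the method outperforms existing approaches on OA, F1-score, and mIoU.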


AI中文摘要

光学遥感影像经常受到云和云阴影污染的干扰，这限制了其在近实时土地利用和土地覆盖（LULC）制图中的可靠性。尽管合成孔径雷达（SAR）可以提供穿透云层的结构信息，但现有的SAR-光学融合方法通常假设光学观测可靠，未能充分解决云污染引入的语义不确定性。为了解决这个问题，我们提出了CloudLULC-Net，一个端到端的异质SAR-光学融合框架，直接从受云污染的Sentinel-2影像和时间相邻的Sentinel-1 SAR观测中预测LULC图。所提出的网络包含光学可靠性调制以抑制不可靠的光学响应、异质信息自适应聚合以建模光学和SAR表示之间的高阶空间-通道交互，以及一个统一的语义映射变换器，在面向LULC的潜在空间中组织融合特征。进一步引入语义锚点引导优化策略以提高中间语义表示的一致性。为支持该任务，我们构建了CloudLULC-Set，一个大规模基准数据集，包含40,223个精心挑选的SAR-光学-标签三元组，具有跨不同地理区域和云条件的像素级LULC标注。实验结果表明，CloudLULC-Net实现了86.60%的总体精度、83.29%的F1分数和73.51%的平均交并比，优于代表性的异质重建优先和端到端SAR-光学映射方法。与现有全球LULC产品的比较以及不同云覆盖水平下的分析进一步证明了CloudLULC-Net在易云区域目标日期LULC制图中的鲁棒性和实用价值。该项目公开于：https://github.com/RSIIPAC/CloudLULC

英文摘要

Optical remote sensing imagery is frequently degraded by cloud and cloud-shadow contamination, which limits its reliability for near-real-time land use and land cover (LULC) mapping. Although synthetic aperture radar (SAR) can provide cloud-penetrating structural information, existing SAR-optical fusion methods often assume reliable optical observations and insufficiently address the semantic uncertainty introduced by cloud contamination. To address this issue, we propose CloudLULC-Net, an end-to-end heterogeneous SAR-optical fusion framework that directly predicts LULC maps from cloud-contaminated Sentinel-2 imagery and temporally adjacent Sentinel-1 SAR observations. The proposed network incorporates optical reliability modulation to suppress unreliable optical responses, heterogeneous information adaptive aggregation to model high-order spatial-channel interactions between optical and SAR representations, and a unified semantic mapping transformer to organize fused features in a LULC-oriented latent space. A semantic anchor-guided optimization strategy is further introduced to improve the consistency of intermediate semantic representations. To support this task, we construct CloudLULC-Set, a large-scale benchmark dataset containing 40,223 curated SAR-optical-label triplets with pixel-level LULC annotations across diverse geographic regions and cloud conditions. Experimental results show that CloudLULC-Net achieves an OA of 86.60%, an F1-score of 83.29%, and an mIoU of 73.51%, outperforming representative heterogeneous reconstruction-first and end-to-end SAR-optical mapping methods. Comparisons with existing global LULC products and analyses under different cloud-cover levels further demonstrate the robustness and practical value of CloudLULC-Net for target-date LULC mapping in cloud-prone regions. The project is publicly available at: https://github.com/RSIIPAC/CloudLULC
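The three metrics quoted in the abstract (OA, F1-score, mIoU) can all be derived from a confusion matrix. The sketch below shows one common way to compute them; macro averaging over classes is an assumption, since the abstract does not say how its F1-score is averaged, and the function name `segmentation_metrics` is hypothetical.

```python
def segmentation_metrics(conf):
    """Compute OA, macro-F1, and mIoU from a square confusion matrix.

    conf[i][j] counts pixels of true class i predicted as class j.
    Assumes the matrix contains at least one pixel.
    """
    n = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(n))
    f1s, ious = [], []
    for i in range(n):
        tp = conf[i][i]
        fn = sum(conf[i]) - tp                        # true class i, predicted otherwise
        fp = sum(conf[r][i] for r in range(n)) - tp   # predicted i, true class differs
        f1_den = 2 * tp + fp + fn
        iou_den = tp + fp + fn
        f1s.append(2 * tp / f1_den if f1_den else 0.0)
        ious.append(tp / iou_den if iou_den else 0.0)
    return {
        "OA": correct / total,      # overall accuracy
        "F1": sum(f1s) / n,         # macro-averaged F1 (assumed averaging)
        "mIoU": sum(ious) / n,      # mean intersection over union
    }
```

Because IoU penalizes both false positives and false negatives in its denominator, mIoU is never higher than macro-F1 on the same matrix, which is consistent with the ordering of the reported figures.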


2510.19255 2026-06-17 cs.CV 版本更新

Advances in 4D Representation: Geometry, Motion, and Interaction

4D表示进展：几何、运动与交互

Mingrui Zhao, Sauradip Nag, Kai Wang, Aditya Vora, Guangda Ji, Peter Chun, Ali Mahdavi-Amiri, Hao Zhang

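AI summary: Surveys 4D generation and reconstruction around three pillars (geometry, motion, and interaction), analyzes the properties, challenges, and suitable scenarios of different 4D representations, and discusses the role of large language models and video foundation models.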

Comments CGF'26, 21 pages. Project Page: https://mingrui-zhao.github.io/4DRep-GMI/


AI中文摘要

我们呈现了一篇关于4D生成与重建的综述，这是一个快速发展的计算机图形学子领域，其进展得益于神经场、几何与运动深度学习以及3D生成式人工智能（GenAI）的最新突破。尽管我们的综述并非首篇，但我们从独特且鲜明的4D表示视角构建领域覆盖，以建模随时间演变的3D几何，同时展现运动和交互。具体而言，我们并未穷举众多工作，而是采取更具选择性的方法，聚焦代表性工作，以突出每种表示在不同计算、应用和数据场景下的理想特性及随之而来的挑战。我们旨在向读者传达的主要信息是：如何为其任务选择并定制合适的4D表示。在组织上，我们基于三个关键支柱：几何、运动与交互，对4D表示进行划分。我们的讨论不仅涵盖当今最流行的表示，如神经辐射场（NeRFs）和3D高斯泼溅（3DGS），还关注在4D背景下相对未被充分探索的表示，如结构化模型和长程运动。在整个综述中，我们将重新审视大语言模型（LLMs）和视频基础模型（VFMs）在各种4D应用中的作用，同时引导讨论指向它们当前的局限性以及如何解决。我们还专门介绍了目前可用的4D数据集以及推动该子领域前进所缺乏的数据。项目页面：https://mingrui-zhao.github.io/4DRep-GMI/

英文摘要

We present a survey on 4D generation and reconstruction, a fast-evolving subfield of computer graphics whose developments have been propelled by recent advances in neural fields, geometric and motion deep learning, as well as 3D generative artificial intelligence (GenAI). While our survey is not the first of its kind, we build our coverage of the domain from a unique and distinctive perspective of 4D representations, to model 3D geometry evolving over time while exhibiting motion and interaction. Specifically, instead of offering an exhaustive enumeration of many works, we take a more selective approach by focusing on representative works to highlight both the desirable properties and ensuing challenges of each representation under different computation, application, and data scenarios. The main take-away message we aim to convey to the readers is on how to select and then customize the appropriate 4D representations for their tasks. Organizationally, we separate the 4D representations based on three key pillars: geometry, motion, and interaction. Our discourse will not only encompass the most popular representations of today, such as neural radiance fields (NeRFs) and 3D Gaussian Splatting (3DGS), but also bring attention to relatively under-explored representations in the 4D context, such as structured models and long-range motions. Throughout our survey, we will reprise the role of large language models (LLMs) and video foundational models (VFMs) in a variety of 4D applications, while steering our discussion towards their current limitations and how they can be addressed. We also provide a dedicated coverage on what 4D datasets are currently available, as well as what is lacking, in driving the subfield forward. Project page: https://mingrui-zhao.github.io/4DRep-GMI/


2605.26921 2026-06-17 cs.CV q-bio.NC 版本更新

Similarity-based representation factorization for revealing interpretable dimensions in representational data

基于相似性的表征分解：揭示表征数据中的可解释维度

Florian P. Mahner, Ka Chun Lam, Francisco Pereira, Martin N. Hebart

发表机构 * Max Planck Institute for Human Cognitive and Brain Sciences（马克斯·普朗克人类认知与脑科学研究所）； National Institute of Mental Health（美国国家精神卫生研究所）； Justus Liebig University Giessen（吉森尤斯图斯·李比希大学）； Center for Mind, Brain and Behavior（心智、脑与行为中心）

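AI summary: Proposes Similarity-Based Representation Factorization (SRF), which recovers low-dimensional, non-negative, interpretable embeddings from similarity matrices to reveal the dimensions underlying representations in neural, behavioral, and computational data.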


AI中文摘要

表征研究广泛存在于神经科学、心理学和人工智能等领域。虽然表征通常通过刺激之间的相似性来研究和比较，但现有方法对塑造这些表征的维度只能提供有限的揭示，且可解释性往往有限。为克服这些挑战，本文提出基于相似性的表征分解（SRF），这是一种通用的计算方法，用于从由测量数据导出的相似性矩阵中恢复低维、非负、可解释的嵌入。在模拟以及多种神经、行为和计算数据集中，SRF都能从各种形式的表征数据中恢复可解释的维度，即使数据采样非常稀疏且不完整。从这些数据集中得到的维度与任务特定模型获得的维度相匹配，能够预测独立的行为属性，改进探索性分析，并且相比直接比较相似性矩阵，为验证性假设检验提供更高的统计功效。这些结果共同确立了SRF作为一种通用方法，在揭示、理解和利用表征背后的维度方面具有广泛的应用前景。

英文摘要

The study of representations is widespread across fields, including neuroscience, psychology, and artificial intelligence. While representations are often studied and compared through similarities between stimuli, current methods provide only limited access to the dimensions that shape these representations and are often limited in interpretability. To overcome these challenges, here we introduce Similarity-Based Representation Factorization (SRF), a general computational method for recovering low-dimensional, non-negative, interpretable embeddings from similarity matrices derived from measured data. Across simulations and many neural, behavioral, and computational datasets, SRF recovers interpretable dimensions from diverse forms of representational data, even for very sparsely sampled, incomplete data. The dimensions derived from these datasets match those obtained by task-specific models, predict independent behavioral properties, improve exploratory analysis, and offer higher power for confirmatory hypothesis testing than comparing similarity matrices. Together, these results establish SRF as a general-purpose method with broad applications for uncovering, understanding, and using the dimensions underlying representations.


2507.05169 2026-06-17 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Critique of World Model: A Generative Latent Prediction Architecture for World Modeling

世界模型批判：一种用于世界建模的生成式潜在预测架构

Eric Xing, Mingkai Deng, Jinyu Hou

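AI summary: Starting from the notion of "hypothetical thinking" in psychology, argues that the primary goal of a world model is to simulate all actionable possibilities of the real world, and proposes a Generative Latent Prediction (GLP) architecture based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations.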


AI中文摘要

世界模型，即生物智能体所经历并对其采取行动的真实世界环境的算法模拟器，近年来因开发具有人工（通用）智能的虚拟智能体的需求日益增长而成为一个新兴课题。关于世界模型究竟是什么、如何构建、如何使用以及如何评估，已有许多讨论。本文从著名科幻经典《沙丘》中的想象出发，并借鉴心理学文献中“假设性思维”的概念，论证世界模型的主要目标是模拟真实世界中所有可行动的可能性，以进行有目的的推理和行动。我们审视了世界建模的关键设计维度：数据、表示、架构、学习目标和使用，调查了现有方法并分析了它们的权衡。在此基础上，我们提出了一种新的通用世界模型生成式潜在预测（GLP）架构，基于有状态的、分层的、多层次的、混合连续/离散表示，以及生成式和自监督学习框架，并展望了由这种模型支持的物理、智能体和嵌套（PAN）AGI系统。

英文摘要

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of "hypothetical thinking" in psychology literature, we argue the primary goal of a world model to be simulating all actionable possibilities of the real world for purposeful reasoning and acting. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.


2601.06862 2026-06-17 cs.CR cs.CV cs.LG cs.MM eess.IV 版本更新

Learning QoE from Packet-Level Measurements in Encrypted Video Conferencing Traffic

从加密视频会议流量的数据包级别测量中学习QoE

Michael Sidorov, Ofer Hadar

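AI summary: Because ISPs cannot access encrypted content to assess QoE, proposes a CNN-based framework that predicts BRISQUE and MOS from packet sizes alone and outperforms previous models on datasets collected from WhatsApp and Zoom.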


AI中文摘要

用户体验质量已成为当今世界最重要的方面之一，因为它直接影响个人继续使用或放弃产品或服务的意愿。在此背景下，视频会议应用（VCAs）在COVID-19大流行后得到广泛采用，必须在日益拥挤的市场中提供卓越性能以保持竞争力。尽管内容提供商（CPs）如Zoom、WhatsApp、Telegram和Google Meet可以通过比较发送和接收的数据来评估通话质量，但VCAs中广泛使用的端到端加密使得互联网服务提供商（ISPs）评估体验质量（QoE）变得更加困难。由于ISPs无法访问加密内容，他们必须依赖对数据路径上未加密流量特征的被动测量。在这项工作中，我们提出了一个简单而有效的QoE预测框架，基于几乎原生的卷积神经网络（CNN）架构，仅使用从视频会议（VC）通话中两个参与者之间的通信中提取的数据包大小来预测两个QoE指标：BRISQUE和MOS。所提出的框架简单、易于实现，且不需要高端计算资源，但提供了优越的预测性能，正如我们在从WhatsApp和Zoom收集的两个自定义数据集上的实验所示，这些实验在QoE预测任务上比先前模型取得了显著改进。

英文摘要

The quality of the user experience has become one of the most important aspects in today's world, as it directly influences individuals' willingness to continue using or abandon a product or service. In this context, video conferencing applications (VCAs), which experienced widespread adoption following the COVID-19 pandemic, must deliver excellent performance to remain competitive in an increasingly crowded market. Although content providers (CPs) such as Zoom, WhatsApp, Telegram, and Google Meet can assess conversation quality by comparing transmitted and received data, the widespread use of end-to-end encryption in VCAs makes quality-of-experience (QoE) evaluation by internet service providers (ISPs) far more challenging. Since ISPs do not have access to the encrypted content, they must rely on passive measurements of unencrypted traffic characteristics on the data path. In this work, we present a simple yet effective QoE prediction framework based on an almost stock convolutional neural network (CNN) architecture that uses only the packet sizes extracted from the communication between two participants in a video conferencing (VC) call to predict two QoE metrics: BRISQUE and MOS. The proposed framework is simple, easy to implement, and does not require high-end computational resources, yet it provides superior prediction performance, as shown in our experiments on two custom datasets collected from WhatsApp and Zoom, which achieve substantial improvements over previous models for the QoE prediction task.


2507.15777 2026-06-17 cs.CV 版本更新

Label tree semantic losses for rich multi-class medical image segmentation

用于丰富多类医学图像分割的标签树语义损失

Junwen Wang, Oscar MacCormac, William Rochford, Aaron Kujawa, Jonathan Shapey, Tom Vercauteren

发表机构 * School of Biomedical Engineering & Imaging Sciences（生物医学工程与成像科学学院）； Department of Neurosurgery（神经外科部门）

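AI summary: Proposes two tree-based semantic loss functions that exploit the label hierarchy, with consistent improvements on fully supervised whole-brain MRI parcellation and on sparsely annotated neurosurgical hyperspectral scene understanding.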


DOI: 10.3389/frai.2026.1841639

AI中文摘要

丰富且准确的医学图像分割有望通过描绘术前规划的关键解剖结构、指导实时术中导航和支持精确术后评估，为下一代AI定义的临床实践奠定基础。然而，医学和外科成像分割任务中常用的学习方法对所有错误一视同仁，未能利用标签空间中的任何类间语义。随着标签基数和丰富度的增加以包含细微不同的类别，这一问题变得尤为突出。在这项工作中，我们提出了两种基于树的语义损失函数，利用标签的层次组织。我们进一步将我们的损失纳入最近提出的用于稀疏、无背景标注的训练方法中，以扩展所提出损失的适用性。在两个医学和外科成像分割任务上进行了大量实验，即全监督的头部MRI全脑分割和稀疏标注的神经外科高光谱成像场景理解。结果表明，在评估的任务特定基线上取得了一致的改进，其中基于Wasserstein的复合损失在全脑分割中支持最强，而层次加权顶层监督在稀疏HSI设置中表现最佳。

英文摘要

Rich and accurate medical image segmentation is poised to underpin the next generation of AI-defined clinical practice by delineating critical anatomy for pre-operative planning, guiding real-time intra-operative navigation, and supporting precise post-operative assessment. However, commonly used learning methods for medical and surgical imaging segmentation tasks penalise all errors equivalently and thus fail to exploit any inter-class semantics in the label space. This becomes particularly problematic as the cardinality and richness of labels increases to include subtly different classes. In this work, we propose two tree-based semantic loss functions which take advantage of a hierarchical organisation of the labels. We further incorporate our losses in a recently proposed approach for training with sparse, background-free annotations to extend the applicability of our proposed losses. Extensive experiments are reported on two medical and surgical imaging segmentation tasks, namely head MRI for whole brain parcellation with full supervision and neurosurgical hyperspectral imaging for scene understanding with sparse annotations. Results demonstrate consistent improvements over the evaluated task-specific baselines, with the strongest support for the Wasserstein-based compound loss in whole-brain parcellation and for hierarchy-weighted top-level supervision in the sparse HSI setting.
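To make the idea of not penalising all errors equally concrete, the sketch below scores a prediction by the tree distance between predicted and true labels. It is an illustration only and not the paper's loss functions: the label tree `PARENT` and all function names are hypothetical. It relies on one standard fact, namely that when the target is one-hot, the Wasserstein-1 distance between the predicted distribution and the target reduces to the expected ground distance to the target label.

```python
# Hypothetical label tree for illustration: child -> parent (root has parent None).
PARENT = {
    "root": None,
    "tissue": "root",
    "fluid": "root",
    "grey_matter": "tissue",
    "white_matter": "tissue",
    "csf": "fluid",
}


def path_to_root(label):
    """Return the list of nodes from `label` up to the root."""
    path = []
    while label is not None:
        path.append(label)
        label = PARENT[label]
    return path


def tree_distance(a, b):
    """Number of edges on the path between two nodes of the tree."""
    depth_a = {node: d for d, node in enumerate(path_to_root(a))}
    for d_b, node in enumerate(path_to_root(b)):
        if node in depth_a:           # first common ancestor
            return depth_a[node] + d_b
    raise ValueError("labels are not in the same tree")


def expected_tree_cost(probs, target):
    """Wasserstein-1 distance between `probs` and a one-hot `target`.

    With a one-hot target all predicted mass must move to the target leaf,
    so the distance reduces to the expected tree distance.
    """
    return sum(p * tree_distance(label, target) for label, p in probs.items())
```

Under this cost, confusing grey matter with white matter (siblings, distance 2) is penalised less than confusing it with CSF (distance 4), whereas a plain cross-entropy loss treats both errors the same way.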


2605.12220 2026-06-17 cs.CV cs.AI cs.LG cs.RO 版本更新

TriBand-BEV: Real-Time LiDAR-Only 3D Pedestrian Detection via Height-Aware BEV and High-Resolution Feature Fusion

TriBand-BEV：基于高度感知的鸟瞰图与高分辨率特征融合的实时仅LiDAR三维行人检测

Mohammad Khoshkdahan, Alexey Vinel

发表机构 * Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）

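AI summary: Proposes TriBand-BEV, a real-time LiDAR-only 3D detector that maps the point cloud to a lightweight bird's-eye-view tensor with three height bands, fuses high-resolution features, and detects cars, pedestrians, and cyclists in a single pass, surpassing Complex-YOLO in pedestrian BEV AP at 49 FPS on a single consumer GPU.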

Comments Accepted for publication in the Proceedings of the 2026 International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)


DOI: 10.65109/INST9866
Journal ref: Proceedings of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

AI中文摘要

安全的自主智能体和移动机器人需要快速的实时三维感知，尤其是针对行人等弱势道路使用者（VRUs）。我们提出一种新的鸟瞰图（BEV）编码方法，将完整的三维LiDAR点云映射为具有三个高度带的轻量级二维BEV张量。我们明确地将三维检测重新表述为二维检测问题，然后从BEV输出中重建三维框。单个网络在一次前向传播中检测汽车、行人和骑行者。骨干网络在深层阶段使用区域注意力，覆盖P1到P4的层次化双向颈部网络融合上下文与细节，检测头使用针对边偏移的分布焦点学习和旋转IoU损失来预测定向框。训练时在通道空间中施加小幅垂直重新分箱和轻微的反射率抖动，以抑制记忆化。在三维重建过程中，我们使用四分位距（IQR）滤波器去除噪声和离群的LiDAR点。在KITTI数据集上，TriBand-BEV在单个消费级GPU上以49 FPS的速度运行，在简单、中等和困难样本上的行人BEV AP分别达到58.7/52.6/47.2%，超过Complex-YOLO，分别提升+12.6%、+7.5%和+3.1%。定性场景显示其在遮挡下检测稳定。该流程紧凑，可用于实时机器人部署。我们的源代码已在GitHub上公开。

英文摘要

Safe autonomous agents and mobile robots need fast real-time 3D perception, especially for vulnerable road users (VRUs) such as pedestrians. We introduce a new bird's eye view (BEV) encoding, which maps the full 3D LiDAR point cloud into a light-weight 2D BEV tensor with three height bands. We explicitly reformulate 3D detection as a 2D detection problem and then reconstruct 3D boxes from the BEV outputs. A single network detects cars, pedestrians, and cyclists in one pass. The backbone uses area attention at deep stages, a hierarchical bidirectional neck over P1 to P4 fuses context and detail, and the head predicts oriented boxes with distribution focal learning for side offsets and a rotated IoU loss. Training applies a small vertical re-binning and a mild reflectance jitter in channel space to resist memorization. We use an interquartile range (IQR) filter to remove noisy and outlier LiDAR points during 3D reconstruction. On the KITTI dataset, TriBand-BEV attains 58.7/52.6/47.2 pedestrian BEV AP(%) for easy, moderate, and hard at 49 FPS on a single consumer GPU, surpassing Complex-YOLO, with gains of +12.6%, +7.5%, and +3.1%. Qualitative scenes show stable detection under occlusion. The pipeline is compact and ready for real-time robotic deployment. Our source code is publicly available on GitHub.
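Two of the steps named in the abstract, the interquartile range (IQR) filter and the assignment of points to three height bands, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the multiplier `k=1.5` is the conventional Tukey choice rather than a value from the paper, and the band edges in `height_band` are hypothetical.

```python
import statistics


def iqr_filter(values, k=1.5):
    """Keep only values inside [Q1 - k*IQR, Q3 + k*IQR].

    k=1.5 is the conventional default, assumed here for illustration.
    """
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    lo, hi = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if lo <= v <= hi]


def height_band(z, edges=(-2.0, -1.0, 0.0, 1.0)):
    """Return the index (0, 1, or 2) of the height band containing z.

    `edges` are hypothetical band boundaries in metres; points outside the
    covered range return None and would be dropped from the BEV tensor.
    """
    for band in range(len(edges) - 1):
        if edges[band] <= z < edges[band + 1]:
            return band
    return None
```

Applied to the heights of the points inside a candidate box, the filter discards isolated returns far from the bulk of the points before the 3D box extent is estimated.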


2604.13662 2026-06-17 cond-mat.mes-hall cs.CV cs.LG 版本更新

Automatic Charge State Tuning of 300 mm FDSOI Quantum Dots Using Neural Network Segmentation of Charge Stability Diagram

300毫米FDSOI量子点自动电荷状态调节：基于神经网络的电荷稳定性图分割

Peter Samaha, Amine Torki, Ysaline Renaud, Sam Fiette, Emmanuel Chanrion, Pierre-Andre Mortemousque, Yann Beilliard

发表机构 * CEA-Leti, Univ. Grenoble Alpes（格勒诺布尔-阿尔卑斯大学）

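AI summary: Presents a deep-learning semantic segmentation pipeline that auto-tunes quantum dot charge states by locating transition lines in charge stability diagrams, a step toward high-throughput charge tuning of silicon quantum dot qubits.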

Comments 10 pages, 6 figures, supplementary materials available


DOI: 10.1088/2632-2153/ae7cda

AI中文摘要

栅极定义的半导体量子点（QDs）的调节是扩展自旋量子比特技术的主要瓶颈。我们提出了一种由深度学习（DL）驱动的语义分割流程，通过在完整的电荷稳定性图（CSDs）中定位跃迁线来实现电荷自动调节，并返回单电荷区间的栅极电压目标。我们汇集并手动标注了一个大型异构数据集，包含1015幅从硅量子点器件上实验测得的CSD，涵盖九种设计几何结构、多个晶圆和制造批次。一个采用MobileNetV2编码器的U-Net风格卷积神经网络（CNN）通过五折分组交叉验证进行训练和验证。我们的模型在定位单电荷区间方面实现了80.0%的总体离线调节成功率，部分设计的峰值性能超过88%。我们分析了主要的失败模式并提出了针对性的缓解措施。最后，宽范围图分割还自然地支持可扩展的基于物理的特征提取，其结果可以反馈到制造和设计流程中；我们还概述了在低温晶圆探针台中实现实时集成的路线图。总体而言，我们的结果表明，基于神经网络（NN）的宽范围图分割是实现硅量子点量子比特自动化、高通量电荷调节的可行一步。

英文摘要

Tuning of gate-defined semiconductor quantum dots (QDs) is a major bottleneck for scaling spin qubit technologies. We present a deep learning (DL) driven, semantic-segmentation pipeline that performs charge auto-tuning by locating transition lines in full charge stability diagrams (CSDs) and returns gate voltage targets for the single charge regime. We assemble and manually annotate a large, heterogeneous dataset of 1015 experimental CSDs measured from silicon QD devices, spanning nine design geometries, multiple wafers, and fabrication runs. A U-Net style convolutional neural network (CNN) with a MobileNetV2 encoder is trained and validated through five-fold group cross validation. Our model achieves an overall offline tuning success of 80.0% in locating the single-charge regime, with peak performance exceeding 88% for some designs. We analyze dominant failure modes and propose targeted mitigations. Finally, wide-range diagram segmentation also naturally enables scalable physics-based feature extraction that can feed back to fabrication and design workflows, and we outline a roadmap for real-time integration in a cryogenic wafer prober. Overall, our results show that neural network (NN) based wide-diagram segmentation is a practical step toward automated, high-throughput charge tuning for silicon QD qubits.
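The five-fold group cross validation mentioned in the abstract keeps all samples from one group in the same fold, so that validation data never shares a group with training data. The sketch below is a generic, dependency-free illustration of that splitting rule; what defines a group (device, wafer, or design) is not stated in the abstract, and the function name `group_k_fold` and its greedy balancing strategy are assumptions.

```python
from collections import defaultdict


def group_k_fold(groups, k=5):
    """Yield (train_idx, val_idx) pairs where no group spans both sets.

    groups[i] is the group id of sample i (hypothetically a device or
    wafer id). Whole groups are assigned greedily to the smallest fold.
    """
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    folds = [[] for _ in range(k)]
    # Largest groups first, so fold sizes end up roughly balanced.
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        min(folds, key=len).extend(by_group[g])
    for i in range(k):
        val = sorted(folds[i])
        train = sorted(j for f in folds[:i] + folds[i + 1:] for j in f)
        yield train, val
```

Splitting by group rather than by sample avoids an optimistic bias when diagrams from the same group are strongly correlated.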


2506.07917 2026-06-17 cs.GR cs.CV 版本更新

SpeeDe3DGS: Speedy Deformable 3D Gaussian Splatting with Temporal Pruning and Motion Grouping

SpeeDe3DGS：通过时间剪枝和运动分组实现快速可变形3D高斯泼溅

Allen Tu, Haiyang Ying, Alex Hanson, Yonghan Lee, Tom Goldstein, Matthias Zwicker

发表机构 * University of Maryland, College Park（马里兰大学学院公园分校）

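AI summary: Proposes SpeeDe3DGS, which uses Temporal Sensitivity Pruning, Temporal Sensitivity Sampling, and the GroupFlow module to substantially speed up rendering and training of deformable 3DGS while preserving reconstruction quality.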

Comments Project Page: https://speede3dgs.github.io/


Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 26083-26093

AI中文摘要

3D高斯泼溅（3DGS）的动态扩展通过神经运动场实现高质量重建，但逐高斯的神经网络推理使这些模型计算成本高昂。基于DeformableGS，我们提出快速可变形3D高斯泼溅（SpeeDe3DGS），通过三个互补模块弥合效率与保真度之间的差距：时间敏感性剪枝（TSP）通过时间聚合的敏感性分析移除低影响高斯；时间敏感性采样（TSS）扰动时间戳以抑制漂浮物并提高时间一致性；GroupFlow将学习到的变形场蒸馏为共享的SE(3)变换，以实现高效的分组运动。在MonoDyGauBench的50个动态场景上，将TSP和TSS整合到DeformableGS中，可使渲染速度平均提升6.78倍，同时保持神经场的保真度，并将图元数量减少10倍。加入GroupFlow后，渲染速度最终提升至13.71倍，训练时间缩短2.53倍，在速度上超越所有基线，同时保持优越的图像质量。

英文摘要

Dynamic extensions of 3D Gaussian Splatting (3DGS) achieve high-quality reconstructions through neural motion fields, but per-Gaussian neural inference makes these models computationally expensive. Building on DeformableGS, we introduce Speedy Deformable 3D Gaussian Splatting (SpeeDe3DGS), which bridges this efficiency-fidelity gap through three complementary modules: Temporal Sensitivity Pruning (TSP) removes low-impact Gaussians via temporally aggregated sensitivity analysis, Temporal Sensitivity Sampling (TSS) perturbs timestamps to suppress floaters and improve temporal coherence, and GroupFlow distills the learned deformation field into shared SE(3) transformations for efficient groupwise motion. On the 50 dynamic scenes in MonoDyGauBench, integrating TSP and TSS into DeformableGS accelerates rendering by 6.78× on average while maintaining neural-field fidelity and using 10× fewer primitives. Adding GroupFlow culminates in 13.71× faster rendering and 2.53× shorter training, surpassing all baselines in speed while preserving superior image quality.


2603.19801 2026-06-17 eess.IV cs.AI cs.CV 版本更新

Offshore oil and gas platform dynamics in the North Sea, Gulf of Mexico, and Persian Gulf: Exploiting the Sentinel-1 archive

北海、墨西哥湾和波斯湾的海上石油和天然气平台动态：利用Sentinel-1档案

Robin Spanier, Thorsten Hoeser, John Truckenbrodt, Felix Bachofer, Claudia Kuenzer

发表机构 * German Remote Sensing Data Center, Earth Observation Center, EOC of the German Aerospace Center, DLR（德国航空航天中心（DLR）地球观测中心（EOC）德国遥感数据中心）； Institute for Geography and Geology, Department of Remote Sensing, University of Würzburg（维尔茨堡大学地理与地质研究所遥感系）

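AI summary: Uses Sentinel-1 data and deep learning to track offshore platform dynamics in the North Sea, Gulf of Mexico, and Persian Gulf, revealing changes in platform numbers and structural shifts in the sector, and provides data for monitoring marine infrastructure.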

Comments 16 pages, 10 figures, 1 table


DOI: 10.1080/20964471.2026.2679328
Journal ref: Big Earth Data, 2026, 1-27

AI中文摘要

随着海上基础设施的增加，对持续、可扩展的监测需求日益增长。本文提出了一种基于免费地球观测数据的自动化方法，利用Sentinel-1档案数据和深度学习目标检测技术，构建了2017-2025年间北海、墨西哥湾和波斯湾的季度平台位置时间序列。此外，还推导了平台大小、水深、海岸距离、国家归属及安装和退役日期等信息。2025年识别出3728个海上平台，其中北海有356个，墨西哥湾有1641个，波斯湾有1731个。尽管波斯湾平台数量在2024年前持续增长，但墨西哥湾和北海的平台数量在2018-2020年间有所下降。同时，超过2700个平台被安装或迁移到新地点，同时有相当数量被退役或迁移。此外，平台寿命缩短的趋势表明，海上行业正经历结构性变化，与移动海上单位如钻探平台的重要性增长有关。研究结果展示了免费地球观测数据和深度学习在持续、长期监测海洋基础设施中的潜力。所推导的数据集是公开的，为海上监测、海洋规划及海上能源行业转型分析提供了基础。

英文摘要

The increasing use of marine spaces by offshore infrastructure, including oil and gas platforms, underscores the need for consistent, scalable monitoring. Offshore development has economic, environmental, and regulatory implications, yet maritime areas remain difficult to monitor systematically due to their inaccessibility and spatial extent. This study presents an automated approach to the spatiotemporal detection of offshore oil and gas platforms based on freely available Earth observation data. Leveraging Sentinel-1 archive data and deep learning-based object detection, a consistent quarterly time series of platform locations for three major production regions: the North Sea, the Gulf of Mexico, and the Persian Gulf, was created for the period 2017-2025. In addition, platform size, water depth, distance to the coast, national affiliation, and installation and decommissioning dates were derived. 3,728 offshore platforms were identified in 2025, 356 in the North Sea, 1,641 in the Gulf of Mexico, and 1,731 in the Persian Gulf. While expansion was observed in the Persian Gulf until 2024, the Gulf of Mexico and the North Sea saw a decline in platform numbers from 2018-2020. At the same time, a pronounced dynamic was apparent. More than 2,700 platforms were installed or relocated to new sites, while a comparable number were decommissioned or relocated. Furthermore, the increasing number of platforms with short lifespans points to a structural change in the offshore sector associated with the growing importance of mobile offshore units such as jack-ups or drillships. The results highlighted the potential of freely available Earth observation data and deep learning for consistent, long-term monitoring of marine infrastructure. The derived dataset is public and provides a basis for offshore monitoring, maritime planning, and analyses of the transformation of the offshore energy sector.

2505.12620 2026-06-17 cs.CV Version update

BusterX: MLLM-Powered AI-Generated Video Forgery Detection and Explanation

Haiquan Wen, Yiwei He, Zhenglin Huang, Tianxiao Li, Zihan Yu, Xingru Huang, Lu Qi, Baoyuan Wu, Xiangtai Li, Guangliang Cheng

Affiliations: University of Liverpool, UK; Nanyang Technological University, SG; The Chinese University of Hong Kong, Shenzhen, Guangdong, China; Wuhan University; Hangzhou Dianzi University

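AI summary: This paper proposes BusterX, a video forgery detection system based on multimodal large language models, and improves detection accuracy and explanation quality through a new dataset and evaluation benchmark.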

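AI-translated abstract

As generative video models become increasingly realistic, detecting AI-generated videos requires systems that are both accurate and interpretable. However, applying multimodal large language models (MLLMs) to video forensics is currently limited by outdated datasets, simplistic evaluation protocols, and a reliance on black-box classification. To address these issues, we introduce a comprehensive dataset, benchmark, and baseline model for video forgery detection. First, we present GenBuster-200K, a fair dataset of over 200,000 high-quality videos from state-of-the-art generators covering diverse real-world scenarios. Second, we propose GenBuster-Bench, a diagnostic benchmark with three progressive tracks (in-domain, out-of-domain, and in-the-wild) for evaluating models under domain shifts and generational shifts. It also introduces an MLLM-as-a-Judge protocol for assessing the quality of generated forensic explanations. Finally, we develop BusterX, an MLLM baseline trained with RL. Rather than performing direct binary classification, BusterX treats detection as a visual reasoning task in which the generated reasoning chain itself serves as the detector. Experiments show that BusterX outperforms several leading MLLMs (e.g., Qwen3.5, Claude-Sonnet-4.6) in both detection accuracy and reasoning quality.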

English abstract

As generative video models become increasingly realistic, detecting AI-generated videos requires systems that offer both accuracy and interpretability. However, applying Multimodal Large Language Models (MLLMs) to video forensics is currently limited by outdated datasets, simplistic evaluation protocols, and a reliance on black-box classification. To address these issues, we introduce a comprehensive dataset, benchmark, and baseline model for video forgery detection. First, we present GenBuster-200K, a fair dataset of over 200,000 high-quality videos sourced from state-of-the-art generators, featuring diverse real-world scenarios. Second, we propose GenBuster-Bench, a diagnostic benchmark spanning three progressive tracks (In-Domain, Out-of-Domain, and In-the-Wild) to evaluate models across domain shifts and generational shifts. It also introduces an MLLM-as-a-Judge protocol to assess the quality of the generated forensic explanations. Finally, we develop BusterX, an MLLM baseline with RL training. Instead of direct binary classification, BusterX formulates detection as a visual reasoning task, where the generated reasoning chain itself serves as the detector. Experimental results demonstrate that BusterX outperforms several leading MLLMs (e.g., Qwen3.5, Claude-Sonnet-4.6) in both detection accuracy and rationale quality.

2508.04492 2026-06-17 cs.CV cs.AI Version update

Learning Robust Intervention Representations with Delta Embeddings

Panagiotis Alimisis, Christos Diou

Affiliations: Department of Informatics and Telematics

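AI summary: This paper proposes improving model robustness through representations of actionable counterfactuals in the latent space and introduces Causal Delta Embeddings, a method that learns causal representations without additional supervision. Experiments show strong performance on both synthetic and real-world benchmarks.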

Comments ICLR 2026, Poster

Journal ref: International Conference on Learning Representations (ICLR), 2026

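AI-translated abstract

Causal representation learning has attracted wide attention in recent years as a means of improving model generalization and robustness. Representations of causally interventional image pairs (also called "actionable counterfactuals") have the property that only the scene variables affected by the intervention or action change between the start state and the end state. While most work has focused on identifying and representing scene variables under a causal model, less attention has gone to representing the interventions themselves. This paper shows that focusing on the representation of actionable counterfactuals in the latent space effectively improves out-of-distribution robustness. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in the causal variables it affects. Building on this insight, we propose a method for learning causal representations without additional supervision. Experiments on the Causal Triplet challenge show that Causal Delta Embeddings excel in out-of-distribution settings, significantly exceeding baseline performance on both synthetic and real-world benchmarks.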

English abstract

Causal representation learning has attracted significant research interest during the past few years as a means of improving model generalization and robustness. Causal representations of interventional image pairs (also called "actionable counterfactuals" in the literature) have the property that only variables corresponding to scene elements affected by the intervention or action change between the start state and the end state. While most work in this area has focused on identifying and representing the variables of the scene under a causal model, fewer efforts have focused on representations of the interventions themselves. In this work, we show that an effective strategy for improving out-of-distribution (OOD) robustness is to focus on the representation of actionable counterfactuals in the latent space. Specifically, we propose that an intervention can be represented by a Causal Delta Embedding that is invariant to the visual scene and sparse in terms of the causal variables it affects. Leveraging this insight, we propose a method for learning causal representations from image pairs, without any additional supervision. Experiments in the Causal Triplet challenge demonstrate that Causal Delta Embeddings are highly effective in OOD settings, significantly exceeding baseline performance in both synthetic and real-world benchmarks.
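
To make the two stated properties (scene invariance and sparsity) concrete, here is a toy sketch. It assumes the delta embedding is simply the difference between the latent codes of the "after" and "before" images; the abstract does not spell out the construction, and the latent vectors below are invented numbers, not encoder outputs.

```python
from typing import List, Sequence


def delta_embedding(z_before: Sequence[float], z_after: Sequence[float]) -> List[float]:
    """Difference of latent codes (assumed form of a delta embedding)."""
    return [after - before for before, after in zip(z_before, z_after)]


def active_dimensions(delta: Sequence[float], tol: float = 1e-6) -> List[int]:
    """Indices of latent dimensions that the intervention actually changed."""
    return [i for i, value in enumerate(delta) if abs(value) > tol]


# Two different scenes, the same hypothetical intervention acting on dimension 2.
scene_a = ([0.5, 1.0, 0.0, -0.3], [0.5, 1.0, 1.0, -0.3])
scene_b = ([-1.2, 0.4, 0.25, 0.9], [-1.2, 0.4, 1.25, 0.9])

delta_a = delta_embedding(*scene_a)
delta_b = delta_embedding(*scene_b)

# Scene invariance: the same action gives the same delta in both scenes.
# Sparsity: only one latent dimension is touched.
same_delta = delta_a == delta_b
touched = active_dimensions(delta_a)
```

In this idealized case the delta is identical across scenes and has a single non-zero entry; a learned representation would satisfy both properties only approximately.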

2602.13318 2026-06-17 cs.AI cs.CV cs.LG Version update

DECKBench: Benchmarking Multi-Agent Frameworks for Academic Slide Generation and Editing

Daesik Jang, Morgan Lindsay Heisler, Linzi Xing, Yifei Li, Edward Wang, Ying Xiong, Yong Zhang, Zhenan Fan

Affiliations: Huawei Technologies Canada; University of British Columbia

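AI summary: This paper proposes DECKBench, a framework for evaluating multi-agent generation and editing of academic slides. Using a curated dataset and simulated editing instructions, it systematically assesses slide-level and deck-level fidelity, coherence, layout quality, and multi-turn instruction following.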

DOI: 10.1145/3770855.3817525

English abstract

Automatically generating and iteratively editing academic slide decks requires more than document summarization. It demands faithful content selection, coherent slide organization, layout-aware rendering, and robust multi-turn instruction following. However, existing benchmarks and evaluation protocols do not adequately measure these challenges. To address this gap, we introduce the Deck Edits and Compliance Kit Benchmark (DECKBench), an evaluation framework for multi-agent slide generation and editing. DECKBench is built on a curated dataset of paper-to-slide pairs augmented with realistic, simulated editing instructions. Our evaluation protocol systematically assesses slide-level and deck-level fidelity, coherence, layout quality, and multi-turn instruction following. We further implement a modular multi-agent baseline system that decomposes the slide generation and editing task into paper parsing and summarization, slide planning, HTML creation, and iterative editing. Experimental results demonstrate that the proposed benchmark highlights strengths, exposes failure modes, and provides actionable insights for improving multi-agent slide generation and editing systems. Overall, this work establishes a standardized foundation for reproducible and comparable evaluation of academic presentation generation and editing. Code and data are publicly available at https://github.com/morgan-heisler/DeckBench

2601.17053 2026-06-17 cs.CV Version update

Synthetic Data Guided Feature Selection for Robust Activity Recognition in Older Adults

Shuhao Que, Dieuwke van Dartel, Ilse Heeringa, Han Hegeman, Miriam Vollenbroek-Hutten, Ying Wang

Affiliations: University of Twente; Ziekenhuis Groep Twente; Medisch Spectrum Twente

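AI summary: This study develops a robust human activity recognition system that uses synthetic data to make continuous activity recognition in the context of hip fracture rehabilitation in older adults more reliable, with notable gains on postural transfers, an activity class of high clinical relevance.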

Comments This paper has been submitted to Nordic Conference on Digital Health and Wireless Solutions 2026, currently under review

DOI: 10.1007/978-3-032-28819-6_9

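AI-translated abstract

Physical activity during hip fracture rehabilitation is essential for mitigating long-term functional decline in geriatric patients, yet it is rarely quantified in clinical practice. Existing continuous monitoring systems are typically developed for middle-aged adults and therefore perform unreliably for older adults, whose gait is slower and more variable. This study aimed to develop a robust human activity recognition (HAR) system to improve continuous physical activity recognition during hip fracture rehabilitation. Twenty-four healthy adults aged over 80 performed activities of daily living (walking, standing, sitting, lying down, and postural transfers) for 75 minutes under simulated free-living conditions while wearing two accelerometers, one on the lower back and one on the anterior upper thigh. Model robustness was evaluated with leave-one-subject-out cross-validation. The synthetic data showed potential to improve generalization across participants. The resulting feature intervention model (FIM), guided by synthetic data, achieved reliable activity recognition, with mean F1-scores of 0.896 for walking, 0.927 for standing, 0.997 for sitting, 0.937 for lying down, and 0.816 for postural transfers. Compared with a control model without synthetic data, the FIM significantly improved detection of postural transfers, an activity class of high clinical relevance that is often overlooked in the HAR literature. In conclusion, these preliminary results demonstrate the feasibility of robust activity recognition in older adults. Further validation in hip fracture patients is needed to assess the clinical utility of the proposed monitoring system.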

English abstract

Physical activity during hip fracture rehabilitation is essential for mitigating long-term functional decline in geriatric patients. However, it is rarely quantified in clinical practice. Existing continuous monitoring systems with commercially available wearable activity trackers are typically developed in middle-aged adults and therefore perform unreliably in older adults with slower and more variable gait patterns. This study aimed to develop a robust human activity recognition (HAR) system to improve continuous physical activity recognition in the context of hip fracture rehabilitation. Twenty-four healthy older adults aged over 80 years performed activities of daily living (walking, standing, sitting, lying down, and postural transfers) under simulated free-living conditions for 75 minutes while wearing two accelerometers positioned on the lower back and anterior upper thigh. Model robustness was evaluated using leave-one-subject-out cross-validation. The synthetic data demonstrated potential to improve generalization across participants. The resulting feature intervention model (FIM), aided by synthetic data guidance, achieved reliable activity recognition with mean F1-scores of 0.896 for walking, 0.927 for standing, 0.997 for sitting, 0.937 for lying down, and 0.816 for postural transfers. Compared with a control condition model without synthetic data, the FIM significantly improved postural transfer detection, an activity class of high clinical relevance that is often overlooked in existing HAR literature. In conclusion, these preliminary results demonstrate the feasibility of robust activity recognition in older adults. Further validation in hip fracture patient populations is required to assess the clinical utility of the proposed monitoring system.
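
The evaluation protocol named in the abstract, leave-one-subject-out cross-validation with per-class F1-scores, can be sketched in a few lines of plain Python. The subject IDs and labels below are invented for illustration, and no classifier is trained here; the sketch only shows how folds are formed and how the metric is computed.

```python
from typing import Dict, Iterator, List, Sequence, Tuple


def leave_one_subject_out(
    subjects: Sequence[str],
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices), holding out one subject per fold."""
    for held_out in sorted(set(subjects)):
        train = [i for i, s in enumerate(subjects) if s != held_out]
        test = [i for i, s in enumerate(subjects) if s == held_out]
        yield train, test


def f1_per_class(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Per-class F1 = 2*TP / (2*TP + FP + FN); 0.0 when the class never occurs."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


# Hypothetical window-level data: one subject ID per sample.
subjects = ["s1", "s1", "s2", "s2", "s3"]
folds = list(leave_one_subject_out(subjects))

# Hypothetical predictions for one held-out subject.
scores = f1_per_class(
    ["walk", "sit", "walk", "transfer"],
    ["walk", "sit", "sit", "transfer"],
)
```

Splitting by subject rather than by sample keeps all windows from one person on the same side of the split, so each fold tests on a participant the model has not seen.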

2511.03876 2026-06-17 eess.IV cs.CV cs.LG physics.med-ph Version update

Computed Tomography (CT)-derived Cardiovascular Flow Estimation Using Physics-Informed Neural Networks Improves with Sinogram-based Training: A Simulation Study

Jinyuxuan Guo, Gurnoor Singh Khurana, Alejandro Gonzalo Grande, Juan C. del Alamo, Francisco Contijoch

Affiliations: Dept. of Bioengineering, University of California San Diego; Dept. of Computer Science Engineering, University of California San Diego; Dept. of Mechanical Engineering, University of Washington; Depts. of Mechanical Engineering and Cardiology, University of Washington; Depts. of Bioengineering and Radiology, University of California San Diego

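AI summary: This study evaluates how CT imaging affects blood flow estimation based on physics-informed neural networks (PINNs) and proposes SinoFlow, an improved framework that estimates flow directly from sinogram data. SinoFlow performs better by avoiding the errors introduced by filtered backprojection.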

DOI: 10.1002/mp.70519

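AI-translated abstract

Background: Non-invasive, imaging-based assessment of blood flow plays a key role in evaluating cardiac function and structure. CT is a widely used imaging modality that can robustly assess cardiovascular anatomy and function, but methods for estimating blood flow velocity directly from movies of contrast evolution have not been developed. Purpose: This study evaluates the impact of CT imaging on flow estimation based on physics-informed neural networks (PINNs) and proposes an improved framework, SinoFlow, which estimates flow directly from sinogram data. Methods: We used computational fluid dynamics to generate pulsatile flow fields in an idealized 2D vessel bifurcation and simulated CT scans with varying gantry rotation speeds, tube currents, and pulsed-mode imaging settings. We compared PINN-based flow estimation from reconstructed images (ImageFlow) with SinoFlow. Results: SinoFlow significantly improved flow estimation by avoiding the errors introduced by filtered backprojection. It was robust across all tested gantry rotation speeds and consistently yielded lower mean squared error and velocity error than ImageFlow. SinoFlow was also compatible with pulsed-mode imaging and maintained higher accuracy at shorter pulse widths. Conclusions: This study demonstrates the potential of SinoFlow for CT-based flow estimation, offering a more promising approach to non-invasive blood flow assessment. The findings are intended to inform future applications of PINNs to CT images and provide a solution for image-based estimation in which reasonable acquisition parameters yield accurate flow estimates.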

English abstract

Background: Non-invasive imaging-based assessment of blood flow plays a critical role in evaluating heart function and structure. Computed Tomography (CT) is a widely-used imaging modality that can robustly evaluate cardiovascular anatomy and function, but direct methods to estimate blood flow velocity from movies of contrast evolution have not been developed. Purpose: This study evaluates the impact of CT imaging on Physics-Informed Neural Networks (PINN)-based flow estimation and proposes an improved framework, SinoFlow, which uses sinogram data directly to estimate blood flow. Methods: We generated pulsatile flow fields in an idealized 2D vessel bifurcation using computational fluid dynamics and simulated CT scans with varying gantry rotation speeds, tube currents, and pulse mode imaging settings. We compared the performance of PINN-based flow estimation using reconstructed images (ImageFlow) to SinoFlow. Results: SinoFlow significantly improved flow estimation performance by avoiding propagating errors introduced by filtered backprojection. SinoFlow was robust across all tested gantry rotation speeds and consistently produced lower mean squared error and velocity errors than ImageFlow. Additionally, SinoFlow was compatible with pulsed-mode imaging and maintained higher accuracy with shorter pulse widths. Conclusions: This study demonstrates the potential of SinoFlow for CT-based flow estimation, providing a more promising approach for non-invasive blood flow assessment. The findings aim to inform future applications of PINNs to CT images and provide a solution for image-based estimation, with reasonable acquisition parameters yielding accurate flow estimates.
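
The idea of fitting in the sinogram domain rather than on reconstructed images can be illustrated with a toy example. The sketch below assumes a drastically simplified parallel-beam geometry with only two view angles (0 and 90 degrees, i.e. row sums and column sums) and a plain mean-squared data-consistency term. It is not the SinoFlow forward model or loss, and it contains no PINN.

```python
from typing import List

Image = List[List[float]]


def project(image: Image) -> List[List[float]]:
    """Toy two-view sinogram: line integrals along rows and along columns."""
    row_sums = [sum(row) for row in image]
    col_sums = [sum(col) for col in zip(*image)]
    return [row_sums, col_sums]


def sinogram_mse(estimate: Image, measured: List[List[float]]) -> float:
    """Mean squared mismatch between projected estimate and measured sinogram."""
    predicted = project(estimate)
    diffs = [
        (p - m) ** 2
        for pred_view, meas_view in zip(predicted, measured)
        for p, m in zip(pred_view, meas_view)
    ]
    return sum(diffs) / len(diffs)


# Hypothetical 2x2 contrast distribution and its noiseless measurement.
truth = [[0.0, 1.0], [2.0, 0.0]]
measured = project(truth)

loss_at_truth = sinogram_mse(truth, measured)
loss_at_zero = sinogram_mse([[0.0, 0.0], [0.0, 0.0]], measured)
```

Because the estimate is compared with the measurements only after forward projection, no reconstruction step such as filtered backprojection sits between the data and the loss.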

2201.06574 2026-06-17 eess.IV cs.CV Version update

Neural Computed Tomography

Kunal Gupta, Brendan Colvert, Francisco Contijoch

Affiliations: University of California San Diego

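AI summary: This paper proposes NeuralCT, a framework that uses a neural implicit approach to generate time-resolved images free of motion artifacts, suited to complex motion such as that of the heart.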

Comments https://kunalmgupta.github.io/projects/NeuralCT.html

DOI: 10.1002/mp.16157

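AI-translated abstract

Motion during acquisition of a set of projections can cause significant motion artifacts in computed tomography reconstructions, even though individual views are acquired quickly. In cases such as cardiac imaging, motion may be unavoidable, and evaluating it is of clinical interest. Motion artifacts are typically reduced by developing systems with faster gantry rotation or by using algorithms that measure and/or estimate displacements. However, these approaches have had limited success because of physical constraints and the difficulty of estimating or measuring non-rigid, time-varying, patient-specific motion. We propose a new reconstruction framework, NeuralCT, to generate time-resolved images free of motion artifacts. Our approach uses a neural implicit method and does not require estimating or modeling the underlying motion. Instead, boundaries are represented with a signed distance metric and a neural implicit framework. We use "analysis-by-synthesis" to identify a solution that is consistent with the acquired sinogram and satisfies spatial and temporal consistency constraints. We demonstrate the utility of NeuralCT in three progressively more complex scenarios: translation of a small circle, heartbeat-like change in an ellipse's diameter, and complex topological deformation. Without hyperparameter tuning or architectural changes, NeuralCT provides high-quality image reconstruction for all three motions, as compared with filtered backprojection and measured by mean squared error and Dice metrics.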

English abstract

Motion during acquisition of a set of projections can lead to significant motion artifacts in computed tomography reconstructions despite fast acquisition of individual views. In cases such as cardiac imaging, motion may be unavoidable and evaluating motion may be of clinical interest. Reconstructing images with reduced motion artifacts has typically been achieved by developing systems with faster gantry rotation or using algorithms which measure and/or estimate the displacements. However, these approaches have had limited success due to both physical constraints and the challenge of estimating or measuring non-rigid, temporally varying, and patient-specific motions. We propose a novel reconstruction framework, NeuralCT, to generate time-resolved images free from motion artifacts. Our approach utilizes a neural implicit representation and does not require estimation or modeling of the underlying motion. Instead, boundaries are represented using a signed distance metric and neural implicit framework. We utilize "analysis-by-synthesis" to identify a solution consistent with the acquired sinogram as well as spatial and temporal consistency constraints. We illustrate the utility of NeuralCT in three progressively more complex scenarios: translation of a small circle, heartbeat-like change in an ellipse's diameter, and complex topological deformation. Without hyperparameter tuning or change to the architecture, NeuralCT provides high quality image reconstruction for all three motions, as compared to filtered backprojection, using mean-square-error and Dice metrics.
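
As a rough illustration of representing a boundary with a signed distance, the sketch below uses the closed-form signed distance of a circle and rasterizes it for a few time points, mimicking the "translating small circle" scenario. In NeuralCT the signed distance is produced by a neural implicit model; the analytic function, grid size, and circle parameters here are stand-in assumptions.

```python
import math
from typing import List


def circle_sdf(x: float, y: float, cx: float, cy: float, r: float) -> float:
    """Signed distance to a circle: negative inside, zero on the boundary."""
    return math.hypot(x - cx, y - cy) - r


def occupancy_grid(n: int, cx: float, cy: float, r: float) -> List[List[int]]:
    """Rasterize the shape on n x n pixel centers in the unit square (1 = inside)."""
    grid = []
    for row in range(n):
        y = (row + 0.5) / n
        grid.append(
            [1 if circle_sdf((col + 0.5) / n, y, cx, cy, r) < 0 else 0
             for col in range(n)]
        )
    return grid


# A small circle translating horizontally across three time points.
frames = [occupancy_grid(8, cx, 0.5, 0.2) for cx in (0.3, 0.5, 0.7)]
center_distance = circle_sdf(0.5, 0.5, 0.5, 0.5, 0.2)
```

The sign of the distance gives the object's interior at any continuous location, so the same representation can be queried at whatever time and resolution the forward projection needs.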

1. Multimodal and Vision-Language Models (12 papers)

Pareto LoRA: Mitigating Modality Imbalance in Unified Multimodal Models via Pareto-Optimal Gradient Integration

Visuals Lie, Consistency Speaks: Disentangling Spatial Attention from Reliability in Vision-Language Models

Attention Alignment Between Humans and Vision-Language Models

CIAN: Multi-Stage Framework for Event-Enriched Image Captioning via Retrieval-Augmented Generation

Reinforcing Dual-Path Reasoning in Spatial Vision Language Models

See First, Answer Later: Visual Evidence Pre-Alignment via Sufficiency-Driven RL

Plug-and-Adapt: Multimodal Coreference Resolution at First Sight with a Pretrained Alignment Model

Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification

MODE-RAG: Manifold Outlier Diagnosis and Energy-based Retrieval-Augmented Generation Evaluation

Disentangling Perception and Reasoning in Multimodal LLMs via Reward Design

Bridging Modality Disconnect in Self-Reflection via Closed-Loop Visually Grounded Verification

Last But Not Least: Boundary Attention CalibratiON for Multimodal KV Cache Compression

2. Embodied AI, Robotics, and Autonomous Driving (26 papers)

Beyond Benchmarks: Continuous Edge Inference for Fine-Grained Roadside Perception

DriveJudge: Rethinking Autonomous Driving Evaluation with Vision-Language Models

TerraTransfer: Learning End-to-End Driving Policies Without Expert Demonstrations

WeaveLA: Event Driven Cross-Subtask Latent Memory Weaving for Repetitive Robot Manipulation

GeneralVLA-2: Geometry-Aware Reconstruction and Governed Memory for Robot Planning

SPHINX: First Explain, Then Explore

OmniDrive: An LLM-Choreographed Multi-Agent World Model with Unified Latent Co-Compression for Multi-View Driving Video Generation

ActWorld: From Explorable to Interactive World Model via Action-Aware Memory

EventDrive: Event Cameras for Vision-Language Driving Intelligence

MOCHI: Motion Enhancement of Collaborative Human-object Interactions

Future Dynamic 3D Reconstruction: A 3D World Model with Disentangled Ego-Motion

Contrastive Action-Image Pre-training for Visuomotor Control

Contactless Respiratory Monitoring on Heterogeneous Mobile Robots: A Multimodal Edge-Computing Framework

Where Should Action Generation Begin? A Learnable Source Prior for Generative Robot Policies

AnnotateAnything: Automatic Annotation of 3D Assets for Robot Manipulation

MagicSim: A Unified Infrastructure for Executable Embodied Interaction

GASE: Gaussian Splatting-Based Automated System for Reconstructing Embodied-Simulation Environments

MuseVLA: An Adaptive Multimodal Sensing Vision-Language-Action Model for Robotic Manipulation

ED3R: Energy-Aware Distributed Disaster Detection Enabled by Cooperative Robotic Agents

Qwen-RobotNav Technical Report: A Scalable Navigation Model Designed for an Agentic Navigation System

DiffAttn: Diffusion-Based Drivers' Visual Attention Prediction with LLM-Enhanced Semantic Reasoning

SCC-Loc: A Unified Semantic Cascade Consensus Framework for UAV Thermal Geo-Localization

OpenTie: Open-vocabulary Sequential Rebar Tying System

AlignDrive: Aligned Lateral-Longitudinal Planning for End-to-End Autonomous Driving

Planning with the Views

Kairos: A Native World Model Stack for Physical AI

3. Image Recognition, Retrieval, and Classification (5 papers)

Quantum Enchanced Multi-Scale CNN with Bi-directional Mamba for Crop Field Analysis

Landsat-Sentinel-2 Algal Bloom Mapping Using Vision Transformers: Model Description, Implementation, and Examples

Graph Neural Networks for Semi-Supervised Image Classification with Multi-Feature Aggregation

UoU: A Universal Fingerprint Foundation Model Based on Large-Scale Unsupervised Learning

Partial Ring Scan: Revisiting Scan Order in Vision State Space Models

4. Object Detection, Segmentation, and Localization (11 papers)

FATE: Pillar Encoding and Frequency-Aware Training for Event-Based Object Detection

Improving and Evaluating Hand-Object Interaction Detection

StereoFactory: A Unified Merging Framework for Robust Stereo Matching

RT-Counter: Real-Time Text-Guided Open-Vocabulary Object Counting

MambaCount: Efficient Text-guided Open-vocabulary Object Counting with Spatial Sparse State Space Duality Block

Reload-Mamba: Hierarchical Anti-Dilution State-Space Modeling for Multi-Class Semantic Segmentation

A geometric and deep learning reproducible pipeline for monitoring floating anthropogenic debris in urban rivers using in situ cameras

Co-PLNet: A Collaborative Point-Line Network for Prompt-Guided Wireframe Parsing

GOT-JEPA: Generic Object Tracking with Model Adaptation and Occlusion Handling using Joint-Embedding Predictive Architecture

Clay-CNN Hybrids: Leveraging Geospatial Foundation Models as Auxiliary Context for Landslide Detection

GOOSE-M2F: Adapting Mask2Former for High-Fidelity, Long-Tailed Fine-Grained Semantic Segmentation in Unstructured Outdoor Terrain

5. Video Understanding and Temporal Vision (6 papers)

Training LLMs with Reinforcement Learning over Digital Twin Representations for Reasoning-Intensive Surgical VideoQA

Reasoning Text-to-Video Retrieval for Operating Room Clips via Action-Driven Digital Twins

SkillMoV: Mixture-of-View Routing with Prototype-Conditioned Gating for Unified Multi-View Proficiency Estimation

Divide, Deliberate, Decide: A Multi-Agent Framework for Fine-Grained Egocentric Action Recognition

LiveStarPro: Proactive Streaming Video Understanding with Hierarchical Memory for Long-Horizon Streams

ThinkJEPA: Empowering Latent World Models with Large Vision-Language Reasoning Model

6. Generative Vision and World Models (23 papers)

Pulling The REINS: Training-Free Safety Alignment of Video Diffusion Models via Representation Steering

SierpinskiCam: Camera-Controlled Video Retaking with Sierpinski Triangle Pattern Cues

Learning a Maximum Entropy Model for Visual Textures using Diffusion

Visual Retrieval-Augmented Generation for Silhouette-Guided Animal Art

Root-Selecting Fixed-Point Inversion for Rectified Flows via Trajectory Straightness

TivTok: Broadcasting Time-Invariant Tokens for Scalable Video Tokenization

RAVA: Retrieval-Augmented Viewpoint Alignment for Subject-Driven Image Generation

MaineCoon: Pursuing A Real-Time Audio-Visual Social World Model

ReAge3D: Re-Aging 3D Faces with View Consistency

Edit3DGS: Unified Framework for Dynamic Head Editing via 2D Instruction-Guided Diffusion and 3D Gaussian Splatting

Looped World Models

SceneCompleter: Dense 3D Scene Completion for Generative Novel View Synthesis

TextMesh4D: Zero-shot Text-to-4D Mesh Generation

Detail++: Training-Free Detail Enhancer for Text-to-Image Diffusion Models