2606.05115 2026-06-04 cs.CV cs.AI cs.CL Version update

Continual Visual and Verbal Learning Through a Child's Egocentric Input

Xiaoyang Jiang, Yanlai Yang, Kenneth A. Norman, Brenden Lake, Mengye Ren

Affiliations: Agentic Learning AI Lab, New York University; Department of Psychology, Princeton University

AI summary: Proposes BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass. By combining streaming visual representation learning with an image-text contrastive objective, it outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark and narrows the gap to the offline-training upper bound.

Comments: 15 pages, 4 figures

Abstract

Children learn the meanings of words from a continuous, temporally structured stream of egocentric experience. Recent work shows that neural networks can also learn word-referent mappings from a child's egocentric video recordings, but they cycle through the shuffled data for hundreds of epochs, contrasting with how children actually encounter their environment. We introduce BabyCL, a continual multimodal learning framework that processes the SAYCam dataset in a single chronological pass, combining streaming visual representation learning with an image-text contrastive objective. BabyCL combines a multi-stage temporal segmentation of the stream with a dual replay buffer that independently manages visual and multimodal histories, and it is jointly trained with three contrastive losses on a shared backbone. Under a matched optimization budget, BabyCL outperforms streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, substantially narrowing the gap to an upper bound of offline training. Ablations show that the gains are robust to the length of the online temporal segmentation window and the eviction rule of the replay buffer. Together, these results show that meaningful word-referent mappings can emerge under training conditions much closer to a child's actual experience.
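
As a rough illustration of what a bounded replay buffer with a swappable eviction rule can look like, here is a minimal sketch. It is not BabyCL's implementation: the class name `ReplayBuffer`, the two rules (`fifo` and `reservoir`), the capacities, and the integer frame ids are all assumptions made for illustration. The abstract only states that there are two buffers (visual and multimodal) and that results are robust to the eviction rule.

```python
import random


class ReplayBuffer:
    """Bounded buffer over a stream (hypothetical sketch, not the paper's code)."""

    def __init__(self, capacity, rule="fifo", seed=0):
        if rule not in ("fifo", "reservoir"):
            raise ValueError(f"unknown eviction rule: {rule}")
        self.capacity = capacity
        self.rule = rule
        self.items = []
        self.seen = 0  # number of stream items offered so far
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        elif self.rule == "fifo":
            # Evict the oldest stored item, keep the most recent ones.
            self.items.pop(0)
            self.items.append(item)
        else:
            # Reservoir sampling: every item seen so far is retained
            # with equal probability capacity / seen.
            j = self.rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = item

    def sample(self, k):
        # Draw up to k stored items without replacement for a replay batch.
        return self.rng.sample(self.items, min(k, len(self.items)))


# Two independent buffers, mirroring the "dual" idea at toy scale.
visual = ReplayBuffer(capacity=3, rule="fifo")
multimodal = ReplayBuffer(capacity=3, rule="reservoir")
for frame_id in range(10):
    visual.add(frame_id)
    multimodal.add(frame_id)
```

With FIFO eviction the buffer holds only the most recent frames, whereas reservoir sampling keeps a uniform sample of the whole stream so far; comparing rules of this kind is one plausible form an eviction-rule ablation could take.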

2606.05107 2026-06-04 cs.CV cs.AI Version update

MaCo-GAN: Manifold-Contrastive Adversarial Learning for Single Image Super-Resolution

Daeyoung Han, Seongmin Hwang, Moongu Jeon

Affiliations: Department of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea; Department of AI Convergence, Gwangju Institute of Science and Technology, Gwangju, Republic of Korea

AI summary: Proposes MaCo-GAN, which replaces the conventional adversarial loss with manifold-contrastive adversarial learning. A dynamic fake sample synthesizer generates fake images that preserve low-resolution correspondence, yielding consistent improvements in the perception-distortion trade-off.

Abstract

Conventional Generative Adversarial Networks (GANs) for Single Image Super-Resolution (SISR) often struggle with hallucinated artifacts, largely because standard discriminators evaluate overall image naturalness rather than strict conditional realism. To address this, we propose MaCo-GAN, a novel manifold-contrastive GAN framework that replaces the conventional adversarial loss with a supervised contrastive objective. A core component of our method is a dynamic fake sample synthesizer that transforms ground truth (GT) data into a spectrum of challenging, perceptually plausible fake images that strictly maintain low-resolution (LR) correspondence. Utilizing these synthesized samples, we establish a robust contrastive minimax game: the generator is trained to attract its predictions toward on-manifold fakes (low distortion) and repel them from off-manifold fakes (high distortion), while the discriminator optimizes the exact opposite. By simply replacing the adversarial loss of a baseline SR model with our proposed objective, we demonstrate consistent improvements in the perception-distortion trade-off across various benchmarks. Extensive ablation studies validate the effectiveness of our framework and provide deep insights into the dynamics of this conditional contrastive game.

2606.05058 2026-06-04 cs.CV cs.AI Version update

UniCAD: A Unified Benchmark and Universal Model for Multi-Modal Multi-Task CAD

Jingyuan Chen, Sheng Jin, Haopeng Sun, Wentao Liu, Chen Qian

Affiliations: SenseTime Research and Tetras.AI

AI summary: To address the lack of a unified multimodal benchmark for CAD, proposes the UniCAD benchmark and UniCAD-MLLM, a universal multimodal large language model that handles point-cloud-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering end to end in one framework, achieving state-of-the-art performance on multiple benchmarks.

Abstract

Computer-Aided Design (CAD) underpins modern engineering and manufacturing by enabling the creation of precise, editable 3D models. However, CAD research typically studies tasks in isolation, and multi-modal, multi-task learning for CAD is hindered by the absence of a unified benchmark. To address this gap, we introduce UniCAD, a comprehensive benchmark for multi-modal CAD learning that covers point-to-CAD reconstruction, text/image-to-CAD generation, and CAD question answering across diverse input modalities. Alongside the benchmark, we present UniCAD-MLLM, a universal multi-modal large language model that ingests text, images, sketches, and point clouds and performs these heterogeneous tasks in an end-to-end fashion within a single framework. Extensive experiments on the UniCAD and Fusion360 benchmarks demonstrate that UniCAD-MLLM achieves state-of-the-art performance across all tasks, outperforming existing task-specific and multi-task baselines. We will release the dataset, code, and pretrained models to accelerate future research.

2606.05035 2026-06-04 cs.CV Version update

Anchor3R: Streaming 3D Reconstruction with Transient Anchors for Long-Horizon Visual Mapping

Peilin Tao, Chong Cheng, Yuansen Du, Caiwei Song, Zhengqing Chen, Xiaoyang Guo, Wei Yin, Weiqiang Ren, Qian Zhang, Hainan Cui, Shuhan Shen

Affiliations: CASIA (Institute of Automation, Chinese Academy of Sciences); UCAS (University of Chinese Academy of Sciences); Horizon Robotics; HKUST(GZ) (Hong Kong University of Science and Technology (Guangzhou))

AI summary: Proposes Anchor3R, a framework that treats feed-forward reconstruction as local measurement prediction in the current-frame coordinate system rather than global regression. Combined with window-relative pose prediction, loop-closure reinsertion, and motion averaging, it enables online 3D reconstruction and pose estimation on long sequences.

Abstract

Long-horizon online visual mapping is a core capability for robot perception, requiring continuous camera-motion and scene-geometry estimation from visual streams under bounded memory and computation. Recent feed-forward 3D reconstruction models provide strong geometric priors, but their streaming variants often predict poses in a fixed coordinate system tied to the first frame or a persistent scene memory. This fixed-gauge design leads to train-test mismatch, attention bias toward early anchors, and accumulated drift on sequences much longer than those seen during training. We propose Anchor3R, a streaming 3D reconstruction framework that treats feed-forward reconstruction as current-centric local measurement prediction rather than persistent global-gauge regression. At each time step, Anchor3R predicts window-relative poses and a local pointmap in the current-frame coordinate system, turning streaming reconstruction into relative-pose measurement generation. These measurements support online pose updates, while loop-closure reinsertion and motion averaging align the trajectory and transform local pointmaps into a coherent global reconstruction. Experiments on indoor, outdoor, driving, and RGB-D benchmarks show that Anchor3R improves long-horizon pose accuracy and dense reconstruction quality over existing streaming baselines, while supporting bounded-memory online inference.
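
To make the idea of turning relative-pose measurements into a global trajectory concrete, the following is a minimal sketch under strong simplifying assumptions: poses are planar SE(2) triples `(x, y, theta)` rather than full 3D poses, and measurements are simply chained frame to frame. The function names `compose` and `chain` are hypothetical. It does not implement Anchor3R's window-relative prediction, loop-closure reinsertion, or motion averaging.

```python
import math


def compose(a, b):
    """Compose two SE(2) poses: express pose b, given in the frame of a, globally."""
    ax, ay, at = a
    bx, by, bt = b
    return (
        ax + math.cos(at) * bx - math.sin(at) * by,
        ay + math.sin(at) * bx + math.cos(at) * by,
        at + bt,
    )


def chain(relative_poses):
    """Accumulate frame-to-frame relative poses into global poses from the origin."""
    trajectory = [(0.0, 0.0, 0.0)]
    for rel in relative_poses:
        trajectory.append(compose(trajectory[-1], rel))
    return trajectory


# Four steps of "move one unit forward, then turn 90 degrees left"
# trace a unit square and return to the starting position.
steps = [(1.0, 0.0, math.pi / 2)] * 4
trajectory = chain(steps)
```

In pure chaining like this, any error in one relative measurement propagates to every later pose, which is the kind of accumulated drift that loop closures and motion averaging are meant to correct.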

2606.05031 2026-06-04 cs.CV Version update

Food-R1: A Unified Multi-Task Food Vision-Language Model Based on Reinforcement Learning

Yu Zhu, Yongkang Li, Wenjie Zhu, Haoyi Jiang, Wenyu Liu, Wei Yang, Bin Li, Xinggang Wang

Affiliations: Huazhong University of Science and Technology

AI summary: Existing food vision-language models rely on supervised fine-tuning, which limits reasoning and generalization, and nutritional annotations are scarce. To address this, proposes CalorieBench-80K, a large-scale benchmark with chain-of-thought annotations, and Food-R1, a unified multi-task food vision-language model trained with reinforcement fine-tuning (GRPO), which consistently outperforms strong baselines on food-related tasks.

Abstract

Recent studies have explored Vision-Language Models (VLMs) for food analysis. However, most existing methods rely primarily on supervised fine-tuning (SFT), which often limits reasoning and generalization capabilities. Moreover, high-quality large-scale nutritional annotations remain scarce. To address these issues, we introduce CalorieBench-80K, a large-scale benchmark with curated calorie labels and dietary advice annotations. To the best of our knowledge, it is the first food image benchmark to incorporate Chain-of-Thought (CoT) annotations for calorie reasoning. We also propose Food-R1, a unified food VLM trained in a multi-task learning paradigm to equip the model with broad capabilities. Food-R1 undergoes CoT-based cold-start instruction tuning, followed by reinforcement fine-tuning (RFT) using Group Relative Policy Optimization (GRPO) to improve reasoning and performance. Experiments on CalorieBench-80K and representative benchmarks show that Food-R1 consistently outperforms strong baselines across food-related tasks. The code, model weights, and benchmark annotations are available at the project repository.
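
The core of Group Relative Policy Optimization is that each sampled response is scored relative to the other responses drawn for the same prompt, rather than by a learned value function. A minimal sketch of that normalisation step follows. The function name, the use of the population standard deviation, the `eps` constant, and the example rewards are assumptions for illustration; the abstract does not describe Food-R1's reward design or GRPO settings.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of responses sampled for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps guards against division by zero when all rewards in the group are equal.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled answers to one calorie question.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that score above the group mean receive positive advantages and are reinforced; those below the mean receive negative ones. If every response in a group gets the same reward, all advantages are zero and that group contributes no learning signal.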

2606.04970 2026-06-04 cs.CV cs.AI Version update

Plan, Watch, Recover: A Benchmark and Architectures for Proactive Procedural Assistance

Kaustav Kundu, Ritvik Shrivastava, Maxim Arap, Nanshu Wang, Xianhui Zhu, Quintin Fettes, Gautam Tiwari, Parth Suresh, Théo Moutakanni, Alejandro Castillejo Munoz, Allen Bolourchi, Pascale Fung, Pinar Donmez, Babak Damavandi, Anuj Kumar, Seungwhan Moon

Affiliations: Meta Reality Labs; Meta Superintelligence Labs

AI summary: Proposes the EgoProactive dataset and the Pro²Bench benchmark, and designs a decoupled planner-interaction architecture for real-time guidance and recovery from deviations in proactive procedural assistance.

Comments: 53 pages, 14 figures

Abstract

We envision a proactive multi-modal assistant system that gives users real-time, step-by-step guidance on a procedural task, autonomously deciding when to interrupt and how to coach. However, progress is limited by the absence of large-scale, cross-domain benchmarks that reflect realistic conditions, particularly the common case in which users deviate from the expected step sequence. We address this gap with four contributions: (1) we release EgoProactive, a large-scale wearable-egocentric dataset for proactive procedural assistance with explicit Out-of-Plan (OOP) annotations and recovery steps; (2) we augment five established benchmarks (Ego4D, EPIC-KITCHENS, EgoExo4D, HoloAssist, HowTo100M) into Pro²Bench under a unified proactive-guidance schema; (3) we propose a decoupled planner-interaction architecture specialized for procedural state, visual cues, and recovery injection; (4) we introduce a post-training recipe that transfers across model families, validated by cross-backbone replication on Llama 4 and Qwen-3.6-VL. In extensive experiments, our trained Llama-4 system substantially improves objective intervention quality over strong proprietary baselines (Claude Opus 4.6, Gemini 3.1 Pro, GPT 5.2) and an open-weight baseline (Qwen3 VL 235B) across all six datasets. Oracle-plan experiments further show that, when plan quality is controlled, the trained duplex model produces high-quality guidance and large gains on Out-of-Plan recovery.

2606.04925 2026-06-04 cs.CV Version update

Scene-Centric Unsupervised Video Panoptic Segmentation

Christoph Reich, Oliver Hahn, Nikita Araslanov, Laura Leal-Taixé, Christian Rupprecht, Daniel Cremers, Stefan Roth

Affiliations: TU Munich; TU Darmstadt; NVIDIA; University of Oxford; MCML; ELIZA

AI summary: Introduces the task of unsupervised video panoptic segmentation and the first method for it, VideoCUPS, which generates pseudo-labels from depth, motion, and visual cues and trains on them with Video DropLoss to achieve accurate segmentation without supervision.

Comments: CVPR 2026. Oliver Hahn and Christoph Reich contributed equally. Code: https://github.com/visinf/cups/tree/main/videocups Project page: https://visinf.github.io/videocups/

Abstract

Video panoptic segmentation (VPS) aims to jointly detect, segment, and track all objects while partitioning the video into semantically consistent regions. We introduce the task setting of unsupervised VPS, omitting any human supervision. Existing unsupervised scene understanding works mainly focused on image segmentation tasks; the video domain remains underexplored. We propose VideoCUPS, the first unsupervised VPS approach. VideoCUPS generates temporally consistent panoptic video pseudo-labels from scene-centric videos by exploiting unsupervised depth, motion, and visual cues. Training on these pseudo-labels using a novel Video DropLoss yields an accurate, unsupervised VPS model. To benchmark progress, we introduce a comprehensive evaluation protocol and four competitive baselines, extending state-of-the-art unsupervised panoptic image and instance video segmentation models to VPS. VideoCUPS outperforms all baselines and demonstrates strong label-efficient learning. With VideoCUPS, our evaluation protocol, and baselines, we provide a strong foundation for future research on unsupervised VPS.

2606.04922 2026-06-04 cs.CV cs.AI cs.LG Version update

Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

Tran Dinh Tien, Zhiqiang Shen

Affiliations: Department of Machine Learning; Mohamed bin Zayed University of Artificial Intelligence

AI summary: Proposes the Omni-Geometry Knowledge Distillation (OGKD) framework, which injects class-relation structure into the teacher model to produce directional targets that preserve the ground-truth label while respecting inter-class geometry. It introduces the Global Geometry-Aware Distillation (GAD) and Label-Guided Geometry Distillation (LGD) losses and improves accuracy by an average of 1.7%-2.8% across 11 medical datasets.

Comments: Preprint. Code is available at https://github.com/tientrandinh/OGKD

Abstract

Current prompt-based and adapter-based tuning of vision-language models (VLMs) is attractive for medical imaging, where clinical data sensitivity favors frozen backbones and annotations are limited. However, these methods typically optimize only the ground-truth class, treating all other classes as equally incorrect, ignoring clinically meaningful class relations and yielding unstable decision boundaries in limited-supervision settings. We propose Omni-Geometry Knowledge Distillation (OGKD), a new framework that injects class-relation structure into the teacher to produce directional targets that preserve the ground truth while respecting inter-class geometry. Using these targets, we develop two distillation losses: Global Geometry-Aware Distillation (GAD) operates on the global image token, and Label-Guided Geometry Distillation (LGD) applies the same geometry to attentive patch tokens to improve fine-grained alignment. Across comprehensive experiments and analyses on 11 widely-used medical datasets for base-to-novel and few-shot evaluations, our OGKD achieves substantially better performance, consistently improving accuracy by an average absolute gain of 1.7%-2.8% over all prior state-of-the-art VLM adaptation counterparts. It also robustly generalizes to unseen classes and yields more reliable predictions than other approaches. Our code is available at https://github.com/tientrandinh/OGKD.

2606.04911 2026-06-04 cs.CV cs.CL Version update

BreastGPT: A Multimodal Large Language Model for the Full Spectrum of Breast Cancer Clinical Routine

Yang Liu, Jiajin Zhang, Danyang Tu, Yaojun Hu, Jiao Qu, Jiuyu Zhang, Yu Shi, Wei Fang, Shi Gu, Ling Zhang, Yingda Xia

Affiliations: DAMO Academy, Alibaba Group; Zhejiang University; Hupan Lab; West China Hospital; China Medical University

AI summary: Proposes BreastGPT, a multimodal large language model. By building BreastStage, a workflow-aligned instruction corpus, and using a dual-branch visual encoder, it supports multimodal reasoning across breast cancer screening, diagnosis, and treatment planning, reaching 75.66% closed-ended accuracy and an 89.92% open-ended score on BreastStage-Bench.

Abstract

Breast cancer remains a leading cause of cancer-related mortality among women. Its clinical management requires multimodal reasoning across a clinical workflow that spans screening, diagnosis, and treatment planning, where each stage involves distinct imaging modalities, task objectives, and reasoning patterns. However, constrained by data scarcity and model versatility, existing medical MLLMs are typically evaluated on isolated modalities or narrow task families, limiting their ability to support workflow-level clinical reasoning. In this work, we first introduce BreastStage, a workflow-aligned breast imaging instruction corpus comprising 1.86M instruction-following pairs curated from 17 sub-datasets across 5 imaging modalities and 136 task templates. Its held-out split, BreastStage-Bench, provides a comprehensive benchmark for evaluating multimodal reasoning across the breast cancer care continuum. Building on this corpus, we propose BreastGPT, a unified MLLM equipped with a dual-branch visual encoder and concept-preserving token compression to bridge the scale gap between standard radiology and gigapixel pathology. On BreastStage-Bench, BreastGPT achieves 75.66% closed-ended accuracy and an 89.92% open-ended score, outperforming both general-purpose and medical-specific MLLMs across clinical stages and task formats. These results suggest that workflow-aligned data and cross-scale visual modeling are critical for clinically grounded medical MLLMs. All data, code, and model checkpoints are released at https://yangyy-liu.github.io/BreastGPT.io.

2606.04898 2026-06-04 cs.CV Version update

CDPM-Align: Multi-Scale Guidance-Aligned Diffusion Pretraining for Robust Few-Shot Anatomical Landmark Detection

Roberto Di Via, Irina Voiculescu, Francesca Odone, Vito Paolo Pastore

Affiliations: MaLGa; DIBRIS, University of Genoa; University of Genoa; Department of Computer Science, University of Oxford

AI summary: Proposes CDPM-Align, a multi-scale guidance-aligned conditional diffusion pretraining method that learns robust representations through generative pretraining, improving the accuracy and uncertainty estimates of anatomical landmark detection in few-shot, low-annotation settings.

Comments: Accepted at MICCAI 2026

Abstract

Anatomical landmark detection is a fundamental task in medical image analysis, supporting a wide range of diagnostic and interventional workflows. Although recent methods have achieved sub-millimetric localisation, accuracy alone is not sufficient for clinical deployment, which also requires reliable and robust predictions. Despite its clinical relevance, the impact of representation learning in this context is still underexplored. In this work, we introduce CDPM-Align, a multi-scale guidance-aligned conditional diffusion pre-training method for anatomical landmark detection. Our experimental setup focuses on few-image and few-annotation regimes. Specifically, we employ three popular heterogeneous small-scale benchmark datasets for representation learning via conditional generative pre-training. Furthermore, we consider low-annotation scenarios for the downstream task of landmark detection, with 10 and 25 annotated images, reflecting realistic trade-offs between clinical effort and resource constraints for annotations. Our results confirm that generative pre-training enables the model to learn a robust representation. This improves both accuracy and uncertainty estimation on the downstream tasks, advancing towards safe and efficient clinical deployment.

2606.04891 2026-06-04 cs.CV cs.CG Version update

Hierarchical Space Partition for Surface Reconstruction

Minjie Tang, Xiangfei Li

Affiliations: Independent Researcher; Huazhong University of Science and Technology

AI summary: To address missing detail in point cloud reconstruction caused by the limitations of LiDAR scanning, proposes a hierarchical space partition method based on plane classification and priority-based growth, and generates a watertight polygonal mesh via min-cut optimization.

Comments: Published in 2026 International Conference on 3D Vision (3DV)

DOI: 10.1109/3DV69130.2026.00027
Journal ref: 2026 International Conference on 3D Vision (3DV), Vancouver, BC, Canada, 2026, pp. 207-216

Abstract

Generating compact polygonal models from point clouds is a key problem in 3D vision and computer graphics. However, due to inherent limitations of LiDAR scanning (e.g. range constraints and occlusions), critical scene information is often missing, leading to degraded reconstruction accuracy. To address this, we propose a plane assembling strategy that effectively recovers missing details while maintaining model compactness. We classify all the planes extracted from the scene into three categories: highly visible, barely visible, and invisible. The invisible planes, which are recovered by scene structure analysis, indicate the missing details. The three types of planes correspond to the three growth priorities. Each plane grows according to the priority level, and the space is partitioned progressively, namely, the hierarchical partition. Subsequently, we generate a watertight polygonal mesh from the partition via a min-cut-based optimization. Finally, comparisons on public datasets show the effectiveness and superiority of our method against mainstream approaches. The project page is available at https://hsr-3dv.github.io/.

2606.04888 2026-06-04 cs.CV Version update

HD-DinoMoE: A Class-Aware Hierarchical Dual Mixture-of-Experts Network for Scleral Anomaly Segmentation in Complex Acquisition Scenarios

Yinxiang Yu, Maoxiang Chu, Qi Niu, Guanghu Liu, Wei Xu, Haotian Wang, Zhi Chen, Yutian Zhu, Yuelong Fan, Guanghao Liao

Affiliations: School of Electronic and Information Engineering, University of Science and Technology Liaoning

AI summary: To handle multi-source distribution discrepancies, diverse anomaly morphologies, and scleral specular reflection, proposes HD-DinoMoE, a class-aware hierarchical dual mixture-of-experts network that combines dual-stream DINOv3 feature fusion with class-specific multi-expert decoding for pixel-level segmentation of vessels, yellow and black spots, and blood spots, reaching a mean Dice of 72.11% and a mean IoU of 58.44% on the ML-SASD dataset.

Comments: Submitted to Medical Image Analysis; 47 pages, 31 figures, 14 tables

Abstract

Traditional Chinese Medicine (TCM) ocular inspection provides empirical cues for assessing scleral surface anomalies, but its clinical use remains subjective and difficult to quantify. To support intelligent and quantifiable ocular inspection, this study presents the TCM-inspired Artificial Intelligence Ocular Auxiliary Diagnosis System (TAO) and focuses on pixel-level scleral surface anomaly segmentation. For clinical and user-acquired images affected by multi-source distributional discrepancies, diverse anomaly morphologies, and scleral specular reflection (SSR), we propose HD-DinoMoE, a class-aware hierarchical dual mixture-of-experts network. HD-DinoMoE combines class-aware dual-stream DINOv3 feature fusion with class-specific multi-expert decoding to segment Vessels, Yellow and Black Spots, and Blood Spots. A three-stage backbone-frozen routing strategy stabilizes dual-backbone adaptation; Progressive Confidence Penalty (PCP) Loss reduces high-confidence false positives and segmentation leakage in SSR regions; and Class-Aware Adaptive Sample Weighting (CA-ASW) balances sample- and class-level training contributions. We further construct the Multi-label Scleral Anomaly Segmentation Dataset (ML-SASD), a new benchmark with Clinical, Wild, and Mix settings and pixel-wise annotations for three anomaly categories. On ML-SASD-Mix, HD-DinoMoE achieves a mean Dice of 72.11% and a mean Intersection-over-Union of 58.44%, while maintaining favorable boundary localization and specular-region false-positive control. It also shows competitive generalization on the Vessels subset of the public SBVPI dataset. These results indicate that HD-DinoMoE provides a feasible segmentation solution for TAO under complex acquisition scenarios. The code and data access information are available at https://github.com/FX-CMX/HD-DinoMoE.
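
For readers less familiar with the two reported metrics, the following minimal sketch computes Dice and Intersection-over-Union for a pair of binary masks. The function name, the flat 0/1 list representation, and the empty-mask convention are assumptions for illustration; the paper's evaluation code may differ, for example in how scores are averaged across classes and images.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two binary masks given as flat sequences of 0/1."""
    intersection = sum(p & t for p, t in zip(pred, target))
    pred_sum = sum(pred)
    target_sum = sum(target)
    union = pred_sum + target_sum - intersection
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2 * intersection / (pred_sum + target_sum)
    iou = intersection / union
    return dice, iou


# Toy masks: 3 predicted pixels, 3 ground-truth pixels, 2 shared.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)
```

For a single mask pair the two metrics are tied by IoU = Dice / (2 - Dice). That identity does not carry over to averages: applying it to the reported mean Dice of 72.11% gives about 56.4%, not the reported mean IoU of 58.44%. This is expected when each metric is averaged separately over classes or images, because the mapping is convex.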

2606.04881 2026-06-04 cs.CV cs.AI Version update

DiverAge: Reliable Pluralistic Face Aging with Cross-Age Identity Relation Guidance

Yueying Zou, Peipei Li, Qianrui Teng, Dianyan Xu, Zekun Li

Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications; School of Computer Science, University of California, Santa Barbara

AI summary: Proposes DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. It preserves appearance diversity through stochastic diffusion decoding and age-conditioned semantic modulation, and introduces the Cross-age Identity Relation Regulator (CARR), which guides denoising at inference time to improve sequence-level ordinal reliability.

Comments: 11 pages, 10 figures, 5 tables

Abstract

Face aging plays an important role in long-term biometric analysis, cross-age identity verification, and forensic identity analysis. Since the same subject may exhibit multiple plausible appearances at a target age due to genetic, environmental, and lifestyle factors, face aging is inherently a one-to-many generation problem. However, pluralism alone is insufficient for reliable face aging: a model should provide appearance-level candidate diversity within each age group while maintaining sequence-level ordinal reliability across ordered age groups. Existing deterministic aging methods can synthesize visually plausible age-progressed faces, but usually lack stochastic diversity. In contrast, pluralistic aging methods introduce local appearance variations, but often fail to explicitly regulate the identity evolution of the full aging sequence. In this paper, we propose DiverAge, a hierarchical pluralistic face aging framework based on diffusion autoencoding. DiverAge preserves appearance-level diversity through stochastic diffusion decoding and age-conditioned semantic modulation. To improve sequence-level reliability, we introduce a Cross-age Identity Relation Regulator (CARR), an inference-time guidance strategy that jointly denoises multiple target age groups. CARR is guided by a Cross-age Identity Similarity (CIS) prior estimated from real same-identity cross-age pairs, and suppresses excessive cross-age identity drift through one-sided sampling-time guidance without modifying the training objective or introducing extra trainable parameters. Experiments demonstrate that DiverAge improves sequence-level ordinal reliability while maintaining identity preservation, age accuracy, image quality, and appearance-level diversity.

2606.04880 2026-06-04 cs.CV Version update

MAOAM: Unified Object and Material Selection with Vision-Language Models

Jaden Park, Valentin Deschaintre, Jason Kuen, Kangning Liu, Iliyan Georgiev, Krishna Kumar Singh, Yong Jae Lee, Michael Fischer

Affiliations: University of Wisconsin-Madison; Adobe Research

AI summary: Proposes the MAOAM framework, which uses a vision-language model with a segmentation head to enable precise object and material selection through text or click interactions, and designs a data generation pipeline to address the lack of material selection data.

Comments: Accepted to SIGGRAPH 2026 Conference. Project page: https://jadenpark0.github.io/project_pages/maoam/

DOI: 10.1145/3799902.3811186

Abstract

Selection is a core operation in interactive image editing. To be practical, a user should be able to specify and disambiguate the desired selection region through either text or click-based interactions, and the system should support selecting not only objects but also other criteria, such as materials. Material-based selection is valuable for tasks like re-texturing surfaces or editing instances of a specific material. However, existing vision-language-model (VLM) based selection methods are object-centric and typically support a single interaction modality, limiting their applicability. In this work, we thus present Mask Any Object And Material (MAOAM), a unified selection framework that enables precise object- and material-level selection across both text- and click-based interactions. MAOAM leverages a VLM with a segmentation head to produce pixel-accurate masks from user prompts: the VLM interprets the user's selection intent (object- or material-level) and encodes visual entities, attributes, and spatial relations, while the segmentation head decodes the output token into a mask. A key challenge is the lack of material selection datasets with text annotations. We propose a scalable data generation pipeline: we collect real and synthetic images with material masks, and leverage VLMs to generate material descriptions with rich visual semantics. We train MAOAM with a multi-task objective over click and text-based selection, along with an auxiliary VQA task derived from the material descriptions to facilitate deeper material understanding. Despite being trained with uni-modal prompts, our model exhibits an emergent improvement in selection when combining text and clicks at inference, enabling flexible image editing workflows. Experiments demonstrate accurate and coherent selections across diverse objects, materials, and interaction scenarios, highlighting robustness in practice.

2606.04871 2026-06-04 cs.CV (updated version)

Recent Advances and Trends in Learning-based 3D Representations

Adrien Schockaert, Hamid Laga, Hazem Wannous, Vincent Magnier, Guillaume Dufaye, Jean-françois Witz

Affiliations: CERI SN, IMT Nord Europe; Univ. Lille, CNRS, Centrale Lille, UMR 9013 - LaMcube - Laboratoire de Mécanique, Multiphysique, Multiéchelle; Downs, 59670 Sainte-Marie-Cappel; School of Information Technology, Murdoch University

AI summary: Surveys the families of 3D representations, from discrete explicit formats to continuous implicit fields (based on neural rendering or primitive splatting), analyzes their strengths, weaknesses, and key applications, and highlights the paradigm shift toward implicit representations.

Abstract

The selection of an appropriate 3D representation is a fundamental design decision that dictates the efficiency, quality, and capabilities of modern computer vision and graphics pipelines for tasks such as 3D reconstruction, novel-view synthesis and rendering, shape and motion analysis, recognition, and generation. While traditional representations (e.g., meshes, point clouds, and volumetric grids) remain standard outputs of 3D sensors (e.g., LiDAR and 3D scanners) and are widely used in downstream applications (e.g., editing and simulation), recent neural and primitive-based representations (e.g., 3D Gaussian Splatting) offer compact and differentiable alternatives, opening a wide range of opportunities in applications such as games, AR/VR, autonomous driving, robot navigation, and medical imaging, to name a few. The goal of this paper is to survey the main families of 3D representations from discrete explicit formats to continuous implicit fields based either on neural rendering or primitive splatting. For each type of representation, we present the general formulation and its variants, discuss its benefits and limitations, and highlight key applications. We conclude the paper by outlining the open challenges and potential directions for future research. Distinct from recent surveys that broadly cover 3D object and scene reconstruction, this paper provides a focused analysis on the evolution of 3D representations themselves. We specifically emphasize the paradigm shift toward implicit representations, offering a novel perspective on how these emerging formats fundamentally alter 3D/4D workflows.

2606.04863 2026-06-04 cs.CV (updated version)

IRIS-GAN: Staged Specialist Detection of Deepfake Faces

Jaume M. Trenchs, Veronica Sanz

Affiliations: Departamento de Física Teórica, Universitat de València, Burjassot, Spain; Instituto de Física Corpuscular (IFIC), CSIC–Universitat de València, Valencia, Spain

AI summary: Proposes IRIS-GAN, a specialist fake-face detector trained through staged exposure to different GAN families; it achieves high detection rates under cross-generator shift, and Grad-CAM analysis reveals generator-dependent spatial response patterns.

Comments 20 pages, 10 figures

Abstract

We introduce IRIS-GAN, a specialist forensic detector for synthetic face images under cross-generator shift. Rather than addressing universal synthetic-image detection, we focus on faces generated by generative adversarial networks (GANs), which are state-of-the-art in deepfake content, and train the detector through staged exposure to increasingly demanding GAN families while retaining earlier generators. The final model reaches fake-detection rates above 99% across the GAN families considered and classifies an external real-face dataset with 98.9% accuracy. Grad-CAM analysis further reveals measurable generator-dependent spatial response patterns, which remain informative for a secondary heatmap-only classifier. Out-of-family tests on diffusion-generated faces confirm that IRIS-GAN is a specialist detector, with some capability to reach non-GAN deepfakes. These results establish staged training as an effective strategy for robust GAN-face forensics.

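2606.04847 2026-06-04 cs.CV cs.CL cs.LG (updated version)

MusaCoder: Native GPU Kernel Generation with Full-Stack Training on Moore Threads GPU

Kun Cheng, Songshuo Lu, Sicong Liao, Tankun Li, Yafei Zhang, Dong Yang, Qiheng Lv, Hua Wang, Zhi Chen, Yaohua Tang

Affiliations: Moore Threads AI

AI summary: Proposes MusaCoder, a full-stack training framework that combines progressive data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback reinforcement learning to generate efficient native GPU kernels on CUDA and MUSA backends; the 9B model matches frontier closed-source models and the 27B model sets a new state of the art.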

Abstract

Native GPU kernel generation turns high-level tensor programs into executable, efficient low-level code. Existing Large Language Models (LLMs) struggle with this task, while execution-based reinforcement learning suffers from sparse rewards, reward hacking, and training instability. We present MusaCoder, a full-stack training framework for native GPU kernel generation on CUDA and MUSA backends. MusaCoder combines progressive kernel-oriented data synthesis, diversity-preserving rejection fine-tuning, and execution-feedback Reinforcement Learning (RL) through MooreEval, a distributed verifier and reward environment. To stabilize RL, MusaCoder introduces PrimeEcho for first-turn-anchored multi-turn rewards, Buffered Dynamic Retry for recovering signals from all-failed hard samples, and MirrorPop for off-policy sequence filtering. Experiments on KernelBench and a MUSA-ported variant show that MusaCoder outperforms strong open-source and proprietary baselines in both correctness and empirical speedup, with the 9B model matching or exceeding frontier closed-source models and the 27B model establishing a new state of the art. These results demonstrate not only the effectiveness of full-stack execution-feedback training for native kernel generation, but also the capability of Moore Threads GPUs to support the complete LLM post-training stack, providing a practical foundation for large-model training and optimization on emerging accelerators.

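2606.04844 2026-06-04 cs.SD cs.CV (updated version)

Drift-Augmented Scoring: Text-Derived Noise Robustness for Zero-Shot Audio-Language Classification

Tu Vo, Sheir Zaheer, Chan Y. Park

Affiliations: Anonymous Authors

AI summary: Proposes Drift Augmented Scoring (DAS), which uses text-derived noise-conditioned prompts to predict the drift direction of audio embeddings and adds a per-class bonus score, improving zero-shot audio classification accuracy and mAP under noise without gradients or test-time batching.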

Abstract

Contrastive audio-language models such as CLAP enable zero-shot audio classification: a sound is labelled by matching its embedding to text prompt embeddings, with no labelled audio. This matching breaks down under acoustic noise, where accuracy and mAP fall by 12-30 percentage points at 0 dB SNR on standard benchmarks. We propose Drift Augmented Scoring (DAS), a small per-class bonus added to the cosine score. The bonus rewards a class when the noisy audio embedding drifts in the direction that the class's noise-conditioned text prompts predict. It is derived from text alone, computed once and cached, and adds a single inner product per class at inference, with no gradients and no test-time batch. On a LAION CLAP backbone, we compare DAS against the four variants of Acevedo et al.'s concurrent method on UrbanSound8K and the full FSD50K eval set, mixing each clip with urban acoustic scene noise across a range of SNRs. DAS improves the metric on every test condition: by +2.60 to +5.75 accuracy points on UrbanSound8K and +1.50 to +1.74 mAP points on FSD50K.
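
The scoring rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact form of the bonus, so the drift direction (mean noise-conditioned prompt embedding minus clean prompt embedding), the weight `alpha`, and all function names are assumptions. It only mirrors the stated properties: the directions come from text alone, can be computed once and cached, and cost one inner product per class at inference.

```python
import math


def normalize(v):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return list(v) if norm == 0 else [x / norm for x in v]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def drift_directions(clean_text, noisy_text):
    """Assumed per-class drift direction, derived from text embeddings only.

    clean_text[c]: embedding of the clean prompt for class c.
    noisy_text[c]: list of embeddings of noise-conditioned prompts for class c.
    """
    directions = []
    for clean, noisy_list in zip(clean_text, noisy_text):
        clean = normalize(clean)
        unit_noisy = [normalize(e) for e in noisy_list]
        mean_noisy = [sum(col) / len(unit_noisy) for col in zip(*unit_noisy)]
        directions.append(normalize([m - c for m, c in zip(mean_noisy, clean)]))
    return directions  # computed once, then cached


def das_scores(audio, clean_text, directions, alpha=0.1):
    """Cosine score plus a small per-class bonus (one extra inner product)."""
    a = normalize(audio)
    return [
        dot(a, normalize(t)) + alpha * dot(a, d)
        for t, d in zip(clean_text, directions)
    ]
```

Setting `alpha=0` recovers plain cosine scoring, which makes it easy to check how much the bonus changes a given ranking.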

2606.04836 2026-06-04 cs.CV (updated version)

3D Temporal Analysis for Autism Spectrum Disorder Screening During Attention Tasks

Inam Qadir, Elizabeth B Varghese, Dena Al-Thani, Marwa Qaraqe

Affiliations: College of Science and Engineering, Hamad Bin Khalifa University, Qatar Foundation, Doha, Qatar

AI summary: Proposes a DECA-based 3D temporal analysis framework that extracts head pose and facial expression features and uses LSTM/GRU classifiers for ASD screening during VR-CPT tasks; multimodal fusion reaches 84.6% accuracy.

Abstract

Accurate Autism Spectrum Disorder (ASD) screening for school-age children is crucial to identify cases that may have been missed earlier and to enable timely interventions supporting social, cognitive, and academic development. Current ASD screening relies on subjective assessments and 2D analysis methods that fail to capture spatial displacement patterns characteristic of ASD behaviors. In this study, a novel 3D temporal analysis framework is presented, built on top of DECA (Detailed Expression Capture and Animation), a 3D modeling framework, to extract comprehensive head pose parameters (including translational components $T_x, T_y, T_z$) and facial expressions independent of pose variations. LSTM and GRU-based temporal classifiers were trained on the extracted 3D features from video data collected from 39 participants (19 ASD, 20 TD) aged 7-12 years during Virtual Reality-Continuous Performance Test tasks. The GRU-based models demonstrated superior performance, with 3D head pose features achieving 83.9% accuracy and 3D facial features reaching 81.4% accuracy, outperforming 2D baseline approaches by 10.7% and 7.5%, respectively. Furthermore, multimodal fusion of 3D head pose and facial features with PCA-based dimensionality reduction achieved the highest accuracy of 84.6%, outperforming unimodal approaches. This work establishes a foundation for objective, automated screening tools addressing current diagnostic limitations in ASD identification for school-age populations.

2606.04820 2026-06-04 cs.CV cs.AI cs.LG (updated version)

OA-CutMix: Correcting the Label Bias of CutMix

Tobias Christian Nauen, Stanislav Frolov, Federico Raue, Brian B. Moser, Andreas Dengel

Affiliations: RPTU University Kaiserslautern-Landau; German Research Center for Artificial Intelligence (DFKI)

AI summary: Addresses the semantic bias that arises because CutMix assigns labels by patch area; proposes OA-CutMix, which uses segmentation masks to assign labels by visible object area, improving classification accuracy without changing the image mixing procedure.

Abstract

CutMix has become the de facto standard mixing augmentation, yet its label assignment rests on a flawed assumption: that the area of the pasted patch faithfully reflects its semantic contribution to the mixed image. In practice, however, patches frequently land on background regions, assigning label credit to classes whose objects are not visible. The mean discrepancy between the CutMix label and the semantic object area is 21.5%. In 17% of samples, an image contributes zero visible object pixels yet receives nonzero label weight. We propose Object-Aware CutMix (OA-CutMix), which corrects this bias by replacing the area-based CutMix weight with one derived from precomputed segmentation masks, assigning labels in proportion to the visible object area each image contributes to the mix. The image mixing procedure is left entirely unchanged. We evaluate OA-CutMix against 10+ static and dynamic mixing methods across 4 architectures and 6 datasets. OA-CutMix consistently achieves the highest accuracy over all tasks, outperforming even dynamic mixing methods, but at a fraction of the training-time cost. Improvements are largest for small objects, where the label bias from CutMix is greatest. Thus, correcting the label is sufficient to match or exceed the performance of methods modifying the image mixing algorithm.
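
To make the label bias concrete, the sketch below compares the standard area-based CutMix weight with a mask-based weight of the kind the abstract describes. It is an illustration under assumptions: the function name, the binary-mask input format, the half-open box convention, and the fallback when no object pixel is visible are all hypothetical, and the paper's exact weighting may differ.

```python
def cutmix_label_weights(mask_a, mask_b, box):
    """Return (area_weight_a, object_weight_a) for the base image A.

    mask_a, mask_b: binary object masks (lists of rows, 1 = object pixel)
                    for base image A and patch-source image B, same size.
    box: (top, left, bottom, right), half-open; pixels inside the box are
         taken from B at the same coordinates, as in standard CutMix.
    The weight for image B is 1 minus the returned weight in each case.
    """
    height, width = len(mask_a), len(mask_a[0])
    top, left, bottom, right = box

    def in_box(y, x):
        return top <= y < bottom and left <= x < right

    # Standard CutMix: label weight follows the pasted area only.
    box_area = (bottom - top) * (right - left)
    area_weight_a = 1.0 - box_area / (height * width)

    # Object-aware: count the object pixels that remain visible in the mix.
    pixels = [(y, x) for y in range(height) for x in range(width)]
    visible_a = sum(mask_a[y][x] for y, x in pixels if not in_box(y, x))
    visible_b = sum(mask_b[y][x] for y, x in pixels if in_box(y, x))
    total = visible_a + visible_b
    if total == 0:
        # Assumed fallback: no object visible at all, keep the area weight.
        return area_weight_a, area_weight_a
    return area_weight_a, visible_a / total
```

For two 4x4 images whose objects both sit in the top-left 2x2 corner, pasting B's bottom-right 2x2 patch gives B an area-based weight of 0.25 even though none of its object pixels are visible; the mask-based weight assigns the full label to A instead.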

2606.04806 2026-06-04 cs.CV cs.AI (updated version)

NoRA: Evaluating Grounded Reasonableness in Visual First-person Normative Action Reasoning

Sichao Li, Sai Ma, Daniel Kilov, Secil Yanik Guyot, Zhuang Li, Seth Lazar

Affiliations: The University of Sydney; Australian National University; RMIT University; Johns Hopkins University

AI summary: Proposes the NoRA benchmark, which uses fact-reason-action support graphs to evaluate whether multimodal models can generate reasonable actions and ground their reasoning in visible facts; current VLMs fall short in constructing the full action space and binding actions to the correct support.

Abstract

LLMs and agentic systems are increasingly deployed in social environments, making normative competence critical for safe and appropriate behavior. However, existing approaches either assess normative judgment in text alone or reduce it to choosing among a fixed set of candidate actions. We argue both are insufficient. In practice, agents are never handed a menu of options; they must identify a reasonable action from scratch, grounded in visible facts and supported by inspectable reasons. We introduce NoRA, a visual first-person video benchmark that requires models to generate candidate next actions and justify each through an explicit fact-reason-action support graph. The benchmark comprises 1,420 annotated video clips, including HumanGold-190 and LLMSilver-1230 splits. Each instance is evaluated through action alignment, factual grounding, and support binding, aggregated into a single grounded reasonableness score. We benchmark 12 multimodal systems under direct, deliberate, and structured prompting regimes, finding that current VLMs frequently recover plausible actions and relevant scene facts, but consistently struggle to construct the full reasonable action space and bind selected actions to the correct local support. NoRA makes this gap measurable, shifting the evaluation question from whether a model can pick an action to whether it can justify an appropriate action for the right visible reasons.

2606.04801 2026-06-04 cs.CV (updated version)

Fast Cubical Persistent Homology on 2D and 3D Images via Union-Find, Pruning, and Lookup Tables

Titouan Le Breton, Karol Szustakowski, Marie Piraud

Affiliations: Helmholtz AI; Helmholtz Munich; École des Ponts ParisTech; ENS Paris-Saclay; Institute of AI for Health, Helmholtz Munich

AI summary: Proposes Flash Cubical, which uses union-find, edge pruning, and lookup tables to compute cubical persistence of 2D and 3D images under the V-filtration efficiently, achieving optimal time and memory performance.
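
As background for the union-find ingredient named in the summary, here is a minimal sketch of 0-dimensional sublevel-set persistence on a 2D image, with pixels as vertices and 4-neighbour edges. It is the textbook baseline only: the function name is hypothetical, and the pruning and lookup-table techniques of the paper are not reproduced here, nor are higher-dimensional classes.

```python
def h0_persistence(image):
    """0-dimensional persistence pairs of a 2D image (sublevel sets).

    Pixels are vertices; an edge between 4-neighbours appears at the larger
    of its two pixel values. Returns sorted (birth, death) pairs with
    birth < death, plus one essential pair (global minimum, inf).
    """
    height, width = len(image), len(image[0])
    order = sorted(
        (image[y][x], y, x) for y in range(height) for x in range(width)
    )
    parent = {}  # union-find forest over the pixels added so far
    birth = {}   # birth value stored at each pixel; read at component roots

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    pairs = []
    for value, y, x in order:
        p = (y, x)
        parent[p] = p
        birth[p] = value
        for q in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if q not in parent:
                continue  # neighbour outside the image or not yet added
            elder, younger = find(p), find(q)
            if elder == younger:
                continue
            if birth[elder] > birth[younger]:
                elder, younger = younger, elder
            # Elder rule: the component born later dies at this merge.
            if birth[younger] < value:
                pairs.append((birth[younger], value))
            parent[younger] = elder
    pairs.append((order[0][0], float("inf")))
    return sorted(pairs)
```

On a 3x3 image with four low corner pixels separated by a ridge of height 5, the three younger corners die when the ridge appears and the global minimum survives as the essential class.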

2606.04797 2026-06-04 cs.CV cs.LG (updated version)

Crafting Your Evolving Dreams: Concept-Incremental Versatile Customization

Jiahua Dong, Wenqi Liang, Hongliu Li, Yang Cong, Duzhen Zhang, Hanbin Zhao, Henghui Ding, Yulun Zhang, Salman Khan, Fahad Shahbaz Khan

Affiliations: Mohamed bin Zayed University of Artificial Intelligence; University of Trento; Department of Civil and Environmental Engineering, The Hong Kong Polytechnic University; South China University of Technology; College of Computer Science and Technology, Zhejiang University; Institute of Big Data, Fudan University; Shanghai Jiao Tong University

AI summary: Proposes the Continually Customizable Diffusion Model (CCDM), which mitigates catastrophic forgetting with an attribute-decoupled LoRA module and a relevance-guided aggregation strategy, and handles concept neglect with a controllable regional context synthesis strategy, enabling concept-incremental versatile customization.

Comments Accepted to Transactions on Pattern Analysis and Machine Intelligence (TPAMI)

Abstract

Custom diffusion models (CDMs) have garnered significant interest owing to their remarkable capacity for generating personalized concepts. However, the majority of CDMs unrealistically presume that the user's collection of personalized concepts is static and incapable of incremental growth over time. Furthermore, they exhibit significant catastrophic forgetting and concept neglect of previously learned concepts when incrementally learning a sequence of new ones. To resolve the above challenges, we develop a novel Continually Customizable Diffusion Model (CCDM), enabling users to perform concept-incremental versatile customization. Specifically, we design an attribute-decoupled LoRA (AD-LoRA) module and a relevance-guided AD-LoRA aggregation strategy to mitigate catastrophic forgetting. They can preserve concept-specific attributes of each task and leverage beneficial inter-task correlations to enhance the continual learning of new customization tasks. Additionally, to address the challenge of concept neglect, we propose a controllable regional context synthesis strategy that performs multi-concept composition in alignment with user-provided conditions. This strategy enhances the overall consistency in multi-concept synthesis by guaranteeing semantic independence between user-defined regions and their smooth boundary transitions. Experiments show our CCDM exhibits significant improvements over baseline methods.

2606.04792 2026-06-04 cs.CV (updated version)

A Pathology Foundation Model for Gastric Cancer with Real-World Validation

Ling Liang, Jiabo Ma, Zhengyu Zhang, Fengtao Zhou, Yingxue Xu, Yihui Wang, Cheng Jin, Zhengrui Guo, On Ki Tang, Zhijian Cen, Zhen Wang, Qi Xie, Chengyu Lu, Chenglong Zhao, Feifei Wang, Yu Cai, Hongyi Wang, Jing Zhang, Yaping Ye, Shijun Sun, Shenglei Li, Yu Wang, Zhenhui Li, Ronald Cheong Kin Chan, Xiuming Zhang, Zhe Wang, Hao Chen, Li Liang

Affiliations: Department of Computer Science and Engineering, The Hong Kong University of Science and Technology, Hong Kong SAR, China; Department of Pathology, Nanfang Hospital, Southern Medical University, Guangzhou, China; Department of Pathology, School of Basic Medical Sciences, Southern Medical University, Guangzhou, China; Guangdong Province Key Laboratory of Molecular Tumor Pathology, Guangzhou, China; Department of Anatomical and Cellular Pathology, The Chinese University of Hong Kong, Hong Kong SAR, China; Pathology Artificial Intelligence Development and Assessment Laboratory, State Key Laboratory of Translational Oncology, The Chinese University of Hong Kong, Hong Kong SAR, China

AI summary: Proposes GRACE, a gastric-cancer-specific foundation model built on multicenter HE-stained whole-slide images; it outperforms general-purpose PFMs on 28 clinical tasks, and prospective validation and a reader study confirm its value as a diagnostic aid.

Abstract

Gastric cancer remains a major cause of cancer mortality, yet its histological and molecular heterogeneity complicates diagnosis and risk stratification. General-purpose pathology foundation models (PFMs) often plateau on fine-grained endpoints central to gastric cancer care, and few have undergone rigorous prospective validation or clinical reader studies. We present GRACE, a Gastric-specific foundation model for Real-world Assessment and Clinical dEcision support. GRACE was developed from multicenter gastric pathology datasets totaling 48,364 primarily HE-stained whole-slide images from 37,493 patients. When evaluated on 28 clinically relevant tasks, GRACE consistently outperformed representative pancancer PFMs, achieving a macro-AUC of 0.9188, with strong performance for precancerous lesion diagnosis (macro-AUC 0.9322), tumor histopathological assessment (macro-AUC 0.9119), molecular profiling (macro-AUC 0.8682), and prognostic prediction. Beyond benchmarking, GRACE's translational value was substantiated through a rigorous evidence chain. Under safety-gated criteria requiring 100% NPV for rule-out and 100% PPV for rule-in, GRACE streamlined review for up to 69.6% of malignancy-diagnosis cases and triaged 46.8% of MMR-IHC follow-up requests. This translational feasibility was further strengthened by a randomized crossover reader study of pathologist-AI collaboration. With GRACE assistance, diagnostic accuracy improved from 82.0% to 89.9%, yielding nearly twofold higher adjusted odds of a correct diagnosis (OR 1.987) alongside concurrent gains in sensitivity and specificity. AI assistance also reduced diagnostic time by 14.9%, elevated diagnostic confidence by 9.0%, and markedly improved inter-rater agreement. When calibrated to maintain non-inferior performance to senior pathologists, the AI-assisted workflow could triage 60.7% of atrophy and 82.7% of intestinal metaplasia cases.
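
The safety-gated triage idea (rule out only where NPV is 100%, rule in only where PPV is 100%) can be illustrated with a small threshold search on a labelled calibration set. This is a sketch under assumptions, not GRACE's procedure: the function name, the use of a single scalar score per case, and the strict-inequality convention are hypothetical, and thresholds that are perfect on a calibration set carry no guarantee on new data.

```python
def safety_gated_thresholds(scores, labels):
    """Pick rule-out / rule-in thresholds with no errors on the given set.

    scores: model scores, higher = more likely positive.
    labels: 1 for positive (e.g. malignant), 0 for negative.
    Returns (rule_out, rule_in, triaged_fraction) where
      score < rule_out -> ruled out (all such cases are negative here),
      score > rule_in  -> ruled in  (all such cases are positive here),
      anything between -> left for full manual review.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    rule_out = min(positives)  # nothing below the lowest positive is positive
    rule_in = max(negatives)   # nothing above the highest negative is negative
    ruled_out = sum(s < rule_out for s in scores)
    ruled_in = sum(s > rule_in for s in scores)
    return rule_out, rule_in, (ruled_out + ruled_in) / len(scores)
```

The triaged fraction plays the role of the "streamlined review" percentages quoted above: the better the model separates the classes, the narrower the band of scores that still needs full review.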

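2606.04788 2026-06-04 cs.CV cs.RO (updated version)

Z-FLoc: Zero-Shot Floorplan Localization via Geometric Primitives

Ayumi Umemura, Toshinori Kuwahara, Marc Pollefeys, Daniel Barath

Affiliations: ETH Zurich; Tohoku University

AI summary: Proposes a zero-shot floorplan localization method that extracts geometric primitives such as lines and circles from bird's-eye-view projections of monocular 3D reconstructions and robustly matches them to the floorplan, generalizing to new environments without retraining.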

Abstract

Visual localization -- estimating a camera pose within a pre-existing map -- is a fundamental problem in computer vision. Floorplans are an attractive map representation: they are readily available for most buildings, compact, and inherently invariant to visual appearance changes. However, bridging the severe domain gap between camera observations and floorplan geometry remains challenging. Existing methods address this gap through data-driven learning, yet they require large-scale training data and environment-specific retraining, limiting their practical deployment. We propose a zero-shot floorplan localization method that generalizes to novel environments without any retraining. Our key insight is that dominant geometric primitives -- lines and circles -- are ubiquitous in human-made environments and provide appearance-invariant structural constraints. We extract these primitives from a bird's-eye-view (BEV) projection of monocular 3D reconstructions and match them to the floorplan via dedicated minimal solvers within a robust estimation framework. Experiments on both simulated and real-world datasets show that our approach outperforms state-of-the-art learning-based methods on unseen environments, while using a single fixed set of hyperparameters across all experiments. The source code will be made publicly available.

2606.04775 2026-06-04 cs.LG cs.AI cs.CV cs.SY eess.SY math.OC (updated version)

Activation Steering of Video Generation Models via Reduced-Order Linear Optimal Control

Jihoon Hong, Alice Chan, Qiyue Dai, Julian Skifstad, Glen Chou

Affiliations: Georgia Institute of Technology

AI summary: Proposes the LA-LQR framework, which models text-to-video inference as a dynamical system and uses reduced-order optimal control for minimally invasive activation steering, reducing unsafe generations while preserving visual quality.

Abstract

Text-to-video (T2V) models trained on large-scale web data can generate undesired content, motivating interventions that reduce harmful outputs without sacrificing visual quality. Activation steering offers an attractive mechanistic alternative to finetuning and prompt filtering, but existing T2V steering methods remain limited, typically applying coarse, non-anticipative interventions that can lead to oversteering and content degradation. To close this gap, we propose Latent Activation Linear-Quadratic Regulator (LA-LQR), a reduced-order optimal control framework for minimally invasive T2V steering. LA-LQR formulates T2V inference as a dynamical system and computes closed-loop feedback interventions that steer activations toward desired feature setpoints while penalizing unnecessary perturbations. To make optimal control feasible for high-dimensional video activations, we project activations onto a low-dimensional, task-relevant subspace derived from contrastive prompt pairs, estimate local linear dynamics in this latent space, and solve a latent LQR problem to obtain timestep- and layer-specific steering signals. We provide theoretical bounds relating latent setpoint tracking to raw activation-space feature control, and empirically validate the fidelity of the reduced latent dynamics. On concept steering and video safety benchmarks, LA-LQR reduces unsafe generations relative to baselines, while preserving prompt fidelity and visual quality.
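
The control idea can be illustrated with the simplest possible case: a one-dimensional latent deviation regulated by a finite-horizon discrete-time LQR. This is a toy sketch, not LA-LQR itself: the scalar dynamics, the parameter names (`a`, `b`, `q`, `r`), and the function names are assumptions, and the paper's projection onto a subspace derived from contrastive prompts and its dynamics estimation are not modelled.

```python
def lqr_gains(a, b, q, r, horizon):
    """Time-varying feedback gains from the backward Riccati recursion.

    Assumed scalar model for the deviation e from the setpoint:
        e[t+1] = a * e[t] + b * u[t]
    Cost: sum over t of q * e[t]**2 + r * u[t]**2, terminal weight q.
    A larger r penalizes intervention more, giving gentler steering.
    """
    p = q  # terminal cost-to-go weight
    gains = []
    for _ in range(horizon):
        k = (a * b * p) / (r + b * b * p)
        p = q + a * a * p - a * b * p * k
        gains.append(k)
    gains.reverse()  # gains[t] is the gain applied at step t
    return gains


def steer(z0, setpoint, a, b, gains):
    """Roll out closed-loop feedback u[t] = -gains[t] * e[t].

    Treating e = z - setpoint as the state is exact when a == 1; for other
    values of a, a feedforward term would also be needed.
    Returns the list of deviations e[0..T].
    """
    e = z0 - setpoint
    trajectory = [e]
    for k in gains:
        u = -k * e
        e = a * e + b * u
        trajectory.append(e)
    return trajectory
```

Because the gains come from a cost that weighs deviation against control effort, raising `r` yields smaller gains, which is one way to read the "minimally invasive" goal stated above.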

2606.04773 2026-06-04 cs.CV cs.CL (updated version)

NextMotionQA: Benchmarking and Judging Human Motion Understanding with Vision-Language Models

Yong Cao, Chuqiao Li, Xianghui Xie, Gerard Pons-Moll, Andreas Geiger

Affiliations: University of Tübingen; Tübingen AI Center; Max Planck Institute for Informatics; Saarland Informatics Campus

AI summary: Proposes the NextMotionQA benchmark, which uses three complementary tasks and multi-granularity difficulty stratification to systematically evaluate vision-language models on human motion understanding, and reveals their limitations in fine-grained judging.

Comments 23 pages, 8 figures, 9 tables

Abstract

Reliable evaluation of human motion understanding is fundamental to advancing embodied AI, robotics, and animation. However, existing benchmarks suffer from coarse semantic granularity, undifferentiated difficulty, limited annotation quality, and pervasive answer ambiguity, leaving them unable to diagnose where current models fail. To bridge this gap, we introduce NextMotionQA, a comprehensive benchmark that leverages vision-language models (VLMs) for semi-automated, expert-verified dataset construction. NextMotionQA features three complementary tasks: multiple-choice question answering, video captioning, and fine-grained error correction. Each task is systematically structured across three core semantic axes and stratified into three task complexity levels. Our extensive evaluation of twelve representative VLMs uncovers critical capability gaps and weaknesses that remain invisible under conventional, single-task evaluations. In a complementary direction, recent work has begun using VLMs as judges for text-to-motion evaluation; we ask whether they show the same degradation under harder tasks. We find that VLMs align strongly with expert ratings on coarse criteria (Cohen's κ=0.70) but break down on fine-grained, part-level judgment (κ=0.10), validating the paradigm in its strong regime while clarifying its limits.
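
For readers unfamiliar with the agreement statistic quoted above, Cohen's κ compares the observed agreement between two raters with the agreement expected by chance from their label frequencies: κ = (p_o − p_e) / (1 − p_e). The sketch below is a generic implementation of that definition; the function name is hypothetical, and it says nothing about how the paper collected or binned its expert and VLM ratings.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of items both raters label identically.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both raters used the same single label throughout
    return (p_o - p_e) / (1.0 - p_e)
```

A value near 0 means the raters agree about as often as chance would predict, which is why κ=0.10 on part-level judgments signals a breakdown even if raw agreement looks moderate.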

2606.04772 2026-06-04 cs.CV cs.AI 版本更新

Coarse-to-fine Hierarchical Architecture with Sequential Mamba for Brain Reconstruction

用于脑重建的基于顺序Mamba的粗到细层次架构

Hoang-Son Vo, Van-Hung Bui, Minh-Huy Mai-Duc, Tien-Dung Mai, Soo-Hyung Kim

发表机构 * Chonnam National University, Gwangju, Republic of Korea（全南国立大学，韩国光州）； Vietnam National University - Ho Chi Minh City, University of Science, Vietnam（越南胡志明市国家大学自然科学大学）； Institute for Cybersecurity and Digital Technologies, Russia（俄罗斯网络安全与数字技术研究所）

AI总结提出CHASMBrain，一种基于双流Mamba和粗到细策略的两阶段图像到fMRI编码框架，在NSD数据集上优于基线，并揭示了视觉皮层的因果组织特性。

详情

AI中文摘要

理解深度视觉表征与人类视觉系统之间的关系是计算神经科学中的一个基本挑战。尽管现代视觉模型在图像识别中取得了强劲性能，但它们与人类视觉皮层层次组织的对应关系仍是一个开放问题。在本研究中，我们提出了CHASMBrain，一种新颖的分层两阶段图像到fMRI编码框架。我们的架构利用双流Mamba设计，明确分离并处理全局语义标记和局部空间补丁，这一设计受视觉皮层功能组织的启发。采用粗到细策略：第一阶段预测去噪的ROI级激活，第二阶段使用Mamba-VAE将这些粗响应细化为全体素级预测。在自然场景数据集（NSD）上的实验表明，我们的方法达到了0.429的皮尔逊相关系数和0.261的均方误差，优于所有评估的基线，包括岭回归和DINOv2线性探针。除了预测性能，因果分支消融实验揭示了一种非对称特化：补丁流特定锁定于早期视觉皮层（视网膜拓扑区域），而CLS流为高阶区域提供更广泛的语义上下文——这种对应关系是因果性的，而不仅仅是相关性的。跨被试迁移实验进一步表明，学习到的骨干网络在个体间泛化良好，只需极少的个体适应，表明模型捕捉到了共享的、与主体无关的视觉表征。

英文摘要

Understanding the relationship between deep visual representations and the human visual system is a fundamental challenge in computational neuroscience. While modern vision models achieve strong performance in image recognition, their correspondence with the hierarchical organization of the human visual cortex remains an open question. In this study, we propose CHASMBrain, a novel hierarchical two-stage framework for image-to-fMRI encoding. Our architecture leverages a dual-stream Mamba design to explicitly separate and process global semantic tokens and local spatial patches, motivated by the functional organization of the visual cortex. A coarse-to-fine strategy is employed: Stage 1 predicts denoised ROI-level activations, while Stage 2 refines these coarse responses into full voxel-level predictions using a Mamba-VAE. Experiments on the Natural Scenes Dataset (NSD) demonstrate that our method achieves a Pearson correlation of 0.429 and an MSE of 0.261, outperforming all evaluated baselines including ridge regression and DINOv2 linear probes. Beyond predictive performance, causal branch-ablation experiments reveal an asymmetric specialization: the patch stream is specifically locked to early visual cortex (retinotopic regions), while the CLS stream contributes broader semantic context to higher-order areas -- a correspondence that holds causally, not merely correlationally. Cross-subject transfer experiments further show that the learned backbone generalizes across individuals with minimal per-subject adaptation, suggesting the model captures a shared, subject-agnostic visual representation.

2606.04767 2026-06-04 cs.LG cs.CV 版本更新

Measuring Model Robustness via Fisher Information: Spectral Bounds, Theoretical Guarantees, and Practical Algorithms

通过Fisher信息度量模型鲁棒性：谱界、理论保证与实用算法

Chong Zhang, Xiang Li, Jia Wang, Qiufeng Wang, Xiaobo Jin

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出基于Fisher信息矩阵谱范数的攻击无关鲁棒性度量，理论推导常见架构的闭式谱界，并开发高效估计算法，实验验证其与对抗脆弱性的强相关性。

Comments 35 pages, 1 figure

详情

AI中文摘要

深度神经网络的鲁棒性对于安全关键部署至关重要，但现有评估方法通常依赖于攻击且缺乏可解释性。我们提出了一种基于Fisher信息矩阵（FIM）谱范数的原则性、攻击无关的鲁棒性度量，该度量量化了模型输出分布对输入扰动的最坏情况敏感性。理论上，我们证明了FIM等于输入Jacobian的方差，并推导了常见架构（包括VGG、ResNet、DenseNet和Transformer）的闭式谱界，提供了首个理论鲁棒性排名。为了实现可扩展的评估，我们开发了高效算法，包括幂迭代和基于Hutchinson的估计，支持白盒和黑盒设置。在多个数据集（包括CIFAR、ImageNet和医学图像）和多种架构上的大量实验表明，我们的度量与对抗脆弱性之间存在强相关性。我们的框架作为一种可解释的诊断工具，补充了基于攻击的评估，提供了对架构敏感性的洞察，并指导更鲁棒模型的设计。代码可在https://github.com/franz-chang/SRP/获取。

英文摘要

The robustness of deep neural networks is crucial for safety-critical deployments, yet existing evaluation methods are often attack-dependent and lack interpretability. We propose a principled, attack-agnostic robustness metric based on the spectral norm of the Fisher Information Matrix (FIM), which quantifies the worst-case sensitivity of the model's output distribution to input perturbations. Theoretically, we establish that the FIM equals the variance of the input Jacobian and derive closed-form spectral bounds for common architectures, including VGG, ResNet, DenseNet, and Transformer, providing the first theoretical robustness ranking. To enable scalable evaluation, we develop efficient algorithms, including power iteration and Hutchinson-based estimation, that support both white-box and black-box settings. Extensive experiments across multiple datasets, including CIFAR, ImageNet, and medical images, and across multiple architectures show a strong correlation between our metric and adversarial vulnerability. Our framework serves as an interpretable diagnostic tool that complements attack-based evaluations, offering insights into architectural sensitivity and guiding the design of more robust models. Code is available at: https://github.com/franz-chang/SRP/.
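
To make the power-iteration idea concrete, the following is a minimal sketch under stated assumptions: a hypothetical linear softmax classifier with logits z = Wx, for which the input-space FIM is F(x) = Σ_y p(y|x) g_y g_yᵀ with g_y = ∇_x log p(y|x). All function names are illustrative; this is not the authors' released code and it covers only the white-box case.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def score_vectors(W, x):
    """Return (p, G) with G[y] = grad_x log p(y|x) for logits z = W x."""
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W]
    p = softmax(logits)
    d = len(x)
    # grad_x log p(y|x) = W[y] - sum_k p_k W[k]
    mean_row = [sum(p[k] * W[k][j] for k in range(len(W))) for j in range(d)]
    G = [[W[y][j] - mean_row[j] for j in range(d)] for y in range(len(W))]
    return p, G

def fim_matvec(p, G, v):
    """Compute F v for F = sum_y p_y g_y g_y^T without forming F."""
    out = [0.0] * len(v)
    for p_y, g in zip(p, G):
        coeff = p_y * sum(gi * vi for gi, vi in zip(g, v))
        for j, gj in enumerate(g):
            out[j] += coeff * gj
    return out

def fim_spectral_norm(W, x, iters=200, seed=0):
    """Estimate the largest eigenvalue of the input-space FIM by power iteration."""
    rng = random.Random(seed)
    p, G = score_vectors(W, x)
    v = [rng.gauss(0.0, 1.0) for _ in x]
    norm = math.sqrt(sum(vi * vi for vi in v))
    v = [vi / norm for vi in v]
    lam = 0.0
    for _ in range(iters):
        Fv = fim_matvec(p, G, v)
        lam = math.sqrt(sum(c * c for c in Fv))  # F is PSD, so ||F v|| tends to lambda_max
        if lam == 0.0:
            return 0.0
        v = [c / lam for c in Fv]
    return lam

# Hypothetical two-class model evaluated at the origin.
W_toy = [[1.0, 0.0], [0.0, 1.0]]
lam_toy = fim_spectral_norm(W_toy, [0.0, 0.0])
```

For the toy model, p = (0.5, 0.5) and F = [[0.25, -0.25], [-0.25, 0.25]], whose largest eigenvalue is 0.5. Only matrix-vector products are used, which is what lets the same loop scale to models where F cannot be stored explicitly.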

2606.04764 2026-06-04 cs.CV 版本更新

Do Foundation Models See Biology? Evaluating Attention Coherence with Spatial Transcriptomics in Glioblastoma

基础模型是否理解生物学？利用空间转录组学评估胶质母细胞瘤中的注意力一致性

Dilakshan Srikanthan, Amoon Jamzad, Paul Wilson, Nooshin Maghsoodi, Robert Policelli, Gabor Fichtinger, John F. Rudan, Parvin Mousavi

发表机构 * Translational Medicine, School of Medicine, Queen’s University, Kingston, ON, Canada（皇后大学医学院转化医学系，加拿大安大略省金斯顿）； School of Computing, Queen’s University, Kingston, ON, Canada（皇后大学计算学院，加拿大安大略省金斯顿）； Department of Surgery, Queen’s University, Kingston, ON, Canada（皇后大学外科系，加拿大安大略省金斯顿）

AI总结提出基于空间转录组学的框架，客观评估病理基础模型注意力图与生物学的一致性，发现注意力捕捉多基因转录程序而非单个分子事件。

详情

AI中文摘要

病理基础模型的注意力图是否捕捉真实的生物学仍未知，但这一问题对临床信任和监管批准至关重要。我们提出一个基于空间转录组学的框架，用于无假设的注意力正交评估，并将其应用于五个病理基础模型（CONCH v1.5、UNI v2、Virchow2、GigaPath、H-Optimus-1）和一个ResNet50基线。使用基于注意力的多实例学习，我们训练单任务和多任务模型预测胶质母细胞瘤中的五种分子改变（CPTAC队列），在独立TCGA队列上验证，并使用来自18个样本的共配准Visium空间转录组数据评估注意力图与87个转录特征之间的生物学一致性。内部结果显示，没有单一编码器在所有任务中占优，外部验证则颠倒了内部性能排名。注意力图显示从通路（Cohen's d=0.329）到单个基因（d=0.055）的五倍富集梯度，表明注意力捕捉的是涌现的多基因转录程序而非单个分子事件。空间平滑的注意力图并不意味生物学一致性，不同编码器关注不同的生物学区室。我们的框架提供了对基础模型从组织病理学中学到内容的客观定量评估，推动该领域超越定性显著性图审查。

英文摘要

Whether attention maps from pathology foundation models capture genuine biology remains unknown, yet this question is critical for clinical trust and regulatory approval. We propose a spatial transcriptomics-based framework for orthogonal, hypothesis-free evaluation of attention and apply it to five pathology foundation models (CONCH v1.5, UNI v2, Virchow2, GigaPath, H-Optimus-1) and a ResNet50 baseline. Using attention-based multiple instance learning, we train single-task and multi-task models to predict five molecular alterations in glioblastoma on the CPTAC cohort, validate on an independent TCGA cohort, and evaluate biological coherence of attention maps against 87 transcriptional signatures using co-registered Visium spatial transcriptomics data from 18 samples. Internally, no single encoder dominates across all tasks, and external validation inverts internal performance rankings. Attention maps show a five-fold enrichment gradient from pathways (Cohen's d=0.329) to individual genes (d=0.055), indicating that attention captures emergent multi-gene transcriptional programs rather than individual molecular events. Spatially smooth attention maps do not imply biological coherence, and different encoders attend to distinct biological compartments. Our framework provides objective, quantitative assessment of what foundation models learn from histopathology, moving the field beyond qualitative saliency map review.
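
The enrichment gradient above is reported as Cohen's d. A minimal sketch of the standard pooled-standard-deviation form follows. The grouping into high-attention and low-attention spots and the toy scores are assumptions for illustration; the abstract does not state exactly how the two groups were defined.

```python
import math

def cohens_d(group_a, group_b):
    """Cohen's d using the pooled sample standard deviation (ddof = 1)."""
    na, nb = len(group_a), len(group_b)
    if na < 2 or nb < 2:
        raise ValueError("each group needs at least two values")
    mean_a = sum(group_a) / na
    mean_b = sum(group_b) / nb
    var_a = sum((v - mean_a) ** 2 for v in group_a) / (na - 1)
    var_b = sum((v - mean_b) ** 2 for v in group_b) / (nb - 1)
    pooled = math.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    if pooled == 0.0:
        raise ValueError("pooled standard deviation is zero")
    return (mean_a - mean_b) / pooled

# Hypothetical signature scores (illustration only, not data from the paper).
high_attention_scores = [2.0, 4.0, 6.0]
low_attention_scores = [1.0, 3.0, 5.0]
d = cohens_d(high_attention_scores, low_attention_scores)
```

Here the means differ by 1 and the pooled standard deviation is 2, giving d = 0.5.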

2606.04737 2026-06-04 cs.CV 版本更新

Physics-Informed Video Generation via Mixture-of-Experts Latent Alignment

基于物理信息的视频生成：通过混合专家潜在对齐

Cong Wang, Hanxin Zhu, Jiayi Luo, Yonglin Tian, Xiaoqian Cheng, Peiyan Tu, Xin Jin, Long Chen, Zhibo Chen

发表机构 * CASIA（中国科学院自动化研究所）； UCAS（中国科学院大学）； ZGCA（中关村学院）； USTC（中国科学技术大学）； BUAA（北京航空航天大学）； ZJU（浙江大学）； EIT（宁波东方理工大学）

AI总结提出PILA框架，通过混合专家潜在对齐将物理结构化潜在引导注入预训练视频模型的冻结流匹配动力学，以提升生成视频的物理合理性。

详情

AI中文摘要

大规模视频生成模型在语义一致性和视觉质量方面取得了显著进展，生成的视频越来越连贯且视觉上令人信服。然而，由像素级拟合引发的动态过程自然无法适应支配真实世界运动和交互的规律性，导致在物理合理性方面持续存在不足。为解决这一局限，我们提出了PILA（物理信息潜在对齐），一个将物理结构化的潜在引导注入预训练视频模型冻结流匹配动力学的框架。具体而言，PILA首先采用锚定场估计，将冻结生成器的潜在变量映射到一个由场代理槽组织的可操作物理属性库中，利用可观测运动作为运动学锚点来构建较难直接观测的代理。为处理真实世界动态的异质性，PILA采用基于物理类别的混合专家设计。标签先验掩码专家路由选择特定类别的算子专家，其精炼结果通过从物理关系中抽象出的操作残差进行正则化。最后，精炼后的代理被融合回物理属性库，并解码为流匹配向量场的修正，从而在保持预训练骨干网络视觉先验的同时注入物理感知引导。通过在Wan 2.1-1.3B上进行分阶段适配器训练，并将学到的适配器直接迁移到Wan 2.2-14B，PILA在VBench-2.0、VideoPhy-2和PhyGenBench上，在视觉质量和基准测量的物理合理性方面均达到了最先进的结果。

英文摘要

Large-scale video generation models have made remarkable progress in semantic consistency and visual quality, producing videos that are increasingly coherent and visually convincing. Nevertheless, the dynamics induced by pixel-level fitting do not naturally accommodate the regularities that govern real-world motion and interaction, resulting in persistent shortcomings in physical plausibility. To address this limitation, we propose PILA (Physics-Informed Latent Alignment), a framework that injects physics-structured latent guidance into the frozen flow-matching dynamics of pretrained video models. Specifically, PILA first employs anchored field estimation to map frozen-generator latents into an operational physical attribute bank organized by field-proxy slots, using observable motion as a kinematic anchor for constructing less directly observed proxies. To handle the heterogeneity of real-world dynamics, PILA adopts a mixture-of-experts design over physical categories. Label-prior masked expert routing selects category-specific operator experts, whose refinements are regularized by operational residuals abstracted from physical relations. Finally, the refined proxies are fused into the physical attribute bank and decoded into a correction to the flow-matching vector field, injecting physics-aware guidance while preserving the visual prior of the pretrained backbone. With staged adapter training on Wan 2.1-1.3B and direct transfer of the learned adapter to Wan 2.2-14B, PILA achieves state-of-the-art results on VBench-2.0, VideoPhy-2, and PhyGenBench in both visual quality and benchmark-measured physical plausibility.

2606.04722 2026-06-04 cs.CV 版本更新

StrokeTimer: Robust Representation Learning for Ischemic Stroke Onset-Time Estimation from Non-contrast CT

StrokeTimer: 基于非增强CT的缺血性卒中发病时间估计的鲁棒表示学习

Weiru Wang, Susanne G. H. Olthuis, Elizaveta Lavrova, Robert J. van Oostenbrugge, Charles B. L. M. Majoie, Wim H. van Zwam, Ruisheng Su

发表机构 * Department of Biomedical Engineering, Eindhoven University of Technology（埃因霍温理工大学生物医学工程系）； Graduate School of Life Sciences, Utrecht University（乌得勒支大学生命科学研究生院）； Department of Neurology, Maastricht University Medical Centre+（马斯特里赫特大学医学中心神经科）； Precision Medicine Department, GROW Research Institute for Oncology and Reproduction, Maastricht University（马斯特里赫特大学精准医学部，GROW肿瘤与生殖医学研究所）； Department of Radiology and Nuclear Medicine, Amsterdam University Medical Centre（阿姆斯特丹大学医学中心放射学与核医学系）； Department of Radiology and Nuclear Medicine, Maastricht University Medical Centre+（马斯特里赫特大学医学中心放射学与核医学系）

AI总结提出StrokeTimer框架，通过自监督解耦学习和能量引导对比学习，从非增强CT中估计缺血性卒中发病时间，在大型多中心数据集上实现宏AUC 0.69和宏F1 0.57，较基线提升近50%。

Comments Early accepted at MICCAI 2026

详情

AI中文摘要

缺血性卒中是一种主要的全球性疾病。治疗决策高度时间敏感，因为再灌注治疗的资格取决于卒中发病与干预之间的时间间隔。然而，在临床实践中，真实的发病时间往往不确定，因此需要基于影像的组织年龄评估作为替代标志物。常规非增强CT（NCCT）上的早期缺血性改变通常很细微，而真实世界的临床数据集表现出显著的发病时间类别不平衡和中心-扫描仪相关的异质性。在这项工作中，我们提出了StrokeTimer，一个用于急性缺血性卒中发病时间估计的全自动框架。StrokeTimer整合了自监督解耦学习和能量引导对比学习，以捕捉细微的缺血模式，同时解决采集变异下的长尾数据分布。发病时间被分为三个临床相关窗口：<4.5小时、4.5-6小时和>6小时。在两个国家队列（MR CLEAN Registry和MR CLEAN LATE）的大型多中心NCCT数据集上的实验结果表明，StrokeTimer实现了宏AUC 0.69和宏F1分数0.57，比最强基线提高了近50%（p < 0.005）。在这个现实且具有挑战性的设置中，代表性基线方法表现出接近随机的宏性能。模型解释进一步突出了与已建立的放射学生物标志物一致的细微灰白质模糊和低密度区域。这些发现证明了StrokeTimer在支持急性缺血性卒中治疗决策方面的潜力。代码可在https://github.com/BrainVas/StrokeTimer获取。

英文摘要

Ischemic stroke is a major global disease. Treatment decisions are highly time-sensitive, as eligibility for reperfusion therapies relies on the interval between stroke onset and intervention. However, the true onset time is often uncertain in clinical practice, necessitating imaging-based assessment of tissue age as a surrogate marker. Early ischemic changes on routinely acquired non-contrast CT (NCCT) are often subtle, and real-world clinical datasets exhibit pronounced onset-time class imbalance and center-scanner-related heterogeneity. In this work, we propose StrokeTimer, a fully automated framework for onset-time estimation in acute ischemic stroke. StrokeTimer integrates self-supervised disentanglement learning with energy-guided contrastive learning to capture subtle ischemic patterns while addressing long-tailed data distributions under acquisition variability. Onset time is categorized into three clinically relevant windows: <4.5 h, 4.5-6 h, and >6 h. Experimental results on a large multi-center NCCT dataset from two national cohorts, MR CLEAN Registry and MR CLEAN LATE, show that StrokeTimer achieves a macro AUC of 0.69 and a macro F1-score of 0.57, improving the strongest baseline by nearly 50% (p < 0.005). In this realistic, challenging setting, representative baseline approaches exhibit near-chance macro performance. Model explanations further highlight subtle gray-white matter blurring and hypodense regions consistent with established radiological biomarkers. These findings demonstrate the potential of StrokeTimer to support treatment decision-making in acute ischemic stroke. Code is available at https://github.com/BrainVas/StrokeTimer.
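
Because the three onset-time windows are imbalanced, the abstract reports macro-averaged metrics. The sketch below shows how macro F1 weights every class equally, so a model that always predicts the majority window scores poorly. The labels and predictions are hypothetical toy values, not results from the paper.

```python
def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class that is never predicted and never present contributes 0.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

windows = ["<4.5h", "4.5-6h", ">6h"]
# Hypothetical imbalanced toy labels (illustration only).
y_true = ["<4.5h", "<4.5h", "<4.5h", "<4.5h", "4.5-6h", ">6h"]
always_majority = ["<4.5h"] * 6
mixed = ["<4.5h", "<4.5h", "<4.5h", "<4.5h", "<4.5h", ">6h"]

f1_majority = macro_f1(y_true, always_majority, windows)
f1_mixed = macro_f1(y_true, mixed, windows)
```

The majority-only predictor reaches 4/6 accuracy but a macro F1 of only 0.8/3, which is why near-chance macro performance can coexist with seemingly reasonable accuracy on long-tailed data.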

2606.04710 2026-06-04 cs.CV 版本更新

Data Efficient Complex Feature Fusion Network For Hyperspectral Image Classification

数据高效复杂特征融合网络用于高光谱图像分类

Maitreya Shelare, Atharva Satam, Poonam Sonar, Sneha Burnase

发表机构 * Department of Electronics and Telecommunication, Rajiv Gandhi Institute of Technology, University of Mumbai（电子与电信系，拉吉夫甘地技术学院，孟买大学）

AI总结提出一种数据高效的注意力双支路复杂特征融合网络（DE-CFFN），通过因子分析降维和3D卷积层滤波器数量减半来减少模型复杂度，同时保持与CFFN相当的分类性能。

Comments 10 pages, 3 figures

详情

DOI: 10.1007/978-981-95-3616-0_7
Journal ref: In Proceedings of International Conference on Wireless Communication (ICWiCOM 2025), Lecture Notes in Electrical Engineering, vol. 1499, Springer, 2025

AI中文摘要

本工作提出了一种数据高效的基于注意力的双支路复杂特征融合网络（CFFN）变体，用于高光谱图像分类。所提出的模型称为DE-CFFN，保留了原始的双流结构：实值神经网络（RVNN）处理标准高光谱图像块，而复值神经网络（CVNN）处理其傅里叶变换后的对应物。本工作的主要贡献在于特征提取过程和架构增强。使用因子分析进行降维，相比主成分分析提供了更好的潜在特征表示。此外，RVNN和CVNN流均通过将3D卷积层中的滤波器数量逐次减半来减少复杂度。两个分支的输出被拼接并通过一个挤压激励（SE）块以增强联合特征表示。在Pavia University和Salinas数据集上的评估表明，DE-CFFN实现了与CFFN相当的分类性能，同时显著减小了模型大小、内存消耗和推理延迟，使其适用于实时高光谱成像应用。

英文摘要

This work presents a data-efficient variant of the Attention-Based Dual-Branch Complex Feature Fusion Network (CFFN) for hyperspectral image classification. The proposed model, termed DE-CFFN, retains the original two-stream structure: the Real-Valued Neural Network (RVNN) processes standard hyperspectral patches, while the Complex-Valued Neural Network (CVNN) handles their Fourier-transformed counterparts. The main contribution of this work lies in the feature extraction process and architectural enhancement. Factor Analysis is used for dimensionality reduction, offering improved latent feature representation over Principal Component Analysis. Additionally, both the RVNN and CVNN streams are structurally modified by successively halving the number of filters in the 3D convolutional layers to reduce complexity. The outputs of both branches are concatenated and passed through a Squeeze and Excitation (SE) block to enhance joint feature representation. Evaluated on the Pavia University and Salinas datasets, DE-CFFN achieves classification performance comparable to CFFN, while significantly reducing model size, memory consumption, and inference latency, making it suitable for real-time hyperspectral imaging applications.
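
The Squeeze and Excitation step mentioned above can be sketched as follows. This is a generic SE block written with plain lists, assuming a reduction to a single hidden unit and hand-picked weights; the layer sizes and weights are hypothetical and do not reflect DE-CFFN's actual configuration.

```python
import math

def se_block(features, w1, b1, w2, b2):
    """Squeeze-and-Excitation over C channels.

    features: list of C channels, each a flat list of activations.
    w1: (C_reduced x C) weights, w2: (C x C_reduced) weights.
    """
    # Squeeze: global average pooling per channel.
    squeezed = [sum(ch) / len(ch) for ch in features]
    # Excitation: fully connected -> ReLU -> fully connected -> sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed)) + b)
              for row, b in zip(w1, b1)]
    gates = [1.0 / (1.0 + math.exp(-(sum(w * h for w, h in zip(row, hidden)) + b)))
             for row, b in zip(w2, b2)]
    # Scale: reweight each channel by its gate in (0, 1).
    return [[g * v for v in ch] for g, ch in zip(gates, features)]

# Hypothetical concatenated features from the two branches (two channels).
fused = [[1.0, 3.0], [2.0, 2.0]]
neutral = se_block(fused, [[0.0, 0.0]], [0.0], [[0.0], [0.0]], [0.0, 0.0])
reweighted = se_block(fused, [[1.0, 0.0]], [0.0], [[1.0], [-1.0]], [0.0, 0.0])
```

With all-zero weights every gate is sigmoid(0) = 0.5, so the block halves each channel; with the second set of weights the first channel is emphasized and the second suppressed.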

2606.04706 2026-06-04 cs.CV 版本更新

ReConFuse: Reconstruction-Error Guided Semantic Fusion for AI-Generated Video Detection

ReConFuse: 重建误差引导的语义融合用于AI生成视频检测

Xiaojing Chen, Xinyu Lu, Changtao Miao, Yunfeng Diao

发表机构 * Anhui University（安徽大学）； Ant Group（蚂蚁集团）； Hefei University of Technology（合肥工业大学）

AI总结提出ReConFuse框架，利用预训练WF-VAE的重建误差作为鉴别线索，结合多帧语义特征和Mamba时序建模，实现AI生成视频的鲁棒检测。

详情

AI中文摘要

AI生成的视频变得越来越逼真，引发了关于错误信息、内容真实性和媒体信任的严重担忧。因此，可靠的AI生成视频检测对于多媒体取证至关重要，但由于需要捕捉空间伪影、时间动态并泛化到不断演变的生成模型，这仍然具有挑战性。在本文中，我们探索重建误差作为AI生成视频检测的判别性取证线索。通过使用预训练的WF-VAE重建输入视频，我们观察到真实视频和生成视频表现出可区分的逐帧重建误差模式，表明重建误差可以揭示它们的分布差异。然而，将基于重建的图像检测扩展到视频并非易事，因为视频重建误差在帧间具有时间组织性，并且需要语义上下文才能有效解释。为了应对这些挑战，我们提出了ReConFuse，一个用于视频级AI生成视频检测的重建引导语义融合框架。ReConFuse从WF-VAE重建的视频中提取重建误差线索，将其与多帧语义特征对齐，并使用基于Mamba的模块对时间演化进行建模以进行视频级分类。在多个生成器和评估设置上的实验证明了ReConFuse的有效性和强大的泛化能力。

英文摘要

AI-generated videos are becoming increasingly realistic, raising serious concerns about misinformation, content authenticity, and media trust. Reliable AI-generated video detection is therefore essential for multimedia forensics, yet remains challenging due to the need to capture spatial artifacts and temporal dynamics and to generalize to evolving generative models. In this paper, we explore reconstruction error as a discriminative forensic cue for AI-generated video detection. By reconstructing input videos with a pretrained WF-VAE, we observe that real and generated videos exhibit distinguishable frame-wise reconstruction error patterns, suggesting that reconstruction errors can reveal their distributional discrepancies. However, extending reconstruction-based image detection to videos is non-trivial, since video reconstruction errors are temporally organized across frames and require semantic context for effective interpretation. To address these challenges, we propose ReConFuse, a reconstruction-guided semantic fusion framework for video-level AI-generated video detection. ReConFuse extracts reconstruction error cues from WF-VAE reconstructed videos, aligns them with multi-frame semantic features, and uses a Mamba-based module to model temporal evolution for video-level classification. Experiments across multiple generators and evaluation settings demonstrate the effectiveness and strong generalization ability of ReConFuse.

2606.04705 2026-06-04 cs.CV cs.AI 版本更新

Enhancing MedSAM with a Lightweight Box Predictor for Medical Image Segmentation

通过轻量级框预测器增强 MedSAM 用于医学图像分割

Amirhossein Movahedisefat, Amirreza Fateh, Mohammad Reza Mohammadi

发表机构 * School of Computer Engineering, Iran University of Science and Technology (IUST)（伊朗科学技术大学计算机工程学院）

AI总结提出一种集成轻量级框预测器的 MedSAM 增强框架，通过单次点击估计边界框以提升点提示的空间引导能力，在仅增加 1.6M 参数下显著提高多模态医学图像分割的准确性和鲁棒性。

详情

AI中文摘要

医学图像中的语义分割是一项关键但具有挑战性的任务，原因是数据稀缺和跨模态的高变异性。虽然像 Segment Anything Model (SAM) 这样的基础模型显示出潜力，但它们在没有特定适应的情况下往往难以处理医学图像。此外，点提示尽管是最自然的用户交互形式，但为可靠分割提供的空间上下文不足，特别是当目标结构不规则或对比度差时。在本文中，我们提出了一种增强的分割框架，将轻量级框预测器模块集成到 MedSAM 架构中。框预测器通过使用局部图像嵌入特征从单次用户点击估计近似边界框，提供空间引导以减少点提示的模糊性，同时仅引入 1.6M 额外参数和可忽略的推理开销。我们引入了一个两阶段训练流程，其中框预测器在集成到 MedSAM 之前独立训练。为了验证我们方法的泛化能力，我们在四个不同的数据集（FLARE22、BRISC、BUSI、LungSegDB）上进行了广泛评估，这些数据集涵盖不同的成像模态，包括 CT、MRI 和超声。我们的方法在不同解剖结构和成像领域中提高了分割准确性和鲁棒性，在 BUSI、FLARE22、BRISC 和 LungSegDB 上分别达到了 0.89、0.93、0.88 和 0.98 的 Dice 分数。代码可在 https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor 获取。

英文摘要

Semantic segmentation in medical imaging is a critical yet challenging task due to data scarcity and high variability across modalities. While foundation models like the Segment Anything Model (SAM) show promise, they often struggle with medical images without specific adaptation. Moreover, point prompts, despite being the most natural form of user interaction, provide insufficient spatial context for reliable segmentation, particularly when target structures are irregular or poorly contrasted. In this paper, we propose an enhanced segmentation framework that integrates a lightweight Box Predictor module into the MedSAM architecture. The Box Predictor estimates an approximate bounding box from a single user click using localized image embedding features, providing spatial guidance that reduces the ambiguity of point prompts, while introducing only 1.6M additional parameters and negligible inference overhead. We introduce a two-stage training pipeline where the Box Predictor is trained independently before being integrated into MedSAM. To validate the generalization capability of our method, we conduct extensive evaluations on four diverse datasets (FLARE22, BRISC, BUSI, LungSegDB) spanning distinct imaging modalities, including CT, MRI, and Ultrasound. Our method improves segmentation accuracy and robustness across varied anatomical structures and imaging domains, achieving Dice scores of 0.89 (BUSI), 0.93 (FLARE22), 0.88 (BRISC), and 0.98 (LungSegDB). Code is available at https://github.com/Amirhosseinmovahedi/MedSAM-BoxPredictor
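
The results above are Dice scores. A minimal sketch of the Dice coefficient on binary masks is given below, assuming masks are nested lists of 0/1 values; the masks are hypothetical toy data and the empty-mask convention (returning 1.0) is one common choice rather than something stated in the abstract.

```python
def dice_score(pred_mask, true_mask):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for binary masks of equal shape."""
    intersection = pred_area = true_area = 0
    for pred_row, true_row in zip(pred_mask, true_mask):
        for p, t in zip(pred_row, true_row):
            intersection += p * t
            pred_area += p
            true_area += t
    total = pred_area + true_area
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

# Hypothetical 2x3 masks (illustration only).
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]
dice = dice_score(pred, truth)
```

Both masks cover three pixels and overlap on two, so the score is 4/6.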

2606.04701 2026-06-04 cs.CV cs.CL 版本更新

Benchmarking Living-Screen-Native GUI Agents on Short-Video Platforms

在短视频平台上对原生动态屏幕GUI代理的基准测试

Jiashu Yao, Heyan Huang, Daiqing Wu, Wangke Chen, Huaxi Ai, Haoyu Wen, Zeming Liu, Yuhang Guo

发表机构 * Beijing Institute of Technology（北京理工大学）； Tsinghua University（清华大学）； Beihang University（北京航空航天大学）

AI总结针对短视频平台等动态屏幕环境，提出LivingScreen基准测试，通过三级任务套件和联合评估准确性与信息效率的指标，发现现有GUI代理存在观察过度或不足的问题。

Comments preprint

2606.04700 2026-06-04 cs.CV 版本更新

A New Angle on Bones: Robust Pose Estimation in X-Ray and Ultrasound

骨骼的新视角：X射线和超声中的鲁棒姿态估计

Ron Keuth, Christoph Großbröhmer, Franziska Halm, Miriam Johann, Anne-Nele Schröder, Ludger Tüshaus, Mattias P. Heinrich, Lasse Hansen

发表机构 * Medical Informatics, University of Lübeck（吕贝克大学医学信息学系）； Institute of Radiology and Nuclear Medicine, University Hospital Schleswig-Holstein（石勒苏益格-荷尔斯泰因大学医院放射学与核医学研究所）； Paediatric Surgery, University Hospital Schleswig-Holstein（石勒苏益格-荷尔斯泰因大学医院小儿外科）； EchoScout GmbH

AI总结提出基于学习的关键点候选和鲁棒线模型（RANSAC、霍夫变换）的自动骨骼姿态估计方法，在儿科骨折和髋关节发育不良评估中达到临床可接受的误差并优于地标方法。

Comments Code and annotations for fracture angle assessment in radiographs: https://github.com/multimodallearning/RobustBonePoseEstimation

详情

AI中文摘要

测量骨骼结构之间的角度是医学图像分析中的常规任务，为诊断和治疗规划提供关键的定量参数。自动化方法可以减少时间和成本，同时提高可重复性。在这项工作中，我们通过基于学习的关键点候选提议，随后使用线模型提取轴参数，来解决自动骨骼姿态估计问题。由于传统线模型如最小二乘法对异常值敏感，我们结合了假阳性减少策略和鲁棒拟合技术，如RANSAC和霍夫变换，以提高鲁棒性。我们在三个临床相关的儿科角度估计任务上评估了我们的方法：X射线和超声中的骨折碎片评估，以及使用Graf方法的超声中髋关节发育不良评估。我们的方法分别实现了$4.1^\circ$、$5.4^\circ$和$5.51^\circ$的平均误差，不仅保持在预期的临床观察者变异范围内，而且显著优于基于地标的方法。我们的代码和用于X射线骨折角度评估的注释已在GitHub上公开。

英文摘要

Measuring the angle between bone structures is a routine task in medical image analysis and provides a key quantitative parameter for diagnosis and treatment planning. Automated methods can reduce time and cost while improving reproducibility. In this work, we address automatic bone pose estimation using a learning-based point candidate proposal followed by a line model to extract axis parameters. Since conventional line models such as least squares are sensitive to outliers, we incorporate false-positive reduction strategies and robust fitting techniques, such as RANSAC and Hough transforms, to improve robustness. We evaluate our method on three clinically relevant paediatric angle estimation tasks: fracture fragment assessment in radiographs and ultrasound and developmental dysplasia of the hip evaluation in ultrasound using the Graf method. Our approach achieves mean errors of $4.1^\circ$, $5.4^\circ$, and $5.51^\circ$, respectively, not only remaining within the expected clinical observer variability, but also significantly outperforming landmark-based methods. Our code and annotations for fracture angle assessment in radiographs are publicly available on GitHub.
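
The robust line-fitting step described above can be sketched with a plain RANSAC loop followed by a total-least-squares refit on the inliers. Everything below is an illustrative assumption: the point candidates are hand-made toy data standing in for the learned proposals, and the function names, tolerance, and iteration count are hypothetical rather than taken from the released code.

```python
import math
import random

def fit_axis_ransac(points, n_iters=200, inlier_tol=1.0, seed=0):
    """Fit a line to 2D points with RANSAC; return its orientation in degrees."""
    if len(points) < 2:
        raise ValueError("need at least two points")
    rng = random.Random(seed)
    best_inliers = []
    for _ in range(n_iters):
        (x1, y1), (x2, y2) = rng.sample(points, 2)
        dx, dy = x2 - x1, y2 - y1
        length = math.hypot(dx, dy)
        if length == 0.0:
            continue  # degenerate sample: two identical points
        # Keep points whose perpendicular distance to the sampled line is small.
        inliers = [(x, y) for x, y in points
                   if abs(dy * (x - x1) - dx * (y - y1)) / length <= inlier_tol]
        if len(inliers) > len(best_inliers):
            best_inliers = inliers
    if len(best_inliers) < 2:
        raise ValueError("no consensus line found")
    # Refit on the consensus set: principal axis of the 2x2 scatter matrix.
    n = len(best_inliers)
    mx = sum(x for x, _ in best_inliers) / n
    my = sum(y for _, y in best_inliers) / n
    sxx = sum((x - mx) ** 2 for x, _ in best_inliers)
    syy = sum((y - my) ** 2 for _, y in best_inliers)
    sxy = sum((x - mx) * (y - my) for x, y in best_inliers)
    return math.degrees(0.5 * math.atan2(2.0 * sxy, sxx - syy))

def angle_between_axes(angle_a, angle_b):
    """Smallest angle between two undirected lines, in degrees (0 to 90)."""
    diff = abs(angle_a - angle_b) % 180.0
    return min(diff, 180.0 - diff)

# Hypothetical candidates: two straight fragments plus gross outliers.
fragment_a = [(float(i), 0.0) for i in range(10)] + [(3.0, 50.0), (7.0, -40.0)]
fragment_b = [(float(i), float(i)) for i in range(10)] + [(0.0, 30.0), (9.0, -20.0)]
angle = angle_between_axes(fit_axis_ransac(fragment_a), fit_axis_ransac(fragment_b))
```

A plain least-squares fit would be pulled toward the outliers in each fragment, whereas the consensus step discards them before the refit, so the recovered axes here differ by 45 degrees.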

2606.04699 2026-06-04 cs.LG cs.AI cs.CV 版本更新

目标检测中的实例级事后不确定性量化

Chongzhe Zhang, Zifan Zeng, Qunli Zhang, Feng Liu, Zheng Hu

发表机构 * Tsinghua University（清华大学）

AI总结提出蒙特卡洛广义线性模型（MC-GLM），用于目标检测中实例级、近似事后不确定性量化，无需重新训练，在nuScenes数据集上验证了有效性。

Comments 7 pages, 2 figures

2606.04613 2026-06-04 cs.CV cs.LG 版本更新

Beyond Symmetric Alignment: Spectral Diagnostics of Modality Imbalance in Vision-Language Models in the Medical Domain

超越对称对齐：医学领域视觉-语言模型中模态不平衡的谱诊断

Alessandro Gambetti, Qiwei Han, Cláudia Soares, Hong Shen

发表机构 * NOVA School of Science and Technology（诺瓦科学与技术学校）； Nova School of Business and Economics（诺瓦商业与经济学校）； Carnegie Mellon University（卡内基梅隆大学）

AI总结提出非对称谱对齐分数（SAS），通过特征值加权的特征模态相关性量化模态信息不平衡，并在医学图像-文本数据集上评估15个VLM，发现医学图像比临床报告保留更丰富的结构信息，且SAS与检索性能的相关性最强。

Comments 10 pages, 3 figures, 9 tables

详情

AI中文摘要

视觉-语言模型（VLM）在应用于医学图像-文本数据时表现不佳，但可用于诊断这种失败的工具仍然有限。现有的表示对齐度量是对称的，将两种模态合并为一个分数，隐藏了哪种模态驱动了跨模态退化。我们引入了谱对齐分数（SAS），这是一种非对称度量，将两种模态投影到锚定模态的主特征基上，并计算特征值加权的每个特征模态的相关性，从而得到方向性分数，其差值量化了模态信息不平衡。我们将SAS嵌入到一个基准框架中，评估了15个VLM在自然和医学图像-文本数据集上的表现，同时使用了6种对齐度量和双向检索。我们的实验表明，医学图像比其配对的临床报告保留了更丰富的结构信息，这种方向性不对称是所有竞争度量无法察觉的，并且SAS在医学领域实现了与检索性能的最强零标签相关性，使其成为临床部署的实用诊断工具。代码可在以下网址获取：https://github.com/iamalegambetti/medical-vlms-assessment。

英文摘要

Vision-Language Models (VLMs) struggle when applied to medical image-text data, yet the tools available to diagnose this failure remain limited. Existing representation alignment metrics are symmetric, collapsing both modalities into a single score and hiding which modality drives cross-modal degradation. We introduce the Spectral Alignment Score (SAS), an asymmetric metric that projects both modalities onto the principal eigenbasis of an anchor modality and computes eigenvalue-weighted per-eigenmode correlations, resulting in directional scores whose difference quantifies modality information imbalance. We embed SAS within a benchmarking framework evaluating 15 VLMs across natural and medical image-text datasets alongside 6 alignment metrics and bidirectional retrieval. Our experiments show that medical images retain richer structural information than their paired clinical reports, a directional asymmetry invisible to all competing metrics, and that SAS achieves the strongest zero-label correlation with retrieval performance in the medical domain, positioning it as a practical diagnostic tool for clinical deployment. Code is available at this URL: https://github.com/iamalegambetti/medical-vlms-assessment.

2606.04604 2026-06-04 cs.CV 版本更新

COMBINER: Composed Image Retrieval Guided by Attribute-based Neighbor Relations

COMBINER: 基于属性邻居关系的组合图像检索

Zixu Li, Yupeng Hu, Zhiwei Chen, Haokun Wen, Xuemeng Song, Liqiang Nie

发表机构 * School of Software, Shandong University（山东大学软件学院）； School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳）计算机科学与技术学院）； Department of Data Science, City University of Hong Kong（香港城市大学数据科学系）； Department of Computer Science and Engineering, Southern University of Science and Technology（南方科技大学计算机科学与工程系）

AI总结针对组合图像检索中视觉相似但属性不同的样本问题，提出基于属性原型的跨模态统一表示方法COMBINER，通过自适应语义解耦、统一原型组合和双重关系建模提升检索准确性。

Comments Accepted by IEEE TIP 2026

详情

AI中文摘要

组合图像检索（CIR）是一项具有挑战性的检索任务，旨在通过多模态输入定位特定图像。尽管CIR技术近期取得了进展，但先前的方法常常忽略视觉上相似但属性不同的情况，这可能削弱多模态特征融合和相似性建模。为缓解这一限制，我们基于属性原型设计了跨模态特征的统一表示。然而，由于三个核心问题，该任务远非直接：（1）属性级语义的纠缠，（2）模态间的不一致性，以及（3）监督信号缺失。为解决上述障碍，我们引入了基于属性邻居关系的组合图像检索网络（COMBINER）。具体而言，我们首先设计了一个自适应语义解耦模块，能够基于多模态原始特征解耦属性特征。其次，我们提出了一个统一原型组合模块，可以构建跨模态统一原型（CUP）并促进多模态特征组合。最后，我们引入了一个双重关系建模模块，能够基于属性相似性挖掘成对和邻居关系。与传统的邻居关系建模CIR方法相比，COMBINER是首个解决视觉相似但属性无关样本现象的研究。它通过采用基于属性原型的相似性度量，实现了对样本间语义关系的更准确理解。在三个基准数据集上进行的全面实验证实了我们提出的COMBINER的有效性。我们的方法实现将在https://github.com/Lee-zixu/COMBINER上提供。

英文摘要

Composed Image Retrieval (CIR) is a challenging retrieval task that aims to locate specific images through multimodal inputs. Despite recent progress in CIR techniques, prior approaches often overlook cases where images appear visually alike yet differ in attributes, potentially undermining both multimodal feature fusion and similarity modeling. To mitigate this limitation, we design a unified representation of cross-modal features based on attribute prototypes. Nevertheless, the task is far from straightforward, owing to three core issues: (1) entanglement in attribute-level semantics, (2) inconsistency across modalities, and (3) missing supervision signals. To tackle the above obstacles, we introduce a COMposed image retrieval network guided By attrIbute-based NEighbor Relations (COMBINER). Specifically, we first design an Adaptive Semantic Disentanglement module, which disentangles attribute features based on multimodal primitive features. Second, we propose a Unified Prototype-based Composition module, which constructs cross-modal unified prototypes (CUP) and facilitates multimodal feature composition. Finally, we introduce a Dual Relations Modeling module, which mines pairwise and neighbor relations based on attribute similarity. Compared to traditional CIR methods that model neighbor relations, COMBINER is the first study to address the phenomenon of visually similar but attribute-unrelated samples. It achieves a more accurate understanding of the semantic relations among samples by employing an attribute prototype-based similarity metric. Comprehensive experiments conducted on three benchmark datasets confirm the effectiveness of our proposed COMBINER. The implementation of our method will be available at https://github.com/Lee-zixu/COMBINER.

2606.04593 2026-06-04 cs.CV 版本更新

4D Reconstruction from Sparse Dynamic Cameras

来自稀疏动态相机的4D重建

Kazuki Ozeki, Shun Kenney, Yuto Shibata, Eisuke Takeuchi, Takuya Narihira, Kazumi Fukuda, Ryosuke Sawata, Yuki Mitsufuji, Yoshimitsu Aoki

发表机构 * Keio University（庆应大学）； Sony AI（索尼人工智能）； Sony Group Corporation（索尼集团）

AI总结针对稀疏动态相机设置下的4D重建，提出一种通过集成跨相机特征匹配与帧内点跟踪来确保时空一致性的3D轨迹初始化方法，并引入噪声鲁棒的深度排序正则化损失和时空多样批次采样策略，在自建数据集LetCamsGo上验证了动态区域重建质量的提升。

Comments Accepted by 4DV Workshop at CVPR 2026

详情

AI中文摘要

尽管从单目动态相机进行动态3D（即4D）重建最近取得了进展，但其仍然受到深度模糊的根本限制。本文关注一种替代实用方案，即稀疏动态相机设置，其中少量独立移动的相机捕捉相同的对象。在保持低成本的同时，这种设置引入了多视图约束，并且对于现实世界的视频制作（如体育、音乐会和电视节目）仍然实用。尽管有潜力，但我们的实验表明，现有单目或密集固定相机方法的简单扩展是不够的，因为它们无法解决跨视图和时间的复杂时空不一致性。为填补这一空白，我们提出了一种简单而有效的3D轨迹初始化方法，通过集成跨相机特征匹配与帧内点跟踪来确保时空一致性。此外，我们引入了噪声鲁棒的深度排序正则化损失和时空多样批次采样策略，以增强优化稳定性和跨视图泛化。进一步地，为解决此任务缺乏标准化基准的问题，我们引入了LetCamsGo，这是一个新的真实世界视频数据集，包含4个不同环境中的5个序列，由三个独立移动的相机和一个固定相机记录。在LetCamsGo上的全面基准测试表明，与基线相比，我们提出的框架提高了动态区域的4D重建质量，为野外低成本4D重建范式铺平了道路。

英文摘要

Although dynamic 3D (i.e., 4D) reconstruction from a monocular dynamic camera has recently advanced, it remains fundamentally limited by depth ambiguity. In this paper, we focus on a practical alternative, the sparse dynamic camera setup, where a handful of independently moving cameras capture the same subjects. While keeping capture costs low, this setup introduces multi-view constraints and remains practical for real-world video production such as sports, concerts, and TV shows. Despite its potential, our experiments show that naive extensions of existing monocular or dense fixed-camera methods are insufficient since they fail to resolve the complex spatiotemporal inconsistencies across views and time. To fill this gap, we propose a simple yet effective 3D track initialization method designed to ensure spatiotemporal consistency by integrating inter-camera feature matching with intra-camera point tracking. Additionally, we incorporate a noise-robust depth-ordering regularization loss and a spatiotemporally diverse batch sampling strategy to enhance optimization stability and cross-view generalization. Furthermore, to address the lack of standardized benchmarks for this task, we introduce LetCamsGo, a new real-world video dataset with 5 sequences across 4 diverse environments, recorded by three independently moving cameras and one fixed camera. Comprehensive benchmarking on LetCamsGo demonstrates that our proposed framework improves 4D reconstruction quality in dynamic regions compared with baselines, paving the way for a low-cost 4D reconstruction paradigm in the wild.

2606.04591 2026-06-04 cs.CL cs.CV 版本更新

SFMambaNet: 用于对应点筛选的频谱-频率增强选择性状态空间模型

Zhihua Wang, Yanping Li, Yizhang Liu

AI总结提出SFMambaNet，通过局部频谱-几何注意力块和频谱集成全局Mamba块，首次将频域感知融入对应点筛选任务，增强内点与外点的区分能力。

详情

AI中文摘要

对应点筛选旨在从初始对应点集中识别内点。现有大多数基于图神经网络的方法依赖于从粗欧几里得坐标映射的几何特征，难以捕捉内点呈现的细微几何一致性。而基于Mamba的方法虽具有全局感受野和长序列建模能力，但往往在隐藏状态空间中积累大量不一致特征，难以区分内点与外点。本文首次将频域感知融入该任务，提出SFMambaNet，一种新颖的频谱-频率增强Mamba双视图对应点筛选网络。我们的方法由两个组件协同构成：首先，设计局部频谱-几何注意力（LSGA）块。LSGA将频谱位置编码融入局部图交互，并引入多尺度Mamba处理，以增强对细微几何一致性的捕捉并提升局部特征判别性。在此基础上，设计频谱集成全局Mamba（SIGM）块。SIGM在状态空间中嵌入频率门控机制，利用LSGA提供的频率信息显式抑制隐藏状态内高频噪声的累积，并减轻不一致特征的传播。这增强了内点-外点可分性，并以近乎线性的复杂度实现了鲁棒的全局上下文建模能力。大量实验表明，SFMambaNet在多个具有挑战性的任务上优于当前最先进方法。代码可在https://github.com/Kirito14IT/SFMambaNet获取。

英文摘要

Correspondence pruning aims to identify inliers from an initial set of correspondences. Most existing Graph Neural Network (GNN)-based methods rely on geometric features mapped from coarse Euclidean coordinates, which struggle to capture the subtle geometric consistencies presented by inliers. While Mamba-based methods possess global receptive fields and long sequence modeling capabilities, they tend to accumulate substantial inconsistent features within the hidden state space, making it difficult to distinguish inliers from outliers. In this paper, we integrate frequency domain perception into this task for the first time and propose SFMambaNet, a novel Spectral-Frequency enhanced Mamba-based two-view correspondence pruning network. Our method is collaboratively composed of two components: First, we design a Local Spectral-Geometric Attention (LSGA) block. LSGA incorporates spectral positional encoding into local graph interactions and introduces multi-scale Mamba processing to enhance the capture of subtle geometric consistencies and improve local feature discriminability. Building upon this, we design a Spectral-Integrated Global Mamba (SIGM) block. SIGM embeds a frequency gating mechanism within the state space, utilizing the frequency information provided by LSGA to explicitly suppress high-frequency noise accumulation within hidden states and mitigate the propagation of inconsistent features. This enhances inlier-outlier separability and achieves robust global context modeling capabilities with nearly linear complexity. Extensive experiments demonstrate that SFMambaNet outperforms current state-of-the-art methods on several challenging tasks. The code is available at https://github.com/Kirito14IT/SFMambaNet.

2606.04480 2026-06-04 cs.CV cs.HC (version update)

IMPose: Interactive Multi-person Pose Estimation with Dynamic Correction Propagation

Haoyang Ge, Jian Ma, Ziwen Wang, Qihe Wang, Jianqi Fan, Hongzhi Yu, Xingyu Chen, Kun Li

Affiliations: Tianjin University; Zhongguancun Academy; Tiandi

AI summary: Proposes IMPose, an interactive tool whose dual-level tracking mechanism (keypoint level and instance level) propagates sparse multi-person pose corrections across an entire video, substantially reducing manual annotation effort.

Abstract

High-quality dynamic human pose annotation equips AI with precise motion kinematics to enable human behavior mastery, yet remains labor-intensive and time-consuming. Current annotation tools either lack temporal correction propagation or fail in multi-person scenarios, necessitating excessive manual intervention. In this paper, we introduce IMPose, an interactive tool for multi-person dynamic pose annotation. It features a dual-level tracking mechanism that propagates annotators' single-frame multi-person pose corrections across entire videos. The keypoint level ensures temporal propagation of corrections via sequential modeling, while the instance level employs keypoint-aware embedding with relative positional encoding to maintain multi-person cross-frame consistency. To further improve robustness, IMPose maintains historical pose and instance cues in a trajectory bank, which enhances long-range temporal association and stabilizes annotation in challenging cases such as occlusion and motion blur. By converting sparse human corrections into dense and coherent pose trajectories, our framework significantly reduces repeated manual refinement across frames. Extensive experiments show that IMPose consistently achieves a strong accuracy-efficiency trade-off under different interaction budgets, demonstrating particular advantages in low-click annotation settings. IMPose achieves high-precision annotation with high efficiency, requiring only 27 clicks per 1,050-frame video on 3DPW and 3 clicks per tracklet per 84 frames on PoseTrack21. We further expand PoseTrack21 with 188K pose instances (3.55M keypoints) at a minimal cost of 10 annotators in 10 hours. The annotation tool, code, and extended dataset will be open-sourced.
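
The effort figures quoted in this abstract can be turned into frames per click and keypoints per instance. The sketch below only does that arithmetic on the rounded numbers from the abstract. Reading "10 annotators in 10 hours" as 100 annotator-hours is an assumption, since the abstract does not say whether the 10 hours are per annotator or in total.

```python
# Figures quoted in the abstract (188K and 3.55M are rounded values).
clicks_3dpw, frames_3dpw = 27, 1050
clicks_pt21, frames_pt21 = 3, 84          # per tracklet
instances, keypoints = 188_000, 3_550_000

frames_per_click_3dpw = frames_3dpw / clicks_3dpw   # about 38.9
frames_per_click_pt21 = frames_pt21 / clicks_pt21   # exactly 28.0
keypoints_per_instance = keypoints / instances      # about 18.9

# ASSUMPTION: 10 annotators each working 10 hours (100 annotator-hours).
annotator_hours = 10 * 10
instances_per_annotator_hour = instances / annotator_hours  # 1880.0
```

Under that reading, one click covers roughly 39 frames on 3DPW and 28 frames per tracklet on PoseTrack21.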

2606.04479 2026-06-04 cs.CV cs.AI cs.CL (version update)

Evaluating Reasoning Fidelity in Visual Text Generation

Jiajun Hong, Jiawei Zhou

Affiliations: Stony Brook University

AI summary: Evaluates whether current text-to-image models faithfully preserve reasoning ability in visual text generation, using long text rendering, factual knowledge probing, context understanding, and multi-step reasoning tasks. It finds that they often produce semantic errors and logical inconsistencies, leaving a substantial gap to text-only models.

Comments Peer reviewed and accepted at CVPR 2026 at the GRAIL-V (Grounded Retrieval and Agentic Intelligence for Vision-Language) workshop (non-archival track)

Abstract

Recent text-to-image (T2I) models can render highly legible and well-structured text within images, enabling applications including document generation and slide generation. However, it remains unclear whether such systems faithfully preserve reasoning ability when complex solutions must be expressed directly through rendered text, or whether they merely imitate surface-level patterns. We investigate this question by evaluating reasoning fidelity in visual text generation, where models must express complete reasoning processes as images. Our evaluation includes long text rendering, factual knowledge probing, context understanding, and multi-step reasoning. Across these settings, we find that current T2I models frequently produce semantic errors, logical inconsistencies, and incorrect intermediate steps, even when the rendered text appears visually clear. These failures contrast with the strong reasoning performance of text-only models on the same tasks. Our findings reveal a substantial gap between visual text generation and procedural reasoning, motivating more reliable visual text reasoning.

2606.04469 2026-06-04 cs.CV cs.AI (version update)

Adaptive Calibration for Fair and Performant Facial Recognition

Ryan Brown, Chris Russell

Affiliations: University of Oxford

AI summary: Proposes Adaptive Calibration (AC), which maps cosine similarity between normalized embeddings to calibrated probabilities and uses local context to correct regional differences, improving both the overall performance and the fairness of facial recognition without demographic metadata.

Abstract

We introduce Adaptive Calibration (AC), a novel calibration strategy for facial recognition that maps cosine similarity between normalized embeddings to well-calibrated probabilities. By incorporating local context into calibration, Adaptive Calibration corrects for a fundamental mismatch in cosine similarity, whereby the same distance can correspond to different match probabilities in different embedding regions. Our approach improves overall performance and yields a fairer calibration without requiring demographic metadata. It consistently dominates existing methods on both accuracy and fairness metrics across a variety of pretrained models and standard benchmarks. AC provides a practical solution for equitable facial recognition that requires no demographic group annotations while improving overall performance. Unlike existing approaches, our method provides continuous, region-specific calibration that avoids "leveling down", where fairness comes at the cost of degraded performance for some groups.

2606.04461 2026-06-04 cs.CV (version update)

ChannelTok: Efficient Flexible-Length Vision Tokenization

Sukriti Paul, Arpit Bansal, Tom Goldstein

Affiliations: University of Maryland, College Park

AI summary: Proposes a lightweight channel-wise flexible-length tokenizer that orders channels by semantic importance through stochastic tail-drop training, greatly improving decoding speed and model efficiency while maintaining high quality.

Abstract

Leading flexible vision tokenizers achieve SOTA quality at an extreme cost, relying on parameter-heavy backbones and slow, multi-step generative decoders. We depart from this complex, spatial-token paradigm and introduce a simple, lightweight, and fast channel-wise flexible-length tokenizer. Our method treats each latent channel as a visual token, enabling a parameter-efficient CNN-Transformer hybrid backbone. Furthermore, employing a stochastic tail-dropping paradigm during training naturally forces channels to organize by semantic importance. This allows for flexible compression at inference by simply retaining the first $k$ channels, and naturally enables variable-length autoregressive image generation. We validate our approach through extensive experiments on ImageNet, demonstrating consistent quality across diverse token budgets. The results establish a new quality-efficiency frontier: our model achieves state-of-the-art perceptual quality (rFID 2.92) while being $8.6\times$ faster in decoding and $2.1\times$ smaller (159M params) than the next-best alternative. Our work establishes channel-wise tokenization as a powerful and practical paradigm for efficient visual representation. Project page: https://channeltok.github.io
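
The tail-dropping idea in this abstract (keep the first $k$ channels, discard the rest) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names `tail_drop` and `random_tail_drop`, zeroing rather than truncating the dropped channels, and sampling $k$ uniformly. The abstract does not specify any of these details.

```python
import random

def tail_drop(channels, k):
    """Keep the first k channels and zero out the rest (the 'tail')."""
    return [ch if i < k else [0.0] * len(ch) for i, ch in enumerate(channels)]

def random_tail_drop(channels, rng, min_keep=1):
    """Training-time variant: sample how many leading channels survive.

    ASSUMPTION: k is drawn uniformly; the paper's distribution is not stated.
    """
    k = rng.randint(min_keep, len(channels))  # inclusive on both ends
    return tail_drop(channels, k), k

# Toy latent with 4 channels of 3 values each (hypothetical shapes).
latent = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0], [1.5, 2.5, 3.5]]
truncated = tail_drop(latent, 2)                       # inference with k = 2
dropped, kept = random_tail_drop(latent, random.Random(0))
```

Because later channels are dropped more often during training than earlier ones, a decoder trained this way has an incentive to place the most useful information in the leading channels, which is what makes truncation at inference workable.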

2606.04457 2026-06-04 cs.CV (version update)

Imagine Before You Draw: Visual Prompt Engineering for Image Generation

Liyu Jia, Fengda Zhang, Jiachun Pan, Kesen Zhao, Saining Zhang, Wang Lin, Weijia Wu, Yue Liao, Aojun Zhou, Hanwang Zhang

Affiliations: Nanyang Technological University; National University of Singapore; Zhejiang University; The Chinese University of Hong Kong

AI summary: Proposes Visual Prompt Engineering (VPE), in which a single model first generates visual semantic tokens as an intermediate plan and then generates the full image, avoiding the information bottleneck and improving generation quality and editing fidelity.

Abstract

Incorporating visual semantic representations as an intermediate step before image generation can reduce the modeling difficulty between text and images, thereby improving generation quality. Recent works such as X-Omni and BLIP3o-Next have explored this direction, but they typically use a two-stage external pipeline: a separate autoregressive model first generates semantic tokens, which are then fed as conditioning to an independent diffusion decoder. Since the decoder cannot jointly access the original input and the semantic plan, this design introduces an information bottleneck that limits detail preservation in downstream tasks such as editing. Internal architectures such as Transfusion, BAGEL, and Show-o2 avoid this bottleneck by enabling cross-modal interaction within a single model, but they still face the difficult text-to-pixel modeling gap without intermediate semantic guidance. We propose Visual Prompt Engineering (VPE), which can be seamlessly integrated into such internal frameworks. Specifically, the model first autoregressively generates visual semantic tokens (e.g., SigLIP 2) as "visual prompts" that capture the semantic layout, then generates the full image tokens conditioned on this plan. We validate VPE across class-conditional generation, text-to-image generation, and image editing, covering various token types and model architectures. Results show that VPE can accelerate convergence, raise quality ceilings, and through internal integration, achieve substantially better editing preservation (PSNR: 26.76 vs. 19.92) than external alternatives of the same parameter scale, while maintaining competitive editing responsiveness.
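
To put the reported PSNR gap in perspective, recall the standard definition $\mathrm{PSNR} = 10\log_{10}(\mathrm{MAX}^2/\mathrm{MSE})$. The sketch below converts a PSNR difference into a ratio of mean squared errors. It assumes both figures were computed with the same peak value and on comparable images, which the abstract does not state, and it treats the dataset-level PSNR values as if they came from a single MSE.

```python
import math

def psnr(mse, max_value=255.0):
    """Peak signal-to-noise ratio in dB for a given mean squared error."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_from_psnr_gap(psnr_high, psnr_low):
    """How many times smaller the MSE is at psnr_high than at psnr_low."""
    return 10.0 ** ((psnr_high - psnr_low) / 10.0)

# 26.76 dB versus 19.92 dB, as quoted in the abstract.
ratio = mse_ratio_from_psnr_gap(26.76, 19.92)  # roughly 4.8
```

Under those assumptions, a 6.84 dB gap corresponds to a preservation error that is nearly five times smaller in mean-squared terms.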

2606.04453 2026-06-04 cs.CV cs.LG (version update)

Radiomic Feature Selection Using Gradient Loss of Deep Neural Network for Lung Cancer Stage Detection

Hina Shakir, Mohammad Mohatram, Javeed Hussain, Syed Rizwan Ali, Muhammad Irfan Memon

Affiliations: Department of Software Engineering, Bahria University; Global College of Engineering and Technology; Software Engineering & Business Incubation Center, Bahria University

AI summary: Proposes the GL-RFE framework, which uses gradient sensitivity analysis of a deep neural network to recursively eliminate low-contribution features, selecting the top 15 of 106 radiomic features to classify early-stage versus advanced-stage lung cancer with 90.22% accuracy.

DOI: 10.3791/70181
Journal ref: J. Vis. Exp. (230), e70181, (2026)

Abstract

Radiomics enables extraction of quantitative imaging biomarkers from medical images and has become an important tool for computer-aided cancer diagnosis. However, radiomics datasets are typically high-dimensional with limited samples, making feature selection a critical step for building reliable predictive models. This study proposes a Gradient-Loss Recursive Feature Elimination (GL-RFE) framework that integrates gradient sensitivity analysis from a deep neural network to identify the most influential radiomic features for lung cancer stage detection. A total of 106 radiomic features were extracted from chest Computed Tomography (CT) scans using the PyRadiomics extension of the 3D Slicer platform. The proposed method evaluates feature importance by computing gradients of the network loss with respect to input features and recursively eliminates features with minimal contribution. The resulting top-15 radiomic features are used to train a deep neural network classifier for distinguishing early-stage and advanced-stage lung cancer. The proposed framework achieves strong classification performance, with accuracy of 90.22%, precision of 90.10%, recall of 90.24%, and F1-score of 90.16% on the test dataset. Visualization analyses, including correlation heat maps and distribution plots, further confirm reduced feature redundancy and improved class separability. Compared to conventional feature selection techniques, GL-RFE effectively captures nonlinear feature interactions and enhances model generalization. The presented protocol provides a reproducible and interpretable methodology for radiomics-based cancer stage detection and is particularly suitable for high-dimensional, small-sample biomedical datasets, with potential applications in other domains such as genomics and multimodal clinical analysis.
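
The selection loop described in this abstract (score each feature by the gradient of the loss with respect to the input, drop the weakest, repeat) can be sketched on a toy problem. This is not the authors' implementation: a linear model trained by gradient descent stands in for the deep network, the squared-error loss, the one-feature-per-round schedule, and the names `fit_linear`, `input_gradient_importance`, and `gl_rfe` are all assumptions.

```python
def fit_linear(X, y, features, steps=500, lr=0.05):
    """Fit y ~ sum_j w[j] * x[j] over the active features by gradient descent."""
    w = {j: 0.0 for j in features}
    n = len(X)
    for _ in range(steps):
        grad = {j: 0.0 for j in features}
        for x, t in zip(X, y):
            err = sum(w[j] * x[j] for j in features) - t
            for j in features:
                grad[j] += 2.0 * err * x[j] / n
        for j in features:
            w[j] -= lr * grad[j]
    return w

def input_gradient_importance(X, y, w, features):
    """Mean |d loss / d x_j| per feature for the squared-error loss."""
    n = len(X)
    imp = {j: 0.0 for j in features}
    for x, t in zip(X, y):
        err = sum(w[j] * x[j] for j in features) - t
        for j in features:
            imp[j] += abs(2.0 * err * w[j]) / n  # d(err^2)/dx_j = 2*err*w_j
    return imp

def gl_rfe(X, y, keep):
    """Refit, rank by input-gradient magnitude, drop the weakest, repeat."""
    features = list(range(len(X[0])))
    eliminated = []
    while len(features) > keep:
        w = fit_linear(X, y, features)
        imp = input_gradient_importance(X, y, w, features)
        worst = min(features, key=lambda j: imp[j])
        features.remove(worst)
        eliminated.append(worst)
    return features, eliminated

# Toy data: y = 3*x0 + 0*x1 - 2*x2 + 0.1, so feature 1 carries no signal.
X = [[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]]
y = [1.1, 5.1, -0.9, -4.9]
selected, order = gl_rfe(X, y, keep=2)
```

On this toy data the uninformative feature is the first one removed. With a real network the input gradients would come from automatic differentiation rather than a closed-form expression.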

2606.04437 2026-06-04 cs.CV (version update)

INTACT: Ego-Guided Typed Sparse Evidence Retrieval for Heterogeneous Collaborative Perception

Chen Li, Shengrong Yuan, Jialong Zuo, Xinzhong Zhu, Nong Sang, Changxin Gao

Affiliations: National Key Laboratory of Multispectral Information Intelligent Processing Technology, School of Artificial Intelligence and Automation, Huazhong University of Science and Technology; Zhejiang Normal University

AI summary: Proposes the INTACT framework, a sparse retrieval mechanism in which the ego vehicle issues typed evidence queries and collaborators return only local evidence, allowing new agents to join heterogeneous collaborative perception with zero training and achieving efficient performance on OPV2V-H and DAIR-V2X.

Abstract

Collaborative perception extends the perceptual range of autonomous vehicles by sharing information across agents, but heterogeneous sensors and perception models make intermediate feature fusion difficult to deploy at scale. Existing heterogeneous collaboration methods typically follow a translation-first paradigm: collaborator features must be aligned, adapted, or projected into an ego-compatible space before fusion. Such feature-compatibility contracts improve fixed-system performance, but they couple deployment to collaborator-specific adaptation and make newly joined heterogeneous agents costly to integrate. To address this gap, we propose INTACT, an ego-guided typed sparse evidence retrieval framework for heterogeneous collaborative perception. Instead of translating an entire collaborator feature map, INTACT lets the ego vehicle issue typed evidence queries that express suspected objects and evidence-deficient regions. Collaborators respond only with local evidence at queried locations, and the ego selects useful responses through sparse per-query routing and injects them through gated residual write-back. This changes the compatibility requirement from global feature-map interpretability to local, typed response comparability under ego-issued queries, enabling a zero-training heterogeneous insertion protocol in which the ego interface is trained once and new collaborators join through checkpoint merging. Extensive experiments on simulated and real-world heterogeneous collaborative perception benchmarks validate the effectiveness and deployability of INTACT. On OPV2V-H, INTACT achieves 80.1 AP70 with only 0.52M additional parameters and 18.0 $\log_2$ communication volume, corresponding to about 16$\times$ compression over dense feature transmission. On DAIR-V2X, INTACT achieves 43.8 AP50 under challenging real-world conditions.
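
The communication figures in this abstract are easier to read once the logarithm is undone. The short calculation below assumes that the metric is the base-2 logarithm of the transmitted payload size. The abstract does not define the unit, and the dense-baseline value of 22 is inferred from the quoted 16x compression rather than reported.

```python
import math

log2_volume_intact = 18.0   # quoted in the abstract
compression = 16.0          # "about 16x" per the abstract

# ASSUMPTION: the metric is log2 of the transmitted payload size.
volume_intact = 2.0 ** log2_volume_intact                         # 262144.0
log2_volume_dense = log2_volume_intact + math.log2(compression)   # 22.0
volume_dense = 2.0 ** log2_volume_dense                           # 4194304.0
```

On a log2 scale a 16x reduction is a difference of exactly 4, so the implied dense feature transmission sits at about 22 on the same scale.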

2606.04436 2026-06-04 cs.CV cs.RO (version update)

Motion-Guided Causal Disentanglement for Robust Multi-View Cine Cardiac MRI Diagnosis

Chuankai Xu, Cristiane De Carvalho Singulane, Mohammad Abuannadi, Stephen Chandler, Jeremy Slivnick, Karolina Zareba, Jane Cao, Vidya Nadig, Fabio Fernandes, Seth Uretsky, Diego Perez de Arenaza, Amit Patel, Jianxin Xie

Affiliations: University of Virginia; University of Chicago; Ohio State University; St. Francis Hospital & Heart Center; Hartford HealthCare; Instituto do Coração (InCor); Atlantic Health System; Sociedad Italiana de Beneficencia (Hospital Italiano)

AI summary: Proposes MoViD, a motion-guided view-disease disentanglement framework that separates view-specific from disease-discriminative features through dual-branch supervised contrastive learning and a gradient-reversal adversarial constraint, uses an annotation-free temporal motion feature to localize the cardiac region, and applies focal reweighting to mitigate class imbalance. It outperforms standard transformer baselines on a venous thrombosis dataset and two public benchmarks.

Abstract

Multi-view cardiac magnetic resonance (CMR) imaging provides complementary anatomical information and is widely used for noninvasive disease assessment. Recent transformer-based models have demonstrated strong representation learning capabilities for CMR analysis; however, they typically learn unified latent embeddings that entangle view-specific anatomical variations with disease-related features. Such entanglement biases classifiers toward structural attributes rather than view-invariant pathological patterns. This issue is exacerbated in low-data regimes, particularly for underrepresented cardiac conditions, where limited samples increase the susceptibility to shortcut learning and view-dependent decision boundaries. To address this, we propose a Motion-Guided View-Disease Disentanglement framework, MoViD, built upon a ViT-MAE backbone. The model explicitly factorizes latent representations into view-specific and disease-discriminative components using dual-branch supervised contrastive objectives and a gradient-reversal adversarial constraint that minimizes disease leakage into the view embedding. Additionally, an annotation-free temporal motion feature, derived from inter-frame difference maps, is introduced to localize the beating heart region and suppress background artifacts. A focal reweighting mechanism is incorporated into the contrastive loss to mitigate class imbalance. We evaluate the framework on a private clinical venous thrombosis dataset and two public benchmarks (M&Ms, M&Ms2). Across disease classification and cardiac segmentation tasks, our approach consistently outperforms standard transformer baselines and demonstrates competitive performance against large-scale pretrained foundation models, validating the efficacy of structural disentanglement in medical image analysis.
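
The annotation-free motion cue mentioned in this abstract is derived from inter-frame difference maps. A minimal sketch of that idea follows. Averaging the absolute differences over time and thresholding the result into a binary mask are assumptions made for illustration, as are the names `motion_map` and `motion_mask`. The abstract does not say how the differences are aggregated or used downstream.

```python
def motion_map(frames):
    """Average absolute inter-frame difference per pixel.

    frames: list of 2D lists (T x H x W) of grayscale intensities.
    """
    height, width = len(frames[0]), len(frames[0][0])
    acc = [[0.0] * width for _ in range(height)]
    for prev, cur in zip(frames, frames[1:]):
        for r in range(height):
            for c in range(width):
                acc[r][c] += abs(cur[r][c] - prev[r][c])
    pairs = len(frames) - 1
    return [[v / pairs for v in row] for row in acc]

def motion_mask(mmap, threshold):
    """Binary mask of pixels whose mean motion exceeds the threshold."""
    return [[1 if v > threshold else 0 for v in row] for row in mmap]

# Toy cine loop: only the centre pixel changes over time.
frames = [
    [[5, 5, 5], [5, 0, 5], [5, 5, 5]],
    [[5, 5, 5], [5, 8, 5], [5, 5, 5]],
    [[5, 5, 5], [5, 2, 5], [5, 5, 5]],
]
mask = motion_mask(motion_map(frames), threshold=1.0)
```

Static background pixels accumulate no difference, so only the moving region survives the threshold. No labels are needed at any point, which is the sense in which the cue is annotation-free.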

2606.04410 2026-06-04 cs.CV (version update)

Ultra-Fast Neural Video Compression

Jiahao Li, Wenxuan Xie, Zhaoyang Jia, Bin Li, Zongyu Guo, Xiaoyi Zhang, Yan Lu

Affiliations: Microsoft Research Asia; University of Science and Technology of China

AI summary: Proposes DCVC-UF, a chunk-based coding framework that achieves ultra-fast encoding and decoding through joint spatial-temporal modeling and parallel reconstruction, significantly improving the rate-distortion-complexity trade-off.

Comments CVPR 2026

Abstract

While neural video codecs (NVCs) have demonstrated superior compression ratio, their prohibitive computational complexity remains a critical barrier to real-world deployment. This paper introduces a chunk-based coding framework designed to significantly improve the rate-distortion-complexity trade-off. Instead of processing frames sequentially, our approach encodes a chunk of multiple frames into a single compact latent representation and decodes them simultaneously. This is enabled by cross-frame interaction modules for joint spatial-temporal modeling and frame-specific decoders for parallel reconstruction. This paradigm not only dramatically enhances coding throughput but also facilitates more effective modeling of long-term temporal correlations. To further boost speed, we propose a streamlined entropy coding mechanism that consolidates bit-stream interactions into a single step, substantially reducing decoding overhead. Building on these innovations, we present DCVC-UF (Ultra-Fast), a new NVC that sets a new SOTA in performance. Our experiments show that DCVC-UF can achieve ultra-fast encoding and decoding speeds, significantly outperforming previous leading codecs. DCVC-UF serves as a notable landmark in the journey of NVC evolution. The code is at https://github.com/microsoft/DCVC.

2606.04385 2026-06-04 cs.CV (version update)

Geometry-Preserving Unsupervised Alignment for Heterogeneous Foundation Models

Shuwen Yu, Zhanxuan Hu, Yi Zhao, Yonghang Tai, Huafeng Li

Affiliations: Yunnan Normal University, Kunming, China; Kunming University of Science and Technology, Kunming, China

AI summary: Proposes the GPUA framework, which aligns vision foundation model features to the semantic space of a vision-language model through an orthogonal mapping, with no labels or parameter updates, improving cross-model compatibility and yielding strong gains in zero-shot recognition and segmentation.

Comments Accepted at ICML 2026

Abstract

Foundation models have driven rapid progress in computer vision, yet the two dominant paradigms, vision-language foundation models (VLMs) and vision-only foundation models (VFMs), remain only partially compatible. VLMs offer language-grounded semantic alignment but are often visually coarse, while VFMs learn discriminative perceptual geometry but lack semantic grounding. We propose GPUA (Geometry-Preserving Unsupervised Alignment), a framework that integrates the complementary strengths of VFMs and VLMs. Inspired by cross-lingual alignment, GPUA treats VFM features as a visual language and learns an orthogonal mapping that translates the VFM space into the VLM semantic space, preserving geometry and narrowing the modality gap without labels or model parameter updates. GPUA is task-agnostic and requires only feature-level access to pretrained models. Experiments across diverse benchmarks demonstrate improved cross-model compatibility and strong gains in downstream zero-shot recognition and segmentation with negligible overhead. Code is available at https://github.com/Yuteam14/GPUA
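
The "geometry-preserving" property of an orthogonal mapping can be checked directly: a matrix $Q$ with $Q^\top Q = I$ leaves inner products, norms, and therefore cosine similarities unchanged. The sketch below verifies this with a 2D rotation. The angle and the vectors are arbitrary choices, and the example says nothing about how GPUA actually learns its mapping.

```python
import math

def rotation(theta):
    """A 2x2 orthogonal matrix (rotation by theta radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]

def apply(matrix, vec):
    """Matrix-vector product for small dense lists."""
    return [sum(m * v for m, v in zip(row, vec)) for row in matrix]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    return dot(a, b) / math.sqrt(dot(a, a) * dot(b, b))

Q = rotation(0.7)                      # arbitrary angle
u, v = [1.0, 2.0], [-3.0, 0.5]         # arbitrary feature vectors
before = cosine(u, v)
after = cosine(apply(Q, u), apply(Q, v))
```

The two similarities agree up to floating-point error, so neighbourhood structure in the source feature space survives the mapping. A general linear map offers no such guarantee.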

2606.04369 2026-06-04 cs.CV (version update)

VT-3DAD: Cross-Category 3D Anomaly Detection via Visual-Text Normal Space Alignment

Zi Wang, Katsuya Hotta, Yawen Zou, Koichiro Kamide, Yijin Wei, Chao Zhang, Jun Yu

Affiliations: Niigata University; University of Toyama; Iwate University

AI summary: Proposes VT-3DAD, a training-free framework that uses frozen CLIP encoders to extract multi-view visual features and textual normal anchors and fuses visual and semantic deviation for cross-category 3D anomaly detection, achieving state-of-the-art performance on ShapeNetPart.

Abstract

Few-shot cross-category 3D anomaly detection aims to determine whether an unknown point cloud belongs to a target normal category using only a few normal references. Existing training-based methods usually require category-wise optimization, while recent training-free methods based on multi-view CLIP visual features mainly rely on visual similarity and may be confused by geometrically similar categories. In this paper, we propose VT-3DAD, a training-free framework for cross-category 3D anomaly detection via Visual-Text Normal Space Alignment. Given few-shot normal references and a test point cloud, VT-3DAD first generates realistic multi-view depth maps and extracts view-wise features using a frozen CLIP visual encoder. The visual branch measures reference-test deviation in the multi-view feature space. In parallel, depth-aware and 3D-aware prompts are encoded by the frozen CLIP text encoder to construct textual normal anchors, which provide semantic normality constraints for the target category. The final anomaly score is obtained by fusing visual deviation from normal references and semantic deviation from the textual normal space. Experiments on the ShapeNetPart dataset demonstrate that VT-3DAD achieves state-of-the-art performance. In particular, VT-3DAD improves the one-shot average AUC-ROC from 92.49% to 94.80% compared with the visual-only baseline, while also reducing the average standard deviation from 5.64 to 3.41.
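
The final scoring step described in this abstract fuses a visual deviation (test views versus normal references) with a semantic deviation (test views versus a textual normal anchor). A hypothetical version of that fusion is sketched below. The cosine-based deviations, the weighted sum, and the weight `alpha` are illustrative assumptions, and the toy 2D vectors stand in for features that would come from frozen CLIP encoders.

```python
import math

def cosine(a, b):
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return num / den

def visual_deviation(test_views, ref_views):
    """Mean over test views of 1 - best cosine match among reference views."""
    per_view = [1.0 - max(cosine(t, r) for r in ref_views) for t in test_views]
    return sum(per_view) / len(per_view)

def semantic_deviation(test_views, text_anchor):
    """Mean over test views of 1 - cosine similarity to the text anchor."""
    per_view = [1.0 - cosine(t, text_anchor) for t in test_views]
    return sum(per_view) / len(per_view)

def anomaly_score(test_views, ref_views, text_anchor, alpha=0.5):
    """ASSUMPTION: a convex combination; the paper's fusion rule may differ."""
    return (alpha * visual_deviation(test_views, ref_views)
            + (1.0 - alpha) * semantic_deviation(test_views, text_anchor))

ref_views = [[1.0, 0.0]]       # features of a normal reference (toy values)
text_anchor = [1.0, 0.0]       # embedding of a "normal" prompt (toy values)
score_normal = anomaly_score([[1.0, 0.0]], ref_views, text_anchor)
score_odd = anomaly_score([[0.0, 1.0]], ref_views, text_anchor)
```

A sample that matches both the references and the text anchor scores near 0, while one that matches neither scores near 1. The semantic term is what can separate categories that merely look alike to the visual branch.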

2606.04365 2026-06-04 cs.CV cs.AI (version update)

StateVLM: A State-Aware Vision-Language Model for Robotic Affordance Reasoning

Xiaowen Sun, Matthias Kerzel, Mengdi Li, Xufeng Zhao, Paul Striker, Stefan Wermter

Affiliations: Department of Informatics, University of Hamburg; King Abdullah University of Science and Technology

AI summary: Proposes StateVLM, trained with an auxiliary regression loss that strengthens a vision-language model's numerical reasoning in object detection and object-state localization, and builds the OSAR benchmark to validate it.

Abstract

Vision-language models (VLMs) have shown remarkable performance in various robotic tasks, as they can perceive visual information and understand natural language instructions. However, when applied to robotics, VLMs remain subject to a fundamental limitation inherent in large language models (LLMs): they struggle with numerical reasoning, particularly in object detection and object-state localization. To explore numerical reasoning as a regression task in VLMs, we propose a novel training strategy to adapt VLMs for object detection and object-state localization. This approach leverages box decoder outputs to compute an Auxiliary Regression Loss (ARL) during fine-tuning, while preserving standard sequence prediction at inference. We leverage this training strategy to develop StateVLM (State-aware Vision-Language Model), a novel model designed to perceive and learn fine-grained object representations, including precise localization of objects and their states, as well as graspable regions. Due to the lack of a benchmark for object-state affordance reasoning, we introduce an open-source benchmark, Object State Affordance Reasoning (OSAR), which contains 1172 scenes with 7746 individual objects and corresponding bounding boxes. Comparative experiments on adapted benchmarks (RefCOCO, RefCOCO+, and RefCOCOg) demonstrate that ARL improves model performance by an average of 1.6% compared to models without ARL. Experiments on the OSAR benchmark further support this finding, showing that StateVLM with ARL achieves an average of 5.2% higher performance than models without ARL. In particular, ARL is also important for the complex task of affordance reasoning in OSAR, where it enhances the consistency of model outputs.
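
The training strategy in this abstract adds an Auxiliary Regression Loss on box-decoder outputs during fine-tuning while leaving inference as ordinary sequence prediction. One plausible shape for such an objective is sketched below. The L1 regression term, the weight `arl_weight`, and normalized box coordinates are assumptions, since the abstract does not give the exact form of ARL.

```python
import math

def cross_entropy(probs, target_index):
    """Negative log-likelihood of the target token under predicted probs."""
    return -math.log(probs[target_index])

def l1_box_loss(pred_box, true_box):
    """Mean absolute error over (x1, y1, x2, y2), coordinates in [0, 1]."""
    return sum(abs(p - t) for p, t in zip(pred_box, true_box)) / len(true_box)

def training_loss(token_probs, token_targets, pred_box, true_box, arl_weight=1.0):
    """Sequence loss plus a weighted regression term (hypothetical form of ARL)."""
    seq = sum(cross_entropy(p, t) for p, t in zip(token_probs, token_targets))
    seq /= len(token_targets)
    return seq + arl_weight * l1_box_loss(pred_box, true_box)

# One predicted token over a 2-word vocabulary, one predicted box.
token_probs, token_targets = [[0.5, 0.5]], [0]
pred_box, true_box = [0.1, 0.1, 0.5, 0.5], [0.1, 0.1, 0.5, 0.7]
loss = training_loss(token_probs, token_targets, pred_box, true_box)
```

The regression term penalizes a box in proportion to how far off it is, a signal that token-level cross-entropy alone does not provide because it treats every wrong digit token as equally wrong. Setting `arl_weight=0.0` recovers the plain sequence objective.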

2606.04323 2026-06-04 cs.CV (version update)

A Cookbook of 3D Vision: Data, Learning Paradigms, and Application

Hongyang Du, Zongxia Li, Dawei Liu, Runhao Li, Haoyuan Song, Qingyu Zhang, Yubo Wang, Jingcheng Ni, Shihang Gui, Congchao Dong, Tao Hu

Affiliations: Brown University; University of Maryland, College Park; University of Pennsylvania; University of Southern California; New York University; The University of Sydney; Stability AI

AI summary: Presents a data-centric taxonomy of 3D vision that analyzes geometric representations (point clouds, meshes, voxels, and 3D Gaussians) and their acquisition pipelines, along with dataset design, benchmark construction, and supervision regimes, unifying the relationships among representations, learning paradigms, and downstream tasks (reconstruction, generation, and video modeling).

Comments Accepted to the CVPR 2026 OpenSUN3D Workshop. Official version available at CVF Open Access. https://openaccess.thecvf.com/content/CVPR2026W/OpenSUN3D/html/Du_A_Cookbook_of_3D_Vision_Data_Learning_Paradigms_and_Application_CVPRW_2026_paper.html

Journal ref: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026

Abstract

3D vision has rapidly evolved, driven by increasingly diverse data representations, learning paradigms, and modeling strategies. Yet the field remains fragmented across representations and benchmarks, making it difficult to develop unified perspectives on efficiency, fidelity, and scalability. This work provides a data-centric taxonomy of 3D vision that connects geometric representations, datasets, learning frameworks, and applications within a single conceptual map. We begin by analysing the principal structural representations of 3D data--point clouds, meshes, voxels, and 3D Gaussians--along with their acquisition pipelines. We then examine how dataset design, benchmark construction, and supervision regimes shape recent advances, spanning 2D-supervised 3D learning, implicit neural representations, and 4D world modeling. Through this integrative lens, we clarify the relationships among representations, learning paradigms, and downstream tasks in reconstruction, generation, and video modeling, offering a consolidated view of emerging trends toward balancing efficiency and fidelity and toward multimodal geometric grounding.

2606.04282 2026-06-04 cs.CV Version update

FindIt: A Format-Informed Visual Detection Benchmark for Generalist Multimodal LLMs

Eshika Khandelwal, Jingjing Pan, Mingfang Zhang, Quan Kong, Lorenzo Garattoni, Hilde Kuehne

Affiliations: Tuebingen AI Center, University of Tuebingen; Woven by Toyota, Inc.; Toyota Motor Europe

AI summary: Proposes the first comprehensive benchmark for evaluating the promptable localization abilities of generalist multimodal LLMs. It covers four core tasks, standardizes input and output formats, and reveals that models are sensitive to format constraints.

Abstract

Multimodal large language models (MLLMs) are predominantly evaluated on free-form vision-language tasks such as visual question answering, captioning, and summarization. However, their practical use is rapidly expanding to more structured computer vision settings, where users prompt models to perform localization-centric tasks such as object detection, often within larger agentic or decision-making systems. Despite this shift, there is currently no standardized benchmark that systematically evaluates these capabilities at scale. In this work, we introduce the first comprehensive benchmark specifically designed to assess the promptable localization abilities of generalist MLLMs. Our benchmark spans four core task categories: object detection, referring expression detection, instance-level detection, and video-based detection. To enable consistent and fair evaluation, we develop a unified framework that standardizes inputs, enforces parsable bounding box outputs, and defines transparent evaluation protocols across tasks. Using this suite, we evaluate a diverse set of open-source and proprietary MLLMs, providing an in-depth analysis of their performance and limitations. Beyond accuracy, we examine models' ability to adhere to output format specifications, showing that current systems are highly sensitive to formatting constraints and often fail to generalize even to minor variations. Our results highlight both the strengths and shortcomings of state-of-the-art MLLMs in localization settings, and point toward important directions for improving multimodal model design and evaluation.

2606.04271 2026-06-04 cs.CV cs.AI Version update

StandardE2E: A Unified Framework for End-to-End Autonomous Driving Datasets

Stepan Konev

Affiliations: University of Cambridge

AI summary: Proposes the StandardE2E framework, which addresses format incompatibility among end-to-end autonomous driving datasets through a unified data schema, joint loading of multiple datasets, and a simplified process for adding new datasets.

Abstract

Autonomous driving has shifted from modular perception-prediction-planning stacks toward end-to-end (E2E) models that map sensor inputs directly to vehicle control, often regularized by auxiliary tasks such as 3D detection, motion forecasting, and HD-map perception. Progress is driven by a fast-growing ecosystem of sensor-rich driving datasets, yet each ships its own file formats, APIs, coordinate conventions, and modality coverage, leaving cross-dataset experimentation and even basic per-dataset preprocessing to be re-implemented per project. We present StandardE2E, a framework that provides a single unified interface over E2E driving datasets. StandardE2E (i) standardizes per-dataset preprocessing under one shared data schema; (ii) combines multiple datasets in a single PyTorch DataLoader for cross-dataset pretraining, auxiliary-task supervision, and scenario-level filtering; and (iii) reduces adding a new dataset to a single per-dataset mapping from raw frames to the canonical schema, leaving the entire downstream pipeline unchanged. The framework supports six datasets out of the box: Waymo End-to-End, Waymo Perception, Argoverse 2 Sensor, Argoverse 2 LiDAR, NAVSIM (OpenScene-v1.1), and WayveScenes101, and is released as the open-source standard-e2e Python package, available at https://github.com/stepankonev/StandardE2E.

2606.04269 2026-06-04 cs.RO cs.AI cs.CV Version update

Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

Yilong Wang, Cheng Qian, Edward Johns

Affiliations: The Robot Learning Lab; Imperial College London

AI summary: Proposes the Instant-Fold framework, which uses in-context imitation learning from a single human demonstration to infer and execute diverse deformable-object manipulation modes without gradient updates, and transfers zero-shot to the real world after training in simulation.

Abstract

Deformable object manipulation (DOM) is challenging due to high-dimensional, partially observable states that evolve through long-horizon, topology-changing interactions with multiple valid manipulation modes. We introduce Instant-Fold, an in-context imitation learning framework for DOM. Given a single human demonstration, our policy infers and executes diverse manipulation modes directly from the demonstration, including variations in spatial execution and ordering, without requiring gradient updates. Our approach first learns deformation-aware visual representations via temporal contrastive pretraining, after which a flow-matching transformer policy conditioned on the demonstration predicts actions to execute the intended manipulation mode. Trained entirely in simulation, Instant-Fold generalizes across diverse folding modes and transfers zero-shot to real-world settings without additional data collection or finetuning. Videos are available at https://instant-fold.github.io.

2606.04264 2026-06-04 cs.CV Version update

Prospective Dynamic 3D MRI Reconstruction from a Single Measurement via Latent-Space Motion Tracking

Lixuan Chen, Zhongnan Liu, Jesse Hamilton, James M. Balter, Jeong Joon Park, Liyue Shen

Affiliations: University of Michigan

AI summary: Proposes the PDMR framework, which learns a low-dimensional latent manifold of motion fields offline and uses a tri-plane representation for efficient encoding, achieving high-fidelity, temporally consistent prospective dynamic 3D MRI reconstruction from a single measurement.

Abstract

Prospective reconstruction is crucial in many clinical applications such as MRI-guided radiotherapy, which demands accurate image reconstruction and fast motion estimation from currently acquired measurements. However, prospective reconstruction remains challenging due to ultra-sparse sampling and stringent latency requirements. In this work, we propose PDMR, a Prospective Dynamic 3D MRI Reconstruction framework with latent-space motion tracking. Our core idea is to learn an efficient and generalizable latent manifold of motion fields offline, enabling rapid online adaptation for prospective reconstruction. Specifically, we parameterize the deformation vector fields (DVFs) on a low-dimensional manifold, effectively reducing the search space for fast online adaptation, and employ a tri-plane representation to achieve geometry-aware and memory-efficient encoding of 3D motion. Experiments on both XCAT digital phantoms and in-house abdominal MRI datasets demonstrate that PDMR achieves high-fidelity and temporally consistent reconstruction across multiple prospective scenarios (Immediate and After-2min), outperforming state-of-the-art retrospective and online methods. Our results suggest a promising pathway toward ultra-fast, motion-aware prospective MRI reconstruction in clinical practice.

2606.04244 2026-06-04 cs.AI cs.CL cs.CV cs.LG Version update

VAMPS: Visual-Assisted Mathematical Problem Solving Benchmark

Amirhossein Dabiriaghdam, Shayan Vassef, Mohammadreza Bakhtiari, Yasamin Medghalchi, Ilker Hacihaliloglu, Mesrob Ohannessian, Lele Wang, Giuseppe Carenini

Affiliations: University of California, Berkeley

AI summary: Proposes the VAMPS benchmark, which uses 1,168 bilingual multiple-choice questions to evaluate multimodal large models on mathematical reasoning with the help of plotting tools, and finds that direct analytical solving outperforms tool-assisted visual solving.

Abstract

Multimodal large language models are increasingly capable of complex reasoning, yet their performance often degrades when they must externalize a problem through a tool and then reason over the tool's output, specifically when they rely on visual aids. This gap is especially important because real engineering and scientific workflows often rely on visualization tools for analysis, validation, and decision-making. To study this discrepancy, we introduce VAMPS (Visual-Assisted Mathematical Problem Solving), a benchmark for graph-assisted mathematics. VAMPS contains 1,168 multimodal, bilingual multiple-choice question-answer pairs drawn from Iranian University Entrance Exam algebra and calculus problems and expanded with human-reviewed LLM-generated synthetic variants, all selected so that plotting provides a natural solution strategy by revealing intersections, extrema, asymptotes, etc. Designed for both benchmarking and diagnosis, VAMPS goes beyond prior multimodal benchmarks that primarily evaluate reasoning over fixed visual inputs by testing whether a model can benefit from constructing a useful graph and grounding its answer in the resulting visualization. Overall, we found that across a diverse set of models, direct analytical solving surprisingly outperforms tool-enabled visual solving, even on problems where plotting is a natural strategy.

2606.04240 2026-06-04 cs.CV cs.AI cs.CL Version update

Overview of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1)

Jingbiao Mei

Affiliations: University of Cambridge; Cambridge, United Kingdom

AI summary: This paper presents the design, datasets, evaluation protocol, and final standings of the EReL@MIR 2025 Multimodal Document Retrieval Challenge (Track 1), along with an analysis of the top three winning systems, all of which build on decoder-based multimodal LLM embedders from the Qwen2-VL family.

Comments MDR Challenge Report at WWW2025

Abstract

Retrieval over visually-rich documents, pages that interleave text with figures, tables, and charts, is essential for multimodal retrieval-augmented generation, yet most retrievers still discard the visual channel. The Multimodal Document Retrieval Challenge, Track 1 of the MIR Challenge at the first EReL@MIR workshop, co-located with The Web Conference 2025, asks participants to build a single retrieval system that handles two complementary regimes: closed-set document page retrieval within long documents from a text query (MMDocIR), and open-domain retrieval of Wikipedia-style passages from an image or image-plus-text query (M2KR). Systems are ranked by the macro-average of mean Recall@{1,3,5} over the two tasks. The challenge drew 455 entrants and 586 submissions across 22 teams. This report describes the challenge design, datasets, and evaluation protocol; reports the final standings; and analyses the three winning teams' systems. All three build on decoder-based Multimodal-LLM embedders from the Qwen2-VL family rather than on CLIP-style encoders, and differ chiefly in whether they reach the top through fine-tuned ensembles, training-free multi-route fusion with a strong vision-language re-ranker, or zero-shot late interaction. The training-free system finished within 0.1 point of the fine-tuned winner.
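
The ranking rule in this abstract (the macro-average over the two tasks of the mean Recall at cutoffs 1, 3, and 5) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the input layout (one ranked list and one set of relevant ids per query), and the task names in the usage note are hypothetical, and the challenge's official scoring script may differ in its details.

```python
def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the relevant ids that appear in the top-k ranked ids."""
    relevant = set(relevant_ids)
    hits = len(relevant.intersection(ranked_ids[:k]))
    return hits / len(relevant)


def task_score(runs, ks=(1, 3, 5)):
    """Mean over the cutoffs of the per-query average Recall@k for one task.

    runs: list of (ranked_ids, relevant_ids) pairs, one pair per query
    (hypothetical input layout).
    """
    per_k = [
        sum(recall_at_k(ranked, relevant, k) for ranked, relevant in runs) / len(runs)
        for k in ks
    ]
    return sum(per_k) / len(per_k)


def challenge_score(task_runs):
    """Macro-average: each task counts equally, whatever its number of queries."""
    scores = [task_score(runs) for runs in task_runs.values()]
    return sum(scores) / len(scores)
```

A call would look like `challenge_score({"MMDocIR": runs_a, "M2KR": runs_b})`. Because the average is taken over tasks rather than over pooled queries, a task with many queries does not dominate the final score.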

2606.04205 2026-06-04 cs.MM cs.AI cs.CL cs.CV cs.LG cs.SD Version update

DetectZoo: A Unified Toolkit for AI-Generated Content Detection Across Text, Audio, and Image Modalities

Sajad Ebrahimi, Nima Jamali, Bardia Shirsalimian, Kelly McConvey, Wentao Zhang, Jalehsadat Mahdavimoghaddam, Maksym Taranukhin, Maura Grossman, Vered Shwartz, Yuntian Deng, Ebrahim Bagheri

Affiliations: University of Toronto; University of Waterloo; Toronto Metropolitan University; University of British Columbia; Vector Institute

AI summary: Proposes DetectZoo, a first-of-its-kind unified toolkit for multimodal AI-generated content detection. It enables fair and reproducible benchmarking by standardizing data preprocessing and the evaluation pipeline and by integrating 61 detectors and 22 benchmark datasets.

Abstract

The growing popularity and capacity of generative models have eroded the distinction between human and machine-generated content, motivating a growing body of work on detection across text, images, and audio. Most available detectors are either commercial software or, if open-source, come with incompatible codebases with bespoke preprocessing, evaluation protocols, and evaluation metrics, which make their adoption, fair comparison, and reproduction quite difficult. To address this critical gap, we introduce DetectZoo, a first-of-its-kind, extensible toolkit designed to provide a unified interface for AI-generated content detection across text, audio, and image modalities. DetectZoo standardizes the complete empirical pipeline, from data ingestion and preprocessing to model assessment, offering researchers a cohesive framework to benchmark state-of-the-art detectors systematically. By integrating diverse public datasets and baseline detection algorithms under a single, unified API, our toolkit facilitates rigorous and reproducible evaluation. DetectZoo provides reference implementations of 61 detectors, native loaders for 22 benchmark datasets, and a standardized evaluation pipeline that reports multiple metrics through a common interface. Each detector is self-contained yet accessible through the same interface, automatically caches pretrained weights, and reproduces the original published results. DetectZoo lowers the barrier to entry for multi-modal AI forensics, enabling researchers to identify performance gaps across domains and accelerating the development of robust, generalizable detection techniques. The open-source repository and comprehensive documentation are publicly available at https://github.com/sadjadeb/DetectZoo, and the package can be installed via pip install detectzoo.

2606.04198 2026-06-04 cs.CV Version update

Spatial Artifact Coherence Determines Codec Robustness in Patch-Based rPPG

Achraf Ben Ahmed

Affiliations: PlesmoSense SARL

AI summary: Proposes the Spatial Artifact Coherence (SAC) metric to explain why patch-based rPPG methods outperform global-projection methods under codec compression, and designs the PatchPCA algorithm family. Experiments show that SAC explains 93.8% of the variance in PCA advantage.

Abstract

Remote photoplethysmography (rPPG) achieves low heart-rate error on uncompressed benchmarks yet is deployed over compressed video channels in telehealth, neonatal ICU, and driver fatigue applications. No prior work identifies the physical quantity determining when spatial decomposition outperforms global-projection methods under codec compression. We propose Spatial Artifact Coherence (SAC), defined as the ratio of off-diagonal to diagonal energy in the 4x4 inter-patch Green-channel covariance matrix (bandpass 0.75-2.5 Hz), and the PatchPCA algorithm family (four codec-aware rPPG algorithms). We evaluate 280 subjects across three public datasets, 11 codec degradation variants (MPEG-4, H.265, H.264, JPEG, chroma subsampling), and 13 algorithms via Wilcoxon tests (BH-FDR, q < 0.05, 904 tests). SAC explains 93.8% of between-variant variance in PCA advantage (r = +0.969), with zero overlap between codec families: non-MPEG-4 variants cluster at SAC 0.10-0.18 with 84-90% PCA win rates, while MPEG-4 variants cluster at SAC 0.48-0.59 with 61% win rate and a 5.8x reduction in mean improvement. Within subjects, 78% confirm the expected pattern (p < 10^-22, dz = 0.73). Within-variant subject-level SAC correlation is r = +0.099, confirming SAC classifies codec families rather than predicting individual outcomes. MPEG-4's effect is structural (macroblock DCT geometry, not noise amplitude), governed by source codec state, not resolution. P-Hybrid is identified as the most deployment-robust algorithm. Two necessary operating conditions for PatchPCA advantage are established: SAC < 0.30 and low-to-moderate motion, directly ruling out raw-to-MPEG-4 transcoding pipelines. SAC provides a physically grounded metric for codec-aware rPPG algorithm selection in clinical remote monitoring systems.
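
As a rough illustration of the SAC definition given in this abstract (the ratio of off-diagonal to diagonal energy in the inter-patch green-channel covariance matrix, band-passed to 0.75-2.5 Hz), the following Python sketch computes such a ratio with NumPy. Several details are assumptions that the abstract does not state: four patches (which yields a 4x4 covariance matrix), "energy" taken as the sum of squared covariance entries, and a simple FFT mask standing in for the band-pass filter. The function names are hypothetical, and this is not the authors' implementation.

```python
import numpy as np


def bandpass_fft(signals, fs, low=0.75, high=2.5):
    """Zero every frequency bin outside [low, high] Hz (assumed filter design)."""
    n = signals.shape[-1]
    spectrum = np.fft.rfft(signals, axis=-1)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    spectrum[..., (freqs < low) | (freqs > high)] = 0.0
    return np.fft.irfft(spectrum, n=n, axis=-1)


def spatial_artifact_coherence(patch_green, fs):
    """Off-diagonal / diagonal energy of the inter-patch covariance matrix.

    patch_green: array of shape (n_patches, n_frames) holding the mean
    green-channel value of each patch per frame (assumed input layout).
    """
    filtered = bandpass_fft(np.asarray(patch_green, dtype=float), fs)
    cov = np.cov(filtered)        # (n_patches, n_patches) covariance matrix
    energy = cov ** 2             # assumed notion of "energy": squared entries
    diagonal = np.trace(energy)
    off_diagonal = energy.sum() - diagonal
    return off_diagonal / diagonal
```

Under these assumptions, four patches that carry the same in-band signal give a ratio of 3 (twelve off-diagonal entries against four diagonal ones), while mutually uncorrelated patches give a ratio near 0. Note that the absolute scale depends on the assumed energy definition and patch count, so values from this sketch should not be compared directly with the thresholds reported in the abstract.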

2606.04166 2026-06-04 cs.CV Version update

End-to-End Text Line Detection and Ordering

Benjamin Kiessling

Affiliations: ALMAnaCH, Inria, France

AI summary: Proposes the Orli model, which unifies text-line detection and reading-order determination as a single image-to-sequence problem and handles both end to end by autoregressively generating baselines, achieving strong performance on a variety of historical documents.

Abstract

Practical text-recognition pipelines for historical documents typically decompose layout analysis into line detection followed by a separate reading-order step, with the latter most often handled by a hand-coded geometric heuristic that struggles with marginalia, multiple columns, tables, and source-specific editorial conventions. This article introduces Orli (Ordered Regression of Lines), an end-to-end model that casts both sub-tasks as a single image-to-sequence problem: from a page image, Orli autoregressively generates text-line baselines directly in reading order. Baselines are represented in a chord-frame parameterization that anchors a line's position, orientation, and extent while encoding local geometry through perpendicular offsets; an iterative refinement head and a local visual refiner produce the final curve. Trained on a heterogeneous corpus of 196,691 pages spanning ten writing systems, Orli marginally exceeds the previously reported state of the art for cBAD line detection without dataset-specific training, reaches near perfect coverage and ordering on multiple reading-order benchmarks zero-shot, and adapts to more specialized out-of-domain layouts with limited fine-tuning. The method's source code and model weights are available under an open license at https://github.com/mittagessen/orli.

2606.04133 2026-06-04 cs.CV Version update

Pinpoint: Grounded Worldwide Image Geolocation via Cross-Source Retrieval and Reranking

Nika Chuzhoy, Brian Hu, Amit A. Arora, Jae Ro, Sarthak S. Sahu

Affiliations: Virtualitics

AI summary: Proposes Pinpoint, a retrieve-and-rerank architecture for worldwide image geolocation that uses contrastive learning to combine Flickr photos and street-view imagery, together with an attention-based reranker that exploits cross-source evidence. It achieves state-of-the-art results on multiple benchmarks.

Abstract

Image geolocation aims to estimate where a photograph was taken from its visual content. At worldwide scale, this remains challenging because visual evidence is often ambiguous, diverse, and unevenly distributed. Prior work has typically treated geolocation of ordinary internet photos and street-view imagery as separate tasks, despite their complementary strengths: internet photos better match the appearance distribution of user-captured queries, while street-view imagery provides denser, geographically grounded coverage. We present Pinpoint, a retrieve-and-rerank architecture that combines both sources in a coarse-to-fine pipeline. A contrastive image-GPS embedder is trained on both user-uploaded Flickr photos and street-view imagery, learning a shared image-GPS embedding space that is used to retrieve candidate locations. An attention-based reranker then rescores retrieved candidates by combining candidate-level visual and GPS features with cross-source evidence from nearby locations to ground the prediction. Unlike recent prior work, Pinpoint does not rely on multimodal large-language models, making inference faster and more reproducible. Pinpoint achieves state-of-the-art results across all metrics on standard benchmarks for internet photos (IM2GPS3k and YFCC4k) and street-view imagery (OSV-5M).

2606.04108 2026-06-04 cs.GR cs.AI cs.CV cs.LG Version update

Intra-modal Neighbors Never Lie: Rectifying Inter-modal Noisy Correspondence via Graph-Based Intra-modal Reasoning

Yang Liu, Wentao Feng, Shu-Dong Huang, Yalan Ye, Jiancheng Lv

Affiliations: University of Science and Technology of China

AI summary: Proposes the IN2R framework, which exploits the geometric stability of intra-modal data: a Graph Refiner performs relational reasoning over neighbors from a dynamic cross-modal memory and synthesizes continuous soft prototypes to rectify inter-modal noisy correspondence, significantly improving cross-modal retrieval performance.

Journal ref: International Conference on Machine Learning 2026

Abstract

Large-scale web-harvested datasets have fueled the progress of cross-modal retrieval but inevitably suffer from noisy correspondence, which severely degrades model generalization. Existing methods primarily address this by filtering out noise or seeking a substitute label, yet they predominantly remain bound by a "Discrete Selection" paradigm. We argue that relying on a single discrete proxy induces Single-Point Fragility and Discretization Error. To overcome these limitations, we propose a novel framework, Intra-modal Neighbor-aware Noise Rectification (IN2R), which shifts the paradigm from searching for a substitute to synthesizing a reliable supervision target. Leveraging the intrinsic geometric stability of intra-modal data, IN2R employs a Graph Refiner to perform relational reasoning over neighbors retrieved from a dynamic Cross-Model Memory. Instead of propagating discrete labels, our method synthesizes a continuous, soft prototype that reflects the consensus of the local semantic neighborhood, effectively rectifying inter-modal misalignment. Extensive experiments on Flickr30K, MS-COCO, and CC152K demonstrate that IN2R significantly outperforms state-of-the-art methods. Our code and pre-trained models are publicly available at https://github.com/liuyyy111/IN2R.

2606.04060 2026-06-04 cs.CV Version update

Weakly Supervised Incremental Segmentation via Semantic Anchors and Spatial Arbitration

Zhonggai Wang, Kai Fang, Guangyu Gao

Affiliations: National Natural Science Foundation of China; Tsinghua University

AI summary: To address the feature drift and semantic overwriting caused by noisy supervision in weakly supervised incremental semantic segmentation, proposes SASA, which stabilizes representation learning with semantic anchors and filters unreliable signals with spatial label arbitration, effectively mitigating feature drift.

Comments Accepted by ICME 2026

Abstract

Weakly Incremental Learning for Semantic Segmentation (WILSS) suffers from the continuous introduction of noisy supervision, which progressively corrupts class-level representations, leading to severe feature drift and semantic corruption, thereby causing newly learned classes to overwrite old ones. To address these issues, we propose a drift-resilient WILSS approach, named SASA, designed to stabilize semantic learning via Semantic Anchors and Spatial Arbitration. Specifically, at the representation level, we introduce semantic anchors of learnable tokens as rigid class-level references to preserve long-term semantic identity. Complementary to this, an elastic residual adaptation facilitates controlled, instance-specific refinement, ensuring a stable yet flexible learning trajectory. At the supervision level, we develop a Spatial Label Arbitration mechanism that performs geometry-aware decisions to directly filter unreliable signals and enforce a strict "one object, one class" constraint. By synergistically stabilizing representations and improving supervision reliability, SASA effectively mitigates feature drift under weak supervision. Extensive experiments on standard benchmarks demonstrate that our approach consistently outperforms existing state-of-the-art methods, particularly in challenging multi-step incremental settings. The code is available at https://github.com/ZhonggaiWang/SASA.

2606.04046 2026-06-04 cs.CV cs.AI cs.CL cs.LG cs.RO Version update

Dive into the Scene: Breaking the Perceptual Bottleneck in Vision-Language Decision Making via Focus Plan Generation

Boyuan Xiao, Bohong Chen, Yumeng Li, Ji Feng, Yao-Xiang Ding, Kun Zhou

Affiliations: University of Science and Technology of China; Tsinghua University

AI summary: Proposes SceneDiver, which uses coarse-to-fine focus plan generation to progressively build a scene graph and decompose the task, reducing visual hallucinations and improving the performance of vision-language models and vision-language-action models on embodied decision-making tasks.

Comments Accepted at ICML 2026

Abstract

In embodied vision-language decision-making tasks such as robotic manipulation and navigation, Vision-Language Models and Vision-Language-Action Models (VLMs and VLAs) are powerful tools with different strengths: VLMs are better at long-term planning, while VLAs are better at reactive control. However, their performance is limited by the same perceptual bottleneck: visual hallucinations arise from the models' inability to distinguish task-relevant objects from distractors. In principle, accurately identifying and focusing on critical objects while filtering out irrelevant ones is the key to breaking this limitation. A straightforward solution is one-step focus: directly attending to essential objects. However, this approach proves ineffective because effective focus inherently requires deep scene understanding. To this end, we propose SceneDiver, a coarse-to-fine focus plan generation method that leverages the long-term planning abilities of VLMs: it first constructs a holistic scene graph to establish initial comprehension, then progressively decomposes the task into simpler sub-problems through an iterative cycle of recognition, understanding, and analysis. To enable reactive control, we also design a lightweight adapter that distills this deliberate focus ability into VLAs. Evaluations on standard embodied AI benchmarks confirm that our method substantially reduces visual hallucinations for both VLMs and VLAs, while preserving computational efficiency in tasks requiring fast execution. Our code and data are released at https://future-item.github.io/SceneDiver.

2606.03972 2026-06-04 cs.CV Version update

AAD-1: Asymmetric Adversarial Distillation for One-Step Autoregressive Video Generation

Haobo Li, Yanhong Zeng, Yunhong Lu, Jiapeng Zhu, Hao Ouyang, Qiuyu Wang, Ka Leong Cheng, Yujun Shen, Zhipeng Zhang

Affiliations: University of Science and Technology of China

AI summary: Proposes AAD-1, an asymmetric adversarial distillation framework that breaks the symmetry between generator and discriminator and adopts a phased training strategy to address motion collapse and training instability in one-step autoregressive video generation, achieving state-of-the-art performance.

Comments ICML 2026. Project page: https://aad-1.github.io/

Abstract

We present AAD-1, an Asymmetric Adversarial Distillation framework for One-step autoregressive image-to-video generation. State-of-the-art methods adopt adversarial distillation but suffer from motion collapse and training instability, resulting in static videos. AAD-1 addresses these challenges through two key designs in architecture and training strategy. Our key architectural insight is to break the symmetry between generator and discriminator. While the generator remains causal to preserve autoregressive sampling capability, the discriminator attends bidirectionally over the full spatiotemporal context and produces a single holistic realism score for the entire video sequence. This asymmetric design enables the discriminator to effectively detect global temporal failures and long-range drift that cause motion collapse in autoregressive generation. To stabilize training, we introduce a phased strategy that first uses distribution matching to bootstrap a stable one-step generator, providing a warm-up phase that brings the student distribution closer to the teacher before adversarial distillation begins. Extensive experiments on VBench demonstrate that AAD-1 achieves state-of-the-art performance in one-step autoregressive video generation.
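
The asymmetry described in this abstract, a causal generator paired with a discriminator that attends bidirectionally over the whole clip, can be pictured as two different attention masks. The NumPy sketch below is only an illustration under stated assumptions: frame-level (block) causality, a fixed number of tokens per frame, and the function name are all hypothetical, since the abstract does not specify how attention is implemented.

```python
import numpy as np


def attention_mask(num_frames, tokens_per_frame, causal):
    """Boolean mask where entry [i, j] is True if token i may attend to token j.

    causal=True  : a token sees its own frame and all earlier frames
                   (assumed block-causal layout for the generator).
    causal=False : every token sees the full spatiotemporal context
                   (bidirectional layout for the discriminator).
    """
    # Frame index of every token, e.g. [0, 0, 1, 1, 2, 2] for 3 frames x 2 tokens.
    frame_of_token = np.repeat(np.arange(num_frames), tokens_per_frame)
    if causal:
        return frame_of_token[:, None] >= frame_of_token[None, :]
    n = num_frames * tokens_per_frame
    return np.ones((n, n), dtype=bool)


generator_mask = attention_mask(num_frames=3, tokens_per_frame=2, causal=True)
discriminator_mask = attention_mask(num_frames=3, tokens_per_frame=2, causal=False)
```

With the causal mask, tokens of the first frame never see later frames, which is what keeps frame-by-frame sampling possible. The full mask lets a single score depend on the entire sequence, which is the property the abstract credits for detecting long-range drift.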


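2606.03943 2026-06-04 cs.RO cs.CV cs.LG (updated version)

PointAction: 3D Points as Universal Action Representations for Robot Control

Mutian Tong, Han Jiang, Qiao Feng, Lingjie Liu, Jiatao Gu

Affiliations: University of Pennsylvania

AI summary: Proposes PointAction, which fine-tunes a video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, uses the point dynamics as an embodiment-agnostic action interface, and maps them to executable actions with a diffusion action decoder, reducing the ambiguity of RGB-only action grounding and enabling transfer across tasks and embodiments.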

Comments Project page: https://oriontmt.github.io/pointaction/

English abstract

Video-Action Models (VAMs) leverage the broad visual dynamics captured by pre-trained video diffusion models, offering a promising path toward generalizable robot manipulation. However, RGB-only video rollouts are not directly actionable: they leave metric 3D motion, contact geometry, and fine-grained spatial constraints under-specified, making action grounding ambiguous. Meanwhile, scaling action supervision across diverse tasks and embodiments remains costly. We present PointAction, a framework that bridges video predictions to robot actions through explicit point-based 4D modeling. PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, producing temporally consistent 3D motion of task-relevant scene geometry. These point dynamics serve as a structured, embodiment-agnostic action interface, which a diffusion-based action decoder maps to executable robot actions. By using metric 3D point dynamics as the interface between video prediction and control, PointAction reduces the ambiguity of RGB-only action grounding and supports transfer across tasks and embodiments with limited action supervision. Experiments show that PointAction achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.


2606.03746 2026-06-04 cs.CV cs.AI cs.GR cs.LG (updated version)

Qwen-Image-Flash: Beyond Objective Design

Tianhe Wu, Kun Yan, Zikai Zhou, Lihan Jiang, Jiahao Li, Jie Zhang, Kaiyuan Gao, Ningyuan Tang, Shengming Yin, Xiaoyue Chen, Xiao Xu, Yilei Chen, Yuxiang Chen, Yan Shu, Yixian Xu, Yanran Zhang, Zihao Liu, Zhendong Wang, Zekai Zhang, Deqing Li, Liang Peng, Yi Wang, Jingren Zhou, Chenfei Wu

Affiliations: alibaba-inc.com (Alibaba)

AI summary: Systematically studies three factors (data composition, teacher guidance, and task mixture) and proposes Qwen-Image-Flash, showing that effective few-step distillation requires not only a carefully designed objective but also a principled organization of the broader training pipeline.

2606.03598 2026-06-04 cs.RO cs.AI cs.CV (updated version)

PHASER: Phase-Aware and Semantic Experience Replay for Vision-Language-Action Models

Ziyang Chen, Shaoguang Wang, Weiyu Guo, Qianyi Cai, He Zhang, Pengteng Li, Yiren Zhao, Yandong Guo

Affiliations: Thrust of AI, HKUST(Guangzhou); AI 2 Robotics, Shenzhen, China

AI summary: Proposes PHASER, which combines phase-aware capacity allocation and a multi-modal interference routing strategy with Auto-PC, an automatic phase-extraction pipeline, to mitigate catastrophic forgetting in continual learning of VLA models, improving average success rate by up to 31% on the LIBERO benchmark.

Comments 20 pages, 8 figures, 12 tables

English abstract

Vision-Language-Action (VLA) models have achieved remarkable success in language-conditioned robotic manipulation. However, deploying these models in open-ended environments requires continuously acquiring novel skills, a process that inevitably triggers severe catastrophic forgetting of previously learned behaviors. While experience replay (ER) serves as a standard mitigating strategy, naive uniform sampling fundamentally misaligns with the temporal characteristics of manipulation trajectories. It systematically under-samples brief but causally critical sub-skills, leading to phase starvation, and completely overlooks the varying degrees of forgetting across historical tasks. To overcome these limitations, we introduce PHASER, an architecture-agnostic continual learning framework. PHASER employs a phase-centric capacity allocation to guarantee equal memory support for all sub-skills, coupled with a multi-modal interference routing strategy that dynamically prioritizes historical phases at high risk of forgetting. Furthermore, to enable fully autonomous lifelong adaptation, we integrate Auto-PC, a lightweight pipeline combining unsupervised action-signal change-point detection with VLM-based semantic verification to extract temporal boundaries without intensive manual supervision. Evaluated across three VLA backbones on LIBERO continual learning suites, PHASER yields substantial empirical improvements, increasing Average Success Rate (ASR) by up to 31% over matched-budget ER and achieving an 87.8% final ASR on the LIBERO-Goal CL setting.
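The "phase starvation" argument in this abstract can be made concrete with a small buffer-allocation example. This is a minimal Python sketch under stated assumptions, not PHASER itself: the phase names, frame counts, buffer capacity, and both function names are hypothetical, and the interference routing strategy is not modeled.

```python
def uniform_quota(phase_lengths, capacity):
    """Buffer slots per phase when frames are kept in proportion to phase duration."""
    total = sum(phase_lengths.values())
    return {p: capacity * n // total for p, n in phase_lengths.items()}


def phase_balanced_quota(phase_lengths, capacity):
    """Equal buffer slots per phase, capped by the number of frames the phase has."""
    share = capacity // len(phase_lengths)
    return {p: min(share, n) for p, n in phase_lengths.items()}


# Hypothetical trajectory: long reach and transport phases, a very short grasp.
lengths = {"reach": 120, "grasp": 8, "transport": 150, "release": 10}
uniform = uniform_quota(lengths, capacity=32)
balanced = phase_balanced_quota(lengths, capacity=32)
```

With these made-up numbers, the proportional scheme gives the 8-frame grasp phase no slots at all, while the balanced scheme gives every phase the same share of the buffer.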


2606.03564 2026-06-04 cs.CV cs.AI (updated version)

CR-Seg: Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation

Yifan Cao, Xiaocui Yang, Faxian Wan, Shi Feng, Daling Wang, Yifei Zhang

Affiliations: School of Computer Science and Engineering, Northeastern University

AI summary: Proposes CR-Seg, a two-stage framework that uses attention-map extraction and a global-to-local chain of thought for coarse-to-refined reasoning segmentation, addressing cross-modal alignment and reasoning-answer inconsistency.

English abstract

Reasoning segmentation aims to segment target objects described by complex language through joint visual-textual reasoning. Existing methods typically rely on either learned semantic tokens to bridge Multimodal Large Language Models (MLLMs) and segmentation models, suffering from difficult cross-modal alignment, or explicit spatial prompts such as bounding boxes, which may lose holistic response semantics. To address these limitations, we propose Attention-Guided and CoT-Enhanced Coarse-to-Refined Reasoning Segmentation, termed CR-Seg, a two-stage framework for coarse-to-refined reasoning segmentation. Specifically, we design an Extract Attention Maps and Points (EAP) module to extract attention maps for coarse target localization and select informative points, both of which are fed into SAM for mask refinement. To alleviate reasoning-answer inconsistency, we further introduce Global-to-Local Chain-of-Thought (GLCoT), which guides the model to reason progressively from global scene context to local target details. Extensive experiments on reasoning segmentation benchmarks demonstrate the effectiveness of CR-Seg.


2606.03402 2026-06-04 cs.CV (updated version)

Mamba-Enhanced Implicit Motion Learning for Audio-Driven Portrait Animation

Xuan Wei, Jiahui Chen, Kaiheng Li, Mingyu Shao, Qingqi Hong

Affiliations: Fujian Provincial Natural Science Foundation of China; Giant Interactive Group Inc.; National Natural Science Foundation of China

AI summary: Proposes a two-stage implicit-motion framework that combines a region-aware attention mechanism with a Mamba-enhanced diffusion model to generate realistic, temporally coherent human motion videos from a single static image and audio, reaching state-of-the-art performance on multiple benchmarks.

Comments accepted by 2026 IEEE International Conference on Multimedia and Expo (ICME)

English abstract

Audio-driven human motion video generation aims to synthesize realistic and temporally coherent human animations from a single static image, with applications in talking-head synthesis, co-speech gesture generation, and dynamic presentations. Moving beyond conventional keypoint-based methods that often struggle to capture subtle motion dynamics, we propose a novel implicit-motion framework for generating realistic and temporally coherent human motion videos from a single static image and audio. Our approach uses a two-stage pipeline that decouples motion prediction from rendering. The first stage integrates appearance priors and hierarchical depth cues into a region-aware attention mechanism to model latent motion features. The second stage employs a Mamba-enhanced diffusion model to directly predict these features from audio and the source image, enabling unsupervised learning of fine-grained motion patterns. This decoupled architecture enhances flexibility and efficiency. Trained on a new 380-hour high-quality dataset, our method outperforms prior work across multiple public benchmarks and our collected data in accuracy, naturalness, and temporal coherence, setting a new state of the art.


2606.03376 2026-06-04 cs.CV cs.AI cs.CL cs.LG (updated version)

P$^2$-DPO: Grounding Hallucination in Perceptual Processing via Calibration Direct Preference Optimization

Ruipeng Zhang, Zhihao Li, Haozhang Yuan, C. L. Philip Chen, Tong Zhang

Affiliations: Guangdong Provincial Key Laboratory of Computational AI Models and Cognitive Intelligence, School of Computer Science & Engineering, South China University of Technology; Pazhou Lab, Guangzhou, China; Engineering Research Center of the Ministry of Education on Health Intelligent Perception and Paralleled Digital-Human, Guangzhou, China

AI summary: To address hallucination in large vision-language models, proposes the P$^2$-DPO training paradigm, which uses model-generated preference pairs and a calibration loss to directly target perceptual bottlenecks and visual robustness without costly human feedback.

English abstract

Hallucination has recently garnered significant research attention in Large Vision-Language Models (LVLMs). Direct Preference Optimization (DPO) aims to learn directly from the corrected preferences provided by humans, thereby addressing the hallucination issue. Despite its success, this paradigm has yet to specifically target the perceptual bottleneck in attended regions or address insufficient Visual Robustness against image degradation. Furthermore, existing preference pairs are often vision-agnostic, and their inherently off-policy nature limits their effectiveness in guiding model learning. To address these challenges, we propose Perceptual Processing Direct Preference Optimization (P$^2$-DPO), a novel training paradigm in which the model generates and learns from its own preference pairs, thereby directly addressing the identified visual bottlenecks while inherently avoiding the issues of vision-agnostic and off-policy data. It introduces: (1) an on-policy preference-pair construction method targeting Focus-and-Enhance perception and Visual Robustness, and (2) a well-designed Calibration Loss to precisely align visual signals with the causal generation of text. Experimental results demonstrate that, with a comparable amount of training data and cost, P$^2$-DPO outperforms strong baselines that rely on costly human feedback. Furthermore, evaluations on Attention Region Fidelity (ARF) and image degradation scenarios validate the effectiveness of P$^2$-DPO in addressing the perceptual bottleneck in attended regions and improving Visual Robustness against degraded inputs.
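For readers unfamiliar with the base objective, the standard DPO loss for a single preference pair can be written in a few lines. This is a minimal Python sketch of plain DPO only; it does not reproduce the Calibration Loss or the pair-construction method of P$^2$-DPO, and the argument names and the default `beta` are assumptions made for illustration.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities.

    logp_* come from the policy being trained, ref_* from the frozen reference.
    The loss is -log(sigmoid(beta * margin)), written here as log(1 + exp(-x)).
    """
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return math.log1p(math.exp(-beta * margin))


# Hypothetical log-probabilities: the policy prefers the chosen response
# more strongly than the reference does, so the loss falls below log(2).
loss_good = dpo_loss(-1.0, -3.0, -2.0, -2.0)
loss_bad = dpo_loss(-3.0, -1.0, -2.0, -2.0)
```

When the policy and the reference agree exactly, the margin is zero and the loss equals log 2; the loss decreases as the policy raises the chosen response relative to the rejected one.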


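2606.03201 2026-06-04 cs.CV cs.AI (updated version)

Reinforcement Learning from Cross-domain Videos with Video Prediction Model

Zhao Yang, Xinrui Zu, Jacob E. Kooi, Thomas Delliaux, He Liu, Shujian Yu, Kevin Sebastian Luck, Vincent François-Lavet

Affiliations: VU Amsterdam; ISAE-SUPAERO

AI summary: Proposes the XIPER reward model, which maps agent observations into the expert domain via cross-domain video prediction and uses the prediction likelihood as the reward signal, addressing the missing reward signal and the domain gap between visually different domains.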

English abstract

Reinforcement learning from expert videos across visually distinct domains is challenging due to the absence of reward signals and the presence of domain gaps. We introduce XIPER (Cross-domain Video Prediction Reward), a reward model for learning from expert videos collected in a visually different domain, where the agent's appearance differs due to factors such as color, morphology, or the sim-to-real gap. More specifically, XIPER trains a cross-domain video prediction model that maps agent observations into the expert domain and uses the prediction likelihood as a reward signal. Experiments on the DMC Color Suite (8 tasks) and DMC Body Suite (3 tasks) show that XIPER consistently outperforms baselines despite domain gaps such as differences in agent color and morphology. We further analyze XIPER on a sim-to-real transfer dataset, demonstrating that it produces meaningful reward signals for real-robot observations given only simulated expert videos. Code, pretrained models, datasets and video demonstrations can be found on our project webpage: https://sites.google.com/view/xiper


2606.03175 2026-06-04 cs.CV cs.RO (updated version)

Ask When It Pays: Cost-Aware Open-Ended Interaction for Instance Goal Navigation

Xunyi Zhao, Sihao Lin, Gengze Zhou, Zerui Li, Shijie Li, Wei Tao, Jiajun Liu, Qi Wu

Affiliations: Adelaide University; Responsible AI Research Centre, Australian Institute for Machine Learning; Institute for Infocomm Research (I2R), A*STAR; iMotion; CSIRO Data61

AI summary: To handle language ambiguity in instance goal navigation, proposes a cost-sensitive uncertainty-reduction approach: an information-gain analysis identifies effective question types, a benchmark and a weighted success rate metric are constructed, and a zero-shot MLLM navigator queries only when the expected benefit exceeds the cost.

English abstract

Instance Goal Navigation (IGN) requires an embodied agent to find a specific object instance among distractors from an under-specified natural-language description. Such ambiguity often cannot be resolved from perception and language alone, making interaction with an oracle a natural mechanism for disambiguation. Prior interactive methods allow oracle queries but treat lightweight clarification and route-level guidance alike, letting agents boost success rate through repeated high-information questions rather than by resolving the underlying ambiguity efficiently. We recast interactive IGN as a cost-sensitive uncertainty-reduction problem, where the agent should ask the question whose answer provides the largest reduction in navigation uncertainty relative to its penalty. To this end, we apply an information-gain analysis on existing navigation corpora to identify which cues reduce navigation uncertainty, yielding a compact set of question types and data-derived weights. However, existing interactive navigation benchmarks do not model the cost of different question types or evaluate how efficiently agents use interaction, making them unsuitable for studying cost-sensitive interaction. Based on this taxonomy, we construct a benchmark for diagnosing interaction behavior and efficiency, together with a Weighted Success Rate metric that penalizes each query by its derived cost. We further propose a zero-shot MLLM navigator that selectively queries at each decision step only when the expected uncertainty reduction justifies the interaction cost.
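The "ask only when it pays" rule in this abstract can be illustrated with a toy decision procedure. This is a minimal Python sketch, not the paper's navigator: it assumes the agent holds a belief over candidate instances, that each question type splits the candidates into answer groups, and that costs are expressed in the same unit as information gain (bits). The function names, question types, and cost values are hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def expected_gain(belief, partition):
    """Expected entropy reduction when the answer reveals which group holds the target."""
    remaining = 0.0
    for group in partition:
        mass = sum(belief[i] for i in group)
        if mass > 0:
            remaining += mass * entropy([belief[i] / mass for i in group])
    return entropy(belief) - remaining


def pick_question(belief, questions):
    """questions maps a name to (partition, cost); returns a name, or None to stay silent."""
    gains = {name: expected_gain(belief, part) for name, (part, _) in questions.items()}
    best = max(questions, key=lambda name: gains[name] / questions[name][1])
    return best if gains[best] > questions[best][1] else None


# Hypothetical question types over four candidate chairs.
questions = {
    "color": ([[0, 1], [2, 3]], 0.5),        # cheap clarification
    "route": ([[0], [1], [2], [3]], 3.0),    # expensive route-level guidance
}
ambiguous = pick_question([0.25, 0.25, 0.25, 0.25], questions)
confident = pick_question([0.97, 0.01, 0.01, 0.01], questions)
```

In this toy setting the agent asks the cheap color question when all four candidates are equally likely, and asks nothing once it is already nearly certain, because no question reduces uncertainty by more than it costs.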


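2606.02894 2026-06-04 cs.CV (updated version)

Tiny Collaborative Inference for Occlusion-Robust Object Detection

Chieh-Tung Cheng, Mustafa Aslanov, Eiman Kanjo

Affiliations: Imperial College London; Nottingham Trent University

AI summary: For ultra-low-end edge devices, combines an MCUNet backbone, a YOLOv2 detection head, and TensorFlow Lite quantisation; finds that decision-level fusion (WBF) improves mAP by up to +0.2736 over feature-level fusion under occlusion, and validates the feasibility of multi-view fusion and Wi-Fi peer-to-peer deployment.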

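AI-generated Chinese abstract (translated)

Small edge devices, such as IoT surveillance nodes and search-and-rescue (SAR) platforms, are increasingly expected to run computer vision locally. On ultra-low-end hardware, however, object detection is limited by available memory and compute, by communication cost when multiple devices cooperate, and by accuracy loss due to occlusion. This paper evaluates occlusion-robust object detection on devices with less than 1 MB of SRAM by combining an MCUNet backbone, a YOLOv2 detection head, and TensorFlow Lite quantisation. We evaluate two collaborative inference strategies: feature-level fusion (concatenating intermediate feature maps) and decision-level fusion via Weighted Boxes Fusion (WBF). Under the tested occlusion settings, WBF outperforms feature-level fusion, with gains of up to +0.2736 mAP in asymmetric occlusion scenarios. Extending fusion to three views further improves accuracy (up to +0.3827 mAP) while increasing communication overhead (about 1.3 KB per exchange). Hardware experiments start from a host-assisted USB-relay baseline and then move to a Wi-Fi peer-to-peer deployment on two Coral Dev Board Micro units, where WBF runs on-device and communication energy remains small relative to inference. In a representative 301.9 s autonomous session of 108 frames, fused output was observed on 61 frames versus 47 for Board 2 alone, a frame-level coverage gain of +29.8%. We also include a small exploratory feasibility note on decentralised federated learning (DFL), but do not treat it as a primary result because performance remains limited under non-IID local data. The results support decision-level fusion as a viable option for improving occlusion robustness in small-scale edge object detection, including host-free multi-board operation on ultra-low-end hardware.

English abstract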

Edge AI nodes for search and rescue are increasingly expected to run computer vision locally, yet ultra-low-end hardware imposes hard constraints on memory, compute, and inter-device communication. This work addresses occlusion-robust object detection on devices with less than 1 MB SRAM by combining an MCUNet backbone, a YOLOv2 detection head, and TensorFlow Lite quantisation. Two collaborative inference strategies are evaluated: feature-level fusion, concatenating intermediate feature maps, and decision-level fusion via Weighted Boxes Fusion (WBF). WBF outperforms feature-level fusion under all tested occlusion conditions, yielding gains of up to +0.2736 mAP in asymmetric scenarios. Extending fusion to three views improves accuracy further (up to +0.3827 mAP) at modest communication overhead (~1.3 KB per exchange). Hardware experiments progress from a host-assisted USB-relay baseline to a Wi-Fi peer-to-peer deployment on two Coral Dev Board Micro units, where WBF executes on-device with negligible communication energy relative to inference. In a 301.9 s autonomous session of 108 frames, fused output is produced on 61 frames versus 47 for a single board, a coverage gain of +29.8%. A decentralised federated learning feasibility note is included but not treated as a primary result, as performance remains limited under non-IID data. The results support decision-level fusion as a viable option for improving occlusion robustness in small-scale edge object detection, including host-free multi-board operation on ultra-low-end hardware.
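Decision-level fusion and the reported coverage figure can both be checked with a short example. This is a minimal Python sketch of a simplified two-view variant of Weighted Boxes Fusion, not the paper's on-device code: it matches boxes greedily, ignores class labels, and the function names, the IoU threshold, and the sample detections are assumptions.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def fuse_two_views(dets_a, dets_b, iou_thr=0.55):
    """Fuse (box, score) lists from two views.

    Matched boxes are averaged with score weights; a box seen by only one
    view is kept with its score halved, since one of two views supports it.
    """
    fused, used = [], set()
    for box_a, s_a in dets_a:
        match = next((j for j, (box_b, _) in enumerate(dets_b)
                      if j not in used and iou(box_a, box_b) >= iou_thr), None)
        if match is None:
            fused.append((box_a, s_a / 2))
            continue
        used.add(match)
        box_b, s_b = dets_b[match]
        total = s_a + s_b
        box = tuple((s_a * p + s_b * q) / total for p, q in zip(box_a, box_b))
        fused.append((box, total / 2))
    fused += [(box, s / 2) for j, (box, s) in enumerate(dets_b) if j not in used]
    return fused


both = fuse_two_views([((0, 0, 10, 10), 0.9)], [((1, 1, 11, 11), 0.3)])
occluded = fuse_two_views([((0, 0, 10, 10), 0.9)], [])  # second view sees nothing

# Coverage gain from the abstract: 61 fused frames versus 47 for a single board.
coverage_gain_pct = round((61 - 47) / 47 * 100, 1)
```

The `occluded` case shows why fusing decisions helps under asymmetric occlusion: a detection from the unobstructed view survives even when the other view reports nothing. The last line reproduces the +29.8% coverage figure from the frame counts given in the abstract.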


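2606.02576 2026-06-04 cs.CV cs.LG (updated version)

VG²GT: Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer

Yibin Zhao, Yihan Pan, Jun Nan, Wenli Yang, Liwei Chen, Jianjun Yi

Affiliations: East China University of Science and Technology; Shanghai Open University; Shanghai Xiaoyuan Innovation Center

AI summary: Proposes VG²GT, which uses a frozen visual foundation model and a multi-scale differentiable voxel module, regresses Gaussian primitive parameters directly from voxel features, and supervises depth maps through stochastic solid volume rendering to achieve geometrically accurate Gaussian scene reconstruction, with the best reported performance on several datasets.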

English abstract

Gaussian splatting has shown strong potential for 3D reconstruction and novel view synthesis. However, most existing methods require accurate camera parameters and per-scene optimization, while feed-forward methods with pixel-aligned Gaussian primitives often suffer from artifacts and non-uniform primitives. In this paper, we propose $\text{VG}^2$GT, a Voxel-Gaussian Splatting Visual Geometry-Grounded Transformer. $\text{VG}^2$GT leverages a frozen pretrained visual foundation model (VFM), incorporates a multi-scale differentiable voxel module to enhance geometric understanding, and directly splits and regresses Gaussian primitive parameters from voxel features. During training, depth maps are supervised through stochastic solid volume rendering, enabling geometrically accurate Gaussian scene reconstruction while keeping the visual foundation model fully frozen. This design enables $\text{VG}^2$GT to be seamlessly plugged into any patch-feature-based VFM, while substantially reducing the required training cost. $\text{VG}^2$GT outperforms current state-of-the-art methods on the widely used DTU, Replica, TAT, and ScanNet datasets.


2606.01537 2026-06-04 cs.CV cs.LG (updated version)

PaCX-MAE: Physiology-Augmented Chest X-Ray Masked Autoencoder

Yancheng Liu, Kenichi Maeda, Manan Pancholy

Affiliations: University of California, Berkeley; University of Tokyo; University of Michigan

AI summary: Proposes PaCX-MAE, a cross-modal distillation framework that injects physiological priors into a chest X-ray encoder through a dual contrastive-predictive objective, improving performance on physiology-related tasks while keeping inference unimodal.

Comments Accepted at the ICML 2026 3rd Workshop on Multi-modal Foundation Models and Large Language Models for Life Sciences (FM4LS)

English abstract

Clinical diagnosis often requires combining imaging with physiological measurements, yet deployed models typically operate on unimodal data. We present PaCX-MAE, a cross-modal distillation framework that injects physiological priors into chest X-ray (CXR) encoders while remaining strictly unimodal at inference. PaCX-MAE augments in-domain masked autoencoding with a dual contrastive-predictive objective, aligning CXR representations with paired ECG and laboratory embeddings. Extensive evaluation across nine benchmarks demonstrates consistent improvements over domain-specific MAE, particularly on physiology-dependent tasks (e.g., +2.7 AUROC on MedMod; +6.5 F1 on VinDr). The method proves highly label-efficient in the 1% regime and preserves anatomical fidelity, achieving parity with MAE on segmentation tasks. Zero-shot and attention analyses confirm that PaCX-MAE successfully learns to attend to physiological indicators, such as the cardiac silhouette, absent in standard visual pretraining.


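2606.01023 2026-06-04 cs.CV cs.AI (updated version)

Data Collection for Training Quality-Control AI in Carpet Manufacturing

Akbar Erkinov

Affiliations: Independent Researcher

AI summary: To address slow, subjective, and inconsistent visual inspection in carpet production, proposes a design for an in-line machine-vision system that detects defects in real time using synchronized line-scan cameras and combined illumination, systematically collects labeled data to keep training quality-control models, and quantifies quality improvement through the DMAIC method.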

Comments 10 pages, 3 figures

English abstract

Visual inspection remains the dominant quality-control practice in woven and tufted carpet production, yet it is slow, subjective, and inconsistent at the line speeds and widths of modern looms. We present a design proposal for an in-line machine-vision system whose primary purpose is twofold: to inspect the carpet web in real time and, equally importantly, to systematically collect and label images of defect patterns so that increasingly capable quality-control models can be trained over the life of the installation. The proposal is grounded in a concrete industrial setting: a Six Sigma (DMAIC) project at a woven-carpet production facility that anticipated a production bottleneck following the installation of additional weaving machines, with a substantial baseline defect rate and significant financial exposure associated with quality failures. We describe an imaging subsystem based on synchronized line-scan cameras with combined bright-field and grazing illumination, derive the resolution and throughput requirements needed to resolve fine structural defects across a multi-metre web, and define a carpet-specific defect taxonomy. We then lay out a staged modelling strategy that begins with unsupervised anomaly detection trained on defect-free material, following the paradigm exemplified by the carpet category of the MVTec Anomaly Detection benchmark, and matures through a human-in-the-loop annotation flywheel into supervised detection and segmentation models. Finally, we connect detection performance to the DMAIC objectives, showing how reductions in escaped defects translate into improved process quality and process sigma levels. The contribution is an end-to-end, deployable blueprint that treats data collection as a first-class engineering objective rather than an afterthought.
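The kind of resolution and throughput derivation mentioned in this abstract follows from two basic relations: pixels across the web equal web width divided by object-space pixel size, and the line rate equals web speed divided by pixel size. The following is a minimal Python sketch; every number in it (web width, pixel size, web speed, sensor width) is a hypothetical placeholder rather than a figure from the paper, and camera overlap is ignored.

```python
def line_scan_requirements(web_width_um, pixel_um, web_speed_um_per_s, sensor_pixels):
    """Return (pixels across the web, cameras needed, line rate in lines/s).

    Lengths are in micrometres so the pixel and camera counts stay exact integers.
    """
    pixels_across = -(-web_width_um // pixel_um)       # ceiling division
    cameras = -(-pixels_across // sensor_pixels)       # ceiling division
    line_rate_hz = web_speed_um_per_s / pixel_um       # one line per pixel of travel
    return pixels_across, cameras, line_rate_hz


# Hypothetical setup: 4 m web, 0.2 mm pixels, 10 m/min web speed, 8192-pixel sensors.
pixels, cameras, line_rate = line_scan_requirements(
    web_width_um=4_000_000,
    pixel_um=200,
    web_speed_um_per_s=10_000_000 / 60,
    sensor_pixels=8192,
)
```

Under these placeholder values the web needs 20,000 pixels across, which takes three 8192-pixel cameras, each triggered at roughly 833 lines per second to keep pixels square.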


2606.00747 2026-06-04 cs.CV cs.AI (updated version)

SkyShield: Occupancy as a Safety Interface for Low-Altitude UAV Autonomy

Jie Gao, Jie Ma, Kaihui Lin, Kai Ye, Miaohui Zhang, Pingyang Dai, Liujuan Cao

Affiliations: Xiamen University; Jiangxi Academy of Sciences

AI summary: For 3D spatial understanding in low-altitude UAV autonomy, proposes SkyShield, the first front-view monocular semantic occupancy benchmark; KAR-mIoU, a dynamics-aware metric; and SkyOcc, a geometry-first baseline, treating occupancy as a safety interface.

English abstract

For low-altitude Unmanned Aerial Vehicle (UAV) autonomy, 3D spatial understanding is not merely a perception objective, but the safety interface between human instructions and physical flight. In human-scale urban airspace below 20 meters, thin geometry, occlusions, vegetation, and urban clutter define whether an aerial agent can safely enter the space ahead. However, existing UAV datasets mainly provide 2D annotations or 3D boxes, while driving-oriented occupancy benchmarks assume stable ground-level sensor rigs. Both miss the defining regime of low-altitude flight: a front-facing monocular camera observing occupied and free space from a moving aerial body with frame-wise changing 6-DoF pose and camera extrinsics. To bridge this gap, we introduce SkyShield, to the best of our knowledge the first front-view monocular semantic occupancy benchmark for urban UAV flight below 20 meters. Built on CARLA, SkyShield contains 36K front-view UAV samples across diverse urban scenes and weather conditions, pairing each image with frame-wise 6-DoF UAV pose, frame-wise dynamic camera geometry, UAV states, and front-frustum semantic occupancy labels. We further propose KAR-mIoU, a UAV-centric and dynamics-aware metric that re-weights voxel-level evaluation by kinematic reachability and time-to-collision, revealing safety-critical risks hidden by conventional mIoU. To tackle this challenging new setting, we provide SkyOcc, a geometry-first monocular baseline that integrates frame-wise UAV attitude into projection, fuses temporal occupancy features, and applies safety-prior optimization to preserve sparse collision-critical structures. Together, SkyShield, KAR-mIoU, and SkyOcc establish occupancy as a safety interface for low-altitude aerial autonomy. Code and dataset will be released publicly.


2606.00260 2026-06-04 cs.CV cs.LG (updated version)

LastAct: Trajectory-Guided Latest-Activity Localization for Real-Time Smart-Home Activity Recognition

Zishuai Liu, Ruili Fang, Jin Lu, Fei Dou

Affiliations: School of Computing, University of Georgia

AI summary: Proposes the LastAct framework, which uses trajectory image sequences and a boundary localizer to handle boundary contamination in sliding windows for real-time smart-home activity recognition.

English abstract

Human Activity Recognition (HAR) from ambient sensors enables smart-home applications such as health monitoring and assisted living. In realistic deployments, however, sensor events arrive as a continuous stream and activity boundaries are unknown. Sliding-window inference therefore produces many windows that straddle transitions and contain mixed activities, creating boundary contamination that violates the pre-segmented instance assumption used by most benchmarks and models. Moreover, many pipelines under-use spatial context by treating sensor IDs as independent tokens. We present LastAct, a trajectory-centric framework for streaming smart-home HAR that targets the most recent activity under mixed windows while explicitly modeling spatial structure. LastAct projects sensor events onto the home floorplan to form a layout-aligned trajectory image sequence that preserves spatial continuity. A lightweight gate identifies contaminated windows, and a boundary localizer estimates the last transition to enable boundary-guided masking that emphasizes post-boundary evidence and suppresses stale context. For efficiency, we reuse a precomputed layout-aligned template cache to avoid repeated rendering. Empirically, across four public smart-home datasets under near-realistic mixed-activity protocols, LastAct achieves competitive or superior performance on pure windows and yields substantial Macro-F1 gains on cross/mixed windows, demonstrating improved robustness under near-realistic sliding-window regimes.


2605.31039 2026-06-04 cs.CV (updated version)

GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

Xiangtao Kong, Jixin Zhao, Lingchen Sun, Rongyuan Wu, Lei Zhang

Affiliations: VISUAL COMPUTING LAB POLYU; The Hong Kong Polytechnic University; OPPO Research Institute

AI summary: Proposes using generative multimodal foundation models to synthesize high-quality targets from real low-quality images as ground truth, and builds GGT-100K, a dataset of about 100K pairs that markedly improves the real-world generalization of a range of image restoration models.

English abstract

Real-world image restoration (IR) is bottlenecked by the scarcity of high-quality paired training data. Synthetic datasets are abundant but often fail to model real-world degradations, while real-world paired datasets are expensive and difficult to capture. As a result, IR models trained on these datasets show limited generalization in real-world scenarios. In this work, we propose Generative Ground Truth (GGT) by using generative multimodal foundation models (MFMs) to produce high-quality (HQ) targets from real-world low-quality (LQ) images. We first conduct a systematic evaluation of nine state-of-the-art MFMs, including Nano-Banana-2 and GPT-Image-2, on images of various scenes and degradation types. The results demonstrate that Nano-Banana-2 with VLM-based adaptive prompting shows the highest capability to synthesize perceptually realistic and content-faithful HQ targets, which can serve as the GGT for the LQ input. We then employ Nano-Banana-2 to build a GGT synthesis pipeline, which involves multi-stage quality control to ensure data reliability, and construct GGT-100K, an LQ-HQ paired dataset comprising 103,707 training pairs and covering diverse scenes and complex real-world degradations. A test set of 500 image pairs is also established. Extensive experiments show that GGT-100K consistently improves the real-world generalization of a wide range of IR models, with particularly strong benefits for finetuning generative models for IR tasks. Our results suggest that MFMs can serve as practical tools for restoration-oriented data generation, and GGT-100K is a useful resource to expand the generalization boundaries of real-world IR models.


2605.30705 2026-06-04 cs.CV cs.LG 版本更新

Equivariant Latent Alignment via Flow Matching under Group Symmetries

群对称下通过流匹配的等变潜在对齐

Sunghyun Kim, Jaehoon Hahm, Jeongwoo Shin, Joonseok Lee

发表机构 * University of Illinois Urbana-Champaign, Illinois, USA（伊利诺伊大学厄巴纳-香槟分校）； Seoul National University, Seoul, Korea（首尔国立大学）

AI总结针对现有方法在潜在空间中存在的等变错位问题，提出基于流的残差潜在流框架，通过纠正错位潜在表示来增强旋转群SO(n)下的等变一致性，提升新视角合成质量。

详情

AI中文摘要

几何感知生成模型和新视角合成方法在视觉保真度和一致性方面展现出强大潜力。同时，等变表示学习已成为构建潜在空间的有力框架，其中分析已知的群变换可以直接作用，捕捉数据中的几何结构，并增强新视角合成的可解释性和泛化性。然而，我们发现现有方法常遭受潜在错位问题，即潜在空间中预期的群作用与实际所需的变换之间存在差异。因此，学习到的潜在表示往往无法一致地保持底层群对称性所施加的等变关系。为解决此问题，我们提出残差潜在流，一种基于流的框架，用于纠正错位的潜在表示，从而提高对底层等变关系的遵从性。我们的综合实验表明，在旋转群SO(n)下，我们的方法显著减少了潜在错位，并提高了新视角合成的质量。

英文摘要

Geometry-aware generative models and novel view synthesis approaches have shown strong potential in visual fidelity and consistency. In parallel, equivariant representation learning has emerged as a powerful framework for constructing latent spaces where analytically known group transformations can act directly, capturing geometric structure in data and enhancing both interpretability and generalization in novel view synthesis. However, we identify that existing approaches often suffer from latent misalignment, a discrepancy between the intended group action and the transformation actually required in the latent space. Consequently, the learned latents often fail to consistently preserve the equivariant relations imposed by the underlying group symmetry. To address this, we propose Residual Latent Flow, a flow-based framework that corrects the misaligned latents, thereby improving compliance with the underlying equivariance relation. Our comprehensive experiments show that, under the rotation groups SO(n), our method significantly reduces latent misalignment and improves novel view synthesis quality.
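
The notion of latent misalignment can be made concrete with a small sketch. It assumes the simplest possible setting: inputs and latents are both 2D vectors, the group is SO(2), and the intended latent action is the same rotation that acts on the input. The function names and the toy encoders are hypothetical, and this measures misalignment only; it is not the Residual Latent Flow correction itself.

```python
import math
from typing import Callable, Tuple

Vec = Tuple[float, float]


def rotate(v: Vec, theta: float) -> Vec:
    """Apply the SO(2) rotation by angle theta to a 2D vector."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def misalignment(encoder: Callable[[Vec], Vec], x: Vec, theta: float) -> float:
    """Distance between encoding the rotated input and rotating the encoded input.
    It is zero exactly when the encoder is equivariant for this x and theta."""
    encoded_after_rotation = encoder(rotate(x, theta))
    rotated_after_encoding = rotate(encoder(x), theta)
    return math.dist(encoded_after_rotation, rotated_after_encoding)


def uniform_scale(v: Vec) -> Vec:
    """Toy encoder that commutes with rotations (equivariant)."""
    return (2.0 * v[0], 2.0 * v[1])


def stretch_y(v: Vec) -> Vec:
    """Toy encoder that does not commute with rotations (misaligned)."""
    return (v[0], 2.0 * v[1])


aligned_gap = misalignment(uniform_scale, (1.0, 0.0), math.pi / 2)
misaligned_gap = misalignment(stretch_y, (1.0, 0.0), math.pi / 2)
```

In this toy setting the uniform-scale encoder yields a gap of (numerically) zero, while the anisotropic encoder yields a nonzero gap, which is the kind of discrepancy a correction step would aim to reduce.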


2605.25402 2026-06-04 cs.CV cs.AI 版本更新

Anatomy-Anchored Self-Supervision: Distilling Vision Foundation Models for Invariant Ultrasound Representation

解剖锚定的自监督：蒸馏视觉基础模型用于不变超声表示

Chunzheng Zhu, Yijun Wang, Jianxin Lin, Feng Wang, Hongwei Wang, Lei Zhao, Shengli Li, Kenli Li

发表机构 * Hunan University（湖南大学）； Shenzhen Maternity and Child Healthcare Hospital（深圳妇幼保健医院）

AI总结提出解剖锚定的超声自监督框架ANAUS，通过可学习潜在提示引擎和领域自适应实现无标注解剖分割，并设计双策略自监督学习（语义感知解剖分离对齐和上下文核心区域预测）来增强表示学习，在六个公开数据集上超越现有方法。

Comments MICCAI 2026 Accepted Paper; Anatomy-Anchored Ultrasound Self-Supervision

详情

AI中文摘要

自监督预训练范式在医学图像中学习可迁移表示方面日益重要，但现有超声图像方法在图像或帧级别操作，忽略了临床对齐表示学习的解剖上下文。在这项工作中，我们提出了一种解剖锚定的超声自监督框架ANAUS，将表示学习从通用视觉区域转移到临床有意义的解剖结构。利用可学习的潜在提示引擎以及对现有公开图像-掩码对的一次性领域自适应，我们使LP-SAM模块能够大规模实现无标注解剖描绘。基于此解剖基础，我们提出了一种双策略自监督学习范式，包括视图间语义感知的解剖分离对齐和上下文核心区域预测，以增强表示学习。具体而言，前者在相同解剖区域内强制特征不变性，同时促进不同结构间的可区分性；后者迫使模型重建被破坏的区域，从而捕获细粒度的结构细节。在六个公开数据集上的广泛评估表明，我们的方法持续优于当前最先进的方法，同时保持了临床部署所需的计算效率。代码可在https://github.com/zhcz328/ANAUS获取。

英文摘要

The self-supervised pre-training paradigm has gained increasing prominence for learning transferable representations in medical imaging, yet existing methods for ultrasound (US) images operate at the image or frame level, overlooking the anatomical context needed for clinically aligned representation learning. In this work, we propose ANAUS, an anatomy-anchored ultrasound self-supervision framework that shifts representation learning from generic visual regions to clinically meaningful anatomical structures. Utilizing a learnable latent prompt engine alongside a one-time domain adaptation on existing public image-mask pairs, we empower the LP-SAM module to achieve annotation-free anatomy delineation at scale. Building upon this anatomical grounding, we propose a dual-policy self-supervised learning paradigm consisting of inter-view semantics-aware anatomy-separating alignment and contextual core-region prediction to enhance representation learning. Specifically, the former enforces feature invariance within identical anatomical regions while promoting discriminability across distinct structures; the latter compels the model to reconstruct corrupted regions, thereby capturing fine-grained structural details. Extensive evaluations on six public datasets demonstrate that ANAUS consistently outperforms current state-of-the-art methods while maintaining the computational efficiency essential for clinical deployment. Code is available at https://github.com/zhcz328/ANAUS.


2605.24602 2026-06-04 cs.CV cs.AI 版本更新

Correcting Visual Blur Induced by Attention Distraction to Reduce Hallucinations: Algorithm and Theory

纠正注意力分散引起的视觉模糊以减少幻觉：算法与理论

Quanjiang Li, Zhiming Liu, Wei Luo, Tingjin Luo, Chenping Hou

发表机构 * National University of Singapore（新加坡国立大学）； University of Science and Technology of China（中国科学技术大学）

AI总结本文揭示多模态大语言模型中的物体幻觉与类人注意力分散现象相关，并提出一种无需额外训练的注意力聚焦方法（AFIP）通过跨头注意力增强和动态历史注意力强化来纠正视觉模糊，从而减少幻觉。

详情

Journal ref: ICML 2026

AI中文摘要

多模态大语言模型（MLLMs）经常遭受物体幻觉的困扰，但导致这一失败的视觉感知机制仍知之甚少。在这项工作中，我们揭示幻觉与一种类人注意力分散现象密切相关，其中人类在注意力分散下会经历视觉清晰度下降并产生不准确的描述，而在模型中，同样的机制表现为解码过程中多头注意力的空间不一致性以及对图像令牌注意力的时间衰减。我们进一步提供了理论见解，表明注意力分散会增加模型复杂度并降低分类泛化能力。受这些发现的启发，我们提出了一种用于改进图像感知的注意力聚焦方法（AFIP），该方法通过跨头注意力丰富来纠正注意力分散，并通过动态历史注意力增强来强化视觉基础。在多个基准和模型上的大量实验验证了AFIP的有效性，且无需额外训练。

英文摘要

Multimodal large language models (MLLMs) frequently suffer from object hallucinations, yet the visual perceptual mechanism underlying this failure remains poorly understood. In this work, we reveal that hallucinations are strongly associated with a human-like attention distraction phenomenon, where humans under divided focus experience degraded visual clarity and produce inaccurate descriptions, while in models the same mechanism manifests as spatial inconsistency in multi-head attention and temporal fading of attention to image tokens during decoding. We further provide theoretical insights that attention dispersion increases model complexity and degrades classification generalization. Motivated by these findings, we propose an Attention-Focused Approach for Improved Image Perception (AFIP), which corrects attention distraction via cross-head attention enrichment and reinforces visual grounding through dynamic historical attention enhancement. Extensive experiments on multiple benchmarks and models validate the effectiveness of AFIP without additional training.


2605.18102 2026-06-04 cs.CV 版本更新

DanceHMR: Hand-Aware Whole-Body Human Mesh Recovery from Monocular Videos

DanceHMR: 从单目视频中恢复手部感知的全身人体网格

Wenhao Shen, Ming Zhou, Hengyuan Zhang, Siyuan Bian, Youjiang Xu, Yuan Zhang

发表机构 * ByteDance Intelligent Creation（字节跳动智能创作）

AI总结提出一种基于残差体手融合的时序一致全身HMR框架，通过身体上下文与手部观测的融合以及特写增强，实现稳定身体运动与精细手部恢复。

Comments Project page: https://shenwenhao01.github.io/dancehmr/

详情

AI中文摘要

单目视频人体网格恢复对于数字人、虚拟角色动画和具身模拟至关重要，需要时间稳定性和表现力丰富的全身运动。现有视频HMR方法能生成连贯的身体运动，但常忽略精细的手部关节；而基于图像的全身体方法逐帧独立恢复SMPL-X网格，常导致手部运动抖动且不准确。我们提出一种针对具有挑战性的野外单目视频的时序一致全身体HMR框架。我们的模型通过残差体手融合统一身体上下文和特定部分的手部观测，在单个时序架构中实现稳定的身体运动和精细的手部恢复。我们进一步引入特写感知增强，以提高上半身构图下的鲁棒性。在全身体和仅身体基准上的实验表明，手部重建得到改善，身体精度具有竞争力。我们的方法在具有挑战性的真实世界视频中也产生了时间稳定且2D一致的SMPL-X运动。

英文摘要

Monocular video human mesh recovery is essential for digital humans, avatar animation, and embodied simulation, where both temporal stability and expressive whole-body motion are required. Existing video HMR methods produce coherent body motion but often overlook detailed hand articulation, while image-based whole-body methods recover SMPL-X meshes independently per frame, often leading to jittery and inaccurate hand motion. We present a temporally coherent whole-body HMR framework for challenging in-the-wild monocular videos. Our model unifies body context and part-specific hand observations through residual body-hand fusion, enabling stable body motion and detailed hand recovery within a single temporal architecture. We further introduce close-up-aware augmentation to improve robustness under upper-body framing. Experiments on whole-body and body-only benchmarks demonstrate improved hand reconstruction and competitive body accuracy. Our method also produces temporally stable and 2D-consistent SMPL-X motion in challenging real-world videos.


2605.21268 2026-06-04 cs.CV 版本更新

Vision Transformers and Convolutional Neural Networks for Land Use Scene Classification

视觉Transformer与卷积神经网络在土地利用场景分类中的应用

Arun D. Kulkarni

发表机构 * Computer Science Department, University of Texas at Tyler（德克萨斯大学泰勒分校计算机科学系）

AI总结本文比较了视觉Transformer和CNN在遥感土地利用场景分类中的性能，发现CNN在有限训练样本和局部纹理特征强的场景中表现稳健，而ViT在数据充足时能更好地捕捉全局空间关系，但计算成本更高。

Comments 11 pages

详情

AI中文摘要

来自遥感影像的土地利用场景分类在环境监测、城市规划和可持续资源管理中起着关键作用。近年来，深度学习方法显著推动了该领域的发展，其中卷积神经网络因其强大的局部空间特征捕获能力而占据主导地位。然而，视觉Transformer的出现引入了一种新范式，通过自注意力机制建模长距离依赖关系，可能实现更好的全局上下文理解。本文对视觉Transformer和基于CNN的架构在遥感土地利用场景分类中进行了比较评估。使用基准遥感数据集（包括UC Merced土地利用和EuroSAT土地利用数据集）评估了代表性CNN模型（如AlexNet）和视觉Transformer。研究考察了分类准确率、精确率、召回率、F1分数和计算复杂度，以提供全面的性能比较。实验结果表明，在训练样本有限且局部纹理特征强的数据集上，CNN表现稳健；而在训练数据充足时，视觉Transformer在捕获复杂场景中的全局空间关系方面表现出更优性能。然而，ViT通常需要更多的计算资源和更大的训练数据集才能达到最优性能。本研究的结果为两种架构的优势和局限性提供了见解，并为遥感土地利用场景分类应用中选择合适模型提供了指导。

英文摘要

Land Use Scene Classification (LUSC) from remote sensing imagery plays a critical role in environmental monitoring, urban planning, and sustainable resource management. In recent years, deep learning methods have significantly advanced the state of the art, with Convolutional Neural Networks (CNNs) dominating the field because of their strong ability to capture local spatial features. However, the emergence of Vision Transformers (ViTs) has introduced a new paradigm that models long-range dependencies through self-attention mechanisms, potentially enabling improved global context understanding. This paper presents a comparative assessment of Vision Transformers and CNN-based architectures for remote sensing land use scene classification. Representative CNN models, such as AlexNet, are evaluated alongside the Vision Transformer (ViT) using benchmark remote sensing datasets, including the UC Merced Land Use and EuroSAT Land Use datasets. The study examines classification accuracy, precision, recall, F1-score, and computational complexity to provide a comprehensive performance comparison. Experimental results demonstrate that CNNs perform robustly on datasets with limited training samples and strong local texture characteristics, whereas Vision Transformers exhibit superior performance in capturing global spatial relationships in complex scenes when sufficient training data are available. However, ViTs typically require greater computational resources and larger training datasets to achieve optimal performance. The findings of this study provide insights into the strengths and limitations of both architectures and offer guidance for selecting appropriate models for remote sensing land use scene classification applications.


2605.19398 2026-06-04 cs.CV cs.AI 版本更新

Rebalancing Reference Frame Dominance to Improve Motion in Image-to-Video Models

重新平衡参考帧主导性以改善图像到视频模型中的运动

Wooseok Jeon, Seungho Park, Seunghyun Shin, Sangeyl Lee, Hyeonho Jeong, Hae-Gon Jeon

发表机构 * Yonsei University（延世大学）； GIST（光州科学技术院）； Adobe Research（Adobe研究院）

AI总结针对图像到视频模型生成视频过于静态的问题，提出无需训练且模型无关的DyMoS方法，通过重新平衡去噪初期生成帧对参考帧的注意力来增强运动，同时保持视觉质量和保真度。

Comments Preprint. Project page: https://sh0xed98b8.github.io/DyMoS/

详情

AI中文摘要

与文本到视频模型相比，图像到视频模型通常生成的视频过于静态。先前的方法通过削弱或修改图像条件信号来缓解这一问题，但往往需要额外训练或牺牲对参考图像的保真度。在这项工作中，我们识别出参考帧主导性是运动抑制的关键机制。我们观察到，I2V模型中的非参考帧将过多的自注意力分配给参考帧的关键词元，导致参考信息随时间过度传播，从而抑制了帧间动态。基于这一发现，我们提出了DyMoS（动态运动滑块），一种无需训练且模型无关的方法，在初始去噪步骤中重新平衡从生成帧到参考帧的注意力路径。DyMoS保持输入图像和模型权重不变，并引入单个标量参数以连续控制运动强度。在多个最先进的I2V骨干网络上的实验表明，DyMoS在保持视觉质量和参考图像保真度的同时，一致地改善了运动动态。

英文摘要

Image-to-video models often generate videos that remain overly static, compared to text-to-video models. While prior approaches mitigate this issue by weakening or modifying the image-conditioning signal, they often require additional training or sacrifice fidelity to the reference image. In this work, we identify reference-frame dominance as a key mechanism behind motion suppression. We observe that non-reference frames in I2V models allocate excessive self-attention to reference-frame key tokens, causing reference information to be over-propagated across time and suppressing inter-frame dynamics. Based on this finding, we propose DyMoS (Dynamic Motion Slider), a training-free and model-agnostic method that rebalances the attention pathway from generated frames to the reference frame during initial denoising steps. DyMoS leaves both the input image and model weights unchanged and introduces a single scalar parameter for continuous control over motion strength. Experiments across multiple state-of-the-art I2V backbones demonstrate that DyMoS consistently improves motion dynamics while maintaining visual quality and fidelity to the reference image.
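
The attention rebalancing described here can be sketched for a single query token from a generated frame. The sketch assumes that rebalancing means scaling the pre-normalization attention weights on reference-frame keys by one scalar and then renormalizing; the function name, the `strength` parameter, and this exact form are assumptions for illustration, and the actual DyMoS formulation may differ.

```python
import math
from typing import List, Sequence


def rebalanced_attention(
    logits: Sequence[float],
    is_reference_key: Sequence[bool],
    strength: float,
) -> List[float]:
    """Softmax over keys after multiplying reference-key weights by `strength`.
    strength = 1.0 leaves attention unchanged; 0 < strength < 1 damps the
    attention that this query pays to reference-frame keys."""
    if strength <= 0:
        raise ValueError("strength must be positive")
    shift = math.log(strength)  # scaling a softmax weight equals shifting its logit
    adjusted = [l + shift if ref else l for l, ref in zip(logits, is_reference_key)]
    peak = max(adjusted)  # subtract the maximum for numerical stability
    weights = [math.exp(a - peak) for a in adjusted]
    total = sum(weights)
    return [w / total for w in weights]


# Four keys with equal logits; the first two belong to the reference frame.
unchanged = rebalanced_attention([0.0, 0.0, 0.0, 0.0], [True, True, False, False], 1.0)
damped = rebalanced_attention([0.0, 0.0, 0.0, 0.0], [True, True, False, False], 0.5)
```

Under these assumptions, halving the strength moves the attention mass on the two reference keys from 0.5 to one third in total. Restricting the adjustment to the first few denoising steps, as the abstract describes, would be a separate gating condition around this call.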


2605.15741 2026-06-04 cs.CV 版本更新

HyperDiT: Hyper-Connected Transformers for High-Fidelity Pixel-Space Diffusion

HyperDiT: 用于高保真像素空间扩散的超连接Transformer

Yu He, Lichen Ma, Zipeng Guo, Xinyuan Shan, Jingling Fu, Dong Chen, Junshi Huang, Yan Li

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结针对像素空间扩散模型中全局语义与细粒度细节难以兼顾的粒度困境，提出HyperDiT框架，通过超连接跨尺度交互和尺度感知旋转位置编码，结合预训练视觉基础模型的密集语义，在像素空间实现高保真生成，在ImageNet 256×256上取得1.56的SoTA FID。

详情

AI中文摘要

像素空间扩散模型绕过了变分自编码器（VAE）的重建瓶颈，但面临一个基本的“粒度困境”：捕捉全局语义需要大的块尺度，而生成高保真细节则要求细粒度的输入。为了解决这个问题，我们提出了HyperDiT，一个统一的框架，建立超连接跨尺度交互以桥接语义和像素流形。与通过AdaLN注入语义不同，HyperDiT利用交叉注意力机制，使细粒度标记能够全局查询多级语义锚点。为了解决多尺度交互过程中的空间不匹配问题，我们引入了尺度感知旋转位置编码（SA-RoPE），以确保不同块大小的标记之间精确的几何对齐。此外，我们加入了寄存器，从预训练的视觉基础模型（VFM）中学习密集语义，有效减少生成幻觉和伪影。大量实验表明，HyperDiT在像素空间内直接在ImageNet 256×256上实现了最先进的FID为1.56。通过将细粒度流与语义指导相结合，HyperDiT为高保真像素生成提供了一种优越的范式。

英文摘要

Pixel-space diffusion models bypass the reconstruction bottleneck of Variational Autoencoders (VAEs) but face a fundamental "granularity dilemma": capturing global semantics favors large patch scales, while generating high-fidelity details demands fine-grained inputs. To address this issue, we propose HyperDiT, a unified framework establishing Hyper-Connected Cross-Scale Interactions to bridge the semantic and pixel manifolds. Rather than injecting semantics through AdaLN, HyperDiT utilizes Cross-Attention mechanisms, enabling fine-grained tokens to query multi-level semantic anchors globally. To resolve the spatial mismatch during multi-scale interactions, we introduce Scale-Aware Rotary Position Embedding (SA-RoPE) to ensure precise geometric alignment among tokens of varying patch sizes. Furthermore, we incorporate Registers to learn the dense semantics from a pretrained Visual Foundation Model (VFM), effectively reducing generation hallucination and artifacts. Extensive experiments demonstrate that HyperDiT achieves a state-of-the-art (SoTA) FID of $\mathbf{1.56}$ on ImageNet $256\times256$ directly within the pixel space. By combining the fine-grained stream with semantic guidance, HyperDiT offers a superior paradigm for high-fidelity pixel generation.


2605.14091 2026-06-04 cs.CV 版本更新

Venus-DeFakerOne: Unified Fake Image Detection & Localization

Venus-DeFakerOne: 统一假图检测与定位

GuangJian Team

发表机构 * Ant Group（蚂蚁集团）

AI总结针对假图生成机制统一化而检测定位研究碎片化的问题，提出基于InternVL2和SAM2的数据驱动统一基础模型DeFakerOne，实现跨场景的图像级检测与像素级定位，在39个检测和9个定位基准上达到最优性能。

详情

AI中文摘要

近年来，生成式AI的快速发展从根本上重塑了图像伪造的范式，打破了文档编辑、自然图像篡改、DeepFake生成和全图像AIGC合成之间的传统界限。尽管伪造生成正趋于统一，但现有的假图检测与定位（FIDL）研究仍然碎片化。这造成了日益统一的伪造生成机制与领域特定检测范式之间的不匹配。弥合这一不匹配给FIDL带来了两个关键挑战：理解跨域伪影的迁移与干扰，以及构建一个高容量的统一基础模型以实现联合检测与定位。为应对这些挑战，我们提出了DeFakerOne，一个以数据为中心的统一FIDL基础模型，集成了InternVL2和SAM2。DeFakerOne能够在多种场景下同时进行图像级检测和像素级伪造定位。大量实验表明，DeFakerOne达到了最先进的性能，在39个伪造检测基准和9个定位基准上均优于基线。此外，该模型对真实世界扰动和最先进的生成器（如GPT-Image-2）表现出卓越的鲁棒性。最后，我们系统分析了数据缩放规律、跨域伪影迁移-干扰模式、细粒度监督的必要性以及原始分辨率伪影保留，突显了可扩展、鲁棒且统一的FIDL的设计原则。

英文摘要

In recent years, the rapid evolution of generative AI has fundamentally reshaped the paradigm of image forgery, breaking the traditional boundaries between document editing, natural image manipulation, DeepFake generation, and full-image AIGC synthesis. Despite this shift toward unified forgery generation, existing research in Fake Image Detection and Localization (FIDL) remains fragmented. This creates a mismatch between increasingly unified forgery generation mechanisms and the domain-specific detection paradigm. Bridging this mismatch poses two key challenges for FIDL: understanding cross-domain artifact transfer and interference, and building a high-capacity unified foundation model for joint detection and localization. To address these challenges, we propose DeFakerOne, a data-centric, unified FIDL foundation model integrating InternVL2 and SAM2. DeFakerOne enables simultaneous image-level detection and pixel-level forgery localization across diverse scenarios. Extensive experiments demonstrate that DeFakerOne achieves state-of-the-art performance, outperforming baselines on 39 forgery detection benchmarks and 9 localization benchmarks. Furthermore, the model exhibits superior robustness against real-world perturbations and state-of-the-art generators such as GPT-Image-2. Finally, we provide a systematic analysis of data scaling laws, cross-domain artifact transfer-interference patterns, the necessity of fine-grained supervision, and the preservation of original-resolution artifacts, highlighting the design principles for scalable, robust, and unified FIDL.


2605.14054 2026-06-04 cs.AI cs.CV 版本更新

Bad Seeing or Bad Thinking? Rewarding Perception for Multimodal Reasoning

Haozhe Wang, Qixin Xu, Changpeng Wang, Taofeng Xue, Chong Peng, Wenhu Chen, Fangzhen Lin

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出一种基于强化学习的模态感知信用分配框架（MoCA），通过感知验证和结构化口头验证解决视觉语言模型中感知与推理的权衡问题，实现多任务性能提升。

Comments Accepted by ICML 2026 as Oral

详情

AI中文摘要

实现稳健的感知-推理协同是高级视觉语言模型（VLM）的核心目标。最近的进展通过架构设计或智能体工作流追求这一目标。然而，这些方法通常受限于静态文本推理，或因外部智能体复杂性的巨大计算和工程负担而变得复杂。更糟糕的是，这种大量投入并未带来成比例的性能提升，常常在感知和推理上观察到“跷跷板效应”。这促使我们从根本上重新思考真正的瓶颈。在本文中，我们认为这种权衡的根本原因是模态信用分配中的模糊性：当VLM失败时，是由于感知缺陷（“坏视力”）还是逻辑缺陷（“坏思维”）？为解决这一问题，我们引入了一个强化学习框架，通过可靠地奖励感知保真度来改善感知-推理协同。我们明确地将生成过程分解为交错的感知和推理步骤。这种解耦使得能够对感知进行有针对性的监督。关键的是，我们引入了感知验证（PV），利用“盲推理”代理独立于推理结果奖励感知保真度。此外，为了在自由形式的VL任务中扩展训练，我们提出了结构化口头验证（Structured Verbal Verification），用结构化的算法执行替代高方差的LLM评判。这些技术被整合到模态感知信用分配（MoCA）机制中，该机制将奖励路由到特定的错误源——无论是坏视力还是坏思维——使单个VLM能够在广泛的任务谱系上同时获得性能提升。

英文摘要

Achieving robust perception-reasoning synergy is a central goal for advanced Vision-Language Models (VLMs). Recent advancements have pursued this goal via architectural designs or agentic workflows. However, these approaches are often limited by static textual reasoning or burdened by the significant compute and engineering cost of external agentic complexity. Worse, this heavy investment does not yield proportional gains, and a "seesaw effect" between perception and reasoning is often observed. This motivates a fundamental rethinking of the true bottleneck. In this paper, we argue that the root cause of this trade-off is an ambiguity in modality credit assignment: when a VLM fails, is it due to flawed perception ("bad seeing") or flawed logic ("bad thinking")? To resolve this, we introduce a reinforcement learning framework that improves perception-reasoning synergy by reliably rewarding perception fidelity. We explicitly decompose the generation process into interleaved perception and reasoning steps. This decoupling enables targeted supervision on perception. Crucially, we introduce Perception Verification (PV), leveraging a "blindfolded reasoning" proxy to reward perceptual fidelity independently of reasoning outcomes. Furthermore, to scale training across free-form VL tasks, we propose Structured Verbal Verification, which replaces high-variance LLM judging with structured algorithmic execution. These techniques are integrated into a Modality-Aware Credit Assignment (MoCA) mechanism, which routes rewards to the specific source of error -- either bad seeing or bad thinking -- enabling a single VLM to achieve simultaneous performance gains across a wide task spectrum.


2605.13672 2026-06-04 cs.CV cs.AI cs.LG 版本更新

SpurAudio: A Benchmark for Studying Shortcut Learning in Few-Shot Audio Classification

SpurAudio: 用于研究少样本音频分类中捷径学习的基准

Giries Abu Ayoub, Morad Tukan, Loay Mualem

发表机构 * Department of Computer Science, University of Haifa（海法大学计算机科学系）； Independent Researcher（独立研究者）； University of Stuttgart, Germany（斯图加特大学，德国）； IMPRS-IS, Germany（国际马克斯·普朗克智能系统研究学院，德国）

AI总结提出SpurAudio基准，通过控制音频中前景与背景的关联，评估少样本分类模型对虚假相关性的敏感性，发现现有方法在背景变化时性能显著下降。

详情

AI中文摘要

少样本分类（FSC）广泛用于从有限标注数据中学习，但大多数评估隐含假设目标概念与上下文线索无关。然而，在现实场景中，样本通常出现在丰富的上下文中，允许模型利用前景内容与背景信号之间的虚假相关性。虽然这种效应已在少样本图像分类中得到研究，但其在少样本音频分类中的作用仍基本未被探索，且现有音频基准对上下文结构的控制有限。我们引入了SpurAudio，一个利用音频中前景事件和背景环境的自然可分离性，以支持对支持集和查询集之间的上下文偏移进行可控、多级评估的基准。使用该基准，我们表明许多最先进的少样本方法在背景相关性被破坏时遭受严重的性能下降，尽管在标准评估协议下达到相似的准确率。关键的是，即使在大型预训练音频基础模型中，这种脆弱性仍然存在，排除了骨干网络容量不足的解释。此外，在传统基准下看似相当的方法可能对虚假相关性表现出显著不同的敏感性，揭示了与特征表示在推理时如何与分类器头交互相关的系统性算法优势和脆弱性。这些发现为音频中少样本方法的行为提供了新的见解，并强调了在评估FSC模型时需要明确探测上下文依赖性的基准。

英文摘要

Few-shot classification (FSC) is widely used for learning from limited labeled data, yet most evaluations implicitly assume that target concepts are independent of contextual cues. In real-world settings, however, examples often appear within rich contexts, allowing models to exploit spurious correlations between foreground content and background signals. While such effects have been studied in few-shot image classification, their role in few-shot audio classification remains largely unexplored, and existing audio benchmarks offer limited control over contextual structure. We introduce SpurAudio, a benchmark that leverages the natural separability of foreground events and background environments in audio to enable controlled, multi-level evaluation of contextual shifts across support and query sets. Using this benchmark, we show that many state-of-the-art few-shot methods suffer severe performance degradation when background correlations are disrupted, despite achieving similar accuracy under standard evaluation protocols. Crucially, this vulnerability persists even in large pretrained audio foundation models, ruling out limited backbone capacity as an explanation. Moreover, methods that appear comparable under conventional benchmarks can exhibit markedly different sensitivity to spurious correlations, revealing systematic algorithmic strengths and vulnerabilities tied to how feature representations interact with classifier heads at inference time. These findings provide new insight into the behavior of few-shot methods in audio and highlight the need for benchmarks that explicitly probe context dependence when evaluating FSC models.


2304.10891 2026-06-04 cs.LG cs.AI cs.CV cs.RO cs.SY eess.SY 版本更新

Transformer-Based Autonomous Driving Models and Deployment-Oriented Compression: A Survey

基于Transformer的自动驾驶模型与面向部署的压缩：综述

Juan Zhong, Yuhang Shi, Zukang Xu, Xi Chen

发表机构 * Renmin University of China（中国人民大学）； Artificial Intelligence Innovation and Incubation Institute, Fudan University（复旦大学人工智能创新与孵化院）； Shanghai Academy of AI for Science（上海人工智能科学研究院）； Department of houmo.ai（houmo.ai部门）

AI总结本文综述了基于Transformer的自动驾驶模型，并从部署角度分析了压缩与加速策略（如量化、剪枝、知识蒸馏等）如何影响模型设计、部署性、鲁棒性和安全性。

详情

AI中文摘要

基于Transformer的模型正成为自动驾驶的核心范式，因为它们能够捕捉感知、预测和规划中的长程空间依赖、多智能体交互和多模态上下文。然而，它们在真实车辆中的部署仍然困难，因为高容量注意力架构带来了显著的延迟、内存和能量开销。本综述回顾了具有代表性的基于Transformer的自动驾驶模型，并按任务角色、感知配置和架构设计进行组织。更重要的是，我们从面向部署的角度审视这些模型，分析效率约束如何在实际中重塑模型设计选择。我们进一步回顾了与基于Transformer的驾驶系统相关的压缩和加速策略，包括量化、剪枝、知识蒸馏、低秩近似和高效注意力，并讨论了它们的优势、局限性和任务依赖性。我们不将压缩视为孤立的后期处理步骤，而是强调其作为直接影响部署性、鲁棒性和安全性的系统级设计考虑。最后，我们指出了面向标准化、安全感知和硬件感知的高效自动驾驶系统评估的开放挑战和未来研究方向。

英文摘要

Transformer-based models are becoming a central paradigm in autonomous driving because they can capture long-range spatial dependencies, multi-agent interactions, and multimodal context across perception, prediction, and planning. At the same time, their deployment in real vehicles remains difficult because high-capacity attention-based architectures impose substantial latency, memory, and energy overhead. This survey reviews representative Transformer-based autonomous driving models and organizes them by task role, sensing configuration, and architectural design. More importantly, it examines these models from a deployment-oriented perspective and analyzes how efficiency constraints reshape model design choices in practice. We further review compression and acceleration strategies relevant to Transformer-based driving systems, including quantization, pruning, knowledge distillation, low-rank approximation, and efficient attention, and discuss their benefits, limitations, and task-dependent applicability. Rather than treating compression as an isolated post-processing step, we highlight it as a system-level design consideration that directly affects deployability, robustness, and safety. Finally, we identify open challenges and future research directions toward standardized, safety-aware, and hardware-conscious evaluation of efficient autonomous driving systems.


2602.22779 2026-06-04 cs.CV 版本更新

TrajTok: Learning Trajectory Tokens enables better Video Understanding

TrajTok: 学习轨迹令牌以实现更好的视频理解

Chenhao Zheng, Jieyu Zhang, Jianing Zhang, Weikai Huang, Ashutosh Kumar, Quan Kong, Oncel Tuzel, Chun-Liang Li, Ranjay Krishna

发表机构 * University of Washington（华盛顿大学）； Allen Institute for Artificial Intelligence（艾伦人工智能研究所）； Apple（苹果公司）； Woven by Toyota, Inc（Woven by Toyota公司）

AI总结提出TrajTok，一种端到端视频令牌化模块，通过隐式时空聚类生成对象轨迹令牌，提升视频理解效率与性能。

Comments CVPR 2026

详情

AI中文摘要

视频模型中的令牌化通常通过分块（patchification）进行，产生过多且冗余的令牌，严重限制了视频的效率和可扩展性。虽然最近的基于轨迹的令牌化器通过将视频时长与令牌数量解耦提供了有前景的解决方案，但它们依赖于复杂的外部分割和跟踪流水线，速度慢且任务无关。我们提出TrajTok，一个端到端的视频令牌化器模块，完全集成并与视频模型共同训练以服务于下游目标，根据语义复杂度动态调整令牌粒度，独立于视频时长。TrajTok包含一个统一的分割器，在空间和时间上对像素进行隐式聚类，直接在一次前向传播中生成对象轨迹。通过优先考虑下游适应性而非像素完美的分割保真度，TrajTok轻量且高效，同时经验上提升了视频理解性能。利用TrajTok，我们实现了一个从头训练的视频CLIP模型（TrajViT2）。它在分类和检索基准上均实现了大规模的最佳精度，同时保持了与最佳令牌合并方法相当的高效率。TrajTok也证明了其作为令牌化器之外的多功能组件。我们表明，它可以无缝集成作为预训练视觉特征的探测头（TrajAdapter）或视觉-语言模型中的对齐连接器（TrajVLM），尤其在长视频推理中表现出色。

英文摘要

Tokenization in video models, typically through patchification, generates an excessive and redundant number of tokens. This severely limits video efficiency and scalability. While recent trajectory-based tokenizers offer a promising solution by decoupling video duration from token count, they rely on complex external segmentation and tracking pipelines that are slow and task-agnostic. We propose TrajTok, an end-to-end video tokenizer module that is fully integrated and co-trained with video models for a downstream objective, dynamically adapting its token granularity to semantic complexity, independent of video duration. TrajTok contains a unified segmenter that performs implicit clustering over pixels in both space and time to directly produce object trajectories in a single forward pass. By prioritizing downstream adaptability over pixel-perfect segmentation fidelity, TrajTok is lightweight and efficient, yet empirically improves video understanding performance. With TrajTok, we implement a video CLIP model trained from scratch (TrajViT2). It achieves the best accuracy at scale across both classification and retrieval benchmarks, while maintaining efficiency comparable to the best token-merging methods. TrajTok also proves to be a versatile component beyond its role as a tokenizer. We show that it can be seamlessly integrated as either a probing head for pretrained visual features (TrajAdapter) or an alignment connector in vision-language models (TrajVLM) with especially strong performance in long-video reasoning.


2605.06637 2026-06-04 cs.CV 版本更新

DPM++: Dynamic Masked Metric Learning for Occluded Person Re-identification

DPM++：用于遮挡行人重识别的动态掩码度量学习

Lei Tan, Yingshi Luan, Pincong Zou, Pingyang Dai, Liujuan Cao

发表机构 * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University（中国教育部多媒体可信感知与高效计算重点实验室，厦门大学）

AI总结提出DPM++动态掩码度量学习框架，通过自适应掩码选择可靠身份子空间，结合CLIP两阶段监督和显著性引导的补丁转移策略，在遮挡和整体场景下均达到最优性能。

详情

AI中文摘要

尽管行人重识别取得了显著进展，但障碍物造成的遮挡在实际应用中仍是一个未解决的问题。困难在于不完整的遮挡样本与整体身份表示之间的不匹配。严重遮挡会移除判别性身体线索并引入背景杂波和遮挡物的干扰，使得全局度量学习不可靠。现有方法主要依赖额外的预训练模型来估计可见部分以进行对齐，或通过数据增强构建遮挡样本，但仍缺乏一个统一的框架来学习在真实遮挡模式下鲁棒的可见性一致匹配。本文提出了DPM++，一种用于遮挡行人重识别的动态掩码度量学习框架。DPM++学习一种输入自适应的掩码度量，动态地为每个遮挡实例选择可靠的身份子空间，使匹配能够强调可见性一致的证据，同时抑制不可靠的组件。基于分类器-原型空间，DPM++引入了基于CLIP的两阶段监督方案，其中ID级语义先验从文本分支学习并转移到分类器-原型空间中进行动态掩码匹配。为了增强掩码度量，我们引入了一种显著性引导的补丁转移策略，在训练过程中合成可控且逼真的遮挡样本。利用真实场景先验，该策略使模型暴露于真实的部分观察中，并提供比随机擦除更丰富的监督。此外，遮挡感知的样本配对和掩码引导优化提高了框架的稳定性和有效性。在遮挡和整体行人重识别基准上的实验表明，DPM++在整体和遮挡场景中均持续优于先前的最先进方法。

英文摘要

Although person re-identification has made impressive progress, occlusion caused by obstacles remains an unsettled issue in real applications. The difficulty lies in the mismatch between incomplete occluded samples and holistic identity representations. Severe occlusion removes discriminative body cues and introduces interference from background clutter and occluders, making global metric learning unreliable. Existing methods mainly rely on extra pre-trained models to estimate visible parts for alignment or construct occluded samples via data augmentation, but still lack a unified framework that learns robust visibility-consistent matching under realistic occlusion patterns. In this paper, we propose DPM++, a Dynamic Masked Metric Learning framework for occluded person re-identification. DPM++ learns an input-adaptive masked metric that dynamically selects reliable identity subspaces for each occluded instance, enabling matching to emphasize visibility-consistent evidence while suppressing unreliable components. Built upon the classifier-prototype space, DPM++ introduces a CLIP-based two-stage supervision scheme, where ID-level semantic priors are learned from the text branch and transferred into the classifier-prototype space for dynamic masked matching. To strengthen the masked metric, we introduce a saliency-guided patch transfer strategy to synthesize controllable and photo-realistic occluded samples during training. Exploiting real scene priors, this strategy exposes the model to realistic partial observations and provides richer supervision than random erasing. In addition, occlusion-aware sample pairing and mask-guided optimization improve the stability and effectiveness of the framework. Experiments on occluded and holistic person re-identification benchmarks show that DPM++ consistently outperforms previous state-of-the-art methods in both holistic and occlusion scenarios.


2605.00242 2026-06-04 cs.CV cs.AI updated version

R3G: A Reasoning-Retrieval-Reranking Framework for Vision-Centric Answer Generation

Zhuohong Chen, Zhengxian Wu, Zirui Liao, Shenao Jiang, Hangrui Xu, Yang Chen, Chaokui Su, Xiaoyu Liu, Haoqian Wang

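Affiliations * The Shenzhen International Graduate School, Tsinghua University, China; State Key Laboratory of Nuclear Power Safety Technology and Equipment, China; School of Computer Science and Information Engineering, Hefei University of Technology, China

AI summary: Proposes the R3G framework, which first draws up a reasoning plan specifying the required visual cues, then selects evidence images with a two-stage strategy of coarse retrieval followed by fine-grained reranking. On MRAG-Bench it improves the accuracy of six multimodal large language models across nine sub-scenarios and achieves the best overall performance.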

2603.28762 2026-06-04 cs.CV cs.AI cs.GR cs.LG updated version

On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

上下文空间中的即时排斥以实现扩散变换器的丰富多样性

Omer Dahary, Benaya Koren, Daniel Garibi, Daniel Cohen-Or

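Affiliations * Tel Aviv University; Snap Research Israel

AI summary: To address the limited diversity of text-to-image diffusion models, proposes applying on-the-fly repulsion through the multimodal attention channels in the Contextual Space of Diffusion Transformers. This markedly increases generation diversity without sacrificing visual fidelity or semantic adherence, adds little computational overhead, and works with modern Turbo and distilled models.

Comments SIGGRAPH 2026. Project page: https://contextual-repulsion.github.io/

AI-generated Chinese abstract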

现代文本到图像（T2I）扩散模型在语义对齐方面取得了显著进展，但通常缺乏多样性，倾向于为任何给定提示收敛到狭窄的视觉解决方案集。这种典型性偏差对需要广泛生成结果的创意应用构成了挑战。我们识别出当前多样性方法中的一个基本权衡：修改模型输入需要昂贵的优化来整合生成路径的反馈。相反，对空间上已承诺的中间潜变量进行操作往往会破坏正在形成的视觉结构，导致伪影。在这项工作中，我们提出在上下文空间中应用排斥作为一种新颖的框架，以实现扩散变换器的丰富多样性。通过干预多模态注意力通道，我们在变换器的前向传播过程中施加即时排斥，在文本条件被新兴图像结构丰富后的块之间注入干预。这允许在结构信息形成后但构图固定之前重定向引导轨迹。我们的结果表明，上下文空间中的排斥在不牺牲视觉保真度或语义一致性的情况下产生了显著更丰富的多样性。此外，我们的方法非常高效，计算开销小，即使在现代“Turbo”和蒸馏模型中也有效，而传统的基于轨迹的干预在这些模型中通常会失败。

English abstract

Modern Text-to-Image (T2I) diffusion models have achieved remarkable semantic alignment, yet they often suffer from a significant lack of variety, converging on a narrow set of visual solutions for any given prompt. This typicality bias presents a challenge for creative applications that require a wide range of generative outcomes. We identify a fundamental trade-off in current approaches to diversity: modifying model inputs requires costly optimization to incorporate feedback from the generative path. In contrast, acting on spatially-committed intermediate latents tends to disrupt the forming visual structure, leading to artifacts. In this work, we propose to apply repulsion in the Contextual Space as a novel framework for achieving rich diversity in Diffusion Transformers. By intervening in the multimodal attention channels, we apply on-the-fly repulsion during the transformer's forward pass, injecting the intervention between blocks where text conditioning is enriched with emergent image structure. This allows for redirecting the guidance trajectory after it is structurally informed but before the composition is fixed. Our results demonstrate that repulsion in the Contextual Space produces significantly richer diversity without sacrificing visual fidelity or semantic adherence. Furthermore, our method is uniquely efficient, imposing a small computational overhead while remaining effective even in modern "Turbo" and distilled models where traditional trajectory-based interventions typically fail.


2512.14177 2026-06-04 cs.CV updated version

Improving Semantic Uncertainty Quantification in LVLMs with Semantic Gaussian Processes

利用语义高斯过程改进LVLM中的语义不确定性量化

Joseph Hoche, Andrei Bursuc, David Brellmann, Gilles Louppe, Pavel Izmailov, Angela Yao, Gianni Franchi

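Affiliations * AMIAD, Pôle Recherche, Palaiseau; valeo.ai; Safran Tech; University of Liège; New York University; National University of Singapore; ENSTA Paris

AI summary: Proposes the Semantic Gaussian Process Uncertainty (SGPU) framework, which quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering, and achieves state-of-the-art calibration and discriminative performance across multiple models and datasets.

AI-generated Chinese abstract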

大型视觉语言模型（LVLM）经常产生看似合理但不可靠的输出，因此鲁棒的不确定性估计至关重要。最近的语义不确定性估计工作依赖于外部模型对多个采样响应进行聚类并测量其语义一致性。然而，这些聚类方法通常脆弱，对微小的措辞变化高度敏感，并且可能错误地分组或分离语义相似的答案，导致不可靠的不确定性估计。我们提出了语义高斯过程不确定性（SGPU），这是一个贝叶斯框架，通过分析答案嵌入的几何结构来量化语义不确定性，避免了脆弱的聚类。SGPU将生成的答案映射到密集的语义空间，计算其嵌入的Gram矩阵，并通过特征谱总结其语义配置。然后将这种谱表示输入到高斯过程分类器中，该分类器学习将语义一致性模式映射到预测不确定性，并且可以在黑盒和白盒设置中应用。在跨越VQA、图像分类和文本QA的八个数据集上的六个LLM和LVLM中，SGPU始终实现了最先进的校准（ECE）和判别（AUROC、AUARC）性能。我们进一步表明，SGPU可以跨模型和模态迁移，表明其谱表示捕捉了语义不确定性的通用模式。

English abstract

Large Vision-Language Models (LVLMs) often produce plausible but unreliable outputs, making robust uncertainty estimation essential. Recent work on semantic uncertainty estimates relies on external models to cluster multiple sampled responses and measure their semantic consistency. However, these clustering methods are often fragile, highly sensitive to minor phrasing variations, and can incorrectly group or separate semantically similar answers, leading to unreliable uncertainty estimates. We propose Semantic Gaussian Process Uncertainty (SGPU), a Bayesian framework that quantifies semantic uncertainty by analyzing the geometric structure of answer embeddings, avoiding brittle clustering. SGPU maps generated answers into a dense semantic space, computes the Gram matrix of their embeddings, and summarizes their semantic configuration via the eigenspectrum. This spectral representation is then fed into a Gaussian Process Classifier that learns to map patterns of semantic consistency to predictive uncertainty, and that can be applied in both black-box and white-box settings. Across six LLMs and LVLMs on eight datasets spanning VQA, image classification, and textual QA, SGPU consistently achieves state-of-the-art calibration (ECE) and discriminative (AUROC, AUARC) performance. We further show that SGPU transfers across models and modalities, indicating that its spectral representation captures general patterns of semantic uncertainty.


2603.22121 2026-06-04 cs.CV cs.AI updated version

GenSpan: Generation-Calibrated Motion Span Priors for Multi-Verb Video Corpus Moment Retrieval

GenSpan: 用于多动词视频语料库时刻检索的生成校准运动跨度先验

Yunzhuo Sun, Xinyue Liu, Yanyang Li, Nanding Wu, Linlin Zong, Xianchao Zhang, Wenxin Liang

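Affiliations * Dalian University of Technology

AI summary: Proposes the GenSpan framework, which uses auxiliary videos built from LLM-selected cues as temporal priors and combines a token selector with a bidirectional state-space model to improve video corpus moment retrieval and localization for multi-verb queries.

Comments Major revision with title change, updated method, and additional experiments

AI-generated Chinese abstract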

视频语料库时刻检索（VCMR）旨在检索与自然语言查询对应的正确视频及其时间片段，对于时间动作顺序至关重要的多动词查询尤其具有挑战性。现有方法通常仅依赖文本或静态图像，难以捕捉隐式运动动态，导致检索错误和时间错位。我们提出GenSpan，一个生成校准的VCMR框架，从LLM选择的字幕线索和分解的子事件中构建短辅助视频，将这些作为时间先验而非直接检索目标。令牌选择器过滤与生成运动对齐的候选视频特征，双向状态空间模型高效预测视频-时刻元组。在TVR和ActivityNet-Captions上的实验表明，GenSpan提高了语料库级检索和时刻定位，特别是对于复杂的多动作查询，同时与最先进的多模态基线相比降低了计算成本。

English abstract

Video Corpus Moment Retrieval (VCMR) aims to retrieve both the correct video and its temporal segment corresponding to a natural-language query, a task that is especially challenging for multi-verb queries where temporal action ordering is critical. Existing approaches often rely solely on text or static images and struggle to capture implicit motion dynamics, leading to retrieval errors and temporal misalignment. We propose GenSpan, a generation-calibrated VCMR framework that constructs short auxiliary videos from LLM-selected subtitle cues and decomposed sub-events, using these as temporal priors rather than direct retrieval targets. A token selector filters candidate-video features aligned with generated motion, and a bidirectional state-space model efficiently predicts video-moment tuples. Experiments on TVR and ActivityNet-Captions demonstrate that GenSpan improves corpus-level retrieval and moment localization, particularly for complex multi-action queries, while reducing computational cost compared to state-of-the-art multimodal baselines.


2603.13432 2026-06-04 cs.CV cs.AI updated version

EvoPrompt: Guiding Prompt Evolution to Adapt Vision-Language Models

Enming Zhang, Jiayang Li, Yanlong Wang, Yanru Wu, Zhenyu Liu, Yang Li

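Affiliations * Tsinghua Shenzhen International Graduate School, Tsinghua University; Sun Yat-sen University; Chinese University of Hong Kong, Shenzhen

AI summary: Proposes the EvoPrompt framework, which steers the evolutionary path of prompts and decouples low-rank updates into directional and magnitude components, enabling forgetting-free fine-tuning of vision-language models in few-shot learning while preserving zero-shot capability.

AI-generated Chinese abstract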

大规模视觉-语言模型（VLM）在有限标注数据下适应下游任务仍然是一个重大挑战。虽然参数高效的提示学习方法提供了一条有希望的路径，但它们常常遭受预训练知识的灾难性遗忘。为了解决这一限制，我们的工作基于一个洞察：控制提示的演化路径对于无遗忘适应至关重要。为此，我们提出了EvoPrompt，一个旨在明确引导提示轨迹以进行知识保留微调的新型框架。具体来说，我们的方法采用模态共享提示投影器（MPP）从统一嵌入空间生成层次化提示。关键的是，一种演化训练策略将低秩更新解耦为方向和幅度分量，保留早期学习的语义方向而仅调整其幅度，从而使提示能够在不丢弃基础知识的情况下演化。这一过程通过特征几何正则化（FGR）进一步稳定，该正则化强制特征去相关以防止表示崩溃。大量实验表明，EvoPrompt在少样本学习中实现了最先进的性能，同时稳健地保留了预训练VLM的原始零样本能力。

English abstract

The adaptation of large-scale vision-language models (VLMs) to downstream tasks with limited labeled data remains a significant challenge. While parameter-efficient prompt learning methods offer a promising path, they often suffer from catastrophic forgetting of pre-trained knowledge. Toward addressing this limitation, our work is grounded in the insight that governing the evolutionary path of prompts is essential for forgetting-free adaptation. To this end, we propose EvoPrompt, a novel framework designed to explicitly steer the prompt trajectory for knowledge-preserving fine-tuning. Specifically, our approach employs a Modality-Shared Prompt Projector (MPP) to generate hierarchical prompts from a unified embedding space. Critically, an evolutionary training strategy decouples low-rank updates into directional and magnitude components, preserving early-learned semantic directions while only adapting their magnitude, thus enabling prompts to evolve without discarding foundational knowledge. This process is further stabilized by Feature Geometric Regularization (FGR), which enforces feature decorrelation to prevent representation collapse. Extensive experiments demonstrate that EvoPrompt achieves state-of-the-art performance in few-shot learning while robustly preserving the original zero-shot capabilities of pre-trained VLMs.
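The direction/magnitude decoupling mentioned in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `decompose` and `recompose`, the two-dimensional update, and the step size are all hypothetical, chosen only to show an update whose direction is frozen while its magnitude is adapted.

```python
import math


def decompose(update):
    """Split a vector into a unit direction and a scalar magnitude."""
    magnitude = math.sqrt(sum(v * v for v in update))
    if magnitude == 0.0:
        raise ValueError("a zero update has no direction")
    return [v / magnitude for v in update], magnitude


def recompose(direction, magnitude):
    """Rebuild the update from a fixed direction and a (possibly new) magnitude."""
    return [magnitude * d for d in direction]


# Hypothetical early-learned update; its direction is frozen from here on.
early_update = [3.0, 4.0]
direction, magnitude = decompose(early_update)  # unit direction, magnitude 5.0

# Later adaptation touches only the scalar magnitude (illustrative step of 0.5).
adapted = recompose(direction, magnitude + 0.5)
print(adapted)  # approximately [3.3, 4.4]
```

Because only the scalar changes, the adapted update stays parallel to the early one, which is the property the abstract attributes to preserving early-learned semantic directions.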


2603.09242 2026-06-04 cs.CV updated version

When Detectors Forget Forensics: Blocking Semantic Shortcuts for Generalizable AI-Generated Image Detection

当检测器遗忘取证：阻断语义捷径以实现可泛化的AI生成图像检测

Chao Shuai, Shaojing Fan, Chenlin Zou, Bin Gong, Weichen Lian, Xiuli Bi, Zhenguang Liu, Zhongjie Ba, Kui Ren

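Affiliations * State Key Laboratory of Blockchain and Data Security, Zhejiang University; National University of Singapore; Chongqing University of Posts and Telecommunications

AI summary: Proposes the Geometric Semantic Decoupling (GSD) framework, which suppresses semantically dominant directions to promote invariant forensic representations, addressing the poor generalization caused by semantic fallback when pre-trained Vision Foundation Models are used for AI-generated image detection.

AI-generated Chinese abstract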

生成模型日益逼真，模糊了真实与合成内容之间的界限，给可靠的AI生成图像检测带来了重大挑战。尽管大规模预训练视觉基础模型提升了检测能力，但它们对来自未见生成管道的图像的泛化仍然不足。在本文中，我们首次识别出一个关键失败机制，称为语义回退，即取证微调未能完全重塑表征空间。因此，所得表征仍沿高层语义结构而非操作特定的取证线索组织。基于这一见解，我们提出了一个几何语义解耦（GSD）框架，该框架显式抑制语义主导方向，从而促进不变的取证表征。具体而言，GSD利用冻结的CLIP编码器通过奇异值分解（SVD）估计主导语义子空间。然后，通过几何约束公式抑制语义成分，并在样本和层间自适应调节抑制强度。我们进一步引入了一种小批量SVD近似策略，该策略分摊子空间估计，在保持有效性的同时实现了超过15倍的计算开销减少。最后，考虑到涵盖大规模和在线评估的实际场景，我们开发了三种推理协议：批量推理、逐样本推理和基于参考的推理，并证明它们能产生一致的语义解耦，从而形成稳定的面向伪造的特征流形。

English abstract

The growing realism of generative models has blurred the boundary between real and synthetic content, posing significant challenges to reliable AI-generated image detection. Although large-scale pre-trained Vision Foundation Models have advanced detection capability, their generalization to images from unseen generation pipelines remains inadequate. In this paper, we identify, for the first time, a key failure mechanism, termed semantic fallback, wherein forensic fine-tuning fails to fully reshape the representation space. Consequently, the resulting representations remain organized along high-level semantic structures rather than manipulation-specific forensic cues. Building on this insight, we propose a Geometric Semantic Decoupling (GSD) framework, which explicitly suppresses semantically dominant directions, thereby promoting invariant forensic representations. Specifically, GSD leverages a frozen CLIP encoder to estimate the dominant semantic subspace via Singular Value Decomposition (SVD). It then suppresses the semantic components through a geometry-constrained formulation with the suppression strength adaptively modulated across samples and layers. We further introduce a mini-batch SVD approximation strategy that amortizes subspace estimation, achieving over a 15× reduction in computational overhead while preserving effectiveness. Finally, considering practical scenarios spanning both large-scale and online evaluation, we develop three inference protocols (batch, per-sample, and reference-based inference) and demonstrate that they induce consistent semantic decoupling, yielding a stable forgery-oriented feature manifold.
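Suppressing a dominant subspace, as described above, can be sketched as subtracting a scaled projection of a feature onto that subspace. The sketch below is only an illustration under stated assumptions: the basis is taken as given and orthonormal (the abstract estimates it via SVD on frozen CLIP features, which is not reproduced here), and `suppress`, `semantic_basis`, and the fixed `strength` are hypothetical names and values rather than the paper's adaptive formulation.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def suppress(feature, basis, strength=1.0):
    """Subtract `strength` times the projection of `feature` onto span(basis).

    `basis` is assumed to be a list of orthonormal vectors.
    """
    out = list(feature)
    for u in basis:
        coeff = dot(feature, u)
        out = [o - strength * coeff * ui for o, ui in zip(out, u)]
    return out


# Hypothetical one-dimensional "semantic" subspace along the first axis.
semantic_basis = [[1.0, 0.0, 0.0]]
feature = [2.0, 1.0, -1.0]

residual = suppress(feature, semantic_basis, strength=1.0)
print(residual)  # [0.0, 1.0, -1.0]
```

With `strength=1.0` the residual is orthogonal to the basis; smaller values leave part of the semantic component in place, which is one simple way to picture a modulated suppression strength.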


2603.03482 2026-06-04 cs.CV cs.AI cs.LG updated version

Beyond Pixel Histories: World Models with Persistent 3D State

超越像素历史：具有持久3D状态的世界模型

Samuel Garcin, Thomas Walker, Steven McDonagh, Tim Pearce, Hakan Bilen, Tianyu He, Kaixin Wang, Jiang Bian

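Affiliations * University of Edinburgh; Microsoft Research

AI summary: Proposes PERSIST, a world-model paradigm that simulates the evolution of a latent 3D scene (environment, camera, and renderer), yielding persistent spatial memory and consistent geometry and substantially improving 3D consistency, spatial memory, and long-horizon stability.

Comments Accepted to the International Conference on Machine Learning (ICML) 2026. To appear in the Proceedings of Machine Learning Research (PMLR). 9 pages

AI-generated Chinese abstract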

交互式世界模型通过响应用户的动作持续生成视频，实现开放式的生成能力。然而，现有模型通常缺乏环境的3D表示，意味着3D一致性必须从数据中隐式学习，且空间记忆受限于有限的时域上下文窗口。这导致不真实的用户体验，并对训练智能体等下游任务构成重大障碍。为解决这一问题，我们提出PERSIST，一种新的世界模型范式，它模拟潜在3D场景（环境、相机和渲染器）的演化。这使得我们能够合成具有持久空间记忆和一致几何的新帧。定量指标和定性用户研究均表明，与现有方法相比，在空间记忆、3D一致性和长期稳定性方面有显著提升，从而实现连贯、演化的3D世界。我们进一步展示了新颖的能力，包括从单张图像合成多样化的3D环境，以及通过直接在3D空间中支持环境编辑和指定，实现对生成体验的细粒度、几何感知控制。项目页面：https://francelico.github.io/persist.github.io

English abstract

Interactive world models continually generate video by responding to a user's actions, enabling open-ended generation capabilities. However, existing models typically lack a 3D representation of the environment, meaning 3D consistency must be implicitly learned from data, and spatial memory is restricted to limited temporal context windows. This results in an unrealistic user experience and presents significant obstacles to downstream tasks such as training agents. To address this, we present PERSIST, a new paradigm of world model which simulates the evolution of a latent 3D scene: environment, camera, and renderer. This allows us to synthesise new frames with persistent spatial memory and consistent geometry. Both quantitative metrics and a qualitative user study show substantial improvements in spatial memory, 3D consistency, and long-horizon stability over existing methods, enabling coherent, evolving 3D worlds. We further demonstrate novel capabilities, including synthesising diverse 3D environments from a single image, as well as enabling fine-grained, geometry-aware control over generated experiences by supporting environment editing and specification directly in 3D space. Project page: https://francelico.github.io/persist.github.io


2603.02697 2026-06-04 cs.CV cs.AI updated version

ShareVerse: Multi-Agent Consistent Video Generation for Shared World Modeling

ShareVerse：面向共享世界建模的多智能体一致视频生成

Jiayi Zhu, Jianing Zhang, Yiying Yang, Wei Cheng, Xiaoyun Yuan

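Affiliations * Shanghai Jiao Tong University, China; Fudan University, China; StepFun, China

AI summary: Proposes the ShareVerse framework, which achieves consistent video generation for a multi-agent shared world through a multi-agent interaction dataset, a spatial concatenation strategy, and cross-agent attention.

AI-generated Chinese abstract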

本文提出ShareVerse，一个视频生成框架，支持多智能体共享世界建模，解决了现有工作缺乏统一共享世界构建和多智能体交互支持的问题。ShareVerse利用大型视频模型的生成能力，并整合了三个关键创新：1）在CARLA仿真平台上构建了大规模多智能体交互世界建模数据集，包含多样场景、天气条件和交互轨迹，以及配对的每智能体四视角视频（前/后/左/右视图）和相机数据。2）我们提出了一种针对独立智能体四视角视频的空间拼接策略，以建模更广泛的环境并确保内部多视角几何一致性。3）我们将跨智能体注意力模块集成到预训练视频模型中，实现跨智能体时空信息的交互传递，保证重叠区域的共享世界一致性和非重叠区域的合理生成。支持49帧大规模视频生成的ShareVerse能够准确感知动态智能体的位置，实现一致的共享世界建模。

English abstract

This paper presents ShareVerse, a video generation framework enabling multi-agent shared world modeling, addressing the gap in existing works that lack support for unified shared world construction with multi-agent interaction. ShareVerse leverages the generation capability of large video models and integrates three key innovations: 1) A dataset for large-scale multi-agent interactive world modeling is built on the CARLA simulation platform, featuring diverse scenes, weather conditions, and interactive trajectories with paired multi-view videos (front/ rear/ left/ right views per agent) and camera data. 2) We propose a spatial concatenation strategy for four-view videos of independent agents to model a broader environment and to ensure internal multi-view geometric consistency. 3) We integrate cross-agent attention blocks into the pretrained video model, which enable interactive transmission of spatial-temporal information across agents, guaranteeing shared world consistency in overlapping regions and reasonable generation in non-overlapping regions. ShareVerse, which supports 49-frame large-scale video generation, accurately perceives the position of dynamic agents and achieves consistent shared world modeling.


2602.23214 2026-06-04 cs.CV cs.LG eess.IV updated version

Plug-and-Play Diffusion Meets ADMM: Dual-Variable Coupling for Robust Medical Image Reconstruction

即插即用扩散遇见ADMM：双变量耦合用于鲁棒医学图像重建

Chenhe Du, Xuanyu Tian, Qing Wu, Muyu Liu, Jingyi Yu, Hongjiang Wei, Yuyao Zhang

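Affiliations * ShanghaiTech University; Shanghai Jiao Tong University

AI summary: Proposes the Dual-Coupled PnP Diffusion (DC-PnPDP) framework, which restores the classical dual variable to provide integral feedback and applies Spectral Homogenization (SH) to handle structured artifacts. It addresses the steady-state bias and hallucination problems of existing PnP solvers and achieves state-of-the-art fidelity with faster convergence on CT and MRI reconstruction.

Comments Accepted by ICML 2026

AI-generated Chinese abstract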

即插即用扩散先验（PnPDP）框架通过将预训练生成模型视为模块化先验，已成为解决成像逆问题的强大范式。然而，我们发现当前PnP求解器（例如基于HQS或近端梯度）存在一个关键缺陷：它们作为无记忆算子，仅基于瞬时梯度更新估计。这种缺乏历史跟踪的做法不可避免地导致非消失稳态偏差，使得重建在严重损坏下无法严格满足物理测量。为了解决这个问题，我们提出了双耦合PnP扩散（DC-PnPDP），它恢复了经典对偶变量以提供积分反馈，逐步强制数据一致性和先验之间的一致性。然而，这种严格的几何耦合引入了第二个挑战：累积的对偶残差表现出频谱有色、结构化的伪影，违反了扩散先验的加性白高斯噪声（AWGN）假设，导致严重的幻觉。为了弥合这一差距，我们引入了频谱均匀化（SH），一种频域适应机制，将这些结构化残差调制为统计上合规的伪AWGN输入。这有效地将求解器的严格优化轨迹与去噪器的有效统计流形对齐。在CT和MRI重建上的大量实验表明，我们的方法解决了偏差-幻觉权衡，实现了最先进的保真度并显著加速收敛。代码可在https://github.com/duchenhe/DC-PnPDP获取。

English abstract

Plug-and-Play diffusion prior (PnPDP) frameworks have emerged as a powerful paradigm for solving imaging inverse problems by treating pretrained generative models as modular priors. However, we identify a critical flaw in prevailing PnP solvers (e.g., based on HQS or Proximal Gradient): they function as memoryless operators, updating estimates solely based on instantaneous gradients. This lack of historical tracking inevitably leads to non-vanishing steady-state bias, where the reconstruction fails to strictly satisfy physical measurements under heavy corruption. To resolve this, we propose Dual-Coupled PnP Diffusion (DC-PnPDP), which restores the classical dual variable to provide integral feedback, progressively enforcing agreement between the data-consistency and prior terms. However, this rigorous geometric coupling introduces a secondary challenge: the accumulated dual residuals exhibit spectrally colored, structured artifacts that violate the Additive White Gaussian Noise (AWGN) assumption of diffusion priors, causing severe hallucinations. To bridge this gap, we introduce Spectral Homogenization (SH), a frequency-domain adaptation mechanism that modulates these structured residuals into statistically compliant pseudo-AWGN inputs. This effectively aligns the solver's rigorous optimization trajectory with the denoiser's valid statistical manifold. Extensive experiments on CT and MRI reconstruction demonstrate that our approach resolves the bias-hallucination trade-off, achieving state-of-the-art fidelity with significantly accelerated convergence. The code is available at https://github.com/duchenhe/DC-PnPDP
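The role of the dual variable as integral feedback can be seen in a toy scaled-form ADMM loop. The sketch below is a generic textbook ADMM on a scalar problem, not DC-PnPDP: a soft-threshold operator stands in for the diffusion denoiser, Spectral Homogenization is not modeled, and the names `admm_scalar`, `soft_threshold`, `lam`, and `rho` are assumptions made for the illustration.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|; a stand-in for a learned denoiser."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def admm_scalar(y, lam, rho=1.0, iters=200):
    """Minimize 0.5 * (x - y)**2 + lam * |z| subject to x == z (scaled ADMM)."""
    x = z = u = 0.0
    for _ in range(iters):
        # Data-consistency step: pull x toward the measurement y and toward z - u.
        x = (y + rho * (z - u)) / (1.0 + rho)
        # Prior step: "denoise" the dual-corrected estimate.
        z = soft_threshold(x + u, lam / rho)
        # Dual update: accumulate the remaining disagreement between x and z.
        u = u + x - z
    return x, z, u


x, z, u = admm_scalar(y=3.0, lam=1.0)
print(round(x, 6), round(z, 6))  # 2.0 2.0
```

The dual variable `u` keeps a running sum of the constraint residual `x - z`, so any persistent disagreement keeps growing its correction until the two estimates meet; a solver without `u` has no such memory.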


2506.06006 2026-06-04 cs.CV cs.AI cs.CL updated version

Can VLMs Predict Future States? Bootstrapping World Models from Inverse Dynamics

视觉语言模型能预测未来状态吗？从逆动力学引导世界模型

Yifu Qiu, Yftah Ziser, Anna Korhonen, Shay B. Cohen, Edoardo M. Ponti

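Affiliations * Institute for Language, Cognition and Computation, University of Edinburgh; Language Technology Lab, University of Cambridge; NVIDIA; University of Groningen

AI summary: Finds that vision-language models (VLMs) struggle with forward dynamics prediction (FDP) directly but learn inverse dynamics prediction (IDP) more easily, and uses IDP to bootstrap FDP through two strategies, weakly supervised learning and inference-time verification, reaching performance competitive with state-of-the-art image editing models on Aurora-Bench.

AI-generated Chinese abstract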

统一的视觉语言模型（VLM）能否执行前向动力学预测（FDP），即根据先前的观察和（语言形式的）动作预测未来状态（图像形式）？我们发现VLM难以根据指令生成帧之间物理上合理的过渡。然而，我们识别出多模态基础中的一个关键不对称性：微调VLM学习逆动力学预测（IDP）——有效地描述帧之间的动作——比学习FDP容易得多。反过来，IDP可以通过两种主要策略引导FDP：1）来自合成数据的弱监督学习，以及2）推理时验证。首先，IDP可以为未标记的视频帧观察对标注动作，以扩大FDP的训练数据规模。其次，IDP可以为FDP的多个样本分配奖励以对其进行评分，从而在推理时有效指导搜索。我们通过Aurora-Bench上的以动作为中心的图像编辑任务，使用两个VLM家族评估了这两种策略产生的FDP。尽管仍然是通用模型，我们的最佳模型实现了与最先进的图像编辑模型竞争的性能，根据GPT4o作为评判，在Aurora-Bench的所有子集上，性能提高了7%到13%，并获得了最佳平均人类评估。

English abstract

Can unified vision-language models (VLMs) perform forward dynamics prediction (FDP), i.e., predicting the future state (in image form) given the previous observation and an action (in language form)? We find that VLMs struggle to generate physically plausible transitions between frames from instructions. Nevertheless, we identify a crucial asymmetry in multimodal grounding: fine-tuning a VLM to learn inverse dynamics prediction (IDP), which effectively amounts to captioning the action between frames, is significantly easier than learning FDP. In turn, IDP can be used to bootstrap FDP through two main strategies: 1) weakly supervised learning from synthetic data and 2) inference-time verification. Firstly, IDP can annotate actions for unlabelled pairs of video frame observations to expand the training data scale for FDP. Secondly, IDP can assign rewards to multiple samples of FDP to score them, effectively guiding search at inference time. We evaluate the FDP resulting from both strategies through the task of action-centric image editing on Aurora-Bench with two families of VLMs. Despite remaining general-purpose, our best model achieves a performance competitive with state-of-the-art image editing models, improving on them by a margin between 7% and 13% according to GPT4o-as-judge, and achieving the best average human evaluation across all subsets of Aurora-Bench.


2602.06883 2026-06-04 cs.LG cs.CV stat.ML updated version

Vision Transformer Finetuning Benefits from Non-Smooth Components

视觉变换器微调受益于非平滑组件

Ambroise Odonnat, Laetitia Chapel, Romain Tavenard, Ievgen Redko

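Affiliations * Noah's Ark Lab; Univ. Rennes 2, Inria

AI summary: Analyzes the plasticity of vision transformer components (the sensitivity of their outputs to input changes) and finds that attention modules and feedforward layers, which have high plasticity (low smoothness), give better finetuning performance, challenging the conventional view that smoothness is desirable.

Comments Accepted at ICML 2026

AI-generated Chinese abstract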

变换器架构的平滑性在泛化、训练稳定性和对抗鲁棒性方面已被广泛研究。然而，其在迁移学习中的作用仍知之甚少。本文分析了视觉变换器组件使其输出适应输入变化的能力，即它们的\emph{可塑性}。定义为平均变化率，它捕捉了对输入扰动的敏感性；特别地，高可塑性意味着低平滑性。我们的理论分析和大量实验——在大规模视觉变换器上进行超过1000次微调运行——表明，这一视角为选择在适应过程中优先考虑的组件提供了原则性指导。对从业者的关键启示是，注意力模块和前馈层的高可塑性始终导致更好的微调性能。我们的发现偏离了平滑性是可取的普遍假设，为变换器的功能特性提供了新的视角。代码可在 https://github.com/ambroiseodt/vit-plasticity 获取。

English abstract

The smoothness of the transformer architecture has been extensively studied in the context of generalization, training stability, and adversarial robustness. However, its role in transfer learning remains poorly understood. In this paper, we analyze the ability of vision transformer components to adapt their outputs to changes in inputs, or, in other words, their plasticity. Defined as an average rate of change, it captures the sensitivity to input perturbation; in particular, a high plasticity implies a low smoothness. Our theoretical analysis and extensive experiments (over 1,000 finetuning runs on large-scale vision transformers) show that this perspective provides principled guidance in choosing the components to prioritize during adaptation. A key takeaway for practitioners is that the high plasticity of the attention modules and feedforward layers consistently leads to better finetuning performance. Our findings depart from the prevailing assumption that smoothness is desirable, offering a novel perspective on transformers' functional properties. The code is available at https://github.com/ambroiseodt/vit-plasticity.
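One way to read "an average rate of change" is as the mean ratio between the output change and the input change under small random perturbations. The sketch below estimates that ratio empirically; it is an assumption about the general idea, not the paper's definition, and `plasticity`, `eps`, `trials`, and the Gaussian perturbations are hypothetical choices.

```python
import math
import random


def norm(v):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(x * x for x in v))


def plasticity(f, inputs, eps=1e-3, trials=8, seed=0):
    """Average of ||f(x + d) - f(x)|| / ||d|| over inputs and random perturbations d."""
    rng = random.Random(seed)
    ratios = []
    for x in inputs:
        fx = f(x)
        for _ in range(trials):
            delta = [rng.gauss(0.0, eps) for _ in x]
            delta_norm = norm(delta)
            if delta_norm == 0.0:
                continue  # skip a degenerate perturbation
            fxd = f([a + d for a, d in zip(x, delta)])
            ratios.append(norm([p - q for p, q in zip(fxd, fx)]) / delta_norm)
    return sum(ratios) / len(ratios)


def scale_by_three(x):
    """Toy component: a linear map whose rate of change is exactly 3."""
    return [3.0 * a for a in x]


print(plasticity(scale_by_three, [[1.0, 2.0], [0.5, -1.0]]))  # close to 3.0
```

Under this reading, a component with a larger value reacts more strongly to input perturbations, which matches the abstract's statement that high plasticity implies low smoothness.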


2512.03553 2026-06-04 cs.CV cs.AI updated version

Dynamic Content Moderation in Livestreams: Combining Supervised Classification with MLLM-Boosted Similarity Matching

直播中的动态内容审核：结合监督分类与MLLM增强的相似度匹配

Wei Chee Yew, Hailun Xu, Sanjay Saha, Xiaotian Fan, Hiok Hian Ong, David Yuchen Wang, Kanchan Sarkar, Zhenheng Yang, Danhui Guan

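Affiliations * TikTok, Singapore, Singapore; TikTok, San Jose, United States; TikTok, Shanghai, China

AI summary: Proposes a hybrid moderation framework that combines supervised classification with reference-based similarity matching and uses a multimodal large language model to boost accuracy, enabling large-scale livestream moderation while keeping inference lightweight.

Comments To be published at KDD 2026 (ADS track)

DOI: 10.1145/33770854.3783936

AI-generated Chinese abstract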

内容审核对于大规模用户生成视频平台仍然是一个关键且具有挑战性的任务，尤其是在直播环境中，审核必须及时、多模态，并且能够应对不断演变的不良内容形式。我们提出了一个在生产规模部署的混合审核框架，该框架将已知违规的监督分类与针对新颖或微妙情况的基于参考的相似度匹配相结合。这种混合设计能够稳健地检测出明确违规以及传统分类器无法检测到的新颖边缘情况。多模态输入（文本、音频、视觉）通过两个流水线处理，多模态大语言模型（MLLM）将知识提炼到每个流水线中，以提高准确性，同时保持推理轻量。在生产中，分类流水线在80%精确率下达到67%召回率，相似度流水线在80%精确率下达到76%召回率。大规模A/B测试显示，用户对不良直播的观看次数减少了6-8%。这些结果表明了一种可扩展且适应性强的多模态内容治理方法，能够处理明确违规和新兴对抗行为。

English abstract

Content moderation remains a critical yet challenging task for large-scale user-generated video platforms, especially in livestreaming environments where moderation must be timely, multimodal, and robust to evolving forms of unwanted content. We present a hybrid moderation framework deployed at production scale that combines supervised classification for known violations with reference-based similarity matching for novel or subtle cases. This hybrid design enables robust detection of both explicit violations and novel edge cases that evade traditional classifiers. Multimodal inputs (text, audio, visual) are processed through both pipelines, with a multimodal large language model (MLLM) distilling knowledge into each to boost accuracy while keeping inference lightweight. In production, the classification pipeline achieves 67% recall at 80% precision, and the similarity pipeline achieves 76% recall at 80% precision. Large-scale A/B tests show a 6-8% reduction in user views of unwanted livestreams. These results demonstrate a scalable and adaptable approach to multimodal content governance, capable of addressing both explicit violations and emerging adversarial behaviors.
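Figures such as "67% recall at 80% precision" come from choosing a score threshold that keeps precision at or above a target and reporting the recall reached there. The sketch below shows that computation on made-up scores and labels; the function name `recall_at_precision` and the data are hypothetical and are not taken from the paper.

```python
def recall_at_precision(scores, labels, min_precision=0.8):
    """Best recall over score thresholds whose precision is at least min_precision.

    `labels` holds 1 for a violation and 0 otherwise; higher scores mean
    the model is more confident that the item is a violation.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    best = 0.0
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1
        else:
            fp += 1
        # Only evaluate where the threshold can actually sit (between distinct scores).
        if i + 1 < len(ranked) and ranked[i + 1][0] == score:
            continue
        if tp / (tp + fp) >= min_precision:
            best = max(best, tp / total_pos)
    return best


# Made-up scores and labels for illustration only.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 1]
print(recall_at_precision(scores, labels, min_precision=0.8))  # 0.5
```

Lowering the precision target lets the threshold move further down the ranking, which trades more false positives for higher recall.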


2601.19683 2026-06-04 cs.CV updated version

From Segments to Scenes: Temporal Understanding in Autonomous Driving with Vision-Language Models

Kevin Cannons, Saeed Ranjbar Alvar, Mohammad Asiful Hossain, Ahmad Rezaei, Mohsen Gholami, Alireza Heidarikhazaei, Zhou Weimin, Yong Zhang, Mohammad Akbari

Affiliations * University of California, Berkeley; University of Cambridge; University of Toronto; ETH Zurich; University of Washington; University of Southern California

AI summary: Proposes TAD, a benchmark for temporal understanding in autonomous driving, and improves the temporal reasoning of vision-language models with two training-free methods, Scene-CoT and TCogMap.

AI-generated Chinese abstract

视觉语言模型（VLM）越来越多地被部署为野外自主代理的感知和推理骨干，其中自动驾驶（AD）是最安全关键的实例之一。可靠的时间理解对于此类代理预测事件、归因原因和在动态环境中安全行动至关重要，但即使对于最先进的（SoTA）VLM来说，这仍然是一个重大挑战。先前的视频基准强调了其他内容（体育、烹饪等），但现有基准没有专门关注短时和长时AD视频的时间理解。为填补这一空白，我们提出了自动驾驶时间理解（TAD）基准，包含近6000个问答（QA）对，涵盖7个任务，并评估了9个闭源和开源通用以及AD专用模型。当前SoTA模型在TAD上的表现远低于人类准确率。为了改进基于VLM的驾驶代理的时间推理，我们提出了两种新颖的无训练解决方案：Scene-CoT，它使用思维链（CoT）推理；以及TCogMap，它结合了由轨迹分析模块生成的自我中心时间认知图，该模块作为VLM周围的代理工具运行。与现有VLM集成后，我们的方法在TAD上的平均准确率提高了高达17.72%，在STSBench上提高了高达10.35%。通过引入TAD、对SoTA模型进行基准测试并提出有效的增强方法，本工作旨在促进野外代理AD系统时间理解的进一步进展。基准和评估代码分别可在Hugging Face（https://huggingface.co/datasets/vbdai/TAD）和GitHub（https://github.com/vbdi/tad_bench）上获取。

English abstract

Vision-Language Models (VLMs) are increasingly deployed as the perception and reasoning backbone of autonomous agents acting in the wild, with autonomous driving (AD) being one of the most safety-critical instances. Reliable temporal understanding is essential for such agents to anticipate events, attribute causes, and act safely in dynamic environments, yet this remains a significant challenge even for state-of-the-art (SoTA) VLMs. Prior video benchmarks have emphasized other content (sports, cooking, etc.), yet no existing benchmark focuses exclusively on temporal understanding for both short- and long-form AD footage. To fill this gap, we present the Temporal Understanding in Autonomous Driving (TAD) benchmark, comprising nearly 6000 question-answer (QA) pairs across 7 tasks, and evaluate 9 closed- and open-source generalist as well as AD-specialist models. Current SoTA models perform substantially below human accuracy on TAD. To improve the temporal reasoning of VLM-based driving agents, we propose two novel training-free solutions: Scene-CoT, which uses Chain-of-Thought (CoT) reasoning, and TCogMap, which incorporates an ego-centric temporal cognitive map produced by a trajectory-analysis module that operates as an agentic tool around the VLM. Integrated with existing VLMs, our methods improve average accuracy on TAD by up to 17.72% and by up to 10.35% on STSBench. By introducing TAD, benchmarking SoTA models, and proposing effective enhancements, this work aims to catalyze further progress on temporal understanding for agentic AD systems operating in the wild. The benchmark and evaluation code are available on Hugging Face (https://huggingface.co/datasets/vbdai/TAD) and GitHub (https://github.com/vbdi/tad_bench), respectively.


2512.08331 2026-06-04 cs.CV updated version

DMAConv: Dual Mask-Adaptive Convolution for Remote Sensing Pansharpening

DMAConv: 用于遥感全色锐化的双掩膜自适应卷积

Xianghong Xiao, Zeyu Xia, Zhou Fei, Jinliang Xiao, Haorui Chen, Liangjian Deng

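Affiliations * University of Electronic Science and Technology of China; Tongji University

AI summary: Proposes Dual Mask-Adaptive Convolution (DMAConv), which uses soft and hard masks to allocate computation dynamically and a lightweight dual-branch structure to handle regional heterogeneity in remote sensing images efficiently, achieving state-of-the-art performance at the lowest computational cost among adaptive convolution models.

AI-generated Chinese abstract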

全色锐化旨在融合高分辨率全色图像与低分辨率多光谱图像。现有的深度学习方法，包括最近的自适应卷积，难以应对遥感图像的区域异质性，且往往计算成本过高。为解决这些挑战，我们提出双掩膜自适应卷积（DMAConv），这是一种根据特征特性动态分配计算资源的新型算子。DMAConv首先使用轻量级模块生成软掩膜和硬掩膜。硬掩膜将特征分为一个紧凑分支（用于全局处理冗余信息）和一个聚焦分支（以更多计算投入建模复杂异质区域）。随后，软掩膜对两个分支的输入特征进行初步调制。这种双分支掩膜自适应设计显著增强了特征表示，同时最小化了计算开销。大量实验表明，我们的方法在广泛的定量基准上达到了SOTA，且参数数量显著更低，计算成本在自适应卷积模型中最低。

English abstract

Pansharpening aims to fuse a high-resolution panchromatic image with a low-resolution multispectral image. Existing deep learning methods, including recent adaptive convolutions, struggle with regional heterogeneity in remote sensing images and often incur prohibitive computational costs. To address these challenges, we propose Dual Mask-Adaptive Convolution (DMAConv), a novel operator that dynamically allocates computational resources based on feature characteristics. DMAConv first employs a lightweight module to generate soft and hard masks. The hard mask separates features into a compact branch for processing redundant information globally and a focused branch that models complex, heterogeneous regions with greater computational investment. The soft mask then preliminarily modulates the input features for both branches. This dual-branch, mask-adaptive design significantly enhances feature representation while minimizing computational overhead. Extensive experiments demonstrate that our method achieves state-of-the-art results on a broad array of quantitative benchmarks, with substantially lower parameter counts and the lowest computational cost among adaptive convolution models.


2511.16624 2026-06-04 cs.CV cs.AI updated version

SAM 3D: 3Dfy Anything in Images

SAM 3D: 将图像中的任何内容3D化

SAM 3D Team, Xingyu Chen, Fu-Jen Chu, Pierre Gleize, Kevin J Liang, Alexander Sax, Hao Tang, Weiyao Wang, Michelle Guo, Thibaut Hardin, Xiang Li, Aohan Lin, Jiawei Liu, Ziqi Ma, Anushka Sagar, Bowen Song, Xiaodong Wang, Jianing Yang, Bowen Zhang, Piotr Dollár, Georgia Gkioxari, Matt Feiszli, Jitendra Malik

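Affiliations * Meta Superintelligence Labs

AI summary: Proposes SAM 3D, a generative model that reconstructs the geometry, texture, and layout of 3D objects from a single image, overcomes the data bottleneck with human- and model-in-the-loop annotation and multi-stage training, and shows significant gains on real-world scenes.

Comments Website: https://ai.meta.com/sam3d/

AI-generated Chinese abstract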

我们提出SAM 3D，一种用于视觉引导的3D物体重建的生成模型，能够从单张图像预测几何、纹理和布局。SAM 3D在自然图像中表现出色，这些图像中遮挡和场景杂乱很常见，且来自上下文的视觉识别线索起着更重要的作用。我们通过一个人工和模型在环的流水线来标注物体形状、纹理和姿态，以前所未有的规模提供视觉引导的3D重建数据。我们在一个现代的、多阶段的训练框架中从这些数据中学习，该框架结合了合成预训练和真实世界对齐，打破了3D“数据壁垒”。与近期工作相比，我们获得了显著提升，在真实世界物体和场景的人类偏好测试中至少达到5:1的胜率。我们将发布我们的代码和模型权重、一个在线演示以及一个新的用于野外3D物体重建的具有挑战性的基准测试。

English abstract

We present SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image. SAM 3D excels in natural images, where occlusion and scene clutter are common and visual recognition cues from context play a larger role. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose, providing visually grounded 3D reconstruction data at unprecedented scale. We learn from this data in a modern, multi-stage training framework that combines synthetic pretraining with real-world alignment, breaking the 3D "data barrier". We obtain significant gains over recent work, with at least a 5:1 win rate in human preference tests on real-world objects and scenes. We will release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.


2511.06331 2026-06-04 cs.CV 版本更新

Label-Efficient 3D Forest Mapping: Self-Supervised and Transfer Learning for Instance Segmentation, Semantic Segmentation, and Species Classification

标签高效的3D森林映射：自监督与迁移学习用于实例分割、语义分割和物种分类

Aldino Rizaldy, Fabian Ewald Fassnacht, Ahmed Jamal Afifi, Hua Jiang, Richard Gloaguen, Pedram Ghamisi

发表机构 * Helmholtz-Zentrum Dresden-Rossendorf (HZDR), Helmholtz Institute Freiberg for Resource Technology (HIF)（亥姆霍兹德累斯顿-罗森多夫研究中心（HZDR）、亥姆霍兹弗莱贝格资源技术研究所（HIF））； Remote Sensing and Geoinformatics, Freie Universität Berlin（柏林自由大学遥感与地理信息学系）； Institute of Geomatics, BOKU University（维也纳自然资源与生命科学大学（BOKU）测绘研究所）； Faculty of Electrical and Computer Engineering, University of Iceland（冰岛大学电气与计算机工程学院）

AI总结本文利用自监督和迁移学习策略，在少量标注数据下提升3D点云中树木实例分割、语义分割和物种分类的性能，并集成统一框架以简化流程。

详情

AI中文摘要

个体树木级别的详细结构和物种信息对于支持精准林业、生物多样性保护以及为生物量和碳映射提供参考数据日益重要。来自机载和地面激光扫描的点云是目前快速大规模获取此类信息的最合适数据源。深度学习的最新进展改进了对个体树木的分割和分类以及语义树组件的识别。然而，深度学习模型通常需要大量标注训练数据，这限制了进一步的改进。为3D点云生成密集、高质量的标注，尤其是在复杂森林中，劳动密集且难以规模化。我们探索使用自监督和迁移学习来减少对大型标注数据集的依赖。我们的目标是提高三个任务的性能：实例分割、语义分割和树木分类，使用现实且可操作的训练集。与从头训练相比，我们观察到所有任务均有所改进，并通过各自的指标进行评估。对于实例分割，自监督学习结合领域适应使AP50提高了16.98%。对于语义分割，仅自监督学习使mIoU提高了1.79%。对于树木分类，层次迁移学习使平均Jaccard提高了6.07%。为简化使用并鼓励采用，我们将这些任务集成到一个统一框架中，简化了从原始点云到树木描绘、结构分析和物种分类的流程。预训练模型减少了约21%的能耗和碳排放。这一开源贡献旨在加速从激光扫描点云中操作性地提取个体树木信息，以支持林业、生物多样性和碳映射。

英文摘要

Detailed structural and species information at the individual-tree level is increasingly important for supporting precision forestry and biodiversity conservation and for providing reference data for biomass and carbon mapping. Point clouds from airborne and ground-based laser scanning are currently the most suitable data source to rapidly derive such information at scale. Recent advancements in deep learning have improved the segmentation and classification of individual trees and the identification of semantic tree components. However, deep learning models typically require large amounts of annotated training data, which limits further improvement. Producing dense, high-quality annotations for 3D point clouds, especially in complex forests, is labor-intensive and challenging to scale. We explore strategies to reduce dependence on large annotated datasets using self-supervised and transfer learning. Our objective is to improve performance across three tasks (instance segmentation, semantic segmentation, and tree classification) using realistic and operational training sets. Compared to training from scratch, we observe improvements across all tasks, each evaluated with its respective metric. For instance segmentation, self-supervised learning combined with domain adaptation improves AP50 by 16.98%. For semantic segmentation, self-supervised learning alone improves mIoU by 1.79%. For tree classification, hierarchical transfer learning improves mean Jaccard by 6.07%. To simplify use and encourage uptake, we integrated the tasks into a unified framework, streamlining the process from raw point clouds to tree delineation, structural analysis, and species classification. Pretrained models reduce energy consumption and carbon emissions by ~21%. This open-source contribution aims to accelerate operational extraction of individual tree information from laser scanning point clouds to support forestry, biodiversity, and carbon mapping.


2511.00801 2026-06-04 cs.CV cs.MM 版本更新

Med-Banana: Learning Quality-Controlled Medical Image Editing from Success-and-Failure Trajectories

Med-Banana：从成功与失败轨迹中学习质量可控的医学图像编辑

Zhihui Chen, Qingyuan Lei, Kai He, Yanrui Du, Mengling Feng

发表机构 * National University of Singapore（新加坡国立大学）； The Chinese University of Hong Kong（香港中文大学）； Harbin Institute of Technology（哈尔滨工业大学）

AI总结提出Med-Banana框架，通过收集成功与失败编辑轨迹数据集Med-Banana-80K，联合训练编辑器、验证器和优化器，实现质量可控的医学图像编辑。

详情

AI中文摘要

文本引导的医学图像编辑必须满足所需的病理特征，同时保留解剖结构、模态特定外观和临床合理性。然而，现有数据集主要用最终接受的编辑结果来监督编辑器，并丢弃生成过程中产生的失败尝试。我们认为这些失败为质量控制提供了必要的监督：它们指定了应该拒绝什么、为什么编辑在医学或视觉上无效，以及应该如何修改指令。我们提出了Med-Banana，一个用于质量可控的医学图像编辑的轨迹监督框架。我们引入了Med-Banana-80K，一个大规模的成功与失败编辑轨迹资源，包含候选图像、验证结果、拒绝原因和提示优化。在此基础上，Med-Banana联合训练编辑器、验证器和优化器，实现了从接受和拒绝尝试中进行编辑-验证-优化推理。在MLLM评估者、盲审专家评估、源保留和真实-合成可分离性探测上的实验表明，与开放的医学图像编辑器相比，该方法具有一致的改进。代码和数据已公开。

英文摘要

Text-guided medical image editing must satisfy the requested pathology while preserving anatomy, modality-specific appearance, and clinical plausibility. However, existing datasets largely supervise editors with final accepted edits and discard the failed attempts produced during generation. We argue that these failures provide essential supervision for quality control: they specify what should be rejected, why an edit is medically or visually invalid, and how the instruction should be revised. We present Med-Banana, a trajectory-supervised framework for quality-controlled medical image editing. We introduce Med-Banana-80K, a large-scale resource of success-and-failure editing trajectories with candidate images, verification outcomes, rejection reasons, and prompt refinements. Building on it, Med-Banana jointly trains an editor, verifier, and refiner, enabling edit-verify-refine inference from accepted and rejected attempts. Experiments across MLLM judges, blind expert assessment, source-preservation and real-synthetic separability probes demonstrate consistent improvements over open medical image editors. Code and data are publicly available.


2505.24528 2026-06-04 cs.CV cs.LG 版本更新

Geospatial Foundation Models to Enable Progress on Sustainable Development Goals

地理空间基础模型推动可持续发展目标的进展

Pedram Ghamisi, Weikang Yu, Xiaokang Zhang, Aldino Rizaldy, Jian Wang, Chufeng Zhou, Richard Gloaguen, Gustau Camps-Valls

发表机构 * Helmholtz-Zentrum Dresden-Rossendorf（亥姆霍兹德累斯顿-罗森多夫研究中心）； University of Iceland（冰岛大学）； Wuhan University（武汉大学）； Wuhan University of Science and Technology（武汉科技大学）； Universitat de València（瓦伦西亚大学）

AI总结本文提出SustainFM基准框架，基于17个可持续发展目标评估地理空间基础模型，发现其在多样任务中优于传统方法，并强调需从模型中心转向影响驱动部署，关注能效、泛化性和伦理。

详情

AI中文摘要

基础模型（FMs）是大规模预训练的人工智能系统，已革新自然语言处理和计算机视觉，并正在推进地理空间分析和地球观测（EO）。它们承诺在任务间改进泛化、可扩展性以及用最少标注数据高效适应。然而，尽管地理空间FMs迅速激增，其现实世界效用和与全球可持续发展目标的一致性仍未充分探索。我们提出SustainFM，一个基于17个可持续发展目标的全面基准框架，涵盖从资产财富预测到环境危害检测的极其多样化的任务。本研究提供了对地理空间FMs的严格、跨学科评估，并对其在实现可持续发展目标中的作用提供了关键见解。我们的发现表明：（1）虽然并非普遍优越，但FMs在多样任务和数据集上通常优于传统方法。（2）评估FMs应超越准确性，将可迁移性、泛化性和能效作为其负责任使用的关键标准。（3）FMs支持可扩展的、基于SDG的解决方案，为应对复杂可持续发展挑战提供广泛实用性。关键的是，我们倡导从以模型为中心的发展转向以影响驱动的部署，并强调能效、对领域变化的鲁棒性以及伦理考量等指标。

MATCH: 面向半监督组织病理学分割的多面自适应拓扑一致性

Meilong Xu, Xiaoling Hu, Shahira Abousamra, Chen Li, Chao Chen

发表机构 * Stony Brook University（石溪大学）； Massachusetts General Hospital and Harvard Medical School（麻省总医院和哈佛医学院）； Department of Biomedical Data Science, Stanford University（斯坦福大学生物医学数据科学系）

AI总结提出一种半监督分割框架MATCH，通过随机丢弃和时间训练快照生成多种扰动预测，并强制拓扑一致性来识别和保留相关拓扑特征，引入结合空间重叠与全局结构对齐的匹配策略以减少预测差异，有效降低拓扑错误，提升分割鲁棒性和准确性。

Comments 20 pages, 6 figures. Accepted by NeurIPS 2025

详情

AI中文摘要

在半监督分割中，从无标签数据中捕获有意义的语义结构至关重要。这在组织病理学图像分析中尤其具有挑战性，因为物体分布密集。为了解决这个问题，我们提出了一个半监督分割框架，旨在稳健地识别和保留相关的拓扑特征。我们的方法利用通过随机丢弃和时间训练快照获得的多种扰动预测，强制这些不同输出之间的拓扑一致性。这种一致性机制有助于将生物学有意义的结构与瞬态和噪声伪影区分开来。这个过程的一个关键挑战是在没有真实标签的情况下准确匹配预测中对应的拓扑特征。为了克服这一点，我们引入了一种新颖的匹配策略，将空间重叠与全局结构对齐相结合，最小化预测之间的差异。大量实验表明，我们的方法有效减少了拓扑错误，从而产生更稳健和准确的分割，这对于可靠的下游分析至关重要。代码可在 https://github.com/Melon-Xu/MATCH 获取。

英文摘要

In semi-supervised segmentation, capturing meaningful semantic structures from unlabeled data is essential. This is particularly challenging in histopathology image analysis, where objects are densely distributed. To address this issue, we propose a semi-supervised segmentation framework designed to robustly identify and preserve relevant topological features. Our method leverages multiple perturbed predictions obtained through stochastic dropouts and temporal training snapshots, enforcing topological consistency across these varied outputs. This consistency mechanism helps distinguish biologically meaningful structures from transient and noisy artifacts. A key challenge in this process is to accurately match the corresponding topological features across the predictions in the absence of ground truth. To overcome this, we introduce a novel matching strategy that integrates spatial overlap with global structural alignment, minimizing discrepancies among predictions. Extensive experiments demonstrate that our approach effectively reduces topological errors, resulting in more robust and accurate segmentations essential for reliable downstream analysis. Code is available at https://github.com/Melon-Xu/MATCH.


2307.00862 2026-06-04 cs.CV cs.CL 版本更新

UniFine: A Unified and Fine-grained Approach for Zero-shot Vision-Language Understanding

UniFine: 一种统一且细粒度的零样本视觉-语言理解方法

Rui Sun, Zhecan Wang, Haoxuan You, Noel Codella, Kai-Wei Chang, Shih-Fu Chang

发表机构 * Columbia University（哥伦比亚大学）； Microsoft Research（微软研究院）； University of California, Los Angeles（加州大学洛杉矶分校）

AI总结提出UniFine框架，通过利用句子关键词和图像对象等细粒度信息进行图像-文本匹配，在零样本设置下统一处理VQA、SNLI-VE和VCR等视觉-语言任务，并在多个数据集上取得显著改进。

Comments 14 pages, 4 figures, ACL 2023 Findings

详情

AI中文摘要

视觉-语言任务，如VQA、SNLI-VE和VCR，具有挑战性，因为它们需要模型的推理能力来理解视觉世界和自然语言的语义。针对视觉-语言任务的监督方法已被充分研究。然而，在零样本设置下解决这些任务的研究较少。由于对比语言-图像预训练（CLIP）在图像-文本匹配上展现了显著的零样本性能，先前的工作通过将视觉-语言任务转换为图像-文本匹配问题来利用其强大的零样本能力，并且它们主要考虑全局级别的匹配（例如，整个图像或句子）。然而，我们发现视觉和文本的细粒度信息，例如句子中的关键词和图像中的对象，对于语义理解可能相当有信息量。受此启发，我们提出了一个统一框架，利用细粒度信息进行零样本视觉-语言学习，涵盖多个任务，如VQA、SNLI-VE和VCR。我们的实验表明，我们的框架在VQA上优于先前的零样本方法，并在SNLI-VE和VCR上取得了显著改进。此外，我们的消融研究证实了我们提出的方法的有效性和泛化性。

英文摘要

Vision-language tasks, such as VQA, SNLI-VE, and VCR, are challenging because they require the model's reasoning ability to understand the semantics of the visual world and natural language. Supervised methods for vision-language tasks have been well studied. However, solving these tasks in a zero-shot setting is less explored. Since Contrastive Language-Image Pre-training (CLIP) has shown remarkable zero-shot performance on image-text matching, previous works utilized its strong zero-shot ability by converting vision-language tasks into an image-text matching problem, and they mainly consider global-level matching (e.g., the whole image or sentence). However, we find visual and textual fine-grained information, e.g., keywords in the sentence and objects in the image, can be fairly informative for semantic understanding. Inspired by this, we propose a unified framework to take advantage of the fine-grained information for zero-shot vision-language learning, covering multiple tasks such as VQA, SNLI-VE, and VCR. Our experiments show that our framework outperforms former zero-shot methods on VQA and achieves substantial improvement on SNLI-VE and VCR. Furthermore, our ablation studies confirm the effectiveness and generalizability of our proposed method.


2503.21469 2026-06-04 eess.IV cs.CV 版本更新

Embedding Compression Distortion in Video Coding for Machines

面向机器的视频编码中的嵌入压缩失真

Yuxiao Sun, Yao Zhao, Meiqin Liu, Chao Yao, Weisi Lin

发表机构 * Beijing Jiaotong University, China（北京交通大学）； University of Science and Technology Beijing, China（北京科技大学）； Nanyang Technological University, Singapore（南洋理工大学）

AI总结提出压缩失真表示嵌入（CDRE）框架，通过提取机器感知相关的失真表示并嵌入下游模型，提升压缩视频的任务性能。

详情

DOI: 10.1109/ICME59968.2025.11210034

AI中文摘要

目前，视频传输不仅服务于人类视觉系统（HVS）以供观看，还服务于机器感知以供分析。然而，现有的编解码器主要针对像素域和HVS感知指标进行优化，而非机器视觉任务的需求。为解决此问题，我们提出了一种压缩失真表示嵌入（CDRE）框架，该框架提取与机器感知相关的失真表示，并将其嵌入下游模型，从而解决压缩过程中丢失的信息并提升任务性能。具体而言，为了更好地分析与机器感知相关的失真，我们设计了一个压缩敏感提取器，用于在特征域中识别压缩退化。为了实现高效传输，引入了一个轻量级失真编解码器，将失真信息压缩为紧凑表示。随后，该表示被逐步嵌入下游模型，使其更好地了解压缩退化并提升性能。在各种编解码器和下游任务上的实验表明，我们的框架能够以最小的比特率、执行时间和参数数量开销，有效提升现有编解码器的率-任务性能。我们的代码和补充材料发布在 https://github.com/Ws-Syx/CDRE/。

英文摘要

Currently, video transmission serves not only the Human Visual System (HVS) for viewing but also machine perception for analysis. However, existing codecs are primarily optimized for pixel-domain and HVS-perception metrics rather than the needs of machine vision tasks. To address this issue, we propose a Compression Distortion Representation Embedding (CDRE) framework, which extracts machine-perception-related distortion representation and embeds it into downstream models, addressing the information lost during compression and improving task performance. Specifically, to better analyze the machine-perception-related distortion, we design a compression-sensitive extractor that identifies compression degradation in the feature domain. For efficient transmission, a lightweight distortion codec is introduced to compress the distortion information into a compact representation. Subsequently, the representation is progressively embedded into the downstream model, enabling it to be better informed about compression degradation and enhancing performance. Experiments across various codecs and downstream tasks demonstrate that our framework can effectively boost the rate-task performance of existing codecs with minimal overhead in terms of bitrate, execution time, and number of parameters. Our code and supplementary materials are released at https://github.com/Ws-Syx/CDRE/.


2503.10629 2026-06-04 cs.CV 版本更新

Hierarchical Self-Supervised Adversarial Training for Robust Vision Models in Histopathology

层次化自监督对抗训练用于组织病理学中的鲁棒视觉模型

Hashmat Shadab Malik, Shahina Kunhimon, Muzammal Naseer, Fahad Shahbaz Khan, Salman Khan

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）； Khalifa University（哈利法大学）； Linköping University（林雪平大学）； Australian National University（澳大利亚国立大学）

AI总结提出层次化自监督对抗训练（HSAT），利用组织病理图像的患者-切片-补丁层次结构进行多级对比学习，生成对抗样本并整合到对抗训练中，在OpenSRH数据集上白盒设置平均提升54.31%，黑盒设置性能下降降至3-4%。

Comments Accepted at 28th International Conference On Medical Image Computing And Computer Assisted Intervention (MICCAI 2025)

详情

AI中文摘要

对抗攻击对医疗等关键领域的视觉模型构成重大挑战，这些领域可靠性至关重要。尽管对抗训练在自然图像中已得到充分研究，但其在生物医学和显微镜数据中的应用仍然有限。现有的自监督对抗训练方法忽视了组织病理图像的层次结构，其中患者-切片-补丁关系提供了有价值的判别信号。为了解决这一问题，我们提出了层次化自监督对抗训练（HSAT），它利用这些属性通过多级对比学习生成对抗样本，并将其整合到对抗训练中以增强鲁棒性。我们在多类组织病理数据集OpenSRH上评估了HSAT，结果表明HSAT在生物医学和自然图像领域均优于现有方法。HSAT增强了鲁棒性，在白盒设置中平均提升54.31%，在黑盒设置中将性能下降降至3-4%，而基线为25-30%。这些结果为该领域的对抗训练树立了新的基准，为更鲁棒的模型铺平了道路。我们的训练和评估代码可在https://github.com/HashmatShadab/HSAT获取。

英文摘要

Adversarial attacks pose significant challenges for vision models in critical fields like healthcare, where reliability is essential. Although adversarial training has been well studied in natural images, its application to biomedical and microscopy data remains limited. Existing self-supervised adversarial training methods overlook the hierarchical structure of histopathology images, where patient-slide-patch relationships provide valuable discriminative signals. To address this, we propose Hierarchical Self-Supervised Adversarial Training (HSAT), which exploits these properties to craft adversarial examples using multi-level contrastive learning and integrates them into adversarial training for enhanced robustness. We evaluate HSAT on the multiclass histopathology dataset OpenSRH, and the results show that HSAT outperforms existing methods from both biomedical and natural image domains. HSAT enhances robustness, achieving an average gain of 54.31% in the white-box setting and reducing performance drops to 3-4% in the black-box setting, compared to 25-30% for the baseline. These results set a new benchmark for adversarial training in this domain, paving the way for more robust models. Our code for training and evaluation is available at https://github.com/HashmatShadab/HSAT.


2502.01576 2026-06-04 cs.CV 版本更新

Robust-LLaVA: On the Effectiveness of Large-Scale Robust Image Encoders for Multi-modal Large Language Models

Robust-LLaVA：大规模鲁棒图像编码器对多模态大语言模型的有效性

Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Khan, Salman Khan

发表机构 * Mohamed bin Zayed University of AI（穆罕默德·本·扎耶德人工智能大学）； Khalifa University（哈利法大学）； Michigan State University（密歇根州立大学）； Australian National University（澳大利亚国立大学）

AI总结本文提出利用大规模对抗预训练的图像分类模型替代CLIP编码器，以增强多模态大语言模型对视觉对抗扰动的鲁棒性，在无需额外对抗训练的情况下，在视觉问答、图像描述和越狱攻击任务中取得显著鲁棒性提升。

Comments Accepted at Trustworthy FMs Workshop Trust Before Use: Building Foundation Models that You Can Trust (ICCVW) 2025

详情

AI中文摘要

多模态大语言模型（MLLMs）在视觉-语言任务中表现出色，但仍然容易受到视觉对抗扰动的影响，这些扰动可能引发幻觉、操纵响应或绕过安全机制。现有方法通过在ImageNet规模的数据上对CLIP视觉编码器进行受限的对抗微调来缓解这些风险，从而保持其泛化能力。然而，这种有限的对抗训练限制了鲁棒性和更广泛的泛化。在这项工作中，我们探索了一种替代方法，即利用在大规模数据上经过对抗预训练的现有视觉分类模型。我们的分析揭示了两个主要贡献：（1）对抗预训练的广泛规模和多样性使得这些模型能够对各种对抗威胁表现出优越的鲁棒性，范围从不可察觉的扰动到高级越狱尝试，而无需额外的对抗训练；（2）将这些鲁棒模型与MLLM进行端到端集成，有助于语言组件更好地适应鲁棒视觉特征，在复杂推理任务上优于现有的即插即用方法。通过在视觉问答、图像描述和越狱攻击上的系统评估，我们证明使用这些鲁棒模型训练的MLLM在保持良好干净性能的同时，实现了优越的对抗鲁棒性。我们的框架在描述和VQA任务中分别实现了2倍和1.5倍的平均鲁棒性增益，并在越狱攻击中提供了超过10%的改进。代码和预训练模型将在https://github.com/HashmatShadab/Robust-LLaVA 提供。

英文摘要

Multi-modal Large Language Models (MLLMs) excel in vision-language tasks but remain vulnerable to visual adversarial perturbations that can induce hallucinations, manipulate responses, or bypass safety mechanisms. Existing methods seek to mitigate these risks by applying constrained adversarial fine-tuning to CLIP vision encoders on ImageNet-scale data, ensuring their generalization ability is preserved. However, this limited adversarial training restricts robustness and broader generalization. In this work, we explore an alternative approach of leveraging existing vision classification models that have been adversarially pre-trained on large-scale data. Our analysis reveals two principal contributions: (1) the extensive scale and diversity of adversarial pre-training enables these models to demonstrate superior robustness against diverse adversarial threats, ranging from imperceptible perturbations to advanced jailbreaking attempts, without requiring additional adversarial training, and (2) end-to-end MLLM integration with these robust models facilitates enhanced adaptation of language components to robust visual features, outperforming existing plug-and-play methodologies on complex reasoning tasks. Through systematic evaluation across visual question-answering, image captioning, and jail-break attacks, we demonstrate that MLLMs trained with these robust models achieve superior adversarial robustness while maintaining favorable clean performance. Our framework achieves 2x and 1.5x average robustness gains in captioning and VQA tasks, respectively, and delivers over 10% improvement against jailbreak attacks. Code and pretrained models will be available at https://github.com/HashmatShadab/Robust-LLaVA.


2412.20803 2026-06-04 cs.CV 版本更新

Scalable Event Cloud Network for Event-based Classification

可扩展事件云网络用于基于事件的分类

Hongwei Ren, Fei Ma, Xiaopeng Lin, Yuetong Fang, Hongxiang Huang, Yue Zhou, Yulong Huang, Haotian Fu, Ziyi Yang, Youxin Jiang, Xiangqian Wu, Bojun Cheng

发表机构 * Research Centre for Multimodal Artificial Intelligence（多模态人工智能研究中心）； Applications, Faculty of Computing, Harbin Institute of Technology（应用学院，哈尔滨工业大学）； MICS Thrust, Hong Kong University of Science（科学与技术大学（香港）MICS研究方向）； Guangdong Laboratory of Artificial Intelligence（广东人工智能与数字经济实验室）

AI总结提出SECNet，通过结构级极性集成和频域特征提取，解决事件云表示在空间和时间分辨率上的可扩展性问题，在十个数据集上验证了有效性。

Comments ICML2026 Oral

详情

AI中文摘要

事件相机是受生物启发的传感器，引起了工业界和学术界的广泛关注。主流方法倾向于帧和体素表示，这些方法在达到满意性能的同时，引入了耗时的转换、庞大的模型，并牺牲了细粒度的时间信息。相比之下，点云表示在解决上述弱点方面显示出潜力，但在抽象更高空间分辨率和更长时序事件的特征方面可扩展性有限。在本文中，我们提出了一种名为SECNet的可扩展网络，以利用事件云表示。SECNet通过创新的基于事件的分组和采样模块，在结构层面而非仅在输入层面集成极性。为了适应事件数量的激增，SECNet通过傅里叶变换在频域中进行特征提取。这种方法不仅显著减少了乘累加操作的爆炸，而且有效地抽象了时空特征。我们在十个基于事件的数据集上进行了大量实验，验证了SECNet的可扩展性、有效性和效率。我们的代码将在以下网址提供：https://github.com/rhwxmx/SECNet_ICML。

英文摘要

Event cameras are biologically inspired sensors garnering significant attention from both industry and academia. Mainstream methods favor frame and voxel representations, which reach a satisfactory performance while introducing time-consuming transformations, bulky models, and sacrificing fine-grained temporal information. Alternatively, Point Cloud representation demonstrates promise in addressing the mentioned weaknesses, but it has limited scalability in abstracting features of higher spatial resolution and longer temporal sequence events. In this paper, we propose a Scalable Network named SECNet to leverage Event Cloud representation. SECNet integrates polarity at the structural level by innovating the Event-based Group and Sampling module rather than only at the input level. To accommodate the surge in the number of events, SECNet embraces feature extraction in the frequency domain via the Fourier transform. This approach not only substantially curbs the explosion of Multiply Accumulate Operations but also effectively abstracts spatio-temporal features. We conducted extensive experiments on ten event-based datasets, and substantiate the scalability, effectiveness, and efficiency of SECNet. Our code will be available at: https://github.com/rhwxmx/SECNet_ICML.


2411.19758 2026-06-04 cs.CV cs.AI cs.LG 版本更新

LaVIDE: Language-Prompted Satellite Change Detection via Map-Image Alignment

LaVIDE: 通过地图-图像对齐的语言提示卫星变化检测

Shuguo Jiang, Fang Xu, Chuandong Liu, Hong Tan, Shengyang Li, Lei Yu, Wen Yang, Sen Jia, Gui-Song Xia

发表机构 * School of Computer Science, Wuhan University（武汉大学计算机学院）； School of Artificial Intelligence, Wuhan University（武汉大学人工智能学院）； Technology and Engineering Center for Space Utilization and the Key Laboratory of Space Utilization, Chinese Academy of Sciences（中国科学院空间利用技术与重点实验室）； School of Aeronautics and Astronautics, University of Chinese Academy of Sciences（中国科学院大学航空宇航学院）； School of Electronic Information, Wuhan University（武汉大学电子信息学院）； College of Computer Science and Software Engineering, Shenzhen University（深圳大学计算机科学与软件工程学院）

AI总结提出LaVIDE框架，利用受限提示学习和对象感知嵌入增强，通过语言弥合高层地图类别与低层图像细节之间的语义鸿沟，实现跨模态对齐，在多类与单类变化检测任务上分别提升IoU 18.4%和5.2%。

详情

AI中文摘要

基于地图参考和最新图像的遥感变化检测，在缺乏早期图像进行比较时，有助于及时观测地球表面。然而，高层地图类别与低层图像细节之间的语义鸿沟阻碍了提取同质特征以进行稳健的时间关联。与比较像素级视觉相似性或传播分割误差的传统方法不同，我们提出了一种新颖框架——LaVIDE（用于检测变化的语言-视觉判别器），该框架以语言为中介，弥合了高层地图类别与低层图像细节之间的语义鸿沟。具体来说，我们引入了受限提示学习来生成上下文感知的文本提示，使地图语义与图像内容对齐，并采用对象感知嵌入增强策略将对象级属性（如形状、边界）整合到地图表示中。这些组件能够在统一的语言-视觉特征空间中实现稳健的跨模态对齐。在四个基准数据集（DynamicEarthNet、HRSCD、BANDON和SECOND）上的大量实验表明，LaVIDE以显著优势超越了最先进的方法，在多类和单类变化检测任务上分别实现了18.4%和5.2%的IoU提升。我们的框架不仅提高了地图-图像变化检测的准确性，还为以最少人工干预快速更新地图提供了实用解决方案，有望在城市规划、灾害评估和生态保护等领域产生广泛影响。代码和数据集可在 https://github.com/ShuGuoJ/LAVIDE.git 获取。

英文摘要

Remote sensing change detection based on a map reference and an up-to-date image boosts timely observation of the Earth's surface when earlier images are lacking for comparison. However, the semantic gap between high-level map categories and low-level image details hinders the extraction of homogeneous features for robust temporal association in change detection. Unlike conventional approaches that either compare pixel-level visual similarity or propagate segmentation errors, we propose a novel framework, Language-VIsion Discriminator for dEtecting changes (LaVIDE), which bridges the semantic gap between high-level map categories and low-level image details using language as an intermediary. Specifically, we introduce restricted prompt learning to generate context-aware textual prompts that align map semantics with image content, and an object-aware embedding enhancement strategy to integrate object-level attributes (e.g., shape, boundary) into map representations. These components enable robust cross-modal alignment within a unified language-vision feature space. Extensive experiments on four benchmarks, DynamicEarthNet, HRSCD, BANDON, and SECOND, demonstrate that LaVIDE outperforms state-of-the-art methods by significant margins, achieving 18.4% and 5.2% improvements in IoU on multi-class and single-class change detection tasks, respectively. Our framework not only advances the accuracy of map-image change detection but also provides a practical solution for rapid map updating with minimal human intervention, promising broad impacts in urban planning, disaster assessment, and ecological conservation. Code and datasets are available at: https://github.com/ShuGuoJ/LAVIDE.git.


2406.09407 2026-06-04 cs.CV 版本更新

Towards Evaluating the Robustness of Visual State Space Models

评估视觉状态空间模型的鲁棒性

Hashmat Shadab Malik, Fahad Shamshad, Muzammal Naseer, Karthik Nandakumar, Fahad Shahbaz Khan, Salman Khan

发表机构 * Mohamed Bin Zayed University of AI（穆罕默德·本·扎耶德人工智能大学）； Center of Secure Cyber-Physical Security Systems（安全的网络物理安全系统中心）； Linköping University（林雪平大学）； Australian National University（澳大利亚国立大学）

AI总结本文全面评估了视觉状态空间模型（VSSMs）在遮挡、图像结构、常见损坏和对抗攻击等多种扰动下的鲁棒性，并与Transformer和CNN等架构进行比较，揭示了其优势和局限性。

Comments Accepted at The 5th Workshop of Adversarial Machine Learning on Computer Vision (CVPRW 2025)

详情

AI中文摘要

视觉状态空间模型（VSSMs）是一种结合了循环神经网络和潜变量模型优势的新型架构，通过有效捕捉长程依赖和建模复杂视觉动态，在视觉感知任务中表现出色。然而，它们在自然和对抗扰动下的鲁棒性仍然是一个关键问题。在这项工作中，我们全面评估了VSSMs在各种扰动场景下的鲁棒性，包括遮挡、图像结构、常见损坏和对抗攻击，并将其性能与Transformer和卷积神经网络等成熟架构进行比较。此外，我们研究了VSSMs在复杂视觉场景中针对物体-背景组合变化的鲁棒性，使用了专门设计用于测试模型性能的复杂基准。我们还使用模拟真实场景的损坏数据集评估了它们在目标检测和分割任务上的鲁棒性。为了更深入地理解VSSMs的对抗鲁棒性，我们进行了基于频率的对抗攻击分析，评估了它们对低频和高频扰动的性能。我们的发现突出了VSSMs在处理复杂视觉损坏方面的优势和局限性，为未来研究提供了宝贵的见解。我们的代码和模型将在 https://github.com/HashmatShadab/MambaRobustness 提供。

英文摘要

Vision State Space Models (VSSMs), a novel architecture that combines the strengths of recurrent neural networks and latent variable models, have demonstrated remarkable performance in visual perception tasks by efficiently capturing long-range dependencies and modeling complex visual dynamics. However, their robustness under natural and adversarial perturbations remains a critical concern. In this work, we present a comprehensive evaluation of VSSMs' robustness under various perturbation scenarios, including occlusions, image structure, common corruptions, and adversarial attacks, and compare their performance to well-established architectures such as transformers and Convolutional Neural Networks. Furthermore, we investigate the resilience of VSSMs to object-background compositional changes on sophisticated benchmarks designed to test model performance in complex visual scenes. We also assess their robustness on object detection and segmentation tasks using corrupted datasets that mimic real-world scenarios. To gain a deeper understanding of VSSMs' adversarial robustness, we conduct a frequency-based analysis of adversarial attacks, evaluating their performance against low-frequency and high-frequency perturbations. Our findings highlight the strengths and limitations of VSSMs in handling complex visual corruptions, offering valuable insights for future research. Our code and models will be available at https://github.com/HashmatShadab/MambaRobustness.


2407.13922 2026-06-04 cs.CV cs.AI cs.LG 版本更新

CounterFace: A Synthetic Face Dataset for Fine-Grained Counterfactual Evaluation of Face Recognition Systems

CounterFace: 用于人脸识别系统细粒度反事实评估的合成人脸数据集

Guruprasad Viswanathan Ramesh, Ashish Hooda, Shimaa Ahmed, Harrison J Rosenberg, Ramya Korlakai Vinayak, Kassem Fawaz

发表机构 * University of Wisconsin-Madison（威斯康星大学麦迪逊分校）； Visa Research（Visa研究）

AI总结提出CounterFace数据集，通过全自动流水线生成包含20种面部属性和8种人口统计因素的11,821个反事实人脸对，用于细粒度评估人脸识别系统在特定属性-人口统计组合下的性能退化。

Comments Code available at https://github.com/Guruprasad68/counterface_facct2026. Dataset available for non-commercial research upon request

详情

DOI: 10.1145/3805689.3812410

AI中文摘要

人脸识别系统广泛应用于关键应用，因此其在不同人群和条件下的可靠性和鲁棒性至关重要。人脸识别系统的标准评估通常依赖LFW等数据集来估计平均识别准确率。一些基准测试也捕捉了粗粒度的身份内变化，如老化、姿态和光照。然而，人脸存在更细粒度的变化，包括发型和化妆等外观变化，这些在现有基准测试中代表性不足。反事实评估提供了一种在细粒度变化下评估人脸识别鲁棒性的方法。然而，现有使用图像生成器合成的反事实人脸数据集由于在流程中使用人工验证，属性覆盖范围有限。我们提出CounterFace，一个新的反事实评估数据集，包含20种面部属性和8种人口统计因素，超过先前合成人脸数据集14种属性和2种人口统计因素。该数据集使用基于现成图像生成器和自定义验证器的全自动流水线生成，无需人工验证。CounterFace包含11,821个反事实人脸对，事后用户研究证实了生成反事实的忠实性。我们评估了两个商业和四个开源人脸识别系统（AWS Rekognition、Face++、AdaFace、MagFace、ArcFace、FaceNet）在160种属性-人口统计组合上的性能。与标准评估基准不同，我们的数据集有助于隔离单个系统的精确故障模式。结果表明，所有六个系统的性能退化因属性和人口统计而异，遮挡属性（如口罩和胡须）普遍降低性能。

英文摘要

Face recognition (FR) systems are widely deployed in critical applications, making their reliability and robustness across diverse populations and conditions essential. Standard evaluation of FR systems typically relies on datasets such as LFW to estimate average recognition accuracy. Some benchmarks also capture coarse-grained intra-identity variations such as aging, pose, and lighting. However, human faces undergo more fine-grained changes, including appearance changes such as hairstyles and makeup, that are underrepresented in existing benchmarks. Counterfactual evaluation provides a method to assess FR robustness under such fine-grained variations. Existing counterfactual face datasets synthesized with image generators, however, are limited in attribute coverage due to the use of humans for verification in the pipeline. We propose CounterFace, a new counterfactual evaluation dataset comprising 20 facial attributes and 8 demographic factors, exceeding prior synthetic face datasets by 14 attributes and 2 demographics. The dataset is generated using a fully automated pipeline based on off-the-shelf image generators with custom verifiers, removing the need for human verification. CounterFace contains 11,821 counterfactual face pairs, and a post-hoc user study confirms the faithfulness of the generated counterfactuals. We evaluate two commercial and four open-source FR systems (AWS Rekognition, Face++, AdaFace, MagFace, ArcFace, FaceNet) across 160 attribute-demographic combinations. Unlike standard evaluation benchmarks, our dataset helps isolate precise failure modes for individual systems. Results indicate that the performance degradation varies across attributes and demographics for all six systems, and occluding attributes (e.g., facemask and facial hair) universally degrade performance.
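The 160 evaluation cells follow from crossing the 20 attributes with the 8 demographic factors. A minimal sketch of that grid is shown below; the labels are placeholders, since the abstract does not list the actual attribute or demographic names.

```python
from itertools import product

# Placeholder labels (hypothetical): the real names are not given in the abstract.
attributes = [f"attribute_{i}" for i in range(20)]
demographics = [f"demographic_{j}" for j in range(8)]

# One evaluation cell per (attribute, demographic) pair.
combinations = list(product(attributes, demographics))
print(len(combinations))  # 160
```

Reporting one score per cell, rather than a single average, is what allows a failure to be traced to a specific attribute within a specific demographic group.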


2404.11309 2026-06-04 cs.CV 版本更新

Achieving Rotation-Invariant Convolution via Non-Learnable Orientation Alignment Operators

通过不可学习的朝向对齐算子实现旋转不变卷积

Hanlin Mo, Peihong Lei, You Hao, Guoying Zhao

发表机构 * University of Science and Technology of China（中国科学技术大学）

AI总结提出基于不可学习算子的旋转不变卷积（RIConvs），其参数量和计算过程与标准卷积相同，在多个视觉任务中提升准确率，尤其在数据有限时效果显著。

详情

AI中文摘要

在深度神经网络中实现旋转不变性而无需数据增强是一个研究热点。内在不变性使特征能够捕捉目标的固有属性，从而提升深度学习在视觉任务中的性能。基于多种类型的不可学习算子，本文提出了一套对任意旋转自然不变的卷积操作。与大多数先前方法不同，这些旋转不变卷积（RIConvs）具有与标准卷积相同的可学习参数数量和相似的计算过程，因此可以互换。使用MNIST-Rot数据集，我们验证了它们在不同旋转角度下的不变性，并与先前的旋转不变CNN进行了比较，其中两种基于梯度的RIConvs取得了最先进的结果。然后，我们将RIConvs与经典CNN骨干网络集成，并在纹理识别、飞机类型识别和遥感图像分类任务上进行了评估。结果表明，RIConvs显著提高了准确率，特别是在训练数据有限的情况下，并且即使在使用数据增强时也能提升性能。

英文摘要

Achieving rotational invariance in deep neural networks without data augmentation is a research hotspot. Intrinsic invariance enables features to capture targets' inherent properties, enhancing deep learning performance in visual tasks. Based on various types of non-learnable operators, this paper proposes a comprehensive set of convolution operations that are naturally invariant to arbitrary rotations. Unlike most prior methods, these rotation-invariant convolutions (RIConvs) have the same number of learnable parameters and a similar computational process as standard convolutions, making them interchangeable. Using the MNIST-Rot dataset, we validate their invariance across rotation angles and compare them with previous rotation-invariant CNNs, where two gradient-based RIConvs achieve state-of-the-art results. Then, we integrate RIConvs with classic CNN backbones and evaluate them on texture recognition, aircraft type recognition, and remote sensing image classification tasks. Results show that RIConvs significantly improve accuracy, particularly with limited training data, and enhance performance even with data augmentation.
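The general idea of aligning a patch with a fixed, non-learnable operator before applying an ordinary kernel can be illustrated with a toy example. The sketch below is not the paper's operator: it is restricted to 90-degree rotations of a single 3x3 patch, uses a simple horizontal-gradient score chosen for illustration, and all function names are hypothetical.

```python
def rot90(p):
    """Rotate a square patch 90 degrees counterclockwise."""
    n = len(p)
    return [[p[j][n - 1 - i] for j in range(n)] for i in range(n)]


def horizontal_gradient(p):
    """Fixed (non-learnable) score: right-column sum minus left-column sum."""
    return sum(row[-1] for row in p) - sum(row[0] for row in p)


def align(p):
    """Return the 90-degree rotation of p with the largest horizontal gradient.

    Ties would break the invariance; this toy example assumes a unique maximum.
    """
    candidates = [p]
    for _ in range(3):
        candidates.append(rot90(candidates[-1]))
    return max(candidates, key=horizontal_gradient)


def ri_response(p, kernel):
    """Apply an ordinary kernel to the orientation-aligned patch."""
    a = align(p)
    n = len(a)
    return sum(a[i][j] * kernel[i][j] for i in range(n) for j in range(n))


patch = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
kernel = [[1, 0, -1], [2, 1, 0], [0, 3, -2]]  # arbitrary, not rotation-symmetric

responses = []
q = patch
for _ in range(4):
    responses.append(ri_response(q, kernel))
    q = rot90(q)

print(len(set(responses)) == 1)  # True
```

Because the alignment step has no parameters, the kernel keeps the same number of learnable weights as a standard convolution, which matches the interchangeability the abstract describes.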


1803.03724 2026-06-04 math.DG cs.CG cs.CV cs.NA math.AP math.NA 版本更新

Contour Parametrization via Anisotropic Mean Curvature Flows

通过各向异性平均曲率流进行轮廓参数化

P. Suárez-Serrato, E. I. Velázquez Richards

发表机构 * Department of Mathematics, University of California, Santa Barbara, on leave from Instituto de Matemáticas, Universidad Nacional Autónoma de México, Mexico City

AI总结本文提出了一种新的各向异性平均曲率流实现，用于轮廓识别。通过将平面闭合光滑曲线的平均曲率流与外部场相结合，该方法利用点电荷势场约束曲线运动，从而实现轮廓的参数化。

Comments 30 pages, 20 images; source code for our numerical implementation is available at https://github.com/V3du4rd0/AMCF

2402.02555 2026-06-04 cs.CV cs.CL 版本更新

High-Quality Entity Segmentation and Grounding

高质量实体分割与定位

Lu Qi, Yi-Wen Chen, Tao Zhang, Xiangtai Li, Xu Yang, Bo Du, Ming-Hsuan Yang

Affiliations: Wuhan University; Insta360 Research; Department of EECS, University of California, Merced; Nanyang Technological University; Institute of Automation of the Chinese Academy of Sciences

AI summary: Proposes the ESG pipeline, which achieves high-quality entity segmentation and grounding through a new dataset, EntitySeg, and a two-stage decoupled design (CropFormer for high-quality segmentation plus GELLA for accurate noun extraction and semantic matching), and is shown to be effective on five tasks.

详情

AI中文摘要

在这项工作中，我们提出了ESG，一个由新数据集EntitySeg支持的高质量实体分割与定位流水线。首先，所提出的数据集命名为EntitySeg，包含跨越各种图像域和实体的图像，以及用于训练和测试的大量高分辨率图像和高质量掩码标注。然后，ESG主要由两个模块组成：用于高质量实体分割的CropFormer，以及用于从句子中精确提取名词并在语言和视觉区域之间进行语义匹配的GELLA。与现有联合训练分割和大语言模型的定位方法不同，ESG采用两阶段解耦设计，保留了高质量掩码和定位鲁棒性，避免了联合训练通常带来的权衡。CropFormer确保高质量实体分割结果，然后可以编码到GELLA模型中进行有效定位。大量实验结果表明，我们提出的流水线在五项任务上有效，包括实体分割、全景分割、开放词汇分割、指代分割和全景定位叙述。此外，ESG流水线的GELLA模块高度灵活，能够处理来自任何分割框架的掩码输入，这得益于其轻量级的颜色图/视觉编码器、语言/掩码解码器和关联模块。实体分割数据集和定位代码将在https://github.com/qqlu/Entity发布。

英文摘要

In this work, we propose ESG, a pipeline for high-quality entity segmentation and grounding supported by a new dataset, EntitySeg. First, the proposed EntitySeg dataset contains images spanning various image domains and entities, along with plentiful high-resolution images and high-quality mask annotations for training and testing. Second, ESG mainly consists of two modules: CropFormer for high-quality entity segmentation, and GELLA for accurate noun extraction from sentences and semantic matching between language and visual regions. Unlike existing grounding methods that jointly train a segmentation model and a large language model, ESG adopts a two-stage decoupled design, preserving high-quality masks and grounding robustness without the trade-offs often introduced by joint training. CropFormer ensures high-quality entity segmentation results, which can then be encoded into the GELLA model for effective grounding. Extensive experimental results demonstrate the effectiveness of our proposed pipeline across five tasks, including entity segmentation, panoptic segmentation, open-vocabulary segmentation, referring segmentation, and panoptic localized narratives. Furthermore, the GELLA module of the ESG pipeline is highly flexible and capable of processing mask inputs from any segmentation framework, thanks to its lightweight colormap/vision encoder, language/mask decoder, and association module. The entity segmentation dataset and grounding code will be released at https://github.com/qqlu/Entity.


1410.6333 2026-06-04 cs.CV cs.NA math.NA 版本更新

A Regularization Approach to Blind Deblurring and Denoising of QR Barcodes

一种正则化方法用于QR条形码的盲去模糊和去噪

Yves van Gennip, Prashant Athavale, Jérôme Gilles, Rustum Choksi

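Affiliations: Fields Institute, University of Toronto

AI summary: This paper proposes a purely regularization-based method for blind deblurring and denoising of QR barcodes in the presence of noise, making use of the known required patterns of QR codes and of open-source barcode readers.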

Comments 14 pages, 19 figures (with a total of 57 subfigures), 1 table; v3: previously missing reference [35] added

1803.00638 2026-06-04 math.NA cs.CV cs.NA 版本更新

Fast and accurate computation of orthogonal moments for texture analysis

快速且准确的正交矩计算用于纹理分析

C. Di Ruberto, L. Putzu, G. Rodriguez

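Affiliations: Department of Mathematics and Computer Science, University of Cagliari; Department of Electrical and Electronic Engineering, University of Cagliari

AI summary: This paper proposes a fast and stable algorithm for computing the orthogonal moments of an image. It optimizes a Matlab implementation based on recurrence relations to improve computational efficiency and reconstruction accuracy, and in texture analysis the resulting moments in some cases outperform popular descriptors.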

Comments 29 pages, 9 figures

详情

DOI: 10.1016/j.patcog.2018.06.012
Journal ref: Pattern Recognit. 83 (2018) 498-510

AI中文摘要

在本文中，我们描述了一种快速且稳定的算法，用于计算图像的正交矩。正交矩具有高区分能力，但某些公式形式计算复杂度较高，限制了实时应用。本文详细描述了基于递推关系的方法，并提出了一种优化的Matlab实现，旨在解决上述限制，并向社区提供高效易用的软件。在实验中，我们评估了递推公式的有效性及其在重建任务中的性能，与文献中常用的闭式表示进行比较。结果表明，计算复杂度显著降低，同时重建精度更高。为了评估和比较计算矩在纹理分析中的准确性，我们在六个著名的纹理图像数据库上进行了分类实验。再次，递推公式在分类任务中优于闭式表示。更重要的是，如果使用所提出的稳定过程从图像的GLCM计算正交矩，则在某些情况下，正交矩优于一些最流行的纹理分类状态-of-the-art描述符。

英文摘要

In this work we describe a fast and stable algorithm for the computation of the orthogonal moments of an image. Orthogonal moments have high discriminative power, but some of their possible formulations have a large computational complexity, which limits their real-time application. This paper describes in detail an approach based on recurrence relations, and proposes an optimized Matlab implementation of the corresponding computational procedure, aiming to overcome the above limitations and to put efficient and easy-to-use software at the community's disposal. In our experiments we evaluate the effectiveness of the recurrence formulation, as well as its performance for the reconstruction task, in comparison to the closed form representation often used in the literature. The results show a significant reduction in computational complexity, together with greater accuracy in reconstruction. In order to assess and compare the accuracy of the computed moments in texture analysis, we perform classification experiments on six well-known databases of texture images. Again, the recurrence formulation performs better in classification than the closed form representation. More importantly, if computed from the GLCM of the image using the proposed stable procedure, the orthogonal moments outperform in some situations some of the most widely used state-of-the-art descriptors for texture classification.
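
To illustrate why a recurrence formulation can be cheaper and more stable than a closed-form sum, here is a minimal sketch that evaluates Legendre polynomials with Bonnet's three-term recurrence. Legendre polynomials are only an assumed example of an orthogonal basis (the abstract does not name the moment families it covers), and `legendre_values` is a hypothetical helper, not part of the authors' Matlab software.

```python
def legendre_values(max_order, x):
    """Return [P_0(x), ..., P_max_order(x)] using Bonnet's recurrence:
    (n + 1) * P_{n+1}(x) = (2n + 1) * x * P_n(x) - n * P_{n-1}(x).
    """
    values = [1.0]                 # P_0(x) = 1
    if max_order >= 1:
        values.append(float(x))    # P_1(x) = x
    for n in range(1, max_order):
        nxt = ((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1)
        values.append(nxt)
    return values


# Each new order costs a constant number of operations, so all orders
# up to max_order are obtained in O(max_order) time per sample point.
p = legendre_values(3, 0.5)
print(p)
```

Image moments would then be obtained by summing pixel values weighted by products of such basis values along each axis; that step is omitted here.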


1601.03094 2026-06-04 cs.CV cs.SY eess.SY math.OC 版本更新

A metric for sets of trajectories that is practical and mathematically consistent

一种用于轨迹集的度量标准，具有实用性和数学一致性

José Bento, Jia Jie Zhu

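AI summary: This paper proposes a new metric for sets of trajectories that addresses two problems: existing mathematically consistent metrics are hard to compute, and practical approximate measures are inconsistent. The proposed metric can be computed quickly, handles confusion of trajectory identities optimally, and is a valid metric in the mathematical sense.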

Comments Submitted to IEEE Transactions on Signal Processing

详情

AI中文摘要

在计算机视觉、机器学习、机器人学和通用人工智能领域，对轨迹集空间的度量至关重要。然而，现有的轨迹集接近性概念要么在数学上不一致，要么在实际应用中有限。在本文中，我们指出现有数学一致度量的局限性，这些度量基于OSPA（Schuhmacher等人，2008）；以及实践中使用的启发式接近性概念，其主要思想与广泛用于计算机视觉的CLEAR MOT度量（Keni和Rainer，2008）相似。通过两步方法，我们提出了一个新的直观度量标准，以解决这些局限性。首先，我们解释了一种导致难以计算的度量解决方案。然后，我们修改此公式，以获得一个易于计算但保留先前度量有用属性的度量。我们的接近性概念是第一个展示以下三个特征的度量：1）可以快速计算，2）以最优方式整合轨迹身份的混淆，3）在数学意义上是一个度量。

英文摘要

Metrics on the space of sets of trajectories are important for scientists in the field of computer vision, machine learning, robotics, and general artificial intelligence. However, existing notions of closeness between sets of trajectories are either mathematically inconsistent or of limited practical use. In this paper, we outline the limitations in the current mathematically-consistent metrics, which are based on OSPA (Schuhmacher et al. 2008); and the inconsistencies in the heuristic notions of closeness used in practice, whose main ideas are common to the CLEAR MOT measures (Keni and Rainer 2008) widely used in computer vision. In two steps, we then propose a new intuitive metric between sets of trajectories and address these limitations. First, we explain a solution that leads to a metric that is hard to compute. Then we modify this formulation to obtain a metric that is easy to compute while keeping the useful properties of the previous metric. Our notion of closeness is the first demonstrating the following three features: the metric 1) can be quickly computed, 2) incorporates confusion of trajectories' identity in an optimal way, and 3) is a metric in the mathematical sense.
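
For orientation, the OSPA distance mentioned above can be sketched for two finite sets of 1-D points at a single time step. This is only a brute-force illustration of the standard OSPA definition with cutoff `c` and order `p`; it is not the trajectory metric proposed in the paper, and the function name and default values are assumptions made for the example.

```python
from itertools import permutations


def ospa(xs, ys, c=2.0, p=1.0):
    """OSPA distance between two finite sets of 1-D points.

    Distances are cut off at c, the best assignment of the smaller set
    into the larger one is found by brute force, and every unassigned
    point is charged the full cutoff c.
    """
    if len(xs) > len(ys):
        xs, ys = ys, xs            # make xs the smaller set
    m, n = len(xs), len(ys)
    if n == 0:
        return 0.0                 # both sets are empty
    best = min(
        sum(min(abs(x - ys[j]), c) ** p for x, j in zip(xs, perm))
        for perm in permutations(range(n), m)
    )
    return ((best + (c ** p) * (n - m)) / n) ** (1.0 / p)


print(ospa([0.0, 10.0], [0.5, 10.0]))  # two close pairs
print(ospa([], [1.0]))                 # one missed point costs c
```

The brute-force search is exponential in the set size; a practical implementation would use an assignment solver such as the Hungarian algorithm.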


1812.03434 2026-06-04 cs.CG cs.CV cs.NA math.NA physics.med-ph q-bio.QM 版本更新

Area-preserving mapping of 3D ultrasound carotid artery images using density-equalizing reference map

利用密度相等参考图进行3D超声颈动脉图像的面积保持映射

Gary P. T. Choi, Bernard Chiu, Chris H. Rycroft

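Affiliations: Mathematics Group, Lawrence Berkeley National Laboratory

AI summary: This paper proposes a new density-equalizing reference map (DERM) method for mapping 3D carotid surfaces onto a standardized 2D carotid template. It preserves local geometry by minimizing local area distortion, which improves quantitative monitoring and comparison of the vessel-wall-plus-plaque thickness (VWT).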

详情

DOI: 10.1109/TBME.2019.2963783
Journal ref: IEEE Transactions on Biomedical Engineering, 67(9), 1507-1517 (2020)

AI中文摘要

颈动脉动脉粥样斑块是一种发生在颈动脉分叉处的局部疾病。为了定量监测血管壁加斑块厚度（VWT）的局部变化，并比较不同患者或同一患者在不同超声扫描会话中的VWT分布，需要一种映射技术来调整不同颈动脉模型的几何变化。在本工作中，我们提出了一种新的方法，称为密度相等参考图（DERM），用于将3D颈动脉表面映射到标准化的2D颈动脉模板，重点是通过最小化局部面积变形来保持颈动脉表面的局部几何结构。初始映射是由之前描述的弧长缩放（ALS）映射方法生成的，该方法将3D颈动脉表面投影到2D非凸L形域。随后通过变形ALS映射，利用所提出的结合密度相等映射和参考映射技术的算法，构建出平滑且面积保持的扁平化映射。这种结合使首次实现了将3D表面映射到标准化的非凸平面域的1:1映射，且以面积保持的方式。使用20个颈动脉表面模型的评估显示，与ALS映射方法相比，所提出的方法将扁平化映射的面积变形减少了超过80%。

英文摘要

Carotid atherosclerosis is a focal disease at the bifurcations of the carotid artery. To quantitatively monitor the local changes in the vessel-wall-plus-plaque thickness (VWT) and compare the VWT distributions for different patients or for the same patients at different ultrasound scanning sessions, a mapping technique is required to adjust for the geometric variability of different carotid artery models. In this work, we propose a novel method called density-equalizing reference map (DERM) for mapping 3D carotid surfaces to a standardized 2D carotid template, with an emphasis on preserving the local geometry of the carotid surface by minimizing the local area distortion. The initial map was generated by a previously described arc-length scaling (ALS) mapping method, which projects a 3D carotid surface onto a 2D non-convex L-shaped domain. A smooth and area-preserving flattened map was subsequently constructed by deforming the ALS map using the proposed algorithm that combines the density-equalizing map and the reference map techniques. This combination allows, for the first time, one-to-one mapping from a 3D surface to a standardized non-convex planar domain in an area-preserving manner. Evaluations using 20 carotid surface models show that the proposed method reduced the area distortion of the flattening maps by over 80% as compared to the ALS mapping method.


1605.01177 2026-06-04 cs.CV cs.SY eess.SY 版本更新

A metric on the space of finite sets of trajectories for evaluation of multi-target tracking algorithms

有限轨迹集合空间上的度量用于多目标跟踪算法评估

Ángel F. García-Fernández, Abu Sajana Rahmathullah, Lennart Svensson

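Affiliations: Zenuity AB

AI summary: This paper proposes a metric on the space of finite sets of trajectories for evaluating multi-target tracking algorithms in a mathematically sound way. The metric compares trajectory estimates from different algorithms with the ground-truth trajectories and includes intuitive costs for localization error, missed and false targets, and track switches. Its computation is based on solving a multi-dimensional assignment problem. The paper also proposes a lower bound that is itself a metric on sets of trajectories and is computable in polynomial time by linear programming, and extends the metric to random finite sets of trajectories.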

Comments Matlab code for the metric is available at https://github.com/Agarciafernandez/MTT

详情

DOI: 10.1109/TSP.2020.3005309
Journal ref: in IEEE Transactions on Signal Processing, vol. 68, pp. 3917-3928, 2020

AI中文摘要

在本文中，我们提出了一种度量，用于以数学严谨的方式评估多目标跟踪算法。该度量的主要用途是将不同算法对轨迹的估计与真实轨迹进行比较。所提出的度量包括与每个时间步长正确检测目标、漏检和误检以及轨迹切换相关的直观成本。度量计算基于解决多维分配问题。我们还提出了该度量的下界，该下界也是轨迹集合的度量，并可通过线性规划在多项式时间内计算。此外，我们还扩展了该度量到随机有限轨迹集合。

英文摘要

In this paper, we propose a metric on the space of finite sets of trajectories for assessing multi-target tracking algorithms in a mathematically sound way. The main use of the metric is to compare estimates of trajectories from different algorithms with the ground truth of trajectories. The proposed metric includes intuitive costs associated with localization error for properly detected targets, missed and false targets, and track switches at each time step. The metric computation is based on solving a multi-dimensional assignment problem. We also propose a lower bound for the metric, which is also a metric for sets of trajectories and is computable in polynomial time using linear programming. We also extend the proposed metrics on sets of trajectories to random finite sets of trajectories.


1902.09135 2026-06-04 math.NA cs.CV cs.NA eess.IV 版本更新

A Dual Symmetric Gauss-Seidel Alternating Direction Method of Multipliers for Hyperspectral Sparse Unmixing

一种双对称Gauss-Seidel交替方向乘子法用于超光谱稀疏解混

Longfei Ren, Chengjing Wang, Peipei Tang, Zheng Ma

Affiliations: School of Information Science and Technology, and the Provincial Key Lab of Information Coding and Transmission, Southwest Jiaotong University; School of Mathematics, Southwest Jiaotong University; School of Computer and Computing Science, Zhejiang University City College

AI summary: This paper proposes an efficient dual symmetric Gauss-Seidel alternating direction method of multipliers (sGS-ADMM) for hyperspectral sparse unmixing with total variation regularization. It addresses the computational cost and convergence shortcomings of the standard ADMM, and experiments indicate better unmixing efficiency and image quality.

Comments 30 pages, 6 figures

详情

AI中文摘要

由于稀疏解混已成为超光谱解混的有前景的方法，最近一些空间上下文信息已被用来提高解混性能。总变分（TV）已被广泛用于促进空间均匀性和相邻像素之间的平滑性。然而，带有TV正则项的超光谱稀疏解混的计算任务很重。此外，对于带有TV正则项的超光谱稀疏解混的原始交替方向乘子法（ADMM）的收敛性尚未详细解释。在本文中，我们设计了一种高效的、收敛的双对称Gauss-Seidel ADMM（sGS-ADMM）用于带有TV正则项的超光谱稀疏解混。我们还对这种算法进行了全局收敛性和局部线性收敛率的分析。如数值实验所示，我们的算法在解混效率上明显优于最先进的算法。更重要的是，我们能够获得质量更高的图像。

英文摘要

Since sparse unmixing emerged as a promising approach to hyperspectral unmixing, spatial-contextual information in hyperspectral images has recently been exploited to improve unmixing performance. The total variation (TV) has been widely used to promote spatial homogeneity as well as smoothness between adjacent pixels. However, hyperspectral sparse unmixing with a TV regularization term is computationally heavy. Moreover, the convergence of the primal alternating direction method of multipliers (ADMM) for hyperspectral sparse unmixing with a TV regularization term has not been explained in detail. In this paper, we design an efficient and convergent dual symmetric Gauss-Seidel ADMM (sGS-ADMM) for hyperspectral sparse unmixing with a TV regularization term. We also present the global convergence and local linear convergence rate analysis for this algorithm. As demonstrated in numerical experiments, our algorithm clearly improves the efficiency of unmixing compared with the state-of-the-art algorithm. More importantly, we can obtain images with higher quality.


1407.0221 2026-06-04 cs.CV cs.NA math.NA 版本更新

Imaging with Kantorovich-Rubinstein discrepancy

基于Kantorovich-Rubinstein偏差的成像

Jan Lellmann, Dirk A. Lorenz, Carola Schönlieb, Tuomo Valkonen

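Affiliations: Department for Applied Mathematics and Theoretical Physics, University of Cambridge; Institute for Analysis and Algebra, TU Braunschweig; Center for Mathematical Modeling (Modemat), EPN Quito

AI summary: This paper applies the Kantorovich-Rubinstein norm from optimal transport to imaging problems. It proposes a variational regularization model that combines a Kantorovich-Rubinstein discrepancy term with total variation regularization for image denoising and cartoon-texture decomposition, relates the model to other methods, and shows that the optimization problem can be recast as a convex-concave saddle point problem and solved with standard tools.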

1801.03800 2026-06-04 cs.CV cs.NA math.NA 版本更新

Cortical-inspired image reconstruction via sub-Riemannian geometry and hypoelliptic diffusion

基于子黎曼几何和亚椭圆扩散的皮层启发式图像重建

Ugo Boscain, Roman Chertovskih, Jean-Paul Gauthier, Dario Prandi, Alexey Remizov

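Affiliations: CNRS, LJLL, Université Pierre et Marie Curie, Paris, France; SYSTEC, FEUP, University of Porto, Portugal

AI summary: Based on a mathematical model of the primary visual cortex, this paper presents image inpainting algorithms that use hypoelliptic diffusion. One algorithm does not use information about where the image is corrupted, while the other does; the latter reaches the state of the art in image inpainting, and the results are taken as evidence that the first algorithm reflects what the visual cortex encodes.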

1712.05870 2026-06-04 math.NA cs.CV cs.NA 版本更新

Multi-dimensional imaging data recovery via minimizing the partial sum of tubal nuclear norm

通过最小化管核范数的偏和进行多维成像数据恢复

Tai-Xiang Jiang, Ting-Zhu Huang, Xi-Le Zhao, Liang-Jian Deng

Affiliations: School of Mathematical Sciences/Research Center for Image and Vision Computing; University of Electronic Science and Technology of China; FinTech Innovation Center; Financial Intelligence and Financial Engineering Research Key Laboratory of Sichuan province; School of Economic Information Engineering; Southwestern University of Finance and Economics

AI summary: This paper studies tensor recovery within the tensor singular value decomposition (t-SVD) framework. It proposes the partial sum of the tubal nuclear norm (PSTNN) as a surrogate for the tensor tubal rank, builds two PSTNN-based minimization models for tensor completion and tensor principal component analysis, solves them with the alternating direction method of multipliers (ADMM), and verifies the advantages of PSTNN on synthetic and real data.

1904.10379 2026-06-04 cs.GR cs.CV cs.NA math.NA 版本更新

Multi-modal 3D Shape Reconstruction Under Calibration Uncertainty using Parametric Level Set Methods

在校准不确定性下利用参数水平集方法进行多模态3D形状重建

Moshe Eliasof, Andrei Sharf, Eran Treister

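Affiliations: Computer Science Department, Ben-Gurion University of the Negev

AI summary: This paper proposes a parametric level set method for reconstructing 3D shapes from multi-modal data under calibration uncertainty. The method handles different data modalities, such as sparse point sets, volumetric slices, and 2D photos, and offers compact representation of complex objects and robustness to calibration noise.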

详情

AI中文摘要

我们考虑了在存在不确定校准参数的情况下，从多模态数据中重建3D形状的问题。通常，3D数据模态可以是多种多样的形式，例如稀疏点集、体素切片、2D照片等。为了联合处理这些数据模态，我们利用了一种参数化水平集方法，该方法使用椭球径向基函数。这种方法不仅允许我们以解析且紧凑的方式表示物体，还赋予我们克服源自不准确获取参数的校准相关噪声的能力。这种本质上隐式的正则化导致了高度鲁棒且可扩展的重建，超越了其他传统方法。在我们的结果中，我们首先展示了该方法对复杂物体进行紧凑表示的能力。然后我们展示了我们的重建方法在少量测量和获取参数中的噪声方面都具有鲁棒性。最后，我们展示了从不同模态，如通过液体位移获得的体素切片（类似于CT扫描和X射线）以及从形状轮廓获得的视觉测量中，我们的重建能力。

英文摘要

We consider the problem of 3D shape reconstruction from multi-modal data, given uncertain calibration parameters. Typically, 3D data modalities can be in diverse forms such as sparse point sets, volumetric slices, 2D photos and so on. To jointly process these data modalities, we exploit a parametric level set method that utilizes ellipsoidal radial basis functions. This method not only allows us to represent the object analytically and compactly, it also lets us overcome calibration-related noise that originates from inaccurate acquisition parameters. This essentially implicit regularization leads to a highly robust and scalable reconstruction, surpassing other traditional methods. In our results we first demonstrate the ability of the method to compactly represent complex objects. We then show that our reconstruction method is robust both to a small number of measurements and to noise in the acquisition parameters. Finally, we demonstrate our reconstruction abilities from diverse modalities such as volume slices obtained from liquid displacement (similar to CT scans and X-rays), and visual measurements obtained from shape silhouettes.


1605.06311 2026-06-04 stat.CO cs.CV cs.SY eess.SY 版本更新

Poisson multi-Bernoulli conjugate prior for multiple extended object filtering

泊松多伯努利共轭先验用于多扩展目标滤波

Karl Granstrom, Maryam Fatemi, Lennart Svensson

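Affiliations: Department of Signals and Systems, Chalmers University of Technology; Zenuity

AI summary: This paper presents a Poisson multi-Bernoulli mixture (PMBM) conjugate prior for multiple extended object filtering. A Poisson point process describes the existence of targets not yet detected, and a multi-Bernoulli mixture describes the distribution of detected targets. Because prediction and update both preserve the PMBM form, the density is a conjugate prior, but unknown data associations produce an intractably large number of terms, so approximations are needed. The paper gives a gamma Gaussian inverse Wishart implementation with methods for handling data association; simulations compare favorably with the extended target d-GLMB and LMB filters, and a Lidar experiment illustrates the benefit of tracking both detected and undetected targets.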

详情

DOI: 10.1109/TAES.2019.2920220

AI中文摘要

本文提出了一种用于多扩展目标滤波的泊松多伯努利混合（PMBM）共轭先验。通过泊松点过程描述尚未检测到的目标存在，而多伯努利混合描述已检测到的目标分布。预测和更新方程针对标准转移密度和测量似然性进行推导。预测和更新均保持密度的PMBM形式，因此PMBM密度是一种共轭先验。然而，未知的数据关联导致PMBM密度中出现难以处理的大量项，因此需要近似方法。本文给出了伽马高斯逆 Wishart 实现，并提供了处理数据关联问题的方法。模拟研究显示，扩展目标PMBM滤波器在与扩展目标d-GLMB和LMB滤波器的比较中表现良好。使用激光雷达数据的实验展示了同时跟踪已检测和未检测目标的优势。

英文摘要

This paper presents a Poisson multi-Bernoulli mixture (PMBM) conjugate prior for multiple extended object filtering. A Poisson point process is used to describe the existence of yet undetected targets, while a multi-Bernoulli mixture describes the distribution of the targets that have been detected. The prediction and update equations are presented for the standard transition density and measurement likelihood. Both the prediction and the update preserve the PMBM form of the density, and in this sense the PMBM density is a conjugate prior. However, the unknown data associations lead to an intractably large number of terms in the PMBM density, and approximations are necessary for tractability. A gamma Gaussian inverse Wishart implementation is presented, along with methods to handle the data association problem. A simulation study shows that the extended target PMBM filter performs well in comparison to the extended target d-GLMB and LMB filters. An experiment with Lidar data illustrates the benefit of tracking both detected and undetected targets.


1903.10604 2026-06-04 cs.CV cs.SY eess.SY 版本更新

An Approach for Adaptive Automatic Threat Recognition Within 3D Computed Tomography Images for Baggage Security Screening

一种基于3D计算层析成像的自适应自动威胁识别方法用于行李安全检查

Qian Wang, Khalid N. Ismail, Toby P. Breckon

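Affiliations: Department of Computer Science, Durham University, United Kingdom; Department of Engineering, Durham University, United Kingdom

AI summary: This paper proposes an adaptive automatic threat recognition method based on 3D X-ray computed tomography, aimed at rapidly evolving threat signatures. It combines a multi-scale 3D CT image segmentation algorithm, a multi-class support vector machine classifier, and an adaptation strategy, and reports a probability of detection around 90% with a probability of false alarm below 20%.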

Comments Technical Report, Durham University

详情

AI中文摘要

使用X射线扫描器对行李进行安全检查已成为航空安全的常规操作，自动威胁检测方法基于3D X射线计算层析成像（CT）图像，称为自动威胁识别（ATR）。当前策略使用预定义的威胁材料签名，而非适应于新出现的威胁签名。为了解决这个问题，先前的工作提出了自适应自动威胁识别（AATR）的概念。本文提出了一种基于X射线CT行李扫描图像的解决方案。该方法旨在解决筛查要求中快速演变的威胁特征问题。理想情况下，部署在安全扫描仪中的检测算法应能快速适应不同情况，具有不同的威胁特征要求（例如，威胁材料、物体的物理属性）。我们通过一种新颖的自适应机器学习方法来解决这个问题，该解决方案包括一个多尺度3D CT图像分割算法、一个多类支持向量机（SVM）分类器用于物体材料识别以及一种使方法适应的策略。实验在开放和封闭的3D CT行李图像数据集上进行，这些数据集专门用于AATR研究。我们提出的方法在识别和适应性方面表现良好。总体而言，我们的方法可以实现约90%的检测概率和低于20%的误报率。我们的AATR展示了适应不同种类材料的能力，甚至包括训练数据中未出现的未知材料，适应不同所需的检测概率以及适应不同规模的威胁物体。

英文摘要

The screening of baggage using X-ray scanners is now routine in aviation security with automatic threat detection approaches, based on 3D X-ray computed tomography (CT) images, known as Automatic Threat Recognition (ATR) within the aviation security industry. These current strategies use pre-defined threat material signatures in contrast to adaptability towards new and emerging threat signatures. To address this issue, the concept of adaptive automatic threat recognition (AATR) was proposed in previous work. In this paper, we present a solution to AATR based on such X-ray CT baggage scan imagery. This aims to address the issues of rapidly evolving threat signatures within the screening requirements. Ideally, the detection algorithms deployed within the security scanners should be readily adaptable to different situations with varying requirements of threat characteristics (e.g., threat material, physical properties of objects). We tackle this issue using a novel adaptive machine learning methodology with our solution consisting of a multi-scale 3D CT image segmentation algorithm, a multi-class support vector machine (SVM) classifier for object material recognition and a strategy to enable the adaptability of our approach. Experiments are conducted on both open and sequestered 3D CT baggage image datasets specifically collected for the AATR study. Our proposed approach performs well on both recognition and adaptation. Overall our approach can achieve the probability of detection around 90% with a probability of false alarm below 20%. Our AATR shows the capabilities of adapting to varying types of materials, even the unknown materials which are not available in the training data, adapting to varying required probability of detection and adapting to varying scales of the threat object.


1904.03537 2026-06-04 math.OC cs.CV cs.LG cs.NA math.NA 版本更新

Convex-Concave Backtracking for Inertial Bregman Proximal Gradient Algorithms in Non-Convex Optimization

凸凹回溯法用于非凸优化中的惯性Bregman近似梯度算法

Mahesh Chandra Mukkamala, Peter Ochs, Thomas Pock, Shoham Sabach

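Affiliations: Faculty of Mathematics and Computer Science, Saarland University; Institute of Computer Graphics and Vision, Graz University of Technology; Faculty of Industrial Engineering, The Technion

AI summary: This paper proposes a convex-concave backtracking method for inertial Bregman proximal gradient algorithms in non-convex optimization. By locally finding a convex upper bound and a concave lower bound of the objective, it selects the step size and the extrapolation parameter adaptively, and the algorithms are proved to converge globally to a critical point.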

Comments 29 pages

详情

AI中文摘要

回溯线搜索是一种古老而强大的策略，用于在近似梯度算法中寻找更好的步长。其主要原理是局部寻找目标函数的简单凸上界，从而控制使用的步长。在惯性近似梯度算法中，情况变得更加复杂，通常导致对外推参数的非常严格的限制。在本文中，我们展示通过局部寻找目标函数的简单凹下界，可以控制外推参数。这导致了一种双凸凹回溯过程，允许自适应地选择步长和外推参数。我们将此过程应用于惯性Bregman近似梯度方法的类别，并证明由这些算法生成的任何序列都全局收敛到函数的临界点。在图像处理和机器学习中的多个具有挑战性的非凸问题上的数值实验显示，结合惯性步和双回溯策略能够实现性能的提升。

英文摘要

Backtracking line-search is an old yet powerful strategy for finding better step sizes to be used in proximal gradient algorithms. The main principle is to locally find a simple convex upper bound of the objective function, which in turn controls the step size that is used. In the case of inertial proximal gradient algorithms, the situation becomes much more difficult and usually leads to very restrictive rules on the extrapolation parameter. In this paper, we show that the extrapolation parameter can be controlled by also locally finding a simple concave lower bound of the objective function. This gives rise to a double convex-concave backtracking procedure which allows for an adaptive choice of both the step size and extrapolation parameters. We apply this procedure to the class of inertial Bregman proximal gradient methods, and prove that any sequence generated by these algorithms converges globally to a critical point of the function at hand. Numerical experiments on a number of challenging non-convex problems in image processing and machine learning show the power of combining the inertial step and the double backtracking strategy in achieving improved performance.
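
The first principle described above (accept a step only when a quadratic upper bound of the smooth term holds) can be sketched for the plain Euclidean proximal gradient method. This is a simplified illustration under stated assumptions: no inertia, no Bregman distance, and no concave lower bound, so it is not the paper's double backtracking procedure. The toy objective 0.5*||x - b||^2 + lam*||x||_1 and all names are chosen for the example.

```python
def soft_threshold(z, tau):
    """Proximal map of tau * ||.||_1, applied element-wise."""
    return [max(abs(zi) - tau, 0.0) * (1.0 if zi > 0 else -1.0) for zi in z]


def prox_grad_backtracking(b, lam, x0, L0=0.01, iters=50):
    """Minimize 0.5*||x - b||^2 + lam*||x||_1 with backtracking on L."""
    f = lambda x: 0.5 * sum((xi - bi) ** 2 for xi, bi in zip(x, b))
    grad = lambda x: [xi - bi for xi, bi in zip(x, b)]
    x, L = list(x0), L0
    for _ in range(iters):
        g, fx = grad(x), f(x)
        while True:
            # Candidate step with step size 1/L.
            x_new = soft_threshold([xi - gi / L for xi, gi in zip(x, g)], lam / L)
            d = [xn - xi for xn, xi in zip(x_new, x)]
            # Quadratic (convex) upper bound of f around x.
            upper = (fx + sum(gi * di for gi, di in zip(g, d))
                     + 0.5 * L * sum(di * di for di in d))
            if f(x_new) <= upper + 1e-12:
                break
            L *= 2.0               # bound violated: shrink the step
        x = x_new
    return x


x_hat = prox_grad_backtracking(b=[3.0, -0.5, 1.2], lam=1.0, x0=[0.0, 0.0, 0.0])
print(x_hat)
```

For this separable toy problem the minimizer is the soft-thresholded vector `b`, which gives a simple correctness check.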


1812.03446 2026-06-04 math.NA cs.CV cs.NA 版本更新

A New Variational Model for Joint Image Reconstruction and Motion Estimation in Spatiotemporal Imaging

一种新的变分模型用于时空成像中的联合图像重建和运动估计

Chong Chen, Barbara Gris, Ozan Öktem

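Affiliations: LSEC, ICMSEC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences; Department of Mathematics, KTH Royal Institute of Technology

AI summary: This paper proposes a new variational model for joint image reconstruction and motion estimation in spatiotemporal imaging. Built on a shape theory framework, the model combines modified static image reconstruction with sequential indirect image registration, and theoretical analysis and numerical experiments show its effectiveness on sparse and highly noisy data.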

Comments 35 pages, 5 figures, 3 tables, revised

详情

DOI: 10.1137/18M1234047
Journal ref: SIAM Journal on Imaging Sciences 2019

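AI abstract

We propose a new variational model for joint image reconstruction and motion estimation in spatiotemporal imaging, developed within a general framework that we present using shape theory. The model consists of two components: one performs modified static image reconstruction, and the other performs sequential indirect image registration. For the latter, we generalize the large deformation diffeomorphic metric mapping framework to the sequential indirect registration setting. The proposed model is compared theoretically with alternative approaches (optical flow based models and diffeomorphic motion models), and we show that it has desirable properties in terms of the optimal solution. We also give the theoretical derivation and an efficient algorithm for a time-discretized version of the model, showing that the optimal solution of the time-discretized version is consistent with that of the time-continuous version and that most of the computational components are easily implemented linearized deformations. The complexity of the algorithm is also analyzed. The paper concludes with numerical examples on very sparse and/or highly noisy data in 2D space + time tomography.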

Spatial perception is the backbone of many robotics applications, and spans a broad range of research problems, including localization and mapping, point cloud alignment, and relative pose estimation from camera images. Robust spatial perception is jeopardized by the presence of incorrect data association, and in general, outliers. Although techniques to handle outliers do exist, they can fail in unpredictable manners (e.g., RANSAC, robust estimators), or can have exponential runtime (e.g., branch-and-bound). In this paper, we advance the state of the art in outlier rejection by making three contributions. First, we show that even a simple linear instance of outlier rejection is inapproximable: in the worst-case one cannot design a quasi-polynomial time algorithm that computes an approximate solution efficiently. Our second contribution is to provide the first per-instance sub-optimality bounds to assess the approximation quality of a given outlier rejection outcome. Our third contribution is to propose a simple general-purpose algorithm, named adaptive trimming, to remove outliers. Our algorithm leverages recently-proposed global solvers that are able to solve outlier-free problems, and iteratively removes measurements with large errors. We demonstrate the proposed algorithm on three spatial perception problems: 3D registration, two-view geometry, and SLAM. The results show that our algorithm outperforms several state-of-the-art methods across applications while being a general-purpose method.


1903.02531 2026-06-04 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

Combining Optimal Control and Learning for Visual Navigation in Novel Environments

将最优控制与学习相结合用于新环境中的视觉导航

Somil Bansal, Varun Tolani, Saurabh Gupta, Jitendra Malik, Claire Tomlin

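Affiliations: University of California, Berkeley; Facebook AI Research

AI summary: This paper combines model-based control with learning-based perception for reliable visual navigation in novel environments. The perception module generates waypoints along a collision-free path so that the robot reaches goal locations efficiently, and the approach works well at low frame rates and transfers well from simulation to the real world.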

Comments Project website: https://vtolani95.github.io/WayPtNav/

详情

AI中文摘要

基于模型的控制是机器人导航的流行范式，因为它可以利用已知的动力学模型来高效地规划鲁棒的机器人轨迹。然而，在环境事先未知且只能通过机器人上的传感器部分观测的情况下，使用基于模型的方法具有挑战性。在本工作中，我们通过将基于模型的控制与基于学习的感知相结合来解决这一不足。基于学习的感知模块生成一系列 waypoints，通过无碰撞路径引导机器人到达目标。这些 waypoints 被用于基于模型的规划器生成平滑且动态可行的轨迹，该轨迹通过反馈控制在物理系统上执行。我们在模拟的真实世界复杂环境中以及在实际地面车辆上的实验表明，与纯几何映射或端到端学习方法相比，所提出的方法在新环境中能够更可靠、更高效地到达目标位置。我们的方法不依赖于详细的显式 3D 环境地图，能够与低帧率工作，并且在仿真到现实的迁移中表现良好。描述我们方法和实验的视频可在项目网站上获得。

英文摘要

Model-based control is a popular paradigm for robot navigation because it can leverage a known dynamics model to efficiently plan robust robot trajectories. However, it is challenging to use model-based methods in settings where the environment is a priori unknown and can only be observed partially through on-board sensors on the robot. In this work, we address this shortcoming by coupling model-based control with learning-based perception. The learning-based perception module produces a series of waypoints that guide the robot to the goal via a collision-free path. These waypoints are used by a model-based planner to generate a smooth and dynamically feasible trajectory that is executed on the physical system using feedback control. Our experiments in simulated real-world cluttered environments and on an actual ground vehicle demonstrate that the proposed approach can reach goal locations more reliably and efficiently in novel environments as compared to purely geometric mapping-based or end-to-end learning-based alternatives. Our approach does not rely on detailed explicit 3D maps of the environment, works well with low frame rates, and generalizes well from simulation to the real world. Videos describing our approach and experiments are available on the project website.


1811.09358 2026-06-04 cs.LG cs.CV cs.NA math.NA math.OC stat.ML 版本更新

A Sufficient Condition for Convergences of Adam and RMSProp

Adam和RMSProp收敛性的充分条件

Fangyu Zou, Li Shen, Zequn Jie, Weizhong Zhang, Wei Liu

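Affiliations: Tencent AI Lab; Stony Brook University

AI summary: This paper introduces an easy-to-check sufficient condition, depending only on the base learning rate parameters and combinations of historical second-order moments, that guarantees global convergence of generic Adam/RMSProp for large-scale non-convex stochastic optimization. It also shows that the convergence of several Adam variants in the non-convex setting follows directly from this condition.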

Comments Accepted by CVPR2019 as an Oral presentation

详情

AI中文摘要

Adam和RMSProp是训练深度神经网络中最具影响力的自适应随机算法，尽管在凸设置中通过几个简单的反例已被指出存在发散现象。许多尝试，如降低自适应学习率、采用大批次大小、引入时间去相关技术、寻找类比的替代方案等，已被尝试以促进Adam/RMSProp型算法收敛。与现有方法不同，我们引入了一种替代的易于检查的充分条件，该条件仅依赖于基础学习率参数和历史二阶矩量的组合，以保证通用的Adam/RMSProp算法在大规模非凸随机优化中的全局收敛性。此外，我们展示了几种Adam变体，如AdamNC、AdaEMA等，在非凸设置下的收敛性可通过所提出的充分条件直接推导。此外，我们表明Adam本质上是一种具有指数移动平均动量的特定加权AdaGrad，这为理解Adam和RMSProp提供了新的视角。这一观察结合该充分条件，为它们的发散性提供了更深入的解释。最后，我们通过将Adam和RMSProp应用于特定反例和训练深度神经网络来验证该充分条件。数值结果与我们的理论分析一致。

英文摘要

Adam and RMSProp are two of the most influential adaptive stochastic algorithms for training deep neural networks, which have been pointed out to be divergent even in the convex setting via a few simple counterexamples. Many attempts, such as decreasing an adaptive learning rate, adopting a big batch size, incorporating a temporal decorrelation technique, seeking an analogous surrogate, etc., have been made to promote the convergence of Adam/RMSProp-type algorithms. In contrast with existing approaches, we introduce an alternative easy-to-check sufficient condition, which merely depends on the parameters of the base learning rate and combinations of historical second-order moments, to guarantee the global convergence of generic Adam/RMSProp for solving large-scale non-convex stochastic optimization. Moreover, we show that the convergence of several variants of Adam, such as AdamNC, AdaEMA, etc., can be directly implied via the proposed sufficient condition in the non-convex setting. In addition, we illustrate that Adam is essentially a specifically weighted AdaGrad with exponential moving average momentum, which provides a novel perspective for understanding Adam and RMSProp. This observation coupled with this sufficient condition gives much deeper interpretations of their divergence. Finally, we validate the sufficient condition by applying Adam and RMSProp to tackle a certain counterexample and train deep neural networks. Numerical results are exactly in accord with our theoretical analysis.
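
For readers unfamiliar with the update rule under discussion, a minimal scalar sketch of the standard Adam iteration follows. It only shows the base learning rate and the exponential moving averages of first and second moments that the abstract refers to; it does not implement or check the paper's sufficient condition, and the hyperparameter values and the quadratic test function are assumptions chosen for illustration.

```python
import math


def adam_minimize(grad, x0, steps=2000, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """Run the standard Adam update on a scalar variable."""
    x, m, v = x0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad(x)
        m = beta1 * m + (1 - beta1) * g          # first-moment moving average
        v = beta2 * v + (1 - beta2) * g * g      # second-moment moving average
        m_hat = m / (1 - beta1 ** t)             # bias corrections
        v_hat = v / (1 - beta2 ** t)
        x -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x


# Toy objective f(x) = (x - 3)^2 with gradient 2 * (x - 3).
x_star = adam_minimize(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_star)
```

Setting `beta1=0.0` and dropping the bias corrections gives an RMSProp-style update, which is why the two methods are usually analyzed together.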


1905.11299 2026-06-04 eess.SY cs.CV cs.LG cs.SY 版本更新

ImgSensingNet: UAV Vision Guided Aerial-Ground Air Quality Sensing System

ImgSensingNet: 基于无人机视觉引导的空地空气质量感知系统

Yuzhe Yang, Zhiwen Hu, Kaigui Bian, Lingyang Song

发表机构 * Computer Science and Artificial Intelligence Laboratory（计算机科学与人工智能实验室）； School of Electrical Engineering and Computer Science（电子工程与计算机科学学院）

AI总结本文提出ImgSensingNet，一种基于无人机视觉引导的空地联合感知系统，通过融合无人机拍摄的雾霾图像与地面三维无线传感器网络（WSN）收集的AQI数据，实现精细化空气质量监测与预测，显著降低系统能耗。

Comments Preliminary version published in INFOCOM 2019. Code available at https://github.com/YyzHarry/ImgSensingNet

详情

AI中文摘要

鉴于日益严重的空气污染问题，城市区域空气质量指数（AQI）的监测已引起广泛关注。本文提出ImgSensingNet，一种基于视觉引导的空地联合感知系统，用于利用无人机拍摄的雾霾图像与地面三维无线传感器网络（WSN）收集的AQI数据进行精细化空气质量监测与预测。具体而言，ImgSensingNet首先利用计算机视觉技术从拍摄的雾霾图像中识别不同区域的AQI尺度，其中设计了与雾霾相关的特征和深度卷积神经网络（CNN）以直接学习雾霾图像与相应AQI尺度之间的映射关系。基于学习到的AQI尺度，ImgSensingNet决定是否唤醒地面无线传感器进行小尺度AQI监测和推断，从而显著降低系统的能耗。采用基于熵的模型以在未测量位置实现准确的实时AQI推断和未来空气质量分布预测。我们在两所大学校园自2018年2月起实施并评估ImgSensingNet，已收集17,630张照片和260万条AQI数据样本。实验结果证实，与现有最先进的AQI监测方法相比，ImgSensingNet在提高推断精度的同时显著降低了能耗。

英文摘要

Given the increasingly serious air pollution problem, the monitoring of the air quality index (AQI) in urban areas has drawn considerable attention. This paper presents ImgSensingNet, a vision-guided aerial-ground sensing system for fine-grained air quality monitoring and forecasting that fuses haze images taken by an unmanned aerial vehicle (UAV) with AQI data collected by an on-ground three-dimensional (3D) wireless sensor network (WSN). Specifically, ImgSensingNet first uses computer vision to estimate the AQI scale in different regions from the captured haze images, where haze-relevant features and a deep convolutional neural network (CNN) are designed to learn the mapping between haze images and the corresponding AQI scale directly. Based on the learnt AQI scale, ImgSensingNet determines whether to wake up on-ground wireless sensors for small-scale AQI monitoring and inference, which can greatly reduce the energy consumption of the system. An entropy-based model is employed for accurate real-time AQI inference at unmeasured locations and for forecasting the future air quality distribution. We have implemented and evaluated ImgSensingNet on two university campuses since Feb. 2018, and have collected 17,630 photos and 2.6 million AQI data samples. Experimental results confirm that ImgSensingNet achieves higher inference accuracy while greatly reducing energy consumption, compared to state-of-the-art AQI monitoring approaches.


1905.08538 2026-06-04 math.NA cs.CV cs.NA 版本更新

A Two-stage Classification Method for High-dimensional Data and Point Clouds

高维数据和点云的两阶段分类方法

Xiaohao Cai, Raymond Chan, Xiaoyu Xie, Tieyong Zeng

发表机构 * Mullard Space Science Laboratory (MSSL), University College London, Surrey RH5 6NT, UK（穆拉德空间科学实验室（MSSL），伦敦大学学院，萨里 RH5 6NT，英国）； Department of Mathematics, City University of Hong Kong, Kowloon Tong, Hong Kong（香港城市大学数学系，九龙塘，香港）； Department of Mathematics, The Chinese University of Hong Kong, Shatin, Hong Kong（香港中文大学数学系，沙田，香港）

AI总结本文提出了一种两阶段多相半监督分类方法，用于高维数据和无结构点云的分类。首先使用模糊分类方法（如标准支持向量机）生成初始解，然后应用SaT（平滑和阈值）两阶段方法改进分类。第一阶段通过无约束凸变分模型净化和平滑初始解，第二阶段将第一阶段得到的平滑分区投影为二值分区。这两个阶段可以重复进行，以提高分类质量。我们证明了平滑阶段的凸模型有唯一解，并可以通过专门设计的、收敛性有保证的原始-对偶算法求解。我们在多个基准数据集上测试了我们的方法，并与最先进的方法进行了比较。实验结果表明，我们的方法在高维数据和点云的分类准确率和计算速度上均优于现有方法。

Comments 21 pages, 4 figures

详情

AI中文摘要

高维数据分类是机器学习和成像科学中的基本任务。在本文中，我们提出了一种两阶段多相半监督分类方法，用于对高维数据和无结构点云进行分类。首先，使用模糊分类方法（如标准支持向量机）生成初始解。然后应用名为SaT（平滑和阈值）的两阶段方法来改进分类。第一阶段实现一个无约束凸变分模型以净化和平滑初始解，随后第二阶段将第一阶段得到的平滑分区投影为二值分区。这两个阶段可以重复进行，以最新结果作为新初始解，持续提高分类质量。我们证明了平滑阶段的凸模型具有唯一解，并可以通过专门设计的、收敛性有保证的原始-对偶算法求解。我们测试了我们的方法，并在多个基准数据集上与最先进的方法进行了比较。实验结果清楚地表明，我们的方法在高维数据和点云的分类准确率和计算速度上均优于现有方法。

英文摘要

High-dimensional data classification is a fundamental task in machine learning and imaging science. In this paper, we propose a two-stage multiphase semi-supervised classification method for classifying high-dimensional data and unstructured point clouds. To begin with, a fuzzy classification method such as the standard support vector machine is used to generate a warm initialization. We then apply a two-stage approach named SaT (smoothing and thresholding) to improve the classification. In the first stage, an unconstrained convex variational model is implemented to purify and smooth the initialization; the second stage then projects the smoothed partition obtained at stage one to a binary partition. These two stages can be repeated, with the latest result as a new initialization, to keep improving the classification quality. We show that the convex model of the smoothing stage has a unique solution and can be solved by a specifically designed primal-dual algorithm whose convergence is guaranteed. We test our method and compare it with state-of-the-art methods on several benchmark data sets. The experimental results demonstrate clearly that our method is superior in both classification accuracy and computation speed for high-dimensional data and point clouds.
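
The thresholding stage described in the abstract (projecting a smoothed fuzzy partition to a binary one) might look roughly like the sketch below. It assumes the projection is a per-point argmax over class memberships, which the abstract does not state explicitly; the function name is hypothetical, and the convex smoothing stage is not reproduced here.

```python
def threshold_partition(fuzzy):
    """Map each row of fuzzy class memberships to a one-hot (binary) row.

    Assumption: the projection picks the class with the largest membership;
    ties go to the lowest class index.
    """
    binary = []
    for row in fuzzy:
        best = max(range(len(row)), key=lambda j: row[j])
        binary.append([1 if j == best else 0 for j in range(len(row))])
    return binary

# Two points, three classes: a smoothed partition and its binary projection.
smoothed = [[0.2, 0.7, 0.1],
            [0.5, 0.5, 0.0]]
labels = threshold_partition(smoothed)
```

In the repeated scheme the abstract describes, the binary result would be fed back as the initialization of the next smoothing pass.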


1905.05946 2026-06-04 cs.RO cs.CV cs.SY eess.SY 版本更新

Depth map estimation methodology for detecting free-obstacle navigation areas

用于检测无障碍导航区域的深度图估计方法

Sergio Trejo, Karla Martinez, Gerardo Flores

AI总结本文提出了一种基于视觉的方法，利用立体相机和一维LiDAR估计四旋翼导航中的无障碍区域。通过加权最小二乘滤波器过滤深度图，并通过卡尔曼滤波算法融合信息，确定足够大、可供四旋翼通过的无障碍空间区域。整个过程在Jetson TX2嵌入式计算机上用ROS实现。

1905.02176 2026-06-04 math.NA cs.CG cs.CV cs.NA math.DG 版本更新

Computation of Circular Area and Spherical Volume Invariants via Boundary Integrals

通过边界积分计算圆面积和球体积不变量

Riley O'Neill, Pedro Angulo-Umana, Jeff Calder, Bo Hessburg, Peter J. Olver, Chehrzad Shakiban, Katrina Yezzi-Woodley

发表机构 * Department of Mathematics, University of St. Thomas（圣托马斯大学数学系）； School of Mathematics, University of Minnesota（明尼苏达大学数学学院）； Department of Anthropology, University of Minnesota（明尼苏达大学人类学系）

AI总结本文提出通过边界积分计算平面曲线的圆面积不变量和曲面的球体积不变量，利用散度定理将积分转化为边界积分，并扩展到高维超曲面，为三角化曲面提供了一种无需离散环境空间的计算方法，应用于考古学中骨折碎片的特征检测。

1809.00846 2026-06-04 cs.LG cs.CV cs.SY eess.SY stat.ML 版本更新

Towards Understanding Regularization in Batch Normalization

向批量归一化中的正则化理解迈进

Ping Luo, Xinjiang Wang, Wenqi Shao, Zhanglin Peng

发表机构 * The Chinese University of Hong Kong（香港中文大学）； SenseTime Research（商汤科技研究院）； The University of Hong Kong（香港大学）

AI总结本文通过理论分析探讨了批量归一化在神经网络训练中的收敛性和泛化能力，揭示了批量归一化作为隐式正则化的作用，并通过实验验证了其在卷积神经网络中的正则化特性。

Comments International Conference on Learning Representations (ICLR)

1806.02998 2026-06-04 cs.CV cs.NA math.GN math.NA 版本更新

Logarithmic mathematical morphology: a new framework adaptive to illumination changes

对数数学形态学：一种适应光照变化的新框架

Guillaume Noyel

发表机构 * University of Strathclyde Institute of Global Public Health（斯特拉思克莱德大学全球公共卫生研究所）； International Prevention Research Institute（国际预防研究所）； iPRI ； Lyon, France（法国里昂）

AI总结本文提出了一种基于对数图像处理模型的新数学形态学框架，该框架能够适应曝光时间或光照强度变化引起的光照变化，通过定义对数膨胀和腐蚀算子，提高了低对比信息处理的效率。

详情

DOI: 10.1007/978-3-030-13469-3_53
Journal ref: 23rd Iberoamerican Congress on Pattern Recognition (CIARP 2018), Nov 2018, Madrid, Spain. Springer International Publishing, Lecture Notes in Computer Science, 11401, pp.453-461, 2019, Progress in Pattern Recognition, Image Analysis, Computer Vision, and Applications. https://atvs.ii.uam.es/ciarp2018/

AI中文摘要

通过基于成像物理的对数图像处理（LIP）模型，定义了一组适应由曝光时间或光照强度变化引起的光照变化的数学形态学（MM）算子。该模型与人类视觉一致。基本算子，即对数膨胀和对数腐蚀，通过结构函数的LIP加法定义。这些两个互补算子的结合给出了形态学滤波器，即对数开运算和闭运算，用于模式识别。建立了“经典”膨胀和腐蚀与其对数版本之间的数学关系，从而方便了其实现。在模拟和真实图像上的结果表明，对数MM在低对比信息上比“经典”MM更有效。

英文摘要

A new set of mathematical morphology (MM) operators adaptive to illumination changes caused by variation of exposure time or light intensity is defined thanks to the Logarithmic Image Processing (LIP) model. This model based on the physics of acquisition is consistent with human vision. The fundamental operators, the logarithmic-dilation and the logarithmic-erosion, are defined with the LIP-addition of a structuring function. The combination of these two adjunct operators gives morphological filters, namely the logarithmic-opening and closing, useful for pattern recognition. The mathematical relation existing between "classical" dilation and erosion and their logarithmic-versions is established facilitating their implementation. Results on simulated and real images show that logarithmic-MM is more efficient on low-contrasted information than "classical" MM.
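
To make the idea of a dilation built on LIP-addition concrete, here is a minimal 1D sketch. It assumes the usual LIP addition law f + g - f*g/M on gray tones in [0, M) and takes the logarithmic dilation as the supremum of f(x - h) LIP-added to b(h) over the support of the structuring function b. The function names, the choice M = 256, and the border handling (out-of-range samples are skipped) are assumptions for illustration, not the author's implementation.

```python
M = 256.0  # assumed upper bound of the gray-tone range [0, M)

def lip_add(f, g):
    """LIP addition of two gray tones."""
    return f + g - f * g / M

def log_dilation(signal, b):
    """Logarithmic dilation of a 1D signal.

    b maps an integer offset h to the structuring-function value b(h);
    it should contain offset 0 so that every position has a candidate.
    """
    n = len(signal)
    out = []
    for x in range(n):
        candidates = [lip_add(signal[x - h], bh)
                      for h, bh in b.items() if 0 <= x - h < n]
        out.append(max(candidates))
    return out

# With a flat structuring function (all zeros) this reduces to a max filter.
flat = {-1: 0.0, 0: 0.0, 1: 0.0}
dilated = log_dilation([10.0, 50.0, 20.0], flat)
```

A non-flat b changes the result multiplicatively in the transmittance domain rather than additively, which is what makes the operator follow illumination changes.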


1904.05814 2026-06-04 cs.CV cs.GR cs.LG cs.NA cs.RO math.NA 版本更新

Probabilistic Permutation Synchronization using the Riemannian Structure of the Birkhoff Polytope

利用Birkhoff多面体的Riemannian结构的概率排列同步

Tolga Birdal, Umut Şimşekli

AI总结本文提出了一种新的几何和概率方法，用于在多个对象或图像集合之间同步对应关系。核心方法包括基于Birkhoff-Riemannian L-BFGS优化放松后的循环一致性损失，以及基于Birkhoff-Riemannian Langevin Monte Carlo生成Birkhoff多面体样本并估计解的置信度。

Comments To appear as oral presentation at CVPR 2019. 20 pages including the supplementary material

详情

AI中文摘要

我们提出了一种全新的几何和概率方法，用于在多个对象或图像集合之间同步对应关系。具体而言，我们提出了两个算法：(1) Birkhoff-Riemannian L-BFGS，用于以有原则的方式优化组合上难以处理的循环一致性损失的松弛版本；(2) Birkhoff-Riemannian Langevin Monte Carlo，用于在Birkhoff多面体上生成样本并估计所得解的置信度。为此，我们首先介绍了最近发展出的Birkhoff多面体的Riemannian几何。接着，我们引入了一种新的概率同步模型，形式为马尔可夫随机场（MRF）。最后，基于一阶retraction算子，我们将问题表述为对随机微分方程的模拟，并设计了新的积分器。我们在合成和真实数据集上展示，我们能够以更快的收敛速度和可靠的置信度/不确定性估计获得高质量的多图匹配结果。

英文摘要

We present an entirely new geometric and probabilistic approach to synchronization of correspondences across multiple sets of objects or images. In particular, we present two algorithms: (1) Birkhoff-Riemannian L-BFGS for optimizing the relaxed version of the combinatorially intractable cycle consistency loss in a principled manner, (2) Birkhoff-Riemannian Langevin Monte Carlo for generating samples on the Birkhoff Polytope and estimating the confidence of the found solutions. To this end, we first introduce the very recently developed Riemannian geometry of the Birkhoff Polytope. Next, we introduce a new probabilistic synchronization model in the form of a Markov Random Field (MRF). Finally, based on the first order retraction operators, we formulate our problem as simulating a stochastic differential equation and devise new integrators. We show on both synthetic and real datasets that we achieve high quality multi-graph matching results with faster convergence and reliable confidence/uncertainty estimates.


1604.00970 2026-06-04 cs.CV cs.SY eess.SP eess.SY 版本更新

Extended Object Tracking: Introduction, Overview and Applications

扩展目标跟踪：介绍、概述与应用

Karl Granstrom, Marcus Baum, Stephan Reuter

发表机构 * Department of Signals and Systems, Chalmers University of Technology（信号与系统系，查尔姆斯理工大学）

AI总结本文综述了扩展目标跟踪的当前研究，定义了该问题并讨论其与其他目标跟踪方法的区别，介绍了两种基本的扩展目标跟踪方法，并总结了在摄像头、X波段雷达、激光雷达、RGB-D传感器等应用中的实际应用。

Comments 30 pages, 19 figures

详情

Journal ref: Journal of Advances in Information Fusion, Volume 12, Number 2, Pages 139-174, December 2016, ISSN 1557-6418

AI中文摘要

本文提供了一篇详尽的扩展目标跟踪当前研究的概述。我们给出了扩展目标跟踪问题的明确定义，并讨论了其与其他类型目标跟踪的界限。接下来，广泛讨论了扩展目标建模的不同方面。随后，我们介绍了两种基本且常用的方法——随机矩阵方法和基于卡尔曼滤波的星形形状方法。接下来一部分讨论了多个扩展目标的跟踪，并阐述了如何利用随机有限集（RFS）和非RFS多目标跟踪器来处理大量的可行关联假设。文章最后总结了当前的应用情况，突出了四个涉及摄像头、X波段雷达、激光雷达、红绿蓝深度（RGB-D）传感器的应用示例。

英文摘要

This article provides an elaborate overview of current research in extended object tracking. We provide a clear definition of the extended object tracking problem and discuss its delimitation to other types of object tracking. Next, different aspects of extended object modelling are extensively discussed. Subsequently, we give a tutorial introduction to two basic and well used extended object tracking approaches - the random matrix approach and the Kalman filter-based approach for star-convex shapes. The next part treats the tracking of multiple extended objects and elaborates how the large number of feasible association hypotheses can be tackled using both Random Finite Set (RFS) and Non-RFS multi-object trackers. The article concludes with a summary of current applications, where four example applications involving camera, X-band radar, light detection and ranging (lidar), red-green-blue-depth (RGB-D) sensors are highlighted.


1803.07187 2026-06-04 cs.CV cs.NA eess.IV math.NA 版本更新

Unveiling the invisible - mathematical methods for restoring and interpreting illuminated manuscripts

揭示不可见之物：用于修复和解读泥金装饰手抄本的数学方法

Luca Calatroni, Marie d'Autume, Rob Hocking, Stella Panayotova, Simone Parisotto, Paola Ricciardi, Carola-Bibiane Schönlieb

AI总结本文探讨了用于修复和可视化手稿的数学方法，强调了数字图像处理在艺术领域中的应用和重要性。

详情

DOI: 10.1186/s40494-018-0216-z

AI中文摘要

过去五十年来，数学方法在数字图像分析和处理方面的快速发展，主要集中在摄影、生物医学成像和各种工程领域。然而，艺术领域在此过程中大多被忽视，除了最近十年中少数例外。然而，随着艺术领域数字化的迅速兴起，艺术领域对数字图像处理方法的接受度正在增加，因此关注这一点的重要性也随之增加。本文讨论了一系列用于数字图像修复和数字可视化的方法，特别是手稿，因为它们传统上保持物理未受干扰，为数字操作提供了有趣的机会。同时，它们也展示了数学和数字修复作为通用和客观工具包在艺术领域中的可能性。

英文摘要

The last fifty years have seen an impressive development of mathematical methods for the analysis and processing of digital images, mostly in the context of photography, biomedical imaging and various forms of engineering. The arts have been mostly overlooked in this process, apart from a few exceptional works in the last ten years. With the rapid emergence of digitisation in the arts, however, the arts domain is becoming increasingly receptive to digital image processing methods and the importance of paying attention to this therefore increases. In this paper we discuss a range of mathematical methods for digital image restoration and digital visualisation for illuminated manuscripts. The latter provide an interesting opportunity for digital manipulation because they traditionally remain physically untouched. At the same time they also serve as an example for the possibilities mathematics and digital restoration offer as a generic and objective toolkit for the arts.


1903.05079 2026-06-04 math.NA cs.CV cs.NA 版本更新

A total variation based regularizer promoting piecewise-Lipschitz reconstructions

一种基于总变分的正则化器，促进分段Lipschitz重建

Martin Burger, Yury Korolev, Carola-Bibiane Schönlieb, Christiane Stollenwerk

发表机构 * Department of Applied Mathematics and Theoretical Physics, University of Cambridge（剑桥大学应用数学与理论物理系）

AI总结本文提出了一种新的总变分家族正则化器，促进具有给定Lipschitz常数（可空间变化）的重建。通过证明该功能的正则化性质，并研究其与总变分和infimal convolution类型正则化器TVLp的联系，特别是建立了拓扑等价性。数值实验表明，所提出的正则化器在性能上与总广义变分相似，但具有非常直观的自由参数解释，即只是梯度范数的局部估计。它还提供了一种自然的空间自适应正则化方法。

Comments 12 pages, 4 figures, accepted for publication in SSVM conference proceedings 2019

1902.10414 2026-06-04 math.NA cs.CV cs.NA 版本更新

Computing Nonlinear Eigenfunctions via Gradient Flow Extinction

通过梯度流熄灭计算非线性本征函数

Leon Bungert, Martin Burger, Daniel Tenbrinck

AI总结本文研究了通过梯度流的熄灭轮廓计算非线性本征函数的问题，提出了一种递归减去本征函数的方案，并展示了该方法在某些情况下（如一维总变分）能将数据分解为本征函数。还讨论了使用熄灭轮廓和梯度流进行谱图聚类的数值实验结果。

Comments 12 pages, 5 figures, accepted for publication in SSVM conference proceedings 2019

1902.05343 2026-06-04 cs.CV cs.RO cs.SY eess.SY 版本更新

Study of dynamical system based obstacle avoidance via manipulating orthogonal coordinates

基于操纵正交坐标的动态系统障碍避障研究

Weiya Ren

发表机构 * Artificial Intelligence Research Center of National Innovation Institute of Defense Technology（国防科技创新研究院人工智能研究中心）； Tianjin Artificial Intelligence Innovation Center（天津人工智能创新中心）

AI总结本文研究了基于动态系统的障碍避障问题，通过引入正交坐标开发了调制矩阵，使调制矩阵更加合理。新轨迹的方向可通过正交坐标的线性组合表示。提出了一种通过引入旋转矩阵来解决局部最小问题，并在三维或更高维空间中提供更合理运动的正交坐标操纵方法。该方法还为围绕凸形体巡逻提供了解决方案。实验结果表明所提出方法的有效性。

1807.04638 2026-06-04 math.NA cs.CV cs.NA 版本更新

PDE-constrained LDDMM via geodesic shooting and inexact Gauss-Newton-Krylov optimization using the incremental adjoint Jacobi equations

基于测地线打靶和非精确高斯-牛顿-克雷洛夫优化的偏微分方程约束LDDMM：使用增量伴随雅可比方程

Monica Hernandez

发表机构 * Computer Sciences Department（计算机科学系）； Aragon Institute on Engineering Research（阿拉贡工程研究院）； University of Zaragoza（萨拉戈萨大学）

AI总结本文提出了一种偏微分方程约束的LDDMM方法，利用测地线打靶和非精确高斯-牛顿-克雷洛夫优化，在初始速度场空间中进行参数化；梯度和Hessian-向量积在最终速度场上推导，并通过伴随和增量伴随雅可比方程反向传输，从而避免了对初始速度场的复杂依赖，并给出了合适的测地线路径。

详情

DOI: 10.1088/1361-6560/aaf598

AI中文摘要

在偏微分方程约束的大变形微分同胚度量映射（LDDMM）框架下提出的一类非刚性配准方法，是一族特别有趣且具有物理意义的微分同胚配准方法。在该框架中，非精确牛顿-克雷洛夫优化已显示出卓越的数值精度和极快的收敛速度。然而，非稳态速度场的伽辽金表示并未提供适当的测地线路径。在本文中，我们提出了一种在EPDiff方程下、于初始速度场空间中参数化的偏微分方程约束LDDMM方法。梯度和Hessian-向量积的推导在最终速度场上进行，并通过伴随和增量伴随雅可比方程反向传输。这样，我们避免了在推导和计算伴随方程及其增量版本时对初始速度场的复杂依赖。所提出的方法在偏微分方程约束LDDMM框架内提供了测地线，并展示了与基准PDE约束LDDMM和EPDiff-LDDMM方法相媲美的性能。

英文摘要

The class of non-rigid registration methods proposed in the framework of PDE-constrained Large Deformation Diffeomorphic Metric Mapping is a particularly interesting family of physically meaningful diffeomorphic registration methods. Inexact Newton-Krylov optimization has shown an excellent numerical accuracy and an extraordinarily fast convergence rate in this framework. However, the Galerkin representation of the non-stationary velocity fields does not provide proper geodesic paths. In this work, we propose a method for PDE-constrained LDDMM parameterized in the space of initial velocity fields under the EPDiff equation. The derivation of the gradient and the Hessian-vector products are performed on the final velocity field and transported backward using the adjoint and the incremental adjoint Jacobi equations. This way, we avoid the complex dependence on the initial velocity field in the derivations and the computation of the adjoint equation and its incremental counterpart. The proposed method provides geodesics in the framework of PDE-constrained LDDMM, and it shows performance competitive to benchmark PDE-constrained LDDMM and EPDiff-LDDMM methods.


1805.11572 2026-06-04 cs.CV cs.LG cs.NA math.NA stat.ML 版本更新

Adversarial Regularizers in Inverse Problems

对抗正则化在反问题中的应用

Sebastian Lunz, Ozan Öktem, Carola-Bibiane Schönlieb

发表机构 * DAMTP Department of Mathematics（DAMTP数学系）； University of Cambridge（剑桥大学）； KTH - Royal Institute of Technology（皇家理工学院）

AI总结本文提出了一种利用神经网络作为正则化函数的新框架，用于解决反问题，该方法通过学习真实图像分布与未正则化重建分布之间的差异来提升反问题求解的性能。

Comments published at NeurIPS 2018

1707.09715 2026-06-04 eess.SY cs.CV cs.RO cs.SY 版本更新

Automatic Crack Detection in Built Infrastructure Using Unmanned Aerial Vehicles

使用无人机自动检测建筑基础设施裂缝

Manh Duong Phung, Van Truong Hoang, Tran Hiep Dinh, Quang Ha

发表机构 * School of Electrical Mechanical and Mechatronic Systems, University of Technology Sydney, Australia（电气、机械与机电系统学院，悉尼科技大学，澳大利亚）

AI总结本文提出了一种利用无人机采集数据并结合直方图分析进行建筑基础设施裂缝检测的方法，通过自动化流程提高检测效率并降低安全隐患。

Comments In proceeding of The 34th International Symposium on Automation and Robotics in Construction (ISARC), pp. 823-829, Taipei, Taiwan, 2017

详情

DOI: 10.22260/ISARC2017/0115

AI中文摘要

本文针对建筑基础设施健康监测中至关重要的裂缝检测问题，提出了一种包含两个阶段的方法：使用无人机（UAV）进行数据采集和利用直方图分析进行裂缝检测。首先，利用激光扫描仪创建结构的3D模型，然后提取几何属性以生成用于导航无人机拍摄结构图像的路径点。接着，将从重叠视野中获取的图像拼接在一起，通过直方图分析和峰值检测进行聚类，最后利用局部自适应阈值识别潜在裂缝。整个过程自动化进行，从而显著提高了检查时间并最小化了安全风险。已开发出原型系统进行评估，并包含实验结果。

英文摘要

This paper addresses the problem of crack detection, which is essential for health monitoring of built infrastructure. Our approach includes two stages: data collection using unmanned aerial vehicles (UAVs) and crack detection using histogram analysis. For the data collection, a 3D model of the structure is first created using laser scanners. Based on the model, geometric properties are extracted to generate the waypoints needed to navigate the UAV while it takes images of the structure. The next step is to stitch together the images obtained from overlapping fields of view. The resulting image is then clustered by histogram analysis and peak detection. Potential cracks are finally identified using locally adaptive thresholds. The whole process is carried out automatically, so that inspection time is significantly reduced while safety hazards are minimised. A prototype system has been developed for evaluation, and experimental results are included.


1805.12521 2026-06-04 math.NA cs.CV cs.NA 版本更新

Whole Brain Susceptibility Mapping Using Harmonic Incompatibility Removal

利用谐波不兼容性去除的全脑磁化率成像

Chenglong Bao, Jae Kyu Choi, Bin Dong

发表机构 * Yau Mathematical Sciences Center, Tsinghua University（清华大学丘成桐数学科学中心）； School of Mathematical Sciences, Tongji University（同济大学数学科学学院）

AI总结本文提出了一种基于正则化的磁化率重建模型，通过对谐波不兼容性引入基于稀疏性的正则化项，以提高全脑磁化率成像的性能。

Comments Accepted for publication in SIAM Journal on Imaging Sciences

详情

AI中文摘要

定量磁化率成像（QSM）旨在利用磁共振信号中的相位数据求解场到源的逆问题，从而可视化三维磁化率分布。然而，由于积分核的傅里叶变换在频域中存在零点，该逆问题是不适定的。尽管已经提出了许多基于正则化的模型来克服这个问题，但场数据中的不兼容性并未得到足够的关注，这会导致恢复质量下降。在本文中，我们表明QSM的数据采集过程本质上会在测量的局部场中产生谐波不兼容性。基于这一发现，我们提出了一种新的基于正则化的磁化率重建模型，并对谐波不兼容性引入了额外的基于稀疏性的正则化项。数值实验表明，所提出的方法在性能上优于现有的方法。

英文摘要

Quantitative susceptibility mapping (QSM) aims to visualize the three dimensional susceptibility distribution by solving the field-to-source inverse problem using the phase data in magnetic resonance signal. However, the inverse problem is ill-posed since the Fourier transform of integral kernel has zeroes in the frequency domain. Although numerous regularization based models have been proposed to overcome this problem, the incompatibility in the field data has not received enough attention, which leads to deterioration of the recovery. In this paper, we show that the data acquisition process of QSM inherently generates a harmonic incompatibility in the measured local field. Based on such discovery, we propose a novel regularization based susceptibility reconstruction model with an additional sparsity based regularization term on the harmonic incompatibility. Numerical experiments show that the proposed method achieves better performance than the existing approaches.
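
The ill-posedness mentioned in the abstract can be seen directly from the dipole kernel. The sketch below assumes the form commonly used in the QSM literature, D(k) = 1/3 - kz^2/|k|^2 with the main field along z; the abstract itself only says that the Fourier transform of the integral kernel has zeros. The function name and the convention D(0) = 0 are assumptions.

```python
def dipole_kernel(kx, ky, kz):
    """Fourier-domain dipole kernel, main magnetic field assumed along z."""
    k2 = kx * kx + ky * ky + kz * kz
    if k2 == 0.0:
        return 0.0  # value at the origin is a convention, assumed here
    return 1.0 / 3.0 - (kz * kz) / k2

# The kernel vanishes on the cone kz^2 = |k|^2 / 3, for example along (1, 1, 1).
on_cone = dipole_kernel(1.0, 1.0, 1.0)
along_field = dipole_kernel(0.0, 0.0, 1.0)
across_field = dipole_kernel(1.0, 0.0, 0.0)
```

Dividing the measured field by this kernel is undefined on the cone and unstable near it, which is why regularization is needed.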


1711.04178 2026-06-04 cs.LG cs.CV cs.NA math.NA stat.ML 版本更新

CUR Decompositions, Similarity Matrices, and Subspace Clustering

CUR分解、相似矩阵与子空间聚类

Akram Aldroubi, Keaton Hamm, Ahmet Bugra Koku, Ali Sekmen

AI总结本文提出了一种利用CUR分解解决子空间聚类问题的通用框架，通过构造相似矩阵实现无噪声情况下的精确聚类，并展示了如何通过CUR分解生成多种相似矩阵以处理噪声数据，同时推导出两种已知的子空间聚类方法。

Comments Approximately 30 pages. Current version contains improved algorithm and numerical experiments from the previous version

详情

AI中文摘要

本文提出了一种利用CUR分解解决子空间聚类问题的通用框架。CUR分解提供了一种自然方法来构造数据来自未知子空间联盟$\mathscr{U}=\underset{i=1}{\overset{M}\bigcup}S_i$的相似矩阵。由此构造的相似矩阵在无噪声情况下能够实现精确聚类。此外，这种分解还能从给定数据集生成多种不同的相似矩阵，从而具有足够的灵活性以对含噪声数据进行准确聚类。我们还展示了两种已知的子空间聚类方法可以从CUR分解中推导出来。本文还提出了一种基于相似矩阵理论构造的算法，并在合成和真实数据上进行了实验以测试该方法。此外，本文还利用了基于CUR的相似矩阵的改进版本，提供了一种启发式算法用于子空间聚类；该算法在Hopkins155运动分割数据集上的聚类性能目前最佳。

英文摘要

A general framework for solving the subspace clustering problem using the CUR decomposition is presented. The CUR decomposition provides a natural way to construct similarity matrices for data that come from a union of unknown subspaces $\mathscr{U}=\underset{i=1}{\overset{M}\bigcup}S_i$. The similarity matrices thus constructed give the exact clustering in the noise-free case. Additionally, this decomposition gives rise to many distinct similarity matrices from a given set of data, which allow enough flexibility to perform accurate clustering of noisy data. We also show that two known methods for subspace clustering can be derived from the CUR decomposition. An algorithm based on the theoretical construction of similarity matrices is presented, and experiments on synthetic and real data are presented to test the method. Additionally, an adaptation of our CUR based similarity matrices is utilized to provide a heuristic algorithm for subspace clustering; this algorithm yields the best overall performance to date for clustering the Hopkins155 motion segmentation dataset.
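
As a small illustration of the CUR decomposition the abstract builds on, the sketch below checks the identity A = C U^{-1} R on a rank-2 matrix, where C holds selected columns, R selected rows, and U their intersection. It relies on the standard fact that the identity is exact when U has the same rank as A. The helper names are hypothetical, and the 2x2 inverse stands in for the pseudoinverse that a general implementation would use.

```python
from fractions import Fraction

def matmul(P, Q):
    """Product of two matrices given as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]

def inv2(U):
    """Exact inverse of a 2x2 matrix using rational arithmetic."""
    (a, b), (c, d) = U
    det = Fraction(a * d - b * c)
    return [[d / det, -b / det], [-c / det, a / det]]

def cur(A, rows, cols):
    """Return C (selected columns), the inverse of U (intersection), and R (selected rows)."""
    C = [[A[i][j] for j in cols] for i in range(len(A))]
    R = [A[i] for i in rows]
    U = [[A[i][j] for j in cols] for i in rows]
    return C, inv2(U), R

# A rank-2 matrix: rows 3 and 4 are combinations of rows 1 and 2.
A = [[1, 2, 3],
     [0, 1, 1],
     [1, 3, 4],
     [2, 3, 5]]
C, U_inv, R = cur(A, rows=[0, 1], cols=[0, 1])
reconstructed = matmul(matmul(C, U_inv), R)
```

Different choices of rows and columns with a full-rank intersection give different factorizations of the same matrix, which is the flexibility the abstract refers to.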


1812.04303 2026-06-04 cs.CV cs.GR cs.NA math.NA 版本更新

Analytic heuristics for a fast DSC-MRI

快速DSC-MRI的解析启发式方法

Marco Virgulin, Marco Castellaro, Enrico Grisan, Fabio Marcuzzi

发表机构 * Department of Mathematics, Padua University（数学系，帕多瓦大学）； Department of Information Engineering, Padua University（信息工程系，帕多瓦大学）

AI总结本文提出了一种确定性方法用于动态磁敏感对比成像数据的重建，并将其与文献中已有的压缩感知解决方案进行比较。通过问题的数学分析，尽管计算复杂度非多项式导致计算不可行，但提出了简单的启发法，效果良好，并在真实图像和加噪人工假体上给出了结果。

1805.07857 2026-06-04 cs.LG cs.CV cs.NA math.NA stat.ML 版本更新

Parallel Transport Convolution: A New Tool for Convolutional Neural Networks on Manifolds

平行运输卷积：用于流形上卷积神经网络的新工具

Stefan C. Schonsheck, Bin Dong, Rongjie Lai

发表机构 * Rensselaer Polytechnic Institute（伦斯拉尔理工学院）

AI总结本文提出平行运输卷积（PTC），一种在流形及其离散对应物上扩展卷积操作的新方法，能够保持卷积的紧凑支持、方向性和跨流形的可转移性，从而在曲面域上构建小波样操作和深度卷积神经网络。

Comments 10 pages

详情

AI中文摘要

卷积在科学和工程中的各种应用中扮演了重要的角色，是卷积神经网络中最关键的操作。近年来，研究者对在曲面域（如流形和图）上推广卷积的兴趣增长，但现有方法无法保持欧几里得卷积的所有理想特性，即紧凑支持滤波器、方向性和跨不同流形的可转移性。本文开发了一种新的卷积操作扩展，称为平行运输卷积（PTC），应用于黎曼流形及其离散对应物。PTC基于平行运输，能够沿流形传输信息并内在保持方向性。PTC允许构建具有紧凑支持的滤波器，并且对流形变形具有鲁棒性。这使得我们能够执行小波样操作，并在曲面域上定义深度卷积神经网络。

英文摘要

Convolution has played a prominent role in various applications in science and engineering for many years. It is the most important operation in convolutional neural networks. There has been growing research interest in generalizing convolutions to curved domains such as manifolds and graphs. However, existing approaches cannot preserve all the desirable properties of Euclidean convolutions, namely compactly supported filters, directionality, and transferability across different manifolds. In this paper we develop a new generalization of the convolution operation, referred to as parallel transport convolution (PTC), on Riemannian manifolds and their discrete counterparts. PTC is designed based on parallel transport, which is able to translate information along a manifold and to intrinsically preserve directionality. PTC allows for the construction of compactly supported filters and is also robust to manifold deformations. This enables us to perform wavelet-like operations and to define deep convolutional neural networks on curved domains.


1804.01983 2026-06-04 math.NA cs.CV cs.LG cs.NA 版本更新

High-dimension Tensor Completion via Gradient-based Optimization Under Tensor-train Format

通过张量列车格式的梯度优化实现高维张量补全

Longhao Yuan, Qibin Zhao, Lihua Gui, Jianting Cao

发表机构 * Graduate School of Engineering, Saitama Institute of Technology, Japan（日本埼玉工业大学工学研究科）； Tensor Learning Unit, RIKEN Center for Advanced Intelligence Project (AIP), Japan（日本RIKEN先进人工智能项目（AIP）张量学习单元）； School of Automation, Guangdong University of Technology, China（广东工业大学自动化学院）； School of Computer Science and Technology, Hangzhou Dianzi University, China（杭州电子科技大学计算机科学与技术学院）

AI总结本文提出了一种基于张量列车格式的梯度优化方法，用于补全高维张量中的缺失数据，通过寻找低秩张量列车分解来捕捉数据的潜在特征，并利用梯度下降算法高效解决张量补全问题，同时引入视觉数据张量化方法提升算法性能。

详情

AI中文摘要

张量列车（TT）分解因其在高阶张量中的强大表示能力和稳定性而受到关注。本文提出了一种新的方法，用于恢复由高阶张量表示的不完整数据中的缺失条目。我们尝试找到不完整数据的低秩TT分解，以捕捉整个数据集的潜在特征，然后重建缺失条目。通过应用梯度下降算法，利用优化模型高效地解决了张量补全问题。我们提出了两种基于TT的算法：张量列车加权优化（TT-WOPT）和张量列车随机梯度下降（TT-SGD），用于优化TT分解因子。此外，提出了一种名为视觉数据张量化（VDT）的方法，将视觉数据转换为高阶张量，从而提升了我们算法的性能。在合成数据和视觉数据的实验中，我们的算法在高阶、高缺失率和大规模张量补全情况下表现出高效和优越的性能，相比最先进的补全算法。

英文摘要

Tensor train (TT) decomposition has drawn people's attention due to its powerful representation ability and performance stability in high-order tensors. In this paper, we propose a novel approach to recover the missing entries of incomplete data represented by higher-order tensors. We attempt to find the low-rank TT decomposition of the incomplete data which captures the latent features of the whole data and then reconstruct the missing entries. By applying gradient descent algorithms, tensor completion problem is efficiently solved by optimization models. We propose two TT-based algorithms: Tensor Train Weighted Optimization (TT-WOPT) and Tensor Train Stochastic Gradient Descent (TT-SGD) to optimize TT decomposition factors. In addition, a method named Visual Data Tensorization (VDT) is proposed to transform visual data into higher-order tensors, resulting in the performance improvement of our algorithms. The experiments in synthetic data and visual data show high efficiency and performance of our algorithms compared to the state-of-the-art completion algorithms, especially in high-order, high missing rate, and large-scale tensor completion situations.
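
The tensor-train format behind the abstract can be sketched as follows: each entry of the tensor is a product of one matrix slice per mode, and completion fits the cores by minimizing the error on observed entries only. The functions below are an illustrative sketch with hypothetical names; they evaluate an entry and a masked squared loss, and do not reproduce the TT-WOPT or TT-SGD update rules.

```python
def tt_entry(cores, index):
    """Evaluate one tensor entry from TT cores.

    cores[k][i] is the r_{k-1} x r_k matrix slice of core k for index i,
    with boundary ranks r_0 = r_N = 1.
    """
    vec = [1.0]  # running 1 x r_k row vector
    for core, i in zip(cores, index):
        mat = core[i]
        vec = [sum(vec[a] * mat[a][b] for a in range(len(vec)))
               for b in range(len(mat[0]))]
    return vec[0]

def masked_loss(cores, observed):
    """Half the squared error over observed entries only (index tuple -> value)."""
    return 0.5 * sum((tt_entry(cores, idx) - val) ** 2
                     for idx, val in observed.items())

# A 2 x 2 tensor with all TT ranks equal to 1: X[i, j] = a[i] * b[j].
cores = [
    [[[1.0]], [[2.0]]],  # core 1, slices for i = 0, 1
    [[[3.0]], [[4.0]]],  # core 2, slices for j = 0, 1
]
loss = masked_loss(cores, {(0, 0): 3.0, (1, 1): 10.0})
```

A gradient-based method would differentiate this loss with respect to the core slices; a stochastic variant would do so for one sampled observed entry at a time.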


1811.12084 2026-06-04 cs.CV cs.LG cs.NA math.AP math.NA 版本更新

Networks for Nonlinear Diffusion Problems in Imaging

图像中非线性扩散问题的网络

Simon Arridge, Andreas Hauptmann

发表机构 * Department of Computer Science（计算机科学系）； University College London（伦敦大学学院）

AI总结本文提出了一种基于非线性扩散过程的网络架构DiffNet，用于解决图像中的非线性扩散问题，该网络在可解释性和泛化能力方面优于传统卷积神经网络，并在非线性扩散逆问题上取得了与U-Net相当的性能。

详情

AI中文摘要

许多成像和视觉任务近期通过深度学习方法，特别是卷积神经网络的应用，经历了重大变革。这些方法在某些应用中取得了显著成果，即使这些应用并不明显表明卷积适合捕捉底层物理。在本文中，我们开发了一种基于非线性扩散过程的网络架构，称为DiffNet。通过设计，我们获得了一种适合图像中扩散相关问题的非线性网络架构。此外，所执行的更新是显式的，从而比传统卷积神经网络架构获得了更好的可解释性和泛化能力。在STL-10图像数据集上测试DiffNet在非线性扩散逆问题中的性能，使用Perona-Malik滤波器。我们获得的结果与已建立的U-Net架构具有竞争力，参数数量和必要的训练数据较少。

英文摘要

A multitude of imaging and vision tasks have recently seen a major transformation by deep learning methods and in particular by the application of convolutional neural networks. These methods achieve impressive results, even for applications where it is not apparent that convolutions are suited to capture the underlying physics. In this work we develop a network architecture based on nonlinear diffusion processes, named DiffNet. By design, we obtain a nonlinear network architecture that is well suited for diffusion-related problems in imaging. Furthermore, the performed updates are explicit, by which we obtain better interpretability and generalisability compared to classical convolutional neural network architectures. The performance of DiffNet is tested on the inverse problem of nonlinear diffusion with the Perona-Malik filter on the STL-10 image dataset. We obtain results competitive with the established U-Net architecture, with a fraction of the parameters and necessary training data.
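
For readers unfamiliar with the Perona-Malik filter named in the abstract, one explicit diffusion update in 1D can be sketched as below. This is the classical hand-written scheme, not DiffNet itself. The edge-stopping function g(s) = 1 / (1 + (s/kappa)^2), the zero-flux boundaries, and the parameter values are assumptions chosen for illustration.

```python
def edge_stop(s, kappa):
    """Perona-Malik diffusivity: close to 1 in flat regions, small across strong edges."""
    return 1.0 / (1.0 + (s / kappa) ** 2)

def perona_malik_step(u, dt=0.2, kappa=10.0):
    """One explicit time step of 1D Perona-Malik diffusion with zero-flux boundaries."""
    n = len(u)
    # flux[i] is the diffusive flux between samples i and i + 1
    flux = [edge_stop(abs(u[i + 1] - u[i]), kappa) * (u[i + 1] - u[i])
            for i in range(n - 1)]
    out = []
    for i in range(n):
        right = flux[i] if i < n - 1 else 0.0
        left = flux[i - 1] if i > 0 else 0.0
        out.append(u[i] + dt * (right - left))
    return out

# A step edge of height 100 is barely smoothed when kappa = 10.
step = [0.0, 0.0, 100.0, 100.0]
after = perona_malik_step(step)
```

Because the update is explicit, each step is a local, differentiable operation, which is the property the abstract connects to interpretability.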


1510.02923 2026-06-04 math.AP cs.CV cs.NA math.NA 版本更新

On 1-Laplacian Elliptic Equations Modeling Magnetic Resonance Image Rician Denoising

关于1-拉普拉斯椭圆方程建模磁共振图像Rician去噪

Adrian Martin, Emanuele Schiavi, Sergio Segura de Leon

AI总结本文研究了利用总变分（TV）在贝叶斯或广义Tikhonov框架中建模磁共振图像Rician去噪的问题，推导出非线性椭圆方程，涉及1-拉普拉斯算子，并通过修正的一阶贝塞尔函数定义反应项，提出存在性理论和解的性质，采用收敛的近点算法直接求解非光滑非凸优化问题，通过合成和真实MRI数据验证了方法的有效性，并应用于扩散张量图像。

详情

DOI: 10.1007/s10851-016-0675-3

AI中文摘要

在贝叶斯或广义Tikhonov框架中利用总变分（TV）建模磁共振图像（MRI）的Rician去噪问题，自然导致非线性椭圆方程的考虑。这些方程涉及所谓的1-拉普拉斯算子，需要特别注意问题的正确建模。通过引入描述数据的Rician统计学，通过一个带有反应项的奇异方程来定义，该反应项由修正的一阶贝塞尔函数定义。本文提供了存在性理论和其他解的定性性质。值得注意的是，相关函数的每个正全局极小值都是此类解之一。此外，本文直接使用收敛的近点算法解决这一非光滑非凸最小化问题。基于合成和真实MRI的数据结果表明，所提出的方法在Rician去噪中优于之前的基于TV的模型，这些模型通过正则化或凸化问题来处理。最后，还展示了在受Rician噪声强烈影响的MRI模态——扩散张量图像上的应用，并进行了讨论。

英文摘要

Modeling Rician denoising of magnitude Magnetic Resonance Images (MRI) in a Bayesian or generalized Tikhonov framework using Total Variation (TV) leads naturally to the consideration of nonlinear elliptic equations. These involve the so-called $1$-Laplacian operator, and special care is needed to properly formulate the problem. The Rician statistics of the data are introduced through a singular equation with a reaction term defined in terms of modified first order Bessel functions. An existence theory is provided here together with other qualitative properties of the solutions. Remarkably, each positive global minimum of the associated functional is one such solution. Moreover, we directly solve this non-smooth non-convex minimization problem using a convergent Proximal Point Algorithm. Numerical results based on synthetic and real MRI demonstrate a better performance of the proposed method when compared to previous TV-based models for Rician denoising, which regularize or convexify the problem. Finally, an application to real Diffusion Tensor Images, an MRI modality strongly affected by Rician noise, is presented and discussed.


1804.06128 2026-06-04 math.NA cs.CV cs.NA 版本更新

Fast and Accurate Tensor Completion with Total Variation Regularized Tensor Trains

快速且准确的张量补全与总变分正则化张量列车

Ching-Yun Ko, Kim Batselier, Wenjian Yu, Ngai Wong

发表机构 * Department of Electrical and Electronic Engineering, The University of Hong Kong（香港大学电子工程系）； Delft Center for Systems and Control, Delft University of Technology（代尔夫特理工大学系统与控制中心）

AI总结本文提出了一种基于张量列车的新型张量补全方法，通过总变分和Tikhonov正则化提升了补全速度和可扩展性，尤其在已知数据极少时表现优异。

Comments 13 pages. Source code and supplemental materials are available via: https://github.com/IRENEKO/TTC Updates 11/13: included more comparisons and experimental results

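1811.03621 2026-06-04 cs.HC cs.CV cs.LG cs.SY eess.SY stat.ML (updated version)

Satyam: Democratizing Groundtruth for Machine Vision

Hang Qiu, Krishna Chintalapudi, Ramesh Govindan

Affiliations: University of Southern California; Microsoft Research

AI summary: This paper presents Satyam, a system that simplifies the workflow so that non-experts can efficiently collect groundtruth for machine vision, thereby improving the performance of autonomous driving, traffic monitoring, and video surveillance systems.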

Abstract

The democratization of machine learning (ML) has led to ML-based machine vision systems for autonomous driving, traffic monitoring, and video surveillance. However, true democratization cannot be achieved without greatly simplifying the process of collecting groundtruth for training and testing these systems. This groundtruth collection is necessary to ensure good performance under varying conditions. In this paper, we present the design and evaluation of Satyam, a first-of-its-kind system that enables a layperson to launch groundtruth collection tasks for machine vision with minimal effort. Satyam leverages a crowdtasking platform, Amazon Mechanical Turk, and automates several challenging aspects of groundtruth collection: creating and launching of custom web-UI tasks for obtaining the desired groundtruth, controlling result quality in the face of spammers and untrained workers, adapting prices to match task complexity, filtering spammers and workers with poor performance, and processing worker payments. We validate Satyam using several popular benchmark vision datasets, and demonstrate that groundtruth obtained by Satyam is comparable to that obtained from trained experts and provides matching ML performance when used for training.


1802.00285 2026-06-04 cs.CV cs.RO cs.SY eess.SY (updated version)

Virtual-to-Real: Learning to Control in Visual Semantic Segmentation

Zhang-Wei Hong, Chen Yu-Ming, Shih-Yang Su, Tzu-Yun Shann, Yi-Hsiang Chang, Hsuan-Kung Yang, Brian Hsi-Lin Ho, Chih-Chieh Tu, Yueh-Chuan Chang, Tsu-Ching Hsiao, Hsin-Wei Hsiao, Sih-Pin Lai, Chun-Yi Lee

Affiliations: Elsa Lab; Department of Computer Science; National Tsing Hua University

AI summary: This paper proposes a modular architecture that combines a perception module with a control policy module and uses semantic image segmentation as the meta representation to address virtual-to-real transfer, showing superior performance on obstacle avoidance and target following tasks.

Comments 7 pages, accepted by IJCAI-18

Abstract

Collecting training data from the physical world is usually time-consuming and even dangerous for fragile robots, and thus, recent advances in robot learning advocate the use of simulators as the training platform. Unfortunately, the reality gap between synthetic and real visual data prohibits direct migration of the models trained in virtual worlds to the real world. This paper proposes a modular architecture for tackling the virtual-to-real problem. The proposed architecture separates the learning model into a perception module and a control policy module, and uses semantic image segmentation as the meta representation for relating these two modules. The perception module translates the perceived RGB image to semantic image segmentation. The control policy module is implemented as a deep reinforcement learning agent, which performs actions based on the translated image segmentation. Our architecture is evaluated in an obstacle avoidance task and a target following task. Experimental results show that our architecture significantly outperforms all of the baseline methods in both virtual and real environments, and demonstrates a faster learning curve than them. We also present a detailed analysis for a variety of variant configurations, and validate the transferability of our modular architecture.


1703.09971 2026-06-04 cs.CV cs.NA math.DS math.NA (updated version)

A Geometric Framework for Stochastic Shape Analysis

Alexis Arnaudon, Darryl D. Holm, Stefan Sommer

Affiliations: Department of Mathematics, Imperial College; Department of Computer Science (DIKU), University of Copenhagen

AI summary: This paper proposes a geometric framework based on a stochastic model of diffeomorphisms for analysing the evolution of shapes, images, and landmarks, studies the stochastic evolution via the Fokker-Planck equation and numerical simulations, and derives two parameter inference methods.

DOI: 10.1007/s10208-018-9394-z

Abstract

We introduce a stochastic model of diffeomorphisms, whose action on a variety of data types descends to stochastic evolution of shapes, images and landmarks. The stochasticity is introduced in the vector field which transports the data in the Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework for shape analysis and image registration. The stochasticity thereby models errors or uncertainties of the flow in following the prescribed deformation velocity. The approach is illustrated in the example of finite dimensional landmark manifolds, whose stochastic evolution is studied both via the Fokker-Planck equation and by numerical simulations. We derive two approaches for inferring parameters of the stochastic model from landmark configurations observed at discrete time points. The first of the two approaches matches moments of the Fokker-Planck equation to sample moments of the data, while the second approach employs an Expectation-Maximisation based algorithm using a Monte Carlo bridge sampling scheme to optimise the data likelihood. We derive and numerically test the ability of the two approaches to infer the spatial correlation length of the underlying noise.


1709.00483 2026-06-04 math.NA cs.CV cs.NA math.OC stat.ML (updated version)

Iteratively Linearized Reweighted Alternating Direction Method of Multipliers for a Class of Nonconvex Problems

Tao Sun, Hao Jiang, Lizhi Cheng, Wei Zhu

Affiliations: Department of Mathematics, National University of Defense Technology; College of Computer, National University of Defense Technology; The State Key Laboratory for High Performance Computation, National University of Defense Technology; Hunan Key Laboratory for Computation and Simulation in Science and Engineering, School of Mathematics and Computational Science, Xiangtan University

AI summary: This paper proposes an iteratively linearized reweighted alternating direction method of multipliers for the nonconvex and nonsmooth problems common in signal processing and machine learning; the method turns each subproblem into a convex problem to improve solution efficiency, and its global convergence is proved.

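1710.06647 2026-06-04 cs.CV cs.NA math.NA (updated version)

Image Restoration by Iterative Denoising and Backward Projections

Tom Tirer, Raja Giryes

Affiliations: School of Electrical Engineering, Tel Aviv University

AI summary: This paper proposes an alternative method for solving inverse problems with off-the-shelf denoisers: it transforms a typical cost function into a new optimization problem and introduces an efficient minimization scheme and an automatic tuning mechanism, reducing parameter tuning and improving image inpainting and deblurring results.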

Comments To appear in IEEE Transactions on Image Processing

Abstract

Inverse problems appear in many applications, such as image deblurring and inpainting. The common approach to address them is to design a specific algorithm for each problem. The Plug-and-Play (P&P) framework, which has been recently introduced, allows solving general inverse problems by leveraging the impressive capabilities of existing denoising algorithms. While this fresh strategy has found many applications, a burdensome parameter tuning is often required in order to obtain high-quality results. In this work, we propose an alternative method for solving inverse problems using off-the-shelf denoisers, which requires less parameter tuning. First, we transform a typical cost function, composed of fidelity and prior terms, into a closely related, novel optimization problem. Then, we propose an efficient minimization scheme with a plug-and-play property, i.e., the prior term is handled solely by a denoising operation. Finally, we present an automatic tuning mechanism to set the method's parameters. We provide a theoretical analysis of the method, and empirically demonstrate its competitiveness with task-specific techniques and the P&P approach for image inpainting and deblurring.


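1810.03275 2026-06-04 math.NA cs.CV cs.NA (updated version)

TV-regularized CT Reconstruction and Metal Artifact Reduction Using Inequality Constraints with Preconditioning

Clemens Schiffer

Affiliations: Karl-Franzens-Universität Graz

AI summary: This paper proposes a TV-regularized method combining inequality constraints with preconditioning to reduce metal artifacts in CT reconstruction; fast convergence is achieved with the Chambolle-Pock algorithm and with preconditioned Douglas-Rachford splitting and ADMM, and the model is validated on real and synthetic data.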

Comments Master's Thesis, as submitted at the University of Graz

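1809.07399 2026-06-04 math.NA cs.CV cs.NA (updated version)

Nonisometric Surface Registration via Conformal Laplace-Beltrami Basis Pursuit

Stefan C. Schonsheck, Michael M. Bronstein, Rongjie Lai

Affiliations: Department of Mathematics, Rensselaer Polytechnic Institute; Institute of Computational Science; Università della Svizzera italiana

AI summary: This paper proposes a variational model that aligns the Laplace-Beltrami (LB) eigensystems of non-isometric genus-zero surfaces via conformal deformations. A new basis pursuit scheme simultaneously computes a conformal deformation of the target shape and its deformed LB eigensystem; a proximal alternating minimization algorithm hybridized with the augmented Lagrangian method yields accurate correspondences from only a few landmark points.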

Comments 21 pages, 7 figures

Abstract

Surface registration is one of the most fundamental problems in geometry processing. Many approaches have been developed to tackle this problem in cases where the surfaces are nearly isometric. However, it is much more challenging to compute correspondence between surfaces which are intrinsically less similar. In this paper, we propose a variational model to align the Laplace-Beltrami (LB) eigensystems of two non-isometric genus zero shapes via conformal deformations. This method enables us to compute geometrically meaningful point-to-point maps between non-isometric shapes. Our model is based on a novel basis pursuit scheme whereby we simultaneously compute a conformal deformation of a 'target shape' and its deformed LB eigensystem. We solve the model using a proximal alternating minimization algorithm hybridized with the augmented Lagrangian method, which produces accurate correspondences given only a few landmark points. We also propose a reinitialization scheme to overcome some of the difficulties caused by the non-convexity of the variational problem. Intensive numerical experiments illustrate the effectiveness and robustness of the proposed method in handling non-isometric surfaces with large deformation, with respect to both noise on the underlying manifolds and errors within the given landmarks.


1608.01825 2026-06-04 physics.med-ph cs.CV cs.NA math.NA (updated version)

Compartmental analysis of dynamic nuclear medicine data: regularization procedure and application to physiology

Delbary Fabrice, Garbarino Sara

Affiliations: Centre for Medical Image Computing, Department of Computer Science, University College London

AI summary: This paper proposes a regularization procedure for compartmental models, based on a regularized multivariate Gauss-Newton method, to estimate tracer coefficients, and applies it to experimental studies of cerebral, hepatic, and renal function.

DOI: 10.1080/17415977.2018.1512603
Journal ref: Inverse Problems in Science and Engineering 2018

Abstract

Compartmental models based on tracer mass balance are extensively used in clinical and pre-clinical nuclear medicine in order to obtain quantitative information on tracer metabolism in the biological tissue. This paper is the second of a series of two that deal with the problem of tracer coefficient estimation via compartmental modelling in an inverse problem framework. While the previous work was devoted to the discussion of identifiability issues for 2-, 3- and n-dimensional compartmental systems, here we discuss the problem of numerically determining the tracer coefficients by means of a general regularized Multivariate Gauss-Newton scheme. In this paper, applications concerning cerebral, hepatic and renal functions are considered, involving experimental measurements on FDG-PET data from different sets of murine models.


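1601.05585 2026-06-04 eess.SY cs.CV cs.SY (updated version)

Generalized optimal sub-pattern assignment metric

Abu Sajana Rahmathullah, Ángel F. García-Fernández, Lennart Svensson

Affiliations: Zenuity AB; Aalto University; Chalmers University of Technology

AI summary: This paper proposes the generalized optimal sub-pattern assignment (GOSPA) metric on the space of finite sets of targets. GOSPA is unnormalized with respect to cardinality and is expressed as an optimization over assignments rather than permutations, which allows a sounder evaluation of multiple target tracking algorithms.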

Comments The paper received the Jean Pierre Le Cadre best paper award at the 20th International Conference on Information Fusion, July 2017. A Matlab implementation of the proposed GOSPA metric is available in https://github.com/abusajana/GOSPA Also visit https://youtu.be/M79GTTytvCM for a 15-min presentation about the paper

DOI: 10.23919/ICIF.2017.8009645
Journal ref: Proceedings of the 20th International Conference on Information Fusion (Fusion), 2017

Abstract

This paper presents the generalized optimal sub-pattern assignment (GOSPA) metric on the space of finite sets of targets. Compared to the well-established optimal sub-pattern assignment (OSPA) metric, GOSPA is unnormalized as a function of the cardinality and it penalizes cardinality errors differently, which enables us to express it as an optimisation over assignments instead of permutations. An important consequence of this is that GOSPA allows us to penalize localization errors for detected targets and the errors due to missed and false targets, as indicated by traditional multiple target tracking (MTT) performance measures, in a sound manner. In addition, we extend the GOSPA metric to the space of random finite sets, which is important to evaluate MTT algorithms via simulations in a rigorous way.
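The way GOSPA combines localization and cardinality errors can be illustrated with a small sketch. It assumes the commonly cited definition (for |X| ≤ |Y|, the p-th root of the minimum summed cut-off distance over pairings, plus a penalty of c^p/α per unpaired target), which for α = 2 coincides with the assignment form mentioned in the abstract; this should be checked against the paper. The function name `gospa`, the Euclidean base distance, the default parameter values and the brute-force search are illustrative assumptions, not the authors' Matlab implementation.

```python
from itertools import permutations
import math


def gospa(X, Y, c=2.0, p=2, alpha=2.0):
    """Brute-force GOSPA between two finite sets of points (small sets only)."""
    # The metric is symmetric, so let X be the smaller set.
    if len(X) > len(Y):
        X, Y = Y, X
    n, m = len(X), len(Y)

    def cut_off(x, y):
        # Base (Euclidean) distance saturated at the cut-off value c.
        return min(math.dist(x, y), c)

    # Minimum localization cost over all ways of pairing every point of X
    # with a distinct point of Y.
    best = min(
        sum(cut_off(X[i], Y[j]) ** p for i, j in enumerate(perm))
        for perm in permutations(range(m), n)
    )
    # Each of the m - n unpaired points incurs a penalty of c^p / alpha.
    return (best + (c ** p / alpha) * (m - n)) ** (1.0 / p)


truth = [(0.0, 0.0), (5.0, 5.0)]
estimate = [(0.0, 1.0)]
# One localization error of 1 plus one missed target: sqrt(1**2 + 2**2 / 2).
value = gospa(truth, estimate)
```

The brute-force search is only practical for a handful of targets; a real evaluation would solve the underlying assignment problem with a dedicated solver.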


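1809.03314 2026-06-04 cs.CV cs.SY eess.SY (updated version)

A Robotic Auto-Focus System based on Deep Reinforcement Learning

Xiaofan Yu, Runze Yu, Jingsong Yang, Xiaohui Duan

Affiliations: Center of Wireless Communication and Signal Processing

AI summary: This paper proposes an end-to-end auto-focus method that uses deep reinforcement learning to learn focusing policies from visual input and reach a sharp image automatically. By discretizing the action space and applying DQN, the method solves the auto-focus problem and generalizes to vision-based control problems.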

Comments To Appear at ICARCV 2018

Abstract

Considering its advantages in dealing with high-dimensional visual input and learning control policies in discrete domains, Deep Q Network (DQN) could be an alternative to traditional auto-focus methods in the future. In this paper, based on Deep Reinforcement Learning, we propose an end-to-end approach that can learn auto-focus policies from visual input and finish at a clear spot automatically. We demonstrate that our method, discretizing the action space with coarse-to-fine steps and applying DQN, is not only a solution to auto-focus but also a general approach towards vision-based control problems. Separate phases of training in virtual and real environments are applied to obtain an effective model. Virtual experiments, carried out after the virtual training phase, indicate that our method could achieve 100% accuracy on a certain view with different focus ranges. Further training on real robots could eliminate the deviation between the simulator and the real scenario, leading to reliable performance in real applications.


1802.07072 2026-06-04 math.OC cs.CV cs.NA math.NA (updated version)

Composite Optimization by Nonconvex Majorization-Minimization

Jonas Geiping, Michael Moeller

Affiliations: University of Siegen

AI summary: This paper proposes a nonconvex majorization-minimization method for optimizing nonconvex composite functions, proves that it achieves global convergence, and demonstrates its advantages experimentally on deep super-resolution.

Comments 38 pages, 12 figures, accepted for publication in SIIMS

1806.00728 2026-06-04 stat.ML cs.CV cs.LG cs.SY eess.SP eess.SY (updated version)

Data-Free/Data-Sparse Softmax Parameter Estimation with Structured Class Geometries

Nisar Ahmed

Affiliations: H.J. Smead Aerospace Engineering Sciences, University of Colorado, Boulder, Colorado 80309

AI summary: This paper proposes softmax parameter estimation with little or no labeled data, using prior information about the structured geometry of class-label log-odds boundaries; the parameters are obtained by solving a system of linear equations, without expensive data sampling and optimization.

Comments Final version accepted to IEEE Signal Processing Letters (double column), submitted July 21, 2018

DOI: 10.1109/LSP.2018.2860238

Abstract

This note considers softmax parameter estimation when little/no labeled training data is available, but a priori information about the relative geometry of class label log-odds boundaries is available. It is shown that `data-free' softmax model synthesis corresponds to solving a linear system of parameter equations, wherein desired dominant class log-odds boundaries are encoded via convex polytopes that decompose the input feature space. When solvable, the linear equations yield closed-form softmax parameter solution families using class boundary polytope specifications only. This allows softmax parameter learning to be implemented without expensive brute force data sampling and numerical optimization. The linear equations can also be adapted to constrained maximum likelihood estimation in data-sparse settings. Since solutions may also fail to exist for the linear parameter equations derived from certain polytope specifications, it is thus also shown that there exist probabilistic classification problems over m convexly separable classes for which the log-odds boundaries cannot be learned using an m-class softmax model.
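The reason boundary specifications lead to linear equations can be made explicit with a short derivation. The symbols $w_i$, $b_i$, $n_{ij}$, $d_{ij}$ and $\lambda_{ij}$ are notation introduced here for illustration and are not necessarily the paper's.

```latex
\begin{align*}
% Softmax model over m classes with weights w_i and biases b_i
P(i \mid x) &= \frac{\exp(w_i^\top x + b_i)}{\sum_{k=1}^{m} \exp(w_k^\top x + b_k)} \\
% The log-odds between classes i and j are affine in the input x
\log \frac{P(i \mid x)}{P(j \mid x)} &= (w_i - w_j)^\top x + (b_i - b_j) \\
% Requiring this boundary to coincide with a prescribed polytope face
% n_{ij}^\top x + d_{ij} = 0, for a chosen scale \lambda_{ij} > 0
w_i - w_j &= \lambda_{ij}\, n_{ij}, \qquad b_i - b_j = \lambda_{ij}\, d_{ij}
\end{align*}
```

Stacking one such pair of equations per specified face gives a linear system in the softmax parameters. The system can be inconsistent, which matches the abstract's remark that solutions may fail to exist for certain polytope specifications.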


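1708.01244 2026-06-04 math.NA cs.CV cs.NA (updated version)

Image reconstruction with imperfect forward models and applications in deblurring

Yury Korolev, Jan Lellmann

AI summary: This paper proposes an image reconstruction approach in partially ordered spaces (Banach lattices) that describes errors in the data and in the forward model by order intervals, and analyses the convexity of the feasible set and applications to deblurring.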

1509.02223 2026-06-04 cs.CV cs.NA math.NA (updated version)

Diffusion tensor imaging with deterministic error bounds

Artur Gorokh, Yury Korolev, Tuomo Valkonen

Affiliations: Faculty of Physics, Lomonosov Moscow State University; School of Engineering and Materials Science, Queen Mary University of London

AI summary: This paper uses the theory of partial order in Banach lattices to model errors in inverse problems and applies it to the difficult noise modelling problem in diffusion tensor imaging, where deterministic error bounds simplify the treatment of the nonlinear Stejskal-Tanner equation.

1608.02702 2026-06-04 cs.CV cs.NA math.NA (updated version)

Steerable Principal Components for Space-Frequency Localized Images

Boris Landa, Yoel Shkolnisky

Affiliations: Department of Applied Mathematics, School of Mathematical Sciences

AI summary: This paper proposes a fast and accurate method that expands images in two-dimensional Prolate Spheroidal Wave Functions to obtain steerable principal components, which are optimal for expanding the images and all of their rotations.

DOI: 10.1137/16M1085334

Abstract

This paper describes a fast and accurate method for obtaining steerable principal components from a large dataset of images, assuming the images are well localized in space and frequency. The obtained steerable principal components are optimal for expanding the images in the dataset and all of their rotations. The method relies upon first expanding the images using a series of two-dimensional Prolate Spheroidal Wave Functions (PSWFs), where the expansion coefficients are evaluated using a specially designed numerical integration scheme. Then, the expansion coefficients are used to construct a rotationally-invariant covariance matrix which admits a block-diagonal structure, and the eigen-decomposition of its blocks provides us with the desired steerable principal components. The proposed method is shown to be faster than existing methods, while providing appropriate error bounds which guarantee its accuracy.


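1807.11534 2026-06-04 math.NA cs.CV cs.NA (updated version)

A Restricted-Domain Dual Formulation for Two-Phase Image Segmentation

Jack Spencer

Affiliations: Department of Mathematics, University of Liverpool, UK

AI summary: This paper examines properties of the data fitting term in two-phase image segmentation, proposes solving the dual formulation on a restricted domain to improve computational efficiency, and validates the approach experimentally.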

1807.10757 2026-06-04 cs.CV cs.NA math.NA (updated version)

A multi-contrast MRI approach to thalamus segmentation

Veronica Corona, Jan Lellmann, Peter Nestor, Carola-Bibiane Schoenlieb, Julio Acosta-Cabronero

Affiliations: Department of Applied Mathematics and Theoretical Physics, University of Cambridge; Queensland Brain Institute, University of Queensland; Mater Hospital, South Brisbane, Queensland, Australia; Wellcome Centre for Human Neuroimaging, UCL Institute of Neurology, University College London, United Kingdom; German Center for Neurodegenerative Diseases (DZNE), Magdeburg, Germany

AI summary: This paper proposes a multi-modality MRI segmentation method that uses multi-contrast data to improve the accuracy of thalamic subnuclei segmentation, combining iterative registration, manual segmentation on a template, supervised learning, and convex optimization to improve segmentation performance and robustness.

Abstract

Thalamic alterations are relevant to many neurological disorders including Alzheimer's disease, Parkinson's disease and multiple sclerosis. Routine interventions to improve symptom severity in movement disorders, for example, often consist of surgery or deep brain stimulation to diencephalic nuclei. Therefore, accurate delineation of grey matter thalamic subregions is of the utmost clinical importance. MRI is highly appropriate for structural segmentation as it provides different views of the anatomy from a single scanning session. With several contrasts potentially available, it is also of increasing importance to develop new image segmentation techniques that can operate multi-spectrally. We hereby propose a new segmentation method for use with multi-modality data, which we evaluated for automated segmentation of major thalamic subnuclear groups using T1-, T2*-weighted and quantitative susceptibility mapping (QSM) information. The proposed method consists of four steps: highly iterative image co-registration, manual segmentation on the average training-data template, supervised learning for pattern recognition, and a final convex optimisation step imposing further spatial constraints to refine the solution. This led to solutions in greater agreement with manual segmentation than the standard Morel atlas-based approach. Furthermore, we show that the multi-contrast approach boosts segmentation performance. We then investigated whether prior knowledge using the training-template contours could further improve convex segmentation accuracy and robustness, which led to highly precise multi-contrast segmentations in single subjects. This approach can be extended to most 3D imaging data types and any region of interest discernible in single scans or multi-subject templates.


1806.10472 2026-06-04 cs.CV cs.NA math.NA (updated version)

Homogeneity of a region in the logarithmic image processing framework: application to region growing algorithms

Michel Jourlin, Guillaume Noyel

Affiliations: Lab. H. Curien, UMR CNRS 5516; University of Strathclyde Institute of Global Public Health; International Prevention Research Institute, iPRI

AI summary: This paper examines the role of logarithmic image processing (LIP) operators in evaluating the homogeneity of a region, proposes two new heterogeneity criteria, and improves Revol's technique to increase robustness to contrast variations and reduce the chaining effect in region growing.

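1804.03415 2026-06-04 math.NA cs.CV cs.NA (updated version)

A Fast Hierarchically Preconditioned Eigensolver Based On Multiresolution Matrix Decomposition

Thomas Y. Hou, De Huang, Ka Chun Lam, Ziyun Zhang

AI summary: This paper proposes an iterative method, built on a multiresolution operator compression framework, for efficiently computing a large number of leftmost eigenpairs of a sparse symmetric positive matrix; it integrates the implicitly restarted Lanczos method with the multiresolution framework and proposes an extension-refinement iterative scheme to improve computational efficiency.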

Comments 46 pages, 11 figures, 10 tables

1710.04265 2026-06-04 math.NA cs.CV cs.NA (updated version)

Solutions of Quadratic First-Order ODEs applied to Computer Vision Problems

David Casillas-Perez, Daniel Pizarro, Manuel Mazo, Adrien Bartoli

Affiliations: Department of Electronic, University of Alcalá; ISIT - CNRS/Université d’Auvergne

AI summary: This paper studies existence and uniqueness of solutions for a particular class of quadratic first-order ODEs, explores their application to planar-perspective curve reconstruction, and introduces the maximal depth function and the maximal-depth solution problem.

Comments The version 2: New change of variable. Maximal Curve Maximal Solution Convergence Cones The version 3: modifies the author's list and the abstract in metadata

1709.05746 2026-06-04 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY (updated version)

Adversarial Discriminative Sim-to-real Transfer of Visuo-motor Policies

Fangyi Zhang, Jürgen Leitner, Zongyuan Ge, Michael Milford, Peter Corke

Affiliations: Australian Centre for Robotic Vision (ACRV); Queensland University of Technology (QUT); Monash University

AI summary: This paper proposes an adversarial discriminative sim-to-real transfer approach that reduces the cost of labelling real data. In a table-top object reaching task, a 7 DoF arm is controlled from visual observations to reach a blue cuboid in clutter, achieving a 97.8% success rate and 1.8 cm control accuracy with only 93 labelled and 186 unlabelled real images.

Comments Under review for the International Journal of Robotics Research

Abstract

Various approaches have been proposed to learn visuo-motor policies for real-world robotic applications. One solution is first learning in simulation then transferring to the real world. In the transfer, most existing approaches need real-world images with labels. However, the labelling process is often expensive or even impractical in many robotic applications. In this paper, we propose an adversarial discriminative sim-to-real transfer approach to reduce the cost of labelling real data. The effectiveness of the approach is demonstrated with modular networks in a table-top object reaching task where a 7 DoF arm is controlled in velocity mode to reach a blue cuboid in clutter through visual observations. The adversarial transfer approach reduced the labelled real data requirement by 50%. Policies can be transferred to real environments with only 93 labelled and 186 unlabelled real images. The transferred visuo-motor policies are robust to novel (not seen in training) objects in clutter and even a moving target, achieving a 97.8% success rate and 1.8 cm control accuracy.


1710.01493 2026-06-04 cs.LG cs.CV cs.NA math.NA math.OC (updated version)

Image Labeling Based on Graphical Models Using Wasserstein Messages and Geometric Assignment

Ruben Hühnerbein, Fabrizio Savarino, Freddie Åström, Christoph Schnörr

Affiliations: Image and Pattern Analysis Group, Heidelberg University, Germany; Heidelberg Collaboratory for Image Processing, Heidelberg University, Germany

AI summary: This paper proposes a new approach to maximum a posteriori inference in discrete graphical models that approximates the objective function with local Wasserstein distances and achieves convergence in parallel.

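1804.02307 2026-06-04 math.OC cs.CV cs.NA math.NA (updated version)

Accelerated Optimization in the PDE Framework: Formulations for the Manifold of Diffeomorphisms

Ganesh Sundaramoorthi, Anthony Yezzi

Affiliations: KAUST (King Abdullah University of Science and Technology); Georgia Institute of Technology

AI summary: This paper proposes a new approach to optimization problems on the manifold of diffeomorphisms by generalizing Nesterov accelerated optimization to infinite-dimensional manifolds; it derives continuum evolution equations, relates them to principles of fluid mechanics, and draws connections to the optimal mass transport problem.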

Abstract

We consider the problem of optimization of cost functionals on the infinite-dimensional manifold of diffeomorphisms. We present a new class of optimization methods, valid for any optimization problem set up on the space of diffeomorphisms, by generalizing Nesterov accelerated optimization to the manifold of diffeomorphisms. While our framework is general for infinite dimensional manifolds, we specifically treat the case of diffeomorphisms, motivated by optical flow problems in computer vision. This is accomplished by building on a recent variational approach to a general class of accelerated optimization methods by Wibisono, Wilson and Jordan, which applies in finite dimensions. We generalize that approach to infinite dimensional manifolds. We derive the surprisingly simple continuum evolution equations, which are partial differential equations, for accelerated gradient descent, and relate them to simple mechanical principles from fluid mechanics. Our approach has natural connections to the optimal mass transport problem. This is because one can think of our approach as an evolution of an infinite number of particles endowed with mass (represented with a mass density) that move in an energy landscape. The mass evolves with the optimization variable, and endows the particles with dynamics. This is different from the finite dimensional case, where only a single particle moves and hence the dynamics do not depend on the mass. We derive the theory, compute the PDEs for accelerated optimization, and illustrate the behavior of these new accelerated optimization schemes.


1805.09408 2026-06-04 cs.CV cs.NA math.NA (updated version)

Non-convex non-local flows for saliency detection

Iván Ramírez, Gonzalo Galiano, Emanuele Schiavi

Affiliations: Dpt. of Mathematics, Universidad Rey Juan Carlos; Dpt. of Mathematics, Universidad de Oviedo

AI summary: This paper proposes and solves new variational models for automatic saliency detection in digital images, combining a non-local framework with a new quadratic saliency detection term, and applies them to glioma segmentation in MRI-Flair images.

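1805.08095 2026-06-04 cs.LG cs.CV cs.NA math.NA stat.ML (updated version)

Small steps and giant leaps: Minimal Newton solvers for Deep Learning

João F. Henriques, Sebastien Ehrhardt, Samuel Albanie, Andrea Vedaldi

Affiliations: Visual Geometry Group, University of Oxford

AI summary: This paper proposes a fast second-order method that can serve as a drop-in replacement for current deep learning solvers. It needs only two additional forward-mode automatic differentiation operations per iteration, at a cost comparable to two standard forward passes, and is easy to implement. It addresses long-standing issues with current second-order solvers by avoiding the costly, noise-sensitive inversion of an approximate Hessian matrix.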

详情

AI中文摘要

我们提出了一种快速的二阶方法，可作为现有深度学习求解器的替代方案。与随机梯度下降（SGD）相比，该方法每个迭代仅需两次额外的前向模式自动微分操作，计算成本与两次标准前向传递相当，且易于实现。我们的方法解决了现有二阶求解器的长期问题，即每次迭代精确或通过共轭梯度法计算近似Hessian矩阵的逆矩阵，这一过程成本高且对噪声敏感。相反，我们提出保持一个梯度的估计值，该估计值通过逆Hessian矩阵投影得到，并在每次迭代中更新一次。该估计值的大小相同，类似于SGD中常用的动量变量。不维护Hessian的估计值。我们首先在具有已知闭式解的小问题上验证了我们的方法，称为CurveBall，包括噪声Rosenbrock函数和退化的两层线性网络，其中现有深度学习求解器似乎难以处理。然后我们在CIFAR和ImageNet上训练了多个大型模型，包括ResNet和VGG-f网络，展示了无需超参数调优的更快收敛速度。代码已提供。

英文摘要

We propose a fast second-order method that can be used as a drop-in replacement for current deep learning solvers. Compared to stochastic gradient descent (SGD), it only requires two additional forward-mode automatic differentiation operations per iteration, which has a computational cost comparable to two standard forward passes and is easy to implement. Our method addresses long-standing issues with current second-order solvers, which invert an approximate Hessian matrix every iteration exactly or by conjugate-gradient methods, a procedure that is both costly and sensitive to noise. Instead, we propose to keep a single estimate of the gradient projected by the inverse Hessian matrix, and update it once per iteration. This estimate has the same size and is similar to the momentum variable that is commonly used in SGD. No estimate of the Hessian is maintained. We first validate our method, called CurveBall, on small problems with known closed-form solutions (noisy Rosenbrock function and degenerate 2-layer linear networks), where current deep learning solvers seem to struggle. We then train several large models on CIFAR and ImageNet, including ResNet and VGG-f networks, where we demonstrate faster convergence with no hyperparameter tuning. Code is available.

1804.10432 2026-06-04 math.NA cs.CV cs.NA math.DG version update

Variational Regularization of Inverse Problems for Manifold-Valued Data

Martin Storath, Andreas Weinmann

Affiliations: Image Analysis and Learning Group, Interdisciplinary Center for Scientific Computing, Universität Heidelberg, Germany; Department of Mathematics and Natural Sciences, Hochschule Darmstadt, and Institute of Computational Biology, Helmholtz Zentrum München, Germany

AI summary: This paper studies variational regularization of inverse problems for manifold-valued data, proposes TV and TGV regularization methods, and validates them on synthetic and real data.

1801.02686 2026-06-04 cs.CV cs.SY eess.SY version update

Towards Multi-Object Detection and Tracking in Urban Scenario under Uncertainties

Achim Kampker, Mohsen Sefati, Arya Abdul Rachman, Kai Kreisköther, Pascual Campoy

AI summary: This paper proposes a real-time framework that combines occlusion-aware detection on 3D LIDAR data with efficient heuristic filtering to handle the uncertainties arising from sensor limitations and complex object motion in urban environments, achieving efficient multi-object tracking.

Comments Some significant editorial/editing issues are found upon review. Paper will undergo language re-proofing before resubmitted

DOI: 10.5220/0006706101560167
Journal ref: 4th.VEHITS.Proc. 109 (2018) 156-167

Abstract

Urban-oriented autonomous vehicles require reliable perception technology to cope with a high degree of uncertainty. The recently introduced compact 3D LIDAR sensor offers surround spatial information that can be exploited to enhance vehicle perception. We present a real-time integrated framework for multi-target object detection and tracking using 3D LIDAR, geared toward urban use. Our approach combines a sensor occlusion-aware detection method with computationally efficient heuristic rule-based filtering and adaptive probabilistic tracking to handle uncertainties arising from the sensing limitations of 3D LIDAR and the complexity of target object movement. Evaluation on real-world pre-recorded 3D LIDAR data and comparison with state-of-the-art works show that our framework is capable of achieving promising tracking performance in urban situations.

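1804.06114 2026-06-04 cs.LG cs.CV cs.NA math.NA stat.ML version update

A Support Tensor Train Machine

Cong Chen, Kim Batselier, Ching-Yun Ko, Ngai Wong

Affiliations: The Department of Electrical and Electronic Engineering, The University of Hong Kong

AI summary: This paper proposes a support tensor train machine, which replaces the rank-one tensor of the traditional support tensor machine with a tensor train to increase the model's expressive power; experiments show that it outperforms SVM and STM.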

Comments 7 pages

1612.00181 2026-06-04 cs.CV cs.NA math.NA version update

Monge's Optimal Transport Distance for Image Classification

Michael Snow, Jan Van lent

Affiliations: Department of Engineering Design and Mathematics, Centre for Machine Vision, University of the West of England

AI summary: This paper proposes using the Wasserstein distance to compare images, computed with an efficient numerical method for solving Monge's problem, and demonstrates its advantages for image classification with a 1-NN algorithm.

Comments 15 pages, 14 figures

1712.08585 2026-06-04 math.OC cs.CV cs.NA math.NA version update

Denoising of image gradients and total generalized variation denoising

Birgit Komander, Dirk A. Lorenz, Lena Vestweber

Affiliations: Institute of Analysis and Algebra; TU Braunschweig

AI summary: This paper revisits total variation denoising and proposes an augmented model that assumes an estimate of the image gradient is available, which improves reconstruction quality. It derives a model resembling total generalized variation denoising, proposes a constrained denoising model and a parameter-free variational denoising model, and runs numerical experiments with the Chambolle-Pock and Douglas-Rachford methods, confirming that preconditioning speeds up convergence.

Abstract

We revisit total variation denoising and study an augmented model where we assume that an estimate of the image gradient is available. We show that this increases the image reconstruction quality and derive that the resulting model resembles the total generalized variation denoising method, thus providing a new motivation for this model. Further, we propose to use a constraint denoising model and develop a variational denoising model that is basically parameter free, i.e. all model parameters are estimated directly from the noisy image. Moreover, we use Chambolle-Pock's primal dual method as well as the Douglas-Rachford method for the new models. For the latter one has to solve large discretizations of partial differential equations. We propose to do this in an inexact manner using the preconditioned conjugate gradients method and derive preconditioners for this. Numerical experiments show that the resulting method has good denoising properties and also that preconditioning does increase convergence speed significantly. Finally we analyze the duality gap of different formulations of the TGV denoising problem and derive a simple stopping criterion.

1710.10781 2026-06-04 math.NA cs.CV cs.LG cs.NA stat.ML version update

Stochastic variance reduced multiplicative update for nonnegative matrix factorization

Hiroyuki Kasai

Affiliations: Graduate School of Informatics and Engineering, The University of Electro-Communications

AI summary: This paper proposes a stochastic variance reduced multiplicative update algorithm that improves the convergence speed of nonnegative matrix factorization; numerical experiments confirm its superiority on several datasets.

Comments IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP2018)

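1705.10887 2026-06-04 stat.ML cs.CV cs.LG cs.NA math.NA version update

Efficient, sparse representation of manifold distance matrices for classical scaling

Javier S. Turek, Alexander Huth

Affiliations: Intel Labs; The University of Texas at Austin

AI summary: This paper proposes a sparse method based on biharmonic interpolation for efficiently representing manifold distance matrices; compared with existing methods it is 2 times faster and uses 20 times less memory, and it can handle large point sets.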

Comments Conference CVPR 2018

1803.08137 2026-06-04 cs.CV cs.AI cs.NA math.NA stat.ML version update

Robust Blind Deconvolution via Mirror Descent

Sathya N. Ravi, Ronak Mehta, Vikas Singh

AI summary: This paper studies the robustness and convergence of blind deconvolution and proposes an algorithm with theoretical guarantees that performs well in practice.

Abstract

We revisit the Blind Deconvolution problem with a focus on understanding its robustness and convergence properties. Provable robustness to noise and other perturbations is receiving recent interest in vision, from obtaining immunity to adversarial attacks to assessing and describing failure modes of algorithms in mission critical applications. Further, many blind deconvolution methods based on deep architectures internally make use of or optimize the basic formulation, so a clearer understanding of how this sub-module behaves, when it can be solved, and what noise injection it can tolerate is a first order requirement. We derive new insights into the theoretical underpinnings of blind deconvolution. The algorithm that emerges has nice convergence guarantees and is provably robust in a sense we formalize in the paper. Interestingly, these technical results play out very well in practice, where on standard datasets our algorithm yields results competitive with or superior to the state of the art. Keywords: blind deconvolution, robust continuous optimization

1803.07226 2026-06-04 cs.CV cs.DS cs.NA math.NA version update

Learning the Hierarchical Parts of Objects by Deep Non-Smooth Nonnegative Matrix Factorization

Jinshi Yu, Guoxu Zhou, Andrzej Cichocki, Shengli Xie

Affiliations: RIKEN; SKOLTECH

AI summary: This paper proposes a deep nonsmooth nonnegative matrix factorization method that learns hierarchical features of complex data through a deeper architecture; the nonnegativity constraints yield parts-based features while higher-level abstract features are extracted, and experiments show strong performance in clustering analysis.

Abstract

Nonsmooth Nonnegative Matrix Factorization (nsNMF) is capable of producing more localized, less overlapped feature representations than other variants of NMF while keeping a satisfactory fit to data. However, nsNMF, like other existing NMF methods, cannot learn hierarchical features of complex data because of its shallow structure. To fill this gap, we propose a deep nsNMF method, so named because it possesses a deeper architecture than standard nsNMF. The deep nsNMF not only gives parts-based features due to the nonnegativity constraints, but also creates higher-level, more abstract features by combining lower-level ones. An in-depth description of how the deep architecture can help to efficiently discover abstract features in dnsNMF is presented. We also show that the deep nsNMF has a close relationship with the deep autoencoder, suggesting that the proposed model inherits the major advantages of both deep learning and NMF. Extensive experiments demonstrate the standout performance of the proposed method in clustering analysis.

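1803.05026 2026-06-04 cs.LG cs.CV cs.IT cs.NA math.IT math.NA version update

Principal Component Analysis with Tensor Train Subspace

Wenqi Wang, Vaneet Aggarwal, Shuchin Aeron

Affiliations: Purdue University

AI summary: This paper proposes the TT-PCA algorithm, which estimates a structured tensor train subspace while preserving low-rank tensor structure; it is more robust than PCA and Tucker-PCA, and experiments confirm its effectiveness.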

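1701.01945 2026-06-04 math.OC cs.CV cs.NA math.NA version update

A Framework for Wasserstein-1-Type Metrics

Bernhard Schmitzer, Benedikt Wirth

AI summary: This paper proposes a unified framework that generalizes the Wasserstein-1 metric to a discrepancy measure between nonnegative measures of different mass, inheriting its convexity and computational efficiency, and demonstrates its usefulness in numerical experiments.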

Comments to appear in Journal of Convex Analysis

1803.03104 2026-06-04 eess.SY cs.CV cs.SY math.DS stat.ML version update

Applicability and interpretation of the deterministic weighted cepstral distance

Oliver Lauwers, Bart De Moor

Affiliations: KU Leuven, Department of Electrical Engineering (ESAT), STADIUS Center for Dynamical Systems, Signal Processing and Data Analytics

AI summary: Combining systems theory and machine learning, this paper studies the weighted cepstral distance for deterministic linear time invariant single input single output models and proposes a purely data-driven method that assesses system stability and phase type from input/output signal information.

Comments 18 pages, 5 figures, submitted for review to Automatica

Abstract

Quantifying similarity between data objects is an important part of modern data science. Deciding what similarity measure to use is very application dependent. In this paper, we combine insights from systems theory and machine learning, and investigate the weighted cepstral distance, which was previously defined for signals coming from ARMA models. We provide an extension of this distance to invertible deterministic linear time invariant single input single output models, and assess its applicability. We show that it can always be interpreted in terms of the poles and zeros of the underlying model, and that, in the case of stable, minimum-phase, or unstable, maximum-phase models, a geometrical interpretation in terms of subspace angles can be given. We then devise a method to assess stability and phase-type of the generating models, using only input/output signal information. In this way, we prove a connection between the extended weighted cepstral distance and a weighted cepstral model norm. In this way, we provide a purely data-driven way to assess different underlying dynamics of input/output signal pairs, without the need for any system identification step. This can be useful in machine learning tasks such as time series clustering. An iPython tutorial is published complementary to this paper, containing implementations of the various methods and algorithms presented here, as well as some numerical illustrations of the equivalences proven here.

1605.00031 2026-06-04 cs.LG cs.CV cs.NA math.NA stat.ML version update

Deep Convolutional Neural Networks on Cartoon Functions

Philipp Grohs, Thomas Wiatowski, Helmut Bölcskei

Affiliations: Dept. Math., ETH Zurich, Switzerland

AI summary: This paper studies the deformation stability of deep convolutional neural networks on cartoon functions and presents new results that take structural properties into account, applicable to signals with sharp and curved discontinuities.

Comments This is a slightly updated version of the paper published in the ISIT proceedings. Specifically, we corrected errors in the arguments on the volume of tubes. Note that this correction does not affect the main statements of the paper

DOI: 10.1109/ISIT.2016.7541482
Journal ref: Proc. of IEEE International Symposium on Information Theory (ISIT), Barcelona, Spain, pp. 1163-1167, July 2016

Abstract

Wiatowski and Bölcskei, 2015, proved that deformation stability and vertical translation invariance of deep convolutional neural network-based feature extractors are guaranteed by the network structure per se rather than the specific convolution kernels and non-linearities. While the translation invariance result applies to square-integrable functions, the deformation stability bound holds for band-limited functions only. Many signals of practical relevance (such as natural images) exhibit, however, sharp and curved discontinuities and are, hence, not band-limited. The main contribution of this paper is a deformation stability result that takes these structural properties into account. Specifically, we establish deformation stability bounds for the class of cartoon functions introduced by Donoho, 2001.

1705.07364 2026-06-04 cs.LG cs.CV cs.NA math.NA version update

Stabilizing Adversarial Nets With Prediction Methods

Abhay Yadav, Sohil Shah, Zheng Xu, David Jacobs, Tom Goldstein

Affiliations: University of Maryland, College Park

AI summary: This paper proposes a modified stochastic gradient descent method that stabilizes the training of adversarial networks so that it converges more reliably to saddle points, improving training stability and efficiency.

Comments Accepted at ICLR 2018

1801.09238 2026-06-04 eess.SY cs.CV cs.SY stat.AP version update

Performance Analysis of Robust Stable PID Controllers Using Dominant Pole Placement for SOPTD Process Models

Saptarshi Das, Kaushik Halder, Amitava Gupta

Affiliations: Department of Mathematics, College of Engineering, Mathematics and Physical Sciences, University of Exeter; Department of Power Engineering, Jadavpur University

AI summary: This paper proposes a new dominant pole placement PID controller design method for second order processes with time delay. A third order Pade approximation is used to constrain the locations of the closed loop dominant and non-dominant poles, and the effect of different types of non-dominant poles on the stability regions is analyzed.

Comments 50 pages, 42 figures, Knowledge-Based Systems, 2018

DOI: 10.1016/j.knosys.2018.01.030

Abstract

This paper derives new formulations for designing dominant pole placement based proportional-integral-derivative (PID) controllers to handle second order processes with time delays (SOPTD). Previously, similar attempts have been made for pole placement in delay-free systems. The presence of the time delay term manifests itself as a higher order system with a variable number of interlaced poles and zeros upon Pade approximation, which makes it difficult to achieve precise pole placement control. We here report the analytical expressions to constrain the closed loop dominant and non-dominant poles at the desired locations in the complex s-plane, using a third order Pade approximation for the delay term. However, invariance of the closed loop performance with different time delay approximations has also been verified using increasing orders of Pade approximation, representing close-to-reality higher order delay dynamics. The choice of the nature of non-dominant poles, e.g. all being complex, real or a combination of them, modifies the characteristic equation and influences the achievable stability regions. The effect of different types of non-dominant poles and the corresponding stability regions are obtained for nine test-bench processes indicating different levels of open-loop damping and lag to delay ratio. Next, we investigate which expression yields a wider stability region in the design parameter space by using Monte Carlo simulations while uniformly sampling a chosen design parameter space. Various time and frequency domain control performance parameters are investigated next, as well as their deviations with uncertain process parameters, using thousands of Monte Carlo simulations, around the robust stable solution for each of the nine test-bench processes.
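As a rough illustration of the Pade step mentioned in the abstract, the sketch below builds the standard diagonal Pade approximant of a pure delay term exp(-L s) and evaluates its third order instance, whose coefficients are 1, 1/2, 1/10 and 1/120. This is a minimal sketch using the textbook formula, not code from the paper; the names `pade_coeffs` and `pade_delay` are hypothetical and chosen only for this example.

```python
from math import factorial
import cmath


def pade_coeffs(n):
    """Coefficients c_k of the diagonal [n/n] Pade approximant of exp(-x):
    exp(-x) ~ sum_k c_k * (-x)**k / sum_k c_k * x**k  (textbook formula)."""
    return [
        factorial(2 * n - k) * factorial(n)
        / (factorial(2 * n) * factorial(k) * factorial(n - k))
        for k in range(n + 1)
    ]


def pade_delay(s, delay, n=3):
    """Rational approximation of the delay transfer function exp(-delay * s)."""
    x = delay * s
    c = pade_coeffs(n)
    num = sum(ck * (-x) ** k for k, ck in enumerate(c))
    den = sum(ck * x ** k for k, ck in enumerate(c))
    return num / den


if __name__ == "__main__":
    s = 0.5j  # a point on the imaginary axis (frequency 0.5 rad/s)
    print("third order coefficients:", pade_coeffs(3))
    print("approximation error:", abs(pade_delay(s, 1.0) - cmath.exp(-s)))
```

The approximant has equal numerator and denominator degree, so like the true delay it has unit gain on the imaginary axis; the price is that it adds n poles and n zeros to the loop, which is why the delay shows up as a higher order system in the pole placement problem.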

1701.08092 2026-06-04 cs.CV cs.NA math.NA version update

Double-sided probing by map of Asplund's distances using Logarithmic Image Processing in the framework of Mathematical Morphology

Guillaume Noyel, Michel Jourlin

Affiliations: International Prevention Research Institute; Lab. H. Curien; UMR CNRS 5516

AI summary: Within the framework of Mathematical Morphology, this paper uses the Logarithmic Image Processing scalar multiplication to link Mathematical Morphology with the map of Asplund's distances between a probe and a grey scale function, and illustrates the approach with a pattern matching example.

Comments The final publication is available at link.springer.com

DOI: 10.1007/978-3-319-57240-6_33
Journal ref: 13th International Symposium on Mathematical Morphology, ISMM 2017, May 2017, Fontainebleau, France. Springer International Publishing, pp.408-420, 2017, Mathematical Morphology and Its Applications to Signal and Image Processing: 13th International Symposium, ISMM 2017, Fontainebleau, France, May 15--17, 2017, Proceedings. http://cmm.ensmp.fr/ismm2017/

Abstract

We establish the link between Mathematical Morphology and the map of Asplund's distances between a probe and a grey scale function, using the Logarithmic Image Processing scalar multiplication. We demonstrate that the map is the logarithm of the ratio between a dilation and an erosion of the function by a structuring function: the probe. The dilations and erosions are mappings from the lattice of the images into the lattice of the positive functions. Using a flat structuring element, the expression of the map of Asplund's distances can be simplified with a dilation and an erosion of the image; these mappings stay in the lattice of the images. We illustrate our approach by an example of pattern matching with a non-flat structuring function.
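The abstract does not spell out the formulas, so the following 1D sketch relies on definitions recalled from the general LIP literature, which should be treated as assumptions here: the LIP scalar multiplication lam * f = M - M (1 - f/M)^lam, and the Asplund distance ln(lam / mu), where lam is the smallest factor that lifts the probe above the signal and mu the largest factor that keeps it below. Under those assumptions both factors reduce to the maximum and minimum of a pointwise ratio of logarithms. The names `lip_scale`, `asplund_distance` and `asplund_map`, and the value M = 256, are hypothetical choices for this example.

```python
import math

M = 256.0  # assumed upper bound of the grey-level range (8-bit images)


def lip_scale(lam, f):
    """LIP scalar multiplication of a grey level f in [0, M) by lam > 0."""
    return M - M * (1.0 - f / M) ** lam


def asplund_distance(f, b):
    """Asplund distance between equal-length signals with values in (0, M).

    lam * b >= f pointwise  <=>  lam >= ln(1 - f/M) / ln(1 - b/M),
    so the upper factor is the max of that ratio and the lower factor its min.
    """
    ratios = [
        math.log(1.0 - fi / M) / math.log(1.0 - bi / M) for fi, bi in zip(f, b)
    ]
    return math.log(max(ratios) / min(ratios))


def asplund_map(signal, probe):
    """Slide the probe over the signal and record the distance at each offset."""
    n = len(probe)
    return [
        asplund_distance(signal[i:i + n], probe)
        for i in range(len(signal) - n + 1)
    ]


if __name__ == "__main__":
    probe = [50.0, 120.0, 80.0]
    # Embed a LIP-scaled copy of the probe (factor 2) at offset 2.
    signal = [10.0, 30.0] + [lip_scale(2.0, p) for p in probe] + [200.0, 40.0]
    dmap = asplund_map(signal, probe)
    print("best match at offset", dmap.index(min(dmap)))
```

Because the distance depends only on the ratio of the two factors, a copy of the probe that has been LIP-scaled by any positive factor still gives a distance of zero, which is what makes the map usable for pattern matching under this kind of intensity change.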

1610.06781 2026-06-04 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY version update

Modular Deep Q Networks for Sim-to-real Transfer of Visuo-motor Policies

Fangyi Zhang, Jürgen Leitner, Michael Milford, Peter Corke

Affiliations: Australian Centre for Robotic Vision (ACRV); Queensland University of Technology (QUT)

AI summary: This paper proposes a modular deep reinforcement learning method that introduces a bottleneck between perception and control to enable sim-to-real transfer and improve the robot's hand-eye coordination.

Comments Australasian Conference on Robotics and Automation (ACRA) 2017, Student Paper Award Finalist

Journal ref: The proceedings of the Australasian Conference on Robotics and Automation (ACRA) 2017

Abstract

While deep learning has had significant successes in computer vision thanks to the abundance of visual data, collecting sufficiently large real-world datasets for robot learning can be costly. To increase the practicality of these techniques on real robots, we propose a modular deep reinforcement learning method capable of transferring models trained in simulation to a real-world robotic task. We introduce a bottleneck between perception and control, enabling the networks to be trained independently, but then merged and fine-tuned in an end-to-end manner to further improve hand-eye coordination. On a canonical, planar visually-guided robot reaching task a fine-tuned accuracy of 1.6 pixels is achieved, a significant improvement over naive transfer (17.5 pixels), showing the potential for more complicated and broader applications. Our method provides a technique for more efficient learning and transfer of visuo-motor policies for real robotic systems without relying entirely on large real-world robot datasets.

1503.05528 2026-06-04 cs.CV cs.MM cs.NA eess.IV math.NA version update

Video Inpainting of Complex Scenes

Alasdair Newson, Andrés Almansa, Matthieu Fradet, Yann Gousseau, Patrick Pérez

Affiliations: Technicolor

AI summary: This paper proposes an automatic video inpainting algorithm that optimizes a global, patch-based functional to inpaint complex scenes, improving speed and quality; it requires no manual input and works on high definition videos.

DOI: 10.1137/140954933
Journal ref: SIAM Journal on Imaging Sciences, Society for Industrial and Applied Mathematics, 2014, 7 (4), pp.1993-2019

Abstract

We propose an automatic video inpainting algorithm which relies on the optimisation of a global, patch-based functional. Our algorithm is able to deal with a variety of challenging situations which naturally arise in video inpainting, such as the correct reconstruction of dynamic textures, multiple moving objects and moving background. Furthermore, we achieve this in an order of magnitude less execution time with respect to the state-of-the-art. We are also able to achieve good quality results on high definition videos. Finally, we provide specific algorithmic details to make implementation of our algorithm as easy as possible. The resulting algorithm requires no segmentation or manual input other than the definition of the inpainting mask, and can deal with a wider variety of situations than is handled by previous work.

1. Introduction. Advanced image and video editing techniques are increasingly common in the image processing and computer vision world, and are also starting to be used in media entertainment. One common and difficult task closely linked to the world of video editing is image and video "inpainting". Generally speaking, this is the task of replacing the content of an image or video with some other content which is visually pleasing. This subject has been extensively studied in the case of images, to such an extent that commercial image inpainting products destined for the general public are available, such as Photoshop's "Content Aware fill" [1]. However, while some impressive results have been obtained in the case of videos, the subject has been studied far less extensively than image inpainting. This relative lack of research can largely be attributed to high time complexity due to the added temporal dimension. Indeed, it has only very recently become possible to produce good quality inpainting results on high definition videos, and this only in a semi-automatic manner.

Nevertheless, high-quality video inpainting has many important and useful applications such as film restoration, professional post-production in cinema and video editing for personal use. For this reason, we believe that an automatic, generic video inpainting algorithm would be extremely useful for both academic and professional communities.

1711.11075 2026-06-04 math.NA cs.CV cs.NA version update

A fast nonconvex Compressed Sensing algorithm for highly low-sampled MR images reconstruction

Damiana Lazzaro, Elena Loli Piccolomini, Fabiana Zama

Affiliations: Department of Mathematics, University of Bologna

AI summary: This paper proposes a fast and efficient MRI reconstruction algorithm that handles severely under-sampled data using a non-convex regularizing objective function with a least squares data fit constraint, and proves the convergence of the algorithm.

Abstract

In this paper we present a fast and efficient method for the reconstruction of Magnetic Resonance Images (MRI) from severely under-sampled data. From the Compressed Sensing theory we have mathematically modeled the problem as a constrained minimization problem with a family of non-convex regularizing objective functions depending on a parameter and a least squares data fit constraint. We propose a fast and efficient algorithm, named Fast NonConvex Reweighting (FNCR) algorithm, based on an iterative scheme where the non-convex problem is approximated by its convex linearization and the penalization parameter is automatically updated. The convex problem is solved by a Forward-Backward procedure, where the Backward step is performed by a Split Bregman strategy. Moreover, we propose a new efficient iterative solver for the arising linear systems. We prove the convergence of the proposed FNCR method. The results on synthetic phantoms and real images show that the algorithm is very well performing and computationally efficient, even when compared to the best performing methods proposed in the literature.

1711.09867 2026-06-04 math.NA cs.CV cs.NA version update

Accelerated Optimization in the PDE Framework: Formulations for the Active Contour Case

Anthony Yezzi, Ganesh Sundaramoorthi

Affiliations: School of Electrical and Computer Engineering, Georgia Institute of Technology; Electrical Engineering, King Abdullah University of Science and Technology

AI summary: This paper explores accelerated optimization in the PDE framework to improve parameter estimation performance. Building on a variational framework formulated around the Bregman divergence that yields continuum limit ODEs, it extends the approach to infinite dimensional manifolds and introduces a co-evolving mass model that links to fluid dynamical formulations of optimal mass transport.

Abstract

Following the seminal work of Nesterov, accelerated optimization methods have been used to powerfully boost the performance of first-order, gradient-based parameter estimation in scenarios where second-order optimization strategies are either inapplicable or impractical. Not only does accelerated gradient descent converge considerably faster than traditional gradient descent, but it also performs a more robust local search of the parameter space by initially overshooting and then oscillating back as it settles into a final configuration, thereby selecting only local minimizers with a basin of attraction large enough to contain the initial overshoot. This behavior has made accelerated and stochastic gradient search methods particularly popular within the machine learning community. In their recent PNAS 2016 paper, Wibisono, Wilson, and Jordan demonstrate how a broad class of accelerated schemes can be cast in a variational framework formulated around the Bregman divergence, leading to continuum limit ODEs. We show how their formulation may be further extended to infinite dimensional manifolds (starting here with the geometric space of curves and surfaces) by substituting the Bregman divergence with inner products on the tangent space and explicitly introducing a distributed mass model which evolves in conjunction with the object of interest during the optimization process. The co-evolving mass model, which is introduced purely for the sake of endowing the optimization with helpful dynamics, also links the resulting class of accelerated PDE-based optimization schemes to fluid dynamical formulations of optimal mass transport.
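The speed-up that motivates this line of work can be seen already in finite dimensions. The sketch below compares plain gradient descent with a constant-momentum Nesterov scheme on an ill-conditioned quadratic; it is only a finite-dimensional analogue under assumed settings (a diagonal quadratic with eigenvalues 1 and 100, step 1/L, momentum (sqrt(kappa) - 1)/(sqrt(kappa) + 1)), not the PDE formulation of the paper, and all function names are hypothetical.

```python
import math


def f(x, eigs):
    """Quadratic objective f(x) = 0.5 * sum_i eigs[i] * x[i]**2."""
    return 0.5 * sum(lam * xi * xi for lam, xi in zip(eigs, x))


def grad(x, eigs):
    """Gradient of the quadratic objective."""
    return [lam * xi for lam, xi in zip(eigs, x)]


def gradient_descent(x0, eigs, step, iters):
    x = list(x0)
    for _ in range(iters):
        g = grad(x, eigs)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


def nesterov(x0, eigs, step, momentum, iters):
    x_prev = list(x0)
    x = list(x0)
    for _ in range(iters):
        # Look-ahead point: extrapolate along the previous displacement.
        y = [xi + momentum * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        g = grad(y, eigs)
        # Gradient step taken from the look-ahead point.
        x_prev, x = x, [yi - step * gi for yi, gi in zip(y, g)]
    return x


if __name__ == "__main__":
    eigs = [1.0, 100.0]               # condition number kappa = 100
    step = 1.0 / max(eigs)            # 1 / L
    kappa = max(eigs) / min(eigs)
    momentum = (math.sqrt(kappa) - 1.0) / (math.sqrt(kappa) + 1.0)
    x0 = [1.0, 1.0]
    print("gradient descent:", f(gradient_descent(x0, eigs, step, 100), eigs))
    print("nesterov        :", f(nesterov(x0, eigs, step, momentum, 100), eigs))
```

With these settings the slow direction contracts by a factor of about 0.99 per plain gradient step but about 0.9 per accelerated step, which is the square-root improvement in the condition number that accelerated methods are known for.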

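1310.7443 2026-06-04 cs.CV cs.NA math.NA version update

On Convergent Finite Difference Schemes for Variational - PDE Based Image Processing

V. B. S. Prasath, Juan C. Moreno

AI summary: This paper proposes an image restoration scheme based on an adaptive anisotropic Huber function combined with L2-L1 regularization functions, uses the Split Bregman method to denoise images while preserving edges, and reports experiments indicating that the algorithm has optimal convergence.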

Comments 23 pages, 12 figures, 2 tables

1703.09499 2026-06-04 cs.CV cs.NA math.NA version update

Locality preserving projection on SPD matrix Lie group: algorithm and analysis

Yangyang Li, Ruqian Lu

AI summary: This paper proposes a dimension reduction algorithm on the SPD matrix Lie group that builds a Laplacian matrix following the idea of locality preserving projection, handles high-dimensional SPD matrices effectively, and improves face recognition and action recognition performance.

Comments 15 pages, 3 tables

Abstract

Symmetric positive definite (SPD) matrices used as feature descriptors in image recognition are usually high dimensional. Traditional manifold learning is only applicable for reducing the dimension of high-dimensional vector-form data. For high-dimensional SPD matrices, directly using manifold learning algorithms to reduce the dimension of matrix-form data is impossible. The SPD matrix must first be transformed into a long vector, and then the dimension of this vector must be reduced. However, this approach breaks the spatial structure of the SPD matrix space. To overcome this limitation, we propose a new dimension reduction algorithm on SPD matrix space to transform high-dimensional SPD matrices into low-dimensional SPD matrices. Our work is based on the fact that the set of all SPD matrices with the same size has a Lie group structure, and we aim to transform the manifold learning to the SPD matrix Lie group. We use the basic idea of the manifold learning algorithm called locality preserving projection (LPP) to construct the corresponding Laplacian matrix on the SPD matrix Lie group. Thus, we call our approach Lie-LPP to emphasize its Lie group character. We present a detailed algorithm analysis and show through experiments that Lie-LPP achieves effective results on human action recognition and human face recognition.


1510.02975 2026-06-04 math.OC cs.CV cs.DC cs.NA cs.SY eess.SY math.NA version update

Optimal Piecewise Linear Function Approximation for GPU-based Applications

Daniel Berjón, Guillermo Gallego, Carlos Cuevas, Francisco Morán, Narciso García

AI summary: This paper proposes an efficient method that uses optimally designed piecewise linear approximations to speed up real-time evaluation of complex continuous functions, and it performs especially well on GPUs.

Comments 12 pages, 12 figures, post-print, IEEE Transactions on Cybernetics, Oct. 2015

Details

DOI: 10.1109/TCYB.2015.2482365
Journal ref: IEEE Transactions on Cybernetics, vol. 46, no. 11, pp. 2584-2595, Nov. 2016

Abstract

Many computer vision and human-computer interaction applications developed in recent years need to evaluate complex and continuous mathematical functions as an essential step toward proper operation. However, rigorous evaluation of this kind of function often implies a very high computational cost, unacceptable in real-time applications. To alleviate this problem, functions are commonly approximated by simpler piecewise-polynomial representations. Following this idea, we propose a novel, efficient, and practical technique to evaluate complex and continuous functions using a nearly optimal design of two types of piecewise linear approximations in the case of a large budget of evaluation subintervals. To this end, we develop a thorough error analysis that yields asymptotically tight bounds to accurately quantify the approximation performance of both representations. It provides an improvement upon previous error estimates and allows the user to control the trade-off between the approximation error and the number of evaluation subintervals. To guarantee real-time operation, the method is suitable for, but not limited to, an efficient implementation in modern Graphics Processing Units (GPUs), where it outperforms previous alternative approaches by exploiting the fixed-function interpolation routines present in their texture units. The proposed technique is a perfect match for any application requiring the evaluation of continuous functions. We have measured in detail its quality and efficiency on several functions, in particular the Gaussian function, because it is extensively used in many areas of computer vision and cybernetics and is expensive to evaluate.
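To make the idea of piecewise linear evaluation concrete, here is a minimal sketch that tabulates the Gaussian at equally spaced knots and interpolates linearly between them, the same operation the abstract attributes to GPU texture units. The uniform knot spacing, the interval [-4, 4], and the helper names `build_table` and `eval_pwl` are assumptions for illustration; the paper's nearly optimal designs are not reproduced here. The check uses the classical interpolation bound h^2 / 8 * max|f''|, not the paper's asymptotically tight bounds.

```python
import math

def build_table(f, a, b, n):
    """Sample f at n + 1 equally spaced knots on [a, b]."""
    h = (b - a) / n
    return [f(a + i * h) for i in range(n + 1)], h

def eval_pwl(table, a, h, x):
    """Evaluate the piecewise linear interpolant at x (clamped to the table range)."""
    n = len(table) - 1
    u = min(max((x - a) / h, 0.0), float(n))
    i = min(int(u), n - 1)        # index of the subinterval containing x
    frac = u - i                  # position inside the subinterval, in [0, 1]
    return (1.0 - frac) * table[i] + frac * table[i + 1]

gauss = lambda x: math.exp(-0.5 * x * x)
a, b, n = -4.0, 4.0, 64
table, h = build_table(gauss, a, b, n)

# Measure the worst error on a fine grid and compare it with the classical
# bound h^2 / 8 * max|f''|; for this Gaussian max|f''| = 1 (attained at x = 0).
xs = [a + (b - a) * k / 10000 for k in range(10001)]
max_err = max(abs(eval_pwl(table, a, h, x) - gauss(x)) for x in xs)
bound = h * h / 8.0
```

Doubling `n` should cut `bound` by a factor of four, which is the error-versus-subinterval trade-off the abstract refers to.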


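1711.02857 2026-06-04 cs.LG cs.AI cs.CV cs.NA math.NA stat.ML version update

Learning Sparse Visual Representations with Leaky Capped Norm Regularizers

Jianqiao Wangni, Dahua Lin

AI summary: This paper proposes the leaky capped norm regularizer for learning over-complete visual representations, proves convergence for 3D shape recovery, and outperforms ℓ1 and other non-convex regularization methods.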

Details

Abstract

Sparsity-inducing regularization is an important part of learning over-complete visual representations. Despite the popularity of $\ell_1$ regularization, in this paper we investigate the use of non-convex regularizations for this problem. Our contribution consists of three parts. First, we propose the leaky capped norm regularization (LCNR), which allows model weights below a certain threshold to be regularized more strongly than those above it, and therefore imposes strong sparsity while introducing only controllable estimation bias. We propose a majorization-minimization algorithm to optimize the joint objective function. Second, in our study of monocular 3D shape recovery and neural networks, LCNR outperforms $\ell_1$ and other non-convex regularizations, achieving state-of-the-art performance and faster convergence. Third, we prove a theoretical global convergence speed on the 3D recovery problem. To the best of our knowledge, this is the first convergence analysis of the 3D recovery problem.


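1710.00620 2026-06-04 cs.CV cs.NA math.NA version update

Out-of-focus Blur: Image De-blurring

Yuzhen Lu

AI summary: Through a simulation study, this paper addresses de-blurring of images degraded by out-of-focus blur. It uses regularization methods and the conjugate gradient method to improve the result and proposes a strategy for choosing the optimal parameter.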

Comments 11 pages

Details

Abstract

Image de-blurring is important in many cases of imaging a real scene or object by a camera. This project focuses on de-blurring an image distorted by an out-of-focus blur through a simulation study. A pseudo-inverse filter is first explored, but it fails because of severe noise amplification. Then Tikhonov regularization methods are employed, which produce greatly improved results compared to the pseudo-inverse filter. In Tikhonov regularization, the choice of the regularization parameter plays a critical role in obtaining a high-quality image, and the regularized solutions possess a semi-convergence property. The best result, with a relative restoration error of 8.49%, is achieved when the prescribed discrepancy principle is used to decide an optimal value. Furthermore, an iterative method, Conjugate Gradient (CG), is employed for image de-blurring, which is fast in computation and leads to an even better result with a relative restoration error of 8.22%. The number of iterations in CG acts as a regularization parameter, and the iterates have a semi-convergence property as well.
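The contrast between the pseudo-inverse filter and Tikhonov regularization can be sketched in one dimension. The example below is a toy built on stated assumptions, not the project's experiment: it uses a 1-D signal, a width-5 box kernel standing in for the out-of-focus blur, circular convolution, a fixed `lam = 1e-3` instead of the discrepancy principle, and a naive DFT so that it needs only the standard library.

```python
import cmath
import math
import random

def dft(x, inverse=False):
    """Naive O(N^2) discrete Fourier transform (enough for a small demo)."""
    n = len(x)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        s = sum(x[m] * cmath.exp(sign * 2j * math.pi * k * m / n) for m in range(n))
        out.append(s / n if inverse else s)
    return out

def rel_error(estimate, truth):
    """Relative restoration error ||estimate - truth|| / ||truth||."""
    num = math.sqrt(sum((e - t) ** 2 for e, t in zip(estimate, truth)))
    return num / math.sqrt(sum(t * t for t in truth))

random.seed(0)
N = 64
truth = [1.0 if 24 <= m < 40 else 0.0 for m in range(N)]   # sharp 1-D "image"
kernel = [0.0] * N
for m in (-2, -1, 0, 1, 2):                                 # width-5 box blur
    kernel[m % N] = 0.2

H = dft(kernel)
blurred = [Hk * Xk for Hk, Xk in zip(H, dft(truth))]        # circular convolution
noisy = [v.real + random.gauss(0.0, 0.02) for v in dft(blurred, inverse=True)]
Y = dft(noisy)

# Pseudo-inverse filter: divide by H, which amplifies noise where |H| is small.
naive = [v.real for v in dft([Yk / Hk for Yk, Hk in zip(Y, H)], inverse=True)]

# Tikhonov filter: conj(H) / (|H|^2 + lam) damps those frequencies instead.
lam = 1e-3
tikhonov = [v.real for v in dft(
    [Hk.conjugate() * Yk / (abs(Hk) ** 2 + lam) for Yk, Hk in zip(Y, H)],
    inverse=True)]

err_naive = rel_error(naive, truth)
err_tikhonov = rel_error(tikhonov, truth)
```

On this toy problem the Tikhonov estimate should have the smaller relative error, since its filter never amplifies a frequency more than the plain division does. The resulting numbers are unrelated to the 8.49% and 8.22% reported in the abstract.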


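1710.06232 2026-06-04 cs.CV cs.SY eess.SY version update

Analysis of feature detector and descriptor combinations with a localization experiment for various performance metrics

Ertugrul Bayraktar, Pinar Boyraz

AI summary: Using indoor experiments with a mobile robot, this paper compares the image-matching performance of different feature detector and descriptor combinations and analyzes them under five metrics, including accuracy, time, and angle difference.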

Comments 11 pages, 3 figures, 1 table

Details

DOI: 10.3906/elk-1602-225
Journal ref: Turkish Journal of Electrical Engineering & Computer Sciences, (2017) 25: 2444 - 2454

Abstract

The purpose of this study is to provide a detailed performance comparison of feature detector/descriptor methods, particularly when their various combinations are used for image-matching. The localization experiments of a mobile robot in an indoor environment are presented as a case study. In these experiments, 3090 query images and 127 dataset images were used. This study includes five methods for feature detectors (features from accelerated segment test (FAST), oriented FAST and rotated binary robust independent elementary features (BRIEF) (ORB), speeded-up robust features (SURF), scale invariant feature transform (SIFT), and binary robust invariant scalable keypoints (BRISK)) and five other methods for feature descriptors (BRIEF, BRISK, SIFT, SURF, and ORB). These methods were used in 23 different combinations and it was possible to obtain meaningful and consistent comparison results using the performance criteria defined in this study. All of these methods were used independently and separately from each other as either feature detector or descriptor. The performance analysis shows the discriminative power of various combinations of detector and descriptor methods. The analysis is completed using five parameters: (i) accuracy, (ii) time, (iii) angle difference between keypoints, (iv) number of correct matches, and (v) distance between correctly matched keypoints. In a range of 60°, covering five rotational pose points for our system, the FAST-SURF combination had the lowest distance and angle difference values and the highest number of matched keypoints. SIFT-SURF was the most accurate combination with a 98.41% correct classification rate. The fastest algorithm was ORB-BRIEF, with a total running time of 21,303.30 s to match 560 images captured during motion with 127 dataset images.
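Binary descriptors such as BRIEF, ORB, and BRISK are compared with the Hamming distance. The following sketch shows brute-force matching with a mutual nearest-neighbour check; the 16-bit descriptors, the function names, and the cross-check rule are illustrative assumptions and do not reproduce the matching pipeline used in the study.

```python
def hamming(a: int, b: int) -> int:
    """Number of differing bits between two binary descriptors stored as ints."""
    return bin(a ^ b).count("1")

def match_cross_check(query, train):
    """Brute-force nearest-neighbour matching with a mutual-consistency check.

    Returns (query_index, train_index, distance) triples for pairs that are
    each other's nearest neighbour.
    """
    def nearest(desc, pool):
        return min(range(len(pool)), key=lambda j: hamming(desc, pool[j]))

    matches = []
    for qi, q in enumerate(query):
        ti = nearest(q, train)
        if nearest(train[ti], query) == qi:      # keep only mutual best matches
            matches.append((qi, ti, hamming(q, train[ti])))
    return matches

# Hypothetical 16-bit descriptors; real binary descriptors are much longer.
query = [0b1111000011110000, 0b0000111100001111, 0b1010101010101010]
train = [0b0000111100001011, 0b1111000011110001, 0b0101010101010101]
matches = match_cross_check(query, train)
```

The third query descriptor is dropped because its closest train descriptor prefers a different query, which is how a cross-check trades the number of matches for fewer false ones.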


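1608.01431 2026-06-04 cs.CV cs.NA math.NA version update

An efficient iterative thresholding method for image segmentation

Dong Wang, Haohan Li, Xiaoyu Wei, Xiaoping Wang

AI summary: This paper proposes an efficient iterative thresholding method for multiphase image segmentation. It approximates contour length with a nonlocal multiphase energy and achieves segmentation with optimal O(N log N) complexity.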

Comments 14 pages, 21 figures

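1511.06631 2026-06-04 math.NA cs.CV cs.NA math.OC version update

Multi-Contrast MRI Reconstruction with Structure-Guided Total Variation

Matthias J. Ehrhardt, Marta M. Betcke

AI summary: This paper proposes a structure-guided total variation method for multi-contrast MRI reconstruction. It incorporates structural prior knowledge to improve reconstruction quality and outperforms conventional total variation on standard metrics.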

Comments 18 pages, 16 figures

Details

DOI: 10.1137/15M1047325

Abstract

Magnetic resonance imaging (MRI) is a versatile imaging technique that allows different contrasts depending on the acquisition parameters. Many clinical imaging studies acquire MRI data for more than one of these contrasts, such as T1- and T2-weighted images, which makes the overall scanning procedure very time consuming. As all of these images show the same underlying anatomy, one can try to omit unnecessary measurements by taking the similarity into account during reconstruction. We will discuss two modifications of total variation, based on (i) location and (ii) direction, that take structural a priori knowledge into account and reduce to total variation in the degenerate case when no structural knowledge is available. We solve the resulting convex minimization problem with the alternating direction method of multipliers that separates the forward operator from the prior. For both priors the corresponding proximal operator can be implemented as an extension of the fast gradient projection method on the dual problem for total variation. We tested the priors on six data sets that are based on phantoms and real MRI images. In all test cases exploiting the structural information from the other contrast yields better results than separate reconstruction with total variation in terms of standard metrics like peak signal-to-noise ratio and structural similarity index. Furthermore, we found that exploiting the two-dimensional directional information results in images with well-defined edges, superior to those reconstructed solely using a priori information about the edge location.


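1710.00489 2026-06-04 cs.RO cs.AI cs.CV cs.NE cs.SY eess.SY version update

SE3-Pose-Nets: Structured Deep Dynamics Models for Visuomotor Planning and Control

Arunkumar Byravan, Felix Leeb, Franziska Meier, Dieter Fox

AI summary: This paper proposes a deep visuomotor control approach based on structured deep dynamics models. An encoder-decoder structure learns a low-dimensional pose embedding that supports scene segmentation and pose prediction, and closed-loop control is demonstrated in the real world.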

Comments 8 pages, Initial submission to IEEE International Conference on Robotics and Automation (ICRA) 2018

Details

Abstract

In this work, we present an approach to deep visuomotor control using structured deep dynamics models. Our deep dynamics model, a variant of SE3-Nets, learns a low-dimensional pose embedding for visuomotor control via an encoder-decoder structure. Unlike prior work, our dynamics model is structured: given an input scene, our network explicitly learns to segment salient parts and predict their pose-embedding along with their motion modeled as a change in the pose space due to the applied actions. We train our model using a pair of point clouds separated by an action and show that given supervision only in the form of point-wise data associations between the frames our network is able to learn a meaningful segmentation of the scene along with consistent poses. We further show that our model can be used for closed-loop control directly in the learned low-dimensional pose space, where the actions are computed by minimizing error in the pose space using gradient-based methods, similar to traditional model-based control. We present results on controlling a Baxter robot from raw depth data in simulation and in the real world and compare against two baseline deep networks. Our method runs in real-time, achieves good prediction of scene dynamics and outperforms the baseline methods on multiple control runs. Video results can be found at: https://rse-lab.cs.washington.edu/se3-structured-deep-ctrl/


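1709.02641 2026-06-04 math.NA cs.CV cs.NA version update

Completion of High Order Tensor Data with Missing Entries via Tensor-train Decomposition

Longhao Yuan, Qibin Zhao, Jianting Cao

AI summary: This paper proposes the TT-WOPT algorithm, which uses tensor-train decomposition to complete high-order tensor data with missing entries. Experiments show that it outperforms other methods at high missing rates.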

Comments 8 pages, ICONIP 2017

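1709.01237 2026-06-04 cs.CV cs.LG cs.NA math.NA version update

Newton-type Methods for Inference in Higher-Order Markov Random Fields

Hariprasad Kannan, Nikos Komodakis, Nikos Paragios

AI summary: This paper studies the benefit of Newton-type methods for solving the Lagrangian dual in inference for higher-order Markov random fields. It proposes a provably convergent and efficient framework that includes a compromise between computational complexity and precision in constructing the Hessian matrix, a damping strategy, a truncation strategy coupled with a generic preconditioner, and efficient sum-product computation for sparse clique potentials.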

Comments 10 pages, 3 figures, 3 tables, CVPR 2017

Details

Journal ref: Poster at IEEE International Conference on Computer Vision and Pattern Recognition 2017

Abstract

Linear programming relaxations are central to MAP inference in discrete Markov Random Fields. The ability to properly solve the Lagrangian dual is a critical component of such methods. In this paper, we study the benefit of using Newton-type methods to solve the Lagrangian dual of a smooth version of the problem. We investigate their ability to achieve superior convergence behavior and to better handle the ill-conditioned nature of the formulation, as compared to first-order methods. We show that it is indeed possible to efficiently apply a trust region Newton method for a broad range of MAP inference problems. In this paper we propose a provably convergent and efficient framework that includes (i) an excellent compromise between computational complexity and precision concerning the Hessian matrix construction, (ii) a damping strategy that aids efficient optimization, (iii) a truncation strategy coupled with a generic pre-conditioner for Conjugate Gradients, and (iv) efficient sum-product computation for sparse clique potentials. Results for higher-order Markov Random Fields demonstrate the potential of this approach.


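1611.02862 2026-06-04 cs.CV cs.NA math.NA version update

The Little Engine that Could: Regularization by Denoising (RED)

Yaniv Romano, Michael Elad, Peyman Milanfar

AI summary: This paper proposes a more powerful and flexible framework that uses a denoising engine to define the regularization of inverse problems, improving performance in image deblurring and super-resolution.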

Details

Abstract

Removal of noise from an image is an extensively studied problem in image processing. Indeed, the recent advent of sophisticated and highly effective denoising algorithms has led some to believe that existing methods are touching the ceiling in terms of noise removal performance. Can we leverage this impressive achievement to treat other tasks in image processing? Recent work has answered this question positively, in the form of the Plug-and-Play Prior ($P^3$) method, showing that any inverse problem can be handled by sequentially applying image denoising steps. This relies heavily on the ADMM optimization technique in order to obtain this chained denoising interpretation. Is this the only way in which tasks in image processing can exploit the image denoising engine? In this paper we provide an alternative, more powerful and more flexible framework for achieving the same goal. As opposed to the $P^3$ method, we offer Regularization by Denoising (RED): using the denoising engine in defining the regularization of the inverse problem. We propose an explicit image-adaptive Laplacian-based regularization functional, making the overall objective functional clearer and better defined. With complete flexibility to choose the iterative optimization procedure for minimizing the above functional, RED is capable of incorporating any image denoising algorithm, treating general inverse problems very effectively, and is guaranteed to converge to the globally optimal result. We test this approach and demonstrate state-of-the-art results in the image deblurring and super-resolution problems.
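A minimal sketch of the RED idea for the simplest inverse problem, denoising, may help. It assumes the regularizer rho(x) = 0.5 * x^T (x - f(x)), whose gradient is x - f(x) when the denoiser f meets the conditions stated in the paper, and it substitutes a fixed symmetric moving-average filter for f so that this gradient formula holds exactly. The filter `smooth`, the weight `lam`, and the step size are assumptions, not the authors' settings.

```python
import random

def smooth(x):
    """Hypothetical stand-in denoiser: circular [1/4, 1/2, 1/4] moving average."""
    n = len(x)
    return [0.25 * x[i - 1] + 0.5 * x[i] + 0.25 * x[(i + 1) % n] for i in range(n)]

def red_objective(x, y, lam):
    """0.5 * ||x - y||^2 + 0.5 * lam * x^T (x - f(x)) for the denoising problem."""
    fx = smooth(x)
    fidelity = 0.5 * sum((a - b) ** 2 for a, b in zip(x, y))
    prior = 0.5 * lam * sum(a * (a - b) for a, b in zip(x, fx))
    return fidelity + prior

def red_denoise(y, lam=2.0, iters=50):
    """Steepest descent using the RED gradient (x - y) + lam * (x - f(x))."""
    step = 1.0 / (1.0 + lam)      # safe step: the gradient is (1 + lam)-Lipschitz here
    x = list(y)
    for _ in range(iters):
        fx = smooth(x)
        x = [xi - step * ((xi - yi) + lam * (xi - fi))
             for xi, yi, fi in zip(x, y, fx)]
    return x

random.seed(1)
clean = [1.0 if 10 <= i < 22 else 0.0 for i in range(32)]
noisy = [c + random.gauss(0.0, 0.2) for c in clean]
restored = red_denoise(noisy)
```

Because the stand-in denoiser is linear and symmetric, the objective is a convex quadratic and plain gradient descent lowers it at every step. With a stronger denoiser only the call to `smooth` would change, which is the flexibility the abstract emphasizes.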


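1708.07850 2026-06-04 cs.LG cs.CV cs.NA math.NA version update

Structured Low-Rank Matrix Factorization: Global Optimality, Algorithms, and Applications

Benjamin D. Haeffele, Rene Vidal

AI summary: This paper proposes a matrix factorization technique suited to large-scale datasets that captures additional structure through a particular form of regularization. It proves that any local minimizer is a global minimizer when the factors are large enough, and it demonstrates advantages in neural calcium imaging video segmentation and hyperspectral compressed recovery.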

Details

Abstract

Recently, convex formulations of low-rank matrix factorization problems have received considerable attention in machine learning. However, such formulations often require solving for a matrix of the size of the data matrix, making it challenging to apply them to large scale datasets. Moreover, in many applications the data can display structures beyond simply being low-rank, e.g., images and videos present complex spatio-temporal structures that are largely ignored by standard low-rank methods. In this paper we study a matrix factorization technique that is suitable for large datasets and captures additional structure in the factors by using a particular form of regularization that includes well-known regularizers such as total variation and the nuclear norm as particular cases. Although the resulting optimization problem is non-convex, we show that if the size of the factors is large enough, under certain conditions, any local minimizer for the factors yields a global minimizer. A few practical algorithms are also provided to solve the matrix factorization problem, and bounds on the distance from a given approximate solution of the optimization problem to the global optimum are derived. Examples in neural calcium imaging video segmentation and hyperspectral compressed recovery show the advantages of our approach on high-dimensional datasets.


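1610.03819 2026-06-04 math.NA cs.CV cs.NA math.ST stat.TH version update

Recursive Diffeomorphism-Based Regression for Shape Functions

Jieren Xu, Haizhao Yang, Ingrid Daubechies

AI summary: This paper proposes a recursive diffeomorphism-based regression method for the one-dimensional generalized mode decomposition problem, which aims to extract generalized modes from their superposition. It first applies the one-dimensional synchrosqueezed transform to estimate instantaneous information, then proposes a new method based on diffeomorphisms and nonparametric regression to estimate the wave shape functions.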

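1705.05065 2026-06-04 cs.RO cs.AI cs.CV cs.SY eess.SY version update

AirSim: High-Fidelity Visual and Physical Simulation for Autonomous Vehicles

Shital Shah, Debadeepta Dey, Chris Lovett, Ashish Kapoor

AI summary: This paper presents AirSim, a simulator built on Unreal Engine for efficiently developing and testing algorithms for autonomous vehicles. It supports high-frequency physics simulation and multiple protocols, and its effectiveness is validated through quadrotor experiments.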

Comments Accepted for Field and Service Robotics conference 2017 (FSR 2017)

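1611.05963 2026-06-04 cs.CV cs.NA math.NA version update

Reweighted Low-Rank Tensor Decomposition based on t-SVD and its Applications in Video Denoising

M. Baburaj, Sudhish N. George

AI summary: This paper proposes a reweighted low-rank tensor decomposition based on t-SVD that improves recovery of the tensor multi-rank and the sparse component, thereby improving video denoising performance.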

Comments Algorithm 1 is inefficient since line 2 is processed $n_3$ times and needs to be changed. There are inconsistent notations throughout the manuscript. Unitary tensors are not defined.

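1707.01530 2026-06-04 cs.CV cs.NA math.NA version update

On the Fusion of Compton Scatter and Attenuation Data for Limited-view X-ray Tomographic Applications

Hamideh Rezaee, Brian Tracey, Eric L. Miller

AI summary: This paper proposes a method that fuses Compton scatter data with conventional attenuation data to recover mass density and photoelectric absorption, using a variational approach and regularization to improve imaging accuracy.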

Details

Abstract

In this paper we demonstrate the utility of fusing energy-resolved observations of Compton scattered photons with traditional attenuation data for the joint recovery of mass density and photoelectric absorption in the context of limited view tomographic imaging applications. We begin with the development of a physical and associated numerical model for the Compton scatter process. Using this model, we propose a variational approach recovering these two material properties. In addition to the typical data-fidelity terms, the optimization functional includes regularization for both the mass density and photoelectric coefficients. We consider a novel edge-preserving method in the case of mass density. To aid in the recovery of the photoelectric information, we draw on our recent method in \cite{r15} and employ a non-local regularization scheme that builds on the fact that mass density is more stably imaged. Simulation results demonstrate clear advantages associated with the use of both scattered photon data and energy resolved information in mapping the two material properties of interest. Specifically, comparing images obtained using only conventional attenuation data with those where we employ only Compton scatter photons and images formed from the combination of the two, shows that taking advantage of both types of data for reconstruction provides far more accurate results.


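1707.00281 2026-06-04 cs.CV cs.NA math.NA math.OC version update

A Batch-Incremental Video Background Estimation Model using Weighted Low-Rank Approximation of Matrices

Aritra Dutta, Xin Li, Peter Richtárik

AI summary: This paper proposes a batch-incremental video background estimation model that improves on conventional methods through weighted low-rank approximation and outperforms algorithms such as GRASTA and ReProCS on real and synthetic videos.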

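1706.08575 2026-06-04 math.NA cs.CV cs.NA version update

Using Frame Theoretic Convolutional Gridding for Robust Synthetic Aperture Sonar Imaging

John McKay, Anne Gelb, Vishal Monga, Raghu Raj

AI summary: This paper proposes using the frame theoretic convolutional gridding algorithm to improve synthetic aperture sonar imaging, increasing robustness and accuracy and reducing the inaccuracies caused by speckle and errors in sound-speed estimation.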

Comments Accepted to OCEANS 2017 - Anchorage (Conference)

Details

Abstract

Recent progress in synthetic aperture sonar (SAS) technology and processing has led to significant advances in underwater imaging, outperforming previously common approaches in both accuracy and efficiency. There are, however, inherent limitations to current SAS reconstruction methodology. In particular, popular and efficient Fourier domain SAS methods require a 2D interpolation which is often ill conditioned and inaccurate, inevitably reducing robustness with regard to speckle and inaccurate sound-speed estimation. To overcome these issues, we propose using the frame theoretic convolution gridding (FTCG) algorithm to handle the non-uniform Fourier data. FTCG extends non-uniform fast Fourier transform (NUFFT) algorithms by casting the NUFFT as an approximation problem given Fourier frame data. FTCG has been shown to yield improved accuracy at little more computational cost. Using simulated data, we outline how FTCG can be used to enhance current SAS processing.


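1403.7588 2026-06-04 math.OC cs.CV cs.NA math.NA stat.ML version update

Scalable Robust Matrix Recovery: Frank-Wolfe Meets Proximal Methods

Cun Mu, Yuqian Zhang, John Wright, Donald Goldfarb

AI summary: This paper proposes a scalable and efficient method for robust matrix recovery that combines Frank-Wolfe and proximal methods to solve compressive principal component pursuit with essentially linear per-iteration cost. It updates the low-rank component with a rank-one SVD, handles the sparse term with a proximal step, and demonstrates scalability on visual data.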

Details

DOI: 10.1137/15M101628X
Journal ref: SIAM Journal on Scientific Computing, 2016, Vol. 38, No. 5 : pp. A3291-A3317

Abstract

Recovering matrices from compressive and grossly corrupted observations is a fundamental problem in robust statistics, with rich applications in computer vision and machine learning. In theory, under certain conditions, this problem can be solved in polynomial time via a natural convex relaxation, known as Compressive Principal Component Pursuit (CPCP). However, all existing provable algorithms for CPCP suffer from superlinear per-iteration cost, which severely limits their applicability to large scale problems. In this paper, we propose provable, scalable and efficient methods to solve CPCP with (essentially) linear per-iteration cost. Our method combines classical ideas from Frank-Wolfe and proximal methods. In each iteration, we mainly exploit Frank-Wolfe to update the low-rank component with rank-one SVD and exploit the proximal step for the sparse term. Convergence results and implementation details are also discussed. We demonstrate the scalability of the proposed approach with promising numerical experiments on visual data.


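1705.05804 2026-06-04 cs.CV cs.NA math.NA stat.ML version update

The Incremental Multiresolution Matrix Factorization Algorithm

Vamsi K. Ithapu, Risi Kondor, Sterling C. Johnson, Vikas Singh

AI summary: This paper proposes the incremental multiresolution matrix factorization algorithm for uncovering hierarchical block structure in symmetric matrices. It works one feature at a time, which lets it scale to large matrices, and its effectiveness is validated on regression tasks with medical imaging data.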

Comments Computer Vision and Pattern Recognition (CVPR) 2017, 10 pages

Details

Abstract

Multiresolution analysis and matrix factorization are foundational tools in computer vision. In this work, we study the interface between these two distinct topics and obtain techniques to uncover hierarchical block structure in symmetric matrices, an important aspect in the success of many vision problems. Our new algorithm, the incremental multiresolution matrix factorization, uncovers such structure one feature at a time, and hence scales well to large matrices. We describe how this multiscale analysis goes much farther than what a direct global factorization of the data can identify. We evaluate the efficacy of the resulting factorizations for relative leveraging within regression tasks using medical imaging data. We also use the factorization on representations learned by popular deep networks, providing evidence of their ability to infer semantic relationships even when they are not explicitly trained to do so. We show that this algorithm can be used as an exploratory tool to improve the network architecture, and within numerous other settings in vision.


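1705.05116 2026-06-04 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY version update

Tuning Modular Networks with Weighted Losses for Hand-Eye Coordination

Fangyi Zhang, Jürgen Leitner, Michael Milford, Peter I. Corke

AI summary: This paper proposes an end-to-end fine-tuning method that uses weighted losses to improve the hand-eye coordination performance of modular deep visuomotor policies in a planar grasping task.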

Comments 2 pages, to appear in the Deep Learning for Robotic Vision (DLRV) Workshop in CVPR 2017

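1610.06688 2026-06-04 cs.CV cs.NA math.NA version update

Multispectral image denoising with optimized vector non-local mean filter

Ahmed Ben Said, Rachid Hadjidj, Kamel Eddine Melkemi, Sebti Foufou

AI summary: This paper extends the non-local means filter to the vector case for multispectral image denoising, improving denoising performance by optimizing the filter parameters and reducing computational complexity.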

Comments 30 pages, 17 figures, journal paper

Details

DOI: 10.1016/j.dsp.2016.07.017

Abstract

Nowadays, many applications rely on images of high quality to ensure good performance in conducting their tasks. However, noise goes against this objective as it is an unavoidable issue in most applications. Therefore, it is essential to develop techniques to attenuate the impact of noise, while maintaining the integrity of relevant information in images. We propose in this work to extend the application of the Non-Local Means filter (NLM) to the vector case and apply it for denoising multispectral images. The objective is to benefit from the additional information brought by multispectral imaging systems. The NLM filter exploits the redundancy of information in an image to remove noise. A restored pixel is a weighted average of all pixels in the image. In our contribution, we propose an optimization framework where we dynamically fine-tune the NLM filter parameters and reduce its computational complexity by considering only pixels which are most similar to each other in computing a restored pixel. Filter parameters are optimized using Stein's Unbiased Risk Estimator (SURE) rather than using ad hoc means. Experiments have been conducted on multispectral images corrupted with additive white Gaussian noise, and PSNR and similarity comparisons with other approaches are provided to illustrate the efficiency of our approach in terms of both denoising performance and computational complexity.
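The weighted-average rule at the heart of NLM extends to vector pixels by measuring patch similarity jointly over all bands. The sketch below does this for a single row of multiband pixels; the function name `nlm_vector`, the fixed `half_patch` and `h`, and the border clamping are assumptions, and neither the SURE-based parameter optimization nor the restriction to the most similar pixels is implemented.

```python
import math

def nlm_vector(pixels, half_patch=1, h=0.3):
    """Non-local means on a 1-D row of multiband pixels.

    pixels is a list of equal-length band vectors. Each restored pixel is a
    weighted average of all pixels, with weights exp(-d^2 / h^2), where d^2 is
    the mean squared difference between the two surrounding patches taken
    jointly over all bands.
    """
    n, bands = len(pixels), len(pixels[0])

    def patch(i):
        # clamp indices at the borders so every patch has the same size
        return [pixels[min(max(i + o, 0), n - 1)]
                for o in range(-half_patch, half_patch + 1)]

    patches = [patch(i) for i in range(n)]
    size = (2 * half_patch + 1) * bands
    out = []
    for i in range(n):
        weights = []
        for j in range(n):
            d2 = sum((a - b) ** 2
                     for pa, pb in zip(patches[i], patches[j])
                     for a, b in zip(pa, pb)) / size
            weights.append(math.exp(-d2 / (h * h)))
        total = sum(weights)
        out.append([sum(w * pixels[j][b] for j, w in enumerate(weights)) / total
                    for b in range(bands)])
    return out

# Two-band step edge with one small hand-made perturbation (no random noise).
row = [[0.0, 1.0] for _ in range(4)] + [[1.0, 0.0] for _ in range(4)]
row[1][0] += 0.1
denoised = nlm_vector(row)
```

The perturbed value is pulled back toward its look-alike neighbours, while pixels across the edge receive almost no weight because their patches differ in both bands at once.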


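1703.09744 2026-06-04 cs.CV cs.SY eess.SY version update

Feature Analysis and Selection for Training an End-to-End Autonomous Vehicle Controller Using the Deep Learning Approach

Shun Yang, Wenshuo Wang, Chang Liu, Kevin Deng, J. Karl Hedrick

AI summary: This paper analyzes how different features affect controller performance in CNN training and proposes a feature selection approach to reduce computation cost. Experiments show that road-related features are indispensable, roadside-related features improve the controller's generalizability, and sky-related features contribute little.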

Comments 6 pages, 11 figures, 3 tables, accepted by 2017 IEEE Intelligent Vehicles Symposium

Details

Abstract

Deep learning-based approaches have been widely used for training controllers for autonomous vehicles due to their powerful ability to approximate nonlinear functions or policies. However, the training process usually requires large labeled data sets and takes a lot of time. In this paper, we analyze the influence of features on the performance of controllers trained using convolutional neural networks (CNNs), which provides a guideline for feature selection to reduce computation cost. We collect a large set of data using The Open Racing Car Simulator (TORCS) and classify the image features into three categories (sky-related, roadside-related, and road-related features). We then design two experimental frameworks to investigate the importance of each single feature for training a CNN controller. The first framework uses the training data with all three features included to train a controller, which is then tested with data that has one feature removed to evaluate the feature's effects. The second framework is trained with the data that has one feature excluded, while all three features are included in the test data. Different driving scenarios are selected to test and analyze the trained controllers using the two experimental frameworks. The experimental results show that (1) the road-related features are indispensable for training the controller, (2) the roadside-related features are useful for improving the generalizability of the controller to scenarios with complicated roadside information, and (3) the sky-related features contribute little to training an end-to-end autonomous vehicle controller.

1703.08001 2026-06-04 cs.CV cs.NA math.NA 版本更新

Nonlinear Spectral Image Fusion

非线性频谱图像融合

Martin Benning, Michael Möller, Raz Z. Nossek, Martin Burger, Daniel Cremers, Guy Gilboa, Carola-Bibiane Schönlieb

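AI summary: This paper demonstrates the effectiveness of a nonlinear spectral decomposition framework based on total variation regularization for image fusion and broader image processing tasks, enabling image editing by transferring features at selected frequencies of an image, such as facial wrinkles.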

Comments 13 pages, 9 figures, submitted to SSVM conference proceedings 2017

1703.05560 2026-06-04 math.NA cs.CV cs.NA math.SP 版本更新

Combining Contrast Invariant L1 Data Fidelities with Nonlinear Spectral Image Decomposition

结合对比不变的L1数据保真度与非线性频谱图像分解

Leonie Zeune, Stephan A. van Gils, Leon W. M. M. Terstappen, Christoph Brune

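AI summary: This paper studies multi-scale approaches for variational methods and their gradient flows, combining L1 data fidelities with nonlinear spectral decomposition to improve image segmentation and shape decomposition.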

Comments 13 pages, 7 figures, conference SSVM 2017

详情

AI中文摘要

本文聚焦于变分方法的多尺度方法及其对应的梯度流。近年来，针对如总变分等凸正则化函数，已开发出通过非线性频谱分解解决非线性本征值问题的新理论和算法。这些方法为高级图像滤波开辟了新方向。然而，为了在图像分割和形状分解中有效应用，需要清晰解释频谱响应与大小和强度尺度的关系，但当前方法缺乏这一解释。在此背景下，L1数据保真度因其有趣的多尺度特性，如对比不变性，特别有用。因此，本文的创新点是将基于L1的多尺度方法与非线性频谱分解相结合。我们从频谱图像表示和分解的角度比较L1与L2尺度空间方法。我们证明了L1-TV的对比不变多尺度行为在频谱响应中促进稀疏性，从而提供更具信息量的分解。我们提供了一种数值方法，并分析了合成和生物医学图像，其中分解导致了改进的分割。

英文摘要

This paper focuses on multi-scale approaches for variational methods and corresponding gradient flows. Recently, for convex regularization functionals such as total variation, new theory and algorithms for nonlinear eigenvalue problems via nonlinear spectral decompositions have been developed. Those methods open new directions for advanced image filtering. However, for an effective use in image segmentation and shape decomposition, a clear interpretation of the spectral response regarding size and intensity scales is needed but lacking in current approaches. In this context, $L^1$ data fidelities are particularly helpful due to their interesting multi-scale properties such as contrast invariance. Hence, the novelty of this work is the combination of $L^1$-based multi-scale methods with nonlinear spectral decompositions. We compare $L^1$ with $L^2$ scale-space methods in view of spectral image representation and decomposition. We show that the contrast invariant multi-scale behavior of $L^1$-TV promotes sparsity in the spectral response, providing more informative decompositions. We provide a numerical method and analyze synthetic and biomedical images for which the decomposition leads to improved segmentation.

1501.06209 2026-06-04 math.NA cs.CV cs.NA math.OC physics.med-ph 版本更新

Parallel Magnetic Resonance Imaging

并行磁共振成像

Martin Uecker

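AI summary: This paper discusses parallel magnetic resonance imaging in the context of image reconstruction, analyzed from the perspective of inverse problems. It introduces basic concepts such as regularization, discretization, and iterative reconstruction, and discusses auto-calibration algorithms, approximation theory, and the combination with compressed sensing.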

Comments 22 pages, 9 Figures, 76 References. Copyright: Martin Uecker. Draft for a book chapter. To appear in: A Majumdar and RK Ward (eds.), MRI: Physics, Image Reconstruction, and Analysis, CRC Press 2015

详情

Journal ref: In: MRI: Physics, Image Reconstruction, and Analysis, CRC Press 2015, pp. 73-92, ISBN 9781482298871

AI中文摘要

磁共振成像（MRI）的主要缺点是其长扫描时间和由此引起的运动敏感性。利用多个接收线圈的互补信息，平行成像能够从欠采样的k空间数据中恢复图像并加速测量。由于平行磁共振成像可以加速任何成像序列，因此具有重要的应用价值。平行成像带来了图像重建的根本性转变：图像重建从简单的直接傅里叶变换转变为求解一个病态逆问题的解决方案。本文从逆问题的角度概述了图像重建，介绍了正则化、离散化和迭代重建等基本概念，并讨论了包括自校准算法、与近似理论的联系以及与压缩感知的结合等高级主题。

英文摘要

The main disadvantages of Magnetic Resonance Imaging (MRI) are its long scan times and, in consequence, its sensitivity to motion. Exploiting the complementary information from multiple receive coils, parallel imaging is able to recover images from under-sampled k-space data and to accelerate the measurement. Because parallel magnetic resonance imaging can be used to accelerate basically any imaging sequence, it has many important applications. Parallel imaging brought a fundamental shift in image reconstruction: image reconstruction changed from a simple direct Fourier transform to the solution of an ill-conditioned inverse problem. This work gives an overview of image reconstruction from the perspective of inverse problems. After introducing basic concepts such as regularization, discretization, and iterative reconstruction, advanced topics are discussed including algorithms for auto-calibration, the connection to approximation theory, and the combination with compressed sensing.

1703.00663 2026-06-04 math.NA cs.CV cs.LG cs.NA math.OC stat.ML 版本更新

Introduction to Nonnegative Matrix Factorization

非负矩阵因子分解简介

Nicolas Gillis

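AI summary: This paper introduces nonnegative matrix factorization: its applications, the geometry and uniqueness of its solutions, its complexity and algorithms, and its connection to extended formulations of polyhedra.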

Comments 18 pages, 4 figures

1608.00514 2026-06-04 math.NA cs.CV cs.NA 版本更新

Dimensionality reduction based on Distance Preservation to Local Mean (DPLM) for SPD matrices and its application in BCI

基于距离保持到局部均值的距离降维（DPLM）的SPD矩阵及其在BCI中的应用

Alireza Davoudi, Saeed Shiry Ghidary, Khadijeh Sadatnejad

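AI summary: This paper proposes a nonlinear dimensionality reduction algorithm for the manifold of symmetric positive definite (SPD) matrices that preserves the local structure of the data through distance preservation to local mean (DPLM), yielding a low-dimensional representation with high class discrimination. Experiments show that DPLM outperforms other methods on the multi-class dataset IIa from BCI Competition IV because of its robustness to outliers.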

详情

DOI: 10.1088/1741-2552/aa61bb

AI中文摘要

本文提出了一种非线性降维算法，用于对称正定（SPD）矩阵流形。该算法考虑了SPD矩阵的几何特性，并通过保持距离到局部均值（DPLM）来提供高类间区分的低维表示。DPLM在训练样本数量上是线性的，并且可以利用可用的标签信息以提高分类任务的性能。我们在BCI竞赛IV的多类数据集IIa上进行了多项实验。结果表明，我们的方法在与其他文献中方法相比时，由于其对异常值的鲁棒性而表现更优。实验还确认，DPLM与FGMDM作为分类器的结合在该数据集上实现了最先进的性能。

英文摘要

In this paper, we propose a nonlinear dimensionality reduction algorithm for the manifold of Symmetric Positive Definite (SPD) matrices that considers the geometry of SPD matrices and provides a low-dimensional representation of the manifold with high class discrimination. The proposed algorithm tries to preserve the local structure of the data by preserving distance to local mean (DPLM) and also provides an implicit projection matrix. DPLM is linear in the number of training samples and can use label information, when available, to improve performance in classification tasks. We performed several experiments on the multi-class dataset IIa from BCI Competition IV. The results show that our approach, used as a dimensionality reduction technique, leads to superior results in comparison with competing methods in the related literature because of its robustness against outliers. The experiments confirm that the combination of DPLM with FGMDM as the classifier leads to state-of-the-art performance on this dataset.

1612.07850 2026-06-04 cs.RO cs.CV cs.SY eess.SY 版本更新

Automatic Interpretation of Unordered Point Cloud Data for UAV Navigation in Construction

无人机在建筑施工中无序点云数据的自动解释

M. D. Phung, C. H. Quach, D. T. Chu, N. Q. Nguyen, T. H. Dinh, Q. P. Ha

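AI summary: This paper proposes a data processing system that automatically generates waypoints for a UAV to inspect the surfaces of structures such as buildings and bridges. Using data from two orthogonally mounted 2D laser scanners and an inertial measurement unit, the system applies data registration, surface detection, and waypoint generation algorithms to reconstruct a point cloud of the structure and plan the flight path.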

Comments In The 14th International Conference on Control, Automation, Robotics and Vision, ICARCV 2016

详情

DOI: 10.1109/ICARCV.2016.7838683

AI中文摘要

本工作旨在开发一种数据处理系统，能够自动为无人驾驶航空器（UAV）生成航路点，以检查建筑物和桥梁等结构的表面。输入包括由两个正交安装在UAV上的2D激光扫描仪和惯性测量单元（IMU）记录的数据。为实现目标，开发了处理所收集数据的算法，分为三类：（i）数据注册和滤波以生成结构的3D模型并控制点云密度以提高数据完整性；（ii）表面和障碍物检测以协助UAV的监控任务；（iii）航路点生成以设置飞行路径。不同数据集的实验表明，所开发的系统能够重建结构的3D点云，提取其表面和物体，并为UAV生成航路点以完成检查任务。

英文摘要

The objective of this work is to develop a data processing system that can automatically generate waypoints for navigation of an unmanned aerial vehicle (UAV) to inspect surfaces of structures like buildings and bridges. The input includes data recorded by two 2D laser scanners, orthogonally mounted on the UAV, and an inertial measurement unit (IMU). To achieve the goal, algorithms are developed to process the data collected. They are separated into three major groups: (i) the data registration and filtering to generate a 3D model of the structure and control the density of point clouds for data completeness enhancement; (ii) the surface and obstacle detection to assist the UAV in monitoring tasks; and (iii) the waypoint generation to set the flight path. Experiments on different data sets show that the developed system is able to reconstruct a 3D point cloud of the structure, extract its surfaces and objects, and generate waypoints for the UAV to accomplish inspection tasks.

1702.02680 2026-06-04 cs.CV cs.NA math.NA 版本更新

Manifold Based Low-rank Regularization for Image Restoration and Semi-supervised Learning

基于流形的低秩正则化用于图像恢复和半监督学习

Rongjie Lai, Jia Li

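AI summary: This paper proposes a manifold-based low-rank regularization for image restoration and semi-supervised learning, which serves as a linear approximation of manifold dimension and handles data with nonlinear structures more flexibly and effectively.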

Comments 23 pages, 13 figures

详情

AI中文摘要

低秩结构在图像科学和数据科学的近期进展中扮演重要角色。作为低秩结构在非线性数据中的自然扩展，流形低维结构的概念被应用于许多数据处理问题。受此概念启发，本文考虑基于流形的低秩正则化作为流形维度的线性近似。这种正则化比全局低秩正则化更灵活，能够更好地处理非线性数据。作为应用，本文将所提正则化方法应用于图像科学和数据科学中的经典反问题，包括图像修复、图像超分辨率、X射线计算机断层扫描（CT）图像重建和半监督学习。我们在多个图像恢复问题和使用MINST数据集的半监督学习问题上进行了大量数值实验。我们的数值测试展示了所提方法的有效性，并通过与许多现有方法的比较，证明了新正则化方法的出色性能。

英文摘要

Low-rank structures play an important role in recent advances on many problems in image science and data science. As a natural extension of low-rank structures for data with nonlinear structures, the concept of the low-dimensional manifold structure has been considered in many data processing problems. Inspired by this concept, we consider a manifold-based low-rank regularization as a linear approximation of manifold dimension. This regularization is less restrictive than the global low-rank regularization, and thus enjoys more flexibility in handling data with nonlinear structures. As applications, we apply the proposed regularization to classical inverse problems in image science and data science, including image inpainting, image super-resolution, X-ray computed tomography (CT) image reconstruction, and semi-supervised learning. We conduct intensive numerical experiments on several image restoration problems and a semi-supervised learning problem of classifying handwritten digits using the MNIST data. Our numerical tests demonstrate the effectiveness of the proposed methods and illustrate, through comparison with many existing methods, that the new regularization methods produce outstanding results.

1609.06041 2026-06-04 physics.med-ph cs.CV cs.NA math.NA 版本更新

A very fast iterative algorithm for TV-regularized image reconstruction with applications to low-dose and few-view CT

一种非常快速的迭代算法用于TV正则化图像重建及其在低剂量和少视角CT中的应用

Hiroyuki Kudo, Fukashi Yamazaki, Takuya Nemoto, Keita Takaki

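AI summary: This paper proposes a fast iterative algorithm for low-dose and few-view CT image reconstruction that minimizes a TV-regularized data fidelity term and uses preconditioning to speed up convergence.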

Comments 16 pages, 8 figures, SPIE Optics + Photonics 2016 Conference (Developments in X-Ray Tomography X) Paper No. 9967-37

1609.06020 2026-06-04 physics.med-ph cs.CV cs.NA math.NA 版本更新

Proposal of fault-tolerant tomographic image reconstruction

容错断层成像图像重建方法的提出

Hiroyuki Kudo, Keita Takaki, Fukashi Yamazaki, Takuya Nemoto

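AI summary: This paper proposes a fault-tolerant tomographic reconstruction algorithm that replaces the L2 norm with the L1 norm and uses the proximal splitting framework from convex optimization to make reconstruction more robust to abnormal data.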

Comments 12 pages, 5 figures, SPIE Optics + Photonics 2016 Conference (Developments in X-Ray Tomography X) Paper No. 9967-55

详情

DOI: 10.1117/12.2237107

AI中文摘要

本文针对断层成像中部分投影数据bin被异常数据污染的情况，提出一种新的容错重建算法。传统迭代重建使用L2范数误差函数||Ax-b||_2^2，易受异常数据影响。本文改用L1范数误差函数||Ax-b||_1^1，并开发一种基于近端分裂框架的行动作迭代算法。同时提出改进的L1-TV重建方法，在成本函数中加入弱化总变分（TV）惩罚项。仿真结果表明，L2范数重建在异常bin影响下图像严重受损，而L1范数和L1-TV重建对异常bin具有鲁棒性。

英文摘要

This paper deals with tomographic image reconstruction in the situation where some of the projection data bins are contaminated with abnormal data. Such situations occur in various instances of tomography. We propose a new reconstruction algorithm, called Fault-Tolerant reconstruction, outlined as follows. The least-squares (L2-norm) error function ||Ax-b||_2^2 used in ordinary iterative reconstructions is sensitive to the existence of abnormal data. The proposed algorithm utilizes the L1-norm error function ||Ax-b||_1^1 instead of the L2-norm, and we develop a row-action-type iterative algorithm using the proximal splitting framework from convex optimization. We also propose an improved version of the L1-norm reconstruction called the L1-TV reconstruction, in which a weak Total Variation (TV) penalty is added to the cost function. Simulation results demonstrate that images reconstructed with the L2-norm were severely damaged by the effect of abnormal bins, whereas images reconstructed with the L1-norm and L1-TV methods were robust to the existence of abnormal bins.
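
The sensitivity of the L2-norm error to abnormal data, and the robustness of the L1-norm error, can be seen in a scalar toy case. The sketch below assumes that A is a single column of ones, so x is one unknown measured several times; it is not the row-action algorithm of the paper, and the names `l2_estimate`, `l1_estimate`, and the sample readings are made up for illustration. In this special case the minimizer of the sum of squared residuals is the mean, and a minimizer of the sum of absolute residuals is the median.

```python
from statistics import mean, median


def l2_estimate(readings):
    # argmin_x sum_i (x - b_i)^2 is the arithmetic mean of the readings.
    return mean(readings)


def l1_estimate(readings):
    # argmin_x sum_i |x - b_i| is attained at the median of the readings.
    return median(readings)


# Five consistent readings plus one abnormal bin (hypothetical values).
readings = [10.0, 10.2, 9.8, 10.1, 9.9, 500.0]

x_l2 = l2_estimate(readings)  # pulled far away by the single outlier
x_l1 = l1_estimate(readings)  # stays close to the consistent readings
```

One abnormal reading drags the least-squares estimate far from the consistent values, while the L1 estimate barely moves. This is the same effect the abstract reports for full tomographic reconstructions.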

1701.07158 2026-06-04 math.NA cs.CV cs.NA math.FA 版本更新

An Edge Driven Wavelet Frame Model for Image Restoration

基于边缘驱动的小波框架模型图像修复

Jae Kyu Choi, Bin Dong, Xiaoqun Zhang

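AI summary: This paper proposes an edge-driven wavelet frame model that approximates images as piecewise smooth functions and applies regularization of different strengths to smooth and singular regions, achieving robust image restoration.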

1512.00389 2026-06-04 cs.CV cs.NA math.NA 版本更新

Accelerated graph-based nonlinear denoising filters

加速的图基非线性去噪滤波器

Andrew Knyazev, Alexander Malyshev

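AI summary: This paper proposes accelerating graph-based nonlinear denoising filters with the conjugate gradient method and Nesterov acceleration; experiments show a 2-12x efficiency gain in image denoising.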

Comments 10 pages, 6 figures, to appear in Procedia Computer Science, vol.80, 2016, International Conference on Computational Science, San Diego, CA, USA, June 6-8, 2016

1607.06032 2026-06-04 cs.CV cs.NA math.DS math.NA math.OC 版本更新

A Topological Lowpass Filter for Quasiperiodic Signals

一种用于拟周期信号的拓扑低通滤波器

Michael Robinson

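AI summary: This paper proposes a two-stage topological algorithm for recovering an estimate of a quasiperiodic function from noisy measurements. The first stage is a topological phase estimator that detects the quasiperiodic structure of the function without imposing additional restrictions, thereby avoiding distortion when many samples are used.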

1612.06176 2026-06-04 cs.CV cs.NA math.NA stat.ML 版本更新

An extended Perona-Malik model based on probabilistic models

基于概率模型扩展的Perona-Malik模型

Lars M. Mescheder, Dirk A. Lorenz

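AI summary: This paper extends the Perona-Malik model on the basis of Gaussian scale mixtures, derives the lagged-diffusivity algorithm via the EM algorithm, and modifies it to better capture uncertainties in the restoration. It also proposes a computationally feasible relaxation; experiments show that the modified algorithm performs better at restoring textured areas and fuzzy edges.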

详情

AI中文摘要

Perona-Malik模型在从噪声输入中恢复图像方面非常成功。本文将该模型重新诠释为高斯尺度混合物的语言，并推导出一些扩展。具体来说，我们展示了将EM算法应用于高斯尺度混合物导致滞后扩散算法用于计算Perona-Malik扩散方程的稳态点。此外，我们展示了这些高斯尺度混合物的均场近似如何导致一种改进的滞后扩散算法，更准确地捕捉恢复中的不确定性。由于这种改进在实践中难以计算，我们提出对均场目标进行放松以使算法计算可行。我们的数值实验表明，这种改进的滞后扩散算法在恢复纹理区域和模糊边缘方面通常比未改进的算法表现更好。作为高斯尺度混合框架的第二个应用，我们展示了如何通过高效采样过程获得概率模型，使计算条件均值和其他期望在算法上可行。同样，所得到的算法与滞后扩散算法有很强的相似性。最后，我们展示了在相同框架下，通过离散边缘先验可以得到概率版本的Mumford-Shah分割模型。

英文摘要

The Perona-Malik model has been very successful at restoring images from noisy input. In this paper, we reinterpret the Perona-Malik model in the language of Gaussian scale mixtures and derive some extensions of the model. Specifically, we show that the expectation-maximization (EM) algorithm applied to Gaussian scale mixtures leads to the lagged-diffusivity algorithm for computing stationary points of the Perona-Malik diffusion equations. Moreover, we show how mean field approximations to these Gaussian scale mixtures lead to a modification of the lagged-diffusivity algorithm that better captures the uncertainties in the restoration. Since this modification can be hard to compute in practice, we propose relaxations to the mean field objective to make the algorithm computationally feasible. Our numerical experiments show that this modified lagged-diffusivity algorithm often performs better at restoring textured areas and fuzzy edges than the unmodified algorithm. As a second application of the Gaussian scale mixture framework, we show how an efficient sampling procedure can be obtained for the probabilistic model, making the computation of the conditional mean and other expectations algorithmically feasible. Again, the resulting algorithm has a strong resemblance to the lagged-diffusivity algorithm. Finally, we show that a probabilistic version of the Mumford-Shah segmentation model can be obtained in the same framework with a discrete edge-prior.

1612.05323 2026-06-04 cs.CV cs.NA math.NA 版本更新

A Stochastic Large Deformation Model for Computational Anatomy

计算解剖学中的一种随机大变形模型

Alexis Arnaudon, Darryl D. Holm, Akshay Pai, Stefan Sommer

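AI summary: This paper proposes a stochastic model for introducing random variations into the Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework. With a setup tailored to the framework's geometrical properties, it addresses template estimation for landmarks with noise and proposes two methods for efficiently estimating the parameters of the noise fields.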

详情

AI中文摘要

在使用计算解剖学研究人体器官形状时，发现变异来源于受试者间解剖差异、疾病特异性效应和测量噪声。本文介绍了一种随机模型，用于将随机变化纳入大变形流形度量映射（LDDMM）框架中。通过在特定设置中考虑随机性，该设置适合LDDMM的几何性质，我们为带噪声的地标点模板估计问题建立了公式，并给出了两种高效估计噪声场参数的方法。一种方法直接用有限组微分方程近似每个地标点的方差时间演化，另一种基于期望最大化算法。在第二种方法中，通过应用随机扰动的大变形梯度流算法的桥采样技术，在不注册地标点的情况下评估数据似然性。该方法和估计算法在合成示例和人类胼胝体形状数据上进行了实验验证。

英文摘要

In the study of shapes of human organs using computational anatomy, variations are found to arise from inter-subject anatomical differences, disease-specific effects, and measurement noise. This paper introduces a stochastic model for incorporating random variations into the Large Deformation Diffeomorphic Metric Mapping (LDDMM) framework. By accounting for randomness in a particular setup which is crafted to fit the geometrical properties of LDDMM, we formulate the template estimation problem for landmarks with noise and give two methods for efficiently estimating the parameters of the noise fields from a prescribed data set. One method directly approximates the time evolution of the variance of each landmark by a finite set of differential equations, and the other is based on an Expectation-Maximisation algorithm. In the second method, the evaluation of the data likelihood is achieved without registering the landmarks, by applying bridge sampling using a stochastically perturbed version of the large deformation gradient flow algorithm. The method and the estimation algorithms are experimentally validated on synthetic examples and shape data of human corpora callosa.

1609.05258 2026-06-04 cs.RO cs.AI cs.CV cs.SY eess.SY 版本更新

The ACRV Picking Benchmark (APB): A Robotic Shelf Picking Benchmark to Foster Reproducible Research

ACRV 摘取基准 (APB)：一个促进可重复研究的机器人货架摘取基准

Jürgen Leitner, Adam W. Tow, Jake E. Dean, Niko Suenderhauf, Joseph W. Durham, Matthew Cooper, Markus Eich, Christopher Lehnert, Ruben Mangels, Christopher McCool, Peter Kujala, Lachlan Nicholson, Trung Pham, James Sergeant, Liao Wu, Fangyi Zhang, Ben Upcroft, Peter Corke

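AI summary: This paper proposes the ACRV Picking Benchmark (APB), which provides a reproducible robotic picking benchmark using 42 common objects, a widely available shelf, and precise guidelines for object arrangement, supporting the comparison of complete robotic systems.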

Comments 8 pages, submitted to RA:Letters

1607.08481 2026-06-04 cs.CV cs.NA math.NA 版本更新

A Nonlocal Denoising Algorithm for Manifold-Valued Images Using Second Order Statistics

基于二阶统计的非局部去噪算法用于流形值图像

Friederike Laus, Mila Nikolova, Johannes Persch, Gabriele Steidl

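AI summary: This paper is the first to generalize nonlocal patch-based methods to manifold-valued images, proposing a new estimator based on minimum mean squared error estimation for restoring manifold-valued images.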

详情

AI中文摘要

非局部块方法，特别是Lebrun等人（2013）的贝叶斯方法，被认为是去噪（彩色）图像的有效方法，这些图像受到白高斯噪声影响。本文首次尝试将该技术推广到流形值图像。此类图像，例如具有相位或方向信息或值在对称正定矩阵流形上的图像，在现实应用中很常见。将正态分布推广到流形不是标准的，已有不同尝试。本文聚焦于一个直接的内在模型，并讨论特定流形的其他方法。我们将Lebrun等人的贝叶斯方法重新解释为最小均方误差估计，这促使我们定义相应的估计器。有了这个估计器，我们提出了一种非局部块方法用于恢复流形值图像。各种概念验证示例展示了所提算法的潜力。

英文摘要

Nonlocal patch-based methods, in particular the Bayes' approach of Lebrun, Buades and Morel (2013), are considered as state-of-the-art methods for denoising (color) images corrupted by white Gaussian noise of moderate variance. This paper is the first attempt to generalize this technique to manifold-valued images. Such images, for example images with phase or directional entries or with values in the manifold of symmetric positive definite matrices, are frequently encountered in real-world applications. Generalizing the normal law to manifolds is not canonical and different attempts have been considered. Here we focus on a straightforward intrinsic model and discuss the relation to other approaches for specific manifolds. We reinterpret the Bayesian approach of Lebrun et al. (2013) in terms of minimum mean squared error estimation, which motivates our definition of a corresponding estimator on the manifold. With this estimator at hand we present a nonlocal patch-based method for the restoration of manifold-valued images. Various proof of concept examples demonstrate the potential of the proposed algorithm.

1612.00056 2026-06-04 math.NA cs.CV cs.NA math.GR 版本更新

Generalized Fourier-Bessel operator and almost-periodic interpolation and approximation

广义傅里叶-贝塞尔算子与近周期插值与逼近

Jean-Paul Gauthier, Dario Prandi

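AI summary: This paper studies the evaluation, interpolation, and approximation of functions over finite sets of frequencies that are closed under discrete rotations in the frequency plane, and proposes an abstract factorization theorem, based on the special structure of the group SE(2,N), for solving the associated numerical problems efficiently.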

Comments 15 pages, 2 figures

详情

AI中文摘要

我们考虑由有限频率集F上的三角函数定义的函数f，该集在频率平面中关于角度2kπ/M（M为整数）的旋转闭合。首先研究在空间平面上类似有限集E上的函数评估问题，其次研究通过E网格上的f来插值或逼近双变量函数g。为此，我们建立了评估函数的抽象分解定理，这是解决这些问题高效数值解的关键。该结果基于SE(2,N)群的特殊结构，该群是平面运动群SE(2)的子群，对应于离散旋转，是一个极大近周期群。尽管本文动机源于生物仿生图像重建和模式识别中的相关问题，但该主题也与经典问题如极坐标下的FFT、非均匀FFT以及一般三角多项式评估等有关。

英文摘要

We consider functions $f$ of two real variables, given as trigonometric functions over a finite set $F$ of frequencies. This set is assumed to be closed under rotations in the frequency plane of angle $\frac{2k\pi}{M}$ for some integer $M$. Firstly, we address the problem of evaluating these functions over a similar finite set $E$ in the space plane and, secondly, we address the problems of interpolating or approximating a function $g$ of two variables by such an $f$ over the grid $E$. In particular, for this aim, we establish an abstract factorization theorem for the evaluation function, which is a key point for an efficient numerical solution to these problems. This result is based on the very special structure of the group $SE(2,N)$, the subgroup of the group $SE(2)$ of motions of the plane corresponding to discrete rotations, which is a maximally almost periodic group. Although the motivation of this paper comes from our previous works on biomimetic image reconstruction and pattern recognition, where these questions appear naturally, this topic is related to several classical problems: the FFT in polar coordinates, the Non Uniform FFT, the evaluation of general trigonometric polynomials, and so on.

1611.05947 2026-06-04 math.AG cs.CV cs.NA math.NA 版本更新

Minimal Problems for the Calibrated Trifocal Variety

校准三焦点流形的最优化问题

Joe Kileel

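AI summary: This paper determines the algebraic degrees of minimal problems for the calibrated trifocal variety using numerical algebraic geometry and the Bertini software.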

Comments 23 pages, 1 table

1601.08201 2026-06-04 math.NA cs.CV cs.NA 版本更新

Spectrally Grouped Total Variation Reconstruction for Scatter Imaging Using ADMM

基于ADMM的谱分组总变分重建用于散射成像

Ikenna Odinaka, Yan Kaganovsky, Joel A. Greenberg, Mehadi Hassan, David G. Politte, Joseph A. O'Sullivan, Lawrence Carin, David J. Brady

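AI summary: This paper proposes an ADMM-based spectrally grouped total variation reconstruction algorithm that improves the spectral and spatial quality of scatter imaging through an improved regularization and uses convex decompositions to enable parallel optimization.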

Comments Presented at IEEE Nuclear Science Symposium and Medical Imaging Conference (NSS/MIC) 2015. 4 pages, 2 figures

详情

DOI: 10.1109/NSSMIC.2015.7582220

AI中文摘要

我们考虑X射线相干散射成像，目标是从多重散射测量中重建每个空间位置的动量转移分布（谱分布）。每种材料具有独特的动量转移分布（MTP），可用于区分不同材料。我们提出了一种基于泊松噪声模型的迭代图像重建算法，能够处理光子限制的测量以及数据的多种二次统计。为了提高图像质量，先前方法使用边缘保持正则化器以在空间域中促进分段常数图像，而每个谱bin分别处理。相反，我们提出谱分组正则化，促进空间方向上的分段常数图像，同时确保相邻空间bin的MTP相似，如果它们包含相同材料。我们证明这种分组正则化在谱和空间图像质量上都有提升。我们追求一种优化转移方法，利用凸分解将问题提升，使得所有超体素可以并行更新并以闭式形式处理。分组惩罚引入了挑战，因为它不直接适用于这些分解。我们使用交替方向乘子法（ADMM）将原问题替换为一个等效的子问题序列，这些子问题适用于凸分解，导致高度并行的算法。我们在真实数据上展示了性能。

英文摘要

We consider X-ray coherent scatter imaging, where the goal is to reconstruct momentum transfer profiles (spectral distributions) at each spatial location from multiplexed measurements of scatter. Each material is characterized by a unique momentum transfer profile (MTP) which can be used to discriminate between different materials. We propose an iterative image reconstruction algorithm based on a Poisson noise model that can account for photon-limited measurements as well as various second order statistics of the data. To improve image quality, previous approaches use edge-preserving regularizers to promote piecewise constancy of the image in the spatial domain while treating each spectral bin separately. Instead, we propose spectrally grouped regularization that promotes piecewise constant images along the spatial directions but also ensures that the MTPs of neighboring spatial bins are similar if they contain the same material. We demonstrate that this group regularization results in improvement of both spectral and spatial image quality. We pursue an optimization transfer approach where convex decompositions are used to lift the problem such that all hyper-voxels can be updated in parallel and in closed form. The group penalty introduces a challenge since it is not directly amenable to these decompositions. We use the alternating direction method of multipliers (ADMM) to replace the original problem with an equivalent sequence of sub-problems that are amenable to convex decompositions, leading to a highly parallel algorithm. We demonstrate the performance on real data.

1511.05261 2026-06-04 cs.CV cs.LG cs.NA math.NA stat.ML 版本更新

Robust PCA via Nonconvex Rank Approximation

通过非凸秩近似实现鲁棒PCA

Zhao Kang, Chong Peng, Qiang Cheng

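AI summary: This paper proposes a nonconvex rank approximation to overcome the limitations of the nuclear norm in robust PCA, with an efficient algorithm that improves both accuracy and efficiency.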

Comments IEEE International Conference on Data Mining

详情

DOI: 10.1109/ICDM.2015.15

AI中文摘要

在数据挖掘和机器学习中，许多应用需要恢复低秩矩阵。鲁棒主成分分析（RPCA）是处理此类问题的通用框架。RPCA中核范数作为秩函数的凸替代物被广泛研究。在某些假设下，它可以以高概率恢复底层低秩矩阵。然而，这些假设可能在实际应用中不成立。由于核范数通过将所有奇异值相加来近似秩，即本质上是奇异值的ℓ1范数，因此产生的近似误差并不 trivial，导致最终的矩阵估计器可能有显著偏差。为寻求更接近的近似并缓解核范数的上述限制，我们提出了一种非凸秩近似。这种对矩阵秩的近似比核范数更紧密。为了解决相关的非凸最小化问题，我们开发了高效的增广拉格朗日乘子优化算法。实验结果表明，我们的方法在准确性和效率上均优于当前最先进的算法。

英文摘要

Numerous applications in data mining and machine learning require recovering a matrix of minimal rank. Robust principal component analysis (RPCA) is a general framework for handling this kind of problem. The nuclear norm based convex surrogate of the rank function in RPCA has been widely investigated. Under certain assumptions, it can recover the underlying true low rank matrix with high probability. However, those assumptions may not hold in real-world applications. Since the nuclear norm approximates the rank by adding all singular values together, which is essentially an $\ell_1$-norm of the singular values, the resulting approximation error is not trivial and thus the resulting matrix estimator can be significantly biased. To seek a closer approximation and to alleviate the above-mentioned limitations of the nuclear norm, we propose a nonconvex rank approximation. This approximation to the matrix rank is tighter than the nuclear norm. To solve the associated nonconvex minimization problem, we develop an efficient augmented Lagrange multiplier based optimization algorithm. Experimental results demonstrate that our method outperforms current state-of-the-art algorithms in both accuracy and efficiency.
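
The gap between the rank and the nuclear norm can be made concrete by working directly on singular values. The sketch below is illustrative only: the singular value lists are made up, and `saturating_surrogate` with its parameter `gamma` is a generic nonconvex surrogate assumed here for comparison, not the approximation proposed in the paper.

```python
def rank_from_singular_values(sigmas, tol=1e-12):
    # Rank = number of singular values that are (numerically) nonzero.
    return sum(1 for s in sigmas if s > tol)


def nuclear_norm(sigmas):
    # Nuclear norm = sum of singular values, i.e. their l1-norm.
    return sum(sigmas)


def saturating_surrogate(sigmas, gamma=0.01):
    # Hypothetical nonconvex surrogate: each term s / (s + gamma) tends to 1
    # for s much larger than gamma and equals 0 at s = 0, mimicking a count.
    return sum(s / (s + gamma) for s in sigmas)


# Two hypothetical spectra with the same rank but very different magnitudes.
small_spectrum = [1.0, 1.0, 0.0]
large_spectrum = [100.0, 50.0, 0.0]

for spectrum in (small_spectrum, large_spectrum):
    print(rank_from_singular_values(spectrum),
          nuclear_norm(spectrum),
          saturating_surrogate(spectrum))
```

Both spectra have rank 2, yet the nuclear norm grows with the magnitude of the singular values, which is the bias the abstract refers to. A saturating surrogate stays close to the rank in both cases, at the price of a nonconvex objective.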

1610.06049 2026-06-04 math.NA cs.CV cs.NA 版本更新

Fast and Accurate Surface Normal Integration on Non-Rectangular Domains

非矩形域上快速且准确的表面法向量积分

Martin Bähr, Michael Breuß, Yvain Quéau, Ali Sharifi Boroujerdi, Jean-Denis Durou

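AI summary: This paper proposes an efficient algorithm combining a classic approach with modern techniques for surface normal integration on non-rectangular domains, using an iterative Krylov subspace solver with preconditioning to improve accuracy and efficiency; experiments demonstrate its effectiveness in computer vision.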

详情

AI中文摘要

在三维空间中计算表面形状时，表面法向量的积分是一个经典问题。然而，至今仍难以设计出一种方法，既能处理非平凡计算域，又具备高精度、鲁棒性和计算效率。本文结合经典方法与现代计算技术，构建了一个求解器。基于Poisson积分模型，我们提出使用迭代Krylov子空间求解器作为核心步骤。尽管此类方法可能非常高效，但只有在结合合适的数值预条件处理和问题特定的初始化时，才能发挥其全部潜力。我们进行了详尽的数值研究，以确定适用于我们目的的预条件器。为了解决合适的初始化问题，我们提出通过最近开发的快速推进积分器计算初始状态。详细的数值实验展示了这种新组合的优势。此外，我们还展示了所开发的数值框架能够灵活应对现代计算机视觉应用，在真实世界光流立体数据集上进行了验证。

英文摘要

The integration of surface normals for the purpose of computing the shape of a surface in 3D space is a classic problem in computer vision. However, even nowadays it is still a challenging task to devise a method that combines the flexibility to work on non-trivial computational domains with high accuracy, robustness and computational efficiency. By uniting a classic approach for surface normal integration with modern computational techniques we construct a solver that fulfils these requirements. Building upon the Poisson integration model we propose to use an iterative Krylov subspace solver as a core step in tackling the task. While such a method can be very efficient, it may only show its full potential when combined with a suitable numerical preconditioning and a problem-specific initialisation. We perform a thorough numerical study in order to identify an appropriate preconditioner for our purpose. To address the issue of a suitable initialisation we propose to compute this initial state via a recently developed fast marching integrator. Detailed numerical experiments illuminate the benefits of this novel combination. In addition, we show on real-world photometric stereo datasets that the developed numerical framework is flexible enough to tackle modern computer vision applications.

1610.04973 2026-06-04 cs.NE cs.CV cs.SY eess.SY 版本更新

Multiple Instance Fuzzy Inference Neural Networks

多实例模糊推理神经网络

Amine Ben Khalifa, Hichem Frigui

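AI summary: This paper proposes multiple instance fuzzy inference systems for handling multiple instance data. By extending the Sugeno inference method, it develops the MI-ANFIS system, which is applied to multi-algorithm fusion in landmine detection.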

Comments Submitted to IEEE Transactions On Cybernetics for review

详情

AI中文摘要

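Fuzzy logic is a powerful tool for modeling knowledge uncertainty, measurement imprecision, and vagueness. However, it does not handle well another kind of ambiguity, which arises when the data have multiple forms of expression; this is particularly apparent in multiple instance learning (MIL) problems. In MIL, an object is represented by a collection of instances, called a bag. A bag is labeled negative if all of its instances are negative, and positive if at least one of its instances is positive. Positive bags encode ambiguity, because the instances themselves are not labeled. This paper introduces fuzzy inference systems and neural networks that take bags of instances as input and can learn from ambiguously labeled data. First, we introduce Multiple Instance Sugeno style fuzzy inference (MI-Sugeno), which extends standard Sugeno inference to handle reasoning with multiple instances. Second, we use MI-Sugeno to define and develop the Multiple Instance Adaptive Neuro-Fuzzy Inference System (MI-ANFIS). We extend the architecture of standard ANFIS to allow reasoning with bags, and derive a learning algorithm using backpropagation to determine the premise and consequent parameters of the network. The proposed inference system is tested and validated on synthetic and benchmark datasets suitable for MIL problems. We also apply the proposed MI-ANFIS to fuse the outputs of multiple discrimination algorithms for landmine detection using ground penetrating radar.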
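
The bag labeling rule of multiple instance learning stated in this abstract (a bag is negative only if all its instances are negative, positive if at least one is positive) can be written as a short sketch. The function names and the max-based soft score are assumptions for illustration; they are not the MI-Sugeno or MI-ANFIS inference described in the paper.

```python
def bag_label(instance_labels):
    # A bag is positive (1) if at least one instance is positive,
    # and negative (0) only if every instance is negative.
    return int(any(instance_labels))


def bag_score(instance_scores):
    # Soft analogue (an assumption): the bag inherits the score of its
    # most positive instance, since one positive instance is enough.
    return max(instance_scores)


negative_bag = [0, 0, 0]
positive_bag = [0, 1, 0]  # which instance is positive is not known in training

labels = [bag_label(negative_bag), bag_label(positive_bag)]
```

During training only the bag-level label is observed, which is why positive bags are ambiguous: the learner is not told which instance triggered the label.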

Generalized Regressive Motion: A Visual Cue to Collision

Krzysztof Chalupka, Michael Dickinson, Pietro Perona

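AI summary: This study proposes Generalized Regressive Motion as a visual cue for collision detection, proves through geometric analysis that it is reliable for collisions among conspecifics, and shows through agent-based modeling that it is more effective than looming.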

详情

DOI: 10.1088/1748-3190/11/4/046008

AI中文摘要

大脑和感觉系统进化以指导运动。关键任务是控制对静止障碍物的接近并检测移动生物。Looming 被提出为主要的单目视觉线索，用于检测其他动物的接近并避免与静止障碍物碰撞。在昆虫和脊椎动物大脑中发现了优雅的神经机制用于 looming 检测。然而，looming 未在两个移动动物碰撞的背景下进行分析。我们提出了一种替代策略，即广义回归运动（GRM），这与最近观察到的果蝇行为一致。几何分析证明 GRM 是同类碰撞的可靠线索，而基于代理的建模表明 GRM 比 looming 更有效用于检测接近、防止碰撞和维持移动性。

英文摘要

Brains and sensory systems evolved to guide motion. Central to this task is controlling the approach to stationary obstacles and detecting moving organisms. Looming has been proposed as the main monocular visual cue for detecting the approach of other animals and avoiding collisions with stationary obstacles. Elegant neural mechanisms for looming detection have been found in the brain of insects and vertebrates. However, looming has not been analyzed in the context of collisions between two moving animals. We propose an alternative strategy, Generalized Regressive Motion (GRM), which is consistent with recently observed behavior in fruit flies. Geometric analysis proves that GRM is a reliable cue to collision among conspecifics, whereas agent-based modeling suggests that GRM is a better cue than looming as a means to detect approach, prevent collisions and maintain mobility.

1609.05483 2026-06-04 eess.SY cs.CV cs.RO cs.SY 版本更新

Set-Point Regulation of Linear Continuous-Time Systems using Neuromorphic Vision Sensors

利用神经形态视觉传感器进行线性连续时间系统的设定点调节

Prince Singh, Sze Zheng Yong, Emilio Frazzoli

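AI summary: This paper proposes an H∞ controller based on neuromorphic vision sensors for set-point regulation of linear time-invariant systems and validates the method on an unstable system.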

Comments Submitted to IEEE Transactions on Automatic Control

1609.05434 2026-06-04 math.NA cs.CV cs.NA 版本更新

Consistent Discretization and Minimization of the L1 Norm on Manifolds

一致离散化和流形上L1范数的最小化

Alex Bronstein, Yoni Choukroun, Ron Kimmel, Matan Sela

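AI summary: This paper examines the discretization of the L1 norm on manifolds, points out its sensitivity to sampling, and proposes an alternative based on an iteratively reweighted l2 norm. Applied to the compressed modes problem, it improves stability and accuracy through simple eigendecompositions.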

1609.04167 2026-06-04 math.NA cs.CV cs.IT cs.LG cs.NA math.IT math.OC 版本更新

Proceedings of the third "international Traveling Workshop on Interactions between Sparse models and Technology" (iTWIST'16)

第三届“国际稀疏模型与技术相互作用研讨会”（iTWIST'16）会议论文集

V. Abrol, O. Absil, P. -A. Absil, S. Anthoine, P. Antoine, T. Arildsen, N. Bertin, F. Bleichrodt, J. Bobin, A. Bol, A. Bonnefoy, F. Caltagirone, V. Cambareri, C. Chenot, V. Crnojević, M. Daňková, K. Degraux, J. Eisert, J. M. Fadili, M. Gabrié, N. Gac, D. Giacobello, A. Gonzalez, C. A. Gomez Gonzalez, A. González, P. -Y. Gousenbourger, M. Græsbøll Christensen, R. Gribonval, S. Guérit, S. Huang, P. Irofti, L. Jacques, U. S. Kamilov, S. Kitić, M. Kliesch, F. Krzakala, J. A. Lee, W. Liao, T. Lindstrøm Jensen, A. Manoel, H. Mansour, A. Mohammad-Djafari, A. Moshtaghpour, F. Ngolè, B. Pairet, M. Panić, G. Peyré, A. Pižurica, P. Rajmic, M. Roblin, I. Roth, A. K. Sao, P. Sharma, J. -L. Starck, E. W. Tramel, T. van Waterschoot, D. Vukobratovic, L. Wang, B. Wirth, G. Wunder, H. Zhang

AI总结本文探讨了稀疏模型与技术的相互作用，涵盖数据传感、非凸逆问题、概率推断、机器学习等领域，通过演讲和讨论促进国际科研合作。

Comments 69 pages, 22 extended abstracts, iTWIST'16 website: http://www.itwist16.es.aau.dk

AI中文摘要

第三届“国际稀疏模型与技术相互作用研讨会”（iTWIST'16）于2016年8月24日至26日在丹麦第四大城市阿勒堡举行。该研讨会旨在通过具体的口头/海报展示和自由讨论促进国际科研团队的合作。本届研讨会汇集了约50位国际参与者，包含8场特邀讲座、12场口头报告和12个海报，主题涵盖稀疏范式的理论、应用和推广，包括由稀疏驱动的数据传感与处理（如光学、计算机视觉、基因组学、生物医学、数字通信、信道估计、天文学）；稀疏模型在非凸/非线性逆问题中的应用（如相位恢复、盲去卷积、自校准）；近似概率推断用于稀疏问题；稀疏机器学习与推断；“盲”逆问题与字典学习；稀疏建模的优化；信息论、几何与随机性；稀疏？未来是什么（离散值信号；低维空间的并集、共稀疏性、混合/组范数、基于模型的、低复杂度模型等）；矩阵/流形传感与处理（图、低秩近似等）；数值方法/优化中的复杂性与精度权衡；电子/光学压缩传感器（硬件）。

英文摘要

The third edition of the "international Traveling Workshop on Interactions between Sparse models and Technology" (iTWIST) took place in Aalborg, the 4th largest city in Denmark situated beautifully in the northern part of the country, from the 24th to 26th of August 2016. The workshop venue was at the Aalborg University campus. One implicit objective of this biennial workshop is to foster collaboration between international scientific teams by disseminating ideas through both specific oral/poster presentations and free discussions. For this third edition, iTWIST'16 gathered about 50 international participants and featured 8 invited talks, 12 oral presentations, and 12 posters on the following themes, all related to the theory, application and generalization of the "sparsity paradigm": Sparsity-driven data sensing and processing (e.g., optics, computer vision, genomics, biomedical, digital communication, channel estimation, astronomy); Application of sparse models in non-convex/non-linear inverse problems (e.g., phase retrieval, blind deconvolution, self calibration); Approximate probabilistic inference for sparse problems; Sparse machine learning and inference; "Blind" inverse problems and dictionary learning; Optimization for sparse modelling; Information theory, geometry and randomness; Sparsity? What's next? (Discrete-valued signals; Union of low-dimensional spaces, Cosparsity, mixed/group norm, model-based, low-complexity models, ...); Matrix/manifold sensing/processing (graph, low-rank approximation, ...); Complexity/accuracy tradeoffs in numerical methods/optimization; Electronic/optical compressive sensors (hardware).

1608.06440 2026-06-04 cs.RO cs.CV cs.SY eess.SY 版本更新

A Delay-Tolerant Potential-Field-Based Network Implementation of an Integrated Navigation System

一种集成导航系统的延迟容忍、基于势场的网络化实现

Rachana Ashok Gupta, Ahmad A. Masoud, Mo-Yuen Chow

AI总结本文提出一种基于网络控制器的集成导航系统，通过互联网实现摄像头、无人地面车和远程服务器的实时网络化，旨在简化无人车导航同时保持系统鲁棒性。

DOI: 10.1109/TIE.2009.2026764
Journal ref: The IEEE Transactions On Industrial Electronics, Vol. 57, No.2, February 2010, PP. 769-783

AI中文摘要

网络控制器（NCs）是能够将动态、空间扩展且功能专门化的模块转化为可执行目标导向组的设备，称为网络控制系统。本文探讨了设计和构建使用互联网作为通信介质的NC的实践方面。重点在于寻找兼容的控制器组件，这些组件可通过主机结构集成，使摄像头、无人地面车（UGV）、远程计算机服务器及必要的操作软件界面能够实时联网。目标是简化UGV的导航过程，同时保持系统性能的鲁棒性。本文描述了所提控制器的结构、其组件及其接口方式。提供了详尽的实验结果，包括性能评估和与之前实现的NC的比较。

英文摘要

Network controllers (NCs) are devices that are capable of converting dynamic, spatially extended, and functionally specialized modules into a taskable goal-oriented group called a networked control system. This paper examines the practical aspects of designing and building an NC that uses the Internet as a communication medium. It focuses on finding compatible controller components that can be integrated via a host structure in a manner that makes it possible to network, in real time, a webcam, an unmanned ground vehicle (UGV), and a remote computer server along with the necessary operator software interface. The aim is to deskill the UGV navigation process and yet maintain robust performance. The structure of the suggested controller, its components, and the manner in which they are interfaced are described. Thorough experimental results along with performance assessment and comparisons to a previously implemented NC are provided.

1608.02165 2026-06-04 cs.CV cs.AI cs.NA math.NA math.OC 版本更新

ShapeFit and ShapeKick for Robust, Scalable Structure from Motion

ShapeFit 与 ShapeKick：用于鲁棒、可扩展的运动恢复结构

Thomas Goldstein, Paul Hand, Choongbum Lee, Vladislav Voroninski, Stefano Soatto

AI总结本文提出一种利用高效凸优化程序进行成对方向定位恢复的新方法，能有效处理对抗性异常值，且在真实场景和模拟数据上验证了其性能和灵活性。

1608.01372 2026-06-04 cs.CV cs.NA math.NA 版本更新

Permutation NMF

排列非负矩阵分解

Giovanni Barbarino

AI总结本文提出在经典NMF中引入平移不变性，使其能检测不同图像中位移后的共同特征。

1502.00592 2026-06-04 stat.ME cs.CV cs.MM cs.NA math.NA stat.AP 版本更新

A Class of DCT Approximations Based on the Feig-Winograd Algorithm

基于Feig-Winograd算法的一类DCT近似方法

C. J. Tablada, F. M. Bayer, R. J. Cintra

AI总结本文提出基于Feig-Winograd算法8点DCT因子化的参数化矩阵类，通过多目标优化获得具有低计算复杂度、正交性、低逆复杂度及接近精确DCT性能的新型DCT近似方法。

Comments 26 pages, 4 figures, 5 tables, fixed arithmetic complexity in Table IV

DOI: 10.1016/j.sigpro.2015.01.011
Journal ref: Signal Processing, vol. 113, pp. 38-51, August 2015

AI中文摘要

本文提出了一种基于Feig-Winograd因子化8点DCT的参数化矩阵类。此类参数化诱导出一个矩阵子空间，统一了多种现有的DCT近似方法。通过求解一个综合的多目标优化问题，我们识别出几种新的DCT近似方法。所求的解旨在具备以下特性：(i) 较低的无乘法器计算复杂度，(ii) 正交或近正交性，(iii) 低复杂度可逆性，(iv) 与精确DCT高度接近且性能相近。所提出的近似方法在与DCT的接近程度、编码性能及图像压缩适用性方面进行了评估。从帕累托效率来看，某些新提出的近似方法可能优于文献中已有的多种方法。

英文摘要

A new class of matrices based on a parametrization of the Feig-Winograd factorization of 8-point DCT is proposed. Such parametrization induces a matrix subspace, which unifies a number of existing methods for DCT approximation. By solving a comprehensive multicriteria optimization problem, we identified several new DCT approximations. Obtained solutions were sought to possess the following properties: (i) low multiplierless computational complexity, (ii) orthogonality or near orthogonality, (iii) low complexity invertibility, and (iv) close proximity and performance to the exact DCT. Proposed approximations were submitted to assessment in terms of proximity to the DCT, coding performance, and suitability for image compression. Considering Pareto efficiency, particular new proposed approximations could outperform various existing methods archived in literature.

1607.03255 2026-06-04 math.NA cs.CV cs.NA math.OC 版本更新

A Variational Model for Joint Motion Estimation and Image Reconstruction

一种联合运动估计和图像重建的变分模型

Martin Burger, Hendrik Dirks, Carola-Bibiane Schönlieb

AI总结本文提出一种变分模型，用于联合估计运动和重建图像序列，基于连续时间欧拉运动模型，通过分析亮度恒定方程实现鲁棒运动估计，并证明了在合适函数空间中解的存在性。

1602.07038 2026-06-04 cs.GR cs.CV cs.NA math.NA 版本更新

Computer Aided Restoration of Handwritten Character Strokes

计算机辅助修复手写字符笔画

Barak Sober, David Levin

AI总结本文提出一种变分方法用于修复损坏的古代希伯来文字符，通过对三次样条进行梯度下降并保持插值，实现了对1000个古代字符的修复。

Comments 11 pages, 17 figures

1606.07414 2026-06-04 cs.CV cs.MM cs.NA math.NA stat.ME 版本更新

Multiplierless 16-point DCT Approximation for Low-complexity Image and Video Coding

无乘法器16点DCT近似用于低复杂度图像和视频编码

T. L. T. Silveira, R. S. Oliveira, F. M. Bayer, R. J. Cintra, A. Madanayake

AI总结本文提出一种无需乘法和位移操作的16点近似DCT变换，通过矩阵分解快速算法仅需44次加法，实现了最低的算术成本，并在图像和视频编码中表现出最佳的成本效益比。

Comments 12 pages, 5 figures, 3 tables

DOI: 10.1007/s11760-016-0923-4

AI中文摘要

本文介绍了一种正交的16点近似离散余弦变换（DCT）。所提出的变换不需要乘法或位移操作。引入了一种基于矩阵分解的快速算法，仅需44次加法，这是文献中最低的算术成本。为了评估所提出的变换，计算了计算复杂度、与精确DCT的相似性以及编码性能指标。经典和最先进的16点低复杂度变换在比较分析中被使用。在图像压缩中，所提出的近似通过PSNR和SSIM测量评估，获得了最佳的成本效益比。对于视频编码，所提出的近似被嵌入到HEVC参考软件中，直接与原始HEVC标准进行比较。通过FPGA硬件实现和测试，所提出的变换在与文献中最佳竞争变换相比时，面积-时间和面积-时间平方VLSI指标分别提高了35%和37%。

英文摘要

An orthogonal 16-point approximate discrete cosine transform (DCT) is introduced. The proposed transform requires neither multiplications nor bit-shifting operations. A fast algorithm based on matrix factorization is introduced, requiring only 44 additions, the lowest arithmetic cost in the literature. To assess the introduced transform, computational complexity, similarity with the exact DCT, and coding performance measures are computed. Classical and state-of-the-art 16-point low-complexity transforms were used in a comparative analysis. In the context of image compression, the proposed approximation was evaluated via PSNR and SSIM measurements, attaining the best cost-benefit ratio among the competitors. For video encoding, the proposed approximation was embedded into the HEVC reference software for direct comparison with the original HEVC standard. Physically realized and tested using FPGA hardware, the proposed transform showed 35% and 37% improvements of area-time and area-time-squared VLSI metrics when compared to the best competing transform in the literature.

1508.01308 2026-06-04 cs.CV cs.NA math.HO math.NA math.OC 版本更新

Collaborative Total Variation: A General Framework for Vectorial TV Models

协同总变分：向量总变分模型的通用框架

Joan Duran, Michael Moeller, Catalina Sbert, Daniel Cremers

AI总结本文提出协同总变分（CTV）模型，通过不同维度的范数测量颜色图像张量的平滑性，探讨其理论性质和应用效果，实验比较了多种CTV方法在去噪、去模糊和修复等逆问题中的性能。

DOI: 10.1137/15M102873X
Journal ref: SIAM Journal on Imaging Sciences, vol. 9(1), pp. 116-151, 2016

AI中文摘要

即使在问世二十多年之后，总变分（TV）仍然是图像处理中最受欢迎的正则化方法之一，并引发了大量研究，特别是从标量函数到向量值函数的推广。本文将彩色图像的梯度视为一个三维矩阵或张量，其维度分别对应空间范围、与其他像素的差异和光谱通道。通过沿不同维度取不同的范数来度量该张量的平滑性，根据这些范数的类型可获得差异很大的正则化特性，从而得到新的彩色图像模型。我们称这类正则化为协同总变分（CTV）。在理论方面，我们刻画了所提出正则化器的对偶范数、次微分和近端映射。进一步地，借助广义奇异向量的概念，我们证明了$\ell^{\infty}$通道耦合做出最强的先验假设，并具有最大程度减少颜色伪影的潜力。我们的实际贡献包括一个广泛的实验部分，其中我们比较了大量协同TV方法在去噪、去模糊和修复等逆问题中的性能。

英文摘要

Even after over two decades, the total variation (TV) remains one of the most popular regularizations for image processing problems and has sparked a tremendous amount of research, particularly to move from scalar to vector-valued functions. In this paper, we consider the gradient of a color image as a three-dimensional matrix or tensor with dimensions corresponding to the spatial extent, the differences to other pixels, and the spectral channels. The smoothness of this tensor is then measured by taking different norms along the different dimensions. Depending on the type of these norms one obtains very different properties of the regularization, leading to novel models for color images. We call this class of regularizations collaborative total variation (CTV). On the theoretical side, we characterize the dual norm, the subdifferential and the proximal mapping of the proposed regularizers. We further prove, with the help of the generalized concept of singular vectors, that an $\ell^{\infty}$ channel coupling makes the most prior assumptions and has the greatest potential to reduce color artifacts. Our practical contributions consist of an extensive experimental section where we compare the performance of a large number of collaborative TV methods for inverse problems like denoising, deblurring and inpainting.

1606.05535 2026-06-04 math.NA cs.CV cs.DS cs.NA 版本更新

Tensor Ring Decomposition

张量环分解

Qibin Zhao, Guoxu Zhou, Shengli Xie, Liqing Zhang, Andrzej Cichocki

AI总结本文提出张量环分解，通过循环多线性乘积表示高维张量，具有循环维度不变性，改进了传统张量分解方法，通过不同算法优化隐含核心并验证其有效性。

AI中文摘要

张量网络近年来已成为解决大规模优化问题的强大工具。其中最流行的张量网络是张量列车（TT）分解，但其高度依赖张量维度的排列，导致难以找到最优TT表示。本文引入一种基本的张量分解模型，通过循环多线性乘积表示高维张量，可图形化解释为第三阶张量的环形连接，称为张量环（TR）分解。TR模型的关键优势是通过迹操作和等价处理隐含核心获得循环维度不变性。TR模型可视为TT分解的线性组合，从而获得强大的泛化表示能力。对于隐含核心的优化，我们提出了基于顺序SVD、ALS方案和分块ALS技术的四种不同算法。此外，研究了TR模型的数学性质，表明通过TR表示可以高效地执行基本多线性代数运算，经典张量分解可方便地转换为TR表示。最后，通过合成信号和真实数据集的实验评估了不同算法的性能。

英文摘要

Tensor networks have in recent years emerged as powerful tools for solving large-scale optimization problems. One of the most popular tensor networks is the tensor train (TT) decomposition, which acts as a building block for more complicated tensor networks. However, the TT decomposition highly depends on permutations of tensor dimensions, due to its strictly sequential multilinear products over latent cores, which leads to difficulties in finding the optimal TT representation. In this paper, we introduce a fundamental tensor decomposition model to represent a large dimensional tensor by circular multilinear products over a sequence of low dimensional cores, which can be graphically interpreted as a cyclic interconnection of 3rd-order tensors, and thus termed tensor ring (TR) decomposition. The key advantage of the TR model is the circular dimensional permutation invariance, which is gained by employing the trace operation and treating the latent cores equivalently. The TR model can be viewed as a linear combination of TT decompositions, thus obtaining powerful and generalized representation abilities. For optimization of latent cores, we present four different algorithms based on sequential SVDs, the ALS scheme, and block-wise ALS techniques. Furthermore, the mathematical properties of the TR model are investigated, which shows that basic multilinear algebra can be performed efficiently by using TR representations and that classical tensor decompositions can be conveniently transformed into the TR representation. Finally, experiments on both synthetic signals and real-world datasets were conducted to evaluate the performance of different algorithms.
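
As a rough illustration of the trace-based representation and the circular permutation invariance described in this abstract, the following NumPy sketch contracts a set of ring cores into a full tensor. The helper name `tr_to_full`, the core layout `(r_k, n_k, r_{k+1})`, and the specific ranks and mode sizes are assumptions chosen for the example; none of the paper's four fitting algorithms is shown.

```python
import numpy as np


def tr_to_full(cores):
    """Contract tensor-ring cores into the full tensor.

    Each core is assumed to have shape (r_k, n_k, r_{k+1}), with the
    last rank equal to the first so that the ring closes.
    """
    result = cores[0]
    for core in cores[1:]:
        # Contract the trailing rank index with the next core's leading rank index.
        result = np.tensordot(result, core, axes=([result.ndim - 1], [0]))
    # result has shape (r_1, n_1, ..., n_d, r_1); the trace closes the ring.
    return np.trace(result, axis1=0, axis2=result.ndim - 1)


rng = np.random.default_rng(0)
ranks = [2, 3, 4]   # hypothetical TR ranks r_1, r_2, r_3 (with r_4 = r_1)
dims = [3, 4, 5]    # hypothetical mode sizes n_1, n_2, n_3
cores = [
    rng.standard_normal((ranks[k], dims[k], ranks[(k + 1) % 3]))
    for k in range(3)
]

full = tr_to_full(cores)
# Circularly shifting the cores yields the same tensor with its modes shifted.
shifted = tr_to_full(cores[1:] + cores[:1])
```

Each entry of `full` equals the trace of a product of core slices, so moving the first core to the end only permutes the modes of the result cyclically. This follows from the cyclic property of the trace.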

1403.3320 2026-06-04 math.NA cs.CV cs.NA 版本更新

Numerical Approaches for Linear Left-invariant Diffusions on SE(2), their Comparison to Exact Solutions, and their Applications in Retinal Imaging

在SE(2)上的线性左不变扩散的数值方法，其与精确解的比较及其在视网膜成像中的应用

Jiong Zhang, Remco Duits, Gonzalo Sanguinetti, Bart M. ter Haar Romeny

AI总结本文比较了SE(2)上左不变扩散的多种数值方法，并提供了精确解的分析，展示了傅里叶方法在精度上的优势，并在视网膜成像中应用了结合左不变PDE演化的可逆方向分数。

Comments A final and corrected version of the manuscript is published in Numerical Mathematics: Theory, Methods and Applications (NM-TMA), vol. (9), p.1-50, 2016

DOI: 10.4208/nmtma.2015.m1411
Journal ref: Numerical Mathematics: Theory, Methods and Applications (NM-TMA), vol. (9), p.1-50, 2016

AI中文摘要

在旋转平移群SE(2)上的左不变PDE演化（及其预解方程）在皮层建模和图像分析领域已被广泛研究。它们包括Citti与Sarti、Petitot提出的亚椭圆扩散（用于轮廓增强）以及Mumford提出的方向过程（用于轮廓补全）。本文对多种数值方法进行了详尽研究和比较，而这类研究在文献中尚属空白。现有的数值方法可以分为三类：有限差分法、基于傅里叶的方法（等价于SE(2)-傅里叶方法）以及随机方法（蒙特卡洛模拟）。此外，Duits和van Almsick在2005年的工作中已显式推导出这些PDE演化的三类精确解（在空间傅里叶域中）。本文概述了这三类精确解，并解释了它们如何与三种数值方法相关联。我们计算了所有数值方法相对于精确解的相对误差，发现基于傅里叶的方法表现最佳，相对误差最小。我们还改进了用于计算Mathieu函数的Mathematica算法，这对实现精确解至关重要。此外，我们对核中的奇点进行了渐近分析，并提出了对底层随机过程的概率扩展，以克服时间积分核在原点处的奇异行为。最后，我们展示了将左不变PDE演化与可逆方向分数相结合在视网膜成像中的应用。

英文摘要

Left-invariant PDE-evolutions on the roto-translation group $SE(2)$ (and their resolvent equations) have been widely studied in the fields of cortical modeling and image analysis. They include hypo-elliptic diffusion (for contour enhancement) proposed by Citti & Sarti, and Petitot, and they include the direction process (for contour completion) proposed by Mumford. This paper presents a thorough study and comparison of the many numerical approaches, which, remarkably, is missing in the literature. Existing numerical approaches can be classified into 3 categories: Finite difference methods, Fourier based methods (equivalent to $SE(2)$-Fourier methods), and stochastic methods (Monte Carlo simulations). There are also 3 types of exact solutions to the PDE-evolutions that were derived explicitly (in the spatial Fourier domain) in previous works by Duits and van Almsick in 2005. Here we provide an overview of these 3 types of exact solutions and explain how they relate to each of the 3 numerical approaches. We compute relative errors of all numerical approaches to the exact solutions, and the Fourier based methods show us the best performance with smallest relative errors. We also provide an improvement of Mathematica algorithms for evaluating Mathieu-functions, crucial in implementations of the exact solutions. Furthermore, we include an asymptotical analysis of the singularities within the kernels and we propose a probabilistic extension of underlying stochastic processes that overcomes the singular behavior in the origin of time-integrated kernels. Finally, we show retinal imaging applications of combining left-invariant PDE-evolutions with invertible orientation scores.

1605.02196 2026-06-04 eess.SY cs.CV cs.LG cs.RO cs.SY 版本更新

All Weather Perception: Joint Data Association, Tracking, and Classification for Autonomous Ground Vehicles

全天候感知：面向自主地面车辆的数据关联、跟踪与分类的联合解决方案

Peter Radecki, Mark Campbell, Kevin Matzen

AI总结本文提出一种新型概率感知算法，用于自主地面车辆在全天候条件下的数据关联、目标跟踪和分类。该算法扩展了原有的 Rao-Blackwellized 粒子滤波器，结合多模型跟踪进行分类，并通过升级 Cornell 的 AGV 实验证明了先进视觉算法在恶劣天气下的鲁棒性。

Comments 35 pages, 21 figures, 14 tables

AI中文摘要

本文提出了一种新颖的概率感知算法，作为实时联合解决方案，用于自主地面车辆在全天候条件下的数据关联、目标跟踪和目标分类。该算法扩展了最初使用粒子滤波进行数据关联和卡尔曼滤波进行多目标跟踪的 Rao-Blackwellized 粒子滤波器（Miller 等，2011a），现已包含多模型跟踪用于分类。此外，还实现了一种最先进的视觉检测算法，该算法包含方向信息，适用于自主地面车辆（AGV）应用。Cornell 的 AGV 从 DARPA 城市挑战中被升级并用于实验，以检验先进视觉算法能否补充或替代激光雷达和雷达传感器。在恶劣天气和光照条件下，传感器和算法性能得到测试。实验评估显示，在联合概率感知算法中，摄像头、激光雷达和雷达传感器能够实现稳健的全天候数据关联、跟踪和分类。

英文摘要

A novel probabilistic perception algorithm is presented as a real-time joint solution to data association, object tracking, and object classification for an autonomous ground vehicle in all-weather conditions. The presented algorithm extends a Rao-Blackwellized Particle Filter originally built with a particle filter for data association and a Kalman filter for multi-object tracking (Miller et al. 2011a) to now also include multiple model tracking for classification. Additionally a state-of-the-art vision detection algorithm that includes heading information for autonomous ground vehicle (AGV) applications was implemented. Cornell's AGV from the DARPA Urban Challenge was upgraded and used to experimentally examine if and how state-of-the-art vision algorithms can complement or replace lidar and radar sensors. Sensor and algorithm performance in adverse weather and lighting conditions is tested. Experimental evaluation demonstrates robust all-weather data association, tracking, and classification where camera, lidar, and radar sensors complement each other inside the joint probabilistic perception algorithm.

1512.01979 2026-06-04 math.NA cs.CV cs.NA 版本更新

Hyperspectral Chemical Plume Detection Algorithms Based On Multidimensional Iterative Filtering Decomposition

基于多维迭代滤波分解的高光谱化学烟雾检测算法

Antonio Cicone, Jingfang Liu, Haomin Zhou

AI总结本文提出基于多维迭代滤波分解的后处理工具，用于提升化学烟雾边界识别性能，并通过预处理方法实现高光谱数据的去相关与均值中心化，改进了余弦相似度分类器性能。

DOI: 10.1098/rsta.2015.0196

AI中文摘要

空气中的化学物质可能对人类和环境造成极大危害。高光谱图像可用于识别化学烟雾，但该任务极具挑战性。假设已知某些具有已知频谱的化学烟雾已被高光谱传感器拍摄，可以使用匹配滤波或自适应余弦估计等标准技术，结合适当选择的阈值，来确定化学烟雾的位置。然而，由于噪声和传感器故障，即使在看似简单的状况下，准确识别化学像素也并不容易。本文提出了一种后处理工具，以完全自适应和数据驱动的方式，提高任何分类方法在识别烟雾边界方面的性能。这是通过多维迭代滤波（MIF）算法（arXiv:1411.6051, arXiv:1507.07173）实现的，这是一种类似于开创性经验模态分解（EMD）方法的非平稳信号分解方法。此外，基于MIF技术，我们还提出了一种预处理方法，使高光谱数据集去相关并均值中心化。余弦相似度度量，通常在实践中表现不佳，当配备此类预处理方法时，似乎成为一种成功且表现优异的分类器。我们展示了所提出方法在实际问题中的应用示例。

英文摘要

Chemicals released in the air can be extremely dangerous for human beings and the environment. Hyperspectral images can be used to identify chemical plumes; however, the task can be extremely challenging. Assuming we know a priori that some chemical plume, with a known frequency spectrum, has been photographed using a hyperspectral sensor, we can use standard techniques such as the so-called matched filter or adaptive cosine estimator, plus a properly chosen threshold value, to identify the position of the chemical plume. However, due to noise and sensor faults, the accurate identification of chemical pixels is not easy even in this apparently simple situation. In this paper we present a post-processing tool that, in a completely adaptive and data-driven fashion, improves the performance of any classification method in identifying the boundaries of a plume. This is done using the Multidimensional Iterative Filtering (MIF) algorithm (arXiv:1411.6051, arXiv:1507.07173), which is a non-stationary signal decomposition method like the pioneering Empirical Mode Decomposition (EMD) method. Moreover, based on the MIF technique, we also propose a pre-processing method that decorrelates and mean-centers a hyperspectral dataset. The Cosine Similarity measure, which often fails in practice, appears to become a successful and outperforming classifier when equipped with such a pre-processing method. We show some examples of the proposed methods applied to real-life problems.

1504.07259 2026-06-04 cs.CV cs.NA math.AP math.NA 版本更新

Image Segmentation and Restoration Using Parametric Contours With Free Endpoints

基于自由端点参数轮廓的图像分割与修复

Heike Benninghoff, Harald Garcke

AI总结本文提出一种新型自由端点主动轮廓方法，通过离散化Mumford-Shah泛函实现图像分割与修复，结合曲线法向流动和端点切向流动演化规律，采用参数化轮廓与边缘保持去噪实现快速分割与修复。

1604.02292 2026-06-04 math.NA cs.CV cs.NA 版本更新

A method for locally approximating regularized iterative tomographic reconstruction methods

一种局部近似正则化迭代断层成像重建方法

D. M. Pelt, K. J. Batenburg

AI总结本文提出一种局部近似正则化迭代断层成像方法，通过仅在感兴趣区域进行计算，降低计算需求，实现与全局方法相近的重建质量。

Comments 32 pages, 13 figures

AI中文摘要

在许多断层成像应用中，获取的投影数据往往数量有限或包含大量噪声。在这些情况下，标准重建方法容易产生伪影，影响进一步分析。先进的正则化迭代方法，如总变分最小化，通常能通过利用对被扫描物体的先验知识提高重建质量。然而，这些方法在实践中往往计算时间过长或内存需求大。此外，由于它们基于最小化全局目标函数，正则化迭代方法需要重建整个被扫描物体，即使仅对重建图像的某个（小）区域感兴趣。本文提出了一种在被扫描物体的（小）感兴趣区域内部近似正则化迭代重建方法的方法。该方法仅在感兴趣区域内部进行计算，确保低计算需求。针对不同体模（phantom）图像和正则化类型的结果显示，所提出局部方法的重建结果与其所近似的全局正则化迭代方法的重建结果几乎相同，即使对于相对较小的感兴趣区域也是如此。此外，我们还表明，通过并行重建多个小区域并将它们合并为一个重建，可以高效地重建更大的区域。

英文摘要

In many applications of tomography, the acquired projections are either limited in number or contain a significant amount of noise. In these cases, standard reconstruction methods tend to produce artifacts that can make further analysis difficult. Advanced regularized iterative methods, such as total variation minimization, are often able to achieve a higher reconstruction quality by exploiting prior knowledge about the scanned object. In practice, however, these methods often have prohibitively long computation times or large memory requirements. Furthermore, since they are based on minimizing a global objective function, regularized iterative methods need to reconstruct the entire scanned object, even when one is only interested in a (small) region of the reconstructed image. In this paper, we present a method to approximate regularized iterative reconstruction methods inside a (small) region of the scanned object. The method only performs computations inside the region of interest, ensuring low computational requirements. Reconstruction results for different phantom images and types of regularization are given, showing that reconstructions of the proposed local method are almost identical to those of the global regularized iterative methods that are approximated, even for relatively small regions of interest. Furthermore, we show that larger regions can be reconstructed efficiently by reconstructing several small regions in parallel and combining them into a single reconstruction afterwards.

1402.4893 2026-06-04 cs.CV cs.NA math.NA 版本更新

Anisotropic Mesh Adaptation for Image Representation

各向异性网格自适应用于图像表示

Xianping Li

AI总结本文提出基于各向异性网格自适应的GPRAMA方法，通过改进的网格拼接技术实现更高质量的图像表示，同时降低计算成本。

Comments 25 pages, 15 figures

1603.08497 2026-06-04 cs.CV cs.NA math.NA 版本更新

On distances, paths and connections for hyperspectral image segmentation

关于超光谱图像分割的距离、路径和连接

Guillaume Noyel, Jesus Angulo, Dominique Jeulin

AI总结本文提出η和μ连接以增强λ-平坦区的区域信息，通过自顶向下的方法实现更精细的分割。

1410.7632 2026-06-04 cs.CV cs.RO cs.SY eess.SY 版本更新

On the Covariance of ICP-based Scan-matching Techniques

关于基于ICP的扫描匹配技术的协方差

Silvère Bonnabel, Martin Barczyk, François Goulette

AI总结本文研究了估计ICP算法所得旋转平移变换的协方差的问题，指出将已有闭式公式应用于点到点版本的ICP会导致错误的协方差，并通过数学证明说明该公式在点到平面版本中成立。

Comments Accepted at 2016 American Control Conference

1506.09016 2026-06-04 cs.LG cs.CV cs.NA math.NA math.OC stat.ML 版本更新

Online Learning to Sample

在线学习采样

Guillaume Bouchard, Théo Trouillon, Julien Perez, Adrien Gaidon

AI总结本文提出AW-SGD算法，通过在线学习优化采样策略，提升在线优化效率，应用于图像分类、矩阵分解和强化学习。

Comments Update: removed convergence theorem and proof as there is an error. Submitted to UAI 2016

AI中文摘要

随机梯度下降（SGD）是机器学习中用于在线优化最广泛使用的技术之一。在本工作中，我们通过自适应地学习如何在每个时间步选择最有用的训练示例来加速SGD。首先，我们证明SGD可以用于学习重要性采样估计器的最佳可能采样分布。其次，我们证明SGD算法的采样分布可以通过逐步最小化梯度的方差来在线估计。所得到的算法称为自适应加权SGD（AW-SGD），它维护一组待优化的参数，以及一组用于采样学习示例的参数。我们证明AW-SGD在三个不同的应用中实现了更快的收敛：（i）使用深度特征的图像分类，其中图像的采样取决于其标签，（ii）矩阵分解，其中行和列不是均匀采样的，以及（iii）强化学习，其中被优化的策略和探索策略同时被估计，此时我们的方法对应于一个off-policy梯度算法。

英文摘要

Stochastic Gradient Descent (SGD) is one of the most widely used techniques for online optimization in machine learning. In this work, we accelerate SGD by adaptively learning how to sample the most useful training examples at each time step. First, we show that SGD can be used to learn the best possible sampling distribution of an importance sampling estimator. Second, we show that the sampling distribution of an SGD algorithm can be estimated online by incrementally minimizing the variance of the gradient. The resulting algorithm, called Adaptive Weighted SGD (AW-SGD), maintains a set of parameters to optimize, as well as a set of parameters to sample learning examples. We show that AW-SGD yields faster convergence in three different applications: (i) image classification with deep features, where the sampling of images depends on their labels, (ii) matrix factorization, where rows and columns are not sampled uniformly, and (iii) reinforcement learning, where the optimized and exploration policies are estimated at the same time, and where our approach corresponds to an off-policy gradient algorithm.
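
To make concrete why the sampling distribution matters, the sketch below computes the exact mean and variance of a one-sample importance-weighted gradient estimate under two fixed distributions. It is not the AW-SGD algorithm, which learns the sampling parameters online. The scalar "gradients", the function names and both distributions are assumptions made purely for illustration.

```python
import numpy as np


def estimator_mean(grads, q):
    """Expectation of g_i / (n * q_i) when index i is drawn with probability q_i."""
    n = grads.size
    return np.sum(q * grads / (n * q))


def estimator_variance(grads, q):
    """Variance of the same one-sample importance-weighted estimate."""
    n = grads.size
    second_moment = np.sum(q * (grads / (n * q)) ** 2)
    return second_moment - estimator_mean(grads, q) ** 2


# Hypothetical per-example scalar gradients (all positive for simplicity).
grads = np.array([1.0, 2.0, 3.0, 10.0])
uniform = np.full(grads.size, 1.0 / grads.size)
proportional = np.abs(grads) / np.abs(grads).sum()

print(estimator_variance(grads, uniform))
print(estimator_variance(grads, proportional))
```

Both distributions give an unbiased estimate of the average gradient, but sampling in proportion to the gradient magnitude removes the variance entirely in this toy case. That is the kind of gain an adaptively learned sampler aims for.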

1410.1699 2026-06-04 math.NA cs.CV cs.NA math.OC physics.med-ph 版本更新

Mumford-Shah and Potts Regularization for Manifold-Valued Data with Applications to DTI and Q-Ball Imaging

Mumford-Shah和Potts正则化用于流形值数据及其在DTI和Q-ball成像中的应用

Andreas Weinmann, Laurent Demaret, Martin Storath

AI总结本文提出用于流形值信号和图像的Mumford-Shah和Potts正则化算法，通过动态规划与优化技术解决单变量问题，并针对Cartan-Hadamard流形证明算法能全局最小化。方法无需先验边缘集限制或数据空间离散化，应用于DTI和Q-ball成像实现脑白质分割。

DOI: 10.1007/s10851-015-0628-2

AI中文摘要

Mumford-Shah和Potts函数是强大的变分模型，广泛用于信号和图像处理，典型应用包括边缘保持去噪和分割。由于非光滑和非凸特性，即使对标量数据计算也具有挑战性。对于流形值数据，问题更加复杂，因为向量空间的典型特征不可用。本文提出用于流形值信号和图像的Mumford-Shah和Potts正则化算法。对于单变量问题，我们推导出基于动态规划结合（凸）优化技术的求解器。对于Cartan-Hadamard流形（包括扩散张量成像的数据空间），我们证明我们的算法能为任何起始点计算全局最小值。对于多变量Mumford-Shah和Potts问题（图像正则化），我们提出将其拆分为合适的子问题，利用相应单变量问题开发的技术精确求解。我们的方法不需要任何先验边缘集限制，也不需要离散化数据空间。我们将其应用于扩散张量成像（DTI）以及Q-ball成像。使用DTI模型，我们获得了脑胼胝体的分割。

英文摘要

Mumford-Shah and Potts functionals are powerful variational models for regularization which are widely used in signal and image processing; typical applications are edge-preserving denoising and segmentation. Being both non-smooth and non-convex, they are computationally challenging even for scalar data. For manifold-valued data, the problem becomes even more involved since typical features of vector spaces are not available. In this paper, we propose algorithms for Mumford-Shah and for Potts regularization of manifold-valued signals and images. For the univariate problems, we derive solvers based on dynamic programming combined with (convex) optimization techniques for manifold-valued data. For the class of Cartan-Hadamard manifolds (which includes the data space in diffusion tensor imaging), we show that our algorithms compute global minimizers for any starting point. For the multivariate Mumford-Shah and Potts problems (for image regularization) we propose a splitting into suitable subproblems which we can solve exactly using the techniques developed for the corresponding univariate problems. Our method does not require any a priori restrictions on the edge set and we do not have to discretize the data space. We apply our method to diffusion tensor imaging (DTI) as well as Q-ball imaging. Using the DTI model, we obtain a segmentation of the corpus callosum.

1510.00771 2026-06-04 cs.CV cs.RO cs.SY eess.SY 版本更新

Design and Analysis of a Single-Camera Omnistereo Sensor for Quadrotor Micro Aerial Vehicles (MAVs)

用于四旋翼微型飞行器（MAVs）的单相机 omnistereo 传感器的设计与分析

Carlos Jaramillo

AI总结本文提出一种适用于低负载四旋翼微型飞行器的单相机 omnistereo 传感器设计，通过共轴双曲面镜实现立体视觉，分析其几何特性与3D感知性能。

Comments 49 pages, 22 figures, journal article draft

DOI: 10.3390/s16020217
Journal ref: Sensors 16 (2016) 217

AI中文摘要

我们描述了一种应用于微型飞行器（MAVs）的全向立体视觉系统（omnistereo）的设计和3D感知性能。所提出的 omnistereo 模型采用一个单目相机，与一对双曲面镜共轴对齐（折叠式折反射配置）。我们证明，这种配置安装在低负载的螺旋桨式MAV顶部时可用于立体视觉。理论上的单视点（SVP）约束帮助我们推导出传感器投影几何的解析解，并生成符合SVP的全景图像，从而（以真正同步的方式）根据立体对应关系计算3D信息。我们对系统的各种特性进行了广泛分析，如尺寸、折反射空间分辨率和视场。此外，我们提出了一种概率模型，用于估计异面（skew）反投影射线三角化所得深度的不确定性。我们希望推动本方案的复现，因为它可以（最优地）适配到其他基于折反射的 omnistereo 视觉应用。

英文摘要

We describe the design and 3D sensing performance of an omnidirectional stereo-vision system (omnistereo) as applied to Micro Aerial Vehicles (MAVs). The proposed omnistereo model employs a monocular camera that is co-axially aligned with a pair of hyperboloidal mirrors (folded catadioptric configuration). We show that this arrangement is practical for performing stereo-vision when mounted on top of propeller-based MAVs characterized by low payloads. The theoretical single viewpoint (SVP) constraint helps us derive analytical solutions for the sensor's projective geometry and generate SVP-compliant panoramic images to compute 3D information from stereo correspondences (in a truly synchronous fashion). We perform an extensive analysis on various system characteristics such as its size, catadioptric spatial resolution, field-of-view. In addition, we pose a probabilistic model for uncertainty estimation of the depth from triangulation for skew back-projection rays. We expect to motivate the reproducibility of our solution since it can be adapted (optimally) to other catadioptric-based omnistereo vision applications.

1510.06895 2026-06-04 cs.LG cs.CV cs.NA math.NA 版本更新

Nonconvex Nonsmooth Low-Rank Minimization via Iteratively Reweighted Nuclear Norm

非凸非光滑低秩最小化通过迭代重加权核范数

Canyi Lu, Jinhui Tang, Shuicheng Yan, Zhouchen Lin

AI总结本文提出通过迭代重加权核范数算法解决非凸非光滑低秩最小化问题，利用非凸替代函数近似秩函数，提升低秩矩阵恢复性能。

DOI: 10.1109/TIP.2015.2511584

AI中文摘要

核范数作为秩函数的凸替代，在压缩感知的低秩矩阵恢复中被广泛使用，并应用于图像恢复和信号处理。然而，求解基于核范数的松弛凸问题通常只能得到原始秩最小化问题的次优解。本文提出在矩阵奇异值上使用一族$L_0$范数的非凸替代函数来近似秩函数，从而得到非凸非光滑最小化问题。然后通过迭代重加权核范数（IRNN）算法求解，该算法迭代求解加权奇异值阈值（WSVT）问题，后者由于非凸替代函数的特殊性质而具有闭式解。同时，IRNN被扩展以处理含两个或多个变量块的非凸问题。理论上，我们证明IRNN单调减少目标函数值，且任何极限点都是驻点。在合成数据和真实图像上的大量实验表明，IRNN相比最先进的凸算法在低秩矩阵恢复方面表现更优。

英文摘要

The nuclear norm is widely used as a convex surrogate of the rank function in compressive sensing for low rank matrix recovery with its applications in image recovery and signal processing. However, solving the nuclear norm based relaxed convex problem usually leads to a suboptimal solution of the original rank minimization problem. In this paper, we propose to perform a family of nonconvex surrogates of $L_0$-norm on the singular values of a matrix to approximate the rank function. This leads to a nonconvex nonsmooth minimization problem. Then we propose to solve the problem by Iteratively Reweighted Nuclear Norm (IRNN) algorithm. IRNN iteratively solves a Weighted Singular Value Thresholding (WSVT) problem, which has a closed form solution due to the special properties of the nonconvex surrogate functions. We also extend IRNN to solve the nonconvex problem with two or more blocks of variables. In theory, we prove that IRNN decreases the objective function value monotonically, and any limit point is a stationary point. Extensive experiments on both synthesized data and real images demonstrate that IRNN enhances the low-rank matrix recovery compared with state-of-the-art convex algorithms.
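
The weighted singular value thresholding step mentioned in this abstract can be sketched as follows. The sketch assumes the subproblem has the form of minimizing sum_i w_i * sigma_i(X) + 0.5 * ||X - Y||_F^2 with non-negative weights in non-descending order, in which case shrinking each singular value of Y by its weight gives a minimizer. The function name, the omission of any step-size scaling and the toy inputs are assumptions, not the paper's implementation.

```python
import numpy as np


def weighted_svt(Y, weights):
    """Shrink each singular value of Y by its own weight.

    Assumes non-negative weights sorted in non-descending order, so the
    shrunken singular values stay ordered (hypothetical helper).
    """
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    s_shrunk = np.maximum(s - weights, 0.0)   # soft-threshold per singular value
    return (U * s_shrunk) @ Vt                # rebuild with the shrunken spectrum


Y = np.diag([3.0, 2.0, 1.0])             # toy input with singular values 3, 2, 1
weights = np.array([0.5, 1.0, 1.5])      # hypothetical non-descending weights
X = weighted_svt(Y, weights)
```

In an iteratively reweighted scheme the weights would be recomputed from the current singular values at every iteration; here they are fixed to keep the example small.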

1512.01927 2026-06-04 math.NA cs.CV cs.LG cs.NA 版本更新

Fast Optimization Algorithm on Riemannian Manifolds and Its Application in Low-Rank Representation

流形上的快速优化算法及其在低秩表示中的应用

Haoran Chen, Yanfeng Sun, Junbin Gao, Yongli Hu

AI总结本文提出了一种具有快速收敛速度的新型一阶优化算法FOA，并在低秩表示模型中应用了基于FOA的快速子空间追踪方法，实验表明其在收敛速度和准确性方面优于其他方法。

1512.00298 2026-06-04 math.NA cs.CV cs.NA math.OC 版本更新

On Optical Flow Models for Variational Motion Estimation

关于变分运动估计中的光学流模型

Martin Burger, Hendrik Dirks, Lena Frerking

AI总结本文探讨了基于总变分的正则化方法在运动估计中的应用，重点分析了光学流模型的不同变体及优化方法，并提出通过反问题视角评估运动估计质量的框架。

Comments 27 pages, 3 figures, 2 tables

1511.04685 2026-06-04 math.NA cs.CV cs.NA math.SP 版本更新

Semi-Inner-Products for Convex Functionals and Their Use in Image Decomposition

半内积在凸泛函中的应用及其在图像分解中的应用

Guy Gilboa

AI总结本文扩展了Lumer意义下的半内积到凸泛函，构建了Banach空间中凸泛函的希尔伯特空间结构，并利用该结构分析总变分和更高阶泛函。

1411.0814 2026-06-04 math.NA cs.CV cs.NA 版本更新

A random algorithm for low-rank decomposition of large-scale matrices with missing entries

一种用于大规模矩阵低秩分解的随机算法

Yiguang Liu

AI总结本文提出随机子矩阵方法(RSM)用于大规模矩阵低秩分解，该方法在速度和内存使用上优于现有算法，并在精度上达到或接近最优。

详情

DOI: 10.1109/TIP.2015.2458176

AI中文摘要

本文提出了一种随机子矩阵方法(RSM)，用于计算具有已知条目百分比ρ的大规模矩阵的低秩分解。RSM非常快速，其所需的浮点运算(flops)与现有最先进算法相比具有竞争力。同时，RSM非常节省内存。在已知条目均匀分布于给定矩阵的情况下，通过已知条目形成的子矩阵被随机选择。根据已证明的定理，与较小奇异值相关的子空间受噪声扰动较小，因此计算每个子矩阵对应的null向量或右奇异向量。这些向量是给定大规模矩阵真实地面中的相应子矩阵的null向量。如果随机选择了足够的子矩阵，就能估计出低秩分解。在随机合成矩阵（如131072X1024）和真实数据集上的实验结果表明，RSM在速度和内存使用上显著优于现有方法，同时在精度上也达到或接近最优。

英文摘要

A Random SubMatrix method (RSM) is proposed to calculate the low-rank decomposition of large-scale matrices with known entry percentage ρ. RSM is very fast, as the floating-point operations (flops) required compare favorably with those of state-of-the-art algorithms. Meanwhile, RSM is very memory-saving. With known entries homogeneously distributed in the given matrix, sub-matrices formed by known entries are randomly selected. According to the proved theorem that the subspace related to smaller singular values is less perturbed by noise, the null vectors or the right singular vectors associated with the minor singular values are calculated for each submatrix. These vectors are the null vectors of the corresponding submatrix in the ground truth of the given large-scale matrix. If enough sub-matrices are randomly chosen, the low-rank decomposition is estimated. The experimental results on random synthetic matrices with sizes such as 131072 × 1024 and on real data sets indicate that RSM is much faster and more memory-saving, while its precision matches or approaches the best.

1509.04237 2026-06-04 cs.CV cs.NA math.NA 版本更新

A Total Fractional-Order Variation Model for Image Restoration with Non-homogeneous Boundary Conditions and its Numerical Solution

一种总分数阶变化模型用于图像恢复及其数值解法，具有非均匀边界条件

Jianping Zhang, Ke Chen

AI总结本文提出一种分数阶总α阶变化模型用于图像恢复，克服传统总变分模型的不足，通过分析理论性质和开发四种算法，在恢复质量和效率上优于现有高阶模型。

Comments 26 pages

详情

AI中文摘要

为克服基于总变分的图像恢复模型的不足，近年来提出了多种高阶（通常为二阶）正则化模型。本文分析并测试了一种基于分数阶导数的总α阶变化模型，其性能优于当前流行的高阶正则化模型。尽管已有使用总α阶变化进行图像恢复的研究，但尚未进行理论分析，且所有测试公式均使用零Dirichlet边界条件，这在现实中不适用（而非零边界条件违反分数阶导数的定义）。本文首先回顾了一些分数阶导数的结果，然后严格分析所提出总α阶变分模型的理论性质。接着开发了四种算法来求解变分问题，一种基于变分Split-Bregman思想，三种基于直接求解离散优化问题。数值实验表明，在恢复质量和解效率方面，所提模型在光滑图像上能与已建立的高阶模型（均 curvature 和总泛化变化）产生高度竞争的结果。

英文摘要

To overcome the weakness of a total variation based model for image restoration, various high order (typically second order) regularization models have been proposed and studied recently. In this paper we analyze and test a fractional-order derivative based total $α$-order variation model, which can outperform the currently popular high order regularization models. There exist several previous works using total $α$-order variations for image restoration; however, first, no analysis has been done yet and, second, all tested formulations, differing from each other, utilize zero Dirichlet boundary conditions, which are not realistic (while non-zero boundary conditions violate definitions of fractional-order derivatives). This paper first reviews some results on fractional-order derivatives and then rigorously analyzes the theoretical properties of the proposed total $α$-order variational model. It then develops four algorithms for solving the variational problem, one based on the variational Split-Bregman idea and three based on direct solution of the discretise-optimization problem. Numerical experiments show that, in terms of restoration quality and solution efficiency, the proposed model can produce results, for smooth images, that are highly competitive with two established high order models: the mean curvature and the total generalized variation.

1404.5009 2026-06-04 cs.CV cs.LG cs.NA math.NA 版本更新

Efficient Semidefinite Branch-and-Cut for MAP-MRF Inference

高效半定规划分支定界法用于MAP-MRF推断

Peng Wang, Chunhua Shen, Anton van den Hengel, Philip Torr

AI总结本文提出了一种高效的分支定界方法用于求解通用MAP-MRF推断问题，通过结合可扩展的半定规划和切割平面法，实现了高效的约束求解，并在密集连接或 unary 成本相对较低时取得最佳结果。

Comments 21 pages

1509.00728 2026-06-04 math.OC cs.CV cs.MA cs.NA math.NA stat.ML 版本更新

On Transitive Consistency for Linear Invertible Transformations between Euclidean Coordinate Systems

关于欧几里得坐标系统之间线性可逆变换的传递一致性

Johan Thunberg, Florian Bernard, Jorge Goncalves

AI总结本文研究了如何同步非传递一致的线性可逆变换，提出两种同步方法及迭代Gauss-Newton方法，适用于不同图拓扑，并通过仿真验证了方法的有效性。

Comments 25 pages

详情

AI中文摘要

传递一致性是欧几里得坐标框架之间线性可逆变换集合的内在属性。在实践中，当变换由数据估计时，这一属性往往缺失。本文解决如何同步非传递一致的变换的问题。一旦变换被同步，它们将满足传递一致性条件——从框架A到框架C的变换等于先从A到B再从B到C的复合变换。坐标框架对应图中的节点，变换对应图中的边。提出了两种直接或集中同步方法，分别适用于近强连通图和连通图。作为第二种方法的扩展，提出了迭代Gauss-Newton方法，并将其适应于仿射和欧几里得变换的情况。还提出了适用于正交矩阵的两种分布式同步方法，这些方法可以看作是两种直接或集中方法的分布式版本；它们类似于用于分布式平均的标准共识协议。当变换为正交矩阵时，可以计算最优性间隙的上界。仿真显示，即使在噪声幅度较大的情况下，间隙也几乎准确。本文还从理论层面提供了传递一致变换的线性代数关系。所提出方法的一个优点是其简单性——使用基本线性代数方法，例如奇异值分解（SVD）。对于广泛参数设置范围内的方法，进行了数值验证。

英文摘要

Transitive consistency is an intrinsic property for collections of linear invertible transformations between Euclidean coordinate frames. In practice, when the transformations are estimated from data, this property is lacking. This work addresses the problem of synchronizing transformations that are not transitively consistent. Once the transformations have been synchronized, they satisfy the transitive consistency condition - a transformation from frame $A$ to frame $C$ is equal to the composite transformation of first transforming $A$ to $B$ and then transforming $B$ to $C$. The coordinate frames correspond to nodes in a graph and the transformations correspond to edges in the same graph. Two direct or centralized synchronization methods are presented for different graph topologies; the first one for quasi-strongly connected graphs, and the second one for connected graphs. As an extension of the second method, an iterative Gauss-Newton method is presented, which is later adapted to the case of affine and Euclidean transformations. Two distributed synchronization methods are also presented for orthogonal matrices, which can be seen as distributed versions of the two direct or centralized methods; they are similar in nature to standard consensus protocols used for distributed averaging. When the transformations are orthogonal matrices, a bound on the optimality gap can be computed. Simulations show that the gap is almost tight, even for noise large in magnitude. This work also contributes on a theoretical level by providing linear algebraic relationships for transitively consistent transformations. One of the benefits of the proposed methods is their simplicity - basic linear algebraic methods are used, e.g., the Singular Value Decomposition (SVD). For a wide range of parameter settings, the methods are numerically validated.
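
The transitive consistency condition stated in this abstract can be checked numerically. The sketch below is a minimal illustration, assuming 2x2 matrices and hypothetical frame-to-reference maps `t_a`, `t_b`, `t_c`; the helper names are invented for this example, and it only tests the property rather than implementing any of the paper's synchronization methods.

```python
def matmul(p, q):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def inverse(m):
    """Inverse of an invertible 2x2 matrix."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]


def consistency_error(t_ab, t_bc, t_ac):
    """Largest entrywise gap between T_AC and the composition T_BC * T_AB."""
    composed = matmul(t_bc, t_ab)
    return max(abs(composed[i][j] - t_ac[i][j])
               for i in range(2) for j in range(2))


# Hypothetical maps from frames A, B, C into a common reference frame.
t_a = [[2.0, 1.0], [0.0, 1.0]]
t_b = [[1.0, 0.0], [1.0, 3.0]]
t_c = [[0.0, -1.0], [1.0, 0.0]]

# Pairwise transformations built this way are consistent by construction.
t_ab = matmul(inverse(t_b), t_a)
t_bc = matmul(inverse(t_c), t_b)
t_ac = matmul(inverse(t_c), t_a)
clean_error = consistency_error(t_ab, t_bc, t_ac)

# An "estimated" T_AC with a small perturbation breaks the property.
t_ac_noisy = [[t_ac[0][0] + 0.05, t_ac[0][1]], [t_ac[1][0], t_ac[1][1]]]
noisy_error = consistency_error(t_ab, t_bc, t_ac_noisy)
```

Here `clean_error` is zero up to rounding, while `noisy_error` reflects the injected perturbation; synchronization amounts to replacing the noisy pairwise estimates with a set for which this error vanishes on every triple of frames.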

1508.05514 2026-06-04 stat.ML cs.CV cs.LG cs.RO cs.SY eess.SY 版本更新

Gaussian Mixture Reduction Using Reverse Kullback-Leibler Divergence

基于反向Kullback-Leibler散度的高斯混合减少

Tohid Ardeshiri, Umut Orguner, Emre Özkan

AI总结本文提出一种贪心混合减少算法，基于Kullback-Leibler散度进行混合成分的剪枝与合并，通过分析近似方法提高计算效率，并在模拟和实际数据中验证其性能优于现有方法。

1412.2291 2026-06-04 stat.CO cs.CG cs.CV cs.NA math.NA 版本更新

Adjusted least squares fitting of algebraic hypersurfaces

修正的最小二乘法拟合代数超曲面

Konstantin Usevich, Ivan Markovsky

AI总结本文提出修正的最小二乘法用于拟合欧几里得空间中的点集，通过构造偏倚修正的矩矩阵解决普通最小二乘法的偏倚问题，并改进了计算算法。

Comments 30 pages, 10 figures

1508.04467 2026-06-04 cs.CV cs.IT cs.LG cs.NA math.IT math.NA stat.ML 版本更新

Robust Subspace Clustering via Smoothed Rank Approximation

通过平滑秩近似实现鲁棒子空间聚类

Zhao Kang, Chong Peng, Qiang Cheng

AI总结本文提出基于对数-行列式秩近似的方法，用于子空间聚类，以提高精度并有效处理误差和噪声。

Comments Journal, code is available

详情

DOI: 10.1109/LSP.2015.2460737
Journal ref: IEEE Signal Processing Letters, 22(2015)2088-2092

AI中文摘要

本文提出基于对数-行列式秩近似的方法，用于子空间聚类，以提高精度并有效处理误差和噪声。矩阵秩最小化受线性约束在许多应用领域中出现，从信号处理到机器学习。核范数是该问题的凸松弛，可以在某些受限且理论有趣的条件下精确恢复秩。然而，对于许多现实应用，核范数近似到秩函数只能产生远离最优解的结果。为了寻求比核范数更准确的解决方案，本文提出基于对数-行列式的秩近似方法。我们考虑将此秩近似应用于子空间聚类应用。我们的框架可以建模不同类型的误差和噪声。开发了有效的优化策略，并具有理论保证，以收敛到 stationary 点。所提出的方法在人脸识别和运动分割任务上相比最先进的子空间聚类算法表现出有希望的结果。

英文摘要

Matrix rank minimizing subject to affine constraints arises in many application areas, ranging from signal processing to machine learning. Nuclear norm is a convex relaxation for this problem which can recover the rank exactly under some restricted and theoretically interesting conditions. However, for many real-world applications, nuclear norm approximation to the rank function can only produce a result far from the optimum. To seek a solution of higher accuracy than the nuclear norm, in this paper, we propose a rank approximation based on Logarithm-Determinant. We consider using this rank approximation for subspace clustering application. Our framework can model different kinds of errors and noise. Effective optimization strategy is developed with theoretical guarantee to converge to a stationary point. The proposed method gives promising results on face clustering and motion segmentation tasks compared to the state-of-the-art subspace clustering algorithms.
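
To see why a log-determinant surrogate can track the rank more closely than the nuclear norm, the sketch below compares the two on lists of singular values. It assumes the surrogate takes the form log det(I + XᵀX) = Σ log(1 + σᵢ²); this specific form and the function names are assumptions made for illustration.

```python
import math


def rank_from_singular_values(sigmas, tol=1e-10):
    """Number of singular values that are numerically non-zero."""
    return sum(1 for s in sigmas if s > tol)


def nuclear_norm(sigmas):
    """Sum of singular values, the convex surrogate of the rank."""
    return sum(sigmas)


def logdet_surrogate(sigmas):
    """log det(I + X^T X) written in terms of the singular values of X."""
    return sum(math.log(1.0 + s * s) for s in sigmas)


# Two spectra with the same rank but very different magnitudes.
small = [1.0, 1.0, 0.0]
large = [100.0, 100.0, 0.0]

nuclear_ratio = nuclear_norm(large) / nuclear_norm(small)
logdet_ratio = logdet_surrogate(large) / logdet_surrogate(small)
```

Both spectra have rank 2, yet the nuclear norm grows in proportion to the magnitude of the singular values while the log-determinant value grows only logarithmically, so it is much less sensitive to scale.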

1503.01993 2026-06-04 cs.CV cs.NA math.NA 版本更新

Tomographic Image Reconstruction using Training images

利用训练图像进行断层图像重建

Sara Soltani, Martin S. Andersen, Per Christian Hansen

AI总结本文提出一种利用训练图像的断层图像重建算法，通过非负字典和正则化非负矩阵分解实现稀疏表示，减少计算复杂度，并在低剂量设置下表现优异。

Comments 25 pages, 12 figures

1507.08847 2026-06-04 cs.LG cs.CV cs.NA math.NA 版本更新

A novel multivariate performance optimization method based on sparse coding and hyper-predictor learning

一种基于稀疏编码和超预测器学习的新型多变量性能优化方法

Jiachen Yang, Zhiyong Ding, Fei Guo, Huogen Wang, Nick Hughes

AI总结本文提出一种新型方法，通过稀疏编码和超预测器学习优化多变量性能度量，通过联合优化问题最小化重建误差、稀疏性及复杂损失函数上界。

详情

DOI: 10.1016/j.neunet.2015.07.011

AI中文摘要

本文研究了多变量性能度量的优化问题，提出了一种新算法。与传统机器学习方法不同，本文研究如何学习有效超预测器以处理数据点元组，从而最小化对应于多变量性能度量的复杂损失函数。我们提出将数据点元组通过字典转换为稀疏码元组，然后应用线性函数比较稀疏码与给定候选类别标签。为了学习字典、稀疏码和线性函数参数，我们提出一个联合优化问题。在此问题中，同时最小化稀疏码的重建误差和稀疏性，以及复杂损失函数的上界。此外，损失函数的上界通过稀疏码和线性函数参数近似。为优化此问题，我们开发了一种基于下降梯度方法的迭代算法，交替学习稀疏码和超预测器参数。在一些基准数据集上的实验结果表明，所提方法优于其他最先进的算法。

英文摘要

In this paper, we investigate the problem of optimizing multivariate performance measures, and propose a novel algorithm for it. Different from traditional machine learning methods, which optimize simple loss functions to learn a prediction function, the problem studied in this paper is how to learn an effective hyper-predictor for a tuple of data points, so that a complex loss function corresponding to a multivariate performance measure can be minimized. We propose to represent the tuple of data points as a tuple of sparse codes via a dictionary, and then apply a linear function to compare a sparse code against a given candidate class label. To learn the dictionary, sparse codes, and parameter of the linear function, we propose a joint optimization problem. In this problem, both the reconstruction error and sparsity of the sparse codes, and the upper bound of the complex loss function, are minimized. Moreover, the upper bound of the loss function is approximated by the sparse codes and the linear function parameter. To optimize this problem, we develop an iterative algorithm based on gradient descent methods to learn the sparse codes and hyper-predictor parameter alternately. Experimental results on some benchmark data sets show the advantage of the proposed method over other state-of-the-art algorithms.

1506.08110 2026-06-04 cs.CV cs.NA math.NA 版本更新

Nonnegative Matrix Factorization applied to reordered pixels of single images based on patches to achieve structured nonnegative dictionaries

非负矩阵分解应用于基于块的单图像重新排序像素以实现结构非负字典

Richard M. Charles, Kye M. Taylor, James H. Curry

AI总结本文提出利用非负矩阵分解对单图像块重新排序像素生成结构化非负字典，通过SVD和NMF对比发现NMF能保留图像原始符号结构和局部细节。

Comments 34 pages, 15 figures, 2 tables

详情

AI中文摘要

近年来计算能力的提升使得处理和分析各种领域的大型数据集成为可能。通常分析需要创建低秩近似以实现高效存储。本文提出并分析了一种新颖方法，通过将非负矩阵分解应用于单自然图像的重新排序像素来创建非负、结构化的字典。我们基于块重新排序像素并提出了一般方法。我们研究了当使用奇异值分解（SVD）和非负矩阵分解（NMF）作为低秩近似时的方法。峰值信噪比（PSNR）和均值结构相似性指数（MSSIM）用于评估算法。我们报告说，虽然SVD提供最佳重建，但其向量字典丢失了原始图像的符号结构和局部细节。相比之下，使用NMF生成的字典保留了原始图像矩阵的符号结构，并提供了一个非负、基于部分的字典。

英文摘要

Recent improvements in computing allow for the processing and analysis of very large datasets in a variety of fields. Often the analysis requires the creation of low-rank approximations to the datasets leading to efficient storage. This article presents and analyzes a novel approach for creating nonnegative, structured dictionaries using NMF applied to reordered pixels of single, natural images. We reorder the pixels based on patches and present our approach in general. We investigate our approach when using the Singular Value Decomposition (SVD) and Nonnegative Matrix Factorizations (NMF) as low-rank approximations. Peak Signal-to-Noise Ratio (PSNR) and Mean Structural Similarity Index (MSSIM) are used to evaluate the algorithm. We report that while the SVD provides the best reconstructions, its dictionary of vectors loses both the sign structure of the original image and details of localized image content. In contrast, the dictionaries produced using NMF preserve the sign structure of the original image matrix and offer a nonnegative, parts-based dictionary.

1412.2700 2026-06-04 math.NA cs.CV cs.NA 版本更新

Subspace based low rank and joint sparse matrix recovery

基于子空间的低秩和联合稀疏矩阵恢复

Sampurna Biswas, Sunrita Poddar, Soura Dasgupta, Raghuraman Mudumbai, Mathews Jacob

AI总结本文研究从欠采样测量中恢复低秩和联合稀疏矩阵的问题，提出在不同测量矩阵下进行测量以减少总测量数，适用于动态MRI图像恢复。

Comments 5 pages, 5 figures, Asilomar 2014 conference submission

详情

AI中文摘要

我们考虑从欠采样的列测量中恢复低秩和联合稀疏矩阵的问题。该问题在高时空分辨率动态MRI数据恢复中具有重要相关性，其中矩阵的每一列对应图像时间序列中的一个帧，由于帧高度相关，矩阵具有高度低秩性。此外，在适当变换/框架域（如小波、梯度）中，矩阵的非零位置在不同帧中大致相同，其支持集的超集可以安全地假设为联合稀疏。与经典多测量向量（MMV）设置不同，我们考虑每个快照使用不同的测量矩阵进行测量。我们证明这种方法可以减少总测量数，特别是当矩阵的秩远小于其稀疏性时。在动态成像的实验中，该方法在实现自由呼吸心脏MRI中非常有用。

英文摘要

We consider the recovery of a low rank and jointly sparse matrix from undersampled measurements of its columns. This problem is highly relevant in the recovery of dynamic MRI data with high spatio-temporal resolution, where each column of the matrix corresponds to a frame in the image time series; the matrix is highly low-rank since the frames are highly correlated. Similarly, the non-zero locations of the matrix in appropriate transform/frame domains (e.g. wavelet, gradient) are roughly the same in different frames. The superset of the support can be safely assumed to be jointly sparse. Unlike the classical multiple measurement vector (MMV) setup that measures all the snapshots using the same matrix, we consider each snapshot to be measured using a different measurement matrix. We show that this approach reduces the total number of measurements, especially when the rank of the matrix is much smaller than its sparsity. Our experiments in the context of dynamic imaging show that this approach is very useful in realizing free breathing cardiac MRI.

1306.1392 2026-06-04 math.NA cs.CV cs.NA 版本更新

PyHST2: an hybrid distributed code for high speed tomographic reconstruction with iterative reconstruction and a priori knowledge capabilities

PyHST2：一种混合分布式代码，用于高速断层扫描重建，支持迭代重建和先验知识能力

Alessandro Mirone, Emmanuelle Gouillart, Emmanuel Brun, Paul Tafforeau, Jerome Kieffer

AI总结 PyHST2是一种用于高速断层扫描重建的混合分布式代码，支持迭代重建和先验知识，适用于第三代同步辐射设施的高数据流需求。

详情

DOI: 10.1016/j.nimb.2013.09.030

AI中文摘要

我们介绍了PyHST2代码，该代码在ESRF用于相位对比和吸收断层扫描。该代码采用了分布式和流水线架构，以支持第三代同步辐射设施的高数据流（每实验10太字节）。代码实现了默认的滤波反投影重建，以及结合先验知识的迭代重建技术。这些技术用于提高重建质量或减少所需数据量以达到目标质量。实现的先验知识技术基于总变分惩罚和一种新的凸函数，该函数基于重叠块。我们详细介绍了不同方法及其实现，代码以免费许可证发布。我们还提供了在没有地面真实数据的情况下估计先验技术最佳参数值的方法。

英文摘要

We present the PyHST2 code which is in service at ESRF for phase-contrast and absorption tomography. This code has been engineered to sustain the high data flow typical of the third generation synchrotron facilities (10 terabytes per experiment) by adopting a distributed and pipelined architecture. The code implements, besides a default filtered backprojection reconstruction, iterative reconstruction techniques with a-priori knowledge. The latter are used to improve the reconstruction quality or to reduce the required data volume while still reaching a given quality goal. The implemented a-priori knowledge techniques are based on the total variation penalisation and a recently found convex functional which is based on overlapping patches. We give details of the different methods and their implementations, and the code is distributed under a free license. We provide methods for estimating, in the absence of ground-truth data, the optimal parameter values for a-priori techniques.

1305.1256 2026-06-04 math.NA cs.CV cs.NA 版本更新

A Convex Functional for Image Denoising based on Patches with Constrained Overlaps and its vectorial application to Low Dose Differential Phase Tomography

基于具有受限重叠的块的凸函数的图像去噪方法及其在低剂量差分相位断层扫描中的向量应用

Alessandro Mirone, Emmanuel Brun, Paola Coan

AI总结本文提出了一种新的凸函数用于图像去噪，通过引入块间相似性约束项，结合FISTA加速迭代算法，在低剂量差分相位断层扫描中实现了高效且鲁棒的重建。

详情

DOI: 10.1371/journal.pone.0114325

AI中文摘要

我们通过字典学习技术解决图像去噪问题，构建了一种新的凸函数形式。该函数包含通常的稀疏诱导项和保真项，以及一个新的项，诱导重叠区域中块之间的相似性。该函数依赖于两个自由正则化参数：一个乘以块基函数系数的稀疏诱导L1范数系数，另一个乘以重叠区域中块差异的L2范数系数。通过应用迭代近端梯度下降法与FISTA加速求解。在断层扫描重建中，通过在每次迭代步骤中应用解的投影及其误差反投影来计算梯度。我们在合成数据上研究了解的质量，作为正则化参数和噪声函数的函数，其中解是先验已知的。我们对实验数据应用该方法，针对差分相位断层扫描，采用一种原始方法，即使用向量块，每个块有两个成分：每个梯度成分一个。所得到的算法在ESRF断层扫描重建代码PyHST中实现，结果显示出稳健、高效，并且适合在医学断层扫描中显著减少所需剂量和投影数量。

英文摘要

We solve the image denoising problem with a dictionary learning technique by writing a convex functional of a new form. This functional contains, besides the usual sparsity-inducing term and fidelity term, a new term which induces similarity between overlapping patches in the overlap regions. The functional depends on two free regularization parameters: a coefficient multiplying the sparsity-inducing $L_{1}$ norm of the patch basis function coefficients, and a coefficient multiplying the $L_{2}$ norm of the differences between patches in the overlapping regions. The solution is found by applying the iterative proximal gradient descent method with FISTA acceleration. In the case of tomography reconstruction we calculate the gradient by applying projection of the solution and its error backprojection at each iterative step. We study the quality of the solution, as a function of the regularization parameters and noise, on synthetic data for which the solution is known a priori. We apply the method to experimental data in the case of Differential Phase Tomography. For this case we use an original approach which consists in using vectorial patches, each patch having two components: one per gradient component. The resulting algorithm, implemented in the ESRF tomography reconstruction code PyHST, proves to be robust, efficient, and well adapted to strongly reduce the required dose and the number of projections in medical tomography.
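
The proximal gradient iteration with FISTA acceleration mentioned in this abstract can be sketched on a much simpler problem. The example below minimises 0.5·||Ax − b||² + λ·||x||₁ only; it omits the patch-overlap term and the tomographic projector of the paper, and the function names, the dense list-of-rows matrix format and the test data are assumptions made for illustration.

```python
import math


def soft_threshold(v, t):
    """Proximal operator of t * |.|, applied to one number."""
    return math.copysign(max(abs(v) - t, 0.0), v)


def fista_lasso(a, b, lam, step, iters=200):
    """Minimise 0.5 * ||a x - b||^2 + lam * ||x||_1 with FISTA.

    `a` is a list of rows; `step` must not exceed 1 / L, where L is the
    largest eigenvalue of a^T a.
    """
    n = len(a[0])
    x = [0.0] * n
    y = x[:]
    t = 1.0
    for _ in range(iters):
        # Gradient of the smooth term at the extrapolated point y.
        residual = [sum(row[j] * y[j] for j in range(n)) - bi
                    for row, bi in zip(a, b)]
        grad = [sum(a[i][j] * residual[i] for i in range(len(a)))
                for j in range(n)]
        # Proximal gradient step.
        x_new = [soft_threshold(y[j] - step * grad[j], step * lam)
                 for j in range(n)]
        # Momentum update.
        t_new = (1.0 + math.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = [x_new[j] + ((t - 1.0) / t_new) * (x_new[j] - x[j])
             for j in range(n)]
        x, t = x_new, t_new
    return x


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
solution = fista_lasso(identity, [3.0, -0.2, 1.5], lam=0.5, step=1.0)
```

With the identity as the forward operator the minimiser is simply the soft-thresholded data, which makes the example easy to verify by hand; in a tomographic setting the two matrix products would be replaced by projection and backprojection.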

1506.00060 2026-06-04 cs.CV cs.NA math.NA 版本更新

A Three-stage Approach for Segmenting Degraded Color Images: Smoothing, Lifting and Thresholding (SLaT)

一种用于退化彩色图像分割的三阶段方法：平滑、提升和阈值化（SLaT）

Xiaohao Cai, Raymond Chan, Mila Nikolova, Tieyong Zeng

AI总结本文提出了一种三阶段SLaT方法，用于处理噪声、信息损失和模糊的多相分割。通过凸变分模型平滑图像，利用维度提升保留颜色信息，并通过多通道阈值化实现分割，提升了分割质量和效率。

Comments 19 pages

详情

AI中文摘要

本文提出了一种SLaT（平滑、提升和阈值化）方法，用于多相分割受噪声、信息损失和模糊影响的彩色图像。第一阶段应用凸变分模型的变体对每个通道进行平滑处理，证明该模型在不同退化下具有唯一解。第二阶段通过维度提升，将恢复图像与其在二次颜色空间中的变换组合成向量值图像，以保留足够的信息进行分割。第三阶段通过多通道阈值化找到分割结果。相数仅在最后阶段需要，用户可自由选择或更改而无需重新解决前阶段。实验表明，SLaT方法在分割质量和CPU时间上优于其他先进方法。

英文摘要

In this paper, we propose a SLaT (Smoothing, Lifting and Thresholding) method with three stages for multiphase segmentation of color images corrupted by different degradations: noise, information loss, and blur. At the first stage, a convex variant of the Mumford-Shah model is applied to each channel to obtain a smooth image. We show that the model has a unique solution under the different degradations. In order to properly handle the color information, the second stage is dimension lifting, where we consider a new vector-valued image composed of the restored image and its transform in a secondary color space with additional information. This ensures that even if the first color space has highly correlated channels, we still have enough information to give good segmentation results. In the last stage, we apply multichannel thresholding to the combined vector-valued image to find the segmentation. The number of phases is only required in the last stage, so users can choose or change it without solving the previous stages again. Experiments demonstrate that our SLaT method gives excellent results in terms of segmentation quality and CPU time in comparison with other state-of-the-art segmentation methods.

1504.00905 2026-06-04 math.OC cs.CV cs.LG cs.SY eess.SY 版本更新

Robust Anomaly Detection Using Semidefinite Programming

使用半定规划进行鲁棒异常检测

Jose A. Lopez, Octavia Camps, Mario Sznaier

AI总结本文提出基于多项式优化和矩方法的新型异常检测方法，仅需正常状态特征统计矩信息，相较于Parzen窗口和1类SVM等方法表现更优，且能简洁描述正常状态，简化高维数据集的异常检测问题。

Comments 13 pages, 11 figures

1505.07690 2026-06-04 math.NA cs.CV cs.NA 版本更新

Invertible Orientation Scores of 3D Images

三维图像的可逆取向分数

Michiel Janssen, Remco Duits, Marcel Breeuwer

AI总结本文提出三维取向分数，用于增强和检测噪声图像中的细长结构，采用可逆相干态变换和球面谐波变换实现高效计算，展示初步应用结果。

Comments SSVM 2015 published version in LNCS contains a mistake (a switch in the notation of the spherical angles) that is corrected in this arXiv version

1312.6208 2026-06-04 math.NA cs.CV cs.NA 版本更新

Total variation with overlapping group sparsity for image deblurring under impulse noise

总变分与重叠组稀疏性用于在脉冲噪声下的图像去模糊

Gang Liu, Ting-Zhu Huang, Jun Liu, Xiao-Guang Lv

AI总结本文提出一种结合总变分与重叠组稀疏性的模型，用于在脉冲噪声下恢复模糊图像，通过引入盒约束和ADMM框架下的高效算法，有效缓解阶梯效应并提升PSNR和相对误差。

Comments 22 pages, 57 figures, submitted

详情

DOI: 10.1371/journal.pone.0122562
Journal ref: PLOS ONE 2015 10(4): e0122562

AI中文摘要

总变分（TV）正则化方法是图像去模糊中保留边缘的有效方法。然而，基于TV的解决方案通常存在阶梯效应。本文为了缓解阶梯效应，提出了一种新的模型，用于恢复受脉冲噪声影响的模糊图像。该模型包含一个ℓ1保真项和一个总变分与重叠组稀疏性（OGS）正则化项。此外，我们对所提模型施加盒约束以获得更准确的解决方案。提出了一种高效的算法，在交替方向乘子法（ADMM）框架下求解该模型。我们使用一个内循环，嵌套在主要化最小化（MM）迭代中，用于所提方法的子问题。与其他方法相比，数值结果表明，所提方法在避免阶梯效应和PSNR以及相对误差（ReE）方面显著提高了恢复质量。

英文摘要

The total variation (TV) regularization method is an effective method for image deblurring that preserves edges. However, TV-based solutions usually have some staircase effects. In this paper, in order to alleviate the staircase effect, we propose a new model for restoring blurred images with impulse noise. The model consists of an $\ell_1$-fidelity term and a TV with overlapping group sparsity (OGS) regularization term. Moreover, we impose a box constraint on the proposed model to obtain more accurate solutions. An efficient and effective algorithm is proposed to solve the model under the framework of the alternating direction method of multipliers (ADMM). We use an inner loop which is nested inside the majorization minimization (MM) iteration for the subproblem of the proposed method. Compared with other methods, numerical results illustrate that the proposed method can significantly improve the restoration quality, both in avoiding staircase effects and in terms of peak signal-to-noise ratio (PSNR) and relative error (ReE).
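
The two quality measures named at the end of this abstract can be written down in a few lines. The sketch below uses the common textbook definitions, PSNR = 10·log₁₀(peak² / MSE) and ReE = ||x − x_ref||₂ / ||x_ref||₂, on flat lists of pixel values; whether the paper uses exactly these normalisations is an assumption, as are the function names and sample values.

```python
import math


def psnr(reference, restored, peak=255.0):
    """Peak signal-to-noise ratio in dB for two equally sized pixel lists."""
    mse = sum((r - x) ** 2 for r, x in zip(reference, restored)) / len(reference)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)


def relative_error(reference, restored):
    """Euclidean norm of the error divided by the norm of the reference."""
    num = math.sqrt(sum((r - x) ** 2 for r, x in zip(reference, restored)))
    den = math.sqrt(sum(r * r for r in reference))
    return num / den


reference = [100.0, 150.0, 200.0, 250.0]
restored = [101.0, 149.0, 201.0, 249.0]

quality_db = psnr(reference, restored)
quality_ree = relative_error(reference, restored)
```

A higher PSNR and a lower relative error both indicate a restoration closer to the reference image.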

1505.01599 2026-06-04 cs.CV cs.NA math.NA 版本更新

Filter characteristics in image decomposition with singular spectrum analysis

图像分解中的奇异谱分析滤波特性

Kenji Kume, Naoko Nose-Togawa

AI总结本文研究了奇异谱分析在多维数据分解中的滤波特性，指出自适应生成的滤波器具有对称性，用于图像去噪。

详情

快速且鲁棒的固定秩矩阵恢复

German Ros, Julio Guerrero

AI总结本文提出了一种高效且稳定的固定秩矩阵分解方法，通过几何和代数技术结合，避免了截断奇异值分解的瓶颈，提升了大规模问题的处理效率。

详情

AI中文摘要

我们解决了高效稀疏固定秩（S-FR）矩阵分解问题，即将受污染的矩阵M分解为未受污染的秩为r的矩阵L和稀疏异常值矩阵S。固定秩约束通常由系统研究的物理限制决定。本文提出了一种准确且高效的S-FR分解方法，适用于大规模问题。我们的方法结合了几何和代数技术，避免了截断奇异值分解（TSVD）的瓶颈。相反，采用极坐标分解来利用固定秩问题的流形结构，作为Stiefel和SPD流形的乘积，从而获得更好的收敛性和稳定性。然后，闭合形式的投影器有助于加速方法的每次迭代。我们引入了一种新的快速投影器用于SPD流形，并证明其有效性。进一步的加速是通过Nystrom方案实现的。在鲁棒光度立体和光谱聚类的合成和真实数据实验中，我们的方法优于现有技术。

英文摘要

We address the problem of efficient sparse fixed-rank (S-FR) matrix decomposition, i.e., splitting a corrupted matrix $M$ into an uncorrupted matrix $L$ of rank $r$ and a sparse matrix of outliers $S$. Fixed-rank constraints are usually imposed by the physical restrictions of the system under study. Here we propose a method to perform accurate and very efficient S-FR decomposition that is more suitable for large-scale problems than existing approaches. Our method is a graceful combination of geometrical and algebraic techniques, which avoids the bottleneck caused by the Truncated SVD (TSVD). Instead, a polar factorization is used to exploit the manifold structure of fixed-rank problems as the product of two Stiefel manifolds and an SPD manifold, leading to better convergence and stability. Then, closed-form projectors help to speed up each iteration of the method. We introduce a novel and fast projector for the $\text{SPD}$ manifold and a proof of its validity. Further acceleration is achieved using a Nyström scheme. Extensive experiments with synthetic and real data in the context of robust photometric stereo and spectral clustering show that our proposals outperform the state of the art.

1503.06561 2026-06-04 math.NA cs.CV cs.NA 版本更新

A Comparative Analysis of Tensor Decomposition Models Using Hyper Spectral Image

基于超光谱图像的张量分解模型比较分析

Ankit Gupta, Ashish Oberoi

AI总结本文比较了LMLRA、BTD和CPD三种张量分解模型在超光谱图像处理中的性能，发现BTD在分解结果上表现最佳。

Comments 7 pages, 3 figures, 1 table

1305.3006 2026-06-04 cs.CV cs.NA math.NA 版本更新

Fast Linearized Alternating Direction Minimization Algorithm with Adaptive Parameter Selection for Multiplicative Noise Removal

快速线性化交替方向最小化算法与自适应参数选择用于乘性噪声去除

Dai-Qiang Chen, Li-Zhi Cheng

AI总结本文提出基于线性化技术的两种快速算法，通过特殊偏差函数自适应选择正则化参数，同时恢复图像，在PSNR和计算时间上优于现有方法。

Comments 23 pages

详情

DOI: 10.1016/j.cam.2013.08.012
Journal ref: Journal of Computational and Applied Mathematics 257 (2014) 29-45

AI中文摘要

由于总变分（TV）具有边缘保持能力和低计算成本，具有TV正则化的变分模型在乘性噪声去除领域被广泛研究。成功应用的关键在于：正则化参数的最优选择，平衡数据保真项与TV正则化项；以及高效算法计算解。本文提出两种基于线性化技术的快速算法，能够同时估计正则化参数并恢复图像。在所提算法的迭代步骤中，正则化参数通过为乘性噪声定义的特殊偏差函数进行调整。在一定条件下证明了所提算法的收敛性，数值实验显示所提算法在PSNR值和计算时间上整体优于一些最先进的方法。

英文摘要

Owing to the edge preserving ability and low computational cost of the total variation (TV), variational models with the TV regularization have been widely investigated in the field of multiplicative noise removal. The key points of the successful application of these models lie in: the optimal selection of the regularization parameter which balances the data-fidelity term with the TV regularizer; the efficient algorithm to compute the solution. In this paper, we propose two fast algorithms based on the linearized technique, which are able to estimate the regularization parameter and recover the image simultaneously. In the iteration step of the proposed algorithms, the regularization parameter is adjusted by a special discrepancy function defined for multiplicative noise. The convergence properties of the proposed algorithms are proved under certain conditions, and numerical experiments demonstrate that the proposed algorithms overall outperform some state-of-the-art methods in the PSNR values and computational time.

1502.06220 2026-06-04 cs.CV cs.NA math.NA 版本更新

Boosting of Image Denoising Algorithms

图像去噪算法的提升

Yaniv Romano, Michael Elad

AI总结本文提出一种通用递归算法提升图像去噪方法，通过SOS流程增强信号、操作去噪方法并减去前一步结果，研究其收敛性并展示在K-SVD等算法中的改进效果。

Comments 33 pages, 9 figures, 3 tables, submitted to SIAM Journal on Imaging Sciences

1502.07743 2026-06-04 eess.SY cs.CV cs.SY math.OC 版本更新

Tracking an Object with Unknown Accelerations using a Shadowing Filter

利用阴影滤波器跟踪具有未知加速度的对象

Kevin Judd

AI总结本文提出基于阴影滤波器的跟踪方法，用于处理未知随机加速度的物体跟踪问题，该方法高效且稳健，优于传统卡尔曼滤波。

Comments 20 pages, 5 figures

详情

AI中文摘要

一个常见的问题是跟踪物理对象，如机动船只、飞机、陆地车辆、航天器或携带无线设备的生物体。传感器数据通常有限，且范围或方位的观测不准确。此问题比跟踪弹道轨迹更困难，因为操作会影响未知且任意变化的加速度。尽管随机滤波或状态估计（卡尔曼滤波和粒子滤波）被广泛使用，但在这种跟踪上下文中，变分方法更为合适，因为物体通常不显示显著的随机运动。这促使我们提出基于阴影滤波器的优雅方法。所得到的滤波器高效（减少为线性方程的求解）且稳健（不受缺失数据和奇异相关性导致贝叶斯滤波灾难性失败的影响）。跟踪如此稳健，以至于在某些常见情况下，它实际上通过忽略对卡尔曼滤波至关重要的误差相关性而表现更好。

英文摘要

A commonly encountered problem is the tracking of a physical object, like a maneuvering ship, aircraft, land vehicle, spacecraft or animate creature carrying a wireless device. The sensor data often consist of limited and inaccurate observations of range or bearing. This problem is more difficult than tracking a ballistic trajectory, because an operative effects unknown and arbitrarily changing accelerations. Although stochastic methods of filtering or state estimation (Kalman filters and particle filters) are widely used, out-of-vogue variational methods are more appropriate in this tracking context, because the objects do not typically display any significant random motions at the length and time scales of interest. This leads us to propose a rather elegant approach based on a shadowing filter. The resulting filter is efficient (it reduces to the solution of linear equations) and robust (unaffected by missing data and singular correlations that would cause catastrophic failure of Bayesian filters). The tracking is so robust that in some common situations it actually performs better by ignoring error correlations that are so vital to Kalman filters.

1502.00555 2026-06-04 stat.ME cs.CV cs.MM cs.NA math.NA stat.CO 版本更新

A Discrete Tchebichef Transform Approximation for Image and Video Coding

一种用于图像和视频编码的离散切比绍夫变换近似

P. A. M. Oliveira, R. J. Cintra, F. M. Bayer, S. Kulasekera, A. Madanayake

AI总结本文提出了一种低复杂度的离散切比绍夫变换近似方法，通过减少乘法和加法运算提升编码效率，并在FPGA上实现时降低功耗和面积。

Comments 13 pages, 5 figures, 2 tables

1310.5715 2026-06-04 math.NA cs.CV cs.LG cs.NA math.OC stat.ML 版本更新

Stochastic Gradient Descent, Weighted Sampling, and the Randomized Kaczmarz algorithm

随机梯度下降、加权采样与随机化Kaczmarz算法

Deanna Needell, Nathan Srebro, Rachel Ward

AI总结本文改进了随机梯度下降在光滑强凸目标下的线性收敛保证，从二次依赖于条件数转换为线性依赖，同时探讨了加权采样对收敛性的影响，并将随机化Kaczmarz算法与SGD联系起来，证明其在加权最小二乘问题中的指数收敛性。

Comments 22 pages, 6 figures

详情

AI中文摘要

我们获得了随机梯度下降在光滑且强凸目标下的改进有限样本保证，将线性收敛的依赖从二次的条件数$(L/μ)^2$（其中$L$是光滑性的上界，$μ$是强凸性的上界）转为线性依赖于$L/μ$。此外，我们展示了如何通过重新加权采样分布（即重要性采样）进一步提升收敛性，并获得平均光滑性的线性依赖，优于先前结果。我们还讨论了SGD中的重要性采样在其他场景中的应用。我们的结果基于将SGD与随机化Kaczmarz算法联系起来的发现，使我们能够将两种方法的文献思想相互转移。特别是，我们将随机化Kaczmarz算法重新表述为SGD的一个实例，并应用我们的结果证明其在加权最小二乘问题中的指数收敛性，而非原始最小二乘问题。然后，我们提出了一种修改的Kaczmarz算法，具有部分偏置采样，该算法能够收敛到原始最小二乘解，并以相同的指数收敛速率。

英文摘要

We obtain an improved finite-sample guarantee on the linear convergence of stochastic gradient descent for smooth and strongly convex objectives, improving from a quadratic dependence on the conditioning $(L/μ)^2$ (where $L$ is a bound on the smoothness and $μ$ on the strong convexity) to a linear dependence on $L/μ$. Furthermore, we show how reweighting the sampling distribution (i.e. importance sampling) is necessary in order to further improve convergence, and obtain a linear dependence in the average smoothness, dominating previous results. We also discuss importance sampling for SGD more broadly and show how it can improve convergence also in other scenarios. Our results are based on a connection we make between SGD and the randomized Kaczmarz algorithm, which allows us to transfer ideas between the separate bodies of literature studying each of the two methods. In particular, we recast the randomized Kaczmarz algorithm as an instance of SGD, and apply our results to prove its exponential convergence, but to the solution of a weighted least squares problem rather than the original least squares problem. We then present a modified Kaczmarz algorithm with partially biased sampling which does converge to the original least squares solution with the same exponential convergence rate.
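
The randomized Kaczmarz algorithm referred to in this abstract is short enough to sketch in full. The version below samples rows with probability proportional to their squared norm and projects onto one row constraint per step; it is applied to a small consistent system, where it converges to the exact solution. The function name, the test system and the iteration count are assumptions chosen for illustration, and the paper's partially biased sampling variant is not shown.

```python
import random


def randomized_kaczmarz(a, b, iters=2000, seed=0):
    """Solve a consistent system a x = b by random row projections.

    Rows are sampled with probability proportional to their squared norm.
    """
    rng = random.Random(seed)
    n = len(a[0])
    row_norms_sq = [sum(v * v for v in row) for row in a]
    x = [0.0] * n
    for _ in range(iters):
        i = rng.choices(range(len(a)), weights=row_norms_sq)[0]
        # Project x onto the hyperplane {z : <a_i, z> = b_i}.
        residual = b[i] - sum(a[i][j] * x[j] for j in range(n))
        scale = residual / row_norms_sq[i]
        x = [x[j] + scale * a[i][j] for j in range(n)]
    return x


a = [[2.0, 1.0], [1.0, 3.0], [1.0, -1.0]]
x_true = [1.0, 2.0]
b = [sum(row[j] * x_true[j] for j in range(2)) for row in a]
x_hat = randomized_kaczmarz(a, b)
```

Each step can be read as a stochastic gradient step on the squared residual of one sampled row with a row-dependent step size, which is the kind of connection between SGD and Kaczmarz that the abstract describes.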

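1501.02995 2026-06-04 cs.MM cs.CV cs.NA math.NA stat.ME (version update)

Improved 8-point Approximate DCT for Image and Video Compression Requiring Only 14 Additions

U. S. Potluri, A. Madanayake, R. J. Cintra, F. M. Bayer, S. Kulasekera, A. Edirisuriya

AI summary: Proposes a low-complexity 8-point DCT approximation that requires only 14 additions and no multiplications, compares it with existing approximations in terms of algorithmic complexity and peak signal-to-noise ratio, and presents it as a candidate for reconfigurable video standards such as HEVC.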

Comments 30 pages, 7 figures, 5 tables

DOI: 10.1109/TCSI.2013.2295022
Journal ref: Circuits and Systems I: Regular Papers, IEEE Transactions on, Volume 61, Issue 6, June 2014, 1727--1740

Abstract

The low energy consumption that video processing systems such as HEVC require for the multimedia market has led to extensive development of fast algorithms for the efficient approximation of 2-D DCT transforms. The DCT is employed in a multitude of compression standards due to its remarkable energy compaction properties. Multiplier-free approximate DCT transforms have been proposed that offer superior compression performance at very low circuit complexity. Such approximations can be realized in digital VLSI hardware using additions and subtractions only, leading to significant reductions in chip area and power consumption compared to conventional DCTs and integer transforms. In this paper, we introduce a novel 8-point DCT approximation that requires only 14 addition operations and no multiplications. The proposed transform possesses low computational complexity and is compared to state-of-the-art DCT approximations in terms of both algorithm complexity and peak signal-to-noise ratio. The proposed DCT approximation is a candidate for reconfigurable video standards such as HEVC. The proposed transform and several other DCT approximations are mapped to systolic-array digital architectures and physically realized as digital prototype circuits using FPGA technology and mapped to 45 nm CMOS technology.

1501.00680 2026-06-04 math.NA cs.CV cs.NA (version update)

A New Method for Signal and Image Analysis: The Square Wave Method

Osvaldo Skliar, Ricardo E. Monge, Sherry Gapper

AI summary: Introduces the Square Wave Method for signal and image analysis and illustrates its frequency-domain behavior through two case studies.

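1403.5403 2026-06-04 cs.CV cs.NA math.NA math.OC (version update)

A Non-Local Structure Tensor Based Approach for Multicomponent Image Recovery Problems

Giovanni Chierchia, Nelly Pustelnik, Beatrice Pesquet-Popescu, Jean-Christophe Pesquet

AI summary: Proposes a multicomponent image recovery method based on non-local total variation that uses the structure tensor obtained from the image gradient and penalizes non-local variations through $\ell_{1,p}$ matrix norms, improving convergence speed.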

DOI: 10.1109/TIP.2014.2364141

Abstract

Non-Local Total Variation (NLTV) has emerged as a useful tool in variational methods for image recovery problems. In this paper, we extend the NLTV-based regularization to multicomponent images by taking advantage of the Structure Tensor (ST) resulting from the gradient of a multicomponent image. The proposed approach allows us to penalize the non-local variations, jointly for the different components, through various $\ell_{1,p}$ matrix norms with $p \ge 1$. To facilitate the choice of the hyper-parameters, we adopt a constrained convex optimization approach in which we minimize the data fidelity term subject to a constraint involving the ST-NLTV regularization. The resulting convex optimization problem is solved with a novel epigraphical projection method. This formulation can be efficiently implemented thanks to the flexibility offered by recent primal-dual proximal algorithms. Experiments are carried out for multispectral and hyperspectral images. The results demonstrate the interest of introducing a non-local structure tensor regularization and show that the proposed approach leads to significant improvements in terms of convergence speed over current state-of-the-art methods.

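1406.5429 2026-06-04 math.NA cs.CV cs.LG cs.NA math.OC (version update)

Playing with Duality: An Overview of Recent Primal-Dual Approaches for Solving Large-Scale Optimization Problems

Nikos Komodakis, Jean-Christophe Pesquet

AI summary: Surveys recent primal-dual approaches for solving large-scale optimization problems, discusses the role of duality in signal processing, computer vision, and machine learning, and highlights the advantages of primal-dual algorithms for convex and discrete optimization problems.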

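1411.2584 2026-06-04 cs.CV cs.NA math.NA (version update)

Applications of sampling Kantorovich operators to thermographic images for seismic engineering

Danilo Costarelli, Federico Cluni, Anna Maria Minotti, Gianluca Vinti

AI summary: Uses the theory of multivariate sampling Kantorovich operators $S_w$ to develop a matrix-based MATLAB algorithm for reconstructing thermographic images, which are then used to model the behavior of buildings under seismic action; a real-world case study compares the performance of the resulting models.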

Comments 16 pages, 5 figures, 2 tables

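1410.3426 2026-06-04 math.NA cs.CV cs.NA (version update)

Computing Topology Preservation of RBF Transformations for Landmark-Based Image Registration

R. Cavoretto, A. De Rossi, H. Qiao, B. Quatember, W. Recheis, M. Mayr

AI summary: Studies how well RBF transformations preserve topology in landmark-based image registration, analyzing Matérn functions and other transformations with one-landmark and four-landmark models and comparing the numerical results with those for Gaussian and Wendland functions.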

1410.0719 2026-06-04 math.NA cs.CV cs.IT cs.LG cs.NA math.IT math.OC math.ST stat.TH (version update)

Proceedings of the second "international Traveling Workshop on Interactions between Sparse models and Technology" (iTWIST'14)

L. Jacques, C. De Vleeschouwer, Y. Boursier, P. Sudhakar, C. De Mol, A. Pizurica, S. Anthoine, P. Vandergheynst, P. Frossard, C. Bilen, S. Kitic, N. Bertin, R. Gribonval, N. Boumal, B. Mishra, P. -A. Absil, R. Sepulchre, S. Bundervoet, C. Schretter, A. Dooms, P. Schelkens, O. Chabiron, F. Malgouyres, J. -Y. Tourneret, N. Dobigeon, P. Chainais, C. Richard, B. Cornelis, I. Daubechies, D. Dunson, M. Dankova, P. Rajmic, K. Degraux, V. Cambareri, B. Geelen, G. Lafruit, G. Setti, J. -F. Determe, J. Louveaux, F. Horlin, A. Drémeau, P. Heas, C. Herzet, V. Duval, G. Peyré, A. Fawzi, M. Davies, N. Gillis, S. A. Vavasis, C. Soussen, L. Le Magoarou, J. Liang, J. Fadili, A. Liutkus, D. Martina, S. Gigan, L. Daudet, M. Maggioni, S. Minsker, N. Strawn, C. Mory, F. Ngole, J. -L. Starck, I. Loris, S. Vaiter, M. Golbabaee, D. Vukobratovic

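AI summary: iTWIST'14 focuses on the theory and applications of the sparsity paradigm, fostering international collaboration through talks, posters, and discussions on topics including sparsity-driven data sensing, unions of subspaces, and nonlinear inverse problems.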

Comments 69 pages, 24 extended abstracts, iTWIST'14 website: http://sites.google.com/site/itwist14

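1410.0868 2026-06-04 math.NA cs.CV cs.NA (version update)

Group Orbit Optimization: A Unified Approach to Data Normalization

Shuchang Zhou, Zhihua Zhang, Xiaobing Feng

AI summary: Proposes and studies the group orbit optimization (GOO) problem, shows that it induces matrix factorization techniques such as the SVD, LU, and QR decompositions, yielding a unified framework for matrix decomposition, and extends the framework to tensor decomposition.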

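1409.3714 2026-06-04 math.NA cs.CV cs.NA (version update)

Time-domain multiscale shape identification in electro-sensing

Habib Ammari, Han Wang

AI summary: Proposes a novel time-domain multiscale method for shape identification in electro-sensing using pulse-type signals; it computes invariant shape descriptors from multiscale-filtered polarization tensors, is highly robust to noise, and applies to pulse imaging with echolocation and induction data.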

1409.2579 2026-06-04 math.NA cs.CV cs.LG cs.NA (version update)

A theoretical contribution to the fast implementation of null linear discriminant analysis method using random matrix multiplication with scatter matrices

Ting-ting Feng, Gang Wu

AI summary: Presents a theoretical result showing that a suitable choice of random matrix guarantees full column rank in the null LDA method, avoiding information loss and improving computational efficiency.

Comments 7 pages

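1407.0921 2026-06-04 math.OC cs.CV cs.NA math.NA (version update)

Solving QVIs for Image Restoration with Adaptive Constraint Sets

Frank Lenzen, Jan Lellmann, Florian Becker, Christoph Schnörr

AI summary: Studies quasi-variational inequalities (QVIs) arising in adaptive image restoration, proves uniqueness of the solution for a larger class of problems, and proposes a convergent numerical algorithm; experiments support the theoretical findings.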

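1405.2220 2026-06-04 q-fin.TR cs.CE cs.CV cs.SY eess.SY q-fin.ST (version update)

Gaussian-Chain Filters for Heavy-Tailed Noise with Application to Detecting Big Buyers and Big Sellers in Stock Market

Li-Xin Wang

AI summary: Proposes the Gaussian-Chain distribution for modeling heavy-tailed noise, builds Gaussian-Chain filters based on the maximum likelihood principle, and uses them in a trading strategy that follows the mood of the market; on Hong Kong stock data the strategy yields higher returns than the benchmark Buy-and-Hold strategy.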

Abstract

We propose a new heavy-tailed distribution, the Gaussian-Chain (GC) distribution, which is inspired by the hierarchical structures prevailing in social organizations. We determine the mean, variance and kurtosis of the Gaussian-Chain distribution to show its heavy-tailed property, and compute the tail distribution table to give specific numbers showing how heavy the tails are. To filter out the heavy-tailed noise, we construct two filters, the 2nd- and 3rd-order GC filters, based on the maximum likelihood principle. Simulation results show that the GC filters perform much better than the benchmark least-squares algorithm when the noise is heavy-tailed. Using the GC filters, we propose a trading strategy, named Ride-the-Mood, to follow the mood of the market by detecting the actions of the big buyers and the big sellers in the market based on the noisy, heavy-tailed price data. Application of the Ride-the-Mood strategy to five blue-chip Hong Kong stocks over the recent two-year period from April 2, 2012 to March 31, 2014 shows that its returns are higher than the returns of the benchmark Buy-and-Hold strategy and the Hang Seng Index Fund.

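1405.2128 2026-06-04 cs.CV cs.NA math.NA (version update)

Variational Image Segmentation Model Coupled with Image Restoration Achievements

Xiaohao Cai

AI summary: Proposes a multiphase segmentation model that couples image restoration with segmentation, handles highly noisy, blurry, and missing-pixel images effectively, extends classical segmentation models, and can be solved efficiently.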

Comments 23 pages

Abstract

Image segmentation and image restoration are two important topics in image processing with great achievements. In this paper, we propose a new multiphase segmentation model by combining image restoration and image segmentation models. Utilizing image restoration aspects, the proposed segmentation model can effectively and robustly tackle highly noisy images, blurry images, images with missing pixels, and vector-valued images. In particular, one of the most important segmentation models, the piecewise constant Mumford-Shah model, can be extended easily in this way to segment gray and vector-valued images corrupted, for example, by noise, blur or missing pixels, after coupling a new data fidelity term which comes from image restoration topics. It can be solved efficiently using the alternating minimization algorithm, and we prove the convergence of this algorithm with three variables under mild conditions. Experiments on many synthetic and real-world images demonstrate that our method gives better segmentation results than other state-of-the-art segmentation models, especially for blurry images and images with missing pixel values.

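1312.6813 2026-06-04 math.NA cs.CV cs.NA (version update)

New explicit thresholding/shrinkage formulas for one class of regularization problems with overlapping group sparsity and their applications

Gang Liu, Ting-Zhu Huang, Xiao-Guang Lv, Jun Liu

AI summary: Presents new explicit shrinkage formulas for a class of regularization problems with overlapping group sparsity, applies them to TV deblurring and denoising, and validates their effectiveness within the alternating direction method of multipliers.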

Comments 22 pages, 30 figures

1404.6691 2026-06-04 math.NA cs.CV cs.NA physics.med-ph (version update)

Sinogram constrained TV-minimization for metal artifact reduction in CT

Clemens Schiffer, Kristian Bredies

AI summary: Proposes a method for metal artifact reduction in CT based on a convex optimization problem with total variation regularization, solved with the Chambolle-Pock algorithm, and validates it on synthetic data.

Comments Part of the OAGM 2014 proceedings (arXiv:1404.3538)

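1404.6871 2026-06-04 math.NA cs.CV cs.NA (version update)

Proximal Iteratively Reweighted Algorithm with Multiple Splitting for Nonconvex Sparsity Optimization

Canyi Lu, Yunchao Wei, Zhouchen Lin, Shuicheng Yan

AI summary: Proposes the PIRE algorithm for nonconvex sparse and structured sparse problems; it is more efficient than previous methods, with a per-iteration cost close to that of convex solvers. The PIRE-PS and PIRE-AU variants handle multi-variable problems; convergence is proved, and experiments show strong performance.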

Journal ref: Twenty-Eighth AAAI Conference on Artificial Intelligence, 2014

Abstract

This paper proposes the Proximal Iteratively REweighted (PIRE) algorithm for solving a general problem, which involves a large body of nonconvex sparse and structured sparse related problems. Compared with previous iterative solvers for nonconvex sparse problems, PIRE is much more general and efficient. The computational cost of PIRE in each iteration is usually as low as that of state-of-the-art convex solvers. We further propose the PIRE algorithm with Parallel Splitting (PIRE-PS) and the PIRE algorithm with Alternative Updating (PIRE-AU) to handle multi-variable problems. In theory, we prove that our proposed methods converge and any limit solution is a stationary point. Extensive experiments on both synthetic and real data sets demonstrate that our methods achieve learning performance comparable to that of previous nonconvex solvers while being much more efficient.

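1403.7543 2026-06-04 math.OC cs.CV cs.IT cs.NA math.IT math.NA (version update)

A sparse Kaczmarz solver and a linearized Bregman method for online compressed sensing

Dirk A. Lorenz, Stephan Wenger, Frank Schöpfer, Marcus Magnor

AI summary: Proposes an algorithmic framework for computing sparse or minimal-TV solutions of linear systems that includes the Kaczmarz method and the linearized Bregman method as special cases, and introduces new methods such as a sparse Kaczmarz solver, suited to problems where measurements are slow and expensive, including online compressed sensing, TV tomography, and radio interferometry.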

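1403.3022 2026-06-04 cs.CV cs.NA math.NA (version update)

Efficient Legendre moment computation for grey level images

Guanyu Yang, Huazhong Shu, Christine Toumoulin, Guo-Niu Han, Limin M. Luo

AI summary: Proposes an efficient method for computing Legendre moments of both binary and grey-level images, reducing computational complexity through a recurrence formula and a multiplication-free algorithm.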

DOI: 10.1016/j.patcog.2005.08.008
Journal ref: Pattern Recognition 39, 1 (2006) 74-80

Abstract

Legendre orthogonal moments have been widely used in the field of image analysis. Because their computation by a direct method is very time-consuming, recent efforts have been devoted to the reduction of computational complexity. Nevertheless, the existing algorithms are mainly focused on binary images. We propose here a new fast method for computing the Legendre moments, which is suitable not only for binary images but also for grey-level images. We first set up the recurrence formula of one-dimensional (1D) Legendre moments by using the recursive property of Legendre polynomials. As a result, the 1D Legendre moment of order $p$, $L_p = L_p(0)$, can be expressed as a linear combination of $L_{p-1}(1)$ and $L_{p-2}(0)$. Based on this relationship, the 1D Legendre moment $L_p(0)$ is thus obtained from the arrays of $L_1(a)$ and $L_0(a)$, where $a$ is an integer less than $p$. To further decrease the computational complexity, an algorithm in which no multiplication is required is used to compute these quantities. The method is then extended to the calculation of the two-dimensional Legendre moments $L_{pq}$. We show that the proposed method is more efficient than the direct method.
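For orientation, the direct method that this abstract uses as its baseline can be sketched in a few lines of Python. This is not the paper's fast recurrence: it evaluates the Legendre polynomials with the standard three-term (Bonnet) recurrence and approximates the moment integral $\frac{2p+1}{2}\int_{-1}^{1} f(x) P_p(x)\,dx$ with a midpoint rule over the samples. The function names, the mapping of sample centres into $[-1, 1]$, and the midpoint discretization are assumptions made for illustration.

```python
def legendre_polys(x, max_order):
    """Return [P_0(x), ..., P_max_order(x)] using Bonnet's recurrence."""
    values = [1.0, x]
    for n in range(1, max_order):
        # (n + 1) P_{n+1}(x) = (2n + 1) x P_n(x) - n P_{n-1}(x)
        values.append(((2 * n + 1) * x * values[n] - n * values[n - 1]) / (n + 1))
    return values[: max_order + 1]


def legendre_moments_1d(signal, max_order):
    """Approximate 1D Legendre moments of a sampled signal (direct method)."""
    n_samples = len(signal)
    dx = 2.0 / n_samples  # width of one sample after mapping onto [-1, 1]
    moments = [0.0] * (max_order + 1)
    for i, f in enumerate(signal):
        x = -1.0 + (i + 0.5) * dx  # sample centre in [-1, 1]
        for p, value in enumerate(legendre_polys(x, max_order)):
            moments[p] += (2 * p + 1) / 2.0 * f * value * dx
    return moments
```

The inner loop performs one multiplication chain per sample and per order, which is the cost the abstract's recurrence and multiplication-free scheme aim to reduce. For a constant signal the order-0 moment is close to the signal value and the higher-order moments are close to zero, which gives a quick sanity check.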

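1403.3021 2026-06-04 cs.CV cs.NA math.NA (version update)

Image reconstruction from limited range projections using orthogonal moments

Huazhong Shu, Jian Zhou, Guo-Niu Han, Limin M. Luo, Jean-Louis Coatrieux

AI summary: Proposes a set of orthogonal polynomials for image reconstruction, examines the relationship between projection moments and image moments, and validates the method through simulations.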

1403.0240 2026-06-04 cs.CV cs.CE cs.NA math.NA q-bio.QM (version update)

Particle methods enable fast and simple approximation of Sobolev gradients in image segmentation

Ivo F. Sbalzarini, Sophie Schneider, Janick Cardinale

AI summary: Proposes using particle methods to compute Sobolev gradients efficiently for regularization in image segmentation, replacing the global Poisson equation with local particle interactions to improve computational efficiency.

Comments 21 pages, 10 figures

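Wavelet methods for shape perception in electro-sensing

Habib Ammari, Stéphane Mallat, Irène Waldspurger, Han Wang

AI summary: Proposes a new wavelet-based approach to shape recognition in electro-sensing that efficiently identifies a target's shape from micro-electrical impedance measurements; numerical simulations verify the algorithm's stability and resolution.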

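1309.5401 2026-06-04 cs.RO cs.CV cs.SY eess.SY (version update)

Nonmyopic View Planning for Active Object Detection

Nikolay Atanasov, Bharath Sankaran, Jerome Le Ny, George J. Pappas, Kostas Daniilidis

AI summary: Proposes active object detection by controlling the viewpoint of a mobile depth camera, planning a sequence of views that balances the energy spent on movement against the probability of identifying the correct hypothesis; experiments show that it outperforms greedy viewpoint selection.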

Comments 12 pages (two-column); 7 figures; 2 tables; Manuscript submitted to the IEEE Transactions on Robotics (TRO)

Abstract

One of the central problems in computer vision is the detection of semantically important objects and the estimation of their pose. Most of the work in object detection has been based on single image processing and its performance is limited by occlusions and ambiguity in appearance and geometry. This paper proposes an active approach to object detection by controlling the point of view of a mobile depth camera. When an initial static detection phase identifies an object of interest, several hypotheses are made about its class and orientation. The sensor then plans a sequence of views, which balances the amount of energy used to move with the chance of identifying the correct hypothesis. We formulate an active hypothesis testing problem, which includes sensor mobility, and solve it using a point-based approximate POMDP algorithm. The validity of our approach is verified through simulation and real-world experiments with the PR2 robot. The results suggest that our approach outperforms the widely-used greedy view point selection and provides a significant improvement over static object detection.

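1308.2292 2026-06-04 cs.CV cs.NA math.AP math.NA (version update)

Fast image segmentation and restoration using parametric curve evolution with junctions and topology changes

Heike Benninghoff, Harald Garcke

AI summary: Proposes a curve evolution scheme based on a region-based contour model that supports junctions and topology changes and, combined with a post-processing denoising step, yields fast and efficient image segmentation and restoration.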

Comments 26 pages, 16 figures